DotNet Soddi

Posted

Whenever I get into learning a new technology around data access, there is always the issue of finding a good database to work against. Long gone are the days of Northwind. You need a database not with just a lot of rows, you need a database with a interesting and varied distribution of data. The databases I turn to are the Stack Exchange archives. These frequently updated archives contain anonymized exports from all the Stack Exchange sites, from the tiny Sports all the way to the huge main site Stack Overflow. With hundreds of sites to pick from, I can grab a fresh database of whatever size that matches my need. With sites ranging from 2mb to over 100gb of data you can find a database with real data for playing with that fits your size. The schema is simple, well defined and has a variety of real-world data making it perfect for playing with SQL Server indexing strategies, or testing out the latest Entity Framework version.

The archives themselves are just huge xml files, ready to be used for whatever. For me, that means SQL Server. Getting these into your database server I've always relied on Soddi. Originally written by Sky Sanders, and now maintained by Brent Ozar's team, it smooths out importing the 7z files into your database server. And it works well.

But I tend to run into the same set of problems every time I want to grab one of these archives.

With that in mind (and after I read a couple of articles around creating dotnet tools) I decided I'd try and create a modern CLI version of SODDI.

dotnet soddi

soddi screenshot

Because dotnet-soddi is a dotnet tool, you install it using the dotnet tool install command. As of right now, it's still in (very) pre-release so a version number will be needed e.g.

You can also clone and run it from it's repository at https://github.com/phil-scott-78/dotnet-soddi

dotnet tool install --global dotnet-soddi --version 0.1.1

Once installed there are four main commands

list will display all the archives available along with their file size. You can specific a filter to only include a partial list. E.g. soddi list spa, which will only include archives with spa in their name.

download will download the archive over HTTPS. This is preferred for smaller sized files (less than 100mb) so you don't need to worry about the overhead of bittorrent.

torrent use bittorrent to download the archives. For larger archives, especially things like math and the main site, this is almost needed. On my machine, I'll max out at 35mbps over http, but have seen the torrent download reach up to 600mbps depending on the other peers connected)

import is where the real magic happens.

Import will take a given .7z file, and simultaneously extract and perform a bulk insert into SQL Server. No need for scratch space for the extracted files, it streams directly to your SQL Server. Not only keeping disk space requirements low, but it also keeps memory usage low. Even a 50gb import rarely reaches over 65mb of memory used on my machine.

By default, import will expect the destination database to be created and empty with a name matching the site (e.g. stackoverflow.com or math.stackexchange.com). With many of these databases having large sizes, this allows you to fine tune the log and data locations before the import. If you are ok with blindly creating the DB in the default locations, the -–dropAndCreate option will take care of that for you. It won't check for disk space or anything ahead of time so use with caution.

During this process, keys and indexes will be created as well as a helper table named PostTags. These can be turned off with a switch if so desired.

Sample Workflow

  1. soddi list spa - this lists all archives containing the phrase "spa"
  2. soddi download space - you can use either the full site name (space.stackexchange.com) or the short name to download. we'll use the short name, and it'll download a file whose name matches the site name.
  3. soddi import space.stackexchange.com.7z --dropAndCreate - import the space.stackexchange.com.7z file we just downloaded. the --dropAndCreate option will drop any database named space.stackexchange.com and recreate it using your server's configured defaults. soddi import -h will give you more options for specifying a connectionstring and specifying a database name.

Once the import is done, you'll have a local copy of the Stack Exchange database of your choice!

Database Info

If you've never worked with the Stack Exchange schema, here's a few helpful pieces of documentation

Technical Notes

A few technical notes on the app itself