The Coming Government Data Flood

Government is releasing data at a breakneck pace, and it is just getting started. One interesting side effect of our National Data Catalog is that we’re regularly parsing all of the data on data.gov, and we’re able to do interesting things with the aggregate metadata. By parsing out the release date for each dataset on data.gov and grouping the releases by quarter, it’s easy to see that since the second quarter of 2009, when Data.gov launched, the federal government has released more raw datasets than it ever has in the past. Take a look at what’s happened after Data.gov launched:

[Chart: datasets released on Data.gov by quarter (data-gov-release-dates.numbers)]
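
For the curious, here’s roughly how that kind of bucketing works. Below is a minimal sketch, not our actual pipeline: it assumes a CSV export of catalog metadata with a release_date column in YYYY-MM-DD form, and both the file name and the field name here are hypothetical.

    # Minimal sketch: count datasets released per quarter.
    # Assumes a hypothetical CSV export with a "release_date" column
    # in YYYY-MM-DD form; this is not the real data.gov schema.
    import csv
    from collections import Counter
    from datetime import datetime

    counts = Counter()
    with open("datasets.csv", newline="") as f:
        for row in csv.DictReader(f):
            try:
                released = datetime.strptime(row["release_date"], "%Y-%m-%d")
            except ValueError:
                continue  # throw out faulty or year-only dates (see caveats below)
            quarter = (released.year, (released.month - 1) // 3 + 1)
            counts[quarter] += 1

    for (year, q), n in sorted(counts.items()):
        print(f"{year} Q{q}: {n} datasets")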

Now, granted, like all government data, it’s a little messy. These are bulk, aggregate conclusions and haven’t been reviewed, but even if the individual numbers are imprecise, they point to a clear trend. Keep in mind that what we’re using is the original release date of the data. Trivia: the oldest known release date of a dataset in Data.gov is 1909, for a dataset we’re all fairly familiar with: the Bureau of Labor Statistics’ Employment and Earnings data. Government has released more datasets in the past year than it did from 1909 through 2008 combined.

As of today, about halfway through the first quarter, government is already on pace to beat its Q3 2009 record of 308 datasets. Since June of last year, government has been releasing data at a pace of four datasets per day.
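
A quick back-of-the-envelope check: at four datasets a day, a full quarter of roughly 90 days works out to about 360 datasets, comfortably past the 308 record.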

I spoke with the people here in the District government about new datasets being released by DC agencies. The story is similar: an exponential rise in dataset releases since the announcement of a data catalog. Three and a half years after the launch of data.dc.gov, they’re looking at incredible exponential growth; last year alone, the number of new datasets being released more than doubled. It isn’t crazy to suspect we’ll see the same exponential curve of data growth coming out of the federal government and other municipalities as they follow suit.

A new problem is starting to arise: classifying and organizing this information. Much of this data, for instance, may well be of little use to most people, and the flood of esoteric data that agencies push out will make it more difficult to find the proverbial diamonds in the rough. Hopefully the National Data Catalog will help solve that problem in the same way that PHP’s community documentation helped make PHP one of the more prevalent and well-documented languages around today: by allowing people to comment on the functions in the language, it became easier to learn and to use. The same should be done with these datasets. Right now we’re testing it; you can check it out at http://nationaldatacatalog.com.

There are a few caveats here, as always:

  1. We’re measuring datasets, not megabytes. A dataset could be incredibly small and inconsequential, yet it counts just as much as one that’s large and extremely important. Another, equally appropriate way to measure this would be by the number of bytes these different governments are releasing.

  2. We’ve done very little quality analysis on this data. A lot of the data from Data.gov, for instance, had only years rather than full dates associated with it, and some entries had faulty dates. We threw both of those out.

  3. It’s possible that some publishing agencies have conflated “release date” and “update date.”

  4. We have not included the Geodata or Tools sections of data.gov.

Those caveats aside, the point remains the same: government is starting to release data at breakneck speed. And this is just the beginning. The great government data flood is coming, and we need to start preparing to stay afloat. Now’s the time to start thinking about what kinds of data you want out of your government, and to start asking them to release it. This dam won’t hold for much longer.