Open data inventories, ready for human consumption


Just in time for Open Data Day, your Uncle Sam — aided and abetted by developers here at Sunlight — is giving you a present.

It has taken weeks, but it appears that all Cabinet agencies have released machine-readable lists of their public data holdings, in compliance with President Barack Obama’s open data executive order. Now, with the help of Sunlight’s Dan Drinkard and Timothy Ball, you can explore most of these listings in human-readable format.

The president’s order made thousands of new data sets available for public consumption, allowing citizens to find all sorts of new information in one central repository. But when we checked in last month on the progress agencies had made in complying with the executive order, we wondered aloud how difficult it would be for agencies to list their data holdings in a way that citizens without programming degrees could understand. Under the order, agencies released their data listings in JSON format, which fulfills a key component of the executive order but can prove difficult for non-programmers to manipulate. The executive order also requires agencies to release human-readable versions.

Some departments have already complied with this section of the guidance. Others, like Defense, have gone the extra mile and added their complete catalogs to the data.gov website, making it easy for users to search their full listings. Most agencies, however, have been somewhat slower in this process. In the interest of aiding any academics, activists, journalists or other citizens who would like to easily parse a larger pool of government data, we’ve made as much of it as we could available in human-readable format. To download an agency-by-agency list of available data sets, in some cases complete with formatting options and key words, you can:

  • Download this zip file, or
  • View the spreadsheets as Google docs here.

Unfortunately, we ran into a few problems during the process and, at press time, had not been able to convert the JSON from the Department of the Interior and the Environmental Protection Agency. The most common problem appears to stem from incompatibility between our parser and the JSON files in question. Rest assured that Sunlight technologists are working hard to circumvent these issues. We will update this post, and the related downloads, as we progress. Below, see what the cleaned data listings look like for the Department of State (abbreviated to fit on the page).

Thanks to the guidelines crafted by Project Open Data, every agency’s listing uses the same metadata schema, which gives information such as the data set’s title, where to access the set (if it has been made publicly available), point of contact and data format, among other things (see the full rundown of the schema on Project Open Data’s page on GitHub).
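For readers who want to try the conversion themselves, here is a minimal Python sketch of how a JSON listing might be flattened into a spreadsheet-friendly CSV. It is not the parser Sunlight used. The column names other than “accessLevel” are our assumptions about the schema’s field names, as is the idea that a listing is either a bare list of entries or a list wrapped under a “dataset” key, so check them against Project Open Data’s documentation before relying on them.

```python
import csv
import io
import json

# Assumed field names; only "accessLevel" is named in this post.
COLUMNS = ["title", "accessLevel", "accessURL", "contactPoint", "format", "keyword"]


def listing_to_csv(json_text, columns=COLUMNS):
    """Flatten a JSON data listing into CSV text, one row per data set."""
    data = json.loads(json_text)
    # Assumption: the listing is either a bare list of entries or an
    # object that wraps the list under a "dataset" key.
    entries = data.get("dataset", []) if isinstance(data, dict) else data
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for entry in entries:
        row = []
        for col in columns:
            value = entry.get(col, "")
            # Join list values, such as key words, into a single cell.
            if isinstance(value, list):
                value = "; ".join(str(v) for v in value)
            row.append(value)
        writer.writerow(row)
    return out.getvalue()


# A made-up entry for illustration, not taken from any agency's listing.
sample = json.dumps([
    {
        "title": "Example data set",
        "accessLevel": "public",
        "accessURL": "https://example.gov/data.csv",
        "format": "csv",
        "keyword": ["example", "sample"],
    }
])
print(listing_to_csv(sample))
```

Fields missing from an entry are left blank rather than treated as errors, since not every listing fills in every field.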

Each data set is designated as either “public,” “restricted public” or “non-public” under the “accessLevel” heading. Agencies are not currently required to share information about their “non-public” data sets, although we have urged them to do so (and filed a FOIA request for the complete data inventories, including all non-public data sets). “Restricted public” data sets may include some information, related to things like personal privacy or national security, that needs to be redacted before releasing the larger data sets. The schema requires agencies to explain what aspects of these data sets are restricted.
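As a rough illustration of how the “accessLevel” heading can be used, the Python sketch below tallies the entries in a listing by access level. The sample entries are invented for the example, and the “unspecified” bucket for entries that omit the field is our own convention, not part of the schema.

```python
from collections import Counter


def count_access_levels(entries):
    """Count data sets per accessLevel value.

    Entries without the field are tallied as "unspecified"
    (a label invented for this sketch).
    """
    return Counter(entry.get("accessLevel", "unspecified") for entry in entries)


# Made-up entries for illustration only.
entries = [
    {"title": "Data set A", "accessLevel": "public"},
    {"title": "Data set B", "accessLevel": "restricted public"},
    {"title": "Data set C", "accessLevel": "public"},
    {"title": "Data set D"},
]

counts = count_access_levels(entries)
for level in ("public", "restricted public", "non-public", "unspecified"):
    print(level, counts[level])
```

A Counter returns zero for levels that never appear, so the loop can safely print all three categories even when an agency lists no “non-public” data sets.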

However, for those data sets that are shared publicly, accessing the data can be as simple as accessing the URL provided in the metadata. While most of the data sets listed fall under the “public” or “restricted public” categories, you may be able to access part of “non-public” data sets through a Freedom of Information Act request.

Update 2:00 p.m.: This post has been updated to reflect that the Open Data Executive Order requires data listings to be published in JSON format; this is separate from the requirement that listings also be human-readable.