After Sunlight FOIA, comprehensive data indexes arrive on Data.gov

by Matt Rumsey, John Wonderlich and Sean Vitka

Mar 2, 2015 1:32 pm

Over a year ago, in December 2013, the Sunlight Foundation started down a path in search of agencies’ internal indexes of their data holdings, also known as “enterprise data inventories,” or EDIs. The FOIA request we drew up was issued in concert with continued advocacy aimed at encouraging faithful development of these inventories as well as their public release. Both are critical elements of good data practices, which may be why one of the most data-forward agencies, the Department of Transportation, was the only agency to proactively release its index before our FOIA request was resolved.

It’s been a long road, but at the beginning of February we learned that the Office of Management and Budget (OMB) had decided to comply with our request. This marked a high point in a years-long process for the Sunlight Foundation.

We hope that OMB and the agencies decide to continue releasing these indexes publicly as they develop, which, if done soon, would make the United States the first government to proactively release such indexes.

This represents an important first in government open data efforts, rooted in the principle that the public, and Congress, needs more than access to the data the government deigns to publish — it needs to know what information the government knows it knows, what information the government isn’t releasing and to what extent the government is prioritizing assessment of its data assets and publishing them.

Data primer

For those who want to dive into the data, something we enthusiastically encourage, we wanted to provide a very brief primer. First, these files (so far) are showing up in JSON, a format that is geared toward computer programmers. They will open in your browser like any text file, but the code may well be disorienting for those not used to the format. Some of us use a tool called JSON Formatter, which should help break it all up so that it’s more understandable. Similar tools should be available, no matter your browser of choice.

The second tool we use is a quick web application, created by former Sunlighter Eric Mill, where you can paste the contents of a JSON file and get back a spreadsheet that you can download (a CSV file, which opens perfectly in, for instance, Microsoft Excel). Note, however, that because this tool operates within your browser, it won’t work with particularly large JSON files (like the Environmental Protection Agency’s). Once the file is converted, you’ll be able to explore this information like any data in a spreadsheet.
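
For files too large for a browser-based converter, the same JSON-to-CSV step can be sketched locally in Python. This is only a rough sketch, not the tool described above: it assumes an inventory is either a bare list of dataset records or an object that wraps the list under a "dataset" key, and the function names and the sample records (including the "Crash Data" title) are made up for illustration.

```python
import csv
import io
import json


def load_datasets(json_text):
    """Return the list of dataset records from an inventory file."""
    data = json.loads(json_text)
    # Assumption: the file is either a bare list of records or an object
    # that wraps the list under a "dataset" key.
    if isinstance(data, dict):
        data = data.get("dataset", [])
    return [record for record in data if isinstance(record, dict)]


def inventory_to_csv(json_text):
    """Flatten dataset records into CSV text, one row per dataset."""
    records = load_datasets(json_text)
    # Collect column names in first-seen order across all records.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = {}
        for key, value in record.items():
            # Nested lists and objects are stored as JSON text in one cell.
            row[key] = value if isinstance(value, str) else json.dumps(value)
        writer.writerow(row)
    return out.getvalue()


# Made-up sample records, not taken from any agency's inventory.
sample = (
    '[{"title": "Operating Plan", "accessLevel": "non-public"},'
    ' {"title": "Crash Data", "keyword": ["trucks", "buses"]}]'
)
print(inventory_to_csv(sample))
```

To use it on a real file, read the downloaded JSON into a string, pass it to inventory_to_csv and write the result to a .csv file. Fields that a record lacks are left blank in its row.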

What we’re seeing so far

Even a brief look at the information revealed by this release allowed us to identify several datasets that are currently not public but could be of great value to journalists, researchers and interested citizens. The best place to start a search for the EDIs is on Data.gov.

As of this morning, we’ve seen EDIs from the following agencies: National Science Foundation, National Archives and Records Administration, Veterans Affairs, Environmental Protection Agency, Office of Personnel Management, Housing and Urban Development, the Department of Transportation, Department of Labor, Department of Education, Department of Defense and the Social Security Administration. That leaves at least 14 other agencies to comply. We expect many to do so early this week.

Representatives of OMB promised to share links to every agency EDI once they were all posted. We will update this post with that list as soon as we have it. Additionally, these are designed to be updated at least quarterly, meaning each new update should reveal new data.

We have access to some 40,000-foot overviews, thanks to OMB’s own tracking of agency progress. There are 24 agencies that must comply with the inventory requirement (a full list here). But the 40,000-foot view is only so helpful. For instance, the NARA inventory only has 19 “non-public” datasets (which appears to be the extent to which this is larger than the agency’s public data listing). Ideally, agencies are already publishing the most data they can, which would be reflected by a smaller difference between the public data listings and the enterprise data inventories. With that said, an agency prioritizing the cataloging of datasets may have a large value here despite also being committed and faithful to opening up its data. Accordingly, there isn’t a simple binary indicator for “good” or “bad” compliance. One example reveals how unwise it would be to jump to either conclusion: The Office of Personnel Management (OPM) has 624 datasets in its data inventory. Fifty-three are “restricted public” and 30 are “non-public.” However, Data.gov currently shows only 178 datasets (including OPM’s EDI). In other words, the inventory lists far more datasets that are not marked restricted or non-public (541) than Data.gov currently shows.
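
The kind of tally behind the OPM figures can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how OMB tracks progress: it assumes each record stores its level in a field named "accessLevel" (a name not given in this post), and it builds synthetic records sized to match the numbers quoted above, treating the remaining 541 entries as "public" purely for the sake of the example.

```python
from collections import Counter


def access_level_counts(records):
    """Count dataset records by their access level."""
    # Assumption: each record stores its level under "accessLevel".
    return Counter(record.get("accessLevel", "unspecified") for record in records)


def withheld_share(counts):
    """Fraction of records marked anything other than "public"."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - counts.get("public", 0)) / total


# Synthetic records sized to match the OPM figures quoted above; treating the
# remaining entries as "public" is an assumption for illustration.
records = (
    [{"accessLevel": "restricted public"}] * 53
    + [{"accessLevel": "non-public"}] * 30
    + [{"accessLevel": "public"}] * (624 - 53 - 30)
)
counts = access_level_counts(records)
print(counts["restricted public"], counts["non-public"], counts["public"])
print(round(withheld_share(counts), 3))
```

A share like this is only a starting point. As noted above, a high or low value alone does not say whether an agency is doing a good or bad job of cataloging and releasing its data.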

We also want to take a moment to congratulate the National Archives and Records Administration, which is not required to comply with the open data executive order (or, accordingly, the requirement to produce enterprise data inventories), for jumping into the fray. We hope to see every agency follow this path, and they should be applauded for this commitment to transparency.

The beasts themselves

In the current, limited view, Sunlight has already identified indicators of varying levels of success in indexing government information, some datasets that appear to be wisely withheld and others that could be of significant public value if released.

One indicator we’ve found within the lists of unpublished data is the presence of items that reflect the extraordinary complexity of government. Seeing that agencies are cataloging, for instance, the software that controls security clearances, is a good indication that some agencies have properly understood the EDI mandate, which is to ensure the federal government finally understands, to the greatest extent practicable, what information it possesses. If we didn’t see such things, it would be a worrying sign that the cataloging process isn’t working, and it would rob the public of a window into the complexity of agency operations, even if the datasets themselves never make it into the public’s hands. A counterexample is the eighth item in the list below (the Department of Defense).

Some items seem to be wisely unpublished — for instance, a dataset titled “Accident Injuries PII” (PII refers to Personally Identifiable Information). If ever published, and if it contains PII, such data must be appropriately scrubbed of sensitive, personal information.

And then there’s the good stuff (that we’ve seen so far). Instead of listing every example of data we think is intriguing, here are a few choice items (the links go to the JSON files, the inventories themselves, and may be a bit of a heavy lift for some computers).

The Department of Labor has a dataset of hazardous conditions complaints provided by the mining industry. It is currently not public, but the EDI does not explain why. This appears to be an important dataset that could feasibly be unearthed through FOIA, then combined with other relevant safety data to inform the work of journalists, watchdogs and — perhaps most critically — employees working in those conditions.
The Department of Transportation, which has shown consistent leadership with its open data efforts and released its EDI proactively, has a number of potentially useful nonpublic datasets. One that jumps out immediately contains information on motor carrier crashes, or accidents involving large trucks and buses. It is reasonably withheld because it includes PII. Fortunately, DoT understands the importance of releasing aggregate and scrubbed data and has already done so in this case.
The Office of Personnel Management has a dataset titled “Congressional and Legislative Affairs (CLA) Tracking,” which, based on the description, appears to be both a tracking system and a database, presumably of OPM’s Congressional outreach.
Housing and Urban Development includes a non-public dataset simply titled “Operating Plan,” which is described as “The written explanation of how HUD’s plans to run the IT business piece of the agency. The operating plan include includes [sic] funding by Investment with specific details that make up the investment.”
One nonpublic dataset listed by the Environmental Protection Agency (EPA) is the “National Register of Historic Places, US, 2014, NPS, SEGS,” currently only available to internal personnel and state partners. However, the data appears to be available on another website, where they explain software problems have prevented them from publishing more recent information. We expect a good number of problems like this, unfortunately; problematic IT procurement in the federal government is well-known and well-hated, and it will surely present problems for people in charge of indexing what’s actually been released, identifying where it can be found and ensuring it’s up-to-date.
The EPA has another nonpublic dataset that jumped out: “Arizona – Social Vulnerability Index.” We found similarly titled public information on websites maintained by the Centers for Disease Control and Prevention, but it’s unclear whether the two are the same.
The Department of Labor has some information redacted under FOIA. (So far, it’s the only one we’ve seen that does.) While redactions aren’t something Sunlight is keen to celebrate, it’s important to note that this successfully proves the concept: Even with sensitive information included, these indexes can be released to the public. There appear to be 28 redactions, and thanks to OMB’s guidance — which you can view and comment on publicly — the redactions are done without using a black marker and a PDF! Curiously, most of Labor’s redactions are in “contactPoint” fields — that is, the person who is supposed to be the point of contact for the data — and they are redacted under (b)(7) of FOIA, which applies to records compiled for certain law enforcement purposes.
The Department of Defense, somehow, has not cataloged within its index any “non-public” or “restricted” data, nor does it appear to have redacted any information under FOIA. Hopefully this reflects a choice to focus on publishable data, but, perhaps obviously, Defense is an agency we expect to have a lot of nonpublic information — information that still very much needs to be indexed and tracked.
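
For readers who want to check redaction counts like Labor’s themselves, here is a rough Python sketch that walks each record and tallies redaction markers by field. The marker text it looks for, "[[REDACTED-EX B7]]", is an assumption about how redactions appear in the files; check the inventories and OMB’s guidance and adjust the pattern accordingly. The function names and sample records are likewise made up for illustration.

```python
import re
from collections import Counter

# Assumption: redacted values contain a marker such as "[[REDACTED-EX B7]]".
# Adjust the pattern to whatever the real files use.
REDACTION = re.compile(r"\[\[REDACTED-EX (B\d+)\]\]")


def find_redactions(value, path=""):
    """Yield (field path, exemption code) for each redaction marker found."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from find_redactions(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for child in value:
            yield from find_redactions(child, path)
    elif isinstance(value, str):
        for code in REDACTION.findall(value):
            yield path, code


def redactions_by_field(records):
    """Tally redactions across all records, keyed by (field, exemption)."""
    return Counter(hit for record in records for hit in find_redactions(record))


# Made-up sample records, not taken from any agency's inventory.
sample = [
    {"title": "Example A", "contactPoint": "[[REDACTED-EX B7]]"},
    {"title": "Example B", "contactPoint": {"fn": "[[REDACTED-EX B7]]"}},
    {"title": "Example C", "contactPoint": "Jane Doe"},
]
for (field, code), n in redactions_by_field(sample).items():
    print(field, code, n)
```

Run against a full inventory, a tally like this would show at a glance which fields carry redactions and under which exemption, and whether an agency has redacted anything at all.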

Final thoughts

OMB’s unprecedented answer to Sunlight’s FOIA shows an extraordinary amount of transparency and represents a true commitment to open data. We will work to make sure their continued enforcement of open data mandates is similarly zealous and successful. We can’t stress enough our hope that OMB and other agencies will choose to proactively release these inventories moving forward, allowing for continued public engagement.

Transparency is a bedrock principle for democracy, and increasingly we expect proactive disclosure of government information. It’s not possible to judge, however, what the government is disclosing without understanding what information the government has decided not to disclose. It is similarly impossible for the government to understand its own data practices if it doesn’t know what information it holds — something the public can only view through the lens of the most comprehensive datasets available. These releases make these processes possible.