Sunlight Foundation

A Principled Look at Open Data

From Moses to James Madison to David Letterman, important ideas come in lists of ten, as do these principles for opening up government information.[1] The list isn’t new: my colleague John Wonderlich wrote about “themes for legislative information publication” in February 2007, and eight open government data principles emerged from a conference organized by internet oracle Carl Malamud and technology publisher Tim O’Reilly in December 2007.[2] However, we have refreshed the principles, expanded upon them, and added details.

The government is increasingly making data available online, partly in response to congressional and presidential leadership and partly from public pressure. The newly released or updated data varies markedly in quality and usefulness; agencies are searching for guidance on how to do better.

These principles are intended to provide a starting point. They are: completeness, primacy, timeliness, ease of physical and electronic access, machine readability, non-discrimination, use of commonly owned standards, licensing, permanence and usage costs. Each one exists along a continuum of openness, and the list writ large is intended as a guidebook, not a rulebook.

We welcome additional ideas and corrections.[3] The document is available here.

***

[1] Technically speaking, what we call the "Bill of Rights" was intended to have 12 constitutional amendments, although only 10 were enacted in the 1790s; noted commentator Melvin Kaminsky reports the number of commandments varied over time; and few items from Letterman’s list are actually funny.

[2] Sunlight provided a grant to the conference.

[3] More background materials are available here.

A Study in Transparency: The Open Government Directive, the Department of Labor, and the Open Data Principles

Cabinet agencies (and others) released their Open Government Plans last week with much fanfare, mixed reviews, and many promises for the future. I want to focus on one initiative -- the Department of Labor's "Online Enforcement Database" -- to highlight the strengths and weaknesses of what we've seen, and suggest some guidelines for going forward.

Online Database Strengths and Weaknesses

With the explosion at a mine in West Virginia last week, many questions are being asked about federal safety inspections. My colleague Anu Narayanswamy wrote on Monday, before the Online Enforcement Database was released, that the way the federal government releases data on mine safety makes it impossible to see how safety violations at one mine stack up against others. You cannot tell if the 500 safety violations in 2009 at this particular mine, for example, are typical for this industry.

On Wednesday, the Labor Department released the Online Enforcement Database, which contains five major data sets, including one on mine safety. Anu's follow-up article on Friday explained that "with mine safety data, released for the first time in bulk [on Wednesday], users can search for mine inspection data by state or even zip code." But she also reported that the data sets are only partially downloadable, and do not include "the kinds of violation and penalties levied on mines across the country." In other words, it's difficult to figure out what's going on.

It is the search results, and not the underlying database, that are downloadable in bulk. ("Bulk" access means that you can download all of the information at once, and not piecemeal.) The only way to get at the Enforcement Database's information is to use its search tool, which has very limited capabilities. Users may search by state, agency, zip code, and industry code. (DOL deserves credit for including the industry codes in a link from the search page.) A user cannot, however, narrow the search to a county, a congressional district, or the owner of a facility. Compare this to the search tool at transparencydata.com, a new initiative from Sunlight that lets users search, sort, and download campaign contribution data in a multiplicity of ways.

As mentioned before, the Online Enforcement Database itself is not available for download in bulk. There's no way to look at all of the information the Labor Department has painstakingly gathered. And despite the wealth of information, a clunky search tool adds to the frustration. Without access to the supporting data, researchers cannot answer many questions. In fairness, the Labor Department says that bulk access and improved search tools are "coming soon," but it would be very helpful to have a date to accompany this promise. Doing so would make the promise concrete and testable.
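To make the point concrete, here is a minimal sketch of the kind of question bulk access would let a researcher answer: how one mine's violation count compares with the rest of the industry. Everything in it is an assumption for illustration. The file layout, the column names (mine_id, state, year), and the sample rows are invented, and nothing here reflects the actual structure of the Labor Department's data.

```python
import csv
import io
from collections import Counter
from statistics import median

# Hypothetical bulk export: one row per cited violation.
# The column names and values are invented for illustration only.
BULK_CSV = """mine_id,state,year
M-001,WV,2009
M-001,WV,2009
M-001,WV,2009
M-002,WV,2009
M-003,KY,2009
M-003,KY,2009
"""


def violations_per_mine(csv_text, year):
    """Count violations per mine for one year from a bulk CSV export."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["mine_id"] for row in rows if row["year"] == str(year))


def compare_to_industry(counts, mine_id):
    """Return (this mine's count, median count across all mines in the data)."""
    return counts[mine_id], median(counts.values())


counts = violations_per_mine(BULK_CSV, 2009)
mine_count, industry_median = compare_to_industry(counts, "M-001")
print(f"M-001: {mine_count} violations; industry median: {industry_median}")
```

A search tool that returns one mine at a time cannot support this comparison, because the median depends on every mine in the data set. Note also that a file listing only violations would omit mines with none, so a real analysis would need the full list of inspected mines as well.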

I do not mean to pick on the Department of Labor, which made an effort in its Open Government Plan [PDF] to identify datasets for online publication and to set deadlines. Indeed, they stated they plan to take all data they collect and make it publicly available online and in downloadable formats, with appropriate caveats. Many agencies fell far short of DOL's achievements. But DOL should go further.

Open Data Principles

Elsewhere I've pulled together resources (from Princeton and Sunlight Labs) on building good data sets, including drafting guidelines for government data catalogs. It's important, however, to focus on the fundamental question of what we mean when we talk about how government should publish data online, a.k.a. "open data principles." As an attorney, I'm hardly qualified to talk about this, so I am fortunate that much of the heavy lifting was done at a conference in 2007. Afterward, my colleagues Clay and John and I worked on revising the open data principles, nine in number, and fleshed out a rough evaluation of when they are satisfied.

When agencies think about how to make information available, they should look to these (draft) principles. They state, in short, that data should be: complete, primary, timely, accessible, machine-processable, non-discriminatory, non-proprietary, license-free, and permanent. Recourse to these principles by the agencies -- and a better effort to comply with the directive's requirement to identify all high-value data sets and set deadlines for online publication -- would have turned the thus-far mixed results of the Open Government Directive into an unqualified success. There is still time to make that promise into a reality.

Here are the 9 open data principles in a framework to evaluate the extent to which they are satisfied:

(If you're wondering what ^M means, it's old school geek for delete - the 8 principles of open data have now become 9.)

Defective by Design?

David Moore at Open Congress has an excellent post up explaining how the current life of a bill in Congress is riddled with disclosure holes. I can't do more than say: go read David's post. Here are some choice paragraphs:

The reason is that the “Baucus Bill” is only a “mark”, not yet an official Senate bill, which means (to summarize reductively) that the digital text that constitutes the .pdf does not make its way off internal government web servers to the official website of the Library of Congress, THOMAS — and in turn, does not make its way to government transparency web resources such as GovTrack and OpenCongress. Before that happens, this mark of the health care bill needs to be reconciled with other Senate committee versions of the same, which will then be put forward for consideration to the U.S. Senate as a whole. Health care reform is leading news coverage & blog analysis of American politics right now, this is a major document in the mix, and there’s not a widely-recognized, user-friendly resource for online examination by the public at large. You should have better access to this info! You should have — at your fingertips — immediate, unrestricted digital access to the full text of any piece of legislation the very moment it’s released publicly by Congress.

...

The current Congressional process for publishing data is, to borrow a phrase from the Free Software Foundation, Defective By Design. As we see in many proprietary, top-down systems affecting the public interest, it’s insistently closed-off. Congress’ processes for distributing legislative info is fundamentally broken — it could and should relatively easily be fixed, starting now. Whether or not you support the Baucus markup or the House version of the health care reform bill, we hope you agree that the public has a right to read this important iteration & political volley in the process.

Recovery.gov

The Obama administration has promised that it will track the progress of projects approved in the stimulus bill (H.R. 1) through a web site, Recovery.gov. Matt Cooper at TPMDC notes the obvious about the current make-up of the site:

In his remarks earlier this morning about his stimulus plan, Obama touted Recovery.gov as a website where Americans "will be able to see how and where we spend taxpayer dollars." Actually the site is empty pending the passage of the bill. Basically, it's a placeholder for after the bill is passed. Shouldn't there be something in there about the competing proposals? The options? Etc. It seems kind of lame for such a techno-savvy White House. Besides, after the bill is passed how quickly are they really going to be able to update how Topeka spends its sewer money?

On that note, I think it would be best if anyone who might have control over the stimulus tracking web site took note of the awesome suggestions laid out by our own John Wonderlich in a CNET article:

We'd like the site to serve not just the amateur information consumer, but also the programmers that can skillfully remix the information. The citizen observer's role seems well-addressed by the legislation that mandated the site (with requirements for "printable reports," feedback, and to be "easy to understand"), while the needs of the programmer are largely unaddressed. The data should be available in formats that facilitate more advanced use by programmers and analysts alike.

Certainly, the data should be made available following the 8 Principles of Open Data: (1) complete, (2) primary (as it is collected at the source), (3) timely, (4) accessible, (5) machine-processable, (6) nondiscriminatory, (7) nonproprietary, and (8) license-free. XML and CSV are a minimum.

Search is great, if you are looking to find information about any one thing. But original analysis and visualization require access to data in bulk. If the goal of putting the data online is to increase accountability and transparency, then it is necessary [to] provide bulk data access.

Similarly, Ellen Miller blogged about David Robinson's (not the 7-foot former Spurs center) even more ambitious suggestions for the release of large data sets of government information.

We know the administration, especially the tech team, is having a tough time getting used to the antiquated equipment in the White House, the Executive Office Building, and the Old Executive Office Building. I remember what it looked like in the '90s and I'm sure it has changed very little.

At the same time, there are a lot of impatient people out here wondering when the administration will start running the kind of wired White House it has always intended. In the case of Recovery.gov, there is no shortage of ideas for it to quickly tap.