One of the things we’ve been most excited about here at the Sunlight Foundation is the concept of Data.gov. Later this year, new federal CIO Vivek Kundra will release it as a central repository for government data and research. And while in this series we traditionally redesign existing federal websites, we thought we’d take the opportunity to design Data.gov right off the bat to show you what we’d like to see happen.
Here’s what we came up with:
Why we did it:
Providing access to government data is one of the clearest ways to be more transparent, and it is our hope that Kundra and his team nail this with Data.gov. To do so, we’re looking for these things:
- Bulk access to data
- Accountability for data quality
- Clear, understandable language
- Service- and developer-friendly file formats
Only raw, bulk access to data can be completely transparent. So we’re looking for a http://bulk.data.gov akin to Carl Malamud’s bulk.resource.org. This will allow developers to browse through a raw directory listing of the judicial, executive, and legislative branches, as well as independent, miscellaneous, and joint agencies, and get compressed, bulk files of data via direct download. Getting FEC data, for example, should be as easy as clicking on “Other” -> “FEC” -> “Contributions” -> 2008_summary.tar.gz. This first and arguably most important part of Data.gov doesn’t need any design. It needs to look like this:
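The click path above maps naturally onto a plain download URL. Here is a minimal sketch of that idea. The host is the bulk.data.gov we are hoping for, and the directory layout simply mirrors the example click path; neither exists yet, so treat both as assumptions.

```python
from urllib.parse import quote

# Hoped-for host from this post; it does not exist yet (assumption).
BULK_ROOT = "http://bulk.data.gov"


def bulk_url(*segments: str) -> str:
    """Join directory names and a file name into a direct-download URL.

    Each segment is percent-encoded so that names containing spaces or
    slashes cannot break the path.
    """
    return BULK_ROOT + "/" + "/".join(quote(s, safe="") for s in segments)


# The example click path from the paragraph above.
url = bulk_url("Other", "FEC", "Contributions", "2008_summary.tar.gz")
print(url)
```

If the directory structure is that predictable, a developer never needs to open the website at all: the URL can be guessed from the name of the agency and the data set.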
Secondly, we want the public to be able to comment on and rate the quality of the data government provides. The public should be able to rate, review or comment on the data sets Data.gov publishes just like it does books on Amazon.com. This will help Vivek Kundra and his team find slow patches and erroneous data faster than any form of government quality assurance process could. So take an Amazon.com-style approach to data:
Cataloging the data sources inside the Federal Government is not good enough. Some data sources are simply not up to par. Data sets like FARAdb are unusable in the form in which the government provides them to citizens. But we also understand that change cannot happen overnight. To make this process as efficient as possible, Government should rely on the customers of its data to pinpoint where the problems are. A reviewing system for the provided data sets does just that.
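To show how reviews could pinpoint problem data sets, here is a rough sketch that averages star ratings per data set and surfaces the lowest-rated ones first. The `Review` record, the 1-to-5 star scale, and the data set names are all made up for illustration; nothing here reflects a real Data.gov feature.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Review:
    """One public review of a data set (hypothetical structure)."""
    dataset: str
    stars: int  # assumed scale: 1 (unusable) to 5 (excellent)
    comment: str = ""


def lowest_rated(reviews, limit=3):
    """Return (dataset, average stars) pairs, worst average first."""
    stars_by_dataset = defaultdict(list)
    for review in reviews:
        stars_by_dataset[review.dataset].append(review.stars)
    averages = [(name, mean(stars)) for name, stars in stars_by_dataset.items()]
    return sorted(averages, key=lambda pair: pair[1])[:limit]


# Invented sample reviews, not real ratings of real data sets.
sample = [
    Review("dataset-a", 5, "Clean CSV, well documented"),
    Review("dataset-a", 4),
    Review("dataset-b", 1, "Scanned PDFs only"),
    Review("dataset-b", 2),
]
print(lowest_rated(sample, limit=1))
```

A list like this gives the Data.gov team a prioritized to-do list written by the people who actually use the data.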
We also don’t think that Data.gov can exist without an editorial staff. You need people to write about the data and explain the data that’s being provided. Let’s face it: traditionally the federal government has mostly written in a voice that lawyers and government officials can understand. Take a look at Data.gov’s closest equivalent right now: Fido. In looking at the different data samples there, can you tell what any of them actually do? Could your mother? Of course not. The very language that government uses is the antithesis of transparency, so use something like this to make it more friendly and understandable:
Data.gov should build real, practical descriptions for the data that it provides. It should speak to why each data set is important and move beyond the non-transparent federal-speak that is so often used. It should feature data, blog about data, and perhaps even link off to interesting things that other people are doing with the data that comes from Data.gov. But at the heart of this, at bare minimum, Data.gov has to do a better job of explaining the data than Kundra’s first attempt at this, the DC Data Catalog.
Human understanding isn’t enough, though. The data that is provided also needs to be understood by machines, in formats that are common not only to developers but also to outside services like Google Earth or Microsoft Excel. Data.gov should make it easy for everyone to get to its data in the format that they want.
That’s the hardest part about building a real data catalog for the Federal Government. You have databases out there that range from the 30-year-old COBOL format at the FEC to the binary Access databases that the FCC has been providing! But in order for Data.gov to truly be successful, it has to take these different data sources and make them available in modern data formats that developers and machines can make sense of.
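As a rough illustration of what that conversion involves, here is a sketch that turns fixed-width records, the kind of layout old COBOL systems typically produce, into CSV. The field names, column positions, and sample record are invented for the example; they are not the FEC's actual file format.

```python
import csv
import io

# Hypothetical record layout: (field name, start column, end column).
LAYOUT = [("record_id", 0, 9), ("amount", 9, 17), ("date", 17, 25)]


def parse_record(line: str) -> dict:
    """Slice one fixed-width line into named fields, trimming padding."""
    return {name: line[start:end].strip() for name, start, end in LAYOUT}


def to_csv(lines) -> str:
    """Convert fixed-width lines into CSV text with a header row."""
    out = io.StringIO()
    fieldnames = [name for name, _, _ in LAYOUT]
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for line in lines:
        writer.writerow(parse_record(line))
    return out.getvalue()


# One made-up record: 9-character id, 8-character amount, 8-character date.
print(to_csv(["ID00000010000250020080115"]))
```

The slicing itself is trivial. The hard part is that every developer currently has to hunt down and re-implement the layout table, which is exactly the work Data.gov could do once for everyone.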
If Government standardizes these file formats, and standardizes the forms that request them too, then groups like Sunlight Labs can create helper classes that let developers browse and interact with the data programmatically rather than just through a web interface. Imagine if this is how you, the developer, interact with Data.gov:
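Here is one way such a helper class might look, as a sketch only. The `DataGov` class, the `http://data.gov/api` base URL, the `fec/contributions` data set path, and the filter names are all hypothetical; no such API has been announced. The `fetch` argument lets you swap in a stand-in for the network, which is what the demo at the bottom does.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


class DataGov:
    """Hypothetical helper for a standardized Data.gov request format."""

    def __init__(self, base_url="http://data.gov/api", fetch=None):
        # base_url is an assumption; fetch can be replaced for testing.
        self.base_url = base_url
        self._fetch = fetch or (lambda url: urlopen(url).read().decode("utf-8"))

    def url(self, dataset, fmt="json", **filters):
        """Build the request URL for a data set, format, and filters."""
        address = f"{self.base_url}/{dataset}.{fmt}"
        if filters:
            # Sort the filters so the same request always yields the same URL.
            address += "?" + urlencode(sorted(filters.items()))
        return address

    def get(self, dataset, **filters):
        """Fetch a data set as JSON and return the decoded records."""
        return json.loads(self._fetch(self.url(dataset, "json", **filters)))


# Demo with a fake fetch function, so nothing touches the network.
client = DataGov(fetch=lambda url: '[{"amount": 2500}]')
print(client.url("fec/contributions", fmt="csv", year=2008))
print(client.get("fec/contributions", year=2008, state="DC"))
```

The same call could hand back CSV for Excel or KML for Google Earth just by changing `fmt`, which only works if the formats and request forms are standardized in the first place.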
Data.gov has to be comprehensive and timely. While the Constitution calls for separation of powers, we do not believe that Data.gov, run by the Executive Branch of Government, should be limited to only Executive Branch information. It should encompass all branches of Government and every independent agency. (P.S. An OPML-based list of all government agencies, represented in the natural hierarchy of Government, should be a data feed!) And it should constantly be growing. When data isn’t available, people should be able to ask for it straight off the website. And obviously those requests should be a data feed in and of themselves.
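The OPML idea in that parenthetical is easy to picture: OPML is just nested `outline` elements, which fits a hierarchy of branches and agencies well. A minimal sketch, assuming a nested dictionary as input; the tiny agency list is only an illustration, nowhere near the full hierarchy, and the feed title is made up.

```python
import xml.etree.ElementTree as ET


def add_outlines(parent, tree):
    """Recursively add one <outline> element per name in a nested dict."""
    for name, children in tree.items():
        node = ET.SubElement(parent, "outline", text=name)
        add_outlines(node, children)


def to_opml(title, tree):
    """Serialize a nested dict of names as an OPML 2.0 document string."""
    root = ET.Element("opml", version="2.0")
    head = ET.SubElement(root, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(root, "body")
    add_outlines(body, tree)
    return ET.tostring(root, encoding="unicode")


# Illustrative fragment only; a real feed would list every agency.
agencies = {
    "Independent Agencies": {
        "Federal Election Commission": {},
        "Federal Communications Commission": {},
    },
}
print(to_opml("U.S. Government Agencies", agencies))
```

Any feed reader or outliner that understands OPML could then browse the structure of Government without custom code.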
Because this is a government site, we also had to think about how the general public would interact with it. We made the navigation and search simple, hiding the more complicated queries behind an advanced search button, and made the home page consumer-friendly by adding a description and a dashboard of the newest and most recently updated data available. Having these things on the home page makes the site browsable and might help users discover data, even if they weren’t searching for anything in particular.
In the end, the site should be predominantly about the data itself, and not about conclusions that may be drawn from it. It should be clear, organized, and easy to use for anyone who visits.
So, here you have it, the big reveal of Data.gov: