A Modern Approach to Open Data
Last year, a group of us who work daily with open government data (Josh Tauberer of GovTrack.us, Derek Willis at The New York Times, and me) decided to stop each building the same basic tools over and over, and start building a foundation we could share.
We set up a small home at github.com/unitedstates, and kicked it off with a couple of projects to gather data on the people and work of Congress. Using a mix of automation and curation, they gather basic information from all over the government — THOMAS.gov, the House and Senate, the Congressional Bioguide, GPO’s FDSys, and others — that everyone needs to report, analyze, or build nearly anything to do with Congress.
Once we centralized this work and started maintaining it publicly, we began getting contributions nearly immediately. People educated us on identifiers, fixed typos, and gathered new data. Chris Wilson built an impressive interactive visualization of the Senate’s budget amendments by extending our collector to find and link the text of amendments.
This is an unusual, and occasionally chaotic, model for an open data project. The /unitedstates project is a neutral space; GitHub’s permissions system allows many of us to share the keys, so no one person or institution controls it. What this means is that while we all benefit from each other’s work, no one is dependent on, or “downstream” from, anyone else. It’s a shared commons in the public domain.
There are a few principles that have helped make the /unitedstates project something that’s worth our time:
* We collaborate in public. When we have questions or ideas, we bring them up and talk them out using GitHub’s issue tracker. Questions get answers very quickly, unexpected participants hop in, and (as with other Q&A systems like Stack Overflow and Quora) discussions themselves become valuable long-term artifacts. GitHub is extremely well designed for this.
* Our congressional tools can be used in a standalone, language-agnostic way, with no required configuration. You just need a command line, and data gets placed on disk in bulk. Nothing depends on a database.
* We started using our new data in a live product right away. Instead of waiting for something that felt “1.0”, Sunlight and GovTrack replaced their pre-existing collection infrastructure with our new tools as soon as they were functional. Because of this, we were forced to promptly fix bugs and fill gaps, and create a stable platform to iterate on. This guarantees momentum.
* No brand names. Our organization’s name, “unitedstates”, is harder to describe to someone in an elevator, but it makes it clearer to volunteers that they’re contributing to the public domain and the common good. Repository names project authority by being clear and descriptive, rather than catchy.
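To make the "data on disk, no database" principle concrete, here is a minimal sketch of how a downstream script might consume bulk output. The directory layout, the `data.json` filename, and the `bill_type` field are assumptions chosen for illustration, not a description of what the /unitedstates tools actually write.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_records(data_dir: str, filename: str = "data.json") -> Iterator[dict]:
    """Yield every parsed JSON document found under data_dir.

    The filename and nesting are hypothetical; adjust them to match
    whatever the collector really produced.
    """
    for path in sorted(Path(data_dir).rglob(filename)):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


def count_by(records: Iterable[dict], key: str) -> dict:
    """Tally records by the value of one field (missing values count as 'unknown')."""
    counts: dict = {}
    for record in records:
        value = record.get(key, "unknown")
        counts[value] = counts.get(value, 0) + 1
    return counts
```

Pointing `iter_records` at a local data directory and passing the result to `count_by(..., "bill_type")` would give a quick tally using nothing but the files themselves. The same files could be read just as easily from any other language, which is the point of keeping the tools free of a database.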
Since we started, the /unitedstates project has grown into a bigger collection of small pieces. There’s a reliable US Code parser, a Swiss army knife for legal citations, and a tiny community spreadsheet of slang bill nicknames. None of these require a huge amount of ongoing investment, and all of them are used in a live service somewhere.
These projects don’t do anything fundamentally new. People have solved these problems before. But usually, developers will just write these sorts of things quickly to get them out of the way, and leave them tightly integrated into some larger system. Even when this is made open source, it’s tough to reuse code written this way. Newcomers find the learning curve intimidating, and the author rarely feels like re-engineering working code.
Instead, when we notice common problems — even small ones — we’re solving them as independent projects that are easy to share. This is basically all upside; anyone can build and brand anything they want on top of these tools, and benefit from the fixes and improvements of others. It’s a healthy arrangement, and the kind we should see more of in the open government community.