I'm happy to announce the newest project from Sunlight Labs, Poligraft. A utility built on top of Transparency Data, Poligraft takes in a block of text, parses it for entities like politicians and corporations, and returns a result set representing the political influence contained in that text. I won't dwell on the features -- read Ellen Miller's announcement blog post and the about page for more information. What I want to talk about instead is the development process.
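For readers who just want a feel for the shape of the workflow -- text in, annotated entities out -- here's a rough, hypothetical sketch. The endpoint URL and response fields below are illustrative guesses, not documented API details:

```python
import requests

# Hypothetical sketch of the text-in, entities-out workflow described above.
# The URL and response fields are assumptions for illustration, not the real API.
ARTICLE = """Senator Example met with lobbyists for BigCorp to discuss the
upcoming appropriations bill."""

resp = requests.post("https://example.com/poligraft.json", data={"text": ARTICLE})
resp.raise_for_status()
result = resp.json()

# Each recognized entity would carry the influence data attached to it.
for entity in result.get("entities", []):
    print(entity["name"], entity["type"], entity.get("contributions"))
```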
We Don’t Need a GitHub for Data
There was an interesting exchange this past weekend between Derek Willis of the New York Times and Sunlight's own Labs Director emeritus, Clay Johnson. Clay wrote a post arguing that we need a "GitHub for data":
It's too hard to put data on the web. It’s too hard to get data off the web. We need a GitHub for data.
With a good version control system like Git or Mercurial, I can track changes, I can do rollbacks, branch and merge and most importantly, collaborate. With a web counterpart like GitHub I can see who is branching my source, what’s been done to it, they can easily contribute back and people can create issues and a wiki about the source I’ve written. To publish source to the web, I need only configure my GitHub account, and in my editor I can add a file, commit the change, and publish it to the web in a couple quick keystrokes.
[...]
Getting and integrating data into a project needs to be as easy as integrating code into a project. If I want to interface with Google Analytics with ruby, I can type gem install vigetlabs-garb and I’ve got what I need to talk to the Google Analytics API. Why can I not type into a console gitdata install census-2010 or gitdata install census-2010 --format=mongodb and have everything I need to interface with the coming census data?
On his own blog, Derek pushed back a bit:
[...] The biggest issue, for data-driven apps contests and pretty much any other use of government data, is not that data isn’t easy to store on the Web. It’s that data is hard to understand, no matter where you get it.
[...]
What I’m saying is that the very act of what Clay describes as a hassle:
A developer has to download some strange dataset off of a website like data.gov or the National Data Catalog, prune it, massage it, usually fix it, and then convert it to their database system of choice, and then they can start building their app.
Is in fact what helps a user learn more about the dataset he or she is using. Even a well-documented dataset can have its quirks that show up only in the data itself, and the act of importing often reveals more about the data than the documentation does. We need to import, prune, massage, convert. It’s how we learn.
I think there's a lot to what Derek is saying. Understanding what an MSA is, or how to match Census data up against information that's been geocoded by zip code -- these are bigger challenges than figuring out how to get the Census data itself. The documentation for this stuff is difficult to find and even harder to understand. Most users are driven toward the American FactFinder tool, but if it can't tell you what you want to know, you're going to have to spend some time hunting down the appropriate FTP site and an explanation of how it's organized -- Clay's right that this is a pain. But it's nothing compared to the challenge of figuring out how to use the data properly. It can be daunting.
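To make that concrete: matching zip-coded records against Census geography usually means routing everything through a crosswalk file first. Here's a rough sketch -- the filenames and column names are invented for illustration, not any particular dataset's layout:

```python
import pandas as pd

# Hypothetical filenames and columns -- real crosswalks from Census/HUD use
# their own layouts, which is exactly the kind of quirk you only learn about
# by importing the data and looking at it.
records = pd.read_csv("geocoded_records.csv", dtype={"zip": str})
crosswalk = pd.read_csv("zip_to_msa_crosswalk.csv", dtype={"zip": str, "msa_code": str})

# ZIP codes aren't clean keys: leading zeros vanish, and some ZIPs span
# multiple MSAs, so a naive merge can silently duplicate or drop rows.
merged = records.merge(crosswalk[["zip", "msa_code", "msa_name"]], on="zip", how="left")

unmatched = merged["msa_code"].isna().sum()
print(f"{unmatched} records have no MSA match and need manual attention")
```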
But I think there are problems with the "GitHub for data" framing that go beyond the simple fact that the problems GitHub solves aren't the biggest problems facing analysts.
OSCON 2010
Last week I was fortunate enough to attend OSCON in Portland, OR. This year OSCON hosted hundreds of talks on a dizzying array of subjects. The hot topics were definitely cloud computing and programming languages, both established and emerging.
Most interesting to me though was the emphasis on government, social issues, and information freedom. Tim O'Reilly's opening keynote set the tone for many of the later talks by calling for the open source community to use its expertise in cooperative problem solving to address pressing issues in government and society. There were also keynotes from Portland's Mayor Sam Adams and DC's Chief Technology Officer Bryan Sivak on the importance of open source in local government.
Searching Earmarks Isn’t Actually That Hard
Yesterday morning I watched the first markup session of the Earmark Transparency Act. The bill aims to create a comprehensive database of all earmark requests, not just approved earmarks. In its current version, there are over twenty required data elements, including free-text descriptions and justifications of the earmark request, as well as related documents. The bill also calls for a great deal of flexibility in the search interface and the API. Overall, it's a win for transparency and a big technical leap forward in terms of how the government thinks about releasing its data. Its biggest opponent in committee was Senator Carl Levin.
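The searching part really isn't the hard part. Once the records exist in structured form, full-text search over the free-text fields is close to a solved problem. A minimal sketch using SQLite's built-in FTS5 -- the field names are my own invention, not the bill's actual data elements:

```python
import sqlite3

# Minimal sketch: full-text search over free-text earmark fields.
# Requires an SQLite build with FTS5 (standard in modern Python distributions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE earmarks USING fts5(sponsor, recipient, description, justification)"
)
conn.execute(
    "INSERT INTO earmarks VALUES (?, ?, ?, ?)",
    ("Sen. Example", "Example Water Authority",
     "Upgrades to the municipal water treatment plant",
     "Aging infrastructure serving 40,000 residents"),
)

# Any terms in the description or justification are now searchable.
for row in conn.execute(
    "SELECT sponsor, recipient FROM earmarks WHERE earmarks MATCH ?", ("water treatment",)
):
    print(row)
```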
Meet the New Federal Register
If you haven't already, be sure to check out the new federalregister.gov, which launched last night. For some of you, the site might bring to mind govpulse, one of the winners of our second Apps for America contest. That's no coincidence: GPO and NARA, the agencies responsible for maintaining the FR, sought out Andrew, Dave and Bob -- the folks behind govpulse -- and asked them to help build the new site.
As you can imagine, those of us at Sunlight are pretty excited about this. It's a great validation of the work of the Labs community, and a wonderful example of what's possible when government stays open to the transformative possibilities offered by technology.
Government Data and the Case for Not Running Me Over
Over the weekend I was clearing out my RSS reader and was pleasantly surprised to find Sunlight's work in an unexpected place. The WashCycle is my favorite DC bike blog, and its author has started a series of posts designed to address arguments that cycling advocates commonly face. One of those is that cyclists don't pay for roads — that the gas tax pays for them — and consequently folks on bikes aren't entitled to the use of roads, or are less entitled to space on the road than motorists, or shouldn't have a say in how roads are built.
As it turns out, the assumption that cyclists don't pay for roads is wrong. The WashCycle post linked to some work that we did for Pew's Subsidyscope project, which shows that gas taxes are paying for a decreasing share of our roads. In 2007 taxes and fees related to auto use covered only half the bill. The shortfall is made up by general revenues and debt — and though the specifics of the story play out differently from state to state it's safe to say that cyclists pay taxes that help build roads.
I mention all this not simply to highlight some pro-cyclist propaganda — though of course, as a daily bike commuter, I'm glad to do that, too — but rather to point this out as an example of what open government data can accomplish.
A Few Git Tips
This weekend I had the opportunity to attend Scott Chacon's Advanced Git class at Jumpstart Lab. Scott works for GitHub and maintains the Git project's website. He's also written a book, Pro Git, and the handy reference site Git Reference.
Scott spent a good bit of time going over the fundamentals of Git -- the different types of objects stored in its database and how they point to one another. I had seen all this before when I first started using Git, but I wasn't ready to really understand it then. If you've ever felt that Git was a bit mysterious or scary, I'd highly recommend going over the basics again. Try this article and these two sections of Scott's book.
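If you want to poke at those objects yourself, Git will happily show them to you. A quick sketch -- run it from inside any Git repository -- that walks from HEAD to its commit, tree, and blob objects:

```python
import subprocess

def git(*args):
    """Run a git command in the current repository and return its output."""
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout.strip()

# A branch (via HEAD) points at a commit object...
commit = git("rev-parse", "HEAD")
print(commit, "is a", git("cat-file", "-t", commit))   # -> commit

# ...the commit points at a tree object (plus parents and metadata)...
tree = next(
    line.split()[1]
    for line in git("cat-file", "-p", commit).splitlines()
    if line.startswith("tree ")
)
print(tree, "is a", git("cat-file", "-t", tree))       # -> tree

# ...and the tree points at blobs (file contents) and sub-trees (directories).
print(git("cat-file", "-p", tree))
```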
Here are some other useful tips I picked up:
The Health 2.0 Developer Challenge
The Health 2.0 Developer Challenge launched last week, and I've been embarrassingly remiss in mentioning it. Hopefully, many of you are already in the loop and excited about the project. Let me take a second and fill the rest of you in.
There are a lot of app contests and hackathons and dev challenges around these days. But I think this is one worth getting excited about, for three reasons.
Labs Olympics: Automate your life with geocron
geocron, a Labs Olympics experiment, automates your life based on your location.
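The post doesn't go into geocron's internals, but the core trick -- fire an action when your reported location enters a region -- comes down to a geofence check. A toy sketch, with the coordinates and the action entirely invented:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Invented example rule: when a location ping lands within 200 meters of the
# office, run the "arrived at work" action.
OFFICE = (38.9007, -77.0435)   # made-up coordinates
RADIUS_KM = 0.2

def check(lat, lon):
    if haversine_km(lat, lon, *OFFICE) <= RADIUS_KM:
        print("Trigger: arrived at work -- run the configured action")

check(38.9009, -77.0433)   # simulated location ping
```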
Grading the new USA.gov
USA.gov, the official portal for information and services from the U.S. government, just launched a redesign of the site. Since we took a stab at redesigning it ourselves back in January of '09, we thought we'd see if they took any of our advice.