This may admittedly be of limited interest to those outside the DC area, but it's extremely interesting to me, so I'm afraid you'll just have to humor me for a paragraph or two. WMATA, our regional transit agency, has just launched a developer portal and API, and they've done a really nice job of it. People seem to love transit data -- after crime data it seems to be the municipal information people get most excited about (and I'd argue that it's much, much more useful than crime data) -- and I'm no exception. Playing with this stuff is a bit of a hobby of mine, and I've been following WMATA's gradual move toward openness for years. This is a big step forward for both the agency and its customers.
Bus data is still forthcoming, and I suspect that's where the real possibilities lie: the rail system is pretty easy to use; tech can pay bigger dividends when applied to the relative mysteries of the bus. Still, it's already clear that WMATA has made some smart decisions about implementation, defined reasonable terms of service, and generally seems to be moving in the right direction. When the API is considered alongside the already-released GTFS dataset, Metro's offerings match up fairly well (though not perfectly) with the ten open data principles that Sunlight has just published.
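For a sense of what working with the already-released GTFS dataset looks like: GTFS feeds include a `stops.txt` CSV whose field names come from the GTFS spec. The sample rows below are illustrative stand-ins, not actual WMATA data.

```python
import csv
import io

# A minimal stops.txt sample (field names per the GTFS spec).
# A real feed would be downloaded from the agency's developer portal.
SAMPLE_STOPS = """stop_id,stop_name,stop_lat,stop_lon
A01,Metro Center,38.8983,-77.0281
A02,Farragut North,38.9032,-77.0397
"""

def load_stops(text):
    """Parse GTFS stops.txt into a dict keyed by stop_id."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["stop_id"]: {
            "name": row["stop_name"],
            "lat": float(row["stop_lat"]),
            "lon": float(row["stop_lon"]),
        }
        for row in reader
    }

stops = load_stops(SAMPLE_STOPS)
print(stops["A01"]["name"])  # Metro Center
```

The same pattern extends to the other files in a GTFS bundle (`routes.txt`, `trips.txt`, `stop_times.txt`), which join on their shared ID columns.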
Now to see if I can't get a Graphserver instance running...
We Don’t Need a GitHub for Data
There was an interesting exchange this past weekend between Derek Willis of the New York Times and Sunlight's own Labs Director emeritus, Clay Johnson. Clay wrote a post arguing that we need a "GitHub for data":
It's too hard to put data on the web. It’s too hard to get data off the web. We need a GitHub for data.
With a good version control system like Git or Mercurial, I can track changes, I can do rollbacks, branch and merge and most importantly, collaborate. With a web counterpart like GitHub I can see who is branching my source, what’s been done to it, they can easily contribute back and people can create issues and a wiki about the source I’ve written. To publish source to the web, I need only configure my GitHub account, and in my editor I can add a file, commit the change, and publish it to the web in a couple quick keystrokes.
[...]
[...]
Getting and integrating data into a project needs to be as easy as integrating code into a project. If I want to interface with Google Analytics with ruby, I can type gem install vigetlabs-garb and I’ve got what I need to talk to the Google Analytics API. Why can I not type into a console gitdata install census-2010 or gitdata install census-2010 --format=mongodb and have everything I need to interface with the coming census data?
On his own blog, Derek pushed back a bit:
[...] The biggest issue, for data-driven apps contests and pretty much any other use of government data, is not that data isn’t easy to store on the Web. It’s that data is hard to understand, no matter where you get it.
[...]
What I’m saying is that the very act of what Clay describes as a hassle:
A developer has to download some strange dataset off of a website like data.gov or the National Data Catalog, prune it, massage it, usually fix it, and then convert it to their database system of choice, and then they can start building their app.
Is in fact what helps a user learn more about the dataset he or she is using. Even a well-documented dataset can have its quirks that show up only in the data itself, and the act of importing often reveals more about the data than the documentation does. We need to import, prune, massage, convert. It’s how we learn.
I think there's a lot to what Derek is saying. Understanding what an MSA is, or how to match Census data up against information that's been geocoded by zip code -- these are bigger challenges than figuring out how to get the Census data itself. The documentation for this stuff is difficult to find and even harder to understand. Most users are driven toward the American FactFinder tool, but if it can't tell you what you want, you're going to have to spend some time hunting down the appropriate FTP site and an explanation of its organization -- Clay's right that this is a pain. But it's nothing compared to the challenge of figuring out how to use the data properly. It can be daunting.
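To make the zip-code matching problem concrete, here's a minimal sketch of the kind of join involved. The figures are invented for illustration, and it's worth noting that Census tables are actually keyed by ZCTA (ZIP Code Tabulation Area), which only approximates USPS zip codes, so joins like this are inherently lossy.

```python
# Hypothetical figures for illustration only -- not real Census data.
census_by_zcta = {
    "20001": {"population": 40000},
    "20002": {"population": 52000},
}

geocoded_records = [
    {"name": "Site A", "zip": "20001"},
    {"name": "Site B", "zip": "20002"},
    {"name": "Site C", "zip": "99999"},  # no matching ZCTA
]

def join_by_zip(records, census):
    """Attach Census figures to records sharing a zip/ZCTA code.

    Records with no matching ZCTA get population=None rather than
    being silently dropped, so the gaps stay visible.
    """
    joined = []
    for rec in records:
        match = census.get(rec["zip"])
        joined.append(
            {**rec, "population": match["population"] if match else None}
        )
    return joined

for row in join_by_zip(geocoded_records, census_by_zcta):
    print(row["name"], row["population"])
```

The mechanics are trivial; the hard part, as Derek says, is knowing that the keys don't quite mean what they appear to mean.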
But I think there are problems with the "GitHub for data" framing that go beyond the simple fact that the problems GitHub solves aren't the biggest problems facing analysts.
Meet the New Federal Register
If you haven't already, be sure to check out the new federalregister.gov, which launched last night. For some of you, the site might bring to mind govpulse, one of the winners of our second Apps for America contest. That's no coincidence: GPO and NARA, the agencies responsible for maintaining the FR, sought out Andrew, Dave and Bob -- the folks behind govpulse -- and asked them to help build the new site.
As you can imagine, those of us at Sunlight are pretty excited about this. It's a great validation of the work of the Labs community, and a wonderful example of what's possible when government stays open to the transformative possibilities offered by technology.
Government Data and the Case for Not Running Me Over
Over the weekend I was clearing out my RSS, and was pleasantly surprised to find Sunlight's work in an unexpected place. TheWashCycle is my favorite DC bike blog, and its author has started a series of posts designed to address arguments that are commonly faced by cycling advocates. One of those is that cyclists don't pay for roads — that the gas tax pays for them — and consequently folks on bikes aren't entitled to the use of roads, or are less entitled to space on the road than motorists, or shouldn't have a say in how roads are built.
As it turns out, the assumption that cyclists don't pay for roads is wrong. The WashCycle post linked to some work that we did for Pew's Subsidyscope project, which shows that gas taxes are paying for a decreasing share of our roads. In 2007 taxes and fees related to auto use covered only half the bill. The shortfall is made up by general revenues and debt — and though the specifics of the story play out differently from state to state it's safe to say that cyclists pay taxes that help build roads.
I mention all this not simply to highlight some pro-cyclist propaganda — though of course, as a daily bike commuter, I'm glad to do that, too — but rather to point this out as an example of what open government data can accomplish.
The Health 2.0 Developer Challenge
The Health 2.0 Developer Challenge launched last week, and I've been embarrassingly remiss in mentioning it. Hopefully, many of you are already in the loop and excited about the project. Let me take a second and fill the rest of you in.
There are a lot of app contests and hackathons and dev challenges around these days. But I think this is one worth getting excited about, for three reasons.
Guest Post: Calling All Phoenix Area Civic Hackers
Marc Chung is one of the organizers who helped make the Great American Hackathon a success, and is a friend of Sunlight. He's asked for a little space on the Labs blog to announce his new Phoenix-area open data group, and we're only too happy to oblige. Read on for the details.
I'm Marc Chung, a computer scientist who is passionate about bringing technologists together to improve our world.
Last year, I organized the Phoenix edition of the Great American Hackathon. That weekend a local gathering of developers decided to contribute time towards building a [parser](http://sunlightlabs.com/blog/2009/hotness-arizona/) for the Arizona State Legislature. The work was done as part of the Fifty States project which supports organizations like MapLight and OpenCongress.
After the hackathon, I was contacted by several journalists and developers who were very excited by the work we did and just as eager to offer their assistance on future civic hacking initiatives. In the short time since GAH '09, we've been working together to extract useful information from public data in an effort to shed more light on how state governments work.
Combining the interests of these two groups was inevitable, and so today, along with Mark Ng and Brian Shaler, I'd like to announce PhxData, a group to unite technologists in the Phoenix area who are engaged in data mining, parsing, visualization, etc. It also serves as a platform for journalists and government officials to connect with civic hackers who want to take public data and make it useful.
Check out our website: http://phxdata.org
If you're a data scientist, journalist, government official, statistician, developer or designer who would like to work on exploring data in the interest of pursuing greater government transparency for the state of Arizona, you should join this group.
Elena’s Inbox: How Not to Release Data
On Friday @BobBrigham tweeted a suggestion: put the just-released Elena Kagan email dump into a GMail-style interface. I thought this was a pretty cool idea, so I started hacking away at it over the weekend. You can see the finished results at elenasinbox.com.
I'm really pleased that people have found the site useful and interesting, but the truth is that a lot of the emails in the system are garbage: they're badly formatted, duplicative, or missing information. For instance, one of the most-visited pages on the site is the thread with the subject "Two G-rated Jewish jokes" -- understandably, given that it's the most potentially scandalous-sounding subject line on the first page of results. Unfortunately, if you click through you'll see that there's no content in the messages.
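Grouping messages like these into GMail-style threads generally comes down to normalizing the subject line so that replies and forwards collapse onto the original. This is my own illustrative sketch of that idea, not the site's actual code:

```python
import re

def normalize_subject(subject):
    """Strip reply/forward prefixes so variants of a subject thread together."""
    s = subject.strip()
    # Repeatedly remove a leading "Re:", "Fw:", or "Fwd:" prefix,
    # since forwards of replies stack several of them.
    while True:
        stripped = re.sub(r"^(re|fwd|fw)\s*:\s*", "", s, flags=re.IGNORECASE)
        if stripped == s:
            return s
        s = stripped

emails = [
    "Two G-rated Jewish jokes",
    "RE: Two G-rated Jewish jokes",
    "Fwd: Re: Two G-rated Jewish jokes",
]
# All three subjects collapse to a single thread key.
threads = {normalize_subject(e) for e in emails}
print(len(threads))  # 1
```

Of course, subject-based threading only works when the subject lines survive intact, which is exactly what the source documents can't be trusted to guarantee.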
The site was admittedly a bit rushed, but in this case it isn't the code that's to blame. If you go through the source PDF, you'll see that the content is missing there, too. It looks like it might have been redacted, but the format of the document is confusing enough that it's difficult to be sure.
But the source documents' problems go beyond ambiguous formatting. A lot of the junky content on the site comes from the junk it was built from -- there's not much we can do about it. To give you some idea of the problem, consider these strings:
“How Our Laws Are Made”: Now in Poster Form
Just a quick note: we've been getting a few requests from folks saying they'd like to buy a printed copy of How Our Laws Are Made, one of the winning entries from Design for America. Well, good news: the folks responsible for this fantastic infographic have made it available from an on-demand print service, letting you get a physical copy in whatever format you think would best suit your classroom, office or other source of blank wall space. Even as I type this a print should be winging its way to Sunlight's offices -- if you'd like one, too, you know where to click.
Hello, Labs
Like Clay said, I'm the new guy. Well, not entirely new -- I've been at Sunlight since late 2008. But I'm the one who's going to be trying to fill the enormous gap he's leaving. I thought I'd start to explain how I want to do that by talking about how I arrived at Sunlight.
I first became aware of the Sunlight Foundation while working as a programmer at a consultancy here in DC, building sites for large nonprofits and dabbling in various technologies -- and writing about them -- on the side. When I heard about Sunlight Labs, I thought it was pretty much the coolest thing in the world. Technologists using their skills to directly improve society. For people like me (and probably you) -- people who have acquired a technical skillset that's powerful, in a sense, but not always obviously useful -- it's an incredibly compelling prospect.
Checking Out the New USASpending.gov
USASpending.gov got a face-lift on Wednesday evening, and it brought with it a raft of new features. Some of these are great; others are either not very useful, or an actual step backward. Let's run through them -- not only to highlight the features and shortcomings, but to examine what they can tell us about how government should be opening its data.