Last week, David, Eric, and I attended RubyConf in New Orleans. The organizers of the conference were kind enough to offer us space for an open government hackathon that we held every day of the conference. During the day, as conference sessions were going on, quite a few folks trickled in and out during the "open hacking" hours. On the first two evenings, after the conference sessions ended, we held a series of talks of our own in the hackathon room. We hosted a little over twenty people each night for the talks.
Big thanks go to Tropo for sponsoring food and drinks for the hackathon attendees. We were able to enjoy beignets, soft pretzels, and king cake, not to mention stay hydrated and caffeinated, thanks to them.
Here's a quick recap of what we worked on:
Better Tools Won’t Save Us
Sam Smith wrote a post reacting to what I had to say about the Geithner schedule. In it, he argues that pushing for data to be released in better formats may not be the best course of action: tools exist to sidestep the problem.
Sunlight, as an organisation which complains about this often enough, has much better tools at their disposal than complaining about it. As people using computers in 2010, we all have better tools to use on PDFs than we currently use. We often complain about how inaccessible PDFs are, without doing the basic, simple, automatable tasks which can make them readable.
Opening the PDF in acrobat, pressing the "Recognise text using OCR" [button] and then [you'll find that] it's searchable, and Sunlight could republish this for everyone to use (or put up a webservice which adds the OCR text in such a way that when you search, what you get highlighted is the relevant bits of the page where the OCRed text matches). That is possible now.
But, as a community, we prefer to stick to the notion that anything in PDF is utterly locked up in a way which no one can get at.
It's not (really).
It is far from ideal, it's a bugger to use, and it is not the best format for most things, but it's what we've got. And showing how valuable this data is will get us far further than complaining that we can't read a file that most people clearly can in the tools they use. It's the tools we choose to use that are letting us down. And, as a movement, open data has to get better at it, and then it'll be less of a problem for us, and we can spend more time doing what we claim to be wanting to do.
I appreciate the response, but I disagree. Nothing Sam says about what technology makes possible is wrong, per se. And better tools are of course useful and desirable. But the last thing I want is for government to begin thinking that OCR can make up for bad document workflows. It simply can't: even though it happens to work well on the Geithner schedule, OCR remains a fundamentally lossy technology.
A Quick Reminder
Are you following @sunlightlabs on Twitter? No? We feel strongly that that's a mistake. Similarly, you might want to take a second to join our mailing list if you haven't already. We do our best to use both of these outlets as ways of discussing interesting topics and highlighting the exciting work of others -- not (just) as a means of relentlessly hawking our own work (though of course we tend to find that work exciting, too).
Preview: Real Time Congress API
My main project for the last month or so has been something we're calling the Real Time Congress API. It's not quite ready for production use, and the data in it is subject to change, but I wanted to give you all a preview of what's coming, and to ask for your help and ideas.
The goal of the Real Time Congress (RTC) API is to provide a current, RESTful API over all the artifacts of Congress, updated in as close to real time as possible. For the first version, we plan to include data about bills, votes, legislative and policy documents, committee schedules, updates from the House and Senate floor, and accompanying floor video.
ScraperWiki is Extremely Cool
ScraperWiki is a project that's been on my radar for a while. Last week Aine McGuire and Richard Pope, two of the people behind the project, happened to be in town, and were nice enough to drop by Sunlight's offices to talk about what they've been up to.
Let's start with the basics: remedial screen scraping 101. "Screen scraping" refers to any technique for getting data off the web and into a well-structured format. There's lots of information on web pages that isn't available as a non-HTML download. Making this information useful typically involves writing a script to process one or more HTML files, then spit out a database of some kind.
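To make that concrete, here is a minimal sketch of the pattern in Python, using only the standard library. The sample HTML, the `TableScraper` class, and the `payments` table are all made up for illustration; a real scraper would fetch a live page and would almost certainly need messier parsing rules.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page content, invented for this example.
# A real scraper would download this over HTTP instead.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Recipient</th><th>Amount</th></tr>
  <tr><td>2010-11-01</td><td>Example Bank</td><td>1,000</td></tr>
  <tr><td>2010-11-08</td><td>Sample Trust</td><td>2,500</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Header rows contain only <th> cells, so they end up empty.
            if self._row:
                self.rows.append([cell.strip() for cell in self._row])
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def scrape(html):
    """Turn the HTML table into a list of (date, recipient, amount) tuples."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return [
        (date, recipient, int(amount.replace(",", "")))
        for date, recipient, amount in parser.rows
    ]


def store(rows, conn):
    """Write the scraped rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(date TEXT, recipient TEXT, amount INTEGER)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    store(scrape(SAMPLE_HTML), connection)
    for row in connection.execute("SELECT * FROM payments ORDER BY date"):
        print(row)
```

Even this toy version leans on presentation details (the header row is skipped only because it happens to use `<th>` cells), which is the kind of assumption that tends to break when a page is redesigned.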
It's not particularly glamorous work. People who know how to make nice web pages typically know how to properly release their data. Those who don't tend to leave behind a mess of bad HTML. As a result, screen scrapers often contain less-than-lovely code. Pulling data often involves doing unsexy things like treating presentation information as though it had semantic value, or hard-coding kludges ("# ignore the second span... just because"). Scraper code is often ugly by necessity, and almost always of deliberately limited use. It consequently doesn't get shared very often -- having the open-sourced code languish sadly in someone's GitHub account is normally the best you can hope for.
The ScraperWiki folks realized that the situation could be improved. A collaborative approach can help avoid repetition of work. And since scrapers often malfunction when changes are made to the web pages they examine, making a scraper editable by others might lead to scrapers that spend less time broken.
What’s Going on in the Labs
It's that time again...
Redaction and Technical Incompetence
Felix Salmon, finance blogger extraordinaire, was inspired by some reporting by Bloomberg to have a look at Treasury's website. Apparently Tim Geithner visited Jon Stewart back in April, and Felix was understandably interested in seeing the evidence for himself. He went to the Treasury website, and then... well, things took a turn for the worse:
First, you go to the Treasury homepage. Then you ignore all of the links and navigation, and go straight down to the footer at the very bottom of the page, where there’s a link saying FOIA. Click on that, and then on the link saying Electronic Reading Room. Once you’re there, you want Other Records. Where, finally, you can see Secretary Geithner’s Calendar April – August 2010.
Be careful clicking on that last link, because it’s a 31.5 MB file, comprising Geithner’s scanned diary. Search for “Stewart” and you won’t find anything, because what we’re looking at is just a picture of his name as it’s printed out on a piece of paper.
In other words, these diaries, posted for transparency, are about as opaque as it can get. Finding the file is very hard, and then once you’ve found it, it’s even harder to, say, count up the number of phone calls between Geithner and Rahm Emanuel. You can’t just search for Rahm’s name; you have to go through each of the 52 pages yourself, counting every appearance manually.
Is this really how Obama’s web-savvy administration wants to behave? The Treasury website is still functionally identical to the dreadful one we had under Bush, and we’ve passed the midterm elections already. I realize that Treasury’s had a lot on its plate these past two years, but much more transparent and usable website is long overdue.
This all sounds sadly familiar to me. I still remember when Treasury started posting TARP disbursement reports as CSVs instead of PDFs. I was working on Subsidyscope at the time, and had to load those reports on a weekly basis. It's more than a little sad how much better my life got when they made that change.
But I think it's important to note that Felix's frustration isn't just the product of bad technology.
New Wireframes from the FCC
As some of you might recall, we took a stab at redesigning the FCC site a little over a year ago. Since then the FCC has been reconsidering their online presence. A few days ago they released some interesting wireframes of a reimagined FCC.gov site. Looking through those wireframes, it seems like quite a good attempt at organizing their content and really trying to make it more understandable to the general public.
There are a few small things here and there that I can nitpick. For example, on the "Search Results" wireframe it would be nice to have a title at the top saying what the user had just searched for. I'm also a bit perplexed as to why a search results page would have a section for videos that breaks up the main results. If they want to have results by category, they should group them as such and then have links to see full results in each category. Also, please, FCC, we're begging you: make things like press releases available in formats other than Word and PDF.
Influence Explorer Gets New Election Tools and More
I wanted to let everyone know about some great new features that have just gone live on Influence Explorer. If you haven't already checked it out, Influence Explorer is our one-stop source for a variety of influence-related data on politicians, political organizations, private companies and powerful individuals.
Getting Serious About Bear Data
Our colleagues at Sunlight have just launched a major new ursine initiative. Naturally, we want to help, so David dug into the data he'd recently scraped from the Catalog of Government Publications to see if there were any contributions we could make from a labs perspective.
The Catalog of Government Publications is a powerful bear resource. It contains no fewer than 77 records about bears. The most important document in that list, of course, is the Bear Spray Safety Program. Nothing drives humanity more than survival. Therefore, you must familiarize yourself with the all-important Bear Force Continuum: