It's back to work after a 4th of July filled with hamburgers, hot dogs, and other non-meat options. Here's what the Labs was up to over the past month...
Live from OKCon
I suspect/hope that most of this blog's readership is still asleep right now, but for those who rightly begin their day with a review of Sunlight blogs over their morning coffee, let me encourage you to tune in to the proceedings here at OKCon. So far we've already heard great talks from Rufus Pollock and Glyn Moody, and Richard Stallman is beginning a talk as I post this. I'll be speaking around 8:30am EDT, and plan to say a bit about the e-Gov cuts, #savethedata and the lessons that other open data organizations can take from the episode.
If that's too early for you, I suspect that the video will be archived. And while you're at it, have a look at the OKCon schedule -- there's lots of good stuff coming up!
Hack for Change Was Great
A quick note of thanks and congratulations to all those who participated in last weekend's Hack for Change event. Timball and I were in SF for the hackathon, and I know I speak for us both when I say it was thrilling to see so many smart people working on interesting and important problems.
The folks over at Tokbox have a great roundup of links to the various entries -- and I'd be remiss if I didn't mention the Ruby gem that Code For America's Erik Michaels-Ober cooked up for Sunlight's Real-Time Congress API (thanks, Erik!).
All in all, a great event -- many thanks to the good folks at change.org for making it happen.
Sarah’s Inbox: The Agony and the .tgz
Many of you have probably already seen that earlier today we stood up a copy of the Elena's Inbox code for the Sarah Palin email collection. You can find the site here. I think that by most reasonable standards, Sarah Palin is currently a less newsworthy figure than Justice Kagan was at the time of her confirmation. But there's no question that many people find her fascinating, and folks seem to really enjoy having this sort of interface available -- the response has been overwhelmingly positive, even in spite of its horrifying Gmail 1.0 look (for what it's worth, Sunlight's design team deserves absolutely none of the blame for this one!).
It's worth taking a moment to reflect on what it took to get this site online. The state of Alaska released Governor Palin's email records on paper. News organizations had to have people on the ground to collect, scan and OCR these documents. Our thanks go out to Crivella West, msnbc.com, Mother Jones and ProPublica, whose incredibly quick and high-quality work provided us with the baseline data that powers the site.
But it wasn't yet structured data. It was easy enough to convert the PDFs into text, though this introduced some errors -- dates from the year "20Q7", for instance. Then we had to parse the text into documents, each with recipients, a subject line, and a sender. This is trickier than it might seem. Consider the following recipient list:
To: Smith, John; Jane Doe; Anderson; Andy (GOV); Paul Paulson
It's parseable... sort of. It turns out that, in this case, "Andy Anderson" should be treated as a single entity. In this dataset, portions of names are delimited by semicolons, but so are whole names. It's a bit of a mess. Sunlight staff spent the better part of Monday performing a manual merge of the detected entities, collapsing over 6,000 automatically-captured people to fewer than half that number. I won't pretend that the dataset is now spotless, but it's considerably more structured than it used to be.
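To make the ambiguity concrete, here is a minimal Python sketch of one heuristic for splitting a recipient line like the one above. It is an illustration only, not the code that powers the site: the function name `parse_recipients` and the rule it applies (a lone single-word fragment followed by another single-word fragment is assumed to be a "Last; First" pair) are assumptions made for this example, and a rule like this will misfire on real data, which is why a manual merge was still needed.

```python
import re

# Matches a fragment that is a single word, e.g. "Anderson" or "Andy".
SINGLE_WORD = re.compile(r"[A-Za-z'\-]+")
# Matches a parenthetical tag such as " (GOV)".
PAREN_TAG = re.compile(r"\s*\([^)]*\)")


def parse_recipients(line):
    """Split a 'To:' line into 'First Last' names (heuristic, hypothetical)."""
    header, sep, body = line.partition(":")
    if not sep:
        # No "To:" style header present; treat the whole line as the list.
        body = header
    parts = [p.strip() for p in body.split(";") if p.strip()]

    names = []
    i = 0
    while i < len(parts):
        part = PAREN_TAG.sub("", parts[i]).strip()
        # Assumption: a lone word followed by another lone word is a surname
        # and first name that were split by a semicolon ("Anderson; Andy").
        if SINGLE_WORD.fullmatch(part) and i + 1 < len(parts):
            following = PAREN_TAG.sub("", parts[i + 1]).strip()
            if SINGLE_WORD.fullmatch(following):
                names.append(f"{following} {part}")
                i += 2
                continue
        if "," in part:
            # "Last, First" becomes "First Last".
            last, first = [s.strip() for s in part.split(",", 1)]
            names.append(f"{first} {last}")
        else:
            names.append(part)
        i += 1
    return names


recipients = parse_recipients(
    "To: Smith, John; Jane Doe; Anderson; Andy (GOV); Paul Paulson"
)
print(recipients)
```

Two people who are each listed by a single name would be wrongly merged by this rule, so its output is best treated as a list of candidates for human review.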
And that structure makes possible not only novel interfaces like Sarah's Inbox, but also novel analyses. Consider this graph of how often the word "McCain" appears in the emails:
Interesting, right? More substantively, consider the efforts of Andree McCloud, who's raising questions about an apparent gap in the Palin emails near the beginning of the governor's term. With the data captured, it's easy to visualize this -- here's a graph of the total email volume in the system by week, beginning with the first week of December 2006, when Palin took office:
(To be clear, I don't think you can necessarily conclude from this graph that there's anything nefarious about that period's low email volume -- there are plenty of potential explanations. Still, it's useful to be able to understand the outlier period in the larger context of the document corpus.)
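For anyone who wants to reproduce a weekly-volume count like the one described above, here is a minimal Python sketch. It assumes you have already extracted a sent date for each email; the function name `weekly_volume` and the sample dates are hypothetical and are not taken from the released data.

```python
from collections import Counter
from datetime import date, timedelta


def weekly_volume(sent_dates):
    """Count emails per week, keyed by the Monday that starts each week."""
    counts = Counter()
    for sent in sent_dates:
        # date.weekday() is 0 for Monday, so this rewinds to the week's Monday.
        week_start = sent - timedelta(days=sent.weekday())
        counts[week_start] += 1
    return dict(sorted(counts.items()))


# Made-up sample dates, for illustration only.
sample = [date(2006, 12, 4), date(2006, 12, 6), date(2006, 12, 12)]
volume = weekly_volume(sample)
for week_start, total in volume.items():
    print(week_start.isoformat(), total)
```

Weeks with no emails at all do not appear in the result, so a plot built on it should fill those weeks in with zeros; otherwise a gap in the record would be invisible.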
Of course, these analyses and interfaces could be even better if Alaska had just released the files digitally. In fact, if they had, we might be able to draw some more solid conclusions: as our sysadmin Tim pointed out, message headers' often-sequential IDs could conceivably show whether there actually are missing emails from those first few weeks.
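Here is a rough Python sketch of the kind of check Tim describes. It is purely hypothetical: we don't have the message headers, and the assumption that each message carries an integer ID that increases by one is ours, made only for illustration. The function name `find_gaps` and the sample IDs are invented.

```python
def find_gaps(message_ids):
    """Return (first_missing, last_missing) ranges in a set of integer IDs."""
    ordered = sorted(set(message_ids))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > 1:
            gaps.append((previous + 1, current - 1))
    return gaps


# Invented IDs, for illustration only.
gaps = find_gaps([101, 102, 105, 106, 110])
print(gaps)
```

Even with real headers, a gap would show only that an ID is absent from the release, not why; redacted or withheld messages would look the same as deleted ones.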
It's a shame that that didn't happen -- and not just because it meant my weekend was spent parsing PDFs. Releasing properly structured data ultimately allows everyone to do better work in less time. It's unfortunate that the authorities in Alaska introduced such a substantial and unnecessary roadblock.
But we at Sunlight can at least share what we've done to improve the situation. If you're interested in running your own analysis, you can find our code here, and the data to power it here (12M). At the moment it's in the form of a Django project -- if you need it in a different format, don't hesitate to ask on our mailing list. If you do something neat with it, please tell us!
Labs Update
I admit it: we missed the May Labs Update entirely. I'm sorry! It's been as busy as always around here, with a number of really neat -- but also really involved -- projects beginning to see some light at the end of their respective tunnels. Amidst that effort, we just plain forgot to get out a timely update last month. We promise we won't keep you in the dark like this again.
The most exciting news is that we have a bunch of new faces in the labs offices:
- Drew Vogel has joined the Subsidyscope team, and after a few weeks of work by remote, he's now in the office, in person, and doing great things.
- Ryan Sibley recently moved over to Labs from the Reporting Group. Ryan's been with Sunlight for a while, but the idea of having a journalist embedded with our developers is a new one, and something that I'm pretty excited about.
- Casey Kimmey has joined the Open States team for the summer, where she'll be doing some invaluable data quality validation.
- And finally, Montserrat Lobos is joining us for three weeks from our friends at Ciudadano Inteligente in Chile. She's going to be working with the design team; we're excited to have her.
Here's what the team has been working on:
Tom has been finally -- finally! -- finishing the process of getting the team back to a full headcount. He's also been doing the usual mixture of proposal writing, project oversight, and general triage. Also: some soldering and messing around with Titanium Mobile, the results of which will hopefully be published in a few months.
Alison added a new, corporate accountability dataset to TransparencyData and Influence Explorer. She has also been working on several name matching tasks and making some major additions, including parsing for organization names, to our Name Cleaver name parsing library. In addition, she has been experimenting with some visualizations based upon campaign finance data, which you may be able to look forward to cropping up on our site or in a blog post in the future.
Luigi is busy with the next generation of software that will power our Sunlight Live events. He wrote an article for HTML5 Rocks and took trips to Portland and Baltimore.
Drew added a few usability enhancements to the Subsidyscope search tool. Now he is working on changes to the data importer that will allow us to provide direct expenditure figures based on more current USASpending data.
Jeremy has been hard at work destroying Sunlight web sites, but in a good way. We have decided to retire Public Equals Online and integrate its features into the main Sunlight Foundation site and organizing page. Jeremy also rebuilt TransparencyCamp.org to add a brand new mobile app and HTML-based informational screens that display session information on monitors at the conference.
Eric has been integrating full-text searching with ElasticSearch into our Real Time Congress API for bills. There'll be new endpoints and features announced soon. He's also been wrapping up work on our soon-to-be-released iOS/Android mobile app to help people make better local health care decisions. Finally, he's been working on getting the first round of House expenditure data from the 112th Congress up into our expenditure database and House staff directory.
Aaron updated Follow the Unlimited Money for the new election cycle -- just in time for the special election in New York's 26th District -- and made some improvements to the Reporting Group's Lobbyist Tracker. He's also been working on Capitol Words (preview: the top words in Congress so far this year are "job", "cut", "create" and "repeal").
Kaitlin has been working on a third Roku app as well as building some backend tools for Subsidyscope. She's also been busy writing bombastic blog posts and checking up on her FOIA for contracting data quality reports. Also, she's been working with the other Caitlin to update the design and functionality of the Subsidyscope site.
Chris has been working on a variety of projects, including coding the new design for the House Staff Directory; creating graphics, signs, and other deliverables for Transparency Camp; designing a new background theme for Sunlight's YouTube channel; and producing miscellaneous graphics for other Sunlight projects. She is currently working on a new theme for Sunlight's Data Viz Tumblr blog and continuing to work on the House Staff Directory site.
Michael has been working on expanding the coverage of the Open State Project to new states, adding XML support to the API, and exploring visualizations of the data that's being collected.
Ethan has been coordinating a number of new products and features in Data Commons. This week we released Inbox Influence and added POGO's contractor misconduct database to TransparencyData and Influence Explorer. We're hard at work on several new data sets to be released in July.
With Sunlight Health teetering on the edge of completion, Caitlin has turned her focus to building out the Subsidyscope redesign, interrupted briefly by a jaunt through the South and once again to help with updates to the Roku app that Kaitlin has been working on.
James has been adding support for Maine, New Hampshire, and Oregon to the Open State Project. He's also been working with new Open States intern Casey, who has been doing data checking and cleaning to help promote more states from experimental to ready status.
Andrew prepped the Inbox Influence project for its launch, which was announced at the PdF conference in New York. He has also continued to work on extracting data from Regulations.gov.
Ali has been working on numerous small tasks to support the foundation's design needs. She's been doing a little bit of everything, from branding to visualizing data to giving the CFPB advice on its new mortgage forms. Currently the big project on her desk is the new Capitol Words site.
The Palin Emails and Redaction Technology
Today's release of the Palin emails is prompting frustration among reporters, environmentalists and people who know how to use computers over the fact that the documents are being delivered in the form of a huge, $700+ stack of paper. As Luigi pointed out on Twitter, this decision is being attributed to the difficulty of performing redaction properly within an all-digital system.
This is something I've written about here before. Redaction mistakes do happen -- the brilliant Tim Lee recently released some interesting work showing how to quantify just how often -- but doing it properly isn't rocket science. Digital workflows save time, money and material resources; and in cases like this one, they make it easier for the press to do its job. In other cases, like the one facing the PIDB, there's simply no choice: they'll never overcome the backlog they face without the help of information technology. It's long past time for government to get over its skittishness about digital redaction.
UPDATE: Be sure to check out the comments to this post. Jeremy Ashkenas -- who has personally had to haul the Palin emails, in paper form, across Juneau -- points out that the redaction workflow in this case does appear to have been digital... up to a point. The output, though, was thoroughly analog. If it's not one thing it's another...
Inbox Influence: Political Influence Data in Gmail
Today we're officially launching Inbox Influence, the latest addition to our suite of political influence tools. Inbox Influence is a browser extension that adds political influence data to your Gmail messages. With Inbox Influence installed, you'll see information on the sender of each email, the company from which it's sent, and any politician, company, union or political action committee mentioned in the body of the email. The information is added unobtrusively and nearly instantaneously, and includes campaign contributions, fundraisers and lobbying activity. You can use it to add context to news alerts, political mailers and corporate emails, or just to see who your friends donated to in the last election. We hope that the tool will be of interest to journalists, activists and anyone interested in seeing the political activity of the people and organizations they communicate with.
DevHouse DC Was Awesome
As was strongly implied would be the case, DevHouse DC #5 was awesome. We had dozens of folk over for coding, talking, soldering, eating, and presenting, variously interspersed throughout the office.
Hit the jump for some sweet photos, including one of our Labs Director Tom Lee (who did the bulk of the day's soldering) in the middle of a lightning talk about the LED numbers he's working on lighting up for an upcoming Sunlight project.
West Coast Coders: Come Hack for Change!
Change.org is hosting a hackathon at their San Francisco headquarters in a couple of weeks. If you're in the Bay Area, handy with a text editor, and feel like doing some interesting work, I hope you'll consider joining the party (did I mention they'll be giving away $10,000 in prizes?). I'm pleased to say that Sunlight is a partner for the event, and indeed, I'll be on hand to talk to folks about our APIs, judge entries and, if at all possible, eat some Mission burritos.
You can find all of the details about Hack for Change here. I hope to see you in San Francisco on the 18th!
DevHouse DC #5 at Sunlight
Come by Sunlight's office in Dupont Circle this Saturday, June 4 for DevHouse DC's 5th gathering. We'll be here from noon to midnight, hacking, soldering, and roboticizing the day away.
If you have no idea what DevHouse DC is, then you have no choice but to come and find out the fun way. Or I guess you can look at their website, whatever. Hopefully using the words "typing", "talking", and "treasure" in a sentence together like this is enough to get it across. Register for free on EventBrite and we'll see you there!