Some thoughts on the strategy of retiring projects and how we look back at our work on the new tools page.
Sarah’s Inbox: The Agony and the .tgz
Many of you have probably already seen that earlier today we stood up a copy of the Elena's Inbox code for the Sarah Palin email collection. You can find the site here. I think that by most reasonable standards, Sarah Palin is currently a less newsworthy figure than Justice Kagan was at the time of her confirmation. But there's no question that many people find her fascinating, and folks seem to really enjoy having this sort of interface available -- the response has been overwhelmingly positive, even in spite of its horrifying Gmail 1.0 look (for what it's worth, Sunlight's design team deserves absolutely none of the blame for this one!).
It's worth taking a moment to reflect on what it took to get this site online. The state of Alaska released Governor Palin's email records on paper. News organizations had to have people on the ground to collect, scan and OCR these documents. Our thanks go out to Crivella West, msnbc.com, Mother Jones and ProPublica, whose incredibly quick and high-quality work provided us with the baseline data that powers the site.
But it wasn't yet structured data. It was easy enough to convert the PDFs into text, though this introduced some errors -- dates from the year "20Q7", for instance. Then we had to parse the text into documents, each with recipients, a subject line, and a sender. This is trickier than it might seem. Consider the following recipient list:
To: Smith, John; Jane Doe; Anderson; Andy (GOV); Paul Paulson
It's parseable... sort of. It turns out that, in this case, "Andy Anderson" should be treated as a single entity. In this dataset, semicolons separate one name from the next, but they also separate the parts of a single name. It's a bit of a mess. Sunlight staff spent the better part of Monday performing a manual merge of the detected entities, collapsing over 6,000 automatically-captured people to fewer than half that number. I won't pretend that the dataset is now spotless, but it's considerably more structured than it used to be.
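To make the ambiguity concrete, here is a minimal sketch of one heuristic for a list like the one above. It is not the code that powers the site: the function name `parse_recipients` and the rule itself (a lone single-word fragment is assumed to be a surname whose given name follows in the next fragment) are assumptions made for illustration.

```python
import re

# Hypothetical helper, not taken from the Sarah's Inbox code.
PAREN_TAG = re.compile(r"\s*\([^)]*\)")          # e.g. " (GOV)"
HEADER = re.compile(r"^\s*(To|Cc|Bcc):\s*", re.IGNORECASE)

def parse_recipients(line):
    """Split a recipient line into 'First Last' names using a rough heuristic."""
    body = HEADER.sub("", line)
    parts = [p.strip() for p in body.split(";") if p.strip()]
    names = []
    i = 0
    while i < len(parts):
        part = parts[i]
        # Assumption: a single word with no comma is a surname that was
        # split from its given name by a semicolon.
        if " " not in part and "," not in part and i + 1 < len(parts):
            given = PAREN_TAG.sub("", parts[i + 1]).strip()
            names.append(f"{given} {part}")
            i += 2
            continue
        part = PAREN_TAG.sub("", part).strip()
        if "," in part:
            # "Last, First" becomes "First Last"
            last, first = [s.strip() for s in part.split(",", 1)]
            part = f"{first} {last}"
        names.append(part)
        i += 1
    return names

print(parse_recipients("To: Smith, John; Jane Doe; Anderson; Andy (GOV); Paul Paulson"))
```

A rule like this guesses wrong whenever a recipient really is listed by a single word, which is part of why a manual merge was still needed.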
And that structure makes possible not only novel interfaces like Sarah's Inbox, but also novel analyses. Consider this graph of how often the word "McCain" appears in the emails:
Interesting, right? More substantively, consider the efforts of Andree McCloud, who's raising questions about an apparent gap in the Palin emails near the beginning of the governor's term. With the data captured, it's easy to visualize this -- here's a graph of the total email volume in the system by week, beginning with the first week of December 2006, when Palin took office:
(To be clear, I don't think you can necessarily conclude from this graph that there's anything nefarious about that period's low email volume -- there are plenty of potential explanations. Still, it's useful to be able to understand the outlier period in the larger context of the document corpus.)
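For anyone who wants to reproduce that kind of weekly count from the parsed data, a rough sketch follows. The function name `weekly_volume` and the idea of passing in a plain list of sent dates are assumptions for illustration, as is the choice of December 4, 2006 as the start of the first week; the project's Django models may expose the dates differently.

```python
from collections import Counter
from datetime import date

def weekly_volume(sent_dates, start):
    """Count emails per 7-day bucket, where bucket 0 begins on `start`."""
    counts = Counter()
    for d in sent_dates:
        if d >= start:
            counts[(d - start).days // 7] += 1
    if not counts:
        return []
    # Fill in empty weeks with 0 so that gaps show up in the series.
    return [counts.get(week, 0) for week in range(max(counts) + 1)]

# Made-up dates, purely to show the shape of the output.
sample = [date(2006, 12, 4), date(2006, 12, 10), date(2006, 12, 11), date(2006, 12, 26)]
print(weekly_volume(sample, start=date(2006, 12, 4)))
```

Zero-filling the empty weeks matters here: a week with no email at all is exactly the kind of outlier the graph is meant to surface.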
Of course, these analyses and interfaces could be even better if Alaska had just released the files digitally. In fact, if they had, we might be able to draw some more solid conclusions: as our sysadmin Tim pointed out, message headers' often-sequential IDs could conceivably show whether there actually are missing emails from those first few weeks.
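As a purely hypothetical illustration of Tim's point: if the release had included message headers with integer, sequential IDs (it did not, and real IDs may not be that tidy), spotting missing ranges would take only a few lines. The function name `find_id_gaps` and the sample IDs are invented for this sketch.

```python
def find_id_gaps(message_ids):
    """Return (first_missing, last_missing) ranges between the IDs we have."""
    ids = sorted(set(message_ids))
    gaps = []
    for prev, cur in zip(ids, ids[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps

# Invented IDs, only to show the output format.
print(find_id_gaps([101, 102, 105, 106, 110]))
```

Even then, a gap would not prove anything on its own: the missing IDs could belong to messages that were never part of the records request.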
It's a shame that that didn't happen -- and not just because it meant my weekend was spent parsing PDFs. Releasing properly structured data ultimately allows everyone to do better work in less time. It's unfortunate that the authorities in Alaska introduced such a substantial and unnecessary roadblock.
But we at Sunlight can at least share what we've done to improve the situation. If you're interested in running your own analysis, you can find our code here, and the data to power it here (12M). At the moment it's in the form of a Django project -- if you need it in a different format, don't hesitate to ask on our mailing list. If you do something neat with it, please tell us!
Announcing Sarah’s Inbox
Today the Sunlight Foundation is proud to unveil Sarah's Inbox, our attempt to make Sarah Palin's recently released email records easier to use with a searchable function and an interface similar to Gmail.
The Palin Emails and Redaction Technology
Today's release of the Palin emails is prompting frustration among reporters, environmentalists and people who know how to use computers over the fact that the documents are being delivered in the form of a huge, $700+ stack of paper. As Luigi pointed out on Twitter, this decision is being attributed to the difficulty of performing redaction properly within an all-digital system.
This is something I've written about here before. Redaction mistakes do happen -- the brilliant Tim Lee recently released some interesting work showing how to quantify just how often -- but doing it properly isn't rocket science. Digital workflows save time, money and material resources; and in cases like this one, they make it easier for the press to do its job. In other cases, like the one facing the PIDB, there's simply no choice: they'll never overcome the backlog they face without the help of information technology. It's long past time for government to get over its skittishness about digital redaction.
UPDATE: Be sure to check out the comments to this post. Jeremy Ashkenas -- who has personally had to haul the Palin emails, in paper form, across Juneau -- points out that the redaction workflow in this case does appear to have been digital... up to a point. The output, though, was thoroughly analog. If it's not one thing it's another...
Top 25 Viewed Pages in Elena’s Inbox
The most interesting pages to Elena's Inbox visitors (judged by most viewed and tweeted) are quite telling. Given the diversity of incoming links and the numbers of views over such a short period, these numbers can provide some nice insight into what the public is curious about with Obama's latest Supreme Court nominee.
Kagan’s constitutional thoughts on abortion
Even though Elena Kagan, President Obama's Supreme Court nominee, was just learning to e-mail, she had no trouble expressing rather sophisticated reasoning and legal thought on the constitutional aspects of partial-birth abortion.
In an e-mail exchange with another adviser in the Clinton Administration, Kagan writes strategically about what should be included in a bill that would ban partial-birth abortion:
"It seems to me that the way to go is to recommend that the President reject the bill because (1) it does not include an exception for the health of the mother, and (2) (not stressed as much) because it ...
Kagan central to Clinton campaign finance reform efforts
Elena Kagan, President Barack Obama's nominee for the Supreme Court, was an active player in the Clinton Administration's efforts on campaign finance reform, a quick search of her emails--easily searchable and available here, thanks to Sunlight Labs--shows. (Click here to see a list of all emails that crossed her desk mentioning the term.)
Campaign finance reform was one of two ideas she gave to her boss, White House Counsel Abner Mikva, as a topic that would keep her "amused," and make "good use" of her.
After she started work at the White House in 1995 she wrote in ...
Elena’s Inbox: How Not to Release Data
On Friday @BobBrigham tweeted a suggestion: put the just-released Elena Kagan email dump into a GMail-style interface. I thought this was a pretty cool idea, so I started hacking away at it over the weekend. You can see the finished results at elenasinbox.com.
I'm really pleased that people have found the site useful and interesting, but the truth is that a lot of the emails in the system are garbage: they're badly-formatted, duplicative or missing information. For instance, one of the most-visited pages on the site is the thread with the subject "Two G-rated Jewish jokes" -- understandably, given that it's the most potentially-scandalous-sounding subject line on the first page of results. Unfortunately, if you click through you'll see that there's no content in the messages.
The site was admittedly a bit rushed, but in this case it isn't the code that's to blame. If you go through the source PDF, you'll see that the content is missing there, too. It looks like it might have been redacted, but the format of the document is confusing enough that it's difficult to be sure.
But the source documents' problems go beyond ambiguous formatting. A lot of the junky content on the site comes from the junk it was built from -- there's not much we can do about it. To give you some idea of the problem, consider these strings: