Elena’s Inbox: How Not to Release Data
On Friday @BobBrigham tweeted a suggestion: put the just-released Elena Kagan email dump into a GMail-style interface. I thought this was a pretty cool idea, so I started hacking away at it over the weekend. You can see the finished results at elenasinbox.com.
I’m really pleased that people have found the site useful and interesting, but the truth is that a lot of the emails in the system are garbage: they’re badly-formatted, duplicative or missing information. For instance, one of the most-visited pages on the site is the thread with the subject “Two G-rated Jewish jokes” — understandably, given that it’s the most potentially-scandalous-sounding subject line on the first page of results. Unfortunately, if you click through you’ll see that there’s no content in the messages.
The site was admittedly a bit rushed, but in this case it isn’t the code that’s to blame. If you go through the source PDF, you’ll see that the content is missing there, too. It looks like it might have been redacted, but the format of the document is confusing enough that it’s difficult to be sure.
But the source documents’ problems go beyond ambiguous formatting. A lot of the junky content on the site comes from the junk it was built from — there’s not much we can do about it. To give you some idea of the problem, consider these strings:
- http://172.28.127.30:8082/ARMS/servletigetEmaiIArchive?URL]ATH=/n1cp -2/Anns405/who/Who_19981…
- http://I 72.28.127.30:8082/ ARMS/servletigetEmaiIArchive?URL]ATH=/ncp-llArms405/who/…411012009
- hltp:lI172.28.l27 .30:8082/ARMS/servlet/getEmaiIArchive?URL]ATH=/nlcp-2/Arms405/who/Who _19981 …
URLs like this are present at the bottom of nearly every page in the document dump. It’s obvious that they’re being generated by some part of the record management system (in this case that system is named “ARMS”). In other words, robots wrote this stuff. So why is it so inconsistent? Why are capital “I”s substituted for numeric “1”s? And why are there so many similar errors in the documents — like “1998” showing up as “199S”? Why are there so many hex dumps of attachments that can’t be decoded?
I have a theory. I can’t prove it, but here’s what I think happened:
- White House officials send emails — digital records that get fed to ARMS.
- It comes time to release records. Someone makes ARMS print out its lovely digital files on paper, a profoundly analog format.
- Those printouts are shipped or faxed to the Clinton Library.
- Clinton Library staff scan the printouts and run them through optical character recognition (OCR) turning them back into digital documents — but ones with substantial problems, since OCR is a far-from-flawless process.
- The once-again-digital documents are then released as PDFs (just to add insult to injury).
When you’ve been staring at regular expressions for hours, it’s easy to start thinking that all of this must be a deliberate attempt to obfuscate the data — to hide it from the public. But that’s probably wrong. Let’s remember Hanlon’s razor: “Never attribute to malice that which can be adequately explained by stupidity.” And while I’m sure many of the people who worked on these documents are perfectly smart, the overall system is a profoundly stupid way to release data.
To be honest, I’m not sure what to say next. It’s not like there’s some secret open data invocation I can utter, after which the scales will fall from the relevant employees’ eyes. In fact, I’m sure the people who built these systems know better; the people using them probably know better, too. I guess all I can say is: get with it, guys. Do a better job. Because this kind of nonsense is flat-out ridiculous — and the longer these habits persist, the more outrageous they become.