On its surface, name standardization would appear to be a primarily aesthetic problem (no pun intended). People's names can be listed "last, first" or "first last". Simple, right? Not exactly. When you're naming different kinds of things (people vs. organizations, for instance) and dealing with different ordering, capitalization styles, honorifics, suffixes, metadata or other additional info embedded in names (e.g. political party signifiers, company departments or locations), or just general cruft and typos, name standardization becomes a thorny problem. Add to that the fact that many datasets have no universal identifiers for people or companies, that names rarely (if ever) come split into their constituent parts, and that we are often expected to link data via little more than a name string, and you can see how relevant the issue is to the world of open government data.
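To make the problem concrete, here is a minimal Python sketch of what standardizing a person's name might involve. It is only an illustration under stated assumptions, not the approach the post goes on to describe: the function name `standardize_person_name`, the honorific and suffix lists, and the output keys are all hypothetical, and the sketch ignores organizations and typos entirely.

```python
import re

# Hypothetical, deliberately incomplete lookup tables; real data needs many more entries.
HONORIFICS = {"mr", "mrs", "ms", "dr", "hon", "rep", "sen"}
SUFFIXES = {"jr": "Jr.", "sr": "Sr.", "ii": "II", "iii": "III", "iv": "IV"}


def _clean(token):
    """Lowercase a token and strip surrounding periods and commas for lookup."""
    return token.lower().strip(".,")


def standardize_person_name(raw):
    """Split a raw name string into first, middle, last and suffix parts."""
    # Drop parenthetical metadata such as a party signifier, e.g. "(D-NY)".
    text = re.sub(r"\([^)]*\)", " ", raw)
    text = " ".join(text.split())

    # Turn "Last, First" into "First Last"; leave a trailing ", Jr." in place.
    parts = [p.strip() for p in text.split(",") if p.strip()]
    if len(parts) >= 2 and _clean(parts[1]) not in SUFFIXES:
        parts = [parts[1], parts[0]] + parts[2:]
    tokens = " ".join(parts).split()

    # Strip leading honorifics ("Hon.", "Dr.", ...).
    while tokens and _clean(tokens[0]) in HONORIFICS:
        tokens.pop(0)

    # Pull out the first recognized suffix, wherever it ended up.
    suffix = ""
    remaining = []
    for token in tokens:
        if _clean(token) in SUFFIXES and not suffix:
            suffix = SUFFIXES[_clean(token)]
        else:
            remaining.append(token)

    # Naive capitalization: this mangles names like "McDonald" or "O'Brien".
    names = [t.strip(",").capitalize() for t in remaining]
    return {
        "first": names[0] if names else "",
        "middle": " ".join(names[1:-1]),
        "last": names[-1] if len(names) > 1 else "",
        "suffix": suffix,
    }
```

Even this toy version has to guess: it assumes a single comma means "last, first", treats the final token as the surname, and cannot tell a person from a company. Those guesses are exactly where real-world name data breaks down.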
Elena’s Inbox: How Not to Release Data
On Friday @BobBrigham tweeted a suggestion: put the just-released Elena Kagan email dump into a Gmail-style interface. I thought this was a pretty cool idea, so I started hacking away at it over the weekend. You can see the finished results at elenasinbox.com.
I'm really pleased that people have found the site useful and interesting, but the truth is that a lot of the emails in the system are garbage: they're badly formatted, duplicative or missing information. For instance, one of the most-visited pages on the site is the thread with the subject "Two G-rated Jewish jokes" -- understandably, given that it's the most potentially-scandalous-sounding subject line on the first page of results. Unfortunately, if you click through you'll see that there's no content in the messages.
The site was admittedly a bit rushed, but in this case it isn't the code that's to blame. If you go through the source PDF, you'll see that the content is missing there, too. It looks like it might have been redacted, but the format of the document is confusing enough that it's difficult to be sure.
But the source documents' problems go beyond ambiguous formatting. A lot of the junky content on the site comes from the junk it was built from -- there's not much we can do about it. To give you some idea of the problem, consider these strings: