Exploring strategies for cleaning messy data

Data, data, everywhere, and all the nerds did think; data, data, everywhere, yet nothing with which to link. (Photo via John Crowe/Flickr)

Thanks to the open government community’s efforts over the past few decades, a [large](http://data.influenceexplorer.com/#) and [hard-won](http://sunlightfoundation.com/blog/2015/02/09/a-big-win-for-open-government-sunlight-gets-us-to-release-indexes-of-federal-data/) body of [government datasets](http://openstates.org/downloads/) has been collected and [made publicly available](https://github.com/unitedstates). It’s inspiring to look up from the day-to-day grind of opening up government data to see how much progress has already been made. Now, though, we must bear the burden of our collective success and recognize that we’ve created an unruly menagerie of data sources with many related, but unrelatable, datasets.

At Sunlight, this means that [we’re consolidating many of the related but separate projects that have sprung up over the years](http://sunlightfoundation.com/blog/2015/05/04/were-making-it-easier-to-access-and-use-sunlights-data/). We’re applying all that we’ve learned from [the dozens of projects we’ve done](http://sunlightfoundation.com/tools/) to provide a unified experience. The public should not have to search a dozen different databases to find the information they seek. Just as no man is an island, information cannot have meaning outside the context of its collection and environment. We aim to provide fast, easy and meaningful context to government affairs.

Over the past year, we’ve been working on taming this messy data by testing and validating [new ways of moving](http://kafka.apache.org/) and [representing data](http://opencivicdata.org/). As we’ve figured out how to effectively consolidate our data, we’ve found ourselves facing the same problem time and time again. It’s a basic issue that runs deep, seemingly without any easy fix: The datasets we collect don’t have reliable identifiers associated with each person or organization mentioned in the data. There is nothing equivalent to a Social Security number that allows data collectors to reference the same entity across datasets (or even consistently within the same dataset).
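To make the problem concrete, here’s a toy sketch in Python. The records and names are made up, and this isn’t any actual Sunlight dataset; it just shows how two records describing the same organization simply refuse to join when there’s no shared identifier:

```python
# Hypothetical records: the same committee appears in two datasets with
# no shared identifier and slightly different spellings of its name.
contributions = [
    {"recipient": "Smith for Senate", "amount": 5000},
    {"recipient": "SMITH FOR SENATE INC.", "amount": 2500},
]
filings = [
    {"committee_name": "Smith for Senate, Inc.", "state": "CA"},
]

# A naive exact-match join finds nothing, even though every record above
# refers to the same committee.
matches = [
    (c, f)
    for c in contributions
    for f in filings
    if c["recipient"] == f["committee_name"]
]
print(matches)  # [] -- no links, despite the records describing one entity
```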

### Bootstrapping authority

We must act as curators, creating reliable identifiers ourselves, deciding which identifiers each piece of data should get and managing those identifiers in the face of changes to the content and format of incoming data. We’re forced to move beyond finding, liberating and publishing data. We must use all the data we have to provide context for every piece of data we have. There is no authority on the data as a whole, so we’re forced to rely on ourselves and bootstrap the process from scratch.
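One way to picture that curation step is as a registry that mints our own identifiers and maps every raw spelling of a name onto them. This is purely a sketch with hypothetical names, not a description of any finished Sunlight system:

```python
import uuid

# Toy registry: once we decide a group of raw names refers to one entity,
# we mint a single identifier of our own and point every raw name at it.
canonical_ids = {}

def register(raw_names):
    """Assign one new identifier to names we've judged to be the same entity."""
    entity_id = str(uuid.uuid4())
    for name in raw_names:
        canonical_ids[name] = entity_id
    return entity_id

register(["Smith for Senate", "SMITH FOR SENATE INC.", "Smith for Senate, Inc."])

# Later lookups from any dataset now resolve to the same identifier.
print(canonical_ids["Smith for Senate"] == canonical_ids["Smith for Senate, Inc."])  # True
```

The hard part, of course, is everything this sketch glosses over: deciding which names belong together, and keeping the registry correct as new and changed data arrives.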

Thankfully, we are not the only ones who’ve had problems such as these. As long as there have been databases, there have been database integrity problems. As we started Googling around, we ran across field after field, specialization after specialization, tool after tool that seek to redress every variation of the above problem we could imagine. [Entity resolution](http://www.amazon.com/Entity-Resolution-Information-Quality-Talburt/dp/0123819725), [record linkage](http://en.wikipedia.org/wiki/Record_linkage), [householding](http://analytics.ncsu.edu/sesug/1999/085.pdf) and many other academic fields were all created to address this issue. Background checks, counterterrorism efforts and fraud analysis all depend on these techniques to find the important data hiding in the mountains of messy data. The U.S. Census Bureau has been using advanced statistical techniques for decades to make sense of the data it collects. In short, as we researched these issues, we found ourselves in interesting, varied and, frankly, unexpected company.
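As a rough illustration of what a record linkage technique does, here is a minimal sketch using Python’s standard-library `difflib`. It’s only a toy scoring function with a hand-picked threshold, not a tool or threshold we’ve settled on:

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and strip punctuation so trivial formatting differences don't matter."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())

def similarity(a, b):
    """Score how alike two name strings are, between 0 and 1."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# A threshold turns similarity scores into links between records.
pairs = [
    ("Smith for Senate", "SMITH FOR SENATE INC."),
    ("Smith for Senate", "Jones for Congress"),
]
for a, b in pairs:
    score = similarity(a, b)
    verdict = "link" if score > 0.8 else "no link"
    print(f"{a!r} vs {b!r}: {score:.2f} -> {verdict}")
```

Real record linkage systems go far beyond this, with blocking to avoid comparing every pair, trained models instead of a fixed threshold, and human review of uncertain matches; the sketch just shows the basic move of scoring candidate pairs instead of demanding exact equality.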

### What’s next

Although it will still be several months before we can point to projects where we use these techniques, we’ve run across enough interesting ideas, projects and efforts that we feel compelled to share some of the things we’ve found. From talking with others in the open government community, we know that others have felt our pain and are looking for their own solutions. Our solution surely won’t be the same as everyone else’s, but all of these solutions will likely share some common traits.

Over the summer, we’ll be blogging about the research, companies and problems we’ve found especially interesting in our entity resolution work. The issues are necessarily technical, but we aim to keep the explanations from being overly technical. We aim to build a lighthouse of ideas for others trapped in the confusing fog of messy data. No one should have to navigate the stormy seas of government data alone — and we hope that these posts will help you find your way to wherever you are headed.