Why is cleaning messy data so hard?

What’s the best way to clean up information when it comes in many shapes and sizes? We aim to find out. (Photo credit: Taichiro Ueki/Flickr)

After over a year of research and work, constraints have emerged that shape the systems we are building to clean our messy data. This post is about these constraints along with design strategies for satisfying them.

No matching oracle

We’ve yet to find a general method that lets us know with absolute certainty that two records represent the same entity. Even if we copped out and hired an army of experts to clean up all our data by hand, we would still have to deal with the occasional mistake. Humans provide the most reliable resolutions, but it’s silly to think someone would always make the correct choice. With this upper bound, any matching method we use is most likely going to make mistakes eventually. Hopefully, we will find these mistakes over time and need to fix them. So, any decision we make about the relationships between data needs to be correctable.

Moreover, we’ve found that, depending on the dataset, different methods have different efficacy. For example, the Senate has semireliable identifiers for lobbying firms but not lobbying clients. We’ve also explored deterministic parsing and found that it produces good results for some types of data. And, as always, we will have humans going in to clean up the messiest data and the mistakes that other methods make (though hopefully not too often). So, we would like to have flexibility in the kinds of ways we can match records together.

Datasets have different data formats

One of our first projects involves combining datasets detailing lobbying, campaign finance and legislative affairs. Each of the datasets involved are collected by different organizations with different purposes. Even beyond high-level file format differences like XML versus CSV versus HTML, there are differences in how the data is described. Fields and attributes like “Name,” “name (first),” “first_name” and “FirstName” all have similar but not quite the same meanings. These mismatches are a hard constraint that isn’t easy to work around. The industry has yet to find a solid way of mitigating the data format mismatch beyond engineering effort. There is some promising work going on about inferring schema mappings, but it remains unproven.

Different subsystems have different performance requirements

Our task breaks up into roughly four pieces: collection, resolution, analysis and publishing. Collection typically means web scraping of the data we need from various government websites. Resolution involves using rules, statistics and human efforts to match the all the collected data together. Analysis takes the results of resolution and creates easily understandable, high-level overviews of the data. Publishing means putting the results from the previous stages into a database for the purpose of being the backend for a website. We’ve found that the requirements for each of the stages all vary significantly, to the point that colocating and combining all the systems into one process or piece of software doesn’t work. We wish it weren’t so, but a single database can’t do everything we need for our goals. And yet, coordinating data between multiple databases and systems is a hard engineering problem.

Varying data quality levels

Refreshingly, there are some datasets that are well curated and managed. They have reliable identifiers and seldom contain mistakes. The challenge here is that most datasets are not as nice to work with as these. We’d like to ensure that the quality of the nice datasets does not get dragged down into the quagmire that is the average dataset we work with. As I mentioned before, we have a project involving legislative affairs data and campaign finance data: Congressional legislative data is fairly good — in particular the bioguide identifier for current and former lawmakers is well maintained and reliable — but campaign finance data is some of the messiest that any of us have ever seen. Our task is to combine these two datasets while preserving the quality of the good datasets. As we make resolution decisions, we’d like to ensure that the identifiers assigned to good data don’t change often. If we can do this, developers who only need to use the good data can provide reliable links from their websites to other websites without having to be downstream of the resolution changes. We hope we can create a barrier to restrict the flow of insanity between those working with the well-ordered data and the poor souls laboring away on the messy data.

Things change

Last but certainly not least, the data is always changing. Datasets are constantly being added to, updated and corrected. New rows get tacked on every day, and formats are prone to changes every few years. Therefore, we need to work with a continual outpouring of data, not just static sets.

Designing within constraints

Now that we’ve laid out the constraint, we can explore how to build a system within these bounds. Here are some of the techniques we’re trying out and why we think they are useful.

Linking and blending, not merging

Since we cannot be sure of our decisions and because the underlying data changes, we’re avoiding doing destructive merges of our data as long as possible. Instead of combining records A and B into a new record C, we’re creating a link that signifies that the two records are the same. If we need to correct our decision later, we couldn’t recover A and B from record C, but we could just remove the link. For consumers of the data downstream, they can consume the clusters of records and blend them into the format they need.

Tagged records

Things change, and our system should reflect the changes in the underlying data. We tag incoming data with unique, reproducible identifiers that never change. So, when we observe a change in the source data, we can update the corresponding piece of data downstream. Note that this in contrast to creating an entirely new record out of the update and having the old and new versions coexisting in the same system downstream.


Perhaps the most risky design we are working with is using Kafka and Samza to move data around. Kafka is a persistent message queue that uses the abstraction of streams of data. Samza can be used to consume streams in order to produce new streams of data. One of the next posts in the series will be about how we are using these technologies. For the moment though, we can say that we are excited by how easy it looks to have many disparate producers and various types of consumers of the various streams of data. We think that by betting hard on streaming, we can build a system that easily and quickly reacts to any change in the data or how we choose to create our understanding of the data.

Black box matching

On that note, we are viewing the matching algorithms as a black box of sorts: Streams of messy data go in and streams of resolutions come out. By using streaming, we can unify all methods with a common interface. A streaming interface means that we can bring up new algorithms and test them on all the old data, while at the same time doing incremental resolution based on new incoming data — without having to build multiple systems or interfaces to deal with either scenario. Moreover, anything can happen inside this black box at any time scale. If we want to have simple, fast deterministic algorithms going that send out updates within a second of receiving a message, we can. If we want to have long running statistical calculations going, then we can ingest the data, compute for a few hours and emit events later on. If we want to have humans in the loop, we can ingest the data into a simple app and then send out new events as the person makes decisions within the app at their leisure. We’re fairly certain that each method can be based off the same types of input and create similar kinds of output and we hope that this will make our life as easy as we hoped.

Open Civic Data as lingua franca

The mismatching format problem is the major hurdle and where the bulk of the work will be invested, as measured by man-hours. We’ve been developing Open Civic Data for a while now and are working to expand OCD to cover all the cases we care about. The hope is that by writing a converter script from a dataset’s given format to Open Civic Data, a person can plug their data into our system without having to know about all the decisions we made downstream while still getting the benefits of entity resolution. This will take a fair amount of engineering effort, but the work isn’t necessarily hard and offers a straightforward path to data standardization.

What’s next?

We’ve laid out the major constraints we’ve encountered and proposed some new designs. This post was kept at a high level to keep from diving too deep too quickly into the technical weeds, but the gist of the ideas should be clear. By streaming Open Civic Data information into and out of a system that manipulates links between tagged records in a flexible manner, we think we can build an entity resolution system that suits our purposes. We’ve built about half of this infrastructure out already and will be sharing examples of how it works soon!