If you are a regular Sunlight Foundation blog reader, Twitter follower or staff member’s loved one, you are probably sick of hearing about unique identifiers. We’re sorry about that.
But there’s really no getting around it. Almost any data analysis task requires knowing whether two records refer to the same thing. When working with government datasets, we often have to do this based on the names that have been entered into text fields. Sometimes this is relatively easy: “COCA-COLA CORP.” and “COCACOLA CORPORATION” are clearly the same entity. But which college does “USC” refer to? In some cases it gets even trickier: Does “MCDONALDS” refer to the nationwide brand, the central corporation or a specific regional franchise corporation?
The process of answering these questions and linking records together is called entity resolution, and it’s not always easy or even possible. It is a ubiquitous challenge, however, affecting our work on campaign finance data, lobbying data and White House visitor logs, to name just a few datasets.
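To make the easy and hard cases above concrete, here is a minimal Python sketch of name normalization. The abbreviation table and the `normalize` function are illustrative assumptions, not a description of any tool Sunlight uses.

```python
import re

# Hypothetical abbreviation table, for illustration only; a real one would be far longer.
ABBREVIATIONS = {"CORP": "CORPORATION", "INC": "INCORPORATED", "CO": "COMPANY"}


def normalize(name: str) -> str:
    """Uppercase the name, expand known abbreviations, and drop punctuation and spaces."""
    tokens = re.findall(r"[A-Z0-9]+", name.upper())
    expanded = [ABBREVIATIONS.get(token, token) for token in tokens]
    return "".join(expanded)


# The easy case: both spellings collapse to the same key.
easy_match = normalize("COCA-COLA CORP.") == normalize("COCACOLA CORPORATION")

# The hard case: "USC" matches neither full name, and nothing in the
# string itself says which school was meant.
usc_key = normalize("USC")
candidates = [
    normalize("UNIVERSITY OF SOUTHERN CALIFORNIA"),
    normalize("UNIVERSITY OF SOUTH CAROLINA"),
]
hard_match = usc_key in candidates
```

Cleaning up punctuation and abbreviations handles the Coca-Cola case, but no amount of string cleanup can recover information that was never entered.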
Matching records based on names can never be made perfect. There just isn’t enough information present in the name. A better solution is to use ID numbers instead of names. This makes it possible to ignore the name entirely — similar or ambiguous names cease to be a problem.
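A small Python sketch shows the difference. The records, the `entity_id` field and its values are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical records: the IDs, names and amounts are invented for this example.
RECORDS = [
    {"entity_id": "E-001", "name": "COCA-COLA CORP.", "amount": 500},
    {"entity_id": "E-001", "name": "COCACOLA CORPORATION", "amount": 250},
    {"entity_id": "E-002", "name": "USC", "amount": 100},
    {"entity_id": "E-003", "name": "USC", "amount": 75},
]


def total_by(records, key):
    """Sum the amount field, grouping records by the given key."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += record["amount"]
    return dict(totals)


# Grouping by name splits one company in two and merges two different schools.
by_name = total_by(RECORDS, "name")

# Grouping by ID gets both cases right without looking at the name at all.
by_id = total_by(RECORDS, "entity_id")
```

Grouped by name, the two Coca-Cola spellings are counted separately and the two "USC" rows are lumped together. Grouped by ID, neither mistake happens.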
Unfortunately, using good identifiers is often not an option, for several reasons:
- The bureaucratic or procedural overhead of creating or looking up an ID number might be too burdensome.
- The system collects a good identifier, but releasing it might raise privacy or other concerns. The White House visitor log system collects Social Security numbers, for example, but doesn’t release them.
- The existing system wasn’t designed with ID collection in mind, and changing it would be difficult for legal or procedural reasons like the Paperwork Reduction Act.
These problems all deserve to be taken seriously. But there are some simple technical steps that can be used to alleviate them. By using these steps more often, I think individual government officials and vendors could make their data better.
Let’s start with a simple one: Government should be using more autocomplete forms.
Most of the time, autocomplete fields feel like a pleasant but inessential UI frill — a very minor time-saver that we could easily do without.
It seems that way because browser UI only shows the text of the name that’s being completed. But autocomplete fields can fill in more than that. As you type, your browser is querying a database for entities that are appropriate answers to the field in question — typically, entries that have already been stored in the database. But it’s not just the name that comes back. Identifiers can be silently recorded, too, providing concrete, unambiguous links between identical entities.
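Here is a rough Python sketch of the server side of that exchange, assuming a small in-memory list of known entities. The `Entity` class, the IDs and the function names are hypothetical; a real form would query its own database and use whatever autocomplete library its web framework provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    entity_id: str
    name: str


# Hypothetical existing entries; in practice these would live in the form's database.
ENTITIES = [
    Entity("E-001", "Coca-Cola Corporation"),
    Entity("E-002", "University of South Carolina"),
    Entity("E-003", "University of Southern California"),
]


def suggest(prefix: str, limit: int = 5) -> list:
    """Return suggestions for the typed prefix; each carries an ID as well as a label."""
    needle = prefix.strip().casefold()
    if not needle:
        return []
    matches = [e for e in ENTITIES if e.name.casefold().startswith(needle)]
    return [{"id": e.entity_id, "label": e.name} for e in matches[:limit]]


def record_submission(typed_text: str, selected_id: Optional[str] = None) -> dict:
    """Store the ID when the user picked a known suggestion; otherwise keep only the text."""
    known_ids = {e.entity_id for e in ENTITIES}
    entity_id = selected_id if selected_id in known_ids else None
    return {"name_text": typed_text, "entity_id": entity_id}
```

The user only ever sees the label. The ID rides along with the selection and is saved next to the typed text, so a free-text entry that matches no suggestion is still accepted, just without a link.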
This doesn’t eliminate ambiguity, particularly if the existing entries in the database are themselves ambiguous. But it does nudge users toward using an existing record, if appropriate. It injects a few milliseconds’ worth of expert human judgment into the data entry process once per record. The alternative is often to find the resources for the many hours of nonexpert human judgment needed to clean the data every time it’s used.
Better still, implementing autocomplete is usually pretty simple. Many web framework add-on libraries offer it, and it generally takes between a few minutes and a couple of hours of dev time to implement. Autocomplete fields can leak sensitive information, so some thought has to be given to your application’s security model. But it’s not rocket science.
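As one illustration of the kind of thought involved, the Python sketch below applies three common precautions: require a minimum prefix length, cap the number of results, and return only fields that are safe to show. The thresholds, field names and sample rows are assumptions made for the example, and the right rules will depend on the application.

```python
# Hypothetical settings; the right values depend on the application.
MIN_PREFIX_LENGTH = 3
MAX_RESULTS = 5
PUBLIC_FIELDS = ("id", "name")

# Invented rows with an obviously fake sensitive field.
ROWS = [
    {"id": "V-1", "name": "Jane Example", "ssn": "000-00-0000"},
    {"id": "V-2", "name": "Janet Sample", "ssn": "000-00-0000"},
    {"id": "V-3", "name": "John Placeholder", "ssn": "000-00-0000"},
]


def safe_suggest(prefix: str, rows=ROWS) -> list:
    """Return autocomplete suggestions without exposing non-public fields."""
    needle = prefix.strip().casefold()
    # Very short prefixes would let a caller page through the whole table.
    if len(needle) < MIN_PREFIX_LENGTH:
        return []
    hits = [row for row in rows if row["name"].casefold().startswith(needle)]
    # Copy only the public fields, so sensitive columns never leave the server.
    return [{field: row[field] for field in PUBLIC_FIELDS} for row in hits[:MAX_RESULTS]]
```

Calling `safe_suggest("jan")` returns the two matching names with their IDs and nothing else, while `safe_suggest("j")` returns an empty list.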
So there’s one simple, concrete way to improve entity resolution in government datasets. If you’re in charge of disclosure forms and would like to talk about it more, we’d love to hear from you.