Name Standardization: Problems and a Solution
Name standardization, on its surface, would appear to be a primarily aesthetic problem (no pun intended). People’s names can be listed “last, first” or “first last”. Simple, right? Not exactly. When you’re naming different things (people vs. organizations, for instance) and dealing with different ordering, capitalization styles, honorifics, suffixes, metadata or other additional info embedded in names (e.g. political party signifiers, company departments or locations), or just general cruft and typos, name standardization is a thorny problem. Add to that the fact that many datasets have no universal identifiers for people or companies, names rarely (if ever) come split into their constituent parts, and we are often expected to link data via little more than a name string, and you can see how relevant the issue is to the world of open government data.
Name Standardization in Practice: Influence Explorer
Influence Explorer unites numerous datasets on political influence from many different sources. What all these datasets have in common is names, lots of names, and it’s our job to figure out which names refer to the same entity and then display them in a reasonably consistent format. Even when we do have shared IDs we can use to link names between datasets, such as those from the Congressional BioGuide or CRP, we still have to form a canonical version of each name to display, and do it in an automated fashion, since we don’t fancy the idea of manually adjusting the 164,000 entity names currently residing in our database.
Different Names, Different Problems
The challenges associated with each name differ depending on whether you’re talking about a politician, individual or organization (the three main entity types embodied in Influence Explorer).
Politicians and individuals have the same problems in theory, since they’re all people, but in practice they tend to present differently. Politicians’ names are more carefully vetted and standardized, yet pesky metadata, such as state and party, often comes along for the ride. (See example below.)
"Wyden, Ron (D-OR)" => "Ron Wyden"
Individual names, on the other hand, come with more baggage: honorifics (e.g. Mr./Mrs.), suffixes (e.g. Jr./Sr./III) and nicknames. For example, below are a couple of raw individual names we received for prominent donors in a recent election cycle, and what we convert them to.
"ROTHSCHILD 212, STANFORD Z MR" => "Stanford Z Rothschild"
"Baird, Frederick A 'Tripp' III" => "Frederick A 'Tripp' Baird III"
Even capitalization is tricky. Python strings have a built-in “title()” method, which gets us part of the way, but some names require special heuristics to capitalize properly, such as Scottish and Irish surnames beginning with “Mc”.
"Milton Elmer 'Mac' McCullough, Jr (3)" => "Milton Elmer McCullough Jr"
While (Western-style) person names can be messy, they are relatively easily quantified and fixed, but organization names, as our recently released Six Degrees of Corporations project demonstrates, are another beast. The names of companies and institutions frequently contain words which can be abbreviated, or phrases which are junk as far as a canonical name is concerned, and a single name could have several variations in punctuation alone. For example:
- Merck & Co., Inc. / Merck & Company Incorporated
- Health Net Inc / Health Net, Inc.
- Massachusetts Inst. of Technology / Massachusetts Institute of Technology
- F. HOFFMANN-LA ROCHE LTD and its Affiliates
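One way to see why these variations matter for matching is to reduce each name to a comparison key. The sketch below is an assumption-laden illustration, not Name Cleaver's code: `ABBREVIATIONS`, `JUNK_PHRASES` and `match_key` are hypothetical, and the tables contain only enough entries to cover the examples above.

```python
import re

# Illustrative tables only; a real list would be much longer.
ABBREVIATIONS = {"inst": "institute", "co": "company", "inc": "incorporated"}
JUNK_PHRASES = [r"\band its affiliates\b"]


def match_key(name):
    """Reduce an organization name to a lowercase key for comparison."""
    key = name.lower()
    for phrase in JUNK_PHRASES:
        key = re.sub(phrase, "", key)       # strip phrases that carry no identity
    key = key.replace("&", " and ")
    key = re.sub(r"[^\w\s-]", "", key)      # drop punctuation but keep hyphens
    words = [ABBREVIATIONS.get(word, word) for word in key.split()]
    return " ".join(words)


print(match_key("Merck & Co., Inc."))
# merck and company incorporated
print(match_key("Merck & Company Incorporated"))
# merck and company incorporated
print(match_key("F. HOFFMANN-LA ROCHE LTD and its Affiliates"))
# f hoffmann-la roche ltd
```

Expanding abbreviations, rather than contracting full words, avoids having to decide which of several abbreviations is the canonical one.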
While we initially used a series of regular expressions to transform names into the desired format, we quickly realized that we needed more. We needed to:
- have names split into their constituent parts, and
- be able to reuse the code across repositories.
To address these needs, we designed an object-oriented library to replace our ad hoc regex text transformation approach, and called it Name Cleaver.[1]
Name Cleaver has been around for about a year, but only recently, as of version 0.2.1, has it replaced all of our ad hoc standardization code. It also now supports all three major name types (politicians, individuals and organizations), with a specific class and special features for each. The PoliticianNameCleaver class has specific methods to deal with and store metadata about a politician (party and state), and also to deal with the names of running mates (Governors and Lieutenant Governors are billed together in many states). The OrganizationNameCleaver class has methods to reduce a name to only its “kernel”, and also to expand all abbreviations (that Name Cleaver knows of), which is useful for matching tasks.
How to Use
It’s easy to slice and dice a name: import the appropriate NameCleaver class, instantiate it with a name, and call parse():
from name_cleaver import PoliticianNameCleaver
mcdonnell = PoliticianNameCleaver('McDonnell, Robert M (Bob)').parse()
The __str__ method on a Name object defines how it will be displayed, so displaying a name is as simple as this:
str(mcdonnell) => 'Robert M McDonnell'
How to Contribute
All of this name standardization is anything but standard. Much of it depends heavily on the inputs. The bigger the set of inputs, the smarter Name Cleaver’s heuristics will need to get. We’d love your contributions to help make Name Cleaver more robust and better at standardizing the names we find in political influence data. You can find (and fork) it on GitHub.
[1] Other food-themed libraries at Sunlight (by no means a comprehensive list): brisket, saucebrush, Chutney, oxtail.