Earlier this year we started on the Data Commons, a project to merge open government data sets to make them more searchable and usable. Our goal for the initial release is to load state and federal campaign contribution data from The Center for Responsive Politics and The National Institute for Money in State Politics. Along with the raw transactional records, we will be taking the additional step of matching the entities (people, organizations, corporations, etc.) across the data sets. We'll have more posts soon with details about the Data Commons.
To assist us in this effort, we are developing Matchbox, a toolkit for the merging and matching of entities. We have big plans for Matchbox, but we want feedback from the community as we improve it over the next few months.
So what is Matchbox and what does it do?
Looking at the various political data we have available, almost all of it can be represented as transactional records between entities:
- A gave $2000 to X's campaign
- A paid lobbyist B to meet with X
- X requested a $1 million earmark for A
- Agency D awarded a contract to A for $6 million
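To make the "transactions between entities" idea concrete, here is a minimal Python sketch of how the four examples above could be held in one record type. The `Transaction` class, its field names, and the `kind` labels are made up for illustration; they are not Matchbox's actual data model.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    """One transactional record between entities (hypothetical shape)."""
    source: str                        # entity that initiates the transaction
    target: str                        # entity on the receiving end
    kind: str                          # illustrative label, not a real schema value
    amount: Optional[Decimal] = None   # dollar amount, when one is known
    via: Optional[str] = None          # intermediary, such as a lobbyist


# The four examples from the list above, using the same placeholder names.
transactions = [
    Transaction("A", "X", "contribution", Decimal("2000")),
    Transaction("A", "X", "lobbying", via="B"),
    Transaction("X", "A", "earmark_request", Decimal("1000000")),
    Transaction("Agency D", "A", "contract", Decimal("6000000")),
]


def transactions_involving(entity, records):
    """Return every record in which the entity takes part in any role."""
    return [t for t in records if entity in (t.source, t.target, t.via)]


for t in transactions_involving("A", transactions):
    print(t.kind, t.source, "->", t.target)
```

Once every data set is expressed this way, finding everything A was involved in is a simple filter. The hard part is knowing that "A" really is the same entity in each record.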
The biggest challenge is not reconciling the transactions, but matching the transaction participants across the data. Each data set has its own representation for entities; they usually have different IDs and different names. We need a way to look at two data sets and decide that entity A from the CRP data is the same organization as entity Z from NIMSP data. Additionally, we need to keep track of any attributes, such as the original CRP and NIMSP IDs, that each entity contains.
Matchbox allows us to load and store entities from each data set. We can then merge records that are deemed to represent the same entity. You can interact with Matchbox using the included Python module or by calling the basic API over HTTP. For now the API is meant for internal use rather than as a public-facing service, though it will be expanded in the future.
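As a rough illustration of what merging two records while keeping their original attributes might look like, here is a small sketch. It does not use the Matchbox module or API; the `Entity` class, the `merge` function, and the example names and IDs are all assumptions invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A single entity record (hypothetical shape, not Matchbox's model)."""
    name: str
    aliases: set = field(default_factory=set)
    # Maps a data set name to the IDs this entity carried in that data set.
    source_ids: dict = field(default_factory=dict)


def merge(primary, duplicate):
    """Combine two records judged to be the same entity.

    The primary record's name wins, the other name is kept as an alias,
    and the original IDs from both data sets are preserved.
    """
    aliases = primary.aliases | duplicate.aliases | {duplicate.name}
    aliases.discard(primary.name)

    source_ids = {}
    for record in (primary, duplicate):
        for source, ids in record.source_ids.items():
            source_ids.setdefault(source, set()).update(ids)

    return Entity(primary.name, aliases, source_ids)


# Made-up records standing in for one organization in two data sets.
from_crp = Entity("Acme Corp", source_ids={"crp": {"crp-001"}})
from_nimsp = Entity("ACME Corporation", source_ids={"nimsp": {"nimsp-042"}})

merged = merge(from_crp, from_nimsp)
print(merged.name, sorted(merged.aliases), merged.source_ids)
```

Storing IDs as a set per data set means nothing is lost when two records from the same source turn out to be duplicates of each other.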
What are the plans for Matchbox?
Over the next few months we will be adding several features to Matchbox.
The newest member of the labs, Ethan, will be working to add text matching algorithms to automate the process of finding potential merge candidates. His initial focus will be on developing algorithms to match corporate names. We will be adding algorithms for other types of entities as well and hope that the community will contribute code for other entity types.
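To give a feel for what corporate name matching could involve, here is a minimal sketch that uses only Python's standard library. It is not the algorithm being developed for Matchbox: the suffix list, the 0.85 threshold, and the function names are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Common corporate suffixes that rarely help tell two companies apart.
# This list is illustrative and far from complete.
SUFFIXES = {"inc", "incorporated", "corp", "corporation",
            "co", "company", "llc", "ltd"}


def normalize(name):
    """Lowercase the name, strip punctuation, and drop corporate suffixes."""
    words = re.sub(r"[^a-z0-9\s]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)


def similarity(a, b):
    """Score two names from 0.0 to 1.0 after normalization."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        # Nothing left to compare, so do not report a match.
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def merge_candidates(name, others, threshold=0.85):
    """Return (other_name, score) pairs at or above the threshold, best first."""
    scored = [(other, similarity(name, other)) for other in others]
    matches = [pair for pair in scored if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


print(merge_candidates("Acme Corp.", ["ACME Corporation", "Globex Inc"]))
```

A score like this only suggests candidates. A person, or a stricter rule, still has to confirm that two names refer to the same organization before the records are merged.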
Web-based Administration Interface
The Python module and API help developers interact with the Data Commons, but there is not yet an easy way for the average analyst or journalist to assist in the standardization of entities. We are in the process of developing a web-based administration interface that will allow end users to manage the merging process from their browser. MAPLight.org has done a great job building their own internal interface for name standardization, and it is a big influence on the application we are building.
Importers and Exporters
While Matchbox is an integral part of the Data Commons, we realize that some organizations would find it useful as a tool alongside their existing data processes. We plan to add utilities to help import data from other sources (spreadsheets, relational databases) and export the match results in various formats.
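These utilities do not exist yet, so the following is only a sketch of the kind of round trip we have in mind: read entities from a spreadsheet exported as CSV, then write match results back out. The function names, the `id` and `name` column headers, and the output layout are all assumptions made for this example.

```python
import csv
import io
import json


def import_entities(csv_file, source):
    """Read entities from a CSV file that has 'id' and 'name' columns."""
    return [
        {"source": source, "id": row["id"], "name": row["name"]}
        for row in csv.DictReader(csv_file)
    ]


def export_matches_csv(matches, out_file):
    """Write matched pairs as rows of source and ID columns."""
    writer = csv.writer(out_file)
    writer.writerow(["source_a", "id_a", "source_b", "id_b"])
    for a, b in matches:
        writer.writerow([a["source"], a["id"], b["source"], b["id"]])


def export_matches_json(matches):
    """Return the same matched pairs as a JSON string."""
    return json.dumps([{"a": a, "b": b} for a, b in matches], indent=2)


# In-memory stand-ins for two spreadsheet exports, with made-up rows.
crp_rows = import_entities(io.StringIO("id,name\n1,Acme Corp\n"), "crp")
nimsp_rows = import_entities(io.StringIO("id,name\n9,ACME Corporation\n"), "nimsp")

# Pretend these two records were already judged to be the same entity.
matches = [(crp_rows[0], nimsp_rows[0])]

out = io.StringIO()
export_matches_csv(matches, out)
print(out.getvalue())
print(export_matches_json(matches))
```

Because the functions take file-like objects, the same code works with real files opened with `open()` as well as the in-memory buffers used here.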
We're currently using Matchbox to build the Data Commons and are excited about the features we'll be adding in the next few months. If you have any questions or would like to participate, come join our mailing list. Big thanks to James for getting this development version released.
Check out the source code on GitHub.