The Physics of the Corporate Universe
Today we’re launching 6° of Corporations, a new micro-site that provides some insight into the complicated area of corporate identity. It may sound trivial, but uniquely identifying a corporate entity is not easy. For federal contracting data (like in USASpending.gov), DUNS numbers are used to (supposedly) uniquely identify a contractor. However, there are problems in not only how DUNS numbers are issued and maintained, but also with the agency’s use of DUNS numbers. To help illustrate this, we’ve created a visualization that shows the relationship between company names and company DUNS numbers in USASpending.gov.
The visualization takes a company name and finds all the DUNS numbers associated with that name in a transaction. Then, it takes the resulting DUNS numbers and searches for more names that are associated with them. If A = B and B = C then A = C, right? Not in this data! Checkout the site to learn why this is such a difficult (though not unsolveable!) problem. The rest of this post will detail how the tool works and how we built the tool.
The visualization repeats these two steps, alternately searching for related DUNS, then related names, then related DUNS again, searching until there are no more unique names or DUNS numbers. As the tool finds more names and DUNS numbers, it displays a graph representing the relationships between them. The graph is made up of circular nodes connected by thin lines, representing names and DUNS numbers, respectively.
The largest challenge in building this tool was determining how to draw the graph. There’s a long lineage of rock-solid tools like Graphviz that have solved this problem. Unfortunately the underlying algorithms were developed in an earlier era, mostly in the context of batch processing. They aren’t well suited for applications where you need to add nodes or edges to the graph while it’s being laid out.
One of the more mature web graphics technologies is Processing.js. In the Processing community there is a well-known particle physics library named for it’s creator Jeffery Traer. Many talented people, such as Natasha Harper and Katie Adee, have created nice visual representations of Traer-based particle systems, so we tried to build off of their success.
In technical terms, the tool generates a force-directed graph. Simply put, it runs a physics simulation in which the names are represented as particles and the DUNS numbers are represented by springs. To ensure that the particles are all spaced fairly evenly, there is a small repelling force between each particle and every other particle. Each time the simulation runs, it is calculating at least n2 forces, where n is the number of names found. In order to draw smooth animations we need to run the simulation 12 times per second. So, as a concrete example, when I search for the defense contractor General Dynamics, it will find 673 names before my computer slows down and the simulation turns itself off (your results will vary). For 673 names, it is calculating 452,929 forces 12 times per second — 5,435,148 calculations per second. While nature has infinite computing power, your browser does not. We had to find a way to speed this up.
The heart of the inefficiency is that there are too many particles interacting. While there are ways of significantly reducing the number of calculations required, they don’t do well with large changes in the graph mid-simulation. Our search results will often come back in large bursts. Even though we rate-limit the addition of particles, since the updates are localized to a relatively small area of the graph, the large concentration of energy can really upset the layout.
Our initial version contained a line for each transaction that established a relationship between a company name and a DUNS number (there were nodes for DUNS numbers as well as names). That graph contained so many nodes that it was impractical to render it in real-time on low-end hardware. Changing the layout to hide the individual contracts and grants served two purposes well. First, we were able to drastically reduce the number of particles in our physics simulation, allowing it to run more smoothly, especially on lesser hardware. However, the secondary purpose was actually more important. Removing the individual transactions made it more clear that the DUNS number alone could not be used to disambiguate two entities. When you use the tool today, it takes much less time to understand that there is a missing piece of the corporate identity puzzle. Win-win!
In the end, most of the remaining speed improvements came from simple programming optimizations. Since we’re laying out out a graph rather than modeling the real world, we could remove the 3rd dimension from our simulation. After that, we were left to the details of the virtual machines that run JavaScript in your browser. The major browsers all have their own quirks, so we had to do a lot of cross-browser testing. In IE 6, 7, & 8 we simply require Google’s Chrome Frame to avoid VML and all of the complications involved with that. The JavaScript profilers in Firefox and Chrome were indispensable.
It was very exciting to implement a physics simulation in the browser. We did it with open technologies (and put everything on github) and we applied it to a real-world problem, which is very rewarding. If you’re inspired to contribute, head on over to OpenCorporates. They’re collaborating with the awesome folks at ScraperWiki to pull in state and country level information about corporations into a single database.