Yesterday morning I watched the first markup session of the Earmark Transparency Act. The bill aims to create a comprehensive database of all earmark requests, not just approved earmarks. In its current version, there are over twenty required data elements, including free text descriptions and justifications of the earmark request, as well as related documents. The bill also calls for huge flexibility in the search interface and the API. Overall, it's a win for transparency and a big technical leap forward in terms of how the government thinks about releasing its data. Its biggest opponent in committee was Senator Carl Levin.
To be fair, Sen. Levin is not against the spirit of the bill, but has serious concerns about the technical implementation. He claims it is impossible for a search engine to aggregate earmarks by the free form text fields required by the bill. Sen. Levin also claims it's not feasible to sort or aggregate by documents related to the earmark (such as the bill itself, or a related budget estimate). Our policy team has logged some interesting thoughts on the subject, but I thought I’d take this opportunity to approach this from a technical standpoint.
To be specific, the bill states: “the term ‘searchable website’ means a website that ... allows the public to search and aggregate earmarks by any data element required under section 3”, where section 3 contains a list of the data elements required. Two of the data elements, the project description and the list of relevant documents, are the source of contention -- not because they’re included, but because Sen. Levin is not convinced they can be aggregated, as the bill calls for.
Aggregate has a specific meaning in computer science. It’s usually short for “aggregate function”, referring to functions such as minimum, maximum, sum, count and average that reduce many values to one. It’s quite common for these aggregate functions to be used in conjunction with an operation that groups the data by similar values. Suppose we’re aggregating data by fiscal year. First we’d bin the data into groups according to their fiscal year value, then we’d sum the earmark totals for each group, resulting in a list of fiscal years and respective totals of earmark requests. As many of you may know, this is easily done in any SQL-based relational database (and I have it on good authority the Senate uses Oracle). Now let’s throw some free form text into the mix. We can follow the same operation, except instead of grouping by the exact fiscal year, we will limit the earmarks to those whose project description contains the phrase “research and development”. Then we can sum or average the earmark totals for all of these earmarks. These queries can also be combined, giving us all the R&D earmark request totals, by fiscal year.
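To make that concrete, here is a minimal sketch of those three queries using Python's built-in sqlite3 module. The table name, column names and sample rows are all made up for illustration; I am not claiming this resembles the Senate's actual schema, and SQLite is standing in for whatever database they really run.

```python
import sqlite3

# Hypothetical schema and sample rows, invented purely for illustration.
earmarks_db = sqlite3.connect(":memory:")
earmarks_db.execute(
    "CREATE TABLE earmark_requests ("
    " id INTEGER PRIMARY KEY,"
    " fiscal_year INTEGER,"
    " amount INTEGER,"
    " description TEXT)"
)
earmarks_db.executemany(
    "INSERT INTO earmark_requests (fiscal_year, amount, description)"
    " VALUES (?, ?, ?)",
    [
        (2009, 100, "Research and development of ethanol fuels"),
        (2009, 250, "Highway bridge repair"),
        (2010, 300, "Naval research and development program"),
        (2010, 50, "Ethanol research and development grants"),
    ],
)


def totals_by_year(db):
    """Group requests by fiscal year and sum the amounts in each group."""
    return db.execute(
        "SELECT fiscal_year, SUM(amount) FROM earmark_requests"
        " GROUP BY fiscal_year ORDER BY fiscal_year"
    ).fetchall()


def phrase_total(db, phrase):
    """Sum the amounts of requests whose description contains the phrase.

    SQLite's LIKE ignores case for ASCII letters by default.
    """
    row = db.execute(
        "SELECT SUM(amount) FROM earmark_requests WHERE description LIKE ?",
        ("%" + phrase + "%",),
    ).fetchone()
    return row[0]


def phrase_totals_by_year(db, phrase):
    """Combine both: per-year totals, limited to matching descriptions."""
    return db.execute(
        "SELECT fiscal_year, SUM(amount) FROM earmark_requests"
        " WHERE description LIKE ?"
        " GROUP BY fiscal_year ORDER BY fiscal_year",
        ("%" + phrase + "%",),
    ).fetchall()


print(totals_by_year(earmarks_db))
print(phrase_total(earmarks_db, "research and development"))
print(phrase_totals_by_year(earmarks_db, "research and development"))
```

A substring match with LIKE is the bluntest tool available. A real deployment would more likely use a full-text index, but the grouping and summing would look the same.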
As for aggregating by the list of related documents, I can kind of understand how this might seem tricky. In this case, you wouldn’t aggregate by the exact text in these related documents. Rather, you would house all of these documents in one place, and assign them all unique identifiers. Then, the database only needs to keep track of a list of document IDs for each earmark request. Now our previous query can be altered to return the sum of all earmark requests that have the defense appropriations bill as a related document (using the document IDs). We have built a similar document database that drives our Real Time Congress iPhone app.
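Here is a rough sketch of that document ID approach, again with Python's sqlite3 module. The three tables, their names and the sample data are my own assumptions for the sake of the example, not a description of our Real Time Congress database or of anything the Senate runs.

```python
import sqlite3

# Hypothetical tables: documents live in one place with unique IDs, and a
# link table records which documents relate to which earmark request.
docs_db = sqlite3.connect(":memory:")
docs_db.executescript(
    """
    CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE earmark_requests (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE earmark_documents (earmark_id INTEGER, document_id INTEGER);

    INSERT INTO documents VALUES (1, 'Defense appropriations bill');
    INSERT INTO documents VALUES (2, 'Agriculture appropriations bill');

    INSERT INTO earmark_requests VALUES (10, 500);
    INSERT INTO earmark_requests VALUES (11, 200);
    INSERT INTO earmark_requests VALUES (12, 75);

    INSERT INTO earmark_documents VALUES (10, 1);
    INSERT INTO earmark_documents VALUES (11, 1);
    INSERT INTO earmark_documents VALUES (11, 2);
    INSERT INTO earmark_documents VALUES (12, 2);
    """
)


def total_for_document(db, document_id):
    """Sum the amounts of all requests linked to the given document ID."""
    row = db.execute(
        "SELECT COALESCE(SUM(e.amount), 0)"
        " FROM earmark_requests AS e"
        " JOIN earmark_documents AS ed ON ed.earmark_id = e.id"
        " WHERE ed.document_id = ?",
        (document_id,),
    ).fetchone()
    return row[0]


print(total_for_document(docs_db, 1))  # requests tied to the defense bill
print(total_for_document(docs_db, 2))  # requests tied to the agriculture bill
```

Note that request 11 is linked to both bills and counts toward both totals, which is what you would want when asking about one document at a time.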
Most professional software developers have learned the fundamental principles of database phrase searches, free form text search engines, relevancy algorithms and similarity calculations. And if they haven’t, there is no need to reinvent the wheel, since this software already exists for a multitude of platforms and is free. Sen. Levin shouldn’t be put in a position to make these technical judgments on his own and should instead leverage the expertise of the technical community. He says that the Senate Sergeant at Arms (the office in charge of technical infrastructure and support for the Senate) agrees with the claims above. And for this I think we are owed an explanation.
To all you non-programmers out there, I will let you in on a secret. Every software developer, at some point in their career, will use their developer trump card. When someone requests a feature that is a pain to implement, or a feature you disagree with but can't convince the team to abandon, you rattle off some technical jargon about how it won't be compatible with your ColdFusion infrastructure, or some other nonsense. To the Senate Sergeant at Arms I say: put it back in the deck. If the technological infrastructure in Congress is so backward and outdated that it cannot launch a free text search instance or make use of the “LIKE” SQL keyword, then we have a much more serious problem on our hands.
For starters, the existing technical infrastructure in Congress should be more transparent. Congress has consistently watered down the specificity and breadth of the data it has released and has used data formats that are not machine readable. All this has been done roughly under the guise of "doing it the right way is too hard and expensive". For example, when the House released its expenditure data, vendor names and other specific information regarding the expenditures were removed to ease the technical strain of making the text more uniform and easily comparable. Additionally, the data was released as PDFs instead of in a structured data format such as JSON, XML or CSV. These are serious impediments to total and open transparency.
There's very little known about what format Congress' internal data is in and what infrastructure already exists. This works to Congress' advantage because they can consistently "punt" on meaningful transparency requirements due to supposed technical limitations. One thing we do know is that a contractor called General Dynamics IT provides IT services to the Senate Sergeant at Arms. This contractor is the fourth-biggest contractor for the federal government according to the Federal Procurement Data System. It comes in just behind Lockheed Martin, Boeing, and Northrop Grumman. From the General Dynamics 2009 Annual Report:
"Information Systems and Technology also supplies network−modernization and IT infrastructure services to U.S. government customers. As one of the U.S. Air Force’s leading partners for network modernization, for example, the group has provided IT support services to more than 75 Air Force bases. It currently supports all Air Force main operating bases. The group also has provided continuous enterprise−wide IT services and support to the U.S. Senate for more than five years."
Is it really the case that a corporation that got $16 billion from the federal government in 2009 cannot aggregate all the earmark request records whose description contains "ethanol", or sum the total earmark requests that have the agriculture appropriations bill in their related document array?
Thankfully, the bill made it out of committee. Hopefully those conducting the upcoming Senate floor debate will consult the open government technical community instead of settling for excuses about the limitations of IT.