Wrangling messy political data into usable information


Thanks to the [Lobbying Disclosure Act of 1995](http://en.wikipedia.org/wiki/Lobbying_Disclosure_Act_of_1995), individuals and organizations must disclose the activities they undertook each quarter while representing themselves or their clients to Congress. After the [Honest Leadership and Open Government Act of 2007](http://en.wikipedia.org/wiki/Honest_Leadership_and_Open_Government_Act) was passed, there was a rapid and sustained rise in electronic filing for lobbying disclosures. There are now over 500,000 disclosure forms available for analysis in electronic formats from the past seven years. Although the disclosures don’t offer nearly as many specifics as one would hope, when taken in aggregate the available data provides a high level overview of the movements and trends of the lobbying industry.

Sadly, we can’t just skip from downloading the data to calculating aggregate statistics. The disclosure forms include no reliable way of knowing when two lobbying firms or two clients of a lobbying firm are the same. Without taking an educated guess, we don’t know from the data that the client called [“1Sky Education Fund”](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=a45f7297-469c-4cda-8cf0-a880d21dddf8&filingTypeID=1) and the client called [“350.org (formerly known as 1Sky Education Fund)”](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=087d8459-9542-4a79-bae0-d57350ebb73a&filingTypeID=51) are in fact the same organization. Compounding the issue, two firms usually don’t disclose the name of the same client in the exact same way. Some lobbying firms hardly even disclose their own name consistently. Before we can get into the high level overview of lobbying disclosure data, we must merge and identify all the organizations and individuals in the disclosure forms.

The traditional approach to this problem has been to build software that allows people to label and tag disclosure forms. Human annotation by experts is a tried and true method for understanding these forms. If I had not been one lone intern but rather, say, a hungry swarm of human labor ready to descend on the senate disclosure data like politically inclined locusts settling in on vast fields of informative wheat, I too would have built a system to store the stream of annotations I would have been producing. But, for better or worse, this summer it was just me and a computer with 32 cores, 64 gigabytes of memory and no inborn interest in the lobbying activity of [“NPLMCC–Nuclear Power Labor-Management Cooperation Committee” during the first quarter of 2014](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=acb60dbc-a561-4a90-8ce2-7a63af36615a&filingTypeID=51).

Moreover, we’ve found that lobbying disclosure forms all get submitted during the same two week period each quarter. This means that the month after the disclosure deadlines is hell on researchers. When disclosures hit, everybody drops everything and helps fight the good fight. Despite the best efforts of the labor liquidity movement, hiring and subsequently firing large swaths of lobbying disclosure experts is not a tenable system for dealing with the quarterly disclosures long term. If an organization wants to annotate and tag lobbying disclosure forms, the organization has to be structured to deal with sharp, regular and unavoidable labor surpluses and deficits from the get go.

Some organizations amortize this labor cost by creating dual roles like an individual serving as a reporter primarily and only as an annotator when needed, or finding other disclosure data sets of similar size that are “on” when lobbying disclosure forms are “off.” The Sunlight Foundation was not willing to pursue such a drastic organizational shift and so we decided to explore how far we could get with only software. If a technological solution could be found, we figured it would be faster, cheaper and more reliable than a team of human annotators with, hopefully, acceptable levels of accuracy and precision.

And so, with all of the above in mind, I embarked on a quest to train a computer to care about politics. The Influence Explorer team figured that if I could reproduce even a fraction of the accuracy that human annotation provides, then a technological solution offered some very real benefits that made the trade off reasonably attractive. In short, we hoped that by sprinkling magical silicon dust over the lobbying disclosure data, we wouldn’t have to destroy the environment by burning all of the midnight oil folks would need to get the projects done each quarter.


Upon arriving at the Sunlight Foundation in May, I was given the goal of automating the annotation of lobbying disclosure data. I had effectively free rein to do what I thought was best. While in pursuit of this goal, I built a series of systems capable of easily answering interesting questions about the world of lobbying disclosures. [ECHELON](https://github.com/sunlightlabs/echelon) is the third prototype I’ve built so far and is by far the most successful and powerful. With just a few hours of computation, ECHELON is able to approximately reproduce the resolution precision that several years of dedicated hand curation built up.

I’m happy to say that automated annotation and tagging is well within the realm of possibility. I was able to build a system that approximated the results of other organizations. In particular, the number of unique clients and lobbying firms produced by our program was within a few thousand of the same statistics for human annotated data for the same time period. Considering that we started out with over 1,000,000 clients and registrants in total, we were very excited when we saw that we were getting down to around 33,000 clients and 7,000 registrants.

This is not meant to be a deep technical blog post, so I will only touch on the technical architecture before diving into some of the results. ECHELON is a [Clojure](http://clojure.org/) project built on top of the free version of [Datomic](http://www.datomic.com/) and leverages [Instaparse](https://github.com/Engelberg/instaparse) heavily. At this point, the core of ECHELON consists of 1500 lines, with another 1000 lines for one off experiments and queries that I’ve been exploring. [Most of the core code deals with loading in the data and getting it into just the right format](https://github.com/sunlightlabs/echelon/blob/master/src/echelon/load.clj).

As we’ll soon see, such a small project can pack a mean punch.

## Parsing field names

One of the major forces powering ECHELON is the understanding of the disclosed name fields. Thanks to Instaparse, we were able to create a [formal grammar](http://en.wikipedia.org/wiki/Formal_grammar) for [parsing the various corporate entities](https://github.com/sunlightlabs/echelon/blob/master/src/echelon/parser.bnf) that appeared in the name fields for client, registrants, affiliated organizations and foreign entities. This means that we can turn “SkyTerra Communications, Inc., formerly Mobile Satellite Ventures” into something that looks like the following:

((“skyterra” “communications” :corporation) :fka (“mobile” “satellite” “ventures”))

All that the above is indicating is that there was a corporation called Skyterra Communications, (“skyterra” “communications” :corporation), that was formerly known as, :fka, Mobile Satellite Ventures, (“mobile” “satellite” “ventures”). Which is neat, but sort of useless alone. Here’s what one gets when one runs “SKYTERRA COMMUNICATIONS CORPORATION F/K/A MOBILE SATELLITE VENTURES” through ECHELON’s parser:

((“skyterra” “communications” :corporation) :fka (“mobile” “satellite” “ventures”))

We get the exact same result for both examples! Two names which look very different produce the exact same result. What I’ve done is create a way of taking a wide variety of inputs and imposing a rigid structure on them, with an emphasis on making similar inputs produce the exact same result.
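To make the idea concrete, here is a minimal Python sketch of that kind of normalization. It is not ECHELON's parser: the real thing is an Instaparse grammar in Clojure, while this stand-in uses regular expressions, and the suffix table, the list of "formerly" markers, and the function names `parse_entity` and `parse_name` are all assumptions made up for illustration.

```python
import re

# Hypothetical suffix table; ECHELON's actual grammar (parser.bnf) covers far more cases.
CORPORATE_SUFFIXES = {
    "inc": ":corporation",
    "incorporated": ":corporation",
    "corp": ":corporation",
    "corporation": ":corporation",
    "llc": ":llc",
    "llp": ":llp",
    "lp": ":lp",
}

# Assumed spellings of the "formerly known as" marker; longest alternative first.
FKA_PATTERN = re.compile(
    r"\b(?:formerly known as|formerly|f/k/a|fka)\b", re.IGNORECASE
)


def parse_entity(text):
    """Lowercase the text, drop punctuation, and tag a trailing corporate suffix."""
    tokens = re.findall(r"[a-z0-9&]+", text.lower())
    if tokens and tokens[-1] in CORPORATE_SUFFIXES:
        return tuple(tokens[:-1]) + (CORPORATE_SUFFIXES[tokens[-1]],)
    return tuple(tokens)


def parse_name(raw):
    """Split a name field on the first 'formerly' marker and parse each side."""
    parts = FKA_PATTERN.split(raw, maxsplit=1)
    current = parse_entity(parts[0])
    if len(parts) == 2:
        return (current, ":fka", parse_entity(parts[1]))
    return (current,)


# Two differently written names collapse to the same structure.
a = parse_name("SkyTerra Communications, Inc., formerly Mobile Satellite Ventures")
b = parse_name("SKYTERRA COMMUNICATIONS CORPORATION F/K/A MOBILE SATELLITE VENTURES")
assert a == b
```

A regex approach like this falls over quickly on real filings, which is a fair argument for the formal grammar that ECHELON actually uses.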

Once I had run this parser over every organization name field in the disclosure data, I had a fair amount of power to play with. The parser is a vital step in the automated annotation process. As an example, I’ve picked out [LightSquared](http://en.wikipedia.org/wiki/LightSquared) because it is an organization that has a long history and has operated with several different names. Here are all the various names that ECHELON has annotated to be the same thing:

| Names of entities matched to LightSquared |
| --- |
| SkyTerra Communications, Inc., formerly Mobile Satellite Ventures |
| LightSquared (Formerly known as SkyTerra) |
| LightSquared (formerly known as Skyterra / Lightsquared) |
| SkyTerra / LightSquared |
| LightSquared (formerly SkyTerra Communications, Inc.) |
| Mobile Satellite Ventures, LP |
| SkyTerra (Formerly Mobile Satellite Ventures) |
| Mobile Satellite Ventures |
| Skyterra (formerly known as Mobile Satellite Ventures) |
| Skyterra (formerly known as Mobile Satellite Ventures, LP) |
| SkyTerra Communications, Inc. (formerly Mobile Satellite Ventures) |

So, by the use of a formal grammar and a smart annotation step, we are able to easily find and record the various names that an organization has used as it went about lobbying Congress.
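One plausible way to turn parsed names into groups like the one above is to treat every "formerly known as" link as an edge and merge everything that is connected. The sketch below does that with a small union-find structure. This is an assumption about how such an annotation step could work, not a description of ECHELON's code, and the pre-normalized keys in the example records are hypothetical.

```python
from collections import defaultdict


class UnionFind:
    """Minimal disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Point every node on the path directly at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_names(records):
    """Group raw names that share a current or former name.

    records: list of (raw_name, current_key, former_key_or_None), where the
    keys are assumed to come from a normalizing parser.
    """
    records = list(records)
    uf = UnionFind()
    for raw, current, former in records:
        uf.union(("raw", raw), ("key", current))
        if former is not None:
            uf.union(("key", current), ("key", former))
    clusters = defaultdict(set)
    for raw, _, _ in records:
        clusters[uf.find(("raw", raw))].add(raw)
    return list(clusters.values())


records = [
    ("SkyTerra (Formerly Mobile Satellite Ventures)", "skyterra", "mobile satellite ventures"),
    ("Mobile Satellite Ventures", "mobile satellite ventures", None),
    ("LightSquared (Formerly known as SkyTerra)", "lightsquared", "skyterra"),
    ("Patton Boggs LLP", "patton boggs", None),
]
clusters = cluster_names(records)
```

Transitive merging is aggressive: one bad "formerly" link can glue two unrelated organizations together, so a real system would want some way to inspect and veto merges.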

## Querying

ECHELON provides a powerful interface for querying the data. Assuming the answer to a question exists within the data, there hasn’t been a question yet that I’ve been able to think of that ECHELON cannot answer. The system is surprisingly powerful, more so than I could have hoped for. Here are the organizations which come up the most often in the disclosure data, broken down by the various associated names:

| Alias | Number of Occurrences |
| --- | ---: |
| “Patton Boggs LLP” | 3469 |
| “Squire Patton Boggs formerly Patton Boggs LLP” | 170 |
| “Squire Patton Boggs” | 7 |
| “Patton Boggs, LLP” | 6 |

| Alias | Number of Occurrences |
| --- | ---: |
| “Van Scoyoc Associates” | 6526 |

| Alias | Number of Occurrences |
| --- | ---: |
| “Holland & Knight LLP” | 5062 |
| “Holland & Knight, LLP” | 3 |

| Alias | Number of Occurrences |
| --- | ---: |
| “Akin, Gump, Strauss, Hauer & Feld” | 19 |
| “Delaware North Companies on behalf of Akin Gump Strauss Hauer & Feld” | 11 |
| “Oneida Indian Nation on behalf of Akin Gump Strauss Hauer & Feld” | 10 |
| “City of Houston on behalf of Akin Gump Strauss Hauer & Feld” | 10 |
| “Akin Gump Strauss Hauer and Feld” | 1 |
| “Akin Gump Strauss Hauer & Feld” | 1 |

| Alias | Number of Occurrences |
| --- | ---: |
| “K&L GATES LLP” | 4056 |
| “K&L Gates LLP” | 212 |
| “K&L Gates, LLP” | 12 |
| “K&L Gates, LLp” | 1 |

| Alias | Number of Occurrences |
| --- | ---: |
| “Hogan Lovells US LLP” | 1459 |
| “Hogan & Hartson LLP” | 634 |
| “Hogan Lovells US LLP f/k/a Hogan & Hartson LLP” | 380 |
| “Hogan Lovells f/k/a Hogan & Hartson LLP” | 131 |
| “Hogan Lovells US LLP f/k/a Hogan & Hartson LLP” | 3 |

| Alias | Number of Occurrences |
| --- | ---: |
| “Cornerstone Government Affairs, LLC” | 2999 |

| Alias | Number of Occurrences |
| --- | ---: |
| “Cassidy & Associates, Inc. formerly known as Cassidy & Associates ” | 2215 |
| “Cassidy & Associates, Inc.” | 268 |
| “Cassidy & Associates” | 244 |
| “Cassidy & Associates Inc.” | 70 |
| “Tiffany & Co. on behalf of Cassidy & Associates” | 7 |
| “Hospital for Special Surgery on behalf of Cassidy & Associates” | 6 |
| “College of New Rochelle, The on behalf of Cassidy & Associates” | 6 |
| “Claflin University on behalf of Cassidy & Associates” | 6 |
| “Hampton University on behalf of Cassidy & Associates” | 5 |
| “United States Tennis Association Inc. on behalf of Cassidy & Associates” | 4 |
| “National Acquarium in Baltimore, Inc. on behalf of Cassidy & Associates” | 3 |
| “Institute for Student Achievement on behalf of Cassidy & Associates” | 3 |
| “National Aquarium in Baltimore, Inc. on behalf of Cassidy & Associates” | 2 |
| “Cassidy & Associates, Inc.formerly known as Cassidy & Associates” | 1 |

| Alias | Number of Occurrences |
| --- | ---: |
| “Podesta Group, Inc.” | 62 |

| Alias | Number of Occurrences |
| --- | ---: |
| “ALCALDE & FAY” | 2894 |
| “Alcalde & Fay” | 14 |

There are many interesting little tidbits in the above output. The value of the parser is easily seen as we look at all the variations of the names that pop up within the documents. Specifically, there is a phenomenon within disclosure forms where a third party will include itself within the name of the client, i.e. “Patton Boggs on behalf of Northrop Grumman Inc.” even though the lobbying firm that filed the form could be “Cassidy & Associates.” In general, the client name will be something like “Firm A on behalf of Client A” while the registrant will be neither “Firm A” nor “Client A.” This is a common pattern and the parser and annotator account for it. There are several theories about what these disclosed “on behalf of” relationships mean. The most believable one is that the disclosing firms hire these other firms to lobby on behalf of their clients in areas where the disclosing firms are weak. The clients get a wider range of expertise and, perhaps more importantly, clients don’t have to go through the trouble of coordinating with more than one lobbying firm directly. This seems like a reasonable explanation, but these relationships admittedly deserve more scrutiny than I’ve been able to give them.

In some rare instances the grammar of the disclosure gets messed up though. While I’m not so into linguistic prescription, it seems like “Entity A on behalf of Entity B” usually means that “Entity A” undertook some work for the benefit of “Entity B” and not that “Entity B” undertook some work for the benefit of “Entity A.” However, as evidenced above, sometimes form fillers will flip the entity positions within the “on behalf of” statement. This confuses the automated annotator. That’s why “Hospital for Special Surgery on behalf of Cassidy & Associates” is resolved to be the same entity as just “Cassidy & Associates.” There is a potential solution to this problem involving more information and a more complicated annotation process, but this issue only occurred a handful of times and thus didn’t feel like it was within the scope of the current project.
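As a rough illustration of the “on behalf of” pattern and the flipped case, here is a small Python sketch. The helper names `split_on_behalf` and `looks_flipped` are hypothetical, and the flip check (comparing both halves against a set of known registrant names) is only one guess at what the “more information” might look like; it is not something ECHELON does.

```python
import re

ON_BEHALF = re.compile(r"\s+on behalf of\s+", re.IGNORECASE)


def split_on_behalf(client_field):
    """Return (intermediary, beneficiary); intermediary is None when absent."""
    parts = ON_BEHALF.split(client_field.strip(), maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return None, parts[0]


def looks_flipped(intermediary, beneficiary, known_registrants):
    """Flag 'Client on behalf of Firm' orderings.

    Assumption: if only the right-hand side is a known lobbying firm,
    the form filler probably swapped the two positions.
    """
    return (
        intermediary is not None
        and beneficiary in known_registrants
        and intermediary not in known_registrants
    )


known = {"Patton Boggs", "Cassidy & Associates"}
normal = split_on_behalf("Patton Boggs on behalf of Northrop Grumman Inc.")
flipped = split_on_behalf("Hospital for Special Surgery on behalf of Cassidy & Associates")
assert not looks_flipped(*normal, known)
assert looks_flipped(*flipped, known)
```

Exact string membership is far too strict for real data; in practice the comparison would need to run on normalized names rather than raw strings.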

We can see from this that Patton Boggs occurs most often! Neat. What sorts of activities does Patton Boggs undertake for its clients? Part of the disclosure process is that Patton Boggs must break down what they do into specific issue codes representing the areas that any lobbying activity can fall under. Thus, here is a list of lobbying codes and the number of times Patton Boggs has undertaken a lobbying activity with that code during a quarter on behalf of itself or a client.

| Issue Code | Number of Occurrences |
| --- | ---: |
| “Budget/Appropriations” | 1679 |
| “Transportation” | 965 |
| “Health Issues” | 908 |
| “Taxation/Internal Revenue Code” | 593 |
| “Medicare/Medicaid” | 550 |
| “Urban Development/Municipalities” | 399 |
| “Homeland Security” | 393 |
| “Energy/Nuclear” | 351 |
| “Housing” | 332 |
| “Financial Institutions/Investments/Securities” | 332 |
| “Defense” | 290 |
| “Telecommunications” | 284 |
| “Aviation/Aircraft/Airlines” | 272 |
| “Education” | 257 |
| “Economics/Economic Development” | 247 |
| “Law Enforcement/Crime/Criminal Justice” | 236 |
| “Natural Resources” | 227 |
| “Labor Issues/Antitrust/Workplace” | 226 |
| “Environmental/Superfund” | 222 |
| “Trade (Domestic & Foreign)” | 163 |
| “Government Issues” | 148 |
| “Indian/Native American Affairs” | 127 |
| “Agriculture” | 115 |
| “Insurance” | 113 |
| “Clean Air & Water (Quality)” | 110 |
| “Retirement” | 106 |
| “Consumer Issues/Safety/Protection” | 99 |
| “Disaster Planning/Emergencies” | 81 |
| “Communications/Broadcasting/Radio/TV” | 66 |
| “Chemicals/Chemical Industry” | 65 |
| “Copyright/Patent/Trademark” | 64 |
| “Banking” | 64 |
| “Gaming/Gambling/Casino” | 61 |
| “Utilities” | 59 |
| “Food Industry (Safety, Labeling, etc.)” | 59 |
| “Pharmacy” | 54 |
| “Science/Technology” | 53 |
| “Roads/Highway” | 53 |
| “Manufacturing” | 49 |
| “Marine/Maritime/Boating/Fisheries” | 46 |
| “Travel/Tourism” | 41 |
| “Computer Industry” | 41 |
| “Tobacco” | 40 |
| “Small Business” | 40 |
| “Medical/Disease Research/Clinical Labs” | 39 |
| “Immigration” | 38 |
| “Foreign Relations” | 38 |
| “Railroads” | 35 |
| “Veterans” | 33 |
| “Sports/Athletics” | 32 |
| “Bankruptcy” | 24 |
| “Torts” | 23 |
| “District of Columbia” | 19 |
| “Real Estate/Land Use/Conservation” | 18 |
| “Fuel/Gas/Oil” | 16 |
| “Beverage Industry” | 15 |
| “Aerospace” | 15 |
| “Firearms/Guns/Ammunition” | 12 |
| “Automotive Industry” | 10 |
| “Family Issues/Abortion/Adoption” | 9 |
| “Accounting” | 9 |
| “Welfare” | 8 |
| “Advertising” | 8 |
| “Trucking/Shipping” | 7 |
| “Media (Information/Publishing)” | 6 |
| “Alcohol & Drug Abuse” | 5 |
| “Arts/Entertainment” | 3 |
| “Intelligence and Surveillance” | 2 |
| “Constitution” | 2 |
| “Commodities (Big Ticket)” | 2 |

Woah! No wonder Patton Boggs is the entity that shows up the most, they seem to be doing a little bit of everything. How neat. I wonder how things change with time, though. Does Patton Boggs have its bread and butter type lobbying activities or has it been a dynamic firm? Here are the top five activities for each of the past seven years for Patton Boggs:

| 2008 | Number of Reports |
| --- | ---: |
| “Budget/Appropriations” | 281 |
| “Transportation” | 127 |
| “Health Issues” | 105 |
| “Taxation/Internal Revenue Code” | 76 |
| “Medicare/Medicaid” | 72 |

| 2009 | Number of Reports |
| --- | ---: |
| “Budget/Appropriations” | 308 |
| “Transportation” | 167 |
| “Health Issues” | 149 |
| “Taxation/Internal Revenue Code” | 95 |
| “Medicare/Medicaid” | 89 |

| 2010 | Number of Reports |
| --- | ---: |
| “Budget/Appropriations” | 296 |
| “Health Issues” | 198 |
| “Transportation” | 152 |
| “Taxation/Internal Revenue Code” | 113 |
| “Medicare/Medicaid” | 101 |

| 2011 | Number of Reports |
| --- | ---: |
| “Budget/Appropriations” | 254 |
| “Transportation” | 150 |
| “Health Issues” | 134 |
| “Taxation/Internal Revenue Code” | 92 |
| “Medicare/Medicaid” | 74 |

| 2012 | Number of Reports |
| --- | ---: |
| “Budget/Appropriations” | 224 |
| “Transportation” | 158 |
| “Health Issues” | 122 |
| “Medicare/Medicaid” | 82 |
| “Taxation/Internal Revenue Code” | 79 |

| 2013 | Number of Reports |
| --- | ---: |
| “Budget/Appropriations” | 184 |
| “Transportation” | 122 |
| “Health Issues” | 119 |
| “Medicare/Medicaid” | 83 |
| “Taxation/Internal Revenue Code” | 76 |

| 2014 | Number of Reports |
| --- | ---: |
| “Budget/Appropriations” | 132 |
| “Transportation” | 89 |
| “Health Issues” | 81 |
| “Taxation/Internal Revenue Code” | 62 |
| “Medicare/Medicaid” | 49 |

So it seems that Patton Boggs does have its standard issues that it hits every year, with very little movement in the ranking of issues each year. A solid firm then, a stoic firm one might say, a firm that knows what it is good at and sticks to its guns. Good on you Patton Boggs, good on you. Now, this is not the limit of what is possible with ECHELON at all. There is a whole rabbit hole of queries and results that we could disappear into. Every which way I turn when touching the data new questions arise and they can quickly overwhelm us. This post is only meant to introduce and briefly explain ECHELON and its capabilities and so let’s focus on one particular type of query to wrap everything up.
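For readers who want a feel for the shape of the year-by-year query, here is a rough Python equivalent. ECHELON runs its queries against Datomic, so this is only a stand-in: the function name `top_issues_by_year` and the flat `(year, issue_code)` input format are assumptions, and the sample rows are toy data rather than real counts.

```python
from collections import Counter, defaultdict


def top_issues_by_year(reports, n=5):
    """Count issue codes per year and keep the n most common for each year.

    reports: iterable of (year, issue_code) pairs, one per reported activity.
    """
    counts = defaultdict(Counter)
    for year, issue_code in reports:
        counts[year][issue_code] += 1
    return {
        year: counter.most_common(n)
        for year, counter in sorted(counts.items())
    }


# Toy input, not real disclosure data.
sample = (
    [(2008, "Budget/Appropriations")] * 3
    + [(2008, "Transportation")] * 2
    + [(2008, "Health Issues")]
    + [(2009, "Transportation")] * 2
)
ranking = top_issues_by_year(sample, n=2)
```

The same pattern works for any "top N per group" question, such as the most active clients per firm, by swapping what goes into each pair.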

## A Comedy of Errors

Back when I was young and naive, i.e. three months ago, I had great faith in the identifiers that the house and senate gave to each registrant and client. You see, registrants are the ones who are actually filling out and filing the forms that I’ve been analyzing. Every registrant, which typically means every lobbying firm, must register that they are going to lobby on their clients’ behalf. Then, each quarter, the registrants file a report on behalf of their clients disclosing the activities they undertook. The house and the senate give each client and firm pair a unique identifier to use when filing the forms. While these identifiers aren’t terribly useful by themselves, they could potentially make it easier to link up all the activities that firms undertook for clients across time. Early on, I was advised by colleagues to look into how reliable the identifiers were. After some rough experiments, it seemed that lobbyists made enough mistakes when entering the identifiers that correcting them all by hand was possible but would be neither enjoyable nor productive. I decided to ignore the government issued identifiers for as long as I could get by without them.

Obviously, ECHELON has been a success without using the government issued identifiers. Moreover, ECHELON can tell us exactly how much of a problem these government issued identifiers would have posed if I tried to use them. By relying on only the automated annotation, we can easily find mistakes that lobbyists made when entering in the identifiers on the forms.
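The basic check can be sketched in a few lines of Python: once names have been resolved, every filing for the same registrant and client pair should carry the same identifier, so any filing that disagrees with the pair's most common identifier is worth a look. The function name `suspicious_identifiers`, the tuple layout, and the filing labels below are hypothetical; the two identifiers are the ones from the “Process Handler et al.” example later in this post.

```python
from collections import Counter, defaultdict


def suspicious_identifiers(filings):
    """Flag filings whose identifier differs from the pair's most common one.

    filings: iterable of (registrant, client, house_id, filing_label), with
    registrant and client already resolved to canonical names.
    """
    by_pair = defaultdict(list)
    for registrant, client, house_id, filing_label in filings:
        by_pair[(registrant, client)].append((house_id, filing_label))

    flagged = []
    for pair, entries in by_pair.items():
        counts = Counter(house_id for house_id, _ in entries)
        if len(counts) < 2:
            continue  # Every filing agrees, nothing to report.
        expected, _ = counts.most_common(1)[0]
        for house_id, filing_label in entries:
            if house_id != expected:
                flagged.append((pair, expected, house_id, filing_label))
    return flagged


pair = ("Process Handler et al.", "Mr. Cie Sharp")
filings = [
    (*pair, "363570022", "2008-Q3"),
    (*pair, "363570022", "2008-Q4"),
    (*pair, "362570022", "2009-Q1"),
    (*pair, "363570022", "2009-Q2"),
]
flagged = suspicious_identifiers(filings)
```

Majority voting assumes the correct identifier is also the most frequent one, which will not hold for a pair with only two or three filings.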

First off, form fillers don’t seem to make mistakes when entering in the senate issued identifier. We’ve checked and the senate identifier is apparently used to log into the disclosure systems for both the senate and the house. Thus, these forms cannot be uploaded and still have the senate id wrong. This was a surprising and encouraging find!

The house identifier did not fare nearly so well. I was able to find a couple dozen serious uncorrected mistakes that were made when entering in the house id. At first blush, fourteen mistakes out of over 500,000 forms filled out is a pretty decent track record. However, this number is a lower bound on the number of mistakes that have been made and gives no indication as to the actual number of mistakes. If I ran a better query to find mistakes, found better techniques for annotation, or caught an unknown mistake I was making in my current code, the number of found mistakes in the house identifiers column could sharply increase.

What distresses me most about these mistakes is that they occur in the field where precision matters the most. A firm can forget to include an activity in the disclosures, fudge the numbers on how much they were paid, even misspell the name of their client and it would be fine. That sort of thing doesn’t really matter all that much. Identifiers matter because they are meant to precisely identify an organization and there is no room for error. By making any mistake at all when entering in the identifier, no matter how small the mistake may be, lobbying firms effectively negate the entire purpose of the field in the first place. I’d rather have them leave it blank than put in nonsense.

Putting the rants of a young pedantic wonk aside, there are two types of mistakes that I’ve found so far when firms are filling out disclosure forms. The first is just a simple typo where something like “1001” becomes something like “10001” or “2001.” The majority of the mistakes found were just typos. These mistakes aren’t terribly interesting, so let’s just look at two examples of them.

“Process Handler et al.” [registered](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=7f21b60c-b05f-4f75-87fc-8a9ec8dcc6e8&filingTypeID=1) that it would be lobbying on behalf of “Mr. Cie Sharp” in early 2007. All throughout 2008, the activities undertaken by the firm on behalf of the client were disclosed with the house identifier “363570022” (as evidenced by the [Q1](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=352825ec-dadf-4088-b735-6a337adfa4c4&filingTypeID=52), [Q2](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=58af25a3-d3be-4ee9-a23a-d685d069a708&filingTypeID=61), [Q3](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=d2cd6737-eb9f-4594-8e4d-610d21160e2b&filingTypeID=69) and [Q4](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=54c2e646-3d4b-43ee-ab10-4d13d84feb9c&filingTypeID=79) reports). However, the house identifier of “362570022” was used on the [Q1 report for 2009](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=577f2e5d-190b-427b-90dc-065506a4533c&filingTypeID=51). After that, the reports switched back to using “363570022” until the relationship was terminated at the end of 2009 ([Q2](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=5a0cbf07-2947-4f3e-ae81-a4d2d6a94ea7&filingTypeID=61), [Q3](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=8c627843-ca1e-48c4-a4ce-97b8677d3007&filingTypeID=70), [Q4](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=910deaf0-6a01-4141-a568-5186639e349e&filingTypeID=78), [Q4 termination](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=689903f3-24fa-4178-8947-016c953f7fae&filingTypeID=84)).

“Hoffman, Silver, Gilman & Blasco P.C. (formerly known as Robertson, Monagle & Eastuagh *[sic]*)” has had a [long](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=15872f57-51da-4827-a5d1-b88febf4205b&filingTypeID=3) [multi](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=cc2bf6ab-7c15-4b04-b8c6-e89f76900cc7&filingTypeID=9) [year](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=518ca210-b981-4cd6-ad66-e905903210ac&filingTypeID=60) relationship with the “Alaska Forest Association.” The relationship between the two entities is typically disclosed with the house identifier “306260000.” In the fourth quarter of 2011 though, [three](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=d8fb2fc0-0930-4c97-b124-09f1e1a7adaf&filingTypeID=78) [separate](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=0052aef0-5d21-4cda-a81f-283dfc737842&filingTypeID=78) [reports](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=e302010a-8730-47c3-9987-5d8246d25104&filingTypeID=78) were filed to detail this relationship. None of them are amendments to the others, all of them spell the client’s name wrong and two of them use the wrong house identifier (the incorrect “306250000” and “306260005” vs. the correct “306260000”). Strange.
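One way to separate simple typos like these from stranger mistakes is to measure the edit distance between the expected identifier and the one that was filed. The sketch below uses a plain Levenshtein distance. Both the threshold of one edit and the labels “typo” and “needs review” are arbitrary choices made for illustration, not categories that ECHELON computes.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def classify_mismatch(expected, observed, typo_threshold=1):
    """Label a wrong identifier as a likely typo or as something odder."""
    if edit_distance(expected, observed) <= typo_threshold:
        return "typo"
    return "needs review"


# Identifiers taken from the two examples above.
assert classify_mismatch("363570022", "362570022") == "typo"
assert classify_mismatch("306260000", "306250000") == "typo"
assert classify_mismatch("306260000", "306260005") == "typo"
```

Anything the classifier hands back as “needs review” still requires a human to work out what actually happened.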

Moving beyond typos, there were two cases of general incompetence. “Keevican Weiss Bauerle & Hirsch, LLC” has lobbied for “TriState Capital Bank” under the identifier “405970000” since 2009. Although they eventually settled in with using the correct identifier, the fourth quarter of 2009 saw that same pattern of [three](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=39f5c792-f3c7-4882-910f-b20048532a43&filingTypeID=79) [different](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=e3e0a4a3-8af3-4de9-97db-6d489710ccd5&filingTypeID=79) [disclosures](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=8d68b13d-a11b-4b2d-a4a9-2af3bd707b62&filingTypeID=79), none of them amendments, with two of the disclosures using the wrong identifiers. The [first mistake](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=e3e0a4a3-8af3-4de9-97db-6d489710ccd5&filingTypeID=79) was a simple typo where an extra zero was included at the beginning of the identifier. The [second mistake](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=8d68b13d-a11b-4b2d-a4a9-2af3bd707b62&filingTypeID=79) seems nonsensical though; there isn’t a simple way of getting from “405970000” to “408550000” without making at least three typos. If we look for other relationships which have the same identifier though, we see that “Keevican Weiss Baurele & Hirsch, LLC” also does work for “C & S Patient Education Foundation dba Conquer Chiari” and, surprise, that [relationship has the “40885000” identifier](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=a01f29ec-56ec-468c-8ab2-5445f4dcfa9f&filingTypeID=78).

“The Susquehanna Group” has lobbied for “The Corps Network” for about a decade now. [This](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=e8056dc1-a7c0-4915-91a0-b310f2e657db&filingTypeID=9) [relationship](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=35af03d5-e09a-4e38-99f3-9844e750e228&filingTypeID=69) [typically](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=501e83c7-8f76-40b9-9e4e-d4ccd44eb1b5&filingTypeID=60) [uses](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=ec3eb4ea-a634-415f-9748-c6d48d8307c3&filingTypeID=70) [the identifier “358530003”](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=ad9ee124-4ad7-45c6-bcf2-76add2fafe87&filingTypeID=60). [One time they made a typo](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=83c3b935-38a4-44d4-8949-ff35b16a7b1e&filingTypeID=61), who cares, but another time they did something odd. I think they just made up a house identifier and used that instead. [This report](http://soprweb.senate.gov/index.cfm?event=getFilingDetails&filingID=9b804101-f9ad-41c8-9e5b-ce63f0b55f87&filingTypeID=51) uses “200052379” as the house identifier. That’s more than a few typos and there doesn’t seem to be any other client of anyone who has ever used that identifier. So, as far as I can tell, during one quarter “The Susquehanna Group” just made up an identifier and decided to use that instead. Very strange.

## Summary

This has been a terribly long way to explain something that most anyone who has ever worked with raw lobbying disclosure forms has discovered: lobbying disclosure forms are awful in a variety of astounding and disappointing ways. The disclosure forms provide very little information about what is actually going on and the information that is provided is on par with second hand gossip at best. Only by leveraging a fair amount of technical resources and techniques could these forms be processed and turned into something useful. In a way, we’ve shown how ECHELON bootstraps itself out of nothingness and into the Lobbying Form Typo Limelight. ECHELON needs to exist because look at what ECHELON has already had to do to exist! In all seriousness though, the ECHELON project has been a success that shows the power and potential of automated annotation systems. Just as the earth ever so patiently applied pressure and force on the excrement of long forgotten herbivores to create the fuel that powers our modern day economy, we too can apply annotation techniques and hard work to lobbying disclosure data and create something that can further our understanding of the modern political landscape.