When we launched TransparencyCorps at the end of June, we ran a few small earmark campaigns to digitize little batches of earmark request letters that legislators had posted on their websites. These campaigns wrapped up very quickly, and around the same time the House released earmark request letters en masse, so we no longer needed to run campaigns per-legislator.
Given the demonstrated interest in earmarks, we decided to run a much larger campaign, for all the earmarks released by the House Appropriations Committee, starting with those for the Commerce, Justice, and Science Subcommittee. These were released in a single massive PDF, which I split up into individual 1- or 2-page request letters.
This campaign involved 1,183 letters, and it ran until 5,537 tasks had been completed. Total volunteer time, as measured on TransparencyCorps: over 472 hours. That’s nearly 20 full days of work. Here are the results.
First, the data:
- CSV – This is a CSV of all the earmark request letters, created by automatically merging user responses.
- SQLite – This is an SQLite database containing tables of the original document metadata, all collected responses, and the final merged table that was used to generate the above CSV, including metadata about agreement and allowed variation.
The merging process chose the most accurate response for a field based on how many other users’ responses agreed with it. For each field, we allowed a different level of “fuzziness” when calculating agreement; fuzziness is measured using an implementation of the Levenshtein distance algorithm.
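As a rough illustration, the agreement calculation might look something like the sketch below. The function names, the convention of counting a response as agreeing with itself, and the plain dynamic-programming edit distance are all my own choices for illustration, not the actual TransparencyCorps implementation.

```python
# Sketch of agreement-based merging (illustrative, not the real code).
# For each field, the response that agrees with the most responses
# (within an allowed Levenshtein distance) becomes the merged value.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def merge_field(responses: list[str], fuzziness: int) -> tuple[str, int]:
    """Return (winning response, number of responses agreeing with it).

    A response "agrees" with another if their edit distance is within
    the field's allowed fuzziness; each response counts itself.
    """
    best, best_count = responses[0], 0
    for r in responses:
        count = sum(1 for other in responses
                    if levenshtein(r, other) <= fuzziness)
        if count > best_count:
            best, best_count = r, count
    return best, best_count
```

For example, with a 1-character variance, `merge_field(["123 Main St", "123 Main St.", "something else"], fuzziness=1)` would pick `"123 Main St"` with 2 agreeing responses.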
About a third (403) of the letters were assigned to 4 users, most of the rest (773) were tackled 5 times, and a handful (10) were assigned 6 times. Tasks were handed out in order of fewest completed responses, so my guess is that those last 10 were cases where a task was assigned to a second user before the first user had finished theirs.
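That fewest-completions-first hand-out can be sketched as follows; the function name and data shape are hypothetical, and the real site’s logic surely differs.

```python
# Hypothetical sketch of fewest-completions-first assignment.
# Each arriving volunteer gets the task with the fewest *completed*
# responses so far. Two volunteers arriving before either finishes can
# receive the same task, which would explain the handful of letters
# that ended up with 6 responses instead of 5.

def next_task(completed_counts: dict[str, int]) -> str:
    """Pick the task ID with the fewest completed responses."""
    return min(completed_counts, key=completed_counts.get)
```

For instance, if letter-002 has 4 completed responses while the others have 5, the next volunteer gets letter-002.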
We collected six pieces of information about each earmark request letter:
- The legislator doing the requesting. This was a dropdown whose stored value was an ID, so no fuzziness was allowed. It was also an extremely accurate field: all but 24 letters had agreement among 4 or more responses, and those 24 all had agreement among 3 responders.
- The title of the project the money would go to. This was arguably the most important field, and accuracy was good: allowing responses to differ by only 1 character and still count as agreeing, over 86% of the request letters got agreement from at least 3 responders.
- The amount being requested. Accuracy here was good too, though most of the request letters listed no amount at all; only 51 did. Of those, 3 or more people agreed on 46. On 4 of the remaining 5, a majority of responders missed the amount entirely.
- The purpose of the funding. This field was a textarea that collected the sentence or sentences describing the stated purpose of the entity receiving the funding. As you’d expect with a larger block of text to capture, there was more variation in the entries, whether from inexact highlighting on copy/paste or more room for error in manual transcription. This was our least accurate field: only 62% of the earmarks got responses that 3 or more responders agreed on, allowing 5 characters of variance. Raising the variance to 20 characters only brings that number up to 71%. The merged results in the data use a 5-character variance for this field.
- The names of the entity or entities actually receiving the money. This was quite accurate: 87% of the letters had agreement from 3 or more people, allowing a 1-character variance. Earmark money could be requested on behalf of multiple entities, and people were asked to enter them in separate fields, but the calculations here treat them as one field.
- The addresses of those entities. This was a larger text field, like the funding purpose, yet it turned out to be far more accurate: 97% of the earmark letters had agreement among 3 or more responders, with a 5-character variance. One possible explanation is that this field was not prose but followed the address format everyone is familiar with, so the correct start and end points were easier to pick out instinctively.
I believe we demonstrated that crowdsourcing earmarks is a viable way of collecting this data.
The earmark data was merged automatically and includes the ambiguous responses (by which I mean fields where fewer than 3 people agreed on an answer). But suppose we wanted a trusted person to go over the ambiguously answered fields and fill them in manually: that’s 677 fields where the crowd didn’t agree, out of the 7,116 fields that would have needed filling out if one person had done it all. The crowd took care of over 90% of the work.
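The arithmetic behind that claim is simple enough to check directly:

```python
# Checking the back-of-the-envelope numbers from the paragraph above.
ambiguous = 677     # fields where fewer than 3 responders agreed
total = 7116        # fields if one person had entered everything
crowd_share = 1 - ambiguous / total
assert crowd_share > 0.90   # the crowd handled over 90% of the fields
```

That works out to roughly 90.5% of the fields resolved by the crowd without any manual review.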
A huge thank you to everyone who threw in some time towards this campaign. We generated a lot of data. This was only about 1/3 of the earmark request letters that the House has released, so if there continues to be interest in this work, we’re happy to launch another earmark campaign to take more on. We welcome your feedback on how this campaign went, and anything we can do to improve upon it in the future.