Disclaimer: The opinions expressed by the guest blogger and those providing comments are theirs alone and do not reflect the opinions of the Sunlight Foundation or any employee thereof. Sunlight Foundation is not responsible for the accuracy of any of the information within the guest blog.
Luke Rosiak is a former Sunlight Foundation reporter and database analyst who now writes for the Washington Examiner. Luke is also a winner of Sunlight Foundation’s OpenGov Grants for his project, CitizenAudit. You can reach Luke on Twitter at @lukerosiak.
In return for not paying taxes, nonprofits in the U.S. file detailed financial disclosures with the IRS, listing how much of their money goes to certain categories, how much they pay their top people and what groups they give money to. But even though large nonprofits submit structured electronic data, the IRS takes pains to convert it into paper copies and doesn’t make them available publicly at all, instead directing interested parties to request a copy from the organization itself. Recently, tech pioneer Carl Malamud’s Public.Resource.Org began successfully filing Freedom of Information Act requests for all disclosures--990s, as they are called--and paying the IRS on a monthly basis for reams of DVDs with TIFF images. Some are scanned paper filings; for others, the IRS went out of its way to turn structured data into a mere image. None has an embedded text layer.
The information is invaluable for philanthropists, journalists and competitors--and the universe of nonprofits is enormous, including the major sports leagues, political groups, hospitals, universities and quasi-public institutions. So I began an enormous OCRing spree using open-source tools and home-built software, and put the results in Elasticsearch and PostgreSQL on a free site. The effort, half the funding for which came from a Sunlight Foundation OpenGov grant of $5,000, is called CitizenAudit.org.
Three or four years of nonprofit disclosures, out of 7 million PDFs, are now fully and instantly text-searchable, and that number will continue to grow. Not only can you pull up 990s by typing in an organization’s name, a lot faster (and freer) than on Guidestar, but you can also search across the full text of all nonprofits’ filings.
Because nonprofits often disclose who they give to, but not who they get from, searching an organization’s name turns up the filings of other, seemingly unaffiliated groups--essentially uncovering the previously secret donors to the first organization. You can also type a person’s name to see what boards they’re on and what groups they’re drawing a salary from. (I’ve found some hucksters who spin complicated webs of nonprofits drawing government grants and paying themselves full-time salaries at each.) And a simple CTRL-F can navigate you to the part of the document you’re interested in, as opposed to reading through dozens or even hundreds of pages.
I’ve also pieced together more usable databases from poorly documented and obscure IRS files. These are downloadable as SQL dumps, as is an index of the 7 million disclosures.
And you can pass an organization’s IRS-assigned ID to an API for some structured data and extracted text -- it’s as simple as citizenaudit.org/api/[EIN]/.
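As a rough illustration of that URL pattern, here is a minimal Python sketch. The post gives only the path citizenaudit.org/api/[EIN]/, so the HTTPS scheme, the assumption that the EIN is passed as nine bare digits, and the helper names (normalize_ein, api_url, fetch_raw) are all mine, and the response format is not described here, so the sketch returns raw bytes rather than guessing at a structure.

```python
import re
import urllib.request

# Assumption: the post only gives "citizenaudit.org/api/[EIN]/"; the scheme is a guess.
API_BASE = "https://citizenaudit.org/api"


def normalize_ein(raw: str) -> str:
    """Strip hyphens and spaces from an EIN and check that nine digits remain."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 9:
        raise ValueError(f"expected a 9-digit EIN, got {raw!r}")
    return digits


def api_url(ein: str) -> str:
    """Build the API URL for one organization (hypothetical helper)."""
    return f"{API_BASE}/{normalize_ein(ein)}/"


def fetch_raw(ein: str, timeout: float = 30.0) -> bytes:
    """Fetch the response body as bytes; decoding is left to the caller
    because the post does not say what format the API returns."""
    with urllib.request.urlopen(api_url(ein), timeout=timeout) as response:
        return response.read()
```

Calling api_url("12-3456789") with that placeholder EIN yields a URL ending in /api/123456789/.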
The next stage of this project is to use regular expressions to extract structured data, where possible, from the text. (A more ambitious goal is to use the hOCR files, which give the bounding boxes of words, to deal with cases where we need to know exactly where the text was in a complicated page layout. If hOCR parsers come out of this project, that could be a lasting and generalized contribution to this sphere.) If you’re interested in either and have some familiarity with programming/regular expressions, please contact me.
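To make the hOCR idea concrete, here is a small Python sketch of what such a parser might look like. It is not the project’s code. It assumes tesseract-style hOCR, in which each word is an element with class ocrx_word whose title attribute begins with "bbox x0 y0 x1 y1"; the names Word, parse_hocr and words_in_region are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


# hOCR stores the bounding box in the title attribute, e.g. "bbox 100 200 260 240; x_wconf 93".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect the text and bounding box of every ocrx_word element."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Word] = []
        self._bbox = None   # bbox of the word currently being read, if any
        self._depth = 0     # nesting depth inside that word (e.g. <strong>)
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            self._depth += 1  # a nested tag inside the current word
            return
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            match = BBOX_RE.search(attr.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._depth = 1
                self._chunks = []

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:  # the word element itself just closed
            text = "".join(self._chunks).strip()
            if text:
                self.words.append(Word(text, *self._bbox))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)


def parse_hocr(markup: str) -> List[Word]:
    parser = HocrWordParser()
    parser.feed(markup)
    parser.close()
    return parser.words


def words_in_region(words: List[Word], x0: int, y0: int, x1: int, y1: int) -> List[Word]:
    """Return the words whose boxes lie entirely inside the given rectangle."""
    return [w for w in words
            if w.x0 >= x0 and w.y0 >= y0 and w.x1 <= x1 and w.y1 <= y1]
```

With word positions in hand, a form field could be read by asking for the words inside the rectangle where that field is printed, and a regular expression could then be applied only to that text.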
The project began at a civic hackathon in Washington, D.C., where Amazon had offered each participant $100 in AWS credits. I built a system where people could run a command against their credits that would spin up an EC2 instance, set up tesseract (an open-source OCR library), connect to an SQS queue and upload the results to S3. We managed to piece together several thousand dollars of free computing time, with up to 1,000 EC2 instances running. (A Python script to manage OCRing across many EC2 instances is here.)
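The shape of one such worker can be sketched in a few lines of Python. This is not the script mentioned above, only an illustration under stated assumptions: the queue and storage calls are passed in as plain functions (in a real deployment they would wrap SQS and S3 calls), the names tesseract_command, ocr_to_hocr and drain_queue are hypothetical, and the .hocr output extension assumes a reasonably recent tesseract.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Optional


def tesseract_command(image: Path, output_base: Path) -> List[str]:
    """Command line asking tesseract for hOCR output: tesseract IMAGE OUTPUTBASE hocr."""
    return ["tesseract", str(image), str(output_base), "hocr"]


def ocr_to_hocr(image: Path, workdir: Path) -> Path:
    """Run tesseract on one image and return the path of the file it wrote.
    Assumption: this tesseract version names the output OUTPUTBASE.hocr."""
    base = workdir / image.stem
    subprocess.run(tesseract_command(image, base), check=True)
    return Path(str(base) + ".hocr")


def drain_queue(
    receive: Callable[[], Optional[str]],
    process: Callable[[str], None],
    acknowledge: Callable[[str], None],
) -> int:
    """Pull jobs until the queue is empty, acknowledging each only after success.

    receive     -- returns the next job, or None when nothing is left
    process     -- downloads, OCRs and uploads one job (may raise)
    acknowledge -- removes a finished job from the queue
    """
    completed = 0
    while True:
        job = receive()
        if job is None:
            break
        try:
            process(job)
        except Exception:
            # Not acknowledged, so a real queue could hand the job to another worker.
            continue
        acknowledge(job)
        completed += 1
    return completed
```

Acknowledging a job only after it has been processed means a worker that dies partway through a document does not lose it, which matters when many short-lived instances are pulling from the same queue.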
But there are 7 million disclosures, some of them hundreds of pages long. OCRing is extremely computationally expensive, and EC2 instances are quite weak. Worse, it turned out that the S3 costs were atrocious. CitizenAudit.org has over three terabytes of data. So I built a top-of-the-line computer that will churn through documents for years, with power the main operating cost. I pieced together a 6-core, 14-terabyte machine with an overclocked, water-cooled Intel 3930K. It uses tesseract, PostgreSQL and a little Redis to manage its workload.
The text output is pushed out to an Elasticsearch server and accessed through a Django site. One challenge is whether the PDFs that CitizenAudit relies on will continue to be reliably available. Public.Resource.Org’s ability to continue to obtain them is unclear. Having a free and open repository of all nonprofits’ 990s in PDF form is more important than, and a precondition to, CitizenAudit.org, and funding for continuing to FOIA the forms from the IRS must be secured.
In all, this was a fairly simple process executed at a large scale--at least for a noncommercial side project in the public interest--but also one that endeavored mightily merely to reverse-engineer the devastation the IRS wreaked on the valuable structured data that filers submitted.
If the IRS is going to exempt certain--often enormous--groups from paying taxes, we need to know why. And the IRS needs to stop actively taking steps to make data less useful. This isn’t even a case of a legacy system--they have electronic data and won’t release it.
The IRS needs to put my site out of business and start providing bulk downloads of the structured-data Form 990s that groups give it.