Disclaimer: The opinions expressed by the guest blogger and those providing comments are theirs alone and do not reflect the opinions of the Sunlight Foundation or any employee thereof. Sunlight Foundation is not responsible for the accuracy of any of the information within the guest blog.
Luke Rosiak is a former Sunlight Foundation reporter and database analyst who now writes for the Washington Examiner. Luke is also a winner of Sunlight Foundation’s OpenGov Grants for his project, CitizenAudit. You can reach Luke on Twitter at @lukerosiak.
In return for not paying taxes, nonprofits in the U.S. file detailed financial disclosures to the IRS, listing how much of their money goes to certain categories, how much they pay their top people and what groups they give money to.
But even though large nonprofits submit structured electronic data, the IRS takes pains to convert it into paper copies and doesn’t make them available publicly at all, instead directing interested parties to request a copy from the organization itself.
Recently, tech pioneer Carl Malamud’s Public.Resource.Org began successfully filing Freedom of Information Act requests for all disclosures (990s, as they are called) and paying the IRS on a monthly basis for reams of DVDs with TIFF images. Some are scanned paper filings; for others, the IRS went out of its way to turn structured data into a mere image. None has an embedded text layer.
The information is invaluable for philanthropists, journalists and competitors, and the universe of nonprofits is enormous, including the major sports leagues, political groups, hospitals, universities and quasi-public institutions.
So I began an enormous OCRing spree using open-source tools and home-built software, and put the results in Elasticsearch and PostgreSQL on a free site. The effort, half the funding for which came from a Sunlight Foundation OpenGov grant of $5,000, is called CitizenAudit.org.
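For readers curious what such a pipeline might look like, here is a minimal sketch of the OCR step. It is not CitizenAudit's actual code: the post does not say which OCR tool was used, so the Tesseract command-line program is an assumption, and the names `ocr_tiff`, `build_document` and the record fields are hypothetical. The sketch turns the page images of one filing into a single text record that could later be indexed in a search engine or stored in a database.

```python
import subprocess


def ocr_tiff(path):
    """OCR one TIFF page by shelling out to the Tesseract CLI.

    Assumption: the `tesseract` binary is installed and on PATH.
    `tesseract <image> stdout` prints the recognized text to stdout.
    """
    result = subprocess.run(
        ["tesseract", str(path), "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def build_document(filing_id, page_paths, ocr=ocr_tiff):
    """Combine the OCR text of every page into one searchable record.

    `ocr` is injectable so the function can be exercised without
    Tesseract installed. The field names are hypothetical.
    """
    # Sort the paths so pages are joined in file-name order.
    pages = [ocr(p).strip() for p in sorted(page_paths)]
    return {
        "filing_id": filing_id,
        "page_count": len(pages),
        "text": "\n".join(pages),
    }
```

A record like this could then be sent to a full-text index and written to a relational table keyed on `filing_id`; how the real site divides work between Elasticsearch and PostgreSQL is not described here.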