A few months ago, we kicked off [a new project on GitHub](https://github.com/unitedstates/inspectors-general) called `inspectors-general`. Part of [the @unitedstates project](http://theunitedstates.io), it’s a small, public domain tool with the goal of eventually gathering reports from every inspector general (IG) in the federal government.
The `inspectors-general` project gathers reports by — you guessed it! — scraping government websites. There is no central, public website linking to all Inspector General reports. That means they need to be gathered from each and every IG website.
While not technically a hojillion, as of now we’ve downloaded over 5,000 reports using reliable scrapers for 5 agencies:
* [US Postal Service](http://www.uspsoig.gov/)
* [Department of Homeland Security](http://www.oig.dhs.gov/) (this includes FEMA, TSA, ICE, and more)
* [Office of Personnel Management](https://www.opm.gov/our-inspector-general/reports/)
* [Environmental Protection Agency](http://www.epa.gov/oig/reports.html)
* [Department of Justice](http://www.justice.gov/oig/reports/) (this includes DEA, ATF, the FBI, and more)
We’re investing in the project’s quality by making every report [searchable in our alert engine, Scout](https://scout.sunlightfoundation.com/search/documents/report?documents[document_type]=ig_report). If you’ve signed up for search alerts, you may already have gotten emails containing some.
Yet there are so many more reports to be had! The indefatigable [Matt Rumsey](http://sunlightfoundation.com/team/mrumsey/) has [compiled a spreadsheet](https://docs.google.com/spreadsheet/ccc?key=0AoQuErjcV2a0dF9jUjRSczQ5WEVqd3RoS3dtLTdGQnc&usp=sharing#gid=0) listing websites for over **70 Inspectors General**.
Read on for more about what an inspector general is, how IG offices can make these reports more accessible, and how **you** can help (if you know a tiny bit of Python).
## What’s an inspector general?
Just about every agency in the federal government has an independent unit, usually called the Office of the Inspector General, dedicated to oversight of that agency. This includes regular audits of the agency’s spending, monitoring of active government contractors and investigations into wasteful or corrupt agency practices. They ask tough questions, [carry guns](http://cnsnews.com/news/article/doj-inspectors-general-employ-3501-workers-who-can-carry-gunsmake-arrests) and sue people.
While the relationship between an IG and its agency need not be hostile, it certainly [can be](http://www.foxnews.com/us/2014/05/07/epa-inspector-general-office-to-say-its-investigations-impeded-by-unit-inside/). An IG is meant to be truly independent, with a separately determined budget and a free hand to do their work, without political or agency interference. If you’re interested in more detail, check out [IGnet’s FAQ](http://www.ignet.gov/igs/faq1.html), or read [this 2011 report by GAO](http://www.gao.gov/new.items/d11770.pdf) on the state of the government’s IGs.
For an outside-of-government perspective, read the Project on Government Oversight’s [excellent 2008 report](http://pogoarchives.org/m/go/ig/report-20080226.pdf) on how many IGs function in practice and their recommendations for improvement.
## And why would I read an IG report?
While some of an IG’s reports may make dry reading, the nature of an IG’s work — uncovering lies, prosecuting bad actors, and public fingerpointing — can make for some surprisingly gripping tales.
One particularly high profile example was a [2012 report](http://www.gsaig.gov/?LinkServID=908FFF8C-B323-14AD-270C38936310AEBD&showMeta=0) by the IG for the General Services Administration, Brian D. Miller. The report described, in great detail, lavish spending and disregard for taxpayer money by one of GSA’s branches at a regional conference. The scope and severity of the report created an outcry from Congress and the White House, and resulted in [embarrassment and resignations](http://www.washingtonpost.com/politics/gsa-chief-resigns-amid-reports-of-excessive-spending/2012/04/02/gIQABLNNrS_story.html) for GSA.
In another case, the Inspector General for the Department of Justice released [a tense report](http://www.justice.gov/oig/reports/2013/s1305.pdf) concluding that a US Attorney working at the Justice Department [leaked a memo in order to retaliate](http://www.msnbc.com/hardball/the-fast-and-furious-case-untangling-the-i) against the original whistleblower of the “Fast & Furious” [gunwalking scandal](https://en.wikipedia.org/wiki/ATF_gunwalking_scandal).
## Collecting them all
The IG community in the U.S. produces valuable, high quality reports, but they’re spread all over the Internet. In this state, it’s difficult to integrate them into any sort of project, public or private, that might get the right report in front of the right person. The most egregiously scandalous reports, like those described above, will get picked up by the press, but there’s a lot of great work that will fly under many people’s radar — journalists included.
As we’ve [said before](http://sunlightfoundation.com/blog/2012/03/21/government-do-you-really-need-an-api/), many uses of government data require first obtaining *all* the data. If you want to make so much as a basic search engine over IG reports, the first step is to get the reports in one place.
We’ve started by putting the ones we have so far in [our own search engine, Scout](https://scout.sunlightfoundation.com/search/documents/report?documents[document_type]=ig_report), but these reports are public property, and we hope others will find more uses for them.
This is just a snapshot, but you can download the ~5,500 reports we’ve collected so far in bulk, below. Each report has its original PDF, along with extracted text and JSON metadata. Be careful, it’s big:
[http://bulk.sunlightfoundation.com/inspectors-general/inspectors-general-2014-05-12.tgz](http://bulk.sunlightfoundation.com/inspectors-general/inspectors-general-2014-05-12.tgz) (3.8 GB)
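Since the archive is large, it can be handy to look inside it without unpacking everything first. Here is a minimal sketch using only Python's standard library. It makes no assumption about how the archive is laid out internally; it just streams through the file and counts entries by extension. The function name `count_by_extension` is ours for illustration and is not part of the project.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_by_extension(archive_path):
    """Stream a gzipped tar archive and count regular files by extension."""
    counts = Counter()
    # "r|gz" reads the archive as a forward-only stream, so nothing is
    # extracted to disk and the whole file never has to sit in memory.
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            if member.isfile():
                suffix = PurePosixPath(member.name).suffix.lower()
                counts[suffix] += 1
    return counts


# Example use, assuming the bulk file was downloaded to the current directory:
#   counts = count_by_extension("inspectors-general-2014-05-12.tgz")
#   print(counts.most_common())
```

If the archive matches the description above, you would expect to see PDF, text and JSON files turn up in the counts, but check the output rather than taking our word for it.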
## How you can help
If you have a bit of Python experience and spare time, you can hop over to [github.com/unitedstates/inspectors-general](https://github.com/unitedstates/inspectors-general), pick an IG from [the list](https://docs.google.com/spreadsheet/ccc?key=0AoQuErjcV2a0dF9jUjRSczQ5WEVqd3RoS3dtLTdGQnc&usp=sharing#gid=0) that no one’s done yet, write a scraper to download its reports and file a pull request when you’ve got something working. There’s no consistency between IG websites; some are relatively straightforward to scrape, and others can be quite tricky.
To help, the project is already set up with tools that do much of the heavy lifting (downloading reports, extracting text, handling errors). All you have to do is write the code to navigate the IG’s website, identify key data (such as the report’s title and publication date) and tell the project to save that report.
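To give a feel for the "identify key data" part, here is a rough sketch using only Python's standard library. It does not use the project's own helpers, and everything in it is hypothetical: the page layout (a PDF link followed by a `<span class="date">`), the `oig.example.gov` address, and the names `ReportListParser` and `extract_reports`. Real IG websites will each need their own variation.

```python
from datetime import datetime
from html.parser import HTMLParser
from urllib.parse import urljoin


class ReportListParser(HTMLParser):
    """Collect PDF links and the date that follows each one (hypothetical layout)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.reports = []
        self._capture = None  # "title", "date", or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        if tag == "a" and href.lower().endswith(".pdf"):
            # A link to a PDF starts a new report record.
            self.reports.append(
                {"url": urljoin(self.base_url, href), "title": "", "published_on": None}
            )
            self._capture = "title"
        elif tag == "span" and attrs.get("class") == "date" and self.reports:
            self._capture = "date"

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._capture = None

    def handle_data(self, data):
        if self._capture == "title":
            self.reports[-1]["title"] += data
        elif self._capture == "date" and data.strip():
            # Assumes dates are written like "May 1, 2014".
            parsed = datetime.strptime(data.strip(), "%B %d, %Y")
            self.reports[-1]["published_on"] = parsed.date().isoformat()


def extract_reports(html, base_url):
    """Return one dict per PDF link found in a listing page."""
    parser = ReportListParser(base_url)
    parser.feed(html)
    parser.close()
    for report in parser.reports:
        # Collapse stray whitespace and line breaks inside link text.
        report["title"] = " ".join(report["title"].split())
    return parser.reports


# A made-up listing page, standing in for a real IG website.
SAMPLE_PAGE = """
<ul>
  <li><a href="/reports/2014/audit-14-01.pdf">Audit of Travel Card Use</a>
      <span class="date">May 1, 2014</span></li>
  <li><a href="/about.html">About the office</a></li>
  <li><a href="/reports/2014/audit-14-02.pdf">Review of Contract Oversight</a>
      <span class="date">April 15, 2014</span></li>
</ul>
"""
```

Calling `extract_reports(SAMPLE_PAGE, "https://oig.example.gov/reports/")` yields one record per PDF link, each with an absolute URL, a title and an ISO-formatted date. In the real project, a record like that is what you would hand off to be downloaded and saved.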
Check out the [instructions for contributing a scraper](https://github.com/unitedstates/inspectors-general#contributing-a-scraper) for the gritty details.
And a big **thanks** to [Lindsay Young](http://sunlightfoundation.com/team/lyoung/), [Travis Briggs](http://boxofmonocles.com) and [Andrew Dai](http://andrewdai.co), each of whom has graciously written a scraper and greatly contributed to our current collection.
## How the IG community can help
Clearly, having to write and maintain 70 scrapers just to collect our nation’s independent oversight reports in one place is not ideal.
As our [Open States](http://openstates.org) project taught us, reverse engineering so many websites is loads of work and entropy — the exact sort of situation that private companies that sell information routinely take advantage of to put [a price on access](http://sunlightfoundation.com/blog/2010/10/16/the-price-of-access/).
The IG community can greatly improve how they distribute their work to the public.
* The first step is to publish **machine-readable feeds of reports** to syndicate **all reports** (not just recent reports). [RSS](http://www.howto.gov/social-media/rss) is one common way of doing this. Besides allowing the public to follow these feeds in “feed readers” like [NewsBlur](http://newsblur.com/), comprehensive RSS feeds also allow programs to download every report, and stay up to date automatically and efficiently. Whatever machine-readable format is chosen, publishing recent reports only **is not enough**: the entire archive of reports should be discoverable this way.
* This feed must include **reliable metadata**. At a bare minimum, this means the publication date, title, a unique report ID, and a URL to download the report. Including a summary of the report is also helpful. The RSS format [has fields for all of those](https://en.wikipedia.org/wiki/RSS#Example). If an IG oversees multiple “components” of a large agency, it should include which component the report is in reference to.
* Publish reports themselves as **open, machine-friendly documents**. While fully structured report bodies are less important than structured syndication of reports, the more text-based a report is, the more value can be had from it. HTML is a great format for publishing reports, but if an IG must publish PDFs, it must ensure every word in that PDF can be easily extracted. In other words: **[no printing and scanning](http://sunlightfoundation.com/blog/2014/04/15/sunlight-and-allies-talk-fara-reform-with-the-department-of-justice/)**, even if redactions must be made.
These are each steps that individual inspectors general can take, without having to coordinate with others or to establish standard formats or URLs.
However, the biggest gains would be had by either **publishing reports to a central repository**, or by **standardizing the URLs and format** of reports across IGs. If 70 IGs publish reports on 70 different websites, in 70 different ways, then it will still be necessary to maintain 70 different programs for keeping up with their work.
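Whichever route is taken, a comprehensive feed with the metadata listed above would shrink each of those programs to a few lines. Here is a minimal sketch, assuming a plain RSS 2.0 feed. The sample feed, the `oig.example.gov` URLs, the report IDs and the function name `read_feed` are all made up for illustration, and using RSS's `<category>` element to carry the agency component is our assumption rather than any established convention.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# The bare-minimum metadata: title, download URL, unique ID, publication date.
REQUIRED = ("title", "link", "guid", "pubDate")


def read_feed(xml_text):
    """Turn an RSS 2.0 document into a list of report records."""
    root = ET.fromstring(xml_text)
    reports = []
    for item in root.iter("item"):
        fields = {child.tag: (child.text or "").strip() for child in item}
        missing = [name for name in REQUIRED if not fields.get(name)]
        if missing:
            raise ValueError("feed item is missing: " + ", ".join(missing))
        # RSS dates use the RFC 822 style, e.g. "Mon, 12 May 2014 09:00:00 GMT".
        published = parsedate_to_datetime(fields["pubDate"])
        reports.append({
            "report_id": fields["guid"],
            "title": fields["title"],
            "url": fields["link"],
            "published_on": published.date().isoformat(),
            "summary": fields.get("description", ""),
            "component": fields.get("category", ""),  # assumed convention
        })
    return reports


# A made-up feed showing the fields an IG office would need to publish.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example OIG Reports</title>
    <link>https://oig.example.gov/reports/</link>
    <description>All reports, not just recent ones</description>
    <item>
      <title>Audit of Travel Card Use</title>
      <link>https://oig.example.gov/reports/OIG-14-01.pdf</link>
      <guid>OIG-14-01</guid>
      <pubDate>Mon, 12 May 2014 09:00:00 GMT</pubDate>
      <description>Summary of findings on travel card spending.</description>
      <category>Office of Administration</category>
    </item>
    <item>
      <title>Review of Contract Oversight</title>
      <link>https://oig.example.gov/reports/OIG-14-02.pdf</link>
      <guid>OIG-14-02</guid>
      <pubDate>Thu, 03 Apr 2014 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""
```

Calling `read_feed(SAMPLE_FEED)` returns two records, and a feed item that lacks any of the required fields raises an error instead of slipping through silently. Compare that with writing and maintaining a custom scraper for every website.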
## Too many reports; didn’t read
Wow, what a long piece about downloading inspector general reports! To sum up:
* Inspectors general do important, vigorous oversight work on behalf of taxpayers.
* Their reports are spread all over the Internet, without any coordination or consistency.
* Collecting them in one place opens the door for powerful new uses, and a wider audience.
* You can make a difference by [contributing a Python scraper](https://github.com/unitedstates/inspectors-general) to help download them. Novices welcome!
* The IGs can make a difference by modernizing and standardizing their publishing systems.
If you’d like to get involved or discuss the project, the best place to write in is [the project’s GitHub forum](https://github.com/unitedstates/inspectors-general/issues) (no tech expertise required).