Recently, a project started to gather the work of every inspector general (IG) in the U.S. government by using web scrapers. This effort has now hit a major milestone, gathering the reports of every U.S. federal IG that publishes them: 65 inspectors general with over 18,000 reports.
Why we’ve collected a hojillion inspector general reports
While not technically a hojillion, as of now we've downloaded over 5,000 reports using reliable scrapers for 5 agencies.
Opening data: Have you checked your pipes?
Almost every technical project (and every idea for one) has an initial cost known as ETL. So why aren't we talking about it?
Sample the new, à la carte, Congressional Record parser
Introducing congressional-record! This is a project that can parse the flat text of the Congressional Record from the Government Printing Office's HTML files and produce bulk XML data for the entirety of the digital record — no database required.
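To make the idea concrete, here is a minimal sketch of turning flat text into XML without a database. It is not the congressional-record project's actual code: the `text_to_xml` function, the `SPEAKER` pattern, and the `record`, `speech`, and `narrative` element names are all assumptions chosen for illustration, and the sample text is invented.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: a speaker turn begins with "Mr./Ms./Mrs. NAME. " in caps.
SPEAKER = re.compile(r"^\s*((?:Mr|Ms|Mrs)\. [A-Z][A-Z' -]+)\. (.*)$")


def text_to_xml(flat_text: str) -> str:
    """Group lines of flat text into <speech> and <narrative> elements."""
    root = ET.Element("record")
    current = None  # the <speech> element currently being filled, if any
    for line in flat_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = SPEAKER.match(line)
        if match:
            # A new speaker turn starts here.
            current = ET.SubElement(root, "speech", speaker=match.group(1))
            current.text = match.group(2).strip()
        elif current is not None:
            # Continuation line of the current speech.
            current.text += " " + line.strip()
        else:
            # Text that appears before any speaker is plain narrative.
            ET.SubElement(root, "narrative").text = line.strip()
    return ET.tostring(root, encoding="unicode")


# Invented sample input, shaped loosely like a floor transcript.
sample = """\
  The House met at 10 a.m.
  Mr. SMITH. Mr. Speaker, I rise today
in support of the bill.
  Ms. JONES. I yield back.
"""

xml_output = text_to_xml(sample)
print(xml_output)
```

Everything stays in memory and comes out as a single XML string, which is the sense in which a parser like this needs no database. A real parser would have to handle many more line types than this sketch does.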
How 60,000 bills tried to become law – in one graph
While working on OpenCongress, one of the questions we've been tackling is how close a given bill is to actually passing into law. So we made a graph to help find out.
A Modern Approach to Open Data
Last year, a group of us who work daily with open government data -- Josh Tauberer of GovTrack.us, Derek Willis at The New York Times, and myself -- decided to stop each building the same basic tools over and over, and start building a foundation we could share.

We set up a small home at github.com/unitedstates, and kicked it off with a couple of projects to gather data on the people and work of Congress. Using a mix of automation and curation, they gather basic information from all over the government -- THOMAS.gov, the House and Senate, the Congressional Bioguide, GPO's FDSys, and others -- that everyone needs to report, analyze, or build nearly anything to do with Congress.

Once we centralized this work and started maintaining it publicly, we began getting contributions almost immediately. People educated us on identifiers, fixed typos, and gathered new data. Chris Wilson built an impressive interactive visualization of the Senate's budget amendments by extending our collector to find and link the text of amendments.

This is an unusual, and occasionally chaotic, model for an open data project. github.com/unitedstates is a neutral space; GitHub's permissions system allows many of us to share the keys, so no one person or institution controls it. What this means is that while we all benefit from each other's work, no one is dependent on, or "downstream" from, anyone else. It's a shared commons in the public domain.

There are a few principles that have helped make the unitedstates project something that's worth our time, which we've listed below.
Integrating the US’ Documents
A few weeks ago, we integrated the full text of federal bills and regulations into our alert system, [Scout](https://scout.sunlightfoundation.com). Now, if you visit [CISPA](https://scout.sunlightfoundation.com/item/bill/hr624-113) or a fascinating [cotton rule](https://scout.sunlightfoundation.com/item/regulation/2013-10114), you'll see the original document - nicely formatted, but also well-integrated into Scout's layout.

There are a lot of good reasons to integrate the text this way: we want you to see why we alerted you to a document without having to jump off-site, and without clunky iframes. Just as importantly, we wanted to do this in a way that would be easily reusable by other projects and people.

So we **built a tool called [us-documents](https://github.com/unitedstates/documents)** that makes it possible for anyone to do this with federal bills and regulations. It's [available as a Ruby gem](https://rubygems.org/gems/us-documents), and comes with a [command line tool](https://github.com/unitedstates/documents#usage) so that you can use it with Python, Node, or any other language. It lives inside the [unitedstates project](https://github.com/unitedstates) at [unitedstates/documents](https://github.com/unitedstates/documents), and is entirely public domain.