The Library of Congress Really Really Does Not Want To Give You Your Data

by Eric Mill

Sep 27, 2013 12:26 pm

It’s 2013, and the Library of Congress seems to think releasing public data about Congress is a risk to the public.

The Library of Congress is in charge of [THOMAS.gov](http://thomas.loc.gov/), and its successor [Congress.gov](http://congress.gov). These sites publish some of the most fundamental information about Congress — the history and status of bills. Whether it’s immigration law or SOPA, patent reform or Obamacare, the Library of Congress will tell you: *What is Congress working on? Who’s working on it? When did that happen?*

Except they won’t let you download that information. Instead, popular websites like [GovTrack](https://www.govtrack.us), widely used services like [Sunlight’s](http://sunlightlabs.github.io/congress/), and world-class newspapers like [the New York Times](http://politics.nytimes.com/congress) are forced to design complicated, error-prone systems that extract what data they can from the Library of Congress’ web pages. It’s a lot of work, but it’s a necessary burden for anyone outside Congress who wishes to use that data to inform and empower the public.
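To make "complicated, error-prone" concrete, here is a minimal sketch in Python. Both snippets of markup are invented for illustration: neither is real THOMAS output nor a format the Library has published, and the function names are hypothetical. The point is only the contrast between guessing at page layout and reading a structured record.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical HTML fragment, invented for illustration (not real THOMAS markup).
HTML_PAGE = """
<html><body>
<b>Latest Major Action:</b> 9/26/2013 Referred to House committee.<br>
<b>Sponsor:</b> Rep Example, Jane [XX-1]
</body></html>
"""

# Hypothetical XML record, also invented (no such schema is described in this post).
XML_RECORD = """
<bill number="H.R. 0000">
  <sponsor>Rep Example, Jane</sponsor>
  <latest-action date="2013-09-26">Referred to House committee.</latest-action>
</bill>
"""


def scrape_latest_action(html):
    """Pull the latest action out of page markup with a regex.

    Returns None when the page layout does not match the pattern,
    which is exactly what happens after a redesign.
    """
    match = re.search(r"<b>Latest Major Action:</b>\s*(\S+)\s+(.*?)<br>", html)
    if match is None:
        return None
    return {"date": match.group(1), "text": match.group(2).strip()}


def parse_latest_action(xml_text):
    """Read the same fact from a structured record by element name."""
    root = ET.fromstring(xml_text.strip())
    action = root.find("latest-action")
    return {"date": action.get("date"), "text": action.text.strip()}


if __name__ == "__main__":
    print(scrape_latest_action(HTML_PAGE))
    # A purely cosmetic change to the page silently breaks the scraper.
    redesigned = HTML_PAGE.replace("<b>", "<strong>").replace("</b>", "</strong>")
    print(scrape_latest_action(redesigned))
    print(parse_latest_action(XML_RECORD))
```

The scraper depends on presentation details (a bold tag, a line break, the date's position) that a publisher can change at any time without notice. The structured version depends only on named fields.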

For years now, the Sunlight Foundation, Josh Tauberer of GovTrack, and a host of others from the open government community have been [pushing the Library of Congress to remove this burden](http://sunlightfoundation.com/blog/2012/02/02/bulk-data-at-the-house-legislative-data-conference/) by publishing downloadable, machine-readable bulk data to the public.

This pressure is a big part of what led to the [creation of the Bulk Data Task Force](http://www.speaker.gov/press-release/house-leaders-back-bulk-access-legislative-information) by the House of Representatives in 2012. The Task Force soon asked the Library of Congress to produce a cost estimate on publishing just a tiny piece of this data — hand-written bill summaries — online in XML.

The Library’s cost estimate, tucked away in six pages of a 913-page committee report, was written in [October 2012](https://gist.github.com/konklone/6690800) but publicly released only in July 2013. In the middle is a jarring portrayal of the use of open government data as a risky, inaccurate disservice to the public:

> Once the information is hosted and “mashed up” by third parties, there exists no method for ensuring that the information has not been tampered with or innocently misinterpreted. Furthermore, distribution of bulk data will likely result in multiple alternative stores of legislative information that, to varying degrees are not as timely, and therefore as accurate, as Congress’ primary systems. If there is an obligation to inform the general public to the risks of non-authoritative versions of the information, it has not been included in the estimates. [Emphasis added.]

The estimate also devotes [an entire paragraph](https://gist.github.com/konklone/6690800#support) to scare language, warning the reader that the release of bill summaries in XML could lead directly to unstoppable public demand for CRS reports, forever compromising the integrity and quality of CRS’ work.

Finally, the bill summaries the Library is grudgingly, ominously releasing are [only those that apply to bills originating in the House](https://gist.github.com/konklone/6690800#house-only). Anybody wishing to use these summaries to evaluate the work of Congress will find a weakened and incomplete dataset.

The Library appears to see public reuse of government data as, well, simply too risky. Somebody might mess something up! This is a narrow attitude, and a damaging policy decision. As [Josh Tauberer](https://twitter.com/JoshData/status/383217004066242560) of GovTrack put it in [his letter to the Library and to the Speaker of the House](http://razor.occams.info/pubdocs/2013-09-26%20Letter%20to%20the%20Librarian.pdf) today:

> …it is reprehensible that the Library would discredit other sources of information as inherently risky, as if instead of a free press the public ought to rely on the Library alone as its only source of risk-free information about the legislative branch.

The Library need not feel so threatened. Congress.gov will always be the invaluable source of official record for legislative information, no matter what “alternative stores” arise. In fact, those stores arose years ago! GovTrack and Sunlight (among others) have been [giving away Congressional data](http://sunlightfoundation.com/blog/2013/08/20/a-modern-approach-to-open-data/) on the streets of the Internet, for free, for years, and it hasn’t removed an ounce of legitimacy from the official sources.

What **has** happened is that, in the absence of reliable downloadable data from the Library of Congress, the public has learned to get that [data](https://github.com/unitedstates/congress/wiki) [from](http://developer.nytimes.com/docs/congress_api) [other](https://www.govtrack.us/developers/api) [sources](http://sunlightlabs.github.io/congress/). This includes Congress itself — you can find GovTrack, in particular, [referenced](http://donyoung.house.gov/news/documentsingle.aspx?DocumentID=155578) and [embedded](http://kinzinger.house.gov/index.cfm?sectionid=74) all over members’ websites. And, as we learned at this year’s [Legislative Data and Transparency Conference](https://cha.house.gov/2013-legislative-data-and-transparency-conference), the House Democrats use GovTrack data to power their own internal information systems, precisely because of how difficult it is to get data from the Library:

> DemCom—the intranet site for House Dems—uses @govtrack as its main source of legislative data. Better than scraping THOMAS themselves. #LDTC
>
> Harlan Yu (@harlanyu), May 22, 2013

In recent years, every other part of Congress has found reasons to offer tremendously valuable data to the public. [House votes](http://clerk.house.gov/evs/2013/index.asp) and [Senate votes](http://www.senate.gov/pagelayout/legislative/a_three_sections_with_teasers/votes.htm) are now published in XML in nearly real time. The House has created an ambitious repository of committee documents and upcoming floor activity at [docs.house.gov](http://docs.house.gov/). The Government Printing Office’s [FDSys](http://www.gpo.gov/fdsys/) has grown into a [vitally important](http://sunlightfoundation.com/blog/2013/02/07/keeping-gpos-data-free/) source of bulk data for the United States. And most recently, [the US Code was released as XML](http://sunlightfoundation.com/blog/2013/07/30/the-u-s-code-arrives-in-xml/).

Going forward, the Senate should step up and join the House in demonstrating leadership on opening data to the public. The Senate can start by telling the Library of Congress that it has no objection to the Library publishing its Senate data alongside that of the House.

The Library of Congress still has the opportunity to join the open government community as the data provider of official record. To do that, the Library will have to stop making excuses, and embrace its responsibility to the public.