Comments for the Record
from Daniel Schuman
Policy Counsel of the Sunlight Foundation
Director of the Advisory Committee on Transparency
for the Committee on Appropriations
Subcommittee on Legislative Branch
United States House of Representatives
on the Budget for
the Library of Congress and
the Government Printing Office,
regarding bulk access to
THOMAS legislative information
February 6, 2012
Comments of the Sunlight Foundation
Chairman Crenshaw, Ranking Member Honda, and members of the Committee, thank you for the opportunity to submit comments on the budget for the Library of Congress and the Government Printing Office.
I am the Policy Counsel for the Sunlight Foundation, a non-partisan non-profit dedicated to using the power of the Internet to increase government openness and transparency, and Director of the Advisory Committee on Transparency, a project of the Sunlight Foundation that brings together organizations from across the political spectrum in support of the Congressional Transparency Caucus’ mission of educating policymakers on transparency issues.
Today’s comments are focused on the failure of the Library of Congress to meet Congress’s charge to “report on the feasibility” of “enhancing public access to legislative documents, bill status, summary information, and other legislative data through more direct methods such as bulk data downloads and other means of no-charge digital access to legislative databases.” Four years have elapsed since the Library said it “would look into the issue” in response to congressional prompting; three years have passed since appropriators directed the Library to undertake a study; and I testified about ongoing failures to make progress on bulk access to THOMAS data before this Committee last May.
Providing bulk access to data means that users can download all the information contained in a database at once. By contrast, an Application Programming Interface, or API, allows computers to ask a database for specific information. THOMAS does not support either of these technologies. Instead, programmers must build tools called web scrapers that simulate a person going to each page of a website, copying that information into a database, and then trying to put those results into context. This is very hard to do automatically, particularly with large quantities of information, and the scrapers often break or take a lot of time to gather all the necessary information.
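The difference between these three access models can be illustrated with a minimal sketch. Everything in it is hypothetical: the bill records, field names, page layout, and function names are invented for illustration and do not describe any actual THOMAS, Library of Congress, or GPO interface.

```python
import re

# Hypothetical stand-in for a legislative database; all records are invented.
BILLS = {
    "hr1": {"title": "Example Act One", "status": "Introduced"},
    "hr2": {"title": "Example Act Two", "status": "Passed House"},
    "s1": {"title": "Example Act Three", "status": "Referred to Committee"},
}


def bulk_download():
    """Bulk access: a single request returns every record at once."""
    return dict(BILLS)


def api_lookup(bill_id):
    """API access: a program asks for one specific record by identifier."""
    return BILLS.get(bill_id)


def render_page(bill_id):
    """What a website shows a human reader: data wrapped in page markup."""
    bill = BILLS[bill_id]
    return (
        f"<html><body><h1>{bill['title']}</h1>"
        f"<p>Latest action: <b>{bill['status']}</b></p></body></html>"
    )


def scrape(bill_ids):
    """Scraping: visit each page in turn and pull the data back out of the markup.

    The patterns below depend on the exact (assumed) page layout, so any
    redesign of the page breaks them.
    """
    results = {}
    for bill_id in bill_ids:
        html = render_page(bill_id)  # one page fetch per bill
        title = re.search(r"<h1>(.*?)</h1>", html)
        status = re.search(r"Latest action: <b>(.*?)</b>", html)
        if title is None or status is None:
            raise ValueError(f"page layout changed for {bill_id}")
        results[bill_id] = {"title": title.group(1), "status": status.group(1)}
    return results
```

In this sketch, `scrape(list(BILLS))` recovers the same records that `bulk_download()` returns directly, but only by fetching and parsing one page per bill, and only for as long as the page layout stays the same.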
Recognizing this problem and the importance of public access to information, the government already provides bulk access to many datasets. The Government Printing Office, one of the entities responsible for THOMAS, has published six legislative datasets online in bulk, including the Code of Federal Regulations and the Federal Register. Data.gov, which provides the public bulk access to government data, contained 3,824 “high value” datasets as of February 3, 2012, with 1.5 million data downloads in the last year. Compared to this veritable feast of information, THOMAS provides only a small morsel at a time.
As I mentioned earlier, there are ongoing efforts to scrape THOMAS, but these methods are error-prone, onerous, slow, and fragile. Even so, the scraped data is gathered and used by websites like GovTrack.us and OpenCongress.org, which increase the audience for congressional information by providing better user interfaces and adding important context. This data is often used on mobile platforms, too. The Sunlight Foundation’s Congress app for Android smartphones has been downloaded over 400,000 times.
A variety of non-government developers are extending the reach and value of legislative information. Much important information is being made available at no cost to the public, and its dissemination improves everyone’s awareness of what’s going on in Congress. These private-sector efforts are necessarily limited, however, because of the difficulty of getting the data from the Library and GPO in the first place. Legislative support agencies should recognize that aiding non-governmental efforts to disseminate legislative information is a crucial component of their public service mission.
Congress has already recognized the importance of sharing legislative data broadly. In 2009, Congress adopted a forward-thinking approach that would have required an examination of granting the American people access to the entirety of the legislative archives at once – via “bulk” access – in its explanatory statement accompanying the Omnibus Appropriations Act of 2009. It said:
"Public Access to Legislative Data.--There is support for enhancing public access to legislative documents, bill status, summary information, and other legislative data through more direct methods such as bulk data downloads and other means of no-charge digital access to legislative databases. The Library of Congress, Congressional Research Service, and Government Printing Office and the appropriate entities of the House of Representatives are directed to prepare a report on the feasibility of providing advanced search capabilities. This report is to be provided to the Committees on Appropriations of the House and Senate within 120 days of the release of Legislative Information System 2.0."
The House had initially wanted to go further, proposing a report from the Library of Congress within 90 days of enactment of the 2009 legislation, but the requirement was changed to no later than 120 days after the release of LIS 2.0. The report was expected to be released during the first part of 2009. Three years later, the Library has apparently ignored Congress’s mandate.
Movement has been so slow that the House of Representatives has been able to build and implement a system within a year that makes many primary House documents available online in bulk, with more information to go online soon. A major focus of the exemplary House Legislative Data and Transparency Conference, hosted by the Committee on House Administration on February 2, 2012, was the importance of bulk access to legislative information. The Library of Congress and GPO are being left in the dust. They must be prompted to act.
Times have changed since the Committee's original unheeded directive, and we request your renewed attention. We urge the Committee to direct the Library of Congress, the Government Printing Office, and the Congressional Research Service – or the agencies that now have responsibility for THOMAS – to provide bulk access to legislative documents, bill status, summary information, and other legislative data within 120 days. In addition, we ask for the immediate creation of an advisory committee, composed of members of these agencies and members of the public, that meets regularly to address the public's need for access to this information and the means by which it is provided. Only sustained attention can ensure that we finally make progress.
I appreciate the opportunity to draw your attention to this matter, and I welcome any questions that you may have. I can be contacted at email@example.com or 202-742-1520 x 273.
(Please note that footnotes are omitted. See the original for the full text.)