Sunlight Foundation

Bulk Data at the House Legislative Data Conference

Many of us from Sunlight have been at the House's legislative data conference today, as Daniel has noted on the blog. The conference organizers have done a fantastic job -- the day has been like an all-day committee hearing, where the House's tech officials are the witnesses, and the public gets to ask the questions. This is exactly the sort of good faith attempt to take responsibility for data policy that we wrote about in 2007 with the Open House Project report. It's extraordinary for the leading providers of third party legislative information systems to sit as peers among the administrators, staff, and politicians responsible for how the House shares its work with the public. If that praise seems effusive, it should be; the House is setting an example for how to work with NGOs on data availability.

That's not to say everything we're hearing is good news.

The morning's last panel featured the leaders of the offices responsible for most legislative data processes -- like the Office of the Law Revision Counsel, the Law Library of Congress, and the Government Printing Office. We saw valuable new projects -- mobile sites, web redesigns, and incremental improvements in data publication. These are all worthy efforts, and they show the legislative support bureaucracy adapting to new expectations for online information.

In cultivating these projects, though, these offices are also choosing to ignore another responsibility: their role in providing the data about Congress that enables third party web publishers (like Sunlight) to do their jobs. The officials were asked (by a number of us from Sunlight) why they still haven't begun publishing bulk legislative data, and their answers were telling: it's not a priority, and they're more concerned about accuracy.

These answers were a bit of a surprise for me, since Sunlight has been persistently asking for bulk legislative data since 2007. These agencies have seen letters from Members and leadership, appropriations language requiring a report on feasibility, a bill proposing to force the issue, public criticism, and steadfast activism from our colleagues like Josh Tauberer (of Popvox and GovTrack) and David Moore (of OpenCongress). Even with all that attention, we've been met with a shrug.

The people responsible for publishing this information should get a little more familiar with the third party publishers who are reusing and re-presenting congressional information. Right now, people are researching legislation and the records of their representatives using both official sites (like THOMAS) and third party sites like OpenCongress.org, GovTrack.us, Popvox.com, WashingtonWatch, or Congress. Third party sites aren't going away -- they're essential to activists and analysts who rely on access to information that official congressional sites will never provide. Official and third party sites should be capable of coexisting amicably, reinforcing each other's role and mission.

By declining to provide bulk access to legislative data, support agencies are actually ensuring that third party sites will continue to rely on a brittle, complex system of scraping and parsing, where legislative data lags behind the official version, and errors from official sources spread even after they're corrected. Whatever concerns the LOC has about reliable data, the publishing system they're relying on now is probably worse. By withholding bulk data, they're creating the liabilities they warn against: the public relying on slightly less reliable data.

Part of Congress's job should be to empower third party developers, who are a permanent part of the infrastructure that brings legislative data to a huge slice of the public. By ignoring the public's and Congress's calls for bulk legislative data, administrators are ignoring part of what it means to be a responsible steward of public data. That definition has changed, and this morning showed that we've got a lot of work left to do to demonstrate that bulk data does in fact fall squarely within those responsibilities.

Update: See our wiki page for more resources regarding how to improve THOMAS.

State Level Data Opening Up

This is from US Deputy CTO for Open Government Beth Noveck:

Inspired by the President’s call for more open government, the Commonwealth of Massachusetts launched its data catalogue, following in the footsteps of Washington, DC, San Francisco, New York, and elsewhere around the country (as well as cities in Canada and the UK), to provide public access to information by and about government. What makes this exciting is not merely having transportation information available in machine-readable formats, but that professional and amateur enthusiasts can then get together, as they did last weekend, to create new software applications and data visualizations to better enable public transit riders to track arrival times for the next subway, bus, or ferry. Publishing government information online facilitates this kind of useful collaboration between government and the public that transforms dry data into the tools that improve people’s lives. (For another great example, check out what happened when we published the Federal Register for people to use.)

The National Association of State CIOs is helping to spur this movement toward greater data transparency at the state level by publishing “Guidance for Opening the Doors to State Data.”

As my colleague John Wonderlich wrote earlier this year in a post about the importance of Recovery.gov, "The Internet has been recognized as having a central -- even fundamental -- role in enabling oversight and public access." In the case of state and city level bulk data, you can make the case that the Internet is also being recognized as making people's lives easier. Just ask anyone here in Washington, D.C. about checking Metro and bus arrivals online or on their iPhone or Android. Our city releases a large amount of information that can be turned into useful applications that make the lives of Washingtonians easier. If you don't live in New York or San Francisco or Toronto or Vancouver or any other city that has bulk data access, you should really be advocating for it.