Sunlight Foundation

Benchmarks for Measuring Success for Legislative Data Transparency

The following are my notes for remarks I delivered at the House Legislative Data and Transparency Conference on February 2, 2012. They've been updated to include hyperlinks, but were delivered largely as written. The official page for the conference, with video, is here.

Thank you to Matt Lira and Steve Dwyer for the introduction, and to the House of Representatives for holding such an important and timely conference. This kind of event has been a long time in coming.

I must acknowledge the excellent panels that have been happening all day. And I would be remiss if I didn't commend the Committee on House Administration for adopting "standards for the electronic posting of house and committee documents and data," which are already transforming the House in a very positive way.

Because I'm limited to 10 minutes, let me briefly commend three documents to all of you which lay out a transparency vision in greater breadth and detail than is possible here. They are the Open House Project Report, the Ten Principles for Opening Up Government Data, and the report from the Congressional Facebook Hackathon.

I've been asked to speak about benchmarks for measuring success in making legislative data available online. I feel like a kid in a candy store, but I will try to restrain myself.  When I speak about the House, please construe my remarks as applying to the Senate and the legislative support agencies as well.

 

What is Transparency For?

In determining benchmarks, it's incumbent on us to assess, at least briefly: what good is online transparency anyway? Here's how I see transparency adding value to our political process. It provides relevant information to decisionmakers at the time they need it. It levels the playing field between the special interests and everyone else so we all have an equal opportunity to find out what's going on. It lets the American people and their elected representatives have a solid basis for a conversation about priorities. It helps Congress work more efficiently, by eliminating redundancies and identifying bottlenecks. It allows the agencies to better understand what they're supposed to do. It helps businesses make money by improving their ability to predict government actions. And most importantly, transparency is the cornerstone of a democracy.

This is all pretty ethereal, so I'll get to the point. To the maximum extent possible, legislative information must be available online, in real time, and in machine readable formats. With the exception of internal deliberations protected by the speech or debate clause, or national security and some personnel matters, the Congress's business is the people's business. So let me break down this formulation of online, in real time, and in machine readable formats into concrete benchmarks.

 

Online Publication

Publishing information online is a major hurdle in and of itself. A lot of information isn't online, but instead is only available if you know the right person, or go to the right room and ask for a hardcopy, and so on. Should you have to know someone on staff to get a copy of the chairman's mark on a bill before it's voted on? Do we really want to make people trudge down to the House's Legislative Resource Center to print out documents at 10 cents a page? It certainly cannot make any sense to have to request a CRS report through your representative or pay 20 bucks online to buy a copy.

Almost as bad as the failure to publish online is secrecy through obscurity. If information is locked inside an image file and not susceptible to a search engine, or is in an entirely random location, or is hidden on page 400 of the Congressional Record, it's not really helpful to anyone.

In addition, old information can be just as important as newly created information. For example, there's a huge gap in the availability of committee reports. Along the same lines, while ignorance of the law is no defense for a crime, the actual enactment of the law, known as the Statutes at Large, is not available online for a nearly 80-year period.

Let me offer some concrete benchmarks by which we can judge improvements on this.

  1. The House of Representatives should conduct an audit of all the different types of information it produces and releases, including whether it's online, and where it can be found.

  2. To the extent the House (or legislative support agencies) has information that is already in electronic format -- from the documents in the Clerk's office to CRS reports to hearing transcripts -- that information should be put online in whatever format it's currently in. It's also worth considering whether legislative data should include sometimes-released items like Dear Colleagues and Whip notices. We can worry later about improving how this information is made available, but just to start, put them online.

 

Real-time publication

Moving on, let's now talk about real-time publication. This is the kind of idea that makes a lot of people uncomfortable, but I'd suggest a common-sense starting point: think about the time frame and context in which a document is used. An amendment that's going to be voted on in 2 hours needs to be online just as soon as it's drafted. A bill that's going to be voted on in 2 legislative days needs to go up pretty quickly as well. You should know about a committee hearing a week in advance. Other items, like the House disbursement reports, can take a little longer.

Don't get me wrong. The goal should be real-time publication for everything. The evaluation of what that means in the short term can be context dependent, but that context changes if the document is originally created in digital format -- in that circumstance, there shouldn't be any wait.

Here are some benchmarks:

  1. All committee reports, amendments, and bills should be available online as they are introduced. The House should monitor the lag time between introduction and when they appear on THOMAS or the committee websites. I've done this, and it can be a while before some bills show up. Evaluate the extent of the problem, and work to reduce it.

  2. All hearing notices should be available online 7 days prior to the hearing.

  3. Many committees are skirting House rules about publishing video of hearings. House appropriators are particularly guilty of this. The House should review whether meetings are being held in rooms where video capability exists natively or could be added through use of the House's video service, and pester the committees if they're opting out of recording. When only one meeting in a particular committee is going on at a time, it should be streamed online so long as it is open to the public. It's time to review behavior and start slapping some wrists. Perhaps the House should create a mechanism for the public to report on non-webcast hearings.
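As a rough illustration of the first benchmark, here is a minimal sketch of how the lag between introduction and online publication might be tracked. The records, identifiers, and timestamps are invented for the example, and the function names are hypothetical; a real audit would draw on whatever logs the House actually keeps.

```python
from datetime import datetime
from statistics import median

# Hypothetical records: (document id, time introduced, time it appeared online).
# These values are made up for illustration; they are not real measurements.
RECORDS = [
    ("bill-A", "2012-01-10T14:00", "2012-01-10T16:30"),
    ("bill-B", "2012-01-11T09:00", "2012-01-13T09:00"),
    ("amdt-C", "2012-01-12T11:00", "2012-01-12T11:45"),
]

def lag_hours(introduced: str, published: str) -> float:
    """Return the hours between introduction and online publication."""
    delta = datetime.fromisoformat(published) - datetime.fromisoformat(introduced)
    return delta.total_seconds() / 3600

def summarize(records):
    """Return (median lag, worst lag, id of the slowest document), in hours."""
    lags = {doc_id: lag_hours(intro, pub) for doc_id, intro, pub in records}
    slowest = max(lags, key=lags.get)
    return median(lags.values()), lags[slowest], slowest
```

Calling summarize(RECORDS) reports the typical delay alongside the worst offender, which is the kind of number that could be reviewed from one conference to the next.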

 

Machine Readability

So let's move on to discuss machine-readable formats. This is what really allows the idea of House of Representatives as a platform for democracy to succeed.

The biggest wish of many staffers is to be able to dynamically see how an amendment would modify a bill and how that bill would change the law (and eventually how an agency would promulgate a regulation, how the courts interpret that regulation, and back to Congress again). Along the same lines, people looking at a bill want to know if there are other, similar bills, in this Congress or in previous ones, whether there are committee reports, CRS and GAO evaluations, and so on. If you cannot find a way to tie this information together, this dream becomes impossible.

Legislative data needs to be released as highly structured data. In other words, a machine needs to be able to look at the content and "know" what it is looking at. This would require the use of languages like XML, which allows this kind of value-added context. But to make it work, we also need a way to uniquely describe people and bills and amendments and so on -- cleverly enough embodied in commonly-accepted unique identifiers. There are already tons of these identifiers being used, but the House needs to consistently and widely employ them.
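To make "structured data" a little more concrete, here is a minimal sketch of what a program can do once a bill carries its own labels and identifiers. The element names, attribute names, and identifier values are invented for this example and are not drawn from any actual House schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element and attribute names below are invented
# for this example and do not reflect any actual House schema.
SAMPLE = """
<bill id="hr-0000" congress="112">
  <title>An example bill</title>
  <sponsor member-id="M000000">Rep. Example</sponsor>
  <amends cite="usc-00-0000"/>
</bill>
"""

def describe(xml_text: str) -> dict:
    """Pull the identifiers out of a structured bill record."""
    root = ET.fromstring(xml_text)
    return {
        "bill_id": root.get("id"),
        "congress": int(root.get("congress")),
        "title": root.findtext("title"),
        "sponsor_id": root.find("sponsor").get("member-id"),
        "amends": [node.get("cite") for node in root.findall("amends")],
    }
```

Because the sponsor and the amended provision are identified by stable codes rather than free text, the same few lines work on every bill that follows the same layout, with no guessing about what a given string means.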

Sometimes, structured language is used when creating a document, or unique identifiers are used to describe data items in a document, but that document is stripped naked before it is released to the public. There are some circumstances where this makes sense, like hiding the different internal drafts of a bill. But most of the time, it serves no real purpose. The data that's removed could be very helpful to those on the outside. Leave it in.

Let me add that PDFs, especially PDFs that are image files, do not promote transparency. They make it difficult to impossible to extract data from documents. If you must use a PDF, make sure that the underlying data is available some other way as well.

That brings me to a point about how the data is made available. A lot of transparency advocates build scrapers to try to transform data that's published online and put it back into a useful structure. Josh Tauberer, for example, scrapes THOMAS to turn it into a database. It's like trying to unscramble an egg.

Legislative data, such as that in THOMAS, should be made available online in bulk. Give folks the database all at once or in very large chunks, and let them figure out how to use it. (See our wiki page for more resources regarding how to improve THOMAS.)

Here are my benchmarks:

  1. All bills, amendments, and votes should be published online in XML, or some other structured format. Make scrapers unnecessary.

  2. End the tyranny of only publishing in PDFs. House expenditure reports are a giant database -- publish them as a spreadsheet file, not a PDF. The Constitution Annotated is prepared in XML, don't publish it as a PDF.

  3. Encourage the use of unique identifiers, whether they come from inside the House or elsewhere. The data needs to be interoperable.
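Here is a small sketch of why spreadsheet-style files and shared identifiers matter together. It assumes two separately published files that happen to use the same member identifier; the column names, identifiers, and amounts are all invented for illustration.

```python
import csv
import io

# Two hypothetical datasets, published separately, that share a member identifier.
# Column names and values are invented for illustration.
MEMBERS_CSV = """member_id,name
M001,Rep. Alpha
M002,Rep. Beta
"""

SPENDING_CSV = """member_id,category,amount
M001,travel,1200.50
M002,supplies,300.00
M001,supplies,99.50
"""

def total_by_member(members_csv: str, spending_csv: str) -> dict:
    """Sum spending per member name by joining the two files on member_id."""
    names = {row["member_id"]: row["name"]
             for row in csv.DictReader(io.StringIO(members_csv))}
    totals = {}
    for row in csv.DictReader(io.StringIO(spending_csv)):
        name = names[row["member_id"]]
        totals[name] = totals.get(name, 0.0) + float(row["amount"])
    return totals
```

The join is a dictionary lookup only because both files agree on the identifier. If one file spelled out names and the other used codes, the same task would require manual matching, which is the interoperability problem the third benchmark is meant to avoid.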

 

Concluding Remarks

My time is running short, so I will only make two more comments about process.

First, today's conference, and the standards released by the House in December, are a good thing.

As a benchmark, we need to have another conference like this one within the next year as a way of assessing how well we have done, and we should continue with these conferences on a regular basis.

Second, we need to foster collaboration between those inside and outside government. In particular, technologists who are trying to use legislative data need to be able to get technology questions answered by the responsible internal stakeholder. And policy wonks can help provide direction so that the new services developed by the House meet the needs of the public. I suggest:

  1. The creation of a standing committee, composed of internal and external stakeholders, that meets at least quarterly, if not monthly, to discuss these issues.

  2. A listserv where people who are not in DC can engage in this discussion with people inside and outside of government.

I appreciate your time and the opportunity to speak. Thank you very much.

A Year Later, Little Progress on Digitizing Legislative Documents

A year ago today, Congress' Joint Committee on Printing directed that three sets of vital legislative and legal documents be published online "as quickly as possible." We've reviewed how well that order was implemented, and the results are not encouraging. Of the three documents, there's only apparent progress on one.

The vital documents are the Constitution Annotated, the Congressional Record, and the Statutes at Large. The Government Printing Office is responsible for publishing them, and shares that responsibility to a certain extent with the Library of Congress and its subsidiary agencies, the Congressional Research Service and the Law Library of Congress. These agencies are custodians of America's heritage, and have an important obligation to make it available to every citizen. Here's how they've performed.

The Constitution Annotated

The Constitution Annotated (or CONAN) is a constantly-updated legal treatise that explains how the Supreme Court has interpreted the Constitution. It's available to the public online from GPO, but in a cramped, out-of-date, technologically unsophisticated format. Members of the public have been asking for access to a better version for years.

JCP's instructions to GPO are simple and straightforward:

To make the online version of CONAN as useful as possible to Congress and the public, it is time to put the updates online as soon as they are prepared, rather than waiting to coincide with the two-year print cycle. The Joint Committee on Printing is authorizing you to work with the Library of Congress to update the online edition as frequently as possible, and to create new and improved functions on the CONAN site. The Congress and the public should find this site accessible and user-friendly.

What's happened since then? As far as is visible to the public, nothing. The most recent complete version of CONAN that GPO has made publicly available dates back to 2002, and no updates have been published online since 2010. The webpage is hard to find, and only Congress has access to the latest version on its internal network, as provided by the document's author, the Congressional Research Service. GPO should save itself the trouble and share with the public what's already available on Congress' intranet.

The Congressional Record

The Congressional Record is the official record of congressional proceedings and debates. GPO has published an online version of the Record dating back to 1994, and the document was first published in its current format in 1873. The Library of Congress has published online earlier records of congressional proceedings and debates, covering the period from the founding of the country until 1873.

The Joint Committee on Printing authorized a collaboration between the GPO and the Library of Congress to digitize volumes of the Congressional Record from 1873 to 1998, which would fill in the missing gaps and provide a complete record of Congressional activity on the internet. JCP directed the online publication of "digital files with search functions, content management capabilities, and digital authentication."

Looking at GPO's website, the collection only dates back to 1994. THOMAS, however, appears to contain records going back to 1989.

There's a gap of more than 100 years in the online records of congressional proceedings and debates, a majority of which is within living memory and has repercussions to this day. There's no evidence that any substantive work has been done on this in the last year.

Statutes at Large

The Statutes at Large is the official source for the laws and resolutions passed by Congress. It was first published by a private company in 1845, but responsibility for publication was transferred to GPO in 1874, with administrative responsibility shifting in 1950 and again in 1985. Like the Congressional Record, the Library of Congress has published online historic statutes at large covering the years 1789 to 1873. THOMAS also has long made it possible to browse (but not search) copies of the Statutes at Large from 1973 to present.

The JCP instructed GPO to work with the Law Library of Congress "to create digitized volumes of the Statutes at Large and to develop robust searching and content management tools." In essence, their role is to fill in the gaps. JCP further instructed that "once the content has been prepared, the Statutes at Large will be published online by GPO, and the Library of Congress will use their GPO content in its public database of legislative information known as 'THOMAS.'"

Unlike with the other two publications, there is tangible evidence of progress. GPO is now publishing a digitized version that covers 1951-2002, which is a significant undertaking. However, the documents have not been integrated into THOMAS, and are still somewhat difficult to use because of their large size. Moreover, GPO published another set of digitized documents, from 2003 to 2007, that are kept in a separate location on GPO's website and stored at a much greater level of granularity.

This project is only partially complete, with a sizable gap in the public record from 1874 to 1951. Moreover, the documents haven't been integrated into THOMAS.
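The size of that gap can be checked mechanically. The sketch below is only an illustration: the function name is hypothetical, and the year ranges are the ones named in this post (the Library of Congress collection and the two GPO sets), not a verified inventory of what is online.

```python
def find_gaps(covered, start, end):
    """Return inclusive (first_year, last_year) gaps within [start, end]."""
    gaps = []
    next_needed = start
    for first, last in sorted(covered):
        if first > next_needed:
            gaps.append((next_needed, first - 1))
        next_needed = max(next_needed, last + 1)
    if next_needed <= end:
        gaps.append((next_needed, end))
    return gaps

# Year ranges taken from the text above: the Library of Congress collection
# (1789-1873) and the two GPO sets (1951-2002 and 2003-2007).
ONLINE = [(1789, 1873), (1951, 2002), (2003, 2007)]
```

Running find_gaps(ONLINE, 1789, 2007) returns the uncovered years as a single span beginning in 1874 and ending the year before the GPO collection picks up. The same check could be rerun as new volumes are posted to confirm the hole is actually closing.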

GPO Statement

I asked GPO to comment on their ongoing efforts to comply with the Joint Committee on Printing's letter. Here is their response:

GPO and the Library of Congress have worked together to digitize the U.S. Statutes at Large (content covers volumes 65-116, 1951-2002) and make them available through GPO’s Federal Digital System (www.fdsys.gov).

GPO and the Library of Congress are collaborating on a project to digitize the print bound Congressional Record dating back to 1873. GPO first put the daily Congressional Record online in 1994, and digital versions of the bound Congressional Record from 1998-2002 are currently available on FDsys. GPO is working with CRS on the dynamic version of CONAN.

Conclusion

I would like to call this a work in progress, but there doesn't appear to have been much progress. GPO hasn't provided an explanation for the delay, a timeline for completion, or a plan to get things on track. I know that GPO and its legislative branch colleagues can act with greater speed than we've seen thus far.

I am concerned by the apparent failure to think of how the public will find and use this information. Why aren't all the existing data sets integrated into THOMAS, where people will look for them? Why isn't the data available in bulk, so that developers can build tools to share the information more widely? Why aren't members of the public involved in the design and specifications of these sites, to make sure their needs are addressed?

The JCP described these documents as "essential to understanding our laws and legislative history" and proclaimed that "they should all be readily available online in electronic format." It is long past time to make this happen. The public deserves an explanation of what's gone wrong and when to expect results.

Update: I want to add that none of this should be construed as a commentary on what GPO, LOC, or other agency funding levels should be. Generally speaking, funding cuts would make it less likely that these important initiatives will come to fruition. Instead, I would urge Congress to more closely scrutinize compliance with its directives, and encourage agencies to be more open about their progress and the challenges they face. With respect to funding, it may be that digitization and online publication will lead to significant savings -- especially in terms of the current need to print many copies of these documents as well as the cost to government of paying private vendors to access ostensibly public documents -- but my main point is that the public has a right to this information.

(One more thing -- you may find that some of the links to documents stored on GPO's website, FDsys, don't always work. I don't know why that is, but they often time out for me, too.)

O Conan! Where art thou? Legal treatise a no-show

Seven months ago, the order was given for the legal treatise known as the Constitution Annotated (or CONAN) to be published online, but so far without result. CONAN is a government publication that explains the Constitution as interpreted by the Supreme Court. The Joint Committee on Printing directed the Congressional Research Service and the Government Printing Office to provide "enhanced access" to that document, which means that CONAN should be published online as it is updated, albeit as a searchable PDF and not the structured data format that we (and many others) requested.

A frequently updated version of the Constitution Annotated is available to congressional staff on Congress' internal website -- and in the structured data format that we want. All that's available to the public, however, is a decade-old copy and a handful of scattershot updates. What's strangely funny is that only a few minutes' work would be required to publish the Congress-only version of CONAN online, but transforming CONAN into the much less useful PDF version has taken seven months ... and counting. Perhaps some lessons could be learned from last week's Committee on House Administration hearing on modernizing information delivery in the House.

Tomorrow, the Joint Committee on Printing and the Joint Committee on the Library will hold a very rare public meeting. It's for organizational purposes -- 6 months after Congress convened -- so don't get too excited. Movement is measured slowly, especially since the JCP's website hasn't been updated in several years. But if you're so inclined, the hearing is set for 11:30am in SC6, which is on the Senate side of the U.S. Capitol. We'll see you there.

JCP directs enhanced access to 3 of "our nation's vital legislative and legal documents"

I’m rather late in sharing the news, but “enhanced access” to three of “our nation’s vital legislative and legal documents” will soon be possible thanks to a letter from the Joint Committee on Printing to the Government Printing Office and the Library of Congress. Specifically, it authorizes the two legislative agencies to work together to provide “enhanced access” to the Constitution Annotated, the Congressional Record, and the Statutes at Large.

The Constitution Annotated

We’ve been banging on the drum for improved access to the Constitution Annotated for a year-and-a-half, and I’m pleased to announce a partial victory. To recap, the Constitution Annotated is a government publication that explains the Constitution as interpreted by the Supreme Court. Although updated on a frequent basis and readily available to congressional staff, the complete Constitution Annotated is released to the public only once a decade -- scrubbed of helpful metadata. Updates reflecting recent Court decisions are released separately every two years, far short of what’s available to Congress.

The Joint Committee on Printing has directed that updates to CONAN (as it’s affectionately known) be put online as soon as they are prepared. But, instead of publishing it in XML, the structured data format in which it is prepared, CONAN will be published as a PDF. My former colleague Clay Johnson explained two years ago why publishing files only as PDFs is bad for open government. We appreciate that the document will be searchable and have a hyperlinked table of contents, but we’d like the underlying data, too. More than 20 organizations last year asked for CONAN to be made publicly available online in structured data format as it is updated in real time, as did then-Senator Feingold, and we hope that we’ll ultimately get there.

Congressional Record

It is a surprising fact that the official record of the proceedings and debates of the U.S. Congress is only available online (for free) from 1999 forward and prior to 1873. The JCP has now given GPO the go-ahead to digitize volumes of the Congressional Record from that 125-year gap. I fear that it will be made available only as a PDF, which will require a tremendous and expensive effort to transform those files into a structured data format that everyone can use. Still, making the documents available in some way is better than not at all. The American people have a right to see the crucial debates in Congress that continue to shape our world.

Statutes at Large

Believe it or not, it’s impossible to find all the laws enacted by Congress online. Although the U.S. Code is available in its entirety, it is not always “positive law”; to find the original bills as they were enacted and are often still in effect, you have to look to the Statutes at Large. In essence, the Statutes at Large are a chronological compilation of bills enacted into law. (The process by which the bills are broken apart and transformed into the U.S. Code is discussed here.)

The JCP has now authorized GPO to work with the Law Library of Congress to digitize and publish online absent volumes of the Statutes at Large and “develop robust searching and content management tools.” Hopefully this means more than scanning them and putting them online as PDFs, but even that would be a great step forward. We’ve been interested in this for quite a while, and we’re glad to see that things are moving forward.

The Road Ahead

The JCP letter was sent nearly 3 months ago -- on November 17 -- and I am unable to find any evidence that the Constitution Annotated has been updated online or that progress has been made on the Congressional Record or the Statutes at Large. That is not to say that nothing has been done, but I was hoping to see, well, something. Although JCP has directed these agencies to complete these projects “as quickly as possible,” the absence of deadlines and historical reluctance on the part of some of the institutional players raise concerns about forward movement, particularly with respect to the Constitution Annotated.

We have other ideas about how Congress can improve public access to lawmaking information. Some of them are described in my “Read the Bill 2.0” post. The truth is that we are only beginning to scratch the surface of what should be available. I applaud the JCP’s efforts to move things forward, and I hope that the pace will only quicken.

Constitution Annotated, Congressional Record, and Statutes at Large

20+ Orgs Ask For Better Access to the “Constitution Annotated”

Photo by Pink Sherbet Photography on Flickr.

Today, on the birthday of the Constitution, more than 20 organizations and individuals called for better public access to the legal treatise Constitution Annotated, a government publication that explains the Constitution as interpreted by the Supreme Court. Although updated on a frequent basis and readily available to congressional staff, the complete Constitution Annotated is released to the public only once a decade -- scrubbed of helpful metadata. Updates reflecting recent Court decisions are released separately every two years, far short of what’s available to Congress.

We believe the Constitution Annotated should be published online as it is updated and with metadata intact. Because it is prepared in XML, this is relatively easy to do.

Last September, the Sunlight Foundation called for the release of the Constitution Annotated, a call that was joined by Senator Feingold in October. Although the Congressional Research Service and the Government Printing Office have held a meeting regarding its release, as the parties respectively responsible for authoring and publishing the document, they still have not acted. It is time.

The signatories urge Senators Schumer and Bennett and Representatives Brady and Lungren -- who lead the relevant House and Senate committees -- to prod CRS and GPO to make this vital resource available to the American people intact and on a timely basis.

The letter is available here, with background information on the Constitution Annotated available here.

Organizations Call for Better Access to the "Constitution Annotated"

Senator Feingold Urges Posting of Constitution Annotated Online

Last week, Senator Feingold sent a letter requesting that the Government Printing Office post the Constitution Annotated online. The Constitution Annotated is a public document, and a great resource on the Supreme Court's interpretation of the Constitution. It is nominally publicly available, but is online only in a PDF format. The Constitution Annotated contains analysis of 8,000 cases, so to be truly useful, it seems obvious it must be searchable.

The GPO can take a simple step toward greater transparency by making this document available to the public in a navigable format. It could also ensure that rather than updating the Constitution Annotated every two years, as is the current practice, updates are posted in real time, as Senator Feingold also requested in his letter.

As Chairman of the Constitution Subcommittee, Senator Feingold gets it. We hope the Government Printing Office gets it too.

CRS On Making the Constitution Annotated Available in XML

Last week, the Sunlight Foundation urged the Government Printing Office to publish the legal treatise Constitution Annotated (a.k.a. CONAN) online in XML. CONAN explains the U.S. Constitution section by section, describing in its usual (and legally required) non-partisan fashion how the U.S. Supreme Court has interpreted the Constitution's provisions. CONAN contains analysis of nearly 8,000 Supreme Court cases.

We contacted the Librarian of Congress, who has statutory responsibility for preparing CONAN, for his opinion on making the treatise available online in XML. (Although it is prepared in XML, GPO publishes CONAN online in plain text and PDF format, sans meta-data. As a result, the structured data is unavailable to those who may want to republish, remix, or otherwise engage with the treatise.)

The Congressional Research Service*, which is part of the Library of Congress and whose staff actually write CONAN, made themselves available to answer our questions, summarized below:

(1) Would CRS agree to making the Constitution Annotated available online in XML every two years, when the document is printed?

(2) Would CRS agree to making the Constitution Annotated available online in XML as that document is updated and released on Congress's intranet? (This would be more frequent than the every-other-year publication schedule.)

Here is CRS's response:

The Congressional Research Service and the Government Printing Office plan to discuss publication of the Constitution Annotated and possible future enhancements.

It is not entirely clear what this means. What we hope is that this statement indicates movement towards an arrangement whereby CRS regularly provides the XML file to GPO, and GPO makes that file -- untouched -- available for download on its website. Stay tuned.

Thanks to BoingBoing for the coverage.

* Disclosure: I used to work for CRS.

220+ Years Later, It's Time to Publish the Constitution Annotated Online in XML

Today, the Sunlight Foundation called upon the Government Printing Office to publish the legal treatise The Constitution Annotated online in XML format as it is updated. The Constitution Annotated has been written by the Library of Congress for nearly 100 years, and contains analysis of nearly 8,000 U.S. Supreme Court cases.

Over the decades, GPO has published print versions of this extraordinary resource every two years, with limited electronic versions available from the 1992 edition onward. Although the Library of Congress has drafted the Constitution Annotated in XML for a number of years, that data is no longer present when it is published online by GPO. [Update: To clarify, GPO has never published the XML data. However, CRS currently creates that document in XML format, and has done so for a number of years.] Releasing the treatise in XML would allow for the easy sharing of information between different kinds of computers, applications, and organizations, and provide a roadmap to the underlying data.

In addition to asking for The Constitution Annotated to be published online in XML, we are also asking that as the data is updated and made available to Congressional staff, it also be made available to the general public. For an example of what that could look like, see Cornell University Law School's transformation of the data.

Today is the 222nd anniversary of the adoption of the Constitution. In 1787, it was made available to the American people by the most modern technology of the day. We should do no less today, and provide the Constitution (along with commentary) in XML.

Constitution Annotated Letter

The full text of the letter is after the jump.

The Honorable Robert C. Tapella
Public Printer of the United States
Government Printing Office
732 North Capitol Street, NW
Washington, DC 20401-0001

September 17, 2009

Dear Mr. Tapella:

Today is the 222nd anniversary of the adoption of the United States Constitution. It is in light of this momentous historical event that I am writing on behalf of the Sunlight Foundation to ask that the GPO begin to immediately publish the legal treatise "The Constitution of the United States, Analysis and Interpretation" (The Constitution Annotated) online in XML.

The Constitution Annotated is the oldest continuously published treatise on the Constitution, containing analysis of nearly 8,000 U.S. Supreme Court cases. Prepared by the Library of Congress for nearly 100 years, it provides a wealth of resources to scholars and laypersons alike.

The Library of Congress now transmits this document to your office in XML format for publication, so GPO needs only to electronically publish that file. Moreover, the GPO should publish the treatise as it is updated, and not every two years, as is current practice.

Publishing The Constitution Annotated online without encoding it in XML is analogous to printing it without a table of contents, index, chapter breaks, or footnotes. As you know, XML is a standard for laying out data in a format that allows other computers to easily parse that data. Releasing this document in XML would allow the easy sharing of information between different kinds of computers, applications, and organizations, and provide a roadmap to the underlying data.

GPO’s publication of The Constitution Annotated in XML will further the agency’s mandate of making available government information to the public in a timely fashion. Here, GPO can provide a substantive and timely view of the Constitution’s enduring role in our democracy, and uphold the President’s pledge to increase accessibility to government information.

If you have any questions regarding this request, please feel free to contact me.

Sincerely,

Ellen S. Miller
Executive Director

Updated: to add a "plus" sign