Integrating the US’ Documents

by Eric Mill

May 21, 2013 9:43 am

A few weeks ago, we integrated the full text of federal bills and regulations into our alert system, [Scout](https://scout.sunlightfoundation.com). Now, if you visit [CISPA](https://scout.sunlightfoundation.com/item/bill/hr624-113) or a fascinating [cotton rule](https://scout.sunlightfoundation.com/item/regulation/2013-10114), you’ll see the original document – nicely formatted, but also well-integrated into Scout’s layout. There are a lot of good reasons to integrate the text this way: we want you to see why we alerted you to a document without having to jump off-site, and without clunky iframes.

Just as importantly, we wanted to do this in a way that other projects and people could easily reuse. So we **built a tool called [us-documents](https://github.com/unitedstates/documents)** that makes it possible for anyone to do this with federal bills and regulations. It’s [available as a Ruby gem](https://rubygems.org/gems/us-documents), and comes with a [command line tool](https://github.com/unitedstates/documents#usage) so that you can use it with Python, Node, or any other language. It lives inside the [unitedstates project](https://github.com/unitedstates) at [unitedstates/documents](https://github.com/unitedstates/documents), and is entirely public domain.

What it Does

The chief problem that [us-documents](https://github.com/unitedstates/documents) solves is taking the original documents and converting them into context-less HTML that can be dropped directly into any website. Once dropped in, all of the styling can be done with CSS.

For example: Congress publishes XML for every bill in the House and Senate, and it’s rich XML (just [look at that DTD](view-source:http://www.gpo.gov/fdsys/pkg/BILLS-113hr624rfs/xml/bill.dtd)). It’s built to handle a lot of different use cases, including compatibility with sophisticated drafting tools. Most of that just gets in the way of displaying the bill to users, so we strip most of that data out and turn what remains into divs and spans.

Here, we’re turning a piece of [CISPA’s official XML](http://www.gpo.gov/fdsys/pkg/BILLS-113hr624rfs/xml/BILLS-113hr624rfs.xml) into HTML we can [drop into place](https://scout.sunlightfoundation.com/item/bill/hr624-113).
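A minimal sketch of that kind of transform, in Python rather than the gem’s Ruby. This is not how us-documents is implemented: the sample XML is a made-up fragment that only resembles bill markup, and the choice of which tags become spans (`INLINE_TAGS`) is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: these tags read as inline text, so they become <span>;
# every other tag becomes a <div>.
INLINE_TAGS = {"enum", "header", "text", "quote"}


def to_html(node: ET.Element) -> ET.Element:
    """Copy an XML element into a div/span tree.

    The original tag name survives only as a CSS class; all original
    attributes are dropped.
    """
    tag = "span" if node.tag in INLINE_TAGS else "div"
    html = ET.Element(tag, {"class": node.tag})
    html.text = node.text
    for child in node:
        converted = to_html(child)
        converted.tail = child.tail  # keep text that follows the child
        html.append(converted)
    return html


def convert(xml_string: str) -> str:
    """Parse an XML fragment and serialize the div/span version as HTML."""
    root = ET.fromstring(xml_string)
    return ET.tostring(to_html(root), encoding="unicode", method="html")


# Hypothetical fragment, not copied from any real bill.
SAMPLE_XML = (
    '<section id="H1"><enum>1.</enum><header>Short title</header>'
    '<text display-inline="no-display-inline">'
    "This Act may be cited as the Example Act.</text></section>"
)

print(convert(SAMPLE_XML))
```

The result contains only divs and spans whose class names come from the original tags, so a stylesheet can target `.section`, `.enum`, or `.header` without knowing anything about the source XML.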

Doing it this way also lets us make our own decisions on what to display – Congress may feel the need to display the table of contents of the bill, but we don’t, so it can be hidden with CSS.

We do something similar with rules and notices from [FederalRegister.gov](https://www.federalregister.gov). Even though FR.gov already provides HTML snippets ripe for integration, there are simple things we can do to make them even more universally usable, like ditching “id” attributes.

This lets us take the [HTML used here](https://www.federalregister.gov/articles/2013/04/30/2013-10114/revision-of-regulations-defining-bona-fide-cotton-spot-markets) on FederalRegister.gov and [drop it into Scout](https://scout.sunlightfoundation.com/item/regulation/2013-10114), without any conflict with our own HTML.
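To make the “id” point concrete, here is a rough sketch of stripping those attributes, again in Python rather than the gem’s Ruby. The sample markup is invented, and the sketch assumes the snippet is well-formed enough to parse as XML; real-world HTML would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET


def strip_ids(html_fragment: str) -> str:
    """Remove every id attribute so the fragment cannot collide with
    ids already used by the page it is dropped into."""
    root = ET.fromstring(html_fragment)
    for element in root.iter():
        element.attrib.pop("id", None)  # no-op when there is no id
    return ET.tostring(root, encoding="unicode", method="html")


# Hypothetical snippet, not taken from FederalRegister.gov.
SAMPLE_HTML = (
    '<div class="body"><h3 id="h-4">Background</h3>'
    '<p id="p-15">Example paragraph.</p></div>'
)

print(strip_ids(SAMPLE_HTML))
```

Classes are left alone, since they are what the host site’s CSS hooks into.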

Doing Right By Citations

One of the other reasons to integrate these documents into Scout was to link legal citations to searches, to take advantage of Scout’s [special citation searching](https://scout.sunlightfoundation.com/search/all/5%20usc%20552).

Both Congress and the Federal Register already attempt to detect and link the legal citations in their documents. To make these links easily overridable, us-documents extracts the basic pieces from each caught citation and puts them into data attributes on the link, e.g. `data-title="5" data-section="552"`.

This way, we can easily process that HTML client-side in JavaScript, and replace the original links with new ones built from those data attributes.
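The same idea can be sketched outside the browser. The example below is Python, not the client-side JavaScript Scout uses, and it is only an illustration: the sample link is invented, the `data-title` and `data-section` names come from the example above, and the search URL pattern is inferred from the Scout citation search linked earlier. It also assumes every citation is to the U.S. Code and that the fragment parses as XML.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Assumed URL pattern, inferred from Scout's citation search link.
SCOUT_SEARCH = "https://scout.sunlightfoundation.com/search/all/"


def relink_citations(html_fragment: str) -> str:
    """Point each citation link at a Scout search built from its data attributes."""
    root = ET.fromstring(html_fragment)
    for link in root.iter("a"):
        title = link.get("data-title")
        section = link.get("data-section")
        if title and section:  # leave ordinary links untouched
            link.set("href", SCOUT_SEARCH + quote(f"{title} usc {section}"))
    return ET.tostring(root, encoding="unicode", method="html")


# Hypothetical input; the original href is a placeholder.
SAMPLE_HTML = (
    '<p>See <a href="http://example.com/original" data-title="5" '
    'data-section="552">5 U.S.C. 552</a>.</p>'
)

print(relink_citations(SAMPLE_HTML))
```

Because the pieces of the citation live in data attributes rather than in the href itself, the link target can be swapped without re-parsing the citation text.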

The Federal Register does a [terrific job](https://www.federalregister.gov/articles/2013/04/30/2013-10114/revision-of-regulations-defining-bona-fide-cotton-spot-markets#p-15) of detecting citations, but strangely, Congress’s detection seems a lot spottier. If you take a look at last year’s [DISCLOSE Act](https://scout.sunlightfoundation.com/item/bill/s3369-112), it’s difficult to tell why only one citation gets linked while so many others do not. We may end up using [our own](https://github.com/unitedstates/citation) citation detector in the end.

Does This Need a Standard?

us-documents is doing a pretty naive transform on these documents, and the resulting HTML needs entirely different CSS for bills and for regulations. It works, but there are a few questions that come to mind:

* Is it worth identifying the common denominator of tags and features necessary for both kinds of documents, and transforming them both into a standard end product?
* If so, should that standard be the resulting HTML, or is it worth creating an intermediate format of some kind?
* And should this standard have some sort of compatibility or at least good vibes with existing standards for legal documents, such as [Akoma Ntoso](http://www.akomantoso.org/) or [legal-markdown](https://github.com/compleatang/legal-markdown)?

The answers may become more obvious over time if others begin using the tool, and if it gets expanded to process more kinds of documents than just federal bills and regulations. Grant Vergottini also just wrote [a post asking smart questions](http://legixinfo.wordpress.com/2013/05/20/xml-html-json-choosing-the-right-format-for-legislative-text/) about the use of HTML, XML, and JSON with documents like these.

In the meantime, [us-documents](https://github.com/unitedstates/documents) has already been useful to us, and we hope it will be to others.

Sunlight Foundation
