Sample the new, à la carte, Congressional Record parser

The U.S. Capitol. Photo credit: Architect of the Capitol

Introducing congressional-record, a modular parser for the Congressional Record (CR). It lets you parse the flat text of the CR from the Government Printing Office’s (GPO) HTML files and produce structured, bulk XML data for the entirety of the digital record, on disk, with no database required.

The congressional-record parser is part of the unitedstates project, a collaboration among open-government and open-source technologists from across the country. Sunlight contributes to this project as part of an effort to make government accountability tools more accessible to developers.

The parser originates from Sunlight’s Capitol Words project. We think Capitol Words is a useful resource because it augments the CR so that speeches and remarks can be attributed to the person speaking. Another way to access that data is through the handy Capitol Words API.

Read more about the tech behind Capitol Words here, where Sunlight developer Dan Drinkard explains how he ingests and analyzes the output of what is now the congressional-record parser to make Capitol Words.

Running the parser

You can install the code by cloning the project from GitHub into its own environment and installing the requirements from requirements.txt. The parser runs from the command line via ./parsecr.py, which takes a date in YYYY-MM-DD format. See the documentation for additional options.
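
A typical setup might look like the following. We assume here that the repository lives under the unitedstates organization on GitHub and that you are using virtualenv; adjust the URL and environment tooling to taste.

$ git clone https://github.com/unitedstates/congressional-record.git
$ cd congressional-record
$ virtualenv virt && source virt/bin/activate
$ pip install -r requirements.txt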

After installation, the congressional-record parser can fetch the CR for you for a single date or a range of dates. Below is a simple example of the command to retrieve the CR for a single date.

$ ./parsecr.py 2014-02-14
Downloading url  http://www.gpo.gov/fdsys/pkg/CREC-2014-02-14.zip
processed zip 
...

Here is an example of the default output structure. Additional files are omitted for brevity.

output
└── 2014
    └── CREC-2014-02-14
        ├── __log
        │   └── parser.log
        ├── __parsed
        │   ├── CREC-2014-02-14-pt1-PgE211-2.xml
        │   ├── CREC-2014-02-14-pt1-PgE211-3.xml
        │   ├── CREC-2014-02-14-pt1-PgE211-4.xml
        │   ├── CREC-2014-02-14-pt1-PgE211-5.xml
        │   ├── CREC-2014-02-14-pt1-PgE211-6.xml
        └── __text
            ├── CREC-2014-02-14-pt1-PgD151-2.htm
            ├── CREC-2014-02-14-pt1-PgD151-3.htm
            ├── CREC-2014-02-14-pt1-PgE215-4.txt
            ├── CREC-2014-02-14-pt1-PgE215-5.txt
            ├── CREC-2014-02-14-pt1-PgE215.txt
            └── mods.xml

The __log folder contains a log file for that day’s records. The __parsed folder contains the XML parsing results. The __text folder contains the original .htm documents and the plain text .txt versions derived from them.
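
Because everything lands on disk in a predictable layout, downstream code can consume the results without any database. Here is a minimal Python sketch, assuming the default output directory shown above, that collects a day’s parsed XML files:

import os

# Collect the parsed XML files for one day's record, following
# the default on-disk layout shown above.
parsed_dir = os.path.join('output', '2014', 'CREC-2014-02-14', '__parsed')

xml_files = sorted(
    os.path.join(parsed_dir, name)
    for name in os.listdir(parsed_dir)
    if name.endswith('.xml')
)

for path in xml_files:
    print(path)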

To use a date range, give the starting and ending dates separated by a colon, YYYY-MM-DD:YYYY-MM-DD. If using a range of dates, keep in mind that there is a limit to the number of downloads per day. Days with no records will be listed at the end of the command-line output.
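
For example, to fetch and parse several days of records in one run:

$ ./parsecr.py 2014-02-10:2014-02-14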

Additional options determine where these files are read from and written to (see the example after the list):

-id, --indir: Input directory to parse. Front matter and other procedural text will not be processed.
-od, --outdir: Output directory for parsed files.
-l, --logdir: Directory for logs to be written to. Defaults to __log in the input directory.
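
For instance, to keep parsed output and logs outside the working directory, you might run something like the following; the paths here are hypothetical.

$ ./parsecr.py 2014-02-14 --outdir /data/cr/parsed --logdir /data/cr/logs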

How it works

After you pass a date argument, the parser looks for a zip file on FDsys, the GPO’s information portal. Once the file is downloaded, the parser looks for substantive files and creates a plain text version of each HTML file. Records that serve as legislative boilerplate, like the Pledge of Allegiance and front matter, are not parsed.
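
The zip’s location follows a predictable pattern, visible in the download message earlier. Here is a small sketch of how such a URL can be built from a date; it mirrors the pattern shown above, not necessarily the parser’s internal code:

from datetime import datetime

def record_zip_url(date_string):
    # Validate the YYYY-MM-DD format before building the URL;
    # strptime raises ValueError on malformed input.
    datetime.strptime(date_string, '%Y-%m-%d')
    return 'http://www.gpo.gov/fdsys/pkg/CREC-%s.zip' % date_string

print(record_zip_url('2014-02-14'))
# http://www.gpo.gov/fdsys/pkg/CREC-2014-02-14.zip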

From the plain text version, the parser uses regular expressions and some official XML metadata to identify things such as who is speaking and whether the speaker is quoting another person. This can be tricky: spotting these transitions means carefully monitoring whitespace, the contents of the previous line, and many different combinations of titles, punctuation and capitalization.
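
To give a flavor of that pattern matching, here is a deliberately simplified, hypothetical speaker pattern; the parser’s real expressions handle many more titles and edge cases:

import re

# Hypothetical, simplified speaker pattern: an honorific in the
# Record's house style, an all-caps surname, an optional state
# qualifier, and a trailing period. Illustrative only.
SPEAKER_RE = re.compile(
    r'^  (?P<title>Mr\.|Ms\.|Mrs\.) '
    r'(?P<name>[A-Z]{2,})'
    r'(?: of (?P<state>[A-Z][a-zA-Z ]+))?\.'
)

line = '  Mr. SMITH of Texas. Mr. Speaker, I rise today...'
match = SPEAKER_RE.match(line)
if match:
    print('%s is speaking (state: %s)' % (match.group('name'), match.group('state')))
# SMITH is speaking (state: Texas)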

The parser also includes a small test suite. On GitHub, we have integrated it with Travis, which emails contributors if a new commit breaks the tests. With tests that catch major errors, improvements can be added with less fear of breaking the system.

Tests are easy to run:

$ ./test/run
.....................
----------------------------------------------------------------------
Ran 21 tests in 1.726s

OK

Contributing

If you are interested in contributing, the project carries a BSD 3-Clause license, and additions to the parser should include a test verifying that the addition works.
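
As a sketch of what such a test might look like (the pattern and names below are illustrative, not the project’s actual code):

import re
import unittest

# Illustrative only: a simplified speaker pattern like the one
# sketched earlier, not the project's real implementation.
SPEAKER_RE = re.compile(r'^  (Mr\.|Ms\.|Mrs\.) [A-Z]{2,}')

class TestSpeakerPattern(unittest.TestCase):
    def test_matches_titled_speaker(self):
        self.assertTrue(SPEAKER_RE.match('  Mr. SMITH of Texas. I rise today.'))

    def test_ignores_ordinary_text(self):
        self.assertFalse(SPEAKER_RE.match('The bill was read twice.'))

if __name__ == '__main__':
    unittest.main()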

Don’t be shy: take a look at the congressional-record project. We would love to hear your feedback on how to make it more useful to you.