Catalog of Government Publications Bulk Data

Today we are releasing regularly updated bulk data for the Catalog of Government Publications (CGP). The CGP contains records describing electronic and print publications from the legislative, executive, and judicial branches of the U.S. government. It holds 700,000+ records issued from 1976 onward and is administered by the Government Printing Office (GPO).

To be clear, the CGP is a catalog, not a library. The CGP helps you find information about documents, but does not contain the full contents. Use it to search titles, keywords, dates, or any other metadata. Once you have the record you want, the CGP will also help you locate the original document in a Federal depository library. In some cases, the CGP metadata includes a hyperlink to an online version of the document.

The GPO offers a public CGP search interface but not access to the raw data. Since the search interface has limited querying abilities, the public cannot fully dig into the data and explore it.

At Sunlight, we advocate for free, open access to bulk government data. That’s why we built a CGP crawler to extract the CGP records and share the resulting CGP bulk data publicly.

There is a lot of metadata to explore here! Making sense of the documents and putting them into context will require exploration. We invite the transparency community to get involved. At the bottom of this post, I list some specific next steps where you can help.

Directory Structure

These files are stored in a nested directory structure that looks like this:

 system number      revision
             |      |
             v      v
/000/111/000111222-000.xml
/000/111/000111222-001.xml
/000/111/000111222-002.xml

These files are grouped into folders in order to reduce the number of files per directory. Note that we keep old revisions of the documents, making it possible to see how the documents change over time.
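
To make the layout concrete, here is a minimal Python sketch that builds and parses these paths. It assumes, based only on the example above, that system numbers are zero-padded to nine digits and revisions to three. The names `record_path` and `parse_record_name` are made up for this illustration and are not part of any Sunlight tool.

```python
import re
from typing import Tuple

# Assumed layout, inferred from the example paths above:
# /<digits 1-3>/<digits 4-6>/<9-digit system number>-<3-digit revision>.xml
NAME_PATTERN = re.compile(r"^(\d{9})-(\d{3})\.xml$")


def record_path(system_number: int, revision: int = 0) -> str:
    """Build the relative path for one revision of a record."""
    padded = f"{system_number:09d}"
    return f"/{padded[:3]}/{padded[3:6]}/{padded}-{revision:03d}.xml"


def parse_record_name(filename: str) -> Tuple[int, int]:
    """Split a file name into (system number, revision)."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return int(match.group(1)), int(match.group(2))
```

For instance, `record_path(111222, 2)` returns `/000/111/000111222-002.xml`, the third file in the listing above. Parsing the file names the other way is handy if you want to group revisions by system number and keep only the latest one.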

File Format

The CGP bulk data are XML files, formatted as MARCXML and encoded as UTF-8. To give you an idea of the file format, here is an excerpt from an example record:

<record xmlns="http://www.loc.gov/MARC21/slim">
  ...
  <controlfield tag="001">000122458</controlfield>
  ...
  <controlfield tag="005">20041122002551.0</controlfield>
  ...
  <datafield tag="035" ind1=" " ind2=" ">
    <subfield code="a">(OCoLC)03879096</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Shoreline plant establishment and use of a wave-stilling device /</subfield>
    <subfield code="c">
      by J.W. Webb and J.D. Dodd ;
      prepared for U.S. Army, Corps of Engineers, Coastal Engineering Research Center.
    </subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="a">Fort Belvoir, Va. :</subfield>
    <subfield code="b">The Center ;</subfield>
    <subfield code="a">Springfield, Va. :</subfield>
    <subfield code="b">Available from National Technical Information Service,</subfield>
    <subfield code="c">[1978]</subfield>
  </datafield>
  ...
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Revegetation</subfield>
    <subfield code="z">Texas</subfield>
    <subfield code="z">Galveston Bay.</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Shorelines</subfield>
    <subfield code="z">Texas</subfield>
    <subfield code="z">Galveston Bay.</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Soil conservation</subfield>
    <subfield code="z">Texas</subfield>
    <subfield code="z">Galveston Bay.</subfield>
  </datafield>
  ...
</record>

This format is not particularly welcoming to the uninitiated. It is certainly not a great example of XML. One approach to making sense of this metadata is to read the MARC standards first and then (hopefully) return enlightened, invigorated, and ready to make sense of the CGP. If this approach works for you, go for it — but in my experience, the MARC documentation is an overwhelming jungle of strange, overgrown, interweaving concepts. If you go in, take a machete and don’t go alone. If you make it out alive, share your observations and lessons with the Sunlight Labs community.

If you are looking to make faster progress, I would suggest venturing into MARC-land only with a clear goal in mind (for example, to figure out what tag="650" means). Just by scanning the record above, you will probably notice that tag "650" has something to do with categories or classification. Searching the MARC documentation confirms this hunch: tag 650 is used for a Subject Added Entry - Topical Term.
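
If you want to poke at a single tag without learning all of MARC first, something like the following Python sketch may help. It uses only the standard library's `xml.etree.ElementTree` and pulls the subfields out of every 650 field in one record. The function name `topical_terms` and the trimmed `SAMPLE` record are my own illustration (the sample reuses values from the excerpt above), and I am assuming each file parses as a MARCXML document containing `<datafield>` elements in the namespace shown.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the <record> element in the excerpt above.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Trimmed-down record for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">000122458</controlfield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Revegetation</subfield>
    <subfield code="z">Texas</subfield>
    <subfield code="z">Galveston Bay.</subfield>
  </datafield>
</record>"""


def topical_terms(xml_text):
    """Return one list of (subfield code, text) pairs per 650 field."""
    root = ET.fromstring(xml_text)
    results = []
    for field in root.iterfind(".//marc:datafield[@tag='650']", MARC_NS):
        pairs = []
        for sub in field.findall("marc:subfield", MARC_NS):
            # Collapse line breaks and indentation inside the text.
            text = " ".join((sub.text or "").split())
            pairs.append((sub.get("code"), text))
        results.append(pairs)
    return results


if __name__ == "__main__":
    for pairs in topical_terms(SAMPLE):
        print(pairs)
```

Changing the `'650'` in the path expression to another tag is a quick way to see what a field holds across a handful of records before you go looking for it in the MARC documentation.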

Next Steps

By exposing the raw data, we’ve only taken the first step. We invite the open data and open government communities to dig into the CGP data. Here are some example future directions and projects that we would like to see:

  • Improved file formats. The existing MARCXML format is not intuitive. For example, <datafield tag="650"> is difficult to understand. It would look better as <topical_term>.

  • Scripts or tools to import the bulk data into the database of your choice.

  • An API for accessing the CGP bulk data. The API could offer a cleaner file format, as mentioned above.

  • A National Data Catalog importer. We expect that many CGP publications will correspond to data sets — so we want to ingest these into NatDatCat.

  • Visualizations. There are 700,000+ records — figuring out how the records are clumped (by topic, keywords, dates, and so on) would be very interesting.

  • Web and mobile applications. How can we make the CGP relevant and interesting?
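
As a starting point for the "improved file formats" idea at the top of this list, here is a rough Python sketch that renames a few numeric tags to readable element names. Only `topical_term` comes from this post. The other names in `FRIENDLY_NAMES`, the `simplify` function, and the output layout are hypothetical choices of mine, and a real converter would need to cover many more tags and keep the indicators.

```python
import xml.etree.ElementTree as ET

MARC = "http://www.loc.gov/MARC21/slim"

# Hypothetical mapping from MARC tags to friendlier element names.
# Only "topical_term" is suggested in the post; the rest are placeholders.
FRIENDLY_NAMES = {
    "245": "title_statement",
    "260": "publication",
    "650": "topical_term",
}


def simplify(record_xml):
    """Copy the mapped datafields into a new, namespace-free <record>."""
    source = ET.fromstring(record_xml)
    out = ET.Element("record")
    for field in source.iter(f"{{{MARC}}}datafield"):
        name = FRIENDLY_NAMES.get(field.get("tag"))
        if name is None:
            continue  # Unmapped tags are skipped in this sketch.
        elem = ET.SubElement(out, name)
        for sub in field.iter(f"{{{MARC}}}subfield"):
            child = ET.SubElement(elem, "subfield", code=sub.get("code"))
            child.text = " ".join((sub.text or "").split())
    return out


if __name__ == "__main__":
    sample = (
        f'<record xmlns="{MARC}">'
        '<datafield tag="650" ind1=" " ind2="0">'
        '<subfield code="a">Shorelines</subfield>'
        "</datafield></record>"
    )
    print(ET.tostring(simplify(sample), encoding="unicode"))
```

Note that this sketch silently drops any field it does not recognize, which is fine for exploring but not for a faithful conversion.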

If you want to take on any of these mini-projects — or have ideas to go in a different direction — please let us know on our mailing list!