The National Data Catalog Is Hungry
So you’ve found some government data on the web. Naturally, you are eager to share your findings with the world. Perfect! Sunlight Labs can help. Our National Data Catalog (NatDatCat) is hungry for government data, and we have to feed it regularly. Otherwise, it gets grumpy.
The first step is to assess what you’ve found. If it is just a few scattered files, fill out a quick form and tell us about it. On the other hand, if it is a collection of data sets, you might consider writing an importer…
Writing an Importer for NatDatCat
Have Ruby and Git skills and a hankering for some Web spelunking? Then writing an importer for NatDatCat might be a perfect civic hacking project for you!
Since the NatDatCat system is centered around a RESTful API, it is easy to write small standalone programs to work with the data. (Even the Web app is, more or less, a presentation layer that communicates through the API.) So, to write your importer, you could integrate with the NDC API directly. We have API documentation at your service to get you started.
But not so fast. There is a better way. We recommend using the NDC importer framework. The framework serves two major purposes:
- It simplifies the task of writing an importer. In particular, the importer framework handles the API communication, so all an importer has to do is handle the external translation step (such as scraping of a Web site or integration with an API). It also provides utility functions that come in handy.
- It standardizes importers. This encourages the sharing of best practices and it also makes coordination easier. The various importers are automated through the use of the National Data Catalog Importer System.
The importer framework is good at doing a few things and delegating the rest. This document will help you get started. Before long, you’ll have an importer ready to liberate government data.
As a prerequisite, you’ll need to install the NatDatCat API on your system. Doing so lets you test your importer locally in a controlled environment. (Once you get your importer working, let us know and we’ll add it to our collection of importers that run against our production API.)
Importer Walkthrough
Let’s take a look at some example code in the example folder.
1. Set Up the Rakefile
Begin by looking at example/rakefile.rb. In this file, you set some configuration information and call out to the importer framework. It will define some rake tasks for you.
The importer framework handles quite a few things for you, provided that you follow its design. Your importer is responsible for providing a Puller class (as defined with :puller => Puller in rakefile.rb).
2. Make Some Keys and Hide Them
Use the API to generate a key for your importer. Remember, API keys are private, so please don’t store them in code. In fact, don’t store them in source control at all. Separate them out and store them in config.yml. Make sure that your .gitignore file is set up to ignore config.yml. It is a good idea to include a config.example.yml that demonstrates the format of the file.
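To make the idea concrete, here is a minimal sketch of reading a key from config.yml at run time. The helper name load_api_key and the api_key entry are assumptions made up for this illustration; the importer framework may expect a different layout, so check config.example.yml in the example folder.

```ruby
require "yaml"

# Hypothetical helper: reads the API key from a YAML file kept out of git.
# The "api_key" entry name is an assumption for illustration.
def load_api_key(path = "config.yml")
  unless File.exist?(path)
    raise "Missing #{path}; copy config.example.yml and fill in your key"
  end

  config = YAML.load_file(path)
  key = config.is_a?(Hash) ? config["api_key"] : nil
  raise "No api_key found in #{path}" if key.nil? || key.to_s.strip.empty?

  key
end
```

Failing loudly when the file or the key is missing saves you from puzzling over authentication errors later.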
3. Make the Puller
Next, let’s look at the Puller class. It is responsible for defining two methods: initialize and run. (The rake tasks constructed above rely on these methods.)
Please note that the example provided here is oversimplified. It is intended to demonstrate how to use the importer framework, not as a practical example to copy verbatim. If you want to steal some importer code, please visit the Sunlight Labs projects page and filter the projects by ‘datacatalog-imp-’.
As you would probably expect, initialize is called once. Its main purpose is to set up the callback handler (@handler) to refer back to the importer framework.
Put the main logic / algorithm / secret recipe / voodoo of your importer in the run method. The key responsibility of your importer is to call @handler.source or @handler.organization each time your importer finds a data source or organization, respectively. (Historical note: the 0.1.x version of the importer framework worked a little bit differently. This is a more flexible style.)
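Here is a rough sketch of that shape. It assumes the framework passes its handler as the only argument to initialize; the real signature may differ, so treat example/ as the authority. The fetch_records method and the RecordingHandler class are hypothetical stand-ins so the sketch can run without the framework or the API.

```ruby
# Sketch of a Puller. Assumption: the framework hands its handler
# to initialize; the real signature may differ.
class Puller
  def initialize(handler)
    @handler = handler
  end

  def run
    # A real importer would scrape a site or call an API here.
    fetch_records.each do |record|
      @handler.source(
        :title       => record[:title],
        :url         => record[:url],
        :source_type => "dataset"
      )
    end
  end

  private

  # Hypothetical stand-in for the scraping step.
  def fetch_records
    [
      { :title => "Example Budget Data", :url => "http://example.gov/budget" }
    ]
  end
end

# Stand-in handler that only records calls, for local experiments.
class RecordingHandler
  attr_reader :sources, :organizations

  def initialize
    @sources = []
    @organizations = []
  end

  def source(data)
    @sources << data
  end

  def organization(data)
    @organizations << data
  end
end

handler = RecordingHandler.new
Puller.new(handler).run
puts handler.sources.length
```

Keeping the scraping step in its own method makes it easy to swap in canned records while you debug the rest of the importer.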
source parameter
@handler.source() expects a hash parameter of this shape:
{
  :title => "Budget for...",
  :description => "Congressional budget for...",
  :source_type => "dataset",
  :url => "http://...",
  :documentation_url => "http://...",
  :license => "...",
  :license_url => "http://...",
  :released => Kronos.parse("...").to_hash,
  :frequency => "daily",
  :period_start => Kronos.parse("...").to_hash,
  :period_end => Kronos.parse("...").to_hash,
  :organization => {
    :name => "", # organization that provides data
  },
  :downloads => [{
    :url => "http://...",
    :format => "xml",
  }], # include as many download formats as appropriate
  :custom => {},
  :raw => {},
  :catalog_name => "...",
  :catalog_url => "http://...",
}
Note that most of these parameters match up with the properties defined for a Source in the National Data Catalog API. These parameters are just passed along to the API, which will validate the values.
The remaining parameters (organization and downloads) are handled by the importer framework:
- The organization sub-hash is used to look up or create the associated organization for the source. Then an organization_id key/value pair is sent to the API.
- The downloads array is used to look up or create the associated download formats for a data source.
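If the look-up-or-create step sounds abstract, the following sketch shows the general idea for the organization sub-hash. It is an illustration only, not the framework's actual code: FakeOrganizationStore is a made-up in-memory stand-in for the API, and resolve_organization is a hypothetical helper name.

```ruby
# Illustration only: an in-memory stand-in for the API.
class FakeOrganizationStore
  def initialize
    @ids_by_name = {}
    @next_id = 1
  end

  # Returns the id for this name, creating a new entry if needed.
  def find_or_create(name)
    @ids_by_name[name] ||= begin
      id = @next_id
      @next_id += 1
      id
    end
  end
end

# Returns a copy of the source hash in which the :organization
# sub-hash is replaced by an :organization_id pair.
def resolve_organization(source, store)
  resolved = source.dup
  org = resolved.delete(:organization)
  resolved[:organization_id] = store.find_or_create(org[:name]) if org
  resolved
end

store = FakeOrganizationStore.new
first = resolve_organization(
  { :title => "A", :organization => { :name => "Department of Example" } },
  store
)
second = resolve_organization(
  { :title => "B", :organization => { :name => "Department of Example" } },
  store
)
```

Both sources end up pointing at the same organization id, which is the point of looking up before creating.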
You may have noticed the use of Kronos.parse above. We highly recommend the kronos library for parsing dates.
organization parameter
@handler.organization() expects a hash parameter of this shape:
{
  :name => "",
  :acronym => "",
  :url => "http://...",
  :description => "",
  :org_type => "governmental",
  :organization => {
    :name => "", # parent organization, if any
    :url => "",
  },
  :catalog_name => "...",
  :catalog_url => "http://...",
}
Note that most of these parameters match up with the properties defined for an Organization in the National Data Catalog API. These parameters are just passed along to the API, which will validate the values.
The remaining parameter, organization, is handled by the importer framework. The framework looks up the parent organization using the name or url. It then sends parent_id, set to the id of the parent organization, to the API.
You’re Done / Best Practices
That’s it. But before you go hacking away, let me say a few words about best practices:
- If you are scraping a web site, we highly recommend caching the raw HTML files in your importer. Our production importers are queued up using the NDC Importer System, which integrates nicely with git. It keeps a record of the raw HTML files that correspond to each run. This makes it easier to debug if and when things go wrong.
- Take advantage of the utility functions in /lib/utility.rb. If you have suggestions about useful utility functions, please let us know.
- It goes without saying, but please follow best Ruby practices and make a good faith effort at writing clean code. Follow the conventions of the community and strive to make your code readable by other people.
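As one way to follow the caching advice above, here is a small sketch that saves each fetched page to disk and reuses it on later calls. The cached_fetch name, the cache/raw directory, and the SHA-1 file naming are all assumptions for this illustration; the utility functions in /lib/utility.rb may already cover this, so look there first.

```ruby
require "digest/sha1"
require "fileutils"
require "net/http"
require "uri"

# Returns the HTML for url, reading from cache_dir when a copy exists.
# The cache layout (SHA-1 of the URL as the file name) is an assumption.
def cached_fetch(url, cache_dir = "cache/raw")
  FileUtils.mkdir_p(cache_dir)
  path = File.join(cache_dir, Digest::SHA1.hexdigest(url) + ".html")
  return File.read(path) if File.exist?(path)

  # A block lets callers (and tests) supply their own fetcher.
  html = block_given? ? yield(url) : Net::HTTP.get(URI.parse(url))
  File.write(path, html)
  html
end
```

Calling cached_fetch("http://...") with no block goes over the network the first time; passing a block lets you feed in canned HTML while you work on your parsing code.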
And thanks for helping us feed the National Data Catalog!
Talk to Us / Stay Up To Date
Please reach out to us on our National Data Catalog Google Group. We can help you with your importer. Once it works reliably, we will want to add it to our importer system. The more up-to-date, relevant government data we bring in, the more useful our data catalog becomes.
This document is adapted from the README in the datacatalog-importer source code repository. You can find the latest version there.