How We Use MongoDB at Sunlight
Last week, David and I attended MongoNYC, a one-day conference focused on MongoDB. We like Mongo here at Sunlight. We like it a lot.
Working with Mongo, it’s become clear that, for us, it’s a more natural way to store data. We primarily use Python and Ruby, and because Mongo allows us to think in JSON, everything tends to just click. JSON documents are close enough to objects in Python and Ruby that mapping between application and database becomes almost effortless. Mongo has really shone in two specific use cases: as a datastore for a resource-oriented web service, and as a datastore for results from scraping a web site.
Powering a Resource Oriented JSON API
We currently have two APIs powered by Mongo, and we’ve blogged about both recently: the National Data Catalog API and Drumbone.
A great thing about Mongo is that it allows you to store data as you’d naturally want to work with it, particularly through the use of Embedded Documents. Looking at an example entry on the National Data Catalog, it’s easy to determine the Mongo document schema just by looking at the page design. We have a single document representing the earthquake data source, and it has fields that represent the title, homepage, and documentation URL. Embedded in the document are collections representing the downloads, ratings, and comments made for this entry. There’s no need to think about tables and joins — a catalog entry can be thought of as a JSON document.
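To make that concrete, here’s a rough sketch in Python. The field names and values are invented for illustration — they aren’t the actual National Data Catalog schema — but the point is that the whole entry, embedded ratings and comments included, is just one nested dictionary that maps directly to a single Mongo document:

```python
# A hypothetical catalog entry as a plain Python dict. With MongoDB,
# the entire nested structure is stored as one document -- the embedded
# ratings and comments need no separate tables and no joins.
entry = {
    "title": "Worldwide Earthquakes, Past 7 Days",   # illustrative values
    "homepage": "http://example.gov/earthquakes",
    "documentation_url": "http://example.gov/earthquakes/docs",
    "ratings": [
        {"user": "alice", "value": 4},
    ],
    "comments": [
        {"user": "bob", "text": "Updated every few minutes."},
    ],
}

# With pymongo, saving it would be a single call along the lines of:
#   db.sources.insert(entry)
```

Reading the document back gives you the same nested structure, ready to serialize straight out as JSON.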
It’s easy to create a resource-oriented web service with a framework like Sinatra, but for the National Data Catalog, David wrote a framework called sinatra_resource, which provides a nice DSL for exposing Mongo documents as resources.
For the Drumbone API, Eric exposed Mongo’s ability to reach deep into embedded documents by supporting dot notation in the query string. So, let’s say you want to grab only two of the fields from the earmarks sub-object. By default, requesting the earmarks section returns the whole thing:
// ?sections=last_name,first_name,state,earmarks
{legislator: {
    last_name: "Lee",
    state: "CA",
    first_name: "Barbara",
    earmarks: {
      average_number: 20,
      total_amount: 10000000,
      average_amount: 22994535,
      total_number: 28,
      last_updated: "2010-03-18",
      fiscal_year: 2010
    }
  }
}
You would modify the parameter string to use dot notation like so:
// ?sections=last_name,first_name,state,earmarks.total_amount,earmarks.total_number
{legislator: {
    last_name: "Lee",
    state: "CA",
    first_name: "Barbara",
    earmarks: {
      total_amount: 10000000,
      total_number: 28
    }
  }
}
The ability to ask for partial responses from an API is a huge win when it comes to speed and efficiency, and Mongo makes supporting it dead simple.
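In Mongo itself this kind of field selection happens server-side, but the idea is easy to see in plain Python. The function below is a simplified, hypothetical version of the projection logic — not Drumbone’s actual code — that takes a document and a comma-separated sections string, where dotted names reach into embedded sub-documents:

```python
def pick_sections(doc, sections):
    """Return a partial copy of doc containing only the requested
    fields. Dotted names (e.g. "earmarks.total_amount") reach into
    embedded sub-documents -- a toy version of Mongo's server-side
    field selection."""
    result = {}
    for path in sections.split(","):
        parts = path.split(".")
        src, dst = doc, result
        for key in parts[:-1]:
            if key not in src:
                break
            src = src[key]
            dst = dst.setdefault(key, {})
        else:
            leaf = parts[-1]
            if leaf in src:
                dst[leaf] = src[leaf]
    return result

legislator = {
    "last_name": "Lee",
    "first_name": "Barbara",
    "state": "CA",
    "earmarks": {"total_amount": 10000000, "total_number": 28,
                 "fiscal_year": 2010},
}

partial = pick_sections(
    legislator,
    "last_name,first_name,state,earmarks.total_amount,earmarks.total_number")
```

When you have Mongo do the work, the same request becomes a projection passed to the query, along the lines of `find_one(spec, {"earmarks.total_amount": 1, "earmarks.total_number": 1})`, so the unwanted fields never leave the database.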
Storing Results from Parsing and Scraping
Not coincidentally, both the National Data Catalog and Drumbone aggregate lots of data from disparate sources. For the National Data Catalog, we scrape other catalogs like data.gov and the DC Data Catalog to create our central catalog. Drumbone uses API data from GovTrack, TransparencyData, and USASpending.gov, among others.
Using a relational database to store data from different sources with divergent schemas usually means creating a lot of small, single-purpose tables, or creating one generic key/value table. With Mongo, schema isn’t enforced at the database level, making parsing and scraping much less tedious when it comes to data storage.
We’re converting the Fifty State Project to MongoDB. We’re scraping the legislative web sites for all fifty states, and while we have a baseline schema for legislators, bills, and votes, we also want to preserve all valuable but unique data that a given state may provide.
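Here’s what that looks like in practice. The states, names, and extra fields below are invented for illustration, but the shape of the problem is real: every record shares a baseline, and some states contribute fields no other state has. Both shapes can live in the same Mongo collection with no migration:

```python
# Baseline fields every state's legislators share (illustrative).
baseline = {"full_name": None, "chamber": None, "district": None}

# One state gives us only the baseline...
tx_legislator = dict(baseline,
                     full_name="Jane Doe", chamber="upper", district="5")

# ...while another exposes extra data we'd rather keep than discard.
ca_legislator = dict(baseline,
                     full_name="John Roe", chamber="lower", district="12",
                     office_phone="555-0100",          # state-specific extras
                     committee_assignments=["Budget", "Rules"])

# With Mongo, both go into the same collection as-is, e.g.:
#   db.legislators.insert(tx_legislator)
#   db.legislators.insert(ca_legislator)
```

In a relational schema, those two extra fields would force either new columns that stay NULL for forty-nine states or a generic key/value side table; in Mongo, they’re just part of that one document.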
Lightning Talk
Here’s the video of a lightning talk I gave at MongoNYC, with the slide deck below it. I go over what we do at Sunlight and how we use MongoDB in our projects. Please excuse the sound quality: I was not wearing a microphone and the walls of the room were thin, so at some points you can hear the session next door in the background.