Last Monday we launched an update to our Capitol Words project, which indexes and tokenizes the Congressional Record daily. With the launch behind us and the dust starting to settle, I'd like to walk through how we get from raw text to attributed, searchable quotations, and provide some examples of how you can interact with the data directly.
Before delving into how it works, though, it's important to acknowledge the myriad developers whose work on this project has made it possible. I'm only the most recent steward of the site; the bulk of the data legwork for this iteration was handled by Aaron Bycoffe and Jessy Kate Schingler, and the web interface owes its beauty to Caitlin Weber and Ali Felski. Timball provided the hardware, and the list continues from contributions to the scrapers all the way back to the original conception and implementation of the idea by Josh Ruihley and Garrett Schure. It's the combined efforts of everyone involved that brought us the site that's available today.
How we index
There are three steps involved in the transition from raw GPO text to index: scrape, parse and ingest. Each morning, we download yesterday's Congressional Record in plain text from FDSys and store it on disk. We log each morning's activity so that missing days can be retrieved in bulk later if something unexpected happens, either on our end or theirs. Quite frequently, GPO is several hours to a day or so late in posting, so keeping a log of the days with missing data is easier than recursing through the file system to find them. Each day's record comes in the form of dozens of files (today's batch is 95), depending on what business was conducted. Files are split by section and page, with separate sections for House proceedings, Senate proceedings, and extensions of remarks.
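To make the logging idea concrete, here is a minimal sketch of how a list of days to re-fetch could be derived from a record of successful downloads. The `missing_days` helper and the sample dates are hypothetical, not taken from the actual scraper, and a real version would also need to skip days on which no Record was published.

```python
from datetime import date, timedelta

def missing_days(retrieved, start, end):
    """Return the dates in [start, end] that are absent from `retrieved`."""
    span = (end - start).days + 1
    candidates = (start + timedelta(days=i) for i in range(span))
    return [day for day in candidates if day not in retrieved]

# Hypothetical log contents: days whose files were downloaded successfully.
retrieved = {date(2011, 10, 3), date(2011, 10, 4), date(2011, 10, 6)}

gaps = missing_days(retrieved, date(2011, 10, 3), date(2011, 10, 7))
print(gaps)
```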
Once the day's files are downloaded, we parse each of them against a set of regular expressions, yielding XML with relevant semantics. You can view the parser in its entirety here, but the gist is that it captures entities such as titles, speakers and their associated blocks of text, dates, interjections from the recorder, and document delimiters. The resulting XML is a technology-agnostic representation of each CREC, and it, too, is stored on disk to save time should we need to re-process a chunk of data or, as occurs in my nightmares from time to time, move to a different backend. This gist shows an example of a marked-up CR document.
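For a feel of what regex-driven markup looks like, here is a toy sketch. The `SPEAKER` pattern, the `CRDoc` and `speaking` element names, and the sample lines are all assumptions made up for illustration; the real parser handles many more cases than this.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: a line opening with a title and an upper-case
# surname, e.g. "Mr. REID. ", marks the start of a new speaker's remarks.
SPEAKER = re.compile(r"^\s*((?:Mr|Mrs|Ms)\. [A-Z][A-Z'\-]+)\. (.*)$")

def to_xml(lines):
    """Wrap each speaker's block of text in a <speaking> element."""
    root = ET.Element("CRDoc")
    current = None
    for line in lines:
        match = SPEAKER.match(line)
        if match:
            current = ET.SubElement(root, "speaking", speaker=match.group(1))
            current.text = match.group(2)
        elif current is not None and line.strip():
            # Continuation lines belong to whoever spoke last.
            current.text += " " + line.strip()
    return root

# Made-up sample text in the general style of the Record.
sample = [
    "  Mr. REID. Mr. President, I suggest the absence of a quorum.",
    "  This will only take a moment.",
    "  Mrs. BOXER. I ask unanimous consent to speak.",
]
doc = to_xml(sample)
print(ET.tostring(doc, encoding="unicode"))
```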
Finally, each XML file is converted to our Solr schema and posted to the index. This stage is where our n-grams are created, in lengths of one to five words, as well as stemmed unigrams, which are stored without inflection or derivation. In addition to n-grams, documents in the Solr index include fields such as identifiers and dates, speaker name / bioguide id / chamber of congress, document title, spoken quote, and full text. These fields can be aggregated as facets, creating the basis for the chart visualizations on capitolwords.org.
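As an illustration of what "n-grams in lengths of one to five words" means, here is a small sketch. It splits on whitespace only, which is an assumption; the tokenization used at ingest is certainly more careful than this.

```python
def ngrams(text, max_n=5):
    """Yield every run of 1..max_n consecutive words in `text`."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

grams = list(ngrams("I would have voted yea"))
print(len(grams))   # 5 unigrams + 4 bigrams + 3 + 2 + 1 = 15
print(grams[-1])    # the single 5-gram
```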
Now the fun part: once the day's CR is in Solr, we analyze the index (on demand, for a fair chunk of it!) in several ways: tf*idf weighting, raw counts, percentages and vector space (cosine distance) between comparable entities.
Tf*idf stands for 'term frequency–inverse document frequency.' It's useful for finding the significance of words versus the overall corpus in which they appear. There's more on our methodology at the Capitol Words about page, but in general we use tf*idf to determine top words for each day, month, legislator and state.
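For readers who haven't met tf*idf before, here is a sketch of the textbook form: term frequency within a document, multiplied by the log of (number of documents / number of documents containing the term). This is the generic formulation, not necessarily the exact weighting used on the site, and the three "days" of text are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, textbook tf*idf."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({t: (k / total) * math.log(n_docs / df[t])
                        for t, k in c.items()})
    return weights

# Invented sample "days" of speech.
days = [
    "the budget the deficit",
    "the grapefruit harvest grapefruit",
    "the budget vote",
]
scores = tf_idf(days)
top_word = max(scores[1], key=scores[1].get)
print(top_word)
```

A word like 'the' appears in every document, so its idf is log(1) = 0 and it drops out, while a word concentrated in one document floats to the top.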
While tf*idf is great at pinpointing significance or interestingness among a sea of n-grams, in practice it's less useful going the other direction--say, the top state or legislator for a particular word. For the top states and legislators on a given term detail page, we rank by overall count of attributable utterances per entity. To some degree this favors states with larger delegations, as well as states with more prolix legislators. But in general the approach causes the notable voices on a given subject to appear prominently. Party breakdowns are also derived from ratios of raw counts, as using tf*idf there would be more or less useless.
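Ranking by raw count is simple enough to show in a few lines. The attributions below are fabricated, and the legislator ids are placeholders rather than real bioguide ids.

```python
from collections import Counter

# Fabricated (state, legislator id) attributions for utterances of one term.
utterances = [("MT", "L001"), ("AZ", "L002"), ("AZ", "L003"),
              ("AZ", "L002"), ("MT", "L004"), ("AZ", "L002")]

by_state = Counter(state for state, _ in utterances)
by_legislator = Counter(legislator for _, legislator in utterances)

print(by_state.most_common())
print(by_legislator.most_common(1))
```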
The sparkline graphs on the site are drawn by percentage, meaning each point is plotted as time versus (term count / total terms). So eight mentions of 'global warming' versus 2644331 words spoken in the month of March '96 yields a value of roughly 0.000303%.
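The arithmetic behind that figure, as a quick sketch (the `percentage` helper is just for illustration):

```python
def percentage(term_count, total_terms):
    """Share of all terms in a period accounted for by one term, in percent."""
    return 100.0 * term_count / total_terms

value = percentage(8, 2644331)
print(f"{value:.6f}%")
```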
In addition to surfacing the top words for an entity and the top entities for a word, we have also begun calculating similarity between states and legislators. Similarity is derived as the cosine of the angle between two vectors of n-gram tf*idfs. To illustrate this, let's draw a basic comparison (numbers here are fabricated):
| Legislator | 'American' | 'person' | 'grapefruit' |
|------------|------------|----------|--------------|
| Max Baucus | 0.0140     | 0.0045   | 0.0000       |
| Jon Kyl    | 0.0346     | 0.0135   | 0.0018       |
Here we have two vectors, each three calculations in length. We can express them as three-dimensional points:
V1: [0.0140, 0.0045, 0.0000]
V2: [0.0346, 0.0135, 0.0018]
In this case, we could even plot these points on a triaxial graph, where each word is an axis. In reality, we generate our comparison vectors from the tf*idf values of the full superset of all unigrams attributed to either entity--yielding thousands of coordinates per point--but it's simpler for the sake of example to model space in terms of a perceivable number of dimensions. Using these two vectors, we can effectively determine how similar Max Baucus and Jon Kyl are by the cosine of the angle between the lines drawn from the origin of our graph to each point. Because we are working in one 'quadrant' where all values are positive (tf*idfs are never negative), the cosine of the angle will always be a number between 1 (0°, exactly similar) and 0 (90°, no words in common at all).
SciPy's spatial.distance module provides functions for computing the distance between two vectors--including cosine distance, among many others--making this comparison trivial to generate. Note that SciPy returns cosine distance, which is 1 minus the cosine of the angle, so a distance of 0 indicates congruence. Given our sample, we could run the following:
>>> from scipy.spatial.distance import cosine
>>> V1 = [0.0140, 0.0045, 0.0000]
>>> V2 = [0.0346, 0.0135, 0.0018]
>>> cosine(V1, V2)
0.0030305984940074415
Do that just over a million times, and we have compared every legislator to every other legislator since 1996 by every word they said. Pretty cool!
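To show what happens when the two entities don't share the same vocabulary, here is a sketch that works on sparse `{term: tf*idf}` dictionaries using only the standard library. A term missing from one side simply counts as 0 there, which is equivalent to padding both vectors out to the superset of terms. The dictionaries reuse the fabricated numbers from the table above, and the function name is my own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: tf*idf} vectors."""
    # Terms absent from one side contribute 0, so the dot product only
    # needs the shared terms; each norm uses that vector's own terms.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

# Fabricated values, as in the table above; 'grapefruit' is absent for Baucus.
baucus = {"american": 0.0140, "person": 0.0045}
kyl = {"american": 0.0346, "person": 0.0135, "grapefruit": 0.0018}

similarity = cosine_similarity(baucus, kyl)
distance = 1 - similarity   # same convention as SciPy's cosine()
print(round(distance, 6))
```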
Unfortunately, the dataset is large enough that Solr isn't performant on some facets over large ranges. So, we fall back to precalculating some of them on a weekly or monthly basis. These calculations are stored in a relational database and modeled in the Django application.
Specifically, each day's n-grams are stored at ingest time, with associated counts and tf*idf. We also store the total numbers of 1..5-grams to get a corpus size for each day. The month's tf*idfs are recomputed daily at the beginning of the month, and then weekly as that month's corpus grows. Legislator and state tf*idfs are calculated monthly due to the sheer number of records, and distances for those entities are processed immediately after. State distances take a couple of hours to calculate, and legislator distances take on the order of a couple of days.
An introduction to the API
All of this number crunching makes for some interesting views of the words spoken by our legislators, but the potential for this data goes well beyond what we've done. So, I'd like to spend a little time walking through the API we're using to drive the site, which we've also opened to the public. I should note that the documentation needs a little updating, as the final methods accept slightly different sets of parameters than what was originally planned, and that update is forthcoming.
Currently there are three documented methods: text, dates, and phrases. The first two are, at the time of this writing, far more interesting than the third, due to some performance constraints on faceting top phrases. Before you can access the API, you'll need to get a key from http://services.sunlightlabs.com/accounts/register, so do that if you haven't already. It will probably take about 5 minutes from the time you confirm your key for it to sync out, so don't be alarmed if your key is denied initially.
Once you have a working key you can start making calls. If you're interested in searching the Congressional Record for instances of a word or phrase, you want text.json. This endpoint queries Solr directly, and responds to several facets. For example, to find personal explanations that coincide with this year's summer recess, we could hit this URL:
53 results! To break those down by party we can just add party=D|R to that url:
Looks like the parties are dead even at 25 each, indicating that 3 instances weren't properly attributed.
To break the list down instead by chamber of the document they originated from, we could add chamber=House|Senate|Extensions:
28, 1 and 24, respectively. We can also search by document title:
Only 34; it seems several titles were conveniently misspelled...
As mentioned, the documentation needs updating at the time of this post, but currently filterable fields are:
- state: AK,AZ,NM
- party: D,R,I
- bioguide_id: B000243,K000352
- congress: 110,111,112
- session: 1,2
- cr_pages, volume and page_id are also available, though probably less useful for most users.
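Here is a sketch of assembling a filtered text.json request from the fields listed above. The endpoint path is assumed by analogy with the phrases.json URLs later in this post, and the `phrase` parameter name for the search term is also an assumption, so check both against the documentation once it's updated. The snippet only builds the URL; it doesn't make a request.

```python
from urllib.parse import quote, urlencode

# Assumed path, by analogy with the phrases.json URLs in this post.
TEXT_ENDPOINT = "http://capitolwords.org/api/text.json"

def text_url(apikey, **filters):
    """Build a text.json URL from an API key and keyword filters."""
    params = {"apikey": apikey}
    params.update(filters)
    # quote_via=quote encodes spaces as %20 rather than +.
    return TEXT_ENDPOINT + "?" + urlencode(params, quote_via=quote)

# `phrase` is an assumed parameter name; the filters come from the list above.
url = text_url("your-key-here", phrase="personal explanation",
               party="D", state="AZ", congress=112)
print(url)
```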
Just as we can search for instances of a word or phrase using the text method, we can get counts and tf*idfs over time using dates.json. Because this method also hits Solr directly, the same set of filters on documents is available. So: to get a sequence of document counts containing 'I would have voted' spoken by Democrats, by month, during the entirety of the George W. Bush presidency, with percentage calculations:
The third method, phrases.json, hits the database that powers the top phrase display on the site, and therefore is not capable of 'real' faceting. So, the only available metrics are:
Top phrases for a given month: http://capitolwords.org/api/phrases.json?apikey=your-key-here&entity_type=month&entity_value=201110
Top phrases for a given day: http://capitolwords.org/api/phrases.json?apikey=your-key-here&entity_type=date&entity_value=2011-10-02
Top phrases for a given legislator: http://capitolwords.org/api/phrases.json?apikey=your-key-here&entity_type=legislator&entity_value=B000243
Top phrases for a given state, sorted by count: http://capitolwords.org/api/phrases.json?apikey=your-key-here&entity_type=state&entity_value=MT&sort=count%20desc
By default, the results are sorted by tf*idf desc, but can also be sorted by count, and in either direction.
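If you'd rather build these URLs in code, a small helper along these lines should do. The function name is my own; the parameters are the ones shown in the URLs above. It only constructs the URL, leaving the actual request to whatever HTTP client you prefer.

```python
from urllib.parse import quote, urlencode

PHRASES_ENDPOINT = "http://capitolwords.org/api/phrases.json"

def phrases_url(apikey, entity_type, entity_value, sort=None):
    """Build a phrases.json URL for a month, date, legislator or state."""
    params = {
        "apikey": apikey,
        "entity_type": entity_type,
        "entity_value": entity_value,
    }
    if sort:
        params["sort"] = sort  # e.g. "count desc"
    # quote_via=quote encodes the space in the sort value as %20.
    return PHRASES_ENDPOINT + "?" + urlencode(params, quote_via=quote)

state_url = phrases_url("your-key-here", "state", "MT", sort="count desc")
month_url = phrases_url("your-key-here", "month", "201110")
print(state_url)
print(month_url)
```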
That's about it--you've now more or less seen the tech behind Capitol Words in its entirety, so it must be time to talk about what's next. We've got some ideas about ways to improve performance to deliver more API content in real-time, and a few extra datasets we'd like to incorporate, but if you've got thoughts about new ways to use this data, or requests for features you'd like to see, please do let us know in the comments here or on the Sunlight Labs mailing list!