Building Poligraft

I’m happy to announce the newest project from Sunlight Labs, Poligraft. A utility built on top of Transparency Data, Poligraft takes in a block of text, parses it for entities like politicians and corporations, and returns a result set representing the political influence contained in that text. I won’t dwell on the features — read Ellen Miller’s announcement blog post and the about page for more information. What I want to talk about instead is the development process.

Third Time’s A Charm

The idea behind Poligraft is not new. Back in late 2007, well before I joined Sunlight, the nascent Labs team attempted an initial version of the concept that didn’t pan out. Then in early 2009, I still wasn’t at Sunlight, but I did develop an entry for the first Apps for America contest that was called Defogger. Defogger was embarrassingly slow, didn’t use any AJAX updating, and stopped short of making the connections between entities that Poligraft does today. Much more worthy apps placed at the top of Apps for America.

But in developing Defogger, I did build a key piece of the puzzle used in Poligraft: what I now call the content plucker. Since the best way to use Poligraft is through the bookmarklet, the app needs to pull the article content out of an arbitrary URL. Thankfully, the Readability bookmarklet does exactly that, and the code is open source. The algorithm examines the elements that contain paragraphs and assigns more points to the containers that are more likely to hold the page's main content. Porting the algorithm from JavaScript to Ruby was a fun exercise in screen scraping.
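To give a feel for the scoring idea, here is a minimal Python sketch. It is not the Readability code and not the Ruby port used in Poligraft: the class name, the list of container tags, and the scoring rule (one point per paragraph plus one per 100 characters) are all assumptions chosen for illustration, and it expects reasonably well-formed markup.

```python
from html.parser import HTMLParser


class ParagraphScorer(HTMLParser):
    """Credit each <p> to its nearest enclosing container element."""

    # Assumed set of tags that count as "containers" for this sketch.
    CONTAINERS = {"div", "article", "section", "td"}

    def __init__(self):
        super().__init__()
        self.stack = []    # indexes into self.scores for currently open containers
        self.scores = []   # [label, points] per container, in document order
        self.in_p = False
        self.p_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            attrs = dict(attrs)
            label = attrs.get("id") or attrs.get("class") or tag
            self.scores.append([label, 0])
            self.stack.append(len(self.scores) - 1)
        elif tag == "p":
            self.in_p = True
            self.p_chars = 0

    def handle_data(self, data):
        if self.in_p:
            self.p_chars += len(data.strip())

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            if self.stack:
                # Hypothetical rule: 1 point per paragraph, plus 1 per 100 characters.
                self.scores[self.stack[-1]][1] += 1 + self.p_chars // 100
        elif tag in self.CONTAINERS and self.stack:
            self.stack.pop()


def best_container(html):
    """Return the label (id, class, or tag name) of the highest-scoring container."""
    parser = ParagraphScorer()
    parser.feed(html)
    parser.close()
    if not parser.scores:
        return None
    return max(parser.scores, key=lambda entry: entry[1])[0]
```

A navigation block with one short paragraph scores far lower than a container holding several long paragraphs, so the article body tends to win. A real implementation would also weigh things like class names and link density, which this sketch leaves out.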

Harnessing APIs

With the content plucked, Poligraft extracts the entities (people, organizations, companies) from the text with the Calais API. A service by Thomson Reuters, Calais semantically processes any given text, and returns a rich representation of that text. It’s very detailed, much more so than what Poligraft needs. Try out the Calais Viewer to see what I mean.

Using the people, companies, and organizations that Calais detected, Poligraft then uses the Transparency Data API in three steps. First, the Transparency Data entity search is called on each Calais entity. This will usually weed out the majority of entities detected by Calais, because we’re only focusing on entities that have something to do with campaign contributions. These are the “Points of Influence” you see in the sidebar, and you can sometimes see the “weed out” step if you watch closely. Second, on that subset of entities, Poligraft uses the Transparency Data aggregate endpoints to draw the graphs you see on the sidebar. Third, the “Aggregated Contributions” section in the sidebar is filled out using a pairwise aggregation endpoint that is not yet described in the official Transparency Data API documentation. It’ll be ready for public use very soon.
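The three steps can be sketched roughly as follows. This is only an outline of the flow described above, not the actual Poligraft code: the function name and the three callables passed in (search_entity, fetch_aggregates, fetch_pairwise) are hypothetical stand-ins for the Transparency Data calls, and querying every pair of matched entities is an assumption on my part.

```python
from itertools import combinations


def build_sidebar(calais_entities, search_entity, fetch_aggregates, fetch_pairwise):
    """Outline of the three-step flow, with the API calls injected as callables.

    search_entity(name)      -> a match object, or None if nothing was found
    fetch_aggregates(match)  -> aggregate data used to draw one entity's graph
    fetch_pairwise(a, b)     -> contribution data between two matches, or None
    All three are hypothetical placeholders, not real client functions.
    """
    # Step 1: entity search weeds out entities with no contribution data.
    matched = {}
    for name in calais_entities:
        match = search_entity(name)
        if match is not None:
            matched[name] = match

    # Step 2: aggregate data for each surviving entity.
    graphs = {name: fetch_aggregates(match) for name, match in matched.items()}

    # Step 3: pairwise aggregation (assumed here to run over every pair).
    pairs = {}
    for (name_a, match_a), (name_b, match_b) in combinations(matched.items(), 2):
        result = fetch_pairwise(match_a, match_b)
        if result:
            pairs[(name_a, name_b)] = result

    return {
        "points_of_influence": graphs,
        "aggregated_contributions": pairs,
    }
```

Passing the API calls in as arguments keeps the sketch independent of any particular endpoint or client library.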

Providing an API

Poligraft also has its own built-in API, which Poligraft itself uses to populate the results page dynamically via AJAX. Specify a URL or text to be processed, and get back the results in JSON format. In fact, every result page in Poligraft has a corresponding JSON representation. Just append .json to the unique slug, as in http://poligraft.com/ABCD.json.

To process an article, use the http://poligraft.com/poligraft endpoint in conjunction with a POST or GET request:

http://poligraft.com/poligraft?url=ARTICLE-URL-HERE&json=1

Be sure to pass in json=1 or else HTML will be returned, and use url= to pass in a URL or text= to pass in a selection of text. HTTP clients must have redirection enabled, as the response will be a redirect to a slug endpoint like http://poligraft.com/ABCD.json.
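As a small illustration, the request URL can be assembled like this in Python. The helper name and its arguments are my own invention for the example; only the endpoint and the url, text, and json parameters come from the description above.

```python
from urllib.parse import urlencode

POLIGRAFT_ENDPOINT = "http://poligraft.com/poligraft"


def poligraft_request_url(article_url=None, text=None):
    """Build a request URL that asks for JSON (hypothetical helper)."""
    if (article_url is None) == (text is None):
        raise ValueError("pass exactly one of article_url or text")
    if article_url is not None:
        params = {"url": article_url}
    else:
        params = {"text": text}
    params["json"] = 1  # without json=1 the endpoint returns HTML
    # urlencode percent-escapes the article URL so it survives as a query value.
    return POLIGRAFT_ENDPOINT + "?" + urlencode(params)
```

For a long selection of text, sending the same parameters in the body of a POST request is probably the safer choice, since query strings have practical length limits.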

Because Poligraft does processing asynchronously, this endpoint will return a 202 ACCEPTED code until processing is finished, when it returns a 200 OK. In addition to the HTTP response code, there’s a top-level field in the JSON called processed which is set to false while processing is active. Poll the endpoint every few seconds until the return code is 200 or the processed value in the JSON is true. Both techniques will work.
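A polling loop along these lines should do the job. It is a sketch, not an official client: the function names, the three-second interval, and the attempt limit are assumptions. Python's urllib follows redirects by default, which covers the redirect to the slug endpoint.

```python
import json
import time
import urllib.request


def fetch_json(url):
    """GET a URL and return (status_code, parsed_json_body)."""
    with urllib.request.urlopen(url) as response:
        return response.status, json.loads(response.read().decode("utf-8"))


def wait_for_result(url, fetch=fetch_json, interval=3.0, max_attempts=20,
                    sleep=time.sleep):
    """Poll until the status is 200 or the JSON says processed is true."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        # Either signal is enough: a 200 status or processed == true.
        if status == 200 or body.get("processed") is True:
            return body
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError("result was not ready after %d attempts" % max_attempts)
```

The fetch and sleep arguments are injectable so the loop can be exercised without touching the network.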

Open Source + Open Data

As usual, the code behind Poligraft is open source on GitHub. The APIs it uses are free for anyone to use. In particular, the Transparency Data API is incredibly valuable for building tools and apps that examine and visualize political influence. While building Poligraft, I was pleasantly surprised on many occasions by what Transparency Data provides. In the months and years to come, I hope we see many more apps built on top of it, not just from within the Labs, but from the wider community.