To take a break from the routine and our official projects, Sunlight Labs organized an internal “labs olympics”, in which teams would compete for outrageous prizes by building an extracurricular project. This installment brings you the contribution from “Team Intern”.
As Team Intern, we felt we had something to prove. Could four unseasoned new recruits withstand the blazing glory of the veteran Sunlighters? On the team were Charlie DeTar (from MIT, working at Sunlight Labs on Transparency Data), Dan Schneiderman (from RIT, working on the Fifty State Project), Michael Stephens (from RPI, also with the Fifty State Project) and Ryan Wold (consultant, working on the National Data Catalog).
We started off on Monday morning with a couple of vague ideas of what we might work on (Some sort of direct message/Twitter bot for RSS feeds? Something to do with mapping?). We kicked things off with a couple of hours of brainstorming: putting ideas on Post-it notes, sorting them into categories, and pruning. We eventually settled on a “Legalese Translator” service: a wiki which lets people annotate legalese documents – such as Terms of Service and Privacy Policies – with more human-readable summaries, and eye-catching icons indicating major problem areas (such as the company asserting they can change the TOS at any time). We started poking around the MediaWiki codebase to see what it would take to write a few extensions to suit our needs. After spending a couple of hours on this, we started to second-guess ourselves: would we be able to pull off something worthy of a demo? Challenges included coming up with a taxonomy of legal problems (none of us are lawyers), coming up with enough seed data to make the wiki work, and the realization that the vast majority of the work in a project like this would involve community management, expectation setting, and organization, none of which were particular strengths of ours.
So, at 1pm on Monday, with a quarter of the allotted time already consumed, we shifted gears. Gathered around a whiteboard, we almost instantly converged on another topic: mapping the complex references in bodies of law. Legal code tends to refer to itself, often in noodly, snaky paths that are hard to traverse, and most of the laws were written before such a thing as “hypertext” existed. This stayed in our general topic area of “legalese”, but gave us a much more finite and concrete objective: visualizing and navigating references in laws. We started exploring a few different bodies of law to choose one for the project, and settled on the US Code – a gargantuan body comprising more than 50 titles broken into more than 60,000 sections with a decidedly complex subsection hierarchy. To get started, we made use of Cornell University’s XML translation of the code. For the rest of the day, we worked on importing the code into a relational database from which we could generate the reference hierarchies necessary for our navigation and visualization tools. And a name... we needed a name. Since we were dealing with the law in a shredded and stringy form, we decided to call it “Coleslaw”, or if you prefer, “Cole§law”.
The US Code is awfully complex. Among the 50 titles of the US Code, there are 168,000 references – including those within and between sections. Now on to the eye candy.
U.S. Code browser
The first tool we created was a Django application to import the code, along with a simple view to browse it. To make navigating the spaghetti structure of the code a little easier, we support opening referenced sections in a pop-up frame. We also provide a list of outgoing and incoming references at the bottom of each section. The stub application is available at http://coleslaw.sunlightlabs.com. The code for the application, as well as the Python code to import the XML files, is available on GitHub.
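To give a rough idea of how the outgoing and incoming reference lists can be derived, here is a minimal sketch using Python's built-in sqlite3 module rather than the Django models in the real application. The table names, column names, and the three sample sections are all invented for illustration; they are not the actual Coleslaw schema or real citations.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE section (
            id INTEGER PRIMARY KEY,
            title INTEGER NOT NULL,
            number TEXT NOT NULL
        );
        CREATE TABLE reference (
            source_id INTEGER NOT NULL REFERENCES section(id),
            target_id INTEGER NOT NULL REFERENCES section(id)
        );
        """
    )
    # Made-up sections and references, purely for demonstration.
    conn.executemany(
        "INSERT INTO section VALUES (?, ?, ?)",
        [(1, 1, "1"), (2, 1, "2"), (3, 2, "1")],
    )
    conn.executemany(
        "INSERT INTO reference VALUES (?, ?)",
        [(2, 1), (3, 1), (1, 3)],
    )
    return conn


def outgoing(conn, section_id):
    """Sections that the given section refers to, as (title, number) pairs."""
    rows = conn.execute(
        "SELECT s.title, s.number FROM reference r "
        "JOIN section s ON s.id = r.target_id "
        "WHERE r.source_id = ? ORDER BY s.title, s.number",
        (section_id,),
    )
    return list(rows)


def incoming(conn, section_id):
    """Sections that refer to the given section, as (title, number) pairs."""
    rows = conn.execute(
        "SELECT s.title, s.number FROM reference r "
        "JOIN section s ON s.id = r.source_id "
        "WHERE r.target_id = ? ORDER BY s.title, s.number",
        (section_id,),
    )
    return list(rows)
```

Storing each reference once as a (source, target) pair means both lists come from the same table; only the direction of the join changes.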
Next, using exports of the relationship hierarchy between all the references of the code, we developed a set of visualizations. As an example, here are Graphviz renderings of the relationships contained in titles 3, 11, and 41.
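An export of this kind can be as simple as writing the references out in Graphviz's DOT language. The sketch below is an assumption about how one might do it, not the script we actually used; the `to_dot` function and the section labels in the usage note are hypothetical.

```python
def to_dot(edges, name="uscode"):
    """Render (source, target) reference pairs as a Graphviz DOT digraph."""
    lines = [f"digraph {name} {{"]
    # De-duplicate and sort so the output is stable between runs.
    for source, target in sorted(set(edges)):
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)
```

Calling `to_dot([("11 USC 101", "11 USC 362")])` (a placeholder edge) yields text that can be saved to a file and rendered with, for example, `dot -Tpng refs.dot -o refs.png`.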
Here is another visualization for titles 11 and 28. Sections are laid out linearly on the X axis, and references are noted with a curve that connects the two corresponding points.
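The geometry behind an arc layout like this is small enough to sketch. Assuming sections are numbered by their position along the X axis, each reference becomes a half-circle whose diameter spans the two positions. The function names, the spacing value, and the choice of SVG as output are all assumptions for illustration, not a description of how the images above were produced.

```python
def arc_path(i, j, spacing=10.0, baseline=100.0):
    """SVG path for a half-circle joining section positions i and j."""
    x1, x2 = sorted((i * spacing, j * spacing))
    radius = (x2 - x1) / 2
    # Sweep flag 1 going left to right draws the arc above the baseline.
    return (
        f"M {x1:g} {baseline:g} "
        f"A {radius:g} {radius:g} 0 0 1 {x2:g} {baseline:g}"
    )


def arc_diagram_svg(section_count, references, spacing=10.0):
    """Build a minimal SVG document with one arc per (source, target) pair."""
    width = max(section_count - 1, 1) * spacing
    # The widest possible arc has radius width / 2, so that is the height.
    baseline = width / 2
    paths = [
        f'<path d="{arc_path(i, j, spacing, baseline)}" '
        'fill="none" stroke="black"/>'
        for i, j in references
        if i != j  # a self-reference has no width to draw
    ]
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width:g}" height="{baseline:g}">'
        + "".join(paths)
        + "</svg>"
    )
```

Long arcs then stand out as references that jump far across a title, while dense clusters of short arcs show sections that mostly cite their neighbors.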
We also used the database we generated to do some statistical analysis of the code’s reference structure. This led to some interesting but not altogether surprising insights – for example, 4 of the 10 most widely referenced sections of the code are within Title 26, the Internal Revenue Code. Also heavily referenced is Title 5, Section 552 – the Freedom of Information Act.
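A ranking like this boils down to counting how often each section appears as the target of a reference. Here is a minimal sketch, assuming the references have already been pulled out of the database as (source, target) pairs; the `most_referenced` helper is hypothetical.

```python
from collections import Counter


def most_referenced(references, n=10):
    """Return the n most frequently cited targets with their counts."""
    counts = Counter(target for _source, target in references)
    return counts.most_common(n)
```

The same counter keyed on the title of each target, rather than the full section, would show which titles attract the most references overall.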
During the two and a half days from concept through development, we came up with a lot of ideas to add to Coleslaw. Unfortunately, we were not able to implement all of them. Possible future work includes color-coding specific titles or reference types to see how they relate on the arc visualization. Another idea was to create an interactive interface for browsing the code, jumping from one reference to another, within the context of a treemap visualization. Each additional reference would extend a long line of visual nodes, each providing the text of the referenced code. A tool like Coleslaw could also be adapted to compare changes made to laws, helping people to understand the repercussions of new legislation.