Earlier this week the annual Law Via the Internet conference was hosted by the Legal Information Institute at Cornell University. The conference schedule featured talks on a range of policy and technical subjects, including the topic of extracting legal citations from text and understanding them programmatically, which arises whenever people need to determine the relevance of legal documents based on the authorities they cite. Recognizing citations in text is also a vexing but fun programming challenge, so I was excited to see this issue figure prominently in at least four separate talks.
So why is parsing citations from text a difficult problem? I'll attempt to briefly explain. Beyond serving as attribution for a quotation or claim, a citation's only purpose is to identify the cited authority without too much ambiguity. For that reason, early citation manuals tended to be simple and straightforward. The citation formats they prescribed were calibrated only for authorship and consumption by humans, never machines. The first edition of the Blue Book in 1926 was a mere 26 pages in length. But by 2010, the nineteenth edition of the Blue Book had swollen to over 500 pages, hundreds of which struggled to respond to the vagaries of citing electronic resources like websites and CD-ROM sets. Beyond that sheer growth in complexity, the idiosyncratic field of deciding how sources should be cited tends to be populated with (there's no nice way to put this) weird people who experience an aesthetic mania in arguing about, and ultimately revising, existing rules. So citation rules are in constant flux and vary widely depending on which manual you consult. A third factor is that complicated, fluid rules are difficult to comply with, so writers frequently bend or break the rules to save time. With these challenges in mind, here are some techniques that presenters at the conference are using to deal with them:
Michael Lissner, the co-founder and lead developer of CourtListener.com, explained the approach that he and collaborator Rowyn McDonald are using to parse case citations from federal court opinions. They start by identifying a subset of abbreviations for federal bound volumes, like "U.S.", "F. 2d", and "F. 3d", and then search to the left and right of each one to identify the volume number and page number, respectively (source code here). This is slightly different from the approach taken by Sunlight Labs' own Eric Mill in citation.js, which defines a set of regular expressions and callback functions to process the matches each finds.
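To make the "find the reporter, then look left and right" idea concrete, here is a minimal Python sketch. It is not CourtListener's code: the REPORTERS list, the find_citations function, and the whitespace tokenization are all assumptions made for illustration, and the second citation in the usage example is invented.

```python
# Hypothetical, deliberately tiny list of reporter abbreviations.
# A real system would need a far longer list and many spelling variants.
REPORTERS = ["U.S.", "F. 2d", "F. 3d"]


def find_citations(text):
    """Return (volume, reporter, page) tuples found in text."""
    tokens = text.split()
    results = []
    for reporter in REPORTERS:
        parts = reporter.split()  # "F. 3d" spans two tokens
        width = len(parts)
        # Start at 1 and stop early so a token exists on both sides.
        for i in range(1, len(tokens) - width):
            if tokens[i:i + width] != parts:
                continue
            volume = tokens[i - 1]                       # look left
            page = tokens[i + width].rstrip(".,;:)")     # look right
            if volume.isdigit() and page.isdigit():
                results.append((int(volume), reporter, int(page)))
    return results


sample = ("See Roe v. Wade, 410 U.S. 113 (1973), and "
          "Smith v. Jones, 123 F. 3d 456, 460 (9th Cir. 1997).")
print(find_citations(sample))
```

Anchoring on the reporter keeps the search cheap, because volume and page are only examined next to a known abbreviation. The cost is that any reporter missing from the list is silently skipped.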
Another presentation by Marc-André Morissette, the director of technology at Lexum, focused on specialized techniques his firm is using to parse statute and session law citations from court decisions in Canada. This presentation touched on the challenges of programmatically identifying human-written citations that deviate from the styles prescribed in citation manuals. To correctly recognize these malformed references, Morissette's team precomputes all valid citations to the resources they want to extract from the cases, then computes a variety of misspellings and mistakes that commonly appear in the targeted citation types. They then generate a state machine (a decision tree, basically) and scan over the input text one token at a time, testing each against the current node in the decision tree. When a terminal node is reached, a match is found. This technique has the interesting (but also potentially limiting) characteristic of substantively verifying the citation during the extraction operation: if a citation doesn't get matched against the decision tree, it must not refer to a known resource.
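One way to picture the precompute-then-scan technique is a token trie, sketched below in Python. This is my own simplified illustration, not Lexum's implementation: the VARIANTS table, the "criminal-code" identifier, and the tokenize, build_trie, and scan helpers are all hypothetical, and the misspellings are typed in by hand rather than generated.

```python
END = object()  # sentinel key marking a terminal node in the trie


def tokenize(text):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    stripped = (t.strip(".,;:()").lower() for t in text.split())
    return [t for t in stripped if t]


def build_trie(variants):
    """Map every known spelling of a citation to its canonical identifier."""
    root = {}
    for canonical, spellings in variants.items():
        for spelling in spellings:
            node = root
            for token in tokenize(spelling):
                node = node.setdefault(token, {})
            node[END] = canonical
    return root


def scan(text, trie):
    """Walk the text one token at a time, keeping the longest match."""
    tokens = tokenize(text)
    matches, i = [], 0
    while i < len(tokens):
        node, j, best = trie, i, None
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                best = (node[END], j)
        if best:
            matches.append(best[0])
            i = best[1]  # resume after the matched citation
        else:
            i += 1
    return matches


# Hypothetical table: one known statute plus two sloppy spellings of it.
VARIANTS = {
    "criminal-code": [
        "Criminal Code, R.S.C. 1985, c. C-46",
        "Criminal Code, RSC 1985, c C-46",
        "Crimnal Code, R.S.C. 1985, c. C-46",
    ],
}
TRIE = build_trie(VARIANTS)
print(scan("Charged under the Crimnal Code, R.S.C. 1985, c. C-46.", TRIE))
```

Because only precomputed spellings live in the trie, every match is a reference to a known resource, which is the verification property described above. The flip side is that anything not anticipated in the table is never found.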
I was also very pleased to see that Anurag Acharya, the lead engineer of Google Scholar, was not only in attendance at LVI 2012, but presenting. In his presentation, he detailed a number of challenges his team faced in publishing and cross-referencing court opinions (and scholarly journals) on Google Scholar: how to react to changes in citation formats across jurisdictions and different time periods, how to distinguish short citations ("in the New York decision we held...") from normal prose ("she lived in New York"), and how to resolve ambiguous citations, like "ibid", to the actual source they refer to. Anurag understandably stopped short of disclosing his recipe for the secret sauce that makes Google Scholar so, so delicious, but hinted that his go-to tools included a huge corpus of test data, unit tests, Bayes' theorem, and plenty of common sense. He spoke in very simple and down-to-earth terms about his work, but I still came away from his talk dumbfounded by what he has accomplished with a team of fewer than three full-time developers.
A fourth and really impressive presentation was given by Frank Bennett, an associate professor at the Nagoya University law school in Japan. Whereas the other presentations focused on identifying citations in text and using them to link related documents, Frank focused on the opposite but closely related task of transforming bibliographic data into properly formatted legal citations that can be inserted into documents. Confronted with the problem of law students so confounded by the complexity of the Blue Book that they simply resorted to plagiarizing citations, Frank set out to adapt the Zotero research platform for use with legal resources. Zotero is a browser extension that enables users to select resources from sites like Amazon and Google Scholar and store their bibliographic information in a local database. Frank's work goes one step further and provides style definitions that citeproc-js can use to format the citations for publication. The session included an eye-popping demonstration of a related word-processing utility that enables users to insert formatted citations from sources in their local Zotero database.
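The reverse direction, going from structured data to a formatted citation, can be sketched in a few lines of Python. This is only an illustration of the general idea and has nothing to do with how citeproc-js or its style definitions actually work: the CaseRecord class, its field names, and the format_case function are all assumptions, and the single hard-coded format covers only a basic case citation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseRecord:
    """Hypothetical bibliographic record for a court opinion."""
    name: str
    volume: int
    reporter: str
    first_page: int
    year: int
    pinpoint: Optional[int] = None  # specific page being cited, if any


def format_case(record):
    """Render a record as 'Name, volume reporter page[, pinpoint] (year)'."""
    cite = f"{record.name}, {record.volume} {record.reporter} {record.first_page}"
    if record.pinpoint is not None:
        cite += f", {record.pinpoint}"
    return f"{cite} ({record.year})"


roe = CaseRecord("Roe v. Wade", 410, "U.S.", 113, 1973)
print(format_case(roe))
```

Keeping the data separate from the formatting rule is what lets one record be rendered in different citation styles, which is roughly the job a style definition does.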
So what does any of this have to do with Sunlight Labs? Parsing citations effectively is an inevitable task for applications that need to identify related sources in text. For us, that issue is arising in a few places. Eric's latest project, Scout, aims to provide users with alerts for proposed regulations, and one of the more effective ways to determine whether a proposed regulation affects your interests is to search its text for references to statutes you care about. In an attempt to build out this feature, Eric has started working on parsing US Code citations from the text of bills and proposed regulations. Similarly, the Open States project is also looking for ways to identify what statutes would be impacted by legislative bills so we can link users to appropriate sites (if they exist) and also enable querying of bills by the law sections they would affect. In our case, these tend to be prosaic references rather than formal citations, which lends an additional and terrifying dimension to the problem. But from a software developer's perspective, parsing citations involves the same family of techniques regardless of the underlying resources you're looking for.
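For a sense of what the formal end of that spectrum looks like, here is a rough Python sketch of a regular expression for US Code citations such as "5 U.S.C. § 552(a)(1)". It is not Eric's code or the citation.js pattern: USC_PATTERN and find_usc_citations are names I made up, and the pattern ignores ranges, hyphenated section numbers, and the prose-style references mentioned above.

```python
import re

# Hypothetical pattern: title number, "U.S.C." with optional periods,
# an optional section sign, a section number, then any "(a)(1)" parts.
USC_PATTERN = re.compile(
    r"""
    \b(?P<title>\d+)\s+                       # title, e.g. 5
    U\.?\s?S\.?\s?C\.?\s+                     # U.S.C., USC, U. S. C.
    (?:§+\s*)?                                # optional section sign(s)
    (?P<section>\d+[a-z]?)                    # section, e.g. 552 or 552a
    (?P<subsections>(?:\([A-Za-z0-9]+\))*)    # e.g. (a)(1)
    """,
    re.VERBOSE,
)


def find_usc_citations(text):
    """Return one dict per US Code citation found in text."""
    found = []
    for match in USC_PATTERN.finditer(text):
        found.append({
            "title": int(match.group("title")),
            "section": match.group("section"),
            "subsections": re.findall(
                r"\(([A-Za-z0-9]+)\)", match.group("subsections")
            ),
        })
    return found


print(find_usc_citations("as required by 5 U.S.C. § 552(a)(1) and 42 USC 1983"))
```

Returning structured fields rather than the raw matched string is what makes it possible to query bills by the sections they cite.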
All in all, LVI was a great conference and I hope to see these projects enjoy continued success.