Today I'm pleased to announce that the Superfastmatch project is open-source and ready for use. I’m excited to be posting this—I’ve been waiting to do so for a while! I think SFM is really, really cool—and I think you’ll agree once I tell you why. But first, a little bit of backstory.
We first became aware of the technology behind SFM when Churnalism launched. Created by the Media Standards Trust, Churnalism is an ingenious effort to detect when UK journalists copy-and-paste press releases into their published stories. It’s a great project, but we were even more excited by the technology behind it. Finding overlap between documents in huge corpora is not as simple a problem as you might think--it's tempting to assume that diff will manage the job, but in truth that tool is unsuitable for most types of documents.
The basic algorithmic challenge is the same one faced by those working on systems to detect academic plagiarism--a rich and evolving field in its own right. But surprisingly little of that technology is freely available.
Sunlight reached out to MST and was ultimately able to provide a grant that allowed them to open-source their code. Even better: they've been improving it. A mostly-Python implementation that needed hefty hardware is now a compiled solution that runs blazingly fast on commodity hardware (we’ve also successfully run it on vanilla EC2 instances--see the README for details).
Each instance of the system is an HTTP server. Users load documents by POSTing their text to a RESTful interface. As each document is processed, it’s normalized and split into substrings, which are hashed into unique tokens. After you’ve loaded your documents, you run an association task, which compares each document's collection of tokens against one another. Where there's overlap, contiguous chunks of text are assembled, and you can begin to inspect the parts that might be borrowed from one another. (The actual mechanics of the system are considerably more complex than this explanation, but the preceding should give you a rough idea of how things work.)
There's a demo at scripts/gutenberg.sh that loads the Bible, the Koran and ten classic novels from Project Gutenberg into the system, then finds every bit of overlap between them (it takes about 45 seconds from start to finish on my three year-old laptop).
Sunlight's particular interest is in pairing this technology with data from our Open States Project in order to detect when legislation is migrating between statehouses or from interest groups and into law. But we hope and expect that SFM's uses will extend well beyond our mission--the applications of this technology seem sure to surprise us.
The project remains under very active development. We expect a bugfix related to very large datasets to be merged into the main branch in a week or two, for instance. But Sunlight and MST are both anxious to see developers begin to acquaint themselves with Superfastmatch. And of course we're also hopeful that others might be inspired to contribute back to it. Providing the system's output as JSON, for example, is a long-planned feature that would be easy to implement and of considerable value.
For now, though, please have a look at the project repo and start thinking about what SFM might make possible for you. You don't need to look for a needle in a haystack anymore--you just need a few good haystacks.