Churnalism US: the Nuts and Bolts
Churnalism US (launching today!) allows you to check the news articles you read for influence from press releases and Wikipedia. If you’re curious about a particular article, you can simply copy/paste the web address into the Churnalism US website. You can also choose to check each news article you read by installing our browser extensions. The extensions will alert you when a news article matches our database. You can read more about using Churnalism in Nicko’s post, but I’ll explain how we approached this problem from a technical point of view.
The core technology behind the service is a fast, full text search database named SuperFastMatch. It was developed by our friends at the Media Standards Trust to power the original UK-based Churnalism.com. The original version of the site allowed you to check the influence a particular press release has had on the UK national press. Our task is the inverse of theirs but the fundamental technical challenge is the same so we used the second generation of their technology to power this new site.
SuperFastMatch employs an innovative technique that splits the text of a corpus (mostly press releases in this case) into overlapping windows of a fixed number of characters. Each of those text windows is hashed into a 26 bit number. We use a “rolling” hash function but if you’re familiar with MD5 or SHA1 then you’ve got a good idea of what it does. Every hash function suffers from hash collisions. Instead of trying to avoid these, the collisions are used as an approximation for comparing the text represented by the hash. Once a list of matching hashes is found then a more exact (but slower) comparison of the text windows can be done on this smaller set of values in order to filter out false positives.
Having this list of hashes isn’t enough to make the text search fast. Once the list of hashes is in hand they need to be stored in an index. Since the hashes are numbers, the index stores them in a numerically sorted list. This list is then delta-encoded by subtracting each number from the previous one and then using a variable bit-length encoding, stored in a sparsetable. Even with this compression, the index can grow very large; our index is about 20 GB and growing.
Once we have a list of which press releases share text with a given news article we have to analyze whether that shared text is meaningful. This is where the Churnalism web frontend takes over. We remove fragments that are mostly long proper nouns (such as “the President of the United States of America”). We then measure how many characters overlap and how close together the shared passages are, relative to the document lengths. A 3,000 word news article that shares two sentences with a press release is less interesting than a 1,000 article that shares two paragraphs. Similarly two articles of the same length that share the same two sentences with a press release aren’t always churning the press release to the same degree. We boil this down into the “density” of the shared text in the two documents as a measure of how likely the text was simply copy/pasted and then slightly edited.
Unfortunately the state of web publishing is inconsistent such that we couldn’t reliably detect and eliminate quotes. Often blockquote html tags are used for things that are not actually quotes and of course not all quotes are annotated with appropriate html tags. While initially frustrating from an engineering perspective, we’ve found it delivers an additional feature by providing context around quotes in a news article and exposing instances of news articles selectively quoting speeches or press releases.
As great of a service as Churnalism provides, we think the underlying technology has many other exciting uses. SuperFastMatch can be used as an approach to any problem that requires a longest common substring search. If you’re a Pythonista, we have a client library that will handle simple load balancing and sharding by document type. It also provides tools to backup and restore the SuperFastMatch index (the index is ephemeral, so a reboot wipes the data). We’ve found it useful in our Ad Hawk service for clustering “cookie-cutter” attack ads where most of the audio is the same but the politician’s name and background are changed. If you find it useful, let us know. If you have any trouble setting it up, submit a ticket to the Github project and we’ll do our best to get you up and running.