SuperFastMatch

 

Churnalism: Discover When News Copies from Other Sources

Churnalism US is a new web tool and browser extension that lets you compare the news you read against existing content to uncover possible instances of plagiarism. It is a joint project with the Media Standards Trust.

Simply feed a link or block of text into the Churnalism site, or let the browser extension run in the background and notify you when text matches Churnalism's cache of documents. The cache includes most articles in Wikipedia, press releases from PR Newswire, PR News Web, EurekAlert!, congressional leadership offices, the White House, a sampling of Fortune 500 companies, prominent philanthropic foundations and much more. The browser extension, available for Chrome, Internet Explorer and Firefox (full approval pending), allows Churnalism to extract article text from a whitelist of common news sites and lets you know when something you're reading may be copied from another source. It's a rare occurrence, but it's not unprecedented. Just last week Tom Lee, a noted Churnalism beta tester and Sunlight Labs Director, found through Churnalism that Reuters' prematurely published obituary of still-alive-human George Soros borrowed heavily from the collection of quotes on his Wikipedia page.

For a video walkthrough of how to use the Churnalism web tool and extensions please watch this two minute tutorial on Sunlight Academy featuring Kaitlin Devine, a developer on Churnalism:

Sunlight’s Churnalism is based on a UK site of the same name and is driven by open-source text analysis technology dubbed SuperFastMatch, both developed by the awesome Media Standards Trust. For a deeper dive into the underlying technology and process behind the project, check out this detailed post from Drew Vogel, another developer on Churnalism.

With the extension installed, you can learn about the sourced and unsourced flow of text copied from somewhere else. As anecdotal evidence from my own experience using Churnalism, I've found a number of articles about science topics that rely heavily on press releases and study summaries. For example, take this piece on the BBC website about epilepsy and migraines. Churnalism found that a significant portion of the text came from this press release on EurekAlert! and let me know with a ribbon notification at the top of the page. When you tap the Show Me button on the notification, Churnalism overlays a side-by-side display of the article and the possible match, with copied text highlighted for easy comparison:

The Sunlight Foundation's Churnalism US shows overlap of a BBC article and a press release.
Using the Churnalism browser extension, it's easy to see the overlap between the article, shown on the left, and the corresponding text copied from a press release, shown on the right.

The best way to detect influence and language sharing from other sources is to install the browser extension and continue consuming news. You'll slowly start uncovering overlaps of language seen in this CBS News report, this NY Daily News article, this piece on NBC News or maybe uncover a reverse application of Churnalism, like this New York Times article that is cited heavily in a Wikipedia article.

We understand the privacy sensitivities with an extension extracting text from what you read, so we've designed Churnalism to be quite customizable and to never retain identifiable information such as your IP address. You can easily change which sites Churnalism runs on by going into the settings for the browser extension. We've provided a basic whitelist of major news sites, a listing of local news affiliates and the ability to let Churnalism run on any site with "news" or "article" in the URL, but all of these can be removed or pared down (or expanded!) to whatever sites you're interested in.

We're very excited to get this project out into the public and hope to continue to improve the underlying software, as there are some exciting potential applications for large corpus text matching. We used the SuperFastMatch technology to look at model legislation, and it drove stories like our look at how ALEC distributed the 'Stand Your Ground' legislation for adoption in a number of different states.

Let us know any interesting Churnalism matches you uncover!

Churnalism US: the Nuts and Bolts

Churnalism US (launching today!) allows you to check the news articles you read for influence from press releases and Wikipedia. If you’re curious about a particular article, you can simply copy/paste the web address into the Churnalism US website. You can also choose to check each news article you read by installing our browser extensions. The extensions will alert you when a news article matches our database. You can read more about using Churnalism in Nicko's post, but I'll explain how we approached this problem from a technical point of view.

The core technology behind the service is a fast, full text search database named SuperFastMatch. It was developed by our friends at the Media Standards Trust to power the original UK-based Churnalism.com. The original version of the site allowed you to check the influence a particular press release has had on the UK national press. Our task is the inverse of theirs but the fundamental technical challenge is the same so we used the second generation of their technology to power this new site.

SuperFastMatch employs an innovative technique that splits the text of a corpus (mostly press releases in this case) into overlapping windows of a fixed number of characters. Each of those text windows is hashed into a 26 bit number. We use a "rolling" hash function but if you’re familiar with MD5 or SHA1 then you've got a good idea of what it does. Every hash function suffers from hash collisions. Instead of trying to avoid these, the collisions are used as an approximation for comparing the text represented by the hash. Once a list of matching hashes is found then a more exact (but slower) comparison of the text windows can be done on this smaller set of values in order to filter out false positives.
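To make the windowing and hashing idea concrete, here is a minimal sketch in Python. It is not the SuperFastMatch implementation: the window size, the polynomial base and the function names are assumptions chosen for illustration, and only the 26 bit hash width comes from the description above. It hashes every overlapping window with a simple rolling polynomial hash, then confirms each hash match with an exact text comparison to weed out collisions.

```python
WINDOW = 15                      # window size in characters (assumed value)
HASH_BITS = 26                   # hash width mentioned in the post
BASE = 257                       # polynomial base (assumed value)
MASK = (1 << HASH_BITS) - 1      # keeps every hash within HASH_BITS bits


def window_hashes(text, window=WINDOW):
    """Return [(position, hash)] for every overlapping window of `text`."""
    codes = [ord(c) for c in text]
    if len(codes) < window:
        return []
    # BASE ** (window - 1), used to remove the character leaving the window.
    top = pow(BASE, window - 1, 1 << HASH_BITS)
    h = 0
    for code in codes[:window]:
        h = (h * BASE + code) & MASK
    hashes = [(0, h)]
    for i in range(window, len(codes)):
        h = (h - codes[i - window] * top) & MASK   # drop the oldest character
        h = (h * BASE + codes[i]) & MASK           # append the newest character
        hashes.append((i - window + 1, h))
    return hashes


def shared_windows(doc_a, doc_b, window=WINDOW):
    """Return [(pos_in_a, pos_in_b)] for windows whose text really matches."""
    index = {}
    for pos, h in window_hashes(doc_a, window):
        index.setdefault(h, []).append(pos)
    matches = []
    for pos_b, h in window_hashes(doc_b, window):
        for pos_a in index.get(h, []):
            # Equal hashes are only a hint; compare the text to drop collisions.
            if doc_a[pos_a:pos_a + window] == doc_b[pos_b:pos_b + window]:
                matches.append((pos_a, pos_b))
    return matches
```

The rolling update is what makes this cheap: each new window's hash is derived from the previous one in constant time instead of rehashing the whole window.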

Having this list of hashes isn't enough to make the text search fast. Once the list of hashes is in hand, the hashes need to be stored in an index. Since the hashes are numbers, the index stores them in a numerically sorted list. This list is then delta-encoded by replacing each number with its difference from the previous one; the differences are written with a variable bit-length encoding and stored in a sparsetable. Even with this compression, the index can grow very large; our index is about 20 GB and growing.
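The compression step can be sketched as follows. This is an illustration only: it uses a byte-oriented variable-length encoding (7 data bits per byte, similar to the common "varint" scheme) as a stand-in for whatever bit-level encoding SuperFastMatch actually uses, it does not model the sparsetable, and all function names are hypothetical.

```python
def delta_encode(sorted_values):
    """Replace each value with its difference from the previous value."""
    deltas = []
    previous = 0
    for value in sorted_values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by keeping a running total."""
    values = []
    total = 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


def varint_encode(numbers):
    """Pack non-negative integers using 7 data bits per byte (assumed scheme)."""
    out = bytearray()
    for n in numbers:
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)   # high bit set: more bytes follow
            else:
                out.append(byte)          # high bit clear: last byte
                break
    return bytes(out)


def varint_decode(data):
    """Invert varint_encode."""
    numbers = []
    n = 0
    shift = 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n = 0
            shift = 0
    return numbers
```

Because the list is sorted, neighbouring values tend to be close together, so most differences are small and fit in a single byte even when the original numbers would need three or four.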

Once we have a list of which press releases share text with a given news article we have to analyze whether that shared text is meaningful. This is where the Churnalism web frontend takes over. We remove fragments that are mostly long proper nouns (such as "the President of the United States of America"). We then measure how many characters overlap and how close together the shared passages are, relative to the document lengths. A 3,000 word news article that shares two sentences with a press release is less interesting than a 1,000 word article that shares two paragraphs. Similarly, two articles of the same length that share the same two sentences with a press release aren't always churning the press release to the same degree. We boil this down into the "density" of the shared text in the two documents as a measure of how likely the text was simply copy/pasted and then slightly edited.
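The post doesn't spell out the exact formula, so the following is only one plausible way to turn those two measurements into numbers, not the calculation Churnalism uses. The function name, the input format (a list of `(start, length)` character ranges for shared passages within one document) and both ratios are assumptions made for illustration.

```python
def overlap_stats(fragments, doc_length):
    """Summarize shared passages within a single document.

    fragments:  list of (start, length) character ranges that matched
    doc_length: total length of the document in characters
    Returns a dict with two hypothetical measures:
      coverage - share of the document made up of matched text
      density  - share of matched text within the stretch that runs from
                 the first matched character to the last one
    """
    if not fragments or doc_length <= 0:
        return {"coverage": 0.0, "density": 0.0}

    # Merge overlapping ranges so no character is counted twice.
    merged = []
    for start, length in sorted(fragments):
        end = start + length
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    shared = sum(end - start for start, end in merged)
    span = merged[-1][1] - merged[0][0]
    return {"coverage": shared / doc_length, "density": shared / span}
```

Under these assumptions, two sentences scattered across a long article give low coverage and low density, while two adjacent copied paragraphs in a short article score high on both.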

Unfortunately the state of web publishing is inconsistent enough that we couldn't reliably detect and eliminate quotes. Often blockquote HTML tags are used for things that are not actually quotes, and of course not all quotes are annotated with appropriate HTML tags. While initially frustrating from an engineering perspective, we've found it delivers an additional feature by providing context around quotes in a news article and exposing instances of news articles selectively quoting speeches or press releases.

As great a service as Churnalism provides, we think the underlying technology has many other exciting uses. SuperFastMatch can be used as an approach to any problem that requires a longest common substring search. If you’re a Pythonista, we have a client library that will handle simple load balancing and sharding by document type. It also provides tools to back up and restore the SuperFastMatch index (the index is ephemeral, so a reboot wipes the data). We've found it useful in our Ad Hawk service for clustering "cookie-cutter" attack ads where most of the audio is the same but the politician’s name and background are changed. If you find it useful, let us know. If you have any trouble setting it up, submit a ticket to the GitHub project and we’ll do our best to get you up and running.
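For readers unfamiliar with the longest common substring problem mentioned above, here is the textbook dynamic-programming solution for two strings. It is a reference point for what is being computed, not how SuperFastMatch computes it: this approach takes time proportional to the product of the two lengths, which is exactly why a hash-based index is attractive for a large corpus.

```python
def longest_common_substring(a, b):
    """Return the longest run of characters that appears in both a and b."""
    best_length = 0
    best_end = 0                       # end position of the best run in `a`
    previous = [0] * (len(b) + 1)      # run lengths for the previous row
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous character pair.
                current[j] = previous[j - 1] + 1
                if current[j] > best_length:
                    best_length = current[j]
                    best_end = i
        previous = current
    return a[best_end - best_length:best_end]
```

If several runs tie for the longest, this version returns the one that ends first in `a`.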

Sunlight and Open Source

David Eaves has a thoughtful post over at TechPresident talking about open source and the transparency community's commitment to it -- a commitment that David sees as half-hearted. Sunlight's mentioned in the post, and the MySociety initiative that prompted the post is something that our team has been thinking about a lot. I think there's something to David's criticisms. But he's missing a few important things.

But let's get the baseline stuff out of the way first. Sunlight loves open source. Our whole stack is built on it, from the Varnish cache your browser connects to, to the Django/Rails/Flask/Sinatra/whatever app behind it, to the Postgres/Mongo/Redis/Solr/elasticsearch datastores that power it, to the OpenOffice suite that edits the grant application that paid for it all. All of our code is up on GitHub, and we welcome and celebrate contributions from the community.

But, Kindle contest aside, the above examples are mostly about us benefiting from open source. What have we done for the movement lately? This is the crux of David's critique:

So far, it appears that the spirit of re-use among the big players, like MySociety and the Sunlight Foundation, only goes so deep. Indeed often it seems they are limited to believing others should re-use their code. There are few examples where the bigger players dedicate resources to support other people's components. Again, it is fine if this is all about creating competing platforms and competing to get players in smaller jurisdictions who cannot finance creating whole websites on their own to adopt it. But if this is about reducing duplication then I'll expect to see some of the big players throw resources behind components they see built elsewhere. So far it isn't clear to me that we are truly moving to a world of "small pieces loosely joined" instead of a world of "our pieces, loosely joined."

I think David's missing a few important examples. For one thing, Sunlight's been adopting and investing in other organizations' code for a while now. PPF's OpenCongress has long been a Sunlight grantee, of course, and their code is entirely open source, including specific components like Formageddon that we commissioned. It's been more than a year since we began providing support for the Media Standards Trust to open-source and continue to develop SuperFastMatch; that's a partnership we think has tremendous potential to benefit both us and others, and you can expect to see some additional collaborations announced soon. Politwoops is a recent example of Sunlight adopting, extending and then launching a project started by another NGO -- the Open State Foundation, in this case (we're in the process of working with them to open-source the code).

But this is at the level of fairly specific partnerships with other transparency NGOs. The fact is that the more specific a project's use case, the harder it is to generalize its adoption. The more fundamental and abstract a tool is, the easier it is to adopt it and contribute back to it. It's no coincidence that we have people on our team who have patches in the Linux kernel but none who have patches in FixMyStreet. We see plenty of people use our Django apps and middlewares, but (so far) no successful redeployments of Influence Explorer. We've contributed a number of patches to the Boundary Service project that David mentions, but none to Ushahidi. Heck, back in my fixed-width font days, even I managed to get a minor patch into PySolr.

It simply gets harder to collaborate when you move to a less-abstract level of software. Requirements become more specific, and there cease to be good, general approaches to tackling problems. I saw this first-hand when I threw together the Elena's Inbox project. That effort generated a lot of excitement from other folks who had access to email archives, and we were glad to speak to all of them. I was eager to offer advice, answer questions and generally do some hand-holding, but I found myself wishing I had better news for the people who got in touch with me. Because unfortunately the reusable part of the site isn't all that valuable -- it's just some ugly templates and a basic Django app that provides endpoints for search and starring of emails (though we do have some much less ugly templates waiting for the next time we do a similar project). The real work and value-creation comes in the weekend following the government's Friday afternoon email document dump, when you need a programmer to lose sleep writing endless regular expressions that parse the idiosyncratic formatting of what's likely to be a badly-OCRed pile of text, then apply algorithmic approaches -- usually specific to the particular document set -- to stitch individual emails back together into threads. Come Monday morning, you'll be facing a huge, all-hands-on-deck manual review process as your staff tries to collapse duplicate entities down to single individuals (a process that can be aided by some string-similarity techniques, but which inevitably involves a lot of judgment calls and contextual knowledge).

Setting up an EI-style site is unfortunately never going to be a clean, easily-repeatable process, at least not until government starts releasing MDBs or exposing IMAP endpoints (something we have yet to see, as far as I know). And this is fairly typical of work in our space: a lot of it needs to be purpose-built because of the quirks of government and the datasets it produces.

The good news is that although our movement is still quite young, we've already learned some lessons. I think MySociety's components strategy reflects this: they're moving down a layer of abstraction -- cautiously and after much consideration -- and tackling a slightly-more-specific task than a typical NoSQL or GIS project; a task that's still abstract enough to be reusable, but which is targeted enough to be particularly relevant to transparency organizations. It's something that we think is worth pursuing, and that we're anxious to help make into a success. It probably won't make sense to spend time replacing Sunlight's too-specific-to-be-reusable but perfectly-useful-for-us entity store with PopIt in the near term. But those organizations that come to this space after us should be able to benefit from the lessons learned by MySociety, Sunlight and others. It's the same reason why Open States has been refactored twice: it takes time and experience to figure out what parts of a problem can be abstracted and made reusable.

There's no question that we can do better. We're looking at which projects have the most potential for reuse, and -- where appropriate -- we're planning to clean up their docs, add easy Heroku deployment support, roll some AMIs, and support some up-and-coming general source data formats. We'll also be taking a hard look at how our APIs are organized: we can make our data more easily reusable, too.

But specificity is often the enemy of reusability, and we think some of the most interesting opportunities tend to involve very specific problems. It's a real tension, but one that we're committed to continuing to work to address.

UPDATE: MySociety's Tom Steinberg has also posted a response to David, in which he explains the rationale behind MySociety's components strategy in considerably more detail.


Expanded Analysis Finds 15 States Sharing "Stand Your Ground" Language

An expansion of our recent analysis of Stand Your Ground laws confirms that an additional five states, and perhaps more, have passed legal language strikingly similar to the Florida law that has gained national attention in the wake of Trayvon Martin's death. In addition to the previously identified states, Alabama, Georgia, North Carolina, Utah, and Illinois passed laws that appear to share substantial overlap with the legislation enacted in Florida.

Sunlight's initial analysis demonstrated that at least ten states had passed bills sharing substantial overlapping text with the Florida measure. However, our report was limited by the availability of legislative text, and we intentionally confined its scope to the text of bills that were enacted after the Florida law’s passage in 2005.

With research assistance from the Legal Community Against Violence, we expanded our analysis to include a wider collection of state-level self-defense laws, as well as statutory text in cases where original legislative language could not be obtained. As a result, we can now confirm that measures on the books in at least 15 states share meaningful amounts of overlapping language with the Florida law — one such example is the phrase “a person does not have a duty to retreat” — implying connections related to these laws' provenance and spread. This analytic finding is consistent with reports that the Florida law was adopted and promoted to other states by the American Legislative Exchange Council and the National Rifle Association.

Our analysis also hints at some of the origins of the Florida law. Illinois and Utah's measures represent earlier "Castle Doctrines." But these bills lacked some of the provisions of the Florida measure that are considered to be most problematic, such as the presumption of innocence for those using deadly force. And it was the Florida bill that appears to have been used as a template by most states that have since adopted "Stand Your Ground" laws.

Although our expanded analysis examined legislative or statutory language from 45 states, in some cases data availability remains a limiting factor. In others, laws may be substantially similar or share a common origin but lack overlapping language. Our analysis is limited to overlapping, identical text (we ignored matching text that appeared to be a product of short or frequently-used legal boilerplate). It is important to note that the 15 matches we located represent a lower bound; the total number of states that have adopted bills inspired by Florida’s may well be higher.

Sunlight's work relies on software called SuperFastMatch. Created by the Media Standards Trust and supported by a grant from Sunlight, SFM allows for the identification of overlap between text documents at large scales and high speeds. You can examine the connections between the "Stand Your Ground" measures we have collected for yourself by visiting our research instance of SFM. Click the "Documents" tab to begin exploring the different self-defense-related measures we examined and their degree of overlap with those of other states.

Announcing Superfastmatch

Today I'm pleased to announce that the Superfastmatch project is open-source and ready for use. I’m excited to be posting this--I’ve been waiting to do so for a while! I think SFM is really, really cool--and I think you’ll agree once I tell you why. But first, a little bit of backstory.

We first became aware of the technology behind SFM when Churnalism launched. Created by the Media Standards Trust, Churnalism is an ingenious effort to detect when UK journalists copy-and-paste press releases into their published stories. It’s a great project, but we were even more excited by the technology behind it. Finding overlap between documents in huge corpora is not as simple a problem as you might think--it's tempting to assume that diff will manage the job, but in truth that tool is unsuitable for most types of documents.

The basic algorithmic challenge is the same one faced by those working on systems to detect academic plagiarism--a rich and evolving field in its own right. But surprisingly little of that technology is freely available.

Sunlight reached out to MST and was ultimately able to provide a grant that allowed them to open-source their code. Even better: they've been improving it. A mostly-Python implementation that needed hefty hardware is now a compiled solution that runs blazingly fast on commodity hardware (we’ve also successfully run it on vanilla EC2 instances--see the README for details).

Each instance of the system is an HTTP server. Users load documents by POSTing their text to a RESTful interface. As each document is processed, it’s normalized and split into substrings, which are hashed into tokens. After you’ve loaded your documents, you run an association task, which compares each document's collection of tokens against those of the other documents. Where there's overlap, contiguous chunks of text are assembled, and you can begin to inspect the parts that might be borrowed from one another. (The actual mechanics of the system are considerably more complex than this explanation, but the preceding should give you a rough idea of how things work.)
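The load-then-associate flow can be imitated in a few lines of Python. This toy is an in-memory stand-in, not the SuperFastMatch server or its HTTP interface: the `TinyMatcher` class, its method names, the window size, the normalization rules and the use of `zlib.crc32` as the hash are all assumptions made for illustration.

```python
import re
import zlib
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace (assumed normalization rules)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


class TinyMatcher:
    """Hypothetical in-memory imitation of the load/associate workflow."""

    def __init__(self, window=12):
        self.window = window
        self.docs = {}                    # doc_id -> normalized text
        self.index = defaultdict(list)    # token -> [(doc_id, position)]

    def _tokens(self, text):
        for pos in range(len(text) - self.window + 1):
            chunk = text[pos:pos + self.window]
            yield pos, chunk, zlib.crc32(chunk.encode("utf-8"))

    def add(self, doc_id, text):
        """Load a document: normalize, split into windows, hash, index."""
        text = normalize(text)
        self.docs[doc_id] = text
        for pos, _, token in self._tokens(text):
            self.index[token].append((doc_id, pos))

    def associate(self, doc_id):
        """Return {other_id: [(pos_here, pos_there, length), ...]}."""
        pairs = defaultdict(set)
        for pos, chunk, token in self._tokens(self.docs[doc_id]):
            for other_id, other_pos in self.index.get(token, []):
                if other_id == doc_id:
                    continue
                other = self.docs[other_id]
                # Confirm the text matches, since hashes can collide.
                if other[other_pos:other_pos + self.window] == chunk:
                    pairs[other_id].add((pos, other_pos))
        return {other: self._assemble(p) for other, p in pairs.items()}

    def _assemble(self, pairs):
        """Join runs of consecutive matching windows into fragments."""
        fragments = []
        for pos, other_pos in sorted(pairs):
            if (pos - 1, other_pos - 1) in pairs:
                continue                  # part of a run that started earlier
            step = 1
            while (pos + step, other_pos + step) in pairs:
                step += 1
            fragments.append((pos, other_pos, self.window + step - 1))
        return fragments
```

A short usage example, with made-up documents:

```python
matcher = TinyMatcher(window=8)
matcher.add("release", "Acme Corp today announced record profits for the quarter.")
matcher.add("article", "Reporters noted that Acme Corp today announced record profits, again.")
for other_id, fragments in matcher.associate("article").items():
    for pos, other_pos, length in fragments:
        print(other_id, repr(matcher.docs["article"][pos:pos + length]))
```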

There's a demo at scripts/gutenberg.sh that loads the Bible, the Koran and ten classic novels from Project Gutenberg into the system, then finds every bit of overlap between them (it takes about 45 seconds from start to finish on my three-year-old laptop).

Sunlight's particular interest is in pairing this technology with data from our Open States Project in order to detect when legislation is migrating between statehouses or from interest groups and into law. But we hope and expect that SFM's uses will extend well beyond our mission--the applications of this technology seem sure to surprise us.

The project remains under very active development. We expect a bugfix related to very large datasets to be merged into the main branch in a week or two, for instance. But Sunlight and MST are both anxious to see developers begin to acquaint themselves with Superfastmatch. And of course we're also hopeful that others might be inspired to contribute back to it. Providing the system's output as JSON, for example, is a long-planned feature that would be easy to implement and of considerable value.

For now, though, please have a look at the project repo and start thinking about what SFM might make possible for you. You don't need to look for a needle in a haystack anymore--you just need a few good haystacks.
