Tag Archive: SuperFastMatch

Churnalism now updated with fresher Wikipedia content

by dvogel

technology

Jan 6, 2014 2:52 pm

We've updated our version of Churnalism, making it more reliable than ever to discover the journalism you can trust — and what you should question.

Churnalism: Discover When News Copies from Other Sources

by Nicko Margolies Apr 23, 2013 8:00 am

Churnalism US is a new web tool and browser extension that allows anyone to compare the news you read against existing content to uncover possible instances of plagiarism.

Churnalism US: the Nuts and Bolts

by dvogel

technology

Apr 23, 2013 7:59 am

Churnalism US (launching today!) allows you to check the news articles you read for influence from press releases and Wikipedia.... View Article

Sunlight and Open Source

by Tom Lee

technology

Jul 26, 2012 1:32 pm

David Eaves has a thoughtful post over at TechPresident talking about open source and the transparency community's commitment to it -- a commitment that David sees as half-hearted. Sunlight's mentioned in the post, and the MySociety initiative that prompted the post is something that our team has been thinking about a lot. I think there's something to David's criticisms. But he's missing a few important things.

But let's get the baseline stuff out of the way first. Sunlight loves open source. Our whole stack is built on it, from the Varnish cache your browser connects to, to the Django/Rails/Flask/Sinatra/whatever app behind it, to the Postgres/Mongo/Redis/Solr/elasticsearch datastores that power it, to the OpenOffice suite that edits the grant application that paid for it all. All of our code is up on GitHub, and we welcome and celebrate contributions from the community.

But, Kindle contest aside, the above examples are mostly about us benefiting from open source. What have we done for the movement lately? This is the crux of David's critique:

So far, it appears that the spirit of re-use among the big players, like MySociety and the Sunlight Foundation, only goes so deep. Indeed often it seems they are limited to believing others should re-use their code. There are few examples where the bigger players dedicate resources to support other people's components. Again, it is fine if this is all about creating competing platforms and competing to get players in smaller jurisdictions who cannot finance creating whole websites on their own to adopt it. But if this is about reducing duplication then I'll expect to see some of the big players throw resources behind components they see built elsewhere. So far it isn't clear to me that we are truly moving to a world of "small pieces loosely joined" instead of a world of "our pieces, loosely joined."

I think David's missing a few important examples. For one thing, Sunlight's been adopting and investing in other organizations' code for a while now. PPF's OpenCongress has long been a Sunlight grantee, of course, and their code is entirely open source, including specific components like Formageddon that we commissioned. It's been more than a year since we began providing support for the Media Standards Trust to open-source and continue to develop SuperFastMatch; that's a partnership we think has tremendous potential to benefit both us and others, and you can expect to see some additional collaborations announced soon. Politwoops is a recent example of Sunlight adopting, extending and then launching a project started by another NGO -- the Open State Foundation, in this case (we're in the process of working with them to open-source the code).

But this is at the level of fairly specific partnerships with other transparency NGOs. The fact is that the more specific a project's use case, the harder it is to generalize its adoption. The more fundamental and abstract a tool is, the easier it is to adopt it and contribute back to it. It's no coincidence that we have people on our team who have patches in the Linux kernel but none who have patches in FixMyStreet. We see plenty of people use our Django apps and middlewares, but (so far) no successful redeployments of Influence Explorer. We've contributed a number of patches to the Boundary Service project that David mentions, but none to Ushahidi. Heck, back in my fixed-width font days, even I managed to get a minor patch into PySolr.

It simply gets harder to collaborate when you move to a less-abstract level of software. Requirements become more specific, and there cease to be good, general approaches to tackling problems. I saw this first-hand when I threw together the Elena's Inbox project. That effort generated a lot of excitement from other folks who had access to email archives, and we were glad to speak to all of them. I was eager to offer advice, answer questions and generally do some hand-holding, but I found myself wishing I had better news for the people who got in touch with me. Because unfortunately the reusable part of the site isn't all that valuable -- it's just some ugly templates and a basic Django app that provides endpoints for search and starring of emails (though we do have some much less ugly templates waiting for the next time we do a similar project). The real work and value-creation comes in the weekend following the government's Friday afternoon email document dump, when you need a programmer to lose sleep writing endless regular expressions that parse the idiosyncratic formatting of what's likely to be a badly-OCRed pile of text, then apply algorithmic approaches -- usually specific to the particular document set -- to stitch individual emails back together into threads. Come Monday morning, you'll be facing a huge, all-hands-on-deck manual review process as your staff tries to collapse duplicate entities down to single individuals (a process that can be aided by some string-similarity techniques, but which inevitably involves a lot of judgment calls and contextual knowledge).

Setting up an EI-style-site is unfortunately never going to be a clean, easily-repeatable process; not until government starts releasing MDBs or exposing IMAP endpoints (something we have yet to see, as far as I know). And this is fairly typical of work in our space: a lot of it needs to be purpose-built because of the quirks of government and the datasets it produces.

The good news is that although our movement is still quite young, we've already learned some lessons. I think MySociety's components strategy reflects this: they're moving down a layer of abstraction -- cautiously and after much consideration -- and tackling a slightly-more-specific task than a typical NOSQL or GIS project; a task that's still abstract enough to be reusable, but which is targeted enough to be particularly relevant to transparency organizations. It's something that we think is worth pursuing, and that we're anxious to help to make into a success. It probably won't make sense to spend time replacing Sunlight's too-specific-to-be-reusable but perfectly-useful-for-us entity store with PopIt in the near term. But those organizations that come to this space after us should be able to benefit from the lessons learned by MySociety, Sunlight and others. It's the same reason why Open States has been refactored twice: it takes time and experience to figure out what parts of a problem can be abstracted and made reusable.

There's no question that we can do better. We're looking at which projects have the most potential for reuse, and -- where appropriate -- we're planning to clean up their docs, add easy Heroku deployment support, roll some AMIs, and support some up-and-coming general source data formats. We'll also be taking a hard look at how our APIs are organized: we can make our data more easily reusable, too.

But specificity is often the enemy of reusability, and we think some of the most interesting opportunities tend to involve very specific problems. It's a real tension, but one that we're committed to continuing to work to address.

UPDATE: MySociety's Tom Steinburg has also posted a response to David, in which he explains the rationale behind MySociety's components strategy in considerably more detail

Expanded Analysis Finds 15 States Sharing “Stand Your Ground” Language

by Tom Lee Mar 29, 2012 5:39 pm

An expansion of our recent analysis of Stand Your Ground laws confirms that an additional five states, and perhaps more,... View Article

Virginia ultrasound law is the image of a few others

by Ryan Sibley

investigations

Mar 7, 2012 9:59 am

Source: Guttmacher Institute

Updated March 7, 4;22 p.m.

The bill Virginia Gov. Bob McDonnell signed Wednesday requiring women in his state to undergo ultrasound screening before they can proceed with an abortion represents the latest victory for anti-abortion activists pushing to get similar legislation enacted nationwide.

Seven states already have laws on the books requiring pre-abortion ultrasound screening, according to Elizabeth Nash, state issues manager for the Guttmacher Institute, an abortion rights group that focuses on reproductive health policy. At least 18 more states are considering similar bills.

Above is a map showing states that either have or ...

Announcing Superfastmatch

by Tom Lee

technology

Sep 13, 2011 10:42 am

picture of a combine thresher Today I'm pleased to announce that the Superfastmatch project is open-source and ready for use. I’m excited to be posting this—I’ve been waiting to do so for a while! I think SFM is really, really cool—and I think you’ll agree once I tell you why. But first, a little bit of backstory.

We first became aware of the technology behind SFM when Churnalism launched. Created by the Media Standards Trust, Churnalism is an ingenious effort to detect when UK journalists copy-and-paste press releases into their published stories. It’s a great project, but we were even more excited by the technology behind it. Finding overlap between documents in huge corpora is not as simple a problem as you might think--it's tempting to assume that diff will manage the job, but in truth that tool is unsuitable for most types of documents.

The basic algorithmic challenge is the same one faced by those working on systems to detect academic plagiarism--a rich and evolving field in its own right. But surprisingly little of that technology is freely available.

Sunlight reached out to MST and was ultimately able to provide a grant that allowed them to open-source their code. Even better: they've been improving it. A mostly-Python implementation that needed hefty hardware is now a compiled solution that runs blazingly fast on commodity hardware (we’ve also successfully run it on vanilla EC2 instances--see the README for details).

Each instance of the system is an HTTP server. Users load documents by POSTing their text to a RESTful interface. As each document is processed, it’s normalized and split into substrings, which are hashed into unique tokens. After you’ve loaded your documents, you run an association task, which compares each document's collection of tokens against one another. Where there's overlap, contiguous chunks of text are assembled, and you can begin to inspect the parts that might be borrowed from one another. (The actual mechanics of the system are considerably more complex than this explanation, but the preceding should give you a rough idea of how things work.)

There's a demo at scripts/gutenberg.sh that loads the Bible, the Koran and ten classic novels from Project Gutenberg into the system, then finds every bit of overlap between them (it takes about 45 seconds from start to finish on my three year-old laptop).

Sunlight's particular interest is in pairing this technology with data from our Open States Project in order to detect when legislation is migrating between statehouses or from interest groups and into law. But we hope and expect that SFM's uses will extend well beyond our mission--the applications of this technology seem sure to surprise us.

The project remains under very active development. We expect a bugfix related to very large datasets to be merged into the main branch in a week or two, for instance. But Sunlight and MST are both anxious to see developers begin to acquaint themselves with Superfastmatch. And of course we're also hopeful that others might be inspired to contribute back to it. Providing the system's output as JSON, for example, is a long-planned feature that would be easy to implement and of considerable value.

For now, though, please have a look at the project repo and start thinking about what SFM might make possible for you. You don't need to look for a needle in a haystack anymore--you just need a few good haystacks.