Sunlight and Open Source

by

David Eaves has a thoughtful post over at TechPresident talking about open source and the transparency community’s commitment to it — a commitment that David sees as half-hearted. Sunlight’s mentioned in the post, and the MySociety initiative that prompted the post is something that our team has been thinking about a lot. I think there’s something to David’s criticisms. But he’s missing a few important things.

But let’s get the baseline stuff out of the way first. Sunlight loves open source. Our whole stack is built on it, from the Varnish cache your browser connects to, to the Django/Rails/Flask/Sinatra/whatever app behind it, to the Postgres/Mongo/Redis/Solr/elasticsearch datastores that power it, to the OpenOffice suite that edits the grant application that paid for it all. All of our code is up on GitHub, and we welcome and celebrate contributions from the community.

But, Kindle contest aside, the above examples are mostly about us benefiting from open source. What have we done for the movement lately? This is the crux of David’s critique:

So far, it appears that the spirit of re-use among the big players, like MySociety and the Sunlight Foundation, only goes so deep. Indeed often it seems they are limited to believing others should re-use their code. There are few examples where the bigger players dedicate resources to support other people’s components. Again, it is fine if this is all about creating competing platforms and competing to get players in smaller jurisdictions who cannot finance creating whole websites on their own to adopt it. But if this is about reducing duplication then I’ll expect to see some of the big players throw resources behind components they see built elsewhere. So far it isn’t clear to me that we are truly moving to a world of “small pieces loosely joined” instead of a world of “our pieces, loosely joined.”

I think David’s missing a few important examples. For one thing, Sunlight’s been adopting and investing in other organizations’ code for a while now. PPF’s OpenCongress has long been a Sunlight grantee, of course, and their code is entirely open source, including specific components like Formageddon that we commissioned. It’s been more than a year since we began providing support for the Media Standards Trust to open-source and continue to develop SuperFastMatch; that’s a partnership we think has tremendous potential to benefit both us and others, and you can expect to see some additional collaborations announced soon. Politwoops is a recent example of Sunlight adopting, extending and then launching a project started by another NGO — the Open State Foundation, in this case (we’re in the process of working with them to open-source the code).

But this is at the level of fairly specific partnerships with other transparency NGOs. The fact is that the more specific a project’s use case, the harder it is to generalize its adoption. The more fundamental and abstract a tool is, the easier it is to adopt it and contribute back to it. It’s no coincidence that we have people on our team who have patches in the Linux kernel but none who have patches in FixMyStreet. We see plenty of people use our Django apps and middlewares, but (so far) no successful redeployments of Influence Explorer. We’ve contributed a number of patches to the Boundary Service project that David mentions, but none to Ushahidi. Heck, back in my fixed-width font days, even I managed to get a minor patch into PySolr.

It simply gets harder to collaborate when you move to a less-abstract level of software. Requirements become more specific, and there cease to be good, general approaches to tackling problems. I saw this first-hand when I threw together the Elena’s Inbox project. That effort generated a lot of excitement from other folks who had access to email archives, and we were glad to speak to all of them. I was eager to offer advice, answer questions and generally do some hand-holding, but I found myself wishing I had better news for the people who got in touch with me. Because unfortunately the reusable part of the site isn’t all that valuable — it’s just some ugly templates and a basic Django app that provides endpoints for search and starring of emails (though we do have some much less ugly templates waiting for the next time we do a similar project). The real work and value-creation comes in the weekend following the government’s Friday afternoon email document dump, when you need a programmer to lose sleep writing endless regular expressions that parse the idiosyncratic formatting of what’s likely to be a badly-OCRed pile of text, then apply algorithmic approaches — usually specific to the particular document set — to stitch individual emails back together into threads. Come Monday morning, you’ll be facing a huge, all-hands-on-deck manual review process as your staff tries to collapse duplicate entities down to single individuals (a process that can be aided by some string-similarity techniques, but which inevitably involves a lot of judgment calls and contextual knowledge).

Setting up an EI-style-site is unfortunately never going to be a clean, easily-repeatable process; not until government starts releasing MDBs or exposing IMAP endpoints (something we have yet to see, as far as I know). And this is fairly typical of work in our space: a lot of it needs to be purpose-built because of the quirks of government and the datasets it produces.

The good news is that although our movement is still quite young, we’ve already learned some lessons. I think MySociety’s components strategy reflects this: they’re moving down a layer of abstraction — cautiously and after much consideration — and tackling a slightly-more-specific task than a typical NOSQL or GIS project; a task that’s still abstract enough to be reusable, but which is targeted enough to be particularly relevant to transparency organizations. It’s something that we think is worth pursuing, and that we’re anxious to help to make into a success. It probably won’t make sense to spend time replacing Sunlight’s too-specific-to-be-reusable but perfectly-useful-for-us entity store with PopIt in the near term. But those organizations that come to this space after us should be able to benefit from the lessons learned by MySociety, Sunlight and others. It’s the same reason why Open States has been refactored twice: it takes time and experience to figure out what parts of a problem can be abstracted and made reusable.

There’s no question that we can do better. We’re looking at which projects have the most potential for reuse, and — where appropriate — we’re planning to clean up their docs, add easy Heroku deployment support, roll some AMIs, and support some up-and-coming general source data formats. We’ll also be taking a hard look at how our APIs are organized: we can make our data more easily reusable, too.

But specificity is often the enemy of reusability, and we think some of the most interesting opportunities tend to involve very specific problems. It’s a real tension, but one that we’re committed to continuing to work to address.

UPDATE: MySociety’s Tom Steinburg has also posted a response to David, in which he explains the rationale behind MySociety’s components strategy in considerably more detail