We Don’t Need a GitHub for Data

[Image: Lt. Commander Data standing in front of a screen with the GitHub logo]

There was an interesting exchange this past weekend between Derek Willis of the New York Times and Sunlight’s own Labs Director emeritus, Clay Johnson. Clay wrote a post arguing that we need a “GitHub for data”:

It’s too hard to put data on the web. It’s too hard to get data off the web. We need a GitHub for data.

With a good version control system like Git or Mercurial, I can track changes, I can do rollbacks, branch and merge and most importantly, collaborate. With a web counterpart like GitHub I can see who is branching my source, what’s been done to it, they can easily contribute back and people can create issues and a wiki about the source I’ve written. To publish source to the web, I need only configure my GitHub account, and in my editor I can add a file, commit the change, and publish it to the web in a couple quick keystrokes.

[…]

Getting and integrating data into a project needs to be as easy as integrating code into a project. If I want to interface with Google Analytics with ruby, I can type gem install vigetlabs-garb and I’ve got what I need to talk to the Google Analytics API. Why can I not type into a console gitdata install census-2010 or gitdata install census-2010 --format=mongodb and have everything I need to interface with the coming census data?

On his own blog, Derek pushed back a bit:

[…] The biggest issue, for data-driven apps contests and pretty much any other use of government data, is not that data isn’t easy to store on the Web. It’s that data is hard to understand, no matter where you get it.

[…]

What I’m saying is that the very act of what Clay describes as a hassle:

A developer has to download some strange dataset off of a website like data.gov or the National Data Catalog, prune it, massage it, usually fix it, and then convert it to their database system of choice, and then they can start building their app.

Is in fact what helps a user learn more about the dataset he or she is using. Even a well-documented dataset can have its quirks that show up only in the data itself, and the act of importing often reveals more about the data than the documentation does. We need to import, prune, massage, convert. It’s how we learn.

I think there’s a lot to what Derek is saying. Understanding what an MSA (Metropolitan Statistical Area) is, or how to match Census data up against information that’s been geocoded by ZIP code — these are bigger challenges than figuring out how to get the Census data itself. The documentation for this stuff is difficult to find and even harder to understand. Most users are driven toward the American FactFinder tool, but if that isn’t up to telling you what you want to know, you’re going to have to spend some time hunting down the appropriate FTP site and an explanation of how it’s organized — Clay’s right that this is a pain. But it’s nothing compared to the challenge of figuring out how to use the data properly. It can be daunting.
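
To make that mismatch concrete: ZIP codes aren’t Census geographies, so matching usually means going through a crosswalk file first. Here’s a rough, hypothetical sketch of the kind of join a developer ends up writing; the filenames and column names are invented for illustration, not taken from any real Census product.

```python
# Hypothetical sketch: matching ZIP-geocoded records to Census figures.
# File names and column names are made up for illustration.
import pandas as pd

records = pd.read_csv("projects_by_zip.csv", dtype={"zip": str})
crosswalk = pd.read_csv("zip_to_zcta_crosswalk.csv", dtype={"zip": str, "zcta": str})
census = pd.read_csv("population_by_zcta.csv", dtype={"zcta": str})

# ZIP codes are postal delivery routes, not areas; the Census Bureau publishes
# ZCTA approximations, and the crosswalk bridges the two before the join.
merged = (records
          .merge(crosswalk, on="zip", how="left")
          .merge(census, on="zcta", how="left"))

print(merged.head())
```

None of that is hard once you know the crosswalk exists; the hard part, as Derek says, is learning that you need it at all.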

But I think there are problems with the “GitHub for data” framing that go beyond the simple fact that the problems GitHub solves aren’t the biggest problems facing analysts. Before I talk about that, though, let me applaud Clay for something he did in his post: he used specific examples. I have to admit I bristle when people start talking about what could or needs to be done with “data” in the abstract. I think that level of vagueness makes it tough to reach useful conclusions. As Derek pointed out, meaningful analysis requires that we roll up our sleeves and begin to deal with the specifics of a dataset. A focus on the most abstract level of discussion can also sometimes be symptomatic of generalists who find the idea of big datasets exciting, but have never actually tried to conduct an analysis themselves. I’m glad for their enthusiasm, but hesitant to spend much time listening to their recommendations. If someone doesn’t know what a normal distribution is, I have doubts about how much meaningful insight they can contribute to discussions about appropriate tools and policies for open data. Anyway, cheers to Clay for avoiding this trap.

So! On to my quibbles. My sense is that the phrase “GitHub for [blank]” is getting thrown around a lot right now because GitHub is still relatively new, Git itself is innately mysterious and powerful, and there’s a general sense that a lot of exciting things are happening within the GH community. GitHub brought social features to the world of code repositories, and did so with impressive execution. That’s a real innovation, and people are justifiably excited about it.

I’m all for taking advantage of social web innovations within analyst communities. That’s part of the vision for the National Data Catalog, after all — having a place where people interested in the same problems can find each other and share their work. But if that’s all we’re after, I’d rather start talking about a “Flickr for data” or “Reddit for data” — I think that both framings might offer a bit less novelty, but could save us from making some serious conceptual mistakes.

And here I’m referring to the core functionality of GitHub: version control. Simply put, version control doesn’t make sense for data. In one sense it seems to, because both code and datasets demand auditability — the ability to trace an artifact’s current state back to its origin. Version control can certainly do this. For code, one can look at a diff and see what has changed and why. The transformation between one revision and another will probably involve some lines being changed, and others not, and hopefully the insertion of inline comments. That, plus the comment associated with the revision, will usually be enough for a viewer to understand the transformation.

Transformations on datasets aren’t like that. If you normalize a vector, every number in it will change. This is going to make your version control system suck up a ton of space, of course, but the bigger problem is that while you might be able to tease out the nature of the transformation by looking at the before and after snapshots, it’s going to be much harder to do so than it is with code. The transformation is the thing that is interesting, and the thing that may need revision in order to fix mistakes. And in most cases, the transformation is going to be written in code of one type or another. That’s what needs to go into the version control system — not the data.
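
A toy example (mine, not Clay’s or Derek’s) makes the point: the transformation is one legible line of code, but it touches every stored value, so a diff of the data itself shows wall-to-wall changes without conveying any intent.

```python
# Toy illustration: a diff of the transformation is readable; a diff of the
# transformed data is not.
import numpy as np

v = np.array([3.0, 4.0, 12.0])

# The interesting, reviewable artifact is this one line...
normalized = v / np.linalg.norm(v)

# ...but every value in the stored dataset changes as a result.
print(v)           # [ 3.  4. 12.]
print(normalized)  # [0.23076923 0.30769231 0.92307692]
```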

In fact, I’d argue that the data shouldn’t go into the VCS. Because there’s another big difference between evolving data and evolving code: worked-on code tends to get better, while worked-on data tends to get worse. Not for the person doing the work, maybe. But for everyone else, I think this generally holds true. The original data is sacred, in a way — there may be effects hidden within it that aren’t immediately apparent, and which ought to be preserved. The transformations that the data must undergo will often lead to the loss of information — a necessary step in service of analysis, but one that needs to be kept in mind, as that information may prove to be valuable, sometimes in ways that can’t be anticipated.

I’ll make up an example: let’s say someone’s studying dolphin calls to see if they can be computationally differentiated. They produce some high-resolution underwater audio recordings in a PCM format, then set about improving the dataset to facilitate others’ analysis. Maybe they do some filtering to remove noises outside the frequency band known to be used by dolphins, then they compress everything down to MP3 to facilitate distribution.

That might be okay, but it’s not sufficient. We can’t look at the difference between the MP3s and the source audio and know what we’ve lost, or what assumptions might be coloring our analysis of the derived data. We need to know the methodology that was used to get from one point to the other, so that we can, for example, revisit the assumptions we made about the frequency range we’re examining (maybe we discover something new about dolphin biology, or maybe there’s a harmonic that travels farther underwater than we’d expected). Ideally, we’d just share those transformations along with the source data and let other users run them themselves — like make && make install. We should only be shuttling around “improved” data when there are practical reasons for doing so (typically because of file size, or because the transformations are extremely computationally demanding) and when the transformations have been thoroughly reviewed. I’m not confident that a freewheeling environment akin to GitHub can reach that level of control and rigor — nor should it, in my opinion.
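
To picture what that might look like in practice, here is a hypothetical version of the researcher’s filtering step written as a script; the band limits, file paths, and library choices are my own stand-ins, not anything from the example above. The script is the thing you would commit and review, while the raw recordings stay archived untouched.

```python
# Hypothetical transformation script: this file goes into version control;
# the raw PCM recordings it reads do not.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt

# Assumed band of interest: a documented, reviewable choice that a later
# reader can question and rerun, unlike a pre-filtered MP3.
LOW_HZ, HIGH_HZ = 2_000.0, 20_000.0

def bandpass(samples: np.ndarray, rate: int) -> np.ndarray:
    """Apply a fourth-order Butterworth band-pass filter to the recording."""
    sos = butter(4, [LOW_HZ, HIGH_HZ], btype="bandpass", fs=rate, output="sos")
    return sosfilt(sos, samples.astype(np.float64))

if __name__ == "__main__":
    rate, raw = wavfile.read("raw/dolphin_recording_001.wav")   # made-up path
    filtered = bandpass(raw, rate)
    wavfile.write("derived/dolphin_recording_001_filtered.wav", rate,
                  filtered.astype(np.int16))
```

A script like this is small, diff-able, and reviewable in exactly the way GitHub is good at; the audio it emits is none of those things.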

So let’s share our transformations — our code — in a social way. I think that’ll work fine, and convey real advantages. Better still, the tools are already built. But data is different from code, and we should think carefully before we jam it into the conceptual framework of a VCS. That isn’t to say that we don’t need better tools for managing it — and the good news here is that people like Harvard’s Gary King are thinking hard about what those tools might look like. But GitHub is the wrong model.