Opening data: Have you checked your pipes?

by

[Code for America](http://codeforamerica.org/) alum [Dave Guarino](https://twitter.com/allafarce) had a post recently entitled [“ETL for America”](http://daguar.github.io/2014/03/17/etl-for-america/). In it, he highlights something that open data practitioners face with every new project: the problem of **Extracting** data from old databases, **Transforming** it to suit a new application or analysis and **Loading** it into the new datastore that will support that new application or analysis. Almost every technical project (and every idea for one) has this process as an initial cost. This cost is so pervasive that it’s rarely discussed by anyone except for the wretched “data plumber” (Dave’s term) who has no choice but to figure out how to move the important resources from one place to another.

### Why aren’t we talking about it?

The up-front costs of ETL don’t come up very often in the open data and civic hacking community. At hackathons, in funding pitches, and in our definitions of success, we tend to focus on outputs (apps, APIs, visualizations) and treat the data preparation as a collateral task, unavoidable and necessary but not worth “getting into the weeds” about. Quoting Dave:

> The fact that I can go months hearing about “open data” without a single > mention of ETL is a problem. ETL is the pipes of your house: it’s how you > open data.

It’s difficult to point to evidence that this is really the case, but I personally share Dave’s experience. To me, it’s *still* the elephant in the room during the proceedings of any given hackathon or open data challenge. I worry that the open data community is somehow under the false impression that, eventually in the sunny future, data will be released in a more clean way and that this cost will decrease over time.

It won’t. Open data might get cleaner, but no data source can evolve to the point where it serves all possible needs. Regardless of how easy it is to read, the data published by government probably wasn’t prepared with your new app idea in mind.

Data transformation will always be necessary, and it’s worth considering apart from the development of the next cool interface. It’s a permanent cost of developing new things in our space, so why aren’t we putting more resources toward addressing it as a problem in its own right? Why not spend some quality time (and money) focused on data preparation itself, and then let a thousand apps bloom?

### You can’t buy a solution

One reason why there are so few ETL-specific projects out there is that the solution would never look like an app or even a single project. There’s no such thing as turnkey software for these tasks, even though the incentive to create one is obviously enormous.

As Dave points out, even with all the money in the world, you can’t buy a universal, out-of-the-box solution to ETL workflow. He’s right about that, and he’s right about why:

> ETL is actually just an intrinsically hard technical problem. Palantir, a company > which is very good with data, essentially solves the problem by throwing engineers > at it. They have a fantastic analytic infrastructure at their core, and they pay > large sums of money to very smart people to do one thing: write scripts to get > clients’ data into that infrastructure.

In other words, it’s not as if there is some vacuum in the data processing market, waiting for the next lean startup to get in and fix everything for everyone once and for all. You’re always going to need those very smart people designing bespoke data preparation.

### The pipes, the pipes are calling

This is why “data plumber” is such an apt term. There is no push-button solution to rigging up the new wet bar in your basement, because every house is different and has its own history encoded in the tubes that move water around. The value of a plumber’s expertise is that, while they may not know the specifics about your house and your wet bar, they’ve seen enough houses and wet bars to be able to figure it out before too long.

So the answer is clearly not to shop for data-munging robot brains. We shouldn’t expect to see magical software that will take the work out of channeling existing data while sculpting it to fit some new purpose. It’s hard to do: every new dataset will present its own warts, and every new project will have its own special demands. With the right expertise, though, smart people can make all the pipes fit.

### Encouraging tradecraft

Does this mean that open data projects are doomed forever to shoulder the up-front costs of data munging? Yes and no. Well, yes and maybe. Yes, we’ll always have to do this work, even after the last government agency releases an Access database and the last fixed-width COBOL printout is accepted as “disclosure.” But maybe we can reduce the cost by showing our work, and benefitting from the shared experience of everyone who ETL’d before us.

If we as a community agree that opening data is hard, and that resources are scarce, then we should be more active in avoiding duplicated effort. Documenting how we do ETL and sharing the code should be top priority. Before you make a web app, document your ETL process. Before you make a D3 chart, document your ETL process. Before you contribute money to development of the next big thing in open data, demand that the ETL process be thoroughly documented and available as open source code.

If we do a good job of sharing our collective experience dealing with various forms of civic data, we might have a chance to share what plumbers, electricians, carpenters (and, for that matter, spies and lobbyists) do: tradecraft. Tradecraft is the assemblage of rules-of-thumb, heuristic guidelines and what-not-to-do’s that get passed on from master to journeyman in a trade. It’s not going to be taught in a classroom, and it’s hard to learn without at least some degree of mentorship. If, as Niels Bohr tells us, “an expert is a person who has made all the mistakes that can be made in a very narrow field,” then following the example of experts can help everyone to avoid repeating costly mistakes during an already costly part of any new project.

### Stop showing off, start showing work

It’s important, though, when sharing, not to obscure the logic of your ETL process with complex code. Examples of advanced software construction are useful to developers who are operating at a certain level of proficiency. They’re a barrier, though, to passing on the workflow logic that is at the heart of good data munging.

If you have an ETL process that you think others could use or learn from, lay it out procedurally with lots of comments to show how you’re THINKING about the data, not just how your production code happens to work. Newcomers and experienced developers alike stand to benefit from the hard-knocks lessons learned by their colleagues, even if they’re not looking at exactly the same data in exactly the same setting. Seeing how your production code is factored or how your database scales is somewhat less illuminating, because these things are, in part, dictated by your particular use case. In any case, it doesn’t help the next data munger get started with ugly data, and could indirectly encourage premature optimization.

### By the way, this is already happening

If you’d like to see a community where this kind of collective experience is being shared particularly well, you might look at the way the [unitedstates](http://theunitedstates.io/) GitHub organization has been doing things (full disclosure: quite a few Labs members are active contributors). They don’t prescribe very much in the way of coding style, but they’re setting a really good example for how to make ETL scripts that are reliable and testable, while being approachable enough to be maintained by a community of coders with different levels of expertise. Each repo is dedicated to downloading/transforming one specific kind of government data, and that’s it.

One thing that will strike you when going through, say, the congress repo, is that the code is VERY procedural and not optimized in any significant way. No threading/multiprocessing, no mapping or reducing, not even very many object-oriented code. There are also no interfaces to databases. The output is files. One file per bill. This has at least two really important consequences:

– Even a fairly novice Python programmer can debug this code. – There are practically no dependencies other than what Python has built-in (which is admittedly a lot).

Riffing on Michael Pollan, [Mike Migurski](https://twitter.com/michalmigurski) (CTO of Code for America) sums this approach up pretty well:

Write code. Not too much. Mostly procedural.

credit: Mike Migurski

### ETL;DR

Maybe you’re a data plumber like me, in the middle of interviewing some god-awful new source of data, and you didn’t have time to read all of the above. Here’s the short version:

– ETL is hard – You can’t buy your way out of it with shiny software – Human practitioners will always be necessary – Human practitioners can make it easier on each other by sharing our ETL code and lessons learned – Community-shared and maintained ETL processes will be maximally helpful if they are… – modular – focused – free of dependencies – accessible to everyone – output to plain text files.

I’d also like to mention that the [unitedstates](http://github.com/unitedstates) project has inspired me to create a sister (cousin?) organization on GitHub called [influence-usa](https://github.com/influence-usa), which is in its infancy, but will aim to collect ETL code on all things influence-related (lobbying, campaign finance, contracts & grants, revolving door activity, and more). Note, I’ve already broken the “free of dependencies” rule twice, but maybe we can make an exception for “for humans” dependencies like [Requests](http://docs.python-requests.org/en/latest/) and [stream.py](http://www.trinhhaianh.com/stream.py/)? In any case, if you have code that you’d like to contribute, please get in touch!

**UPDATE**: forgot to mention that Dave Guarino has started a repo at [daguar/etl-for-america](https://github.com/daguar/etl-for-america) for collaborating and discussing general problems related to ETL.