It's a question I'm seeing asked more and more: by press, by Gov 2.0 advocates, and by the online public. Those of us excited by the possibilities of open data have promised great things. So why is BrightScope the only government data startup that anyone seems to talk about?
I think it's important that those of us who value open data be ready with an answer to this question. But part of that answer needs to address the misperceptions built into the query itself.
There Are Lots of Open Data Businesses

BrightScope is a wonderful example of a business that sells services built in part on publicly available data. They've gotten a lot of attention because they started up after the Open Government Directive, after data.gov -- after Gov 2.0 in general -- and can therefore be pointed to as a validation of that movement.
But if we want to validate the idea that public sector information (PSI) can be a useful foundation for businesses *in general*, we can expand our scope considerably. And if we do, it's easy to find companies built on government data: databases of legal decisions, databases of patent information, Medicare data, resellers of weather data, business intelligence services that rely in part on SEC data, GIS products derived from Census data, and many others.
Some of these should probably be free, open, and much less profitable than they currently are. But all of them are examples of how genuinely possible it is to make money off of government data. It's not all that surprising that many of the most profitable uses of PSI emerged before anyone started talking about open data's business potential. That's just the magic of capitalism! This stuff was useful, and so people found it and commercialized it. The profit motive meant that nobody had to wait around for people like me to start talking about open formats and APIs. There are no doubt still efficiencies to be gained in improving and opening these systems, but let's not be shocked if a lot of the low-hanging commercial fruit turns out to have already been picked.
Still, surely there are more opportunities out there. A lot of new government data is being opened up. Some of it must be valuable... right?
Sam Smith wrote a post reacting to what I had to say about the Geithner schedule. In it, he argues that pushing for data to be released in better formats may not be the best course of action: tools exist to sidestep the problem.
Sunlight, as an organisation which complains about this often enough, has much better tools at their disposal than complaining about it. As people using computers in 2010, we all have better tools to use on PDFs than we currently use. We often complain about how inaccessible PDFs are, without doing the basic, simple, automatable tasks which can make them readable.
Opening the PDF in acrobat, pressing the "Recognise text using OCR" [button] and then [you'll find that] it's searchable, and Sunlight could republish this for everyone to use (or put up a webservice which adds the OCR text in such a way that when you search, what you get highlighted is the relevant bits of the page where the OCRed text matches). That is possible now.
But, as a community, we prefer to stick to the notion that anything in PDF is utterly locked up in a way which no one can get at.
It's not (really).
It is far from ideal, it's a bugger to use, and it is not the best format for most things, but it's what we've got. And showing how valuable this data is will get us far further than complaining that we can't read a file that most people clearly can in the tools they use. It's the tools we choose to use that are letting us down. And, as a movement, open data has to get better at it, and then it'll be less of a problem for us, and we can spend more time doing what we claim to be wanting to do.
I appreciate the response, but I disagree. Nothing Sam says about what technology makes possible is wrong, per se. And better tools are of course useful and desirable. But the last thing I want is for government to begin thinking that OCR can make up for bad document workflows. It simply can't: even though it happens to work well on the Geithner schedule, OCR remains a fundamentally lossy technology.
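To make that lossiness concrete, here's a toy sketch. The "OCR output" below is invented, seeded with confusions real OCR engines commonly make (S/5, i/l, 0/O, m/rn); the point is that even when the text is mostly recovered, an exact search can still miss:

```python
import difflib

# Ground truth as it appears in the original document
ground_truth = "Meeting with Secretary Geithner, 9:30 a.m."

# Invented OCR output with typical character confusions (S->5, i->l, 0->O, m->rn)
ocr_output = "Meetlng with 5ecretary Geithner, 9:3O a.rn."

# The text is mostly recovered...
similarity = difflib.SequenceMatcher(None, ground_truth, ocr_output).ratio()

# ...but an exact search for a name still fails
found = "Secretary" in ocr_output
print(f"similarity: {similarity:.2f}, 'Secretary' found: {found}")
```

A search interface built on output like this will silently return no results for queries that should match, which is exactly why fixing the upstream document workflow beats patching things with OCR.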
There was an interesting exchange this past weekend between Derek Willis of the New York Times and Sunlight's own Labs Director emeritus, Clay Johnson. Clay wrote a post arguing that we need a "GitHub for data":
It's too hard to put data on the web. It’s too hard to get data off the web. We need a GitHub for data.
With a good version control system like Git or Mercurial, I can track changes, I can do rollbacks, branch and merge and most importantly, collaborate. With a web counterpart like GitHub I can see who is branching my source, what’s been done to it, they can easily contribute back and people can create issues and a wiki about the source I’ve written. To publish source to the web, I need only configure my GitHub account, and in my editor I can add a file, commit the change, and publish it to the web in a couple quick keystrokes.
Getting and integrating data into a project needs to be as easy as integrating code into a project. If I want to interface with Google Analytics with ruby, I can type gem install vigetlabs-garb and I’ve got what I need to talk to the Google Analytics API. Why can I not type into a console gitdata install census-2010 or gitdata install census-2010 --format=mongodb and have everything I need to interface with the coming census data?
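For what it's worth, the storage half of Clay's hypothetical gitdata command isn't hard to imagine. Here's a minimal sketch -- the function name, manifest layout, and choice of SHA-1 are my own assumptions, loosely mimicking Git's content-addressed object store:

```python
import hashlib
import json
import pathlib

def install_dataset(name: str, raw_bytes: bytes, store: pathlib.Path) -> dict:
    """Save a dataset blob under its content hash and write a small
    manifest, mimicking Git's content-addressed object store."""
    store.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha1(raw_bytes).hexdigest()
    # The blob's filename *is* its hash, so identical data dedupes itself
    (store / digest).write_bytes(raw_bytes)
    manifest = {"name": name, "sha1": digest}
    (store / f"{name}.json").write_text(json.dumps(manifest))
    return manifest
```

Any two users who fetch the same bytes end up with the same hash, which is what would let a gitdata-style service detect changes, deduplicate storage, and support rollbacks.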
On his own blog, Derek pushed back a bit:
[...] The biggest issue, for data-driven apps contests and pretty much any other use of government data, is not that data isn’t easy to store on the Web. It’s that data is hard to understand, no matter where you get it.
What I’m saying is that the very act of what Clay describes as a hassle:
A developer has to download some strange dataset off of a website like data.gov or the National Data Catalog, prune it, massage it, usually fix it, and then convert it to their database system of choice, and then they can start building their app.
Is in fact what helps a user learn more about the dataset he or she is using. Even a well-documented dataset can have its quirks that show up only in the data itself, and the act of importing often reveals more about the data than the documentation does. We need to import, prune, massage, convert. It’s how we learn.
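Derek's point is easy to demonstrate. Here's a sketch, using an invented fragment of agency CSV, of the kind of quirks that only surface when you actually import: padded whitespace, sentinel values standing in for "no data", and fields that need type coercion:

```python
import csv
import io

# Invented sample: the documentation won't tell you that "N/A" and
# "999999" both mean missing, or that fields carry stray whitespace.
raw = """agency, amount, year
Treasury ,  1200.50 , 2009
  HUD, N/A ,2010
EPA , 999999 , 2010
"""

SENTINELS = {"N/A", "999999"}  # assumed "missing" markers for this sketch

def clean(rows):
    for row in rows:
        amount = row["amount"].strip()
        yield {
            "agency": row["agency"].strip(),
            "amount": None if amount in SENTINELS else float(amount),
            "year": int(row["year"]),
        }

records = list(clean(csv.DictReader(io.StringIO(raw), skipinitialspace=True)))
```

None of the decisions in `clean` -- which values count as missing, whether amounts are floats -- are documented anywhere; you discover them by importing and looking.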
I think there's a lot to what Derek is saying. Understanding what an MSA is, or how to match Census data up against information that's been geocoded by ZIP code -- these are bigger challenges than figuring out how to get the Census data itself. The documentation for this stuff is difficult to find and even harder to understand. Most users are driven toward the American FactFinder tool, but if it can't tell you what you want to know, you'll have to spend some time hunting down the appropriate FTP site and an explanation of how it's organized -- Clay's right that this is a pain. But it's nothing compared to the challenge of figuring out how to use the data properly. That can be daunting.
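The ZIP-code problem in particular trips people up: ZIP codes are postal delivery routes, while the Census publishes figures by ZCTA (ZIP Code Tabulation Area), and the two only approximately line up. A toy sketch -- the crosswalk entries and population figures below are invented for illustration:

```python
# Invented crosswalk: several ZIPs can collapse into one ZCTA, and some
# ZIPs (e.g. PO-box-only codes) have no ZCTA at all.
ZIP_TO_ZCTA = {
    "02138": "02138",
    "02142": "02139",  # assumed: folded into a neighboring ZCTA
}

# Invented Census-style figures, keyed by ZCTA
POPULATION_BY_ZCTA = {"02138": 36000, "02139": 41000}

def population_for_zip(zip_code: str):
    """Look up a Census-style figure for a postal ZIP via the crosswalk;
    returns None when the ZIP has no corresponding ZCTA."""
    zcta = ZIP_TO_ZCTA.get(zip_code)
    return POPULATION_BY_ZCTA.get(zcta) if zcta else None
```

Joining ZIP-geocoded data straight onto Census tables without a crosswalk like this silently drops or misassigns records, which is exactly the kind of quirk no README warns you about.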
But I think there are problems with the "GitHub for data" framing that go beyond the simple fact that the problems GitHub solves aren't the biggest problems facing analysts.