What I’d Change about Data.gov

I think Data.gov is pretty awesome. I’m generally a fan of what Vivek Kundra and his team are trying to do inside the government to make our country more transparent. Heck, we’re so excited about it that we’re running our own contest with cash prizes to celebrate.

But I do have a few gripes. So in the interest of full transparency, and in the hope that this will create change, here are my complaints for all to see:

1. Half the data is from the USGS.

No offense to our hard-working geologists, but seriously, copper smelters? Really? Why is the first dataset on Data.gov about copper smelters? And more importantly, every piece of data on the front catalog page of Data.gov has a 1 by it. Is that because they wanted them to appear at the top of the list? So these four datasets (Smelters, Hydrologic Remote Sensing Center, Patent Grants, and Residential Energy Consumption from 2005) were editorially chosen to lead the pack?

I want better data. There’s a lot of it out there, and there’s no excuse for it not to be in Data.gov: it is data that’s already being maintained by the feds. The datasets I’d particularly like to see, in no particular order:

  1. How about the data in Data.gov itself? Put Data.gov’s catalog online in a bulk format for all to see and play with.
  2. FARA
  3. FEC
  4. FACA
  5. Personal Financial Disclosure Statements for Cabinet and Key Government Employees.
  6. USASpending.gov Downloads
  7. The Federal Register. This one’s special, and a little political, but the Government shouldn’t be charging $17,250 for an electronic copy of the Federal Register.
  8. Census: all of it. In something other than PDF files, too, please.
  9. Bureau of Labor Statistics: all of it.
  10. Bulk data from FedBizOpps
  11. Of course, all the data on Recovery.gov

I’m sure there are more than these 11 datasets. According to the feds, there are 200,000+ more coming, so get on with it and hurry up!

2. It is a data catalog, not a data repository

This isn’t just semantics: the data on Data.gov links out to external sources that are not standardized, which makes it very hard to wrap programmatically. If you check out the Patent Grant Bibliographic Data, for instance, you’ll see that you can download the file as XML from uspto.gov. In other words, Data.gov is merely linking off to another site rather than serving as a single source for the data.

Fine, cool. I can think of a million reasons to do that, especially that whole Separation of Powers bit: Data.gov could link off to congressional information without the Executive Branch having to compel Congress to do something, or having to wait on legislation (maybe). The problem is that even the links are non-standardized and not RESTful. What we want is to be able to presume:

  a. The Patent Data has an ID number of 3.
  b. It has XML data.
  c. Therefore, to get the XML data, we can go to data.gov/data/3/xml

And have the software point us to the data we want. This kind of REST-ish interface for the website would be particularly useful. That way we could build software similar to RubyGems for Data.gov. How cool would that be? My dream? To be able to type in:

datagov install census.economic -y 2007 -v csv

And see my terminal download that information directly onto my hard drive in a format that I, as well as my trusty computer, can understand. Data.gov can lead us there. Where we need to head is for the data to all be in the same place, in standard formats, with a guarantee that it will always be there.
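To make the idea concrete, here is a rough sketch in Python of how such a tool might turn that command into a predictable URL. None of this exists today. The base URL simply follows the data.gov/data/3/xml pattern above, the name-to-ID table is made up (only the Patent Data ID of 3 comes from the example), the `?year=` query parameter is my own guess at how a year might be passed, and the `-y` and `-v` flags are read as "year" and "format" because that is what the command above seems to imply.

```python
import argparse

# Hypothetical base URL, following the data.gov/data/<id>/<format> pattern.
BASE_URL = "https://data.gov/data"

# Hypothetical name-to-ID registry. Only the Patent Data ID (3) comes from
# the example in the text; the census entry is invented for illustration.
DATASET_IDS = {
    "patent.grants": 3,
    "census.economic": 7,
}


def dataset_url(dataset_id, fmt, year=None):
    """Build the predictable URL for a dataset ID and format."""
    url = f"{BASE_URL}/{dataset_id}/{fmt.lower()}"
    if year is not None:
        # The year query parameter is an assumption, not a real Data.gov API.
        url += f"?year={year}"
    return url


def build_parser():
    """Parse commands shaped like: datagov install census.economic -y 2007 -v csv"""
    parser = argparse.ArgumentParser(prog="datagov")
    commands = parser.add_subparsers(dest="command", required=True)
    install = commands.add_parser("install", help="resolve a dataset by name")
    install.add_argument("dataset", help="dataset name, e.g. census.economic")
    install.add_argument("-y", "--year", type=int, help="year of the data")
    install.add_argument("-v", "--format", dest="fmt", default="xml",
                         help="file format, e.g. csv or xml")
    return parser


def resolve(argv):
    """Turn command-line arguments into the URL the tool would download."""
    args = build_parser().parse_args(argv)
    if args.dataset not in DATASET_IDS:
        raise KeyError(f"unknown dataset: {args.dataset}")
    return dataset_url(DATASET_IDS[args.dataset], args.fmt, args.year)


if __name__ == "__main__":
    print(resolve(["install", "census.economic", "-y", "2007", "-v", "csv"]))
```

The sketch stops at printing the URL on purpose. Once the URLs are predictable, the actual download is the easy part (in Python, something like `urllib.request.urlretrieve` would do), and the hard part is Data.gov committing to the pattern.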

3. It doesn’t engage us directly

I don’t just want you to put links to the data up there. This is the biggest technical transparency and openness initiative the Government has undertaken in a long time, and it is also going to be a hub for developers. So talk to us, engage us, have a blog, tell us what’s going on and what to expect.

So much of dealing with data is narrative, and telling the story of Data.gov on an ongoing basis has so much value to it. We want to know what’s going on inside: who is working on it, what the process is, and who is building it. How are you talking Federal Agencies into putting their data online? What software challenges are you facing? When there’s new data, how will we know? (Here at Sunlight we built our own RSS feed for it.)

Those are my three biggest gripes. But all in all, it is a great contribution to society that I think will make amazing things happen for years to come. Heaps of praise, appreciation and gratitude for the sleepless nights that went into building this site. What would you change?