Government Data Sets – Managing Expectations

US federal agencies' Open Government plans were released today. As part of this process, agencies are beginning to release data sets publicly in ways they never have before. Several substantial, thought-provoking blog posts over the last few weeks have discussed how government can do open data well.

There are significant cultural and social sticking points that have yet to be addressed in releasing data openly. A discussion with a colleague from NASA last week confirmed how far most agencies are from having the luxury of considering the innovative data set management ideas available to them. Here's why:

With few exceptions, the data sets being released on data.gov and elsewhere were not assembled with the explicit intention of being made public, nor of being subject to the scrutiny that comes along with that. As a result, the primary roadblock to releasing data is a set of concerns about responsibility, quality, documentation, and maintenance. These aren't just cultural issues to get over; they represent very real questions about the quality and usability of the data itself. The 8 principles of open data should absolutely be institutionalized. But in the meantime, when faced with data sets that fail to adhere to those principles (often for historical reasons, and through no fault of the parties releasing them), how do we set expectations?

This is a starting point: an incredible, historic, and unprecedented opportunity. While there are many technical shortcomings in the data being made available, it's understandable that this data was not gathered with public release in mind. Perhaps not desirable, but certainly understandable. It is still worth releasing data, even when the quality isn't as high as we'd like. What we need to do is give the government community ways to describe the condition, quality, and lineage of a data set, and then, importantly, encourage them to publish it anyway.

A data set classification system would provide a standardized way to assert these important qualities. Such a rating system would serve two purposes: it would ease the uncertainty and discomfort associated with data set release by offering a clear, legitimized path forward for a variety of data characteristics; and it would give the user community, from developers to consumers, greater clarity in adopting, integrating, and using that data.

A rating system would address the following characteristics; a sketch of how such ratings might be encoded follows the list:

  • P for Provenance: This data is or is not of known provenance (its dates of collection and its collectors are known).
  • Q for Quality: This data set was or was not gathered to meet specific statistical guidelines, with known margins of error, a well-defined sample size, etc.
  • R for Responsibility: This data set does or does not have a point of contact.
  • M for Maintenance: This data set is or is not being actively maintained.
  • D for Documentation: This data set is or is not distributed with documentation.
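To make this concrete, here is a minimal sketch of how such a rating might be represented in machine-readable form. Everything in it is a hypothetical illustration: the `DataSetRating` class, the three-valued flags (asserted present, asserted absent, not yet assessed), and the compact label notation are one possible encoding, not an existing standard or anything agencies have adopted.

```python
from dataclasses import dataclass, fields

@dataclass
class DataSetRating:
    """Asserted characteristics of a published data set (hypothetical).

    Each flag states whether the data set does (True) or does not
    (False) exhibit the property; None means "not yet assessed".
    """
    provenance: bool | None = None      # P: dates of collection and collectors known
    quality: bool | None = None         # Q: gathered to meet statistical guidelines
    responsibility: bool | None = None  # R: has a named point of contact
    maintenance: bool | None = None     # M: actively maintained
    documentation: bool | None = None   # D: distributed with documentation

    def label(self) -> str:
        """Render a compact label, e.g. 'P+ Q- R+ M? D+'."""
        marks = {True: "+", False: "-", None: "?"}
        return " ".join(
            f"{f.name[0].upper()}{marks[getattr(self, f.name)]}"
            for f in fields(self)
        )

# A data set with known provenance and a point of contact, but no
# statistical quality guarantees and an unassessed maintenance status:
rating = DataSetRating(provenance=True, quality=False,
                       responsibility=True, documentation=True)
print(rating.label())  # P+ Q- R+ M? D+
```

The three-valued design matters: asserting that quality guarantees are absent is itself useful information, and is distinct from not having assessed quality at all.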

In many cases, data sets being released today have none of these characteristics. Note that the ability to assert a lack of quality doesn't excuse us from ever having to meet those standards. Much work will be done over the coming years to define what data should be gathered and how it should be released and maintained, and we'll start to see standards and best practices emerge. In the meantime, producers (in this case, government) and users (all of us) need a way to move forward with integrity and confidence.

When we put data out onto the web, that data is left to fend for itself. Even beyond government, distributing your data with statements about the qualities that affect products built on it helps the open data ecosystem stay healthy and accountable, and creates a positively reinforcing cycle.
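As one hypothetical form such a statement could take, a publisher might ship a small machine-readable sidecar file alongside each data set. The file name, field names, and contact address below are all invented for illustration; no such convention is proposed in this post or defined elsewhere.

```python
import json

# Hypothetical machine-readable "statement of qualities" published as a
# sidecar file next to the data it describes. All names are illustrative.
rating = {
    "dataset": "employment_stats.csv",
    "provenance": True,       # dates of collection and collectors known
    "quality": False,         # no statistical guidelines asserted
    "responsibility": True,   # has a point of contact
    "maintenance": None,      # not yet assessed
    "documentation": True,
    "contact": "opendata@example.gov",
}

with open("employment_stats.rating.json", "w") as f:
    json.dump(rating, f, indent=2)
```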

If you're part of an organization releasing its data, would this help address some of the discomfort of data set release? What's missing?