Free Yourself from the Shackles of “High Value Data”

When the feds introduced the term “High Value Data,” my immediate response was, “What the heck is ‘High Value Data’?!” We quickly extracted the definition from the Open Government Directive, and here it is:

“High-value information is information that can be used to increase agency accountability and responsiveness; improve public knowledge of the agency and its operations; further the core mission of the agency; create economic opportunity; or respond to need and demand as identified through public consultation.”

Now we’ve had a chance to go through some of the datasets. Our reporting team at http://reporting.sunlightfoundation.com is having a field day analyzing the data, pointing out its flaws, and generally doing a great job of figuring out what’s actually new in the datasets.

Predictably, a new complaint has emerged: people keep trying to figure out why this data is “high value.” And what we have here is the equivalent of legions of sci-fi fans complaining that Huckleberry Finn didn’t have enough Yoda. The fact is– it’s impossible for anyone, government or otherwise, to claim that data is “high value” by any universal standard. It depends too much on the person, the timing, and other completely subjective factors.

Investigative reporters point out that much of the data isn’t “high value,” when what they really mean is that it isn’t high value to them. Much of it doesn’t help to “increase the accountability of the agency” or give them material for investigative stories. Researchers and specialists will sometimes say, “But this data has been online for years. All you’ve done is release it as a .csv. This isn’t high value.” Developers will say, “You’ve given me a bunch of bulk data, and that’s great, but I have no idea what it is or what to do with it. Where’s the context?”

“High value” is a subjective term. Data has no value without context, and its value varies with the expertise and interests of the individual. I don’t find the Feed Grains Database particularly interesting, but I suspect Michael Pollan would find it quite useful.

“High value” also depends on timing. Today’s “junk” dataset could be tomorrow’s gold mine. Would a dataset about gas pedals have been considered high value before the Toyota recall? Would Recovery.gov’s data be considered high value without the political backdrop of Obama’s economic stimulus?

Let’s do away with the term “high value.” It would be better to set specific, measurable goals for each type of dataset the public wants released. Take accountability data, for instance. What would happen if we set benchmarks specifically for accountability data? Suppose the Open Government Directive said that each agency had to release two or three datasets allowing for the detection of the most common types of waste, fraud, or corruption affecting that agency. Then investigative journalists could measure far more accurately whether the data is of value. They can tell you fairly quickly whether a dataset is going to help them write the stories that help citizens hold agencies accountable for their work.

Because a dataset’s value is so subjective, discoverability and usability become the more important factors for most datasets. How do we get the right datasets in front of the right people? The flood of data is coming, and one person’s junk is another’s treasure.

We’re almost there. The White House should consider pushing a little harder and asking agencies to release new datasets against specific, measurable goals. And you– yes, you– the open government community can do the same. Join me and free yourself from the shackles of the “high value” abstraction. Create your own metrics for the impact of this released data and start tracking.
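As one way to get started, here is a minimal sketch, in Python, of what tracking your own impact measurements for a released dataset might look like. Every dataset name, metric, and number below is hypothetical and purely illustrative; nothing here comes from the directive or from Data.gov.

```python
from dataclasses import dataclass

# A tiny, hypothetical tracker for the "impact" of a released dataset.
# The metric names below are my own assumptions, not anything the
# Open Government Directive defines.
@dataclass
class DatasetImpact:
    name: str
    goal: str                   # the specific goal you set for this dataset
    stories_published: int = 0  # investigative stories that used the data
    apps_built: int = 0         # tools built on top of the data
    corrections_filed: int = 0  # data errors reported back to the agency

    def score(self) -> int:
        """A crude tally; the point is to track something concrete."""
        return self.stories_published + self.apps_built + self.corrections_filed


if __name__ == "__main__":
    tracked = [
        DatasetImpact("Feed Grains Database", "inform food-policy reporting",
                      stories_published=1),
        DatasetImpact("Agency travel spending", "detect wasteful spending",
                      stories_published=3, corrections_filed=2),
    ]
    for d in sorted(tracked, key=lambda d: d.score(), reverse=True):
        print(f"{d.name}: goal={d.goal!r}, impact score={d.score()}")
```

Swap in whatever measurements matter to you; the only requirement is that they be specific enough to actually check.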