Data Quality Deserves to be Tackled on Its Own


Last week Clay wrote about how we’ll be evaluating /open pages released under the OGD. The post ended with a series of considerations that we think are important: completeness, primacy, timeliness, accessibility, machine readability, availability without registration, being non-proprietary, freedom from licensing restrictions, permanence and obtainability.

One thing is conspicuously missing from the list, though: quality. This isn’t by accident. Data quality is something we talk about a lot at Sunlight. The reason is obvious: if the government releases data that is incomplete, erroneous or otherwise leads to bad conclusions, it doesn’t matter at all whether we got it via the coolest realtime XML pubsub serialized comet-streaming architecture imaginable. Garbage put into a useful container is still garbage.

But striking the right balance is tough. We want to help government find the best way to get its data into the hands of the public. Frankly, convincing an agency to get its technical house in order is much easier than convincing it to revisit its entire internal (and sometimes external) workflow. And not just because it’s an easier argument to make! It’s also simply harder to identify data quality problems.

Sometimes it’s possible to write a clever script that can identify missing or incorrect data and flag it for correction. To the extent that we can do this, we certainly ought to. But there’s a flip side to that approach: it can lead to mistakes like the ones on display at USASpending’s Data Quality page. Near the bottom of that page you’ll find a couple of tables measuring “data completeness.” Basically, USASpending looks at the transactional rows it gets from each agency, counts up how many columns are blank, and assigns a percentage score for completeness. This isn’t a bad way to triage data for glaring technical problems, but it can hardly be called a measurement of data quality. For instance, it won’t ever tell you that the Maritime Administration doesn’t bother to report any of its spending. Or that loan guarantees are often only reported when the loan goes into default (they’re supposed to be reported either way). Or that when you add up the spending numbers for an agency in USASpending, they frequently don’t match the numbers quoted on the agency’s website (and almost never match the obligation numbers quoted in the CFDA). The result is that USASpending doesn’t really accomplish what it’s designed to do. It’s far from useless, but it’s also far from authoritative.
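To make the limitation concrete, here is a minimal sketch of the blank-counting approach described above. It is my own illustration, not USASpending’s actual code: the function name, the field names and the sample rows are all hypothetical.

```python
from collections import defaultdict


def completeness_by_agency(rows, fields):
    """Return {agency: percentage of non-blank values} over the given fields.

    Hypothetical helper: a value counts as blank if it is missing, None,
    or only whitespace.
    """
    filled = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        agency = row["agency"]
        for field in fields:
            total[agency] += 1
            value = row.get(field)
            if value is not None and str(value).strip() != "":
                filled[agency] += 1
    return {agency: 100.0 * filled[agency] / total[agency] for agency in total}


# Made-up sample rows; "Agency C" exists but reports nothing at all.
rows = [
    {"agency": "Agency A", "recipient": "Acme", "amount": "1000", "zip": "20001"},
    {"agency": "Agency A", "recipient": "", "amount": "250", "zip": ""},
    {"agency": "Agency B", "recipient": "Widgets", "amount": "75", "zip": "10001"},
]
FIELDS = ["recipient", "amount", "zip"]

scores = completeness_by_agency(rows, FIELDS)
for agency, score in sorted(scores.items()):
    print(f"{agency}: {score:.1f}% complete")
```

In this toy data, Agency A scores roughly 67 percent and Agency B scores 100 percent. Agency C, which submitted no rows, never shows up in the results at all. That is the blind spot: a score computed only from the rows that arrive has nothing to say about the rows that don’t.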

The only reason I know this is that I’ve been working on the Subsidyscope project for the last year, and as we’ve tried to work with the USASpending data, its shortcomings have become apparent. Unfortunately, I think that’s the case with a lot of data. Until you really dig into it, making good judgments about its quality is tough.

That’s not to say that the situation is hopeless. One effort aimed to use Linus’s Law to fix its data quality problems by empowering an army of “citizen IGs,” crowdsourcing the problem, in essence. I don’t think anyone would claim that this worked perfectly, but it’s much too early to dismiss the approach entirely. On the other end of the spectrum, there’s a huge opportunity for fixing these problems using resources that the government already has. For instance, a number of us at Subsidyscope feel that the big problem with the USASpending system is that it isn’t the “real” system that the government uses to track its spending. That’s done through Treasury, and it’s not public (they don’t want to release the names and addresses of everyone who gets a Social Security check, for instance). But perhaps we can find ways to redact the personal data, cross-walk it with USASpending, and make sure agencies consider it important to have these systems reconciled with one another.
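As a rough sketch of what a reconciliation check could look like, the snippet below sums the public rows by agency and compares each total against a reference figure. Everything here is an assumption for illustration: the function name, the one percent tolerance, the row layout and the numbers are hypothetical, and a real cross-walk against Treasury data would be far messier.

```python
def reconcile_totals(public_rows, reference_totals, tolerance=0.01):
    """Compare per-agency sums of public rows with reference totals.

    Returns {agency: status}, where status is "ok", "mismatch",
    "missing from public data" or "missing from reference".
    The tolerance is relative to the reference total (hypothetical default).
    """
    public_totals = {}
    for row in public_rows:
        agency = row["agency"]
        public_totals[agency] = public_totals.get(agency, 0.0) + float(row["amount"])

    report = {}
    for agency in sorted(set(public_totals) | set(reference_totals)):
        public = public_totals.get(agency)
        reference = reference_totals.get(agency)
        if public is None:
            report[agency] = "missing from public data"
        elif reference is None:
            report[agency] = "missing from reference"
        elif abs(public - reference) > tolerance * abs(reference):
            report[agency] = "mismatch"
        else:
            report[agency] = "ok"
    return report


# Made-up figures for illustration only.
public_rows = [
    {"agency": "Agency A", "amount": "1000"},
    {"agency": "Agency A", "amount": "250"},
    {"agency": "Agency B", "amount": "75"},
]
reference_totals = {"Agency A": 1250.0, "Agency B": 100.0, "Agency C": 500.0}

report = reconcile_totals(public_rows, reference_totals)
for agency, status in report.items():
    print(f"{agency}: {status}")
```

With these made-up figures, Agency A reconciles, Agency B is flagged as a mismatch, and Agency C is flagged as missing from the public data. Unlike a blank-column count, a comparison against an outside source can notice an agency that reports nothing.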

That’s just one example. The point is that there’s a way forward — once you can identify the problem. Figuring out better ways to do that is something we’re always thinking about here at Sunlight. We’d love to hear your thoughts about it, too.