Dealing with Inaccurate Government Data


Developers are good at getting bits to line up, importing data, and drawing conclusions out of it. Designers are good at polishing the result and making those conclusions and bits easily digestible. But all the apps I’ve seen ultimately suffer from the same fatal flaw: accuracy.

As developers, we must focus not only on making compelling visualizations and user experiences out of the data we pull from government, but also on making that information as accurate as possible. We have to do a better job than government itself does, and that is difficult, because the data we start from is the government’s own inaccurate data.

Journalists know this. The folks at Realtime caution us about nearly every project we release. Almost every dataset that comes out of government (especially the ethics kind) carries nuances that journalists have learned only after years of working with the data. The Center for Responsive Politics has built a strong, sustainable brand around its FEC data not only because it makes the data accessible, but because the data gets refined by human eyes, which adds both value and accuracy.

When developers alone tackle the problem, they tend to worry about taking a dataset, putting it into some kind of database, mashing it up with other data, and building a thoughtful, useful tool. They rarely worry about the accuracy of the data they’re providing.

Even the best developers face this problem. Aaron Swartz faced it with Watchdog.net, which we funded last year. He built an elegant platform that mashes up data from all kinds of sources and partners to produce a single view of a Member of Congress. He pulled campaign contribution data from OpenSecrets and earmark data from Taxpayers for Common Sense. He got data from Congress and from lobbyists and built his system. And none of the data on that site is trustworthy. Check this out:

[Screenshot: Lobbyist Contributions (watchdog.net)]

A $2.3MM contribution? To the untrained and unknowledgeable eye this might pass as fact, but to anyone with some knowledge of the FEC it is immediately suspect: the FEC caps individual contributions to a campaign at $2,300.

But it isn’t a bug in Aaron’s code; it’s a bug in the system. Going upstream, you can see straight from the source that those million-dollar contributions are being reported even though they’re completely illegal. Most likely someone slipped an extra comma or zero into the amount field, and those are really just $2,300 contributions. But then the question arises: whose responsibility is it?

While we at Sunlight and our great community work to push the government to make its data more accurate, I think it is still the developer’s responsibility to make the data as accurate as it can be. I’d like to suggest these best practices for doing so:

Make Data Imports Modular and Reusable

This is the most important one. Separate your import scripts from your application entirely, and don’t make “cleaning up the data” depend on your application; keep that step as generic as you can.

You want to modularize it so that you can open it up and people can contribute code to just this part of your application. In a world where we’re writing parsers for everything, it would certainly be great if we only wrote each one once (or twice). This way you draw the most expertise possible around your dataset, contributing as far upstream in your app as possible.
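
As a minimal sketch of what that separation looks like (the module, field, and file names here are all hypothetical, not from any real schema), the import and cleanup logic lives in its own module that knows nothing about the app consuming it:

    # contributions_import.py -- standalone import and cleanup logic.
    # Field and file names are illustrative, not from a real schema.
    import csv

    def parse_amount(raw):
        """Normalize an amount string like '$2,300.00' to a float."""
        cleaned = raw.replace("$", "").replace(",", "").strip()
        return float(cleaned) if cleaned else 0.0

    def load_contributions(path):
        """Yield cleaned contribution records from a raw agency CSV."""
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield {
                    "contributor": row["contributor"].strip(),
                    "recipient": row["recipient"].strip(),
                    "amount": parse_amount(row["amount"]),
                }

Because the module has no dependency on your database or templates, other projects can reuse it, and domain experts can send fixes against the parser alone.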

Create Tests on the Regulations Around the Data

Attempt to understand the rules and regulations around the data you’re working with. If you’re dealing with campaign contributions, for instance, you need to know that the maximum contribution from an individual is capped. Then write a test that flags all of the data that doesn’t match those regulations: for instance, every contribution over the contribution cap.
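
Here’s one way such a test might look, assuming the hypothetical importer sketched above and the $2,300 individual cap discussed earlier:

    # test_regulations.py -- encode the legal rules as tests on the data.
    from contributions_import import load_contributions

    INDIVIDUAL_CAP = 2300.00  # the individual contribution cap discussed above

    def over_cap(records, cap=INDIVIDUAL_CAP):
        """Return every contribution that exceeds the legal cap."""
        return [r for r in records if r["amount"] > cap]

    def test_no_contribution_exceeds_cap():
        suspect = over_cap(load_contributions("contributions.csv"))
        assert not suspect, f"{len(suspect)} contributions exceed ${INDIVIDUAL_CAP:,.2f}"

Run under a test runner like pytest, a failure tells you exactly how many records need scrutiny before they ever reach your users.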

The question then becomes what to do with that data. While it varies case by case, I think there are plenty of options. No matter what, you should capture it, save it, and report it back to the agency providing the data. Sure, the agency may fire back and blame someone upstream, but you’ve now done the socially responsible thing. After that, you have a few options:

  1. Spot-check the data and figure out what’s wrong.
  2. Simply drop the inaccurate data.
  3. Present the data to the user with a flag or warning.

Probably a combination of all three is what you want, to be as accurate and as friendly to the user as possible.
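
Continuing the hypothetical sketch above, combining the options could look like this: quarantine anything that fails a check so it can be saved and reported back, and flag whatever you still choose to display:

    def partition_records(records, cap=2300.00):
        """Split records into clean ones and a quarantine pile for review."""
        clean, quarantined = [], []
        for r in records:
            if r["amount"] > cap:
                # Keep it, annotate it, and report it; don't silently drop it.
                r["flag"] = f"amount exceeds the ${cap:,.2f} individual cap"
                quarantined.append(r)
            else:
                clean.append(r)
        return clean, quarantined

The quarantined list is what you send back to the agency and spot-check by hand; the "flag" field is what your UI can surface as a warning if you display the record anyway.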

Provide a Way for Users to Report Inaccuracies

You, the developer, are never going to be able to catch everything. There shouldn’t be a single app that deals primarily with government data that doesn’t offer users an easy way to flag bad data. If I had one thing to add to GovPulse.us, DataMasher.org, or ThisWeKnow.org, it would be the ability for users to report faulty data, or even nuances in the data. It’s simply irresponsible not to architect your systems to use user feedback to continuously improve the accuracy of the data you’re providing.
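
The mechanics can be very small. Here’s a hypothetical sketch of a “report this record” endpoint (Flask is an illustrative choice; any web framework works the same way):

    # feedback.py -- a minimal endpoint for users to flag bad records.
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    reports = []  # in production this would be a database table

    @app.route("/records/<record_id>/report", methods=["POST"])
    def report_record(record_id):
        """Let a user flag a record as inaccurate, with an optional note."""
        reports.append({
            "record_id": record_id,
            "note": request.form.get("note", ""),
        })
        return jsonify({"status": "received"}), 201

A “report an error” link next to every record that posts to something like this gives you a steady stream of human corrections to review.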

There are probably more best practices, but start with these three. As we enter a new era of open government and start building serious, long-term apps, we have to inherit some principles not only from the developers around us, but from the great investigative journalists who’ve been bringing this stuff to light for years.