Reasons to Not Release Data, Part 7: Accuracy

by Laurenellen McCann and Alisha Green

policy

Oct 9, 2013 2:00 pm

Earlier this month, we shared a crowdsourced collection of the top concerns data advocates have heard when they’ve raised an open data project with government officials at the federal, state, and local level, and we asked for you to share how you’ve responded. Dozens of you contributed to the project, sharing your thoughts on social media, our public Google doc, and even on the Open Data Stack Exchange, where 8 threads were opened to dive deeper into specific subjects.

Drawing from your input, our own experience, and existing materials from our peers at the National Neighborhood Indicators Partnership and some data warriors from the UK, we’ve compiled a number of answers — discussion points, if you will — to help unpack and respond to some of the most commonly cited open data concerns. This mash-up of expertise is a work in progress, but we bet you’ll find it a useful conversation starter (or continuer) for your own data advocacy efforts.

Click here to see other posts in this series.

Over the next few weeks, we’ll be sharing challenges and responses from our #WhyOpenData list that correspond to different themes. Today’s theme is Accuracy.

30. If we put the data out there in bulk, people will alter it

“What is the worst case scenario? What is the risk-benefit trade-off?”
People might already be misusing and misunderstanding the bits and pieces of data that are already available. Communicating the meaning of the data and working toward improved data quality and release could help clear up these misuses. In this way, bulk data can actually help clarify misunderstandings by sharing the largest scope of information available on a particular subject.
Bulk data access doesn’t mean losing the original copies of data; it just means sharing a complete copy with the public. If there is inaccurate reporting or use of results, that’s likely not the fault of the data or the provider, but an error in calculation by an analyst. These can be addressed.
Further, not having bulk data available can also be detrimental, leading experts and amateur data scientists to draw conclusions from a limited field of information. Giving broader access to a dataset can enhance the accuracy of analysis and research by giving access to a wider sample pool with more dynamic indicators.
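One way to make the "original copy" point concrete: an agency could publish a checksum alongside a bulk file, so anyone can check whether a copy in circulation still matches the official release. Here is a minimal sketch in Python, assuming a made-up permit dataset. The sample contents and the names sha256_of_bytes and matches_official are ours, for illustration only, and not part of any agency's actual process.

```python
import hashlib


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical bulk file contents; a real release would read the published file.
official_copy = b"permit_id,status\n1001,approved\n1002,denied\n"

# The agency would publish this value next to the download link.
published_checksum = sha256_of_bytes(official_copy)


def matches_official(copy: bytes, checksum: str = published_checksum) -> bool:
    """True if the copy is byte-for-byte identical to the official release."""
    return sha256_of_bytes(copy) == checksum


# Someone edits a record in their own copy; the official file is untouched.
altered_copy = official_copy.replace(b"denied", b"approved")

print(matches_official(official_copy))
print(matches_official(altered_copy))
```

An altered copy fails the check, while the agency's original stays exactly as it was published.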

31. If we share our data/code, we’ll be hacked

A. Hackers will do stuff with it

An opportunity for a discussion: ‘Hackers’ no longer just refers to people crashing your web servers. The term also encompasses problem-solvers, tinkerers, and others who contribute to code, build applications and tech, and explore the potential of data.
People with access to data want to ‘hack’ it to improve upon it, mash it up with other data to add context, and put it into useful applications. Hackathons and hacking don’t mean bad things: they’re forms of civic participation that can lead to positive outcomes for the agency releasing the data.
Consider sharing some educational material, such as ‘Civic Hacking Creates Democracy,’ a video by the Sunlight Foundation.

B. Our systems will be hacked

“Let’s be clear about what we’re asking for you to release. Releasing your data or code will not necessarily lead to an attack — there are ways to protect government technology from malicious attacks while releasing datasets.”
Explain that you’re asking for the release of public data, not asking them to make any changes that would open up their systems for attack.
“Releasing data could actually prevent hacking or alternative system access: proactively sharing information can help avoid scraping or other well-intentioned efforts made to access data. If you continue to find members of the public requesting or scraping (extracting) data from your website, treat these events not as threats, but as cues for what information is considered desirable by your community, and explore opportunities to publish this data more widely.”

C. People might use the data to plan attacks

“Is there a subset of this data which would reduce/eliminate/be unaffected by this risk? Can we redact sensitive information, balancing privacy or security concerns against the public interest?”

32. It might be presented in ways that result in people misunderstanding it

A. The media will misreport it / The data will be misinterpreted

“Publish the data dictionary, descriptions, and classifications. Opening it up and providing context is more likely to improve data integrity than corrupt it.”
Also, consider that misunderstandings can be useful: They raise flags for productive review of data quality and create opportunities for the government agency to set the record straight and point the public to the correct data/understanding.
When people do misinterpret data, be prepared to help and correct the errors. Those who misinterpret it by accident will be grateful for the help.
Publishing may actually be useful to counter willful misrepresentation (e.g. of data acquired through Freedom of Information legislation), as you can quickly point to the real data on the web to refute the wrong interpretation.

B. People don’t understand my data / It’s complex/magical/for experts

“How do you understand your data? How would you explain it to others so that they could also benefit by knowing about what you do?”
Are there experts outside your agency that use your data in their work? How do they access it? How could you enhance their access? Data made available to the broader public isn’t always information that every member of the public can use. The point is to allow those people who are interested and can use your data to be able to do so without restriction, no matter who they are. Experts might self-select, but we’ll never generate new experts without allowing curious amateurs to explore.

33. People will be confused: The data quality isn’t great

A. The data might have errors or mistakes and could misinform the public / We’re not even sure how accurate the data is, so we don’t see the point in sharing it

Provide the agency with feedback about their data. Government staff may only use individual records or certain fields for operations, and not have the need to evaluate their data in its entirety. Sharing your diagnostic results with them can at minimum ensure open lines of communication, and at best, prompt measures to improve the data quality.
Pitch it as a crowdsourcing opportunity: “People are more forgiving and more eager to help than you think. If you’re able to contextualize this as a first step, you can find ways to use the public’s input to your advantage by asking people to flag inaccuracies that you and your team may never have found otherwise.”
Offer to help the data owner to tidy up, or better maintain, their data. By providing a system in which owners are more easily able to curate their data, you could be doing them a favor.
Suggest they look at creating or complying with some common standards.
Ask if some of the data can be released while the rest goes through a data quality process.
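If you want to bring diagnostic results to the agency, even a very simple check can start the conversation. The sketch below is a minimal example in Python, assuming the data is a CSV file with a column that should uniquely identify each record. The sample rows, the permit_id column, and the diagnose function are hypothetical and only there to show the idea.

```python
import csv
import io
from collections import Counter


def diagnose(csv_text: str, key_field: str) -> dict:
    """Count rows, blank values per field, and repeated key values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if field is None:
                continue  # extra cells beyond the header row; ignored here
            if not isinstance(value, str) or not value.strip():
                missing[field] += 1

    # A key that appears more than once may point to duplicate records.
    key_counts = Counter(row[key_field] for row in rows)
    duplicate_keys = sorted(k for k, n in key_counts.items() if k and n > 1)

    return {
        "row_count": len(rows),
        "missing_by_field": dict(missing),
        "duplicate_keys": duplicate_keys,
    }


# Hypothetical sample standing in for a real dataset.
sample = (
    "permit_id,status,issued\n"
    "1001,approved,2013-09-01\n"
    "1002,,2013-09-03\n"
    "1002,denied,\n"
)

report = diagnose(sample, key_field="permit_id")
print(report)
```

A short summary like this (how many records, which fields have blanks, which IDs repeat) gives staff something specific to review, and it works just as well on a partial release while the rest of the data goes through a quality process.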

B. Releasing it like this will make us look bad

“Data imperfection is not an excuse for withholding data — all data is imperfect to some extent, and openness around data can help improve it.”

Stay tuned tomorrow for our next #WhyOpenData post on Privacy.