Exploring open data’s microdata frontier

by Emily Shaw

Oct 24, 2014 2:13 pm

Photo credit: *n3wjack’s world in pixels/Flickr

As open data advocates, we seek public access to the best-quality data we can get. One critical dimension of a dataset’s quality is its granularity: the number of individual observations that the dataset aggregates into each cell. Microdata (data that is not aggregated at all, but is available at the level of the individual observation) is the most granular data. Where such data is openly available, it is “open data at the power of one”: data that allows us to query individual cases, giving us our best chance at a true picture of the empirical reality the data represents.

Why is microdata so powerful? For many of the most socially valuable kinds of data use, the more granular the data, the better it is. A useful analogy for understanding the value of high granularity is high-resolution digital photography. Images with more pixels per inch are clearer and easier to interpret; they can be enlarged to look at smaller details, and they offer a more precise representation of the object they depict. More detailed data generally offers a similar set of advantages. Highly granular datasets allow researchers to test more detailed causal theories and learn about more specific outcomes. Across the fields of criminal justice, education, health and social service delivery, researchers have the potential to achieve life-changing social advances with access to more microdata. For app developers, highly granular datasets enable the creation of more precisely tailored services, providing greater value to end users. In both cases, microdata offers the highest possible granularity for the use at hand.

Because more granular data enables a broader variety of uses, open data advocates tend to seek access to more data at higher levels of granularity. So if the utility of microdata is so high, why is it often difficult to achieve open access to individual-level datasets?

The answer is that one person’s individual-level observation is another person’s private information. For many decades, U.S. laws have worked to define and defend a right to individual privacy that concerns not just our immediate physical privacy but also, as the potential power inherent in holding data became clear, the privacy of certain kinds of information. As a result of significant federal laws like the Privacy Act of 1974, the Health Insurance Portability and Accountability Act, and the Family Educational Rights and Privacy Act, Americans enjoy the right to prevent the open disclosure of many kinds of data that would allow others to identify and learn personal information about them. Even beyond the legal restrictions, however, we have a broad ethical interest in making sure that private individuals are treated with respect as human beings, not just as generators of data. Although they are not the only way to accomplish this end, and they are not without significant downsides, our existing privacy protections are the major way we currently demonstrate our interest in making sure data-sharing does not harm individuals.

The conflict between data users’ desire for individual-level data and our laws and norms of individual privacy protection leads to some critical questions for open data advocates. What exactly are the best current policies and practices for balancing our need for improved open data against the claims of individual privacy? And how do we address the competing moral claims, on the one side to the good of public knowledge and on the other to the good of personal privacy?

To help inform this critical conversation, we want to begin seeking answers to these questions. Please join the Sunlight Foundation as we explore the challenges and opportunities of the current legal, technical and social landscape of 21st-century microdata. In our new periodic series, “Opendata¹,” we will investigate the important questions facing open data advocates at the microdata frontier.