The flip side of anonymity: legally-identified microdata


When we think about using microdata, the individual-level data that permits analysis at the most granular level, we often think about the way its use is restricted by rules protecting individual privacy. Health data, school data and criminal justice records are all restricted due to various privacy concerns. While some of these restrictions seem necessary, others don’t seem realistic or helpful.

Because of that frustration, our exploration into “open data at the power of one” began with a consideration of the state of the art in data anonymization techniques. If anonymization techniques were robust enough that individuals could not be connected with the released data, we could deploy them widely across public microdata-holding sites and feel quite comfortable advocating for the immediate and universal release of microdata. (As our recent post described, however, while techniques are moving in that direction, we’re not quite there yet.)

At the same time, it is also important to point out that not all kinds of individual-level data are held to the same privacy standard. Some data legally contain personally identifiable information. While we wouldn’t have to worry about violating privacy protection laws under conditions of perfect data anonymization, we also don’t have to worry about violating privacy protection laws where individual-level datasets legally contain identified data.

Exploring the varied terrain of legally-identified data helps show how different datasets, even closely related datasets, may have very different levels of protection. Circumstances of history, our legal traditions, and the political ability to fight for privacy or argue a countervailing public interest all affect the legal determination of which personally-identified information can be publicly released and which cannot. In other words, the determination of which data can be legally identified occurs through a social and political process and is not an exact science.

While there is no clear or absolute rule for determining which personally-identifiable data should be legally available, arguments about legally identified data always operate by balancing social benefit and individual harm. When there are convincing political arguments that the social benefit strongly outweighs the individual harm, it’s easier to maintain public access to personally identified data. Where there are convincing political arguments that people are significantly harmed by identified data with little countervailing social benefit, policies are more likely to restrict access.

So what kinds of data can legally include personally-identifiable information today?

We can begin to answer that question by considering how the official American collection, provision and restriction of individual-level information has changed over time.

In the colonial era (as now), the right to government-held information sprang from the existence of official collections of data. In colonial America, the mandated collection of such documents as probate records, land-ownership records and vital records provided a core set of public data that individuals could access for specific legal uses. The records of court trials were also available. These, however, were generally available for inspection to anybody, without a specific legal mandate, in line with the ancient common law right to a public trial.

In addition to legal data, public data collection also reflected ongoing public surveillance for social goals, such as monitoring disease prevalence. In his entertaining historical survey of the American effort to balance personal privacy and access to information about others, Robert Ellis Smith also describes how a Puritan belief in the godliness of constant public scrutiny led to the early American public recording of such misdeeds as drunkenness, failing to attend church and living alone (an often-illegal activity).

Data collection and its public availability changed dramatically in the 20th Century. Daniel Solove describes how the expansion of the bureaucratic state in the 20th Century led to an explosion of public records, far outstripping the scale of individual-level data collection that had occurred to that point. Moreover, while earlier record-keeping law and technology had limited the practical scope of records access, 20th Century technology and political action changed the nature of both data collection and data requests in ways that let both occur far more broadly. Harlan Yu and David Robinson effectively summarize the “closed government” characteristics of public records custodianship in the period up to and after World War II. That closed posture was countered by the successful effort, led by journalists and media lawyers, to achieve enactment in 1966 of the federal Freedom of Information Act (FOIA). While most states did not have formal freedom of access policies in place before the federal FOIA, all of them established similar laws in the years after its implementation.

While FOIA expanded the public right to access government information, it developed simultaneously with a movement to restrict access to government-held individual-level data. The 1974 Privacy Act was a response to fear about increased federal use of collected individual-level data. An influential report on citizen rights to privacy in the computer age argued that the ever-increasing pace of computerized government data collection meant citizens needed more control over their data, and it convinced lawmakers of the need for increased protection. The Privacy Act requires that citizens be able to learn about all of the federal datasets which hold their information, obtain that information, correct faulty information and limit new kinds of uses for the data (except if the new use falls within the twelve permitted exceptions). While this was unquestionably an important advance in establishing new personal protections, the law applies almost solely to individual-level data housed by federal agencies, although it does place some additional conditions on federal, state and local governments’ use of Social Security Numbers as identifiers. In response to the same kind of concerns driving enactment of the Privacy Act, many states also developed their own laws to protect citizens’ data. (We will explore the extent of this state-level variation in an upcoming post.)

Significant in its own right, the Privacy Act was also the first in a series of federal laws which gave specific kinds of data strong legal protection from public release. The Health Insurance Portability and Accountability Act, providing for the confidentiality and privacy of medical records, and the Family Educational Rights and Privacy Act, protecting education records, are the two most significant of these federal restrictions on individual-level data release in terms of the sheer number of institutions that must abide by them. However, a number of other specific data confidentiality laws developed similarly in other issue areas, such as the Child Abuse Prevention and Treatment Act protecting records related to child welfare, the Comprehensive Alcohol Abuse and Alcoholism Prevention, Treatment, and Rehabilitation Act setting limitations on access to substance abuse treatment information, and the Gramm-Leach-Bliley Act, which limits the release of data associated with individuals’ financial accounts. People seeking to use or access microdata are likely to find that issue-specific laws are ultimately more consequential than the Privacy Act for data access because they address records held by non-federal entities.

The boundary between public, legally-identified data and private data is constantly changing, frequently subject to pressure from visible public events and their political consequences. For example, the Driver’s Privacy Protection Act (DPPA), which protects motor vehicle records, arose after women were stalked and killed using information obtained through their motor vehicle records. (The DPPA changed again after other, less violent public events: A later amendment to this act further restricted state governments from selling motor vehicle record data to marketers in response to a public perception that the practice fueled junk mail.) Another example of the political source of data privacy can be found in the 1988 Video Privacy Protection Act, an expansion of federal law protecting the privacy of individuals’ videotape rental records, passed in response to the journalistic publication of U.S. Supreme Court nominee Robert Bork’s video rentals. More recently, we’ve witnessed political events drive privacy law in the effort, occurring in a number of states, to restrict public access to lists of individuals who’ve received state gun permits. This wave of restrictions on identified data about gun licensing followed a newspaper’s publication of a map of local gun permit holders in the wake of the December 2012 shooting at Sandy Hook Elementary School.

Between the new modes of accessing government-held information provided by federal and state-level FOI laws, and the new restrictions on data created by the Privacy Act and other laws restricting data on specific topics, the landscape of microdata in the early 21st Century is a unique patchwork of highly available and highly protected data. Because of FOIA and advances in electronic access to information, where individual-level data has not been specifically restricted in statute, we can generally find large quantities of individual-level identified data online. Where categories of individual-level data have been legally restricted, it is essentially inaccessible without additional qualifications, certifications or legal intervention.

A survey of a few categories of data demonstrates the highly varied data landscape.

Criminal Justice

  • Maintaining the legacy of common law access, courts often continue to provide fully accessible, individually-identified data. While states may redact elements of court files (commonly information about victims, financial information and information about juveniles), many court records are available to the public via an online interface (albeit for a fee). Observers have noted, however, that tension around the availability of court records, and new restrictions on that availability, have arisen where those records contain the kinds of specific identifiers restricted under the issue-specific data laws, such as SSNs, driver’s license numbers and financial account information, since that information, when made available in online court records, has been a source of identity theft.
  • Police department records — while difficult to access on some topics, notably internal investigations — can be highly identified and available for other subjects, particularly adult arrests. Some states, like Iowa, publish mugshots alongside an individual’s full name, physical details, and previous arrests. States typically more restrictive on privacy, like California, still require police to make public a fairly comprehensive set of details about adult arrestees.
  • Since the 1990s, people convicted of sex crimes have been subject to federal registration requirements that create very highly-identified and specific public data. Sex offender databases are available at the state level online and contain specific identifying details about individuals such as pictures, birth dates and addresses and the crimes they committed, along with distinct social labels like “predator.”
  • Meanwhile, when criminal justice records concern minors, another set of principles governs the identifiability of data. Court proceedings involving minors are confidential in most jurisdictions, though not all, and even where confidentiality is the rule, exceptions are made for very serious and violent crimes.

Political Data

  • State voter registration data contains highly detailed identifying information, including full name, address, political party and recent voting history. It is accessible to some degree in all states (except North Dakota, which has no state voter registration) but with varying degrees of access and protection. In Florida, for example, the state makes many personal details publicly available and explicitly acknowledges that users of the Florida voter registration database may use the data in ways that could inconvenience or cost the registered voters. In California, meanwhile, the state makes voter registration information only nominally public by highly restricting access to it, requiring potential data users to demonstrate they are using it for legitimate public interest purposes.
  • Donations to political candidates became a new form of personally identified public data after the 1971 Federal Election Campaign Act (FECA) mandated public disclosure of donations. When the legitimacy of this disclosure requirement was assessed by the U.S. Supreme Court in Buckley v. Valeo (1976), the justices determined that the potential for individual harm caused by the disclosure was outweighed by the social purpose achieved when “disclosure requirements deter actual corruption and avoid the appearance of corruption by exposing large contributions and expenditures to the light of publicity.” Significant personally identified details — including name, employer, address, supported candidates and donation amounts — are now available about campaign donors in all 50 states.
  • Even though an individual’s party registration and donation history are public, their votes themselves remain strictly protected. The challenge of producing acceptable online voting has in fact been stymied for quite a while by the problem of adequately demonstrating that the method would secure the anonymity of individual votes.

Public Employee Data

  • The full names, titles and salaries (including benefit information) of people identified as “public employees” are typically considered public information. Many state and local governments put this identified information online as open data.
  • The determination of who counts as a public employee for these purposes depends on historical and political definitions of which institutions count as “public” and which do not. For example, public access to salaries typically extends to the personnel at state universities, despite the fact that many states now fund only a small percentage of their public universities’ budgets. The obligation to disclose salaries, however, does not extend to institutions identified as “private,” no matter how much public money they receive. For example, large publicly supported businesses like Lockheed Martin are nearly 80% funded through public money but have no similar requirement to provide salary and benefit data for all employees.

The landscape of which data can legally contain personally-identifiable information and which cannot is clearly complicated, nuanced and perpetually in flux. In just about all cases, the question of how to find the balance between social benefit and avoidance of individual harm remains an active area of inquiry.

Some of the leading work in finding the balance between data and harm has been developed in scientific research settings. In an upcoming post, we will explore some of the most common and useful approaches that we’ve found in this domain.