Anonymization and microdata: Can we open up granular info without invading privacy?
We’re taking a closer look at a number of important questions associated with the use of microdata: the individual-level data that offers enormous potential benefit but also carries real potential risk.
At present, governments tend to manage risks to personal privacy when they release data by aggregating data points in a way that obscures identifiable personal characteristics. For example, when the U.S. Census Bureau collects detailed demographic information about U.S. residents, it does not publicly release individual survey responses. To comply with privacy laws, the bureau hides individual-level detail by transforming individual responses into aggregate counts and categories.
At the same time, we know that technological advances perpetually offer new opportunities and methods for achieving data-related goals. Because microdata can be such a powerful tool for answering important social questions, we want to increase access to it, but we do not want to illegally provide access to individuals’ personally identifiable information. Do current techniques allow us to provide the public with individual-level records while simultaneously preserving the privacy of data contributors? Are anonymization techniques at a level of development where we can expect government data holders to use them to create and release anonymized individual-level data?
Anonymization and deidentification
“Anonymized data” describes the rigorous condition under which the individual cases that make up a dataset cannot be reconnected to specific individual identities. If it is possible to use the dataset to identify specific real people, then the data was insufficiently anonymized. Achieving the status of “anonymized data” represents one traditional and important ideal for the public release of microdata. If anonymization were truly possible and feasible, it would allow the unproblematic opening of individual-level datasets, without concerns about privacy violation, since all potential links between identities and data would have been broken.
“Deidentification” is a closely related, though slightly distinct, concept. Deidentified data is mentioned most often in the context of health data, since HIPAA, the law mandating health data protection, defines “deidentification” as the process of ensuring that data have been stripped of a list of 18 specified identifiers. In addition, deidentified data may still allow a data manager to reunite individual identities with their data rows by using a securely kept reidentification key unique to that dataset. Informally, however, the terms anonymization and deidentification are generally used interchangeably.
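To make the distinction concrete, here is a minimal sketch of deidentification with a reidentification key. The field names and records are hypothetical, and a real HIPAA deidentification would need to strip all 18 identifier types and protect the key far more carefully than a short script can show:

```python
# A minimal, hypothetical sketch of deidentification with a reidentification key.
import secrets

def deidentify(records, identifier_fields):
    """Replace direct identifiers with random pseudonyms; return the cleaned
    records plus a reidentification key to be stored securely and never released."""
    reid_key = {}
    deidentified = []
    for record in records:
        pseudonym = secrets.token_hex(8)
        reid_key[pseudonym] = {field: record[field] for field in identifier_fields}
        cleaned = {k: v for k, v in record.items() if k not in identifier_fields}
        cleaned["pseudonym"] = pseudonym
        deidentified.append(cleaned)
    return deidentified, reid_key

patients = [{"name": "Ada Smith", "ssn": "000-00-0000", "diagnosis": "asthma"}]
public_rows, key = deidentify(patients, identifier_fields={"name", "ssn"})
# public_rows could be shared; key stays with the data manager for reidentification.
```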
So how hard is it to anonymize data? If it doesn’t matter what you’re using it for, it’s not that hard. Superficially, we could say that anyone can create anonymized individual-level datasets; after all, a dataset which contains only apparently random values is entirely anonymous. In order to have any informational utility, of course, a dataset must retain at least some true values. Anonymized datasets therefore must strike a balance: they need to provide enough true information that the data remain useful, while withholding or distorting enough information that the data cannot be linked back to the individuals who contributed it.
Early 21st century efforts in data anonymization
The problem of how to effectively anonymize microdata has been an active subject of discussion in both academia and the popular media. A tour of the academic and popular literature over the past fifteen years reveals how the practice of anonymizing data has evolved in the age of “big data.”
In 2002, Latanya Sweeney, now a Harvard professor serving as chief technologist at the Federal Trade Commission (FTC), developed an influential method for anonymizing datasets called “k-anonymity.” Her paper “k-Anonymity: A Model for Protecting Privacy” addressed the problem that common, non-specific identifiers like gender, birth date and zip code can be used together to de-anonymize a database. (An earlier paper by Sweeney revealed that 87 percent of the U.S. population could be uniquely identified purely through the combination of gender, birth date and zip code.) Broadly speaking, k-anonymity describes the relationship between a numerical anonymity requirement and the number of people sharing the same identifying values: a dataset has k-anonymity when no individual in it can be distinguished from at least k-1 other individuals in the dataset. Specifically, the technique Sweeney recommended protects privacy by generalizing the identifying fields so that each combination of values is shared by more records: returning a birth year rather than an exact date, for example.
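To illustrate the idea with hypothetical records (not Sweeney’s actual implementation), a small script can generalize birth dates to years and zip codes to three-digit prefixes, then report the size of the smallest group of indistinguishable rows; that minimum is the dataset’s k:

```python
# A rough sketch of measuring k-anonymity after generalizing quasi-identifiers.
from collections import Counter

def generalize(record):
    """Coarsen quasi-identifiers: keep gender, birth year and a 3-digit zip prefix."""
    return (record["gender"], record["birth_date"][:4], record["zip"][:3])

def k_anonymity(records):
    """Return the size of the smallest group sharing the same generalized values."""
    groups = Counter(generalize(r) for r in records)
    return min(groups.values())

records = [
    {"gender": "F", "birth_date": "1984-07-31", "zip": "20008", "diagnosis": "flu"},
    {"gender": "F", "birth_date": "1984-02-14", "zip": "20009", "diagnosis": "asthma"},
    {"gender": "M", "birth_date": "1975-05-02", "zip": "20010", "diagnosis": "flu"},
    {"gender": "M", "birth_date": "1975-11-23", "zip": "20011", "diagnosis": "diabetes"},
]
print(k_anonymity(records))  # 2: every generalized row is shared by at least two people
```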
Subsequent research has pointed out weaknesses in this approach, as well as potential fixes. Data anonymization using k-anonymity can be defeated through a linkage attack: using one publicly available dataset, like state voting records, to de-anonymize another. Researchers have also demonstrated that datasets which are updated periodically present another challenge, because the same kind of attack can be mounted using a prior release of the same dataset. Merely keeping a dataset up to date can inadvertently reveal sensitive information, because the differences between successive releases, in counts and averages, can expose what changed.
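A rough sketch of such a linkage attack, again with hypothetical records: join the “anonymized” rows to a public voter roll on the shared quasi-identifiers, and keep any match that is unique.

```python
# A hypothetical linkage attack: re-identify rows whose quasi-identifiers
# match exactly one person in a public dataset.
anonymized_health = [
    {"gender": "F", "birth_date": "1984-07-31", "zip": "20008", "diagnosis": "asthma"},
]
public_voter_roll = [
    {"name": "Ada Smith", "gender": "F", "birth_date": "1984-07-31", "zip": "20008"},
]
QUASI_IDENTIFIERS = ("gender", "birth_date", "zip")

def link(health_rows, voter_rows):
    index = {}
    for voter in voter_rows:
        key = tuple(voter[q] for q in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(voter["name"])
    for row in health_rows:
        matches = index.get(tuple(row[q] for q in QUASI_IDENTIFIERS), [])
        if len(matches) == 1:  # a unique match re-identifies the record
            yield matches[0], row["diagnosis"]

print(list(link(anonymized_health, public_voter_roll)))  # [('Ada Smith', 'asthma')]
```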
Anonymization techniques were not merely critiqued in theory. Examples of recent failed anonymization efforts – where supposedly anonymized datasets were released to the public and subsequently shown to be vulnerable to deanonymization – also received substantial community attention.
One frequently cited case occurred when the New York City Taxi and Limousine Commission released over 20 gigabytes of improperly anonymized taxi fare and trip logs. The dataset was released with hashed values standing in for taxi medallion and driver’s license numbers, but the hashing turned out to be easily defeatable in this case.
Cryptographic hashing of identifying fields offers one method for predictably transforming input data into a unique-looking digest, encoding identifying details in a form that cannot simply be read back out. The basic reason hashing can be a strong technique is that you can only reproduce a given hash if you know the input, so the protection depends on that input being hard to guess. If this approach were effective for anonymization, it would be very feasible to implement, since many people are already familiar with it: web developers, for example, often hash passwords before they get stored in a database. Hashes are further strengthened when another random value, called a “salt,” is mixed into the input before hashing.
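As a rough illustration, not a recommendation for any particular system, salting looks something like this; real password storage would normally use a dedicated scheme such as bcrypt or Argon2 rather than a bare SHA-256:

```python
# A minimal sketch of salted hashing with Python's standard library.
import hashlib
import secrets

def salted_hash(value, salt=None):
    """Hash a value together with a random salt; both salt and digest are stored."""
    salt = salt if salt is not None else secrets.token_hex(16)
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return salt, digest

def verify(value, salt, digest):
    """Recompute the hash with the stored salt and compare."""
    return salted_hash(value, salt)[1] == digest

salt, digest = salted_hash("correct horse battery staple")
print(verify("correct horse battery staple", salt, digest))  # True
```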
Of course, serious issues arise when hashes are made on predictable input and without the standard precaution of adding a salt to confound that predictability. In the case of the New York taxi data, Vijay Pandurangan was able to compute hashes for every possible value of a taxi medallion number because he knew the regular format of a valid medallion number. Since the released taxi data used only a basic, unsalted hash, Pandurangan was able to recover the original numbers by looking them up in his table. Pandurangan pointed out that there were better methods for encoding these values, but also noted that even if those identifiers were removed entirely, it could still be possible to de-anonymize the records using techniques that join the anonymized dataset with other, non-anonymized public datasets.
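Here is a minimal sketch of that lookup-table attack, assuming a hypothetical medallion format of one letter followed by three digits; the real formats differ, and MD5 simply stands in for whatever unsalted hash a release might use, but the principle is the same: a small, predictable input space can be hashed exhaustively.

```python
# Build a lookup table from every possible medallion hash back to its plain value.
import hashlib
from itertools import product
from string import ascii_uppercase, digits

def md5_hex(value):
    return hashlib.md5(value.encode("utf-8")).hexdigest()

lookup = {}
for letter in ascii_uppercase:
    for combo in product(digits, repeat=3):
        medallion = letter + "".join(combo)
        lookup[md5_hex(medallion)] = medallion

released_hash = md5_hex("A123")   # a hash as it might appear in a released dataset
print(lookup[released_hash])      # "A123": the "anonymized" value is recovered
```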
The effectiveness of combining anonymized data with non-anonymized data as a method of individual reidentification was demonstrated several years ago using a dataset of anonymized movie ratings. In 2006, Netflix released a dataset of over 100 million user ratings in a contest to devise a more accurate movie recommendation system. The ratings were anonymized in the sense that all directly identifying user details were removed. However, researchers at the University of Texas at Austin demonstrated that users could be de-anonymized by comparing the Netflix data with a public dataset. They used public, non-anonymous reviews from the Internet Movie Database (IMDb) to identify a few users based on both the similarity of the movie ratings and the timing of the reviews. The effort did not achieve a complete deanonymization of all of the users, but it provided a proof of concept that the anonymity of a single database, even if effective on its own, was vulnerable to being broken when joined with other publicly available datasets.
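A toy version of that matching logic, with made-up profiles rather than the researchers’ actual algorithm: score each public reviewer by how many movies they rated in common with an anonymized user within a short time window, and take the clear winner.

```python
# Hypothetical sketch: match an anonymized rating history to public reviews
# using shared movies and nearby dates.
from datetime import date, timedelta

def match_score(anon_ratings, public_reviews, window=timedelta(days=14)):
    """Count movies rated by both profiles within `window` days of each other."""
    return sum(
        1
        for movie, rated_on in anon_ratings.items()
        if movie in public_reviews and abs(rated_on - public_reviews[movie]) <= window
    )

anon_user = {"Movie A": date(2005, 3, 1), "Movie B": date(2005, 3, 4), "Movie C": date(2005, 5, 9)}
imdb_profiles = {
    "reviewer_1": {"Movie A": date(2005, 3, 2), "Movie B": date(2005, 3, 5), "Movie C": date(2005, 5, 10)},
    "reviewer_2": {"Movie A": date(2004, 1, 1)},
}
scores = {name: match_score(anon_user, reviews) for name, reviews in imdb_profiles.items()}
print(max(scores, key=scores.get))  # reviewer_1
```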
Earlier in 2006, there was another case in which a company released an anonymized dataset only to find that its efforts weren’t sufficient. AOL posted the search histories of 500,000 users, replacing their usernames with random numbers. However, the individual search queries themselves were enough to identify some of the users, and the Electronic Frontier Foundation filed a complaint with the FTC over AOL’s failure to protect personal consumer information.
These publicized anonymization failures prompted articles claiming that “‘Anonymized’ data really isn’t” and had legal scholars warning us, “Don’t Build a Database of Ruin.” Collectively, these public examples demonstrated some genuine problems facing the release of anonymized microdata.
Creating methods to preserve both microdata and privacy
While several techniques have been shown to be insufficient for true public-facing anonymization, other techniques for protecting sensitive data have continued to evolve. For the most part, researchers have taken a layered approach to managing data privacy, pairing methods for anonymizing data with controls over who is permitted to access the data and what is made available to them. To meet the legal standards set both by federal regulations and by institutional review boards, most researchers strictly restrict access to raw data and publicly provide only highly aggregated, low-granularity data.
However, researchers have been looking at how to preserve more attributes of datasets that contain personally identifiable information while still ensuring anonymity. The most promising approach, called “differential privacy,” applies statistical algorithms to create views of the data from which no individual can be singled out.
Differential privacy is a method in which the data anonymizer takes a dataset and adds “noise,” random false information, in order to make it difficult to identify any one individual within the dataset. Unlike previous anonymization techniques, however, differential privacy also integrates a quantified level of acceptable risk, “ε,” chosen with known reidentification risks such as linkage attacks and repeated queries over time in mind. Differential privacy requires the data anonymizer to conceptualize a dataset’s overall “privacy budget”: the total privacy loss that will be tolerated across the number and type of queries the dataset is expected to answer, which allows the anonymizer to add a level of noise appropriate to the queries that will be permitted.
This explicit expectation of how people will interact with the data allows the data anonymizer to “turn up” or “turn down” the noise as appropriate for the conditions of data release. As mentioned above, it is possible to have an entirely anonymized dataset if it contains purely random information, but then the data is no longer useful. Differential privacy answers the question of how much noise must be added, and where it must be added, in order to obscure every individual while retaining at least some useful qualities of the data. (For another explanation of how differential privacy works, and a widget that allows you to explore how it hides individual rows, see Anthony Tockar’s blog post on “the basics” of differential privacy, and see how he applies it to the New York taxi data example mentioned above.)
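To give a sense of the mechanics, here is a minimal sketch of the Laplace mechanism, the textbook way to answer a counting query with ε-differential privacy. The dataset and query are hypothetical, and a real deployment would also track how much of the privacy budget each query consumes:

```python
# A minimal sketch of the Laplace mechanism for a counting query.
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def private_count(records, predicate, epsilon):
    """A counting query changes by at most 1 when one person is added or removed
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives epsilon-DP."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(scale=1.0 / epsilon)

trips = [{"pickup_zip": "10001"}, {"pickup_zip": "10001"}, {"pickup_zip": "11201"}]
# Smaller epsilon means more noise and stronger privacy.
print(private_count(trips, lambda t: t["pickup_zip"] == "10001", epsilon=0.5))
```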
Earlier this year, Cynthia Dwork and Aaron Roth published The Algorithmic Foundations of Differential Privacy, a work which discusses in detail how to apply the related techniques. Dwork and Roth promise that “at their best, differentially private database mechanisms can make confidential data widely available for accurate data analysis, without resorting to data clean rooms, data usage agreements, data protection plans, or restricted views.” However, the work is new enough that it has yet to enjoy broad application. Given the current financial constraints of many government data holders, it may take a while for the technique to become routine and simple enough to implement without an in-house statistical specialist. Still, the promise of differential privacy suggests that while we may not get anonymized data that preserves a row for each individual person in a dataset, we can expect to see privacy algorithms that enable the regular release and updating of properly anonymized, high-resolution data for the researchers who want access to it.