How to protect privacy when releasing open data

Rows of white numbers on a black background
Image credit: r2hox/Flickr

In early 2015, Jeb Bush published a trove of emails sent to him during his time as governor. All political considerations aside, it provoked significant debate — once people realized the information published included personal information ranging from citizens’ email addresses to the holy grail of government-maintained personal information, social security numbers. This was certainly transparency (in some sense) on Bush’s part, but it also likely compromised the privacy of the people who wrote to him.

Many argue that Bush should have adopted accepted practices that address the potentially negative impacts of “too much transparency,” such as redacting information traditionally redacted in government FOIA processes, like social security numbers. Anonymizing the emails, making it as difficult as possible to tell who had communicated with Bush, would achieve more privacy at the cost of even less transparency. Others might advocate aggregating the emails (listing only categories or quantities of emails), a solution frequently offered for highly granular data.

We could also choose to look at this problem in a different way, by weighing the costs, benefits and methods available to help calibrate the balance of privacy and openness. To do that, we might ask a slightly different and more abstract set of questions:

  • What is the benefit being accomplished by the release of data about individuals?
  • What are the negative consequences that may accompany that release?
  • What are the mitigating techniques that could prevent those negative consequences from occurring?

By answering these questions, we may be able to better identify where to look for problems with the release of individual-level information and, more importantly, where to look for solutions.

For Bush’s emails, the answers are fairly simple:

  • There wasn’t a particularly large benefit, given that it was one person in a unique position (both as governor and as a presidential candidate), and because the release happened once.
  • The negative consequences could be profound for individuals, who became vulnerable to the serious problem of identity theft.
  • In this case, redaction, generally speaking, would have made the most sense.

We can apply this approach to other challenging cases where individuals’ data release causes concern. By applying these same questions to other “edge instances” of transparency, we see an interesting correlation: As we improve data quality, we also achieve better control over sensitive information.

Foreign Agents Registration Act

The Department of Justice maintains many valuable databases, including the Foreign Agents Registration Act (FARA) database. FARA requires people acting as agents of foreign entities to register and submit information, much as American lobbyists must under the Lobbying Disclosure Act. This is tremendously valuable data for evaluating the possibility of foreign influence on U.S. policymaking, and the FARA database is the only source for that data.

Unfortunately, this is a prime example of how the automated publication of great quantities of valuable information can produce mediocre results. The published filings can end up nearly illegible. The government allows agents to submit their records on paper, even though those records are almost always compiled digitally. Even agents who want to submit high-quality, machine-readable data cannot do so: the DOJ accepts attachments (which often include enormous spreadsheets) only in PDF, JPG and TIFF formats. As filed, these are effectively images of documents rather than structured data, which makes them significantly less useful from the user’s perspective.

But what does this terrible data quality have to do with privacy? Everything.

Not only is this information nearly useless without exhaustive additional processing, it’s also dangerous. The information (which, to be clear, the government should be collecting and should almost entirely be publishing) is of such poor quality that the only way to review it for private information is by hand, with fingers crossed that the reviewer catches and redacts everything that should have been excluded. Unfortunately, we know reviewers don’t catch everything. We’ve brought several instances directly to FARA in the past, instances where a bank account number or other personal information was included in the published filing (for obvious reasons, we won’t be linking to any here).

Sunlight has pushed for a variety of policy changes to FARA for years now, sadly to no avail. We’ve made the argument that not only do we need this data to be high quality and machine-readable in order to make best use of it, but the Department of Justice also needs it to be high quality and machine-readable, so that it can adequately protect people’s privacy.

This is a critical point because it runs counter to the idea that making individual-level data more available necessarily increases privacy risks. While improving the availability of individual-level data might increase the risk to individuals’ privacy in some instances, data like FARA’s often has to be released at the individual level and can’t be released in aggregate. Where data must be released at the individual level, one answer is to increase its quality by making it machine-readable. In other words, by ensuring data is of high enough quality to be useful to the public, the government can also better enable itself to be a responsible steward of (very) sensitive information.

With FARA, we see examples of how some familiar transparency-versus-privacy solutions feed into, and are in turn fed by, the cutting-edge challenges we will face as data become more and more open. The questions we asked about benefits, costs and mitigation techniques were answered in the following way:

  • The benefit of making FARA data available as individual-level data is that it enhances the public’s ability to evaluate foreign influence on U.S. policymaking.
  • The negative consequence of FARA data release is that the poor quality of the data makes it inevitable that human reviewers will fail to redact some sensitive information.
  • The mitigating technique most relevant here is to improve the quality of information from the beginning by requiring foreign lobbyists to file their forms electronically, ensuring the integrity of the data and reducing or eliminating the likelihood of sensitive data release. This has the added benefit of creating more transparency.

Form 990 data

A final example summarizes how the concerns about privacy and open data can be answered, in large part, with existing solutions or their analogues.

The annual tax form that nonprofit organizations are required to file with the IRS is called the Form 990. The work to make organizations’ 990 data more available puts useful data about the nation’s nonprofit universe in front of a large variety of audiences. 990s are effectively the only place to acquire critical information about these organizations, their makeup and their budgets. Meanwhile, the 990, which contains large amounts of microdata about organizations and the people who work there, also poses substantial potential problems for privacy, even though the 990 forms were already, in theory, supposed to be public.

Carl Malamud’s dogged efforts to open up organizations’ Form 990s and require that they be made available in a machine-readable, searchable format leapt forward when he won a court case against the IRS. The IRS had been taking 990s that were frequently submitted as machine-readable, searchable data and then releasing them to the public only as TIFF files, the digital equivalent of printing out a text document and publishing a photograph of the page. In short, the court sided with Malamud, telling the IRS that this was an unacceptable practice.

This was a major win. At the same time, it came with heightened risks to privacy. Making this data more available would certainly make the existing human redaction errors more obvious. Human failures to find the social security numbers or bank account numbers that had accidentally been included in the public filings would make organizations liable for substantial privacy violations. And it was obvious that the 990s were full of this sort of data — many files that were released under the existing structure contained extremely sensitive information.

Happily, the court also required the IRS to make a specific modernization: the agency must release the data in XML, a machine-readable, searchable format in which 990s are already sometimes collected. This wasn’t a voluntary decision by the IRS, which fought the change tooth and nail for years. Thanks to the strength of FOIA, however, Malamud was able to convince the courts to compel the IRS’s cooperation. As a result, the transition also forced an internal processing improvement that will vastly improve the government’s ability to protect sensitive information (as Malamud has long argued in a variety of contexts).

Moving the 990s away from TIFF files to XML will allow the IRS to automatically redact far more information, a privacy benefit that can only be realized with improved data quality. Indeed, complementing existing hand review with automated methods makes it far more likely that we get a high-quality, effectively redacted result, one that gives us both the useful data and the appropriate level of control over private information.
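To make that concrete, here is a minimal sketch of what automated review of a structured filing could look like. It is an illustration only, not the IRS’s actual process: the element names in the sample (Return, Filer, BusinessName, Officer, PersonName, Notes) are hypothetical and are not drawn from the real 990 e-file schema, and the single pattern for hyphenated social security numbers is an assumption chosen for brevity. The point is simply that once a filing is XML, a program can visit every piece of text in it, something that is not possible with a scanned image.

```python
import re
import xml.etree.ElementTree as ET
from typing import Tuple

# Assumption: only SSNs written as 123-45-6789 are targeted in this sketch.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical filing; these element names are made up for illustration.
SAMPLE = """<Return>
  <Filer><BusinessName>Example Charity</BusinessName></Filer>
  <Officer>
    <PersonName>Jane Doe</PersonName>
    <Notes>SSN 123-45-6789 on file</Notes>
  </Officer>
</Return>"""


def redact_xml(xml_text: str) -> Tuple[str, int]:
    """Replace SSN-shaped strings in every text node; return (xml, hit count)."""
    root = ET.fromstring(xml_text)
    hits = 0
    for elem in root.iter():
        # An element's own text and the text trailing it are stored separately.
        for attr in ("text", "tail"):
            value = getattr(elem, attr)
            if value:
                new_value, n = SSN_RE.subn("[REDACTED]", value)
                if n:
                    setattr(elem, attr, new_value)
                    hits += n
    return ET.tostring(root, encoding="unicode"), hits


if __name__ == "__main__":
    redacted, count = redact_xml(SAMPLE)
    print(count, "item(s) redacted")
    print(redacted)
```

The hit count matters as much as the redaction itself: a reviewer can be pointed at exactly the filings where something was found, so hand review complements the automated pass rather than replacing it.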

The questions we asked about benefits, costs and mitigation techniques were effectively answered in the following way:

  • Form 990 data provides a large benefit to audiences who are interested in nonprofit transparency, and greater availability of this data powers more analysis and insight.
  • Form 990 data is known to have unacceptably high levels of private information inappropriately included within it, and existing hand-redaction techniques have not eliminated all of it.
  • The mitigating technique most relevant here is to improve the quality of information from the beginning by requiring release in a format that can be combined with existing auto-redaction programs, allowing for much more comprehensive removal of sensitive information.

Mitigation techniques in more detail

I’ve described several ways in which the challenges presented by the above problems are actually opportunities to improve both data and privacy, but it is worth spelling this out briefly. Credit card numbers, email addresses and many other types of sensitive data have recognizable traits. For instance, credit card numbers have a set number of digits and satisfy a checksum algorithm, and email addresses have an “@” sign followed by some characters, a period and then a short domain suffix. Patterns like these can be detected by software, and across troves of open data that detection is better performed by machines than by people. Requiring hand processing of the 990s in particular would likely be prohibitively expensive and, given the sensitivity of the information they contain, would greatly increase the negative consequences of imperfect, human-only redaction.
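As a rough illustration of the traits described above, the sketch below flags email addresses, hyphenated social security numbers and card-like numbers in free text. The card check uses the Luhn checksum, the algorithm that payment card numbers are designed to satisfy. The patterns and the function names (luhn_valid, redact) are assumptions made for this example; a production redaction tool would need broader, carefully tested rules and would still benefit from human review.

```python
import re

# Illustrative patterns only; real-world data needs broader, tested rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Mask card numbers (only if Luhn-valid), SSNs and email addresses."""
    def card_sub(match):
        digits = re.sub(r"\D", "", match.group())
        # Leave long digit runs that fail the checksum untouched.
        return "[REDACTED CARD]" if luhn_valid(digits) else match.group()

    text = CARD_RE.sub(card_sub, text)
    text = SSN_RE.sub("[REDACTED SSN]", text)
    return EMAIL_RE.sub("[REDACTED EMAIL]", text)


if __name__ == "__main__":
    # 4111 1111 1111 1111 is a widely used test number, not a real account.
    print(redact("Write to jane.doe@example.org about card 4111 1111 1111 1111."))
```

The checksum step is what keeps this from being a blunt instrument: a sixteen-digit reference number that fails the Luhn test is left in place, so useful data is not redacted merely because it looks like a card number.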

As explained in connection with FARA, automated redaction techniques require high-quality data, which both empowers the public’s use of the data and better preserves privacy. With the 990s, we see how this can be done en masse. Interestingly, in the case of the 990s, Malamud’s use of public records requests to compel the production of open data forced the development of improved privacy functions: redaction techniques designed for open data that, in the long run, will prove more effective than hand processing, especially for large volumes of information.