A little math could make identifiers a whole lot better

A Springfield DMV billboard from The Simpsons

Last month I wrote about the difficulty of telling government records apart and how autocomplete fields can help. That’s a simple technology that can offer a real, if partial, improvement in our ability to analyze records.

But there are other, slightly fancier techniques that government should consider employing to solve this problem. They’re also considerably more fun to geek out over. Here’s one of them.

There are many cases in which the government can or does collect sensitive information that it can use to tell entities apart, but which it can’t reasonably release to the general public. Tax returns and immigration forms are two examples, but there are plenty of others.

For our purposes, let’s talk about WAVES. If you ever go to a meeting in the White House, you’ll be asked to fill out a WAVES form beforehand. It’ll ask for your first, middle and last names; your date of birth; your Social Security Number (SSN); whether you’re a citizen; your country of birth; your gender; and your current city and state of residence.

This information is used for security screening, but it has also become a transparency measure: You can find (most of) it at Ethics.gov. This can be valuable data. Back in 2010, former Sunlighter Paul Blumenthal used it and other information sources to piece together how President Obama negotiated parts of health care reform with the pharmaceutical industry.

Naturally, the public parts of WAVES don’t display all of the information that the system collects. All that we really get are names. If we’re looking for a distinctive name like “Billy Tauzin,” this is usually no big deal. But if we’re trying to figure out how many times a particular “Nancy K Smith” visited, we run into the familiar disambiguation problem. There are 27 such records in the system as of this writing, and although the government can tell them apart, the public can’t.

Government could use checksums

Obviously it wouldn’t be a good idea to share Ms. Smith’s SSN with the world. But what if we just shared a tiny bit of information? Suppose we added a column to WAVES: “Is SSN Odd?” This would just be full of “YESes” and “NOs,” indicating if the last digit in each visitor’s SSN was an odd or even number.

This would, on average, cut the size of our disambiguation problem in half. In information theory-speak, we’d be emitting one bit of information about the SSN. It’s already easy to see how this would be helpful, but it’s a fairly crude approach. There are a lot of approaches to checksumming, and you can scale the number of bits they emit up or down pretty easily.
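To make the odd-or-even idea concrete, here is a minimal sketch in Python. The function names and the made-up SSNs (the 000 prefix is never issued) are assumptions for illustration, and taking the low bits of the number is just one crude way to scale the scheme beyond a single bit.

```python
from collections import Counter

def is_ssn_odd(ssn: str) -> str:
    """Return "YES" if the last digit of the SSN is odd, otherwise "NO"."""
    last_digit = [c for c in ssn if c.isdigit()][-1]
    return "YES" if int(last_digit) % 2 == 1 else "NO"

def ssn_checksum(ssn: str, bits: int = 1) -> int:
    """Crude checksum: the low `bits` bits of the SSN read as an integer.

    With bits=1 this is exactly the odd-or-even column.
    """
    number = int("".join(c for c in ssn if c.isdigit()))
    return number % (2 ** bits)

def visits_per_public_id(records, bits=1):
    """Count visits per (name, checksum) pair, which is all the public would see."""
    return Counter((name, ssn_checksum(ssn, bits)) for name, ssn in records)

# Hypothetical visitor log: two different people who share a name.
visits = [
    ("Nancy K Smith", "000-00-0001"),
    ("Nancy K Smith", "000-00-0004"),
    ("Nancy K Smith", "000-00-0001"),
]
counts = visits_per_public_id(visits, bits=1)
```

With one bit, the three visits split into two groups. Two different visitors whose SSNs happen to share the same parity would still be lumped together, which is why emitting more bits gives more disambiguating power.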

What’s that? You have an objection to this idea? Go on…

Didn’t I read something about the Mosaic Effect somewhere?

Indeed you did! For those unfamiliar with the term, the basic idea behind the Mosaic Effect is that disparate bits of seemingly innocuous, anonymous information can be combined into a problematic whole. This is a real problem that’s worth taking seriously. Those emitted bits of data — even though they are few — could be used to make guesses that help link to records in other databases.

But it’s important to remember that most Mosaic Effect analyses boil down to combining databases to allow a single individual to be identified. If this happens to your Netflix viewing data, it’s a problem. There’s no reason your activity on Netflix should be public. That’s not part of the deal when you sign up for the service.

But in the case of data that’s released for the sake of transparency — like WAVES — making data personally identifiable is the point of the release.

A distinction needs to be made between personally identifiable information (PII) and sensitive information. These terms are often used interchangeably, but they shouldn’t be. A transparency database should not leak sensitive information. But in most cases, releasing PII is exactly what such systems are supposed to do. The fact that so many of them don’t is a bug, not a feature.

In some cases, the release of PII and the release of sensitive information are the same thing. But the distinction is important, and should be carefully considered by system designers — even when making minor disclosures like the odd-or-even SSN scheme described above.

It’s also worth noting that checksums can bring problems of their own. If a system designer opts to emit a lot of bits (which they might, since more bits mean more disambiguating power), it can become possible for an attacker with sufficient computing power to figure out the original SSN. There are at most a billion nine-digit SSNs, so an attacker can simply checksum every candidate and compare the results against the published value. There are ways of avoiding this, but they rely on government systems choosing values to complicate the checksum process, then managing to keep them secret. That can be asking for trouble.
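One way to read "choosing values to complicate the checksum process" is a keyed hash. The sketch below assumes HMAC-SHA256 with a secret key, truncated to a handful of bits. The function name, the key and the bit count are all hypothetical, and this is an illustration of the general idea rather than a description of any real government system.

```python
import hashlib
import hmac

def keyed_checksum(ssn: str, secret_key: bytes, bits: int = 8) -> int:
    """Return the top `bits` bits of HMAC-SHA256(secret_key, ssn digits).

    Without the secret key, an attacker cannot recompute checksums for
    candidate SSNs, so enumerating all of them no longer works.
    """
    if not 1 <= bits <= 256:
        raise ValueError("bits must be between 1 and 256")
    digits = "".join(c for c in ssn if c.isdigit())
    digest = hmac.new(secret_key, digits.encode("ascii"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)

# Hypothetical key; in practice this is the value that has to stay secret.
agency_key = b"example-key-that-must-never-leak"
public_value = keyed_checksum("000-00-0001", agency_key, bits=8)
```

The same SSN always maps to the same public value, so records can still be told apart, but the scheme is only as safe as the key. That is the "managing to keep them secret" problem in a nutshell.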

Fortunately, there are ways to take advantage of sensitive identifiers’ distinctiveness without revealing information about them.

Government should use tokenization

Checksums can be very easy to implement, and well-designed systems that use them can be perfectly safe. But we can avoid the risk of data leakage entirely by mapping sensitive IDs to new, unrelated values. The folks running WAVES could keep an eye on the incoming SSNs and look them up against a database. If a given SSN hasn’t been seen before, it gets mapped to a made-up number — let’s say that every time Ms. Smith’s SSN arrives, she gets assigned “123” as a less-sensitive ID to use within the system and in public records. This is an example of a technique known as tokenization. It’s sort of like how the Secret Service uses code names.

Even though this is a very simple system, it’s enough to protect Ms. Smith’s information. The only real threat is that the database that maps input IDs to output IDs gets leaked. And, indeed, leaks of that kind are a sadly common problem.
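The lookup-and-assign step described above can be sketched in a few lines of Python. The class name, the in-memory dictionary and the random hex tokens are assumptions for illustration. A real system would need a durable, well-protected store for the mapping.

```python
import secrets

class Tokenizer:
    """Maps sensitive IDs to made-up tokens that carry no information about them."""

    def __init__(self):
        # Sensitive ID -> token. This table is the one thing that must not leak.
        self._token_by_id = {}
        self._issued = set()

    def token_for(self, ssn: str) -> str:
        """Return the existing token for this SSN, or mint a new random one."""
        key = "".join(c for c in ssn if c.isdigit())
        if key not in self._token_by_id:
            token = secrets.token_hex(8)
            while token in self._issued:  # vanishingly unlikely, but cheap to check
                token = secrets.token_hex(8)
            self._token_by_id[key] = token
            self._issued.add(token)
        return self._token_by_id[key]

waves = Tokenizer()
first_visit = waves.token_for("000-00-0001")
second_visit = waves.token_for("000-00-0001")   # same person, same token
someone_else = waves.token_for("000-00-0002")   # different person, different token
```

Because the token is random rather than computed from the SSN, no amount of computing power can work backward from the published value. The only route to the SSN is the mapping table itself.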

That doesn’t mean that the situation is hopeless, though. It just means that sensitive information should be treated carefully. In particular, it should be handled by as few people as possible, rather than being spread out across countless government contractors’ laptops.

Ideally, this would mean getting a tokenized identifier from a central, well-protected source. Do you use your Google, Twitter or Facebook account to sign into other web services and apps? Those OAuth providers go to the trouble of managing your account info, password retrieval workflows, two-factor auth and other security considerations. The smaller web services that connect to them don’t get access to all of your information, just a guarantee of your identity.

There’s no reason that government’s approach to identity can’t work the same way. In fact, you could get even fancier about it.

Government should issue IDs namespaced by agency

This is admittedly getting a little far out there, but it’s an awfully neat idea suggested by Michael Rogers on the liberationtech email list:

At the level of individual records, you could use modular exponentiation to anonymise the data. You pick a prime modulus p, and each organisation that’s going to publish anonymised data picks a random secret value. Organisation X with secret value x anonymises a piece of data d by publishing d_x = d^x mod p, and organisation Y with secret value y anonymises the same data by publishing d_y = d^y mod p.

If X and Y want to know which records they have in common, X takes the data published by Y and calculates d_x’ = (d_y)^x mod p = d^(yx) mod p, and Y takes the data published by X and calculates d_y’ = (d_x)^y mod p = d^(xy) mod p. For each record in common, d_x’ = d_y’, but neither can de-anonymise records published by the other that they don’t have in common.

This can be extended to more than two organisations: pass the records round in a circle, and when they get back to you they’ve been exponentiated by all the secret values (order doesn’t matter). Now you can see which records you have in common with all the other organisations.

Don’t worry if you’re frightened by the math; that feeling is normal. Still, this is super-cool. It does admittedly rely on agencies not leaving their random secret values on employee laptops. But if executed properly, each agency could issue distinct IDs, limiting potential exposure and preventing inter-agency revelation of the source data. And yet each agency could still tell which records should be linked.
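For the curious, the scheme in the quoted email can be sketched with Python’s built-in three-argument pow. The modulus (the Mersenne prime 2^127 - 1), the function names and the sample record numbers are assumptions chosen for convenience, not vetted parameters. Requiring each secret to be coprime to p - 1 is also an addition of this sketch: it guarantees that two different records can never collide.

```python
import math
import secrets

P = 2**127 - 1  # a known prime, used here only for illustration

def random_secret() -> int:
    """Pick a secret exponent in [2, P-2] that is coprime to P-1."""
    while True:
        candidate = secrets.randbelow(P - 3) + 2
        if math.gcd(candidate, P - 1) == 1:
            return candidate

def anonymise(value: int, secret: int) -> int:
    """Compute value^secret mod P."""
    return pow(value, secret, P)

def records_in_common(x_records, y_records, x_secret, y_secret):
    """Return the doubly exponentiated values that both organisations share."""
    x_published = {anonymise(d, x_secret) for d in x_records}  # d^x mod p
    y_published = {anonymise(d, y_secret) for d in y_records}  # d^y mod p
    # Each side raises the other's published values to its own secret.
    x_view = {anonymise(v, x_secret) for v in y_published}     # d^(yx) mod p
    y_view = {anonymise(v, y_secret) for v in x_published}     # d^(xy) mod p
    return x_view & y_view

x, y = random_secret(), random_secret()
shared = records_in_common({1001, 1002, 1003}, {1003, 1004}, x, y)
```

Here the two organisations hold exactly one record in common (1003), so `shared` contains a single value. The published IDs differ between agencies because each uses its own secret.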

Not bad. But we could probably just start with the simple stuff for now.