The point of publishing bulk data is for it to be reused as widely as possible. This is particularly true for government data, which belongs to the public.
Government agencies can also be concerned with ensuring the authenticity of their legal information – especially when the data might be seen as an official source. Authenticity breaks down into two major concerns: integrity (ensuring the text is accurate) and origin (proving it’s official). A lot of people are used to the “wax seal” model of authenticity – the experience of opening a PDF and seeing that the document is signed and official. This model quickly breaks down for distributing bulk data.
The goals of ease of use and authentication are frequently presented as being in tension, but that tension is overstated. There are straightforward approaches to guaranteeing authenticity of bulk data that don’t get in the way of reuse.
In fact, the Government Printing Office currently employs one of these approaches, cryptographic hashes, for every document it publishes on behalf of the United States. In its FDsys system, every document (take H.R. 6289 as an example) has an accompanying “PREMIS” file.
This PREMIS file contains a SHA-256 hash for every version of H.R. 6289 that GPO publishes – plain text, XML, and PDF. After you’ve downloaded any of those files, you can re-calculate the hash, using standard open source tools, to verify that the file is identical to what GPO published. PREMIS is an open standard, hosted at the Library of Congress.
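As a minimal sketch of that re-calculation step, the Python below uses only the standard library's `hashlib`. The function names and the file path are hypothetical; the published digest is assumed to have been copied out of the PREMIS file as a hex string.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """True if the file's digest equals the published one (case-insensitive)."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

Calling `matches_published_hash("downloaded-bill.xml", published_hex)` (a hypothetical file name) returns `True` only if the local copy is byte-for-byte identical to the file the digest was computed from.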
GPO described their approach in June of 2011, saying that data integrity should not get in the way of reuse:
“The publication of the cryptographic hash values in the PREMIS metadata file, and the way FDsys structures its public URLs, makes it possible for machines to crawl and use this information to determine content integrity in bulk…
GPO recognizes the importance of ensuring that any content integrity verification method for XML content, such as digital signatures, should be structured so as not to interfere with data re-use or re-purposing. GPO is also committed to the principle of employing open, internationally recognized standards whenever possible.”
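The kind of machine checking GPO describes might look roughly like the sketch below. It relies on the `fixity`, `messageDigestAlgorithm`, and `messageDigest` element names from the PREMIS data dictionary, but the exact layout of GPO's files is an assumption here, so the code ignores namespaces and nesting and simply looks for fixity blocks anywhere in the document. The function names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def sha256_digests(premis_xml):
    """Collect the SHA-256 messageDigest values found in fixity blocks."""
    digests = []
    for element in ET.fromstring(premis_xml).iter():
        if local_name(element.tag) != "fixity":
            continue
        # Map each direct child's local name to its trimmed text.
        fields = {local_name(child.tag): (child.text or "").strip()
                  for child in element}
        algorithm = fields.get("messageDigestAlgorithm", "")
        # Accept both "SHA-256" and "SHA256" spellings (an assumption).
        if algorithm.upper().replace("-", "") == "SHA256":
            digests.append(fields.get("messageDigest", "").lower())
    return digests


def is_published(content, premis_xml):
    """True if the SHA-256 of `content` appears among the listed digests."""
    return hashlib.sha256(content).hexdigest() in sha256_digests(premis_xml)
```

A crawler could run `is_published` over every downloaded rendition and flag any file whose digest is missing from its metadata.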
In December, California’s Office of Legislative Counsel wrote a report on authentication documenting several approaches, from a hash-based approach to a range of proprietary solutions. Hashes are, not surprisingly, a vastly cheaper solution.
As for guaranteeing that the hashes themselves are legitimate, the OLC presents another simple, cheap solution – using SSL:
“The primary limitation of hashes is that, by themselves, they do not authenticate the origin of the document…However, hashes can be used in combination with a secure Web site to authenticate documents. For instance, the hash for a document can be posted on a secure Web site, and consumers of the document can verify that the hash from the Web site matches the hash computed directly from the document.”
This is quite true, although verifying origin like this is only necessary if you’re concerned about someone pretending to be the owner of the document. It’s difficult to see this as a concern for all but the most security-sensitive government materials. GPO doesn’t use SSL for hosting anything on FDsys, apparently also not seeing it as an issue at present. (However, PREMIS supports verifying origin, by including a version of the hash signed with the publisher’s private key, which anyone can check against the corresponding public key.)
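The OLC's suggestion of posting hashes on a secure site can be sketched in a few lines. This is an illustration under stated assumptions, not anything GPO or the OLC publishes: the hash is assumed to be served as a plain-text file containing only a hex SHA-256 digest, the function names are hypothetical, and the `opener` parameter exists only so the download step can be swapped out. In current Python 3, `urllib.request.urlopen` validates the server's certificate by default, which is what ties the hash to its origin.

```python
import hashlib
import urllib.request
from urllib.parse import urlsplit


def fetch_published_hash(url, opener=urllib.request.urlopen):
    """Download a hex digest, refusing any URL that is not HTTPS."""
    if urlsplit(url).scheme.lower() != "https":
        raise ValueError("hash must be fetched over HTTPS: " + url)
    with opener(url) as response:
        # Assumes the response body is just the hex digest as ASCII text.
        return response.read().decode("ascii").strip().lower()


def verify_document(document_bytes, hash_url, opener=urllib.request.urlopen):
    """True if the document's SHA-256 matches the digest served at `hash_url`."""
    local = hashlib.sha256(document_bytes).hexdigest()
    return local == fetch_published_hash(hash_url, opener)
```

The document itself can come from anywhere, including a mirror or a bulk download over plain HTTP; only the hash needs to travel over the secure channel.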
The OLC also acknowledges another possibility: that an agency could legitimately decide that the whole issue is moot.
“The validation problem could be simplified if XML validation by the general public is determined to be unnecessary. Large document consumers that desire authenticated XML documents could be required to implement their own validation solutions.”
What’s clear in all of this is that authenticity can be simple, inexpensive, and optional, for both sides. Government bodies publishing bulk data that feel it’s important to guarantee authenticity can provide hashes and, where origin matters, signatures. Consumers who don’t care about authenticity can ignore them, and those who do can easily verify them. Everyone can win.
Edit: Replaced “signature” with “hash” where appropriate, and clarified that PREMIS does support origin verification.