Don’t attribute open data — cite it!

Hand holding a sign that reads "Citation Needed" within square brackets
Photo credit: Wikimedia Commons

In discussions about access to government data, we should distinguish between attribution and citation: government-imposed attribution requirements are inappropriate restrictions on reuse and unnecessary barriers to access, while citation guidelines can help add reliable value to data.

Sunlight’s Open Data Policy Guidelines spring from a long-running collaborative effort to make government data more open. The concept of openness entails not limiting access, use or reuse. As part of that effort, we’ve continued to revise the guidelines, adding or clarifying provisions to strip away the obstacles to information access that we’ve observed.

An important question that we’ve found a need to address is the legitimacy of attribution requirements. The question arises in particular because the heritage of the developing concept of online “openness” includes an accepted place for the attribution requirement. The definition of “openness” that was developed by the Open Knowledge Foundation in 2005 includes the caveat that free use, reuse and redistribution can be made subject “to the requirement to attribute and sharealike.” The OKF definition, however, addressed all producers of open property, without reference to the nature of the producer. Private producers of intellectual property may have additional reasons to claim ownership of that property; government producers, meanwhile, are able to produce information only because they’re doing so on behalf of the public. When we think about government producers of open data, is it legitimate for them also to claim that they can condition data availability on attribution requirements?

We find that attribution requirements place an unnecessary barrier between the public and its data. Attribution requirements can be used as a way to track and potentially censor users of public data. Including an attribution requirement is also a way to expressly hold the threat of a lawsuit over data users. Sunlight has joined with other advocates in the open government data movement to call for government data to be made available worldwide without license, or using a tool like CC0 in cases where a license type must be affixed. This means that governments should not create attribution requirements for their data, since such a requirement, even in its ostensibly minimal form, violates the principle of unrestricted information access.

On the other hand, there is a strong value to allowing people to understand where their data comes from. As one commenter responded to the announcement of our CC0 advocacy, “Academically it is best practice – even required – to link back to the primary source to allow a consumer to validate any perspective or repeat any experiment.”

Happily, an attribution requirement is not the only way to achieve that best practice. Rather, we might consider the existing conversation within the sciences about data users’ need for a common method of identifying the source of information used in scholarship and analysis. Researchers depend heavily on the work of their predecessors. They need strong and reliable ways to describe the information they draw on in order to produce high-quality work, since an implication of high-quality work is that it can be evaluated through replication or validation. An ability to describe the provenance of each particular piece of information included within a research project is essential to establishing the trustworthiness of the project in question.

Instead of viewing this identification of sources from the perspective of attribution, however, the scientific research community primarily describes it in terms of citation.

The difference between citation and attribution lies in the difference between assigning credit for information and making reference to it. Attribution is distinguishable from citation because it represents a legal requirement, premised on the assumed power (whether of creation or some other form of ownership) of the attributed party over the information that’s being shared. Citation, meanwhile, is intended to clarify specific qualities of the information so that end users can find out more about that information’s context, development, and quality. While attribution helps describe who has power over a piece of information, citation helps to make the information itself more powerful by making it more useful. (It must be said that the difference between these concepts is not clear in every professional field; in journalism, for example, the terms are used interchangeably.)

In order to ensure the best possible quality of data – and best possibilities for data use – our updated provisions do not suggest attribution, but instead contain a new recommendation that government data managers develop a recommended citation form for their data sets. In reviewing existing practice, we’ve found that a number of government data holders already follow this recommendation.

The practice of providing a recommended citation form is particularly prevalent across the public sphere of physical and biological scientific data. Specific projects of the National Institutes of Health, NASA’s data archive center at Oak Ridge National Laboratory, and the National Cancer Institute represent just a few examples of how citation recommendation works in practice. Data used in social scientific research is also often well supported through citation recommendations, including for data from the US Census and the Bureau of Labor Statistics.

Government sites express the non-obligatory nature of the citation recommendation in a number of ways. Some include their recommended citation form within an FAQ in response to the question of how a researcher should cite the data. The Bureau of Economic Analysis notes that “citations are appreciated and appropriate.” The US Geological Survey asks that credit be given, but in a respectful way that does not bludgeon the information user with implied legal action if the user decides not to do so. (The USGS also explicitly points out that most of its work is in the public domain and can be used without restriction.)

The desire to describe data provenance should not involve a legal requirement that will hinder the freest use of that data. By identifying the best form of citation for their data, government data managers position themselves in a helpful way, demonstrating themselves to be experts on and guides to their data. The best role government can take in the opening of its data is to ensure that it enables the best possible quality of research. That role is far superior to jealously and inappropriately claiming legal ownership rights over our public data.