Open Data Policies and Implementation: Frequently Asked Questions

At the Sunlight Foundation, we believe access to government information and decision-making processes is a fundamental democratic principle. Open data is one important way of achieving that access. Here, we explore some of the commonly asked questions about open data policies.

Frequently asked questions:

What kind of archival materials should be digitized?

Think about what kinds of materials would add context to the data being released and what kinds of documents are important to the public. This is a great place to draw on public input: does the community want online access to a digital archive of city council meeting minutes? Old city photos digitized and shared online? The same prioritization process used for data release can also be applied to deciding what materials to digitize. See this blog post, or the prioritization question below, for more on that process.

What data should we prioritize for release?

Simply releasing whatever is cheapest or easiest is not the best approach to open data release. If a jurisdiction envisions its data release occurring gradually over an extended period of time, its government should think seriously about data prioritization: consciously deciding which data to release first. To be sure that data release consistently supports the goals of the jurisdiction's open data policy, governments should use those goals as the basis for prioritizing dataset release.

Transparency goals are served by releasing information that provides insight into critical government decision-making processes, such as data used in the creation of government policies or data that reveals the details of government revenues, spending and contracting. The goal of making government more accessible is achieved by providing information in which internal and external stakeholders have already demonstrated an interest through formal and informal requests. See more about developing an approach to data prioritization in this blog post on the topic.

How do we safeguard sensitive information?

Safeguarding sensitive information is an important, if challenging, aspect of proactive data release. It can only be done appropriately with a balance test that asks whether the potential harm from releasing the information outweighs the public interest in accessing it. Without a process that demonstrates how the public interest was weighed, any sensitive information being withheld should be subject to scrutiny as to why it is not being released. Selective redaction is another process that can be used to protect sensitive information while still releasing as much data as possible. Rather than withholding an entire dataset because of one problematic element (or even a few), the sensitive information should be redacted and the rest of the dataset released. See this blog post for more information about appropriately safeguarding sensitive information with a balance test. See also below on limiting liability.
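As a minimal sketch of what selective redaction can look like in practice, the snippet below masks sensitive columns in a CSV before release. The file and column names are hypothetical, and the script is an illustration rather than a prescribed tool:

```python
import csv

# Columns flagged as sensitive during legal review (illustrative names).
SENSITIVE_COLUMNS = {"ssn", "home_address"}

def redact(in_path, out_path):
    """Copy a CSV, masking sensitive columns so the rest can be released."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for col in SENSITIVE_COLUMNS & set(row):
                row[col] = "REDACTED"
            writer.writerow(row)

redact("permits_raw.csv", "permits_public.csv")
```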

How do we ensure the integrity of the data?

A government can best ensure the integrity of its published data by getting quality right at the point of publication. Creating quality control mechanisms, including regular opportunities to review, correct and incorporate public feedback about data quality, allows a government to ensure that the data it publishes is likely to be accurate. Good quality data is as complete and timely as possible, so publication guidelines and schedules also affect the quality of published data. The principles of open data require making published datasets as available as possible for reuse, so a government effectively can't ensure the integrity of data once it leaves the government's site. However, it can limit liability for the data and suggest a preferred form of citation, allowing individuals who reuse the data in an app to direct their end users back to the original source on the government website.
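As one illustration of a quality control mechanism, a small script can tally obvious problems, such as missing values or malformed dates, before a dataset is published. This is a minimal sketch; the file name and the "date" column are assumptions to be adapted to a real schema:

```python
import csv
from datetime import datetime

def quality_report(path):
    """Tally basic quality problems before publication (illustrative checks)."""
    report = {"rows": 0, "missing_values": 0, "bad_dates": 0}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            report["rows"] += 1
            report["missing_values"] += sum(
                1 for v in row.values() if not (v or "").strip()
            )
            try:
                # "date" is a hypothetical column; adapt to the real schema.
                datetime.strptime(row["date"], "%Y-%m-%d")
            except (KeyError, TypeError, ValueError):
                report["bad_dates"] += 1
    return report

print(quality_report("spending.csv"))
```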

What are some of the ways to limit liability?

Governments can limit their potential liability by defining “data to be released” as referring only to information that’s under the authority of their jurisdiction and as not including information otherwise protected by law, including local right-to-know law exemptions, privacy, security, and accessibility laws and otherwise legally privileged information.

Governments should also ensure that their released data complies with applicable regulations and accounts for the following concerns:

  • Americans with Disabilities Act (ADA) challenges and Section 508 (an amendment to the Rehabilitation Act of 1973): The opening of government data should assist in making more information available to screen-reading software, but accommodations, such as alternative text describing graphics or alternative formats, should be made for any images or other non-text elements.
  • Health Insurance Portability and Accountability Act (“HIPAA”) / Family Educational Rights and Privacy Act of 1974 (“FERPA”): HIPAA/FERPA have very exacting requirements for determining whether data have been sufficiently de-identified so as not to compromise individual privacy. Agencies that deal with health and student educational data should carefully evaluate the limitations they place on data publication.
  • Mosaic Effect: Even in the absence of specific legal prohibitions, government entities should be aware that publication may result in personally identifiable information (PII) being unintentionally disclosed. Datasets that do not individually reveal PII may do so in combination with other published datasets; one way to screen for this risk is sketched after this list.
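One way to reason about mosaic-effect risk, not named in the policies above but common in privacy practice, is a k-anonymity style check: does any combination of quasi-identifying columns describe fewer than k records? The sketch below assumes hypothetical column and file names and is a starting point rather than a complete privacy review:

```python
import csv
from collections import Counter

# Quasi-identifiers: fields harmless alone but identifying in combination.
QUASI_IDENTIFIERS = ["zip_code", "birth_year", "gender"]  # illustrative names

def risky_groups(path, k=5):
    """Return quasi-identifier combinations shared by fewer than k records."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[tuple(row.get(col, "") for col in QUASI_IDENTIFIERS)] += 1
    return [combo for combo, n in counts.items() if n < k]

# Combinations returned here may single out individuals when joined with
# other published datasets, and deserve review before release.
print(risky_groups("survey.csv"))
```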

Disclaimers can be added to the language of the policy itself, and should also be included in the data's Terms of Use. Disclaimers can include exclusions of express or implied warranties, limitations of responsibility for consequential damages, and indemnity clauses. Ideally these are not overbroad, include a right to access (save for narrowly defined emergencies), and are coupled with a policy that also has a strong process to ensure data quality.

Having multiple opportunities to review data helps catch errors before publication. For example, by sorting data during an inventory process into data that is releasable as is, data that is unreleasable (with a legal citation as to why), and data that must be cleaned before release, governments add another helpful layer of review. In the same vein, governments can build legal checks into release procedures. For example, New York state requires that Data Coordinators get explicit approval from Legal Counsel (via a signature) before publishing a dataset.

For public data relevant to a jurisdiction but collected by third parties, consider adding provisions to third-party contracts requiring that the data collected be distributed under the terms of the jurisdiction's open data law. This both increases the amount of data a government can release and obliges contractors to follow the same good practices the government creates for its own data release processes.

What are good terms of use for the data?

Terms of Use that support truly open government data can include disclaimers of warranty and limitations of liability when needed, but should include:

  • No cost or registration requirement.
  • No restriction on use.
  • No license restrictions. All open government data should be available as public domain information. If some government data is currently controlled by a vendor license that has yet to be resolved and opened, add a note stating that all datasets are available in the public domain and without restriction unless otherwise noted, and flag those datasets.
  • No attribution requirement. Citation can be recommended as a best practice, along with a note about the attribution required for any government logos or seals.

How do we address licensing issues?

For information to be truly public, and maximally re-usable, there should be no license-related barrier to the reuse of public information. To be completely “open,” public government information should be released completely into the worldwide public domain and clearly labeled as such. If the government data in question is not explicitly in the worldwide public domain, it should be given an explicit public domain dedication, such as the Creative Commons CC0 statement or an Open Data Commons Public Domain Dedication and License (PDDL) — both of which combine a waiver and a license.

What portal should we use?

To make data easy to find, data portals should permit indexing and searching by third parties such as search engines. Several helpful features should be included in general or specific portals. One necessary feature is a list of the data the portal contains, which makes it easy for users to quickly see what kinds of information are available; if appropriate, this could be done through a link to a data inventory. Another beneficial feature is a view of analytics on data downloads, which helps users and government data providers understand which datasets are of the highest interest.

Here is a growing list of open data portals throughout the United States.

Here are some tools that facilitate hosting open data in a number of ways:

  • Linked ad hoc on local websites in open formats like CSV
  • On GitHub
  • Via Google Fusion Tables
  • Or on an open data portal, such as:

    Open Source Solutions

    • CKAN, the “Comprehensive Knowledge Archive Network,” open source software created by Open Knowledge
    • DKAN, a Drupal-based implementation of CKAN
    • Open Data Catalog, open source software created by Azavea

    Free w/ Previous Software Subscriptions

    Cloud-based SaaS Subscription Services

    • Junar, a cloud-based SaaS data portal, paid for on a competitively priced subscription basis
    • NuCivic Data Enterprise, a competitively priced OpenSaaS platform based on DKAN
    • OpenData.city, a competitively priced OpenSaaS platform based on CKAN
    • Socrata, a cloud-based SaaS data portal, paid for on a competitively priced subscription basis

    Other Community Resources
How should we transfer data from our systems to a portal?

There are several resources that address the extract, transform and load (ETL) process, which is, quite simply, the process of moving data from one place to another. One resource is the chapter about ETL from “Beyond Transparency” written by former Chicago Chief Data Officer Brett Goldstein. Chicago’s process is also discussed step by step here. The website Simple Open Data broadly explores steps to open data. There’s another account of ETL here by Dave Guarino, and there’s a list of ETL resources here. Sunlight’s Bob Lannon has a look at the challenges and opportunities of ETL, too.
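To make the pattern concrete, here is a minimal ETL sketch in Python. It extracts rows from an internal database (SQLite stands in for whatever system actually holds the records), transforms them into a publishable shape, and loads the result to a portal upload endpoint. The table, fields and URL are all placeholders, not any particular portal's API:

```python
import csv
import sqlite3
import urllib.request

# Extract: pull rows from an internal system (SQLite stands in for it here).
conn = sqlite3.connect("internal.db")
rows = conn.execute("SELECT permit_id, issued_date, status FROM permits").fetchall()

# Transform: normalize values into the shape the public dataset should take.
cleaned = [(pid, str(date)[:10], str(status).upper()) for pid, date, status in rows]

# Load: write a CSV and push it to the portal's upload endpoint.
with open("permits.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["permit_id", "issued_date", "status"])
    writer.writerows(cleaned)

with open("permits.csv", "rb") as f:
    request = urllib.request.Request(
        "https://data.example.gov/upload",  # placeholder; portals differ
        data=f.read(),
        headers={"Content-Type": "text/csv"},
        method="POST",
    )
urllib.request.urlopen(request)
```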

Do we need APIs?

Although bulk data provides the most basic means of searching and retrieving government data, government bodies can also develop APIs, or Application Programming Interfaces, that allow third parties to automatically search, retrieve or submit information directly from databases online. Navigating requirements for bulk data and APIs should be done in consultation with people with technical expertise as well as with likely users of the information. Tools such as CSV to API, Database to API and API Sandbox may be helpful as examples for bootstrapping APIs and getting them online quickly.
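To illustrate the idea behind tools like CSV to API (without reproducing how any of those tools actually work), the sketch below uses Flask to serve rows of a published CSV as JSON. File, field and route names are hypothetical:

```python
import csv
from flask import Flask, jsonify

app = Flask(__name__)

def load_rows(path="permits.csv"):
    """Re-read the published CSV on each request (fine for small files)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

@app.route("/api/permits")
def list_permits():
    # A real API would add paging and filtering; this returns every row.
    return jsonify(load_rows())

@app.route("/api/permits/<permit_id>")
def get_permit(permit_id):
    return jsonify([r for r in load_rows() if r.get("permit_id") == permit_id])

if __name__ == "__main__":
    app.run()
```

Rereading the CSV on each request keeps the sketch simple; a production API would cache the data and paginate responses.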

How should we engage the public in the process?

There are many ways to engage the public. One of the most important ways to engage the community is by asking what data would be useful to them. Including the public in the prioritization process for data release helps ensure the information being shared will be used. Public participation shouldn’t stop there, however. The public should also be engaged with the data once it is released. There will still be important questions to ask the public. Are they happy with the completeness, timeliness and quality of the data being released? What other information would be useful to them? How can the data that is being released be useful in decision-making and engagement processes? There are already many apps that make government data useful to people by allowing them to engage in a more informed or direct manner. These kinds of options should be explored to foster better engagement.

How much does it cost?

There has been a range of budgeted numbers attached to legislation for funding open data efforts. Generally, anywhere from $0 (if no one is hired and no additional money is needed for technology) to $500,000 (for hiring new staff and adopting new technology) may be budgeted for city, state, or federal open data initiatives. Funding should be considered for, but not limited to, the potential need for new staff (administrative, technical and legal), new software (to house, extract and input data), training, and server maintenance. It will be important for each jurisdiction to consider what it needs to support a strong open data ecosystem.

For commentary on differences between CKAN and Socrata see this discussion and for insight into Socrata’s pricing model see this research.

How should we structure oversight?

Data management should be overseen by a clearly delineated authority structure, with those close to the creation of the data taking part in the process. Almost all of the formal local open data policies on the books in the U.S. to date (including open data administrative memos, executive orders and laws) establish an authority structure for oversight of the open data policy. These structures either empower existing staff, such as a Chief Information Officer, City Manager or IT department, or involve hiring new staff specifically tasked with implementing the policy, such as a Chief Data Officer. Managing data release requires project management skill as well as data literacy.

Will we need to hire more staff to do this? What if we get more information requests about the data we put out there?

Implementing an open data policy does not necessarily require more staff. Responsibility can be distributed among departmental coordinators who meet regularly, for example, to reduce the burden of oversight. This can also help with cross-departmental coordination and buy-in to the open data efforts. It is possible that more information requests could come initially from sharing data online, as people inquire with specific departments about what they see in datasets. Generally, proactive release of information online — especially if it’s focused on sharing commonly requested datasets — can help reduce the number of information requests that governments receive. Ongoing studies and benchmarking should be able to provide insight soon about the impacts of open data release on the volume of information requests going to government staff.

How do we address records retention schedules?

Records retention schedules can, and should, be integrated thoughtfully into efforts to release open data. Where destruction of records was driven by storage limits, those schedules should be considered for revision: digital storage is now inexpensive, so space and cost are no longer a barrier to keeping many records. Where destruction schedules must stand for other reasons, clerks, records managers, archivists, or other appropriate equivalents should be consulted about how to ensure open data release does not conflict with records retention laws.

Where can I find examples of open data plans and progress reports?

Here are a few examples of plans, progress reports and implementation guidelines from cities with open data policies: