Analyzing the state of cities’ bulk data

A map of U.S. cities and states with open data policies, as of July 2016. (Image credit: Sunlight Foundation)

We at the Sunlight Foundation are very interested in whether or not cities have open data — but what do we even mean when we say that data is open or not? Often, Sunlight policy experts review what governments make available and then assign rankings along a series of indicators. Sunlight’s Open States Report Card, for instance, checks whether legislative data are permanent, thorough and updated once or multiple times a day; the U.S. States Open Data Census considers whether state datasets are copyrighted, verifiable and in a state repository; and the Open Data Map simply determines whether or not an open data policy is in place.

The US City Open Data Census collects information about an array of datasets important for local-level governance. Beneath the scores the census gives each city, and beneath its evaluations of transit or zoning data, lies a treasure trove of information about specific factors that help make data more “open” or accessible, such as whether datasets are provided in bulk, available in machine-readable formats and freely licensed. Although many cities on the census digitize official documents and put them online, looking at the availability of bulk data downloads could help show which cities are more thoroughly engaged with opening their datasets. Here are some of the results from the census:

City name        Percent of key datasets available in bulk*    City name           Percent of key datasets available in bulk*
Los Angeles      94.7                                          St. Louis           36.8
New York         94.7                                          Tulsa               36.8
Hartford         91.7                                          Anchorage           36.8
San Francisco    89.5                                          Washington, D.C.    36.8
Chicago          88.9                                          Detroit             35.7
Santa Monica     86.7                                          Atlanta             33.3

* Only counts datasets that have received census submissions, and excludes cities with few census submissions.

On the left are the six cities in the census that do best at providing bulk data access. Aside from including America’s three largest cities, this list features places with long and well-recognized track records with open data: Hartford, a smaller city with strong open data policies; San Francisco, an early open data champion; and Santa Monica, next door to large and open data-friendly Los Angeles. While these cities could sometimes do better with their open data implementation, they clearly strive to make sure their datasets are easily accessible, and go beyond simply putting datasets online.

What about the cities on the right? Their record with bulk data is much spottier, but otherwise they may be doing a decent job of releasing datasets. Atlanta, in fact, has a higher overall census score than Hartford. Why, then, do these cities not seem to fare as well at providing data in bulk?

A graph comparing cities’ population size and bulk data access. (Image credit: Sunlight Foundation)

Overall open data indicators frequently correlate with population. In other words, whether one looks at the overall scores on the US City Open Data Census or at expert evaluations of a city’s open data program, cities with larger populations tend to do better at implementing open data. However, the correlation between population and bulk data access is even stronger, and it persists even when the census’ largest and smallest cities are excluded (see graph). In addition, as one might expect, cities that make their data available in bulk also tend to make their data available in machine-readable formats and with an open license. The correlations between bulk access, machine-readability and open licenses far surpass the correlations between, say, bulk access and simply putting datasets online, or between open licensing and keeping datasets up to date. If a city provides data in bulk, then chances are that it also does very well at making data open in many other ways.
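
For readers who want to run a similar comparison on their own extract of census data, here is a minimal sketch in Python. It assumes a Pearson correlation, since this post does not say which measure was used, and it assumes each city is a record with `population` and `bulk_pct` fields. Those field names, the helper functions and the five sample rows are all hypothetical placeholders, not census figures.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def drop_extremes(cities, k=1):
    """Return the cities without the k largest and k smallest by population."""
    ranked = sorted(cities, key=lambda c: c["population"])
    return ranked[k:len(ranked) - k]


def bulk_vs_population(cities):
    """Correlate population with the share of key datasets offered in bulk."""
    populations = [c["population"] for c in cities]
    bulk_shares = [c["bulk_pct"] for c in cities]
    return pearson(populations, bulk_shares)


# Hypothetical placeholder rows for illustration only, NOT census data.
cities = [
    {"name": "City A", "population": 50_000, "bulk_pct": 20.0},
    {"name": "City B", "population": 200_000, "bulk_pct": 35.0},
    {"name": "City C", "population": 600_000, "bulk_pct": 50.0},
    {"name": "City D", "population": 1_500_000, "bulk_pct": 80.0},
    {"name": "City E", "population": 4_000_000, "bulk_pct": 90.0},
]

r_all = bulk_vs_population(cities)
r_trimmed = bulk_vs_population(drop_extremes(cities))
print(f"all cities: r = {r_all:.2f}")
print(f"without largest and smallest: r = {r_trimmed:.2f}")
```

Comparing the two printed values mirrors the robustness check described above: if the relationship holds only because of a few very large or very small cities, the trimmed coefficient should drop noticeably.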

Small population sizes don’t prevent cities from having strong open data programs, as places like Hartford and Asheville indicate. Large populations, however, may give cities an extra push, not only to create an open data initiative but to create one well. That could explain why bulk access to data is even more closely linked to population size than generic open data indicators, and why more complicated facets of open data, like bulk access, machine-readability and open licensing, tend to appear in the same cities. Though cities might not implement important components of open data one by one, enough public enthusiasm for open data might motivate a city to implement a series of open data components all at once, or prompt a city to enact policies that call for specific open data measures. (See the chart below for sample policy language on Sunlight’s bulk data guideline.)

Only 12 cities address the question of releasing open data in bulk, making it one of the less widely adopted best practices from Sunlight’s Open Data Policy Guidelines.

What other factors could influence the implementation of bulk data access? It is possible that cities learn from and compete with a select group of “peers” with similar sizes or locations. Many of the first cities to develop open data programs were large metropolises like New York and Los Angeles; other large or nearby “peer” cities may have been faster to see what the early adopters accomplished and then catch on. Another possibility is that, while many cities face calls for greater openness and transparency, cities with particularly active tech communities are more oriented toward the finer details of open data, and so present machine-readable bulk data rather than many separate pages of scanned PDFs.

Although this overview of the state of bulk data is only a start, we hope that closer analyses of the data behind open data will help show how and why some cities have effective open data programs, and what challenges other cities might have to overcome.