Improvements Needed For High Value Datasets On Data.gov

This morning a number of organizations, including POGO, OMB Watch, CREW, the National Security Archive, the Center for Democracy and Technology, and the Open The Government coalition, joined Sunlight in sending a letter to Vivek Kundra, the Federal CIO, about improvements needed to the release of High Value Datasets on Data.gov. The core recommendations are reproduced below. Please tell us what you think in the comments.

As advocates for government openness, we support the Administration’s efforts to provide the public with access to information through Data.gov. We are eager to work with you to ensure the success of Data.gov and, in that spirit, write to raise our concerns with the datasets submitted by agencies to fulfill their requirement under the Open Government Directive to post three high value datasets by January 22, and to offer constructive suggestions for improving their usefulness.

As an overall recommendation, we urge you to add public representatives to the Open Government Initiative interagency working committee and ask the committee to address the problems and recommendations identified below.

Release Format and Usability by the Public

We understand one of the primary purposes of Data.gov is to enable the technology community and transparency advocates to most effectively use the data to make a direct impact on the daily lives of the American people. The format of the data plays a key role in its usability; many within the community of advocates who re-use and repackage government data would prefer data in CSV format, rather than the XML format in which many of the posted databases are provided. Accordingly, we recommend that you strike an appropriate balance between formats (such as XML) that serve the coding community and web-based presentations by agencies that can be used and understood by the general public.
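To make the format question concrete, here is a minimal, hypothetical sketch (the agencies, titles, and field names are invented for illustration and are not drawn from Data.gov) of the same records expressed as nested XML and as a flat CSV table that a re-user could open directly in a spreadsheet:

```python
# Hypothetical illustration: the same two records as nested XML and as flat CSV.
# The agencies, titles, and field names are invented for this sketch.
import csv
import io
import xml.etree.ElementTree as ET

xml_doc = """<datasets>
  <dataset><agency>DOT</agency><title>Car Seat Ratings</title><records>1200</records></dataset>
  <dataset><agency>FEMA</agency><title>Hazard Mitigation Grants</title><records>54000</records></dataset>
</datasets>"""

# Reading the XML means walking a tree of nested elements...
rows = [
    {child.tag: child.text for child in dataset}
    for dataset in ET.fromstring(xml_doc)
]

# ...while the CSV equivalent is a flat table with one header row.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["agency", "title", "records"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

The CSV output is a single table that opens directly in a spreadsheet, which is one reason many re-users find it the easier starting point.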

In addition, some of the currently posted files are quite large, ranging up to several hundred megabytes. Their size undermines their usefulness for most people and organizations. The large number of currently posted datasets also makes it difficult to find a particular database of interest. We therefore recommend that if a Data.gov dataset is available from an agency through a web-based interface, Data.gov link to that interface on the dataset’s landing page. For a consumer looking for information on a car seat, for example, it would be far easier to search the Department of Transportation’s online database than to scroll through screen after screen of raw data in XML format. Additionally, as agencies continue to post datasets to Data.gov, efforts should be made to identify those of greatest public interest that lack such interfaces and to develop web interfaces that allow the data to be explored online.

Further, while we agree there is value in aggregating government data in a single site, it is questionable how much the collocation of the currently posted information on Data.gov actually benefits the public. The site is not searchable by topic and does not provide any way to bring together data from different sources on similar topics.

As an enhancement to the organization of the site, we recommend that you use tagging or metadata to enable the public to bring together information on a topic. The thesaurus that USA.gov uses provides a useful example of the needed vocabulary.
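As a rough sketch of how tagging against a shared vocabulary could let the public pull together related datasets (the vocabulary terms, dataset titles, and tags below are invented for illustration, not an actual Data.gov schema):

```python
# Hypothetical sketch of topic tagging with a controlled vocabulary.
# The vocabulary terms, dataset titles, and tags are invented for illustration.
CONTROLLED_VOCABULARY = {"consumer safety", "disaster assistance", "federal spending"}

catalog = [
    {"title": "Car Seat Ratings", "agency": "DOT", "tags": {"consumer safety"}},
    {"title": "Hazard Mitigation Grants", "agency": "FEMA",
     "tags": {"disaster assistance", "federal spending"}},
]

def datasets_for_topic(topic: str):
    """Return every catalog entry tagged with the given vocabulary term."""
    assert topic in CONTROLLED_VOCABULARY, "topic must come from the shared vocabulary"
    return [entry for entry in catalog if topic in entry["tags"]]

print([d["title"] for d in datasets_for_topic("federal spending")])
```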

Value of Data

The release of the datasets also has prompted discussions about the value and the quality of the released data, and the additional value provided by access to existing data in a new format. We believe repackaging old information is of marginal value, yet that is what many agencies have done with their recent postings on Data.gov. According to the Sunlight Foundation, of 58 datasets posted by major agencies, only 16 were previously unavailable in some format online. This leaves the impression that agencies posted easily available data, the proverbial low-hanging fruit, rather than seriously considering which of their datasets truly are of high value. While these initial postings can be considered a test run, more attention needs to be directed toward ensuring the overall quality and usefulness of the data.

In addition, sustained attention should be paid to the possibility of making some of the datasets available as feeds that are constantly up to date, rather than as static datasets that are pulled down and then reposted on an occasional basis.

We recommend that agencies be required to explain why the data is high value by having them designate which of the “high value criteria” the data meets: information that can be used to increase agency accountability and responsiveness; improve public knowledge of the agency and its operations; further the core mission of the agency; create economic opportunity; or respond to need and demand as identified through public consultation. Similarly, we recommend requiring agencies to indicate whether a high value dataset was previously unavailable, available only with a FOIA request, available only for purchase, or available, but in a less user-friendly format. Going forward, this will make it much easier to track how agencies are complying with the other requirements of the Open Government Directive.

While we appreciate the value of data that furthers the mission of an agency, we believe it is equally important to make available to the public data that holds an agency accountable for its policy and spending decisions. We hope to see more datasets of this type available in the near future.
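As a hedged illustration of the feed approach recommended above (the URL below is a placeholder and the Atom feed is an assumption, not an existing Data.gov service), a consumer could poll an agency-published feed and ingest only new entries rather than repeatedly re-downloading a large static file:

```python
# Hypothetical sketch of consuming a dataset as a feed instead of a static download.
# The feed URL is invented; it stands in for an agency-published Atom feed.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.gov/datasets/hazard-mitigation-grants/atom"  # placeholder
ATOM = "{http://www.w3.org/2005/Atom}"

def latest_entries(url: str):
    """Fetch the feed and yield (updated, title) for each entry."""
    with urllib.request.urlopen(url) as response:
        root = ET.parse(response).getroot()
    for entry in root.findall(f"{ATOM}entry"):
        yield entry.findtext(f"{ATOM}updated"), entry.findtext(f"{ATOM}title")

# A consumer could poll this on a schedule and ingest only new entries:
# for updated, title in latest_entries(FEED_URL):
#     print(updated, title)
```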

Quality

As is to be expected in efforts of this type, there were a number of glitches: datasets that could not be downloaded or, once downloaded, could not be opened (the Central Contractor Registration FOIA extract from the General Services Administration seems to have caused problems for several users). Additionally, some datasets were incomplete (the Hazard Mitigation Grant Program data released by FEMA is missing 23 years of data between 1966 and 1989). Even more troubling, some did not have header rows, and for those that did, their Data.gov pages did not always link to code sheets explaining what those header rows meant. Without this information, the data cannot be used.
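A small, hypothetical sketch of the problem (the file contents, column names, and codes below are invented): a headerless extract only becomes usable once column names and a code sheet are supplied alongside the raw data.

```python
# Hypothetical sketch: a headerless extract is only usable once column names
# and a code sheet (both invented here) are supplied alongside the raw file.
import csv
import io

raw_extract = "1001,06,A\n1002,48,T\n"  # no header row in the file itself

# Externally supplied documentation of the kind that should accompany the data:
COLUMN_NAMES = ["grant_id", "state_fips", "status_code"]
CODE_SHEET = {"A": "approved", "T": "terminated", "W": "withdrawn"}
STATE_FIPS = {"06": "California", "48": "Texas"}

reader = csv.DictReader(io.StringIO(raw_extract), fieldnames=COLUMN_NAMES)
for row in reader:
    print(row["grant_id"],
          STATE_FIPS.get(row["state_fips"], "unknown"),
          CODE_SHEET.get(row["status_code"], "unknown"))
```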

We therefore urge the implementation of a responsive feedback mechanism that allows the public to alert an agency that a specific dataset is not working, lacks information, or is missing explanatory material, and that ensures a response to those concerns within a specified time. One way to address this may be to include an agency contact with the ability to resolve database problems or provide information about the database. The interagency working group could sample the quality of these agency-specific dialogues to ensure that they are having an impact and to develop recommendations on best practices to improve responsiveness. Additionally, we strongly recommend that all datasets on Data.gov be directly associated with their code sheets.

Finally, we are concerned with the current lack of public notice when data is removed from the site. We respectfully urge you to note all tools and raw data that are removed from Data.gov, and to provide an explanation for their removal.

Many of the concerns outlined above apply across all or many of the agencies’ datasets. Accordingly, we think that standards for handling these types of problems can easily be addressed through the interagency working group and then disseminated amongst the agencies.

  • Andrea Schneider

    Hi Ellen,

    This is an excellent memo, very clear and easy to understand. Great job.

    I wonder if one reason we aren’t seeing meaningful data (from some agencies) is because they don’t have it or have not changed their own data collection requirements.

    The OGD is putting pressure on agencies (maybe caught with their pants down); they can’t hide out as easily and now face public expectations on performance. I’m not surprised you are seeing some “low-hanging fruit”.

    Having worked inside the federal government evaluating grants, I can say this part of the grant process is not always very sophisticated, at both the agency level and the grantee level. It can also be very expensive.

    I’d like to see a discussion of what we mean by data. Is it quantitative, qualitative, both? Sometimes one type of data helps more than another, like in making a decision about car seats.

    But what if a community wants to prevent crime? Crime data is only one part of the bigger story. How did they reduce crime? What strategies do they tie to the numbers? What recommendations do they have for another community with the same issue? In this case, the data needed to address the issue is more than numbers.

    We have a lot of trouble with redundancy of effort and funding; we keep spending the same dollar over and over again.

    Funding is usually categorical, with each agency tackling the same problem from its own point of view. Many of our current problems are inter-related; perhaps datasets from multiple agencies might give us a better picture and be more helpful.

    It is useful to think about all the possible customers for data ahead of time. That way it can be planned for and collected with criteria in mind.

    I am very interested in this subject. Thanks for all the work you are doing. I hope I can be helpful.

    Andrea

  • John L. Clark

    What motivates the suggestion that government agencies prefer CSV files over XML files? Is there a discussion about this elsewhere that I should refer to before discussing this further here?

  • Andrea:

    It’s definitely true that a lot of data maintained by agencies is in a format that does not lend itself to release, which is a huge problem. When FedSpending.org and USASpending.gov went online, some people discovered that Agriculture loan programs used recipients’ Social Security numbers as part of the unique identifiers for the loans they were giving out. For years and years, when those records were internal (and there was no Internet), this wasn’t a big deal. But after USASpending.gov went online, they had to completely change the record numbering process.

    As to a discussion of what data is: I personally think of it as the first building block that lets you start asking the more substantive questions. You mention crime stats (I happen to think the FBI’s Uniform Crime Reports are some of the most misleading stats out there). Getting them into a database is part of the process of understanding what’s going on with crime, but not all of it. For example, there’s been an explosion of federal criminal statutes over the last few decades; there are far more crimes to commit than there were 30 or 40 years ago, which has had a huge impact on crime rates. Having data in and of itself won’t explain everything, but it does let one say, “Geez, look at how the federal prison population exploded. What’s going on there?” Without the data, you miss those things.

    That said, having a substantive conversation about what data is would be incredibly useful. We like to try to treat everything as data: our Capitol Words site breaks up speeches given by members of Congress into individual words and then counts how many times members say things like “earmark” or “taxes” or “health.” Data is more than just numbers in a spreadsheet. But obviously, and in light of the Open Government Directive, having a more meaningful discussion of what we mean by data would be useful.
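    A minimal sketch of that word-counting idea, assuming invented speech text and tracked terms (this is not Capitol Words’ actual code):

    ```python
    # Toy sketch of counting tracked terms across speeches; the speeches and
    # tracked terms are invented, and this is not Capitol Words' actual code.
    import re
    from collections import Counter

    speeches = [
        "We must reform earmark spending and lower taxes.",
        "Health care and taxes dominate this earmark debate.",
    ]
    tracked = {"earmark", "taxes", "health"}

    counts = Counter()
    for speech in speeches:
        for word in re.findall(r"[a-z]+", speech.lower()):
            if word in tracked:
                counts[word] += 1

    print(counts)  # Counter({'earmark': 2, 'taxes': 2, 'health': 1})
    ```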

    And yes, it would absolutely be beneficial if we had real time disclosure of spending data from different agencies tagged in such a way that, let’s say, a Labor Dept. grant isn’t duplicated by one from Commerce.

  • John:

    I think it’s the advocates, not the government, saying they prefer CSV over XML, although there’s no reason not to release data in both formats where possible.

    I know a lot of folks here at Sunlight prefer XML.