Changes to USGS website highlight the importance of search for public access

As has been true for decades, the ways public data are stored and presented on federal government websites can sometimes be tricky for the public to understand, access, use, and re-use.

As federal websites have been changing under the Trump administration, there has been ongoing confusion in the media and public about the difference between the output of a search on a website or a data portal and the data or Web resources to which those search results link.

This week, the U.S. Geological Survey (USGS) came under scrutiny. On September 18, Peter Gleick, a climate scientist and member of the U.S. National Academy of Sciences, brought attention to changes to the USGS Science Explorer, a government search engine for scientific data and information, in a series of tweets.

Using the archived page stored in the Wayback Machine, Gleick noted that the search results for the term “climate change” had dropped from a count of 5,932 in December 2016 to 416 on September 19. Following media attention, USGS changed their search function. The same search a day later showed a count of 63,016 results. The increased count is not necessarily an indication of improved search and it is unclear if the change was an intended or an inadvertent result of an attempted fix.

If you’re not familiar with the Science Explorer, when a member of the public searches this site, it returns results with links, dates, and descriptions corresponding to other Web resources, such as webpages, images, videos, and data, but it does not itself host that content.

That’s important: our spot checks of the linked resources produced by the search in December 2016 but not found in the September 19 search results found that the resources themselves were not removed. Only the search results were affected.

Without providing full context, a ThinkProgress post about Gleick’s statements reads, “You paid for U.S. Geological Survey climate data, but the White House is making it disappear.” An E&E News article went even further, claiming that “thousands of webpages are gone.”

These assertions are not supported by our research and were refuted by the agency. For example, when searching for the words “climate change,” the top result in December 2016 was a link to a URL (https://nccwsc.usgs.gov/content/usda-announces-new-climate-change-initiatives) that does not appear in the September 19 search results but still leads to a live page that has been unchanged since December 12, 2016.
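For readers who want to repeat this kind of spot check, here is a minimal sketch in Python. It is an illustration, not the tooling we used: the function names `http_status` and `spot_check` are hypothetical, the URLs in the usage lines are placeholders, and the check assumes that a plain HTTP 200 response means a page is still live. It does not confirm that the page's content is unchanged.

```python
from typing import Callable, Dict, Iterable
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a URL, or 0 if it cannot be reached."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # The server answered, but with an error status such as 404.
        return error.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar problems.
        return 0


def spot_check(
    old_results: Iterable[str],
    new_results: Iterable[str],
    fetch_status: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each URL that dropped out of the search results to whether it is still live."""
    dropped = sorted(set(old_results) - set(new_results))
    return {url: fetch_status(url) == 200 for url in dropped}


# Usage with placeholder URLs and a stubbed status lookup, so nothing is fetched.
archived = ["https://example.org/a", "https://example.org/b", "https://example.org/c"]
current = ["https://example.org/a"]
stub = {"https://example.org/b": 200, "https://example.org/c": 404}
report = spot_check(archived, current, fetch_status=lambda url: stub.get(url, 0))
```

In this stubbed run, `report` marks the first dropped URL as live and the second as gone. Passing the status lookup as a parameter keeps the comparison logic separate from the network call, so the same function works against archived result lists or live requests.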

“No doubt, there have been technical problems with the USGS search function,” A.B. Wade, a press officer at USGS, told Sunlight, in an emailed statement. “Many of these technical problems have now been fixed, meaning the search is more reliable, but it’s not perfect. What’s important, though, is that the USGS has not been directed by the current Administration to remove any data sets, publications, or webpages from our online presence.”

How access to public information can be altered

We’ve been glad to see that there have not been widespread federal data takedowns in 2017, as many people had feared. The only example of a removal of an open data set from the Internet thus far was when the Department of Agriculture took down a set of data on animal welfare.

Instead, what we’ve seen the Trump administration do on federal government websites constitutes more subtle, but still significant, reductions to public access to public information online.

This latest episode has generated confusion for at least two reasons. First, it muddied the conversation about whether public information was taken offline entirely or if the changes only affected the search portal results.

Second, the specific focus on climate change, when many other search results also changed substantially, implied without evidence that this was a targeted and potentially politically motivated removal of public climate information by the Trump administration. For example, there was also a large reduction in the number of results when searching for the words “natural resource exploration” between December 2016 and September 19.

Both of these issues could be substantially mitigated by more proactive communication from government agencies that operate search engines and websites regarding future content changes, downtime, bugs, migrations, and changes to how searches function, from indexing to display.

The confusion about changes to federal websites and removals of information since the beginning of the Trump administration can be traced to a misunderstanding of how access to public information works on federal websites.

We group resources into three broad categories:

  1. Public information assets, including documents and structured government databases
  2. Webpages that host or link to public information assets
  3. Search results from website search engines and open data platforms that point to public information assets or their host webpages
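One way to make the three categories concrete is a small Python sketch that labels a reported change by the most severe type it affects. The class `ResourceStatus` and the function `describe_change` are hypothetical names introduced only for illustration, and treating the three types as a strict order of severity is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceStatus:
    asset_online: bool       # type 1: the document or data set itself
    host_page_online: bool   # type 2: a webpage that hosts or links to it
    in_search_results: bool  # type 3: a search result that points to it


def describe_change(status: ResourceStatus) -> str:
    """Name the most severe category of access that has been lost."""
    if not status.asset_online:
        return "type 1: the public information asset was taken offline"
    if not status.host_page_online:
        return "type 2: the asset is online but its host page is gone"
    if not status.in_search_results:
        return "type 3: the asset and page are online but missing from search"
    return "no change"


# A page that is still live but no longer appears in search results.
print(describe_change(ResourceStatus(True, True, False)))
```

Checking the categories in this order means a report of "missing from search" is only labeled type 3 after the asset and its host page have both been confirmed to be online.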

The most extreme example of type #1 content changed since Inauguration Day is the example mentioned above, in which the Department of Agriculture took down USDA data sets related to animal welfare.

When content of type #2 alone is removed, data or information may remain available at the same URL on a government website, but often with substantially reduced accessibility.

For example, the Environmental Protection Agency removed its Clean Water Rule website, leaving PDFs that provide information about clean water online but inaccessible to anyone who did not have a direct URL. In some cases when type #2 content is removed, it may also no longer be possible to navigate to the content using a government website, significantly limiting the discoverability of the resource.

When content of type #3 alone is removed, however, the content’s Web hosting is not affected at all and the content is still discoverable on a given website.

While access may be reduced in an important way, that assessment is closely linked to how commonly that website search engine or data portal is used and how well its search function works. So when type #3 content is removed, we cannot say that data or information was “removed” or “deleted” or that it was necessarily made “inaccessible.”

In the case of the USGS search portal, with no evidence provided that any content of type #1 or #2 was removed, it appears that the removals were only of type #3. These alterations may have been due to changes to the indexing or structure of the metadata, which is the underlying information that a search function is based on. It could also be because of changes to the algorithm that the search function uses to search the metadata.
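To illustrate how a change to indexed metadata alone can shrink a result count, here is a toy sketch in Python. It is not a description of how the Science Explorer works, and the USGS has not said which mechanism was responsible. The paths, descriptions, and function names are all invented for the example.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(metadata: Dict[str, str]) -> Dict[str, Set[str]]:
    """Build an inverted index mapping each lowercase word to the URLs it describes."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for url, description in metadata.items():
        for word in description.lower().split():
            index[word].add(url)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the URLs whose metadata contains every word in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


# Invented metadata for three hosted pages, before and after a re-indexing.
old_metadata = {
    "/pages/a": "climate change initiatives",
    "/pages/b": "glacier retreat and climate change",
    "/pages/c": "streamflow data",
}
new_metadata = {
    "/pages/a": "climate change initiatives",
    "/pages/b": "glacier retreat",  # description trimmed during re-indexing
    "/pages/c": "streamflow data",
}

old_hits = search(build_index(old_metadata), "climate change")
new_hits = search(build_index(new_metadata), "climate change")
print(len(old_hits), len(new_hits))  # the count drops from 2 to 1
print(set(old_metadata) == set(new_metadata))  # the same pages are still hosted
```

In this toy case the same three pages exist before and after the change. Only the metadata describing one of them changed, and that was enough to remove it from the results for “climate change.”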

A poorly functioning search on a website or data portal, especially when it was just recently working much better, is certainly something to be concerned about, but this type of change is very far from a “deletion of data.”

While there is no evidence yet that resources corresponding to the changed search results have been removed or deleted, there is evidence that the USGS website has been altered otherwise: webpages presenting maps by topic matter have been removed. The “Map Topics” page, which links to 10 other topic pages, was live on June 17 but currently leads to a “File Not Found” page. These removals of the webpages appear to have been independent of the search alterations and are significant reductions in access to Web content that we hope to see restored as the USGS continues to update its website.

What’s next at the USGS?

To its credit, the USGS has responded consistently to tweets and news media inquiries, including Sunlight’s, and has acknowledged the issues.

In a September 18 tweet, the USGS stated that “our current search engine is not up to snuff. Content exists, but we have issues w/our search causing poor results. Fixes coming.”

It was good to hear the USGS acknowledge the problem with search on social media and to Sunlight directly. Any substantial changes to Web resources and tools, like those made to the Science Explorer, should be accompanied by a clear explanation from a government agency before they’re made, during any issues, and upon resolution, including the kind of public responsiveness to affected members of the public that USGS adopted here.

Design websites for people to use

As we continue to evaluate the federal government’s approach to access, disclosure, and presentation of public data and information, we hope agencies will stay focused on improving the design and functionality of public websites and their features.

How can a given search tool help the user access information that they knew about in advance of the search? How can it help a user discover new information in a useful way?

Answers to these questions are context-dependent and based on specific search use cases. A reduced or increased search result count does not necessarily mean that a given search tool is better or worse. Getting a dump of thousands of poorly prioritized results can make it almost impossible to find any particular one.

We hope that the federal government will consider the needs and use cases of the public when building or improving website search engines and data portals for accessing Web content.

Transparency about technical issues increases accountability and reduces public confusion and fear when changes are reported. Taking steps to communicate with the public is crucial, given the understandable concern that large numbers of scientists and citizens have regarding the future of scientific research and disclosure at the EPA and other agencies.

  • Jesse Biroscak

    Thanks, Toly and Andrew, for writing this (and to Greg J-D for sharing it with me). As a PM working on a new govt search experience, these questions in particular caught my eye:

    “How can a given search tool help the user access information that they knew about in advance of the search?”
    “How can it help a user discover new information in a useful way?”

    Have you seen any examples (inside or outside govt) or parallel ideas that you liked?

    A couple questions for you, too:
    * Transparency around technical issues has confused the general public in my experience. They are often more concerned with getting the thing they want rather than why they can’t get it. Speculation on how the public (even technical publics) would want this information communicated? Seen anyone do it well?
    * How do you see documents / data surfaced vs page content being best surfaced for ease of access? Examples you like?