Federal open data audit: Defense downright dismal, Interior immense yet imperfect

Photo credit: Yuri Yu. Samoilov/Flickr

As we audit the public data catalogs of federal agencies, we have found that the agencies vary widely in both the quantity and the quality of the data they publish. We also made some unusual discoveries while examining the URLs the agencies published. Here, we look at the departments of Defense and the Interior.

Department of Defense and the Bermuda Traceroute

The Department of Defense’s public data catalog is much smaller than the Interior’s, with only 373 datasets that offer 375 URLs for data. Of those, 59 URLs (15.7 percent) return 404 errors, including the Federal Employees Overseas Absentee Voting dataset. Additional 404 errors include links to records of FOIA requests to the Office of Secretary of Defense for 2011 and 2013. Valid URLs for these records can be found outside the data catalog.

But of particular interest are the domains armyobt.army.mil, dod.xml.feedroom.com and marines.xml.feedroom.com, which together appear 48 times in the catalog. The first apparently doesn’t exist, or at least doesn’t have a public DNS listing. The other two have DNS listings but do not respond to any requests. These links may work within the department, but, if they exist at all, they block public access from anywhere else. Defense indicates that datasets exist at 24 distinct URLs among these three domains, but the existence of servers at those domains can’t be confirmed. Requests to dod.xml.feedroom.com and marines.xml.feedroom.com don’t result in an error from the server or a message that the server was unreachable; they provide no response whatsoever. Note that these 48 problems are in addition to, not part of, the 59 URLs that return 404 errors. That means, in total, 107 of the catalog’s 375 URLs do not function for the public, or about 29 percent.
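The arithmetic behind those percentages can be checked in a few lines. This is only a worked calculation using the counts reported above; the helper name percent is our own, chosen for illustration.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

TOTAL_URLS = 375      # URLs listed in the Defense catalog
NOT_FOUND = 59        # URLs that return "404 Not Found"
NO_RESPONSE = 48      # catalog entries on the three unreachable domains

broken = NOT_FOUND + NO_RESPONSE  # the two groups do not overlap

print(f"404 errors: {percent(NOT_FOUND, TOTAL_URLS)} percent")
print(f"Not usable by the public: {broken} URLs, "
      f"{percent(broken, TOTAL_URLS)} percent")
```

The combined share comes to 28.5 percent, which the text rounds to about 29 percent.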

Using the traceroute tool, we can see that data packets sent to dod.xml.feedroom.com make their way across various pieces of network hardware without issue until they just disappear — a virtual Bermuda Triangle.

Perform a traceroute on feedroom.com and packets make their way there and back just fine. The domain is owned by an online video services company, Piksel, which makes sense given that the URLs we were examining are associated with various military video offerings, such as “The Pentagon Channel.” It is unclear why this particular host would be unavailable and not respond with an error (for instance, that access is forbidden or that the page doesn’t exist). It could be a side effect of compliance with Defense’s directive on cybersecurity and information technology.

Again, inside the Department of Defense, that server could be accessible. But, suffice it to say, this data is not publicly accessible.
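The three kinds of failure described in this section (no DNS listing, a listing but no response, and an ordinary server error) can be told apart programmatically. What follows is a minimal sketch, not the tooling used for this audit: the function name classify_host, its labels, the port and the timeout are all assumptions made for illustration. The resolve and connect parameters exist only so the logic can be exercised without touching the network.

```python
import socket

def classify_host(host, port=80, timeout=5.0,
                  resolve=socket.getaddrinfo,
                  connect=socket.create_connection):
    """Label a host as 'no_dns', 'no_response', 'refused',
    'unreachable' or 'reachable' (labels are illustrative)."""
    # Step 1: does the name have a public DNS listing at all?
    try:
        resolve(host, port)
    except socket.gaierror:
        return "no_dns"

    # Step 2: does anything answer a TCP connection attempt?
    try:
        conn = connect((host, port), timeout=timeout)
    except socket.timeout:
        # Packets vanish: no error, no refusal, just silence.
        return "no_response"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "unreachable"

    conn.close()
    return "reachable"
```

Called as classify_host("dod.xml.feedroom.com"), a silent host of the kind described above would be expected to come back as "no_response" once the timeout expires, while a name with no public listing would come back as "no_dns". Results depend on where the check is run from.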

Interior, Becky, Bob, Chris and Lisa

The Department of the Interior’s public data catalog is impressive. It weighs in at 36,486 datasets, making it by far the largest we have examined. The catalog contains entries for the main agency and sub-agencies such as the Bureau of Land Management, the U.S. Geological Survey and the National Park Service. There were 43,940 unique URL entries associated with these datasets, and quite a few of those point to geographic databases. But when we checked those URLs, we also found a variety of issues, from items that didn’t exist at the provided URL to entries that pointed to local addresses, meaning the datasets are on agency computers rather than the Internet.

For 2,650 URLs, a web server returned a “404 Not Found” response when queried with an HTTP HEAD request (this is also how we explored the Department of Defense’s data). Roughly half of those came from http://www.landscape.blm.gov/, which responds to all HEAD requests with a “404 Not Found,” meaning that the web server won’t tell you if some potentially large download exists unless you attempt to download whatever it is at the URL. While that may not dissuade people from gathering a particular dataset, it is a hurdle when attempting to determine the availability of tens of thousands of potential datasets. In a manual examination of a selection of 790 URLs under the National Park Service domain, numerous URLs still “404’d” when the full resource was requested. So if you’re looking to collect, for instance, information about wheelchair accessibility in national parks, you won’t find information on wheelchair-accessible trails in Chaco Culture National Historical Park, N.M. Perhaps with increased visibility, these missing resources will be located in the near future.
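A HEAD check of this kind, with a fallback for servers that answer every HEAD request with a 404, might look like the sketch below. It is an illustration under stated assumptions rather than the script behind these numbers: the names fetch_status and check_url are ours, and the fallback to a GET request that stops after the response headers is one possible workaround, not something the audit describes doing automatically.

```python
import urllib.error
import urllib.request

def fetch_status(url, method, timeout=10.0):
    """Send one request and return the HTTP status code.
    The response body is never read, so a GET stays cheap."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx answers arrive as exceptions; keep the code.
        return err.code

def check_url(url, fetch=fetch_status):
    """Try HEAD first; if the server says 404, confirm with GET."""
    status = fetch(url, "HEAD")
    if status == 404:
        # Some servers reject HEAD outright, so ask again with GET.
        status = fetch(url, "GET")
    return status
```

Connection failures such as timeouts or missing DNS entries raise urllib.error.URLError here and are deliberately left uncaught, since they are a different kind of problem from a 404.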

In addition to things that aren’t there or don’t appear to be, we have run across URLs that appear to be filed away on someone’s desktop computer. The Department of the Interior lists URLs for files presumably on the hard drives of Bob, Lisa, Chris and Becky’s computers. Here is an example of each:

In addition to Bob, Lisa, Chris and Becky’s computers, there are also numerous URLs that point to shared drive letters such as G, X and I, or other locations that appear to be internal resources. Some of these URLs are part of a collection of links for a particular dataset, so there may be some usable data for those datasets, but we still don’t know what’s on Lisa’s computer (or on Bob’s, Chris’s or Becky’s, either); it may be a duplicate of what we have access to, or, more likely, an additional piece of the dataset.
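Entries like these can be flagged before any network request is made. The sketch below is a rough heuristic under our own assumptions: the function name is_local_reference is hypothetical, the three patterns it checks (a file: scheme, a Windows drive letter, a UNC share path) are guesses at what such entries commonly look like, and the sample paths are invented placeholders, not entries from the Interior catalog.

```python
import re
from urllib.parse import urlparse

# A leading drive letter such as "G:\" or "X:/" (assumed pattern).
DRIVE_LETTER = re.compile(r"^[A-Za-z]:[\\/]")

def is_local_reference(url):
    """Return True if the entry looks like a path on a local or
    shared drive rather than an address on the public Internet."""
    text = url.strip()
    if DRIVE_LETTER.match(text):
        return True               # e.g. a shared drive letter
    if text.startswith("\\\\"):
        return True               # UNC path: \\server\share\...
    return urlparse(text).scheme == "file"

# Invented examples for illustration only.
samples = [
    "file:///C:/Users/example/data.zip",
    "G:\\shared\\layer.shp",
    "http://www.example.gov/data.zip",
]
for entry in samples:
    print(entry, is_local_reference(entry))
```

A check like this only sorts entries into likely local and likely public; it cannot say what the local files contain.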

Overall, our audit reveals a pretty dismal performance by the Department of Defense, and an impressive-if-imperfect presentation by one of the biggest data publishers in the federal government, the Interior. We’ll be following up on this post with additional audits in the future, so stay tuned!