How Data Refuge works, and how YOU can help save federal open data

by Guest Blogger

Feb 6, 2017 1:17 pm

Editor’s Note: Last week, Sunlight joined Abbie Grotke, a digital library project manager at the Library of Congress, and professor Bethany Wiggin, from the University of Pennsylvania, at a Transparency Caucus briefing in the U.S. House of Representatives to discuss archiving federal open government data. (You can watch video of the event online.) At the event, Wiggin shared what’s happening with Data Refuge, a distributed, grassroots effort around the United States in which scientists, researchers, programmers, librarians and other volunteers are working to preserve government data. Her prepared remarks are below.

Thank you for inviting me; it is an honor to be here. I am an Associate Professor at the University of Pennsylvania, where I teach a seminar on censorship and technology in history. I also research and write about Philadelphia’s urban waters, including the great lack of environmental data about those waters. I am also the Founding Director of Penn’s Program in Environmental Humanities.

This academic program works at the intersection of the natural sciences and the human sciences, otherwise known as the humanities. We take an integrated, interdisciplinary approach spanning the arts and sciences to understand how humans have profoundly remade the natural world, fundamentally altering earth systems. Working at the intersection of academic disciplines, we can foster resilience in an era of human-caused global climate change.

Public engagement lies at the very heart of the program’s mission. One recent example of our public engagement work is the project called Data Refuge.

What is Data Refuge?

Working with partners, especially with our librarians at Penn, Data Refuge aims to accomplish three goals:

Use our trustworthy system to make research-quality copies of federal climate and environmental data. The types of public data we copy range from satellite imaging to PDFs. We augment the webcrawling work of our partners at End of Term Harvest and the Internet Archive by developing tools to download and describe “uncrawlables,” which can then be put in the public server space available at www.datarefuge.org.

Advocate for environmental literacy with storytelling projects that showcase how federal environmental data support health and safety in our local communities; and advocate for more robust archiving of born-digital materials as well as for more reliable access to them. They are, after all, paid for by American taxpayers.

Build a consortium of research libraries to scale Data Refuge’s tools and practices to make copies of other kinds of federal data beyond the environment. This budding consortium, supported by the Association of Research Libraries, will supplement the existing system of federal depository libraries, where printed documents are “pushed.” This new consortium could actively “pull” public materials, i.e., copy them, from federal agencies.

How much data has Data Refuge archived?

Data Rescue events have downloaded roughly 4 terabytes of data. Related libraries’ efforts have captured petabytes of open data. Data Rescue events, as of 1/31/17, have also seeded more than 30,000 URLs to put into the Internet Archive’s Wayback Machine. As of 1/31/17, some 800 people have participated in Data Rescue events.

Since beginning this project in November of last year, we have helped support six Data Rescue events, including a two-day event in Philadelphia, the first to tackle data that cannot go into the Wayback Machine. We’ve now supported a seventh in Cambridge, Mass.; an eighth at UC Davis; a ninth in Portland, OR; and a tenth in NYC.

Data Refuge organizers hosted a webinar for future event organizers attended by well over one hundred participants. More than twenty additional Data Rescue events are in the works, in locations ranging from the SF Bay area to Atlanta, Austin, Boston (two additional events), Boulder, Chapel Hill, DC, Denver, Haverford (PA), Miami, NYC (another event), Seattle, the Twin Cities, and Wageningen, Netherlands.

How do we know how to prioritize the data to save?

Since December, with the help of the Union of Concerned Scientists, we have circulated a survey that invites researchers to identify those data sets most valuable for their work. It also asks them to consider how vulnerable those data sets might be. If they are stored in multiple locations, they are less vulnerable; if they are stored in only one location, it is far easier to limit or even block access to them.
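As a rough illustration of that reasoning, here is a minimal Python sketch of how survey responses might be ordered for rescue. It is not the Data Refuge survey or its scoring method: the SurveyResponse fields, the dataset labels, and the ranking rule (fewest known copies first, then most nominations) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SurveyResponse:
    dataset: str        # hypothetical dataset label, not a real federal dataset
    nominations: int    # how many researchers named it as valuable to their work
    known_copies: int   # how many separate locations are known to hold it


def rescue_order(responses: List[SurveyResponse]) -> List[str]:
    """Order datasets so the fewest known copies come first.

    Ties are broken in favor of the dataset with the most nominations.
    """
    ranked = sorted(responses, key=lambda r: (r.known_copies, -r.nominations))
    return [r.dataset for r in ranked]
```

A dataset held in a single location would sort ahead of one mirrored in three places, however popular the mirrored one is. A real prioritization would likely weigh value and vulnerability together rather than strictly in sequence.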

Data Rescue events also use a comprehensive approach to webcrawling, one that surveys climate and environmental data across multiple locations. It was developed by the Environmental Data Governance Initiative, a newly formed coalition of individual researchers. (I serve on the steering committee.) This method allows people without deep content knowledge to participate in Data Rescue events, as does the work of the storytelling teams.

What happens at a Data Rescue event?

Participants select one of “Four Trails” through the Refuge. These trails are in essence working groups, with trained Trail Guides coordinating the work across the different local Data Rescue events. The Trails are:

Feed Internet Archive

Federal Internet materials that can go to the Internet Archive’s Wayback Machine go there.

Feed Data Refuge

Suspected “uncrawlables” are added to a master list on a spreadsheet the Data Refuge team manages, and project participants do additional research. (An app will soon replace the spreadsheet and the associated workflow with its multiple checks for quality assurance.)

Storytellers and Documentarians

Create social media about data rescuers and events. Develop use cases in partnership with city and municipal government partners as well as other community partners and NGOs.

The Long Trail

Build a library consortium and advocate for better policy on federal open data management.

Why good copies of data are so important

Data Refuge Rests on a Clear Chain of Custody. The documentation of a clear “chain of custody” is the cornerstone of Data Refuge. Without it, trust in data collapses; without it, trustworthy, research-quality copies of digital datasets cannot be created.

Libraries always say: “Lots of Copies Keep Stuff Safe” (LOCKSS). That’s very true. But consider what happens if a faulty copy is made, whether by accident, technical error, or deliberate action, and then proliferates. Especially in a digital world, an epidemic can be the result. Instead of keeping “stuff safe,” we have spread lots of bad copies. Factual-looking data can in fact easily be fake data.

But how do we safeguard data and ensure that a copy is true to the original? Especially if the original is no longer available, we must find another way to verify the copy’s accuracy. This is where a clear, well-documented “chain of custody” comes in. The Data Refuge project documents this chain (where the data come from originally, who copied them and how, and then who redistributes them and how) and relies on multiple checks by trained librarians and archivists, who provide quality assurance along every link in the chain. Consider this extreme case: What happens if an original dataset disappears, and the only copy has passed through unverified hands and processes? Even a system that relies on multiple unverified copies can be gamed if many copies of bad data proliferate.

This practice of documenting whose hands have been on information goes back across hundreds, even thousands, of years. Instilling trust in information is a universal human concern. Unfortunately, it’s imperfect. The workflow devised for Data Refuge is similarly not 100% foolproof. But we can increase our trust in the copies by including librarians trained in digital archiving and metadata as the final instance of quality control before we make anything public. At this end link in the chain, we verify the quality with the Data Refuge stamp of approval.

How we verify data for Data Refuge

After the data is harvested, it gets checked against the original website copy of the data by an expert who can answer: “Will this data make sense to a scientist or other researcher who might want to use it?” This guarantees the data are useable. Then, digital preservation experts check the data again, make sure that the metadata reflect the right information, and create a manifest of technical checksums to enclose with the data in a BagIt file so that any future changes to the data will be easily recognizable.
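To make the idea of a checksum manifest concrete, here is a minimal Python sketch using only the standard library. It is an illustration of the general technique, not the tooling Data Refuge uses and not a full BagIt implementation: the function names are hypothetical, SHA-256 is an assumed choice of algorithm, and the manifest is assumed to live outside the data directory, one "checksum, two spaces, relative path" line per file.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir: Path, manifest: Path) -> None:
    """Record one '<checksum>  <relative path>' line per file under data_dir."""
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(data_dir).as_posix()
        lines.append(f"{sha256_of(path)}  {rel}")
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")


def changed_files(data_dir: Path, manifest: Path) -> List[str]:
    """Return listed files that are missing or whose contents no longer match."""
    changed = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines, e.g. an empty manifest
        expected, rel = line.split("  ", 1)
        target = data_dir / rel
        if not target.is_file() or sha256_of(target) != expected:
            changed.append(rel)
    return changed
```

Anyone who later receives the data together with the manifest can rerun the check: an empty result means every listed file still matches its recorded checksum. Note that this sketch does not detect files added after the manifest was written.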

The BagIt files move to the describers, who open them, spot check for errors, and create records in the datarefuge.org catalog, adding still more metadata.

Each actor in this chain is recorded. Each actor in effect signs off, saying yes, this data matches the original. And each actor also checks the work of the previous actor and signs off on it. This is the best way we have to ensure this copy is the same as the original, even if the original goes away.
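One way to picture such a record is as a log in which each sign-off stores a checksum of the data and a digest of the previous sign-off, so a changed dataset or an edited log entry both show up. The Python sketch below is a hypothetical illustration only: the CustodyEntry fields, the role names, and the hash-linking scheme are assumptions, not the Data Refuge workflow, and the entries are plain records rather than cryptographic signatures.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class CustodyEntry:
    actor: str          # who handled the data (hypothetical role name)
    action: str         # what they did, e.g. "harvested" or "checked"
    data_sha256: str    # checksum of the dataset as this actor saw it
    previous: str       # digest of the prior entry, "" for the first link

    def digest(self) -> str:
        """Stable digest of this entry, used to link the next one to it."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def sign_off(chain: List[CustodyEntry], actor: str, action: str, data: bytes) -> None:
    """Append an entry recording that `actor` handled exactly these bytes."""
    previous = chain[-1].digest() if chain else ""
    checksum = hashlib.sha256(data).hexdigest()
    chain.append(CustodyEntry(actor, action, checksum, previous))


def chain_is_intact(chain: List[CustodyEntry]) -> bool:
    """True when every entry links to its predecessor and all saw the same data."""
    for index, entry in enumerate(chain):
        expected_previous = chain[index - 1].digest() if index else ""
        if entry.previous != expected_previous:
            return False
        if entry.data_sha256 != chain[0].data_sha256:
            return False
    return True
```

In this sketch the first entry is the reference point, so the log is only as trustworthy as the initial harvest; that is why the human checks against the original website still matter.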

Libraries

Today, building on decades of work, many libraries are taking fast action to advocate for open data and to promote better access. Many libraries have hosted Data Rescue events and are working quickly with their communities to harvest and save data. But the fragility of government information on the internet is a problem that has already gained considerable traction in the library community. Coordinated efforts like EoT, as well as less well-known consortia, are actively working to map the landscape of new government information. Together with various research communities, we are strategizing on how to manage that information landscape responsibly and systematically. This will require deep and sustained collaboration of the type that is difficult to create quickly. Nonetheless, Data Refuge has done much to accelerate, responsibly, those collaborations.

Last week, with two librarians from Penn, I met with the Association of Research Libraries headquartered here in Washington.

In response to the overwhelming number of requests that we’ve received from colleagues at universities across the US who want to help, we propose an additional scaling strategy: leveraging existing capacity within libraries, in staff and expertise, to archive web sites immediately. Many academic research libraries have web archiving systems in place with knowledgeable librarians engaged in this work. If a fraction of those skills and systems is directed to address a small slice of this challenge, together we can make substantive and critical progress toward preserving federal websites.

You can learn more about Data Refuge at ppehlab.org, including how to get involved as an individual. If you want to host a Data Rescue event, check our guide, “How to Host a Data Rescue Event,” which also includes a useful Toolkit.

Bethany Wiggin is the Founding Director of the University of Pennsylvania’s Program in Environmental Humanities and holds appointments in German, English, and Comparative Literature.

Disclaimer: The opinions expressed by the guest blogger and those providing comments are theirs alone and do not reflect the opinions of the Sunlight Foundation.
