PDF Liberation: Why it matters and how you can help
For government officials, journalists and citizen watchdogs trying to take advantage of reams of data locked into hard-to-access digital documents, there’s hope: Last weekend’s PDF Liberation Hackathon produced some not-so-small steps toward what could be a major breakthrough.
Held in at least six cities (including Washington, where the Sunlight Foundation played host), the event brought together technophiles and transparency advocates to work on what has turned out to be one of the knottiest problems facing anyone who wants to analyze government data.
PDFs, or files in the Portable Document Format, are, for many government agencies, the preferred way of creating and storing documents. Developed by Adobe, the format provides a one-format-fits-all way of allowing documents to be shared as their creators intended them to look, regardless of the type of computer or operating system used to open them. In other words, PDFs are designed to look pretty, not to promote transparency.
When it comes to machine-readability, they turn out to be a Rosetta Stone written in plain English (and plenty of other modern languages). That’s because the same technology that allows visual formatting to be loaded and locked also locks in content. It stops machines from doing what machines do best — ingesting and collating vast quantities of data. And that, in turn, blocks humans from doing what they do best — querying that data, analyzing it and making judgments about it.
A few examples to show the implications of this problem:
- A half century of reports — dating back to the Marshall Plan — describing the results of projects funded by the U.S. Agency for International Development are locked into PDF files. Kat Townsend of USAID came to the hackathon looking for help in turning those reports into data “so that we can figure out what works.”
- For the Sunlight Foundation’s Daniel Cloud, one of the organizers of the Washington hackathon, the “enormity” of the PDF problem became apparent when he began working on Political Ad Sleuth. The tool, developed by the Sunlight Foundation with the assistance of Free Press, enables users to search across more than 78,000 records of political ad buys uploaded by the Federal Communications Commission (and some by volunteers). But in order to make calculations with them, and to determine precisely how much money is being spent by a given group or in a given market, the data on those PDF records must be retyped by hand into a structured database.
- The Center for Responsive Politics provides an invaluable public service by entering information (some of it hand-written) from lawmakers’ personal financial disclosure forms into a database that allows users to determine the richest and poorest members of Congress, among other things. But the data runs months behind the release of the records because so much of it must be hand-entered.
Finding ways to make these tasks less Herculean was the goal of the PDF Liberation Hackathon, and participants came away feeling hopeful that it laid the groundwork for some major progress.
Want to get involved? Come to Open Data Day!
“The summit is very achievable. We just have to establish a few base camps,” said Greg Elin, a former Sunlight Labs director who just left his post as chief data officer for the FCC to focus on a new, Knight Challenge Grant-winning project.
In the view of the Washington Hackathon judges (watch Sunlight’s blog next week for more about contests in other cities), one of those “base camps” was put in place by Sunlight developers Jacob Fenton and Bob Lannon, with their project whatwordwhere. Judge Waldo Jaquith, an accomplished open data activist, said the “huge breakthrough” accomplished by the two came in treating PDFs more like maps than documents. “Everyone else, me included, has been working with some form of extracting data as text.”
Lannon and Fenton, on the other hand, focused on the topography of the documents, “treating this as geodata,” Jaquith observed. “This gives us a rich suite of tools to apply to the problem.”
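To make the “maps, not documents” idea concrete, here is a minimal sketch in Python. It is not the whatwordwhere code, and nothing in it comes from that project: the Word record, the two helper functions, the coordinate convention (origin at the top left, y growing downward) and the sample page are all assumptions invented for illustration. The point is only that once each word carries a position, questions like “what sits in this rectangle?” or “which words share a row?” become simple spatial queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word and its bounding box on a page.

    Hypothetical layout: origin at top left, y grows downward, units are points.
    """
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def words_in_region(words, x0, y0, x1, y1):
    """Return words whose boxes lie entirely inside the query rectangle,
    ordered top to bottom, then left to right."""
    hits = [
        w for w in words
        if w.x0 >= x0 and w.y0 >= y0 and w.x1 <= x1 and w.y1 <= y1
    ]
    return sorted(hits, key=lambda w: (w.y0, w.x0))


def group_into_rows(words, tolerance=2.0):
    """Cluster words into rows by the top edge of their boxes.

    A word joins the current row if its top edge is within `tolerance`
    of the first word in that row; each row is then read left to right.
    """
    rows = []
    for w in sorted(words, key=lambda w: w.y0):
        if rows and abs(rows[-1][0].y0 - w.y0) <= tolerance:
            rows[-1].append(w)
        else:
            rows.append([w])
    return [[w.text for w in sorted(row, key=lambda w: w.x0)] for row in rows]


# Made-up sample page: a two-column table plus a stray footer word.
PAGE = [
    Word("Advertiser", 50, 100, 110, 112),
    Word("Amount", 300, 100, 345, 112),
    Word("Example", 50, 120, 95, 132),
    Word("PAC", 100, 120, 125, 132),
    Word("$1,200", 300, 121, 340, 133),
    Word("Page", 500, 700, 525, 712),
]

if __name__ == "__main__":
    # Ask what falls inside the "Amount" column below the header.
    print([w.text for w in words_in_region(PAGE, 290, 115, 350, 140)])
    # Rebuild table rows from position alone.
    print(group_into_rows(PAGE))
```

In this toy example the region query picks out the dollar figure without any knowledge of the text around it, and the row grouping recovers the table structure even though the amount sits one point lower than the rest of its row.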
Other breakthroughs included:
- Alex Byrnes and Miriam Diemer from the Center for Responsive Politics made progress towards ingesting and parsing personal financial disclosure documents available from the House Office of the Clerk. With help from Ross Tsiomenko of OpenGov and Sunlight’s Bob Lannon, the team scraped the House Clerk’s site for personal financial disclosures and periodic transaction reports. Then, using the ABBYY Cloud OCR SDK (generously provided with an extended trial period to PDF Liberation), they were able to extract tables of data from the raw scans. Their GitHub repo can be found here.
- Dylan Bartlett, a developer with Versivo, which provides services to local government agencies, cracked enough of the USAID code to put reports from 1947-59 in .csv format on GitHub.
Bartlett said he volunteered his time on a weekend because “I’ve gotten kind of good at reverse-engineering these proprietary interfaces.”
He hopes that other developers will find ways to use his work to liberate more government data.
“After all,” he said, “we all own it anyway.”