So next week, Adobe’s having a conference here to tell Federal employees why they ought to be using “Adobe PDF, and Adobe® Flash® technology” to make government more open. They’ve spent what seems to be millions of dollars wrapping buses in DC with Adobe marketing materials, all designed to tell us how necessary Adobe products are to Obama’s Open Government Initiative. They’ve even got a beautiful website set up to tout the government’s use of Flash and PDF, and the conference itself will cover how Government should use ubiquitous and secure technologies to make government more open and interactive.
Here at the Sunlight Foundation, we spend a lot of time with Adobe’s products, mainly trying to reverse the damage that these technologies create when government discloses information. The PDF file format, for instance, isn’t particularly easy to parse. As ubiquitous as PDF files are, oftentimes they’re non-parsable by software, unfindable by search engines, and unreliable when text is extracted from them.
Take, for instance, H.R. 3200, otherwise known as “America’s Affordable Health Choices Act of 2009”, a 1,017-page healthcare bill from Congress. Because it is primarily published in PDF, we’ve got to build a special parser just for that one bill in order to represent it programmatically. Or take Carl Malamud’s IRS filings for 527 (stealth PAC) organizations: gigabytes of PDF files, all released by government. Government releasing data in PDF tends to be catastrophic for Open Government advocates, journalists and our readers because of the amount of overhead it takes to get the data back out. When a government agency publishes its data and documents as PDFs, it makes us Open Government advocates and developers cringe, tear our hair out, and swear a little (just a little). Most earmark requests by members of Congress are published as PDF files of scanned letters, leading the Sunlight Foundation and others to write custom parsers for each letter.
Yet, for some reason, Adobe feels its products are essential to the new administration’s mission of transparent and open government. I, on the other hand, feel like picketing the event they’re having next week to sell their wares (hey hey! ho ho! your-binary-low-parsable-formats-for-government-data has got to go!) because in fact, these formats do quite the opposite. Here at Sunlight we want the government to STOP publishing bills and data in PDFs and Flash and start publishing them in open, machine-readable formats like XML and XSLT. What’s most frustrating is that Government seems to transform documents that are already in XML into PDF to release them to the public, thinking that that’s a good thing for citizens. Government: We can turn XML into PDFs. We can’t turn PDFs into XML.
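To show why a structured source matters so much to developers, here is a minimal sketch in Python using only the standard library. The markup is invented for illustration: the element names (`bill`, `section`, `heading`, `text`), the placeholder section text, and the helper `bill_to_dict` are all assumptions, not the schema Congress actually uses or real language from H.R. 3200.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical bill markup, made up for this example. The real
# legislative XML is far richer; only the general idea carries over.
BILL_XML = """
<bill number="H.R. 3200">
  <title>America's Affordable Health Choices Act of 2009</title>
  <section id="1">
    <heading>Short title</heading>
    <text>Placeholder text for section 1.</text>
  </section>
  <section id="2">
    <heading>Definitions</heading>
    <text>Placeholder text for section 2.</text>
  </section>
</bill>
"""


def bill_to_dict(xml_text):
    """Turn the hypothetical bill XML into plain Python data."""
    root = ET.fromstring(xml_text.strip())
    return {
        "number": root.get("number"),
        "title": root.findtext("title"),
        "sections": [
            {
                "id": section.get("id"),
                "heading": section.findtext("heading"),
                "text": section.findtext("text"),
            }
            for section in root.findall("section")
        ],
    }


if __name__ == "__main__":
    # Once the data is structured, re-publishing it as JSON is one call.
    print(json.dumps(bill_to_dict(BILL_XML), indent=2))
```

With a source like this, every section arrives already labeled, and the same data can be rendered as a web page, a feed, or a printable document. Getting the same structure back out of a PDF means guessing at headings and section boundaries from layout, which is why a custom parser ends up being written for each document.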
Flash isn’t off the hook either. Government has spent lots of time and money developing Flash tools to allow citizens to view charts and graphs online, and while we’re happy the government is interested in allowing citizens to do this, Government’s primary method of disclosure should not be these visualizations, but rather publishing the APIs and datasets that allow citizens to make their own. Only after those things are completed to the fullest extent possible should government be working on its own visualizations. Adobe may say in their open government whitepaper:
“Since the advent of the web, an entire infrastructure has evolved to enable public access to information. Such technologies include HTML, Adobe PDF, and Adobe® Flash® technology.”
This is nonsense. The fact is, sticking to open, standards-based technologies like HTML, XML, JSON and others is far more important and useful in getting your information out to the public than relying on the proprietary formats of Adobe. Here’s a hint: if the data format has an ® by its name, it probably isn’t great for transparency or open data.
So don’t get me wrong: I appreciate, just like the next guy, that I can download a nice PDF file of an IRS form, print it out, and send it in. I think that members of Congress publishing their “Dear Colleague” letters with accuracy is great and important, and I think the pie charts on the IT dashboard are really neat. But when it comes down to it, these technologies aren’t helping to fully open our government. They have their place, but in terms of transparency and openness, I’m afraid they do more harm than good. Relying on them alone yields only frustration for the people who use government’s published data the most. They should be considered a bell or a whistle on top of the foundation an agency should lay to be fully transparent: putting data online, and obeying the 8 principles of Open Data to the fullest extent.
Update (3:10pm): At the strong urging of our Policy Director, I’ll add this caveat: any time Government decides to release data to the public, we’re glad that government has taken a step forward. But for data, and for large documents like bills, government should strongly consider open, machine-readable, parsable alternatives to the PDF file format. There are plenty, and we’re happy to help you find them.
Update (3:20pm): PJ Doland has the right answer. PDF by itself is insufficient. So is Flash. What makes PDF in particular bad is that the conversion only runs one way: you can turn XML into a human-readable PDF, but more often than not you can’t turn a PDF back into a machine-readable XML/JSON/whatever file.