Tag Archive: PDF

Nonprofit E-File Data Should Be Open

The IRS is refusing to release digital e-file data for public documents filed by nonprofits--instead, they release it as PDFs. This introduces wasteful barriers for people who want to use this data. Carl Malamud's been fighting to fix this problem. We at Sunlight join him in calling for the IRS to release 990 e-file data.


Coming to PDF? Get Warmed Up With a Hackathon

A bunch of the Labs team (and the rest of Sunlight) will be in New York next week for PDF 2012. It's one of the can't-miss events of our calendar year -- and not just because Sunlight counts Micah Sifry and Andrew Rasiej as close friends. PDF is a consistently great opportunity for like-minded folks to get together and share their visions for how technology can change society for the better. We've found more than a few team members at past PDFs; I don't think it's a coincidence.

This year the folks behind the event are trying something new: a two-day hackathon in the lead-up to the conference. They're calling it PDF: Applied, and if you have a talent for coding and a chance to make it to New York a little early, you should really consider attending. It's always exciting to see this kind of attempt to translate big thoughts into concrete action.


Redaction and Technical Incompetence

Felix Salmon, finance blogger extraordinaire, was inspired by some reporting by Bloomberg to have a look at Treasury's website. Apparently Tim Geithner visited Jon Stewart back in April, and Felix was understandably interested in seeing the evidence for himself. He went to the Treasury website, and then... well, things took a turn for the worse:

First, you go to the Treasury homepage. Then you ignore all of the links and navigation, and go straight down to the footer at the very bottom of the page, where there’s a link saying FOIA. Click on that, and then on the link saying Electronic Reading Room. Once you’re there, you want Other Records. Where, finally, you can see Secretary Geithner’s Calendar April – August 2010.

Be careful clicking on that last link, because it’s a 31.5 MB file, comprising Geithner’s scanned diary. Search for “Stewart” and you won’t find anything, because what we’re looking at is just a picture of his name as it’s printed out on a piece of paper.

In other words, these diaries, posted for transparency, are about as opaque as it can get. Finding the file is very hard, and then once you’ve found it, it’s even harder to, say, count up the number of phone calls between Geithner and Rahm Emanuel. You can’t just search for Rahm’s name; you have to go through each of the 52 pages yourself, counting every appearance manually.

Is this really how Obama’s web-savvy administration wants to behave? The Treasury website is still functionally identical to the dreadful one we had under Bush, and we’ve passed the midterm elections already. I realize that Treasury’s had a lot on its plate these past two years, but a much more transparent and usable website is long overdue.

This all sounds sadly familiar to me. I still remember when Treasury started posting TARP disbursement reports as CSVs instead of PDFs. I was working on Subsidyscope at the time, and had to load those reports on a weekly basis. It's more than a little sad how much better my life got when they made that change.
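To make the contrast concrete, here is a minimal Python sketch of the kind of question Felix wanted to ask. It assumes the calendar had been published as a CSV rather than a scanned image. The column names and rows are invented purely for illustration, and `count_entries` is a hypothetical helper; none of it reflects Treasury's actual data or format.

```python
import csv
import io

# Hypothetical calendar export. These rows are made up for illustration
# and are not taken from any real Treasury document.
SAMPLE_CSV = """date,type,counterpart
2010-04-01,call,Person A
2010-04-02,meeting,Person B
2010-04-05,call,Person A
"""


def count_entries(csv_text, name):
    """Count rows whose 'counterpart' column mentions `name` (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    needle = name.lower()
    return sum(1 for row in reader if needle in row["counterpart"].lower())


if __name__ == "__main__":
    print(count_entries(SAMPLE_CSV, "Person A"))
```

With structured text the count is a few lines of code. With a scanned PDF, the same question means OCR or paging through the document by hand.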

record of Timothy Geithner's meeting with Jon Stewart

But I think it's important to note that Felix's frustration isn't just the product of bad technology.


Elena’s Inbox: How Not to Release Data

screenshot of elenasinbox.com

On Friday @BobBrigham tweeted a suggestion: put the just-released Elena Kagan email dump into a GMail-style interface. I thought this was a pretty cool idea, so I started hacking away at it over the weekend. You can see the finished results at elenasinbox.com.

I'm really pleased that people have found the site useful and interesting, but the truth is that a lot of the emails in the system are garbage: they're badly-formatted, duplicative or missing information. For instance, one of the most-visited pages on the site is the thread with the subject "Two G-rated Jewish jokes" -- understandably, given that it's the most potentially-scandalous-sounding subject line on the first page of results. Unfortunately, if you click through you'll see that there's no content in the messages.

The site was admittedly a bit rushed, but in this case it isn't the code that's to blame. If you go through the source PDF, you'll see that the content is missing there, too. It looks like it might have been redacted, but the format of the document is confusing enough that it's difficult to be sure.

But the source documents' problems go beyond ambiguous formatting. A lot of the junky content on the site comes from the junk it was built from -- there's not much we can do about it. To give you some idea of the problem, consider these strings:


A Lesson in Humility

On Monday the House of Representatives delivered, as promised, an electronic dump of House Expense Reports. We at Sunlight Labs had a plan. We knew it was going to be a huge PDF, but we had all the infrastructure in place. We had plenty of bandwidth, and we knew when the data was coming out, roughly how it was going to look, and that we likely wouldn't be able to parse it all with computers. "We'll use TransparencyCorps," we thought, to get that last mile out of the data, so that eventually we'd end up with a parseable database.

