As stated in the note from the Sunlight Foundation's Board Chair, as of September 2020 the Sunlight Foundation is no longer active. This site is maintained as a static archive only.


Hack for Change Was Great

by

A quick note of thanks and congratulations to all those who participated in last weekend's Hack for Change event. Timball and I were in SF for the hackathon, and I know I speak for us both when I say it was thrilling to see so many smart people working on interesting and important problems.

The folks over at Tokbox have a great roundup of links to the various entries -- and I'd be remiss if I didn't mention the Ruby gem that Code For America's Erik Michaels-Ober cooked up for Sunlight's Real-Time Congress API (thanks, Erik!).

All in all, a great event -- many thanks to the good folks at change.org for making it happen.


Sarah’s Inbox: The Agony and the .tgz

by

Many of you have probably already seen that earlier today we stood up a copy of the Elena's Inbox code for the Sarah Palin email collection. You can find the site here. I think that by most reasonable standards, Sarah Palin is currently a less newsworthy figure than Justice Kagan was at the time of her confirmation. But there's no question that many people find her fascinating, and folks seem to really enjoy having this sort of interface available -- the response has been overwhelmingly positive, despite its horrifying Gmail 1.0 look (for what it's worth, Sunlight's design team deserves absolutely none of the blame for this one!).

It's worth taking a moment to reflect on what it took to get this site online. The state of Alaska released Governor Palin's email records on paper. News organizations had to have people on the ground to collect, scan and OCR these documents. Our thanks go out to Crivella West, msnbc.com, Mother Jones and ProPublica, whose incredibly quick and high-quality work provided us with the baseline data that powers the site.

But it wasn't yet structured data. It was easy enough to convert the PDFs into text, though this introduced some errors -- dates from the year "20Q7", for instance. Then we had to parse the text into documents, each with recipients, a subject line, and a sender. This is trickier than it might seem. Consider the following recipient list:

To: Smith, John; Jane Doe; Anderson; Andy (GOV); Paul Paulson

It's parseable... sort of. It turns out that, in this case, "Anderson; Andy (GOV)" refers to a single person, Andy Anderson. In this dataset semicolons separate the parts of a single name as well as the names themselves. It's a bit of a mess. Sunlight staff spent the better part of Monday performing a manual merge of the detected entities, collapsing over 6,000 automatically captured people down to fewer than half that number. I won't pretend that the dataset is now spotless, but it's considerably more structured than it used to be.
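If you're curious what that sort of cleanup looks like, here's a toy Python sketch of one possible heuristic for splitting a recipient line like the one above. The rules and names are illustrative only; this isn't the actual code behind Sarah's Inbox:

    # A toy heuristic for a semicolon-delimited recipient line in which some
    # semicolons separate people and others separate the halves of one
    # person's name ("Anderson; Andy (GOV)"). Illustrative only.
    def parse_recipients(to_line):
        tokens = [t.strip() for t in to_line.split(";") if t.strip()]
        people = []
        i = 0
        while i < len(tokens):
            token = tokens[i]
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            # Guess: a one-word token followed by a short token containing a
            # parenthesized org is probably "Last; First (ORG)" split in two.
            if nxt and len(token.split()) == 1 and "(" in nxt and len(nxt.split()) <= 2:
                people.append("%s, %s" % (token, nxt))
                i += 2
            else:
                people.append(token)
                i += 1
        return people

    print(parse_recipients("Smith, John; Jane Doe; Anderson; Andy (GOV); Paul Paulson"))
    # -> ['Smith, John', 'Jane Doe', 'Anderson, Andy (GOV)', 'Paul Paulson']

Even a rule like this gets plenty of real-world lines wrong, which is why the manual merge was unavoidable.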

And that structure makes possible not only novel interfaces like Sarah's Inbox, but also novel analyses. Consider this graph of how often the word "McCain" appears in the emails:

[Figure: total emails mentioning 'mccain' by week]

Interesting, right? More substantively, consider the efforts of Andrée McLeod, who's raising questions about an apparent gap in the Palin emails near the beginning of the governor's term. With the data captured, it's easy to visualize this -- here's a graph of the total email volume in the system by week, beginning with the first week of December 2006, when Palin took office:

[Figure: total released email volume by week]

(To be clear, I don't think you can necessarily conclude from this graph that there's anything nefarious about that period's low email volume -- there are plenty of potential explanations. Still, it's useful to be able to understand the outlier period in the larger context of the document corpus.)
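For the curious, the aggregation behind charts like these is simple once the data is structured. Here's a minimal sketch in Python, assuming a hypothetical list of (date, text) pairs rather than the site's actual Django models:

    # Minimal sketch of the weekly roll-ups behind the two charts above.
    # The emails list below is made up; the real data lives in a Django app.
    from collections import Counter
    from datetime import date, timedelta

    def week_of(d):
        """Bucket a date by the Monday of its week."""
        return d - timedelta(days=d.weekday())

    def weekly_counts(emails, term=None):
        counts = Counter()
        for sent, text in emails:
            if term is None or term.lower() in text.lower():
                counts[week_of(sent)] += 1
        return sorted(counts.items())

    emails = [
        (date(2008, 8, 25), "Meeting with McCain staff tomorrow"),
        (date(2008, 8, 26), "Budget follow-up"),
        (date(2008, 9, 1), "McCain event logistics"),
    ]

    print(weekly_counts(emails))                 # total email volume by week
    print(weekly_counts(emails, term="mccain"))  # emails mentioning 'mccain' by week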

Of course, these analyses and interfaces could be even better if Alaska had just released the files digitally. In fact, if they had, we might be able to draw some more solid conclusions: as our sysadmin Tim pointed out, message headers' often-sequential IDs could conceivably show whether there actually are missing emails from those first few weeks.
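Here's a sketch of the sort of check Tim has in mind, assuming (and it is only an assumption) that message IDs really were assigned sequentially:

    # Look for gaps in a set of numeric message IDs. Whether the mail system
    # actually assigned sequential IDs is an assumption; this is just the
    # check we'd love to be able to run against a digital release.
    def find_gaps(message_ids):
        ids = sorted(set(message_ids))
        return [(prev, cur, cur - prev - 1)          # (before, after, # missing)
                for prev, cur in zip(ids, ids[1:])
                if cur - prev > 1]

    print(find_gaps([1001, 1002, 1005, 1006, 1010]))
    # -> [(1002, 1005, 2), (1006, 1010, 3)]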

It's a shame that that didn't happen -- and not just because it meant my weekend was spent parsing PDFs. Releasing properly structured data ultimately allows everyone to do better work in less time. It's unfortunate that the authorities in Alaska introduced such a substantial and unnecessary roadblock.

But we at Sunlight can at least share what we've done to improve the situation. If you're interested in running your own analysis, you can find our code here, and the data to power it here (12M). At the moment it's in the form of a Django project -- if you need it in a different format, don't hesitate to ask on our mailing list. If you do something neat with it, please tell us!


Labs Update

by

I admit it: we missed the May Labs Update entirely. I'm sorry! It's been as busy as always around here, with a number of really neat -- but also really involved -- projects beginning to see some light at the end of their respective tunnels. Amidst that effort, we just plain forgot to get out a timely update last month. We promise we won't keep you in the dark like this again.

The most exciting news is that we have a bunch of new faces in the labs offices:

  • Drew Vogel has joined the Subsidyscope team, and after a few weeks of working remotely, he's now in the office, in person, and doing great things.

  • Ryan Sibley recently moved over to Labs from the Reporting Group. Ryan's been with Sunlight for a while, but the idea of having a journalist embedded with our developers is a new one, and something that I'm pretty excited about.

  • Casey Kimmey has joined the Open States team for the summer, where she'll be doing some invaluable data quality validation.

  • And finally, Montserrat Lobos is joining us for three weeks from our friends at Ciudadano Inteligente in Chile. She's going to be working with the design team; we're excited to have her.

Here's what the team has been working on:

Tom has been finally -- finally! -- finishing the process of getting the team back to a full headcount. He's also been doing the usual mixture of proposal writing, project oversight, and general triage. Also: some soldering and messing around with Titanium Mobile, the results of which will hopefully be published in a few months.

Alison added a new corporate accountability dataset to TransparencyData and Influence Explorer. She has also been working on several name matching tasks and making some major additions, including parsing for organization names, to our Name Cleaver name parsing library. In addition, she has been experimenting with some visualizations based on campaign finance data, which you may see cropping up on our site or in a blog post in the future.

Luigi is busy with the next generation of software that will power our Sunlight Live events. He wrote an article for HTML5 Rocks and took trips to Portland and Baltimore.

Drew added a few usability enhancements to the Subsidyscope search tool. Now he is working on changes to the data importer that will allow us to provide direct expenditure figures based on more current USASpending data.

Jeremy has been hard at work destroying Sunlight web sites, but in a good way. We have decided to retire Public Equals Online and integrate the features into the main Sunlight Foundation site and organizing page. Jeremy also rebuilt TransparencyCamp.org to add a brand new mobile app and HTML-based informational screens to display session information on monitors at the conference.

Eric has been integrating full-text searching with ElasticSearch into our Real Time Congress API for bills. There'll be new endpoints and features announced soon. He's also been wrapping up work on our soon-to-be-released iOS/Android mobile app to help people make better local health care decisions. Finally, he's been working on getting the first round of House expenditure data from the 112th Congress up into our expenditure database and House staff directory.

Aaron updated Follow the Unlimited Money for the new election cycle -- just in time for the special election in New York's 26th District -- and made some improvements to the Reporting Group's Lobbyist Tracker. He's also been working on Capitol Words (preview: the top words in Congress so far this year are "job", "cut", "create" and "repeal").

Kaitlin has been working on a third Roku app as well as building some backend tools for Subsidyscope. She's also been busy writing bombastic blog posts and checking up on her FOIA for contracting data quality reports. Also, she's been working with the other Caitlin to update the design and functionality of the Subsidyscope site.

Chris has been working on a variety of projects, including coding the new design for the House Staff Directory, creating graphics, signs, and other deliverables for Transparency Camp, designing a new background theme for Sunlight's YouTube channel, and producing miscellaneous graphics for other Sunlight projects. She is currently working on a new theme for Sunlight's Data Viz Tumblr blog and continuing to work on the House Staff Directory site.

Michael has been working on expanding the coverage of the Open State Project to new states, adding XML support to the API, and exploring visualizations of the data that's being collected.

Ethan has been coordinating a number of new products and features in Data Commons. This week we released Inbox Influence and added POGO's contractor misconduct database to Transparency Data and Influence Explorer. We're hard at work on several new data sets to be released in July.

With Sunlight Health teetering on the edge of completion, Caitlin has turned her focus to building out the Subsidyscope redesign, interrupted briefly by a jaunt through the South and, once again, by helping with updates to the Roku app that Kaitlin has been working on.

James has been adding support for Maine, New Hampshire, and Oregon to the Open State Project. He's also been working with new Open States intern Casey who has been doing data checking and cleaning to help promote more states from experimental to ready status.

Andrew prepped the Inbox Influence project for its launch, which was announced at the PdF conference in New York. He has also continued to work on extracting data from Regulations.gov.

Ali has been working on numerous small tasks to support the foundation's design needs. She's been doing a little bit of everything, from branding to visualizing data to giving the CFPB advice on their new mortgage forms. Currently the big project on her desk is the new Capitol Words site.


The Palin Emails and Redaction Technology

by

Today's release of the Palin emails is prompting frustration among reporters, environmentalists and people who know how to use computers over the fact that the documents are being delivered in the form of a huge, $700+ stack of paper. As Luigi pointed out on Twitter, this decision is being attributed to the difficulty of performing redaction properly within an all-digital system.

This is something I've written about here before. Redaction mistakes do happen -- the brilliant Tim Lee recently released some interesting work showing how to quantify just how often -- but doing it properly isn't rocket science. Digital workflows save time, money and material resources; and in cases like this one, they make it easier for the press to do its job. In other cases, like the one facing the PIDB, there's simply no choice: they'll never overcome the backlog they face without the help of information technology. It's long past time for government to get over its skittishness about digital redaction.

UPDATE: Be sure to check out the comments to this post. Jeremy Ashkenas -- who has personally had to haul the Palin emails, in paper form, across Juneau -- points out that the redaction workflow in this case does appear to have been digital... up to a point. The output, though, was thoroughly analog. If it's not one thing it's another...


West Coast Coders: Come Hack for Change!

by

Change.org is hosting a hackathon at their San Francisco headquarters in a couple of weeks. If you're in the Bay Area, handy with a text editor, and feel like doing some interesting work, I hope you'll consider joining the party (did I mention they'll be giving away $10,000 in prizes?). I'm pleased to say that Sunlight is a partner for the event, and indeed, I'll be on hand to talk to folks about our APIs, judge entries and, if at all possible, eat some Mission burritos.

You can find all of the details about Hack for Change here. I hope to see you in San Francisco on the 18th!


The Consequences of the e-Gov Cuts

by

If you haven't already, please be sure to check out my colleague Daniel Schuman's post over at the main Sunlight Foundation blog, where he details the consequences of the cuts to the e-Gov fund. The short version: in a letter to Sen. Carper, federal CIO Vivek Kundra is reporting that the cuts will negatively affect upgrades to a broad variety of executive branch transparency- and good-government-related websites; lead to the cancellation of FedSpace and the Citizen Services Dashboard; and hinder efforts at improving data quality.

There's no doubt this is bad news -- that the administration is already making excuses for not following through on fixing data quality is particularly discouraging. But there's also no question that things could have been worse. This fight isn't over yet, but our community has already made a big difference.

So thanks for your help, and for sticking with us as we try to ensure that our government doesn't stagger backward from its early, tentative steps into the online era.


Using our APIs is Absurdly Easy

by

A little while ago Ethan blogged about how to use our Influence Explorer APIs. It was a great intro to just how easy it is to start pulling influence data from our systems and into your projects.

But of course that's just one of several APIs that we offer. A couple of weeks ago I responded to an email from someone interested in matching a dataset of zip codes to congressional districts. This is a pretty common task for people doing research, or building advocacy websites, or otherwise trying to link citizens to their elected representatives. It also happens to be a problem that our APIs are perfectly suited to solving.

So here's an example that I wrote to try to show a non-programmer how to get up to speed with our APIs in Python. If you're on OS X or a Linux system, you've already got Python installed. If you're on Windows, you'll need to jump through a few more hoops -- this blog post should be helpful (it's probably a good idea to stick with a Python version earlier than 3.0). Hopefully this will show just how simple it can be to start using our services.

This particular code is oriented toward taking a CSV file with zip codes and adding information about the congressional districts associated with each zip. There's sample data included as well -- just a random assortment of zipcodes -- to help you see how everything works. You shouldn't need much more than a free API key and a command line prompt.

This code interfaces with our API through the use of a helper library. I've included that file too, but if you want the most up-to-date version you can find it here (Rubyists: we have a gem as well). I should also note that the code doesn't follow optimal conventions -- for instance, hardcoding the input filename is not how I'd normally do things -- but I think it's a bit easier to follow this way. I've tried to add a lot of comments.

For this exercise I assumed that the zip is in the row's final column -- the row[-1] code at line 26 determines this. This is the case for the sample file, but if you have your own CSV to process, it might not be. But it's easy to change this! If the zip is in the second-to-last column, for instance, you can use row[-2], and so on. You can also use positive addressing: row[0] is the first column, row[1] the second, etc. Please make sure that whatever CSV you use doesn't begin with a header row, as this will confuse our API and throw an error ("Dear API: which congressional districts fall within the zipcode with the number 'Zipcode'?").
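Here's what that indexing looks like against a made-up row, just to make it concrete:

    # A made-up row, purely to illustrate the column indexing described above.
    row = ["Jane Doe", "Springfield", "20500"]
    print(row[-1])   # "20500"       -- the last column (where the script expects the zip)
    print(row[-2])   # "Springfield" -- the second-to-last column
    print(row[0])    # "Jane Doe"    -- the first column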

To use the script:

  1. Obtain a free API key from services.sunlightlabs.com.

  2. Download this zip file and uncompress it. Place its contents in the same directory as the CSV file you want to process (or just use the included one, if you're trying things out -- you can put them in any old folder).

  3. Open getdistricts.py in a decent text editor, like TextWrangler (OS X), vim/emacs (Linux), or Notepad++ (Windows).

  4. Insert your API key in the appropriate spot on line 4.

  5. Change the value of the INPUT_FILENAME variable on line 5 to match your desired CSV's filename.

  6. In a terminal window, navigate to the appropriate directory and run the script by typing "python getdistricts.py".

You should see output as a query is made for each zip code (zip codes that have already been looked up will be cached). When the process is complete, a file called output.csv will be present in the same directory. It will contain the same columns as the source file, plus two new columns at each row's end: one with the number of districts within that zipcode, and another with those districts delimited with semicolons.
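If you'd rather see the shape of the thing before downloading anything, here's a stripped-down sketch of the same loop. Fair warning: the API URL and JSON layout below are written from memory and simplified, and the file names are placeholders, so treat the real getdistricts.py and helper library as the authoritative versions.

    # A stripped-down sketch of the workflow above: read a CSV whose last
    # column is a zip code, look up that zip's congressional districts
    # (caching repeats), and write output.csv with two extra columns.
    # The API URL and response layout here are assumptions, not gospel.
    import csv
    import json
    import urllib2  # Python 2, per the post's advice; urllib.request on Python 3

    API_KEY = "your-key-here"           # get one free at services.sunlightlabs.com
    INPUT_FILENAME = "sample_data.csv"  # placeholder; set to your own CSV

    def get_districts(zip_code):
        url = ("http://services.sunlightlabs.com/api/districts.getDistrictsFromZip.json"
               "?apikey=%s&zip=%s" % (API_KEY, zip_code))
        response = json.load(urllib2.urlopen(url))
        return ["%s-%s" % (d["district"]["state"], d["district"]["number"])
                for d in response["response"]["districts"]]

    cache = {}
    with open(INPUT_FILENAME) as infile, open("output.csv", "wb") as outfile:
        writer = csv.writer(outfile)
        for row in csv.reader(infile):
            zip_code = row[-1]  # the zip is assumed to live in the last column
            if zip_code not in cache:
                print("looking up %s" % zip_code)
                cache[zip_code] = get_districts(zip_code)
            districts = cache[zip_code]
            writer.writerow(row + [len(districts), ";".join(districts)])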

That's it! Now, yes, if you're coming to this as a complete newbie, following these steps probably won't make you instantly comfortable with programming. But for those who've tinkered but never tangled with a real API, hopefully this will go some of the way toward showing how easy it is to use our services. And don't forget: if you run into trouble, we're here to help.


Data Visualeggzation

by

Months ago, Josh and Tim bought an Egg-Bot kit from Evil Mad Scientist Laboratories. Despite the obvious utility of this piece of office equipment, it fell into disuse not too long after assembly. But with the year's premier egg-decorating holiday fast approaching, we decided to dust off the Egg-Bot and see if we couldn't put it to good use during our team's weekly lunch meeting. Things kind of spiraled out of control from there. We blame the sugar high from eating all that candy.


The Worst Government Website We’ve Ever Seen?

by

Yesterday the government's Federal Awardee Performance and Integrity Information System (FAPIIS) came online. This is something we've been looking forward to for a while. It's easy to find horror stories about the mismanagement of contracts; this isn't surprising when you consider the disorganized constellation of contractor oversight databases that exists, many of which aren't open to the public. Getting FAPIIS online should be a step toward fixing that problem. Yesterday government took that step.

POGO has some thoughts about it that are certainly worth your time. But we can't help chiming in as well. In short: this site is terrible. As one colleague said, "This might be the worst website I've ever seen."

This is at least debatable. Contracting databases are part of the world of procurement, procurement is heavily influenced by the Defense Department, and DoD has a proud heritage of producing websites so ugly that they make you want to claw out your eyes. So FAPIIS has company. But if this was just a question of aesthetics, we wouldn't be complaining.

Assuming you're using one of the few web browsers in which the site works at all (Chrome and Safari users are out of luck), the experience is off-putting from the start, as users are warned that their use of the site may be monitored, surveilled, or otherwise spied upon (you don't necessarily surrender your right to speak privately to your priest by using the website, though -- thanks for clearing that up, guys!). Perhaps this is why their (arguably superfluous) SSL certificate is utterly broken. But let's click past the security warnings and proceed.

Here's the next screen. It contains a captcha.

Let's be clear: the use of a captcha to gate government data is outrageous. Government should be making its data more accessible and more machine-readable. Captchas are designed to interfere with automated tools that facilitate malicious acts. But downloading government data is decidedly not a malicious act. Why are we trying to limit machines' ability to use this data?

But our irritation with the captcha is softened a bit by how laughably inept its implementation is. It's made of black and white text, unrotated, unskewed, superimposed on the same black and white grid every time. Here's a stab at how you'd beat it:

  1. Subtract grid
  2. Flip every white pixel that's bordered by 2 or more black pixels to black
  3. Identify columns of all-white pixels and slice the image by them
  4. Crop the resulting slices, then recombine
  5. OCR
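Just to illustrate, in Python with the Python Imaging Library and an OCR wrapper like pytesseract, those steps might look roughly like this. The file names, the availability of a clean copy of the grid, and the thresholds are all assumptions; this is a sketch of the approach, not something we've actually run against the site.

    # Illustrative only: assumes the captcha is saved as captcha.png, the
    # fixed background grid as grid.png, and that PIL/Pillow and pytesseract
    # are installed. Mirrors the five steps listed above.
    from PIL import Image, ImageChops
    import pytesseract

    captcha = Image.open("captcha.png").convert("L")  # grayscale captcha
    grid = Image.open("grid.png").convert("L")        # the ever-identical grid

    # 1. Subtract the grid: wherever the grid is dark, force the pixel to white.
    cleaned = ImageChops.lighter(captcha, ImageChops.invert(grid))

    # 2. Flip white pixels with two or more dark 4-neighbours to black.
    src = cleaned.load()
    out_img = cleaned.copy()
    out = out_img.load()
    width, height = cleaned.size
    for x in range(1, width - 1):
        for y in range(1, height - 1):
            neighbours = [src[x-1, y], src[x+1, y], src[x, y-1], src[x, y+1]]
            if src[x, y] > 128 and sum(1 for n in neighbours if n < 128) >= 2:
                out[x, y] = 0

    # 3. Columns containing no dark pixels mark the boundaries between letters.
    def column_is_blank(img, x):
        return all(img.getpixel((x, y)) > 128 for y in range(img.size[1]))

    # 4. Crop each run of non-blank columns out as its own letter.
    slices, start = [], None
    for x in range(out_img.size[0]):
        if not column_is_blank(out_img, x) and start is None:
            start = x
        elif column_is_blank(out_img, x) and start is not None:
            slices.append(out_img.crop((start, 0, x, out_img.size[1])))
            start = None

    # 5. OCR each letter in single-character mode and recombine.
    print("".join(pytesseract.image_to_string(s, config="--psm 10").strip() for s in slices))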

You could probably get this done using a stock PHP distribution in about an afternoon. But you don't need to, because even this pathetic level of security isn't properly implemented! Instead, the SHA1 hash of the captcha's human-readable text is sent to the client in a hidden field. That hash is compared to the hash of whatever the user enters for the captcha code. So a scraper can just ignore the captcha and resend a solved hash for every request -- it'll work just fine.¹ They didn't even salt the hash. Whoever wrote this has absolutely no idea how to implement a secure system.
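To make the flaw concrete, here's a toy model of that validation logic. The field names are placeholders, not the site's actual parameter names, but the shape of the problem is the same:

    # Toy model of the broken check: the server ships sha1(answer) to the
    # client in a hidden field, then "validates" by hashing whatever the user
    # typed and comparing it to the hash the client sent back.
    import hashlib

    def sha1(text):
        return hashlib.sha1(text.encode("utf-8")).hexdigest()

    def server_validates(form):
        # No salt, no server-side secret: both sides of the comparison are
        # supplied by the client.
        return sha1(form["captcha_answer"]) == form["captcha_hash"]

    # A legitimate user reads the image and types the answer the server hashed.
    print(server_validates({"captcha_answer": "K7XQ2", "captcha_hash": sha1("K7XQ2")}))  # True

    # A scraper never looks at the image: it submits any string it likes along
    # with that string's own hash, on every single request.
    canned = {"captcha_answer": "whatever", "captcha_hash": sha1("whatever")}
    print(server_validates(canned))  # True, every single time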

After the captcha, things start to get really weird, with radio buttons wired to onclick handlers standing in for hyperlinks. It's unclear to me whether the programmers responsible for this interface had ever actually used the web or simply had it described to them. Either way, whoever built this should be embarrassed. Whoever managed the project should be embarrassed. Whoever signed off on delivery should be embarrassed! But we haven't even gotten to the worst part yet.

That's because while all of the above will be embarrassing to any developer who takes pride in his or her craft, the quality of a government website is ultimately less important than the data it exposes. And there is no FAPIIS data in FAPIIS. Not yet, anyway. Such data exists, mind you. But the decision was made not to include any historical data when FAPIIS went public. Presumably the contractors who did a bad job, and who were reported for doing so, are concerned that people might look at those reports and get the impression that, uh, they did a bad job. Others may be concerned that the database could cast them in a bad light and raise uncomfortable questions. That government caved in to the demands of these vendors -- vendors who are supposed to be serving government! -- can only be described as an act of craven capitulation. We've FOIAed for this data, and if we're lucky, perhaps we'll even get it. But it ought to be online right now.

As a matter of principle, it's good to see government opening closed databases, and Congress deserves credit for deciding to open this one. But what has followed that decision deserves only the smallest quantity of plaudits still distinguishable from zero. I hope that the site removes the captcha, offers bulk downloads, and fills up with useful, unsanitized data. But whoever built this travesty deserves to have an entry in FAPIIS of their own.

1: You do need to update the JSESSIONID cookie and get a fresh value for the org.apache.struts.taglib.html.TOKEN hidden variable, but this is easy enough to do.

