Happy Document Freedom Day! What are you doing to celebrate?
While it may not be obvious, open standards go hand-in-hand with open government. If we are asking our government to make information about itself available to the public, documents should be released in a way that everyone can access. Access to government information should not be only for the privileged few that can afford expensive commercial software needed to open files in proprietary formats. The public should be able to use any device, operating system, or software of their choosing without having to worry about what documents and data they won't be able to access.
Please note that while we advocate for the use of open document formats, it is more important to release data in appropriate formats. You won't get any praise from us for releasing earmark requests in ODF instead of a proprietary format. Structured data should always be released in programmer-friendly formats such as JSON, CSV, and XML.
Additionally, open formats should be offered in addition to common proprietary formats instead of replacing them completely. We're not zealots here; proprietary document formats make sense for a large number of users. Just be kind to those of us that choose to not run the software needed to access the documents and give us an open choice.
Sunlight is celebrating by making the following pledges:
- Any document created by Sunlight that is published in a proprietary format will also be published in an open document format.
- We will update our recommendations to government to include the publishing of open document alternatives when proprietary formats are used. The Ten Principles for Opening Up Government Information hint at this, but we'll make it explicit.
So yay open document formats! Visit documentfreedom.org for more information or download this post in ODF. Sorry, I just had to do that.
Blog Posts Via Email With CloudMailin.com
I recently learned (with horror) that a co-worker wrote her blog posts in Gmail, copied the rich text to WordPress, then copied and pasted the generated HTML into our Markdown-enabled blog backend. To be fair, our nerdy authoring tool is a bit much for non-technical users and doesn't really fit into most "normal" workflows. Additionally, she emails her posts to an internal list so Gmail was a natural authoring tool.
There had to be some common ground: blog posts would still be written in Markdown, and she could keep using Gmail to write them. Our solution was to enable post-by-email on the blog. When she adds a special email address to the recipients, the message is parsed into Markdown, a draft post is created, and she receives an email reply a few seconds later with a link to edit the new post. From there she can review and publish it in a few clicks, resulting in a much improved workflow.
We wanted the draft posts created immediately and I didn't care to be polling a mail server every few seconds. Fortunately, we found a new service that made this project incredibly easy to implement.
CloudMailin.com
CloudMailin.com is a fantastic service that does the opposite of most other mail services. Rather than providing an API-based method of sending email like Postmark, another fantastic service, CloudMailin.com receives email at a provided address and POSTs the data to a URL of your choosing. In addition to the simple parsing of SMTP headers and MIME parts, the service can handle email attachments. Pay them a few bucks extra and they'll upload the attachments to one of your S3 buckets!
A competing service we evaluated started at a pricey $30 a month; a bit ridiculous if we are receiving 5 emails a week to start. CloudMailin.com's recently announced pricing is right on the mark with a 200 message free plan and a 3000 message micro plan for $9 per month.
So how did we make it work? Let's look at some code...
django-cloudmailin
django-cloudmailin is a Django app we created to make working with CloudMailin.com as simple as possible. First we need a method that will receive the posted email message parameters and create a blog post.
In create_post we extract the parameters from the message to get the author, title, and content of the post. A post object is created and an email is sent back to the original sender of the email with a link to the Django admin for the new post. The author needs to check to make sure the post looks correct and hit publish. This is a greatly simplified example because we do some additional parsing of the content to transform the plain text into valid Markdown, but it should give you an idea of how it works.
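As a rough illustration of the shape of create_post, here is a minimal sketch in plain Python. It is not the code we run in production: the DraftPost dataclass stands in for our Django model, and the parameter names 'from', 'subject', and 'plain' are assumptions about what the posted message contains. The reply email and the Markdown conversion are left out.

```python
from dataclasses import dataclass
from email.utils import parseaddr


@dataclass
class DraftPost:
    """Hypothetical stand-in for the blog's Django Post model."""
    author_email: str
    title: str
    content: str
    is_published: bool = False


def create_post(**params):
    """Build an unpublished draft from the POSTed message parameters.

    The keys 'from', 'subject', and 'plain' are assumed names for the
    sender header, the subject line, and the plain-text body.
    """
    # parseaddr splits "Jane Doe <jane@example.com>" into (name, address).
    _, author_email = parseaddr(params.get("from", ""))
    title = params.get("subject", "").strip() or "(untitled)"
    content = params.get("plain", "").strip()
    # The draft stays unpublished until the author reviews it in the admin.
    return DraftPost(author_email=author_email, title=title, content=content)
```

In a real Django app the function would save a model instance and send the reply email with the admin link instead of returning a dataclass.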
Next we register that method with the mail handler.
MailHandler is a class-based view provided by django-cloudmailin that handles the registration and processing of mail messages. In this example we register our CloudMailin.com email address and secret key with the method that is to be invoked upon receipt of a new message. Multiple email addresses can be registered with the handler to allow for many different actions-by-mail in the same application. Finally the MailHandler instance is associated with a URL pattern in urls.py.
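To make the registration idea concrete, here is a small self-contained sketch of an address registry. It is deliberately not the real MailHandler: the class name SketchMailHandler, its method names, and the 'to' parameter are all hypothetical and only show how one handler can route several addresses to different callbacks.

```python
class SketchMailHandler:
    """Illustrative registry mapping a recipient address to (secret, callback)."""

    def __init__(self):
        self._routes = {}

    def register_address(self, address, secret, callback):
        # Store addresses lowercased so lookups are case-insensitive.
        self._routes[address.lower()] = (secret, callback)

    def dispatch(self, params):
        """Invoke the callback registered for the message's 'to' address."""
        route = self._routes.get(params.get("to", "").lower())
        if route is None:
            raise LookupError("no handler registered for this address")
        _secret, callback = route
        return callback(**params)


# Placeholder address and secret; the callback just echoes the subject.
handler = SketchMailHandler()
handler.register_address(
    "drafts@example.com", "not-a-real-secret", lambda **p: p["subject"]
)
```

Registering a second address with a different callback is all it takes to add another action-by-mail to the same application.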
All incoming messages are signed with your secret key to prevent any old person from spamming your mail endpoint. The MailHandler instance takes care of verifying the signature so you can concentrate on writing your application.
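For a feel of what that verification involves, here is a minimal sketch. The exact signing scheme is an assumption made for illustration (an MD5 hash over the parameter values, sorted by key, followed by the secret); check the service's documentation for the real one. The function names are hypothetical too.

```python
import hashlib
import hmac


def generate_signature(params, secret):
    """Hash the parameter values (sorted by key) followed by the secret.

    Assumed scheme for illustration only; the 'signature' field itself is
    excluded from the hash.
    """
    joined = "".join(
        str(params[key]) for key in sorted(params) if key != "signature"
    )
    return hashlib.md5((joined + secret).encode("utf-8")).hexdigest()


def is_authentic(params, secret):
    """Return True when the supplied signature matches the recomputed one."""
    supplied = params.get("signature", "")
    expected = generate_signature(params, secret)
    # compare_digest does a constant-time comparison of the two byte strings.
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))
```

A handler would call is_authentic before invoking the registered callback and reject the request when it returns False.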
You can find the source for django-cloudmailin on GitHub.
Sunlight Labs & Google Summer of Code 2011
We're proud to announce we've been accepted as a mentoring organization for the Google Summer of Code 2011.
If you aren't familiar with Google Summer of Code, it is a great opportunity for college students and open source organizations to work together. Google pays students a $5000 stipend in exchange for their work on an eligible project. For more details about the program in general visit the GSoC 2011 website.
This is our third year participating and we're looking forward to another great summer and a new batch of students and projects.
Defining “High Value Data” Is Hard. So Let’s Not Do It.
Yesterday I had the pleasure of sitting on a Sunshine Week panel moderated by Patrice McDermott, along with CRP's Sheila Krumholz, Pro Publica's Jennifer LaFleur and Todd Park of HHS. We touched on a lot of different topics, including one that by now is probably familiar to anyone who's followed the progress of the Open Government Directive: frustration with the vagueness of the term "high value datasets." Various organizations--Sunlight included--have criticized the administration for releasing "high value" datasets that seem to actually be of questionable usefulness.
Jennifer coined a formulation of what she considers to be a high value dataset, and it attracted some support on the panel:
Information on anything that's inspected, spent, enforced, or licensed. That's what I want, and that's what the public wants.
I don't think this is a bad formulation. But while I'm not anxious to tie myself into knots of relativism, we should keep in mind the degree to which "high value" is in the eye of the beholder. It's clear how Jennifer's criteria map to the needs of journalists like those at Pro Publica. But if you consider the needs of someone working with weather data, or someone constructing a GIS application--two uses of government data that have spawned thriving industries, and generated a lot of wealth--it's obvious that the definition isn't complete. To use a more melodramatic example, if World War III broke out tomorrow, a KML inventory of fallout shelters could quickly go from being an anachronism to a vital asset.
The point isn't that Jennifer's definition is bad, but rather that any definition is going to be incomplete. The problem isn't that agencies did a bad job of interpreting "high value" (though to be clear, some did do a bad job); rather, it's that formulating their task in this way was bound to produce unsatisfactory results.
We're going about this backward. Ideally, we'd be able to start by talking about what the available datasets are, not by trying to figure out what we hope they'll turn out to be. Government should audit its data holdings, publish the list, then ask the public to identify what we want and need. This won't be easy, but it's far from impossible. And any other approach will inevitably leave the public wondering what we're not being told.
What’s Going On In The Labs
... or what was going on in the labs. I'm horribly late in posting this -- it turns out that I'm much, much worse at this than Josh was. Just another piece of evidence that we need more talented folks around here! Remember, we still have open positions.
Luigi has been working on Datajam, a data-driven platform for reporting live events on the Web. You can follow its development on GitHub. Datajam will soon power our Sunlight Live events.
Jeremy has been working on various Sunlight sites including the relaunch of the Advisory Committee on Transparency. February also saw the launch of Capitol Defense, a JavaScript/SVG/HTML game developed with Andrew and Chris. Other various interesting tasks included: launching Sunlight Jobs, teaching a half-day HTML class to Sunlight employees, releasing django-cloudmailin which we use for blog post drafting via email, and preparing for TransparencyCamp 2011.
Ethan attended the Computer Assisted Reporting Conference, worked on an algorithm for fast entity matching in text, and researched new content for the Influence Explorer homepage. He's now planning for new corporate accountability datasets and new lobbying-related features.
Eric released the Real Time Congress API, and version 3.0 of the Congress app for Android. He also continued his work on an upcoming mobile app to help people make better local health care decisions.
Kaitlin had a lovely vacation and then spent several days updating the USASpending data on Subsidyscope and is now squashing bugs in the soon-to-be-expanded tax expenditure database on the site. She also interviewed many a candidate for Subsidyscope and pitched in a little bit on the Clearspending testimony.
timball has been crying a lot over ISPs and is starting to familiarize himself with Chef, a new Ruby-based scaling solution. Also he says he gained 5lbs from eating in NOLA. We thought you should know.
Chris has been fabulously wireframing new layouts for the House Staff Directory, designing magically delicious HTML emails and newsletters, creating spectacular presentations promoting Sunlight's awesomeness, and providing Sugar-free-Red-Bull-fueled graphics support for a variety of little projects along the way (e.g. Capitol Defense, one Influence Explorer postcard, Sunlight's meetup page, new Twitter background, etc).
James and Michael have continued the process of expanding the reach of the Open States Project and migrating content to the new site. The most recent update brings the project to 20 states and the District of Columbia. New functionality in the API is in the works, including the ability to query for bills by sponsor or issue area. We are also working on adding more ways for people to access the data without having to access the API directly.
Aaron added an additional lobbying dataset to the Reporting Group's lobbying tracker. Users can now see a list of post-employment notifications for former congressional staffers and members, including when they'll be eligible to lobby their old colleagues. He's also continued work on Capitol Words.
David is working on an analytics dashboard. He uploaded some sample data to Google's Public Data Explorer. He also worked on pulling structured data out of GAO reports, making some progress but hitting some obstacles along the way.
Caitlin has been working with Eric and the reporting team on nailing down wireframes for the healthcare app and has been translating them into pretty sexy comps. She is also working with the other Kaitlin to redesign and streamline the Subsidyscope site. ...and stuff. She also helped launch the new Openstates site since the last Labs update.
Ali has been making a lot of ads lately to remarket the Sunlight Foundation and the reporting group and for new and upcoming Sunlight Live events. She has also been working on building out a new page for the organizing section of the Foundation and Sunlight Live.
Andrew has been working on new tools for adding influence-related context to text, focusing on a plugin for enhancing Gmail. He has also been experimenting with new scraping technologies.
Alison has been updating our Wikipedia scraper to pull in corporate logos to display on the organization pages in Influence Explorer. She has also been working on adding information to Influence Explorer detailing which bills organizations hired lobbyists to work on.
...and I (Tom) have been working on a bunch of proposals, organizing meetings around the corporate ID issue, writing some testimony related to Clearspending, and trying to find staff to fill the spots left by Josh and Kevin's departures. Also, daydreaming about what we're going to do with these enormous 7-segment LEDs.
LexPop
I hope that readers will spare a second to check out LexPop. It's a contribution to a problem that a lot of you are interested in: how to allow citizens into the legislative process to a greater degree. There's no question that the old machinery that we use for transmitting public opinion to lawmakers and rulemakers suffers from some serious pathologies. So I've been very glad to see efforts like POPVOX and Expert Labs emerge.
LexPop is working in that same vein. I met Matt Baca, one of the people behind the project, at an event last month, and was struck by the ambition of his experiment. LexPop isn't working at the federal scale, but the scope of what they're doing is large: they're trying to write a state law from start to finish. What makes the effort really fascinating is that they've got a legislator interested, ready to engage with the process. It's going to be interesting to see how this unfolds.
Clearspending Heads To Capitol Hill
I'm thrilled to say that tomorrow morning Sunlight's Executive Director, Ellen Miller, will be testifying before Congress about our Clearspending project. You can read more about it here, or just check out the posts we wrote about Clearspending back when it launched.
We think that the data quality problems identified by the project are important, and we're glad to see that government is taking them seriously. Without a clear understanding of how our government spends money, it's difficult to make smart decisions about how to adjust that spending.
Having Congress pay attention to our results is a tremendous vindication for the work that Kaitlin and Kevin have done on Clearspending. I think it's also a great example of why Sunlight is such a cool place to work. Where else can your diligent SQL-wrangling turn into a chance to give sworn testimony before Congress?
And speaking of working here: as I've mentioned before, we have a couple of open positions. As you might imagine, preparing testimony has gotten in the way of reviewing resumes. But we'll be diving back into that process very soon. If you've been thinking about it, stop hesitating!
USASpending.gov Data Quality — Still Bad?
We at the labs have written about USASpending.gov several times now. We’ve recently been able to make use of their bulk data downloads to regularly populate some of our webapps with federal grants and contracts data. However, we also have an old snapshot of the data that we received in April of 2010. This snapshot was received on a hard drive that we shipped to USASpending engineers -- before the bulk data downloads existed. Thankfully, we don’t have to go through that process anymore. I wondered how the data has changed over the past year. Last year, the USASpending team took a lot of flak for their data quality issues. Has it been improved? I thought I’d take a look back and see how two data snapshots from April 2010 and December 2010 compare.
django-mediasync 2.1 for Django 1.3
Earlier today we released django-mediasync 2.1 in anticipation of Django's upcoming 1.3 release. The Django 1.3 RC was released last night so the final version should be coming any day now. Django 1.3 changes the way static files are handled, which breaks previous versions of mediasync. The old MEDIA_URL and MEDIA_ROOT settings are now meant to handle media uploaded by users, while two new settings, STATIC_URL and STATIC_ROOT, handle static site content.
Mediasync will first try to use STATIC_ settings and fall back to MEDIA_ if not found. This ensures that mediasync will work regardless of the version of Django being used.
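The fallback can be pictured with a few lines of Python. This is only a sketch of the idea, not mediasync's actual code: resolve_setting is a hypothetical helper, and SimpleNamespace objects stand in for Django's settings module.

```python
from types import SimpleNamespace


def resolve_setting(settings, suffix):
    """Return STATIC_<suffix> when it is set, otherwise MEDIA_<suffix>."""
    value = getattr(settings, "STATIC_" + suffix, None)
    if value:
        return value
    # Older Django versions only define the MEDIA_ settings.
    return getattr(settings, "MEDIA_" + suffix, None)


# Stand-ins for a pre-1.3 project and a 1.3-style project.
old_style = SimpleNamespace(MEDIA_URL="/media/", MEDIA_ROOT="/srv/media")
new_style = SimpleNamespace(STATIC_URL="/static/", MEDIA_URL="/media/")
```

With these stand-ins, the old-style project resolves URL to its MEDIA_URL value, while the new-style project resolves it to STATIC_URL.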
Find the package on PyPI and the source on GitHub. And as always, if you use mediasync please indicate it on Django Packages.
Sunlight and Open Government on TWiT
Those of us who grew up watching ZDTV (later called TechTV) have fond memories of Leo Laporte. In recent years, Leo has been building a podcast empire called TWiT. Jeremy and I were on the TWiT Network's FLOSS Weekly show to discuss open government in the context of free, libre, and open source software. Give it a listen, or watch the video.
While we're on the subject, there have been a few other tech podcasts where Sunlight has made an appearance: