The Cyber Intelligence Sharing and Protection Act (CISPA) passed the House by a comfortable margin last week despite loud opposition from privacy groups, a veto threat from the White House, and uncertain prospects in the Senate. Lawmakers made several changes to the bill aimed at easing privacy concerns. Unfortunately, a provision that should give transparency advocates pause not only survived, but is spreading to other cybersecurity legislation.
When CISPA was originally introduced in the 112th Congress it contained language that would effectively exempt all information about "cyber threats" shared via the bill from the Freedom of Information Act. That provision survived in the version of CISPA that passed the House last week, and similar language has worked its way into another piece of legislation, the SECURE IT Act, introduced earlier this month.
Wholesale exemptions for "cyber threat information" will prevent public oversight and deny citizens and watchdogs the ability to understand how the government and businesses communicate about and respond to cyber threats. The most sensitive information that would be shared through these bills is already protected from disclosure through existing FOIA exemptions. It is hard to see a compelling reason to subvert the FOIA altogether when it comes to cybersecurity.
Privacy advocates are concerned that personal information will be subject to over-sharing and misuse. Without access to rights provided by the FOIA there will be no way to hold those in power accountable if they are collecting too much information or misusing the data they obtain.
The Freedom of Information Act is a cornerstone for public oversight of government activity. Any change to the law deserves a vigorous and open debate.
CISPA and the SECURE IT Act give government officials broad new powers and the current FOIA provisions provide them with blanket protection from public scrutiny. These new, overly broad exemptions are unnecessary and should not be passed into law.
The Philippine Department of Agriculture is embracing the message of transparency being spread by their president with a new open data portal. The portal aims to raise public awareness of the department's projects and includes budget data, photos, and mapping features. (Future Gov)
A new proposal from the European Commission would require companies to publicly disclose information about their anti-bribery and corruption efforts. The proposal would target some 16,000 European companies that have at least 500 employees. (TrustLaw)
A prominent Russian blogger and political activist who exposed corruption in the United Russia party is facing up to ten years in prison on charges of corruption that independent reviewers have called "laughably bogus." Aleksei Navalny, who got his start advising a provincial governor, exposed millions of dollars in corruption and led opposition to the United Russia party in recent elections. (Tech President)
A data disclosure bill working its way through the California legislature has attracted some negative attention from big tech firms like Facebook and Google. The bill would require companies to provide customers with any personal information that the company holds about them upon request and is similar to laws that are already in effect in Europe. (Ars Technica)
The Chamber of Commerce pulled up slightly on its rapid election year lobbying pace, but the group still managed to spend more than $10 million on lobbying during the first quarter of 2013. The Chamber has more than 40 in-house lobbyists and 14 firms on retainer. (Roll Call)
On Monday, Michele Bachmann's former chief of staff, Andy Parrish, testified that Bachmann personally approved payments to an Iowa state senator as part of her presidential campaign despite rules against the practice. He also stated that those involved believed that they acted within the law. (National Journal)
Companies that chose to file their first quarter lobbying reports early are generally showing spending increases. It's hard to say if the trend will continue as more companies file or if the overachievers were just eager to show off. (Roll Call)
Churnalism US is a new web tool and browser extension that lets you compare the news you read against existing content to uncover possible instances of plagiarism. It is a joint project with the Media Standards Trust.
Simply feed in a link or block of text to the Churnalism site or let the browser extension run in the background to notify you of any matches of text from Churnalism's cache of documents. They include most articles in Wikipedia, press releases from PR Newswire, PR News Web, EurekAlert!, congressional leadership offices, the White House, a sampling of Fortune 500 companies, prominent philanthropic foundations and much more. The browser extension, available for Chrome, Internet Explorer and Firefox (full approval pending), allows Churnalism to extract article text from a whitelist of common news sites and lets you know when something you're reading may be copied from another source. It's a rare occurrence, but it's not unprecedented. Just last week Tom Lee, a noted Churnalism beta tester and Sunlight Labs Director, found through Churnalism that Reuters' prematurely published obituary of still-alive-human George Soros borrowed heavily from the collection of quotes on his Wikipedia page.
With the extension installed, you can learn about the sourced and unsourced flow of text copied from somewhere else. For some anecdotal evidence from my experience using Churnalism, I've found a number of instances of articles about science topics relying heavily on press releases and study summaries. For example, take this piece on the BBC website about epilepsy and migraines. Churnalism found a significant portion of the text came from this press release in EurekAlert! and let me know with a ribbon notification on the top of the page. When you tap the Show Me button on the notification, Churnalism overlays a side-by-side display of the article and the possible match with copied text highlighted for easy comparison:
Using the Churnalism browser extension, it's easy to see the overlap between the article, shown on the left, and the corresponding text copied from a press release, shown on the right.
We understand the privacy sensitivities with an extension extracting text from what you read, so we've designed Churnalism to be quite customizable and to never retain identifiable information such as your IP address. You can easily change which sites Churnalism runs on by going into the settings for the browser extension. We've provided a basic whitelist of major news sites, a listing of local news affiliates and the ability to let Churnalism run on any site with "news" or "article" in the URL, but all these can be removed or pared down (or expanded!) to whatever sites you're interested in.
Churnalism US (launching today!) allows you to check the news articles you read for influence from press releases and Wikipedia. If you’re curious about a particular article, you can simply copy/paste the web address into the Churnalism US website. You can also choose to check each news article you read by installing our browser extensions. The extensions will alert you when a news article matches our database. You can read more about using Churnalism in Nicko's post, but I'll explain how we approached this problem from a technical point of view.
The core technology behind the service is a fast, full text search database named SuperFastMatch. It was developed by our friends at the Media Standards Trust to power the original UK-based Churnalism.com. The original version of the site allowed you to check the influence a particular press release has had on the UK national press. Our task is the inverse of theirs but the fundamental technical challenge is the same so we used the second generation of their technology to power this new site.
SuperFastMatch employs an innovative technique that splits the text of a corpus (mostly press releases in this case) into overlapping windows of a fixed number of characters. Each of those text windows is hashed into a 26-bit number. We use a "rolling" hash function, but if you’re familiar with MD5 or SHA1 then you've got a good idea of what it does. Every hash function suffers from hash collisions. Instead of trying to avoid these, the collisions are used as an approximation for comparing the text represented by the hash. Once a list of matching hashes is found, a more exact (but slower) comparison of the text windows can be done on this smaller set of values in order to filter out false positives.
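The windowing and two-stage comparison described above can be sketched in a few lines of Python. This is an illustration only, not SuperFastMatch's actual code: the polynomial rolling hash, the `BASE` constant, the default window size of 15 characters and the function names `window_hashes` and `shared_windows` are all assumptions made for the example. Only the 26-bit hash width comes from the description above.

```python
HASH_BITS = 26                    # hash width mentioned in the post
MASK = (1 << HASH_BITS) - 1
BASE = 257                        # assumed multiplier for the polynomial hash


def window_hashes(text, window):
    """Yield (offset, hash) for every overlapping window of `window` chars."""
    if window <= 0 or len(text) < window:
        return
    # Weight of the oldest character in the window, reduced to 26 bits.
    high = pow(BASE, window - 1, 1 << HASH_BITS)
    h = 0
    for ch in text[:window]:
        h = (h * BASE + ord(ch)) & MASK
    yield 0, h
    for end in range(window, len(text)):
        # Roll: drop the outgoing character, shift, add the incoming one.
        outgoing = ord(text[end - window])
        h = (h - outgoing * high) & MASK
        h = (h * BASE + ord(text[end])) & MASK
        yield end - window + 1, h


def shared_windows(article, release, window=15):
    """Return (article_offset, release_offset) pairs of identical windows."""
    index = {}
    for offset, h in window_hashes(release, window):
        index.setdefault(h, []).append(offset)
    matches = []
    for a_off, h in window_hashes(article, window):
        for r_off in index.get(h, ()):
            # Equal hashes are only candidates; compare the actual text
            # to filter out false positives caused by collisions.
            if article[a_off:a_off + window] == release[r_off:r_off + window]:
                matches.append((a_off, r_off))
    return matches
```

Calling `shared_windows(article_text, release_text)` returns the offsets of every 15-character window the two texts have in common. A real index would hold hashes for the whole corpus rather than a single press release.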
Having this list of hashes isn't enough to make the text search fast. Once the list of hashes is in hand they need to be stored in an index. Since the hashes are numbers, the index stores them in a numerically sorted list. This list is then delta-encoded by subtracting the previous number from each one, and the resulting differences are stored with a variable bit-length encoding in a sparsetable. Even with this compression, the index can grow very large; our index is about 20 GB and growing.
Once we have a list of which press releases share text with a given news article we have to analyze whether that shared text is meaningful. This is where the Churnalism web frontend takes over. We remove fragments that are mostly long proper nouns (such as "the President of the United States of America"). We then measure how many characters overlap and how close together the shared passages are, relative to the document lengths. A 3,000-word news article that shares two sentences with a press release is less interesting than a 1,000-word article that shares two paragraphs. Similarly, two articles of the same length that share the same two sentences with a press release aren't always churning the press release to the same degree. We boil this down into the "density" of the shared text in the two documents as a measure of how likely the text was simply copy/pasted and then slightly edited.
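To make the compression step concrete, here is a minimal sketch of delta encoding a sorted list of integers. It is not the SuperFastMatch index format: a byte-oriented base-128 varint stands in for the variable bit-length encoding, the sparsetable is left out entirely, and the names `encode_sorted` and `decode_sorted` are hypothetical.

```python
def encode_sorted(values):
    """Delta-encode a sorted list of non-negative ints into varint bytes."""
    out = bytearray()
    prev = 0
    for v in values:
        delta = v - prev          # gap from the previous value
        prev = v
        while True:
            byte = delta & 0x7F   # low 7 bits of the gap
            delta >>= 7
            if delta:
                out.append(byte | 0x80)   # high bit set: more bytes follow
            else:
                out.append(byte)
                break
    return bytes(out)


def decode_sorted(data):
    """Invert encode_sorted, rebuilding the original sorted list."""
    values = []
    prev = 0
    delta = 0
    shift = 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7            # continuation byte: keep accumulating
        else:
            prev += delta         # gap complete: add it to the running total
            values.append(prev)
            delta = 0
            shift = 0
    return values
```

Because neighboring entries in a sorted list tend to be close together, most gaps fit in one or two bytes, which is where the savings come from. The sketch assumes the input is already sorted and contains no negative numbers.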
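One way to picture a density measure like this is the sketch below. The formula is an assumption made for illustration, not the score Churnalism actually computes: it multiplies the fraction of the article covered by shared text by how tightly the shared fragments cluster. The function name `churn_score` and the fragment representation are hypothetical, and fragments are assumed not to overlap.

```python
def churn_score(fragments, article_len):
    """Score shared text by coverage and clustering (illustrative only).

    fragments   -- list of (offset, length) pairs of shared passages,
                   measured in characters within the article
    article_len -- total length of the article in characters
    """
    if not fragments or article_len <= 0:
        return 0.0
    shared = sum(length for _, length in fragments)
    if shared <= 0:
        return 0.0
    start = min(offset for offset, _ in fragments)
    end = max(offset + length for offset, length in fragments)
    coverage = shared / article_len      # how much of the article is shared
    compactness = shared / (end - start) # 1.0 when fragments are adjacent
    return coverage * compactness


# Two adjacent shared sentences in a short article ...
clustered = churn_score([(100, 100), (200, 100)], 6000)
# ... versus the same two sentences far apart in an article of equal length.
scattered = churn_score([(100, 100), (5000, 100)], 6000)
```

Under this scoring, `clustered` comes out higher than `scattered`, matching the intuition that two articles sharing the same sentences are not necessarily churning to the same degree.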
Unfortunately the state of web publishing is inconsistent such that we couldn't reliably detect and eliminate quotes. Often blockquote html tags are used for things that are not actually quotes and of course not all quotes are annotated with appropriate html tags. While initially frustrating from an engineering perspective, we've found it delivers an additional feature by providing context around quotes in a news article and exposing instances of news articles selectively quoting speeches or press releases.
As great a service as Churnalism provides, we think the underlying technology has many other exciting uses. SuperFastMatch can be used as an approach to any problem that requires a longest common substring search. If you’re a Pythonista, we have a client library that will handle simple load balancing and sharding by document type. It also provides tools to back up and restore the SuperFastMatch index (the index is ephemeral, so a reboot wipes the data). We've found it useful in our Ad Hawk service for clustering "cookie-cutter" attack ads where most of the audio is the same but the politician’s name and background are changed. If you find it useful, let us know. If you have any trouble setting it up, submit a ticket to the GitHub project and we’ll do our best to get you up and running.
Here’s another win for open data. The Consumer Financial Protection Bureau releases data on which banks have the most consumer complaints. Even before the data becomes public officially, banks start improving response times and responding more favorably to customer complaints.
That’s the story that’s emerging from the CFPB’s bold decision to make bank consumer complaint data public.
This is exactly what open data is supposed to do. It equalizes the balance of power. In this case, it has empowered consumers, and brought accountability to big banks.
Despite still being allowed to file their campaign finance reports on paper, a growing number of Senators are embracing the future and filing electronically. Jon Tester, who introduced legislation that would require every Senator to e-file, is joined by a bipartisan group, although more Democrats have taken up the practice than Republicans. (Public Integrity)
Steven VanRoekel, the US CIO, expressed his hopes that open data will become "the default setting of the federal government" during a speech last week. As part of his message he urged vendors and contractors to plan to collect and distribute data in ways that will allow agencies to make it available in free, non-proprietary formats. (Federal Computer Week)
GOP Boy Wonder Marco Rubio's leadership PAC raised $650,000 during the first quarter. Insert played out joke here: the PAC spent more than $47,000 on bottled water. (Roll Call)
I think everyone can agree that last week was tough on America's mental state and, through that, our productivity. But the string of tragic events didn't slow the train of political fundraising moving all across the nation. (Public Integrity)
Mark Zuckerberg's FWD.us jumped into the immigration debate feet first last month, paying Republican lobbyists with the firm Fierce, Isakowitz & Blalock $30,000 in March to lobby on the issue. FWD.us also signed up Peck Madigan Jones, but the firm has yet to file its first quarter report. (Roll Call)
The NRA is spending more than ever on federal lobbying as they face a massive push to reform some gun laws. They spent at least $800,000 in the first quarter to lobby on a number of bills in the House and Senate. (Public Integrity)
President Obama raised more than $43 million to fund his second inaugural festivities, not quite reaching the high bar that he set with his first inaugural haul of $53 million. A number of major corporations and unions cut big checks. (The Hill, Washington Times)
Disclaimer: The opinions expressed by the guest blogger and those providing comments are theirs alone and do not reflect the opinions of the Sunlight Foundation or any employee thereof. Sunlight Foundation is not responsible for the accuracy of any of the information within the guest blog.
Adam Green is the CTO of UniteBlue.com -- a social network for progressives based on Twitter. UniteBlue connects and organizes political activists on a national and state level. You can reach Adam at adam@uniteblue.com.
One of the great challenges for political activism in each of the 50 state legislatures is providing timely information on bills as they move through the legislative process. The data provided by the Sunlight Foundation Open States API is comprehensive, but combing through it to find the limited number of bills that activists have the time to focus on can be overwhelming. I have found over 88,000 bills in the current session alone. We are now developing an early warning system at UniteBlue.com that can assist our volunteers in filtering this information flow down to a manageable level.
While the entire process will take months to complete, in this guest post I’d like to propose a triage model that can be used as a first step. Triage is the medical model used in hospital emergency rooms and during disaster responses. A high flow of patients is screened using certain “signatures” to determine which ones need to be seen first, such as blocked airways or dangerous vital signs. The process of triaging the high flow of bills coming from the Sunlight data will follow a similar model.
Oakland, CA, trying to stay out of San Francisco's open data shadow, has a new budget visualization website. Open Budget Oakland, put together by OpenOakland, launched this week with visualized budget data for 2011 - 2013 and will soon be updated to include budget blueprints through 2015. (Tech President)
It looks like Mark Sanford has let his relationship issues get in the way of his ambitions once again. The NRCC cut ties with Sanford immediately after reports surfaced this week that his ex-wife had accused him of trespassing on her property earlier this year. Now they're being joined by a number of well funded outside groups. (Roll Call)
Jamaica is taking steps to foster an open data community in hopes of sparking a start up culture and improving governance. Officials highlighted a number of steps being taken by the country at the "Developing the Caribbean" conference last week in Kingston. (O'Reilly Radar)
The ethics probe into Rep. Michele Bachmann is slated to take an interesting turn. Her former Chief of Staff, Andy Parrish, is expected to testify that Bachmann's presidential campaign inappropriately paid an Iowa State Senator to work for her. (POLITICO)
The Obama Administration's recently released open government self-assessment shows progress, but also presents a rose-colored view of some of the administration's actions. (POGO)
Open Government Indonesia is pushing a new portal called "Lapor" to encourage citizens to report corruption by public officials. OGI is part of the OGP and is made up of a number of government agencies and NGOs. (Future Gov)
France looks set to institute wide-reaching asset disclosure rules for public officials in response to recent corruption scandals. Rules have already been put in place for government ministers and there is a law pending that would extend disclosure to members of parliament. (Transparency International)
TransparencyCamp.org, your destination for all things TCamp, now has a fresh new look and additional functionality. This new package of beauty and brains will help you register and prepare for the "unconference," and you can even submit ideas for sessions!
The responsive site will be a great way to look at session choices during TCamp. During the conference, you will even be able to submit your session via the web.
Don't forget to take a "scroll" down memory lane and see awesome TCamps of years past. Think of all the possibilities -- at your fingertips!