On Aug. 5, the Federal Communications Commission [announced](http://sunlightfoundation.com/blog/2014/08/05/fcc-releases-open-internet-comments-in-bulk/) the bulk release of the comments from its largest-ever public comment collection. We’ve spent the last three weeks cleaning and preparing the data and leveraging our experience in machine learning and [natural language processing](http://en.wikipedia.org/wiki/Natural_language_processing) to try to make sense of the hundreds of thousands of comments in the docket. Here is a high-level overview, along with our cleaned version of the full corpus, which is available for download in the hope of making further research easier.
Our first exploration uses natural language processing techniques to identify topical keywords within comments and use those keywords to group comments together. We analyzed a corpus of 800,959 comments. Some key findings:
* We estimate that less than 1 percent of comments were clearly opposed to net neutrality.
* At least 60 percent of comments submitted were form letters written by organized campaigns (484,692 comments); while these make up the majority of comments, this is actually a lower percentage than is common for high-volume regulatory dockets.
* At least 200 comments came from law firms, on behalf of themselves or their clients.
Below is an interactive visualization that lets you explore these groupings and view individual comments within the groups.
In-depth exploration of the topical keywords revealed several prominent recurring themes, both in form letter and non-form letter submissions (see below for a more detailed exploration of form letter submissions). Among the most common:
* Around two-thirds of commenters objected to the idea of paid priority for Internet traffic, or division of Internet traffic into separate speed tiers. This topic was discussed in many independent comments, as well as form letter campaigns organized by [the Nation](http://activism.thenation.com/p/dia/action3/common/public/?action_KEY=13853), [Battle for the Net](https://www.battleforthenet.com/), [CREDO Action](http://act.credoaction.com/sign/fcc_nn_comments_2014), [Daily Kos](http://campaigns.dailykos.com/p/dia/action3/common/public/?action_KEY=878) and [Free Press](https://act.freepress.net/sign/internet_fcc_break/). Common keywords in this group included “slow/fast lane,” “pay to play,” “wealthy,” “divide” and “Netflix.”
* About the same number of comments, including submissions from form letter campaigns organized by the Nation, [Badass Digest](http://badassdigest.com/2014/05/15/the-fcc-wants-to-slow-down-your-internet.-we-dont-want-them-to/), CREDO Action, Daily Kos and Free Press, asked the FCC to reclassify ISPs as [common carriers](http://en.wikipedia.org/wiki/Common_carrier#Telecommunications) under the [1934 Communications Act](http://en.wikipedia.org/wiki/Communications_Act_of_1934). Common keywords in these comments included “common carrier,” “(re)classify,” “authority” and “Title II” (a part of the act that might grant the FCC this authority). A smaller portion of commenters advocated a regulatory strategy with a similar effect but a different legal basis, relying on section 706 of the [1996 Telecommunications Act](http://en.wikipedia.org/wiki/Telecommunications_Act_of_1996).
* More than half of comments, including form letters from the Nation, Battle for the Net, CREDO Action and Daily Kos, addressed Internet access as an essential freedom. Common topic words included “important,” “vitally,” “economy,” “essential,” “resource” and “cornerstone.”
* Almost half of comments, including form letters from [Electronic Frontier Foundation](https://www.dearfcc.org/), the Nation, Battle for the Net, Daily Kos and Free Press, discussed the economic impact, or the impact on small businesses and innovation, of the end of net neutrality. Typical terms in these comments included “work,” “competition,” “startup,” “kill,” “barrier” and “entry.”
* Around 40 percent of comments, including campaign letters from EFF, Battle for the Net and Daily Kos, discussed the importance of consumer choice, or the impact of regulations on consumer fees. Topic words included “access,” “choice,” “entertainment,” “fee,” “content,” “extort” and “extract.”
* About one-third of comments, including those in Battle for the Net’s campaign, discussed the importance of competition among ISPs. Frequent terms included “monopoly,” “competition,” “Comcast,” “Verizon” and “Warner.”
* Several form letters either from the Daily Kos or of unknown provenance (combined with non-form letters) advocated treating broadband providers like a [public utility](http://en.wikipedia.org/wiki/Public_utility#United_States). About 15 percent of comments discussed this topic.
* A small number of comments (around 5 percent, including letters from [Stop Net Neutrality](http://www.stopnetregulation.org/about) and a [Tea Partier blog](http://teapartiers.blogspot.com/2014/07/fcc-petition.html)) had anti-regulation messages. Interestingly, some of these comments seemed to emphasize freedom for consumers while others advocated freedom for ISPs, two positions seemingly at odds with one another.
Additionally, a couple of topics came up in enough comments to be noteworthy despite not occurring in any of the form letter campaigns. About 2,500 comments called for the resignation of FCC Chairman Tom Wheeler or other FCC commissioners or staff. About 1,500 comments either mentioned John Oliver by name or used the words “dingo” or “f*ckery,” again typically directed at Tom Wheeler; these were likely motivated by the use of those terms in Oliver’s [net neutrality segment](https://www.youtube.com/watch?v=fpbOEoRrHyU).
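As a rough illustration of how keyword lists like the ones above can be turned into per-theme counts, here is a minimal sketch. It is a plain substring match, not the topic-modeling approach we used, and the `THEMES` map, its theme names and the helper functions are hypothetical, seeded with a few of the terms quoted above.

```python
import re
from collections import Counter

# Hypothetical theme -> keyword map, seeded with a few terms quoted in the list above.
THEMES = {
    "paid_priority": ["fast lane", "slow lane", "pay to play"],
    "reclassification": ["common carrier", "reclassify", "title ii"],
    "isp_competition": ["monopoly", "comcast", "verizon"],
}

def tag_themes(text, themes=THEMES):
    """Return the set of themes with at least one keyword in the comment text."""
    # Lowercase and collapse whitespace so multi-word keywords match across line breaks.
    normalized = re.sub(r"\s+", " ", text.lower())
    return {
        theme
        for theme, keywords in themes.items()
        if any(keyword in normalized for keyword in keywords)
    }

def theme_counts(comments, themes=THEMES):
    """Count how many comments touch each theme (a comment can count toward several)."""
    counts = Counter()
    for comment in comments:
        counts.update(tag_themes(comment, themes))
    return counts
```

Because a comment can match several themes at once, counts produced this way overlap, which is also why the percentages in the list above sum to well over 100.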
## Wait, where are the 1.1 million comments?
The comments were [originally released](http://www.fcc.gov/blog/fcc-makes-open-internet-comments-more-accessible-public) by the FCC as six continuous XML files, with two caveats:
> First, mailed comments postmarked prior to July 18 are still being scanned and entered into the ECFS and may not be reflected in the files. We will post an updated XML file when they are completed, so stay tuned.
We haven’t received word of any updates since the original release.
> Second, certain handwritten comments may not be searchable. For this reason, source links to these comments are included in the files.
More than 500 comments had text fields which were blank. Our guess is that these may correspond to handwritten comments.
The XML files contained 446,719 records. Many of these contained a single comment each, but some contained multitudes. We wrote custom processing scripts to break up the multiple-comment records, revealing the total count of 801,781 comments. Of these, some were discarded as unparseable or too long (both *Les Misérables* and *War and Peace* were submitted as comments), leaving the final count at 800,959 comments.
## Detecting expert submissions
After speaking with policy experts from the [Open Technology Institute](http://oti.newamerica.net/) and [Public Knowledge](https://www.publicknowledge.org/), we learned some interesting details about comment submission. While most public comments were submitted using a simplified form or via email, experienced submitters made use of a more complex form. Comments submitted by these “experts” were marked in the data, giving us an easy way to isolate them.
Once isolated, they provided the basis for training a piece of artificial intelligence software called a [text classifier](http://en.wikipedia.org/wiki/Document_classification). We trained the classifier to detect expert language based on examples from submissions that we knew were from experts. It was then able to read comments submitted through the simple form or via email and tell us whether or not each was likely to have been written by an expert. The classifier found approximately 6,700 such comments. Approximately 3,900 of these were form letters with this basic structure:
> To Chairman Tom Wheeler and the FCC Commissioners
>
> To the FCC
>
> Please build any net neutrality argument upon solid legal standing. Specifically, this means reclassifying broadband under Title II of the Telecommunications Act of 1934. 706 authority from the Telecommunications Act has been repeatedly struck down in court after legal challenges by telecom companies. Take the appropriate steps to prevent this from happening again.
>
> Sincerely,
While this was almost certainly penned by an expert, we’re considering it a non-expert submission, because it seems to have been part of a broader organized campaign. Of the remaining 2,846 comments, 567 contain at least 200 words, which we feel is an appropriate heuristic to apply to expert submissions. In summary, our back-of-the-envelope estimate of the number of expert submissions is 600, or about 0.07 percent of the 800,959 comments analyzed.
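To make the two-step idea in this section concrete (train on known expert submissions, then apply a minimum length), here is a minimal sketch. It uses a tiny hand-rolled naive Bayes model, which is an assumption chosen for brevity and not necessarily the classifier we used; the `NaiveBayes` class, the `likely_expert` helper and the `"expert"` label are all hypothetical names.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a comment and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)   # documents seen per label
        self.word_counts = {}               # label -> Counter of token frequencies
        for text, label in zip(texts, labels):
            self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        tokens = [w for w in tokenize(text) if w in self.vocab]  # ignore unseen words
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)  # log prior
            denom = sum(counts.values()) + len(self.vocab)
            for w in tokens:
                score += math.log((counts[w] + 1) / denom)         # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def likely_expert(text, model, min_words=200):
    """Classifier says 'expert' AND the comment meets the 200-word heuristic."""
    return model.predict(text) == "expert" and len(text.split()) >= min_words
```

In this sketch the training labels would come from the submission channel (complex form versus simple form or email), and the form letter above would still need to be removed separately, since a classifier alone cannot tell a campaign template from an individually written comment.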
## Form letters
We searched within the topical groupings that powered the visualization above to find groups of comments with very little text variation from one comment to another, yielding a similar result (though using different technology better suited to the extreme size of this docket) to the form letter detection visualizations employed in our [Docket Wrench](http://docketwrench.sunlightfoundation.com) tool. After manual review of these groups, we estimate that at least 20 separate form letter writing campaigns drove submissions to this docket, ranging in size from a few hundred comments to more than 100,000 and together comprising almost 500,000 comments, or about 60 percent of the corpus that we examined. We made a cursory attempt to find the organizations that orchestrated each form letter writing campaign. In the interactive visualization below, we’ve shown each group, along with its sponsoring organization if we were able to find it. The visualization is color-coded by whether each group appears to support or oppose net neutrality (the lone opposing group is difficult to see, but is shown in red near the center):
While form letters do appear to make up the majority of the comments, it’s actually surprising how many of the submitted comments seemed *not* to have been driven by form letter writing campaigns. In previous analyses of high-volume dockets, we’ve found that it’s not unusual for form letter contributions to make up in excess of 90 percent of a docket’s total submissions, with the percentage of comments coming from form letter campaigns being well-correlated with the total number of comments received. The two largest dockets in Docket Wrench, the [Department of State Keystone XL rulemaking](http://docketwrench.sunlightfoundation.com/docket/DOS-2014-0003) and the [Internal Revenue Service docket on political activity undertaken by social welfare organizations](http://docketwrench.sunlightfoundation.com/docket/IRS-2013-0038), both from earlier this year, are each dominated by form letter comments, with more than 75 percent of the comments in each having been classified as form letter submissions by our detection systems.
It’s difficult to know why, exactly, more members of the public apparently wrote letters themselves in this rulemaking than is typical for large dockets. It could be an indicator of a genuinely higher level of personal investment and interest in this issue, or perhaps this docket drew organizers who employed different “get out the comment” techniques than we have seen in the past.
Even within the form letters, we see evidence of various kinds of innovation in terms of the way form letter campaigns have been run. EFF’s campaign gives submitters several opportunities to choose from a menu of options at various points within the text, for example. More intriguingly, several groups of comments that we were unable to attribute show subtle textual variations that don’t seem to alter the meaning of the text in the way that EFF’s do. These groups appear to all be about the same size, leading us to believe that a single overall population of users might have been solicited to submit comments and was then automatically uniformly segmented in some fashion. This could have been to test which versions of the comment text got the most users to submit (along the lines of the [A/B testing](http://en.wikipedia.org/wiki/A/B_testing) commonly used in software development). It could also perhaps be an effort to foil exactly the kind of automated grouping tools we (and some federal agencies) might employ to make large volumes of comments like this one easier to review.
Finally, while comments submitted as part of form letter campaigns are similar to one another, it’s important to note that they’re not identical. Many submitters take the opportunity to personalize their comment beyond what was supplied by the campaign’s template language. How exactly they vary is an interesting question, and worth pursuing.
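For readers who want to experiment with the idea of grouping comments that vary only slightly from one another, here is a minimal sketch using word shingles and Jaccard similarity. This is a common textbook technique and an assumption on our part for illustration, not a description of the tooling used for this docket; the function names and the `threshold` value are hypothetical.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) for a comment."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def group_form_letters(comments, threshold=0.6):
    """Greedy grouping: a comment joins the first group whose representative
    (the group's first member) it resembles, otherwise it starts a new group."""
    groups = []  # list of (representative_shingles, [comment indices])
    for index, text in enumerate(comments):
        current = shingles(text)
        for representative, members in groups:
            if jaccard(current, representative) >= threshold:
                members.append(index)
                break
        else:
            groups.append((current, [index]))
    return [members for _, members in groups]
```

This greedy loop compares every comment against every group, so it would be far too slow for a docket of this size without an indexing step; it is meant only to show why a lightly personalized form letter still lands in the same group as its template.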
If you’re interested in doing your own analysis with this data, you can download our cleaned-up versions below. We’ve taken the six XML files released by the FCC and split them out into individual files in JSON format, one per comment, then compressed them into archives, one for each XML file. Additionally, we’ve taken several individual records from the FCC data that represented multiple submissions grouped together, and split them out into individual files (these JSON files will have hyphens in their filenames, where the value before the hyphen represents the original record ID). This includes email messages to [email@example.com](mailto:firstname.lastname@example.org), which had been aggregated into bulk submissions, as well as mass submissions from CREDO Mobile, Sen. Bernie Sanders’ office and others. We would be happy to answer any questions you may have about how these files were generated, or how to use them.
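As a starting point for working with the extracted archives, the sketch below loads one-comment-per-file JSON from a directory and regroups the files by original record ID, using the hyphen convention described above. The directory layout, the `.json` extension and the function names are assumptions; the sketch deliberately makes no assumption about the fields inside each JSON file.

```python
import json
from collections import defaultdict
from pathlib import Path

def original_record_id(path):
    """Value before the hyphen in the filename, or the whole stem if there is none."""
    return Path(path).stem.split("-", 1)[0]

def load_by_record(directory):
    """Map each original record ID to the list of parsed JSON comments split from it."""
    grouped = defaultdict(list)
    for path in sorted(Path(directory).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            grouped[original_record_id(path)].append(json.load(handle))
    return dict(grouped)
```

Records that map to more than one comment are the bulk submissions that were split apart, so `len(comments) > 1` is a quick way to find them.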
## Ideas for further investigation
We’ve only just scratched the surface of what could be learned from such a rich dataset. Here are some other promising avenues of investigation that have occurred to us. If you pursue them, please let us know! Bonus points for research that comes complete with links to open source code and data.
* How do commenters augment the template responses provided by form letter campaigns? What do they add, delete or modify? What consistently stays intact?
* Do models of non-form submissions surface topics that we haven’t found? What about models of expert submissions?
* How are individual words related to one another? For example, what modifiers are used for terms like “ISP,” “Wheeler,” “Internet,” etc.?
* Looking at email addresses, which domains are most popular?
* How often are key political figures or elements of government mentioned?
* Which other services or utilities is broadband Internet compared with, and how often?
* How do commenters break out by gender? (This is more difficult than it seems, even if you’re using the way fun [Genderize API](http://genderize.io/). Often the commenter’s real name can only be found in the body of the comment itself, not in the “applicant” field.)
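As one possible way into the first question in the list above, the sketch below compares a comment against a campaign template word by word using Python's standard `difflib` module. The `template_changes` helper is a hypothetical name, and word-level diffing is just one reasonable choice; sentence-level comparison might suit some campaigns better.

```python
import difflib

def template_changes(template, comment):
    """Return (added, removed) word runs in a comment relative to its template."""
    template_words, comment_words = template.split(), comment.split()
    matcher = difflib.SequenceMatcher(a=template_words, b=comment_words, autojunk=False)
    added, removed = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            removed.append(" ".join(template_words[i1:i2]))  # text dropped from the template
        if op in ("replace", "insert"):
            added.append(" ".join(comment_words[j1:j2]))     # text the submitter wrote
    return added, removed
```

Running this over every comment in a form letter group and tallying the `added` runs would show what submitters most often contribute, while template spans that never appear in `removed` are the parts that consistently stay intact.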
To help get you started, we’ve released all of the code we used to do our analysis in a [GitHub repository](https://github.com/sunlightlabs/fcc-net-neutrality-comments), and it depends entirely on open-source tools.
We’d like to thank Michael Weinberg and his colleagues at Public Knowledge, and Sarah Morris of the New America Foundation’s Open Technology Institute for their invaluable advice in better understanding this data. We’d also like to thank Radim Řehůřek, maintainer of the [gensim library](http://radimrehurek.com/gensim/), which was crucial to our text analysis.