Prices. Mother. Bless. Soldiers. Borrowing. Corporate. Abortion. Seniors.
What do these words have in common? They are all significantly more prominent in the speech of congressional freshmen than in overall speech patterns of House members, based on a new Sunlight Foundation analysis of the Congressional Record.
Transportation. Cosponsor. International. Order. Public. Intelligence. Human. Respect.
What do these words have in common? You guessed it. They are all significantly less prominent in the speech of congressional freshmen.
In general, there’s not a whole lot of difference in how the freshmen speak compared to their more senior colleagues. But certain words do jump out as being more or less prominent in the freshman class, and we think some of them (like the ones we’ve listed above) are intriguing enough to share. Our look at patterns of freshman speech, both on the House floor and on Twitter, is part of Sunlight's examination of the record of the big class of congressional rookies as they face their first election as incumbents. See more of the series here, and read a summary by Sunlight's editorial director, Bill Allison, here.
Figures 1 and 2 below look at selected common words that were noticeably more or less important to freshmen than to House members overall.
Our measure of prominence is based on a statistic known as Term Frequency – Inverse Document Frequency, or TF*IDF. Most notably used by search engines to rank the relevance of a document to the words in your search queries, TF*IDF multiplies the frequency of a given word in a single document by the inverse of that word's frequency across the collection of documents (the inverse is usually log-scaled). A word scores high when it appears often in one document but in few documents overall. In short, it is a metric of whether a word is significant or not compared to the corpus in which it is found.
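To make the statistic concrete, here is a minimal sketch of one common TF*IDF formulation: term frequency multiplied by the log of the inverse document frequency. It is not necessarily the exact variant used in our analysis. The function name `tf_idf` and the whitespace tokenization are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document.

    Assumed formulation: (count / words in document) * log(N / df),
    where df is the number of documents containing the word.
    """
    # Naive tokenization (lowercase, split on whitespace); an assumption.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each word.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights
```

Under this formulation, a word that appears in every document gets a weight of zero, no matter how often it is used, because it does nothing to distinguish one document from another.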
To discover our significant words, we took a list of the top 1,000 words from our freshmen by TF*IDF. These are words that were significant either to freshman legislators or in congressional speech as a whole. We then compared their weights to 25 random samples of the same size from the 112th Congress at large, which we used to get a baseline standard deviation. Words falling outside that standard deviation for our freshmen list can be considered significantly more or less important to the freshman class.
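The comparison against the random samples can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual pipeline: the function name `flag_words` is hypothetical, each sample is assumed to be a dict of word weights, and "outside that standard deviation" is read as more than one standard deviation from the mean of the samples.

```python
import statistics


def flag_words(freshman_weights, baseline_samples, n_sd=1.0):
    """Return {word: z} for words whose freshman weight lies more than
    n_sd baseline standard deviations from the baseline mean.

    baseline_samples is a list of {word: weight} dicts, one per random
    sample (at least two are needed to compute a standard deviation).
    """
    flagged = {}
    for word, weight in freshman_weights.items():
        # Assumption: a word missing from a sample has weight 0 there.
        values = [sample.get(word, 0.0) for sample in baseline_samples]
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample standard deviation
        if sd == 0:
            continue  # no variation across samples, so no usable baseline
        z = (weight - mean) / sd
        if abs(z) > n_sd:
            flagged[word] = z
    return flagged
```

A positive value marks a word that is more prominent among freshmen than in the baseline, and a negative value marks one that is less prominent.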
Finally, to show that freshmen are not that different from House members in general, Figure 3 shows the distribution of the natural log of the freshmen-to-overall member word importance ratios. Basically, what this tells us is that most of the log ratios fall within +/- 1 (the natural log of 2.7182 is 1). That is, there are very few words that are more than about 2.72 times more or less important to freshmen than to members overall.
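As a rough illustration of how the log ratios work: a log ratio of +1 means a word is e (about 2.72) times more important to freshmen, and -1 means it is about 2.72 times less important. The helper names `log_ratios` and `share_within` below are hypothetical, and the weights in the usage example are made up, not taken from our data.

```python
import math


def log_ratios(freshman_weights, overall_weights):
    """Natural log of the freshman-to-overall weight ratio for each word.

    Words missing from either dict, or with a non-positive weight, are
    skipped because the log ratio is undefined for them.
    """
    return {
        word: math.log(weight / overall_weights[word])
        for word, weight in freshman_weights.items()
        if weight > 0 and overall_weights.get(word, 0) > 0
    }


def share_within(ratios, bound=1.0):
    """Fraction of words whose log ratio lies within +/- bound."""
    if not ratios:
        return 0.0
    inside = sum(1 for value in ratios.values() if abs(value) <= bound)
    return inside / len(ratios)


# Made-up weights, for illustration only.
example = log_ratios({"prices": 8.0, "order": 1.0},
                     {"prices": 1.0, "order": 1.0})
print(share_within(example))
```

In this toy example, "prices" is eight times more important to freshmen (log ratio of about 2.08), so it falls outside the +/- 1 band, while "order" sits exactly at zero.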
In the data, we also provide standard deviations for the overall sample, based on 25 random draws of 10,000 documents from a much larger corpus. This allows one to see how likely or unlikely it is that certain word prominences among freshmen are due to purely random chance.