Monthly Archives: July 2013

ngramr – an R package for Google Ngrams

The recent post How common are common words? made use of unusually explicit language for the Stubborn Mule. As expected, a number of email subscribers reported that the post fell foul of their email filters. Here I will return to the topic of n-grams, while keeping the language cleaner, and describe the R package I developed to generate n-gram charts.

Rather than an explicit language warning, this post carries a technical language warning: regular readers of the blog who are not familiar with the R statistical computing system may want to stop reading now!

The Google Ngram Viewer is a tool for tracking the frequency of words or phrases across the vast collection of scanned texts in Google Books. As an example, the chart below shows the frequency of the words “Marx” and “Freud”. It appears that Marx peaked in popularity in the late 1970s and has been in decline ever since. Freud persisted for a decade longer but has likewise been in decline.

The Ngram Viewer will display an n-gram chart, but does not provide the underlying data for your own analysis. But all is not lost. The chart is drawn using JavaScript, so the n-gram data is buried in the code in the source of the web page. It looks something like this:

// Add column headings, with escaping for JS strings.

data.addColumn('number', 'Year');
data.addColumn('number', 'Marx');
data.addColumn('number', 'Freud');

// Add graph data, without autoescaping.

data.addRows(
[[1900, 2.0528437403299904e-06, 1.2246303970897543e-07],
[1901, 1.9467918036752963e-06, 1.1974195999187031e-07],
...
[2008, 1.1858645848406013e-05, 1.3913611155658145e-05]]
)

With the help of the RJSONIO package, it is easy enough to parse this data into an R dataframe. Here is how I did it:

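library(RJSONIO)  # provides fromJSON
ngram_parse <- function(html){
  # Stop early if the Ngram Viewer found nothing to plot
  if (any(grepl("No valid ngrams to plot!",
                html))) stop("No valid ngrams.")

  # Column names: second argument of each addColumn() call, quotes stripped
  cols <- lapply(strsplit(grep("addColumn", html,
                               value=TRUE), ","),
                getElement, 2)
  cols <- gsub(".*'(.*)'.*", "\\1", cols)

  # What follows is an assumed completion, not the original code:
  # treat the addRows() argument as JSON (html is taken to be a vector of lines)
  rows <- paste(html[-seq_len(grep("data.addRows(", html, fixed=TRUE)[1])], collapse="")
  data <- as.data.frame(t(sapply(fromJSON(sub("\\).*", "", rows)), unlist)))
  names(data) <- cols
  data
}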

I realise that is not particularly beautiful, so to make life easier I have bundled everything up neatly into an R package which I have called ngramr, hosted on GitHub.
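For anyone curious about what the parsing involves, here is a small self-contained sketch that uses only base R and runs on the sample lines shown above. It is not the package's code: `parse_ngram_lines` is a hypothetical helper name, and it assumes the page source has already been split into a character vector with one line per element.

```r
# Sample lines copied from the JavaScript snippet in this post
js <- c(
  "data.addColumn('number', 'Year');",
  "data.addColumn('number', 'Marx');",
  "data.addColumn('number', 'Freud');",
  "data.addRows(",
  "[[1900, 2.0528437403299904e-06, 1.2246303970897543e-07],",
  "[1901, 1.9467918036752963e-06, 1.1974195999187031e-07],",
  "[2008, 1.1858645848406013e-05, 1.3913611155658145e-05]]",
  ")"
)

# Hypothetical helper, for illustration only
parse_ngram_lines <- function(lines) {
  # Column names: the last quoted string in each addColumn(...) call
  col_lines <- grep("addColumn", lines, value = TRUE, fixed = TRUE)
  cols <- gsub(".*'(.*)'.*", "\\1", col_lines)

  # Data rows: every line that starts with an opening square bracket
  row_lines <- grep("^\\[", lines, value = TRUE)

  # Drop the brackets and any trailing comma, leaving comma-separated numbers
  cleaned <- gsub("\\[|\\]", "", row_lines)
  cleaned <- sub(",[[:space:]]*$", "", cleaned)

  # One numeric vector per row, stacked into a data frame
  values <- lapply(strsplit(cleaned, ","), as.numeric)
  data <- as.data.frame(do.call(rbind, values))
  names(data) <- cols
  data
}

marx_freud <- parse_ngram_lines(js)
marx_freud
```

The result has one row per year and one column per phrase, which is the shape you need for plotting.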

The core functions are ngram, which queries the Ngram Viewer and returns a dataframe of frequencies; ngrami, which does the same thing in a somewhat case-insensitive manner (by which I mean that, for example, the results for "mouse", "Mouse" and "MOUSE" are all combined); and ggram, which retrieves the data and plots the results using ggplot2. All of these functions allow you to specify various options, including the date range and the language corpus (Google can provide results for US English, British English or a number of other languages including German and Chinese).

The package is easy to install from GitHub and I may also post it on CRAN.
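As a rough sketch of how this might look in practice, the snippet below wraps a query and a chart for "Marx" and "Freud" in a small function. The wrapper `marx_vs_freud` is a hypothetical name used only for this illustration, the GitHub repository path in the comment is an assumption, and the calls need the package to be installed and an internet connection, since they query the Ngram Viewer.

```r
# Install once, using either route:
# install.packages("ngramr")                      # CRAN
# devtools::install_github("seancarmody/ngramr")  # GitHub (repository path assumed)

phrases <- c("Marx", "Freud")

# Wrapped in a function so that nothing is fetched until you call it
marx_vs_freud <- function(phrases, year_start = 1900) {
  # Dataframe of frequencies for each phrase and year
  freq <- ngramr::ngram(phrases, year_start = year_start)
  print(head(freq))

  # The same query, plotted with ggplot2
  ngramr::ggram(phrases, year_start = year_start)
}

# Uncomment to run (needs internet access):
# marx_vs_freud(phrases)
```

Swapping ngram for ngrami in the function would combine the different capitalisations of each phrase.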

I would be very interested in feedback from anyone who tries out this package and will happily consider implementing any suggested enhancements.

UPDATE: ngramr is now available on CRAN, making it much easier to install.

Poll Dancing

With elections looming, and Kevin Rudd’s return to power, it is time for our regular guest blogger, James, to pull out his beer coaster calculator and take a closer look at the polls.

It is really that time again. Australian election fever has risen. Though in this case it feels like we have been here for three years since the last election. Polls every week telling us what we think and who we will vote for. But what exactly do these polls mean? And what do they mean by “margin of error”?

So here is the quick answer. Suppose you have a two-party election (which, through Australia’s preference system, is effectively what two-party preferred, or 2PP, amounts to). Now suppose each of those parties really has 50% of the vote. If there are 8 million voters and you poll 1,000 of them then what can you tell? Surprisingly, it turns out that, of these inputs, the 8 million figure is actually irrelevant! We can all understand that if you only poll 1,000 voters out of 8 million then there is a margin of error. This margin of error turns out to be quite easy to compute (using undergraduate level Binomial probability theory) and only depends on the number of people polled, and not the total number of voters. The formula is:

MOE = k × 0.5 / √N

where N is the number of people polled and k is the number of standard deviations for the error (the 0.5 is √(p(1 − p)) for a vote share of p = 50%). Now √1000 ≈ 32 so 1/√1000 ≈ 0.03 = 3%. The choice of k is somewhat arbitrary but in this case k = 2 (because for the Normal distribution 95% of outcomes lie within k = 2 standard deviations of the mean), which conveniently makes k × 0.5 = 1. So MOE = 1/√N is a fairly accurate formula. If N = 1000 then MOE = 1/32 ≈ 3% (give or take). This simply means that even if the actual vote was 50:50 then 5% of the time, an unbiased poll of 1,000 voters would poll outside 47:53 due purely to random selection. And even if the actual vote is, say, 46:54, the MOE will be about the same.
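For the R users still reading, the beer coaster formula is easy to check. The sketch below is only an illustration: the function name `moe` and its argument names are my own choices, and the final step uses the exact Binomial distribution to see how often a poll of 1,000 voters from a genuinely 50:50 electorate would land outside 47:53.

```r
# Margin of error for a poll of n people.
# k = number of standard deviations (2 gives roughly 95% coverage),
# p = assumed true share of the vote.
moe <- function(n, k = 2, p = 0.5) {
  k * sqrt(p * (1 - p) / n)
}

moe(1000)            # full formula with p = 0.5
1 / sqrt(1000)       # beer coaster version, identical when k = 2 and p = 0.5
moe(1000, p = 0.46)  # a 46:54 split gives almost the same answer

# Exact Binomial check: chance that a poll of 1,000 voters from a 50:50
# electorate has fewer than 470 or more than 530 supporters of one party
p_outside <- pbinom(469, 1000, 0.5) +
  pbinom(530, 1000, 0.5, lower.tail = FALSE)
p_outside
```

The last number should come out at around 5%, which is where the "5% of the time" comes from.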

Interestingly, in the US, where there are about 100m voters, they usually poll at N = 40,000, which makes the MOE = 0.5%. In this case the economics of polling scale with the number of voters, hence they can afford to poll more people. But the total number of voters, 100m or 10m, is irrelevant for the MOE. As the formula shows, to improve the accuracy of the estimate by a factor of 10 (say from 3% to 0.3%) they would need to increase the sample size by a factor of 100. You simply can’t get around this.
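Both claims can be sketched in a few lines of R. The first function simply inverts MOE = 1/√N. The second is the standard finite population correction, which is not part of the beer coaster formula and is included here only to show how little the total number of voters matters. The function names are my own.

```r
# Sample size needed for a target margin of error, from MOE = 1 / sqrt(N)
n_needed <- function(moe) 1 / moe^2

n_needed(0.03)    # roughly 1,100 people for 3%
n_needed(0.003)   # roughly 111,000 people for 0.3%: 100 times as many

# Finite population correction: the factor by which the MOE shrinks
# when n people are sampled from a finite population
fpc <- function(n, population) sqrt((population - n) / (population - 1))

fpc(1000, 8e6)    # very close to 1
fpc(1000, 100e6)  # closer still
```

With 1,000 people polled, the correction factor is so close to 1 for either population that ignoring it changes nothing you could see in a reported margin of error.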

One of the criticisms of polls is that they don’t reach the same number of (young) people on mobile phones as older people on land lines. This is easily fixed. You just adjust the figures according to what type of phone they are using, based on known percentages of who uses what type of phone. Similarly you can adjust by gender and age. The interesting thing though is that the further the mix of phone usage/gender/age in your poll gets from the actual mix, the more you need to increase your MOE, but not your expected outcome.
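Here is a minimal sketch of that adjustment in R. Every number in it is invented for illustration: a poll that reaches 200 mobile users and 800 land line users when the assumed true split is 50:50. The wider margin of error is estimated with Kish's effective sample size, which is one common way of doing this and not necessarily what any particular pollster uses.

```r
# Hypothetical poll: all numbers below are invented for illustration
poll <- data.frame(
  phone     = c("mobile", "landline"),
  polled    = c(200, 800),    # respondents reached on each phone type
  labor_2pp = c(0.55, 0.47),  # share supporting Labor within each group
  pop_share = c(0.50, 0.50)   # assumed known share of voters in each group
)

# Unweighted estimate: mobile users are under-represented
raw <- sum(poll$polled * poll$labor_2pp) / sum(poll$polled)

# Weighted estimate: each group counts according to its population share
weighted <- sum(poll$pop_share * poll$labor_2pp)

# Weight per respondent = population share / sample share of their group
w <- poll$pop_share / (poll$polled / sum(poll$polled))

# Kish's effective sample size: the unweighted sample size that would
# give the same precision as this weighted poll
n_eff <- sum(poll$polled * w)^2 / sum(poll$polled * w^2)

moe_raw      <- 1 / sqrt(sum(poll$polled))
moe_weighted <- 1 / sqrt(n_eff)

c(raw = raw, weighted = weighted, n_eff = n_eff,
  moe_raw = moe_raw, moe_weighted = moe_weighted)
```

In this made-up case the 1,000 responses are worth only 640 once weighted, so the margin of error widens from a little over 3% to about 4%.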

Okay so that is it: MOE = 1/√N where N = number of people polled. If N = 1000 then MOE=3%. My all time favourite back of the beer coaster formula.

The recent jump in the 2PP polls for Labor from about 45% to 49%, when Kevin Rudd reassumed the PM-ship, was greeted by journalists as “Kevin Rudd is almost, but not quite, dead even”. I found this amusing as it could statistically have been 51%, within the MOE, in which case the headline would have been “Kevin Rudd is ahead!”. Indeed barely a week later he was “neck and neck” in the polls at 50:50. Next week it may be “51:49” in which case he will be declared on a certain path to victory! However within the MOE of 3% these results are statistically indistinguishable.

From my point of view, as a professional statistician, I find the way many journalists develop a narrative based on polls from week to week, without understanding the margin of error, quite annoying. Given the theory that a politician who has “The Mo” (i.e. momentum) may end up being helped to win by it, it is irresponsible to allow random fluctuation due to statistical sampling error to influence the outcome of an election. Unless of course it helps the party I support win.

How common are common words?

One of my favourite podcasts is Slate’s Lexicon Valley. All about language, it is rigorous and detailed in its approach to the subject, which appeals to the closet academic in me, but also extremely entertaining. It is a sign of a good podcast to find yourself bursting out laughing while walking down a busy city street. Lexicon Valley is to blame for numerous moments of alarm for my fellow commuters.

In September last year, hosts Mike Vuolo (the knowledgeable one) and Bob Garfield (the funny one) interviewed linguist Geoffrey Nunberg, talking to him about his recent book, Ascent of the A-Word: Assholism the First Sixty Years. A half hour discussion of the evolution of the word “asshole” helps earn this podcast an “Explicit” tag in the iTunes store and, as a result, this will be the first Stubborn Mule post that may fall victim to email filters. Apologies in advance to anyone of a sensitive disposition and to any email subscribers this post fails to reach.

Nunberg traces the evolution of “asshole” from its origins among US soldiers in the Second World War through to its current role as a near-universal term of abuse for arrogant boors lacking self-awareness. Along the way, he explores the differences between profanity (swearing drawing on religion), obscenity (swearing drawing on body parts and sexual activity) and plain old vulgarity (any of the above).

The historical perspective of the book is supported by charts using Google “n-grams”. An n-gram is any word or phrase found in a book and one type of quantitative analysis used by linguists is to track the frequency of n-grams in a “corpus” of books. After working for years with libraries around the world, Google has amassed a particularly large corpus: Google Books. Conveniently for researchers like Nunberg, with the help of the Google n-gram Viewer, anyone can analyse n-gram frequencies across the Google Books corpus. For example, the chart below shows that “asshole” is far more prevalent in books published in the US than in the UK. No surprises there.

Use of “asshole” in US and UK Books

If “asshole” is the American term, the Australian and British equivalent should be “arsehole”, but surprisingly arsehole is less common than asshole in the British Google Books corpus. This suggests that, while being a literal equivalent to asshole, arsehole really does not perform the same function. If anything, it would appear that the US usage of asshole bleeds over to Australia and the UK.

“asshole” versus “arsehole”

Intriguing though these n-gram charts are, they should be interpreted with caution, as I learned when I first tried to replicate some of Nunberg’s charts.

The chart below is taken from Ascent of the A-word and compares growth in the use of the words “asshole” and “empathetic”. The frequencies are scaled relative to the frequency of “asshole” in 1972*. At first, try as I might, I could not reproduce Nunberg’s results. Convinced that I must have misunderstood the book’s explanation of the scaling, I wrote to Nunberg. His more detailed explanation confirmed my original interpretation, but meant that I still could not reproduce the chart.
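For the technically inclined, the scaling itself is simple enough to sketch in R. The frequencies below are toy values invented purely to show the arithmetic (they are not real n-gram data), and the column and function names are chosen for illustration. The reference year is a parameter, so it is easy to try neighbouring years.

```r
# Toy frequencies, invented purely to show the arithmetic
freq <- data.frame(
  Year      = c(1971, 1972, 1973, 1971, 1972, 1973),
  Phrase    = rep(c("asshole", "empathetic"), each = 3),
  Frequency = c(2e-7, 4e-7, 5e-7, 1e-7, 1e-7, 3e-7)
)

# Divide every frequency by that of one phrase in one reference year
rescale_to_reference <- function(data, ref_phrase, ref_year) {
  ref <- data$Frequency[data$Phrase == ref_phrase & data$Year == ref_year]
  stopifnot(length(ref) == 1)  # exactly one reference value expected
  data$Relative <- data$Frequency / ref
  data
}

scaled <- rescale_to_reference(freq, "asshole", 1972)
scaled
```

After rescaling, the reference phrase takes the value 1 in the reference year and every other point is a multiple of it.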

Relative growth of “empathetic” and “asshole”

Then I had an epiphany. It turns out that Google has published two sets of n-gram data. The first release of the data was based on an analysis of the Google Books collection in July 2009, described in the paper Michel, Jean-Baptiste, et al. “Quantitative analysis of culture using millions of digitized books” Science 331, No. 6014 (2011): 176-182. As time passed, Google continued to build the Google Books collection and in July 2012 a second n-gram data set was assembled. As the charts below show, the growth of “asshole” and “empathetic” is somewhat different depending on which edition of the n-gram data set is used. I had been using the more recent 2012 data set and, evidently, Nunberg used the 2009 data set. While either chart would support the same broad conclusions, the differences show that smaller movements in these charts are likely to be meaningless and not too much should be read into anything other than large-scale trends.

Comparison of the 2009 and 2012 Google Books corpuses

So far I have not done very much to challenge anyone’s email filters. I can now fix that by moving on to a more recent Lexicon Valley episode, A Brief History of Swearing. This episode featured an interview with Melissa Mohr, the author of Holy Shit: A Brief History of Swearing. In this book Mohr goes all the way back to Roman times in her study of bad language. Well-preserved graffiti in Pompeii is one of the best sources of evidence we have of how to swear in Latin. Some Latin swear words were very much like our own, others were very different.

Of the “big six” swear words in English, namely ass, cock, cunt, fuck, prick and piss (clearly not all as bad as each other!), five had equivalents in Latin. The only one missing was “piss”. It was common practice to urinate in jars left in the street by fullers who used diluted urine to wash clothing. As a result, urination was not particularly taboo and so not worthy of being the basis for vulgarity. Mohr goes on to enumerate another five Latin swear words to arrive at a list of the Roman “big ten” obscenities. One of these was the Latin word for “clitoris”, which was a far more offensive word than “clit” is today. I also learned that our relatively polite, clinical terms “penis”, “vulva” and “vagina” all derive from obscene Latin words. It was the use of these words by the upper class during the Renaissance, speaking in Latin to avoid corrupting the young, that caused these words to become gentrified.

Unlike Nunberg, Mohr does not make use of n-grams in her book, which provides a perfect opportunity for me to track the frequency of the big six English swear words.

Frequency of the “Big Six” swear words

The problem with this chart is that the high frequency of “ass” and “cock”, particularly in centuries gone by, is likely augmented by their use to refer to animals. Taking a closer look at the remaining four shows just how popular the use of “fuck” became in the second half of the twentieth century, although “cunt” and “piss” have seen modest (or should I say immodest) growth. Does this mean we are all getting a little more accepting of bad language? Maybe I need to finish reading Holy Shit to find out.

Frequency of four of the “Big Six” swear words

* The label on the chart indicates that the reference year is 1972, but by my calculations the reference year is in fact 1971.