ngramr – an R package for Google Ngrams

by Stubborn Mule on 16 July 2013 · 36 comments

The recent post How common are common words? made use of unusually explicit language for the Stubborn Mule. As expected, a number of email subscribers reported that the post fell foul of their email filters. Here I will return to the topic of n-grams, while keeping the language cleaner, and describe the R package I developed to generate n-gram charts.

Rather than an explicit language warning, this post carries a technical language warning: regular readers of the blog who are not familiar with the R statistical computing system may want to stop reading now!

The Google Ngram Viewer is a tool for tracking the frequency of words or phrases across the vast collection of scanned texts in Google Books. As an example, the chart below shows the frequency of the words “Marx” and “Freud”. It appears that Marx peaked in popularity in the late 1970s and has been in decline ever since. Freud persisted for a decade longer but has likewise been in decline.

Freud vs Marx ngram chart

The Ngram Viewer will display an n-gram chart, but it does not provide the underlying data for your own analysis. All is not lost, though: the chart is produced using JavaScript, so the n-gram data is embedded in the source of the web page. It looks something like this:

// Add column headings, with escaping for JS strings.

data.addColumn('number', 'Year');
data.addColumn('number', 'Marx');
data.addColumn('number', 'Freud');

// Add graph data, without autoescaping.

[[1900, 2.0528437403299904e-06, 1.2246303970897543e-07],
[1901, 1.9467918036752963e-06, 1.1974195999187031e-07],
...
[2008, 1.1858645848406013e-05, 1.3913611155658145e-05]]

With the help of the RJSONIO package, it is easy enough to parse this data into an R dataframe. Here is how I did it:

library(RJSONIO)  # provides fromJSON

ngram_parse <- function(html){
  # Bail out if Google reports that none of the phrases were found
  if (any(grepl("No valid ngrams to plot!", html)))
    stop("No valid ngrams.")
  # Pull the phrase names out of the addColumn calls
  cols <- lapply(strsplit(grep("addColumn", html, value = TRUE), ","),
                 getElement, 2)
  cols <- gsub(".*'(.*)'.*", "\\1", cols)
  # Locate the JavaScript data array (the marker used here assumes the
  # rows appear as an array literal, as in the sample above) and parse it
  data_line <- grep("^\\s*\\[\\[", html)[1]
  data <- fromJSON(sub(".*=", "", html[data_line]))
  data <- as.data.frame(do.call(rbind, data))
  names(data) <- cols
  data
}
I realise that is not particularly beautiful, so to make life easier I have bundled everything up neatly into an R package which I have called ngramr, hosted on GitHub.

The core functions are ngram, which queries the Ngram Viewer and returns a dataframe of frequencies; ngrami, which does the same thing in a somewhat case-insensitive manner (by which I mean that, for example, the results for "mouse", "Mouse" and "MOUSE" are all combined); and ggram, which retrieves the data and plots the results using ggplot2. All of these functions allow you to specify various options, including the date range and the language corpus (Google can provide results for US English, British English or a number of other languages, including German and Chinese).
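Putting those together, a minimal session looks something like this (a sketch based on the description above; argument names such as year_start follow the examples that appear later in the comments, so treat the details as illustrative):

```r
library(ngramr)
library(ggplot2)

# Fetch yearly frequencies for two 1-grams, starting from 1900
freq <- ngram(c("Marx", "Freud"), year_start = 1900)
head(freq)

# Case-insensitive variant: combines "mouse", "Mouse", "MOUSE", etc.
freq_ci <- ngrami("mouse")

# Fetch and plot in one step with ggplot2
ggram(c("Marx", "Freud"), year_start = 1900)
```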

The package is easy to install from GitHub and I may also post it on CRAN.
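Installing from GitHub with the devtools package looks something like this (the repository path shown is an assumption; check the package's GitHub page for the current one):

```r
# Install straight from the GitHub repository using devtools.
# The repository path here is an assumption and may change.
install.packages("devtools")  # if you don't already have it
devtools::install_github("seancarmody/ngramr")
```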

I would be very interested in feedback from anyone who tries out this package and will happily consider implementing any suggested enhancements.

UPDATE: ngramr is now available on CRAN, making it much easier to install.
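With the package on CRAN, installation reduces to the usual one-liner:

```r
install.packages("ngramr")  # from CRAN
library(ngramr)
```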


1 Fr. July 20, 2013 at 9:21 am

Nice job! I am not sure I understand the smoothing parameter, though, and the aggregate argument has failed a few quick tests. I have made a few suggestions to make the function a bit more robust and to provide more flexibility with geoms. I’m also suggesting the GGally package as a possible candidate to publish the function in.

2 Stubborn Mule July 20, 2013 at 1:39 pm

@Fr thanks for the suggested edits over on github: I have incorporated your suggestions. Could you give me some examples of the errors you got with your aggregate tests? I suspect that there is scope for further error trapping!

3 Stephen Peplow July 22, 2013 at 8:07 am

Hi—I can’t seem to get ngramr working. I have tried both the devtools method and installing from a local zip file. Here is the error message:

Error in .install_package_code_files(".", instdir) :
files in 'Collate' field missing from 'C:/Users/Stephen/AppData/Local/Temp/RtmpEBIzzL/ngramr-master/R':
ERROR: unable to collate and parse R files for package 'ngramr'
* removing 'C:/Users/Stephen/Documents/R/win-library/2.15/ngramr'
Error: Command failed (1)

This is a really neat application and I’d love to get it going.
Thanks for any help

4 Stubborn Mule July 22, 2013 at 9:25 am

@Stephen: sorry about that. I’ve been tweaking some new functionality and seem to have broken it! I will let you know as soon as it is fixed.

5 Fr. July 22, 2013 at 5:18 pm

I have flagged the line in the code that is probably at fault; it should be easy to fix.

6 Stubborn Mule July 22, 2013 at 5:35 pm

Yes indeed: I was part way through some changes and must have pushed them prematurely up to GitHub. I will fix them when I get home and make sure I adopt better practice and establish a development branch!

7 Stubborn Mule July 22, 2013 at 7:01 pm

@Stephen: it should work now. Sorry for the mess!

8 Stephen Peplow July 23, 2013 at 3:26 am

Thanks — I got it working, but thought you should know: downloading from the ZIP file didn’t work; it just stops. Downloading from GitHub worked, except users should be aware that they’ll need to update their version of R. Small thing: the example code you give at the top for hacker etc. doesn’t include require(ggplot2). I am going to write up my own example and will send you a link. Thanks for all this!

9 Stubborn Mule July 23, 2013 at 6:03 am

@Stephen: thanks for the feedback. I will look into the ZIP file issue and reflect the other comments in the instructions.

10 Stubborn Mule July 23, 2013 at 8:49 pm

@Stephen: what problem did you have with the ZIP install? My testing looked like this:

> library(devtools)
> install_local("~/Downloads/")
Installing package from ~/Downloads/
Installing ngramr
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL  
  --library='/Library/Frameworks/R.framework/Versions/3.0/Resources/library' --with-keep.source  

* installing *source* package 'ngramr' ...
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (ngramr)

11 Maxine December 24, 2013 at 2:35 am

This is fantastic! I was just wondering how R goes about dealing with accents? This seems to be somewhat of a barrier for working with the non-English corpora. Thanks again for such a useful bit of code!

12 Stubborn Mule December 24, 2013 at 5:37 pm

While I have not tested this extensively, R, Google ngrams and ngramr seem to behave ok with accents. For example, this seems to work fine:

ggram("soufflé", corpus="fre_2012", year_start=1800)

13 Maxine December 27, 2013 at 9:58 pm

Thanks for replying! That’s funny, I can see that works fine but when I look at certain words I get an error. For example:
ggram("fécondité", corpus="fre_2012", year_start=1800)
gives me:
Error in data.frame(…, check.names = FALSE) :
arguments imply differing number of rows: 209, 0
Any ideas?

14 Stubborn Mule December 27, 2013 at 10:29 pm

I’ll have a look into it!

15 Stubborn Mule December 28, 2013 at 9:37 am

@Maxine. I tried

ggram("fécondité", corpus="fre_2012", year_start=1800)

and it worked. Here is the result. Have you got the latest version of ngramr (and other packages)? If you are using RStudio (which I would highly recommend!) this can be done via Tools -> Check for Package Updates.

16 Maxine December 30, 2013 at 9:13 pm

Thanks, at least I know it’s on my end! I’ve just started using RStudio and have the same issue. For some reason my R is just ignoring the accents, as when I try “soufflé” for example I get the results for “souffl”, and words with an accent within the word (like fécondité) are returning no results, hence the error. Very strange.

17 Stubborn Mule December 30, 2013 at 10:35 pm

@Maxine: what operating system are you using? Windows? I am on a Mac, but will also test on Windows.

18 Stubborn Mule December 31, 2013 at 8:17 am

@Maxine, I have tried it now on Windows and get the same problem as you, so now there is something for me to investigate!

19 Maxine December 31, 2013 at 9:15 am

You sir, are a god amongst mules!

20 Stubborn Mule December 31, 2013 at 9:25 am

@Maxine I think I have got to the bottom of the problem: a different approach to character encoding on Windows. I think I have sorted out a fix, so will submit it to CRAN. I will keep you posted.
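For anyone hitting similar issues, the gist of the problem is worth sketching (this is a summary of the likely cause, not the literal patch): on Windows the native character encoding is typically latin1, so accented phrases need to be converted to UTF-8 before being URL-encoded into the query.

```r
# Sketch only: normalise phrases to UTF-8 before building the query URL,
# so accented characters survive on Windows (native encoding latin1).
phrases <- c("fécondité", "soufflé")
phrases <- enc2utf8(phrases)                   # force UTF-8 in any locale
url_terms <- URLencode(phrases, reserved = TRUE)
print(url_terms)
```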

21 Stubborn Mule December 31, 2013 at 5:28 pm

@Maxine the updated version (1.4.2) has been accepted on CRAN. It may take a little while for the binary versions to appear, but do try updating your packages and let me know if this solves your problem.

22 Maxine January 11, 2014 at 2:26 am

Working beautifully now, thank you for all your hard work!

23 Stubborn Mule January 11, 2014 at 1:47 pm

@Maxine excellent! Pleased to hear it.

24 Matt February 1, 2014 at 7:23 am

Thanks for writing such a useful package! While working with more obscure words, I’ve encountered a potential bug with the ngrami function. The line
>ngram("pulser", corpus="fre_2012")
returns the expected full dataset while the case-insensitive function,
>ngrami("pulser", corpus="fre_2012")
returns an error. I believe this is because it is trying to combine the results from "pulser" and "Pulser", the latter of which is empty.
(When I enter >ngram("Pulser", corpus="fre_2012"), it returns an error because there aren’t any instances of it in the corpus.) I’m using some workarounds, but I figure a fix is possible in the code itself.
Thanks for your help,
Matt Blackshaw

25 Stubborn Mule February 1, 2014 at 9:07 am

I’m glad you are finding the package useful. I will get onto the additional error trapping and keep you posted.

26 Stubborn Mule February 2, 2014 at 9:37 am

@Matt I have uploaded a new release of ngramr. The packages should be rebuilt within 24 hours or so. Let me know if it fixes your problems.

27 will May 16, 2014 at 5:11 am

Have you had any experience with this error message?

Error in fromJSON(sub(".*=", "", html[data_line])) :
CHAR() can only be applied to a 'CHARSXP', not a 'pairlist'

It seems to be caused by repeated ngram calls: I encountered it in a loop to build a matrix of more than 12 ngrams. Is there a capacity constraint built in by Google?

28 Stubborn Mule May 17, 2014 at 11:04 am

I have not seen that error, but it is certainly possible that there is a capacity constraint: Google changes the way its pages work quite often! Can you post some sample code?

29 suz August 21, 2014 at 6:26 am

G’day! Thanks for the great work with the package, I love it. Here are two questions from a noob, who wants to query a few dozen words at once:

1. I tried to build a for-loop (not experienced with looping, I admit), but as you need to quote the phrase, I’m unsure how you’d call ngram() with indexes. I couldn’t find any discussion apart from will’s comment above, so I guess it is possible to loop over a vector of strings. Mine’s not a capacity problem (not there yet!). I just get the error "is.character(phrases) is not TRUE". I tried ngram('"' cat[i] '"', corpus="eng_us_2012"), but I guess my fault lies elsewhere, like not understanding for-loops. (I only have a few dozen words I’d like to query.) Any suggestions on this issue?

2. What am I doing wrong with the ngrami() function? It returns a line "Browse[1]> " that expects user input?

30 Stubborn Mule August 21, 2014 at 7:56 am

I’ll have a look into it for you.

31 suz August 21, 2014 at 4:53 pm

Thanks! Q1 seems somewhat solved; it works fine with sapply(). So my problem has more to do with a misunderstanding of for-loops, I guess.

32 Stubborn Mule August 21, 2014 at 8:37 pm

There is a problem with the ngrami code. I’ve submitted an update to CRAN, so check for an update in 24 hours or so. Thanks for picking it up!

33 Stubborn Mule August 21, 2014 at 8:38 pm

By the way, you can bundle up multiple words in a single call: ngram(c("fish", "cow", "bird"))

34 Stubborn Mule August 31, 2014 at 8:44 pm

I am having problems with the CRAN submission. In the meantime you can get the most up-to-date version from GitHub.

35 catphish July 7, 2015 at 2:36 pm

Great script! Quick question: I can’t get any data back for phrases/words that include apostrophes. To process these, Google adds a space (e.g. “X ‘s Y”), but even when I do this the script skips the phrase. Any ideas?
