The recent post How common are common words? made use of unusually explicit language for the Stubborn Mule. As expected, a number of email subscribers reported that the post fell foul of their email filters. Here I will return to the topic of n-grams, while keeping the language cleaner, and describe the R package I developed to generate n-gram charts.
Rather than an explicit language warning, this post carries a technical language warning: regular readers of the blog who are not familiar with the R statistical computing system may want to stop reading now!
The Google Ngram Viewer is a tool for tracking the frequency of words or phrases across the vast collection of scanned texts in Google Books. As an example, the chart below shows the frequency of the words “Marx” and “Freud”. It appears that Marx peaked in popularity in the late 1970s and has been in decline ever since. Freud persisted for a decade longer but has likewise been in decline.
The Ngram Viewer will display an n-gram chart, but does not provide the underlying data for your own analysis. All is not lost, though: the chart is produced using JavaScript, so the n-gram data is embedded in the source of the web page. It looks something like this:
```javascript
// Add column headings, with escaping for JS strings.
data.addColumn('number', 'Year');
data.addColumn('number', 'Marx');
data.addColumn('number', 'Freud');
// Add graph data, without autoescaping.
data.addRows(
  [[1900, 2.0528437403299904e-06, 1.2246303970897543e-07],
   [1901, 1.9467918036752963e-06, 1.1974195999187031e-07],
   ...
   [2008, 1.1858645848406013e-05, 1.3913611155658145e-05]]
)
```
With the help of the RJSONIO package, it is easy enough to parse this data into an R dataframe. Here is how I did it:
```r
ngram_parse <- function(html){
  if (any(grepl("No valid ngrams to plot!", html)))
    stop("No valid ngrams.")
  # Pull the column names out of the addColumn calls
  cols <- lapply(strsplit(grep("addColumn", html, value=TRUE), ","),
                 getElement, 2)
  cols <- gsub(".*'(.*)'.*", "\\1", cols)
  # ... (remainder of the function omitted)
}
```
I realise that is not particularly beautiful, so to make life easier I have bundled everything up neatly into an R package which I have called ngramr, hosted on GitHub.
The package has three core functions. ngram queries the Ngram Viewer and returns a dataframe of frequencies. ngrami does the same thing in a somewhat case-insensitive manner, by which I mean that the results for, say, "mouse", "Mouse" and "MOUSE" are all combined. ggram retrieves the data and plots the results using ggplot2. All of these functions accept various options, including the date range and the language corpus (Google provides results for US English, British English and a number of other languages, including German and Chinese).
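As a quick illustration, the Marx/Freud chart above can be reproduced in a couple of lines. This is a sketch: the corpus and year arguments follow the forms used elsewhere in this post, and the exact defaults may differ in the released package.

```r
library(ngramr)
library(ggplot2)

# Fetch the raw frequencies as a dataframe
freq <- ngram(c("Marx", "Freud"), corpus = "eng_us_2012", year_start = 1900)
head(freq)

# Or fetch and plot in a single step
ggram(c("Marx", "Freud"), corpus = "eng_us_2012", year_start = 1900)
```

Note that ngram queries the Ngram Viewer over the network, so an internet connection is required.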
The package is easy to install from GitHub and I may also post it on CRAN.
I would be very interested in feedback from anyone who tries out this package and will happily consider implementing any suggested enhancements.
UPDATE: ngramr is now available on CRAN, making it much easier to install.
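In concrete terms, installation now looks like this (the GitHub repository path shown is an assumption; check the package README for the canonical one):

```r
# Released version from CRAN
install.packages("ngramr")

# Or the development version from GitHub
# (repository path assumed; see the package README)
# install.packages("devtools")
devtools::install_github("seancarmody/ngramr")
```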
Nice job! I am not sure I understand the smoothing parameter, though, and the aggregate argument has failed a few quick tests. I have made a few suggestions to make the function a bit more robust and to provide more flexibility with geoms. I’m also suggesting the GGally package as a possible candidate to publish the function in.
@Fr thanks for the suggested edits over on GitHub: I have incorporated your suggestions. Could you give me some examples of the errors you got with your aggregate tests? I suspect there is scope for further error trapping!
Hi—I can’t seem to get ngramr working. I have tried both the devtools method and installing from a local zip file. Here is the error message:
```
Error in .install_package_code_files(".", instdir) :
  files in 'Collate' field missing from
  'C:/Users/Stephen/AppData/Local/Temp/RtmpEBIzzL/ngramr-master/R':
  themes.R
ERROR: unable to collate and parse R files for package 'ngramr'
* removing 'C:/Users/Stephen/Documents/R/win-library/2.15/ngramr'
Error: Command failed (1)
```
This is a really neat application and I’d love to get it going.
Thanks for any help
Stephen
@Stephen: sorry about that. I’ve been tweaking some new functionality and seem to have broken it! I will let you know as soon as it is fixed.
I have flagged the line that is probably at fault in the code, it should be easy to fix.
Yes indeed: I was part way through some changes and must have pushed them prematurely up to GitHub. I will fix them when I get home and make sure I adopt better practice and establish a development branch!
@Stephen: it should work now. Sorry for the mess!
Thanks, I got it working, but thought you should know: downloading from the ZIP file didn't work; it just stops. Downloading from GitHub worked, though users should be aware that they'll need to update their version of R. One small thing: the example code you give at the top for hacker etc. doesn't include require(ggplot2). I am going to write up my own example and will send you a link. Thanks for all this!
@Stephen: thanks for the feedback. I will look into the ZIP file issue and reflect the other comments in the instructions.
@Stephen: what problem did you have with the ZIP install? My testing looked like this:
This is fantastic! I was just wondering how R goes about dealing with accents? This seems to be somewhat of a barrier for working with the non-English corpora. Thanks again for such a useful bit of code!
While I have not tested this extensively, R, Google ngrams and ngramr seem to behave ok with accents. For example, this seems to work fine:
Thanks for replying! That’s funny, I can see that works fine but when I look at certain words in I get an error. For example:
```r
ggram("fécondité", corpus="fre_2012", year_start=1800)
```
gives me:
```
Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 209, 0
```
Any ideas?
I’ll have a look into it!
@Maxine. I tried
```r
ggram("fécondité", corpus="fre_2012", year_start=1800)
```
and it worked. Here is the result. Have you got the latest version of ngramr (and other packages)? If you are using RStudio (which I would highly recommend!) this can be done via Tools -> Check for Package Updates.
Thanks, at least I know it’s on my end! I’ve just started using RStudio and have the same issue. For some reason my R is just ignoring the accents, as when I try “soufflé” for example I get the results for “souffl”, and words with an accent within the word (like fécondité) are returning no results, hence the error. Very strange.
@Maxine: what operating system are you using? Windows? I am on a Mac, but will also test on Windows.
@Maxine, I have tried it now on Windows and get the same problem as you, so now there is something for me to investigate!
You sir, are a god amongst mules!
@Maxine I think I have got to the bottom of the problem: a different approach to character encoding on Windows. I think I have sorted out a fix, so will submit it to CRAN. I will keep you posted.
@Maxine the updated version (1.4.2) has been accepted on CRAN. It may take a little while for the binary versions to appear, but do try updating your packages and let me know if this solves your problem.
Working beautifully now, thank you for all your hard work!
@Maxine excellent! Pleased to hear it.
Thanks for writing such a useful package! While working with more obscure words, I’ve encountered a potential bug with the ngrami function. The line
```r
> ngram("pulser", corpus="fre_2012")
```
returns the expected full dataset while the case insensitive function,
```r
> ngrami("pulser", corpus="fre_2012")
```
returns an error. I believe this is because it is trying to combine the results from “pulser” and “Pulser”, the latter of which is empty.
(When I enter > ngram("Pulser", corpus="fre_2012"), it returns an error because there aren't any instances of it in the corpus.) I'm using some workarounds, but I figure a fix is possible in the code itself.
Thanks for your help,
Matt Blackshaw
I’m glad you are finding the package useful. I will get onto the additional error trapping and keep you posted.
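In the meantime, one hypothetical workaround (illustrative only, not the package's eventual fix) is to trap the error and fall back to the case-sensitive query; safe_ngrami below is a made-up name, not part of ngramr:

```r
library(ngramr)

# Illustrative wrapper: fall back to ngram() if ngrami() fails,
# for example when one case variant is absent from the corpus
safe_ngrami <- function(phrase, ...) {
  tryCatch(
    ngrami(phrase, ...),
    error = function(e) {
      message("ngrami failed: ", conditionMessage(e),
              " -- falling back to case-sensitive ngram()")
      ngram(phrase, ...)
    }
  )
}

safe_ngrami("pulser", corpus = "fre_2012")
```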
@Matt I have uploaded a new release of ngramr. The packages should be rebuilt within 24 hours or so. Let me know if it fixes your problems.
Have you had any experience with this error message?
```
Error in fromJSON(sub(".*=", "", html[data_line])) :
  CHAR() can only be applied to a 'CHARSXP', not a 'pairlist'
```
It seems to be caused by repeated ngram calls: I encountered it in a loop building a matrix of more than 12 ngrams. Is there a capacity constraint built in by Google?
I have not seen that error, but it is certainly possible that there is a capacity constraint: Google changes the way its pages work quite often! Can you post some sample code?
G’day! Thanks for the great work with the package, I love it. Here are two questions from a noob, who wants to query a few dozen words at once:
1. I tried to build a for-loop (not experienced with looping, I admit), but as you need to quote the phrase, I'm unsure how you'd call ngram() with indexes. I couldn't find any discussion apart from the comment above, so I guess it is possible to loop over a vector of strings. Mine's not a capacity problem (not there yet!). I just get the error "is.character(phrases) is not TRUE". I tried ngram('"'cat[i]'"', corpus="eng_us_2012"), but I guess my fault lies elsewhere, like not understanding for-loops. (I only have a few dozen words I'd like to query.) Any suggestions on this issue?
2. What am I doing wrong with the ngrami() function? It returns a line "Browse[1]> " that expects user input?
I’ll have a look into it for you.
Thanks! Q1 seems somewhat solved; it works fine with sapply(). So my problem has more to do with a misunderstanding of for-loops, I guess.
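For anyone following along, the pattern looks something like this (the word list here is made up for illustration):

```r
library(ngramr)

words <- c("cat", "dog", "mouse")

# One query per word, collected into a list of dataframes
results <- lapply(words, function(w) ngram(w, corpus = "eng_us_2012"))

# Though note that a single call accepts a whole vector of phrases
combined <- ngram(words, corpus = "eng_us_2012")
```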
There is a problem with the ngrami code. I’ve submitted an update to CRAN, so check for an update in 24 hours or so. Thanks for picking it up!
By the way, you can bundle up multiple words in a single call:
ngram(c("fish", "cow", "bird"))
I am having problems with the CRAN submission. In the meantime you can get the most up to date version from github.
Great script! Quick question: I can’t get any data back for phrases/words that include apostrophes. To process these, Google adds a space (e.g. “X ‘s Y”), but even when I do this the script skips the phrase. Any ideas?