Shark season

Summer in Australia comes with cicadas, sunburn and, in the media at least, sharks. So far, I have learned that aerial shark patrols are inefficient (or perhaps not) and that the Western Australian government plans to keep swimmers safe by shooting big sharks.

Sharks are compelling objects of fear, right up there with spiders and snakes in the package of special terrors for visitors to Australia. As good hosts, we are quick to reassure: sharks may be the stuff of nightmares and 70s horror movies, but attacks are rare.

But, exactly how rare is death by shark? Over a Boxing Day lunch, I heard an excellent ‘statistic’, designed to reassure a visiting American. Apparently, more people are killed each year in the US by falling vending machines than are killed by sharks around the world. I was skeptical, but had no data to hand. Later, with the help of Google, I discovered that this statistic is 10 years old. And the source? Los Angeles lifeguards. The tale has, however, become taller over time: originally, vending machine deaths in the US were compared to shark attack fatalities in the US, not the entire world.

While data on vending machine related deaths are hard to come by, subsequent attempts to validate the story concluded that it was plausible, on the basis that there were two vending machine deaths in 2005 in the US but no fatal shark attacks.

Fun though the vending machine line may be, it is not relevant to Australia and, if you are on the beach contemplating a quick dip, the risk of shark attack is certainly higher than the risk of death by vending machine. Local data is in order.

According to the Taronga Zoo Australian Shark Attack File (ASAF):

 In the last 50 years, there have been 50 recorded unprovoked fatalities due to shark attack, which averages one per year.

Fatalities have been higher than average over the last couple of years. The ASAF recorded two deaths in 2012 and, although validated figures for 2013 are yet to be published, six deaths have been reported over the last two years, which would put this year's toll at four.

To compare shark fatalities to other causes of mortality, a common scale is useful. My unit of choice is the micromort. A one-in-a-million chance of death corresponds to one micromort; a one-in-ten-million chance of death to 0.1 micromorts. Taking the recent average of three shark deaths per year (more conservative than the longer-run average of one) and a population of 23 million leads to a figure of 0.13 micromorts for the annual risk of death by shark attack for a randomly chosen Australian.
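
For the sceptical, here is that arithmetic as a couple of lines of R, using only the figures quoted above (three deaths a year, 23 million people):

# Annual risk of death by shark attack, in micromorts
shark_deaths_per_year <- 3        # recent average from the ASAF figures above
population            <- 23e6     # approximate population of Australia
shark_deaths_per_year / population * 1e6
# [1] 0.1304348  -- roughly 0.13 micromorts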

The most recent data on causes of death published by the Australian Bureau of Statistics (ABS) are for 2009. That year, three people were killed by crocodiles. Sharks are not specifically identified, but any fatal shark attacks would be included among the three deaths due to ‘contact with marine animals’. The chart below illustrates the risk of death associated with a number of ‘external causes’. None of these come close to heart disease, cancer or car accidents. Death by shark ranks well below drowning, even drowning in the bath, as well as below a variety of different types of falls, whether from stairs, cliffs or ladders.

Annual risk of death in Australia (2009 data)*

Of course, you and I are not randomly chosen Australians and our choices change the risks we face. I am far less likely to suffer death by vending machine if I steer clear of the infernal things and I am far less likely to be devoured by a shark if I stay out of the water.

So, care should be taken when interpreting the data in the chart. Drug addicts (or perhaps very serious Hendrix imitators) are far more likely to asphyxiate on their own vomit than summer beach-goers are. The fairest point of comparison is drowning in natural waters. At almost 3.5 micromorts, drowning in the sea (or in lakes and rivers) is more than 25 times more common than fatal shark attack. And the risk of both can be reduced by swimming between the flags.

What does that leave us with for conversations with foreign visitors? If you are headed to the beach, the risk of shark attack is higher than the risk of death by vending machine, but it is still very low. The drive there (at 34.3 micromorts) is almost certainly more dangerous.

I will be taking comfort from my own analysis as I am heading to Jervis Bay tomorrow and sharks were sighted there this weekend:

Bendigo Bank Aerial Patrol spotted up to 14 sharks between 50 and 100 metres from shore at various beaches in Jervis Bay. [The] crew estimated the sharks at between 2.5 and 3.5 metres in length at Nelsons, Blenheim, Greenfields, Chinaman’s Beach and Hyams Beaches.

The beaches are un-patrolled, so wish me luck…but I don’t think I’ll need it.

* The figure for ‘Shark attack’ is based on the estimate of three deaths per year rather than the ABS data.

Power to the people

Regular Mule contributor, James Glover, returns to the blog today to share his reflections on solar power.

I have been investigating solar power for years and finally bit the bullet and signed up for a system. A 4.5kW system cost me $8,500 after the Government rebate (about $3,000). I’ve been meaning to write about my adventures in solar for a while now. It started with a strange fact I discovered about four years ago: even though the cost of solar cells has dropped dramatically over that time (by about 75%), the payback time has stayed steady at about 5-10 years. The payback time is based on what you save by not paying your power bills plus what you earn by selling electricity back into the grid. The peak time for solar generation is 10am-2pm, while the peak times for domestic use are in the morning and evening, outside that window.

The answer to my conundrum is that while the cost of solar cells has been steadily dropping, so has the feed-in tariff. When the feed-in tariff was 60c per kWh, the excess power generated during the day paid for the more expensive power consumed in the evening. In Victoria the feed-in tariff has dropped to about 8c, while peak power costs about 32c per kWh, so to bring your net cost of power to zero you now need an even bigger system. A particularly good website I found for all things solar is SolarQuotes. I thoroughly recommend it as it has lots of info on solar power as well as cost-benefit analysis. They recommended two solar companies in my area, both of whom were very good.
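
To make the payback arithmetic concrete, here is a rough sketch in R. Everything in it is an illustrative assumption (generation and the share of power used at home will vary with your roof, your habits and your retailer); only the 32c peak price and the 8c feed-in tariff come from the figures above.

# Rough solar payback estimate -- all inputs are illustrative assumptions
system_cost    <- 8500            # $ for a 4.5kW system, after the rebate
generation     <- 4.5 * 4 * 365   # kWh per year, assuming ~4 "sun hours" a day
self_use_share <- 0.3             # assumed fraction of generation used at home
peak_price     <- 0.32            # $/kWh avoided when using your own power
feed_in_tariff <- 0.08            # $/kWh earned on power exported to the grid

annual_saving <- generation * (self_use_share * peak_price +
                               (1 - self_use_share) * feed_in_tariff)
system_cost / annual_saving       # payback in years -- roughly 8.5 on these numbers

On these numbers the payback comes out at around 8.5 years, at the top of the 5-10 year range. Rerun it with the old 60c feed-in tariff but a pre-drop system price roughly four times today’s and you land in much the same range, which is the conundrum resolved.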

From a financial point of view it makes sense that power companies would buy solar power at a lower rate than they sell it: it’s called the bid/offer spread and is how most companies make money. The cost of producing power is about 5c per kWh, so it is still cheaper for them to produce and sell the power themselves than to buy it from solar generators.

There is a twist to this tale, however. Electricity generators are monopolies and so, left to their own devices, would naturally gouge buyers. When the state governments privatised electricity generation they set up supervisory boards to ensure the companies made reasonable, but not immodest, profits. In the absence of a competitive market, one way to do this is on a “cost plus” basis: set the profit at, say, 10% above the cost of electricity generation. That seemed reasonable until power companies found a way to game the system: if they increase the cost of providing electricity, they increase their profits.

But surely, you say, the cost of generating electricity is set by market forces for the raw materials plus the cost of running the plant? Not quite: one way to inflate the cost base is to spend much more on investment than is actually necessary, and the electricity companies did this beautifully. They convinced the state government oversight bodies that not only was electricity consumption forecast to rise well above GDP growth, but that existing infrastructure needed to be “gold plated”: improved to reduce the probability of a widespread failure. A combination of inflated growth predictions (and hence building new plants) and gold plating is the real reason electricity prices have risen 20% year on year over the last few years. Yes, the carbon tax has had a small effect as a one-off increase. The Coalition (now the Government) exploited this in the run-up to the election, although I am pretty sure it was not the real reason the Labor government lost office.

If you take solar power growth into consideration then electricity generation from traditional sources such as coal and hydro is expected to fall, not rise. Gold plating (soon to include actual gold power lines…I think I am joking) is now seen for what it is and is being reined in.

One of the things I have always wondered is why someone doesn’t set up a virtual power company which buys solar power and sells it to distributors. It turns out they already exist. The thing which swung me to the solar provider I chose (the price was identical to the others) was that they could hook me into just such a company. Sunpower is a US company which has set up in Australia to do exactly this. Currently their feed-in tariffs are higher (guaranteed 20c for 2 years, as opposed to 8c from the coal-generating providers), though I have no expectation they will remain this high. Diamond Energy, an Australian company, is another example of a virtual power company. Diamond Energy buys green power from retail solar producers (i.e. you and me) as well as independent wind farms. They also invest in their own larger-scale solar and wind farms. Market forces will dictate the future price and I am happy to offset the environmental cost of running my air conditioning at full bore over summer.

In the US there are already communities which set up solar farms to provide their bulk electricity and sell the excess to the grid. Old-style electricity companies have resorted to claiming that there are problems with solar electricity: either that it is generated at the wrong time of the day, or that old-style inverters convert direct current into modified sine waves rather than pure sine waves, and some electrical appliances don’t operate as well on modified sine waves. Increasingly, though, inverters are of the pure sine wave type anyway. While there is some truth to these arguments, it is worth remembering that power companies would prefer that there was no solar at all. They have an axe to grind: their arguments are designed to limit the onward march of solar, or to compensate them fully for lost revenue, which would achieve the same aim through higher solar costs or lower feed-in tariffs.

Another example of why traditional power companies are increasingly out of touch is smart meters. Solar power companies monitor power usage through smart meters and solar panel output monitoring. They then provide feedback directly to your tablet or smart phone, and work with you to optimise your power usage and minimise costs. Traditional power companies see smart meters purely as a way to save on meter-reading costs; they have no interest in reducing users’ power consumption.

It seems that in Australia, the “sunburnt country”, we have missed a few tricks. The dinosaur coal-based power companies are fighting a rearguard action, trying to get governments to lower the feed-in tariff further or to let them charge solar customers a fixed fee to cover their “costs”. I think they are on the wrong side of history. A consumer group, Solar Citizens, has already been effective in reminding governments that over 1m households have solar power. I think that 1m is a tipping point.

There are about 8m households in Australia. At a cost of about $5,000 each, we could make every one of them a net producer of electricity for a total of $40bn. About the cost of the NBN. A new national Snowy River Scheme!

Power to the people. From the people. For the people.

Qantas and Adobe

In my last post, I complained about the approach Qantas has taken to password security for its new Qantas Cash website. When I called Qantas to express my concerns, my query was referred to the “technical team”. I was assured they would be able to assuage my concerns. Here is the email response I received:

As I’m sure you’ll understand, we cannot discuss in any depth the security protocols and practices of our products.

However, I can assure you that your password is stored and encrypted on our server, is never transmitted and cannot be viewed by anyone.

The reason we use random ordinal characters rather than full password entry is because it is more secure as it makes harvesting passwords using keylogging software a much more challenging task.

Thank you for taking an interest in the product and we are certain you’ll find the site, the card and the product as a whole, a secure and useful addition to your payment options.

I tried to dig a little deeper, asking whether individual password characters were hashed. This did not help:

Thank you for your email. Your previous question has been queried with our technical team. They have advised that we cannot discuss in any depth the security protocols and practices of our products.

I am far from reassured. Security through obscurity is a poor strategy. Knowing how an effective security practice works does not make it weaker. Quite the contrary: the best security practices are well known, have been tested and retested, and have survived unscathed. The ones that do not pass these tests are discarded. If Qantas is keeping its security methods secret, it simply heightens my fear that they have been devised by web developers who are not experts in security, and that those methods are vulnerable to attack.

Qantas and I are approaching the question of security very differently, with different threat models. Qantas is focused on preventing me from doing something silly that could compromise my account, whereas I am worried about Qantas itself being hacked.

Only a few weeks ago, Adobe was hacked and up to 150 million encrypted passwords were made public. Adobe’s encryption methods were weak (no salted hashing!) and password hints for all of the accounts were also leaked. Enthusiastic hackers are quickly reverse-engineering the passwords.

The same thing could happen to Qantas. If it does, and Qantas is moved to offer a heartfelt apology to their customers, I will not be too upset: I will not be one of those customers.

Security can be tricky

Qantas has recently launched Qantas cash, a pre-paid Mastercard which you can charge up with cash in multiple currencies. The contemporary equivalent of traveller’s cheques, cards like this can be as convenient as a credit card, with the added advantage of reducing the uncertainty associated with exchange rate volatility. If you have a rough idea of how much you will need in euro, you can charge up the card with euro at today’s exchange rate without having to worry about the Australian dollar dropping in value while you are halfway through your trip.

As a Qantas frequent flyer account holder, I received a Qantas cash card in the mail and it seemed worth investigating. However after activating the card, my interest in the card itself was quickly displaced by disappointment in the insecure design of the Qantas cash website.

Computer security is not easy. It should be left to the experts. I am no expert myself, but I have listened to enough of the Security Now podcast to recognise poor security when I see it.

The first sign of trouble came when I set my password, which had to be 6 to 8 characters long. A maximum of only 8 characters? The longer the password, the more secure it is, and 8 characters is far too short for a secure password.

Somewhat disconcerted, I pressed on, creating a password made up of 8 random characters. Random passwords are far more secure than real words (or even transparently modified “w0rd5”). They are also impossible to remember, but there are plenty of secure password storage tools (such as LastPass) that make that unnecessary.

Having set everything up, I was then prompted to log in. Unexpectedly, instead of being prompted to enter my password, I was asked to enter the “3rd, 4th and 5th character of the password”. Alarm bells started ringing. Quite apart from the irritation that this caused as it prevented LastPass from automatically filling in the password, it confirmed my initial fears that the website’s security model was flawed.

What I had realised was that Qantas servers must be storing passwords. For anyone unfamiliar with password security, this may seem blindingly obvious. If the servers don’t store the password, how can the website confirm you have entered the correct password when you log in?

In fact, there is a far more secure approach, which makes use of so-called “one-way functions”. A one-way function takes a string of characters (a password, for example) as input and produces a different string of characters as its output. The key feature of a one-way function is that it is extremely difficult to reverse the process: given the output, working out what the input must have been is computationally highly intensive. Applying a one-way function is also known as (cryptographic) “hashing”.

Armed with a good one-way function, instead of storing passwords, a web server can store a hash of the password*. Then, whenever a user enters a password, the web site applies the one-way function and compares the result to its database. The password itself can be discarded immediately. The webserver’s user database should only ever contain hashes of user passwords and never the “plain text” original version of the password.
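
To make the idea concrete, here is a minimal sketch in R using the digest package. It illustrates the principle only: the helper functions are my own invention, not anything Qantas or any real site uses, and a serious implementation would use a deliberately slow algorithm such as bcrypt or scrypt rather than a single round of SHA-256, with a salt from a cryptographic random number generator.

library(digest)

# Store a random salt and the hash of salt + password -- never the password itself
store_password <- function(password) {
  salt <- paste(sample(c(letters, LETTERS, 0:9), 16, replace = TRUE), collapse = "")
  list(salt = salt,
       hash = digest(paste0(salt, password), algo = "sha256", serialize = FALSE))
}

# To check a login attempt, hash it with the stored salt and compare the results
check_password <- function(attempt, stored) {
  identical(digest(paste0(stored$salt, attempt), algo = "sha256",
                   serialize = FALSE), stored$hash)
}

stored <- store_password("correct horse battery staple")
check_password("correct horse battery staple", stored)   # TRUE
check_password("guess", stored)                           # FALSE

Note that nothing here ever keeps the password itself: only the salt and the hash are stored.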

While this approach to password storage is well-established practice in the security community, many corporate websites are not designed by security experts. Back in 2011, hackers were able to get hold of more than a million passwords from Sony which had been stored in plain text.

Unfortunately, it would appear that Qantas cash is not following best practice in its website security. If the site was only storing hashed passwords, it would be impossible for the site to verify whether users were correctly entering the 3rd, 4th and 5th character of the password. Taking a password hash and trying to determine individual characters of the original password is just as difficult as reverse engineering the whole password.**

I then called Qantas cash to seek clarification. I was assured that all passwords were “encrypted” using the same security techniques that any other commercial website, such as Amazon, would use. Furthermore, the requirement to enter individual characters of the password was an additional security measure to prevent users from copying and pasting passwords.

This did not reassure me. Even if the passwords are encrypted, the Qantas cash server itself clearly has the capability of decrypting the passwords, which makes it just as vulnerable as Sony. I am also sure that Amazon does not use this approach. And preventing copying and pasting is a furphy. By preventing users from using secure password stores, this approach simply encourages the use of weaker passwords.

The Qantas cash developers may think they have come up with some excellent security features. But these developers are clearly not experts in security and, as a result, have produced a far less secure site. The call centre promised that the technical team would email me more details of the site’s security. My hopes are not high.

Needless to say, I will not be using the Qantas cash card. This is an e-commerce site, not a movie chat forum. When money is involved, security should be paramount.

Keep your eyes open for news about a Qantas cash website hack.

* Strictly speaking, a “salted hash” should be stored to add an additional layer of security and protect against the use of rainbow tables.

** In principle, Qantas could store hashes of three character combinations (56 hashes would have to be stored or 336 if order is significant). In practice I doubt this is being done.
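
For what it’s worth, the counting in that second footnote is easy to check in R:

choose(8, 3)   # 56 ways to pick 3 positions from an 8-character password
prod(8:6)      # 336 if the order of the three characters matters (8 x 7 x 6)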

Cats

Somehow September has passed by without a single post. During that time, the Mule has travelled to the other side of the world and back (primarily for a one day workshop in Switzerland). Also, James Glover (regular contributor to the blog) and I have been exploring the statistical significance of global temperatures. That will, eventually, crystallise into a future post but in the meantime James has been driven to reflect on cats rather than climate.

There are, apparently, two kinds of people. Those who like cats and those who don’t have personalities. I am of the former and am onto my 5th and 6th cats (a mother/daughter pair of rescue cats). I’ve been reading (another) book on cat behaviour which traces the domestication of the cat from solitary hunters to domestic pets (John Bradshaw’s Cat Sense: The Feline Enigma Revealed). Most domesticated animals are herd beasts whose natural behaviours lend them to domestication. A really great read on this is Jared Diamond’s Guns, Germs and Steel. Cats, however, are naturally solitary creatures whose real benefit to humans became obvious when agrarian societies stored grains which attracted rodents, the cat’s natural food source. It’s hard to imagine now, when we get our daily bread from Woolies, but think back to the day when farmers were (literally) plagued by mice and rats, and cats served to control them.

As a kid growing up in suburban Townsville, we had an un-neutered tom cat called Whiskey. We weren’t allowed to play with Whiskey, and I have vague memories of him bringing home litters which lived briefly under the house, and of my mother throwing him the occasional piece of liver on the back steps. He wasn’t what you would call a friendly cat. When I was eight we moved, and I recall driving with my father to take Whiskey to a “cat home”. I still have an image of dozens of cats climbing up the side of a large wire cage. I am guessing Whiskey didn’t last there for long and, of course, was happily re-homed with another loving family. Yes, that’s what happened.

Almost every website on cats says not to feed them cow’s milk, because adult mammals don’t produce lactase, the enzyme required to break down the lactose in milk into simpler sugars. Mammals stop producing lactase once they are weaned, because their mothers no longer provide them with milk; instead they produce enzymes which turn proteins, in animal and vegetable matter, into sugars. Producing lactase would be pointless, would require resources better devoted to other enzymes, and hence has been selected against. The idea is that if cats can’t digest lactose, it stays in their gut and the bacteria feeding on it lead to an upset stomach and diarrhoea. But I see several problems with this view.

  1. Humans can produce lactase as adults*, due to a variety of different genetic mutations which stop the shutdown of lactase production in adulthood. The mutation doesn’t have to find a way to produce lactase from scratch, just a way to stop stopping it. It spread because of the nutritional benefits of cow’s milk to dairy farmers, who first appeared about 10,000 years ago. Comparisons of 10,000-year-old human DNA with that of modern descendants of dairy farmers show this is a widespread adaptation, owing to its obvious nutritional benefits. Indigenous Australians and Inuit don’t have this mutation because they have no dairy-farming ancestors. There is still an open question here, however, as curdled milk and cheese don’t contain much lactose and so don’t require lactase to digest. Personally I suspect that hunters who killed a lactating cow were able to drink the milk immediately and benefited. Other theories say cow’s milk, as an alternative to water, may have protected against disease. Not all humans can digest milk, though. My own father, for example, can’t drink it.
  2. Cats are quickly put off foods that make them feel sick, and my cats love milk. It’s possible there is something in milk which they love (like catnip) even if it makes them sick, but they are quick learners and I doubt it.
  3. There is a lack of eyewitness evidence from vets and catteries, back in the day when cats were routinely fed milk, that cats suffered diarrhoea when they drank it. None of the evidence against cats drinking cow’s milk seems to be based on such accounts. I’ve not found a single account of someone whose cats were fed cow’s milk and suffered.
  4. Cats have adapted to living with humans rapidly over the last 2-3 thousand years. Because of their shorter lifespans, this is equivalent to 4-5 times as long in human terms, about the same time over which humans have adapted to drinking milk as adults.
  5. It makes sense that cats which were given milk by humans, and could process it, would have a better chance of reproducing. They would have had a nutritional advantage over cats which couldn’t; the same evolutionary pressure that operated on humans should operate on cats, and they should (most of them, anyway) have adapted to drinking milk as adults.
  6. I can’t find a single study which shows that cats can’t produce lactase as adults; it just seems to be assumed because they are non-human mammals.

My guess is that cats descended from European cats can (most of them anyway) drink cow’s milk safely. If they drink it and come back for more it probably doesn’t upset them. My own cats, when they drink milk, run around like kids on sugary drinks, displaying very kittenish behaviour. That makes me think they are turning lactose into sugar, which means they are still producing lactase as adults.

I still find it quite amazing how memes like “cats shouldn’t drink milk” propagate across the internet without any supporting evidence, such as an actual study which shows it. Like climate skeptics, cat people latch onto “evidence” which supports their point of view. In any event, if anyone has firm evidence that adult cats don’t produce lactase I would be happy to hear about it.

Two cats, both called Minoo, because cats don’t actually know their names

* Editor’s note: a recent episode of Science Friday touched on this and other evolutionary changes in the human diet. The theme of the podcast is that humans are still evolving, faster than ever. So, perhaps cats are too, as James suggests.

The price of protectionism

An article in Friday’s Australian began:

Ford has blamed Kevin Rudd’s $1.8 billion fringe benefits tax overhaul for halting production, forcing at least 750 workers to be stood down in rolling stoppages that will further imperil Labor’s chances of retaining the nation’s most marginal seat.

and goes on to report that the Federal Chamber of Automotive Industries has called on Labor to reverse its changes to the application of fringe benefits tax (FBT) to cars.

So what exactly has Labor done to put these jobs at risk?

The previous regime provided two mechanisms to determine tax benefits for expenses incurred for cars used for work purposes:

  1. the “log book” method, whereby the driver maintained records to show what proportion of their use of the car was for work rather than personal use, or
  2. an assumed flat rate of 20% work use of the car (regardless of how often the car is actually used for work purposes).

The government has eliminated the second option. So, the estimated $1.8 billion saving is due to the fact that a significant number of drivers using the 20% method could never come close to a 20% proportion of work use if they took the trouble to maintain a log book. Either that or they don’t think it is worth the effort to maintain the log book records.

While the elimination of this taxpayer largesse for drivers may come at a cost to workers in the car industry, does it really make sense to reverse the changes to save 750 jobs? These jobs would be saved at a cost to the taxpayer of $2.4 million per job. Admittedly, these are just the jobs at Ford and (for now at least) we should acknowledge that some Holden jobs may also be saved, bringing the cost closer to $1 million per job.

The car industry in Australia has long benefited from government support, but surely there are better ways of saving these jobs. A job guarantee springs to mind.

Of course, industry protectionism is far from unique to Australia and this week I had my attention drawn to an extreme example in the small central American nation of Belize.

On 7 August, the parliament of Belize met for the first time since April. With so long between sittings, there were many bills for parliament to pass that day. Included among these was one which increased the already high import tariff on flour from 25% to 100%.


Why such a dramatic increase? For some time, local bakers had been buying their flour from Mexico for 69 Belize dollars per sack (approximately A$38). It was hard to justify buying the more expensive local flour at BZ$81 per sack (A$45). The new tariff will push the price of Mexican flour up to around BZ$110 (A$61), which is good news for the domestic flour mill and its employees.

That domestic flour mill is operated by Archer Daniels Midland (ADM), one of the top 10 global commodity firms. This is the same ADM which is in the process of trying to buy GrainCorp, Australia’s largest agricultural business.

But back to Belize. ADM’s website proudly declares that it “employs more than 40 people” in its Belize mill. Presumably, parliament had an eye to saving these jobs from the threat of cheap Mexican flour when it hiked the import tariff. With a population of only 335,000, Belize is 1/70th the size of Australia. You could argue that saving 40 jobs in Belize is the equivalent of saving 2,800 in Australia and that this is a far more effective form of protectionism than reversing FBT reforms.
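
That equivalence is nothing more than population arithmetic; a quick check in R, using the population figures above:

australia <- 23e6        # approximate population of Australia
belize    <- 335e3       # population of Belize
australia / belize       # Belize is roughly 1/70th the size of Australia
40 * australia / belize  # about 2,750 -- or 2,800 if you round Belize to 1/70th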

But protectionism always has consequences and in Belize these are easier to see than is often the case.

Bread in Belize is subject to price control, along with rice, beans and even local beer. By law, bakers must sell “standard loaves” of bread for BZ$1.75. The August sitting of parliament may have increased flour tariffs, but it did not increase the price bakers could charge for bread.

Bakers in Belize will see their profits squeezed, job losses may follow and there are more bakers in Belize than workers at the ADM mill. Needless to say, the Belize Baker’s Association is lobbying for an increase in the controlled price of bread.

Perhaps it is time for the Belize government to consider abandoning the flour tariff and trying a job guarantee instead.

ngramr – an R package for Google Ngrams

The recent post How common are common words? made use of unusually explicit language for the Stubborn Mule. As expected, a number of email subscribers reported that the post fell foul of their email filters. Here I will return to the topic of n-grams, while keeping the language cleaner, and describe the R package I developed to generate n-gram charts.

Rather than an explicit language warning, this post carries a technical language warning: regular readers of the blog who are not familiar with the R statistical computing system may want to stop reading now!

The Google Ngram Viewer is a tool for tracking the frequency of words or phrases across the vast collection of scanned texts in Google Books. As an example, the chart below shows the frequency of the words “Marx” and “Freud”. It appears that Marx peaked in popularity in the late 1970s and has been in decline ever since. Freud persisted for a decade longer but has likewise been in decline.

Freud vs Marx ngram chart

The Ngram Viewer will display an n-gram chart, but it does not provide the underlying data for your own analysis. All is not lost, though. The chart is produced using JavaScript, so the n-gram data is buried in the source code of the web page. It looks something like this:

// Add column headings, with escaping for JS strings.

data.addColumn('number', 'Year');
data.addColumn('number', 'Marx');
data.addColumn('number', 'Freud');

// Add graph data, without autoescaping.

data.addRows(
[[1900, 2.0528437403299904e-06, 1.2246303970897543e-07],
[1901, 1.9467918036752963e-06, 1.1974195999187031e-07],
...
[2008, 1.1858645848406013e-05, 1.3913611155658145e-05]]
)

With the help of the RJSONIO package, it is easy enough to parse this data into an R dataframe. Here is how I did it:

ngram_parse <- function(html){
  # html is the Ngram Viewer page source, split into lines
  if (any(grepl("No valid ngrams to plot!", html))) stop("No valid ngrams.")

  # Pull the column names out of the addColumn calls: the second argument
  # of each call, stripped of its surrounding quotes
  cols <- lapply(strsplit(grep("addColumn", html, value = TRUE), ","),
                 getElement, 2)
  cols <- gsub(".*'(.*)'.*", "\\1", cols)

  # The rest of the function (omitted here) extracts the array passed to
  # addRows and converts it, via RJSONIO::fromJSON, into a dataframe with
  # these column names
  cols
}

I realise that is not particularly beautiful, so to make life easier I have bundled everything up neatly into an R package which I have called ngramr, hosted on GitHub.

The core functions are ngram, which queries the Ngram Viewer and returns a dataframe of frequencies; ngrami, which does the same thing in a somewhat case-insensitive manner (by which I mean that, for example, the results for "mouse", "Mouse" and "MOUSE" are all combined); and ggram, which retrieves the data and plots the results using ggplot2. All of these functions allow you to specify various options, including the date range and the language corpus (Google can provide results for US English, British English or a number of other languages, including German and Chinese).
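
By way of example, reproducing something like the Marx and Freud chart above should only take a few lines (a sketch only; check the package documentation for the current argument names and defaults):

library(ngramr)

# Fetch the frequencies as a dataframe
freud_marx <- ngram(c("Marx", "Freud"), year_start = 1900)
head(freud_marx)

# Or fetch and plot in one step (ggram uses ggplot2 under the hood)
ggram(c("Marx", "Freud"), year_start = 1900)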

The package is easy to install from GitHub and I may also post it on CRAN.

I would be very interested in feedback from anyone who tries out this package and will happily consider implementing any suggested enhancements.

UPDATE: ngramr is now available on CRAN, making it much easier to install.

Poll Dancing

With elections looming, and Kevin Rudd’s return to power, it is time for our regular guest blogger, James, to pull out his beer coaster calculator and take a closer look at the polls. 

It is really that time again. Australian election fever has risen, though in this case it feels as if we have been in campaign mode ever since the last election three years ago. Polls every week telling us what we think and who we will vote for. But what exactly do these polls mean? And what do they mean by “margin of error”?

So here is the quick answer. Suppose you have a two-party election (which, through Australia’s preference system, is effectively what two-party preferred, or 2PP, amounts to). Now suppose each of those parties really has 50% of the vote. If there are 8 million voters and you poll 1,000 of them, what can you tell? Surprisingly, it turns out that the figure of 8 million voters is actually irrelevant! We can all understand that if you only poll 1,000 voters out of 8 million then there is a margin of error. This margin of error turns out to be quite easy to compute (using undergraduate-level Binomial probability theory) and depends only on the number of people polled, not the total number of voters. The formula is:

MOE = k × 0.5 /√N.

where N is the number of people polled and k is the number of standard deviations for the error. Since √1000 ≈ 32, 1/√1000 ≈ 0.03 = 3%. The choice of k is somewhat arbitrary, but in this case k = 2 (because, for the Normal distribution, 95% of outcomes lie within k = 2 standard deviations of the mean), which conveniently makes k × 0.5 = 1. So MOE = 1/√N is a fairly accurate formula. If N = 1,000 then MOE ≈ 3% (give or take). This simply means that even if the actual vote was 50:50, then 5% of the time an unbiased poll of 1,000 voters would come in outside 47:53 due purely to random selection. And even if the actual vote is, say, 46:54, the MOE will be about the same.
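
If you want to convince yourself, the formula and the claim about 5% of polls falling outside it take only a few lines of R (a quick simulation, nothing rigorous):

moe <- function(n, k = 2) k * 0.5 / sqrt(n)
moe(1000)      # about 0.032, i.e. roughly 3%

# Simulate 10,000 polls of 1,000 voters when the true vote really is 50:50
polls <- rbinom(10000, size = 1000, prob = 0.5) / 1000
mean(abs(polls - 0.5) > moe(1000))   # close to 0.05: about 5% fall outside the MOE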

Interestingly, in the US, where there are about 100 million voters, they usually poll around N = 40,000, which makes the MOE 0.5%. The economics of polling scale with the number of voters, so they can afford to poll more people, but the total number of voters, 100 million or 10 million, is irrelevant to the MOE. As the formula shows, to improve the accuracy of the estimate by a factor of 10 (say from 3% to 0.3%) you need to increase the sample size by a factor of 100. You simply can’t get around this.

One of the criticisms of polling is that pollsters don’t reach the same proportion of (young) people on mobile phones as older people on landlines. This is easily fixed: you just weight the figures according to the known percentages of who uses what type of phone. Similarly, you can adjust by gender and age. The interesting thing, though, is that the further your sample is from the actual distribution of phone usage, gender and age, the more you need to increase your MOE, while the expected outcome is unchanged.

Okay so that is it: MOE = 1/√N where N = number of people polled. If N = 1000 then MOE=3%. My all time favourite back of the beer coaster formula.

The recent jump in Labor’s 2PP polling when Kevin Rudd reassumed the prime ministership, from about 45% to 49%, was greeted by journalists as “Kevin Rudd is almost, but not quite, dead even”. I found this amusing, as the result could statistically have been 51%, within the MOE, in which case the headline would have been “Kevin Rudd is ahead!”. Indeed, barely a week later he was “neck and neck” in the polls at 50:50. Next week it may be 51:49, in which case he will be declared on a certain path to victory! Yet within the MOE of 3%, all of these results are statistically indistinguishable.

From my point of view as a professional statistician, I find the way many journalists develop a narrative based on polls from week to week, without understanding the margin of error, quite annoying. Given the theory that a politician with “The Mo” (i.e. momentum) may be helped over the line, it is irresponsible to allow random fluctuations due to statistical sampling error to shape the story, and potentially the outcome, of an election. Unless of course it helps the party I support win.

How common are common words?

One of my favourite podcasts is Slate’s Lexicon Valley. All about language, it is rigorous and detailed in its approach to the subject, which appeals to the closet academic in me, but also extremely entertaining. It is a sign of a good podcast to find yourself bursting out laughing while walking down a busy city street. Lexicon Valley is to blame for numerous moments of alarm for my fellow commuters.

In September last year, hosts Mike Vuolo (the knowledgeable one) and Bob Garfield (the funny one) interviewed linguist Geoffrey Nunberg, talking to him about his recent book, Ascent of the A-Word: Assholism the First Sixty Years. A half-hour discussion of the evolution of the word “asshole” helps earn this podcast an “Explicit” tag in the iTunes store and, as a result, this will be the first Stubborn Mule post that may fall victim to email filters. Apologies in advance to anyone of a sensitive disposition and to any email subscribers this post fails to reach.

Nunberg traces the evolution of “asshole” from its origins among US soldiers in the Second World War through to its current role as a near-universal term of abuse for arrogant boors lacking self-awareness. Along the way, he explores the differences between profanity (swearing drawing on religion), obscenity (swearing drawing on body parts and sexual activity) and plain old vulgarity (any of the above).

The historical perspective of the book is supported by charts using Google “n-grams”. An n-gram is any word or phrase found in a book, and one type of quantitative analysis used by linguists is to track the frequency of n-grams in a “corpus” of books. After working for years with libraries around the world, Google has amassed a particularly large corpus: Google Books. Conveniently for researchers like Nunberg, with the help of the Google Ngram Viewer, anyone can analyse n-gram frequencies across the Google Books corpus. For example, the chart below shows that “asshole” is far more prevalent in books published in the US than in the UK. No surprises there.

Use of “asshole” in US and UK books

If “asshole” is the American term, the Australian and British equivalent should be “arsehole”, but surprisingly arsehole is less common than asshole in the British Google Books corpus. This suggests that, while being a literal equivalent to asshole, arsehole really does not perform the same function. If anything, it would appear that the US usage of asshole bleeds over to Australia and the UK.

“asshole” versus “arsehole”

Intriguing though these n-gram charts are, they should be interpreted with caution, as I learned when I first tried to replicate some of Nunberg’s charts.

The chart below is taken from Ascent of the A-word and compares growth in the use of the words “asshole” and “empathetic”. The frequencies are scaled relative to the frequency of “asshole” in 1972*. At first, try as I might, I could not reproduce Nunberg’s results. Convinced that I must have misunderstood the book’s explanation of the scaling, I wrote to Nunberg. His more detailed explanation confirmed my original interpretation, but meant that I still could not reproduce the chart.

Nunberg's chart: asshole versus empathy

Relative growth of “empathetic” and “asshole”

Then I had an epiphany. It turns out that Google has published two sets of n-gram data. The first release was based on an analysis of the Google Books collection in July 2009, described in the paper Michel, Jean-Baptiste, et al., “Quantitative analysis of culture using millions of digitized books”, Science 331, No. 6014 (2011): 176-182. As time passed, Google continued to build the Google Books collection, and in July 2012 a second n-gram data set was assembled. As the charts below show, the growth of “asshole” and “empathetic” differs somewhat depending on which edition of the n-gram data set is used. I had been using the more recent 2012 data set and, evidently, Nunberg used the 2009 data set. While either chart would support the same broad conclusions, the differences show that smaller movements in these charts are likely to be meaningless and not too much should be read into anything other than large-scale trends.

Comparison of the 2009 and 2012 Google Books corpuses

So far I have not done very much to challenge anyone’s email filters. I can now fix that by moving on to a more recent Lexicon Valley episode, A Brief History of Swearing. This episode featured an interview with Melissa Mohr, the author of Holy Shit: A Brief History of Swearing. In this book Mohr goes all the way back to Roman times in her study of bad language. Well-preserved graffiti in Pompeii is one of the best sources of evidence we have of how to swear in Latin. Some Latin swear words were very much like our own, others were very different.

Of the “big six” swear words in English, namely ass, cock, cunt, fuck, prick and piss (clearly not all as bad as each other!), five had equivalents in Latin. The only one missing was “piss”. It was common practice to urinate in jars left in the street by fullers who used diluted urine to wash clothing. As a result, urination was not particularly taboo and so not worthy of being the basis for vulgarity. Mohr goes on to enumerate another five Latin swear words to arrive at a list of the Roman “big ten” obscenities. One of these was the Latin word for “clitoris”, which was a far more offensive word than “clit” is today. I also learned that our relatively polite, clinical terms “penis”, “vulva” and “vagina” all derive from obscene Latin words. It was the use of these words by the upper class during the Renaissance, speaking in Latin to avoid corrupting the young, that caused these words to become gentrified.

Unlike Nunberg, Mohr does not make use of n-grams in her book, which provides a perfect opportunity for me to track the frequency of the big six English swear words.

Frequency of the “Big Six” swear words

The problem with this chart is that the high frequency of “ass” and “cock”, particularly in centuries gone by, is likely augmented by their use to refer to animals. Taking a closer look at the remaining four shows just how popular the use of “fuck” became in the second half of the twentieth century, although “cunt” and “piss” have seen modest (or should I say immodest) growth. Does this mean we are all getting a little more accepting of bad language? Maybe I need to finish reading Holy Shit to find out.

Frequency of four of the “Big Six” swear words

* The label on the chart indicates that the reference year is 1972, but by my calculations the reference year is in fact 1971.

Feedburner on the fritz

Those of you who have subscribed to email updates from the Stubborn Mule will have noticed some strange behaviour lately, as old blog posts have appeared in your inboxes. Why this is happening remains a mystery to me. The email subscriptions are powered by Google’s Feedburner service and, with the recent announcement that Google is shutting down Google Reader, I am starting to wonder whether Google is deliberately sabotaging Feedburner as a precursor to shutting it down too.

The sabotage theory is a bit too extreme, but certainly others are speculating that Feedburner’s days may be numbered. In any event, the time has come for me to look for an alternative in an attempt to stop the random emails.

I have looked at Feedblitz and have been bombarded with marketing materials as a result, so that one is not for me. Mailchimp is a possibility.

While I am weighing my options, I would welcome suggestions from other bloggers who have successfully made the move from Feedburner.