NZ Real GDP htmlwidget

Thought I would try my hand at generating an interactive JavaScript line graph using R. Thankfully the dygraphs package makes this very easy!

The code below generates an interactive plot of New Zealand’s real GDP through time. I have added annotations marking some of the major financial crises. It is as if the economy fell off a cliff during the GFC!
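A minimal sketch of such a plot, assuming the dygraphs package is installed, might look like the following. The series is simulated stand-in data rather than RBNZ figures, and the annotation dates, letters and tooltips are illustrative assumptions.

```r
library(dygraphs)

# Hypothetical quarterly series (simulated values, NOT actual RBNZ data)
set.seed(1)
gdp <- ts(cumsum(c(30000, rnorm(99, mean = 150, sd = 200))),
          start = c(1990, 1), frequency = 4)

# Build the widget step by step: base plot, axis label, annotations, zoom bar
gdp_plot <- dygraph(gdp, main = "NZ Real GDP (illustrative data)")
gdp_plot <- dyAxis(gdp_plot, "y", label = "Real GDP")
gdp_plot <- dyAnnotation(gdp_plot, "1997-7-1", text = "A",
                         tooltip = "Asian financial crisis")
gdp_plot <- dyAnnotation(gdp_plot, "2008-7-1", text = "B",
                         tooltip = "Global financial crisis")
gdp_plot <- dyRangeSelector(gdp_plot)

# Printing the object renders the interactive widget
gdp_plot
```

Dragging across the chart zooms into the selected area, and the range selector at the bottom does the same job with handles.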

NZ Real GDP
Overall Picture. (Source: RBNZ).

 

NZ GDP Zoom
Select an Area to Zoom. (Source: RBNZ).

I would highly recommend you give this package a look. The static images really don’t do it justice.

All data was sourced from the Reserve Bank of New Zealand (RBNZ) website.


Good Parameterisation in R

Imagine you work in a large factory that produces complicated widgets. It is your job to control production line settings which must be reset each day so as to ensure the smooth operation of the factory. However, to change the settings you have to walk around turning dials and pressing buttons at various different locations on the factory floor.

One morning you forget to turn the dial on an important machine causing the production line to completely shut down. Your manager storms over and you explain to him that it would be so much easier if you could just change the settings from one location!

One can think of a data science solution, automated or otherwise, as a factory which takes data as an input and produces insight. Best practice is to parameterise all code and place these parameters sensibly so that you can easily find and change them. Parameterised code is especially important when it comes to dashboard development. When a user interacts with a visual display, they should not be presented with a series of hard-coded outputs; rather, they should be changing parameters that produce freshly generated results.

You pretty much have three options with respect to your code:

Unparameterised

Hard-coded settings are dispersed throughout your script, and in order to change them one must trawl right through it. This style of script is very prone to find-and-replace errors and is a nightmare to hand over if you were to leave your job.

Partially Parameterised

Settings can all be found in a logical place, such as the beginning of your script or a separate file that is sourced in, making them easy to change. This set-up also helps future users (including your future self!) fully understand what is going on.

Fully Parameterised

Functions are defined and any parameters are set as arguments to the function. This is the most elegant solution.

The example below illustrates these three options. The script takes a data frame, selects only numeric fields, calls the k-means algorithm and plots a coloured chart displaying cluster allocations. There are three settings that can be changed: the data frame, the number of clusters and the chart title.

The first code block is an example of unparameterised code, as the user must change all three settings manually by finding them in the script. The second code block is partially parameterised: setting nClust at the beginning selects k while simultaneously altering the chart title. The final code block wraps everything into a function, dealing with all three settings through specified arguments.
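As a rough sketch of the three styles, assuming the built-in iris data frame as the input and a hypothetical function name cluster_plot (only nClust comes from the description above), the blocks could look like this:

```r
set.seed(123)  # k-means starts from random centres; fix the seed for repeatability

# --- 1. Unparameterised: data frame, k and title are hard-coded in place ---
num_data <- iris[sapply(iris, is.numeric)]
fit <- kmeans(num_data, centers = 3)
plot(num_data[, 1:2], col = fit$cluster, main = "3 cluster solution")

# --- 2. Partially parameterised: settings collected at the top ---
df <- iris
nClust <- 3
num_data <- df[sapply(df, is.numeric)]
fit <- kmeans(num_data, centers = nClust)
# The title is built from nClust, so changing k also changes the title
plot(num_data[, 1:2], col = fit$cluster,
     main = paste(nClust, "cluster solution"))

# --- 3. Fully parameterised: all three settings are function arguments ---
cluster_plot <- function(df, nClust,
                         title = paste(nClust, "cluster solution")) {
  num_data <- df[sapply(df, is.numeric)]       # keep numeric fields only
  fit <- kmeans(num_data, centers = nClust)    # run k-means
  plot(num_data[, 1:2], col = fit$cluster, main = title)
  invisible(fit)                               # return the fit for reuse
}

fit <- cluster_plot(iris, nClust = 3)
```

With the function in place, trying a different data frame or number of clusters is a one-line change, for example cluster_plot(mtcars, nClust = 4).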

cluster_solution

Pretty Data Class Conversion

Load data – check structure – convert – analyse.

Data class conversion is essential to gaining the right result… especially if you have left stringsAsFactors = TRUE. The worst thing you can do is feed factor data into a function when you expected it to be characters.

If system memory is not a concern, I prefer to read data in as character strings and then convert accordingly. I view this as a safer option… it forces you to take stock of each field.

There are many ways to perform data conversion. For example, you can use transform() in base R or dplyr’s mutate() family of functions. For a single column conversion I prefer to use mutate(), but for multiple conversions I use mutate_each() and just specify the relevant columns. This avoids repeating the column names in code.

I still need to do some bench-marking to see which setup is faster, but for now I see mutate_each() as the cleanest, aesthetically at least. I have also included an example of ‘all column’ conversion.
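A small sketch of these conversions follows, using a made-up data frame (the column names are hypothetical). Note that mutate_each() has since been deprecated in dplyr, so this sketch assumes dplyr 1.0.0 or later and uses across() for the multi-column and all-column cases instead.

```r
library(dplyr)

# Hypothetical data, read in as character strings
raw <- data.frame(
  id     = c("1", "2", "3"),
  weight = c("70.5", "82.1", "64.0"),
  height = c("1.75", "1.80", "1.62"),
  joined = c("2015-01-31", "2015-02-28", "2015-03-31"),
  stringsAsFactors = FALSE
)

# Single column: base R transform() and dplyr mutate() give the same result
step1_base  <- transform(raw, joined = as.Date(joined))
step1_dplyr <- mutate(raw, joined = as.Date(joined))

# Several columns: name them once, apply one conversion function
step2 <- mutate(step1_dplyr, across(c(weight, height), as.numeric))

# All columns: convert everything (here, back to character)
all_chr <- mutate(step2, across(everything(), as.character))

str(step2)
```

Running str() after each step is a cheap way to confirm that every field ended up with the class you intended.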

Demystifying the GLM (Part 1)

Upon being thrown a prickly binary classification problem, most data practitioners will have dug deep into their statistical tool box and pulled out the trusty logistic regression model.

Essentially, logistic regression can help us predict a binary (yes/no) response with consideration given to other, hopefully related, variables. For example, one might want to predict whether a person will experience a heart attack given their weight and age. In this case, we have reason to believe weight and age are related to the incidence of heart attacks.

So, they will have sorted their data, fired up R and typed something along the lines of:

glm(heartAttack ~ weight + age, data = heartData, family=binomial())

But what is a glm? What does family = binomial() actually mean?

It turns out the logistic regression model is a member of a broad group of models known as generalised linear models, or GLMs for short.
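To make the call above concrete, here is a small sketch that fits the same formula to simulated data. The data frame heartData, its coefficients and the example person are all invented for illustration; only the glm() call itself comes from the text.

```r
set.seed(42)
n <- 500

# Simulated predictors (hypothetical units: kg and years)
heartData <- data.frame(
  weight = rnorm(n, mean = 80, sd = 15),
  age    = round(runif(n, min = 30, max = 80))
)

# Simulate a binary outcome from a known relationship on the log-odds scale
log_odds <- -10 + 0.05 * heartData$weight + 0.08 * heartData$age
heartData$heartAttack <- rbinom(n, size = 1, prob = plogis(log_odds))

# binomial() uses the logit link by default, which gives logistic regression
fit <- glm(heartAttack ~ weight + age, data = heartData, family = binomial())
print(coef(fit))

# Predicted probability for one hypothetical person
new_person <- data.frame(weight = 95, age = 65)
p <- predict(fit, newdata = new_person, type = "response")

# The same number, obtained by hand from the linear predictor
p_by_hand <- plogis(predict(fit, newdata = new_person))
```

The last two lines hint at what the family argument is doing: the model is linear on the log-odds scale, and the inverse logit maps that linear predictor back to a probability.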

This series will endeavor to help demystify these highly useful models.

Stay tuned.

NZ’s Shifting Makeup

New Zealand is culturally diverse. Even at a regional level, there are big differences in ethnic composition… and with an increasingly inter-connected world, ethnic composition is expected to change substantially in the future, particularly in Auckland.

Statistics New Zealand has provided us with sub-national ethnic population projections, by age and sex, from 2013 to 2038 which are well suited to visualisation using stacked area charts. The package ggplot2 in R makes generating these easy.
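As a hedged sketch of how such a chart can be built with ggplot2, the example below stacks proportions for three placeholder groups. The numbers and group names are invented for illustration and are not Statistics New Zealand figures.

```r
library(ggplot2)

# Hypothetical proportions that sum to 1 within each year
proj <- data.frame(
  year       = rep(c(2013, 2018, 2023, 2028, 2033, 2038), times = 3),
  ethnicity  = rep(c("Group A", "Group B", "Group C"), each = 6),
  proportion = c(0.70, 0.68, 0.66, 0.64, 0.62, 0.60,
                 0.20, 0.21, 0.22, 0.23, 0.24, 0.25,
                 0.10, 0.11, 0.12, 0.13, 0.14, 0.15)
)

# geom_area() stacks the groups on top of each other by default
p <- ggplot(proj, aes(x = year, y = proportion, fill = ethnicity)) +
  geom_area() +
  scale_x_continuous(breaks = unique(proj$year)) +
  labs(title = "Projected ethnic composition (illustrative data)",
       x = "Year", y = "Proportion of population", fill = "Ethnic group")

p
```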

The following projections assume ‘medium fertility, medium paternity, medium mortality, medium net migration, and medium net inter-ethnic mobility.’ This is considered ‘medium growth’*.

Total New Zealand

total nz
Data Source: Statistics New Zealand

Total North Island

total ni
Data Source: Statistics New Zealand

Total South Island

total si
Data Source: Statistics New Zealand

Auckland

auckland
Data Source: Statistics New Zealand

*Please see http://nzdotstat.stats.govt.nz for more information.

References:

Data:

Subnational ethnic population projections, by age and sex, 2013(base)-2038. Statistics New Zealand. Provided under the Creative Commons Attribution 3.0 New Zealand licence.

I have transformed the data into proportions.

Plotting:

http://stackoverflow.com/questions/5030389/getting-a-stacked-area-plot-in-r

http://www.cookbook-r.com/Graphs/Axes_%28ggplot2%29/

A Matter of Style?

Up until a few weeks ago I would style my code like this:

I thought that was the only way… until I witnessed a DBA friend of mine coding. He would write the same function like this:

In my opinion, the second style makes the code easier to read. I suspect it is something to do with the nice ‘column’ of commas. The whole thing seems more orderly!
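To illustrate the contrast, here is a guess at the two layouts, assuming the second is the leading-comma style common in SQL. The data frame contents are placeholders.

```r
# Style 1: trailing commas, arguments aligned under the first one
trailing <- data.frame(name = c("a", "b", "c"),
                       value = c(1, 2, 3),
                       flag = c(TRUE, FALSE, TRUE))

# Style 2: leading commas, which line up in a single column
leading <- data.frame(
    name  = c("a", "b", "c")
  , value = c(1, 2, 3)
  , flag  = c(TRUE, FALSE, TRUE)
)

# Layout is purely cosmetic: both calls build the same object
identical(trailing, leading)
```

A side benefit of the second layout is that commenting out or deleting the last argument never leaves a dangling comma behind.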

Trying to Win with R

A common competition run by vendors of fishing equipment is ‘guess the weight and win’: an image of someone holding a fish is posted and it is up to you to guess its weight, with the closest guess winning a prize.

The ‘law of large numbers’ implies that the average of many guesses is more reliable than the average of a few, so the ‘best guess’ should be close to the average of all guesses…

Motivated by the possibility of winning some fishing tackle I set about messing about with R’s regular expressions to create a tool that would enable me to make an informed guess based on the guesses of many.

The function below reads in a text file containing each person’s guess (provided via a comment), extracts and cleans the guesses, transforms them into a common unit (kilograms) and provides summary statistics and a histogram that suggest the best guess you could make. Of course, this function could also be adapted to suit a ‘how many jelly beans in the jar?’ competition!
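A minimal sketch of such a function is given here, under some assumptions: the name summarise_guesses, the regular expression and the set of recognised units are all hypothetical, and it takes a character vector of comments (for a file, pass in the result of readLines()).

```r
summarise_guesses <- function(comments) {
  comments <- tolower(comments)

  # A number (optionally with decimals) followed by a weight unit
  pattern <- "([0-9]+\\.?[0-9]*)\\s*(kgs?|kilos?|kilograms?|lbs?|pounds?)"

  # First match in each comment; comments without a match are dropped
  hits <- regmatches(comments, regexpr(pattern, comments, perl = TRUE))

  # Split each match into its numeric value and its unit
  value <- as.numeric(sub(pattern, "\\1", hits, perl = TRUE))
  unit  <- sub(pattern, "\\2", hits, perl = TRUE)

  # Convert pounds to kilograms (1 lb = 0.45359237 kg)
  kg <- ifelse(grepl("^(lb|pound)", unit), value * 0.45359237, value)

  print(summary(kg))
  hist(kg, main = "Distribution of guesses", xlab = "Guessed weight (kg)")
  invisible(kg)
}

# Made-up comments for illustration
example_comments <- c("I reckon 17kg", "About 40 lbs!",
                      "12.5 kilos for sure", "no idea sorry", "20 KG")
kg <- summarise_guesses(example_comments)
```

Requiring a unit after the number is a deliberate simplification: it skips stray numbers in a comment, at the cost of ignoring guesses that give no unit at all.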

Here is the output of one such competition:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
4.00   12.50   17.00   17.35   19.90   85.00 

totalGuesses

In this case, I would guess the weight of the fish to be around 17 kilograms!