R Work Areas. Standardise and Automate.

Before beginning work on a new data science project I like to do the following:

1. Get my work area ready by creating an R Project for use with the RStudio IDE.

2. Organise my work area by creating a series of directories to store my project inputs and outputs. I create ‘data’ (raw data), ‘src’ (R code), ‘reports’ (markdown documents, etc.) and ‘documentation’ (help files) directories.

3. Set up tracking of my work by initialising a Git repo.

4. Take measures to avoid tracking sensitive information (such as data or passwords) by adding certain file names or extensions to the .gitignore file.

You can of course achieve the desired result using the RStudio IDE GUI, but I have found it handy to automate the process using a shell script. Because I use Windows, I execute this script using the Git Bash emulator. If you have a Mac or Linux machine, just use the terminal.


1. Navigate to a directory you want to designate as your area of work and run

bash project_setup projectname

where “projectname” is a name of your choosing.

2. Open the freshly generated R Project in RStudio. This will create your .Rproj.user directory.

3. Start work!

You can see my script below; just modify it to suit your own requirements. Notice I have set the R project options ‘Restore Workspace’, ‘Save Workspace’ and ‘Always Save History’ to an explicit ‘No’, increased the ‘Number of Spaces for Tab’ to 4 and prevented the tracking of .csv data files.
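The original script did not survive in this copy, so here is a minimal sketch of what it does (the directory names match the post; the exact .Rproj fields shown are my own choices, so adjust to suit):

```shell
#!/bin/bash
# Sketch of the setup script described above.
# Usage: bash project_setup projectname
PROJECT=${1:-myproject}   # default name so the sketch runs standalone

# Create the work area directories
mkdir -p "$PROJECT"/data "$PROJECT"/src "$PROJECT"/reports "$PROJECT"/documentation

# Write an .Rproj file with the options mentioned above set explicitly
cat > "$PROJECT/$PROJECT.Rproj" <<EOF
Version: 1.0

RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: No

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 4
EOF

# Keep data and other sensitive files out of version control
cat > "$PROJECT/.gitignore" <<EOF
.Rproj.user
.Rhistory
.RData
*.csv
EOF

# Initialise the Git repository
git init "$PROJECT"
```

Opening the generated .Rproj file in RStudio then creates the .Rproj.user directory, as per step 2.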


NZ Real GDP htmlwidget

Thought I would try my hand at generating an interactive JavaScript line graph using R. Thankfully the dygraphs package makes this very easy!

The code below generates an interactive plot of New Zealand’s real GDP through time. I have added some annotations displaying some of the major financial crises. It is as if the economy fell off a cliff during the GFC!
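The original snippet did not survive in this copy, so here is a hedged sketch of the sort of dygraphs code involved. The series below is simulated purely for illustration; the real figures came from the RBNZ.

```r
library(dygraphs)
library(xts)

# Simulated quarterly series standing in for real GDP (real data: RBNZ)
set.seed(1)
gdp <- xts(
  cumsum(rnorm(120, mean = 0.5)) + 100,
  order.by = seq(as.Date("1988-03-01"), by = "quarter", length.out = 120)
)

# Interactive line graph with crisis annotations and a zoomable range selector
dygraph(gdp, main = "NZ Real GDP") %>%
  dyEvent("1997-07-01", "Asian Financial Crisis", labelLoc = "bottom") %>%
  dyEvent("2008-09-15", "GFC", labelLoc = "bottom") %>%
  dyRangeSelector()   # drag to zoom, double-click to reset
```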

Overall Picture. (Source: RBNZ).


Select an Area to Zoom. (Source: RBNZ).

I would highly recommend you give this package a look. The static images really don’t do it justice.

All data was sourced from the Reserve Bank of New Zealand website.

Good Parameterisation in R

Imagine you work in a large factory that produces complicated widgets. It is your job to control production line settings which must be reset each day so as to ensure the smooth operation of the factory. However, to change the settings you have to walk around turning dials and pressing buttons at various different locations on the factory floor.

One morning you forget to turn the dial on an important machine causing the production line to completely shut down. Your manager storms over and you explain to him that it would be so much easier if you could just change the settings from one location!

One can think of a data science solution, automated or otherwise, as a factory which takes data as an input and produces insight. Best practice is to parameterise all code and sensibly place these parameters so that you can easily find and change them. Parameterised code is especially important when it comes to dashboard development. When a user interacts with a visual display, they should not be presented with a series of hard-coded outputs; rather, they should be changing parameters that result in uniquely generated results.

You pretty much have three options with respect to your code:

Unparameterised

Hard-coded settings are dispersed throughout your script, and in order to change them one must trawl right through it. This style of script is very prone to find-and-replace errors and is a nightmare to hand over if you were to leave your job.

Partially parameterised

Settings can all be found in a logical place, such as the beginning of your script or in a separate file that is sourced in, making them easy to change. This set-up also helps future users (including your future self!) fully understand what is going on.

Fully parameterised

Functions are defined and any parameters are set as arguments to the function. This is the most elegant solution.

The example below illustrates these three options. The script takes a data frame, selects only numeric fields, calls the k-means algorithm and plots a coloured chart displaying cluster allocations. There are three settings that can be changed: the data frame, the number of clusters and the chart title.

The first code block is an example of unparameterised code as the user must change all three settings manually by finding them in the script. The second code block is partially parameterised, setting nClust at the beginning selects k while simultaneously altering the chart title. The final code block wraps everything into a function, dealing with all three settings through specified arguments.
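The original code blocks did not survive in this copy, but the final, fully parameterised version might look something like the sketch below (my own reconstruction; the function name is illustrative). The data frame, the number of clusters and the chart title are all handled as arguments.

```r
library(ggplot2)

# Fully parameterised: all three settings are function arguments
plot_kmeans <- function(df, nClust = 3, chartTitle = "Cluster Allocations") {
  numericData <- df[sapply(df, is.numeric)]      # keep numeric fields only
  fit <- kmeans(numericData, centers = nClust)   # run the k-means algorithm
  numericData$cluster <- factor(fit$cluster)

  # Plot the first two numeric fields, coloured by cluster allocation
  ggplot(numericData,
         aes(x = numericData[[1]], y = numericData[[2]], colour = cluster)) +
    geom_point() +
    ggtitle(chartTitle)
}

plot_kmeans(iris, nClust = 3, chartTitle = "Iris Clusters")
```

Changing any of the three settings is now a matter of passing a different argument, rather than hunting through the script.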


Pretty Data Class Conversion

Load data – check structure – convert – analyse.

Data class conversion is essential to getting the right result… especially if you have left stringsAsFactors = TRUE. The worst thing you can do is feed factor data into a function when you were expecting characters.

If system memory is not a concern, I prefer to read data in as character strings and then convert accordingly. I view this as the safer option… it forces you to take stock of each field.

There are many ways to perform data conversion; for example, you can use transform() in base R or dplyr’s mutate() family of functions. For a single column conversion I prefer to use mutate(), but for multiple conversions I use mutate_each() and just specify the relevant columns. This avoids repeating the column names in code.

I still need to do some benchmarking to see which setup is faster, but for now I see mutate_each() as the cleanest, aesthetically at least. I have also included an example of ‘all column’ conversion.
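A small sketch of the two approaches, using illustrative data read in as character strings (note that newer versions of dplyr replace mutate_each() with mutate(across(...))):

```r
library(dplyr)

# Illustrative data, all read in as character strings
df <- data.frame(
  id    = c("1", "2", "3"),
  price = c("10.5", "11.2", "12.9"),
  qty   = c("4", "7", "2"),
  stringsAsFactors = FALSE
)

# Single column conversion: mutate()
df <- mutate(df, id = as.integer(id))

# Multiple columns at once: mutate_each() over just the relevant columns,
# so the conversion function is written only once
df <- mutate_each(df, funs(as.numeric), price, qty)

# 'All column' conversion, here back to character
dfChar <- mutate_each(df, funs(as.character))

str(df)
```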

Demystifying the GLM (Part 1)

Upon being thrown a prickly binary classification problem, most data practitioners will have dug deep into their statistical toolbox and pulled out the trusty logistic regression model.

Essentially, logistic regression can help us predict a binary (yes/no) response with consideration given to other, hopefully related, variables. For example, one might want to predict whether a person will experience a heart attack given their weight and age. In this case, we have reason to believe weight and age are related to the incidence of heart attacks.

So, they will have sorted their data, fired up R and typed something along the lines of:

glm(heartAttack ~ weight + age, data = heartData, family = binomial())
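To make that call runnable, here is a simulated heartData (my own illustrative numbers, not real medical data) whose log-odds of a heart attack depend on weight and age:

```r
set.seed(42)
n <- 200

# Simulated predictors
heartData <- data.frame(
  weight = rnorm(n, mean = 80, sd = 15),
  age    = rnorm(n, mean = 50, sd = 12)
)

# Simulate a binary response whose log-odds rise with weight and age
logOdds <- -10 + 0.06 * heartData$weight + 0.08 * heartData$age
heartData$heartAttack <- rbinom(n, size = 1, prob = plogis(logOdds))

fit <- glm(heartAttack ~ weight + age, data = heartData, family = binomial())
summary(fit)
```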

But what is a glm? What does family = binomial() actually mean?

It turns out the logistic regression model is a member of a broad group of models known as generalised linear models, or GLMs for short.

This series will endeavour to help demystify these highly useful models.

Stay tuned.

NZ’s Shifting Makeup

New Zealand is culturally diverse. Even at a regional level, there are big differences in ethnic composition… and with an increasingly interconnected world, ethnic composition is expected to change substantially in the future, particularly in Auckland.

Statistics New Zealand has provided us with sub-national ethnic population projections, by age and sex, from 2013 to 2038 which are well suited to visualisation using stacked area charts. The package ggplot2 in R makes generating these easy.

The following projections assume ‘medium fertility, medium paternity, medium mortality, medium net migration, and medium net inter-ethnic mobility.’ This is considered ‘medium growth’*.

Total New Zealand

[Stacked area chart: total NZ]
Data Source: Statistics New Zealand

Total North Island

[Stacked area chart: total North Island]
Data Source: Statistics New Zealand

Total South Island

[Stacked area chart: total South Island]
Data Source: Statistics New Zealand

*Please see http://nzdotstat.stats.govt.nz for more information.



Subnational ethnic population projections, by age and sex, 2013(base)–2038. Statistics New Zealand. Provided under the Creative Commons Attribution 3.0 New Zealand licence.

I have transformed the data into proportions.
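The charts above were built along these lines; the sketch below uses made-up projection numbers purely to show the proportion transformation and the geom_area() call (real figures are on nzdotstat.stats.govt.nz):

```r
library(ggplot2)
library(dplyr)

# Illustrative projection data only -- not the Statistics New Zealand figures
set.seed(7)
proj <- expand.grid(
  year      = seq(2013, 2038, by = 5),
  ethnicity = c("European", "Maori", "Asian", "Pacific")
)
proj$population <- runif(nrow(proj), min = 50, max = 500)

# Transform counts into proportions within each year
proj <- proj %>%
  group_by(year) %>%
  mutate(proportion = population / sum(population)) %>%
  ungroup()

# Stacked area chart of the shifting composition
ggplot(proj, aes(x = year, y = proportion, fill = ethnicity)) +
  geom_area() +
  labs(title = "Projected Ethnic Composition (illustrative data)",
       x = "Year", y = "Proportion")
```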




A Matter of Style?

Up until a few weeks ago I would style my code like this:

I thought that was the only way… until I witnessed a DBA friend of mine coding. He would write the same function like this:

In my opinion, the second style makes the code easier to read. I suspect it is something to do with the nice ‘column’ of commas. The whole thing seems more orderly!
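The original snippets did not survive in this copy, so here is an illustrative function of my own showing the two layouts side by side:

```r
# First style: a comma trails each argument
standardise <- function(x,
                        centre = TRUE,
                        rescale = TRUE,
                        na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  as.vector(scale(x, center = centre, scale = rescale))
}

# Second style: leading commas line up in a neat column down the left
standardise <- function(x
                        , centre = TRUE
                        , rescale = TRUE
                        , na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  as.vector(scale(x, center = centre, scale = rescale))
}

standardise(c(2, 4, NA, 6))
```

A side benefit of the leading-comma layout: adding or deleting the last argument never touches a neighbouring line, which keeps diffs small.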