R Work Areas. Standardize and Automate.

Before beginning work on a new data science project I like to do the following:

1. Get my work area ready by creating an R Project for use with the RStudio IDE.

2. Organize my work area by creating a series of directories to store my project inputs and outputs. I create ‘data’ (raw data), ‘src’ (R code), ‘reports’ (markdown documents, etc.) and ‘documentation’ (help files) directories.

3. Set up tracking of my work by initializing a Git repo.

4. Take measures to avoid tracking sensitive information (such as data or passwords) by adding certain file names or extensions to the .gitignore file.

You can of course achieve the desired result using the RStudio IDE GUI, but I have found it handy to automate the process with a shell script. Because I use Windows, I execute this script using the Git Bash emulator. If you have a Mac or Linux machine, just use the terminal.


1. Navigate to a directory you want to designate as your area of work and run

bash project_setup projectname

where “projectname” is a name of your choosing.

2. Open the freshly generated R Project in RStudio. This will create your .Rproj.user directory.

3. Start work!

You can see my script below; just modify it to suit your own requirements. Notice that I have set the R project options ‘Restore Workspace’, ‘Save Workspace’ and ‘Always Save History’ to an explicit ‘No’, increased the ‘Number of Spaces for Tab’ to 4 and prevented the tracking of .csv data files.
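For anyone who prefers to stay inside R rather than use a shell script, a rough sketch of the same workflow might look like the function below. This is only an illustration, not my shell script: the directory names, project options and .gitignore entries mirror the steps above, and it assumes git is available on the system PATH.

# Sketch of an R equivalent of the setup workflow described above.
setup_project <- function(project_name) {

  # 1. Create the project directory and a minimal .Rproj file with the
  #    options discussed above (no workspace restore/save, no history,
  #    4 spaces per tab).
  dir.create(project_name)
  writeLines(
    c("Version: 1.0",
      "",
      "RestoreWorkspace: No",
      "SaveWorkspace: No",
      "AlwaysSaveHistory: No",
      "",
      "NumSpacesForTab: 4"),
    file.path(project_name, paste0(project_name, ".Rproj"))
  )

  # 2. Create the input/output directories.
  for (d in c("data", "src", "reports", "documentation")) {
    dir.create(file.path(project_name, d))
  }

  # 3. Initialise a Git repo (assumes git is on the PATH).
  system(paste("git init", shQuote(project_name)))

  # 4. Avoid tracking data files and R session artefacts.
  writeLines(
    c("*.csv", ".Rproj.user/", ".Rhistory", ".RData"),
    file.path(project_name, ".gitignore")
  )
}

setup_project("myproject")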


Helpful Data Science Reads

Here are some of the books that I found interesting and useful in 2017.

Scrum: The Art of Doing Twice the Work in Half the Time by Jeff Sutherland

Jeff Sutherland, one of the creators of the scrum methodology of project management, lays out the rationale for adopting scrum over more traditional project management frameworks. He explains how large-scale software projects often faltered under ‘waterfall’-type approaches but were quickly turned around by ditching Gantt charts and drawing up a scrum board.

Although I believe there is a time and a place for traditional methods, such as long-term projects with fixed budgets and a finite and predictable scope, the agile approach of scrum has huge advantages for analytical projects especially under conditions of quickly shifting stakeholder requirements.

Scrum: The Art of Doing Twice the Work in Half the Time describes the fundamental concepts of scrum and is a must-read for data science managers.

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham, Garrett Grolemund

R for Data Science equips data scientists with the tools and techniques to extract data, clean it up and uncover clear insights. The primary toolbox here is R’s tidyverse package, itself a collection of packages that have revolutionised R by making it much easier to use for day to day data management activities. The authors also present a helpful iterative framework for conducting data science projects and address the common issues of messy data and strange formats.

R for Data Science is extremely easy to read; if you haven’t played around with the tidyverse, this book is a great place to start. Your life as a data scientist will become much easier!

The Third Wave: An Entrepreneur’s Vision of the Future by Steve Case

Steve Case, co-founder of AOL, predicts that the future will be dominated not by the ‘Internet of Things’ but by the ‘Internet of Everything.’ This will be the third wave of the internet. He describes the rising influence of impact investing and how tech and government need to work together to usher in this next phase.

I found Steve’s account of AOL during the early days of the internet fascinating and his prophecies about the future of tech inspiring. Clearly, data scientists have a big part to play in uncovering the insights buried in the huge volumes of data that will arise when the third wave comes into being.

Data Smart: Using Data Science to Transform Information into Insight by John W. Foreman

John W. Foreman, the then chief data scientist at MailChimp.com, takes readers through several practical exercises in data science and business analytics. Data Smart highlights the point that in order to ‘do data science’ one does not always have to use fancy tools; much of the work can simply be done in Excel. Of course, using a tool such as R is often much more convenient, but it is sometimes helpful to hack the problem out in Excel first; then you know you really understand what you are doing! I found the pragmatic approach to data science put forward by Data Smart most refreshing.

Data Science for Business: What you need to know about data mining and data-analytic thinking by Foster Provost, Tom Fawcett

Data Science for Business presents a comprehensive survey of modern supervised and unsupervised data science methods and applications in a business context. I particularly enjoyed the treatment of model ‘accuracy’ and business cost considerations when setting an acceptable true-positive threshold for a chosen model.

This book strikes a nice balance between being too technical and too fluffy. For those really interested, ‘extra for experts’ mathematics sections are available throughout.

An Introduction to Statistical Learning: With Applications in R
by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani

This is by far the best book I have read on statistical learning. Like Data Science for Business, ISL strikes a great balance between mathematical and intuitive explanations.

What I particularly like about this book is the discussion around the trade-off between variance and bias when choosing a model specification and the handy graphics used to illustrate this point. Another selling point is the collection of exercises that enable the reader to test their knowledge by using R.

Note: if you want to expand on the content found in this book you may consider reading its big brother, The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, Jerome Friedman.

NZ Real GDP htmlwidget

Thought I would try my hand at generating an interactive JavaScript line graph using R. Thankfully the dygraphs package makes this very easy!

The code below generates an interactive plot of New Zealand’s real GDP through time. I have added annotations marking some of the major financial crises. It is as if the economy fell off a cliff during the GFC!
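A minimal sketch of this kind of dygraphs plot is shown below. The file name and column names are placeholders, and the two annotated dates are just well-known examples of crisis events; treat it as a starting point rather than the exact code behind the charts.

library(dygraphs)
library(xts)

# Quarterly real GDP series, assumed to sit in a CSV with 'date' and 'real_gdp' columns
gdp <- read.csv("nz_real_gdp.csv", stringsAsFactors = FALSE)
gdp_xts <- xts(gdp$real_gdp, order.by = as.Date(gdp$date))
colnames(gdp_xts) <- "Real GDP"

dygraph(gdp_xts, main = "New Zealand Real GDP (Source: RBNZ)") %>%
  dyEvent("1987-10-19", "1987 sharemarket crash", labelLoc = "bottom") %>%  # example annotation
  dyEvent("2008-09-15", "GFC", labelLoc = "bottom") %>%                     # example annotation
  dyRangeSelector()  # enables the select-an-area-to-zoom behaviour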

Overall Picture. (Source: RBNZ).


Select an Area to Zoom. (Source: RBNZ).

I would highly recommend you give this package a look. The static images really don’t do it justice.

All data was sourced from the Reserve Bank of New Zealand (RBNZ) website.

Good Parameterisation in R

Imagine you work in a large factory that produces complicated widgets. It is your job to control production line settings which must be reset each day so as to ensure the smooth operation of the factory. However, to change the settings you have to walk around turning dials and pressing buttons at various different locations on the factory floor.

One morning you forget to turn the dial on an important machine causing the production line to completely shut down. Your manager storms over and you explain to him that it would be so much easier if you could just change the settings from one location!

One can think of a data science solution, automated or otherwise, as a factory which takes data as an input and produces insight. Best practice is to parameterise all code and sensibly place these parameters so that you can easily find and change them. Parameterised code is especially important when it comes to dashboard development. When a user interacts with a visual display, they should not be presented with a series of hard-coded outputs, rather, they should be changing parameters that result in uniquely generated results.

You pretty much have three options with respect to your code:


Unparameterised

Hard-coded settings are dispersed throughout your script, and in order to change them one must trawl right through it. This style of script is very prone to find-and-replace errors and is a nightmare to hand over if you were to leave your job.

Partially Parameterised

Settings can all be found in a logical place such as the beginning of your script or in a separate file that is sourced in making them easy to change. This set-up also helps future users (including your future self!) fully understand what is going on.

Fully Parameterised

Functions are defined and any parameters are set as arguments to the function. This is the most elegant solution.

The example below illustrates these three options. The script takes a data frame, selects only numeric fields, calls the k-means algorithm and plots a coloured chart displaying cluster allocations. There are three settings that can be changed: the data frame, the number of clusters and the chart title.

The first code block is an example of unparameterised code, as the user must change all three settings manually by finding them in the script. The second code block is partially parameterised: setting nClust at the beginning selects k while simultaneously altering the chart title. The final code block wraps everything into a function, dealing with all three settings through specified arguments.
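To give a flavour of the fully parameterised version, a sketch is shown below. This is an illustration rather than the original code: the data frame, number of clusters and chart title are the function’s arguments, and the call on iris at the end is purely for demonstration.

library(dplyr)
library(ggplot2)

# Fully parameterised: data frame, number of clusters and chart title
# are all arguments to a single function.
cluster_plot <- function(df, n_clust, chart_title) {

  # keep only the numeric fields
  num_df <- select_if(df, is.numeric)

  # run k-means with the requested number of clusters
  km <- kmeans(num_df, centers = n_clust)

  # plot the first two numeric fields, coloured by cluster allocation
  plot_df <- mutate(num_df, cluster = factor(km$cluster))
  x_var <- names(num_df)[1]
  y_var <- names(num_df)[2]

  ggplot(plot_df, aes(.data[[x_var]], .data[[y_var]], colour = cluster)) +
    geom_point() +
    labs(title = chart_title, x = x_var, y = y_var)
}

cluster_plot(iris, n_clust = 3, chart_title = "k-means clusters in the iris data")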


Pretty Data Class Conversion

Load data – check structure – convert – analyse.

Data class conversion is essential to getting the right result… especially if you have left stringsAsFactors = TRUE. The worst thing you can do is feed factor data into a function when you expected it to be character data.

If system memory is not a concern, I prefer to read data in as character strings and then convert accordingly. I view this as the safer option… it forces you to take stock of each field.

There are many ways to perform data conversion; for example, you can use transform() in base R or dplyr’s mutate() family of functions. For a single-column conversion I prefer to use mutate(), but for multiple conversions I use mutate_each() and just specify the relevant columns. This avoids repeating the column names in code.

I still need to do some bench-marking to see which setup is faster, but for now I see mutate_each() as the cleanest, aesthetically at least. I have also included an example of ‘all column’ conversion.
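As a rough illustration, the three flavours look something like the code below. The file and column names are made up, and mutate_each()/funs() reflect the dplyr API current at the time of writing (newer releases express the same thing with across(), as noted in the comments).

library(dplyr)

# Read everything in as character first, then convert once each field has been checked
# ('measurements.csv' and the column names below are hypothetical).
raw <- read.csv("measurements.csv", colClasses = "character")

# Single-column conversion with mutate()
clean <- raw %>% mutate(age = as.integer(age))

# Several columns at once with mutate_each(), naming only the relevant columns
clean <- raw %>% mutate_each(funs(as.numeric), height, weight, income)

# 'All column' conversion: mutate_each() with no columns specified
clean <- raw %>% mutate_each(funs(as.numeric))

# On current dplyr releases the equivalents are:
# raw %>% mutate(across(c(height, weight, income), as.numeric))
# raw %>% mutate(across(everything(), as.numeric))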

Demystifying the GLM (Part 1)

Upon being thrown a prickly binary classification problem, most data practitioners will have dug deep into their statistical toolbox and pulled out the trusty logistic regression model.

Essentially, logistic regression can help us predict a binary (yes/no) response with consideration given to other, hopefully related, variables. For example, one might want to predict whether a person will experience a heart attack given their weight and age. In this case, we have reason to believe weight and age are related to the incidence of heart attacks.

So, they will have sorted their data, fired up R and typed something along the lines of:

glm(heartAttack ~ weight + age, data = heartData, family = binomial())
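For anyone who wants to run this end to end, here is a sketch with simulated data; heartData and the effect sizes are entirely made up for illustration.

set.seed(1)
n <- 500

# Simulate weight (kg) and age (years)
heartData <- data.frame(
  weight = rnorm(n, mean = 85, sd = 15),
  age    = rnorm(n, mean = 55, sd = 10)
)

# Simulate a binary response whose log-odds increase with weight and age
logOdds <- -12 + 0.05 * heartData$weight + 0.12 * heartData$age
heartData$heartAttack <- rbinom(n, size = 1, prob = plogis(logOdds))

# Fit the logistic regression and inspect the estimated coefficients
fit <- glm(heartAttack ~ weight + age, data = heartData, family = binomial())
summary(fit)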

But what is a glm? What does family = binomial() actually mean?

It turns out the logistic regression model is a member of a broad group of models known as generalised linear models, or GLMs for short.

This series will endeavor to help demystify these highly useful models.

Stay tuned.

NZ’s Shifting Makeup

New Zealand is culturally diverse. Even at a regional level, there are big differences in ethnic composition… and with an increasingly inter-connected world, ethnic composition is expected to change substantially in the future, particularly in Auckland.

Statistics New Zealand has provided us with sub-national ethnic population projections, by age and sex, from 2013 to 2038, which are well suited to visualisation using stacked area charts. The package ggplot2 in R makes generating these easy.

The following projections assume ‘medium fertility, medium paternity, medium mortality, medium net migration, and medium net inter-ethnic mobility.’ This is considered ‘medium growth’*.
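A sketch of the kind of ggplot2 code involved is shown below. The file and column names are placeholders for the NZ.Stat extract, and the data are converted to proportions before plotting.

library(dplyr)
library(ggplot2)

# Assumed layout: one row per projection year and ethnic group,
# with a projected population count for each combination.
proj <- read.csv("ethnic_projections.csv", stringsAsFactors = FALSE)

proj %>%
  group_by(year) %>%
  mutate(proportion = population / sum(population)) %>%  # counts to proportions
  ungroup() %>%
  ggplot(aes(x = year, y = proportion, fill = ethnicity)) +
  geom_area() +  # stacked by default
  labs(title = "Projected ethnic composition, 2013-2038",
       x = "Year", y = "Proportion of population", fill = "Ethnic group")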

Total New Zealand
(Stacked area chart. Data source: Statistics New Zealand)

Total North Island
(Stacked area chart. Data source: Statistics New Zealand)

Total South Island
(Stacked area chart. Data source: Statistics New Zealand)

*Please see http://nzdotstat.stats.govt.nz for more information.



Subnational ethnic population projections, by age and sex, 2013(base)-2038. Statistics New Zealand. Provided under the Creative Commons Attribution 3.0 New Zealand licence.

I have transformed the data into proportions.