Working with Data Frames in Python and R

Data Hipsters

Data frame objects facilitate most data analysis exercises in both R and Python (perhaps with the exception of time series analysis, where the focus is on R time series and Pandas series objects). Data frames are a tidy and meaningful way to store data.

This post will display exactly the same workflow in both languages. I will run through the Python code first, and you can find an equivalent R script presented at the end.

If you are an R user who has been tempted to explore the exciting world of Python, one of the first things you will notice is the similarity of the syntax. This should make it easy to pick up the basics. However, there are some key differences between the two. A good example is how to index the first observation in a set of data: R indexing starts at 1, while Python indexing starts at 0!
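A minimal illustration of that indexing difference (Python shown; the equivalent R lookup for the first element would be data[1]):

```python
# a small list of observations
data = [10, 20, 30]

# Python indexing starts at 0, so the first observation is data[0]
first = data[0]
print(first)  # prints 10; in R, data[1] returns this same first element
```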



Text Mining the NZ Road Network with R

What are the most common words in New Zealand road names? Are there any common themes?

Thankfully, New Zealand’s 73,906 current road names have been made available through the LINZ Data Service. To answer the questions above, we can use R’s tm package to conduct basic text mining.

The process is simple*. Text is cleansed of any punctuation, extra white-space, redundant or uninteresting words before being fed into wordcloud(). The 60 most common words are then displayed with size proportional to frequency of occurrence.
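The same cleanse-and-count idea can be sketched with Python's standard library. The road names and the stop-word list below are made-up stand-ins for the LINZ data, not the real dataset:

```python
import re
from collections import Counter

# hypothetical sample of road names (not the LINZ data)
road_names = ["Queen Street", "George Street", "Victoria Road", "Queen Drive"]

# cleanse: lower-case, keep letters only, drop generic road-type words
stopwords = {"street", "road", "drive", "avenue", "place", "lane"}
words = [w for name in road_names
           for w in re.findall(r"[a-z]+", name.lower())
           if w not in stopwords]

# count word frequencies, as wordcloud() does before sizing the words
counts = Counter(words)
print(counts.most_common(1))  # [('queen', 2)]
```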

The 60 most common words found in New Zealand road names

Can we see any common themes? Yes, namely:

1. Royalty and famous Britons: George, King, Victoria, Queen, Elizabeth, Albert, Nelson.

Queen Victoria, Reigned 1837 – 1901.

2. Early New Zealanders: Campbell, Russel, Grey, Scott.

Sir George Grey, 3rd Governor of New Zealand, In office 1845 – 1854.

3. Native trees: Kowhai, Totara, Rata, Rimu, Matai, Kauri, Miro.

A pair of Kauri trees.

4. Not-so native trees: Pine, Oak.

5. Native birds: Tui, Huia, Kiwi.

The now-extinct Huia (male and female).

*This blog post from deltaDNA served as a guide.

Landonline: Road Name. Source: LINZ/Full Landonline Dataset.
ASP: Street Type. Source: LINZ/Electoral.

Set Operations in R and Python. Useful!

Set operations are super useful when data cleaning or testing scripts. They are a must-have in any analyst’s (data scientist’s/statistician’s/data wizard’s) toolbox. Here is a quick rundown in both R and Python.

Say we have two vectors x and y…

# vector x
x = c(1,2,3,4,5,6)

# vector y
y = c(4,5,6,7,8,9)

What if we ‘combined’ x and y ignoring any duplicate elements? (x \cup y)

# x UNION y
union(x, y)

[1] 1 2 3 4 5 6 7 8 9

What are the common elements in x and y? (x \cap y)

intersect(x, y)

[1] 4 5 6

What elements feature in x but not in y?

# x members not in y
setdiff(x, y)

[1] 1 2 3

What elements feature in y but not in x?

# y members not in x
setdiff(y, x)

[1] 7 8 9

How might we visualise all this?

Venn Diagram

What about Python? In modern Python, set is a built-in type, and a Set object can be created directly from a Python list (the old standalone ‘sets’ module has since been removed from the language). Set objects have methods and operators that provide the same functionality as the R functions above.
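A sketch of the Python side, mirroring the four R operations above:

```python
# sets built straight from lists; set is a built-in type
x = set([1, 2, 3, 4, 5, 6])
y = set([4, 5, 6, 7, 8, 9])

print(x.union(y))         # x UNION y: {1, 2, 3, 4, 5, 6, 7, 8, 9}
print(x.intersection(y))  # common elements: {4, 5, 6}
print(x.difference(y))    # x members not in y: {1, 2, 3}
print(y.difference(x))    # y members not in x: {7, 8, 9}
```

The operator forms x | y, x & y, x - y and y - x are equivalent shorthands.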


Data Distance: Identifying Nearest Neighbours in R

Given a large multidimensional data-set, what is the closest point to any point you care to choose? Or, for that matter, what is the closest point to any location you care to specify? Easy enough in one dimension: you can just look at the absolute values of the differences between your chosen point and the rest. But what about in 2, 3 or even 20 dimensions? We are talking about point distance.

Distance can be measured a number of ways, but the go-to is Euclidean distance. Imagine taking a ruler to a set of points in 3D space. Euclidean distance would be the distance that you measure. Euclidean distance can even be used in more than three dimensions and allows for the construction of distance matrices representing point separation in your data. The following example in R illustrates the concept.

Let’s slice up the classic Fisher’s iris data-set into a more manageable form consisting of the first 10 observations of the setosa species in 3 dimensions.

# required libraries
library(dplyr)

# loading the iris data set
data(iris)

# just setosa
iris.df = iris %>% filter(Species=='setosa') %>%
                # just three variables
                  select(-Petal.Width, -Species) %>%
                  # just the first 10 observations
                    slice(1:10)

# viewing
head(iris.df)

  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
4          4.6         3.1          1.5
5          5.0         3.6          1.4
6          5.4         3.9          1.7

dist() allows us to build a distance matrix. Smaller values indicate points that are closer together and that is why we see 0’s on the diagonal.

# generating distance matrix
distances.mat = as.matrix(dist(iris.df))

# viewing the first 8 columns

           1         2         3         4         5         6         7         8
1  0.0000000 0.5385165 0.5099020 0.6480741 0.1414214 0.5830952 0.5099020 0.1732051
2  0.5385165 0.0000000 0.3000000 0.3316625 0.6082763 1.0723805 0.5000000 0.4242641
3  0.5099020 0.3000000 0.0000000 0.2449490 0.5099020 1.0677078 0.2449490 0.4123106
4  0.6480741 0.3316625 0.2449490 0.0000000 0.6480741 1.1489125 0.3162278 0.5000000
5  0.1414214 0.6082763 0.5099020 0.6480741 0.0000000 0.5830952 0.4472136 0.2236068
6  0.5830952 1.0723805 1.0677078 1.1489125 0.5830952 0.0000000 0.9899495 0.6708204
7  0.5099020 0.5000000 0.2449490 0.3162278 0.4472136 0.9899495 0.0000000 0.4123106
8  0.1732051 0.4242641 0.4123106 0.5000000 0.2236068 0.6708204 0.4123106 0.0000000
9  0.9219544 0.5099020 0.4358899 0.3000000 0.9219544 1.4456832 0.5385165 0.7874008
10 0.4582576 0.1414214 0.3000000 0.3000000 0.5196152 0.9643651 0.4358899 0.3162278

We can see from the matrix that 1 is closest to 5 (distance = 0.14), 2 is closest to 10 (distance = 0.14) and 7 is closest to 3 (distance = 0.24) etc. Using the ggplot2 package we can transform this matrix into a nice heat-map for easy visualisation.
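For readers coming from Python, the same nearest-neighbour lookup can be sketched with the standard library alone. The three points below are observations 1, 2 and 5 from the output above, so no extra data is assumed:

```python
import math

# rows 1, 2 and 5 of iris.df (Sepal.Length, Sepal.Width, Petal.Length)
points = [(5.1, 3.5, 1.4), (4.9, 3.0, 1.4), (5.0, 3.6, 1.4)]

# pairwise Euclidean distance matrix, with zeros on the diagonal
dist = [[math.dist(p, q) for q in points] for p in points]

# nearest neighbour of the first point (skip the zero on the diagonal)
nearest = min((d, j) for j, d in enumerate(dist[0]) if j != 0)[1]
print(nearest)  # 2: the point taken from observation 5 is closest
```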

Darker tiles indicate closer points.

# required libraries
library(ggplot2)
library(reshape2)

# ggplot2 heat-map of the distance matrix
p = qplot(x=Var1, y=Var2, data=melt(distances.mat), fill=value, geom='tile')
# adding all the tick marks
p + scale_x_continuous(breaks=1:10) + 
    scale_y_continuous(breaks=1:10) + 
# hiding the axis labels
    xlab('') +
    ylab('')

Heat-map of the distance matrix

Viewing the data in 3D using the scatter3d() function in the car package further confirms the results we see in the distance matrix (click image to zoom).

# 3D scatter plot of the data
library(car)
scatter3d(x = iris.df$Sepal.Length,
          y = iris.df$Sepal.Width,
          z = iris.df$Petal.Length,
          xlab = 'Sepal Length', 
          ylab = 'Sepal Width',
          zlab = 'Petal Length',
          axis.col = c(rep('blue',3)))

3D scatter (angle 1)

3D scatter (angle 2)


Approaches to Extremes: Visualisation, Tests and The Local Outlier Factor

For most modelling exercises, extreme values pose a real problem. Whether these points are considered outliers* or influential points, it is good practice to identify, visualise and possibly exclude them for the sake of your model. Extreme values can really mess up your estimates and consequently your conclusions!

Of course in some fields, such as marketing or retail analytics, extreme values may represent a significant opportunity. Perhaps there are some customers that are simply off the charts in terms of purchase size and frequency. It would then be a good idea to identify them and do everything you can to retain them!

An extreme value may come to be via a mistake in data collection, or it could have been legitimately generated, however unlikely, by the statistical process that produced your data. The decision on whether a point is extreme is inherently subjective. Some people believe outliers should never be excluded unless the data point is surely a mistake; others seem to be far more ruthless… luckily, there exists a set of tools to aid us in our decisions regarding extreme data.

The univariate case

In the univariate case, the problem of extreme values is relatively straightforward. Let’s generate a vector of data from a standard normal distribution and add to this vector one point that is five times the size of a randomly chosen point:


# 1000 random standard normal variates
q = rnorm(n=1000)

# adding an 'extreme value'
q = c(q,sample(q,size=1)*5)

# summary
summary(q)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.28200 -0.72470 -0.02786 -0.03383  0.66420  7.39500 

# plotting
plot(q)

1000 standard normal values plus one ‘extreme’ value

The most extreme data point is 7.39500 and it can be seen on the plot sitting far above the others. This point could be defined as an outlier.

There are also a number of statistical tests that can be deployed to test for an outlying point in univariate data (assuming normality). One such test is the ‘Grubbs test for one outlier’ that can be found in the outliers package:


# loading the outliers package
library(outliers)

# testing for one outlier
grubbs.test(q)

	Grubbs test for one outlier

data:  q
G = 7.08620, U = 0.94974, p-value = 3.593e-10
alternative hypothesis: highest value 7.39479015341698 is an outlier

Here we have strong evidence against the null hypothesis that the highest value in the data is not an outlier.
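The Grubbs statistic itself is simple to compute by hand. Here is a minimal Python sketch of just the statistic, G = max|x_i − x̄|/s; the critical-value lookup, which requires the t distribution, is omitted:

```python
import statistics

def grubbs_statistic(data):
    # G = largest absolute deviation from the mean,
    # measured in units of the sample standard deviation
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return max(abs(v - m) for v in data) / s

# a clear outlier inflates G well above what clean data produces
print(grubbs_statistic([1, 2, 3, 4, 5, 100]))
```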

This is all well and good, but what happens when you are faced with multivariate data, perhaps with a high number of dimensions? An extreme value for one dimension may not appear extreme for the rest, making it hard to systematically identify problem values.

The multivariate case

The local outlier factor algorithm (LOF) proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in the early 2000s provides a general mechanism to identify extreme values in multivariate data. Not only does LOF allow us to identify these values, it also gives the degree to which a given point is considered ‘outlying.’ In simple terms, the LOF looks at a point in relation to its nearest neighbours in the data cloud and determines its degree of local isolation. If the point is very isolated it is given a high score. An implementation of the LOF algorithm, lofactor(data,k), can be found in the DMwR package. The argument k specifies the number of neighbours considered when assessing each point. The example below demonstrates the system:

Vectors x,y and z each represent 1000 data points generated from separate standard normal distributions. 5 ‘extreme’ points are then added to each vector. The effect is a central data cloud with several extreme satellite points. We wish to use the LOF algorithm to flag these satellite points.


x = rnorm(1000)
x = c(x,sample(x,size=5)*5)

y = rnorm(1000)
y = c(y,sample(y,size=5)*5)

z = rnorm(1000)
z = c(z,sample(z,size=5)*5)

# dat is a data frame comprised of the three vectors
dat = data.frame(x,y,z)

# visualising the data
plot(dat)

Scatter plot matrix revealing a central data cloud and a number of satellite points

Calling the LOF function scores each point in relation to its abnormality:


# let's use 6 neighbours
scores = lofactor(dat, k=6)

# viewing the scores for the first few points
head(scores)

[1] 1.0080632 1.0507945 1.1839767 0.9840658 1.0239369 1.1021009

Now lets take the top 10 most extreme points:

# storing the ten most outlying points in a vector called 'top.ten'
top.ten = order(scores, decreasing=TRUE)[1:10]

# printing
top.ten

[1] 1003 1004 1001 1005  883  785  361  589  130  283

Plotting the data with the 10 most extreme points coloured red:

# outliers will be coloured red, all other points will be black
colouring = rep(1,nrow(dat))
colouring[top.ten] = 2

# outliers will be crosses, all other points will be circles
symbol = rep(1,nrow(dat))
symbol[top.ten] = 4

# visualising the data
plot(dat, col=colouring, pch=symbol)

Red crosses are deemed to be local outliers

As we can see, the algorithm does a pretty good job of identifying the satellites!
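For intuition, the core LOF computation can be sketched in pure Python. This is a simplified sketch of the reachability/density logic described above, assuming a small data set with no duplicate points; it is not the DMwR implementation:

```python
import math

def knn(points, i, k):
    # indices of the k nearest neighbours of point i (excluding i itself)
    order = sorted(range(len(points)),
                   key=lambda j: math.dist(points[i], points[j]))
    return [j for j in order if j != i][:k]

def lof(points, k):
    n = len(points)
    neigh = [knn(points, i, k) for i in range(n)]
    # k-distance: distance to the k-th nearest neighbour
    kdist = [math.dist(points[i], points[neigh[i][-1]]) for i in range(n)]
    def lrd(i):
        # local reachability density: inverse mean reachability distance
        reach = [max(kdist[j], math.dist(points[i], points[j]))
                 for j in neigh[i]]
        return len(reach) / sum(reach)
    lrds = [lrd(i) for i in range(n)]
    # LOF: how much less dense is point i than its neighbours?
    return [sum(lrds[j] for j in neigh[i]) / (k * lrds[i]) for i in range(n)]

# a tight cluster plus one obvious satellite point
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = lof(pts, k=3)
print(scores.index(max(scores)))  # 5: the satellite gets the highest score
```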

Viewing the data in 3 dimensions further confirms the result. The red points indicate our top 10 most extreme points as suggested by LOF:


# outliers will be coloured red, all other points will be grey
col = rep("#4c4646",nrow(dat))
col[top.ten] = "#b20000"

# javascript 3d scatter plot (via the threejs package)
library(threejs)
scatterplot3js(x, y, z, color=col)

3D visualisation (angle 1)
3D visualisation (angle 2)
3D visualisation (angle 3)

Of course you must decide on your value for k. Try a few different values and visualise the data until you are happy.

*Actually defining an outlier is a prickly subject so I have steered clear!


DIY Header Comments in R

It is always a good idea to include a header at the start of your R scripts that provides a title, records the author and date of writing, and describes what your code actually does! This makes it easy for someone else (or the future you!) to understand what you have done.

There is no need to manually type these out. header_comment() from the commentr package prints a really nice header to the R console:


header_comment("Amazing Program", "Does everything you could ever want!",
               author       = "John Roberts",
               contact      = "",
               client       = "A Big Firm",
               date_created = date(),
               width        = "a4landscape")

Try it!

Otherwise, if you would rather the header took a different form, you could write your own function. What follows is an example function I quickly wrote that uses cat() to print a header. Go ahead and change it to suit your style!

# the function takes 5 arguments:
# title (required with no default)
# version (default: NULL)
# author (default: system user)
# contact (default: NULL) 
# description (default: NULL)

header = function(title=NULL, version=NULL,
                  author=Sys.info()[['user']],
                  contact=NULL, description=NULL){
  # you must have a title
  if (is.null(title)){
    stop('You need a title!')
  }
  # defines the bars
  bar = c(rep('#',4), rep('-',nchar(date())), rep('#',4),'\n')
  # what will print regarding version
  version.str = c('Version:', version,'\n')
  # what will print regarding contact
  contact.str = c('Contact:', contact,'\n')
  # what will print regarding description
  description.str = c('Description:', description, rep('\n',2)) 
  # if no version was entered, nothing will print
  if (is.null(version)) version.str = NULL
  # same for contact
  if (is.null(contact)) contact.str = NULL
  # same for description
  if (is.null(description)) description.str = NULL
  # comment construction
  cat(bar, bar,
      'Title:', title, '\n',
      version.str,
      'Date:', date(), '\n',
      'Author:', author, '\n', 
      contact.str,
      description.str,
      bar, bar)
}

header(title="Amazing Program",version=2.1,author="John Roberts",
description="Does everything you could ever want!")


 # # # # - - - - - - - - - - - - - - - # # # 
 # # # # - - - - - - - - - - - - - - - # # # 
 Title: Amazing Program 
 Version: 2.1 
 Date: Mon Jul 20 21:14:30 2015 
 Author: John Roberts 
 Description: Does everything you could ever want! 
 # # # # - - - - - - - - - - - - - - - # # # 
 # # # # - - - - - - - - - - - - - - - # # #


Maximum-likelihood Fitting with R

So you have gathered your data and now you want to use maximum-likelihood estimation along with a selected distribution to fit a model. Here are two methods:

Method 1

In R, you could define the log-likelihood and use R’s optim() to maximise this function with respect to the parameters. This gets you your maximum likelihood estimates which then enable you to fit a probability distribution to the data.

The process is perhaps best shown by way of an example.

Say we have some count data that we think may best be modelled using the Poisson distribution. The data looks like this:


# let’s generate some synthetic sample data:
x = rpois(n=10000,lambda=2)

# plotting
hist(x)

Histogram of x

Hint: We know the data has been generated from a Poisson distribution with \lambda=2!

The Poisson probability function has the form:

P(X=x) = \frac{\lambda^{x}e^{-\lambda}}{x!}

The likelihood function considering n independent data points is therefore:

L(\lambda;x) = \prod_{i=1}^{n}\frac{\lambda^{x_i}e^{-\lambda}}{x_i!} = \frac{\lambda^{x_1}e^{-\lambda}}{x_1!}\times\frac{\lambda^{x_2}e^{-\lambda}}{x_2!}\times\ldots\times\frac{\lambda^{x_n}e^{-\lambda}}{x_n!}

= \frac{\lambda^{\sum x_i}e^{-n\lambda}}{x_1!x_2!\dots x_n!}

And the log-likelihood function:

\log(L(\lambda;x))=\log(\lambda^{\sum x_i}e^{-n\lambda})-\log(x_1!x_2!\dots x_n!)

Which, ignoring the terms not featuring the parameter \lambda,

=\sum x_i\log(\lambda)-n\lambda

Expressing this in R:

# the log-likelihood is a function of lambda and the data
poissonLogLik = function(lambda, data){
  # total number of observations
  n = length(data)
  # equation
  logLik = sum(data)*log(lambda) - n*lambda
  return(logLik)
}

To find our estimate \hat\lambda we need to maximise the log-likelihood function, i.e. set its derivative with respect to \lambda to zero:

\frac{\partial}{\partial \lambda} \log(L(\lambda;x)) = \frac{\sum x_i}{\lambda} - n = 0

Solving this analytically gives \hat\lambda = \bar{x}, the sample mean, but below we let R's numerical optimiser do the work.

Doing it in R:

# with a starting value of 1, maximise the function! 
mle = optim(par=1, fn=poissonLogLik, data=x, control=list(fnscale=-1))

> mle
$par
[1] 2.000195

$value
[1] -6136.363

$counts
function gradient 
      30       NA 

$convergence
[1] 0


Note that by default, optim() finds a minimum solution. This is no good because we are only interested in maximising! Adding control=list(fnscale=-1) solves the problem. Check out optim()’s help file for more details.
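As a cross-check, the same maximisation can be sketched in pure Python, with a golden-section search standing in for optim(). The data here is a small made-up sample, not the x generated above:

```python
import math

def poisson_log_lik(lam, data):
    # log-likelihood up to an additive constant: sum(x)*log(lam) - n*lam
    return sum(data) * math.log(lam) - len(data) * lam

def maximise(f, lo, hi, tol=1e-8):
    # golden-section search for the maximum of a unimodal function
    g = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        a = hi - g * (hi - lo)
        b = lo + g * (hi - lo)
        if f(a) < f(b):
            lo = a
        else:
            hi = b
    return (lo + hi) / 2

data = [2, 1, 3, 0, 2, 4, 2, 1, 2, 3]
lam_hat = maximise(lambda l: poisson_log_lik(l, data), 0.01, 10)
# the Poisson MLE is the sample mean, so lam_hat should be close to 2.0
print(lam_hat)
```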

Now we have our estimate, \hat\lambda, how close are our observed and expected values?

Our observed values:

# observed
table(x)

   0    1    2    3    4    5    6    7    8    9 
1327 2781 2676 1790  880  367  126   38   14    1 

Our expected values:

# expected
expected = dpois(x=min(x):max(x), lambda=mle$par)*10000

> expected
 [1] 1353.088531 2706.441338 2706.705639 1804.646644  902.411439  360.999826
 [7]  120.345027   34.387651    8.597752    1.910798

Overlaying the expected values onto the original histogram (with a red line):

# observed
hist(x)

# expected
lines(x=min(x):max(x), y=expected, col='red')

Histogram of x with the expected values overlaid (red line)

The model appears to fit well!

Method 2

The function fitdistr() in the MASS package allows us to fit a range of distributions to data using maximum-likelihood. This means we don’t have to use method 1 each time! Woohoo! It’s as easy as handing over your data, choosing your distribution and pressing RUN. You also get estimated standard errors, which is a bonus.

# loading the MASS package
library(MASS)

# fitting a Poisson distribution using maximum-likelihood 
fitdistr(x, densfun='Poisson')
Very nice!