Data frame objects facilitate most data analysis exercises in both R and Python (perhaps with the exception of time series analysis, where the focus is on R time series and Pandas series objects). Data frames are a tidy and meaningful way to store data.

This post will display exactly the same workflow in both languages. I will run through the Python code first, and you can find an equivalent R script presented at the end.

If you are an R user who has been tempted to explore the exciting world of Python, one of the first things you will notice is the similarity of the syntax. This should make it easy to pick up the basics. However, there are some key differences between the two. A good example is how to index the first observation in a set of data: R indexing starts at 1, while Python indexing starts at 0!
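For instance, indexing the first element of a small vector:

# R: vectors are indexed from 1
v = c(10, 20, 30)
v[1]   # returns 10; the equivalent in Python would be v[0]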

What are the most common words in New Zealand road names? Are there any common themes?

Thankfully, New Zealand’s 73,906 current road names have been made available through the LINZ Data Service. To answer the questions above, we can use R’s tm package to conduct basic text mining.

The process is simple*. Text is cleansed of punctuation, extra white-space, and redundant or uninteresting words before being fed into wordcloud(). The 60 most common words are then displayed with size proportional to frequency of occurrence.
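Roughly, the workflow looks like the sketch below, assuming the LINZ road names have been read into a character vector called road.names (the vector name and the exact list of removed words are illustrative):

library(tm)
library(wordcloud)
# build a corpus from the (assumed) vector of road names
corpus = Corpus(VectorSource(road.names))
# cleaning: lower case, punctuation, white-space and uninteresting words
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace)
corpus = tm_map(corpus, removeWords, c(stopwords('english'), 'road', 'street', 'avenue'))
# the 60 most common words, sized by frequency of occurrence
wordcloud(corpus, max.words = 60, random.order = FALSE)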

Can we see any common themes? Yes, namely:

1. Royalty and famous Britons: George, King, Victoria, Queen, Elizabeth, Albert, Nelson.

2. Early New Zealanders: Campbell, Russel, Grey, Scott.

3. Native trees: Kowhai, Totara, Rata, Rimu, Matai, Kauri, Miro.

Set operations are super useful when cleaning data or testing scripts. They are a must-have in any analyst’s (data scientist’s/statistician’s/data wizard’s) toolbox. Here is a quick rundown in both R and Python.

Say we have two vectors x and y…

# vector x
x = c(1,2,3,4,5,6)
# vector y
y = c(4,5,6,7,8,9)

What if we ‘combined’ x and y, ignoring any duplicate elements (the union, x ∪ y)?

# x UNION y
union(x, y)
[1] 1 2 3 4 5 6 7 8 9

What are the common elements in x and y (the intersection, x ∩ y)?

# x INTERSECTION y
intersect(x, y)
[1] 4 5 6

What elements feature in x but not in y?

# x members not in y
setdiff(x,y)
[1] 1 2 3

What elements feature in y but not in x?

# y members not in x
setdiff(y,x)
[1] 7 8 9

How might we visualise all this?
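One option (an illustration; the code for this isn't shown in the original post) is a quick Venn diagram, for example via the gplots package:

library(gplots)
# two-set Venn diagram showing shared and unique elements of x and y
venn(list(x = x, y = y))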

What about Python? Python’s built-in set type (the modern replacement for the old ‘sets’ module) can be created from a Python list, and set objects have methods and operators that provide the same functionality as the R functions above.
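For example, a minimal sketch in modern Python, mirroring the R calls above (results shown as comments):

x = [1, 2, 3, 4, 5, 6]
y = [4, 5, 6, 7, 8, 9]

set(x) | set(y)   # union        -> {1, 2, 3, 4, 5, 6, 7, 8, 9}
set(x) & set(y)   # intersection -> {4, 5, 6}
set(x) - set(y)   # x not in y   -> {1, 2, 3}
set(y) - set(x)   # y not in x   -> {7, 8, 9}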

Given a large multidimensional data-set, what is the closest point to any point you care to choose? Or, for that matter, what is the closest point to any location you care to specify? In one dimension this is easy enough: you can just look at the absolute values of the differences between your chosen point and the rest. But what about in 2, 3 or even 20 dimensions? We are talking about point distance.

Distance can be measured in a number of ways, but the go-to is Euclidean distance. Imagine taking a ruler to a set of points in 3D space. Euclidean distance is the distance you would measure. Euclidean distance extends to any number of dimensions and allows for the construction of distance matrices representing point separation in your data. The following example in R illustrates the concept.
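For two points $p$ and $q$ in $n$ dimensions, the Euclidean distance is:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$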

Let’s slice up Fisher’s classic iris data-set into a more manageable form consisting of the first 10 observations of the setosa species in 3 dimensions.

# required libraries
library(dplyr)
library(ggplot2)
library(reshape2)
library(car)
library(rgl)
# loading the iris data set
data(iris)
# just setosa
iris.df = iris %>%
  filter(Species == 'setosa') %>%
  # just three variables
  select(-Petal.Width, -Species) %>%
  # just the first 10 observations
  slice(1:10)
# viewing
head(iris.df)
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
6 5.4 3.9 1.7

dist() allows us to build a distance matrix. Smaller values indicate points that are closer together and that is why we see 0’s on the diagonal.
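The matrix referenced below can be built along these lines (the original call isn’t shown, so this is a sketch):

# Euclidean distance matrix for the 10 observations
distances.mat = as.matrix(dist(iris.df))
# rounding to 2 decimal places for easier reading
round(distances.mat, 2)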

We can see from the matrix that 1 is closest to 5 (distance = 0.14), 2 is closest to 10 (distance = 0.14) and 7 is closest to 3 (distance = 0.24) etc. Using the ggplot2 package we can transform this matrix into a nice heat-map for easy visualisation.

Darker tiles indicate closer points.

# ggplot2 heat-map of the distance matrix
p = qplot(x=Var1, y=Var2, data=melt(distances.mat), fill=value, geom='tile')
# adding all the tick marks
p + scale_x_continuous(breaks=1:10) +
  scale_y_continuous(breaks=1:10) +
  # hiding the axis labels
  xlab('') +
  ylab('')

Viewing the data in 3D using the scatter3d() function in the car package further confirms the results we see in the distance matrix (click image to zoom).
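A sketch of that call, using the car and rgl packages loaded earlier (surface = FALSE simply suppresses the fitted regression surface):

# interactive 3D scatter plot of the 10 points (opens an rgl device)
scatter3d(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris.df,
          surface = FALSE, point.col = 'blue')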

For most modelling exercises, extreme values pose a real problem. Whether these points are considered outliers* or influential points, it is good practice to identify, visualise and possibly exclude them for the sake of your model. Extreme values can really mess up your estimates and consequently your conclusions!

Of course in some fields, such as marketing or retail analytics, extreme values may represent a significant opportunity. Perhaps there are some customers that are simply off the charts in terms of purchase size and frequency. It would then be a good idea to identify them and do everything you can to retain them!

An extreme value may arise from a mistake in data collection, or it could have been legitimately generated, however improbably, by the statistical process that generated your data. The decision on whether a point is extreme is inherently subjective. Some people believe outliers should never be excluded unless the data point is surely a mistake; others seem to be far more ruthless… luckily, there exists a set of tools to aid us in our decisions regarding extreme data.

The univariate case

In the univariate case, the problem of extreme values is relatively straightforward. Let’s generate a vector of data from a standard normal distribution and add to it one point that is five times the size of a randomly chosen point:

set.seed(8)
# 1000 random standard normal variates
q = rnorm(n=1000)
# adding an 'extreme value'
q = c(q,sample(q,size=1)*5)
# summary
summary(q)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.28200 -0.72470 -0.02786 -0.03383 0.66420 7.39500
# plotting
boxplot(q)

The most extreme data point is 7.39500 and it can be seen on the plot sitting far above the others. This point could be defined as an outlier.
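As a quick, informal check (not part of the original analysis), base R’s boxplot rule flags any point lying more than 1.5 times the interquartile range beyond the quartiles:

# points beyond the boxplot whiskers (the 1.5 * IQR rule)
boxplot.stats(q)$out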

There are also a number of statistical tests that can be deployed to test for an outlying point in univariate data (assuming normality). One such test is the ‘Grubbs test for one outlier’ that can be found in the outliers package:

library(outliers)
# testing for one outlier
grubbs.test(q)
Grubbs test for one outlier
data: q
G = 7.08620, U = 0.94974, p-value = 3.593e-10
alternative hypothesis: highest value 7.39479015341698 is an outlier

With a p-value this small, we have strong evidence against the null hypothesis of no outliers; the test flags the highest value as an outlier.

This is all well and good, but what happens when you are faced with multivariate data, perhaps with a high number of dimensions? An extreme value in one dimension may not appear extreme in the rest, making it hard to systematically identify problem values.

The multivariate case

The local outlier factor (LOF) algorithm proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000 provides a general mechanism to identify extreme values in multivariate data. Not only does LOF allow us to identify these values, it also gives the degree to which a given point is considered ‘outlying.’ In simple terms, LOF looks at a point in relation to its nearest neighbours in the data cloud and determines its degree of local isolation. If the point is very isolated it is given a high score. An implementation of the LOF algorithm, lofactor(data, k), can be found in the DMwR package. The argument k specifies the number of neighbours considered when assessing each point. The example below demonstrates the system:

Vectors x, y and z each represent 1000 data points generated from separate standard normal distributions. Five ‘extreme’ points are then added to each vector. The effect is a central data cloud with several extreme satellite points. We wish to use the LOF algorithm to flag these satellite points.

set.seed(20)
x = rnorm(1000)
x = c(x,sample(x,size=5)*5)
y = rnorm(1000)
y = c(y,sample(y,size=5)*5)
z = rnorm(1000)
z = c(z,sample(z,size=5)*5)
# dat is a data frame comprised of the three vectors
dat = data.frame(x,y,z)
# visualising the data
pairs(dat)

Calling the LOF function scores each point in relation to its abnormality:

library(DMwR)
# let's use 6 neighbours
scores = lofactor(dat, k=6)
# viewing the scores for the first few points
head(scores)
[1] 1.0080632 1.0507945 1.1839767 0.9840658 1.0239369 1.1021009

Now let’s take the top 10 most extreme points:

# storing the 10 most outlying points in a vector called 'top.ten'
top.ten = order(scores, decreasing=TRUE)[1:10]
# printing
top.ten
[1] 1003 1004 1001 1005 883 785 361 589 130 283

Plotting the data with the 10 most extreme points coloured red:

# outliers will be coloured red, all other points will be black
colouring = rep(1,nrow(dat))
colouring[top.ten] = 2
# outliers will be crosses, all other points will be circles
symbol = rep(1,nrow(dat))
symbol[top.ten] = 4
# visualising the data
pairs(dat,col=colouring,pch=symbol)

As we can see, the algorithm does a pretty good job of identifying the satellites!

Viewing the data in 3 dimensions further confirms the result. The red points indicate our top 10 most extreme points as suggested by LOF:

library(threejs)
# outliers will be coloured red, all other points will be grey
col = rep("#4c4646",nrow(dat))
col[top.ten] = "#b20000"
# javascript 3d scatter plot
scatterplot3js(x,y,z,num.ticks=NULL,color=col)

Of course you must decide on your value for k. Try a few different values and visualise the data until you are happy.
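For example, one rough way to compare candidate values of k (a sketch, not from the original post) is to loop over a few choices and inspect the largest scores:

# compare the most extreme LOF scores for a few values of k
for (k in c(4, 6, 10, 20)) {
  scores.k = lofactor(dat, k = k)
  cat('k =', k, '-> five largest scores:',
      round(sort(scores.k, decreasing = TRUE)[1:5], 2), '\n')
}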

*Actually defining an outlier is a prickly subject so I have steered clear!

It is always a good idea to include a header at the start of your R scripts that provides a title, records the author and date of writing, and describes what your code actually does! This makes it easy for someone else (or the future you!) to understand what you have done.

There is no need to manually type these out. header_comment() from the commentr package prints a really nice header to the R console:

library(commentr)
header_comment("Amazing Program", "Does everything you could ever want!",
author = "John Roberts",
contact = "john.roberts@someemailprovider.com",
client = "A Big Firm",
date_created = date(),
width = "a4landscape")

Try it!

Otherwise, if you would rather the header took a different form, you could write your own function. What follows is an example function I quickly wrote that uses cat() to print a header. Go ahead and change it to suit your style!

# the function takes 5 arguments:
# title (required with no default)
# version (default: NULL)
# author (default: system user)
# contact (default: NULL)
# description (default: NULL)
header = function(title=NULL, version=NULL,
                  author=Sys.getenv('USERNAME'),
                  contact=NULL, description=NULL){
  # you must have a title
  if (is.null(title)){
    stop('You need a title!')
  }
  # defines the bars
  bar = c(rep('#',4), rep('-',nchar(date())), rep('#',4), '\n')
  # what will print regarding version
  version.str = c('Version:', version, '\n')
  # what will print regarding contact
  contact.str = c('Contact:', contact, '\n')
  # what will print regarding description
  description.str = c('Description:', description, rep('\n',2))
  # if no version was entered, nothing will print
  if(is.null(version)){
    version.str = NULL
  }
  # same for contact
  if(is.null(contact)){
    contact.str = NULL
  }
  # same for description
  if(is.null(description)){
    description.str = NULL
  }
  # comment construction
  cat(
    '',
    rep(x=bar, times=2), '\n',
    'Title:', title, '\n',
    version.str,
    'Date:', date(), '\n',
    '\n',
    'Author:', author, '\n',
    contact.str, '\n',
    description.str,
    rep(x=bar, times=2)
  )
}
header(title="Amazing Program", version=2.1, author="John Roberts",
       contact="john.roberts@someemailprovider.com",
       description="Does everything you could ever want!")

So you have gathered your data and now you want to use maximum-likelihood estimation along with a selected distribution to fit a model. Here are two methods:

Method 1

In R, you could define the log-likelihood and use R’s optim() to maximise this function with respect to the parameters. This gets you your maximum likelihood estimates which then enable you to fit a probability distribution to the data.

The process is perhaps best shown by way of an example.

Say we have some count data that we think may best be modelled using the Poisson distribution. The data looks like this:

set.seed(100)
# let’s generate some synthetic sample data:
x = rpois(n=10000,lambda=2)
# plotting
hist(x,col='lightblue')

Hint: We know the data has been generated from a Poisson distribution with $\lambda = 2$!

The Poisson probability function has the form:

$$P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \dots$$

The likelihood function for $n$ independent data points $x_1, \dots, x_n$ is therefore:

$$L(\lambda) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}$$

And the log-likelihood function:

$$\ell(\lambda) = \log(\lambda)\sum_{i=1}^{n} x_i - n\lambda - \sum_{i=1}^{n} \log(x_i!)$$

Which, ignoring the terms not featuring the parameter $\lambda$, reduces to:

$$\ell(\lambda) \propto \log(\lambda)\sum_{i=1}^{n} x_i - n\lambda$$

Expressing this in R:

# the log-likelihood is a function of lambda and the data
poissonLogLik = function(lambda, data){
  # total number of observations
  n = length(data)
  # the log-likelihood, dropping terms that do not involve lambda
  logLik = sum(data) * log(lambda) - n * lambda
  return(logLik)
}

To find our estimate we need to maximise the log-likelihood function:

$$\hat{\lambda} = \underset{\lambda}{\arg\max}\ \ell(\lambda)$$

(For the Poisson, this has the closed-form solution $\hat{\lambda} = \bar{x}$, the sample mean, which gives us a handy check on the numerical answer below.)

Doing it in R:

# with a starting value of 1, maximise the function!
mle = optim(par=1,fn=poissonLogLik,data=x,control=list(fnscale=-1))

> mle
$par
[1] 2.000195
$value
[1] -6136.363
$counts
function gradient
30 NA
$convergence
[1] 0
$message
NULL

Note that by default, optim() finds a minimum solution. This is no good because we are only interested in maximising! Adding control=list(fnscale=-1) solves the problem. Check out optim()’s help file for more details.

Now that we have our estimate, $\hat{\lambda} = 2.000195$, how close are our observed and expected values?
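One simple check (a sketch; the original comparison isn’t shown) is to line up the observed relative frequencies against the Poisson probabilities implied by the estimate:

# observed relative frequencies of each count
observed = table(x) / length(x)
# Poisson probabilities at the same counts, using the fitted lambda
expected = dpois(as.numeric(names(observed)), lambda = mle$par)
round(rbind(observed, expected), 3)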

Method 2

The function fitdistr() in the MASS package allows us to fit a range of distributions to data using maximum likelihood. This means we don’t have to use Method 1 each time! Woohoo! It’s as easy as handing over your data, choosing your distribution and pressing RUN. You also get the estimated standard errors, which is a bonus.

# loading the MASS package
library(MASS)
# fitting a Poisson distribution using maximum-likelihood
fitdistr(x,densfun='Poisson')