A common competition run by vendors of fishing equipment is ‘guess the weight and win’: an image of someone holding a fish is posted, and it is up to you to guess its weight, with the closest guess winning a prize.
The ‘law of large numbers’ implies that the average of many guesses is a more stable estimate than the average of a few, so the ‘best guess’ should be close to the average of all guesses…
Motivated by the possibility of winning some fishing tackle, I set about experimenting with R’s regular expressions to create a tool that would enable me to make an informed guess based on the guesses of many.
The function below reads in a text file containing each person’s guess (provided via a comment), extracts and cleans the guesses, transforms the guesses into a common unit (kilograms) and provides summary statistics and a histogram that would suggest the best guess you could make. Of course, this function could also be adapted to suit a ‘how many jelly beans in the jar?’ competition!
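As a rough illustration of the idea (not the original function), here is a minimal sketch in base R. The function name extract.guesses, the sample comments, the regular expression and the rule that a guess without a unit is taken to be in kilograms are all assumptions made for this example.

```r
# Hypothetical sketch: 'comments' stands in for lines read with readLines()
extract.guesses <- function(comments) {
  # lower-case so that 'KG', 'Kg' and 'kg' are treated alike
  comments <- tolower(comments)
  # a number (optionally with a decimal part) followed by an optional unit
  pattern <- "([0-9]+\\.?[0-9]*)\\s*(kgs?|kilos?|lbs?|pounds?)?"
  # keep the first match in each comment; comments with no number are dropped
  hits <- regmatches(comments, regexpr(pattern, comments, perl = TRUE))
  value <- as.numeric(sub(pattern, "\\1", hits, perl = TRUE))
  unit <- sub(pattern, "\\2", hits, perl = TRUE)
  # convert pounds to kilograms; a missing unit is ASSUMED to mean kilograms
  is.lb <- grepl("^(lb|pound)", unit)
  ifelse(is.lb, value * 0.45359237, value)
}

# made-up comments, for illustration only
comments <- c("I reckon 17kg", "38 lbs for sure!", "15.5 Kilos", "nice fish", "12")
guesses.kg <- extract.guesses(comments)
summary(guesses.kg)
# hist(guesses.kg) would draw the corresponding histogram
```

A real comment thread will be messier than this (several numbers in one comment, stones and ounces, jokes), so the pattern would need tuning against the actual data.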
Here is the output of one such competition:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4.00   12.50   17.00   17.35   19.90   85.00
In this case, I would guess the weight of the fish to be around 17 kilograms!
Data frame objects facilitate most data analysis exercises in both R and Python (perhaps with the exception of time series analysis, where the focus is on R time series and Pandas series objects). Data frames are a tidy and meaningful way to store data.
This post will display exactly the same workflow in both languages. I will run through the Python code first, and you can find an equivalent R script presented at the end.
If you are an R user and have been tempted to explore the exciting world of Python, one of the first things you will notice is the similarity of syntax. This should make it easy to pick up the basics. However, there are some key differences between the two. A good example is how to index the first observation in a set of data. R indexing starts at 1 while Python indexing starts at 0!
What are the most common words in New Zealand road names? Are there any common themes?
Thankfully, New Zealand’s 73,906 current road names have been made available through the LINZ Data Service. To answer the questions above, we can use R’s tm package to conduct basic text mining.
The process is simple*. Text is cleansed of any punctuation, extra white-space, redundant or uninteresting words before being fed into wordcloud(). The 60 most common words are then displayed with size proportional to frequency of occurrence.
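The cleaning and counting steps can be sketched in base R as follows. This is an illustration rather than the original tm workflow: the road names and the list of ‘uninteresting’ words are made up for the example, and the real input would be the road-name column from the LINZ data.

```r
# made-up sample of road names, for illustration only
roads <- c("George Street", "King Street", "Kowhai Road", "King Edward Parade",
           "Totara  Avenue", "Queen Street", "Kowhai Place", "King Road")
# words considered uninteresting (an assumed, illustrative list)
boring <- c("street", "road", "avenue", "parade", "place")

clean <- tolower(roads)
clean <- gsub("[[:punct:]]", "", clean)      # strip punctuation
clean <- gsub("\\s+", " ", trimws(clean))    # collapse extra white-space
words <- unlist(strsplit(clean, " "))        # one element per word
words <- words[!words %in% boring]           # drop the uninteresting words

# word frequencies, most common first
freq <- sort(table(words), decreasing = TRUE)
head(freq, 60)
# with the wordcloud package installed, the plot could then be drawn with
# wordcloud(names(freq), as.integer(freq), max.words = 60)
```

The tm package wraps the same steps in functions such as removePunctuation(), stripWhitespace() and removeWords(), which is more convenient for a corpus of this size.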
Can we see any common themes? Yes, namely:
1. Royalty and famous Britons: George, King, Victoria, Queen, Elizabeth, Albert, Nelson.
2. Early New Zealanders: Campbell, Russel, Grey, Scott.
3. Native trees: Kowhai, Totara, Rata, Rimu, Matai, Kauri, Miro.
Set operations are super useful when cleaning data or testing scripts. They are a must-have in any analyst’s (data scientist’s/statistician’s/data wizard’s) toolbox. Here is a quick rundown in both R and Python.
Say we have two vectors x and y…
# vector x
x = c(1,2,3,4,5,6)
# vector y
y = c(4,5,6,7,8,9)
What if we ‘combined’ x and y, ignoring any duplicate elements? (union())
# x UNION y
union(x, y)
[1] 1 2 3 4 5 6 7 8 9
What are the common elements in x and y? (intersect())
# x INTERSECTION y
intersect(x, y)
[1] 4 5 6
What elements feature in x but not in y? (setdiff())
# x members not in y
setdiff(x, y)
[1] 1 2 3
What elements feature in y but not in x?
# y members not in x
setdiff(y, x)
[1] 7 8 9
How might we visualise all this?
What about Python? Standard Python has a built-in set type (older versions offered the same functionality through a module called ‘sets’ and its ‘Set’ class) that can be created from a Python list. A set object has methods, such as union(), intersection() and difference(), that provide the same functionality as the R functions above.
Given a large multidimensional data-set, what is the closest point to any point you care to choose? Or, for that matter, what is the closest point to any location you care to specify? Easy enough in one dimension, you can just look at the absolute values of the differences between your chosen point and the rest. But what about in 2, 3 or even 20 dimensions? We are talking about point distance.
Distance can be measured in a number of ways, but the go-to is Euclidean distance. Imagine taking a ruler to a set of points in 3D space. Euclidean distance would be the distance that you measure. Euclidean distance can even be used in more than three dimensions and allows for the construction of distance matrices representing point separation in your data. The following example in R illustrates the concept.
Let’s slice up the classic Fisher’s iris data-set into a more manageable form consisting of the first 10 observations of the setosa species in 3 dimensions.
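# required libraries
library(dplyr)
library(reshape2)
library(ggplot2)
# loading the iris data set
data(iris)
# just setosa
iris.df = iris %>% filter(Species=='setosa') %>%
# just three variables
  select(Sepal.Length, Sepal.Width, Petal.Length) %>%
# just the first 10 observations
  head(10)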
  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
4          4.6         3.1          1.5
5          5.0         3.6          1.4
6          5.4         3.9          1.7
dist() allows us to build a distance matrix. Smaller values indicate points that are closer together and that is why we see 0’s on the diagonal.
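For reference, here is a minimal base R sketch of how such a matrix can be built. The subsetting is redone without dplyr so that the snippet stands alone; the names iris.small, masked and nearest are assumptions for this example, and distances.mat is assumed to be the object used in the plotting code.

```r
# first 10 setosa observations and three variables, using base R only
iris.small <- head(iris[iris$Species == "setosa",
                        c("Sepal.Length", "Sepal.Width", "Petal.Length")], 10)

# pairwise Euclidean distances, converted to a full symmetric matrix
distances.mat <- as.matrix(dist(iris.small, method = "euclidean"))
round(distances.mat, 2)

# nearest neighbour of each point: mask the zero diagonal, then take the
# column index of the smallest remaining distance in each row
masked <- distances.mat
diag(masked) <- Inf
nearest <- apply(masked, 1, which.min)
nearest
```

Masking the diagonal matters here: without it every point would simply report itself as its own nearest neighbour, at distance zero.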
We can see from the matrix that 1 is closest to 5 (distance = 0.14), 2 is closest to 10 (distance = 0.14) and 7 is closest to 3 (distance = 0.24) etc. Using the ggplot2 package we can transform this matrix into a nice heat-map for easy visualisation.
Darker tiles indicate closer points.
# ggplot2 heat-map of the distance matrix
p = qplot(x=Var1, y=Var2, data=melt(distances.mat), fill=value, geom='tile')
# adding all the tick marks
p + scale_x_continuous(breaks=1:10) +
    scale_y_continuous(breaks=1:10) +
# hiding the axis labels
    xlab('') + ylab('')
Viewing the data in 3D using the scatter3d() function in the car package further confirms the results we see in the distance matrix.
For most modelling exercises, extreme values pose a real problem. Whether these points are considered outliers* or influential points, it is good practice to identify, visualise and possibly exclude them for the sake of your model. Extreme values can really mess up your estimates and consequently your conclusions!
Of course in some fields, such as marketing or retail analytics, extreme values may represent a significant opportunity. Perhaps there are some customers that are simply off the charts in terms of purchase size and frequency. It would then be a good idea to identify them and do everything you can to retain them!
An extreme value may come about through a mistake in data collection, or it could have been legitimately generated by the statistical process that generated your data, however unlikely. The decision on whether a point is extreme is inherently subjective. Some people believe outliers should never be excluded unless the data point is surely a mistake; others seem to be far more ruthless… Luckily, there exists a set of tools to aid us in our decisions regarding extreme data.
The univariate case
In the univariate case, the problem of extreme values is relatively straightforward. Let’s generate a vector of data from a standard normal distribution and add to this vector one point that is five times the size of a randomly chosen point:
# 1000 random standard normal variates
q = rnorm(n=1000)
# adding an 'extreme value'
q = c(q,sample(q,size=1)*5)
# plotting the data
plot(q)
# summary statistics
summary(q)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-3.28200 -0.72470 -0.02786 -0.03383  0.66420  7.39500
The most extreme data point is 7.39500 and it can be seen on the plot sitting far above the others. This point could be defined as an outlier.
There are also a number of statistical tests that can be deployed to test for an outlying point in univariate data (assuming normality). One such test is the ‘Grubbs test for one outlier’ that can be found in the outliers package:
# testing for one outlier
library(outliers)
grubbs.test(q)
Grubbs test for one outlier
G = 7.08620, U = 0.94974, p-value = 3.593e-10
alternative hypothesis: highest value 7.39479015341698 is an outlier
Here we have strong evidence against the null hypothesis that the highest value in the data is not an outlier.
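To see where the G statistic comes from, here is a small base R sketch that computes it by hand and compares it with the usual one-sided critical value. The seed, the planted value of 8 and the 5% significance level are assumptions for this example, so the numbers will not match the output above.

```r
set.seed(1)               # arbitrary seed, for reproducibility only
q <- c(rnorm(1000), 8)    # 1000 standard normal variates plus one planted extreme value
n <- length(q)

# Grubbs statistic for the largest value: its distance from the mean in standard deviations
G <- (max(q) - mean(q)) / sd(q)

# one-sided critical value at significance level alpha, based on the t distribution
alpha <- 0.05
t.crit <- qt(alpha / n, df = n - 2, lower.tail = FALSE)
G.crit <- ((n - 1) / sqrt(n)) * sqrt(t.crit^2 / (n - 2 + t.crit^2))

# TRUE means we reject the null hypothesis that the maximum is not an outlier
G > G.crit
```

In other words, G simply measures how many standard deviations the largest value sits above the mean; the test asks whether that is further than we would expect from the maximum of n normal observations.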
This is all well and good, but what happens when you are faced with multivariate data, perhaps with a high number of dimensions? An extreme value for one dimension may not appear extreme for the rest, making it hard to systematically identify problem values.
The multivariate case
The local outlier factor algorithm (LOF), proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000, provides a general mechanism to identify extreme values in multivariate data. Not only does LOF allow us to identify these values, it also gives the degree to which a given point is considered ‘outlying.’ In simple terms, LOF looks at a point in relation to its nearest neighbours in the data cloud and determines its degree of local isolation. If the point is very isolated, it is given a high score. An implementation of the LOF algorithm, lofactor(data,k), can be found in the DMwR package. The argument k specifies the number of neighbours considered when assessing each point. The example below demonstrates the system:
Vectors x, y and z each represent 1000 data points generated from separate standard normal distributions. Five ‘extreme’ points are then added to each vector. The effect is a central data cloud with several extreme satellite points. We wish to use the LOF algorithm to flag these satellite points.
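To make the ‘local isolation’ idea concrete, here is a from-scratch sketch of the LOF calculation in base R. It is an illustration under simplifying assumptions, not the DMwR implementation: lof.sketch is a made-up name, ties in the k-th neighbour distance are ignored, and duplicated points (which would give infinite densities) are not handled.

```r
lof.sketch <- function(data, k) {
  d <- as.matrix(dist(data))    # pairwise Euclidean distances
  n <- nrow(d)
  diag(d) <- Inf                # a point is not its own neighbour

  # indices of the k nearest neighbours of each point (one row per point)
  nn <- matrix(0L, nrow = n, ncol = k)
  for (i in 1:n) nn[i, ] <- order(d[i, ])[1:k]

  # k-distance: distance from each point to its k-th nearest neighbour
  k.dist <- sapply(1:n, function(i) d[i, nn[i, k]])

  # local reachability density: inverse of the mean reachability distance,
  # where reach-dist(a, b) = max(k-distance(b), d(a, b))
  lrd <- sapply(1:n, function(i) {
    reach <- pmax(k.dist[nn[i, ]], d[i, nn[i, ]])
    1 / mean(reach)
  })

  # LOF: average density of the neighbours relative to the point's own density
  sapply(1:n, function(i) mean(lrd[nn[i, ]]) / lrd[i])
}

# toy check: a 2D standard normal cloud with one far-away satellite at (20, 20)
set.seed(42)                    # arbitrary seed
cloud <- matrix(rnorm(200), ncol = 2)
demo <- rbind(cloud, c(20, 20))
demo.scores <- lof.sketch(demo, k = 6)
which.max(demo.scores)          # index of the most outlying point
```

Scores close to 1 mean a point is about as dense as its neighbours, while the planted satellite (row 101) should receive by far the largest score.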
x = rnorm(1000)
x = c(x,sample(x,size=5)*5)
y = rnorm(1000)
y = c(y,sample(y,size=5)*5)
z = rnorm(1000)
z = c(z,sample(z,size=5)*5)
# dat is a data frame comprised of the three vectors
dat = data.frame(x,y,z)
# visualising the data
pairs(dat)
Calling the LOF function scores each point in relation to its abnormality:
# loading the package that provides lofactor()
library(DMwR)
# let's use 6 neighbours
scores = lofactor(dat, k=6)
# viewing the scores for the first few points
head(scores)
[1] 1.0080632 1.0507945 1.1839767 0.9840658 1.0239369 1.1021009
Now let’s take the top 10 most extreme points:
# storing the ten most outlying points in a vector called 'top.ten'
top.ten = order(scores, decreasing=TRUE)[1:10]
top.ten
 [1] 1003 1004 1001 1005  883  785  361  589  130  283
Plotting the data with the 10 most extreme points coloured red:
# outliers will be coloured red, all other points will be black
colouring = rep(1,nrow(dat))
colouring[top.ten] = 2
# outliers will be crosses, all other points will be circles
symbol = rep(1,nrow(dat))
symbol[top.ten] = 4
# visualising the data
pairs(dat, col=colouring, pch=symbol)
As we can see, the algorithm does a pretty good job of identifying the satellites!
Viewing the data in 3 dimensions further confirms the result. The red points indicate our top 10 most extreme points as suggested by LOF:
# outliers will be coloured red, all other points will be grey
col = rep("#4c4646",nrow(dat))
col[top.ten] = "#b20000"
Of course you must decide on your value for k. Try a few different values and visualise the data until you are happy.
*Actually defining an outlier is a prickly subject so I have steered clear!