For most modelling exercises, extreme values pose a real problem. Whether these points are considered outliers* or influential points, it is good practice to identify, visualise and possibly exclude them for the sake of your model. Extreme values can really mess up your estimates and consequently your conclusions!
Of course in some fields, such as marketing or retail analytics, extreme values may represent a significant opportunity. Perhaps there are some customers that are simply off the charts in terms of purchase size and frequency. It would then be a good idea to identify them and do everything you can to retain them!
An extreme value may arise from a mistake in data collection, or it may have been legitimately generated by the statistical process that generated your data, however unlikely. The decision on whether a point is extreme is inherently subjective. Some people believe outliers should never be excluded unless the data point is surely a mistake; others seem to be far more ruthless. Luckily, there exists a set of tools to aid us in our decisions regarding extreme data.
The univariate case
In the univariate case, the problem of extreme values is relatively straightforward. Let’s generate a vector of data from a standard normal distribution and add to this vector one point that is five times the size of a randomly chosen point:
set.seed(8)
# 1000 random standard normal variates
q = rnorm(n=1000)
# adding an 'extreme value'
q = c(q, sample(q, size=1)*5)
# summary
summary(q)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-3.28200 -0.72470 -0.02786 -0.03383  0.66420  7.39500
# plotting
boxplot(q)
The most extreme data point is 7.39500 and it can be seen on the plot sitting far above the others. This point could be defined as an outlier.
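If you would rather list the flagged points than read them off the plot, base R's boxplot.stats() applies the same whisker rule as boxplot(). The snippet below is a minimal sketch on its own toy vector v (an assumption for illustration, not the q from above), with one extreme value planted by hand:

```r
# Toy data (assumed for illustration): 1000 standard normal draws
# plus one planted extreme value of 8
set.seed(1)
v <- c(rnorm(1000), 8)

# boxplot.stats() returns, in $out, every value lying beyond the whiskers
# (by default more than 1.5 * IQR away from the hinges)
flagged <- boxplot.stats(v)$out
flagged

# positions of the flagged values in the original vector
flagged_idx <- which(v %in% flagged)
flagged_idx
```

With 1000 normal draws, expect a handful of ordinary tail values to be flagged alongside the planted one; the 1.5 * IQR rule is a visual aid rather than a formal test.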
There are also a number of statistical tests that can be deployed to test for an outlying point in univariate data (assuming normality). One such test is the ‘Grubbs test for one outlier’ that can be found in the outliers package:
library(outliers)
# testing for one outlier
grubbs.test(q)

Grubbs test for one outlier

data: q
G = 7.08620, U = 0.94974, p-value = 3.593e-10
alternative hypothesis: highest value 7.39479015341698 is an outlier
Here we have strong evidence against the null hypothesis that the highest value in the data is not an outlier.
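To see where the G statistic comes from, it can be computed by hand: it is simply the distance of the largest value from the sample mean, measured in sample standard deviations. The sketch below uses its own toy vector v (an assumption, not the q from above) and the usual t-based approximation for the one-sided p-value; it is meant as an illustration and may not match grubbs.test() to the last digit.

```r
# Toy data (assumed for illustration): 1000 standard normal draws
# plus one planted extreme value of 8
set.seed(1)
v <- c(rnorm(1000), 8)
n <- length(v)

# Grubbs statistic for the highest value
G <- (max(v) - mean(v)) / sd(v)

# One-sided p-value via the standard t-distribution approximation:
# convert G to a t statistic with n - 2 degrees of freedom,
# then multiply the tail probability by n (capped at 1)
t_stat <- sqrt(n * (n - 2) * G^2 / ((n - 1)^2 - n * G^2))
p_val <- min(1, n * pt(t_stat, df = n - 2, lower.tail = FALSE))

G
p_val
```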
This is all well and good, but what happens when you are faced with multivariate data, perhaps with a high number of dimensions? An extreme value for one dimension may not appear extreme for the rest, making it hard to systematically identify problem values.
The multivariate case
The local outlier factor algorithm (LOF), proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000, provides a general mechanism to identify extreme values in multivariate data. Not only does LOF allow us to identify these values, it also gives the degree to which a given point is considered ‘outlying.’ In simple terms, LOF looks at a point in relation to its nearest neighbours in the data cloud and determines its degree of local isolation. If the point is very isolated, it is given a high score. An implementation of the LOF algorithm, lofactor(data,k), can be found in the DMwR package. The argument k specifies the number of neighbours considered when assessing each point. The example below demonstrates the approach:
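Before turning to that example, it may help to see roughly what the score computes. The following is a minimal base-R sketch of the LOF idea, not the DMwR implementation: the function name lof_sketch is hypothetical, it takes exactly k neighbours per point (ties in distance are broken by row order), and it assumes there are no duplicated points.

```r
# Hypothetical helper: a bare-bones LOF score for each row of `data`
lof_sketch <- function(data, k) {
  d <- as.matrix(dist(data))   # pairwise Euclidean distances
  n <- nrow(d)
  diag(d) <- Inf               # a point is not its own neighbour

  # indices of the k nearest neighbours, one column per point
  nn <- matrix(apply(d, 1, function(row) order(row)[1:k]), nrow = k)

  # k-distance: distance from each point to its k-th nearest neighbour
  kdist <- sapply(1:n, function(i) d[i, nn[k, i]])

  # local reachability density: inverse of the mean reachability distance,
  # where reach-dist(i, o) = max(k-distance of o, distance from i to o)
  lrd <- sapply(1:n, function(i) {
    o <- nn[, i]
    1 / mean(pmax(kdist[o], d[i, o]))
  })

  # LOF: average density of the neighbours relative to the point's own density
  sapply(1:n, function(i) mean(lrd[nn[, i]]) / lrd[i])
}

# Toy check: a regular 5 x 5 grid plus one planted far-away point (row 26)
pts <- rbind(expand.grid(x = 1:5, y = 1:5), data.frame(x = 20, y = 20))
sketch_scores <- lof_sketch(pts, k = 4)
round(sketch_scores, 2)
```

Points sitting inside the grid should score close to 1, since their neighbourhoods are about as dense as those of their neighbours, while the planted point should receive by far the largest score.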
Vectors x, y and z each represent 1000 data points generated from separate standard normal distributions. Five ‘extreme’ values are then added to each vector. The effect is a central data cloud with several extreme satellite points. We wish to use the LOF algorithm to flag these satellite points.
set.seed(20)
x = rnorm(1000)
x = c(x, sample(x, size=5)*5)
y = rnorm(1000)
y = c(y, sample(y, size=5)*5)
z = rnorm(1000)
z = c(z, sample(z, size=5)*5)
# dat is a data frame comprised of the three vectors
dat = data.frame(x, y, z)
# visualising the data
pairs(dat)
Calling the LOF function scores each point according to how abnormal it is relative to its neighbours:
library(DMwR)
# let's use 6 neighbours
scores = lofactor(dat, k=6)
# viewing the scores for the first few points
head(scores)
[1] 1.0080632 1.0507945 1.1839767 0.9840658 1.0239369 1.1021009
Now let's take the 10 most extreme points:
# storing the indices of the ten most outlying points in a vector called 'top.ten'
top.ten = order(scores, decreasing=TRUE)[1:10]
# printing
top.ten
[1] 1003 1004 1001 1005 883 785 361 589 130 283
Plotting the data with the 10 most extreme points coloured red:
# outliers will be coloured red, all other points will be black
colouring = rep(1, nrow(dat))
colouring[top.ten] = 2
# outliers will be crosses, all other points will be circles
symbol = rep(1, nrow(dat))
symbol[top.ten] = 4
# visualising the data
pairs(dat, col=colouring, pch=symbol)
As we can see, the algorithm does a pretty good job of identifying the satellites!
Viewing the data in 3 dimensions further confirms the result. The red points indicate our top 10 most extreme points as suggested by LOF:
Of course you must decide on your value for k. Try a few different values and visualise the data until you are happy.
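One way to make that trial and error a little more systematic is to count how often each point lands among the most extreme for several values of k. The sketch below is only an illustration: top_counts and knn_score are hypothetical helpers, and knn_score (mean distance to the k nearest neighbours) is a simple stand-in score so the snippet runs without extra packages. It is not LOF itself; any function taking (data, k) and returning one score per row, such as lofactor, should be usable in its place.

```r
# Hypothetical helper: for every k in `ks`, take the n highest-scoring rows
# and count how many times each row index appears overall
top_counts <- function(data, ks, score_fun, n = 10) {
  hits <- unlist(lapply(ks, function(k) {
    order(score_fun(data, k), decreasing = TRUE)[1:n]
  }))
  sort(table(hits), decreasing = TRUE)
}

# Stand-in score (an assumption, NOT LOF): mean distance to the k nearest neighbours
knn_score <- function(data, k) {
  d <- as.matrix(dist(data))
  diag(d) <- Inf
  apply(d, 1, function(row) mean(sort(row)[1:k]))
}

# Toy data: 200 points in a central cloud plus two planted satellites (rows 201, 202)
set.seed(20)
demo <- data.frame(x = c(rnorm(200), 8, -9), y = c(rnorm(200), 9, 8))

counts <- top_counts(demo, ks = c(4, 6, 8, 10), score_fun = knn_score)
head(counts)
```

Rows that appear in the top set for every k tried are the safest candidates to inspect first; rows that come and go as k changes deserve a closer look before any decision is made.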
*Actually defining an outlier is a prickly subject so I have steered clear!