Data Distance: Identifying Nearest Neighbours in R

Given a large multidimensional data-set, what is the closest point to any point you care to choose? Or, for that matter, what is the closest point to any location you care to specify? It is easy enough in one dimension: you can just look at the absolute values of the differences between your chosen point and the rest. But what about in 2, 3 or even 20 dimensions? We are talking about point distance.

Distance can be measured in a number of ways, but the go-to is Euclidean distance. Imagine taking a ruler to a set of points in 3D space: Euclidean distance is the distance you would measure. It can even be used in more than three dimensions, and it allows for the construction of distance matrices representing the separation between every pair of points in your data. The following example in R illustrates the concept.
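As a warm-up before the iris data, here is a minimal sketch of the calculation itself. The points a and b and the helper euclid() are made up for illustration and do not appear elsewhere in this post. The same one-liner works for any number of dimensions.

```r
# two hypothetical points in 3D space (illustrative values only)
a = c(1, 2, 3)
b = c(4, 6, 3)

# Euclidean distance: square root of the summed squared differences
euclid = function(p, q) sqrt(sum((p - q)^2))

euclid(a, b)
# [1] 5

# dist() should agree when the points are stacked as rows of a matrix
as.numeric(dist(rbind(a, b)))
# [1] 5
```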

Let’s slice up the classic Fisher’s iris data-set into a more manageable form consisting of the first 10 observations of the setosa species in 3 dimensions.

# required libraries
library(dplyr)  # filter(), select(), slice() and the %>% pipe

# loading the iris data set
data(iris)

# just setosa
iris.df = iris %>%
  filter(Species == 'setosa') %>%
  # just three variables
  select(-Petal.Width, -Species) %>%
  # just the first 10 observations
  slice(1:10)

# viewing
head(iris.df)

  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
4          4.6         3.1          1.5
5          5.0         3.6          1.4
6          5.4         3.9          1.7

dist() allows us to build a distance matrix (Euclidean by default). Smaller values indicate points that are closer together; every point is at distance zero from itself, which is why we see 0’s on the diagonal.

# generating distance matrix
distances.mat = as.matrix(dist(iris.df))

# viewing the first 8 columns
distances.mat[, 1:8]

           1         2         3         4         5         6         7         8
1  0.0000000 0.5385165 0.5099020 0.6480741 0.1414214 0.5830952 0.5099020 0.1732051
2  0.5385165 0.0000000 0.3000000 0.3316625 0.6082763 1.0723805 0.5000000 0.4242641
3  0.5099020 0.3000000 0.0000000 0.2449490 0.5099020 1.0677078 0.2449490 0.4123106
4  0.6480741 0.3316625 0.2449490 0.0000000 0.6480741 1.1489125 0.3162278 0.5000000
5  0.1414214 0.6082763 0.5099020 0.6480741 0.0000000 0.5830952 0.4472136 0.2236068
6  0.5830952 1.0723805 1.0677078 1.1489125 0.5830952 0.0000000 0.9899495 0.6708204
7  0.5099020 0.5000000 0.2449490 0.3162278 0.4472136 0.9899495 0.0000000 0.4123106
8  0.1732051 0.4242641 0.4123106 0.5000000 0.2236068 0.6708204 0.4123106 0.0000000
9  0.9219544 0.5099020 0.4358899 0.3000000 0.9219544 1.4456832 0.5385165 0.7874008
10 0.4582576 0.1414214 0.3000000 0.3000000 0.5196152 0.9643651 0.4358899 0.3162278

We can see from the matrix that 1 is closest to 5 (distance = 0.14), 2 is closest to 10 (distance = 0.14) and 7 is closest to 3 (distance = 0.24) etc. Using the ggplot2 package we can transform this matrix into a nice heat-map for easy visualisation.

Darker tiles indicate closer points.

# ggplot2 heat-map of the distance matrix
library(reshape2)  # melt()
library(ggplot2)   # qplot()
p = qplot(x=Var1, y=Var2, data=melt(distances.mat), fill=value, geom='tile')
# adding all the tick marks
p + scale_x_continuous(breaks=1:10) + 
    scale_y_continuous(breaks=1:10) + 
# hiding the axis labels
    xlab('') +
    ylab('')
Heat-map of the distance matrix
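
The nearest neighbours read off the matrix by eye can also be pulled out programmatically. The following is a sketch only: it rebuilds the same ten observations with base R so that it runs on its own, and the names pts, m and nearest are illustrative rather than taken from the original code.

```r
# same ten setosa observations, rebuilt with base R so the sketch stands alone
pts = head(iris[iris$Species == 'setosa',
                c('Sepal.Length', 'Sepal.Width', 'Petal.Length')], 10)

m = as.matrix(dist(pts))

# a point is always closest to itself, so mask the diagonal first
diag(m) = Inf

# for every row, the column holding the smallest remaining distance
nearest = apply(m, 1, which.min)

data.frame(point    = 1:nrow(m),
           nearest  = unname(nearest),
           distance = round(apply(m, 1, min), 2))
```

Rows 1, 2 and 7 should report neighbours 5, 10 and 3, matching the values read from the matrix. Where two neighbours tie, which.min() returns the first of them.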

Viewing the data in 3D using the scatter3d() function in the car package further confirms the results we see in the distance matrix.

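library(car)
# opening of the call reconstructed; the x, y, z mapping is assumed from the axis labels
scatter3d(x = iris.df$Sepal.Length, y = iris.df$Sepal.Width, z = iris.df$Petal.Length,
          xlab = 'Sepal Length', 
          ylab = 'Sepal Width',
          zlab = 'Petal Length',
          axis.col = rep('blue', 3))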

3D scatter (angle 1)

3D scatter (angle 2)
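
The opening question also asked about the closest point to an arbitrary location, not just to an existing observation. A minimal sketch, assuming a made-up location loc (not part of the data) and the same ten observations rebuilt with base R:

```r
# same ten setosa observations, rebuilt with base R so the sketch stands alone
pts = head(iris[iris$Species == 'setosa',
                c('Sepal.Length', 'Sepal.Width', 'Petal.Length')], 10)

# a hypothetical location in (Sepal.Length, Sepal.Width, Petal.Length) space
loc = c(5.3, 3.8, 1.6)

# subtract loc from every row, then take the Euclidean length of each difference
d = sqrt(rowSums(sweep(as.matrix(pts), 2, loc)^2))

# index of the closest observation and its distance
which.min(d)
min(d)
```

With these values the closest observation should be number 6, at a distance of roughly 0.17.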



Approaches to Extremes: Visualisation, Tests and The Local Outlier Factor

For most modelling exercises, extreme values pose a real problem. Whether these points are considered outliers* or influential points, it is good practice to identify, visualise and possibly exclude them for the sake of your model. Extreme values can really mess up your estimates and consequently your conclusions!

Of course in some fields, such as marketing or retail analytics, extreme values may represent a significant opportunity. Perhaps there are some customers that are simply off the charts in terms of purchase size and frequency. It would then be a good idea to identify them and do everything you can to retain them!

An extreme value may arise from a mistake in data collection, or it could have been legitimately generated by the statistical process that generated your data, however unlikely. The decision on whether a point is extreme is inherently subjective. Some people believe outliers should never be excluded unless the data point is surely a mistake; others seem to be far more ruthless. Luckily, there exists a set of tools to aid us in our decisions regarding extreme data.

The univariate case

In the univariate case, the problem of extreme values is relatively straightforward. Let’s generate a vector of data from a standard normal distribution and add to this vector one point that is five times the size of a point randomly chosen:


# 1000 random standard normal variates
q = rnorm(n=1000)

# adding an 'extreme value'
q = c(q,sample(q,size=1)*5)

# summary
summary(q)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.28200 -0.72470 -0.02786 -0.03383  0.66420  7.39500 

# plotting
plot(q)

1000 standard normal values plus one ‘extreme’ value

The most extreme data point is 7.39500 and it can be seen on the plot sitting far above the others. This point could be defined as an outlier.

There are also a number of statistical tests that can be deployed to test for an outlying point in univariate data (assuming normality). One such test is the ‘Grubbs test for one outlier’ that can be found in the outliers package:


# testing for one outlier
library(outliers)
grubbs.test(q)

	Grubbs test for one outlier

data:  q
G = 7.08620, U = 0.94974, p-value = 3.593e-10
alternative hypothesis: highest value 7.39479015341698 is an outlier

Here we have strong evidence against the null hypothesis that the highest value in the data is not an outlier.
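
The G statistic reported above is simply the gap between the most extreme value and the sample mean, measured in sample standard deviations. The sketch below recomputes it by hand. The seed and the fixed extreme value of 7.4 are assumptions made so the example is reproducible, so the numbers will not match the output above exactly.

```r
# reproducible stand-in for the data above (seed and extreme value are assumptions)
set.seed(1)
q = c(rnorm(n=1000), 7.4)

# Grubbs statistic for the highest value
G = (max(q) - mean(q)) / sd(q)
G

# position of the suspect point in the vector
which.max(q)
```

If the outliers package is installed, grubbs.test(q) should report the same G for this vector.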

This is all well and good, but what happens when you are faced with multivariate data, perhaps with a high number of dimensions? An extreme value for one dimension may not appear extreme for the rest, making it hard to systematically identify problem values.

The multivariate case

The local outlier factor algorithm (LOF), proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000, provides a general mechanism to identify extreme values in multivariate data. Not only does LOF allow us to identify these values, it also gives the degree to which a given point is considered ‘outlying.’ In simple terms, the LOF looks at a point in relation to its nearest neighbours in the data cloud and determines its degree of local isolation. If the point is very isolated it is given a high score. An implementation of the LOF algorithm, lofactor(data,k), can be found in the DMwR package. The argument k specifies the number of neighbours considered when assessing each point. The example below demonstrates the system:

Vectors x,y and z each represent 1000 data points generated from separate standard normal distributions. 5 ‘extreme’ points are then added to each vector. The effect is a central data cloud with several extreme satellite points. We wish to use the LOF algorithm to flag these satellite points.


x = rnorm(1000)
x = c(x,sample(x,size=5)*5)

y = rnorm(1000)
y = c(y,sample(y,size=5)*5)

z = rnorm(1000)
z = c(z,sample(z,size=5)*5)

# dat is a data frame comprised of the three vectors
dat = data.frame(x,y,z)

# visualising the data
pairs(dat)

Scatter plot matrix revealing a central data cloud and a number of satellite points

Calling the LOF function scores each point in relation to its abnormality:


# let's use 6 neighbours
library(DMwR)
scores = lofactor(dat, k=6)

# viewing the scores for the first few points
head(scores)

[1] 1.0080632 1.0507945 1.1839767 0.9840658 1.0239369 1.1021009
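
To make these scores less of a black box, here is a rough base R sketch of the calculation described by Breunig et al.: find each point's k nearest neighbours, estimate a local density from the reachability distances, then compare each point's density with that of its neighbours. The function lof.sketch() and the toy grid data are illustrative assumptions, not the DMwR implementation. The sketch ignores efficiency and can return infinite or undefined scores when many points are exact duplicates. Scores near 1 indicate a point about as dense as its neighbours, while much larger scores indicate isolation.

```r
# a rough sketch of LOF; lof.sketch() is a hypothetical helper, not lofactor()
lof.sketch = function(data, k) {
  d = as.matrix(dist(data))
  diag(d) = Inf                       # a point is not its own neighbour
  n = nrow(d)

  # k-distance: distance from each point to its k-th nearest neighbour
  k.dist = apply(d, 1, function(row) sort(row)[k])

  # neighbourhood: every point within the k-distance (ties included)
  nbrs = lapply(1:n, function(i) which(d[i, ] <= k.dist[i]))

  # local reachability density: inverse of the mean reachability distance,
  # where reach-dist(p, o) = max(k-distance(o), d(p, o))
  lrd = sapply(1:n, function(i) {
    o = nbrs[[i]]
    1 / mean(pmax(k.dist[o], d[i, o]))
  })

  # LOF: average density of the neighbours relative to the point's own density
  sapply(1:n, function(i) mean(lrd[nbrs[[i]]]) / lrd[i])
}

# toy data: a 5 x 5 grid of points plus one far-away satellite (row 26)
toy = rbind(as.matrix(expand.grid(x=1:5, y=1:5)), c(20, 20))

toy.scores = lof.sketch(toy, k=4)
which.max(toy.scores)       # the satellite should come out on top
round(toy.scores[26], 1)
```

The grid points sit in evenly dense surroundings and should score close to 1, whereas the satellite's neighbours are all far denser than it is, which pushes its score well above 1.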

Now let’s take the 10 most extreme points:

# storing the indices of the ten most outlying points in a vector called 'top.ten'
top.ten = order(scores, decreasing=TRUE)[1:10]

# printing
print(top.ten)

[1] 1003 1004 1001 1005  883  785  361  589  130  283

Plotting the data with the 10 most extreme points coloured red:

# outliers will be coloured red, all other points will be black
colouring = rep(1,nrow(dat))
colouring[top.ten] = 2

# outliers will be crosses, all other points will be circles
symbol = rep(1,nrow(dat))
symbol[top.ten] = 4

# visualising the data
pairs(dat, col=colouring, pch=symbol)

Red crosses are deemed to be local outliers

As we can see, the algorithm does a pretty good job of identifying the satellites!

Viewing the data in 3 dimensions further confirms the result. The red points indicate our top 10 most extreme points as suggested by LOF:


# outliers will be coloured red, all other points will be grey
col = rep("#4c4646",nrow(dat))
col[top.ten] = "#b20000"

# javascript 3d scatter plot
3D visualisation (angle 1)
3D visualisation (angle 2)
3D visualisation (angle 3)

Of course you must decide on your value for k. Try a few different values and visualise the data until you are happy.

*Actually defining an outlier is a prickly subject so I have steered clear!