Data Distance: Identifying Nearest Neighbours in R

Given a large multidimensional data-set, what is the closest point to any point you care to choose? Or, for that matter, what is the closest point to any location you care to specify? This is easy enough in one dimension: you can just look at the absolute values of the differences between your chosen point and the rest. But what about in 2, 3 or even 20 dimensions? We are talking about point distance.
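As a minimal sketch of the one-dimensional case, the snippet below finds the value closest to a chosen target by taking absolute differences. The vector x and the value target are made-up illustration data, not part of the iris example.

```r
# hypothetical one-dimensional data and query value (illustration only)
x = c(2.3, 7.1, 4.6, 9.0, 5.3)
target = 5.0

# absolute difference between the query and every point
diffs = abs(x - target)

# index and value of the closest point
nearest.idx = which.min(diffs)
nearest.val = x[nearest.idx]
nearest.val
```

which.min() returns only the first minimum, so exact ties would need separate handling.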

Distance can be measured in a number of ways, but the go-to is Euclidean distance. Imagine taking a ruler to a set of points in 3D space: Euclidean distance is the distance you would measure. It also generalises to more than three dimensions and allows for the construction of distance matrices representing the separation between every pair of points in your data. The following example in R illustrates the concept.

Let’s slice up the classic Fisher’s iris data-set into a more manageable form consisting of the first 10 observations of the setosa species in 3 dimensions.

# required libraries
library(dplyr)
library(ggplot2)
library(reshape2)
library(car)
library(rgl)

# loading the iris data set
data(iris)

# just setosa
iris.df = iris %>%
  filter(Species == 'setosa') %>%
  # just three variables
  select(-Petal.Width, -Species) %>%
  # just the first 10 observations
  slice(1:10)

# viewing
head(iris.df)

  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
4          4.6         3.1          1.5
5          5.0         3.6          1.4
6          5.4         3.9          1.7

dist() allows us to build a distance matrix; by default it uses Euclidean distance. Smaller values indicate points that are closer together, and each point is at distance zero from itself, which is why we see 0’s on the diagonal.

# generating distance matrix
distances.mat = as.matrix(dist(iris.df))

# viewing the first 8 columns
distances.mat[,1:8]

           1         2         3         4         5         6         7         8
1  0.0000000 0.5385165 0.5099020 0.6480741 0.1414214 0.5830952 0.5099020 0.1732051
2  0.5385165 0.0000000 0.3000000 0.3316625 0.6082763 1.0723805 0.5000000 0.4242641
3  0.5099020 0.3000000 0.0000000 0.2449490 0.5099020 1.0677078 0.2449490 0.4123106
4  0.6480741 0.3316625 0.2449490 0.0000000 0.6480741 1.1489125 0.3162278 0.5000000
5  0.1414214 0.6082763 0.5099020 0.6480741 0.0000000 0.5830952 0.4472136 0.2236068
6  0.5830952 1.0723805 1.0677078 1.1489125 0.5830952 0.0000000 0.9899495 0.6708204
7  0.5099020 0.5000000 0.2449490 0.3162278 0.4472136 0.9899495 0.0000000 0.4123106
8  0.1732051 0.4242641 0.4123106 0.5000000 0.2236068 0.6708204 0.4123106 0.0000000
9  0.9219544 0.5099020 0.4358899 0.3000000 0.9219544 1.4456832 0.5385165 0.7874008
10 0.4582576 0.1414214 0.3000000 0.3000000 0.5196152 0.9643651 0.4358899 0.3162278
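
As a quick check on what dist() is doing, one entry of the matrix can be reproduced by hand. The sketch below rebuilds the same ten rows with base R and compares a hand-rolled Euclidean distance against the dist() result for observations 1 and 5. The helper name euclid and the object pts are introduced here purely for illustration.

```r
# same 10 setosa rows and 3 variables as above, rebuilt with base R
data(iris)
pts = as.matrix(iris[1:10, c('Sepal.Length', 'Sepal.Width', 'Petal.Length')])

# Euclidean distance: square root of the summed squared differences
# (euclid is a hypothetical helper, not a base R function)
euclid = function(a, b) sqrt(sum((a - b)^2))

# distance between observations 1 and 5, computed by hand
d15 = euclid(pts[1, ], pts[5, ])

# the same entry taken from the dist() matrix
d15.dist = unname(as.matrix(dist(pts))[1, 5])

# TRUE if both routes agree (up to floating-point tolerance)
all.equal(d15, d15.dist)
```

The same function works unchanged for vectors of any length, which is what makes the approach carry over to 20 dimensions as easily as 3.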

We can see from the matrix that 1 is closest to 5 (distance = 0.14), 2 is closest to 10 (distance = 0.14), 7 is closest to 3 (distance = 0.24), and so on. Using the ggplot2 package we can transform this matrix into a nice heat-map for easy visualisation.

Darker tiles indicate closer points.

# ggplot2 heat-map of the distance matrix
# (ggplot() is used here because qplot() is deprecated in recent ggplot2 releases)
p = ggplot(melt(distances.mat), aes(x=Var1, y=Var2, fill=value)) +
    geom_tile()
# adding all the tick marks and hiding the axis labels
p + scale_x_continuous(breaks=1:10) +
    scale_y_continuous(breaks=1:10) +
    xlab('') +
    ylab('')

[Figure: Heat-map of the distance matrix]
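
The nearest neighbours read off the matrix by eye can also be extracted programmatically. The following is a minimal sketch, not the only way to do it: it sets the diagonal to Inf so that a point cannot be its own neighbour, then takes the row-wise minimum. The object names pts, d, nearest and nearest.dist are assumptions made for this illustration.

```r
# same 10 setosa rows and 3 variables as above, rebuilt with base R
data(iris)
pts = iris[1:10, c('Sepal.Length', 'Sepal.Width', 'Petal.Length')]

# full distance matrix
d = as.matrix(dist(pts))

# a point should not count as its own nearest neighbour
diag(d) = Inf

# for each row: index of the closest other point and the distance to it
nearest = apply(d, 1, which.min)
nearest.dist = apply(d, 1, min)

# one row per point
data.frame(point = 1:10, nearest = nearest, distance = round(nearest.dist, 2))
```

Because which.min() reports only the first minimum, ties or near-ties need extra care: the matrix shows point 3 at 0.2449490 from both point 4 and point 7, for instance.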

Viewing the data in 3D using the scatter3d() function in the car package confirms the results we see in the distance matrix.

scatter3d(x=iris.df$Sepal.Length,
          y=iris.df$Sepal.Width,
          z=iris.df$Petal.Length,
          surface=FALSE,
          point.col='#003300',
          # label every point; newer car releases may expect id=list(n=nrow(iris.df)) instead
          id.n=nrow(iris.df),
          xlab = 'Sepal Length',
          ylab = 'Sepal Width',
          zlab = 'Petal Length',
          axis.col = rep('blue', 3)
)

[Figure: 3D scatter (angle 1)]

[Figure: 3D scatter (angle 2)]
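
The opening question also asked about the closest point to an arbitrary location, not just to an existing observation. A minimal sketch of that case follows, assuming a made-up query location (the values in query are hypothetical and chosen only for illustration). It subtracts the query from every row and takes the row-wise Euclidean norm.

```r
# same 10 setosa rows and 3 variables as above, rebuilt with base R
data(iris)
pts = as.matrix(iris[1:10, c('Sepal.Length', 'Sepal.Width', 'Petal.Length')])

# hypothetical query location: (Sepal.Length, Sepal.Width, Petal.Length)
query = c(5.0, 3.3, 1.5)

# subtract the query from every row (sweep() subtracts by default)
diffs = sweep(pts, 2, query)

# Euclidean distance from the query to each observation
query.dist = sqrt(rowSums(diffs^2))

# closest observation and its coordinates
nearest.idx = which.min(query.dist)
pts[nearest.idx, ]
```

Nothing here depends on there being three columns, so the same lines apply to data with more dimensions.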

References:
https://en.wikipedia.org/wiki/Euclidean_distance
http://www.sthda.com/english/wiki/amazing-interactive-3d-scatter-plots-r-software-and-data-visualization
