Given a large multidimensional data-set, what is the closest point to any point you care to choose? Or, for that matter, what is the closest point to any location you care to specify? Easy enough in one dimension: just look at the absolute differences between your chosen point and the rest. But what about in 2, 3 or even 20 dimensions? We are talking about point *distance*.

Distance can be measured in a number of ways, but the go-to is *Euclidean distance*. Imagine taking a ruler to a set of points in 3D space: Euclidean distance is the distance you would measure. It generalises beyond three dimensions and allows for the construction of distance matrices representing point separation in your data. The following example in R illustrates the concept.
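For two points, Euclidean distance is just the square root of the summed squared coordinate differences. A minimal sketch in base R, using two 3D points, checked against *dist()*:

```r
# two points in 3D space
a <- c(5.1, 3.5, 1.4)
b <- c(4.9, 3.0, 1.4)

# square root of the sum of squared coordinate differences
euclid <- sqrt(sum((a - b)^2))
euclid
# [1] 0.5385165

# dist() gives the same answer
dist(rbind(a, b))
```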

Let's slice the classic Fisher's iris data-set into a more manageable form consisting of the first 10 observations of the *setosa* species in 3 dimensions (sepal length, sepal width and petal length).

```r
# required libraries
library(dplyr)
library(ggplot2)
library(reshape2)
library(car)
library(rgl)

# loading the iris data set
data(iris)

iris.df = (iris
           # just setosa
           %>% filter(Species == 'setosa')
           # just three variables
           %>% select(-Petal.Width, -Species)
           # just the first 10 observations
           %>% slice(1:10))

# viewing
head(iris.df)
```

```
  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
4          4.6         3.1          1.5
5          5.0         3.6          1.4
6          5.4         3.9          1.7
```

*dist()* allows us to build a distance matrix. Smaller values indicate points that are closer together, which is why we see zeros on the diagonal: every point is at distance 0 from itself.
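*dist()* computes Euclidean distance by default, but its `method` argument offers alternatives such as Manhattan distance. A quick sketch on two toy points:

```r
# two 2D points forming a 3-4-5 right triangle
pts <- rbind(c(0, 0), c(3, 4))

# Euclidean: sqrt(3^2 + 4^2) = 5 (the default method)
dist(pts)

# Manhattan: |3| + |4| = 7
dist(pts, method = "manhattan")
```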

```r
# generating distance matrix
distances.mat = as.matrix(dist(iris.df))

# viewing the first 8 columns
distances.mat[, 1:8]
```

```
           1         2         3         4         5         6         7         8
1  0.0000000 0.5385165 0.5099020 0.6480741 0.1414214 0.5830952 0.5099020 0.1732051
2  0.5385165 0.0000000 0.3000000 0.3316625 0.6082763 1.0723805 0.5000000 0.4242641
3  0.5099020 0.3000000 0.0000000 0.2449490 0.5099020 1.0677078 0.2449490 0.4123106
4  0.6480741 0.3316625 0.2449490 0.0000000 0.6480741 1.1489125 0.3162278 0.5000000
5  0.1414214 0.6082763 0.5099020 0.6480741 0.0000000 0.5830952 0.4472136 0.2236068
6  0.5830952 1.0723805 1.0677078 1.1489125 0.5830952 0.0000000 0.9899495 0.6708204
7  0.5099020 0.5000000 0.2449490 0.3162278 0.4472136 0.9899495 0.0000000 0.4123106
8  0.1732051 0.4242641 0.4123106 0.5000000 0.2236068 0.6708204 0.4123106 0.0000000
9  0.9219544 0.5099020 0.4358899 0.3000000 0.9219544 1.4456832 0.5385165 0.7874008
10 0.4582576 0.1414214 0.3000000 0.3000000 0.5196152 0.9643651 0.4358899 0.3162278
```

We can see from the matrix that 1 is closest to 5 (distance = 0.14), 2 is closest to 10 (distance = 0.14), 7 is closest to 3 (distance = 0.24), and so on. Using the *ggplot2* package we can transform this matrix into a nice heat-map for easy visualisation.
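Reading nearest neighbours off the matrix by eye gets tedious. One way to automate it is to blank out the diagonal zeros (which would otherwise always "win") and take `which.min()` per row. A sketch in base R, rebuilding the sliced data so the snippet stands alone:

```r
# first 10 setosa observations, three variables (rebuilt without dplyr)
iris.df <- iris[iris$Species == 'setosa',
                c('Sepal.Length', 'Sepal.Width', 'Petal.Length')][1:10, ]
distances.mat <- as.matrix(dist(iris.df))

# the zero diagonal would always be the minimum, so exclude it
diag(distances.mat) <- Inf

# index of the nearest neighbour of each observation
nearest <- apply(distances.mat, 1, which.min)
nearest[1]  # observation 1's nearest neighbour is observation 5
```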

Darker tiles indicate closer points.

```r
# ggplot2 heat-map of the distance matrix
p = qplot(x = Var1, y = Var2, data = melt(distances.mat),
          fill = value, geom = 'tile')

# adding all the tick marks
p + scale_x_continuous(breaks = 1:10) +
    scale_y_continuous(breaks = 1:10) +
    # hiding the axis labels
    xlab('') + ylab('')
```

Viewing the data in 3D using the *scatter3d()* function in the *car* package further confirms the results we see in the distance matrix (click image to zoom).

```r
# interactive 3D scatter plot, labelling every point
scatter3d(x = iris.df$Sepal.Length,
          y = iris.df$Sepal.Width,
          z = iris.df$Petal.Length,
          surface = FALSE,
          point.col = '#003300',
          id.n = nrow(iris.df),
          xlab = 'Sepal Length',
          ylab = 'Sepal Width',
          zlab = 'Petal Length',
          axis.col = rep('blue', 3))
```
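The same machinery answers the opening question about an arbitrary location: compute the distance from a chosen query point to every observation and take the minimum. A base-R sketch; the query coordinates here are invented for illustration:

```r
# first 10 setosa observations, three variables (rebuilt without dplyr)
iris.df <- iris[iris$Species == 'setosa',
                c('Sepal.Length', 'Sepal.Width', 'Petal.Length')][1:10, ]

# an arbitrary location in the same 3D space (hypothetical query point)
query <- c(5.0, 3.5, 1.5)

# Euclidean distance from the query to every observation:
# t(iris.df) is 3 x 10, so subtracting the length-3 query
# recycles it down each column (one column per observation)
d <- sqrt(colSums((t(iris.df) - query)^2))

which.min(d)  # the closest observation
min(d)        # and its distance
```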

References:

- https://en.wikipedia.org/wiki/Euclidean_distance
- http://www.sthda.com/english/wiki/amazing-interactive-3d-scatter-plots-r-software-and-data-visualization