New Zealand is culturally diverse. Even at a regional level, there are big differences in ethnic composition… and with an increasingly inter-connected world, ethnic composition is expected to change substantially in the future, particularly in Auckland.
A common competition run by vendors of fishing equipment is a ‘guess the weight and win’ where an image of someone holding a fish is posted and it is up to you to guess it’s weight with the closest guess winning a prize.
The ‘law of large numbers’ implies that the average of the guesses of many is superior to the average of the guesses of a few, so the ‘best guess’ should be close to the average of all guesses…
Motivated by the possibility of winning some fishing tackle I set about messing about with R’s regular expressions to create a tool that would enable me to make an informed guess based on the guesses of many.
The function below reads in a text file containing each persons guess (provided via a comment), extracts and cleans the guesses, transforms the guesses into a common unit (kilograms) and provides summary statistics and a histogram that would suggest the best guess you could make. Of course this function could be adapted to suit a ‘how many jelly beans in the jar?’ competition also!
Here is the output of one such competition:
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.00 12.50 17.00 17.35 19.90 85.00
In this case, I would guess the weight of the fish to be around 17 kilograms!
Data frame objects facilitate most data analysis exercises in both R and Python (perhaps with the exception of time series analysis, where the focus is on R time series and Pandas series objects). Data frames are a tidy and meaningful way to store data.
This post will display exactly the same workflow in both languages. I will run though the Python code first, and you can find an equivalent R script presented at the end.
If you are an R user and have been tempted to explore the exciting world of Python one of the first things you will notice is the similarity of syntax. This should make it easy to pick up the basics. However, there are some key differences between the two. A good example is how to index the first observation in a set of data. R indexing starts at 1 while Python indexing starts at 0!
What are the most common words in New Zealand road names? Are there any common themes?
Thankfully, New Zealand’s 73,906 current road names have been made available through the LINZ Data Service. To answer the questions above, we can use R’s tm package to conduct basic text mining.
The process is simple*. Text is cleansed of any punctuation, extra white-space, redundant or uninteresting words before being fed into wordcloud(). The 60 most common words are then displayed with size proportional to frequency of occurrence.
Can we see any common themes? Yes, namely:
1. Royalty and famous Britons: George, King, Victoria, Queen, Elizabeth, Albert, Nelson.
2. Early New Zealanders: Campbell, Russel, Grey, Scott.
3. Native trees: Kowhai, Totara, Rata, Rimu, Matai, Kauri, Miro.
Set operations are super useful when data cleaning or testing scripts. They are a must have in any analyst’s (data scientist’s/statistician’s/data wizard’s) toolbox. Here is a quick rundown in both R and python.
Say we have two vectors x and y…
# vector x
x = c(1,2,3,4,5,6)
# vector y
y = c(4,5,6,7,8,9)
What if we ‘combined’ x and y ignoring any duplicate elements? ()
# x UNION y
 1 2 3 4 5 6 7 8 9
What are the common elements in x and y? ()
# x INTERSECTION y
 4 5 6
What elements feature in x but not in y?
# x members not in y
 1 2 3
What elements feature in y but not in x?
# y members not in x
 7 8 9
How might we visualise all this?
What about python? In standard python there exists a module called ‘sets’ that allows for the creation of a ‘Set’ object from a python list. The Set object has methods that provide the same functionality as the R functions above.
Given a large multidimensional data-set, what is the closest point to any point you care to choose? Or, for that matter, what is the closest point to any location you care to specify? Easy enough in one dimension, you can just look at the absolute values of the differences between your chosen point and the rest. But what about in 2, 3 or even 20 dimensions? We are talking about point distance.
Distance can be measured a number of ways but the go-to is Euclidean distance. Imagine taking a ruler to a a set of points in 3D space. Euclidean distance would be the distance that you measure. Euclidean distance can even be used in more than three dimensions and allows for the construction of distance matrices representing point separation in your data. The following example in R illustrates the concept.
Lets slice up the classic Fisher’s iris data-set into a more manageable form consisting of the first 10 observations of the setosa species in 3 dimensions.
# required libraries
# loading the iris data set
# just setosa
iris.df = (iris %>% filter(Species=='setosa')
# just three variables
# just the first 10 observations
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
6 5.4 3.9 1.7
dist() allows us to build a distance matrix. Smaller values indicate points that are closer together and that is why we see 0’s on the diagonal.
We can see from the matrix that 1 is closest to 5 (distance = 0.14), 2 is closest to 10 (distance = 0.14) and 7 is closest to 3 (distance = 0.24) etc. Using the ggplot2 package we can transform this matrix into a nice heat-map for easy visualisation.
Darker tiles indicate closer points.
# ggplot2 heat-map of the distance matrix
p = qplot(x=Var1, y=Var2, data=melt(distances.mat), fill=value, geom='tile')
# adding all the tick marks
p + scale_x_continuous(breaks=1:10) +
# hiding the axis labels
Viewing the data in 3D using the scatter3d() function in the car package further confirms the results we see in the distance matrix (click image to zoom).