Trying to Win with R

A common competition run by vendors of fishing equipment is a ‘guess the weight and win’ where an image of someone holding a fish is posted and it is up to you to guess it’s weight with the closest guess winning a prize.

The ‘law of large numbers’ implies that the average of the guesses of many is superior to the average of the guesses of a few, so the ‘best guess’ should be close to the average of all guesses…

Motivated by the possibility of winning some fishing tackle I set about messing about with R’s regular expressions to create a tool that would enable me to make an informed guess based on the guesses of many.

The function below reads in a text file containing each persons guess (provided via a comment), extracts and cleans the guesses, transforms the guesses into a common unit (kilograms) and provides summary statistics and a histogram that would suggest the best guess you could make. Of course this function could be adapted to suit a ‘how many jelly beans in the jar?’ competition also!

Here is the output of one such competition:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
4.00   12.50   17.00   17.35   19.90   85.00 


In this case, I would guess the weight of the fish to be around 17 kilograms!

Working with Data Frames in Python and R

Data Hipsters

Data frame objects facilitate most data analysis exercises in both R and Python (perhaps with the exception of time series analysis, where the focus is on R time series and Pandas series objects). Data frames are a tidy and meaningful way to store data.

This post will display exactly the same workflow in both languages. I will run though the Python code first, and you can find an equivalent R script presented at the end.

If you are an R user and have been tempted to explore the exciting world of Python one of the first things you will notice is the similarity of syntax. This should make it easy to pick up the basics. However, there are some key differences between the two. A good example is how to index the first observation in a set of data. R indexing starts at 1 while Python indexing starts at 0!

View original post 163 more words