Getting it Done with Python!

There is considerable excitement in the analytics community about the use of Python as a multi-purpose data science tool. And for good reason.

It’s fast.

It’s also well supported by a vast array of statistical and graphics libraries such as NumPy, statsmodels and matplotlib.

I won’t go into details here, but if you are interested in the ‘R versus Python’ discussion, check out this neat infographic by DataCamp.

In my last post I shared a simple example of data blending in R. Now you can see how it might be done in Python. Note the similarity in syntax and structure between the two languages, largely thanks to the Python data management library pandas (well worth a look!).


# Sales data for this year:

#### ------------------------ ####
#### -------- SQLite -------- ####
#### ------------------------ ####

# libraries
import pandas
import sqlite3 as lite

# connecting to database
sqliteConnection = lite.connect('.../sales.sqlite')

# sending the query and reading the results into a DataFrame
# (no separate execute() call is needed -- read_sql_query runs the query itself)
dataSqlite = pandas.read_sql_query(sql='SELECT * FROM sales', con=sqliteConnection)

# viewing data
dataSqlite

>>> dataSqlite
     name  salesTy
0   James     2000
1  Hannah     1500
2    Nick      500
3    Anne     3000
4  Hayley     2000
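One small caveat with the snippet above: the connection is never closed. Wrapping it in `contextlib.closing` handles that tidily. Here is a self-contained sketch using an in-memory database, with stand-in rows in place of the real sales.sqlite file:

```python
import sqlite3 as lite
from contextlib import closing

import pandas

# an in-memory stand-in for '.../sales.sqlite'
with closing(lite.connect(':memory:')) as sqliteConnection:
    sqliteConnection.execute('CREATE TABLE sales (name TEXT, salesTy INTEGER)')
    sqliteConnection.executemany('INSERT INTO sales VALUES (?, ?)',
                                 [('James', 2000), ('Hannah', 1500)])
    dataSqlite = pandas.read_sql_query('SELECT * FROM sales', con=sqliteConnection)

# the connection is closed by now, but the DataFrame lives on
print(dataSqlite.shape)  # (2, 2)
```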

# Sales data for last year:

#### ------------------------ ####
#### ------- MS Excel ------- ####
#### ------------------------ ####

# reading data into a DataFrame
dataExcel = pandas.read_excel('.../salesLy.xlsx', 'Sheet1', index_col=None)

# viewing data
dataExcel

>>> dataExcel
     name  salesLy
0   James     1500
1  Hannah      400
2    Nick      600
3    Anne     3500
4  Hayley     1000

# What are the area assignments?

#### ------------------------ ####
#### --------- .csv --------- ####
#### ------------------------ ####

# reading data into a DataFrame
dataCsv = pandas.read_csv('.../salesArea.csv')

# viewing data
dataCsv

>>> dataCsv
     name   area
0    Nick  South
1    Anne  South
2  Hayley  North
3   James   West
4  Hannah   East
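If you want to experiment with the CSV step without a file on disk, `read_csv` also accepts any file-like object. A minimal sketch, with inline text standing in for salesArea.csv:

```python
from io import StringIO

import pandas

# inline stand-in for the contents of '.../salesArea.csv'
csvText = StringIO('name,area\nNick,South\nAnne,South\nHayley,North\n')

dataCsv = pandas.read_csv(csvText)
print(dataCsv.shape)  # (3, 2)
```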

# Number of clients each employee looks after:

#### ------------------------ ####
#### --------- .txt --------- ####
#### ------------------------ ####

# reading data into a DataFrame
dataTxt = pandas.read_table('.../salesClients.txt')

# viewing data
dataTxt

>>> dataTxt
     name  clients
0   James        2
1  Hannah        4
2    Nick        3
3    Anne        5
4  Hayley        1

# Centralising the data:

#### ------------------------ ####
#### --------- MERGE -------- ####
#### ------------------------ ####

# placing all the previously created DataFrames in a list
dataSources = [dataSqlite, dataExcel, dataCsv, dataTxt]

# first join
final = pandas.merge(dataSqlite, dataExcel, how='outer', on='name')

# joining in the remaining data sources (the third and fourth DataFrames)
for i in range(2, 4):
    final = pandas.merge(final, dataSources[i], how='outer', on='name')

# viewing data
final

>>> final
     name  salesTy  salesLy   area  clients
0   James     2000     1500   West        2
1  Hannah     1500      400   East        4
2    Nick      500      600  South        3
3    Anne     3000     3500  South        5
4  Hayley     2000     1000  North        1
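Because every join shares the same key and join type, the first-join-plus-loop pattern can also be folded into a single `functools.reduce` call. A sketch with small stand-in DataFrames (the real ones come from the four file reads above):

```python
from functools import reduce

import pandas

# stand-in DataFrames mirroring the four sources
dataSqlite = pandas.DataFrame({'name': ['James', 'Hannah'], 'salesTy': [2000, 1500]})
dataExcel = pandas.DataFrame({'name': ['James', 'Hannah'], 'salesLy': [1500, 400]})
dataCsv = pandas.DataFrame({'name': ['James', 'Hannah'], 'area': ['West', 'East']})
dataTxt = pandas.DataFrame({'name': ['James', 'Hannah'], 'clients': [2, 4]})

dataSources = [dataSqlite, dataExcel, dataCsv, dataTxt]

# fold the list into a single DataFrame, outer-joining on 'name' at each step
final = reduce(lambda left, right: pandas.merge(left, right, how='outer', on='name'),
               dataSources)
print(list(final.columns))  # ['name', 'salesTy', 'salesLy', 'area', 'clients']
```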

# Assessing sales performance, this year versus last:

#### ------------------------ ####
#### ------ OPERATIONS ------ ####
#### ------------------------ ####

# what is the percentage change in sales for each sales person?
final['percentChange'] = ((final.salesTy-final.salesLy)/final.salesLy)*100
 
# creating the output DataFrame (a copy, so later edits don't alter final)
output = final.copy()
 
# viewing
output

>>> output
     name  salesTy  salesLy   area  clients  percentChange
0   James     2000     1500   West        2      33.333333
1  Hannah     1500      400   East        4     275.000000
2    Nick      500      600  South        3     -16.666667
3    Anne     3000     3500  South        5     -14.285714
4  Hayley     2000     1000  North        1     100.000000
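A natural follow-up is to rank the team on the new column. Assuming the same `output` DataFrame (rebuilt here from the table above so the snippet stands alone):

```python
import pandas

# the merged result from above, rebuilt so this snippet is self-contained
output = pandas.DataFrame({
    'name': ['James', 'Hannah', 'Nick', 'Anne', 'Hayley'],
    'salesTy': [2000, 1500, 500, 3000, 2000],
    'salesLy': [1500, 400, 600, 3500, 1000],
})
output['percentChange'] = ((output.salesTy - output.salesLy) / output.salesLy) * 100

# best growth first
ranked = output.sort_values('percentChange', ascending=False).reset_index(drop=True)
print(ranked['name'].tolist())  # ['Hannah', 'Hayley', 'James', 'Anne', 'Nick']
```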

I used Python 3.4.3 alongside Anaconda 2.2.0 (64-bit) and the Spyder IDE to produce this output.

Note: I had some trouble reading my Microsoft Access database so I have excluded it from this example.


Data Blending with R

Many businesses have a decentralised data architecture whereby data is stored in multiple locations, often in different file formats. For example, financial data may be stored in a MySQL database while employee data may be stored in a series of Excel files sitting in different folders.

Data blending is the process by which data is gathered from different sources and moulded into a uniform structure that allows for easy analysis. The results of the analysis can then be exported in whatever format one desires. R and its associated packages provide a perfect environment for this exercise. Consider the following fictitious example.

Hamish is a data analyst who has just landed a job at Firm A. Firm A has 5 employees working in 4 areas across the country to sell products to several important clients. Unfortunately, Firm A does not have the luxury of a centralised database for its sales and employee data. Sales for this year are stored in a SQLite database, sales for last year are stored in an Excel file, years of employment are stored in an Access database, area assignments can be found in a .csv file and someone has recorded the number of clients each employee looks after in a .txt file. What a headache!

Hamish has been told to centralise this data and assess sales performance, this year versus last. The following code solves his problem.

Sales data for this year:

#### ------------------------ ####
#### -------- SQLite -------- ####
#### ------------------------ ####

# required package
library(RSQLite)

# loading database driver
sqlite = dbDriver('SQLite')

# connecting to database
sqliteConnection = dbConnect(sqlite, dbname='.../sales.sqlite')

# sending query
query = dbSendQuery(sqliteConnection, statement='SELECT * FROM Sales')

# transforming results into a data frame
dataSqlite.df = fetch(query)

# viewing data
dataSqlite.df

 
> dataSqlite.df
    name salesTy
1  James    2000
2 Hannah    1500
3   Nick     500
4   Anne    3000
5 Hayley    2000

Sales data for last year:


#### ------------------------ ####
#### ------- MS Excel ------- ####
#### ------------------------ ####

# required package
library(readxl)

# reading data into a data frame
dataExcel.df = read_excel(path='.../salesLy.xlsx', sheet=1, col_names=TRUE)

# viewing data
dataExcel.df

  
> dataExcel.df
    name salesLy
1  James    1500
2 Hannah     400
3   Nick     600
4   Anne    3500
5 Hayley    1000

How long have the employees been employed?

#### ------------------------ ####
#### ------ MS Access ------- ####
#### ------------------------ ####

# required package
library(RODBC)

# connecting to database
accessConnection = odbcDriverConnect('Driver={Microsoft Access Driver (*.mdb,*.accdb)};DBQ=.../salesEmployed.accdb')

# sending query and transforming into data frame
dataAccess.df = sqlQuery(channel=accessConnection, query='SELECT * FROM Employed')

# renaming the key so that it is lower case
colnames(dataAccess.df)[1] = 'name'

# closing the connection
odbcClose(accessConnection)

# viewing data
dataAccess.df

    
> dataAccess.df
    name employed
1   Anne       10
2 Hannah        4
3 Hayley       20
4  James        6
5   Nick        3

What are the area assignments?


#### ------------------------ ####
#### --------- .csv --------- ####
#### ------------------------ ####

# reading data into a data frame
dataCsv.df = read.csv(file='.../salesArea.csv', header=TRUE)

# viewing data
dataCsv.df

    
> dataCsv.df
    name  area
1   Nick South
2   Anne South
3 Hayley North
4  James  West
5 Hannah  East

Number of clients each employee looks after:


#### ------------------------ ####
#### --------- .txt --------- ####
#### ------------------------ ####

# reading data into a data frame
dataTxt.df = read.table(file='.../salesClients.txt', header=TRUE)

# viewing data
dataTxt.df

    
> dataTxt.df
    name clients
1  James       2
2 Hannah       4
3   Nick       3
4   Anne       5
5 Hayley       1

Centralising the data:


#### ------------------------ ####
#### --------- MERGE -------- ####
#### ------------------------ ####

# required package
library(dplyr)

# placing all the previously created data frames in a list
dataSources = list(dataSqlite.df, dataExcel.df, dataAccess.df, dataCsv.df, dataTxt.df)

# first join
final.df = full_join(dataSqlite.df, dataExcel.df, by='name')

# joining in the remaining data frames (sources three to five)
for(i in 3:5){
  final.df = full_join(final.df, dataSources[[i]], by='name')
}

# viewing data
final.df

    
> final.df
    name salesTy salesLy employed  area clients
1  James    2000    1500        6  West       2
2 Hannah    1500     400        4  East       4
3   Nick     500     600        3 South       3
4   Anne    3000    3500       10 South       5
5 Hayley    2000    1000       20 North       1

Assessing sales performance, this year versus last:


#### ------------------------ ####
#### ------ OPERATIONS ------ ####
#### ------------------------ ####

# what is the percentage change in sales for each sales person?
output.df = mutate(final.df,percentChange=((salesTy-salesLy)/salesLy)*100)

# viewing
output.df
> output.df
    name salesTy salesLy employed  area clients percentChange
1  James    2000    1500        6  West       2      33.33333
2 Hannah    1500     400        4  East       4     275.00000
3   Nick     500     600        3 South       3     -16.66667
4   Anne    3000    3500       10 South       5     -14.28571
5 Hayley    2000    1000       20 North       1     100.00000

Clearly Hannah and Hayley are doing very well.

References:
https://nyquist212.wordpress.com/2014/08/15/example-to-connect-to-a-sqlite-database-using-r-programming-language-for-statistical-computing-r-project/
http://rprogramming.net/connect-to-ms-access-in-r/

Geographical Distances with R

If you are dealing with data that has a spatial component you may find yourself wanting to calculate the geographical distance (as the crow flies) between two points.

The first task is to source the latitude and longitude values for both points. These co-ordinates should be in the form of decimal degrees (DD). For example, the rough location of New Zealand’s Little Barrier Island in DD is (-36.200965, 175.080992).

If you don’t have co-ordinates on hand, you can easily get them from Google Maps by following these instructions.

If your co-ordinates take another form, perhaps degrees, minutes, and seconds (DMS) or degrees and decimal minutes (DMM), several converters exist online.

When you have your co-ordinates, you can calculate the distance between them by calling the distance functions contained in the R package geosphere. There are a handful of functions to choose from, each reflecting a different approach to distance calculation. For example, distHaversine() assumes the earth is spherical while distVincentyEllipsoid() assumes the earth is an ellipsoid (a great discussion on this can be found here). geosphere is a really interesting package; I would recommend having a read through the documentation as it is loaded with interesting functions.

By way of an example, let’s say you want to calculate the distance in metres between Mount Ruapehu and Mount Ngauruhoe, two volcanoes found in New Zealand’s Central Plateau.

# loading package
library(geosphere)

# mount ruapehu
ruapehu = c(175.568450,-39.281726)

# mount ngauruhoe
ngauruhoe = c(175.632122,-39.156874)

# distance between ruapehu and ngauruhoe in metres
distVincentyEllipsoid(p1=ruapehu, p2=ngauruhoe)
[1] 14911.83

Just divide the result by 1000 if you want kilometres.
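For intuition, the spherical calculation that distHaversine() performs is compact enough to sketch by hand. Here it is in Python with the same coordinates; the small gap to the ellipsoidal Vincenty figure above is expected, since the haversine formula treats the earth as a sphere:

```python
from math import asin, cos, radians, sin, sqrt

def haversine(p1, p2, radius=6371000.0):
    """Great-circle distance in metres between two (lon, lat) points in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (*p1, *p2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * asin(sqrt(a))

# same (lon, lat) pairs as the R example
ruapehu = (175.568450, -39.281726)
ngauruhoe = (175.632122, -39.156874)

d = haversine(ruapehu, ngauruhoe)
print(round(d))  # within a few tens of metres of the 14911.83 m Vincenty result
```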

As well as providing us with co-ordinates, Google Maps allows us to validate our results by measuring the distance between two points. Instructions for how to do this can be found here.