Data Blending with R

Many businesses have a decentralised data architecture whereby data is stored in multiple locations, often in different file formats. For example, financial data may be stored in a MySQL database while employee data may be stored in a series of excel files all sitting in different folders.

Data blending is the process by which data is gathered from different sources and moulded into a uniform structure that allows for easy analysis. The results of the analysis can then be exported in whatever format one desires. R and it’s associated packages provide a perfect environment for this exercise. Consider the following fictitious example.

Hamish is a data analyst who has just landed a job at Firm A. Firm A has 5 employees working in 4 areas across the country to sell products to several important clients. Unfortunately, firm A does not have the luxury of a centralised database to store it’s sales and employee data. Sales this year are stored in a SQLite database, sales last year are stored in a excel file, years of employment are stored in a access database, area assignments can be found in a .csv file and someone has recorded the number of clients each employee looks after in a .txt file. What a headache!

Hamish been told to centralise this data and assess sales performance, this year versus last. The following code solves his problem.

Sales data for this year:

#### ------------------------ ####
#### -------- SQLite -------- ####
#### ------------------------ ####

# required package
library(RSQLite)

# loading database driver
sqlite = dbDriver('SQLite')

# connecting to database
sqliteConnection = dbConnect(sqlite, dbname='.../sales.sqlite')

# sending query
query = dbSendQuery(sqliteConnection, statement='SELECT * FROM Sales')

# transforming results into a data frame
dataSqlite.df = fetch(query)

# viewing data
dataSqlite.df

 
> dataSqlite.df
    name salesTy
1  James    2000
2 Hannah    1500
3   Nick     500
4   Anne    3000
5 Hayley    2000

Sales data for last year:


#### ------------------------ ####
#### ------- MS Excel ------- ####
#### ------------------------ ####

# required package
library(readxl)

# reading data into a data frame
dataExcel.df = read_excel(path='.../salesLy.xlsx', sheet=1, col_names=TRUE)

# viewing data
dataExcel.df

  
> dataExcel.df
    name salesLy
1  James    1500
2 Hannah     400
3   Nick     600
4   Anne    3500
5 Hayley    1000

How long have the employees been employed?

#### ------------------------ ####
#### ------ MS Access ------- ####
#### ------------------------ ####

# required package
library(RODBC)

# connecting to database
accessConnection = odbcDriverConnect('Driver={Microsoft Access Driver (*.mdb,*.accdb)};DBQ=.../salesEmployed.accdb')

# sending query and transforming into data frame
dataAccess.df = sqlQuery(channel=accessConnection, query='SELECT * FROM Employed')

# renaming the key so that it is lower case
colnames(dataAccess.df)[1] = 'name'

# closing the connection
odbcClose(accessConnection)

# viewing data
dataAccess.df

    
> dataAccess.df
    name employed
1   Anne       10
2 Hannah        4
3 Hayley       20
4  James        6
5   Nick        3

What are the area assignments?


#### ------------------------ ####
#### --------- .csv --------- ####
#### ------------------------ ####

# reading data into a data frame
dataCsv.df = read.csv(file='.../salesArea.csv', header=TRUE)

# viewing data
dataCsv.df

    
> dataCsv.df
    name  area
1   Nick South
2   Anne South
3 Hayley North
4  James  West
5 Hannah  East

Number of clients each employee looks after:


#### ------------------------ ####
#### --------- .txt --------- ####
#### ------------------------ ####

# reading data into a data frame
dataTxt.df = read.table(file='.../salesClients.txt', header=TRUE)

# viewing data
dataTxt.df

    
> dataTxt.df
    name clients
1  James       2
2 Hannah       4
3   Nick       3
4   Anne       5
5 Hayley       1

Centralising the data:


#### ------------------------ ####
#### --------- MERGE -------- ####
#### ------------------------ ####

# required package
library(dplyr)

# placing all the previously created data frames in a list
dataSources = list(dataSqlite.df, dataExcel.df, dataAccess.df, dataCsv.df, dataTxt.df)

# first join
final.df = full_join(dataSqlite.df, dataExcel.df, by='name')

# third to fith join
for(i in 3:5){
  final.df = full_join(final.df, dataSources[[i]], by='name')
}

# viewing data
final.df

    
> final.df
    name salesTy salesLy employed  area clients
1  James    2000    1500        6  West       2
2 Hannah    1500     400        4  East       4
3   Nick     500     600        3 South       3
4   Anne    3000    3500       10 South       5
5 Hayley    2000    1000       20 North       1

Assessing sales performance, this year versus last:


#### ------------------------ ####
#### ------ OPERATIONS ------ ####
#### ------------------------ ####

# what is the percentage change in sales for each sales person?
output.df = mutate(final.df,percentChange=((salesTy-salesLy)/salesLy)*100)

# viewing
output.df
> output.df
    name salesTy salesLy employed  area clients percentChange
1  James    2000    1500        6  West       2      33.33333
2 Hannah    1500     400        4  East       4     275.00000
3   Nick     500     600        3 South       3     -16.66667
4   Anne    3000    3500       10 South       5     -14.28571
5 Hayley    2000    1000       20 North       1     100.00000

Clearly Hannah and Hayley are doing very well.

References:
https://nyquist212.wordpress.com/2014/08/15/example-to-connect-to-a-sqlite-database-using-r-programming-language-for-statistical-computing-r-project/
http://rprogramming.net/connect-to-ms-access-in-r/

Advertisements

One thought on “Data Blending with R

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s