Getting it Done with Python!

There is considerable excitement in the analytics community regarding the use of python as a multi-purpose data science tool. And for good reason.

It’s fast.

It’s also well supported by a vast array of statistical and graphics libraries such as Numpy, statsmodels and matplotlib.

I won’t go into details here but if you are interested in the ‘R versus python’ discussion check out this neat infographic by DataCamp.

In my last post I shared a simple example of data blending in R. Now you can see how it might be done in python. Note the similarity in syntax and structure between the two languages. This is largely thanks to the python data management library pandas (well worth a look!).


# Sales data for this year:

#### ------------------------ ####
#### -------- SQLite -------- ####
#### ------------------------ ####

# libraries
import pandas
import sqlite3 as lite

# connecting to database
sqliteConnection = lite.connect('.../sales.sqlite')

# sending query
query = sqliteConnection.execute('SELECT * FROM sales')

# transforming results into a DataFrame
dataSqlite = pandas.read_sql_query(sql='SELECT * FROM sales',con=sqliteConnection)

# viewing data
dataSqlite

>>> dataSqlite
     name  salesTy
0   James     2000
1  Hannah     1500
2    Nick      500
3    Anne     3000
4  Hayley     2000

# sales data for last year:

#### ------------------------ ####
#### ------- MS Excel ------- ####
#### ------------------------ ####

# reading data into a DataFrame
dataExcel = pandas.read_excel('.../salesLy.xlsx', 'Sheet1', index_col=None)

# viewing data
dataExcel

>>> dataExcel
     name  salesLy
0   James     1500
1  Hannah      400
2    Nick      600
3    Anne     3500
4  Hayley     1000

# What are the area assignments?

#### ------------------------ ####
#### --------- .csv --------- ####
#### ------------------------ ####

# reading data into a DataFrame
dataCsv = pandas.read_csv('.../salesArea.csv')

# viewing data
dataCsv

>>> dataCsv
     name   area
0    Nick  South
1    Anne  South
2  Hayley  North
3   James   West
4  Hannah   East

# Number of clients each employee looks after:

#### ------------------------ ####
#### --------- .txt --------- ####
#### ------------------------ ####

# reading data into a DataFrame
dataTxt = pandas.read_table('.../salesClients.txt')

# viewing data
dataTxt

>>> dataTxt
     name  clients
0   James        2
1  Hannah        4
2    Nick        3
3    Anne        5
4  Hayley        1

# Centralising the data:

#### ------------------------ ####
#### --------- MERGE -------- ####
#### ------------------------ ####

# placing all the previously created DataFrames in a list
dataSources = [dataSqlite, dataExcel, dataCsv, dataTxt]

# first join
final = pandas.merge(dataSqlite, dataExcel, how='outer', on='name')

# third to fifth join
for i in range(2,4):
    final = pandas.merge(final, dataSources[i], how='outer', on='name')

# viewing data
final

>>> final
     name  salesTy  salesLy   area  clients
0   James     2000     1500   West        2
1  Hannah     1500      400   East        4
2    Nick      500      600  South        3
3    Anne     3000     3500  South        5
4  Hayley     2000     1000  North        1

# Assessing sales performance, this year versus last:

#### ------------------------ ####
#### ------ OPERATIONS ------ ####
#### ------------------------ ####

# what is the percentage change in sales for each sales person?
final['percentChange'] = ((final.salesTy-final.salesLy)/final.salesLy)*100
 
# creating output DataFrame 
output = final
 
# viewing
output

>>> final
     name  salesTy  salesLy   area  clients  percentChange
0   James     2000     1500   West        2      33.333333
1  Hannah     1500      400   East        4     275.000000
2    Nick      500      600  South        3     -16.666667
3    Anne     3000     3500  South        5     -14.285714
4  Hayley     2000     1000  North        1     100.000000

I used Python 3.4.3 alongside Anaconda 2.2.0 64-bit and the spyder IDE to produce this output.

Note: I had some trouble reading my Microsoft Access database so I have excluded it from this example.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s