Statistical computing with Python

This is a tutorial on how to do some typical statistical programming tasks using Python. It’s intended for people basically familiar with Python and experienced at statistical programming in a language like R, Stata, SAS, SPSS, or MATLAB.
# 0. Getting set up ====

""" To get started, pip install the following: jupyter, numpy, scipy, pandas,
    matplotlib, seaborn, requests.
        Make sure to do this tutorial in a Jupyter notebook so that you get
    the inline plots and easy documentation lookup. The shell command to open
    one is simply `jupyter notebook`, then click New -> Python.
"""

# 1. Data acquisition ====

""" One reason people choose Python over R is that they intend to interact a lot
    with the web, either by scraping pages directly or requesting data through
    an API. You can do those things in R, but in the context of a project
    already using Python, there's a benefit to sticking with one language.
"""

import requests  # for HTTP requests (web scraping, APIs)
import os

# web scraping
r = requests.get("https://github.com/adambard/learnxinyminutes-docs")
r.status_code  # if 200, request was successful
r.text  # raw page source
print(r.text)  # prettily formatted
# save the page source in a file:
os.getcwd()  # check what's the working directory
with open("learnxinyminutes.html", "wb") as f:
    f.write(r.text.encode("UTF-8"))

# downloading a csv
fp = "https://raw.githubusercontent.com/adambard/learnxinyminutes-docs/master/"
fn = "pets.csv"
r = requests.get(fp + fn)
print(r.text)
with open(fn, "wb") as f:
    f.write(r.text.encode("UTF-8"))

""" for more on the requests module, including APIs, see
    http://docs.python-requests.org/en/latest/user/quickstart/
"""
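""" The two download snippets above write a file even when the request
    failed. Here is a minimal sketch of a more careful pattern. `save_response`
    is a hypothetical helper, not part of requests; it relies only on the
    `raise_for_status()` method and the `content` attribute of a response.
"""

```python
from pathlib import Path


def save_response(response, path):
    """Write a response body to `path`, raising first if the request failed."""
    response.raise_for_status()  # raises an HTTPError for 4xx/5xx status codes
    target = Path(path)
    target.write_bytes(response.content)  # .content holds the undecoded bytes
    return target


# usage (assumes network access and the `fp`/`fn` names defined above):
#   import requests
#   save_response(requests.get(fp + fn, timeout=10), fn)
```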

# 2. Reading a CSV file ====

""" Wes McKinney's pandas package gives you 'DataFrame' objects in Python. If
    you've used R, you will be familiar with the idea of the "data.frame" already.
"""

import pandas as pd
import numpy as np
import scipy as sp
pets = pd.read_csv(fn)
pets
#        name  age  weight species
# 0    fluffy    3      14     cat
# 1  vesuvius    6      23    fish
# 2       rex    5      34     dog

""" R users: note that Python, like most C-influenced programming languages, starts
    indexing from 0. R starts indexing at 1 due to Fortran influence.
"""

# two different ways to print out a column
pets.age
pets["age"]

pets.head(2)  # prints first 2 rows
pets.tail(1)  # prints last row

pets.name[1]  # 'vesuvius'
pets.species[0]  # 'cat'
pets["weight"][2]  # 34

# in R, you would expect to get 3 rows doing this, but here you get 2:
pets.age[0:2]
# 0    3
# 1    6
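""" A short sketch of the slicing rules, rebuilt from the table printed above
    so that it runs without the download. The inline CSV text is an assumption
    that mirrors that printout. A plain `[0:2]` slice and `.iloc[0:2]` are
    positional and exclude the end point; `.loc[0:2]` slices by label and
    includes it.
"""

```python
import io

import pandas as pd

# stand-in for pets.csv, copied from the printed DataFrame
csv_text = """name,age,weight,species
fluffy,3,14,cat
vesuvius,6,23,fish
rex,5,34,dog
"""
pets_demo = pd.read_csv(io.StringIO(csv_text))

first_two = pets_demo.age.iloc[0:2]         # positions 0 and 1
through_label_two = pets_demo.age.loc[0:2]  # labels 0, 1 and 2

print(first_two.tolist())          # [3, 6]
print(through_label_two.tolist())  # [3, 6, 5]
```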

sum(pets.age) * 2  # 28
max(pets.weight) - min(pets.weight)  # 20

""" If you are doing some serious linear algebra and number-crunching, you may
    just want arrays, not DataFrames. DataFrames are ideal for combining columns
    of different types.
"""
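""" As a rough illustration of that advice, the numeric columns can be pulled
    out as a NumPy array. The small DataFrame here is retyped from the pets
    table above so that the snippet runs on its own.
"""

```python
import numpy as np
import pandas as pd

pets_demo = pd.DataFrame({
    "name": ["fluffy", "vesuvius", "rex"],
    "age": [3, 6, 5],
    "weight": [14, 23, 34],
})

# keep only the numeric columns and convert them to a 2-D array
X = pets_demo[["age", "weight"]].to_numpy()

X.shape         # (3, 2)
X.mean(axis=0)  # column means
X.T @ X         # 2x2 cross-product matrix, ordinary linear algebra
```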

# 3. Charts ====

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

# To do data visualization in Python, use matplotlib

plt.hist(pets.age);

plt.boxplot(pets.weight);

plt.scatter(pets.age, pets.weight)
plt.xlabel("age")
plt.ylabel("weight");

# seaborn sits atop matplotlib and makes plots prettier

import seaborn as sns
sns.set_theme()  # since seaborn 0.8 the styling is applied only on request

plt.scatter(pets.age, pets.weight)
plt.xlabel("age")
plt.ylabel("weight");

# there are also some seaborn-specific plotting functions
# notice how seaborn automatically labels the x-axis on this barplot
sns.barplot(x=pets["age"])

# R veterans can still use ggplot
# (the ggplot package is no longer maintained; plotnine is a maintained
# package with a similar grammar-of-graphics interface)
from ggplot import *
ggplot(aes(x="age",y="weight"), data=pets) + geom_point() + labs(title="pets")
# source: https://pypi.python.org/pypi/ggplot

# there's even a d3.js port: https://github.com/mikedewar/d3py

# 4. Simple data cleaning and exploratory analysis ====

""" Here's a more complicated example that demonstrates a basic data
    cleaning workflow leading to the creation of some exploratory plots
    and the running of a linear regression.
        The data set was transcribed from Wikipedia by hand. It contains
    all the Holy Roman Emperors and the important milestones in their lives
    (birth, death, coronation, etc.).
        The goal of the analysis will be to explore whether a relationship
    exists between emperor birth year and emperor lifespan.
    data source: https://en.wikipedia.org/wiki/Holy_Roman_Emperor
"""

# load some data on Holy Roman Emperors
url = "https://raw.githubusercontent.com/adambard/learnxinyminutes-docs/master/hre.csv"
r = requests.get(url)
fp = "hre.csv"
with open(fp, "wb") as f:
    f.write(r.text.encode("UTF-8"))

hre = pd.read_csv(fp)

hre.head()
"""
   Ix      Dynasty        Name        Birth             Death
0 NaN  Carolingian   Charles I  2 April 742    28 January 814
1 NaN  Carolingian     Louis I          778       20 June 840
2 NaN  Carolingian   Lothair I          795  29 September 855
3 NaN  Carolingian    Louis II          825     12 August 875
4 NaN  Carolingian  Charles II  13 June 823     6 October 877

       Coronation 1   Coronation 2 Ceased to be Emperor
0   25 December 800            NaN       28 January 814
1  11 September 813  5 October 816          20 June 840
2       5 April 823            NaN     29 September 855
3        Easter 850     18 May 872        12 August 875
4   29 December 875            NaN        6 October 877
"""

# clean the Birth and Death columns

import re  # module for regular expressions

rx = re.compile(r'\d+$')  # match trailing digits

""" This function applies the regular expression to an input column (here Birth,
    Death), flattens the resulting list, converts it to a Series object, and
    finally converts the type of the Series object from string to integer. For
    more information on what the different parts of the code do, see:
      - https://docs.python.org/2/howto/regex.html
      - http://stackoverflow.com/questions/11860476/how-to-unlist-a-python-list
      - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html
"""

from functools import reduce

def extractYear(v):
    return pd.Series(reduce(lambda x, y: x + y, map(rx.findall, v), [])).astype(int)

hre["BirthY"] = extractYear(hre.Birth)
hre["DeathY"] = extractYear(hre.Death)
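""" To see what extractYear does step by step, here is a sketch on a few
    strings taken from the printout above, using only the standard library.
    It also shows a limitation: a value with no trailing digits yields an
    empty match list, so it adds nothing to the flattened result and the
    output ends up shorter than the input. ("unknown" is a made-up value.)
"""

```python
import re
from functools import reduce

rx = re.compile(r"\d+$")  # same pattern as above: trailing digits

births = ["2 April 742", "778", "13 June 823"]

matches = [rx.findall(s) for s in births]       # one list of matches per string
flat = reduce(lambda x, y: x + y, matches, [])  # concatenate the lists
years = [int(y) for y in flat]

print(matches)  # [['742'], ['778'], ['823']]
print(years)    # [742, 778, 823]

# a string without trailing digits produces no match at all
print(rx.findall("unknown"))  # []
```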

# make a column with the estimated age at death
hre["EstAge"] = hre.DeathY.astype(int) - hre.BirthY.astype(int)

# simple scatterplot, no trend line, color represents dynasty
# (x and y are passed by keyword; recent seaborn versions require this)
sns.lmplot(x="BirthY", y="EstAge", data=hre, hue="Dynasty", fit_reg=False)

# use scipy to run a linear regression
from scipy import stats
(slope, intercept, rval, pval, stderr) = stats.linregress(hre.BirthY, hre.EstAge)
# code source: http://wiki.scipy.org/Cookbook/LinearRegression

# check the slope
slope  # 0.0057672618839073328

# check the R^2 value:
rval**2  # 0.020363950027333586

# check the p-value
pval  # 0.34971812581498452
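""" For readers who want to see where a slope, intercept, and R^2 come from,
    here is a hand-rolled sketch of simple least squares in plain Python. It
    uses the three pets rows from section 2 as toy data rather than the
    emperors, the `toy_` names are illustrative, and it does not compute the
    p-value or standard error that linregress reports.
"""

```python
from math import sqrt

ages = [3, 6, 5]        # pets ages
weights = [14, 23, 34]  # pets weights

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(weights) / n

# sums of squares and cross-products around the means
s_xx = sum((x - x_bar) ** 2 for x in ages)
s_yy = sum((y - y_bar) ** 2 for y in weights)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, weights))

toy_slope = s_xy / s_xx                    # change in weight per year of age
toy_intercept = y_bar - toy_slope * x_bar  # fitted weight at age 0
toy_r = s_xy / sqrt(s_xx * s_yy)           # Pearson correlation

# slope is about 4, intercept about 5, R^2 about 0.372
print(round(toy_slope, 3), round(toy_intercept, 3), round(toy_r ** 2, 3))
```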

# use seaborn to make a scatterplot and plot the linear regression trend line
sns.lmplot(x="BirthY", y="EstAge", data=hre)

""" For more information on seaborn, see
      - http://web.stanford.edu/~mwaskom/software/seaborn/
      - https://github.com/mwaskom/seaborn
    For more information on SciPy, see
      - http://wiki.scipy.org/SciPy
      - http://wiki.scipy.org/Cookbook/
    To see a version of the Holy Roman Emperors analysis using R, see
      - http://github.com/e99n09/R-notes/blob/master/holy_roman_emperors_dates.R
"""
If you want to learn more, get Python for Data Analysis by Wes McKinney. It’s a superb resource and I used it as a reference when writing this tutorial.
You can also find plenty of interactive IPython tutorials on subjects specific to your interests, like Cam Davidson-Pilon’s Probabilistic Programming and Bayesian Methods for Hackers.
Some more modules to research:
- text analysis and natural language processing: nltk
- social network analysis: igraph