Statistical computing with Python


This is a tutorial on how to do some typical statistical programming tasks using Python. It’s intended for people who know basic Python and have experience with statistical programming in a language like R, Stata, SAS, SPSS, or MATLAB.

# 0. Getting set up ====

""" To get started, pip install the following: jupyter, numpy, scipy, pandas,
    matplotlib, seaborn, requests.
        Make sure to do this tutorial in a Jupyter notebook so that you get
    the inline plots and easy documentation lookup. The shell command to open
    one is simply `jupyter notebook`, then click New -> Python.
"""

# 1. Data acquisition ====

""" One reason people choose Python over R is that they intend to interact a lot
    with the web, either by scraping pages directly or requesting data through
    an API. You can do those things in R, but in the context of a project
    already using Python, there's a benefit to sticking with one language.
"""

import requests  # for HTTP requests (web scraping, APIs)
import os

# web scraping
r = requests.get("https://github.com/adambard/learnxinyminutes-docs")
r.status_code  # if 200, request was successful
r.text  # raw page source
print(r.text)  # prettily formatted
# save the page source in a file:
os.getcwd()  # check what the working directory is
with open("learnxinyminutes.html", "wb") as f:
    f.write(r.text.encode("UTF-8"))

# downloading a csv
fp = "https://raw.githubusercontent.com/adambard/learnxinyminutes-docs/master/"
fn = "pets.csv"
r = requests.get(fp + fn)
print(r.text)
with open(fn, "wb") as f:
    f.write(r.text.encode("UTF-8"))

""" for more on the requests module, including APIs, see
    http://docs.python-requests.org/en/latest/user/quickstart/
"""

# 2. Reading a CSV file ====

""" Wes McKinney's pandas package gives you 'DataFrame' objects in Python. If
    you've used R, you will be familiar with the idea of the "data.frame" already.
"""

import pandas as pd
import numpy as np
import scipy as sp
pets = pd.read_csv(fn)
pets
#        name  age  weight species
# 0    fluffy    3      14     cat
# 1  vesuvius    6      23    fish
# 2       rex    5      34     dog

""" R users: note that Python, like most C-influenced programming languages, starts
    indexing from 0. R starts indexing at 1 due to Fortran influence.
"""

# two different ways to print out a column
pets.age
pets["age"]

pets.head(2)  # prints first 2 rows
pets.tail(1)  # prints last row

pets.name[1]  # 'vesuvius'
pets.species[0]  # 'cat'
pets["weight"][2]  # 34

# in R, you would expect to get 3 rows doing this, but here you get 2:
pets.age[0:2]
# 0    3
# 1    6

sum(pets.age) * 2  # 28
max(pets.weight) - min(pets.weight)  # 20

""" If you are doing some serious linear algebra and number-crunching, you may
    just want arrays, not DataFrames. DataFrames are ideal for combining columns
    of different types.
"""

# 3. Charts ====

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

# To do data visualization in Python, use matplotlib

plt.hist(pets.age);

plt.boxplot(pets.weight);

plt.scatter(pets.age, pets.weight)
plt.xlabel("age")
plt.ylabel("weight");

# seaborn sits atop matplotlib and makes plots prettier

import seaborn as sns
sns.set()  # apply seaborn's styling; recent versions no longer restyle plots on import

plt.scatter(pets.age, pets.weight)
plt.xlabel("age")
plt.ylabel("weight");

# there are also some seaborn-specific plotting functions
# notice how seaborn automatically labels the x-axis on this barplot
sns.barplot(pets["age"])

# R veterans can still use ggplot
# (note: the ggplot package is no longer maintained and may not work with
# recent pandas; plotnine offers a similar interface)
from ggplot import *
ggplot(aes(x="age",y="weight"), data=pets) + geom_point() + labs(title="pets")
# source: https://pypi.python.org/pypi/ggplot

# there's even a d3.js port: https://github.com/mikedewar/d3py

# 4. Simple data cleaning and exploratory analysis ====

""" Here's a more complicated example that demonstrates a basic data
    cleaning workflow leading to the creation of some exploratory plots
    and the running of a linear regression.
        The data set was transcribed from Wikipedia by hand. It contains
    all the Holy Roman Emperors and the important milestones in their lives
    (birth, death, coronation, etc.).
        The goal of the analysis will be to explore whether a relationship
    exists between emperor birth year and emperor lifespan.
    data source: https://en.wikipedia.org/wiki/Holy_Roman_Emperor
"""

# load some data on Holy Roman Emperors
url = "https://raw.githubusercontent.com/adambard/learnxinyminutes-docs/master/hre.csv"
r = requests.get(url)
fp = "hre.csv"
with open(fp, "wb") as f:
    f.write(r.text.encode("UTF-8"))

hre = pd.read_csv(fp)

hre.head()
"""
   Ix      Dynasty        Name        Birth             Death
0 NaN  Carolingian   Charles I  2 April 742    28 January 814
1 NaN  Carolingian     Louis I          778       20 June 840
2 NaN  Carolingian   Lothair I          795  29 September 855
3 NaN  Carolingian    Louis II          825    12 August 875
4 NaN  Carolingian  Charles II  13 June 823     6 October 877

       Coronation 1   Coronation 2 Ceased to be Emperor
0   25 December 800            NaN       28 January 814
1  11 September 813  5 October 816          20 June 840
2      5 April 823             NaN     29 September 855
3       Easter 850     18 May 872        12 August 875
4  29 December 875             NaN       6 October 877
"""

# clean the Birth and Death columns

import re  # module for regular expressions

rx = re.compile(r'\d+$')  # match trailing digits

""" This function applies the regular expression to an input column (here Birth,
    Death), flattens the resulting list, converts it to a Series object, and
    finally converts the type of the Series object from string to integer. For
    more information on what different parts of the code do, see:
      - https://docs.python.org/2/howto/regex.html
      - http://stackoverflow.com/questions/11860476/how-to-unlist-a-python-list
      - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html
"""

from functools import reduce

def extractYear(v):
    # rx.findall returns a list per value; reduce concatenates them into one flat list
    return pd.Series(reduce(lambda x, y: x + y, map(rx.findall, v), [])).astype(int)

hre["BirthY"] = extractYear(hre.Birth)
hre["DeathY"] = extractYear(hre.Death)

# make a column telling estimated age
hre["EstAge"] = hre.DeathY.astype(int) - hre.BirthY.astype(int)

# simple scatterplot, no trend line, color represents dynasty
# (x and y are passed by keyword; recent seaborn versions reject them positionally)
sns.lmplot(x="BirthY", y="EstAge", data=hre, hue="Dynasty", fit_reg=False)

# use scipy to run a linear regression
from scipy import stats
(slope, intercept, rval, pval, stderr) = stats.linregress(hre.BirthY, hre.EstAge)
# code source: http://wiki.scipy.org/Cookbook/LinearRegression

# check the slope
slope  # 0.0057672618839073328

# check the R^2 value:
rval**2  # 0.020363950027333586

# check the p-value
pval  # 0.34971812581498452

# use seaborn to make a scatterplot and plot the linear regression trend line
sns.lmplot(x="BirthY", y="EstAge", data=hre)

""" For more information on seaborn, see
      - http://web.stanford.edu/~mwaskom/software/seaborn/
      - https://github.com/mwaskom/seaborn
    For more information on SciPy, see
      - http://wiki.scipy.org/SciPy
      - http://wiki.scipy.org/Cookbook/
    To see a version of the Holy Roman Emperors analysis using R, see
      - http://github.com/e99n09/R-notes/blob/master/holy_roman_emperors_dates.R
"""

If you want to learn more, get Python for Data Analysis by Wes McKinney. It’s a superb resource and I used it as a reference when writing this tutorial.

You can also find plenty of interactive IPython tutorials on subjects specific to your interests, like Cam Davidson-Pilon’s Probabilistic Programming and Bayesian Methods for Hackers.

Some more modules to research:

  • text analysis and natural language processing: nltk
  • social network analysis: igraph