8 Distributions and Descriptive Statistics in Python

This tutorial introduces distributions and descriptive statistics in Python using pandas and helper functions that mirror R’s syntax.

Getting Started

Install and Import Packages

%pip install pandas plotnine scipy

import os, sys
import pandas as p
from plotnine import *
sys.path.append(os.path.abspath('functions'))
from functions_distributions import *

8.1 Our Data

sw = p.Series([4.5, 5, 5.5, 5, 5.5, 6.5, 6.5, 6, 5, 4])
sw

## 0    4.5
## 1    5.0
## 2    5.5
## 3    5.0
## 4    5.5
## 5    6.5
## 6    6.5
## 7    6.0
## 8    5.0
## 9    4.0
## dtype: float64

8.2 Size

8.2.1 Length

len(sw)

## 10

8.3 Location

8.3.1 Mean and Median

sw.mean()

## 5.35

sw.median()

## 5.25

8.3.2 Mode

sw.mode()

## 0    5.0
## dtype: float64

8.4 Spread (1)

8.4.1 Percentiles

sw.quantile(q=0)  # min

## 4.0

sw.quantile(q=1)  # max

## 6.5

sw.quantile(q=.75)

## 5.875

8.5 Spread (2)

8.5.1 Standard Deviation, Variance, CV, SE

# Manual SD (sample)
x = ((sw - sw.mean())**2).sum()
x = x / (len(sw) - 1)
x**0.5

## 0.8181958472422385

sw.std()

## 0.8181958472422385

sw.var()

## 0.6694444444444444

sw.std()**2

## 0.6694444444444444

sw.std() / sw.mean()  # CV

## 0.15293380322284833

se = sw.std() / (len(sw)**0.5)
se

## 0.25873624493766706

8.6 Shape

8.6.1 Skewness and Kurtosis

diff = sw - sw.mean()
n = len(sw) - 1
sigma = sw.std()
sum(diff**3) / (n * sigma**3)

## 0.024342597820882206

sum(diff**4) / (n * sigma**4)

## 1.8758509667533272

# Using helper functions mirroring R
skewness(sw)

## 0.024342597820882206

kurtosis(sw)

## 1.8758509667533272

8.7 Finding Parameters for Your Distributions

sw = p.Series([4.5, 5, 5.5, 5, 5.5, 6.5, 6.5, 6, 5, 4])
mymean = sw.mean()
mysd = sw.std()

8.8 Common Distributions

8.8.1 Normal

mynorm = rnorm(n=1000, mean=mymean, sd=mysd)
hist(mynorm)

## <plotnine.ggplot.ggplot object at 0x16ed3d580>

8.8.2 Poisson

mypois = rpois(n=1000, mu=mymean)
hist(mypois)

## <plotnine.ggplot.ggplot object at 0x308f8d8e0>

8.8.3 Exponential

myrate_e = 1 / sw.mean()
myexp = rexp(n=1000, rate=myrate_e)
hist(myexp)

## <plotnine.ggplot.ggplot object at 0x30b79a420>

8.8.4 Gamma

myshape = sw.mean()**2 / sw.var()
myrate = 1 / (sw.var() / sw.mean())
mygamma = rgamma(n=1000, shape=myshape, rate=myrate)
hist(mygamma)

## <plotnine.ggplot.ggplot object at 0x30b7b0d70>

8.8.5 Weibull

from scipy import stats as fitdistr
myshape_w, loc, myscale_w = fitdistr.weibull_min.fit(sw, floc=0)
myweibull = rweibull(n=1000, shape=myshape_w, scale=myscale_w)
hist(myweibull)

## <plotnine.ggplot.ggplot object at 0x30abb1040>

8.9 Comparing Distributions

mysim = p.concat([
  p.DataFrame({'x': sw, 'type': "Observed"}),
  p.DataFrame({'x': mynorm, 'type': "Normal"}),
  p.DataFrame({'x': mypois, 'type': "Poisson"}),
  p.DataFrame({'x': mygamma, 'type': "Gamma"}),
  p.DataFrame({'x': myexp, 'type': "Exponential"}),
  p.DataFrame({'x': myweibull, 'type': "Weibull"})
])

g1 = (ggplot(mysim, aes(x='x', fill='type')) +
  geom_density(alpha=0.5) +
  labs(x='Seawall Height (m)', y='Density (Frequency)', subtitle='Which distribution fits best?', fill='Type'))
g1

## <plotnine.ggplot.ggplot object at 0x30676d8b0>

g1 + xlim(0,10)

## <plotnine.ggplot.ggplot object at 0x30b74bc20>

Learning Check 1

Question

Simulate 1000 draws from a normal distribution using your sw mean and standard deviation. What are the simulated mean and sd? How close are they to sw’s?

[View Answer!]

mymean = sw.mean(); mysd = sw.std()
m = rnorm(1000, mean=mymean, sd=mysd)
[m.mean(), m.std()]

## [5.31369576645973, 0.8235176440088696]

Conclusion

You computed size, location, spread, and shape statistics and compared common simulated distributions using helper functions that mirror R.