8 Distributions and Descriptive Statistics in Python

This tutorial introduces distributions and descriptive statistics in Python using pandas and helper functions that mirror R’s syntax.

Getting Started

Install and Import Packages

%pip install pandas plotnine scipy
import os, sys
import pandas as p
from plotnine import *
sys.path.append(os.path.abspath('functions'))
from functions_distributions import *

8.1 Our Data

sw = p.Series([4.5, 5, 5.5, 5, 5.5, 6.5, 6.5, 6, 5, 4])
sw
## 0    4.5
## 1    5.0
## 2    5.5
## 3    5.0
## 4    5.5
## 5    6.5
## 6    6.5
## 7    6.0
## 8    5.0
## 9    4.0
## dtype: float64

8.2 Size

8.2.1 Length

len(sw)
## 10

8.3 Location

8.3.1 Mean and Median

sw.mean()
## np.float64(5.35)
sw.median()
## np.float64(5.25)

8.3.2 Mode

sw.mode()
## 0    5.0
## dtype: float64

8.4 Spread (1)

8.4.1 Percentiles

sw.quantile(q=0)  # min
## np.float64(4.0)
sw.quantile(q=1)  # max
## np.float64(6.5)
sw.quantile(q=.75)
## np.float64(5.875)

8.5 Spread (2)

8.5.1 Standard Deviation, Variance, CV, SE

# Manual SD (sample)
x = ((sw - sw.mean())**2).sum()
x = x / (len(sw) - 1)
x**0.5
## np.float64(0.8181958472422385)
sw.std()
## np.float64(0.8181958472422385)
sw.var()
## np.float64(0.6694444444444445)
sw.std()**2
## np.float64(0.6694444444444444)
sw.std() / sw.mean()  # CV
## np.float64(0.15293380322284833)
se = sw.std() / (len(sw)**0.5)
se
## np.float64(0.25873624493766706)

8.6 Shape

8.6.1 Skewness and Kurtosis

diff = sw - sw.mean()
n = len(sw) - 1
sigma = sw.std()
sum(diff**3) / (n * sigma**3)
## np.float64(0.024342597820882206)
sum(diff**4) / (n * sigma**4)
## np.float64(1.8758509667533272)
# Using helper functions mirroring R
skewness(sw)
## np.float64(0.024342597820882206)
kurtosis(sw)
## np.float64(1.8758509667533272)

8.7 Finding Parameters for Your Distributions

sw = p.Series([4.5, 5, 5.5, 5, 5.5, 6.5, 6.5, 6, 5, 4])
mymean = sw.mean()
mysd = sw.std()

8.8 Common Distributions

8.8.1 Normal

mynorm = rnorm(n=1000, mean=mymean, sd=mysd)
hist(mynorm)
## <plotnine.ggplot.ggplot object at 0x000002382A6B1EB0>

8.8.2 Poisson

mypois = rpois(n=1000, mu=mymean)
hist(mypois)
## <plotnine.ggplot.ggplot object at 0x000002382AB1ECF0>

8.8.3 Exponential

myrate_e = 1 / sw.mean()
myexp = rexp(n=1000, rate=myrate_e)
hist(myexp)
## <plotnine.ggplot.ggplot object at 0x000002382A6B1EB0>

8.8.4 Gamma

myshape = sw.mean()**2 / sw.var()
myrate = 1 / (sw.var() / sw.mean())
mygamma = rgamma(n=1000, shape=myshape, rate=myrate)
hist(mygamma)
## <plotnine.ggplot.ggplot object at 0x000002382AB30CE0>

8.8.5 Weibull

from scipy import stats as fitdistr
myshape_w, loc, myscale_w = fitdistr.weibull_min.fit(sw, floc=0)
myweibull = rweibull(n=1000, shape=myshape_w, scale=myscale_w)
hist(myweibull)
## <plotnine.ggplot.ggplot object at 0x000002382A5E32C0>

8.9 Comparing Distributions

mysim = p.concat([
  p.DataFrame({'x': sw, 'type': "Observed"}),
  p.DataFrame({'x': mynorm, 'type': "Normal"}),
  p.DataFrame({'x': mypois, 'type': "Poisson"}),
  p.DataFrame({'x': mygamma, 'type': "Gamma"}),
  p.DataFrame({'x': myexp, 'type': "Exponential"}),
  p.DataFrame({'x': myweibull, 'type': "Weibull"})
])

g1 = (ggplot(mysim, aes(x='x', fill='type')) +
  geom_density(alpha=0.5) +
  labs(x='Seawall Height (m)', y='Density (Frequency)', subtitle='Which distribution fits best?', fill='Type'))
g1
## <plotnine.ggplot.ggplot object at 0x000002382AB1F680>
g1 + xlim(0,10)
## <plotnine.ggplot.ggplot object at 0x000002382AB0B2C0>

Learning Check 1

Question

Simulate 1000 draws from a normal distribution using your sw mean and standard deviation. What are the simulated mean and sd? How close are they to sw’s?

[View Answer!]
mymean = sw.mean(); mysd = sw.std()
m = rnorm(1000, mean=mymean, sd=mysd)
[m.mean(), m.std()]
## [np.float64(5.3334619491997195), np.float64(0.828318086068825)]

Conclusion

You computed size, location, spread, and shape statistics and compared common simulated distributions using helper functions that mirror R.