6 Visualization with plotnine in Python

Visualization is a key part of statistical analyses, especially in systems engineering. In this tutorial, we’ll learn to visualize data with plotnine (the Python port of ggplot2).

Please follow along using the code below!



Getting Started

Loading Packages

import pandas as p
from plotnine import *
from gapminder import gapminder as gapminder
from seaborn import load_dataset

# diamonds from seaborn
diamonds = load_dataset('diamonds')

Gapminder data

# View it in the console
gapminder
##           country continent  year  lifeExp       pop   gdpPercap
## 0     Afghanistan      Asia  1952   28.801   8425333  779.445314
## 1     Afghanistan      Asia  1957   30.332   9240934  820.853030
## 2     Afghanistan      Asia  1962   31.997  10267083  853.100710
## 3     Afghanistan      Asia  1967   34.020  11537966  836.197138
## 4     Afghanistan      Asia  1972   36.088  13079460  739.981106
## ...           ...       ...   ...      ...       ...         ...
## 1699     Zimbabwe    Africa  1987   62.351   9216418  706.157306
## 1700     Zimbabwe    Africa  1992   60.377  10704340  693.420786
## 1701     Zimbabwe    Africa  1997   46.809  11404948  792.449960
## 1702     Zimbabwe    Africa  2002   39.989  11926563  672.038623
## 1703     Zimbabwe    Africa  2007   43.487  12311143  469.709298
## 
## [1704 rows x 6 columns]
# Glimpse-like summary
gapminder.dtypes, gapminder.shape
## (country       object
## continent     object
## year           int64
## lifeExp      float64
## pop            int64
## gdpPercap    float64
## dtype: object, (1704, 6))

6.1 Your first scatterplot

ggplot(data=gapminder, mapping=aes(x='gdpPercap', y='lifeExp'))
## <plotnine.ggplot.ggplot object at 0x000001EC7B182BA0>

Add points with + geom_point().

(ggplot(gapminder, aes(x='gdpPercap', y='lifeExp')) + geom_point())
## <plotnine.ggplot.ggplot object at 0x000001EC7A03D130>

Learning Check 1

Question

What kind of relationship does this graph show? Why might it matter to policymakers?

[View Answer!]

As wealth per person (GDP per capita) increases, life expectancy rises quickly then tapers off. This shows a strong relationship between wealth and health.


6.2 Transparency (alpha)

(ggplot(gapminder, aes(x='gdpPercap', y='lifeExp')) + geom_point(alpha=0.2))
## <plotnine.ggplot.ggplot object at 0x000001EC7B1834A0>
(ggplot(gapminder, aes(x='gdpPercap', y='lifeExp')) + geom_point(alpha=0.5))
## <plotnine.ggplot.ggplot object at 0x000001EC7B1B42C0>
(ggplot(gapminder, aes(x='gdpPercap', y='lifeExp')) + geom_point(alpha=1))
## <plotnine.ggplot.ggplot object at 0x000001EC7B1B6DE0>

Learning Check 2

Question

What happens when you change alpha across the three visuals above?

[View Answer!]

alpha controls transparency from 0 to 1. Higher values are more opaque; lower values are more transparent.


6.3 Color: constant vs mapped

# Single color
(ggplot(gapminder, aes(x='gdpPercap', y='lifeExp')) +
  geom_point(alpha=0.5, color='steelblue'))
## <plotnine.ggplot.ggplot object at 0x000001EC7B19AE40>
# Color mapped by continent
(ggplot(gapminder, aes(x='gdpPercap', y='lifeExp', color='continent')) +
  geom_point(alpha=0.5))
## <plotnine.ggplot.ggplot object at 0x000001EC7B1DDB20>

Learning Check 3

Question

Where do you place color for a single color vs. multiple colors based on a variable?

[View Answer!]

Single color: set color inside geom_point(color='...') (outside aes). Mapped colors: set color inside aes(color='variable').


6.4 Improving our visualizations

(ggplot(gapminder, aes(x='gdpPercap', y='lifeExp', color='continent')) +
  geom_point(alpha=0.5) +
  labs(x='GDP per capita (USD)',
       y='Life Expectancy (years)',
       color='Continent',
       title='Does Wealth affect Health?',
       subtitle='Global Health Trends by Continent',
       caption='Points display individual country-year observations.'))
## <plotnine.ggplot.ggplot object at 0x000001EC7A8CBB30>

You can save visuals as objects to reuse them.

myviz = (ggplot(gapminder, aes(x='gdpPercap', y='lifeExp', color='continent')) +
  geom_point(alpha=0.5) +
  labs(x='GDP per capita (USD)', y='Life Expectancy (years)', color='Continent',
       title='Does Wealth affect Health?', subtitle='Global Health Trends by Continent',
       caption='Points display individual country-year observations.'))

myviz
## <plotnine.ggplot.ggplot object at 0x000001EC7AF92870>
myviz + theme_bw()
## <plotnine.ggplot.ggplot object at 0x000001EC7A9AA0C0>
myviz + theme_dark()
## <plotnine.ggplot.ggplot object at 0x000001EC7A94E8A0>
myviz + theme_classic()
## <plotnine.ggplot.ggplot object at 0x000001EC7B196930>



6.5 Visualizing diamonds data

diamonds.head(3)
##    carat      cut color clarity  depth  table  price     x     y     z
## 0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
## 1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
## 2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
diamonds.dtypes, diamonds.shape
## (carat       float64
## cut        category
## color      category
## clarity    category
## depth       float64
## table       float64
## price         int64
## x           float64
## y           float64
## z           float64
## dtype: object, (53940, 10))

6.5.1 Boxplots by cut

(ggplot(diamonds, aes(x='cut', y='price', group='cut')) + geom_boxplot())
## <plotnine.ggplot.ggplot object at 0x000001EC7B1D5190>
(ggplot(diamonds, aes(x='cut', y='price', group='cut')) +
  geom_boxplot(fill='steelblue'))
## <plotnine.ggplot.ggplot object at 0x000001EC7B0DFCB0>
(ggplot(diamonds, aes(x='cut', y='price', group='cut', fill='cut')) +
  geom_boxplot())
## <plotnine.ggplot.ggplot object at 0x000001EC7B196540>

Learning Check 4

Question

Why do the two boxplot versions look different? What changed in the code to create those effects?

[View Answer!]

Constant fill uses geom_boxplot(fill='steelblue'). Mapped fill uses aes(fill='cut') to color by variable.


6.6 Visualizing Distributions

(ggplot(diamonds, aes(x='price', fill='cut')) +
  geom_histogram(color='white') +
  labs(x='Price (USD)', y='Frequency', title='US Diamond Sales'))
## <plotnine.ggplot.ggplot object at 0x000001EC7B196A20>

6.6.1 Try adjusting binwidth and theme

(ggplot(diamonds, aes(x='price', fill='cut')) +
  geom_histogram(color='white', binwidth=500) +
  theme_classic() +
  labs(x='Price (USD)', y='Frequency', title='US Diamond Sales (binwidth=500)'))
## <plotnine.ggplot.ggplot object at 0x000001EC7B19A090>

Learning Check 5

Question

Make a histogram of price with a narrower binwidth and apply a different theme. Which choices improve readability?

[View Answer!]
(ggplot(diamonds, aes(x='price', fill='cut')) +
  geom_histogram(color='white', binwidth=250) +
  theme_bw() +
  labs(x='Price (USD)', y='Frequency'))
## <plotnine.ggplot.ggplot object at 0x000001EC7B1D5D00>

Conclusion

You learned how to build scatterplots, boxplots, and histograms with plotnine and how to control transparency, color mapping, labels, and themes.