Simpson's Paradox
Simpson’s Paradox is an interesting statistical property that arises when you arrive at misleading conclusions due to overlooking confounding variables in your data.
Ultimately, the only way to overcome the paradox (should it even arise…) is a thorough understanding of your data and that it represents.
Simple Overview
This video was very helpful in helping me gain some intuition with simple examples.
A More Concrete Example
import requests
import pandas as pd
We’ll lean on a longitudinal dataset from South Africa.
conn = requests.get('http://jse.amstat.org/datasets/birthtotenb.dat.txt')
df = pd.DataFrame.from_records(conn.text.split('\n'))
df = df[[0, 2, 4]]
df.columns = ['aid', 'traced', 'race']
df.drop(1590, inplace=True)
df = df.applymap(int)
df.head()
aid | traced | race | |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 0 | 0 | 1 |
2 | 0 | 0 | 1 |
3 | 0 | 0 | 1 |
4 | 0 | 0 | 1 |
The dataset is comprised of of three variables:
aid
: Whether or not the patient had insurancetraced
: Whether or not newborns had five-year follow appointmentsrace
:1
being White,2
Black
A naive look at the mean traced
proportions, suggests that having insurance makes you less likely to have a follow up appointment.
df.groupby('aid')['traced'].agg(['mean', 'size'])
mean | size | |
---|---|---|
aid | ||
0 | 0.274277 | 1349 |
1 | 0.190871 | 241 |
However, when you also break out by race, we can see that this idea doesn’t hold.
In fact, regardless of race, having insurance makes you objectively more likely to follow-up.
df.groupby(['aid', 'race'])['traced'].mean().unstack()
race | 1 | 2 |
---|---|---|
aid | ||
0 | 0.083333 | 0.277736 |
1 | 0.087719 | 0.283465 |
All told, through investigation, the more glaring isight this datset gives is the disproportionate level of care provided race-to-race.
df.groupby('race')['aid'].mean()
race
1 0.826087
2 0.087466
Name: aid, dtype: float64