Simpson’s Paradox is an interesting statistical phenomenon in which a trend seen in aggregated data weakens or even reverses once the data is split by a confounding variable, so overlooking confounders can lead you to misleading conclusions.
Ultimately, the only way to overcome the paradox (should it even arise…) is a thorough understanding of your data and what it represents.
This video was very helpful for building some intuition with simple examples.
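To see the reversal in miniature, here is a minimal sketch using counts I invented purely for illustration (the group labels `X` and `Y`, the `treated`/`untreated` arms, and every number are hypothetical, not taken from any dataset). It compares success rates within each group and then pooled across groups.

```python
# Hypothetical counts, invented for illustration: (successes, trials)
counts = {
    ("X", "treated"): (9, 10),
    ("X", "untreated"): (80, 100),
    ("Y", "treated"): (30, 100),
    ("Y", "untreated"): (2, 10),
}


def rate(pairs):
    """Pooled success rate over an iterable of (successes, trials) pairs."""
    pairs = list(pairs)
    return sum(s for s, _ in pairs) / sum(n for _, n in pairs)


def group_rate(group, arm):
    """Success rate for one arm inside one group."""
    return rate([counts[(group, arm)]])


def pooled_rate(arm):
    """Success rate for one arm with both groups lumped together."""
    return rate(v for (_, a), v in counts.items() if a == arm)


for group in ("X", "Y"):
    print(f"{group}: treated={group_rate(group, 'treated'):.2f} "
          f"untreated={group_rate(group, 'untreated'):.2f}")
print(f"all: treated={pooled_rate('treated'):.2f} "
      f"untreated={pooled_rate('untreated'):.2f}")
```

With these made-up numbers the treated arm wins inside both groups, yet loses once the groups are pooled, because most treated cases sit in the low-rate group `Y`.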
A More Concrete Example
import requests
import pandas as pd
We’ll lean on a longitudinal dataset from South Africa.
conn = requests.get('http://jse.amstat.org/datasets/birthtotenb.dat.txt')
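# Each line is a string, so from_records yields one column per character
df = pd.DataFrame.from_records(conn.text.split('\n'))
df = df[[0, 2, 4]]  # character positions holding the three fields
df.columns = ['aid', 'traced', 'race']
df.drop(1590, inplace=True)  # drop row 1590 so the int cast below succeeds
df = df.astype(int)
df.head()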
The dataset consists of three variables:
aid: Whether or not the patient had insurance
traced: Whether or not newborns had five-year follow-up appointments
race: The race of the patient, coded as 1 or 2
A naive look at the mean traced proportions suggests that having insurance makes you less likely to have a follow-up appointment.
However, when we also break the data out by race, we can see that this idea doesn’t hold.
In fact, regardless of race, having insurance makes you more likely to follow up.
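The two comparisons described above can be sketched with a single group-by. This is only my guess at a tidy way to write it: the helper name `traced_rate_by` is hypothetical, and it assumes a frame with 0/1 `aid` and `traced` columns plus a `race` column, like the one built earlier.

```python
import pandas as pd


def traced_rate_by(df, keys):
    """Mean of the 0/1 `traced` column within each combination of `keys`."""
    return df.groupby(keys)["traced"].mean()


# Assumed usage with the frame built earlier:
# traced_rate_by(df, ["aid"])          # naive comparison, insured vs. not
# traced_rate_by(df, ["race", "aid"])  # same comparison within each race
```

Because `traced` is coded 0/1, its mean within a group is simply the proportion of that group with a follow-up, so no extra counting is needed.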
All told, the more glaring insight this dataset gives is the disproportionate level of care provided from race to race.
race
1    0.826087
2    0.087466
Name: aid, dtype: float64
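Output of this shape is what a group-by on race over the `aid` column produces, and the same split helps explain the reversal: a pooled rate is just a weighted average of the per-race rates, with weights given by the racial mix of each insurance group. The sketch below is my own illustration under that reading; the function names are hypothetical and the frame is assumed to have the `aid`, `traced`, and `race` columns used throughout.

```python
import pandas as pd


def insured_share_by_race(df):
    """Fraction of rows with aid == 1 within each race."""
    return df.groupby("race")["aid"].mean()


def pooled_rate_from_strata(df, aid_value):
    """Rebuild the pooled traced rate for one aid group from per-race pieces."""
    sub = df[df["aid"] == aid_value]
    # Share of each race inside this aid group (the weights)
    weights = sub["race"].value_counts(normalize=True)
    # Traced rate of each race inside this aid group
    rates = sub.groupby("race")["traced"].mean()
    # The two Series align on race, so this is a weighted average
    return (weights * rates).sum()
```

If the insured group is dominated by one race and the uninsured group by the other, the two pooled rates use very different weights, which is how the aggregate comparison can point the opposite way from the within-race ones.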