Simpson's Paradox

13 Nov 2018

Simpson’s Paradox is an interesting statistical property that arises when you arrive at misleading conclusions due to overlooking confounding variables in your data.

Ultimately, the only way to overcome the paradox (should it even arise…) is a thorough understanding of your data and that it represents.

Simple Overview

This video was very helpful in helping me gain some intuition with simple examples.

A More Concrete Example

import requests
import pandas as pd

We’ll lean on a longitudinal dataset from South Africa.

conn = requests.get('http://jse.amstat.org/datasets/birthtotenb.dat.txt')

df = pd.DataFrame.from_records(conn.text.split('\n'))
df = df[[0, 2, 4]]
df.columns = ['aid', 'traced', 'race']
df.drop(1590, inplace=True)
df = df.applymap(int)
df.head()

	aid	traced	race
0	0	0	1
1	0	0	1
2	0	0	1
3	0	0	1
4	0	0	1

The dataset is comprised of of three variables:

aid: Whether or not the patient had insurance
traced: Whether or not newborns had five-year follow appointments
race: 1 being White, 2 Black

A naive look at the mean traced proportions, suggests that having insurance makes you less likely to have a follow up appointment.

df.groupby('aid')['traced'].agg(['mean', 'size'])

	mean	size
aid
0	0.274277	1349
1	0.190871	241

However, when you also break out by race, we can see that this idea doesn’t hold.

In fact, regardless of race, having insurance makes you objectively more likely to follow-up.

df.groupby(['aid', 'race'])['traced'].mean().unstack()

race	1	2
aid
0	0.083333	0.277736
1	0.087719	0.283465

All told, through investigation, the more glaring isight this datset gives is the disproportionate level of care provided race-to-race.

df.groupby('race')['aid'].mean()

race
1    0.826087
2    0.087466
Name: aid, dtype: float64