QQ Plots

Overview

Short and sweet, a QQ plot is used to check the normality of a given data distribution.

Their construction is pretty straight-forward. Essentially you:

(Borrowing visuals from StatQuest):

  1. Sort your data and label each point as its own quantile (10th, 42nd, 99th, etc). Normalized data is your cleanest way to go here.
from IPython.display import Image
Image('../images/qq_data.PNG')

png

  1. Then, using the quantiles from step (1), fire up your vanilla N ~ (0, 1) distribution, and sample the same quantiles from it
Image('../images/qq_theory.PNG')

png

Note: Because quantiles are a strictly ordinal measure (think median vs mean), comparing quantiles to quantiles across different distributions may very well yield inconsistent values.

Generating QQ Plots

For starters, we’ll download an interesting dataset using yellowbrick and ignore like 80% of it, lol

from yellowbrick.datasets import load_nfl

dataset = load_nfl(return_dataset=True)
df = dataset.to_dataframe()

The dataset describes overall Receiving stats for the 2018 season. There’s a lot here.

df.head()
Rk Player Id Tm Age G GS Tgt Rec Ctch_Rate ... FirstTeamAllPro C_pos CB_pos DT_pos FB_pos QB_pos RB_pos T_pos TE_pos WR_pos
0 1 Michael Thomas ThomMi05 NOR 25 16 16 147 125 0.850 ... 1 0 0 0 0 0 0 0 0 1
1 2 Zach Ertz ErtzZa00 PHI 28 16 16 156 116 0.744 ... 0 0 0 0 0 0 0 0 1 0
2 3 DeAndre Hopkins HopkDe00 HOU 26 16 16 163 115 0.706 ... 1 0 0 0 0 0 0 0 0 1
3 4 Julio Jones JoneJu02 ATL 29 16 16 170 113 0.665 ... 0 0 0 0 0 0 0 0 0 1
4 5 Adam Thielen ThieAd00 MIN 28 16 16 153 113 0.739 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 29 columns

For the sake of demonstration, let’s consider total yardage.

This skewed distribution makes a ton of sense when you consider how many players don’t get a lot of touches during the season.

df['Yds'].hist();

png

Now we’ll fire up statsmodels.api, which has a really clean utility for generating QQ plots.

Go figure this distribution isn’t very normal.

from statsmodels.api import qqplot

qqplot(df['Yds'], line='s');

png

The “Longest Reception” metric, on the other hand, looks a bit more palatable– what, because it has an inherent upper-bound at the length of a football field.

df['Lng'].hist();

png

Not perfect, but certainly normal-er!

qqplot(df['Lng'], line='s');

png

… but also runs into having a negative-yard-longest-reception?

fig = qqplot(df['Lng'])
fig.axes[0].set_xlim([-5, 0])
fig.axes[0].set_ylim([-20, 10]);

png

No, that’s accurate

df['Lng'].min()
-11

lol

df[df['Lng'] == -11]['Player']
492    Russell Wilson
Name: Player, dtype: object