# QQ Plots

## Overview

Short and sweet, a QQ plot is used to check the normality of a given data distribution.

Their construction is pretty straight-forward. Essentially you:

(Borrowing visuals from StatQuest):

1. Sort your data and label each point as its own quantile (10th, 42nd, 99th, etc). Normalized data is your cleanest way to go here.
from IPython.display import Image
Image('../images/qq_data.PNG')

1. Then, using the quantiles from step (1), fire up your vanilla N ~ (0, 1) distribution, and sample the same quantiles from it
Image('../images/qq_theory.PNG')

Note: Because quantiles are a strictly ordinal measure (think median vs mean), comparing quantiles to quantiles across different distributions may very well yield inconsistent values.

## Generating QQ Plots

For starters, we’ll download an interesting dataset using yellowbrick and ignore like 80% of it, lol

from yellowbrick.datasets import load_nfl

df = dataset.to_dataframe()

The dataset describes overall Receiving stats for the 2018 season. There’s a lot here.

df.head()
Rk Player Id Tm Age G GS Tgt Rec Ctch_Rate ... FirstTeamAllPro C_pos CB_pos DT_pos FB_pos QB_pos RB_pos T_pos TE_pos WR_pos
0 1 Michael Thomas ThomMi05 NOR 25 16 16 147 125 0.850 ... 1 0 0 0 0 0 0 0 0 1
1 2 Zach Ertz ErtzZa00 PHI 28 16 16 156 116 0.744 ... 0 0 0 0 0 0 0 0 1 0
2 3 DeAndre Hopkins HopkDe00 HOU 26 16 16 163 115 0.706 ... 1 0 0 0 0 0 0 0 0 1
3 4 Julio Jones JoneJu02 ATL 29 16 16 170 113 0.665 ... 0 0 0 0 0 0 0 0 0 1
4 5 Adam Thielen ThieAd00 MIN 28 16 16 153 113 0.739 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 29 columns

For the sake of demonstration, let’s consider total yardage.

This skewed distribution makes a ton of sense when you consider how many players don’t get a lot of touches during the season.

df['Yds'].hist();

Now we’ll fire up statsmodels.api, which has a really clean utility for generating QQ plots.

Go figure this distribution isn’t very normal.

from statsmodels.api import qqplot

qqplot(df['Yds'], line='s');

The “Longest Reception” metric, on the other hand, looks a bit more palatable– what, because it has an inherent upper-bound at the length of a football field.

df['Lng'].hist();

Not perfect, but certainly normal-er!

qqplot(df['Lng'], line='s');

… but also runs into having a negative-yard-longest-reception?

fig = qqplot(df['Lng'])
fig.axes[0].set_xlim([-5, 0])
fig.axes[0].set_ylim([-20, 10]);

No, that’s accurate

df['Lng'].min()
-11


lol

df[df['Lng'] == -11]['Player']
492    Russell Wilson
Name: Player, dtype: object