Standardization

Standardizing your data before starting in on machine learning routines is paramount. Not only does it help your algorithms converge faster (gradient steps behave far better when every feature lives on a similar scale), but it also prevents features that happen to be scaled arbitrarily larger from having an inflated weight on whatever your model winds up learning.

E.g. a “0, 1, or 2 car garage” probably has more predictive power on a home's value than a “0-10,000 jelly beans could fit in the master bathtub” count. Probably.

Getting the Data

Loading one of the gimme datasets from scikit-learn

from sklearn.datasets import load_boston

data = load_boston()  # Boston housing: 506 rows, 13 numeric features
X = data['data']
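One caveat if you're following along on a newer install: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. The scaling mechanics below don't care which numeric dataset they run on, so something like the California housing data makes a fine stand-in (though the printed tables below show the Boston numbers):

from sklearn.datasets import fetch_california_housing

# Stand-in for load_boston on scikit-learn >= 1.2; any numeric
# feature matrix behaves the same for the scaling demos below
data = fetch_california_housing()
X = data['data']  # shape (20640, 8)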

Stuffing into a pandas DataFrame for easier inspection

import pandas as pd

df = pd.DataFrame(X, columns=data['feature_names'])
df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

And if you look at each attribute, there's huge variation in:

  • The relative difference between max and min
  • The numeric scale of the attribute

df.describe().T[['max', 'min']]
max min
CRIM 88.9762 0.00632
ZN 100.0000 0.00000
INDUS 27.7400 0.46000
CHAS 1.0000 0.00000
NOX 0.8710 0.38500
RM 8.7800 3.56100
AGE 100.0000 2.90000
DIS 12.1265 1.12960
RAD 24.0000 1.00000
TAX 711.0000 187.00000
PTRATIO 22.0000 12.60000
B 396.9000 0.32000
LSTAT 37.9700 1.73000

Approaches

And so we have two ways of resolving this mismatch in scale.

Normalization

This essentially ensures that each column's values land between 0 and 1.

We achieve this by finding the range of values in each column

spread = df.max() - df.min()
spread
CRIM        88.96988
ZN         100.00000
INDUS       27.28000
CHAS         1.00000
NOX          0.48600
RM           5.21900
AGE         97.10000
DIS         10.99690
RAD         23.00000
TAX        524.00000
PTRATIO      9.40000
B          396.58000
LSTAT       36.24000
dtype: float64

And then, for each value, subtracting its column's minimum and dividing by that column's spread.

normed_df = (df - df.min()) / spread
normed_df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.000000 0.18 0.067815 0.0 0.314815 0.577505 0.641607 0.269203 0.000000 0.208015 0.287234 1.000000 0.089680
1 0.000236 0.00 0.242302 0.0 0.172840 0.547998 0.782698 0.348962 0.043478 0.104962 0.553191 1.000000 0.204470
2 0.000236 0.00 0.242302 0.0 0.172840 0.694386 0.599382 0.348962 0.043478 0.104962 0.553191 0.989737 0.063466
3 0.000293 0.00 0.063050 0.0 0.150206 0.658555 0.441813 0.448545 0.086957 0.066794 0.648936 0.994276 0.033389
4 0.000705 0.00 0.063050 0.0 0.150206 0.687105 0.528321 0.448545 0.086957 0.066794 0.648936 1.000000 0.099338
normed_df.describe().T[['min', 'max']]
min max
CRIM 0.0 1.0
ZN 0.0 1.0
INDUS 0.0 1.0
CHAS 0.0 1.0
NOX 0.0 1.0
RM 0.0 1.0
AGE 0.0 1.0
DIS 0.0 1.0
RAD 0.0 1.0
TAX 0.0 1.0
PTRATIO 0.0 1.0
B 0.0 1.0
LSTAT 0.0 1.0

However, this handles outlier data… poorly
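A toy sketch of why (the numbers here are made up for illustration): one extreme value stretches a column's max, so min-max scaling squashes every other observation into a tiny sliver of [0, 1].

import numpy as np

col = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # one wild outlier
normed_col = (col - col.min()) / (col.max() - col.min())
print(normed_col)
# [0.       0.001001 0.002002 0.003003 1.      ] -- the bulk is crushed near 0

The housing data shows the same effect once you plot it: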

%matplotlib inline

_ = normed_df.boxplot(figsize=(18, 10))

[boxplot of the normalized features]

Standardization

Instead, we’ll try standardization, which subtracts each column's mean from each value and then divides by that column's standard deviation.

means = df.mean()
means
CRIM         3.593761
ZN          11.363636
INDUS       11.136779
CHAS         0.069170
NOX          0.554695
RM           6.284634
AGE         68.574901
DIS          3.795043
RAD          9.549407
TAX        408.237154
PTRATIO     18.455534
B          356.674032
LSTAT       12.653063
dtype: float64
stand_df = (df - means) / df.std()
stand_df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 -0.417300 0.284548 -1.286636 -0.272329 -0.144075 0.413263 -0.119895 0.140075 -0.981871 -0.665949 -1.457558 0.440616 -1.074499
1 -0.414859 -0.487240 -0.592794 -0.272329 -0.739530 0.194082 0.366803 0.556609 -0.867024 -0.986353 -0.302794 0.440616 -0.491953
2 -0.414861 -0.487240 -0.592794 -0.272329 -0.739530 1.281446 -0.265549 0.556609 -0.867024 -0.986353 -0.302794 0.396035 -1.207532
3 -0.414270 -0.487240 -1.305586 -0.272329 -0.834458 1.015298 -0.809088 1.076671 -0.752178 -1.105022 0.112920 0.415751 -1.360171
4 -0.410003 -0.487240 -1.305586 -0.272329 -0.834458 1.227362 -0.510674 1.076671 -0.752178 -1.105022 0.112920 0.440616 -1.025487

This, of course, leads to data that falls outside our neat [0, 1] range

stand_df.describe().T[['min', 'max']]
min max
CRIM -0.417300 9.931906
ZN -0.487240 3.800473
INDUS -1.556302 2.420170
CHAS -0.272329 3.664771
NOX -1.464433 2.729645
RM -3.876413 3.551530
AGE -2.333128 1.116390
DIS -1.265817 3.956602
RAD -0.981871 1.659603
TAX -1.312691 1.796416
PTRATIO -2.704703 1.637208
B -3.903331 0.440616
LSTAT -1.529613 3.545262

But it does a… marginally better job of handling outliers

_ = stand_df.boxplot(figsize=(18, 10))

[boxplot of the standardized features]

And forces each variable onto a useful unit scale: mean 0 (the e-16 values below are floating-point noise) and standard deviation 1

stand_df.describe().T[['mean', 'std']]
mean std
CRIM 1.144232e-16 1.0
ZN 3.466704e-16 1.0
INDUS -3.016965e-15 1.0
CHAS 3.999875e-16 1.0
NOX 3.563575e-15 1.0
RM -1.149882e-14 1.0
AGE -1.158274e-15 1.0
DIS 7.308603e-16 1.0
RAD -1.068535e-15 1.0
TAX 6.534079e-16 1.0
PTRATIO -1.084420e-14 1.0
B 8.117354e-15 1.0
LSTAT -6.494585e-16 1.0

Using Scikit-Learn

Of course, if we weren't interested in taking our numpy data, piping it into a pandas.DataFrame, doing our transformations, and then .values-ing our way back to numpy, sklearn provides a class to handle this directly.

# Again, for demonstration
df.describe().T[['mean', 'std', 'min', 'max']]
mean std min max
CRIM 3.593761 8.596783 0.00632 88.9762
ZN 11.363636 23.322453 0.00000 100.0000
INDUS 11.136779 6.860353 0.46000 27.7400
CHAS 0.069170 0.253994 0.00000 1.0000
NOX 0.554695 0.115878 0.38500 0.8710
RM 6.284634 0.702617 3.56100 8.7800
AGE 68.574901 28.148861 2.90000 100.0000
DIS 3.795043 2.105710 1.12960 12.1265
RAD 9.549407 8.707259 1.00000 24.0000
TAX 408.237154 168.537116 187.00000 711.0000
PTRATIO 18.455534 2.164946 12.60000 22.0000
B 356.674032 91.294864 0.32000 396.9000
LSTAT 12.653063 7.141062 1.73000 37.9700
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)
StandardScaler(copy=True, with_mean=True, with_std=True)
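(Under the hood, fit just records each column's mean and standard deviation, which land on scaler.mean_ and scaler.scale_; scaler.fit_transform(X) collapses the two calls into one.)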
pd.DataFrame(scaler.transform(X)).describe().T[['mean', 'std', 'min', 'max']]
mean std min max
0 6.340997e-17 1.00099 -0.417713 9.941735
1 -6.343191e-16 1.00099 -0.487722 3.804234
2 -2.682911e-15 1.00099 -1.557842 2.422565
3 4.701992e-16 1.00099 -0.272599 3.668398
4 2.490322e-15 1.00099 -1.465882 2.732346
5 -1.145230e-14 1.00099 -3.880249 3.555044
6 -1.407855e-15 1.00099 -2.335437 1.117494
7 9.210902e-16 1.00099 -1.267069 3.960518
8 5.441409e-16 1.00099 -0.982843 1.661245
9 -8.868619e-16 1.00099 -1.313990 1.798194
10 -9.205636e-15 1.00099 -2.707379 1.638828
11 8.163101e-15 1.00099 -3.907193 0.441052
12 -3.370163e-16 1.00099 -1.531127 3.548771
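One quirk worth calling out: the stds above read 1.00099 rather than exactly 1. StandardScaler divides by the population standard deviation (ddof=0), while pandas' .std() defaults to the sample standard deviation (ddof=1); with 506 rows, the ratio works out to sqrt(506/505) ≈ 1.00099.

And the normalization approach gets the same one-class treatment in sklearn. A minimal sketch, mirroring our manual (df - df.min()) / spread arithmetic:

from sklearn.preprocessing import MinMaxScaler

# Learns each column's min/max and rescales everything to [0, 1],
# the same math as the manual spread computation earlier
mm_scaler = MinMaxScaler()
X_normed = mm_scaler.fit_transform(X)

Either way, in a real pipeline you'd fit the scaler on the training split only and reuse it to transform the test split, so information about the test data never leaks into the scaling.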