Standardization

Standardizing your data before starting in on machine learning routines is paramount. Not only does it help your algorithms converge faster (gradient steps behave far better when every feature lives on a similar scale), but it also prevents features that happen to be scaled arbitrarily larger from having an inflated weight on whatever your model winds up learning.

E.g. a “0, 1, or 2 car garage” probably has more predictive power on a home's value than a “0-10,000 jelly beans could fit in the master bathtub” count. Probably.

Getting the Data

Loading one of the gimme datasets from scikit-learn

from sklearn.datasets import load_boston

data = load_boston()  # Boston housing: 506 rows, 13 numeric features
X = data['data']
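One caveat if you're following along on a newer install: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. The scaling mechanics below don't care which numeric dataset they run on, so something like the California housing data makes a fine stand-in (though the printed tables below show the Boston numbers):

from sklearn.datasets import fetch_california_housing

# Stand-in for load_boston on scikit-learn >= 1.2; any numeric
# feature matrix behaves the same for the scaling demos below
data = fetch_california_housing()
X = data['data']  # shape (20640, 8)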

Stuffing into a pandas DataFrame for easier inspection

import pandas as pd

df = pd.DataFrame(X, columns=data['feature_names'])
df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

And if you look at each attribute, there's huge variation in:

  • The relative difference between max and min
  • The numeric scale of the attribute

df.describe().T[['max', 'min']]
max min
CRIM 88.9762 0.00632
ZN 100.0000 0.00000
INDUS 27.7400 0.46000
CHAS 1.0000 0.00000
NOX 0.8710 0.38500
RM 8.7800 3.56100
AGE 100.0000 2.90000
DIS 12.1265 1.12960
RAD 24.0000 1.00000
TAX 711.0000 187.00000
PTRATIO 22.0000 12.60000
B 396.9000 0.32000
LSTAT 37.9700 1.73000

Approaches

And so we have two ways of resolving this mismatch in scale.

Normalization

This essentially ensures that each column's values land between 0 and 1.

We achieve this by finding the range of values in each column

spread = df.max() - df.min()
spread
CRIM        88.96988
ZN         100.00000
INDUS       27.28000
CHAS         1.00000
NOX          0.48600
RM           5.21900
AGE         97.10000
DIS         10.99690
RAD         23.00000
TAX        524.00000
PTRATIO      9.40000
B          396.58000
LSTAT       36.24000
dtype: float64

And then, for each value, subtracting its column's minimum and dividing by that column's spread.

normed_df = (df - df.min()) / spread
normed_df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.000000 0.18 0.067815 0.0 0.314815 0.577505 0.641607 0.269203 0.000000 0.208015 0.287234 1.000000 0.089680
1 0.000236 0.00 0.242302 0.0 0.172840 0.547998 0.782698 0.348962 0.043478 0.104962 0.553191 1.000000 0.204470
2 0.000236 0.00 0.242302 0.0 0.172840 0.694386 0.599382 0.348962 0.043478 0.104962 0.553191 0.989737 0.063466
3 0.000293 0.00 0.063050 0.0 0.150206 0.658555 0.441813 0.448545 0.086957 0.066794 0.648936 0.994276 0.033389
4 0.000705 0.00 0.063050 0.0 0.150206 0.687105 0.528321 0.448545 0.086957 0.066794 0.648936 1.000000 0.099338
normed_df.describe().T[['min', 'max']]
min max
CRIM 0.0 1.0
ZN 0.0 1.0
INDUS 0.0 1.0
CHAS 0.0 1.0
NOX 0.0 1.0
RM 0.0 1.0
AGE 0.0 1.0
DIS 0.0 1.0
RAD 0.0 1.0
TAX 0.0 1.0
PTRATIO 0.0 1.0
B 0.0 1.0
LSTAT 0.0 1.0

However, this handles outlier data… poorly
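A toy sketch of why (the numbers here are made up for illustration): one extreme value stretches a column's max, so min-max scaling squashes every other observation into a tiny sliver of [0, 1].

import numpy as np

col = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # one wild outlier
normed_col = (col - col.min()) / (col.max() - col.min())
print(normed_col)
# [0.       0.001001 0.002002 0.003003 1.      ] -- the bulk is crushed near 0

The housing data shows the same effect once you plot it: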

%matplotlib inline

_ = normed_df.boxplot(figsize=(18, 10))

[boxplot of the normalized features]

Standardization

Instead, we’ll try standardization, which subtracts each column's mean from each value and then divides by that column's standard deviation.

means = df.mean()
means
CRIM         3.593761
ZN          11.363636
INDUS       11.136779
CHAS         0.069170
NOX          0.554695
RM           6.284634
AGE         68.574901
DIS          3.795043
RAD          9.549407
TAX        408.237154
PTRATIO     18.455534
B          356.674032
LSTAT       12.653063
dtype: float64
stand_df = (df - means) / df.std()
stand_df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 -0.417300 0.284548 -1.286636 -0.272329 -0.144075 0.413263 -0.119895 0.140075 -0.981871 -0.665949 -1.457558 0.440616 -1.074499
1 -0.414859 -0.487240 -0.592794 -0.272329 -0.739530 0.194082 0.366803 0.556609 -0.867024 -0.986353 -0.302794 0.440616 -0.491953
2 -0.414861 -0.487240 -0.592794 -0.272329 -0.739530 1.281446 -0.265549 0.556609 -0.867024 -0.986353 -0.302794 0.396035 -1.207532
3 -0.414270 -0.487240 -1.305586 -0.272329 -0.834458 1.015298 -0.809088 1.076671 -0.752178 -1.105022 0.112920 0.415751 -1.360171
4 -0.410003 -0.487240 -1.305586 -0.272329 -0.834458 1.227362 -0.510674 1.076671 -0.752178 -1.105022 0.112920 0.440616 -1.025487

This, of course, leads to data that falls outside our neat [0, 1] range

stand_df.describe().T[['min', 'max']]
min max
CRIM -0.417300 9.931906
ZN -0.487240 3.800473
INDUS -1.556302 2.420170
CHAS -0.272329 3.664771
NOX -1.464433 2.729645
RM -3.876413 3.551530
AGE -2.333128 1.116390
DIS -1.265817 3.956602
RAD -0.981871 1.659603
TAX -1.312691 1.796416
PTRATIO -2.704703 1.637208
B -3.903331 0.440616
LSTAT -1.529613 3.545262

But it does a… marginally better job of handling outliers

_ = stand_df.boxplot(figsize=(18, 10))

[boxplot of the standardized features]

And forces each variable onto a useful unit scale: mean 0 (the e-16 values below are floating-point noise) and standard deviation 1

stand_df.describe().T[['mean', 'std']]
mean std
CRIM 1.144232e-16 1.0
ZN 3.466704e-16 1.0
INDUS -3.016965e-15 1.0
CHAS 3.999875e-16 1.0
NOX 3.563575e-15 1.0
RM -1.149882e-14 1.0
AGE -1.158274e-15 1.0
DIS 7.308603e-16 1.0
RAD -1.068535e-15 1.0
TAX 6.534079e-16 1.0
PTRATIO -1.084420e-14 1.0
B 8.117354e-15 1.0
LSTAT -6.494585e-16 1.0

Using Scikit-Learn

Of course, if we weren't interested in taking our numpy data, piping it into a pandas.DataFrame, doing our transformations, and then .values-ing our way back to numpy, sklearn provides a class to handle this directly.

# Again, for demonstration
df.describe().T[['mean', 'std', 'min', 'max']]
mean std min max
CRIM 3.593761 8.596783 0.00632 88.9762
ZN 11.363636 23.322453 0.00000 100.0000
INDUS 11.136779 6.860353 0.46000 27.7400
CHAS 0.069170 0.253994 0.00000 1.0000
NOX 0.554695 0.115878 0.38500 0.8710
RM 6.284634 0.702617 3.56100 8.7800
AGE 68.574901 28.148861 2.90000 100.0000
DIS 3.795043 2.105710 1.12960 12.1265
RAD 9.549407 8.707259 1.00000 24.0000
TAX 408.237154 168.537116 187.00000 711.0000
PTRATIO 18.455534 2.164946 12.60000 22.0000
B 356.674032 91.294864 0.32000 396.9000
LSTAT 12.653063 7.141062 1.73000 37.9700
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)
StandardScaler(copy=True, with_mean=True, with_std=True)
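(Under the hood, fit just records each column's mean and standard deviation, which land on scaler.mean_ and scaler.scale_; scaler.fit_transform(X) collapses the two calls into one.)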
pd.DataFrame(scaler.transform(X)).describe().T[['mean', 'std', 'min', 'max']]
mean std min max
0 6.340997e-17 1.00099 -0.417713 9.941735
1 -6.343191e-16 1.00099 -0.487722 3.804234
2 -2.682911e-15 1.00099 -1.557842 2.422565
3 4.701992e-16 1.00099 -0.272599 3.668398
4 2.490322e-15 1.00099 -1.465882 2.732346
5 -1.145230e-14 1.00099 -3.880249 3.555044
6 -1.407855e-15 1.00099 -2.335437 1.117494
7 9.210902e-16 1.00099 -1.267069 3.960518
8 5.441409e-16 1.00099 -0.982843 1.661245
9 -8.868619e-16 1.00099 -1.313990 1.798194
10 -9.205636e-15 1.00099 -2.707379 1.638828
11 8.163101e-15 1.00099 -3.907193 0.441052
12 -3.370163e-16 1.00099 -1.531127 3.548771
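One quirk worth calling out: the stds above read 1.00099 rather than exactly 1. StandardScaler divides by the population standard deviation (ddof=0), while pandas' .std() defaults to the sample standard deviation (ddof=1); with 506 rows, the ratio works out to sqrt(506/505) ≈ 1.00099.

And the normalization approach gets the same one-class treatment in sklearn. A minimal sketch, mirroring our manual (df - df.min()) / spread arithmetic:

from sklearn.preprocessing import MinMaxScaler

# Learns each column's min/max and rescales everything to [0, 1],
# the same math as the manual spread computation earlier
mm_scaler = MinMaxScaler()
X_normed = mm_scaler.fit_transform(X)

Either way, in a real pipeline you'd fit the scaler on the training split only and reuse it to transform the test split, so information about the test data never leaks into the scaling.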