Handling Missing Numeric Data
You always need to keep track of where you've got missing data and decide what to do about it.
Not only is that the right thing to do when building a scalable model, but sklearn
will often throw its hands up in frustration if you don't tell it what to do when it encounters the dreaded np.nan
value.
The Data
Let’s load the iris dataset
import numpy as np
from sklearn.datasets import load_iris
data = load_iris()
X = data['data']
y = data['target']
Which is all non-null
(np.isnan(X)).any()
False
And then clumsily make the middle 50 rows all null values.
X[50:-50] = np.nan
sum(np.isnan(X))
array([50, 50, 50, 50])
And while pandas
might be clever enough to toss out NULL values
import pandas as pd
pd.DataFrame(X).mean()
0 5.797
1 3.196
2 3.508
3 1.135
dtype: float64
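That NaN-skipping behavior is pandas' default, and it's controlled by the skipna argument to the aggregation methods; setting it to False makes pandas behave like raw numpy. A quick sketch, rebuilding the same nulled-out iris array:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# same setup as above: null out the middle 50 rows
X = load_iris()['data']
X[50:-50] = np.nan

df = pd.DataFrame(X)

# default: NaNs are dropped before averaging
print(df.mean())

# skipna=False propagates the NaNs, numpy-style
print(df.mean(skipna=False))
```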
numpy
isn’t
X.mean()
nan
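To be fair, numpy does ship NaN-aware aggregations like np.nanmean, if you want the pandas-style answer without leaving numpy. A quick sketch on the same data:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris()['data']
X[50:-50] = np.nan

# plain mean propagates the NaNs...
print(X.mean())

# ...but nanmean ignores them, per column
col_means = np.nanmean(X, axis=0)
print(col_means)
```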
And by extension, neither is sklearn
, which is all parked on top of the underlying numpy
arrays.
from sklearn.linear_model import LinearRegression
try:
    model = LinearRegression()
    model.fit(X, y)
    model.predict(X)
except ValueError:
    print("Doesn't work")
Doesn't work
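For the curious, the exception sklearn raises here is a ValueError from its input validation; catching it specifically (rather than with a bare except) lets you see the actual complaint. A sketch:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

data = load_iris()
X, y = data['data'], data['target']
X[50:-50] = np.nan

err = None
try:
    LinearRegression().fit(X, y)
except ValueError as e:
    err = e

# the message (wording varies by sklearn version) mentions the NaN input
print(err)
```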
Modelling with Null Values
Thankfully, sklearn
has a helpful Imputer
class to handle this hiccup for us.
from sklearn.preprocessing import Imputer
imputer = Imputer()
imputer.fit_transform(X).mean()
3.4090000000000003
By default, this will fill missing values with the mean of the column.
imputer.fit(X)
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
But if we wanted to use, say, the median, it'd be as easy as passing that into the strategy
argument at instantiation.
imputer = Imputer(strategy='median')
imputer.fit_transform(X).mean()
3.3601666666666667
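Worth noting: in newer versions of sklearn (0.20+), Imputer was deprecated and then removed in favor of SimpleImputer, which lives in sklearn.impute and defaults to np.nan (rather than the string 'NaN') for missing_values. The same mean/median workflow, sketched against the modern API:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer

X = load_iris()['data']
X[50:-50] = np.nan

# strategy='mean' is the default, just as with the old Imputer
filled = SimpleImputer(strategy='median').fit_transform(X)

# every previously-null row now holds the NaN-ignoring column medians
print(filled[75])
```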