You always need to keep track of where you’ve got missing data and what to do about it.
Not only is it the right thing to do from a “build a scalable model” perspective, but sklearn will often throw its hands up in frustration if you don’t tell it what to do when it encounters the dreaded NaN.
Let’s load the iris dataset:

import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
X = data['data']
y = data['target']
Which is all non-null.
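If you want to double-check that, a one-liner with numpy does it:

# True if any entry of X is NaN -- False for the freshly-loaded iris data
np.isnan(X).any()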
And then clumsily make the middle 50 rows all null values.
X[50:-50] = np.nan
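One way to confirm the damage is to count the NaNs per column:

# NaN count per column of X
np.isnan(X).sum(axis=0)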
array([50, 50, 50, 50])
pandas might be clever enough to toss out NULL values when computing summary statistics:
import pandas as pd

pd.DataFrame(X).mean()

0    5.797
1    3.196
2    3.508
3    1.135
dtype: float64
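numpy itself, by contrast, isn’t nearly as accommodating. A quick check (using the X with the injected NaNs from above) shows a plain column mean just propagating the NaN:

# numpy's mean has no NaN handling, so every column comes back as nan
X.mean(axis=0)

# np.nanmean is the NaN-aware version, which is essentially what pandas did above
np.nanmean(X, axis=0)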
And by extension, neither is sklearn, which is all parked on top of the underlying numpy arrays.
from sklearn.linear_model import LinearRegression

try:
    # fitting directly on data with NaNs raises a ValueError
    model = LinearRegression()
    model.fit(X, y)
    model.predict(X)
except ValueError:
    print("Doesn't work")
Modelling with Null Values
sklearn has a helpful Imputer class to handle this hiccup for us.

from sklearn.preprocessing import Imputer

imputer = Imputer()
imputer.fit(X)
By default, this will fill missing values with the mean of the column; you can see strategy='mean' among the fitted object’s parameters:
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
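From there, transform (or fit_transform to do both steps in one go) hands back a NaN-free copy of the data that the earlier regression is perfectly happy with. A rough sketch, reusing the imputer, X, y, and LinearRegression import from above:

# fill each NaN with the per-column mean learned during fit
X_imputed = imputer.transform(X)

# the column means match what pandas reported after skipping the NaNs
X_imputed.mean(axis=0)

# and the regression that choked earlier now fits without complaint
model = LinearRegression()
model.fit(X_imputed, y)
model.predict(X_imputed)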
But if we wanted to use, say, the median, it’d be as easy as passing that into the strategy argument at instantiation.
imputer = Imputer(strategy='median')
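One caveat if you’re on a newer sklearn: Imputer was deprecated in 0.20 and removed in 0.22 in favour of SimpleImputer, which lives in sklearn.impute and takes the same strategy argument. The equivalent call looks roughly like this:

# SimpleImputer replaces Imputer on sklearn >= 0.20; it is always column-wise,
# so there is no axis argument, and missing values default to np.nan
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)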