Handling Missing Numeric Data

You always need to keep track of where you’ve got missing data and what to do about it.

Not only is it the right thing to do from a “build a scalable model” perspective, but sklearn will often throw its hands up in frustration if you don’t tell it what to do when it encounters the dreaded np.nan value.

The Data

Let’s load the iris dataset

import numpy as np
from sklearn.datasets import load_iris
data = load_iris()
X = data['data']
y = data['target']

Which is all non-null

(np.isnan(X)).any()
False

And then clumsily make the middle 50 rows all null values.

X[50:-50] = np.nan
np.isnan(X).sum(axis=0)
array([50, 50, 50, 50])
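
If we want a row-level view instead, the same np.isnan mask works along the other axis; as a quick sanity check, exactly the 50 rows we just blanked out should show up.

np.isnan(X).any(axis=1).sum()
50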

And while pandas might be clever enough to skip over NaN values

import pandas as pd

pd.DataFrame(X).mean()
0    5.797
1    3.196
2    3.508
3    1.135
dtype: float64
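
pandas only gets away with this because DataFrame.mean defaults to skipna=True; turn that off and it propagates the missing values instead.

pd.DataFrame(X).mean(skipna=False)
0   NaN
1   NaN
2   NaN
3   NaN
dtype: float64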

numpy, on the other hand, isn’t

X.mean()
nan
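
numpy does ship NaN-aware helpers (np.nanmean, np.nanmedian, and friends) if all you need is the statistic, but they’re strictly opt-in; the plain operations keep returning nan.

np.nanmean(X, axis=0)  # same column-wise means pandas reported above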

And by extension, neither is sklearn, which is built on top of the same underlying numpy arrays.

from sklearn.linear_model import LinearRegression
try:
    model = LinearRegression()
    model.fit(X, y)
    model.predict(X)
except ValueError:
    print("Doesn't work")
Doesn't work

Modelling with Null Values

Thankfully, sklearn has a helpful Imputer class to handle this hiccup for us.

from sklearn.preprocessing import Imputer

imputer = Imputer()
imputer.fit_transform(X).mean()
3.4090000000000003

By default, this will fill missing values with the mean of the column.

imputer.fit(X)
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
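
We can sanity-check that: filling a column with its own mean doesn’t move the mean, so the imputed column means should match the NaN-skipping means pandas computed earlier.

np.allclose(imputer.transform(X).mean(axis=0), pd.DataFrame(X).mean())
True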

But if we wanted to use, say, the median, it’d be as easy as passing that into the strategy argument at instantiation.

imputer = Imputer(strategy='median')
imputer.fit_transform(X).mean()
3.3601666666666667
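
One caveat if you’re on a newer version of scikit-learn: Imputer was deprecated in 0.20 and removed in 0.22 in favor of SimpleImputer, which lives in sklearn.impute. Here’s a minimal sketch of the same idea on a current install, wired straight into the regression with a Pipeline so the imputation happens automatically at fit and predict time.

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Fill NaNs with the column median, then fit the linear model on the imputed data
model = make_pipeline(SimpleImputer(strategy='median'), LinearRegression())
model.fit(X, y)
model.predict(X)[:5]

Bundling the imputer into the pipeline also means that, under cross-validation, the fill values are learned from the training folds only, which keeps the evaluation honest.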