# Handling Missing Numeric Data

You always need to keep track of where you've got missing data and what to do about it. Not only is it the right thing to do from a "build a scalable model" perspective, but `sklearn` will often throw its hands up in frustration if you don't tell it what to do when it encounters the dreaded `np.nan` value.

## The Data

Let's load the iris dataset:

```
import numpy as np
from sklearn.datasets import load_iris
```

```
data = load_iris()
X = data['data']
y = data['target']
```

Which is all non-null

`(np.isnan(X)).any()`

```
False
```

And then clumsily make the middle 50 rows all null values.

`X[50:-50] = np.nan`

`np.isnan(X).sum(axis=0)`

```
array([50, 50, 50, 50])
```
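The same `np.isnan` mask can be aggregated along either axis to tell you exactly where the holes are — a quick sketch, repeating the setup above:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris()['data']
X[50:-50] = np.nan

# Count missing values per column
col_counts = np.isnan(X).sum(axis=0)

# Row indices containing at least one NaN
bad_rows = np.where(np.isnan(X).any(axis=1))[0]
```

Summing over `axis=0` gives the per-column counts shown above, while `any(axis=1)` flags the affected rows.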

And while `pandas` might be clever enough to toss out NULL values

```
import pandas as pd
pd.DataFrame(X).mean()
```

```
0 5.797
1 3.196
2 3.508
3 1.135
dtype: float64
```

`numpy` isn't

`X[:].mean()`

```
nan
```
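To be fair, `numpy` can match the `pandas` behavior if you reach for its nan-aware aggregations — a quick sketch using `np.nanmean`:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris()['data']
X[50:-50] = np.nan

# Plain mean propagates NaN...
overall = X.mean()

# ...but the nan-aware variant skips missing entries, like pandas
col_means = np.nanmean(X, axis=0)
```

You just have to ask for that behavior explicitly, per call, rather than getting it for free.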

And by extension, neither is `sklearn`, which is all parked on top of the underlying `numpy` arrays.

`from sklearn.linear_model import LinearRegression`

```
try:
    model = LinearRegression()
    model.fit(X, y)
    model.predict(X)
except ValueError:
    print("Doesn't work")
```

```
Doesn't work
```
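Before reaching for imputation, it's worth noting the blunt alternative: just drop the incomplete rows with a boolean mask. A minimal sketch (not the approach taken below — you lose a third of the data this way):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

data = load_iris()
X, y = data['data'], data['target']
X[50:-50] = np.nan

# Keep only the rows with no missing values
mask = ~np.isnan(X).any(axis=1)
model = LinearRegression().fit(X[mask], y[mask])
preds = model.predict(X[mask])
```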

## Modelling with Null Values

Thankfully, `sklearn` has a helpful `SimpleImputer` class (older releases called it `Imputer` and housed it in `sklearn.preprocessing`) to handle this hiccup for us.

```
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
imputer.fit_transform(X).mean()
```

```
3.4090000000000003
```

By default, this will fill missing values with the mean of the column.

`imputer.fit(X)`

```
SimpleImputer()
```
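A quick sanity check: the fitted fill values are exposed on the imputer's `statistics_` attribute (shown here with the modern `SimpleImputer`), and they should line up with the nan-aware column means:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer

X = load_iris()['data']
X[50:-50] = np.nan

imputer = SimpleImputer()  # strategy='mean' by default
imputer.fit(X)

# The learned per-column fill values
fills = imputer.statistics_
```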

But if we wanted to use, say, the median, it'd be as easy as passing that into the `strategy` argument at instantiation.

`imputer = SimpleImputer(strategy='median')`

`imputer.fit_transform(X).mean()`

```
3.3601666666666667
```
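In practice, you'd usually chain the imputer and the model together in a `Pipeline`, so the fill values learned during `fit` get reused at prediction time — a sketch using the modern `SimpleImputer`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

data = load_iris()
X, y = data['data'], data['target']
X[50:-50] = np.nan

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('model', LinearRegression()),
])
pipe.fit(X, y)           # no NaN complaints this time
preds = pipe.predict(X)
```

This also keeps you honest on held-out data: the medians come from the training rows alone, rather than leaking in from the rows you're predicting on.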