Sklearn Pipelines
If you’ve read the other notebooks under this header, you know how to do all kinds of data preprocessing using `sklearn` objects. And if you’ve been reading closely, you’ll notice that they all generally fit the same form. That’s no accident.
We can chain together successive preprocessing steps into one cohesive object. But doing so requires a bit of planning.
Tired of iris yet?
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
data = load_iris()
cols = list(data['feature_names']) + ['flower_name']
df = pd.DataFrame(np.c_[data['data'], data['target']],
                  columns=cols)
df.shape
(150, 5)
flowerNames = {0: 'setosa',
               1: 'versicolor',
               2: 'virginica'}
df['flower_name'] = df['flower_name'].map(flowerNames)
df.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | flower_name |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Now say that this is the format the data comes to us in, and we want to build a KNN classifier. Ultimately, that means our data needs to end up completely numeric, with no target label.
Therefore, we’re going to think about our preprocessing in two steps:
- Handling the numeric columns
- Handling the categorical columns
Numeric Columns
The first 4 columns of this dataset are all numeric, but there’s still preprocessing we should do to make sure the data plays well with our algorithm. Namely:
- Ensuring that there’s no missing data
- Scaling each feature
from sklearn.preprocessing import Imputer, StandardScaler

# Heads up: `Imputer` was replaced by `sklearn.impute.SimpleImputer` in
# sklearn 0.20 and removed in 0.22; swap it in if you're on a newer version
imputer = Imputer(strategy='median')
scaler = StandardScaler()
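iris has no missing values, so the imputer is just insurance here. To see what it would actually do, a tiny made-up example (the values are mine, not from the dataset):
toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
# The column median of [1, 2, 4] is 2, so the NaN becomes 2.0
Imputer(strategy='median').fit_transform(toy)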
And then we can pipeline each of these calls into the next.
numericData = df.values[:, :4]
scaledNumericData = scaler.fit_transform(imputer.fit_transform(numericData))
scaledNumericData[:5]
array([[-0.90068117,  1.03205722, -1.3412724 , -1.31297673],
       [-1.14301691, -0.1249576 , -1.3412724 , -1.31297673],
       [-1.38535265,  0.33784833, -1.39813811, -1.31297673],
       [-1.50652052,  0.10644536, -1.2844067 , -1.31297673],
       [-1.02184904,  1.26346019, -1.3412724 , -1.31297673]])
But this is gross, hard to read, and hard to maintain.
Instead, `sklearn` provides a really slick `Pipeline` class that handles this.
from sklearn.pipeline import Pipeline
num_pipeline = Pipeline([
    ('imputer', Imputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])
A few things to note:
- This executes sequentially from top to bottom, so be deliberate about your flow
- Each intermediate step must have a `fit_transform` method that does its transformation and pushes the result to the next step
- The final estimator only needs a `fit()` method
- The names in each tuple are for clarity and debugging. You can call them whatever, though as shown below, they do more than just label things.
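Concretely, `sklearn` uses those names to let you address individual steps:
# Pull a configured step back out by name
num_pipeline.named_steps['imputer']
# The same names drive hyperparameter access via the <step>__<param>
# convention, e.g. num_pipeline.set_params(imputer__strategy='mean'),
# which is what makes pipelines play nicely with GridSearchCV.
# (Left as a comment so the equality check below still uses the median.)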
Now, all together
smarterScaledNumericData = num_pipeline.fit_transform(numericData)
np.all(smarterScaledNumericData == scaledNumericData)
True
Categorical Columns
There’s a notebook on the specifics if you haven’t read it already; this pipelines the same steps.
categ_data = df.values[:, -1]
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
label_encoder = LabelEncoder()
hot_encoder = LabelBinarizer()
hot_encoder.fit_transform(label_encoder.fit_transform(categ_data))[:5]
array([[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0]])
And same deal, just pipeline it.
# We have to extend the `LabelBinarizer` class: `Pipeline` passes both
# X and y to fit/fit_transform, but `LabelBinarizer` only accepts the
# labels, so we wrap it. The CategoricalEncoder class in 0.20.0 will
# handle this.
from sklearn.base import TransformerMixin

class MyLabelBinarizer(TransformerMixin):
    def __init__(self, *args, **kwargs):
        self.encoder = LabelBinarizer(*args, **kwargs)

    def fit(self, x, y=0):
        self.encoder.fit(x)
        return self

    def transform(self, x, y=0):
        return self.encoder.transform(x)
categ_pipeline = Pipeline([
    ('label_encode', MyLabelBinarizer()),
])
categ_pipeline.fit_transform(categ_data)[:5]
array([[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0]])
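As a quick sanity check (my addition, reaching through the wrapper’s `encoder` attribute), `LabelBinarizer` can walk those rows back to the original strings:
onehot = categ_pipeline.fit_transform(categ_data)
# Round-trip: one-hot rows back to the original labels
categ_pipeline.named_steps['label_encode'].encoder.inverse_transform(onehot[:5])
# -> five 'setosa' labels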
Pipelining the Pipelines
This has all been pretty slick so far, but we still found ourselves manually pulling the dataset apart above. Our ideal state is one single object that we can hand the data as we get it, and that spits it back out prepared for the model.
We can accomplish this by:
- Building a preprocessing step that will split the `DataFrame` into categorical and numeric frames
- Using the `FeatureUnion` object to stitch the two pipelines together at the end
from sklearn.base import BaseEstimator
class DataFrameSplitter(BaseEstimator, TransformerMixin):
    def __init__(self, attributeNames):
        self.attributeNames = attributeNames

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attributeNames].values
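Worth a quick spot check before wiring it into anything, to see exactly what it hands downstream:
# Selecting just the label column yields a plain (n, 1) numpy array,
# which is what the next pipeline step expects
DataFrameSplitter(['flower_name']).fit_transform(df)[:3]
# -> array([['setosa'], ['setosa'], ['setosa']], dtype=object)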
We can get at the categorical and numeric columns with a bit of set logic.
colNames = set(df.dtypes.index)
colNames
{'flower_name',
 'petal length (cm)',
 'petal width (cm)',
 'sepal length (cm)',
 'sepal width (cm)'}
df.dtypes
sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
flower_name           object
dtype: object
numericColumns = set(df.dtypes[df.dtypes == 'float64'].index)
categColumns = colNames - numericColumns
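Which leaves exactly one set per pipeline:
categColumns
# -> {'flower_name'}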
Feature Union
We build the two parts separately
num_attributes = list(numericColumns)
categ_attributes = list(categColumns)
num_pipeline = Pipeline([
    ('selector', DataFrameSplitter(num_attributes)),
    ('imputer', Imputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSplitter(categ_attributes)),
    ('label_encode', MyLabelBinarizer()),
])
Then merge them together. This runs the two in parallel and `np.c_`’s the data together when it’s done.
from sklearn.pipeline import FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])
final_data = full_pipeline.fit_transform(df)
final_data.shape
(150, 7)
final_data[:5]
array([[ 1.03205722, -1.3412724 , -0.90068117, -1.31297673,  1.        ,
         0.        ,  0.        ],
       [-0.1249576 , -1.3412724 , -1.14301691, -1.31297673,  1.        ,
         0.        ,  0.        ],
       [ 0.33784833, -1.39813811, -1.38535265, -1.31297673,  1.        ,
         0.        ,  0.        ],
       [ 0.10644536, -1.2844067 , -1.50652052, -1.31297673,  1.        ,
         0.        ,  0.        ],
       [ 1.26346019, -1.3412724 , -1.02184904, -1.31297673,  1.        ,
         0.        ,  0.        ]])
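And that’s the payoff: one fitted object we can keep handing raw frames to. (Note the numeric columns come out in set-iteration order, so they’re shuffled relative to the original frame.) To close the loop on the KNN idea from the top, a minimal sketch; the feature/target split here is my choice, not anything the pipeline dictates:
from sklearn.neighbors import KNeighborsClassifier

# The first 4 columns of final_data are the scaled measurements; the
# trailing one-hot label columns were for the pipeline demo, so we skip
# them as features and take the target straight from the original frame
knn = KNeighborsClassifier()
knn.fit(final_data[:, :4], df['flower_name'])

# The fitted pipeline transforms fresh frames the same way
knn.predict(full_pipeline.transform(df[:5])[:, :4])
# -> presumably five 'setosa' predictions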