Sklearn Pipelines

If you’ve read the other notebooks under this header, you know how to do all kinds of data preprocessing using sklearn objects. And if you’ve been reading closely, you’ll have noticed that they all generally fit the same form. That’s no accident.

We can chain together successive preprocessing steps into one cohesive object. But doing so requires a bit of planning.

Tired of iris yet?

from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
data = load_iris()

cols = list(data['feature_names']) + ['flower_name']

df = pd.DataFrame(np.c_[data['data'], data['target']],
                  columns=cols)
df.shape
(150, 5)
flowerNames = {0: 'setosa',
               1: 'versicolor',
               2: 'virginica'}
df['flower_name'] = df['flower_name'].map(flowerNames)
df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) flower_name
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Now say that this is the format the data comes to us in and we want to build a KNN classifier. Ultimately, that means our data needs to end up completely numeric, with no string labels.

Therefore, we’re going to think about our preprocessing in two steps:

  • Handling the numeric columns
  • Handling the categorical columns

Numeric Columns

The first 4 columns of this dataset are all numeric, but there’s still preprocessing that we should do to ensure that it plays well with our algorithm. Namely:

  • Ensuring that there’s no missing data
  • Scaling each feature
from sklearn.preprocessing import Imputer, StandardScaler

# Note: in sklearn >= 0.20, Imputer was replaced by
# SimpleImputer, which lives at sklearn.impute
imputer = Imputer(strategy='median')
scaler = StandardScaler()

And then we can pipeline each of these calls into the next.

numericData = df.values[:, :4]
scaledNumericData = scaler.fit_transform(imputer.fit_transform(numericData))
scaledNumericData[:5]
array([[-0.90068117,  1.03205722, -1.3412724 , -1.31297673],
       [-1.14301691, -0.1249576 , -1.3412724 , -1.31297673],
       [-1.38535265,  0.33784833, -1.39813811, -1.31297673],
       [-1.50652052,  0.10644536, -1.2844067 , -1.31297673],
       [-1.02184904,  1.26346019, -1.3412724 , -1.31297673]])

But this is gross, hard to read, and hard to maintain.

Instead, sklearn provides a really slick Pipeline class that handles this.

from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ('imputer', Imputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])

A few things to note:

  1. The steps execute sequentially from top to bottom, so be deliberate about your flow
  2. Every step but the last must have fit and transform methods (sklearn calls fit_transform to do the transformation and push the result to the next step)
  3. The final estimator only needs a fit() method
  4. The names in each tuple are how you reference steps later (e.g. in grid-search parameter strings), so call them whatever is clearest
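Those step names come into play when you want to pull a piece of the pipeline back out, or when you tack the model itself on as the final step. A quick sketch of both (this uses the modern `SimpleImputer` import, so it assumes sklearn >= 0.20 — the `Imputer` used elsewhere in this notebook is its pre-0.20 name):

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Same two preprocessing steps, plus the model as the final step
knn_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5)),  # final step: only needs fit()
])

knn_pipeline.fit(X, y)

# The tuple names aren't just labels -- they're how you address steps later
scaler = knn_pipeline.named_steps['std_scaler']
print(knn_pipeline.score(X, y))
```

Calling fit() on the whole pipeline fit_transforms every preprocessing step and then fits the classifier on the result, so there’s no intermediate bookkeeping on our end.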

Now, all together

smarterScaledNumericData = num_pipeline.fit_transform(numericData)
np.all(smarterScaledNumericData == scaledNumericData)
True

Categorical Columns

There’s a notebook you can read on the specifics if you haven’t already; this just pipelines the same steps.

categ_data = df.values[:, -1]
from sklearn.preprocessing import LabelEncoder, LabelBinarizer

label_encoder = LabelEncoder()
hot_encoder = LabelBinarizer()
hot_encoder.fit_transform(label_encoder.fit_transform(categ_data))[:5]
array([[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0]])

And same deal, just pipeline it.

# We have to wrap the `LabelBinarizer` class because Pipeline
# calls fit_transform(X, y) and LabelBinarizer's version only
# accepts one argument; the CategoricalEncoder class in 0.20.0
# will handle this

from sklearn.base import TransformerMixin

class MyLabelBinarizer(TransformerMixin):
    def __init__(self, *args, **kwargs):
        self.encoder = LabelBinarizer(*args, **kwargs)
    def fit(self, x, y=0):
        self.encoder.fit(x)
        return self
    def transform(self, x, y=0):
        return self.encoder.transform(x)
categ_pipeline = Pipeline([
    ('label_encode', MyLabelBinarizer()),
])
categ_pipeline.fit_transform(categ_data)[:5]
array([[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0]])
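Worth noting: the wrapper above is a workaround for its era. If you’re on sklearn >= 0.20, OneHotEncoder accepts string columns directly, so the LabelEncoder/LabelBinarizer dance collapses into one step. A sketch:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array(['setosa', 'setosa', 'versicolor', 'virginica'])

# OneHotEncoder wants a 2D array: one row per sample, one column per feature.
# fit_transform returns a sparse matrix by default, hence the toarray()
enc = OneHotEncoder()
one_hot = enc.fit_transform(labels.reshape(-1, 1)).toarray()
print(enc.categories_[0])  # categories come out sorted alphabetically
print(one_hot)
```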

Pipelining the Pipelines

This has all been pretty slick so far, but we still found ourselves manually pulling the dataset apart above.

Ideally, we’d have one single object that we can hand the data exactly as we receive it, and that spits out data ready for the model.

We can accomplish this by:

  • Building a preprocessing step that will split the DataFrame into categorical and numeric frames
  • Using the FeatureUnion object to stitch the two pipelines together at the end
from sklearn.base import BaseEstimator

class DataFrameSplitter(BaseEstimator, TransformerMixin):
    def __init__(self, attributeNames):
        self.attributeNames = attributeNames
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attributeNames].values

We can get at the categorical and numeric columns with a bit of set logic. (One caveat: sets don’t preserve order, so the numeric columns may come out shuffled relative to the DataFrame — you can see this in the column order of the final output.)

colNames = set(df.dtypes.index)
colNames
{'flower_name',
 'petal length (cm)',
 'petal width (cm)',
 'sepal length (cm)',
 'sepal width (cm)'}
df.dtypes
sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
flower_name           object
dtype: object
numericColumns = set(df.dtypes[df.dtypes == 'float64'].index)
categColumns = colNames - numericColumns
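If you’d rather lean on pandas than on set arithmetic, select_dtypes gets you the same split with column order preserved. A sketch (rebuilding the iris frame so it stands alone):

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['flower_name'] = pd.Series(data['target']).map(
    {0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Numeric columns vs. everything else
numericColumns = df.select_dtypes(include='number').columns.tolist()
categColumns = df.select_dtypes(exclude='number').columns.tolist()
print(numericColumns)
print(categColumns)  # ['flower_name']
```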

Feature Union

We build the two parts separately

num_attributes = list(numericColumns)
categ_attributes = list(categColumns)

num_pipeline = Pipeline([
    ('selector', DataFrameSplitter(num_attributes)),
    ('imputer', Imputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSplitter(categ_attributes)),
    ('label_encode', MyLabelBinarizer()),
])

Then merge them together. FeatureUnion runs the two pipelines side by side and concatenates their outputs column-wise (np.c_-style) when they’re done.

from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list = [
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])
final_data = full_pipeline.fit_transform(df)
final_data.shape
(150, 7)
Four scaled numeric features plus three one-hot label columns gives us the 7.
final_data[:5]
array([[ 1.03205722, -1.3412724 , -0.90068117, -1.31297673,  1.        ,
         0.        ,  0.        ],
       [-0.1249576 , -1.3412724 , -1.14301691, -1.31297673,  1.        ,
         0.        ,  0.        ],
       [ 0.33784833, -1.39813811, -1.38535265, -1.31297673,  1.        ,
         0.        ,  0.        ],
       [ 0.10644536, -1.2844067 , -1.50652052, -1.31297673,  1.        ,
         0.        ,  0.        ],
       [ 1.26346019, -1.3412724 , -1.02184904, -1.31297673,  1.        ,
         0.        ,  0.        ]])
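For what it’s worth, sklearn 0.20 shipped ColumnTransformer, which folds the column-selection step into the union itself — so on newer versions both the DataFrameSplitter and the LabelBinarizer wrapper go away. A sketch of the same end-to-end transform, assuming sklearn >= 0.20:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['flower_name'] = pd.Series(data['target']).map(
    {0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Each entry is (name, transformer, columns it applies to) --
# no custom selector class needed
full_pipeline = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('std_scaler', StandardScaler()),
    ]), list(data['feature_names'])),
    ('cat', OneHotEncoder(), ['flower_name']),
])

final_data = full_pipeline.fit_transform(df)
print(final_data.shape)  # (150, 7), same as the FeatureUnion version
```

As a bonus, the column order here is deterministic — it follows the order of the transformer list and the column lists you pass, rather than whatever order a set happens to iterate in.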