Custom Transformers

31 May 2018

As we’ve seen in other notebooks, we can use built-in Imputer, StandardScaler, LabelEncoder, and LabelBinarizer classes in sklearn to do a good deal of the data-preprocessing heavy lifting.

However, under the hood, these all fit the same form:

Each class inherits from the BaseEstimator object and has a

fit() method to fit the data
transform() method that transforms the data
fit_transform() method that does the previous two steps in sequence

Additionally, by inheriting from the TransformerMixin object, we get the fit_transform() method for free.
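To make that concrete, here is a minimal sketch of the pattern. The ValueScaler class and its factor parameter are made up purely for illustration and are not part of sklearn. We only write fit() and transform(); fit_transform() comes from TransformerMixin, and get_params() comes from BaseEstimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ValueScaler(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: multiplies every value by `factor`."""

    def __init__(self, factor=2):
        # BaseEstimator expects constructor arguments to be stored
        # under the same name, unchanged.
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn from the data, so just return self.
        return self

    def transform(self, X):
        return np.asarray(X) * self.factor


toy = np.array([[1.0, 2.0], [3.0, 4.0]])
scaler = ValueScaler(factor=3)

# fit_transform() is never defined above; TransformerMixin supplies it.
via_mixin = scaler.fit_transform(toy)

# It should match calling fit() and then transform() by hand.
by_hand = scaler.fit(toy).transform(toy)

# get_params() is supplied by BaseEstimator.
params = scaler.get_params()
```

Here via_mixin and by_hand hold the same values, and params reports the factor we passed in.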

Our Data

Let’s go back to the iris dataset.

from sklearn.datasets import load_iris

data = load_iris()

X = data['data']
y = data['target']

import pandas as pd
import numpy as np

cols = data['feature_names'] + ['flower_name']

df = pd.DataFrame(np.c_[X, y], columns=cols)
df.head()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

Creating `area` measures

Assuming for a second that the sepals and petals of an iris flower are rectangles, say we want to derive metrics sepal area and petal area that are the product of their respective lengths and widths.

Doing this in pandas is a gimme.

df['sepal area'] = (df['sepal length (cm)']
                    * df['sepal width (cm)'])

df['petal area'] = (df['petal length (cm)']
                    * df['petal width (cm)'])

df.head()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	sepal area	petal area
0	5.1	3.5	1.4	0.2	17.85	0.28
1	4.9	3.0	1.4	0.2	14.70	0.28
2	4.7	3.2	1.3	0.2	15.04	0.26
3	4.6	3.1	1.5	0.2	14.26	0.30
4	5.0	3.6	1.4	0.2	18.00	0.28

But we want this to be a repeatable preprocessing step rather than a one-off. Let’s make a sklearn.base.BaseEstimator child class.

df.drop('sepal area', axis=1, inplace=True)
df.drop('petal area', axis=1, inplace=True)

from sklearn.base import BaseEstimator, TransformerMixin

sep_len_idx, sep_wid_idx = 0, 1
pet_len_idx, pet_wid_idx = 2, 3

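class FlowerAreaAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # nothing to learn from the data, so fit is a no-op
        return self

    def transform(self, X, y=None):
        # X is expected to be a 2D numpy array with the iris column order
        sepArea = X[:, sep_len_idx] * X[:, sep_wid_idx]
        petArea = X[:, pet_len_idx] * X[:, pet_wid_idx]

        # append the two area columns; np.c_ returns a new array
        return np.c_[X, sepArea, petArea]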

And so our shape of X is

X.shape

(150, 4)

And calling the transformer yields

areaAdder = FlowerAreaAdder()
areaAdder.transform(X).shape

(150, 6)
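Because the class follows the fit/transform form, it should also slot into a sklearn Pipeline next to the built-in transformers. The sketch below is one way that could look. The class is repeated with hard-coded column indices so the snippet runs on its own, and the step names 'add_areas' and 'scale' are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class FlowerAreaAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # columns 0/1 are sepal length/width, 2/3 are petal length/width
        sep_area = X[:, 0] * X[:, 1]
        pet_area = X[:, 2] * X[:, 3]
        return np.c_[X, sep_area, pet_area]


X = load_iris()['data']

# Add the area columns first, then standardize all six columns.
prep = Pipeline([
    ('add_areas', FlowerAreaAdder()),
    ('scale', StandardScaler()),
])

X_prepared = prep.fit_transform(X)
```

X_prepared has the same six columns as before, now standardized column by column.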

A Note

The last line will have to be assigned to another variable if we want to persist our new columns. I deliberately chose not to overwrite the X that was passed in, because we should leave our raw data unblemished!
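The input survives because np.c_ builds a brand new array instead of writing into the one it was given. Here is a small check of that on the first two iris rows, using only numpy. The variable names are just for this example.

```python
import numpy as np

# first two rows of the iris measurements shown earlier
raw = np.array([[5.1, 3.5, 1.4, 0.2],
                [4.9, 3.0, 1.4, 0.2]])
raw_before = raw.copy()

sep_area = raw[:, 0] * raw[:, 1]
pet_area = raw[:, 2] * raw[:, 3]

# assign the result to a new name to keep the extra columns
with_areas = np.c_[raw, sep_area, pet_area]

# the original array is untouched and shares no memory with the result
unchanged = np.array_equal(raw, raw_before)
shares_memory = np.shares_memory(raw, with_areas)
```

So unchanged is True, shares_memory is False, and only with_areas carries the two extra columns.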
