Custom Transformers

As we’ve seen in other notebooks, sklearn’s built-in Imputer (SimpleImputer in newer releases), StandardScaler, LabelEncoder, and LabelBinarizer classes do a good deal of the data-preprocessing heavy lifting.

Under the hood, these all follow the same form:

Each class inherits from the BaseEstimator class and has a

  • fit() method to fit the data
  • transform() method that transforms the data
  • fit_transform() method that does the previous two steps in sequence

By also inheriting from the TransformerMixin class, we get the fit_transform() method for free, so we only have to write fit() and transform() ourselves.
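To make the "for free" part concrete, here is a minimal sketch. ColumnDoubler and X_demo are hypothetical names made up for this illustration, not part of sklearn. The class defines only fit() and transform(), yet fit_transform() is still available because it is inherited from TransformerMixin.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnDoubler(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer that doubles every value."""

    def fit(self, X, y=None):
        # Nothing to learn from the data, so just return self
        return self

    def transform(self, X):
        return np.asarray(X) * 2


X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])

# fit_transform() is never defined above; it comes from TransformerMixin
doubled = ColumnDoubler().fit_transform(X_demo)
print(doubled)
```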

Our Data

Let’s go back to the iris dataset.

from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

data = load_iris()

X = data['data']
y = data['target']

cols = data['feature_names'] + ['flower_name']

df = pd.DataFrame(np.c_[X, y], columns=cols)
df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  flower_name
0                5.1               3.5                1.4               0.2          0.0
1                4.9               3.0                1.4               0.2          0.0
2                4.7               3.2                1.3               0.2          0.0
3                4.6               3.1                1.5               0.2          0.0
4                5.0               3.6                1.4               0.2          0.0

Creating area measures

Assuming for a moment that the sepals and petals of an iris flower are rectangles, say we want to derive two metrics, sepal area and petal area, each the product of the corresponding length and width.

Doing this in pandas is a gimme.

df['sepal area'] = (df['sepal length (cm)']
                    * df['sepal width (cm)'])
df['petal area'] = (df['petal length (cm)']
                    * df['petal width (cm)'])
df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  flower_name  sepal area  petal area
0                5.1               3.5                1.4               0.2          0.0       17.85        0.28
1                4.9               3.0                1.4               0.2          0.0       14.70        0.28
2                4.7               3.2                1.3               0.2          0.0       15.04        0.26
3                4.6               3.1                1.5               0.2          0.0       14.26        0.30
4                5.0               3.6                1.4               0.2          0.0       18.00        0.28

But we want this step to be a repeatable procedure rather than a one-off. Let’s make a child class of sklearn.base.BaseEstimator.

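# Undo the pandas columns from above so df matches the raw data again
df.drop('sepal area', axis=1, inplace=True)
df.drop('petal area', axis=1, inplace=True)

from sklearn.base import BaseEstimator, TransformerMixin

# Column positions of each measurement in X
sep_len_idx, sep_wid_idx = 0, 1
pet_len_idx, pet_wid_idx = 2, 3

class FlowerAreaAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn from the data, so fit() just returns self
        return self

    def transform(self, X, y=None):
        sepArea = X[:, sep_len_idx] * X[:, sep_wid_idx]
        petArea = X[:, pet_len_idx] * X[:, pet_wid_idx]

        # np.c_ builds a new array, so the X passed in is not modified
        return np.c_[X, sepArea, petArea]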

The shape of X before transforming is

X.shape
(150, 4)

And calling the transformer yields

areaAdder = FlowerAreaAdder()
areaAdder.transform(X).shape
(150, 6)

A Note

The result of that last call has to be assigned to a new variable if we want to keep the new columns. I deliberately chose not to overwrite the X that was passed in, because raw data should be left untouched.
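As a rough sketch of what that looks like in practice, the block below repeats the transformer from above so it runs on its own, then stores the result under a new name. X_with_areas is a name invented for this example. It uses fit_transform(), which here gives the same result as transform() because fit() learns nothing.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris

sep_len_idx, sep_wid_idx = 0, 1
pet_len_idx, pet_wid_idx = 2, 3


class FlowerAreaAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        sepArea = X[:, sep_len_idx] * X[:, sep_wid_idx]
        petArea = X[:, pet_len_idx] * X[:, pet_wid_idx]
        return np.c_[X, sepArea, petArea]


X = load_iris()['data']

areaAdder = FlowerAreaAdder()

# Keep the widened array under a new (hypothetical) name
X_with_areas = areaAdder.fit_transform(X)

print(X.shape)             # raw data keeps its four columns
print(X_with_areas.shape)  # four original columns plus two area columns
```

The two appended columns hold the same products as the 'sepal area' and 'petal area' columns computed with pandas earlier.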