Custom Transformers
As we’ve seen in other notebooks, we can use the built-in Imputer (replaced by SimpleImputer in newer scikit-learn releases), StandardScaler, LabelEncoder, and LabelBinarizer classes in sklearn to do a good deal of the data-preprocessing heavy lifting.
Under the hood, though, these all follow the same form: each class inherits from the BaseEstimator class and has a fit() method that fits the data, a transform() method that transforms the data, and a fit_transform() method that does the previous two steps in sequence.
Additionally, by inheriting from the TransformerMixin class, we get the fit_transform() method for free.
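To make the "for free" part concrete, here is a minimal sketch. ColumnDoubler and the toy array are hypothetical names invented for this illustration; only BaseEstimator and TransformerMixin come from sklearn. The class defines fit() and transform() and nothing else, yet fit_transform() is still available on it.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnDoubler(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: multiplies every value by two."""

    def fit(self, X, y=None):
        # Nothing to learn from the data, so just return the instance.
        return self

    def transform(self, X):
        return np.asarray(X) * 2


toy = np.array([[1.0, 2.0], [3.0, 4.0]])

# fit_transform() is never defined above; TransformerMixin supplies it
# by calling fit() and then transform() on the same data.
doubled = ColumnDoubler().fit_transform(toy)
print(doubled)
```

Returning self from fit() is what lets the inherited method chain the two calls.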
Our Data
Let’s go back to the iris dataset.
from sklearn.datasets import load_iris
data = load_iris()
X = data['data']
y = data['target']
import pandas as pd
import numpy as np
cols = data['feature_names'] + ['flower_name']
df = pd.DataFrame(np.c_[X, y], columns=cols)
df.head()
 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | flower_name |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |
Creating area measures
Assuming for a second that the sepals and petals of an iris flower are rectangles, say we want to derive two metrics, sepal area and petal area, each the product of the respective length and width.
Doing this in pandas is a gimme.
df['sepal area'] = (df['sepal length (cm)']
* df['sepal width (cm)'])
df['petal area'] = (df['petal length (cm)']
* df['petal width (cm)'])
df.head()
 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | flower_name | sepal area | petal area |
---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 | 17.85 | 0.28 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 | 14.70 | 0.28 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 | 15.04 | 0.26 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 | 14.26 | 0.30 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 | 18.00 | 0.28 |
But we want to be more procedural about this. Let’s make a sklearn.base.BaseEstimator child class.
df.drop('sepal area', axis=1, inplace=True)
df.drop('petal area', axis=1, inplace=True)
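from sklearn.base import BaseEstimator, TransformerMixin

# Column positions of each measurement in the raw iris array
sep_len_idx, sep_wid_idx = 0, 1
pet_len_idx, pet_wid_idx = 2, 3

class FlowerAreaAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn from the data; return self so calls can be chained
        return self

    def transform(self, X, y=None):
        # Treat sepals and petals as rectangles: area = length * width
        sepArea = X[:, sep_len_idx] * X[:, sep_wid_idx]
        petArea = X[:, pet_len_idx] * X[:, pet_wid_idx]
        # np.c_ builds a new array, so the X passed in is left unmodified
        return np.c_[X, sepArea, petArea]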
So the shape of X is
X.shape
(150, 4)
And calling the transformer yields
areaAdder = FlowerAreaAdder()
areaAdder.transform(X).shape
(150, 6)
A Note
The result of the transform() call will have to be assigned to another variable if we want to persist our new columns. I deliberately chose not to overwrite the X that was passed in, because we should leave our raw data unblemished!
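As a rough sketch of what that looks like in practice, the block below repeats the class definition so it runs on its own and stores the transformed array under a new name. X_with_areas is a hypothetical name chosen for this illustration, and the fit_transform() call used here is the one inherited from TransformerMixin.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris

sep_len_idx, sep_wid_idx = 0, 1
pet_len_idx, pet_wid_idx = 2, 3


class FlowerAreaAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        sepArea = X[:, sep_len_idx] * X[:, sep_wid_idx]
        petArea = X[:, pet_len_idx] * X[:, pet_wid_idx]
        return np.c_[X, sepArea, petArea]


X = load_iris()['data']

# Keep the result under a new (hypothetical) name instead of overwriting X.
X_with_areas = FlowerAreaAdder().fit_transform(X)

print(X.shape)             # (150, 4): the raw data is untouched
print(X_with_areas.shape)  # (150, 6): four original columns plus two areas
```

Keeping both arrays around costs a little memory, but it means any later step can still start from the original measurements.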