As we’ve seen in other notebooks, we can use built-in classes such as LabelBinarizer in sklearn to do a good deal of the data-preprocessing heavy lifting.
However, under the hood, these all fit the same form:
Each class inherits from the BaseEstimator class and has a
fit() method that fits the data
transform() method that transforms the data
fit_transform() method that does the previous two steps in sequence
Additionally, by inheriting from the TransformerMixin class, we get the fit_transform() method for free.
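To make the "for free" part concrete, here is a minimal pure-Python sketch of the mixin idea. It is a simplified stand-in and not scikit-learn's actual source: SketchTransformerMixin and MeanCenterer are hypothetical names made up for this illustration.

```python
# Simplified stand-ins for illustration only; NOT scikit-learn's source code.
class SketchTransformerMixin:
    """Hypothetical mixin that derives fit_transform() from fit() and transform()."""

    def fit_transform(self, X, y=None):
        # fit() returns self, so the two calls can be chained.
        return self.fit(X, y).transform(X)


class MeanCenterer(SketchTransformerMixin):
    """Hypothetical transformer that subtracts the mean learned in fit()."""

    def fit(self, X, y=None):
        # Learn a single statistic from the data and store it on the instance.
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        # Apply the learned statistic to whatever data is passed in.
        return [x - self.mean_ for x in X]


# MeanCenterer never defines fit_transform(); it comes from the mixin.
centered = MeanCenterer().fit_transform([1.0, 2.0, 3.0])
print(centered)  # [-1.0, 0.0, 1.0]
```

The child class only has to supply fit() and transform(); the chained call lives in one place and is shared by every transformer that inherits from the mixin.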
Let’s go back to the iris dataset
from sklearn.datasets import load_iris

data = load_iris()
X = data['data']
y = data['target']

import pandas as pd
import numpy as np

cols = data['feature_names'] + ['flower_name']
df = pd.DataFrame(np.c_[X, y], columns=cols)
df.head()
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | flower_name |
Assuming for a second that the sepals and petals of an iris flower are rectangles, say we want to derive two metrics, sepal area and petal area, that are the products of their respective lengths and widths.
Doing this in pandas is a gimme.
df['sepal area'] = (df['sepal length (cm)'] * df['sepal width (cm)'])
df['petal area'] = (df['petal length (cm)'] * df['petal width (cm)'])
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | flower_name | sepal area | petal area |
But we want to be more procedural with this. Let’s make a
sklearn.base.BaseEstimator child class.
df.drop('sepal area', axis=1, inplace=True)
df.drop('petal area', axis=1, inplace=True)
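from sklearn.base import BaseEstimator, TransformerMixin

# Column positions of each measurement in the raw feature array X
sep_len_idx, sep_wid_idx = 0, 1
pet_len_idx, pet_wid_idx = 2, 3

class FlowerAreaAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn from the data, so just return self
        return self

    def transform(self, X, y=None):
        sepArea = X[:, sep_len_idx] * X[:, sep_wid_idx]
        petArea = X[:, pet_len_idx] * X[:, pet_wid_idx]
        # Append the two area columns to the right of the original features
        return np.c_[X, sepArea, petArea]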
Our original X has a shape of (150, 4), and calling the transformer yields a new array with two more columns, for a shape of (150, 6):
areaAdder = FlowerAreaAdder()
areaAdder.transform(X).shape
The result of that last line will have to be assigned to another variable if we want to persist our new columns. I deliberately chose not to overwrite the X that was passed in, because we should leave our raw data unblemished!
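One way to persist the new columns while leaving X alone is sketched below. The class is repeated (with the column indices written inline) so the block runs on its own. The Pipeline with StandardScaler is an assumption added purely for illustration, not a step from this walkthrough, and the names X_with_areas, pipeline, and X_scaled are made up here.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class FlowerAreaAdder(BaseEstimator, TransformerMixin):
    # Same logic as the transformer above, repeated so this block is self-contained.
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        sep_area = X[:, 0] * X[:, 1]  # sepal length * sepal width
        pet_area = X[:, 2] * X[:, 3]  # petal length * petal width
        return np.c_[X, sep_area, pet_area]


X = load_iris()['data']

# fit_transform() is inherited from TransformerMixin; the result goes into a
# new variable so the raw X is left untouched.
X_with_areas = FlowerAreaAdder().fit_transform(X)
print(X.shape, X_with_areas.shape)  # (150, 4) (150, 6)

# Hypothetical pipeline: add the area columns, then standardize every column.
pipeline = Pipeline([
    ('add_areas', FlowerAreaAdder()),
    ('scale', StandardScaler()),
])
X_scaled = pipeline.fit_transform(X)
print(X_scaled.shape)  # (150, 6)
```

Because the transformer follows the same fit/transform contract as the built-in classes, it can sit in a Pipeline next to them without any extra glue code.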