Encoding Categorical Data

Perhaps not surprisingly, when we want to do some sort of prediction in sklearn using data that comes to us in text format, the library doesn’t know how to stuff the word “Michigan” into a regression.

Thus, we have to transform our categorical data into a numerical representation.

The Data

Let’s load the iris dataset.

from sklearn.datasets import load_iris

data = load_iris()

And, for the sake of example, do a bit of manipulation to get it into a format relevant to this notebook.

import numpy as np
import pandas as pd

cols = data['feature_names'] + ['flower_name']
flowerNames = {0: 'setosa',
               1: 'versicolor',
               2: 'virginica'}

df = pd.DataFrame(np.c_[data['data'], data['target']],
                  columns=cols)
df['flower_name'] = df['flower_name'].map(flowerNames)
df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) flower_name
0                5.1               3.5                1.4               0.2      setosa
1                4.9               3.0                1.4               0.2      setosa
2                4.7               3.2                1.3               0.2      setosa
3                4.6               3.1                1.5               0.2      setosa
4                5.0               3.6                1.4               0.2      setosa
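
Before moving on, it can help to confirm what pandas thinks it’s holding (a quick aside, not part of the original flow): the four measurement columns are numeric, while flower_name is a plain string column.

# the measurement columns come back as float64; flower_name is an
# object (string) column, which is exactly what trips sklearn up below
df.dtypes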

Trying to Predict sepal length (cm)

Typically, firing up the iris dataset leads to an exercise in trying to predict the last column, flower_name. However, since the purpose of this tutorial is to show how to leverage categorical variables in sklearn, we’re going to predict one of the features instead.

Nevertheless, let’s try to use one of the more popular almost-classification techniques on an almost-classification dataset.

from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor()

X = df.iloc[:, 1:].values
y = df.iloc[:, 0].values

As expected, it doesn’t know what to do with strings.

try:
    forest.fit(X, y)
except ValueError as e:
    print(e)
could not convert string to float: 'virginica'

And so we can transform that column from strings to a numerical representation with the LabelEncoder class.

from sklearn.preprocessing import LabelEncoder
stringCol = X[:, -1]
encoder = LabelEncoder()

encoder.fit(stringCol)
LabelEncoder()
encoder.transform(stringCol)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int64)
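
As a small aside, the encoder remembers the mapping it learned, so you can go back and forth between strings and integers (fit_transform, classes_, and inverse_transform are all part of the LabelEncoder API).

# fit_transform does the fit and the transform in one shot
labels = encoder.fit_transform(stringCol)

# classes_ records the learned label order; inverse_transform maps back
print(encoder.classes_)
print(encoder.inverse_transform(labels[:3]))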

And build the same X, but with numbers.

clean_X = np.c_[X[:, :-1], encoder.transform(stringCol)]
forest.fit(clean_X, y)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
forest.predict(clean_X)
array([ 5.11      ,  4.82      ,  4.595     ,  4.71      ,  5.015     ,
        5.37      ,  4.77      ,  5.02833333,  4.54      ,  4.89      ,
        5.29333333,  4.91      ,  4.81      ,  4.56      ,  5.43      ,
        5.43      ,  5.3       ,  5.09      ,  5.57333333,  5.20833333,
        5.25      ,  5.16      ,  4.83      ,  5.2       ,  4.86      ,
        4.92      ,  5.1       ,  5.19333333,  5.15      ,  4.69      ,
        4.83      ,  5.21      ,  5.255     ,  5.275     ,  4.89      ,
        4.915     ,  5.42      ,  4.89      ,  4.49      ,  5.02833333,
        5.21      ,  4.62      ,  4.595     ,  5.04      ,  5.1       ,
        4.84      ,  5.16333333,  4.62      ,  5.29333333,  5.        ,
        6.91      ,  6.43      ,  6.86      ,  5.53      ,  6.51      ,
        5.87333333,  6.31      ,  5.04      ,  6.54      ,  5.39      ,
        5.21      ,  5.79      ,  5.77      ,  6.24      ,  5.61      ,
        6.66      ,  5.59666667,  5.78      ,  6.01      ,  5.54      ,
        6.14      ,  6.02      ,  6.18      ,  6.21      ,  6.175     ,
        6.59      ,  6.42      ,  6.37      ,  5.91333333,  5.48      ,
        5.46      ,  5.44      ,  5.76      ,  6.04      ,  5.59666667,
        6.04      ,  6.7       ,  6.08      ,  5.71      ,  5.55      ,
        5.62      ,  6.27      ,  5.78      ,  5.09      ,  5.63      ,
        5.75      ,  5.7       ,  6.175     ,  5.15      ,  5.77      ,
        6.53      ,  5.93      ,  6.84      ,  6.37      ,  6.61      ,
        7.65      ,  5.52      ,  7.29      ,  6.7       ,  7.47      ,
        6.33      ,  6.23      ,  6.68      ,  5.92      ,  6.18      ,
        6.52      ,  6.52      ,  7.71      ,  7.72      ,  6.08      ,
        6.78      ,  5.96      ,  7.68      ,  6.16      ,  6.66      ,
        7.2       ,  6.22      ,  6.11      ,  6.32      ,  7.01      ,
        7.39      ,  7.74      ,  6.43      ,  6.24      ,  6.19      ,
        7.53      ,  6.36      ,  6.42      ,  6.01      ,  6.7       ,
        6.64      ,  6.68      ,  5.93      ,  6.78      ,  6.58      ,
        6.7       ,  6.23      ,  6.3       ,  6.34      ,  6.01      ])

Groovy.
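
If you want a rough read on how well that fit went, RandomForestRegressor has a score method that reports R² on whatever data you hand it. Since this is the same data the forest trained on, treat it as a sanity check rather than a proper evaluation.

# R^2 on the training data (optimistic, since the forest has already seen it)
forest.score(clean_X, y)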

A Better Idea

Of course, we might not have gone the Random Forest route at all, and instead reached for a Linear Regression.

from sklearn.linear_model import LinearRegression

And it works.

model = LinearRegression()
model.fit(clean_X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

But this is where it’s particularly important to know what you’re actually doing. If something didn’t compile, you’d know that and have to investigate. Here, we’ve made a critical error and it passed silently.

Let’s investigate.

According to scikit-learn, a flower’s contribution to the sepal length is [-0.22 * its encoded label], so ‘versicolor’ (encoded as 1) contributes -0.22.

print(list(df.columns[1:]))
print(model.coef_)
['sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'flower_name']
[ 0.6291636   0.74403774 -0.41389919 -0.22135464]

What’s more, because ‘versicolor’ is encoded as a 1 and ‘virginica’ as a 2, virginica ends up being “twice” versicolor, which is nonsense.

data['target_names']
array(['setosa', 'versicolor', 'virginica'],
      dtype='<U10')
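
To see just how rigid that encoding is, we can spell out the contribution it forces on each species, using the coefficient printed above (a quick illustration, not part of the original flow).

# with a single label-encoded column, each species' effect on the prediction
# is coef * label: setosa is pinned to 0, and virginica is forced to be
# exactly twice versicolor
flower_coef = model.coef_[-1]
for label, name in enumerate(data['target_names']):
    print(name, label, flower_coef * label)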

Instead, we want to use the LabelBinarizer class to break each of these values out into its own column, populated with 0s and 1s.

from sklearn.preprocessing import LabelBinarizer

binarizer = LabelBinarizer()
binarizer.fit(X[:, -1])
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)

This process is called one-hot encoding and produces rows that look like this.

encoded_flowers = binarizer.transform(X[:, -1])

encoded_flowers[0], encoded_flowers[50], encoded_flowers[100]
(array([1, 0, 0]), array([0, 1, 0]), array([0, 0, 1]))

Where each row only has one non-zero value.

sum(encoded_flowers.T)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
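
As an aside, pandas can build the same 0/1 columns directly with get_dummies; we’ll stick with the LabelBinarizer output for the rest of this notebook.

# equivalent one-hot encoding, straight from the DataFrame
pd.get_dummies(df['flower_name']).head()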

So now, if we put this one-hot encoding into our Linear Regression

encoded_X = np.c_[X[:, :-1], encoded_flowers]
model.fit(encoded_X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

We can intuit the behavior of the last three coefficients.

model.coef_
array([ 0.50107481,  0.82878689, -0.32210351,  0.57456359, -0.13951206,
       -0.43505152])

For instance, a setosa flower (one-hot encoded as (1, 0, 0)) would contribute [.5746 * 1 + (-.1395) * 0 + (-.4351) * 0] = .5746 to the predicted sepal length.
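
To check that reading, we can rebuild a prediction by hand (purely illustrative): the model’s output for the first, setosa row is just the intercept plus the dot product of the coefficients with that row’s encoded features, with the last three coefficients acting as per-species offsets.

# compare a hand-computed prediction against model.predict for the first row
first_row = encoded_X[0].astype(float)
by_hand = model.intercept_ + np.dot(model.coef_, first_row)
print(by_hand, model.predict(encoded_X[:1])[0])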