Encoding Categorical Data
Perhaps not surprisingly, when we want to do some sort of prediction in sklearn
using data that comes to us in text format, the library doesn’t know how to stuff the word “Michigan” into a regression.
Thus, we have to transform our categorical data into a numerical representation.
The Data
Let’s load the iris dataset.
from sklearn.datasets import load_iris
data = load_iris()
And, for the sake of example, do a bit of manipulation to it to get it into a format relevant to this notebook.
import numpy as np
import pandas as pd
cols = data['feature_names'] + ['flower_name']
flowerNames = {0: 'setosa',
               1: 'versicolor',
               2: 'virginica'}
df = pd.DataFrame(np.c_[data['data'], data['target']],
                  columns=cols)
df['flower_name'] = df['flower_name'].map(flowerNames)
df.head()
 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | flower_name
---|---|---|---|---|---
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa
Trying to Predict sepal length (cm)
Typically, firing up the iris dataset leads to an exercise in trying to predict the last column, flower_name
. However, since the purpose of this tutorial is to show how to leverage categorical variables in sklearn
, we’re going to predict one of the features instead.
Nevertheless, let’s try to use one of the more popular almost-classification techniques for an almost-classification dataset.
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor()
X = df.iloc[:, 1:].values
y = df.iloc[:, 0].values
As expected, it doesn’t know what to do with strings.
try:
    forest.fit(X, y)
except ValueError as e:
    print(e)
could not convert string to float: 'virginica'
And so we can transform that column from strings to a numerical representation with the LabelEncoder
class.
from sklearn.preprocessing import LabelEncoder
stringCol = X[:, -1]
encoder = LabelEncoder()
encoder.fit(stringCol)
LabelEncoder()
encoder.transform(stringCol)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int64)
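If you want to check which integer went to which species, the fitted LabelEncoder keeps the sorted class names in its classes_ attribute (and inverse_transform maps integers back to names). A quick look:
# The position of each name in classes_ is the integer it was assigned
# (here: setosa -> 0, versicolor -> 1, virginica -> 2).
encoder.classes_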
And build the same X
, but with numbers.
clean_X = np.c_[X[:, :-1], encoder.transform(stringCol)]
forest.fit(clean_X, y)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False)
forest.predict(clean_X)
array([ 5.11 , 4.82 , 4.595 , 4.71 , 5.015 ,
5.37 , 4.77 , 5.02833333, 4.54 , 4.89 ,
5.29333333, 4.91 , 4.81 , 4.56 , 5.43 ,
5.43 , 5.3 , 5.09 , 5.57333333, 5.20833333,
5.25 , 5.16 , 4.83 , 5.2 , 4.86 ,
4.92 , 5.1 , 5.19333333, 5.15 , 4.69 ,
4.83 , 5.21 , 5.255 , 5.275 , 4.89 ,
4.915 , 5.42 , 4.89 , 4.49 , 5.02833333,
5.21 , 4.62 , 4.595 , 5.04 , 5.1 ,
4.84 , 5.16333333, 4.62 , 5.29333333, 5. ,
6.91 , 6.43 , 6.86 , 5.53 , 6.51 ,
5.87333333, 6.31 , 5.04 , 6.54 , 5.39 ,
5.21 , 5.79 , 5.77 , 6.24 , 5.61 ,
6.66 , 5.59666667, 5.78 , 6.01 , 5.54 ,
6.14 , 6.02 , 6.18 , 6.21 , 6.175 ,
6.59 , 6.42 , 6.37 , 5.91333333, 5.48 ,
5.46 , 5.44 , 5.76 , 6.04 , 5.59666667,
6.04 , 6.7 , 6.08 , 5.71 , 5.55 ,
5.62 , 6.27 , 5.78 , 5.09 , 5.63 ,
5.75 , 5.7 , 6.175 , 5.15 , 5.77 ,
6.53 , 5.93 , 6.84 , 6.37 , 6.61 ,
7.65 , 5.52 , 7.29 , 6.7 , 7.47 ,
6.33 , 6.23 , 6.68 , 5.92 , 6.18 ,
6.52 , 6.52 , 7.71 , 7.72 , 6.08 ,
6.78 , 5.96 , 7.68 , 6.16 , 6.66 ,
7.2 , 6.22 , 6.11 , 6.32 , 7.01 ,
7.39 , 7.74 , 6.43 , 6.24 , 6.19 ,
7.53 , 6.36 , 6.42 , 6.01 , 6.7 ,
6.64 , 6.68 , 5.93 , 6.78 , 6.58 ,
6.7 , 6.23 , 6.3 , 6.34 , 6.01 ])
Groovy.
A Better Idea
Of course, we might not have decided to go the Random Forest route, and may instead have used a Linear Regression.
from sklearn.linear_model import LinearRegression
And it works.
model = LinearRegression()
model.fit(clean_X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
But this is where it’s particularly important to know what you’re actually doing. If something didn’t compile, you’d know right away. Here, we’ve made a critical error and it passed silently.
Let’s investigate.
According to scikit-learn
, a flower’s contribution to the sepal length is -0.22 times its encoded label (so ‘versicolor’, encoded as 1, contributes -0.22).
print(list(df.columns[1:]))
print(model.coef_)
['sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'flower_name']
[ 0.6291636 0.74403774 -0.41389919 -0.22135464]
What’s more, because ‘versicolor’ is encoded as a 1 and ‘virginica’ as a 2, that makes virginica “twice” versicolor, which is nonsense.
data['target_names']
array(['setosa', 'versicolor', 'virginica'],
dtype='<U10')
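To make the problem concrete, here is a small sketch (reusing the model and clean_X fit above): because the regression treats the encoded column as an ordinary number, relabelling a flower from 1 to 2 shifts its predicted sepal length by exactly that -0.22 coefficient, regardless of what the integers were supposed to mean.
# Take a versicolor row (encoded flower label 1) and pretend it is virginica (2);
# the prediction moves by exactly the flower_name coefficient (about -0.22).
row = clean_X[50].astype(float)
row_relabeled = row.copy()
row_relabeled[-1] = 2
print(model.predict([row])[0])
print(model.predict([row_relabeled])[0])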
Instead, we want to use the LabelBinarizer
class to break each of these values out into its own column, populated with 0’s and 1’s.
from sklearn.preprocessing import LabelBinarizer
binarizer = LabelBinarizer()
binarizer.fit(X[:, -1])
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
This process is called one-hot encoding and produces rows that look like this.
encoded_flowers = binarizer.transform(X[:, -1])
encoded_flowers[0], encoded_flowers[50], encoded_flowers[100]
(array([1, 0, 0]), array([0, 1, 0]), array([0, 0, 1]))
Where each row only has one non-zero value.
sum(encoded_flowers.T)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
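It’s also worth confirming which column stands for which species. Like LabelEncoder, the fitted LabelBinarizer exposes the class names in classes_, and the one-hot columns come out in that order; inverse_transform maps rows back to names.
# Column order follows binarizer.classes_ (expected: setosa, versicolor, virginica).
print(binarizer.classes_)
print(binarizer.inverse_transform(encoded_flowers[:3]))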
So now if we put this into our Linear Regression
encoded_X = np.c_[X[:, :-1], encoded_flowers]
model.fit(encoded_X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
We can intuit the behavior of the last three values.
model.coef_
array([ 0.50107481, 0.82878689, -0.32210351, 0.57456359, -0.13951206,
-0.43505152])
For instance, a setosa flower (one-hot encoded as (1, 0, 0)) would contribute .5746 * 1 + (-.1395) * 0 + (-.4351) * 0 = .5746 to the predicted sepal length, on top of the intercept and the other feature terms.
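As a quick sanity check on that reading, here is a small sketch (reusing encoded_X and the fitted model from above) that rebuilds a prediction by hand from the intercept and coefficients and compares it with model.predict.
# Rebuild the first row's prediction (a setosa) from intercept + coefficients;
# it should match model.predict for the same row.
first_row = encoded_X[0].astype(float)
by_hand = model.intercept_ + np.dot(model.coef_, first_row)
print(by_hand)
print(model.predict([first_row])[0])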