Histogram Tricks for Comparing Classes

Looking at the different distributions of features between various classes is the first step in building any sort of classifier. However, even univariate analysis can lead to some cluttered visualizations fore more than a couple of different classes.

Example

We’ll load up our old, reliable Iris Dataset

%pylab inline

import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
Populating the interactive namespace from numpy and matplotlib

Map the 0, 1, 2 into actual flower names.

mapping = {num: flower
           for num, flower
           in enumerate(data['target_names'])}

flowers = pd.Series(data['target'], name='flower').map(mapping)

Then build out an iterator we can use to cycle through DataFrames by flower class

gb = df.groupby(flowers)

So for a feature like petal width, the separation is pretty straight-forward. I’d ship this.

fig, ax = plt.subplots(figsize=(12, 10))

for idx, group in gb:
    ax.hist(group['petal width (cm)'], label=idx)
    
ax.legend();

png

However, if we instead look at sepal length, there’s more overlap between class distributions, and due to rendering order, it’s not obvious what’s happening to versicolor in the [6.0, 7.0] range.

fig, ax = plt.subplots(figsize=(12, 10))

for idx, group in gb:
    ax.hist(group['sepal length (cm)'], label=idx)

ax.legend();

png

For this, we might consider using the histtype='step' argument to un-shade the area beneath the bars

fig, ax = plt.subplots(figsize=(12, 10))

for idx, group in gb:
    ax.hist(group['sepal length (cm)'], histtype='step', linewidth=3, label=idx)

ax.legend();

png

But this still looks a bit crowded.

Worth pointing out, however, that this technique can be extremely valueable when looking at two different classes of similar distributions, such as the one outlined in hundredblocks’ book on ML Applications.

from IPython.display import Image
Image('images/dual_hist.PNG')

png

For this, I’d probably just ratched down the value of alpha argument. But it’s easy to see how the introduction of another class or two would really make this a mess.

fig, ax = plt.subplots(figsize=(12, 10))

for idx, group in gb:
    ax.hist(group['sepal length (cm)'], alpha=.5, label=idx)
    
ax.legend();

png

In the case that I have more than 3 or so classes, I think I’d opt to put each class on its own histogram, taking great care to remember to utilize the sharex=True argument so I can meaningfully compare their distributions

# to keep the same color scheme
colors = mpl.cm.get_cmap('tab10').colors
N_CLASSES = 3

fig, axes = plt.subplots(N_CLASSES, 1, figsize=(12, 10), sharex=True)

for ax, (idx, group), color in zip(axes, gb, colors):
    ax.hist(group['sepal length (cm)'], label=idx, color=color)
    ax.legend();

png