Generating Classification Datasets

When you’re tired of running through the Iris or Breast Cancer datasets for the umpteenth time, sklearn has a neat utility that lets you generate classification datasets.

Its use is pretty simple. A call to the function yields an attribute matrix and a target column of the same length; by default, 100 samples with 20 features and a binary target

import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification()
print(X.shape, y.shape)
(100, 20) (100,)

Customizing

Additionally, the function takes a bunch of parameters that let you tailor your dataset, including:

Number of samples and size of feature space

X, y = make_classification(n_samples=1000, n_features=10)
print(X.shape, y.shape)
(1000, 10) (1000,)

Number of redundant features (random linear combinations of the informative ones) and repeated features to trip up your models, as well as how many informative variables there are

X, y = make_classification(n_redundant=4, n_repeated=5, n_informative=10)
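Assuming the default n_features=20 still applies, that call budgets 10 informative, 4 redundant, and 5 repeated columns, leaving one feature as pure noise:

# 10 informative + 4 redundant + 5 repeated + 1 noise = 20 total features
print(X.shape)
(100, 20)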

Number of classes you aim to predict (note: the generator needs enough informative attributes to cover the classes; n_classes * n_clusters_per_class can't exceed 2 ** n_informative)

X, y = make_classification(n_informative=4, n_classes=3)
print(np.unique(y))
[0 1 2]
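Push past what the informative features can support and the call raises a ValueError. A minimal sketch (the exact message wording varies by sklearn version):

try:
    # 3 classes * 2 clusters per class = 6 > 2 ** 1 informative features
    X, y = make_classification(n_informative=1, n_classes=3)
except ValueError as e:
    print(e)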

And other, more technical elements
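For instance, class_sep spreads the class clusters further apart (bigger means an easier problem), flip_y randomly mislabels a fraction of samples, and random_state makes the draw reproducible:

X, y = make_classification(
    class_sep=2.0,    # push the class clusters further apart (an easier problem)
    flip_y=0.05,      # randomly flip ~5% of labels to simulate noise
    random_state=42,  # reproducible draws
)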

Imbalanced Datasets

One thing a plain call won't hand you is a heavily imbalanced dataset. It's not unreasonable to want to practice training algorithms on data with a huge class imbalance, but make_classification() generally provides an even-ish split.

for _ in range(5):
    X, y = make_classification(10000)
    print(sum(y == 1))  # how many of the 10000 samples are labeled 1
5007
4997
5013
5015
5019
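For the record, make_classification() can skew the split itself through its weights parameter, which takes per-class proportions (flip_y is zeroed here so label noise doesn't blur the count):

X, y = make_classification(10000, weights=[0.99, 0.01], flip_y=0)
print(sum(y == 1))  # ~100, i.e. about 1% of the samples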

Rather than lean on that parameter, we're going to do a bit of pandas magic™.

First, we’ll make a ton of data

import numpy as np
import pandas as pd

X, y = make_classification(10000)

Which is, as expected, split roughly 50/50 between 1s and 0s

print(sum(y == 1))
5012

If we stuff this into a DataFrame

df = pd.DataFrame(np.c_[X, y])  # np.c_ glues y onto X as a 21st column
df.shape
(10000, 21)

We can write some tricksy indexing logic to find all of the rows where the target is 1

true_rows = df[df[df.columns[-1]] == 1].index
print(len(true_rows))
5012

Randomly pick 100 of them to keep

survivors = np.random.choice(true_rows, 100, replace=False)  # replace=False keeps the picks unique
survivors
array([  28, 2844, 4029,  876, 8481, 6550, 4976, 2347, 8514, 1986, 8691,
       ...
       8004, 3197, 1001, 1420], dtype=int64)

And cull the rest

chopping_block = set(true_rows) - set(survivors)  # every positive row we didn't pick
df = df.drop(chopping_block)

Giving us a much smaller DataFrame

df.shape
(5088, 21)

With an artificially down-sampled target column

df[df.columns[-1]].value_counts()
0.0    4988
1.0     100
Name: 20, dtype: int64

That we’ll stuff back into our original X, y values

X = df[df.columns[:-1]].values
y = df[df.columns[-1]].values

print(X.shape, y.shape)
(5088, 20) (5088,)
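Since the target came back as floats, y.mean() doubles as the positive rate, a handy sanity check on the imbalance:

print(round(y.mean(), 4))
0.0197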

As a Function

def downsample_class(X, y, class_to_downsample=1, downsample_to_pct=.1):
    df = pd.DataFrame(np.c_[X, y])  # np.c_ column-stacks X and y, target last

    # indices of every row belonging to the class we're thinning out
    true_rows = df[df[df.columns[-1]] == class_to_downsample].index
    num_desired_true = int(downsample_to_pct * len(true_rows))

    # replace=False keeps the survivors unique, so we keep exactly the count we asked for
    survivors = np.random.choice(true_rows, num_desired_true, replace=False)
    chopping_block = set(true_rows) - set(survivors)
    df = df.drop(chopping_block)

    X = df[df.columns[:-1]].values
    y = df[df.columns[-1]].values

    assert len(X) == len(y)
    return X, y
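Taking it for a spin on a larger draw: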
X, y = make_classification(100000)
X, y = downsample_class(X, y)
len(X)
54765
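And a quick check that the imbalance took (exact counts will wobble from run to run):

values, counts = np.unique(y, return_counts=True)
print(values, counts)  # expect roughly 50000 zeros and 5000 ones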