Generating Classification Datasets

When you’re tired of running through the Iris or Breast Cancer datasets for the umpteenth time, sklearn has a neat utility that lets you generate classification datasets.

Its use is pretty simple: a call to the function yields a feature matrix and a target column of the same length.

import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification()
print(X.shape, y.shape)
(100, 20) (100,)


Customizing

Additionally, the function takes a bunch of parameters that allow you to modify your dataset, including:

Number of samples and size of feature space

X, y = make_classification(n_samples=1000, n_features=10)
print(X.shape, y.shape)
(1000, 10) (1000,)


Number of redundant (linear combinations of the informative columns) and repeated features to trip up your models, as well as how many informative features there are

X, y = make_classification(n_redundant=4, n_repeated=5, n_informative=10)

Number of classes you aim to predict (note: the generation algorithm needs enough informative features to support the class count; concretely, it requires n_classes * n_clusters_per_class <= 2**n_informative)

X, y = make_classification(n_informative=4, n_classes=3)
print(np.unique(y))
[0 1 2]
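Push past that balance and the generator refuses outright. A quick sketch of the failure mode (the constraint, per sklearn, is that n_classes * n_clusters_per_class can't exceed 2**n_informative):

```python
from sklearn.datasets import make_classification

# 2**2 = 4 hypercube vertices available, but 5 classes * 2 clusters per class = 10 needed
try:
    make_classification(n_informative=2, n_classes=5)
except ValueError as err:
    print(err)
```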


And other, more technical elements

Imbalanced Datasets

One thing this function won't hand you without some coaxing is a lopsided dataset. It's not unreasonable to want to practice training algorithms on data with a huge class imbalance, but make_classification() generally provides an even-ish split:

for _ in range(5):
    X, y = make_classification(10000)
    print(sum(y == 1))
5007
4997
5013
5015
5019
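(In fairness, make_classification does expose a weights parameter that skews the class proportions at generation time; a minimal sketch, with flip_y=0 to switch off the default label noise:)

```python
from sklearn.datasets import make_classification

# weights gives the approximate fraction of samples in each class;
# a single entry covers class 0, and the remainder goes to class 1
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
print(sum(y == 1))  # roughly 100
```

Still, the pandas route below is a handy trick when you want to thin out a dataset you already have.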


To get around this, we’re going to do a bit of pandas magic™.

First, we’ll make a ton of data

import numpy as np
import pandas as pd

X, y = make_classification(10000)

Which is, as expected, split roughly 50/50 between 1 and 0

print(sum(y == 1))
5012


If we stuff this into a DataFrame

df = pd.DataFrame(np.c_[X, y])
df.shape
(10000, 21)
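(np.c_ column-stacks its arguments, so the 20 feature columns pick up y as a 21st column. A tiny illustration with made-up numbers:)

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])  # two rows of features
b = np.array([9, 9])            # target column

print(np.c_[a, b])
# [[1 2 9]
#  [3 4 9]]
```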


We can write some tricksy logic to find all of the 1 rows

true_rows = df[df[df.columns[-1]] == 1].index
print(len(true_rows))
5012
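The same selection reads a little more directly with iloc, which picks the last column by position instead of looking its label up twice (a stylistic alternative, same result):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(10000)
df = pd.DataFrame(np.c_[X, y])

# position-based: rows where the last column equals 1
true_rows_alt = df[df.iloc[:, -1] == 1].index
print(len(true_rows_alt))
```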


Randomly pick a selection of them. (Note that np.random.choice samples with replacement by default, so the same index can come up more than once; pass replace=False if you need exactly 100 unique survivors.)

survivors = np.random.choice(true_rows, 100)
survivors
array([  28, 2844, 4029,  876, 8481, 6550, 4976, 2347, 8514, 1986, 8691,
1763, 3272, 5270, 2864, 1217, 8935, 4135,  561, 2259, 3420,  804,
8671, 8038,  706, 1568, 8309, 3756,  877, 8653, 4461, 3986, 6061,
5831, 7167, 2214, 1415, 3260, 6016, 4288, 1924, 6576, 1546, 3267,
8999, 5531, 1152, 3800, 1523, 6493, 3390, 2965, 6186,   14, 7335,
7867, 5706, 8033, 1247, 7669, 8100, 8146, 2375, 3351, 9907, 5604,
4679, 3794, 3429, 1020, 6384, 8380, 3250, 6055, 2558, 3388, 4299,
929, 8720, 8145, 5570, 4274, 7250, 7692, 7795, 6964, 3834, 3589,
6312, 8418, 6370,  720, 1098, 4666, 3041, 8691, 8004, 3197, 1001,
1420], dtype=int64)


And cull the rest

chopping_block = set(true_rows) - set(survivors)
df = df.drop(chopping_block)

Giving us a much smaller DataFrame

df.shape
(5087, 21)


With an artificially down-sampled target column (99 positives rather than 100, since one index was drawn twice above)

df[df.columns[-1]].value_counts()
0.0    4988
1.0      99
Name: 20, dtype: int64


That we’ll stuff back into our original X, y values

X = df[df.columns[:-1]].values
y = df[df.columns[-1]].values

print(X.shape, y.shape)
(5087, 20) (5087,)


As a Function

def downsample_class(X, y, class_to_downsample=1, downsample_to_pct=0.1):
    df = pd.DataFrame(np.c_[X, y])

    true_rows = df[df[df.columns[-1]] == class_to_downsample].index
    num_desired_true = int(downsample_to_pct * len(true_rows))

    # replace=False guarantees num_desired_true unique survivors
    survivors = np.random.choice(true_rows, num_desired_true, replace=False)
    chopping_block = set(true_rows) - set(survivors)
    df = df.drop(chopping_block)

    X = df[df.columns[:-1]].values
    y = df[df.columns[-1]].values

    assert len(X) == len(y)
    return X, y

X, y = make_classification(100000)
X, y = downsample_class(X, y)
len(X)  # ~55000: all of the zeros plus 10% of the ones
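The same downsampling can be sketched without pandas at all; a numpy-only variant (downsample_class_np is a made-up name for this sketch) that doubles as a sanity check on the class balance you end up with:

```python
import numpy as np
from sklearn.datasets import make_classification

def downsample_class_np(X, y, class_label=1, keep_pct=0.1):
    """Keep every row of the other classes plus keep_pct of class_label's rows."""
    hits = np.flatnonzero(y == class_label)
    keep = np.random.choice(hits, int(keep_pct * len(hits)), replace=False)
    mask = np.ones(len(y), dtype=bool)
    mask[hits] = False   # drop all of the target class...
    mask[keep] = True    # ...then resurrect the chosen survivors
    return X[mask], y[mask]

X, y = make_classification(100000)
X_small, y_small = downsample_class_np(X, y)
print(round((y_small == 1).mean(), 3))  # roughly 0.09: ~5000 ones out of ~55000 rows
```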