Generating Classification Datasets
When you’re tired of running through the Iris or Breast Cancer datasets for the umpteenth time, sklearn
has a neat utility that lets you generate classification datasets.
Its use is pretty simple. A call to the function yields a matrix of attributes and a target column of the same length:
import numpy as np
from sklearn.datasets import make_classification
X, y = make_classification()
print(X.shape, y.shape)
(100, 20) (100,)
Customizing
Additionally, the function takes a bunch of parameters that allow you to modify your dataset, including:
Number of samples and size of feature space
X, y = make_classification(n_samples=1000, n_features=10)
print(X.shape, y.shape)
(1000, 10) (1000,)
Number of redundant features (linear combinations of the informative ones) and repeated features to trip up your models, as well as how many informative variables there are
X, y = make_classification(n_redundant=4, n_repeated=5, n_informative=10)
Number of classes you aim to predict (note: the generation algorithm needs enough informative attributes to support the number of classes, as sketched below)
X, y = make_classification(n_informative=4, n_classes=3)
print(np.unique(y))
[0 1 2]
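Concretely, the balance rule (in current scikit-learn, with the default n_clusters_per_class=2) is that n_classes * n_clusters_per_class can't exceed 2 ** n_informative; violating it raises a ValueError. A minimal sketch:
# 3 classes * 2 clusters = 6, but 2 informative features only give 2**2 = 4 cluster corners
try:
    X, y = make_classification(n_informative=2, n_classes=3)
except ValueError as e:
    print(e)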
And other, more technical elements
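For instance (these are real make_classification parameters, though the values here are purely illustrative), class_sep controls how far apart the classes sit and flip_y randomly flips a fraction of the labels:
# A harder problem: classes pushed closer together and 5% label noise;
# random_state pins the output for reproducibility
X, y = make_classification(class_sep=0.5, flip_y=0.05, random_state=42)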
Imbalanced Datasets
One thing the default call doesn't give you is class imbalance. It's not unreasonable to want to practice training algorithms on datasets with a huge skew between classes, but make_classification()
provides an even-ish split out of the box (the small wobble below comes from the default flip_y=0.01 label noise):
for _ in range(5):
    X, y = make_classification(10000)
    print(sum(y == 1))
5007
4997
5013
5015
5019
To get around this, we're going to do a bit of pandas magic™.
First, we’ll make a ton of data
import numpy as np
import pandas as pd
X, y = make_classification(10000)
Which is, as expected, split about 50/50 between 1 and 0
print(sum(y == 1))
5012
If we stuff this into a DataFrame
df = pd.DataFrame(np.c_[X, y])
df.shape
(10000, 21)
We can write some tricksy logic to find all of the 1 rows
true_rows = df[df[df.columns[-1]] == 1].index
print(len(true_rows))
5012
Randomly pick a selection of them
survivors = np.random.choice(true_rows, 100)
survivors
array([ 28, 2844, 4029, 876, 8481, 6550, 4976, 2347, 8514, 1986, 8691,
1763, 3272, 5270, 2864, 1217, 8935, 4135, 561, 2259, 3420, 804,
8671, 8038, 706, 1568, 8309, 3756, 877, 8653, 4461, 3986, 6061,
5831, 7167, 2214, 1415, 3260, 6016, 4288, 1924, 6576, 1546, 3267,
8999, 5531, 1152, 3800, 1523, 6493, 3390, 2965, 6186, 14, 7335,
7867, 5706, 8033, 1247, 7669, 8100, 8146, 2375, 3351, 9907, 5604,
4679, 3794, 3429, 1020, 6384, 8380, 3250, 6055, 2558, 3388, 4299,
929, 8720, 8145, 5570, 4274, 7250, 7692, 7795, 6964, 3834, 3589,
6312, 8418, 6370, 720, 1098, 4666, 3041, 8691, 8004, 3197, 1001,
1420], dtype=int64)
And cull the rest
chopping_block = set(true_rows) - set(survivors)
df = df.drop(chopping_block)
Giving us a much smaller DataFrame
df.shape
(5087, 21)
With an artificially down-sampled target column
df[df.columns[-1]].value_counts()
0.0 4988
1.0 99
Name: 20, dtype: int64
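Sharp eyes will notice we got 99 ones instead of the 100 we asked for. That's because np.random.choice samples with replacement by default, so index 8691 was drawn twice above. If an exact count matters, pass replace=False to sample without replacement:
survivors = np.random.choice(true_rows, 100, replace=False)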
Finally, we stuff this back into our original X, y values
X = df[df.columns[:-1]].values
y = df[df.columns[-1]].values
print(X.shape, y.shape)
(5087, 20) (5087,)
As a Function
def downsample_class(X, y, class_to_downsample=1, downsample_to_pct=.1):
    df = pd.DataFrame(np.c_[X, y])
    # Index of every row belonging to the class we want to thin out
    true_rows = df[df[df.columns[-1]] == class_to_downsample].index
    num_desired_true = int(downsample_to_pct * len(true_rows))
    # replace=False guarantees exactly num_desired_true unique survivors
    survivors = np.random.choice(true_rows, num_desired_true, replace=False)
    chopping_block = set(true_rows) - set(survivors)
    df = df.drop(chopping_block)
    X = df[df.columns[:-1]].values
    y = df[df.columns[-1]].values
    assert len(X) == len(y)
    return X, y
X, y = make_classification(100000)
X, y = downsample_class(X, y)
len(X)
54765
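As a quick usage check (using the parameter names from the function above), np.unique with return_counts=True confirms the new balance:
X, y = make_classification(100000)
X, y = downsample_class(X, y, class_to_downsample=1, downsample_to_pct=.05)
print(np.unique(y, return_counts=True))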