Splitting Your Data

It’s Data Science 101 to split your data out in order to validate the performance of your model. Thankfully, sklearn comes with some pretty robust, batteries-included approaches to doing that.

Load a Dataset

Here we’ll use the Iris Dataset

from sklearn.datasets import load_iris

data = load_iris()
data.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
X = data['data']
y = data['target']
X.shape, y.shape
((150, 4), (150,))
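As an aside, if you’d rather start from a table, newer versions of sklearn (0.23+) let load_iris hand back a DataFrame directly via as_frame — a quick sketch:

```python
from sklearn.datasets import load_iris

# as_frame=True packs the features and target into a single DataFrame
bunch = load_iris(as_frame=True)
df = bunch['frame']  # four feature columns plus a 'target' column

df.shape
```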

Vanilla Split

from sklearn.model_selection import train_test_split

Say we wanted to split our data 70/30; we’d just use the test_size=0.3 argument.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
[arr.shape for arr in train_test_split(X, y, test_size=0.3)]
[(105, 4), (45, 4), (105,), (45,)]
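Worth noting: every call to train_test_split reshuffles, so running the same split twice gives two different partitions. If you want a reproducible split, pass a random_state (the 42 below is an arbitrary seed):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# the same seed yields the same partition every time
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=42)

(X_te1 == X_te2).all()
```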

But this presupposes that we’ve already broken our data out into X and y. What if, instead, we started with a single table of data and wanted to preserve it as such?

import numpy as np

values = np.c_[X, y]
values.shape
(150, 5)

The train_test_split function can handle that just fine.

train_values, test_values = train_test_split(values, test_size=0.3)
[arr.shape for arr in train_test_split(values, test_size=0.3)]
[(105, 5), (45, 5)]
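In fact, train_test_split is happy to take a pandas DataFrame directly and hand DataFrames back, so the np.c_ step is optional if your data already lives in a table — a small sketch:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['target'] = data['target']

# the splits come back as DataFrames, indexes and all
train_df, test_df = train_test_split(df, test_size=0.3)
train_df.shape, test_df.shape
```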

Stratification

One thing to note, looking at this, is the effect of our sampling on each population. For instance, all-in, our base dataset has a perfectly equal distribution of each kind of flower.

import pandas as pd

features = ['x1', 'x2', 'x3', 'x4', 'flower']

df = pd.DataFrame(values, columns=features)
train_df = pd.DataFrame(train_values, columns=features)
test_df = pd.DataFrame(test_values, columns=features)
df['flower'].value_counts().sort_index()
0.0    50
1.0    50
2.0    50
Name: flower, dtype: int64

However, as a result of our train_test_split, we’ve skewed the distribution between our train and test datasets.

train_df['flower'].value_counts().sort_index() / len(train_df)
0.0    0.342857
1.0    0.342857
2.0    0.314286
Name: flower, dtype: float64
test_df['flower'].value_counts().sort_index() / len(test_df)
0.0    0.311111
1.0    0.311111
2.0    0.377778
Name: flower, dtype: float64

If we were working with a massive amount of data, we might be able to make sweeping assumptions about this distribution, but with a meager 150 rows of data, we want to be careful about our sampling.

The StratifiedShuffleSplit object takes the typical “how do you want to split your data” arguments at instantiation.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.3)

But then it has its own split method where you specify what X you’re splitting and, more importantly, what y it should be working to preserve a distribution of.

for train_index, test_index in split.split(X=df, y=df['flower']):
    strat_train_set = values[train_index]
    strat_test_set = values[test_index]

That’s more like it.

pd.DataFrame(strat_test_set)[4].value_counts()
0.0    15
2.0    15
1.0    15
Name: 4, dtype: int64
pd.DataFrame(strat_train_set)[4].value_counts()
0.0    35
2.0    35
1.0    35
Name: 4, dtype: int64
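As an aside, if all you need is a single stratified split, train_test_split itself takes a stratify argument that does the same bookkeeping in one call:

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y preserves the class proportions in both partitions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

Counter(y_test)
```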

Splitting by Group

On the other hand, we might instead have a feature, say garden, that indexes which garden the flowers grew in.

Each garden grows only 10 samples of a single flower type.

from itertools import cycle, islice, chain

gardens = pd.Series(chain.from_iterable(
    [list(islice(cycle(range(0, 5)), 50)),   # flower 0 grows in gardens 0-4
     list(islice(cycle(range(5, 10)), 50)),  # flower 1 grows in gardens 5-9
     list(islice(cycle(range(10, 15)), 50))] # flower 2 grows in gardens 10-14
), name='garden')

gardens.value_counts()
14    10
13    10
12    10
11    10
10    10
9     10
8     10
7     10
6     10
5     10
4     10
3     10
2     10
1     10
0     10
Name: garden, dtype: int64
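A quick sanity check on that construction (rebuilding the flower labels inline so the snippet stands alone): each flower type should span exactly five distinct gardens.

```python
from itertools import cycle, islice, chain
import pandas as pd

gardens = pd.Series(chain.from_iterable(
    [list(islice(cycle(range(0, 5)), 50)),
     list(islice(cycle(range(5, 10)), 50)),
     list(islice(cycle(range(10, 15)), 50))]
), name='garden')

# 50 of each flower type, in the same order as the iris target
flower = pd.Series([0] * 50 + [1] * 50 + [2] * 50, name='flower')

# each flower type maps onto its own block of five gardens
gardens.groupby(flower).nunique()
```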

And say that we randomly assigned our flowers to the various gardens, so long as the garden matches the correct flower type.

# shuffle the rows within each flower type, then reset the index so that
# the positional pairing with the gardens series holds
shuffled_df = (df.groupby('flower', group_keys=False)
                 .apply(lambda x: x.sample(frac=1))
                 .reset_index(drop=True))

contrived = pd.concat([shuffled_df, gardens], axis=1)
contrived.head()
x1 x2 x3 x4 flower garden
0 5.1 3.5 1.4 0.2 0.0 0
1 4.9 3.0 1.4 0.2 0.0 1
2 4.7 3.2 1.3 0.2 0.0 2
3 4.6 3.1 1.5 0.2 0.0 3
4 5.0 3.6 1.4 0.2 0.0 4

And then one-hot encoded (“dummied”), to convert the numeric garden labels into categorical indicator columns.

contrived = contrived.join(pd.get_dummies(contrived['garden'], drop_first=False))
contrived.head()
x1 x2 x3 x4 flower garden 0 1 2 3 ... 5 6 7 8 9 10 11 12 13 14
0 5.1 3.5 1.4 0.2 0.0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 4.9 3.0 1.4 0.2 0.0 1 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 4.7 3.2 1.3 0.2 0.0 2 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
3 4.6 3.1 1.5 0.2 0.0 3 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
4 5.0 3.6 1.4 0.2 0.0 4 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 21 columns

Now.

Somehow, we found ourselves in a position where this garden feature made its way into our model. If we do our regular routine of train_test_split(), you can see that values of garden wind up in both the train and test sets.

train_values, test_values = train_test_split(contrived, test_size=0.3)


garden_counts = pd.concat([train_values['garden'].value_counts(),
                           test_values['garden'].value_counts()],
                           axis=1).fillna(0)
garden_counts.columns = ['train', 'test']
garden_counts
train test
0 7 3
1 4 6
2 7 3
3 7 3
4 8 2
5 6 4
6 7 3
7 7 3
8 5 5
9 8 2
10 7 3
11 8 2
12 9 1
13 8 2
14 7 3

Finally, imagine that we were using some whiz-bang Deep Learning model that can learn non-linear relationships.

What will likely happen is that the model simply learns that non-zero values in dummy columns 0-4 correspond to flower 0, and so on.

(Editor’s Note: I spent an embarrassing amount of time trying to prove this to be the case with a simple Decision Tree, to no avail. It just didn’t take the obvious bait)

train_values.groupby('flower').sum()
x1 x2 x3 x4 garden 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
flower
0.0 164.4 113.3 48.5 7.6 71 7 4 7 7 8 0 0 0 0 0 0 0 0 0 0
1.0 196.0 91.7 139.6 43.7 233 0 0 0 0 0 6 7 7 5 8 0 0 0 0 0
2.0 256.5 116.3 215.5 79.0 468 0 0 0 0 0 0 0 0 0 0 7 8 9 8 7

And so, because we continue to live in a world where we can’t simply drop garden (and its resulting dummy features) from our dataset, we instead want to ensure that if a record for garden X shows up in the training set, no garden X records show up in the test set.

This keeps the model guessing, you see.

For this, we want to employ the GroupShuffleSplit object, which behaves much like the StratifiedShuffleSplit object we used above.

from sklearn.model_selection import GroupShuffleSplit

group_splitter = GroupShuffleSplit(n_splits=1)

train_idx, test_idx = next(group_splitter.split(contrived, groups=contrived['garden']))

Lo and behold, the gardens in the train and test sets are mutually exclusive!

garden_counts = pd.concat([contrived.loc[train_idx]['garden'].value_counts(),
                           contrived.loc[test_idx]['garden'].value_counts()],
                          axis=1).fillna(0)
garden_counts.columns = ['train', 'test']
garden_counts
train test
0 10.0 0.0
1 10.0 0.0
2 0.0 10.0
3 10.0 0.0
4 0.0 10.0
5 10.0 0.0
6 0.0 10.0
7 10.0 0.0
8 10.0 0.0
9 10.0 0.0
10 10.0 0.0
11 10.0 0.0
12 10.0 0.0
13 10.0 0.0
14 10.0 0.0
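We can also confirm that disjointness programmatically rather than eyeballing the table — a self-contained rerun with a synthetic garden column standing in for the contrived frame above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# 150 rows across 15 gardens of 10 rows each, mirroring the setup above
toy = pd.DataFrame({'x': np.arange(150),
                    'garden': np.repeat(np.arange(15), 10)})

splitter = GroupShuffleSplit(n_splits=1)
train_idx, test_idx = next(splitter.split(toy, groups=toy['garden']))

# no garden appears on both sides of the split
train_gardens = set(toy.loc[train_idx, 'garden'])
test_gardens = set(toy.loc[test_idx, 'garden'])
train_gardens.isdisjoint(test_gardens)
```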

Worth mentioning that this last section was brought to my attention (in a more practical context) via the book Building Machine Learning Powered Applications, wherein the author describes a case of building a model off of user entries to a Q&A site.

He posits that doing a similar group-based split on user_id would prevent a sophisticated NLP application from learning a given user’s prose and using that to bias its decision making.