# Splitting Your Data
It’s some Data Science 101 stuff to split your data out in order to validate the performance of your model. Thankfully, sklearn comes with some pretty robust, batteries-included approaches to doing that.
## Load a Dataset
Here we’ll use the Iris dataset.
```python
from sklearn.datasets import load_iris

data = load_iris()
data.keys()
```
```
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
```
```python
X = data['data']
y = data['target']

X.shape, y.shape
```
```
((150, 4), (150,))
```
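(As an aside, if your sklearn is new enough — 0.23 or later, if memory serves — `load_iris` can hand you a pandas DataFrame directly via its `as_frame` argument. A minimal sketch; the `df_iris` name is just for illustration:)

```python
# assumes sklearn >= 0.23; the 'frame' key holds features plus target in one table
data = load_iris(as_frame=True)
df_iris = data['frame']
```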
## Vanilla Split
```python
from sklearn.model_selection import train_test_split
```
Say we wanted to split our data 70/30; we’d just use the `test_size=0.3` argument.
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

[arr.shape for arr in train_test_split(X, y, test_size=0.3)]
```
```
[(105, 4), (45, 4), (105,), (45,)]
```
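One hedge worth mentioning: the split is random, so shapes aside, you’ll draw different rows every run. If you need a reproducible split, pass a `random_state`:

```python
# fixing the seed makes the split repeatable across runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```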
But this presupposes that we’ve already broken our data out into `X` and `y`. What if, instead, we started with a single table of data and wanted to preserve it as such?
```python
import numpy as np

values = np.c_[X, y]
values.shape
```
```
(150, 5)
```
The `train_test_split` function can handle that just fine.
```python
train_values, test_values = train_test_split(values, test_size=0.3)

# with no test_size given, train_test_split defaults to a 75/25 split
[arr.shape for arr in train_test_split(values)]
```
```
[(112, 5), (38, 5)]
```
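Also handy: `train_test_split` will split any number of same-length arrays in one call, keeping the rows aligned across all of them. For instance, carrying the row indices along (the variable names here are my own, purely for illustration):

```python
# the index array rides along with the data, so we can later
# recover which original rows landed in which set
idx = np.arange(len(values))
tr_vals, te_vals, tr_idx, te_idx = train_test_split(values, idx, test_size=0.3)
```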
## Stratification
One thing to note, looking at this, is the effect of our sampling on each population. For instance, all-in, our base dataset has a perfectly equal distribution of each kind of flower.
```python
import pandas as pd

features = ['x1', 'x2', 'x3', 'x4', 'flower']

df = pd.DataFrame(values, columns=features)
train_df = pd.DataFrame(train_values, columns=features)
test_df = pd.DataFrame(test_values, columns=features)
```
```python
df['flower'].value_counts().sort_index()
```
```
0.0    50
1.0    50
2.0    50
Name: flower, dtype: int64
```
However, as a result of our `train_test_split`, we’ve skewed the distribution between our train and test datasets.
```python
train_df['flower'].value_counts().sort_index() / len(train_df)
```
```
0.0    0.342857
1.0    0.342857
2.0    0.314286
Name: flower, dtype: float64
```
```python
test_df['flower'].value_counts().sort_index() / len(test_df)
```
```
0.0    0.311111
1.0    0.311111
2.0    0.377778
Name: flower, dtype: float64
```
If we were working with a massive amount of data, we might be able to make sweeping assumptions about this distribution, but with a meager 150 rows of data, we want to be careful about our sampling.
The `StratifiedShuffleSplit` object takes the typical “how do you want to split your data” arguments at instantiation.
```python
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.3)
```
But then it has its own `split` method, where you specify what `X` you’re splitting and, more importantly, what `y` it should be working to preserve the distribution of.
```python
for train_index, test_index in split.split(X=df, y=df['flower']):
    strat_train_set = values[train_index]
    strat_test_set = values[test_index]
```
That’s more like it.
```python
pd.DataFrame(strat_test_set)[4].value_counts()
```
```
0.0    15
2.0    15
1.0    15
Name: 4, dtype: int64
```
```python
pd.DataFrame(strat_train_set)[4].value_counts()
```
```
0.0    35
2.0    35
1.0    35
Name: 4, dtype: int64
```
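For what it’s worth, if all you need is a single stratified split, `train_test_split` can get you there directly via its `stratify` argument — a shorter path to the same place:

```python
# one stratified split, without building a splitter object;
# passing y to stratify preserves the class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y)
```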
## Splitting by Group
On the other hand, we might instead have a feature, say `garden`, that indexes which garden each flower grew in. Each garden grows only 10 samples, all of the same flower:
```python
from itertools import cycle, islice, chain

gardens = pd.Series(chain.from_iterable(
    [list(islice(cycle(range(0, 5)), 50)),    # flower 0 grows in gardens 0-4
     list(islice(cycle(range(5, 10)), 50)),   # flower 1 grows in gardens 5-9
     list(islice(cycle(range(10, 15)), 50))]  # flower 2 grows in gardens 10-14
), name='garden')
```
```python
gardens.value_counts()
```
```
14    10
13    10
12    10
11    10
10    10
9     10
8     10
7     10
6     10
5     10
4     10
3     10
2     10
1     10
0     10
Name: garden, dtype: int64
```
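(For the record, the same series falls out of a couple of numpy calls, if itertools isn’t your speed — a sketch, relying on the fact that each flower’s 50-row block divides evenly into its 5 gardens:)

```python
# tile cycles 0-4 down all 150 rows; repeat shifts each flower's block of 50
gardens_alt = pd.Series(np.tile(np.arange(5), 30) + np.repeat([0, 5, 10], 50),
                        name='garden')
```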
And suppose that each flower was assigned to a garden of the matching type. Since the Iris data already comes sorted by flower, we can just line the rows up against our `garden` series:

```python
# the iris rows are already grouped by flower type, so concatenating
# column-wise along the shared index keeps every garden single-flower
contrived = pd.concat([df, gardens], axis=1)
contrived.head()
```
| | x1 | x2 | x3 | x4 | flower | garden |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 | 2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 | 3 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 | 4 |
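A quick crosstab is a cheap way to confirm the setup behaves as advertised (my own sanity check, not load-bearing):

```python
# rows are flower types, columns are gardens; each column should
# contain exactly one non-zero entry, a 10
pd.crosstab(contrived['flower'], contrived['garden'])
```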
And then dummied, to convert the numeric `garden` labels into one-hot columns.
```python
contrived = contrived.join(pd.get_dummies(contrived['garden'], drop_first=False))
contrived.head()
```
| | x1 | x2 | x3 | x4 | flower | garden | 0 | 1 | 2 | 3 | ... | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 | 2 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 | 3 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 | 4 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 21 columns
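(If bare integers as column names make you nervous — they collide visually with positional indexing — `get_dummies` takes a `prefix` argument. A small, purely optional tweak:)

```python
# yields columns named garden_0 through garden_14 instead of 0-14
garden_dummies = pd.get_dummies(contrived['garden'], prefix='garden')
```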
Now.

Somehow, we found ourselves in a position where this `garden` feature made its way into our model. If we do our regular routine of `train_test_split()`, you can see that values of `garden` wind up in both the train and test sets.
```python
train_values, test_values = train_test_split(contrived, test_size=0.3)

garden_counts = pd.concat([train_values['garden'].value_counts(),
                           test_values['garden'].value_counts()],
                          axis=1).fillna(0)
garden_counts.columns = ['train', 'test']
garden_counts
```
| garden | train | test |
|---|---|---|
| 0 | 7 | 3 |
| 1 | 4 | 6 |
| 2 | 7 | 3 |
| 3 | 7 | 3 |
| 4 | 8 | 2 |
| 5 | 6 | 4 |
| 6 | 7 | 3 |
| 7 | 7 | 3 |
| 8 | 5 | 5 |
| 9 | 8 | 2 |
| 10 | 7 | 3 |
| 11 | 8 | 2 |
| 12 | 9 | 1 |
| 13 | 8 | 2 |
| 14 | 7 | 3 |
Finally, imagine that we were using some whiz-bang Deep Learning model that can learn non-linear relationships. What will likely happen is that the model just learns that non-zero values in dummy features 0-4 correspond to flower 0, and so on.

(Editor’s Note: I spent an embarrassing amount of time trying to prove this to be the case with a simple Decision Tree, to no avail. It just didn’t take the obvious bait.)
```python
train_values.groupby('flower').sum()
```
| flower | x1 | x2 | x3 | x4 | garden | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 164.4 | 113.3 | 48.5 | 7.6 | 71 | 7 | 4 | 7 | 7 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1.0 | 196.0 | 91.7 | 139.6 | 43.7 | 233 | 0 | 0 | 0 | 0 | 0 | 6 | 7 | 7 | 5 | 8 | 0 | 0 | 0 | 0 | 0 |
| 2.0 | 256.5 | 116.3 | 215.5 | 79.0 | 468 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 8 | 9 | 8 | 7 |
And so, because we continue to live in a world where we can’t simply drop `garden` (and its attendant dummy features) from our dataset, we instead want to ensure that if a record for garden X shows up in the training set, then no garden X records show up in the test set.

This keeps the model guessing, you see.
For this, we want to employ the `GroupShuffleSplit` object, which behaves much like the `StratifiedShuffleSplit` object we used above. (One wrinkle worth knowing: its `test_size` argument refers to a fraction of the groups, not of the rows.)
```python
from sklearn.model_selection import GroupShuffleSplit

group_splitter = GroupShuffleSplit(n_splits=1)
train_idx, test_idx = next(group_splitter.split(contrived, groups=contrived['garden']))
```
Lo and behold, the `garden` groups are mutually exclusive between train and test!
```python
garden_counts = pd.concat([contrived.loc[train_idx]['garden'].value_counts(),
                           contrived.loc[test_idx]['garden'].value_counts()],
                          axis=1).fillna(0)
garden_counts.columns = ['train', 'test']
garden_counts
```
| garden | train | test |
|---|---|---|
| 0 | 10.0 | 0.0 |
| 1 | 10.0 | 0.0 |
| 2 | 0.0 | 10.0 |
| 3 | 10.0 | 0.0 |
| 4 | 0.0 | 10.0 |
| 5 | 10.0 | 0.0 |
| 6 | 0.0 | 10.0 |
| 7 | 10.0 | 0.0 |
| 8 | 10.0 | 0.0 |
| 9 | 10.0 | 0.0 |
| 10 | 10.0 | 0.0 |
| 11 | 10.0 | 0.0 |
| 12 | 10.0 | 0.0 |
| 13 | 10.0 | 0.0 |
| 14 | 10.0 | 0.0 |
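And if you need this same group-exclusivity during cross-validation rather than in a single holdout, `GroupKFold` is the analogous tool — a sketch, reusing our contrived frame:

```python
from sklearn.model_selection import GroupKFold

# every fold keeps each garden entirely in train or entirely in test
gkf = GroupKFold(n_splits=3)
for fold_train_idx, fold_test_idx in gkf.split(contrived, groups=contrived['garden']):
    assert set(contrived.loc[fold_train_idx, 'garden']).isdisjoint(
        contrived.loc[fold_test_idx, 'garden'])
```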
Worth mentioning that this last section was brought to my attention (in a more practical context) via the book Building Machine Learning Powered Applications, wherein the author describes a case of building a model off of user entries to a Q&A site.

He posits that doing a similar group-wise split on `user_id` would prevent a sophisticated NLP application from learning a given user’s prose and using that to bias its decision making.