Rolling DataFrame Window
I wanted to train some sort of sequence model on some mental health data I’d been capturing.
The data was stored as a flat .csv with a bunch of columns (omitted) representing various things I track per-entry, a couple columns (date, timestamp_id) to determine when the entry was, and finally, the mood_id, my target variable.
However, going from that table to something ingestible by a model took some creativity.
The Problem
The idea was to start from a dataset of records ordered sequentially:
import pandas as pd
df = (pd.read_csv('../data/moods.csv', parse_dates=['date'])
        .sort_values(['date', 'timestamp_id']))
# want ordered data, but won't use this column
del df['date']
df.head()
| | mood_id | timestamp_id |
|---|---|---|
| 0 | 3 | 5 |
| 1 | 4 | 1 |
| 2 | 4 | 3 |
| 3 | 4 | 5 |
| 4 | 4 | 1 |
df.shape
(2666, 2)
I wanted to scan through my records n rows at a time and extract the matrix of values in that chunk of the table.
So if n=5, the first step would look like
df.iloc[0:5].values
array([[3, 5],
[4, 1],
[4, 3],
[4, 5],
[4, 1]], dtype=int64)
then
df.iloc[1:6].values
array([[4, 1],
[4, 3],
[4, 5],
[4, 1],
[5, 2]], dtype=int64)
until we got to
df.iloc[-5:].values
array([[4, 1],
[5, 2],
[5, 3],
[4, 4],
[4, 5]], dtype=int64)
All Together
That whole process can be expressed with a simple generator:
def window_scan(df, windowSize):
    # one window per valid starting row
    numWindows = len(df) - windowSize + 1
    for i in range(numWindows):
        yield df.iloc[i:(windowSize + i)].values
If that works correctly, we should expect equivalent results when we unpack using __next__()
windowIter = window_scan(df, 5)
df.iloc[0:5].values
array([[3, 5],
[4, 1],
[4, 3],
[4, 5],
[4, 1]], dtype=int64)
windowIter.__next__()
array([[3, 5],
[4, 1],
[4, 3],
[4, 5],
[4, 1]], dtype=int64)
df.iloc[1:6].values
array([[4, 1],
[4, 3],
[4, 5],
[4, 1],
[5, 2]], dtype=int64)
windowIter.__next__()
array([[4, 1],
[4, 3],
[4, 5],
[4, 1],
[5, 2]], dtype=int64)
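Rather than eyeballing two windows, the same comparison can be run over every window at once. This is a minimal sketch on a small made-up frame (the `toy` values are hypothetical, not my real data); it checks that the generator yields `len(df) - windowSize + 1` windows, that each one matches the corresponding `iloc` slice, and that the last one lines up with `df.iloc[-n:]`.

```python
import pandas as pd

def window_scan(df, windowSize):
    # same generator as above, repeated so this block runs on its own
    numWindows = len(df) - windowSize + 1
    for i in range(numWindows):
        yield df.iloc[i:(windowSize + i)].values

# hypothetical toy data standing in for the real moods table
toy = pd.DataFrame({
    "mood_id":      [3, 4, 4, 4, 4, 5, 5],
    "timestamp_id": [5, 1, 3, 5, 1, 2, 3],
})
n = 5
windows = list(window_scan(toy, n))

# 7 rows with a window of 5 leaves 7 - 5 + 1 starting positions
assert len(windows) == len(toy) - n + 1

# every window should equal the matching positional slice
for i, w in enumerate(windows):
    assert (w == toy.iloc[i:i + n].values).all()

# the final window should be the last n rows of the frame
assert (windows[-1] == toy.iloc[-n:].values).all()
```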
Looks good to me. Finally, we can stuff it into the numpy array that our model is expecting.
import numpy as np
windowIter = window_scan(df, 5)
res = np.array(list(windowIter), dtype='float64')
res.shape
(2662, 5, 2)
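As an aside, and not what I did above: assuming NumPy 1.20 or newer, `numpy.lib.stride_tricks.sliding_window_view` should be able to build the same 3-D array without a Python loop. The sketch below uses a hypothetical helper name (`window_stack`) and made-up toy data. Note that `sliding_window_view` puts the window dimension last, so the axes need swapping to get the `(windows, rows, columns)` layout.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def window_stack(df, window_size):
    # hypothetical helper: all windows as one (num_windows, window_size, n_cols) array
    values = df.to_numpy(dtype="float64")
    # sliding over axis 0 gives shape (num_windows, n_cols, window_size)
    windows = sliding_window_view(values, window_shape=window_size, axis=0)
    # move the window dimension to the middle to match the generator's layout
    return windows.transpose(0, 2, 1)

# hypothetical toy data standing in for the real moods table
toy = pd.DataFrame({
    "mood_id":      [3, 4, 4, 4, 4, 5],
    "timestamp_id": [5, 1, 3, 5, 1, 2],
})
stacked = window_stack(toy, 3)

# reference result built the slow way, one iloc slice per window
expected = np.array([toy.iloc[i:i + 3].values for i in range(len(toy) - 3 + 1)],
                    dtype="float64")
assert stacked.shape == expected.shape
assert np.array_equal(stacked, expected)
```

The result is a read-only view onto the underlying values, so call `.copy()` on it if the array needs to be modified afterwards.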