Rolling DataFrame Window

I wanted to train some sort of sequence model on some mental health data I’d been capturing.

The data was stored as a flat .csv: a bunch of columns (omitted) for the various things I track per entry, a couple of columns (date, timestamp_id) recording when the entry was made, and finally mood_id, my target variable.

However, going from that table to something ingestible by a model took some creativity.

The Problem

The idea was to start from a dataset of records ordered sequentially:

import pandas as pd

df = (pd.read_csv('../data/moods.csv', parse_dates=['date'])
        .sort_values(['date', 'timestamp_id']))

# want ordered data, but won't use this column
del df['date']
df.head()
mood_id timestamp_id
0 3 5
1 4 1
2 4 3
3 4 5
4 4 1
df.shape
(2666, 2)
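
To see the load-and-sort step in isolation, here is a minimal sketch that runs without the real file. The in-memory CSV and its values are made up for illustration, and the `reset_index` call is my own addition so the row labels follow the sorted order.

```python
import io
import pandas as pd

# Hypothetical stand-in for moods.csv; these rows are invented for the example.
csv_text = """date,timestamp_id,mood_id
2019-01-02,1,4
2019-01-01,5,3
2019-01-02,3,4
"""

toy = (pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
         .sort_values(['date', 'timestamp_id'])
         .reset_index(drop=True))

# the date column only matters for ordering, so drop it afterwards
toy = toy.drop(columns='date')
print(toy)
```

Sorting on both columns matters here: date orders the days, and timestamp_id orders the entries within a day.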

I wanted to scan through my records n rows at a time and extract the matrix of values in that chunk of the table.

So if n=5, the first step would look like

df.iloc[0:5].values
array([[3, 5],
       [4, 1],
       [4, 3],
       [4, 5],
       [4, 1]], dtype=int64)

then

df.iloc[1:6].values
array([[4, 1],
       [4, 3],
       [4, 5],
       [4, 1],
       [5, 2]], dtype=int64)

until we got to

df.iloc[-5:].values
array([[4, 1],
       [5, 2],
       [5, 3],
       [4, 4],
       [4, 5]], dtype=int64)

All Together

That whole process can be expressed with a simple generator

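def window_scan(df, windowSize):
    # a frame of N rows holds N - windowSize + 1 full windows
    numWindows = len(df) - windowSize + 1
    for i in range(numWindows):
        # rows i through i + windowSize - 1, as a 2-D array
        yield df.iloc[i:i + windowSize].values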
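
As a quick sanity check on the window count, here is a small sketch that runs the same generator logic on a toy frame. The toy values are made up, and the generator is restated with snake_case names only so the block runs on its own.

```python
import pandas as pd

def window_scan(df, window_size):
    """Yield each run of `window_size` consecutive rows as a 2-D array."""
    num_windows = len(df) - window_size + 1
    for i in range(num_windows):
        yield df.iloc[i:i + window_size].values

# Made-up toy data: 6 rows, 2 columns
toy = pd.DataFrame({'mood_id': [3, 4, 4, 4, 4, 5],
                    'timestamp_id': [5, 1, 3, 5, 1, 2]})

windows = list(window_scan(toy, 5))
print(len(windows))       # 6 - 5 + 1 = 2 windows
print(windows[0].shape)   # (5, 2)

# a frame shorter than the window yields no windows at all
print(list(window_scan(toy.head(3), 5)))  # []
```

The last line shows an edge case worth knowing about: when the frame has fewer rows than the window, `range` receives a non-positive count and the generator quietly produces nothing rather than raising.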

If that works correctly, each value we pull from the generator should match the corresponding manual slice.

windowIter = window_scan(df, 5)
df.iloc[0:5].values
array([[3, 5],
       [4, 1],
       [4, 3],
       [4, 5],
       [4, 1]], dtype=int64)
next(windowIter)
array([[3, 5],
       [4, 1],
       [4, 3],
       [4, 5],
       [4, 1]], dtype=int64)
df.iloc[1:6].values
array([[4, 1],
       [4, 3],
       [4, 5],
       [4, 1],
       [5, 2]], dtype=int64)
next(windowIter)
array([[4, 1],
       [4, 3],
       [4, 5],
       [4, 1],
       [5, 2]], dtype=int64)

Looks good to me. Finally, we can stuff it into the 3-D NumPy array that our model is expecting, with shape (number of windows, rows per window, columns).

import numpy as np

windowIter = window_scan(df, 5)
res = np.array(list(windowIter), dtype='float64')
res.shape
(2662, 5, 2)
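
That first dimension is just 2666 - 5 + 1 = 2662. For what it's worth, NumPy can build the same 3-D layout without a Python loop. The sketch below is an alternative I'm swapping in, not what the post above uses: it assumes NumPy 1.20 or newer, where `sliding_window_view` is available, and it runs on made-up values.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Made-up stand-in for df.values: 6 rows, 2 columns
values = np.arange(12, dtype='float64').reshape(6, 2)
window_size = 5

# sliding_window_view appends the window dimension at the end,
# giving (num_windows, n_cols, window_size)
view = sliding_window_view(values, window_shape=window_size, axis=0)

# reorder to (num_windows, window_size, n_cols) to match the generator output
res = view.transpose(0, 2, 1)
print(res.shape)  # (2, 5, 2)
```

The result is a read-only view onto the original array, so call `.copy()` on it if the model or any later step needs to write into it.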