# Rolling DataFrame Window

I wanted to train some sort of sequence model on some mental health data I’d been capturing.

The data was stored as a flat `.csv` with a bunch of columns (omitted) representing various things I track per-entry, a couple columns (`date`, `timestamp_id`) to determine when the entry was, and finally, the `mood_id`, my target variable.

However, going from that table to something ingestible by a model took some creativity.

## The Problem

The idea was to start from a dataset of records ordered sequentially:

```
import pandas as pd

# parse the date column as datetimes so the sort below is chronological
df = (pd.read_csv('../data/moods.csv', parse_dates=['date'])
        .sort_values(['date', 'timestamp_id']))
# want ordered data, but won't use this column
del df['date']
df.head()
```

| | mood_id | timestamp_id |
|---|---|---|
| 0 | 3 | 5 |
| 1 | 4 | 1 |
| 2 | 4 | 3 |
| 3 | 4 | 5 |
| 4 | 4 | 1 |

`df.shape`

```
(2666, 2)
```

I wanted to scan through my records `n` rows at a time and extract the matrix of values in that chunk of the table.

So if `n=5`, the first step would look like

`df.iloc[0:5].values`

```
array([[3, 5],
       [4, 1],
       [4, 3],
       [4, 5],
       [4, 1]], dtype=int64)
```

then

`df.iloc[1:6].values`

```
array([[4, 1],
       [4, 3],
       [4, 5],
       [4, 1],
       [5, 2]], dtype=int64)
```

until we got to

`df.iloc[-5:].values`

```
array([[4, 1],
       [5, 2],
       [5, 3],
       [4, 4],
       [4, 5]], dtype=int64)
```
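To make the slice arithmetic explicit: a window that starts at offset `s` covers the half-open range `[s, s + n)`, so a table with `N` rows has `N - n + 1` windows. The sketch below only works with the bounds, not the data; `n_rows` is taken from the `df.shape` output above, and the `bounds` list is a name introduced here for illustration.

```python
n_rows = 2666   # row count reported by df.shape
n = 5           # window length

# A window starting at offset s covers the half-open slice [s, s + n).
bounds = [(s, s + n) for s in range(n_rows - n + 1)]

print(bounds[0])     # (0, 5)
print(bounds[1])     # (1, 6)
print(bounds[-1])    # (2661, 2666), the same rows as df.iloc[-5:]
print(len(bounds))   # 2662
```

Note that the end of each slice is exclusive, which is why the first window of five rows is `0:5` rather than `0:4`.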

## All Together

That whole process can be expressed with a simple generator

```
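def window_scan(df, windowSize):
    # one window per valid start offset: 0 .. len(df) - windowSize
    numWindows = len(df) - windowSize + 1
    for i in range(numWindows):
        # rows i .. i + windowSize - 1 as a (windowSize, n_columns) array
        yield df.iloc[i:(windowSize + i)].values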
```

If that works correctly, each call to `__next__()` (equivalent to the built-in `next(windowIter)`) should return the same values as the matching manual slice.

`windowIter = window_scan(df, 5)`

`df.iloc[0:5].values`

```
array([[3, 5],
       [4, 1],
       [4, 3],
       [4, 5],
       [4, 1]], dtype=int64)
```

`windowIter.__next__()`

```
array([[3, 5],
       [4, 1],
       [4, 3],
       [4, 5],
       [4, 1]], dtype=int64)
```

`df.iloc[1:6].values`

```
array([[4, 1],
       [4, 3],
       [4, 5],
       [4, 1],
       [5, 2]], dtype=int64)
```

`windowIter.__next__()`

```
array([[4, 1],
       [4, 3],
       [4, 5],
       [4, 1],
       [5, 2]], dtype=int64)
```
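The spot checks above can also be automated over every window. Since the original CSV isn't reproduced here, this sketch uses a small made-up frame (`toy`, a hypothetical stand-in with the same two columns) and compares each yielded window against the equivalent manual slice.

```python
import numpy as np
import pandas as pd


def window_scan(df, windowSize):
    numWindows = len(df) - windowSize + 1
    for i in range(numWindows):
        yield df.iloc[i:(windowSize + i)].values


# Made-up values standing in for the real mood data.
toy = pd.DataFrame({'mood_id': [3, 4, 4, 4, 4, 5, 5],
                    'timestamp_id': [5, 1, 3, 5, 1, 2, 3]})

n = 5
count = 0
for i, window in enumerate(window_scan(toy, n)):
    # Each yielded window should match the manual slice starting at row i.
    assert np.array_equal(window, toy.iloc[i:i + n].values)
    count += 1

print(count)  # 3 windows for 7 rows and n = 5
```

One edge case worth knowing: if the window is longer than the frame, `range` receives a non-positive count and the generator simply yields nothing.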

Looks good to me. Finally, we can stuff it into the `numpy` array that our model is expecting.

```
import numpy as np
windowIter = window_scan(df, 5)
res = np.array(list(windowIter), dtype='float64')
```

`res.shape`

```
(2662, 5, 2)
```
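As an aside, the same 3-D array can probably be built without a Python-level loop using `numpy.lib.stride_tricks.sliding_window_view`, which is available in NumPy 1.20 and later. This is an alternative sketch, not what the post above does; `values` is a made-up stand-in for `df.values`, and `res_alt` is a name introduced here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Made-up values standing in for df.values (columns: mood_id, timestamp_id).
values = np.array([[3, 5], [4, 1], [4, 3], [4, 5], [4, 1], [5, 2], [5, 3]])
n = 5

# Sliding along axis 0 appends the window axis last: (num_windows, 2, n).
view = sliding_window_view(values, window_shape=n, axis=0)

# Reorder to (num_windows, n, 2) and copy into a float64 array.
res_alt = view.transpose(0, 2, 1).astype('float64')

print(res_alt.shape)  # (3, 5, 2)
```

The view itself shares memory with `values` and is read-only by default, so the `astype` copy is what makes the result safe to hand to a model or modify.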