csv 6: Iterating

Iterating

After you’ve handled all of the “how” of parsing a csv, you can also specify “what” you get.

Trimming down rows and columns at read time spares you from staging intermediate datasets before the read, or from dropping data after you’ve already built your DataFrame.

There are also a number of arguments that instruct how to handle/iterate through very large files.

First, let’s start with a simple dataset.

import pandas as pd

csvPath = 'data/ex1.csv'
open(csvPath, 'r').read()
'a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo'
pd.read_csv(csvPath)
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Skipping over Columns

Sometimes you’ve got more columns in your raw dataset than you want to stuff into a DataFrame. We can handle this pretty easily with usecols.

usecols

Specify which columns of data you want to use, either as:

A list of str (col names)

pd.read_csv(csvPath, usecols=['a', 'c'])
a c
0 1 3
1 5 7
2 9 11

or a list of int (col indexes)

pd.read_csv(csvPath, usecols=[0, 2])
a c
0 1 3
1 5 7
2 9 11

Alternatively, you can pass a callable; it gets evaluated against each column name, and only the columns where it returns True are kept.

pd.read_csv(csvPath, usecols=lambda x: '2' not in x)
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
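
Since none of those column names contain a '2', the callable keeps every column here. As a minimal sketch of one that actually filters, this drops the message column by name:

pd.read_csv(csvPath, usecols=lambda name: name != 'message')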

Interestingly, if you specify fewer columns than are available, you see performance gains because the iterator skips right over the junk columns for each row.

from io import StringIO
import numpy as np

# build a 100k-row DataFrame and dump it to a csv string
lotOfData = pd.DataFrame(np.random.rand(100000, 3)).to_csv()
%timeit pd.read_csv(StringIO(lotOfData))
136 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.read_csv(StringIO(lotOfData), usecols=[0, 2])
107 ms ± 7.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Skipping over Rows

skiprows
skipfooter

These are especially useful for data brute-force converted to csv from xlsx files.

With List-Indexing

You can specify which rows you want to skip with a list of int. We’ll skip the lines containing “hello” and “world”.

pd.read_csv(csvPath, skiprows=[1, 2])
a b c d message
0 9 10 11 12 foo

With Head/Tail Behavior

Just as DataFrame.head(n) and DataFrame.tail(n) give you the first/last n rows, you can use the same kind of logic for skipping rows if you pass an int instead of a list to the skiprows argument.

Here we’ll skip the first two lines. Note that the header line counts, so the next unskipped line gets read in as the header.

pd.read_csv(csvPath, skiprows=2)
5 6 7 8 world
0 9 10 11 12 foo

To skip rows at the end, we’ll use the skipfooter argument in the same fashion.

pd.read_csv(csvPath, skipfooter=2)
C:\Users\nhounshell\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.
  """Entry point for launching an IPython kernel.
a b c d message
0 1 2 3 4 hello
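
As the warning says, you can avoid it by specifying engine='python' yourself:

pd.read_csv(csvPath, skipfooter=2, engine='python')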

We want some Data

nrows
skiprows

nrows applies all of the other rules/logic in the function call, but limits the number of rows returned to whatever int you pass in.

Useful for large files. Silly for 4 row files.

pd.read_csv(csvPath, nrows=2)
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
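
Because nrows plays nicely with the other arguments, you can also pair it with skiprows to pull a window out of the middle of a file. A minimal sketch against this same dataset:

# skip the 'hello' row, then keep just one row
pd.read_csv(csvPath, skiprows=[1], nrows=1)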

With a Function

The third type of argument that you can pass to skiprows is a callable, applied to the index of each row (header included); rows where it returns True get skipped.

This could be useful to grab a sort-of-random assortment of rows. For instance, consider this large csv.

import numpy as np

bigCsvPath = 'data/ex6.csv'
len(pd.read_csv(bigCsvPath))
10000

Say we want about a sixth of those rows (~1667). Cheating a bit with numpy.random.randint: randint(0, KEEP_ONE_IN_N) returns 0, which is falsy (keep the row), about one time in six, so we’ll get about 1 in 6 lines.

KEEP_ONE_IN_N = 6

semirand = lambda x: np.random.randint(0, KEEP_ONE_IN_N)
len(pd.read_csv(bigCsvPath, skiprows=semirand))
1689

Note that this is going to give different results each run unless you seed. And because the callable sees line 0 (the header) as well, the header row itself can get randomly skipped, which is why the “column names” below look like data values.

KEEP_ONE_IN_N = 6

semirand = lambda x: np.random.randint(0, KEEP_ONE_IN_N)
pd.read_csv(bigCsvPath, skiprows=semirand).head()
-0.358893469543 1.40445260007 0.704964644926 -0.200638304015 B
0 2.317658 0.430710 -1.334216 0.199679 P
1 -0.088496 0.634712 0.153324 0.415335 B
2 -0.070127 -1.579097 0.120892 0.671432 F
3 -0.248618 0.868707 -0.492226 -0.717959 W
4 -1.091549 -0.867110 -0.647760 -0.832562 C
KEEP_ONE_IN_N = 6

semirand = lambda x: np.random.randint(0, KEEP_ONE_IN_N)
pd.read_csv(bigCsvPath, skiprows=semirand).head()
-0.913134961617 1.53062351168 -0.572656719239 0.477252252981 K
0 2.317658 0.430710 -1.334216 0.199679 P
1 1.338459 0.338489 -0.213446 2.325923 Q
2 0.762428 -0.648529 1.936515 -1.159666 J
3 0.124548 0.730791 -0.257527 -1.511257 G
4 -0.541681 1.510031 0.467945 0.553225 N
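
If you want the sampling to be reproducible, seed numpy before the read. A minimal sketch:

np.random.seed(0)
len(pd.read_csv(bigCsvPath, skiprows=semirand))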

Chunking Iteration Out

iterator
chunksize

Say you have a very large file that you want to gradually read through, instead of getting everything at once. You can do this by passing iterator=True and setting whatever batch size you want with chunksize.

Doing this returns a TextFileReader iterable object.

it = pd.read_csv(bigCsvPath, iterator=True, chunksize=10,
                 usecols=[0], squeeze=True)
it
<pandas.io.parsers.TextFileReader at 0x103fe400>
it.__next__()
0    0.467976
1   -0.358893
2   -0.501840
3    0.204886
4    0.354628
5    1.817480
6   -0.776764
7   -0.913135
8    0.358480
9   -1.740877
Name: one, dtype: float64
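
From there, you can keep calling __next__() for the next batch, or just loop over a fresh reader to work through the whole file one chunk at a time. A minimal sketch that tallies rows across chunks:

# passing chunksize alone also returns a TextFileReader
totalRows = 0
for chunk in pd.read_csv(bigCsvPath, chunksize=1000):
    totalRows += len(chunk)
totalRows
10000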