csv 6: Iterating
After you’ve handled the “how” of parsing a csv, you can also specify “what” you get.
Trimming rows and columns at read time spares you from staging intermediate datasets before the read, or dropping data after you’ve already built your DataFrame.
There are also a number of arguments that control how to handle/iterate through very large files.
First, let’s start with a simple dataset.
import pandas as pd
csvPath = 'data/ex1.csv'
open(csvPath, 'r').read()
'a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo'
pd.read_csv(csvPath)
| | a | b | c | d | message |
|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
Skipping over Columns
Sometimes you’ve got more columns in your raw dataset than you want to stuff into a DataFrame. We can handle this pretty easily with usecols.

usecols

Specify which columns of data you want to use. This can be a list of str (column names)
pd.read_csv(csvPath, usecols=['a', 'c'])
| | a | c |
|---|---|---|
0 | 1 | 3 |
1 | 5 | 7 |
2 | 9 | 11 |
or a list of int (column indexes)
pd.read_csv(csvPath, usecols=[0, 2])
| | a | c |
|---|---|---|
0 | 1 | 3 |
1 | 5 | 7 |
2 | 9 | 11 |
Alternatively, you can pass a callable that gets applied to each column name; columns for which it returns True are kept.
pd.read_csv(csvPath, usecols=lambda x: '2' not in x)
| | a | b | c | d | message |
|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
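Since none of the column names in this file contain a '2', that particular filter ends up keeping every column. As a sketch, a callable that actually trims something (the choice of columns here is just for illustration) might look like:
# keep only the columns whose names appear in this set
pd.read_csv(csvPath, usecols=lambda name: name in {'a', 'message'})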
Interestingly, if you specify fewer columns than are available, you see performance gains because the parser skips right over the unwanted columns in each row.
from io import StringIO
import numpy as np
# make a 100,000-row DataFrame and serialize it to a csv string
lotOfData = pd.DataFrame(np.random.rand(100000, 3)).to_csv()
%timeit pd.read_csv(StringIO(lotOfData))
136 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.read_csv(StringIO(lotOfData), usecols=[0, 2])
107 ms ± 7.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
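If you don’t know what columns a file has ahead of time, one cheap way to find out before committing to a usecols list (a sketch, not part of the timing comparison above) is to read zero rows and inspect just the header:
# parse only the header line to see what columns exist
pd.read_csv(StringIO(lotOfData), nrows=0).columns.tolist()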
Skipping over Rows
skiprows
skipfooter
These are especially useful for data brute-force converted to csv from xlsx files.
With List-Indexing
You can specify which rows you want to skip with a list of int. We’ll skip the lines containing “hello” and “world”.
pd.read_csv(csvPath, skiprows=[1, 2])
| | a | b | c | d | message |
|---|---|---|---|---|---|
0 | 9 | 10 | 11 | 12 | foo |
With Head/Tail Behavior
Just as DataFrame.head(n) and DataFrame.tail(n) give you the first/last n rows, you can use the same kind of logic for skipping rows if you pass an int instead of a list to the skiprows argument.
Here we’ll skip the first two lines of the file.
pd.read_csv(csvPath, skiprows=2)
| | 5 | 6 | 7 | 8 | world |
|---|---|---|---|---|---|
0 | 9 | 10 | 11 | 12 | foo |
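Notice that an int passed to skiprows counts raw lines from the top of the file, header included, which is why the first surviving data row got promoted to the column names above. If you want to keep the original header, a small sketch is to pass a list-like of the data line numbers instead:
# skip file lines 1 and 2 (the 'hello' and 'world' rows) but keep line 0 as the header
pd.read_csv(csvPath, skiprows=range(1, 3))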
To skip rows at the end, we’ll use the skipfooter argument in the same fashion.
pd.read_csv(csvPath, skipfooter=2)
C:\Users\nhounshell\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.
"""Entry point for launching an IPython kernel.
| | a | b | c | d | message |
|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
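As a sketch of where these two arguments earn their keep with xlsx-style exports (the csv text below is made up for illustration), you can strip a title block at the top and a totals row at the bottom in one call. Passing engine='python' explicitly also silences the warning above, since skipfooter needs the python engine anyway:
# hypothetical export: two junk title lines up top, a totals line at the bottom
messyCsv = StringIO(
    'Quarterly Report\n'
    'Generated 2020-01-01\n'
    'a,b,c\n'
    '1,2,3\n'
    '4,5,6\n'
    'TOTAL,7,9\n'
)
pd.read_csv(messyCsv, skiprows=2, skipfooter=1, engine='python')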
We want some Data
nrows

Applies all of the other rules/logic in the function call, but limits the number of rows returned to whatever int you pass in.
Useful for large files. Silly for 4-row files.
pd.read_csv(csvPath, nrows=2)
| | a | b | c | d | message |
|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
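On something bigger, the same idea is a cheap way to eyeball a file’s layout without parsing everything. A quick sketch using the 100,000-row csv string built in the timing example earlier:
# parse just the first five rows of the big csv string
pd.read_csv(StringIO(lotOfData), nrows=5)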
With a Function
The third type of argument that you can pass to skiprows is a callable applied to each row’s index; rows for which it returns a truthy value get skipped.
This could be useful to grab a sort-of-random assortment of rows. For instance, consider this large csv.
import numpy as np
bigCsvPath = 'data/ex6.csv'
len(pd.read_csv(bigCsvPath))
10000
Say we want about a sixth of those rows (~1667). Cheating a bit with numpy.random.randint, we’ll keep about 1 in 6 lines.
KEEP_ONE_IN_N = 6
# skip a row whenever randint draws a nonzero value, i.e. keep roughly 1 row in 6
semirand = lambda x: np.random.randint(0, KEEP_ONE_IN_N)
len(pd.read_csv(bigCsvPath, skiprows=semirand))
1689
Note that this is going to give different results each run unless you seed.
KEEP_ONE_IN_N = 6
semirand = lambda x: np.random.randint(0, KEEP_ONE_IN_N)
pd.read_csv(bigCsvPath, skiprows=semirand).head()
| | -0.358893469543 | 1.40445260007 | 0.704964644926 | -0.200638304015 | B |
|---|---|---|---|---|---|
0 | 2.317658 | 0.430710 | -1.334216 | 0.199679 | P |
1 | -0.088496 | 0.634712 | 0.153324 | 0.415335 | B |
2 | -0.070127 | -1.579097 | 0.120892 | 0.671432 | F |
3 | -0.248618 | 0.868707 | -0.492226 | -0.717959 | W |
4 | -1.091549 | -0.867110 | -0.647760 | -0.832562 | C |
KEEP_ONE_IN_N = 6
semirand = lambda x: np.random.randint(0, KEEP_ONE_IN_N)
pd.read_csv(bigCsvPath, skiprows=semirand).head()
| | -0.913134961617 | 1.53062351168 | -0.572656719239 | 0.477252252981 | K |
|---|---|---|---|---|---|
0 | 2.317658 | 0.430710 | -1.334216 | 0.199679 | P |
1 | 1.338459 | 0.338489 | -0.213446 | 2.325923 | Q |
2 | 0.762428 | -0.648529 | 1.936515 | -1.159666 | J |
3 | 0.124548 | 0.730791 | -0.257527 | -1.511257 | G |
4 | -0.541681 | 1.510031 | 0.467945 | 0.553225 | N |
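One wrinkle worth flagging: the callable also sees the header line at index 0, so whenever the random draw skips it, the first surviving data row gets promoted to the column names, which is exactly what happened in both runs above. A small sketch that always keeps row 0 and seeds numpy so the sample is repeatable:
np.random.seed(42)  # make the 'random' sample reproducible across runs
KEEP_ONE_IN_N = 6
# never skip the header (index 0); skip ~5 in 6 of the remaining rows
keep_header_semirand = lambda x: x > 0 and np.random.randint(0, KEEP_ONE_IN_N) != 0
pd.read_csv(bigCsvPath, skiprows=keep_header_semirand).head()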
Chunking Iteration Out
iterator
chunksize
Say you have a very large file that you want to gradually read through, instead of getting everything at once. You can do this by setting iterator=True and setting whatever batch size you want with chunksize.
Doing this returns a TextFileReader, an iterable object.
it = pd.read_csv(bigCsvPath, iterator=True, chunksize=10,
usecols=[0], squeeze=True)
it
<pandas.io.parsers.TextFileReader at 0x103fe400>
it.__next__()
0 0.467976
1 -0.358893
2 -0.501840
3 0.204886
4 0.354628
5 1.817480
6 -0.776764
7 -0.913135
8 0.358480
9 -1.740877
Name: one, dtype: float64
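Because the reader is iterable, the more common pattern is to loop over it and fold each chunk into a running result rather than calling __next__() by hand. A sketch that tallies the row count and the sum of the first column across the whole file (the chunksize of 1,000 is an arbitrary choice here):
total_rows = 0
running_sum = 0.0
# each iteration yields a DataFrame with up to 1,000 rows
for chunk in pd.read_csv(bigCsvPath, chunksize=1000):
    total_rows += len(chunk)
    running_sum += chunk.iloc[:, 0].sum()
print(total_rows, running_sum)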