csv 2: Indexing

Indexing

Getting the data out of a file involves first telling pandas what rules to follow when parsing, as well as how to label the data in the finished DataFrame.

Here’s our simple csv.

import pandas as pd

csvPath = 'data/ex1.csv'
open(csvPath, 'r').read()
'a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo'

Dialect

This should be immediately familiar to anyone comfortable working with the csv standard library. Each of these can be passed in as an individual argument or as part of a dialect argument.

sep (same as 'delimiter')
quoting
quotechar
doublequote
escapechar    
skipinitialspace
encoding
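As a quick sketch of the dialect route: you can register a dialect with the csv module and hand its name to read_csv. The 'pipes' dialect name and the inline data here are invented for illustration.

```python
import csv
import io

import pandas as pd

# Register a hypothetical dialect for a pipe-delimited file --
# the 'pipes' name and the inline data are made up for this sketch
csv.register_dialect(
    'pipes',
    delimiter='|',
    quotechar='"',
    doublequote=True,
    skipinitialspace=True,
)

data = 'a|b|c\n1|2|3\n4|5|6'

# read_csv accepts either a registered dialect name or a csv.Dialect
df_dialect = pd.read_csv(io.StringIO(data), dialect='pipes')
print(df_dialect)
```

Any of the dialect's settings can still be overridden by passing the matching read_csv argument directly.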

The sep argument, for instance, helps resolve files that look like this

pd.read_csv('data/iris_bar.csv').head()
sepal_length|sepal_width|petal_length|petal_width|species
0 5.1|3.5|1.4|0.2|setosa
1 4.9|3.0|1.4|0.2|setosa
2 4.7|3.2|1.3|0.2|setosa
3 4.6|3.1|1.5|0.2|setosa
4 5.0|3.6|1.4|0.2|setosa

Into something a little less awful.

pd.read_csv('data/iris_bar.csv', sep='|').head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

If None is passed to the sep argument, then the Python engine will take over, running the sniffer tool from the csv module.
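A minimal sniffing sketch (the inline, semicolon-delimited data is invented for illustration):

```python
import io

import pandas as pd

# sep=None hands delimiter detection to csv.Sniffer; engine='python'
# is specified explicitly since the C engine can't sniff
data = 'a;b;c\n1;2;3\n4;5;6'
df_sniffed = pd.read_csv(io.StringIO(data), sep=None, engine='python')
print(df_sniffed)
```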

Differences from Vanilla csv.Dialect

Using Regex in sep

Consider some silly file like this

open('data/iris_chaos.csv', encoding='utf-8').read()
'sepal_length±sepal_width±petal_length±petal_width±species\n5.1™3.5™1.4™0.2™setosa\n4.9™3.0™1.4™0.2™setosa\n4.7♫3.2♫1.3♫0.2♫setosa\n4.6♫3.1♫1.5♫0.2♫setosa\n5.0♫3.6♫1.4♫0.2♫setosa\n5.4™3.9™1.7™0.4™setosa\n4.6™3.4™1.4™0.3™setosa\n5.0ñ3.4ñ1.5ñ0.2ñsetosa\n4.4ñ2.9ñ1.4ñ0.2ñsetosa\n4.9™3.1™1.5™0.1™setosa\n5.4™3.7™1.5™0.2™setosa\n4.8™3.4™1.6™0.2™setosa\n4.8™3.0™1.4™0.1™setosa\n4.3™3.0™1.1™0.1™setosa\n5.8™4.0™1.2™0.2™setosa\n5.7™4.4™1.5™0.4™setosa'

Trying its best, the csv module would handle it like this.

import csv

with open('data/iris_chaos.csv', encoding='utf-8') as f:
    csvreader = csv.reader(f)
    for row in csvreader:
        print(row)
['sepal_length±sepal_width±petal_length±petal_width±species']
['5.1™3.5™1.4™0.2™setosa']
['4.9™3.0™1.4™0.2™setosa']
['4.7♫3.2♫1.3♫0.2♫setosa']
['4.6♫3.1♫1.5♫0.2♫setosa']
['5.0♫3.6♫1.4♫0.2♫setosa']
['5.4™3.9™1.7™0.4™setosa']
['4.6™3.4™1.4™0.3™setosa']
['5.0ñ3.4ñ1.5ñ0.2ñsetosa']
['4.4ñ2.9ñ1.4ñ0.2ñsetosa']
['4.9™3.1™1.5™0.1™setosa']
['5.4™3.7™1.5™0.2™setosa']
['4.8™3.4™1.6™0.2™setosa']
['4.8™3.0™1.4™0.1™setosa']
['4.3™3.0™1.1™0.1™setosa']
['5.8™4.0™1.2™0.2™setosa']
['5.7™4.4™1.5™0.4™setosa']

You can pass in one of the weird, one-off characters (♫, ™, ñ, ±) as the delimiter, but then csv only splits the rows that happen to use that character.

with open('data/iris_chaos.csv', encoding='utf-8') as f:
    csvreader = csv.reader(f, delimiter='ñ')
    for row in csvreader:
        print(row)
['sepal_length±sepal_width±petal_length±petal_width±species']
['5.1™3.5™1.4™0.2™setosa']
['4.9™3.0™1.4™0.2™setosa']
['4.7♫3.2♫1.3♫0.2♫setosa']
['4.6♫3.1♫1.5♫0.2♫setosa']
['5.0♫3.6♫1.4♫0.2♫setosa']
['5.4™3.9™1.7™0.4™setosa']
['4.6™3.4™1.4™0.3™setosa']
['5.0', '3.4', '1.5', '0.2', 'setosa']
['4.4', '2.9', '1.4', '0.2', 'setosa']
['4.9™3.1™1.5™0.1™setosa']
['5.4™3.7™1.5™0.2™setosa']
['4.8™3.4™1.6™0.2™setosa']
['4.8™3.0™1.4™0.1™setosa']
['4.3™3.0™1.1™0.1™setosa']
['5.8™4.0™1.2™0.2™setosa']
['5.7™4.4™1.5™0.4™setosa']

In the pandas version, you can pass in a regular expression if your file is delimited by more than a single character (or, as here, by several different ones).

Note: This will force the Python engine (as evidenced by the warning message below)

bunchOfGarbageChars = 'ñ|±|™|Þ|→|♫'
pd.read_csv('data/iris_chaos.csv', sep=bunchOfGarbageChars, encoding='utf-8')
C:\Users\nhounshell\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa
10 5.4 3.7 1.5 0.2 setosa
11 4.8 3.4 1.6 0.2 setosa
12 4.8 3.0 1.4 0.1 setosa
13 4.3 3.0 1.1 0.1 setosa
14 5.8 4.0 1.2 0.2 setosa
15 5.7 4.4 1.5 0.4 setosa

Data split by whitespace

Sometimes, your data is split by a variable amount of whitespace

open('data/ex3.csv').read()
'            A         B         C\naaa -0.264438 -1.026059 -0.619500\nbbb  0.927272  0.302904 -0.032399\nccc -0.264273 -0.386314 -0.217601\nddd -0.871858 -0.348382  1.100491'

So you could use some regex magic and pass a ‘split on whitespace’ pattern

pd.read_csv('data/ex3.csv', sep='\s+')
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491

Or you could pass delim_whitespace=True, which does the exact same thing.

pd.read_csv('data/ex3.csv', delim_whitespace=True)
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491

Labeling your data

header
prefix
names
index_col

header

Which row(s) to use as column names. Takes the first row of the data by default.

pd.read_csv(csvPath)
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

You can also tell it that there isn’t a header

pd.read_csv(csvPath, header=None)
0 1 2 3 4
0 a b c d message
1 1 2 3 4 hello
2 5 6 7 8 world
3 9 10 11 12 foo

Or specify a Multi-Index header by row number.

(Note that we skipped over row 1, and that the one remaining row after row 2 is our data)

pd.read_csv(csvPath, header=[0, 2])
a b c d message
5 6 7 8 world
0 9 10 11 12 foo

prefix

If your data has no header, this is the string to put in front of the generated 0..N-1 column names.

This would be used if you find yourself pulling in many different csv files and are more concerned about which table a column is from than what it represents.

pd.read_csv(csvPath, header=None, prefix='X')
X0 X1 X2 X3 X4
0 a b c d message
1 1 2 3 4 hello
2 5 6 7 8 world
3 9 10 11 12 foo

names

List of column names to use.

This is most useful if your data doesn’t have a header row. If it does, then specifying names will just push the old header down into the data.

pd.read_csv(csvPath, names=['col1', 'col2', 'col3', 'col4'])
col1 col2 col3 col4
a b c d message
1 2 3 4 hello
5 6 7 8 world
9 10 11 12 foo

Unless you specify that the header is found in the top row.

pd.read_csv(csvPath, names=['col1', 'col2', 'col3', 'col4'], header=0)
col1 col2 col3 col4
1 2 3 4 hello
5 6 7 8 world
9 10 11 12 foo

Which makes this very useful for renaming weird columns.

pd.read_csv(csvPath, names=['foo', 'bar', 'baz', 'meh'], header=0)
foo bar baz meh
1 2 3 4 hello
5 6 7 8 world
9 10 11 12 foo

index_col

Which column to use as your index. Without one, the index just defaults to a 0..N-1 range.

pd.read_csv(csvPath)
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

But something else might be more useful, and you can pass it either as a str with the column name or as an int with the column position.

pd.read_csv(csvPath, index_col='c')
a b d message
c
3 1 2 4 hello
7 5 6 8 world
11 9 10 12 foo

Notice that specifying the index_col also gives the index a name.

pd.read_csv(csvPath, index_col=2)
a b d message
c
3 1 2 4 hello
7 5 6 8 world
11 9 10 12 foo

Just Want a Column?

squeeze

It’s an edge case, but if you find that you’re only trying to read one column from a csv, it’s usually best to store it as a Series, not a DataFrame.

pd.read_csv(csvPath, usecols=[0])
a
0 1
1 5
2 9
pd.read_csv(csvPath, usecols=[0], squeeze=True)
0    1
1    5
2    9
Name: a, dtype: int64
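One caveat: in more recent pandas releases the squeeze argument was deprecated (and eventually removed). If that applies to your version, the DataFrame.squeeze method does the same job. A sketch, using the same data inline in place of the file:

```python
import io

import pandas as pd

# Same contents as the ex1.csv used throughout
data = 'a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo'

# .squeeze('columns') collapses a one-column DataFrame into a Series
ser = pd.read_csv(io.StringIO(data), usecols=[0]).squeeze('columns')
print(ser)
```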

Resources

Data from Chapter 6 of Wes McKinney’s Python for Data Analysis 1st ed