csv 2: Indexing
Getting the data out of a file involves first telling pandas the rules it needs to follow when parsing, as well as how to label things in the finished DataFrame.
Here’s our simple csv.
```python
import pandas as pd

csvPath = 'data/ex1.csv'
open(csvPath, 'r').read()
```
```
'a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo'
```
Dialect
This should be immediately familiar to anyone comfortable working with the csv standard library. Each of these can be passed in as an individual argument or as part of a dialect argument:
- sep (same as 'delimiter')
- quoting
- quotechar
- doublequote
- escapechar
- skipinitialspace
- encoding
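For instance, here's a sketch of the two equivalent spellings, against a hypothetical pipe-delimited file:

```python
import csv

import pandas as pd

# Register a dialect bundling several parsing rules at once
csv.register_dialect('pipes', delimiter='|', skipinitialspace=True)

# Pass the rules as a dialect...
df = pd.read_csv('data/some_pipes.csv', dialect='pipes')

# ...or as individual arguments
df = pd.read_csv('data/some_pipes.csv', sep='|', skipinitialspace=True)
```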
These options help resolve files that look like this:
```python
pd.read_csv('data/iris_bar.csv').head()
```
```
  sepal_length|sepal_width|petal_length|petal_width|species
0                                  5.1|3.5|1.4|0.2|setosa
1                                  4.9|3.0|1.4|0.2|setosa
2                                  4.7|3.2|1.3|0.2|setosa
3                                  4.6|3.1|1.5|0.2|setosa
4                                  5.0|3.6|1.4|0.2|setosa
```
Into something a little less awful.
```python
pd.read_csv('data/iris_bar.csv', sep='|').head()
```
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
If None is passed as the sep argument, the Python engine will take over, running the Sniffer tool from the csv module to guess the delimiter.
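For example, a minimal sketch of letting the sniffer find the pipe delimiter above (passing engine='python' explicitly avoids the fallback warning):

```python
# sep=None hands delimiter detection to csv.Sniffer
pd.read_csv('data/iris_bar.csv', sep=None, engine='python').head()
```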
Differences from Vanilla csv.Dialect
Using Regex in sep
Consider some silly file like this
```python
open('data/iris_chaos.csv', encoding='utf-8').read()
```
```
'sepal_length±sepal_width±petal_length±petal_width±species\n5.1™3.5™1.4™0.2™setosa\n4.9™3.0™1.4™0.2™setosa\n4.7♫3.2♫1.3♫0.2♫setosa\n4.6♫3.1♫1.5♫0.2♫setosa\n5.0♫3.6♫1.4♫0.2♫setosa\n5.4™3.9™1.7™0.4™setosa\n4.6™3.4™1.4™0.3™setosa\n5.0ñ3.4ñ1.5ñ0.2ñsetosa\n4.4ñ2.9ñ1.4ñ0.2ñsetosa\n4.9™3.1™1.5™0.1™setosa\n5.4™3.7™1.5™0.2™setosa\n4.8™3.4™1.6™0.2™setosa\n4.8™3.0™1.4™0.1™setosa\n4.3™3.0™1.1™0.1™setosa\n5.8™4.0™1.2™0.2™setosa\n5.7™4.4™1.5™0.4™setosa'
```
Trying its best, the csv module would handle it like this.
```python
import csv

with open('data/iris_chaos.csv', encoding='utf-8') as f:
    csvreader = csv.reader(f)
    for row in csvreader:
        print(row)
```
```
['sepal_length±sepal_width±petal_length±petal_width±species']
['5.1™3.5™1.4™0.2™setosa']
['4.9™3.0™1.4™0.2™setosa']
['4.7♫3.2♫1.3♫0.2♫setosa']
['4.6♫3.1♫1.5♫0.2♫setosa']
['5.0♫3.6♫1.4♫0.2♫setosa']
['5.4™3.9™1.7™0.4™setosa']
['4.6™3.4™1.4™0.3™setosa']
['5.0ñ3.4ñ1.5ñ0.2ñsetosa']
['4.4ñ2.9ñ1.4ñ0.2ñsetosa']
['4.9™3.1™1.5™0.1™setosa']
['5.4™3.7™1.5™0.2™setosa']
['4.8™3.4™1.6™0.2™setosa']
['4.8™3.0™1.4™0.1™setosa']
['4.3™3.0™1.1™0.1™setosa']
['5.8™4.0™1.2™0.2™setosa']
['5.7™4.4™1.5™0.4™setosa']
```
You can pass one of the weird, one-off characters (♫, ™, ñ, ±) as the delimiter, but then it only parses the rows that use that character.
```python
with open('data/iris_chaos.csv', encoding='utf-8') as f:
    csvreader = csv.reader(f, delimiter='ñ')
    for row in csvreader:
        print(row)
```
```
['sepal_length±sepal_width±petal_length±petal_width±species']
['5.1™3.5™1.4™0.2™setosa']
['4.9™3.0™1.4™0.2™setosa']
['4.7♫3.2♫1.3♫0.2♫setosa']
['4.6♫3.1♫1.5♫0.2♫setosa']
['5.0♫3.6♫1.4♫0.2♫setosa']
['5.4™3.9™1.7™0.4™setosa']
['4.6™3.4™1.4™0.3™setosa']
['5.0', '3.4', '1.5', '0.2', 'setosa']
['4.4', '2.9', '1.4', '0.2', 'setosa']
['4.9™3.1™1.5™0.1™setosa']
['5.4™3.7™1.5™0.2™setosa']
['4.8™3.4™1.6™0.2™setosa']
['4.8™3.0™1.4™0.1™setosa']
['4.3™3.0™1.1™0.1™setosa']
['5.8™4.0™1.2™0.2™setosa']
['5.7™4.4™1.5™0.4™setosa']
```
In the pandas version, you can pass in a regular expression if your file is delimited by more than a single character.

Note: this will force the Python engine (as evidenced by the warning message below).
```python
bunchOfGarbageChars = 'ñ|±|™|Þ|→|♫'
pd.read_csv('data/iris_chaos.csv', sep=bunchOfGarbageChars, encoding='utf-8')
```
```
C:\Users\nhounshell\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
```
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
11 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
12 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
13 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
14 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
15 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
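As the warning itself suggests, you can opt into the Python engine up front to keep things quiet:

```python
# Specifying the engine explicitly avoids the ParserWarning
pd.read_csv('data/iris_chaos.csv', sep=bunchOfGarbageChars,
            engine='python', encoding='utf-8').head()
```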
Data split by whitespace
Sometimes, your data is split by a variable amount of whitespace.

```python
open('data/ex3.csv').read()
```
```
'            A         B         C\naaa -0.264438 -1.026059 -0.619500\nbbb  0.927272  0.302904 -0.032399\nccc -0.264273 -0.386314 -0.217601\nddd -0.871858 -0.348382  1.100491'
```
So you could use some regex magic and pass a 'split on whitespace' pattern:
```python
pd.read_csv('data/ex3.csv', sep='\s+')
```
| | A | B | C |
|---|---|---|---|
aaa | -0.264438 | -1.026059 | -0.619500 |
bbb | 0.927272 | 0.302904 | -0.032399 |
ccc | -0.264273 | -0.386314 | -0.217601 |
ddd | -0.871858 | -0.348382 | 1.100491 |
Notice that the header row has one fewer field than the data rows, so pandas used the leftmost column as the index. Alternatively, you could set delim_whitespace=True, which does the exact same thing. (Unlike other multi-character separators, '\s+' is special-cased, so it doesn't force the Python engine.)
```python
pd.read_csv('data/ex3.csv', delim_whitespace=True)
```
| | A | B | C |
|---|---|---|---|
aaa | -0.264438 | -1.026059 | -0.619500 |
bbb | 0.927272 | 0.302904 | -0.032399 |
ccc | -0.264273 | -0.386314 | -0.217601 |
ddd | -0.871858 | -0.348382 | 1.100491 |
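Incidentally, since this particular file's columns happen to line up, pandas' fixed-width reader would likely cope with it as well; a quick sketch:

```python
# read_fwf infers column boundaries from the whitespace alignment
pd.read_fwf('data/ex3.csv')
```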
Labeling your data
- header
- prefix
- names
- index_col
header
Which row(s) to use as column names. Takes the first row of the data by default.
```python
pd.read_csv(csvPath)
```
| | a | b | c | d | message |
|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
You can also tell it that there isn’t a header
```python
pd.read_csv(csvPath, header=None)
```
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
0 | a | b | c | d | message |
1 | 1 | 2 | 3 | 4 | hello |
2 | 5 | 6 | 7 | 8 | world |
3 | 9 | 10 | 11 | 12 | foo |
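You can also point header at a later row, in which case everything above it is discarded:

```python
# Row 1 ('1,2,3,4,hello') becomes the header; row 0 is thrown away
pd.read_csv(csvPath, header=1)
```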
Or specify a MultiIndex header by row number.

(Note that we skipped over row 1, and the one remaining row after row 2 becomes our data.)
```python
pd.read_csv(csvPath, header=[0, 2])
```
| | a | b | c | d | message |
|---|---|---|---|---|---|
| | 5 | 6 | 7 | 8 | world |
| 0 | 9 | 10 | 11 | 12 | foo |
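To confirm what happened, you can inspect the resulting columns (a sketch; the second level comes from row 2 of the file):

```python
df = pd.read_csv(csvPath, header=[0, 2])
df.columns  # a MultiIndex pairing row 0's labels with row 2's values
```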
prefix
If your data has no header, this is the string to put in front of the default 0..N-1 column names.

This would be used if you find yourself pulling in many different csv files and are more concerned about which table a column is from than what it represents; a sketch follows the table below.
```python
pd.read_csv(csvPath, header=None, prefix='X')
```
| | X0 | X1 | X2 | X3 | X4 |
|---|---|---|---|---|---|
0 | a | b | c | d | message |
1 | 1 | 2 | 3 | 4 | hello |
2 | 5 | 6 | 7 | 8 | world |
3 | 9 | 10 | 11 | 12 | foo |
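For instance, a sketch with hypothetical file paths, tagging each table's columns by source:

```python
# Hypothetical paths; prefix tags each column with its source table
first = pd.read_csv('data/table_one.csv', header=None, prefix='T1_')
second = pd.read_csv('data/table_two.csv', header=None, prefix='T2_')

# Columns come out as T1_0, T1_1, ... and T2_0, T2_1, ...
```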
names
List of column names to use. This is most useful if your data doesn't have a header row. If it does, then specifying names will just push it down a row. (Note that we pass only four names for five columns, so pandas treats the leftmost column as the index.)
```python
pd.read_csv(csvPath, names=['col1', 'col2', 'col3', 'col4'])
```
| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
a | b | c | d | message |
1 | 2 | 3 | 4 | hello |
5 | 6 | 7 | 8 | world |
9 | 10 | 11 | 12 | foo |
Unless you specify that the header is found in the top row.
```python
pd.read_csv(csvPath, names=['col1', 'col2', 'col3', 'col4'], header=0)
```
| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
1 | 2 | 3 | 4 | hello |
5 | 6 | 7 | 8 | world |
9 | 10 | 11 | 12 | foo |
Which makes this very useful for renaming weird columns.
```python
pd.read_csv(csvPath, names=['foo', 'bar', 'baz', 'meh'], header=0)
```
| | foo | bar | baz | meh |
|---|---|---|---|---|
1 | 2 | 3 | 4 | hello |
5 | 6 | 7 | 8 | world |
9 | 10 | 11 | 12 | foo |
index_col
Which column to use as your index. By default, the index is just a 0..N-1 range.
```python
pd.read_csv(csvPath)
```
| | a | b | c | d | message |
|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
But something else might be more useful, which you can pass either as a str with the column name or as an int with the column position.
```python
pd.read_csv(csvPath, index_col='c')
```
| | a | b | d | message |
|---|---|---|---|---|
| c | | | | |
3 | 1 | 2 | 4 | hello |
7 | 5 | 6 | 8 | world |
11 | 9 | 10 | 12 | foo |
Notice that specifying the index_col also gives the index a name.
```python
pd.read_csv(csvPath, index_col=2)
```
| | a | b | d | message |
|---|---|---|---|---|
| c | | | | |
3 | 1 | 2 | 4 | hello |
7 | 5 | 6 | 8 | world |
11 | 9 | 10 | 12 | foo |
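You can also pass a list of columns to build a hierarchical MultiIndex; a quick sketch on our toy file:

```python
# Columns 'c' and 'd' become a two-level MultiIndex
pd.read_csv(csvPath, index_col=['c', 'd'])
```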
Just Want a Column?
squeeze
It’s an edge case, but if you find that you’re only trying to read one column from a csv (here, via usecols), it’s usually best to store it as a Series, not a DataFrame.
```python
pd.read_csv(csvPath, usecols=[0])
```
| | a |
|---|---|
0 | 1 |
1 | 5 |
2 | 9 |
```python
pd.read_csv(csvPath, usecols=[0], squeeze=True)
```
```
0    1
1    5
2    9
Name: a, dtype: int64
```
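Equivalently, you can squeeze after the fact with the regular DataFrame method:

```python
# Same result: collapse a single-column frame into a Series
pd.read_csv(csvPath, usecols=[0]).squeeze('columns')
```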
Resources
Data from Chapter 6 of Wes McKinney’s *Python for Data Analysis*, 1st ed.