The csv module
Overview
The csv standard library module is used for interfacing with, strangely enough, csv files.
Mechanically, working with csv files involves breaking them up by lines, then by delimiter, and using the values.
However, these files aren't beholden to a consistent format. Different rules regarding quotes, delimiters, and line separation can arise, kneecapping your ability to generalize how you work with the files.
The csv module can be handily leveraged as a translation layer in your data pipeline to resolve inconsistencies between these formats.
Our Data
import csv
import sys
from itertools import islice
Printing the first few rows of the iris dataset, we can see that the data is split into rows, and each entry is separated by a comma.
path = 'data/iris.csv'
with open(path, 'r') as csvfile:
    print(csvfile.read()[:195])
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
But it could just as easily have been separated by a bar.
path = 'data/iris_bar.csv'
with open(path, 'r') as csvfile:
    print(csvfile.read()[:195])
sepal_length|sepal_width|petal_length|petal_width|species
5.1|3.5|1.4|0.2|setosa
4.9|3.0|1.4|0.2|setosa
4.7|3.2|1.3|0.2|setosa
4.6|3.1|1.5|0.2|setosa
5.0|3.6|1.4|0.2|setosa
5.4|3.9|1.7|0.4|setosa
Or, inexplicably, by ™s.
path = 'data/iris_tm.csv'
with open(path, 'r', encoding='utf-8') as csvfile:
    print(csvfile.read()[:190])
sepal_length™sepal_width™petal_length™petal_width™species
5.1™3.5™1.4™0.2™setosa
4.9™3.0™1.4™0.2™setosa
4.7™3.2™1.3™0.2™setosa
4.6™3.1™1.5™0.2™setosa
5.0™3.6™1.4™0.2™setosa
5.4™3.9™1.7™0.4™s
Reading and Writing
Both the reader and writer in the csv module operate using Dialects, which describe the rules a file is to be written or read with.
Both csv.reader and csv.writer can take either a dialect argument (more on this below) or individual arguments that would be found within a Dialect. Often, you won't need to finagle with much more than the delimiter.
Reader
Takes in any object that supports the iterator protocol and returns a str from its __next__ method, such as an open file. The reader object itself is a lightweight wrapper around that iterator.
The default reader handles vanilla csv files just fine.
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    for line in islice(csvreader, 5):
        print(line)
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3.0', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
['4.6', '3.1', '1.5', '0.2', 'setosa']
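In fact, the reader doesn't require a file at all; any iterable of strings will do:

```python
import csv

# csv.reader accepts any iterable of strings, not just file objects
rows = list(csv.reader(['a,b,c', '1,2,3']))
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```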
But as mentioned above, you can specify the reader's Dialect, or pass parameters that modify the defaults.
with open('data/iris_bar.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='|')
    for line in islice(csvreader, 5):
        print(line)
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3.0', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
['4.6', '3.1', '1.5', '0.2', 'setosa']
Additionally, the csv.reader has a line_num attribute that is useful when searching a file.
with open('data/iris_bar.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='|')
    for line in csvreader:
        if (line[0] == '5.0') and (line[3] == '0.2'):
            print(csvreader.line_num)
6
9
27
37
51
Or, obviously, in logging with a try/except block.
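A minimal sketch of that pattern, collecting the line numbers of rows that fail to parse (the tiny dataset here is made up for illustration):

```python
import csv

# Rows with one malformed value; line_num tells us where parsing failed
lines = ['5.1,setosa', 'oops,setosa', '4.9,setosa']
bad_lines = []
csvreader = csv.reader(lines)
for row in csvreader:
    try:
        float(row[0])
    except ValueError:
        # The reader tracks the current source line for us
        bad_lines.append(csvreader.line_num)
print(bad_lines)  # [2]
```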
Writer
Similarly, you can use csv.writer to specify some Dialect attributes and write either:
per row
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    csvwriter = csv.writer(sys.stdout)
    for line in islice(csvreader, 5):
        csvwriter.writerow(line)
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
Or multiple rows at once
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    csvwriter = csv.writer(sys.stdout)
    five_lines = list(islice(csvreader, 5))
    csvwriter.writerows(five_lines)
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
This is very handy because it allows you to neatly translate between Dialects. Here, we can quickly change the delimiter from ',' to '|'.
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    csvwriter = csv.writer(sys.stdout, delimiter='|')
    for line in islice(csvreader, 5):
        csvwriter.writerow(line)
sepal_length|sepal_width|petal_length|petal_width|species
5.1|3.5|1.4|0.2|setosa
4.9|3.0|1.4|0.2|setosa
4.7|3.2|1.3|0.2|setosa
4.6|3.1|1.5|0.2|setosa
Dialects
Of course, data is often messy, so we need a way to communicate the “rules” that a given csv file follows. In the csv library, this is implemented as dialects.
Dialects are implemented as Python classes with attributes that fall into two categories:
- How to break up a file
- What to do about quotes when writing
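A dialect can also be registered under a name and then referred to in any reader or writer; a minimal sketch (the name 'bars' is made up for this example):

```python
import csv

# Register a named dialect for the bar-delimited variant shown earlier
csv.register_dialect('bars', delimiter='|')
rows = list(csv.reader(['5.1|3.5|1.4|0.2|setosa'], dialect='bars'))
print(rows)  # [['5.1', '3.5', '1.4', '0.2', 'setosa']]
```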
Breaking up files
delimiter
: The character that fields within each row are split on. Can only be a single character.
If we pass the correct delimiter for a given file, each row becomes a list with one entry per field.
with open('data/iris_bar.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='|')
    for line in islice(csvreader, 5):
        print(line)
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3.0', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
['4.6', '3.1', '1.5', '0.2', 'setosa']
If we don’t, each row comes back as a single, unsplit string.
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='|')
    for line in islice(csvreader, 5):
        print(line)
['sepal_length,sepal_width,petal_length,petal_width,species']
['5.1,3.5,1.4,0.2,setosa']
['4.9,3.0,1.4,0.2,setosa']
['4.7,3.2,1.3,0.2,setosa']
['4.6,3.1,1.5,0.2,setosa']
lineterminator
: The character(s) used to terminate lines when writing. (The reader ignores this attribute; it recognizes either '\r' or '\n' as the end of a line.)
skipinitialspace
: Some csvs have a bunch of whitespace after each delimiter. This tells the reader to skip past it.
Dealing with Quotes
Let’s assume a simple dataset: 3 rows with columns “Index” and “Letter”.
nums = list(range(1, 4))
letters = [chr(x+64) for x in nums]
print(list(zip(nums, letters)))
[(1, 'A'), (2, 'B'), (3, 'C')]
def write_3(**quoteargs):
    csvwriter = csv.writer(sys.stdout, **quoteargs)
    for line in zip(nums, letters):
        csvwriter.writerow(line)
Out of the box, this works fine.
write_3()
1,A
2,B
3,C
Using the quoting argument, we can choose to include or omit quotes, regardless of a value's type.
write_3(quoting=csv.QUOTE_ALL)
"1","A"
"2","B"
"3","C"
write_3(quoting=csv.QUOTE_NONE)
1,A
2,B
3,C
However, this gets dicey when you try to quote and have quote characters within your values. Let’s overwrite our C with "C".
letters[2] = '"C"'
csv doesn’t know what to do with itself now.
write_3(quoting=csv.QUOTE_NONE)
1,A
2,B
---------------------------------------------------------------------------
Error Traceback (most recent call last)
<ipython-input-19-bd14db3002e5> in <module>()
----> 1 write_3(quoting=csv.QUOTE_NONE)
<ipython-input-14-d8e006bede05> in write_3(**quoteargs)
2 csvwriter = csv.writer(sys.stdout, **quoteargs)
3 for line in zip(nums, letters):
----> 4 csvwriter.writerow(line)
Error: need to escape, but no escapechar set
A few options here.
We can use the quotechar argument to tell csv that we quote using something besides "
write_3(quoting=csv.QUOTE_NONE, quotechar="'")
1,A
2,B
3,"C"
Or use the escapechar argument to escape the quote characters that were already in the string.
write_3(quoting=csv.QUOTE_NONE, escapechar='\\')
1,A
2,B
3,\"C\"
Alternatively, QUOTE_NONNUMERIC and QUOTE_ALL sidestep the problem by doubling the embedded quote character (the doublequote behavior, which defaults to True).
write_3(quoting=csv.QUOTE_NONNUMERIC)
1,"A"
2,"B"
3,"""C"""
write_3(quoting=csv.QUOTE_ALL)
"1","A"
"2","B"
"3","""C"""
The easiest option is to just use QUOTE_MINIMAL, which is the default behavior.
write_3(quoting=csv.QUOTE_MINIMAL)
1,A
2,B
3,"""C"""
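Those doubled quotes read back correctly, too. A quick round-trip sketch through an in-memory buffer:

```python
import csv
import io

# Write a value containing quotes, then read it straight back
buf = io.StringIO()
csv.writer(buf).writerow([3, '"C"'])  # QUOTE_MINIMAL is the default
buf.seek(0)
row = next(csv.reader(buf))
print(row)  # ['3', '"C"']
```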
Standard Dialects
csv comes preloaded with some common dialects that match the output of popular csv-creating software.
csv.list_dialects()
['excel', 'excel-tab', 'unix']
def print_dialect_attrs(dialect):
    attrs = [x for x in dir(dialect) if not x.startswith('_')]
    d = {x: getattr(dialect, x) for x in attrs}
    for k, v in d.items():
        print(str(k).ljust(16), ':', repr(v))
print_dialect_attrs(csv.excel)
delimiter : ','
doublequote : True
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : False
print_dialect_attrs(csv.excel_tab)
delimiter : '\t'
doublequote : True
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : False
print_dialect_attrs(csv.get_dialect('unix'))
delimiter : ','
doublequote : 1
escapechar : None
lineterminator : '\n'
quotechar : '"'
quoting : 1
skipinitialspace : 0
strict : 0
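Any standard dialect can be passed by name to a reader or writer. For instance, writing with the unix dialect quotes every field and ends lines with '\n' (sketched against an in-memory buffer):

```python
import csv
import io

# Writing with the 'unix' dialect: every field quoted, \n line endings
buf = io.StringIO()
csv.writer(buf, dialect='unix').writerow(['5.1', 'setosa'])
print(repr(buf.getvalue()))  # '"5.1","setosa"\n'
```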
Sniffer
As we saw above, accounting for all of the csv behavior can be a headache.
Thankfully, we can lean on a really useful csv
utility called sniffer
.
has_header
Sniffer can investigate a file to see whether or not it has a header row.
with open('data/iris.csv', 'r') as csvfile:
    print(csv.Sniffer().has_header(csvfile.read()))
True
sniff
But the real power of the Sniffer class is its ability to determine how things are broken up in a given csv and return its findings as a Dialect class.
with open('data/iris_bar.csv', 'r') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read())
print_dialect_attrs(dialect)
delimiter : '|'
doublequote : False
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : False
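In practice, you'd sniff just a sample and then hand the resulting dialect back to csv.reader; a sketch with an in-memory file:

```python
import csv
import io

sample = 'sepal_length|sepal_width|species\n5.1|3.5|setosa\n4.9|3.0|setosa\n'
# Sniff the sample, then read the data with the detected dialect
dialect = csv.Sniffer().sniff(sample)
csvfile = io.StringIO(sample)
rows = list(csv.reader(csvfile, dialect=dialect))
print(rows[1])  # ['5.1', '3.5', 'setosa']
```

With a real file, you'd sniff `csvfile.read(1024)` and then `csvfile.seek(0)` before reading.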
Of course, it’s not always right, so use caution.
with open('data/iris_tm.csv', 'r', encoding='utf-8') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read())
print_dialect_attrs(dialect)
delimiter : '.'
doublequote : False
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : True
Manually correcting, by passing the delimiter ourselves:
with open('data/iris_tm.csv', 'r', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='™')
    print_dialect_attrs(csvreader.dialect)
delimiter : '™'
doublequote : 1
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : 0
strict : 0