The csv module
Overview
The csv standard library is used for interfacing with, strangely enough, csv files.
Mechanically, working with csv files involves breaking the file into lines, splitting each line on a delimiter, and using the resulting values.
However, these files aren’t beholden to a consistent format. Different rules regarding quotes, delimiters, and line separation can arise, kneecapping your ability to generalize how to work with the files.
The csv standard library can be handily leveraged as a translation layer in your data pipeline to resolve inconsistencies between these formats.
Our Data
import csv
import sys
from itertools import islice

Printing the first few rows of the iris dataset, we can see that the data is split into rows, and each entry is separated by a comma.
path = 'data/iris.csv'
with open(path, 'r') as csvfile:
    print(csvfile.read()[:195])

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
But it could just as easily have been separated by a bar.
path = 'data/iris_bar.csv'
with open(path, 'r') as csvfile:
    print(csvfile.read()[:195])

sepal_length|sepal_width|petal_length|petal_width|species
5.1|3.5|1.4|0.2|setosa
4.9|3.0|1.4|0.2|setosa
4.7|3.2|1.3|0.2|setosa
4.6|3.1|1.5|0.2|setosa
5.0|3.6|1.4|0.2|setosa
5.4|3.9|1.7|0.4|setosa
Or, inexplicably, by ™’s
path = 'data/iris_tm.csv'
with open(path, 'r', encoding='utf-8') as csvfile:
    print(csvfile.read()[:190])

sepal_length™sepal_width™petal_length™petal_width™species
5.1™3.5™1.4™0.2™setosa
4.9™3.0™1.4™0.2™setosa
4.7™3.2™1.3™0.2™setosa
4.6™3.1™1.5™0.2™setosa
5.0™3.6™1.4™0.2™setosa
5.4™3.9™1.7™0.4™s
Reading and Writing
Both the Reader and Writer in the csv module operate using Dialects, which essentially describe the rules a file should be read or written with.
Both csv.reader and csv.writer can take either a dialect argument (more on this below), or individual keyword arguments corresponding to Dialect attributes. Often, you won't need to finagle with much more than the delimiter.
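For example (a minimal sketch; the tab-separated filename is made up), naming a built-in dialect and passing the equivalent keyword argument produce the same configuration:

with open('some_tab_separated.csv', 'r') as csvfile:
    reader_a = csv.reader(csvfile, dialect='excel-tab')   # named dialect
    reader_b = csv.reader(csvfile, delimiter='\t')        # equivalent keyword argument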
Reader
Takes in any object that supports the iterator protocol and returns a str each time its __next__ method is called; an open csv file is the typical case. The reader object itself is a lightweight wrapper around that iterator.
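Because only the iterator protocol is required, a plain list of strings works just as well as a file (a minimal sketch with made-up rows):

rows = ['a,b,c', '1,2,3']          # any iterable of strings will do
for parsed in csv.reader(rows):
    print(parsed)
# ['a', 'b', 'c']
# ['1', '2', '3']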
The default reader handles vanilla csv files just fine
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    for line in islice(csvreader, 5):
        print(line)

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3.0', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
['4.6', '3.1', '1.5', '0.2', 'setosa']
But as mentioned above, you can specify the reader’s Dialect, or pass parameters that modify the defaults.
with open('data/iris_bar.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='|')
    for line in islice(csvreader, 5):
        print(line)

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3.0', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
['4.6', '3.1', '1.5', '0.2', 'setosa']
Additionally, the csv.reader has a line_num attribute that is useful when searching a file.
with open('data/iris_bar.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='|')
    for line in csvreader:
        if (line[0] == '5.0') and (line[3] == '0.2'):
            print(csvreader.line_num)

6
9
27
37
51
Or, of course, when logging errors inside a try/except block.
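Something along these lines (a sketch, not from the original; the logging setup is assumed):

import logging

with open('data/iris_bar.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='|')
    try:
        for line in csvreader:
            pass    # process each row here
    except csv.Error as e:
        # line_num tells us exactly where parsing failed
        logging.error('Bad row at line %d: %s', csvreader.line_num, e)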
Writer
Similarly, you can use csv.writer to specify some Dialect attributes and write either:
per row
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    csvwriter = csv.writer(sys.stdout)
    for line in islice(csvreader, 5):
        csvwriter.writerow(line)

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
Or multiple rows at once
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    csvwriter = csv.writer(sys.stdout)
    fiveLines = list(islice(csvreader, 5))
    csvwriter.writerows(fiveLines)

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
This is very handy because it allows you to neatly translate between Dialects. Here, we can quickly change the delimiter from ',' to '|'.
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    csvwriter = csv.writer(sys.stdout, delimiter='|')
    for line in islice(csvreader, 5):
        csvwriter.writerow(line)

sepal_length|sepal_width|petal_length|petal_width|species
5.1|3.5|1.4|0.2|setosa
4.9|3.0|1.4|0.2|setosa
4.7|3.2|1.3|0.2|setosa
4.6|3.1|1.5|0.2|setosa
Dialects
Of course, data is often messy.
Thus, we need a way to communicate the "rules" that a csv file follows. In the csv library, this is implemented as Dialects.
Dialects are implemented as Python classes and have attributes that fall into two categories:
- How to break up a file
- What to do about quotes when writing
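You can also define and register a Dialect of your own by subclassing an existing one; the name and attribute values here are just for illustration:

class BarDialect(csv.excel):              # hypothetical dialect built on the excel defaults
    delimiter = '|'
    skipinitialspace = True

csv.register_dialect('bar', BarDialect)   # now usable by name, e.g. csv.reader(csvfile, dialect='bar')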
Breaking up files
delimiter: The character each row is split on to produce its fields. Can only be a single character.
If we pass the correct delimiter for a given file, each row becomes a list with one entry per field.
with open('data/iris_bar.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='|')
    for line in islice(csvreader, 5):
        print(line)

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3.0', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
['4.6', '3.1', '1.5', '0.2', 'setosa']
If we don't, each row comes back as a single, unsplit string in a one-element list.
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='|')
    for line in islice(csvreader, 5):
        print(line)

['sepal_length,sepal_width,petal_length,petal_width,species']
['5.1,3.5,1.4,0.2,setosa']
['4.9,3.0,1.4,0.2,setosa']
['4.7,3.2,1.3,0.2,setosa']
['4.6,3.1,1.5,0.2,setosa']
lineterminator: The string used to terminate lines when writing. (The reader ignores it and recognizes '\r' or '\n' on its own.)
skipinitialspace: Some csvs have a bunch of whitespace after each delimiter. Setting this to True reads past that.
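A quick sketch of both, using made-up in-memory data:

import io

# skipinitialspace: ignore the space that follows each delimiter
messy = ['sepal_length, sepal_width', '5.1, 3.5']
print(list(csv.reader(messy, skipinitialspace=True)))
# [['sepal_length', 'sepal_width'], ['5.1', '3.5']]

# lineterminator: control what the writer puts at the end of each row
buf = io.StringIO()
csv.writer(buf, lineterminator='\n').writerow(['5.1', '3.5'])
print(repr(buf.getvalue()))
# '5.1,3.5\n'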
Dealing with Quotes
Let's assume a simple dataset: 3 rows with columns "Index" and "Letter".
nums = list(range(1, 4))
letters = [chr(x+64) for x in nums]
print(list(zip(nums, letters)))

[(1, 'A'), (2, 'B'), (3, 'C')]
def write_3(**quoteargs):
    csvwriter = csv.writer(sys.stdout, **quoteargs)
    for line in zip(nums, letters):
        csvwriter.writerow(line)

Out of the box, this works fine.
write_3()

1,A
2,B
3,C
Using the quoting argument, we can choose to quote every field or none of them, regardless of type.
write_3(quoting=csv.QUOTE_ALL)

"1","A"
"2","B"
"3","C"
write_3(quoting=csv.QUOTE_NONE)

1,A
2,B
3,C
However, this gets dicey when you try to quote and have quote characters within your values.
Let’s overwrite our C with "C"
letters[2] = '"C"'

csv doesn't know what to do with itself now.
write_3(quoting=csv.QUOTE_NONE)

1,A
2,B
---------------------------------------------------------------------------
Error Traceback (most recent call last)
<ipython-input-19-bd14db3002e5> in <module>()
----> 1 write_3(quoting=csv.QUOTE_NONE)
<ipython-input-14-d8e006bede05> in write_3(**quoteargs)
2 csvwriter = csv.writer(sys.stdout, **quoteargs)
3 for line in zip(nums, letters):
----> 4 csvwriter.writerow(line)
Error: need to escape, but no escapechar set
A few options here.
We can use the quotechar argument to tell csv that we quote using something besides ".
write_3(quoting=csv.QUOTE_NONE, quotechar="'")

1,A
2,B
3,"C"
Or use the escapechar argument to mark where the quote character was already present in the string.
write_3(quoting=csv.QUOTE_NONE, escapechar='\\')

1,A
2,B
3,\"C\"
We can also let csv do the quoting for us. QUOTE_NONNUMERIC quotes anything that isn't a number (and, with the default doublequote=True, doubles up the quotes already embedded in a value):

write_3(quoting=csv.QUOTE_NONNUMERIC)

1,"A"
2,"B"
3,"""C"""
And QUOTE_ALL quotes everything, numbers included:

write_3(quoting=csv.QUOTE_ALL)

"1","A"
"2","B"
"3","""C"""
The easiest way is to just use QUOTE_MINIMAL, which is the default behavior
write_3(quoting=csv.QUOTE_MINIMAL)

1,A
2,B
3,"""C"""
Standard Dialects
csv comes pre-loaded with some common dialects that are produced by popular csv-creating software
csv.list_dialects()

['excel', 'excel-tab', 'unix']
def print_dialect_attrs(dialect):
    attrs = [x for x in dir(dialect) if not x.startswith('_')]
    d = dict((x, getattr(dialect, x)) for x in attrs)
    for k, v in d.items():
        print(str(k).ljust(16), ':', repr(v))

print_dialect_attrs(csv.excel)

delimiter : ','
doublequote : True
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : False
print_dialect_attrs(csv.excel_tab)

delimiter : '\t'
doublequote : True
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : False
print_dialect_attrs(csv.get_dialect('unix'))

delimiter : ','
doublequote : 1
escapechar : None
lineterminator : '\n'
quotechar : '"'
quoting : 1
skipinitialspace : 0
strict : 0
Sniffer
As we saw above, accounting for all of the csv behavior can be a headache.
Thankfully, we can lean on a really useful csv utility called Sniffer.
has_header
Sniffer can investigate a file to see whether or not it has a header row.
with open('data/iris.csv', 'r') as csvfile:
    print(csv.Sniffer().has_header(csvfile.read()))

True
sniff
But the real power of the Sniffer class is in its ability to determine how things are broken up in a given csv and return its findings as a Dialect class.
with open('data/iris_bar.csv', 'r') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read())
    print_dialect_attrs(dialect)

delimiter : '|'
doublequote : False
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : False
Of course, it's not always right, so use caution.
with open('data/iris_tm.csv', 'r', encoding='utf-8') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read())
    print_dialect_attrs(dialect)

delimiter : '.'
doublequote : False
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : True
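If you have a shortlist of plausible delimiters, you can also pass them to sniff via its delimiters parameter to narrow the guess (a sketch; the candidate set here is made up):

with open('data/iris_tm.csv', 'r', encoding='utf-8') as csvfile:
    # restrict the Sniffer to a handful of candidate delimiters
    dialect = csv.Sniffer().sniff(csvfile.read(), delimiters='™|,')
    print(repr(dialect.delimiter))    # with the hint, the Sniffer has a much better shot at '™'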
Manually correcting
with open('data/iris_tm.csv', 'r', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='™')
    print_dialect_attrs(csvreader.dialect)

delimiter : '™'
doublequote : 1
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : 0
strict : 0