The csv module
Overview
The csv standard library module is used for interfacing with, strangely enough, csv files.
Mechanically, working with csv files involves breaking them up by lines, then by delimiter, and using the values.
However, these files aren't beholden to a consistent format. Different rules regarding quotes, delimiters, and line separation can arise, kneecapping your ability to generalize how you work with the files.
The csv module can be handily leveraged as a translation layer in your data pipeline to resolve inconsistencies between these formats.
Our Data
import csv
import sys
from itertools import islice
Printing the first few rows of the iris dataset, we can see that the data is split into rows, and each entry is separated by a comma.
path = 'data/iris.csv'
with open(path, 'r') as csvfile:
    print(csvfile.read()[:195])
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
But it could just as easily have been separated by a bar.
path = 'data/iris_bar.csv'
with open(path, 'r') as csvfile:
    print(csvfile.read()[:195])
sepal_length|sepal_width|petal_length|petal_width|species
5.1|3.5|1.4|0.2|setosa
4.9|3.0|1.4|0.2|setosa
4.7|3.2|1.3|0.2|setosa
4.6|3.1|1.5|0.2|setosa
5.0|3.6|1.4|0.2|setosa
5.4|3.9|1.7|0.4|setosa
Or, inexplicably, by ™s.
path = 'data/iris_tm.csv'
with open(path, 'r', encoding='utf-8') as csvfile:
    print(csvfile.read()[:190])
sepal_length™sepal_width™petal_length™petal_width™species
5.1™3.5™1.4™0.2™setosa
4.9™3.0™1.4™0.2™setosa
4.7™3.2™1.3™0.2™setosa
4.6™3.1™1.5™0.2™setosa
5.0™3.6™1.4™0.2™setosa
5.4™3.9™1.7™0.4™s
Reading and Writing
Both the reader and writer in the csv module operate using Dialects, which describe the rules a file is to be written or read with.
Both csv.reader and csv.writer can take either a dialect argument (more on this below) or individual arguments that would be found within a Dialect. Often, you won't need to finagle with much more than the delimiter.
Reader
Takes in any object that supports the iterator protocol and returns a str from its __next__ method, such as an open file. The reader object itself is a lightweight wrapper around that iterator.
The default reader handles vanilla csv files just fine.
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    for line in islice(csvreader, 5):
        print(line)
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3.0', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
['4.6', '3.1', '1.5', '0.2', 'setosa']
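In fact, the reader doesn't require a file at all; any iterable of strings will do:

```python
import csv

# csv.reader accepts any iterable of strings, not just file objects
rows = list(csv.reader(['a,b,c', '1,2,3']))
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```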
But as mentioned above, you can specify the reader's Dialect, or pass parameters that modify the defaults.
with open('data/iris_bar.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='|')
    for line in islice(csvreader, 5):
        print(line)
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3.0', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
['4.6', '3.1', '1.5', '0.2', 'setosa']
Additionally, the csv.reader has a line_num attribute that is useful when searching a file.
with open('data/iris_bar.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='|')
    for line in csvreader:
        if (line[0] == '5.0') and (line[3] == '0.2'):
            print(csvreader.line_num)
6
9
27
37
51
Or, obviously, in logging with a try/except block.
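A minimal sketch of that pattern, collecting the line numbers of rows that fail to parse (the tiny dataset here is made up for illustration):

```python
import csv

# Rows with one malformed value; line_num tells us where parsing failed
lines = ['5.1,setosa', 'oops,setosa', '4.9,setosa']
bad_lines = []
csvreader = csv.reader(lines)
for row in csvreader:
    try:
        float(row[0])
    except ValueError:
        # The reader tracks the current source line for us
        bad_lines.append(csvreader.line_num)
print(bad_lines)  # [2]
```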
Writer
Similarly, you can use csv.writer to specify some Dialect attributes and write either:
per row
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    csvwriter = csv.writer(sys.stdout)
    for line in islice(csvreader, 5):
        csvwriter.writerow(line)
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
Or multiple rows at once
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    csvwriter = csv.writer(sys.stdout)
    five_lines = list(islice(csvreader, 5))
    csvwriter.writerows(five_lines)
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
This is very handy because it allows you to neatly translate between Dialects. Here, we can quickly change the delimiter from ',' to '|'.
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    csvwriter = csv.writer(sys.stdout, delimiter='|')
    for line in islice(csvreader, 5):
        csvwriter.writerow(line)
sepal_length|sepal_width|petal_length|petal_width|species
5.1|3.5|1.4|0.2|setosa
4.9|3.0|1.4|0.2|setosa
4.7|3.2|1.3|0.2|setosa
4.6|3.1|1.5|0.2|setosa
Dialects
Of course, data is often messy, so we need a way to communicate the “rules” that a given csv file follows. In the csv library, this is implemented as dialects.
Dialects are implemented as Python classes with attributes that fall into two categories:
- How to break up a file
- What to do about quotes when writing
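A dialect can also be registered under a name and then referred to in any reader or writer; a minimal sketch (the name 'bars' is made up for this example):

```python
import csv

# Register a named dialect for the bar-delimited variant shown earlier
csv.register_dialect('bars', delimiter='|')
rows = list(csv.reader(['5.1|3.5|1.4|0.2|setosa'], dialect='bars'))
print(rows)  # [['5.1', '3.5', '1.4', '0.2', 'setosa']]
```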
Breaking up files
delimiter
: The character that fields within each row are split on. Can only be a single character.
If we pass the correct delimiter for a given file, each row becomes a list with one entry per field.
with open('data/iris_bar.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='|')
    for line in islice(csvreader, 5):
        print(line)
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3.0', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
['4.6', '3.1', '1.5', '0.2', 'setosa']
If we don’t, each row comes back as a single, unsplit string.
with open('data/iris.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='|')
    for line in islice(csvreader, 5):
        print(line)
['sepal_length,sepal_width,petal_length,petal_width,species']
['5.1,3.5,1.4,0.2,setosa']
['4.9,3.0,1.4,0.2,setosa']
['4.7,3.2,1.3,0.2,setosa']
['4.6,3.1,1.5,0.2,setosa']
lineterminator
: The character(s) used to terminate lines when writing. (The reader ignores this attribute; it recognizes either '\r' or '\n' as the end of a line.)
skipinitialspace
: Some csvs have a bunch of whitespace after each delimiter. This tells the reader to skip past it.
Dealing with Quotes
Let’s assume a simple dataset: 3 rows with columns “Index” and “Letter”.
nums = list(range(1, 4))
letters = [chr(x+64) for x in nums]
print(list(zip(nums, letters)))
[(1, 'A'), (2, 'B'), (3, 'C')]
def write_3(**quoteargs):
    csvwriter = csv.writer(sys.stdout, **quoteargs)
    for line in zip(nums, letters):
        csvwriter.writerow(line)
Out of the box, this works fine.
write_3()
1,A
2,B
3,C
Using the quoting argument, we can choose to include or omit quotes, regardless of a value's type.
write_3(quoting=csv.QUOTE_ALL)
"1","A"
"2","B"
"3","C"
write_3(quoting=csv.QUOTE_NONE)
1,A
2,B
3,C
However, this gets dicey when you try to quote and have quote characters within your values. Let’s overwrite our C with "C".
letters[2] = '"C"'
csv doesn’t know what to do with itself now.
write_3(quoting=csv.QUOTE_NONE)
1,A
2,B
---------------------------------------------------------------------------
Error Traceback (most recent call last)
<ipython-input-19-bd14db3002e5> in <module>()
----> 1 write_3(quoting=csv.QUOTE_NONE)
<ipython-input-14-d8e006bede05> in write_3(**quoteargs)
2 csvwriter = csv.writer(sys.stdout, **quoteargs)
3 for line in zip(nums, letters):
----> 4 csvwriter.writerow(line)
Error: need to escape, but no escapechar set
A few options here.
We can use the quotechar argument to tell csv that we quote using something besides "
write_3(quoting=csv.QUOTE_NONE, quotechar="'")
1,A
2,B
3,"C"
Or use the escapechar argument to escape the quote characters that were already in the string.
write_3(quoting=csv.QUOTE_NONE, escapechar='\\')
1,A
2,B
3,\"C\"
Alternatively, QUOTE_NONNUMERIC and QUOTE_ALL sidestep the problem by doubling the embedded quote character (the doublequote behavior, which defaults to True).
write_3(quoting=csv.QUOTE_NONNUMERIC)
1,"A"
2,"B"
3,"""C"""
write_3(quoting=csv.QUOTE_ALL)
"1","A"
"2","B"
"3","""C"""
The easiest option is to just use QUOTE_MINIMAL, which is the default behavior.
write_3(quoting=csv.QUOTE_MINIMAL)
1,A
2,B
3,"""C"""
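Those doubled quotes read back correctly, too. A quick round-trip sketch through an in-memory buffer:

```python
import csv
import io

# Write a value containing quotes, then read it straight back
buf = io.StringIO()
csv.writer(buf).writerow([3, '"C"'])  # QUOTE_MINIMAL is the default
buf.seek(0)
row = next(csv.reader(buf))
print(row)  # ['3', '"C"']
```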
Standard Dialects
csv comes preloaded with some common dialects that match the output of popular csv-creating software.
csv.list_dialects()
['excel', 'excel-tab', 'unix']
def print_dialect_attrs(dialect):
    attrs = [x for x in dir(dialect) if not x.startswith('_')]
    d = {x: getattr(dialect, x) for x in attrs}
    for k, v in d.items():
        print(str(k).ljust(16), ':', repr(v))
print_dialect_attrs(csv.excel)
delimiter : ','
doublequote : True
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : False
print_dialect_attrs(csv.excel_tab)
delimiter : '\t'
doublequote : True
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : False
print_dialect_attrs(csv.get_dialect('unix'))
delimiter : ','
doublequote : 1
escapechar : None
lineterminator : '\n'
quotechar : '"'
quoting : 1
skipinitialspace : 0
strict : 0
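Any standard dialect can be passed by name to a reader or writer. For instance, writing with the unix dialect quotes every field and ends lines with '\n' (sketched against an in-memory buffer):

```python
import csv
import io

# Writing with the 'unix' dialect: every field quoted, \n line endings
buf = io.StringIO()
csv.writer(buf, dialect='unix').writerow(['5.1', 'setosa'])
print(repr(buf.getvalue()))  # '"5.1","setosa"\n'
```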
Sniffer
As we saw above, accounting for all of the csv behavior can be a headache.
Thankfully, we can lean on a really useful csv
utility called sniffer
.
has_header
Sniffer can investigate a file to see whether or not it has a header row.
with open('data/iris.csv', 'r') as csvfile:
    print(csv.Sniffer().has_header(csvfile.read()))
True
sniff
But the real power of the Sniffer class is its ability to determine how things are broken up in a given csv and return its findings as a Dialect class.
with open('data/iris_bar.csv', 'r') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read())
print_dialect_attrs(dialect)
delimiter : '|'
doublequote : False
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : False
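In practice, you'd sniff just a sample and then hand the resulting dialect back to csv.reader; a sketch with an in-memory file:

```python
import csv
import io

sample = 'sepal_length|sepal_width|species\n5.1|3.5|setosa\n4.9|3.0|setosa\n'
# Sniff the sample, then read the data with the detected dialect
dialect = csv.Sniffer().sniff(sample)
csvfile = io.StringIO(sample)
rows = list(csv.reader(csvfile, dialect=dialect))
print(rows[1])  # ['5.1', '3.5', 'setosa']
```

With a real file, you'd sniff `csvfile.read(1024)` and then `csvfile.seek(0)` before reading.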
Of course, it’s not always right, so use caution.
with open('data/iris_tm.csv', 'r', encoding='utf-8') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read())
print_dialect_attrs(dialect)
delimiter : '.'
doublequote : False
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : True
Manually correcting, by passing the delimiter ourselves:
with open('data/iris_tm.csv', 'r', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='™')
    print_dialect_attrs(csvreader.dialect)
delimiter : '™'
doublequote : 1
escapechar : None
lineterminator : '\r\n'
quotechar : '"'
quoting : 0
skipinitialspace : 0
strict : 0