csv 5: Cleaning

import pandas as pd

Handling Unclean Data

How much should Python freak out given bad data?

error_bad_lines
warn_bad_lines

Given a table that has a random data element jutting out, pd.read_csv loses its mind.

badTableCsvPath = 'data/ex7.csv'
print(open(badTableCsvPath).read())
"a","b","c"
"1","2","3"
"1","2","3","4"
try:
    pd.read_csv(badTableCsvPath)
except Exception as e:
    print(e)
Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

Thankfully, you can provide instructions on what to do when it goes into crisis mode.

We can either skip the data (with outputs) by setting error_bad_lines to False

pd.read_csv(badTableCsvPath, error_bad_lines=False)
b'Skipping line 3: expected 3 fields, saw 4\n'
a b c
0 1 2 3

And can suppress error prints with warn_bad_lines

pd.read_csv(badTableCsvPath, error_bad_lines=False, warn_bad_lines=False)
a b c
0 1 2 3

Null Data

na_values
keep_default_na
naCsvPath = 'data/ex5.csv'
open(naCsvPath).read()
'something,a,b,c,d,message\none,1,2,3,4,NA\ntwo,5,6,,8,world\nthree,9,10,11,12,foo'

na_values can specify values that count as null, per_column.

So if we took our original DataFrame

pd.read_csv(naCsvPath)
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

And decided that the string ‘foo’ meant bad data. We could inform the parser either by using the na_values argument, with a str for whatever string we don’t like.

pd.read_csv(naCsvPath, na_values='foo')
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN

Or if ‘two’ was also a bad idea, we could use a list.

pd.read_csv(naCsvPath, na_values=['two', 'foo'])
something a b c d message
0 one 1 2 3.0 4 NaN
1 NaN 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN

But if we were only interested in cleaning up weird data in the message column, we’d specify that with a dictionary.

pd.read_csv(naCsvPath, na_values={'message': ['two', 'foo']})
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN

keep_default_na determines whether we append or overwrite the existing list of characters to filter out.

pd.read_csv(naCsvPath, na_values={'message': ['two', 'foo']},
            keep_default_na=False)
something a b c d message
0 one 1 2 3.0 4 NA
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN

Comments

comments
commentCsvPath = 'data/ex4.csv'
print(open(commentCsvPath).read())
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv(commentCsvPath, comment='#')
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo