CSV 5: Cleaning
import pandas as pd
Handling Unclean Data
How much should pandas freak out when given bad data?
error_bad_lines
warn_bad_lines
Given a table with a stray extra field jutting out of one row, pd.read_csv
loses its mind.
badTableCsvPath = 'data/ex7.csv'
print(open(badTableCsvPath).read())
"a","b","c"
"1","2","3"
"1","2","3","4"
try:
    pd.read_csv(badTableCsvPath)
except Exception as e:
    print(e)
Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
Thankfully, you can provide instructions on what to do when it goes into crisis mode.
We can skip the bad lines (with a warning printed for each) by setting error_bad_lines
to False
pd.read_csv(badTableCsvPath, error_bad_lines=False)
b'Skipping line 3: expected 3 fields, saw 4\n'
|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
And we can suppress those warnings by also setting warn_bad_lines to False
pd.read_csv(badTableCsvPath, error_bad_lines=False, warn_bad_lines=False)
|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
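A heads-up: in newer pandas releases (1.3+), error_bad_lines and warn_bad_lines are deprecated in favor of a single on_bad_lines argument. A sketch of the equivalent call, using the same bad data built in-memory:

```python
import io

import pandas as pd

# Same shape of bad data as ex7.csv, built in-memory for the example
bad_csv = io.StringIO('"a","b","c"\n"1","2","3"\n"1","2","3","4"\n')

# on_bad_lines='skip' drops the offending row,
# 'warn' skips it but emits a warning, and 'error' (the default) raises
df = pd.read_csv(bad_csv, on_bad_lines='skip')
print(df)
```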
Null Data
na_values
keep_default_na
naCsvPath = 'data/ex5.csv'
open(naCsvPath).read()
'something,a,b,c,d,message\none,1,2,3,4,NA\ntwo,5,6,,8,world\nthree,9,10,11,12,foo'
na_values
can specify values that count as null, per column.
So if we took our original DataFrame
pd.read_csv(naCsvPath)
|   | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | foo |
And decided that the string ‘foo’ meant bad data. We could inform the parser by passing the na_values
argument a str
of whatever string we don’t like.
pd.read_csv(naCsvPath, na_values='foo')
|   | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | NaN |
Or if ‘two’ was also bad data, we could use a list.
pd.read_csv(naCsvPath, na_values=['two', 'foo'])
|   | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | NaN | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | NaN |
But if we were only interested in cleaning up weird data in the message
column, we’d specify that with a dictionary.
pd.read_csv(naCsvPath, na_values={'message': ['two', 'foo']})
|   | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | NaN |
keep_default_na
determines whether our na_values are appended to, or completely replace, pandas’ built-in list of NA strings (like ‘NA’, ‘NaN’, and the empty string).
pd.read_csv(naCsvPath, na_values={'message': ['two', 'foo']},
keep_default_na=False)
|   | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NA |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | NaN |
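One consequence worth calling out: with keep_default_na=False, strings pandas would normally convert (like the literal ‘NA’ in row 0) are left alone, and only our own sentinels become null. A sketch with the same data built in-memory:

```python
import io

import pandas as pd

# Same rows as ex5.csv, built in-memory for the example
csv = io.StringIO(
    'something,a,b,c,d,message\n'
    'one,1,2,3,4,NA\n'
    'two,5,6,,8,world\n'
    'three,9,10,11,12,foo\n'
)

# Only 'foo' in the message column counts as null now --
# the literal string 'NA' in row 0 survives untouched
df = pd.read_csv(csv, na_values={'message': ['foo']},
                 keep_default_na=False)
print(df['message'].tolist())
```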
Comments
comment
commentCsvPath = 'data/ex4.csv'
print(open(commentCsvPath).read())
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv(commentCsvPath, comment='#')
|   | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
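The comment character doesn’t have to start the line, either: pandas also truncates everything from the character to the end of a data row, which is handy for trailing notes. A sketch with hypothetical in-memory data:

```python
import io

import pandas as pd

csv = io.StringIO(
    'a,b,message\n'
    '1,2,hello# ignore the rest of this line\n'
    '5,6,world\n'
)

# Everything from '#' to the end of the line is dropped before parsing,
# so row 0's message column is just 'hello'
df = pd.read_csv(csv, comment='#')
print(df)
```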