CSV 5: Cleaning
import pandas as pd
Handling Unclean Data
How much should pandas freak out when given bad data?
error_bad_lines
warn_bad_lines
Given a table with a stray extra field jutting out of one row, pd.read_csv
loses its mind.
badTableCsvPath = 'data/ex7.csv'
print(open(badTableCsvPath).read())
"a","b","c"
"1","2","3"
"1","2","3","4"
try:
    pd.read_csv(badTableCsvPath)
except Exception as e:
    print(e)
Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
Thankfully, you can provide instructions on what to do when it goes into crisis mode.
We can skip the bad lines (with a warning printed for each) by setting error_bad_lines
to False
pd.read_csv(badTableCsvPath, error_bad_lines=False)
b'Skipping line 3: expected 3 fields, saw 4\n'
|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
And we can suppress those warnings by also setting warn_bad_lines to False
pd.read_csv(badTableCsvPath, error_bad_lines=False, warn_bad_lines=False)
|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
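A heads-up: in newer pandas releases (1.3+), error_bad_lines and warn_bad_lines are deprecated in favor of a single on_bad_lines argument. A sketch of the equivalent call, using the same bad data built in-memory:

```python
import io

import pandas as pd

# Same shape of bad data as ex7.csv, built in-memory for the example
bad_csv = io.StringIO('"a","b","c"\n"1","2","3"\n"1","2","3","4"\n')

# on_bad_lines='skip' drops the offending row,
# 'warn' skips it but emits a warning, and 'error' (the default) raises
df = pd.read_csv(bad_csv, on_bad_lines='skip')
print(df)
```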
Null Data
na_values
keep_default_na
naCsvPath = 'data/ex5.csv'
open(naCsvPath).read()
'something,a,b,c,d,message\none,1,2,3,4,NA\ntwo,5,6,,8,world\nthree,9,10,11,12,foo'
na_values
can specify values that count as null, per column.
So if we took our original DataFrame
pd.read_csv(naCsvPath)
|   | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | foo |
And decided that the string ‘foo’ meant bad data. We could inform the parser by passing the na_values
argument a str
of whatever string we don’t like.
pd.read_csv(naCsvPath, na_values='foo')
|   | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | NaN |
Or if ‘two’ was also bad data, we could use a list.
pd.read_csv(naCsvPath, na_values=['two', 'foo'])
|   | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | NaN | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | NaN |
But if we were only interested in cleaning up weird data in the message
column, we’d specify that with a dictionary.
pd.read_csv(naCsvPath, na_values={'message': ['two', 'foo']})
|   | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | NaN |
keep_default_na
determines whether our na_values are appended to, or completely replace, pandas’ built-in list of NA strings (like ‘NA’, ‘NaN’, and the empty string).
pd.read_csv(naCsvPath, na_values={'message': ['two', 'foo']},
keep_default_na=False)
|   | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NA |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | NaN |
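One consequence worth calling out: with keep_default_na=False, strings pandas would normally convert (like the literal ‘NA’ in row 0) are left alone, and only our own sentinels become null. A sketch with the same data built in-memory:

```python
import io

import pandas as pd

# Same rows as ex5.csv, built in-memory for the example
csv = io.StringIO(
    'something,a,b,c,d,message\n'
    'one,1,2,3,4,NA\n'
    'two,5,6,,8,world\n'
    'three,9,10,11,12,foo\n'
)

# Only 'foo' in the message column counts as null now --
# the literal string 'NA' in row 0 survives untouched
df = pd.read_csv(csv, na_values={'message': ['foo']},
                 keep_default_na=False)
print(df['message'].tolist())
```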
Comments
comment
commentCsvPath = 'data/ex4.csv'
print(open(commentCsvPath).read())
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv(commentCsvPath, comment='#')
|   | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
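The comment character doesn’t have to start the line, either: pandas also truncates everything from the character to the end of a data row, which is handy for trailing notes. A sketch with hypothetical in-memory data:

```python
import io

import pandas as pd

csv = io.StringIO(
    'a,b,message\n'
    '1,2,hello# ignore the rest of this line\n'
    '5,6,world\n'
)

# Everything from '#' to the end of the line is dropped before parsing,
# so row 0's message column is just 'hello'
df = pd.read_csv(csv, comment='#')
print(df)
```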