csv 3: Type Handling
The default read-in behavior is pretty good, but sometimes it’s worth being a bit more explicit.
Consider this file that has all of its data wrapped in quotes.
import pandas as pd
dataPath = 'data/ex8.csv'
open(dataPath).read()
'"a","b","c"\n"1","2","3"\n"1","2","3"\n'
pd.read_csv(dataPath)
|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 1 | 2 | 3 |
General: dtype
Getting the data in float format is easy with the dtype argument.
pd.read_csv(dataPath, dtype=float)
|   | a   | b   | c   |
|---|-----|-----|-----|
| 0 | 1.0 | 2.0 | 3.0 |
| 1 | 1.0 | 2.0 | 3.0 |
It can also be applied on a column-by-column basis, letting the parser infer the type of anything unspecified.
pd.read_csv(dataPath, dtype={'a': str, 'b': float})
|   | a | b   | c |
|---|---|-----|---|
| 0 | 1 | 2.0 | 3 |
| 1 | 1 | 2.0 | 3 |
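To confirm the per-column behavior, here's a self-contained sketch (recreating the quoted file inline with StringIO rather than reading data/ex8.csv) that inspects the resulting dtypes:

```python
from io import StringIO

import pandas as pd

# Recreate the quoted file from the example so this runs anywhere.
quoted = '"a","b","c"\n"1","2","3"\n"1","2","3"\n'

# Mix explicit dtypes with inference: 'a' is forced to string, 'b' to float,
# and 'c' is left for the parser to infer on its own.
df = pd.read_csv(StringIO(quoted), dtype={'a': str, 'b': float})

print(df.dtypes)
```

The unspecified column 'c' comes back as the parser's default integer type, while 'a' and 'b' honor the dictionary.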
Handling Foreign “True/False” Words: true_values, false_values
Sometimes we get output from other sources that codes the Python True and False values in ways we might not expect.
For example, say we want to read this data:
from io import StringIO
stringyData = 'BOOL,oneBoolWord,twoBoolWords\nTRUE,Yes,Yes\nFALSE,FALSE,No'
StringIO(stringyData).read()
'BOOL,oneBoolWord,twoBoolWords\nTRUE,Yes,Yes\nFALSE,FALSE,No'
Notice that our ALL-CAPS TRUE and FALSE values get correctly cased (and cast as bool types) in the first column of the output. External data looks like this so frequently that the parser handles it out of the box.
pd.read_csv(StringIO(stringyData))
|   | BOOL  | oneBoolWord | twoBoolWords |
|---|-------|-------------|--------------|
| 0 | True  | Yes         | Yes          |
| 1 | False | FALSE       | No           |
We can expand the set of words the parser looks for when making these smart substitutions (here, we'll use true_values). The second column now casts accordingly (note how the value in row 1 changed).
pd.read_csv(StringIO(stringyData), true_values=['Yes'])
|   | BOOL  | oneBoolWord | twoBoolWords |
|---|-------|-------------|--------------|
| 0 | True  | True        | Yes          |
| 1 | False | False       | No           |
However, the third column still reads in as strings, even though we mapped “Yes” to True. Because the parser doesn't know how to handle every value in that column, it elects to store them all as strings to be safe.
Thus, our last step is providing a rule for the string “No” with false_values.
pd.read_csv(StringIO(stringyData), true_values=['Yes'], false_values=['No'])
|   | BOOL  | oneBoolWord | twoBoolWords |
|---|-------|-------------|--------------|
| 0 | True  | True        | True         |
| 1 | False | False       | False        |
Finally, this gives us our intended column datatypes.
pd.read_csv(StringIO(stringyData), true_values=['Yes'], false_values=['No']).dtypes
BOOL bool
oneBoolWord bool
twoBoolWords bool
dtype: object
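If you'd rather not bake the rules into the read call, one alternative sketch is converting after the fact with Series.map (the dictionaries here are my own mappings, not anything read_csv provides):

```python
from io import StringIO

import pandas as pd

stringyData = 'BOOL,oneBoolWord,twoBoolWords\nTRUE,Yes,Yes\nFALSE,FALSE,No'

# Read with no hints; only the BOOL column auto-casts to bool.
df = pd.read_csv(StringIO(stringyData))

# Map the leftover string columns to booleans ourselves.
df['oneBoolWord'] = df['oneBoolWord'].map({'Yes': True, 'FALSE': False})
df['twoBoolWords'] = df['twoBoolWords'].map({'Yes': True, 'No': False})

print(df)
```

This trades the tidiness of a one-shot read_csv call for explicit, per-column control, which can be handy when different columns use different vocabularies.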
Tricky, Rules-Based Typing: converters
And if your conversions are less straightforward, you can use converters to specify a dictionary of functions to apply as the file is read.
For instance, if you’re reading the following file:
moneyCsvPath = 'data/ex9.csv'
open(moneyCsvPath).read()
'a|b|c|money|bignum\n1|2|3|$1,000|1,000,000\n5|6|7|$0.00|867,530.9\n9|10|11|$3.50|3.14159265359'
money_to_float = lambda x: float(x[1:].replace(',', ''))
pd.read_csv(moneyCsvPath, sep='|', converters={'money':money_to_float})
|   | a | b  | c  | money  | bignum        |
|---|---|----|----|--------|---------------|
| 0 | 1 | 2  | 3  | 1000.0 | 1,000,000     |
| 1 | 5 | 6  | 7  | 0.0    | 867,530.9     |
| 2 | 9 | 10 | 11 | 3.5    | 3.14159265359 |
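converters takes one function per column, so you can clean several columns in a single read. A self-contained sketch (inlining the contents of data/ex9.csv, and adding a comma_to_float helper of my own for the bignum column):

```python
from io import StringIO

import pandas as pd

# Inline copy of data/ex9.csv so this runs without the file.
moneyCsv = ('a|b|c|money|bignum\n'
            '1|2|3|$1,000|1,000,000\n'
            '5|6|7|$0.00|867,530.9\n'
            '9|10|11|$3.50|3.14159265359')

# Strip the leading '$' and thousands commas, then cast to float.
money_to_float = lambda x: float(x[1:].replace(',', ''))
# Hypothetical helper: just strip commas and cast.
comma_to_float = lambda x: float(x.replace(',', ''))

df = pd.read_csv(StringIO(moneyCsv), sep='|',
                 converters={'money': money_to_float,
                             'bignum': comma_to_float})

print(df[['money', 'bignum']])
```

Each function is applied cell-by-cell on read, so both columns arrive already numeric.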
Big Numbers: thousands, decimal
We had to use a function in our money example because of the leading $, but you may also run into large numbers written out as strings with comma separators and decimal points (and some complicated Dialect rules, no doubt). By default, pandas doesn't check for these.
If you need to turn that feature on, you can use the thousands and decimal arguments, setting them to ',' and '.', respectively. Or thousands='.' and decimal=',', if you're reading data from some backwards, Byzantine hellscape.
pd.read_csv(moneyCsvPath, sep='|',
usecols=['a', 'b', 'c', 'bignum'],
thousands=',', decimal='.')
|   | a | b  | c  | bignum         |
|---|---|----|----|----------------|
| 0 | 1 | 2  | 3  | 1000000.000000 |
| 1 | 5 | 6  | 7  | 867530.900000  |
| 2 | 9 | 10 | 11 | 3.141593       |
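And for that flipped European convention, a quick sketch with a made-up semicolon-delimited file where '.' groups thousands and ',' marks the decimal point:

```python
from io import StringIO

import pandas as pd

# Hypothetical European-style numbers: '.' for thousands, ',' for decimals.
euro = 'a;b\n1.000.000;3,14\n867.530,9;2,72'

df = pd.read_csv(StringIO(euro), sep=';', thousands='.', decimal=',')

print(df)
```

The same two arguments, with the characters swapped, and the values parse to the floats you'd expect.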