csv 3: Type Handling
The default read-in behavior is pretty good, but sometimes it’s worth being a bit more explicit.
Consider this file that has all of its data wrapped in quotes.
import pandas as pd
dataPath = 'data/ex8.csv'
open(dataPath).read()
'"a","b","c"\n"1","2","3"\n"1","2","3"\n'
pd.read_csv(dataPath)
|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 1 | 2 | 3 |
General: dtype
Getting the data in float format is easy with the dtype argument.
pd.read_csv(dataPath, dtype=float)
|   | a   | b   | c   |
|---|-----|-----|-----|
| 0 | 1.0 | 2.0 | 3.0 |
| 1 | 1.0 | 2.0 | 3.0 |
It can also be applied on a column-by-column basis, letting the parser infer the type of anything unspecified.
pd.read_csv(dataPath, dtype={'a': str, 'b': float})
|   | a | b   | c |
|---|---|-----|---|
| 0 | 1 | 2.0 | 3 |
| 1 | 1 | 2.0 | 3 |
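To confirm the per-column behavior, here's a self-contained sketch (recreating the quoted file inline with StringIO rather than reading data/ex8.csv) that inspects the resulting dtypes:

```python
from io import StringIO

import pandas as pd

# Recreate the quoted file from the example so this runs anywhere.
quoted = '"a","b","c"\n"1","2","3"\n"1","2","3"\n'

# Mix explicit dtypes with inference: 'a' is forced to string, 'b' to float,
# and 'c' is left for the parser to infer on its own.
df = pd.read_csv(StringIO(quoted), dtype={'a': str, 'b': float})

print(df.dtypes)
```

The unspecified column 'c' comes back as the parser's default integer type, while 'a' and 'b' honor the dictionary.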
Handling Foreign “True/False” Words: true_values, false_values
Sometimes we get output from other sources that codes the Python True and False values in ways we might not expect.
For example, say we want to read this data:
from io import StringIO
stringyData = 'BOOL,oneBoolWord,twoBoolWords\nTRUE,Yes,Yes\nFALSE,FALSE,No'
StringIO(stringyData).read()
'BOOL,oneBoolWord,twoBoolWords\nTRUE,Yes,Yes\nFALSE,FALSE,No'
Notice that our ALL-CAPS TRUE and FALSE values get correctly cased (and cast as bool types) in the first column of the output. External data looks like this so frequently that the parser handles it out of the box.
pd.read_csv(StringIO(stringyData))
|   | BOOL  | oneBoolWord | twoBoolWords |
|---|-------|-------------|--------------|
| 0 | True  | Yes         | Yes          |
| 1 | False | FALSE       | No           |
We can expand the set of words the parser looks for when making these smart substitutions (here, we'll use true_values). The second column now casts accordingly (note how the value in row 1 changed).
pd.read_csv(StringIO(stringyData), true_values=['Yes'])
|   | BOOL  | oneBoolWord | twoBoolWords |
|---|-------|-------------|--------------|
| 0 | True  | True        | Yes          |
| 1 | False | False       | No           |
However, the third column still reads in as strings, even though we mapped “Yes” to True. Because the parser doesn't know how to handle every value in that column, it elects to store them all as strings to be safe.
Thus, our last step is providing a rule for the string “No” with false_values.
pd.read_csv(StringIO(stringyData), true_values=['Yes'], false_values=['No'])
|   | BOOL  | oneBoolWord | twoBoolWords |
|---|-------|-------------|--------------|
| 0 | True  | True        | True         |
| 1 | False | False       | False        |
Finally, this gives us our intended column datatypes.
pd.read_csv(StringIO(stringyData), true_values=['Yes'], false_values=['No']).dtypes
BOOL bool
oneBoolWord bool
twoBoolWords bool
dtype: object
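If you'd rather not bake the rules into the read call, one alternative sketch is converting after the fact with Series.map (the dictionaries here are my own mappings, not anything read_csv provides):

```python
from io import StringIO

import pandas as pd

stringyData = 'BOOL,oneBoolWord,twoBoolWords\nTRUE,Yes,Yes\nFALSE,FALSE,No'

# Read with no hints; only the BOOL column auto-casts to bool.
df = pd.read_csv(StringIO(stringyData))

# Map the leftover string columns to booleans ourselves.
df['oneBoolWord'] = df['oneBoolWord'].map({'Yes': True, 'FALSE': False})
df['twoBoolWords'] = df['twoBoolWords'].map({'Yes': True, 'No': False})

print(df)
```

This trades the tidiness of a one-shot read_csv call for explicit, per-column control, which can be handy when different columns use different vocabularies.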
Tricky, Rules-Based Typing: converters
And if your conversions are less straightforward, you can use converters to specify a dictionary of functions to apply as the file is read.
For instance, if you’re reading the following file:
moneyCsvPath = 'data/ex9.csv'
open(moneyCsvPath).read()
'a|b|c|money|bignum\n1|2|3|$1,000|1,000,000\n5|6|7|$0.00|867,530.9\n9|10|11|$3.50|3.14159265359'
money_to_float = lambda x: float(x[1:].replace(',', ''))
pd.read_csv(moneyCsvPath, sep='|', converters={'money':money_to_float})
|   | a | b  | c  | money  | bignum        |
|---|---|----|----|--------|---------------|
| 0 | 1 | 2  | 3  | 1000.0 | 1,000,000     |
| 1 | 5 | 6  | 7  | 0.0    | 867,530.9     |
| 2 | 9 | 10 | 11 | 3.5    | 3.14159265359 |
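converters takes one function per column, so you can clean several columns in a single read. A self-contained sketch (inlining the contents of data/ex9.csv, and adding a comma_to_float helper of my own for the bignum column):

```python
from io import StringIO

import pandas as pd

# Inline copy of data/ex9.csv so this runs without the file.
moneyCsv = ('a|b|c|money|bignum\n'
            '1|2|3|$1,000|1,000,000\n'
            '5|6|7|$0.00|867,530.9\n'
            '9|10|11|$3.50|3.14159265359')

# Strip the leading '$' and thousands commas, then cast to float.
money_to_float = lambda x: float(x[1:].replace(',', ''))
# Hypothetical helper: just strip commas and cast.
comma_to_float = lambda x: float(x.replace(',', ''))

df = pd.read_csv(StringIO(moneyCsv), sep='|',
                 converters={'money': money_to_float,
                             'bignum': comma_to_float})

print(df[['money', 'bignum']])
```

Each function is applied cell-by-cell on read, so both columns arrive already numeric.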
Big Numbers: thousands, decimal
We had to use a function in our money example because of the leading $, but you may also run into large numbers written out as strings with comma separators and decimal points (and some complicated Dialect rules, no doubt). By default, pandas doesn't check for these.
If you need to turn that feature on, you can use the thousands and decimal arguments, setting them to ',' and '.', respectively. Or thousands='.' and decimal=',', if you're reading data from some backwards, Byzantine hellscape.
pd.read_csv(moneyCsvPath, sep='|',
usecols=['a', 'b', 'c', 'bignum'],
thousands=',', decimal='.')
|   | a | b  | c  | bignum         |
|---|---|----|----|----------------|
| 0 | 1 | 2  | 3  | 1000000.000000 |
| 1 | 5 | 6  | 7  | 867530.900000  |
| 2 | 9 | 10 | 11 | 3.141593       |
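And for that flipped European convention, a quick sketch with a made-up semicolon-delimited file where '.' groups thousands and ',' marks the decimal point:

```python
from io import StringIO

import pandas as pd

# Hypothetical European-style numbers: '.' for thousands, ',' for decimals.
euro = 'a;b\n1.000.000;3,14\n867.530,9;2,72'

df = pd.read_csv(StringIO(euro), sep=';', thousands='.', decimal=',')

print(df)
```

The same two arguments, with the characters swapped, and the values parse to the floats you'd expect.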