csv 4: Datetime Handling

from IPython.display import Image
import pandas as pd

Datetime Handling

This one is so messy that it gets its own notebook.

Let’s start with a csv that has a simple date column:

dateCsvPath = 'data/tseries.csv'
print(open(dateCsvPath).read())
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6

Reading Datetimes as Datetimes

parse_dates

By default, the parser won’t acknowledge the fact that the first column is of type datetime.

pd.read_csv(dateCsvPath).dtypes
2000-01-01    object
0              int64
dtype: object

However, you can tell the parser to use the parse engine in dateutil.parser with the parse_dates argument.

pd.read_csv(dateCsvPath, parse_dates=[0], header=None).dtypes
0    datetime64[ns]
1             int64
dtype: object

Simply setting parse_dates to True doesn’t work, because it only tries to read the index.

pd.read_csv(dateCsvPath, parse_dates=True, header=None).dtypes
0    object
1     int64
dtype: object

Building Datetimes from multiple columns

parse_dates
keep_date_col

Maybe instead, our data csv has 3 columns that make up the same information, but split over month, day, year.

splitDtCsvPath = 'data/tseries3.csv'
print(open(splitDtCsvPath).read())
1,1,2000,0
1,2,2000,1
1,3,2000,2
1,4,2000,3
1,5,2000,4
1,6,2000,5
1,7,2000,6

You can pass a nested list into the parse_dates argument to specify which columns make up the date.

pd.read_csv(splitDtCsvPath, header=None, parse_dates=[[0, 1, 2]])
0_1_2 3
0 2000-01-01 0
1 2000-01-02 1
2 2000-01-03 2
3 2000-01-04 3
4 2000-01-05 4
5 2000-01-06 5
6 2000-01-07 6

And for clarity, you can pass it as a dict, and specify a new column name as the key.

pd.read_csv(splitDtCsvPath, header=None, parse_dates={'dateCol':[0, 1, 2]})
dateCol 3
0 2000-01-01 0
1 2000-01-02 1
2 2000-01-03 2
3 2000-01-04 3
4 2000-01-05 4
5 2000-01-06 5
6 2000-01-07 6

And if you don’t want to throw away the existing data, you can specify keep_date_col=True

pd.read_csv(splitDtCsvPath, header=None,
            parse_dates={'dateCol': [0, 1, 2]},
            keep_date_col=True)
dateCol 0 1 2 3
0 2000-01-01 1 1 2000 0
1 2000-01-02 1 2 2000 1
2 2000-01-03 1 3 2000 2
3 2000-01-04 1 4 2000 3
4 2000-01-05 1 5 2000 4
5 2000-01-06 1 6 2000 5
6 2000-01-07 1 7 2000 6

Tuning Datetime Reads

date_parser
infer_datetime_format

The parser engine is impressively expressive. However, because it’s written to sniff out so many different test cases, it’s pretty non-performant at scale in all but the most basic of cases.

Consider a large file (10,000 rows) that looks like

dateTimeCsvPath = 'data/tseries2.csv'
pd.read_csv(dateTimeCsvPath).head()
DT ID
0 4/10/17 8:39 3376898277
1 10/3/17 14:41 3390216125
2 4/6/17 8:11 3376630233
3 11/7/17 12:06 3392651569
4 9/5/17 18:17 3388322876

Base

Calling parse_dates results in the correct typing of the first column.

pd.read_csv(dateTimeCsvPath, parse_dates=[0]).dtypes
DT    datetime64[ns]
ID             int64
dtype: object

But it takes a couple seconds.

%timeit pd.read_csv(dateTimeCsvPath, parse_dates=['DT'])
1.47 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Better

However, if all of the data is going to be of the same format through the whole column, you can use the infer_datetime_format argument to run the parser once and use that same mapping for the rest of the data.

Negligiblly faster here, but you’d really see the benefits at scale.

%%timeit
pd.read_csv(dateTimeCsvPath, parse_dates=['DT'],
            infer_datetime_format=True)
1.44 s ± 66.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Best

So, if you’re trying to parse in some non-intuitive, but consistent, format, you should instead remove the guesswork using the date_parser argument.

You want to explicitly define the string mapping using strftime syntax and pd.to_datetime.

%%timeit 
ourFormat = '%m/%d/%y %H:%M'
pd.read_csv(dateTimeCsvPath, parse_dates=['DT'],
            date_parser=lambda x: pd.to_datetime(x, format=ourFormat))
44.5 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Much better

Data from other locales

dayfirst

In much the same way as decimals and commas in large numbers, other, less-enlightened parts of the world don’t follow our superior date formatting

Image('data/pyramid.png')

png

So, you can handle for this backwards convention using the dayfirst argument.