Kaplan-Meier Estimate

So before we get underway with our fancy-pants tuned models to approximate our survival/hazard equations, it’s worth highlighting a simple estimator that provides a useful, if naive, approximation with minimal headache.

This straight-forward equation is called the Kaplan-Meier Estimate and is used as a proxy for evaluating the Survival Function at a given point in time.

from IPython.display import Image
Image('images/km_eq.PNG')

png

The n_i term is the population still at risk right before time i (not yet swallowed by F(t)), and d_i is the number of events that occur on time i.

A couple things worth noting here:

  • We use product, not sum, here in order to account for the joint probability of surviving at time i, as well as surviving through all times before
  • This equation cleanly handles data that drops out due to an event obsevation or censoring. Similarly, it considers data with no established terminus for the duration of the obseration.

This blog-post works through a toy example on six points of data.

Image('images/km_simple.PNG')

png

On Actual Data

At scale, these look a lot more sophisticated, and with hardly any additional effort.

Let’s pull a dataset of term-lengths of Canadian Senators from the lifelines library.

%pylab inline

from lifelines.datasets import load_canadian_senators

data = load_canadian_senators()
data.head()
Populating the interactive namespace from numpy and matplotlib
Name Political Affiliation at Appointment Province / Territory Appointed on the advice of Term (yyyy.mm.dd) start_date end_date reason diff_days observed
0 Abbott, John Joseph Caldwell Liberal-Conservative Quebec Macdonald, John Alexander 1887.05.12 - 1893.10.30 (Death) 1887-05-12 00:00:00 1893-10-30 00:00:00 Death 2363 True
1 Adams, Michael Conservative (1867-1942) New Brunswick Bowell, Mackenzie 1896.01.07 - 1899.01.01 (Death) 1896-01-07 00:00:00 1899-01-01 00:00:00 Death 1090 True
2 Adams, Willie Liberal Party of Canada Northwest Territories Trudeau, Pierre Elliott 1977.04.05 - 2009.06.22 (Retirement) 1977-04-05 00:00:00 2009-06-22 00:00:00 Retirement 11766 True
3 Aikins, James Cox Liberal-Conservative Ontario Royal Proclamation 1867.10.23 - 1882.05.30 (Resignation) 1867-10-23 00:00:00 1882-05-30 00:00:00 Resignation 5333 True
4 Aikins, James Cox Liberal-Conservative Ontario Bowell, Mackenzie 1896.01.07 - 1904.08.06 (Death) 1896-01-07 00:00:00 1904-08-06 00:00:00 Death 3133 True

There’s a bit more information than we really need in this DataFrame, so we’ll follow lifelines convention and stuff the durations and event observations into their own Series– T and E, respectively

T = data['diff_days']
E = data['observed'].map({True: 1, False: 0})

Before we fit the Kaplan-Meier Estimator to the data, let’s get a peek at some of the data.

There are a ton of records

len(data)
933

And after a bit of inspection, I found that there was a good mix of “observed” and “not observed” event data between 1000 and 2000 days. So we’ll narrow down to that timeline window and sample down to 50 records, to make a neater lifetime plot.

from lifelines.plotting import plot_lifetimes

narrowed = (data[(data['diff_days'] > 1000)
                  & (data['diff_days'] < 2000)]
           ).sample(n=50, random_state=0).reset_index()
narrowed_T = narrowed['diff_days']
narrowed_E = narrowed['observed']

fig, ax = plt.subplots(figsize=(12, 10))

plot_lifetimes(durations=narrowed_T,
               event_observed=narrowed_E);

png

Fitting the Curve

Actually fitting the data is super easy.

While the lifelines.fitters.BaseFitter object doesn’t directly inherit from anything in the sklearn space, the interface is more or less the same. We just need to specify the Series values that we want to use to fit durations and event_observed.

from lifelines import KaplanMeierFitter


kmf = KaplanMeierFitter()
kmf.fit(durations=T, event_observed=E)
<lifelines.KaplanMeierFitter:"KM_estimate", fitted with 933 total observations, 99 right-censored observations>

And the object even comes, batteries-included, with a neat plotting interface.

fig, ax = plt.subplots(figsize=(15, 10))

kmf.plot(ax=ax);

png

Where the lighter area around the curve represents the 95% Confidence Interval, achieved via the exponential Greenwood (says the lifeline docs).

Additionally, the model object gives us a bunch of other statistical goodies.

for fn in dir(kmf):
    if (fn[-1] == '_' and fn[0] != '_'):
        print(fn)
conditional_time_to_event_
confidence_interval_
confidence_interval_cumulative_density_
confidence_interval_survival_function_
cumulative_density_
median_survival_time_
survival_function_

As an Analytical Tool

The Kaplan-Meier Estimator is obviously useful for getting a handle on the basic shape of the survival curve for your population. But with surprisingly-little adjustment, it also provides analytical insight between different segments of your population.

For instance, suppose we were interested in comparing the survival curves of sentators, based on which decade they were elected in. Getting at those values is simply some clever pandas datetime stuff on the field start_date, which reveals the decades for which we have the most data.

import pandas as pd

decade = pd.to_datetime(data['start_date']).dt.year // 10 * 10

decade.value_counts()
1990    85
1860    78
2000    76
1940    71
1970    67
1910    67
1960    65
1900    60
1950    51
1920    51
1930    49
1890    48
1880    47
1870    47
1980    41
2010    30
Name: start_date, dtype: int64

To better illustrate change over time, we’ll space out which decades we grab from: 1860, 1940, 1990

Then, plotting them together simply involves making a shared plot, then fitting the same estimator and plotting the results– three times.

fig, ax = plt.subplots(figsize=(15, 10))

kmf = KaplanMeierFitter()
kmf.fit(T[decade==1860], E[decade==1860], label='1860')
kmf.plot(ax=ax, ci_alpha=.1)

kmf.fit(T[decade==1940], E[decade==1940], label='1940')
kmf.plot(ax=ax, ci_alpha=.1)

kmf.fit(T[decade==1990], E[decade==1990], label='1990')
kmf.plot(ax=ax, ci_alpha=.1);

png

And almost immediately, you’ll notice that the senators elected in the 40s enjoyed a much kinder survival rate between days 3 and 5 thousand. However to better pick apart why this might be, we’d be better-served using other approaches.