# Kaplan-Meier Estimate

So before we get underway with our fancy-pants *tuned* models to approximate our survival/hazard equations, it’s worth highlighting a simple estimator that provides a useful, if naive, approximation with minimal headache.

This straight-forward equation is called the Kaplan-Meier Estimate and is used as a proxy for evaluating the Survival Function at a given point in time.

```
from IPython.display import Image
Image('images/km_eq.PNG')
```

The `n_i`

term is the population still at risk right before time `i`

(not yet swallowed by `F(t)`

), and `d_i`

is the number of events that occur *on* time `i`

.

A couple things worth noting here:

- We use product, not sum, here in order to account for the joint probability of surviving at time
`i`

, as well as surviving through all times before - This equation cleanly handles data that drops out due to an event obsevation or censoring. Similarly, it considers data with no established terminus for the duration of the obseration.

This blog-post works through a toy example on six points of data.

`Image('images/km_simple.PNG')`

## On Actual Data

At scale, these look a lot more sophisticated, and with hardly any additional effort.

Let’s pull a dataset of term-lengths of Canadian Senators from the `lifelines`

library.

```
%pylab inline
from lifelines.datasets import load_canadian_senators
data = load_canadian_senators()
data.head()
```

```
Populating the interactive namespace from numpy and matplotlib
```

Name | Political Affiliation at Appointment | Province / Territory | Appointed on the advice of | Term (yyyy.mm.dd) | start_date | end_date | reason | diff_days | observed | |
---|---|---|---|---|---|---|---|---|---|---|

0 | Abbott, John Joseph Caldwell | Liberal-Conservative | Quebec | Macdonald, John Alexander | 1887.05.12 - 1893.10.30 (Death) | 1887-05-12 00:00:00 | 1893-10-30 00:00:00 | Death | 2363 | True |

1 | Adams, Michael | Conservative (1867-1942) | New Brunswick | Bowell, Mackenzie | 1896.01.07 - 1899.01.01 (Death) | 1896-01-07 00:00:00 | 1899-01-01 00:00:00 | Death | 1090 | True |

2 | Adams, Willie | Liberal Party of Canada | Northwest Territories | Trudeau, Pierre Elliott | 1977.04.05 - 2009.06.22 (Retirement) | 1977-04-05 00:00:00 | 2009-06-22 00:00:00 | Retirement | 11766 | True |

3 | Aikins, James Cox | Liberal-Conservative | Ontario | Royal Proclamation | 1867.10.23 - 1882.05.30 (Resignation) | 1867-10-23 00:00:00 | 1882-05-30 00:00:00 | Resignation | 5333 | True |

4 | Aikins, James Cox | Liberal-Conservative | Ontario | Bowell, Mackenzie | 1896.01.07 - 1904.08.06 (Death) | 1896-01-07 00:00:00 | 1904-08-06 00:00:00 | Death | 3133 | True |

There’s a bit more information than we really need in this DataFrame, so we’ll follow `lifelines`

convention and stuff the durations and event observations into their own Series– `T`

and `E`

, respectively

```
T = data['diff_days']
E = data['observed'].map({True: 1, False: 0})
```

Before we fit the Kaplan-Meier Estimator to the data, let’s get a peek at some of the data.

There are a ton of records

`len(data)`

```
933
```

And after a bit of inspection, I found that there was a good mix of “observed” and “not observed” event data between 1000 and 2000 days. So we’ll narrow down to that timeline window and sample down to 50 records, to make a neater lifetime plot.

```
from lifelines.plotting import plot_lifetimes
narrowed = (data[(data['diff_days'] > 1000)
& (data['diff_days'] < 2000)]
).sample(n=50, random_state=0).reset_index()
narrowed_T = narrowed['diff_days']
narrowed_E = narrowed['observed']
fig, ax = plt.subplots(figsize=(12, 10))
plot_lifetimes(durations=narrowed_T,
event_observed=narrowed_E);
```

### Fitting the Curve

Actually fitting the data is super easy.

While the `lifelines.fitters.BaseFitter`

object doesn’t directly inherit from anything in the `sklearn`

space, the interface is more or less the same. We just need to specify the Series values that we want to use to fit `durations`

and `event_observed`

.

```
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
kmf.fit(durations=T, event_observed=E)
```

```
<lifelines.KaplanMeierFitter:"KM_estimate", fitted with 933 total observations, 99 right-censored observations>
```

And the object even comes, batteries-included, with a neat plotting interface.

```
fig, ax = plt.subplots(figsize=(15, 10))
kmf.plot(ax=ax);
```

Where the lighter area around the curve represents the 95% Confidence Interval, achieved via the exponential Greenwood (says the `lifeline`

docs).

Additionally, the model object gives us a bunch of other statistical goodies.

```
for fn in dir(kmf):
if (fn[-1] == '_' and fn[0] != '_'):
print(fn)
```

```
conditional_time_to_event_
confidence_interval_
confidence_interval_cumulative_density_
confidence_interval_survival_function_
cumulative_density_
median_survival_time_
survival_function_
```

### As an Analytical Tool

The Kaplan-Meier Estimator is obviously useful for getting a handle on the basic shape of the survival curve for your population. But with surprisingly-little adjustment, it also provides analytical insight between different segments of your population.

For instance, suppose we were interested in comparing the survival curves of sentators, based on which decade they were elected in. Getting at those values is simply some clever `pandas`

datetime stuff on the field `start_date`

, which reveals the decades for which we have the most data.

```
import pandas as pd
decade = pd.to_datetime(data['start_date']).dt.year // 10 * 10
decade.value_counts()
```

```
1990 85
1860 78
2000 76
1940 71
1970 67
1910 67
1960 65
1900 60
1950 51
1920 51
1930 49
1890 48
1880 47
1870 47
1980 41
2010 30
Name: start_date, dtype: int64
```

To better illustrate change over time, we’ll space out which decades we grab from: 1860, 1940, 1990

Then, plotting them together simply involves making a shared plot, then fitting the same estimator and plotting the results– three times.

```
fig, ax = plt.subplots(figsize=(15, 10))
kmf = KaplanMeierFitter()
kmf.fit(T[decade==1860], E[decade==1860], label='1860')
kmf.plot(ax=ax, ci_alpha=.1)
kmf.fit(T[decade==1940], E[decade==1940], label='1940')
kmf.plot(ax=ax, ci_alpha=.1)
kmf.fit(T[decade==1990], E[decade==1990], label='1990')
kmf.plot(ax=ax, ci_alpha=.1);
```

And almost immediately, you’ll notice that the senators elected in the 40s enjoyed a much kinder survival rate between days 3 and 5 thousand. However to better pick apart why this might be, we’d be better-served using other approaches.