Survival Analysis & Event Modeling: how long until something happens.

Standard supervised learning predicts a label or a number; survival analysis predicts a time — and does so under the unavoidable complication that for many subjects, the event of interest has not yet occurred when the data is collected. The result is a specialised framework: hazard rates instead of probabilities, censoring as a first-class concern, and likelihood functions that encode partial information from incomplete observations. This chapter develops the classical theory (Kaplan-Meier, Cox regression, parametric models), the modern neural-network counterparts (DeepSurv, DeepHit, neural ODEs for survival), and the deployment patterns for medicine, customer churn, predictive maintenance, and credit risk.

Prerequisites & orientation

This chapter assumes the probability and statistics background of Part I Ch 04–05, basic regression (Part IV Ch 01), and the regularised-likelihood framework that underlies most of statistical learning. Familiarity with the Cox proportional-hazards model is helpful but not required — Section 4 develops it from first principles. Neural-network fundamentals (Part V Ch 01–02) are needed for Sections 7–8 on deep survival models, and the causal-inference framing of Part XIII Ch 03 is a useful complement when survival outcomes are tied to interventions.

Two threads run through the chapter. The first is censoring: in survival data, the event of interest is often unobserved at the end of the study, and the model must extract information from "this subject was still alive at month 24" without pretending we know when they eventually die. Censoring is what makes survival analysis its own discipline rather than a special case of regression. The second thread is the hazard function — the instantaneous rate of event occurrence given survival up to now — which turns out to be the natural mathematical object for everything from Kaplan-Meier estimates through Cox regression to modern neural survival models. Once you internalise censoring and the hazard, the rest of the chapter is variations on a theme.

01

Why Survival Analysis Is Its Own Discipline

If you know the time at which an event occurs for every subject in your dataset, you can fit a regression model and predict event time directly. The trouble is that you almost never know that. The defining problem of survival analysis is censoring: at the end of the study some subjects have not yet experienced the event, and ignoring them or pretending they have biases the answer. Survival analysis exists because regression on time-to-event data is wrong by default, and getting it right requires explicit machinery.

Censoring: the central nuisance

Imagine a clinical trial that enrolls 500 cancer patients and follows them for five years. Some patients die during the study; their event time is observed. Some are still alive when the study ends; we know only that they survived at least 60 months. Some drop out at 36 months because they moved away; we know they survived at least that long but nothing more. The first group provides exact event times, the second and third provide only lower bounds. The technical name for the latter case is right-censoring, and it is the dominant form of censoring in practice. Other variants exist — left-censoring (we know the event happened before some time but not exactly when), interval-censoring (the event is known to lie within an interval) — but right-censoring is the case the chapter mostly addresses.

[Figure: timeline from enrolment to study end (12, 36, 60 months) for four patients — A: died at t=22; B: lost to follow-up at t=18 (censored); C: still alive at study end (censored); D: enrolled later, died at t=35.]
Four representative patients in a survival study. A's event is fully observed. B is censored at the time they were lost to follow-up. C is censored at study end. D entered late but had their event observed. Standard regression treats only A and D as informative and throws away B and C — wrong, because B and C contribute information ("alive at least until t").

Why naive regression fails

Three options come to mind, and all of them are wrong. Option one: drop the censored subjects and fit a regression on the rest. This induces a survivor-bias towards short event times because the subjects with the longest true times are exactly those who were censored — the model learns "events happen early" because it never sees the long-survivors. Option two: treat the censoring time as if it were the event time. This makes everyone look like they had the event soon, which is plainly false. Option three: impute event times for the censored subjects from the model's predictions and refit. This is closer to right (it is essentially what survival models do internally) but doing it ad hoc compounds the bias of whatever imputation rule you chose.

The right approach treats censoring as partial information: a censored subject contributes "I did not experience the event by time c" to the likelihood, while an observed subject contributes "I experienced the event exactly at time t." The likelihood function for survival data is the product of these contributions, and survival analysis is the body of methods built around it.

Where survival analysis shows up

The applications driving survival-analysis research and deployment include: medicine and epidemiology (time to death, recurrence, hospitalisation; the original home of the discipline), customer churn and retention (time to subscription cancellation, app uninstall, account closure), predictive maintenance (time to machine failure given sensor history), credit risk (time to loan default), insurance (time to claim), employee retention (time to attrition), and reliability engineering (time to component failure). Each application has its own censoring patterns and its own dominant tooling, but the mathematical machinery is shared.

The Censoring-First Principle

Any model for time-to-event data must explicitly handle censoring or it will be biased. The question is never "do we have censoring?" but "what kind of censoring, and is the censoring mechanism informative?" Methods differ in their assumptions: most assume non-informative censoring — that the timing of censoring is independent of the event hazard given the covariates. When that assumption fails (treatment-related dropout, for instance), the answers can be badly wrong even with the right method.

02

Hazards, Survival Functions, and the Likelihood

Three mathematical objects together describe a survival distribution: the survival function S(t), the hazard function h(t), and the cumulative hazard H(t). Each contains the same information; each is convenient for different purposes; and the censored likelihood is naturally written in terms of all three. Internalising these objects is the precondition for everything that follows.

Survival, density, and hazard

Let T be the random event time. The survival function is S(t) = P(T > t) — the probability of surviving past time t. The probability density f(t) is the derivative of the cumulative event probability, f(t) = −dS/dt. The hazard function is the instantaneous rate of event occurrence, given survival up to now:

Hazard function
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}
The hazard is the conditional density of the event at time t given that the event has not yet occurred. Crucially, it is conditional on survival — h(t) is well-defined even at very long times where the unconditional density f(t) might be tiny. The hazard is the natural object for modelling because it is local: each subject contributes information about h(t) only at times when they were still at risk.

The cumulative hazard is the integral of the hazard, H(t) = ∫₀ᵗ h(u) du, and it ties the three objects together: S(t) = exp(−H(t)). In practice, survival models often parameterise h(t) directly (as in Cox regression, where the covariate effect multiplies a baseline hazard) and recover S(t) via this exponential relationship.

The censored likelihood

Suppose subject i has event time ti and an indicator δi that is 1 if the event was observed and 0 if censored. The likelihood contribution from each subject is:

Likelihood for right-censored data
L_i = f(t_i)^{\delta_i} \cdot S(t_i)^{1 - \delta_i}
An observed event (δ = 1) contributes the density at the exact time. A censored subject (δ = 0) contributes the survival function — "still alive at tᵢ". Equivalently, using h = f / S and S = exp(−H), the log-likelihood becomes Σᵢ [ δᵢ log h(tᵢ) − H(tᵢ) ], which is the form most survival models actually optimise.

Notice the elegance: the log-likelihood depends only on the hazard at observed event times and the cumulative hazard at every observation time. Censored subjects contribute through the cumulative-hazard term but not the log-hazard term. This is the formal expression of "censored subjects are at risk for as long as we observe them, and contribute information accordingly."
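To make the form concrete, here is a minimal NumPy sketch of the censored log-likelihood for the simplest possible model, a constant hazard λ (so h(t) = λ and H(t) = λt); the data are invented toy values, and the closed-form MLE is observed events divided by total follow-up time.

import numpy as np

# Toy right-censored data: t = observed time (months), d = 1 if the event
# was observed, 0 if the subject was censored at that time.
t = np.array([22.0, 18.0, 60.0, 35.0, 41.0, 9.0])
d = np.array([1, 0, 0, 1, 1, 1])

def censored_log_lik(lam, t, d):
    # sum_i [ d_i * log h(t_i) - H(t_i) ]  with h(t) = lam, H(t) = lam * t
    return np.sum(d * np.log(lam) - lam * t)

# Closed-form MLE for the constant-hazard model: events / person-time at risk.
lam_hat = d.sum() / t.sum()
print(lam_hat, censored_log_lik(lam_hat, t, d))

The censored subjects enter only through the −λtᵢ term — exactly the "at risk for as long as observed" contribution described above.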

[Figure: three views of the same distribution — S(t), survival; f(t), density (−dS/dt); h(t), hazard (f/S); each plotted against t.]
The same distribution viewed three ways. S(t) decreases monotonically from 1 toward 0. The density f(t) rises and falls, peaking where events are most likely. The hazard h(t) — the conditional rate given survival — is increasing here, indicating that the risk per unit time grows with age (a typical pattern in mortality data). For exponential lifetimes the hazard is constant; for Weibull it is monotone; for human mortality it is bathtub-shaped.

The four shapes of hazard

A surprising amount of practical guidance comes from asking what shape h(t) has. Constant hazard means the event is "memoryless" — the exponential distribution; you see this in some failure-time data and in event-stream models. Monotonically increasing hazard means subjects degrade over time — typical for ageing components, some cancers in late stage, some chronic diseases. Monotonically decreasing hazard means early events thin the population, after which survivors are robust — typical of post-surgery recovery and infant-mortality patterns. Bathtub-shaped hazard — high early, low middle, rising late — characterises human mortality and many engineered systems. The hazard shape constrains which parametric family is appropriate; non-parametric methods (Kaplan-Meier in Section 3) make no shape assumption.

03

Kaplan-Meier and Non-Parametric Estimation

Before fitting a model with covariates, you almost always want a model-free picture of the survival pattern in the data — and the right tool is the Kaplan-Meier estimator. Published by Edward Kaplan and Paul Meier in 1958, it remains the single most widely-used method in survival analysis and the picture every survival paper opens with.

The product-limit estimator

The Kaplan-Meier estimator works by chopping time into intervals defined by observed event times. At each event time tⱼ, let dⱼ be the number of events and nⱼ the number of subjects still at risk. The estimated survival function is the product over event times:

Kaplan-Meier (product-limit) estimator
\hat{S}(t) = \prod_{j:\, t_j \le t} \left( 1 - \frac{d_j}{n_j} \right)
At each event time, the survival probability is multiplied by (1 − fraction of at-risk subjects who experienced the event at that time). Censored subjects leave the at-risk set silently — they reduce nⱼ for later event times but contribute no factor of their own. The result is a step function that drops at each observed event and is flat between events.
[Figure: Kaplan-Meier curve with 95% confidence band — Ŝ(t) on the vertical axis, time on the horizontal; tick marks denote censored subjects (at-risk count drops, no step).]
A Kaplan-Meier curve. The step function drops at each observed event; tick marks show times of censored observations (which reduce the at-risk denominator but do not produce a step). The shaded band is a 95% confidence interval, typically computed via Greenwood's formula. Reading: at any time t, Ŝ(t) is the estimated probability that a subject from the same population would survive past t.
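In code, the estimator is a few lines with the lifelines library; the durations and event flags below are invented toy values, but the KaplanMeierFitter calls are the library's standard API.

import numpy as np
from lifelines import KaplanMeierFitter

# Toy data: durations in months, event flag (1 = event observed, 0 = censored).
t = np.array([22.0, 18.0, 60.0, 35.0, 41.0, 9.0, 60.0, 27.0])
d = np.array([1, 0, 0, 1, 1, 1, 0, 1])

kmf = KaplanMeierFitter()
kmf.fit(t, event_observed=d, label="all subjects")
print(kmf.survival_function_)       # the step function Ŝ(t)
print(kmf.confidence_interval_)     # Greenwood-based 95% band
print(kmf.median_survival_time_)
kmf.plot_survival_function()        # the standard KM plot with its confidence band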

Confidence intervals: Greenwood's formula

Pointwise confidence intervals for the Kaplan-Meier curve are typically computed using Greenwood's formula for the variance of Ŝ(t). The variance is approximately Ŝ(t)² · Σⱼ dⱼ / [nⱼ(nⱼ − dⱼ)], where the sum runs over event times tⱼ ≤ t. Most practitioners apply a log-log transformation before computing intervals because confidence bounds on raw survival can extend below 0 or above 1, which is awkward; the transformation keeps bounds in the unit interval. Modern software handles this transparently — the interpretation is what matters.

The log-rank test

The natural follow-up question after drawing two Kaplan-Meier curves (treatment vs. control, group A vs. group B) is whether they differ statistically. The standard answer is the log-rank test, which compares observed events to expected events under the null hypothesis of equal hazards. The test statistic is asymptotically chi-squared with one degree of freedom for two groups (k − 1 degrees of freedom for k groups). The log-rank test is most powerful against alternatives where hazards are proportional — which connects directly to the Cox model in the next section. Stratified log-rank tests, weighted log-rank tests (Wilcoxon, Tarone-Ware) and tests robust against non-proportional hazards exist for the cases where the basic log-rank assumption fails.
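A sketch of the two-group comparison with lifelines' logrank_test, on invented two-arm data:

import numpy as np
from lifelines.statistics import logrank_test

# Hypothetical two-arm study: durations (months) and event flags per arm.
t_ctrl = np.array([6.0, 9.0, 12.0, 15.0, 20.0, 24.0, 30.0])
d_ctrl = np.array([1, 1, 1, 0, 1, 1, 0])
t_trt = np.array([10.0, 14.0, 18.0, 24.0, 30.0, 36.0, 36.0])
d_trt = np.array([1, 0, 1, 1, 0, 1, 0])

res = logrank_test(t_ctrl, t_trt,
                   event_observed_A=d_ctrl, event_observed_B=d_trt)
print(res.test_statistic, res.p_value)   # chi-squared statistic, 1 df for two groups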

What Kaplan-Meier doesn't do

Kaplan-Meier estimates the survival curve in a single (sub-)population. It does not adjust for covariates, does not produce individual-level predictions, and does not extrapolate beyond the observed data. The curve gets noisy in its right tail (few subjects still at risk → wide confidence intervals → unreliable estimates), and many practitioners truncate the display at the time when fewer than ~10% of the original cohort remain at risk. For covariate-adjusted survival prediction the chapter moves to Cox regression in the next section.

04

Cox Proportional Hazards Regression

The Cox proportional-hazards model, published by David Cox in 1972, is the survival-analysis equivalent of linear regression: the default model that practitioners reach for first, robust enough to handle most real datasets, and the framework on top of which most modern extensions (including the deep-learning models of Sections 7–8) are built. Its key innovation — the partial likelihood — sidesteps the need to estimate a baseline hazard, making it semi-parametric and dramatically more flexible than the parametric alternatives of Section 5.

The proportional-hazards assumption

Cox's model assumes the hazard for subject i with covariates xi takes the form:

Cox proportional-hazards model
h(t \mid x_i) = h_0(t) \cdot \exp(\beta^\top x_i)
h₀(t) is an unspecified baseline hazard common to all subjects; exp(βᵀxᵢ) is a multiplicative covariate effect. The proportional-hazards assumption is in the multiplicative structure: the ratio of hazards between any two subjects is constant over time, equal to exp(βᵀ(xᵢ − xⱼ)). The β coefficients are interpretable as log hazard ratios.

The proportional-hazards assumption is testable. The standard diagnostic is the Schoenfeld residuals plot — plot residuals against time and look for trend. If the residuals show no time trend, proportionality is supported; if there is a trend, the assumption is suspect and you may need stratification, time-varying covariates, or a model that allows non-proportional hazards.
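A sketch of a Cox fit and the Schoenfeld-residual diagnostic using lifelines and its bundled Rossi recidivism dataset ('week' is the duration column, 'arrest' the event indicator):

from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()                    # 432 subjects, 7 covariates

cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")
cph.print_summary()                  # coef = log hazard ratio, exp(coef) = hazard ratio

# Schoenfeld-residual-based tests (and advice) for the proportional-hazards assumption.
cph.check_assumptions(df, p_value_threshold=0.05)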

Partial likelihood: the central trick

Cox's brilliant insight was that you can estimate β without ever estimating the baseline hazard h0(t). The argument: at each event time, the conditional probability that a particular at-risk subject is the one who experiences the event depends only on covariates, not on the baseline hazard. Form the product of these conditional probabilities and you get the partial likelihood:

Cox partial likelihood
L(\beta) = \prod_{i:\, \delta_i = 1} \frac{\exp(\beta^\top x_i)}{\sum_{j \in R(t_i)} \exp(\beta^\top x_j)}
R(tᵢ) is the at-risk set at time tᵢ — every subject who has not yet had an event or been censored. The numerator is the hazard of the subject who actually experienced the event; the denominator sums hazards over all at-risk subjects. The baseline h₀(t) cancels in the ratio. Maximising this partial likelihood yields consistent, asymptotically normal estimates of β.
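The partial likelihood is compact enough to write out directly. A minimal NumPy sketch (Breslow-style handling of ties via the sort order, no regularisation) of the partial log-likelihood as a function of β:

import numpy as np

def cox_partial_log_lik(beta, X, time, event):
    # Sort by decreasing time so that the risk set R(t_i) of subject i is
    # exactly the prefix 0..i of the sorted arrays.
    order = np.argsort(-time)
    X, event = X[order], event[order]
    eta = X @ beta                                   # linear predictor β·x
    log_risk = np.logaddexp.accumulate(eta)          # log Σ_{j in R(t_i)} exp(β·x_j)
    return float(np.sum(event * (eta - log_risk)))   # sum over observed events only

Maximising this function (for instance by passing its negative to scipy.optimize.minimize) reproduces the Cox β̂; subjects tied at the same event time are handled only approximately here.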

Once β̂ is estimated, the baseline hazard can be recovered using the Breslow estimator if needed for survival-curve prediction. In many applications, however, only the hazard ratios matter and the baseline is never explicitly estimated.

Hazard ratios and interpretation

The fitted coefficient β̂ₖ is the log of the hazard ratio for a one-unit increase in covariate xₖ, holding others fixed. exp(β̂ₖ) is the hazard ratio itself: a value of 2 means doubling the hazard, 0.5 means halving. Hazard ratios are the lingua franca of clinical-trial reporting, and the interpretability they offer is part of why the Cox model has remained dominant for fifty years. For categorical covariates, the hazard ratio compares to the reference category; for continuous, it is per unit (so for age in years, exp(β̂) = 1.04 means a 4% hazard increase per year of age).

Stratification, time-varying covariates, and frailty

The basic Cox model has several extensions that handle violations of its assumptions. Stratification allows different baseline hazards for different strata (e.g., separate baselines per study site) while keeping a common β. Time-varying covariates let xi change over time — useful for biomarkers that evolve or treatments that change during follow-up; the partial likelihood adapts naturally. Frailty models introduce a random effect (a "frailty") shared by clusters of subjects, capturing unobserved heterogeneity. Together these extensions make the Cox framework remarkably flexible without leaving the proportional-hazards setup. The cases where Cox genuinely fails — strongly time-varying hazard ratios, complex non-linear covariate effects — are exactly where the neural extensions of Sections 7–8 come in.

05

Parametric Survival Models

The Cox model leaves the baseline hazard unspecified, which is its strength when you don't know what shape it should have. When you do know — or when you need to extrapolate beyond the observed data, simulate event times, or quote absolute predictions rather than hazard ratios — a fully parametric model is the right choice. The parametric survival family is small, well-understood, and continues to dominate reliability engineering and economic-modelling applications.

The Weibull and exponential

The exponential distribution is the simplest survival distribution: constant hazard, h(t) = λ. It corresponds to memorylessness — the probability of surviving another unit of time does not depend on how long you have already survived. Few real-world processes are exactly exponential, but the model is useful as a baseline and as a building block.

The Weibull distribution generalises by allowing a power-law hazard, h(t) = λ p tᵖ⁻¹. The shape parameter p controls the hazard's behaviour: p = 1 recovers the exponential, p > 1 gives an increasing hazard (typical of ageing), p < 1 gives a decreasing hazard (typical of early-failure-prone systems). The Weibull is the workhorse of reliability engineering — Weibull plots are a standard quality-control tool — and remains common in clinical trials when the hazard shape is well-understood.
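A sketch of fitting and comparing parametric families with lifelines on invented censored failure times; the AIC comparison and the Weibull shape parameter (lifelines calls it rho_) are the quantities of interest.

import numpy as np
from lifelines import ExponentialFitter, WeibullFitter, LogNormalFitter

# Toy right-censored failure times (months) and event flags.
t = np.array([3.0, 7.0, 11.0, 14.0, 20.0, 26.0, 30.0, 36.0, 36.0, 36.0])
d = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 1])

for name, fitter in [("exponential", ExponentialFitter()),
                     ("weibull", WeibullFitter()),
                     ("log-normal", LogNormalFitter())]:
    fitter.fit(t, event_observed=d)
    print(f"{name:12s} AIC = {fitter.AIC_:.1f}")

wf = WeibullFitter().fit(t, event_observed=d)
# lifelines parameterises S(t) = exp(-(t / lambda_) ** rho_), so
# rho_ > 1 means an increasing hazard (ageing), rho_ < 1 a decreasing one.
print(wf.lambda_, wf.rho_)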

Other distributions: log-normal, log-logistic, gamma

The log-normal distribution lets log(T) be normally distributed; its hazard rises and then falls, which is appropriate for diseases with a peak risk window. The log-logistic is similar but with heavier tails. The generalised gamma nests several of these as special cases and is sometimes used when model selection across families is itself part of the analysis. The choice among these is empirical — fit several, compare via AIC or BIC, examine residual plots, pick the one that fits best.

Accelerated failure time framing

An alternative to the proportional-hazards parameterisation is the accelerated failure time (AFT) framing, where covariates multiply the time scale rather than the hazard:

Accelerated Failure Time model
\log T = \alpha + \gamma^\top x + \sigma \varepsilon
where ε is a noise term with a specified distribution (normal for log-normal AFT, extreme-value for Weibull AFT, logistic for log-logistic AFT). The covariate effect γₖ is a log time ratio: exp(γₖ) is the multiplicative effect on event time. AFT models are popular in industrial reliability because the time-acceleration interpretation matches the engineer's intuition more directly than hazard ratios.

The Weibull family has the property of being both a proportional-hazards and an AFT model — the only family where this holds. For other distributions, AFT and PH parameterisations give different models, and the choice is dictated by which interpretation is more useful.
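A sketch of an AFT fit with lifelines' WeibullAFTFitter on a small invented dataset; the exp(coef) column of the summary carries the time-ratio interpretation described above.

import pandas as pd
from lifelines import WeibullAFTFitter

# Toy dataset: duration T (months), event flag E, two covariates.
df = pd.DataFrame({
    "T":   [5.0, 8.0, 12.0, 18.0, 24.0, 30.0, 36.0, 36.0, 14.0, 22.0],
    "E":   [1, 1, 0, 1, 1, 0, 1, 0, 1, 1],
    "age": [61, 70, 54, 65, 58, 49, 73, 60, 66, 57],
    "trt": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
})

aft = WeibullAFTFitter()
aft.fit(df, duration_col="T", event_col="E")
aft.print_summary()               # exp(coef) are time ratios, not hazard ratios
print(aft.predict_median(df))     # predicted median event time per subject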

When parametric beats Cox

Three situations favour parametric models. First, extrapolation: Cox can predict survival only within the observed time range; a parametric model extrapolates by structural form. For a 5-year clinical trial that needs to predict 10-year outcomes, this matters. Second, simulation and decision-analysis: economic models and reliability simulations need closed-form survival distributions to integrate over future time, which Cox does not provide. Third, small samples with many covariates: Cox's partial-likelihood estimates can be unstable when events are sparse, while a parametric model with a tight functional form has fewer effective parameters and degrades more gracefully.

06

Competing Risks and Multi-State Models

Standard survival analysis assumes a single event of interest and treats anything else as censoring. In many real applications this is wrong: a cancer patient might die from the cancer or from an unrelated cause, and the two are not exchangeable; a subscriber might churn voluntarily or be terminated, and the rates of each respond to different interventions. The competing risks framework handles multiple mutually-exclusive event types simultaneously, and multi-state models extend this to more complex event histories.

Cause-specific hazards vs. subdistribution hazards

Two parameterisations dominate the competing-risks literature, and they answer different questions. The cause-specific hazard for cause k is the rate of type-k events among subjects still alive (still at risk for any cause), hₖ(t) = lim_{Δt → 0} P(t ≤ T < t + Δt, cause = k | T ≥ t) / Δt. Cause-specific hazards are useful for asking "what drives the rate of cause k?" — they are the natural object for aetiological questions.

The subdistribution hazard, introduced by Fine and Gray (1999), is the rate of cause-k events among subjects who have either not had any event yet or have already had a competing event. It is the right object for predicting the cumulative incidence function (CIF) — the absolute probability that cause k will eventually be the first event. The Fine-Gray model is the Cox-style regression on the subdistribution hazard and is the standard tool for clinical applications where absolute risk prediction matters.

The cumulative incidence function

The cumulative incidence function for cause k is Fₖ(t) = P(T ≤ t, cause = k). The cause-specific CIFs satisfy Σₖ Fₖ(t) ≤ 1, with the gap accounting for subjects still alive. Crucially, the CIF is not simply 1 minus the cause-specific Kaplan-Meier — that classic mistake (which treats competing events as censoring) overestimates the cumulative incidence because it implicitly assumes the censored subjects could still go on to have the cause-of-interest event. The right non-parametric estimator is Aalen-Johansen, which accounts for competing events properly.
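lifelines ships the Aalen-Johansen estimator directly. A sketch on invented competing-risks data, with causes coded 0 = censored, 1 = cause of interest, 2 = competing cause:

import numpy as np
from lifelines import AalenJohansenFitter

durations = np.array([5.0, 8.0, 12.0, 12.0, 20.0, 25.0, 31.0, 40.0])
causes = np.array([1, 2, 0, 1, 2, 1, 0, 1])

ajf = AalenJohansenFitter()
ajf.fit(durations, causes, event_of_interest=1)
print(ajf.cumulative_density_)    # CIF for cause 1, accounting for the competing cause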

[Figure: cumulative incidence of two competing causes as a stacked area over time — cause 1 (cancer death), cause 2 (other death), and still alive.]
Cumulative incidence of two competing causes plotted as a stacked area. At any time t, the bottom region is the probability that cause 1 has been the first event by t; the middle region the probability that cause 2 has; the top region the probability of still being event-free. The three regions sum to 1. This stacked-CIF picture is the standard summary for competing-risks studies.

Multi-state models

When events can recur or when subjects pass through a sequence of states (healthy → diseased → dead, or active → trial → paid → cancelled), the natural framework is a multi-state model. Subjects move between states according to transition hazards, and the methodology generalises Cox regression to each transition. The illness-death model (three states: healthy, ill, dead, with possible direct healthy-to-dead transition) is a canonical example. Multi-state models are popular in chronic disease epidemiology and in customer-lifecycle analytics where the full progression matters more than time-to-first-event.

07

DeepSurv and Neural Cox Extensions

Cox regression assumes the log hazard is linear in the covariates. For tabular data with a handful of features, that assumption is often fine. For data with hundreds of features, complex interactions, or non-tabular structure (images, sequences), a linear log-hazard is a strong and frequently-violated assumption. The DeepSurv family of models replaces the linear function with a neural network while keeping the rest of the Cox machinery — a clean, well-understood extension that has dominated deep-learning survival analysis since 2018.

DeepSurv: a neural log-hazard

Katzman et al. (2018) introduced DeepSurv, which keeps the proportional-hazards form but replaces the linear log-hazard with a neural network gθ(x):

DeepSurv (Katzman et al. 2018)
h(t \mid x) = h_0(t) \cdot \exp\big( g_\theta(x) \big)
g_θ is a feed-forward MLP (or any other neural architecture) producing a scalar log-hazard for input x. Training maximises the Cox partial likelihood with g_θ(xᵢ) substituted for βᵀxᵢ. The output is non-linear in covariates but still proportional in time, so the model gains expressivity without abandoning the Cox framework.

DeepSurv inherits all the strengths of Cox — censoring handled correctly, no need to estimate the baseline hazard during training — and gains the expressivity of a neural network. It outperforms linear Cox on benchmarks where non-linear covariate effects are expected (most modern medical datasets, where feature interactions matter) and matches it on benchmarks where linearity is a reasonable approximation (small clinical datasets with a handful of features). Production deployments in oncology, cardiovascular risk, and predictive maintenance increasingly use DeepSurv-style models.

Loss formulation and training tricks

The Cox partial-likelihood loss has a quirk that complicates SGD training: each event subject's contribution depends on the entire risk set at their event time, which can be the whole training set. Several fixes exist. Cox-CC (Cox case-control) approximates the risk-set sum by sampling a small set of negatives at each event time, restoring the standard mini-batch-friendly loss. Sorted-batch training (sort the batch by event time and let each event subject see only the later subjects in the batch as their risk set) is another common trick. Both produce close-to-correct gradients with manageable memory and have become standard in deep-survival libraries.
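A minimal PyTorch sketch of a DeepSurv-style model with the sorted-batch risk-set trick; the architecture, dropout rate, and batch handling are illustrative choices rather than the original paper's exact setup.

import torch
import torch.nn as nn

class DeepSurvNet(nn.Module):
    """MLP producing a scalar log-hazard g_theta(x) per subject."""
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def cox_ph_loss(log_hazard, time, event):
    """Negative Cox partial log-likelihood in which each event's risk set is
    approximated by the later-surviving subjects within the mini-batch."""
    order = torch.argsort(time, descending=True)
    lh, ev = log_hazard[order], event[order]
    log_risk = torch.logcumsumexp(lh, dim=0)          # running risk-set denominator
    return -((lh - log_risk) * ev).sum() / ev.sum().clamp(min=1.0)

# One training step on a mini-batch (x, t, e) of covariates, times, event flags:
#   loss = cox_ph_loss(model(x), t, e); loss.backward(); optimizer.step()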

Cox-Time: time-varying log-hazard

The proportional-hazards assumption — that hazard ratios are constant over time — is the most common Cox-model violation. Cox-Time (Kvamme et al. 2019) extends DeepSurv to allow the log-hazard to depend on time, g_θ(x, t). The proportional-hazards property is dropped; the model learns time-varying hazard ratios from data. Cox-Time is a strong default for problems where proportionality clearly fails — for instance, immunotherapy trials where short-term and long-term hazards differ in direction. The cost is more parameters and more careful regularisation; the benefit is a more honest fit when the data demands it.

DeepHit and the discrete alternative

An alternative philosophy abandons the Cox structure entirely and parameterises a discrete-time hazard or survival distribution directly. That family is the subject of the next section, but the design choice is worth flagging here: continuous-time Cox-style models stay close to the classical framework and are easier to interpret and audit; discrete-time models are simpler computationally and more flexible about hazard shape. Both have their place, and modern deep-survival deployments commonly include one of each as ensemble members.

08

Discrete-Time and Direct Survival Networks

An alternative to the Cox-style continuous-time formulation is to discretise time into bins and predict, for each bin, the conditional probability of the event given survival up to that bin. The discrete-time framework matches how event data is typically collected (in days, months, or visits), makes deep-learning-style training straightforward, and removes the proportional-hazards assumption entirely. It is the dominant architecture for survival prediction in large-scale industrial deployments.

The logistic-hazard parameterisation

Discretise time into bins 0 = τ₀ < τ₁ < ... < τ_K. For each bin k, define the discrete hazard λₖ(x) = P(T ∈ [τₖ₋₁, τₖ) | T ≥ τₖ₋₁, x) — the conditional probability of the event happening in bin k given survival up to the start of bin k. A neural network outputs K hazards via a sigmoid-per-bin head, and the survival function is recovered as the cumulative product:

Logistic-hazard network
\hat{S}(\tau_k \mid x) = \prod_{j \le k} \big( 1 - \lambda_j(x) \big)
Training uses a binary cross-entropy loss on the per-bin hazards, weighted to handle censoring: an observed event in bin k contributes a positive label at bin k and negative labels for prior bins; a censored observation contributes negative labels up to the censoring bin and no contribution thereafter. The whole thing is straightforward mini-batch SGD.

Logistic-hazard networks are the simplest deep-survival architecture and are remarkably effective on industrial-scale datasets with millions of subjects. Modern implementations like pycox's LogisticHazard are essentially drop-in for any tabular survival problem.
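A minimal PyTorch sketch of the logistic-hazard parameterisation — per-bin sigmoid hazards, a censoring-aware binary cross-entropy, and the cumulative-product survival recovery. The bin-labelling convention for censored subjects and the architecture are illustrative choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LogisticHazardNet(nn.Module):
    def __init__(self, n_features: int, n_bins: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins),        # one logit per time bin
        )

    def forward(self, x):
        return self.net(x)

def logistic_hazard_loss(logits, bin_idx, event):
    """bin_idx: index of the bin containing the event (or censoring) time.
    An event contributes a positive label at its bin and negatives before it;
    a censored subject contributes negatives up to its bin and nothing after."""
    n, k = logits.shape
    hazards = torch.sigmoid(logits)
    target = torch.zeros_like(hazards)
    target[torch.arange(n), bin_idx] = event.float()
    mask = (torch.arange(k, device=logits.device).unsqueeze(0)
            <= bin_idx.unsqueeze(1)).float()
    bce = F.binary_cross_entropy(hazards, target, reduction="none")
    return (bce * mask).sum() / mask.sum()

def predict_survival(logits):
    """Ŝ(τ_k | x) = Π_{j ≤ k} (1 − λ_j(x))."""
    return torch.cumprod(1.0 - torch.sigmoid(logits), dim=1)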

DeepHit and competing risks

The DeepHit architecture (Lee et al. 2018) extends the discrete-time framework to competing risks and adds a ranking-based loss term. A shared trunk feeds cause-specific heads, each producing a discrete probability mass over time bins; the output is the joint distribution over (cause, time). The training loss combines a likelihood term (for the observed events) with a ranking term that pushes higher predicted risk for subjects who actually experienced the event sooner — a differentiable surrogate for the C-index discussed in Section 9. DeepHit and its successors dominate competing-risks deep-learning benchmarks.

Choice of bin count and edges

The discretisation introduces a hyperparameter — the number of bins and where to place them. Too few bins coarsens predictions; too many produces noisy hazards in sparse-event regions. The standard advice is to use ~30–100 bins with edges at empirical quantiles of observed event times (so each bin has roughly equal numbers of events). Some implementations learn the bin edges; this rarely helps in practice. Once the bin count is in a sensible range, it has little further effect on results.

Discrete-time versus continuous-time: a practical comparison

Discrete-time models win on raw flexibility, ease of training, and natural support for competing risks. Continuous-time Cox-style models win on interpretability (hazard ratios are straightforward to explain), extrapolation (the hazard function is defined at all times), and integration with the classical statistical-inference apparatus. Production deployments often use both: a Cox-style model for headline reporting and clinical interpretation, a DeepHit-style model for the highest-stakes predictions. The choice is not sharp; the methods overlap in capability and the difference comes down to which framework is easier to maintain in your specific stack.

09

Evaluation: Concordance, Brier, and Calibration

Evaluating a survival model is harder than evaluating a regression or classification model because censoring obscures the truth and because survival predictions are not single numbers — a model produces a survival curve per subject, and the right metric depends on which property of that curve you care about. The survival-evaluation literature has converged on three complementary metrics: concordance for discrimination, Brier score for overall accuracy, calibration for trustworthiness of probabilities.

Concordance: the C-index

The concordance index or C-index is the survival-analysis equivalent of AUC. Given two subjects, the model is concordant on the pair if it predicts a higher risk for the one who actually had the earlier event. The C-index is the fraction of all comparable pairs (those with at least one observed event and where the comparison is identifiable despite censoring) on which the model is concordant. C = 1 is perfect; C = 0.5 is random. Most published survival models report a C-index between 0.6 and 0.85 on real benchmarks; jumps from 0.75 to 0.78 are clinically meaningful.

Two C-index variants matter: Harrell's C, the original definition, biased upward when censoring is heavy; and Uno's C, which uses inverse-probability-of-censoring weighting to correct the bias. For comparative reporting, Uno's C is the modern default. Most survival libraries (lifelines, pycox, scikit-survival) implement both.
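A sketch of both variants, continuing from the Rossi-data Cox fit above but now refit on a hypothetical train/test split (train and test are assumed to be splits of the same dataframe): Harrell's C via lifelines, Uno's IPCW-corrected C via scikit-survival (package sksurv).

from lifelines.utils import concordance_index
from sksurv.util import Surv
from sksurv.metrics import concordance_index_ipcw

cph.fit(train, duration_col="week", event_col="arrest")
risk_test = cph.predict_partial_hazard(test)

# Harrell's C: concordance_index expects scores that rise with survival time,
# so the predicted partial hazard is negated.
print(concordance_index(test["week"], -risk_test, event_observed=test["arrest"]))

# Uno's C with inverse-probability-of-censoring weighting.
y_train = Surv.from_arrays(event=train["arrest"].astype(bool), time=train["week"])
y_test = Surv.from_arrays(event=test["arrest"].astype(bool), time=test["week"])
print(concordance_index_ipcw(y_train, y_test, estimate=risk_test.values, tau=40)[0])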

Brier score and integrated Brier

The Brier score at time t is the mean squared error between the predicted survival probability and the indicator of survival past t, averaged over subjects (with inverse-probability-of-censoring weights to handle censoring). The integrated Brier score (IBS) integrates this over a range of follow-up times, producing a single summary metric. IBS is sensitive to both calibration and discrimination — a model can have high concordance but poor IBS if its absolute predicted probabilities are miscalibrated. For applications that use absolute predicted risks (clinical decision-making, insurance pricing), IBS is a more relevant headline metric than the C-index.
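A sketch of the integrated Brier score with scikit-survival, continuing from the previous snippet (same cph, train/test split, and y_train/y_test structured arrays); the evaluation grid must lie strictly inside the observed follow-up range.

import numpy as np
from sksurv.metrics import integrated_brier_score

times = np.linspace(5, 45, 20)     # weeks, inside the follow-up range of the test set
# Predicted survival probabilities at `times` for each test subject, taken here
# from the lifelines Cox fit; shape must be (n_test_subjects, len(times)).
surv = cph.predict_survival_function(test, times=times)   # DataFrame: rows = times
print(integrated_brier_score(y_train, y_test, surv.T.values, times))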

Calibration: do predictions mean what they say?

A model is well-calibrated if, among subjects predicted to have a 30% probability of an event by year 5, roughly 30% actually experience the event by year 5. Calibration is independent of discrimination — a model can be perfectly discriminating yet consistently over- or under-predict the absolute risk. The standard diagnostic is a calibration plot: bin subjects by predicted risk, plot observed vs. predicted incidence per bin, look for a 45-degree line. Recalibration is the post-hoc adjustment (typically logistic recalibration on the predicted log-hazard) to restore calibration when discrimination is fine but absolute risks are off.

Time-dependent AUC

For a fixed prediction horizon (say, 5-year mortality), one can define a binary outcome (died by 5 years vs. not) and compute AUC for the predicted 5-year risk against this binary outcome — handling censoring by inverse-probability weighting. The result is a time-dependent AUC, useful when the application has a specific decision horizon and a cleaner story than a horizon-spanning C-index. Plotting the time-dependent AUC across many horizons (the AUC-versus-time curve) is now a standard diagnostic in clinical-survival reporting.

What the metrics miss

None of these metrics capture clinical utility directly — the question of whether using the model improves real-world decisions. The decision curve analysis framework (Vickers et al.) addresses this by quantifying net benefit at varying decision thresholds. For high-stakes deployments, decision-curve analysis should accompany the standard discrimination/calibration metrics, and the regulatory expectation in medical-device approval is increasingly that all three appear together.

10

Applications and Frontier

Survival-analysis machinery is used in a remarkable range of domains, from oncology to industrial maintenance to consumer churn. The deployment patterns differ — feature schemas, censoring mechanisms, and reporting requirements vary by domain — but the methods of the previous nine sections cover most production needs. This final section surveys the application landscape and the frontier where deep learning, causal inference, and continuous-time methods are reshaping the field.

Oncology and clinical applications

Survival analysis was born in clinical research and remains its largest application. Standard oncology workflows use Kaplan-Meier curves to summarise raw survival, the log-rank test for treatment comparison, and Cox regression with treatment plus prognostic covariates for the primary efficacy analysis. Modern deep-survival models (DeepSurv, DeepHit) appear in radiomics, where covariates include extracted image features, and in genomic survival prediction, where covariates include thousands of gene-expression measurements. The cBioPortal and TCGA public datasets are the standard benchmarks for genomic-survival methods.

Customer churn and retention

Subscription businesses run continuous survival analysis on their user bases. The event is account cancellation; the covariates are usage features, account demographics, support-ticket history. The competing-risks framing is often appropriate (voluntary cancellation vs. forced termination for non-payment). Production stacks are typically logistic-hazard networks for the model itself plus retention-action systems on top — the model identifies subjects at imminent risk and triggers retention offers, and the analysis loop measures whether the offers actually moved the cancellation hazard.

Predictive maintenance and reliability

Industrial systems generate continuous sensor data; the survival-analysis question is when a component will fail. Weibull and log-normal parametric models dominate the engineering side of this domain (they extrapolate well and the engineering interpretation of the shape parameter is familiar), while deep-survival models are used when sensor histories are high-dimensional time series. Multi-state models capture wear-condition progression. The deployment pattern: live forecasting of remaining-useful-life, with maintenance scheduling driven by predicted hazard crossing a threshold.

Credit risk, insurance, and finance

Time to default, time to claim, time to bankruptcy — survival framing is increasingly standard for credit and insurance applications, displacing classification models that predict "will default within 12 months" with continuous-time models that produce a full survival curve. The competing-risks framework is essential (default vs. prepayment for mortgages, for instance). Regulatory regimes (Basel for banks, Solvency II for insurers) increasingly accept survival-style models for the calibrated absolute-risk predictions they produce.

Frontier methods

Three frontiers are particularly active in 2026. First, neural ordinary differential equations (Neural ODEs) for survival, which parameterise the hazard as a continuous-time function output by an ODE network — keeping the continuous-time mathematical structure while gaining neural-network expressivity. Second, causal survival analysis, which extends the doubly-robust and orthogonal-machine-learning estimators of Part XIII Ch 04 to time-to-event outcomes, producing causally-interpretable hazard-ratio estimates under treatment-confounding. Third, survival foundation models — pretrained representations of clinical-event sequences that fine-tune to specific endpoints, drawing on the same techniques as language-model pretraining; the early work (e.g., BEHRT, Med-BERT) is promising but the field is far from settled.

What this chapter does not cover

Several adjacent areas are out of scope. Recurrent-event analysis, where subjects can experience the same event multiple times (hospital readmissions, repeated equipment failures), uses extensions like the Andersen-Gill model and counting-process formulations. Joint models for longitudinal-and-survival data, where a biomarker trajectory and a time-to-event outcome are modelled simultaneously, are an active subfield with their own software ecosystem. Cure models, which assume a fraction of subjects are immune to the event and never experience it, are important in oncology with very-long follow-up. Bayesian survival analysis, including the BUGS/Stan/PyMC ecosystem with horseshoe priors on covariate effects, complements the frequentist treatment in this chapter for small-sample and prior-informed problems. And the engineering literature on Weibull mixture models, Bayesian network reliability, and accelerated-life testing intersects survival analysis but draws on a separate quality-engineering tradition that warrants its own treatment.

Further reading

Foundational texts and key papers for survival analysis and modern event modelling. The Klein-Moeschberger textbook plus the Cox 1972 paper plus the lifelines documentation is the right starting kit for practitioners.