Week 1: Introduction to Survival Analysis

Definitions

Survival Time (T): Time until an event occurs; also called time-to-event, death time, or reliability time

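Censored Data: Observations for which the exact event time is unknown and only a bound on it is observed. Common causes:

Event did not occur before the end of the study
Subject was lost to follow-up (could no longer be observed) before the end of the study

Truncated Data: Data in which subjects are included only if their event time falls in a certain range, e.g. left truncation, where subjects enter the study after time 0 and must have survived until entry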

Concept

Survival analysis studies time-to-event data. Key components:

Origin: Start of observation time
Scale: Time measurement unit (days, months, hours)
Event: The endpoint being measured (death, failure, relapse)

Week 2: Basic Quantities of Survival Distribution

Definitions

Survival Function S(t): $S (t) = Pr (T > t)$ — probability surviving beyond time t

Recall that T is the survival time random variable defined in Week 1.

Hazard Function h(t): Instantaneous failure rate at time t given survival until t

See the conceptual example under Hazard function below.

Cumulative Hazard H(t): $\int_{0}^{t} h (u) d u$ — total hazard accumulated up to time t

Mean Residual Life mrl(t): $E [T - t ∣ T > t]$ — expected remaining time after surviving t

Median Life: Time where 50% of subjects have experienced the event

Concepts

S(t) properties:

Monotonically non-increasing (it can stay flat but never goes up)
$S (0) = 1$ , $lim_{t \to \infty} S (t) = 0$ (starts at 1, approaches 0)

h(t) properties: Can be increasing, decreasing, constant, or bathtub-shaped

Relationship: $S (t) = exp [- H (t)]$ and $h (t) = f (t) / S (t)$

Hazard function

Rate of the event occurring per unit time (t), among those who haven’t experienced the event yet.

A high h(t) at some time t means: among those still alive at t, failures are happening rapidly.
A low h(t) means failures are rare at that moment.

The “instantaneous” part means the time window is shrunk toward a single point in time, whereas the survival function summarises the whole interval from 0 to t.

100 light bulbs. All running at time t=1000 hours. 5 fail between 1000 and 1001 hours.

$h (1000) \approx \frac{5/100}{1} = 0.05$ per hour

At t=1000, survivors are failing at a rate of 0.05 per hour.
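The light-bulb calculation can be sketched as a small helper. The function name `interval_hazard` and its arguments are illustrative and not part of the notes; it approximates h(t) by the fraction of at-risk units that fail in a short window, divided by the window length.

```python
def interval_hazard(at_risk: int, failures: int, width: float) -> float:
    """Approximate h(t): fraction of at-risk units failing, per unit time."""
    if at_risk <= 0 or width <= 0:
        raise ValueError("at_risk and width must be positive")
    return (failures / at_risk) / width

# 100 bulbs still running at t = 1000 h, 5 of them fail within the next hour
h_1000 = interval_hazard(at_risk=100, failures=5, width=1.0)
print(h_1000)  # 0.05
```

The approximation gets closer to the true hazard as the window shrinks, provided enough units remain at risk to estimate the fraction.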

Survival vs hazard function

Survival describes the probability that the event has not yet occurred by time t.
Conversely, hazard describes the instantaneous risk that the event occurs at time t, given that it has not occurred before t.

Cumulative hazard: Accumulated risk over time.

Driving a car: every hour you drive, you face some hazard of an accident. H(t) is the total risk you’ve accumulated after t hours of driving. Even if the hourly risk is small, H(t) keeps growing the longer you drive.

Mean residual life

Try interpreting $E [T - t]$ (given $T > t$ ) :

$T - t$ : How much time remains from now ( $t$ ) until the event $(T)$ .
$E [T - t]$ : Mean of time left (survival time)

It’s basically the mean of remaining time left before event occurs

Formulas

$S (t) = 1 - F (t)$

$h (t) = \lim_{Δ t \to 0} \frac{\Pr (t < T \leq t + Δ t \mid T > t)}{Δ t} = \frac{f (t)}{S (t)}$

$H (t) = \int_{0}^{t} h (u) \, d u = - \ln S (t)$

$\text{mrl} (t) = E [T - t \mid T > t] = \frac{\int_{t}^{\infty} S (u) \, d u}{S (t)}$

$μ = E [T] = \int_{0}^{\infty} S (u) \, d u$

Total Time on Test (TTT): Cumulative operating time until last failure — sum of all observed times (both event and censored)

$TTT = \sum_{i = 1}^{n} t_{i}$

where $t_{i}$ is the failure or censoring time for subject $i$ .
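As a minimal sketch of the sum above, assuming observations are stored as hypothetical `(time, event)` pairs with `event = 1` for a failure and `0` for censoring:

```python
def total_time_on_test(times):
    """Sum of all observed times, whether they end in an event or in censoring."""
    return sum(times)

# Illustrative data only: (time, event) pairs
observations = [(2, 1), (3, 0), (5, 1), (6, 0), (8, 1)]
ttt = total_time_on_test(t for t, _ in observations)
print(ttt)  # 24
```

Note that the event indicator is ignored: censored subjects contribute their full observed time.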

Interpretation

TTT represents the total “machine-hours” or “person-years” that subjects contributed to the study. It’s used in:

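TTT Plot: A diagnostic tool for the shape of the hazard. A scaled TTT plot close to the diagonal suggests a constant hazard (exponential), while a concave or convex curve suggests an increasing or decreasing hazard

Exponential model: TTT is the denominator of the maximum likelihood estimate, $\hat{λ} = d / TTT$ , where $d$ is the number of observed events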

Hazard function relationship proof

$h (t) = \lim_{Δ t \to 0} \frac{1}{Δ t} \cdot \frac{\Pr (t < T \leq t + Δ t)}{\Pr (T > t)} = \lim_{Δ t \to 0} \frac{F (t + Δ t) - F (t)}{Δ t \cdot S (t)}$

The limit $lim_{Δ t \to 0} \frac{F ( t + Δ t ) - F ( t )}{Δ t}$ is by definition $F^{'} (t) = f (t)$ :

$h (t) = \frac{f ( t )}{S ( t )}$

MRL proof

$\text{mrl} (t) = E [T - t \mid T > t]$

By the definition of conditional probability, for a continuous random variable the conditional density given $T > t$ is:

$f (u \mid T > t) = \frac{f (u)}{\Pr (T > t)} = \frac{f (u)}{S (t)}, \quad u > t$

Let $T = u$ , then $T - t = u - t$ . So, by definition of Expectation for continuous random variable,

$E [T - t \mid T > t] = \int_{t}^{\infty} \underbrace{(u - t)}_{\text{value of } T - t} \cdot \underbrace{\frac{f (u)}{S (t)}}_{\text{conditional pdf of } T} \, d u = \frac{1}{S (t)} \int_{t}^{\infty} (u - t) f (u) \, d u$

Integration by parts with $v^{'} = f (u)$ (so $v = - S (u)$ ) and $w = u - t$ (so $w^{'} = 1$ ):

$\int_{t}^{\infty} (u - t) f (u) \, d u = [- (u - t) S (u)]_{t}^{\infty} + \int_{t}^{\infty} S (u) \, d u$

The boundary term: at $u = t$ , $(t - t) S (t) = 0$ . As $u \to \infty$ , $(u - t) S (u) \to 0$ (assuming finite mean). So the boundary term vanishes, leaving:

$\int_{t}^{\infty} (u - t) f (u) \, d u = \int_{t}^{\infty} S (u) \, d u$

Therefore:

$\text{mrl} (t) = \frac{\int_{t}^{\infty} S (u) \, d u}{S (t)}$
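The final formula can be checked numerically. The sketch below assumes an exponential survival function with an arbitrary rate of 0.5, for which the mean residual life should equal 1/λ = 2 at every t; the helper `mrl_numeric`, the truncation point `upper`, and the step count are illustrative choices.

```python
import math

def mrl_numeric(S, t, upper=200.0, steps=100_000):
    """Approximate mrl(t) = (integral of S(u) from t to infinity) / S(t).

    The infinite upper limit is truncated at `upper`, which is only
    reasonable when S(u) is negligible beyond that point.
    """
    width = (upper - t) / steps
    # Trapezoidal rule on [t, upper]
    total = 0.5 * (S(t) + S(upper))
    for k in range(1, steps):
        total += S(t + k * width)
    return total * width / S(t)

lam = 0.5  # assumed rate, chosen only for the illustration

def S_exp(u):
    return math.exp(-lam * u)

mrl_values = {t: mrl_numeric(S_exp, t) for t in (0.0, 1.0, 5.0)}
for t, value in mrl_values.items():
    print(t, round(value, 3))
```

Each printed value should be close to 2 regardless of t, which is the exponential "lack of memory" seen through the mean residual life.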

Week 3: Parametric Models

Exponential Distribution

Concept: Simplest model; assumes constant hazard rate. Has “lack of memory” property.

Formulas (for $T \sim Exp (λ)$ , $λ > 0$ ):

$f (t) = λ e^{- λ t}, \quad t > 0$
$S (t) = e^{- λ t}$
$h (t) = λ$
$H (t) = λ t$
$E (T) = \frac{1}{λ}, \quad V (T) = \frac{1}{λ^{2}}$

Notice that $h (t) = λ$ . So $λ$ can be read directly through the definition of the hazard function:

Rate of the event occurring per unit time (t), among those who haven’t experienced the event yet.
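The “lack of memory” property mentioned in the concept line says that $\Pr (T > s + t \mid T > s) = \Pr (T > t)$ . A minimal numeric check, assuming an arbitrary rate and arbitrary values of s and t:

```python
import math

def surv_exp(t, lam):
    """S(t) = exp(-lam * t) for the exponential model."""
    return math.exp(-lam * t)

lam = 0.2          # assumed rate
s, t = 3.0, 4.0    # arbitrary elapsed time and additional time

# Pr(T > s + t | T > s) = S(s + t) / S(s)
conditional = surv_exp(s + t, lam) / surv_exp(s, lam)
unconditional = surv_exp(t, lam)
print(math.isclose(conditional, unconditional))  # True
```

Having already survived s time units does not change the chance of surviving t more, which is exactly what a constant hazard implies.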

Weibull Distribution

Concept: Generalization of exponential with shape parameter $α$ . Can model increasing, decreasing, or constant hazard.

Formulas (for $T \sim Weibull (λ, α)$ ):

$f (t) = α λ t^{α - 1} e^{- λ t^{α}}, \quad t > 0$
$S (t) = \exp (- λ t^{α})$
$h (t) = α λ t^{α - 1}$
$H (t) = λ t^{α}$

$α < 1$ : Decreasing hazard
$α = 1$ : Constant hazard (reduces to exponential)
$α > 1$ : Increasing hazard
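The three shape regimes can be seen by evaluating $h (t) = α λ t^{α - 1}$ on a few time points. The value of λ and the grid of times below are arbitrary choices for illustration.

```python
def weibull_hazard(t, lam, alpha):
    """h(t) = alpha * lam * t**(alpha - 1), valid for t > 0."""
    return alpha * lam * t ** (alpha - 1)

lam = 0.5                      # assumed value of lambda
times = [0.5, 1.0, 2.0, 4.0]   # arbitrary evaluation points

for alpha in (0.5, 1.0, 2.0):
    hazards = [weibull_hazard(t, lam, alpha) for t in times]
    print(alpha, [round(h, 4) for h in hazards])
```

With α = 0.5 the printed hazards fall over time, with α = 1 they stay at λ, and with α = 2 they rise linearly.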

Gamma Distribution

Concept: Another generalization of the exponential, with shape parameter $α$ ; it reduces to the exponential when $α = 1$ .

Formulas: $f (t) = \frac{λ ^{α} t ^{α - 1} e ^{- λ t}}{Γ ( α )}, t > 0$

Cheatsheet

	$S (t)$	$h (t)$	$f (t) = S (t) \cdot h (t)$	$H (t)$	$E (T)$	$V (T)$
Exp	$e^{- λ t}$	$λ$	$λ e^{- λ t}$	$λ t$	$\frac{1}{λ}$	$\frac{1}{λ ^{2}}$
Weibull	$e^{- λ t^{α}}$	$α λ t^{α - 1}$	$α λ t^{α - 1} e^{- λ t^{α}}$	$λ t^{α}$	—	—
Gamma	—	—	$\frac{λ ^{α} t ^{α - 1} e ^{- λ t}}{Γ ( α )}$	—	$\frac{α}{λ}$	$\frac{α}{λ ^{2}}$

Example

Case: Risk of getting sick during flu season. Hazard increases over time as the season peaks.

Use Weibull with $λ = 0.001$ , $α = 2$ (increasing hazard).

$h (t) = α λ t^{α - 1} = 0.002 t$

t (days)	h(t)
10	0.02
30	0.06
60	0.12

Interpretation of h(t): On day 10, survivors (still-healthy people) are getting sick at a rate of 0.02 per day. By day 60, that rate has jumped to 0.12 — the flu season is peaking, risk is much higher.

$H (t) = λ t^{α} = 0.001 t^{2}$

t (days)	H(t)
10	0.1
30	0.9
60	3.6

Interpretation of H(t): By day 30 you’ve accumulated 0.9 units of total sickness risk — nearly 1 full unit. By day 60 it’s 3.6, and survival probability is $e^{- 3.6} \approx 0.027$ , meaning only 2.7% of people have avoided getting sick by day 60.
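The two tables and the survival probability can be reproduced with a few lines, using the parameters of the example. The function names `h`, `H`, and `S` simply mirror the notation of the notes.

```python
import math

lam, alpha = 0.001, 2  # parameters from the flu example

def h(t):
    """Weibull hazard."""
    return alpha * lam * t ** (alpha - 1)

def H(t):
    """Weibull cumulative hazard."""
    return lam * t ** alpha

def S(t):
    """Survival via S(t) = exp(-H(t))."""
    return math.exp(-H(t))

for day in (10, 30, 60):
    print(day, round(h(day), 3), round(H(day), 3), round(S(day), 3))
```

The last column shows how quickly the survival probability falls once the cumulative hazard passes 1, ending near 0.027 on day 60.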

Week 4: Censoring and Truncation

Definitions

Censoring (Penyensoran): When the exact survival time is not fully observed — only partial information is available. The subject has not experienced the event by the end of study, or is lost to follow-up before the event occurs.

Censoring is mostly about incomplete information on subjects that are in the study.

Truncation (Pemancungan): When subjects are only observed if their event time falls within a certain window. Those whose event occurs outside the observation window are not included in the study at all.

Concepts

Types of Censoring:

Right Censoring: Event time is beyond a certain point (study ends, subject drops out)
- Type I: Study ends at fixed time $C$
- Type II: Study ends when $r$ events occur
- Random/Progressive: Subject lost to follow-up
Left Censoring: Event occurred before study started but exact time unknown
Interval Censoring: Event known to occur within an interval $[L, R]$

Type	What we know	What we don’t know
Right censored	Survived until time $C$	If/when event occurred after $C$
Left censored	Event occurred before study	Exactly when
Interval censored	Event occurred in $[L, R]$	Exact time

Types of Truncation:

Left Truncation: Subject enters study after time 0; only observed if $T \geq L$
Right Truncation: Only subjects whose event has already occurred by time $R$ are included; only observed if $T \leq R$

Formulas

Likelihood for right-censored data:

$L = \prod_{i = 1}^{n} [f (t_{i})]^{δ_{i}} [S (t_{i})]^{1 - δ_{i}}$

where $δ_{i}$ is the event indicator:

$δ_{i} = 1$ if event observed at time $t_{i}$
$δ_{i} = 0$ if censored at time $t_{i}$

Nelson-Aalen & Kaplan-Meier estimators:

$\hat{H} (t) = \sum_{t_{i} \leq t} \frac{d_{i}}{Y_{i}}$
$\hat{S} (t) = \prod_{t_{i} \leq t} (1 - \frac{d_{i}}{Y_{i}})$

where:

$d_{i}$ = number of events at time $t_{i}$
$Y_{i}$ = number at risk just before $t_{i}$

Examples

Clinical Trial: 30 patients treated for heart disease, observed for 6 years. Only 10 had strokes during the study. The other 20 are right-censored (type I) — we know they survived at least 6 years but don’t know when (or if) they will have a stroke.
Carcinogen Study: 40 mice injected with carcinogen, observed until 25 show disease symptoms. The remaining 15 mice are right-censored (type II) — they may develop disease later but we stopped before observing it.
Survey: Children asked when they started using gadgets. Some cannot remember the exact time (left-censored), some started during the study (observed), some haven’t started yet (right-censored, random/progressive).

Key point: Censoring and truncation affect the likelihood function and require special statistical methods (non-parametric like Kaplan-Meier, or semi-parametric like Cox proportional hazards) to properly analyze.

Week 5: Advanced Censoring & Truncation

Detailed Censoring (Penyensoran)

Censoring occurs when we have only partial information about the exact survival time.

Types of Right Censoring (Penyensoran Kanan):

Type I (Time Censoring): Study ends at a pre-determined time $τ$ set by the researcher.
- Fixed: All subjects stop at the same time $τ$ (e.g., mice observed for exactly 14 days).
- Progressive: Different fixed censoring times $C_{i}$ are assigned at the start.
- Generalized: Subjects enter the study at different times and are censored if the event hasn’t occurred when the study ends or they leave.
Type II (Failure Censoring): Study ends when a pre-specified number of events $d$ occur among $n$ subjects.
- Simple: Stops exactly at the $d$ -th failure. The total duration of the study is a random variable.
- Progressive: Some survivors are intentionally removed from the study at various intermediate event times.
Competing Risk: Occurs when multiple types of events are possible. The occurrence of one event prevents the observation of others (e.g., in a study on leukemia, death from other causes is a competing risk for disease relapse).

Left Censoring (Penyensoran Kiri): The event occurred before a certain time $t_{l c}$ , but the exact time is unknown (e.g., a child already knows how to use a gadget before the survey starts).

Interval Censoring (Penyensoran Interval): The event is known to have occurred within a specific interval $[C_{l}, C_{r}]$ (e.g., a tumor is detected during a follow-up at 24 months, but was not present at the 18-month check-up).

Double Censoring (Penyensoran Ganda): A dataset that contains both left-censored and right-censored observations (e.g., a gadget usage survey where some children already use them, some start during the study, and some haven’t started by the end).

Truncation (Pemancungan)

Truncation is a selection mechanism by design. Only subjects who satisfy certain conditions regarding their survival time $T$ are included in the sample.

Right Truncation (Pemancungan Kanan): Only subjects who have already experienced the event before a time $T_{R}$ are included.
- Condition: $T < T_{R}$
- Example: Using historical death records to study lifespan. People who are still alive are not in the records and are excluded from the analysis.
Left Truncation (Pemancungan Kiri / Delayed Entry): Only subjects who have not yet experienced the event at time $T_{L}$ are included.
- Condition: $T > T_{L}$
- Subjects must survive until $T_{L}$ to enter the study (Delayed Entry).
- Example:
  - Nursing Home: Studying age of death among residents. Subjects must survive long enough to enter the home. Those who die before entering are never observed.
  - Life Insurance: Policyholders must be alive at the time they sign up for the policy.

Comparison Summary

Feature	Censoring	Truncation
Nature	Missing information about exact time	Selection bias by study design
Awareness	Researcher knows the subject exists but not the exact $T$	Researcher may not even know the excluded subjects exist
Likelihood	Uses $f (t)$ for events, $S (t)$ for censored	Uses conditional probabilities given selection criteria

Likelihood Construction Examples

Right-Censored Data

Scenario: 5 patients in a clinical trial. 3 die at times 2, 5, 8; 2 are censored at times 3, 6.

Step-by-step construction:

For event times ( $δ_{i} = 1$ ): contribute $f (t_{i})$ to likelihood
For censored times ( $δ_{i} = 0$ ): contribute $S (t_{i})$ to likelihood

$L = f (2) \cdot f (5) \cdot f (8) \cdot S (3) \cdot S (6)$

For exponential model with constant hazard $λ$ : $L = λ e^{- λ \cdot 2} \cdot λ e^{- λ \cdot 5} \cdot λ e^{- λ \cdot 8} \cdot e^{- λ \cdot 3} \cdot e^{- λ \cdot 6} = λ^{3} e^{- λ \cdot 24}$

To find MLE: $ln L = 3 ln λ - 24 λ$ , then $\frac{d}{d λ} ln L = \frac{3}{λ} - 24 = 0 \Rightarrow \hat{λ} = \frac{3}{24} = 0.125$
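The same estimate can be checked in code. The sketch below builds the log-likelihood from the $f (t_{i})$ and $S (t_{i})$ contributions for the five patients, computes the closed-form estimate, and confirms it with a crude grid search; the grid resolution is an arbitrary choice.

```python
import math

# (time, delta) pairs from the scenario: delta = 1 death, 0 censored
data = [(2, 1), (5, 1), (8, 1), (3, 0), (6, 0)]

def log_lik(lam):
    """ln L = sum of delta * ln f(t) + (1 - delta) * ln S(t), exponential model."""
    total = 0.0
    for t, delta in data:
        log_f = math.log(lam) - lam * t   # ln f(t)
        log_S = -lam * t                  # ln S(t)
        total += delta * log_f + (1 - delta) * log_S
    return total

events = sum(delta for _, delta in data)
exposure = sum(t for t, _ in data)        # total time on test
lam_hat = events / exposure               # closed form: 3 / 24
print(lam_hat)  # 0.125

# Crude grid check that the closed form maximises the log-likelihood
grid = [k / 1000 for k in range(1, 1001)]
lam_grid = max(grid, key=log_lik)
print(lam_grid)  # 0.125
```

The estimate is the number of events divided by the total observed time, so censored patients lower the estimated rate without adding to the event count.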

Left-Truncated Data

Scenario: Nursing home study. Subject enters at age 70, dies at age 85.

For left-truncated data with entry time $L_{i}$ :

Must survive until $L_{i}$ to be observed
Contribution: $\frac{f ( t _{i} )}{S ( L _{i} )}$ (conditional on surviving until entry)

Combined Left-Truncated & Right-Censored

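For a subject who enters at $L_{i}$ and has an event or censoring at $t_{i}$ , the likelihood contribution (written $\mathcal{L}_{i}$ to avoid a clash with the entry time $L_{i}$ ) is: $\mathcal{L}_{i} = [\frac{f (t_{i})}{S (L_{i})}]^{δ_{i}} [\frac{S (t_{i})}{S (L_{i})}]^{1 - δ_{i}}$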
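A minimal sketch of this contribution under an exponential model follows. The exponential assumption is purely illustrative (a constant hazard is not realistic for ages at death), and apart from the first row, which is the nursing home subject above, the data are made up. For the exponential, each subject's log-contribution reduces to $δ_{i} \ln λ - λ (t_{i} - L_{i})$ , so the estimate is the number of events divided by the time actually spent under observation.

```python
import math

# (entry age L_i, exit age t_i, delta_i); rows after the first are hypothetical
subjects = [(70, 85, 1), (65, 80, 0), (72, 90, 1), (68, 75, 1)]

def log_lik(lam):
    """Sum of delta*ln f(t) + (1 - delta)*ln S(t) - ln S(L), exponential model."""
    total = 0.0
    for entry, exit_age, delta in subjects:
        log_f = math.log(lam) - lam * exit_age
        log_S_exit = -lam * exit_age
        log_S_entry = -lam * entry
        total += delta * log_f + (1 - delta) * log_S_exit - log_S_entry
    return total

events = sum(d for _, _, d in subjects)
exposure = sum(t - L for L, t, _ in subjects)   # time at risk after entry
lam_hat = events / exposure
print(events, exposure, lam_hat)
```

Only the time after entry counts as exposure; ignoring the delayed entry would add the years before entry to the denominator and understate the hazard.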

Week 6: Non-Parametric Estimation (Penaksiran Non-Parametrik)

Estimating the Survival Function $S (t)$ (Kaplan-Meier Approach)

The Kaplan-Meier estimator builds the survival curve step-by-step, calculating the conditional probability of surviving past each observed event time.

Timeline and Definitions:

$n = Y_{0}$ : Total number of subjects at the start (time $t_{0}$ ).
$Y_{i}$ : Number of individuals at risk just before time $t_{i}$ .
$d_{i}$ : Number of events (failures/deaths) that occur at time $t_{i}$ .
$c_{i}$ : Number of censored individuals between $t_{i}$ and $t_{i + 1}$ .

Step-by-step Intuition:

At time $t_{0}$ : Everyone is alive. $S (t_{0}) = Pr (T > t_{0}) = 1$
At time $t_{1}$ : There are $Y_{1}$ people at risk. $d_{1}$ people experience the event. The probability of surviving past $t_{1}$ given survival up to $t_{1}$ is $1 - \frac{d _{1}}{Y _{1}}$ . $S (t_{1}) = S (t_{0}) \times (1 - \frac{d _{1}}{Y _{1}}) = 1 \times (1 - \frac{d _{1}}{Y _{1}})$
At time $t_{2}$ : There are $Y_{2}$ people at risk. Notice that $Y_{2}$ drops not just because of the deaths $d_{1}$ , but also because of any censored observations $c_{1}$ that occurred between $t_{1}$ and $t_{2}$ . So, $Y_{2} = Y_{1} - d_{1} - c_{1}$ (which equals $n - d_{1} - c_{1}$ when nobody is censored before $t_{1}$ ). $d_{2}$ people experience the event. The conditional survival probability is $1 - \frac{d _{2}}{Y _{2}}$ . $S (t_{2}) = S (t_{1}) \times (1 - \frac{d _{2}}{Y _{2}})$

General Kaplan-Meier Formula: $\hat{S} (t) = \prod_{t_{i} \leq t} (1 - \frac{d _{i}}{Y _{i}})$
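The product formula can be sketched as a short function. This is a plain-Python illustration rather than a reference implementation; the function name, the return layout, and the sample data are assumptions. Subjects censored at the same time as an event are treated as still at risk at that time, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Return [(t_i, Y_i, d_i, S_hat(t_i))] at each distinct event time.

    times  : observed times (event or censoring)
    events : 1 if the event was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    n = len(data)
    surv = 1.0
    rows = []
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i                 # Y_i: subjects with observed time >= t
        d = 0
        j = i
        while j < n and data[j][0] == t:
            d += data[j][1]             # count events tied at time t
            j += 1
        if d > 0:
            surv *= 1 - d / at_risk
            rows.append((t, at_risk, d, surv))
        i = j
    return rows

# Illustrative data: events at 2, 5, 8 and censoring at 3, 6
rows = kaplan_meier([2, 3, 5, 6, 8], [1, 0, 1, 0, 1])
for t, y, d, s in rows:
    print(t, y, d, round(s, 4))
```

The curve only steps down at event times, while censored subjects shrink the risk set used at later steps.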

Estimating the Cumulative Hazard $H (t)$ (Nelson-Aalen Approach)

The Nelson-Aalen estimator calculates the cumulative hazard by adding up the instantaneous hazard rates at each event time.

Step-by-step Intuition:

At time $t_{1}$ : Out of $Y_{1}$ people at risk, $d_{1}$ fail. The hazard rate is $h (t_{1}) = \frac{d _{1}}{Y _{1}}$ .
At time $t_{2}$ : Out of $Y_{2}$ people at risk, $d_{2}$ fail. The hazard rate is $h (t_{2}) = \frac{d _{2}}{Y _{2}}$ .

Cumulative Hazard up to $t_{2}$ : Simply sum the individual hazards: $H (t_{2}) = h (t_{1}) + h (t_{2}) = \frac{d _{1}}{Y _{1}} + \frac{d _{2}}{Y _{2}}$

General Nelson-Aalen Formula: $\hat{H} (t) = \sum_{t_{i} \leq t} \frac{d _{i}}{Y _{i}}$
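The sum can be sketched the same way. The function name and the sample data are illustrative, and the risk set is recomputed by a simple scan for clarity rather than speed.

```python
def nelson_aalen(times, events):
    """Return [(t_i, H_hat(t_i))] at each distinct event time."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    cum_hazard = 0.0
    rows = []
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)                        # Y_i
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)   # d_i
        cum_hazard += d / at_risk
        rows.append((t, cum_hazard))
    return rows

# Illustrative data: events at 2, 5, 8 and censoring at 3, 6
na_rows = nelson_aalen([2, 3, 5, 6, 8], [1, 0, 1, 0, 1])
for t, H in na_rows:
    print(t, round(H, 4))
```

Unlike a survival probability, the cumulative hazard is not bounded by 1; here it passes 1 at the last event time because the final subject at risk fails.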

Variance Estimation: Greenwood’s Formula

To estimate uncertainty in $\hat{S} (t)$ , use Greenwood’s formula:

$Var [\hat{S} (t)] = [\hat{S} (t)]^{2} \sum_{t_{i} \leq t} \frac{d _{i}}{Y _{i} ( Y _{i} - d _{i} )}$

Intuition: Kaplan-Meier is a product of conditional probabilities. Using the delta method, the variance of $ln \hat{S} (t)$ is the sum of variances from each term.

Simplified form when $d_{i} = 1$ (no ties): $Var [\hat{S} (t)] = [\hat{S} (t)]^{2} \sum_{t_{i} \leq t} \frac{1}{Y _{i} ( Y _{i} - 1 )}$

For Nelson-Aalen: $Var [\hat{H} (t)] = \sum_{t_{i} \leq t} \frac{d _{i}}{Y _{i}^{2}}$

Confidence Intervals

Pointwise CI for $S (t)$

Linear scale (can go below 0 or above 1, so not recommended): $\hat{S} (t) \pm z_{α /2} \sqrt{Var [\hat{S} (t)]}$

Log-log transform (recommended, since it keeps the interval within $[0, 1]$ ): $[\hat{S} (t)^{\exp (z_{α /2} \hat{σ} (t))}, \hat{S} (t)^{\exp (- z_{α /2} \hat{σ} (t))}]$

where $\hat{σ}^{2} (t) = \frac{Var [\hat{S} (t)]}{[\hat{S} (t) \ln \hat{S} (t)]^{2}}$ . Because $0 < \hat{S} (t) < 1$ , the larger exponent gives the lower limit.

For 95% CI: $z_{0.025} = 1.96$
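Greenwood's variance and the log-log interval can be combined in one sketch. The function names and the six-subject data set are hypothetical, and the code stops once the risk set is exhausted, because the Greenwood term and the log-log transform are undefined when the estimate reaches 0.

```python
import math

def km_with_greenwood(times, events):
    """Return [(t_i, S_hat, Var_hat)] using Greenwood's formula."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv, gw_sum = 1.0, 0.0
    rows = []
    for t in event_times:
        y = sum(1 for u in times if u >= t)                              # Y_i
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)   # d_i
        if y == d:
            break  # S_hat drops to 0; variance term is undefined
        surv *= 1 - d / y
        gw_sum += d / (y * (y - d))
        rows.append((t, surv, surv ** 2 * gw_sum))
    return rows

def loglog_ci(surv, var, z=1.96):
    """Pointwise log-log interval; requires 0 < surv < 1."""
    sigma = math.sqrt(var) / abs(surv * math.log(surv))
    return surv ** math.exp(z * sigma), surv ** math.exp(-z * sigma)

# Hypothetical data: events at 2, 5, 8 and censoring at 3, 6, 9
times = [2, 3, 5, 6, 8, 9]
events = [1, 0, 1, 0, 1, 0]
rows = km_with_greenwood(times, events)
for t, s, v in rows:
    lower, upper = loglog_ci(s, v)
    print(t, round(s, 4), round(lower, 4), round(upper, 4))
```

With so few subjects the intervals are wide, but both limits stay strictly between 0 and 1, which the linear-scale interval does not guarantee.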

Summary: Non-Parametric Estimators

Estimator	Formula	Variance
Kaplan-Meier $\hat{S} (t)$	$\prod_{t_{i} \leq t} (1 - \frac{d _{i}}{Y _{i}})$	Greenwood
Nelson-Aalen $\hat{H} (t)$	$\sum_{t_{i} \leq t} \frac{d _{i}}{Y _{i}}$	$\sum \frac{d _{i}}{Y _{i}^{2}}$

Key relationship: $\hat{S}_{K M} (t) \approx exp (- \hat{H}_{N A} (t))$
