Introduction to censored data
Outcome variable: time to event
 Generally time to the occurrence of a particular event, e.g.
 death
 disease recurrence
 or other experience of interest
 Time: the time from the beginning of an observation period \(t_0\)
(e.g. surgery) to:
 an event, or
 end of the study, or
 loss of contact or withdrawal from the study
Typical research questions
 What is the median survival time (in years) of patients diagnosed
with a certain disease?
 What is the probability of those patients surviving for at least 5
years?
 Are certain personal, behavioral, or clinical characteristics
associated with a participant’s chance of survival?
 Is there a survival difference between groups?
 e.g. treatment vs. control
 e.g. exposed vs. unexposed
Special considerations in survival analysis
 Survival data requires special techniques:
 Survival data is generally not normally distributed

Censoring: individuals are observed for differing
lengths of time, which may or may not end in an “event”
 Censoring is a key challenge in survival analysis. Consider a
clinical study where:
 patient 1 dies 1 month after diagnosis
 patient 2 dies 12 years after diagnosis
 patient 3 is lost to follow-up after 1 month
 patient 4 is still alive after 12 years of follow-up
Question #1: which patients are “censored?”
Question #2: how would you rank these patients in order of
disease severity?
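One common way to record such data is a follow-up time paired with an event indicator. The sketch below is only an illustration: the dictionary layout and variable names are assumptions introduced here, and times are converted to years for convenience.

```python
# Each patient is recorded as (follow-up time in years, event observed?).
# True means the death was observed; False means the observation is censored.
patients = {
    "patient 1": (1 / 12, True),   # dies 1 month after diagnosis
    "patient 2": (12.0, True),     # dies 12 years after diagnosis
    "patient 3": (1 / 12, False),  # lost to follow-up after 1 month
    "patient 4": (12.0, False),    # still alive after 12 years of follow-up
}

# Censored patients are those whose event was never observed.
censored = [name for name, (_, event) in patients.items() if not event]
print(censored)
```

Note that patients 1 and 3 share the same follow-up time, as do patients 2 and 4. Only the event indicator tells them apart, which is why the time alone cannot be analyzed as an ordinary outcome.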
Left / right / interval censoring

right censoring: The event (if it occurs) happens past the
end of the observation period

left censoring: We observe the presence of a state or
condition but do not know when it began.
 Example: a study investigating the time to recurrence of a cancer
following surgical removal of the primary tumor. If the patients were
examined 3 months after surgery to determine recurrence, then those who
had a recurrence would have a survival time that was left censored
because the actual time of recurrence occurred less than 3 months after
surgery.

interval censoring: individuals come in and out of
observation, so the event is known only to have occurred between two observation times.
Source: https://data.princeton.edu/wws509/notes/c7.pdf

type 1 censoring: The total duration of the study is fixed
 a generalization is fixed censoring: each individual has a
potentially different maximum observation time, but still fixed in
advance

type 2 censoring: The sample is followed as long as
necessary until a prespecified number of events have occurred
 the length of the study is unknown in advance

random censoring: the censoring times are independent
random variables
These are all analyzed in essentially the same way.
Source: https://data.princeton.edu/wws509/notes/c7.pdf

Uninformative censoring: The most basic assumption we will
make is that the censoring of an observation does not provide any
information about survival other than that it exceeds the time of the
censoring
 Can be violated if, for example, higher risk of death causes study
dropout
 Similar to when we assume data missing at random or completely at
random
Survival function S(t)
 The survival function at time \(t\), denoted \(S(t)\), is the probability of being
event-free at \(t\).
 Equivalently, it is the probability that the survival time is
greater than \(t\).
Leukemia example: see leuk.csv
 Study of 6-mercaptopurine (6-MP) maintenance therapy for children in
remission from acute lymphoblastic leukemia (ALL)
 42 patients achieved remission from induction therapy and were then
randomized in equal numbers to 6-MP or placebo.
 Survival time studied was from randomization until relapse.
Survival times in weeks for Placebo group:
## [1] 1 1 2 2 3 4 4 5 5 8 8 8 8 11 11 12 12 15 17 22 23
Survival times in weeks for Treatment group (+ marks a censored time):
## [1] 6 6 6 7 10 13 16 22 23 6+ 9+ 10+ 11+ 17+ 19+ 20+ 25+ 32+ 32+
## [20] 34+ 35+
A graphical look at the treatment group
(Initiation times \(t_0\) are simulated between 0 and 26 weeks)
Leukemia study follow-up table
This is the Kaplan-Meier estimate \(\hat S(t)\) of the survival function \(S(t)\).
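The Kaplan-Meier estimate can be written in product-limit form. The notation below is an assumption introduced here for illustration, not taken from the slides: \(t_i\) are the distinct observed event times, \(d_i\) is the number of events at \(t_i\), and \(n_i\) is the number of individuals still at risk just before \(t_i\).

\[
\hat S(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)
\]

As a worked step using the treatment group times listed earlier: at week 6 there are 21 patients at risk and 3 relapses, so \(\hat S(6) = 1 - 3/21 \approx 0.857\). By week 7, the 3 relapses and the 1 censored patient from week 6 have left the risk set, leaving 17 at risk with 1 relapse, so \(\hat S(7) \approx 0.857 \times (1 - 1/17) \approx 0.807\).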
Survival function and Kaplan-Meier estimator
Kaplan-Meier Estimate
Definition: Median Survival Time is the time at
which half of a group (sample, population) is expected to have experienced the
event (in the leukemia example, relapse)
 Without censoring, median survival time can be calculated the
obvious way
 With censoring, we need to use the Kaplan-Meier estimate of the
survival function \(\hat S(t)\)
## Call: survfit(formula = Surv(time, cens) ~ group, data = leuk)
##
##                n events median 0.95LCL 0.95UCL
## group=6 MP    21      9     23      16      NA
## group=Placebo 21     21      8       4      12
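To make the median calculation concrete, here is a minimal Python sketch that rebuilds the Kaplan-Meier curve by hand from the survival times listed earlier and reads off the median. It is not the `survfit` implementation. The function and variable names are assumptions introduced for illustration, and the sketch takes the median to be the first event time at which the estimated curve falls to 0.5 or below, without any special handling for a curve that equals 0.5 exactly.

```python
# Leukemia data as listed earlier; event = 1 is a relapse, 0 is a censored time ("+").
placebo_times = [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23]
placebo_events = [1] * 21
mp_times = [6, 6, 6, 7, 10, 13, 16, 22, 23, 6, 9, 10, 11, 17, 19, 20, 25, 32, 32, 34, 35]
mp_events = [1] * 9 + [0] * 12


def kaplan_meier(times, events):
    """Return [(t, S_hat(t))] evaluated at each distinct event time t."""
    surv = 1.0
    curve = []
    for t in sorted({u for u, e in zip(times, events) if e}):
        # Everyone with an observed time >= t is still at risk at t.
        at_risk = sum(1 for u in times if u >= t)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - d / at_risk
        curve.append((t, surv))
    return curve


def median_survival(times, events):
    """First event time at which S_hat is 0.5 or below; None if never reached."""
    for t, surv in kaplan_meier(times, events):
        if surv <= 0.5:
            return t
    return None


print(median_survival(mp_times, mp_events))
print(median_survival(placebo_times, placebo_events))
```

With these data the sketch should agree with the medians in the output above (23 weeks for 6-MP, 8 weeks for placebo). A plain median of the 6-MP times would be smaller because it treats censored times as if they were relapses.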
Definition: Median Potential Follow-Up Time is the
time for which half of a sample would have been expected to be followed
in the absence of events.
 Without any events, median follow-up time can be calculated the
obvious way
 With events, a simple median will underestimate the
potential follow-up time. Use a reverse Kaplan-Meier estimate
instead:
## Call: survfit(formula = Surv(time, 1 - cens) ~ group, data = leuk)
##
##                n events median 0.95LCL 0.95UCL
## group=6 MP    21     12     25      17      NA
## group=Placebo 21      0     NA      NA      NA
Note: Actual median follow-up time is half as long
for the placebo group, but there is no reason to believe the potential
follow-up times were different
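The reverse Kaplan-Meier idea can be sketched in a few lines of Python: flip the event indicator so that censoring becomes the "event" and relapse becomes "censoring", then compute the Kaplan-Meier median as usual. This is an illustrative sketch rather than the `survfit` implementation, and the function names are assumptions. At tied times it keeps relapsed patients in the risk set while the censoring at that time is counted.

```python
# Leukemia data as listed earlier; relapse = 1 is an observed relapse, 0 is censored ("+").
placebo_times = [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23]
placebo_relapse = [1] * 21
mp_times = [6, 6, 6, 7, 10, 13, 16, 22, 23, 6, 9, 10, 11, 17, 19, 20, 25, 32, 32, 34, 35]
mp_relapse = [1] * 9 + [0] * 12


def km_median(times, events):
    """Kaplan-Meier median: first event time where the curve is 0.5 or below."""
    surv = 1.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - d / at_risk
        if surv <= 0.5:
            return t
    return None  # the curve never reaches 0.5 (reported as NA in the R output)


def reverse_km_median(times, events):
    """Median potential follow-up: treat censoring as the event of interest."""
    flipped = [1 - e for e in events]
    return km_median(times, flipped)


print(reverse_km_median(mp_times, mp_relapse))
print(reverse_km_median(placebo_times, placebo_relapse))
```

For the placebo group every patient relapsed, so there are no censoring "events" and the reverse curve never drops; the sketch returns `None`, matching the `NA` in the output above.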
Cumulative Event Function
Definition: The cumulative event function at time
\(t\), denoted \(F(t)\), is the probability
that the event has occurred by time \(t\), or equivalently, the probability
that the survival time is less than or equal to \(t\). Note \(F(t) = 1 - S(t)\).
Hazard and Cumulative Hazard functions

\(h(t)\): hazard function, risk of
event at a point in time
 only calculated by software

\(H(t) = -\log[S(t)]\): cumulative
hazard function
 not easily interpretable
 cumulative force of mortality, or the number of events that would be
expected for each individual by time t if the event were a repeatable
process.
 Will be important next class for Cox Proportional Hazards
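The link between the three functions can be written out explicitly. Assuming a continuous survival time (an assumption made here for the sake of the formula), the cumulative hazard accumulates the hazard over time, and the survival function follows from it:

\[
H(t) = \int_0^t h(u)\,du = -\log S(t)
\quad\Longleftrightarrow\quad
S(t) = \exp[-H(t)]
\]

For example, at the median survival time \(S(t) = 0.5\), so \(H(t) = \log 2 \approx 0.69\): if the event were repeatable, about 0.69 events per individual would be expected by that time.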
Comparing Groups Using the Log-rank Test
Log-rank test

The log-rank test is used to compare survival between two or
more groups

\(H_0\) is that the population
survival functions are equal at all follow-up times

\(H_1\) is that the population
survival functions differ at at least one follow-up time
 The log-rank test is really just a chi-square test comparing
expected vs. observed number of events in each group.
 Observed is just what we see.
 How to calculate expected?
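One way to see where the expected counts come from is to compute them by hand. At each distinct event time, the events observed at that time are allocated to a group in proportion to its share of the patients still at risk, and these allocations are summed over all event times. The Python sketch below does this for the two-group leukemia data and also accumulates the hypergeometric variance used in the test statistic. It is an illustration under these assumptions, not the `survdiff` implementation, and the function and variable names are hypothetical.

```python
# Leukemia data as listed earlier; event = 1 is a relapse, 0 is a censored time ("+").
placebo_times = [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23]
mp_times = [6, 6, 6, 7, 10, 13, 16, 22, 23, 6, 9, 10, 11, 17, 19, 20, 25, 32, 32, 34, 35]
times = mp_times + placebo_times
events = [1] * 9 + [0] * 12 + [1] * 21
groups = ["6-MP"] * 21 + ["Placebo"] * 21


def logrank(times, events, groups, ref):
    """Observed and expected events in group `ref`, plus the chi-square statistic."""
    observed = expected = variance = 0.0
    rows = list(zip(times, events, groups))
    for t in sorted({u for u, e, _ in rows if e}):
        n = sum(1 for u, _, _ in rows if u >= t)               # at risk, both groups
        n1 = sum(1 for u, _, g in rows if u >= t and g == ref)  # at risk in ref group
        d = sum(1 for u, e, _ in rows if u == t and e)          # events at t, both groups
        d1 = sum(1 for u, e, g in rows if u == t and e and g == ref)
        observed += d1
        expected += d * n1 / n  # ref group's share of the events under H0
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chisq = (observed - expected) ** 2 / variance
    return observed, expected, chisq


obs, exp, chisq = logrank(times, events, groups, ref="6-MP")
print(obs, round(exp, 2), round(chisq, 1))
```

For two groups the statistic is referred to a chi-square distribution with 1 degree of freedom. The expected count for the other group is the total number of events minus the expected count computed here.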
Log-rank test (cont’d)
The log-rank test is just a chi-square test on the observed and expected
number of events:
## Call:
## survdiff(formula = Surv(time, cens) ~ group, data = leuk)
##
##                N Observed Expected (O-E)^2/E (O-E)^2/V
## group=6 MP    21        9     19.3      5.46      16.8
## group=Placebo 21       21     10.7      9.77      16.8
##
## Chisq= 16.8 on 1 degrees of freedom, p= 4e-05
 Many alternatives are available, but log-rank should be the default
unless you have good reason.
 E.g. Wilcoxon (Breslow), Tarone-Ware, Peto tests
Notes about the Log-rank Test
 Non-parametric: no assumptions on the form of \(S(t)\)
 The log-rank test and KM curves don’t work with continuous
predictors
 Assumes non-informative censoring:
 censoring is unrelated to the likelihood of developing the event of
interest
 for each subject, the censoring time is statistically
independent of the failure time
Summary
 Censoring requires special methods to make full use of the data
 The Kaplan-Meier estimate provides a non-parametric estimate of the
survival function
 non-parametric meaning that no form of the survival function is
assumed; instead it is estimated empirically
 The log-rank test provides a non-parametric hypothesis test
 \(H_0\): identical survival functions across strata