lab_1_estimation_solutions
lab_1_estimation_solutions
lab_1_estimation_solutions
Dave Harrington
May 2018
The lab files come in two versions: a PDF with the problem statements and an Rmd file that
produces the PDF. In most cases, you can work in the Rmd file and enter in your solutions. For
the purely algebraic questions, you may either use LaTeX commands to enter your solutions in the
Rmd file or write out the answers on paper.
The solution files (a PDF with the solutions, and the Rmd file to produce them) are contained in a
separate folder under each lab. Your learning experience with the labs will be more effective if you
do not look at the solutions until after you finish working on the questions.
f (t) = λeλt , t ≥ 0.
a) What are the median, mean, standard deviation, and 20th and 80th percentiles for T , expressed
as a function of λ?
b) Show that the cumulative hazard function for an exponential distribution with rate λ is given
by the straight line y = λt, t ≥ 0.
c) Verify the memoryless property of exponential random variables,
Problem 1 Solution.
a) To find the median (i.e., value at the 50th percentile), use the definition of the survival function
and solve for tm where the survival function equals 0.50.
− log(0.5) log(2)
tm = =
λ λ
To find the mean, µ
Z ∞
µ= t exp(λt)dt
0
∞
Z ∞
= t exp(−λt) − exp(−λt)dt integration by parts
0 0
= (0 − 0) + 1/λ
= 1/λ
1
More generally,
J
µ= for discrete T with J possible outcomes.
X
aj fj
j=1
Z ∞ Z ∞Z u Z ∞Z ∞
µ= uf (u)du = dvf (u)du = f (u)dudv
0 0 0 0 v
Z ∞
= S(v)dv for continuous T
0
To find the standard deviation, first calculate the variance: Var(T ) = E(T 2 ) − [E(T )]2
Z ∞
E(T 2 ) = t2 exp(−λt)dt
0
∞
Z ∞
= −t2 exp(−λt) + 2t exp(−λt)dt integration by parts
0 0
2
∞ Z ∞
−2
= (0 − 0) + t exp(−λt) + exp(−λt)dt integration by parts, again
λ λ 0
0∞
2 −1
= (0 − 0) exp(−λt)dt
λ λ 0
2
= 2
λ
1
.[E(T )] = 2
2
λ
2 1 1
Var(T ) = 2 − 2 = 2
λ λ λ
1 1
r
SD(T ) = =
λ2 λ
The 20th and 80th percentiles are found using the same approach to find the median, since the
median is just the 50th percentile. For any pth percentile tp ,
− log((1 − p)/100)
tp = ,
λ
since the survival function gives right tail probabilities.
− log(0.80) log(5) − log(0.20)
Thus, t20 = λ = λ and t80 = λ .
b) The survival function for the exponential distribution is
Z ∞
S(t) = exp(−λu)du = exp(−λt).
t
The cumulative hazard function is Λ(t) = − log(S(t)) = λt. This implies Rthat the hazard
function λ(t) for an exponential distribution is the constant λ, since Λ(t) = 0t λ(u)du.
2
equivalent to P (T > a + b). Rewrite probabilities in terms of S(t) = exp(−λt), then simplify.
P (T > a + b, T > a)
P (T > a + b|T > a) =
P (T > a)
P (T > a + b)
=
P (T > a)
exp((−λ)(a + b))
=
exp(−λa)
exp(−λa) exp(−λb)
=
exp(−λa)
= exp(−λb)
= P (T > b)
3
Problem 2: The Weibull distribution
Suppose T has a Weibull distribution with survival function
γ
S(t) = e−λt .
a) What are the density, hazard, and cumulative hazard functions for T , expressed as a function
of the parameters λ and γ?
b) To plot a function in R, first define the function then call the plot() command. Additional
functions can then be added to the plot using the lines command. The lines command
must be the next command after plot(). For example, the following R chunk produces a plot
of two functions of x, 3x and x, for values of x from 1 to 100. The function 3x, named eq, is
plotted with a solid line; the function x, named eq2, is plotted with a dashed line.
eq = function(x){3*x}
eq2 = function(x){x}
plot(eq(1:100), type='l', lty = 1)
lines(eq2(1:100), type='l', lty = 2)
250
eq(1:100)
150
50
0
0 20 40 60 80 100
Index
Make three separate plots, each showing three functions. On the first plot, plot the hazard functions
for a Weibull distribution with scale parameter λ = 0.5 and shape parameters γ = 0.5, 1, 1.4; use
the range 0 ≤ t ≤ 5. On the second plot, plot the survival functions for the same combination of
parameters. On the third plot, plot the density functions for the same combinations of parameters.
The code chunk below shows how to create the plot with the hazard functions. Add additional
chunks to plot the survival functions and the density functions.
Describe one or two interesting features you observe from the plots of the hazard and survival
functions.
#define shape and scale parameters
scale = 0.5
shape.1 = 0.5
shape.2 = 1.0
4
shape.3 = 1.4
#add legend
legend(x = "topright", lty = 1:3,
legend = c("gamma = 0.5", "gamma = 1.0", "gamma = 1.4"))
Problem 2 Solution.
a)
γ
S(t) = e−λt
−d γ
f (t) = S(t) = γλtγ−1 e−λt
dt
f (t)
λ(t) = = γλtγ−1
S(t)
5
b) The shape of the hazard functions with different γ are very different, while the differences
between the survival functions are noticeable, but far less dramatic. This is the reason that
hazard functions are typically used when modeling survival data.
The plots visually demonstrate the relationship between the hazard and survival functions.
At times when the hazard is very high, the survival function drops rapidly. For example,
for γ = 0.5, the initial very large values of the hazard function causes the corresponding
survival function to decrease more quickly than the survival functions for other values of γ;
this decrease levels off as the values of the hazard become smaller. In contrast, the hazard
starts out low for γ = 1.4 and increases, which causes the survival function to have the slowest
decrease at small values of t and the largest decrease at larger values of t.
#define shape and scale parameters
scale = 0.5
shape.1 = 0.5
shape.2 = 1.0
shape.3 = 1.4
#add legend
legend(x = "topright", lty = 1:3,
legend = c("gamma = 0.5", "gamma = 1.0", "gamma = 1.4"))
6
Weibull hazard functions, scale = 0.5
gamma = 0.5
1.0
gamma = 1.0
gamma = 1.4
0.8
hazard
0.6
0.4
0.2
0 20 40 60 80 100
time
#define shape and scale parameters
scale = 0.5
shape.1 = 0.5
shape.2 = 1.0
shape.3 = 1.4
#add legend
legend(x = "topright", lty = 1:3,
legend = c("gamma = 0.5", "gamma = 1.0", "gamma = 1.4"))
7
Weibull survival functions, scale = 0.5
gamma = 0.5
gamma = 1.0
0.9
gamma = 1.4
survival
0.7
0.5
0.3
0 20 40 60 80 100
time
#define shape and scale parameters
scale = 0.5
shape.1 = 0.5
shape.2 = 1.0
shape.3 = 1.4
#add legend
legend(x = "topright", lty = 1:3,
legend = c("gamma = 0.5", "gamma = 1.0", "gamma = 1.4"))
8
Weibull density functions, scale = 0.5
1.0
gamma = 0.5
gamma = 1.0
0.8
gamma = 1.4
0.6
density
0.4
0.2
0.0
0 20 40 60 80 100
time
9
Problem 3: Simulating a clinical trial
This exercise begins with the construction of a simulated dataset from a clinical trial with potentially
censored failure time observations. The simulation builds the clinical trial dataset using the concept
of staggered entry and finite follow-up shown in the graphic in the slide labeled ‘Structure of event
time data’ in Unit 1.
The result of the simulation will be dataset of 300 participants, a follow-up time X, and a failure
indicator δ.
The simulated trial has the following characteristics:
• Participants will enter at a constant rate (i.e., uniform entry) beginning on January 1, 2019;
enrollment will end 8 years later on December 31, 2022. Assume that enrollment time is
measured in months. The dates are not as important as the length of the enrollment period.
Assume that the entry times will be uniformly distributed over the enrollment period.
• Data will be locked and analyzed when the last patient enrolled has been followed for 1 year.
• Assume that the trial will enroll a total of 300 participants.
• Assume this is a simple trial with only one treatment, and that patients in the trial will have
a median survival of 18 months.
• Assume that the only form of censoring will be administrative censoring; censoring will happen
only for participants whose death has not been observed by the end of the data collection
period (i.e., at the time the data are locked).
The most efficient way to simulate the dataset is to, for each case, sample a uniformly distributed
entry time and failure time, then use these to calculate the other items needed.
The simulation uses the following R functions. See the R help files for further information about
syntax for the function arguments.
• runif: simulates uniform random variables
• rbinom: simulates binomial random variables
• rexp: simulates exponential random variables
• pmin(a,b): finds the component pairwise minimum of two vectors a and b
When doing a simulation in R, it advisable to set the seed for the random number generator so that
the results are reproducible. The code chunk below sets the seed with 14052018, the date for the
first day of this course.
a) Run the simulation. Use the techniques discussed in the unit on estimation to check that
the simulation behaves as expected. Since the parameters of the survival distribution in the
simulation are known, there are several things that could be inspected, such as summary
statistics for the failure times, summary statistics for the failure time distribution estimated
from the censored data, as well as graphical comparisons of the underlying distributions and
those estimated from the censored data.
b) Show that the failure time distribution estimated from the censored data and the true
distribution are different than the distribution of X, the follow-up times. It is best to do this
with survival curves. When plotting a survival curve where there is no censoring, the event
10
variable can simply be omitted. Why are the medians not useful in this case for showing that
the distribution of obs.time differs from the other two distributions?
c) Show that the failure time distribution estimated from the censored data and the true
distribution are different than the distribution of the follow-up times from only uncensored
cases, i.e., the cases where δ = 1. (Hint: The subset() command may be useful.)
Problem 3 Solution.
a) The median of the simulated failure times is 16.8, which is close to the true median of 18.
The true hazard λ can be calculated from the median: λ = log(2)
18 = 0.039. The true mean and
standard deviation have the common value 1/λ = 1/0.039 = 25.64. This is close to the mean
and standard deviation of the simulated failure times: 27.7 and 29.8.
The estimated median for the survival distribution based on the censored data is also 16.8,
and the 95% confidence interval for the median contains the true value 18. The Kaplan-Meier
estimator and the true exponential survival curves are close. They differ in the right tail
because censoring often causes the right tail of the KM estimator to have high variance.
b) Because most of the censoring in the simulation happens after the median, the medians
alone are not sufficient to show that the distribution of obs.time is different from the true
distribution. The plots show that that the curves are initially very similar, and the difference
is mainly in the right tail.
c) The distribution of failure times estimated from only uncensored cases is clearly different from
the failure time distribution estimated from the censored data and the true distribution. The
difference is most obvious when comparing the survival curves, but the summary statistics
also indicate that the distributions are different; for example, the median and third quartile
estimated from only uncensored cases are lower than the median and third quartile estimated
from the censored data, 13.4 versus 16.8 and 27.5 versus 38.0.
###SIMULATION###
#initialize parameters
total.enrollment = 300
enrollment.period = 96
followup.period = 12
max.obs.time = enrollment.period + followup.period
median.failure.time = 18
exp.rate = log(2)/median.failure.time
set.seed(14052018)
11
#define longest possible follow-up time for each case
potential.obs.time = max.obs.time - entry.time
###ANALYSIS###
## [1] 29.8405
#plot km estimator of survival, assuming no censoring
library(survival)
survfit(Surv(failure.time, failed) ~ 1)
12
1.0
KM estimator
true S(t)
0.8
Survival Probability
0.6
0.4
0.2
0.0
0 20 40 60 80 100
Months
#summary statistics and survival curve for X = obs.time
summary(obs.time)
13
1.0
X
KM estimator
0.8
true S(t)
Survival Probability
0.6
0.4
0.2
0.0
0 20 40 60 80 100
Months
#summary statistics and survival curve for only uncensored cases
uncensored.times = subset(obs.time, failed == 1)
summary(uncensored.times)
14
1.0
uncensored cases
KM estimator
0.8
true S(t)
Survival Probability
0.6
0.4
0.2
0.0
0 20 40 60 80 100
Months
15
Problem 4: Calculating the Kaplan-Meier estimator
Listed below are values of survival time in years for 6 males and 6 females from a small study.
Right-censored times are denoted with “+” as a superscript.
Group 0: 1.2, 3.4, 5.0+ , 5.1, 6.1, 7.1
Group 1: 0.4, 1.2, 4.3, 4.9, 5.0, 5.1+
a) Let S(t)
b and S̃(t) denote the Kaplan-Meier estimator and the Fleming-Harrington estimator
of the survivorship function of the failure time combining Groups 0 and 1. Calculate S(1.2)
b
and S̃(1.2) by hand.
The Fleming-Harrington estimator of the survival function at a time t is exp(−Λ(t)),
b where Λ
b
is the Nelson-Aalen cumulative hazard estimator.
b) Figure 1 shows the Kaplan-Meier estimates separately for the two groups. Suppose that we
know that the largest observation in the female group is a censored observation. Which group,
0 or 1, represents the female group?
1.0
0.8
Survival Probability
0.6
0.4
0.2
group 0
group 1
0.0
0 1 2 3 4 5 6 7
c) The table below provides the Kaplan-Meier estimates for the survival function (S(t)),
b estimated
failure time function F (t), and the estimated standard errors for S(t) using the Greenwood
b b
Formula for Group 1. Calculate the values S(5.1)
b and Fb (5.1).
Product-Limit Survival Estimates
years S(t)
b Fb (t) sd
.e.(S(t))
b
0.00 1.0000 0 0
0.40 0.8333 0.1667 0.1521
1.20 0.6667 0.3333 0.1925
4.30 0.5000 0.5000 0.2041
4.90 0.3333 0.6667 0.1925
5.00 0.1667 0.8333 0.1521
5.10+ ???? ???? .
16
d) Use the above table to calculate the estimated 75th percentile of the survival time in Group
1. Provide a point estimate and a 95% confidence interval for S(t)
b at that time, using the
“log-log” approach. You can use the following result directly:
!2
1
Var log(− log(S(t)))
b = Var S(t)
b
b log(S(t))
S(t) b
Problem 4 Solution.
S(5.0)
b = 0.1667.
log(− log(S(t)))
b = log(− log(0.1667)) = 0.5831
1 2
Var(log(− log(S(t))))
b = (0.1521)2 = 0.2594
0.1667 × log(0.1667)
√
95%CI on log-log scale = 0.5831 ± 1.96 × 0.2594 = (−0.4151, 1.5813)
95% CI : (exp(− exp(1.5813)), exp(− exp(−0.4151))) = (0.0077, 0.5167)
A 95% CI for the survival at time 5.0 years in Group 1 is (0.0077, 0.5167).
17
Problem 5: Estimating the length of stay in a nursing home
The National Center for Health Services Research in the United States studied 36 for-profit nursing
homes to assess the effects of different financial incentives on length of stay. “Treated” nursing
homes received higher daily allowances for Medicaid patients and bonuses for improving a patient’s
health and sending them home. The study included 1,601 patients admitted between May 1, 1981
and April 30, 1982. The data appear in Morris, et al., Case Studies in Biometry, Chapter 12.
Data from this study are contained in the dataset nursing.home in the package eventtimedata.
Refer to the package documentation for information on the variable definitions and coding.
a) Using the Kaplan-Meier estimator, provide a graph of the estimated survival function for
length of stay for each of the two groups defined by the intervention variable rx. Length of
stay is the variable stay, and the indicator for discharge is the numeric variable cens. Does
the treatment appear to have changed the length of stay? Support your answer with some
summary statistics; a later lab will ask for a significance test.
b) Do you notice anything strange about the appearance of the survival curves in the two groups?
c) Plot the cumulative hazard function for each group.
d) The 5-number summary for a distribution consists of the minimum and the maximum, the
first and third quartile, and the median. As often happens with censored data, standard terms
may need to be carefully redefined to accommodate censoring.
Provide two 5-number summaries for the control group with these data. For the first summary,
use just the observation times stored in stay, ignoring censoring; the summary( ) command
can be used. For the second 5-number summary, the elements should be the estimated 1st
quartile, 3rd quartile, and median from the Kaplan-Meier and the largest and smallest observed
times to discharge for censored cases. The quantiles from a KM fit in R can be calculated
using the quantile function. The command quantile(km, 0.5) returns the median with its
confidence interval for a KM fit named km.
e) Most manuscripts of randomized trials with event-time data report a ‘median follow-up’. There
are several definitions of median follow-up in use. The first uses the median of all observed
follow-up times. This is the variable X in the slides and the variable stay in this dataset.
The second reports the median of the follow-up times among participants who have not had
an observed failure.
Calculate each of these for the control group in the nursing home data. How might each of
these statistics be useful?
Here is the beginning of a code chunk to get you started.
library(survival)
library(eventtimedata)
data(nursing.home)
18
Problem 5 Solution.
a) The two estimated medians for length of stay are 108 in the control group and 123 days in the
intervention group. The confidence intervals for the medians overlap, so there will probably
not be a significant difference in length of stay between the groups.
b) The intervention arm has considerably longer follow-up. The chapter by Morris, et al., notes
that participants were enrolled in the study if they were admitted to a participating nursing
home between May 1, 1981 and April 30, 1982. Residents in the treatment nursing homes
were censored after April 30, 1984; residents in the control nursing homes were censored after
April 30, 1983. I do not know the reason for the difference.
c) See below for plot.
d) Ignoring censoring, the 5-number summary for stay is (0, 28, 108, 379, 726) days. The second
5-number summary is (0, 28, 108, 428, 597), as calculated from the KM fit and the minimum
and maximum censored observations in the control group.
e) Median follow-up according to the first definition has already been calculated from the 5-
number summary for stay, 506 days. The second method yields 521 days. Each method has
its advantages. The median of the full set of times provides information about the amount
of information in the study. The second summarizes length of follow-up among cases still at
risk of an event, and is a more accurate summary of whether the cases who have not failed
have been followed long enough to ensure that enough failures are observed in the tail of the
survival distribution.
library(survival)
library(eventtimedata)
data(nursing.home)
19
KM of Discharge Probability
1.0
Control
Intervention
Probability of Discharge
0.8
0.6
0.4
0.2
0.0
Days
# cumulative hazard plot
plot(nrshome.km, col = 3:4, mark.time = TRUE,
fun = "cumhaz", cex = 0.6, cex.main = 0.8,
xlab = "Days", ylab = "Cumulative Hazard",
main = "Cumulative Hazard of Discharge", axes = FALSE)
axis(1)
axis(2)
legend(x = "bottomright", lty = 1, col = 3:4,
legend = c("Control", "Intervention"))
20
Cumulative Hazard of Discharge
1.5
Cumulative Hazard
1.0
0.5
Control
Intervention
0.0
Days
#summary of observed follow-up times in control group, ignoring censoring
control.grp = (nursing.home$rx == "Control")
summary(nursing.home$stay[control.grp])
## $quantile
## 25 50 75
## rx=Control 28.0 108 428
## rx=Intervention 28.5 123 444
##
## $lower
## 25 50 75
## rx=Control 24 92 340
## rx=Intervention 24 100 386
##
## $upper
## 25 50 75
## rx=Control 32 127 578
## rx=Intervention 37 156 553
control.grp.censored = control.grp & (nursing.home$cens == 1)
min(nursing.home$stay[control.grp.censored]) #min of censored cases, control arm
21
## [1] 0
max(nursing.home$stay[control.grp.censored]) #max of censored cases, control arm
## [1] 597
# median follow-up, control group, cases without discharge
not.failed = (nursing.home$cens == 0)
median(nursing.home$stay[not.failed == TRUE & control.grp == TRUE])
## [1] 521
22
Problem 6: The exponential distribution, again
Suppose T has an exponential distribution with rate parameter λ. Let (t1 , t2 , . . . , tn ) be a random
sample of n observations from T .
a) The likelihood function for a set of observations from a distribution with density f (t, λ) is
n
L(λ) = f (ti , λ).
Y
i=1
c) Suppose the observations from the exponential are subject to independent censoring. For each
case, let
• Ui denote the censoring variable,
• Ti denote the underlying, possibly censored event time,
• Xi denote the observed follow-up time, and
• δi = I(Ti ≤ Ui ) be the failure indicator.
Assume that the data for each case consist of (xi , δi ), i = 1, . . . n.
When data are censored in a parametric model, the likelihood for the data is given by
n
[f (ti )]δi [S(ti )]1−δi
Y
i=1
Show that the maximum likelihood estimator for λ based on the censored sample is
P
δi
λ= P
b
Xi
rate?
Problem 6 Solution.
i=1
b) Calculate the log of the likelihood function, then maximize by setting the derivative with
respect to λ equal to 0, and solving for λ.
23
d d
`(λ) = [n log(λ) − λ( ti )] = n/λ −
X X
ti
dλ dλ
0 = n/λ −
X
ti
b = n/( ti ) = 1/t, where t is the sample mean of t.
X
λ
c) As before, calculate the log-likelihood, differentiate with respect to λ, set the derivative equal
to 0, and solve for λ.
" #
−λXi
` = log
Y
δi
λ e
i
= [δi log(λ) − λXi ]
X
i
= [δi log(λ)] −
X X
λXi
i i
P
d δi X
`= − Xi = 0
dλ λ
P
b = P δi
λ
Xi
d) The Xi represents the total amount of time at risk for all subjects who were being followed.
P
The numerator represents the number of individuals who developed the event of interest during
the observation time. Thus, λb can be interpreted as the average rate at which the outcome
occurs.
24
Problem 7: Fitting an exponential distribution
This problem uses the code from the simulated dataset from a clinical trial with a total of 300
randomized patients (Problem 3).
a) Estimate and plot the survival distribution for all patients and plot the estimate for median
survival with confidence intervals.
b) Use the formula for the MLE for an exponential distribution to estimate the rate parameter,
then estimate the median survival using the formula derived earlier in the lab for the median
of an exponential distribution.
c) On a single plot, plot the Nelson-Aalen cumulative hazard estimator and the estimated
cumulative hazard from the exponential model. What do you notice?
Problem 7 Solution.
#initialize parameters
total.enrollment = 300
enrollment.period = 96
followup.period = 24
max.obs.time = enrollment.period + followup.period
median.failure.time = 18
exp.rate = log(2)/median.failure.time
set.seed(14052018) #set seed
25
obs.time = pmin(potential.obs.time, failure.time)
###ANALYSIS###
0.6
0.4
0.2
0.0
Months
#part b
## [1] 0.03683046
#estimate median
median.mle = log(2)/lambda.mle
median.mle
26
## [1] 18.81994
#part c: cumulative hazard plot
plot(exp.sim.km, lty= 1, mark.time = TRUE,
fun = "cumhaz",
xlab = "Months",
ylab = "Cumulative Hazard",
conf.int = F,
axes = FALSE,
cex = 0.6, cex.main = 0.8)
axis(1)
axis(2)
abline(a = 0, b = lambda.mle, lty=2)
2.5
2.0
Cumulative Hazard
1.5
1.0
0.5
0.0
Months
27