Semiparametric Regression


See discussions, stats, and author profiles for this publication at:
https://www.researchgate.net/publication/227390047

Semiparametric Regression

Book · January 2006
DOI: 10.1017/CBO9780511755453 · Source: RePEc


All content following this page was uploaded by David Ruppert on 26 September 2017.



Semiparametric Regression

DAVID RUPPERT
Cornell University

M. P. WAND
Harvard University

R. J. CARROLL
Texas A&M University
Published by the Press Syndicate of the University of Cambridge
The Pitt Building, Trumpington Street, Cambridge, United Kingdom

Cambridge University Press
The Edinburgh Building, Cambridge CB2 2RU, UK
40 West 20th Street, New York, NY 10011-4211, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
Ruiz de Alarcón 13, 28014 Madrid, Spain
Dock House, The Waterfront, Cape Town 8001, South Africa
http://www.cambridge.org

© David Ruppert, M. P. Wand, R. J. Carroll 2003

This book is in copyright. Subject to statutory exception and
to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without
the written permission of Cambridge University Press.

First published 2003

Printed in the United States of America

Typeface Times 10.5/13 pt. System AMS-TeX [FH]

A catalog record for this book is available from the British Library.

Library of Congress Cataloging in Publication data

Ruppert, David, 1948-
Semiparametric regression / David Ruppert, M.P. Wand, R.J. Carroll.
p. cm.
Includes bibliographical references and index.
ISBN 0-521-78050-0 - ISBN 0-521-78516-2 (pb.)
1. Regression analysis. 2. Nonparametric statistics. I. Wand, M. P. (Matthew P.).
II. Carroll, Raymond J. III. Title.

QA278.2.R87 2003
519.5'36 - dc21 2002041460

ISBN 0 521 78050 0 hardback
ISBN 0 521 78516 2 paperback

Contents

Preface
Guide to Notation

1 Introduction
1.1 Assessing the Carcinogenicity of Phenolphthalein
1.2 Salinity and Fishing in North Carolina
1.3 Management of a Retirement Fund
1.4 Biomonitoring of Airborne Mercury
1.5 Term Structure of Interest Rates
1.6 Air Pollution and Mortality in Milan: The Harvesting Effect

2 Parametric Regression
2.1 Introduction
2.2 Linear Regression Models
2.3 Regression Diagnostics
2.4 Inference
2.5 Parametric Additive Models
2.6 Model Selection
2.7 Polynomial Regression Models
2.8 Nonlinear Regression
2.9 Transformations in Regression
2.10 Bibliographic Notes
2.11 Summary of Formulas

3 Scatterplot Smoothing
3.1 Introduction
3.2 Preliminary Ideas
3.3 Practical Implementation
3.4 Automatic Knot Selection
3.5 Penalized Spline Regression
3.6 Quadratic Spline Bases
3.7 Other Spline Models and Bases
3.8 Other Penalties
3.9 General Definition of a Penalized Spline
3.10 Linear Smoothers
3.11 Error of a Smoother
3.12 Rank of a Smoother
3.13 Degrees of Freedom of a Smoother
3.14 Residual Degrees of Freedom
3.15 Other Approaches to Scatterplot Smoothing
3.16 Choosing a Scatterplot Smoother
3.17 Bibliographical Notes
3.18 Summary of Formulas

4 Mixed Models
4.1 Introduction
4.2 Mixed Models
4.3 Prediction
4.4 The Linear Mixed Model (LMM)
4.5 Estimation and Prediction in LMM
4.6 Estimated BLUP (EBLUP)
4.7 Standard Error Estimation
4.8 Hypothesis Testing
4.9 Penalized Splines as BLUPs
4.10 Bibliographical Notes
4.11 Summary of Formulas

5 Automatic Scatterplot Smoothing
5.1 Introduction
5.2 The Likelihood Approach
5.3 The Model Selection Approach
5.4 Caveats of Automatic Parameter Selection
5.5 Choosing the Knots and Basis Functions
5.6 Automatic Selection of the Number of Knots
5.7 Bibliographical Notes
5.8 Summary of Formulas

6 Inference
6.1 Introduction
6.2 Variability Bands
6.3 Confidence and Prediction Intervals
6.4 Inference for Penalized Splines
6.5 Simultaneous Confidence Bands
6.6 Testing the Adequacy of Parametric Models
6.7 Testing for No Effect
6.8 Inference Using First Derivatives
6.9 Testing for Existence of a Feature
6.10 Bibliographical Notes
6.11 Summary of Formulas

7 Simple Semiparametric Models
7.1 Introduction
7.2 Beyond Scatterplot Smoothing
7.3 Semiparametric Binary Offset Model
7.4 Additivity and Interactions
7.5 General Parametric Component
7.6 Inference
7.7 Bibliographical Notes

8 Additive Models
8.1 Introduction
8.2 Fitting an Additive Model
8.3 Degrees of Freedom
8.4 Smoothing Parameter Selection
8.5 Hypothesis Testing
8.6 Model Selection
8.7 Bibliographical Notes

9 Semiparametric Mixed Models
9.1 Introduction
9.2 Additive Mixed Models
9.3 Subject-Specific Curves
9.4 Bibliographical Notes

10 Generalized Parametric Regression
10.1 Introduction
10.2 Binary Response Data
10.3 Logistic Regression
10.4 Other Generalized Linear Models
10.5 Iteratively Reweighted Least Squares
10.6 Hat Matrix, Degrees of Freedom, and Standard Errors
10.7 Overdispersion and Variance Functions: Pseudolikelihood
10.8 Generalized Linear Mixed Models
10.9 Deviance
10.10 Technical Details
10.11 Bibliographical Notes

11 Generalized Additive Models
11.1 Introduction
11.2 Generalized Scatterplot Smoothing
11.3 Generalized Additive Mixed Models
11.4 Degrees-of-Freedom Approximations
11.5 Automatic Smoothing Parameter Selection
11.6 Hypothesis Testing
11.7 Model Selection
11.8 Density Estimation
11.9 Bibliographical Notes

12 Interaction Models
12.1 Introduction
12.2 Binary-by-Continuous Interaction Models
12.3 Factor-by-Curve Interactions in Additive Models
12.4 Varying Coefficient Models
12.5 Continuous-by-Continuous Interactions
12.6 Bibliographical Notes

13 Bivariate Smoothing
13.1 Introduction
13.2 Choice of Bivariate Basis Functions
13.3 Kriging
13.4 General Radial Smoothing
13.5 Default Automatic Bivariate Smoother
13.6 Geoadditive Models
13.7 Additive Plus Interaction Models
13.8 Generalized Bivariate Smoothing
13.9 Appendix: Equivalence of BLUP using ZR and ZP
13.10 Bibliographical Notes

14 Variance Function Estimation
14.1 Introduction
14.2 Formulation
14.3 Application to the LIDAR Data
14.4 Quasilikelihood and Variance Functions
14.5 Bibliographical Notes

15 Measurement Error
15.1 Introduction
15.2 Formulation
15.3 The Expectation Maximization (EM) Algorithm
15.4 Simulated Example Revisited
15.5 Sensitivity Analysis Example
15.6 Bibliographical Notes

16 Bayesian Semiparametric Regression
16.1 Introduction
16.2 General Framework
16.3 Scatterplot Smoothing
16.4 Linear Mixed Models
16.5 Generalized Linear Mixed Models
16.6 Rao-Blackwellization
16.7 Bibliographical Notes

17 Spatially Adaptive Smoothing
17.1 Introduction
17.2 A Local Penalty Method
17.3 Completely Automatic Algorithm
17.4 Bayesian Inference
17.5 Simulations
17.6 LIDAR Example
17.7 Additive Models
17.8 Bibliographical Notes

18 Analyses
18.1 Cancer Rates on Cape Cod
18.2 Assessing the Carcinogenicity of Phenolphthalein
18.3 Salinity and Fishing in North Carolina
18.4 Management of a Retirement Fund
18.5 Biomonitoring of Airborne Mercury
18.6 Term Structure of Interest Rates
18.7 Air Pollution and Mortality in Milan: The Harvesting Effect

19 Epilogue
19.1 Introduction
19.2 Minimalist Statistics
19.3 Some Omitted Topics
19.4 Future Research

A Technical Complements
A.1 Introduction
A.2 Matrix Definitions and Results
A.3 Linear Algebra
A.4 Probability Definitions and Results
A.5 Maximum Likelihood Estimation
A.6 Bibliographical Notes

B Computational Issues
B.1 Fast Computation of Penalized Spline Smooths
B.2 Computation of Covariance Matrix Estimators
B.3 Software

Bibliography
Author Index
Notation Index
Example Index
Subject Index

1 Introduction

Semiparametric regression can be of substantial value in the solution of complex
scientific problems. The real world is far too complicated for the human mind to
comprehend in great detail. Semiparametric regression models reduce complex
data sets to summaries that we can understand. Properly applied, they retain es-
sential features of the data while discarding unimportant details, and hence they
aid sound decision-making.
Figure 1.1 depicts a complex data set corresponding to a cancer study in the
Upper Cape Cod region of Massachusetts. Apart from the geographical location
of cancer occurrences, there are data on age and smoking status. These data are
for females.
One question of interest is whether there are elevated lung cancer rates, relative
to all cancers and after adjustment for confounders, in any particular geographi-
cal locations. There is clearly a lot of relevant information represented by the one
thousand points in this plot. However, it is very difficult to draw any conclusions
from this alone. A semiparametric regression analysis leads to Figure 1.2.

Figure 1.1 One thousand randomly chosen occurrences of female cancer in
Upper Cape Cod, Massachusetts, for the period 1986–1994. The data are
categorized according to lung cancer (red) or other (blue) and smoker
(closed circle) or nonsmoker (open circle). The size of the circle is
proportional to age. For confidentiality reasons, the data have been
jittered. [Plot of degrees latitude against degrees longitude. Legend:
lung cancer, smoker; lung cancer, non-smoker; other cancer, smoker;
other cancer, non-smoker.]

Figure 1.2 Graphical outcomes from a semiparametric regression analysis
of Upper Cape Cod lung cancer data: top panel, point estimate and
approximate 95% confidence interval for the odds ratio of lung cancer
among smokers who have some type of cancer; middle panel, estimated odds
ratio as function of age; bottom panel, estimated odds ratio as function
of geographic location. Higher values correspond to high estimated
probabilities of lung cancer, given cancer, measured through the odds
ratio. [Axis labels: 95% C.I. for odds ratio of smokers; est. odds ratio
(relative to average) against age (years); est. odds ratio (relative to
geometric mean) over degrees latitude and degrees longitude.]

Margin note: The odds ratio of an event A, relative to an event B, is
defined to be the ratio of the odds of A to the odds of B. The odds of A
is the probability of A occurring divided by the probability of A not
occurring.

Each of the graphics in Figure 1.2 displays an easy-to-comprehend estimate of
the effect of smoking status, age, and geographical location on the occurrence of
lung cancer, relative to cancer, while controlling for each of the other two
variables. Smoking status is a binary variable, so its effect can be modeled through
a single parameter. This is the simplest type of parametric modeling. The graphic
shows an odds ratio estimated to be in the range 11 to 33. Age is a continuous
variable and, in this instance, its effect can be modeled reasonably well using
parametric regression techniques. However, the nonparametric estimate shown
in the middle panel suggests an unusual type of nonlinearity and so nonparametric
regression techniques may lead to an improved fit. The effect of geography is
difficult to model using traditional parametric models, and the map in Figure 1.2
is the result of a bivariate nonparametric regression technique. It clearly shows
regions with elevated lung cancer levels, something that is not easy to discern in
Figure 1.1. Since the effects of smoking, age, and location have been modeled
using a combination of parametric and nonparametric regression techniques, we
call this a semiparametric regression analysis.
In the next sections we look at other important scientific investigations where
semiparametric regression can play a useful role. We give detailed analyses of
these studies (or at least references to where careful analyses can be found) in
Chapter 18, after we have developed methodology to tackle them; Chapters 2–17
will be spent describing this methodology.

Table 1.1 Observed mammary tumor rates with phenolphthalein. For example,
17 of the 50 animals exposed at 25,000 ppm had tumors at the time of death.
Of these 50 animals, 18 died during the experiment and 32 were sacrificed at
the end of the experiment, with 15 of the sacrificed animals being among the
17 with tumors.

              Tumor rates
Dose          Overall    Terminal (a)    Mean body weight (b)
0 ppm         32/50      25/30           287
25,000 ppm    17/50      15/32           254

(a) Tumors found at terminal sacrifice time.
(b) Average body weight at 12 months.

1.1 Assessing the Carcinogenicity of Phenolphthalein


The U.S. National Toxicology Program (NTP) routinely conducts animal exper-
iments to measure the toxicity of certain foods and drugs. One such example is
the assessment of the possible carcinogenicity of phenolphthalein, an ingredient
of over-the-counter laxatives that was recently withdrawn by the U.S. Food and
Drug Administration.
A topic of recent interest in the analysis of carcinogenicity data is how to deal
with body weight. A recent editorial in Science magazine was highly critical of
risk assessment agencies for not controlling for the possible confounding effect
of weight, since weight loss caused by a toxic substance might protect against
cancer and mask a carcinogenic effect (Abelson 1995). It is not uncommon for
control animals to weigh substantially more than the treated animals through-
out the course of an experiment owing to toxic effects of the chemical. Several
sources have reported a lower incidence of tumors corresponding to lower body
weights (Hart et al. 1995; Haseman, Bourbina, and Eustis 1994; Seilkop 1995).
Thus, dose-related differences in body weights could affect the conclusions drawn
from these studies. Indeed, many studies conducted by the NTP have shown pro-
tective effects of the chemical being tested on certain tumor incidences. These
apparent reductions in tumor incidence across dose may be due to differences in
body weight (Hart et al. 1995). This phenomenon is illustrated in Table 1.1, taken
from the NTP study in phenolphthalein.
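
As a small illustration of the odds ratio defined in the introduction, the
sketch below applies that definition to the overall tumor rates of Table 1.1
(32/50 at 0 ppm and 17/50 at 25,000 ppm). The function names are ours, and
this crude ratio ignores body weight, the very confounder discussed above; it
is not the analysis carried out in the book.

```python
def odds(p: float) -> float:
    """Odds of an event with probability p: p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return p / (1.0 - p)


def odds_ratio(p_a: float, p_b: float) -> float:
    """Odds ratio of event A relative to event B."""
    return odds(p_a) / odds(p_b)


# Overall mammary tumor rates from Table 1.1.
p_control = 32 / 50   # 0 ppm
p_treated = 17 / 50   # 25,000 ppm

# Odds of a tumor in the treated group relative to the control group.
crude_or = odds_ratio(p_treated, p_control)
print(f"crude odds ratio (25,000 ppm vs 0 ppm): {crude_or:.3f}")
```

A value below 1 corresponds to the apparent protective effect described in
the text; whether it survives adjustment for weight is the question that
motivates the semiparametric analysis.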
Figure 1.3 shows nonparametric estimates of the probabilities of four
carcinogenic outcomes as a function of weight based on a large NTP set of data on
controls. It is apparent from these plots that nonlinear relationships exist and that
semiparametric models for incorporation of weight data would be beneficial.

Figure 1.3 Estimated probability of mammary tumor, leukemia, pituitary
tumor, and thyroid tumor as a function of weight for a set of NTP historical
controls. The shaded region represents plus and minus twice the estimated
(pointwise) standard error. [Four panels (mammary tumor, leukemia, pituitary
tumor, thyroid tumor), each plotting prob(tumors|weight) against weight.]

1.2 Salinity and Fishing in North Carolina


This example comes from a larger project to predict the annual shrimp (or prawn)
harvest in Pamlico Sound, North Carolina, where shrimping occurs in the sum-
mer and autumn. It was believed that low salinity in the sound was detrimental
to the shrimp harvest and that salinity values during certain crucial springtime
periods would be useful predictors.
Salinity values were not measured regularly during the years prior to the
project. However, discharges from rivers that empty into Pamlico Sound were
known. The goal of the project was to develop a prediction model that could be
used during the spring, early enough to help the fishing industry decide whether
to rig for shrimp or instead to harvest some other species such as bluefish.
The data set has 28 cases taken from the spring periods of years 1972 to 1977.
In each case, salinity was measured at the current time period and two weeks ear-
lier, giving the variables salinity and lagged.sal. Two other variables were
measured, discharge and trend. The variable trend indicated which of six
biweekly periods during March to May a case came from. It was felt that trend
might model the effects of increasing evaporation as the weather warmed, but no
effect of trend was detected and so that variable will be ignored.
Figure 1.4 is a scatterplot matrix of the salinity data. One can see the strong,
seemingly linear, relationship between salinity and lagged.sal. The
relationship between salinity and discharge is somewhat weaker and
possibly nonlinear. There is not a strong relationship between lagged.sal and
discharge, so their effects upon salinity should be individually estimable
with good precision.

Figure 1.4 Scatterplot matrix of the salinity data. [Panels: salinity against
lagged salinity, salinity against discharge, and lagged salinity against
discharge.]

Margin note: The notion of smoothing a scatterplot will be described
extensively in Chapters 3 and 5.

The relationship between salinity and discharge is easier to see if we
remove the effects of lagged.sal. To do this, we regressed salinity on
lagged.sal using a straight line model (see Section 2.2). The residuals (i.e.,
the differences between salinity and the predicted values) are plotted against
discharge in Figure 1.5. The nonlinearity is now more evident, especially
because a scatterplot smooth has been added. This suggests that a semiparametric
regression approach will be beneficial. The observation with discharge equal
to nearly 34 is a high leverage point, meaning that it has a potentially high
influence on the fitted curve. In fact, the fitted curve bends upward in the figure but
would not do so if the leverage point were excluded. However, unlike a linear
fit, the curved fit is only influenced locally, that is, on the right. We will discuss
this point further when we return to this example in Chapter 18.
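
To make the idea of a high leverage point concrete, here is a minimal sketch
for a straight line fit, where the leverage of observation i is
h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2. The discharge values below
are invented for illustration (only the value near 34 echoes the text); they
are not the Pamlico Sound data, and the formula describes the linear fit, not
the scatterplot smooth.

```python
def line_leverages(x: list[float]) -> list[float]:
    """Leverages (hat-matrix diagonal) for a straight line fit with intercept."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Hypothetical discharge values; the last one mimics the point near 34.
discharge = [21.0, 22.5, 23.0, 23.5, 24.0, 24.5, 25.0, 26.0, 27.0, 33.7]

h = line_leverages(discharge)
for xi, hi in zip(discharge, h):
    print(f"discharge = {xi:5.1f}   leverage = {hi:.3f}")

# Leverages of a straight line fit sum to 2 (the number of coefficients).
print(f"sum of leverages = {sum(h):.3f}")
```

In this made-up sample the isolated point on the right carries most of the
leverage, which is why a straight line fit is pulled toward it globally.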

1.3 Management of a Retirement Fund


Bryant and Smith (1995) describe a managerial problem based on a real data
set, but with names changed to protect confidentiality. It concerns a company,
Best Retirement Inc. (BRI), that sells retirement plans to corporations around the
United States. To capture a market niche, it has decided to target smaller firms:
those with 500 or fewer employees. The major portion of their revenue comes
from retirement packages.
For a particular type of retirement plan known as 401(k), data are available
on several attributes of the firms from the previous year. It is advantageous that
BRI be able to estimate the year-end dollar amount contributed to each plan in
advance so that it can make internal revenue and cost projections.

Figure 1.5 Scatterplot of residuals from the regression of salinity on
lagged.sal. A scatterplot smooth has been added. Note the effect of the high
leverage point on the extreme right. [Axes: salinity residual against
discharge.]

Figure 1.6 Estimated effect of salary on contribution to the logarithm of
year-end contributions in a semiparametric regression analysis. The shaded
region represents plus and minus twice the estimated (pointwise) standard
error. [Axes: effect on mean year end contributions against salary.]

Apart from building a prediction model for year-end contributions, there are
some other managerial questions that can be addressed using these data. For ex-
ample, BRI has a sales representative who has been specifically trained to deal
exclusively with 401(k) retirement plans. The company would like to know if her
expertise is a factor that influences contributions to such retirement plans.
Figure 1.6 shows the effect of salary (average salary of each firm) on the
logarithm of year-end contributions as estimated by a semiparametric regression
analysis. There is a pronounced nonlinearity here, which suggests that better
predictions and managerial decisions can be realized through the use of
semiparametric regression.

1.4 Biomonitoring of Airborne Mercury

Waste incineration is a major source of environmental mercury. As part of an
environmental monitoring program in Warren County, New Jersey, pots of sphagnum
moss were placed at 15 sampling locations about a solid waste incinerator and
exposed to ambient conditions between July 9 and July 23, 1991. The moss was
then collected, dried, and assayed for mercury. The resultant data are shown in
Figure 1.7.

Figure 1.7 Plot of biomonitoring data. Open circles show sampling locations,
and asterisks mark the single or replicate values of mercury measured at each
sampling location. The large solid circle marks the location of the
incinerator. [Three-dimensional plot of mercury against UTM East and UTM
North.]

The goals of the study include estimating the distribution of mercury about the
incinerator and testing the null hypothesis that the mean mercury concentration
is constant.
Figure 1.8 shows estimated levels of mercury concentration that were obtained
using nonparametric methods described in this book. The plot indicates that
mercury concentration peaks north of the incinerator. There are only 15 sampling
locations, with replicate moss pots at 7 of these sites, for a total of 22
observations. With so few data, only gross features of mercury deposition can be resolved,
but the nonparametric fit provides a pleasing image of these features.

Figure 1.8 Plot of biomonitoring data with coloring of estimated mercury
concentration. There were 15 sampling locations and 7 had replicate samples.
Open circles indicate sampling locations; the asterisk marks the incinerator
location. [Surface of estimated mercury concentration over UTM East and UTM
North.]

1.5 Term Structure of Interest Rates

Corporations, municipalities, the U.S. Treasury, and other entities raise money by
issuing bonds. The purchase price of the bond is a loan to the issuing entity and
the bond is a contract requiring that entity to pay to the bond holder both principal
and interest according to a schedule. At the time of expiration of the bond, which
is called the maturity, the bond holder receives a payment called the par value.
There are two general classes of bonds, coupon bonds and zero-coupon bonds.
At fixed periods, often every six months, the holder of a coupon bond receives
a coupon payment. Generally, coupon bonds sell at a price near their par value.
The par payment at maturity is a repayment of principal while the coupon pay-
ments are interest. Zero-coupon bonds have no coupon payments and sell below
par. The par payment at maturity represents principal and interest.
Frequently, the initial owner of the bond will sell the bond to another investor.
The current price at which bonds trade depends upon the current interest rates.
For example, suppose a corporate coupon bond with a 5% coupon rate is issued
with the initial price equal to par, so that the coupon payments are 5% of the
initial price. If the prevailing interest rate increases to 6% then the price of the bond
will drop, so that a new purchaser of the bond will in effect receive a 6% rate.
The interest rates on bonds depend upon their maturities, with long-term bonds
frequently (though not always) paying higher rates than short-term bonds. For
example, on January 26, 2001, the rate on a 1-year Treasury bill was 4.83% whereas
the rate on a 30-year Treasury bond was 6.11%. The term structure of interest
rates is a quantitative description of the dependency of rate upon maturity. The
estimation of term structure is essential for financial analysts working, for
example, with credit derivatives.

Margin note: A financial derivative is a security whose value depends on the
value of other underlying securities. As an example of a derivative, consider
a call option on a stock. A call option gives the owner the right, but not
the obligation, to purchase a share of stock at a fixed price on a given
date, called the expiration date. The value of the call option depends on the
price of the underlying stock and on such other variables as the time left
until expiration. An example of an interest rate derivative is a cap. If an
interest rate exceeds the cap, then the owner of the cap is paid the
difference between the interest rate and the cap. Clearly, the value of the
cap depends on the underlying interest rate. A company paying interest at a
floating rate might purchase a cap as insurance against rate increases.

Interest rates not only depend upon the maturity, but for any fixed maturity, the
interest rate on bonds with that maturity will change over time. In this case study,
we are not concerned with such changes. Rather, we will only be concerned with
how interest rates on a given day depend on maturity. Specifically, in our
example, we will model bond interest rates on December 31, 1995.
We will work with continuously compounded interest rates. As an illustration,
we will start with an unrealistic assumption that the interest rate is constant, that
is, not dependent on maturity. If a bond is worth P(t) dollars at time t and is
continuously compounded at a constant rate r, then P(t) satisfies the simple
differential equation
    \[ P'(t) = r\,P(t) \tag{1.1} \]
and so, at maturity T,
    \[ P(T) = P(0) \exp(rT). \tag{1.2} \]
The rate r is called the forward rate. It is the rate agreed upon at present for
interest in the future, that is, forward in time.
Interest rates must be inferred from bond prices. Recall that the bond's value
at maturity, P(T), is called the par value. Hence, from (1.2) we have
    \[ P(0) = \mathrm{par}\,\exp(-rT), \tag{1.3} \]
where par is the par value. Suppose a 1-year, par $100 zero-coupon bond is
selling now for $92. This means we can buy the bond now for $92 and receive $100
exactly one year from now. Recall that zero-coupon means the bond holder
receives no interest payments until maturity. The $8 difference between the present
price and the par value is the only interest payment. Here we have T = 1, P(1) =
par = 100, P(0) = 92, and, from (1.3),
    \[ 92 = 100 \exp(-r) \]
or
    \[ r = \log(100/92) = 0.0834. \]
Thus, the annual continuously compounded interest rate over the next year is
8.34%.
Suppose, in addition, that a 2-year, par $100 zero-coupon bond sells for $83.
We assume that this bond pays the just-determined rate of 8.34% the first year
but a different interest rate the next year. The rate for the second year, call it r_2,
solves
    \[ 83 = 100 \exp\{-(0.0834 + r_2)\} \]
or
    \[ r_2 = \log(100/83) - 0.0834 = 0.1029. \]
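
The two rate calculations above can be reproduced in a few lines. This is a
minimal sketch under the text's assumption that the rate is constant within
each year, using the $92 and $83 prices that appear in the displayed
equations; the function name `yearly_forward_rates` is ours, not the book's.

```python
import math


def yearly_forward_rates(prices: list[float], par: float = 100.0) -> list[float]:
    """Piecewise-constant yearly rates from zero-coupon prices.

    prices[k] is today's price of a zero-coupon bond maturing in k + 1 years.
    With price = par * exp(-(r_1 + ... + r_k)), each rate is the increment
    of log(par / price) from one maturity to the next.
    """
    rates = []
    cumulative = 0.0  # r_1 + ... + r_(k-1)
    for price in prices:
        total = math.log(par / price)  # r_1 + ... + r_k
        rates.append(total - cumulative)
        cumulative = total
    return rates


r1, r2 = yearly_forward_rates([92.0, 83.0])
print(f"r1 = {r1:.4f}, r2 = {r2:.4f}")  # prints r1 = 0.0834, r2 = 0.1029
```

The same loop extends to longer maturities, provided a zero-coupon price is
available for every year.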
Table 1.2 gives the prices on December 31, 1995, of five bonds previously
issued by the U.S. communications company AT&T and maturing at some time after
that date. These are the prices at which the bonds were traded, that is, purchased
by one investor from another. Each bond price is expressed as a percentage of
par, the amount AT&T will pay the bond owner at maturity. The maturity is given
in years from December 31, 1995. The bonds make semiannual interest payments
called coupons. The time in years of the next coupon and the coupon payments
are given in the table. The aim is to determine the forward rate of AT&T bonds
from these data.

Table 1.2 AT&T bond prices on December 31, 1995. Issue, maturity, and next
coupon dates are in years from December 31, 1995.

Issue      Maturity    Next coupon    Coupon    Price
-3.9644     5.9781     0.0356         7.1250    109.4580
-1.7726     8.1890     0.2274         6.7500    106.2840
-1.5836    10.3562     0.4164         7.5000    111.4360
-0.8384    11.1041     0.1616         7.7500    115.5090
-0.6384     9.3096     0.3616         7.0000    107.6590

We have been assuming that the forward interest rate is constant over each
year. Clearly, this is an oversimplification. Financial analysts model the forward
interest rate as a continuous function of time, r(t). If P(T) is the par value of a
zero-coupon bond maturing at time T and if P(0) is the current price of the bond,
then (1.1) is replaced by
    \[ P'(t) = r(t)\,P(t), \]
with solution
    \[ P(0) = P(T) \exp\left\{ -\int_0^T r(x)\,dx \right\}. \tag{1.4} \]

Margin note: A forward price is a price negotiated at the present for the
future delivery of some commodity. A forward interest rate means an interest
rate that is agreed upon now for a loan in the future.

The problem is to estimate r(t) from bond prices, such as those shown in Table
1.2. A further complication is that many bonds, including those in the table, have
coupons. A coupon bond can be modeled as a bundle of zero-coupon bonds, one
for each coupon payment and one for the final payment at maturity of the par
value. The bond price is the aggregate price of all of these zero-coupon bonds. Bond
prices such as in Table 1.2 have some random error since, for example, they are
really prices at last transaction, not exactly at the current time. Therefore, the
estimation of the forward rate curve is a statistical problem. Fisher, Nychka, and
Zervos (1994) have developed a very elegant spline method for estimating the
forward rate curve. Their method works well for Treasury bond data because there
are enough Treasury bonds to estimate a continuous forward rate.
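
The bundle-of-zero-coupon-bonds idea can be sketched numerically: each
payment is discounted by (1.4) and the results are summed. The bond below
(par 100, a coupon of 3 every half year for two years) and the forward rate
function are hypothetical, chosen only for illustration, and the trapezoidal
rule is our choice of integrator. This is the forward pricing step only, not
the spline estimator of Fisher, Nychka, and Zervos.

```python
import math
from typing import Callable, Sequence


def discount_factor(r: Callable[[float], float], T: float, steps: int = 1000) -> float:
    """exp(-integral_0^T r(x) dx), with the integral by the trapezoidal rule."""
    h = T / steps
    integral = 0.5 * (r(0.0) + r(T))
    integral += sum(r(k * h) for k in range(1, steps))
    integral *= h
    return math.exp(-integral)


def coupon_bond_price(
    r: Callable[[float], float],
    coupon_times: Sequence[float],
    coupon: float,
    par: float,
) -> float:
    """Price a coupon bond as a bundle of zero-coupon bonds.

    One zero-coupon bond per coupon payment, plus one for the par value
    paid at maturity (taken here to be the last coupon time).
    """
    maturity = coupon_times[-1]
    price = sum(coupon * discount_factor(r, t) for t in coupon_times)
    return price + par * discount_factor(r, maturity)


def forward_rate(x: float) -> float:
    """Hypothetical forward rate curve, rising linearly with time."""
    return 0.05 + 0.01 * x


times = [0.5, 1.0, 1.5, 2.0]
price = coupon_bond_price(forward_rate, times, coupon=3.0, par=100.0)
print(f"price = {price:.4f}")
```

Estimating r(t) reverses this computation: one looks for a curve whose implied
prices are close to the observed ones, which is where the statistical
modeling enters.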
For corporate bonds, there is often a paucity of data and so the method of Fisher
and colleagues cannot be applied directly. Jarrow, Ruppert, and Yu (2001) extend
the model of Fisher et al. by assuming that the forward rate for a corporation such
as AT&T differs from the Treasury forward rate by a constant or, perhaps, by a
low-degree polynomial function of time. The corporate forward rate is greater
than the Treasury rate, since Treasury bonds have no risk of default; the U.S. Trea-
sury can always raise money by taxation. The difference between the two rates
is called the risk premium or spread and reflects the extra interest that investors
demand when buying corporate bonds (which may default) rather than risk-free
Treasury bonds. The model of Jarrow and colleagues is semiparametric in that
the Treasury forward rate is modeled as a spline, but the risk premium is mod-
eled parametrically. This case study is typical of semiparametric models in that
parts of the model for which there is much data are modeled nonparametrically
while parts that are not well supported by data are modeled parametrically.
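In symbols, the structure just described might be written as follows. The notation is assumed here for illustration and is not taken from Jarrow, Ruppert, and Yu (2001): $r_T(x)$ denotes the Treasury forward rate (modeled as a spline), $r_C(x)$ the corporate forward rate, $s(x)$ the risk premium, and $\alpha_0, \ldots, \alpha_p$ the coefficients of a polynomial of low degree $p$, with $p = 0$ giving a constant spread.

```latex
r_C(x) = r_T(x) + s(x), \qquad
s(x) = \alpha_0 + \alpha_1 x + \cdots + \alpha_p x^p .
```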
Figure 1.9 shows the prices of U.S. STRIPS (Separate Trading of Registered
Interest and Principal of Securities), a type of zero-coupon Treasury bond. The
prices are expressed as a percentage of the par value and are plotted against time
to maturity. If $r(x)$ is constant, say $r(x) = r_0$ for all $x$, then by (1.4) we have
$y_i = 100 \exp(-r_0 T_i)$ (1.5)
and
$\log(y_i) = \log(100) - r_0 T_i$. (1.6)
Here $P(T_i)$ is the par, $P(0)$ is the present price, $y_i = 100 P(0)/P(T_i)$ is the
response, and $T_i$ is the maturity for the $i$th U.S. STRIPS.

Figure 1.9 U.S. STRIPS prices as a percentage of the par value (price versus time to maturity).

The rough exponential shape in Figure 1.9 suggests that model (1.5) is at least
approximately correct. However, in Figure 1.10 we see $\log(y_i)$ plotted against
$T_i$, and the plot is not quite the straight line that (1.6) suggests. In fact, we fit
a straight line to $\{T_i, \log(y_i)\}_{i=1}^n$ and plotted the residuals, which are the
differences between the $\log(y_i)$ and the fitted line. This plot, shown as Figure 1.11,
shows an obvious deviation from the random cloud that we would expect if the
model (1.5) fit the data, thus indicating the need for a nonparametric model. The
fitting of straight line models and residual analysis will be discussed in Chapter 2.
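The straight-line check described above can be sketched as follows. The STRIPS data are not reproduced here, so the example uses synthetic, noise-free log prices generated from a hypothetical non-constant forward rate $r(x) = 0.05 + 0.001x$; the maturities and the rate are assumptions chosen only to show how a systematic residual pattern arises when the forward rate is not constant.

```python
import numpy as np

# Synthetic maturities in years (NOT the STRIPS data shown in Figure 1.9).
T = np.linspace(0.25, 30.0, 120)

# Hypothetical forward rate r(x) = 0.05 + 0.001 x; its integral from 0 to T
# is 0.05 T + 0.0005 T^2, so log prices are slightly curved in T.
log_y = np.log(100.0) - (0.05 * T + 0.0005 * T**2)

# Least-squares straight line fit to the points (T_i, log y_i).
slope, intercept = np.polyfit(T, log_y, deg=1)
residuals = log_y - (intercept + slope * T)

# For comparison: with a constant rate r_0 = 0.05 the log prices are exactly
# linear in T, and the fitted slope recovers -r_0.
slope_const, _ = np.polyfit(T, np.log(100.0) - 0.05 * T, deg=1)
```

In this synthetic case the residuals are negative at short and long maturities and positive in between, a smooth arc rather than a random cloud, which is the kind of departure from a straight line that motivates a more flexible model.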

1.6 Air Pollution and Mortality in Milan: The Harvesting Effect


In the last decade, a good deal of literature has been published concerning the
short-term effect of air pollution on health. Daily mortality counts and hospital
admissions have been associated with daily air pollution levels, correcting
for several time-dependent confounders. From the public health point of view,
the significance of air pollution's short-term effects corresponds to an increase in
mortality or morbidity among individuals who would otherwise die much later,
not among those who could have died within a few days.

Figure 1.10 Logarithms of U.S. STRIPS prices as a percentage of the par value (log(price) versus time to maturity).

Figure 1.11 Residuals from a straight line fit to the logarithms of the U.S. STRIPS prices as a percentage of the par value (residual versus time to maturity).

Figure 1.12 is a schematic representation of the dynamics that arise when air
pollution has an impact on mortality. The risk pool consists of sick and elderly
people. Transitions between this state and the general population are affected by
air pollution levels.
Consider the following lagged regression model of air pollution and generic
mortality:
$\log\{E(\text{mortality}_t)\} = \alpha + \beta_0\,\text{pollution}_t + \cdots + \beta_q\,\text{pollution}_{t-q} + \varepsilon_t,$
where $\text{mortality}_t$ and $\text{pollution}_t$ are (respectively) the mortality count and
pollution level for day $t$. The lag structure in Figure 1.13 describes the so-called
harvesting effect. The horizontal axis is the lag number and the vertical axis shows
the coefficients $\beta_\ell$. Each $\beta_\ell$ has this interpretation: net effect of pollution level $\ell$
days ago on mortality.

Figure 1.12 Schematic representation of the dynamics that arise when air pollution has an impact on mortality (diagram labels: Air Pollution, General Population, Risk Pool, Death).

Figure 1.13 Lag structure corresponding to the harvesting effect (coefficients versus lag, with regions labeled A and B).

In the figure, A is the sum of the positive coefficients for low lags and represents
the fact that pollution levels in the past few days or weeks have a positive
effect on mortality. However, the negative coefficients in B mean that pollution
levels a longer period ago have a negative effect. This is due to depletion of
the risk pool, normally made up of elderly and sick people whose deaths have
been hastened a few days or weeks by episodes of high pollution; this is known
as harvesting. Here A overestimates the public health significance of pollution,
since it is really A + B (where B is negative) that represents deaths brought forward
by a noticeable amount of time.
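The arithmetic behind A and B can be illustrated with a small sketch. The lag coefficients below are hypothetical numbers invented for the example (they are not estimates from the Milan data); A is taken as the sum of the positive coefficients and B as the sum of the negative ones, as in the description above.

```python
import numpy as np

# Hypothetical lag coefficients beta_0, ..., beta_q (illustrative values only).
beta = np.array([0.30, 0.25, 0.15, 0.05, -0.05, -0.10, -0.08, -0.04, 0.0])

A = beta[beta > 0].sum()   # short-lag increase in mortality
B = beta[beta < 0].sum()   # later deficit from depletion of the risk pool
net_effect = A + B         # equals the sum of all lag coefficients
```

With these made-up values, reporting A alone would overstate the effect, because part of the short-lag excess is offset by the later deficit captured in B.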
Daily data over 10 years are available on mortality, air pollution, and several
meteorological variables for the city of Milan, Italy. It is of interest to use these to
quantify the public health significance of air pollution, incorporating the harvesting
effect. By constraining the lag coefficients to be on a smooth (but otherwise
flexible) curve, we obtained Figure 1.14. This suggests some evidence of harvesting.
The construction of this result required some nonstandard semiparametric
regression techniques that allowed for the lag coefficients to lie on a smooth curve
and also be influenced by data on daily weather conditions.

Figure 1.14 Estimates of the coefficients of the lags of sulphur dioxide on mortality in Milan, Italy. The shaded points are plus and minus 2 times the estimated standard error of each coefficient estimate (est. coefficients versus lag number).
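The idea of constraining lag coefficients to lie on a smooth curve can be sketched with a much simpler stand-in than the techniques used for Figure 1.14. The example below is a polynomial distributed lag fit by ordinary least squares on simulated, noise-free data; it is not the authors' method, it ignores the log link and the weather covariates, and every number in it (series length, number of lags, polynomial degree, true coefficients) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, degree = 400, 20, 2

# Simulated pollution series (hypothetical) and its lag matrix:
# row t holds pollution_t, pollution_{t-1}, ..., pollution_{t-q}.
pollution = rng.normal(size=n + q)
X = np.column_stack([pollution[q - l : q - l + n] for l in range(q + 1)])

# True coefficients on a smooth quadratic curve in the lag number:
# positive at short lags, negative at longer lags (a harvesting-like shape).
lags = np.arange(q + 1)
beta_true = 0.5 - 0.08 * lags + 0.0025 * lags**2

# Noise-free response so that the fit can be checked exactly.
y = X @ beta_true

# Constrain beta = C @ theta, where the columns of C are 1, l/q, (l/q)^2.
C = np.vander(lags / q, degree + 1, increasing=True)
theta, *_ = np.linalg.lstsq(X @ C, y, rcond=None)
beta_hat = C @ theta
```

The constraint reduces the 21 lag coefficients to 3 free parameters; a spline basis with a roughness penalty could be substituted for the polynomial basis `C` to obtain a more flexible curve.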

Chapter 18 provides much fuller analyses and solutions for a selection of the problems
presented in this chapter. Between now and then we will need to describe
techniques for performing semiparametric regression analysis. The next chapter
signals the start of this journey.
