Semiparametric Regression
All content following this page was uploaded by David Ruppert on 26 September 2017.
DAVID RUPPERT
Cornell University
M. P. WAND
Harvard University
R. J. CARROLL
Texas A&M University
Published by the Press Syndicate of the University of Cambridge
The Pitt Building, Trumpington Street, Cambridge, United Kingdom
A catalog record for this book is available from the British Library.
QA278.2.R87 2003
519.5'36 dc21 2002041460
Contents
1 Introduction
1.1 Assessing the Carcinogenicity of Phenolphthalein
1.2 Salinity and Fishing in North Carolina
1.3 Management of a Retirement Fund
1.4 Biomonitoring of Airborne Mercury
1.5 Term Structure of Interest Rates
1.6 Air Pollution and Mortality in Milan: The Harvesting Effect
2 Parametric Regression
2.1 Introduction
2.2 Linear Regression Models
2.3 Regression Diagnostics
2.4 Inference
2.5 Parametric Additive Models
2.6 Model Selection
2.7 Polynomial Regression Models
2.8 Nonlinear Regression
2.9 Transformations in Regression
2.10 Bibliographic Notes
2.11 Summary of Formulas
3 Scatterplot Smoothing
3.1 Introduction
3.2 Preliminary Ideas
3.3 Practical Implementation
3.4 Automatic Knot Selection
3.5 Penalized Spline Regression
3.6 Quadratic Spline Bases
3.7 Other Spline Models and Bases
3.8 Other Penalties
3.9 General Definition of a Penalized Spline
3.10 Linear Smoothers
3.11 Error of a Smoother
4 Mixed Models
4.1 Introduction
4.2 Mixed Models
4.3 Prediction
4.4 The Linear Mixed Model (LMM)
4.5 Estimation and Prediction in LMM
4.6 Estimated BLUP (EBLUP)
4.7 Standard Error Estimation
4.8 Hypothesis Testing
4.9 Penalized Splines as BLUPs
4.10 Bibliographical Notes
4.11 Summary of Formulas
6 Inference
6.1 Introduction
6.2 Variability Bands
6.3 Confidence and Prediction Intervals
6.4 Inference for Penalized Splines
6.5 Simultaneous Confidence Bands
6.6 Testing the Adequacy of Parametric Models
6.7 Testing for No Effect
6.8 Inference Using First Derivatives
6.9 Testing for Existence of a Feature
6.10 Bibliographical Notes
6.11 Summary of Formulas
18 Analyses
18.1 Cancer Rates on Cape Cod
18.2 Assessing the Carcinogenicity of Phenolphthalein
18.3 Salinity and Fishing in North Carolina
18.4 Management of a Retirement Fund
18.5 Biomonitoring of Airborne Mercury
18.6 Term Structure of Interest Rates
18.7 Air Pollution and Mortality in Milan: The Harvesting Effect
19 Epilogue
19.1 Introduction
19.2 Minimalist Statistics
19.3 Some Omitted Topics
19.4 Future Research
Bibliography
Author Index
Notation Index
Example Index
Subject Index
1 Introduction
lung cancer, relative to cancer, while controlling for each of the other two
variables. Smoking status is a binary variable, so its effect can be modeled through
a single parameter. This is the simplest type of parametric modeling. The graphic
shows an odds ratio estimated to be in the range 11 to 33. Age is a continuous
variable and, in this instance, its effect can be modeled reasonably well using
parametric regression techniques. However, the nonparametric estimate shown
in the middle panel suggests an unusual type of nonlinearity and so nonparametric
regression techniques may lead to an improved fit. The effect of geography is
difficult to model using traditional parametric models, and the map in Figure 1.2
is the result of a bivariate nonparametric regression technique. It clearly shows

Margin note: The odds ratio of an event A, relative to an event B, is defined to be
the ratio of the odds of A to the odds of B. The odds of A is the probability of A
occurring divided by the probability of A not occurring.
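The definition of the odds ratio given in the margin note can be illustrated with a minimal sketch. The function names and the two probabilities below are hypothetical, chosen only to make the arithmetic easy to follow; they are not taken from the Cape Cod data.

```python
def odds(p: float) -> float:
    """Odds of an event with probability p: p divided by (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def odds_ratio(p_a: float, p_b: float) -> float:
    """Odds ratio of event A relative to event B: odds(A) / odds(B)."""
    return odds(p_a) / odds(p_b)


# Hypothetical probabilities, for illustration only.
p_a, p_b = 0.75, 0.5
print("odds of A:", odds(p_a))
print("odds of B:", odds(p_b))
print("odds ratio of A relative to B:", odds_ratio(p_a, p_b))
```

An odds ratio above 1 means that A has higher odds than B; an odds ratio in the range 11 to 33, as quoted for smoking status, is a very large effect on this scale.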
1.1 Assessing the Carcinogenicity of Phenolphthalein
Figure 1.3 Estimated probability of mammary tumor, leukemia, pituitary tumor, and
thyroid tumor as a function of weight for a set of NTP historical controls. The shaded
region represents plus and minus twice the estimated (pointwise) ...
(Panels: mammary tumor, leukemia, pituitary tumor, thyroid tumor. Horizontal axis:
weight; vertical axis: prob(tumors|weight).)
controls. It is apparent from these plots that nonlinear relationships exist and that
semiparametric models for incorporation of weight data would be beneficial.
Figure 1.4 Scatterplot matrix of the salinity data. (Variables: salinity, lagged
salinity, discharge.)
Figure 1.5 Scatterplot of residuals from the regression of salinity on lagged.sal.
A scatterplot smooth has been added. Note ... (Horizontal axis: discharge; vertical
axis: salinity residual.)
Figure 1.6 Estimated effect of salary on contribution to the logarithm of year-end
contributions in a semiparametric regression analysis. The shaded region represents
plus and minus ... (Vertical axis: effect on mean year end contributions.)
BRI be able to estimate the year-end dollar amount contributed to each plan in
advance so that it can make internal revenue and cost projections.
Apart from building a prediction model for year-end contributions, there are
some other managerial questions that can be addressed using these data. For ex-
ample, BRI has a sales representative who has been specifically trained to deal
exclusively with 401(k) retirement plans. The company would like to know if her
expertise is a factor that influences contributions to such retirement plans.
Figure 1.6 shows the effect of salary (average salary of each firm) on the
logarithm of year-end contributions as estimated by a semiparametric regression
[Figure: plot over UTM East and UTM North coordinates, with the location of the
INCINERATOR marked.]
1.5 Term Structure of Interest Rates
and interest according to a schedule. At the time of expiration of the bond, which
is called the maturity, the bond holder receives a payment called the par value.
There are two general classes of bonds, coupon bonds and zero-coupon bonds.
At fixed periods, often every six months, the holder of a coupon bond receives
a coupon payment. Generally, coupon bonds sell at a price near their par value.
The par payment at maturity is a repayment of principal while the coupon payments
are interest. Zero-coupon bonds have no coupon payments and sell below
par. The par payment at maturity represents principal and interest.
Frequently, the initial owner of the bond will sell the bond to another investor.
The current price at which bonds trade depends upon the current interest rates.
For example, suppose a corporate coupon bond with a 5% coupon rate is issued
with the initial price equal to par, so that the coupon payments are 5% of the initial
price. If the prevailing interest rate increases to 6% then the price of the bond
will drop, so that a new purchaser of the bond will in effect receive a 6% rate.
The interest rates on bonds depend upon their maturities, with long-term bonds
frequently (though not always) paying higher rates than short-term bonds. For
example, on January 26, 2001, the rate on a 1-year Treasury bill was 4.83% whereas
the rate on a 30-year Treasury bond was 6.11%. The term structure of interest
rates is a quantitative description of the dependency of rate upon maturity. The
estimation of term structure is essential for financial analysts working, for example,
with credit derivatives.

Margin note: A financial derivative is a security whose value depends on the value
of other underlying securities. As an example of a derivative, consider a call option
on a stock. A call option gives the owner the right, but not the obligation, to
purchase a share of stock at a fixed price on a given date, called the expiration date.
The value of the call option depends on the price of the underlying stock and on such
other variables as the time left until expiration. An example of an interest rate
derivative is a cap. If an interest rate exceeds the cap, then the owner of the cap
is paid the difference between the interest rate and the cap. Clearly, the value of
the cap depends on the underlying interest rate. A company paying interest at a
floating rate might purchase a cap as insurance against rate increases.

Interest rates not only depend upon the maturity, but for any fixed maturity, the
interest rate on bonds with that maturity will change over time. In this case study,
we are not concerned with such changes. Rather, we will only be concerned with
how interest rates on a given day depend on maturity. Specifically, in our example,
we will model bond interest rates on December 31, 1995.
We will work with continuously compounded interest rates. As an illustration,
we will start with an unrealistic assumption that the interest rate is constant, that
is, not dependent on maturity. If a bond is worth P(t) dollars at time t and is
continuously compounded at a constant rate r, then P(t) satisfies the simple
differential equation
$$P'(t) = rP(t) \tag{1.1}$$
and so, at maturity T,
$$P(T) = P(0)\exp(rT). \tag{1.2}$$
The rate r is called the forward rate. It is the rate agreed upon at present for
interest in the future, that is, forward in time.
Interest rates must be inferred from bond prices. Recall that the bond's value
at maturity, P(T), is called the par value. Hence, from (1.2) we have
$$P(0) = \text{par}\,\exp(-rT), \tag{1.3}$$
where par is the par value. Suppose a 1-year, par $100 zero-coupon bond is selling
now for $92. This means we can buy the bond now for $92 and receive $100
exactly one year from now. Recall that zero-coupon means the bond holder receives
no interest payments until maturity. The $8 difference between the present
price and the par value is the only interest payment. Here we have T = 1, P(1) =
par = 100, P(0) = 92, and, from (1.2),
$$92 = 100\exp(-r)$$
or
$$r = \log(100/92) = 0.0834.$$
Thus, the annual continuously compounded interest rate over the next year is
8.34%.
Suppose, in addition, that a 2-year, par $100 zero-coupon bond sells for $83.
We assume that this bond pays the just-determined rate of 8.34% the first year
but a different interest rate the next year. The rate for the second year, call it r_2,
solves
$$83 = 100\exp\{-(0.0834 + r_2)\}$$
or
$$r_2 = \log(100/83) - 0.0834 = 0.1029.$$
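The year-by-year calculation above can be sketched in a few lines of Python. This is only an illustration under the text's assumption that the forward rate is constant within each year; the function name `yearly_forward_rates` is hypothetical and not part of any library.

```python
import math


def yearly_forward_rates(prices, par=100.0):
    """Continuously compounded rate for each successive year.

    prices[k] is the current price of a zero-coupon bond maturing in k + 1
    years. The forward rate is assumed constant within each year, so
    log(par / price) is the sum of the yearly rates up to that maturity.
    """
    rates = []
    cumulative = 0.0
    for price in prices:
        total = math.log(par / price)      # r_1 + ... + r_k
        rates.append(total - cumulative)   # r_k alone
        cumulative = total
    return rates


# The two bonds from the text: 1-year at $92 and 2-year at $83, both par $100.
r1, r2 = yearly_forward_rates([92.0, 83.0])
print(f"first-year rate:  {r1:.4f}")
print(f"second-year rate: {r2:.4f}")
```

The sketch subtracts the unrounded first-year rate rather than 0.0834, so its second-year rate can differ from the hand calculation in the fifth decimal place.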
Table 1.2 gives the prices on December 31, 1995, of five bonds previously issued
by the U.S. communications company AT&T and maturing at some time after
that date. These are the prices at which the bonds were traded, that is, purchased
by one investor from another. Each bond price is expressed as a percentage of
par, the amount AT&T will pay the bond owner at maturity. The maturity is given
in years from December 31, 1995. The bonds make semiannual interest payments
called coupons. The time in years of the next coupon and the coupon payments
are given in the table. The aim is to determine the forward rate of AT&T bonds
from these data.
We have been assuming that the forward interest rate is constant over each
year. Clearly, this is an oversimplification. Financial analysts model the forward
interest rate as a continuous function of time, r(t). If P(T) is the par value of a
zero-coupon bond maturing at time T and if P(0) is the current price of the bond,
then (1.1) is replaced by
$$P'(t) = r(t)P(t),$$
with solution
$$P(0) = P(T)\exp\left\{-\int_0^T r(x)\,dx\right\}. \tag{1.4}$$
The problem is to estimate r(t) from bond prices, such as those shown in Table
1.2. A further complication is that many bonds, including those in the table, have
coupons. A coupon bond can be modeled as a bundle of zero-coupon bonds, one
for each coupon payment and one for the final payment at maturity of the par
value. The bond price is the aggregate price of all of these zero-coupon bonds. Bond
prices such as in Table 1.2 have some random error since, for example, they are
really prices at last transaction, not exactly at the current time. Therefore, the
estimation of the forward rate curve is a statistical problem. Fisher, Nychka, and
Zervos (1994) have developed a very elegant spline method for estimating the
forward rate curve. Their method works well for Treasury bond data because there
are enough Treasury bonds to estimate a continuous forward rate.

Margin note: A forward price is a price negotiated at the present for the future
delivery of some commodity. A forward interest rate means an interest rate that is
agreed upon now for a loan in the future.
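To make (1.4) and the bundle-of-zero-coupon-bonds view concrete, here is a minimal sketch that prices a coupon bond for a given forward rate function. The forward rate curve, the coupon amount, and the function names are assumptions made for illustration; the integral is approximated by the trapezoidal rule, and nothing here is the estimation method of Fisher, Nychka, and Zervos.

```python
import math


def discount_factor(forward_rate, T, n_steps=1000):
    """exp(-integral from 0 to T of r(x) dx), via the trapezoidal rule."""
    if T == 0:
        return 1.0
    h = T / n_steps
    total = 0.5 * (forward_rate(0.0) + forward_rate(T))
    for k in range(1, n_steps):
        total += forward_rate(k * h)
    return math.exp(-h * total)


def coupon_bond_price(forward_rate, par, coupon, payment_times):
    """Price a coupon bond as a bundle of zero-coupon bonds.

    Each coupon is a zero-coupon bond maturing at its payment time; the par
    value is one more zero-coupon bond maturing at the last payment time.
    """
    price = sum(coupon * discount_factor(forward_rate, t) for t in payment_times)
    price += par * discount_factor(forward_rate, payment_times[-1])
    return price


def example_rate(t):
    """Hypothetical forward rate curve, rising slowly with time."""
    return 0.05 + 0.002 * t


# Hypothetical 2-year bond with semiannual coupons of 3 per 100 of par.
times = [0.5 * k for k in range(1, 5)]
price = coupon_bond_price(example_rate, par=100.0, coupon=3.0, payment_times=times)
print(f"price as a percentage of par: {price:.2f}")
```

Estimating r(t) reverses this calculation: given observed prices, one looks for a forward rate curve whose implied prices are close to them.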
For corporate bonds, there is often a paucity of data and so the method of Fisher
and colleagues cannot be applied directly. Jarrow, Ruppert, and Yu (2001) extend
the model of Fisher et al. by assuming that the forward rate for a corporation such
as AT&T differs from the Treasury forward rate by a constant or, perhaps, by a
low-degree polynomial function of time. The corporate forward rate is greater
than the Treasury rate, since Treasury bonds have no risk of default; the U.S. Trea-
sury can always raise money by taxation. The difference between the two rates
is called the risk premium or spread and reflects the extra interest that investors
demand when buying corporate bonds (which may default) rather than risk-free
Treasury bonds. The model of Jarrow and colleagues is semiparametric in that
the Treasury forward rate is modeled as a spline, but the risk premium is mod-
eled parametrically. This case study is typical of semiparametric models in that
parts of the model for which there is much data are modeled nonparametrically
while parts that are not well supported by data are modeled parametrically.
Figure 1.9 shows the prices of U.S. STRIPS (Separate Trading of Registered
Interest and Principal of Securities), a type of zero-coupon Treasury bond. The
prices are expressed as a percentage of the par value and are plotted against time
to maturity. If r(x) is constant, say r(x) = r_0 for all x, then by (1.4) we have
$$y_i = 100\exp(-r_0 T_i) \tag{1.5}$$
and
$$\log(y_i) = \log(100) - r_0 T_i. \tag{1.6}$$
Here P(T_i) is the par, P(0) is the present price, y_i = 100 P(0)/P(T_i) is the
response, and T_i is the maturity for the ith U.S. STRIPS.

Figure 1.9 (Horizontal axis: time to maturity; vertical axis: price.)
The rough exponential shape in Figure 1.9 suggests that model (1.5) is at least
approximately correct. However, in Figure 1.10 we see log(y_i) plotted against
T_i, and the plot is not quite the straight line that (1.6) suggests. In fact, we fit
a straight line to {(T_i, log(y_i))}, i = 1, ..., n, and plotted the residuals, which are
the differences between the log(y_i) and the fitted line. This plot, shown as Figure 1.11,
shows an obvious deviation from the random cloud that we would expect if the
model (1.5) fit the data, thus indicating the need for a nonparametric model. The
fitting of straight line models and residual analysis will be discussed in Chapter 2.
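The straight-line check described above can be sketched as follows. The prices here are synthetic, generated from a hypothetical forward rate that rises with maturity; they are not the STRIPS data, and the helper `fit_line` is an ordinary least-squares fit written out by hand for illustration.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y against x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Synthetic log prices (an assumption, not real data): the forward rate
# r(x) = 0.05 + 0.001 x gives log(y) = log(100) - 0.05 T - 0.0005 T^2.
maturities = [float(t) for t in range(1, 31)]
log_prices = [math.log(100.0) - 0.05 * t - 0.0005 * t ** 2 for t in maturities]

intercept, slope = fit_line(maturities, log_prices)
residuals = [y - (intercept + slope * t) for t, y in zip(maturities, log_prices)]

# Under model (1.6) the residuals would be patternless; here they are negative
# at both ends and positive in the middle, a systematic curve.
for t, res in zip(maturities[::5], residuals[::5]):
    print(f"T = {t:4.0f}   residual = {res:+.4f}")
```

A systematic pattern of this kind in the residuals is what signals that a constant forward rate, and hence a straight line on the log scale, does not describe the data.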
Figure 1.10 Logarithms of U.S. STRIPS prices as a percentage of the par value.
(Horizontal axis: time to maturity; vertical axis: log(price).)
Figure 1.11 (Horizontal axis: time to maturity; vertical axis: residual.)
1.6 Air Pollution and Mortality in Milan: The Harvesting Effect
Figure 1.12 is a schematic representation of the dynamics that arise when air
pollution has an impact on mortality. The risk pool consists of sick and elderly
people. Transitions between this state and the general population are affected by
air pollution levels.
Consider the following lagged regression model of air pollution and generic
mortality:
$$\log\{E(\text{mortality}_t)\} = \alpha + \beta_0\,\text{pollution}_t + \cdots + \beta_q\,\text{pollution}_{t-q} + \varepsilon_t,$$
where mortality_t and pollution_t are (respectively) the mortality count and
pollution level for day t. The lag structure in Figure 1.13 describes the so-called
harvesting effect. The horizontal axis is the lag number and the vertical axis shows
the coefficients β_ℓ. Each β_ℓ has this interpretation: net effect of pollution level ℓ
days ago on mortality.

Figure 1.12 Schematic representation of the dynamics that arise when air pollution
has an impact on mortality. (Diagram labels: Air Pollution, General Population,
Risk Pool, Death.)
Figure 1.13 Lag structure corresponding to the harvesting effect. (Horizontal axis:
lag; vertical axis: coefficients; regions labeled A and B.)
In the figure, A is the sum of the positive coefficients for low lags and repre-
sents the fact that pollution levels in the past few days or weeks have a positive
effect on mortality. However, the negative coefficients in B mean that pollution
levels a longer period ago have a negative effect. This is due to depletion of
the risk pool, normally made up of elderly and sick people whose deaths have
been hastened a few days or weeks by episodes of high pollution; this is known
as harvesting. Here A overestimates the public health significance of pollution,
since it is really A + B (where B is negative) that represents deaths hastened by a
noticeable amount of time.
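The roles of A, B, and A + B can be illustrated with a small sketch. The lag coefficients below are made up to mimic the shape of a harvesting lag structure (positive at short lags, negative at longer lags); they are not estimates from the Milan data, and the function name is hypothetical.

```python
def harvesting_summary(coefficients):
    """Return (A, B, A + B) for a sequence of lag coefficients.

    A is the sum of the positive coefficients, B is the sum of the negative
    coefficients, and A + B is the net effect once harvesting is accounted for.
    """
    a = sum(c for c in coefficients if c > 0)
    b = sum(c for c in coefficients if c < 0)
    return a, b, a + b


# Hypothetical coefficients for lags 0, 1, ..., 10 (illustration only).
lag_coefficients = [0.8, 0.6, 0.4, 0.2, 0.05, -0.05, -0.1, -0.15, -0.1, -0.05, 0.0]

a, b, net = harvesting_summary(lag_coefficients)
print(f"A (short-lag increase):      {a:.2f}")
print(f"B (later depletion):         {b:.2f}")
print(f"A + B (net effect):          {net:.2f}")
```

With these made-up numbers, reporting A alone would overstate the net effect by the size of B, which is the point made in the text.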
Daily data over 10 years are available on mortality, air pollution, and several
meteorological variables for the city of Milan, Italy. It is of interest to use these to
quantify the public health significance of air pollution, incorporating the harvesting
effect. By constraining the lag coefficients to be on a smooth (but otherwise
flexible) curve, we obtained Figure 1.14. This suggests some evidence of harvesting.
The construction of this result required some nonstandard semiparametric
regression techniques that allowed for the lag coefficients to lie on a smooth curve
and also be influenced by data on daily weather conditions.

Figure 1.14 Estimates of the coefficients of the lags of sulphur dioxide on mortality
in Milan, Italy. The shaded points are plus and minus 2 times the estimated standard
error of each coefficient estimate. (Horizontal axis: lag number; vertical axis:
est. coefficients.)
Chapter 18 provides much fuller analyses and solutions for a selection of the prob-
lems presented in this chapter. Between now and then we will need to describe
techniques for performing semiparametric regression analysis. The next chapter
signals the start of this journey.