Notes for Predictive Modeling
published by
bookdown.org/egarpor/pm-uc3m
Licensed under the Creative Commons License version 4.0 under the terms of Attribution, Non-Commercial, and No-Derivatives (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://creativecommons.org/licenses/by-nc-nd/4.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "as is" basis, without warranties or conditions of any kind, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Preface
1 Introduction
1.1 Course overview
1.2 What is predictive modelling?
1.3 General notation and background
1.4 Scripts and datasets
4 Linear models III: shrinkage, multivariate response, and big data
4.1 Shrinkage
4.2 Constrained linear models
4.3 Multivariate multiple linear model
4.4 Big data considerations
B Software
B.1 Installation of R and RStudio
B.2 Introduction to RStudio
B.3 Introduction to R
C References
Preface
Welcome
Welcome to the notes for Predictive Modeling for the course 2019/2020.
The subject is part of the MSc in Big Data Analytics from Carlos III
University of Madrid.
The course is designed to have, roughly, one lesson per main topic in the syllabus. The schedule is tight due to time constraints, which will inevitably make the treatment of certain methods a little superficial compared with what would be optimal. Nevertheless, the course will hopefully give you a respectable panoramic view of the different statistical methods available for predictive modelling. A broad view of the syllabus and its planning is:
• The office hours are Thursdays from 12:45 to 13:45, at the classroom in which the session took place. Making use of them is the fastest way for me to clarify your doubts.
• Questions and comments during lectures are most welcome. So just go ahead and fire! Particularly if these are clarifications, comments, or alternative perspectives that may help the rest of the class.
• Detailed course evaluation guidelines can be found in Aula Global. Recall that participation in lessons is positively evaluated.
Contributions
Contributions, reporting of typos, and feedback on the notes are very welcome. Just send an email to edgarcia@est-econ.uc3m.es and give me a reason for writing your name in the list of contributors!
License
All the material in these notes is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License (CC BY-NC-ND 4.0). You may not use this material except in compliance with the aforementioned license. The human-readable summary of the license states that:
These notes contain both the theory and practice for the statistical methods presented in the course. The emphasis is placed on building intuition behind the methods, gaining insights into their properties, and showing their application through the use of statistical software. The topics we will cover are an in-depth analysis of linear models, their extension to generalized linear models, and an introduction to nonparametric regression.
We will employ several packages that are not included by default with R (the list may be updated during the course). These can be installed as:
# Installation of required packages
packages <- c("MASS", "car", "readxl", "rgl", "nortest", "latex2exp", "pca3d",
              "ISLR", "pls", "corrplot", "glmnet", "mvtnorm", "biglm", "leaps",
              "lme4", "viridis", "ffbase", "ks", "KernSmooth", "nor1mix", "np")
              # (the list continues in the full notes)
install.packages(packages)
Y = m( X1 , . . . , X p ) + ε, (1.1)
that the energy and speed have a quadratic relationship, and there-
fore we may assume that Y and X are truly quadratically-related for
the sake of exposition:
Y = a + bX + cX² + ε.
We can obtain the estimates of a, b, and c by fitting a quadratic model (note that we use the information that m has to be of a particular form, in this case quadratic, which is an unrealistic situation for other data applications):

# Estimates for a, b, and c
lm(y ~ x + I(x^2))
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Coefficients:
## (Intercept)            x       I(x^2)
##    8.512421    -0.153291     0.001408

Figure 1.1: Scatterplot of fuel consumption vs. speed.

Then the estimate of m is m̂(x) = â + b̂x + ĉx² = 8.512 − 0.153x + 0.001x², and its fit to the data is pretty good. As a consequence, we can use this precise mathematical function to predict Y from a particular observation X. For example, the estimated fuel consumption at speed 90 km/h is 8.512421 - 0.153291 * 90 + 0.001408 * 90^2 = 6.1210.
Consider the random vector X = (X1, . . . , Xp). The (joint) cdf of X is

F(x) := P[X ≤ x] := P[X1 ≤ x1, . . . , Xp ≤ xp]

(here X ≤ x means X1 ≤ x1 and . . . and Xp ≤ xp) and, if X is continuous, its (joint) pdf is f := ∂^p F / (∂x1 · · · ∂xp).

The marginals of F and f are the cdf and pdf of Xj, j = 1, . . . , p,

F_{Xj}(xj) := P[Xj ≤ xj] = F(∞, . . . , ∞, xj, ∞, . . . , ∞),

f_{Xj}(xj) := ∂F_{Xj}(xj)/∂xj = ∫_{R^{p−1}} f(x) dx_{−j},
• Joint pdf: φ(x1, x2; µ1, µ2, σ1², σ2², ρ).

• Marginal pdfs: f_{X1}(x1) = ∫ φ(x1, t2; µ1, µ2, σ1², σ2², ρ) dt2 = φ(x1; µ1, σ1²), and analogously f_{X2}(x2) = φ(x2; µ2, σ2²); that is, X1 ∼ N(µ1, σ1²) and X2 ∼ N(µ2, σ2²).
• Conditional pdfs:

f_{X1 | X2 = x2}(x1) = f(x1, x2) / f_{X2}(x2) = φ(x1; µ1 + ρ (σ1/σ2)(x2 − µ2), (1 − ρ²)σ1²),

f_{X2 | X1 = x1}(x2) = φ(x2; µ2 + ρ (σ2/σ1)(x1 − µ1), (1 − ρ²)σ2²).

Hence

X1 | X2 = x2 ∼ N(µ1 + ρ (σ1/σ2)(x2 − µ2), (1 − ρ²)σ1²),

X2 | X1 = x1 ∼ N(µ2 + ρ (σ2/σ1)(x1 − µ1), (1 − ρ²)σ2²).

• Conditional expectations:

E[X1 | X2 = x2] = µ1 + ρ (σ1/σ2)(x2 − µ2),

E[X2 | X1 = x1] = µ2 + ρ (σ2/σ1)(x1 − µ1).

• Joint cdf:

∫_{−∞}^{x2} ∫_{−∞}^{x1} φ(t1, t2; µ1, µ2, σ1², σ2², ρ) dt1 dt2.

• Marginal cdfs: ∫_{−∞}^{x1} φ(t; µ1, σ1²) dt =: Φ(x1; µ1, σ1²) and Φ(x2; µ2, σ2²).

• Conditional cdfs:

∫_{−∞}^{x1} φ(t; µ1 + ρ (σ1/σ2)(x2 − µ2), (1 − ρ²)σ1²) dt = Φ(x1; µ1 + ρ (σ1/σ2)(x2 − µ2), (1 − ρ²)σ1²)

and Φ(x2; µ2 + ρ (σ2/σ1)(x1 − µ1), (1 − ρ²)σ2²).
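These closed-form conditional distributions are easy to check numerically. Below is a minimal sketch using the mvtnorm package listed above; the parameter values and the conditioning value x2 = 3 are illustrative choices, not taken from the notes.

# Empirical check of E[X1 | X2 = x2] for a bivariate normal
library(mvtnorm)
mu <- c(1, 2); sigma1 <- 1; sigma2 <- 2; rho <- 0.7
Sigma <- matrix(c(sigma1^2, rho * sigma1 * sigma2,
                  rho * sigma1 * sigma2, sigma2^2), nrow = 2)
set.seed(1)
X <- rmvnorm(n = 1e5, mean = mu, sigma = Sigma)
x2 <- 3
mean(X[abs(X[, 2] - x2) < 0.05, 1])                # Local average of X1 around X2 = x2
mu[1] + rho * (sigma1 / sigma2) * (x2 - mu[2])     # Theoretical conditional mean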
The scripts employed in each chapter are the following; they can be opened in your browser:
• Chapter 1: 01-intro.R.
• Chapter 2: 02-lm-i.R.
• Chapter 3: 03-lm-ii.R.
• Chapter 4: 04-lm-iii.R.
• Chapter 5: 05-glm.R. Generation of Figures 5.12–5.23: hypothesisGlm.R.
• Chapter 6: 06-npreg.R.
• Appendices A and B: 07-appendix.R.
Y = β 0 + β 1 X1 + β 2 X2 + . . . + β p X p + ε
Y = β0 + β1 X + ε
Calculate the winter rain and the harvest rain (in millimeters). Add summer heat in the vineyard (in degrees centigrade). Subtract 12.145. And what do you have? A very, very passionate argument over wine.

— "Wine Equation Puts Some Noses Out of Joint", The New York Times, 04/03/1990.

… the aged wine is in the market. Therefore, being able to predict the quality of a vintage is valuable information for investing resources, for determining a fair price for vintages, and for understanding what factors are affecting the wine quality. The purpose of this case study is to answer:
Y = β0 + β1 X + ε (2.1)
E[Y | X = x ] = β 0 + β 1 x, (2.2)
since E[ε| X = x ] = 0.
The Left Hand Side (LHS) of (2.2) is the conditional expectation of
Y given X. It represents how the mean of the random variable Y
is changing according to a particular value x of the random vari-
able X. With the RHS, what we are saying is that the mean of Y is
changing in a linear fashion with respect to the value of X. Hence
the clear interpretation of the coefficients:
It can be seen that the minimizers of the RSS are

β̂0 = Ȳ − β̂1 X̄,   β̂1 = s_xy / s_x²,   (2.3)

where s_xy is the sample covariance of X and Y and s_x² is the sample variance of X. (The minimizers are unique and always exist, except if s_x = 0, that is, when all the Xi are the same. They can be obtained by solving ∂RSS(β0, β1)/∂β0 = 0 and ∂RSS(β0, β1)/∂β1 = 0.)
# x and eps are assumed to have been generated previously (e.g., as samples
# from a N(0, 1)); that code is not shown in this extract
# Responses
yLin <- -0.5 + 1.5 * x + eps
yQua <- -0.5 + 1.5 * x^2 + eps
yExp <- -0.5 + 1.5 * 2^x + eps
# Data
leastSquares <- data.frame(x = x, yLin = yLin, yQua = yQua, yExp = yExp)
# Call lm
lm(yLin ~ x, data = leastSquares)
##
## Call:
## lm(formula = yLin ~ x, data = leastSquares)
##
## Coefficients:
## (Intercept) x
## -0.6154 1.3951
lm(yQua ~ x, data = leastSquares)
##
## Call:
## lm(formula = yQua ~ x, data = leastSquares)
##
## Coefficients:
## (Intercept) x
## 0.9710 -0.8035
lm(yExp ~ x, data = leastSquares)
##
## Call:
## lm(formula = yExp ~ x, data = leastSquares)
##
## Coefficients:
## (Intercept) x
## 1.270 1.007
# The lm object
mod <- lm(yLin ~ x, data = leastSquares)
mod
##
## Call:
## lm(formula = yLin ~ x, data = leastSquares)
##
## Coefficients:
## (Intercept) x
## -0.6154 1.3951
names(mod)
## [1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr"
mod$coefficients
## (Intercept)           x
##  -0.6153744   1.3950973

# We can produce a plot with the linear fit easily
plot(x, yLin)
abline(coef = mod$coefficients, col = 2)

Figure 2.4: Linear fit for the data employed in Figure 2.3, minimizing the RSS.
Check that you can not improve the error in Figure 2.3
when using the coefficients given by lm, if vertical dis-
tances are selected. Check also that these coefficients are
only optimal for vertical distances.
# Covariance
Sxy <- cov(x, yLin)
# Variance
Sx2 <- var(x)
# Coefficients
beta1 <- Sxy / Sx2
beta0 <- mean(yLin) - beta1 * mean(x)
c(beta0, beta1)
## [1] -0.6153744 1.3950973
# Output from lm
mod <- lm(yLin ~ x, data = leastSquares)
mod$coefficients
## (Intercept) x
## -0.6153744 1.3950973
# Read data
wine <- read.table(file = "wine.csv", header = TRUE, sep = ",")
# Numerical summary (the column names are inferred from the values shown)
summary(wine)
##       Year          Price        WinterRain         AGST        HarvestRain         Age          FrancePop
##  Min.   :1952   Min.   :6.205   Min.   :376.0   Min.   :14.98   Min.   : 38.0   Min.   : 3.00   Min.   :43184
##  1st Qu.:1960   1st Qu.:6.508   1st Qu.:543.5   1st Qu.:16.15   1st Qu.: 88.0   1st Qu.: 9.50   1st Qu.:46856
##  Median :1967   Median :6.984   Median :600.0   Median :16.42   Median :123.0   Median :16.00   Median :50650
##  Mean   :1967   Mean   :7.042   Mean   :608.4   Mean   :16.48   Mean   :144.8   Mean   :16.19   Mean   :50085
##  3rd Qu.:1974   3rd Qu.:7.441   3rd Qu.:705.5   3rd Qu.:17.01   3rd Qu.:185.5   3rd Qu.:22.50   3rd Qu.:53511
##  Max.   :1980   Max.   :8.494   Max.   :830.0   Max.   :17.65   Max.   :292.0   Max.   :31.00   Max.   :55110
Figure: scatterplot matrix of the wine dataset variables (Year, Price, AGST, HarvestRain, Age, FrancePop, and WinterRain).
# Price ~ AGST
modAGST <- lm(Price ~ AGST, data = wine)
sumModAGST <- summary(modAGST)
# R^2
sumModAGST$r.squared
## [1] 0.4455894
Predictor R2
AGST 0.4456
HarvestRain 0.2572
FrancePop 0.2314
Age 0.2120
WinterRain 0.0181
It seems that none of these simple linear models is able to explain Price properly on its own. Intuitively, it would make sense to bind them together to achieve a better explanation of Price. Let's see how to do that with a more advanced model.
Y = β 0 + β 1 X1 + . . . + β p X p + ε (2.4)
E [ Y | X1 = x 1 , . . . , X p = x p ] = β 0 + β 1 x 1 + . . . + β p x p , (2.5)
since E[ε| X1 = x1 , . . . , X p = x p ] = 0.
The LHS of (2.5) is the conditional expectation of Y given X1 , . . . ,
X p . It represents how the mean of the random variable Y is chang-
ing, now according to particular values of several predictors. With
the RHS, what we are saying is that the mean of Y is changing in
a linear fashion with respect to the values of X1 , . . . , X p . Hence the
neat interpretation of the coefficients:
X := ( 1  X11  · · ·  X1p
       ⋮   ⋮    ⋱     ⋮
       1  Xn1  · · ·  Xnp ),   an n × (p + 1) matrix.

With this notation, the RSS for the multiple linear regression is

RSS(β) := ∑_{i=1}^n (Yi − β0 − β1 Xi1 − . . . − βp Xip)²
        = (Y − Xβ)′(Y − Xβ).   (2.6)

The RSS aggregates the squared vertical distances from the data to a regression plane given by β. The least squares estimators are the minimizers of the RSS,

β̂ = (X′X)⁻¹ X′Y.   (2.7)

(They are unique and they always exist, provided that rank(X′X) = p + 1.)
Let’s check that indeed the coefficients given by lm are the ones
given by (2.7). For that purpose we consider the leastSquares3D
data frame in the least-squares-3D.RData dataset. Among other
variables, the data frame contains the response yLin and the predic-
tors x1 and x2.
load(file = "least-Squares-3D.RData")
# Matrix X
X <- cbind(1, leastSquares3D$x1, leastSquares3D$x2)
# Vector Y
Y <- leastSquares3D$yLin
# Coefficients
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
# %*% multiplies matrices
# solve() computes the inverse of a matrix
# t() transposes a matrix
beta
## [,1]
## [1,] -0.5702694
## [2,] 0.4832624
## [3,] 0.3214894
Once we have the least squares estimates β̂, we can define the
next concepts:
ε̂ i := Yi − Ŷi , i = 1, . . . , n.
They are the vertical distances between actual data and fitted
data.
# Residuals
mod$residuals
# A shortcut
modWine1 <- lm(Price ~ ., data = wine)
modWine1
##
## Call:
## lm(formula = Price ~ ., data = wine)
##
## Coefficients:
## (Intercept) WinterRain AGST HarvestRain Age FrancePop
## -2.343e+00 1.153e-03 6.144e-01 -3.837e-03 1.377e-02 -2.213e-05
# Summary
summary(modWine1)
##
## Call:
## lm(formula = Price ~ ., data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46541 -0.24133 0.00413 0.18974 0.52495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.343e+00 7.697e+00 -0.304 0.76384
i. Linearity: E[Y | X1 = x1, . . . , Xp = xp] = β0 + β1 x1 + . . . + βp xp.
ii. Homoscedasticity: Var[ε | X1 = x1, . . . , Xp = xp] = σ².
iii. Normality: ε ∼ N(0, σ²).
iv. Independence of the errors: ε1, . . . , εn are independent (or uncorrelated, E[εi εj] = 0, i ≠ j, since they are assumed to be normal).
Y |( X1 = x1 , . . . , X p = x p ) ∼ N ( β 0 + β 1 x1 + . . . + β p x p , σ2 ). (2.8)
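Equation (2.8) can be illustrated by simulating data that fulfills the four assumptions and checking that lm recovers the coefficients. A minimal sketch (the coefficient values, error standard deviation, and sample size are arbitrary choices for illustration):

# Simulate a sample fulfilling (2.8) and recover the coefficients
set.seed(42)
nSim <- 100
x1 <- rnorm(nSim)
x2 <- runif(nSim)
ySim <- 1 + 2 * x1 - x2 + rnorm(nSim, sd = 0.5)   # beta = (1, 2, -1), sigma = 0.5
coef(lm(ySim ~ x1 + x2))                          # Estimates close to (1, 2, -1)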
Recall that, except for assumption iv, the rest are expressed in terms of the random variables, not in terms of the sample. Thus they are population versions, rather than sample versions. It is however trivial to express (2.8) in terms of assumptions about the sample {(Xi, Yi)}_{i=1}^n:
This result can be obtained from the form of β̂ given in (2.7), the
sample version of the model assumptions given in (2.10), and the
linear transformation property of a normal given in (1.4). Equation
(2.11) implies that the marginal distribution of β̂ j is
β̂j ∼ N(βj, SE(β̂j)²),   (2.12)

or, equivalently,

(β̂j − βj) / SE(β̂j) ∼ N(0, 1),
with

SE(β̂0)² = (σ²/n) (1 + X̄²/s_x²),   SE(β̂1)² = σ² / (n s_x²).   (2.14)
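The expressions in (2.14) can be verified numerically by plugging in the estimate σ̂² for σ². Below is a sketch for the slope of the simple fit Price ~ AGST; here s_x² denotes the variance with denominator n, as in the notes, and the result is compared with the standard error reported by summary().

# Hand-made SE(beta1) vs. the one reported by summary()
modSimple <- lm(Price ~ AGST, data = wine)
n <- nrow(wine)
sigma2Hat <- sum(modSimple$residuals^2) / (n - 2)
sx2 <- mean((wine$AGST - mean(wine$AGST))^2)
sqrt(sigma2Hat / (n * sx2))
summary(modSimple)$coefficients["AGST", "Std. Error"]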
• Bias. Both estimates are unbiased. That means that their expecta-
tions are the true coefficients for any sample size n.
The insights about (2.11) are more convoluted and the following
broad remarks, extensions of what happened when p = 1, apply:
• Bias. All the estimates are unbiased for any sample size n.
where tn− p−1;α/2 is the α/2-upper quantile of the tn− p−1 . Usually,
α = 0.10, 0.05, 0.01 are considered.
For testing H0 : βj = 0, the test statistic is

(β̂j − 0) / ŜE(β̂j),

which is distributed as a t_{n−p−1} under (the veracity of) the null hypothesis. This is denoted as (β̂j − 0)/ŜE(β̂j) ∼ t_{n−p−1} under H0.
The null hypothesis H0 is tested against the alternative hypothesis,
H1 . If H0 is rejected, it is rejected in favor of H1 . The alternative hy-
pothesis can be bilateral (we will focus mostly on these alternatives),
such as
H0 : βj = 0   vs   H1 : βj ≠ 0

or unilateral, such as

H0 : βj = 0   vs   H1 : βj < 0 (or βj > 0).
• Yes → reject H0 .
• No → do not reject H0 .
# Fit
modWine1 <- lm(Price ~ ., data = wine)
# Summary
sumModWine1 <- summary(modWine1)
sumModWine1
##
## Call:
## lm(formula = Price ~ ., data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46541 -0.24133 0.00413 0.18974 0.52495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.343e+00 7.697e+00 -0.304 0.76384
## WinterRain 1.153e-03 4.991e-04 2.311 0.03109 *
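The t value and Pr(>|t|) columns of the output above can be reproduced by hand from the estimate and its standard error. A sketch for the WinterRain coefficient:

# Manual t-statistic and two-sided p-value for WinterRain
est <- sumModWine1$coefficients["WinterRain", "Estimate"]
se <- sumModWine1$coefficients["WinterRain", "Std. Error"]
tStat <- est / se
tStat
2 * pt(abs(tStat), df = modWine1$df.residual, lower.tail = FALSE)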
car::compareCoefs(modWine1, modWine2)
## Calls:
## 1: lm(formula = Price ~ ., data = wine)
## 2: lm(formula = Price ~ . - FrancePop, data = wine)
##
## Model 1 Model 2
## (Intercept) -2.34 -3.65
## SE 7.70 1.69
##
## WinterRain 0.001153 0.001167
## SE 0.000499 0.000482
##
## AGST 0.6144 0.6164
## SE 0.0980 0.0952
##
## HarvestRain -0.003837 -0.003861
## SE 0.000837 0.000808
##
## Age 0.01377 0.02385
## SE 0.05821 0.00717
##
## FrancePop -2.21e-05
## SE 1.27e-04
##
Note how the coefficients of modWine2 have smaller standard errors than those of modWine1.
The individual CIs for the unknown β j ’s can be obtained by
applying the confint function to an lm object. Let’s compute the
CIs for the model coefficients of modWine1, modWine2, and a new
model modWine3:
# Fit a new model
modWine3 <- lm(Price ~ Age + WinterRain, data = wine)
summary(modWine3)
##
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88964 -0.51421 -0.00066 0.43103 1.06897
##
## Coefficients:
Note that this does not mean that the coefficient will always be non-significant: in Price ~ Age + AGST + HarvestRain + WinterRain it is significant.

Y = β0 + β1 (X1 − E[X1]) + . . . + βp (Xp − E[Xp]) + ε.

# By default, scale() centers (subtracts the mean) and scales (divides by the
# standard deviation) the columns of a matrix
wineCen <- data.frame(scale(wine, center = TRUE, scale = FALSE))
# Centered version of modWine3 (this fit is assumed from context; it is not
# shown in this extract)
modWine3Cen <- lm(Price ~ Age + WinterRain, data = wineCen)
# Summary
summary(modWine3Cen)
##
## Call:
2.5 Prediction
The inspection of the CIs for the conditional mean and the conditional response in the simple linear model offers great insight into the previous similarities and differences, and also into which components precisely affect the quality of the prediction:

ŷ ± t_{n−2;α/2} √( (σ̂²/n) (1 + (x − x̄)²/s_x²) ),   (2.18)

ŷ ± t_{n−2;α/2} √( σ̂² + (σ̂²/n) (1 + (x − x̄)²/s_x²) ).   (2.19)
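The interval (2.18) can be computed by hand and compared with predict(). A sketch for the simple fit Price ~ AGST at an illustrative value x = 17; recall that s_x² has denominator n, so n s_x² = ∑(xi − x̄)².

# CI (2.18) for the conditional mean, by hand and with predict()
modSimple <- lm(Price ~ AGST, data = wine)
n <- nrow(wine)
x0 <- 17
sigma2Hat <- sum(modSimple$residuals^2) / (n - 2)
xBar <- mean(wine$AGST)
sx2 <- mean((wine$AGST - xBar)^2)
yHat <- predict(modSimple, newdata = data.frame(AGST = x0))
marg <- qt(0.975, df = n - 2) * sqrt(sigma2Hat / n * (1 + (x0 - xBar)^2 / sx2))
c(yHat - marg, yHat + marg)
predict(modSimple, newdata = data.frame(AGST = x0), interval = "confidence")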
# Fit a linear model for the price on WinterRain, HarvestRain, and AGST
modWine4 <- lm(Price ~ WinterRain + HarvestRain + AGST, data = wine)
summary(modWine4)
##
## Call:
## lm(formula = Price ~ WinterRain + HarvestRain + AGST, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62816 -0.17923 0.02274 0.21990 0.62859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.9506001 1.9694011 -2.514 0.01940 *
## WinterRain 0.0012820 0.0005765 2.224 0.03628 *
## HarvestRain -0.0036242 0.0009646 -3.757 0.00103 **
## AGST 0.7123192 0.1087676 6.549 1.11e-06 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.3436 on 23 degrees of freedom
## Multiple R-squared: 0.7407, Adjusted R-squared: 0.7069
## F-statistic: 21.9 on 3 and 23 DF, p-value: 6.246e-07
# Other levels (weather is a data frame with the values of WinterRain,
# HarvestRain, and AGST at which to predict; its definition is not shown here)
predict(modWine4, newdata = weather, interval = "confidence", level = 0.90)
## fit lwr upr
## 1 8.066342 7.774576 8.358108
predict(modWine4, newdata = weather, interval = "confidence", level = 0.99)
## fit lwr upr
## 1 8.066342 7.588427 8.544258
# Other levels
predict(modWine4, newdata = weather, interval = "prediction", level = 0.90)
## fit lwr upr
## 1 8.066342 7.409208 8.723476
predict(modWine4, newdata = weather, interval = "prediction", level = 0.99)
## fit lwr upr
## 1 8.066342 6.989951 9.142733
2.6 ANOVA
SST = SSR + SSE,   (2.20)

where SST measures the variation of the Yi's, SSR the variation of the Ŷi's, and SSE the variation of the ε̂i's.
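The decomposition (2.20) can be verified numerically for any fitted model with intercept. A quick sketch with an illustrative model on the wine dataset:

# Checking SST = SSR + SSE numerically
fitANOVA <- lm(Price ~ AGST + HarvestRain, data = wine)
SST <- sum((wine$Price - mean(wine$Price))^2)
SSR <- sum((fitANOVA$fitted.values - mean(wine$Price))^2)
SSE <- sum(fitANOVA$residuals^2)
c(SST = SST, SSRplusSSE = SSR + SSE)   # Both quantities coincide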
# Function returning a simplified ANOVA table (the opening lines are
# reconstructed from context: they build the sequential ANOVA table and
# aggregate the rows of the predictors)
simpleAnova <- function(object, ...) {

  # Compute the standard sequential ANOVA table
  tab <- anova(object, ...)

  # Number of predictors
  p <- nrow(tab) - 1

  # Aggregate the predictors' Df and Sum Sq, and add their Mean Sq
  predictorsRow <- colSums(tab[1:p, 1:2])
  predictorsRow <- c(predictorsRow, predictorsRow[2] / predictorsRow[1])

  # F-quantities
  Fval <- predictorsRow[3] / tab[p + 1, 3]
  pval <- pf(Fval, df1 = p, df2 = tab$Df[p + 1], lower.tail = FALSE)
  predictorsRow <- c(predictorsRow, Fval, pval)

  # Simplified table
  tab <- rbind(predictorsRow, tab[p + 1, ])
  row.names(tab)[1] <- "Predictors"
  return(tab)

}
# Models
modWine1 <- lm(Price ~ ., data = wine)
modWine2 <- lm(Price ~ . - FrancePop, data = wine)
# Simplified table
simpleAnova(modWine1)
## Analysis of Variance Table
##
## Response: Price
## Df Sum Sq Mean Sq F value Pr(>F)
## Predictors 5 8.6671 1.73343 20.188 2.232e-07 ***
## Residuals 21 1.8032 0.08587
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
simpleAnova(modWine2)
## Analysis of Variance Table
##
## Response: Price
## Df Sum Sq Mean Sq F value Pr(>F)
## Predictors 4 8.6645 2.16613 26.39 4.057e-08 ***
## Residuals 22 1.8058 0.08208
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
# The null hypothesis of no linear dependence is emphatically rejected in
# both models
2.7.1 The R²

R² := SSR/SST = SSR/(SSR + SSE) = SSR/(SSR + (n − p − 1)σ̂²).   (2.21)

(This is not the σ̂² in (2.21), but σ̂² is obviously dependent on σ².)
Ŷi = β̂ 0 + β̂ 1 Xi
= (Ȳ − β̂ 1 X̄ ) + β̂ 1 Xi
= Ȳ + β̂ 1 ( Xi − X̄ ). (2.22)
r²_{yŷ} = s²_{yŷ} / (s²_y s²_ŷ)
        = [∑_{i=1}^n (Yi − Ȳ)(Ŷi − Ȳ)]² / [∑_{i=1}^n (Yi − Ȳ)² ∑_{i=1}^n (Ŷi − Ȳ)²]
        = [∑_{i=1}^n (Yi − Ȳ)(Ȳ + β̂1(Xi − X̄) − Ȳ)]² / [∑_{i=1}^n (Yi − Ȳ)² ∑_{i=1}^n (Ȳ + β̂1(Xi − X̄) − Ȳ)²]
        = r²_{xy}.
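This identity is easy to check numerically in a simple linear regression. A sketch with Price ~ AGST as an illustrative model:

# R^2 equals the squared correlations r^2_{y,yhat} and r^2_{x,y}
fitSimple <- lm(Price ~ AGST, data = wine)
summary(fitSimple)$r.squared
cor(wine$Price, fitSimple$fitted.values)^2
cor(wine$Price, wine$AGST)^2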
# Great R^2!?
# (x and y were generated previously, following a clearly nonlinear and
# heteroskedastic pattern; that code is not shown in this extract)
reg <- lm(y ~ x)
summary(reg)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53525 -0.18020 0.02811 0.16882 0.46896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.87190 0.05860 14.88 <2e-16 ***
## x -1.69268 0.09359 -18.09 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.232 on 98 degrees of freedom
## Multiple R-squared: 0.7695, Adjusted R-squared: 0.7671
## F-statistic: 327.1 on 1 and 98 DF, p-value: < 2.2e-16
# plot(x, y)
# abline(coef = reg$coef, col = 3)
Figure 2.18: Regression line for a dataset that clearly violates the linearity and homoscedasticity assumptions. The R² is, nevertheless, as high as (approximately) 0.77.

# Create data that:
# 1) does not follow a linear model
# 2) the error is heteroskedastic
x1 <- seq(0.15, 1, l = 100)
set.seed(123456)
x2 <- runif(100, -3, 3)
eps <- rnorm(n = 100, sd = 0.25 * x1^2)
y <- 1 - 3 * x1 * (1 + 0.25 * sin(4 * pi * x1)) + 0.25 * cos(x2) + eps
# Great R^2!?
reg <- lm(y ~ x1 + x2)
summary(reg)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78737 -0.20946 0.01031 0.19652 1.05351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.788812 0.096418 8.181 1.1e-12 ***
## x1 -2.540073 0.154876 -16.401 < 2e-16 ***
## x2 0.002283 0.020954 0.109 0.913
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.3754 on 97 degrees of freedom
## Multiple R-squared: 0.744, Adjusted R-squared: 0.7388
## F-statistic: 141 on 2 and 97 DF, p-value: < 2.2e-16
We can visualize the fit of the latter multiple linear model, since
we are in p = 2.
# But prediction is obviously problematic
car::scatter3d(y ~ x1 + x2, fit = "linear")
rgl::rglwidget()
Remember that:
Y = β 0 + β 1 X1 + . . . + β j X j + ε, (2.26)
The fact that the ANOVA decomposition does not hold for no-
intercept models has serious consequences on the theory we built,
# Manually checking the R^2 indeed reveals that summary() is doing something
# different when the intercept is excluded
cor(mod0$model$Sepal.Length, mod0$fitted.values)^2
## [1] 0.6690277
# (SSE1 and SST1 were computed previously; that code is not shown here)
1 - SSE1 / SST1
## [1] 0.6690277

Figure: scatterplot of Sepal.Length vs. Petal.Width in the iris dataset.
Let's play the evil practitioner and try to modify the R²₀ returned by summary when the intercept is excluded. If we were really evil, we could use this knowledge to fool someone that is not aware of the difference between R²₀ and R² into believing that any given model is incredibly good or bad in terms of the R²!
# Summaries
summary(modWine1)
##
## Call:
## lm(formula = Price ~ ., data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46541 -0.24133 0.00413 0.18974 0.52495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.343e+00 7.697e+00 -0.304 0.76384
## WinterRain 1.153e-03 4.991e-04 2.311 0.03109 *
## AGST 6.144e-01 9.799e-02 6.270 3.22e-06 ***
## HarvestRain -3.837e-03 8.366e-04 -4.587 0.00016 ***
## Age 1.377e-02 5.821e-02 0.237 0.81531
## FrancePop -2.213e-05 1.268e-04 -0.175 0.86313
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
Although these were known facts, keep in mind that this model
# # Alternatively
# data(Boston, package = "MASS")
Figure: scatterplot matrix of medv, crim, dis, nox, and rm in the Boston dataset.
and n are, the more variable the estimates ( β̂, σ̂2 ) will be, since less
information is available for estimating each one. In the limit case
n = p + 2, each sample point determines a parameter estimate.
# Case p = n - 1 = 2: beta can be estimated, but sigma^2 cannot (hence no
# inference can be performed, since it requires the estimated sigma^2)
# (df is a small simulated data frame with n = 3 observations, defined
# previously and not shown in this extract)
summary(lm(y ~ ., data = df))
• tn− p−1;α/2 appears in (2.16) and influences the length of the CIs
for β j , see (2.17). It also influences the length of the CIs for the
prediction. As Figure 3.2 shows, when the degrees of freedom
n − p − 1 decrease, tn− p−1;α/2 increases, thus the intervals widen.
• σ̂² = (1/(n − p − 1)) ∑_{i=1}^n ε̂i² influences the R² and the R²Adj. If no relevant variables are added to the model, then ∑_{i=1}^n ε̂i² will not change substantially. However, the factor 1/(n − p − 1) will increase as p grows, inflating σ̂² and its variance. This is exactly what happened in Figure 2.19.

Figure 3.2: Effect of df = n − p − 1 on t_{df;α/2} for α = 0.10, 0.05, 0.01.

… fitness of a model with the number of predictors employed. Hence, it determines objectively the best model as the one that minimizes

BIC(model) = −2ℓ(model) + npar(model) log n,   (3.1)
where ℓ(model) is the log-likelihood of the model (how well the model fits the data) and npar(model) is the number of parameters considered in the model, p + 2 in the case of a multiple linear regression model with p predictors. The AIC replaces log n by 2 in (3.1) so, compared with the BIC, it penalizes the more complex models less (recall that log n > 2 if n ≥ 8). This is one of the reasons why the BIC is preferred by some practitioners for model comparison. (Also, because the BIC is consistent in selecting the true distribution/regression model: if enough data is provided (n → ∞), the BIC is guaranteed to select the true data-generating model among a list of candidate models if the true model is included in that list; see Schwarz (1978) for the specifics for the linear model. Note, however, that this may be unrealistic in practice, as the true model may be nonlinear.)

The BIC and AIC can be computed through the functions BIC and AIC. They take a model as the input.

# Two models with different predictors
mod1 <- lm(medv ~ age + crim, data = Boston)
mod2 <- lm(medv ~ age + crim + lstat, data = Boston)

# BICs
BIC(mod1)
## [1] 3581.893
BIC(mod2) # Smaller -> better
## [1] 3300.841

# AICs
AIC(mod1)
## [1] 3564.987
AIC(mod2) # Smaller -> better
## [1] 3279.708
# Full model
mod <- lm(Price ~ ., data = wine)
# With BIC
modBIC <- MASS::stepAIC(mod, k = log(nrow(wine)))
## Start: AIC=-53.29
## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop
##
##
## Step: AIC=-53.29
## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop
##
## Df Sum of Sq RSS AIC
## - FrancePop 1 0.0026 1.8058 -56.551
## - Year 1 0.0048 1.8080 -56.519
## <none> 1.8032 -53.295
## - WinterRain 1 0.4585 2.2617 -50.473
## - HarvestRain 1 1.8063 3.6095 -37.852
## - AGST 1 3.3756 5.1788 -28.105
##
## Step: AIC=-56.55
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -56.551
## - WinterRain 1 0.4809 2.2867 -53.473
## - Year 1 0.9089 2.7147 -48.840
## - HarvestRain 1 1.8760 3.6818 -40.612
## - AGST 1 3.4428 5.2486 -31.039
summary(modBIC)
##
## Call:
## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
## data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46024 -0.23862 0.01347 0.18601 0.53443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.6390418 14.6939240 2.970 0.00707 **
# With AIC
modAIC <- MASS::stepAIC(mod, k = 2)
## Start: AIC=-61.07
## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop
##
##
## Step: AIC=-61.07
## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop
##
## Df Sum of Sq RSS AIC
## - FrancePop 1 0.0026 1.8058 -63.031
## - Year 1 0.0048 1.8080 -62.998
## <none> 1.8032 -61.070
## - WinterRain 1 0.4585 2.2617 -56.952
## - HarvestRain 1 1.8063 3.6095 -44.331
## - AGST 1 3.3756 5.1788 -34.584
##
## Step: AIC=-63.03
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -63.031
## - WinterRain 1 0.4809 2.2867 -58.656
## - Year 1 0.9089 2.7147 -54.023
## - HarvestRain 1 1.8760 3.6818 -45.796
## - AGST 1 3.4428 5.2486 -36.222
summary(modAIC)
##
## Call:
## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
## data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46024 -0.23862 0.01347 0.18601 0.53443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.6390418 14.6939240 2.970 0.00707 **
## Year -0.0238480 0.0071667 -3.328 0.00305 **
## WinterRain 0.0011667 0.0004820 2.420 0.02421 *
## AGST 0.6163916 0.0951747 6.476 1.63e-06 ***
## HarvestRain -0.0038606 0.0008075 -4.781 8.97e-05 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.7962
## F-statistic: 26.39 on 4 and 22 DF, p-value: 4.057e-08
## [1] 78.62268
n * (log(2 * pi) + 1) + 2
## [1] 78.62268
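The information criteria can also be recomputed from their definitions via logLik(). A sketch for modBIC; npar counts the p + 1 coefficients plus σ².

# BIC from its definition, matching BIC()
npar <- length(coef(modBIC)) + 1
-2 * as.numeric(logLik(modBIC)) + npar * log(nrow(wine))
BIC(modBIC)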
Note that the selected models modBIC and modAIC are equivalent
to the modWine2 we selected in Section 2.7.3 as the best model. This
is an illustration that the model selected by stepAIC is often a good
starting point for further additions or deletions of predictors.
## Step: AIC=-53.38
## Price ~ Year + WinterRain + AGST + HarvestRain + noisePredictor
##
## Df Sum of Sq RSS AIC
## - noisePredictor 1 0.0081 1.8058 -56.551
## <none> 1.7977 -53.376
## - WinterRain 1 0.4771 2.2748 -50.317
## - Year 1 0.9162 2.7139 -45.552
## - HarvestRain 1 1.8449 3.6426 -37.606
## - AGST 1 3.4234 5.2212 -27.885
##
## Step: AIC=-56.55
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -56.551
## - WinterRain 1 0.4809 2.2867 -53.473
## - Year 1 0.9089 2.7147 -48.840
## - HarvestRain 1 1.8760 3.6818 -40.612
## - AGST 1 3.4428 5.2486 -31.039
##
## Call:
## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
## data = wineNoise)
##
## Coefficients:
## (Intercept) Year WinterRain AGST HarvestRain
## 43.639042 -0.023848 0.001167 0.616392 -0.003861
# Using the defaults from the full model essentially does backward selection,
# but allowing predictors that were removed to enter again at later steps
MASS::stepAIC(modAll, direction = "both", k = log(n))
## Start: AIC=-50.13
## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop +
## noisePredictor
##
##
## Step: AIC=-50.13
## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop +
## noisePredictor
##
## Df Sum of Sq RSS AIC
## - FrancePop 1 0.0036 1.7977 -53.376
## - Year 1 0.0038 1.7979 -53.374
## - noisePredictor 1 0.0090 1.8032 -53.295
## <none> 1.7941 -50.135
## - WinterRain 1 0.4598 2.2539 -47.271
## - HarvestRain 1 1.7666 3.5607 -34.923
## - AGST 1 3.3658 5.1599 -24.908
##
## Step: AIC=-53.38
## Price ~ Year + WinterRain + AGST + HarvestRain + noisePredictor
##
## Df Sum of Sq RSS AIC
## - noisePredictor 1 0.0081 1.8058 -56.551
## <none> 1.7977 -53.376
## - WinterRain 1 0.4771 2.2748 -50.317
## + FrancePop 1 0.0036 1.7941 -50.135
## - Year 1 0.9162 2.7139 -45.552
## - HarvestRain 1 1.8449 3.6426 -37.606
## - AGST 1 3.4234 5.2212 -27.885
##
## Step: AIC=-56.55
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -56.551
## - WinterRain 1 0.4809 2.2867 -53.473
Note also how the BIC penalizes complexity more than the AIC, which is flatter.
# Best models (modHouse is the full model for medv on all the predictors of
# Boston; its definition, medv ~ ., is assumed from the stepwise trace below)
modHouse <- lm(medv ~ ., data = Boston)
modBIC <- MASS::stepAIC(modHouse, k = log(nrow(Boston)))
## Start: AIC=1648.81
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - age 1 0.06 11079 1642.6
## - indus 1 2.52 11081 1642.7
## <none> 11079 1648.8
## - chas 1 218.97 11298 1652.5
## - tax 1 242.26 11321 1653.5
## - crim 1 243.22 11322 1653.6
## - zn 1 257.49 11336 1654.2
## - black 1 270.63 11349 1654.8
## - rad 1 479.15 11558 1664.0
## - nox 1 487.16 11566 1664.4
## - ptratio 1 1194.23 12273 1694.4
## - dis 1 1232.41 12311 1696.0
## - rm 1 1871.32 12950 1721.6
## - lstat 1 2410.84 13490 1742.2
##
## Step: AIC=1642.59
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
## ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - indus 1 2.52 11081 1636.5
## <none> 11079 1642.6
## - chas 1 219.91 11299 1646.3
## - tax 1 242.24 11321 1647.3
## - crim 1 243.20 11322 1647.3
## - zn 1 260.32 11339 1648.1
## - black 1 272.26 11351 1648.7
## - rad 1 481.09 11560 1657.9
## - nox 1 520.87 11600 1659.6
## - ptratio 1 1200.23 12279 1688.4
## - dis 1 1352.26 12431 1694.6
## - rm 1 1959.55 13038 1718.8
## - lstat 1 2718.88 13798 1747.4
##
## Step: AIC=1636.48
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
## black + lstat
##
## Df Sum of Sq RSS AIC
## <none> 11081 1636.5
## - chas 1 227.21 11309 1640.5
## - crim 1 245.37 11327 1641.3
## - zn 1 257.82 11339 1641.9
## - black 1 270.82 11352 1642.5
## - tax 1 273.62 11355 1642.6
## - rad 1 500.92 11582 1652.6
## - nox 1 541.91 11623 1654.4
## - ptratio 1 1206.45 12288 1682.5
## - dis 1 1448.94 12530 1692.4
## - rm 1 1963.66 13045 1712.8
## - lstat 1 2723.48 13805 1741.5
modAIC <- MASS::stepAIC(modHouse, trace = 0, k = 2)
# Comparison
car::compareCoefs(modBIC, modAIC)
## Calls:
## 1: lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + black + lstat, data = Boston)
## 2: lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + black + lstat, data = Boston)
##
## Model 1 Model 2
## (Intercept) 36.34 36.34
## SE 5.07 5.07
##
## crim -0.1084 -0.1084
## SE 0.0328 0.0328
##
## zn 0.0458 0.0458
## SE 0.0135 0.0135
##
## chas 2.719 2.719
## SE 0.854 0.854
##
## nox -17.38 -17.38
## SE 3.54 3.54
##
## rm 3.802 3.802
## SE 0.406 0.406
##
## dis -1.493 -1.493
## SE 0.186 0.186
##
## rad 0.2996 0.2996
## SE 0.0634 0.0634
##
## tax -0.01178 -0.01178
## SE 0.00337 0.00337
##
## ptratio -0.947 -0.947
## SE 0.129 0.129
##
## black 0.00929 0.00929
## SE 0.00267 0.00267
##
## lstat -0.5226 -0.5226
## SE 0.0474 0.0474
##
summary(modBIC)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
# Confidence intervals
confint(modBIC)
## 2.5 % 97.5 %
## (Intercept) 26.384649126 46.29764088
## crim -0.172817670 -0.04400902
## zn 0.019275889 0.07241397
## chas 1.040324913 4.39710769
Note how the R²Adj has slightly increased with respect to the full model and how all the predictors are significant. Note also that modBIC and modAIC are the same.
We have quantified the influence of the predictor variables on the
housing prices (Q1) and we can conclude that, in the final model
(Q2) and with significance level α = 0.05:
E[Y | X1 = x1, . . . , Xp = xp, D = d] = β0 + β1 x1 + . . . + βp xp + βp+1 d.
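In R, the coding of a factor into dummy variables is handled automatically through the model matrix. A quick sketch with the iris data used below; note that by default the first level (setosa) is the reference, whereas the fit commented on below uses a releveled factor with versicolor as the reference.

# How a factor is encoded: the reference level is absorbed into the intercept
head(model.matrix(~ Species, data = iris))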
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
## F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
# Speciessetosa (D1) coefficient: 0.72356. The average increment of
# Sepal.Length when the species is setosa instead of versicolor (reference)
# Speciesvirginica (D2) coefficient: -0.29994. The average increment of
# Sepal.Length when the species is virginica instead of versicolor (reference)
# Both dummy variables are significant
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
# The coefficient associated to chas is 2.71871. If the suburb is close to
# the river, the mean of medv increases by 2.71871 units
# chas is significant as well in the presence of more predictors
1. Y = β0 + β1 X² + ε
2. Y = β0 + β1 log(X) + ε
3. Y = β0 + β1 (X³ − log(|X|) + 2X) + ε

These amount to considering the transformed predictors:

1. X̃i = Xi²,  i = 1, . . . , n.
2. X̃i = log(Xi),  i = 1, . . . , n.
3. X̃i = Xi³ − log(|Xi|) + 2Xi,  i = 1, . . . , n.
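In practice, these transformed predictors can be created on the fly inside the formula with I(). A sketch, illustrated on the Boston dataset with lstat playing the role of X (an assumption for the example; any positive predictor would do for the logarithm):

# Models 1-3 via I() in the formula
lm(medv ~ I(lstat^2), data = Boston)
lm(medv ~ I(log(lstat)), data = Boston)
lm(medv ~ I(lstat^3 - log(abs(lstat)) + 2 * lstat), data = Boston)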
y and x, together with its fitted regression line. Clearly, the data does not follow a linear pattern but a nonlinear one, similar to a parabola y = x². Hence, y might be better explained by the square of x, x^2, rather than by x. Indeed, if we plot y against x^2 in the right panel of Figure 3.4, we can see that the relation of y and x^2 is now linear!
Figure 3.4: nonlinear pattern of y vs. x (left) that becomes a linearized pattern when plotting Y against X² (right). In red, the fitted regression line.
Figure: common monotone transformations y = x, y = x², y = x³, y = √x, y = exp(x), y = exp(−x), y = log(x), and their negative counterparts y = −x, y = −x², y = −x³, y = −√x, y = −exp(−x), y = −exp(x), y = −log(x).
# Data
x <- c(-2, -1.9, -1.7, -1.6, -1.4, -1.3, -1.1, -1, -0.9, -0.7, -0.6,
-0.4, -0.3, -0.1, 0, 0.1, 0.3, 0.4, 0.6, 0.7, 0.9, 1, 1.1, 1.3,
1.4, 1.6, 1.7, 1.9, 2, 2.1, 2.3, 2.4, 2.6, 2.7, 2.9, 3, 3.1,
3.3, 3.4, 3.6, 3.7, 3.9, 4, 4.1, 4.3, 4.4, 4.6, 4.7, 4.9, 5)
y <- c(1.4, 0.4, 2.4, 1.7, 2.4, 0, 0.3, -1, 1.3, 0.2, -0.7, 1.2, -0.1,
-1.2, -0.1, 1, -1.1, -0.9, 0.1, 0.8, 0, 1.7, 0.3, 0.8, 1.2, 1.1,
2.5, 1.5, 2, 3.8, 2.4, 2.9, 2.7, 4.2, 5.8, 4.7, 5.3, 4.9, 5.1,
6.3, 8.6, 8.1, 7.1, 7.9, 8.4, 9.2, 12, 10.5, 8.7, 13.5)
# Gather the data in a data frame (assumed step, not shown in the extract)
nonLinear <- data.frame(x = x, y = y)
# We create a new column inside nonLinear, called x2, that contains the
# new variable x^2
nonLinear$x2 <- nonLinear$x^2
# If you wish to remove it
# nonLinear$x2 <- NULL
# Regressions
mod1 <- lm(y ~ x, data = nonLinear)
mod2 <- lm(y ~ x2, data = nonLinear)
summary(mod1)
##
## Call:
## lm(formula = y ~ x, data = nonLinear)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5268 -1.7513 -0.4017 0.9750 5.0265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9771 0.3506 2.787 0.0076 **
## x 1.4993 0.1374 10.911 1.35e-14 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2.005 on 48 degrees of freedom
## Multiple R-squared: 0.7126, Adjusted R-squared: 0.7067
## F-statistic: 119 on 1 and 48 DF, p-value: 1.353e-14
summary(mod2)
##
## Call:
## lm(formula = y ~ x2, data = nonLinear)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0418 -0.5523 -0.1465 0.6286 1.8797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.05891 0.18462 0.319 0.751
## x2 0.48659 0.01891 25.725 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.9728 on 48 degrees of freedom
## Multiple R-squared: 0.9324, Adjusted R-squared: 0.931
## F-statistic: 661.8 on 1 and 48 DF, p-value: < 2.2e-16
# mod2 has a larger R^2. Also notice that the intercept is not significant

Some hints:

Y = β0 + β1 X + . . . + βk X^k + ε.
## attr(,"coefs")$norm2
## [1] 1.0000000 4.0000000 2.2222222 0.7901235
##
0.5
## attr(,"degree")
## [1] 1 2
## attr(,"class")
0.0
xk
k=5
matplot(x, poly(x, degree = degree, raw = TRUE), type = "l", lty = 1, -1.0 -0.5 0.0 0.5 1.0
ylab = expression(x^k)) x
k
legend("bottomright", legend = paste("k =", 1:degree), col = 1:degree, lwd = 2) Figure 3.7: Raw polynomials (x ) in
(−1, 1) up to degree k = 5.
0.2
0.0
k=1
k=2
# Data containing speed (mph) and stopping distance (ft) of cars from 1920 k=3
-0.2
k=4
data(cars) k=5
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)") -1.0 -0.5 0.0 0.5 1.0
# Quadratic
mod2 <- lm(dist ~ poly(speed, degree = 2), data = cars)
# The fit is not a line, we must look for an alternative approach
d <- seq(0, 25, length.out = 200)
lines(d, predict(mod2, new = data.frame(speed = d)), col = 3)
# Cubic
mod3 <- lm(dist ~ poly(speed, degree = 3), data = cars)
lines(d, predict(mod3, new = data.frame(speed = d)), col = 4)
# BICs -- the linear model is better!
# (mod1 and mod10, the degree-1 and degree-10 fits, were defined previously)
BIC(mod1, mod2, mod3, mod10)
##       df      BIC
## mod1   3 424.8929
## mod2   4 426.4202
## mod3   5 429.4451
## mod10 12 450.3523

# poly computes by default orthogonal polynomials. These are not
# X^1, X^2, ..., X^p but combinations of them such that the polynomials are
# orthogonal. 'Raw' polynomials are possible with raw = TRUE. They give the
# same fit, but the coefficient estimates are different.
mod2Raw <- lm(dist ~ poly(speed, degree = 2, raw = TRUE), data = cars)
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
lines(d, predict(mod2, new = data.frame(speed = d)), col = 1)
lines(d, predict(mod2Raw, new = data.frame(speed = d)), col = 2)

Figure 3.9: Raw and orthogonal polynomial fits of dist ~ speed in the cars dataset.
# However: different coefficient estimates, but same R^2. How is this possible?
summary(mod2)
##
## Call:
## lm(formula = dist ~ poly(speed, degree = 2), data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
##
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                42.980      2.146  20.026  < 2e-16 ***
## poly(speed, degree = 2)1  145.552     15.176   9.591 1.21e-12 ***
## poly(speed, degree = 2)2   22.996     15.176   1.515    0.136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.18 on 47 degrees of freedom
## Multiple R-squared: 0.6673, Adjusted R-squared: 0.6532
## F-statistic: 47.14 on 2 and 47 DF, p-value: 5.852e-12

Figure 3.10: Raw and orthogonal polynomial fits of dist ~ speed in the cars dataset.
summary(mod2Raw)
##
## Call:
## lm(formula = dist ~ poly(speed, degree = 2, raw = TRUE), data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.720 -9.184 -3.188 4.628 45.152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.47014 14.81716 0.167 0.868
# Because the predictors in mod2Raw are highly related between them, and
# the ones in mod2 are uncorrelated between them!
car::scatterplotMatrix(mod2$model[, -1], col = 1, regLine = list(col = 2),
smooth = list(col.smooth = 4, col.spread = 4))
car::scatterplotMatrix(mod2Raw$model[, -1],col = 1, regLine = list(col = 2),
smooth = list(col.smooth = 4, col.spread = 4))
cor(mod2$model[, -1])
## 1 2
## 1 1.000000e+00 4.686464e-17
## 2 4.686464e-17 1.000000e+00
cor(mod2Raw$model[, -1])
## 1 2
## 1 1.0000000 0.9794765
## 2 0.9794765 1.0000000
Figure: scatterplot matrices of the two polynomial predictors in mod2 (orthogonal) and mod2Raw (raw).

The use of orthogonal polynomials instead of raw polynomials is advised for high order polynomial fits, since they avoid the numerical instabilities arising from excessive linear dependencies between the raw polynomial predictors.
3.4.3 Interactions
When two predictors X1 and X2 are present, it might be of interest to explore the interaction between them by means of X1 X2. This is a new variable that positively (negatively) affects the response Y when both X1 and X2 are positive or negative at the same time (at different times):

Y = β0 + β1 X1 + β2 X2 + β3 X1 X2 + ε.
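In R's formula syntax the interaction is requested with : (and * is a shorthand that also includes the main effects). A sketch with Boston's lstat and age; the first line matches the situation commented on right below, in which age enters only through its interaction with lstat.

# Interactions in formulas
lm(medv ~ lstat + lstat:age, data = Boston)   # age only through the interaction
lm(medv ~ lstat * age, data = Boston)         # expands to lstat + age + lstat:age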
# Thus, when age increases, lstat affects medv less negatively.
# Note that the same interpretation does NOT hold if we switch the roles
# of age and lstat, because age is not present as a sole predictor!
Y = β0 + β1 X + ε

Y = β0 + β1 X + β2 D + ε = { β0 + β1 X + ε,           if D = 0,
                             (β0 + β2) + β1 X + ε,     if D = 1.

Y = β0 + β1 X + β2 D + β3 (X · D) + ε = { β0 + β1 X + ε,                if D = 0,
                                          (β0 + β2) + (β1 + β3) X + ε,  if D = 1.

Y = β0 + β1 X + β2 (X · D) + ε = { β0 + β1 X + ε,          if D = 0,
                                   β0 + (β1 + β2) X + ε,   if D = 1.

Y = β0 + β1 D + ε = { β0 + ε,          if D = 0,
                      (β0 + β1) + ε,   if D = 1.

Y = β0 + β1 D + β2 (X · D) + ε = { β0 + ε,                 if D = 0,
                                   (β0 + β1) + β2 X + ε,   if D = 1.

Y = β0 + β1 (X · D) + ε = { β0 + ε,          if D = 0,
                            β0 + β1 X + ε,   if D = 1.
# 1. No dummy variable
# (col and cex are plotting parameters defined previously, not shown here)
(mod1 <- lm(medv ~ lstat, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Coefficients:
## (Intercept)        lstat
##       34.55        -0.95
plot(medv ~ lstat, data = Boston, col = col, pch = 16, cex = cex, main = "1")
# 2. Dummy variable (fit line reconstructed from the Call below)
(mod2 <- lm(medv ~ lstat + chas, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat + chas, data = Boston)
##
## Coefficients:
## (Intercept)        lstat         chas
##     34.0941      -0.9406       4.9200
plot(medv ~ lstat, data = Boston, col = col, pch = 16, cex = cex, main = "2")
abline(a = mod2$coefficients[1], b = mod2$coefficients[2], col = 3, lwd = 2)
abline(a = mod2$coefficients[1] + mod2$coefficients[3],
       b = mod2$coefficients[2], col = 4, lwd = 2)
# 3. Dummy variable, with interaction
(mod3 <- lm(medv ~ lstat * chas, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat * chas, data = Boston)
##
## Coefficients:
## (Intercept)        lstat         chas   lstat:chas
##     33.7672      -0.9150       9.8251      -0.4329
plot(medv ~ lstat, data = Boston, col = col, pch = 16, cex = cex, main = "3")
abline(a = mod3$coefficients[1], b = mod3$coefficients[2], col = 3, lwd = 2)
abline(a = mod3$coefficients[1] + mod3$coefficients[3],
       b = mod3$coefficients[2] + mod3$coefficients[4], col = 4, lwd = 2)
# 4. Dummy variable only present in interaction
(mod4 <- lm(medv ~ lstat + lstat:chas, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat + lstat:chas, data = Boston)
plot(medv ~ lstat, data = Boston, col = col, pch = 16, cex = cex, main = "4")
abline(a = mod4$coefficients[1], b = mod4$coefficients[2], col = 3, lwd = 2)  # line for chas == 0

# 5. Only the dummy variable (fit line reconstructed from the Call below)
(mod5 <- lm(medv ~ chas, data = Boston))
##
## Call:
## lm(formula = medv ~ chas, data = Boston)
##
## Coefficients:
## (Intercept)         chas
##      22.094        6.346
plot(medv ~ lstat, data = Boston, col = col, pch = 16, cex = cex, main = "5")

# 6. Dummy variable and interaction in the slope (fit reconstructed from the Call)
(mod6 <- lm(medv ~ chas + lstat:chas, data = Boston))
##
## Call:
## lm(formula = medv ~ chas + lstat:chas, data = Boston)
##
## Coefficients:
# 7. Dummy variable. Interaction in the slope
(mod7 <- lm(medv ~ lstat:chas, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat:chas, data = Boston)
##
## Coefficients:
## (Intercept)   lstat:chas
##    22.49484      0.04882
plot(medv ~ lstat, data = Boston, col = col, pch = 16, cex = cex, main = "7")
abline(a = mod7$coefficients[1], b = 0, col = 3, lwd = 2)
abline(a = mod7$coefficients[1], b = mod7$coefficients[2], col = 4, lwd = 2)
… (with varying flexibility) to the two groups of data encoded by the dummy variable, and merge this simultaneous fit within a single linear model. We can check this in more detail using the subset option of lm:

# Model using a dummy variable in the full dataset
lm(medv ~ lstat + chas + lstat:chas, data = Boston)
##
## Call:
## lm(formula = medv ~ lstat + chas + lstat:chas, data = Boston)
##
## Coefficients:
## (Intercept)        lstat         chas   lstat:chas
##     33.7672      -0.9150       9.8251      -0.4329
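A possible check along these lines with the subset argument (a sketch; the two separate fits reproduce the two group-wise lines implied by the interaction model above):

# Separate fits for the two groups encoded by chas
lm(medv ~ lstat, data = Boston, subset = chas == 0)
lm(medv ~ lstat, data = Boston, subset = chas == 1)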
The last scatterplot (Sepal.Width vs. Petal.Width) is an illustration of Simpson's paradox:

1. Within each group, there is a clear and common correlation…

Do the following:
3.5.1 Linearity
Linearity between the response Y and the predictors X1, . . . , Xp is the building block of the linear model. If this assumption fails, i.e., if there is a nonlinear trend linking Y and at least one of the predictors X1, . . . , Xp in a significant way, then all the conclusions we might extract from the analysis are suspected to be flawed. Therefore it is a key assumption.

How to check it

The so-called residuals vs. fitted values plot is the scatterplot of the (Ŷi, ε̂i), i = 1, . . . , n, and is a very useful tool for detecting linearity departures using a single graphical device. (If p = 1, it is possible to inspect the scatterplot of the (Xi1, Yi), i = 1, . . . , n, in order to determine whether linearity is plausible. But the usefulness of this graphical check quickly decays with the dimension p, as p scatterplots need to be investigated. That is precisely the key point for relying on the residuals vs. fitted values plot.) Under linearity, we expect that there is no trend in the residuals ε̂i with respect to Ŷi. For example:
plot(mod, 1)

Figure 3.13: Residuals vs. fitted values for the Price ~ Age + AGST + HarvestRain + WinterRain model for the wine dataset.

Under linearity, we expect the red line (a flexible fit of the mean of the residuals) to show no trend. The next examples illustrate how the residuals vs. fitted values plots behave for situations in which the linearity is known to be respected or violated. If nonlinearities are observed, it is worth plotting the regression terms of the model. These are the p scatterplots (Xij, Yi), i = 1, . . . , n, that are accompanied by the regression lines y = β̂0 + β̂j xj (important: β̂j, j = 1, . . . , p, come from the multiple linear fit that gives β̂, not from individual simple linear regressions). They help with detecting which predictor is having nonlinear effects on Y.
Figure: regression terms (partial residuals) for Age, AGST, HarvestRain, and WinterRain in the wine model.

What to do if it fails

Using an adequate nonlinear transformation for the problematic predictors or adding interaction terms, as we saw in Section 3.4, … (rather than Y), as we will see in the case study of Section 3.5.7. Let's see the transformation of predictors in the example that motivated Section 3.4.
Figure 3.16: Correction of nonlinearity by using the right transformation for the predictor.

3.5.2 Normality

The assumed normality of the errors εi, i = 1, . . . , n, allows us to make exact inference in the linear model, in the sense that the distribution of β̂ given in (2.11) is exact for any n and not asymptotic with n → ∞. If normality does not hold, then the inference we did (CIs for βj, hypothesis testing, CIs for prediction) is to be somehow suspected. Why just somehow? Roughly speaking, the reason is that the central limit theorem will make β̂ asymptotically normal, even if the errors are not. (Recall that β̂j = e′j (X′X)⁻¹ X′Y =: ∑_{i=1}^n wij Yi, where ej is the canonical vector of R^{p+1} with 1 in the j-th position and 0 elsewhere. Therefore, β̂j is a weighted sum of the random variables Y1, . . . , Yn; recall that we assume that X is given and therefore is deterministic. Even if Y1, . . . , Yn are not normal, the central limit theorem entails that √n(β̂j − βj) is asymptotically normally distributed when n → ∞, provided that linearity holds.) However, the speed of this asymptotic convergence greatly depends on how non-normal the distribution of the errors is. Hence the next rule of thumb:

Non-severe departures from normality (distributions that are not heavy tailed, not heavily multimodal, and not heavily skewed) yield valid (asymptotic) inference for relatively large sample sizes n.

Therefore, the failure of normality is typically less problematic than that of other assumptions.
How to check it

The QQ-plot (Theoretical Quantile vs. Empirical Quantile) allows us to check if the standardized residuals follow a N(0, 1). What it does is to compare the theoretical quantiles of a N(0, 1) with the quantiles of the sample of standardized residuals.

plot(mod, 2)

Figure 3.17: QQ-plot for the residuals of the Price ~ Age + AGST + HarvestRain + WinterRain model for the wine dataset.

Under normality, we expect the points to align closely with the diagonal; systematic departures suggest that the residuals are non-normal. There are formal tests to check the null hypothesis of normality in our residuals, such as the Shapiro–Wilk test implemented by the shapiro.test function or the Lilliefors test implemented by the nortest::lillie.test function:
# shapiro.test allows up to 5000 observations -- if dealing with more data
# points, randomization of the input is a possibility

# Lilliefors test -- the Kolmogorov-Smirnov adaptation for testing normality
nortest::lillie.test(mod$residuals)
##
##  Lilliefors (Kolmogorov-Smirnov) normality test
##
## data:  mod$residuals
## D = 0.13739, p-value = 0.2125

# We do not reject normality

(Side note: the variance of the sample quantile x̂p grows as p → 0, 1, so more variability is expected on the extremes of the QQ-plot; see Figure 3.19.)
What to do if it fails

Patching non-normality is not easy and most of the time requires the consideration of other models, like the ones to be seen in Chapter 5. If Y is non-negative, one possibility is to transform Y by means of the Box–Cox (Box and Cox, 1964) transformation:

y^(λ) := { (y^λ − 1)/λ,   λ ≠ 0,
           log(y),         λ = 0.        (3.3)

This transformation alleviates the skewness of the data, therefore making it more symmetric and hence normal-like. (Precisely, if λ < 1, positive skewness, or skewness to the right, is palliated: large values of Y shrink and small values grow; whereas if λ > 1 negative skewness, or skewness to the left, is corrected: large values of Y grow and small values shrink.) The optimal data-dependent λ̂ that makes the data more normal-like can be found through maximum likelihood on the transformed sample {Yi^(λ)}_{i=1}^n (for a N(µ, σ²), and potentially using the linear model structure if we are performing the transformation to achieve normality in the errors of the linear model; recall that optimally transforming Y such that Y is normal-like and such that Y | (X1, . . . , Xp) is normal-like, the assumption in the linear model, are very different goals!). If Y is not non-negative, (3.3) can not be applied. A possible patch is to shift the data by a positive constant m = −min(Y1, . . . , Yn) + δ, δ > 0, such that transformation (3.3) becomes

y^(λ,m) := { ((y + m)^λ − 1)/λ,   λ ≠ 0,
             log(y + m),           λ = 0.

A neat alternative to this shifting is to rely on a transformation that is already designed for real Y, such as the Yeo–Johnson (Yeo and Johnson, 2000) transformation:

y^(λ) := { ((y + 1)^λ − 1)/λ,              λ ≠ 0, y ≥ 0,
           log(y + 1),                      λ = 0, y ≥ 0,
           −((−y + 1)^(2−λ) − 1)/(2 − λ),   λ ≠ 2, y < 0,
           −log(−y + 1),                    λ = 2, y < 0.        (3.4)

The beauty of the Yeo–Johnson transformation is that it extends neatly the Box–Cox transformation, which appears as a particular case when Y is non-negative (see Figure 3.20). As with the Box–Cox transformation, the optimal λ̂ is estimated through maximum likelihood on the transformed sample {Yi^(λ)}_{i=1}^n.

Figure 3.19: The uncertainty behind the QQ-plot. The figure aggregates M = 1000 different QQ-plots of N(0, 1) data with n = 100, displaying for each of them the pairs (x_p, x̂_p) evaluated at p = (i − 1/2)/n, i = 1, . . . , n (as they result from ppoints(n)). The uncertainty is measured by the asymptotic 100(1 − α)% CIs for x̂_p, given by x_p ± z_{1−α/2} √(p(1 − p)) / (√n φ(x_p)). These curves are displayed in red for α = 0.05. Observe that the vertical strips arise since the x_p coordinate is deterministic.
N <- 200
Yeo-Johnson transformation
y <- seq(-4, 4, length.out = N)
lambda <- c(0, 0.5, 1, 2, -0.5, -1, -2) 4
l <- length(lambda)
psi <- sapply(lambda, function(la) car::yjPower(U = y, lambda = la))
2
matplot(y, psi, type = "l", ylim = c(-4, 4), lwd = 2, lty = 1:l,
ylab = latex2exp::TeX("$y^{(\\lambda)}$"), col = 1:l, las = 1,
main = "Yeo-Johnson transformation")
y(λ)
abline(v = 0, h = 0)
legend("bottomright", lty = 1:l, lwd = 2, col = 1:l,
-2 λ=0
legend = latex2exp::TeX(paste0("$\\lambda = ", lambda, "$"))) λ = 0.5
λ=1
λ=2
λ = -0.5
λ = -1
The previous transformations have the price of modelling the -4 λ = -2
-4 -2 0 2 4
transformed response rather than Y. It is also possible to patch it y
# Test data
# Predictors
n <- 200
set.seed(121938)
X1 <- rexp(n, rate = 1 / 5) # Non-negative
X2 <- rchisq(n, df = 5) - 3 # Real
# Box-Cox transformation
X1Transf <- car::bcPower(U = X1, lambda = lambdaBC)
# Comparison
par(mfrow = c(1, 2))
hist(X1, freq = FALSE, breaks = 10, ylim = c(0, 0.3))
hist(X1Transf, freq = FALSE, breaks = 10, ylim = c(0, 0.3))
0.30
0.20
0.20
Density
Density
0.10
0.10
0.00
0.00
0 10 20 30 40 -2 0 2 4 6
X1 X1Transf
# Yeo-Johnson transformation
X2Transf <- car::yjPower(U = X2, lambda = lambdaYJ)
# Comparison
par(mfrow = c(1, 2))
hist(X2, freq = FALSE, breaks = 10, ylim = c(0, 0.3))
hist(X2Transf, freq = FALSE, breaks = 10, ylim = c(0, 0.3))
notes for predictive modeling 111
0.30
0.30
0.20
0.20
Density
Density
0.10
0.10
0.00
0.00
-2 0 2 4 6 8 10 -4 -2 0 2 4 6
X2 X2Transf
# Yeo-Johnson transformation
YTransf <- car::yjPower(U = Y, lambda = lambdaYJ)
Standardized residuals
142 142
4
0 1 2 3 4
14 14
52
52
2
0
-2
-2
-3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3
3.5.3 Homoscedasticity
The constant-variance assumption of the errors is also key for ob-
taining the inferential results we saw. For example, if the assump-
tion does not hold, then the CIs for prediction will not respect the
Scale-Location
24
20
How to check it
1.2
plot(mod, 3)
6.0 6.5 7.0 7.5 8.0
Fitted values
Under homoscedasticity, we expect the red line to show no lm(Price ~ Age + AGST + HarvestRain + WinterRain)
# Heteroskedastic models
set.seed(123456)
x <- rnorm(100)
y1 <- 1 + 2 * x + rnorm(100, sd = x^2)
y2 <- 1 + 2 * x + rnorm(100, sd = 1 + x * (x > 0))
modHet1 <- lm(y1 ~ x)
modHet2 <- lm(y2 ~ x)
Scale-Location
# Heteroskedasticity not detected 5
car::ncvTest(modHet1)
2.0
55
What to do if fails
Using a nonlinear transformation for the response Y may help 96
√
1.5
Standardized residuals
Fitted values
lm(y2 ~ x)
# Artificial data with heteroskedasticity
set.seed(12345) Figure 3.23: Two heteroskedasticity
X <- rchisq(500, df = 3) patterns that are undetected and
e <- rnorm(500, sd = sqrt(0.1 + 2 * X)) detected, respectively, by the Breusch–
Y <- 1 + X + e Pagan test.
114 eduardo garcía portugués
# Original
plot(lm(Y ~ X), 3) # Very heteroskedastic
# Transformed
plot(lm(I(log(abs(Y))) ~ X), 3) # Much less hereroskedastic, but at the price
# of losing the signs in Y...
306 338
# Transformed by Yeo-Johnson
1.5
# Optimal lambda for Yeo-Johnson
Standardized residuals
YJ <- car::powerTransform(lm(Y ~ X), family = "yjPower")
1.0
(lambdaYJ <- YJ$lambda)
## Y1
## 0.6932053
0.5
# Yeo-Johnson transformation
0.0
YTransf <- car::yjPower(U = Y, lambda = lambdaYJ)
plot(lm(YTransf ~ X), 3) # Slightly less hereroskedastic 5 10 15
Fitted values
lm(Y ~ X)
Scale-Location
372
2.5
3.5.4 Independence 202
155
2.0
Independence is also a key assumption: it guarantees that the
Standardized residuals
1.5
amount of information that we have on the relationship between
Y and X1 , . . . , X p with n observations is maximal. If there is depen- 1.0
variability of the estimates will be larger. For example, our 95% CIs
0.0
will be smaller than the adequate, meaning that they will not contain
0.5 1.0 1.5 2.0 2.5 3.0 3.5
with a 95% confidence the unknown parameter, but with a lower Fitted values
lm(I(log(abs(Y))) ~ X)
Scale-Location
confidence (say 80%). 137
and so are the estimates for the coefficients β. Yet in the 2n-sample
√ −1
1.0
the confidence interval has happened, but we have the same infor-
0.0
mation! This will give us a wrong sense of confidence in our model, 1.5 2.0 2.5 3.0
Fitted values
and the root of the evil was the dependence between observations. lm(I(log(Y + m)) ~ X)
Scale-Location
137
How to check it
2.0
295
323
Fitted values
lm(YTransf ~ X)
Under uncorrelation, we expect the series to show no track- Figure 3.24: Patching of heteroskedas-
ing of the residuals. That is, that the closer observations do not ticity for an artificial dataset.
notes for predictive modeling 115
take similar values, but rather change without any kind of distin-
0.4
guishable pattern. Tracking is associated to positive autocorrelation,
but negative, manifested as alternating small-large or positive-
0.2
mod$residuals
negative residuals, is also possible. The lagged plots of (ε̂ i−l , ε̂ i ),
0.0
i = l + 1, . . . , n, obtained through lag.plot, allow us to detect both
kinds of autocorrelations for a given lag l. Under independence, we
-0.2
expect no trend in such plot. Here is an example:
-0.4
lag.plot(mod$residuals, lags = 1, do.lines = FALSE) 0 5 10 15 20 25
Index
There are also formal tests for testing for the absence of au-
0.4
tocorrelation, such as the Durbin–Watson test implemented in
0.2
car::durbinWatsonTest:
mod$residuals
0.0
# Durbin-Watson test
car::durbinWatsonTest(mod)
-0.2
## lag Autocorrelation D-W Statistic p-value
## 1 -0.4160168 2.787261 0.054
-0.4
## Alternative hypothesis: rho != 0
# Does not reject at alpha = 0.05 -0.4 -0.2 0.0 0.2 0.4 0.6
lag 1
What to do if fails
Little can be done if there is dependence in the data, once this
has been collected. If the dependence is temporal, we must rely
on the family of statistical models meant to deal with serial de-
pendence: time series. Other kinds of dependence such as spatial
dependence, spatio-temporal dependence, geometrically-driven de-
pendencies, censorship, truncation, etc. need to be analyzed with a
different set of tools to the ones covered in these notes.
However, there is a simple trick worth mentioning. If the ob-
servations of the response Y, say Y1 , Y2 , . . . , Yn , present serial de-
pendence, a differentiation of the sample that yields Y1 − Y2 , Y2 −
Y3 , . . . , Yn−1 − Yn may lead to independent observations. These are
called the innovations of the series of Y.
3.5.5 Multicollinearity
A common problem that arises in multiple linear regression is mul-
ticollinearity. This is the situation when two or more predictors are
highly linearly related between them. Multicollinearitiy has impor-
tant effects on the fit of the model:
116 eduardo garcía portugués
HarvestRain
WinterRain
FrancePop
## AGST -0.29488335 0.6675248 -0.32113230 1.00000000 -0.02708361 0.29488335 -0.30126148
AGST
Price
Year
Age
## HarvestRain -0.05884976 -0.5071846 -0.26798907 -0.02708361 1.00000000 0.05884976 -0.03201463
1
## Age -1.00000000 0.4604087 -0.05118354 0.29488335 0.05884976 1.00000000 Year -0.99227908
1 -0.46 0.05 -0.29 -0.06 -1 0.99
0.8
## FrancePop 0.99227908 -0.4810720 0.02945091 -0.30126148 -0.03201463 -0.99227908 1.00000000
Price 0.6
-0.46 1 0.13 0.67 -0.51 0.46 -0.48
0.4
# Graphically WinterRain 0.05 0.13 1 -0.32 -0.27 -0.05 0.03
0.2
corrplot::corrplot(cor(wine), addCoef.col = "grey")
AGST -0.29 0.67 -0.32 1 -0.03 0.29 -0.3 0
Here we can see what we already knew from Section 2.1: Age HarvestRain -0.06 -0.51 -0.27 -0.03 1 0.06 -0.03
-0.2
-0.4
and Year are perfectly linearly related and Age and FrancePop are Age -1 0.46 -0.05 0.29 0.06 1 -0.99 -0.6
highly linearly related. Then one approach will be to directly re- FrancePop 0.99 -0.48 0.03 -0.3 -0.03 -0.99 1
-0.8
x2
x3
x4
terexamples that show non suspicious pairwise correlations but x1 1 0.38 0.21 -0.53 0.31 0.8
problematic complex linear relations that remain hidden. For the 0.6
x2
sake of illustration, here is one: 0.38 1 0.52 0.57 -0.04 0.4
0.2
# Create predictors with multicollinearity: x4 depends on the rest x3 0.21 0.52 1 0.25 -0.77 0
set.seed(45678)
-0.2
x1 <- rnorm(100)
x4 -0.53 0.57 0.25 1 -0.29 -0.4
x2 <- 0.5 * x1 + rnorm(100)
x3 <- 0.5 * x2 + rnorm(100) -0.6
-1
# Without x4
modClean <- lm(y ~ x1 + x2 + x3)
# Comparison
car::compareCoefs(modMultiCo, modClean)
## Calls:
## 1: lm(formula = y ~ x1 + x2 + x3 + x4)
## 2: lm(formula = y ~ x1 + x2 + x3)
##
## Model 1 Model 2
## (Intercept) 1.062 1.058
## SE 0.103 0.103
##
## x1 0.922 1.450
## SE 0.551 0.116
##
## x2 1.640 1.119
## SE 0.546 0.124
##
## x3 -3.165 -3.145
## SE 0.109 0.107
##
## x4 -0.529
## SE 0.541
##
confint(modMultiCo)
## 2.5 % 97.5 %
## (Intercept) 0.8568419 1.2674705
## x1 -0.1719777 2.0167093
## x2 0.5556394 2.7240952
## x3 -3.3806727 -2.9496676
## x4 -1.6030032 0.5446479
confint(modClean)
## 2.5 % 97.5 %
## (Intercept) 0.8526681 1.262753
## x1 1.2188737 1.680188
## x2 0.8739264 1.364981
## x3 -3.3564513 -2.933473
notes for predictive modeling 119
# Summaries
summary(modMultiCo)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9762 -0.6663 0.1195 0.6217 2.5568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0622 0.1034 10.270 < 2e-16 ***
## x1 0.9224 0.5512 1.673 0.09756 .
## x2 1.6399 0.5461 3.003 0.00342 **
## x3 -3.1652 0.1086 -29.158 < 2e-16 ***
## x4 -0.5292 0.5409 -0.978 0.33040
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.028 on 95 degrees of freedom
## Multiple R-squared: 0.9144, Adjusted R-squared: 0.9108
## F-statistic: 253.7 on 4 and 95 DF, p-value: < 2.2e-16
summary(modClean)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91297 -0.66622 0.07889 0.65819 2.62737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0577 0.1033 10.24 < 2e-16 ***
## x1 1.4495 0.1162 12.47 < 2e-16 ***
## x2 1.1195 0.1237 9.05 1.63e-14 ***
## x3 -3.1450 0.1065 -29.52 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.028 on 96 degrees of freedom
## Multiple R-squared: 0.9135, Adjusted R-squared: 0.9108
## F-statistic: 338 on 3 and 96 DF, p-value: < 2.2e-16
2
0.5
19
1
Standardized residuals
residuals vs. leverage plot:
0
plot(mod, 5)
-1
The rules of thumb for declaring outliers and high-leverage 24
20
-2
0.5
Leverage
• If the standardized residual of an observation is larger than 3 in lm(Price ~ Age + AGST + HarvestRain + WinterRain)
absolute value, then it may be an outlier. Figure 3.29: Residuals vs. leverage
plot for the Price ~ Age + AGST +
• If the leverage statistic hi (see below) is greatly exceeding ( p +
HarvestRain + WinterRain model for
1)/n 21 , then the i-th observation may be suspected of having a the wine dataset.
high leverage. 21
This is the expected value for the
leverage statistic hi if the linear model
Let’s see an artificial example. holds.
6
# Create data
set.seed(12345)
4
x <- rnorm(100)
e <- rnorm(100, sd = 0.5)
2
y <- 1 + 2 * x + e
y
## [1] 0.01980198
-4
# Base model
-2 -1 0 1 2
m0 <- lm(y ~ x) x
plot(x, y) Residuals vs Leverage
abline(coef = m0$coefficients, col = 2)
3
2
24
plot(m0, 5)
Standardized residuals
summary(m0)
0
##
-1
## Call: 42
## lm(formula = y ~ x) 75
-2
##
Cook's distance
## Residuals:
0.00 0.01 0.02 0.03 0.04 0.05 0.06
## Min 1Q Median 3Q Max
Leverage
## -1.10174 -0.30139 -0.00557 0.30949 1.30485 lm(y ~ x)
notes for predictive modeling 121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.01103 0.05176 19.53 <2e-16 ***
## x 2.04727 0.04557 44.93 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.5054 on 98 degrees of freedom
## Multiple R-squared: 0.9537, Adjusted R-squared: 0.9532
## F-statistic: 2018 on 1 and 98 DF, p-value: < 2.2e-16
# Make an outlier
x[101] <- 0; y[101] <- 30
m1 <- lm(y ~ x)
30
plot(x, y)
25
abline(coef = m1$coefficients, col = 2)
20
plot(m1, 5)
15
y
10
summary(m1)
5
##
## Call:
0
## lm(formula = y ~ x)
-5
##
-2 -1 0 1 2
## Residuals:
x
## Min 1Q Median 3Q Max
Residuals vs Leverage
## -1.3676 -0.5730 -0.2955 0.0941 28.6881
10
## 101
## Coefficients:
8
## Estimate Std. Error t value Pr(>|t|)
Standardized residuals
## (Intercept) 1.3119 0.2997 4.377 2.98e-05 *** 6
1
## x 1.9901 0.2652 7.505 2.71e-11 ***
4
## --- 0.5
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
2
##
## Residual standard error: 2.942 on 99 degrees of freedom
0
47 42
## F-statistic: 56.33 on 1 and 99 DF, p-value: 2.708e-11 0.00 0.01 0.02 0.03 0.04 0.05 0.06
Leverage
lm(y ~ x)
m2 <- lm(y ~ x)
plot(x, y)
4
plot(m2, 5)
0
summary(m2)
-2
##
-4
## Call:
## lm(formula = y ~ x) -2 0 2 4 6 8 10
## x
##
0.5
## Coefficients: 60 1
Standardized residuals
42
-2
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
-8
## 101
-10
Cook's distance
## Residual standard error: 1.339 on 99 degrees of freedom
0.0 0.1 0.2 0.3 0.4
## Multiple R-squared: 0.6791, Adjusted R-squared: 0.6758
Leverage
## F-statistic: 209.5 on 1 and 99 DF, p-value: < 2.2e-16 lm(y ~ x)
122 eduardo garcía portugués
# Another option
h <- hat(x = x)
# Standardized residuals
60
rs <- rstandard(m2)
plot(m2, 2) # QQ-plot
Frequency
84+
+ +
transistors ≈ 2years/2 .
++
++++++
+++++++++
+++++++
++++++++
+++++++++++
+++++++++++++
0
+++
++
++++++++++++
+
+++++++++++++
+++
++++
Applying logarithms to both sides gives ++
Standardized residuals
+42
-2
log(2)
-4
log(transistors) ≈ years.
2
-6
-2 -1 0 1 2
Theoretical Quantiles
where ε is a random error. This is a linear model! lm(y ~ x)
notes for predictive modeling 123
PC = A0 (X − µ), (3.7)
notes for predictive modeling 125
X = µ + APC, (3.9)
which admits an insightful interpretation: the PC are uncorrelated26 Since the variance-covariance matrix
26
Var[PC] is diagonal.
variables that, once rotated by A and translated to the location µ,
produce exactly X. So, PC contains the same information as X
but rearranged in a more convenient way because the principal
components are centred and uncorrelated between them:
λ , if i = j,
i
E[PCi ] = 0 and Cov[PCi , PC j ] =
0, if i 6= j.
∑lj=1 λ j
p .
∑ j =1 λ j
Let’s check that R’s function for PCA, princomp, returns the same
principal components we outlined in the theory.
# PCA
pcaLaliga <- princomp(laliga, fix_sign = TRUE)
summary(pcaLaliga)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 104.7782561 48.5461449 22.13337511 12.66692413 8.234215354 7.83426116 6.068864168 4.137079559
## Proportion of Variance 0.7743008 0.1662175 0.03455116 0.01131644 0.004782025 0.00432876 0.002597659 0.001207133
## Cumulative Proportion 0.7743008 0.9405183 0.97506949 0.98638593 0.991167955 0.99549671 0.998094374 0.999301507
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 2.0112480391 1.8580509157 1.126111e+00 9.568824e-01 4.716064e-01 1.707105e-03 8.365534e-04
## Proportion of Variance 0.0002852979 0.0002434908 8.943961e-05 6.457799e-05 1.568652e-05 2.055361e-10 4.935768e-11
## Cumulative Proportion 0.9995868048 0.9998302956 9.999197e-01 9.999843e-01 1.000000e+00 1.000000e+00 1.000000e+00
## Comp.16 Comp.17
## Standard deviation 5.867584e-07 0
## Proportion of Variance 2.428208e-17 0
## Cumulative Proportion 1.000000e+00 1
# The standard deviations are the square roots of the eigenvalues
# The cumulative proportion of variance explained accumulates the
# variance explained starting at the first component
# Useful for detecting an "elbow" in the graph whose location gives the
6000
Variances
# when the next variances are almost similar and notably smaller when
# compared with the previous
2000
# Same eigenvalues
pcaLaliga$sdev^2 - eig$values
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## -1.818989e-12 2.728484e-12 2.273737e-13 -2.273737e-13 5.684342e-14 3.552714e-14 4.263256e-14 -3.552714e-14
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16
## 1.234568e-13 5.373479e-14 1.199041e-14 1.054712e-14 1.049161e-14 -3.709614e-15 -2.191892e-15 4.476020e-13
## Comp.17
## 2.048814e-12
# The eigenvectors (the a_j vectors) are the column vectors in $loadings
pcaLaliga$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10 Comp.11 Comp.12
## Points 0.125 0.497 0.195 0.139 0.340 0.425 0.379 0.129 0.166
## Wins 0.184 0.175 0.134 -0.198 0.139
## Draws 0.101 -0.186 0.608 0.175 0.185 -0.251
## Loses -0.129 -0.114 -0.157 -0.410 -0.243 -0.166 0.112
## Goals.scored 0.181 0.251 -0.186 -0.169 0.399 0.335 -0.603 -0.155 0.129 0.289 -0.230
## Goals.conceded -0.471 -0.493 -0.277 0.257 0.280 0.441 -0.118 0.297
## Percentage.scored.goals
## Percentage.conceded.goals
## Shots 0.718 0.442 -0.342 0.255 0.241 0.188
## Shots.on.goal 0.386 0.213 0.182 -0.287 -0.532 -0.163 -0.599
## Penalties.scored -0.350 0.258 0.378 -0.661 0.456
## Assistances 0.148 0.198 -0.173 0.362 0.216 0.215 0.356 -0.685 -0.265
## Fouls.made -0.480 0.844 0.166 -0.110
## Matches.without.conceding 0.151 0.129 -0.182 0.176 -0.369 -0.376 -0.411
## Yellow.cards -0.141 0.144 -0.363 0.113 0.225 0.637 -0.550 -0.126 0.156
## Red.cards -0.123 -0.157 0.405 0.666
## Offsides 0.108 0.202 -0.696 0.647 -0.106
## Comp.13 Comp.14 Comp.15 Comp.16 Comp.17
## Points 0.138 0.278 0.315
## Wins 0.147 -0.907
## Draws -0.304 0.526 -0.277
## Loses 0.156 0.803
## Goals.scored -0.153
## Goals.conceded
## Percentage.scored.goals -0.760 -0.650
## Percentage.conceded.goals 0.650 -0.760
## Shots
## Shots.on.goal
## Penalties.scored -0.114
## Assistances -0.102
## Fouls.made
## Matches.without.conceding -0.664
## Yellow.cards
## Red.cards -0.587
## Offsides
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059
## Cumulative Var 0.059 0.118 0.176 0.235 0.294 0.353 0.412 0.471 0.529 0.588 0.647 0.706 0.765 0.824
## Comp.15 Comp.16 Comp.17
## SS loadings 1.000 1.000 1.000
## Proportion Var 0.059 0.059 0.059
## Cumulative Var 0.882 0.941 1.000
head(pcaLaliga$scores)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## Barcelona 242.23916 -21.581016 25.380103 -17.4375054 -7.1797218 9.0814106 -5.5920449 -7.3615386 -0.3715688
## Real Madrid 313.60263 63.202402 -8.756998 8.5557906 0.7119181 0.2221379 6.7034894 2.4455971 1.8388132
## Atlético Madrid 45.99393 -0.646609 38.540964 31.3888316 3.9162812 3.2904137 0.2431925 5.0912667 -3.0444029
## Villarreal -96.22013 -42.932869 50.003639 -11.2420481 10.4732634 2.4293930 -3.0183049 0.1958417 1.2106025
## Athletic 14.51728 -16.189672 18.884019 -0.4122161 -5.6491352 -6.9329640 8.0652665 2.4783231 -2.6920566
## Celta -13.07483 6.792525 5.227118 -9.0945489 6.1264750 11.8794638 -2.6148154 6.9706627 3.0825781
## Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17
## Barcelona 1.7160752 0.0264937 -1.0948280 0.15160351 -0.0010244179 -0.0002177801 -1.047044e-12 -7.623799e-12
## Real Madrid -2.9660592 -0.1344557 0.3409538 -0.03316355 0.0014744734 0.0003891502 1.587123e-12 1.215672e-11
## Atlético Madrid 2.0974553 0.6771343 -0.3985625 -0.18088616 0.0004984680 -0.0001116115 6.252046e-13 1.090230e-12
## Villarreal -1.7453143 0.1350586 -0.5735057 -0.58936061 -0.0001067413 -0.0002431758 3.319980e-13 -3.406707e-12
## Athletic 0.8950389 0.1542005 1.4714997 0.11090055 -0.0031346887 -0.0002557828 -3.696130e-12 -1.895661e-11
## Celta 0.3129865 0.0859623 1.9159241 0.37219921 -0.0017960697 0.0002911046 -2.150767e-12 -5.513489e-12
# Uncorrelated
corrplot::corrplot(cor(pcaLaliga$scores), addCoef.col = "gray")
Comp.10
Comp.11
Comp.12
Comp.13
Comp.14
Comp.15
Comp.16
Comp.17
Comp.1
Comp.2
Comp.3
Comp.4
Comp.5
Comp.6
Comp.7
Comp.8
Comp.9
1
Comp.1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# Caution! What happened in the last columns? What happened is that the Comp.2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.01 0 0.8
# variance for the last principal components is close to zero (because there Comp.3 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Comp.4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.6
# are linear dependencies on the variables; e.g. Points, Wins, Loses, Draws), Comp.5 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0.01 0
0.4
# so the computation of the correlation matrix becomes unstable for those Comp.6 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
Comp.7 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0.2
# variables (a 0/0 division takes place) Comp.8 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 -0.01
Comp.9 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0.06 0 0
Comp.10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0.02 0
# Better to inspect the covariance matrix Comp.11 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0.040.01
-0.2
Comp.12 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -0.05-0.01
corrplot::corrplot(cov(pcaLaliga$scores), addCoef.col = "gray", is.corr = FALSE) Comp.13 -0.4
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 -0.12-0.01
Comp.14 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0.99 0.6 -0.6
Comp.15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -0.040.8
Comp.16 -0.8
0 -0.01 0 0 0.01 0 0 0 0.060.020.04-0.05-0.120.99-0.04 1 0.56
Comp.17 0 0 0 0 0 0 0 -0.01 0 0 0.01-0.01-0.010.6 0.8 0.56 1
-1
# The scores are A’ * (X_i - mu). We center the data with scale()
Comp.10
Comp.11
Comp.12
Comp.13
Comp.14
Comp.15
Comp.16
Comp.17
Comp.1
Comp.2
Comp.3
Comp.4
Comp.5
Comp.6
Comp.7
Comp.8
Comp.9
# and then multiply each row by A’ 11556.3
Comp.111556.30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
scores <- scale(laliga, center = TRUE, scale = FALSE) %*% A Comp.2 02480.770 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10400.67
Comp.3 0 0515.670 0 0 0 0 0 0 0 0 0 0 0 0 0
Comp.4 0 0 0 168.9 0 0 0 0 0 0 0 0 0 0 0 0 0 9245.04
# Same as (but this is much slower) Comp.5 0 0 0 0 71.37 0 0 0 0 0 0 0 0 0 0 0 0
8089.41
# scores <- t(apply(scale(laliga, center = TRUE, scale = FALSE), 1, Comp.6 0 0 0 0 0 64.61 0 0 0 0 0 0 0 0 0 0 0
Comp.7 0 0 0 0 0 0 38.77 0 0 0 0 0 0 0 0 0 0 6933.78
# function(x) t(A) %*% x)) Comp.8 0 0 0 0 0 0 0 18.02 0 0 0 0 0 0 0 0 0
Comp.9 0 0 0 0 0 0 0 0 4.26 0 0 0 0 0 0 0 0 5778.15
Comp.10 0 0 0 0 0 0 0 0 0 3.63 0 0 0 0 0 0 0
# Same scores (up to possible changes in signs) Comp.11 0 0 0 0 0 0 0 0 0 0 1.33 0 0 0 0 0 0
4622.52
Comp.12 0 0 0 0 0 0 0 0 0 0 0 0.96 0 0 0 0 0
max(abs(abs(pcaLaliga$scores) - abs(scores))) Comp.13 0 0 0 0 0 0 0 0 0 0 0 0 0.23 0 0 0 0
3466.89
head(
sweep(pcaLaliga$scores %*% t(pcaLaliga$loadings), 2, pcaLaliga$center, "+")
)
## Points Wins Draws Loses Goals.scored Goals.conceded Percentage.scored.goals Percentage.conceded.goals
## Barcelona 91 29 4 5 112 29 2.95 0.76
## Real Madrid 90 28 6 4 110 34 2.89 0.89
## Atlético Madrid 88 28 4 6 63 18 1.66 0.47
## Villarreal 64 18 10 10 44 35 1.16 0.92
## Athletic 62 18 8 12 58 45 1.53 1.18
## Celta 60 17 9 12 51 59 1.34 1.55
## Shots Shots.on.goal Penalties.scored Assistances Fouls.made Matches.without.conceding Yellow.cards
## Barcelona 600 277 11 79 385 18 66
## Real Madrid 712 299 6 90 420 14 72
## Atlético Madrid 481 186 1 49 503 24 91
## Villarreal 346 135 3 32 534 17 100
## Athletic 450 178 3 42 502 13 84
## Celta 442 170 4 43 528 10 116
## Red.cards Offsides
## Barcelona 1 120
## Real Madrid 5 114
## Atlético Madrid 3 84
## Villarreal 4 106
## Athletic 5 92
## Celta 6 103
notes for predictive modeling 129
300
0.6
# The effects of the distorsion can be clearly seen with the biplot
200
0.4
Espanyol
Fouls.made
# Variability absorbed by Shots, Shots.on.goal, Fouls.made Real Madrid
Rayo Vallecano
biplot(pcaLaliga, cex = 0.75) Eibar
100
0.2
Granada
Sevilla Shots
Comp.2
Málaga
Shots.on.goal
Levante
Yellow.cards
Valencia Offsides
# The effects of the variables are more balanced Celta
Goals.conceded Goals.scored
Assistances
0.0
Getafe
Red.cards
Penalties.scored
Loses WinsPoints
Percentage.scored.goals
Percentage.conceded.goals
0
Atlético Madrid
Matches.without.conceding
Draws
-100
-0.2
Sporting
VillarrealGijón
Real Sociedad
Betis
Las Palmas
-200
-0.4
ing the relevant information contained in the first two principal Deportivo
1. The scores of the data in PC1 and PC2 by points (with optional
Atlético Madrid
text labels, depending if there are case names). This is the repre-
4
0.4
2
0.2
2. The variables represented in the PC1 and PC2 by the arrows. Real Sociedad
Comp.2
Las Palmas
Deportivo Athletic
Sporting Gijón Points
Wins
These arrows are centred at (0, 0). Draws
0.0
GetafeValencia 0
Fouls.made Eibar
Yellow.cards
Loses
Levante Celta
Espanyol Sevilla Assistances
Goals.scoredBarcelona
Percentage.scored.goals
Shots.on.goal
Penalties.scored
-2
Shots
Goals.conceded Granada
Percentage.conceded.goals
Red.cards Offsides
X j . X j is expressed in terms of PC1 and PC2 by the weights a j1 and Real Madrid
-0.4
-4
a j2 : Rayo Vallecano
Comp.1
X j = a j1 PC1 + a j2 PC2 + . . . + a jp PC p ≈ a j1 PC1 + a j2 PC2 . Figure 3.30: Biplots for laliga dataset,
with unstandardized and standardized
The weights a j1 and a j2 have the same sign as Cor( X j , PC1 ) and data, respectively.
Cor( X j , PC2 ), respectively. The arrow associated to X j is given by 30
Computed with biplot.princomp or
with biplot applied to a princomp ob-
the segment joining (0, 0) and ( a j1 , a j2 ). Therefore:
ject. But note that this function applies
an internal scaling of the scores and
• If the arrow points right (a j1 > 0), there is positive correlation variables to improve the visualization
between X j and PC1 . Analogous if the arrow points left. (see ?biplot.princomp). This can be
disabled with the argument scale = 0.
• If the arrow is approximately vertical (a j1 ≈ 0), there is uncorrelation 31
For the sample version, replace the
between X j and PC1 . variable X j by its sample X1j , . . . , Xnj ,
the weights a ji by their estimates â ji ,
and the principal component PC j by
the scores PCˆ j1 , . . . , PC
ˆ jn .
Analogously:
130 eduardo garcía portugués
Real Sociedad
Comp.2
Las Palmas
Deportivo Athletic
Sporting Gijón Points
Wins
Draws
0.0
GetafeValencia
Fouls.made Eibar
Yellow.cards
Loses
Levante Celta
Espanyol Sevilla Assistances
Goals.scoredBarcelona
Percentage.scored.goals
-0.2
Shots.on.goal
Penalties.scored
-2
Shots
Goals.conceded Granada
Percentage.conceded.goals
Red.cards Offsides
Real Madrid
-0.4
-4
Rayo Vallecano
Comp.1
0.6
4
biplot(pcaLaligaStd, choices = c(1, 3)) # 0.7138 proportion of variance
0.4
biplot(pcaLaligaStd, choices = c(2, 3)) # 0.2180 proportion of variance Fouls.made
Espanyol Atlético Madrid
2
Yellow.cards
0.2
Granada
Wins
Levante Villarreal Matches.without.conceding
Celta Points
Loses Red.cards EibarSevilla
Valencia
Penalties.scored
Offsides
Getafe
0.0
Athletic Percentage.scored.goals
0
Goals.scored
Assistances
Barcelona
Comp.3
Goals.concededSporting Gijón
Percentage.conceded.goals Málaga Shots.on.goal
RayoReal
Vallecano
Las Sociedad
Palmas Shots
Real Madrid
At the sight of the previous plots, can you think about an Betis
-0.2
-2
interpretation for PC3 ?
-0.4
Draws
-4
-0.6
-0.8
Deportivo
-6
3.6.2 Principal components regression -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6
Comp.1
-4 -2 0 2
The key idea behind Principal Components Regression (PCR) is to Fouls.made
Espanyol
2
Atlético Madrid
Yellow.cards
Granada Wins
Matches.without.conceding
Levante Points Villarreal
tion is that often a small number of principal components is enough Red.cards
Penalties.scored
Offsides
Celta
Loses
Sevilla
Eibar
Valencia
Getafe
0.0
Percentage.scored.goals Athletic
0
Goals.scored
Barcelona
Assistances
Percentage.conceded.goals
Goals.conceded Sporting Gijón
Málaga
Shots.on.goal
Rayo Vallecano
to explain most of the variability of the predictors and consequently RealPalmas
Las Sociedad
Comp.3
Real Madrid
Shots
Betis
-0.2
model34
Draws
-0.6
-4
The first point is worth discussing now. The PCR model (3.10)
can be seen as a linear model expressed in terms of the original
predictors. To make this point clearer, let’s re-express (3.10) as
0
Y = α0 + PC1:l α1:l + ε, (3.11)
where A1:l represents the A matrix with only its first l columns and
In other words, the coefficients α1:l of the PCR done with l principal
components in (3.10) translate into the coefficients (3.12) of the
linear model based on the p original predictors
Y = γ0 + γ1 X1 + . . . + γ p X p + ε,
Notice that γ̂ is not the least squares estimator that we use to de-
note by β̂, but just the coefficients of the PCR that are associated to
the original predictors. Consequently, γ̂ is useful for the interpreta-
tion of the linear model produced by PCR, as it can be interpreted
36
For example, thanks to γ̂ we know
in the same way β̂ was36 . that the estimated conditional re-
Finally, remember that the usefulness of PCR relies on how well sponse of Y precisely increases γ̂ j for a
we are able to reduce the dimensionality37 of the predictors and marginal unit increment in X j .
6
0.6
4
# Create a new dataset with the response + principal components
0.4
Red.cards
laligaPCA <- data.frame("Points" = laliga$Points, pcaLaligaRed$scores) Percentage.conceded.goals
Goals.conceded
Granada Offsides Real Madrid
2
0.2
Espanyol
Sevilla
Celta
Yellow.cards Shots
Penalties.scored
Comp.2
Fouls.made Shots.on.goal
Levante
# Regression on all the principal components Percentage.scored.goals
Goals.scored
Assistances
Getafe Eibar Barcelona
Valencia
modPCA <- lm(Points ~ ., data = laligaPCA)
0.0
0
summary(modPCA) # Predictors clearly significative -- same R^2 as mod Sporting Gijón Athletic
Real Sociedad
Deportivo
Las Palmas
##
-0.2
Málaga
-2
Villarreal
Betis
## Call:
## lm(formula = Points ~ ., data = laligaPCA)
-0.4
Atlético Madrid
-4
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 4.274 on 7 degrees of freedom
## Multiple R-squared: 0.9795, Adjusted R-squared: 0.9443
## F-statistic: 27.83 on 12 and 7 DF, p-value: 9.784e-05
car::vif(modPCA) # No problems at all
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10 Comp.11 Comp.12
## 1 1 1 1 1 1 1 1 1 1 1 1
# Reality
laliga[1:2, 1]
## [1] 91 90
# Slots of information in the model -- most of them as 3-dim arrays with the
# third dimension indexing the number of components considered
names(modPcr)
notes for predictive modeling 137
##
350
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.73 79.13 86.11 92.24 95.84 98.19 99.11 99.69 99.87 100.00 100 100.00
250
## Points 80.47 84.23 87.38 90.45 90.99 93.94 94.36 96.17 96.32 96.84 97 97.95
200
MSEP
# The black is the CV loss, the dashed red line is the adjCV loss, a bias
50
# (k = 10 is the default)
modPcrCV10 <- pcr(Points ~ ., data = laligaRed2, scale = TRUE,
validation = "CV")
summary(modPcrCV10)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 18.57 8.520 8.392 8.911 7.187 7.124 6.045 6.340 5.728 6.073 7.676 9.300
## adjCV 18.57 8.464 8.331 8.768 7.092 7.043 5.904 6.252 5.608 5.953 7.423 8.968
## 12 comps
## CV 9.151
## adjCV 8.782
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.73 79.13 86.11 92.24 95.84 98.19 99.11 99.69 99.87 100.00 100 100.00
## Points 80.47 84.23 87.38 90.45 90.99 93.94 94.36 96.17 96.32 96.84 97 97.95
validationplot(modPcrCV10, val.type = "MSEP") # l = 6 gives the minimum CV
Points
350
300
pcr() does an internal scaling of the predictors by their
quasi-standard deviations. This means that each variable
250
is divided by √ 1 , when in princomp a scaling of √1n
200
MSEP
n −1
is applied (the standard deviations are employed). This
150
results in a minor discrepancy in the scores object of 100
n
princomp() are the ones of pcr() multiplied by n −1 .
0 2 4 6 8 10 12
This problem is inherited to the coefficients, which as- number of components
# Equality of scores from princomp() and pcr() (with the same standardization)
max(abs(abs(pcaLaligaRed$scores[, 1:3]) -
abs(modPcr$scores[, 1:3] * sqrt(n / (n - 1)))))
## [1] 5.551115e-15
38
Alternatively, the “in Prediction” part
The selection of l by cross-validation attempts to minimize the of the latter terms is dropped and they
Mean Squared Error in Prediction (MSEP) or, equivalently, the Root are just referred to as the MSE and
MSEP (RMSEP) of the model38 . This is a jackknife method valid RMSE.
for the selection of any tuning parameter λ that affects the form
of the estimate m̂λ of the regression function m (remember (1.1)).
Given the sample {(Xi , Yi )}in=1 , leave-one-out cross-validation
considers the tuning parameter
n
λ̂CV := arg min ∑ (Yi − m̂λ,−i (Xi ))2 , (3.14)
λ ≥0 i =1
where m̂λ,−i represents the fit of the model m̂λ without the i-th
observation (Xi , Yi ).
A less computationally expensive variation on leave-one-out
cross-validation is k-fold cross-validation, which partitions the data
into k folds F1 , . . . , Fk of approximately equal size, trains the model
m̂λ in the aggregation of k − 1 folds and evaluates its MSEP in the
remaining fold:
k
λ̂k-CV := arg min ∑ ∑ (Yi − m̂λ,−Fj (Xi ))2 , (3.15)
λ ≥0 j =1 i ∈ F
j
where m̂λ,− Fj represents the fit of the model m̂λ excluding the data
from the j-th fold Fj . Recall that k-fold cross-validation is more
general than leave-one-out cross-validation, since the latter is a
particular case of the former with k = n.
140 eduardo garcía portugués
Cov[ X j , Y ]
a1j := ,
Var[ X j ]
X j = α0 + α1j PLS1 + ε j , j = 1, . . . , p.
40
They play the role of X1 , . . . , X p in
2. Regress Y on each of the p random errors ε 1 , . . . , ε p 40 from the the computation of PLS1 and can be
above regressions, yielding regarded as X1 , . . . , X p after filtering
the linear information explained by
Cov[ε j , Y ] PLS1 .
a2j := ,
Var[ε j ]
Y = β 0 + β 1 PLS1 + . . . + β l PLSl + ε.
# PLS scores
head(modPls$scores)
## [1] 7.36407868 6.70659816 2.45577740 0.06913485 0.96513341 -0.39588401
# Also uncorrelated
head(cov(modPls$scores))
## Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8
## Comp 1 7.393810e+00 1.704822e-16 1.667417e-16 -1.716470e-16 -1.788249e-16 -8.227618e-16 -4.380633e-16 -1.202580e-16
## Comp 2 1.704822e-16 1.267859e+00 1.815810e-16 1.695817e-16 1.084887e-16 9.991690e-17 7.754899e-18 3.451971e-17
## Comp 3 1.667417e-16 1.815810e-16 9.021126e-01 -1.412726e-17 -7.389551e-17 -1.422399e-16 -3.699555e-17 -4.991681e-17
## Comp 4 -1.716470e-16 1.695817e-16 -1.412726e-17 3.130984e-01 1.130620e-17 1.088368e-17 -1.278556e-17 2.547340e-18
## Comp 5 -1.788249e-16 1.084887e-16 -7.389551e-17 1.130620e-17 2.586132e-01 8.607103e-18 -2.416915e-17 -1.698417e-17
notes for predictive modeling 143
# Prediction
predict(modPls, newdata = laligaRed2[1:2, ], ncomp = 12)
## , , 12 comps
##
## Points
## Barcelona 92.01244
## Real Madrid 91.38026
350
300
# Selecting the number of components to retain by 10-fold Cross-Validation
# (k = 10 is the default)
250
modPlsCV10 <- plsr(Points ~ ., data = laligaRed2, scale = TRUE,
validation = "CV")
200
MSEP
summary(modPlsCV10)
150
## Data: X dimension: 20 12
## Y dimension: 20 1
100
## Fit method: kernelpls
## Number of components considered: 12
50
##
0 2 4 6 8 10 12
## VALIDATION: RMSEP number of components
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 18.57 7.895 6.944 6.571 6.330 6.396 6.607 7.018 7.487 7.612 7.607 8.520
## adjCV 18.57 7.852 6.868 6.452 6.201 6.285 6.444 6.830 7.270 7.372 7.368 8.212
## 12 comps
## CV 8.014
## adjCV 7.714
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.53 76.07 84.94 88.29 92.72 95.61 98.72 99.41 99.83 100.00 100.00 100.00
## Points 83.76 91.06 93.93 95.43 95.94 96.44 96.55 96.73 96.84 96.84 97.69 97.95
validationplot(modPlsCV10, val.type = "MSEP")
Points
350
##
## Call: 0 2 4 6 8 10 12
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3565 -3.6157 0.4508 2.3288 12.3116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.4000 1.2799 40.941 < 2e-16 ***
## Comp.1 6.0933 0.4829 12.618 4.65e-10 ***
## Comp.2 4.3413 1.1662 3.723 0.00169 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 5.724 on 17 degrees of freedom
## Multiple R-squared: 0.9106, Adjusted R-squared: 0.9
## F-statistic: 86.53 on 2 and 17 DF, p-value: 1.225e-09
car::vif(modPLS) # No problems at all
## Comp.1 Comp.2
## 1 1
notes for predictive modeling 145
4.1 Shrinkage
Y = β 0 + β 1 X1 + . . . + β p X p + ε,
ally tractable way and is the one employed in the package glmnet.
0.5
0.0
x1
β̂s,α := arg p
min RSS( β), (4.6)
β∈R p+1 :∑ j=1 (α| β j |+(1−α)| β j |2 )≤sλ
# Discard NA’s
Hitters <- na.omit(Hitters)
# The glmnet function works with the design matrix of predictors (without
# the ones). This can be obtained easily through model.matrix()
x <- model.matrix(Salary ~ 0 + ., data = Hitters)
# 0 + to exclude a column of 1’s for the intercept, since the intercept will be
# added by default in glmnet::glmnet and if we do not exclude it here we will
# end with two intercepts, one of them resulting in NA
# Interestingly, note that in Hitters there are two-level factors and these
# are automatically transformed into dummy variables in x -- the main advantage
# of model.matrix.lm
head(Hitters[, 14:20])
## League Division PutOuts Assists Errors Salary NewLeague
## -Alan Ashby N W 632 43 10 475.0 N
## -Alvin Davis A W 880 82 14 480.0 A
## -Andre Dawson N E 200 11 3 500.0 N
## -Andres Galarraga N E 805 40 4 91.5 N
## -Alfredo Griffin A W 282 421 25 750.0 A
## -Al Newman N E 76 127 7 70.0 A
head(x[, 14:19])
## LeagueA LeagueN DivisionW PutOuts Assists Errors
## -Alan Ashby 0 1 1 632 43 10
## -Alvin Davis 1 0 1 880 82 14
## -Andre Dawson 0 1 0 200 11 3
## -Andres Galarraga 0 1 0 805 40 4
## -Alfredo Griffin 1 0 1 282 421 25
150 eduardo garcía portugués
15
library(glmnet)
2
4
5
10
11
17
12
18
9
0
8
13
1
3
19
# Plot of the solution path -- gives the value of the coefficients for different
# measures in xvar (penalization imposed to the model or fitness) 16
L1 Norm
notes for predictive modeling 151
# Versus lambda
plot(ridgeMod, label = TRUE, xvar = "lambda")
20 20 20 20 20
15
0
8
13
1
3
19
# R^2 for generalized linear models. Since we have a linear model, this is the 7
14
Coefficients
-50
# The maximum R^2 is slightly above 0.5
-100
# Indeed, we can see that R^2 = 0.5461
summary(lm(Salary ~., data = Hitters))$r.squared 16
## [1] 0.5461159 4 6 8 10 12
Log Lambda
0
8
13
1
3
19
names(ridgeMod) 14
Coefficients
## [1] "a0" "beta" "df" "dim" "lambda" "dev.ratio" "nulldev" "npasses" "jerr"
-50
## [10] "offset" "call" "nobs"
-100
# in exchange of better variable interpretation and avoidance of overfitting
plot(log(ridgeMod$lambda), ridgeMod$dev.ratio, type = "l", 16
ridgeMod$dev.ratio[length(ridgeMod$dev.ratio)]
0.5
## [1] 0.5164752
# Slightly different to lm’s because of compromises in accuracy for speed
0.4
# The coefficients for different values of lambda are given in $a0 (intercepts)
0.3
length(ridgeMod$a0)
0.2
## [1] 100
dim(ridgeMod$beta)
0.1
## [1] 20 100
length(ridgeMod$lambda) # 100 lambda’s were automatically chosen
0.0
## [1] 100 4 6 8 10 12
log(lambda)
abline(v = log(ridgeMod$lambda[50]))
-10
120
The selection of the penalty parameter λ is usually done by k-
100
fold cross-validation, following the general principle described at
80
the end of Section 3.6. This data-driven selector is denoted by λ̂k-CV
l2 norm
60
and has the form given8 in (3.15) (or (3.14) if k = n):
40
k
∑ ∑ (Yi − m̂λ,−Fj (Xi ))2 .
20
λ̂k-CV := arg min CVk (λ), CVk (λ) :=
λ ≥0 j=1 i ∈ Fj
0
4 6 8 10 12
log(lambda)
A very interesting variant for the λ̂k-CV selector is the so-called one 8
If the linear model is employed, for
standard error rule. This rule is based on the parsimonious principle generalized linear models the metric
in the cross-validation function is
“favor simplicity within the set of most likely optimal models”. not the squared difference between
observations and fitted values.
ˆ CVk (λ̂k-CV )
λ̂k-1SE := max λ ≥ 0 : CVk (λ) ∈ CVk (λ̂k-CV ) ± SE .
15
8
5
13
4
1
19
7
# error the MSE (other possible error can be considered for generalized models
# using the argument type.measure) 20
14
Coefficients
-50
# Equivalent to
indMin <- which.min(kcvRidge$cvm)
kcvRidge$lambda[indMin]
## [1] 25.52821
# Potential problem! Minimum occurs at one extreme of the lambda grid in which
# CV is done. The grid was automatically selected, but can be manually inputted
range(kcvRidge$lambda)
## [1] 25.52821 255282.09651
lambdaGrid <- 10^seq(log10(kcvRidge$lambda[1]), log10(0.1),
length.out = 150) # log-spaced grid
kcvRidge2 <- cv.glmnet(x = x, y = y, nfolds = 10, alpha = 0,
lambda = lambdaGrid)
# Much better
plot(kcvRidge2)
kcvRidge2$lambda.min
## [1] 9.506186
# But the CV curve is random, since it depends on the sample. Its variability
# can be estimated by considering the CV curves of each fold. An alternative
# approach to select lambda is to choose the largest within one standard
# deviation of the minimum error, in order to favour simplicity of the model
# around the optimal lambda value. This is know as the "one standard error rule"
kcvRidge2$lambda.1se
## [1] 2964.928
220000
# The consideration of the one standard error rule for selecting lambda makes
# special sense when the CV function is quite flat around the minimum (hence an
180000
# overpenalization that gives more sparsity does not affect so much the CV loss)
Mean-Squared Error
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
Prediction
100000 120000 140000 160000 180000 200000 220000
# The model associated to lambda.1se (or any other lambda not included in the 0 5 10
Log(λ)
# original path solution -- obtained by an interpolation) can be retrieved with
20 20 20
predict(modRidgeCV, type = "coefficients", s = kcvRidge2$lambda.1se)
## 21 x 1 sparse Matrix of class "dgCMatrix"
15
## 1 2
6
3
## (Intercept) 229.758314334
11
12
18
17
9
10
0
8
13
5
1
4
19
7
## AtBat 0.086325740
20
## Hits 0.351303930 14
Coefficients
## HmRun 1.142772275
-50
## Runs 0.567245068
## RBI 0.568056880
## Walks 0.731144713
-100
## Years 2.389248929
16
## CAtBat 0.007261489
## CHits 0.027854683 0 5 10
## CRuns 0.055877337
## CRBI 0.057777505
## CWalks 0.056352113
## LeagueA -2.509251990
## LeagueN 2.509060248
## DivisionW -20.162700810
## PutOuts 0.048911039
## Assists 0.006973696
## Errors -0.128351187
## NewLeagueN 2.373103450
# Predictions for the first observation, for all the lambdas. We can see how
# the prediction for one observation changes according to lambda
plot(log(modRidgeCV$lambda),
predict(modRidgeCV, type = "response", newx = x[1, , drop = FALSE]),
type = "l", xlab = "log(lambda)", ylab = " Prediction")
Analytical form
500
The optimization problem (4.4) has an explicit solution for α =
0. To see it, assume that both the response Y and the predictors
Prediction
450
X1 , . . . , X p are centred, and that the sample {(Xi , Yi )}in=1 is also
centred9 . In this case, there is no intercept β 0 (= 0) to estimate by
400
β̂ 0 (= 0) and the linear model is simply
Y = β 1 X1 + . . . + β p X p + ε.
0 5 10
9
That is, that Ȳ = 0 and X̄ = 0.
Hλ := X(X0 X + λI p )−1 X0
The next chunk of code implements β̂λ,0 and shows that is equiv- -5 0 5 10 15
n <- 200
beta <- seq(-1, 1, l = p)
set.seed(123124)
x <- matrix(rnorm(n * p), n, p)
y <- 1 + x %*% beta + rnorm(n)
# Unrestricted fit
fit <- glmnet(x, y, alpha = 0, lambda = 0, intercept = TRUE,
standardize = FALSE)
beta0Hat <- rbind(fit$a0, fit$beta)
beta0Hat
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s0
## 1.05856208
## V1 -1.03109958
## V2 -0.56932123
## V3 -0.03813426
## V4 0.47415412
## V5 1.05761841
# Restricted fit
# glmnet considers as the reguarization parameter "lambda" the value
# lambda / n (lambda being here the penalty parameter employed in the theory)
lambda <- 2
fit <- glmnet(x, y, alpha = 0, lambda = lambda / n, intercept = TRUE,
standardize = FALSE)
betaLambdaHat <- rbind(fit$a0, fit$beta)
betaLambdaHat
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s0
## 1.0586029
## V1 -1.0264951
## V2 -0.5667469
## V3 -0.0377357
## V4 0.4710700
## V5 1.0528297
4.1.2 Lasso
# Call to the main function -- use alpha = 1 for lasso regression (the default)
lassoMod <- glmnet(x = x, y = y, alpha = 1)
# Same defaults as before, same object structure
# Plot of the solution path -- now the paths are not smooth when decreasing to
# zero (they are zero exactly). This is a consequence of the l1 norm 20 18 12 6
0
15
10
8
5
13
1
4
19
7
-20
20
-40
Coefficients
# Versus the R^2 -- same maximum R^2 as before
-60
14
-80
-100
# Now the l1-norm of the coefficients decreases as lambda increases
16
-120
plot(log(lassoMod$lambda), colSums(abs(lassoMod$beta)), type = "l",
-2 0 2 4
xlab = "log(lambda)", ylab = "l1 norm")
Log Lambda
0 2 4 4 5 12
2
6
3
11
12
18
17
9
0
15
10
8
5
13
1
4
19
set.seed(12345)
-20
20
14
kcvLasso$lambda.min
-80
## [1] 2.674375
-100
plot(kcvLasso)
abline(h = kcvLasso$cvm[indMin] + c(0, kcvLasso$cvsd[indMin]))
150
l1 norm
# Interesting: note that the numbers on top of the figure gives the number of
# coefficients *exactly* different from zero -- the number of predictors
50
-2 0 2 4
# Leave-one-out cross-validation
log(lambda)
lambdaGrid <- 10^seq(log10(kcvRidge$lambda[1]), log10(0.1),
20 19 19 19 18 18 14 14 12 9 7 6 6 6 6 5 4 2
length.out = 150) # log-spaced grid
220000
plot(ncvLasso)
160000
140000
Prediction
120000
20 19 18 16 14 9 6 6 5 2 0 0 0 0 0 0 0 0 0 0
220000
# The model associated to lambda.min (or any other lambda not included in the
# original path solution -- obtained by an interpolation) can be retrieved with
predict(modLassoCV, type = "coefficients",
180000
Mean-Squared Error
s = c(kcvLasso$lambda.min, kcvLasso$lambda.1se))
## 21 x 2 sparse Matrix of class "dgCMatrix"
## 1 2
140000
## (Intercept) 1.558172e+02 144.37970485
## AtBat -1.547343e+00 .
## Hits 5.660897e+00 1.36380384
100000
## HmRun . .
## Runs . . 0 5 10
## RBI . . Log(λ)
## Years -9.595837e+00 . 2
6
3
11
12
## CAtBat . .
18
17
9
0
15
10
8
5
13
1
4
19
7
## CHits . .
-20
20
## CHmRun 5.108207e-01 .
## CRuns 6.594856e-01 0.15275165
-40
Coefficients
## CRBI 3.927505e-01 0.32833941
-60
14
## CWalks -5.291586e-01 .
## LeagueA -3.206508e+01 .
-80
## LeagueN 3.285872e-14 .
-100
## DivisionW -1.192990e+02 .
## PutOuts 2.724045e-01 0.06625755 16
-120
## Assists 1.732025e-01 .
-2 0 2 4
## Errors -2.058508e+00 . Log Lambda
## NewLeagueN . .
Variable selection
# We can use lasso for model selection!
selPreds <- predict(modLassoCV, type = "coefficients",
s = c(kcvLasso$lambda.min, kcvLasso$lambda.1se))[-1, ] != 0
x1 <- x[, selPreds[, 1]]
x2 <- x[, selPreds[, 2]]
# CV
lambdaGrid <- exp(seq(-10, 3, l = 200))
plot(cv.glmnet(x = x, y = y, alpha = 1, nfolds = n, lambda = lambdaGrid))
100 100 99 98 94 92 80 60 30 8 0 0 0 0 0 0 0 0
1.8
4.2 Constrained linear models
1.6
Mean-Squared Error
As outlined in the previous section, after doing variable selection
1.4
with lasso15 , two possibilities are: (i) fit a linear model on the lasso-
selected predictors; (ii) run a stepwise selection starting from the
1.2
lasso-selected model to try to further improve the model16 .
Let’s explore the intuitive idea behind (i) in more detail. For the
1.0 -10 -8 -6 -4 -2 0 2
sake of exposition, assume that among p predictors, lasso selected Log(λ)
the first q of them17 . Then, once q is known, we would seek to fit Figure 4.4: “L”-shaped form of a
cross-validation curve with unrelated
the model
response and predictors.
15
For example, based on the data-
Y = β 0 + β 1 X1 + . . . + β p X p + ε, subject to β 1 = . . . = β q = 0. driven penalization parameters λ̂k-CV
or λ̂k-1SE .
This is a very simple constraint that we know how to solve: just 16
Note that with this approach we
include the p − q remaining predictors in the model and fit it. It is assign to the more computationally
efficient lasso the “hard work” of
however a specific case of a linear constraint on β, since β 1 = . . . = coming up with a set of relevant
β q = 0 is expressible as predictors from the whole dataset,
whereas the betterment of that model
is done with the more demanding
Iq 0q×( p−q) β −1 = 0 q , (4.11) stepwise regression (if the number of
q× p
predictors is smaller than n).
where Iq is an q × q identity matrix and β−1 = ( β 1 , . . . , β p )0 . The Note that this is a random quantity,
17
means that β 0 and β̂ 0 are null, hence they are not included in the
model. That is, that the model
Y = β 1 X1 + . . . + β p X p + ε (4.13)
For the general case given in (4.12), in which neither Y and X nor
the sample are centred, the estimator of β in (4.12) is unaltered for
the slopes and equals (4.15). The intercept is given by
The next code illustrates how to fit a linear model with con-
straints in practice.
# Simulate data
set.seed(123456)
n <- 50
p <- 3
x1 <- rnorm(n, mean = 1)
x2 <- rnorm(n, mean = 2)
x3 <- rnorm(n, mean = 3)
eps <- rnorm(n, sd = 0.5)
y <- 1 + 2 * x1 - 3 * x2 + x3 + eps
## x1Cen 1.9873776
## x2Cen -3.1449015
## x3Cen 0.9828062
where
The inference for constrained linear models is not built within base
R. Therefore, we just give a couple of insights about (4.16) and do
not pursue inference further. Note that:
Y = B0 X + ε (4.17)
E[Y|X = x] = B0 x.
notes for predictive modeling 165
Y = XB + E, (4.19)
21
This notation is required to avoid
confusions between the design matrix
where Y, X 21 ,
and E are clearly identified by comparing (4.19)
and the random vector (not matrix) X.
with (4.18).
The approach for estimating B is really similar to the univariate
multiple linear model: minimize the sum of squared distances be-
tween the responses Y1 , . . . , Yn and their explanations B0 X1 , . . . , B0 Xn .
These distances are now measured by the k · k2 norm, resulting
n
RSS(B) := ∑ kYi − B0 Xi k22
i =1
n
= ∑ ( Yi − B 0 Xi )0 ( Yi − B 0 Xi )
i =1
= tr (Y − XB)0 (Y − XB) .
22
Employing the centred version of
The similarities with (2.6) are clear and it is immediate to see that it
the univariate multiple linear model,
appears as a special case for q = 1 22 . as we have done in this section for the
multivariate version.
B̂ := arg min RSS(B) = (X0 X)−1 X0 Y. (4.20)
B∈M p×q
Recall that if the responses and predictors are not centred, then the
estimate of the intercept is simply obtained from the sample means
Ȳ := (Ȳ1 , . . . , Ȳq )0 and X̄ = ( X̄1 , . . . , X̄ p )0 :
# Linear model
B <- matrix((-1)^(1:p) * (1:p), nrow = p, ncol = q, byrow = TRUE)
Y <- X %*% B + E
# Compare with B
B
## [,1] [,2]
## [1,] -1 2
## [2,] -3 -1
## [3,] 2 -3
# Summary of the model: gives q separate summaries, one for each fitted
# univariate model
summary(mod)
## Response Y1 :
##
## Call:
## lm(formula = Y1 ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0432 -1.3513 0.2592 1.1325 3.5298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
notes for predictive modeling 167
# Exactly equivalent to
summary(lm(Y[, 1] ~ X))
##
## Call:
## lm(formula = Y[, 1] ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0432 -1.3513 0.2592 1.1325 3.5298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.05017 0.96251 0.052 0.9585
## X1 -0.54770 0.24034 -2.279 0.0249 *
## X2 -3.01547 0.26146 -11.533 < 2e-16 ***
## X3 1.88327 0.21537 8.745 7.38e-14 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.695 on 96 degrees of freedom
## Multiple R-squared: 0.7033, Adjusted R-squared: 0.694
## F-statistic: 75.85 on 3 and 96 DF, p-value: < 2.2e-16
summary(lm(Y[, 2] ~ X))
##
## Call:
## lm(formula = Y[, 2] ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1385 -0.7922 -0.0486 0.8987 3.6599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3690 0.8897 -0.415 0.67926
168 eduardo garcía portugués
i. Linearity: E[Y|X = x] = B0 x.
ii. Homoscedasticity: Var[ε| X1 = x1 , . . . , X p = x p ] = Σ.
iii. Normality: ε ∼ Nq (0, Σ).
iv. Independence of the errors: ε1 , . . . , εn are independent (or
uncorrelated, E[εi ε0j ] = 0, i 6= j, since they are assumed to be
normal).
where Σ = (σij ) and σii = σi2 .24 Observe how the covariance of the
24
# MANOVA table
manova(modIris)
## Call:
## manova(modIris)
##
## Terms:
## Sepal.Length Sepal.Width Species Residuals
## Petal.Width 57.9177 6.3975 17.5745 4.6802
## Petal.Length 352.8662 50.0224 49.7997 11.6371
## Deg. of Freedom 1 1 2 145
##
## Residual standard errors: 0.1796591 0.2832942
## Estimated effects may be unbalanced
4.3.3 Shrinkage
Applying shrinkage is also possible in multivariate linear models.
In particular, this allows to fit models with p > n predictors. The
glmnet package implements the elastic net regularization for mul-
tivariate linear models with extensions of the k · k1 and k · k2 norm
penalties for a vector of parameters in R p , as considered in Section
4.1, to norms for a matrix of parameters of size p × q. Precisely:
350
300
# Extract the coefficients associated to some fits
kcvLassoMfit <- kcvLassoM$glmnet.fit
250
coefs <- predict(kcvLassoMfit, type = "coefficients",
Mean-Squared Error
s = c(kcvLassoMfit$lambda.min, kcvLassoMfit$lambda.1se))
200
str(coefs, 1)
150
## List of 5
## $ y1:Formal class ’dgCMatrix’ [package "Matrix"] with 6 slots
100
# Plot true B
image(1:q, 1:p, t(B), col = viridisLite::viridis(20))
In (4.25) above, xi0 is the i-th row of the design matrix X. The ex-
pression follows from the Sherman–Morrison formula for an invert-
ible matrix A and a vector b,
A−1 bb0 A−1
(A + bb0 )−1 = A−1 − ,
1 + b 0 A −1 b
and from the equalities
1. Start from a reduced dataset Xold ≡ X−i and Yold ≡ Y−i for
which the least squares estimate can be computed. Denote it by
β̂old ≡ β̂−i .
2. Add one of the remaining data points to get β̂new ≡ β̂ from
(4.25) and (4.26).
3. Set β̂new ← β̂old and Xnew ← Xold .
4. Repeat steps 2–3 until there are no remaining data points left.
5. Return β̂ ← β̂new .
# biglm has a very similar syntaxis to lm -- but the formula interface does not
# work always as expected
# biglm::biglm(formula = resp ~ ., data = bigData1) # Does not work
# biglm::biglm(formula = y ~ x) # Does not work
# biglm::biglm(formula = resp ~ pred.1 + pred.2, data = bigData1) # Does work,
# but not very convenient for a large number of predictors
# Hack for automatic inclusion of all the predictors
f <- formula(paste("resp ~", paste(names(bigData1)[-1], collapse = " + ")))
biglmMod <- biglm::biglm(formula = f, data = bigData1)
# lm’s call
lmMod <- lm(formula = resp ~ ., data = bigData1)
# Summaries
s1 <- summary(biglmMod)
s2 <- summary(lmMod)
s1
176 eduardo garcía portugués
# Further information
s1$mat # Coefficients and their inferences
## Coef (95% CI) SE p
## (Intercept) 1.0021454430 0.9939053491 1.010385537 0.0041200470 0.000000e+00
## pred.1 -0.9732674585 -1.0132653005 -0.933269616 0.0199989210 0.000000e+00
## pred.2 -0.2866314070 -0.2911021089 -0.282160705 0.0022353509 0.000000e+00
## pred.3 -0.0534833941 -0.0554827653 -0.051484023 0.0009996856 0.000000e+00
## pred.4 -0.0040771777 -0.0060739907 -0.002080365 0.0009984065 4.432709e-05
## pred.5 -0.0002051218 -0.0022030377 0.001792794 0.0009989579 8.373098e-01
## pred.6 0.0002828388 -0.0017149118 0.002280589 0.0009988753 7.770563e-01
## pred.7 0.0026085425 0.0006093153 0.004607770 0.0009996136 9.066118e-03
## pred.8 0.0520743791 0.0500755376 0.054073221 0.0009994208 0.000000e+00
## pred.9 0.2840358104 0.2800374345 0.288034186 0.0019991879 0.000000e+00
## pred.10 0.9866850849 0.9667099026 1.006660267 0.0099875911 0.000000e+00
s1$rsq # R^2
## [1] 0.5777074
s1$nullrss # SST (as in Section 2.6)
## [1] 2364861
# Extract coefficients
coef(biglmMod)
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7
## 1.0021454430 -0.9732674585 -0.2866314070 -0.0534833941 -0.0040771777 -0.0002051218 0.0002828388 0.0026085425
## pred.8 pred.9 pred.10
## 0.0520743791 0.2840358104 0.9866850849
notes for predictive modeling 177
-860000
tion.
-860000
-860000
# Model selection adapted to big data models
bic
-860000
reg <- leaps::regsubsets(biglmMod, nvmax = p, method = "exhaustive")
plot(reg) # Plot best model (top row) to worst model (bottom row) -860000
-850000
-840000
-560000
pred.1
pred.2
pred.3
pred.4
pred.5
pred.6
pred.7
pred.8
pred.9
pred.10
# It also works with ordinary linear models and it is much faster and
# informative than stepAIC
reg <- leaps::regsubsets(resp ~ ., data = bigData1, nvmax = p,
method = "backward")
subs <- summary(reg)
subs$bic
## [1] -558603.7 -837859.9 -854062.3 -856893.8 -859585.3 -861936.5 -861939.3 -861932.3 -861918.6 -861904.8
subs$which[which.min(subs$bic), ]
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9
## TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
## pred.10
## TRUE
# First fit
bigMod <- biglm::biglm(y ~ x, data = data.frame(y, x))
# Update the fit chunk by chunk
# pb <- txtProgressBar(style = 3)
for (i in 2:nChunks) {
  # Progress
  # setTxtProgressBar(pb = pb, value = i / nChunks)
  # Update the model with the i-th chunk (the loading of that chunk into y and
  # x, which depends on how the chunks are stored, is omitted here)
  bigMod <- update(bigMod, moredata = data.frame(y = y, x = x))
}
# Final model
summary(bigMod)
## Large data regression model: biglm::biglm(y ~ x, data = data.frame(y, x))
## Sample size = 100000000
## Coef (95% CI) SE p
## (Intercept) 1.0003 0.9995 1.0011 4e-04 0.0000
## x1 -1.0015 -1.0055 -0.9975 2e-03 0.0000
## x2 -0.2847 -0.2852 -0.2843 2e-04 0.0000
## x3 -0.0531 -0.0533 -0.0529 1e-04 0.0000
## x4 -0.0041 -0.0043 -0.0039 1e-04 0.0000
## x5 0.0002 0.0000 0.0004 1e-04 0.0760
## x6 -0.0001 -0.0003 0.0001 1e-04 0.2201
## x7 0.0041 0.0039 0.0043 1e-04 0.0000
## x8 0.0529 0.0527 0.0531 1e-04 0.0000
## x9 0.2844 0.2840 0.2848 2e-04 0.0000
## x10 1.0007 0.9987 1.0027 1e-03 0.0000
print(object.size(bigMod), units = "KiB")
## 7.8 KiB
$$Y \,|\, (X_1 = x_1, \ldots, X_p = x_p) \sim \mathcal{N}(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p, \sigma^2)$$

and hence

$$\mathbb{E}[Y \,|\, X_1 = x_1, \ldots, X_p = x_p] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.$$
the NASA Space Shuttle orbiter Challenger broke apart and disintegrated at 73 seconds into its flight, leading to the deaths of its seven crew members. The accident had serious consequences for the shuttle program. The Presidential Rogers Commission (formed by astronaut Neil A. Armstrong and Nobel laureate Richard P. Feynman, among others) was created in order to investigate the causes of the disaster.

The Rogers Commission elaborated a report (Presidential Commission on the Space Shuttle Challenger Accident, 1986) with all the findings. The commission determined that the disintegration began with the failure of an O-ring seal in the solid rocket motor due to the unusually cold temperature (−0.6 degrees Celsius) during the launch. This failure produced a breach of burning gas through the solid rocket motor that compromised the whole shuttle structure, resulting in its disintegration due to the extreme aerodynamic forces. The problem with O-rings was something known:

[Figure 5.1: Scatterplot of a sample {(Xi, Yi)}, i = 1, ..., n, sampled from a logistic regression.]
[Figure 5.2: Challenger launch and posterior explosion, as broadcast live by NBC on 28/01/1986. Video also available here.]
Let's begin the analysis by replicating Figures 5.3a and 5.3b and checking that linear regression is not the right tool for answering Q1–Q3.
challenger <- read.table(file = "challenger.txt", header = TRUE, sep = "\t")
car::scatterplot(nfails.field ~ temp, smooth = FALSE, boxplots = FALSE,
                 data = challenger)

There is a fundamental problem in using linear regression for these data: the response is not continuous. As a consequence, there is no linearity and the errors around the mean are not normal (indeed, they are strongly non-normal). Let's check this with the corresponding diagnostic plots:

mod <- lm(nfails.field ~ temp, data = challenger)
plot(mod, 1)
plot(mod, 2)

Although linear regression is not the adequate tool for these data, it is able to detect the obvious difference between the two plots:

1. The trend for launches with incidents is flat, hence suggesting there is no dependence on the temperature (Figure 5.3a). This was one of the arguments behind NASA's decision to launch the rocket at a temperature of −0.6 degrees Celsius.
2. However, the trend for all launches indicates a clear negative dependence between the number of incidents and the temperature (Figure 5.3b).
Along this chapter we will see the required tools for answering precisely Q1–Q3.

For simplicity, we first study the logistic regression and then present the general case of generalized linear models. Recall that the linear model assumes

$$\mathbb{E}[Y \,|\, X_1 = x_1, \ldots, X_p = x_p] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \tag{5.1}$$

and

$$Y \,|\, (X_1 = x_1, \ldots, X_p = x_p) \sim \mathcal{N}(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p, \sigma^2).$$
$$\mathbb{P}[Y = y] = p^y (1 - p)^{1 - y}, \quad y = 0, 1. \tag{5.2}$$

$$\mathbb{E}[Y \,|\, X_1 = x_1, \ldots, X_p = x_p] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p =: \eta.$$

$$g^{-1}(\eta) = \mathrm{logistic}(\eta) := \frac{e^\eta}{1 + e^\eta} = \frac{1}{1 + e^{-\eta}}.$$
[Figure: the transformations g⁻¹(η) for the logit and probit links, compared with the linear regression fit.]

The inverse of the logistic function, mapping [0, 1] → ℝ, is known as the logit function:

$$\mathrm{logit}(p) := \mathrm{logistic}^{-1}(p) = \log\frac{p}{1 - p}.$$

In conclusion, with the logit link function we can map the domain of Y to ℝ in order to apply a linear model. The logistic model can then be equivalently stated in terms of the log-odds, as in (5.8) below. The odds of Y is defined as

$$\mathrm{odds}(Y) := \frac{p}{1 - p} = \frac{\mathbb{P}[Y = 1]}{\mathbb{P}[Y = 0]}. \tag{5.6}$$
The odds is the ratio between the probability of success and the probability of failure. It is extensively used in betting⁵ due to its better interpretability⁶. Conversely, if the odds of Y is given, we can easily know what is the probability of success p, using the inverse of (5.6)⁷:

$$p = \mathbb{P}[Y = 1] = \frac{\mathrm{odds}(Y)}{1 + \mathrm{odds}(Y)}.$$

Recall that the odds is a number in $[0, +\infty]$. The 0 and $+\infty$ values are attained for $p = 0$ and $p = 1$, respectively. The log-odds (or logit) is a number in $[-\infty, +\infty]$.

We can rewrite (5.4) in terms of the odds (5.6)⁸ so we get:

$$\mathrm{odds}(Y \,|\, X_1 = x_1, \ldots, X_p = x_p) = e^\eta = e^{\beta_0} e^{\beta_1 x_1} \cdots e^{\beta_p x_p}, \tag{5.7}$$
$$\log(\mathrm{odds}(Y \,|\, X_1 = x_1, \ldots, X_p = x_p)) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p. \tag{5.8}$$

⁵ Recall that (traditionally) the result of a bet is binary: you either win or lose the bet.
⁶ For example, if a horse Y has a probability p = 2/3 of winning a race (Y = 1), then the odds of the horse is (2/3)/(1/3) = 2. This means that the horse has a probability of winning that is twice as large as the probability of losing. This is sometimes written as 2 : 1 or 2 × 1 ("two-to-one").
⁷ For the previous example, if the odds of the horse were 5, that would correspond to a probability of winning p = 5/6.
⁸ To do so, apply (5.6) to (5.4) and use (5.3).
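In R, the multiplicative effects on the odds given in (5.7) can be inspected by exponentiating the estimated coefficients of a fitted logistic model. A minimal sketch, assuming the Challenger fit nasa (fail.field ~ temp) employed throughout this chapter, is:

# Multiplicative effects on the odds, as in (5.7); the values correspond to
# the e^beta-hat figures discussed next
exp(coef(nasa))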
# Plot data
plot(challenger$temp, challenger$fail.field, xlim = c(-1, 30),
     xlab = "Temperature", ylab = "Incident probability")

# Draw the fitted logistic curve (y contains the evaluations of exp(-eta-hat)
# on a temperature grid x; their computation is not visible in this excerpt)
y <- 1 / (1 + y)
lines(x, y, col = 2, lwd = 2)

# The Challenger

At the sight of this curve and the summary it seems that the temperature has a strong effect on the probability of an incident:
• e^β̂0 = 1965.974. This means that, when the temperature is zero, the fitted odds is 1965.974, so the (estimated) probability of having an incident (Y = 1) is 1965.974 times larger than the probability of not having an incident (Y = 0). Or, in other words, the probability of having an incident at temperature zero is 1965.974 / (1965.974 + 1) = 0.999.
• e^β̂1 = 0.659. This means that each Celsius degree increment on the temperature multiplies the fitted odds by a factor of 0.659 ≈ 2/3, hence reducing it.
However, for the moment we cannot say whether these findings are significant, since we do not have information on the variability of the estimates of β. We will need inference for that.
Estimation by maximum likelihood

The estimation of β from a sample {(Xi, Yi)}, i = 1, ..., n, is done by Maximum Likelihood Estimation (MLE). As can be seen in Appendix A.2, MLE is equivalent to least squares in the linear model under the assumptions mentioned in Section 2.3, particularly normality and independence. In the logistic model, we assume that⁹

$$Y_i \,|\, (X_1 = x_{i1}, \ldots, X_p = x_{ip}) \sim \mathrm{Ber}(\mathrm{logistic}(\eta_i)), \quad i = 1, \ldots, n,$$

⁹ Section 5.7 discusses in detail the assumptions of generalized linear models.
# Data
dataMle <- data.frame(x = x, y1 = y1, y2 = y2, y3 = y3)
For fitting a logistic model we employ glm, which has the syntax glm(formula = response ~ predictor, family = "binomial", data = data), where response is a binary variable. Note that family = "binomial" refers to the fact that the response is a binomial variable (a Bernoulli is a particular case). Let's check that indeed the coefficients given by glm are the ones that maximize the likelihood shown in the animation of Figure 5.5. We do so for y1 ~ x.
# Call glm
mod <- glm(y1 ~ x, family = "binomial", data = dataMle)
mod$coefficients
## (Intercept) x
## -0.1691947 2.4281626
# -loglik(beta)
minusLogLik <- function(beta) {
p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
-sum(y1 * log(p) + (1 - y1) * log(1 - p))
}
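The output below corresponds to a numerical minimization of minusLogLik whose call is not visible in this excerpt. A call consistent with that kind of output, assuming optim's default Nelder–Mead method and a zero starting value, would be along the lines of:

# Numerical minimization of the minus log-likelihood (sketch; the original
# call and starting values are assumptions)
optim(par = c(0, 0), fn = minusLogLik)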
## [1] 14.79376
##
## $counts
## function gradient
## 73 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
# The plot.axes argument is a hack to add graphical information within the
# coordinates of the main panel (behind filled.contour there is a layout()...)
$$g(\mathbb{E}[Y \,|\, X_1 = x_1, \ldots, X_p = x_p]) = \eta$$

or, equivalently,

$$\mathbb{E}[Y \,|\, X_1 = x_1, \ldots, X_p = x_p] = g^{-1}(\eta),$$

$$f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}, \tag{5.10}$$

where a(·), b(·), and c(·, ·) are specific functions. If Y has the pdf (5.10), then we write Y ∼ E(θ, φ, a, b, c). If the scale parameter φ is known, this is an exponential family with canonical parameter θ (if φ is unknown, then it may or may not be a two-parameter exponential family). Distributions from the exponential family have some nice properties. Importantly, if Y ∼ E(θ, φ, a, b, c), then

$$\theta = g(\mu) \tag{5.12}$$

$$g(\mu) = (b')^{-1}(\mu). \tag{5.13}$$

For the normal distribution, $a(\phi) = \phi$, $b(\theta) = \frac{\theta^2}{2}$, and $c(y, \phi) = -\frac{1}{2}\left(\frac{y^2}{\phi} + \log(2\pi\phi)\right)$,

Support of Y     Distribution   Link g(µ)   g⁻¹(η)         φ     Y | (X1 = x1, ..., Xp = xp)
ℝ                N(µ, σ²)       µ           η              σ²    N(η, σ²)
{0, 1}           B(1, p)        logit(µ)    logistic(η)    1     B(1, logistic(η))
{0, 1, 2, ...}   Pois(λ)        log(µ)      e^η            1     Pois(e^η)
$$Y \,|\, (X_1 = x_1, \ldots, X_p = x_p) \sim \mathrm{Pois}(e^\eta),$$

that is,

$$\mathbb{E}[Y \,|\, X_1 = x_1, \ldots, X_p = x_p] = \lambda(Y \,|\, X_1 = x_1, \ldots, X_p = x_p) = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}. \tag{5.15}$$

Notice that, since in the Poisson distribution the mean and the variance are equal, this implies that the variance of Y | (X1 = x1, ..., Xp = xp) changes according to the value of the predictors. The interpretation of the coefficients is clear from (5.15): an increment of one unit in the predictor Xj, keeping the rest fixed, multiplies the expected count by a factor of e^{βj}.
# Data
plot(Species ~ Biomass, data = species, col = pH)
legend("topright", legend = c("Low pH", "Medium pH", "High pH"),
       col = 1:3, lwd = 2)
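The object species1 summarized next is a Poisson fit whose defining call is not visible in this excerpt; judging from the model comparison later in this section (where model 1 is Species ~ pH + Biomass), it presumably corresponds to:

# Poisson regression without interactions (presumed fit; its call is not shown
# in this excerpt but matches "Model 1: Species ~ pH + Biomass" below)
species1 <- glm(Species ~ pH + Biomass, data = species, family = poisson)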
summary(species1)
##
## Call:
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -2.5959 -0.6989 -0.0737  0.6647  3.5604
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.84894 0.05281 72.885 < 2e-16 ***
## pHlow -1.13639 0.06720 -16.910 < 2e-16 ***
## pHmed -0.44516 0.05486 -8.114 4.88e-16 ***
## Biomass -0.12756 0.01014 -12.579 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
# With interactions
species2 <- glm(Species ~ Biomass * pH, data = species, family = poisson)
summary(species2)
##
## Call:
## glm(formula = Species ~ Biomass * pH, family = poisson, data = species)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4978 -0.7485 -0.0402 0.5575 3.2297
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.76812 0.06153 61.240 < 2e-16 ***
## Biomass -0.10713 0.01249 -8.577 < 2e-16 ***
## pHlow -0.81557 0.10284 -7.931 2.18e-15 ***
## pHmed -0.33146 0.09217 -3.596 0.000323 ***
## Biomass:pHlow -0.15503 0.04003 -3.873 0.000108 ***
## Biomass:pHmed -0.03189 0.02308 -1.382 0.166954
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 452.346 on 89 degrees of freedom
## Residual deviance: 83.201 on 84 degrees of freedom
## AIC: 514.39
##
## Number of Fisher Scoring iterations: 4
exp(species2$coefficients)
## (Intercept) Biomass pHlow pHmed Biomass:pHlow Biomass:pHmed
## 43.2987424 0.8984091 0.4423865 0.7178730 0.8563910 0.9686112
# - If pH decreases to med (low), then the multiplicative effect of each unit
#   of Biomass on the expected number of species is further multiplied by
#   0.9686 (0.8564). Hence, the lower the pH, the stronger the negative effect
#   of Biomass on Species
# Draw fits
plot(Species ~ Biomass, data = species, col = pH)
legend("topright", legend = c("High pH", "Medium pH", "Low pH"),
       col = 1:3, lwd = 2)

# Without interactions
bio <- seq(0, 10, l = 100)
z <- species1$coefficients[1] + species1$coefficients[4] * bio
lines(bio, exp(z), col = 1)
lines(bio, exp(species1$coefficients[2] + z), col = 2)
lines(bio, exp(species1$coefficients[3] + z), col = 3)
Estimation by maximum likelihood

The estimation of β by MLE can be done in a unified framework for all generalized linear models thanks to the exponential family (5.10). Given {(Xi, Yi)}, i = 1, ..., n, and employing a canonical link function (5.13), we have that

$$Y_i \,|\, (X_1 = x_{i1}, \ldots, X_p = x_{ip}) \sim \mathrm{E}(\theta_i, \phi, a, b, c), \quad i = 1, \ldots, n,$$

where

$$\theta_i := \eta_i := \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \qquad \mu_i := \mathbb{E}[Y_i \,|\, X_1 = x_{i1}, \ldots, X_p = x_{ip}] = g^{-1}(\eta_i).$$
$$\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^n \frac{(Y_i - b'(\theta_i))}{a(\phi)}\frac{\partial \theta_i}{\partial \beta}$$

$$\left.\frac{\partial^2 \ell(\beta)}{\partial\beta\partial\beta'}\right|_{\beta=\beta_{\mathrm{old}}}(\beta_{\mathrm{new}} - \beta_{\mathrm{old}}) = -\left.\frac{\partial \ell(\beta)}{\partial\beta}\right|_{\beta=\beta_{\mathrm{old}}}. \tag{5.18}$$

A simplifying trick is to consider the expectation of $\left.\frac{\partial^2 \ell(\beta)}{\partial\beta\partial\beta'}\right|_{\beta=\beta_{\mathrm{old}}}$ in (5.18) rather than its actual value. By doing so, we can arrive at a neat iterative algorithm called Iterative Reweighted Least Squares (IRLS). To that aim, we use the following well-known property of the Fisher information matrix of the MLE theory:

$$\mathbb{E}\left[\frac{\partial^2 \ell(\beta)}{\partial\beta\partial\beta'}\right] = -\mathbb{E}\left[\frac{\partial \ell(\beta)}{\partial\beta}\left(\frac{\partial \ell(\beta)}{\partial\beta}\right)'\right].$$

Then, it can be seen that¹³

$$\mathbb{E}\left[\left.\frac{\partial^2 \ell(\beta)}{\partial\beta\partial\beta'}\right|_{\beta=\beta_{\mathrm{old}}}\right] = -\sum_{i=1}^n w_i \mathbf{x}_i\mathbf{x}_i' = -\mathbf{X}'\mathbf{W}\mathbf{X}, \tag{5.19}$$

¹³ Recall that $\mathbb{E}[(Y_i - \mu_i)(Y_j - \mu_j)] = \mathrm{Cov}[Y_i, Y_j] = V_i$ if $i = j$ and $0$ if $i \neq j$, because of independence.
In general, $\mathbb{E}\left[\frac{\partial^2 \ell(\beta)}{\partial\beta\partial\beta'}\right] \neq \frac{\partial^2 \ell(\beta)}{\partial\beta\partial\beta'}$. Thus, IRLS in general departs from the standard Newton–Raphson. However, if the canonical link is used, it can be seen that the equality of the matrices is guaranteed and IRLS is exactly the same as Newton–Raphson. In that case, $w_i = \frac{1}{g'(\mu_i)}$ (which simplifies the computation of W in the algorithm above).
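As an illustration of these ideas, the following is a minimal IRLS sketch for the logistic model with canonical link. It is not the exact algorithm statement of these notes, just a direct translation of the weighted least squares iteration; the function name irls_logistic is an assumption.

# Minimal IRLS sketch for logistic regression (canonical link). X is the
# design matrix including the intercept column; y is the 0/1 response.
# No safeguards are included for fitted probabilities near 0 or 1.
irls_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (k in 1:max_iter) {
    eta <- drop(X %*% beta)            # linear predictor
    mu <- 1 / (1 + exp(-eta))          # fitted probabilities
    w <- mu * (1 - mu)                 # IRLS weights for the canonical link
    z <- eta + (y - mu) / w            # working response
    beta_new <- solve(crossprod(X, w * X), crossprod(X, w * z))
    if (max(abs(beta_new - beta)) < tol) {
      beta <- beta_new
      break
    }
    beta <- beta_new
  }
  drop(beta)
}

# Check against glm on the simulated data of the previous chunks:
# irls_logistic(X = cbind(1, x), y = y1)
# glm(y1 ~ x, family = "binomial")$coefficients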
$$\mathcal{I}(\beta) := -\mathbb{E}\left[\frac{\partial^2 \ell(\beta)}{\partial\beta\partial\beta'}\right]$$

is the Fisher information matrix. The name comes from the fact that it measures the information available in the sample for estimating β. The "larger" (large eigenvalues) the matrix is, the more precise the estimation of β is, because that results in smaller variances in (5.22). The inverse of the Fisher information matrix gives the asymptotic variance-covariance matrix of β̂. In particular, asymptotically,

$$\frac{\hat\beta_j - \beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)} \sim \mathcal{N}(0, 1), \qquad \hat{\mathrm{SE}}(\hat\beta_j)^2 := v_j, \tag{5.24}$$

where $v_j$ is the j-th diagonal element of $\mathcal{I}(\hat\beta)^{-1}$.
Thanks to (5.24), we can have the 100(1 − α)% CI for the coefficient $\beta_j$, $j = 0, \ldots, p$:

$$\hat\beta_j \pm \hat{\mathrm{SE}}(\hat\beta_j)\, z_{\alpha/2}, \tag{5.25}$$

where $z_{\alpha/2}$ is the α/2-upper quantile of the $\mathcal{N}(0, 1)$. In case we are interested in the CI for $e^{\beta_j}$, we can simply take the exponential of the above CI. So the 100(1 − α)% CI for $e^{\beta_j}$, $j = 0, \ldots, p$, is

$$e^{\hat\beta_j \pm \hat{\mathrm{SE}}(\hat\beta_j) z_{\alpha/2}}.$$

Of course, this CI is not the same as $e^{\hat\beta_j} \pm e^{\hat{\mathrm{SE}}(\hat\beta_j) z_{\alpha/2}}$, which is not a valid CI for $e^{\beta_j}$.
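In R, the Wald-type CIs (5.25) and their exponentiated versions can be obtained from a fitted glm with confint.default, which is based precisely on the normal approximation above. A minimal sketch, assuming the logistic fit nasa1 (fail.field ~ temp) employed later in this chapter:

# Wald CIs (5.25) for the coefficients, and the CIs for exp(beta_j) obtained by
# exponentiating their endpoints
confint.default(nasa1, level = 0.95)
exp(confint.default(nasa1, level = 0.95))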
$$H_0: \beta_j = 0$$

$$\frac{\hat\beta_j - 0}{\hat{\mathrm{SE}}(\hat\beta_j)},$$

5.4 Prediction

$$\mathbb{E}[Y \,|\, X_1 = x_1, \ldots, X_p = x_p] = g^{-1}(\eta) = g^{-1}(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)$$
# Function for computing the predictions and CIs for the conditional probability
predictCIsLogistic <- function(object, newdata, level = 0.95) {
  # Predictions in the log-odds, with standard errors
  pred <- predict(object = object, newdata = newdata, se.fit = TRUE)
  # CI in the log-odds
  za <- qnorm(p = (1 - level) / 2)
  lwr <- pred$fit + za * pred$se.fit
  upr <- pred$fit - za * pred$se.fit
  # Transform to probabilities
  fit <- 1 / (1 + exp(-pred$fit))
  lwr <- 1 / (1 + exp(-lwr))
  upr <- 1 / (1 + exp(-upr))
  # Return fit and CI limits
  data.frame(fit = fit, lwr = lwr, upr = upr)
}

# Simple call (newdata contains the Challenger's launch temperature, temp = -0.6,
# as implied by the comments below)
newdata <- data.frame(temp = -0.6)
predictCIsLogistic(nasa, newdata = newdata)
## fit lwr upr
## 1 0.999604 0.4838505 0.9999999
# The CI is large because there is no data around temp = -0.6 and
# that makes the prediction more variable (and also because we only
# have 23 observations)
5.5 Deviance

$$D = -\frac{2}{a(\phi)}\sum_{i=1}^n\left[Y_i\hat\theta_i - b(\hat\theta_i) - Y_i g(Y_i) + b(g(Y_i))\right]\phi
= \frac{2\phi}{a(\phi)}\sum_{i=1}^n\left[Y_i(g(Y_i) - \hat\theta_i) - b(g(Y_i)) + b(\hat\theta_i)\right]. \tag{5.26}$$

$$D = \frac{2\sigma^2}{\sigma^2}\sum_{i=1}^n\left[Y_i(Y_i - \hat\eta_i) - \frac{Y_i^2}{2} + \frac{\hat\eta_i^2}{2}\right]
= \sum_{i=1}^n\left[2Y_i^2 - 2Y_i\hat\eta_i - Y_i^2 + \hat\eta_i^2\right]
= \sum_{i=1}^n(Y_i - \hat\eta_i)^2 = \mathrm{RSS}(\hat\beta), \tag{5.27}$$

since $\hat\eta_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_p x_{ip}$. Remember that $\mathrm{RSS}(\hat\beta)$ is just another name for the SSE.
A benchmark for evaluating the scale of the deviance is the null deviance,

$$D_0 := -2\left[\ell(\hat\beta_0) - \ell_s\right]\phi,$$

which is the deviance of the worst model, the one fitted without any predictor (only an intercept), with respect to the perfect model. For example, in logistic regression:

$$Y \,|\, (X_1 = x_1, \ldots, X_p = x_p) \sim \mathrm{Ber}(\mathrm{logistic}(\beta_0)).$$

In this case, $\hat\beta_0 = \mathrm{logit}\left(\frac{m}{n}\right) = \log\frac{m/n}{1 - m/n}$, where m is the number of 1's in $Y_1, \ldots, Y_n$ (see Figure 5.9).

Using again (5.26), we can see that the null deviance is a generalization of the total sum of squares of the linear model (see Section 2.6):

$$D_0 = \sum_{i=1}^n(Y_i - \hat\eta_i)^2 = \sum_{i=1}^n\left(Y_i - \hat\beta_0\right)^2 = \mathrm{SST},$$
Using the deviance and the null deviance, we can compare how much the model has improved by adding the predictors $X_1, \ldots, X_p$ and quantify the percentage of deviance explained. This can be done by means of the $R^2$ statistic, which is a generalization of the determination coefficient for linear regression:

$$R^2 := 1 - \frac{D}{D_0} \overset{\text{linear model}}{=} 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}.$$
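The helper r2glm used in the next chunks is not defined in this excerpt; a minimal implementation consistent with the definition above is:

# Deviance-based R^2 for a glm fit: R^2 := 1 - D / D_0
r2glm <- function(model) {
  1 - model$deviance / model$null.deviance
}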
# R^2
r2glm(nasa)
## [1] 0.280619
r2glm(null)
## [1] -2.220446e-16
n − p − 1. This provides

$$\hat\phi_D := \frac{D}{n - p - 1} = \frac{-2(\ell(\hat\beta) - \ell_s)\phi}{n - p - 1},$$

which, as expected, in the case of the linear model is equivalent to $\hat\sigma^2$ as given in (2.15). More importantly, the scaled deviance can be used for performing hypothesis tests on sets of coefficients of a generalized linear model.
Assume we have one model, say model 2, with p2 predictors and another model, say model 1, with p1 < p2 predictors that are contained in the set of predictors of model 2. In other words, assume model 1 is nested within model 2. Then we can test the null hypothesis that the extra coefficients of model 2 are simultaneously zero. For example, if model 1 has the coefficients {β0, β1, ..., βp1} and model 2 has coefficients {β0, β1, ..., βp1, βp1+1, ..., βp2}, we can test H0: βp1+1 = ... = βp2 = 0 using that, asymptotically,

$$D^*_{p_1} - D^*_{p_2} \overset{H_0}{\sim} \chi^2_{p_2 - p_1}. \tag{5.29}$$
# Polynomial predictors
nasa0 <- glm(fail.field ~ 1, family = "binomial", data = challenger)
nasa1 <- glm(fail.field ~ temp, family = "binomial", data = challenger)
nasa2 <- glm(fail.field ~ poly(temp, degree = 2), family = "binomial",
data = challenger)
nasa3 <- glm(fail.field ~ poly(temp, degree = 3), family = "binomial",
data = challenger)
# Plot fits
temp <- seq(-1, 35, l = 200)
tt <- data.frame(temp = temp)
plot(fail.field ~ temp, data = challenger, pch = 16, xlim = c(-1, 30),
xlab = "Temperature", ylab = "Incident probability")
lines(temp, predict(nasa0, newdata = tt, type = "response"), col = 1)
lines(temp, predict(nasa1, newdata = tt, type = "response"), col = 2)
lines(temp, predict(nasa2, newdata = tt, type = "response"), col = 3)
lines(temp, predict(nasa3, newdata = tt, type = "response"), col = 4)
legend("bottomleft", legend = c("Null model", "Linear", "Quadratic", "Cubic"),
lwd = 2, col = 1:4)
# R^2's
r2glm(nasa0)
## [1] -2.220446e-16
r2glm(nasa1)
## [1] 0.280619
r2glm(nasa2)
## [1] 0.4831863
# Chisq and F tests -- same results since phi is known
# Cubic vs linear
anova(nasa1, nasa3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: fail.field ~ temp
## Model 2: fail.field ~ poly(temp, degree = 3)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 21 20.335
## 2 19 14.609 2 5.726 0.0571 .
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
# Comparison
anova(species1, species2, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Species ~ pH + Biomass
## Model 2: Species ~ Biomass * pH
# Summaries
summary(nasa1)
##
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0566 -0.7575 -0.3818 0.4571 2.2195
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.5837 3.9146 1.937 0.0527 .
## temp -0.4166 0.1940 -2.147 0.0318 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.335 on 21 degrees of freedom
## AIC: 24.335
##
## Number of Fisher Scoring iterations: 5
summary(nasa2)
##
## Call:
## glm(formula = fail.field ~ temp + pres.field, family = "binomial",
## data = challenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2109 -0.6081 -0.4292 0.3498 2.0913
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.642709 4.038547 1.645 0.1000
## temp -0.435032 0.197008 -2.208 0.0272 *
## pres.field 0.009376 0.008821 1.063 0.2878
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 19.078 on 20 degrees of freedom
## AIC: 25.078
##
## Number of Fisher Scoring iterations: 5
# AICs
AIC(nasa1) # Better
## [1] 24.33485
AIC(nasa2)
## [1] 25.07821
# Model whether a suburb has a median house value larger than $25000
mod <- glm(I(medv > 25) ~ ., data = Boston, family = "binomial")
summary(mod)
##
## Call:
## glm(formula = I(medv > 25) ~ ., family = "binomial", data = Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3498 -0.2806 -0.0932 -0.0006 3.3781
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.312511 4.876070 1.090 0.275930
## crim -0.011101 0.045322 -0.245 0.806503
## zn 0.010917 0.010834 1.008 0.313626
## indus -0.110452 0.058740 -1.880 0.060060 .
## chas 0.966337 0.808960 1.195 0.232266
## nox -6.844521 4.483514 -1.527 0.126861
## rm 1.886872 0.452692 4.168 3.07e-05 ***
## age 0.003491 0.011133 0.314 0.753853
## dis -0.589016 0.164013 -3.591 0.000329 ***
## rad 0.318042 0.082623 3.849 0.000118 ***
## tax -0.010826 0.004036 -2.682 0.007314 **
## ptratio -0.353017 0.122259 -2.887 0.003884 **
## black -0.002264 0.003826 -0.592 0.554105
## lstat -0.367355 0.073020 -5.031 4.88e-07 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.52 on 505 degrees of freedom
## Residual deviance: 209.11 on 492 degrees of freedom
## AIC: 237.11
##
## Number of Fisher Scoring iterations: 7
r2glm(mod)
## [1] 0.628923
# With BIC -- ends up with only the significant variables and a similar R^2
modBIC <- MASS::stepAIC(mod, trace = 0, k = log(nrow(Boston)))
summary(modBIC)
##
## Call:
## glm(formula = I(medv > 25) ~ indus + rm + dis + rad + tax + ptratio +
## lstat, family = "binomial", data = Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3077 -0.2970 -0.0947 -0.0005 3.2552
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.556433 3.948818 0.394 0.693469
## indus -0.143236 0.054771 -2.615 0.008918 **
## rm 1.950496 0.441794 4.415 1.01e-05 ***
## dis -0.426830 0.111572 -3.826 0.000130 ***
## rad 0.301060 0.076542 3.933 8.38e-05 ***
## tax -0.010240 0.003631 -2.820 0.004800 **
## ptratio -0.404964 0.112086 -3.613 0.000303 ***
## lstat -0.384823 0.069121 -5.567 2.59e-08 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.52 on 505 degrees of freedom
## Residual deviance: 215.03 on 498 degrees of freedom
## AIC: 231.03
##
## Number of Fisher Scoring iterations: 7
r2glm(modBIC)
## [1] 0.6184273
# Classified Y’s
yHat <- nasa$fitted.values > 0.5
# Hit matrix:
# - 16 correctly classified as 0
# - 4 correctly classified as 1
# - 3 incorrectly classified as 0
tab <- table(challenger$fail.field, yHat)
tab
## yHat
## FALSE TRUE
## 0 16 0
## 1 3 4
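The hit ratio (the proportion of correctly classified observations) follows directly from the hit matrix; a minimal computation is:

# Hit ratio: proportion of correctly classified observations
sum(diag(tab)) / sum(tab)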
1. Compute the hit matrix and hit ratio for the regres-
sion I(medv > 25) ~ ..
2. Fit I(medv > 25) ~ . but now using only the first 300
observations of Boston, the training dataset.
3. For the previous model, predict the probability of the
responses and classify them into 0 or 1 in the last 206
observations, the testing dataset.
4. Compute the hit matrix and hit ratio for the new pre-
dictions. Check that the hit ratio is smaller than the
one in the first point.
$$\mu = \mathbb{E}[Y \,|\, X_1 = x_1, \ldots, X_p = x_p] = g^{-1}(\eta), \quad \text{and} \quad \eta(x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.$$
In the case of the logistic and Poisson regressions, both with
canonical link functions, the general model takes the form (inde-
pendence is implicit)
The assumptions behind (5.30), (5.31), and (5.32) are the following:

i. Linearity (in the link scale): $g(\mathbb{E}[Y \,|\, X_1 = x_1, \ldots, X_p = x_p]) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$.
ii. Response distribution: Y | X = x ∼ E(η(x), φ, a, b, c) (φ, a, b, c are constant for x).
Recall that:

For the linear model, $d_i = \hat\varepsilon_i^2$, since $D = \mathrm{RSS}(\hat\beta)$. Based on this, we can define the deviance residuals as

$$\hat\varepsilon_i^D := \mathrm{sign}(Y_i - \hat\mu_i)\sqrt{d_i}, \quad i = 1, \ldots, n. \tag{5.33}$$

If the model assumptions hold, the $\hat\varepsilon_i^D$ are approximately normal.
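In R, the deviance residuals of a fitted glm can be extracted directly; a minimal sketch, using the logistic fit nasa1 of this chapter as an example:

# Deviance residuals of a fitted glm
head(residuals(nasa1, type = "deviance"))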
5.7.1 Linearity
Linearity between the transformed expectation of Y and the predictors X1, ..., Xp is the building block of generalized linear models. If this assumption fails, then all the conclusions we might extract from the analysis are likely to be flawed. Therefore, it is a key assumption.

How to check it

We use the residuals vs. fitted values plot, which for generalized linear models is the scatterplot of $(\hat\eta_i, \hat\varepsilon_i^D)$, $i = 1, \ldots, n$. Recall that it is not the scatterplot of $(\hat Y_i, \hat\varepsilon_i)$. Under linearity, we expect no trend in the residuals $\hat\varepsilon_i^D$ with respect to $\hat\eta_i$. If nonlinearities are observed, it is worth plotting the regression terms of the model via termplot.
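A minimal sketch of this diagnostic for a fitted glm (using nasa1 for illustration) is:

# Deviance residuals vs. fitted linear predictor (eta-hat)
plot(nasa1$linear.predictors, residuals(nasa1, type = "deviance"),
     xlab = "Fitted linear predictor", ylab = "Deviance residuals")
abline(h = 0, lty = 2)

# Plots of the individual regression terms
termplot(nasa1)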
What to do if fails
Using an adequate nonlinear transformation for the problematic
predictors or adding interaction terms might be helpful. Alterna-
tively, considering a nonlinear transformation f for the response Y
might also be helpful.
What to do if fails
Patching the distribution assumption is not easy and requires
the consideration of more flexible models. One possibility is to
transform Y by means of one of the transformations discussed in
Section 3.5.2, of course at the price of modelling the transformed
response rather than Y.
5.7.3 Independence
Independence is also a key assumption: it guarantees that the
amount of information that we have on the relationship between
Y and X1 , . . . , X p with n observations is maximal.
How to check it

The presence of autocorrelation in the residuals can be examined by means of a serial plot of the residuals. Under uncorrelation, we expect no systematic patterns (trends or cycles) in such a plot.
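A sketch of such a serial plot, again using nasa1 for illustration, could be:

# Serial plot of the deviance residuals, in the order of the observations
plot(residuals(nasa1, type = "deviance"), type = "o",
     ylab = "Deviance residuals")
abline(h = 0, lty = 2)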
What to do if fails
As in the linear model, little can be done if there is dependence
in the data, once this has been collected. If serial dependence is
present, a differentiation of the response may lead to independent
observations.
5.7.4 Multicollinearity
Multicollinearity can also be present in generalized linear models. Despite the nonlinear effect of the predictors on the response, the predictors are combined linearly in (5.4). Due to this, if two or more predictors are highly correlated with each other, the fit of the model will be compromised, since the individual linear effect of each predictor is hard to distinguish from those of the remaining correlated predictors.
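As in the linear model, variance inflation factors can flag this issue; a minimal sketch, assuming the fit modMultiCo summarized in the chunks below, is:

# Variance inflation factors of the glm fit with correlated predictors
car::vif(modMultiCo)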
# Response
z <- 1 + 0.5 * x1 + 2 * x2 - 3 * x3 - x4
y <- rbinom(n = 100, size = 1, prob = 1/(1 + exp(-z)))
data <- data.frame(x1 = x1, x2 = x2, x3 = x3, x4 = x4, y = y)
# Without x4
modClean <- glm(y ~ x1 + x2 + x3, family = "binomial")
# Comparison
summary(modMultiCo)
##
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4743 -0.3796 0.1129 0.4052 2.3887
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2527 0.4008 3.125 0.00178 **
## x1 -3.4269 1.8225 -1.880 0.06007 .
## x2 6.9627 2.1937 3.174 0.00150 **
## x3 -4.3688 0.9312 -4.691 2.71e-06 ***
## x4 -5.0047 1.9440 -2.574 0.01004 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132.81 on 99 degrees of freedom
## Residual deviance: 59.76 on 95 degrees of freedom
## AIC: 69.76
##
## Number of Fisher Scoring iterations: 7
summary(modClean)
##
## Call:
## glm(formula = y ~ x1 + x2 + x3, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0952 -0.4144 0.1839 0.4762 2.5736
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9237 0.3221 2.868 0.004133 **
## x1 1.2803 0.4235 3.023 0.002502 **
## x2 1.7946 0.5290 3.392 0.000693 ***
## x3 -3.4838 0.7491 -4.651 3.31e-06 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132.813 on 99 degrees of freedom
## Residual deviance: 68.028 on 96 degrees of freedom
## AIC: 76.028
##
## Number of Fisher Scoring iterations: 6
5.8 Shrinkage

$$-\ell(\beta) + \lambda\sum_{j=1}^p\left(\alpha|\beta_j| + (1 - \alpha)|\beta_j|^2\right). \tag{5.34}$$

The minimization of (5.34) gives

$$\hat\beta_{\lambda,\alpha} := \arg\min_{\beta \in \mathbb{R}^{p+1}}\left\{-\ell(\beta) + \lambda\sum_{j=1}^p\left(\alpha|\beta_j| + (1 - \alpha)|\beta_j|^2\right)\right\}. \tag{5.35}$$

Note that the sparsity is enforced in the slopes, not in the intercept, and that the link function g does not affect the penalization term. As in linear models, the predictors need to be standardized if they have different scales.
We illustrate the shrinkage in generalized linear models with the Hitters dataset. The response is the player's League, a factor with levels A (standing for American League) and N (standing for National League). The variable indicates the player's league at the end of 1986. The predictors employed are his performance statistics during the 1986 season, the interest being whether these show some distinctive pattern between the players in both leagues.

# Load data
# (the call loading Hitters, presumably from the ISLR package, is not visible
# in this excerpt)

# Include only predictors related with 1986 season and discard NA's
Hitters <- subset(Hitters, select = c(League, AtBat, Hits, HmRun, Runs, RBI,
                                      Walks, Division, PutOuts, Assists,
                                      Errors))
Hitters <- na.omit(Hitters)

# Response and predictors
y <- Hitters$League
# (the construction of the predictor matrix x from the remaining columns of
# Hitters, e.g. via model.matrix, is not visible in this excerpt)

library(glmnet)
ridgeMod <- glmnet(x = x, y = y, alpha = 0, family = "binomial")

# Versus the percentage of deviance explained

set.seed(12345)
kcvLasso <- cv.glmnet(x = x, y = y, alpha = 1, nfolds = 10, family = "binomial")
plot(kcvLasso)
# The lambda that minimises the CV error and "one standard error rule"'s lambda
kcvLasso$lambda.min
## [1] 0.01039048
kcvLasso$lambda.1se
## [1] 0.08829343

# Leave-one-out cross-validation -- similar result
ncvLasso <- cv.glmnet(x = x, y = y, alpha = 1, nfolds = nrow(Hitters),
                      family = "binomial")
plot(ncvLasso)
ncvLasso$lambda.min
## [1] 0.007860015
ncvLasso$lambda.1se
## [1] 0.07330276
# Model selected
predict(ncvLasso, type = "coefficients", s = ncvLasso$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) -0.099447861
## AtBat        .
## Hits         .
## HmRun       -0.006971231
## Runs         .
## RBI          .
## Walks        .
## DivisionW    .
## PutOuts      .
## Assists      .
## Errors       .
# Let's split the dataset in two, do model-selection in one part and then
# inference on the selected model in the other, to have an idea of the real
# significance of HmRun
set.seed(123456)
train <- sample(c(FALSE, TRUE), size = nrow(Hitters), replace = TRUE)
# (the model selection on the training part and the computation of the
# predicted probabilities "pred" on the test part are not visible in this
# excerpt)

# Hit matrix
H <- table(pred > 0.5, y[!train] == "A") # ("A" was the reference level)
H
##
##         FALSE TRUE
##   FALSE    79   75
sum(diag(H)) / sum(H) # Worse than tossing a coin!
## [1] 0.512987
• What (if any) are the leading factors among the fea-
tures of a player in season 1986 in order to be in the
top 10% of most paid players in season 1987?
• What (if any) are the player features in season 1986
influencing the number of home runs in the same
season? And during his career?
1. Computing the likelihood requires reading all the data at once. Unlike in the linear model, updating the model with a new chunk implies re-fitting with all the data, due to the nonlinearity of the likelihood.
2. The IRLS algorithm requires reading the data as many times as there are iterations.
# Logistic regression
# Same comments for the formula framework -- this is the hack for automatic
# inclusion of all the predictors
library(biglm)
f <- formula(paste("resp ~", paste(names(bigData1)[-1], collapse = " + ")))
bigglmMod <- bigglm.ffdf(formula = f, data = bigData1ff, family = binomial())
# glm’s call
glmMod <- glm(formula = resp ~ ., data = bigData1, family = binomial())
# Compare sizes
print(object.size(bigglmMod), units = "KiB")
## 178.4 KiB
print(object.size(glmMod), units = "MiB")
## 732.6 MiB
# Summaries
s1 <- summary(bigglmMod)
s2 <- summary(glmMod)
s1
## Large data regression model: bigglm(formula = f, data = bigData1ff, family = binomial())
## Sample size = 1e+06
## Coef (95% CI) SE p
## (Intercept) 1.0177 0.9960 1.0394 0.0109 0.0000
## pred.1 -0.9869 -1.0931 -0.8808 0.0531 0.0000
## pred.2 -0.2927 -0.3046 -0.2808 0.0059 0.0000
## pred.3 -0.0481 -0.0534 -0.0428 0.0027 0.0000
## pred.4 -0.0029 -0.0082 0.0024 0.0026 0.2779
## pred.5 -0.0015 -0.0068 0.0038 0.0027 0.5708
## pred.6 -0.0022 -0.0075 0.0031 0.0026 0.4162
## pred.7 0.0018 -0.0035 0.0071 0.0027 0.4876
## pred.8 0.0582 0.0529 0.0635 0.0027 0.0000
## pred.9 0.2747 0.2640 0.2853 0.0053 0.0000
## pred.10 0.9923 0.9392 1.0454 0.0265 0.0000
s2
##
## Call:
## glm(formula = resp ~ ., family = binomial(), data = bigData1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4234 0.2112 0.4591 0.6926 2.5853
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.017719 0.010856 93.744 <2e-16 ***
## pred.1 -0.986934 0.053075 -18.595 <2e-16 ***
## pred.2 -0.292675 0.005947 -49.214 <2e-16 ***
## pred.3 -0.048077 0.002653 -18.120 <2e-16 ***
## pred.4 -0.002875 0.002650 -1.085 0.278
## pred.5 -0.001503 0.002650 -0.567 0.571
## pred.6 -0.002154 0.002650 -0.813 0.416
## pred.7 0.001840 0.002651 0.694 0.488
## pred.8 0.058182 0.002653 21.927 <2e-16 ***
# Further information
s1$mat # Coefficients and their inferences
## Coef (95% CI) SE p
## (Intercept) 1.017718995 0.996006224 1.039431766 0.010856386 0.000000e+00
## pred.1 -0.986933994 -1.093083429 -0.880784558 0.053074718 3.515228e-77
## pred.2 -0.292674675 -0.304568670 -0.280780679 0.005946998 0.000000e+00
## pred.3 -0.048076842 -0.053383444 -0.042770241 0.002653301 2.230685e-73
## pred.4 -0.002875381 -0.008174848 0.002424085 0.002649733 2.778513e-01
## pred.5 -0.001502598 -0.006803348 0.003798152 0.002650375 5.707564e-01
## pred.6 -0.002154396 -0.007453619 0.003144826 0.002649611 4.161613e-01
## pred.7 0.001839965 -0.003462032 0.007141962 0.002650999 4.876415e-01
## pred.8 0.058182131 0.052875269 0.063488994 0.002653431 1.431803e-106
## pred.9 0.274650557 0.264011111 0.285290003 0.005319723 0.000000e+00
## pred.10 0.992299439 0.939217602 1.045381276 0.026540919 6.230296e-306
s1$rsq # R^2
## [1] 0.2175508
s1$nullrss # Null deviance
## [1] 1132208
# Extract coefficients
coef(bigglmMod)
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8
## 1.017718995 -0.986933994 -0.292674675 -0.048076842 -0.002875381 -0.001502598 -0.002154396 0.001839965 0.058182131
## pred.9 pred.10
## 0.274650557 0.992299439
# Model selection adapted to big data models
# (the regsubsets call creating "reg", analogous to the one employed above for
# biglmMod but now applied to bigglmMod, is not visible in this excerpt)
subs <- summary(reg)
subs$which
##    (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9 pred.10
## 1         TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE    TRUE
## 2         TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE    TRUE
## 3         TRUE  FALSE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE    TRUE
## 4         TRUE  FALSE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE    TRUE
## 5         TRUE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE    TRUE
## 6         TRUE   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE    TRUE
## 7         TRUE   TRUE   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE   TRUE   TRUE    TRUE
## 8         TRUE   TRUE   TRUE   TRUE   TRUE  FALSE   TRUE  FALSE   TRUE   TRUE    TRUE
## 9         TRUE   TRUE   TRUE   TRUE   TRUE  FALSE   TRUE   TRUE   TRUE   TRUE    TRUE
## 10        TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE    TRUE
subs$bic
## [1] -79219.27 -118848.59 -121238.63 -121703.97 -122033.75 -122347.61 -122334.97 -122321.82 -122308.48 -122294.99
subs$which[which.min(subs$bic), ]
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9
## TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## pred.10
## TRUE
# Let's compute the true BICs for the p models. This implies fitting p bigglm's
bestModels <- list()
for (i in 1:nrow(subs$which)) {
  f <- formula(paste("resp ~", paste(names(which(subs$which[i, -1])),
                                     collapse = " + ")))
  bestModels[[i]] <- bigglm.ffdf(formula = f, data = bigData1ff,
                                 family = binomial(), maxit = 20)
  # Did not converge with the default iteration limit, maxit = 8
}

# The approximate BICs and the true BICs are very similar (in this example)
# (the vector exactBICs, containing the BIC of each fit in bestModels, is
# computed in a step not visible in this excerpt)
cor(subs$bic, exactBICs, method = "pearson") # Correlation
## [1] 0.9999708

# Both give the same model selection and same order
subs$which[which.min(subs$bic), ] # Approximate
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9
##        TRUE   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE
## pred.10
##    TRUE
subs$which[which.min(exactBICs), ] # Exact
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9
##        TRUE   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE
## pred.10
##    TRUE
cor(subs$bic, exactBICs, method = "spearman") # Order correlation
## [1] 1
6
Nonparametric regression

$$f(x_0) = F'(x_0) = \lim_{h \to 0^+}\frac{F(x_0 + h) - F(x_0)}{h} = \lim_{h \to 0^+}\frac{\mathbb{P}[x_0 < X < x_0 + h]}{h}.$$
More precisely, given an origin t0 and a bandwidth h > 0, the histogram builds a piecewise constant function in the intervals {Bk := [tk, tk+1) : tk = t0 + hk, k ∈ ℤ} by counting the number of sample points inside each of them. These constant-length intervals are also denoted bins. The fact that they are of constant length h is important, since it allows us to standardize by h in order to have relative frequencies per length³ in the bins. The histogram at a point x is defined as

$$\hat f_H(x; t_0, h) := \frac{1}{nh}\sum_{i=1}^n 1_{\{X_i \in B_k : x \in B_k\}}. \tag{6.1}$$

Equivalently, if we denote by $v_k$ the number of sample points in the bin $B_k$, the histogram is $\hat f_H(x; t_0, h) = \frac{v_k}{nh}$ if $x \in B_k$ for a $k \in \mathbb{Z}$.

³ Recall that with this standardization we approach the probability density concept.
The computation of histograms is straightforward in R. As an example, we consider the faithful dataset, which contains the duration of the eruption and the waiting time between eruptions of the Old Faithful geyser.

# Duration of eruption
faithE <- faithful$eruptions

# (the output below shows the breakpoints returned by a call not visible in
# this excerpt)
## [1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5

# Bins for the histogram: origin and bandwidth
t0 <- min(faithE)
h <- 0.25
# Uniform sample
set.seed(1234567)
u <- runif(n = 100)
# t0 = 0, h = 0.2
Bk1 <- seq(0, 1, by = 0.2)
# t0 = -0.1, h = 0.2
Bk2 <- seq(-0.1, 1.1, by = 0.2)
# Comparison
par(mfrow = 1:2)
hist(u, probability = TRUE, breaks = Bk1, ylim = c(0, 1.5),
main = "t0 = 0, h = 0.2")
rug(u)
abline(h = 1, col = 2)
hist(u, probability = TRUE, breaks = Bk2, ylim = c(0, 1.5),
main = "t0 = -0.1, h = 0.2")
rug(u)
abline(h = 1, col = 2)
$$f(x) = F'(x) = \lim_{h \to 0^+}\frac{F(x + h) - F(x - h)}{2h} = \lim_{h \to 0^+}\frac{\mathbb{P}[x - h < X < x + h]}{2h},$$

where

$$\mathbb{P}[x - h < X < x + h] = \mathbb{P}\left[-1 < \frac{x - X}{h} < 1\right]$$
# Quickly compute a kernel density estimator and plot the density object
# Automatically chooses bandwidth and uses normal kernel
plot(density(x = samp))

# density automatically chooses the interval for plotting the kernel density
# estimator (observe that the black line goes to roughly between -3 and 3)
# This can be tuned using "from" and "to"
plot(density(x = samp, from = -4, to = 4), xlim = c(-5, 5))

# The density object is a list
kde <- density(x = samp, from = -5, to = 5, n = 1024)
str(kde)
##  $ x        : num [1:1024] -5 -4.99 -4.98 -4.97 -4.96 ...
##  $ bw       : num 0.315
##  $ n        : int 100
##  $ call     : language density.default(x = samp, n = 1024, from = -5, to = 5)
##  $ data.name: chr "samp"
##  $ has.na   : logi FALSE
##  - attr(*, "class")= chr "density"
# Note that the evaluation grid "x" is not directly controlled, only through
# "from", "to", and "n" (better use powers of 2). This is because, internally,
# kde employs an efficient Fast Fourier Transform on grids of size 2^m

# Plotting by the returned values of the kde
plot(kde$x, kde$y, type = "l")
Once the MISE is set as the error criterion to be minimized, our aim
is to find
or as
# Rule-of-thumb
bw.nrd(x = x)
## [1] 0.4040319
# bw.nrd employs 1.34 as an approximation for diff(qnorm(c(0.25, 0.75)))
# Same as
iqr <- diff(quantile(x, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75)))
1.06 * n^(-1/5) * min(sd(x), iqr)
## [1] 0.4040319
# Data
set.seed(672641)
x <- rnorm(100)
# DPI selector
bw.SJ(x = x, method = "dpi")
## [1] 0.5006905
# Similar to
ks::hpi(x) # Default is two-stages
## [1] 0.4999456
Cross-validation

We turn now our attention to a different philosophy of bandwidth estimation. Instead of trying to minimize the AMISE by plugging in estimates for the unknown curvature term, we directly attempt to minimize the MISE by using the sample twice: once for computing the kde and another time for evaluating its performance on estimating f. To avoid the clear dependence on the sample, we do this evaluation in a cross-validatory way: the data used for computing the kde is not used for its evaluation.

We begin by expanding the square in the MISE expression:

$$\mathrm{MISE}[\hat f(\cdot; h)] = \mathbb{E}\left[\int (\hat f(x; h) - f(x))^2\,dx\right]
= \mathbb{E}\left[\int \hat f(x; h)^2\,dx\right] - 2\,\mathbb{E}\left[\int \hat f(x; h) f(x)\,dx\right] + \int f(x)^2\,dx.$$

Since the last term does not depend on h, minimizing $\mathrm{MISE}[\hat f(\cdot; h)]$ is equivalent to minimizing

$$\mathbb{E}\left[\int \hat f(x; h)^2\,dx\right] - 2\,\mathbb{E}\left[\int \hat f(x; h) f(x)\,dx\right].$$
where $\hat f_{-i}(\cdot; h)$ is the leave-one-out kde and is based on the sample with $X_i$ removed:

$$\hat f_{-i}(x; h) = \frac{1}{n - 1}\sum_{\substack{j = 1 \\ j \neq i}}^n K_h(x - X_j).$$
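In base R, the least squares cross-validation bandwidth built on this idea is available through bw.ucv (unbiased cross-validation); a minimal sketch, assuming a sample x such as the one generated in the next chunk:

# LSCV/UCV bandwidth for the kde of the sample x
bw.ucv(x = x)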
# Data
set.seed(123456)
x <- rnorm(100)
$$f(x; \mu, \sigma, w) := \sum_{j=1}^r w_j \phi(x; \mu_j, \sigma_j^2),$$

$$\mathrm{MISE}_r[\hat f(\cdot; h)] = (2\sqrt{\pi}nh)^{-1} + w'\left\{(1 - n^{-1})\Omega_2 - 2\Omega_1 + \Omega_0\right\}w,$$

$$(\Omega_a)_{ij} := \phi(\mu_i - \mu_j; ah^2 + \sigma_i^2 + \sigma_j^2), \quad i, j = 1, \ldots, r.$$
# Available models
?nor1mix::MarronWand

# Density evaluation
x <- seq(-4, 4, length.out = 400)

plot(nor1mix::MW.nm7)
plot(nor1mix::MW.nm10)
plot(nor1mix::MW.nm12)
and then assuming that f is the pdf of a $\mathcal{N}_p(\mu, \Sigma)$. With the normal kernel, this results in a full bandwidth matrix H with many parameters that need to be chosen – precisely $\frac{p(p+1)}{2}$ – which notably complicates bandwidth selection as the dimension p grows. A common simplification is to consider a diagonal bandwidth matrix $\mathbf{H} = \mathrm{diag}(h_1^2, \ldots, h_p^2)$, which yields the kde employing product kernels:

$$\hat f(\mathbf{x}; \mathbf{h}) = \frac{1}{n}\sum_{i=1}^n K_{h_1}(x_1 - X_{i,1}) \times \cdots \times K_{h_p}(x_p - X_{i,p}), \tag{6.12}$$

If the variables have been standardized (so that they have the same scale), then a simple choice is to consider $h = h_1 = \ldots = h_p$.
Multivariate kernel density estimation and bandwidth selection is not supported in base R, but the ks package implements the kde by ks::kde for p ≤ 6. Bandwidth selectors, allowing for full or diagonal bandwidth matrices, are implemented by: ks::Hns (NS), ks::Hpi and ks::Hpi.diag (DPI), ks::Hlscv and ks::Hlscv.diag (LSCV), and ks::Hbcv and ks::Hbcv.diag (BCV). The next chunk of code illustrates their use.

# DPI selectors
Hpi1 <- ks::Hpi(x = faithful)
Hpi1
##            [,1]       [,2]
## [1,] 0.06326802  0.6041862
## [2,] 0.60418624 11.1917775
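The object kdeHpi1 used in the next chunk is the kde of faithful computed with the bandwidth Hpi1; its construction, not visible in this excerpt, would be:

# Kernel density estimate with the DPI bandwidth matrix
kdeHpi1 <- ks::kde(x = faithful, H = Hpi1)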
# Different representations
plot(kdeHpi1, display = "slice", cont = c(25, 50, 75))
# "cont" specifies the density contours, which are upper percentages of highest
# density regions. The default contours are at 25%, 50%, and 75%
plot(kdeHpi1, display = "filled.contour2", cont = c(25, 50, 75))
# Image plot of the estimated density (the z argument, presumably
# kdeHpi1$estimate, and the remaining arguments were lost in the extraction)
image(kdeHpi1$eval.points[[1]], kdeHpi1$eval.points[[2]],
      main = "full")
Hlscv0 <- ks::Hlscv(x = x)
Hbcv0 <- ks::Hbcv(x = x)
Hpi0 <- ks::Hpi(x = x)
Hns0 <- ks::Hns(x = x)

par(mfrow = c(2, 2))
p <- lapply(list(Hlscv0, Hbcv0, Hpi0, Hns0), function(H) {
  # col.fun for custom colours
  plot(ks::kde(x = x, H = H), display = "filled.contour2",
       cont = seq(10, 90, by = 10), col.fun = viridis::viridis)
  points(x, cex = 0.5, pch = 16)
})
$$m(x) = \mathbb{E}[Y \,|\, X = x] = \int_{\mathbb{R}} y f_{Y|X=x}(y)\,dy = \frac{\int y f(x, y)\,dy}{f_X(x)}. \tag{6.13}$$

$$\hat f(x, y; \mathbf{h}) = \frac{1}{n}\sum_{i=1}^n K_{h_1}(x - X_i) K_{h_2}(y - Y_i) \tag{6.14}$$

where

$$W_i^0(x) := \frac{K_h(x - X_i)}{\sum_{i=1}^n K_h(x - X_i)}.$$
# Nadaraya-Watson estimator (the enclosing function definition was lost in the
# extraction; its name and arguments follow from the calls below)
mNW <- function(x, X, Y, h, K = dnorm) {
  # Arguments
  # x: evaluation points
  # X: vector (size n) with the predictors
  # Y: vector (size n) with the response variable
  # h: bandwidth
  # K: kernel
  # Kernel evaluations: length(x) x n matrix with entries K_h(x_j - X_i)
  Kx <- outer(x, X, function(x, X) K((x - X) / h) / h)
  # Weights
  W <- Kx / rowSums(Kx) # Column recycling!
  # Estimated regression at x
  drop(W %*% Y)
}
# Bandwidth
h <- 0.5
# Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = h), col = 2)
legend("top", legend = c("True regression", "Nadaraya-Watson"),
lwd = 2, col = 1:2)
[Figure: simulated sample, true regression function m, and Nadaraya–Watson fit.]
# Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = h), col = 2)
legend("topright", legend = c("True regression", "Nadaraya-Watson"),
lwd = 2, col = 1:2)
without assuming any particular form for the true m. This is not achievable directly, since no knowledge on m is available. Recall that what we did in parametric models, such as linear regression, was to assume a parametrization for m, $m_\beta(x) = \beta_0 + \beta_1 x$ for the simple linear model, which allowed us to tackle the minimization of (6.17) by means of solving

$$m_{\hat\beta}(\mathbf{x}) := \arg\min_{\beta}\sum_{i=1}^n (Y_i - m_\beta(X_i))^2.$$
and

$$\mathbf{W} := \mathrm{diag}(K_h(X_1 - x), \ldots, K_h(X_n - x)), \qquad \mathbf{Y} := \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}_{n \times 1}.$$

$$\hat m(x; p, h) := \hat\beta_{h,0} = \mathbf{e}_1'(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{Y} = \sum_{i=1}^n W_i^p(x) Y_i, \tag{6.23}$$

where

$$W_i^p(x) := \mathbf{e}_1'(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{e}_i$$

where $\hat s_r(x; h) := \frac{1}{n}\sum_{i=1}^n (X_i - x)^r K_h(x - X_i)$.
# KernSmooth::locpoly fits
h <- 0.25
lp0 <- KernSmooth::locpoly(x = X, y = Y, bandwidth = h, degree = 0,
range.x = c(-10, 10), gridsize = 500)
lp1 <- KernSmooth::locpoly(x = X, y = Y, bandwidth = h, degree = 1,
range.x = c(-10, 10), gridsize = 500)
# Provide the evaluation points by range.x and gridsize
# loess fits
span <- 0.25 # The default span is 0.75, which works very badly in this scenario
lo0 <- loess(Y ~ X, degree = 0, span = span)
lo1 <- loess(Y ~ X, degree = 1, span = span)
# loess employs a "span" argument that plays the role of a variable bandwidth:
# "span" gives the proportion of points of the sample that are taken into
# account for performing the local fit around x, and then a triweight kernel
# (not a normal kernel) is used for weighting their contributions. Therefore,
# the final estimate differs from the definition of the local polynomial
# estimator, although the principles on which both are based are the same
# Prediction at x = 2
x <- 2
lp1$y[which.min(abs(lp1$x - x))] # Prediction by KernSmooth::locpoly
## [1] 5.445975
predict(lo1, newdata = data.frame(X = x)) # Prediction by loess
## 1
## 5.379652
m(x) # Reality
## [1] 7.274379
# Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(lp0$x, lp0$y, col = 2)
lines(lp1$x, lp1$y, col = 3)
lines(xGrid, predict(lo0, newdata = data.frame(X = xGrid)), col = 2, lty = 2)
lines(xGrid, predict(lo1, newdata = data.frame(X = xGrid)), col = 3, lty = 2)
legend("bottom", legend = c("True regression", "Local constant (locpoly)",
"Local linear (locpoly)", "Local constant (loess)",
"Local linear (loess)"),
lwd = 2, col = c(1:3, 2:3), lty = c(rep(1, 3), rep(2, 2)))
The resulting fit heavily depends on h.

# Simple plot of local polynomials for varying h's
manipulate::manipulate({

  # Plot data
  lpp <- KernSmooth::locpoly(x = X, y = Y, bandwidth = h, degree = p,
                             range.x = c(-10, 10), gridsize = 500)
  plot(X, Y)
  rug(X, side = 1); rug(Y, side = 2)
  lines(xGrid, m(xGrid), col = 1)
  lines(lpp$x, lpp$y, col = p + 2)
  legend("bottom", legend = c("True regression", "Local polynomial fit"),
         lwd = 2, col = c(1, p + 2))

# The closing slider arguments were lost in the extraction; the ranges below
# are assumptions
}, h = manipulate::slider(0.01, 2, initial = 0.5, step = 0.01),
   p = manipulate::slider(0, 4, initial = 0, step = 1))
• A1.²⁵ m is twice continuously differentiable.
• A2.²⁶ σ² is continuous and positive.
• A3.²⁷ f, the marginal pdf of X, is continuously differentiable and bounded away from zero²⁸.
• A4.²⁹ The kernel K is a symmetric and bounded pdf with finite second moment and is square integrable.
• A5.³⁰ h = hn is a deterministic sequence of bandwidths such that, when n → ∞, h → 0 and nh → ∞.

²⁵ This assumption requires certain smoothness of the regression function, thus allowing Taylor expansions to be performed. This assumption is important in practice: m̂(·; p, h) is infinitely differentiable if the considered kernels K are.
²⁶ Avoids the situation in which Y is a degenerate random variable.
²⁷ Avoids the degenerate situation in which m is estimated at regions without observations of the predictors (such as holes in the support of X).
²⁸ Meaning that there exists a positive lower bound for f.
²⁹ Mild assumption inherited from the kde.
³⁰ Key assumption for reducing the bias and variance of m̂(·; p, h) simultaneously.

The bias and variance are studied in their conditional versions on the predictors' sample X1, ..., Xn. The reason for analyzing the conditional instead of the unconditional versions is to avoid the technical difficulties that integration with respect to the predictors' density may pose. This is in the spirit of what was done in the parametric inference of Sections 2.4 and 5.3. The main result is the following theorem.
Theorem 6.1. Under A1–A5, the conditional bias and variance of the local constant (p = 0) and local linear (p = 1) estimators are³¹

$$\mathrm{Bias}[\hat m(x; p, h) \,|\, X_1, \ldots, X_n] = B_p(x)h^2 + o_P(h^2), \tag{6.24}$$
$$\mathrm{Var}[\hat m(x; p, h) \,|\, X_1, \ldots, X_n] = \frac{R(K)}{nh f(x)}\sigma^2(x) + o_P((nh)^{-1}), \tag{6.25}$$

where

$$B_p(x) := \begin{cases} \dfrac{\mu_2(K)}{2}\left\{m''(x) + 2\dfrac{m'(x)f'(x)}{f(x)}\right\}, & \text{if } p = 0,\\[1ex] \dfrac{\mu_2(K)}{2}\, m''(x), & \text{if } p = 1.\end{cases}$$

³¹ The notation $o_P(a_n)$ stands for a random variable that converges in probability to zero at a rate faster than $a_n \to 0$. It is mostly employed for denoting non-important terms in asymptotic expansions, like the ones in (6.24)–(6.25).
The bias and variance expressions (6.24) and (6.25) yield very
interesting insights:
1. Bias.
2. Variance.
Observe that this definition is very similar to the kde's MISE, except for the fact that f appears weighting the quadratic difference: what matters is to minimize the estimation error on the regions where the density of X is higher. Recall also that the MISE follows by integrating the conditional MSE, which amounts to the squared bias (6.24) plus the variance (6.25) given in Theorem 6.1. These operations produce the conditional AMISE:

$$\mathrm{AMISE}[\hat m(\cdot; p, h) \,|\, X_1, \ldots, X_n] = h^4\int B_p(x)^2 f(x)\,dx + \frac{R(K)}{nh}\int \sigma^2(x)\,dx$$
and, if p = 1, the resulting optimal AMISE bandwidth is

$$h_{\mathrm{AMISE}} = \left[\frac{R(K)\int\sigma^2(x)\,dx}{2\mu_2^2(K)\,\theta_{22}\,n}\right]^{1/5},$$
# DPI selector
hDPI <- KernSmooth::dpill(x = X, y = Y)
# Fits
lp1 <- KernSmooth::locpoly(x = X, y = Y, bandwidth = 0.25, degree = 0,
range.x = c(-10, 10), gridsize = 500)
lp1DPI <- KernSmooth::locpoly(x = X, y = Y, bandwidth = hDPI, degree = 1,
range.x = c(-10, 10), gridsize = 500)
# Compare fits
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(lp1$x, lp1$y, col = 2)
lines(lp1DPI$x, lp1DPI$y, col = 3)
legend("bottom", legend = c("True regression", "Local linear",
"Local linear (DPI)"),
lwd = 2, col = 1:3)
We turn now our attention to cross-validation. Following an analogy with the fit of the linear model, we could look for the bandwidth h such that it minimizes an RSS of the form

$$\frac{1}{n}\sum_{i=1}^n (Y_i - \hat m(X_i; p, h))^2. \tag{6.26}$$

As it looks, this is a bad idea. Attempting to minimize (6.26) always leads to bandwidths close to zero, for which the fit virtually interpolates the data, as illustrated below.

# Error curve (hGrid and error, the grid of bandwidths and the associated RSS
# (6.26), are computed in a step not visible in this excerpt)
plot(hGrid, error, type = "l")
rug(hGrid)
abline(v = hGrid[which.min(error)], col = 2)
$\sum_{i=1}^n W_i^p(x) Y_i$:

$$W_{-i,j}^p(x) = \frac{W_j^p(x)}{\sum_{\substack{k = 1 \\ k \neq i}}^n W_k^p(x)} = \frac{W_j^p(x)}{1 - W_i^p(x)}.$$
# Objective function: leave-one-out cross-validation error for the
# Nadaraya-Watson estimator (its body was lost in the extraction; see the
# sketch after this chunk)
cvNW <- function(X, Y, h, K = dnorm) {
}

# Bandwidth minimizing the CV error over a grid (bw.cv.grid is a helper whose
# definition was also lost; see the sketch after this chunk)
hCV <- bw.cv.grid(X = X, Y = Y, plot.cv = TRUE)
hCV
## [1] 0.3117806
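A minimal sketch of cvNW and bw.cv.grid consistent with the calls and output above, computing the leave-one-out criterion explicitly (possibly less efficient than the original implementation, and with an assumed bandwidth grid, so the exact value of hCV may differ):

# Leave-one-out cross-validation error for the Nadaraya-Watson estimator
cvNW <- function(X, Y, h, K = dnorm) {
  mean(sapply(seq_along(X), function(i) {
    (Y[i] - mNW(x = X[i], X = X[-i], Y = Y[-i], h = h, K = K))^2
  }))
}

# Grid search of the CV bandwidth, optionally plotting the CV curve
bw.cv.grid <- function(X, Y, hGrid = seq(0.05, 2, l = 200), K = dnorm,
                       plot.cv = FALSE) {
  cv <- sapply(hGrid, function(h) cvNW(X = X, Y = Y, h = h, K = K))
  h <- hGrid[which.min(cv)]
  if (plot.cv) {
    plot(hGrid, cv, type = "l")
    rug(hGrid)
    abline(v = h, col = 2, lwd = 2)
  }
  h
}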
# Plot result
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = hCV), col = 2)
legend("top", legend = c("True regression", "Nadaraya-Watson"),
lwd = 2, col = 1:2)
A more sophisticated cross-validation bandwidth selection can be carried out with the np package, as illustrated in the code below.
# np::npregbw computes by default the least squares CV bandwidth associated to
# a local constant fit
bw0 <- np::npregbw(formula = Y ~ X)
## Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 |
# Multiple initial points can be employed for minimizing the CV function (for
# one predictor, defaults to 1)
# The "rbandwidth" object contains much useful information, see ?np::npregbw
# for all the returned objects
bw0
##
## Regression Data (100 observations, 1 variable(s)):
##
## X
## Bandwidth(s): 0.3112962
##
## Regression Type: Local-Constant
## Bandwidth Selection Method: Least Squares Cross-Validation
## Formula: Y ~ X
## Bandwidth Type: Fixed
## Objective Function Value: 5.368999 (achieved on multistart 1)
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 1
# Recall that the fit is very similar to hCV

# Once the bandwidth is estimated, np::npreg can be directly called with the
# "rbandwidth" object (it encodes the regression to be made, the data, the kind
# of estimator considered, etc). The hard work goes on np::npregbw, not on
# np::npreg
kre0 <- np::npreg(bw0)
kre0
##
## Regression Data: 100 training points, in 1 variable(s)
##                        X
## Bandwidth(s): 0.3112962
##
## Kernel Regression Estimator: Local-Constant

# The evaluation points of the estimator are by default the predictor's sample
# (which is not sorted!)

# Plot directly the fit via plot() -- it employs different evaluation points
# than exdat
plot(kre0, col = 2, type = "o")
points(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(kre0$eval$xGrid, kre0$mean, col = 3, type = "o", pch = 16, cex = 0.5)

# Using the evaluation points
bw1 <- np::npregbw(formula = Y ~ X, regtype = "ll")
## Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 |
# regtype = "ll" stands for "local linear", "lc" for "local constant"

# Local linear fit
kre1 <- np::npreg(bw1, exdat = xGrid)

# Comparison
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(kre0$eval$xGrid, kre0$mean, col = 2)
lines(kre1$eval$xGrid, kre1$mean, col = 3)
legend("top", legend = c("True regression", "Nadaraya-Watson", "Local linear"),
       lwd = 2, col = 1:3)
# Constant bandwidth
bwc <- np::npregbw(formula = Y ~ X, bwtype = "fixed", regtype = "ll")
## Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 |
krec <- np::npreg(bwc, exdat = xGrid)
# Variable bandwidths
bwg <- np::npregbw(formula = Y ~ X, bwtype = "generalized_nn", regtype = "ll")
## Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |
kreg <- np::npreg(bwg, exdat = xGrid)
bwa <- np::npregbw(formula = Y ~ X, bwtype = "adaptive_nn", regtype = "ll")
krea <- np::npreg(bwa, exdat = xGrid)
# Comparison
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(krec$eval$xGrid, krec$mean, col = 2)
lines(kreg$eval$xGrid, kreg$mean, col = 3)
lines(krea$eval$xGrid, krea$mean, col = 4)
legend("top", legend = c("True regression", "Fixed", "Generalized NN",
"Adaptive NN"),
lwd = 2, col = 1:4)
(Figure: comparison of the true regression with the local linear fits using fixed, generalized nearest-neighbor, and adaptive nearest-neighbor bandwidths.)
# Observe how the fixed bandwidth may yield a fit that produces serious
# artifacts in the low-density region. In that region the NN-based bandwidths
# expand to borrow strength from the points in the high-density regions,
# whereas in the high-density regions they shrink to adapt faster to the
# changes of the regression function
and then solve the problem in the exact same way but now con-
sidering
$$\mathbf{X} := \begin{pmatrix} 1 & (X_1 - x)' \\ \vdots & \vdots \\ 1 & (X_n - x)' \end{pmatrix}_{n \times (p + 1)}$$
and the corresponding kernel weight matrix. The bandwidths can then be selected by least squares cross-validation,
$$\hat{h}_{\mathrm{CV}} := \arg\min_{h_1, \ldots, h_p > 0} \mathrm{CV}(h),$$
where
$$\mathrm{CV}(h) := \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{m}_{-i}(X_i; p, h) \right)^2.$$
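To make the weighted least squares fit behind this design matrix concrete, here is a minimal sketch of the local linear estimate at a point x0 with a product Gaussian kernel (illustrative helper with assumed names Xmat, x0, and h; not the np implementation):
# Local linear estimate of m(x0) for an n x p matrix of predictors Xmat,
# response Y, and bandwidth vector h (illustrative sketch)
locLinAt <- function(x0, Xmat, Y, h) {
  Xc <- sweep(Xmat, 2, x0)                          # rows X_i - x0
  W <- apply(dnorm(sweep(Xc, 2, h, "/")), 1, prod)  # product Gaussian kernel weights
  Xd <- cbind(1, Xc)                                # design matrix of the display above
  beta <- solve(crossprod(Xd, W * Xd), crossprod(Xd, W * Y))
  beta[1]                                           # the intercept estimates m(x0)
}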
# Regression
fitWine <- np::npreg(bwWine)
summary(fitWine)
##
## Regression Data: 27 training points, in 4 variable(s)
## Age WinterRain AGST HarvestRain
## Bandwidth(s): 15217744 164.468 0.8698265 190414416
##
## Kernel Regression Estimator: Local-Linear
## Bandwidth Type: Fixed
## Residual standard error: 0.1947015
## R-squared: 0.9038531
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 4
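The marginal effect plots summarized in the figure below can be obtained with the plot method for np regression objects (the exact call and graphical options used for the original figure are assumed here):
# Marginal effects of each predictor on Price; plot() dispatches to np::npplot
plot(fitWine)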
(Figure: estimated marginal effects of Age, WinterRain, AGST, and HarvestRain on Price.)
# Therefore:
# - Age is positively related with Price (almost linearly)
# - WinterRain is positively related with Price (with a subtle nonlinearity)
# - AGST is positively related with Price, but now we see what looks like a
#   quadratic pattern
# - HarvestRain is negatively related with Price (almost linearly)
$$l(x_d, X_d; \lambda) = \lambda^{|x_d - X_d|},$$
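As a side note, this kernel is straightforward to code directly (a small illustrative function, not part of np):
# Ordered-categorical kernel l(x_d, X_d; lambda) = lambda^|x_d - X_d|
# (illustrative helper, not the np implementation)
kernOrd <- function(x, X, lambda) lambda^abs(x - X)
kernOrd(x = 2, X = c(1, 2, 3, 5), lambda = 0.5)  # 0.500 1.000 0.500 0.125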
# Regression
fitIris <- np::npreg(bwIris)
summary(fitIris)
##
## Regression Data: 150 training points, in 2 variable(s)
## Sepal.Width Species
## Bandwidth(s): 898696.6 2.357536e-07
##
## Kernel Regression Estimator: Local-Linear
## Bandwidth Type: Fixed
## Residual standard error: 0.3775005
## R-squared: 0.9539633
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 1
##
## Unordered Categorical Kernel Type: Aitchison and Aitken
## No. Unordered Categorical Explanatory Vars.: 1
(Figure: estimated marginal effects of Sepal.Width and Species on Petal.Length.)
# Load data
data(oecdpanel, package = "np")
# Bandwidth by CV for local constant -- use only two starts to reduce the
# computation time
out <- capture.output(
bwOECD <- np::npregbw(formula = growth ~ factor(oecd) + ordered(year) +
initgdp + popgro + inv + humancap, data = oecdpanel,
regtype = "lc", nmulti = 2)
)
bwOECD
##
## Regression Data (616 observations, 6 variable(s)):
##
## factor(oecd) ordered(year) initgdp popgro inv humancap
## Bandwidth(s): 0.02414207 0.8944302 0.1940907 698479 0.09140916 1.223215
##
## Regression Type: Local-Constant
## Bandwidth Selection Method: Least Squares Cross-Validation
## Formula: growth ~ factor(oecd) + ordered(year) + initgdp + popgro + inv +
## humancap
## Bandwidth Type: Fixed
## Objective Function Value: 0.0006545946 (achieved on multistart 2)
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 4
##
## Unordered Categorical Kernel Type: Aitchison and Aitken
## No. Unordered Categorical Explanatory Vars.: 1
##
## Ordered Categorical Kernel Type: Li and Racine
## No. Ordered Categorical Explanatory Vars.: 1
# Regression
fitOECD <- np::npreg(bwOECD)
summary(fitOECD)
##
## Regression Data: 616 training points, in 6 variable(s)
## factor(oecd) ordered(year) initgdp popgro inv humancap
(Figure: estimated marginal effects of factor(oecd), ordered(year), initgdp, popgro, inv, and humancap on growth.)
# Recall that $mean contains the regression evaluated at the evaluation points
# (by default, the sample of the predictors), so in this case it coincides with
# the fitted values
head(fitOECD$mean)
## [1] 0.02877775 0.02113485 0.03592716 0.04027500 0.01099441 0.03888302
(Figure: additional marginal effect plots of the six predictors on growth.)
showed that, under the assumptions given in Section 2.3, the maxi-
mum likelihood estimate of β in the linear model
$$Y \,|\, (X_1, \ldots, X_p) \sim \mathcal{N}(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p, \sigma^2) \qquad (6.29)$$
$$Y \,|\, X \sim \mathcal{N}(\beta_0 + \beta_1 X + \ldots + \beta_p X^p, \sigma^2). \qquad (6.30)$$
$$Y_i \,|\, X_i \sim \mathrm{Ber}(\mathrm{logistic}(\eta(X_i))), \quad i = 1, \ldots, n,$$
The log-likelihood of $\beta$ is
$$\ell(\beta) = \sum_{i=1}^{n} \left\{ Y_i \log(\mathrm{logistic}(\eta(X_i))) + (1 - Y_i) \log(1 - \mathrm{logistic}(\eta(X_i))) \right\} = \sum_{i=1}^{n} \ell(Y_i, \eta(X_i)),$$
$$\hat{\eta}(x) := \hat{\beta}_{h,0}.$$
suppressWarnings(
fitGlm <- sapply(x, function(x) {
K <- dnorm(x = x, mean = X, sd = h)
glm.fit(x = cbind(1, X - x), y = Y, weights = K,
family = binomial())$coefficients[1]
})
)
# Compare fits
plot(x, p(x), ylim = c(0, 1.5), type = "l", lwd = 2)
lines(x, logistic(fitGlm), col = 2)
lines(x, logistic(fitNlm), col = 3, lty = 2)
plot(fitLocfit, add = TRUE, col = 4)
legend("topright", legend = c("p(x)", "glm", "nlm", "locfit"), lwd = 2,
col = c(1, 2, 3, 4), lty = c(1, 1, 2, 1))
(Figure: comparison of p(x) with the glm, nlm, and locfit local logistic fits.)
Bandwidth selection can be done by means of likelihood cross-validation, that is, by maximizing
$$\mathrm{LCV}(h) = \sum_{i=1}^{n} \ell(Y_i, \hat{\eta}_{-i}(X_i)), \qquad (6.34)$$
where $\hat{\eta}_{-i}(X_i)$ represents the local fit at $X_i$ without the $i$-th datum.
Luckily, the test statistic $D_n$, the asymptotic cdf $K$, and the associated asymptotic p-value7 are readily implemented in R through the ks.test function.
7 Which is $\lim_{n \to \infty} P[D_n > d_n] = 1 - K(d_n)$, where $d_n$ is the observed statistic and $D_n$ is the random variable (A.1).
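As a minimal standalone illustration (not part of the simulation below), ks.test can be applied directly to a sample to obtain the statistic and its asymptotic p-value:
# Kolmogorov-Smirnov test of N(0, 1)-ness for a simulated sample (illustrative)
set.seed(42)
z <- rnorm(100)
ks.test(z, "pnorm")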
# Comparison of p-values
par(mfrow = 1:2)
hist(pValues_H0, breaks = seq(0, 1, l = 20), probability = TRUE,
main = expression(H[0]), ylim = c(0, 2.5))
abline(h = 1, col = 2)
hist(pValues_H1, breaks = seq(0, 1, l = 20), probability = TRUE,
main = expression(H[1]), ylim = c(0, 2.5))
abline(h = 1, col = 2)
(Figure: histograms of the p-values under H0 and H1; the uniform density is superimposed in red.)
$$\ell(\beta) = \log \phi(Y; X\beta, \sigma^2 I) = \sum_{i=1}^{n} \log \phi(Y_i; (X\beta)_i, \sigma). \qquad (A.3)$$
Theorem A.1. Under the assumptions i–iv in Section 2.3, the maximum
likelihood estimate of β is the least squares estimate (2.7):
$$\frac{1}{\sigma^2}(Y - X\beta)'X = \frac{1}{\sigma^2}(Y'X - \beta'X'X) = 0.$$
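A quick numerical check of this result (simulated data; all object names here are illustrative) compares the closed-form least squares solution with lm's estimate:
# The closed-form least squares estimate coincides with lm()'s coefficients
set.seed(1)
nLS <- 50
XLS <- cbind(1, rnorm(nLS), rnorm(nLS))           # design matrix with intercept
YLS <- drop(XLS %*% c(1, 2, -1)) + rnorm(nLS)     # linear model with N(0, 1) errors
betaHat <- solve(t(XLS) %*% XLS, t(XLS) %*% YLS)  # solves X'X beta = X'Y
cbind(betaHat, coef(lm(YLS ~ 0 + XLS)))           # both columns agree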
$$p_j(x) := P[Y = j \,|\, X_1 = x_1, \ldots, X_p = x_p] = \frac{e^{\beta_{0j} + \beta_{1j} X_1 + \ldots + \beta_{pj} X_p}}{1 + \sum_{l=1}^{J-1} e^{\beta_{0l} + \beta_{1l} X_1 + \ldots + \beta_{pl} X_p}}, \qquad (A.4)$$
$$p_J(x) := P[Y = J \,|\, X_1 = x_1, \ldots, X_p = x_p] = \frac{1}{1 + \sum_{l=1}^{J-1} e^{\beta_{0l} + \beta_{1l} X_1 + \ldots + \beta_{pl} X_p}}. \qquad (A.5)$$
Note that (A.4) and (A.5) imply that $\sum_{j=1}^{J} p_j(x) = 1$ and that there are $(J - 1) \times (p + 1)$ coefficients11. Also, (A.5) reveals that the last level, J, is given a different treatment. This is because it is the reference level (it could be a different one, but traditionally the last one is chosen).
11 $(J - 1)$ intercepts and $(J - 1) \times p$ slopes.
The multinomial logistic model has an interesting interpretation
in terms of logistic regressions. Taking the quotient between (A.4)
and (A.5) gives
$$\frac{p_j(x)}{p_J(x)} = e^{\beta_{0j} + \beta_{1j} X_1 + \ldots + \beta_{pj} X_p}. \qquad (A.6)$$
Therefore:
# Data from the voting intentions in the 1988 Chilean national plebiscite
data(Chile, package = "carData")
summary(Chile)
## region population sex age education income statusquo vote
## C :600 Min. : 3750 F:1379 Min. :18.00 P :1107 Min. : 2500 Min. :-1.80301 A :187
## M :100 1st Qu.: 25000 M:1321 1st Qu.:26.00 PS : 462 1st Qu.: 7500 1st Qu.:-1.00223 N :889
## N :322 Median :175000 Median :36.00 S :1120 Median : 15000 Median :-0.04558 U :588
## S :718 Mean :152222 Mean :38.55 NA’s: 11 Mean : 33876 Mean : 0.00000 Y :868
## SA:960 3rd Qu.:250000 3rd Qu.:49.00 3rd Qu.: 35000 3rd Qu.: 0.96857 NA’s:168
## Max. :250000 Max. :70.00 Max. :200000 Max. : 2.04859
## NA’s :1 NA’s :98 NA’s :17
# vote is a factor with levels A (abstention), N (against Pinochet),
# U (undecided), Y (for Pinochet)
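The model mod2 used below is fitted earlier in the notes; a minimal sketch of how a multinomial logistic regression for vote could be fitted with nnet::multinom (an illustrative call, not necessarily the one behind mod2) is:
# Multinomial logistic regression for vote (illustrative; mod2 is defined
# earlier in the notes with its own formula and preprocessing)
ChileClean <- na.omit(Chile)
modMult <- nnet::multinom(vote ~ age + education + statusquo, data = ChileClean)
summary(modMult)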
# Predicted class
predict(mod2, newdata = newdata, type = "class")
## [1] N U Y
## Levels: N A U Y
# Complete cases
comp <- complete.cases(airquality)
head(airquality[comp, ])
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 9 8 19 20.1 61 5 9
## 13 11 290 9.2 66 5 13
## 14 14 274 10.9 68 5 14
## 15 18 65 13.2 58 5 15
## Residuals:
## Min 1Q Median 3Q Max
## -23.790 -10.910 -2.249 10.960 33.246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -71.50880 31.49703 -2.270 0.035704 *
## Solar.R -0.01112 0.03376 -0.329 0.745596
## Wind -0.61129 0.96113 -0.636 0.532769
## Temp 1.82870 0.42224 4.331 0.000403 ***
## Month -2.86513 2.22222 -1.289 0.213614
## Day -0.28710 0.41700 -0.688 0.499926
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 16.12 on 18 degrees of freedom
## (129 observations deleted due to missingness)
## Multiple R-squared: 0.5957, Adjusted R-squared: 0.4833
## F-statistic: 5.303 on 5 and 18 DF, p-value: 0.00362
We have seen the problems that missing data may cause in re-
gression models. There are many techniques designed to handle
missing data, depending on the missing data mechanism (whether
it is completely at random or whether there is some pattern in the
missing process) and the approach used to impute the data (parametric,
nonparametric, Bayesian, etc.). We do not give an exhaustive view of
the topic here, but we outline three concrete approaches to handle
missing data in practice:
(Figure: VIM::aggr missingness plot displaying the proportion of missings and the missing-data combinations for Ozone, Solar.R, Wind, Temp, Month, and Day.)
VIM::aggr(airqualityNoNA)
# Stepwise regression without NA's -- no problem
modBIC1 <- MASS::stepAIC(lm(Ozone ~ ., data = airqualityNoNA),
                         k = log(nrow(airqualityNoNA)), trace = 0)
summary(modBIC1)
##
## Call:
## lm(formula = Ozone ~ Temp, data = airqualityNoNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.645 -13.180 -0.136 8.926 37.666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -90.1846 24.6414 -3.660 0.00138 **
## Temp 1.6321 0.3367 4.847 7.63e-05 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 15.95 on 22 degrees of freedom
## Multiple R-squared: 0.5164, Adjusted R-squared: 0.4944
## F-statistic: 23.49 on 1 and 22 DF, p-value: 7.634e-05
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 22.17 on 101 degrees of freedom
## Multiple R-squared: 0.5901, Adjusted R-squared: 0.5779
## F-statistic: 48.46 on 3 and 101 DF, p-value: < 2.2e-16
# In this example the approach works well because most of
# the NA's are associated with the variable Solar.R
# Notice that the imputed data is the same (except for a small
# truncation that is introduced by mice) as
predict(lm(airquality$Ozone ~ ., data = airqualityMean),
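Since mice is mentioned above, a minimal sketch of a model-based imputation with it (default settings; this is an assumed call, not necessarily the one employed in the notes) is:
# Impute airquality with mice and pool the fits across the imputed datasets
library(mice)
imp <- mice(airquality, m = 5, printFlag = FALSE)   # 5 imputed datasets
fitImp <- with(imp, lm(Ozone ~ Solar.R + Wind + Temp + Month + Day))
summary(pool(fitImp))                               # pooled estimates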
$$Y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon, \qquad (A.8)$$
1. The model with all the predictors. The inferences for the coef-
ficients are correct: the distribution of the p-values (pvalues1) is
uniform whenever H0 : β j = 0 holds (for j = 3, 4) and concen-
trated around 0 when H0 does not hold (for j = 1, 2).
2. The model with predictors selected by stepwise regression. The
inferences for the coefficients are biased: when X3 and X4 are
included in the model, it is because they are highly significant for
the given sample by mere chance. Therefore, the distribution of
the p-values (pvalues2) is not uniform but concentrated at 0.
3. The model with the predictors selected by stepwise regression, but
fitted on a separate dataset. In this case, the p-values (pvalues3)
are not unrealistically small when the non-significant predictors end
up included in the model, and the inference is correct.
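Once the simulation below has been run, the claimed uniformity can be checked formally, for instance with the Kolmogorov-Smirnov test (an illustrative check, not part of the original code):
# Uniformity of the p-values of a truly null coefficient (e.g., beta_3) in the
# model with all the predictors; pvalues1 is filled in the simulation loop below
ks.test(pvalues1[, 3], "punif")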
# Simulation setting
n <- 2e2
p <- 4
p0 <- p %/% 2
beta <- c(rep(1, p0), rep(0, p - p0))
# Generate two sets of independent data following the same linear model
# with coefficients beta and null intercept
x1 <- matrix(rnorm(n * p), nrow = n, ncol = p)
data1 <- data.frame("x" = x1)
xbeta1 <- x1 %*% beta
x2 <- matrix(rnorm(n * p), nrow = n, ncol = p)
data2 <- data.frame("x" = x2)
xbeta2 <- x2 %*% beta
# Simulation
# pb <- txtProgressBar(style = 3)
for (i in 1:M) {

  mod1 <- lm(y ~ 0 + ., data = data1)
  s1 <- summary(mod1)
  pvalues1[i, ] <- s1$coefficients[, 4]

  # Obtain the significances of the coefficients for a data-driven selected
  # linear model (in this case, by stepwise regression using BIC)
  mod2 <- MASS::stepAIC(mod1, k = log(n), trace = 0)
  s2 <- summary(mod2)
  ind <- match(x = names(s2$coefficients[, 4]), table = nam)

  # Progress
}
(Figure: distributions of the p-values associated with β1, β2, β3, and β4.)
B.3 Introduction to R
Simple computations
# The console can act as a simple calculator
1.0 + 1.1
2 * 2
3/2
2^3
1/0
0/0
Compute:
• $\dfrac{e^2 + \sin(2)}{\cos^{-1}(1/2) + 2}$. Answer: 2.723274.
• $\sqrt{3^{2.5} + \log(10)}$. Answer: 4.22978.
• $\left(2^{0.93} - \log_2\left(3 + \sqrt{2 + \sin(1)}\right)\right) 10^{\tan(1/3)} \sqrt{3^{2.5} + \log(10)}$. Answer: -3.032108.
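One possible way of checking these answers in the console (acos is the inverse cosine and log the natural logarithm) is:
# One possible solution, line by line
(exp(2) + sin(2)) / (acos(1/2) + 2)                      # 2.723274
sqrt(3^2.5 + log(10))                                    # 4.22978
(2^0.93 - log2(3 + sqrt(2 + sin(1)))) * 10^tan(1/3) *
  sqrt(3^2.5 + log(10))                                  # -3.032108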
# Capitalization matters
x <- 2
X <- 3
x; X
## [1] 2
## [1] 3
# Remove variables
rm(X)
X
## Error in eval(expr, envir, enclos): object ’X’ not found
Do the following:
Vectors
# We combine numbers with the function "c"
c(1, 3)
## [1] 1 3
c(1.5, 0, 5, -3.4)
## [1] 1.5 0.0 5.0 -3.4
# Entrywise operations
myData <- c(1, 2)
myData + 1
## [1] 2 3
myData^2
## [1] 1 4
# And also
myData2[-c(1, 2)]
## [1] 1.1 1.0 3.0 4.0