
Notes for Predictive Modeling

Eduardo García Portugués


2020-01-27, v5.5.9
Copyright © 2020 Eduardo García Portugués

published by

bookdown.org/egarpor/pm-uc3m

Licensed under the Creative Commons License version 4.0 under the terms of Attribution, Non-Commercial
and No-Derivatives (the “License”); you may not use this file except in compliance with the License. You
may obtain a copy of the License at http://creativecommons.org/licenses/by-nc-nd/4.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed
on an “as is” basis, without warranties or conditions of any kind, either express or implied.
See the License for the specific language governing permissions and limitations under the License.

First printing, January 2020


Contents

Preface

1 Introduction
1.1 Course overview
1.2 What is predictive modelling?
1.3 General notation and background
1.4 Scripts and datasets

2 Linear models I: multiple linear model
2.1 Case study: The Bordeaux equation
2.2 Model formulation and least squares
2.3 Assumptions of the model
2.4 Inference for model parameters
2.5 Prediction
2.6 ANOVA
2.7 Model fit

3 Linear models II: model selection, extensions, and diagnostics
3.1 Case study: Housing values in Boston
3.2 Model selection
3.3 Use of qualitative predictors
3.4 Nonlinear relationships
3.5 Model diagnostics
3.6 Dimension reduction techniques

4 Linear models III: shrinkage, multivariate response, and big data
4.1 Shrinkage
4.2 Constrained linear models
4.3 Multivariate multiple linear model
4.4 Big data considerations

5 Generalized linear models
5.1 Case study: The Challenger disaster
5.2 Model formulation and estimation
5.3 Inference for model parameters
5.4 Prediction
5.5 Deviance
5.6 Model selection
5.7 Model diagnostics
5.8 Shrinkage
5.9 Big data considerations

6 Nonparametric regression
6.1 Nonparametric density estimation
6.2 Kernel regression estimation
6.3 Kernel regression with mixed multivariate data
6.4 Prediction and confidence intervals
6.5 Local likelihood

A Further topics
A.1 Informal review on hypothesis testing
A.2 Least squares and maximum likelihood estimation
A.3 Multinomial logistic regression
A.4 Dealing with missing data
A.5 A note of caution with inference after model-selection

B Software
B.1 Installation of R and RStudio
B.2 Introduction to RStudio
B.3 Introduction to R

C References
Preface

Figure 1: Illustration of nonparametric kernel estimators of the regression function.

Welcome
Welcome to the notes for Predictive Modeling for the course 2019/2020. The subject is part of the MSc in Big Data Analytics from Carlos III University of Madrid.
The course is designed to have, roughly, one lesson per main topic in the syllabus. The schedule is tight due to time constraints, which will inevitably make the treatment of certain methods a little more superficial than would be optimal. Nevertheless, the course will hopefully give you a respectable panoramic view of the different statistical methods available for predictive modelling. A broad view of the syllabus and its planning is:

1. Introduction (first lesson)


2. Linear models I (first/second lesson)
3. Linear models II (second/third lesson)

4. Linear models III (third/fourth lesson)


5. Generalized linear models (fifth/sixth lesson)
6. Nonparametric regression (sixth/seventh lesson)

Some logistics for the development of the course follow:

• The office hours are Thursdays from 12:45 to 13:45, at the classroom in which the session took place. Making use of them is the fastest way for me to clarify your doubts.
• Questions and comments during lectures are most welcome. So just go ahead and fire! Particularly if these are clarifications, comments, or alternative perspectives that may help the rest of the class.
• Detailed course evaluation guidelines can be found in Aula Global. Recall that participation in lessons is positively evaluated.

Main references and credits


Several great reference books have been used for preparing these
notes. The following list presents the books that have been con-
sulted:

• Chacón and Duong (2018) (Section 6.1.4)


• DasGupta (2008) (Section 3.5.2)
• Durbán (2017) (Section 5.2.2)
• Fan and Gijbels (1996) (Sections 6.2, 6.2.3, and 6.2.4)
• Hastie et al. (2009) (Section 4.1)
• James et al. (2013) (Sections 2.2–2.7, 3.1, 3.5, 3.6.3, and 4.1)
• Kuhn and Johnson (2013) (Section 1.2)
• Li and Racine (2007) (Section 6.3)
• Loader (1999) (Section 6.5)
• McCullagh and Nelder (1983) (Sections 5.2 – 5.6)
• Peña (2002) (Sections 2.2 – 2.7, 3.5, and 5.2.1)
• Seber and Lee (2003) (Section 4.2)
• Seber (1984) (Section 4.3)
• Wand and Jones (1995) (Sections 6.1.2, 6.1.3, and 6.2.4)
• Wasserman (2004) (Section 6.5)
• Wasserman (2006) (Section 6.2.4)
• Wood (2006) (Sections 5.2.2 and 5.7)

These notes are possible due to the existence of the incredible


pieces of software by Xie (2016a), Xie (2016b), Allaire et al. (2017),
Xie and Allaire (2018), and R Core Team (2018). Also, certain hacks
to improve the design layout have been possible due to the wonder-
ful work of Úcar (2018). The icons used in the notes were designed
by madebyoliver, freepik, and roundicons from Flaticon.
Last but not least, the notes have benefited from contributions
from the following people.
List of contributors:

• Ainara Apezteguía García (fixed a mathematical typo)


• Katherine Botz (performed a thorough proofreading of the
course materials, fixing a large number of typos)
• Marcos José Castillo Estévez (fixed two typos)
• Luis Cerdán Pedraza (performed an outstanding proofreading
of the course materials fixing more than fifty typos, style issues,
and mathematical typos)
• Frederik Chettouh (fixed a mathematical typo and two bugs)
• Gulnur Demir (fixed two typos)
• Andrés Escalante Ariza (fixed a mathematical typo)
• José Ángel Fernández (fixed several typos)
• Trinidad González Berzal (fixed a mathematical typo)
• Andrés Modet Álamo (performed an excellent review of the
course materials detecting and fixing more than thirty typos,
mostly mathematical, and four bugs)
• Santiago Palmero Muñoz (fixed a mathematical typo and a bug)
• Federico Petraccaro (fixed three mathematical typos)
• Enrique Ramírez Díaz (fixed a mathematical typo)
• Pavel Razgovorov (fixed a mathematical typo)
• Cristina Rodríguez Beltrán (fixed a typo and two bugs)
• Manuel Rodríguez Ramírez (fixed two typos)
• Celia Romero González (fixed a typo)
• Leonardo Stincone (fixed a mathematical typo and a bug)

Contributions
Contributions, reporting of typos, and feedback on the notes are
very welcome. Just send an email to edgarcia@est-econ.uc3m.es
and give me a reason for writing your name in the list of contribu-
tors!

License
All the material in these notes is licensed under the Creative Com-
mons Attribution-NonCommercial-NoDerivatives 4.0 Interna-
tional Public License (CC BY-NC-ND 4.0). You may not use this
material except in compliance with the former license. The human-
readable summary of the license states that:

• You are free to:


– Share – Copy and redistribute the material in any medium or
format.
• Under the following terms:
– Attribution – You must give appropriate credit, provide a link
to the license, and indicate if changes were made. You may do
so in any reasonable manner, but not in any way that suggests
the licensor endorses you or your use.
– NonCommercial – You may not use the material for commercial
purposes.

– NoDerivatives – If you remix, transform, or build upon the material, you may not distribute the modified material.
1 Introduction

These notes contain both the theory and practice for the statistical methods presented in the course. The emphasis is placed on building intuition behind the methods, gaining insights into their properties, and showing their application through the use of statistical software. The topics we will cover are an in-depth analysis of linear models, their extension to generalized linear models, and an introduction to nonparametric regression.

1.1 Course overview

The notes contain a substantial amount of code snippets that are fully self-contained within the chapter in which they are included. This allows one to see how the methods and theory translate neatly into practice. The software employed in the course is the statistical language R and its most common IDE (Integrated Development Environment) nowadays, RStudio. A basic prior knowledge of both is assumed (among others: basic programming in R, ability to work with objects and data structures, ability to produce graphics, knowledge of the main statistical functions, and ability to run scripts in RStudio). Appendix B presents some very basic introductions to RStudio and R for those students lacking basic expertise in them.

The Shiny interactive apps in the notes can be downloaded and run locally, which in particular allows one to examine their code. Check out this GitHub repository for the sources. Unfortunately, the interactive apps will not work properly the first time they are browsed (the browser's warning is due to the animations being hosted at an https website with a self-signed SSL certificate) and some action is required:

The animations of these notes will not be displayed the first time they are browsed. To see them, click on the caption's link "Application available here" and allow an exception in your browser. After this is done once, all the animations will show up correctly within the notes in the future.

We will employ several packages that are not included by default with R (the list may be updated during the course). These can be installed as:

# Installation of required packages
packages <- c("MASS", "car", "readxl", "rgl", "nortest", "latex2exp", "pca3d",
              "ISLR", "pls", "corrplot", "glmnet", "mvtnorm", "biglm", "leaps",
              "lme4", "viridis", "ffbase", "ks", "KernSmooth", "nor1mix", "np",
              "locfit", "manipulate", "mice", "VIM", "nnet")

install.packages(packages)

The notes make explicit mention of the package to which a function belongs by using the operator :: and, when the use of a package's functions is very repetitive, they may opt to load that package. In case you want to load all of them, you may want to run:
# Load packages
lapply(packages, library, character.only = TRUE)
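
As a minimal illustration of the :: convention (a sketch using the MASS package from the list above; its Boston dataset is also employed later in the notes):

# Using an object from a package without loading the package
head(MASS::Boston)

# Equivalent result after loading the package
library(MASS)
head(Boston)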

1.2 What is predictive modelling?


Predictive modelling is the process of developing a mathematical
tool or model that generates an accurate prediction about a random
quantity of interest.

In predictive modelling we are interested in predicting a random


variable, typically denoted by Y, from a set of related variables
X1 , . . . , X p . The focus is on learning what is the probabilistic model
that relates Y with X1 , . . . , X p , and use that acquired knowledge
for predicting Y given an observation of X1 , . . . , X p . Some concrete
examples of this are:

• Predicting the wine quality (Y) from a set of environmental


variables (X1 , . . . , X p ).
• Predicting the number of sales (Y) from a set of marketing ac-
tions (X1 , . . . , X p ).
• Modelling the average house value in a given suburb (Y) from a
set of community-related features (X1 , . . . , X p ).
• Predicting the probability of failure (Y) of a rocket launcher from
the ambient temperature (X1 ).
• Predicting students' academic performance (Y) according to education resources and learning methodologies (X1 , . . . , X p ).

The process of predictive modelling can be statistically abstracted in the following way. We believe that Y and X1 , . . . , X p are related by a regression model of the form

\[
Y = m(X_1, \ldots, X_p) + \varepsilon, \quad (1.1)
\]

where m is the regression function and ε is a random error with zero mean that accounts for the uncertainty of knowing Y if X1 , . . . , X p are known. The function m : R^p → R is unknown in practice and is the target of predictive modelling: m encodes the relation between Y and X1 , . . . , X p (the relation is encoded in average, by means of the conditional expectation). In other words, m captures the trend of the relation between Y and X1 , . . . , X p , and ε represents the stochasticity of that relation. Knowing m allows us to predict Y, and we are going to devote this course to seeing some statistical models that come up with an estimate of m, denoted by m̂, and to using m̂ to predict Y.

Let's see a concrete example of this with an artificial dataset. Suppose Y represents the average fuel consumption (l/100 km) of a car and X is the average speed (km/h). It is well known from physics that energy and speed have a quadratic relationship, and therefore we may assume, for the sake of exposition, that Y and X are truly quadratically related:

\[
Y = a + bX + cX^2 + \varepsilon.
\]

Then m : R → R (p = 1) with m(x) = a + bx + cx². Suppose the following data are measurements from a given car model, measured for different drivers and conditions (we do not have data for accounting for all those effects, which go into the ε term; this is an alternative useful view of ε: the aggregation of the effects we cannot account for when predicting Y):

x <- c(64, 20, 14, 64, 44, 39, 25, 53, 48, 9, 100, 112, 78, 105, 116, 94, 71,
       71, 101, 109)
y <- c(4, 6, 6.4, 4.1, 4.9, 4.4, 6.6, 4.4, 3.8, 7, 7.4, 8.4, 5.2, 7.6, 9.8,
       6.4, 5.1, 4.8, 8.2, 8.7)
plot(x, y, xlab = "Speed", ylab = "Fuel consumption")

Figure 1.1: Scatterplot of fuel consumption vs. speed.

From this data, we can estimate m by means of a polynomial model (note we use the information that m has to be of a particular form, in this case quadratic, which is an unrealistic situation for other data applications):

# Estimates for a, b, and c
lm(y ~ x + I(x^2))
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Coefficients:
## (Intercept)            x       I(x^2)
##    8.512421    -0.153291     0.001408

Then the estimate of m is m̂(x) = â + b̂x + ĉx² = 8.512 − 0.153x + 0.001x² and its fit to the data is pretty good. As a consequence, we can use this precise mathematical function to predict Y from a particular observation of X. For example, the estimated fuel consumption at speed 90 km/h is 8.512421 - 0.153291 * 90 + 0.001408 * 90^2 = 6.1210.
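
The same prediction can also be obtained with predict(); a minimal sketch (the object name quadFit is arbitrary):

# Store the fit and predict the fuel consumption at 90 km/h
quadFit <- lm(y ~ x + I(x^2))
predict(quadFit, newdata = data.frame(x = 90))
# Approximately 6.12, matching the manual computation above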

plot(x, y, xlab = "Speed", ylab = "Fuel consumption")


curve(8.512421 - 0.153291 * x + 0.001408 * x^2, add = TRUE, col = 2)

Figure 1.2: Fitted quadratic model.

There are a number of generic issues and decisions to take when building and estimating regression models that are worth highlighting:

1. The prediction accuracy versus interpretability trade-off. Prediction accuracy is key in any predictive model: of course, the better the model is able to predict Y, the more useful it will be. However, some models achieve this predictive accuracy in exchange for a clear interpretability of the model (the so-called black boxes). Interpretability is key in order to gain insights on the prediction process, to know which variables are most influential in Y, to be able to interpret the parameters of the model, and to translate the prediction process to non-experts. In essence, interpretability allows us to explain precisely how and why the model behaves when predicting Y from X1 , . . . , X p . Most of the models seen in these notes favor interpretability (not only that, but they are neatly interpretable) and hence they may make a sacrifice in terms of their prediction accuracy when compared with more convoluted models.

2. Model correctness versus model usefulness. Correctness and usefulness are two different concepts in modelling. The first refers to the model being statistically correct, that is, to the assumptions on which the model relating Y with X1 , . . . , X p is built being satisfied. The second refers to the model being useful for explaining or predicting Y from X1 , . . . , X p . Both concepts are certainly related (if the model is correct/useful, then likely it is useful/correct) but neither is implied by the other. For example, a regression model might be correct but useless if the variance of ε is large (too much noise). And yet if the model is not completely correct, it may give useful insights and predictions, but inference may be completely spurious (particularly, it usually happens that inference based on erroneous assumptions underestimates variability, as the assumptions tend to ensure that the information of the sample is maximal for estimating the model at hand; thus, inference based on erroneous assumptions results in a false sense of confidence: a larger error is made in reality than the one the model theory states).

3. Flexibility versus simplicity. The best model is one that is very simple (low number of parameters), highly interpretable, and delivers great predictions. This is often unachievable in practice. What can be achieved is a good model: one that balances simplicity with prediction accuracy, which is often increased the more flexible the model is. However, flexibility comes at a price: more flexible (hence more complex) models use more parameters that need to be estimated from a finite amount of information – the sample. This is problematic, as overly flexible models are more dependent on the sample, up to the point in which they end up not estimating the true relation between Y and X1 , . . . , X p , m, but merely interpolating the observed data. This well-known phenomenon is called overfitting and it can be avoided by splitting the dataset into two datasets (three datasets if we are fitting hyperparameters or tuning parameters in our model: the training dataset for estimating the model parameters; the validation dataset for estimating the hyperparameters; and the testing dataset for evaluating the final performance of the fitted model): the training dataset, used for estimating the model; and the testing dataset, used for evaluating the predictive performance of the fitted model. On the other hand, excessive simplicity (underfitting) is also problematic, since the true relation between Y and X1 , . . . , X p may be overly simplified. Therefore, a trade-off in the degree of flexibility has to be attained for having a good model. This is often referred to as the bias-variance trade-off (low flexibility increases the bias of the fitted model, high flexibility increases the variance). An illustration of this transversal problem in predictive modelling is given in Figure 1.3.
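
To make the overfitting discussion concrete, here is a minimal sketch (not part of the original notes; the seed, sample sizes, and polynomial degrees are arbitrary choices) that fits polynomials of increasing degree on a training dataset and evaluates them on a testing dataset:

# Sketch: overfitting illustrated with a train/test split (illustrative values)
set.seed(123456)
n <- 200
x <- rnorm(n)
y <- -0.5 + 1.5 * x^2 + rnorm(n)  # true regression is quadratic

# Split the sample: first half for training, second half for testing
train <- data.frame(x = x[1:100], y = y[1:100])
test <- data.frame(x = x[101:200], y = y[101:200])

# Fit polynomials of increasing degree and compare prediction errors
degrees <- c(1, 2, 5, 10, 15)
sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, degree = d), data = train)
  c(trainMSE = mean(fit$residuals^2),
    testMSE = mean((test$y - predict(fit, newdata = test))^2))
})
# The training error keeps decreasing as the degree grows, while the testing
# error typically deteriorates for large degrees: the hallmark of overfitting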

1.3 General notation and background

We use capital letters to denote random variables, such as X, and lowercase, such as x, to denote deterministic values. For example, P[X = x] means "the probability that the random variable X takes the particular value x". In predictive modelling we are concerned with the prediction or explanation of a response Y from a set of predictors X1 , . . . , X p . Both Y and X1 , . . . , X p are random variables, but we use them in a different way: our interest lies in predicting or explaining Y from X1 , . . . , X p . Another name for Y is dependent variable, and X1 , . . . , X p are sometimes referred to as independent variables, covariates, or explanatory variables. We will not use these terminologies.

Figure 1.3: Illustration of overfitting in polynomial regression. The left plot shows the training dataset and the right plot the testing dataset. Better fitting of the training data with a higher polynomial order does not imply better performance on new observations (prediction), but just an overfitting of the available data with an overly-parametrized model (too flexible for the amount of information available). Reduction in the predictive error is only achieved with fits (in red) of polynomial degrees close to the true regression (in black). Application available here.
The cumulative distribution function (cdf) of a random variable X is F(x) := P[X ≤ x] and is a function that completely characterizes the randomness of X. Continuous random variables are also characterized by the probability density function (pdf) f(x) = F'(x) (respectively, F(x) = \int_{-\infty}^{x} f(t)\,\mathrm{d}t), which represents the infinitesimal relative probability of X per unit of length. On the other hand, discrete random variables are also characterized by the probability mass function P[X = x]. We write X ∼ F (or X ∼ f if X is continuous) to denote that X has a cdf F (or a pdf f). If two random variables X and Y have the same distribution, we write X =d Y (equality in distribution).
For a random variable X ∼ F, the expectation of g(X) is defined as

\[
\mathbb{E}[g(X)] := \int g(x)\,\mathrm{d}F(x) :=
\begin{cases}
\int g(x) f(x)\,\mathrm{d}x, & \text{if } X \text{ is continuous},\\
\sum_{\{x \in \mathbb{R}:\, \mathbb{P}[X = x] > 0\}} g(x)\,\mathbb{P}[X = x], & \text{if } X \text{ is discrete}.
\end{cases}
\]

Unless otherwise stated, the integration limits of any integral are R or R^p. The variance is defined as Var[X] := E[(X − E[X])²] = E[X²] − E[X]².
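
A quick numerical illustration of these definitions with the standard normal (a sketch, not part of the original notes):

# E[g(X)] with g(x) = x^2 for X ~ N(0, 1): equals Var[X] = 1
integrate(function(x) x^2 * dnorm(x), lower = -Inf, upper = Inf)$value

# The cdf is the integral of the pdf: F(1) computed both ways
integrate(dnorm, lower = -Inf, upper = 1)$value
pnorm(1)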
We employ boldface to denote vectors (assumed to be column matrices, although sometimes written in row layout), like a, and matrices, like A. We denote by A' the transpose of A. Boldfaced capitals will be used simultaneously for denoting matrices and also random vectors X = (X1 , . . . , X p ), which are collections of random variables X1 , . . . , X p . The (joint) cdf of X, understood as the probability that X1 ≤ x1 and . . . and X p ≤ x p , is

\[
F(\mathbf{x}) := \mathbb{P}[\mathbf{X} \le \mathbf{x}] := \mathbb{P}[X_1 \le x_1, \ldots, X_p \le x_p]
\]

and, if X is continuous, its (joint) pdf is f := \frac{\partial^p}{\partial x_1 \cdots \partial x_p} F.
The marginals of F and f are the cdf and pdf of X j , j = 1, . . . , p, respectively. They are defined as:

\[
F_{X_j}(x_j) := \mathbb{P}[X_j \le x_j] = F(\infty, \ldots, \infty, x_j, \infty, \ldots, \infty),
\]
\[
f_{X_j}(x_j) := \frac{\partial}{\partial x_j} F_{X_j}(x_j) = \int_{\mathbb{R}^{p-1}} f(\mathbf{x})\,\mathrm{d}\mathbf{x}_{-j},
\]

where x_{-j} := (x1 , . . . , x_{j−1} , x_{j+1} , . . . , x p ). The definitions can be extended analogously to the marginals of the cdf and pdf of different subsets of X.

The conditional cdf and pdf of X1 | (X2 , . . . , X p ) are defined, respectively, as

\[
F_{X_1 | \mathbf{X}_{-1} = \mathbf{x}_{-1}}(x_1) := \mathbb{P}[X_1 \le x_1 | \mathbf{X}_{-1} = \mathbf{x}_{-1}],
\qquad
f_{X_1 | \mathbf{X}_{-1} = \mathbf{x}_{-1}}(x_1) := \frac{f(\mathbf{x})}{f_{\mathbf{X}_{-1}}(\mathbf{x}_{-1})}.
\]

The conditional expectation of Y | X is the following random variable (recall that the X-part of E[Y | X] is random; however, E[Y | X = x] is deterministic):

\[
\mathbb{E}[Y | X] := \int y\,\mathrm{d}F_{Y|X}(y|X).
\]

For two random variables X1 and X2 , the covariance between them is defined as

\[
\mathrm{Cov}[X_1, X_2] := \mathbb{E}[(X_1 - \mathbb{E}[X_1])(X_2 - \mathbb{E}[X_2])] = \mathbb{E}[X_1 X_2] - \mathbb{E}[X_1]\mathbb{E}[X_2],
\]

and the correlation between them is defined as

\[
\mathrm{Cor}[X_1, X_2] := \frac{\mathrm{Cov}[X_1, X_2]}{\sqrt{\mathrm{Var}[X_1]\mathrm{Var}[X_2]}}.
\]
The variance and the covariance are extended to a random vector X = (X1 , . . . , X p )' by means of the so-called variance-covariance matrix:

\[
\mathrm{Var}[\mathbf{X}] := \mathbb{E}[(\mathbf{X} - \mathbb{E}[\mathbf{X}])(\mathbf{X} - \mathbb{E}[\mathbf{X}])'] = \mathbb{E}[\mathbf{X}\mathbf{X}'] - \mathbb{E}[\mathbf{X}]\mathbb{E}[\mathbf{X}]'
= \begin{pmatrix}
\mathrm{Var}[X_1] & \mathrm{Cov}[X_1, X_2] & \cdots & \mathrm{Cov}[X_1, X_p]\\
\mathrm{Cov}[X_2, X_1] & \mathrm{Var}[X_2] & \cdots & \mathrm{Cov}[X_2, X_p]\\
\vdots & \vdots & \ddots & \vdots\\
\mathrm{Cov}[X_p, X_1] & \mathrm{Cov}[X_p, X_2] & \cdots & \mathrm{Var}[X_p]
\end{pmatrix},
\]

where E[X] := (E[X1], . . . , E[X p])' is just the componentwise expectation. As in the univariate case, the expectation is a linear operator, which now means that

\[
\mathbb{E}[\mathbf{A}\mathbf{X} + \mathbf{b}] = \mathbf{A}\mathbb{E}[\mathbf{X}] + \mathbf{b}, \quad \text{for a } q \times p \text{ matrix } \mathbf{A} \text{ and } \mathbf{b} \in \mathbb{R}^q. \quad (1.2)
\]

It follows from (1.2) that

\[
\mathrm{Var}[\mathbf{A}\mathbf{X} + \mathbf{b}] = \mathbf{A}\mathrm{Var}[\mathbf{X}]\mathbf{A}', \quad \text{for a } q \times p \text{ matrix } \mathbf{A} \text{ and } \mathbf{b} \in \mathbb{R}^q. \quad (1.3)
\]

(When applying (1.2), (1.3), and (1.4), it is common to have B = I_p; in other words, B disappears from the equations.)

The p-dimensional normal of mean µ ∈ R^p and covariance matrix Σ (a p × p symmetric and positive definite matrix) is denoted by N_p(µ, Σ) and is the generalization to p random variables of the usual normal distribution. Its (joint) pdf is given by

\[
\phi(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) := \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}, \quad \mathbf{x} \in \mathbb{R}^p.
\]

The p-dimensional normal has a nice linear property that stems from (1.2) and (1.3):

\[
\mathbf{A}\mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) + \mathbf{b} \stackrel{d}{=} \mathcal{N}_q(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'). \quad (1.4)
\]

Notice that when p = 1, µ = µ, and Σ = σ², the pdf of the usual normal N(µ, σ²) is recovered (if µ = 0 and σ = 1, the standard normal, then the pdf and cdf are simply denoted by φ and Φ, without extra parameters):

\[
\phi(x; \mu, \sigma^2) := \frac{1}{\sqrt{2\pi}\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.
\]

When p = 2, the pdf is expressed in terms of µ = (µ1 , µ2 ) and Σ = (σ1², ρσ1σ2; ρσ1σ2, σ2²), for µ1 , µ2 ∈ R, σ1 , σ2 > 0, and −1 < ρ < 1:

\[
\phi(x_1, x_2; \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho) := \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}\right]\right\}. \quad (1.5)
\]

The surface defined by (1.5) can be regarded as a three-dimensional bell. In addition, it serves to provide concrete examples of the functions introduced above:

• Joint pdf: f(x1, x2) = φ(x1, x2; µ1, µ2, σ1², σ2², ρ).

• Marginal pdfs: f_{X1}(x1) = \int φ(x1, t2; µ1, µ2, σ1², σ2², ρ) dt2 = φ(x1; µ1, σ1²) and f_{X2}(x2) = φ(x2; µ2, σ2²). Hence X1 ∼ N(µ1, σ1²) and X2 ∼ N(µ2, σ2²).

• Conditional pdfs:
\[
f_{X_1|X_2=x_2}(x_1) = \frac{f(x_1, x_2)}{f_{X_2}(x_2)} = \phi\!\left(x_1;\, \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2-\mu_2),\, (1-\rho^2)\sigma_1^2\right),
\]
\[
f_{X_2|X_1=x_1}(x_2) = \phi\!\left(x_2;\, \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1),\, (1-\rho^2)\sigma_2^2\right).
\]
Hence
\[
X_1 | X_2 = x_2 \sim \mathcal{N}\!\left(\mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2-\mu_2),\, (1-\rho^2)\sigma_1^2\right), \qquad
X_2 | X_1 = x_1 \sim \mathcal{N}\!\left(\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1),\, (1-\rho^2)\sigma_2^2\right).
\]

• Conditional expectations:
\[
\mathbb{E}[X_1 | X_2 = x_2] = \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2-\mu_2), \qquad
\mathbb{E}[X_2 | X_1 = x_1] = \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1).
\]

• Joint cdf: \int_{-\infty}^{x_2}\int_{-\infty}^{x_1} φ(t1, t2; µ1, µ2, σ1², σ2², ρ) dt1 dt2.

• Marginal cdfs: \int_{-\infty}^{x_1} φ(t; µ1, σ1²) dt =: Φ(x1; µ1, σ1²) and Φ(x2; µ2, σ2²).

• Conditional cdfs:
\[
\int_{-\infty}^{x_1} \phi\!\left(t;\, \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2-\mu_2),\, (1-\rho^2)\sigma_1^2\right)\mathrm{d}t = \Phi\!\left(x_1;\, \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2-\mu_2),\, (1-\rho^2)\sigma_1^2\right)
\]
and Φ(x2; µ2 + ρ(σ2/σ1)(x1 − µ1), (1 − ρ²)σ2²).

Figure 1.4 graphically summarizes the concepts of joint, marginal,


and conditional distributions within the context of a 2-dimensional
normal.

Figure 1.4: Visualization of the joint pdf (in blue), marginal pdfs (green), conditional pdf of X2 | X1 = x1 (orange), expectation (red point), and conditional expectation E[X2 | X1 = x1] (orange point) of a 2-dimensional normal. The conditioning point of X1 is x1 = −2. Note the different scales of the densities, as they have to integrate to one over different supports. Note how the conditional density (upper orange curve) is not the joint pdf f(x1, x2) (lower orange curve) with x1 = −2, but a rescaling of this curve by 1/f_{X1}(x1). The parameters of the 2-dimensional normal are µ1 = µ2 = 0, σ1 = σ2 = 1, and ρ = 0.75. 500 observations sampled from the distribution are shown in black.
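
To connect these formulas with simulation, here is a minimal sketch (not part of the original notes) that draws from a 2-dimensional normal with the parameters of Figure 1.4 (µ1 = µ2 = 0, σ1 = σ2 = 1, ρ = 0.75) and checks the conditional expectation E[X2 | X1 = x1] empirically, using the mvtnorm package listed in Section 1.1:

# Simulate from a 2D normal and check E[X2 | X1 = x1] empirically
library(mvtnorm)
mu <- c(0, 0)
rho <- 0.75
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)

set.seed(42)
samp <- rmvnorm(n = 1e5, mean = mu, sigma = Sigma)

# Theoretical conditional expectation at x1 = -2: mu2 + rho * (sigma2 / sigma1) * (x1 - mu1)
x1 <- -2
mu[2] + rho * (x1 - mu[1])
## -1.5

# Empirical counterpart: average X2 among observations with X1 close to x1
mean(samp[abs(samp[, 1] - x1) < 0.1, 2])
## approximately -1.5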

Finally, in the predictive models we will consider an indepen-


dent and identically distributed (iid) sample of the response and the
predictors. We use the following notation: Yi is the i-th observa-
tion of the response Y and Xij represents the i-th observation of
the j-th predictor X j . Thus we will deal with samples of the form {(X_{i1}, . . . , X_{ip}, Y_i)}_{i=1}^{n}.

1.4 Scripts and datasets

The code snippets of the notes are conveniently collected in the following scripts. To download them, simply save the link as a file in your browser.

• Chapter 1: 01-intro.R.
• Chapter 2: 02-lm-i.R.
• Chapter 3: 03-lm-ii.R.
• Chapter 4: 04-lm-iii.R.
• Chapter 5: 05-glm.R. Generation of Figures 5.12–5.23: hypothesisGlm.R.
• Chapter 6: 06-npreg.R.
• Appendices A and B: 07-appendix.R.

The following is a handy list of all the relevant datasets used


in the course together with brief descriptions. The list is sorted
according to the order of appearance of the datasets in the notes. To
download them, simply save the link as a file in your browser.

• wine.csv. The dataset is formed by the auction Price of 27 red


Bordeaux vintages, five vintage descriptors (WinterRain, AGST,
HarvestRain, Age, Year), and the population of France in the year
of the vintage, FrancePop.

• least-squares.RData. Contains a single data.frame, named


leastSquares, with 50 observations of the variables x, yLin,
yQua, and yExp. These are generated as X ∼ N(0, 1), Ylin = −0.5 + 1.5X + ε, Yqua = −0.5 + 1.5X² + ε, and Yexp = −0.5 + 1.5 · 2^X + ε, with ε ∼ N(0, 0.5²). The purpose of the dataset is to
illustrate the least squares fitting.

• least-squares-3D.RData. Contains a single data.frame, named


leastSquares3D, with 50 observations of the variables x1, x2, x3,
yLin, yQua, and yExp. These are generated as X1 , X2 ∼ N (0, 1),
X3 = X1 + N (0, 0.052 ), Ylin = −0.5 + 0.5X1 + 0.5X2 + ε, Yqua =
−0.5 + X12 + 0.5X2 + ε, and Yexp = −0.5 + 0.5e X2 + X3 + ε, with
ε ∼ N (0, 1). The purpose of the dataset is to illustrate the least
squares fitting with several predictors.

• assumptions.RData. Contains the data frame assumptions with


200 observations of the variables x1, . . . , x9 and y1, . . . , y9. The
purpose of the dataset is to identify which regression y1 ~ x1,
. . . , y9 ~ x9 fulfills the assumptions of the linear model. The
dataset moreAssumptions.RData has the same structure.

• assumptions3D.RData. Contains the data frame assumptions3D


with 200 observations of the variables x1.1, . . . , x1.8, x2.1, . . . ,
x2.8 and y.1, . . . , y.8. The purpose of the dataset is to identify
which regression y.1 ~ x1.1 + x2.1, . . . , y.8 ~ x1.8 + x2.8
fulfills the assumptions of the linear model.

• Boston.xlsx. The dataset contains 14 variables describing 506


suburbs in Boston. Among those variables, medv is the median
house value, rm is the average number of rooms per house and
crim is the per capita crime rate. The full description is available
in ?MASS::Boston.

• cpus.txt and gpus.txt. The datasets contain 102 and 35 rows,


respectively, of commercial CPUs and GPUs that have appeared from the first models up to nowadays. The variables in the datasets
are Processor, Transistor count, Date of introduction,
Manufacturer, Process, and Area.

• la-liga-2015-2016.xlsx. Contains 19 performance metrics for


the 20 football teams in La Liga 2015/2016.

• challenger.txt. Contains data for 23 space-shuttle launches.


There are 8 variables. Among them: temp (the temperature in
Celsius degrees at the time of launch), and fail.field and
fail.nozzle (indicators of whether there were incidents in the
the O-rings of the field joints and nozzles of the solid rocket
boosters).

• species.txt. Contains data for 90 country parcels in which the


Biomass, pH of the terrain (categorical variable), and number of
Species were measured.

• Chile.txt. Contains data for 2700 respondents on a survey for


the voting intentions in the 1988 Chilean national plebiscite.
There are 8 variables: region, population, sex, age, education,
income, statusquo (scale of support for the status quo), and vote.
vote is a factor with levels A (abstention), N (against Pinochet),
U (undecided), and Y (for Pinochet). Available in data(Chile,
package = "carData").
2 Linear models I: multiple linear model

The multiple linear model is a simple but useful statistical model.


In short, it allows us to analyse the (assumed) linear relation be-
tween a response Y and multiple predictors, X1 , . . . , X p in a proper
way:

Y = β 0 + β 1 X1 + β 2 X2 + . . . + β p X p + ε

The simplest case corresponds to p = 1, known as the simple linear


model:

Y = β0 + β1 X + ε

This model would be useful, for example, to predict Y given X from


a sample ( X1 , Y1 ), . . . , ( Xn , Yn ) such that its scatterplot is the one in
Figure 2.1.
Figure 2.1: Scatterplot of a sample (X1 , Y1 ), . . . , (Xn , Yn ) showing a linear pattern.

2.1 Case study: The Bordeaux equation

Calculate the winter rain and the harvest rain (in millimeters). Add summer heat in the vineyard (in degrees centigrade). Subtract 12.145. And what do you have? A very, very passionate argument over wine.

— "Wine Equation Puts Some Noses Out of Joint", The New York Times, 04/03/1990.

This case study is motivated by the study of Princeton professor Orley Ashenfelter (Ashenfelter et al., 1995) on the quality of red Bordeaux vintages. The study became mainstream after disputes with the wine press, especially with Robert Parker Jr., one of the most influential wine critics in America. You can see a short review of the story at the Financial Times ("How computers routed the experts", Financial Times, 31/08/2007) and at the video in Figure 2.2.

Figure 2.2: ABC interview to Orley Ashenfelter, broadcast in 1992. Video also available here.

Red Bordeaux wines have been produced in Bordeaux, one of the most famous and prolific wine regions in the world, in a very similar way for hundreds of years. However, the quality of vintages is largely variable from one season to another due to a long list of random factors, such as weather conditions. Because Bordeaux wines taste better when they are older (young wines are astringent; as the wines age they lose their astringency), there is an incentive to store the young wines until they are mature. Due to the important difference in taste, it is hard to determine the quality of the wine when it is so young just by tasting it, because it will change substantially when

the aged wine is in the market. Therefore, being able to predict the
quality of a vintage is valuable information for investing resources,
for determining a fair price for vintages, and for understanding
what factors are affecting the wine quality. The purpose of this case
study is to answer:

• Q1. Can we predict the quality of a vintage effectively?


• Q2. What is the interpretation of such prediction?

The wine.csv file contains 27 red Bordeaux vintages. The data is the same data originally employed by Ashenfelter et al. (1995) (source: http://www.liquidasset.com/winedata.html), except for the inclusion of the variable Year, the exclusion of NAs, and the reference price used for the wine. Each row has the following variables:

• Year: year in which grapes were harvested to make wine.


• Price: logarithm of the average market price for Bordeaux vin-
tages according to 1990–1991 auctions. The price is relative to the
price of the 1961 vintage, regarded as the best one ever recorded.
• WinterRain: winter rainfall (in mm).
• AGST: Average Growing Season Temperature (in Celsius degrees).
• HarvestRain: harvest rainfall (in mm).
• Age: age of the wine measured as the number of years stored in a
cask.
• FrancePop: population of France at Year (in thousands).

The quality of the wine is quantified as the Price, a clever way of


quantifying a qualitative measure. A portion of the data is shown
in Table 2.1.

Table 2.1: First 15 rows of the wine dataset.

Year Price WinterRain AGST HarvestRain Age FrancePop


1952 7.4950 600 17.1167 160 31 43183.57
1953 8.0393 690 16.7333 80 30 43495.03
1955 7.6858 502 17.1500 130 28 44217.86
1957 6.9845 420 16.1333 110 26 45152.25
1958 6.7772 582 16.4167 187 25 45653.81
1959 8.0757 485 17.4833 187 24 46128.64
1960 6.5188 763 16.4167 290 23 46584.00
1961 8.4937 830 17.3333 38 22 47128.00
1962 7.3880 697 16.3000 52 21 48088.67
1963 6.7127 608 15.7167 155 20 48798.99
1964 7.3094 402 17.2667 96 19 49356.94
1965 6.2518 602 15.3667 267 18 49801.82
1966 7.7443 819 16.5333 86 17 50254.97
1967 6.8398 714 16.2333 118 16 50650.41
1968 6.2435 610 16.2000 292 15 51034.41

We will see along the chapter how to answer Q1 and Q2 and


how to obtain quantitative insights on the effects of the predictors
on the price. Before doing so, we need to introduce the required
statistical machinery.

2.2 Model formulation and least squares

In order to simplify the introduction of the foundations of the linear


model, we first present the simple linear model and then extend it
to the multiple linear model.

2.2.1 Simple linear model


The simple linear model is constructed by assuming that the linear
relation

Y = β0 + β1 X + ε (2.1)

holds between X and Y. In (2.1), β 0 and β 1 are known as the inter-


cept and slope, respectively. The random variable ε has mean zero
and is independent from X. It describes the error around the mean,
or the effect of other variables that we do not model. Another way
of looking at (2.1) is

E[Y | X = x ] = β 0 + β 1 x, (2.2)

since E[ε| X = x ] = 0.
The Left Hand Side (LHS) of (2.2) is the conditional expectation of
Y given X. It represents how the mean of the random variable Y
is changing according to a particular value x of the random vari-
able X. With the RHS, what we are saying is that the mean of Y is
changing in a linear fashion with respect to the value of X. Hence
the clear interpretation of the coefficients:

• β 0 : is the mean of Y when X = 0.


• β 1 : is the increment in mean of Y for an increment of one unit in
X = x.

If we have a sample ( X1 , Y1 ), . . . , ( Xn , Yn ) for our random vari-


ables X and Y, we can estimate the unknown coefficients β 0 and β 1 .
A possible way of doing so is by looking for certain optimality, for
example the minimization of the Residual Sum of Squares (RSS):
\[
\mathrm{RSS}(\beta_0, \beta_1) := \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2.
\]

In other words, we look for the estimators (β̂0 , β̂1 ) such that

\[
(\hat\beta_0, \hat\beta_1) = \arg\min_{(\beta_0, \beta_1) \in \mathbb{R}^2} \mathrm{RSS}(\beta_0, \beta_1).
\]

The motivation for minimizing the RSS is geometrical, as shown


by Figure 2.3. We aim to minimize the squares of the distances of
points projected vertically onto the line determined by ( β̂ 0 , β̂ 1 ).
24 eduardo garcía portugués

Figure 2.3: The effect of the kind of


distance in the error criterion. The
choices of intercept and slope that
minimize the sum of squared distances
for one kind of distance are not the
optimal choices for a different kind of
distance. Application available here.

It can be seen that the minimizers of the RSS are

\[
\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}, \qquad \hat\beta_1 = \frac{s_{xy}}{s_x^2}, \quad (2.3)
\]

(they are unique and always exist whenever s_x > 0, that is, when not all the data points are the same; they can be obtained by solving ∂RSS(β0, β1)/∂β0 = 0 and ∂RSS(β0, β1)/∂β1 = 0), where:

• X̄ = (1/n) ∑_{i=1}^n Xi is the sample mean.
• s_x² = (1/n) ∑_{i=1}^n (Xi − X̄)² is the sample variance (the sample standard deviation is s_x = √(s_x²)).
• s_xy = (1/n) ∑_{i=1}^n (Xi − X̄)(Yi − Ȳ) is the sample covariance. It measures the degree of linear association between X1 , . . . , Xn and Y1 , . . . , Yn . Once scaled by s_x s_y, it gives the sample correlation coefficient r_xy = s_xy / (s_x s_y).

There are some important points hidden behind the choice of the RSS as the error criterion for obtaining (β̂0 , β̂1 ):

• Why the vertical distances and not horizontal or perpendicular? Be-


cause we want to minimize the error in the prediction of Y! Note
that the treatment of the variables is not symmetrical.
• Why the squares in the distances and not the absolute value? Due to
mathematical convenience. Squares are nice to differentiate and
are closely related with maximum likelihood estimation under
the normal distribution (see Appendix A.2).

Let’s see how to obtain automatically the minimizers of the error


in Figure 2.3 by the lm (linear model) function. The data of the
figure has been generated with the following code:
# Generates 50 points from a N(0, 1): predictor and error
set.seed(34567)
x <- rnorm(n = 50)
eps <- rnorm(n = 50)

# Responses
yLin <- -0.5 + 1.5 * x + eps
yQua <- -0.5 + 1.5 * x^2 + eps
yExp <- -0.5 + 1.5 * 2^x + eps

# Data
leastSquares <- data.frame(x = x, yLin = yLin, yQua = yQua, yExp = yExp)

For a simple linear model, lm has the syntax lm(formula =


response ~ predictor, data = data), where response and
predictor are the names of two variables in the data frame data.
Note that the LHS of ~ represents the response and the RHS the
predictors.

# Call lm
lm(yLin ~ x, data = leastSquares)
##
## Call:
## lm(formula = yLin ~ x, data = leastSquares)
##
## Coefficients:
## (Intercept) x
## -0.6154 1.3951
lm(yQua ~ x, data = leastSquares)
##
## Call:
## lm(formula = yQua ~ x, data = leastSquares)
##
## Coefficients:
## (Intercept) x
## 0.9710 -0.8035
lm(yExp ~ x, data = leastSquares)
##
## Call:
## lm(formula = yExp ~ x, data = leastSquares)
##
## Coefficients:
## (Intercept) x
## 1.270 1.007

# The lm object
mod <- lm(yLin ~ x, data = leastSquares)
mod
##
## Call:
## lm(formula = yLin ~ x, data = leastSquares)
##
## Coefficients:
## (Intercept) x
## -0.6154 1.3951

# mod is a list of objects whose names are
names(mod)
## [1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr"
## [8] "df.residual" "xlevels" "call" "terms" "model"

# We can access these elements by $
mod$coefficients
## (Intercept) x
## -0.6153744 1.3950973

# We can produce a plot with the linear fit easily
plot(x, yLin)
abline(coef = mod$coefficients, col = 2)

Figure 2.4: Linear fit for the data employed in Figure 2.3 minimizing the RSS.

Check that you can not improve the error in Figure 2.3
when using the coefficients given by lm, if vertical dis-
tances are selected. Check also that these coefficients are
only optimal for vertical distances.

An interesting exercise is to check that lm is actually implement-


ing the estimates given in (2.3):

# Covariance
Sxy <- cov(x, yLin)

# Variance
Sx2 <- var(x)

# Coefficients
beta1 <- Sxy / Sx2
beta0 <- mean(yLin) - beta1 * mean(x)
c(beta0, beta1)
## [1] -0.6153744 1.3950973

# Output from lm
mod <- lm(yLin ~ x, data = leastSquares)
mod$coefficients
## (Intercept) x
## -0.6153744 1.3950973

The population regression coefficients, ( β 0 , β 1 ), are not the


same as the estimated regression coefficients, ( β̂ 0 , β̂ 1 ):

• ( β 0 , β 1 ) are the theoretical and always unknown quan-


tities (except under controlled scenarios).
• ( β̂ 0 , β̂ 1 ) are the estimates computed from the data.
They are random variables, since they are computed
from the random sample ( X1 , Y1 ), . . . , ( Xn , Yn ).

In an abuse of notation, the term regression line is often


used to denote both the theoretical (y = β 0 + β 1 x) and the
estimated (y = β̂ 0 + β̂ 1 x) regression lines.
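
A minimal simulation sketch of this distinction (not from the original notes; the true values β0 = −0.5 and β1 = 1.5 mimic the yLin construction above) shows that the estimates vary from sample to sample around the population coefficients:

# The estimates (beta0hat, beta1hat) are random: they change across samples
set.seed(12345)
betaHats <- t(replicate(1000, {
  x <- rnorm(n = 50)
  y <- -0.5 + 1.5 * x + rnorm(n = 50)  # true (beta0, beta1) = (-0.5, 1.5)
  lm(y ~ x)$coefficients
}))

# The estimates center at the true coefficients, but fluctuate around them
colMeans(betaHats)
apply(betaHats, 2, sd)
hist(betaHats[, 2], main = "Sampling distribution of the slope estimate", xlab = "")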

2.2.2 Case study application


Let’s get back to the wine dataset and compute some simple linear
regressions. Prior to that, let’s begin by summarizing the informa-
tion in Table 2.1 to get a grasp of the structure of the data. For that,
we first correctly import the dataset into R:

# Read data
wine <- read.table(file = "wine.csv", header = TRUE, sep = ",")

Now we can conduct a quick exploratory analysis to have in-


sights into the data:

# Numerical -- marginal distributions


summary(wine)
## Year Price WinterRain AGST HarvestRain Age FrancePop

## Min. :1952 Min. :6.205 Min. :376.0 Min. :14.98 Min. : 38.0 Min. : 3.00 Min. :43184
## 1st Qu.:1960 1st Qu.:6.508 1st Qu.:543.5 1st Qu.:16.15 1st Qu.: 88.0 1st Qu.: 9.50 1st Qu.:46856
## Median :1967 Median :6.984 Median :600.0 Median :16.42 Median :123.0 Median :16.00 Median :50650
## Mean :1967 Mean :7.042 Mean :608.4 Mean :16.48 Mean :144.8 Mean :16.19 Mean :50085
## 3rd Qu.:1974 3rd Qu.:7.441 3rd Qu.:705.5 3rd Qu.:17.01 3rd Qu.:185.5 3rd Qu.:22.50 3rd Qu.:53511
## Max. :1980 Max. :8.494 Max. :830.0 Max. :17.65 Max. :292.0 Max. :31.00 Max. :55110

# Graphical -- pairwise relations with linear and "smooth" regressions


car::scatterplotMatrix(wine, col = 1, regLine = list(col = 2),
smooth = list(col.smooth = 4, col.spread = 4))

Figure 2.5: Scatterplot matrix for the wine dataset. The diagonal plots show density estimators of the pdf of each variable. The (i, j)-th scatterplot shows the data of Xi vs Xj, where the red line is the regression line of Xi (response) on Xj (predictor) and the blue curve represents a smoother that estimates nonparametrically the regression function of Xi on Xj. The dashed blue curves are the confidence intervals associated to the nonparametric smoother.

As we can see, Year and FrancePop are very dependent, and Year and Age are perfectly dependent. This is so because Age = 1983 - Year. Therefore, we opt to remove the predictor Year and use it to set the case names, which can be helpful later for identifying outliers:

# Set row names to Year -- useful for outlier identification
row.names(wine) <- wine$Year
wine$Year <- NULL

Remember that the objective is to predict Price. Based on the above matrix scatterplot, the best we can do to predict Price by a simple linear regression seems to be to use AGST or HarvestRain. Let's see which one yields the higher R², which, as we will see in Section 2.7.1, is indicative of the performance of the linear model.

# Price ~ AGST
modAGST <- lm(Price ~ AGST, data = wine)

# Summary of the model


summary(modAGST)
##
## Call:
## lm(formula = Price ~ AGST, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78370 -0.23827 -0.03421 0.29973 0.90198
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.5469 2.3641 -1.500 0.146052
## AGST 0.6426 0.1434 4.483 0.000143 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.4819 on 25 degrees of freedom
## Multiple R-squared: 0.4456, Adjusted R-squared: 0.4234
## F-statistic: 20.09 on 1 and 25 DF, p-value: 0.0001425

# The summary is also an object


sumModAGST <- summary(modAGST)
names(sumModAGST)
## [1] "call" "terms" "residuals" "coefficients" "aliased" "sigma" "df"
## [8] "r.squared" "adj.r.squared" "fstatistic" "cov.unscaled"

# R^2
sumModAGST$r.squared
## [1] 0.4455894

Complete the analysis by computing the linear models


Price ~ FrancePop, Price ~ Age, Price ~ WinterRain,
and Price ~ HarvestRain. Name them as modFrancePop,
modAge, modWinterRain, and modHarvestRain. Obtain their R² and display
them in a table like:

Predictor R2
AGST 0.4456
HarvestRain 0.2572
FrancePop 0.2314
Age 0.2120
WinterRain 0.0181

It seems that none of these simple linear models on their own are
properly explaining Price. Intuitively, it would make sense to bind
them together to achieve a better explanation of Price. Let’s see
how to do that with a more advanced model.

2.2.3 Multiple linear model


The multiple linear model extends the simple linear model by describing the relation between several random variables X1 , . . . , X p (note that now X1 represents the first predictor and not the first element of a sample of X) and Y. Therefore, as before, the multiple linear model is constructed by assuming that the linear relation

Y = β0 + β1 X1 + . . . + βp Xp + ε    (2.4)

holds between the predictors X1 , . . . , X p and the response Y. In


(2.4), β 0 is the intercept and β 1 , . . . , β p are the slopes, respectively.
The random variable ε has mean zero and is independent from
X1 , . . . , X p . Another way of looking at (2.4) is

E [ Y | X1 = x 1 , . . . , X p = x p ] = β 0 + β 1 x 1 + . . . + β p x p , (2.5)

since E[ε| X1 = x1 , . . . , X p = x p ] = 0.
The LHS of (2.5) is the conditional expectation of Y given X1 , . . . ,
X p . It represents how the mean of the random variable Y is chang-
ing, now according to particular values of several predictors. With
the RHS, what we are saying is that the mean of Y is changing in
a linear fashion with respect to the values of X1 , . . . , X p . Hence the
neat interpretation of the coefficients:

• β 0 : is the mean of Y when X1 = . . . = X p = 0.


• β j , 1 ≤ j ≤ p: is the increment in mean of Y for an increment
of one unit in X j = x j , provided that the rest of predictors
X1 , . . . , X j−1 , X j+1 , . . . , X p remain constant.

Figure 2.6 illustrates the geometrical interpretation of a multiple


linear model: a hyperplane in the ( p + 1)-dimensional space. If
p = 1, the hyperplane is actually a line, the regression line for
simple linear regression. If p = 2, then the regression plane can be
visualized in a three-dimensional plot.
The estimation of β 0 , β 1 , . . . , β p is done as in simple linear re-
gression by minimizing the RSS, which now accounts for the sum
of squared distances of the data to the vertical projections on the
hyperplane. Before doing so, we need to introduce some helpful
matrix notation:

• A sample of (X1 , . . . , X p , Y) is denoted as {(X_{i1}, . . . , X_{ip}, Y_i)}_{i=1}^{n}, where Xij is the i-th observation of the j-th predictor X j . We denote by Xi := (Xi1 , . . . , Xip ) the i-th observation of (X1 , . . . , X p ), so the sample simplifies to {(X_i, Y_i)}_{i=1}^{n}.

• The design matrix contains all the information of the predictors plus a column of ones:

\[
\mathbf{X} := \begin{pmatrix}
1 & X_{11} & \cdots & X_{1p} \\
\vdots & \vdots & \ddots & \vdots \\
1 & X_{n1} & \cdots & X_{np}
\end{pmatrix}_{n \times (p+1)}.
\]

• The vector of responses Y, the vector of coefficients β, and the vector of errors ε are, respectively,

\[
\mathbf{Y} := \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}_{n \times 1}, \quad
\boldsymbol{\beta} := \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}_{(p+1) \times 1}, \quad \text{and} \quad
\boldsymbol{\varepsilon} := \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}_{n \times 1}.
\]

Thanks to the matrix notation, we can turn the sample version of the multiple linear model, namely

\[
Y_i = \beta_0 + \beta_1 X_{i1} + \ldots + \beta_p X_{ip} + \varepsilon_i, \quad i = 1, \ldots, n,
\]

into something as compact as

\[
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}.
\]

(Recall that (2.4) and (2.5) were referring to the relation of the random variable, or population, Y with the random variables X1 , . . . , X p . Those are population versions of the linear model, and they clearly generate the sample versions the moment they are replicated for each observation (Xi , Yi ), i = 1, . . . , n, of the random vector (X, Y).)

Figure 2.6: The least squares regression


plane y = β̂ 0 + β̂ 1 x1 + β̂ 2 x2 and its
dependence on the kind of squared
distance considered. Application
available here.

Recall that if p = 1 we recover the simple linear model. In this case:

\[
\mathbf{X} = \begin{pmatrix} 1 & X_{11} \\ \vdots & \vdots \\ 1 & X_{n1} \end{pmatrix}_{n \times 2} \quad \text{and} \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}_{2 \times 1}.
\]

With this notation, the RSS for the multiple linear regression is

\[
\mathrm{RSS}(\boldsymbol{\beta}) := \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \ldots - \beta_p X_{ip})^2 = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}). \quad (2.6)
\]

The RSS aggregates the squared vertical distances from the data to a regression plane given by β. The least squares estimators are the minimizers of the RSS (it can be seen that they are unique and that they always exist, provided that rank(X'X) = p + 1):

\[
\hat{\boldsymbol{\beta}} := \arg\min_{\boldsymbol{\beta} \in \mathbb{R}^{p+1}} \mathrm{RSS}(\boldsymbol{\beta}).
\]

Luckily, thanks to the matrix form of (2.6), it is possible to compute a closed-form expression for the least squares estimates:

\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}. \quad (2.7)
\]

(This follows from \frac{\partial(\mathbf{A}\mathbf{x})}{\partial\mathbf{x}} = \mathbf{A} and \frac{\partial(f(\mathbf{x})'g(\mathbf{x}))}{\partial\mathbf{x}} = f(\mathbf{x})'\frac{\partial g(\mathbf{x})}{\partial\mathbf{x}} + g(\mathbf{x})'\frac{\partial f(\mathbf{x})}{\partial\mathbf{x}} for two vector-valued functions f, g : \mathbb{R}^p \to \mathbb{R}^m, where \frac{\partial(f(\mathbf{x})'g(\mathbf{x}))}{\partial\mathbf{x}} is the gradient row vector of f'g, and \frac{\partial f(\mathbf{x})}{\partial\mathbf{x}} and \frac{\partial g(\mathbf{x})}{\partial\mathbf{x}} are the Jacobian matrices of f and g, respectively.)
There are some similarities between (2.7) and β̂1 = (s_x²)^(-1) s_xy from the simple linear model: both are related to the covariance between X and Y weighted by the variance of X.

Let’s check that indeed the coefficients given by lm are the ones
given by (2.7). For that purpose we consider the leastSquares3D
data frame in the least-squares-3D.RData dataset. Among other
variables, the data frame contains the response yLin and the predic-
tors x1 and x2.
load(file = "least-Squares-3D.RData")

Let's compute the coefficients of the regression of yLin on the predictors x1 and x2, which is denoted by yLin ~ x1 + x2. Note the use of + for including all the predictors. This does not mean that they are all added and then the regression is done on the sum (if you wanted to do so, you would need to use the function I() to indicate that + is not including predictors in the model, but is acting as the algebraic sum operator). Instead, this notation is designed to resemble the mathematical form of the multiple linear model.

# Output from lm
mod <- lm(yLin ~ x1 + x2, data = leastSquares3D)
mod$coefficients
## (Intercept) x1 x2
## -0.5702694 0.4832624 0.3214894

# Matrix X
X <- cbind(1, leastSquares3D$x1, leastSquares3D$x2)

# Vector Y
Y <- leastSquares3D$yLin

# Coefficients
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
# %*% multiplies matrices
# solve() computes the inverse of a matrix
# t() transposes a matrix
beta
## [,1]
## [1,] -0.5702694
## [2,] 0.4832624
## [3,] 0.3214894

Compute β̂ for the regressions yLin ~ x1 + x2, yQua ~


x1 + x2, and yExp ~ x2 + x3 using:

• Equation (2.7) and


• the function lm.

Check that both are the same.



Once we have the least squares estimates β̂, we can define the
next concepts:

• The fitted values Ŷ1 , . . . , Ŷn , where

Ŷi := β̂ 0 + β̂ 1 Xi1 + . . . + β̂ p Xip , i = 1, . . . , n.

They are the vertical projections of Y1 , . . . , Yn onto the fitted


plane (see Figure 2.6). In matrix form, plugging in (2.7),

\[
\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{H}\mathbf{Y},
\]

where H := X(X'X)^(-1)X' is called the hat matrix because it "puts the hat into Y". What it does is to project Y into the regression plane (see Figure 2.6).

• The residuals (or estimated errors) ε̂ 1 , . . . , ε̂ n , where

ε̂ i := Yi − Ŷi , i = 1, . . . , n.

They are the vertical distances between actual data and fitted
data.

These two objects are present in the output of lm:


# Fitted values
mod$fitted.values

# Residuals
mod$residuals
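
As a minimal check of these formulas (a sketch reusing the X, Y, and mod objects created above for the regression yLin ~ x1 + x2):

# Hat matrix built from the design matrix X defined above
H <- X %*% solve(t(X) %*% X) %*% t(X)

# H projects Y onto the regression plane: HY equals the fitted values
max(abs(H %*% Y - mod$fitted.values))  # numerically zero

# And the residuals are the responses minus the fitted values
max(abs((Y - mod$fitted.values) - mod$residuals))  # numerically zero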

We conclude with an important insight on the relation of multiple and simple linear regressions that is illustrated in Figure 2.7. The data used in that figure is:
set.seed(212542)
n <- 100
x1 <- rnorm(n, sd = 2)
x2 <- rnorm(n, mean = x1, sd = 3)
y <- 1 + 2 * x1 - x2 + rnorm(n, sd = 1)
data <- data.frame(x1 = x1, x2 = x2, y = y)

Consider the multiple linear model Y = β0 + β1 X1 + β2 X2 + ε1 and its associated simple linear models Y = α0 + α1 X1 + ε2 and Y = γ0 + γ1 X2 + ε3, where ε1, ε2, ε3 are random errors. Assume that we have a sample {(X_{i1}, X_{i2}, Y_i)}_{i=1}^{n}. Then, in general, α̂0 ≠ β̂0 ≠ γ̂0, α̂1 ≠ β̂1, and γ̂1 ≠ β̂2, even if α0 = β0 = γ0, α1 = β1, and γ1 = β2. That is, in general, the inclusion of a new predictor changes the coefficient estimates of the rest of the predictors.

With the above data, check how the fitted coefficients


change for y ~ x1, y ~ x2, and y ~ x1 + x2.

Figure 2.7: The regression plane (blue)


of Y on X1 and X2 and its relation
with the simple linear regressions
(green lines) of Y on X1 and of Y
on X2 . The red points represent the
sample for ( X1 , X2 , Y ) and the black
points the sample projections for
( X1 , X2 ) (bottom), ( X1 , Y ) (left), and
( X2 , Y ) (right). As it can be seen, the
regression plane does not extend the
simple linear regressions.

2.2.4 Case study application


A natural step now is to extend these simple regressions to increase
both the R2 and the prediction accuracy for Price by means of the
multiple linear regression:
# Regression on all the predictors
modWine1 <- lm(Price ~ Age + AGST + FrancePop + HarvestRain + WinterRain,
data = wine)

# A shortcut
modWine1 <- lm(Price ~ ., data = wine)
modWine1
##
## Call:
## lm(formula = Price ~ ., data = wine)
##
## Coefficients:
## (Intercept) WinterRain AGST HarvestRain Age FrancePop
## -2.343e+00 1.153e-03 6.144e-01 -3.837e-03 1.377e-02 -2.213e-05

# Summary
summary(modWine1)
##
## Call:
## lm(formula = Price ~ ., data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46541 -0.24133 0.00413 0.18974 0.52495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.343e+00 7.697e+00 -0.304 0.76384

## WinterRain 1.153e-03 4.991e-04 2.311 0.03109 *


## AGST 6.144e-01 9.799e-02 6.270 3.22e-06 ***
## HarvestRain -3.837e-03 8.366e-04 -4.587 0.00016 ***
## Age 1.377e-02 5.821e-02 0.237 0.81531
## FrancePop -2.213e-05 1.268e-04 -0.175 0.86313
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.293 on 21 degrees of freedom
## Multiple R-squared: 0.8278, Adjusted R-squared: 0.7868
## F-statistic: 20.19 on 5 and 21 DF, p-value: 2.232e-07

The fitted regression is Price = −2.343 + 0.013 × Age +0.614 ×


AGST −0.000 × FrancePop −0.003 × HarvestRain +0.001 × WinterRain.
Recall that the 'Multiple R-squared' has almost doubled with
respect to the best simple linear regression! This tells us that com-
bining several predictors may lead to important performance gains
in the prediction of the response. However, note that the R2 of the
multiple linear model is not the sum of the R2 ’s of the simple linear
models. The performance gain of combining predictors is hard to
anticipate from the single-predictor models and depends on the
dependence among the predictors.

2.3 Assumptions of the model


A natural¹¹ question to ask is: “Why do we need assumptions?” The answer is that we need probabilistic assumptions to ground statistical inference about the model parameters. Or, in other words, to quantify the variability of the estimator β̂ and to infer properties about the unknown population coefficients β from the sample {(X_i, Y_i)}_{i=1}^n.

11 After all, we already have a neat way of estimating β from the data. Isn't it the end of the story?
The assumptions of the multiple linear model are:

i. Linearity: E[Y | X1 = x1 , . . . , X p = x p ] = β 0 + β 1 x1 + . . . + β p x p .
ii. Homoscedasticity: Var[ε| X1 = x1 , . . . , X p = x p ] = σ2 .
iii. Normality: ε ∼ N (0, σ2 ).
iv. Independence of the errors: ε 1 , . . . , ε n are independent (or
uncorrelated, E[ε i ε j ] = 0, i 6= j, since they are assumed to be
normal).

A good one-line summary of the linear model is the following


(independence is implicit)

Y |( X1 = x1 , . . . , X p = x p ) ∼ N ( β 0 + β 1 x1 + . . . + β p x p , σ2 ). (2.8)

Recall that, except assumption iv, the rest are expressed in terms
of the random variables, not in terms of the sample. Thus they are
population versions, rather than sample versions. It is however
trivial to express (2.8) in terms of assumptions about the sample
{(Xi , Yi )}in=1 :

Yi |( Xi1 = xi1 , . . . , Xip = xip ) ∼ N ( β 0 + β 1 xi1 + . . . + β p xip , σ2 ),


(2.9)

with Y1 , . . . , Yn being independent conditionally on the sample


of predictors. Equivalently stated in a compact matrix way, the
assumptions of the model on the sample are:

Y|X ∼ Nn (Xβ, σ2 I). (2.10)
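
The compact statement (2.10) also suggests a direct way of generating data that fulfills assumptions i–iv. A minimal simulation sketch, where the choices of n, p, β, and σ are arbitrary:

# Simulate a sample satisfying i-iv (n, p, beta, and sigma are arbitrary)
set.seed(123)
n <- 100; p <- 2; beta <- c(0.5, 1, -1); sigma <- 0.5
X <- cbind(1, matrix(rnorm(n * p), nrow = n, ncol = p))  # Design matrix
eps <- rnorm(n, mean = 0, sd = sigma)  # Normal, homoscedastic, independent errors
Y <- drop(X %*% beta) + eps            # Linearity of the conditional mean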

Figures 2.10 and 2.11 represent situations where the assumptions


of the model for p = 1 are respected and violated, respectively.

Figure 2.8: The key concepts of the


simple linear model. The red points
represent a sample with population
regression line y = β 0 + β 1 x given
by the black line. The yellow band
denotes where 95% of the data is,
according to the model. The blue
densities represent the conditional
density of Y given X = x, whose
means lie in the regression line.

Figure 2.12 represents situations where the assumptions of the


model are respected and violated, for the situation with two pre-
dictors. Clearly, the inspection of the scatterplots for identifying
strange patterns is more complicated than in simple linear regres-
sion – and here we are dealing only with two predictors.

The dataset assumptions.RData contains the variables x1,


. . . , x9 and y1, . . . , y9. For each regression y1 ~ x1, . . . ,
y9 ~ x9:

• Check whether the assumptions of the linear model


are being satisfied (make a scatterplot with a regres-
sion line).
• State which assumption(s) are violated and justify
your answer.

Figure 2.9: The key concepts of the


multiple linear model when p = 2.
The red points represent a sample
with population regression plane
y = β 0 + β 1 x1 + β 2 x2 given by the
blue plane. The black points represent
the associated observations of the
predictors. The space between the
yellow planes denotes where 95% of
the data is, according to the model.

Figure 2.10: Perfectly valid simple


linear models (all the assumptions are
verified).

Figure 2.11: Problematic simple linear


models (a single assumption does not
hold).

Figure 2.12: Valid (all the assumptions


are verified) and problematic (a single
assumption does not hold) multiple
linear models, when there are two
predictors. Application available here.

2.4 Inference for model parameters

The assumptions introduced in the previous section allow us to specify the distribution of the random vector β̂. The distribution is derived conditionally on the sample predictors X_1, . . . , X_n. In other words, we assume that the randomness of Y = Xβ + ε comes only from the error terms and not from the predictors¹². To denote this, we employ lowercase for the sample predictors x_1, . . . , x_n.

12 This is for theoretical and modelling convenience. With this assumption, we just model the randomness of Y given the predictors. If the randomness of Y and the randomness of X_1, . . . , X_n were to be modeled, we would require a significantly more complex model.

2.4.1 Distributions of the fitted coefficients

The distribution of β̂ is:

β̂ ∼ N_{p+1}(β, σ²(X′X)⁻¹).    (2.11)

This result can be obtained from the form of β̂ given in (2.7), the
sample version of the model assumptions given in (2.10), and the
linear transformation property of a normal given in (1.4). Equation
(2.11) implies that the marginal distribution of β̂ j is
 
β̂_j ∼ N(β_j, SE(β̂_j)²),    (2.12)

where SE(β̂_j) is the standard error, SE(β̂_j)² := σ²v_j, and v_j is the j-th element of the diagonal of (X′X)⁻¹.

Recall that an equivalent form for (2.12) is (why?)

(β̂_j − β_j) / SE(β̂_j) ∼ N(0, 1).

The interpretation of (2.12) is simpler in the case with p = 1, where


   
β̂_0 ∼ N(β_0, SE(β̂_0)²),   β̂_1 ∼ N(β_1, SE(β̂_1)²),    (2.13)

with

SE(β̂_0)² = σ²/n (1 + X̄²/s²_x),   SE(β̂_1)² = σ²/(n s²_x).    (2.14)

Some insights on (2.13) and (2.14), illustrated interactively in Figure


2.13, are the following:

• Bias. Both estimates are unbiased. That means that their expecta-
tions are the true coefficients for any sample size n.

• Variance. The variances SE( β̂ 0 )2 and SE( β̂ 1 )2 have interesting


interpretations in terms of their components:

– Sample size n. As the sample size grows, the precision of the


estimators increases, since both variances decrease.
– Error variance σ2 . The more disperse the error is, the less pre-
cise the estimates are, since more vertical variability is present.

– Predictor variance s2x . If the predictor is spread out (large s2x ),


then it is easier to fit a regression line: we have information
about the data trend over a long interval. If s2x is small, then
all the data is concentrated on a narrow vertical band, so we
have a much more limited view of the trend.
– Mean X̄. It has influence only on the precision of β̂ 0 . The
larger X̄ is, the less precise β̂ 0 is.

Figure 2.13: Illustration of the random-


ness of the fitted coefficients ( β̂ 0 , β̂ 1 )
and the influence of n, σ2 , and s2x .
The sample predictors x1 , . . . , xn are
fixed and new responses Y1 , . . . , Yn
are generated each time from a linear
model Y = β 0 + β 1 X + ε. Application
available here.
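
These properties can be checked with a quick Monte Carlo experiment in the spirit of Figure 2.13. A minimal sketch, where n, β_0, β_1, σ, and the fixed predictor sample are arbitrary choices made for illustration:

# Monte Carlo illustration of the randomness of the fitted coefficients
set.seed(42)
n <- 50; beta0 <- -0.5; beta1 <- 1.5; sigma <- 1
x <- rnorm(n, sd = 2)  # Sample predictors, kept fixed
betaHat1 <- replicate(1e4, {
  Y <- beta0 + beta1 * x + rnorm(n, sd = sigma)  # New responses each time
  coef(lm(Y ~ x))[2]
})
mean(betaHat1)  # Unbiasedness: close to beta1
var(betaHat1)   # Close to the theoretical variance SE(betaHat1)^2
sigma^2 / (n * mean((x - mean(x))^2))  # sigma^2 / (n * s_x^2)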

The insights about (2.11) are more convoluted and the following
broad remarks, extensions of what happened when p = 1, apply:

• Bias. All the estimates are unbiased for any sample size n.

• Variance. It depends on:

– Sample size n. Hidden inside X0 X. As n grows, the precision of


the estimators increases.
– Error variance σ². The larger σ² is, the less precise β̂ is.
– Predictor sparsity (X′X)⁻¹. The more “sparse”¹³ the predictor is, the more precise β̂ is.

13 Understood as small |(X′X)⁻¹|.

The problem with the result in (2.11) is that σ2 is unknown in


practice. Therefore, we need to estimate σ2 in order to use a result
similar to (2.11). We do so by computing a rescaled sample variance
of the residuals ε̂ 1 , . . . , ε̂ n :
σ̂² := 1/(n − p − 1) ∑_{i=1}^n ε̂_i².    (2.15)

Note the n − p − 1 in the denominator. The factor n − p − 1 represents the degrees of freedom: the number of data points minus the number of already¹⁴ fitted parameters (p slopes plus 1 intercept) with the data. For the interpretation of σ̂², it is key to realize that the mean of the residuals ε̂_1, . . . , ε̂_n is zero. Therefore, σ̂² is indeed a rescaled sample variance of the residuals which estimates the variance of ε¹⁵. It can be seen that σ̂² is unbiased as an estimator of σ².

14 Prior to undertaking the estimation of σ we have used the sample to estimate β̂. The situation is thus analogous to the discussion between the sample variance s²_x = (1/n)∑_{i=1}^n (X_i − X̄)² and the sample quasi-variance ŝ²_x = 1/(n − 1)∑_{i=1}^n (X_i − X̄)² that are computed from a sample X_1, . . . , X_n. When estimating Var[X], both estimate previously E[X] through X̄. The fact that ŝ²_x accounts for that prior estimation through the degrees of freedom n − 1 makes that estimator unbiased for Var[X] (s²_x is not).
15 Recall that the sample variance of ε̂_1, . . . , ε̂_n is (1/n)∑_{i=1}^n (ε̂_i − mean(ε̂))².
16 In the sense of practically realistic.

If we use the estimate σ̂² instead of σ², we get more useful¹⁶ distributions than (2.12):

(β̂_j − β_j) / ŜE(β̂_j) ∼ t_{n−p−1},   ŜE(β̂_j)² := σ̂²v_j,    (2.16)

where t_{n−p−1} represents the Student's t distribution with n − p − 1 degrees of freedom.
The LHS of (2.16) is the t-statistic for β j , j = 0, . . . , p. We will
employ them for building confidence intervals and hypothesis tests
in what follows.
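
The standard errors and t-statistics in (2.16) can be computed by hand from the design matrix, which helps to demystify the summary output used later. A minimal sketch for the mod object fitted above on the leastSquares3D data (sigma2Hat and seHat are names introduced here for illustration):

# Estimate of sigma^2 from (2.15) and standard errors from (2.16)
X <- cbind(1, leastSquares3D$x1, leastSquares3D$x2)
sigma2Hat <- sum(mod$residuals^2) / (nrow(X) - ncol(X))  # n - p - 1 in the denominator
seHat <- sqrt(sigma2Hat * diag(solve(t(X) %*% X)))       # sqrt(sigma2Hat * v_j)

# t-statistics; they match the "t value" column of summary(mod)
mod$coefficients / seHat
summary(mod)$coefficients[, "t value"]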

2.4.2 Confidence intervals for the coefficients


Thanks to (2.16), we can have the 100(1 − α)% Confidence Intervals
(CI) for the coefficient β j , j = 0, . . . , p:
 
β̂_j ± ŜE(β̂_j) t_{n−p−1;α/2},    (2.17)

where tn− p−1;α/2 is the α/2-upper quantile of the tn− p−1 . Usually,
α = 0.10, 0.05, 0.01 are considered.
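
It is instructive to compute (2.17) by hand and compare it with confint, which implements this interval. A minimal sketch for the mod object above, with α = 0.05:

# 95% CIs for the coefficients of mod, from (2.17)
alpha <- 0.05
est <- summary(mod)$coefficients[, "Estimate"]
se <- summary(mod)$coefficients[, "Std. Error"]
tQuant <- qt(1 - alpha / 2, df = mod$df.residual)  # Upper alpha/2 quantile
cbind(est - tQuant * se, est + tQuant * se)
confint(mod)  # Same intervals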

Figure 2.14: Illustration of the random-


ness of the CI for β 0 at 100(1 − α)%
confidence. The plot shows 100 ran-
dom CIs for β 0 , computed from 100
random datasets generated by the
same simple linear model, with in-
tercept β 0 . The illustration for β 1 is
completely analogous. Application
available here.

This random CI contains the unknown coefficient β j “with a probabil-


ity of 1 − α”. The previous quoted statement has to be understood

as follows. Suppose you have 100 samples generated according to


a linear model. If you compute the CI for a coefficient, then in ap-
proximately 100(1 − α) of the samples the true coefficient would be
actually inside the random CI. Note also that the CI is symmetric
around β̂ j . This is illustrated in Figure 2.14.

2.4.3 Testing on the coefficients

The distributions in (2.16) allow also to conduct a formal hypothesis


test on the coefficients β_j, j = 0, . . . , p. For example, the test for significance¹⁷ is especially important, that is, the test of the hypotheses

H0 : β_j = 0

for j = 0, . . . , p. The test of H0 : β_j = 0 with 1 ≤ j ≤ p is especially interesting, since it allows us to answer whether the variable X_j has a significant linear effect on Y. The statistic used for testing for significance is the t-statistic

(β̂_j − 0) / ŜE(β̂_j),

which is distributed as a t_{n−p−1} under the (veracity of) the null hypothesis¹⁸.

17 Shortcut for significantly different from zero.
18 This is denoted as (β̂_j − 0) / ŜE(β̂_j) ∼ t_{n−p−1} under H0.
The null hypothesis H0 is tested against the alternative hypothesis, H1. If H0 is rejected, it is rejected in favor of H1. The alternative hypothesis can be bilateral (we will focus mostly on these alternatives), such as

H0 : β_j = 0 vs H1 : β_j ≠ 0

or unilateral, such as

H0 : β_j = 0 vs H1 : β_j < (>) 0.

The test based on the t-statistic is referred to as the t-test. It rejects H0 : β_j = 0 (against H1 : β_j ≠ 0) at significance level α for large absolute values of the t-statistic, precisely for those above the α/2-upper quantile of the t_{n−p−1} distribution. That is, it rejects H0 at level α if |β̂_j| / ŜE(β̂_j) > t_{n−p−1;α/2}. For the unilateral tests, it rejects H0 against H1 : β_j < 0 or H1 : β_j > 0 if β̂_j / ŜE(β̂_j) < −t_{n−p−1;α} or β̂_j / ŜE(β̂_j) > t_{n−p−1;α}, respectively.
Remember the following insights about hypothesis testing.

The analogy of conducting an hypothesis test and a trial


can be seen in Appendix A.1.

In an hypothesis test, the p-value measures the degree of


veracity of H0 according to the data. The rule of thumb is
the following:

Is the p-value lower than α?

• Yes → reject H0 .
• No → do not reject H0 .

The connection of a t-test for H0 : β j = 0 and the CI for β j , both


at level α, is the following:

Is 0 inside the CI for β j ?

• Yes ↔ do not reject H0 .


• No ↔ reject H0 .

The unilateral test H0 : β j = 0 vs H1 : β j < 0 (respectively,


H1 : β j > 0) can be done by means of the CI for β j . If H0 is rejected,
they allow us to conclude that β̂ j is significantly negative (positive)
and that for the considered regression model, X j has a significant negative
(positive) effect on Y. The rule of thumb is the following:

Is the CI for β j below (above) 0 at level α?

• Yes → reject H0 at level α. Conclude X j has a signifi-


cant negative (positive) effect on Y at level α.
• No → the criterion is not conclusive.

2.4.4 Case study application


Let’s analyse the multiple linear model we have considered for
the wine dataset, now that we know how to make inference on the
model parameters. The relevant information is obtained with the
summary of the model:

# Fit
modWine1 <- lm(Price ~ ., data = wine)

# Summary
sumModWine1 <- summary(modWine1)
sumModWine1
##
## Call:
## lm(formula = Price ~ ., data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46541 -0.24133 0.00413 0.18974 0.52495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.343e+00 7.697e+00 -0.304 0.76384
## WinterRain 1.153e-03 4.991e-04 2.311 0.03109 *

## AGST 6.144e-01 9.799e-02 6.270 3.22e-06 ***


## HarvestRain -3.837e-03 8.366e-04 -4.587 0.00016 ***
## Age 1.377e-02 5.821e-02 0.237 0.81531
## FrancePop -2.213e-05 1.268e-04 -0.175 0.86313
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.293 on 21 degrees of freedom
## Multiple R-squared: 0.8278, Adjusted R-squared: 0.7868
## F-statistic: 20.19 on 5 and 21 DF, p-value: 2.232e-07

# Contains the estimation of sigma ("Residual standard error")


sumModWine1$sigma
## [1] 0.2930287

# Which is the same as


sqrt(sum(modWine1$residuals^2) / modWine1$df.residual)
## [1] 0.2930287

The Coefficients block of the summary output contains the next


elements regarding the significance of each coefficient β j , this is, the
test H0 : β j = 0 vs H1 : β j 6= 0:

• Estimate: least squares estimate β̂ j .


• Std. Error: estimated standard error ŜE(β̂_j).
• t value: t-statistic β̂_j / ŜE(β̂_j).
• Pr(>|t|): p-value of the t-test.
• Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1: codes indicating the size of the p-value. The more asterisks, the more evidence supporting that H0 does not hold¹⁹.

19 For example, '**' indicates that the p-value lies within 0.001 and 0.01.

Note that many predictors are not significant in modWine1:


FrancePop, Age, and the intercept are not significant. This is an
indication of an excess of predictors adding little information to
the response. One explanation is the almost perfect correlation be-
tween FrancePop and Age shown before: one of them is not adding
any extra information to explain Price. This complicates the model
unnecessarily and, more importantly, it has the undesirable effect
of making the coefficient estimates less precise. We opt to remove
the predictor FrancePop from the model since it is exogenous to the wine context²⁰. A data-driven justification of the removal of this variable is that it is the least significant in modWine1.

20 This is a context-guided decision, not data-driven.
21 Notice the use of - for excluding a particular predictor.

Then, the model without FrancePop²¹ is:

modWine2 <- lm(Price ~ . - FrancePop, data = wine)


summary(modWine2)
##
## Call:
## lm(formula = Price ~ . - FrancePop, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46024 -0.23862 0.01347 0.18601 0.53443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.6515703 1.6880876 -2.163 0.04167 *
## WinterRain 0.0011667 0.0004820 2.420 0.02421 *
## AGST 0.6163916 0.0951747 6.476 1.63e-06 ***

## HarvestRain -0.0038606 0.0008075 -4.781 8.97e-05 ***


## Age 0.0238480 0.0071667 3.328 0.00305 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.7962
## F-statistic: 26.39 on 4 and 22 DF, p-value: 4.057e-08

All the coefficients are significant at level α = 0.05. Therefore,


there is no clear redundant information. In addition, the R2 is very
similar to the full model, but the 'Adjusted R-squared', a weight-
ing of the R2 to account for the number of predictors used by the
model, is slightly larger. As we will see in Section 2.7.2, this means
that, compared to the number of predictors used, modWine2 explains
more variability of Price than modWine1.
A handy way of comparing the coefficients of both models is
car::compareCoefs:

car::compareCoefs(modWine1, modWine2)
## Calls:
## 1: lm(formula = Price ~ ., data = wine)
## 2: lm(formula = Price ~ . - FrancePop, data = wine)
##
## Model 1 Model 2
## (Intercept) -2.34 -3.65
## SE 7.70 1.69
##
## WinterRain 0.001153 0.001167
## SE 0.000499 0.000482
##
## AGST 0.6144 0.6164
## SE 0.0980 0.0952
##
## HarvestRain -0.003837 -0.003861
## SE 0.000837 0.000808
##
## Age 0.01377 0.02385
## SE 0.05821 0.00717
##
## FrancePop -2.21e-05
## SE 1.27e-04
##

Note how the coefficients for modWine2 have smaller errors than
modWine1.
The individual CIs for the unknown β j ’s can be obtained by
applying the confint function to an lm object. Let’s compute the
CIs for the model coefficients of modWine1, modWine2, and a new
model modWine3:
# Fit a new model
modWine3 <- lm(Price ~ Age + WinterRain, data = wine)
summary(modWine3)
##
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88964 -0.51421 -0.00066 0.43103 1.06897
##
## Coefficients:

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 5.9830427 0.5993667 9.982 5.09e-10 ***
## Age 0.0360559 0.0137377 2.625 0.0149 *
## WinterRain 0.0007813 0.0008780 0.890 0.3824
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared: 0.2371, Adjusted R-squared: 0.1736
## F-statistic: 3.73 on 2 and 24 DF, p-value: 0.03884

# Confidence intervals at 95%


# CI: (lwr, upr)
confint(modWine3)
## 2.5 % 97.5 %
## (Intercept) 4.746010626 7.220074676
## Age 0.007702664 0.064409106
## WinterRain -0.001030725 0.002593278

# Confidence intervals at other levels


confint(modWine3, level = 0.90)
## 5 % 95 %
## (Intercept) 4.9575969417 7.008488360
## Age 0.0125522989 0.059559471
## WinterRain -0.0007207941 0.002283347
confint(modWine3, level = 0.99)
## 0.5 % 99.5 %
## (Intercept) 4.306650310 7.659434991
## Age -0.002367633 0.074479403
## WinterRain -0.001674299 0.003236852

# Compare with previous models


confint(modWine1)
## 2.5 % 97.5 %
## (Intercept) -1.834844e+01 13.6632391095
## WinterRain 1.153872e-04 0.0021910509
## AGST 4.106337e-01 0.8182146540
## HarvestRain -5.577203e-03 -0.0020974232
## Age -1.072931e-01 0.1348317795
## FrancePop -2.858849e-04 0.0002416171
confint(modWine2)
## 2.5 % 97.5 %
## (Intercept) -7.1524497573 -0.150690903
## WinterRain 0.0001670449 0.002166393
## AGST 0.4190113907 0.813771726
## HarvestRain -0.0055353098 -0.002185890
## Age 0.0089852800 0.038710748
confint(modWine3)
## 2.5 % 97.5 %
## (Intercept) 4.746010626 7.220074676
## Age 0.007702664 0.064409106
## WinterRain -0.001030725 0.002593278

In modWine3, the 95% CI for β 0 is (4.7460, 7.2201), for β 1 is


(0.0077, 0.0644), and for β 2 is (−0.0010, 0.0026). Therefore, we
can say with a 95% confidence that the coefficient of WinterRain is
non-significant (0 is inside the CI). But, inspecting the CI of β 2 in
modWine2 we can see that it is significant for the model! How is this
possible? The answer is that the presence of extra predictors af-
fects the coefficient estimate, as we saw in Figure 2.7. Therefore, the
precise statement to make is:
In model Price ~ Age + WinterRain, with α = 0.05, the coefficient
of WinterRain is non-significant.

Note that this does not mean that the coefficient will be al-
ways non-significant: in Price ~ Age + AGST + HarvestRain +
WinterRain it is.

Compute and interpret the CIs for the coefficients, at


levels α = 0.10, 0.05, 0.01, for the following regressions:

• Price ~ WinterRain + HarvestRain + AGST (wine).


• AGST ~ Year + FrancePop (wine).

For the assumptions dataset, do the following:

• Regression y7 ~ x7. Check that:


– The intercept is not significant for the regression at
any reasonable level α.
– The slope is significant for any α ≥ 10−7 .
• Regression y6 ~ x6. Assume the linear model as-
sumptions are verified.
– Check that β̂ 0 is significantly different from zero at
any level α.
– For which α = 0.10, 0.05, 0.01 is β̂ 1 significantly dif-
ferent from zero?

In certain applications, it is useful to center the predictors


X1 , . . . , X p prior to fit the model, in such a way that the
slope coefficients ( β 1 , . . . , β p ) measure the effects of devi-
ations of the predictors from their means. Theoretically,
this amounts to considering the linear model

Y = β 0 + β 1 ( X1 − E[ X1 ]) + . . . + β p ( X p − E[ X p ]) + ε.

In the sample case, we proceed by replacing Xij by


Xij − X̄ j , which can be easily done by the scale function
(see below). If, in addition, the response is also centred,
then β 0 = 0 and β̂ 0 = 0. This centering of the data has
no influence on the significance of the predictors (but has
influence on the significance of β̂ 0 ), as it is just a linear
transformation of them.

# By default, scale centers (subtracts the mean) and scales (divides by the
# standard deviation) the columns of a matrix
wineCen <- data.frame(scale(wine, center = TRUE, scale = FALSE))

# Regression with centred response and predictors


modWine3Cen <- lm(Price ~ Age + WinterRain, data = wineCen)

# Summary
summary(modWine3Cen)
##
## Call:

## lm(formula = Price ~ Age + WinterRain, data = wineCen)


##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88964 -0.51421 -0.00066 0.43103 1.06897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.284e-16 1.110e-01 0.000 1.0000
## Age 3.606e-02 1.374e-02 2.625 0.0149 *
## WinterRain 7.813e-04 8.780e-04 0.890 0.3824
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared: 0.2371, Adjusted R-squared: 0.1736
## F-statistic: 3.73 on 2 and 24 DF, p-value: 0.03884

2.5 Prediction

The forecast of Y from X = x (this is, X1 = x1 , . . . , X p = x p ) is


approached in two different ways:

1. Through the estimation of the conditional mean of Y given


X = x, E[Y |X = x]. This is a deterministic quantity, which equals
β 0 + β 1 x1 + . . . + β p x p .
2. Through the prediction of the conditional response Y |X = x.
This is a random variable distributed as N ( β 0 + β 1 x1 + . . . +
β p x p , σ 2 ).

There are similarities and differences in the prediction of the


conditional mean E[Y |X = x] and conditional response Y |X = x,
which we highlight next:
• Similarities. The estimate is the same²², ŷ = β̂_0 + β̂_1x_1 + . . . + β̂_px_p. Both CIs are centred at ŷ.

22 Because the prediction of a new observation from the random variable N(β_0 + β_1x_1 + . . . + β_px_p, σ²) is simply its mean, β_0 + β_1x_1 + . . . + β_px_p, which is also the most likely value in a normal.

• Differences. E[Y|X = x] is deterministic and Y|X = x is a random variable. The prediction of the latter is more noisy, because it has
to take into account the randomness of Y. Therefore, the variance
is larger for the prediction of Y |X = x than for the prediction
of E[Y |X = x]. This has a direct consequence on the length of
the prediction intervals, which are longer for Y |X = x than for
E [Y | X = x ] .

The inspection of the CIs for the conditional mean and condi-
tional response in the simple linear model offers great insight into
the previous similarities and differences, and also on what compo-
nents affect precisely the quality of the prediction:

• The 100(1 − α)% CI for the conditional mean β_0 + β_1x is

ŷ ± t_{n−2;α/2} √( (σ̂²/n) (1 + (x − x̄)²/s²_x) ).    (2.18)

• The 100(1 − α)% CI for the conditional response Y|X = x is

ŷ ± t_{n−2;α/2} √( σ̂² + (σ̂²/n) (1 + (x − x̄)²/s²_x) ).    (2.19)

Figure 2.15: Illustration of the CIs for


the conditional mean and response.
Note how the width of the CIs is
influenced by x, especially for the
conditional mean (the conditional
response has a constant term affecting
the width). Application available here.

Notice the dependence of both CIs on x, n, and σ̂2 , each of them


with a clear effect on the resulting length of the interval. Note also
the high similarity between (2.18) and (2.19) (both intervals are
centred at ŷ and have a similar variance) and its revealing unique
difference: the extra σ̂² in (2.19)²³, consequence of the “extra randomness” of the conditional response with respect to the conditional mean. Figure 2.15 helps to visualize these concepts and the difference between CIs interactively.

23 A consequence of this extra σ̂² is that the length of (2.19) can not be reduced arbitrarily if the sample size n grows.
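
The two intervals can also be checked numerically against predict (employed in the next section) for a simple linear fit. A minimal sketch with simulated data, where the sample, the coefficients, and the prediction point x0 are arbitrary choices made for illustration:

# Check (2.18) against predict() in a simple linear model
set.seed(123456)
n <- 60
xS <- rnorm(n)
yS <- 1 + 2 * xS + rnorm(n)
modS <- lm(yS ~ xS)
x0 <- 0.5
y0 <- unname(coef(modS)[1] + coef(modS)[2] * x0)
s2 <- sum(modS$residuals^2) / (n - 2)  # sigma^2 estimate
half <- qt(0.975, df = n - 2) *
  sqrt(s2 / n * (1 + (x0 - mean(xS))^2 / mean((xS - mean(xS))^2)))
c(y0 - half, y0 + half)  # (2.18) by hand, at 95% confidence
predict(modS, newdata = data.frame(xS = x0), interval = "confidence")
# For (2.19), add s2 inside the square root and use interval = "prediction"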

2.5.1 Case study application


The prediction and the computation of prediction CIs can be done
with predict. The objects required for predict are: first, an lm ob-
ject; second, a data.frame containing the locations x = ( x1 , . . . , x p )
where we want to predict β 0 + β 1 x1 + . . . + β p x p . The prediction
is β̂ 0 + β̂ 1 x1 + . . . + β̂ p x p and the CIs returned are either (2.18) or
(2.19).

It is mandatory to name the columns of the data.frame


with the same names of the predictors used in lm. Other-
wise predict will generate an error.

# Fit a linear model for the price on WinterRain, HarvestRain, and AGST
modWine4 <- lm(Price ~ WinterRain + HarvestRain + AGST, data = wine)
summary(modWine4)

##
## Call:
## lm(formula = Price ~ WinterRain + HarvestRain + AGST, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62816 -0.17923 0.02274 0.21990 0.62859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.9506001 1.9694011 -2.514 0.01940 *
## WinterRain 0.0012820 0.0005765 2.224 0.03628 *
## HarvestRain -0.0036242 0.0009646 -3.757 0.00103 **
## AGST 0.7123192 0.1087676 6.549 1.11e-06 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.3436 on 23 degrees of freedom
## Multiple R-squared: 0.7407, Adjusted R-squared: 0.7069
## F-statistic: 21.9 on 3 and 23 DF, p-value: 6.246e-07

# Data for which we want a prediction


# Important! You have to name the column with the predictor name!
weather <- data.frame(WinterRain = 500, HarvestRain = 123, AGST = 18)
weatherBad <- data.frame(500, 123, 18)

# Prediction of the mean

# Prediction of the mean at 95% -- the defaults


predict(modWine4, newdata = weather)
## 1
## 8.066342
predict(modWine4, newdata = weatherBad) # Error
## Error in eval(predvars, data, env): object ’WinterRain’ not found

# Prediction of the mean with 95% confidence interval (the default)


# CI: (lwr, upr)
predict(modWine4, newdata = weather, interval = "confidence")
## fit lwr upr
## 1 8.066342 7.714178 8.418507
predict(modWine4, newdata = weather, interval = "confidence", level = 0.95)
## fit lwr upr
## 1 8.066342 7.714178 8.418507

# Other levels
predict(modWine4, newdata = weather, interval = "confidence", level = 0.90)
## fit lwr upr
## 1 8.066342 7.774576 8.358108
predict(modWine4, newdata = weather, interval = "confidence", level = 0.99)
## fit lwr upr
## 1 8.066342 7.588427 8.544258

# Prediction of the response

# Prediction of the mean at 95% -- the defaults


predict(modWine4, newdata = weather)
## 1
## 8.066342

# Prediction of the response with 95% confidence interval


# CI: (lwr, upr)
predict(modWine4, newdata = weather, interval = "prediction")
## fit lwr upr
## 1 8.066342 7.273176 8.859508
predict(modWine4, newdata = weather, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 8.066342 7.273176 8.859508

# Other levels
predict(modWine4, newdata = weather, interval = "prediction", level = 0.90)
## fit lwr upr
## 1 8.066342 7.409208 8.723476
predict(modWine4, newdata = weather, interval = "prediction", level = 0.99)
## fit lwr upr
## 1 8.066342 6.989951 9.142733

# Predictions for several values


weather2 <- data.frame(WinterRain = c(500, 200), HarvestRain = c(123, 200),
AGST = c(17, 18))
predict(modWine4, newdata = weather2, interval = "prediction")
## fit lwr upr
## 1 7.354023 6.613835 8.094211
## 2 7.402691 6.533945 8.271437

For the wine dataset, do the following:

• Regress WinterRain on HarvestRain and AGST. Name


the fitted model modExercise.
• Compute the estimate for the conditional mean of
WinterRain for HarvestRain = 123.0 and AGST = 16.15.
What is the CI at α = 0.01?
• Compute the estimate for the conditional response for
HarvestRain = 125.0 and AGST = 15. What is the CI at
α = 0.10?
• Check that modExercise$fitted.values is the
same as predict(modExercise, newdata =
data.frame(HarvestRain = wine$HarvestRain, AGST
= wine$AGST)). Why is this so?

2.6 ANOVA

The variance of the error, σ2 , plays a fundamental role in the infer-


ence for the model coefficients and in prediction. In this section we
will see how the variance of Y is decomposed into two parts, each
corresponding to the regression and to the error, respectively. This
decomposition is called the ANalysis Of VAriance (ANOVA).
An important fact to highlight prior to introducing the ANOVA decomposition is that Ȳ = Ŷ̄, that is, the sample mean of the fitted values equals the sample mean of the response²⁴. The ANOVA decomposition considers the following measures of variation related with the response:

24 This is an important result that can be checked if we use the matrix notation in Section 2.2.3.

• SST := ∑_{i=1}^n (Y_i − Ȳ)², the Total Sum of Squares. This is the total variation of Y_1, . . . , Y_n, since SST = n s²_y, where s²_y is the sample variance of Y_1, . . . , Y_n.

• SSR := ∑_{i=1}^n (Ŷ_i − Ȳ)², the Regression Sum of Squares. This is the variation explained by the regression plane, that is, the variation from Ȳ that is explained by the estimated conditional mean Ŷ_i = β̂_0 + β̂_1X_{i1} + . . . + β̂_pX_{ip}. SSR = n s²_ŷ, where s²_ŷ is the sample variance of Ŷ_1, . . . , Ŷ_n.

• SSE := ∑_{i=1}^n (Y_i − Ŷ_i)², the Sum of Squared Errors²⁵. It is the variation around the conditional mean. Recall that SSE = ∑_{i=1}^n ε̂_i² = (n − p − 1)σ̂², where σ̂² is the rescaled sample variance of ε̂_1, . . . , ε̂_n given in (2.15).

25 Recall that SSE and RSS (of the least squares estimator β̂) are two names for the same quantity (that appears in different contexts): SSE = ∑_{i=1}^n (Y_i − Ŷ_i)² = ∑_{i=1}^n (Y_i − β̂_0 − β̂_1X_{i1} − . . . − β̂_pX_{ip})² = RSS(β̂).

The ANOVA decomposition states that:

SST = SSR + SSE,    (2.20)

where SST collects the variation of the Y_i's, SSR the variation of the Ŷ_i's, and SSE the variation of the ε̂_i's. Equivalently (dividing by n in (2.20)),

s²_y = s²_ŷ + (n − p − 1)/n × σ̂²,

that is, the variance of the Y_i's decomposes into the variance of the Ŷ_i's plus the (rescaled) variance of the ε̂_i's.

The graphical interpretation of (2.20) when p = 1 is shown in


Figure 2.16. Figure 2.17 dynamically shows how the ANOVA
decomposition places more weight on SSR or SSE according to σ̂2
(which is obviously driven by the value of σ2 ).
The ANOVA table summarizes the decomposition of the variance:

              Degrees of freedom   Sum of Squares   Mean Squares      F-value                   p-value
Predictors    p                    SSR              SSR/p             (SSR/p)/(SSE/(n−p−1))     p-value
Residuals     n − p − 1            SSE              SSE/(n − p − 1)

The F-value of the ANOVA table represents the value of the F-statistic (SSR/p) / (SSE/(n − p − 1)). This statistic is employed to test

H0 : β_1 = . . . = β_p = 0 vs. H1 : β_j ≠ 0 for any j ≥ 1,

that is, the hypothesis of no linear dependence of Y on X_1, . . . , X_p²⁶. This is the so-called F-test and, if H0 is rejected, allows to conclude that at least one β_j is significantly different from zero²⁷. It happens that

F = (SSR/p) / (SSE/(n − p − 1)) ∼ F_{p,n−p−1} under H0,

where F_{p,n−p−1} represents the Snedecor's F distribution with p and n − p − 1 degrees of freedom. If H0 is true, then F is expected to be small since SSR will be close to zero²⁸. The F-test rejects at significance level α for large values of the F-statistic, precisely for those above the α-upper quantile of the F_{p,n−p−1} distribution, denoted by F_{p,n−p−1;α}. That is, H0 is rejected if F > F_{p,n−p−1;α}.

26 Geometrically: the plane is completely flat, it does not have any inclination in the Y direction.
27 And therefore, there is a statistically meaningful (i.e., not constant) linear trend to model.
28 Little variation is explained by the regression model since β̂ ≈ 0.

The “ANOVA table” is a broad concept in statistics, with


different variants. Here we are only covering the ba-
sic ANOVA table from the relation SST = SSR + SSE.
However, further sophistications are possible when SSR
is decomposed into the variations contributed by each
predictor. In particular, for multiple linear regression R’s
anova implements a sequential (type I) ANOVA table, which
is not the previous table!

Figure 2.16: Visualization of the


ANOVA decomposition. SST measures
the variation of Y1 , . . . , Yn with respect
to Ȳ. SSR measures the variation of
Ŷ1 , . . . , Ŷn with respect to Ŷ¯ = Ȳ.
SSE collects the variation between
Y1 , . . . , Yn and Ŷ1 , . . . , Ŷn , that is, the
variation of the residuals.

Figure 2.17: Illustration of the ANOVA


decomposition and its dependence
on σ2 and σ̂2 . Larger (respectively,
smaller) σ̂2 results in more weight
placed on the SSE (SSR) term. Applica-
tion available here.

The anova function takes a model as an input and returns the following sequential ANOVA table²⁹:

29 More complex – included here just for clarification of the anova's output.

              Degrees of freedom   Sum of Squares   Mean Squares    F-value                     p-value
Predictor 1   1                    SSR_1            SSR_1/1         (SSR_1/1)/(SSE/(n−p−1))     p_1
Predictor 2   1                    SSR_2            SSR_2/1         (SSR_2/1)/(SSE/(n−p−1))     p_2
...           ...                  ...              ...             ...                         ...
Predictor p   1                    SSR_p            SSR_p/1         (SSR_p/1)/(SSE/(n−p−1))     p_p
Residuals     n − p − 1            SSE              SSE/(n − p − 1)

Here the SSR_j represents the regression sum of squares associated to the inclusion of X_j in the model with predictors X_1, . . . , X_{j−1}, this is:

SSR_j = SSR(X_1, . . . , X_j) − SSR(X_1, . . . , X_{j−1}).

The p-values p_1, . . . , p_p correspond to the testing of the hypotheses

H0 : β_j = 0 vs. H1 : β_j ≠ 0,

carried out inside the linear model Y = β_0 + β_1X_1 + . . . + β_jX_j + ε. This is like the t-test for β_j for the model with predictors X_1, . . . , X_j. Recall that there is no F-test in this version of the ANOVA table.
In order to exactly³⁰ compute the simplified ANOVA table seen before, we need to rely on the following ad-hoc function. The function takes as input a fitted lm:

30 Note that, if mod <- lm(resp ~ preds, data) represents a model with response resp and predictors preds, and mod0 is the intercept-only model mod0 <- lm(resp ~ 1, data) that does not contain predictors, anova(mod0, mod) gives a similar output to the seen ANOVA table. Precisely, the first row of the outputted table stands for the SST and the second row for the SSE row (so we call it the SST–SSE table). The SSR row is not present. The seen ANOVA table (which contains SSR and SSE) and the SST–SSE table encode the same information due to the ANOVA decomposition. So it is a matter of taste and tradition to employ one or the other. In particular, both have the F-test and its associated p-value (in the SSE row for the SST–SSE table).

# This function computes the simplified anova from a linear model


simpleAnova <- function(object, ...) {

  # Compute anova table
  tab <- anova(object, ...)

  # Obtain number of predictors
  p <- nrow(tab) - 1

  # Add predictors row
  predictorsRow <- colSums(tab[1:p, 1:2])
  predictorsRow <- c(predictorsRow, predictorsRow[2] / predictorsRow[1])

  # F-quantities
  Fval <- predictorsRow[3] / tab[p + 1, 3]
  pval <- pf(Fval, df1 = p, df2 = tab$Df[p + 1], lower.tail = FALSE)
  predictorsRow <- c(predictorsRow, Fval, pval)

  # Simplified table
  tab <- rbind(predictorsRow, tab[p + 1, ])
  row.names(tab)[1] <- "Predictors"
  return(tab)

}
2.6.1 Case study application


Let’s compute the ANOVA decomposition of modWine1 and modWine2
to test the existence of linear dependence.

# Models
modWine1 <- lm(Price ~ ., data = wine)
modWine2 <- lm(Price ~ . - FrancePop, data = wine)

# Simplified table
simpleAnova(modWine1)
## Analysis of Variance Table
##
## Response: Price
## Df Sum Sq Mean Sq F value Pr(>F)
## Predictors 5 8.6671 1.73343 20.188 2.232e-07 ***
## Residuals 21 1.8032 0.08587
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
simpleAnova(modWine2)
## Analysis of Variance Table
##
## Response: Price
## Df Sum Sq Mean Sq F value Pr(>F)
## Predictors 4 8.6645 2.16613 26.39 4.057e-08 ***
## Residuals 22 1.8058 0.08208
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
# The null hypothesis of no linear dependence is emphatically rejected in
# both models

# R's ANOVA table -- warning: this is not what we saw in lessons


anova(modWine1)
## Analysis of Variance Table
##
## Response: Price
## Df Sum Sq Mean Sq F value Pr(>F)
## WinterRain 1 0.1905 0.1905 2.2184 0.1512427
## AGST 1 5.8989 5.8989 68.6990 4.645e-08 ***
## HarvestRain 1 1.6662 1.6662 19.4051 0.0002466 ***
## Age 1 0.9089 0.9089 10.5852 0.0038004 **
## FrancePop 1 0.0026 0.0026 0.0305 0.8631279

## Residuals 21 1.8032 0.0859


## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
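
As a sanity check, the decomposition (2.20) and the F-statistic of the simplified table can be reproduced by hand for modWine2; a minimal sketch (the object names SST, SSR, SSE, Fval are introduced here for illustration):

# Numerical check of SST = SSR + SSE and of the F-test for modWine2
SST <- sum((wine$Price - mean(wine$Price))^2)
SSR <- sum((modWine2$fitted.values - mean(wine$Price))^2)
SSE <- sum(modWine2$residuals^2)
SST - (SSR + SSE)  # Approximately zero

p <- length(modWine2$coefficients) - 1
Fval <- (SSR / p) / (SSE / modWine2$df.residual)
pf(Fval, df1 = p, df2 = modWine2$df.residual, lower.tail = FALSE)  # p-value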

Compute the ANOVA table for the regression Price ~


WinterRain + AGST + HarvestRain + Age in the wine
dataset. Check that the p-value for the F-test given in
summary and by simpleAnova are the same.

For the y6 ~ x6 and y7 ~ x7 in the assumptions dataset,


compute their ANOVA tables. Check that the p-values of
the t-test for β 1 and the F-test are the same (any explana-
tion of why this is so?).

2.7 Model fit

2.7.1 The R2

The coefficient of determination R2 is closely related with the ANOVA


decomposition. It is defined as

R² := SSR / SST = SSR / (SSR + SSE) = SSR / (SSR + (n − p − 1)σ̂²).    (2.21)

R² measures the proportion of variation of the response variable Y that is explained by the predictors X_1, . . . , X_p through the regression. Intuitively, R² measures the tightness of the data cloud around the regression plane. Check in Figure 2.17 how changing the value of σ²³¹ affects the R².

31 Which is not the σ̂² in (2.21), but σ̂² is obviously dependent on σ².
R2 is intimately related with the sample correlation coefficient.
For example, if p = 1, then it can be seen (exercise below) that R2 =
r2xy . More importantly, R2 = ry2ŷ for any p, that is, the square of
the sample correlation coefficient between Y1 , . . . , Yn and Ŷ1 , . . . , Ŷn
is R2 , a fact that is not immediately evident. Let’s check this fact
when p = 1 by relying on R2 = r2xy . First, by the form of β̂ 0 given in
(2.3),

Ŷ_i = β̂_0 + β̂_1X_i = (Ȳ − β̂_1X̄) + β̂_1X_i = Ȳ + β̂_1(X_i − X̄).    (2.22)

Then, replacing (2.22) in

r²_yŷ = s²_yŷ / (s²_y s²_ŷ)
      = [∑_{i=1}^n (Y_i − Ȳ)(Ŷ_i − Ȳ)]² / [∑_{i=1}^n (Y_i − Ȳ)² ∑_{i=1}^n (Ŷ_i − Ȳ)²]
      = [∑_{i=1}^n (Y_i − Ȳ)(Ȳ + β̂_1(X_i − X̄) − Ȳ)]² / [∑_{i=1}^n (Y_i − Ȳ)² ∑_{i=1}^n (Ȳ + β̂_1(X_i − X̄) − Ȳ)²]
      = r²_xy,

and, as a consequence, r²_yŷ = r²_xy = R² when p = 1.
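
The identity R² = r²_yŷ is also easy to check numerically; for instance, with the modWine1 fit of the case study:

# The R^2 is the squared correlation between response and fitted values
summary(modWine1)$r.squared
cor(wine$Price, modWine1$fitted.values)^2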

Show that R2 = r2xy when p = 1. Hint: start from the defi-


nition of R2 and use (2.3) to arrive to r2xy .

Trusting the R2 blindly can lead to catastrophic conclusions.


Here are a couple of counterexamples of a linear regression per-
formed in a data that clearly does not satisfy the assumptions dis-
cussed in Section 2.3, but despite that, the models have a large R2 .
As a consequence, inference built on the validity of the assumptions³² will be problematic, no matter what is the value of R². For example, recall how biased the predictions will be for x = 0.35 and x = 0.65.

32 Which do not hold!
# Simple linear model

# Create data that:


# 1) does not follow a linear model
# 2) the error is heteroskedastic
x <- seq(0.15, 1, l = 100)
set.seed(123456)
eps <- rnorm(n = 100, sd = 0.25 * x^2)
y <- 1 - 2 * x * (1 + 0.25 * sin(4 * pi * x)) + eps

# Great R^2!?
reg <- lm(y ~ x)
summary(reg)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53525 -0.18020 0.02811 0.16882 0.46896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.87190 0.05860 14.88 <2e-16 ***
## x -1.69268 0.09359 -18.09 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.232 on 98 degrees of freedom
## Multiple R-squared: 0.7695, Adjusted R-squared: 0.7671
## F-statistic: 327.1 on 1 and 98 DF, p-value: < 2.2e-16

# scatterplot is a quick alternative to
# plot(x, y)
# abline(coef = reg$coef, col = 3)

# But prediction is obviously problematic
car::scatterplot(y ~ x, col = 1, regLine = list(col = 2), smooth = FALSE)

Figure 2.18: Regression line for a dataset that clearly violates the linearity and homoscedasticity assumptions. The R² is, nevertheless, as high as (approximately) 0.77.

# Multiple linear model

# Create data that:
# 1) does not follow a linear model
# 2) the error is heteroskedastic
x1 <- seq(0.15, 1, l = 100)
set.seed(123456)
x2 <- runif(100, -3, 3)
eps <- rnorm(n = 100, sd = 0.25 * x1^2)
y <- 1 - 3 * x1 * (1 + 0.25 * sin(4 * pi * x1)) + 0.25 * cos(x2) + eps

# Great R^2!?
reg <- lm(y ~ x1 + x2)
summary(reg)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78737 -0.20946 0.01031 0.19652 1.05351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.788812 0.096418 8.181 1.1e-12 ***
## x1 -2.540073 0.154876 -16.401 < 2e-16 ***
## x2 0.002283 0.020954 0.109 0.913
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.3754 on 97 degrees of freedom
## Multiple R-squared: 0.744, Adjusted R-squared: 0.7388
## F-statistic: 141 on 2 and 97 DF, p-value: < 2.2e-16

We can visualize the fit of the latter multiple linear model, since
we are in p = 2.
# But prediction is obviously problematic
car::scatter3d(y ~ x1 + x2, fit = "linear")
rgl::rglwidget()

The previous counterexamples illustrate that a large R2


means nothing in terms of inference if the assumptions of
the model do not hold.

Remember that:

• R2 does not measure the correctness of a linear model


but its usefulness, assuming the model is correct.
• R2 is the proportion of variance of Y explained by
X1 , . . . , X p , but, of course, only when the linear model is
correct.

We finalize by pointing out a nice connection between the R2 , the


ANOVA decomposition, and the least squares estimator β̂:

The ANOVA decomposition gives another interpretation


of the least-squares estimates: β̂ are the estimated co-
efficients that maximize the R2 (among all the possible
estimates we could think about). To see this, recall that

R² = SSR/SST = (SST − SSE)/SST = (SST − RSS(β̂))/SST.

Then, if RSS( β̂) = minβ∈R p+1 RSS( β), then R2 is maximal


for β̂.

2.7.2 The R2Adj

As we saw, these are equivalent forms for R²:

R² = SSR/SST = (SST − SSE)/SST = 1 − SSE/SST = 1 − σ̂²/SST × (n − p − 1).    (2.23)

The SSE on the numerator always decreases as more predictors are added to the model, even if these are not significant. As a consequence, the R² always increases with p. Why is this so? Intuitively, because the complexity – hence the flexibility – of the model increases when we use more predictors to explain Y. Mathematically, because when p approaches n − 1 the second term in (2.23) is reduced and, as a consequence, R² grows.
The adjusted R² is an important quantity specifically designed to overcome this R²'s flaw, which is ubiquitous in multiple linear regression. The purpose is to have a better tool for comparing models without systematically favoring more complex models³³. This alternative coefficient is defined as

R²_Adj := 1 − (SSE/(n − p − 1)) / (SST/(n − 1)) = 1 − SSE/SST × (n − 1)/(n − p − 1) = 1 − σ̂²/SST × (n − 1).    (2.24)

33 An informal way of regarding the difference between R² and R²_Adj is by thinking of R² as a measure of fit that “is not aware of the dangers of overfitting”. In this interpretation, R²_Adj is the overfitting-aware version of R².

The R2Adj is independent of p, at least explicitly. If p = 1 then


R2Adj is almost R2 (practically identical if n is large). Both (2.23) and
(2.24) are quite similar except for the last factor, which in the latter
does not depend on p. Therefore, (2.24) will only increase if σ̂2 is
reduced with p – in other words, if the new variables contribute in
the reduction of variability around the regression plane.
Figure 2.19: Comparison of R² and R²_Adj on the model (2.26) fitted with data generated by (2.25). The number of predictors p ranges from 1 to 198, with only the first two predictors being significant. The M = 200 red and black curves arise from M simulated datasets of sample size n = 200. The thicker curves are the mean of each color's curves.

The different behavior between R² and R²_Adj can be visualized by a small simulation study. Suppose that we generate a random dataset {(X_{i1}, X_{i2}, Y_i)}_{i=1}^n, with n = 200 observations of two predictors X_1 and X_2 that are distributed as a N(0, 1), and a response Y generated by the linear model

Y_i = β_0 + β_1X_{i1} + β_2X_{i2} + ε_i,    (2.25)

where ε i ∼ N (0, 1). To this data, we add 196 garbage predictors


X j ∼ N (0, 1) that are completely independent from Y. There-
fore, we end up with p = 198 predictors where only the first two
ones are relevant for explaining Y. We compute now the R2 ( j) and
R2Adj ( j) for the models

Y = β 0 + β 1 X1 + . . . + β j X j + ε, (2.26)

with j = 1, . . . , p, and we plot them as the curves ( j, R2 ( j)) and


( j, R2Adj ( j)). Since R2 and R2Adj are random variables, we repeat this
procedure M = 200 times to have an idea of the variability behind
R2 and R2Adj . Figure 2.19 contains the results of this experiment.
As it can be seen, the R2 increases linearly with the number of
predictors considered, although only the first two ones were actu-
ally relevant. Thus, if we did not know about the random mecha-
nism (2.25) that generated Y, we would be tempted to believe that
the more adequate models would be the ones with a larger number
of predictors. On the contrary, R2Adj only increases in the first two
variables and then exhibits a mild decaying trend, indicating that,
on average, the best choice for the number of predictors is actually
close to p = 2. However, note that R2Adj has a huge variability when
p approaches n − 2, a consequence of the explosive variance of σ̂² in that degenerate case³⁴. The experiment helps to visualize that R²_Adj is more adequate than the R² for evaluating the fit of a multiple linear regression³⁵.

34 This indicates that R²_Adj can act as a model selection device up to a certain point, as its effectiveness becomes too variable, too erratic, when the number of predictors is very high in comparison with n.
35 Coincidentally, the experiment also serves as a reminder about the randomness of R² and R²_Adj, a fact that is sometimes overlooked.

An example of a simulated dataset considered in the experiment of Figure 2.19 is:
p <- 198 is sometimes overlooked.
n <- 200
set.seed(3456732)
beta <- c(0.5, -0.5, rep(0, p - 2))
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
Y <- drop(X %*% beta + rnorm(n, sd = 3))
data <- data.frame(y = Y, x = X)

# Regression on the two meaningful predictors


summary(lm(y ~ x.1 + x.2, data = data))

# Adding 20 garbage variables


# R^2 increases and adjusted R^2 decreases
summary(lm(y ~ X[, 1:22], data = data))

The R2Adj no longer measures the proportion of variation


of Y explained by the regression, but the result of correct-
ing this proportion by the number of predictors employed. As a
consequence of this, R2Adj ≤ 1 and R2Adj can be negative.

The next code illustrates a situation where we have two predic-


tors completely independent from the response. The fitted model
has a negative R2Adj .

# Three independent variables


set.seed(234599)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 1 + rnorm(100)

# Negative adjusted R^2


summary(lm(y ~ x1 + x2))
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5081 -0.5021 -0.0191 0.5286 2.4750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.97024 0.10399 9.330 3.75e-15 ***
## x1 0.09003 0.10300 0.874 0.384
## x2 -0.05253 0.11090 -0.474 0.637
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.034 on 97 degrees of freedom
## Multiple R-squared: 0.009797, Adjusted R-squared: -0.01062
## F-statistic: 0.4799 on 2 and 97 DF, p-value: 0.6203
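
The figures in the summary above can be reproduced from (2.23) and (2.24); a minimal sketch for this last fit (fit, SSE, and SST are names introduced here for illustration):

# R^2 and adjusted R^2 by hand; the latter can indeed be negative
fit <- lm(y ~ x1 + x2)
n <- length(y)
p <- 2
SSE <- sum(fit$residuals^2)
SST <- sum((y - mean(y))^2)
1 - SSE / SST                            # 'Multiple R-squared'
1 - (SSE / SST) * (n - 1) / (n - p - 1)  # 'Adjusted R-squared'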

For the previous example, construct more predictors (x3,


x4, . . . ) that are independent from y and check that when
the predictors are added to the model, the R2Adj decreases
and the R2 increases.

Beware of the R2 and R2Adj for model fits with no in-


tercept! If the linear model is fit with no intercept,
the summary function silently returns a 'Multiple
R-squared' and an 'Adjusted R-squared' that do not
correspond (see this question in the R FAQ) with the seen
expressions

R² = 1 − SSE / ∑_{i=1}^n (Y_i − Ȳ)²,   R²_Adj = 1 − SSE/SST × (n − 1)/(n − p − 1).

In the case with no intercept, summary rather returns

R²_0 := 1 − SSE / ∑_{i=1}^n Y_i²,   R²_{0,Adj} := 1 − SSE / ∑_{i=1}^n Y_i² × (n − 1)/(n − p − 1).

The reason is perhaps shocking: if the model is fit with-


out intercept and neither the response nor the predictors
are centred, then the ANOVA decomposition does not
hold, in the sense that

SST ≠ SSR + SSE.

The fact that the ANOVA decomposition does not hold for no-
intercept models has serious consequences on the theory we built,

in particular the R2 can be negative and R2 6= ry2ŷ , which derive from


the fact that ∑in=1 ε̂ i 6= 0. Therefore, the R2 can not be regarded as
the “proportion of variance explained” if the model fit is performed
without intercept. The R20 and R20,Adj versions are considered be-
cause they are the ones that arise from the “no-intercept ANOVA
decomposition”
SST_0 = SSR + SSE,   SST_0 := ∑_{i=1}^n Y_i²,

and therefore the R20


is guaranteed to be a quantity in [0, 1]. It
would indeed be the proportion of variance explained if the pre-
dictors and the response were centred (i.e., if Ȳ = 0 and X̄ j = 0,
j = 1, . . . , p).
The next chunk of code illustrates these concepts for the iris
dataset.
# Model with intercept
mod1 <- lm(Sepal.Length ~ Petal.Width, data = iris)
mod1
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Width, data = iris)
##
## Coefficients:
## (Intercept) Petal.Width
## 4.7776 0.8886

# Model without intercept


mod0 <- lm(Sepal.Length ~ 0 + Petal.Width, data = iris)
mod0
##
## Call:
## lm(formula = Sepal.Length ~ 0 + Petal.Width, data = iris)
##
## Coefficients:
## Petal.Width
## 3.731

# Recall the different way of obtaining the estimators


X1 <- cbind(1, iris$Petal.Width)
X0 <- cbind(iris$Petal.Width) # No column of ones!
Y <- iris$Sepal.Length
betaHat1 <- solve(crossprod(X1)) %*% t(X1) %*% Y
betaHat0 <- solve(crossprod(X0)) %*% t(X0) %*% Y
betaHat1
## [,1]
## [1,] 4.7776294
## [2,] 0.8885803
betaHat0
## [,1]
## [1,] 3.731485

# Summaries: higher R^2 for the model with no intercept!?


summary(mod1)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.38822 -0.29358 -0.04393 0.26429 1.34521
##
## Coefficients:

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 4.77763 0.07293 65.51 <2e-16 ***
## Petal.Width 0.88858 0.05137 17.30 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.478 on 148 degrees of freedom
## Multiple R-squared: 0.669, Adjusted R-squared: 0.6668
## F-statistic: 299.2 on 1 and 148 DF, p-value: < 2.2e-16
summary(mod0)
##
## Call:
## lm(formula = Sepal.Length ~ 0 + Petal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1556 -0.3917 1.0625 3.8537 5.0537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Petal.Width 3.732 0.150 24.87 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2.609 on 149 degrees of freedom
## Multiple R-squared: 0.8058, Adjusted R-squared: 0.8045
## F-statistic: 618.4 on 1 and 149 DF, p-value: < 2.2e-16

# Wait a moment... let’s see the actual fit


plot(Sepal.Length ~ Petal.Width, data = iris)
abline(mod1, col = 2) # Obviously, much better
abline(mod0, col = 3)

# Manually checking the R^2 indeed reveals that summary is doing something
# different for computing the R^2 when no intercept
cor(mod0$model$Sepal.Length, mod0$fitted.values)^2
## [1] 0.6690277

# Compute the R^2 manually for mod1
SSE1 <- sum((mod1$residuals - mean(mod1$residuals))^2)
SST1 <- sum((mod1$model$Sepal.Length - mean(mod1$model$Sepal.Length))^2)
1 - SSE1 / SST1
## [1] 0.6690277

# Compute the R^2 manually for mod0


SSE0 <- sum((mod0$residuals - mean(mod0$residuals))^2)
SST0 <- sum((mod0$model$Sepal.Length - mean(mod0$model$Sepal.Length))^2)
1 - SSE0 / SST0
## [1] -6.179158
# It is negative!

# Recall that the mean of the residuals is not zero


mean(mod0$residuals)
## [1] 1.368038

# What summary really returns if there is no intercept


n <- nrow(iris)
p <- 1
R0 <- 1 - sum(mod0$residuals^2) / sum(mod0$model$Sepal.Length^2)
R0Adj <- 1 - sum(mod0$residuals^2) / sum(mod0$model$Sepal.Length^2) *
(n - 1) / (n - p - 1)
R0
## [1] 0.8058497
R0Adj
## [1] 0.8045379

# What if we centred the data previously?



irisCen <- data.frame(scale(iris[, -5], center = TRUE, scale = FALSE))


modCen1 <- lm(Sepal.Length ~ Petal.Width, data = irisCen)
modCen0 <- lm(Sepal.Length ~ 0 + Petal.Width, data = irisCen)

# No problem, "correct" R^2


summary(modCen1)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Width, data = irisCen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.38822 -0.29358 -0.04393 0.26429 1.34521
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.421e-15 3.903e-02 0.0 1
## Petal.Width 8.886e-01 5.137e-02 17.3 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.478 on 148 degrees of freedom
## Multiple R-squared: 0.669, Adjusted R-squared: 0.6668
## F-statistic: 299.2 on 1 and 148 DF, p-value: < 2.2e-16
summary(modCen0)
##
## Call:
## lm(formula = Sepal.Length ~ 0 + Petal.Width, data = irisCen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.38822 -0.29358 -0.04393 0.26429 1.34521
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Petal.Width 0.8886 0.0512 17.36 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.4764 on 149 degrees of freedom
## Multiple R-squared: 0.669, Adjusted R-squared: 0.6668
## F-statistic: 301.2 on 1 and 149 DF, p-value: < 2.2e-16

# But only if we center predictor and response...


summary(lm(iris$Sepal.Length ~ 0 + irisCen$Petal.Width))
##
## Call:
## lm(formula = iris$Sepal.Length ~ 0 + irisCen$Petal.Width)
##
## Residuals:
## Min 1Q Median 3Q Max
## 4.455 5.550 5.799 6.108 7.189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## irisCen$Petal.Width 0.8886 0.6322 1.406 0.162
##
## Residual standard error: 5.882 on 149 degrees of freedom
## Multiple R-squared: 0.01308, Adjusted R-squared: 0.006461
## F-statistic: 1.975 on 1 and 149 DF, p-value: 0.1619

Let’s play the evil practitioner and try to modify the R₀²
returned by summary when the intercept is excluded. If
we were really evil, we could use this knowledge to fool
someone who is not aware of the difference between R₀²
and R² into believing that any given model is incredibly
good or bad in terms of the R²!

• For the previous example, display the R₀² as a function
of a shift in the mean of the predictor Petal.Width. What
can you conclude? Hint: use I() for performing the
shifting in the model equation (see the sketch after this list).
• What shift on Petal.Width would be necessary to
achieve an R₀² ≈ 0.95? Is this shift unique?
• Do the same but for a shift in the mean of the response
Sepal.Length. What shift on Sepal.Length would be
necessary to achieve an R₀² ≈ 0.50? Is there a single shift?
• Consider the multiple linear model medv ~ nox + rm
for the Boston.xlsx dataset. We want to tweak the R₀²
to set it to any number in [0, 1]. Can we achieve this by
only shifting rm or medv (only one of them)?
• Explore systematically the R₀² for the shifting combinations
of rm and medv and obtain a combination that
delivers an R₀² ≈ 0.30. Hint: use filled.contour() for
visualizing the R₀² surface.
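A minimal sketch of the kind of computation the first item asks for, using I() to shift the predictor inside the formula (the grid of shifts is arbitrary):

# R_0^2 of the no-intercept fit as a function of a shift in Petal.Width
shifts <- seq(-5, 5, by = 0.1)
R2_0 <- sapply(shifts, function(s) {
  summary(lm(Sepal.Length ~ 0 + I(Petal.Width + s), data = iris))$r.squared
})
plot(shifts, R2_0, type = "l", xlab = "Shift on Petal.Width",
     ylab = "R^2 without intercept")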

2.7.3 Case study application


Coming back to the case study, we have studied so far three mod-
els:
# Fit models
modWine1 <- lm(Price ~ ., data = wine)
modWine2 <- lm(Price ~ . - FrancePop, data = wine)
modWine3 <- lm(Price ~ Age + WinterRain, data = wine)

# Summaries
summary(modWine1)
##
## Call:
## lm(formula = Price ~ ., data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46541 -0.24133 0.00413 0.18974 0.52495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.343e+00 7.697e+00 -0.304 0.76384
## WinterRain 1.153e-03 4.991e-04 2.311 0.03109 *
## AGST 6.144e-01 9.799e-02 6.270 3.22e-06 ***
## HarvestRain -3.837e-03 8.366e-04 -4.587 0.00016 ***
## Age 1.377e-02 5.821e-02 0.237 0.81531
## FrancePop -2.213e-05 1.268e-04 -0.175 0.86313
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.293 on 21 degrees of freedom
## Multiple R-squared: 0.8278, Adjusted R-squared: 0.7868
## F-statistic: 20.19 on 5 and 21 DF, p-value: 2.232e-07
summary(modWine2)
##
## Call:
## lm(formula = Price ~ . - FrancePop, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46024 -0.23862 0.01347 0.18601 0.53443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.6515703 1.6880876 -2.163 0.04167 *
## WinterRain 0.0011667 0.0004820 2.420 0.02421 *
## AGST 0.6163916 0.0951747 6.476 1.63e-06 ***
## HarvestRain -0.0038606 0.0008075 -4.781 8.97e-05 ***
## Age 0.0238480 0.0071667 3.328 0.00305 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.7962
## F-statistic: 26.39 on 4 and 22 DF, p-value: 4.057e-08
summary(modWine3)
##
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88964 -0.51421 -0.00066 0.43103 1.06897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9830427 0.5993667 9.982 5.09e-10 ***
## Age 0.0360559 0.0137377 2.625 0.0149 *
## WinterRain 0.0007813 0.0008780 0.890 0.3824
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared: 0.2371, Adjusted R-squared: 0.1736
## F-statistic: 3.73 on 2 and 24 DF, p-value: 0.03884

modWine2 is the model with the largest R²Adj. It is a model that
explains 82.75% of the variability in a non-redundant way and
with all its coefficients significant. Therefore, we have a formula
for effectively explaining and predicting the quality of a vintage
(answers Q1). Prediction and, importantly, quantification of the
uncertainty in the prediction, can be done through the predict
function.
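For instance, a minimal sketch of such a prediction for a hypothetical vintage (the predictor values below are made up for illustration):

# Prediction and prediction interval for a made-up vintage
newVintage <- data.frame(WinterRain = 500, AGST = 17, HarvestRain = 100, Age = 10)
predict(modWine2, newdata = newVintage, interval = "prediction")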
Furthermore, the interpretation of modWine2 agrees with well-
known facts in viticulture that make perfect sense. Precisely (an-
swers Q2):

• Higher temperatures are associated with better quality (higher


priced) wine.
• Rain before the growing season is good for the wine quality, but
during harvest is bad.
• The quality of the wine improves with age.

Although these were known facts, keep in mind that this model

• allows us to quantify the effect of each variable on the wine quality, and
• provides a precise way of predicting the quality of future vintages.
3 Linear models II: model selection, extensions, and diagnostics

Given the response Y and the predictors X1 , . . . , X p , many linear


models can be built for predicting and explaining Y. In this chapter
we will see how to address the problem of selecting the best subset
of predictors X1 , . . . , X p for explaining Y. Among others, we will
also see how to extend the linear model to account for nonlinear
relations between Y and X1 , . . . , X p , and how to check whether the
assumptions of the model are realistic in practice.

3.1 Case study: Housing values in Boston

This case study is motivated by Harrison and Rubinfeld (1978),
who proposed an hedonic model¹ for determining the willingness
of house buyers to pay for clean air. The study of Harrison and
Rubinfeld (1978) employed data from the Boston metropolitan area,
containing 506 suburbs and 14 variables. The Boston dataset is
available through the file Boston.xlsx and through the dataset
MASS::Boston.
¹ An hedonic model is a model that decomposes the price of an item into separate
components that determine its price. For example, an hedonic model for the price of
a house may decompose its price into the house characteristics, the kind of
neighborhood, and the location.
The description of the related variables can be found in ?Boston
and Harrison and Rubinfeld (1978)², but we summarize here the
most important ones as they appear in Boston. They are aggregated
into five topics:
² But be aware of the changes in units for medv, black, lstat, and nox.

• Dependent variable: medv, the median value of owner-occupied


homes (in thousands of dollars).
• Structural variables indicating the house characteristics: rm (aver-
age number of rooms “in owner units”) and age (proportion of
owner-occupied units built prior to 1940).
• Neighborhood variables: crim (crime rate), zn (proportion of
residential areas), indus (proportion of non-retail business
area), chas (river limitation), tax (cost of public services in
each community), ptratio (pupil-teacher ratio), black (variable
1000( B − 0.63)2 , where B is the black proportion of population
– low and high values of B increase housing prices) and lstat
(percent of lower status of the population).
• Accessibility variables: dis (distances to five Boston employment
centers) and rad (accessibility to radial highways – larger index
denotes better accessibility).
• Air pollution variable: nox, the annual concentration of nitrogen
oxide (in parts per ten million).

We begin by importing the data:


# Read data
Boston <- readxl::read_excel(path = "Boston.xlsx", sheet = 1, col_names = TRUE)

# # Alternatively
# data(Boston, package = "MASS")

A quick summary of the data is shown below:


summary(Boston)
## crim zn indus chas nox rm
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000 Min. :0.3850 Min. :3.561
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000 1st Qu.:0.4490 1st Qu.:5.886
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000 Median :0.5380 Median :6.208
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917 Mean :0.5547 Mean :6.285
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000 3rd Qu.:0.6240 3rd Qu.:6.623
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000 Max. :0.8710 Max. :8.780
## age dis rad tax ptratio black lstat
## Min. : 2.90 Min. : 1.130 Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32 Min. : 1.73
## 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95
## Median : 77.50 Median : 3.207 Median : 5.000 Median :330.0 Median :19.05 Median :391.44 Median :11.36
## Mean : 68.57 Mean : 3.795 Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67 Mean :12.65
## 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95
## Max. :100.00 Max. :12.127 Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00

The two goals of this case study are:

• Q1. Quantify the influence of the predictor variables on the housing


prices.
• Q2. Obtain the “best possible” model for decomposing the housing
prices and interpret it.

We begin by making an exploratory analysis of the data with a


matrix scatterplot. Since the number of variables is high, we opt to
plot only five variables: crim, dis, medv, nox, and rm. Each of them
represents one of the five topics in which the variables were classified.
car::scatterplotMatrix(~ crim + dis + medv + nox + rm, regLine = list(col = 2),
col = 1, smooth = list(col.smooth = 4, col.spread = 4),
data = Boston)

The diagonal panels show an estimate of the unknown density


of each variable. Note the peculiar distribution of crim, very con-
centrated at zero, and the asymmetry in medv, with a second mode
associated to the most expensive properties. Inspecting the individ-
ual panels, it is clear that some nonlinearity exists in the data and
that some predictors are going to be more important than others
(recall that we have plotted just a subset of all the predictors).
Figure 3.1: Scatterplot matrix for crim, dis, medv, nox, and rm from the Boston dataset.

3.2 Model selection

In Chapter 2 we briefly saw that the inclusion of more predictors


is not for free: there is a price to pay in terms of more variability in
the coefficient estimates, harder interpretation, and possible inclu-
sion of highly-dependent predictors. Indeed, there is a maximum
number of predictors p that can be considered in a linear model
for a sample size n: p ≤ n − 2. Or equivalently, there is a mini-
mum sample size n required for fitting a model with p predictors:
n ≥ p + 2.
The interpretation of this fact is simple if we think on the geome-
try for p = 1 and p = 2:

• If p = 1, we need at least n = 2 points to uniquely fit a line.


However, this line gives no information on the vertical variation
around it and hence σ² can not be estimated³. Therefore we need
at least n = 3 points, or in other words n ≥ p + 2 = 3.
³ Applying its formula, we would obtain σ̂² = 0/0 and so σ̂² will not be defined.

• If p = 2, we need at least n = 3 points to uniquely fit a plane.


But again this plane gives no information on the variation of the
data around it and hence σ2 cannot be estimated. Therefore we
need n ≥ p + 2 = 4.

Another interpretation is the following:

The fitting of a linear model with p predictors involves the estimation


of p + 2 parameters ( β, σ2 ) from n data points. The closer p + 2
and n are, the more variable the estimates ( β̂, σ̂2 ) will be, since less
information is available for estimating each one. In the limit case
n = p + 2, each sample point determines a parameter estimate.

In the above discussion the degenerate case p = n − 1 was


excluded, as it gives a perfect and useless fit for which σ̂2 is not
defined. However, β̂ is actually computable if p = n − 1. The output
of the next chunk of code clarifies the distinction between p = n − 1
and p = n − 2.
# Data: n observations and p = n - 1 predictors
set.seed(123456)
n <- 5
p <- n - 1
df <- data.frame(y = rnorm(n), x = matrix(rnorm(n * p), nrow = n, ncol = p))

# Case p = n - 1 = 2: beta can be estimated, but sigma^2 can not (hence, no
# inference can be performed since it requires the estimated sigma^2)
summary(lm(y ~ ., data = df))

# Case p = n - 2 = 1: both beta and sigma^2 can be estimated (hence, inference


# can be performed)
summary(lm(y ~ . - x.1, data = df))

The degrees of freedom n − p − 1 quantify the increase in the vari-


ability of ( β̂, σ̂2 ) when n − p − 1 decreases. For example:

• t_{n−p−1;α/2} appears in (2.16) and influences the length of the CIs
for β_j, see (2.17). It also influences the length of the CIs for the
prediction. As Figure 3.2 shows, when the degrees of freedom
n − p − 1 decrease, t_{n−p−1;α/2} increases, thus the intervals widen.
• σ̂² = (1/(n − p − 1)) ∑ᵢ₌₁ⁿ ε̂ᵢ² influences the R² and R²Adj. If no relevant
variables are added to the model then ∑ᵢ₌₁ⁿ ε̂ᵢ² will not change
substantially. However, the factor 1/(n − p − 1) will increase as p aug-
ments, inflating σ̂² and its variance. This is exactly what hap-
pened in Figure 2.19.
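As a small illustration of the first point above, the following sketch reproduces the behavior displayed in Figure 3.2 (the grid of degrees of freedom and the plotting choices are arbitrary):

# t_{df;alpha/2} as a function of df = n - p - 1 for several alpha's
df <- 1:30
alpha <- c(0.10, 0.05, 0.01)
quant <- sapply(alpha, function(a) qt(p = 1 - a / 2, df = df))
matplot(df, quant, type = "l", lty = 1, col = 1:3, ylab = "t_{df; alpha/2}")
legend("topright", legend = paste("alpha =", alpha), col = 1:3, lty = 1)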

Now that we have shed more light on the problem of having
an excess of predictors, we turn the focus on selecting the most
adequate predictors for a multiple regression model. This is a chal-
lenging task without a unique solution, and what is worse, without
a method that is guaranteed to work in all the cases. However, there
is a well-established procedure that usually gives good results:
the stepwise model selection. Its principle is to compare multiple
linear regression models with different predictors⁴ sequentially.
⁴ Of course, with the same responses.

Figure 3.2: Effect of df = n − p − 1 in t_{df;α/2} for α = 0.10, 0.05, 0.01.

Before introducing the method, we need to understand what is
an information criterion. An information criterion balances the
fitness of a model with the number of predictors employed. Hence,
it determines objectively the best model as the one that minimizes
the considered information criterion. Two common criteria are
the Bayesian Information Criterion (BIC) and the Akaike Information
Criterion (AIC). Both are based on a balance between the model
fitness and its complexity:

BIC(model) = −2ℓ(model) + npar(model) × log n,        (3.1)
             [model fitness]   [complexity]
where ℓ(model) is the log-likelihood of the model (how well the model
fits the data) and npar(model) is the number of parameters consid-
ered in the model, p + 2 in the case of a multiple linear regression
model with p predictors. The AIC replaces log n by 2 in (3.1) so,
compared with BIC, it penalizes less the more complex models⁵.
This is one of the reasons why BIC is preferred by some practition-
ers for model comparison⁶.
⁵ Recall that log n > 2 if n ≥ 8.
⁶ Also, because the BIC is consistent in selecting the true distribution/regression
model: if enough data is provided (n → ∞), the BIC is guaranteed to select the
true data-generating model among a list of candidate models if the true model is
included in that list (see Schwarz (1978) for the specifics for the linear model). Note,
however, that this may be unrealistic in practice, as the true model may be nonlinear.

The BIC and AIC can be computed through the functions BIC
and AIC. They take a model as the input.

# Two models with different predictors
mod1 <- lm(medv ~ age + crim, data = Boston)
mod2 <- lm(medv ~ age + crim + lstat, data = Boston)

# BICs
BIC(mod1)
## [1] 3581.893
BIC(mod2) # Smaller -> better
## [1] 3300.841

# AICs
AIC(mod1)
## [1] 3564.987
AIC(mod2) # Smaller -> better
## [1] 3279.708

# Check the summaries


summary(mod1)
##
## Call:
## lm(formula = medv ~ age + crim, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.940 -4.991 -2.420 2.110 32.033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.80067 0.97078 30.698 < 2e-16 ***
## age -0.08955 0.01378 -6.499 1.95e-10 ***
## crim -0.31182 0.04510 -6.914 1.43e-11 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 8.157 on 503 degrees of freedom
## Multiple R-squared: 0.2166, Adjusted R-squared: 0.2134
## F-statistic: 69.52 on 2 and 503 DF, p-value: < 2.2e-16
summary(mod2)
##
## Call:
## lm(formula = medv ~ age + crim + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.133 -3.848 -1.380 1.970 23.644
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.82804 0.74774 43.903 < 2e-16 ***
## age 0.03765 0.01225 3.074 0.00223 **
## crim -0.08262 0.03594 -2.299 0.02193 *
## lstat -0.99409 0.05075 -19.587 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 6.147 on 502 degrees of freedom
## Multiple R-squared: 0.5559, Adjusted R-squared: 0.5533
## F-statistic: 209.5 on 3 and 502 DF, p-value: < 2.2e-16
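To connect (3.1) with the previous outputs, the following sketch recomputes the BIC and AIC of mod2 from its log-likelihood, counting npar = p + 2 (the p + 1 coefficients plus σ̂²):

# BIC/AIC "by hand" from the log-likelihood of mod2
n <- nobs(mod2)
npar <- length(coef(mod2)) + 1 # p + 1 coefficients plus sigma^2
-2 * as.numeric(logLik(mod2)) + npar * log(n) # Matches BIC(mod2)
-2 * as.numeric(logLik(mod2)) + npar * 2 # Matches AIC(mod2)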

Let’s go back to the selection of predictors. If we have p predic-


tors, a naive procedure would be to check all the possible models that
can be constructed with them and then select the best one in terms
of BIC/AIC. This is the so-called best subset selection. The problem is
that there are 2^p possible models!⁷ For example, there are already 1,024
models to check with just p = 10 predictors. Fortunately, the MASS::stepAIC
function helps us navigate this ocean of models by iteratively
adding useful predictors and removing the non-important ones.
⁷ If the intercept is questioned to be included, then 2^(p+1) possible models.
The function takes as input a candidate model to be improved.
Let’s see how it works with the already studied wine dataset, using
as initial model the one employing all the available predictors.
# Load data -- notice that "Year" is also included
wine <- read.csv(file = "wine.csv", header = TRUE)

stepAIC takes the argument k as 2 (default) or log n, where n is


the sample size. With k = 2 it uses the AIC criterion and with k =
log(n) it considers the BIC.

# Full model
mod <- lm(Price ~ ., data = wine)

# With BIC
modBIC <- MASS::stepAIC(mod, k = log(nrow(wine)))
## Start: AIC=-53.29
## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop
##
##
## Step: AIC=-53.29
## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop
##
## Df Sum of Sq RSS AIC
## - FrancePop 1 0.0026 1.8058 -56.551
## - Year 1 0.0048 1.8080 -56.519
## <none> 1.8032 -53.295
## - WinterRain 1 0.4585 2.2617 -50.473
## - HarvestRain 1 1.8063 3.6095 -37.852
## - AGST 1 3.3756 5.1788 -28.105
##
## Step: AIC=-56.55
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -56.551
## - WinterRain 1 0.4809 2.2867 -53.473
## - Year 1 0.9089 2.7147 -48.840
## - HarvestRain 1 1.8760 3.6818 -40.612
## - AGST 1 3.4428 5.2486 -31.039
summary(modBIC)
##
## Call:
## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
## data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46024 -0.23862 0.01347 0.18601 0.53443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.6390418 14.6939240 2.970 0.00707 **
## Year -0.0238480 0.0071667 -3.328 0.00305 **
## WinterRain 0.0011667 0.0004820 2.420 0.02421 *
## AGST 0.6163916 0.0951747 6.476 1.63e-06 ***
## HarvestRain -0.0038606 0.0008075 -4.781 8.97e-05 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.7962
## F-statistic: 26.39 on 4 and 22 DF, p-value: 4.057e-08

# With AIC
modAIC <- MASS::stepAIC(mod, k = 2)
## Start: AIC=-61.07
## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop
##
##
## Step: AIC=-61.07
## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop
##
## Df Sum of Sq RSS AIC
## - FrancePop 1 0.0026 1.8058 -63.031
## - Year 1 0.0048 1.8080 -62.998
## <none> 1.8032 -61.070
## - WinterRain 1 0.4585 2.2617 -56.952
## - HarvestRain 1 1.8063 3.6095 -44.331
## - AGST 1 3.3756 5.1788 -34.584
##
## Step: AIC=-63.03
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -63.031
## - WinterRain 1 0.4809 2.2867 -58.656
## - Year 1 0.9089 2.7147 -54.023
## - HarvestRain 1 1.8760 3.6818 -45.796
## - AGST 1 3.4428 5.2486 -36.222
summary(modAIC)
##
## Call:
## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
## data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46024 -0.23862 0.01347 0.18601 0.53443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.6390418 14.6939240 2.970 0.00707 **
## Year -0.0238480 0.0071667 -3.328 0.00305 **
## WinterRain 0.0011667 0.0004820 2.420 0.02421 *
## AGST 0.6163916 0.0951747 6.476 1.63e-06 ***
## HarvestRain -0.0038606 0.0008075 -4.781 8.97e-05 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.7962
## F-statistic: 26.39 on 4 and 22 DF, p-value: 4.057e-08

An explanation of what stepAIC has performed follows. At each


step, stepAIC displayed information about the current value of
the information criterion. For example, for modBIC, the BIC at the
first step was Step: AIC=-53.29 and then it improved to Step:
AIC=-56.55 in the second step (the function always prints ‘AIC’,
even if ‘BIC’ is employed). The next model to move to was decided


by stepAIC by exploring the information criteria of the different
models resulting from adding or removing a predictor (depend-
ing on the direction argument, explained next). For example,
for modBIC, in the first step the model resulting from removing
FrancePop had a BIC equal to -56.551, and if Year was removed
the BIC would be -56.519. The stepwise regression proceeded then
by removing FrancePop (as it gave the lowest BIC) and then by re-
peating the procedure, eventually arriving at the situation in which
removing <none> predictors was the best possible action (otherwise,
the BIC of removing any predictor would be worse).
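As a quick numerical check that the search improved on the starting model, the BICs (in the scale of extractAIC) of the initial and final models can be compared (a small sketch with the objects above):

# BIC (extractAIC scale) of the full model vs. the stepwise-selected model
extractAIC(mod, k = log(nrow(wine)))[2]
extractAIC(modBIC, k = log(nrow(wine)))[2] # Lower, hence preferred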

stepAIC and friends (addterm and dropterm) compute


a slightly different version of the BIC/AIC than the
BIC/AIC functions. Precisely, the BIC/AIC they report
come from the extractAIC function, which differs in
an additive constant from the output of BIC/AIC. This
is not relevant for model comparison, since shifting by
a common constant the BIC/AIC does not change the
lower-to-higher BIC/AIC ordering of models. However,
it is important to be aware of this fact in order not to
compare directly the output of stepAIC with the one of
BIC/AIC. The additive constant (included in BIC/AIC but
not in extractAIC) is n(log(2π ) + 1) + log(n) for the
BIC and n(log(2π ) + 1) + 2 for the AIC. The discrepancy
arises from simplifying the computation of the BIC/AIC
and from counting σ̂2 as an estimated parameter.

The following chunk of code illustrates the relation of the AIC


reported in stepAIC, the output of extractAIC, and the BIC/AIC
reported by BIC/AIC.
# Same BICs, different scale
n <- nobs(modBIC)
extractAIC(modBIC, k = log(n))[2]
## [1] -56.55135
BIC(modBIC)
## [1] 23.36717
# Observe that MASS::stepAIC(mod, k = log(nrow(wine))) returned as final BIC
# the one given by extractAIC(), not by BIC()! But both are equivalent, they
# just live in different scales

# Same AICs, different scale

extractAIC(modBIC, k = 2)[2]
## [1] -63.03053
AIC(modBIC)
## [1] 15.59215

# The additive constant: BIC() includes it but extractAIC() does not


BIC(modBIC) - extractAIC(modBIC, k = log(n))[2]
## [1] 79.91852
n * (log(2 * pi) + 1) + log(n)
## [1] 79.91852

# Same for the AIC


AIC(modAIC) - extractAIC(modAIC)[2]
## [1] 78.62268
n * (log(2 * pi) + 1) + 2
## [1] 78.62268

Note that the selected models modBIC and modAIC are equivalent
to the modWine2 we selected in Section 2.7.3 as the best model. This
is an illustration that the model selected by stepAIC is often a good
starting point for further additions or deletions of predictors.

When applying stepAIC for BIC/AIC, different final


models might be selected depending on the choice of
direction. This is the interpretation:

• "backward": removes predictors sequentially from the


given model.
• "forward": adds predictors sequentially to the given
model (using the predictors that are available in the
data argument of lm).
• "both" (default): combination of the above.

The advice is to try several of these methods and retain


the one with minimum BIC/AIC. Set trace = 0 to omit
lengthy outputs of information of the search procedure.

The chunk of code below clearly explains how to exploit the


direction argument, and other options of stepAIC, with a modi-
fied version of the wine dataset. An important takeaway is that if
employing direction = "forward" or direction = "both", scope
needs to be properly defined.
# Add an irrelevant predictor to the wine dataset
set.seed(123456)
wineNoise <- wine
n <- nrow(wineNoise)
wineNoise$noisePredictor <- rnorm(n)

# Backward selection: removes predictors sequentially from the given model

# Starting from the model with all the predictors


modAll <- lm(formula = Price ~ ., data = wineNoise)
MASS::stepAIC(modAll, direction = "backward", k = log(n))
## Start: AIC=-50.13
## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop +
## noisePredictor
##
##
## Step: AIC=-50.13
## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop +
## noisePredictor
##
## Df Sum of Sq RSS AIC
## - FrancePop 1 0.0036 1.7977 -53.376
## - Year 1 0.0038 1.7979 -53.374
## - noisePredictor 1 0.0090 1.8032 -53.295
## <none> 1.7941 -50.135
## - WinterRain 1 0.4598 2.2539 -47.271
## - HarvestRain 1 1.7666 3.5607 -34.923
## - AGST 1 3.3658 5.1599 -24.908
##
## Step: AIC=-53.38
## Price ~ Year + WinterRain + AGST + HarvestRain + noisePredictor
##
## Df Sum of Sq RSS AIC
## - noisePredictor 1 0.0081 1.8058 -56.551
## <none> 1.7977 -53.376
## - WinterRain 1 0.4771 2.2748 -50.317
## - Year 1 0.9162 2.7139 -45.552
## - HarvestRain 1 1.8449 3.6426 -37.606
## - AGST 1 3.4234 5.2212 -27.885
##
## Step: AIC=-56.55
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -56.551
## - WinterRain 1 0.4809 2.2867 -53.473
## - Year 1 0.9089 2.7147 -48.840
## - HarvestRain 1 1.8760 3.6818 -40.612
## - AGST 1 3.4428 5.2486 -31.039
##
## Call:
## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
## data = wineNoise)
##
## Coefficients:
## (Intercept) Year WinterRain AGST HarvestRain
## 43.639042 -0.023848 0.001167 0.616392 -0.003861

# Starting from an intermediate model


modInter <- lm(formula = Price ~ noisePredictor + Year + AGST, data = wineNoise)
MASS::stepAIC(modInter, direction = "backward", k = log(n))
## Start: AIC=-32.38
## Price ~ noisePredictor + Year + AGST
##
## Df Sum of Sq RSS AIC
## - noisePredictor 1 0.0146 5.0082 -35.601
## <none> 4.9936 -32.384
## - Year 1 0.7522 5.7459 -31.891
## - AGST 1 3.2211 8.2147 -22.240
##
## Step: AIC=-35.6
## Price ~ Year + AGST
##
## Df Sum of Sq RSS AIC
## <none> 5.0082 -35.601
## - Year 1 0.7966 5.8049 -34.911
## - AGST 1 3.2426 8.2509 -25.417
##
## Call:
## lm(formula = Price ~ Year + AGST, data = wineNoise)
##
## Coefficients:
## (Intercept) Year AGST
## 41.49441 -0.02221 0.56067
# Recall that other predictors not included in modInter are not explored
# during the search (so the relevant predictor HarvestRain is not added)

# Forward selection: adds predictors sequentially from the given model

# Starting from the model with no predictors, only intercept (denoted as ~ 1)


modZero <- lm(formula = Price ~ 1, data = wineNoise)
MASS::stepAIC(modZero, direction = "forward",
scope = list(lower = modZero, upper = modAll), k = log(n))
## Start: AIC=-22.28
## Price ~ 1
##
## Df Sum of Sq RSS AIC
## + AGST 1 4.6655 5.8049 -34.911
## + HarvestRain 1 2.6933 7.7770 -27.014
## + FrancePop 1 2.4231 8.0472 -26.092
## + Year 1 2.2195 8.2509 -25.417
## + Age 1 2.2195 8.2509 -25.417
## <none> 10.4703 -22.281
## + WinterRain 1 0.1905 10.2798 -19.481
## + noisePredictor 1 0.1761 10.2942 -19.443
##
## Step: AIC=-34.91
## Price ~ AGST
##
## Df Sum of Sq RSS AIC
## + HarvestRain 1 2.50659 3.2983 -46.878
## + WinterRain 1 1.42392 4.3809 -39.214
## + FrancePop 1 0.90263 4.9022 -36.178
## + Year 1 0.79662 5.0082 -35.601
## + Age 1 0.79662 5.0082 -35.601
## <none> 5.8049 -34.911
## + noisePredictor 1 0.05900 5.7459 -31.891
##
## Step: AIC=-46.88
## Price ~ AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## + FrancePop 1 1.03572 2.2625 -53.759
## + Year 1 1.01159 2.2867 -53.473
## + Age 1 1.01159 2.2867 -53.473
## + WinterRain 1 0.58356 2.7147 -48.840
## <none> 3.2983 -46.878
## + noisePredictor 1 0.06084 3.2374 -44.085
##
## Step: AIC=-53.76
## Price ~ AGST + HarvestRain + FrancePop
##
## Df Sum of Sq RSS AIC
## + WinterRain 1 0.45456 1.8080 -56.519
## <none> 2.2625 -53.759
## + noisePredictor 1 0.00829 2.2542 -50.562
## + Age 1 0.00085 2.2617 -50.473
## + Year 1 0.00085 2.2617 -50.473
##
## Step: AIC=-56.52
## Price ~ AGST + HarvestRain + FrancePop + WinterRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8080 -56.519
## + noisePredictor 1 0.0100635 1.7979 -53.374
## + Year 1 0.0048039 1.8032 -53.295
## + Age 1 0.0048039 1.8032 -53.295
##
## Call:
## lm(formula = Price ~ AGST + HarvestRain + FrancePop + WinterRain,
## data = wineNoise)
##
## Coefficients:
## (Intercept) AGST HarvestRain FrancePop WinterRain
## -5.945e-01 6.127e-01 -3.804e-03 -5.189e-05 1.136e-03
# It is very important to set the scope argument adequately when doing forward
# search! In the scope you have to define the "minimum" (lower) and "maximum"
# (upper) models that contain the set of explorable models. If not provided,
# the maximum model will be taken as the passed starting model (in this case
# modZero) and stepAIC will not do any search

# Starting from an intermediate model


MASS::stepAIC(modInter, direction = "forward",
scope = list(lower = modZero, upper = modAll), k = log(n))
## Start: AIC=-32.38
## Price ~ noisePredictor + Year + AGST
##
## Df Sum of Sq RSS AIC
## + HarvestRain 1 2.71878 2.2748 -50.317
## + WinterRain 1 1.35102 3.6426 -37.606
## <none> 4.9936 -32.384
## + FrancePop 1 0.24004 4.7536 -30.418
##
## Step: AIC=-50.32
## Price ~ noisePredictor + Year + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## + WinterRain 1 0.47710 1.7977 -53.376
## <none> 2.2748 -50.317
## + FrancePop 1 0.02094 2.2539 -47.271
##
## Step: AIC=-53.38
## Price ~ noisePredictor + Year + AGST + HarvestRain + WinterRain
##
## Df Sum of Sq RSS AIC
## <none> 1.7977 -53.376
## + FrancePop 1 0.0036037 1.7941 -50.135
##
## Call:
## lm(formula = Price ~ noisePredictor + Year + AGST + HarvestRain +
## WinterRain, data = wineNoise)
##
## Coefficients:
## (Intercept) noisePredictor Year AGST HarvestRain WinterRain
## 44.096639 -0.019617 -0.024126 0.620522 -0.003840 0.001211
# Recall that predictors included in modInter are not dropped during the
# search (so the irrelevant noisePredictor is kept)

# Both selection: useful if starting from an intermediate model

# Removes the problems associated to "backward" and "forward" searchs done


# from intermediate models
MASS::stepAIC(modInter, direction = "both",
scope = list(lower = modZero, upper = modAll), k = log(n))
## Start: AIC=-32.38
## Price ~ noisePredictor + Year + AGST
##
## Df Sum of Sq RSS AIC
## + HarvestRain 1 2.7188 2.2748 -50.317
## + WinterRain 1 1.3510 3.6426 -37.606
## - noisePredictor 1 0.0146 5.0082 -35.601
## <none> 4.9936 -32.384
## - Year 1 0.7522 5.7459 -31.891
## + FrancePop 1 0.2400 4.7536 -30.418
## - AGST 1 3.2211 8.2147 -22.240
##
## Step: AIC=-50.32
## Price ~ noisePredictor + Year + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## - noisePredictor 1 0.01182 2.2867 -53.473
## + WinterRain 1 0.47710 1.7977 -53.376
## <none> 2.2748 -50.317
## + FrancePop 1 0.02094 2.2539 -47.271
## - Year 1 0.96258 3.2374 -44.085
## - HarvestRain 1 2.71878 4.9936 -32.384
## - AGST 1 2.94647 5.2213 -31.180
##
## Step: AIC=-53.47
## Price ~ Year + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## + WinterRain 1 0.48087 1.8058 -56.551
## <none> 2.2867 -53.473
## + FrancePop 1 0.02497 2.2617 -50.473
## + noisePredictor 1 0.01182 2.2748 -50.317
## - Year 1 1.01159 3.2983 -46.878
## - HarvestRain 1 2.72157 5.0082 -35.601
## - AGST 1 2.96500 5.2517 -34.319
##
## Step: AIC=-56.55
## Price ~ Year + AGST + HarvestRain + WinterRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -56.551
## - WinterRain 1 0.4809 2.2867 -53.473
## + noisePredictor 1 0.0081 1.7977 -53.376
## + FrancePop 1 0.0026 1.8032 -53.295
## - Year 1 0.9089 2.7147 -48.840
## - HarvestRain 1 1.8760 3.6818 -40.612
## - AGST 1 3.4428 5.2486 -31.039
##
## Call:
## lm(formula = Price ~ Year + AGST + HarvestRain + WinterRain,
## data = wineNoise)
##
## Coefficients:
## (Intercept) Year AGST HarvestRain WinterRain
## 43.639042 -0.023848 0.616392 -0.003861 0.001167
# It is very important as well to correctly define the scope, because "both"
# resorts to "forward" (as well as to "backward")

# Using the defaults from the full model essentially does backward selection,
# but allowing predictors that were removed to enter again at later steps
MASS::stepAIC(modAll, direction = "both", k = log(n))
## Start: AIC=-50.13
## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop +
## noisePredictor
##
##
## Step: AIC=-50.13
## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop +
## noisePredictor
##
## Df Sum of Sq RSS AIC
## - FrancePop 1 0.0036 1.7977 -53.376
## - Year 1 0.0038 1.7979 -53.374
## - noisePredictor 1 0.0090 1.8032 -53.295
## <none> 1.7941 -50.135
## - WinterRain 1 0.4598 2.2539 -47.271
## - HarvestRain 1 1.7666 3.5607 -34.923
## - AGST 1 3.3658 5.1599 -24.908
##
## Step: AIC=-53.38
## Price ~ Year + WinterRain + AGST + HarvestRain + noisePredictor
##
## Df Sum of Sq RSS AIC
## - noisePredictor 1 0.0081 1.8058 -56.551
## <none> 1.7977 -53.376
## - WinterRain 1 0.4771 2.2748 -50.317
## + FrancePop 1 0.0036 1.7941 -50.135
## - Year 1 0.9162 2.7139 -45.552
## - HarvestRain 1 1.8449 3.6426 -37.606
## - AGST 1 3.4234 5.2212 -27.885
##
## Step: AIC=-56.55
## Price ~ Year + WinterRain + AGST + HarvestRain
##
## Df Sum of Sq RSS AIC
## <none> 1.8058 -56.551
## - WinterRain 1 0.4809 2.2867 -53.473
## + noisePredictor 1 0.0081 1.7977 -53.376
## + FrancePop 1 0.0026 1.8032 -53.295
## - Year 1 0.9089 2.7147 -48.840
## - HarvestRain 1 1.8760 3.6818 -40.612
## - AGST 1 3.4428 5.2486 -31.039
##
## Call:
## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
## data = wineNoise)
##
## Coefficients:
## (Intercept) Year WinterRain AGST HarvestRain
## 43.639042 -0.023848 0.001167 0.616392 -0.003861

# Omit lengthy outputs


MASS::stepAIC(modAll, direction = "both", trace = 0,
scope = list(lower = modZero, upper = modAll), k = log(n))
##
## Call:
## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
## data = wineNoise)
##
## Coefficients:
## (Intercept) Year WinterRain AGST HarvestRain
## 43.639042 -0.023848 0.001167 0.616392 -0.003861

Run the previous stepwise selections for the Boston


dataset, with the aim of clearly understanding the differ-
ent search directions. Specifically:

• Do a "forward" stepwise fit starting from medv ~ 1.


• Do a "forward" stepwise fit starting from medv ~ crim
+ lstat + age.
• Do a "both" stepwise fit starting from medv ~ crim +
lstat + age.
• Do a "both" stepwise fit starting from medv ~ ..
• Do a "backward" stepwise fit starting from medv ~ ..
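A possible starting point for the first item, mirroring the scope specification used above for the wine data (a sketch with the search trace omitted; the object names are arbitrary):

# Forward stepwise search from the intercept-only model for Boston
modZeroBoston <- lm(medv ~ 1, data = Boston)
modFullBoston <- lm(medv ~ ., data = Boston)
MASS::stepAIC(modZeroBoston, direction = "forward", trace = 0,
              scope = list(lower = modZeroBoston, upper = modFullBoston),
              k = log(nrow(Boston)))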

stepAIC assumes that no NA’s (missing values) are


present in the data. It is advised to remove the missing
values in the data beforehand. Their presence might lead
to errors. To do so, employ data = na.omit(dataset) in
the call to lm (if your dataset is dataset). Also, see Ap-
pendix A.4 for possible alternatives to deal with missing
data.

We conclude by highlighting a caveat on the use of the BIC and


AIC: they are constructed assuming that the sample size n is much
larger than the number of parameters in the model (p + 2). There-
fore, they will work reasonably well if n >> p + 2, but if this is
not true they may favor unrealistically complex models. An illustration
of this phenomenon is Figure 3.3, which is the BIC/AIC version
of Figure 2.19 for the experiment done in Section 2.6. The BIC and
AIC curves tend to have local minima close to p = 2 and then in-
crease. But when p + 2 gets close to n, they quickly drop down.
Note also how the BIC penalizes the complexity more than the AIC,
which is flatter.

3.2.1 Case study application


We want to build a linear model for predicting and explaining medv.
There are a good number of predictors and some of them might be
of little use for predicting medv. However, there is no clear intuition
of which predictors will yield better explanations of medv with
the information at hand. Therefore, we can start by doing a linear
model on all the predictors:

modHouse <- lm(medv ~ ., data = Boston)


Figure 3.3: Comparison of BIC and AIC for n = 200 and p ranging from 1 to 198. M = 100 datasets were simulated with only the first two predictors being significant. The thicker curves are the mean of each color’s curves.

summary(modHouse)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16

There are a couple of non-significant variables, but so far the


model has an R² = 0.74 and the fitted coefficients are in line
with what would be expected. For example, crim, tax, ptratio,
and nox have negative effects on medv, while rm, rad, and chas
have positive ones. However, the non-significant coefficients are not
significantly affecting the model, but only adding artificial noise
and decreasing the overall accuracy of the coefficient estimates.
Let’s polish the previous model a little bit. Instead of manually
removing each non-significant variable to reduce the complexity, we
employ stepAIC for selecting a candidate best model:

# Best models
modBIC <- MASS::stepAIC(modHouse, k = log(nrow(Boston)))
## Start: AIC=1648.81
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - age 1 0.06 11079 1642.6
## - indus 1 2.52 11081 1642.7
## <none> 11079 1648.8
## - chas 1 218.97 11298 1652.5
## - tax 1 242.26 11321 1653.5
## - crim 1 243.22 11322 1653.6
## - zn 1 257.49 11336 1654.2
## - black 1 270.63 11349 1654.8
## - rad 1 479.15 11558 1664.0
## - nox 1 487.16 11566 1664.4
## - ptratio 1 1194.23 12273 1694.4
## - dis 1 1232.41 12311 1696.0
## - rm 1 1871.32 12950 1721.6
## - lstat 1 2410.84 13490 1742.2
##
## Step: AIC=1642.59
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
## ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - indus 1 2.52 11081 1636.5
## <none> 11079 1642.6
## - chas 1 219.91 11299 1646.3
## - tax 1 242.24 11321 1647.3
## - crim 1 243.20 11322 1647.3
## - zn 1 260.32 11339 1648.1
## - black 1 272.26 11351 1648.7
## - rad 1 481.09 11560 1657.9
## - nox 1 520.87 11600 1659.6
## - ptratio 1 1200.23 12279 1688.4
## - dis 1 1352.26 12431 1694.6
## - rm 1 1959.55 13038 1718.8
## - lstat 1 2718.88 13798 1747.4
##
## Step: AIC=1636.48
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
## black + lstat
##
## Df Sum of Sq RSS AIC
## <none> 11081 1636.5
## - chas 1 227.21 11309 1640.5
## - crim 1 245.37 11327 1641.3
## - zn 1 257.82 11339 1641.9
## - black 1 270.82 11352 1642.5
## - tax 1 273.62 11355 1642.6
## - rad 1 500.92 11582 1652.6
## - nox 1 541.91 11623 1654.4
## - ptratio 1 1206.45 12288 1682.5
## - dis 1 1448.94 12530 1692.4
## - rm 1 1963.66 13045 1712.8
## - lstat 1 2723.48 13805 1741.5
modAIC <- MASS::stepAIC(modHouse, trace = 0, k = 2)

# Comparison
car::compareCoefs(modBIC, modAIC)
## Calls:
## 1: lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + black + lstat, data = Boston)
## 2: lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + black + lstat, data = Boston)
##
## Model 1 Model 2
## (Intercept) 36.34 36.34
## SE 5.07 5.07
##
## crim -0.1084 -0.1084
## SE 0.0328 0.0328
##
## zn 0.0458 0.0458
## SE 0.0135 0.0135
##
## chas 2.719 2.719
## SE 0.854 0.854
##
## nox -17.38 -17.38
## SE 3.54 3.54
##
## rm 3.802 3.802
## SE 0.406 0.406
##
## dis -1.493 -1.493
## SE 0.186 0.186
##
## rad 0.2996 0.2996
## SE 0.0634 0.0634
##
## tax -0.01178 -0.01178
## SE 0.00337 0.00337
##
## ptratio -0.947 -0.947
## SE 0.129 0.129
##
## black 0.00929 0.00929
## SE 0.00267 0.00267
##
## lstat -0.5226 -0.5226
## SE 0.0474 0.0474
##
summary(modBIC)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16

# Confidence intervals
confint(modBIC)
## 2.5 % 97.5 %
## (Intercept) 26.384649126 46.29764088
## crim -0.172817670 -0.04400902
## zn 0.019275889 0.07241397
## chas 1.040324913 4.39710769
## nox -24.321990312 -10.43005655
## rm 3.003258393 4.59989929
## dis -1.857631161 -1.12779176
## rad 0.175037411 0.42417950
## tax -0.018403857 -0.00515209
## ptratio -1.200109823 -0.69293932
## black 0.004037216 0.01454447
## lstat -0.615731781 -0.42937513

Note how the R2Adj has slightly increased with respect to the full
model and how all the predictors are significant. Note also that
modBIC and modAIC are the same.
We have quantified the influence of the predictor variables on the
housing prices (Q1) and we can conclude that, in the final model
(Q2) and with significance level α = 0.05:

• zn, chas, rm, rad, and black have a significantly positive influence
on medv;
• crim, nox, dis, tax, ptratio, and lstat have a significantly negative
influence on medv.

The functions MASS::addterm and MASS::dropterm allow
adding and removing all individual predictors to and from a
given model, and report the BICs / AICs of the possible
combinations. Check that:

• modBIC can not be improved in terms of BIC by re-


moving predictors. Use dropterm(modBIC, k =
log(nobs(modBIC))) for that.
• modBIC can not be improved in terms of BIC by adding
predictors. Use addterm(modBIC, scope = lm(medv ~
., data = Boston), k = log(nobs(modBIC))) for
that. scope must specify the maximal model or for-
mula. However, be careful because addterm(modBIC,
scope = medv ~ ., k = log(nobs(modBIC))) will
understand that . refers to all the predictors in mod-
BIC, not in the Boston dataset, and will return an
error. Calling addterm(modBIC, scope = medv ~ .
+ indus + age, k = log(nobs(modBIC))) gives the
required result, at the expense of manually adding the
remaining predictors.

3.3 Use of qualitative predictors

An important situation not covered so far is how to deal with qual-


itative, and not quantitative, predictors. Qualitative predictors, also
known as categorical variables or, in R’s terminology, factors, are very
common, for example in social sciences. Dealing with them requires
some care and proper understanding of how these variables are
represented.
The simplest case is the situation with two levels. A binary


variable C with two levels (for example, a and b) can be encoded as
D = 1, if C = b,
D = 0, if C = a.
D now is a dummy variable: it codifies with zeros and ones the two
possible levels of the categorical variable. An example of C could
be gender, which has levels male and female. The dummy variable
associated is D = 0 if the gender is male and D = 1 if the gender is
female.
The advantage of this dummification is its interpretability in re-
gression models. Since level a corresponds to 0, it can be seen as the
reference level to which level b is compared. This is the key point in
dummification: set one level as the reference and codify the rest
as departures from it with ones.
The previous interpretation translates easily to the linear model.
Assume that the dummy variable D is available together with other
predictors X1 , . . . , X p . Then:

E[Y | X1 = x1 , . . . , Xp = xp , D = d] = β0 + β1 x1 + . . . + βp xp + βp+1 d.

The coefficient associated to D is easily interpretable: β p+1 is the


increment in mean of Y associated to changing D = 0 (reference) to
D = 1, while the rest of the predictors are fixed. Or in other words,
β p+1 is the increment in mean of Y associated to changing the level
of the categorical variable from a to b.
R does the dummification automatically (translates a categori-
cal variable C into its dummy version D) if it detects that a factor
variable is present in the regression model.
Let’s see now the case with more than two levels, for example,
a categorical variable C with levels a, b, and c. If we take a as the
reference level, this variable can be represented by two dummy
variables:
D1 = 1, if C = b,
D1 = 0, if C ≠ b,
and
D2 = 1, if C = c,
D2 = 0, if C ≠ c.

Then C = a is represented by ( D1 , D2 ) = (0, 0), C = b is represented


by ( D1 , D2 ) = (1, 0), and C = c is represented by ( D1 , D2 ) = (0, 1).
The interpretation of the regression models with the presence of D1
and D2 is very similar to the one before. For example, for the linear
model, the coefficient associated to D1 gives the increment in mean
of Y when the category of C changes from a to b, and the coefficient
for D2 gives the increment in mean of Y when C changes from a to
c.
In general, if we have a categorical variable C with J levels, then
the previous process is iterated and the number of dummy vari-
ables required to encode C is J − 1⁸.
⁸ One reference level taken by default and J − 1 possible departures from it.
Again, R does the dummification
automatically if it detects that a factor variable is present in the


regression model.
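The dummy variables that R creates internally can be inspected through the design matrix; a minimal sketch with a made-up three-level factor:

# R expands a J-level factor into J - 1 dummy variables in the design matrix
C <- factor(c("a", "a", "b", "c", "b", "c"))
model.matrix(~ C) # Columns: intercept, Cb, and Cc ("a" is the reference level)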
Let’s see an example with the famous iris dataset.
# iris dataset -- factors in the last column
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
## Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500

# Summary of a linear model


mod1 <- lm(Sepal.Length ~ ., data = iris)
summary(mod1)
##
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79424 -0.21874 0.00899 0.20255 0.73103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.17127 0.27979 7.760 1.43e-12 ***
## Sepal.Width 0.49589 0.08607 5.761 4.87e-08 ***
## Petal.Length 0.82924 0.06853 12.101 < 2e-16 ***
## Petal.Width -0.31516 0.15120 -2.084 0.03889 *
## Speciesversicolor -0.72356 0.24017 -3.013 0.00306 **
## Speciesvirginica -1.02350 0.33373 -3.067 0.00258 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
## F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
# Speciesversicolor (D1) coefficient: -0.72356. The average increment of
# Sepal.Length when the species is versicolor instead of setosa (reference)
# Speciesvirginica (D2) coefficient: -1.02350. The average increment of
# Sepal.Length when the species is virginica instead of setosa (reference)
# Both dummy variables are significant

# How to set a different level as reference (versicolor)


iris$Species <- relevel(iris$Species, ref = "versicolor")

# Same estimates except for the dummy coefficients


mod2 <- lm(Sepal.Length ~ ., data = iris)
summary(mod2)
##
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79424 -0.21874 0.00899 0.20255 0.73103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.44770 0.28149 5.143 8.68e-07 ***
## Sepal.Width 0.49589 0.08607 5.761 4.87e-08 ***
## Petal.Length 0.82924 0.06853 12.101 < 2e-16 ***
## Petal.Width -0.31516 0.15120 -2.084 0.03889 *
## Speciessetosa 0.72356 0.24017 3.013 0.00306 **
## Speciesvirginica -0.29994 0.11898 -2.521 0.01280 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
## F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
# Speciessetosa (D1) coefficient: 0.72356. The average increment of
# Sepal.Length when the species is setosa instead of versicolor (reference)
# Speciesvirginica (D2) coefficient: -0.29994. The average increment of
# Sepal.Length when the species is virginica instead of versicolor (reference)
# Both dummy variables are significant

# Coefficients of the model


confint(mod2)
## 2.5 % 97.5 %
## (Intercept) 0.8913266 2.00408209
## Sepal.Width 0.3257653 0.66601260
## Petal.Length 0.6937939 0.96469395
## Petal.Width -0.6140049 -0.01630542
## Speciessetosa 0.2488500 1.19827390
## Speciesvirginica -0.5351144 -0.06475727
# The coefficient of Speciessetosa is significantly positive and that of
# Speciesvirginica is significantly negative

# Show the dummy variables employed for encoding a factor


contrasts(iris$Species)
## setosa virginica
## versicolor 0 0
## setosa 1 0
## virginica 0 1
iris$Species <- relevel(iris$Species, ref = "setosa")
contrasts(iris$Species)
## versicolor virginica
## setosa 0 0
## versicolor 1 0
## virginica 0 1

Do not codify a categorical variable as a discrete vari-


able. This constitutes a major methodological error that
will compromise the subsequent statistical analysis.
For example, if you have a categorical variable party with
levels partyA, partyB, and partyC, do not encode it as
a discrete variable taking the values 1, 2, and 3, respec-
tively. If you do so:

• You assume implicitly an order in the levels of party,


since partyA is closer to partyB than to partyC.
• You assume implicitly that partyC is three times larger
than partyA.
• The codification is completely arbitrary – why not
consider 1, 1.5, and 1.75 instead?

The right way of dealing with categorical variables in


regression is to set the variable as a factor and let R do
the dummification internally.
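A minimal sketch of the right approach with a made-up party variable (both the labels and the response values are invented for illustration):

# Declare party as a factor and let lm() create the dummy variables
y <- c(2.1, 3.4, 1.7, 2.8, 3.0, 1.5)
party <- factor(c("partyA", "partyB", "partyC", "partyA", "partyB", "partyC"))
lm(y ~ party) # R codifies party internally via the dummies partyB and partyC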
It may happen that one dummy variable, say D1 , is not


significant, while other dummy variables, say D2 , are sig-
nificant. For example, this happens in the example above
at level α = 0.01.

3.3.1 Case study application


Let’s see what the dummy variables are in the Boston dataset and
what effect they have on medv.
# Load the Boston dataset
data(Boston, package = "MASS")

# Structure of the data


str(Boston)
## ’data.frame’: 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# chas is a dummy variable measuring if the suburb is close to the river (1)
# or not (0). In this case it is not codified as a factor but as a 0 or 1
# (so it is already dummyfied)

# Summary of a linear model


mod <- lm(medv ~ chas + crim, data = Boston)
summary(mod)
##
## Call:
## lm(formula = medv ~ chas + crim, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.540 -5.421 -1.878 2.575 30.134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.61403 0.41862 56.409 < 2e-16 ***
## chas 5.57772 1.46926 3.796 0.000165 ***
## crim -0.40598 0.04339 -9.358 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 8.373 on 503 degrees of freedom
## Multiple R-squared: 0.1744, Adjusted R-squared: 0.1712
## F-statistic: 53.14 on 2 and 503 DF, p-value: < 2.2e-16
# The coefficient associated to chas is 5.57772. That means that if the suburb
# is close to the river, the mean of medv increases by 5.57772 units, for
# the same house and neighborhood conditions
# chas is significant (the presence of the river adds a valuable information
# for explaining medv)

# Summary of the best model in terms of BIC


summary(modBIC)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
# The coefficient associated with chas is 2.71871. If the suburb is close to
# the river, the mean of medv increases by 2.71871 units
# chas is significant as well in the presence of more predictors

We will see how to mix dummy and quantitative predictors in Section 3.4.3.

3.4 Nonlinear relationships

3.4.1 Transformations in the simple linear model


The linear model is termed linear not because the regression curve
is a plane, but because the effects of the parameters are linear. Indeed,
the predictor X may exhibit a nonlinear effect on the response Y
and still be a linear model! For example, the following models can
be transformed into simple linear models:

1. Y = β0 + β1 X^2 + ε
2. Y = β0 + β1 log(X) + ε
3. Y = β0 + β1 (X^3 − log(|X|) + 2X) + ε

The trick is to work with the transformed predictors (X^2, log(X), ...), instead of with the original predictor X. Then, rather than working with the sample {(Xi, Yi)}_{i=1}^n, we consider the transformed sample {(X̃i, Yi)}_{i=1}^n with, for the above examples:

1. X̃i = Xi^2, i = 1, ..., n.
2. X̃i = log(Xi), i = 1, ..., n.
3. X̃i = Xi^3 − log(|Xi|) + 2Xi, i = 1, ..., n.

An example of this simple but powerful trick is given as follows.


The left panel of Figure 3.4 shows the scatterplot for some data

y and x, together with its fitted regression line. Clearly, the data does not follow a linear pattern, but a nonlinear one, similar to a parabola y = x². Hence, y might be better explained by the square of x, x^2, rather than by x. Indeed, if we plot y against x^2 in the right panel of Figure 3.4, we can see that the relation of y and x^2 is now linear!

Figure 3.4: Left: quadratic pattern when plotting Y against X. Right: linearized pattern when plotting Y against X². In red, the fitted regression line.

In conclusion, with a simple trick we have drastically increased the proportion of variability of the response that is explained. However, there is a catch: knowing which transformation is required in order to linearize the relation between the response and the predictor is a kind of art that requires a good eye. A first approach is to consider one of the usual transformations, which are displayed in Figure 3.6, depending on the pattern of the data. Figure 3.5 illustrates how to choose an adequate transformation for linearizing certain nonlinear data patterns.

Figure 3.5: Illustration of the choice of the nonlinear transformation. Application available here.

If you apply a nonlinear transformation, namely f, and fit the linear model Y = β0 + β1 f(X) + ε, then there is no point in also fitting the model resulting from the negative transformation −f. The model with −f is exactly the same as the one with f, but with the sign of β1 flipped! As a rule of thumb, compare the data pattern with the transformations displayed in Figure 3.6, choose the most similar curve, and then apply the corresponding function with positive sign. A quick check of this sign flip is sketched below.
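The following short sketch (with simulated data, not taken from the text) verifies the sign-flip claim numerically:

# Sketch (simulated data): using -f instead of f only flips the sign of the
# slope estimate; the fitted values and R^2 are identical
set.seed(1)
x <- runif(50, min = 1, max = 5)
y <- 1 + 2 * log(x) + rnorm(50, sd = 0.1)
coef(lm(y ~ I(log(x))))  # Slope close to 2
coef(lm(y ~ I(-log(x)))) # Slope close to -2, same intercept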

Figure 3.6: Some common nonlinear transformations (y = x, x², x³, √x, exp(x), exp(−x), log(x)) and their negative counterparts. Recall the domain of definition of each transformation.

Let’s see how to compute transformations of our predictors and


perform a linear regression with them. The data for Figure 3.4 is:

# Data
x <- c(-2, -1.9, -1.7, -1.6, -1.4, -1.3, -1.1, -1, -0.9, -0.7, -0.6,
-0.4, -0.3, -0.1, 0, 0.1, 0.3, 0.4, 0.6, 0.7, 0.9, 1, 1.1, 1.3,
1.4, 1.6, 1.7, 1.9, 2, 2.1, 2.3, 2.4, 2.6, 2.7, 2.9, 3, 3.1,
3.3, 3.4, 3.6, 3.7, 3.9, 4, 4.1, 4.3, 4.4, 4.6, 4.7, 4.9, 5)
y <- c(1.4, 0.4, 2.4, 1.7, 2.4, 0, 0.3, -1, 1.3, 0.2, -0.7, 1.2, -0.1,
-1.2, -0.1, 1, -1.1, -0.9, 0.1, 0.8, 0, 1.7, 0.3, 0.8, 1.2, 1.1,
2.5, 1.5, 2, 3.8, 2.4, 2.9, 2.7, 4.2, 5.8, 4.7, 5.3, 4.9, 5.1,
6.3, 8.6, 8.1, 7.1, 7.9, 8.4, 9.2, 12, 10.5, 8.7, 13.5)

# Data frame (a matrix with column names)


nonLinear <- data.frame(x = x, y = y)

# We create a new column inside nonLinear, called x2, that contains the
# new variable x^2
nonLinear$x2 <- nonLinear$x^2
# If you wish to remove it
# nonLinear$x2 <- NULL

# Regressions
mod1 <- lm(y ~ x, data = nonLinear)
mod2 <- lm(y ~ x2, data = nonLinear)
summary(mod1)
##
## Call:
## lm(formula = y ~ x, data = nonLinear)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5268 -1.7513 -0.4017 0.9750 5.0265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9771 0.3506 2.787 0.0076 **
## x 1.4993 0.1374 10.911 1.35e-14 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2.005 on 48 degrees of freedom
## Multiple R-squared: 0.7126, Adjusted R-squared: 0.7067
## F-statistic: 119 on 1 and 48 DF, p-value: 1.353e-14
summary(mod2)
##
## Call:
## lm(formula = y ~ x2, data = nonLinear)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0418 -0.5523 -0.1465 0.6286 1.8797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.05891 0.18462 0.319 0.751
## x2 0.48659 0.01891 25.725 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.9728 on 48 degrees of freedom
## Multiple R-squared: 0.9324, Adjusted R-squared: 0.931
## F-statistic: 661.8 on 1 and 48 DF, p-value: < 2.2e-16
# mod2 has a larger R^2. Also notice the intercept is not significant

A fast way of performing and summarizing the quadratic


fit is

summary(lm(y ~ I(x^2), data = nonLinear))

The I() function wrapping x^2 is fundamental when applying arithmetic operations in the predictor. The symbols +, *, ^, ... have different meanings when used in a formula, so it is required to use I() to indicate that they must be interpreted in their arithmetic meaning and that the result of the expression denotes a new predictor variable. For example, use I((x - 1)^3 - log(3 * x)) if you want to apply the transformation (x - 1)^3 - log(3 * x).
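As a quick check of this point (a sketch using the nonLinear data from above), note that without I() the ^ symbol is interpreted as formula crossing and not as exponentiation:

# Without I(), ^ is formula crossing: y ~ x^2 reduces to y ~ x
lm(y ~ x^2, data = nonLinear)    # Same fit as lm(y ~ x, data = nonLinear)
lm(y ~ I(x^2), data = nonLinear) # Regression on the new predictor x^2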

Load the dataset assumptions.RData. We are going to


work with the regressions y2 ~ x2, y3 ~ x3, y8 ~ x8,
and y9 ~ x9, in order to identify which transformation
of Figure 3.6 gives the best fit. (For the purpose of illus-
tration, we do not care if the assumptions are respected.)
For these, do the following:

• Find the transformation that yields the largest R2 .


• Compare the original and the transformed linear mod-
els.

Some hints:

• y2 ~ x2 has a negative dependence, so look at the


right panel of Figure 3.5.
• y3 ~ x3 seems to have just a subtle nonlinearity... Will it be worth attempting a transformation?
• For y9 ~ x9, try also with exp(-abs(x9)), log(abs(x9)), and 2^abs(x9).

3.4.2 Polynomial transformations

A powerful nonlinear extension of the linear model is given by polynomial models. These are constructed by replacing each predictor Xj by the set of monomials (Xj, Xj^2, ..., Xj^k) that are formed from Xj. In the case with a single predictor X, we have the k-th order polynomial fit:

Y = β0 + β1 X + ... + βk X^k + ε.

With this approach, a highly flexible model is produced, as was shown in Figure 1.3. The creation of such models can be automated by the use of the function poly, which for the observations

(X1, ..., Xn) of X creates the matrices

$$
\begin{pmatrix}
X_1 & X_1^2 & \cdots & X_1^k \\
\vdots & \vdots & \ddots & \vdots \\
X_n & X_n^2 & \cdots & X_n^k
\end{pmatrix}
\quad \text{or} \quad
\begin{pmatrix}
p_1(X_1) & p_2(X_1) & \cdots & p_k(X_1) \\
\vdots & \vdots & \ddots & \vdots \\
p_1(X_n) & p_2(X_n) & \cdots & p_k(X_n)
\end{pmatrix},
\tag{3.2}
$$

where p1, ..., pk are orthogonal polynomials[10] of orders 1, ..., k, respectively. The matrices in (3.2) can be computed through R's poly function which, if the considered polynomials are orthogonal, is guaranteed to yield a data matrix in (3.2) with uncorrelated[11] columns.

[10] Precisely, the type of orthogonal polynomials considered is a data-driven shifting and rescaling of the Legendre polynomials in (−1, 1). The Legendre polynomials $p_j$ and $p_l$ in (−1, 1) satisfy $\int_{-1}^{1} p_j(x) p_l(x)\,\mathrm{d}x = \frac{2}{2j+1}\delta_{jl}$, where $\delta_{jl}$ is the Kronecker delta. The first five (excluding $p_0(x) = 1$) Legendre polynomials in (−1, 1) are, respectively: $p_1(x) = x$, $p_2(x) = \frac{1}{2}(3x^2 - 1)$, $p_3(x) = \frac{1}{2}(5x^3 - 3x)$, $p_4(x) = \frac{1}{8}(35x^4 - 30x^2 + 3)$, and $p_5(x) = \frac{1}{8}(63x^5 - 70x^3 + 15x)$. The recursive formula $p_{k+1}(x) = (k + 1)^{-1}\left[(2k + 1)\,x\,p_k(x) - k\,p_{k-1}(x)\right]$ allows to obtain the (k+1)-th order Legendre polynomial from the previous two ones. Notice that from k ≥ 3, $p_k(x)$ involves more monomials than just $x^k$, precisely all the monomials with a lower order and with the same parity. This may complicate the interpretation of a coefficient related to $p_k(X)$ for k ≥ 3, as it is not directly associated to a k-th order effect of X.

[11] This is due to how the second matrix of (3.2) is computed: by means of a QR decomposition associated to the first matrix. The decomposition yields a data matrix formed by polynomials that are not exactly the Legendre polynomials, but a data-driven shifting and rescaling of them.

Let's see a couple of examples.

x1 <- seq(-1, 1, l = 4)
poly(x = x1, degree = 2, raw = TRUE) # (X, X^2)
##               1         2
## [1,] -1.0000000 1.0000000
## [2,] -0.3333333 0.1111111
## [3,]  0.3333333 0.1111111
## [4,]  1.0000000 1.0000000
## attr(,"degree")
## [1] 1 2
## attr(,"class")
## [1] "poly"   "matrix"
poly(x = x1, degree = 2) # By default, it employs orthogonal polynomials
##               1    2
## [1,] -0.6708204  0.5
## [2,] -0.2236068 -0.5
## [3,]  0.2236068 -0.5
## [4,]  0.6708204  0.5
## attr(,"coefs")
## attr(,"coefs")$alpha
## [1] -5.551115e-17 -4.649059e-17
##
## attr(,"coefs")$norm2
## [1] 1.0000000 4.0000000 2.2222222 0.7901235
##
## attr(,"degree")
## [1] 1 2
## attr(,"class")
## [1] "poly"   "matrix"

# Depiction of raw polynomials
x <- seq(-1, 1, l = 200)
degree <- 5
matplot(x, poly(x, degree = degree, raw = TRUE), type = "l", lty = 1,
        ylab = expression(x^k))
legend("bottomright", legend = paste("k =", 1:degree), col = 1:degree, lwd = 2)

Figure 3.7: Raw polynomials (x^k) in (−1, 1) up to degree k = 5.

# Depiction of orthogonal polynomials
matplot(x, poly(x, degree = degree), type = "l", lty = 1,
        ylab = expression(p[k](x)))
legend("bottomright", legend = paste("k =", 1:degree), col = 1:degree, lwd = 2)

Figure 3.8: Orthogonal polynomials (p_k(x)) in (−1, 1) up to degree k = 5.

These matrices can now be used as inputs in the predictor side of lm. Let's see this in an example.

# Data containing speed (mph) and stopping distance (ft) of cars from 1920
data(cars)
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# Fit a linear model of dist ~ speed
mod1 <- lm(dist ~ speed, data = cars)
abline(coef = mod1$coefficients, col = 2)

# Quadratic
mod2 <- lm(dist ~ poly(speed, degree = 2), data = cars)
# The fit is not a line, so abline() cannot be used; we evaluate the fit on a
# grid and add it with lines()
d <- seq(0, 25, length.out = 200)
lines(d, predict(mod2, new = data.frame(speed = d)), col = 3)

# Cubic
mod3 <- lm(dist ~ poly(speed, degree = 3), data = cars)
lines(d, predict(mod3, new = data.frame(speed = d)), col = 4)

# 10th order -- overfitting


mod10 <- lm(dist ~ poly(speed, degree = 10), data = cars)
lines(d, predict(mod10, new = data.frame(speed = d)), col = 5)

# BICs -- the linear model is better!
BIC(mod1, mod2, mod3, mod10)
##       df      BIC
## mod1   3 424.8929
## mod2   4 426.4202
## mod3   5 429.4451
## mod10 12 450.3523

Figure 3.9: Raw and orthogonal polynomial fits of dist ~ speed in the cars dataset.

# poly computes by default orthogonal polynomials. These are not
# X^1, X^2, ..., X^p but combinations of them such that the polynomials are
# orthogonal. 'Raw' polynomials are possible with raw = TRUE. They give the
# same fit, but the coefficient estimates are different.
mod2Raw <- lm(dist ~ poly(speed, degree = 2, raw = TRUE), data = cars)
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
lines(d, predict(mod2, new = data.frame(speed = d)), col = 1)
lines(d, predict(mod2Raw, new = data.frame(speed = d)), col = 2)

Figure 3.10: Raw and orthogonal polynomial fits of dist ~ speed in the cars dataset.

# However: different coefficient estimates, but same R^2. How is this possible?
summary(mod2)
##
## Call:
## lm(formula = dist ~ poly(speed, degree = 2), data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -28.720  -9.184  -3.188   4.628  45.152
##
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                42.980      2.146  20.026  < 2e-16 ***
## poly(speed, degree = 2)1  145.552     15.176   9.591 1.21e-12 ***
## poly(speed, degree = 2)2   22.996     15.176   1.515    0.136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.18 on 47 degrees of freedom
## Multiple R-squared: 0.6673, Adjusted R-squared: 0.6532
## F-statistic: 47.14 on 2 and 47 DF, p-value: 5.852e-12
summary(mod2Raw)
##
## Call:
## lm(formula = dist ~ poly(speed, degree = 2, raw = TRUE), data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -28.720  -9.184  -3.188   4.628  45.152
##
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                           2.47014   14.81716   0.167    0.868
## poly(speed, degree = 2, raw = TRUE)1  0.91329    2.03422   0.449    0.656
## poly(speed, degree = 2, raw = TRUE)2  0.09996    0.06597   1.515    0.136
##
## Residual standard error: 15.18 on 47 degrees of freedom
## Multiple R-squared: 0.6673, Adjusted R-squared: 0.6532
## F-statistic: 47.14 on 2 and 47 DF, p-value: 5.852e-12

# Because the predictors in mod2Raw are highly related between them, and
# the ones in mod2 are uncorrelated between them!
car::scatterplotMatrix(mod2$model[, -1], col = 1, regLine = list(col = 2),
                       smooth = list(col.smooth = 4, col.spread = 4))
car::scatterplotMatrix(mod2Raw$model[, -1], col = 1, regLine = list(col = 2),
                       smooth = list(col.smooth = 4, col.spread = 4))
cor(mod2$model[, -1])
##              1            2
## 1 1.000000e+00 4.686464e-17
## 2 4.686464e-17 1.000000e+00
cor(mod2Raw$model[, -1])
##           1         2
## 1 1.0000000 0.9794765
## 2 0.9794765 1.0000000
Figure 3.11: Correlations between the first and second order orthogonal polynomials associated to speed, and between speed and speed^2.

The use of orthogonal polynomials instead of raw polynomials is advised for high-order polynomial fits, since they avoid the numerical instabilities arising from excessive linear dependencies between the raw polynomial predictors.
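The following sketch (not from the original text) illustrates this numerical point by comparing the conditioning of raw and orthogonal polynomial design matrices:

# Sketch: condition numbers of the correlation matrices of raw vs. orthogonal
# polynomial predictors of degree 10 (the grid below is arbitrary)
x <- seq(0, 10, l = 100)
kappa(cor(poly(x, degree = 10, raw = TRUE))) # Huge: near-collinear columns
kappa(cor(poly(x, degree = 10)))             # Close to 1: uncorrelated columns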

3.4.3 Interactions
When two or more predictors X1 and X2 are present, it may be of interest to explore the interaction between them by means of X1X2. This is a new variable that positively (negatively) affects the response Y when both X1 and X2 are positive or negative at the same time (at different times):

Y = β0 + β1 X1 + β2 X2 + β3 X1X2 + ε.

The coefficient β3 in Y = β0 + β1 X1 + β2 X2 + β3 X1X2 + ε can be interpreted as the increment of the effect of the predictor X1 on the mean of Y for a unit increment in X2[12]. Significance testing on these coefficients can be carried out as usual.

The way of adding these interactions in lm is through the operators : and *. The operator : only adds the term X1X2, while * adds X1, X2, and X1X2. Let's see an example in the Boston dataset.

[12] Of course, the roles of X1 and X2 can be interchanged in this interpretation.
# Interaction between lstat and age
summary(lm(medv ~ lstat + lstat:age, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat + lstat:age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.815 -4.039 -1.335 2.086 27.491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.041514 0.691334 52.133 < 2e-16 ***


## lstat -1.388161 0.126911 -10.938 < 2e-16 ***
## lstat:age 0.004103 0.001133 3.621 0.000324 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 6.142 on 503 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.554
## F-statistic: 314.6 on 2 and 503 DF, p-value: < 2.2e-16
# For a unit increment in age, the effect of lstat on the response
# increases positively by 0.004103 units, shifting from -1.388161 to -1.384058

# Thus, an increase in age makes lstat affect medv less negatively.
# Note that the same interpretation does NOT hold if we switch the roles
# of age and lstat because age is not present as a sole predictor!

# First order interaction


summary(lm(medv ~ lstat * age, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16

# Second order interaction


summary(lm(medv ~ lstat * age * indus, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat * age * indus, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1549 -3.6437 -0.8427 2.1991 24.8751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.103752 2.891173 15.946 < 2e-16 ***
## lstat -2.641475 0.372223 -7.096 4.43e-12 ***
## age -0.042300 0.041668 -1.015 0.31051
## indus -1.849829 0.380252 -4.865 1.54e-06 ***
## lstat:age 0.014249 0.004437 3.211 0.00141 **
## lstat:indus 0.177418 0.037647 4.713 3.18e-06 ***
## age:indus 0.014332 0.004386 3.268 0.00116 **
## lstat:age:indus -0.001621 0.000408 -3.973 8.14e-05 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 5.929 on 498 degrees of freedom
## Multiple R-squared: 0.5901, Adjusted R-squared: 0.5844
## F-statistic: 102.4 on 7 and 498 DF, p-value: < 2.2e-16

A fast way of accounting for interactions between predictors is to use the ^ operator in lm:

• lm(y ~ (x1 + x2 + x3)^2) equals lm(y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3). Higher powers like lm(y ~ (x1 + x2 + x3)^3) serve to include up to second-order interactions like x1:x2:x3.
• It is possible to regress on all the predictors and the first-order interactions using lm(y ~ .^2).
• Further flexibility in lm is possible, e.g., removing a particular interaction with lm(y ~ .^2 - x1:x2) or forcing the intercept to be zero with lm(y ~ 0 + .^2). A quick check of how these formulas expand is sketched below.
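A quick way of checking how the ^ operator expands, without fitting any model, is to inspect the term labels of the formula (a small sketch; the variable names are placeholders):

# Sketch: the terms generated by the ^ operator in a formula
attr(terms(y ~ (x1 + x2 + x3)^2), "term.labels") # Main effects + first-order interactions
attr(terms(y ~ (x1 + x2 + x3)^3), "term.labels") # Adds also the x1:x2:x3 interaction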

Stepwise regression can also be done with interaction terms.


MASS::stepAIC supports interaction terms, but their inclusion must
be asked for in the scope argument. By default, scope considers
the largest model in which to perform stepwise regression as the
formula of the model in object, the first argument. In order to set
the largest model to search for the best subset of predictors as the
one that contains first-order interactions, we proceed as follows:
# Include first-order interactions in the search for the best model in
# terms of BIC, not just single predictors
modIntBIC <- MASS::stepAIC(object = lm(medv ~ ., data = Boston),
scope = medv ~ .^2, k = log(nobs(modBIC)), trace = 0)
summary(modIntBIC)
##
## Call:
## lm(formula = medv ~ crim + indus + chas + nox + rm + age + dis +
## rad + tax + ptratio + black + lstat + rm:lstat + rad:lstat +
## rm:rad + dis:rad + black:lstat + dis:ptratio + crim:chas +
## chas:nox + chas:rm + chas:ptratio + rm:ptratio + age:black +
## indus:dis + indus:lstat + crim:rm + crim:lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5845 -1.6797 -0.3157 1.5433 19.4311
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.673e+01 1.350e+01 -7.167 2.93e-12 ***
## crim -1.454e+00 3.147e-01 -4.620 4.95e-06 ***
## indus 7.647e-01 1.237e-01 6.182 1.36e-09 ***
## chas 6.341e+01 1.115e+01 5.687 2.26e-08 ***
## nox -1.691e+01 3.020e+00 -5.598 3.67e-08 ***
## rm 1.946e+01 1.730e+00 11.250 < 2e-16 ***
## age 2.233e-01 5.898e-02 3.786 0.000172 ***
## dis -2.462e+00 6.776e-01 -3.634 0.000309 ***
## rad 3.461e+00 3.109e-01 11.132 < 2e-16 ***
## tax -1.401e-02 2.536e-03 -5.522 5.52e-08 ***
## ptratio 1.207e+00 7.085e-01 1.704 0.089111 .
## black 7.946e-02 1.262e-02 6.298 6.87e-10 ***
## lstat 2.939e+00 2.707e-01 10.857 < 2e-16 ***
## rm:lstat -3.793e-01 3.592e-02 -10.559 < 2e-16 ***
## rad:lstat -4.804e-02 4.465e-03 -10.760 < 2e-16 ***
## rm:rad -3.490e-01 4.370e-02 -7.986 1.05e-14 ***
## dis:rad -9.236e-02 2.603e-02 -3.548 0.000427 ***
## black:lstat -8.337e-04 3.355e-04 -2.485 0.013292 *
## dis:ptratio 1.371e-01 3.719e-02 3.686 0.000254 ***
## crim:chas 2.544e+00 3.813e-01 6.672 7.01e-11 ***
## chas:nox -3.706e+01 6.202e+00 -5.976 4.48e-09 ***


## chas:rm -3.774e+00 7.402e-01 -5.099 4.94e-07 ***
## chas:ptratio -1.185e+00 3.701e-01 -3.203 0.001451 **
## rm:ptratio -3.792e-01 1.067e-01 -3.555 0.000415 ***
## age:black -7.107e-04 1.552e-04 -4.578 5.99e-06 ***
## indus:dis -1.316e-01 2.533e-02 -5.197 3.00e-07 ***
## indus:lstat -2.580e-02 5.204e-03 -4.959 9.88e-07 ***
## crim:rm 1.605e-01 4.001e-02 4.011 7.00e-05 ***
## crim:lstat 1.511e-02 4.954e-03 3.051 0.002408 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 3.045 on 477 degrees of freedom
## Multiple R-squared: 0.8964, Adjusted R-squared: 0.8904
## F-statistic: 147.5 on 28 and 477 DF, p-value: < 2.2e-16

# There is no improvement by removing terms in modIntBIC


MASS::dropterm(modIntBIC, k = log(nobs(modIntBIC)), sorted = TRUE)
## Single term deletions
##
## Model:
## medv ~ crim + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + black + lstat + rm:lstat + rad:lstat + rm:rad +
## dis:rad + black:lstat + dis:ptratio + crim:chas + chas:nox +
## chas:rm + chas:ptratio + rm:ptratio + age:black + indus:dis +
## indus:lstat + crim:rm + crim:lstat
## Df Sum of Sq RSS AIC
## <none> 4423.7 1277.7
## black:lstat 1 57.28 4481.0 1278.0
## crim:lstat 1 86.33 4510.1 1281.2
## chas:ptratio 1 95.15 4518.9 1282.2
## dis:rad 1 116.73 4540.5 1284.6
## rm:ptratio 1 117.23 4541.0 1284.7
## dis:ptratio 1 126.00 4549.7 1285.7
## crim:rm 1 149.24 4573.0 1288.2
## age:black 1 194.40 4618.1 1293.2
## indus:lstat 1 228.05 4651.8 1296.9
## chas:rm 1 241.11 4664.8 1298.3
## indus:dis 1 250.51 4674.2 1299.3
## tax 1 282.77 4706.5 1302.8
## chas:nox 1 331.19 4754.9 1308.0
## crim:chas 1 412.86 4836.6 1316.6
## rm:rad 1 591.45 5015.2 1335.0
## rm:lstat 1 1033.93 5457.7 1377.7
## rad:lstat 1 1073.80 5497.5 1381.4

# Nor is there an improvement by including other interaction terms


MASS::addterm(modIntBIC, scope = lm(medv ~ .^2, data = Boston),
k = log(nobs(modIntBIC)), sorted = TRUE)
## Single term additions
##
## Model:
## medv ~ crim + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + black + lstat + rm:lstat + rad:lstat + rm:rad +
## dis:rad + black:lstat + dis:ptratio + crim:chas + chas:nox +
## chas:rm + chas:ptratio + rm:ptratio + age:black + indus:dis +
## indus:lstat + crim:rm + crim:lstat
## Df Sum of Sq RSS AIC
## <none> 4423.7 1277.7
## nox:age 1 52.205 4371.5 1277.9
## chas:lstat 1 50.231 4373.5 1278.1
## crim:nox 1 50.002 4373.7 1278.2
## indus:tax 1 46.182 4377.6 1278.6
## nox:rad 1 42.822 4380.9 1279.0
## tax:ptratio 1 37.105 4386.6 1279.6
## age:lstat 1 29.825 4393.9 1280.5
## rm:tax 1 27.221 4396.5 1280.8
## nox:rm 1 25.099 4398.6 1281.0
## nox:ptratio 1 17.994 4405.7 1281.8


## rm:age 1 16.956 4406.8 1282.0
## crim:black 1 15.566 4408.2 1282.1
## dis:tax 1 13.336 4410.4 1282.4
## dis:lstat 1 10.944 4412.8 1282.7
## rm:black 1 9.909 4413.8 1282.8
## rm:dis 1 9.312 4414.4 1282.8
## crim:indus 1 8.458 4415.3 1282.9
## tax:lstat 1 7.891 4415.8 1283.0
## ptratio:black 1 7.769 4416.0 1283.0
## rad:black 1 7.327 4416.4 1283.1
## age:ptratio 1 6.857 4416.9 1283.1
## age:tax 1 5.785 4417.9 1283.2
## nox:dis 1 5.727 4418.0 1283.2
## age:dis 1 5.618 4418.1 1283.3
## nox:tax 1 5.579 4418.2 1283.3
## crim:dis 1 5.376 4418.4 1283.3
## tax:black 1 4.867 4418.9 1283.3
## indus:age 1 4.554 4419.2 1283.4
## indus:rm 1 4.089 4419.6 1283.4
## indus:ptratio 1 4.082 4419.6 1283.4
## zn 1 3.919 4419.8 1283.5
## chas:tax 1 3.918 4419.8 1283.5
## rad:tax 1 3.155 4420.6 1283.5
## age:rad 1 3.085 4420.6 1283.5
## nox:black 1 2.939 4420.8 1283.6
## ptratio:lstat 1 2.469 4421.3 1283.6
## indus:chas 1 2.359 4421.4 1283.6
## chas:black 1 1.940 4421.8 1283.7
## indus:nox 1 1.440 4422.3 1283.7
## indus:black 1 1.177 4422.6 1283.8
## chas:rad 1 0.757 4423.0 1283.8
## chas:age 1 0.757 4423.0 1283.8
## crim:rad 1 0.678 4423.1 1283.8
## nox:lstat 1 0.607 4423.1 1283.8
## rad:ptratio 1 0.567 4423.2 1283.8
## crim:age 1 0.348 4423.4 1283.9
## indus:rad 1 0.219 4423.5 1283.9
## dis:black 1 0.077 4423.7 1283.9
## crim:ptratio 1 0.019 4423.7 1283.9
## crim:tax 1 0.004 4423.7 1283.9
## chas:dis 1 0.004 4423.7 1283.9

Interactions are also possible with categorical variables. For


example, for one predictor X and one dummy variable D encod-
ing a factor with two levels, we have seven possible linear models
stemming from how we want to combine X and D:

1. Predictor and no dummy variable. The usual simple linear model:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

2. Predictor and dummy variable. D affects the intercept of the linear model, which is different for each group:

$$Y = \beta_0 + \beta_1 X + \beta_2 D + \varepsilon = \begin{cases} \beta_0 + \beta_1 X + \varepsilon, & \text{if } D = 0,\\ (\beta_0 + \beta_2) + \beta_1 X + \varepsilon, & \text{if } D = 1.\end{cases}$$

3. Predictor and dummy variable, with interaction. D affects the intercept and the slope of the linear model, and both are different for each group:

$$Y = \beta_0 + \beta_1 X + \beta_2 D + \beta_3 (X \cdot D) + \varepsilon = \begin{cases} \beta_0 + \beta_1 X + \varepsilon, & \text{if } D = 0,\\ (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X + \varepsilon, & \text{if } D = 1.\end{cases}$$

4. Predictor and interaction with dummy variable. D affects only the slope of the linear model, which is different for each group:

$$Y = \beta_0 + \beta_1 X + \beta_2 (X \cdot D) + \varepsilon = \begin{cases} \beta_0 + \beta_1 X + \varepsilon, & \text{if } D = 0,\\ \beta_0 + (\beta_1 + \beta_2) X + \varepsilon, & \text{if } D = 1.\end{cases}$$

5. Dummy variable and no predictor. D controls the intercept of a constant fit, depending on each group:

$$Y = \beta_0 + \beta_1 D + \varepsilon = \begin{cases} \beta_0 + \varepsilon, & \text{if } D = 0,\\ (\beta_0 + \beta_1) + \varepsilon, & \text{if } D = 1.\end{cases}$$

6. Dummy variable and interaction with predictor. D adds the predictor X for one group and affects the intercept, which is different for each group:

$$Y = \beta_0 + \beta_1 D + \beta_2 (X \cdot D) + \varepsilon = \begin{cases} \beta_0 + \varepsilon, & \text{if } D = 0,\\ (\beta_0 + \beta_1) + \beta_2 X + \varepsilon, & \text{if } D = 1.\end{cases}$$

7. Interaction of dummy and predictor. D adds the predictor X for one group and the intercept is common:

$$Y = \beta_0 + \beta_1 (X \cdot D) + \varepsilon = \begin{cases} \beta_0 + \varepsilon, & \text{if } D = 0,\\ \beta_0 + \beta_1 X + \varepsilon, & \text{if } D = 1.\end{cases}$$

Let’s see the visualization of these seven possibilities with Y =


medv, X = lstat, and D = chas for the Boston dataset. In the
following, the green color stands for data points and linear fits
associated to D = 0, whereas blue stands for D = 1.
# Group settings
col <- Boston$chas + 3
cex <- 0.5 + 0.25 * Boston$chas

# 1. No dummy variable
(mod1 <- lm(medv ~ lstat, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Coefficients:
## (Intercept)        lstat
##       34.55        -0.95
plot(medv ~ lstat, data = Boston, col = col, pch = 16, cex = cex, main = "1")
abline(coef = mod1$coefficients, lwd = 2)

# 2. Dummy variable
(mod2 <- lm(medv ~ lstat + chas, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat + chas, data = Boston)
##
## Coefficients:
## (Intercept)        lstat         chas
##     34.0941      -0.9406       4.9200
plot(medv ~ lstat, data = Boston, col = col, pch = 16, cex = cex, main = "2")
abline(a = mod2$coefficients[1], b = mod2$coefficients[2], col = 3, lwd = 2)
abline(a = mod2$coefficients[1] + mod2$coefficients[3],
       b = mod2$coefficients[2], col = 4, lwd = 2)

# 3. Dummy variable, with interaction
(mod3 <- lm(medv ~ lstat * chas, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat * chas, data = Boston)
##
## Coefficients:
## (Intercept)        lstat         chas   lstat:chas
##     33.7672      -0.9150       9.8251      -0.4329
plot(medv ~ lstat, data = Boston, col = col, pch = 16, cex = cex, main = "3")
abline(a = mod3$coefficients[1], b = mod3$coefficients[2], col = 3, lwd = 2)
abline(a = mod3$coefficients[1] + mod3$coefficients[3],
       b = mod3$coefficients[2] + mod3$coefficients[4], col = 4, lwd = 2)

# 4. Dummy variable only present in interaction
(mod4 <- lm(medv ~ lstat + lstat:chas, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat + lstat:chas, data = Boston)
##
## Coefficients:
## (Intercept)        lstat   lstat:chas
##     34.4893      -0.9580       0.2128
plot(medv ~ lstat, data = Boston, col = col, pch = 16, cex = cex, main = "4")
abline(a = mod4$coefficients[1], b = mod4$coefficients[2], col = 3, lwd = 2)
abline(a = mod4$coefficients[1],
       b = mod4$coefficients[2] + mod4$coefficients[3], col = 4, lwd = 2)

# 5. Dummy variable and no predictor
(mod5 <- lm(medv ~ chas, data = Boston))
##
## Call:
## lm(formula = medv ~ chas, data = Boston)
##
## Coefficients:
## (Intercept)         chas
##      22.094        6.346
plot(medv ~ lstat, data = Boston, col = col, pch = 16, cex = cex, main = "5")
abline(a = mod5$coefficients[1], b = 0, col = 3, lwd = 2)
abline(a = mod5$coefficients[1] + mod5$coefficients[2], b = 0, col = 4, lwd = 2)

# 6. Dummy variable. Interaction in the intercept and slope
(mod6 <- lm(medv ~ chas + lstat:chas, data = Boston))
##
## Call:
## lm(formula = medv ~ chas + lstat:chas, data = Boston)
##
## Coefficients:
## (Intercept)         chas   chas:lstat
##      22.094       21.498       -1.348
plot(medv ~ lstat, data = Boston, col = col, pch = 16, cex = cex, main = "6")
abline(a = mod6$coefficients[1], b = 0, col = 3, lwd = 2)
abline(a = mod6$coefficients[1] + mod6$coefficients[2],
       b = mod6$coefficients[3], col = 4, lwd = 2)

# 7. Dummy variable. Interaction in the slope
(mod7 <- lm(medv ~ lstat:chas, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat:chas, data = Boston)
##
## Coefficients:
## (Intercept)   lstat:chas
##    22.49484      0.04882
plot(medv ~ lstat, data = Boston, col = col, pch = 16, cex = cex, main = "7")
abline(a = mod7$coefficients[1], b = 0, col = 3, lwd = 2)
abline(a = mod7$coefficients[1], b = mod7$coefficients[2], col = 4, lwd = 2)

From the above illustration, it is clear that the effect of adding a dummy variable is to simultaneously fit two linear models (with varying flexibility) to the two groups of data encoded by the dummy variable, and merge this simultaneous fit within a single linear model. We can check this in more detail using the subset option of lm:

# Model using a dummy variable in the full dataset
lm(medv ~ lstat + chas + lstat:chas, data = Boston)
##
## Call:
## lm(formula = medv ~ lstat + chas + lstat:chas, data = Boston)
##
## Coefficients:
## (Intercept) lstat chas lstat:chas
## 33.7672 -0.9150 9.8251 -0.4329

# Individual model for the group with chas == 0


lm(medv ~ lstat, data = Boston, subset = chas == 0)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston, subset = chas == 0)
##
## Coefficients:
## (Intercept) lstat
## 33.767 -0.915
# Notice that the intercept and lstat coefficient are the same as before

# Individual model for the group with chas == 1


lm(medv ~ lstat, data = Boston, subset = chas == 1)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston, subset = chas == 1)
##
## Coefficients:
## (Intercept) lstat
## 43.592 -1.348
# Notice that the intercept and lstat coefficient equal the ones from the
# joint model, plus the specific terms associated with chas

This discussion can be extended to the situation where we have a factor with several levels, and hence more dummy variables, as for example in the iris dataset, where there are three groups. The next code shows how three group-specific linear regressions are modeled together by means of two dummy variables.

# Does not take into account the groups in the data


modIris <- lm(Sepal.Width ~ Petal.Width, data = iris)
modIris$coefficients
## (Intercept) Petal.Width
## 3.3084256 -0.2093598

# Adding interactions with the groups


modIrisSpecies <- lm(Sepal.Width ~ Petal.Width * Species, data = iris)
modIrisSpecies$coefficients
## (Intercept) Petal.Width Speciesversicolor Speciesvirginica
## 3.2220507 0.8371922 -1.8491878 -1.5272777
## Petal.Width:Speciesversicolor Petal.Width:Speciesvirginica
## 0.2164556 -0.2057870

# Joint regression line shows negative correlation, but each group


# regression line shows a positive correlation
plot(Sepal.Width ~ Petal.Width, data = iris, col = as.integer(Species) + 1,
pch = 16)
abline(a = modIris$coefficients[1], b = modIris$coefficients[2], lwd = 2)
abline(a = modIrisSpecies$coefficients[1], b = modIrisSpecies$coefficients[2],
col = 2, lwd = 2)
abline(a = modIrisSpecies$coefficients[1] + modIrisSpecies$coefficients[3],
b = modIrisSpecies$coefficients[2] + modIrisSpecies$coefficients[5],
col = 3, lwd = 2)
abline(a = modIrisSpecies$coefficients[1] + modIrisSpecies$coefficients[4],
b = modIrisSpecies$coefficients[2] + modIrisSpecies$coefficients[6],
col = 4, lwd = 2)

The last scatterplot is an illustration of the Simpson's paradox. The simplest case of the paradox arises in simple linear regression, when there are two or more well-defined groups in the data such that:

1. Within each group, there is a clear and common correlation pattern between the response and the predictor.
2. When the groups are aggregated, the response and the predictor exhibit an opposite correlation pattern.

Figure 3.12: The three linear fits of Sepal.Width ~ Petal.Width * Species for each of the three levels in the Species factor (setosa in red, versicolor in green, and virginica in blue) in the iris dataset. The black line represents the linear fit for Sepal.Width ~ Petal.Width, that is, the linear fit without accounting for the levels in Species.

3.4.4 Case study application
The model employed in Harrison and Rubinfeld (1978) is different
from the modBIC model. In the paper, several nonlinear transforma-
tions of the predictors and the response are done to improve the
linear fit. Also, different units are used for medv, black, lstat, and
nox. The authors considered these variables:

• Response: log(1000 * medv).
• Linear predictors: age, black / 1000 (this variable corresponds to their (B − 0.63)²), tax, ptratio, crim, zn, indus, and chas.
• Nonlinear predictors: rm^2, log(dis), log(rad), log(lstat / 100), and (10 * nox)^2.

Do the following:

1. Check if the model with such predictors corresponds


to the one in the first column, Table VII, page 100
of Harrison and Rubinfeld (1978) (open-access pa-
per available here). To do so, save this model as
modelHarrison and summarize it.

2. Make a stepAIC selection of the variables


in modelHarrison (use BIC) and save it as
modelHarrisonSel. Summarize the fit.

3. Which model has a larger R2 ? And adjusted R2 ?


Which is simpler and has more significant coefficients?

3.5 Model diagnostics

As we saw in Section 2.3, checking the assumptions of the multiple


linear model through the visualization of the data becomes tricky
even when p = 2. To solve this issue, a series of diagnostic tools have
been designed in order to evaluate graphically and systematically
the validity of the assumptions of the linear model.
We will illustrate them in the wine dataset, which is available in
the wine.RData workspace.
load("wine.RData")
mod <- lm(Price ~ Age + AGST + HarvestRain + WinterRain, data = wine)
summary(mod)

When one assumption fails, it is likely that this failure will affect other assumptions. For example, if linearity fails, then most likely homoscedasticity and normality will also fail. Therefore, identifying the root cause of the assumptions' failure is key in order to try to find a patch.

3.5.1 Linearity
Linearity between the response Y and the predictors X1 , . . . , X p
is the building block of the linear model. If this assumption fails,
i.e., if there is a nonlinear trend linking Y and at least one of the
predictors X1, ..., Xp in a significant way, then all the conclusions we might extract from the analysis are suspected to be flawed. Therefore it is a key assumption.

[13] If p = 1, then it is possible to inspect the scatterplot of the (Xi1, Yi), i = 1, ..., n, in order to determine whether linearity is plausible. But the usefulness of this graphical check quickly decays with the dimension p, as p scatterplots need to be investigated. That is precisely the key point for relying on the residuals vs. fitted values plot.

How to check it

The so-called residuals vs. fitted values plot is the scatterplot of the (Ŷi, ε̂i), i = 1, ..., n, and is a very useful tool for detecting linearity departures using a single[13] graphical device. Under linearity, we expect that there is no trend in the residuals ε̂i with respect to Ŷi. For example:

plot(mod, 1)

Figure 3.13: Residuals vs. fitted values for the Price ~ Age + AGST + HarvestRain + WinterRain model for the wine dataset.

Under linearity, we expect the red line (a flexible fit of the mean of the residuals) to capture no trend. Figure 3.14 illustrates how the residuals vs. fitted values plots behave for situations in which the linearity is known to be respected or violated. If nonlinearities are observed, it is worth plotting the regression terms of the model. These are the p scatterplots (Xij, Yi), i = 1, ..., n, that are accompanied by the regression lines y = β̂0 + β̂j xj (important: β̂j, j = 1, ..., p, come from the multiple linear fit that gives β̂, not from individual simple linear regressions). They help with detecting which predictor is having nonlinear effects on Y.

Figure 3.14: Residuals vs. fitted values plots for datasets respecting (left column) and violating (right column) the linearity assumption.
par(mfrow = c(2, 2)) # We have 4 predictors
termplot(mod, partial.resid = TRUE)

Figure 3.15: Regression terms for Price ~ Age + AGST + HarvestRain + WinterRain in the wine dataset.

What to do if it fails

Using an adequate nonlinear transformation for the problematic predictors or adding interaction terms, as we saw in Section 3.4, might be helpful. Alternatively, consider a nonlinear transformation f for the response Y (of course, at the price of predicting f(Y) rather than Y), as we will see in the case study of Section 3.5.7. Let's see the transformation of predictors in the example that motivated Section 3.4.

par(mfrow = c(1, 2))
plot(lm(y ~ x, data = nonLinear), 1)      # Nonlinear
plot(lm(y ~ I(x^2), data = nonLinear), 1) # Linear

Figure 3.16: Correction of nonlinearity by using the right transformation for the predictor.
3.5.2 Normality

The assumed normality of the errors εi, i = 1, ..., n, allows us to make exact inference in the linear model, in the sense that the distribution of β̂ given in (2.11) is exact for any n and not asymptotic with n → ∞. If normality does not hold, then the inference we did (CIs for βj, hypothesis testing, CIs for prediction) is to be somehow suspected. Why just somehow? Roughly speaking, the reason is that the central limit theorem will make β̂ asymptotically normal[14], even if the errors are not. However, the speed of this asymptotic convergence greatly depends on how non-normal the distribution of the errors is. Hence the next rule of thumb:

Non-severe[15] departures from normality yield valid (asymptotic) inference for relatively large sample sizes n.

Therefore, the failure of normality is typically less problematic than that of other assumptions.

[14] Recall that β̂j = e_j'(X'X)^{-1}X'Y =: ∑_{i=1}^n w_{ij} Y_i, where e_j is the canonical vector of R^{p+1} with 1 in the j-th position and 0 elsewhere. Therefore, β̂j is a weighted sum of the random variables Y1, ..., Yn (recall that we assume that X is given and therefore is deterministic). Even if Y1, ..., Yn are not normal, the central limit theorem entails that √n(β̂j − βj) is asymptotically normally distributed when n → ∞, provided that linearity holds.

[15] Distributions that are not heavy-tailed, not heavily multimodal, and not heavily skewed.

How to check it

The QQ-plot (Theoretical Quantile vs. Empirical Quantile) allows to check if the standardized residuals follow a N(0, 1). What it does is to compare the theoretical quantiles of a N(0, 1) with the quantiles of the sample of standardized residuals.

plot(mod, 2)

Figure 3.17: QQ-plot for the residuals of the Price ~ Age + AGST + HarvestRain + WinterRain model for the wine dataset.

Under normality, we expect the points to align with the diagonal line, which represents the ideal position of the points if they were sampled from a N(0, 1). It is usual to have larger departures from the diagonal in the extremes[16] than in the center, even under normality, although these departures are more evident if the data is non-normal.

[16] For X ∼ F, the p-th quantile x_p = F^{-1}(p) of X is estimated through the sample quantile x̂_p := F_n^{-1}(p), where F_n(x) = (1/n) ∑_{i=1}^n 1{X_i ≤ x} is the empirical cdf of X1, ..., Xn. If X ∼ f is continuous, then √n(x̂_p − x_p) is asymptotically N(0, p(1 − p)/f(x_p)²). Therefore, the variance of x̂_p grows if p → 0, 1, and more variability is expected on the extremes of the QQ-plot, see Figure 3.19.

There are formal tests to check the null hypothesis of normality in our residuals, such as the Shapiro–Wilk test implemented by the shapiro.test function or the Lilliefors test[17] implemented by the nortest::lillie.test function:

# Shapiro-Wilk test of normality
shapiro.test(mod$residuals)
##
##  Shapiro-Wilk normality test
##
## data:  mod$residuals
## W = 0.95903, p-value = 0.3512
# We do not reject normality

# shapiro.test allows up to 5000 observations -- if dealing with more data
# points, randomization of the input is a possibility

# Lilliefors test -- the Kolmogorov-Smirnov adaptation for testing normality
nortest::lillie.test(mod$residuals)
##
##  Lilliefors (Kolmogorov-Smirnov) normality test
##
## data:  mod$residuals
## D = 0.13739, p-value = 0.2125
# We do not reject normality

Figure 3.18: QQ-plots for datasets respecting (left column) and violating (right column) the normality assumption.

[17] This is the Kolmogorov–Smirnov test shown in Section A.1 but adapted to testing the normality of the data with unknown mean and variance. More precisely, the test tests the composite null hypothesis H0: F = Φ(·; µ, σ²) with µ and σ² unknown. Note that this is different from the simple null hypothesis H0: F = F0 of the Kolmogorov–Smirnov test in which F0 is completely specified. Further tests of normality can be derived by adapting other tests for the simple null hypothesis H0: F = F0, such as the Cramér–von Mises and the Anderson–Darling tests, and these are implemented in the functions nortest::cvm.test and nortest::ad.test.
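To complement these checks, the following small simulation (a sketch, not part of the original analysis) illustrates the rule of thumb stated above: with markedly non-normal errors, the sampling distribution of a slope estimator still looks normal for a moderate sample size.

# Sketch: sampling distribution of the slope estimator under skewed errors
set.seed(1234)
n <- 100
x <- rnorm(n)
betaHat <- replicate(1e3, {
  eps <- rexp(n) - 1        # Centered, but heavily skewed, errors
  y <- 1 + 0.5 * x + eps
  coef(lm(y ~ x))[2]
})
hist(betaHat, freq = FALSE, main = "", xlab = expression(hat(beta)[1]))
# The histogram is bell-shaped despite the skewed errors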

Figure 3.19: The uncertainty behind the QQ-plot ("Confidence bands for QQ-plot"). The figure aggregates M = 1000 different QQ-plots of N(0, 1) data with n = 100, displaying for each of them the pairs (x_p, x̂_p) evaluated at p = (i − 1/2)/n, i = 1, ..., n (as they result from ppoints(n)). The uncertainty is measured by the asymptotic 100(1 − α)% CIs for x̂_p, given by x_p ± (z_{1−α/2}/√n) √(p(1 − p))/φ(x_p). These curves are displayed in red for α = 0.05. Observe that the vertical strips arise since the x_p coordinate is deterministic.

What to do if it fails

Patching non-normality is not easy and most of the time requires the consideration of other models, like the ones to be seen in Chapter 5. If Y is non-negative, one possibility is to transform Y by means of the Box–Cox (Box and Cox, 1964) transformation:

$$
y^{(\lambda)} := \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \log(y), & \lambda = 0. \end{cases} \tag{3.3}
$$

This transformation alleviates the skewness[18] of the data, therefore making it more symmetric and hence normal-like. The optimal data-dependent λ̂ that makes the data more normal-like can be found through maximum likelihood[19] on the transformed sample {Y_i^{(λ)}}_{i=1}^n. If Y is not non-negative, (3.3) cannot be applied. A possible patch is to shift the data by a positive constant m = −min(Y1, ..., Yn) + δ, δ > 0, such that transformation (3.3) becomes

$$
y^{(\lambda, m)} := \begin{cases} \dfrac{(y + m)^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \log(y + m), & \lambda = 0. \end{cases}
$$

A neat alternative to this shifting is to rely on a transformation that is already designed for real Y, such as the Yeo–Johnson (Yeo and Johnson, 2000) transformation:

$$
y^{(\lambda)} := \begin{cases} \dfrac{(y + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ y \geq 0, \\ \log(y + 1), & \lambda = 0,\ y \geq 0, \\ -\dfrac{(-y + 1)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\ y < 0, \\ -\log(-y + 1), & \lambda = 2,\ y < 0. \end{cases} \tag{3.4}
$$

The beauty of the Yeo–Johnson transformation is that it extends neatly the Box–Cox transformation, which appears as a particular case when Y is non-negative (see Figure 3.20). As with the Box–Cox transformation, the optimal λ̂ is estimated through maximum likelihood on the transformed sample {Y_i^{(λ)}}_{i=1}^n.

[18] Precisely, if λ < 1, positive skewness (or skewness to the right) is palliated (large values of Y shrink, small values grow), whereas if λ > 1 negative skewness (or skewness to the left) is corrected (large values of Y grow, small values shrink).

[19] For a N(µ, σ²), and potentially using the linear model structure if we are performing the transformation to achieve normality in the errors of the linear model. Recall that optimally transforming Y such that Y is normal-like or Y|(X1, ..., Xp) is normal-like (the assumption in the linear model) are very different goals!
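As a quick numerical check of this nesting (a sketch; the value λ = 0.5 and the grid below are arbitrary choices):

# Sketch: for y >= 0, the Yeo-Johnson transformation of y coincides with the
# Box-Cox transformation of y + 1
y <- seq(0, 5, l = 10)
all.equal(car::yjPower(U = y, lambda = 0.5), car::bcPower(U = y + 1, lambda = 0.5))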
N <- 200
y <- seq(-4, 4, length.out = N)
lambda <- c(0, 0.5, 1, 2, -0.5, -1, -2)
l <- length(lambda)
psi <- sapply(lambda, function(la) car::yjPower(U = y, lambda = la))
matplot(y, psi, type = "l", ylim = c(-4, 4), lwd = 2, lty = 1:l,
        ylab = latex2exp::TeX("$y^{(\\lambda)}$"), col = 1:l, las = 1,
        main = "Yeo-Johnson transformation")
abline(v = 0, h = 0)
legend("bottomright", lty = 1:l, lwd = 2, col = 1:l,
       legend = latex2exp::TeX(paste0("$\\lambda = ", lambda, "$")))

Figure 3.20: Yeo–Johnson transformation for some values of λ. The Box–Cox transformation for λ corresponds to the right hand side (y ≥ 0) of the plot.

The previous transformations have the price of modelling the transformed response rather than Y. It is also possible to patch non-normality if it is a consequence of the failure of linearity or homoscedasticity, which translates the problem into fixing those assumptions. Let's see how to implement both using the car::powerTransform, car::bcPower, and car::yjPower functions.
# Test data

# Predictors
n <- 200
set.seed(121938)
X1 <- rexp(n, rate = 1 / 5) # Non-negative
X2 <- rchisq(n, df = 5) - 3 # Real

# Response of a linear model


epsilon <- rchisq(n, df = 10) - 10 # Centered error, but not normal
Y <- 10 - 0.5 * X1 + X2 + epsilon

# Transformation of non-normal data to achieve normal-like data (no model)

# Optimal lambda for Box-Cox


BC <- car::powerTransform(lm(X1 ~ 1), family = "bcPower") # Maximum-likelihood fit
# Note we use a regression model with no predictors
(lambdaBC <- BC$lambda) # The optimal lambda
## Y1
## 0.2412419
# lambda < 1, so positive skewness is corrected

# Box-Cox transformation
X1Transf <- car::bcPower(U = X1, lambda = lambdaBC)

# Comparison
par(mfrow = c(1, 2))
hist(X1, freq = FALSE, breaks = 10, ylim = c(0, 0.3))
hist(X1Transf, freq = FALSE, breaks = 10, ylim = c(0, 0.3))

Histogram of X1 (left) and Histogram of X1Transf (right).

# Optimal lambda for Yeo-Johnson


YJ <- car::powerTransform(lm(X2 ~ 1), family = "yjPower")
(lambdaYJ <- YJ$lambda)
## Y1
## 0.5850791

# Yeo-Johnson transformation
X2Transf <- car::yjPower(U = X2, lambda = lambdaYJ)

# Comparison
par(mfrow = c(1, 2))
hist(X2, freq = FALSE, breaks = 10, ylim = c(0, 0.3))
hist(X2Transf, freq = FALSE, breaks = 10, ylim = c(0, 0.3))

Histogram of X2 (left) and Histogram of X2Transf (right).

# Transformation of non-normal response in a linear model

# Optimal lambda for Yeo-Johnson


YJ <- car::powerTransform(lm(Y ~ X1 + X2), family = "yjPower")
(lambdaYJ <- YJ$lambda)
## Y1
## 0.9160924

# Yeo-Johnson transformation
YTransf <- car::yjPower(U = Y, lambda = lambdaYJ)

# Comparison for the residuals


par(mfrow = c(1, 2))
plot(lm(Y ~ X1 + X2), 2)
plot(lm(YTransf ~ X1 + X2), 2) # Slightly better


3.5.3 Homoscedasticity

The constant-variance assumption of the errors is also key for obtaining the inferential results we saw. For example, if the assumption does not hold, then the CIs for prediction will not respect the confidence on which they were built.

How to check it

Heteroskedasticity can be detected by looking into irregular vertical dispersion patterns in the residuals vs. fitted values plot. However, it is simpler to use the scale-location plot, where the standardized residuals are transformed by a square root (of their absolute value), and to inspect the deviations in the positive axis.

plot(mod, 3)

Figure 3.21: Scale-location plot for the Price ~ Age + AGST + HarvestRain + WinterRain model for the wine dataset.

Under homoscedasticity, we expect the red line to show no trend. If there are consistent patterns, then there is evidence of heteroskedasticity.

Figure 3.22: Scale-location plots for datasets respecting (left column) and violating (right column) the homoscedasticity assumption.
There are formal tests to check the null hypothesis of homoscedasticity in our residuals. For example, the Breusch–Pagan test implemented in car::ncvTest:
# Breusch-Pagan test
car::ncvTest(mod)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.3254683, Df = 1, p = 0.56834
# We do not reject homoscedasticity

The Breusch–Pagan test checks homoscedasticity against a nonconstant variance that increases linearly with respect to the predictors. This fact means that the test can be fooled by a nonlinear pattern in the variance of the residuals that results in a flat plane fit (e.g., a quadratic pattern). It is therefore advisable to check the scale-location plot in addition to performing the Breusch–Pagan test, in order to identify evident nonconstant variances driven by possibly tricky nonlinearities. The next code illustrates this with two examples.

# Heteroskedastic models
set.seed(123456)
x <- rnorm(100)
y1 <- 1 + 2 * x + rnorm(100, sd = x^2)
y2 <- 1 + 2 * x + rnorm(100, sd = 1 + x * (x > 0))
modHet1 <- lm(y1 ~ x)
modHet2 <- lm(y2 ~ x)

# Heteroskedasticity not detected
car::ncvTest(modHet1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.938652e-05, Df = 1, p = 0.99567
plot(modHet1, 3)

# Heteroskedasticity correctly detected
car::ncvTest(modHet2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 41.03562, Df = 1, p = 1.4948e-10
plot(modHet2, 3)

Figure 3.23: Two heteroskedasticity patterns that are undetected and detected, respectively, by the Breusch–Pagan test.

What to do if it fails

Using a nonlinear transformation for the response Y may help to control the variance. Typical choices are log Y and √Y, which reduce the scale of the larger responses and lead to a reduction of heteroskedasticity. Keep in mind that these transformations, as the Box–Cox transformations, are designed for non-negative Y. The Yeo–Johnson transformation can be used instead if Y is real. Let's see some quick examples.

# Artificial data with heteroskedasticity
set.seed(12345)
X <- rchisq(500, df = 3)
e <- rnorm(500, sd = sqrt(0.1 + 2 * X))
Y <- 1 + X + e

# Original
plot(lm(Y ~ X), 3) # Very heteroskedastic

# Transformed
plot(lm(I(log(abs(Y))) ~ X), 3) # Much less heteroskedastic, but at the price
# of losing the signs in Y...

# Shifted and transformed
delta <- 1 # This is tuneable
m <- -min(Y) + delta
plot(lm(I(log(Y + m)) ~ X), 3) # No signs loss

# Transformed by Yeo-Johnson

# Optimal lambda for Yeo-Johnson
YJ <- car::powerTransform(lm(Y ~ X), family = "yjPower")
(lambdaYJ <- YJ$lambda)
## Y1
## 0.6932053

# Yeo-Johnson transformation
YTransf <- car::yjPower(U = Y, lambda = lambdaYJ)
plot(lm(YTransf ~ X), 3) # Slightly less heteroskedastic

Figure 3.24: Patching of heteroskedasticity for an artificial dataset.
3.5.4 Independence 202
155

2.0
Independence is also a key assumption: it guarantees that the amount of information that we have on the relationship between Y and X1, . . . , Xp with n observations is maximal. If there is dependence, then information is repeated and, as a consequence, the variability of the estimates will be larger. For example, our 95% CIs will be narrower than they should, meaning that they will not contain the unknown parameter with a 95% confidence, but with a lower confidence (say 80%).
An extreme case is the following: suppose we have two samples of sizes n and 2n, where the 2n-sample contains the elements of the n-sample twice. The information in both samples is the same, and so are the estimates for the coefficients β. Yet in the 2n-sample the length of the confidence intervals is C(√(2n))⁻¹, whereas in the n-sample they have length C(√n)⁻¹. A reduction by a factor of √2 in the confidence interval has happened, but we have the same information! This will give us a wrong sense of confidence in our model, and the root of the evil was the dependence between observations.
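A minimal simulation sketch of this extreme case (added here as an illustration, not part of the original example): duplicating a sample adds no information, yet the CI for the slope shrinks by roughly √2.

# Duplicating a sample: same information, artificially narrower CIs
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
ciN <- confint(lm(y ~ x))["x", ] # CI for the slope with the n-sample
ci2N <- confint(lm(rep(y, 2) ~ rep(x, 2)))["rep(x, 2)", ] # CI with the duplicated 2n-sample
unname(diff(ciN) / diff(ci2N)) # Close to sqrt(2) = 1.41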

How to check it

The set of possible dependence structures on the residuals is immense, and there is no straightforward way of checking all of them. Usually what is examined is the presence of autocorrelation, which appears when there is some kind of serial dependence in the measurement of the observations. The serial plot of the residuals allows us to detect time trends in them.

plot(mod$residuals, type = "o")

Figure 3.25: Serial plot of the residuals of the Price ~ Age + AGST + HarvestRain + WinterRain model for the wine dataset.

Under uncorrelation, we expect the series to show no tracking of the residuals. That is, that the closer observations do not take similar values, but rather change without any kind of distinguishable pattern. Tracking is associated with positive autocorrelation, but negative autocorrelation, manifested as alternating small-large or positive-negative residuals, is also possible. The lagged plots of (ε̂i−l, ε̂i), i = l + 1, . . . , n, obtained through lag.plot, allow us to detect both kinds of autocorrelation for a given lag l. Under independence, we expect no trend in such a plot. Here is an example:

lag.plot(mod$residuals, lags = 1, do.lines = FALSE)

# No serious serial trend, but some negative autocorrelation is apparent
cor(mod$residuals[-1], mod$residuals[-length(mod$residuals)])
## [1] -0.4267776

There are also formal tests for the absence of autocorrelation, such as the Durbin–Watson test implemented in car::durbinWatsonTest:

# Durbin-Watson test
car::durbinWatsonTest(mod)
## lag Autocorrelation D-W Statistic p-value
##   1     -0.4160168      2.787261   0.054
## Alternative hypothesis: rho != 0
# Does not reject at alpha = 0.05

What to do if fails
Little can be done if there is dependence in the data, once this has been collected. If the dependence is temporal, we must rely on the family of statistical models meant to deal with serial dependence: time series. Other kinds of dependence such as spatial dependence, spatio-temporal dependence, geometrically-driven dependencies, censorship, truncation, etc. need to be analyzed with a different set of tools from the ones covered in these notes.
However, there is a simple trick worth mentioning. If the observations of the response Y, say Y1, Y2, . . . , Yn, present serial dependence, a differencing of the sample that yields Y1 − Y2, Y2 − Y3, . . . , Yn−1 − Yn may lead to independent observations. These are called the innovations of the series of Y.
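A small simulated sketch of this trick (added here; the notes do not include code for it): differencing a serially dependent series sharply reduces its lag-one autocorrelation.

# Serially dependent observations (AR(1)-type dependence) and their innovations
set.seed(123)
Y <- 1 + as.numeric(arima.sim(model = list(ar = 0.8), n = 200))
acf(Y, plot = FALSE)$acf[2] # Strong lag-1 autocorrelation (about 0.8)
acf(diff(Y), plot = FALSE)$acf[2] # Much weaker after differencing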

Load the dataset assumptions3D.RData and compute


the regressions y.3 ~ x1.3 + x2.3, y.4 ~ x1.4 + x2.4,
y.5 ~ x1.5 + x2.5, and y.8 ~ x1.8 + x2.8. Use the
presented diagnostic tools to test the assumptions of the
linear model and look out for possible problems.

Figure 3.26: Serial plots of the residuals for datasets respecting (left column) and violating (right column) the independence assumption.

3.5.5 Multicollinearity

A common problem that arises in multiple linear regression is multicollinearity. This is the situation in which two or more predictors are highly linearly related between them. Multicollinearity has important effects on the fit of the model:

• It reduces the precision of the estimates. As a consequence,


the signs of fitted coefficients may be reversed and valuable
predictors may appear as non-significant.
• It is difficult to determine how each of the highly related pre-
dictors affects the response, since one masks the other. Also, this
may result in numerical instabilities because X'X will be close to being singular.

Intuitively, multicollinearity can be visualized as a card (fitting plane) that is held by its opposite corners and that spins on its diagonal, where the data is concentrated. Then, very different planes will fit the data almost equally well, which results in a large variability of the optimal plane.

An approach to detect multicollinearity is to inspect the correla-


tion matrix between the predictors.
# Numerically
cor(wine)
## Year Price WinterRain AGST HarvestRain Age FrancePop
## Year 1.00000000 -0.4604087 0.05118354 -0.29488335 -0.05884976 -1.00000000 0.99227908
## Price -0.46040873 1.0000000 0.13488004 0.66752483 -0.50718463 0.46040873 -0.48107195
## WinterRain 0.05118354 0.1348800 1.00000000 -0.32113230 -0.26798907 -0.05118354 0.02945091

## AGST -0.29488335 0.6675248 -0.32113230 1.00000000 -0.02708361 0.29488335 -0.30126148
## HarvestRain -0.05884976 -0.5071846 -0.26798907 -0.02708361 1.00000000 0.05884976 -0.03201463
## Age -1.00000000 0.4604087 -0.05118354 0.29488335 0.05884976 1.00000000 -0.99227908
## FrancePop 0.99227908 -0.4810720 0.02945091 -0.30126148 -0.03201463 -0.99227908 1.00000000

# Graphically
corrplot::corrplot(cor(wine), addCoef.col = "grey")

Figure 3.27: Graphical visualization of the correlation matrix.

Here we can see what we already knew from Section 2.1: Age and Year are perfectly linearly related, and Age and FrancePop are highly linearly related. Then one approach will be to directly remove one of the highly-correlated predictors.
However, it is not sufficient to inspect pairwise correlations in
order to get rid of multicollinearity. It is possible to build coun-
terexamples that show non-suspicious pairwise correlations but problematic complex linear relations that remain hidden. For the sake of illustration, here is one:

# Create predictors with multicollinearity: x4 depends on the rest
set.seed(45678)
x1 <- rnorm(100)
x2 <- 0.5 * x1 + rnorm(100)
x3 <- 0.5 * x2 + rnorm(100)
x4 <- -x1 + x2 + rnorm(100, sd = 0.25)

# Response
y <- 1 + 0.5 * x1 + 2 * x2 - 3 * x3 - x4 + rnorm(100)
data <- data.frame(x1 = x1, x2 = x2, x3 = x3, x4 = x4, y = y)

# Correlations -- none seems suspicious
corrplot::corrplot(cor(data), addCoef.col = "grey")

Figure 3.28: Unsuspicious correlation matrix with hidden multicollinearity.

A better approach to detect multicollinearity is to compute the Variance Inflation Factor (VIF) of each fitted coefficient β̂j. This is a measure of how linearly dependent Xj is on the rest of the predictors20, and it is defined as

VIF(β̂j) = 1 / (1 − R²Xj|X−j),

where R²Xj|X−j represents the R² from the regression of Xj onto the remaining predictors X1, . . . , Xj−1, Xj+1, . . . , Xp. Clearly, VIF(β̂j) ≥ 1. The next simple rule of thumb gives direct insight into which predictors are multicollinear:

20 And therefore more informative on the linear dependence of the predictors than the correlations of Xj with each of the remaining predictors.

• VIF close to 1: absence of multicollinearity.


• VIF larger than 5 or 10: problematic amount of multicollinearity.
Advised to remove the predictor with largest VIF.
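To connect the definition with the software output, here is a minimal sketch (reusing the simulated x1, ..., x4 from above) that computes the VIF of x4 directly from its defining R²; it should essentially match the value that car::vif() reports for x4 below.

# VIF of x4 computed from the definition
R2x4 <- summary(lm(x4 ~ x1 + x2 + x3))$r.squared # R^2 of x4 regressed on the rest
1 / (1 - R2x4) # Should essentially match car::vif() for x4 below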

VIF is computed with the car::vif function, which takes as an


argument a linear model. Let’s see how it works in the previous
example with hidden multicollinearity.
# Abnormal variance inflation factors: largest for x4, we remove it
modMultiCo <- lm(y ~ x1 + x2 + x3 + x4)
car::vif(modMultiCo)
## x1 x2 x3 x4
## 26.361444 29.726498 1.416156 33.293983

# Without x4
modClean <- lm(y ~ x1 + x2 + x3)

# Comparison
car::compareCoefs(modMultiCo, modClean)
## Calls:
## 1: lm(formula = y ~ x1 + x2 + x3 + x4)
## 2: lm(formula = y ~ x1 + x2 + x3)
##
## Model 1 Model 2
## (Intercept) 1.062 1.058
## SE 0.103 0.103
##
## x1 0.922 1.450
## SE 0.551 0.116
##
## x2 1.640 1.119
## SE 0.546 0.124
##
## x3 -3.165 -3.145
## SE 0.109 0.107
##
## x4 -0.529
## SE 0.541
##
confint(modMultiCo)
## 2.5 % 97.5 %
## (Intercept) 0.8568419 1.2674705
## x1 -0.1719777 2.0167093
## x2 0.5556394 2.7240952
## x3 -3.3806727 -2.9496676
## x4 -1.6030032 0.5446479
confint(modClean)
## 2.5 % 97.5 %
## (Intercept) 0.8526681 1.262753
## x1 1.2188737 1.680188
## x2 0.8739264 1.364981
## x3 -3.3564513 -2.933473

# Summaries
summary(modMultiCo)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9762 -0.6663 0.1195 0.6217 2.5568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0622 0.1034 10.270 < 2e-16 ***
## x1 0.9224 0.5512 1.673 0.09756 .
## x2 1.6399 0.5461 3.003 0.00342 **
## x3 -3.1652 0.1086 -29.158 < 2e-16 ***
## x4 -0.5292 0.5409 -0.978 0.33040
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.028 on 95 degrees of freedom
## Multiple R-squared: 0.9144, Adjusted R-squared: 0.9108
## F-statistic: 253.7 on 4 and 95 DF, p-value: < 2.2e-16
summary(modClean)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91297 -0.66622 0.07889 0.65819 2.62737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0577 0.1033 10.24 < 2e-16 ***
## x1 1.4495 0.1162 12.47 < 2e-16 ***
## x2 1.1195 0.1237 9.05 1.63e-14 ***
## x3 -3.1450 0.1065 -29.52 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.028 on 96 degrees of freedom
## Multiple R-squared: 0.9135, Adjusted R-squared: 0.9108
## F-statistic: 338 on 3 and 96 DF, p-value: < 2.2e-16

# Variance inflation factors are normal


car::vif(modClean)
## x1 x2 x3
## 1.171942 1.525501 1.364878

Note that multicollinearity is another instance of the tension between model correctness and usefulness. A model with multicollinearity might be perfectly valid in the sense of respecting the assumptions of the model. As we saw in Section 2.3, it does not matter whether the predictors are related or not, at least for the verification of the assumptions. But the model will be useless if the multicollinearity is high, since it can inflate the variability of the estimation without any kind of bound. In the extreme case in which the multicollinearity is perfect, the model will not be identifiable, despite being correct.
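A minimal sketch of this extreme case (added here; it reuses the simulated x1, ..., x4 and y from above): with a perfectly collinear predictor, the model is not identifiable and lm() drops it, reporting an NA coefficient.

# Perfect multicollinearity: x5 is an exact linear combination of x1 and x2
x5 <- 2 * x1 - x2
coef(lm(y ~ x1 + x2 + x3 + x4 + x5)) # The coefficient of x5 is NA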

3.5.6 Outliers and high-leverage points


Outliers and high-leverage points are particular observations that
have an important impact in the final linear model, either on the
estimates or on the properties of the model. They are defined as
follows.

• Outliers are the observations with a response Yi far away from


the regression plane. They typically do not affect the estimate of
the plane, unless one of the predictors is also extreme (see next
point). But they inflate σ̂ and as a consequence they draw down
the R2 of the model and expand the CIs.
• High-leverage points are observations with an extreme predictor
Xij located far away from the rest of points. These observations are
highly influential and may drive the fitting of the linear model.
The reason is the squared distance in the RSS: an individual extreme point contributes a large portion of the RSS.

Both outliers and high-leverage points can be identified with the residuals vs. leverage plot:

plot(mod, 5)

Figure 3.29: Residuals vs. leverage plot for the Price ~ Age + AGST + HarvestRain + WinterRain model for the wine dataset.

The rules of thumb for declaring outliers and high-leverage points are:

• If the standardized residual of an observation is larger than 3 in absolute value, then it may be an outlier.
• If the leverage statistic hi (see below) greatly exceeds (p + 1)/n 21, then the i-th observation may be suspected of having a high leverage.

21 This is the expected value for the leverage statistic hi if the linear model holds.

Let's see an artificial example.

# Create data
set.seed(12345)
x <- rnorm(100)
e <- rnorm(100, sd = 0.5)
y <- 1 + 2 * x + e

# Leverage expected value
2 / 101 # (p + 1) / n
## [1] 0.01980198

# Base model
m0 <- lm(y ~ x)
plot(x, y)
abline(coef = m0$coefficients, col = 2)
plot(m0, 5)
summary(m0)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.10174 -0.30139 -0.00557  0.30949  1.30485

##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.01103 0.05176 19.53 <2e-16 ***
## x 2.04727 0.04557 44.93 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.5054 on 98 degrees of freedom
## Multiple R-squared: 0.9537, Adjusted R-squared: 0.9532
## F-statistic: 2018 on 1 and 98 DF, p-value: < 2.2e-16

# Make an outlier
x[101] <- 0; y[101] <- 30
m1 <- lm(y ~ x)
plot(x, y)
abline(coef = m1$coefficients, col = 2)
plot(m1, 5)
summary(m1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.3676 -0.5730 -0.2955  0.0941 28.6881
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.3119     0.2997   4.377 2.98e-05 ***
## x             1.9901     0.2652   7.505 2.71e-11 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2.942 on 99 degrees of freedom
## Multiple R-squared: 0.3627, Adjusted R-squared: 0.3562
## F-statistic: 56.33 on 1 and 99 DF, p-value: 2.708e-11

# Make a high-leverage point
x[101] <- 10; y[101] <- 5
m2 <- lm(y ~ x)
plot(x, y)
abline(coef = m2$coefficients, col = 2)
plot(m2, 5)
summary(m2)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.2423 -0.6126  0.0373  0.7864  2.1652
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.09830    0.13676   8.031 2.06e-12 ***
## x            1.31440    0.09082  14.473  < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.339 on 99 degrees of freedom
## Multiple R-squared: 0.6791, Adjusted R-squared: 0.6758
## F-statistic: 209.5 on 1 and 99 DF, p-value: < 2.2e-16

The leverage statistic associated to the i-th datum corresponds to the i-th diagonal entry of the hat matrix H:

hi := Hii = (X(X'X)^(-1) X')ii,

and it can be seen that 1/n ≤ hi ≤ 1 and that the mean is h̄ = (1/n) ∑_{i=1}^n hi = (p + 1)/n. This can be clearly seen in the case of simple linear regression, where the leverage statistic has the explicit form

hi = 1/n + (Xi − X̄)² / ∑_{j=1}^n (Xj − X̄)².

Interestingly, this expression shows that the leverage statistic is directly dependent on the distance to the center of the predictor. A measure of how much hi exceeds the expected value h̄ can be given if the predictors are assumed to be jointly normal. In this case, n hi − 1 ∼ χ²p (Peña, 2002) and hence the i-th point is declared as a potential high-leverage point if hi > (χ²p;α + 1)/n, where χ²p;α is the α-upper quantile of the χ²p distribution and α can be taken as 0.05 or 0.01.
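As a quick numerical check of the explicit formula in simple linear regression (added here; it reuses the x vector from the artificial example above), the diagonal of the hat matrix coincides with the explicit expression and averages (p + 1)/n.

# Leverage statistics from the hat matrix vs. the explicit formula
Xd <- cbind(1, x) # Design matrix with intercept
H <- Xd %*% solve(t(Xd) %*% Xd) %*% t(Xd) # Hat matrix
h <- diag(H)
hExplicit <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
max(abs(h - hExplicit)) # Essentially zero
mean(h) # Equals (p + 1) / n = 2 / length(x)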
The functions influence, hat, and rstandard allow us to perform a finer inspection of the leverage statistics.
# Access leverage statistics
head(influence(model = m2, do.coef = FALSE)$hat)
## 1 2 3 4 5 6
## 0.01017449 0.01052333 0.01083766 0.01281244 0.01022209 0.03137304

# Another option
h <- hat(x = x)

# 1% most influential points


n <- length(x)
p <- 1
hist(h, breaks = 20)
abline(v = (qchisq(0.99, df = p) + 1) / n, col = 2)

# Standardized residuals
rs <- rstandard(m2)
plot(m2, 2) # QQ-plot
points(qnorm(ppoints(n = n)), sort(rs), col = 2, pch = ’+’) # Manually computed

3.5.7 Case study application


Moore's law (Moore, 1965) is an empirical law that states that the power of a computer doubles approximately every two years. Translated into a mathematical formula, Moore's law is

transistors ≈ 2^(years/2).

Applying logarithms to both sides gives

log(transistors) ≈ (log(2)/2) years.

We can write the above formula more generally as

log(transistors) = β0 + β1 years + ε,

where ε is a random error. This is a linear model!

The dataset cpus.txt (source, retrieved in September 2016) contains the transistor counts for the CPUs that appeared in the time range 1971–2015. For this data, do the following:

• Import conveniently the data and name it as cpus.


• Show a scatterplot of Transistor.count vs.
Date.of.introduction with a linear regression.
• Are the assumptions verified in Transistor.count ~
Date.of.introduction? Which ones are more “prob-
lematic”?
• Create a new variable, named Log.Transistor.count,
containing the logarithm of Transistor.count.
• Show a scatterplot of Log.Transistor.count vs.
Date.of.introduction with a linear regression.
• Are the assumptions verified in Log.Transistor.count ~ Date.of.introduction? Which ones are more "problematic"?
• Regress Log.Transistor.count ~
Date.of.introduction.
• Summarize the fit. What are the estimates β̂0 and β̂1? Is β̂1 close to log(2)/2?
• Compute the CI for β1 at α = 0.05. Is log(2)/2 inside it? What happens at levels α = 0.10, 0.01?
• We want to forecast the average log-number of transis-
tors for the CPUs to be released in 2017. Compute the
adequate prediction and CI.
• A new CPU design is expected for 2017. What is the
range of log-number of transistors expected for it, at a
95% level of confidence?
• Compute the ANOVA table for Log.Transistor.count
~ Date.of.introduction. Is β 1 significant?

The dataset gpus.txt (source, retrieved in September 2016) contains the transistor counts for the GPUs that appeared in the period 1997–2016. Repeat the previous analysis for this dataset.

3.6 Dimension reduction techniques

As we have seen in Section 3.2, the selection of the best linear


model from a set of p predictors is a challenging task that increases
with the dimension of the problem, that is, with p. In addition to the growth of the set of possible models22 as p grows, the model space becomes more and more complicated to explore due to the potential multicollinearity among the predictors. We will see in this section two methods to deal with these two problems simultaneously.

22 Precisely, the size of the set is 2^(p+1).

3.6.1 Review on principal component analysis


Principal Component Analysis (PCA) is a multivariate technique
designed to summarize the most important features and relations
of p numerical random variables X1 , . . . , X p . PCA computes a new
set of variables, the principal components PC1 , . . . , PC p , that con-
tain the same information as X1 , . . . , X p but expressed in a more
convenient way. The goal of PCA is to retain only a limited number
1 ≤ l < p of principal components such that they explain most of
the information and perform dimension reduction. 23
That is, E[ X j ] = 0, j = 1, . . . , p. This
If X1 , . . . , X p are centred23 , then the principal components are is important since PCA is sensitive to
orthonormal linear combinations of X1 , . . . , X p : the centering of the data.

PC j := a1j X1 + a2j X2 + . . . + a pj X p = a0j X, j = 1, . . . , p, (3.5)

where a j := ( a1j , . . . , a pj )0 , X := ( X1 , . . . , X p )0 , and the orthonormal-


ity condition is

1, if i = j,
ai0 a j =
0, if i 6= j.

Remarkably, PCA computes the principal components in an ordered


way: PC1 is the principal component that explains the most of the
information (quantified as the variance) of X1 , . . . , X p , and then the
explained information decreases monotonically down to PC p , the
principal component that explains the least information. Precisely:

Var[PC1 ] ≥ Var[PC2 ] ≥ . . . ≥ Var[PC p ]. (3.6)

Mathematically, PCA reduces to computing the spectral decomposition24 of the covariance matrix Σ := Var[X]:

Σ = AΛA',

where Λ = diag(λ1, . . . , λp) contains the eigenvalues of Σ and A is the orthogonal matrix25 that contains the unit-norm eigenvectors of Σ as columns. The matrix A gives, thus, the coefficients of the orthonormal linear combinations:

A = (a1 a2 · · · ap) =
⎛ a11 a12 · · · a1p ⎞
⎜ a21 a22 · · · a2p ⎟
⎜  ⋮   ⋮   ⋱   ⋮  ⎟
⎝ ap1 ap2 · · · app ⎠.

24 Recall that the covariance matrix is a real, symmetric, semi-positive definite matrix.
25 Therefore, A^(-1) = A'.
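A tiny simulated illustration of the orthogonality of A (added here, not part of the original exposition): the eigenvectors returned by eigen() for a covariance matrix satisfy A'A = I.

# Eigenvectors of a covariance matrix form an orthogonal matrix
set.seed(42)
X <- matrix(rnorm(300), ncol = 3) # Arbitrary simulated data
A <- eigen(cov(X))$vectors
round(t(A) %*% A, 10) # The 3 x 3 identity: A'A = I, hence A^(-1) = A'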

If the data is not centred, the computation of the principal com-


ponents is done by first subtracting µ = E[X] and then premultiply-
ing with A0 :

PC = A0 (X − µ), (3.7)

where PC = (PC1 , . . . , PC p )0 , X = ( X1 , . . . , X p )0 , and µ =


(µ1 , . . . , µ p )0 . Note that, because of (1.3) and (3.7),

Var[PC] = A0 ΣA = A0 AΛA0 A = Λ, (3.8)

therefore Var[PC j ] = λ j , j = 1, . . . , p, and as a consequence (3.6)


indeed holds.
Also, from (3.7), it is evident that we can express the random
variables X in terms of PC:

X = µ + APC, (3.9)

which admits an insightful interpretation: the PC are uncorrelated26 variables that, once rotated by A and translated to the location µ, produce exactly X. So, PC contains the same information as X but rearranged in a more convenient way, because the principal components are centred and uncorrelated between them:

E[PCi] = 0 and Cov[PCi, PCj] = λi if i = j, and 0 if i ≠ j.

26 Since the variance-covariance matrix Var[PC] is diagonal.

Given the uncorrelation of PC, we can measure the total variance of PC as ∑_{j=1}^p Var[PCj] = ∑_{j=1}^p λj. Consequently, we can define the proportion of variance explained by the first l principal components, 1 ≤ l ≤ p, as

∑_{j=1}^l λj / ∑_{j=1}^p λj.

In the sample case27, where a sample X1, . . . , Xn is given and µ and Σ are unknown, µ is replaced by the sample mean X̄ and Σ by the sample variance-covariance matrix S = (1/n) ∑_{i=1}^n (Xi − X̄)(Xi − X̄)'. Then, the spectral decomposition of S is computed28. This gives Â and produces the scores of the data:

PĈ1 := Â'(X1 − X̄), . . . , PĈn := Â'(Xn − X̄).

The scores are centred, uncorrelated, and have sample variances in each vector's entry that are sorted in a decreasing way.

27 Up to now, the exposition has been focused exclusively on the population case.
28 Some care is needed here. The matrix S is obtained from linear combinations of the n vectors X1 − X̄, . . . , Xn − X̄. Recall that these n vectors are not linearly independent, as they are guaranteed to add up to 0, ∑_{i=1}^n [Xi − X̄] = 0, so it is possible to express one perfectly on the rest. That implies that the p × p matrix S has rank smaller or equal to n − 1. If p ≤ n − 1, then the matrix has full rank p and it is invertible (excluding degenerate cases in which the p variables are collinear). But if p ≥ n, then S is singular and, as a consequence, λj = 0 for j ≥ n. This implies that the principal components for those eigenvalues cannot be determined univocally.

The maximum number of principal components that can be determined from a sample X1, . . . , Xn is min(n − 1, p), assuming that the matrix formed by (X1 · · · Xn) is of full rank (i.e., if the rank is min(n, p)). If n ≥ p and the variables X1, . . . , Xp are such that only r of them are linearly independent, then the maximum is min(n − 1, r).
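A quick simulated sketch of this rank deficiency (added here, not part of the original notes): with p ≥ n, the sample covariance matrix has at most min(n − 1, p) nonzero eigenvalues.

# Rank deficiency of S when p >= n
set.seed(1)
n <- 5; p <- 10
X <- matrix(rnorm(n * p), nrow = n)
S <- cov(X) * (n - 1) / n # Sample covariance matrix (divides by n)
round(eigen(S)$values, 10) # At most min(n - 1, p) = 4 nonzero eigenvalues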

Let's see an example of these concepts in the La Liga 2015/2016 dataset. It contains the standings and team statistics for La Liga 2015/2016.
laliga <- readxl::read_excel("la-liga-2015-2016.xlsx", sheet = 1, col_names = TRUE)
laliga <- as.data.frame(laliga) # Avoid tibble since it drops row.names

A quick preprocessing gives:


rownames(laliga) <- laliga$Team # Set teams as case names to avoid factors
laliga$Team <- NULL
laliga <- laliga[, -c(2, 8)] # Do not add irrelevant information
summary(laliga)
## Points Wins Draws Loses Goals.scored Goals.conceded
## Min. :32.00 Min. : 8.00 Min. : 4.00 Min. : 4.00 Min. : 34.00 Min. :18.00
## 1st Qu.:41.25 1st Qu.:10.00 1st Qu.: 8.00 1st Qu.:12.00 1st Qu.: 40.00 1st Qu.:42.50
## Median :44.50 Median :12.00 Median : 9.00 Median :15.50 Median : 45.50 Median :52.50
## Mean :52.40 Mean :14.40 Mean : 9.20 Mean :14.40 Mean : 52.15 Mean :52.15
## 3rd Qu.:60.50 3rd Qu.:17.25 3rd Qu.:10.25 3rd Qu.:18.25 3rd Qu.: 51.25 3rd Qu.:63.25
## Max. :91.00 Max. :29.00 Max. :18.00 Max. :22.00 Max. :112.00 Max. :74.00
## Percentage.scored.goals Percentage.conceded.goals Shots Shots.on.goal Penalties.scored Assistances
## Min. :0.890 Min. :0.470 Min. :346.0 Min. :129.0 Min. : 1.00 Min. :23.00
## 1st Qu.:1.050 1st Qu.:1.115 1st Qu.:413.8 1st Qu.:151.2 1st Qu.: 1.00 1st Qu.:28.50
## Median :1.195 Median :1.380 Median :438.0 Median :165.0 Median : 3.00 Median :32.50
## Mean :1.371 Mean :1.371 Mean :452.4 Mean :173.1 Mean : 3.45 Mean :37.85
## 3rd Qu.:1.347 3rd Qu.:1.663 3rd Qu.:455.5 3rd Qu.:180.0 3rd Qu.: 4.50 3rd Qu.:36.75
## Max. :2.950 Max. :1.950 Max. :712.0 Max. :299.0 Max. :11.00 Max. :90.00
## Fouls.made Matches.without.conceding Yellow.cards Red.cards Offsides
## Min. :385.0 Min. : 4.0 Min. : 66.0 Min. :1.00 Min. : 72.00
## 1st Qu.:483.8 1st Qu.: 7.0 1st Qu.: 97.0 1st Qu.:4.00 1st Qu.: 83.25
## Median :530.5 Median :10.5 Median :108.5 Median :5.00 Median : 88.00
## Mean :517.6 Mean :10.7 Mean :106.2 Mean :5.05 Mean : 92.60
## 3rd Qu.:552.8 3rd Qu.:13.0 3rd Qu.:115.2 3rd Qu.:6.00 3rd Qu.:103.75
## Max. :654.0 Max. :24.0 Max. :141.0 Max. :9.00 Max. :123.00

Let’s check that R’s function for PCA, princomp, returns the same
principal components we outlined in the theory.
# PCA
pcaLaliga <- princomp(laliga, fix_sign = TRUE)
summary(pcaLaliga)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 104.7782561 48.5461449 22.13337511 12.66692413 8.234215354 7.83426116 6.068864168 4.137079559
## Proportion of Variance 0.7743008 0.1662175 0.03455116 0.01131644 0.004782025 0.00432876 0.002597659 0.001207133
## Cumulative Proportion 0.7743008 0.9405183 0.97506949 0.98638593 0.991167955 0.99549671 0.998094374 0.999301507
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 2.0112480391 1.8580509157 1.126111e+00 9.568824e-01 4.716064e-01 1.707105e-03 8.365534e-04
## Proportion of Variance 0.0002852979 0.0002434908 8.943961e-05 6.457799e-05 1.568652e-05 2.055361e-10 4.935768e-11
## Cumulative Proportion 0.9995868048 0.9998302956 9.999197e-01 9.999843e-01 1.000000e+00 1.000000e+00 1.000000e+00
## Comp.16 Comp.17
## Standard deviation 5.867584e-07 0
## Proportion of Variance 2.428208e-17 0
## Cumulative Proportion 1.000000e+00 1
# The standard deviations are the square roots of the eigenvalues
# The cumulative proportion of variance explained accumulates the
# variance explained starting at the first component

# We use fix_sign = TRUE so that the signs of the loadings are


# determined by the first element of each loading, set to be
# non-negative. Otherwise, the signs could change for different OS /
# R versions yielding to opposite interpretations of the PCs

# Plot of variances of each component (screeplot)
plot(pcaLaliga, type = "l")

# Useful for detecting an "elbow" in the graph whose location gives the
# "right" number of components to retain. Ideally, this elbow appears
# when the next variances are almost similar and notably smaller when
# compared with the previous

# Alternatively: plot of the cumulated percentage of variance
# barplot(cumsum(pcaLaliga$sdev^2) / sum(pcaLaliga$sdev^2))

# Computation of PCA from the spectral decomposition


n <- nrow(laliga)
eig <- eigen(cov(laliga) * (n - 1) / n) # By default, cov() computes the
# quasi-variance-covariance matrix that divides by n - 1, but PCA and princomp
# consider the sample variance-covariance matrix that divides by n
A <- eig$vectors

# Same eigenvalues
pcaLaliga$sdev^2 - eig$values
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## -1.818989e-12 2.728484e-12 2.273737e-13 -2.273737e-13 5.684342e-14 3.552714e-14 4.263256e-14 -3.552714e-14
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16
## 1.234568e-13 5.373479e-14 1.199041e-14 1.054712e-14 1.049161e-14 -3.709614e-15 -2.191892e-15 4.476020e-13
## Comp.17
## 2.048814e-12

# The eigenvectors (the a_j vectors) are the column vectors in $loadings
pcaLaliga$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10 Comp.11 Comp.12
## Points 0.125 0.497 0.195 0.139 0.340 0.425 0.379 0.129 0.166
## Wins 0.184 0.175 0.134 -0.198 0.139
## Draws 0.101 -0.186 0.608 0.175 0.185 -0.251
## Loses -0.129 -0.114 -0.157 -0.410 -0.243 -0.166 0.112
## Goals.scored 0.181 0.251 -0.186 -0.169 0.399 0.335 -0.603 -0.155 0.129 0.289 -0.230
## Goals.conceded -0.471 -0.493 -0.277 0.257 0.280 0.441 -0.118 0.297
## Percentage.scored.goals
## Percentage.conceded.goals
## Shots 0.718 0.442 -0.342 0.255 0.241 0.188
## Shots.on.goal 0.386 0.213 0.182 -0.287 -0.532 -0.163 -0.599
## Penalties.scored -0.350 0.258 0.378 -0.661 0.456
## Assistances 0.148 0.198 -0.173 0.362 0.216 0.215 0.356 -0.685 -0.265
## Fouls.made -0.480 0.844 0.166 -0.110
## Matches.without.conceding 0.151 0.129 -0.182 0.176 -0.369 -0.376 -0.411
## Yellow.cards -0.141 0.144 -0.363 0.113 0.225 0.637 -0.550 -0.126 0.156
## Red.cards -0.123 -0.157 0.405 0.666
## Offsides 0.108 0.202 -0.696 0.647 -0.106
## Comp.13 Comp.14 Comp.15 Comp.16 Comp.17
## Points 0.138 0.278 0.315
## Wins 0.147 -0.907
## Draws -0.304 0.526 -0.277
## Loses 0.156 0.803
## Goals.scored -0.153
## Goals.conceded
## Percentage.scored.goals -0.760 -0.650
## Percentage.conceded.goals 0.650 -0.760
## Shots
## Shots.on.goal
## Penalties.scored -0.114
## Assistances -0.102
## Fouls.made
## Matches.without.conceding -0.664
## Yellow.cards
## Red.cards -0.587
## Offsides
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059
## Cumulative Var 0.059 0.118 0.176 0.235 0.294 0.353 0.412 0.471 0.529 0.588 0.647 0.706 0.765 0.824
## Comp.15 Comp.16 Comp.17
## SS loadings 1.000 1.000 1.000
## Proportion Var 0.059 0.059 0.059
## Cumulative Var 0.882 0.941 1.000

# The scores are the representation of the data in the principal components --
# they have the same information as laliga

head(pcaLaliga$scores)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## Barcelona 242.23916 -21.581016 25.380103 -17.4375054 -7.1797218 9.0814106 -5.5920449 -7.3615386 -0.3715688
## Real Madrid 313.60263 63.202402 -8.756998 8.5557906 0.7119181 0.2221379 6.7034894 2.4455971 1.8388132
## Atlético Madrid 45.99393 -0.646609 38.540964 31.3888316 3.9162812 3.2904137 0.2431925 5.0912667 -3.0444029
## Villarreal -96.22013 -42.932869 50.003639 -11.2420481 10.4732634 2.4293930 -3.0183049 0.1958417 1.2106025
## Athletic 14.51728 -16.189672 18.884019 -0.4122161 -5.6491352 -6.9329640 8.0652665 2.4783231 -2.6920566
## Celta -13.07483 6.792525 5.227118 -9.0945489 6.1264750 11.8794638 -2.6148154 6.9706627 3.0825781
## Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17
## Barcelona 1.7160752 0.0264937 -1.0948280 0.15160351 -0.0010244179 -0.0002177801 -1.047044e-12 -7.623799e-12
## Real Madrid -2.9660592 -0.1344557 0.3409538 -0.03316355 0.0014744734 0.0003891502 1.587123e-12 1.215672e-11
## Atlético Madrid 2.0974553 0.6771343 -0.3985625 -0.18088616 0.0004984680 -0.0001116115 6.252046e-13 1.090230e-12
## Villarreal -1.7453143 0.1350586 -0.5735057 -0.58936061 -0.0001067413 -0.0002431758 3.319980e-13 -3.406707e-12
## Athletic 0.8950389 0.1542005 1.4714997 0.11090055 -0.0031346887 -0.0002557828 -3.696130e-12 -1.895661e-11
## Celta 0.3129865 0.0859623 1.9159241 0.37219921 -0.0017960697 0.0002911046 -2.150767e-12 -5.513489e-12

# Uncorrelated
corrplot::corrplot(cor(pcaLaliga$scores), addCoef.col = "gray")

# Caution! What happened in the last columns? What happened is that the
# variance for the last principal components is close to zero (because there
# are linear dependencies on the variables; e.g. Points, Wins, Loses, Draws),
# so the computation of the correlation matrix becomes unstable for those
# variables (a 0/0 division takes place)

# Better to inspect the covariance matrix
corrplot::corrplot(cov(pcaLaliga$scores), addCoef.col = "gray", is.corr = FALSE)

# The scores are A' * (X_i - mu). We center the data with scale()
# and then multiply each row by A'
scores <- scale(laliga, center = TRUE, scale = FALSE) %*% A

# Same as (but this is much slower)
# scores <- t(apply(scale(laliga, center = TRUE, scale = FALSE), 1,
#                   function(x) t(A) %*% x))

# Same scores (up to possible changes in signs)
max(abs(abs(pcaLaliga$scores) - abs(scores)))
## [1] 1.427989e-11

# Reconstruct the data from all the principal components

head(
sweep(pcaLaliga$scores %*% t(pcaLaliga$loadings), 2, pcaLaliga$center, "+")
)
## Points Wins Draws Loses Goals.scored Goals.conceded Percentage.scored.goals Percentage.conceded.goals
## Barcelona 91 29 4 5 112 29 2.95 0.76
## Real Madrid 90 28 6 4 110 34 2.89 0.89
## Atlético Madrid 88 28 4 6 63 18 1.66 0.47
## Villarreal 64 18 10 10 44 35 1.16 0.92
## Athletic 62 18 8 12 58 45 1.53 1.18
## Celta 60 17 9 12 51 59 1.34 1.55
## Shots Shots.on.goal Penalties.scored Assistances Fouls.made Matches.without.conceding Yellow.cards
## Barcelona 600 277 11 79 385 18 66
## Real Madrid 712 299 6 90 420 14 72
## Atlético Madrid 481 186 1 49 503 24 91
## Villarreal 346 135 3 32 534 17 100
## Athletic 450 178 3 42 502 13 84
## Celta 442 170 4 43 528 10 116
## Red.cards Offsides
## Barcelona 1 120
## Real Madrid 5 114
## Atlético Madrid 3 84
## Villarreal 4 106
## Athletic 5 92
## Celta 6 103

An important issue when doing PCA is the scale of the variables, since the variance depends on the units in which the variable is measured29. Therefore, when variables with different ranges are mixed, the variability of one may dominate the other as an artifact of the scale. To prevent this, we standardize the dataset prior to doing a PCA.

29 Therefore, a sample of lengths measured in centimeters will have a variance 10^4 times larger than the same sample measured in meters -- yet it is the same information!
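A one-line numerical illustration of this scale effect (a made-up sample added here, not from the notes):

# The same sample of lengths expressed in meters and in centimeters
set.seed(1)
heightsM <- rnorm(100, mean = 1.70, sd = 0.10)
var(100 * heightsM) / var(heightsM) # Exactly 1e4: the scale drives the variance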
# Use cor = TRUE to standardize variables (all have unit variance)
# and avoid scale distortions
pcaLaligaStd <- princomp(x = laliga, cor = TRUE, fix_sign = TRUE)
summary(pcaLaligaStd)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 3.2918365 1.5511043 1.13992451 0.91454883 0.85765282 0.59351138 0.45780827 0.370649324
## Proportion of Variance 0.6374228 0.1415250 0.07643694 0.04919997 0.04326873 0.02072093 0.01232873 0.008081231
## Cumulative Proportion 0.6374228 0.7789478 0.85538472 0.90458469 0.94785342 0.96857434 0.98090308 0.988984306
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 0.327182806 0.217470830 0.128381750 0.0976778705 0.083027923 2.582795e-03 1.226924e-03
## Proportion of Variance 0.006296976 0.002781974 0.000969522 0.0005612333 0.000405508 3.924019e-07 8.854961e-08
## Cumulative Proportion 0.995281282 0.998063256 0.999032778 0.9995940111 0.999999519 9.999999e-01 1.000000e+00
## Comp.16 Comp.17
## Standard deviation 2.222483e-08 0
## Proportion of Variance 2.905548e-17 0
## Cumulative Proportion 1.000000e+00 1

# The effects of the distortion can be clearly seen with the biplot
# Variability absorbed by Shots, Shots.on.goal, Fouls.made
biplot(pcaLaliga, cex = 0.75)

# The effects of the variables are more balanced
biplot(pcaLaligaStd, cex = 0.75)

Figure 3.30: Biplots for laliga dataset, with unstandardized and standardized data, respectively.

The biplot30 provides a powerful and succinct way of displaying the relevant information contained in the first two principal components. It shows:

1. The scores of the data in PC1 and PC2 by points (with optional text labels, depending if there are case names). This is the representation of the data in the first two PCs.
2. The variables represented in the PC1 and PC2 by the arrows. These arrows are centred at (0, 0).

Let's examine the population31 arrow associated to the variable Xj. Xj is expressed in terms of PC1 and PC2 by the weights aj1 and aj2:

Xj = aj1 PC1 + aj2 PC2 + . . . + ajp PCp ≈ aj1 PC1 + aj2 PC2.

The weights aj1 and aj2 have the same sign as Cor(Xj, PC1) and Cor(Xj, PC2), respectively. The arrow associated to Xj is given by the segment joining (0, 0) and (aj1, aj2). Therefore:

• If the arrow points right (aj1 > 0), there is positive correlation between Xj and PC1. Analogous if the arrow points left.
• If the arrow is approximately vertical (aj1 ≈ 0), there is uncorrelation between Xj and PC1.

Analogously:

• If the arrow points up (aj2 > 0), there is positive correlation between Xj and PC2. Analogous if the arrow points down.
• If the arrow is approximately horizontal (aj2 ≈ 0), there is uncorrelation between Xj and PC2.

30 Computed with biplot.princomp or with biplot applied to a princomp object. But note that this function applies an internal scaling of the scores and variables to improve the visualization (see ?biplot.princomp). This can be disabled with the argument scale = 0.
31 For the sample version, replace the variable Xj by its sample X1j, . . . , Xnj, the weights aji by their estimates âji, and the principal component PCj by the scores PĈj1, . . . , PĈjn.
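These sign claims can be checked numerically. The following sketch is an addition (it assumes the laliga data frame and the pcaLaligaStd fit from above); for a PCA on standardized variables, the correlation between a variable and a principal component is the corresponding loading scaled by the standard deviation of that component, which in particular implies that the signs agree.

# Loadings on PC1 and PC2 vs. correlations of the variables with the scores
a12 <- pcaLaligaStd$loadings[, 1:2]
cors <- cor(laliga, pcaLaligaStd$scores[, 1:2])
max(abs(cors - sweep(a12, 2, pcaLaligaStd$sdev[1:2], "*"))) # Essentially zero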

In addition, the magnitude of the arrow informs us about the


strength of the correlation of X j with (PC1 , PC2 ).
The biplot also informs about the direct relation between vari-
ables, at sight of their expressions in PC1 and PC2 . The angle of
the arrows of variable X j and Xk gives an approximation to the
correlation between them, Cor( X j , Xk ):

• If angle ≈ 0◦ , the two variables are highly positively correlated.


• If angle ≈ 90◦ , they are approximately uncorrelated.
• If angle ≈ 180◦ , the two variables are highly negatively corre-
lated.

The insights obtained from the approximate correlations


between the variables and principal components are as
valid as the percentage of variance explained by PC1 and
PC2 .

Figure 3.31: Biplot for laliga dataset. Interpretations of PC1 and PC2 are driven by the signs of the variables inside them (directions of the arrows) and by the strength of the correlations of the PC1 and PC2 with the variables (length of the arrows). The scores of the data serve also to cluster similar observations according to their proximity in the biplot.

Some interesting insights in the La Liga 2015/2016 biplot, obtained from the previous remarks, are32:

• PC1 can be regarded as the performance of a team during the sea-


son. PC1 is positively correlated with Wins, Points, etc. and
negatively correlated with Draws, Loses, Yellow.cards, etc.
The best performing teams are not surprising: Barcelona, Real
Madrid, and Atlético Madrid. On the other hand, among the
worst-performing teams are Levante, Getafe, and Granada.
• PC2 can be seen as the efficiency of a team (obtaining points with
little participation in the game). Using this interpretation we can
see that Atlético Madrid and Villareal were the most efficient
teams and that Rayo Vallecano and Real Madrid were the most
inefficient.
• Offsides is approximately uncorrelated with Red.cards and
Matches.without.conceding.

A 3D representation of the biplot can be computed through:

pca3d::pca3d(pcaLaligaStd, show.labels = TRUE, biplot = TRUE)


rgl::rglwidget()

Finally, the biplot function allows us to construct the biplot using two arbitrary principal components by specifying the choices argument. Keep in mind that these pseudo-biplots will explain a lower proportion of variance than the default choices = 1:2.

biplot(pcaLaligaStd, choices = c(1, 3)) # 0.7138 proportion of variance
biplot(pcaLaligaStd, choices = c(2, 3)) # 0.2180 proportion of variance

Figure 3.32: Pseudo-biplots for laliga dataset based on (PC1, PC3) and (PC2, PC3), respectively.
At the sight of the previous plots, can you think about an interpretation for PC3?

3.6.2 Principal components regression
The key idea behind Principal Components Regression (PCR) is to regress the response Y on a set of principal components PC1, . . . , PCl obtained from the predictors X1, . . . , Xp, where l < p. The motivation is that often a small number of principal components is enough to explain most of the variability of the predictors and consequently their relationship with Y33. Therefore, we look for fitting the linear model34

Y = α0 + α1 PC1 + . . . + αl PCl + ε.   (3.10)

33 This does not need to be true, but it is often the case.
34 For the sake of simplicity, we consider in the exposition the first l principal components, but obviously other combinations of l principal components are possible.

The main advantages of PCR are two:

1. Multicollinearity is avoided by design: PC1, . . . , PCl are uncorrelated between them.
2. There are fewer coefficients to estimate (l instead of p), hence the accuracy of the estimation increases.

However, keep in mind that PCR affects the linear model in two fundamental ways:

1. Interpretation of the coefficients is not directly related with the predic-


tors, but with the principal components. Hence, the interpretabil-
ity of a given coefficient in the regression model is tied to the
interpretability of the associated principal component.
2. Prediction needs an extra step, since it is required to obtain the
scores of the new observations of the predictors in the principal
components.

The first point is worth discussing now. The PCR model (3.10) can be seen as a linear model expressed in terms of the original predictors. To make this point clearer, let's re-express (3.10) as

Y = α0 + PC1:l' α1:l + ε,   (3.11)

where the subindex 1:l denotes the inclusion of the vector entries from 1 to l. Now, we can express the PCR problem (3.11) in terms of the original predictors35:

Y = α0 + (A1:l' (X − µ))' α1:l + ε
  = (α0 − µ' A1:l α1:l) + X' A1:l α1:l + ε
  = γ0 + X' γ1:p + ε,

where A1:l represents the A matrix with only its first l columns and

γ0 := α0 − µ' A1:l α1:l,   γ1:p := A1:l α1:l.   (3.12)

35 Since we know by (3.7) that PC = A'(X − µ), where A is the p × p matrix of loadings.

In other words, the coefficients α1:l of the PCR done with l principal
components in (3.10) translate into the coefficients (3.12) of the
linear model based on the p original predictors

Y = γ0 + γ1 X1 + . . . + γ p X p + ε,

In the sample case, we have that

γ̂0 = α̂0 − X̄0 Â1:l α̂1:l , γ̂1:p = Â1:l α̂1:l . (3.13)

Notice that γ̂ is not the least squares estimator that we denote by β̂, but just the coefficients of the PCR that are associated to the original predictors. Consequently, γ̂ is useful for the interpretation of the linear model produced by PCR, as it can be interpreted in the same way β̂ was36.
Finally, remember that the usefulness of PCR relies on how well we are able to reduce the dimensionality37 of the predictors and on the veracity of the assumption that the l principal components are related with Y.

36 For example, thanks to γ̂ we know that the estimated conditional response of Y precisely increases γ̂j for a marginal unit increment in Xj.
37 If l = p, then PCR is equivalent to the least squares estimation.

Keep in mind that PCR considers the PCA done in the set of predictors, that is, we exclude the response for obvious reasons (a perfect and useless fit). It is important to remove the response from the call to princomp if we want to use the output in lm.

We see now two approaches for performing PCR, which we illustrate with the laliga dataset. The common objective is to predict Points using the remaining variables (except for the directly related variables Wins, Draws, Loses, and Matches.without.conceding) in order to quantify, explain, and predict the final points of a team from its performance.
The first approach combines the use of the princomp and lm functions. Its strong points are that it is both able to predict and explain, and that it is linked with techniques we have employed so far. The weak point is that it requires some extra coding.

# A linear model is problematic


mod <- lm(Points ~ . - Wins - Draws - Loses - Matches.without.conceding,
data = laliga)
summary(mod) # Lots of non-significant predictors
##
## Call:
## lm(formula = Points ~ . - Wins - Draws - Loses - Matches.without.conceding,
## data = laliga)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2117 -1.4766 0.0544 1.9515 4.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.11798 26.12915 2.951 0.0214 *
## Goals.scored -28.21714 17.44577 -1.617 0.1498
## Goals.conceded -24.23628 15.45595 -1.568 0.1608
## Percentage.scored.goals 1066.98731 655.69726 1.627 0.1477
## Percentage.conceded.goals 896.94781 584.97833 1.533 0.1691
## Shots -0.10246 0.07754 -1.321 0.2279
## Shots.on.goal 0.02024 0.13656 0.148 0.8863
## Penalties.scored -0.81018 0.77600 -1.044 0.3312
## Assistances 1.41971 0.44103 3.219 0.0147 *
## Fouls.made -0.04438 0.04267 -1.040 0.3328
## Yellow.cards 0.27850 0.16814 1.656 0.1416
## Red.cards 0.68663 1.44229 0.476 0.6485
## Offsides -0.00486 0.14854 -0.033 0.9748
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 4.274 on 7 degrees of freedom
## Multiple R-squared: 0.9795, Adjusted R-squared: 0.9443
## F-statistic: 27.83 on 12 and 7 DF, p-value: 9.784e-05

# We try to clean the model


modBIC <- MASS::stepAIC(mod, k = log(nrow(laliga)), trace = 0)
summary(modBIC) # Better, but still unsatisfactory
##
## Call:
## lm(formula = Points ~ Goals.scored + Goals.conceded + Percentage.scored.goals +
## Percentage.conceded.goals + Shots + Assistances + Yellow.cards,
## data = laliga)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4830 -1.4505 0.9008 1.1813 5.8662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.91373 10.73528 5.860 7.71e-05 ***
## Goals.scored -23.90903 11.22573 -2.130 0.05457 .
## Goals.conceded -12.16610 8.11352 -1.499 0.15959
## Percentage.scored.goals 894.56861 421.45891 2.123 0.05528 .

## Percentage.conceded.goals 440.76333 307.38671 1.434 0.17714


## Shots -0.05752 0.02713 -2.120 0.05549 .
## Assistances 1.42267 0.28462 4.999 0.00031 ***
## Yellow.cards 0.11313 0.07868 1.438 0.17603
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 3.743 on 12 degrees of freedom
## Multiple R-squared: 0.973, Adjusted R-squared: 0.9572
## F-statistic: 61.77 on 7 and 12 DF, p-value: 1.823e-08

# Also, huge multicollinearity


car::vif(modBIC)
## Goals.scored Goals.conceded Percentage.scored.goals Percentage.conceded.goals
## 77998.044758 22299.952547 76320.612577 22322.307151
## Shots Assistances Yellow.cards
## 6.505748 32.505831 3.297224

# A quick way of removing columns without knowing its position


laligaRed <- subset(laliga, select = -c(Points, Wins, Draws, Loses,
Matches.without.conceding))
# PCA without Points, Wins, Draws, Loses, and Matches.without.conceding
pcaLaligaRed <- princomp(x = laligaRed, cor = TRUE, fix_sign = TRUE)
summary(pcaLaligaRed) # l = 3 gives 86% of variance explained
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 2.7437329 1.4026745 0.91510249 0.8577839 0.65747209 0.5310954 0.332556029 0.263170555
## Proportion of Variance 0.6273392 0.1639580 0.06978438 0.0613161 0.03602246 0.0235052 0.009216126 0.005771562
## Cumulative Proportion 0.6273392 0.7912972 0.86108155 0.9223977 0.95842012 0.9819253 0.991141438 0.996913000
## Comp.9 Comp.10 Comp.11 Comp.12
## Standard deviation 0.146091551 0.125252621 3.130311e-03 1.801036e-03
## Proportion of Variance 0.001778562 0.001307352 8.165704e-07 2.703107e-07
## Cumulative Proportion 0.998691562 0.999998913 9.999997e-01 1.000000e+00

# Interpretation of PC1 and PC2


biplot(pcaLaligaRed)
# PC1: attack performance of the team

# Create a new dataset with the response + principal components
laligaPCA <- data.frame("Points" = laliga$Points, pcaLaligaRed$scores)

# Regression on all the principal components
modPCA <- lm(Points ~ ., data = laligaPCA)
summary(modPCA) # Predictors clearly significant -- same R^2 as mod
##
## Call:
## lm(formula = Points ~ ., data = laligaPCA)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.2117 -1.4766  0.0544  1.9515  4.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.4000 0.9557 54.831 1.76e-10 ***
## Comp.1 5.7690 0.3483 16.563 7.14e-07 ***
## Comp.2 -2.4376 0.6813 -3.578 0.0090 **
## Comp.3 3.4222 1.0443 3.277 0.0135 *
## Comp.4 -3.6079 1.1141 -3.238 0.0143 *
## Comp.5 1.9713 1.4535 1.356 0.2172
## Comp.6 5.7067 1.7994 3.171 0.0157 *
## Comp.7 -3.4169 2.8737 -1.189 0.2732
## Comp.8 9.0212 3.6313 2.484 0.0419 *
## Comp.9 -4.6455 6.5415 -0.710 0.5006
## Comp.10 -10.2087 7.6299 -1.338 0.2227
## Comp.11 222.0340 305.2920 0.727 0.4907
## Comp.12 -954.7650 530.6164 -1.799 0.1150

## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 4.274 on 7 degrees of freedom
## Multiple R-squared: 0.9795, Adjusted R-squared: 0.9443
## F-statistic: 27.83 on 12 and 7 DF, p-value: 9.784e-05
car::vif(modPCA) # No problems at all
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10 Comp.11 Comp.12
## 1 1 1 1 1 1 1 1 1 1 1 1

# Using the first three components


modPCA3 <- lm(Points ~ Comp.1 + Comp.2 + Comp.3, data = laligaPCA)
summary(modPCA3)
##
## Call:
## lm(formula = Points ~ Comp.1 + Comp.2 + Comp.3, data = laligaPCA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.178 -4.541 -1.401 3.501 16.093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.4000 1.5672 33.435 3.11e-16 ***
## Comp.1 5.7690 0.5712 10.100 2.39e-08 ***
## Comp.2 -2.4376 1.1173 -2.182 0.0444 *
## Comp.3 3.4222 1.7126 1.998 0.0630 .
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 7.009 on 16 degrees of freedom
## Multiple R-squared: 0.8738, Adjusted R-squared: 0.8501
## F-statistic: 36.92 on 3 and 16 DF, p-value: 2.027e-07

# Coefficients associated to each original predictor (gamma)


alpha <- modPCA3$coefficients
gamma <- pcaLaligaRed$loadings[, 1:3] %*% alpha[-1] # Slopes
gamma <- c(alpha[1] - pcaLaligaRed$center %*% gamma, gamma) # Intercept
gamma
## [1] -44.2288551 1.7305124 -3.4048178 1.7416378 -3.3944235 0.2347716 0.8782162 2.6044699 1.4548813
## [10] -0.1171732 -1.7826488 -2.6423211 1.3755697

# We can overpenalize to have a simpler model -- also one single
# principal component does quite well
modPCABIC <- MASS::stepAIC(modPCA, k = 2 * log(nrow(laliga)), trace = 0)
summary(modPCABIC)
##
## Call:
## lm(formula = Points ~ Comp.1 + Comp.2 + Comp.3 + Comp.4 + Comp.6 +
## Comp.8, data = laligaPCA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6972 -2.6418 -0.3265 2.3535 8.4944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.4000 1.0706 48.946 3.95e-16 ***
## Comp.1 5.7690 0.3902 14.785 1.65e-09 ***
## Comp.2 -2.4376 0.7632 -3.194 0.00705 **
## Comp.3 3.4222 1.1699 2.925 0.01182 *
## Comp.4 -3.6079 1.2481 -2.891 0.01263 *
## Comp.6 5.7067 2.0158 2.831 0.01416 *
## Comp.8 9.0212 4.0680 2.218 0.04502 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 4.788 on 13 degrees of freedom
## Multiple R-squared: 0.9521, Adjusted R-squared: 0.9301


## F-statistic: 43.11 on 6 and 13 DF, p-value: 7.696e-08
# Note that the order of the principal components does not correspond
# exactly to its importance in the regression!

# To perform prediction we need to compute first the scores associated to the


# new values of the predictors, conveniently preprocessed
# Predictions for FCB and RMA (although they are part of the training sample)
newPredictors <- laligaRed[1:2, ]
newPredictors <- scale(newPredictors, center = pcaLaligaRed$center,
scale = pcaLaligaRed$scale) # Centered and scaled
newScores <- t(apply(newPredictors, 1,
function(x) t(pcaLaligaRed$loadings) %*% x))

# We need a data frame for prediction


newScores <- data.frame("Comp" = newScores)
predict(modPCABIC, newdata = newScores, interval = "prediction")
## fit lwr upr
## Barcelona 93.64950 80.35115 106.9478
## Real Madrid 90.05622 77.11876 102.9937

# Reality
laliga[1:2, 1]
## [1] 91 90

The second approach employs the function pls::pcr and is more direct, yet less connected with the techniques we have seen so far. It employs a model object that is different from the lm object and, as a consequence, functions like summary, BIC, MASS::stepAIC, or plot will not work properly. This implies that inference, model selection, and model diagnostics are not so straightforward. In exchange, pls::pcr allows for model fitting in an easier way and for model selection through the use of cross-validation. In summary, this is a purely predictive approach rather than a predictive and explicative one.
# Create a dataset without the problematic predictors and with the response
laligaRed2 <- subset(laliga, select = -c(Wins, Draws, Loses,
Matches.without.conceding))

# Simple call to pcr


library(pls)
modPcr <- pcr(Points ~ ., data = laligaRed2, scale = TRUE)
# Notice we do not need to create a data.frame with PCA, it is automatically
# done within pcr. We also have flexibility to remove predictors from the PCA
# scale = TRUE means that the predictors are scaled internally before computing
# PCA

# The summary of the model is different


summary(modPcr)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: svdpc
## Number of components considered: 12
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.73 79.13 86.11 92.24 95.84 98.19 99.11 99.69 99.87 100.00 100 100.00
## Points 80.47 84.23 87.38 90.45 90.99 93.94 94.36 96.17 96.32 96.84 97 97.95
# First row: percentage of variance explained of the predictors
# Second row: percentage of variance explained of Y (the R^2)
# Note that we have the same R^2 for 3 and 12 components as in the previous
# approach

# Slots of information in the model -- most of them as 3-dim arrays with the
# third dimension indexing the number of components considered
names(modPcr)
## [1] "coefficients" "scores" "loadings" "Yloadings" "projection" "Xmeans" "Ymeans"


## [8] "fitted.values" "residuals" "Xvar" "Xtotvar" "fit.time" "ncomp" "method"
## [15] "scale" "call" "terms" "model"

# The coefficients of the original predictors (gammas), not of the components!


modPcr$coefficients[, , 12]
## Goals.scored Goals.conceded Percentage.scored.goals Percentage.conceded.goals
## -602.85050765 -383.07010184 600.61255371 374.38729000
## Shots Shots.on.goal Penalties.scored Assistances
## -8.27239221 0.88174787 -2.14313238 24.42240486
## Fouls.made Yellow.cards Red.cards Offsides
## -2.96044265 5.51983512 1.20945331 -0.07231723
# pcr() computes up to ncomp (in this case, 12) linear models, each one
# considering one extra principal component. $coefficients returns in a
# 3-dim array the coefficients of all the linear models

# Prediction is simpler and can be done for different number of components


predict(modPcr, newdata = laligaRed2[1:2, ], ncomp = 12)
## , , 12 comps
##
## Points
## Barcelona 92.01244
## Real Madrid 91.38026

# Selecting the number of components to retain. All the components up to ncomp


# are selected, no further flexibility is possible
modPcr2 <- pcr(Points ~ ., data = laligaRed2, scale = TRUE, ncomp = 3)
summary(modPcr2)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: svdpc
## Number of components considered: 3
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps
## X 62.73 79.13 86.11
## Points 80.47 84.23 87.38

# Selecting the number of components to retain by Leave-One-Out


# cross-validation
modPcrCV1 <- pcr(Points ~ ., data = laligaRed2, scale = TRUE,
validation = "LOO")
summary(modPcrCV1)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 20 leave-one-out segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 18.57 8.505 8.390 8.588 7.571 7.688 6.743 7.09 6.224 6.603 7.547 8.375
## adjCV 18.57 8.476 8.356 8.525 7.513 7.663 6.655 7.03 6.152 6.531 7.430 8.236
## 12 comps
## CV 7.905
## adjCV 7.760
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.73 79.13 86.11 92.24 95.84 98.19 99.11 99.69 99.87 100.00 100 100.00
## Points 80.47 84.23 87.38 90.45 90.99 93.94 94.36 96.17 96.32 96.84 97 97.95

# View cross-validation Mean Squared Error in Prediction
validationplot(modPcrCV1, val.type = "MSEP") # l = 8 gives the minimum CV
[Figure: CV MSEP of modPcrCV1 versus the number of components (title "Points").]
# The black is the CV loss, the dashed red line is the adjCV loss, a bias
# corrected version of the MSEP (not described in the notes)

# Selecting the number of components to retain by 10-fold Cross-Validation
# (k = 10 is the default)
modPcrCV10 <- pcr(Points ~ ., data = laligaRed2, scale = TRUE,
validation = "CV")
summary(modPcrCV10)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 18.57 8.520 8.392 8.911 7.187 7.124 6.045 6.340 5.728 6.073 7.676 9.300
## adjCV 18.57 8.464 8.331 8.768 7.092 7.043 5.904 6.252 5.608 5.953 7.423 8.968
## 12 comps
## CV 9.151
## adjCV 8.782
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.73 79.13 86.11 92.24 95.84 98.19 99.11 99.69 99.87 100.00 100 100.00
## Points 80.47 84.23 87.38 90.45 90.99 93.94 94.36 96.17 96.32 96.84 97 97.95
validationplot(modPcrCV10, val.type = "MSEP") # l = 6 gives the minimum CV
[Figure: CV MSEP of modPcrCV10 versus the number of components (title "Points").]

pcr() does an internal scaling of the predictors by their quasi-standard deviations. This means that each variable is divided by √(1/(n−1)), when in princomp a scaling of √(1/n) is applied (the standard deviations are employed). This results in a minor discrepancy in the scores object of both methods that is easily patchable. The scores of princomp() are the ones of pcr() multiplied by √(n/(n−1)). This problem is inherited to the coefficients, which assume scores divided by √(1/(n−1)). Therefore, the γ̂ coefficients described in (3.13) are obtained by dividing the coefficients of pcr() by √(n/(n−1)).

The next chunk of code illustrates the previous warning.

# Equality of loadings from princomp() and pcr()


max(abs(abs(pcaLaligaRed$loadings[, 1:3]) - abs(modPcr$loadings[, 1:3])))
## [1] 1.804112e-15

# Equality of scores from princomp() and pcr() (with the same standardization)
max(abs(abs(pcaLaligaRed$scores[, 1:3]) -
abs(modPcr$scores[, 1:3] * sqrt(n / (n - 1)))))
## [1] 5.551115e-15

# Equality of the gamma coefficients obtained previously for 3 PCA


# (with the same standardization)
modPcr$coefficients[, , 3] / sqrt(n / (n - 1))
## Goals.scored Goals.conceded Percentage.scored.goals Percentage.conceded.goals
## 1.7305124 -3.4048178 1.7416378 -3.3944235
## Shots Shots.on.goal Penalties.scored Assistances
## 0.2347716 0.8782162 2.6044699 1.4548813
## Fouls.made Yellow.cards Red.cards Offsides
## -0.1171732 -1.7826488 -2.6423211 1.3755697
gamma[-1]
## [1] 1.7305124 -3.4048178 1.7416378 -3.3944235 0.2347716 0.8782162 2.6044699 1.4548813 -0.1171732 -1.7826488
## [11] -2.6423211 1.3755697
# Coefficients associated to the principal components -- same as in modPCA3


lm(Points ~ ., data = data.frame("Points" = laliga$Points,
modPcr$scores[, 1:3] * sqrt(n / (n - 1))))
##
## Call:
## lm(formula = Points ~ ., data = data.frame(Points = laliga$Points,
## modPcr$scores[, 1:3] * sqrt(n/(n - 1))))
##
## Coefficients:
## (Intercept) Comp.1 Comp.2 Comp.3
## 52.400 -5.769 2.438 -3.422
modPCA3
##
## Call:
## lm(formula = Points ~ Comp.1 + Comp.2 + Comp.3, data = laligaPCA)
##
## Coefficients:
## (Intercept) Comp.1 Comp.2 Comp.3
## 52.400 5.769 -2.438 3.422
# Of course, flipping of signs is always possible with PCA

The selection of l by cross-validation attempts to minimize the Mean Squared Error in Prediction (MSEP) or, equivalently, the Root MSEP (RMSEP) of the model³⁸. This is a jackknife method valid for the selection of any tuning parameter λ that affects the form of the estimate m̂λ of the regression function m (remember (1.1)).

³⁸ Alternatively, the "in Prediction" part of the latter terms is dropped and they are just referred to as the MSE and RMSE.
Given the sample {(X_i, Y_i)}_{i=1}^n, leave-one-out cross-validation considers the tuning parameter

λ̂_CV := arg min_{λ ≥ 0} ∑_{i=1}^n (Y_i − m̂_{λ,−i}(X_i))²,    (3.14)

where m̂_{λ,−i} represents the fit of the model m̂_λ without the i-th observation (X_i, Y_i).
A less computationally expensive variation on leave-one-out
cross-validation is k-fold cross-validation, which partitions the data
into k folds F1 , . . . , Fk of approximately equal size, trains the model
m̂λ in the aggregation of k − 1 folds and evaluates its MSEP in the
remaining fold:

λ̂_{k-CV} := arg min_{λ ≥ 0} ∑_{j=1}^k ∑_{i ∈ F_j} (Y_i − m̂_{λ,−F_j}(X_i))²,    (3.15)

where m̂λ,− Fj represents the fit of the model m̂λ excluding the data
from the j-th fold Fj . Recall that k-fold cross-validation is more
general than leave-one-out cross-validation, since the latter is a
particular case of the former with k = n.
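To make (3.15) concrete, below is a minimal sketch of how l could be selected by k-fold cross-validation by hand for a PCR fit, using simulated data (nSim, pSim, folds, and cvErr are illustrative names; in practice, pls::pcr with validation = "CV" automates this, as shown above).

# Minimal sketch of (3.15): selecting the number of components l of a PCR fit
# by k-fold cross-validation on simulated data
set.seed(42)
nSim <- 100; pSim <- 5
xSim <- matrix(rnorm(nSim * pSim), nSim, pSim)
ySim <- xSim %*% c(1, -1, 0.5, 0, 0) + rnorm(nSim)
simData <- data.frame("Y" = ySim, xSim)

k <- 10
folds <- sample(rep(1:k, length.out = nSim)) # random fold assignment
cvErr <- sapply(1:pSim, function(l) {
  sum(sapply(1:k, function(j) {
    fit <- pls::pcr(Y ~ ., data = simData[folds != j, ], ncomp = l,
                    scale = TRUE)
    pred <- predict(fit, newdata = simData[folds == j, ], ncomp = l)
    sum((simData$Y[folds == j] - pred)^2) # MSEP contribution of fold j
  }))
})
which.min(cvErr) # the l selected by k-fold cross-validation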
k-fold cross-validation with k < n has an inconvenience to be aware of: it depends on the choice of the folds for splitting the data (i.e., how each datum is assigned to each of the k folds). Some statistical software does this assignment randomly, which means that the selection of λ̂_{k-CV} may vary from one run to another. Thus, fixing the seed prior to the parameter selection is important to ensure reproducibility. Another alternative is to aggregate, in a convenient way, the results of several k-fold cross-validations done on different random partitions. Notice that this problem is not present in leave-one-out cross-validation (where k = n).

Inference in PCR can be carried out as it was done for the


standard linear model. The key point is to realize that the
inference is about the coefficients α0:l associated to the l
principal components, and that it can be carried out by
the summary() function on the output of lm(). The coef-
ficients α0:l are the ones estimated by least squares when
considering the scores of the l principal components as
the predictors. This transformation of the predictors does
not affect the validity of the inferential results of Section
2.4 (derived conditionally on the predictors). But recall
that the inference is not about γ.
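For instance, in the laliga example, confidence intervals for the coefficients of the three principal components retained in modPCA3 follow from the usual lm machinery (a quick illustration of the previous point):

# CIs for the coefficients of the principal components in modPCA3
confint(modPCA3)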
Inference in PCR is based on the assumptions of the lin-


ear model being satisfied for the principal components
we are considering. The evaluation of the assumptions
can be done using the exact same tools described in Sec-
tion 3.5. However, keep in mind that PCA is a linear
transformation of the data, and therefore:

• If the linearity assumption fails for the predictors


X1 , . . . , X p , then it will likely fail for PC1 , . . . , PCl ,
since the transformation will not introduce nonlineari-
ties able to capture the nonlinear effects.
• Similarly, if the homoscedasticity, normality, or inde-
pendence assumptions fail for X1 , . . . , X p , then they
will likely fail for PC1 , . . . , PCl .

Exceptions to the previous common implications are pos-


sible, and may involve the association of one or several
problematic predictors (e.g., having nonlinear effects on the
response) to the principal components that are excluded
from the model. To what extent the failure of the
assumptions in the original predictors can be mitigated
by PCR depends on each application.
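For instance, the diagnostic tools of Section 3.5 can be applied directly to the PCR fit with three components (a quick illustration):

# Usual diagnostic plots for the l = 3 PCR fit
par(mfrow = c(2, 2))
plot(modPCA3)
par(mfrow = c(1, 1))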

3.6.3 Partial least squares


PCR works by replacing the predictors X1 , . . . , X p by a set of princi-
pal components PC1 , . . . , PCl under the hope that these directions,
that explain most of the variability of the predictors, are also the
best directions for predicting the response Y. While this is a rea-
sonable belief, it is not a guaranteed fact. Partial Least Squares (PLS)
precisely tackles this point with the idea of regressing the response
Y on a set of new variables PLS1 , . . . , PLSl that are constructed with
the objective of predicting Y from X1 , . . . , X p in the best linear way.
As with PCA, the idea is to find linear combinations of the predictors X_1, . . . , X_p, that is, to have:

PLS_1 := ∑_{j=1}^p a_{1j} X_j, . . . , PLS_p := ∑_{j=1}^p a_{pj} X_j.

The question is how to choose the coefficients a_{kj}, j, k = 1, . . . , p. PLS does it by placing the most weight on the predictors that are most strongly correlated with Y and in such a way that the resulting PLS_1, . . . , PLS_p are uncorrelated. After standardizing the variables, for PLS_1 this is achieved by setting a_{1j} equal to the theoretical slope coefficient of regressing Y on X_j, that is³⁹:

a_{1j} := Cov[X_j, Y] / Var[X_j],

where a_{1j} stems from Y = a_0 + a_{1j} X_j + ε.

³⁹ Recall (2.3) for the sample version.


The second partial least squares direction, PLS_2, is computed in a similar way, but once the linear effects of PLS_1 on X_1, . . . , X_p are removed. This is achieved by:

1. Regressing each of X_1, . . . , X_p on PLS_1. That is, fit the p simple linear models

   X_j = α_0 + α_{1j} PLS_1 + ε_j,   j = 1, . . . , p.

2. Regress Y on each of the p random errors ε_1, . . . , ε_p⁴⁰ from the above regressions, yielding

   a_{2j} := Cov[ε_j, Y] / Var[ε_j],

   where a_{2j} stems from Y = a_0 + a_{2j} ε_j + ε.

The coefficients for PLS_j, j > 2, can be computed⁴¹ iterating the former process. Once PLS_1, . . . , PLS_l are obtained, then PLS proceeds as PCR and fits the model

Y = β_0 + β_1 PLS_1 + . . . + β_l PLS_l + ε.

⁴⁰ They play the role of X_1, . . . , X_p in the computation of PLS_1 and can be regarded as X_1, . . . , X_p after filtering the linear information explained by PLS_1.
⁴¹ Of course, in practice, the computations need to be done in terms of the sample versions of the described population versions.

The implementation of PLS can be done by the function pls::plsr,


which has an analogous syntax to pls::pcr.
# Simple call to plsr -- very similar to pcr
modPls <- plsr(Points ~ ., data = laligaRed2, scale = TRUE)

# The summary of the model


summary(modPls)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: kernelpls
## Number of components considered: 12
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.53 76.07 84.94 88.29 92.72 95.61 98.72 99.41 99.83 100.00 100.00 100.00
## Points 83.76 91.06 93.93 95.43 95.94 96.44 96.55 96.73 96.84 96.84 97.69 97.95
# First row: percentage of variance explained of the predictors
# Second row: percentage of variance explained of Y (the R^2)
# Note we have the same R^2 for 12 components as in the linear model

# Slots of information in the model


names(modPls)
## [1] "coefficients" "scores" "loadings" "loading.weights" "Yscores" "Yloadings"
## [7] "projection" "Xmeans" "Ymeans" "fitted.values" "residuals" "Xvar"
## [13] "Xtotvar" "fit.time" "ncomp" "method" "scale" "call"
## [19] "terms" "model"

# PLS scores
head(modPls$scores)
## [1] 7.36407868 6.70659816 2.45577740 0.06913485 0.96513341 -0.39588401

# Also uncorrelated
head(cov(modPls$scores))
## Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8
## Comp 1 7.393810e+00 1.704822e-16 1.667417e-16 -1.716470e-16 -1.788249e-16 -8.227618e-16 -4.380633e-16 -1.202580e-16
## Comp 2 1.704822e-16 1.267859e+00 1.815810e-16 1.695817e-16 1.084887e-16 9.991690e-17 7.754899e-18 3.451971e-17
## Comp 3 1.667417e-16 1.815810e-16 9.021126e-01 -1.412726e-17 -7.389551e-17 -1.422399e-16 -3.699555e-17 -4.991681e-17
## Comp 4 -1.716470e-16 1.695817e-16 -1.412726e-17 3.130984e-01 1.130620e-17 1.088368e-17 -1.278556e-17 2.547340e-18
## Comp 5 -1.788249e-16 1.084887e-16 -7.389551e-17 1.130620e-17 2.586132e-01 8.607103e-18 -2.416915e-17 -1.698417e-17
## Comp 6 -8.227618e-16 9.991690e-17 -1.422399e-16 1.088368e-17 8.607103e-18 2.792408e-01 -2.305428e-17 -7.131527e-18


## Comp 9 Comp 10 Comp 11 Comp 12
## Comp 1 -3.291438e-16 -3.606813e-16 -2.165469e-16 -3.535228e-16
## Comp 2 9.691056e-17 1.087954e-16 2.611563e-17 -2.943533e-17
## Comp 3 -1.134240e-17 7.717273e-18 -1.008660e-17 3.430733e-17
## Comp 4 -2.368046e-17 -1.480284e-17 -1.227829e-17 -1.358123e-17
## Comp 5 -3.565028e-17 2.589496e-17 8.527293e-18 9.912266e-18
## Comp 6 -1.579360e-17 -2.849998e-18 1.062075e-17 -2.612850e-18

# The coefficients of the original predictors, not of the components!


modPls$coefficients[, , 2]
## Goals.scored Goals.conceded Percentage.scored.goals Percentage.conceded.goals
## 1.8192870 -4.4038213 1.8314760 -4.4045722
## Shots Shots.on.goal Penalties.scored Assistances
## 0.4010902 0.9369002 -0.2006251 2.3688050
## Fouls.made Yellow.cards Red.cards Offsides
## 0.2807601 -1.6677725 -2.4952503 1.2187529

# Obtaining the coefficients of the PLS components


lm(formula = Points ~., data = data.frame("Points" = laliga$Points,
modPls$scores[, 1:3]))
##
## Call:
## lm(formula = Points ~ ., data = data.frame(Points = laliga$Points,
## modPls$scores[, 1:3]))
##
## Coefficients:
## (Intercept) Comp.1 Comp.2 Comp.3
## 52.400 6.093 4.341 3.232

# Prediction
predict(modPls, newdata = laligaRed2[1:2, ], ncomp = 12)
## , , 12 comps
##
## Points
## Barcelona 92.01244
## Real Madrid 91.38026
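As a quick numerical check of the construction of PLS1 described above (a sketch; w1 and c1 are illustrative names), the first vector of weights of modPls should be proportional to the correlations between the predictors and the response, since the predictors are standardized internally:

# The first PLS weight vector versus the correlations with the response
w1 <- modPls$loading.weights[, 1]
c1 <- cor(subset(laligaRed2, select = -Points), laligaRed2$Points)[, 1]
max(abs(abs(w1 / sqrt(sum(w1^2))) - abs(c1 / sqrt(sum(c1^2))))) # ~ 0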

# Selecting the number of components to retain


modPls2 <- plsr(Points ~ ., data = laligaRed2, scale = TRUE, ncomp = 2)
summary(modPls2)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: kernelpls
## Number of components considered: 2
## TRAINING: % variance explained
## 1 comps 2 comps
## X 62.53 76.07
## Points 83.76 91.06

# Selecting the number of components to retain by Leave-One-Out cross-validation


modPlsCV1 <- plsr(Points ~ ., data = laligaRed2, scale = TRUE,
validation = "LOO")
summary(modPlsCV1)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: kernelpls
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 20 leave-one-out segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 18.57 8.307 7.221 6.807 6.254 6.604 6.572 6.854 7.348 7.548 7.532 7.854
## adjCV 18.57 8.282 7.179 6.742 6.193 6.541 6.490 6.764 7.244 7.430 7.416 7.717
## 12 comps
## CV 7.905
## adjCV 7.760
##
## TRAINING: % variance explained


## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.53 76.07 84.94 88.29 92.72 95.61 98.72 99.41 99.83 100.00 100.00 100.00
## Points 83.76 91.06 93.93 95.43 95.94 96.44 96.55 96.73 96.84 96.84 97.69 97.95

# View cross-validation Mean Squared Error Prediction


validationplot(modPlsCV1, val.type = "MSEP") # l = 4 gives the minimum CV
[Figure: CV MSEP of modPlsCV1 versus the number of components (title "Points").]

# Selecting the number of components to retain by 10-fold Cross-Validation
# (k = 10 is the default)
modPlsCV10 <- plsr(Points ~ ., data = laligaRed2, scale = TRUE,
                   validation = "CV")
summary(modPlsCV10)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: kernelpls
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 18.57 7.895 6.944 6.571 6.330 6.396 6.607 7.018 7.487 7.612 7.607 8.520
## adjCV 18.57 7.852 6.868 6.452 6.201 6.285 6.444 6.830 7.270 7.372 7.368 8.212
## 12 comps
## CV 8.014
## adjCV 7.714
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.53 76.07 84.94 88.29 92.72 95.61 98.72 99.41 99.83 100.00 100.00 100.00
## Points 83.76 91.06 93.93 95.43 95.94 96.44 96.55 96.73 96.84 96.84 97.69 97.95
validationplot(modPlsCV10, val.type = "MSEP")
# l = 4 is close to the minimum CV
[Figure: CV MSEP of modPlsCV10 versus the number of components (title "Points").]

# Regress manually Points on the scores, in order to have an lm object
# Create a new dataset with the response + PLS components
laligaPLS <- data.frame("Points" = laliga$Points, cbind(modPls$scores))

# Regression on the first two PLS
modPLS <- lm(Points ~ Comp.1 + Comp.2, data = laligaPLS)
summary(modPLS) # Predictors clearly significant -- same R^2 as in modPls2
##
## Call:
## lm(formula = Points ~ Comp.1 + Comp.2, data = laligaPLS)

##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3565 -3.6157 0.4508 2.3288 12.3116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.4000 1.2799 40.941 < 2e-16 ***
## Comp.1 6.0933 0.4829 12.618 4.65e-10 ***
## Comp.2 4.3413 1.1662 3.723 0.00169 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 5.724 on 17 degrees of freedom
## Multiple R-squared: 0.9106, Adjusted R-squared: 0.9
## F-statistic: 86.53 on 2 and 17 DF, p-value: 1.225e-09
car::vif(modPLS) # No problems at all
## Comp.1 Comp.2
## 1 1
Let’s perform PCR and PLS in the iris dataset. Recall


that this dataset contains a factor variable, which can
not be treated directly by princomp but can be used in
PCR/PLS (it is transformed internally to two dummy
variables). Do the following:

• Compute the PCA of iris, excluding Species. What


is the percentage of variability explained with two
components?
• Draw the biplot and look for interpretations of the
principal components.
• Plot the PCA scores in a car::scatterplotMatrix
plot such that the scores are coloured by the levels in
Species (you can use the groups argument).
• Compute the PCR and PLS regressions of Petal.Width
with l = 2 (exclude Species). Inspect by CV the best
selection of l for each regression.
• Plot the PLS scores of the data in a
car::scatterplotMatrix plot such that the scores
are coloured by the levels in Species. Compare with
the PCA scores. (A quick way of converting the scores
to a matrix is with rbind.)

For regression, PLS does a more clever selection of the di-


rections of dimensionality reduction than PCA. However,
in practice PCR and PLS tend to perform similarly,
since PLS potentially increases the variance of the direc-
tions with respect to PCR.

Inference for PLS is more involved since, differently to


what happens in PCR, the PLS directions are dependent
on the response Y. This directly breaks a core assump-
tion made in Section 2.4: that the randomness of the
regression model came only from the error terms and not
from the predictors. We do not go into details on how to
perform inference on PLS in these notes.
4 Linear models III: shrinkage, multivariate response, and big data

We explore in this chapter several extensions of the linear model for certain non-classical settings such as: high-dimensional data (p > n) that requires shrinkage methods, big data (large n) that demands thoughtful computation, and the multivariate response situation in which the interest lies in explaining a vector of responses Y = (Y_1, . . . , Y_q).

4.1 Shrinkage

As we saw in Section 2.4.1, the least squares estimates β̂ of the linear model

Y = β_0 + β_1 X_1 + . . . + β_p X_p + ε,

were the minimizers of the residual sum of squares

RSS(β) = ∑_{i=1}^n (Y_i − β_0 − β_1 X_i1 − . . . − β_p X_ip)².

Under the validity of the assumptions of Section 2.3, in Section 2.4 we saw that

β̂ ∼ N_{p+1}(β, σ²(X′X)^{−1}).

A particular consequence of this result is that β̂ is unbiased in estimating β, that is, β̂ does not make any systematic error in the estimation. However, bias is only one part of the quality of an estimate: variance is also important. Indeed, the bias-variance trade-off¹ arises from the bias-variance decomposition of the Mean Squared Error (MSE) of an estimate. For example, for the estimate β̂_j of β_j, we have

MSE[β̂_j] := E[(β̂_j − β_j)²] = (E[β̂_j] − β_j)² + Var[β̂_j],    (4.1)

where the first term is the squared bias and the second term is the variance.

¹ See Section 1.2.
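A small simulation sketch of (4.1) for a toy model with a single predictor and no intercept; the shrunk estimator below mimics a ridge-type estimator and all the names are illustrative.

# Bias-variance trade-off: least squares versus a shrunk (ridge-type) slope
set.seed(1)
betaToy <- 0.5; lambdaToy <- 3; M <- 1e4; nToy <- 20
est <- replicate(M, {
  xToy <- rnorm(nToy); yToy <- betaToy * xToy + rnorm(nToy)
  c("ls" = sum(xToy * yToy) / sum(xToy^2),                  # least squares slope
    "ridge" = sum(xToy * yToy) / (sum(xToy^2) + lambdaToy)) # shrunk slope
})
(rowMeans(est) - betaToy)^2 # squared bias: only the shrunk estimator is biased
apply(est, 1, var)          # variance: smaller for the shrunk estimator
rowMeans((est - betaToy)^2) # MSE: the added bias can pay off with a lower MSE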

Shrinkage methods pursue the following idea:


Add an amount of smart bias to β̂ in order to reduce its variance,


in such a way that we obtain simpler interpretations from the biased
version of β̂.

This is done by enforcing sparsity, that is, by biasing the estimates of β towards being non-null only in the most important relations between the response and predictors. The two methods covered in this section, ridge regression and lasso (least absolute shrinkage and selection operator), use this idea in a different way. It is important to realize that both methods do consider the standard linear model, but what they do differently is the way of estimating β. The way they enforce sparsity in the estimates is by minimizing the RSS plus a penalty term that favors sparsity on the estimated coefficients. For example, ridge regression enforces a quadratic penalty on the coefficients β and seeks to minimize²

RSS(β) + λ ∑_{j=1}^p β_j² = RSS(β) + λ‖β_{−1}‖₂²,    (4.2)

where λ ≥ 0 is the penalty parameter. On the other hand, the lasso considers an absolute penalty:

RSS(β) + λ ∑_{j=1}^p |β_j| = RSS(β) + λ‖β_{−1}‖₁.    (4.3)

² Remember that, for a vector x ∈ R^m and r ≥ 1, the ℓ_r norm (usually referred to as ℓ_p norm, but renamed here to avoid confusion with the number of predictors) is defined as ‖x‖_r := (∑_{j=1}^m |x_j|^r)^{1/r}. The ℓ_∞ norm is defined as ‖x‖_∞ := max_{1≤j≤m} |x_j| and it is satisfied that lim_{r→∞} ‖x‖_r = ‖x‖_∞, for all x ∈ R^m. A visualization of these norms for m = 2 is given in Figure 4.1.

Among other possible joint representations for (4.2) and (4.3)³, the one based on the elastic nets is particularly convenient, as it aims to combine the strengths of both methods in a computationally tractable way and is the one employed in the package glmnet. Considering a proportion 0 ≤ α ≤ 1, the elastic net is defined as

RSS(β) + λ(α‖β_{−1}‖₁ + (1 − α)‖β_{−1}‖₂²).    (4.4)

³ Such as, for example, RSS(β) + λ‖β_{−1}‖_r^r, for r ≥ 1.

Figure 4.1: The “unit circle” ‖(x_1, x_2)‖_r = 1 for r = 1, 2, 3, 4, ∞.

Clearly, ridge regression corresponds to α = 0 (quadratic penalty) and lasso to α = 1 (linear penalty). Obviously, if λ = 0, we are back to the least squares problem and theory. The optimization of (4.4) gives

β̂_{λ,α} := arg min_{β ∈ R^{p+1}} { RSS(β) + λ ∑_{j=1}^p (α|β_j| + (1 − α)|β_j|²) },    (4.5)

which is the penalized estimation of β. Note that the sparsity is enforced in the slopes, not in the intercept, since this depends on the scale of Y. Note also that the optimization problem is convex⁴ and therefore the existence and uniqueness of a minimum are guaranteed. However, in general⁵, there are no explicit formulas for β̂_{λ,α} and the optimization problem needs to be solved numerically. Finally, λ is a tuning parameter that will need to be chosen suitably and that we will discuss afterwards⁶. What is important now is to recall that the predictors need to be standardized, or otherwise their scale will distort the optimization of (4.4).

⁴ The function ‖x‖_r^r is convex if r ≥ 1. If 0 < r < 1, ‖x‖_r^r is not convex.
⁵ The main exception being the ridge regression, as seen in Section 4.1.1.
⁶ In addition, α can also be regarded as a tuning parameter. We do not deal with its data-driven selection in these notes, as this would imply a more costly optimization of the cross-validation on the pair (λ, α).

An equivalent way of viewing (4.5) that helps in visualizing the differences between the ridge and lasso regressions is that they try to solve the equivalent optimization problem⁷ of (4.5):

β̂_{s,α} := arg min_{β ∈ R^{p+1}: ∑_{j=1}^p (α|β_j| + (1−α)|β_j|²) ≤ s_λ} RSS(β),    (4.6)

where s_λ is a certain scalar that does not depend on β.

⁷ Recall that (4.5) can be seen as the Lagrangian of (4.6). Because of the convexity of the optimization problem, the minimizers of (4.5) and (4.6) do actually coincide.

Figure 4.2: Comparison of ridge and lasso solutions from the optimization problem (4.6) with p = 2. The elliptical contours show the regions with equal RSS(β_1, β_2), the objective function, for (β_1, β_2) ∈ R² (β_0 = 0 is assumed). The diamond (α = 1) and circular (α = 0) regions show the feasibility regions determined by ∑_{j=1}^p (α|β_j| + (1 − α)|β_j|²) ≤ s_λ for the optimization problem. The sharpness of the diamond makes the lasso attain solutions with many coefficients exactly zero, in a similar situation to the one depicted. Extracted from James et al. (2013).
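Before the case study below, a small sketch of an elastic net fit (4.5) with α = 0.5, midway between ridge and lasso, on simulated data (xSim, ySim, and enetMod are illustrative names):

# Elastic net with alpha = 0.5 on simulated data
library(glmnet)
set.seed(1)
nSim <- 100; pSim <- 10
xSim <- matrix(rnorm(nSim * pSim), nSim, pSim)
ySim <- xSim[, 1] - 2 * xSim[, 2] + rnorm(nSim)
enetMod <- glmnet(x = xSim, y = ySim, alpha = 0.5) # 0 < alpha < 1 mixes penalties
coef(enetMod, s = 0.1) # coefficients for a particular value of lambda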

The reference implementation of shrinkage estimators based


on elastic nets is the glmnet package. In order to illustrate how
to apply the ridge and lasso regression in practice, we will work
with the ISLR::Hitters dataset. This dataset contains statistics
and salaries from baseball players from the 1986 and 1987 seasons.
The objective will be to predict the Salary from the remaining
predictors.

# Load data -- baseball players statistics


data(Hitters, package = "ISLR")

# Discard NA’s
Hitters <- na.omit(Hitters)

# The glmnet function works with the design matrix of predictors (without
# the ones). This can be obtained easily through model.matrix()
x <- model.matrix(Salary ~ 0 + ., data = Hitters)
# 0 + to exclude a column of 1’s for the intercept, since the intercept will be
# added by default in glmnet::glmnet and if we do not exclude it here we will
# end with two intercepts, one of them resulting in NA

# Interestingly, note that in Hitters there are two-level factors and these
# are automatically transformed into dummy variables in x -- the main advantage
# of model.matrix.lm
head(Hitters[, 14:20])
## League Division PutOuts Assists Errors Salary NewLeague
## -Alan Ashby N W 632 43 10 475.0 N
## -Alvin Davis A W 880 82 14 480.0 A
## -Andre Dawson N E 200 11 3 500.0 N
## -Andres Galarraga N E 805 40 4 91.5 N
## -Alfredo Griffin A W 282 421 25 750.0 A
## -Al Newman N E 76 127 7 70.0 A
head(x[, 14:19])
## LeagueA LeagueN DivisionW PutOuts Assists Errors
## -Alan Ashby 0 1 1 632 43 10
## -Alvin Davis 1 0 1 880 82 14
## -Andre Dawson 0 1 0 200 11 3
## -Andres Galarraga 0 1 0 805 40 4
## -Alfredo Griffin 1 0 1 282 421 25
## -Al Newman 0 1 0 76 127 7

# We also need the vector of responses


y <- Hitters$Salary

model.matrix removes by default the observations with


any NAs, returning only the complete cases. This may be
undesirable in certain circumstances. If NAs are to be pre-
served, an option is to use na.action = "na.pass" but
with the function model.matrix.lm (not model.matrix, as
it ignores the argument!). The next code illustrates this.

# Data with NA in the first observation


data_na <- data.frame("x1" = rnorm(3), "x2" = rnorm(3), "y" = rnorm(3))
data_na$x1[1] <- NA

# The first observation disappears!


model.matrix(y ~ 0 + ., data = data_na)
## x1 x2
## 2 0.5136652 1.435410
## 3 -0.6558154 1.085212
## attr(,"assign")
## [1] 1 2

# Still ignores NA’s


model.matrix(y ~ 0 + ., data = data_na, na.action = "na.pass")
## x1 x2
## 2 0.5136652 1.435410
## 3 -0.6558154 1.085212
## attr(,"assign")
## [1] 1 2

# Does not ignore NA’s


model.matrix.lm(y ~ 0 + ., data = data_na, na.action = "na.pass")
## x1 x2
## 1 NA -1.329132
## 2 0.5136652 1.435410
## 3 -0.6558154 1.085212
## attr(,"assign")
## [1] 1 2

4.1.1 Ridge regression


We describe next how to do the fitting, tuning parameter selection,
prediction, and the computation of the analytical form for the ridge
regression. The first three topics are very similar for the lasso or for
other elastic net fits (i.e., without α = 0).
Fitting

# Call to the main function -- use alpha = 0 for ridge regression
library(glmnet)
ridgeMod <- glmnet(x = x, y = y, alpha = 0)
# By default, it computes the ridge solution over a set of lambdas
# automatically chosen. It also standardizes the variables by default to make
# the model fitting since the penalization is scale-sensitive. Importantly,
# the coefficients are returned on the original scale of the predictors

# Plot of the solution path -- gives the value of the coefficients for different
# measures in xvar (penalization imposed to the model or fitness)
plot(ridgeMod, xvar = "norm", label = TRUE)
[Figure: ridge coefficient paths versus the L1 norm of the coefficients, labeled by predictor index.]
# xvar = "norm" is the default: L1 norm of the coefficients sum_j abs(beta_j)

# Versus lambda
plot(ridgeMod, label = TRUE, xvar = "lambda")
[Figure: ridge coefficient paths versus log(lambda), labeled by predictor index.]

# Versus the percentage of deviance explained -- this is a generalization of the
# R^2 for generalized linear models. Since we have a linear model, this is the
# same as the R^2
plot(ridgeMod, label = TRUE, xvar = "dev")
[Figure: ridge coefficient paths versus the fraction of deviance explained.]

# The maximum R^2 is slightly above 0.5

# Indeed, we can see that R^2 = 0.5461
summary(lm(Salary ~., data = Hitters))$r.squared
## [1] 0.5461159

# Some persistently important predictors are 16, 14, and 15
colnames(x)[c(16, 14, 15)]
## [1] "DivisionW" "LeagueA" "LeagueN"

# What is inside glmnet’s output?
names(ridgeMod)
## [1] "a0" "beta" "df" "dim" "lambda" "dev.ratio" "nulldev" "npasses" "jerr"
## [10] "offset" "call" "nobs"

# lambda versus R^2 -- fitness decreases when sparsity is introduced, in
# exchange for better variable interpretation and avoidance of overfitting
plot(log(ridgeMod$lambda), ridgeMod$dev.ratio, type = "l",
     xlab = "log(lambda)", ylab = "R2")
[Figure: R2 (fraction of deviance explained) versus log(lambda).]

ridgeMod$dev.ratio[length(ridgeMod$dev.ratio)]
## [1] 0.5164752
# Slightly different to lm’s because of compromises in accuracy for speed

# The coefficients for different values of lambda are given in $a0 (intercepts)
# and $beta (slopes) or, alternatively, both in coef(ridgeMod)
length(ridgeMod$a0)
## [1] 100
dim(ridgeMod$beta)
## [1] 20 100
length(ridgeMod$lambda) # 100 lambda’s were automatically chosen
## [1] 100

# Inspecting the coefficients associated to the 50th lambda
coef(ridgeMod)[, 50]
## (Intercept) AtBat Hits HmRun Runs RBI Walks Years
## 214.720401393 0.090210299 0.371622562 1.183305700 0.597288605 0.595291959 0.772639338 2.474046383
## CAtBat CHits CHmRun CRuns CRBI CWalks LeagueA LeagueN
## 0.007597343 0.029269640 0.217470479 0.058715486 0.060728346 0.058696981 -2.903863538 2.903558851
## DivisionW PutOuts Assists Errors NewLeagueN
## -21.886726838 0.052629591 0.007406369 -0.147635448 2.663300521
ridgeMod$lambda[50]
## [1] 2674.375

# Zoom in path solution
plot(ridgeMod, label = TRUE, xvar = "lambda",
     xlim = log(ridgeMod$lambda[50]) + c(-2, 2), ylim = c(-30, 10))
abline(v = log(ridgeMod$lambda[50]))
points(rep(log(ridgeMod$lambda[50]), nrow(ridgeMod$beta)), ridgeMod$beta[, 50],
       pch = 16, col = 1:6)
[Figure: zoom of the ridge coefficient paths around the 50th lambda, with the coefficients at that lambda highlighted.]

# The squared l2-norm of the coefficients decreases as lambda increases
plot(log(ridgeMod$lambda), sqrt(colSums(ridgeMod$beta^2)), type = "l",
     xlab = "log(lambda)", ylab = "l2 norm")
[Figure: l2 norm of the ridge coefficients versus log(lambda).]
Tuning parameter selection

The selection of the penalty parameter λ is usually done by k-fold cross-validation, following the general principle described at the end of Section 3.6. This data-driven selector is denoted by λ̂k-CV and has the form given⁸ in (3.15) (or (3.14) if k = n):

λ̂k-CV := arg min_{λ ≥ 0} CVk(λ),   CVk(λ) := ∑_{j=1}^k ∑_{i ∈ F_j} (Y_i − m̂_{λ,−F_j}(X_i))².

⁸ If the linear model is employed. For generalized linear models, the metric in the cross-validation function is not the squared difference between observations and fitted values.

A very interesting variant of the λ̂k-CV selector is the so-called one standard error rule. This rule is based on the parsimonious principle “favor simplicity within the set of most likely optimal models”.

It arises from observing that the objective function to minimize,


CVk, is random, and thus its minimizer λ̂k-CV is subject to variability. Then, the parsimonious approach proceeds by selecting not λ̂k-CV, but the largest λ (hence, the simplest model) that is still likely optimal, that is, one that is “close” to λ̂k-CV. This closeness is quantified by the estimation of the standard error of the random variable CVk(λ̂k-CV), which is obtained thanks to the fold splitting of the

λ̂k-1SE := max{λ ≥ 0 : CVk(λ) ∈ CVk(λ̂k-CV) ± SÊ(CVk(λ̂k-CV))}.

The λ̂k-1SE selector often offers a good trade-off between model


fitness and interpretability in practice. The code below gives all the
details.
# If we want, we can choose manually the grid of penalty parameters to explore
# The grid should be descending
ridgeMod2 <- glmnet(x = x, y = y, alpha = 0, lambda = 100:1)
plot(ridgeMod2, label = TRUE, xvar = "lambda") # Not a good choice!
[Figure: ridge coefficient paths for the manually chosen lambda grid, versus log(lambda).]

# Lambda is a tuning parameter that can be chosen by cross-validation, using as
# error the MSE (other possible error can be considered for generalized models
# using the argument type.measure)

# 10-fold cross-validation. Change the seed for a different result
set.seed(12345)
kcvRidge <- cv.glmnet(x = x, y = y, alpha = 0, nfolds = 10)

# The lambda grid in which CV is done

# The lambda that minimises the CV error is
kcvRidge$lambda.min
## [1] 25.52821

# Equivalent to
indMin <- which.min(kcvRidge$cvm)
kcvRidge$lambda[indMin]
## [1] 25.52821

# The minimum CV error
kcvRidge$cvm[indMin]
## [1] 115034
min(kcvRidge$cvm)
## [1] 115034
# Potential problem! Minimum occurs at one extreme of the lambda grid in which
# CV is done. The grid was automatically selected, but can be manually inputted
range(kcvRidge$lambda)
## [1] 25.52821 255282.09651
lambdaGrid <- 10^seq(log10(kcvRidge$lambda[1]), log10(0.1),
length.out = 150) # log-spaced grid
kcvRidge2 <- cv.glmnet(x = x, y = y, nfolds = 10, alpha = 0,
lambda = lambdaGrid)

# Much better
plot(kcvRidge2)
kcvRidge2$lambda.min
## [1] 9.506186

# But the CV curve is random, since it depends on the sample. Its variability
# can be estimated by considering the CV curves of each fold. An alternative
# approach to select lambda is to choose the largest within one standard
# deviation of the minimum error, in order to favour simplicity of the model
# around the optimal lambda value. This is known as the "one standard error rule"
kcvRidge2$lambda.1se
## [1] 2964.928

# Location of both optimal lambdas in the CV loss function in dashed vertical


# lines, and lowest CV error and lowest CV error + one standard error
plot(kcvRidge2)
indMin2 <- which.min(kcvRidge2$cvm)
abline(h = kcvRidge2$cvm[indMin2] + c(0, kcvRidge2$cvsd[indMin2]))
[Figure: 10-fold CV error curve for ridge (Mean-Squared Error versus Log(λ)), with horizontal lines at the minimum CV error and at the minimum plus one standard error.]

# The consideration of the one standard error rule for selecting lambda makes
# special sense when the CV function is quite flat around the minimum (hence an
# overpenalization that gives more sparsity does not affect so much the CV loss)

# Leave-one-out cross-validation. More computationally intense but completely
# objective in the choice of the fold-assignment
ncvRidge <- cv.glmnet(x = x, y = y, alpha = 0, nfolds = nrow(Hitters),
                      lambda = lambdaGrid)

# Location of both optimal lambdas in the CV loss function
plot(ncvRidge)
[Figure: leave-one-out CV error curve for ridge (Mean-Squared Error versus Log(λ)).]

Prediction

# The glmnet fit is inside the output of cv.glmnet
modRidgeCV <- kcvRidge2$glmnet.fit

# Inspect the best models
plot(modRidgeCV, label = TRUE, xvar = "lambda")
abline(v = log(c(kcvRidge2$lambda.min, kcvRidge2$lambda.1se)))
[Figure: ridge coefficient paths versus log(lambda), with vertical lines at lambda.min and lambda.1se.]

# The model associated to lambda.1se (or any other lambda not included in the
# original path solution -- obtained by an interpolation) can be retrieved with
predict(modRidgeCV, type = "coefficients", s = kcvRidge2$lambda.1se)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 229.758314334
## AtBat 0.086325740
## Hits 0.351303930
## HmRun 1.142772275
## Runs 0.567245068
## RBI 0.568056880
## Walks 0.731144713
## Years 2.389248929
## CAtBat 0.007261489
## CHits 0.027854683
## CHmRun 0.207220032
## CRuns 0.055877337
## CRBI 0.057777505
## CWalks 0.056352113
## LeagueA -2.509251990
## LeagueN 2.509060248
## DivisionW -20.162700810
## PutOuts 0.048911039
## Assists 0.006973696
## Errors -0.128351187
## NewLeagueN 2.373103450

# Predictions for the first two observations


predict(modRidgeCV, type = "response", s = kcvRidge2$lambda.1se,
newx = x[1:2, ])
## 1
## -Alan Ashby 530.8080
## -Alvin Davis 577.8485

# Predictions for the first observation, for all the lambdas. We can see how
# the prediction for one observation changes according to lambda
plot(log(modRidgeCV$lambda),
predict(modRidgeCV, type = "response", newx = x[1, , drop = FALSE]),
type = "l", xlab = "log(lambda)", ylab = " Prediction")

Analytical form

[Figure: prediction for the first observation as a function of log(lambda).]

The optimization problem (4.4) has an explicit solution for α = 0. To see it, assume that both the response Y and the predictors X_1, . . . , X_p are centred, and that the sample {(X_i, Y_i)}_{i=1}^n is also centred⁹. In this case, there is no intercept β_0 (= 0) to estimate by β̂_0 (= 0) and the linear model is simply

Y = β_1 X_1 + . . . + β_p X_p + ε.

Then, the ridge regression estimator β̂_{λ,0} ∈ R^p is

β̂_{λ,0} = arg min_{β ∈ R^p} RSS(β) + λ‖β‖₂²
        = arg min_{β ∈ R^p} ∑_{i=1}^n (Y_i − X_i β)² + λβ′β
        = arg min_{β ∈ R^p} (Y − Xβ)′(Y − Xβ) + λβ′β,    (4.7)

where X is the design matrix but now excluding the column of ones (thus of size n × p). Luckily, (4.7) is a continuous quadratic optimization problem that is easily solved with the same arguments we employed for obtaining (2.7), resulting in¹⁰

β̂_{λ,0} = (X′X + λI_p)^{−1} X′Y.    (4.8)

The form (4.8) neatly connects with the least squares estimator (λ = 0) and yields many interesting insights. First, notice how the ridge regression estimator is always computable, even if p > n¹¹ and the matrix X′X is not invertible or if X′X is singular due to perfect multicollinearity. Second, as it was done with (2.11), it is straightforward to see that, under the assumptions of the linear model,

β̂_{λ,0} ∼ N_p((X′X + λI_p)^{−1} X′X β, σ²(X′X + λI_p)^{−1} X′X (X′X + λI_p)^{−1}).    (4.9)

⁹ That is, that Ȳ = 0 and X̄ = 0.
¹⁰ If the data was not centred, then (4.8) would translate into β̂_{λ,0} = (X′X + diag(0, λ, . . . , λ))^{−1} X′Y, where X is the n × (p + 1) design matrix with the first column consisting of ones.
¹¹ This was the original motivation for ridge regression, a way of estimating β for the challenging situation with p > n. This property also holds for any other elastic net estimator (e.g., lasso), as long as λ > 0.
The distribution (4.9) is revealing: it shows that β̂_{λ,0} is no longer unbiased and that its variance is smaller¹² than that of the least squares estimator β̂. This is much clearer in the case where the predictors are uncorrelated and standardized, hence X′X = I_p -- precisely the case of the PCA or PLS scores if these are standardized to have unit variance. In this situation, (4.8) and (4.9) simplify to

β̂_{λ,0} = (1 + λ)^{−1} X′Y = (1 + λ)^{−1} β̂,
β̂_{λ,0} ∼ N_p((1 + λ)^{−1} β, σ²(1 + λ)^{−2} I_p).    (4.10)

The shrinking effect of λ is yet more evident from (4.10): if the predictors are uncorrelated, we shrink equally the least squares estimator β̂ by the factor (1 + λ)^{−1}, which results in a reduction of the variance by a factor of (1 + λ)^{−2}. Furthermore, notice an important point: due to the explicit control of the distribution of β̂_{λ,0}, inference about β can be done in a relatively straightforward¹³ way from β̂_{λ,0}, just as it was done from β̂ in Section 2.4. This tractability, both on the explicit form of the estimator and on the associated inference, is one of the main advantages of ridge regression with respect to other shrinkage methods.

¹² If the eigenvalues of X′X are η_1 ≥ . . . ≥ η_p > 0, then the eigenvalues of X′X + λI_p are η_1 + λ ≥ . . . ≥ η_p + λ > 0 (because the addition involves a constant diagonal matrix). Therefore, the determinant of (X′X)^{−1} is ∏_{j=1}^p η_j^{−1} and the determinant of (X′X + λI_p)^{−1} X′X (X′X + λI_p)^{−1} is ∏_{j=1}^p η_j (η_j + λ)^{−2}, which is smaller than ∏_{j=1}^p η_j^{−1} since λ ≥ 0.
¹³ Recall that, if the predictors are uncorrelated, (1 + λ)β̂_{λ,0} ∼ N_p(β, σ² I_p) and the t-tests and CIs for β_j follow easily from there. In general, from (4.9) it follows that (I_p + λ(X′X)^{−1}) β̂_{λ,0} ∼ N_p(β, σ²(X′X)^{−1}) if X′X is invertible.

Finally, just as we did for the least squares estimator, we can define the hat matrix

H_λ := X(X′X + λI_p)^{−1} X′

that predicts Ŷ from Y. This hat matrix becomes especially useful now, as it can be employed to define the effective degrees of freedom associated to a ridge regression with penalty λ. These are defined as the trace of the hat matrix:

df(λ) := tr(H_λ).

The motivation behind is that, for the unrestricted least squares fit¹⁴,

tr(H_0) = tr(X(X′X)^{−1}X′) = tr(X′X(X′X)^{−1}) = p

and thus indeed df(0) = p is representing the degrees of freedom of the fit, understood as the number of parameters employed (keep in mind that the intercept was excluded). For a constrained fit with λ > 0, df(λ) < p because, even if we are estimating p parameters in β̂_{λ,0}, these are restricted to satisfy ‖β̂_{λ,0}‖₂² ≤ s_λ, for a certain s_λ (recall (4.6)). The function df is monotonically decreasing and such that lim_{λ→∞} df(λ) = 0, see Figure 4.3. Recall that, due to the imposed constraint on the coefficients, we could choose λ such that df(λ) = r, where r is an integer smaller than p: this would correspond to effectively employing exactly r parameters in the regression but considering p predictors instead of r.

¹⁴ We employ the cyclic property of the trace operator, which states that tr(ABC) = tr(CAB) = tr(BCA) for any matrices A, B, and C for which the multiplications make sense.

Figure 4.3: The effective degrees of freedom df(λ) as a function of log λ for a ridge regression with p = 5.

The next chunk of code implements β̂_{λ,0} and shows that it is equivalent to the output of glmnet::glmnet with certain touches.

# Random data
p <- 5
n <- 200
beta <- seq(-1, 1, l = p)
set.seed(123124)
x <- matrix(rnorm(n * p), n, p)
y <- 1 + x %*% beta + rnorm(n)

# Unrestricted fit
fit <- glmnet(x, y, alpha = 0, lambda = 0, intercept = TRUE,
standardize = FALSE)
beta0Hat <- rbind(fit$a0, fit$beta)
beta0Hat
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s0
## 1.05856208
## V1 -1.03109958
## V2 -0.56932123
## V3 -0.03813426
## V4 0.47415412
## V5 1.05761841

# Unrestricted fit matches least squares -- but recall glmnet uses an


# iterative method so it is inexact (convergence threshold thresh = 1e-7 by
# default)
X <- model.matrix(y ~ x) # A way of constructing a design matrix that is a
# data.frame and has a column of ones
solve(crossprod(X)) %*% t(X) %*% y
## [,1]
## (Intercept) 1.05856209
## x1 -1.03109954
## x2 -0.56932123
## x3 -0.03813426
## x4 0.47415412
## x5 1.05761841

# Restricted fit
# glmnet considers as the regularization parameter "lambda" the value
# lambda / n (lambda being here the penalty parameter employed in the theory)
lambda <- 2
fit <- glmnet(x, y, alpha = 0, lambda = lambda / n, intercept = TRUE,
standardize = FALSE)
betaLambdaHat <- rbind(fit$a0, fit$beta)
betaLambdaHat
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s0
## 1.0586029
## V1 -1.0264951
## V2 -0.5667469
## V3 -0.0377357
## V4 0.4710700
## V5 1.0528297

# Analytical form with intercept


solve(crossprod(X) + diag(c(0, rep(lambda, p)))) %*% t(X) %*% y
## [,1]
## (Intercept) 1.05864278
## x1 -1.02203900
## x2 -0.56425607
## x3 -0.03735258
## x4 0.46809435
## x5 1.04819600
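To complement the previous chunk, a small sketch of the effective degrees of freedom df(λ) = tr(H_λ) for the simulated predictors above (dfRidge is an illustrative helper function):

# Effective degrees of freedom of the ridge fit as a function of lambda
xCen <- scale(x, center = TRUE, scale = FALSE) # centred predictors
dfRidge <- function(lambda) {
  H <- xCen %*% solve(crossprod(xCen) + lambda * diag(p)) %*% t(xCen)
  sum(diag(H)) # tr(H_lambda)
}
sapply(c(0, 1, 10, 100, 1000), dfRidge) # decreases from p = 5 towards 0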

4.1.2 Lasso

The main novelty in lasso with respect to ridge is its ability to do


variable selection within its fit and the lack of analytical solution.
Fitting, tuning parameter selection, and prediction are completely
analogous to the ridge regression.


Fitting
# Get the Hitters data back
x <- model.matrix(Salary ~ 0 + ., data = Hitters)
y <- Hitters$Salary

# Call to the main function -- use alpha = 1 for lasso regression (the default)
lassoMod <- glmnet(x = x, y = y, alpha = 1)
# Same defaults as before, same object structure

# Plot of the solution path -- now the paths are not smooth when decreasing to
# zero (they are zero exactly). This is a consequence of the l1 norm
plot(lassoMod, xvar = "lambda", label = TRUE)
[Figure: lasso coefficient paths versus log(lambda); the top axis gives the number of nonzero coefficients.]

# Some persistently important predictors are 15, 14, and 19

# Versus the R^2 -- same maximum R^2 as before
plot(lassoMod, label = TRUE, xvar = "dev")
[Figure: lasso coefficient paths versus the fraction of deviance explained.]

# Now the l1-norm of the coefficients decreases as lambda increases
plot(log(lassoMod$lambda), colSums(abs(lassoMod$beta)), type = "l",
     xlab = "log(lambda)", ylab = "l1 norm")
[Figure: l1 norm of the lasso coefficients versus log(lambda).]

# 10-fold cross-validation. Change the seed for a different result
set.seed(12345)
kcvLasso <- cv.glmnet(x = x, y = y, alpha = 1, nfolds = 10)

# The lambda that minimises the CV error is
kcvLasso$lambda.min
## [1] 2.674375

# The "one standard error rule" for lambda
kcvLasso$lambda.1se
## [1] 76.16717

# Location of both optimal lambdas in the CV loss function
indMin <- which.min(kcvLasso$cvm)
plot(kcvLasso)
abline(h = kcvLasso$cvm[indMin] + c(0, kcvLasso$cvsd[indMin]))
[Figure: 10-fold CV error curve for lasso (Mean-Squared Error versus Log(λ)); the numbers on top give the number of nonzero coefficients.]

# No problems now: minimum does not occur at one extreme

# Interesting: note that the numbers on top of the figure give the number of
# coefficients *exactly* different from zero -- the number of predictors
# effectively considered in the model!

# In this case, the one standard error rule also makes sense

# Leave-one-out cross-validation
lambdaGrid <- 10^seq(log10(kcvRidge$lambda[1]), log10(0.1),
                     length.out = 150) # log-spaced grid
ncvLasso <- cv.glmnet(x = x, y = y, alpha = 1, nfolds = nrow(Hitters),
                      lambda = lambdaGrid)

# Location of both optimal lambdas in the CV loss function
plot(ncvLasso)
[Figure: leave-one-out CV error curve for lasso (Mean-Squared Error versus Log(λ)).]

Prediction

# Inspect the best models
modLassoCV <- kcvLasso$glmnet.fit
plot(modLassoCV, label = TRUE, xvar = "lambda")
abline(v = log(c(kcvLasso$lambda.min, kcvLasso$lambda.1se)))
[Figure: lasso coefficient paths versus log(lambda), with vertical lines at lambda.min and lambda.1se.]

# The model associated to lambda.min (or any other lambda not included in the
# original path solution -- obtained by an interpolation) can be retrieved with
predict(modLassoCV, type = "coefficients",
        s = c(kcvLasso$lambda.min, kcvLasso$lambda.1se))
## 21 x 2 sparse Matrix of class "dgCMatrix"
## 1 2
## (Intercept) 1.558172e+02 144.37970485
## AtBat -1.547343e+00 .
## Hits 5.660897e+00 1.36380384
## HmRun . .
## Runs . .
## RBI . .
## Walks 4.729691e+00 1.49731098
## Years -9.595837e+00 .
## CAtBat . .
## CHits . .
## CHmRun 5.108207e-01 .
## CRuns 6.594856e-01 0.15275165
## CRBI 3.927505e-01 0.32833941
## CWalks -5.291586e-01 .
## LeagueA -3.206508e+01 .
## LeagueN 3.285872e-14 .
## DivisionW -1.192990e+02 .
## PutOuts 2.724045e-01 0.06625755
## Assists 1.732025e-01 .
## Errors -2.058508e+00 .
## NewLeagueN . .

# Predictions for the first two observations


predict(modLassoCV, type = "response",
s = c(kcvLasso$lambda.min, kcvLasso$lambda.1se),
newx = x[1:2, ])
## 1 2
## -Alan Ashby 427.8822 540.0835
## -Alvin Davis 700.1705 615.3311

Variable selection
# We can use lasso for model selection!
selPreds <- predict(modLassoCV, type = "coefficients",
s = c(kcvLasso$lambda.min, kcvLasso$lambda.1se))[-1, ] != 0
x1 <- x[, selPreds[, 1]]
x2 <- x[, selPreds[, 2]]

# Least squares fit with variables selected by lasso


modLassoSel1 <- lm(y ~ x1)
modLassoSel2 <- lm(y ~ x2)
summary(modLassoSel1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -940.10 -174.20 -25.94 127.05 1890.12
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 224.24408 84.45013 2.655 0.008434 **
## x1AtBat -2.30798 0.56236 -4.104 5.5e-05 ***
## x1Hits 7.34602 1.71760 4.277 2.7e-05 ***
## x1Walks 6.08610 1.57008 3.876 0.000136 ***
## x1Years -13.60502 10.38333 -1.310 0.191310
## x1CHmRun 0.83633 0.84709 0.987 0.324457
## x1CRuns 0.90924 0.27662 3.287 0.001159 **
## x1CRBI 0.35734 0.36252 0.986 0.325229
## x1CWalks -0.83918 0.27207 -3.084 0.002270 **

## x1LeagueA -36.68460 40.73468 -0.901 0.368685


## x1LeagueN NA NA NA NA
## x1DivisionW -119.67399 39.32485 -3.043 0.002591 **
## x1PutOuts 0.29296 0.07632 3.839 0.000157 ***
## x1Assists 0.31483 0.20460 1.539 0.125142
## x1Errors -3.23219 4.29443 -0.753 0.452373
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 314 on 249 degrees of freedom
## Multiple R-squared: 0.5396, Adjusted R-squared: 0.5156
## F-statistic: 22.45 on 13 and 249 DF, p-value: < 2.2e-16
summary(modLassoSel2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -914.21 -171.94 -33.26 97.63 2197.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -96.96096 55.62583 -1.743 0.082513 .
## x2Hits 2.09338 0.57376 3.649 0.000319 ***
## x2Walks 2.51513 1.22010 2.061 0.040269 *
## x2CRuns 0.26490 0.19463 1.361 0.174679
## x2CRBI 0.39549 0.19755 2.002 0.046339 *
## x2PutOuts 0.26620 0.07857 3.388 0.000814 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 333 on 257 degrees of freedom
## Multiple R-squared: 0.4654, Adjusted R-squared: 0.455
## F-statistic: 44.75 on 5 and 257 DF, p-value: < 2.2e-16

# Comparison with stepwise selection


modBIC <- MASS::stepAIC(lm(Salary ~ ., data = Hitters), k = log(nrow(Hitters)),
trace = 0)
summary(modBIC)
##
## Call:
## lm(formula = Salary ~ AtBat + Hits + Walks + CRuns + CRBI + CWalks +
## Division + PutOuts, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -794.06 -171.94 -28.48 133.36 2017.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117.15204 65.07016 1.800 0.072985 .
## AtBat -2.03392 0.52282 -3.890 0.000128 ***
## Hits 6.85491 1.65215 4.149 4.56e-05 ***
## Walks 6.44066 1.52212 4.231 3.25e-05 ***
## CRuns 0.70454 0.24869 2.833 0.004981 **
## CRBI 0.52732 0.18861 2.796 0.005572 **
## CWalks -0.80661 0.26395 -3.056 0.002483 **
## DivisionW -123.77984 39.28749 -3.151 0.001824 **
## PutOuts 0.27539 0.07431 3.706 0.000259 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 314.7 on 254 degrees of freedom
## Multiple R-squared: 0.5281, Adjusted R-squared: 0.5133
## F-statistic: 35.54 on 8 and 254 DF, p-value: < 2.2e-16
# The lasso variable selection is similar, although the model is slightly worse
# in terms of adjusted R^2 and significance of the predictors. However, keep in
# mind that lasso is solving a constrained least squares problem, so it is
# expected to achieve better R^2 and adjusted R^2 via a selection procedure
# that employs solutions of unconstrained least squares. What is remarkable
# is the speed of lasso on selecting variables, and the fact that gives quite
# good starting points for performing further model selection

# Another interesting possibility is to run a stepwise selection starting from


# the set of predictors selected by lasso. In this search, it is important to
# use direction = "both" (default) and define the scope argument adequately
f <- formula(paste("Salary ~", paste(names(which(selPreds[, 2])),
collapse = " + ")))
start <- lm(f, data = Hitters) # Model with predictors selected by lasso
scope <- list(lower = lm(Salary ~ 1, data = Hitters), # No predictors
upper = lm(Salary ~ ., data = Hitters)) # All the predictors
modBICFromLasso <- MASS::stepAIC(object = start, k = log(nrow(Hitters)),
scope = scope, trace = 0)
summary(modBICFromLasso)
##
## Call:
## lm(formula = Salary ~ Hits + Walks + CRBI + PutOuts + AtBat +
## Division, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -873.11 -181.72 -25.91 141.77 2040.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.51180 65.00006 1.408 0.160382
## Hits 7.60440 1.66254 4.574 7.46e-06 ***
## Walks 3.69765 1.21036 3.055 0.002488 **
## CRBI 0.64302 0.06443 9.979 < 2e-16 ***
## PutOuts 0.26431 0.07477 3.535 0.000484 ***
## AtBat -1.86859 0.52742 -3.543 0.000470 ***
## DivisionW -122.95153 39.82029 -3.088 0.002239 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 319.9 on 256 degrees of freedom
## Multiple R-squared: 0.5087, Adjusted R-squared: 0.4972
## F-statistic: 44.18 on 6 and 256 DF, p-value: < 2.2e-16

# Comparison in terms of BIC, slight improvement with modBICFromLasso


BIC(modLassoSel1, modLassoSel2, modBICFromLasso, modBIC)
## df BIC
## modLassoSel1 15 3839.690
## modLassoSel2 7 3834.434
## modBICFromLasso 8 3817.785
## modBIC 10 3818.320

Consider the la-liga-2015-2016.xlsx dataset. We aim to
predict Points after removing the variables that are
perfectly linearly related with Points. Do the following:

• Lasso regression. Select λ by cross-validation. Obtain
the estimated coefficients for the chosen lambda.
• Use the predictors with non-null coefficients for
creating a model with lm.
• Summarize the model and check for multicollinearity.

It may happen that the cross-validation curve has an "L"-shaped
form without a well-defined global minimum. This usually happens
when only the intercept is significant and none of the predictors
are relevant for explaining Y. The code below illustrates this case.

# Random data with predictors unrelated with the response


p <- 100
n <- 300
set.seed(123124)
x <- matrix(rnorm(n * p), n, p)
y <- 1 + rnorm(n)

# CV
lambdaGrid <- exp(seq(-10, 3, l = 200))
plot(cv.glmnet(x = x, y = y, alpha = 1, nfolds = n, lambda = lambdaGrid))
Figure 4.4: "L"-shaped form of a cross-validation curve with unrelated
response and predictors.

4.2 Constrained linear models

As outlined in the previous section, after doing variable selection
with lasso15 , two possibilities are: (i) fit a linear model on the
lasso-selected predictors; (ii) run a stepwise selection starting from
the lasso-selected model to try to further improve the model16 .
Let's explore the intuitive idea behind (i) in more detail. For the
sake of exposition, assume that among p predictors, lasso selected
the first q of them17 . Then, once q is known, we would seek to fit
the model

Y = β 0 + β 1 X1 + . . . + β p X p + ε,   subject to β 1 = . . . = β q = 0.

This is a very simple constraint that we know how to solve: just
include the p − q remaining predictors in the model and fit it. It is
however a specific case of a linear constraint on β, since β 1 = . . . =
β q = 0 is expressible as

( Iq   0q×( p−q) )q× p β−1 = 0q ,                                     (4.11)

where Iq is a q × q identity matrix and β−1 = ( β 1 , . . . , β p )0 . The
constraint in (4.11) can be generalized as Aβ−1 = c, which results in
the (linearly) constrained linear model

Y = β 0 + β 1 X1 + . . . + β p X p + ε,   subject to Aβ−1 = c,        (4.12)

where A is a q × p matrix of rank q18 and c ∈ Rq . The constrained
linear model (4.12) is useful when there is prior information avail-
able about a linear relation that the coefficients of the linear model
must satisfy (e.g., in piecewise polynomial fitting).

15 For example, based on the data-driven penalization parameters λ̂k-CV
or λ̂k-1SE .
16 Note that with this approach we assign to the more computationally
efficient lasso the "hard work" of coming up with a set of relevant
predictors from the whole dataset, whereas the betterment of that model
is done with the more demanding stepwise regression (if the number of
predictors is smaller than n).
17 Note that this is a random quantity, but we ignore this fact for the
sake of exposition.
18 Therefore, it does not have to be invertible.
Before fitting the model (4.12), let’s assume from now on that
the variables Y and X1 , . . . , X p , as well as the sample {(Xi , Yi )}in=1 ,
are centred (see the tip at the end of Section 2.4.4). This means that
Ȳ = 0 and that X̄ := ( X̄1 , . . . , X̄ p )0 is zero. More importantly, it also

means that β 0 and β̂ 0 are null, hence they are not included in the
model. That is, the model

Y = β 1 X1 + . . . + β p X p + ε                                      (4.13)

is considered. In this setting, β = ( β 1 , . . . , β p )0 (so we do not re-
quire the previous notation β−1 !) and β̂ = (X0 X)−1 X0 Y is the least
squares estimator, with the design matrix X now omitting the first
column of ones.
Now, the estimator of β in (4.13) from a sample {(Xi , Yi )}in=1
under the linear constraint Aβ = c is defined as
β̂A := arg min β∈R p , Aβ=c RSS0 ( β),    RSS0 ( β) := ∑ni=1 (Yi − β 1 Xi1 − . . . − β p Xip )2 .    (4.14)

Solving (4.14) analytically is possible using Lagrange multipliers,


and the explicit solution to (4.14) can be seen to be

β̂A = β̂ + (X0 X)−1 A0 [A(X0 X)−1 A0 ]−1 (c − A β̂). (4.15)

For the general case given in (4.12), in which neither Y and X nor
the sample are centred, the estimator of β in (4.12) is unaltered for
the slopes and equals (4.15). The intercept is given by

β̂ A,0 = Ȳ − X̄0 β̂A .

The next code illustrates how to fit a linear model with con-
straints in practice.

# Simulate data
set.seed(123456)
n <- 50
p <- 3
x1 <- rnorm(n, mean = 1)
x2 <- rnorm(n, mean = 2)
x3 <- rnorm(n, mean = 3)
eps <- rnorm(n, sd = 0.5)
y <- 1 + 2 * x1 - 3 * x2 + x3 + eps

# Center the data and compute design matrix


x1Cen <- x1 - mean(x1)
x2Cen <- x2 - mean(x2)
x3Cen <- x3 - mean(x3)
yCen <- y - mean(y)
X <- cbind(x1Cen, x2Cen, x3Cen)

# Linear restriction: use that


# beta_1 + beta_2 + beta_3 = 0
# beta_2 = -3
# In this case q = 2. The restriction is codified as
A <- rbind(c(1, 1, 1),
c(0, 1, 0))
c <- c(0, -3)

# Fit model without intercept


S <- solve(crossprod(X))
beta_hat <- S %*% t(X) %*% yCen
beta_hat
## [,1]

## x1Cen 1.9873776
## x2Cen -3.1449015
## x3Cen 0.9828062

# Restricted fit enforcing A * beta = c


beta_hat_A <- beta_hat +
S %*% t(A) %*% solve(A %*% S %*% t(A)) %*% (c - A %*% beta_hat)
beta_hat_A
## [,1]
## x1Cen 2.0154729
## x2Cen -3.0000000
## x3Cen 0.9845271

# Intercept of the constrained fit


beta_hat_A_0 <- mean(y) - c(mean(x1), mean(x2), mean(x3)) %*% beta_hat_A
beta_hat_A_0
## [,1]
## [1,] 1.02824

What about inference? In principle, it can be obtained analo-


gously to how the inference for the unconstrained linear model was
obtained in Section 2.4, since the distribution of β̂A under the as-
sumptions of the linear model is straightforward to obtain. We keep
assuming that the model is centered. Then, recall that (4.15) can be
expressed as

β̂A = (X0 X)−1 A0 [A(X0 X)−1 A0 ]−1 c + ( I − (X0 X)−1 A0 [A(X0 X)−1 A0 ]−1 A ) β̂.

Then, using (1.4) and proceeding similarly to (2.11),

β̂A ∼ N p ( β + b( β, A, c, X), σ2 (X0 X)−1 − v(σ2 , A, X) ),         (4.16)

where

b( β, A, c, X) := (X0 X)−1 A0 [A(X0 X)−1 A0 ]−1 (c − Aβ),


v(σ2 , A, X) := σ2 (X0 X)−1 A0 [A(X0 X)−1 A0 ]−1 A(X0 X)−1 .

The inference for constrained linear models is not built within base
R. Therefore, we just give a couple of insights about (4.16) and do
not pursue inference further. Note that:

• The variances of β̂ A,j , j = 1, . . . , p, decrease with respect to the
variances of β̂ j , given by the diagonal elements of σ2 (X0 X)−1 .
This is perfectly coherent: after all, we are constraining the possi-
ble values that the estimator of β can take in order to accommo-
date Aβ = c. More importantly, these variances remain the same
irrespective of whether Aβ = c holds or not (since they do not
depend on c!).

• The bias of β̂A depends on the veracity of Aβ = c. If the re-
striction is verified, then b( β, A, c, X) = 0 and β̂A is still unbiased.
However, if Aβ ≠ c, then β̂A is severely biased in estimating β.

A small numerical check of (4.16) for the simulated example above
is given right below.
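This check is illustrative only: the names sigma2 and v are ad hoc, S
and A are reused from the previous code chunk, and the true error
standard deviation 0.5 employed in the simulation is plugged in for σ.

# Covariance matrix in (4.16) for the simulated example (sigma = 0.5)
sigma2 <- 0.5^2
v <- sigma2 * S %*% t(A) %*% solve(A %*% S %*% t(A)) %*% A %*% S
diag(sigma2 * S)     # Variances of the unconstrained estimator
diag(sigma2 * S - v) # Variances of the constrained estimator -- smaller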

Verify by Monte Carlo that the covariance matrix in (4.16)


is correct. To do so:

1. Choose β, A, and c at your convenience.


2. Sample n = 50 observations for the predictors.
3. Sample n = 50 observations for the responses from a
linear model based on β. Use the same n observations
for the predictors from step 2.
4. Compute β̂A .
5. Repeat steps 3–4 M = 500 times, saving each time β̂A .
6. Compute the sample covariance matrix of the β̂A ’s.
7. Compare it with the covariance matrix in (4.16).

Do the same study for checking the expectation in (4.16),


for the cases in which Aβ = c and Aβ ≠ c.

4.3 Multivariate multiple linear model

So far, we have been interested in predicting/explaining a single


response Y from a set of predictors X1 , . . . , X p . However, we might
want to predict/explain several responses Y1 , . . . , Yq 19 . As we will
see, the model construction and estimation are quite analogous
to the univariate multiple linear model, yet more cumbersome in
notation.

19 Do not confuse them with Y1 , . . . , Yn , the notation employed in the
rest of the sections for denoting the sample of the response Y.

4.3.1 Model formulation and least squares


The centred20 population version of the multivariate multiple linear
model is

Y1 = β 11 X1 + . . . + β p1 X p + ε 1 ,
 ...
Yq = β 1q X1 + . . . + β pq X p + ε q ,

20 Centering the responses and predictors is useful for removing the
intercept term, allowing for simpler matricial versions.

or, equivalently in matrix form,

Y = B0 X + ε (4.17)

where ε := (ε 1 , . . . , ε q )0 is a random vector with null expectation,


Y = (Y1 , . . . , Yq )0 and X = ( X1 , . . . , X p )0 are random vectors, and
 
      ( β 11  . . .  β 1q )
B =   (  ..    ..     ..  )
      ( β p1  . . .  β pq ) p×q .

Clearly, this construction implies that the conditional expectation of


the random vector Y is

E[Y|X = x] = B0 x.

Given a sample {(Xi , Yi )}in=1 of observations of ( X1 , . . . , X p ) and


(Y1 , . . . , Yq ), the sample version of (4.17) is
       
( Y11 . . . Y1q )       ( X11 . . . X1p )     ( β 11 . . . β 1q )     ( ε 11 . . . ε 1q )
(  ..   ..   .. )   =   (  ..   ..   .. )     (  ..   ..    .. )   +  (  ..   ..    .. )
( Yn1 . . . Ynq )n×q    ( Xn1 . . . Xnp )n×p  ( β p1 . . . β pq )p×q  ( ε n1 . . . ε nq )n×q
                                                                                     (4.18)

or, equivalently in matrix form,

Y = XB + E,                                                           (4.19)

where Y, X21 , and E are clearly identified by comparing (4.19)
with (4.18).

21 This notation is required to avoid confusions between the design
matrix and the random vector (not matrix) X.
The approach for estimating B is really similar to the univariate
multiple linear model: minimize the sum of squared distances be-
tween the responses Y1 , . . . , Yn and their explanations B0 X1 , . . . , B0 Xn .
These distances are now measured by the || · ||2 norm, resulting in

RSS(B) := ∑ni=1 ||Yi − B0 Xi ||22
        = ∑ni=1 ( Yi − B0 Xi )0 ( Yi − B0 Xi )
        = tr[ (Y − XB)0 (Y − XB) ].


The similarities with (2.6) are clear and it is immediate to see that it
appears as a special case for q = 1 22 .

B̂ := arg min B∈M p×q RSS(B) = (X0 X)−1 X0 Y.                          (4.20)

22 Employing the centred version of the univariate multiple linear model,
as we have done in this section for the multivariate version.

Recall that if the responses and predictors are not centred, then the
estimate of the intercept is simply obtained from the sample means
Ȳ := (Ȳ1 , . . . , Ȳq )0 and X̄ = ( X̄1 , . . . , X̄ p )0 :

β̂0 = Ȳ − B̂0 X̄.

Equation (4.20) reveals that fitting a q-multivariate linear


model amounts to fitting q univariate linear models
separately! Indeed, recall that B = ( β1 · · · βq ), where the
column vector β j represents the vector of coefficients of
the j-th univariate linear model. Then, comparing (4.20)
with (2.7) (where Y consisted of a single column) and by
block matrix multiplication, we can clearly see that B̂ is
just the concatenation of the columns of β̂ j , j = 1, . . . , q,
i.e., B̂ = ( β̂1 · · · β̂q ).

As happened in the univariate linear model, if p > n then


the inverse of X0 X in (4.20) does not exist. In that case,
one should either remove predictors or resort to a shrink-
age method that avoids inverting X0 X. It is interesting
to note, though, that q has no effect on the feasibility
of the fitting, only p does. In particular it is possible to
compute (4.20) with q > n.

We see next how to do multivariate multiple linear regression in


R for a simulated example.
# Dimensions and sample size
p <- 3
q <- 2
n <- 100

# A quick way of creating a non-diagonal (valid) covariance matrix for the


# errors
Sigma <- 3 * toeplitz(seq(1, 0.1, l = q))
set.seed(12345)
X <- mvtnorm::rmvnorm(n = n, mean = 1:p, sigma = diag(0.5, nrow = p, ncol = p))
E <- mvtnorm::rmvnorm(n = n, mean = rep(0, q), sigma = Sigma)

# Linear model
B <- matrix((-1)^(1:p) * (1:p), nrow = p, ncol = q, byrow = TRUE)
Y <- X %*% B + E

# Fitting the model (note: Y and X are matrices!)


mod <- lm(Y ~ X)
mod
##
## Call:
## lm(formula = Y ~ X)
##
## Coefficients:
## [,1] [,2]
## (Intercept) 0.05017 -0.36899
## X1 -0.54770 2.06905
## X2 -3.01547 -0.78308
## X3 1.88327 -3.00840
# Note that the intercept is markedly different from zero -- that is because
# X is not centred

# Compare with B
B
## [,1] [,2]
## [1,] -1 2
## [2,] -3 -1
## [3,] 2 -3
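
# A quick sanity check (not part of the original example): the closed-form
# estimate in (4.20), with a column of ones appended to account for the
# intercept, matches lm's coefficient matrix, and its first column is just
# the fit of the first univariate linear model
Xd <- cbind(1, X)
B_hat <- solve(crossprod(Xd)) %*% crossprod(Xd, Y)
max(abs(B_hat - coef(mod)))                 # Numerically zero
max(abs(B_hat[, 1] - coef(lm(Y[, 1] ~ X)))) # Also numerically zero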

# Summary of the model: gives q separate summaries, one for each fitted
# univariate model
summary(mod)
## Response Y1 :
##
## Call:
## lm(formula = Y1 ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0432 -1.3513 0.2592 1.1325 3.5298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 0.05017 0.96251 0.052 0.9585


## X1 -0.54770 0.24034 -2.279 0.0249 *
## X2 -3.01547 0.26146 -11.533 < 2e-16 ***
## X3 1.88327 0.21537 8.745 7.38e-14 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.695 on 96 degrees of freedom
## Multiple R-squared: 0.7033, Adjusted R-squared: 0.694
## F-statistic: 75.85 on 3 and 96 DF, p-value: < 2.2e-16
##
##
## Response Y2 :
##
## Call:
## lm(formula = Y2 ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1385 -0.7922 -0.0486 0.8987 3.6599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3690 0.8897 -0.415 0.67926
## X1 2.0691 0.2222 9.314 4.44e-15 ***
## X2 -0.7831 0.2417 -3.240 0.00164 **
## X3 -3.0084 0.1991 -15.112 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.567 on 96 degrees of freedom
## Multiple R-squared: 0.7868, Adjusted R-squared: 0.7801
## F-statistic: 118.1 on 3 and 96 DF, p-value: < 2.2e-16

# Exactly equivalent to
summary(lm(Y[, 1] ~ X))
##
## Call:
## lm(formula = Y[, 1] ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0432 -1.3513 0.2592 1.1325 3.5298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.05017 0.96251 0.052 0.9585
## X1 -0.54770 0.24034 -2.279 0.0249 *
## X2 -3.01547 0.26146 -11.533 < 2e-16 ***
## X3 1.88327 0.21537 8.745 7.38e-14 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.695 on 96 degrees of freedom
## Multiple R-squared: 0.7033, Adjusted R-squared: 0.694
## F-statistic: 75.85 on 3 and 96 DF, p-value: < 2.2e-16
summary(lm(Y[, 2] ~ X))
##
## Call:
## lm(formula = Y[, 2] ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1385 -0.7922 -0.0486 0.8987 3.6599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3690 0.8897 -0.415 0.67926

## X1 2.0691 0.2222 9.314 4.44e-15 ***


## X2 -0.7831 0.2417 -3.240 0.00164 **
## X3 -3.0084 0.1991 -15.112 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.567 on 96 degrees of freedom
## Multiple R-squared: 0.7868, Adjusted R-squared: 0.7801
## F-statistic: 118.1 on 3 and 96 DF, p-value: < 2.2e-16

Let’s see another quick example using the iris dataset

# When we want to add several variables of a dataset as responses through a


# formula interface, we have to use cbind() in the response. Doing
# "Petal.Width + Petal.Length ~ ..." is INCORRECT, as lm will understand
# "I(Petal.Width + Petal.Length) ~ ..." and do one single regression

# Predict Petal’s measurements from Sepal’s


modIris <- lm(cbind(Petal.Width, Petal.Length) ~
Sepal.Length + Sepal.Width + Species, data = iris)
summary(modIris)
## Response Petal.Width :
##
## Call:
## lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Species,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50805 -0.10042 -0.01221 0.11416 0.46455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.86897 0.16985 -5.116 9.73e-07 ***
## Sepal.Length 0.06360 0.03395 1.873 0.063 .
## Sepal.Width 0.23237 0.05145 4.516 1.29e-05 ***
## Speciesversicolor 1.17375 0.06758 17.367 < 2e-16 ***
## Speciesvirginica 1.78487 0.07779 22.944 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.1797 on 145 degrees of freedom
## Multiple R-squared: 0.9459, Adjusted R-squared: 0.9444
## F-statistic: 634.3 on 4 and 145 DF, p-value: < 2.2e-16
##
##
## Response Petal.Length :
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Sepal.Width + Species,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75196 -0.18755 0.00432 0.16965 0.79580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.63430 0.26783 -6.102 9.08e-09 ***
## Sepal.Length 0.64631 0.05353 12.073 < 2e-16 ***
## Sepal.Width -0.04058 0.08113 -0.500 0.618
## Speciesversicolor 2.17023 0.10657 20.364 < 2e-16 ***
## Speciesvirginica 3.04911 0.12267 24.857 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.2833 on 145 degrees of freedom
## Multiple R-squared: 0.9749, Adjusted R-squared: 0.9742

## F-statistic: 1410 on 4 and 145 DF, p-value: < 2.2e-16

# The fitted values and resiuals are now matrices


head(modIris$fitted.values)
## Petal.Width Petal.Length
## 1 0.2687095 1.519831
## 2 0.1398033 1.410862
## 3 0.1735565 1.273483
## 4 0.1439590 1.212910
## 5 0.2855861 1.451142
## 6 0.3807391 1.697490
head(modIris$residuals)
## Petal.Width Petal.Length
## 1 -0.06870951 -0.119831001
## 2 0.06019672 -0.010861533
## 3 0.02644348 0.026517420
## 4 0.05604099 0.287089900
## 5 -0.08558613 -0.051141525
## 6 0.01926089 0.002510054

# The individual models


modIris1 <- lm(Petal.Width ~Sepal.Length + Sepal.Width + Species, data = iris)
modIris2 <- lm(Petal.Length ~Sepal.Length + Sepal.Width + Species, data = iris)
summary(modIris1)
##
## Call:
## lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Species,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50805 -0.10042 -0.01221 0.11416 0.46455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.86897 0.16985 -5.116 9.73e-07 ***
## Sepal.Length 0.06360 0.03395 1.873 0.063 .
## Sepal.Width 0.23237 0.05145 4.516 1.29e-05 ***
## Speciesversicolor 1.17375 0.06758 17.367 < 2e-16 ***
## Speciesvirginica 1.78487 0.07779 22.944 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.1797 on 145 degrees of freedom
## Multiple R-squared: 0.9459, Adjusted R-squared: 0.9444
## F-statistic: 634.3 on 4 and 145 DF, p-value: < 2.2e-16
summary(modIris2)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Sepal.Width + Species,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75196 -0.18755 0.00432 0.16965 0.79580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.63430 0.26783 -6.102 9.08e-09 ***
## Sepal.Length 0.64631 0.05353 12.073 < 2e-16 ***
## Sepal.Width -0.04058 0.08113 -0.500 0.618
## Speciesversicolor 2.17023 0.10657 20.364 < 2e-16 ***
## Speciesvirginica 3.04911 0.12267 24.857 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.2833 on 145 degrees of freedom
## Multiple R-squared: 0.9749, Adjusted R-squared: 0.9742

## F-statistic: 1410 on 4 and 145 DF, p-value: < 2.2e-16

4.3.2 Assumptions and inference


As deduced from what we have seen so far, fitting a multivariate
linear regression is more practical than doing q separate univariate
fits (especially if the number of responses q is large), but it is not
conceptually different. The discussion becomes more interesting
in the inference for the multivariate linear regression, where the
dependence between the responses has to be taken into account. In
order to achieve inference, we will require some assumptions, these
being natural extensions of the ones seen in Section 2.3:

i. Linearity: E[Y|X = x] = B0 x.
ii. Homoscedasticity: Var[ε| X1 = x1 , . . . , X p = x p ] = Σ.
iii. Normality: ε ∼ Nq (0, Σ).
iv. Independence of the errors: ε1 , . . . , εn are independent (or
uncorrelated, E[εi ε0j ] = 0, i 6= j, since they are assumed to be
normal).

Then, a good one-line summary of the multivariate multiple


linear model is (independence is implicit)

Y|X = x ∼ Nq (B0 x, Σ).

Based on these assumptions, the key result for rooting inference


is the distribution of B̂ = ( β̂1 · · · β̂q ) as an estimator of B =
( β1 · · · βq ). This result is now more cumbersome23 , but we can
state it as

β̂ j ∼ N p ( β j , σj2 (X0 X)−1 ),   j = 1, . . . , q,                  (4.21)

( β̂ j , β̂k )0 ∼ N2p ( ( β j , βk )0 , [ σj2 (X0 X)−1  σjk (X0 X)−1 ; σjk (X0 X)−1  σk2 (X0 X)−1 ] ),
                                                  j, k = 1, . . . , q,  (4.22)

23 Indeed, specifying the full distribution of B̂ would require introducing
the matrix normal distribution, a generalization of the p-dimensional
normal seen in Section 1.3.

where Σ = (σij ) and σii = σi2 .24

The results (4.21)–(4.22) open the way for obtaining hypothesis
tests on the joint significance of a predictor in the model (for the q
responses, not just for one), confidence intervals for the coefficients,
prediction confidence regions for the conditional expectation and
the conditional response, the Multivariate ANOVA (MANOVA) de-
composition, the multivariate extensions of the F-test, and others.
However, due to the correlation between responses and the multi-
variateness, these tools are not as simple as in the univariate linear
model. Therefore, given the increased complexity, we do not go into
more details here and refer the interested reader to, e.g., Chapter 8
in Seber (1984). We illustrate with code, though, the most important
practical aspects.

24 Observe how the covariance of the errors ε j and ε k , denoted by σjk ,
is responsible for the correlation between β̂ j and β̂k in (4.22). If
σjk = 0, then β̂ j and β̂k would be uncorrelated (thus independent because
of their joint normality).
# Confidence intervals for the parameters
confint(modIris)
## 2.5 % 97.5 %

## Petal.Width:(Intercept) -1.204674903 -0.5332662


## Petal.Width:Sepal.Length -0.003496659 0.1307056
## Petal.Width:Sepal.Width 0.130680383 0.3340610
## Petal.Width:Speciesversicolor 1.040169583 1.3073259
## Petal.Width:Speciesvirginica 1.631118293 1.9386298
## Petal.Length:(Intercept) -2.163654566 -1.1049484
## Petal.Length:Sepal.Length 0.540501864 0.7521177
## Petal.Length:Sepal.Width -0.200934599 0.1197646
## Petal.Length:Speciesversicolor 1.959595164 2.3808588
## Petal.Length:Speciesvirginica 2.806663658 3.2915610
# Warning! Do not confuse Petal.Width:Sepal.Length with an interaction term!
# It is meant to represent the Response:Predictor coefficient

# Prediction -- now more limited without confidence intervals implemented


predict(modIris, newdata = iris[1:3, ])
## Petal.Width Petal.Length
## 1 0.2687095 1.519831
## 2 0.1398033 1.410862
## 3 0.1735565 1.273483

# MANOVA table
manova(modIris)
## Call:
## manova(modIris)
##
## Terms:
## Sepal.Length Sepal.Width Species Residuals
## Petal.Width 57.9177 6.3975 17.5745 4.6802
## Petal.Length 352.8662 50.0224 49.7997 11.6371
## Deg. of Freedom 1 1 2 145
##
## Residual standard errors: 0.1796591 0.2832942
## Estimated effects may be unbalanced

# "Same" as the "Sum Sq" and "Df" entries of


anova(modIris1)
## Analysis of Variance Table
##
## Response: Petal.Width
## Df Sum Sq Mean Sq F value Pr(>F)
## Sepal.Length 1 57.918 57.918 1794.37 < 2.2e-16 ***
## Sepal.Width 1 6.398 6.398 198.21 < 2.2e-16 ***
## Species 2 17.574 8.787 272.24 < 2.2e-16 ***
## Residuals 145 4.680 0.032
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
anova(modIris2)
## Analysis of Variance Table
##
## Response: Petal.Length
## Df Sum Sq Mean Sq F value Pr(>F)
## Sepal.Length 1 352.87 352.87 4396.78 < 2.2e-16 ***
## Sepal.Width 1 50.02 50.02 623.29 < 2.2e-16 ***
## Species 2 49.80 24.90 310.26 < 2.2e-16 ***
## Residuals 145 11.64 0.08
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

# anova() serves for assessing the significance of including a new predictor


# for explaining all the responses. This is based on an extension of the
# *sequential* ANOVA table briefly covered in Section 2.6. The hypothesis test
# is by default conducted with the Pillai statistic (an extension of the F-test)
anova(modIris)
## Analysis of Variance Table
##
## Df Pillai approx F num Df den Df Pr(>F)
## (Intercept) 1 0.99463 13332.6 2 144 < 2.2e-16 ***
## Sepal.Length 1 0.97030 2351.9 2 144 < 2.2e-16 ***

## Sepal.Width 1 0.81703 321.5 2 144 < 2.2e-16 ***


## Species 2 0.89573 58.8 4 290 < 2.2e-16 ***
## Residuals 145
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

4.3.3 Shrinkage
Applying shrinkage is also possible in multivariate linear models.
In particular, this allows fitting models with p > n predictors. The
glmnet package implements the elastic net regularization for mul-
tivariate linear models with extensions of the || · ||1 and || · ||2 norm
penalties for a vector of parameters in R p , as considered in Section
4.1, to norms for a matrix of parameters of size p × q. Precisely (a
small numerical illustration follows these two points):

• The ridge penalty || β−1 ||22 extends to ||B||2F , where ||B||F =
( ∑pi=1 ∑qj=1 β2ij )1/2 is the Frobenius norm of B. This is a global
penalty to shrink B.
• || β−1 ||1 extends to ∑pj=1 ||B j ||2 , where B j is the j-th row of B. This
is a rowwise penalty that seeks to effectively remove rows of B,
thus eliminating predictors.
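The following minimal chunk (purely illustrative; the matrix B_example is
arbitrary and unrelated to the models below) computes both penalties directly:

# The two penalties above, computed for a small arbitrary matrix
B_example <- matrix(1:6, nrow = 3, ncol = 2)
sum(B_example^2)                # Squared Frobenius norm (ridge-type penalty)
sum(sqrt(rowSums(B_example^2))) # Sum of rowwise Euclidean norms (lasso-type)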

Taking these two extensions into account, this results in

RSS(B) + λ ( α ∑pj=1 ||B j ||2 + (1 − α)||B||2F ).                     (4.23)

Clearly, ridge regression corresponds to α = 0 (quadratic penalty)


and lasso to α = 1 (linear penalty). And if λ = 0, we are back to the
least squares problem and theory. The optimization of (4.23) gives
B̂λ,α := arg min B∈M p×q { RSS(B) + λ ( α ∑pj=1 ||B j ||2 + (1 − α)||B||2F ) }.

From here, the workflow is very similar to the univariate linear


model: we have to be aware that a standardization of X and Y
takes place in glmnet; there are explicit formulas for the ridge re-
gression estimator, but not for lasso; tuning parameter selection
of λ is done by k-fold cross-validation and its one standard error
variant; variable selection (zeroing of rows in B) can be done with
lasso.
The following chunk of code illustrates some of these points
using glmnet::glmnet with family = "mgaussian" (do not forget
this argument!).
# Simulate data
n <- 500
p <- 50
q <- 5
set.seed(123456)
X <- mvtnorm::rmvnorm(n = n, mean = p:1, sigma = 5 * 0.5^toeplitz(1:p))
E <- mvtnorm::rmvnorm(n = n, mean = rep(0, q), sigma = toeplitz(q:1))
B <- 5 / (0.5 * (1:p - 10)^2 + 2) %*% t(sqrt(1:q))
Y <- X %*% B + E

# Visualize B -- blue is close to 0


image(1:q, 1:p, t(B), col = viridisLite::viridis(20))

# Lasso path fit


mfit <- glmnet(x = X, y = Y, family = "mgaussian", alpha = 1)

# A list of models for each response


str(mfit$beta, 1)
## List of 5
## $ y1:Formal class ’dgCMatrix’ [package "Matrix"] with 6 slots
## $ y2:Formal class ’dgCMatrix’ [package "Matrix"] with 6 slots
## $ y3:Formal class ’dgCMatrix’ [package "Matrix"] with 6 slots
## $ y4:Formal class ’dgCMatrix’ [package "Matrix"] with 6 slots
## $ y5:Formal class ’dgCMatrix’ [package "Matrix"] with 6 slots

# Tuning parameter selection by 10-fold cross-validation


set.seed(12345)
kcvLassoM <- cv.glmnet(x = X, y = Y, family = "mgaussian", alpha = 1)
kcvLassoM$lambda.min
## [1] 0.1538278
kcvLassoM$lambda.1se
## [1] 0.3900095

# Location of both optimal lambdas in the CV loss function


indMin <- which.min(kcvLassoM$cvm)
plot(kcvLassoM)
abline(h = kcvLassoM$cvm[indMin] + c(0, kcvLassoM$cvsd[indMin]))
# Extract the coefficients associated to some fits
kcvLassoMfit <- kcvLassoM$glmnet.fit
coefs <- predict(kcvLassoMfit, type = "coefficients",
                 s = c(kcvLassoM$lambda.min, kcvLassoM$lambda.1se))
str(coefs, 1)
## List of 5
## $ y1:Formal class ’dgCMatrix’ [package "Matrix"] with 6 slots
## $ y2:Formal class ’dgCMatrix’ [package "Matrix"] with 6 slots
## $ y3:Formal class ’dgCMatrix’ [package "Matrix"] with 6 slots
## $ y4:Formal class ’dgCMatrix’ [package "Matrix"] with 6 slots
## $ y5:Formal class ’dgCMatrix’ [package "Matrix"] with 6 slots

# Predictions for the first two observations


preds <- predict(kcvLassoMfit, type = "response",
s = c(kcvLassoM$lambda.min, kcvLassoM$lambda.1se),
newx = X[1:2, ])
preds
## , , 1
##
## y1 y2 y3 y4 y5
## [1,] 591.0271 417.9397 341.0378 294.9762 263.5970
## [2,] 565.5700 400.1350 326.5257 282.9225 253.1465
##
## , , 2
##
## y1 y2 y3 y4 y5
## [1,] 589.8410 417.0422 340.5166 294.3984 263.1363
## [2,] 566.4677 400.6527 327.0581 283.3459 253.4001

Finally, the next animation helps to visualize how the zeroing of
the lasso happens for the estimator of B with overall low absolute
values on the previous simulated model.
manipulate::manipulate({

# Plot true B
image(1:q, 1:p, t(B), col = viridisLite::viridis(20))

# Extract B_hat from the lasso fit, a p x q matrix


B_hat <- sapply(seq_along(mfit$beta), function(i) mfit$beta[i][[1]][, j])

# Put as black rows the predictors included


not_zero <- abs(B_hat) > 0
image(1:q, 1:p, t(not_zero), breaks = c(0.5, 1), col = 1, add = TRUE)

}, j = manipulate::slider(min = 1, max = ncol(mfit$beta$y1), step = 1,


initial = 10, label = "j in lambda(j)"))

4.4 Big data considerations

The computation of the least squares estimator

β̂ = (X0 X)−1 X0 Y (4.24)

involves inverting the ( p + 1) × ( p + 1) matrix X0 X, where X is


an n × ( p + 1) matrix. The vector to be obtained, β̂, is of size p +
1. However, computing it directly from (4.24) requires allocating
O(np + p2 ) elements in memory. When n is very large, this can be
prohibitive. In addition, for convenience of the statistical analysis,
R’s lm returns several objects of the same size as X and Y, thus
notably increasing the memory usage. For these reasons, alternative
approaches for computing β̂ with big data are required.
An alternative for computing (4.24) in a memory-friendly way
is to split the computation of (X0 X)−1 and X0 Y by blocks that are
storable in memory. A possibility is to update sequentially the es-
timation of the vector of coefficients. This can be done with the
following expression, which relates β̂ with β̂−i , the vector of esti-
mated coefficients without the i-th datum:
 
β̂ = β̂−i + (X0 X)−1 xi ( Yi − xi0 β̂−i ).                              (4.25)

In (4.25) above, xi0 is the i-th row of the design matrix X. The ex-
pression follows from the Sherman–Morrison formula for an invert-
ible matrix A and a vector b,
(A + bb0 )−1 = A−1 − (A−1 bb0 A−1 ) / (1 + b0 A−1 b),
and from the equalities

X0 X = X0−i X−i + xi xi0 ,


X0 Y = X0−i Y−i + xi Yi0 ,

where X−i is the (n − 1) × ( p + 1) matrix obtained by removing the


i-th row of X. In (4.25), using again the Sherman–Morrison formula,
we can update (X0 X)−1 easily from (X0−i X−i )−1 :

(X0 X)−1 = (X0−i X−i )−1 − [ (X0−i X−i )−1 xi xi0 (X0−i X−i )−1 ] / [ 1 + xi0 (X0−i X−i )−1 xi ].   (4.26)

This has the advantage of not requiring to compute X0 X and then to
invert it. Instead of that, we work directly with (X0−i X−i )−1 , which
was already computed and has size ( p + 1) × ( p + 1).
This idea can be iterated and we can compute β̂ by the following
iterative procedure:

1. Start from a reduced dataset Xold ≡ X−i and Yold ≡ Y−i for
which the least squares estimate can be computed. Denote it by
β̂old ≡ β̂−i .
2. Add one of the remaining data points to get β̂new ≡ β̂ from
(4.25) and (4.26).
3. Set β̂old ← β̂new and Xold ← Xnew .
4. Repeat steps 2–3 until there are no remaining data points left.
5. Return β̂ ← β̂new .

The main advantage of this iterative procedure is clear: we do


not need to store any vector or matrix with n in the dimension –
only matrices of size p. As a consequence, we do not need to store
the data in memory.
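
As a simple illustration of steps 1–5 (this is only a sketch with ad hoc
names, not the implementation of any particular package), the following
chunk updates the estimate one observation at a time using (4.25) and
(4.26). For simplicity the full sample is simulated at once, but each row
could equally well be read sequentially from disk.

# Sequential least squares via (4.25)-(4.26)
set.seed(123456)
n <- 200
X_seq <- cbind(1, matrix(rnorm(n * 2), nrow = n, ncol = 2)) # With intercept
Y_seq <- X_seq %*% c(1, 2, -1) + rnorm(n)

# Step 1: initial fit with the first n0 observations
n0 <- 10
XtX_inv <- solve(crossprod(X_seq[1:n0, ]))
beta_seq <- XtX_inv %*% crossprod(X_seq[1:n0, ], Y_seq[1:n0])

# Steps 2-4: add the remaining observations, one at a time
for (i in (n0 + 1):n) {

  xi <- X_seq[i, ]

  # Update (X'X)^{-1} with the Sherman-Morrison formula, as in (4.26)
  XtX_inv <- XtX_inv - (XtX_inv %*% xi %*% t(xi) %*% XtX_inv) /
    drop(1 + t(xi) %*% XtX_inv %*% xi)

  # Update the coefficients, as in (4.25)
  beta_seq <- beta_seq + XtX_inv %*% xi * drop(Y_seq[i] - t(xi) %*% beta_seq)

}

# Step 5: the result equals the least squares fit on the full dataset
max(abs(beta_seq - solve(crossprod(X_seq)) %*% crossprod(X_seq, Y_seq)))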
A similar iterative approach (yet more sophisticated) is fol-
lowed by the biglm package. We omit the details here (see Miller
(1992)) and just comment the main idea: for computing (4.24),
biglm::biglm performs a QR decomposition25 of X that is com-
puted iteratively. Then, instead of computing (4.24), it solves the
triangular system

R β̂ = Q T Y.

25 The QR decomposition of the matrix X of size n × m is X = QR such
that Q is an n × n orthogonal matrix and R is an n × m upper triangular
matrix. This factorization is commonly used in numerical analysis for
solving linear systems.
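
As a minimal illustration of this idea (again just a sketch, not biglm's
actual internals), the coefficients of the small example above can be
recovered from a QR decomposition and a triangular solve:

# Least squares via the QR decomposition of the design matrix
QR <- qr(X_seq)
beta_qr <- backsolve(qr.R(QR), crossprod(qr.Q(QR), Y_seq))
max(abs(beta_qr - beta_seq)) # Same solution, up to numerical error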

Let’s see how biglm::biglm works in practice.

# Not really "big data", but for the sake of illustration


set.seed(12345)
n <- 1e6
p <- 10
beta <- seq(-1, 1, length.out = p)^5
x1 <- matrix(rnorm(n * p), nrow = n, ncol = p)
x1[, p] <- 2 * x1[, 1] + rnorm(n, sd = 0.1) # Add some dependence to predictors
x1[, p - 1] <- 2 - x1[, 2] + rnorm(n, sd = 0.5)
y1 <- 1 + x1 %*% beta + rnorm(n)
x2 <- matrix(rnorm(100 * p), nrow = 100, ncol = p)
y2 <- 1 + x2 %*% beta + rnorm(100)
bigData1 <- data.frame("resp" = y1, "pred" = x1)
bigData2 <- data.frame("resp" = y2, "pred" = x2)

# biglm has a very similar syntaxis to lm -- but the formula interface does not
# work always as expected
# biglm::biglm(formula = resp ~ ., data = bigData1) # Does not work
# biglm::biglm(formula = y ~ x) # Does not work
# biglm::biglm(formula = resp ~ pred.1 + pred.2, data = bigData1) # Does work,
# but not very convenient for a large number of predictors
# Hack for automatic inclusion of all the predictors
f <- formula(paste("resp ~", paste(names(bigData1)[-1], collapse = " + ")))
biglmMod <- biglm::biglm(formula = f, data = bigData1)

# lm’s call
lmMod <- lm(formula = resp ~ ., data = bigData1)

# The reduction in size of the resulting object is more than notable


print(object.size(biglmMod), units = "KiB")
## 13.1 KiB
print(object.size(lmMod), units = "MiB")
## 381.5 MiB

# Summaries
s1 <- summary(biglmMod)
s2 <- summary(lmMod)
s1

## Large data regression model: biglm::biglm(formula = f, data = bigData1)


## Sample size = 1000000
## Coef (95% CI) SE p
## (Intercept) 1.0021 0.9939 1.0104 0.0041 0.0000
## pred.1 -0.9733 -1.0133 -0.9333 0.0200 0.0000
## pred.2 -0.2866 -0.2911 -0.2822 0.0022 0.0000
## pred.3 -0.0535 -0.0555 -0.0515 0.0010 0.0000
## pred.4 -0.0041 -0.0061 -0.0021 0.0010 0.0000
## pred.5 -0.0002 -0.0022 0.0018 0.0010 0.8373
## pred.6 0.0003 -0.0017 0.0023 0.0010 0.7771
## pred.7 0.0026 0.0006 0.0046 0.0010 0.0091
## pred.8 0.0521 0.0501 0.0541 0.0010 0.0000
## pred.9 0.2840 0.2800 0.2880 0.0020 0.0000
## pred.10 0.9867 0.9667 1.0067 0.0100 0.0000
s2
##
## Call:
## lm(formula = resp ~ ., data = bigData1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8798 -0.6735 -0.0013 0.6735 4.9060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0021454 0.0041200 243.236 < 2e-16 ***
## pred.1 -0.9732675 0.0199989 -48.666 < 2e-16 ***
## pred.2 -0.2866314 0.0022354 -128.227 < 2e-16 ***
## pred.3 -0.0534834 0.0009997 -53.500 < 2e-16 ***
## pred.4 -0.0040772 0.0009984 -4.084 4.43e-05 ***
## pred.5 -0.0002051 0.0009990 -0.205 0.83731
## pred.6 0.0002828 0.0009989 0.283 0.77706
## pred.7 0.0026085 0.0009996 2.610 0.00907 **
## pred.8 0.0520744 0.0009994 52.105 < 2e-16 ***
## pred.9 0.2840358 0.0019992 142.076 < 2e-16 ***
## pred.10 0.9866851 0.0099876 98.791 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.9993 on 999989 degrees of freedom
## Multiple R-squared: 0.5777, Adjusted R-squared: 0.5777
## F-statistic: 1.368e+05 on 10 and 999989 DF, p-value: < 2.2e-16

# Further information
s1$mat # Coefficients and their inferences
## Coef (95% CI) SE p
## (Intercept) 1.0021454430 0.9939053491 1.010385537 0.0041200470 0.000000e+00
## pred.1 -0.9732674585 -1.0132653005 -0.933269616 0.0199989210 0.000000e+00
## pred.2 -0.2866314070 -0.2911021089 -0.282160705 0.0022353509 0.000000e+00
## pred.3 -0.0534833941 -0.0554827653 -0.051484023 0.0009996856 0.000000e+00
## pred.4 -0.0040771777 -0.0060739907 -0.002080365 0.0009984065 4.432709e-05
## pred.5 -0.0002051218 -0.0022030377 0.001792794 0.0009989579 8.373098e-01
## pred.6 0.0002828388 -0.0017149118 0.002280589 0.0009988753 7.770563e-01
## pred.7 0.0026085425 0.0006093153 0.004607770 0.0009996136 9.066118e-03
## pred.8 0.0520743791 0.0500755376 0.054073221 0.0009994208 0.000000e+00
## pred.9 0.2840358104 0.2800374345 0.288034186 0.0019991879 0.000000e+00
## pred.10 0.9866850849 0.9667099026 1.006660267 0.0099875911 0.000000e+00
s1$rsq # R^2
## [1] 0.5777074
s1$nullrss # SST (as in Section 2.6)
## [1] 2364861

# Extract coefficients
coef(biglmMod)
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7
## 1.0021454430 -0.9732674585 -0.2866314070 -0.0534833941 -0.0040771777 -0.0002051218 0.0002828388 0.0026085425
## pred.8 pred.9 pred.10
## 0.0520743791 0.2840358104 0.9866850849

# Prediction works as usual


predict(biglmMod, newdata = bigData2[1:5, ])
## [,1]
## 1 2.3554732
## 2 2.5631387
## 3 2.4546594
## 4 2.3483083
## 5 0.6587481

# Must contain a column for the response


# predict(biglmMod, newdata = bigData2[1:5, -1]) # Error

# Update the model with training data


update(biglmMod, moredata = bigData2)
## Large data regression model: biglm::biglm(formula = f, data = bigData1)
## Sample size = 1000100

# AIC and BIC


AIC(biglmMod, k = 2)
## [1] 998685.1
AIC(biglmMod, k = log(n))
## [1] 998815.1

# Features not immediately available for biglm objects: stepwise selection by


# stepAIC, residuals, variance of the error, model diagnostics, and vifs

# Workaround for obtaining hat(sigma)^2 = SSE / (n - p - 1), SSE = SST * (1 - R^2)


(s1$nullrss * (1 - s1$rsq)) / s1$obj$df.resid
## [1] 0.9986741
s2$sigma^2
## [1] 0.9986741

Model selection of biglm models can be done, not by MASS::stepAIC,
but with the more advanced leaps package. This is achieved by the
leaps::regsubsets function, which returns the best subset of up to
(by default) nvmax = 8 predictors among the p possible predictors
to be included in the model. The function requires the full biglm
model to begin the exhaustive26 search (Furnival and Wilson, 1974).
The kind of search can be changed using the method argument and
choosing the exhaustive (by default), forward, or backward selec-
tion.

26 Not really exhaustive: the method behind it, due to Furnival and
Wilson (1974), employs an ingenious branch-and-bound algorithm to remove
most of the non-interesting models.

Figure 4.5: Best subsets for p = 10 predictors returned by
leaps::regsubsets. The vertical axis indicates the sorting in terms of
the BIC (the top positions are for the best models in terms of the BIC).
White color indicates that the predictor is not included in the model
and black that it is included. The p models obtained with the best
subsets of 1 ≤ r ≤ p out of p predictors are displayed. Note that the
vertical ordering does not necessarily coincide with r = 1, . . . , p.

# Model selection adapted to big data models
reg <- leaps::regsubsets(biglmMod, nvmax = p, method = "exhaustive")
plot(reg) # Plot best model (top row) to worst model (bottom row)

# Summarize (otherwise regsubsets's output is hard to decipher)
subs <- summary(reg)
subs
## Subset selection object
## 10 Variables (and intercept)
##         Forced in Forced out
## pred.1      FALSE      FALSE
## pred.2      FALSE      FALSE
## pred.3      FALSE      FALSE
## pred.4      FALSE      FALSE
## pred.5      FALSE      FALSE
## pred.6      FALSE      FALSE
## pred.7      FALSE      FALSE
## pred.8      FALSE      FALSE
## pred.9      FALSE      FALSE
## pred.10     FALSE      FALSE

## 1 subsets of each size up to 9
## Selection Algorithm: exhaustive
##          pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9 pred.10
## 1  ( 1 ) " "    " "    " "    " "    " "    " "    " "    " "    " "    "*"
## 2  ( 1 ) " "    " "    " "    " "    " "    " "    " "    " "    "*"    "*"
## 3  ( 1 ) " "    "*"    " "    " "    " "    " "    " "    " "    "*"    "*"
## 4  ( 1 ) " "    "*"    "*"    " "    " "    " "    " "    " "    "*"    "*"
## 5  ( 1 ) " "    "*"    "*"    " "    " "    " "    " "    "*"    "*"    "*"
## 6  ( 1 ) "*"    "*"    "*"    " "    " "    " "    " "    "*"    "*"    "*"
## 7  ( 1 ) "*"    "*"    "*"    "*"    " "    " "    " "    "*"    "*"    "*"
## 8  ( 1 ) "*"    "*"    "*"    "*"    " "    " "    "*"    "*"    "*"    "*"
## 9  ( 1 ) "*"    "*"    "*"    "*"    " "    "*"    "*"    "*"    "*"    "*"

# Lots of useful information


str(subs, 1)
## List of 8
## $ which : logi [1:9, 1:11] TRUE TRUE TRUE TRUE TRUE TRUE ...
## ..- attr(*, "dimnames")=List of 2
## $ rsq : num [1:9] 0.428 0.567 0.574 0.576 0.577 ...
## $ rss : num [1:9] 1352680 1023080 1006623 1003763 1001051 ...
## $ adjr2 : num [1:9] 0.428 0.567 0.574 0.576 0.577 ...
## $ cp : num [1:9] 354480 24444 7968 5106 2392 ...
## $ bic : num [1:9] -558604 -837860 -854062 -856894 -859585 ...
## $ outmat: chr [1:9, 1:10] " " " " " " " " ...
## ..- attr(*, "dimnames")=List of 2
## $ obj :List of 27
## ..- attr(*, "class")= chr "regsubsets"
## - attr(*, "class")= chr "summary.regsubsets"

# Get the model with lowest BIC


subs$which
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9 pred.10
## 1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 2 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## 3 TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## 4 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## 5 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
## 6 TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## 9 TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
subs$bic
## [1] -558603.7 -837859.9 -854062.3 -856893.8 -859585.3 -861936.5 -861939.3 -861932.3 -861918.6
subs$which[which.min(subs$bic), ]
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9
## TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
## pred.10
## TRUE

# It also works with ordinary linear models and it is much faster and more
# informative than stepAIC
reg <- leaps::regsubsets(resp ~ ., data = bigData1, nvmax = p,
method = "backward")
subs <- summary(reg)
subs$bic
## [1] -558603.7 -837859.9 -854062.3 -856893.8 -859585.3 -861936.5 -861939.3 -861932.3 -861918.6 -861904.8
subs$which[which.min(subs$bic), ]
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9
## TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
## pred.10
## TRUE

# Compare it with stepAIC


MASS::stepAIC(lm(resp ~ ., data = bigData1), trace = 0,
direction = "backward", k = log(n))
##
## Call:
## lm(formula = resp ~ pred.1 + pred.2 + pred.3 + pred.4 + pred.8 +

## pred.9 + pred.10, data = bigData1)


##
## Coefficients:
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.8 pred.9 pred.10
## 1.002141 -0.973201 -0.286626 -0.053487 -0.004074 0.052076 0.284038 0.986651

Finally, let’s see an example on how to fit a linear model to a


large dataset that does not fit in the RAM of most regular laptops.
Imagine that you want to regress a response Y into a set of p = 10
predictors and the sample size is n = 10^8 . Merely storing the
response and the predictors will take up to 8.2 GiB in RAM:
# Size of the response
print(object.size(rnorm(1e6)) * 1e2, units = "GiB")
## 0.7 GiB

# Size of the predictors


print(object.size(rnorm(1e6)) * 1e2 * 10, units = "GiB")
## 7.5 GiB

In addition to this, if lm was called, it will return the residuals,


effects, and fitted.values slots (all vectors of length n, hence
0.7 × 3 = 2.1 GiB more). It will also return the qr decomposition
of the design matrix and the model matrix (both are n × ( p + 1)
matrices, so another 8.2 × 2 = 16.4 GiB more). The final lm object
will thus be, at the very least, of size 18.5 GiB. Clearly, this is not a
very memory-friendly way of proceeding.
A possible approach is to split the dataset and perform updates
of the model in chunks of reasonable size. The next code provides a
template for such an approach using biglm and update.
# Linear regression with n = 10^8 and p = 10
n <- 10^8
p <- 10
beta <- seq(-1, 1, length.out = p)^5

# Number of chunks for splitting the dataset


nChunks <- 1e3
nSmall <- n / nChunks

# Simulates reading the first chunk of data


set.seed(12345)
x <- matrix(rnorm(nSmall * p), nrow = nSmall, ncol = p)
x[, p] <- 2 * x[, 1] + rnorm(nSmall, sd = 0.1)
x[, p - 1] <- 2 - x[, 2] + rnorm(nSmall, sd = 0.5)
y <- 1 + x %*% beta + rnorm(nSmall)

# First fit
bigMod <- biglm::biglm(y ~ x, data = data.frame(y, x))

# Update fit
# pb <- txtProgressBar(style = 3)
for (i in 2:nChunks) {

# Simulates reading the i-th chunk of data


set.seed(12345 + i)
x <- matrix(rnorm(nSmall * p), nrow = nSmall, ncol = p)
x[, p] <- 2 * x[, 1] + rnorm(nSmall, sd = 0.1)
x[, p - 1] <- 2 - x[, 2] + rnorm(nSmall, sd = 0.5)
y <- 1 + x %*% beta + rnorm(nSmall)

# Update the fit


bigMod <- update(bigMod, moredata = data.frame(y, x))
# Progress
# setTxtProgressBar(pb = pb, value = i / nChunks)

}

# Final model
summary(bigMod)
## Large data regression model: biglm::biglm(y ~ x, data = data.frame(y, x))
## Sample size = 100000000
## Coef (95% CI) SE p
## (Intercept) 1.0003 0.9995 1.0011 4e-04 0.0000
## x1 -1.0015 -1.0055 -0.9975 2e-03 0.0000
## x2 -0.2847 -0.2852 -0.2843 2e-04 0.0000
## x3 -0.0531 -0.0533 -0.0529 1e-04 0.0000
## x4 -0.0041 -0.0043 -0.0039 1e-04 0.0000
## x5 0.0002 0.0000 0.0004 1e-04 0.0760
## x6 -0.0001 -0.0003 0.0001 1e-04 0.2201
## x7 0.0041 0.0039 0.0043 1e-04 0.0000
## x8 0.0529 0.0527 0.0531 1e-04 0.0000
## x9 0.2844 0.2840 0.2848 2e-04 0.0000
## x10 1.0007 0.9987 1.0027 1e-03 0.0000
print(object.size(bigMod), units = "KiB")
## 7.8 KiB

The summary of a biglm object yields slightly different
significances for the coefficients than for lm. The rea-
son is that biglm employs N (0, 1) approximations for
the distributions of the t-tests instead of the exact tn−1
distribution. Obviously, if n is large, the differences are
negligible.
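
A quick numerical illustration of this point: for a t-statistic equal to 2,
the two-sided p-values from the normal approximation and from the t
distribution are essentially indistinguishable once the degrees of freedom
are large.

# Two-sided p-values for a t-statistic equal to 2
2 * pnorm(-2)        # Normal approximation (what biglm uses)
2 * pt(-2, df = 20)  # t distribution, small degrees of freedom
2 * pt(-2, df = 1e6) # t distribution, large degrees of freedom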
5 Generalized linear models

As we saw in Chapter 2, linear regression assumes that the re-


sponse variable Y is such that

Y |( X1 = x1 , . . . , X p = x p ) ∼ N ( β 0 + β 1 x1 + . . . + β p x p , σ2 )

and hence

E [ Y | X1 = x 1 , . . . , X p = x p ] = β 0 + β 1 x 1 + . . . + β p x p .

This, in particular, implies that Y is continuous. In this chapter we


will see how generalized linear models can deal with other kinds of
distributions for Y |( X1 = x1 , . . . , X p = x p ), particularly with discrete
responses, by modelling the transformed conditional expectation.
The simplest generalized linear model is logistic regression, which
arises when Y is a binary response, that is, a variable encoding two
categories with 0 and 1. This model would be useful, for example,
to predict Y given X from a sample {( Xi , Yi )}in=1 like the one in
Figure 5.1.
Figure 5.1: Scatterplot of a sample {( Xi , Yi )}in=1 sampled from a
logistic regression.

5.1 Case study: The Challenger disaster

The Challenger disaster occurred on the 28th January of 1986, when
the NASA Space Shuttle orbiter Challenger broke apart and disinte-
grated at 73 seconds into its flight, leading to the deaths of its seven
crew members. The accident had serious consequences for NASA's
credibility and resulted in an interruption of 32 months in
the shuttle program. The Presidential Rogers Commission (formed by
astronaut Neil A. Armstrong and Nobel laureate Richard P. Feyn-
man, among others) was created in order to investigate the causes
of the disaster.

Figure 5.2: Challenger launch and posterior explosion, as broadcast
live by NBC on 28/01/1986. Video also available here.

The Rogers Commission elaborated a report (Presidential Com-
mission on the Space Shuttle Challenger Accident, 1986) with all
the findings. The commission determined that the disintegration
began with the failure of an O-ring seal in the solid rocket mo-
tor due to the unusually cold temperature (−0.6 Celsius degrees)
during the launch. This failure produced a breach of burning gas
through the solid rocket motor that compromised the whole shuttle
structure, resulting in its disintegration due to the extreme aero-
dynamic forces. The problem with O-rings was something known:
the night before the launch, there was a three-hour teleconference
between motor engineers and NASA management, discussing the
effect of low temperature forecasted for the launch on the O-ring
performance. The conclusion, influenced by Figure 5.3a, was:

“Temperature data [is] not conclusive on predicting primary O-ring
blowby.”

Figure 5.3: Number of incidents in the O-rings (field joints) versus
temperatures. Panel a includes only flights with incidents. Panel b
contains all flights (with and without incidents).

The Rogers Commission noted a major flaw in Figure 5.3a: the


flights with zero incidents were excluded from the plot because it was
felt that these flights did not contribute any information about
the temperature effect (Figure 5.3b). The Rogers Commission con-
cluded:

“A careful analysis of the flight history of O-ring performance would


have revealed the correlation of O-ring damage in low temperature”.

The purpose of this case study, inspired by Siddhartha et al.


(1989), is to quantify what was the influence of the temperature
on the probability of having at least one incident related with the
O-rings. Specifically, we want to address the following questions:

• Q1. Is the temperature associated with O-ring incidents?


• Q2. In which way was the temperature affecting the probability of
O-ring incidents?

• Q3. What was the predicted probability of an incident in an O-ring for
the temperature of the launch day?

To try to answer these questions we have the challenger dataset
(download). The dataset contains (shown in Table 5.1) information
regarding the state of the solid rocket boosters after launch1 for 23
flights. Each row has, among others, the following variables:

• fail.field, fail.nozzle: binary variables indicating whether
there was an incident with the O-rings in the field joints or in the
nozzles of the solid rocket boosters. 1 codifies an incident and 0
its absence. In the analysis, we focus on the O-rings of the field
joints as being the most determinant for the accident.
• temp: temperature on the day of launch. Measured in Celsius
degrees.
• pres.field, pres.nozzle: leak-check pressure tests of the O-
rings. These tests assured that the rings would seal the joint.

1 After the shuttle exits the atmosphere, the solid rocket boosters
separate and descend to land using a parachute where they are carefully
analyzed.

Table 5.1: The challenger dataset.

flight date fail.field fail.nozzle temp


1 12/04/81 0 0 18.9
2 12/11/81 1 0 21.1
3 22/03/82 0 0 20.6
5 11/11/82 0 0 20.0
6 04/04/83 0 1 19.4
7 18/06/83 0 0 22.2
8 30/08/83 0 0 22.8
9 28/11/83 0 0 21.1
41-B 03/02/84 1 1 13.9
41-C 06/04/84 1 1 17.2
41-D 30/08/84 1 1 21.1
41-G 05/10/84 0 0 25.6
51-A 08/11/84 0 0 19.4
51-C 24/01/85 1 1 11.7
51-D 12/04/85 0 1 19.4
51-B 29/04/85 0 1 23.9
51-G 17/06/85 0 1 21.1
51-F 29/07/85 0 0 27.2
51-I 27/08/85 0 0 24.4
51-J 03/10/85 0 0 26.1
61-A 30/10/85 1 0 23.9
61-B 26/11/85 0 1 24.4
61-C 12/01/86 1 1 14.4

Let’s begin the analysis by replicating Figures 5.3a and 5.3b and checking that linear regression is not the right tool for answering Q1–Q3.
challenger <- read.table(file = "challenger.txt", header = TRUE, sep = "\t")
car::scatterplot(nfails.field ~ temp, smooth = FALSE, boxplots = FALSE,
                 data = challenger, subset = nfails.field > 0)
car::scatterplot(nfails.field ~ temp, smooth = FALSE, boxplots = FALSE,
                 data = challenger)

There is a fundamental problem in using linear regression for this data: the response is not continuous. As a consequence, there is no linearity and the errors around the mean are not normal (indeed, they are strongly non-normal). Let’s check this with the corresponding diagnostic plots:

mod <- lm(nfails.field ~ temp, data = challenger)
plot(mod, 1)
plot(mod, 2)

Albeit linear regression is not the adequate tool for this data, it is able to detect the obvious difference between the two plots:

1. The trend for launches with incidents is flat, hence suggesting there is no dependence on the temperature (Figure 5.3a). This was one of the arguments behind NASA’s decision of launching the rocket at a temperature of −0.6 Celsius degrees.
2. However, the trend for all launches indicates a clear negative dependence between temperature and number of incidents! (Figure 5.3b). Think about it in this way: the minimum temperature for a launch without incidents ever recorded was above 18 Celsius degrees, and the Challenger was launched at −0.6 without clearly knowing the effects of such low temperatures.

Along this chapter we will see the required tools for answering precisely Q1–Q3.

5.2 Model formulation and estimation

For simplicity, we first study the logistic regression and then study the general case of a generalized linear model.

5.2.1 Logistic regression

As we saw in Section 2.2, the multiple linear model described the relation between the random variables X1, . . ., Xp and Y by assuming a linear relation in the conditional expectation:

E[Y | X1 = x1, . . ., Xp = xp] = β0 + β1x1 + . . . + βpxp.    (5.1)

In addition, it made three more assumptions on the data (see Sec-


tion 2.3), which resulted in the following one-line summary of the
linear model:

Y |( X1 = x1 , . . . , X p = x p ) ∼ N ( β 0 + β 1 x1 + . . . + β p x p , σ2 ).

Recall that a necessary condition is that Y was continuous, in order


to satisfy the normality of the errors. Therefore, the linear model is
designed for a continuous response.
The situation when Y is discrete (naturally ordered values) or
categorical (non-ordered categories) requires a different treatment.
The simplest situation is when Y is binary: it can only take two
values, codified for convenience as 1 (success) and 0 (failure). For
binary variables there is no fundamental distinction between the
treatment of discrete and categorical variables. Formally, a binary variable is referred to as a Bernoulli variable2: Y ∼ Ber(p), 0 ≤ p ≤ 1³, if

Y = 1 with probability p, and Y = 0 with probability 1 − p,

or, equivalently, if

P[Y = y] = p^y (1 − p)^(1−y),    y = 0, 1.    (5.2)

2 Recall that a binomial variable with size n and probability p, B(n, p), is obtained by summing n independent Ber(p), so Ber(p) is the same distribution as B(1, p).
3 Do not confuse this p with the p representing the number of predictors in the model.

Recall that a Bernoulli variable is completely determined by the


probability p, and so are its mean and variance:

E [Y ] = P [Y = 1 ] = p and Var[Y ] = p(1 − p).

Assume then that Y is a Bernoulli variable and that X1 , . . . , X p


are predictors associated to Y. The purpose in logistic regression is
to model

E[Y | X1 = x1 , . . . , X p = x p ] = P[Y = 1| X1 = x1 , . . . , X p = x p ], (5.3)

that is, how the conditional expectation of Y or, equivalently, the


conditional probability of Y = 1, is changing according to particular
values of the predictors. At sight of (5.1), a tempting possibility is
to consider the model

E[Y | X1 = x1 , . . . , X p = x p ] = β 0 + β 1 x1 + . . . + β p x p =: η.

However, such a model will run into serious problems inevitably:


negative probabilities and probabilities larger than one may hap-
pen.
A solution is to consider a link function g to encapsulate the
value of E[Y | X1 = x1 , . . . , X p = x p ] and map it back to R. Or,
alternatively, a function g−1 that takes η ∈ R and maps it to [0, 1],
the support of E[Y | X1 = x1 , . . . , X p = x p ]. There are several
alternatives for g−1 : R −→ [0, 1] that give rise to different models:

• Uniform: considers the truncation g−1 (η ) = η1{0<η <1} + 1{η ≥1} .


• Probit: considers the normal cdf, that is, g−1(η) = Φ(η).
• Logit: considers the logistic cdf:

g−1(η) = logistic(η) := e^η / (1 + e^η) = 1 / (1 + e^−η).
The logistic transformation is the most employed due to its tractability, interpretability, and smoothness4. Its inverse, g : [0, 1] −→ R, is known as the logit function:

logit(p) := logistic−1(p) = log(p / (1 − p)).

4 And also, as we will see later, because it is the canonical link function.

Figure 5.4: Different transformations g−1 mapping the response of a simple linear regression η = β0 + β1x to [0, 1].

In conclusion, with the logit link function we can map the domain of Y to R in order to apply a linear model. The logistic model can be then equivalently stated as

E[Y | X1 = x1, . . ., Xp = xp] = logistic(η) = 1 / (1 + e^−η),    (5.4)

or as

logit(E[Y | X1 = x1, . . ., Xp = xp]) = η,    (5.5)

where η = β0 + β1x1 + . . . + βpxp. There is a clear interpretation of the role of the linear predictor η in (5.4) (remember (5.3)):

• If η = 0, then P[Y = 1 | X1 = x1, . . ., Xp = xp] = 1/2 (Y = 1 and Y = 0 are equally likely).
• If η < 0, then P[Y = 1 | X1 = x1, . . ., Xp = xp] < 1/2 (Y = 1 less likely).
• If η > 0, then P[Y = 1 | X1 = x1, . . ., Xp = xp] > 1/2 (Y = 1 more likely).
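A minimal numerical check of these facts, using base R’s plogis (the logistic cdf) and qlogis (the logit), can help to fix ideas; the values of η below are arbitrary choices for the sketch.

# logistic (plogis) and logit (qlogis) are inverses of each other
eta <- c(-2, 0, 2)
p <- plogis(eta) # logistic(eta) = 1 / (1 + exp(-eta))
p
qlogis(p) # Recovers eta

# The sign of eta determines on which side of 1/2 the probability falls
plogis(0) # Exactly 0.5
plogis(-2) < 0.5 # TRUE: Y = 1 less likely
plogis(2) > 0.5 # TRUE: Y = 1 more likely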

To be more precise on the interpretation of the coefficients we need to introduce the odds. The odds is an equivalent way of expressing the distribution of probabilities in a binary variable Y. Instead of using p to characterize the distribution of Y, we can use

odds(Y) := p / (1 − p) = P[Y = 1] / P[Y = 0].    (5.6)

The odds is the ratio between the probability of success and the probability of failure. It is extensively used in betting5 due to its better interpretability6. Conversely, if the odds of Y is given, we can easily know what is the probability of success p, using the inverse of (5.6)7:

p = P[Y = 1] = odds(Y) / (1 + odds(Y)).

5 Recall that (traditionally) the result of a bet is binary: you either win or lose the bet.
6 For example, if a horse Y has a probability p = 2/3 of winning a race (Y = 1), then the odds of the horse is p / (1 − p) = (2/3) / (1/3) = 2. This means that the horse has a probability of winning that is twice larger than the probability of losing. This is sometimes written as a 2 : 1 or 2 × 1 (spelled “two-to-one”).
7 For the previous example, if the odds of the horse were 5, that would correspond to a probability of winning p = 5/6.

Recall that the odds is a number in [0, +∞]. The 0 and +∞ values are attained for p = 0 and p = 1, respectively. The log-odds (or logit) is a number in [−∞, +∞].

We can rewrite (5.4) in terms of the odds (5.6)8 so we get:

odds(Y | X1 = x1, . . ., Xp = xp) = e^η = e^β0 e^(β1 x1) · · · e^(βp xp).    (5.7)

8 To do so, apply (5.6) to (5.4) and use (5.3).

Alternatively, taking logarithms, we have the log-odds (or logit)

log(odds(Y | X1 = x1, . . ., Xp = xp)) = β0 + β1x1 + . . . + βpxp.    (5.8)
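To fix ideas, here is a small sketch of the odds–probability conversions in (5.6) and its inverse; the numbers reproduce the horse-betting examples of the margin notes.

# Odds from a probability and probability from an odds, as in (5.6)
odds <- function(p) p / (1 - p)
prob <- function(o) o / (1 + o)
odds(2 / 3) # A horse with p = 2/3 has odds 2
prob(5) # An odds of 5 corresponds to p = 5/6
prob(odds(0.25)) # Round trip recovers the probability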

The conditional log-odds (5.8) plays the role of the conditional


mean for multiple linear regression. Therefore, we have an analo-
gous interpretation for the coefficients:

• β 0 : is the log-odds when X1 = . . . = X p = 0.


• β j , 1 ≤ j ≤ p: is the additive increment of the log-odds for an
increment of one unit in X j = x j , provided that the remaining
variables X1 , . . . , X j−1 , X j+1 , . . . , X p do not change.

The log-odds is not as easy to interpret as the odds. For that


reason, an equivalent way of interpreting the coefficients, this time
based on (5.7), is:

• e β0 : is the odds when X1 = . . . = X p = 0.


• e β j , 1 ≤ j ≤ p: is the multiplicative increment of the odds for
an increment of one unit in X j = x j , provided that the remaining
variables X1 , . . . , X j−1 , X j+1 , . . . , X p do not change. If the increment
in X j is of r units, then the multiplicative increment in the odds
is (e β j )r .

As a consequence of this last interpretation, we have:

If β j > 0 (respectively, β j < 0) then e β j > 1 (e β j < 1)


in (5.7). Therefore, an increment of one unit in X j , pro-
vided that the remaining variables do not change, results
in a positive (negative) increment in the odds and in
P [ Y = 1 | X1 = x 1 , . . . , X p = x p ] .

Case study application


In the Challenger case study we used fail.field as an indicator
of whether “there was at least an incident with the O-rings” (1 =
yes, 0 = no). Let’s see if the temperature was associated with O-
ring incidents (Q1). For that, we compute the logistic regression of
fail.field on temp and we plot the fitted logistic curve.
# Logistic regression: computed with glm and family = "binomial"
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)

# Plot data
plot(challenger$temp, challenger$fail.field, xlim = c(-1, 30),
     xlab = "Temperature", ylab = "Incident probability")

# Draw the fitted logistic curve
x <- seq(-1, 30, l = 200)
y <- exp(-(nasa$coefficients[1] + nasa$coefficients[2] * x))
y <- 1 / (1 + y)
lines(x, y, col = 2, lwd = 2)

# The Challenger
points(-0.6, 1, pch = 16)
text(-0.6, 1, labels = "Challenger", pos = 4)

At the sight of this curve and the summary it seems that the temperature was affecting the probability of an O-ring incident (Q1). Let’s quantify this statement and answer Q2 by looking at the coefficients of the model:



# Exponentiated coefficients ("odds ratios")


exp(coef(nasa))
## (Intercept) temp
## 1965.9743592 0.6592539

The exponentials of the estimated coefficients are:

• e^β̂0 = 1965.974. This means that, when the temperature is zero, the fitted odds is 1965.974, so the (estimated) probability of having an incident (Y = 1) is 1965.974 times larger than the probability of not having an incident (Y = 0). Or, in other words, the probability of having an incident at temperature zero is 1965.974 / (1965.974 + 1) = 0.999.
• e^β̂1 = 0.659. This means that each Celsius degree increment on the temperature multiplies the fitted odds by a factor of 0.659 ≈ 2/3, hence reducing it.

However, for the moment we can not say whether these findings
are significant, since we do not have information on the variability
of the estimates of β. We will need inference for that.
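Before that, a quick numerical check of the multiplicative-odds interpretation is possible with the fitted nasa model; the temperatures 20 and 21 below are arbitrary choices for the sketch.

# The ratio of fitted odds at temp = t + 1 versus temp = t equals exp(beta_1),
# whatever the value of t; here t = 20 is arbitrary
p <- predict(nasa, newdata = data.frame(temp = c(20, 21)), type = "response")
odds_hat <- p / (1 - p)
unname(odds_hat[2] / odds_hat[1]) # Approximately 0.659
unname(exp(coef(nasa)[2])) # Same value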
Estimation by maximum likelihood
The estimation of β from a sample {(Xi, Yi)}_{i=1}^n is done by Maximum Likelihood Estimation (MLE). As it can be seen in Appendix A.2, MLE is equivalent to least squares in the linear model under the assumptions mentioned in Section 2.3, particularly, normality and independence. In the logistic model, we assume that9

Yi |(X1 = xi1, . . ., Xp = xip) ∼ Ber(logistic(ηi)),    i = 1, . . ., n,

9 Section 5.7 discusses in detail the assumptions of generalized linear models.

where ηi := β0 + β1xi1 + . . . + βpxip. Denoting pi(β) := logistic(ηi), the log-likelihood of β is

ℓ(β) = log ∏_{i=1}^n pi(β)^{Yi} (1 − pi(β))^{1−Yi} = ∑_{i=1}^n [Yi log pi(β) + (1 − Yi) log(1 − pi(β))].    (5.9)

The MLE estimate of β is

β̂ := arg max_{β ∈ R^{p+1}} ℓ(β).

Unfortunately, due to the non-linearity of (5.9), there is no explicit


expression for β̂ and it has to be obtained numerically by means
of an iterative procedure. We will see it with more detail in the
next section. Just be aware that this iterative procedure may fail to
converge in low sample size situations with perfect classification,
where the likelihood might be numerically unstable.
Figure 5.5 shows how the log-likelihood changes with respect
to the values for ( β 0 , β 1 ) in three data patterns. The data of the
illustration has been generated with the next chunk of code.
Figure 5.5: The logistic regression fit and its dependence on β0 (horizontal displacement) and β1 (steepness of the curve). Recall the effect of the sign of β1 in the curve: if positive, the logistic curve has an ‘s’ form; if negative, the form is a reflected ‘s’. Application available here.

# Data
set.seed(34567)
x <- rnorm(50, sd = 1.5)
y1 <- -0.5 + 3 * x
y2 <- 0.5 - 2 * x
y3 <- -2 + 5 * x
y1 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y1)))
y2 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y2)))
y3 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y3)))

# Data
dataMle <- data.frame(x = x, y1 = y1, y2 = y2, y3 = y3)

For fitting a logistic model we employ glm, which has the syntax glm(formula = response ~ predictor, family = "binomial", data = data), where response is a binary variable. Note that family = "binomial" refers to the fact that the response is a binomial variable (since it is a Bernoulli). Let’s check that indeed the coefficients given by glm are the ones that maximize the likelihood given in the animation of Figure 5.5. We do so for y1 ~ x.
# Call glm
mod <- glm(y1 ~ x, family = "binomial", data = dataMle)
mod$coefficients
## (Intercept) x
## -0.1691947 2.4281626

# -loglik(beta)
minusLogLik <- function(beta) {
p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
-sum(y1 * log(p) + (1 - y1) * log(1 - p))
}

# Optimization using as starting values beta = c(0, 0)


opt <- optim(par = c(0, 0), fn = minusLogLik)
opt
## $par
## [1] -0.1691366 2.4285119
##
## $value

## [1] 14.79376
##
## $counts
## function gradient
## 73 NA
##
## $convergence
## [1] 0
##
## $message
## NULL

# Visualization of the minusLogLik surface


beta0 <- seq(-3, 3, l = 50)
beta1 <- seq(-2, 8, l = 50)
L <- matrix(nrow = length(beta0), ncol = length(beta1))
for (i in seq_along(beta0)) {
for (j in seq_along(beta1)) {
L[i, j] <- minusLogLik(c(beta0[i], beta1[j]))
}
}
filled.contour(beta0, beta1, -L, color.palette = viridis::viridis,
xlab = expression(beta[0]), ylab = expression(beta[1]),
plot.axes = {
axis(1:2)
points(mod$coefficients[1], mod$coefficients[2],
col = 2, pch = 16)
points(opt$par[1], opt$par[2], col = 4)
})

# The plot.axes argument is a hack to add graphical information within the
# coordinates of the main panel (behind filled.contour there is a layout()...)

Figure 5.6: ℓ(β0, β1) surface and its global maximum (β̂0, β̂1).

For the regressions y2 ~ x and y3 ~ x, do the following:

• Check that the true β is close to maximizing the likelihood computed in Figure 5.5.
• Plot the fitted logistic curve and compare it with the one in Figure 5.5.

The extension of the logistic model to the case of a cat-


egorical response with more than two levels is sketched in
Appendix A.3.

5.2.2 General case


The same idea we used in logistic regression, namely transforming the conditional expectation of Y into something that can be modeled by a linear model (that is, a quantity that lives in R), can be generalized. This gives rise to the family of generalized linear models, which extends the linear model to different kinds of response variables and provides a convenient parametric framework.
The first ingredient is a link function g, monotonic and differentiable, which produces a transformed expectation to be modeled by a linear combination of the predictors:

g(E[Y | X1 = x1, . . ., Xp = xp]) = η

or, equivalently,

E[Y | X1 = x1, . . ., Xp = xp] = g−1(η),

where η := β0 + β1x1 + . . . + βpxp is the linear predictor.


The second ingredient of generalized linear models is a distribu-
tion for Y |( X1 , . . . , X p ), just as the linear model assumes normality
or the logistic model assumes a Bernoulli random variable. Thus,
we have two generalizations with respect to the usual linear model:

1. The conditional mean may be modeled by a transformation g−1


of the linear predictor η.
2. The distribution of Y |(X1, . . ., Xp) may be different from the Normal.

10 Not to be confused with the exponential distribution Exp(λ), which is a member of the exponential family.
11 This is the so-called canonical form of the exponential family. Generalizations of the family are possible, though we do not consider them.

Generalized linear models are intimately related with the exponential family10 11, which is the family of distributions with pdf expressible as

f(y; θ, φ) = exp{[yθ − b(θ)] / a(φ) + c(y, φ)},    (5.10)

where a(·), b(·), and c(·, ·) are specific functions. If Y has the pdf
(5.10), then we write Y ∼ E(θ, φ, a, b, c). If the scale parameter φ is
known, this is an exponential family with canonical parameter
θ (if φ is unknown, then it may or may not be a two-parameter
exponential family). Distributions from the exponential family have
some nice properties. Importantly, if Y ∼ E(θ, φ, a, b, c), then

µ := E[Y] = b′(θ),    σ² := Var[Y] = b′′(θ) a(φ).    (5.11)

The canonical link function is the function g that transforms


µ = b0 (θ ) into the canonical parameter θ. For E(θ, φ, a, b, c), this
happens if

θ = g(µ) (5.12)

or, more explicitly due to (5.11), if

g(µ) = (b′)−1(µ).    (5.13)

In the case of canonical link function, the one-line summary of the


generalized linear model is (independence is implicit)

Y |( X1 = x1 , . . . , X p = x p ) ∼ E(η, φ, a, b, c). (5.14)



Expression (5.14) gives insight on what a generalized


linear model does:

1. Select a member of the exponential family in (5.10) for


modelling Y.
2. The canonical link function g is given by g(µ) = (b′)−1(µ). For this g, then θ = g(µ).
3. The generalized linear model associated to the mem-
ber of the exponential family and g models the con-
ditional θ, given X1 , . . . , Xn , by means of the linear
predictor η. This equals to modelling the conditional µ
by means of g−1 (η ).

The linear model arises as a particular case of (5.14) with

a(φ) = φ,    b(θ) = θ²/2,    c(y, φ) = −(1/2){y²/φ + log(2πφ)},

and scale parameter φ = σ². In this case, µ = θ and the canonical link function g is the identity.

Show that the Normal, Binomial, Gamma (which in-


cludes Exponential and Chi-squared), and Poisson distri-
butions are members of the exponential family. For that,
express their pdfs in terms of (5.10) and identify who is θ
and φ.

The following table lists some useful generalized linear models.


Recall that the linear and logistic models of Sections 2.2.3 and 5.2.1
are obtained from the first and second rows, respectively.

Support of Y | Distribution | Link g(µ) | g−1(η) | φ | Y |(X1 = x1, . . ., Xp = xp)
R | N(µ, σ²) | µ | η | σ² | N(η, σ²)
{0, 1} | B(1, p) | logit(µ) | logistic(η) | 1 | B(1, logistic(η))
{0, 1, 2, . . .} | Pois(λ) | log(µ) | e^η | 1 | Pois(e^η)
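To connect the table with code, the following sketch shows how each row is requested through the family argument of glm; the data are simulated and the coefficients (0.5, 1) are arbitrary choices for the illustration.

# One simulated example per row of the table
set.seed(123)
x_sim <- rnorm(100)
y_norm <- rnorm(100, mean = 0.5 + x_sim, sd = 1)
y_bin <- rbinom(100, size = 1, prob = plogis(0.5 + x_sim))
y_pois <- rpois(100, lambda = exp(0.5 + x_sim))
coef(glm(y_norm ~ x_sim, family = gaussian)) # Identity link
coef(glm(y_bin ~ x_sim, family = binomial)) # Logit link
coef(glm(y_pois ~ x_sim, family = poisson)) # Log link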

Obtain the canonical link function for the exponential


distribution Exp(λ). What is the scale parameter? What
is the distribution of Y |( X1 = x1 , . . . , X p = x p ) in such
model?

The third model is known as Poisson regression and is usually


employed for modelling count data that arises from the recording

of the frequencies of a certain phenomenon. It considers that

Y |(X1 = x1, . . ., Xp = xp) ∼ Pois(e^η),

that is,

E[Y | X1 = x1, . . ., Xp = xp] = λ(Y | X1 = x1, . . ., Xp = xp) = e^(β0 + β1x1 + . . . + βpxp).    (5.15)

Notice that, since in the Poisson distribution the mean and the variance are equal, this implies that the variance of Y |(X1 = x1, . . ., Xp = xp)
changes according to the value of the predictors. The interpretation
of the coefficients is clear from (5.15):

• e β0 : is the expected value and variance of Y when X1 = . . . =


X p = 0.
• e β j , 1 ≤ j ≤ p: is the multiplicative increment of the expectation
and variance for an increment of one unit in X j = x j , provided
that the remaining variables X1 , . . . , X j−1 , X j+1 , . . . , X p do not
change.

Case study application


Let’s see how to apply a Poisson regression. For that aim we
consider the species (download) dataset. The goal is to analyse
whether the Biomass and the pH (a factor) of the terrain are influen-
tial on the number of Species. Incidentally, it will serve to illustrate
that the use of factors within glm is completely analogous to what
we did with lm.
species <- read.table("species.txt", header = TRUE)

# Data
plot(Species ~ Biomass, data = species, col = pH)
legend("topright", legend = c("Low pH", "Medium pH", "High pH"),
col = 1:3, lwd = 2)

# Fit Poisson regression
species1 <- glm(Species ~ ., data = species, family = poisson)
summary(species1)
##
## Call:
## glm(formula = Species ~ ., family = poisson, data = species)
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -2.5959 -0.6989 -0.0737  0.6647  3.5604
##

## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.84894 0.05281 72.885 < 2e-16 ***
## pHlow -1.13639 0.06720 -16.910 < 2e-16 ***
## pHmed -0.44516 0.05486 -8.114 4.88e-16 ***
## Biomass -0.12756 0.01014 -12.579 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##

## Null deviance: 452.346 on 89 degrees of freedom


## Residual deviance: 99.242 on 86 degrees of freedom
## AIC: 526.43
##
## Number of Fisher Scoring iterations: 4
# Took 4 iterations of the IRLS

# Interpretation of the coefficients:


exp(species1$coefficients)
## (Intercept) pHlow pHmed Biomass
## 46.9433686 0.3209744 0.6407222 0.8802418
# - 46.9433 is the average number of species when Biomass = 0 and the pH is high
# - For each increment in one unit in Biomass, the number of species decreases
# by a factor of 0.88 (12% reduction)
# - If pH decreases to med (low), then the number of species decreases by a factor
# of 0.6407 (0.3209)

# With interactions
species2 <- glm(Species ~ Biomass * pH, data = species, family = poisson)
summary(species2)
##
## Call:
## glm(formula = Species ~ Biomass * pH, family = poisson, data = species)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4978 -0.7485 -0.0402 0.5575 3.2297
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.76812 0.06153 61.240 < 2e-16 ***
## Biomass -0.10713 0.01249 -8.577 < 2e-16 ***
## pHlow -0.81557 0.10284 -7.931 2.18e-15 ***
## pHmed -0.33146 0.09217 -3.596 0.000323 ***
## Biomass:pHlow -0.15503 0.04003 -3.873 0.000108 ***
## Biomass:pHmed -0.03189 0.02308 -1.382 0.166954
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 452.346 on 89 degrees of freedom
## Residual deviance: 83.201 on 84 degrees of freedom
## AIC: 514.39
##
## Number of Fisher Scoring iterations: 4
exp(species2$coefficients)
## (Intercept) Biomass pHlow pHmed Biomass:pHlow Biomass:pHmed
## 43.2987424 0.8984091 0.4423865 0.7178730 0.8563910 0.9686112
# - If pH decreases to med (low), then the effect of the biomass in the number
# of species decreases by a factor of 0.9686 (0.8564). The higher the pH, the
# stronger the effect of the Biomass in Species

# Draw fits
plot(Species ~ Biomass, data = species, col = pH)
legend("topright", legend = c("High pH", "Medium pH", "Low pH"),
col = 1:3, lwd = 2)

# Without interactions
bio <- seq(0, 10, l = 100)
z <- species1$coefficients[1] + species1$coefficients[4] * bio
lines(bio, exp(z), col = 1)
lines(bio, exp(species1$coefficients[2] + z), col = 2)
lines(bio, exp(species1$coefficients[3] + z), col = 3)

# With interactions seems to provide a significant improvement


bio <- seq(0, 10, l = 100)
z <- species2$coefficients[1] + species2$coefficients[2] * bio

lines(bio, exp(z), col = 1, lty = 2)


lines(bio, exp(species2$coefficients[3] + species2$coefficients[5] * bio + z),
col = 2, lty = 2)
lines(bio, exp(species2$coefficients[4] + species2$coefficients[6] * bio + z),
col = 3, lty = 2)

Estimation by maximum likelihood
The estimation of β by MLE can be done in a unified framework for all generalized linear models thanks to the exponential family (5.10). Given {(Xi, Yi)}_{i=1}^n and employing a canonical link function (5.13), we have that

Yi |(X1 = xi1, . . ., Xp = xip) ∼ E(θi, φ, a, b, c),    i = 1, . . ., n,

where

θi := ηi := β0 + β1xi1 + . . . + βpxip,
µi := E[Yi | X1 = xi1, . . ., Xp = xip] = g−1(ηi).

Then, the log-likelihood is

ℓ(β) = ∑_{i=1}^n {[Yi θi − b(θi)] / a(φ) + c(Yi, φ)}.    (5.16)

Differentiating with respect to β gives

∂ℓ(β)/∂β = ∑_{i=1}^n [Yi − b′(θi)] / a(φ) · ∂θi/∂β,

which, exploiting the properties of the exponential family, can be reduced to

∂ℓ(β)/∂β = ∑_{i=1}^n (Yi − µi) / [g′(µi) Vi] · xi,    (5.17)

where xi is the i-th row of the design matrix X and Vi := Var[Yi] = a(φ) b′′(θi). Solving explicitly the system of equations ∂ℓ(β)/∂β = 0 is not possible in general and a numerical procedure is required. Newton–Raphson is usually employed, which is based on obtaining βnew from the linear system12

∂²ℓ(β)/∂β∂β′ |β=βold (βnew − βold) = − ∂ℓ(β)/∂β |β=βold.    (5.18)

12 The system stems from a first-order Taylor expansion around the root, β̂.

A simplifying trick is to consider the expectation of ∂²ℓ(β)/∂β∂β′ |β=βold in (5.18) rather than its actual value. By doing so, we can arrive at a neat iterative algorithm called Iterative Reweighted Least Squares (IRLS). To that aim, we use the following well-known property of the Fisher information matrix of the MLE theory:

E[∂²ℓ(β)/∂β∂β′] = −E[(∂ℓ(β)/∂β)(∂ℓ(β)/∂β)′].

13 Recall that E[(Yi − µi)(Yj − µj)] = Cov[Yi, Yj] = Vi if i = j and 0 if i ≠ j, because of independence.

Then, it can be seen that13
E[∂²ℓ(β)/∂β∂β′ |β=βold] = −∑_{i=1}^n wi xi xi′ = −X′WX,    (5.19)

where wi := 1 / [Vi (g′(µi))²] and W := diag(w1, . . ., wn). Using this notation and from (5.17),

∂ℓ(β)/∂β |β=βold = X′W(Y − µold) g′(µold).    (5.20)

Substituting (5.19) and (5.20) in (5.18), we have:

βnew = βold − E[∂²ℓ(β)/∂β∂β′ |β=βold]−1 ∂ℓ(β)/∂β |β=βold
     = βold + (X′WX)−1 X′W(Y − µold) g′(µold)
     = (X′WX)−1 X′Wz,    (5.21)

where z := Xβold + (Y − µold) g′(µold) is the working vector.


As a consequence, fitting a generalized linear model by IRLS
amounts to performing a series of weighted linear models with
changing weights and responses given by the working vector. IRLS
can be summarized as:

1. Set βold with some initial estimation.


2. Compute µold , W, and z.
3. Compute βnew using (5.21).
4. Set βold as βnew .
5. Iterate steps 2–4 until convergence, then set β̂ = βnew .

In general, E[∂²ℓ(β)/∂β∂β′] ≠ ∂²ℓ(β)/∂β∂β′. Thus, IRLS in general departs from the standard Newton–Raphson. However, if the canonical link is used, it can be seen that the equality of the matrices is guaranteed and IRLS is exactly the same as Newton–Raphson. In that case, wi = 1/g′(µi) (which simplifies the computation of W in the algorithm above).
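To make steps 1–5 concrete, below is a minimal IRLS implementation for the logistic model (canonical link). It is an illustrative sketch, not the actual internals of glm, and it reuses the x and y1 data and the mod fit from the previous section.

# Minimal IRLS for logistic regression, following steps 1-5 above
irlsLogistic <- function(x, y, tol = 1e-8, maxIter = 25) {

  X <- cbind(1, x) # Design matrix with intercept
  beta <- rep(0, ncol(X)) # Step 1: initial estimate
  for (k in 1:maxIter) {

    eta <- drop(X %*% beta) # Linear predictor
    mu <- plogis(eta) # logistic(eta)
    w <- mu * (1 - mu) # Weights w_i = 1 / (V_i g'(mu_i)^2) = mu_i (1 - mu_i)
    z <- eta + (y - mu) / w # Working vector z = eta + (y - mu) g'(mu)
    betaNew <- drop(solve(crossprod(X, w * X), crossprod(X, w * z))) # (5.21)
    done <- max(abs(betaNew - beta)) < tol
    beta <- betaNew
    if (done) break

  }
  beta

}

# Coincides (up to numerical tolerance) with the coefficients given by glm
irlsLogistic(x = x, y = y1)
mod$coefficients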

5.3 Inference for model parameters

The assumptions on which a generalized linear model is con-


structed allow us to specify what is the asymptotic distribution of
the random vector β̂ through the theory of MLE. Again, the distribu-
tion is derived conditionally on the sample predictors X1 , . . . , Xn . In
other words, we assume that the randomness of Y comes only from
Y |( X1 = x1 , . . . , X p = x p ) and not from the predictors.
For the ease of exposition, we will focus on the logistic model
rather than in the general case. The conceptual differences are not
so big, but the simplification in terms of notation and the benefits
on the intuition side are important.
There is an important difference between the inference results for
the linear model and for logistic regression:

• In linear regression the inference is exact. This is due to the nice


properties of the normal, least squares estimation, and linear-
ity. As a consequence, the distributions of the coefficients are
perfectly known assuming that the assumptions hold.
• In generalized linear models the inference is asymptotic. This
means that the distributions of the coefficients are unknown
except for large sample sizes n, for which we have approxima-
tions. The reason is the higher complexity of the model in terms
of non-linearity. This is the usual situation for the majority of
regression models.

5.3.1 Distributions of the fitted coefficients

The distribution of β̂ is given by the asymptotic theory of MLE:

β̂ ∼ N_{p+1}(β, I(β)−1),    (5.22)

where ∼ [. . .] must be understood as asymptotically distributed as


[. . .] when n → ∞ for the rest of the chapter and

I(β) := −E[∂²ℓ(β)/∂β∂β′]

is the Fisher information matrix. The name comes from the fact that
it measures the information available in the sample for estimating β. The
“larger” (large eigenvalues) the matrix is, the more precise the
estimation of β is, because that results in smaller variances in (5.22).
The inverse of the Fisher information matrix is

I ( β)−1 = (X0 VX)−1 , (5.23)

where V = diag(V1 , . . . , Vn ) and, for the logistic model, Vi =


logistic(ηi )(1 − logistic(ηi )), with ηi = β 0 + β 1 xi1 + . . . + β p xip .
In the case of the multiple linear regression, I ( β)−1 = σ2 (X0 X)−1
(see (2.11)), so the presence of V here is a consequence of the het-
eroskedasticity of the model.
The interpretation of (5.22) and (5.23) gives some useful insights
on what concepts affect the quality of the estimation:

• Bias. The estimates are asymptotically unbiased.

• Variance. It depends on:

– Sample size n. Hidden inside X0 VX. As n grows, the precision


of the estimators increases.
– Weighted predictor sparsity (X0 VX)−1 . The more sparse the pre-
dictor is (small eigenvalues of (X0 VX)−1 ), the more precise β̂
is.

The precision of β̂ is affected by the value of β, which


is hidden inside V. This contrasts sharply with the linear
model, where the precision of the least squares esti-
mator was not affected by the value of the unknown
coefficients (see (2.11)). The reason is partially due to the
heteroskedasticity of logistic regression, which implies
a dependence of the variance of Y in the logistic curve,
hence in β.

Figure 5.7: Illustration of the randomness of the fitted coefficients (β̂0, β̂1) and the influence of n, (β0, β1), and s²x. The sample predictors x1, . . ., xn are fixed and new responses Y1, . . ., Yn are generated each time from a logistic model Y | X = x ∼ Ber(logistic(β0 + β1x)). Application available here.

Similar to linear regression, the problem with (5.22) and (5.23) is


that V is unknown in practice because it depends on β. Plugging-in
the estimate β̂ to β in V results in V̂. Now we can use V̂ to get

(β̂j − βj) / ŜE(β̂j) ∼ N(0, 1),    ŜE(β̂j)² := vj,    (5.24)

where vj is the j-th element of the diagonal of (X′V̂X)−1.

The LHS of (5.24) is the Wald statistic for β j , j = 0, . . . , p. They are


employed for building confidence intervals and hypothesis tests in
an analogous way to the t-statistics in linear regression.
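These quantities are easy to inspect in practice. The following sketch recomputes (X′V̂X)−1 for the nasa model of the case study and compares it with the covariance matrix and standard errors reported by glm; it is a numerical check, not new methodology.

# Recompute the asymptotic covariance matrix (5.23) with V estimated by V-hat
X <- model.matrix(nasa)
p_hat <- fitted(nasa) # logistic(eta_hat_i)
V_hat <- diag(p_hat * (1 - p_hat))
solve(t(X) %*% V_hat %*% X) # (X' V_hat X)^{-1}
vcov(nasa) # Same matrix, as computed by glm
sqrt(diag(vcov(nasa))) # The standard errors reported in summary(nasa)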

5.3.2 Confidence intervals for the coefficients

Thanks to (5.24), we can have the 100(1 − α)% CI for the coefficient βj, j = 0, . . ., p:

β̂j ± ŜE(β̂j) zα/2,    (5.25)

where zα/2 is the α/2-upper quantile of the N(0, 1). In case we are interested in the CI for e^βj, we can simply take the exponential of the above CI. So the 100(1 − α)% CI for e^βj, j = 0, . . ., p, is

e^(β̂j ± ŜE(β̂j) zα/2).

Of course, this CI is not the same as e^β̂j ± e^(ŜE(β̂j) zα/2), which is not a valid CI for e^β̂j.
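As a quick sketch, (5.25) can be computed by hand for the nasa model and compared with confint.default, which relies on the same Wald-based construction.

# Manual 95% Wald CIs for the coefficients and for exp(beta_j)
z <- qnorm(0.975)
beta_hat <- coef(nasa)
se <- sqrt(diag(vcov(nasa)))
cbind(beta_hat - z * se, beta_hat + z * se)
confint.default(nasa, level = 0.95) # Same intervals
exp(cbind(beta_hat - z * se, beta_hat + z * se)) # CIs for exp(beta_j)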

5.3.3 Testing on the coefficients

The distributions in (5.24) also allow us to conduct a formal hy-


pothesis test on the coefficients β j , j = 0, . . . , p. For example, the
test for significance:

H0 : β j = 0

for j = 0, . . . , p. The test of H0 : β j = 0 with 1 ≤ j ≤ p is especially


interesting, since it allows us to answer whether the variable X j has a
significant effect on Y. The statistic used for testing for significance is
the Wald statistic

(β̂j − 0) / ŜE(β̂j),

which is asymptotically distributed as a N (0, 1) under the (veracity


of) the null hypothesis. H0 is tested against the bilateral alternative
hypothesis H1 : β j 6= 0.
The tests for significance are built-in in the summary function.
However, a note of caution is required when applying the rule of
thumb:

Is the CI for β j below (above) 0 at level α?

• Yes → reject H0 at level α. Conclude X j has a signifi-


cant negative (positive) effect on Y at level α.
• No → the criterion is not conclusive.

The significances given in summary and the output of MASS::confint are slightly incoherent and the previous rule of thumb does not apply. The reason is that MASS::confint uses a more sophisticated method (profile likelihood) to estimate the standard error of β̂j, ŜE(β̂j), and not the asymptotic distribution behind the Wald statistic.

By changing confint to R’s default confint.default, the results of the latter will be completely equivalent to the significances in summary, and the rule of thumb will still be completely valid. For the contents of this course we prefer confint.default due to its better interpretability. This point is exemplified in the next section.

5.3.4 Case study application


Let’s compute the summary of the nasa model in order to address the significance of the coefficients. At the sight of the fitted curve and the summary of the model we can conclude that the temperature was negatively affecting the probability of an O-ring incident (Q2). Indeed, the confidence intervals for the coefficients show a significant negative effect at level α = 0.05:

# Summary of the model


summary(nasa)
##
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0566 -0.7575 -0.3818 0.4571 2.2195
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.5837 3.9146 1.937 0.0527 .
## temp -0.4166 0.1940 -2.147 0.0318 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.335 on 21 degrees of freedom
## AIC: 24.335
##
## Number of Fisher Scoring iterations: 5

# Confidence intervals at 95%


confint.default(nasa)
## 2.5 % 97.5 %
## (Intercept) -0.08865488 15.25614140
## temp -0.79694430 -0.03634877

# Confidence intervals at other levels


confint.default(nasa, level = 0.90)
## 5 % 95 %

## (Intercept) 1.1448638 14.02262275


## temp -0.7358025 -0.09749059

# Confidence intervals for the factors affecting the odds


exp(confint.default(nasa))
## 2.5 % 97.5 %
## (Intercept) 0.9151614 4.223359e+06
## temp 0.4507041 9.643039e-01

The coefficient for temp is significant at α = 0.05 and the intercept


is not (it is for α = 0.10). The 95% confidence interval for β 0 is
(−0.0887, 15.2561) and for β 1 is (−0.7969, −0.0363). For e β0 and e β1 ,
the CIs are (0.9151, 4.2233 × 106 ) and (0.4507, 0.9643), respectively.
Therefore, we can say with a 95% confidence that:

• When temp=0, the probability of fail.field=1 is not significantly larger than the probability of fail.field=0 (using the CI for β0). fail.field=1 is between 0.9151 and 4.2233 × 10^6 times more likely than fail.field=0 (using the CI for e^β0).
• temp has a significantly negative effect on the probability of
fail.field=1 (using the CI for β 1 ). Indeed, each unit increase in
temp produces a reduction of the odds of fail.field by a factor
between 0.4507 and 0.9643 (using the CI for e β1 ).

This completes the answers to Q1 and Q2.


We conclude by illustrating the incoherence of summary and
confint.

# Significances with asymptotic approximation for the standard errors


summary(nasa)
##
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0566 -0.7575 -0.3818 0.4571 2.2195
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.5837 3.9146 1.937 0.0527 .
## temp -0.4166 0.1940 -2.147 0.0318 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.335 on 21 degrees of freedom
## AIC: 24.335
##
## Number of Fisher Scoring iterations: 5

# CIs with asymptotic approximation -- coherent with summary


confint.default(nasa, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -0.08865488 15.25614140
## temp -0.79694430 -0.03634877
confint.default(nasa, level = 0.99)
## 0.5 % 99.5 %
## (Intercept) -2.4994971 17.66698362
## temp -0.9164425 0.08314945

# CIs with profile likelihood -- incoherent with summary


confint(nasa, level = 0.95) # intercept still significant
## 2.5 % 97.5 %
## (Intercept) 1.3364047 17.7834329
## temp -0.9237721 -0.1089953
confint(nasa, level = 0.99) # temp still significant
## 0.5 % 99.5 %
## (Intercept) -0.3095128 22.26687651
## temp -1.1479817 -0.02994011

5.4 Prediction

Prediction in generalized linear models focuses mainly on predicting the values of the conditional mean

E[Y | X1 = x1, . . ., Xp = xp] = g−1(η) = g−1(β0 + β1x1 + . . . + βpxp)

by means of η̂ := β̂ 0 + β̂ 1 x1 + . . . + β̂ p x p and not on predicting the


conditional response. The reason is that confidence intervals, the
main difference between both kinds of prediction, depend heavily
on the family we are considering for the response.
For the logistic model, the prediction of the conditional response follows immediately from logistic(η̂): Ŷ |(X1 = x1, . . ., Xp = xp) equals 1 with probability logistic(η̂) and 0 with probability 1 − logistic(η̂). As a consequence, we can predict Y as 1 if logistic(η̂) > 1/2 and as 0 otherwise.
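For example, with the nasa model of the case study, class predictions follow from thresholding the fitted probabilities at 1/2 (a small sketch):

# Predict Y = 1 when the estimated conditional probability exceeds 1/2
prob_hat <- predict(nasa, type = "response")
class_hat <- as.numeric(prob_hat > 0.5)
table(class_hat, challenger$fail.field) # Compare with the observed responses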
To make predictions and compute CIs in practice we use predict.
There are two differences with respect to its use for lm:

• The argument type. type = "link" returns η̂ (the log-odds in


the logistic model), type = "response" returns g−1 (η̂ ) (the prob-
abilities in the logistic model). Observe that type = "response"
has a different behaviour than predict for lm, where it returned
the predictions for the conditional response.
• There is no interval argument for using predict with glm. That
means that the computation of CIs for prediction involves some
extra coding.

Figure 5.8 gives an interactive visualization of the CIs for the


conditional probability in simple logistic regression. Their inter-
pretation is very similar to the CIs for the conditional mean in the
simple linear model, see Section 2.5 and Figure 2.15.

5.4.1 Case study application


Let’s compute what was the probability of having at least one inci-
dent with the O-rings in the launch day (answers Q3):
predict(nasa, newdata = data.frame(temp = -0.6), type = "response")
## 1
## 0.999604

Figure 5.8: Illustration of the CIs for the conditional probability in the simple logistic regression. Application available here.

Recall that there is a serious problem of extrapolation in the prediction, which makes it less precise (or more variable). But this extrapolation, together with the evidence raised by a simple analysis like the one we did, should have been a strong argument for postponing the launch.
Since it is a bit cumbersome to compute the CIs for the conditional probability, we can code the function predictCIsLogistic to do it automatically.

# Function for computing the predictions and CIs for the conditional probability
predictCIsLogistic <- function(object, newdata, level = 0.95) {

# Compute predictions in the log-odds


pred <- predict(object = object, newdata = newdata, se.fit = TRUE)

# CI in the log-odds
za <- qnorm(p = (1 - level) / 2)
lwr <- pred$fit + za * pred$se.fit
upr <- pred$fit - za * pred$se.fit

# Transform to probabilities
fit <- 1 / (1 + exp(-pred$fit))
lwr <- 1 / (1 + exp(-lwr))
upr <- 1 / (1 + exp(-upr))

# Return a matrix with column names "fit", "lwr" and "upr"


result <- cbind(fit, lwr, upr)
colnames(result) <- c("fit", "lwr", "upr")
return(result)

}

Let’s apply the function to the case study:

# Data for which we want a prediction


newdata <- data.frame(temp = -0.6)

# Prediction of the conditional log-odds, the default


predict(nasa, newdata = newdata, type = "link")
## 1
## 7.833731

# Prediction of the conditional probability


predict(nasa, newdata = newdata, type = "response")
## 1
## 0.999604

# Simple call
predictCIsLogistic(nasa, newdata = newdata)
## fit lwr upr
## 1 0.999604 0.4838505 0.9999999
# The CI is large because there is no data around temp = -0.6 and
# that makes the prediction more variable (and also because we only
# have 23 observations)

For the challenger dataset, do the following:

• Regress fail.nozzle on temp and pres.nozzle.


• Compute the predicted probability of fail.nozzle=1
for temp = 15 and pres.nozzle = 200. What is the
predicted probability for fail.nozzle=0?
• Compute the confidence interval for the two predicted
probabilities at level 95%.

5.5 Deviance

The deviance is a key concept in generalized linear models. Intuitively, it measures the deviation of the fitted generalized linear model with respect to a perfect model for E[Y | X1 = x1, . . ., Xp = xp]. This perfect model, known as the saturated model, is the model that perfectly fits the data, in the sense that the fitted responses (Ŷi) are the same as the observed responses (Yi). For example, in logistic regression this would be the model such that

P̂[Y = 1 | X1 = Xi1, . . ., Xp = Xip] = Yi,    i = 1, . . ., n.

Figure 5.9 shows a saturated model and a fitted logistic regression


to a dataset.
Formally, the deviance is defined through the difference of the
log-likelihoods between the fitted model, `( β̂), and the saturated
model, `s . Computing `s amounts to substitute µi by Yi in (5.16).
If the canonical link function is used, this corresponds to setting
θi = g(Yi ) (recall (5.12)). The deviance is then defined as:
D := −2[ℓ(β̂) − ℓs] φ.

The log-likelihood `( β̂) is always smaller than `s (the saturated


model is more likely given the sample since it is basically the sam-
ple itself). As a consequence, the deviance is always larger or equal
than zero, being zero only if the fit of the model is perfect.
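In the particular case of a binary logistic regression, the saturated model assigns probability Yi to each observation, so ℓs = 0 and the deviance reduces to −2ℓ(β̂). This is easy to verify numerically with the nasa model (a quick check):

# For 0/1 responses the saturated log-likelihood is 0, so D = -2 * logLik
nasa$deviance
-2 * as.numeric(logLik(nasa))
nasa$null.deviance # Deviance of the intercept-only (null) model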

Figure 5.9: Fitted logistic regression versus a saturated model (several are possible depending on the interpolation between points) and the null model.

If the canonical link function is employed, the deviance can be


expressed as

D = −[2/a(φ)] ∑_{i=1}^n [Yi θ̂i − b(θ̂i) − Yi g(Yi) + b(g(Yi))] φ
  = [2φ/a(φ)] ∑_{i=1}^n [Yi (g(Yi) − θ̂i) − b(g(Yi)) + b(θ̂i)].    (5.26)

In most of the cases, a(φ) ∝ φ, so the deviance does not depend on φ.


Expression (5.26) is interesting, since it delivers the following key
insight:

The deviance is a generalization of the Residual Sum of Squares (RSS) of the linear model. The generalization is driven by the likelihood and its equivalence with the RSS in the linear model.

14 The canonical link function g is the identity, check (5.12) and (5.13).

To see it, let’s consider the linear model in (5.26) by setting φ = σ², a(φ) = φ, b(θ) = θ²/2, c(y, φ) = −(1/2){y²/φ + log(2πφ)}, and θ = µ = η14. Then, we have:
D = (2σ²/σ²) ∑_{i=1}^n [Yi (Yi − η̂i) − Yi²/2 + η̂i²/2]
  = ∑_{i=1}^n [2Yi² − 2Yi η̂i − Yi² + η̂i²]
  = ∑_{i=1}^n (Yi − η̂i)²
  = RSS(β̂),    (5.27)
since η̂i = β̂ 0 + β̂ 1 xi1 + . . . + β̂ p xip . Remember that RSS( β̂) is just
another name for the SSE.
A benchmark for evaluating the scale of the deviance is the null deviance,

D0 := −2[ℓ(β̂0) − ℓs] φ,

which is the deviance of the worst model, the one fitted without any predictor and only intercept, to the perfect model. For example, in logistic regression:

Y |(X1 = x1, . . ., Xp = xp) ∼ Ber(logistic(β0)).

In this case, β̂0 = logit(m/n) = log[(m/n)/(1 − m/n)], where m is the number of 1’s in Y1, . . ., Yn (see Figure 5.9).
Using again (5.26), we can see that the null deviance is a generalization of the total sum of squares of the linear model (see Section 2.6):

D0 = ∑_{i=1}^n (Yi − η̂i)² = ∑_{i=1}^n (Yi − β̂0)² = SST,

since β̂0 = Ȳ because there are no predictors.
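The logistic-case claim that β̂0 = logit(m/n) is also easy to check numerically with the challenger data (a quick sketch using the intercept-only fit):

# The intercept of the null logistic model is the logit of the sample proportion
null <- glm(fail.field ~ 1, family = "binomial", data = challenger)
coef(null)
qlogis(mean(challenger$fail.field)) # logit(m / n)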

Figure 5.10: Illustrative pictorial representation of the deviance (D) and the null deviance (D0).

Using the deviance and the null deviance, we can compare how
much the model has improved by adding the predictors X1 , . . . , X p
and quantify the percentage of deviance explained. This can be done
by means of the R² statistic, which is a generalization of the determination coefficient for linear regression:

R² := 1 − D/D0, which for the linear model equals 1 − SSE/SST.

The R2 for generalized linear models is a global measure


of fit that shares the same philosophy with the determi-
nation coefficient in linear regression: it is a proportion
of how good the fit is. If perfect, D = 0 and R2 = 1. If the
predictors do not add anything to the regression, then
D = D0 and R2 = 0.

However, this R2 has a different interpretation than the


one in linear regression. In particular:

• Is not the percentage of variance explained by the


model, but rather a ratio indicating how close is the fit
to being perfect or the worst.
• It is not related to any correlation coefficient.

The deviance is returned by summary, but it is important to recall that R refers to the deviance as the 'Residual deviance' (the null deviance is referred to as 'Null deviance').
# Summary of model
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
summaryLog <- summary(nasa)
summaryLog
##
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0566 -0.7575 -0.3818 0.4571 2.2195
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.5837 3.9146 1.937 0.0527 .
## temp -0.4166 0.1940 -2.147 0.0318 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.335 on 21 degrees of freedom
## AIC: 24.335
##
## Number of Fisher Scoring iterations: 5
# ’Residual deviance’ is the deviance; ’Null deviance’ is the null deviance

# Null model (only intercept)


null <- glm(fail.field ~ 1, family = "binomial", data = challenger)
summaryNull <- summary(null)
summaryNull
##
## Call:
## glm(formula = fail.field ~ 1, family = "binomial", data = challenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8519 -0.8519 -0.8519 1.5425 1.5425
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -0.8267 0.4532 -1.824 0.0681 .


## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 28.267 on 22 degrees of freedom
## AIC: 30.267
##
## Number of Fisher Scoring iterations: 4

# Computation of the R^2 with a function -- useful for repetitive computations


r2glm <- function(model) {

summaryLog <- summary(model)


1 - summaryLog$deviance / summaryLog$null.deviance

}

# R^2
r2glm(nasa)
## [1] 0.280619
r2glm(null)
## [1] -2.220446e-16

A quantity related with the deviance is the scaled deviance:

D* := D/φ = −2[ℓ(β̂) − ℓs].

If φ = 1, such as in the logistic or Poisson regression models, then both the deviance and the scaled deviance agree. The scaled deviance has asymptotic distribution

D* ∼ χ²_{n−p−1},    (5.28)

where χ²_k is the Chi-squared distribution with k degrees of freedom. In the case of the linear model, D* = (1/σ²) RSS is exactly distributed as a χ²_{n−p−1}. The result (5.28) provides a way of estimating φ when it is unknown: match D* = D/φ with the expectation E[χ²_{n−p−1}] = n − p − 1. This provides

φ̂_D := D/(n − p − 1) = −2(ℓ(β̂) − ℓs)/(n − p − 1),
which, as expected, in the case of the linear model is equivalent to
σ̂2 as given in (2.15). More importantly, the scaled deviance can be
used for performing hypotheses tests on sets of coefficients of a
generalized linear model.
Assume we have one model, say model 2, with p2 predictors
and another model, say model 1, with p1 < p2 predictors that are
contained in the set of predictors of the model 2. In other words,
assume model 1 is nested within model 2. Then we can test the null
hypothesis that the extra coefficients of model 2 are simultaneously
zero. For example, if model 1 has the coefficients { β 0 , β 1 , . . . , β p1 }
and model 2 has coefficients { β 0 , β 1 , . . . , β p1 , β p1 +1 , . . . , β p2 }, we can
test

H0 : β p1 +1 = . . . = β p2 = 0 vs. H1 : β j 6= 0 for any p1 < j ≤ p2 .



This can be done by means of the statistic15

D*_{p1} − D*_{p2} ∼ χ²_{p2−p1} under H0.    (5.29)

If H0 is true, then D*_{p1} − D*_{p2} is expected to be small, thus we will reject H0 if the value of the statistic is above the α-upper quantile of the χ²_{p2−p1}, denoted as χ²_{α;p2−p1}.
Note that D* apparently removes the effects of φ, but it is still dependent on φ, since this is hidden in the likelihood (see (5.26)). Therefore, D* cannot be computed unless φ is known, which forbids using (5.29). Fortunately, this dependence is removed by employing (5.28) and (5.29) and assuming that they are asymptotically independent. This gives the F-test for H0:

F = [(D*_{p1} − D*_{p2})/(p2 − p1)] / [D*_{p2}/(n − p2 − 1)] = [(D_{p1} − D_{p2})/(p2 − p1)] / [D_{p2}/(n − p2 − 1)] ∼ F_{p2−p1, n−p2−1} under H0.

Note that the LHS is perfectly computable, since φ cancels due to the quotient (and because we assume that a(φ) ∝ φ). Note also that this is an extension of the F-test as we saw it in Section 2.6: take p1 = 0 and p2 = p, and then the significance of all the predictors included in the model is tested (both models contain an intercept).
The computation of deviances and associated tests is done
through anova, which implements the Analysis of Deviance. This
is illustrated in the following code, which coincidentally also illus-
trates the inclusion of non-linear transformations on the predictors.

# Polynomial predictors
nasa0 <- glm(fail.field ~ 1, family = "binomial", data = challenger)
nasa1 <- glm(fail.field ~ temp, family = "binomial", data = challenger)
nasa2 <- glm(fail.field ~ poly(temp, degree = 2), family = "binomial",
data = challenger)
nasa3 <- glm(fail.field ~ poly(temp, degree = 3), family = "binomial",
data = challenger)

# Plot fits
temp <- seq(-1, 35, l = 200)
tt <- data.frame(temp = temp)
plot(fail.field ~ temp, data = challenger, pch = 16, xlim = c(-1, 30),
xlab = "Temperature", ylab = "Incident probability")
lines(temp, predict(nasa0, newdata = tt, type = "response"), col = 1)
lines(temp, predict(nasa1, newdata = tt, type = "response"), col = 2)
lines(temp, predict(nasa2, newdata = tt, type = "response"), col = 3)
lines(temp, predict(nasa3, newdata = tt, type = "response"), col = 4)
legend("bottomleft", legend = c("Null model", "Linear", "Quadratic", "Cubic"),
lwd = 2, col = 1:4)
# R^2's
r2glm(nasa0)
## [1] -2.220446e-16
r2glm(nasa1)
## [1] 0.280619
r2glm(nasa2)
## [1] 0.3138925
r2glm(nasa3)
## [1] 0.4831863
# Chisq and F tests -- same results since phi is known

anova(nasa1, test = "Chisq")


## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: fail.field
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 22 28.267
## temp 1 7.9323 21 20.335 0.004856 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
anova(nasa1, test = "F")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: fail.field
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev F Pr(>F)
## NULL 22 28.267
## temp 1 7.9323 21 20.335 7.9323 0.004856 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

# Incremental comparisons of nested models


anova(nasa1, nasa2, nasa3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: fail.field ~ temp
## Model 2: fail.field ~ poly(temp, degree = 2)
## Model 3: fail.field ~ poly(temp, degree = 3)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 21 20.335
## 2 20 19.394 1 0.9405 0.3321
## 3 19 14.609 1 4.7855 0.0287 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
# Quadratic effects are not significative

# Cubic vs linear
anova(nasa1, nasa3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: fail.field ~ temp
## Model 2: fail.field ~ poly(temp, degree = 3)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 21 20.335
## 2 19 14.609 2 5.726 0.0571 .
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

# Example in Poisson regression


species1 <- glm(Species ~ ., data = species, family = poisson)
species2 <- glm(Species ~ Biomass * pH, data = species, family = poisson)

# Comparison
anova(species1, species2, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Species ~ pH + Biomass
## Model 2: Species ~ Biomass * pH

## Resid. Df Resid. Dev Df Deviance Pr(>Chi)


## 1 86 99.242
## 2 84 83.201 2 16.04 0.0003288 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
r2glm(species1)
## [1] 0.7806071
r2glm(species2)
## [1] 0.8160674

5.6 Model selection

The same discussion we did in Section 3.2 is applicable to general-


ized linear models with small changes:

1. The deviance of the model (reciprocally the likelihood and the


R2 ) always decreases (increases) with the inclusion of more
predictors – no matter whether they are significant or not.
2. The excess of predictors in the model is paid by a larger vari-
ability in the estimation of the model which results in less pre-
cise prediction.
3. Multicollinearity may hide significant variables, change the
sign of them, and result in an increase of the variability of the
estimation.

Stepwise selection can be done through MASS::stepAIC as in linear models16. Conveniently, summary also reports the AIC:

16 The leaps package does not support generalized linear models directly. There are, however, other packages for performing best subset selection with generalized linear models, but we do not cover them here.

# Models
nasa1 <- glm(fail.field ~ temp, family = "binomial", data = challenger)
nasa2 <- glm(fail.field ~ temp + pres.field, family = "binomial",
             data = challenger)

# Summaries
summary(nasa1)
##
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0566 -0.7575 -0.3818 0.4571 2.2195
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.5837 3.9146 1.937 0.0527 .
## temp -0.4166 0.1940 -2.147 0.0318 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.335 on 21 degrees of freedom
## AIC: 24.335
##
## Number of Fisher Scoring iterations: 5
summary(nasa2)
##
## Call:
## glm(formula = fail.field ~ temp + pres.field, family = "binomial",
## data = challenger)

##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2109 -0.6081 -0.4292 0.3498 2.0913
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.642709 4.038547 1.645 0.1000
## temp -0.435032 0.197008 -2.208 0.0272 *
## pres.field 0.009376 0.008821 1.063 0.2878
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 19.078 on 20 degrees of freedom
## AIC: 25.078
##
## Number of Fisher Scoring iterations: 5

# AICs
AIC(nasa1) # Better
## [1] 24.33485
AIC(nasa2)
## [1] 25.07821

MASS::stepAIC works analogously to how it does in linear regression. An illustration is given next for predicting a binary variable that measures whether a Boston suburb (Boston dataset from Section 3.1) is wealthy or not. The binary variable is medv > 25: it is TRUE (1) for suburbs with a median house value larger than $25000 and FALSE (0) otherwise. The cutoff $25000 corresponds to the 25% richest suburbs.
# Boston dataset
data(Boston, package = "MASS")

# Model whether a suburb has a median house value larger than $25000
mod <- glm(I(medv > 25) ~ ., data = Boston, family = "binomial")
summary(mod)
##
## Call:
## glm(formula = I(medv > 25) ~ ., family = "binomial", data = Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3498 -0.2806 -0.0932 -0.0006 3.3781
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.312511 4.876070 1.090 0.275930
## crim -0.011101 0.045322 -0.245 0.806503
## zn 0.010917 0.010834 1.008 0.313626
## indus -0.110452 0.058740 -1.880 0.060060 .
## chas 0.966337 0.808960 1.195 0.232266
## nox -6.844521 4.483514 -1.527 0.126861
## rm 1.886872 0.452692 4.168 3.07e-05 ***
## age 0.003491 0.011133 0.314 0.753853
## dis -0.589016 0.164013 -3.591 0.000329 ***
## rad 0.318042 0.082623 3.849 0.000118 ***
## tax -0.010826 0.004036 -2.682 0.007314 **
## ptratio -0.353017 0.122259 -2.887 0.003884 **
## black -0.002264 0.003826 -0.592 0.554105
## lstat -0.367355 0.073020 -5.031 4.88e-07 ***
## ---

## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.52 on 505 degrees of freedom
## Residual deviance: 209.11 on 492 degrees of freedom
## AIC: 237.11
##
## Number of Fisher Scoring iterations: 7
r2glm(mod)
## [1] 0.628923

# With BIC -- ends up with only the significant variables and a similar R^2
modBIC <- MASS::stepAIC(mod, trace = 0, k = log(nrow(Boston)))
summary(modBIC)
##
## Call:
## glm(formula = I(medv > 25) ~ indus + rm + dis + rad + tax + ptratio +
## lstat, family = "binomial", data = Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3077 -0.2970 -0.0947 -0.0005 3.2552
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.556433 3.948818 0.394 0.693469
## indus -0.143236 0.054771 -2.615 0.008918 **
## rm 1.950496 0.441794 4.415 1.01e-05 ***
## dis -0.426830 0.111572 -3.826 0.000130 ***
## rad 0.301060 0.076542 3.933 8.38e-05 ***
## tax -0.010240 0.003631 -2.820 0.004800 **
## ptratio -0.404964 0.112086 -3.613 0.000303 ***
## lstat -0.384823 0.069121 -5.567 2.59e-08 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.52 on 505 degrees of freedom
## Residual deviance: 215.03 on 498 degrees of freedom
## AIC: 231.03
##
## Number of Fisher Scoring iterations: 7
r2glm(modBIC)
## [1] 0.6184273

The logistic model is at the intersection between regression models and classification methods. Therefore, the search for adequate predictors to be included in the model can also be done in terms of the classification performance. Although we do not explore this direction in detail, we simply mention how the overall predictive accuracy can be summarized with the hit matrix (also called confusion matrix)

Reality vs. classified    Ŷ = 0          Ŷ = 1
Y = 0                     Correct0       Incorrect01
Y = 1                     Incorrect10    Correct1

and with the hit ratio, (Correct0 + Correct1)/n, which is the percentage of correct classifications. The hit matrix is easily computed with the

table function. The function, whenever called with two vectors,


computes the cross-table between the two vectors.

# Fitted probabilities for Y = 1


nasa$fitted.values
## 1 2 3 4 5 6 7 8 9 10
## 0.42778935 0.23014393 0.26910358 0.32099837 0.37772880 0.15898364 0.12833090 0.23014393 0.85721594 0.60286639
## 11 12 13 14 15 16 17 18 19 20
## 0.23014393 0.04383877 0.37772880 0.93755439 0.37772880 0.08516844 0.23014393 0.02299887 0.07027765 0.03589053
## 21 22 23
## 0.08516844 0.07027765 0.82977495

# Classified Y’s
yHat <- nasa$fitted.values > 0.5

# Hit matrix:
# - 16 correctly classified as 0
# - 4 correctly classified as 1
# - 3 incorrectly classified as 0
tab <- table(challenger$fail.field, yHat)
tab
## yHat
## FALSE TRUE
## 0 16 0
## 1 3 4

# Hit ratio (ratio of correct classification)


sum(diag(tab)) / sum(tab)
## [1] 0.8695652

It is important to recall that the hit matrix will always be biased towards unrealistically good classification rates if it is computed on the same sample used for fitting the logistic model. An approach based on data-splitting/cross-validation is therefore needed to estimate the hit matrix unbiasedly, as sketched below.
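A minimal sketch of this data-splitting idea, using the challenger data purely for illustration (with only 23 observations the split estimates are highly variable; the seed and object names are arbitrary):

# Random train/test split (illustration only)
set.seed(1234)
train <- sample(c(TRUE, FALSE), size = nrow(challenger), replace = TRUE)

# Fit on the training part only
nasaTrain <- glm(fail.field ~ temp, family = "binomial",
                 data = challenger[train, ])

# Predict probabilities on the testing part and classify with a 0.5 cutoff
probTest <- predict(nasaTrain, newdata = challenger[!train, ],
                    type = "response")
yHatTest <- probTest > 0.5

# Hit matrix and hit ratio estimated on data not used for fitting
# (if all predictions fall in one class the table is not square; interpret with care)
tabTest <- table(challenger$fail.field[!train], yHatTest)
tabTest
sum(diag(tabTest)) / sum(tabTest)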

For the Boston dataset, do the following:

1. Compute the hit matrix and hit ratio for the regres-
sion I(medv > 25) ~ ..
2. Fit I(medv > 25) ~ . but now using only the first 300
observations of Boston, the training dataset.
3. For the previous model, predict the probability of the
responses and classify them into 0 or 1 in the last 206
observations, the testing dataset.
4. Compute the hit matrix and hit ratio for the new pre-
dictions. Check that the hit ratio is smaller than the
one in the first point.

5.7 Model diagnostics

As was implicit in Section 5.2, generalized linear models are built on some probabilistic assumptions that are required for performing inference on the model parameters β and φ. Unless stated otherwise, ∼ [. . .] denotes (exactly) distributed as [. . .] throughout this section. In general, if we employ the canonical link function, we assume

that the data has been generated from (independence is implicit)

Y |( X1 = x1 , . . . , X p = x p ) ∼ E(η ( x1 , . . . , x p ), φ, a, b, c), (5.30)

in such a way that

µ = E[Y | X1 = x1, . . . , Xp = xp] = g⁻¹(η),

and η ( x1 , . . . , x p ) = β 0 + β 1 x1 + . . . + β p x p .
In the case of the logistic and Poisson regressions, both with
canonical link functions, the general model takes the form (inde-
pendence is implicit)

Y | (X1 = x1, . . . , Xp = xp) ∼ Ber(logistic(η)),    (5.31)
Y | (X1 = x1, . . . , Xp = xp) ∼ Pois(e^η).    (5.32)
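For concreteness, the following minimal sketch simulates a sample from the logistic model (5.31) with a single predictor; the sample size and coefficients are arbitrary choices for illustration.

# Simulate from (5.31) with p = 1 and (arbitrary) beta0 = 0.5, beta1 = 2
set.seed(1)
n <- 200
x <- rnorm(n)
eta <- 0.5 + 2 * x                                    # Linear predictor
y <- rbinom(n, size = 1, prob = 1 / (1 + exp(-eta)))  # Y | X = x ~ Ber(logistic(eta))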

Figure 5.11: The key concepts of the logistic model. The red points represent a sample with population logistic curve y = logistic(β0 + β1 x), shown in black. The blue bars represent the conditional probability mass functions of Y given X = x, whose means lie in the logistic curve.

The assumptions behind (5.30), (5.31), and (5.32) are the following:

i. Linearity in the transformed expectation: g(E[Y | X1 = x1, . . . , Xp = xp]) = β0 + β1 x1 + . . . + βp xp.
ii. Response distribution: Y | X = x ∼ E(η(x), φ, a, b, c) (φ, a, b, c are constant for x).

iii. Independence: Y1 , . . . , Yn are independent, conditionally on


X1 , . . . , Xn .

There are two important points of the linear model assumptions


“missing” here:

• Where is homoscedasticity? Homoscedasticity is specific to certain exponential family distributions for which θ does not affect the variance. This is not the case for Bernoulli or Poisson distributed variables, which result in heteroskedastic models. Also, homoscedasticity is a consequence of assumption ii in the case of the Normal distribution.
• Where are the errors? The errors are not fundamental for build-
ing the linear model, but just a helpful concept related to least
squares. The linear model can be constructed “without errors”
using (2.8).

Recall that:

• Nothing is said about the distribution of X1, . . . , Xp. They could be deterministic or random. They could be discrete or continuous.
• X1, . . . , Xp are not required to be independent of each other.

Checking the assumptions of a generalized linear model is more complicated than what we did in Section 3.5. The reason is the heterogeneity and heteroskedasticity of the responses, which make the inspection of the residuals Yi − Ŷi complicated. The first step is to construct some residuals ε̂i that are simpler to analyze.
The deviance residuals are the generalization of the residuals ε̂i = Yi − Ŷi from the linear model. They are constructed using the analogy between the deviance and the RSS seen in (5.27). The deviance can be expressed as a sum of terms associated with each datum (recall, e.g., (5.26)):

D = ∑_{i=1}^n di.

For the linear model, di = ε̂i², since D = RSS(β̂). Based on this, we can define the deviance residuals as

ε̂ᴰᵢ := sign(Yi − µ̂i) √di,   i = 1, . . . , n,

and have a generalization of ε̂i for generalized linear models. This definition has interesting distributional consequences. From (5.28), we know that, asymptotically, D* ∼ χ²_{n−p−1}. This suggests that

ε̂ᴰᵢ are approximately normal.    (5.33)

The previous statement is of variable accuracy, depending on the model, sample size, and distribution of the predictors17. In the linear model, it is exact and (ε̂ᴰ₁, . . . , ε̂ᴰₙ) are distributed exactly as a Nn(0, σ²(In − X(X′X)⁻¹X′)).

The deviance residuals are key for the diagnostics of generalized linear models. Whenever we refer to “residuals”, we understand that we refer to the deviance residuals (since several definitions of residuals are possible). They are also the residuals returned by default in R by residuals(glm) (beware that glm$residuals stores the working residuals of the IRLS algorithm instead).

This generalization has interesting connections:

• If the canonical link function is employed, then ∑_{i=1}^n ε̂ᴰᵢ = 0, as in the linear model.
• The estimate of the scale parameter can be seen as φ̂D = ∑_{i=1}^n (ε̂ᴰᵢ)² / (n − p − 1), which is perfectly coherent with σ̂² in the linear model.
• Therefore, φ̂D is the sample variance of ε̂ᴰ₁, . . . , ε̂ᴰₙ, which suggests that φ is the asymptotic variance of the population deviance residuals; in other words, Var[εᴰ] ≈ φ.
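The following minimal sketch illustrates the last two points with a Gaussian glm (for which the deviance residuals are just the ordinary residuals); the model formula is an arbitrary choice over the Boston dataset loaded above.

# Gaussian glm: deviance residuals coincide with the ordinary residuals
modGauss <- glm(medv ~ lstat, data = Boston, family = gaussian())
devRes <- residuals(modGauss, type = "deviance")
all.equal(unname(devRes), unname(Boston$medv - fitted(modGauss))) # TRUE

# phi estimated from the deviance residuals vs. the dispersion given by summary()
sum(devRes^2) / df.residual(modGauss)
summary(modGauss)$dispersion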

When one assumption fails, it is likely that this failure will affect other assumptions. For example, if linearity fails, then most likely the response distribution will also fail. The key point is to identify the root cause of the failure of the assumptions in order to try to find a patch.

The script used for generating the following Figures 5.12–5.23 is


available here.

5.7.1 Linearity
Linearity between the transformed expectation of Y and the predictors
X1 , . . . , X p is the building block of generalized linear models. If
this assumption fails, then all the conclusions we might extract
from the analysis are suspected to be flawed. Therefore it is a key
assumption.
How to check it
We use the residuals vs. fitted values plot, which for generalized
linear models is the scatterplot of (η̂i , ε̂ Di ), i = 1, . . . , n. Recall that it
is not the scatterplot of (Ŷi , ε̂ i ). Under linearity, we expect that there
is no trend in the residuals ε̂ D i with respect to η̂i . If nonlinearities
are observed, it is worth plotting the regression terms of the model
via termplot.
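A minimal sketch of both plots, using the nasa1 model fitted above (the dashed zero line is just a visual reference):

# Deviance residuals vs. the linear predictor (eta-hat)
plot(predict(nasa1, type = "link"), residuals(nasa1, type = "deviance"),
     xlab = "Linear predictor", ylab = "Deviance residuals")
abline(h = 0, lty = 2)

# Regression terms with partial residuals
termplot(nasa1, partial.resid = TRUE)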

Figure 5.12: Residuals vs. fitted values plots (first row) for datasets (second row) respecting the linearity assumption in Poisson regression.

Figure 5.13: Residuals vs. fitted values plots (first row) for datasets (second row) violating the linearity assumption in Poisson regression.

Figure 5.14: Residuals vs. fitted values plots (first row) for datasets (second row) respecting the linearity assumption in logistic regression.

Figure 5.15: Residuals vs. fitted values plots (first row) for datasets (second row) violating the linearity assumption in logistic regression.

What to do if fails
Using an adequate nonlinear transformation for the problematic
predictors or adding interaction terms might be helpful. Alterna-
tively, considering a nonlinear transformation f for the response Y
might also be helpful.

5.7.2 Response distribution


The approximate normality of the deviance residuals allows us to evaluate how well satisfied the assumption on the response distribution is. The good news is that we can do so without relying on ad-hoc tools for each distribution18. The bad news is that we have to pay an important price in terms of inexactness, since we employ an asymptotic distribution. The speed of this asymptotic convergence and the effective validity of (5.33) largely depend on several aspects (distribution of the response, sample size, distribution of the predictors).

18 This will constitute the rigorous approach, but it is notably more complex.
How to check it
The QQ-plot allows us to check if the standardized residuals follow a N(0, 1). Under the correct distribution of the response, we expect the points to align with the diagonal line. It is usual to have departures from the diagonal in the extremes, rather than in the center, even under normality, although these departures are more evident if the data is non-normal. Unfortunately, it is also possible to have severe departures from normality even if the model is perfectly correct (see below). The reason is simply that the deviance residuals are significantly non-normal, which happens often in logistic regression.
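A minimal sketch of this check, again with the nasa1 model:

# QQ-plot of the deviance residuals against the N(0, 1) quantiles
devResNasa <- residuals(nasa1, type = "deviance")
qqnorm(devResNasa)
qqline(devResNasa)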

Figure 5.16: QQ-plots for the deviance residuals (first row) for datasets (second row) respecting the response distribution assumption for Poisson regression.

Figure 5.17: QQ-plots for the deviance residuals (first row) for datasets (second row) violating the response distribution assumption for Poisson regression.

Figure 5.18: QQ-plots for the deviance residuals (first row) for datasets (second row) respecting the response distribution assumption for logistic regression.

Figure 5.19: QQ-plots for the deviance residuals (first row) for datasets (second row) violating the response distribution assumption for logistic regression.

What to do if fails
Patching the distribution assumption is not easy and requires
the consideration of more flexible models. One possibility is to
transform Y by means of one of the transformations discussed in
Section 3.5.2, of course at the price of modelling the transformed
response rather than Y.

5.7.3 Independence
Independence is also a key assumption: it guarantees that the
amount of information that we have on the relationship between
Y and X1 , . . . , X p with n observations is maximal.
How to check it
The presence of autocorrelation in the residuals can be examined by means of a serial plot of the residuals. Under uncorrelation, we expect the series to show no tracking of the residuals; tracking is a sign of positive serial correlation. Negative serial correlation can be identified in the form of a small-large or positive-negative systematic alternation of the residuals. This can be explored better with lag.plot, as seen in Section 3.5.4.
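A minimal sketch of these plots for the nasa1 model (assuming the rows of challenger follow the temporal order of the flights):

# Serial plot of the deviance residuals (index = observation order)
plot(residuals(nasa1, type = "deviance"), type = "o",
     ylab = "Deviance residuals")

# Lag plot: each residual against its first lag
lag.plot(residuals(nasa1, type = "deviance"), lags = 1, do.lines = FALSE)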

Figure 5.20: Serial plots of the residuals (first row) for datasets (second row) respecting the independence assumption for Poisson regression.

Figure 5.21: Serial plots of the residuals (first row) for datasets (second row) violating the independence assumption for Poisson regression.

Figure 5.22: Serial plots of the residuals (first row) for datasets (second row) respecting the independence assumption for logistic regression.

Figure 5.23: Serial plots of the residuals (first row) for datasets (second row) violating the independence assumption for logistic regression.

What to do if fails
As in the linear model, little can be done if there is dependence
in the data, once this has been collected. If serial dependence is
present, a differentiation of the response may lead to independent
observations.

5.7.4 Multicollinearity
Multicollinearity can also be present in generalized linear models.
Despite the nonlinear effect of the predictors on the response, the
predictors are combined linearly in (5.4). Due to this, if two or
more predictors are highly correlated with each other, the fit of the
model will be compromised since the individual linear effect of
each predictor is hard to distinguish from the rest of correlated
predictors.

Then, a useful way of detecting multicollinearity is to inspect the VIF of each coefficient. The situation is exactly the same as in linear regression, since the VIF only looks at the linear relations among the predictors. Therefore, the rule of thumb is the same as in Section 3.5.5:
• VIF close to 1: absence of multicollinearity.
• VIF larger than 5 or 10: problematic amount of multicollinearity. It is advised to remove the predictor with the largest VIF.
# Create predictors with multicollinearity: x4 depends on the rest
set.seed(45678)
x1 <- rnorm(100)
x2 <- 0.5 * x1 + rnorm(100)
x3 <- 0.5 * x2 + rnorm(100)
x4 <- -x1 + x2 + rnorm(100, sd = 0.25)

# Response
z <- 1 + 0.5 * x1 + 2 * x2 - 3 * x3 - x4
y <- rbinom(n = 100, size = 1, prob = 1/(1 + exp(-z)))
data <- data.frame(x1 = x1, x2 = x2, x3 = x3, x4 = x4, y = y)

# Correlations -- none seems suspicious


cor(data)
## x1 x2 x3 x4 y
## x1 1.0000000 0.38254782 0.2142011 -0.5261464 0.20198825
## x2 0.3825478 1.00000000 0.5167341 0.5673174 0.07456324
## x3 0.2142011 0.51673408 1.0000000 0.2500123 -0.49853746
## x4 -0.5261464 0.56731738 0.2500123 1.0000000 -0.11188657
## y 0.2019882 0.07456324 -0.4985375 -0.1118866 1.00000000

# Abnormal generalized variance inflation factors: largest for x4, we remove it


modMultiCo <- glm(y ~ x1 + x2 + x3 + x4, family = "binomial")
car::vif(modMultiCo)
## x1 x2 x3 x4
## 27.84756 36.66514 4.94499 36.78817

# Without x4
modClean <- glm(y ~ x1 + x2 + x3, family = "binomial")

# Comparison
summary(modMultiCo)
##
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4743 -0.3796 0.1129 0.4052 2.3887
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2527 0.4008 3.125 0.00178 **
## x1 -3.4269 1.8225 -1.880 0.06007 .
## x2 6.9627 2.1937 3.174 0.00150 **
## x3 -4.3688 0.9312 -4.691 2.71e-06 ***
## x4 -5.0047 1.9440 -2.574 0.01004 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132.81 on 99 degrees of freedom
## Residual deviance: 59.76 on 95 degrees of freedom
## AIC: 69.76
##
## Number of Fisher Scoring iterations: 7

summary(modClean)
##
## Call:
## glm(formula = y ~ x1 + x2 + x3, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0952 -0.4144 0.1839 0.4762 2.5736
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9237 0.3221 2.868 0.004133 **
## x1 1.2803 0.4235 3.023 0.002502 **
## x2 1.7946 0.5290 3.392 0.000693 ***
## x3 -3.4838 0.7491 -4.651 3.31e-06 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132.813 on 99 degrees of freedom
## Residual deviance: 68.028 on 96 degrees of freedom
## AIC: 76.028
##
## Number of Fisher Scoring iterations: 6

# Generalized variance inflation factors normal


car::vif(modClean)
## x1 x2 x3
## 1.674300 2.724351 3.743940

Performing PCA on the predictors, as seen in Section 3.6.2, is a possibility to achieve uncorrelation and can be employed straightforwardly in generalized linear models. The situation is different for PLS, since it makes use of the linear structure between the response and the predictors and thus is not immediately adaptable to generalized linear models.
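As a minimal sketch of the first idea, a logistic regression can be fitted on the principal component scores of the simulated predictors from the example above; retaining three components is an arbitrary choice for illustration.

# PCA on the standardized predictors, then logistic regression on the scores
pca <- prcomp(data[, c("x1", "x2", "x3", "x4")], scale. = TRUE)
scores <- data.frame(y = data$y, pca$x[, 1:3])
modPCA <- glm(y ~ PC1 + PC2 + PC3, family = "binomial", data = scores)
car::vif(modPCA) # All close to 1: the scores are uncorrelated by construction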

5.8 Shrinkage

Enforcing sparsity in generalized linear models can be done as it


was done with linear models. Ridge regression and Lasso can be
generalized with glmnet with little differences in practice.
What we want is to bias the estimates of β towards being non-null only in the most important relations between the response and the predictors. To achieve that, we add a penalization term to the maximum likelihood estimation of β19:

−ℓ(β) + λ ∑_{j=1}^p (α|βj| + (1 − α)|βj|²).    (5.34)

19 We are minimizing the negative log-likelihood ℓ(β).

As in Section 4.1, ridge regression corresponds to α = 0 (quadratic penalty) and lasso to α = 1 (linear penalty). Obviously, if λ = 0, we are back to the generalized linear models theory. The optimization of (5.34) gives

β̂λ,α := arg min_{β ∈ R^{p+1}} { −ℓ(β) + λ ∑_{j=1}^p (α|βj| + (1 − α)|βj|²) }.    (5.35)

Note that the sparsity is enforced in the slopes, not in the intercept,
and that the link function g is not affecting the penalization term.
As in linear models, the predictors need to be standardized if they
are of different nature.
We illustrate the shrinkage in generalized linear models with the ISLR::Hitters dataset, where now the objective will be to predict NewLeague, a factor with levels A (standing for American League) and N (standing for National League). The variable indicates the player's league at the end of 1986. The predictors employed are his statistics during 1986, and the objective is to see whether there is some distinctive pattern between the players in both leagues.

# Load data
data(Hitters, package = "ISLR")

# Include only predictors related with 1986 season and discard NA's
Hitters <- subset(Hitters, select = c(League, AtBat, Hits, HmRun, Runs, RBI,
                                      Walks, Division, PutOuts, Assists,
                                      Errors))
Hitters <- na.omit(Hitters)

# Response and predictors
y <- Hitters$League
x <- model.matrix(League ~ ., data = Hitters)[, -1]

After preparing the data, we perform the regressions.

# Ridge and lasso regressions
library(glmnet)
ridgeMod <- glmnet(x = x, y = y, alpha = 0, family = "binomial")
lassoMod <- glmnet(x = x, y = y, alpha = 1, family = "binomial")

# Solution paths versus lambda
plot(ridgeMod, label = TRUE, xvar = "lambda")
plot(lassoMod, label = TRUE, xvar = "lambda")

# Versus the percentage of deviance explained
plot(ridgeMod, label = TRUE, xvar = "dev")
plot(lassoMod, label = TRUE, xvar = "dev")

# The percentage of deviance explained only goes up to 0.05. There are no
# clear patterns indicating player differences between both leagues

# Let's select the predictors to be included with a 10-fold cross-validation
set.seed(12345)
kcvLasso <- cv.glmnet(x = x, y = y, alpha = 1, nfolds = 10, family = "binomial")
plot(kcvLasso)



# The lambda that minimises the CV error and "one standard error rule"'s lambda
kcvLasso$lambda.min
## [1] 0.01039048
kcvLasso$lambda.1se
## [1] 0.08829343

# Leave-one-out cross-validation -- similar result
ncvLasso <- cv.glmnet(x = x, y = y, alpha = 1, nfolds = nrow(Hitters),
                      family = "binomial")
plot(ncvLasso)
ncvLasso$lambda.min
## [1] 0.007860015
ncvLasso$lambda.1se
## [1] 0.07330276

# Model selected
predict(ncvLasso, type = "coefficients", s = ncvLasso$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) -0.099447861
## AtBat        .
## Hits         .
## HmRun       -0.006971231
## Runs         .
## RBI          .
## Walks        .
## DivisionW    .
## PutOuts      .
## Assists      .
## Errors       .

HmRun is selected by leave-one-out cross-validation as the only predictor to be included in the lasso regression. We know that the
model is not good due to the percentage of deviance explained.
However, we still want to know whether HmRun has any signifi-
cance at all. When addressing this, we have to take into account
Appendix A.5 to avoid spurious findings.

# Analyse the selected model


fit <- glm(League ~ HmRun, data = Hitters, family = "binomial")
summary(fit)
##
## Call:
## glm(formula = League ~ HmRun, family = "binomial", data = Hitters)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2976 -1.1320 -0.8106 1.1686 1.6440
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.27826 0.18086 1.539 0.12392
## HmRun -0.04290 0.01371 -3.130 0.00175 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 443.95 on 321 degrees of freedom
## Residual deviance: 433.57 on 320 degrees of freedom
## AIC: 437.57
##
## Number of Fisher Scoring iterations: 4

# HmRun is significant -- but it may be spurious due to the model selection


# procedure (see Appendix A.5)

# Let’s split the dataset in two, do model-selection in one part and then
# inference on the selected model in the other, to have an idea of the real
# significance of HmRun
set.seed(123456)
train <- sample(c(FALSE, TRUE), size = nrow(Hitters), replace = TRUE)

# Model selection in training part


ncvLasso <- cv.glmnet(x = x[train, ], y = y[train], alpha = 1,
nfolds = sum(train), family = "binomial")
predict(ncvLasso, type = "coefficients", s = ncvLasso$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.3856625
## AtBat .
## Hits .
## HmRun .
## Runs .
## RBI .
## Walks .
## DivisionW .
## PutOuts .
## Assists .
## Errors .

# Inference in testing part


summary(glm(League ~ HmRun, data = Hitters[!train, ], family = "binomial"))
##
## Call:
## glm(formula = League ~ HmRun, family = "binomial", data = Hitters[!train,
## ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3768 -1.2326 0.9903 1.0985 1.5203
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.45749 0.25402 1.801 0.0717 .
## HmRun -0.03984 0.01932 -2.062 0.0392 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 213.39 on 153 degrees of freedom
## Residual deviance: 208.96 on 152 degrees of freedom
## AIC: 212.96
##
## Number of Fisher Scoring iterations: 4
# HmRun is now not significant...
# We can repeat the analysis for different partitions of the data and we will
# obtain weak significances. Therefore, we can conclude that this is a spurious
# finding and that HmRun is not significant as a single predictor

# Prediction (obviously not trustable, but for illustration)


pred <- predict(ncvLasso, newx = x[!train, ], type = "response",
s = ncvLasso$lambda.1se)

# Hit matrix
H <- table(pred > 0.5, y[!train] == "A") # ("A" was the reference level)
H
##
## FALSE TRUE
## FALSE 79 75
sum(diag(H)) / sum(H) # Worse than tossing a coin!

## [1] 0.512987

Perform an adequate statistical analysis based on shrinkage of a generalized linear model to answer the following questions:

• What (if any) are the leading factors among the features of a player in season 1986 in order to be in the top 10% of best-paid players in season 1987?
• What (if any) are the player features in season 1986 influencing the number of home runs in the same season? And during his career?

Hint: you may use the analysis shown in the section as a template.

5.9 Big data considerations

As we saw in Section 5.2.2, fitting a generalized linear model in-


volves fitting a series of linear models. Therefore, all the memory
problems that appeared in Section 4.4 are inherited. Worse, compu-
tation is now more complicated because:

1. Computing the likelihood requires reading all the data at once. Unlike in the linear model, updating the model with a new chunk implies re-fitting with all the data due to the nonlinearity of the likelihood.
2. The IRLS algorithm requires reading the data as many times as
iterations.

These two peculiarities are a game-changer for the approach followed in Section 4.4: biglm::bigglm needs to have access to the full data while performing the fitting. This can be cumbersome. Fortunately, a neat solution is available using the ff and ffbase packages, which allow for efficiently working with data stored in disk that behaves (almost) as if it were in RAM20. The function that we will employ is ffbase::bigglm.ffdf, and it requires an object of the class ffdf (ff's data frames).

20 The ff package implements the ff vectors and ffdf data frames classes. The package ffbase provides convenience functions for working with these non-standard classes in a more transparent way.

# Not really "big data", but for the sake of illustration
set.seed(12345)
n <- 1e6
p <- 10
beta <- seq(-1, 1, length.out = p)^5
x1 <- matrix(rnorm(n * p), nrow = n, ncol = p)
x1[, p] <- 2 * x1[, 1] + rnorm(n, sd = 0.1) # Add some dependence to predictors
x1[, p - 1] <- 2 - x1[, 2] + rnorm(n, sd = 0.5)
y1 <- rbinom(n, size = 1, prob = 1 / (1 + exp(-(1 + x1 %*% beta))))
x2 <- matrix(rnorm(100 * p), nrow = 100, ncol = p)
y2 <- rbinom(100, size = 1, prob = 1 / (1 + exp(-(1 + x2 %*% beta))))
bigData1 <- data.frame("resp" = y1, "pred" = x1)
bigData2 <- data.frame("resp" = y2, "pred" = x2)

# Save files to disk to emulate the situation with big data



write.csv(x = bigData1, file = "bigData1.csv", row.names = FALSE)


write.csv(x = bigData2, file = "bigData2.csv", row.names = FALSE)

# Read files using ff


library(ffbase) # Imports ff
bigData1ff <- read.table.ffdf(file = "bigData1.csv", header = TRUE, sep = ",")
bigData2ff <- read.table.ffdf(file = "bigData2.csv", header = TRUE, sep = ",")

# Recall: bigData1.csv is not copied into RAM


print(object.size(bigData1), units = "MiB")
## 80.1 MiB
print(object.size(bigData1ff), units = "KiB")
## 41.2 KiB

# Logistic regression
# Same comments for the formula framework -- this is the hack for automatic
# inclusion of all the predictors
library(biglm)
f <- formula(paste("resp ~", paste(names(bigData1)[-1], collapse = " + ")))
bigglmMod <- bigglm.ffdf(formula = f, data = bigData1ff, family = binomial())

# glm’s call
glmMod <- glm(formula = resp ~ ., data = bigData1, family = binomial())

# Compare sizes
print(object.size(bigglmMod), units = "KiB")
## 178.4 KiB
print(object.size(glmMod), units = "MiB")
## 732.6 MiB

# Summaries
s1 <- summary(bigglmMod)
s2 <- summary(glmMod)
s1
## Large data regression model: bigglm(formula = f, data = bigData1ff, family = binomial())
## Sample size = 1e+06
## Coef (95% CI) SE p
## (Intercept) 1.0177 0.9960 1.0394 0.0109 0.0000
## pred.1 -0.9869 -1.0931 -0.8808 0.0531 0.0000
## pred.2 -0.2927 -0.3046 -0.2808 0.0059 0.0000
## pred.3 -0.0481 -0.0534 -0.0428 0.0027 0.0000
## pred.4 -0.0029 -0.0082 0.0024 0.0026 0.2779
## pred.5 -0.0015 -0.0068 0.0038 0.0027 0.5708
## pred.6 -0.0022 -0.0075 0.0031 0.0026 0.4162
## pred.7 0.0018 -0.0035 0.0071 0.0027 0.4876
## pred.8 0.0582 0.0529 0.0635 0.0027 0.0000
## pred.9 0.2747 0.2640 0.2853 0.0053 0.0000
## pred.10 0.9923 0.9392 1.0454 0.0265 0.0000
s2
##
## Call:
## glm(formula = resp ~ ., family = binomial(), data = bigData1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4234 0.2112 0.4591 0.6926 2.5853
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.017719 0.010856 93.744 <2e-16 ***
## pred.1 -0.986934 0.053075 -18.595 <2e-16 ***
## pred.2 -0.292675 0.005947 -49.214 <2e-16 ***
## pred.3 -0.048077 0.002653 -18.120 <2e-16 ***
## pred.4 -0.002875 0.002650 -1.085 0.278
## pred.5 -0.001503 0.002650 -0.567 0.571
## pred.6 -0.002154 0.002650 -0.813 0.416
## pred.7 0.001840 0.002651 0.694 0.488
## pred.8 0.058182 0.002653 21.927 <2e-16 ***

## pred.9 0.274651 0.005320 51.629 <2e-16 ***


## pred.10 0.992299 0.026541 37.388 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1059853 on 999999 degrees of freedom
## Residual deviance: 885895 on 999989 degrees of freedom
## AIC: 885917
##
## Number of Fisher Scoring iterations: 5

# Further information
s1$mat # Coefficients and their inferences
## Coef (95% CI) SE p
## (Intercept) 1.017718995 0.996006224 1.039431766 0.010856386 0.000000e+00
## pred.1 -0.986933994 -1.093083429 -0.880784558 0.053074718 3.515228e-77
## pred.2 -0.292674675 -0.304568670 -0.280780679 0.005946998 0.000000e+00
## pred.3 -0.048076842 -0.053383444 -0.042770241 0.002653301 2.230685e-73
## pred.4 -0.002875381 -0.008174848 0.002424085 0.002649733 2.778513e-01
## pred.5 -0.001502598 -0.006803348 0.003798152 0.002650375 5.707564e-01
## pred.6 -0.002154396 -0.007453619 0.003144826 0.002649611 4.161613e-01
## pred.7 0.001839965 -0.003462032 0.007141962 0.002650999 4.876415e-01
## pred.8 0.058182131 0.052875269 0.063488994 0.002653431 1.431803e-106
## pred.9 0.274650557 0.264011111 0.285290003 0.005319723 0.000000e+00
## pred.10 0.992299439 0.939217602 1.045381276 0.026540919 6.230296e-306
s1$rsq # R^2
## [1] 0.2175508
s1$nullrss # Null deviance
## [1] 1132208

# Extract coefficients
coef(bigglmMod)
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8
## 1.017718995 -0.986933994 -0.292674675 -0.048076842 -0.002875381 -0.001502598 -0.002154396 0.001839965 0.058182131
## pred.9 pred.10
## 0.274650557 0.992299439

# Prediction works as usual


predict(bigglmMod, newdata = bigData2[1:5, ], type = "response")
## [,1]
## 1 0.9603955
## 2 0.7434756
## 3 0.6632871
## 4 0.6188387
## 5 0.6418678
# predict(bigglmMod, newdata = bigData2[1:5, -1]) # Error

# Update the model with training data


update(bigglmMod, moredata = bigData2)
## Large data regression model: bigglm(formula = f, data = bigData1ff, family = binomial())
## Sample size = 1000100

# AIC and BIC


AIC(bigglmMod, k = 2)
## [1] 885917.2
AIC(bigglmMod, k = log(n))
## [1] 886047.2

# Delete the csv files in disk


file.remove(c("bigData1.csv", "bigData2.csv"))
## [1] TRUE TRUE

Note that this is also a perfectly valid approach for linear models; we just need to specify family = gaussian() in the call to bigglm.ffdf.

Model selection of biglm::bigglm models is not so straightfor-


ward. The trick that leaps::regsubsets employs for simplifying
the model search in linear models (see Section 4.4) does not apply
for generalized linear models because of the non-linearity of the
likelihood. However, there is a simple and useful hack: we can do
best subset selection in the linear model associated to the last iter-
ation of the IRLS algorithm and then refine the search by computing the exact BIC/AIC from a set of candidate models21. If we do so, we translate the model selection problem back to the linear case, plus an extra overhead of fitting several generalized linear models. Keep in mind that, albeit useful, this approach is an approximation to the task of finding the best subset of predictors.

21 Without actually expanding that list, as coming out with this list of candidate models is the most expensive part in best subset selection.

# Model selection adapted to big data generalized linear models


reg <- leaps::regsubsets(bigglmMod, nvmax = p + 1, method = "exhaustive")
# This takes the QR decomposition, which encodes the linear model associated to
# the last iteration of the IRLS algorithm. However, the reported BICs are *not*
# the true BICs of the generalized linear models, but a sufficient
# approximation to obtain a list of candidate models in a fast way

# Get the model with lowest BIC


plot(reg)

subs <- summary(reg)
subs$which
##    (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9 pred.10
## 1         TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE    TRUE
## 2         TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE    TRUE
## 3         TRUE  FALSE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE    TRUE
## 4         TRUE  FALSE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE    TRUE
## 5         TRUE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE    TRUE
## 6         TRUE   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE    TRUE
## 7         TRUE   TRUE   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE   TRUE   TRUE    TRUE
## 8         TRUE   TRUE   TRUE   TRUE   TRUE  FALSE   TRUE  FALSE   TRUE   TRUE    TRUE
## 9         TRUE   TRUE   TRUE   TRUE   TRUE  FALSE   TRUE   TRUE   TRUE   TRUE    TRUE
## 10        TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE    TRUE
subs$bic
## [1] -79219.27 -118848.59 -121238.63 -121703.97 -122033.75 -122347.61 -122334.97 -122321.82 -122308.48 -122294.99
subs$which[which.min(subs$bic), ]
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9
## TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## pred.10
## TRUE

# Let's compute the true BICs for the p models. This implies fitting p bigglm's
bestModels <- list()
for (i in 1:nrow(subs$which)) {
  f <- formula(paste("resp ~", paste(names(which(subs$which[i, -1])),
                                     collapse = " + ")))
  bestModels[[i]] <- bigglm.ffdf(formula = f, data = bigData1ff,
                                 family = binomial(), maxit = 20)
  # Did not converge with the default iteration limit, maxit = 8
}

# The approximate BICs and the true BICs are very similar (in this example)
exactBICs <- sapply(bestModels, AIC, k = log(n))
plot(subs$bic, exactBICs, type = "o", xlab = "Exact", ylab = "Approximate")
cor(subs$bic, exactBICs, method = "pearson") # Correlation
## [1] 0.9999708

# Both give the same model selection and same order
subs$which[which.min(subs$bic), ] # Approximate
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9
##        TRUE   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE
##     pred.10
##        TRUE
subs$which[which.min(exactBICs), ] # Exact
## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9
##        TRUE   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE
##     pred.10
##        TRUE
cor(subs$bic, exactBICs, method = "spearman") # Order correlation
## [1] 1
6 Nonparametric regression

The models we saw in the previous chapters share a common root:


all of them are parametric. This means that they assume a certain
structure on the regression function m, which is controlled by parameters1. If this assumption truly holds, then parametric methods are the best approach for estimating m. But in practice it is rarely the case where parametric methods work out-of-the-box, and several tricks are needed in order to expand their degree of flexibility on a case-by-case basis. Avoiding this nuisance is the strongest point of nonparametric methods: they do not assume major hard-to-satisfy hypotheses on the regression function, but just minimal assumptions, which makes them directly employable. Their weak points are that they usually are more computationally demanding and harder to interpret.

1 For example, linear models assume that m is of the form m(x) = β0 + β1 x1 + . . . + βp xp for some unknown coefficients β.

We consider first the simplest situation2: a single continuous predictor X for predicting a response Y. In this case, recall that the complete knowledge of Y when X = x is given by the conditional pdf fY|X=x(y) = f(x, y)/fX(x). While this pdf provides full knowledge about Y | X = x, it is also a challenging task to estimate it: for each x we have to estimate a different curve! A simpler approach, yet still challenging, is to estimate the conditional mean (a scalar) for each x through the regression function

m(x) = E[Y | X = x] = ∫ y fY|X=x(y) dy.

2 For the sake of introducing the main concepts; in Section 6.3 we will see the full general situation.

As we will see, this density-based view of the regression function is


useful in order to motivate estimators.

6.1 Nonparametric density estimation

In order to introduce a nonparametric estimator for the regression


function m, we need to introduce first a nonparametric estimator
for the density of the predictor X. This estimator aims to estimate f, the density of X, from a sample X1, . . . , Xn and without assuming any specific form for f. That is, without assuming, for example, that the data is normally distributed.

6.1.1 Histogram and moving histogram


The simplest method to estimate a density f from an iid sample
X1 , . . . , Xn is the histogram. From an analytical point of view, the
idea is to aggregate the data in intervals of the form [ x0 , x0 + h)
and then use their relative frequency to approximate the density at
x ∈ [ x0 , x0 + h), f ( x ), by the estimate of

f(x0) = F′(x0) = lim_{h→0⁺} [F(x0 + h) − F(x0)] / h = lim_{h→0⁺} P[x0 < X < x0 + h] / h.
More precisely, given an origin t0 and a bandwidth h > 0, the histogram builds a piecewise constant function in the intervals {Bk := [tk, tk+1) : tk = t0 + hk, k ∈ Z} by counting the number of sample points inside each of them. These constant-length intervals are also denoted bins. The fact that they are of constant length h is important, since it allows to standardize by h in order to have relative frequencies per length3 in the bins. The histogram at a point x is defined as

f̂H(x; t0, h) := 1/(nh) ∑_{i=1}^n 1_{Xi ∈ Bk : x ∈ Bk}.    (6.1)

3 Recall that with this standardization we approach the probability density concept.

Equivalently, if we denote the number of points in Bk as vk, then the histogram is f̂H(x; t0, h) = vk/(nh) if x ∈ Bk for a k ∈ Z.
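A minimal sketch of (6.1) coded directly (the function name and the bin-locating trick via floor() are just illustrative choices):

# Histogram density estimate at points x, from data, with origin t0 and bandwidth h
histDens <- function(x, data, t0, h) {
  kx <- floor((x - t0) / h)       # Bin index of each evaluation point
  kData <- floor((data - t0) / h) # Bin index of each observation
  sapply(kx, function(k) sum(kData == k)) / (length(data) * h) # vk / (nh)
}
histDens(x = c(2, 3.5, 4.2), data = faithful$eruptions, t0 = 1.5, h = 0.5)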
The computation of histograms is straightforward in R. As an example, we consider the faithful dataset, which contains the duration of the eruption and the waiting time between eruptions for the Old Faithful geyser in Yellowstone National Park (USA).


# Duration of eruption
faithE <- faithful$eruptions

# Default histogram: automatically chooses bins and uses absolute frequencies
histo <- hist(faithE)

# Bins and bin counts
histo$breaks # Bk's
## [1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
histo$counts # vk's
## [1] 55 37  5  9 34 75 54  3

# With relative frequencies
hist(faithE, probability = TRUE)

# Choosing the breaks
t0 <- min(faithE)
h <- 0.25
Bk <- seq(t0, max(faithE), by = h)
hist(faithE, probability = TRUE, breaks = Bk)
rug(faithE) # The sample

Recall that the shape of the histogram depends on:



• t0, since the separation between bins happens at t0 + hk, k ∈ Z;
• h, which controls the bin size and the effective number of bins for aggregating the sample.

We focus first on exploring the dependence on t0, as it serves to motivate the next density estimator, with the next example.

# Uniform sample
set.seed(1234567)
u <- runif(n = 100)

# t0 = 0, h = 0.2
Bk1 <- seq(0, 1, by = 0.2)

# t0 = -0.1, h = 0.2
Bk2 <- seq(-0.1, 1.1, by = 0.2)

# Comparison
par(mfrow = 1:2)
hist(u, probability = TRUE, breaks = Bk1, ylim = c(0, 1.5),
main = "t0 = 0, h = 0.2")
rug(u)
abline(h = 1, col = 2)
hist(u, probability = TRUE, breaks = Bk2, ylim = c(0, 1.5),
main = "t0 = -0.1, h = 0.2")
rug(u)
abline(h = 1, col = 2)

Figure 6.1: The dependence of the histogram on the origin t0.

Clearly, this dependence is undesirable, as it is prone to change


notably the estimation of f using the same data. An alternative
to avoid the dependence on t0 is the moving histogram or naive
density estimator. The idea is to aggregate the sample X1 , . . . , Xn
in intervals of the form ( x − h, x + h) and then use its relative fre-
quency in ( x − h, x + h) to approximate the density at x, which can
be written as

f(x) = F′(x) = lim_{h→0⁺} [F(x + h) − F(x − h)] / (2h) = lim_{h→0⁺} P[x − h < X < x + h] / (2h).

Recall the differences with the histogram: the inter-


vals depend on the evaluation point x and are centred
around it. That allows to directly estimate f ( x ) (with-
out the proxy f ( x0 )) by an estimate of the symmetric
derivative.

Given a bandwidth h > 0, the naive density estimator builds a piecewise constant function by considering the relative frequency of X1, . . . , Xn inside (x − h, x + h)4:

f̂N(x; h) := 1/(2nh) ∑_{i=1}^n 1{x − h < Xi < x + h}.    (6.2)

4 Note that the function has 2n discontinuities that are located at Xi ± h.

The analysis of f̂N(x; h) as a random variable follows from realizing that

∑_{i=1}^n 1{x − h < Xi < x + h} ∼ B(n, px,h),

where

px,h := P[x − h < X < x + h] = F(x + h) − F(x − h).

Therefore, employing the bias and variance expressions of a binomial5, it follows:

E[f̂N(x; h)] = [F(x + h) − F(x − h)] / (2h),
Var[f̂N(x; h)] = [F(x + h) − F(x − h)] / (4nh²) − [F(x + h) − F(x − h)]² / (4nh²).

5 E[B(n, p)] = np and Var[B(n, p)] = np(1 − p).
These two results provide very interesting insights on the effect of h
on the moving histogram:

1. If h → 0, then E[f̂N(x; h)] → f(x) and (6.2) is an asymptotically unbiased estimator of f(x). However, if h → 0, the variance explodes: Var[f̂N(x; h)] ≈ f(x)/(2nh) − f(x)²/n → ∞.
2. If h → ∞, then both E[f̂N(x; h)] → 0 and Var[f̂N(x; h)] → 0. Therefore, the variance shrinks to zero but the bias grows.
3. If nh → ∞6, then the variance shrinks to zero. If, in addition, h → 0, the bias also shrinks to zero. So both the bias and the variance are reduced if n → ∞, h → 0, and nh → ∞, simultaneously.

6 Or, in other words, if h⁻¹ grows slower than n.
ance are reduced if n → ∞, h → 0, and nh → ∞, simultaneously.

The animation in Figure 6.2 illustrates the previous points and


gives insight on how the performance of (6.2) varies smoothly with
h.
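A minimal sketch of (6.2) coded directly (the sample and bandwidth are arbitrary choices):

# Naive density estimator (moving histogram) evaluated at the points x
fN <- function(x, data, h) {
  sapply(x, function(x0) mean(abs(data - x0) < h)) / (2 * h)
}

# Compare with the true density of a simulated N(0, 1) sample
set.seed(42)
samp <- rnorm(200)
xGrid <- seq(-3, 3, length.out = 200)
plot(xGrid, fN(xGrid, data = samp, h = 0.5), type = "l", ylab = "Density")
curve(dnorm(x), col = 2, add = TRUE)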
The estimator (6.2) raises an important question:

Why give the same weight to all X1, . . . , Xn in (x − h, x + h)?

We are estimating f(x) = F′(x) by estimating [F(x + h) − F(x − h)] / (2h) through the relative frequency of X1, . . . , Xn in the interval (x − h, x + h). Should not the data points closer to x be more important than the ones further away? The answer to this question shows that (6.2) is indeed a particular case of a wider class of density estimators.

Figure 6.2: Bias and variance for the moving histogram. The animation shows how for small bandwidths the bias of f̂N(x; h) on estimating f(x) is small, but the variance is high, and how for large bandwidths the bias is large and the variance is small. The variance is represented by the asymptotic 95% confidence intervals for f̂N(x; h). Recall how the variance of f̂N(x; h) is (almost) proportional to f(x). Application also available here.

6.1.2 Kernel density estimation


The moving histogram (6.2) can be equivalently written as

f̂N(x; h) = 1/(nh) ∑_{i=1}^n (1/2) 1{−1 < (x − Xi)/h < 1}
         = 1/(nh) ∑_{i=1}^n K((x − Xi)/h),    (6.3)

with K(z) = (1/2) 1{−1<z<1}. Interestingly, K is a uniform density in (−1, 1). This means that, when approximating

P[x − h < X < x + h] = P[−1 < (x − X)/h < 1]

by (6.3), we give equal weight to all the points X1, . . . , Xn. The generalization of (6.3) is now obvious: replace K by an arbitrary density. Then K is known as a kernel: a density with certain regularity that is (typically) symmetric and unimodal at 0. This generalization provides the definition of the kernel density estimator7 (kde):

f̂(x; h) := 1/(nh) ∑_{i=1}^n K((x − Xi)/h).    (6.4)

7 Also known as the Parzen–Rosenblatt estimator to honor the proposals by Parzen (1962) and Rosenblatt (1956).

A common notation is Kh(z) := (1/h) K(z/h), so the kde can be compactly written as f̂(x; h) = (1/n) ∑_{i=1}^n Kh(x − Xi).
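A minimal sketch of (6.4) with a normal kernel, coded directly and checked against density() (whose bw argument is precisely h for this kernel); the grid, seed, and bandwidth are arbitrary choices.

# Manual kde with a normal kernel
kdeManual <- function(x, data, h) {
  sapply(x, function(x0) mean(dnorm((x0 - data) / h)) / h)
}
set.seed(1)
samp <- rnorm(100)
xGrid <- seq(-4, 4, length.out = 10)
kdeR <- density(samp, bw = 0.5, from = -4, to = 4, n = 1024)
max(abs(kdeManual(xGrid, data = samp, h = 0.5) - approx(kdeR, xout = xGrid)$y))
# Small differences only: density() evaluates the same estimator via binning/FFT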



It is useful to recall (6.4) with the normal kernel. If that is the case, then Kh(x − Xi) = φ(x; Xi, h²) = φ(x − Xi; 0, h²) (denoted simply as φ(x − Xi; h²)) and the kernel is the density of a N(Xi, h²). Thus the bandwidth h can be thought of as the standard deviation of a normal density whose mean is Xi, and the kde (6.4) as a data-driven mixture of those densities. Figure 6.3 illustrates the construction of the kde, and the bandwidth and kernel effects.

Figure 6.3: Construction of the kernel density estimator. The animation shows how the bandwidth and kernel affect the density estimate, and how the kernels are rescaled densities with modes at the data points. Application available here.

Several types of kernels are possible. The most popular is the normal kernel K(z) = φ(z), although the Epanechnikov kernel, K(z) = (3/4)(1 − z²)1{|z|<1}, is the most efficient8. The rectangular kernel K(z) = (1/2)1{|z|<1} yields the moving histogram as a particular case. The kernel density estimator inherits the smoothness properties of the kernel. That means, for example, that (6.4) with a normal kernel is infinitely differentiable. But with an Epanechnikov kernel, (6.4) is not differentiable, and with a rectangular kernel it is not even continuous. However, if a certain smoothness is guaranteed (continuity at least), then the choice of the kernel has little importance in practice (at least compared with the choice of the bandwidth h).

8 Although the efficiency of the normal kernel, with respect to the Epanechnikov kernel, is roughly 0.95.
The computation of the kde in R is done through the density function. The function automatically chooses the bandwidth h using a data-driven criterion9.

9 Precisely, the rule-of-thumb given by bw.nrd.

# Sample 100 points from a N(0, 1)


set.seed(1234567)
samp <- rnorm(n = 100, mean = 0, sd = 1)

# Quickly compute a kernel density estimator and plot the density object
# Automatically chooses bandwidth and uses normal kernel
plot(density(x = samp))

# Select a particular bandwidth (0.5) and kernel (Epanechnikov)
lines(density(x = samp, bw = 0.5, kernel = "epanechnikov"), col = 2)

# density automatically chooses the interval for plotting the kernel density
# estimator (observe that the black line goes to roughly between -3 and 3)
# This can be tuned using "from" and "to"
plot(density(x = samp, from = -4, to = 4), xlim = c(-5, 5))

# The density object is a list
kde <- density(x = samp, from = -5, to = 5, n = 1024)
str(kde)
## List of 7
##  $ x        : num [1:1024] -5 -4.99 -4.98 -4.97 -4.96 ...
##  $ y        : num [1:1024] 5.98e-17 3.46e-17 2.56e-17 3.84e-17 4.50e-17 ...
##  $ bw       : num 0.315
##  $ n        : int 100
##  $ call     : language density.default(x = samp, n = 1024, from = -5, to = 5)
##  $ data.name: chr "samp"
##  $ has.na   : logi FALSE
##  - attr(*, "class")= chr "density"
# Note that the evaluation grid "x" is not directly controlled, only through
# "from", "to", and "n" (better use powers of 2). This is because, internally,
# kde employs an efficient Fast Fourier Transform on grids of size 2^m

# Plotting by the returned values of the kde
plot(kde$x, kde$y, type = "l")
curve(dnorm(x), col = 2, add = TRUE) # True density
rug(samp)

Load the dataset faithful. Then:

• Estimate and plot the density of faithful$eruptions.
• Create a new plot and superimpose different density estimations with bandwidths equal to 0.1, 0.5, and 1.
• Get the density estimate at exactly the point x = 3.1 using h = 0.15 and the Epanechnikov kernel.

6.1.3 Bandwidth selection


The kde critically depends on the employed bandwidth; hence ob-
jective and automatic bandwidth selectors that attempt to minimize
the estimation error of the target density f are required to properly
apply a kde in practice.
A global, rather than local, error criterion for the kde is the Integrated Squared Error (ISE):

ISE[f̂(·; h)] := ∫ (f̂(x; h) − f(x))² dx.

The ISE is a random quantity, since it depends directly on the sam-


ple X1 , . . . , Xn . As a consequence, looking for an optimal-ISE band-
width is a hard task, since the optimality is dependent on the sam-
ple itself (not only on f and n). To avoid this problem, it is usual

to compute the Mean Integrated Squared Error (MISE):


MISE[f̂(·; h)] := E[ISE[f̂(·; h)]] = ∫ E[(f̂(x; h) − f(x))²] dx = ∫ MSE[f̂(x; h)] dx.

Once the MISE is set as the error criterion to be minimized, our aim
is to find

hMISE := arg min_{h>0} MISE[f̂(·; h)].

For that purpose, we need an explicit expression of the MISE that


we can attempt to minimize. An asymptotic expansion can be de-
rived when h → 0 and nh → ∞, resulting in
MISE[f̂(·; h)] ≈ AMISE[f̂(·; h)] := (1/4) µ₂²(K) R(f″) h⁴ + R(K)/(nh),    (6.5)

where µ₂(K) := ∫ z²K(z) dz and R(g) := ∫ g(x)² dx. The AMISE stands for Asymptotic MISE and, due to its closed expression, it allows to obtain a bandwidth that minimizes it:10

hAMISE = [R(K) / (µ₂²(K) R(f″) n)]^{1/5}.    (6.6)

10 Solving d/dh AMISE[f̂(·; h)] = 0, i.e. µ₂²(K) R(f″) h³ − R(K) n⁻¹ h⁻² = 0, yields hAMISE.

Unfortunately, the AMISE bandwidth depends on R(f″) = ∫ (f″(x))² dx, which measures the curvature of the unknown density f. As a consequence, it can not be readily applied in practice!
Plug-in selectors
A simple solution to turn (6.6) into something computable is to
estimate R( f 00 ) by assuming that f is the density of a N (µ, σ2 ), and
then plug-in the form of the curvature for such density:
R(φ″(·; µ, σ²)) = 3 / (8π^{1/2} σ⁵).

While doing so, we approximate the curvature of an arbitrary density by means of the curvature of a Normal and we have that

hAMISE = [8π^{1/2} R(K) / (3µ₂²(K) n)]^{1/5} σ.

Interestingly, the bandwidth is directly proportional to the standard deviation of the target density. Replacing σ by an estimate yields the normal scale bandwidth selector, which we denote by ĥNS to emphasize its randomness:

ĥNS = [8π^{1/2} R(K) / (3µ₂²(K) n)]^{1/5} σ̂.

The estimate σ̂ can be chosen as the standard deviation s, or, in order to avoid the effects of potential outliers, as the standardized interquantile range

σ̂IQR := (X([0.75n]) − X([0.25n])) / (Φ⁻¹(0.75) − Φ⁻¹(0.25))

or as

σ̂ = min(s, σ̂IQR ). (6.7)

When combined with a normal kernel, for which µ₂(K) = 1 and R(K) = 1/(2√π), this particularization of ĥNS gives the famous rule-of-thumb for bandwidth selection:

ĥRT = (4/3)^{1/5} n^{−1/5} σ̂ ≈ 1.06 n^{−1/5} σ̂.

ĥRT is implemented in R through the function bw.nrd11.

11 Not to be confused with bw.nrd0!
# Data
set.seed(667478)
n <- 100
x <- rnorm(n)

# Rule-of-thumb
bw.nrd(x = x)
## [1] 0.4040319
# bw.nrd employs 1.34 as an approximation for diff(qnorm(c(0.25, 0.75)))

# Same as
iqr <- diff(quantile(x, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75)))
1.06 * n^(-1/5) * min(sd(x), iqr)
## [1] 0.4040319

The rule-of-thumb is an example of a zero-stage plug-in selector, a terminology which relies on the fact that R(f″) was estimated by plugging in a parametric estimation at “the very first moment a quantity that depends on f appears”. We could have opted to estimate R(f″) nonparametrically, in an optimal way, and then plug the estimate into hAMISE. The important catch is in the optimal estimation of R(f″): it requires the knowledge of R(f⁽⁴⁾)! What ℓ-stage plug-in selectors do is to iterate these steps ℓ times and finally plug in a normal estimate of the unknown R(f⁽²ℓ⁾)12.
Typically, two stages are considered a good trade-off between bias (mitigated when ℓ increases) and variance (augments with ℓ) of the plug-in selector. This is the method proposed by Sheather and Jones (1991), yielding what we call the Direct Plug-In (DPI). The DPI selector is implemented in R through the function bw.SJ (use method = "dpi"). An alternative and faster implementation is ks::hpi, which also allows for more flexibility and has a somewhat more complete documentation.

12 The motivation for doing so is to try to add the parametric assumption at a later, least important, step.

# Data
set.seed(672641)
x <- rnorm(100)

# DPI selector
bw.SJ(x = x, method = "dpi")
## [1] 0.5006905

# Similar to
ks::hpi(x) # Default is two-stages
## [1] 0.4999456

Cross-validation
We turn now our attention to a different philosophy of bandwidth estimation. Instead of trying to minimize the AMISE by plugging-in estimates for the unknown curvature term, we directly attempt to minimize the MISE by using the sample twice: once for computing the kde and once for evaluating its performance on estimating f. To avoid the clear dependence on the sample, we do this evaluation in a cross-validatory way: the data used for computing the kde is not used for its evaluation.
We begin by expanding the square in the MISE expression:

    MISE[f̂(·; h)] = E[∫ (f̂(x; h) − f(x))^2 dx]
                  = E[∫ f̂(x; h)^2 dx] − 2 E[∫ f̂(x; h) f(x) dx] + ∫ f(x)^2 dx.

Since the last term does not depend on h, minimizing MISE[f̂(·; h)] is equivalent to minimizing

    E[∫ f̂(x; h)^2 dx] − 2 E[∫ f̂(x; h) f(x) dx].

This quantity is unknown, but it can be estimated unbiasedly by

    LSCV(h) := ∫ f̂(x; h)^2 dx − 2 n^{−1} ∑_{i=1}^n f̂_{−i}(X_i; h),    (6.8)

where f̂_{−i}(·; h) is the leave-one-out kde, based on the sample with X_i removed:

    f̂_{−i}(x; h) = (1/(n − 1)) ∑_{j=1, j≠i}^n K_h(x − X_j).

The Least Squares Cross-Validation (LSCV) selector, also denoted Unbiased Cross-Validation (UCV) selector, is defined as

    ĥ_LSCV := arg min_{h>0} LSCV(h).

Numerical optimization is required for obtaining ĥ_LSCV, contrary to the previous plug-in selectors, and there is little control on the shape of the objective function.

Numerical optimization of (6.8) can be challenging. In practice, several local minima are possible, and the roughness of the objective function can vary notably depending on n and f. As a consequence, optimization routines may get trapped in spurious solutions. To be on the safe side, it is advisable to check the solution by plotting LSCV(h) for a range of h, or to perform an exhaustive search in a bandwidth grid: ĥ_LSCV ≈ arg min_{h_1,...,h_G} LSCV(h).

ĥ_LSCV is implemented in R through the function bw.ucv. bw.ucv uses optimize, which is quite sensitive to the selection of the search interval^13. Therefore, some care is needed, and that is why the bw.ucv.mod function is presented.

^13 Long intervals containing the solution may lead to unsatisfactory termination of the search; and short intervals might not contain the minimum.

# Data
set.seed(123456)
x <- rnorm(100)

# UCV gives a warning


bw.ucv(x = x)
## [1] 0.4499177

# Extend search interval


bw.ucv(x = x, lower = 0.01, upper = 1)
## [1] 0.5482419

# bw.ucv.mod replaces the optimization routine of bw.ucv by an exhaustive


# search on "h.grid" (chosen adaptatively from the sample) and optionally
# plots the LSCV curve with "plot.cv"
bw.ucv.mod <- function(x, nb = 1000L,
h.grid = diff(range(x)) * (seq(0.1, 1, l = 200))^2,
plot.cv = FALSE) {
if ((n <- length(x)) < 2L)
stop("need at least 2 data points")
n <- as.integer(n)
if (is.na(n))
stop("invalid length(x)")
if (!is.numeric(x))
stop("invalid ’x’")
nb <- as.integer(nb)
if (is.na(nb) || nb <= 0L)
stop("invalid ’nb’")
storage.mode(x) <- "double"
hmax <- 1.144 * sqrt(var(x)) * n^(-1/5)
Z <- .Call(stats:::C_bw_den, nb, x)
d <- Z[[1L]]
cnt <- Z[[2L]]
fucv <- function(h) .Call(stats:::C_bw_ucv, n, d, cnt, h)
## Original
# h <- optimize(fucv, c(lower, upper), tol = tol)$minimum
# if (h < lower + tol | h > upper - tol)
# warning("minimum occurred at one end of the range")
## Modification
obj <- sapply(h.grid, function(h) fucv(h))
h <- h.grid[which.min(obj)]
if (plot.cv) {
plot(h.grid, obj, type = "o")
rug(h.grid)
abline(v = h, col = 2, lwd = 2)
}
h
}

# Compute the bandwidth and plot the LSCV curve
bw.ucv.mod(x = x, plot.cv = TRUE)
## [1] 0.5431732

# We can compare with the default bw.ucv output
abline(v = bw.ucv(x = x), col = 3)

The next cross-validation selector is based on Biased Cross-Validation (BCV). The BCV selector presents a hybrid strategy that combines plug-in and cross-validation ideas. It starts by considering the AMISE expression in (6.5) and then plugs-in an estimate for R(f'') based on a modification of R(f̂''(·; h)). The appealing property of ĥ_BCV is that it has a considerably smaller variance compared to ĥ_LSCV. This reduction in variance comes at the price of an increased bias, which tends to make ĥ_BCV larger than h_MISE.
ĥ_BCV is implemented in R through the function bw.bcv. Again, bw.bcv uses optimize, so the bw.bcv.mod function is presented to have better guarantees on finding the adequate minimum.

# Data
set.seed(123456)
x <- rnorm(100)

# BCV gives a warning


bw.bcv(x = x)
## [1] 0.4500924

# Extend search interval


args(bw.bcv)
## function (x, nb = 1000L, lower = 0.1 * hmax, upper = hmax, tol = 0.1 *
## lower)
## NULL
bw.bcv(x = x, lower = 0.01, upper = 1)
## [1] 0.5070129

# bw.bcv.mod replaces the optimization routine of bw.bcv by an exhaustive


# search on "h.grid" (chosen adaptatively from the sample) and optionally
# plots the BCV curve with "plot.cv"
bw.bcv.mod <- function(x, nb = 1000L,
h.grid = diff(range(x)) * (seq(0.1, 1, l = 200))^2,
plot.cv = FALSE) {
if ((n <- length(x)) < 2L)
stop("need at least 2 data points")
n <- as.integer(n)
if (is.na(n))
stop("invalid length(x)")
if (!is.numeric(x))
stop("invalid ’x’")
nb <- as.integer(nb)
if (is.na(nb) || nb <= 0L)
stop("invalid ’nb’")
storage.mode(x) <- "double"
hmax <- 1.144 * sqrt(var(x)) * n^(-1/5)
Z <- .Call(stats:::C_bw_den, nb, x)
d <- Z[[1L]]
cnt <- Z[[2L]]
fbcv <- function(h) .Call(stats:::C_bw_bcv, n, d, cnt, h)
## Original code
# h <- optimize(fbcv, c(lower, upper), tol = tol)$minimum
# if (h < lower + tol | h > upper - tol)
# warning("minimum occurred at one end of the range")
## Modification
obj <- sapply(h.grid, function(h) fbcv(h))
h <- h.grid[which.min(obj)]
if (plot.cv) {
plot(h.grid, obj, type = "o")
rug(h.grid)
abline(v = h, col = 2, lwd = 2)
}
h
}

# Compute the bandwidth and plot the BCV curve


bw.bcv.mod(x = x, plot.cv = TRUE)
## [1] 0.5130493

# We can compare with the default bw.bcv output


abline(v = bw.bcv(x = x), col = 3)

Comparison of bandwidth selectors
Although it is possible to compare theoretically the performance of bandwidth selectors by investigating the convergence of n^ν(ĥ/h_MISE − 1), comparisons are usually done by simulation and investigation of the averaged ISE error. A popular collection of simulation scenarios was given by Marron and Wand (1992) and is conveniently available through the package nor1mix. They form a collection of normal r-mixtures of the form

    f(x; µ, σ, w) := ∑_{j=1}^r w_j φ(x; µ_j, σ_j^2),

where w_j ≥ 0, j = 1, . . . , r, and ∑_{j=1}^r w_j = 1. Densities of this form are especially attractive since they allow for arbitrary flexibility and, if the normal kernel is employed, they allow for explicit and exact MISE expressions, directly computable as

    MISE_r[f̂(·; h)] = (2 π^{1/2} n h)^{−1} + w'{(1 − n^{−1}) Ω_2 − 2 Ω_1 + Ω_0} w,
    (Ω_a)_{ij} := φ(µ_i − µ_j; a h^2 + σ_i^2 + σ_j^2), i, j = 1, . . . , r.

This expression is especially useful for benchmarking bandwidth selectors, as the MISE optimal bandwidth can be computed by h_MISE = arg min_{h>0} MISE_r[f̂(·; h)].
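As an illustration, the exact MISE above can be coded directly and minimized numerically. A minimal sketch for a univariate normal mixture and a normal kernel; miseMix is a hypothetical helper, not part of nor1mix:

# Exact MISE of the kde for a normal r-mixture and normal kernel, following
# the previous formula (miseMix is a hypothetical helper, not part of nor1mix)
miseMix <- function(h, mu, sigma2, w, n) {
  Omega <- function(a) outer(seq_along(mu), seq_along(mu), function(i, j)
    dnorm(mu[i] - mu[j], sd = sqrt(a * h^2 + sigma2[i] + sigma2[j])))
  drop(1 / (2 * sqrt(pi) * n * h) +
         t(w) %*% ((1 - 1 / n) * Omega(2) - 2 * Omega(1) + Omega(0)) %*% w)
}

# h_MISE for a bimodal mixture and n = 100, by numerical minimization
optimize(miseMix, interval = c(0.01, 1), mu = c(-1, 1), sigma2 = c(0.5, 0.5),
         w = c(0.5, 0.5), n = 100)$minimum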

# Available models
?nor1mix::MarronWand

# Simulating -- specify density with MW object
samp <- nor1mix::rnorMix(n = 500, obj = nor1mix::MW.nm9)
hist(samp, freq = FALSE)

# Density evaluation
x <- seq(-4, 4, length.out = 400)
lines(x, nor1mix::dnorMix(x = x, obj = nor1mix::MW.nm9), col = 2)

# Plot a MW object directly
# A normal with the same mean and variance is plotted in dashed lines
par(mfrow = c(2, 2))
plot(nor1mix::MW.nm5)
plot(nor1mix::MW.nm7)
plot(nor1mix::MW.nm10)
plot(nor1mix::MW.nm12)
lines(nor1mix::MW.nm1, col = 2:3) # Also possible

Figure 6.4 presents a visualization of the performance of the kde with different bandwidth selectors, carried out in the family of mixtures of Marron and Wand (1992).

Figure 6.4: Performance comparison of bandwidth selectors. The RT, DPI, LSCV, and BCV are computed for each sample from a normal mixture density. For each sample, the ISEs of the selectors are computed and sorted from best to worst. Changing the scenarios gives insight on the adequacy of each selector to hard- and simple-to-estimate densities. Application also available here.

Which bandwidth selector is the most adequate for a given dataset?
There is no simple and universal answer to this question. There are, however, a series of useful facts and suggestions:

• Trying several selectors and inspecting the results may help in determining which one is estimating the density better.
• The DPI selector has a convergence rate much faster than the cross-validation selectors. Therefore, in theory it is expected to perform better than LSCV and BCV. For this reason, it tends to be amongst the preferred bandwidth selectors in the literature.
• Cross-validatory selectors may be better suited for highly non-normal and rough densities, in which plug-in selectors may end up oversmoothing.
• LSCV tends to be considerably more variable than BCV.
• The RT is a quick, simple, and inexpensive selector. However, it tends to give bandwidths that are too large for non-normal data.

6.1.4 Multivariate extension


Kernel density estimation can be extended to estimate multivariate densities f in R^p. For a sample X_1, . . . , X_n in R^p, the kde of f evaluated at x ∈ R^p is

    f̂(x; H) := (1/(n|H|^{1/2})) ∑_{i=1}^n K(H^{−1/2}(x − X_i)),    (6.9)

where K is a multivariate kernel, a p-variate density that is (typically) symmetric and unimodal at 0, and that depends on the bandwidth matrix^14 H, a p × p symmetric and positive definite matrix. A common notation is K_H(z) := |H|^{−1/2} K(H^{−1/2} z), so the kde can be compactly written as f̂(x; H) := (1/n) ∑_{i=1}^n K_H(x − X_i). The most employed multivariate kernel is the normal kernel K(z) = φ(z; 0, I_p).

^14 Observe that, if p = 1, then H will equal the square of the bandwidth h, that is, H = h^2.
The interpretation of (6.9) is analogous to the one of (6.4): build
a mixture of densities with each density centred at each data
point. As a consequence, and roughly speaking, most of the con-
cepts and ideas seen in univariate kernel density estimation extend
to the multivariate situation, although some of them with consid-
erable technical complications. For example, bandwidth selection
inherits the same cross-validatory ideas (LSCV and BCV selectors)
and plug-in methods (NS and DPI) seen before, but with increased
complexity for the BCV and DPI selectors. The interested reader is
referred to Chacón and Duong (2018) for a rigorous and compre-
hensive treatment.
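As a quick illustration of (6.9), a direct (and inefficient) evaluation with a normal kernel can be sketched as follows; kdeMult is a hypothetical helper, whereas ks::kde below is the implementation to use in practice.

# A direct sketch of (6.9) with a normal multivariate kernel
kdeMult <- function(x, X, H) {
  Hinv <- solve(H)
  mean(apply(X, 1, function(Xi) {
    z <- x - Xi
    exp(-0.5 * drop(t(z) %*% Hinv %*% z)) / sqrt((2 * pi)^length(x) * det(H))
  }))
}

# Evaluate at the sample mean of faithful with an (arbitrary) diagonal H
kdeMult(x = colMeans(faithful), X = as.matrix(faithful), H = diag(c(0.2, 20)))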
We briefly discuss next the NS and LSCV selectors, denoted by
ĤNS and ĤLSCV , respectively. The normal scale bandwidth selector
follows, as in the univariate case, by minimizing the asymptotic
MISE of the kde, which now takes the form

    MISE[f̂(·; H)] = E[∫ (f̂(x; H) − f(x))^2 dx],

and then assuming that f is the pdf of a N_p(µ, Σ). With the normal kernel, this results in

    H_NS = (4/(p + 2))^{2/(p+4)} n^{−2/(p+4)} Σ.    (6.10)

Replacing Σ by the sample covariance matrix S in (6.10) gives ĤNS .
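A quick sketch of (6.10) with S in place of Σ, using the faithful dataset just for illustration (ks::Hns, employed below, implements this kind of normal scale rule):

# Hand-computed normal scale bandwidth matrix (6.10), with S replacing Sigma
p <- ncol(faithful); n <- nrow(faithful)
(4 / (p + 2))^(2 / (p + 4)) * n^(-2 / (p + 4)) * cov(faithful)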


The unbiased cross-validation selector neatly extends from the univariate case and attempts to minimize MISE[f̂(·; H)] by estimating it unbiasedly with

    LSCV(H) := ∫ f̂(x; H)^2 dx − 2 n^{−1} ∑_{i=1}^n f̂_{−i}(X_i; H)

and then minimizing it:

    Ĥ_LSCV := arg min_{H ∈ SPD_p} LSCV(H),    (6.11)

where SPD_p is the set of positive definite matrices^15 of size p.

^15 Observe that implementing the optimization of (6.11) is not trivial, since it is required to enforce the constraint H ∈ SPD_p. A neat way of parametrizing H that induces the positive definiteness constraint is through the (unique) Cholesky decomposition of H ∈ SPD_p: H = R'R, where R is a triangular matrix with positive entries on the diagonal (but the remaining entries unconstrained). Therefore, optimization of (6.11) can be done through the p(p + 1)/2 entries of R.

Considering a full bandwidth matrix H gives more flexibility to the kde, but also increases notably the amount of bandwidth parameters that need to be chosen – precisely p(p + 1)/2 – which notably complicates bandwidth selection as the dimension p grows. A common simplification is to consider a diagonal bandwidth matrix H = diag(h_1^2, . . . , h_p^2), which yields the kde employing product kernels:

    f̂(x; h) = (1/n) ∑_{i=1}^n K_{h_1}(x_1 − X_{i,1}) × · · · × K_{h_p}(x_p − X_{i,p}),    (6.12)

where X_i = (X_{i,1}, . . . , X_{i,p}) and h = (h_1, . . . , h_p) is the vector of bandwidths. If the variables X_1, . . . , X_p have been standardized (so that they have the same scale), then a simple choice is to consider h = h_1 = . . . = h_p.
Multivariate kernel density estimation and bandwidth selection is not supported in base R, but the ks package implements the kde by ks::kde for p ≤ 6. Bandwidth selectors, allowing for full or diagonal bandwidth matrices, are implemented by: ks::Hns (NS), ks::Hpi and ks::Hpi.diag (DPI), ks::Hlscv and ks::Hlscv.diag (LSCV), and ks::Hbcv and ks::Hbcv.diag (BCV). The next chunk of code illustrates their usage with the faithful dataset.

# DPI selectors
Hpi1 <- ks::Hpi(x = faithful)
Hpi1
##            [,1]       [,2]
## [1,] 0.06326802  0.6041862
## [2,] 0.60418624 11.1917775

# Compute kde (if H is missing, ks::Hpi is called)
kdeHpi1 <- ks::kde(x = faithful, H = Hpi1)

# Different representations
plot(kdeHpi1, display = "slice", cont = c(25, 50, 75))
# "cont" specifies the density contours, which are upper percentages of highest
# density regions. The default contours are at 25%, 50%, and 75%
plot(kdeHpi1, display = "filled.contour2", cont = c(25, 50, 75))
plot(kdeHpi1, display = "persp")

# Manual plotting using the kde object structure
image(kdeHpi1$eval.points[[1]], kdeHpi1$eval.points[[2]],
      kdeHpi1$estimate, col = viridis::viridis(20))
points(kdeHpi1$x)

# Diagonal vs. full
Hpi2 <- ks::Hpi.diag(x = faithful)
kdeHpi2 <- ks::kde(x = faithful, H = Hpi2)
plot(kdeHpi1, display = "filled.contour2", cont = c(25, 50, 75),
     main = "full")
plot(kdeHpi2, display = "filled.contour2", cont = c(25, 50, 75),
     main = "diagonal")

# Comparison of selectors along predefined contours
x <- faithful
Hlscv0 <- ks::Hlscv(x = x)
Hbcv0 <- ks::Hbcv(x = x)
Hpi0 <- ks::Hpi(x = x)
Hns0 <- ks::Hns(x = x)
par(mfrow = c(2, 2))
p <- lapply(list(Hlscv0, Hbcv0, Hpi0, Hns0), function(H) {
  # col.fun for custom colours
  plot(ks::kde(x = x, H = H), display = "filled.contour2",
       cont = seq(10, 90, by = 10), col.fun = viridis::viridis)
  points(x, cex = 0.5, pch = 16)
})

Kernel density estimation can be used to visualize density level


sets in 3D too, as illustrated as follows with the iris dataset.

# Normal scale bandwidth
Hns1 <- ks::Hns(iris[, 1:3])

# Show nested contours of high density regions
plot(ks::kde(x = iris[, 1:3], H = Hns1))
rgl::points3d(x = iris[, 1:3])
rgl::rglwidget()

Consider the normal mixture

    w_1 N_2(µ_{11}, µ_{12}, σ_{11}^2, σ_{12}^2, ρ_1) + w_2 N_2(µ_{21}, µ_{22}, σ_{21}^2, σ_{22}^2, ρ_2),

where w_1 = 0.3, w_2 = 0.7, (µ_{11}, µ_{12}) = (1, 1), (µ_{21}, µ_{22}) = (−1, −1), σ_{11}^2 = σ_{21}^2 = 1, σ_{12}^2 = σ_{22}^2 = 2, ρ_1 = 0.5, and ρ_2 = −0.5.
Perform the following simulation exercise:

1. Plot the density of the mixture using ks::dnorm.mixt


and overlay points simulated employ-
ing ks::rnorm.mixt. You may want to use
ks::contourLevels to have density plots compara-
ble to the kde plots performed in the next step.
2. Compute the kde employing ĤDPI , both for full and
diagonal bandwidth matrices. Are there any gains on
considering full bandwidths? What if ρ2 = 0.7?
3. Consider the previous point with ĤLSCV instead of
ĤDPI . Are the conclusions the same?

6.2 Kernel regression estimation

6.2.1 Nadaraya–Watson estimator

Our objective is to estimate the regression function m : R^p → R nonparametrically (recall that we are considering the simplest situation: one continuous predictor, so p = 1). Due to its definition, we can rewrite m as

    m(x) = E[Y | X = x] = ∫ y f_{Y|X=x}(y) dy = ∫ y f(x, y) dy / f_X(x).    (6.13)

This expression shows an interesting point: the regression function


can be computed from the joint density f and the marginal f X .
Therefore, given a sample {( Xi , Yi )}in=1 , a nonparametric estimate
of m may follow by replacing the previous densities by their kernel
density estimators! From the previous section, we know how to do
this using the univariate and multivariate kde's given in (6.4) and
(6.9), respectively. For the multivariate kde, we can consider the kde
(6.12) based on product kernels for the two dimensional case and
bandwidths h = (h_1, h_2), which yields the estimate

    f̂(x, y; h) = (1/n) ∑_{i=1}^n K_{h_1}(x − X_i) K_{h_2}(y − Y_i)    (6.14)

of the joint pdf of (X, Y). On the other hand, considering the same bandwidth h_1 for the kde of f_X, we have

    f̂_X(x; h_1) = (1/n) ∑_{i=1}^n K_{h_1}(x − X_i).    (6.15)

We can therefore define the estimator of m that results from replacing f and f_X in (6.13) by (6.14) and (6.15):

    ∫ y f̂(x, y; h) dy / f̂_X(x; h_1)
      = [∫ y (1/n) ∑_{i=1}^n K_{h_1}(x − X_i) K_{h_2}(y − Y_i) dy] / [(1/n) ∑_{i=1}^n K_{h_1}(x − X_i)]
      = [(1/n) ∑_{i=1}^n K_{h_1}(x − X_i) ∫ y K_{h_2}(y − Y_i) dy] / [(1/n) ∑_{i=1}^n K_{h_1}(x − X_i)]
      = [(1/n) ∑_{i=1}^n K_{h_1}(x − X_i) Y_i] / [(1/n) ∑_{i=1}^n K_{h_1}(x − X_i)]
      = ∑_{i=1}^n [K_{h_1}(x − X_i) / ∑_{j=1}^n K_{h_1}(x − X_j)] Y_i.

The resulting estimator^16 is the so-called Nadaraya–Watson^17 estimate of the regression function:

    m̂(x; 0, h) := ∑_{i=1}^n [K_h(x − X_i) / ∑_{j=1}^n K_h(x − X_j)] Y_i = ∑_{i=1}^n W_i^0(x) Y_i,    (6.16)

where

    W_i^0(x) := K_h(x − X_i) / ∑_{j=1}^n K_h(x − X_j).

^16 Notice that it does not depend on h_2, only on h_1, the bandwidth employed for smoothing X.
^17 Termed due to the coetaneous proposals by Nadaraya (1964) and Watson (1964).

The Nadaraya–Watson estimate can be seen as a weighted average of Y_1, . . . , Y_n by means of the set of weights {W_i^0(x)}_{i=1}^n (they add to one). The set of varying weights depends on the evaluation point x. That means that the Nadaraya–Watson estimator is a local mean of Y_1, . . . , Y_n around X = x (see Figure 6.6).

Let’s implement from scratch the Nadaraya–Watson estimate to


get a feeling of how it works in practice.
# A naive implementation of the Nadaraya-Watson estimator
mNW <- function(x, X, Y, h, K = dnorm) {

# Arguments
# x: evaluation points
# X: vector (size n) with the predictors
# Y: vector (size n) with the response variable
# h: bandwidth
# K: kernel

# Matrix of size n x length(x)


Kx <- sapply(X, function(Xi) K((x - Xi) / h) / h)

# Weights
W <- Kx / rowSums(Kx) # Column recycling!

# Means at x ("drop" to drop the matrix attributes)
drop(W %*% Y)

}

# Generate some data to test the implementation


set.seed(12345)
n <- 100
eps <- rnorm(n, sd = 2)
m <- function(x) x^2 * cos(x)
# m <- function(x) x - x^2 # Other possible regression function, works
# equally well
X <- rnorm(n, sd = 2)
Y <- m(X) + eps
xGrid <- seq(-10, 10, l = 500)

# Bandwidth
h <- 0.5

# Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = h), col = 2)
legend("top", legend = c("True regression", "Nadaraya-Watson"),
lwd = 2, col = 1:2)

Figure 6.5: The Nadaraya–Watson estimator of an arbitrary regression function m.

Implement your own version of the Nadaraya–Watson


estimator in R and compare it with mNW. Focus only on
the normal kernel and reduce the accuracy of the final
computation up to 1e-7 to achieve better efficiency.
Are you able to improve the speed of mNW? Use the
microbenchmark::microbenchmark function to measure
the running times for a sample with n = 10000.

Similarly to kernel density estimation, in the Nadaraya–Watson


estimator the bandwidth has a prominent effect on the shape of the
estimator, whereas the kernel is clearly less important. The code be-
low illustrates the effect of varying h using the manipulate::manipulate
function.
# Simple plot of N-W for varying h’s
manipulate::manipulate({

# Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = h), col = 2)
legend("topright", legend = c("True regression", "Nadaraya-Watson"),
lwd = 2, col = 1:2)

}, h = manipulate::slider(min = 0.01, max = 2, initial = 0.5, step = 0.01))

6.2.2 Local polynomial regression


The Nadaraya–Watson estimator can be seen as a particular case
of a wider class of nonparametric estimators, the so called local
polynomial estimators. Specifically, Nadaraya–Watson is the one that
corresponds to performing a local constant fit. Let’s see this wider
class of nonparametric estimators and their advantages with respect
to the Nadaraya–Watson estimator.
The motivation for the local polynomial fit comes from attempting to find an estimator m̂ of m that "minimizes"^18 the RSS

    ∑_{i=1}^n (Y_i − m̂(X_i))^2    (6.17)

^18 Obviously, avoiding the spurious perfect fit attained with m̂(X_i) := Y_i, i = 1, . . . , n.

without assuming any particular form for the true m. This is not achievable directly, since no knowledge on m is available. Recall that what we did in parametric models, such as linear regression, was to assume a parametrization for m, m_β(x) = β_0 + β_1 x for the simple linear model, which allowed to tackle the minimization of (6.17) by means of solving

    m_β̂(x) := arg min_β ∑_{i=1}^n (Y_i − m_β(X_i))^2.

That was true because the resulting m_β̂ is precisely the estimator that minimizes the RSS among all the linear estimators, that is, among the class of estimators that we have parametrized.

When m has no available parametrization and can adopt any mathematical form, an alternative approach is required. The first step is to induce a local parametrization on m. By a p-th^19 order Taylor expansion it is possible to obtain that, for x close to X_i,

    m(X_i) ≈ m(x) + m'(x)(X_i − x) + (m''(x)/2)(X_i − x)^2 + · · · + (m^{(p)}(x)/p!)(X_i − x)^p.    (6.18)

^19 Here we employ p for denoting the order of the Taylor expansion and, correspondingly, the order of the associated polynomial fit. Do not confuse p with the number of original predictors for explaining Y – there is only one predictor, X. However, with a local polynomial fit we expand this predictor to p predictors based on (X, X^2, . . . , X^p).

Then, replacing (6.18) in the population version of (6.17) that replaces m̂ with m, we have that

    ∑_{i=1}^n (Y_i − ∑_{j=0}^p (m^{(j)}(x)/j!)(X_i − x)^j)^2.    (6.19)

This expression is still not workable: it depends on m^{(j)}(x), j = 0, . . . , p, which of course are unknown, as m is unknown. The great idea is to set β_j := m^{(j)}(x)/j! and turn (6.19) into a linear regression problem where the unknown parameters are precisely β = (β_0, β_1, . . . , β_p)'. Simply rewriting (6.19) using this idea gives

    ∑_{i=1}^n (Y_i − ∑_{j=0}^p β_j (X_i − x)^j)^2.    (6.20)

Now, estimates of β automatically produce estimates for m^{(j)}(x), j = 0, . . . , p. In addition, we know how to obtain an estimate β̂ that minimizes (6.20), since this is precisely the least squares problem studied in Section 2.2.3. The final touch is to weight the contributions of each datum (X_i, Y_i) to the estimation of m(x) according to the proximity of X_i to x^20. We can achieve this precisely by kernels:

    β̂_h := arg min_{β ∈ R^{p+1}} ∑_{i=1}^n (Y_i − ∑_{j=0}^p β_j (X_i − x)^j)^2 K_h(x − X_i).    (6.21)

^20 The rationale is simple: (X_i, Y_i) should be more informative about m(x) than (X_j, Y_j) if x and X_i are closer than x and X_j.

Solving (6.21) is easy once the proper notation is introduced. To that end, denote

    X := [1  X_1 − x  · · ·  (X_1 − x)^p;  . . . ;  1  X_n − x  · · ·  (X_n − x)^p]_{n×(p+1)}

and

    W := diag(K_h(X_1 − x), . . . , K_h(X_n − x)),    Y := (Y_1, . . . , Y_n)'_{n×1}.

Then we can re-express (6.21) into a weighted least squares problem whose exact solution is

    β̂_h = arg min_{β ∈ R^{p+1}} (Y − Xβ)'W(Y − Xβ) = (X'WX)^{−1}X'WY.    (6.22)



The estimate for m(x) is therefore computed as

    m̂(x; p, h) := β̂_{h,0} = e_1'(X'WX)^{−1}X'WY = ∑_{i=1}^n W_i^p(x) Y_i,    (6.23)

where

    W_i^p(x) := e_1'(X'WX)^{−1}X'W e_i

and e_i is the i-th canonical vector. Just as the Nadaraya–Watson was, the local polynomial estimator is a weighted linear combination of the responses.
Two cases deserve special attention on (6.23):

• p = 0 is the local constant estimator or the Nadaraya–Watson estimator. In this situation, the estimator has explicit weights, as we saw before:

    W_i^0(x) = K_h(x − X_i) / ∑_{j=1}^n K_h(x − X_j).

• p = 1 is the local linear estimator, which has weights equal to:

    W_i^1(x) = (1/n) [ŝ_2(x; h) − ŝ_1(x; h)(X_i − x)] / [ŝ_2(x; h) ŝ_0(x; h) − ŝ_1(x; h)^2] K_h(x − X_i),

where ŝ_r(x; h) := (1/n) ∑_{i=1}^n (X_i − x)^r K_h(x − X_i).

Recall that the local polynomial fit is computationally


more expensive than the local constant fit: m̂( x; p, h)
is obtained as the solution of a weighted linear prob-
lem, whereas m̂( x; 0, h) can be directly computed as a
weighted mean of the responses.
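To make the weighted least squares solution (6.22)–(6.23) concrete, a minimal (and inefficient) sketch for the local linear case (p = 1) could be the following; locPoly1 is a hypothetical helper, not the implementation employed later.

# Direct translation of (6.22)-(6.23) for p = 1: one weighted least squares
# problem is solved per evaluation point, the intercept estimating m(x0)
locPoly1 <- function(x, X, Y, h, K = dnorm) {
  sapply(x, function(x0) {
    Xmat <- cbind(1, X - x0)         # n x 2 design matrix at x0
    W <- diag(K((X - x0) / h) / h)   # Kernel weights Kh(x0 - Xi)
    beta <- solve(t(Xmat) %*% W %*% Xmat, t(Xmat) %*% W %*% Y)
    beta[1]
  })
}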

Figure 6.6 illustrates the construction of the local polynomial


estimator (up to cubic degree) and shows how β̂ 0 = m̂( x; p, h), the
intercept of the local fit, estimates m at x.

The local polynomial estimator m̂(·; p, h) of m performs a


series of weighted polynomial fits, as many as points x
on which m̂(·; p, h) is to be evaluated.
An inefficient implementation of the local polynomial estimator can be done relatively straightforwardly from the previous insight and from expression (6.22). However, several R packages provide implementations, such as KernSmooth::locpoly and R's loess^22 (but this one has a different control of the bandwidth plus a set of other modifications). Below are some examples of their usage.

^22 The lowess estimator, related with loess, is the one employed in R's panel.smooth, which is the function in charge of displaying the smooth fits in lm and glm regression diagnostics (employing a prefixed and not data-driven smoothing span of 2/3 – which makes it inevitably a bad choice for certain data patterns).

Figure 6.6: Construction of the local


polynomial estimator. The animation
shows how local polynomial fits in a
neighborhood of x are combined to
provide an estimate of the regression
function, which depends on the
polynomial degree, bandwidth, and
kernel (gray density at the bottom).
The data points are shaded according
to their weights for the local fit at x.
Application available here.

# Generate some data


set.seed(123456)
n <- 100
eps <- rnorm(n, sd = 2)
m <- function(x) x^3 * sin(x)
X <- rnorm(n, sd = 1.5)
Y <- m(X) + eps
xGrid <- seq(-10, 10, l = 500)

# KernSmooth::locpoly fits
h <- 0.25
lp0 <- KernSmooth::locpoly(x = X, y = Y, bandwidth = h, degree = 0,
range.x = c(-10, 10), gridsize = 500)
lp1 <- KernSmooth::locpoly(x = X, y = Y, bandwidth = h, degree = 1,
range.x = c(-10, 10), gridsize = 500)
# Provide the evaluation points by range.x and gridsize

# loess fits
span <- 0.25 # The default span is 0.75, which works very badly in this scenario
lo0 <- loess(Y ~ X, degree = 0, span = span)
lo1 <- loess(Y ~ X, degree = 1, span = span)
# loess employs a "span" argument that plays the role of a variable bandwidth:
# "span" gives the proportion of points of the sample that are taken into
# account for performing the local fit around x and then uses a triweight kernel
# (not a normal kernel) for weighting the contributions. Therefore, the final
# estimate differs from the definition of the local polynomial estimator,
# although the principles on which they are based are the same

# Prediction at x = 2
x <- 2
lp1$y[which.min(abs(lp1$x - x))] # Prediction by KernSmooth::locpoly
## [1] 5.445975
predict(lo1, newdata = data.frame(X = x)) # Prediction by loess
## 1
## 5.379652
m(x) # Reality
## [1] 7.274379

# Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(lp0$x, lp0$y, col = 2)
lines(lp1$x, lp1$y, col = 3)
lines(xGrid, predict(lo0, newdata = data.frame(X = xGrid)), col = 2, lty = 2)
lines(xGrid, predict(lo1, newdata = data.frame(X = xGrid)), col = 3, lty = 2)
legend("bottom", legend = c("True regression", "Local constant (locpoly)",
"Local linear (locpoly)", "Local constant (loess)",
"Local linear (loess)"),
lwd = 2, col = c(1:3, 2:3), lty = c(rep(1, 3), rep(2, 2)))

As with the Nadaraya–Watson, the local polynomial estimator heavily depends on h.

# Simple plot of local polynomials for varying h's
manipulate::manipulate({

# Plot data
lpp <- KernSmooth::locpoly(x = X, y = Y, bandwidth = h, degree = p,
                           range.x = c(-10, 10), gridsize = 500)
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(lpp$x, lpp$y, col = p + 2)
legend("bottom", legend = c("True regression", "Local polynomial fit"),
       lwd = 2, col = c(1, p + 2))

}, p = manipulate::slider(min = 0, max = 4, initial = 0, step = 1),
h = manipulate::slider(min = 0.01, max = 2, initial = 0.5, step = 0.01))

A more sophisticated framework for performing nonparametric


estimation of the regression function is the np package, which we
detail in Section 6.2.4. This package will be the chosen approach
for the more challenging situation in which several predictors are
present, since the former implementations do not scale well for
more than one predictor.

6.2.3 Asymptotic properties


What affects the performance of the local polynomial estimator? Is
local linear estimation better than local constant estimation? What is
the effect of h?

The purpose of this section is to provide some highlights on the questions above by examining the theoretical properties of the local polynomial estimator. This is achieved by examining the asymptotic bias and variance of the local linear and local constant estimators^23. For this goal, we consider the location-scale model for Y and its predictor X:

    Y = m(X) + σ(X)ε,

where σ^2(x) := Var[Y | X = x] is the conditional variance of Y given X and ε is such that E[ε] = 0 and Var[ε] = 1. Note that since the conditional variance is not forced to be constant we are implicitly allowing for heteroskedasticity.

^23 We do not address the analysis of the general case in which p ≥ 1. The reader is referred to, for example, Theorem 3.1 of Fan and Gijbels (1996) for the full analysis.

The following assumptions are the only requirements to perform the asymptotic analysis of the estimator:

• A1. m is twice continuously differentiable. (This assumption requires certain smoothness of the regression function, allowing thus for Taylor expansions to be performed. It is also important in practice: m̂(·; p, h) is infinitely differentiable if the considered kernels K are.)
• A2. σ^2 is continuous and positive. (Avoids the situation in which Y is a degenerated random variable.)
• A3. f, the marginal pdf of X, is continuously differentiable and bounded away from zero, meaning that there exists a positive lower bound for f. (Avoids the degenerate situation in which m is estimated at regions without observations of the predictors, such as holes in the support of X.)
• A4. The kernel K is a symmetric and bounded pdf with finite second moment and is square integrable. (Mild assumption inherited from the kde.)
• A5. h = h_n is a deterministic sequence of bandwidths such that, when n → ∞, h → 0 and nh → ∞. (Key assumption for reducing the bias and variance of m̂(·; p, h) simultaneously.)

The bias and variance are studied in their conditional versions on the predictor's sample X_1, . . . , X_n. The reason for analyzing the conditional instead of the unconditional versions is avoiding technical difficulties that integration with respect to the predictor's density may pose. This is in the spirit of what was done in the parametric inference of Sections 2.4 and 5.3. The main result is the following, which provides useful insights on the effect of p, m, f (standing from now on for the marginal pdf of X), and σ^2 in the performance of m̂(·; p, h) for p = 0, 1.

Theorem 6.1. Under A1–A5, the conditional bias and variance of the local constant (p = 0) and local linear (p = 1) estimators are^31

    Bias[m̂(x; p, h) | X_1, . . . , X_n] = B_p(x) h^2 + o_P(h^2),    (6.24)
    Var[m̂(x; p, h) | X_1, . . . , X_n] = (R(K) / (n h f(x))) σ^2(x) + o_P((nh)^{−1}),    (6.25)

where

    B_p(x) := (µ_2(K)/2) {m''(x) + 2 m'(x) f'(x)/f(x)}   if p = 0,
    B_p(x) := (µ_2(K)/2) m''(x)                          if p = 1.

^31 The notation o_P(a_n) stands for a random variable that converges in probability to zero at a rate faster than a_n → 0. It is mostly employed for denoting non-important terms in asymptotic expansions, like the ones in (6.24)–(6.25).

The bias and variance expressions (6.24) and (6.25) yield very
interesting insights:

1. Bias.

• The bias decreases with h quadratically for both p = 0, 1.


That means that small bandwidths h give estimators with
low bias, whereas large bandwidths provide largely biased
estimators.
• The bias at x is directly proportional to m00 ( x ) if p = 1 or
affected by m00 ( x ) if p = 0. Therefore:

– The bias is negative in regions where m is concave, i.e. { x ∈


R : m00 ( x ) < 0}. These regions correspond to peaks and
modes of m.
– Conversely, the bias is positive in regions where m is convex,
i.e. { x ∈ R : m00 ( x ) > 0}. These regions correspond to
valleys of m.
– All in all, the “wilder” the curvature m00 , the larger the bias
and the harder to estimate m.
notes for predictive modeling 261

• The bias for p = 0 at x is affected by m0 ( x ), f 0 ( x ), and f ( x ).


All of them are quantities that are not present in the bias
when p = 1. Precisely, for the local constant estimator, the
lower the density f ( x ), the larger the bias. Also, the faster m
and f change at x (derivatives), the larger the bias. Thus the
bias of the local constant estimator is much more sensible to
m( x ) and f ( x ) than the local linear (which is only sensible to
m00 ( x )). Particularly, the fact that it depends on f 0 ( x ) and f ( x )
is referred to as the design bias since it depends merely on the
predictor’s distribution.

2. Variance.

• The main term of the variance is the same for p = 0, 1. In


σ2 ( x )
addition, it depends directly on f ( x) . As a consequence, the 32
Recall that this makes perfect sense:
lower the density, the more variable is m̂( x; p, h)32 . Also, the low density regions of X imply less
larger the conditional variance at x, σ2 ( x ), the more variable information about m available.
33
The same happened in the the linear
m̂( x; p, h) is33 . model with the error variance σ2 .
• The variance decreases at a factor of (nh)−1 . This is related
with the so-called effective sample size nh, which can be thought
as the amount of data in the neighborhood of x that is em- 34
The variance of an unweighted
ployed for performing the regression.34 mean is reduced by a factor n−1 when
n observations are employed. For
computing m̂( x; p, h), n observations
The main takeaway of the analysis of p = 0 vs. p = 1 is are used but in a weighted fashion that
roughly amounts to considering nh
that p = 1 has smaller bias than p = 0 (but of the same unweighted observations.
order) while keeping the same variance as p = 0.

An extended version of Theorem 6.1, given in Theorem 3.1 of


Fan and Gijbels (1996), shows that this phenomenon extends to
higher order: odd order (p = 2ν + 1, ν ∈ N) polynomial fits
introduce an extra coefficient for the polynomial fit that allows
them to reduce the bias, while maintaining the same variance of
the precedent even order (p = 2ν). So, for example, local cubic fits
are preferred to local quadratic fits. This motivates the claim that
local polynomial fitting is an odd world (Fan and Gijbels (1996)).

6.2.4 Bandwidth selection


Bandwidth selection, as for density estimation, is of key practical
importance for kernel regression estimation. Several bandwidth
selectors have been proposed for kernel regression by following
similar cross-validatory and plug-in ideas to the ones seen in Sec- 35
Further details are available in
tion 6.1.3. For simplicity, we briefly mention35 the DPI analogue for Section 5.8 of Wand and Jones (1995)
local linear regression for a single continuous predictor and focus and references therein.

mainly on least squares cross-validation, as it is a bandwidth selec-


tor that readily generalizes to the more complex settings of Section
6.3.
Following the derivation of the DPI for the kde, the first step is
to define a suitable error criterion for the estimator m̂(·; p, h). The

conditional (on the sample of the predictor) MISE of m̂(·; p, h) is often considered:

    MISE[m̂(·; p, h) | X_1, . . . , X_n] := E[∫ (m̂(x; p, h) − m(x))^2 f(x) dx | X_1, . . . , X_n]
                                         = ∫ E[(m̂(x; p, h) − m(x))^2 | X_1, . . . , X_n] f(x) dx
                                         = ∫ MSE[m̂(x; p, h) | X_1, . . . , X_n] f(x) dx.

Observe that this definition is very similar to the kde's MISE, except for the fact that f appears weighting the quadratic difference: what matters is to minimize the estimation error on the regions where the density of X is higher. Recall also that the MISE follows by integrating the conditional MSE, which amounts to the squared bias (6.24) plus the variance (6.25) given in Theorem 6.1. These operations produce the conditional AMISE:

    AMISE[m̂(·; p, h) | X_1, . . . , X_n] = h^4 ∫ B_p(x)^2 f(x) dx + (R(K)/(nh)) ∫ σ^2(x) dx

and, if p = 1, the resulting optimal AMISE bandwidth is

    h_AMISE = [R(K) ∫ σ^2(x) dx / (2 µ_2^2(K) θ_{22} n)]^{1/5},

where θ_{22} := ∫ (m''(x))^2 f(x) dx. As happened in the density setting, the AMISE-optimal bandwidth can not be readily employed, as knowledge about the "curvature" of m, θ_{22}, and about ∫ σ^2(x) dx is required. As with the DPI selector, a series of nonparametric estimations of θ_{22} and high-order curvature terms follow, concluding with a necessary estimation of a higher-order curvature based on a "block polynomial fit"^36. The estimation of ∫ σ^2(x) dx is carried out by assuming homoscedasticity and a compactly supported density f. The resulting bandwidth selector, ĥ_DPI, has a much faster convergence rate to h_MISE than cross-validatory selectors. However, it is notably more convoluted, and as a consequence is less straightforward to extend to more complex settings.

^36 A fit based on ordinal polynomial fits but done in different blocks of the data.
ward to extend to more complex settings.
The DPI selector for the local linear estimator is implemented in
KernSmooth::dpill.

# Generate some data


set.seed(123456)
n <- 100
eps <- rnorm(n, sd = 2)
m <- function(x) x^3 * sin(x)
X <- rnorm(n, sd = 1.5)
Y <- m(X) + eps
xGrid <- seq(-10, 10, l = 500)

# DPI selector
hDPI <- KernSmooth::dpill(x = X, y = Y)

# Fits
lp1 <- KernSmooth::locpoly(x = X, y = Y, bandwidth = 0.25, degree = 0,
range.x = c(-10, 10), gridsize = 500)
lp1DPI <- KernSmooth::locpoly(x = X, y = Y, bandwidth = hDPI, degree = 1,
range.x = c(-10, 10), gridsize = 500)
notes for predictive modeling 263

# Compare fits
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(lp1$x, lp1$y, col = 2)
lines(lp1DPI$x, lp1DPI$y, col = 3)
legend("bottom", legend = c("True regression", "Local linear",
"Local linear (DPI)"),
lwd = 2, col = 1:3)

We turn now our attention to cross-validation. Following an analogy with the fit of the linear model, we could look for the bandwidth h such that it minimizes an RSS of the form

    (1/n) ∑_{i=1}^n (Y_i − m̂(X_i; p, h))^2.    (6.26)

As it looks, this is a bad idea. Attempting to minimize (6.26) always leads to h ≈ 0, which results in a useless interpolation of the data, as illustrated below.

# Grid for representing (6.26)


hGrid <- seq(0.1, 1, l = 200)^2
error <- sapply(hGrid, function(h) {
mean((Y - mNW(x = X, X = X, Y = Y, h = h))^2)
})

# Error curve
plot(hGrid, error, type = "l")
rug(hGrid)
abline(v = hGrid[which.min(error)], col = 2)

As we know, the root of the problem is the comparison of Y_i with m̂(X_i; p, h), since there is nothing forbidding h → 0 and, as a consequence, m̂(X_i; p, h) → Y_i. As discussed in (3.14)^37, a solution is to compare Y_i with m̂_{−i}(X_i; p, h), the leave-one-out estimate of m computed without the i-th datum (X_i, Y_i), yielding the least squares cross-validation error

    CV(h) := (1/n) ∑_{i=1}^n (Y_i − m̂_{−i}(X_i; p, h))^2    (6.27)

and then choose

    ĥ_CV := arg min_{h>0} CV(h).

^37 Recall that h is a tuning parameter!

The optimization of (6.27) might seem to be very computationally


expensive, since it is required to compute n regressions for just a
single evaluation of the cross-validation function. There is, however,
a simple and neat theoretical result that vastly reduces the compu-
tational complexity, at the price of increasing the memory demand.
This trick allows to compute, with a single fit, the cross-validation
function.

Proposition 6.1. For any p ≥ 0, the weights of the leave-one-out estimator m̂_{−i}(x; p, h) = ∑_{j=1, j≠i}^n W_{−i,j}^p(x) Y_j can be obtained from m̂(x; p, h) = ∑_{i=1}^n W_i^p(x) Y_i:

    W_{−i,j}^p(x) = W_j^p(x) / ∑_{k=1, k≠i}^n W_k^p(x) = W_j^p(x) / (1 − W_i^p(x)).

This implies that

    CV(h) = (1/n) ∑_{i=1}^n [(Y_i − m̂(X_i; p, h)) / (1 − W_i^p(X_i))]^2.    (6.28)
The result can be proved by using that the weights {Wi ( x )}in=1
add to one, for any x, and that m̂( x; p, h) is a linear combination38 38
Indeed, for any other linear smoother
of the responses {Yi }in=1 . of the response, the result will also
hold.
Computing (6.28) requires evaluating the local polynomial estimator at the sample {X_i}_{i=1}^n and obtaining {W_i^p(X_i)}_{i=1}^n (which are needed for evaluating m̂(X_i; p, h)). Both tasks can be achieved simultaneously from the n × n matrix (W_i^p(X_j))_{ij} and, if p = 0, directly from the symmetric n × n matrix (K_h(X_i − X_j))_{ij}, whose storage costs O((n^2 − n)/2) (the diagonal is constant).

Let’s implement ĥCV for the Nadaraya–Watson estimator in R.


# Generate some data to test the implementation
set.seed(12345)
n <- 100
eps <- rnorm(n, sd = 2)
m <- function(x) x^2 + sin(x)
X <- rnorm(n, sd = 1.5)
Y <- m(X) + eps
xGrid <- seq(-10, 10, l = 500)

# Objective function
cvNW <- function(X, Y, h, K = dnorm) {

sum(((Y - mNW(x = X, X = X, Y = Y, h = h, K = K)) /
     (1 - K(0) / colSums(K(outer(X, X, "-") / h))))^2)
# Beware: outer() is not very memory-friendly!

}

# Find optimum CV bandwidth, with sensible grid


bw.cv.grid <- function(X, Y,
h.grid = diff(range(X)) * (seq(0.1, 1, l = 200))^2,
K = dnorm, plot.cv = FALSE) {

obj <- sapply(h.grid, function(h) cvNW(X = X, Y = Y, h = h, K = K))


h <- h.grid[which.min(obj)]
if (plot.cv) {
plot(h.grid, obj, type = "o")
rug(h.grid)
abline(v = h, col = 2, lwd = 2)
}
h

}

# Bandwidth
hCV <- bw.cv.grid(X = X, Y = Y, plot.cv = TRUE)

hCV
## [1] 0.3117806

# Plot result
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = hCV), col = 2)
legend("top", legend = c("True regression", "Nadaraya-Watson"),
lwd = 2, col = 1:2)
A more sophisticated cross-validation bandwidth selection can be achieved by np::npregbw and np::npreg, as illustrated in the code below.
# np::npregbw computes by default the least squares CV bandwidth associated to
# a local constant fit
bw0 <- np::npregbw(formula = Y ~ X)
## Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 |

# Multiple initial points can be employed for minimizing the CV function (for
# one predictor, defaults to 1)
bw0 <- np::npregbw(formula = Y ~ X, nmulti = 2)
## Multistart 1 of 2 |Multistart 1 of 2 |Multistart 1 of 2 |Multistart 1 of 2 /Multistart 1 of 2 |Multistart 1 of 2 |Multistart 2

# The "rbandwidth" object contains many useful information, see ?np::npregbw for
# all the returned objects
bw0
##
## Regression Data (100 observations, 1 variable(s)):
##
## X
## Bandwidth(s): 0.3112962
##
## Regression Type: Local-Constant
## Bandwidth Selection Method: Least Squares Cross-Validation
## Formula: Y ~ X
## Bandwidth Type: Fixed
## Objective Function Value: 5.368999 (achieved on multistart 1)
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 1
# Recall that the fit is very similar to hCV

# Once the bandwidth is estimated, np::npreg can be directly called with the
# "rbandwidth" object (it encodes the regression to be made, the data, the kind
# of estimator considered, etc). The hard work goes on np::npregbw, not on
# np::npreg
kre0 <- np::npreg(bw0)
kre0
##
## Regression Data: 100 training points, in 1 variable(s)
## X
## Bandwidth(s): 0.3112962
##
## Kernel Regression Estimator: Local-Constant
## Bandwidth Type: Fixed
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 1

# The evaluation points of the estimator are by default the predictor's sample
# (which is not sorted!)
# The evaluation of the estimator is given in "mean"
plot(kre0$eval$X, kre0$mean)

# The evaluation points can be changed using "exdat"



kre0 <- np::npreg(bw0, exdat = xGrid)

# Plot directly the fit via plot() -- it employs different evaluation points
# than exdat
plot(kre0, col = 2, type = "o")
points(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(kre0$eval$xGrid, kre0$mean, col = 3, type = "o", pch = 16, cex = 0.5)

# Using the evaluation points

# Local linear fit -- find first the CV bandwidth
bw1 <- np::npregbw(formula = Y ~ X, regtype = "ll")
## Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 |
# regtype = "ll" stands for "local linear", "lc" for "local constant"

# Local linear fit
kre1 <- np::npreg(bw1, exdat = xGrid)

# Comparison
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(kre0$eval$xGrid, kre0$mean, col = 2)
lines(kre1$eval$xGrid, kre1$mean, col = 3)
legend("top", legend = c("True regression", "Nadaraya-Watson", "Local linear"),
lwd = 2, col = 1:3)

The adequate bandwidths for the local linear estimator are usually larger than the adequate bandwidths for the local constant estimator. The reason: the extra flexibility the local linear estimator has allows it to adapt faster to variations in m, whereas the local constant estimator can only achieve this by shrinking the neighborhood around x by means of a small h.

There are more sophisticated options for bandwidth selection in


np::npregbw. For example, the argument bwtype allows to estimate
data-driven variable bandwidths ĥ( x ) that depend on the evaluation
point x, rather than fixed bandwidths ĥ, as we have considered.
Roughly speaking, these variable bandwidths are related to the
variable bandwidth ĥk ( x ) that is necessary to contain the k nearest
neighbors X1 , . . . , Xk of x in the neighborhood ( x − ĥk ( x ), x + ĥk ( x )).
There is a potential gain in employing variable bandwidths as the
estimator can adapt the amount of smoothing according to the
density of the predictor. We do not investigate this approach in
detail but just point to its implementation.
# Generate some data with bimodal density
set.seed(12345)
n <- 100
eps <- rnorm(2 * n, sd = 2)
m <- function(x) x^2 * sin(x)
X <- c(rnorm(n, mean = -2, sd = 0.5), rnorm(n, mean = 2, sd = 0.5))
Y <- m(X) + eps
xGrid <- seq(-10, 10, l = 500)

# Constant bandwidth
bwc <- np::npregbw(formula = Y ~ X, bwtype = "fixed", regtype = "ll")
## Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 |
krec <- np::npreg(bwc, exdat = xGrid)

# Variable bandwidths
bwg <- np::npregbw(formula = Y ~ X, bwtype = "generalized_nn", regtype = "ll")
## Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |
kreg <- np::npreg(bwg, exdat = xGrid)
bwa <- np::npregbw(formula = Y ~ X, bwtype = "adaptive_nn", regtype = "ll")
## Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 |Multistart 1 of 1 /Multistart 1 of 1 |Multistart 1 of 1 |
krea <- np::npreg(bwa, exdat = xGrid)

# Comparison
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(krec$eval$xGrid, krec$mean, col = 2)
lines(kreg$eval$xGrid, kreg$mean, col = 3)
lines(krea$eval$xGrid, krea$mean, col = 4)
legend("top", legend = c("True regression", "Fixed", "Generalized NN",
"Adaptive NN"),
lwd = 2, col = 1:4)


# Observe how the fixed bandwidth may yield a fit that produces serious
# artifacts in the low density region. At that region the NN-based bandwidths
# expand to borrow strength from the points in the high density regions,
# whereas in the high density regions they shrink to adapt faster to the
# changes of the regression function

6.3 Kernel regression with mixed multivariate data

Until now, we have studied the simplest situation for perform-


ing nonparametric estimation of the regression function: a single,
continuous, predictor X is available for explaining Y, a continuous
response. This served for introducing the main concepts without

the additional technicalities associated to more complex predictors.


We now extend the study to the case in which there are

• multiple predictors X1 , . . . , X p and


• possible non-continuous predictors, namely categorical predic-
tors and discrete predictors.

The first point is how to extend the local polynomial estimator m̂(·; q, h)^39 to deal with p continuous predictors. Although this can be done for q ≥ 0, we focus on the local constant and linear estimators (q = 0, 1) for avoiding excessive technical complications^40. Also, to avoid a quick escalation of the number of smoothing bandwidths, it is customary to consider product kernels. With these two restrictions, the estimators for m based on a sample {(X_i, Y_i)}_{i=1}^n extend easily from the developments in Sections 6.2.1 and 6.2.2:

^39 Now q denotes the order of the polynomial fit, since p stands for the number of predictors.
^40 In particular, the consideration of Taylor expansions of m : R^p → R of more than two orders, which involve the consideration of the vector of partial derivatives D^{⊗s} m(x) formed by ∂^s m(x)/(∂x_1^{s_1} · · · ∂x_p^{s_p}), where s = s_1 + . . . + s_p. For example: if s = 1, then D^{⊗1} m(x) is just the transpose of the gradient ∇m(x)'; if s = 2, then D^{⊗2} m(x) is the vector half (column stacking of the entries below and belonging to the diagonal) of the Hessian matrix H m(x); and if s ≥ 3 the arrangement of derivatives is made in terms of tensors containing the derivatives ∂^s m(x)/(∂x_1^{s_1} · · · ∂x_p^{s_p}), accounting for symmetry.

• Local constant estimator. We can replicate the argument in (6.13) with a multivariate kde for f_X based on product kernels with bandwidth vector h, which gives

    m̂(x; 0, h) := ∑_{i=1}^n [K_h(x − X_i) / ∑_{j=1}^n K_h(x − X_j)] Y_i = ∑_{i=1}^n W_i^0(x) Y_i,

where

    K_h(x − X_i) := K_{h_1}(x_1 − X_{i1}) × · · · × K_{h_p}(x_p − X_{ip}),
    W_i^0(x) := K_h(x − X_i) / ∑_{j=1}^n K_h(x − X_j).
∑in=1 Kh (x − Xi )

• Local linear estimator. Considering the Taylor expansion m(Xi ) ≈


m(x) + ∇m(x)(Xi − x) instead of (6.18) it is possible to arrive to
the analogous of (6.21),
n

2
β̂h := arg min Yi − β0 (1, (Xi − x)0 )0 K h ( x − X i ),
β ∈R p +1i =1

and then solve the problem in the exact same way but now con-
sidering
 
1 ( X1 − x ) 0
. .. 
 ..
X :=  .


1 (Xn − x)0 n×( p+1)

and

W := diag(Kh (X1 − x), . . . , Kh (Xn − x)).

The estimate41 for m( x ) is therefore computed as


41
Recall that now the entries of β̂h are
m̂(x; 1, h) := β̂ h,0
estimating β = (m(x), ∇m(x))0 .
= e10 (X0 WX)−1 X0 WY
n
= ∑ Wi1 (x)Yi
i =1
notes for predictive modeling 269

where

Wi1 (x) := e10 (X0 WX)−1 X0 Wei .

The cross-validation bandwidth selection rule studied in Section 6.2.4 extends neatly to the multivariate case,

    CV(h) := (1/n) ∑_{i=1}^n (Y_i − m̂_{−i}(X_i; p, h))^2,
    ĥ_CV := arg min_{h_1,...,h_p>0} CV(h),

although with obvious complications in the optimization of CV(h).


Importantly, the trick described in Proposition 6.1 also holds with
obvious modifications.
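For concreteness, a minimal sketch of the local constant estimator with product kernels follows; mNWmult is a hypothetical helper, and np::npreg below is the tool intended for real use.

# Local constant estimator with product kernels (the 1/h_j factors cancel in
# the ratio, so they are omitted); x is a matrix of evaluation points
mNWmult <- function(x, X, Y, h, K = dnorm) {
  apply(x, 1, function(x0) {
    Kx <- K(sweep(sweep(X, 2, x0, "-"), 2, h, "/"))  # K((X_ij - x0_j) / h_j)
    W <- apply(Kx, 1, prod)                          # Product kernel weights
    sum(W * Y) / sum(W)
  })
}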
Let’s see an application of multivariate kernel regression for the
wine dataset.

# Employing the wine dataset


# wine <- read.table(file = "wine.csv", header = TRUE, sep = ",")

# Bandwidth by CV for local linear estimator -- a product kernel with 4 bandwidths


# Employs 4 random starts for minimizing the CV surface
out <- capture.output(
bwWine <- np::npregbw(formula = Price ~ Age + WinterRain + AGST + HarvestRain,
data = wine, regtype = "ll")
)
bwWine
##
## Regression Data (27 observations, 4 variable(s)):
##
## Age WinterRain AGST HarvestRain
## Bandwidth(s): 15217744 164.468 0.8698265 190414416
##
## Regression Type: Local-Linear
## Bandwidth Selection Method: Least Squares Cross-Validation
## Formula: Price ~ Age + WinterRain + AGST + HarvestRain
## Bandwidth Type: Fixed
## Objective Function Value: 0.08462009 (achieved on multistart 4)
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 4
# capture.output() to remove the multistart messages

# Regression
fitWine <- np::npreg(bwWine)
summary(fitWine)
##
## Regression Data: 27 training points, in 4 variable(s)
## Age WinterRain AGST HarvestRain
## Bandwidth(s): 15217744 164.468 0.8698265 190414416
##
## Kernel Regression Estimator: Local-Linear
## Bandwidth Type: Fixed
## Residual standard error: 0.1947015
## R-squared: 0.9038531
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 4

# Plot marginal effects of each predictor on the response


plot(fitWine)

# Therefore:
# - Age is positively related with Price (almost linearly)
# - WinterRain is positively related with Price (with a subtle nonlinearity)
# - AGST is positively related with Price, but now we see what it looks like a
# quadratic pattern
# - HarvestRain is negatively related with Price (almost linearly)

The R2 outputted by the summary of np::npreg is defined
as

  R² := (∑_{i=1}^n (Yi − Ȳ)(Ŷi − Ȳ))² / [(∑_{i=1}^n (Yi − Ȳ)²)(∑_{i=1}^n (Ŷi − Ȳ)²)]

and is neither the r²_{yŷ} (because it is not guaranteed
that the mean of the Ŷi's equals Ȳ!) nor "the percentage
of variance explained" by the model – this interpretation
only makes sense within the linear model context. It is,
however, a quantity in [0, 1] that attains R² = 1 when the
fit is perfect.
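
This definition can be checked numerically against the summary output. The snippet below is just a sanity check and assumes that wine and fitWine from the previous chunks are in the workspace.

# Numerical check of the R-squared reported by np::npreg for fitWine
Y <- wine$Price
Y_hat <- fitted(fitWine)
sum((Y - mean(Y)) * (Y_hat - mean(Y)))^2 /
  (sum((Y - mean(Y))^2) * sum((Y_hat - mean(Y))^2))
# Compare with the R-squared shown in summary(fitWine)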

Non-continuous variables can be taken into account by defin-


ing suitably adapted kernels. The two main possibilities for non-
continuous data are:

• Categorical or unordered discrete variables. For example, iris$Species
  is a categorical variable in which ordering does not make sense.
  These variables are specified in R by factor. Due to the lack of
  ordering, the basic mathematical operation behind a kernel, a
  distance computation⁴², is senseless. That motivates the Aitchison
  and Aitken (1976) kernel.
  ⁴² Recall Kh(x − Xi).
  Assume that the categorical random variable Xd has cd different
  levels. Then, it can be represented as Xd ∈ Cd := {0, 1, . . . , cd − 1}.
  For xd, Xd ∈ Cd, the Aitchison and Aitken (1976) unordered
  discrete kernel is

    l(xd, Xd; λ) = 1 − λ           if xd = Xd,
                   λ / (cd − 1)    if xd ≠ Xd,

  where λ ∈ [0, (cd − 1)/cd] is the bandwidth.

• Ordinal or ordered discrete variables. For example, wine$Year is


a discrete variable with a clear order, but it is not continuous.
These variables are specified by ordered (an ordered factor). In
these variables there is ordering, but distances are discrete.
If the ordered discrete random variable Xd can take cd differ-
ent ordered values, then it can be represented as Xd ∈ Cd :=
{0, 1, . . . , cd − 1}. For xd , Xd ∈ Cd , a possible (Li and Racine, 2007)
ordered discrete kernel is

    l(xd, Xd; λ) = λ^{|xd − Xd|},

  where λ ∈ [0, 1] is the bandwidth. (Both discrete kernels are sketched in R right after this list.)
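
A minimal R implementation of the two kernels just defined (the plain textbook forms, not the slight variants that np actually employs; the function names are ad hoc):

# Aitchison and Aitken (1976) kernel for an unordered variable with c_d levels
kernel_unordered <- function(x_d, X_d, lambda, c_d) {
  ifelse(x_d == X_d, 1 - lambda, lambda / (c_d - 1))
}

# Li and Racine (2007) kernel for an ordered variable coded as 0, 1, ..., c_d - 1
kernel_ordered <- function(x_d, X_d, lambda) {
  lambda^abs(x_d - X_d)
}

# Weights assigned to the levels 0:4 when the evaluation point is x_d = 2
kernel_unordered(x_d = 2, X_d = 0:4, lambda = 0.3, c_d = 5)
kernel_ordered(x_d = 2, X_d = 0:4, lambda = 0.3)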

The np package employs a variation of the previous kernels. The


following examples illustrate their use.
# Bandwidth by CV for local linear estimator
# Recall that Species is a factor!
out <- capture.output(
bwIris <- np::npregbw(formula = Petal.Length ~ Sepal.Width + Species,
data = iris, regtype = "ll")
)
bwIris
##
## Regression Data (150 observations, 2 variable(s)):
##
## Sepal.Width Species
## Bandwidth(s): 898696.6 2.357536e-07
##
## Regression Type: Local-Linear
## Bandwidth Selection Method: Least Squares Cross-Validation
## Formula: Petal.Length ~ Sepal.Width + Species
## Bandwidth Type: Fixed
## Objective Function Value: 0.1541057 (achieved on multistart 1)
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 1
##
## Unordered Categorical Kernel Type: Aitchison and Aitken
## No. Unordered Categorical Explanatory Vars.: 1
# Product kernel with 2 bandwidths

# Regression
fitIris <- np::npreg(bwIris)
summary(fitIris)
##
## Regression Data: 150 training points, in 2 variable(s)
## Sepal.Width Species
## Bandwidth(s): 898696.6 2.357536e-07
##
## Kernel Regression Estimator: Local-Linear
## Bandwidth Type: Fixed
## Residual standard error: 0.3775005
## R-squared: 0.9539633
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 1
##
## Unordered Categorical Kernel Type: Aitchison and Aitken
## No. Unordered Categorical Explanatory Vars.: 1

# Plot marginal effects of each predictor on the response


par(mfrow = c(1, 2))
plot(fitIris, plot.par.mfrow = FALSE)
[Figure: marginal effects of Sepal.Width and Species on Petal.Length.]

# Options for the plot method for np::npreg available at ?np::npplot

# Example from ?np::npreg: modelling of the GDP growth of a country from


# economic indicators of the country
# The predictors contain a mix of unordered, ordered, and continuous variables

# Load data
data(oecdpanel, package = "np")

# Bandwidth by CV for local constant -- use only two starts to reduce the
# computation time
out <- capture.output(
bwOECD <- np::npregbw(formula = growth ~ factor(oecd) + ordered(year) +
initgdp + popgro + inv + humancap, data = oecdpanel,
regtype = "lc", nmulti = 2)
)
bwOECD
##
## Regression Data (616 observations, 6 variable(s)):
##
## factor(oecd) ordered(year) initgdp popgro inv humancap
## Bandwidth(s): 0.02414207 0.8944302 0.1940907 698479 0.09140916 1.223215
##
## Regression Type: Local-Constant
## Bandwidth Selection Method: Least Squares Cross-Validation
## Formula: growth ~ factor(oecd) + ordered(year) + initgdp + popgro + inv +
## humancap
## Bandwidth Type: Fixed
## Objective Function Value: 0.0006545946 (achieved on multistart 2)
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 4
##
## Unordered Categorical Kernel Type: Aitchison and Aitken
## No. Unordered Categorical Explanatory Vars.: 1
##
## Ordered Categorical Kernel Type: Li and Racine
## No. Ordered Categorical Explanatory Vars.: 1

# Regression
fitOECD <- np::npreg(bwOECD)
summary(fitOECD)
##
## Regression Data: 616 training points, in 6 variable(s)
## factor(oecd) ordered(year) initgdp popgro inv humancap
## Bandwidth(s): 0.02414207 0.8944302 0.1940907 698479 0.09140916 1.223215


##
## Kernel Regression Estimator: Local-Constant
## Bandwidth Type: Fixed
## Residual standard error: 0.01814205
## R-squared: 0.6767848
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 4
##
## Unordered Categorical Kernel Type: Aitchison and Aitken
## No. Unordered Categorical Explanatory Vars.: 1
##
## Ordered Categorical Kernel Type: Li and Racine
## No. Ordered Categorical Explanatory Vars.: 1

# Plot marginal effects of each predictor on the response


par(mfrow = c(2, 3))
plot(fitOECD, plot.par.mfrow = FALSE)
[Figure: marginal effects of factor(oecd), ordered(year), initgdp, popgro, inv, and humancap on growth.]

6.4 Prediction and confidence intervals

The prediction of the conditional response m(x) = E[Y |X = x] with


the local polynomial estimator reduces to evaluating m̂(x; p, h). The
fitted values are, therefore, Ŷi := m̂(Xi ; p, h), i = 1, . . . , n. The np
package has methods to perform these operations via the predict
and fitted functions.
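
For instance, with the fitWine object from the previous section (a quick illustration):

# Fitted values and predictions from a np::npreg fit
head(fitted(fitWine))
predict(fitWine, newdata = wine[1:3, ])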
More interesting is the discussion about the uncertainty of m̂(x; p, h)
and, as a consequence, of the predictions. Differently from what
happened in parametric models, in nonparametric regression there is
no parametric distribution of the response that can help to carry
out the inference and, consequently, to address the uncertainty of
the estimation. Because of this, it is required to resort to somewhat
convoluted asymptotic expressions⁴³ or to a bootstrap resampling
procedure. The default one in np is the so-called wild bootstrap (Liu,
1988), which is a resampling procedure particularly well-suited for
regression problems with potential heteroskedasticity.
⁴³ Precisely, the ones associated with the asymptotic normality of m̂(x; p, h)
and that are based on the results of Theorem 6.1.
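
The core idea of the wild bootstrap can be sketched in a few lines (a simplified illustration of the general principle, not np's exact implementation): each residual is perturbed by an independent variable with zero mean and unit variance, so the conditional variance of each observation is preserved.

# Sketch of one wild bootstrap resample: Y_i^* = m_hat(X_i) + e_i * V_i,
# with V_i iid, E[V_i] = 0, Var[V_i] = 1 (Mammen's two-point distribution)
wild_resample <- function(Y, Y_hat) {
  e <- Y - Y_hat
  p_neg <- (5 + sqrt(5)) / 10  # probability of the value (1 - sqrt(5)) / 2
  V <- ifelse(runif(length(e)) < p_neg, (1 - sqrt(5)) / 2, (1 + sqrt(5)) / 2)
  Y_hat + e * V
}
# Conceptual usage: refit the estimator on many such resamples and take
# pointwise quantiles of the resulting curves to build confidence bands
# Y_star <- wild_resample(Y = oecdpanel$growth, Y_hat = fitted(fitOECD))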
# Asymptotic confidence bands for the marginal effects of each predictor on the
# response
par(mfrow = c(2, 3))
plot(fitOECD, plot.errors.method = "asymptotic", common.scale = FALSE,
plot.par.mfrow = FALSE)
[Figure: marginal effects of each predictor on growth, with asymptotic confidence bands.]

# Bootstrap confidence bands


# They take more time to compute because a resampling + refitting takes place
par(mfrow = c(2, 3))
plot(fitOECD, plot.errors.method = "bootstrap", plot.par.mfrow = FALSE)

# The asymptotic standard errors associated with the regression evaluated at the


# evaluation points are in $merr
head(fitOECD$merr)
## [1] 0.004534564 0.006104577 0.001558259 0.002401120 0.006213855 0.005274232

# Recall that in $mean we had the regression evaluated at the evaluation points,
# by default the sample of the predictors, so in this case the same as the
# fitted values
head(fitOECD$mean)
## [1] 0.02877775 0.02113485 0.03592716 0.04027500 0.01099441 0.03888302

# Prediction for the first 3 points + standard errors


pred <- predict(fitOECD, newdata = oecdpanel[1:3, ], se.fit = TRUE)

[Figure: marginal effects of each predictor on growth, with bootstrap confidence bands.]

# Approximate (based on assuming asymptotic normality) 100(1 - alpha)% CI for
# the conditional mean of the first 3 points


alpha <- 0.05
pred$fit + (qnorm(1 - alpha / 2) * pred$se.fit) %o% c(-1, 1)
## [,1] [,2]
## [1,] 0.019890170 0.03766533
## [2,] 0.009170096 0.03309960
## [3,] 0.032873032 0.03898129

Extrapolation with kernel regression estimators is par-


ticularly dangerous. Keep in mind that the kernel esti-
mators smooth the data in order to estimate m, the trend.
If there is no data close to the point x at which m(x)
is estimated, then the estimation will heavily depend
on the closest data point to x. If employing compactly-
supported kernels (so that they can take the value 0
exactly), the estimate may not even be properly defined.
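
The following toy example (hand-made estimator, simulated data) makes the warning concrete: with a compactly-supported Epanechnikov kernel, the Nadaraya–Watson weights are exactly zero far from the data and the estimate becomes 0/0.

# Nadaraya-Watson estimator with an Epanechnikov kernel: NaN when extrapolating
K_epa <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)
nw <- function(x, X, Y, h) sum(K_epa((x - X) / h) * Y) / sum(K_epa((x - X) / h))
set.seed(1)
X <- runif(100, 0, 1)
Y <- X^2 + rnorm(100, sd = 0.1)
nw(x = 0.5, X = X, Y = Y, h = 0.1)  # fine, plenty of data nearby
nw(x = 2.0, X = X, Y = Y, h = 0.1)  # 0 / 0 = NaN, no data in the kernel support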

6.5 Local likelihood

We explore in this section an extension of the local polynomial
estimator that mimics the expansion that generalized linear models
made of linear models. This extension aims to estimate the
regression function by relying on the likelihood, rather than on
least squares. Thus, the idea behind the local likelihood is to fit,
locally, parametric models by maximum likelihood.
We begin by seeing that local likelihood using the linear
model is equivalent to local polynomial modelling. Theorem A.1
showed that, under the assumptions given in Section 2.3, the maxi-
mum likelihood estimate of β in the linear model

Y |( X1 , . . . , X p ) ∼ N ( β 0 + β 1 X1 + . . . + β p X p , σ2 ) (6.29)

was equivalent to the least squares estimate, β̂ = (X0 X)−1 X0 Y. The


reason was the form of the conditional (on X1 , . . . , X p ) likelihood:
  ℓ(β) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^n (Yi − β0 − β1 Xi1 − . . . − βp Xip)².

If there is a single predictor X, polynomial fitting of order p of
the conditional mean can be achieved by the well-known trick of
identifying the j-th predictor Xj in (6.29) with X^j. This results in

  Y | X ∼ N(β0 + β1 X + . . . + βp X^p, σ²).     (6.30)

Therefore, the weighted log-likelihood of the linear model (6.30)


around x is
  ℓ_{x,h}(β) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^n (Yi − β0 − β1 Xi − . . . − βp Xi^p)² Kh(x − Xi).     (6.31)

Maximizing with respect to β the local log-likelihood (6.31) provides


β̂ 0 = m̂( x; p, h), the local polynomial estimator, as it was obtained in
(6.21), but now from a likelihood-based perspective. The key point
is to realize that the very same idea can be applied to the family of
generalized linear models.
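
Before moving to the logistic case, this equivalence is easy to verify numerically: maximizing (6.31) is a weighted least squares problem, so a kernel-weighted lm fit of Y on powers of (X − x) returns m̂(x; p, h) as its intercept. The data below are simulated just for this check.

# Local linear estimate at x as the intercept of a kernel-weighted lm fit
set.seed(1)
n <- 100
X <- runif(n, -2, 2)
Y <- sin(X) + rnorm(n, sd = 0.25)
x <- 0.5
h <- 0.3
K <- dnorm(x = x, mean = X, sd = h)  # kernel weights K_h(x - X_i)
lm(Y ~ I(X - x), weights = K)$coefficients[1]  # m_hat(x; 1, h)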
We illustrate the local likelihood principle for the logistic regres-
sion. In this case, {( Xi , Yi )}in=1 with

Yi | Xi ∼ Ber(logistic(η ( Xi ))), i = 1, . . . , n,

with the polynomial term⁴⁴

  η(x) := β0 + β1 x + . . . + βp x^p.

⁴⁴ If p = 1, then we have the usual simple logistic model.

The log-likelihood of β is
  ℓ(β) = ∑_{i=1}^n {Yi log(logistic(η(Xi))) + (1 − Yi) log(1 − logistic(η(Xi)))}
       = ∑_{i=1}^n ℓ(Yi, η(Xi)),

where we consider the log-likelihood addend `(y, η ) = yη − log(1 +


eη ), and make explicit the dependence on η ( x ) for clarity in the
next developments, and implicit the dependence on β.
The local log-likelihood of β around x is then

  ℓ_{x,h}(β) := ∑_{i=1}^n ℓ(Yi, η(Xi − x)) Kh(x − Xi).     (6.32)
Figure 6.7: Construction of the local


likelihood estimator. The animation
shows how local likelihood fits in a
neighborhood of x are combined to
provide an estimate of the regression
function for binary response, which
depends on the polynomial degree,
bandwidth, and kernel (gray density
at the bottom). The data points are
shaded according to their weights for
the local fit at x. Application available
here.

Maximizing45 the local log-likelihood (6.32) with respect to β pro-


vides

  β̂h = arg max_{β ∈ R^{p+1}} ℓ_{x,h}(β).

The local likelihood estimate of η ( x ) is

η̂ ( x ) := β̂ h,0 .

Note that the dependence of β̂ 0 on x and h is omitted. From η̂ ( x ),


we can obtain the local logistic regression evaluated at x as

m̂` ( x; h, p) := g−1 (η̂ ( x )) = logistic( β̂ h,0 ). (6.33)

Each evaluation of m̂` ( x; h, p) in a different x requires, thus, a


weighted fit of the underlying logistic model.
The code below shows three different ways of implementing the
local logistic regression (of first degree) in R.
# Simulate some data
n <- 200
logistic <- function(x) 1 / (1 + exp(-x))
p <- function(x) logistic(1 - 3 * sin(x))
set.seed(123456)
X <- runif(n = n, -3, 3)
Y <- rbinom(n = n, size = 1, prob = p(X))

# Set bandwidth and evaluation grid


h <- 0.25
x <- seq(-3, 3, l = 501)

# Approach 1: optimize the weighted log-likelihood through the workhorse


# function underneath glm, glm.fit
suppressWarnings(
fitGlm <- sapply(x, function(x) {
K <- dnorm(x = x, mean = X, sd = h)
glm.fit(x = cbind(1, X - x), y = Y, weights = K,
family = binomial())$coefficients[1]
})
)

# Approach 2: optimize directly the weighted log-likelihood


suppressWarnings(
fitNlm <- sapply(x, function(x) {
K <- dnorm(x = x, mean = X, sd = h)
nlm(f = function(beta) {
-sum(K * (Y * (beta[1] + beta[2] * (X - x)) -
log(1 + exp(beta[1] + beta[2] * (X - x)))))
}, p = c(0, 0))$estimate[1]
})
)

# Approach 3: employ locfit::locfit


# Bandwidth can not be controlled explicitly -- only through nn in ?lp
fitLocfit <- locfit::locfit(Y ~ locfit::lp(X, deg = 1, nn = h),
family = "binomial", kern = "gauss")

# Compare fits
plot(x, p(x), ylim = c(0, 1.5), type = "l", lwd = 2)
lines(x, logistic(fitGlm), col = 2)
lines(x, logistic(fitNlm), col = 3, lty = 2)
plot(fitLocfit, add = TRUE, col = 4)
legend("topright", legend = c("p(x)", "glm", "nlm", "locfit"), lwd = 2,
col = c(1, 2, 3, 4), lty = c(1, 1, 2, 1))

[Figure: comparison of the true p(x) with the glm, nlm, and locfit local logistic fits.]

Bandwidth selection can be done by means of likelihood cross-
validation. The objective is to maximize the local likelihood fit at
(Xi, Yi) but removing the influence of the datum itself. That is,
maximizing

  LCV(h) = ∑_{i=1}^n ℓ(Yi, η̂_{−i}(Xi)),     (6.34)

where η̂_{−i}(Xi) represents the local fit at Xi without the i-th datum
(Xi, Yi). Unfortunately, the nonlinearity of (6.33) precludes a
simplifying result such as Proposition 6.1. Thus, in principle⁴⁶, it is
required to fit n local likelihoods, each with sample size n − 1, to
obtain a single evaluation of (6.34).
⁴⁶ The interested reader is referred to Sections 4.3.3 and 4.4.3 of Loader
(1999) for an approximation of (6.34) that only requires a local likelihood
fit for a single sample.
We conclude by illustrating how to compute the LCV function
and optimize it (keep in mind that much more efficient implemen-
tations are possible!).
# Exact LCV -- recall that we *maximize* the LCV!
h <- seq(0.1, 2, by = 0.1)
suppressWarnings(
LCV <- sapply(h, function(h) {
sum(sapply(1:n, function(i) {
K <- dnorm(x = X[i], mean = X[-i], sd = h)
nlm(f = function(beta) {
-sum(K * (Y[-i] * (beta[1] + beta[2] * (X[-i] - X[i])) -
log(1 + exp(beta[1] + beta[2] * (X[-i] - X[i])))))
}, p = c(0, 0))$minimum
}))
})
)
plot(h, LCV, type = "o")
abline(v = h[which.max(LCV)], col = 2)
A  Further topics

A.1 Informal review on hypothesis testing

The process of hypothesis testing has an interesting analogy with


a trial. The analogy helps understanding the elements present in a
formal hypothesis test in an intuitive way.

Hypothesis test               Trial

Null hypothesis H0            The defendant: an individual accused of
                              committing a crime. He is backed up by the
                              presumption of innocence, which means that he
                              is not guilty until there are enough evidences
                              to support his guilt.

Sample X1, . . . , Xn         Collection of small evidences supporting
                              innocence and guilt of the defendant. These
                              evidences contain a certain degree of
                              uncontrollable randomness due to how they were
                              collected and the context regarding the case¹.

Test statistic² Tn            Summary of the evidences presented by the
                              prosecutor and the defense lawyer.

Distribution of Tn under H0   The judge conducting the trial. Evaluates and
                              measures the evidence presented by both sides
                              and presents a verdict for the defendant.

Significance level α          1 − α is the strength of evidences required by
                              the judge for condemning the defendant. The
                              judge allows evidences that, on average,
                              condemn 100α% of the innocents, due to the
                              randomness inherent to the evidence collection
                              process. α = 0.05 is considered to be a
                              reasonable level³.

p-value                       Decision of the judge that measures the degree
                              of compatibility, on a scale 0–1, of the
                              presumption of innocence with the summary of
                              the evidences presented. If p-value < α, the
                              defendant is declared guilty. Otherwise, he is
                              declared not guilty.

H0 is rejected                The defendant is declared guilty: there are
                              strong evidences supporting his guilt.

H0 is not rejected            The defendant is declared not guilty: either he
                              is innocent or there are not enough evidences
                              supporting his guilt.

¹ Think about phenomena that may randomly support innocence or guilt of the
defendant, irrespective of his true condition. For example: spurious
coincidences ("happen to be in the wrong place at the wrong time"), loss of
evidences during the case, previous statements of the defendant, doubtful
identification by a witness, imprecise witness testimonies, unverifiable
alibi, etc.
² Usually simply referred to as statistic.
³ As the judge has to have the power of condemning a guilty defendant.
Setting α = 0 (no innocents are declared guilty) would result in a judge that
systematically declares everybody not guilty. Therefore, a compromise is
needed.

More formally, the p-value of a hypothesis test about H0 is
defined as:

The p-value is the probability of obtaining a test statistic more
unfavourable to H0 than the observed one, assuming that H0 is true.

Therefore, if the p-value is small (smaller than the chosen level


α), it is unlikely that the evidences against H0 are due to ran-
domness. As a consequence, H0 is rejected. If the p-value is large
(larger than α), then it is more possible that the evidences against
H0 are merely due to the randomness of the data. In this case, we
do not reject H0 .

If H0 holds, then the p-value (which is a random variable) is
distributed uniformly in (0, 1). If H0 does not hold, then the
distribution of the p-value is not uniform but concentrated at 0
(where the rejections of H0 take place).

Let's quickly illustrate the previous fact with the well-known
Kolmogorov–Smirnov test. This test evaluates whether the unknown
cdf of X, F, equals a specified cdf F0. In other words, it tests the
null hypothesis⁴

  H0 : F = F0

versus the alternative hypothesis⁵

  H1 : F ≠ F0.

⁴ Understood as F(x) = F0(x) for all x ∈ R. For F ≠ F0, we mean that
F(x) ≠ F0(x) for at least one x ∈ R.
⁵ Formally, a null hypothesis H0 is tested against an alternative hypothesis
H1. The concept of alternative hypothesis was not addressed in the trial
analogy for the sake of simplicity, but you may think of H1 as the defendant
being not guilty or as the plaintiff or complainant. The alternative
hypothesis H1 represents the "alternative truth" to H0 and the test decides
between H0 and H1, only rejecting H0 in favor of H1 if there are enough
evidences on the data against H0 or supporting H1. Obviously, H0 ∩ H1 = ∅.
But recall also that H1 ⊂ ¬H0 or, in other words, H1 may be more restrictive
than the opposite of H0.

For that purpose, given a sample X1 , . . . , Xn of X, the Kolmogorov–


Smirnov statistic Dn is computed:

  Dn := √n sup_{x ∈ R} |Fn(x) − F0(x)| = max(Dn+, Dn−),     (A.1)

  Dn+ := √n max_{1≤i≤n} (i/n − U(i)),
  Dn− := √n max_{1≤i≤n} (U(i) − (i − 1)/n),

where Fn represents the empirical cdf of X1, . . . , Xn and U(j) stands
for the j-th sorted Ui := F0(Xi), i = 1, . . . , n. If H0 holds, then Dn
tends to be small. Conversely, when F ≠ F0, larger values of Dn are
expected, so the test rejects when Dn is large.
If H0 holds, then Dn has an asymptotic⁶ cdf given by the
Kolmogorov–Smirnov's K function:
⁶ When the sample size n is large: n → ∞.

  lim_{n→∞} P[Dn ≤ x] = K(x) := 1 − 2 ∑_{m=1}^∞ (−1)^{m−1} e^{−2m²x²}.     (A.2)

Luckily, the test statistic Dn, the asymptotic cdf K, and the
associated asymptotic p-value⁷ are readily implemented in R through
the ks.test function.
⁷ Which is lim_{n→∞} P[Dn > dn] = 1 − K(dn), where dn is the observed
statistic and Dn is the random variable in (A.1).

Implement the Kolmogorov–Smirnov test from the equa-


tions above. This amounts to:

• Provide a function for computing the test statistic


(A.1) from a sample X1 , . . . , Xn and a cdf F0 .
• Implement the K function (A.2).
• Call the previous functions from a routine that returns
the asymptotic p-value of the test.

Check that the implementations coincide with the
ones of the ks.test function when exact = FALSE. Note:
ks.test computes Dn/√n instead of Dn.

# Sample data from a N(0, 1)


set.seed(3245678)
n <- 50
x <- rnorm(n)

# Kolmogorov-Smirnov test for H_0 : F = N(0, 1). Does not reject.


ks.test(x, "pnorm")
##
## One-sample Kolmogorov-Smirnov test
##
## data: x
## D = 0.050298, p-value = 0.9989
## alternative hypothesis: two-sided

# Simulation of p-values when H_0 is true


M <- 1e4
pValues_H0 <- sapply(1:M, function(i) {
x <- rnorm(n) # N(0, 1)
ks.test(x, "pnorm")$p.value
})
# Simulation of p-values when H_0 is false -- the data does not


# come from a N(0, 1) but from a N(0, 1.5)
pValues_H1 <- sapply(1:M, function(i) {
x <- rnorm(n, mean = 0, sd = sqrt(1.5)) # N(0, 1.5)
ks.test(x, "pnorm")$p.value
})

# Comparison of p-values
par(mfrow = 1:2)
hist(pValues_H0, breaks = seq(0, 1, l = 20), probability = TRUE,
main = expression(H[0]), ylim = c(0, 2.5))
abline(h = 1, col = 2)
hist(pValues_H1, breaks = seq(0, 1, l = 20), probability = TRUE,
main = expression(H[1]), ylim = c(0, 2.5))
abline(h = 1, col = 2)

Figure A.1: Comparison of the distribution of p-values under H0 and H1 for
the Kolmogorov–Smirnov test. Observe that the frequency of low p-values,
associated with the rejection of H0, grows when H0 does not hold. Under H0,
the distribution of the p-values is uniform.

A.2 Least squares and maximum likelihood estimation

Least squares had a prominent role in linear models. In a certain
sense, this is strange. After all, it is a purely geometrical argument
for fitting a plane to a cloud of points, and therefore it seems not
to rely on any statistical grounds for estimating the unknown
parameters β.
However, as we will see, least squares estimation is equivalent
to maximum likelihood estimation under the assumptions of the
model seen in Section 2.3⁸. So maximum likelihood estimation,
the most well-known statistical estimation method, is behind least
squares if the assumptions of the model hold.
⁸ Normality is especially important here due to the squares present in the
exponential of the normal pdf.
First, recall that given the sample {(Xi , Yi )}in=1 , due to the as-
sumptions introduced in Section 2.3, we have that:

Yi |( Xi1 = xi1 , . . . , Xip = xip ) ∼ N ( β 0 + β 1 xi1 + . . . + β p xip , σ2 ),

with Y1 , . . . , Yn being independent conditionally on the sample of


predictors. Equivalently stated in a compact matrix way:

Y|X ∼ Nn (Xβ, σ2 I).

From these two equations we can obtain the log-likelihood function
of Y1, . . . , Yn conditionally⁹ on X1, . . . , Xn as
⁹ We assume that the randomness is on the response only.

  ℓ(β) = log φ(Y; Xβ, σ²I) = ∑_{i=1}^n log φ(Yi; (Xβ)i, σ).     (A.3)

Maximization of (A.3) with respect to β gives the maximum likeli-


hood estimator β̂ML .
Now we are ready to show the next result.

Theorem A.1. Under the assumptions i–iv in Section 2.3, the maximum
likelihood estimate of β is the least squares estimate (2.7):

  β̂ML = arg max_{β ∈ R^{p+1}} ℓ(β) = (X′X)⁻¹X′Y.

Proof. Expanding the first equality at (A.3) gives¹⁰
¹⁰ Recall that |σ²I|^{1/2} = σⁿ.

  ℓ(β) = −log((2π)^{n/2} σⁿ) − (1/(2σ²)) (Y − Xβ)′(Y − Xβ).

In order to differentiate with respect to β, we use that, for two


vector-valued functions f and g:

  ∂(Ax)/∂x = A   and   ∂(f(x)′g(x))/∂x = f(x)′ ∂g(x)/∂x + g(x)′ ∂f(x)/∂x.

Then, differentiating with respect to β and equating to zero gives

  (1/σ²) (Y − Xβ)′X = (1/σ²) (Y′X − β′X′X) = 0.

This means that optimizing ℓ does not require knowledge of σ²!
This is a very convenient fact that allows us to solve the above
equation, yielding

  β̂ = (X′X)⁻¹X′Y.
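
The theorem is easy to check numerically. The snippet below (with simulated data chosen just for this check) compares the lm coefficients, the closed-form solution, and a direct numerical maximization of (A.3); the three essentially coincide, up to the tolerance of the optimizer.

# Least squares, closed form, and maximum likelihood yield the same estimate
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))
beta <- c(1, 2, -1)
Y <- drop(X %*% beta) + rnorm(n)
coef(lm(Y ~ 0 + X))                   # least squares via lm
solve(crossprod(X), crossprod(X, Y))  # (X'X)^{-1} X'Y
# Numerical maximization of (A.3); sigma is fixed at 1, which does not
# affect the maximizer in beta
optim(par = rep(0, 3),
      fn = function(b) -sum(dnorm(Y, mean = drop(X %*% b), sd = 1, log = TRUE)))$par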

A final comment on the benefits of relying on maximum likeli-


hood estimation follows.

Maximum likelihood estimation is asymptotically opti-


mal when estimating unknown parameters of a model.
This is a very appealing property that means that, when
the sample size n is large, it is guaranteed to perform
better than any other estimation method, where better is
understood in terms of the mean squared error.
A.3 Multinomial logistic regression

The logistic model seen in Section 5.2.1 can be generalized to cat-


egorical variables Y with more than two possible levels, namely
{1, . . . , J }. Given the predictors X1 , . . . , X p , multinomial logistic re-
gression models the probability of each level j of Y by

  p_j(x) := P[Y = j | X1 = x1, . . . , Xp = xp]
          = e^{β0j + β1j X1 + ... + βpj Xp} / (1 + ∑_{l=1}^{J−1} e^{β0l + β1l X1 + ... + βpl Xp})     (A.4)

for j = 1, . . . , J − 1 and (for the last level J)

  p_J(x) := P[Y = J | X1 = x1, . . . , Xp = xp]
          = 1 / (1 + ∑_{l=1}^{J−1} e^{β0l + β1l X1 + ... + βpl Xp}).     (A.5)

Note that (A.4) and (A.5) imply that ∑_{j=1}^J p_j(x) = 1 and that there
are (J − 1) × (p + 1) coefficients¹¹. Also, (A.5) reveals that the
last level, J, is given a different treatment. This is because it is the
reference level (it could be a different one, but it is the tradition to
choose the last one).
¹¹ (J − 1) intercepts and (J − 1) × p slopes.
The multinomial logistic model has an interesting interpretation
in terms of logistic regressions. Taking the quotient between (A.4)
and (A.5) gives

  p_j(x) / p_J(x) = e^{β0j + β1j X1 + ... + βpj Xp}     (A.6)

for j = 1, . . . , J − 1. Therefore, applying a logarithm to both sides
we have:

  log(p_j(x) / p_J(x)) = β0j + β1j X1 + . . . + βpj Xp.     (A.7)

This equation is indeed very similar to (5.8). If J = 2, it is the same


up to a change in the codes for the levels: the logistic regression
giving the probability of Y = 1 versus Y = 2. On the LHS of (A.7)
we have the logarithm of the ratio of two probabilities and on the
RHS a linear combination of the predictors. If the probabilities on
the LHS were complementary (if they added up to one), then we
would have a log-odds and hence a logistic regression for Y. This
is not the situation, but it is close: instead of odds and log-odds,
we have ratios and log-ratios of non-complementary probabilities.
Also, it gives good insight into what the multinomial logistic
regression is: a set of J − 1 independent logistic regressions for the
probability of Y = j versus the probability of the reference Y = J.
Equation (A.6) also gives an interpretation of the coefficients of the
model, since

  p_j(x) = e^{β0j + β1j X1 + ... + βpj Xp} p_J(x).

Therefore:
• e^{β0j}: is the ratio p_j(0)/p_J(0) between the probabilities of Y = j
  and Y = J when X1 = . . . = Xp = 0. If e^{β0j} > 1 (equivalently,
  β0j > 0), then Y = j is more likely than Y = J. If e^{β0j} < 1
  (β0j < 0), then Y = j is less likely than Y = J.
• e^{βlj}, l ≥ 1: is the multiplicative increment of the ratio
  p_j(x)/p_J(x) for an increment of one unit in Xl = xl, provided
  that the remaining variables X1, . . . , Xl−1, Xl+1, . . . , Xp do not
  change. If e^{βlj} > 1 (equivalently, βlj > 0), then Y = j becomes
  more likely than Y = J for each increment in Xl. If e^{βlj} < 1
  (βlj < 0), then Y = j becomes less likely than Y = J.

The following code illustrates how to compute a basic multino-


mial regression employing the nnet package.

# Data from the voting intentions in the 1988 Chilean national plebiscite
data(Chile, package = "carData")
summary(Chile)
## region population sex age education income statusquo vote
## C :600 Min. : 3750 F:1379 Min. :18.00 P :1107 Min. : 2500 Min. :-1.80301 A :187
## M :100 1st Qu.: 25000 M:1321 1st Qu.:26.00 PS : 462 1st Qu.: 7500 1st Qu.:-1.00223 N :889
## N :322 Median :175000 Median :36.00 S :1120 Median : 15000 Median :-0.04558 U :588
## S :718 Mean :152222 Mean :38.55 NA’s: 11 Mean : 33876 Mean : 0.00000 Y :868
## SA:960 3rd Qu.:250000 3rd Qu.:49.00 3rd Qu.: 35000 3rd Qu.: 0.96857 NA’s:168
## Max. :250000 Max. :70.00 Max. :200000 Max. : 2.04859
## NA’s :1 NA’s :98 NA’s :17
# vote is a factor with levels A (abstention), N (against Pinochet),
# U (undecided), Y (for Pinochet)

# Fit of the model done by multinom: Response ~ Predictors


# It is an iterative procedure (maxit sets the maximum number of iterations)
# Read the documentation in ?multinom for more information
mod1 <- nnet::multinom(vote ~ age + education + statusquo, data = Chile,
maxit = 1e3)
## # weights: 24 (15 variable)
## initial value 3476.826258
## iter 10 value 2310.201176
## iter 20 value 2135.385060
## final value 2132.416452
## converged

# Each row of coefficients gives the coefficients of the logistic


# regression of a level versus the reference level (A)
summary(mod1)
## Call:
## nnet::multinom(formula = vote ~ age + education + statusquo,
## data = Chile, maxit = 1000)
##
## Coefficients:
## (Intercept) age educationPS educationS statusquo
## N 0.3002851 0.004829029 0.4101765 -0.1526621 -1.7583872
## U 0.8722750 0.020030032 -1.0293079 -0.6743729 0.3261418
## Y 0.5093217 0.016697208 -0.4419826 -0.6909373 1.8752190
##
## Std. Errors:
## (Intercept) age educationPS educationS statusquo
## N 0.3315229 0.006742834 0.2659012 0.2098064 0.1292517
## U 0.3183088 0.006630914 0.2822363 0.2035971 0.1059440
## Y 0.3333254 0.006915012 0.2836015 0.2131728 0.1197440
##
## Residual Deviance: 4264.833
## AIC: 4294.833

# Set a different level as the reference (N) for easing the interpretations


Chile$vote <- relevel(Chile$vote, ref = "N")
mod2 <- nnet::multinom(vote ~ age + education + statusquo, data = Chile,


maxit = 1e3)
## # weights: 24 (15 variable)
## initial value 3476.826258
## iter 10 value 2393.713801
## iter 20 value 2134.438912
## final value 2132.416452
## converged
summary(mod2)
## Call:
## nnet::multinom(formula = vote ~ age + education + statusquo,
## data = Chile, maxit = 1000)
##
## Coefficients:
## (Intercept) age educationPS educationS statusquo
## A -0.3002035 -0.00482911 -0.4101274 0.1525608 1.758307
## U 0.5720544 0.01519931 -1.4394862 -0.5217093 2.084491
## Y 0.2091397 0.01186576 -0.8521205 -0.5382716 3.633550
##
## Std. Errors:
## (Intercept) age educationPS educationS statusquo
## A 0.3315153 0.006742654 0.2658887 0.2098012 0.1292494
## U 0.2448452 0.004819103 0.2116375 0.1505854 0.1091445
## Y 0.2850655 0.005700894 0.2370881 0.1789293 0.1316567
##
## Residual Deviance: 4264.833
## AIC: 4294.833
exp(coef(mod2))
## (Intercept) age educationPS educationS statusquo
## A 0.7406675 0.9951825 0.6635657 1.1648133 5.802607
## U 1.7719034 1.0153154 0.2370495 0.5935052 8.040502
## Y 1.2326171 1.0119364 0.4265095 0.5837564 37.846937
# Some highlights:
# - intercepts do not have too much interpretation (correspond to age = 0).
# A possible solution is to center age by its mean (so age = 0 would
# represent the mean of the ages)
# - both age and statusquo increase the probability of voting Y, A or U
# with respect to voting N -> conservativeness increases with age
# - both age and statusquo increase more the probability of voting Y and U
# than A -> elderly and status quo supporters are more decided to participate
# - a PS level of education increases the probability of voting N. Same for
# a S level of education, but more prone to A

# Prediction of votes -- three profile of voters


newdata <- data.frame(age = c(23, 40, 50),
education = c("PS", "S", "P"),
statusquo = c(-1, 0, 2))

# Probabilities of belonging to each class


predict(mod2, newdata = newdata, type = "probs")
## N A U Y
## 1 0.856057623 0.064885869 0.06343390 0.01562261
## 2 0.208361489 0.148185871 0.40245842 0.24099422
## 3 0.000288924 0.005659661 0.07076828 0.92328313

# Predicted class
predict(mod2, newdata = newdata, type = "class")
## [1] N U Y
## Levels: N A U Y
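
As a sanity check, the probabilities returned by predict can be reproduced directly from the fitted coefficients via (A.4) and (A.5); the snippet assumes mod2 and newdata from the chunks above, with N as the reference level.

# Reproduce predict(mod2, newdata, type = "probs") from (A.4) and (A.5)
B <- coef(mod2)  # rows: non-reference levels (A, U, Y); columns: predictors
X_new <- model.matrix(~ age + education + statusquo, data = newdata)
eta <- X_new %*% t(B)  # linear predictors for each non-reference level
probs <- cbind(1, exp(eta)) / (1 + rowSums(exp(eta)))
colnames(probs) <- c("N", rownames(B))
probs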
Multinomial logistic regression will suffer from numer-


ical instabilities and its iterative algorithm might even
fail to converge if the levels of the categorical variable
are very separated in terms of the predictors (e.g., two
clearly separated data clouds, each corresponding to a
different level of the categorical variable).

A.4 Dealing with missing data

Missing data, codified as NA in R, can be problematic in predictive


modelling. By default, most of the regression models in R work
with the complete cases of the data, that is, they exclude the cases
in which there is at least one NA. This may be problematic in certain
cases. Let’s see an example.

# The airquality dataset contains NA’s


data(airquality)
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
summary(airquality)
## Ozone Solar.R Wind Temp Month Day
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00 Min. :5.000 Min. : 1.0
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00 1st Qu.:6.000 1st Qu.: 8.0
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00 Median :7.000 Median :16.0
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88 Mean :6.993 Mean :15.8
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00 Max. :9.000 Max. :31.0
## NA’s :37 NA’s :7

# Let’s add more NA’s for the sake of illustration


set.seed(123456)
airquality$Solar.R[runif(nrow(airquality)) < 0.7] <- NA
airquality$Day[runif(nrow(airquality)) < 0.1] <- NA

# See what are the fully-observed cases


comp <- complete.cases(airquality)
mean(comp) # Only 15% of cases are fully observed
## [1] 0.1568627

# Complete cases
head(airquality[comp, ])
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 9 8 19 20.1 61 5 9
## 13 11 290 9.2 66 5 13
## 14 14 274 10.9 68 5 14
## 15 18 65 13.2 58 5 15

# Linear model on all the variables


summary(lm(Ozone ~ ., data = airquality)) # 129 not included
##
## Call:
## lm(formula = Ozone ~ ., data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.790 -10.910 -2.249 10.960 33.246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -71.50880 31.49703 -2.270 0.035704 *
## Solar.R -0.01112 0.03376 -0.329 0.745596
## Wind -0.61129 0.96113 -0.636 0.532769
## Temp 1.82870 0.42224 4.331 0.000403 ***
## Month -2.86513 2.22222 -1.289 0.213614
## Day -0.28710 0.41700 -0.688 0.499926
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 16.12 on 18 degrees of freedom
## (129 observations deleted due to missingness)
## Multiple R-squared: 0.5957, Adjusted R-squared: 0.4833
## F-statistic: 5.303 on 5 and 18 DF, p-value: 0.00362

# Caution! Even if the problematic variable is excluded, only


# the complete observations are employed
summary(lm(Ozone ~ . - Solar.R, data = airquality))
##
## Call:
## lm(formula = Ozone ~ . - Solar.R, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.43 -11.56 -1.67 11.19 33.11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -72.2501 30.6707 -2.356 0.029386 *
## Wind -0.6236 0.9376 -0.665 0.514001
## Temp 1.7980 0.4021 4.472 0.000261 ***
## Month -2.6533 2.0767 -1.278 0.216762
## Day -0.2944 0.4065 -0.724 0.477851
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 15.74 on 19 degrees of freedom
## (129 observations deleted due to missingness)
## Multiple R-squared: 0.5932, Adjusted R-squared: 0.5076
## F-statistic: 6.927 on 4 and 19 DF, p-value: 0.001291

# Notice the difference with


summary(lm(Ozone ~ ., data = subset(airquality, select = -Solar.R)))
##
## Call:
## lm(formula = Ozone ~ ., data = subset(airquality, select = -Solar.R))
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.677 -12.609 -3.125 11.993 98.805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -74.4456 25.8567 -2.879 0.00488 **
## Wind -3.1064 0.7028 -4.420 2.51e-05 ***
## Temp 2.1666 0.2876 7.534 2.25e-11 ***
## Month -3.6493 1.6343 -2.233 0.02778 *
## Day 0.3162 0.2549 1.241 0.21768
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 22.12 on 100 degrees of freedom
## (48 observations deleted due to missingness)
## Multiple R-squared: 0.5963, Adjusted R-squared: 0.5801


## F-statistic: 36.92 on 4 and 100 DF, p-value: < 2.2e-16

# Model selection can be problematic with missing data, since


# the number of complete cases changes with the addition or
# removing of predictors
mod <- lm(Ozone ~ ., data = airquality)

# stepAIC throws an error


modAIC <- MASS::stepAIC(mod)
## Start: AIC=138.55
## Ozone ~ Solar.R + Wind + Temp + Month + Day
##
## Df Sum of Sq RSS AIC
## - Solar.R 1 28.2 4708.1 136.70
## - Wind 1 105.2 4785.0 137.09
## - Day 1 123.2 4803.1 137.18
## <none> 4679.9 138.55
## - Month 1 432.2 5112.1 138.67
## - Temp 1 4876.6 9556.5 153.69
## Error in MASS::stepAIC(mod): number of rows in use has changed: remove missing values?

# Also, this will be problematic (the number of complete


# cases changes with the predictors considered!)
modBIC <- MASS::stepAIC(mod, k = log(nrow(airquality)))
## Start: AIC=156.73
## Ozone ~ Solar.R + Wind + Temp + Month + Day
##
## Df Sum of Sq RSS AIC
## - Solar.R 1 28.2 4708.1 151.85
## - Wind 1 105.2 4785.0 152.24
## - Day 1 123.2 4803.1 152.33
## - Month 1 432.2 5112.1 153.82
## <none> 4679.9 156.73
## - Temp 1 4876.6 9556.5 168.84
## Error in MASS::stepAIC(mod, k = log(nrow(airquality))): number of rows in use has changed: remove missing values?

# Comparison of AICs or BICs is spurious: the scale of the


# likelihood changes with the sample size (the likelihood
# decreases with n), which increases AIC / BIC with n.
# Hence using BIC / AIC is not adequate for model selection
# with missing data.
AIC(lm(Ozone ~ ., data = airquality))
## [1] 208.6604
AIC(lm(Ozone ~ ., data = subset(airquality, select = -Solar.R)))
## [1] 955.0681

# Considers only complete cases including Solar.R


AIC(lm(Ozone ~ . - Solar.R, data = airquality))
## [1] 206.8047

We have seen the problems that missing data may cause in re-
gression models. There are many techniques designed to handle
missing data, depending on the missing data mechanism (whether
it is completely at random or whether there is some pattern in the
missing process) and the approach to impute the data (parametric,
nonparametric, Bayesian, etc). We do not give an exhaustive view of
the topic here, but we outline three concrete approaches to handle
missing data in practice:

1. Use complete cases. This is the simplest solution and can be


achieved by restricting the analysis to the set of fully-observed
observations. The advantage of this solution is that it can be im-
plemented very easily by using the complete.cases or na.omit
functions. However, an undesirable consequence is that we may


lose a substantial amount of data and therefore the precision
of the estimators will be lower. In addition, it may lead to a bi-
ased representation of the original data (if the missing process is
associated with the values of the response or predictors).
2. Remove predictors with many missing data. This is another
simple solution that is useful in case most of the missing data is
concentrated in one predictor.
3. Use imputation for the missing values. The idea is to replace
the missing observations on the response or the predictors with
artificial values that try to preserve the dataset structure:

• When the response is missing, we can use a predictive model


to predict the missing response, then create a new fully-
observed dataset containing the predictions instead of the
missing values, and finally re-estimate the predictive model in
this expanded dataset. This approach is attractive if most of
the missing data is in the response.
• When different predictors and the response are missing, we
can use a direct imputation for them. The simplest approach
is to replace the missing data with the sample mean of the
observed cases (in the case of quantitative variables). This
and more sophisticated imputation methods, based on predictive
models, are available within the mice package¹². This
approach is interesting if the data contains many NAs scattered
in different predictors (hence a complete-cases analysis will be
inefficient).
¹² Check the available methods for mice::mice at ?mice. There are specific
methods for different kinds of variables (numeric, factor with two levels,
factors of more than two levels) and fairly advanced imputation methods.

Let’s put in practice these three approaches in the previous ex-


ample.

# The complete cases approach is the default in R


summary(lm(Ozone ~ ., data = airquality))
##
## Call:
## lm(formula = Ozone ~ ., data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.790 -10.910 -2.249 10.960 33.246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -71.50880 31.49703 -2.270 0.035704 *
## Solar.R -0.01112 0.03376 -0.329 0.745596
## Wind -0.61129 0.96113 -0.636 0.532769
## Temp 1.82870 0.42224 4.331 0.000403 ***
## Month -2.86513 2.22222 -1.289 0.213614
## Day -0.28710 0.41700 -0.688 0.499926
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 16.12 on 18 degrees of freedom
## (129 observations deleted due to missingness)
## Multiple R-squared: 0.5957, Adjusted R-squared: 0.4833
## F-statistic: 5.303 on 5 and 18 DF, p-value: 0.00362

# However, since the complete cases that R is going to consider


# depends on which predictors are included, it is safer to


# exclude NA’s explicitly before the fitting of a model.
airqualityNoNA <- na.exclude(airquality)
summary(airqualityNoNA)
## Ozone Solar.R Wind Temp Month Day
## Min. : 8.00 Min. : 13.0 Min. : 6.300 Min. :58.00 Min. :5.00 Min. : 1.00
## 1st Qu.:13.00 1st Qu.: 61.5 1st Qu.: 9.575 1st Qu.:65.50 1st Qu.:5.00 1st Qu.:11.25
## Median :17.00 Median :178.5 Median :11.500 Median :71.50 Median :6.00 Median :16.50
## Mean :28.21 Mean :161.8 Mean :11.887 Mean :72.54 Mean :6.75 Mean :15.79
## 3rd Qu.:36.25 3rd Qu.:255.0 3rd Qu.:14.300 3rd Qu.:79.50 3rd Qu.:9.00 3rd Qu.:20.25
## Max. :96.00 Max. :334.0 Max. :20.700 Max. :97.00 Max. :9.00 Max. :29.00

# The package VIM has a function to visualize where the missing


# data is present. It gives the percentage of NA’s for each
# variable and for the most important combinations of NA’s.
VIM::aggr(airquality)

[Figure: proportion of missing values per variable and per combination of variables in airquality.]

VIM::aggr(airqualityNoNA)

[Figure: proportion of missing values per variable and per combination of variables in airqualityNoNA.]

# Stepwise regression without NA's -- no problem
modBIC1 <- MASS::stepAIC(lm(Ozone ~ ., data = airqualityNoNA),
                         k = log(nrow(airqualityNoNA)), trace = 0)
summary(modBIC1)
##
## Call:
## lm(formula = Ozone ~ Temp, data = airqualityNoNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.645 -13.180 -0.136 8.926 37.666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -90.1846 24.6414 -3.660 0.00138 **
## Temp 1.6321 0.3367 4.847 7.63e-05 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 15.95 on 22 degrees of freedom
## Multiple R-squared: 0.5164, Adjusted R-squared: 0.4944
## F-statistic: 23.49 on 1 and 22 DF, p-value: 7.634e-05

# But we only take into account 16% of the original data


nrow(airqualityNoNA) / nrow(airquality)
## [1] 0.1568627

# Removing the predictor with many NA’s, as we did before


# We also exclude NA’s from other predictors
airqualityNoSolar.R <- na.exclude(subset(airquality, select = -Solar.R))
modBIC2 <- MASS::stepAIC(lm(Ozone ~ ., data = airqualityNoSolar.R),
k = log(nrow(airqualityNoSolar.R)), trace = 0)
summary(modBIC2)
##
## Call:
## lm(formula = Ozone ~ Wind + Temp + Month, data = airqualityNoSolar.R)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.973 -14.170 -3.107 10.638 102.297
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.7279 25.3507 -2.672 0.0088 **
## Wind -3.0439 0.7028 -4.331 3.51e-05 ***
## Temp 2.1323 0.2870 7.429 3.59e-11 ***
## Month -3.6167 1.6384 -2.207 0.0295 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 22.17 on 101 degrees of freedom
## Multiple R-squared: 0.5901, Adjusted R-squared: 0.5779
## F-statistic: 48.46 on 3 and 101 DF, p-value: < 2.2e-16
# In this example the approach works well because most of
# the NA’s are associated to the variable Solar.R

# Impute data using the sample mean


library(mice)
airqualityMean <- complete(mice(data = airquality, m = 1, method = "mean"))
##
## iter imp variable
## 1 1 Ozone Solar.R Day
## 2 1 Ozone Solar.R Day
## 3 1 Ozone Solar.R Day
## 4 1 Ozone Solar.R Day
## 5 1 Ozone Solar.R Day
head(airqualityMean)
## Ozone Solar.R Wind Temp Month Day
## 1 41.00000 190.0000 7.4 67 5 1.00000
## 2 36.00000 118.0000 8.0 72 5 2.00000
## 3 12.00000 181.7857 12.6 74 5 3.00000
## 4 18.00000 181.7857 11.5 62 5 15.62857
## 5 42.12931 181.7857 14.3 56 5 5.00000
## 6 28.00000 181.7857 14.9 66 5 15.62857
# Explanation of the syntax:
# - mice::complete() serves to retrieve the completed dataset from
# the mids object.
# - m = 1 specifies that we only want a reconstruction of the
# dataset because the imputation method is deterministic
# (it could be random).
# - method = "mean" says that we want the sample mean to be
# used to fill NA’s in all the columns. This only works
# properly for quantitative variables.

# Impute using linear regression for the response (first column)


# and mean for the predictors (remaining five columns)
airqualityLm <- complete(mice(data = airquality, m = 1,
method = c("norm.predict", rep("mean", 5))))
##
## iter imp variable
## 1 1 Ozone Solar.R Day
## 2 1 Ozone Solar.R Day
## 3 1 Ozone Solar.R Day
## 4 1 Ozone Solar.R Day
## 5 1 Ozone Solar.R Day
head(airqualityLm)
## Ozone Solar.R Wind Temp Month Day
## 1 41.00000 190.0000 7.4 67 5 1.00000
## 2 36.00000 118.0000 8.0 72 5 2.00000
## 3 12.00000 181.7857 12.6 74 5 3.00000
## 4 18.00000 181.7857 11.5 62 5 15.62857
## 5 -13.35375 181.7857 14.3 56 5 5.00000
## 6 28.00000 181.7857 14.9 66 5 15.62857

# Imputed data -- some extrapolation problems may happen


airqualityLm$Ozone[is.na(airquality$Ozone)]
## [1] -13.3537515 33.3839709 -15.1032941 -6.4344422 15.0745276 43.7185093 34.5128480 0.1488983 57.7500138
## [10] 62.1313290 30.1891953 70.4905697 72.0216949 75.6909337 38.1841717 43.5104926 57.9764970 69.7165869
## [19] 63.0884481 51.9374205 50.0400961 56.8792515 41.2844048 49.6257521 33.0354266 67.4490679 48.9905942
## [28] 54.0457773 51.9941898 50.1697328 46.1697595 71.3235772 49.9344726 39.0222991 26.9297719 77.0827188
## [37] 26.9508942

# Notice that the imputed data is the same (except for a small
# truncation that is introduced by mice) as
predict(lm(airquality$Ozone ~ ., data = airqualityMean),
newdata = airqualityMean[is.na(airquality$Ozone), -1])


## 5 10 25 26 27 32 33 34 35 36
## -13.3537515 33.3839709 -15.1032941 -6.4344422 15.0745276 43.7185093 34.5128480 0.1488983 57.7500138 62.1313290
## 37 39 42 43 45 46 52 53 54 55
## 30.1891953 70.4905697 72.0216949 75.6909337 38.1841717 43.5104926 57.9764970 69.7165869 63.0884481 51.9374205
## 56 57 58 59 60 61 65 72 75 83
## 50.0400961 56.8792515 41.2844048 49.6257521 33.0354266 67.4490679 48.9905942 54.0457773 51.9941898 50.1697328
## 84 102 103 107 115 119 150
## 46.1697595 71.3235772 49.9344726 39.0222991 26.9297719 77.0827188 26.9508942

# Removing the truncation with ridge = 0


complete(mice(data = airquality, m = 1,
method = c("norm.predict", rep("mean", 5)),
ridge = 0))[is.na(airquality$Ozone), 1]
##
## iter imp variable
## 1 1 Ozone Solar.R Day
## 2 1 Ozone Solar.R Day
## 3 1 Ozone Solar.R Day
## 4 1 Ozone Solar.R Day
## 5 1 Ozone Solar.R Day
## [1] -13.3537515 33.3839709 -15.1032941 -6.4344422 15.0745276 43.7185093 34.5128480 0.1488983 57.7500138
## [10] 62.1313290 30.1891953 70.4905697 72.0216949 75.6909337 38.1841717 43.5104926 57.9764970 69.7165869
## [19] 63.0884481 51.9374205 50.0400961 56.8792515 41.2844048 49.6257521 33.0354266 67.4490679 48.9905942
## [28] 54.0457773 51.9941898 50.1697328 46.1697595 71.3235772 49.9344726 39.0222991 26.9297719 77.0827188
## [37] 26.9508942

# mice's default method (predictive mean matching) works


# better in this case (in the sense that it does not yield
# negative Ozone values)
# Notice that there is randomness in the imputation!
airqualityMice <- complete(mice(data = airquality, m = 1, seed = 123))
##
## iter imp variable
## 1 1 Ozone Solar.R Day
## 2 1 Ozone Solar.R Day
## 3 1 Ozone Solar.R Day
## 4 1 Ozone Solar.R Day
## 5 1 Ozone Solar.R Day
head(airqualityMice)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 186 12.6 74 5 3
## 4 18 112 11.5 62 5 23
## 5 13 250 14.3 56 5 5
## 6 28 252 14.9 66 5 28

A.5 A note of caution with inference after model-selection


Inferences from models that result from model-selection


procedures, such as stepwise regression, ridge, or lasso,
have to be analysed with caution. The reason is that
we are using the sample twice: once for selecting the
most significant / informative predictors to be included
in the model, and again for making inference using the
same sample. In doing so, we are biasing the significance
tests, and thus obtaining unrealistically small p-values.
In other words, when included in the model, some selected
predictors will be shown as significant when in reality
they are not.

A relatively simple solution for performing valid in-


ference in a data-driven selected model is to split the
dataset into two parts: one part for performing model-
selection (selection of important variables and model
structure; inside this part we could have two subparts for
training and validation) and another for fitting the model
and carrying out inference on the coefficients based on that
fit. Obviously, this approach has the undesirable conse-
quence of losing power in the estimation and inference
parts due to the sample splitting, but it guarantees valid
inference in a simple and general way.

The next simulation exercise exemplifies the previous remarks.


Consider the following linear model

Y = β 1 X1 + β 2 X2 + β 3 X3 + β 4 X4 + ε, (A.8)

where β 1 = β 2 = 1, β 3 = β 4 = 0, and ε ∼ N (0, 1). The next chunk


of code analyses the significances of the four coefficients for:

1. The model with all the predictors. The inferences for the coef-
ficients are correct: the distribution of the p-values (pvalues1) is
uniform whenever H0 : β j = 0 holds (for j = 3, 4) and concen-
trated around 0 when H0 does not hold (for j = 1, 2).
2. The model with predictors selected by stepwise regression. The
inferences for the coefficients are biased: when X3 and X4 are
included in the model, it is because they appear highly significant
for the given sample by mere chance. Therefore, the distribution of
the p-values (pvalues2) is not uniform but concentrated at 0.
3. The model with the predictors selected by stepwise regression, but
fitted in a separate dataset. In this case, the p-values (pvalues3)
are not unrealistically small if the non-significant predictors are
included in the model, and the inference is correct.
# Simulation setting
n <- 2e2
p <- 4
p0 <- p %/% 2
beta <- c(rep(1, p0), rep(0, p - p0))

# Generate two sets of independent data following the same linear model
# with coefficients beta and null intercept
x1 <- matrix(rnorm(n * p), nrow = n, ncol = p)
data1 <- data.frame("x" = x1)
xbeta1 <- x1 %*% beta
x2 <- matrix(rnorm(n * p), nrow = n, ncol = p)
data2 <- data.frame("x" = x2)
xbeta2 <- x2 %*% beta

# Objects for the simulation


M <- 1e4
pvalues1 <- pvalues2 <- pvalues3 <- matrix(NA, nrow = M, ncol = p)
set.seed(12345678)
data1$y <- xbeta1 + rnorm(n)
nam <- names(lm(y ~ 0 + ., data = data1)$coefficients)

# Simulation
# pb <- txtProgressBar(style = 3)
for (i in 1:M) {

# Generate new data


data1$y <- xbeta1 + rnorm(n)
  # Obtain the significances of the coefficients for the usual linear model
  mod1 <- lm(y ~ 0 + ., data = data1)
  s1 <- summary(mod1)
  pvalues1[i, ] <- s1$coefficients[, 4]

  # Obtain the significances of the coefficients for a data-driven selected
  # linear model (in this case, by stepwise regression using BIC)
  mod2 <- MASS::stepAIC(mod1, k = log(n), trace = 0)
  s2 <- summary(mod2)
  ind <- match(x = names(s2$coefficients[, 4]), table = nam)
  pvalues2[i, ind] <- s2$coefficients[, 4]

  # Generate independent data
  data2$y <- xbeta2 + rnorm(n)

  # Significances of the coefficients by the data-driven selected model
  s3 <- summary(lm(y ~ 0 + ., data = data2[, c(ind, p + 1)]))
  pvalues3[i, ind] <- s3$coefficients[, 4]

  # Progress
  # setTxtProgressBar(pb = pb, value = i / M)

}

# Percentage of NA's: NA = predictor excluded
apply(pvalues2, 2, function(x) mean(is.na(x)))
## [1] 0.0000 0.0000 0.9767 0.9784

# Boxplots of significances
boxplot(pvalues1, names = expression(beta[1], beta[2], beta[3], beta[4]),
        main = "p-values in the full model", ylim = c(0, 1))
boxplot(pvalues2, names = expression(beta[1], beta[2], beta[3], beta[4]),
        main = "p-values in the stepwise model", ylim = c(0, 1))
boxplot(pvalues3, names = expression(beta[1], beta[2], beta[3], beta[4]),
        main = "p-values in the model with the predictors selected by
        stepwise regression, and fitted in an independent sample",
        ylim = c(0, 1))

[Figure: boxplots of the p-values of the four coefficients in the full model, in the stepwise-selected model, and in the model with the stepwise-selected predictors fitted on an independent sample.]

# Test uniformity of the p-values associated to the coefficients that are 0


apply(pvalues1[, (p0 + 1):p], 2, function(x) ks.test(x, y = "punif")$p.value)
## [1] 0.2680792 0.9773141
apply(pvalues2[, (p0 + 1):p], 2, function(x) ks.test(x, y = "punif")$p.value)
## [1] 0 0
apply(pvalues3[, (p0 + 1):p], 2, function(x) ks.test(x, y = "punif")$p.value)
## [1] 0.3467553 0.7129604
B  Software

B.1 Installation of R and RStudio

This is what you have to do in order to install R and RStudio in


your own computer:

1. In Mac OS X, if you want to have 3D functionality for packages


like rgl, first download and install XQuartz and log out and
back on your Mac OS X account¹. Be sure that your Mac OS X
system is up-to-date.
¹ This is an important step that is required for 3D graphics to work.
2. Download the latest version of R at https://cran.r-project.
org/. For Windows, you can download it directly here. For Mac
OS X you can download the latest version (at the time of writing
this, 3.4.4) here.
3. Install R. In Windows, be sure to select the 'Startup options'
and then choose 'SDI' in the 'Display Mode' options. Leave the
rest of installation options as default.
4. Download the latest version of RStudio for your system at
https://www.rstudio.com/products/rstudio/download/#download
and install it.

Linux users can follow the corresponding instructions here for installing R, download RStudio (only certain Ubuntu and Fedora versions are supported), and install it using a package manager.
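Once R and RStudio are installed, a quick sanity check can be run from the R console. This is just a minimal sketch; the rgl installation line is optional, only needed if 3D graphics are desired, and is therefore commented out:

# Print the version of R that has just been installed
R.version.string
# Optionally, install the rgl package to test 3D graphics
# (in Mac OS X this requires XQuartz, as explained above)
# install.packages("rgl")
# A trivial computation; the console should return 4
2 + 2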

B.2 Introduction to RStudio

RStudio is the most widely used Integrated Development Environment (IDE) for R nowadays. When you start RStudio, you will see a window similar to Figure B.1. There are a lot of items in the GUI, most of them described in the RStudio IDE Cheat Sheet. The most important things to keep in mind are:

1. The code is written in scripts in the source panel (upper-left panel in Figure B.1).
2. To run a line or a selection of code from the script in the console (first tab in the lower-left panel in Figure B.1), use the keyboard shortcut 'Ctrl+Enter' (Windows and Linux) or 'Cmd+Enter' (Mac OS X), as in the example script below.

Figure B.1: Main window of RStudio. Extracted from here.
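For instance, the following minimal script (the file name my_first_script.R is just a suggestion) can be pasted in the source panel and run line by line with the shortcut:

# my_first_script.R -- run each line with 'Ctrl+Enter' / 'Cmd+Enter'
x <- seq(-3, 3, length.out = 100) # A grid of 100 points in [-3, 3]
y <- x^2 # A parabola evaluated at the grid
plot(x, y, type = "l") # The plot shows up in the lower-right panel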

B.3 Introduction to R

This section provides a collection of self-explanatory code snippets for the programming language R (R Core Team, 2018). These snippets are not meant to provide an exhaustive introduction to R, but just to illustrate its most basic functions and methods.
In the following, # denotes comments in the code and ## denotes outputs of the code.

Simple computations
# The console can act as a simple calculator
1.0 + 1.1
2 * 2
3/2
2^3
1/0
0/0

# Use ";" for performing several operations in the same line


(1 + 3) * 2 - 1; 3 + 2

# Elementary mathematical functions
sqrt(2); 2^0.5
exp(1)
log(10) # Natural or Napierian logarithm
log10(10); log2(10) # Logs in base 10 and 2
sin(pi); cos(0); asin(0)
tan(pi/3)
sqrt(-1)

# Remember to close the parenthesis -- errors below
1 +
(1 + 3
## Error: <text>:24:0: unexpected end of input
## 22: 1 +
## 23: (1 + 3
## ^

Compute:

• $\frac{e^2 + \sin(2)}{\cos^{-1}\left(\frac{1}{2}\right) + 2}$. Answer: 2.723274.
• $\sqrt{3^{2.5} + \log(10)}$. Answer: 4.22978.
• $\left(2^{0.93} - \log_2\left(3 + \sqrt{2 + \sin(1)}\right)\right) 10^{\tan(1/3)} \sqrt{3^{2.5} + \log(10)}$. Answer: -3.032108.
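One possible way of checking these results is to transcribe the expressions above directly into the console (this is just one of several valid transcriptions):

# The three expressions of the exercise; compare the outputs with the answers
(exp(2) + sin(2)) / (acos(1/2) + 2) # Should give 2.723274
sqrt(3^2.5 + log(10)) # Should give 4.22978
(2^0.93 - log2(3 + sqrt(2 + sin(1)))) * 10^tan(1/3) * sqrt(3^2.5 + log(10)) # Should give -3.032108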

Variables and assignment


# Any operation that you perform in R can be stored in a variable
# (or object) with the assignment operator "<-"
x <- 1

# To see the value of a variable, simply type it
x
## [1] 1

# A variable can be overwritten
x <- 1 + 1

# Now the value of x is 2 and not 1, as before
x
## [1] 2

# Capitalization matters
X <- 3
x; X
## [1] 2
## [1] 3

# See which variables are in the workspace
ls()
## [1] "x" "X"

# Remove variables
rm(X)
X
## Error in eval(expr, envir, enclos): object 'X' not found

Do the following:

• Store $-123$ in the variable y.
• Store the log of the square of y in z.
• Store $\frac{y - z}{y + z^2}$ in y and remove z.
• Output the value of y. Answer: 4.366734.
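A possible solution (just a sketch; other equivalent ways exist) is:

y <- -123 # Store -123 in y
z <- log(y^2) # Log of the square of y
y <- (y - z) / (y + z^2) # Overwrite y with the required expression
rm(z) # Remove z
y # Should return the answer given above (4.366734)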

Vectors
# We combine numbers with the function "c"
c(1, 3)
## [1] 1 3
c(1.5, 0, 5, -3.4)
## [1] 1.5 0.0 5.0 -3.4

# A handy way of creating integer sequences is the operator ":"
1:5
## [1] 1 2 3 4 5

# Storing some vectors
myData <- c(1, 2)
myData2 <- c(-4.12, 0, 1.1, 1, 3, 4)
myData
## [1] 1 2
myData2
## [1] -4.12 0.00 1.10 1.00 3.00 4.00

# Entrywise operations
myData + 1
## [1] 2 3
myData^2
## [1] 1 4

# If you want to access a position of a vector, use [position]
myData[1]
## [1] 1
myData2[6]
## [1] 4

# You can also change elements
myData[1] <- 0
myData
## [1] 0 2

# Think about what you want to access...
myData2[7]
## [1] NA
myData2[0]
## numeric(0)

# If you want to access all the elements except a position,
# use [-position]
myData2[-1]
## [1] 0.0 1.1 1.0 3.0 4.0
myData2[-2]
## [1] -4.12 1.10 1.00 3.00 4.00

# Also with vectors as indexes
myData2[1:2]
## [1] -4.12 0.00
myData2[myData]
## [1] 0

# And also
myData2[-c(1, 2)]
## [1] 1.1 1.0 3.0 4.0

# But do not mix positive and negative indexes!
myData2[c(-1, 2)]
## Error in myData2[c(-1, 2)]: only 0's may be mixed with negative subscripts

# Remove the first element
myData2 <- myData2[-1]
