
Lecture 04.

Model Selection and Regularization


(Chapter 6 and Section 5.1)

Ping Yu

HKU Business School


The University of Hong Kong

Ping Yu (HKU) Model Selection and Regularization 1 / 98


Introduction

Recall the linear regression model,

Y = β_0 + β_1 X_1 + ··· + β_p X_p + ε.

In the lectures that follow, we consider some approaches for extending the linear
model framework.
In Lecture 7, we generalize the linear model in order to accommodate non-linear,
but still additive, relationships.
In Lectures 8 and 10, we consider even more general non-linear models.
Despite its simplicity, the linear model has distinct advantages in terms of its inter-
pretability and often shows good predictive performance.
Hence we discuss in this lecture some ways in which the simple linear model can
be improved, by replacing ordinary least squares fitting with some alternative fitting
procedures.

Ping Yu (HKU) Model Selection and Regularization 2 / 98


Why consider alternatives to Least Squares?

Prediction Accuracy: When n is not much larger than p, there can be a lot of vari-
ability in the least squares fit, resulting in overfitting and consequently poor predic-
tions; when p > n, there is no unique least squares estimate: the variance is infinite!
By constraining or shrinking the estimated coefficients, we can substantially reduce
the variance at the cost of a negligible increase in bias.
Model Interpretability: By removing irrelevant features – that is, by setting the corre-
sponding coefficient estimates to zero – we can obtain a model that is more easily
interpreted. We will present some approaches for automatically performing feature
selection or variable selection.

Ping Yu (HKU) Model Selection and Regularization 3 / 98


Three Classes of Methods

Subset Selection: We identify a subset of the p predictors that we believe to be


related to the response. We then fit a model using least squares on the reduced
set of variables.
- We will discuss two such procedures: best subset selection and stepwise selec-
tion.
Shrinkage: We fit a model involving all p predictors, but the estimated coefficients
are shrunken towards zero relative to the least squares estimates. This shrink-
age (also known as regularization) has the effect of reducing variance, and some
shrinkage methods, e.g., the lasso, can also perform variable selection.
- We will discuss two most popular shrinkage methods: ridge regression and the
lasso (least absolute shrinkage and selection operator).
- Since there is a tuning parameter to be selected for our two shrinkage methods,
we will also discuss a general method for such a purpose – cross validation.

Ping Yu (HKU) Model Selection and Regularization 4 / 98


Continued

Dimension Reduction: We project the p predictors into an M-dimensional subspace,


where M < p. This is achieved by computing M different linear combinations, or
projections, of the variables. Then these M projections are used as predictors to fit
a linear regression model by least squares.
- We will concentrate on two dimension reduction methods – principal components
regression and partial least squares, but will postpone the discussion to Lecture 5.
The concepts discussed in this lecture can also be applied to other methods (be-
sides linear regression), e.g., the classification methods in Lecture 2.
- The deviance – negative two times the maximized log-likelihood – plays the role
of RSS for a broader class of models.

Ping Yu (HKU) Model Selection and Regularization 5 / 98


(**) Alternative Definition of Deviance – Standardized Deviance

Suppose under the parameter value θ , the mean of y is µ which is a function of θ ;


then the standardized deviance is defined as

−2 [ l(µ̂; y) − l(y; y) ],
where l(µ; y) = ∑_{i=1}^n l(µ_i; y_i) with l(µ; y) = ln f(y; θ), and µ̂ = (µ̂_1, ..., µ̂_n)^T is the
ML estimator of µ (deduced from the ML estimator of θ).
- In other words, the deviance is (−2 times) the difference between the log likelihood
of the assumed model and that of the saturated model (a model with a parameter for every
observation, so that the data are fitted exactly).
Normal: If y ~ N(µ, σ²) with known σ², then θ = µ, and f(y; µ) = (1/√(2πσ²)) exp(−(y − µ)²/(2σ²)),
so that
l(µ; y) = −(1/2) log(2πσ²) − (y − µ)²/(2σ²).
Setting µ = y gives l(y; y) = −(1/2) log(2πσ²), and so
−2 [ l(µ̂; y) − l(y; y) ] = (y − µ̂)²/σ².

Ping Yu (HKU) Model Selection and Regularization 6 / 98


(**) Continued: Two More Examples

Binary: If y ~ Bernoulli(p), then µ = p and f(y; µ) = µ^y (1 − µ)^{1−y}, so that

l(µ; y) = y log µ + (1 − y) log(1 − µ),

and

−2 [ l(µ̂; y) − l(y; y) ] = 2 [ y log(y/µ̂) + (1 − y) log((1 − y)/(1 − µ̂)) ].

Poisson: If y ~ Poisson(λ), then µ = λ and f(y; µ) = e^{−λ} λ^y / y!, so that

l(µ; y) = −λ + y log λ − log y!,

and

−2 [ l(µ̂; y) − l(y; y) ] = 2 [ y log(y/µ̂) − (y − µ̂) ].

In all three examples, µ̂ is the sample mean.

Ping Yu (HKU) Model Selection and Regularization 7 / 98
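As a quick check of these formulas, here is a minimal R sketch (simulated Bernoulli data assumed) comparing the Binary deviance expression above with the deviance reported by glm() for an intercept-only logistic fit, where µ̂ is the sample mean:

```r
# Minimal sketch: verify the Bernoulli deviance formula against glm()'s deviance.
set.seed(1)
y   <- rbinom(30, size = 1, prob = 0.3)        # simulated 0/1 outcomes
fit <- glm(y ~ 1, family = binomial)           # intercept-only logistic model
mu  <- fitted(fit)                             # muhat_i = sample mean of y for all i

# 2 * [ y log(y/mu) + (1 - y) log((1 - y)/(1 - mu)) ], with 0*log(0) taken as 0
term <- function(a, b) ifelse(a == 0, 0, a * log(a / b))
dev_formula <- 2 * sum(term(y, mu) + term(1 - y, 1 - mu))

c(glm_deviance = deviance(fit), formula = dev_formula)  # the two numbers agree
```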


Subset Selection

Subset Selection
(Section 6.1)

Ping Yu (HKU) Model Selection and Regularization 8 / 98


Subset Selection Best Subset Selection

Best Subset Selection


There are in total C_0^p + C_1^p + ··· + C_p^p = 2^p possible models! So selecting the best
model is not trivial, and we break the procedure into two stages:
- Step 1: let M_0 denote the null model with no predictors. Step 2: for k = 1, ..., p, fit all C_k^p models that contain exactly k predictors, and let M_k be the one with the smallest RSS (equivalently, the largest R²). Step 3: select a single best model from among M_0, ..., M_p.
Step 2 reduces the number of possible models from 2^p to p + 1. [see Figure 6.1 for
the Credit dataset, n = 400, p = 10]
In Step 3, we cannot use RSS or R² to choose the best model since they represent
the training error, not the test error, and they decrease with the number of predictors,
so the largest model would always be selected.
Ping Yu (HKU) Model Selection and Regularization 9 / 98
Subset Selection Best Subset Selection

In total, 2^11 = 2048 models.

When the number of predictors is greater than 3, RSS or R² barely improves.

Ping Yu (HKU) Model Selection and Regularization 10 / 98


Subset Selection Stepwise Selection

Stepwise Selection

The key drawback of best subset selection is its computational limitation; usually,
p > 40 is infeasible.
Best subset selection may also suffer from statistical problems when p is large:
the larger the search space, the higher the chance of finding models that look good on
the training data, even though they might not have any predictive power on future
data.
Thus an enormous search space can lead to overfitting and high variance of the
coefficient estimates.
For both of these reasons, stepwise methods, which explore a far more restricted
set of models, are attractive alternatives to best subset selection.

Ping Yu (HKU) Model Selection and Regularization 11 / 98


Subset Selection Stepwise Selection

Forward Stepwise Selection

Starts from the null model; at each step the variable that gives the greatest addi-
tional improvement to the fit is added to the model, until all of the predictors are in
the model.

Ping Yu (HKU) Model Selection and Regularization 12 / 98


Subset Selection Stepwise Selection

Comments
p +1
Totally, only 1 + p + (p 1) + + 1 = 1 + C2 models are considered,1 so com-
putational advantage over best subset selection is clear.
It is not guaranteed to find the best possible model out of all 2p models containing
subsets of the p predictors.
- Why not? Give an example.

Figure 6.1 shows that there is little difference between the three- and four-variable
models in terms of RSS, so either of the four-variable models will likely be adequate.
This is a greedy approach, and might include variables early that later become
redundant; hybrid selection below can remedy this.
¹(**) If n < p, only M_0, ..., M_{n−1} would be considered, but the number of possible models is much larger than
1 + C_2^{p+1}, which is so even when n > p since usually a guided search is performed.
Ping Yu (HKU) Model Selection and Regularization 13 / 98
Subset Selection Stepwise Selection

Backward Stepwise Selection

Unlike forward stepwise selection, backward stepwise selection begins with the full
least squares model containing all p predictors, and then iteratively removes the
least useful predictor, one-at-a-time, until the null model is achieved.

Comments on forward stepwise selection still apply here.2


Stepwise selection improves prediction accuracy only in certain cases, such as
when only a few covariates have a strong relationship with the outcome.
²(**) If n < p, backward stepwise selection cannot apply, so forward stepwise selection is the only viable subset
method when p is very large.
Ping Yu (HKU) Model Selection and Regularization 14 / 98
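A minimal R sketch of the two stepwise searches, using the leaps package on assumed simulated data (not the lecture's Credit example); note that the forward and backward paths need not agree:

```r
# Minimal sketch: forward and backward stepwise selection via leaps::regsubsets.
library(leaps)
set.seed(5)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p); colnames(x) <- paste0("x", 1:p)
y <- 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n)
dat <- data.frame(y, x)

fwd <- regsubsets(y ~ ., data = dat, nvmax = p, method = "forward")
bwd <- regsubsets(y ~ ., data = dat, nvmax = p, method = "backward")
coef(fwd, 3)   # best 3-variable model found by the forward search
coef(bwd, 3)   # the backward search may select a different 3-variable model
```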
Subset Selection Stepwise Selection

(*) Hybrid/Mixed Approaches

This is a combination of forward and backward selection.


We start with no variables in the model, and as with forward selection, we add the
variable that provides the best fit.
We continue to add variables one-by-one, but the p-values for existing variables
can become larger as new predictors are added to the model.
Hence, if at any point the p-value for one of the variables in the model rises above
a certain threshold, then we remove that variable from the model.
We continue to perform these forward and backward steps until all variables in the
model have a sufficiently low p-value, and all variables outside the model would
have a large p-value if added to the model.
We know the best subset, forward stepwise, and backward stepwise selection approaches
generally give similar but not identical models; the hybrid approach attempts to more
closely mimic best subset selection while retaining the computational advantages of
forward and backward stepwise selection.

Ping Yu (HKU) Model Selection and Regularization 15 / 98


Subset Selection Choosing the Optimal Model

Choosing the Optimal Model

As mentioned above, RSS and R 2 are not suitable for selecting the best model
among a collection of models with different numbers of predictors because they are
related to the training error, not the test error.
To select the best model with respect to test error, we estimate this test error based
on two common approaches:

1 We can indirectly estimate test error by making an adjustment to the training error
to account for the bias due to overfitting.
- Cp (or Mallows's Cp), Akaike Information Criterion (AIC), Bayesian Information
Criterion (BIC or the Schwarz criterion), and adjusted R² (or R̄²). [figure here]
- All criteria are popular, but the adjusted R 2 is not as well motivated in statistical
theory as the other three.
- AIC and BIC can be defined for more general types of models.
2 We can directly estimate the test error.
- A validation set approach or a cross-validation approach, as will be discussed
later in this lecture.

Ping Yu (HKU) Model Selection and Regularization 16 / 98


Subset Selection Choosing the Optimal Model

History of Cp , AIC, BIC, and R̄ 2

Colin L. Mallows (1930-), Bell Labs; Hirotugu Akaike (1927-2009), ISM

Gideon E. Schwarz (1933-2007), Hebrew University; Henri Theil (1924-2000), Chicago and Florida

Ping Yu (HKU) Model Selection and Regularization 17 / 98


Subset Selection Choosing the Optimal Model

Cp , AIC, BIC, and R̄ 2

Figure 6.2 displays Cp , BIC, and adjusted R 2 for the best model of each size pro-
duced by best subset selection on the Credit dataset.

Ping Yu (HKU) Model Selection and Regularization 18 / 98


Subset Selection Choosing the Optimal Model

Details on Cp

Mallows's Cp:

Cp = (1/n) [ RSS(d) + 2 d σ̂² ],

where d is the total number of predictors used, and σ̂² is an estimate of the variance of
the error ε.
- Note that RSS(d ) depends on d – it decreases with d.
- Typically, σ̂ 2 is estimated using the full model containing all predictors.
- The penalty term 2d σ̂ 2 is added since the training error tends to underestimate the
test error; it is increasing in d to adjust for the corresponding decrease in training
RSS.
-(**) If σ̂ 2 is unbiased, then Cp is an unbiased estimate of test MSE.
- In Figure 6.2, minimizing Cp obtains 6 predictors, income, limit, rating, cards, age
and student.

Ping Yu (HKU) Model Selection and Regularization 19 / 98


Subset Selection Choosing the Optimal Model

Details on AIC

AIC: defined for a large class of models fit by maximum likelihood,

AIC = −2 log L(d) + 2d,

which is equivalent to Cp in linear regression with Gaussian errors, where L(d) is
the maximized likelihood function with d predictors.
- This is because

−2 log L(d) = constant + n log σ̂²,

and letting σ̂² = RSS(d)/n in Cp, we have

Cp = ((n + 2d)/n) σ̂²,

such that

log Cp = log σ̂² + log(1 + 2d/n) ≈ log σ̂² + 2d/n ≈ AIC/n (up to constants).

Ping Yu (HKU) Model Selection and Regularization 20 / 98


Subset Selection Choosing the Optimal Model

Details on BIC

BIC: in linear regression with Gaussian errors,


BIC = (1/n) [ RSS(d) + log(n) d σ̂² ].

- BIC replaces the 2d σ̂² used by Cp with a log(n) d σ̂² term.
- Since log n > 2 for any n > 7, the BIC statistic generally places a heavier penalty on
models with many variables, and hence results in the selection of smaller models
than Cp .
- In Figure 6.2, only income, limit, cards, and student are selected, but the four-
variable and six-variable models do not have much difference in accuracy.

Ping Yu (HKU) Model Selection and Regularization 21 / 98


Subset Selection Choosing the Optimal Model

Details on R̄ 2

Adjusted R²:

R̄² = 1 − [ RSS(d)/(n − d − 1) ] / [ TSS/(n − 1) ].

- Different from Cp, AIC and BIC, a large value of R̄² is preferred.
- Maximizing R̄² is equivalent to minimizing RSS(d)/(n − d − 1). While RSS(d) always
decreases as d increases, RSS(d)/(n − d − 1) may increase or decrease, due to the presence
of d in the denominator.
- Unlike the R² statistic, the R̄² statistic pays a price for the inclusion of unnecessary
variables in the model.
- It can be shown that maximizing R̄² is approximately equivalent to minimizing
(1/n) [ RSS(d) + d σ̂² ]. [Exercise]
- In Figure 6.2, besides the six variables selected by Cp and AIC, own is also
selected.
Why select among only M_0, ..., M_p?
- For the same d (and σ̂²), optimizing any of the four criteria over the C_d^p models of size d is the
same as minimizing the corresponding RSS.

Ping Yu (HKU) Model Selection and Regularization 22 / 98
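A minimal R sketch (assumed simulated data) of best subset selection with leaps::regsubsets, choosing the model size by the Cp, BIC, and adjusted R² criteria defined above:

```r
# Minimal sketch: best subset selection and model choice by Cp, BIC, adjusted R^2.
library(leaps)
set.seed(4)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p); colnames(x) <- paste0("x", 1:p)
y <- 2 * x[, 1] - 1.5 * x[, 2] + x[, 3] + rnorm(n)
dat <- data.frame(y, x)

best <- regsubsets(y ~ ., data = dat, nvmax = p)   # best model of each size 1..p
s    <- summary(best)                              # contains rss, cp, bic, adjr2
data.frame(size = 1:p, Cp = s$cp, BIC = s$bic, adjR2 = s$adjr2)

# model sizes picked by each criterion (minimize Cp and BIC, maximize adjusted R^2)
c(Cp = which.min(s$cp), BIC = which.min(s$bic), adjR2 = which.max(s$adjr2))
coef(best, which.min(s$bic))                       # coefficients of the BIC-chosen model
```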


Shrinkage Methods

Shrinkage Methods
(Section 6.2)

Ping Yu (HKU) Model Selection and Regularization 23 / 98


Shrinkage Methods Ridge Regression

History of Ridge Regression

Arthur E. Hoerl (1921-1994, UDelaware) Robert W. Kennard (1923-2011, DuPont)

Hoerl, A.E., and R.W. Kennard, 1970, Ridge Regression: Biased Estimation for Nonorthogo-
nal Problems, Technometrics, 12, 55–67.
Hoerl, A.E., and R.W. Kennard, 1970, Ridge Regression: Applications to Nonorthogonal Prob-
lems, Technometrics, 12, 69–82.

Ping Yu (HKU) Model Selection and Regularization 24 / 98


Shrinkage Methods Ridge Regression

Ridge Regression

Recall that the least squares fitting procedure estimates β_0, β_1, ..., β_p using the
values that minimize

RSS = ∑_{i=1}^n ( y_i − β_0 − ∑_{j=1}^p β_j x_{ij} )².

In contrast, the ridge regression (RR for short) coefficient estimates β̂^R are the
values that minimize

∑_{i=1}^n ( y_i − β_0 − ∑_{j=1}^p β_j x_{ij} )² + λ ∑_{j=1}^p β_j² = RSS + λ‖β‖₂²,   (1)

where λ ≥ 0 is a tuning parameter, to be determined separately, and ‖·‖₂ is the
Euclidean norm or ℓ2 norm.
- Note that β_0 is not penalized, and is equal to ȳ when the columns of X are normalized
to have mean 0; in other words, we care about the sensitivity of Y with respect
to X_j, but not its own level.

Ping Yu (HKU) Model Selection and Regularization 25 / 98
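A minimal R sketch of fitting the whole RR path in one call with the glmnet package mentioned later in this lecture (alpha = 0 selects RR); the data are simulated for illustration:

```r
# Minimal sketch: ridge regression path with glmnet (alpha = 0).
library(glmnet)
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:3] %*% c(2, -1, 1) + rnorm(n))

grid   <- 10^seq(4, -2, length.out = 100)          # lambda from very large to very small
fit_rr <- glmnet(x, y, alpha = 0, lambda = grid)   # one call fits the entire path
dim(coef(fit_rr))                                  # (p + 1) x 100: one column per lambda
coef(fit_rr, s = 1e4)[1:5, ]                       # huge lambda: coefficients near zero
coef(fit_rr, s = 1e-2)[1:5, ]                      # tiny lambda: close to least squares
```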


Shrinkage Methods Ridge Regression

Continued

As with least squares, RR seeks coefficient estimates that fit the data well, by
making the RSS small.
However, the second term, λ ∑_{j=1}^p β_j², called a shrinkage penalty, is small when
β_1, ..., β_p are close to zero, and so it has the effect of shrinking the estimates of β_j
towards zero.
The tuning parameter λ serves to control the relative impact of these two terms on
the regression coefficient estimates.
Selecting a good value for λ is critical; the cross-validation discussed below is used for
this.
RR has substantial computational advantages over best subset selection – only
one model (rather than 2^p models) is fit for each λ.
- Actually, computations required to solve (1), simultaneously for all values of λ , are
almost identical to those for fitting a model using least squares.
Different from least squares, RR can be conducted even if p > n.

Ping Yu (HKU) Model Selection and Regularization 26 / 98


Shrinkage Methods Ridge Regression

[Example] Credit

β̂_λ^R is a function of λ: as λ → 0, β̂_λ^R → β̂_LS, and as λ → ∞, β̂_λ^R → (β̂_0, 0, 0, ..., 0)^T, the
null model.
β̂_λ^R is generally decreasing in λ, but need not be strictly so; in any case, ‖β̂_λ^R‖₂ must
be decreasing in λ, so there is a one-to-one correspondence between λ ∈ (0, ∞)
and ‖β̂_λ^R‖₂ / ‖β̂_LS‖₂ ∈ (0, 1), as shown in the right panel of Figure 6.4.
Ping Yu (HKU) Model Selection and Regularization 27 / 98
Shrinkage Methods Ridge Regression

Scaling of Predictors

β̂_LS is scale equivariant: if X_j → cX_j then β̂_{j,LS} → β̂_{j,LS}/c, such that X_j β̂_{j,LS} remains
the same.
In contrast, β̂_{j,λ}^R can change substantially and need not change to β̂_{j,λ}^R /c when
X_j → cX_j, due to the equal weights on the β_j² in ∑_{j=1}^p β_j², such that X_j β̂_{j,λ}^R would change.
- Actually, X_j β̂_{λ,j}^R depends also on the scaling of the other predictors.
Therefore, it is best to apply RR after standardizing the predictors:

x̃_{ij} = (x_{ij} − x̄_j) / sqrt( (1/n) ∑_{i=1}^n (x_{ij} − x̄_j)² ),

i.e., use the z-score of x_{ij}.
- For any j, (x̃_{1j}, ..., x̃_{nj})^T has mean 0 and standard deviation 1.
- The "standardized coefficients" in Figure 6.4 are generated in this way; now, β̂_{λ,j}^R has
the same meaning – the average change in y when x_j changes by one standard deviation.

Ping Yu (HKU) Model Selection and Regularization 28 / 98


Shrinkage Methods Ridge Regression

Why Does RR Improve Over LS?

Simulated data with n = 50 observations, p = 45 predictors, all having nonzero
coefficients.
RR's advantage over LS (λ = 0) is rooted in the bias-variance trade-off.
The optimal λ ≈ 30; due to the high variance of β̂_LS (because p is close to n), the MSE
of LS ≈ the MSE at λ = ∞; RR works best in situations where β̂_LS has high variance
(e.g., multicollinearity, or p is large).
Ping Yu (HKU) Model Selection and Regularization 29 / 98
Shrinkage Methods Ridge Regression

Why the name "Ridge Regression"?

The solution to (1) is β̂^R = (X^T X + λI)^{−1} X^T y, where a ridge parameter λ times the
identity matrix is added to the correlation matrix X^T X (when X is standardized),
while the diagonal of ones in the correlation matrix can be described as a "ridge".
Historically, "ridge" actually refers to the characteristics of the function we were
attempting to optimize, rather than to adding a "ridge" to the XT X matrix.

Ping Yu (HKU) Model Selection and Regularization 30 / 98
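A minimal R sketch of the closed-form solution above, on simulated data with standardized predictors and a centered response (so the unpenalized intercept can be dropped):

```r
# Minimal sketch: beta_ridge = (X'X + lambda I)^{-1} X'y versus least squares.
set.seed(1)
n <- 100; p <- 5; lambda <- 10
X <- scale(matrix(rnorm(n * p), n, p))                     # standardized predictors
y <- X %*% c(3, -2, 0, 0, 1) + rnorm(n)
y <- y - mean(y)                                           # centered response

beta_ridge <- drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
beta_ls    <- drop(solve(crossprod(X), crossprod(X, y)))
round(cbind(ls = beta_ls, ridge = beta_ridge), 3)          # ridge estimates are shrunk toward 0
```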


Shrinkage Methods The Lasso

History of the Lasso

Robert Tibshirani (1956-, Stanford)3

Tibshirani, R., 1996, Regression Shrinkage and Selection via the lasso, Journal of the Royal
Statistical Society, Series B, 58, 267-88.
³Previously at Toronto, recipient of the 1996 COPSS Award, inventor of the LASSO, and a student of
Efron.
Ping Yu (HKU) Model Selection and Regularization 31 / 98
Shrinkage Methods The Lasso

The Lasso

RR does have one obvious disadvantage: unlike subset selection, which will gen-
erally select models that involve just a subset of the variables, RR will include all p
predictors in the final model (unless λ = ∞).
- This is not a problem for prediction accuracy, but a challenge in model interpreta-
tion when p is large.
The lasso is a relatively recent alternative to RR that overcomes this disadvantage.
The lasso coefficients, β̂_λ^L, minimize the quantity

∑_{i=1}^n ( y_i − β_0 − ∑_{j=1}^p β_j x_{ij} )² + λ ∑_{j=1}^p |β_j| = RSS + λ‖β‖₁,   (2)

where ‖·‖₁ is the ℓ1 norm, i.e., the lasso uses an ℓ1 penalty instead of an ℓ2 penalty.

Ping Yu (HKU) Model Selection and Regularization 32 / 98


Shrinkage Methods The Lasso

Continued

As with RR, the lasso shrinks the coefficient estimates towards zero, so can reduce
variance at the expense of a small increase in bias.
However, in the case of the lasso, the `1 penalty has the effect of forcing some of
the coefficient estimates to be exactly equal to zero when the tuning parameter λ
is sufficiently large.
Hence, much like best subset selection, the lasso performs variable selection, and
the resulting model is much easier to interpret than that in RR.
We say that the lasso yields sparse models – that is, models that involve only a
subset of the variables.
As in RR, selecting a good value of λ for the lasso is critical; cross-validation is
again the method of choice.
Like RR, the entire coefficient paths can be computed with about the same amount
of work as a single least squares fit.

Ping Yu (HKU) Model Selection and Regularization 33 / 98
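A minimal R sketch of the lasso path with glmnet (alpha = 1) on assumed simulated data with a sparse truth; unlike RR, coefficients hit exactly zero along the path:

```r
# Minimal sketch: lasso path and its variable selection property.
library(glmnet)
set.seed(2)
n <- 50; p <- 45
x <- matrix(rnorm(n * p), n, p)
y <- 4 * x[, 1] - 3 * x[, 2] + rnorm(n)        # sparse truth: only 2 signal variables

fit_lasso <- glmnet(x, y, alpha = 1)           # glmnet chooses its own lambda grid
# number of nonzero coefficients at a few values of lambda along the path
sapply(c(2, 0.5, 0.05), function(s) sum(coef(fit_lasso, s = s)[-1] != 0))
```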


Shrinkage Methods The Lasso

[Example] Credit

The limits of β̂_λ^L as λ → 0 and λ → ∞ are the same as for β̂_λ^R; β̂_λ^L is also generally
decreasing in λ, and ‖β̂_λ^L‖₁ must be decreasing in λ.
In between, i.e., for λ ∈ (0, ∞), β̂_λ^L and β̂_λ^R are very different: rating, student, limit, and
income enter the model sequentially, so any number of variables is possible, while
RR involves all variables.
Ping Yu (HKU) Model Selection and Regularization 34 / 98
Shrinkage Methods The Lasso

(**) More Properties of the Lasso Coefficient Path

There is a finite sequence of λ's,

λ_0 > λ_1 > λ_2 > ··· > λ_K = 0,

such that
1. For all λ > λ_0, β̂_λ^L = 0.
2. In the interior of the interval (λ_{m+1}, λ_m), the active set B(λ) := {j : β̂_{λj}^L ≠ 0} and the sign
vector of the active β̂_{λj}^L's are constant with respect to λ, so they can be denoted as B_m and
sign_m.
3. When λ decreases from λ = λ_m, some predictors with zero coefficients at λ_m are
about to have nonzero coefficients; as λ approaches λ_m from above, there are possibly some
predictors in B_m whose coefficients reach zero.
From the Kuhn-Tucker conditions, we have the equiangular conditions:

λ = 2|x_j^T (y − Xβ̂_λ^L)| for all j ∈ B(λ),
λ > 2|x_j^T (y − Xβ̂_λ^L)| for all j ∉ B(λ),

which imply that the coefficient path for the lasso is continuous and piecewise linear
over the entire range of λ, with the transition points {λ_m} occurring whenever the
active set changes, or the signs of the coefficients change.
Ping Yu (HKU) Model Selection and Regularization 35 / 98
Shrinkage Methods The Lasso

The Variable Selection Property of the Lasso

Why is it that the lasso, unlike RR, results in coefficient estimates that are exactly
equal to zero?
One can show that the lasso and RR coefficient estimates solve the problems

min_β ∑_{i=1}^n ( y_i − β_0 − ∑_{j=1}^p β_j x_{ij} )²  subject to  ∑_{j=1}^p |β_j| ≤ s

and

min_β ∑_{i=1}^n ( y_i − β_0 − ∑_{j=1}^p β_j x_{ij} )²  subject to  ∑_{j=1}^p β_j² ≤ s,

respectively. [see Figure 6.7 for p = 2]
- λ is one-to-one and inversely related to s.
- s is like a budget that can be spent by the |β_j|'s or the β_j²'s.

Ping Yu (HKU) Model Selection and Regularization 36 / 98


Shrinkage Methods The Lasso

Best subset selection among s-variable models can be reformulated in a similar
manner:

min_β ∑_{i=1}^n ( y_i − β_0 − ∑_{j=1}^p β_j x_{ij} )²  subject to  ∑_{j=1}^p I(β_j ≠ 0) ≤ s,

so RR and the lasso provide computationally feasible alternatives to best subset selection
by replacing the form of the budget, or through convexification.
Ping Yu (HKU) Model Selection and Regularization 37 / 98
Shrinkage Methods The Lasso

Comparing the Lasso and RR

Same simulated data as in Figure 6.5.


Lasso and RR have similar biases, but RR has a slightly lower variance than lasso.

Ping Yu (HKU) Model Selection and Regularization 38 / 98


Shrinkage Methods The Lasso

A design favorable to lasso: 2 out of 45 (rather than all) predictors contribute to y .


Lasso outperforms RR in terms of bias, variance, and MSE.

Ping Yu (HKU) Model Selection and Regularization 39 / 98


Shrinkage Methods The Lasso

Conclusions and Discussions

These two examples illustrate that neither RR nor the lasso will universally domi-
nate the other.
In general, the lasso is suitable for sparse models with a small number of substan-
tial coefficients, while RR is suitable for models involving many predictors, all with
roughly equal coefficients.
However, the number of predictors that is related to the response is never known a
priori for real datasets.
A technique such as cross-validation can be used in order to determine which ap-
proach is better on a particular dataset.

(**) For extensions of the lasso, see Appendix A.


(**) For a Bayesian interpretation of RR and the lasso, see Appendix B.

A popular package of R used for RR and the lasso is glmnet.


Another popular package of R used for the lasso is gamlr.
Both packages run the lasso in a similar way, and the difference lies in what they
can do beyond the lasso, e.g., gamlr can conduct the Gamma-Lasso in Appendix B, which
includes the lasso as a special case.
Ping Yu (HKU) Model Selection and Regularization 40 / 98
Shrinkage Methods The Lasso

A Simple Special Case for RR and Lasso

n = p, X = I, and β_0 = 0 is known.
LS:

min_β ∑_{j=1}^p (y_j − β_j)²  ⟹  β̂_j = y_j.

RR:

min_β ∑_{j=1}^p (y_j − β_j)² + λ ∑_{j=1}^p β_j²  ⟹  β̂_j^R = y_j / (1 + λ).

- In Figure 6.10, each β̂_j is shrunken by the same proportion.
Lasso:

min_β ∑_{j=1}^p (y_j − β_j)² + λ ∑_{j=1}^p |β_j|  ⟹  β̂_j^L = y_j − λ/2 if y_j > λ/2;  y_j + λ/2 if y_j < −λ/2;  0 if |y_j| ≤ λ/2;

that is, β̂_j^L = sign(y_j) (|y_j| − λ/2)_+ = (1 − (λ/2)/|y_j|)_+ y_j.

- In Figure 6.10, each β̂_j is shrunken toward zero by a constant amount, λ/2,
known as soft-thresholding.⁴

⁴A hard-thresholding is like β̂_j^H = β̂_j I(|β̂_j| > √(2λ)), as in best subset selection, where the ℓ1 penalty of the
lasso is changed to the ℓ0 penalty, λ ∑_{j=1}^p |β_j|⁰, i.e., counting the number of nonzero elements in β. [Exercise]
Ping Yu (HKU) Model Selection and Regularization 41 / 98
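A minimal R sketch of this special case: the RR and lasso solutions are just proportional shrinkage and soft-thresholding of the y_j's, respectively:

```r
# Minimal sketch: ridge shrinkage y/(1 + lambda) versus lasso soft-thresholding.
ridge_shrink   <- function(y, lambda) y / (1 + lambda)
soft_threshold <- function(y, lambda) sign(y) * pmax(abs(y) - lambda / 2, 0)

y      <- c(-3, -0.4, 0.1, 0.8, 2.5)
lambda <- 1
rbind(ls    = y,
      ridge = ridge_shrink(y, lambda),     # every coefficient shrunk by the same proportion
      lasso = soft_threshold(y, lambda))   # small |y_j| are set exactly to zero
```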
Shrinkage Methods The Lasso

This example is special since β̂_j^R is linearly shrunken, so sign(β̂_j^R) = sign(β̂_j),
while in general (especially in the multicollinear case), sign(β̂_j^R) can be reversed
[see Figure 6.4].
Although typically, once β̂_j^L hits zero it stays there (just as in this example), β̂_j^L can
also reverse its sign. [Exercise]
Ping Yu (HKU) Model Selection and Regularization 42 / 98
Shrinkage Methods The Lasso

Comparison Between the Subset Selection and Shrinkage Methods

The subset selection methods are unstable in the sense that a small change in the
data can make large changes in the sequence M_0, M_1, ..., M_p. On the other
hand, the solutions of the shrinkage methods are continuous in λ, so they are relatively
stable.
- (**) CART (indexed by the number of terminal nodes) in Lecture 8, single-layer
neural networks (indexed by the number of hidden units) in Lecture 10, and the ℓq
penalty with 0 < q < 1 in Appendix A are also unstable, while KNN and bagging in
Lecture 8 are stable.⁵
- The more unstable the procedure, the noisier the test error, and the less able we are
to locate the best model.
The subset selection methods are sequential, i.e., first model selection and then
estimation, while the shrinkage methods are simultaneous, conducting model se-
lection and estimation together.
The best subset selection can handle only tens of features, while the shrinkage
methods can handle more than thousands of features.

5
Recall that Clustering in Lecture 3 is also unstable in this sense, but it is an unsupervised learning method.
Ping Yu (HKU) Model Selection and Regularization 43 / 98
Cross Validation

Cross Validation
(Section 5.1)

Ping Yu (HKU) Model Selection and Regularization 44 / 98


Cross Validation

Resampling Methods

Cross-validation (CV) is a resampling method; such methods involve repeatedly


drawing samples from a training set and refitting a model of interest on each sample
in order to obtain additional information about the fitted model.
Resampling approaches can be computationally expensive, because they involve
fitting the same statistical method multiple times using different subsets of the train-
ing data, but this is generally not prohibitive nowadays due to recent advances in
computing power.
We discuss two resampling methods in this course, the bootstrap besides CV. CV
would be discussed in this lecture, and the bootstrap would be discussed in Lecture
8.
CV can be used in both model assessment and model selection. The former means
evaluating a model’s performance, e.g., estimating the test error associated with a
given statistical learning method, and the latter means selecting the proper level of
flexibility for a model.

Ping Yu (HKU) Model Selection and Regularization 45 / 98


Cross Validation

Training Error versus Test Error

Recall the distinction between the test error and the training error.
The test error is the average error that results from using a statistical learning
method to predict the response on a new observation, one that was not used in
training the method.
- Best solution: a large designated test set, but one is often not available.
In contrast, the training error can be easily calculated by applying the statistical
learning method to the observations used in its training.
But the training error rate often is quite different from the test error rate, and in
particular the former can dramatically underestimate the latter. [figure here]
CV estimates the test error by holding out a subset of the training observations from
the fitting process, and then applying the statistical learning method to those held
out observations.

Ping Yu (HKU) Model Selection and Regularization 46 / 98


Cross Validation

Figure: Test and training error as a function of model complexity.

Ping Yu (HKU) Model Selection and Regularization 47 / 98


Cross Validation The Validation Set Approach

The Validation Set Approach

Here we randomly divide the available set of samples into two parts: a training set
and a validation or hold-out set. [see Figure 5.1]

The model is fit on the training set, and the fitted model is used to predict the
responses for the observations in the validation set.
The resulting validation-set error provides an estimate of the test error. This is
typically assessed using MSE in the case of a quantitative response and misclas-
sification rate in the case of a qualitative (discrete) response.

Ping Yu (HKU) Model Selection and Regularization 48 / 98


Cross Validation The Validation Set Approach

[Example] Auto

n = 392, n_training = n_validation = 196. The selected polynomial degree p̂ = 2 in the left panel, and p̂ is not unique across
different splittings (anyway, p̂ ≠ 1 for all splittings).

Ping Yu (HKU) Model Selection and Regularization 49 / 98
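A minimal R sketch of the validation set approach on the Auto data (assuming the ISLR package), along the lines of Lab 3:

```r
# Minimal sketch: validation set approach for polynomial regression on Auto.
library(ISLR)
set.seed(1)
train <- sample(392, 196)                          # a random half as the training set

val_mse <- sapply(1:5, function(d) {
  fit <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = train)
  mean((Auto$mpg - predict(fit, Auto))[-train]^2)  # MSE on the held-out half
})
val_mse   # degree 1 is clearly worse; higher degrees add little beyond degree 2
```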


Cross Validation The Validation Set Approach

Drawbacks of Validation Set Approach

1 The validation estimate of the test error can be highly variable, depending on pre-
cisely which observations are included in the training set and which observations
are included in the validation set.
2 In the validation approach, only a subset of the observations – those that are in-
cluded in the training set rather than in the validation set – are used to fit the model.
- Since statistical methods tend to perform worse when trained on fewer observa-
tions, this suggests that the validation set error may tend to overestimate the test
error for the model fit on the entire data set.

CV refines the validation set approach by addressing these two issues.

Ping Yu (HKU) Model Selection and Regularization 50 / 98


Cross Validation K -Fold Cross-Validation

K -Fold Cross-Validation

K -fold CV is a widely used approach for estimating test error.


Estimates can be used to select the best model, and to give an idea of the test error
of the final chosen model.
The idea is to randomly divide the data into K equal-sized parts. We leave out part
k , fit the model to the other K 1 parts (combined), and then obtain predictions for
the left-out k th part.
This is done in turn for each part k = 1, 2, , K , and then the results are combined.
[see Figure 5.5]
Let the K parts be C_1, C_2, ..., C_K, where C_k denotes the indices of the observations
in part k. There are n_k observations in part k: if n is a multiple of K, then
n_k = n/K.
Compute

CV_K = ∑_{k=1}^K (n_k/n) MSE_k, which equals (1/K) ∑_{k=1}^K MSE_k when n_k = n/K,

where MSE_k = ∑_{i∈C_k} (y_i − ŷ_i)²/n_k, and ŷ_i is the fit for observation i, obtained from
the data with part k removed.
Setting K = n yields n-fold or leave-one-out cross-validation (LOOCV). [see Figure
5.3]
Ping Yu (HKU) Model Selection and Regularization 51 / 98
Cross Validation K -Fold Cross-Validation

Ping Yu (HKU) Model Selection and Regularization 52 / 98


Cross Validation K -Fold Cross-Validation

Ping Yu (HKU) Model Selection and Regularization 53 / 98


Cross Validation K -Fold Cross-Validation

Bias-Variance Trade-Off for K -Fold Cross-Validation

Why can K -fold CV (or LOOCV) avoid or mitigate the two drawbacks of the valida-
tion set approach?
1 Since K > 2, the variability of CVK is typically much lower than that in the validation set
approach.6 [see the example below]
2 Since each training set is only (K − 1)/K as big as the original training set, the estimates
of prediction error will typically be biased upward, but not as much as in the validation set
approach where K = 2.
In LOOCV (K = n), the bias of the prediction error is minimized and there is no variability
at all in CV_n.
However, LOOCV typically doesn’t shake up the data enough. The estimates from
each fold are highly (positively) correlated and hence their average can have high
variance.7
K = 5 or 10 provides a good compromise for this bias-variance tradeoff.

⁶(**) Note that given the data, the variability of CV_K depends on both K and the number of possible splittings;
the latter need not decrease with K, e.g., if n = 6, then #_{K=2} = C_3^6/2! = 10, while #_{K=3} = C_2^6 C_2^4 C_2^2 /3! = 15 > 10.
⁷(**) Note that the variance here is different from the variability of CV_K above: the former is from the random
sampling of the original data from the population, while the latter is conditional on the data, i.e., the data are fixed.
Ping Yu (HKU) Model Selection and Regularization 54 / 98
Cross Validation K -Fold Cross-Validation

A Nice Special Case

With least-squares linear or polynomial regression, an amazing shortcut makes the


cost of LOOCV the same as that of a single model fit! The following formula holds:
CV_n = (1/n) ∑_{i=1}^n [ (y_i − ŷ_i)/(1 − h_i) ]²,

where ŷ_i is the ith fitted value from the original least squares fit, and h_i is the
leverage (the ith diagonal element of the "hat" matrix), reflecting the amount that an observation
influences its own fit; in a simple linear regression, h_i = 1/n + (x_i − x̄)²/∑_{i′=1}^n (x_{i′} − x̄)² ∈ [1/n, 1].
This is like the ordinary MSE, except that the ith residual is inflated by dividing by 1 − h_i.
As mentioned above, a better choice is K = 5 or 10, which can also save computa-
tional time in a general machine learning method.

Ping Yu (HKU) Model Selection and Regularization 55 / 98
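A minimal R sketch (simulated data assumed) verifying that the leverage shortcut reproduces the brute-force LOOCV error for a least squares fit:

```r
# Minimal sketch: LOOCV via the leverage formula versus refitting n times.
set.seed(6)
n <- 40
x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)

fit <- lm(y ~ x)
h   <- hatvalues(fit)                              # leverages h_i
cv_shortcut <- mean(((y - fitted(fit)) / (1 - h))^2)

cv_brute <- mean(sapply(1:n, function(i) {
  fit_i <- lm(y ~ x, subset = -i)                  # refit without observation i
  (y[i] - predict(fit_i, data.frame(x = x[i])))^2
}))
c(shortcut = cv_shortcut, brute_force = cv_brute)  # identical up to rounding error
```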


Cross Validation K -Fold Cross-Validation

[Example] Auto Revisited

Ping Yu (HKU) Model Selection and Regularization 56 / 98


Cross Validation K -Fold Cross-Validation

Although CVs sometimes underestimate the true test MSE, all of the CV curves
come close to identifying the correct level of flexibility – that is, the flexibility level
corresponding to the smallest test MSE.

Ping Yu (HKU) Model Selection and Regularization 57 / 98


Cross Validation K -Fold Cross-Validation

Comparison with Cp , AIC, BIC, and Adjusted R 2

The validation set and cross-validation methods have an advantage relative to AIC,
BIC, Cp, and adjusted R², in that they provide a direct estimate of the test error and
do not require an estimate of the error variance σ².
It can also be used in a wider range of model selection tasks, even in cases where
it is hard to pinpoint the model degrees of freedom (e.g., the number of predictors
in the model) or hard to estimate the error variance σ 2 .
In Figure 6.3, the validation errors were calculated by randomly selecting three-
quarters of the observations as the training set, and the remainder as the validation
set; the cross-validation errors were computed using K = 10 folds.
- In this case, the validation and cross-validation methods both result in a six-variable
model, but all three approaches suggest that the four-, five-, and six-variable models
are roughly equivalent in their test errors.
In this setting, we can select a model using the one-standard-error rule. We first
calculate the standard error of the estimated test MSE for each model size (by
using different randomly chosen training and validation sets), and then select the
smallest or most parsimonious model for which the estimated test error is within
one standard error of the lowest point on the curve.
- The rationale here is that if a set of models appear to be more or less equally
good, then we might as well choose the simplest model – that is, the model with
the smallest number of predictors.
Ping Yu (HKU) Model Selection and Regularization 58 / 98
Cross Validation K -Fold Cross-Validation

[Example] Credit

The one-standard-error rule leads to a three-variable model in the validation set or


cross-validation approach.

Ping Yu (HKU) Model Selection and Regularization 59 / 98


Cross Validation Selecting Tuning Parameters in RR and Lasso

Selecting Tuning Parameters in RR and Lasso

As for subset selection, for RR and lasso we require a method to determine which
of the models under consideration is best.
That is, we require a method for selecting a value of the tuning parameter λ or,
equivalently, of the constraint s.
Cross-validation provides a simple way to tackle this problem. We choose a grid of
λ values, and compute the cross-validation error rate for each value of λ .
We then select the tuning parameter value for which the cross-validation error is
smallest.
Finally, the model is re-fit using all of the available observations and the selected
value of the tuning parameter.

Ping Yu (HKU) Model Selection and Regularization 60 / 98
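A minimal R sketch of this recipe with cv.glmnet on assumed simulated data: choose λ by 10-fold CV, then refit on all observations at the chosen value; lambda.1se corresponds to the one-standard-error rule discussed earlier:

```r
# Minimal sketch: selecting lambda for the lasso by 10-fold cross-validation.
library(glmnet)
set.seed(3)
n <- 50; p <- 45
x <- matrix(rnorm(n * p), n, p)
y <- 4 * x[, 1] - 3 * x[, 2] + rnorm(n)

cv_out <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
cv_out$lambda.min              # lambda with the smallest CV error
cv_out$lambda.1se              # largest lambda within one SE of the minimum

final <- glmnet(x, y, alpha = 1, lambda = cv_out$lambda.min)
coef(final)                    # model refit on all of the data at the chosen lambda
```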


Cross Validation Selecting Tuning Parameters in RR and Lasso

[Example] RR for Credit

The λ chosen by LOOCV is relatively small, so there is not much shrinkage relative to LS; the CV-error
curve is flat around this λ; maybe simply use LS.

Ping Yu (HKU) Model Selection and Regularization 61 / 98


Cross Validation Selecting Tuning Parameters in RR and Lasso

[Example] Lasso for Simulated Data

10-fold CV λ sieves out the two signal variables and discards all noise variables
even if p = 45 and n = 50, while LS assigns a large coefficient estimate to only one
of the two signal variables.

Ping Yu (HKU) Model Selection and Regularization 62 / 98


Cross Validation Cross-Validation on Classification Problems

Cross-Validation on Classification Problems

For the same training/validation set split, compute

CV_K = ∑_{k=1}^K (n_k/n) Err_k, which equals (1/K) ∑_{k=1}^K Err_k when n_k = n/K,

where Err_k = ∑_{i∈C_k} I(y_i ≠ ŷ_i)/n_k.
The estimated standard deviation (or standard error) of CV_K is

SE(CV_K) = sqrt( (1/K) ∑_{k=1}^K (Err_k − CV_K)²/(K − 1) ).

This is a useful estimate, but strictly speaking, not quite valid. Why not? It accounts
only for between-fold variation, not for within-fold variation.
The following figure shows logistic regression fits using polynomials of orders 1-4
on the two-dimensional classification data displayed in Figure 2.13, where the test
error rates are 0.201, 0.197, 0.160 and 0.162 respectively, while the Bayes error rate
is 0.133; Figure 5.8 shows how to use CV_10 to choose the order of the polynomial.

Ping Yu (HKU) Model Selection and Regularization 63 / 98


Cross Validation Cross-Validation on Classification Problems

Ping Yu (HKU) Model Selection and Regularization 64 / 98


Cross Validation Cross-Validation on Classification Problems

Although the CV error curve slightly underestimates the test error rate, it takes on
a minimum very close to the best value for the order of the polynomial or the number
of neighbors.

Ping Yu (HKU) Model Selection and Regularization 65 / 98


Cross Validation Cross-Validation on Classification Problems

(*) Cross-Validation: Right and Wrong

Consider a simple classifier applied to some two-class data:


1 Starting with 5000 predictors and 50 samples, find the 100 predictors having the largest
correlation with the class labels.
2 We then apply a classifier such as logistic regression, using only these 100 predictors.
How do we estimate the test set performance of this classifier?
Can we apply cross-validation in Step 2, forgetting about Step 1?
NO! This would ignore the fact that in Step 1, the procedure has already seen the
labels of the training data, and made use of them. This is a form of training and
must be included in the validation process.
It is easy to simulate realistic data with the class labels independent of the features,
so that the true test error = 50%, but the CV error estimate that ignores Step 1 is zero!
Try to do this yourself. [see the sketch after this slide]
With a multistep modeling procedure, CV must be applied to the entire sequence of
modeling steps.

Ping Yu (HKU) Model Selection and Regularization 66 / 98
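A minimal R sketch of the wrong and right ways just described, on assumed simulated data whose labels are independent of the features (screening by correlation, then a 1-nearest-neighbor classifier):

```r
# Minimal sketch: screening outside CV ("wrong") versus inside CV ("right").
library(class)                         # for knn()
set.seed(123)
n <- 50; p <- 5000; K <- 5
x <- matrix(rnorm(n * p), n, p)
y <- factor(rep(0:1, each = n / 2))    # labels independent of x: true error = 50%

folds  <- sample(rep(1:K, length.out = n))
screen <- function(x, y, m = 25)       # indices of the m predictors most correlated with y
  order(abs(cor(x, as.numeric(y))), decreasing = TRUE)[1:m]

## WRONG: screen once on the full data, then cross-validate only the classifier
keep <- screen(x, y)
err_wrong <- mean(sapply(1:K, function(k) {
  te   <- folds == k
  pred <- knn(x[!te, keep], x[te, keep], y[!te], k = 1)
  mean(pred != y[te])
}))

## RIGHT: redo the screening inside every training fold
err_right <- mean(sapply(1:K, function(k) {
  te     <- folds == k
  keep_k <- screen(x[!te, ], y[!te])
  pred   <- knn(x[!te, keep_k], x[te, keep_k], y[!te], k = 1)
  mean(pred != y[te])
}))

c(wrong = err_wrong, right = err_right)  # "wrong" is far below 0.5; "right" is near 0.5
```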


Cross Validation Cross-Validation on Classification Problems

(*) Wrong Way and Right Way

Ping Yu (HKU) Model Selection and Regularization 67 / 98


Cross Validation Cross-Validation on Classification Problems

(**) Comparison Between Covariance Penalties and Cross-Validation

Both Cp and AIC are covariance penalty methods.
In Cp, y ~ (µ, σ²I), L(y, µ) = (y − µ)², and µ̂ = Py is a linear rule, where P is a
function of the covariates and is always assumed fixed. The target is the prediction error

Err = ∑_{i=1}^n Err_i = ∑_{i=1}^n E_0[ L(y_i^0, µ̂_i) ],

where E_0 is the expectation over y_i^0 ~ (µ_i, σ²) independent of y, with µ̂_i held fixed.
Since err_i = L(y_i, µ̂_i) would under-estimate Err_i, we define the optimism
O_i = Err_i − err_i
and estimate O_i by
Ω_i = E_y[O_i],
where E_y is the expectation over the y which appear in y_i and µ̂_i.
It turns out that
Ω_i = 2 Cov(µ̂_i, y_i) = 2σ² P_ii,
where P_ii is the ith diagonal element of P.
If µ̂ is the LSE, then ∑_{i=1}^n Cov(µ̂_i, y_i) = σ²d. So Err ≈ ∑_{i=1}^n err_i + 2σ²d = RSS + 2dσ².
Ping Yu (HKU) Model Selection and Regularization 68 / 98
Cross Validation Cross-Validation on Classification Problems

(**) Continued

In AIC, y_i ~ f_µ(y) with µ = E[y], L(y, µ) = −2 log f_µ(y) is the deviance loss (which
is equivalent to the squared error loss in the Gaussian model with σ² known), and
µ̂_i is the MLE (which is equivalent to the LSE in the Gaussian case). Now,

Ω_i = 2 Cov(λ̂_i, y_i),

with λ̂_i being a function of µ̂_i depending on f_µ(y); e.g., for the Bernoulli case,
λ̂_i = logit(µ̂_i), and for the Gaussian case, λ̂_i = µ̂_i/σ². It turns out that ∑_{i=1}^n Cov(λ̂_i, y_i) ≈
d, which is obvious for the Gaussian case. As a result,

Err ≈ ∑_{i=1}^n [ −2 log f_{µ̂_i}(y_i) ] + 2d = −2 log L + 2d.

Both the covariance penalty methods and cross-validation are conditional or
local methods in the sense that the randomness in O_i comes only from the ith data point,
although the former fixes x_i while the latter allows x_i to be random.
The former is model-based while the latter is model-free, so the latter is more robust against
model mis-specification.
The former is typically more efficient in estimating Ω := ∑_{i=1}^n Ω_i than the latter,
although it imposes more assumptions on the model.
Ping Yu (HKU) Model Selection and Regularization 69 / 98
Considerations in High Dimensions

(*) Considerations in High Dimensions


(Section 6.4)

Ping Yu (HKU) Model Selection and Regularization 70 / 98


Considerations in High Dimensions

High-Dimensional Data

Traditionally, we consider only low-dimensional problems, where by dimension we


mean the size of p.
In the past 20 years, high-dimensional (i.e., p close to or larger than n) data have become popular, where p
can be extremely large, while n is often limited due to cost, sample availability, etc.
For example, a marketing analyst interested in understanding people's online shopping
patterns could treat as features all of the search terms entered by users of a
search engine, which is sometimes known as the "bag-of-words" model.
Maybe only a few hundred search engine users would be willing to share their search
histories with the researcher.
In this example, n is a few hundred at most, while p is much larger.
In this case, classical approaches such as least squares, logistic regression, LDA,
etc., cannot be applied.
Issues like bias-variance trade-off and the danger of overfitting are particularly im-
portant in the high-dimensional case.

Ping Yu (HKU) Model Selection and Regularization 71 / 98


Considerations in High Dimensions

What Goes Wrong in High Dimensions?

Consider least squares when p is close to or larger than n.


In Figure 6.22, we consider p = 2 while n = 20 and 2.
When n = 2, least squares results in a perfect fit, which certainly leads to overfitting
of the data (e.g., check this fitting on the data in the left panel).
The problem is that when p > n or p ≈ n, a simple least squares regression line is
too flexible and hence overfits the data.
In Figure 6.23, n = 20 observations were simulated, and between 1 and 20 features,
each of which was completely unrelated to the response, were added to the
regression.
- R² increases to 1, the training MSE decreases to 0, and the test MSE increases, because including the additional predictors
leads to a vast increase in the variance of β̂.
The Cp, AIC, and BIC approaches are not appropriate here since estimating σ² is
problematic; the adjusted R² is not appropriate either, since we can easily obtain a
model with an adjusted R² value of 1.

Ping Yu (HKU) Model Selection and Regularization 72 / 98


Considerations in High Dimensions

Ping Yu (HKU) Model Selection and Regularization 73 / 98


Considerations in High Dimensions

Ping Yu (HKU) Model Selection and Regularization 74 / 98


Considerations in High Dimensions

Regression in High Dimensions

Many methods learned in this lecture for fitting less flexible least squares models,
such as forward stepwise selection, RR, the lasso, and those (would be) learned in
the future such as principal components regression and partial least squares, are
particularly useful for performing regression in the high-dimensional setting.
Figure 6.24 shows the performance of the lasso (test MSE against three values of
p) in a simulated example with n = 100 and p = 20, 50, and 2000.
- When p = 20, the best performance was achieved when λ is small, while when p
gets larger, a larger λ achieves the lowest validation set error.
Figure 6.24 highlights three important points:
1 regularization or shrinkage plays a key role in high-dimensional problems;
2 appropriate tuning parameter selection is crucial for good predictive performance;
3 the test error tends to increase as p increases, unless the additional features are truly
associated with the response, which is known as the curse of dimensionality.
Increasing p need not be a blessing: maybe only noise features are included, which
exacerbates the risk of overfitting by assigning nonzero coefficients due to chance
associations with the response; even if the new features are relevant, the variance
incurred in fitting their coefficients may outweigh the reduction in bias that they
bring.

Ping Yu (HKU) Model Selection and Regularization 75 / 98


Considerations in High Dimensions

Ping Yu (HKU) Model Selection and Regularization 76 / 98


Considerations in High Dimensions

Interpreting Results in High Dimensions

In the high-dimensional setting, the multicollinearity problem is extreme: any vari-


able in the model can be written as a linear combination of all of the other variables
in the model.
This means that we can never know exactly which variables (if any) truly are pre-
dictive of the outcome, and we can never identify the best coefficients for use.
At most, we can hope to assign large regression coefficients to variables that are
correlated with the variables that truly are predictive of the outcome.
In the marketing example, we may identify 17 search terms by forward stepwise
selection, but there are likely to be many sets of 17 terms (perhaps even non-
overlapping sets) that would predict the online shopping behaviors just as well as
the selected model.
We must be careful not to overstate the results obtained, and to make it clear that
what we have identified is simply one of many possible models for predicting online
shopping, and that it must be further validated on independent data sets.
It is also important to be careful in reporting errors and measures of model fit in
the high-dimensional setting: one should never use the sum of squared errors,
p-values, R², or other traditional measures of model fit on the training data as
evidence of a good model fit; rather, report results
like the MSE or R² on an independent test set, or CV errors.
Ping Yu (HKU) Model Selection and Regularization 77 / 98
Considerations in High Dimensions

Summary

Model selection methods are an essential tool for data analysis, especially for big
datasets involving many predictors.
Research into methods that give sparsity, such as the lasso, is an especially hot
area.

Ping Yu (HKU) Model Selection and Regularization 78 / 98


Lab 1: Subset Selection Methods

Lab 1: Subset Selection Methods


(Section 6.5.1)

Best Subset Selection


Forward and Backward Stepwise Selection
Choosing Among Models Using the Validation Set Approach and CV

Ping Yu (HKU) Model Selection and Regularization 79 / 98


Lab2: Ridge Regression and the Lasso

Lab2: Ridge Regression and the Lasso


(Section 6.5.2)

Ridge Regression
The Lasso

Ping Yu (HKU) Model Selection and Regularization 80 / 98


Lab3: Cross-Validation

Lab3: Cross-Validation
(Section 5.3.1-3)

The Validation Set Approach


Leave-One-Out Cross-Validation
k -Fold Cross-Validation

Ping Yu (HKU) Model Selection and Regularization 81 / 98


Appendix A: Extensions of the Lasso

Appendix A: Extensions of the Lasso

Ping Yu (HKU) Model Selection and Regularization 82 / 98


Appendix A: Extensions of the Lasso

Extensions of the Lasso

Relaxed Lasso or Gauss Lasso: first find the variables with nonzero coefficients by
the lasso, and then refit the linear model with these variables by least squares. This
would help to remove the bias of β̂_j^L when it is nonzero.
Elastic Net: combine RR and the lasso by using the penalty

λ [ (1 − α)‖β‖₂²/2 + α‖β‖₁ ].

- It can overcome some limitations of the lasso, particularly when p > n, and with
very high degrees of collinearity; it tends to select correlated features (or not) together.

[Figure: penalty function in R¹; contour in R²]

Ping Yu (HKU) Model Selection and Regularization 83 / 98
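A minimal R sketch of the elastic net via glmnet, whose alpha argument mixes the two penalties in the same way (alpha = 0 is RR, alpha = 1 is the lasso); the data are simulated with two nearly collinear features:

```r
# Minimal sketch: elastic net (alpha = 0.5) on two highly correlated features.
library(glmnet)
set.seed(7)
x <- matrix(rnorm(100 * 20), 100, 20)
x[, 2] <- x[, 1] + 0.01 * rnorm(100)      # x2 is nearly identical to x1
y <- x[, 1] + rnorm(100)

enet <- cv.glmnet(x, y, alpha = 0.5)      # CV over lambda at a fixed alpha
coef(enet, s = "lambda.min")[2:3, ]       # the ridge part tends to let x1 and x2 share the fit
```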


Appendix A: Extensions of the Lasso

Continued

Group Lasso: group β_0 + ∑_{l=1}^p β_l x_{il} into θ_0 + ∑_{j=1}^J θ_j^T z_{ij}, where z_{ij} ∈ R^{p_j} are correlated
features such as the dummy variables for the levels of a qualitative variable, and
∑_{j=1}^J p_j = p, and change the penalty of the lasso to

λ ∑_{j=1}^J ‖θ_j‖₂,

where note that ‖θ_j‖₂ is not squared.
- It also tends to select correlated groups.
Fused Lasso: if x_{ij} is ordered in time, i.e., x_{ij} is actually x_{it}, and we hope time-neighboring
coefficients to be the same or similar (grouped in time), then we can
change the objective function of the lasso to

∑_{i=1}^n ( y_i − β_0 − ∑_{t=1}^p β_t x_{it} )² + λ_1 ∑_{t=1}^p |β_t| + λ_2 ∑_{t=2}^p |β_t − β_{t−1}|.

Ping Yu (HKU) Model Selection and Regularization 84 / 98


Appendix A: Extensions of the Lasso

Nonconvex Penalties

ℓq penalty with 0 < q < 1:
- Like the lasso, it performs variable selection or feature extraction.
- Unlike the lasso, the thresholding functions are discontinuous in β̂_j.
Adaptive Lasso: change the penalty of the lasso to

λ ∑_{j=1}^p w_j |β_j|,

where w_j = 1/|β̃_j|^ν, ν > 0, and β̃_j is a pilot estimate such as the LSE if p < n and
the univariate LSE if p ≥ n.
- It can be seen as an approximation to the ℓq penalties with q = 1 − ν, but the
objective function is convex in β given β̃.
Ping Yu (HKU) Model Selection and Regularization 85 / 98
Appendix A: Extensions of the Lasso

Continued
SCAD (smoothly clipped absolute deviation): replace the λ|β| of the lasso by J_{λ,a}(β) for
some a > 2 (a = 3.7 from a Bayesian argument), where

J_{λ,a}(β) = λ ∫_0^{|β|} [ I(s ≤ λ) + ((aλ − s)_+/((a − 1)λ)) I(s > λ) ] ds
= λ|β|, if 0 ≤ |β| ≤ λ;
= (1/(a − 1)) ( −λ²/2 + aλ|β| − β²/2 ), if λ < |β| ≤ aλ;
= (a + 1)λ²/2, if |β| > aλ.

- The second term in square brackets reduces the amount of shrinkage in the lasso
for larger values of β, with ultimately no shrinkage as |β| → ∞; this is similar to the
ℓq penalty with 0 < q < 1. [figure here]

Ping Yu (HKU) Model Selection and Regularization 86 / 98


Appendix A: Extensions of the Lasso

Continued
MC+ (minimax concave): replace the λ|β| of the lasso by

P_{λ,γ}(β) = λ ∫_0^{|β|} (1 − x/(λγ))_+ dx = ∫_0^{|β|} ((λγ − x)_+/γ) dx
= λ|β| − β²/(2γ), if 0 ≤ |β| ≤ γλ;
= γλ²/2, if |β| > γλ,

which translates the linear part of J_{λ,a}(β) to the origin, where γ ≥ 1 controls the concavity.

log penalty: replace the λ|β| of the lasso by log(1 + γ|β|).


Ping Yu (HKU) Model Selection and Regularization 87 / 98
Appendix A: Extensions of the Lasso

Continued: Some Others

Bridge: includes all ℓq penalties with q ≥ 0.

Garrote:

min_β ∑_{i=1}^n ( y_i − β_0 − ∑_{j=1}^p c_j β_j x_{ij} )²  s.t. ‖c‖₂ ≤ s.

Nonnegative Garrote:

min_β ∑_{i=1}^n ( y_i − β_0 − ∑_{j=1}^p c_j β_j x_{ij} )²
s.t. c_j ≥ 0, j = 1, ..., p,  and  ‖c‖₁ = ∑_{j=1}^p c_j ≤ s.

- The prediction is ∑_{j=1}^p c_j β̂_j x_{ij} in both the Garrote and the NN-Garrote.

Ping Yu (HKU) Model Selection and Regularization 88 / 98


Appendix A: Extensions of the Lasso

Continued: GLM + Lasso

We can also apply the lasso to different objective functions.


For example, our target is

min_β { −ℓ(β_0, β_1, ..., β_p) + λ ∑_{j=1}^p κ_j(β_j) },

where ℓ(·) = log L is the log-likelihood function of a GLM model in Lecture 2 (e.g.,
logistic regression), and the κ_j(·) are increasing penalty functions.
If the lasso penalty is used, we often set

κ_j(β_j) = ω_j |β_j|,

with ω_j being the sample standard deviation of the jth covariate if the original
covariate is not normalized to have length 1.

Ping Yu (HKU) Model Selection and Regularization 89 / 98
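A minimal R sketch of a lasso-penalized logistic regression with coordinate-specific weights, using glmnet's family and penalty.factor arguments (simulated data; the weights are the covariate standard deviations, as suggested above):

```r
# Minimal sketch: GLM + lasso with penalty weights omega_j.
library(glmnet)
set.seed(8)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(2 * x[, 1] - x[, 2]))   # logistic model with 2 signal variables

w   <- apply(x, 2, sd)                           # omega_j: per-covariate standard deviations
fit <- glmnet(x, y, family = "binomial", penalty.factor = w, standardize = FALSE)
coef(fit, s = 0.05)                              # penalized logistic-lasso coefficients
```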


Appendix B: Bayesian Interpretation for Ridge Regression and the Lasso

Appendix B: Bayesian Interpretation for Ridge


Regression and the Lasso

Ping Yu (HKU) Model Selection and Regularization 90 / 98


Appendix B: Bayesian Interpretation for Ridge Regression and the Lasso

Bayesian Analysis

In a Bayesian analysis of regression, we need to impose a prior distribution on β,
say p(β).
- If p(β) contains unknown parameters α, then α is called a hyperparameter.
Suppose the likelihood of the data (X, y), f(y, X|β) = f(y|X, β) f(X|β), can be written
as f(y|X, β) f(X), i.e., f(X) does not depend on β: f(X|β) = f(X).
Then the posterior distribution of β is proportional to the product of the prior and
the likelihood,

p(β|X, y) = f(y, X|β) p(β) / ∫ f(y, X|β) p(β) dβ ∝ f(y|X, β) f(X) p(β) ∝ f(y|X, β) p(β),

where ∝ means that a proportionality constant (which does not depend on β but may
depend on (X, y)) is neglected, and ∫ f(y, X|β) p(β) dβ does not depend on β.
We can estimate β by the posterior mean, median or mode (the mode is also known as the
maximum a posteriori or MAP estimate, where the likelihood is required to be consistent
with the prior [figure here]).
- Popular samplers of the posterior include the Gibbs sampler, the Metropolis-
Hastings sampler, and the slice sampler; see the supplementary Markov chain
Monte Carlo (MCMC) chapter for more details.

Ping Yu (HKU) Model Selection and Regularization 91 / 98


Appendix B: Bayesian Interpretation for Ridge Regression and the Lasso

Figure: Illustration of MAP: θ is the parameter, and X is the data.

Ping Yu (HKU) Model Selection and Regularization 92 / 98


Appendix B: Bayesian Interpretation for Ridge Regression and the Lasso

RR

In the linear regression model,

Y = β_0 + β_1 X_1 + ··· + β_p X_p + ε,

suppose ε ~ N(0, σ²), and

p(β) = g(β_0) ∏_{j=1}^p g(β_j) ∝ ∏_{j=1}^p g(β_j),

i.e., β_0, β_1, ..., β_p are distributed independently, and β_0 has a diffuse or uninformative
prior.
If g(β_j) = φ(β_j | 0, σ²/λ) [see Figure 6.11], where φ(·|µ_0, σ_0²) is the pdf of N(µ_0, σ_0²),
then

p(β|X, y)_R ∝ ∏_{i=1}^n φ(y_i | β_0 + x_i^T β, σ²) ∏_{j=1}^p g(β_j)
∝ exp{ −(1/(2σ²)) ∑_{i=1}^n (y_i − β_0 − x_i^T β)² − (λ/(2σ²)) ∑_{j=1}^p β_j² },

and

arg max_β p(β|X, y)_R = β̂_λ^R.

Ping Yu (HKU) Model Selection and Regularization 93 / 98


Appendix B: Bayesian Interpretation for Ridge Regression and the Lasso

Lasso

If

g(β_j) = (λ/(4σ²)) exp( −|β_j| / (2σ²/λ) ),

i.e., the double-exponential (or Laplace) distribution with scale parameter 2σ²/λ [see
Figure 6.11], then

p(β|X, y)_L ∝ ∏_{i=1}^n φ(y_i | β_0 + x_i^T β, σ²) ∏_{j=1}^p g(β_j)
∝ exp{ −(1/(2σ²)) ∑_{i=1}^n (y_i − β_0 − x_i^T β)² − (λ/(2σ²)) ∑_{j=1}^p |β_j| }
= exp{ −[ ∑_{i=1}^n (y_i − β_0 − x_i^T β)² + λ ∑_{j=1}^p |β_j| ] / (2σ²) },

and

arg max_β p(β|X, y)_L = arg min_β { ∑_{i=1}^n (y_i − β_0 − x_i^T β)² + λ ∑_{j=1}^p |β_j| } = β̂_λ^L.

Ping Yu (HKU) Model Selection and Regularization 94 / 98


Appendix B: Bayesian Interpretation for Ridge Regression and the Lasso

The lasso prior is steeply peaked at zero, while the Gaussian is flatter and fatter at
zero. Hence, the lasso expects a priori that many of the coefficients are (exactly)
zero, while ridge assumes the coefficients are randomly distributed about zero.
β̂_λ^R is actually also the posterior mean of p(β|X, y)_R, while β̂_λ^L is not the posterior
mean of p(β|X, y)_L; actually, the posterior mean of p(β|X, y)_L is not sparse.

Ping Yu (HKU) Model Selection and Regularization 95 / 98


Appendix B: Bayesian Interpretation for Ridge Regression and the Lasso

Gamma-Lasso (GL)

As discussed in Appendix A, the penalty on |β_j| often takes a coordinate-specific
form, ω_j|β_j|, which is not covered by the above prior specification.
To cover this case, we specify the prior for β_j as

π(β_j) = (λ_j/2) exp(−λ_j|β_j|),

and the Laplace rate λ_j is assigned a conjugate gamma hyperprior

Ga(λ_j|s, r) = (r^s/Γ(s)) λ_j^{s−1} e^{−rλ_j},

where the shape s and rate r imply mean s/r and variance s/r².
Since arg max_{λ_j} {π(β_j) Ga(λ_j|s, r)} = s/(r + |β_j|), and log{π(β_j) Ga(λ_j|s, r)} ∝
s log λ_j − (r + |β_j|)λ_j, the joint MAP estimation of β_j and λ_j is equivalent to adding

s log( (1 + |β_j|/r) / (s/r) ) + s ∝ s log(1 + |β_j|/r),

which is the log penalty and so is nonconvex in β_j, to −log f(y|X, β).
(s, r) can be chosen by CV or AICc (especially when p is large, see Lecture 6 and
Flynn, Hurvich and Simonoff (2013)).
Ping Yu (HKU) Model Selection and Regularization 96 / 98
Appendix B: Bayesian Interpretation for Ridge Regression and the Lasso

Path of One-Step Estimators (POSE)

For nonconvex penalties like SCAD, Zou and Li (2008) suggest iteratively weighted
ℓ1 regularization, where the coefficient-specific weights are based upon results from
previous iterations of ℓ1 regularization.
Actually, even a single step of such weighted ℓ1 regularization is enough to get
solutions that are close to optimal, so long as the pre-estimates are good enough
starting points. The crux of success is finding starts that are good enough.
On the other hand, Efron et al. (2004) suggest the least-angle regression (LARS)
algorithm for the lasso which provides a regularization path, a sequence of models
under decreasing amounts of regularization.
Taddy (2017) combines these two ideas for nonconvex penalties and proposes the
path of one-step estimators (POSE), which takes advantage of an available source
of initial values in any path estimation algorithm – the solved values from the previ-
ous path iteration, at no more cost than single lasso path.

Ping Yu (HKU) Model Selection and Regularization 97 / 98


Appendix B: Bayesian Interpretation for Ridge Regression and the Lasso

Continued

The POSE algorithm normalizes the penalty c(|β_j|) so that c′(0) = 1, which implies
s = r (equal to, say, 1/γ) in the Gamma-Lasso.
- When γ → 0, c(b) → |b|, and when γ → ∞, c(b) converges to ℓ0-type costs.
It starts from a large λ, e.g., λ^1 = max{ |∂l(α, β)/∂β_j| at β = 0 : j = 1, ..., p }, such that β^1 =
0, sets λ^t = δλ^{t−1} and

ω_j^t = c′(|β̂_j^{t−1}|) if β̂_j^{t−1} ≠ 0, and ω_j^t = 1 if β̂_j^{t−1} = 0,

solves

(α̂, β̂)^t = arg min_{α,β} { l(α, β) + λ^t ∑_{j=1}^p ω_j^t |β_j| }

by coordinate descent, and ends at λ^T (e.g., 0.01 λ^1), where l(·) is the objective
function, and α plays the role of β_0 above.
γ and λ can be chosen by CV or AICc; usually, γ is chosen only from {0, 1, 10}.
- When n/s is large, CV and AICc perform similarly in prediction; when n/s is
small, CV is preferred when there is little signal and AICc is preferred when there
is strong signal, where s measures the sparsity of the model.
Ping Yu (HKU) Model Selection and Regularization 98 / 98
