Lecture 4 - Model Selection and Regularization - Ver 2
Ping Yu
$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon.$
In the lectures that follow, we consider some approaches for extending the linear
model framework.
In Lecture 7, we generalize the linear model in order to accommodate non-linear,
but still additive, relationships.
In Lectures 8 and 10, we consider even more general non-linear models.
Despite its simplicity, the linear model has distinct advantages in terms of its interpretability and often shows good predictive performance.
Hence we discuss in this lecture some ways in which the simple linear model can
be improved, by replacing ordinary least squares fitting with some alternative fitting
procedures.
Prediction Accuracy: When n is not much larger than p, there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions; when p > n, there is no unique least squares estimate: the variance is infinite!
By constraining or shrinking the estimated coefficients, we can substantially reduce
the variance at the cost of a negligible increase in bias.
Model Interpretability: By removing irrelevant features – that is, by setting the corresponding coefficient estimates to zero – we can obtain a model that is more easily interpreted. We will present some approaches for automatically performing feature selection or variable selection.
Deviance: $-2\left[l(\hat{\mu}; y) - l(y; y)\right]$,
where $l(\mu; y) = \sum_{i=1}^n l(\mu_i; y_i)$ with $l(\mu; y) = \ln f(y; \theta)$, and $\hat{\mu} = (\hat{\mu}_1, \ldots, \hat{\mu}_n)^T$ is the ML estimator of $\mu$ (deduced from the ML estimator of $\theta$).
- In other words, the deviance is ($-2$ times) the difference between the log likelihood of the assumed model and that of the saturated model (a model with a parameter for every observation so that the data are fitted exactly).
Normal: If $y \sim N(\mu, \sigma^2)$ with known $\sigma^2$, then $\theta = \mu$, and $f(y; \mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$, so that
$$l(\mu; y) = -\frac{1}{2}\log\left(2\pi\sigma^2\right) - \frac{(y-\mu)^2}{2\sigma^2}.$$
Setting $\mu = y$ gives $l(y; y) = -\frac{1}{2}\log\left(2\pi\sigma^2\right)$ and so
$$-2\left[l(\hat\mu; y) - l(y; y)\right] = \frac{(y-\hat\mu)^2}{\sigma^2}.$$
Bernoulli: If $y \sim \text{Bernoulli}(\mu)$, then $f(y; \mu) = \mu^y(1-\mu)^{1-y}$, so that $l(\mu; y) = y\log\mu + (1-y)\log(1-\mu)$, and
$$-2\left[l(\hat\mu; y) - l(y; y)\right] = 2\left[y\log\frac{y}{\hat\mu} + (1-y)\log\frac{1-y}{1-\hat\mu}\right].$$
Poisson: If $y \sim \text{Poisson}(\lambda)$, then $\mu = \lambda$ and $f(y; \mu) = e^{-\lambda}\lambda^y/y!$, so that
$$l(\mu; y) = -\lambda + y\log\lambda - \log y!,$$
and
$$-2\left[l(\hat\mu; y) - l(y; y)\right] = 2\left[y\log\frac{y}{\hat\mu} - (y - \hat\mu)\right].$$
In all three examples, µ̂ is the sample mean.
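The three deviance contributions above are easy to compute directly. A minimal numpy/scipy sketch (scipy's xlogy is used only to handle the $0\cdot\log 0 = 0$ convention in the saturated model; the function names and the toy data are illustrative):

```python
# Per-observation deviance contributions for the three examples above.
# scipy.special.xlogy(a, b) = a*log(b) with the convention xlogy(0, 0) = 0.
import numpy as np
from scipy.special import xlogy

def dev_normal(y, mu_hat, sigma2):
    return (y - mu_hat) ** 2 / sigma2

def dev_bernoulli(y, mu_hat):
    return 2 * (xlogy(y, y / mu_hat) + xlogy(1 - y, (1 - y) / (1 - mu_hat)))

def dev_poisson(y, mu_hat):
    return 2 * (xlogy(y, y / mu_hat) - (y - mu_hat))

y = np.array([0.0, 1.0, 1.0, 0.0])
print(dev_bernoulli(y, mu_hat=np.full(4, y.mean())))  # mu_hat = sample mean, as noted above
```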
Subset Selection
(Section 6.1)
Step 2 reduces the number of possible models from $2^p$ to $p+1$. [see Figure 6.1 for the Credit dataset, n = 400, p = 10]
In Step 3, we cannot use RSS or $R^2$ to choose the best model since they reflect the training error, not the test error, and are decreasing in p, so the largest model would always be selected.
Stepwise Selection
The key drawback of best subset selection is its computational limitation; usually,
p > 40 is infeasible.
Best subset selection may also suffer from statistical problems when p is large: the larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data.
Thus an enormous search space can lead to overfitting and high variance of the
coefficient estimates.
For both of these reasons, stepwise methods, which explore a far more restricted
set of models, are attractive alternatives to best subset selection.
Forward stepwise selection starts from the null model; at each step the variable that gives the greatest additional improvement to the fit is added to the model, until all of the predictors are in the model.
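A minimal sketch of this greedy search, assuming numpy and scikit-learn are available; the function name forward_stepwise is illustrative, and training RSS is used as the greedy criterion as described above (selecting among the resulting models $M_1, \ldots, M_p$ would then use CV or the criteria discussed later).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Greedy forward search: at each step add the predictor that most reduces the training RSS."""
    n, p = X.shape
    remaining, selected, path = list(range(p)), [], []
    for _ in range(p):
        rss = {}
        for j in remaining:
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss[j] = np.sum((y - fit.predict(X[:, cols])) ** 2)
        best = min(rss, key=rss.get)          # predictor giving the largest RSS reduction
        selected.append(best)
        remaining.remove(best)
        path.append((list(selected), rss[best]))
    return path                               # the models M_1, ..., M_p and their training RSS
```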
Comments
In total, only $1 + p + (p-1) + \cdots + 1 = 1 + C_2^{p+1}$ models are considered,¹ so the computational advantage over best subset selection is clear.
It is not guaranteed to find the best possible model out of all $2^p$ models containing subsets of the p predictors.
- Why not? Give an example.
Figure 6.1 shows that there is little difference between the three- and four-variable models in terms of RSS, so either of the four-variable models (which differ between best subset and forward stepwise selection) will likely be adequate.
This is a greedy approach, and might include variables early that later become
redundant; hybrid selection below can remedy this.
¹ (**) If n < p, only $M_0, \ldots, M_{n-1}$ would be considered, but the number of possible models is much larger than $1 + C_2^{p+1}$, which is so even when n > p since usually a guided search is performed.
Unlike forward stepwise selection, backward stepwise selection begins with the full
least squares model containing all p predictors, and then iteratively removes the
least useful predictor, one-at-a-time, until the null model is achieved.
As mentioned above, RSS and R 2 are not suitable for selecting the best model
among a collection of models with different numbers of predictors because they are
related to the training error, not the test error.
To select the best model with respect to test error, we estimate this test error based
on two common approaches:
1 We can indirectly estimate test error by making an adjustment to the training error
to account for the bias due to overfitting.
- $C_p$ (or Mallows's $C_p$), Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC or the Schwarz criterion), and adjusted $R^2$ (or $\bar{R}^2$). [figure here]
- All criteria are popular, but the adjusted $R^2$ is not as well motivated in statistical theory as the other three.
- AIC and BIC can be defined for more general types of models.
2 We can directly estimate the test error.
- A validation set approach or a cross-validation approach, as will be discussed
later in this lecture.
[Photos: Gideon E. Schwarz (1933-2007), Hebrew University; Henri Theil (1924-2000), Chicago and Florida]
Figure 6.2 displays $C_p$, BIC, and adjusted $R^2$ for the best model of each size produced by best subset selection on the Credit dataset.
Details on Cp
Mallows's $C_p$:
$$C_p = \frac{1}{n}\left(\text{RSS}(d) + 2d\hat\sigma^2\right),$$
where d is the total number of predictors used, and $\hat\sigma^2$ is an estimate of the variance of the error $\varepsilon$.
- Note that RSS(d ) depends on d – it decreases with d.
- Typically, σ̂ 2 is estimated using the full model containing all predictors.
- The penalty term 2d σ̂ 2 is added since the training error tends to underestimate the
test error; it is increasing in d to adjust for the corresponding decrease in training
RSS.
-(**) If σ̂ 2 is unbiased, then Cp is an unbiased estimate of test MSE.
- In Figure 6.2, minimizing $C_p$ selects 6 predictors: income, limit, rating, cards, age and student.
Details on AIC
$$C_p = \frac{n+2d}{n}\hat\sigma^2,$$
such that
$$\log C_p = \log\hat\sigma^2 + \log\left(1 + \frac{2d}{n}\right) \approx \log\hat\sigma^2 + \frac{2d}{n} \approx \frac{\text{AIC}}{n}.$$
Details on BIC
Details on R̄ 2
Adjusted $R^2$:
$$\bar{R}^2 = 1 - \frac{\text{RSS}(d)/(n-d-1)}{\text{TSS}/(n-1)}.$$
- Different from $C_p$, AIC and BIC, a large value of $\bar{R}^2$ is preferred.
- Maximizing $\bar{R}^2$ is equivalent to minimizing $\frac{\text{RSS}(d)}{n-d-1}$. While $\text{RSS}(d)$ always decreases as d increases, $\frac{\text{RSS}(d)}{n-d-1}$ may increase or decrease, due to the presence of d in the denominator.
- Unlike the R 2 statistic, the R̄ 2 statistic pays a price for the inclusion of unnecessary
variables in the model.
- It can be shown that maximizing $\bar{R}^2$ is approximately equivalent to minimizing $\frac{1}{n}\left(\text{RSS}(d) + d\hat\sigma^2\right)$. [Exercise]
- In Figure 6.2, besides the six variables selected by Cp and AIC, own is also
selected.
Why select among only $M_0, \ldots, M_p$?
- For the same d (and $\hat\sigma^2$), optimizing the four criteria over the $C_d^p$ models of size d is the same as minimizing the corresponding RSS.
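The indirect criteria are simple to compute once the RSS of each candidate model is known. A minimal numpy sketch under the formulas above (the BIC line uses one common Gaussian form, $(\text{RSS} + \log(n)\,d\,\hat\sigma^2)/n$, as an assumption, since its exact definition is not spelled out here):

```python
import numpy as np

def selection_criteria(rss_by_size, n, tss, sigma2):
    """rss_by_size[d] = RSS of the best size-d model; sigma2 = full-model estimate of Var(eps)."""
    out = {}
    for d, rss in rss_by_size.items():
        cp = (rss + 2 * d * sigma2) / n                     # Mallows's C_p
        bic = (rss + np.log(n) * d * sigma2) / n            # a common Gaussian form (assumption)
        adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))  # adjusted R^2 (larger is better)
        out[d] = {"Cp": cp, "BIC": bic, "adjR2": adj_r2}
    return out
```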
Shrinkage Methods
(Section 6.2)
Hoerl, A.E., and R.W. Kennard, 1970, Ridge Regression: Biased Estimation for Nonorthogonal Problems, Technometrics, 12, 55-67.
Hoerl, A.E., and R.W. Kennard, 1970, Ridge Regression: Applications to Nonorthogonal Problems, Technometrics, 12, 69-82.
Ridge Regression
Recall that the least squares fitting procedure estimates $\beta_0, \beta_1, \ldots, \beta_p$ using the values that minimize
$$\text{RSS} = \sum_{i=1}^n\left(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\right)^2.$$
In contrast, the ridge regression (RR for short) coefficient estimates $\hat\beta^R$ are the values that minimize
$$\sum_{i=1}^n\left(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\right)^2 + \lambda\sum_{j=1}^p\beta_j^2 = \text{RSS} + \lambda\|\beta\|_2^2, \qquad (1)$$
where $\lambda \ge 0$ is a tuning parameter.
Continued
As with least squares, RR seeks coefficient estimates that fit the data well, by
making the RSS small.
p
However, the second term, λ ∑j =1 β 2j , called a shrinkage penalty, is small when
β 1 , , β p are close to zero, and so it has the effect of shrinking the estimates of β j
towards zero.
The tuning parameter λ serves to control the relative impact of these two terms on
the regression coefficient estimates.
Selecting a good value for λ is critical; cross-validation discussed below is used for
this.
RR has substantial computational advantages over best subset selection – only one (rather than $2^p$) model is fit for each $\lambda$.
- Actually, computations required to solve (1), simultaneously for all values of λ , are
almost identical to those for fitting a model using least squares.
Different from least squares, RR can be conducted even if p > n.
[Example] Credit
$\hat\beta^R_\lambda$ is a function of $\lambda$: as $\lambda \to 0$, $\hat\beta^R_\lambda \to \hat\beta^{LS}$, and as $\lambda \to \infty$, $\hat\beta^R_\lambda \to (\hat\beta_0, 0, 0, \ldots, 0)^T$, the null model.
$\hat\beta^R_\lambda$ is generally decreasing in $\lambda$, but need not be strictly so; anyway, $\|\hat\beta^R_\lambda\|_2$ must be decreasing in $\lambda$, so there is a one-to-one correspondence between $\lambda \in (0, \infty)$ and $\|\hat\beta^R_\lambda\|_2/\|\hat\beta^{LS}\|_2$ (decreasing from 1 to 0), as shown in the right panel of Figure 6.4.
Scaling of Predictors
$\hat\beta^{LS}$ is scale equivariant: if $X_j \to cX_j$ then $\hat\beta^{LS}_j \to \hat\beta^{LS}_j/c$, such that $X_j\hat\beta^{LS}_j$ remains the same.
In contrast, $\hat\beta^R_{j,\lambda}$ can change substantially and need not change to $\hat\beta^R_{j,\lambda}/c$ when $X_j \to cX_j$, due to the equal weights on the $\beta_j^2$ in $\sum_{j=1}^p\beta_j^2$, such that $X_j\hat\beta^R_{j,\lambda}$ would change.
- Actually, $X_j\hat\beta^R_{\lambda,j}$ depends also on the scaling of the other predictors.
Therefore, it is best to apply RR after standardizing the predictors:
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{\frac{1}{n}\sum_{i=1}^n\left(x_{ij} - \bar{x}_j\right)^2}}.$$
The solution to (1) is $\hat\beta^R = \left(X^TX + \lambda I\right)^{-1}X^Ty$, where a ridge parameter $\lambda$ times the identity matrix is added to the correlation matrix $X^TX$ (when X is standardized); the diagonal of ones in the correlation matrix can be described as a "ridge".
Historically, "ridge" actually refers to the characteristics of the function we were attempting to optimize, rather than to adding a "ridge" to the $X^TX$ matrix.
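A minimal sketch of the closed-form solution on standardized predictors, assuming numpy and scikit-learn; the simulated X, y and the choice $\lambda = 10$ are only illustrative (sklearn's Ridge minimizes the same objective as (1) when the data are pre-centered and fit_intercept=False, so the two solutions should agree).

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)

# standardize the predictors (dividing by the 1/n standard deviation, as above) and center y
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

lam = 10.0
beta_closed_form = np.linalg.solve(Xs.T @ Xs + lam * np.eye(Xs.shape[1]), Xs.T @ yc)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(Xs, yc).coef_
print(np.allclose(beta_closed_form, beta_sklearn))  # the two solutions coincide
```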
Tibshirani, R., 1996, Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society, Series B, 58, 267-288.
³ Robert Tibshirani: previously at Toronto, COPSS Award recipient in 1996, inventor of the lasso, and a student of Efron.
The Lasso
RR does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, RR will include all p predictors in the final model (unless $\lambda = \infty$).
- This is not a problem for prediction accuracy, but a challenge for model interpretation when p is large.
The Lasso is a relatively recent alternative to RR that overcomes this disadvantage.
The lasso coefficients, $\hat\beta^L_\lambda$, minimize the quantity
$$\sum_{i=1}^n\left(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\right)^2 + \lambda\sum_{j=1}^p\left|\beta_j\right| = \text{RSS} + \lambda\|\beta\|_1, \qquad (2)$$
where $\|\cdot\|_1$ is the $\ell_1$ norm, i.e., the lasso uses an $\ell_1$ penalty instead of an $\ell_2$ penalty.
Continued
As with RR, the lasso shrinks the coefficient estimates towards zero, so can reduce
variance at the expense of a small increase in bias.
However, in the case of the lasso, the `1 penalty has the effect of forcing some of
the coefficient estimates to be exactly equal to zero when the tuning parameter λ
is sufficiently large.
Hence, much like best subset selection, the lasso performs variable selection, and
the resulting model is much easier to interpret than that in RR.
We say that the lasso yields sparse models – that is, models that involve only a
subset of the variables.
As in RR, selecting a good value of λ for the lasso is critical; cross-validation is
again the method of choice.
Like RR, the entire coefficient paths can be computed with about the same amount
of work as a single least squares fit.
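A minimal sketch of computing the whole coefficient path in one call, assuming scikit-learn and matplotlib; lasso_path solves $\frac{1}{2n}\text{RSS} + \alpha\|\beta\|_1$, so its $\alpha$ corresponds to $\lambda/(2n)$ in the notation of (2), and the simulated data are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 10))
y = X @ np.r_[4.0, -3.0, 2.0, np.zeros(7)] + rng.normal(size=80)

# lasso_path does not fit an intercept, so center the data first
Xc, yc = X - X.mean(axis=0), y - y.mean()
alphas, coefs, _ = lasso_path(Xc, yc, n_alphas=100)

plt.plot(np.log10(alphas), coefs.T)     # piecewise-linear paths; coefficients hit exactly zero
plt.xlabel("log10(alpha)")
plt.ylabel("coefficient estimates")
plt.show()
```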
[Example] Credit
The limits of $\hat\beta^L_\lambda$ as $\lambda \to 0$ and $\infty$ are the same as for $\hat\beta^R_\lambda$; $\hat\beta^L_\lambda$ is also generally decreasing in $\lambda$, and $\|\hat\beta^L_\lambda\|_1$ must be decreasing in $\lambda$.
In between, i.e., for $\lambda \in (0, \infty)$, $\hat\beta^L_\lambda$ and $\hat\beta^R_\lambda$ are very different: rating, student, limit, and income enter the model sequentially, so any number of variables is possible, while RR involves all variables.
There exist transition points $\lambda_0 > \lambda_1 > \cdots$ such that
1 For all $\lambda > \lambda_0$, $\hat\beta^L_\lambda = 0$.
2 In the interior of the interval $(\lambda_{m+1}, \lambda_m)$, the active set $\mathcal{B}(\lambda) := \{j : \hat\beta^L_{\lambda j} \neq 0\}$ and the sign vector of the active $\hat\beta^L_{\lambda j}$'s are constant with respect to $\lambda$, so they can be denoted as $\mathcal{B}_m$ and $\text{sign}_m$.
3 When $\lambda$ decreases from $\lambda = \lambda_m$, some predictors with zero coefficients at $\lambda_m$ are about to have nonzero coefficients; as $\lambda$ approaches $\lambda_{m+1}$ there are possibly some predictors in $\mathcal{B}_m$ whose coefficients reach zero.
From the Kuhn-Tucker conditions, we have the equiangular conditions:
$$\lambda = 2\left|x_j^T\left(y - X\hat\beta^L_\lambda\right)\right|, \quad \forall j \in \mathcal{B}(\lambda),$$
$$\lambda > 2\left|x_j^T\left(y - X\hat\beta^L_\lambda\right)\right|, \quad \forall j \notin \mathcal{B}(\lambda),$$
which imply that the coefficient path for the lasso is continuous and piecewise linear over the entire range of $\lambda$, with the transition points $\{\lambda_m\}$ occurring whenever the active set changes, or the signs of the coefficients change.
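These conditions are easy to check numerically. A sketch assuming scikit-learn, whose Lasso minimizes $\frac{1}{2n}\text{RSS} + \alpha\|\beta\|_1$, so the conditions read $|x_j^T(y - X\hat\beta)| = n\alpha$ on the active set and $\le n\alpha$ off it (i.e., $\lambda = 2n\alpha$ in the notation of (2)); the simulated data are only illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n = 60
X = rng.normal(size=(n, 8))
y = X @ np.r_[3.0, -2.0, np.zeros(6)] + rng.normal(size=n)

Xc, yc = X - X.mean(axis=0), y - y.mean()        # center so no intercept is needed
alpha = 0.2
beta = Lasso(alpha=alpha, fit_intercept=False, tol=1e-12).fit(Xc, yc).coef_

corr = Xc.T @ (yc - Xc @ beta)                   # x_j^T (y - X beta)
active = beta != 0
print(np.abs(corr[active]) / (n * alpha))        # approximately 1 on the active set
print(np.abs(corr[~active]) / (n * alpha))       # at most 1 off the active set
```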
Why is it that the lasso, unlike RR, results in coefficient estimates that are exactly
equal to zero?
One can show that the lasso and RR coefficient estimates solve the problems
$$\min_\beta\left\{\sum_{i=1}^n\left(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\right)^2\right\} \text{ subject to } \sum_{j=1}^p\left|\beta_j\right| \le s$$
and
$$\min_\beta\left\{\sum_{i=1}^n\left(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\right)^2\right\} \text{ subject to } \sum_{j=1}^p\beta_j^2 \le s,$$
respectively, while best subset selection solves
$$\min_\beta\left\{\sum_{i=1}^n\left(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\right)^2\right\} \text{ subject to } \sum_{j=1}^p I\left(\beta_j \neq 0\right) \le s.$$
These two examples illustrate that neither RR nor the lasso will universally domi-
nate the other.
In general, the lasso is suitable for sparse models with a small number of substantial coefficients, while RR is suitable for models involving many predictors, all with roughly equal coefficients.
However, the number of predictors that is related to the response is never known a
priori for real datasets.
A technique such as cross-validation can be used in order to determine which ap-
proach is better on a particular dataset.
Consider the special case n = p, X = I, and $\beta_0 = 0$ is known.
LS:
$$\min_\beta\sum_{j=1}^p\left(y_j - \beta_j\right)^2 \implies \hat\beta_j = y_j.$$
RR:
$$\min_\beta\sum_{j=1}^p\left(y_j - \beta_j\right)^2 + \lambda\sum_{j=1}^p\beta_j^2 \implies \hat\beta^R_j = \frac{y_j}{1+\lambda}.$$
Lasso:
$$\min_\beta\sum_{j=1}^p\left(y_j - \beta_j\right)^2 + \lambda\sum_{j=1}^p\left|\beta_j\right| \implies \hat\beta^L_j = \text{sign}\left(y_j\right)\left(\left|y_j\right| - \lambda/2\right)_+,$$
i.e., soft thresholding.
This example is special since $\hat\beta^R_j$ is linearly shrunken, so $\text{sign}(\hat\beta^R_j) = \text{sign}(\hat\beta_j)$, while in general (especially in the multicollinear case), $\text{sign}(\hat\beta^R_j)$ can be reversed [see Figure 6.4].
Although typically, once $\hat\beta^L_j$ hits zero it stays there (just as in this example), $\hat\beta^L_j$ can also reverse its sign. [Exercise]
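A tiny numpy sketch of the two shrinkage rules in this special case (the vector y and $\lambda = 1$ are only illustrative):

```python
import numpy as np

y = np.array([3.0, 1.2, -0.4, 0.05])
lam = 1.0

beta_ridge = y / (1 + lam)                                        # linear shrinkage: never exactly zero
beta_lasso = np.sign(y) * np.maximum(np.abs(y) - lam / 2, 0.0)    # soft thresholding at lambda/2
print(beta_ridge)   # [ 1.5    0.6   -0.2    0.025]
print(beta_lasso)   # [ 2.5    0.7   -0.     0.   ]  -- small coordinates are set exactly to zero
```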
The subset selection methods are unstable in the sense that a small change in the data can make large changes in the sequence of $M_0, M_1, \ldots, M_p$. On the other hand, the solutions of the shrinkage methods are continuous in $\lambda$, so they are relatively stable.
- (**) CART (indexed by the number of terminal nodes) in Lecture 8, single-layer neural networks (indexed by the number of hidden units) in Lecture 10, and the $\ell_q$ penalty with 0 < q < 1 in Appendix A are also unstable, while KNN and bagging in Lecture 8 are stable.⁵
- The more unstable the procedure, the noisier the test error, and the less able we are to locate the best model.
The subset selection methods are sequential, i.e., first model selection and then estimation, while the shrinkage methods are simultaneous, conducting model selection and estimation together.
Best subset selection can handle only tens of features, while the shrinkage methods can handle thousands of features or more.
⁵ Recall that clustering in Lecture 3 is also unstable in this sense, but it is an unsupervised learning method.
Cross Validation
(Section 5.1)
Resampling Methods
Recall the distinction between the test error and the training error.
The test error is the average error that results from using a statistical learning
method to predict the response on a new observation, one that was not used in
training the method.
- Best solution: a large designated test set, but this is often not available.
In contrast, the training error can be easily calculated by applying the statistical
learning method to the observations used in its training.
But the training error rate often is quite different from the test error rate, and in
particular the former can dramatically underestimate the latter. [figure here]
CV estimates the test error by holding out a subset of the training observations from
the fitting process, and then applying the statistical learning method to those held
out observations.
Here we randomly divide the available set of samples into two parts: a training set
and a validation or hold-out set. [see Figure 5.1]
The model is fit on the training set, and the fitted model is used to predict the
responses for the observations in the validation set.
The resulting validation-set error provides an estimate of the test error. This is typically assessed using MSE in the case of a quantitative response and the misclassification rate in the case of a qualitative (discrete) response.
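A minimal sketch of the validation set approach, assuming scikit-learn; the simulated data (rather than the Auto data of the example below) and the 50/50 split are only illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.normal(size=(392, 3))
y = 1 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=392)

# randomly split into a training half and a validation (hold-out) half
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)
fit = LinearRegression().fit(X_tr, y_tr)
print("validation MSE:", mean_squared_error(y_va, fit.predict(X_va)))
```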
[Example] Auto
n = 392, $n_{\text{training}} = n_{\text{validation}} = 196$. $\hat{p} = 2$ in the left panel, and $\hat{p}$ is not unique across different splittings (anyway, $\hat{p} \neq 1$ for all splittings).
1 The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
2 In the validation approach, only a subset of the observations – those that are included in the training set rather than in the validation set – are used to fit the model.
- Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error may tend to overestimate the test error for the model fit on the entire data set.
K-Fold Cross-Validation
$$\text{CV}_K = \sum_{k=1}^K\frac{n_k}{n}\text{MSE}_k,$$
where $\text{MSE}_k = \sum_{i\in C_k}\left(y_i - \hat{y}_i\right)^2/n_k$, and $\hat{y}_i$ is the fit for observation i, obtained from the data with part k removed.
Setting K = n yields n-fold or leave-one-out cross-validation (LOOCV). [see Figure 5.3]
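A minimal sketch of $\text{CV}_K$ with the folds weighted by $n_k/n$ as in the formula above, assuming scikit-learn's KFold for the splits; the least squares estimator and K = 10 are only illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def cv_error(X, y, K=10, seed=0):
    """K-fold CV estimate of the test MSE for a least squares fit."""
    n = len(y)
    cv = 0.0
    for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        fit = LinearRegression().fit(X[train_idx], y[train_idx])
        mse_k = np.mean((y[test_idx] - fit.predict(X[test_idx])) ** 2)
        cv += (len(test_idx) / n) * mse_k     # weight MSE_k by n_k / n
    return cv
```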
Why can K-fold CV (or LOOCV) avoid or mitigate the two drawbacks of the validation set approach?
1 Since K > 2, the variability of CVK is typically much lower than that in the validation set
approach.6 [see the example below]
2 Since each training set is only (K-1)/K as big as the original training set, the estimates of prediction error will typically be biased upward, but not as much as in the validation set approach where K = 2.
In LOOCV, K = n, the bias of prediction error is minimized and there is no variability
at all in CVn .
However, LOOCV typically doesn’t shake up the data enough. The estimates from
each fold are highly (positively) correlated and hence their average can have high
variance.7
K = 5 or 10 provides a good compromise for this bias-variance tradeoff.
⁶ (**) Note that given the data, the variability of $\text{CV}_K$ depends on both K and the number of possible splittings; the latter need not decrease with K, e.g., if n = 6, then $\#_{K=2} = C_3^6/2! = 10$, while $\#_{K=3} = C_2^6C_2^4C_2^2/3! = 15 > 10$.
⁷ (**) Note that the variance here is different from the variability of $\text{CV}_K$ above: the former is from the random sampling of the original data from the population, while the latter is conditional on the data, i.e., the data are fixed.
For least squares, LOOCV has the shortcut
$$\text{CV}_n = \frac{1}{n}\sum_{i=1}^n\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2,$$
where $\hat{y}_i$ is the ith fitted value from the original least squares fit, and $h_i$ is the leverage (the diagonal of the "hat" matrix), reflecting the amount that an observation influences its own fit; in a simple linear regression, $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^n(x_{i'} - \bar{x})^2} \in \left[\frac{1}{n}, 1\right]$.
This is like the ordinary MSE, except that the ith residual is inflated by dividing by $1 - h_i$.
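A minimal numpy sketch verifying this shortcut on simulated data: the brute-force leave-one-out prediction errors coincide exactly with $(y_i - \hat{y}_i)/(1 - h_i)$ from a single least squares fit.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
X = np.c_[np.ones(n), rng.normal(size=n)]      # simple linear regression with an intercept
y = 1 + 2 * X[:, 1] + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
shortcut = (y - H @ y) / (1 - np.diag(H))      # inflated residuals

brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    brute[i] = y[i] - X[i] @ b                 # leave-one-out prediction error

print(np.allclose(shortcut, brute))            # True
print("CV_n =", np.mean(shortcut ** 2))
```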
As mentioned above, a better choice is K = 5 or 10, which can also save computational time for a general machine learning method.
Although CVs sometimes underestimate the true test MSE, all of the CV curves
come close to identifying the correct level of flexibility – that is, the flexibility level
corresponding to the smallest test MSE.
The validation set and cross-validation methods have an advantage relative to AIC, BIC, $C_p$, and adjusted $R^2$, in that they provide a direct estimate of the test error and don't require an estimate of the error variance $\sigma^2$.
They can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g., the number of predictors in the model) or hard to estimate the error variance $\sigma^2$.
In Figure 6.3, the validation errors were calculated by randomly selecting three-quarters of the observations as the training set, and the remainder as the validation set; the cross-validation errors were computed using K = 10 folds.
- In this case, the validation and cross-validation methods both result in a six-variable model, but all three approaches suggest that the four-, five-, and six-variable models are roughly equivalent in their test errors.
In this setting, we can select a model using the one-standard-error rule. We first
calculate the standard error of the estimated test MSE for each model size (by
using different randomly chosen training and validation sets), and then select the
smallest or most parsimonious model for which the estimated test error is within
one standard error of the lowest point on the curve.
- The rationale here is that if a set of models appear to be more or less equally
good, then we might as well choose the simplest model – that is, the model with
the smallest number of predictors.
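A minimal sketch of the one-standard-error rule; the arrays cv_mean and cv_se are assumed to hold the estimated test error and its standard error for each model size, e.g. from repeated random splits as described above.

```python
import numpy as np

def one_se_rule(sizes, cv_mean, cv_se):
    """sizes are assumed sorted in increasing order of model complexity."""
    cv_mean, cv_se = np.asarray(cv_mean), np.asarray(cv_se)
    best = int(np.argmin(cv_mean))
    threshold = cv_mean[best] + cv_se[best]            # lowest point plus one standard error
    smallest = int(np.where(cv_mean <= threshold)[0][0])
    return sizes[smallest]                             # most parsimonious model within one SE

# illustrative numbers: sizes 1..6 with a flat error curve from size 4 onwards
print(one_se_rule(range(1, 7), [10.0, 7.0, 5.0, 4.1, 4.0, 4.05], [0.3] * 6))   # -> 4
```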
[Example] Credit
As for subset selection, for RR and the lasso we require a method to determine which of the models under consideration is best.
That is, we require a method for selecting a value of the tuning parameter $\lambda$ or, equivalently, the value of the constraint s.
Cross-validation provides a simple way to tackle this problem. We choose a grid of
λ values, and compute the cross-validation error rate for each value of λ .
We then select the tuning parameter value for which the cross-validation error is
smallest.
Finally, the model is re-fit using all of the available observations and the selected
value of the tuning parameter.
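A minimal sketch of this procedure for the lasso, assuming scikit-learn; LassoCV computes the CV error over a grid of tuning parameter values and then refits on all observations at the selected value, and the simulated data are only illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

fit = LassoCV(cv=10, n_alphas=100).fit(X, y)   # 10-fold CV over a grid of 100 tuning values
print("selected alpha:", fit.alpha_)
print("nonzero coefficients:", np.flatnonzero(fit.coef_))
```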
The LOOCV $\lambda$ is relatively small, so there is not much shrinkage relative to LS; the CV-error curve is flat around this $\lambda$; perhaps one should simply use LS.
The 10-fold CV $\lambda$ picks out the two signal variables and discards all noise variables even though p = 45 and n = 50, while LS assigns a large coefficient estimate to only one of the two signal variables.
This is a useful estimate but, strictly speaking, not quite valid. Why not? It reflects only between-fold variation, not within-fold variation.
The following figure shows logistic regression fits using polynomials of orders 1-4 on the two-dimensional classification data displayed in Figure 2.13, where the test error rates are 0.201, 0.197, 0.160 and 0.162 respectively, while the Bayes error rate is 0.133; Figure 5.8 shows how to use $\text{CV}_{10}$ to choose the order of the polynomial.
Although the CV error curve slightly underestimates the test error rate, it takes on a minimum very close to the best value for the order of the polynomial or the number of neighbors.
(**) Continued
$$\Omega_i = 2\,\text{Cov}\left(\hat\mu_i, y_i\right)$$
Both the two covariance penalty methods and cross-validation are conditional or local methods in the sense that the randomness in $\Omega_i$ comes only from the ith observation, although the former fixes $x_i$ while the latter allows $x_i$ to be random.
The former is model-based while the latter is model-free, so the latter is more robust against model mis-specification.
The former is typically more efficient in estimating $\Omega := \sum_{i=1}^n\Omega_i$ than the latter, although it imposes more assumptions on the model.
Considerations in High Dimensions
High-Dimensional Data
Many methods learned in this lecture for fitting less flexible least squares models, such as forward stepwise selection, RR, and the lasso, and those to be learned later, such as principal components regression and partial least squares, are particularly useful for performing regression in the high-dimensional setting.
Figure 6.24 shows the performance of the lasso (test MSE against three values of p) in a simulated example with n = 100 and p = 20, 50 and 2000.
- When p = 20, the best performance is achieved when $\lambda$ is small, while when p gets larger, a larger $\lambda$ achieves the lowest validation set error.
Figure 6.24 highlights three important points:
1 regularization or shrinkage plays a key role in high-dimensional problems;
2 appropriate tuning parameter selection is crucial for good predictive performance;
3 the test error tends to increase as p increases, unless the additional features are truly
associated with the response, which is known as the curse of dimensionality.
Increasing p need not be a blessing: maybe only noise features are included, which exacerbates the risk of overfitting by assigning nonzero coefficients due to chance associations with the response; even if the new features are relevant, the variance incurred in fitting their coefficients may outweigh the reduction in bias that they bring.
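A sketch in the spirit of that simulation (not the book's exact setup, which is an assumption here): n = 100 training observations, 20 relevant features, and either 0 or 1980 additional pure-noise features, with a CV-tuned lasso evaluated on a held-out test set.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n_train, n_test = 100, 1000
beta = rng.normal(size=20)

for extra_noise_features in (0, 1980):
    p = 20 + extra_noise_features
    X = rng.normal(size=(n_train + n_test, p))
    y = X[:, :20] @ beta + rng.normal(size=n_train + n_test)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=n_train, random_state=0)
    fit = LassoCV(cv=10).fit(X_tr, y_tr)
    test_mse = np.mean((y_te - fit.predict(X_te)) ** 2)
    print(f"p = {p}: test MSE = {test_mse:.2f}")   # test error deteriorates as noise features are added
```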
Summary
Model selection methods are an essential tool for data analysis, especially for big
datasets involving many predictors.
Research into methods that give sparsity, such as the lasso, is an especially hot area.
Ridge Regression
The Lasso
Lab3: Cross-Validation
(Section 5.3.1-3)
Relaxed Lasso or Gauss Lasso: first find the variables with nonzero coefficients by the lasso, and then refit the linear model with these variables by least squares. This would help to remove the bias of $\hat\beta^L_j$ when it is nonzero.
Elastic Net: combine RR and the lasso by using the penalty
$$\lambda\left[\frac{1}{2}\left(1-\alpha\right)\|\beta\|_2^2 + \alpha\|\beta\|_1\right].$$
- It can overcome some limitations of the lasso, particularly when p > n and with very high degrees of collinearity; it tends to select (or drop) correlated features together.
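A minimal sketch assuming scikit-learn, whose ElasticNet uses l1_ratio in the role of $\alpha$ and alpha in the role of $\lambda$ (with the RSS scaled by $1/(2n)$ in its objective); the simulated pair of nearly collinear features is only illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 30))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=60)     # two highly correlated features
y = X[:, 0] + X[:, 1] + rng.normal(size=60)

enet = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=10000).fit(X, y)
print(np.flatnonzero(enet.coef_))                  # tends to keep correlated features together
```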
Continued
Group Lasso: group $\beta_0 + \sum_{l=1}^p\beta_lx_{il}$ into $\theta_0 + \sum_{j=1}^J\theta_j^Tz_{ij}$, where $z_{ij} \in \mathbb{R}^{p_j}$ are correlated features such as the dummy variables for the levels of a qualitative variable, and $\sum_{j=1}^Jp_j = p$, and change the penalty of the lasso to
$$\lambda\sum_{j=1}^J\left\|\theta_j\right\|_2.$$
Nonconvex Penalties
Replace the lasso penalty $\lambda\sum_{j=1}^p|\beta_j|$ by the weighted penalty $\lambda\sum_{j=1}^pw_j|\beta_j|$, where $w_j = 1/|\tilde\beta_j|^\nu$, $\nu > 0$, and $\tilde\beta_j$ is a pilot estimate such as the LSE if p < n and the univariate LSE if $p \ge n$.
- It can be seen as an approximation to the $\ell_q$ penalties with $q = 1 - \nu$, but the objective function is convex in $\beta$ given $\tilde\beta$.
Appendix A: Extensions of the Lasso
Continued
SCAD (smoothly clipped absolute deviation): replace $\lambda|\beta|$ of the lasso by $J_{\lambda,a}(\beta)$ for some a > 2 (a = 3.7 from a Bayesian argument), where
$$J_{\lambda,a}(\beta) = \int_0^{|\beta|}\lambda\left[I\left(s \le \lambda\right) + \frac{\left(a\lambda - s\right)_+}{\left(a-1\right)\lambda}I\left(s > \lambda\right)\right]ds = \begin{cases}\lambda|\beta|, & \text{if } 0 \le |\beta| \le \lambda,\\ \frac{1}{a-1}\left(-\frac{\lambda^2}{2} + a\lambda|\beta| - \frac{1}{2}\beta^2\right), & \text{if } \lambda < |\beta| \le a\lambda,\\ \frac{1}{2}\left(a+1\right)\lambda^2, & \text{if } |\beta| > a\lambda.\end{cases}$$
- The second term in square brackets reduces the amount of shrinkage in the lasso for larger values of $\beta$, with ultimately no shrinkage as $|\beta| \to \infty$; this is similar to the $\ell_q$ penalty with 0 < q < 1 [figure here].
Continued
MC+ (minimax concave): replace $\lambda|\beta|$ of the lasso by
$$P_{\lambda,\gamma}(\beta) = \int_0^{|\beta|}\lambda\left(1 - \frac{x}{\lambda\gamma}\right)_+dx = \int_0^{|\beta|}\frac{\left(\lambda\gamma - x\right)_+}{\gamma}dx = \begin{cases}\lambda|\beta| - \frac{\beta^2}{2\gamma}, & \text{if } 0 \le |\beta| \le \gamma\lambda,\\ \frac{1}{2}\gamma\lambda^2, & \text{if } |\beta| > \gamma\lambda,\end{cases}$$
which translates the linear part of $J_{\lambda,a}(\beta)$ to the origin, where $\gamma \ge 1$ controls the concavity.
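The two penalties above are straightforward to write out; a minimal numpy sketch (with a = 3.7 as in the text and an illustrative $\gamma = 2$):

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty J_{lambda,a}(beta), written out from the piecewise formula above."""
    b = np.abs(beta)
    return np.where(
        b <= lam,
        lam * b,
        np.where(b <= a * lam,
                 (-0.5 * lam**2 + a * lam * b - 0.5 * b**2) / (a - 1),
                 0.5 * (a + 1) * lam**2))

def mc_plus_penalty(beta, lam, gamma=2.0):
    """MC+ penalty P_{lambda,gamma}(beta) from the piecewise formula above."""
    b = np.abs(beta)
    return np.where(b <= gamma * lam,
                    lam * b - b**2 / (2 * gamma),
                    0.5 * gamma * lam**2)

grid = np.linspace(-4, 4, 9)
print(scad_penalty(grid, lam=1.0))     # flattens out beyond |beta| = a*lambda
print(mc_plus_penalty(grid, lam=1.0))  # flattens out beyond |beta| = gamma*lambda
```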
Nonnegative Garrote:
$$\min_c\ \sum_{i=1}^n\left(y_i - \beta_0 - \sum_{j=1}^pc_j\hat\beta_jx_{ij}\right)^2$$
$$\text{s.t. } c_j \ge 0,\ j = 1, \ldots, p,\quad \|c\|_1 = \sum_{j=1}^pc_j \le s.$$
- The prediction is $\sum_{j=1}^pc_j\hat\beta_jx_{ij}$ in both the Garrote and the NN-Garrote.
$\kappa_j\beta_j = \omega_j\beta_j$, with $\omega_j$ being the sample standard deviation of the jth covariate if the original covariate is not normalized to have length 1.
Bayesian Analysis
RR
Suppose $\varepsilon \sim N(0, \sigma^2)$, and
$$p(\beta) = g(\beta_0)\prod_{j=1}^pg(\beta_j) \propto \prod_{j=1}^pg(\beta_j).$$
If $g(\beta_j)$ is the density of $N\left(0, \sigma^2/\lambda\right)$, then
$$p(\beta|X, y)^R \propto \prod_{i=1}^n\phi\left(y_i|\beta_0 + x_i^T\beta, \sigma^2\right)\prod_{j=1}^pg(\beta_j) \propto \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n\left(y_i - \beta_0 - x_i^T\beta\right)^2 - \frac{\lambda}{2\sigma^2}\sum_{j=1}^p\beta_j^2\right\},$$
and
$$\arg\max_\beta p(\beta|X, y)^R = \hat\beta^R_\lambda.$$
Lasso
If
$$g(\beta_j) = \frac{\lambda}{4\sigma^2}\exp\left\{-|\beta_j|\Big/\frac{2\sigma^2}{\lambda}\right\},$$
i.e., the double-exponential (or Laplace) distribution with scale parameter $\frac{2\sigma^2}{\lambda}$ [see Figure 6.11], then
$$p(\beta|X, y)^L \propto \prod_{i=1}^n\phi\left(y_i|\beta_0 + x_i^T\beta, \sigma^2\right)\prod_{j=1}^pg(\beta_j) \propto \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n\left(y_i - \beta_0 - x_i^T\beta\right)^2 - \frac{\lambda}{2\sigma^2}\sum_{j=1}^p|\beta_j|\right\} = \exp\left\{-\frac{\sum_{i=1}^n\left(y_i - \beta_0 - x_i^T\beta\right)^2 + \lambda\sum_{j=1}^p|\beta_j|}{2\sigma^2}\right\},$$
and
$$\arg\max_\beta p(\beta|X, y)^L = \arg\min_\beta\left\{\sum_{i=1}^n\left(y_i - \beta_0 - x_i^T\beta\right)^2 + \lambda\sum_{j=1}^p|\beta_j|\right\} = \hat\beta^L_\lambda.$$
The lasso prior is steeply peaked at zero, while the Gaussian is flatter and fatter at
zero. Hence, the lasso expects a priori that many of the coefficients are (exactly)
zero, while ridge assumes the coefficients are randomly distributed about zero.
$\hat\beta^R_\lambda$ is actually also the posterior mean of $p(\beta|X, y)^R$, while $\hat\beta^L_\lambda$ is not the posterior mean of $p(\beta|X, y)^L$; actually the posterior mean of $p(\beta|X, y)^L$ is not sparse.
Gamma-Lasso (GL):
$$\pi(\beta_j) = \frac{\lambda_j}{2}\exp\left(-\lambda_j|\beta_j|\right),$$
and the Laplace rate $\lambda_j$ is assigned a conjugate gamma hyperprior
$$\text{Ga}\left(\lambda_j|s, r\right) = \frac{r^s}{\Gamma(s)}\lambda_j^{s-1}e^{-r\lambda_j},$$
where the shape s and rate r imply mean s/r and variance $s/r^2$.
Since $\arg\max_{\lambda_j}\left\{\pi(\beta_j)\text{Ga}\left(\lambda_j|s, r\right)\right\} = s/\left(r + |\beta_j|\right)$, and $\log\left\{\pi(\beta_j)\text{Ga}\left(\lambda_j|s, r\right)\right\} \propto s\log\lambda_j - \left(r + |\beta_j|\right)\lambda_j$, the joint MAP estimation of $\beta_j$ and $\lambda_j$ is equivalent to adding
$$s\log\frac{1 + |\beta_j|/r}{s/r} + s \propto s\log\left(1 + |\beta_j|/r\right),$$
which is the log penalty and so is nonconvex in $\beta_j$, to $-\log f(y|X, \beta)$.
(s, r ) can be chosen by CV or AICc (especially when p is large, see Lecture 6 and
Flynn, Hurvich and Simonoff (2013)).
Appendix B: Bayesian Interpretation for Ridge Regression and the Lasso
For nonconvex penalties like SCAD, Zou and Li (2008) suggest iteratively weighted $\ell_1$ regularization, where the coefficient-specific weights are based upon results from previous iterations of $\ell_1$ regularization.
Actually, even a single step of such weighted $\ell_1$ regularization is enough to get solutions that are close to optimal, so long as the pre-estimates are good enough starting points. The crux of success is finding starts that are good enough.
On the other hand, Efron et al. (2004) suggest the least-angle regression (LARS)
algorithm for the lasso which provides a regularization path, a sequence of models
under decreasing amounts of regularization.
Taddy (2017) combines these two ideas for nonconvex penalties and proposes the path of one-step estimators (POSE), which takes advantage of an available source of initial values in any path estimation algorithm – the solved values from the previous path iteration – at no more cost than a single lasso path.
Continued