
Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties


Jianqing Fan and Runze Li

Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on $(0, \infty)$, and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. The proposed ideas are widely applicable. They are readily applied to a variety of parametric models such as generalized linear models and robust regression models. They can also be applied easily to nonparametric modeling by using wavelets and splines. Rates of convergence of the proposed penalized likelihood estimators are established. Furthermore, with proper choice of regularization parameters, we show that the proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known. Our simulation shows that the newly proposed methods compare favorably with other variable selection techniques. Furthermore, the standard error formulas are tested to be accurate enough for practical applications.

KEY WORDS: Hard thresholding; LASSO; Nonnegative garrote; Penalized likelihood; Oracle estimator; SCAD; Soft thresholding.

1. INTRODUCTION

Variable selection is an important topic in linear regression analysis. In practice, a large number of predictors usually are introduced at the initial stage of modeling to attenuate possible modeling biases. On the other hand, to enhance predictability and to select significant variables, statisticians usually use stepwise deletion and subset selection. Although they are practically useful, these selection procedures ignore stochastic errors inherited in the stages of variable selection. Hence, their theoretical properties are somewhat hard to understand. Furthermore, the best subset variable selection suffers from several drawbacks, the most severe of which is its lack of stability, as analyzed, for instance, by Breiman (1996). In an attempt to automatically and simultaneously select variables, we propose a unified approach via penalized least squares, retaining the good features of both subset selection and ridge regression. The penalty functions have to be singular at the origin to produce sparse solutions (many estimated coefficients are zero), to satisfy certain conditions to produce continuous models (for stability of model selection), and to be bounded by a constant to produce nearly unbiased estimates for large coefficients. The bridge regression proposed in Frank and Friedman (1993) and the least absolute shrinkage and selection operator (LASSO) proposed by Tibshirani (1996, 1997) are members of the penalized least squares family, although their associated $L_q$ penalty functions do not satisfy all three of the required properties.

The penalized least squares idea can be extended naturally to likelihood-based models in various statistical contexts. Our approaches are distinguished from traditional methods (usually a quadratic penalty) in that the penalty functions are symmetric, convex on $(0, \infty)$ (rather than concave, as for the negative quadratic penalty in the penalized likelihood situation), and possess singularities at the origin. A few penalty functions are discussed. They allow statisticians to select a penalty function to enhance the predictive power of a model and engineers to sharpen noisy images. Optimizing a penalized likelihood is challenging, because the target function is a high-dimensional nonconcave function with singularities. A new and generic algorithm is proposed that yields a unified variable selection procedure. A standard error formula for the estimated coefficients is obtained by using a sandwich formula. The formula is tested to be accurate enough for practical purposes, even when the sample size is very moderate. The proposed procedures are compared with various other variable selection approaches. The results indicate the favorable performance of the newly proposed procedures.

Unlike the traditional variable selection procedures, the sampling properties of the penalized likelihood estimators can be established. We demonstrate how the rates of convergence for the penalized likelihood estimators depend on the regularization parameter. We further show that the penalized likelihood estimators perform as well as the oracle procedure in terms of selecting the correct model, when the regularization parameter is appropriately chosen. In other words, when the true parameters have some zero components, they are estimated as 0 with probability tending to 1, and the nonzero components are estimated as well as when the correct submodel is known. This improves the accuracy for estimating not only the null components, but also the nonnull components. In short, the penalized likelihood estimators work as well as if the correct submodel were known in advance. The significance of this is that the proposed procedures outperform the maximum likelihood estimator and perform as well as we hope. This is very analogous to the superefficiency phenomenon in the Hodges example (see Lehmann 1983, p. 405).

Jianqing Fan is Professor of Statistics, Department of Statistics, Chinese University of Hong Kong, and Professor, Department of Statistics, University of California, Los Angeles, CA 90095. Runze Li is Assistant Professor, Department of Statistics, Pennsylvania State University, University Park, PA 16802-2111. Fan's research was partially supported by National Science Foundation (NSF) grants DMS-0196041 and DMS-9977096 and a grant from University of California at Los Angeles. Li's research was supported by NSF grant DMS-0102505. The authors thank the associate editor and the referees for constructive comments that substantially improved an earlier draft.

© 2001 American Statistical Association, Journal of the American Statistical Association, December 2001, Vol. 96, No. 456, Theory and Methods.

The proposed penalized likelihood method can be applied readily to high-dimensional nonparametric modeling. After approximating regression functions using splines or wavelets, it remains very critical to select significant variables (terms in the expansion) to efficiently represent unknown functions. In a series of work by Stone and his collaborators (see Stone, Hansen, Kooperberg, and Truong 1997), the traditional variable selection approaches were modified to select useful spline subbases. It remains very challenging to understand the sampling properties of these data-driven variable selection techniques. Penalized likelihood approaches, outlined in Wahba (1990) and Green and Silverman (1994) and references therein, are based on a quadratic penalty. They reduce the variability of estimators via the ridge regression. In wavelet approximations, Donoho and Johnstone (1994a) selected significant subbases (terms in the wavelet expansion) via thresholding. Our penalized likelihood approach can be applied directly to these problems (see Antoniadis and Fan, in press). Because we select variables and estimate parameters simultaneously, the sampling properties of such a data-driven variable selection method can be established.

In Section 2, we discuss the relationship between the penalized least squares and the subset selection when design matrices are orthonormal. In Section 3, we extend the penalized likelihood approach discussed in Section 2 to various parametric regression models, including traditional linear regression models, robust linear regression models, and generalized linear models. The asymptotic properties of the penalized likelihood estimators are established in Section 3.2. Based on local quadratic approximations, a unified iterative algorithm for finding penalized likelihood estimators is proposed in Section 3.3. The formulas for covariance matrices of the estimated coefficients are also derived in this section. Two data-driven methods for finding unknown thresholding parameters are discussed in Section 4. Numerical comparisons and simulation studies also are given in this section. Some discussion is given in Section 5. Technical proofs are relegated to the Appendix.

2. PENALIZED LEAST SQUARES AND VARIABLE SELECTION

Consider the linear regression model

$$y = X\beta + \varepsilon, \qquad (2.1)$$

where $y$ is an $n \times 1$ vector and $X$ is an $n \times d$ matrix. As in the traditional linear regression model, we assume that the $y_i$'s are conditionally independent given the design matrix. There are strong connections between the penalized least squares and the variable selection in the linear regression model. To gain more insights about various variable selection procedures, in this section we assume that the columns of $X$ in (2.1) are orthonormal. The least squares estimate is obtained via minimizing $\|y - X\beta\|^2$, which is equivalent to minimizing $\|\hat\beta - \beta\|^2$, where $\hat\beta = X^T y$ is the ordinary least squares estimate.

Denote $z = X^T y$ and let $\hat y = X X^T y$. A form of the penalized least squares is

$$\frac{1}{2}\|y - X\beta\|^2 + \lambda\sum_{j=1}^{d} p_j(|\beta_j|) = \frac{1}{2}\|y - \hat y\|^2 + \frac{1}{2}\sum_{j=1}^{d}(z_j - \beta_j)^2 + \lambda\sum_{j=1}^{d} p_j(|\beta_j|). \qquad (2.2)$$

The penalty functions $p_j(\cdot)$ in (2.2) are not necessarily the same for all $j$. For example, we may wish to keep important predictors in a parametric model and hence not be willing to penalize their corresponding parameters. For simplicity of presentation, we assume that the penalty functions for all coefficients are the same, denoted by $p(|\cdot|)$. Furthermore, we denote $\lambda p(|\cdot|)$ by $p_\lambda(|\cdot|)$, so $p_\lambda(|\cdot|)$ may be allowed to depend on $\lambda$. Extensions to the case with different thresholding functions do not involve any extra difficulties.

The minimization problem of (2.2) is equivalent to minimizing componentwise. This leads us to consider the penalized least squares problem

$$\frac{1}{2}(z - \theta)^2 + p_\lambda(|\theta|). \qquad (2.3)$$

By taking the hard thresholding penalty function [see Fig. 1(a)]

$$p_\lambda(|\theta|) = \lambda^2 - (|\theta| - \lambda)^2 I(|\theta| < \lambda), \qquad (2.4)$$

we obtain the hard thresholding rule (see Antoniadis 1997 and Fan 1997)

$$\hat\theta = z\, I(|z| > \lambda); \qquad (2.5)$$

see Figure 2(a). In other words, the solution to (2.2) is simply $z_j I(|z_j| > \lambda)$, which coincides with the best subset selection, and stepwise deletion and addition for orthonormal designs. Note that the hard thresholding penalty function is a smoother penalty function than the entropy penalty $p_\lambda(|\theta|) = (\lambda^2/2) I(|\theta| \ne 0)$, which also results in (2.5). The former facilitates computational expedience in other settings.

A good penalty function should result in an estimator with three properties.

1. Unbiasedness: The resulting estimator is nearly unbiased when the true unknown parameter is large, to avoid unnecessary modeling bias.
2. Sparsity: The resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.
3. Continuity: The resulting estimator is continuous in the data $z$ to avoid instability in model prediction.

We now provide some insights on these requirements. The first order derivative of (2.3) with respect to $\theta$ is $\mathrm{sgn}(\theta)\{|\theta| + p'_\lambda(|\theta|)\} - z$. It is easy to see that when $p'_\lambda(|\theta|) = 0$ for large $|\theta|$, the resulting estimator is $z$ when $|z|$ is sufficiently large. Thus, when the true parameter $|\theta|$ is large, the observed value $|z|$ is large with high probability. Hence the penalized least squares estimator simply is $\hat\theta = z$, which is approximately unbiased.
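The decomposition in (2.2) and the componentwise problem (2.3) are easy to check numerically. The following small sketch (in Python with NumPy; it is our illustration, not part of the original article) verifies that, for a design with orthonormal columns, the residual sum of squares splits into a lack-of-fit term and a componentwise term, so the penalized criterion can be minimized one coefficient at a time.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 5
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))   # orthonormal columns: Q'Q = I_d
    y = rng.standard_normal(n)
    beta = rng.standard_normal(d)

    z = Q.T @ y        # componentwise data z = X'y
    y_hat = Q @ z      # projection of y onto the column space, y_hat = XX'y

    lhs = 0.5 * np.sum((y - Q @ beta) ** 2)
    rhs = 0.5 * np.sum((y - y_hat) ** 2) + 0.5 * np.sum((z - beta) ** 2)
    assert np.allclose(lhs, rhs)   # the decomposition used in (2.2)-(2.3)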

Figure 1. Three Penalty Functions $p_\lambda(\theta)$ and Their Quadratic Approximations. The values of $\lambda$ are the same as those in Figure 5(c).

Thus, the condition that $p'_\lambda(|\theta|) = 0$ for large $|\theta|$ is a sufficient condition for unbiasedness for a large true parameter. It corresponds to an improper prior distribution in the Bayesian model selection setting. A sufficient condition for the resulting estimator to be a thresholding rule is that the minimum of the function $|\theta| + p'_\lambda(|\theta|)$ is positive. Figure 3 provides further insights into this statement. When $|z| < \min_{\theta \ne 0}\{|\theta| + p'_\lambda(|\theta|)\}$, the derivative of (2.3) is positive for all positive $\theta$'s (and is negative for all negative $\theta$'s). Therefore, the penalized least squares estimator is 0 in this situation, namely $\hat\theta = 0$ for $|z| < \min_{\theta \ne 0}\{|\theta| + p'_\lambda(|\theta|)\}$. When $|z| > \min_{\theta \ne 0}\{|\theta| + p'_\lambda(|\theta|)\}$, two crossings may exist as shown in Figure 1; the larger one is a penalized least squares estimator. This implies that a sufficient and necessary condition for continuity is that the minimum of the function $|\theta| + p'_\lambda(|\theta|)$ is attained at 0. From the foregoing discussion, we conclude that a penalty function satisfying the conditions of sparsity and continuity must be singular at the origin.

It is well known that the $L_2$ penalty $p_\lambda(|\theta|) = \lambda|\theta|^2$ results in a ridge regression. The $L_1$ penalty $p_\lambda(|\theta|) = \lambda|\theta|$ yields a soft thresholding rule

$$\hat\theta_j = \mathrm{sgn}(z_j)(|z_j| - \lambda)_+, \qquad (2.6)$$

which was proposed by Donoho and Johnstone (1994a). LASSO, proposed by Tibshirani (1996, 1997), is the penalized least squares estimate with the $L_1$ penalty in the general least squares and likelihood settings. The $L_q$ penalty $p_\lambda(|\theta|) = \lambda|\theta|^q$ leads to a bridge regression (Frank and Friedman 1993 and Fu 1998). The solution is continuous only when $q \ge 1$. However, when $q > 1$, the minimum of $|\theta| + p'_\lambda(|\theta|)$ is zero and hence it does not produce a sparse solution [see Fig. 4(a)]. The only continuous solution with a thresholding rule in this family is the $L_1$ penalty, but this comes at the price of shifting the resulting estimator by a constant $\lambda$ [see Fig. 2(b)].

Figure 2. Plot of Thresholding Functions for (a) the Hard, (b) the Soft, and (c) the SCAD Thresholding Functions With $\lambda = 2$ and $a = 3.7$ for SCAD.

2.1 Smoothly Clipped Absolute Deviation Penalty

The $L_q$ and the hard thresholding penalty functions do not simultaneously satisfy the mathematical conditions for unbiasedness, sparsity, and continuity. The continuously differentiable penalty function defined by

$$p'_\lambda(\theta) = \lambda\left\{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda}\, I(\theta > \lambda) \right\} \quad \text{for some } a > 2 \text{ and } \theta > 0, \qquad (2.7)$$

improves the properties of the $L_1$ penalty and the hard thresholding penalty function given by (2.4) [see Fig. 1(c) and the subsequent discussion]. We call this penalty function the smoothly clipped absolute deviation (SCAD) penalty. It corresponds to a quadratic spline function with knots at $\lambda$ and $a\lambda$. This penalty function leaves large values of $\theta$ not excessively penalized and makes the solution continuous. The resulting solution is given by

$$\hat\theta = \begin{cases} \mathrm{sgn}(z)(|z| - \lambda)_+, & \text{when } |z| \le 2\lambda, \\ \{(a-1)z - \mathrm{sgn}(z)a\lambda\}/(a-2), & \text{when } 2\lambda < |z| \le a\lambda, \\ z, & \text{when } |z| > a\lambda \end{cases} \qquad (2.8)$$

[see Fig. 2(c)]. This solution is owing to Fan (1997), who gave a brief discussion in the settings of wavelets. In this article, we use it to develop an effective variable selection procedure for a broad class of models, including linear regression models and generalized linear models. For simplicity of presentation, we use the acronym SCAD for all procedures using the SCAD penalty. The performance of SCAD is similar to that of the firm shrinkage proposed by Gao and Bruce (1997) when design matrices are orthonormal.

Figure 3. A Plot of $\theta + p'_\lambda(\theta)$ Against $\theta$ ($\theta > 0$).

The thresholding rule in (2.8) involves two unknown parameters $\lambda$ and $a$. In practice, we could search the best pair $(\lambda, a)$ over two-dimensional grids using some criteria, such as cross-validation and generalized cross-validation (Craven and Wahba 1979). Such an implementation can be computationally expensive. To implement tools in Bayesian risk analysis, we assume that for given $a$ and $\lambda$, the prior distribution for $\theta$ is a normal distribution with zero mean and variance $a\lambda$. We compute the Bayes risk via numerical integration. Figure 5(a) depicts the Bayes risk as a function of $a$ under the squared loss, for the universal thresholding $\lambda = \sqrt{2\log(d)}$ (see Donoho and Johnstone 1994a) with $d = 20, 40, 60$, and 100; and Figure 5(b) is for $d = 512$, 1024, 2048, and 4096. From Figure 5, (a) and (b), it can be seen that the Bayes risks are not very sensitive to the values of $a$. It can be seen from Figure 5(a) that the Bayes risks achieve their minimums at $a \approx 3.7$ when the value of $d$ is less than 100. This choice gives pretty good practical performance for various variable selection problems. Indeed, based on the simulations in Section 4.3, the choice of $a = 3.7$ works similarly to that chosen by the generalized cross-validation (GCV) method.

2.2 Performance of Thresholding Rules

We now compare the performance of the four previously stated thresholding rules. Marron, Adak, Johnstone, Neumann, and Patil (1998) applied the tool of risk analysis to understand the small sample behavior of the hard and soft thresholding rules. The closed forms for the $L_2$ risk functions $R(\hat\theta, \theta) = E(\hat\theta - \theta)^2$ were derived under the Gaussian model $Z \sim N(\theta, \sigma^2)$ for hard and soft thresholding rules by Donoho and Johnstone (1994b). The risk function of the SCAD thresholding rule can be found in Li (2000). To gauge the performance of the four thresholding rules, Figure 5(c) depicts their $L_2$ risk functions under the Gaussian model $Z \sim N(\theta, 1)$. To make the scale of the thresholding parameters roughly comparable, we took $\lambda = 2$ for the hard thresholding rule and adjusted the values of $\lambda$ for the other thresholding rules so that their estimated values are the same when $\theta = 3$.

Figure 4. Plot of $p'_\lambda(\theta)$ Functions Over $\theta > 0$ for (a) the $L_q$ Penalties, (b) the Hard Thresholding Penalty, and (c) the SCAD Penalty. In (a), the heavy line corresponds to $L_1$, the dash-dot line corresponds to $L_{.5}$, and the thin line corresponds to $L_2$ penalties.
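For readers who want to experiment with the rules compared above, here is a hedged sketch (Python/NumPy; the function names are ours, not from the article) of the SCAD derivative (2.7) and of the hard, soft, and SCAD thresholding rules (2.5), (2.6), and (2.8).

    import numpy as np

    def scad_deriv(theta, lam, a=3.7):
        """SCAD penalty derivative p'_lambda(theta) of (2.7), for theta >= 0."""
        t = np.abs(theta)
        return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

    def hard_threshold(z, lam):
        """Hard thresholding rule (2.5)."""
        return z * (np.abs(z) > lam)

    def soft_threshold(z, lam):
        """Soft thresholding rule (2.6)."""
        return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

    def scad_threshold(z, lam, a=3.7):
        """SCAD thresholding rule (2.8) for an orthonormal design."""
        z = np.asarray(z, dtype=float)
        return np.where(np.abs(z) <= 2 * lam,
                        np.sign(z) * np.maximum(np.abs(z) - lam, 0.0),
                        np.where(np.abs(z) <= a * lam,
                                 ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),
                                 z))

Note that for $|z| > a\lambda$ the SCAD rule returns $z$ unchanged, which is the near-unbiasedness property discussed above.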

Figure 5. Risk Functions of Proposed Procedures Under the Quadratic Loss. (a) Posterior risk functions of the SCAD under the prior $\theta \sim N(0, a\lambda)$ using the universal thresholding $\lambda = \sqrt{2\log(d)}$ for four different values of d: heavy line, d = 20; dashed line, d = 40; medium line, d = 60; thin line, d = 100. (b) Risk functions similar to those for (a): heavy line, d = 512; dashed line, d = 1,024; medium line, d = 2,048; thin line, d = 4,096. (c) Risk functions of the four different thresholding rules. The heavy, dashed, and solid lines denote minimum SCAD, hard, and soft thresholding rules, respectively.

The SCAD performs favorably compared with the other two thresholding rules. This also can be understood via their corresponding penalty functions plotted in Figure 1. It is clear that the SCAD retains the good mathematical properties of the other two thresholding penalty functions. Hence, it is expected to perform the best. For general $\sigma^2$, the picture is the same, except it is scaled vertically by $\sigma^2$, and the $\theta$ axis should be replaced by $\theta/\sigma$.

3. VARIABLE SELECTION VIA PENALIZED LIKELIHOOD

The methodology in the previous section can be applied directly to many other statistical contexts. In this section, we consider linear regression models, robust linear models, and likelihood-based generalized linear models. From now on, we assume that the design matrix $X = (x_{ij})$ is standardized so that each column has mean 0 and variance 1.

3.1 Penalized Least Squares and Likelihood

In the classical linear regression model, the least squares estimate is obtained via minimizing the sum of squared residual errors. Therefore, (2.2) can be extended naturally to the situation in which design matrices are not orthonormal. Similar to (2.2), a form of penalized least squares is

$$\frac{1}{2}(y - X\beta)^T(y - X\beta) + n\sum_{j=1}^{d} p_\lambda(|\beta_j|). \qquad (3.1)$$

Minimizing (3.1) with respect to $\beta$ leads to a penalized least squares estimator of $\beta$.

It is well known that the least squares estimate is not robust. We can consider outlier-resistant loss functions such as the $L_1$ loss or, more generally, Huber's $\psi$ function (see Huber 1981). Therefore, instead of minimizing (3.1), we minimize

$$\sum_{i=1}^{n}\psi(|y_i - x_i\beta|) + n\sum_{j=1}^{d} p_\lambda(|\beta_j|) \qquad (3.2)$$

with respect to $\beta$. This results in a penalized robust estimator for $\beta$.

For generalized linear models, statistical inferences are based on underlying likelihood functions. The penalized maximum likelihood estimator can be used to select significant variables. Assume that the data $\{(x_i, Y_i)\}$ are collected independently. Conditioning on $x_i$, $Y_i$ has a density $f_i(g(x_i^T\beta), y_i)$, where $g$ is a known link function. Let $\ell_i = \log f_i$ denote the conditional log-likelihood of $Y_i$. A form of the penalized likelihood is

$$\sum_{i=1}^{n}\ell_i\big(g(x_i^T\beta), y_i\big) - n\sum_{j=1}^{d} p_\lambda(|\beta_j|).$$

Maximizing the penalized likelihood function is equivalent to minimizing

$$-\sum_{i=1}^{n}\ell_i\big(g(x_i^T\beta), y_i\big) + n\sum_{j=1}^{d} p_\lambda(|\beta_j|) \qquad (3.3)$$

with respect to $\beta$. To obtain a penalized maximum likelihood estimator of $\beta$, we minimize (3.3) with respect to $\beta$ for some thresholding parameter $\lambda$.
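To make the penalized likelihood (3.3) concrete, the following sketch (Python/NumPy; our own helper names, shown only as an illustration) evaluates the criterion for a logistic regression model with the SCAD penalty. The closed form of the SCAD penalty value used below is obtained by integrating (2.7).

    import numpy as np

    def scad_penalty(theta, lam, a=3.7):
        """SCAD penalty value p_lambda(|theta|), obtained by integrating (2.7)."""
        t = np.abs(theta)
        small = lam * t
        middle = -(t ** 2 - 2 * a * lam * t + lam ** 2) / (2 * (a - 1))
        large = (a + 1) * lam ** 2 / 2
        return np.where(t <= lam, small, np.where(t <= a * lam, middle, large))

    def penalized_logistic_objective(beta, X, y, lam, a=3.7):
        """Criterion (3.3) for the canonical logistic model: negative log-likelihood
        plus n times the summed SCAD penalty on the coefficients.  In practice the
        intercept would usually be left unpenalized, cf. the remark after (2.2)."""
        n = X.shape[0]
        eta = X @ beta
        neg_loglik = np.sum(np.logaddexp(0.0, eta) - y * eta)   # stable log(1 + e^eta) - y*eta
        return neg_loglik + n * np.sum(scad_penalty(beta, lam, a))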

3.2 Sampling Properties and Oracle Properties

In this section, we establish the asymptotic theory for our nonconcave penalized likelihood estimator. Let

$$\beta_0 = (\beta_{10}, \ldots, \beta_{d0})^T = (\beta_{10}^T, \beta_{20}^T)^T.$$

Without loss of generality, assume that $\beta_{20} = 0$. Let $I(\beta_0)$ be the Fisher information matrix and let $I_1(\beta_{10}, 0)$ be the Fisher information knowing $\beta_{20} = 0$. We first show that there exists a penalized likelihood estimator that converges at the rate

$$O_P(n^{-1/2} + a_n), \qquad (3.4)$$

where $a_n = \max\{p'_{\lambda_n}(|\beta_{j0}|): \beta_{j0} \ne 0\}$. This implies that for the hard thresholding and SCAD penalty functions, the penalized likelihood estimator is root-$n$ consistent if $\lambda_n \to 0$. Furthermore, we demonstrate that such a root-$n$ consistent estimator must satisfy $\hat\beta_2 = 0$ and that $\hat\beta_1$ is asymptotically normal with covariance matrix $I_1^{-1}$, if $n^{1/2}\lambda_n \to \infty$. This implies that the penalized likelihood estimator performs as well as if $\beta_{20} = 0$ were known. In language similar to Donoho and Johnstone (1994a), the resulting estimator performs as well as the oracle estimator, which knows in advance that $\beta_{20} = 0$.

The preceding oracle performance is closely related to the superefficiency phenomenon. Consider the simplest linear regression model $y = \mathbf{1}_n\mu + \varepsilon$, where $\varepsilon \sim N_n(0, I_n)$. A superefficient estimate for $\mu$ is

$$\delta_n = \begin{cases} \bar Y, & \text{if } |\bar Y| \ge n^{-1/4}, \\ c\bar Y, & \text{if } |\bar Y| < n^{-1/4}, \end{cases}$$

owing to Hodges (see Lehmann 1983, p. 405). If we set $c$ to 0, then $\delta_n$ coincides with the hard thresholding estimator with the thresholding parameter $\lambda_n = n^{-1/4}$. This estimator correctly estimates the parameter at point 0 without paying any price for estimating the parameter elsewhere.

We now state the result in a fairly general setting. To facilitate the presentation, we assume that the penalization is applied to every component of $\beta$. However, there is no extra difficulty to extend it to the case where some components (e.g., variance in the linear models) are not penalized.

Set $V_i = (X_i, Y_i)$, $i = 1, \ldots, n$. Let $L(\beta)$ be the log-likelihood function of the observations $V_1, \ldots, V_n$ and let $Q(\beta)$ be the penalized likelihood function $L(\beta) - n\sum_{j=1}^{d} p_{\lambda_n}(|\beta_j|)$. We state our theorems here, but their proofs are relegated to the Appendix, where the conditions for the theorems also can be found.

Theorem 1. Let $V_1, \ldots, V_n$ be independent and identically distributed, each with a density $f(V, \beta)$ (with respect to a measure $\mu$) that satisfies conditions (A)-(C) in the Appendix. If $\max\{|p''_{\lambda_n}(|\beta_{j0}|)|: \beta_{j0} \ne 0\} \to 0$, then there exists a local maximizer $\hat\beta$ of $Q(\beta)$ such that $\|\hat\beta - \beta_0\| = O_P(n^{-1/2} + a_n)$, where $a_n$ is given by (3.4).

It is clear from Theorem 1 that by choosing a proper $\lambda_n$, there exists a root-$n$ consistent penalized likelihood estimator. We now show that this estimator must possess the sparsity property $\hat\beta_2 = 0$, which is stated as follows.

Lemma 1. Let $V_1, \ldots, V_n$ be independent and identically distributed, each with a density $f(V, \beta)$ that satisfies conditions (A)-(C) in the Appendix. Assume that

$$\liminf_{n\to\infty}\,\liminf_{\theta\to 0+}\, p'_{\lambda_n}(\theta)/\lambda_n > 0. \qquad (3.5)$$

If $\lambda_n \to 0$ and $\sqrt{n}\lambda_n \to \infty$ as $n \to \infty$, then with probability tending to 1, for any given $\beta_1$ satisfying $\|\beta_1 - \beta_{10}\| = O_P(n^{-1/2})$ and any constant $C$,

$$Q\{(\beta_1^T, 0^T)^T\} = \max_{\|\beta_2\| \le Cn^{-1/2}} Q\{(\beta_1^T, \beta_2^T)^T\}.$$

Denote

$$\Sigma = \mathrm{diag}\{p''_{\lambda_n}(|\beta_{10}|), \ldots, p''_{\lambda_n}(|\beta_{s0}|)\}$$

and

$$b = \big(p'_{\lambda_n}(|\beta_{10}|)\,\mathrm{sgn}(\beta_{10}), \ldots, p'_{\lambda_n}(|\beta_{s0}|)\,\mathrm{sgn}(\beta_{s0})\big)^T,$$

where $s$ is the number of components of $\beta_{10}$.

Theorem 2 (Oracle Property). Let $V_1, \ldots, V_n$ be independent and identically distributed, each with a density $f(V, \beta)$ satisfying conditions (A)-(C) in the Appendix. Assume that the penalty function $p_{\lambda_n}(|\theta|)$ satisfies condition (3.5). If $\lambda_n \to 0$ and $\sqrt{n}\lambda_n \to \infty$ as $n \to \infty$, then with probability tending to 1, the root-$n$ consistent local maximizers $\hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T$ in Theorem 1 must satisfy:

(a) Sparsity: $\hat\beta_2 = 0$.
(b) Asymptotic normality:

$$\sqrt{n}\,(I_1(\beta_{10}) + \Sigma)\big\{\hat\beta_1 - \beta_{10} + (I_1(\beta_{10}) + \Sigma)^{-1} b\big\} \to N\big(0, I_1(\beta_{10})\big)$$

in distribution, where $I_1(\beta_{10}) = I_1(\beta_{10}, 0)$, the Fisher information knowing $\beta_2 = 0$.

As a consequence, the asymptotic covariance matrix of $\hat\beta_1$ is

$$\frac{1}{n}\,\big(I_1(\beta_{10}) + \Sigma\big)^{-1} I_1(\beta_{10}) \big(I_1(\beta_{10}) + \Sigma\big)^{-1},$$

which approximately equals $(1/n)\,I_1^{-1}(\beta_{10})$ for the thresholding penalties discussed in Section 2 if $\lambda_n$ tends to 0.

Remark 1. For the hard and SCAD thresholding penalty functions, if $\lambda_n \to 0$, then $a_n = 0$. Hence, by Theorem 2, when $\sqrt{n}\lambda_n \to \infty$, their corresponding penalized likelihood estimators possess the oracle property and perform as well as the maximum likelihood estimates for estimating $\beta_1$ knowing $\beta_2 = 0$. However, for the $L_1$ penalty, $a_n = \lambda_n$. Hence, root-$n$ consistency requires that $\lambda_n = O_P(n^{-1/2})$. On the other hand, the oracle property in Theorem 2 requires that $\sqrt{n}\lambda_n \to \infty$. These two conditions for LASSO cannot be satisfied simultaneously. Indeed, for the $L_1$ penalty, we conjecture that the oracle property does not hold. However, for the $L_q$ penalty with $q < 1$, the oracle property continues to hold with a suitable choice of $\lambda_n$.

Now we briefly discuss the regularity conditions (A)-(C) for the generalized linear models (see McCullagh and Nelder 1989). With a canonical link, the conditional distribution of $Y$ given $X = x$ belongs to the canonical exponential family, that is, with a density function

$$f(y; x, \beta) = c(y)\exp\!\left\{\frac{y x^T\beta - b(x^T\beta)}{a(\phi)}\right\}.$$

Clearly, the regularity conditions (A) are satisfied. The Fisher information matrix is

$$I(\beta) = E\{b''(x^T\beta)\, x x^T\}/a(\phi).$$

Therefore, if $E\{b''(x^T\beta)\, x x^T\}$ is finite and positive definite, then condition (B) holds. If for all $\beta$ in some neighborhood of $\beta_0$, $|b^{(3)}(x^T\beta)| \le M_0(x)$ for some function $M_0(x)$ satisfying $E_{\beta_0}\{M_0(x)X_jX_kX_l\} < \infty$ for all $j, k, l$, then condition (C) holds. For general link functions, similar conditions are needed to guarantee conditions (B) and (C). The mathematical derivation of those conditions does not involve any extra difficulty except more tedious notation. Results in Theorems 1 and 2 also can be established for the penalized least squares (3.1) and the penalized robust linear regression (3.2) under some mild regularity conditions. See Li (2000) for details.
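Purely as an illustration of the quantities appearing in Theorem 2 (this block is ours, not from the article), the matrix $\Sigma$ and the asymptotic covariance can be formed from the SCAD derivatives. Because $p'_\lambda(\theta)$ and $p''_\lambda(\theta)$ vanish for $\theta > a\lambda$, both $b$ and $\Sigma$ are zero once $\lambda_n$ is small relative to the nonzero coefficients, which is why the covariance collapses to $I_1^{-1}(\beta_{10})/n$.

    import numpy as np

    def scad_second_deriv(theta, lam, a=3.7):
        """p''_lambda(theta): -1/(a-1) on (lam, a*lam], zero elsewhere (theta > 0)."""
        t = np.abs(theta)
        return np.where((t > lam) & (t <= a * lam), -1.0 / (a - 1), 0.0)

    def theorem2_covariance(I1, beta1, lam, n, a=3.7):
        """(I1 + Sigma)^{-1} I1 (I1 + Sigma)^{-1} / n, the covariance form in Theorem 2."""
        Sigma = np.diag(scad_second_deriv(beta1, lam, a))
        A = np.linalg.inv(I1 + Sigma)
        return A @ I1 @ A / n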

3.3 A New Unified Algorithm

Tibshirani (1996) proposed an algorithm for solving constrained least squares problems of LASSO, whereas Fu (1998) provided a "shooting algorithm" for LASSO. See also LASSO2 submitted by Berwin Turlach at Statlib (http://lib.stat.cmu.edu/S/). In this section, we propose a new unified algorithm for the minimization problems (3.1), (3.2), and (3.3) via local quadratic approximations. The first term in (3.1), (3.2), and (3.3) may be regarded as a loss function of $\beta$. Denote it by $\ell(\beta)$. Then the expressions (3.1), (3.2), and (3.3) can be written in a unified form as

$$\ell(\beta) + n\sum_{j=1}^{d} p_\lambda(|\beta_j|). \qquad (3.6)$$

The $L_1$, hard thresholding, and SCAD penalty functions are singular at the origin, and they do not have continuous second order derivatives. However, they can be locally approximated by a quadratic function as follows. Suppose that we are given an initial value $\beta_0$ that is close to the minimizer of (3.6). If $\beta_{j0}$ is very close to 0, then set $\hat\beta_j = 0$. Otherwise the penalties can be locally approximated by a quadratic function via

$$[p_\lambda(|\beta_j|)]' = p'_\lambda(|\beta_j|)\,\mathrm{sgn}(\beta_j) \approx \{p'_\lambda(|\beta_{j0}|)/|\beta_{j0}|\}\,\beta_j,$$

when $\beta_j \ne 0$. In other words,

$$p_\lambda(|\beta_j|) \approx p_\lambda(|\beta_{j0}|) + \tfrac{1}{2}\{p'_\lambda(|\beta_{j0}|)/|\beta_{j0}|\}(\beta_j^2 - \beta_{j0}^2), \quad \text{for } \beta_j \approx \beta_{j0}. \qquad (3.7)$$

Figure 1 shows the $L_1$, hard thresholding, and SCAD penalty functions, and their approximations on the right-hand side of (3.7) at two different values of $\beta_{j0}$. A drawback of this approximation is that once a coefficient is shrunken to zero, it will stay at zero. However, this method significantly reduces the computational burden.

If $\ell(\beta)$ is the $L_1$ loss as in (3.2), then it does not have continuous second order partial derivatives with respect to $\beta$. However, $\psi(|y - x^T\beta|)$ in (3.2) can be analogously approximated by $\{\psi(|y - x^T\beta_0|)/(y - x^T\beta_0)^2\}(y - x^T\beta)^2$, as long as the initial value $\beta_0$ of $\beta$ is close to the minimizer. When some of the residuals $|y - x^T\beta_0|$ are small, this approximation is not very good. See Section 3.4 for some slight modifications of this approximation.

Now assume that the log-likelihood function is smooth with respect to $\beta$ so that its first two partial derivatives are continuous. Thus the first term in (3.6) can be locally approximated by a quadratic function. Therefore, the minimization problem (3.6) can be reduced to a quadratic minimization problem and the Newton-Raphson algorithm can be used. Indeed, (3.6) can be locally approximated (except for a constant term) by

$$\ell(\beta_0) + \nabla\ell(\beta_0)^T(\beta - \beta_0) + \tfrac{1}{2}(\beta - \beta_0)^T\nabla^2\ell(\beta_0)(\beta - \beta_0) + \tfrac{1}{2}n\beta^T\Sigma_\lambda(\beta_0)\beta, \qquad (3.8)$$

where

$$\nabla\ell(\beta_0) = \frac{\partial\ell(\beta_0)}{\partial\beta}, \qquad \nabla^2\ell(\beta_0) = \frac{\partial^2\ell(\beta_0)}{\partial\beta\,\partial\beta^T},$$

$$\Sigma_\lambda(\beta_0) = \mathrm{diag}\{p'_\lambda(|\beta_{10}|)/|\beta_{10}|, \ldots, p'_\lambda(|\beta_{d0}|)/|\beta_{d0}|\}.$$

The quadratic minimization problem (3.8) yields the solution

$$\beta_1 = \beta_0 - \big\{\nabla^2\ell(\beta_0) + n\Sigma_\lambda(\beta_0)\big\}^{-1}\big\{\nabla\ell(\beta_0) + nU_\lambda(\beta_0)\big\}, \qquad (3.9)$$

where $U_\lambda(\beta_0) = \Sigma_\lambda(\beta_0)\beta_0$. When the algorithm converges, the estimator satisfies the condition

$$\frac{\partial\ell(\hat\beta_0)}{\partial\beta_j} + n\,p'_\lambda(|\hat\beta_{j0}|)\,\mathrm{sgn}(\hat\beta_{j0}) = 0,$$

the penalized likelihood equation, for nonzero elements of $\hat\beta_0$. Specifically, for the penalized least squares problem (3.1), the solution can be found by iteratively computing the ridge regression

$$\beta_1 = \big\{X^TX + n\Sigma_\lambda(\beta_0)\big\}^{-1}X^Ty.$$

Similarly we obtain the solution for (3.2) by iterating

$$\beta_1 = \big\{X^TWX + \tfrac{1}{2}n\Sigma_\lambda(\beta_0)\big\}^{-1}X^TWy,$$

where $W = \mathrm{diag}\{\psi(|y_1 - x_1^T\beta_0|)/(y_1 - x_1^T\beta_0)^2, \ldots, \psi(|y_n - x_n^T\beta_0|)/(y_n - x_n^T\beta_0)^2\}$.

As in the maximum likelihood estimation (MLE) setting, with a good initial value $\beta_0$, the one-step procedure can be as efficient as the fully iterative procedure, namely, the penalized maximum likelihood estimator, when the Newton-Raphson algorithm is used (see Bickel 1975). Now regarding $\beta^{(k-1)}$ as a good initial value at the $k$th step, the next iteration also can be regarded as a one-step procedure and hence the resulting estimator still can be as efficient as the fully iterative method (see Robinson 1988 for the theory on the difference between the MLE and the $k$-step estimators). Therefore, estimators obtained by the aforementioned algorithm with a few iterations always can be regarded as a one-step estimator, which is as efficient as the fully iterative method. In this sense, we do not have to iterate the foregoing algorithm until convergence as long as the initial estimators are good enough. The estimators from the full models can be used as initial estimators, as long as they are not overly parameterized.

3.4 Standard Error Formula

The standard errors for the estimated parameters can be obtained directly because we are estimating parameters and selecting variables at the same time. Following the conventional technique in the likelihood setting, the corresponding sandwich formula can be used as an estimator for the covariance of the estimates $\hat\beta_1$, the nonvanishing component of $\hat\beta$. That is,

$$\widehat{\mathrm{cov}}(\hat\beta_1) = \big\{\nabla^2\ell(\hat\beta_1) + n\Sigma_\lambda(\hat\beta_1)\big\}^{-1}\,\widehat{\mathrm{cov}}\big\{\nabla\ell(\hat\beta_1)\big\}\,\big\{\nabla^2\ell(\hat\beta_1) + n\Sigma_\lambda(\hat\beta_1)\big\}^{-1}. \qquad (3.10)$$

Compare with Theorem 2(b). This formula is shown to have good accuracy for moderate sample sizes.

When the $L_1$ loss is used in the robust regression, some slight modifications are needed in the aforementioned algorithm and its corresponding sandwich formula. For $\psi(x) = |x|$, the diagonal elements of $W$ are $\{|r_i|^{-1}\}$ with $r_i = y_i - x_i^T\beta_0$. Thus, for a given current value of $\beta_0$, when some of the residuals $\{r_i\}$ are close to 0, these points receive too much weight. Hence, we replace the weight by $(a_n + |r_i|)^{-1}$. In our implementations, we took $a_n$ as the $2n^{-1/2}$ quantile of the absolute residuals $\{|r_i|, i = 1, \ldots, n\}$. Thus, the constant $a_n$ changes from iteration to iteration.

3.5 Testing Convergence of the Algorithm

We now demonstrate that our algorithm converges to the right solution. To this end, we took a 100-dimensional vector $\beta$ consisting of 50 zeros and other nonzero elements generated from $N(0, 5^2)$, and used a $100 \times 100$ orthonormal design matrix $X$. We then generated a response vector $y$ from the linear model (2.1). We chose an orthonormal design matrix for our testing case, because the penalized least squares has a closed form mathematical solution so that we can compare our output with the mathematical solution. Our experiment did show that the proposed algorithm converged to the right solution. It took MATLAB 0.27, 0.39, and 0.16 s to converge for the penalized least squares with the SCAD, $L_1$, and hard thresholding penalties. The numbers of iterations were 30, 30, and 5, respectively, for the penalized least squares with the SCAD, $L_1$, and hard thresholding penalties. In fact, after 10 iterations, the penalized least squares estimators are already very close to the true value.

4. NUMERICAL COMPARISONS

The purpose of this section is to compare the performance of the proposed approaches with existing ones and to test the accuracy of the standard error formula. We also illustrate our penalized likelihood approaches by a real data example. In all examples in this section, we computed the penalized likelihood estimate with the $L_1$ penalty, referred to as LASSO, by our algorithm rather than those of Tibshirani (1996) and Fu (1998).

4.1 Prediction and Model Error

The prediction error is defined as the average error in the prediction of $Y$ given $x$ for future cases not used in the construction of a prediction equation. There are two regression situations, $X$ random and $X$ controlled. In the case that $X$ is random, both $Y$ and $x$ are randomly selected. In the controlled situation, design matrices are selected by experimenters and only $y$ is random. For ease of presentation, we consider only the $X$-random case.

In $X$-random situations, the data $(x_i, Y_i)$ are assumed to be a random sample from their parent distribution $(x, Y)$. Then, if $\hat\mu(x)$ is a prediction procedure constructed using the present data, the prediction error is defined as

$$\mathrm{PE}(\hat\mu) = E\{Y - \hat\mu(x)\}^2,$$

where the expectation is taken only with respect to the new observation $(x, Y)$. The prediction error can be decomposed as

$$\mathrm{PE}(\hat\mu) = E\{Y - E(Y\mid x)\}^2 + E\{E(Y\mid x) - \hat\mu(x)\}^2.$$

The first component is the inherent prediction error due to the noise. The second component is due to lack of fit to an underlying model. This component is called model error and is denoted $\mathrm{ME}(\hat\mu)$. The size of the model error reflects the performance of different model selection procedures. If $Y = x^T\beta + \varepsilon$, where $E(\varepsilon\mid x) = 0$, then $\mathrm{ME}(\hat\mu) = (\hat\beta - \beta)^T E(xx^T)(\hat\beta - \beta)$.

4.2 Selection of Thresholding Parameters

To implement the methods described in Sections 2 and 3, we need to estimate the thresholding parameters $\lambda$ and $a$ (for the SCAD). Denote by $\theta$ the tuning parameters to be estimated, that is, $\theta = (\lambda, a)$ for the SCAD and $\theta = \lambda$ for the other penalty functions. Here we discuss two methods of estimating $\theta$: fivefold cross-validation and generalized cross-validation, as suggested by Breiman (1995), Tibshirani (1996), and Fu (1998).

For completeness, we now describe the details of the cross-validation and the generalized cross-validation procedures. Here we discuss only these two procedures for linear regression models. Extensions to robust linear models and likelihood-based linear models do not involve extra difficulties. The fivefold cross-validation procedure is as follows: Denote the full dataset by $T$, and denote the cross-validation training and test sets by $T - T^\nu$ and $T^\nu$, respectively, for $\nu = 1, \ldots, 5$. For each $\theta$ and $\nu$, we find the estimator $\hat\beta^{(\nu)}(\theta)$ of $\beta$ using the training set $T - T^\nu$. Form the cross-validation criterion as

$$\mathrm{CV}(\theta) = \sum_{\nu=1}^{5}\sum_{(y_k, x_k)\in T^\nu}\big\{y_k - x_k^T\hat\beta^{(\nu)}(\theta)\big\}^2.$$

We find a $\hat\theta$ that minimizes $\mathrm{CV}(\theta)$.

The second method is the generalized cross-validation. For linear regression models, we update the solution by

$$\beta_1(\theta) = \big\{X^TX + n\Sigma_\lambda(\beta_0)\big\}^{-1}X^Ty.$$

Thus the fitted value $\hat y$ of $y$ is $X\{X^TX + n\Sigma_\lambda(\beta_0)\}^{-1}X^Ty$, and

$$P_X\{\hat\beta(\theta)\} = X\big\{X^TX + n\Sigma_\lambda(\hat\beta)\big\}^{-1}X^T$$

can be regarded as a projection matrix. Define the number of effective parameters in the penalized least squares fit as $e(\theta) = \mathrm{tr}[P_X\{\hat\beta(\theta)\}]$. Therefore, the generalized cross-validation statistic is

$$\mathrm{GCV}(\theta) = \frac{1}{n}\,\frac{\|y - X\hat\beta(\theta)\|^2}{\{1 - e(\theta)/n\}^2},$$

and $\hat\theta = \arg\min_\theta\{\mathrm{GCV}(\theta)\}$.
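A hedged sketch of GCV-based tuning (Python/NumPy, ours; it reuses the scad_pls and scad_deriv helpers sketched in Section 3.3 and is not the authors' code) evaluates GCV over a grid of $\lambda$ values with $a$ fixed at 3.7:

    import numpy as np

    def gcv_select(X, y, lams, a=3.7):
        """Choose lambda by minimizing the generalized cross-validation statistic."""
        n = X.shape[0]
        best = (np.inf, None, None)
        for lam in lams:
            beta, _ = scad_pls(X, y, lam, a)
            active = beta != 0
            if active.any():
                Xa, ba = X[:, active], beta[active]
                Sigma = np.diag(scad_deriv(ba, lam, a) / np.abs(ba))
                # effective number of parameters e(theta) = tr[X (X'X + n Sigma)^{-1} X']
                e = np.trace(Xa @ np.linalg.solve(Xa.T @ Xa + n * Sigma, Xa.T))
            else:
                e = 0.0
            rss = np.sum((y - X @ beta) ** 2)
            gcv = rss / (n * (1.0 - e / n) ** 2)
            if gcv < best[0]:
                best = (gcv, lam, beta)
        return best   # (GCV value, selected lambda, fitted coefficients)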

4.3 Simulation Study

In the following examples, we numerically compare the proposed variable selection methods with the ordinary least squares, ridge regression, best subset selection, and nonnegative garrote (see Breiman 1995). All simulations are conducted using MATLAB codes. We directly used the constrained least squares module in MATLAB to find the nonnegative garrote estimate. As recommended in Breiman (1995), a fivefold cross-validation was used to estimate the tuning parameter for the nonnegative garrote. For the other model selection procedures, both fivefold cross-validation and generalized cross-validation were used to estimate thresholding parameters. However, their performance was similar. Therefore, we present only the results based on the generalized cross-validation.

Example 4.1 (Linear Regression). In this example we simulated 100 datasets consisting of $n$ observations from the model

$$Y = x^T\beta + \sigma\varepsilon,$$

where $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$, and the components of $x$ and $\varepsilon$ are standard normal. The correlation between $x_i$ and $x_j$ is $\rho^{|i-j|}$ with $\rho = .5$. This is a model used in Tibshirani (1996). First, we chose $n = 40$ and $\sigma = 3$. Then we reduced $\sigma$ to 1 and finally increased the sample size to 60. The model error of the proposed procedures is compared to that of the least squares estimator. The median of relative model errors (MRME) over 100 simulated datasets is summarized in Table 1. The average number of 0 coefficients is also reported in Table 1, in which the column labeled "Correct" presents the average restricted only to the true zero coefficients, and the column labeled "Incorrect" depicts the average of coefficients erroneously set to 0.

From Table 1, it can be seen that when the noise level is high and the sample size is small, LASSO performs the best and it significantly reduces both model error and model complexity, whereas ridge regression reduces only model error. The other variable selection procedures also reduce model error and model complexity. However, when the noise level is reduced, the SCAD outperforms the LASSO and the other penalized least squares. Ridge regression performs very poorly. The best subset selection method performs quite similarly to the SCAD. The nonnegative garrote performs quite well in various situations. Comparing the first two rows in Table 1, we can see that the choice of $a = 3.7$ is very reasonable. Therefore, we used it for the other examples in this article. Table 1 also depicts the performance of an oracle estimator. From Table 1, it also can be seen that the performance of SCAD is expected to be as good as that of the oracle estimator as the sample size $n$ increases (see Tables 5 and 6 for more details).

We now test the accuracy of our standard error formula (3.10). The median absolute deviation divided by .6745, denoted by SD in Table 2, of the 100 estimated coefficients in the 100 simulations can be regarded as the true standard error. The median of the 100 estimated SD's, denoted by SD_m, and the median absolute deviation error of the 100 estimated standard errors divided by .6745, denoted by SD_mad, gauge the overall performance of the standard error formula (3.10). Table 2 presents the results for the nonzero coefficients when the sample size is n = 60. The results for the other two cases with n = 40 are similar. Table 2 suggests that the sandwich formula performs surprisingly well.

Table 1. Simulation Results for the Linear Regression Model

                                Avg. No. of 0 Coefficients
  Method           MRME (%)      Correct      Incorrect
  n = 40, σ = 3
    SCAD1            72.90         4.20          .21
    SCAD2            69.03         4.31          .27
    LASSO            63.19         3.53          .07
    Hard             73.82         4.09          .19
    Ridge            83.28         0             0
    Best subset      68.26         4.50          .35
    Garrote          76.90         2.80          .09
    Oracle           33.31         5             0
  n = 40, σ = 1
    SCAD1            54.81         4.29          0
    SCAD2            47.25         4.34          0
    LASSO            63.19         3.51          0
    Hard             69.72         3.93          0
    Ridge            95.21         0             0
    Best subset      53.60         4.54          0
    Garrote          56.55         3.35          0
    Oracle           33.31         5             0
  n = 60, σ = 1
    SCAD1            47.54         4.37          0
    SCAD2            43.79         4.42          0
    LASSO            65.22         3.56          0
    Hard             71.11         4.02          0
    Ridge            97.36         0             0
    Best subset      46.11         4.73          0
    Garrote          55.90         3.38          0
    Oracle           29.82         5             0

  NOTE: The value of a in SCAD1 is obtained by generalized cross-validation, whereas the value of a in SCAD2 is 3.7.

Table 2. Standard Deviations of Estimators for the Linear Regression Model (n = 60)

                     β̂_1                     β̂_2                     β̂_5
  Method        SD    SD_m (SD_mad)     SD    SD_m (SD_mad)     SD    SD_m (SD_mad)
  SCAD1        .166   .161 (.021)      .170   .160 (.024)      .148   .145 (.022)
  SCAD2        .161   .161 (.021)      .164   .161 (.024)      .151   .143 (.023)
  LASSO        .164   .154 (.019)      .173   .150 (.022)      .153   .142 (.021)
  Hard         .169   .161 (.022)      .174   .162 (.025)      .178   .148 (.021)
  Best subset  .163   .155 (.020)      .152   .154 (.026)      .152   .139 (.020)
  Oracle       .155   .154 (.020)      .147   .153 (.024)      .146   .137 (.019)
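For readers who wish to replicate Example 4.1, the data-generating mechanism and the model error ME(μ̂) = (β̂ − β)^T E(xx^T)(β̂ − β) of Section 4.1 can be sketched as follows (Python/NumPy, ours; the original simulations were carried out in MATLAB).

    import numpy as np

    def simulate_example_41(n, sigma, rho=0.5, seed=0):
        """One dataset from Example 4.1: AR(1)-correlated standard normal covariates."""
        rng = np.random.default_rng(seed)
        beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
        d = beta.size
        cov = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
        X = rng.multivariate_normal(np.zeros(d), cov, size=n)
        y = X @ beta + sigma * rng.standard_normal(n)
        return X, y, beta, cov

    def model_error(beta_hat, beta, cov):
        """ME of Section 4.1 for the linear model: (b - beta)' E(xx') (b - beta)."""
        diff = beta_hat - beta
        return diff @ cov @ diff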

Table 3. Simulation Results for the Robust Linear Model

                                Avg. No. of 0 Coefficients
  Method           MRME (%)      Correct      Incorrect
  SCAD (a = 3.7)     35.52         4.71          0
  LASSO              52.80         4.29          0
  Hard               47.22         4.70          0
  Best subset        41.53         4.85          .18
  Oracle             23.33         5             0

Table 5. Simulation Results for the Logistic Regression

                                Avg. No. of 0 Coefficients
  Method           MRME (%)      Correct      Incorrect
  SCAD (a = 3.7)     26.48         4.98          .04
  LASSO              53.14         3.76          0
  Hard               59.06         4.27          0
  Best subset        31.63         4.84          .01
  Oracle             25.71         5             0

Example 4.2 (Robust Regression). In this example, we simulated 100 datasets consisting of 60 observations from the model

$$Y = x^T\beta + \varepsilon,$$

where $\beta$ and $x$ are the same as those in Example 1. The $\varepsilon$ is drawn from the standard normal distribution with 10% outliers from the standard Cauchy distribution. The simulation results are summarized in Table 3. From Table 3, it can be seen that the SCAD somewhat outperforms the other procedures. The true and estimated standard deviations of the estimators via the sandwich formula (3.10) are shown in Table 4, which indicates that the performance of the sandwich formula is very good.

Example 4.3 (Logistic Regression). In this example, we simulated 100 datasets consisting of 200 observations from the model $Y \sim \mathrm{Bernoulli}\{p(x^T\beta)\}$, where $p(u) = \exp(u)/(1 + \exp(u))$, and the first six components of $x$ and $\beta$ are the same as those in Example 1. The last two components of $x$ are independently identically distributed as a Bernoulli distribution with probability of success .5. All covariates are standardized. Model errors are computed via 1000 Monte Carlo simulations. The summary of simulation results is depicted in Tables 5 and 6. From Table 5, it can be seen that the performance of the SCAD is much better than that of the other two penalized likelihood estimates. Results in Table 6 show that our standard error estimator works well. From Tables 5 and 6, SCAD works as well as the oracle estimator in terms of the MRME and the accuracies of estimated standard errors.

We remark that the estimated SDs for the $L_1$ penalized likelihood estimator (LASSO) are consistently smaller than those for the SCAD, but its overall MRME is larger than that of the SCAD. This implies that the biases in the $L_1$ penalized likelihood estimators are large. This remark applies to all of our examples. Indeed, in Table 7, all coefficients were noticeably shrunken by LASSO.

Example 4.4. In this example, we apply the proposed penalized likelihood methodology to the burns data, collected by the General Hospital Burn Center at the University of Southern California. The dataset consists of 981 observations. The binary response variable $Y$ is 1 for those victims who survived their burns and 0 otherwise. Covariates $X_1$ = age, $X_2$ = sex, $X_3$ = log(burn area + 1), and the binary variable $X_4$ = oxygen (0 normal, 1 abnormal) were considered. Quadratic terms of $X_1$ and $X_3$, and all interaction terms were included. The intercept term was added and the logistic regression model was fitted. The best subset variable selection with the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) was applied to this dataset. The unknown parameter $\lambda$ was chosen by generalized cross-validation: it is .6932, .0015, and .8062 for the penalized likelihood estimates with the SCAD, $L_1$, and hard thresholding penalties, respectively. The constant $a$ in the SCAD was taken as 3.7. With the selected $\lambda$, the penalized likelihood estimator was obtained at the 6th, 28th, and 5th step iterations for the penalized likelihood with the SCAD, $L_1$, and hard thresholding penalties, respectively. We also computed 10-step estimators, which took us less than 50 s for each penalized likelihood estimator, and the differences between the full iteration estimators and the 10-step estimators were less than 1%. The estimated coefficients and standard errors for the transformed data, based on the penalized likelihood estimators, are reported in Table 7.

From Table 7, the best subset procedure via minimizing the BIC scores chooses 5 out of 13 covariates, whereas the SCAD chooses 4 covariates. The difference between them is that the best subset keeps $X_4$. Both SCAD and the best subset variable selection (BIC) do not include $X_1^2$ and $X_3^2$ in the selected subset, but both LASSO and the best subset variable selection (AIC) do. LASSO chooses the quadratic terms of $X_1$ and $X_3$ rather than their linear terms. It also selects an interaction term $X_2X_3$, which may not be statistically significant. LASSO shrinks noticeably large coefficients. In this example, the penalized likelihood with the hard thresholding penalty retains too many predictors. In particular, it selects the variables $X_2$ and $X_2X_3$.

Table 4. Standard Deviations of Estimators for the Robust Regression Model

                     β̂_1                     β̂_2                     β̂_5
  Method        SD    SD_m (SD_mad)     SD    SD_m (SD_mad)     SD    SD_m (SD_mad)
  SCAD         .167   .171 (.018)      .185   .176 (.022)      .165   .155 (.020)
  LASSO        .158   .165 (.022)      .159   .167 (.020)      .182   .154 (.019)
  Hard         .179   .168 (.018)      .176   .176 (.025)      .157   .154 (.020)
  Best subset  .198   .172 (.023)      .185   .175 (.024)      .199   .152 (.023)
  Oracle       .163   .199 (.040)      .156   .202 (.043)      .166   .177 (.037)
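For the logistic regression settings of Examples 4.3 and 4.4, the unified iteration (3.9) becomes a Newton-like update. The minimal sketch below (Python/NumPy, ours, not the authors' implementation; scad_deriv is as defined in the earlier sketches) starts from the full-model maximum likelihood estimate, as recommended in Section 3.3.

    import numpy as np

    def scad_logistic(X, y, lam, a=3.7, eps=1e-8, max_iter=50):
        """LQA iteration (3.9) for the penalized logistic likelihood (3.3) with SCAD."""
        n, d = X.shape
        beta = np.zeros(d)
        for _ in range(25):                     # unpenalized Newton steps: full-model MLE init
            p = 1.0 / (1.0 + np.exp(-(X @ beta)))
            W = p * (1.0 - p)
            beta += np.linalg.solve(X.T @ (W[:, None] * X) + 1e-8 * np.eye(d), X.T @ (y - p))
        for _ in range(max_iter):
            p = 1.0 / (1.0 + np.exp(-(X @ beta)))
            grad = X.T @ (p - y)                # gradient of the negative log-likelihood
            hess = X.T @ ((p * (1.0 - p))[:, None] * X)
            Sigma = np.diag(scad_deriv(beta, lam, a) / np.maximum(np.abs(beta), eps))
            step = np.linalg.solve(hess + n * Sigma, grad + n * (Sigma @ beta))
            beta_new = beta - step
            beta_new[np.abs(beta_new) < 1e-6] = 0.0   # coefficients shrunk to zero stay there
            if np.max(np.abs(beta_new - beta)) < 1e-6:
                beta = beta_new
                break
            beta = beta_new
        return beta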

Table 6. Standard Deviations of Estimators for the Logistic Regression

                        β̂_1                     β̂_2                     β̂_5
  Method           SD    SD_m (SD_mad)     SD    SD_m (SD_mad)     SD    SD_m (SD_mad)
  SCAD (a = 3.7)  .571   .538 (.107)      .383   .372 (.061)      .432   .398 (.065)
  LASSO           .310   .379 (.037)      .285   .284 (.019)      .244   .287 (.019)
  Hard            .675   .561 (.126)      .428   .400 (.062)      .467   .421 (.079)
  Best subset     .624   .547 (.121)      .398   .383 (.067)      .468   .412 (.077)
  Oracle          .553   .538 (.103)      .374   .373 (.060)      .432   .398 (.064)

Table 7. Estimated Coefficients and Standard Errors for Example 4.4

              MLE            Best Subset (AIC)  Best Subset (BIC)   SCAD            LASSO          Hard
  Intercept    5.51 (.75)      4.81 (.45)          6.12 (.57)         6.09 (.29)      3.70 (.25)     5.88 (.41)
  X1          -8.83 (2.97)    -6.49 (1.75)       -12.15 (1.81)      -12.24 (.08)      0 (—)        -11.32 (1.01)
  X2           2.30 (2.00)     0 (—)               0 (—)              0 (—)           0 (—)          2.21 (1.41)
  X3          -2.77 (3.43)     0 (—)              -6.93 (.79)        -7.00 (.21)      0 (—)         -4.23 (.64)
  X4          -1.74 (1.41)      .30 (.11)          -.29 (.11)         0 (—)          -.28 (.09)     -1.16 (1.04)
  X1^2         -.75 (.61)     -1.04 (.54)           0 (—)             0 (—)         -1.71 (.24)      0 (—)
  X3^2        -2.70 (2.45)    -4.55 (.55)           0 (—)             0 (—)         -2.67 (.22)     -1.92 (.95)
  X1X2          .03 (.34)      0 (—)                0 (—)             0 (—)           0 (—)          0 (—)
  X1X3         7.46 (2.34)     5.69 (1.29)          9.83 (1.63)       9.84 (.14)      .36 (.22)      9.06 (.96)
  X1X4          .24 (.32)      0 (—)                0 (—)             0 (—)           0 (—)          0 (—)
  X2X3        -2.15 (1.61)     0 (—)                0 (—)             0 (—)          -.10 (.10)     -2.13 (1.27)
  X2X4         -.12 (.16)      0 (—)                0 (—)             0 (—)           0 (—)          0 (—)
  X3X4         1.23 (1.21)     0 (—)                0 (—)             0 (—)           0 (—)          .82 (1.01)

5. CONCLUSION

We proposed a variable selection method via penalized likelihood approaches. A family of penalty functions was introduced. Rates of convergence of the proposed penalized likelihood estimators were established. With proper choice of regularization parameters, we have shown that the proposed estimators perform as well as the oracle procedure for variable selection. The methods were shown to be effective and the standard errors were estimated with good accuracy. A unified algorithm was proposed for minimizing the penalized likelihood function, which is usually a sum of convex and concave functions. Our algorithm is backed up by statistical theory and hence gives estimators with good statistical properties. Compared with the best subset method, which is very time consuming, the newly proposed methods are much faster and more effective, and have strong theoretical backup. They select variables simultaneously via optimizing a penalized likelihood, and hence the standard errors of estimated parameters can be estimated accurately. The LASSO proposed by Tibshirani (1996) is a member of this penalized likelihood family with the $L_1$ penalty. It has good performance when the noise to signal ratio is large, but the bias created by this approach is noticeably large. See also the remarks in Example 4.3. The penalized likelihood with the SCAD penalty function gives the best performance in selecting significant variables without creating excessive biases. The approach proposed here can be applied to other statistical contexts without any extra difficulties.

APPENDIX: PROOFS

Before we present the proofs of the theorems, we first state some regularity conditions. Denote by $\Omega$ the parameter space for $\beta$.

Regularity Conditions

(A) The observations $V_i$ are independent and identically distributed with probability density $f(V, \beta)$ with respect to some measure $\mu$. $f(V, \beta)$ has a common support and the model is identifiable. Furthermore, the first and second logarithmic derivatives of $f$ satisfy the equations

$$E_\beta\left[\frac{\partial\log f(V, \beta)}{\partial\beta_j}\right] = 0 \quad \text{for } j = 1, \ldots, d$$

and

$$I_{jk}(\beta) = E_\beta\left[\frac{\partial\log f(V, \beta)}{\partial\beta_j}\,\frac{\partial\log f(V, \beta)}{\partial\beta_k}\right] = E_\beta\left[-\frac{\partial^2\log f(V, \beta)}{\partial\beta_j\,\partial\beta_k}\right].$$

(B) The Fisher information matrix

$$I(\beta) = E\left[\left\{\frac{\partial\log f(V, \beta)}{\partial\beta}\right\}\left\{\frac{\partial\log f(V, \beta)}{\partial\beta}\right\}^T\right]$$

is finite and positive definite at $\beta = \beta_0$.

(C) There exists an open subset $\omega$ of $\Omega$ that contains the true parameter point $\beta_0$ such that for almost all $V$ the density $f(V, \beta)$

admits all third derivatives $\partial^3 f(V, \beta)/(\partial\beta_j\,\partial\beta_k\,\partial\beta_l)$ for all $\beta \in \omega$. Furthermore, there exist functions $M_{jkl}$ such that

$$\left|\frac{\partial^3}{\partial\beta_j\,\partial\beta_k\,\partial\beta_l}\log f(V, \beta)\right| \le M_{jkl}(V) \quad \text{for all } \beta \in \omega,$$

where $m_{jkl} = E_{\beta_0}[M_{jkl}(V)] < \infty$ for all $j, k, l$.

These regularity conditions guarantee asymptotic normality of the ordinary maximum likelihood estimates. See, for example, Lehmann (1983).

Proof of Theorem 1

Let $\alpha_n = n^{-1/2} + a_n$. We want to show that for any given $\epsilon > 0$, there exists a large constant $C$ such that

$$P\Big\{\sup_{\|u\|=C} Q(\beta_0 + \alpha_n u) < Q(\beta_0)\Big\} \ge 1 - \epsilon. \qquad (A.1)$$

This implies with probability at least $1 - \epsilon$ that there exists a local maximum in the ball $\{\beta_0 + \alpha_n u: \|u\| \le C\}$. Hence, there exists a local maximizer such that $\|\hat\beta - \beta_0\| = O_P(\alpha_n)$.

Using $p_{\lambda_n}(0) = 0$, we have

$$D_n(u) \equiv Q(\beta_0 + \alpha_n u) - Q(\beta_0) \le L(\beta_0 + \alpha_n u) - L(\beta_0) - n\sum_{j=1}^{s}\big\{p_{\lambda_n}(|\beta_{j0} + \alpha_n u_j|) - p_{\lambda_n}(|\beta_{j0}|)\big\},$$

where $s$ is the number of components of $\beta_{10}$. Let $L'(\beta_0)$ be the gradient vector of $L$. By the standard argument on the Taylor expansion of the likelihood function, we have

$$D_n(u) \le \alpha_n L'(\beta_0)^T u - \tfrac{1}{2}u^T I(\beta_0)u\, n\alpha_n^2\{1 + o_P(1)\} - \sum_{j=1}^{s}\Big[n\alpha_n p'_{\lambda_n}(|\beta_{j0}|)\,\mathrm{sgn}(\beta_{j0})u_j + n\alpha_n^2 p''_{\lambda_n}(|\beta_{j0}|)u_j^2\{1 + o(1)\}\Big]. \qquad (A.2)$$

Note that $n^{-1/2}L'(\beta_0) = O_P(1)$. Thus, the first term on the right-hand side of (A.2) is on the order $O_P(n^{1/2}\alpha_n) = O_P(n\alpha_n^2)$. By choosing a sufficiently large $C$, the second term dominates the first term uniformly in $\|u\| = C$. Note that the third term in (A.2) is bounded by

$$\sqrt{s}\,n\alpha_n a_n\|u\| + n\alpha_n^2\max\big\{|p''_{\lambda_n}(|\beta_{j0}|)|: \beta_{j0} \ne 0\big\}\|u\|^2.$$

This is also dominated by the second term of (A.2). Hence, by choosing a sufficiently large $C$, (A.1) holds. This completes the proof of the theorem.

Proof of Lemma 1

It is sufficient to show that with probability tending to 1 as $n \to \infty$, for any $\beta_1$ satisfying $\beta_1 - \beta_{10} = O_P(n^{-1/2})$ and for some small $\epsilon_n = Cn^{-1/2}$ and $j = s+1, \ldots, d$,

$$\frac{\partial Q(\beta)}{\partial\beta_j} < 0 \quad \text{for } 0 < \beta_j < \epsilon_n \qquad (A.3)$$

$$\phantom{\frac{\partial Q(\beta)}{\partial\beta_j}} > 0 \quad \text{for } -\epsilon_n < \beta_j < 0. \qquad (A.4)$$

To show (A.3), by Taylor's expansion, we have

$$\frac{\partial Q(\beta)}{\partial\beta_j} = \frac{\partial L(\beta)}{\partial\beta_j} - n p'_{\lambda_n}(|\beta_j|)\,\mathrm{sgn}(\beta_j) = \frac{\partial L(\beta_0)}{\partial\beta_j} + \sum_{l=1}^{d}\frac{\partial^2 L(\beta_0)}{\partial\beta_j\,\partial\beta_l}(\beta_l - \beta_{l0}) + \sum_{l=1}^{d}\sum_{k=1}^{d}\frac{\partial^3 L(\beta^*)}{\partial\beta_j\,\partial\beta_l\,\partial\beta_k}(\beta_l - \beta_{l0})(\beta_k - \beta_{k0}) - n p'_{\lambda_n}(|\beta_j|)\,\mathrm{sgn}(\beta_j),$$

where $\beta^*$ lies between $\beta$ and $\beta_0$. Note that by the standard arguments,

$$n^{-1}\frac{\partial L(\beta_0)}{\partial\beta_j} = O_P(n^{-1/2})$$

and

$$\frac{1}{n}\frac{\partial^2 L(\beta_0)}{\partial\beta_j\,\partial\beta_l} = E\left\{\frac{1}{n}\frac{\partial^2 L(\beta_0)}{\partial\beta_j\,\partial\beta_l}\right\} + o_P(1).$$

By the assumption that $\beta - \beta_0 = O_P(n^{-1/2})$, we have

$$\frac{\partial Q(\beta)}{\partial\beta_j} = n\lambda_n\big\{-\lambda_n^{-1}p'_{\lambda_n}(|\beta_j|)\,\mathrm{sgn}(\beta_j) + O_P(n^{-1/2}/\lambda_n)\big\}.$$

Whereas $\liminf_{n\to\infty}\liminf_{\theta\to 0+}\lambda_n^{-1}p'_{\lambda_n}(\theta) > 0$ and $n^{-1/2}/\lambda_n \to 0$, the sign of the derivative is completely determined by that of $\beta_j$. Hence, (A.3) and (A.4) follow. This completes the proof.

Proof of Theorem 2

It follows by Lemma 1 that part (a) holds. Now we prove part (b). It can be shown easily that there exists a $\hat\beta_1$ in Theorem 1 that is a root-$n$ consistent local maximizer of $Q\{(\beta_1^T, 0^T)^T\}$, which is regarded as a function of $\beta_1$, and that satisfies the likelihood equations

$$\frac{\partial Q(\beta)}{\partial\beta_j}\bigg|_{\beta = (\hat\beta_1^T, 0^T)^T} = 0 \quad \text{for } j = 1, \ldots, s. \qquad (A.5)$$

Note that $\hat\beta_1$ is a consistent estimator,

$$\frac{\partial L(\beta)}{\partial\beta_j}\bigg|_{\beta=(\hat\beta_1^T, 0^T)^T} - n p'_{\lambda_n}(|\hat\beta_j|)\,\mathrm{sgn}(\hat\beta_j) = \frac{\partial L(\beta_0)}{\partial\beta_j} + \sum_{l=1}^{s}\left\{\frac{\partial^2 L(\beta_0)}{\partial\beta_j\,\partial\beta_l} + o_P(1)\right\}(\hat\beta_l - \beta_{l0}) - n\Big[p'_{\lambda_n}(|\beta_{j0}|)\,\mathrm{sgn}(\beta_{j0}) + \big\{p''_{\lambda_n}(|\beta_{j0}|) + o_P(1)\big\}(\hat\beta_j - \beta_{j0})\Big].$$

It follows by Slutsky's theorem and the central limit theorem that

$$\sqrt{n}\,(I_1(\beta_{10}) + \Sigma)\big\{\hat\beta_1 - \beta_{10} + (I_1(\beta_{10}) + \Sigma)^{-1}b\big\} \to N\{0, I_1(\beta_{10})\}$$

in distribution.

[Received March 2000. Revised December 2000.]

REFERENCES

Antoniadis, A. (1997), "Wavelets in Statistics: A Review" (with discussion), Journal of the Italian Statistical Association, 6, 97-144.
Antoniadis, A., and Fan, J. (2001), "Regularization of Wavelets Approximations," Journal of the American Statistical Association, 96, 939-967.
Bickel, P. J. (1975), "One-Step Huber Estimates in Linear Models," Journal of the American Statistical Association, 70, 428-433.
Breiman, L. (1995), "Better Subset Regression Using the Nonnegative Garrote," Technometrics, 37, 373-384.
Breiman, L. (1996), "Heuristics of Instability and Stabilization in Model Selection," The Annals of Statistics, 24, 2350-2383.
Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 377-403.
Donoho, D. L., and Johnstone, I. M. (1994a), "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, 81, 425-455.
Donoho, D. L., and Johnstone, I. M. (1994b), "Minimax Risk Over l_p Balls for l_q Error," Probability Theory and Related Fields, 99, 277-303.
Fan, J. (1997), "Comments on 'Wavelets in Statistics: A Review' by A. Antoniadis," Journal of the Italian Statistical Association, 6, 131-138.
Frank, I. E., and Friedman, J. H. (1993), "A Statistical View of Some Chemometrics Regression Tools," Technometrics, 35, 109-148.
Fu, W. J. (1998), "Penalized Regression: The Bridge Versus the LASSO," Journal of Computational and Graphical Statistics, 7, 397-416.

Gao, H. Y., and Bruce, A. G. (1997), "WaveShrink With Firm Shrinkage," Statistica Sinica, 7, 855-874.
Green, P. J., and Silverman, B. W. (1994), Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, London: Chapman and Hall.
Huber, P. (1981), Robust Estimation, New York: Wiley.
Lehmann, E. L. (1983), Theory of Point Estimation, Pacific Grove, CA: Wadsworth and Brooks/Cole.
Li, R. (2000), "High-Dimensional Modeling via Nonconcave Penalized Likelihood and Local Likelihood," unpublished Ph.D. dissertation, University of North Carolina at Chapel Hill, Dept. of Statistics.
Marron, J. S., Adak, S., Johnstone, I. M., Neumann, M. H., and Patil, P. (1998), "Exact Risk Analysis of Wavelet Regression," Journal of Computational and Graphical Statistics, 7, 278-309.
McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), London: Chapman and Hall.
Robinson, P. M. (1988), "The Stochastic Difference Between Econometrics and Statistics," Econometrica, 56, 531-547.
Stone, C. J., Hansen, M., Kooperberg, C., and Truong, Y. K. (1997), "Polynomial Splines and Their Tensor Products in Extended Linear Modeling" (with discussion), The Annals of Statistics, 25, 1371-1470.
Tibshirani, R. J. (1996), "Regression Shrinkage and Selection via the LASSO," Journal of the Royal Statistical Society, Ser. B, 58, 267-288.
Tibshirani, R. J. (1997), "The LASSO Method for Variable Selection in the Cox Model," Statistics in Medicine, 16, 385-395.
Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.
