Prediction error estimation: a comparison of resampling methods
Annette M. Molinaro1,3,∗, Richard Simon2 and Ruth M. Pfeiffer1

1 Biostatistics Branch, Division of Cancer Epidemiology and Genetics and 2 Biometric Research Branch, Division of Cancer Treatment and Diagnostics, NCI, NIH, Rockville, MD 20852, USA and 3 Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520, USA

∗ To whom correspondence should be addressed.

Received on April 6, 2005; revised on April 28, 2005; accepted on May 12, 2005
Advance Access publication May 19, 2005
ABSTRACT

Motivation: In genomic studies, thousands of features are collected on relatively few samples. One of the goals of these studies is to build classifiers to predict the outcome of future observations. There are three inherent steps to this process: feature selection, model selection and prediction assessment. With a focus on prediction assessment, we compare several methods for estimating the 'true' prediction error of a prediction model in the presence of feature selection.

Results: For small studies where features are selected from thousands of candidates, the resubstitution and simple split-sample estimates are seriously biased. In these small samples, leave-one-out cross-validation (LOOCV), 10-fold cross-validation (CV) and the .632+ bootstrap have the smallest bias for diagonal discriminant analysis, nearest neighbor and classification trees. LOOCV and 10-fold CV have the smallest bias for linear discriminant analysis. Additionally, LOOCV, 5- and 10-fold CV, and the .632+ bootstrap have the lowest mean square error. The .632+ bootstrap is quite biased in small sample sizes with strong signal-to-noise ratios. Differences in performance among resampling methods are reduced as the number of specimens available increases.

Contact: annette.molinaro@yale.edu

Supplementary Information: A complete compilation of results and R code for simulations and analyses is available in Molinaro et al. (2005) (http://linus.nci.nih.gov/brb/TechReport.htm).

1 INTRODUCTION

In genomic experiments one frequently encounters high-dimensional data and small sample sizes. Microarrays simultaneously monitor expression levels for several thousands of genes. Proteomic profiling studies using SELDI-TOF (surface-enhanced laser desorption and ionization time-of-flight) measure size and charge of proteins and protein fragments by mass spectroscopy, and result in up to 15 000 intensity levels at prespecified mass values for each spectrum. Sample sizes in such experiments are typically <100.

In many studies, observations are known to belong to predetermined classes and the task is to build predictors or classifiers for new observations whose class is unknown. Deciding which genes or proteomic measurements to include in the prediction is called feature selection and is a crucial step in developing a class predictor. Including too many noisy variables reduces accuracy of the prediction and may lead to over-fitting of data, resulting in promising but often non-reproducible results (Ransohoff, 2004).

Another difficulty is model selection, given the numerous classification models available. An important step in reporting results is assessing the chosen model's error rate, or generalizability. In the absence of independent validation data, a common approach to estimating predictive accuracy is based on some form of resampling the original data, e.g. cross-validation. These techniques divide the data into a learning set and a test set, and range in complexity from the popular learning-test split to v-fold cross-validation, Monte Carlo v-fold cross-validation and bootstrap resampling. Few comparisons of standard resampling methods have been performed to date, and all of them exhibit limitations that make their conclusions inapplicable to most genomic settings. Early comparisons of resampling techniques in the literature focused on model selection as opposed to prediction error estimation (Breiman and Spector, 1992; Burman, 1989). In two recent assessments of resampling techniques for error estimation (Braga-Neto and Dougherty, 2004; Efron, 2004), feature selection was not included as part of the resampling procedures, causing the conclusions to be inappropriate for the high-dimensional setting.

We have performed an extensive comparison of resampling methods to estimate prediction error using simulated (large signal-to-noise ratio), microarray (intermediate signal-to-noise ratio) and proteomic data (low signal-to-noise ratio), encompassing increasing sample sizes with large numbers of features. The impact of feature selection on the performance of various cross-validation methods is highlighted. The results elucidate the 'best' resampling techniques for future research involving high-dimensional data, to avoid overly optimistic assessment of the performance of a model.

2 METHODS

In the prediction problem, one observes n independent and identically distributed (i.i.d.) random variables O1, ..., On with unknown distribution P. Each observation in O consists of an outcome Y with range 𝒴 and an l-vector of measured covariates, or features, X with range 𝒳, such that Oi = (Xi, Yi), i = 1, ..., n. In microarray experiments X includes gene expression measurements, while in proteomic data it includes the intensities at the mass over charge (m/z) values. X may also contain covariates such as a patient's age and/or histopathologic measurements. The outcome Y may be a continuous measure, such as months to disease, or a categorical measure, such as disease status.
The goal in class prediction is to build a rule implementing the information from X in order to predict Y. The intention is that by building this rule, based on the observations O1, ..., On, a future unobserved outcome Y0 can be predicted based on its corresponding measured features X0. If the outcome is continuous, then the rule, or predictor, ψ is defined as a mapping from the feature space 𝒳 onto the real line, i.e. ψ : 𝒳 → R. Consequently, ŷ = ψ(x) denotes the predicted outcome based on the observed X. Such predictors can be built via regression (linear and non-linear) or recursive binary partitioning such as classification and regression trees (CART) (Breiman et al., 1984).

If the outcome Y is categorical, it assumes one of K values. In this case, the rule ψ partitions the feature space 𝒳 into K disjoint and exhaustive groups Gk, where k = 1, ..., K, such that ŷ = k if x ∈ Gk. Standard statistical analyses include linear discriminant analysis (LDA) and diagonal discriminant classifiers (DDA), nearest neighbors (NN) and CART, as well as aggregate classifiers. Thorough discussions of the prediction problem and available algorithms can be found in Breiman et al. (1984), McLachlan (1992), Ripley (1996) and Hastie et al. (2003).

The rule ψ can be written as ψ(·|Pn), where Pn denotes the empirical distribution of O and reflects the dependence of the built rule on the observed data. Loss functions may be employed to quantify the performance of a given rule. A common loss function for a continuous outcome Y is the squared error loss, L(Y, ψ) = [Y − ψ(X)]². With a categorical outcome Y, a popular choice is the indicator loss function, L(Y, ψ) = I[Y ≠ ψ(X)]. A loss function could also incorporate differential misclassification costs (Breiman et al., 1984). For either type of outcome, the expected loss, or risk, is defined as

  \tilde{\theta} = R(\psi, P) = E_P[L(Y, \psi)] = \int L(y, \psi(x)) \, dP(x, y).   (1)

The rule in Equation (1) is constructed and evaluated upon the distribution P; as such, θ̃ is referred to as the asymptotic risk. However, in reality P is unknown; thus, the rule based upon the observations O1, ..., On has an expected loss, or conditional risk (also known as the generalization error), defined as

  \tilde{\theta}_n = R(\psi(\cdot \mid P_n), P) = \int L(y, \psi(x \mid P_n)) \, dP(x, y).   (2)

There are two impetuses for evaluating the conditional risk: model selection and performance assessment. In model selection, the goal is to find the model which minimizes the conditional risk over a collection of potential models. In performance assessment, the goal is to estimate the generalization error for a given model, i.e. to assess how well it predicts the outcome of an observation not included in O.

In an ideal setting an independent dataset would be available for the purposes of model selection and estimating the generalization error. Typically, however, one must use the observed sample O for model building, selection and performance assessment. The simplest method for estimating the conditional risk is the resubstitution or apparent error:

  \hat{\theta}_n^{RS} = R(\psi(\cdot \mid P_n), P_n) = \int L(y, \psi(x \mid P_n)) \, dP_n(x, y).   (3)

Here each of the n observations is used for constructing, selecting and, subsequently, evaluating the prediction error of ψ. Consequently, the resubstitution risk estimate tends to underestimate the generalization error (Efron, 1983; McLachlan, 1992). To alleviate this biased estimation, resampling methods such as cross-validation or bootstrapping can be employed. In the next section, we describe these techniques and their implications in the framework of prediction error.
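To make Equation (3) concrete, the sketch below computes the resubstitution error in R for a binary outcome under the indicator loss, using LDA (one of the classifiers of Section 2.2) as the rule; the function name and interface are ours, not from the paper.

    # Resubstitution (apparent) error: the rule psi(.|Pn) is built and
    # evaluated on the same n observations, so the estimate is optimistic.
    library(MASS)
    resubstitution_error <- function(x, y) {
      fit  <- lda(x, grouping = factor(y))   # build the rule on Pn
      yhat <- predict(fit, x)$class          # predict the SAME observations
      mean(yhat != factor(y))                # empirical indicator loss
    }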
2.1 Resampling methods

In the absence of a large, independent test set, there are numerous techniques for assessing prediction error by implementing some form of partitioning or resampling of the original observed data O. Each of these techniques involves dividing the data into a learning set and a test set. For purposes of model selection, the learning set may further be divided into a training set and a validation set. We will focus solely on the partitioning of the data into learning and test sets for the express purpose of estimating the generalization error.

To enhance a general discussion of resampling methods, we define a binary random n-vector, Sn ∈ {0, 1}^n, which splits the observations into the desired subsets (Molinaro et al., 2004). A realization of Sn = (Sn,1, ..., Sn,n) prescribes a particular split of the entire dataset of n observations into a learning set, {i ∈ {1, ..., n} : Sn,i = 0}, and a test set, {i ∈ {1, ..., n} : Sn,i = 1}. Let p be the proportion of observations in the test set. The empirical distributions of the learning and test sets are denoted by P^0_{n,Sn} and P^1_{n,Sn}, respectively. Importantly, Sn is independent of the empirical distribution of the complete dataset of n observations Pn, and the particular distribution of Sn defines the type of resampling method. Given Sn, the performance of any given estimator ψ(·|Pn) can be assessed via the resampling conditional risk estimate

  \hat{\theta}_{n(1-p)} = E_{S_n} \int L\big(o, \psi(\cdot \mid P^{0}_{n,S_n})\big) \, dP^{1}_{n,S_n}(o),   (4)

where Sn refers to binary split vectors for the entire dataset of n observations and p = \sum_i S_{n,i}/n is the proportion of the n observations in the test set.

There are several considerations when selecting a resampling method. The first is the sample size n. For v-fold cross-validation and the bootstrap, Dudoit and van der Laan (2003) (http://www.bepress.com/ucbbiostat/paper126) have shown that as n → ∞ (and consequently np → ∞) asymptotic optimality is achieved. However, no such results exist for finite samples. Other considerations are the proportion p of the observations allotted to the test set and the number of times the estimate is calculated. We address these considerations in the following sections and refer the reader to more detailed discussions in McLachlan (1992) and Davison and Hinkley (1997).
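In code, a realization of Sn is simply a 0/1 vector. The R sketch below (our own naming and interface, not the paper's) evaluates Equation (4) for an arbitrary collection of split vectors; each concrete resampling method then reduces to a particular choice of the rows of S.

    # Resampling conditional risk: build the rule on {Sn,i = 0}, evaluate on
    # {Sn,i = 1}, and average over the realizations of Sn (the rows of S).
    resampling_risk <- function(x, y, S, fit_predict) {
      errs <- apply(S, 1, function(s) {
        learn <- which(s == 0)
        test  <- which(s == 1)
        yhat  <- fit_predict(x[learn, , drop = FALSE], y[learn],
                             x[test,  , drop = FALSE])
        mean(yhat != y[test])                # test-set indicator loss
      })
      mean(errs)
    }

Here fit_predict(x_learn, y_learn, x_test) is any rule that returns predicted classes for the test set; wrappers for the four algorithms of Section 2.2 are sketched below.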
2.1.1 Split sample This popular resampling method, also known as the learning-test split or holdout method (McLachlan, 1992), entails a single partition of the data into a learning set and a test set based on a predetermined p. For example, p = 1/3 allots two-thirds of the data to the learning set and one-third to the test set. The distribution of Sn places mass 1/2 on two binary vectors which assign the n observations to the learning and test sets. The advantage of this method is the ease of computation. Also, since the classifier is developed only once, a completely specified algorithm for classifier development need not be available; the development can be more informal and subjective. There are two potential sources of bias inherent in this method: bias introduced by each individual observation contributing only to the learning or test set; and bias due to a small learning set, as both the features and the classifier selected depend solely on the learning set. Because the learning set is smaller than the full dataset, the test-set error for a model built on the training set will tend to over-estimate the unknown generalization error for a model built on the full dataset.
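In terms of the helper above, the split-sample estimate corresponds to a single realization of Sn; for instance, with p = 1/3 (a sketch, with our own naming):

    # Split sample (learning-test split): one random split, test proportion p.
    split_sample_risk <- function(x, y, fit_predict, p = 1/3) {
      n <- nrow(x)
      s <- integer(n)                          # 0 = learning set
      s[sample(n, size = round(n * p))] <- 1   # 1 = test set
      resampling_risk(x, y, matrix(s, nrow = 1), fit_predict)
    }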
2.1.2 v-fold cross-validation This method randomly assigns the n observations to one of v partitions such that the partitions are of near-equal size. Subsequently, the learning set contains all but one of the partitions, which is labeled the test set. The generalization error is assessed for each of the v test sets and then averaged over v. In this method, the distribution of Sn puts mass 1/v on the v binary vectors which assign each of the n observations to one of the v partitions. The proportion p is approximately equal to 1/v. Both p and the number of averages can adversely or positively affect this estimate of error. For example, a larger v (e.g. v = 10) results in a smaller proportion p in the test set; thus, a higher proportion in the learning set decreases the bias. In addition, the number of averages is equivalent to v and thus may additionally decrease the bias.

2.1.3 Leave-one-out cross-validation (LOOCV) This is the most extreme case of v-fold cross-validation. In this method each observation is individually assigned to the test set, i.e. v = n and p = 1/n (Lachenbruch and Mickey, 1968; Geisser, 1975; Stone, 1974, 1977). The distribution of Sn places mass 1/n on the n binary vectors which assign each of the n observations to the learning and test sets. LOOCV and the corresponding p = 1/n represent the best example of a bias-variance trade-off: it tends toward a small bias with elevated variance. In model selection, LOOCV has performed poorly compared with v-fold cross-validation (Breiman and Spector, 1992). Due to the computational burden, LOOCV has not been a favored method for large samples, and its behavior in estimating generalization error has not been thoroughly studied.
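Both v-fold CV and LOOCV fit the same split-vector template: the v rows of S indicate the folds, and LOOCV is the special case v = n. A minimal R sketch (our naming):

    # v-fold cross-validation: each fold serves once as the test set.
    vfold_cv_risk <- function(x, y, fit_predict, v = 10) {
      n    <- nrow(x)
      fold <- sample(rep(seq_len(v), length.out = n))   # near-equal partitions
      S    <- t(sapply(seq_len(v), function(k) as.integer(fold == k)))
      resampling_risk(x, y, S, fit_predict)
    }
    # LOOCV is vfold_cv_risk(x, y, fit_predict, v = nrow(x)), i.e. p = 1/n.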
2.1.4 Monte Carlo cross-validation (MCCV) MCCV randomly splits the sample into a learning and test set numerous times (e.g. 20, 50 or 1000 iterations). For each split, np = n(1/v) of the observations are labeled as the test set and n(1 − p) = n(1 − 1/v) as the learning set. For example, in MCCV with v = 10, each of 50 iterations allots 10% of the data to the test set and 90% to the learning set. The generalization error is assessed for each of the 50 test sets and subsequently averaged over the 50 iterations. The distribution of Sn puts mass $1/\binom{n}{np}$ on each of the binary vectors representing one split into a learning and test set. As the number of iterations increases, the computational burden of MCCV becomes quite large. However, unless the number of random splits approaches infinity, the chance that each observation is included in a learning set and a test set (over all iterations) is small, introducing a bias similar to that of the split-sample approach (i.e. when each observation is either in the learning set or the test set).
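MCCV replaces the v disjoint folds by B independent random splits with a fixed test proportion p; for example, B = 50 and p = 0.1 reproduce the 'v = 10' setting described above. A sketch (our naming):

    # Monte Carlo cross-validation: B independent splits, test proportion p.
    mccv_risk <- function(x, y, fit_predict, p = 0.1, B = 50) {
      n <- nrow(x)
      S <- t(replicate(B, {
        s <- integer(n)
        s[sample(n, size = round(n * p))] <- 1
        s
      }))
      resampling_risk(x, y, S, fit_predict)
    }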
2.1.5 .632+ bootstrap Several variations of the bootstrap have been introduced to estimate the generalization error. The leave-one-out bootstrap (θ̂nBS) is based on a random sample drawn with replacement from the n observations (Efron, 1983; Efron and Tibshirani, 1993). For each draw, the observations left out (≈.368n) serve as the test set. The learning set has ≈.632n unique observations, which leads to an overestimation of the prediction error (i.e. a decrease in the learning set size leads to an increase in the bias). To correct for this, two estimators have been suggested: the .632 bootstrap and the .632+ estimator. Both correct by adding in the underestimated resubstitution error θ̂nRS, forming the weighted combination ωθ̂nBS + (1 − ω)θ̂nRS. For the .632 bootstrap the weight ω is constant (ω = .632), whereas for the .632+ bootstrap ω is determined based on the 'no-information error rate' (Efron and Tibshirani, 1997). We focus on the latter, as it is the most used in the literature and the most robust across different algorithms (Efron and Tibshirani, 1997).
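The sketch below assembles the leave-one-out bootstrap and the .632+ combination for a binary outcome. It follows our reading of Efron and Tibshirani (1997): γ̂ is the no-information error rate and the weight is ω = .632/(1 − .368·R̂); the truncations that the original reference applies to θ̂nBS and R̂ are omitted for brevity, and the naming is ours.

    # Leave-one-out bootstrap plus the .632+ correction; assumes y is coded
    # 0/1 and that fit_predict returns labels on the same coding.
    boot632plus_risk <- function(x, y, fit_predict, B = 50) {
      n <- nrow(x)
      errs <- replicate(B, {
        learn <- sample(n, replace = TRUE)            # bootstrap learning set
        test  <- setdiff(seq_len(n), unique(learn))   # ~ .368n left-out obs
        yhat  <- fit_predict(x[learn, , drop = FALSE], y[learn],
                             x[test,  , drop = FALSE])
        mean(yhat != y[test])
      })
      theta_bs <- mean(errs)                          # leave-one-out bootstrap
      yhat_rs  <- fit_predict(x, y, x)
      theta_rs <- mean(yhat_rs != y)                  # resubstitution error
      p1 <- mean(y == 1)                              # observed class-1 rate
      q1 <- mean(yhat_rs == 1)                        # predicted class-1 rate
      gamma <- p1 * (1 - q1) + (1 - p1) * q1          # no-information error rate
      Rhat  <- (theta_bs - theta_rs) / (gamma - theta_rs)
      w     <- 0.632 / (1 - 0.368 * Rhat)             # w = .632 gives the .632 estimator
      w * theta_bs + (1 - w) * theta_rs
    }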
2.2 Algorithms

Predictions of outcomes based on the observed X can employ parametric or non-parametric algorithms. If the outcome is continuous, predictors can be built using regression models or recursive binary partitioning like CART. If the outcome is categorical, algorithms which partition the feature space 𝒳 into disjoint and exhaustive groups are used. In this manuscript, we limit our discussion to the classification of binary outcomes, i.e. Y = 0 or Y = 1, and thus evaluate methods for the estimation of prediction error in the context of the following classification algorithms.

We calculate the LDA with the lda function in the MASS library of the statistical package R (Venables and Ripley, 1994; Ihaka and Gentleman, 1996). We use the function dlda in the library supclust in R to implement DDA (Dettling and Maechler, 2005, http://lib.stat.cmu.edu/R/CRAN/src/contrib/Descriptions/supclust.html). The library supclust also houses the function nnr for NN. CART classification is obtained using the function rpart in the library rpart in R (Breiman et al., 1984; Therneau and Atkinson, 1997).
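Wrapped in the common fit_predict interface used above, the four classifiers look roughly as follows. The package and function names are the ones cited in this section; the argument conventions for dlda and nnr (here assumed to be (xlearn, ylearn, xtest), returning test-set class labels) may differ across supclust versions.

    library(MASS)       # lda
    library(supclust)   # dlda and nnr (Dettling and Maechler, 2005)
    library(rpart)      # classification trees (CART)

    fit_lda  <- function(xl, yl, xt) predict(lda(xl, grouping = factor(yl)), xt)$class
    fit_dlda <- function(xl, yl, xt) dlda(xl, yl, xt)   # assumed interface
    fit_nn   <- function(xl, yl, xt) nnr(xl, yl, xt)    # assumed interface
    fit_cart <- function(xl, yl, xt) {
      learn <- data.frame(y = factor(yl), xl)           # xl, xt need matching columns
      fit   <- rpart(y ~ ., data = learn, method = "class")
      predict(fit, newdata = data.frame(xt), type = "class")
    }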
3 ANALYSIS

The goal of this analysis is to ascertain differences between resampling methods in the estimation of generalization error (presently limited to the classification problem) in the presence of feature selection. We evaluate the influence of sample size, parametric versus non-parametric classification methods, and large feature spaces on each resampling method's ability to estimate the resampling conditional risk θ̂n(1−p) [Equation (4)] compared with that of the 'true' conditional risk θ̃n [Equation (2)]. As such, a range of sample sizes (n = 40, 80 and 120), classification algorithms (LDA, DDA, NN and CART), and datasets (simulated, microarray and proteomic; see Sections 3.1–3.3) are utilized. Prior to discussing results, the general strategy for estimating the risks is explained, followed by the specifics of each dataset.

Each dataset consists of N observations with N1 cases and N0 controls and l measured features. For r = 1, ..., R repetitions, a random sample of size n stratified by case/control status is selected from the N observations, such that the number of cases in the subsample (n/2) equals the number of controls. The stratification allows for equal representation of both cases and controls, such that classification algorithms relying on majority consensus are not biased toward either (Quackenbush, 2004). This random sample, or subsample, plays two roles. First, it serves as a sample from which the resampling conditional risk θ̂n(1−p) can be estimated. This is accomplished by splitting the subsample into a learning and test set corresponding to each of the resampling methods. For each r, an estimate of θ̂n(1−p) is obtained for each resampling method with all four algorithms. In reality, the distribution P of the observed data O is unknown, and thus so is the 'true' conditional risk. In order to estimate θ̃n in Equation (2) we use the complete observed data. As such, the subsample's second role is to serve as the learning set, with the remaining N − n observations as the test set, for an approximation of the conditional risk θ̃n.

Given the high-dimensional structure of each dataset (i.e. large l), feature selection is an important task administered before running any of the algorithms. Feature selection must occur based on the learning set within each resampling; otherwise, additional bias is introduced (Simon et al., 2003). This correct approach to feature selection within cross-validation has been referred to as honest or complete (Quackenbush, 2004). There are many methods available for feature selection; here t-tests are used. Initially, the components of X with the 10 largest absolute t-test statistics are considered. Subsequently, the largest 20 are discussed.

All simulations and analyses were implemented in R (Ihaka and Gentleman, 1996).
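The honest approach described above amounts to wrapping feature selection around any fit_predict rule, so that the t-tests see only the learning set of the current resample. A sketch (our naming):

    # Select the m features with largest absolute two-sample t-statistics,
    # computed on the learning set only, then fit on the reduced matrices.
    select_top_t <- function(xl, yl, m = 10) {
      tstat <- apply(xl, 2, function(g)
        t.test(g[yl == 1], g[yl == 0])$statistic)
      order(abs(tstat), decreasing = TRUE)[seq_len(m)]
    }

    with_selection <- function(fit_predict, m = 10) {
      function(xl, yl, xt) {
        keep <- select_top_t(xl, yl, m)       # selection inside the resample
        fit_predict(xl[, keep, drop = FALSE], yl, xt[, keep, drop = FALSE])
      }
    }

    # e.g. honest 10-fold CV with DDA on the top 10 genes:
    # vfold_cv_risk(x, y, with_selection(fit_dlda, m = 10), v = 10)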
3.1 Simulated data

The simulated datasets are generated as described in Bura and Pfeiffer (2003). Each dataset contains N = 300 observations with 750 covariates, representing patients and genes, respectively. Half of the observations (i.e. 150) are labeled controls (Y = 0) and half cases (Y = 1). Of the 750 genes, 8 are associated with disease and the others are non-predictive. The controls are simulated from a multivariate normal distribution with a mean of 0 and covariance matrix Σ. The cases have 99% non-differentially expressed genes, which are generated from the same N(0, Σ) as the controls. The 1% of the genes that are differentially expressed are generated from a mixture of two multivariate normals with means µ1 and µ2 and covariance structure Σ; the mixing probability is 0.5. The covariance matrix Σ = (σij) has a block structure with σij = 0.2 for |j − i| ≤ 5 and zero otherwise. Estimates of θ̂n(1−p) and θ̃n are based on learning samples of size 40, 80 and 120 and test sets of size 260, 220 and 180, respectively.
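A sketch of this generating mechanism in R. The section fixes Σ and the mixing probability but not µ1, µ2 or the diagonal of Σ, so the unit variances and the effect sizes below are placeholders rather than values from the paper.

    library(MASS)   # mvrnorm
    l <- 750; n_ctrl <- 150; n_case <- 150; de <- 1:8   # 8 disease genes
    Sigma <- outer(1:l, 1:l, function(i, j) ifelse(abs(i - j) <= 5, 0.2, 0))
    diag(Sigma) <- 1                         # assumed unit variances
    x_ctrl <- mvrnorm(n_ctrl, mu = rep(0, l), Sigma = Sigma)
    x_case <- mvrnorm(n_case, mu = rep(0, l), Sigma = Sigma)
    mu1 <- rep( 1, length(de))               # placeholder mean vectors
    mu2 <- rep(-1, length(de))
    from1 <- runif(n_case) < 0.5             # mixing probability 0.5
    x_case[, de] <- x_case[, de] +           # shift only the 8 disease genes
      t(vapply(from1, function(z) if (z) mu1 else mu2, numeric(length(de))))
    x <- rbind(x_ctrl, x_case)
    y <- rep(c(0, 1), c(n_ctrl, n_case))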
3.2 Lymphoma and lung datasets

The microarray datasets are both publicly available. The first focuses on diffuse large-B-cell lymphoma (Rosenwald et al., 2002). In this study there are 7399 genes on the microarray and 240 patients. For the purposes of this analysis, the outcome variable represents the lymphoma subtype: activated B-cell for Y = 0 and germinal-center B-cell for Y = 1. This is an example of a moderate signal-to-noise ratio dataset, as the subgroups do not separate perfectly based on the microarray observations (Wright et al., 2003). Estimates of θ̂n(1−p) and θ̃n are based on learning samples of size 40, 80 and 120 and test sets of size 200, 160 and 120, respectively. The second study uses oligonucleotide microarrays to measure 12 601 transcript sequences for 186 lung tumor samples (Bhattacharjee et al., 2001). For our analysis, the outcome represents the 139 adenocarcinomas as Y = 0 and the remaining 47 tumors as Y = 1.

3.3 Proteomic ovarian dataset

The proteomic dataset consists of 164 SELDI-TOF measurements from NCI/Mayo Clinic serum samples. These data are part of a study designed to validate previously identified proteomic markers for ovarian cancer. The readings are from fraction 4, IMAC30 ProteinChip arrays, read at high and low energy settings in a PCS4000 ProteinChip Reader (Ciphergen Biosystems, Inc., Fremont, CA). The spectra were externally calibrated for mass, internally normalized for intensity using total ion current, and baseline subtracted. Peaks were manually selected and the intensity recorded.

Of the n = 164 observations, 45 are ovarian cancer cases and 119 are controls. Estimates of θ̂n(1−p) and θ̃n are based on learning samples of size 40 and 80 and test sets of size 144 and 104, respectively. Given the nature of proteomic data as well as the naive algorithms implemented, this will serve as a low signal-to-noise example.

3.4 Results

To compare the resampling methods in Section 2.1, conditional risk estimates for each method are calculated and compared with each other and with the truth (i.e. the conditional risk). This evaluation is based on the mean squared error (MSE) and bias, calculated as follows:

  \mathrm{MSE} = \frac{1}{R} \sum_{r=1}^{R} (\hat{\theta}_{n,r} - \tilde{\theta}_{n,r})^2,

  \mathrm{Bias} = \frac{1}{R} \sum_{r=1}^{R} (\hat{\theta}_{n,r} - \tilde{\theta}_{n,r}),

where θ̂n,r is the resampling conditional risk and θ̃n,r is the conditional risk for the r-th repetition. In all results the total number of repetitions is set at 100, i.e. R = 100.
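Given, for one resampling method and algorithm, the vector of R = 100 resampling estimates theta_hat and the vector of approximated conditional risks theta_tilde (names ours), the two criteria are:

    bias <- mean(theta_hat - theta_tilde)        # average signed deviation
    mse  <- mean((theta_hat - theta_tilde)^2)    # average squared deviation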
There were several attempts to examine the effect of varying p on those resampling methods which allow user-defined test-set proportions (i.e. v-fold cross-validation, MCCV and split sample). For v-fold cross-validation, 2-, 5- and 10-fold were explored. In MCCV, both p and the number of MCCV repetitions affect the estimation; thus, test-set proportions of p = 0.5, p = 0.2 and p = 0.1, as well as repetitions of 20, 50 and 1000, were run. In split-sample estimation, test-set proportions of both p = 1/3 and p = 1/2 were examined to assess the bias/variance trade-off.

Due to space limitations, all results are discussed but only a limited number of tables can be displayed. The interested reader is referred to Molinaro et al. (2005) (http://linus.nci.nih.gov/brb/TechReport.htm) for a comprehensive compilation of results. The MCCV results are not included below, as the only noticeable improvement over v-fold CV is a slight decrease in variance. Additionally, the advantage of increasing the MCCV iterations from 20 to 50 to 1000 is minimal.

Simulation study results. For n = 40, LOOCV and 10-fold CV have the smallest MSE and bias, followed by 5-fold CV and then .632+ (Table 1). The largest MSE and bias occur with 2-fold CV and split sample with p = 1/2. For n = 80 and n = 120, the differences among these methods diminish. For n = 40 and n = 80, .632+ has the smallest SD, followed by 10-fold CV, LOOCV and 5-fold CV. The only exception is for LDA and NN at n = 80, when LOOCV and 10-fold CV have the smallest. At n = 120, the differences among these methods diminish.

Table 1. Prediction error estimates

  Estimator    p       Algorithm   Estimate   SD      Bias    MSE
  θ̃n           0.87    LDA         0.078      0.093
                       DDA         0.160      0.086
                       NN          0.042      0.084
                       CART        0.121      0.133
  v-fold CV    0.5     LDA         0.357      0.126   0.279   0.097
                       DDA         0.342      0.106   0.182   0.052
                       NN          0.277      0.135   0.235   0.077
                       CART        0.430      0.121   0.309   0.134
               0.2     LDA         0.161      0.127   0.083   0.017
                       DDA         0.208      0.086   0.048   0.012
                       NN          0.108      0.102   0.066   0.011
                       CART        0.284      0.117   0.163   0.055
               0.1     LDA         0.118      0.120   0.040   0.008
                       DDA         0.177      0.087   0.017   0.007
                       NN          0.078      0.102   0.036   0.005
                       CART        0.189      0.104   0.068   0.024
  LOOCV        0.025   LDA         0.092      0.115   0.014   0.008
                       DDA         0.164      0.096   0.004   0.007
                       NN          0.058      0.103   0.016   0.005
                       CART        0.146      0.125   0.025   0.018
  Split        0.333   LDA         0.205      0.184   0.127   0.053
                       DDA         0.243      0.138   0.083   0.034
                       NN          0.145      0.169   0.103   0.044
                       CART        0.371      0.174   0.25    0.121
               0.5     LDA         0.348      0.185   0.270   0.113
                       DDA         0.344      0.139   0.184   0.062
                       NN          0.265      0.177   0.223   0.086
                       CART        0.438      0.155   0.317   0.147
  .632+ (50)   ≈.368   LDA         0.274      0.084   0.196   0.047
                       DDA         0.286      0.074   0.126   0.028
                       NN          0.200      0.070   0.158   0.032
                       CART        0.387      0.080   0.266   0.100

The estimate θ̂n (column 4) and SD (column 5) are based on a learning sample of size 40; the estimate θ̃n (rows 1–4) and its SD are based on the remaining 260 observations. Bias (column 6) and MSE (column 7) are reported for each resampling technique (column 1) and algorithm (column 3). The ten features with the largest t-statistics were used in the algorithms. Minimums in bold.

Lymphoma and lung study results. In the lymphoma study, for n = 40, 80 and 120, .632+, LOOCV, 5- and 10-fold CV have the smallest MSE and bias. The two split samples and 2-fold CV have the largest MSE and bias. Similar to the simulation study, .632+ has the smallest SD across the algorithms and sample sizes, while both split samples do by far the worst. Partial results are shown in Table 2. The results from the lung study are very similar and are thoroughly discussed in Molinaro et al. (2005) (http://linus.nci.nih.gov/brb/TechReport.htm).

Ovarian study results. For n = 40 to n = 80, LOOCV and .632+ have the smallest MSE, followed by 5- and 10-fold CV. As for bias,
Table 2. Comparison of the resampling methods' MSE, bias and SD

              n = 40                   n = 80                   n = 120
              SD     Bias    MSE       SD     Bias    MSE       SD     Bias    MSE
  2-fold CV   0.085  0.038   0.01      0.043  0.002   0.004     0.031  0.0     0.003
  5-fold CV   0.07   0.004   0.007     0.045  −0.008  0.005     0.032  −0.006  0.003
  10-fold CV  0.063  −0.007  0.006     0.036  −0.009  0.003     0.031  −0.006  0.003
  LOOCV       0.072  −0.019  0.008     0.04   −0.013  0.004     0.033  −0.004  0.003
  Split 1/3   0.119  0.001   0.017     0.071  0.0     0.007     0.059  −0.004  0.005
  Split 1/2   0.117  0.037   0.018     0.058  0.001   0.005     0.046  −0.001  0.004
  .632+       0.049  −0.006  0.004     0.025  −0.02   0.003     0.018  −0.015  0.002

Results shown are for the DDA algorithm using the top 10 genes as ranked by t-tests.

Table 3. Ovarian study results

Table 4. Prediction error estimates without feature selection
Table 5. Prediction error estimates: leave-one-out bootstrap versus 3-fold MCCV

                    Leave-one-out bootstrap           3-fold MCCV
                    Estimate  SD     Bias   MSE       Estimate  SD     Bias   MSE
  n = 40    LDA     0.331     0.075  0.252  0.072     0.242     0.101  0.164  0.035
            DDA     0.337     0.075  0.177  0.044     0.270     0.072  0.110  0.022
            NN      0.259     0.072  0.217  0.055     0.167     0.083  0.125  0.022
            CART    0.414     0.065  0.296  0.114     0.377     0.085  0.256  0.094
  n = 80    LDA     0.07      0.063  0.043  0.004     0.044     0.053  0.017  0.002
            DDA     0.146     0.058  0.074  0.008     0.104     0.058  0.033  0.003
            NN      0.046     0.056  0.036  0.003     0.022     0.043  0.012  0.001
            CART    0.098     0.047  0.057  0.006     0.062     0.039  0.020  0.002
  n = 120   LDA     0.032     0.033  0.011  0.001     0.026     0.026  0.005  0
            DDA     0.088     0.045  0.036  0.002     0.068     0.043  0.016  0.001
            NN      0.016     0.030  0.007  0         0.012     0.023  0.003  0
            CART    0.048     0.025  0.022  0.001     0.038     0.022  0.012  0.001

The leave-one-out bootstrap and 3-fold MCCV estimate, SD, bias and MSE, over three sample sizes and four algorithms. Feature selection was used to select the top 10 features as ranked by t-tests.
Resampling with and without replacement. To understand the ramification of resampling with replacement as it pertains to the bootstrap estimates, we compared the leave-one-out bootstrap estimate (Section 2.1.5) with 3-fold MCCV. The 3-fold MCCV randomly selects .666n of the observations for the learning set and .333n for the test set; this is repeated numerous times and the estimates are averaged. Therefore, 3-fold MCCV is equivalent to the leave-one-out bootstrap, except that it employs resampling without replacement. Table 5 displays the simulation study results for the two estimates using 50 iterations for both. Interestingly, the bias and MSE for the leave-one-out bootstrap are roughly double those of 3-fold MCCV. The only two distinct differences between the two methods are the replicate copies in the learning set, inherent in the bootstrap estimate, and the fact that on average .632n unique observations are in the learning sample for the leave-one-out bootstrap, whereas there are always .666n in the learning sample for 3-fold MCCV. Both these factors may contribute to the increase in bias and MSE.
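The ≈.632n figure is easy to verify: the probability that observation i appears at least once in a bootstrap learning set is 1 − (1 − 1/n)^n → 1 − e⁻¹ ≈ .632. A two-line check in R:

    n <- 40   # smallest learning sample size used in Table 5
    mean(replicate(1e4, length(unique(sample(n, replace = TRUE))) / n))
    # ~ 0.637 for n = 40 (exact: 1 - (1 - 1/n)^n); the limit is 1 - exp(-1) = 0.632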
4 DISCUSSION

Estimation of prediction error when confronted with a multitude of covariates and small sample sizes is a relatively new problem. Feature selection, sample size and signal-to-noise ratio are important influences on the relative performance of resampling methods. We have evaluated resampling methods for use in high-dimensional classification problems using a range of sample sizes, algorithms and signals. Some general conclusions may be summarized as follows:

(1) With small sample sizes, the split-sample method and 2-fold CV perform very poorly. This poor performance is primarily due to a large positive bias resulting from the use of a reduced training set size, which severely impairs the ability to effectively select features and fit a model. The large bias contributes to a large MSE.

(2) LOOCV generally performs very well with regard to MSE and bias. The only exception is when an unstable classifier (e.g. CART) is used in the presence of a weak signal. In this setting, the larger MSE is attributed to LOOCV's increased variance.

(3) 10-fold CV prediction error estimates approximate those of LOOCV in almost all settings. For computationally burdensome analyses, 10-fold CV may be preferable to LOOCV. Additionally, in the simulated data, repeated resamplings (the average of 10 repeats) reduce the MSE, bias and variance of 10-fold CV.

(4) The .632+ prediction error estimate performs best with moderate to weak signal-to-noise ratios. Previous studies have found the bootstrap variants superior to LOOCV and v-fold CV; however, these studies did not include feature selection. As seen in Table 1, honest resampling in small samples with strong signal suggests that LOOCV and 10-fold CV are in fact better than the .632+ bootstrap. This discrepancy fades when feature selection is discarded (Table 4) and when the signal decreases, as seen in the lymphoma and ovarian datasets (Tables 2 and 3). Additional glimpses into the bootstrap estimate (Table 5) indicate that sampling with replacement increases the MSE and bias substantially over 3-fold MCCV (i.e. resampling without replacement).

(5) MCCV does not decrease the MSE or bias enough to warrant its use over v-fold CV.

(6) As the sample size grows, the differences among the resampling methods decrease. Additionally, as the signal decreases from strong in the simulated data to rather weak in the ovarian data, the discrepancies between the methods diminish.

In future work we will compare the resampling methods for continuous outcomes and continue to explore the behavior of the bootstrap estimates. Also, the effect of the feature selection method may play an important role in prediction and needs further investigation.

ACKNOWLEDGEMENTS

A.M.M. was supported by the Cancer Prevention Fellowship Program, DCP/NCI/NIH. The authors thank Mark J. van der Laan for fruitful discussions.

Conflict of Interest: none declared.
REFERENCES

Bhattacharjee,A. et al. (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl Acad. Sci. USA, 98, 13790–13795.
Braga-Neto,U.M. and Dougherty,E.R. (2004) Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20, 374–380.
Breiman,L. and Spector,P. (1992) Submodel selection and evaluation in regression. The X-random case. Int. Stat. Rev., 60, 291–319.
Breiman,L., Friedman,J.H., Olshen,R.A. and Stone,C.J. (1984) Classification and Regression Trees. Wadsworth and Brooks/Cole, Monterey, CA.
Bura,E. and Pfeiffer,R.M. (2003) Graphical methods for class prediction using dimension reduction techniques on DNA microarray data. Bioinformatics, 19, 1252–1258.
Burman,P. (1989) A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76, 503–514.
Davison,A.C. and Hinkley,D.V. (1997) Bootstrap Methods and Their Application. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK.
Dettling,M. and Maechler,M. (2005) Software R contributed package supclust: supervised clustering of genes, version 1.0-5 (http://cran.r-project.org).
Dudoit,S. and van der Laan,M.J. (2003) Asymptotics of cross-validated risk estimation in model selection and performance assessment. Technical Report 126, U.C. Berkeley Division of Biostatistics Working Paper Series (http://www.bepress.com/ucbbiostat/paper126).
Efron,B. (1983) Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc., 78, 316–331.
Efron,B. (2004) The estimation of prediction error: covariance penalties and cross-validation. J. Am. Stat. Assoc., 99, 619–642.
Efron,B. and Tibshirani,R.J. (1993) An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability, Vol. 57. Chapman and Hall, New York.
Efron,B. and Tibshirani,R.J. (1997) Improvements on cross-validation: the .632+ bootstrap method. J. Am. Stat. Assoc., 92, 548–560.
Geisser,S. (1975) The predictive sample reuse method with applications. J. Am. Stat. Assoc., 70, 320–328.
Hastie,T., Tibshirani,R. and Friedman,J. (2003) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer-Verlag, New York.
Ihaka,R. and Gentleman,R. (1996) R: a language for data analysis and graphics. J. Comput. Graph. Stat., 5, 299–314.
Lachenbruch,P.A. and Mickey,M.R. (1968) Estimation of error rates in discriminant analysis. Technometrics, 10, 1–11.
McLachlan,G.J. (1992) Discriminant Analysis and Statistical Pattern Recognition. John Wiley & Sons, New York.
Molinaro,A.M. et al. (2004) Tree-based multivariate regression and density estimation with right-censored data. J. Multivar. Anal., 90, 154–177.
Molinaro,A.M., Simon,R. and Pfeiffer,R.M. (2005) Prediction error estimation: a comparison of resampling methods. Technical Report 30, Biometrics Research Branch, Division of Cancer Treatment and Diagnosis, NCI.
Quackenbush,J. (2004) Meeting the challenges of functional genomics: from the laboratory to the clinic. Preclinica, 2, 313–316.
Ransohoff,D.F. (2004) Rules of evidence for cancer molecular marker discovery and validation. Nat. Rev. Cancer, 4, 309–313.
Ripley,B.D. (1996) Pattern Recognition and Neural Networks. Cambridge University Press, New York.
Rosenwald,A. et al. (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. New Engl. J. Med., 346, 1937–1946.
Simon,R. et al. (2003) Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Natl Cancer Inst., 95, 14–18.
Stone,M. (1974) Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Series B, 36, 111–147.
Stone,M. (1977) Asymptotics for and against cross-validation. Biometrika, 64, 29–35.
Therneau,T. and Atkinson,E. (1997) An introduction to recursive partitioning using the RPART routine. Technical Report 61, Section of Biostatistics, Mayo Clinic, Rochester.
Venables,W.N. and Ripley,B.D. (1994) Modern Applied Statistics with S-PLUS. Springer, New York.
Wright,G. et al. (2003) A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc. Natl Acad. Sci. USA, 100, 9991–9996.