AdaBoost Is Consistent
Abstract
The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n^{1−ε} iterations (for sample size n and ε ∈ (0, 1)), the sequence of risks of the classifiers it produces approaches the Bayes risk.
Keywords: boosting, adaboost, consistency
1. Introduction
Boosting algorithms are an important recent development in classification. These algorithms belong to a group of voting methods (see, for example, Schapire, 1990; Freund, 1995; Freund and Schapire, 1996, 1997; Breiman, 1996, 1998) that produce a classifier as a linear combination of base or weak classifiers. While empirical studies show that boosting is one of the best off-the-shelf classification algorithms (see Breiman, 1998), theoretical results do not give a complete explanation of its effectiveness.
The first formulations of boosting by Schapire (1990), Freund (1995), and Freund and Schapire (1996, 1997) considered boosting as an iterative algorithm that is run for a fixed number of iterations; at every iteration it chooses one of the base classifiers and assigns a weight to it, and it eventually outputs the classifier that is the weighted majority vote of the chosen classifiers. Later, Breiman (1997, 1998, 2004) pointed out that boosting is a gradient-descent-type algorithm (see also Friedman et al., 2000; Mason et al., 2000).
Experimental results by Drucker and Cortes (1996), Quinlan (1996), Breiman (1998), Bauer and Kohavi (1999) and Dietterich (2000) showed that boosting is a very effective method that often leads to a low test error. It was also noted that boosting continues to decrease test error long after the sample error becomes zero: though it keeps adding more weak classifiers to the linear combination of classifiers, the generalization error, perhaps surprisingly, usually does not increase. However, some of the experiments suggested that there might be problems, since boosting performed worse than bagging in the presence of noise (Dietterich, 2000), and boosting concentrated not only on the "hard" areas, but also on outliers and noise (Bauer and Kohavi, 1999). And indeed, further experiments, for example by Friedman et al. (2000), Grove and Schuurmans (1998) and Mason et al. (2000), see also Bickel et al. (2006), as well as some theoretical results (for example, Jiang, 2002), showed that boosting, run for an arbitrarily large number of steps, overfits, though it takes a very long time to do so.
Upper bounds on the risk of boosted classifiers were obtained, based on the fact that boosting
tends to maximize the margin of the training examples (Schapire et al., 1998; Koltchinskii and
Panchenko, 2002), but Breiman (1999) pointed out that margin-based bounds do not completely
explain the success of boosting methods. In particular, these results do not resolve the issue of
consistency: they do not explain under which conditions we may expect the risk to converge to the
Bayes risk. Recent work by Reyzin and Schapire (2006) shows that while maximization of the margin is useful, it should not be done at the expense of classifier complexity.
Breiman (2004) showed that under some assumptions on the underlying distribution "population boosting" converges to the Bayes risk as the number of iterations goes to infinity. Since the population version assumes infinite sample size, this does not imply a similar result for AdaBoost, especially given the results of Jiang (2002) showing that there are examples where the prediction error of AdaBoost is asymptotically suboptimal at t = ∞ (t is the number of iterations).
Several authors have shown that modified versions of AdaBoost are consistent. These modifications include restricting the l_1-norm of the combined classifier (Mannor et al., 2003; Blanchard et al., 2003; Lugosi and Vayatis, 2004; Zhang, 2004), and restricting the step size of the algorithm (Zhang and Yu, 2005). Jiang (2004) analyses the unmodified boosting algorithm and proves a process consistency property, under certain assumptions. Process consistency means that there exists a sequence (t_n) such that if AdaBoost with sample size n is stopped after t_n iterations, its risk approaches the Bayes risk. However, Jiang also imposes strong conditions on the underlying distribution: the distribution of X (the predictor) has to be absolutely continuous with respect to Lebesgue measure and the function F_B(X) = (1/2) ln(P(Y = 1|X)/P(Y = −1|X)) has to be continuous on X. Also, Jiang's proof is not constructive and does not give any indication of when the algorithm should be stopped. Bickel et al. (2006) prove a consistency result for AdaBoost, under the assumption that the probability distribution is such that the steps taken by the algorithm are not too large. In this paper, we study stopping rules that guarantee consistency. In particular, we are interested in AdaBoost, not a modified version. Our main result (Corollary 9) demonstrates that a simple stopping rule suffices for consistency: the number of iterations is a fixed function of the sample size. We assume only that the class of base classifiers has finite VC-dimension, and that the span of this class is sufficiently rich. Both assumptions are clearly necessary.
2. Notation
Here we describe the AdaBoost procedure formulated as a coordinate descent algorithm and introduce definitions and notation. We consider a binary classification problem. We are given X, the measurable (feature) space, and Y = {−1, 1}, the set of (binary) labels. We are given a sample S_n = {(X_i, Y_i)}_{i=1}^n of i.i.d. observations distributed as the random variable (X, Y) ∼ P, where P is an unknown distribution. Our goal is to construct a classifier g_n : X → Y based on this sample. The quality of the classifier g_n is given by the misclassification probability

L(g_n) = P(g_n(X) ≠ Y | S_n).

Of course we want this probability to be as small as possible and close to the Bayes risk

L^⋆ = inf_g L(g) = E min{η(X), 1 − η(X)},

where the infimum is taken over all possible (measurable) classifiers and η(·) is the conditional probability

η(x) = P(Y = 1 | X = x).
The infimum above is achieved by the Bayes classifier g^⋆(x) = g(2η(x) − 1), where

g(x) = 1 if x > 0,  and  g(x) = −1 if x ≤ 0.
The AdaBoost algorithm builds a linear combination f of base classifiers from a class H by greedy stagewise minimization of the empirical exponential risk

(1/n) ∑_{i=1}^n exp(−Y_i f(X_i)).
Many of our results are applicable to a broader family of such algorithms, where the function α ↦ exp(−α) is replaced by another function ϕ. Thus, for a function ϕ : ℝ → ℝ₊, we define the empirical ϕ-risk and the ϕ-risk,

R_{ϕ,n}(f) = (1/n) ∑_{i=1}^n ϕ(Y_i f(X_i))   and   R_ϕ(f) = E ϕ(Y f(X)).

Clearly, the function ϕ needs to be appropriate for classification, in the sense that a measurable f that minimizes R_ϕ(f) should have minimal risk. This is equivalent (see Bartlett et al., 2006) to ϕ satisfying the following condition ('classification calibration'): for all 0 ≤ η ≤ 1, η ≠ 1/2,

inf_{α: α(2η−1) ≤ 0} ( η ϕ(α) + (1 − η) ϕ(−α) ) > inf_{α ∈ ℝ} ( η ϕ(α) + (1 − η) ϕ(−α) ).
The choice of γ < 1 in the algorithm allows approximate minimization at each step: the chosen direction need only achieve a fraction γ of the best possible decrease of the empirical ϕ-risk (the scheme is spelled out before Theorem 6). Notice that the original formulation of AdaBoost assumed exact minimization in (2), which corresponds to γ = 1.
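To make the coordinate-descent view concrete, the following is a minimal sketch (not the authors' code) of AdaBoost run over a finite pool of ±1-valued decision stumps, with exact minimization in the chosen direction (γ = 1); the helper names and the use of stumps as the base class H are our illustrative assumptions.

    import numpy as np

    def stump_pool(X, n_thresholds=10):
        """A hypothetical finite pool H of +/-1-valued decision stumps."""
        pool = []
        for j in range(X.shape[1]):
            for thr in np.quantile(X[:, j], np.linspace(0.05, 0.95, n_thresholds)):
                for sign in (1.0, -1.0):
                    pool.append((j, thr, sign))
        return pool

    def stump_predict(h, X):
        j, thr, sign = h
        return sign * np.where(X[:, j] > thr, 1.0, -1.0)

    def adaboost(X, y, pool, t_max):
        """t_max greedy coordinate-descent steps on (1/n) sum_i exp(-y_i f(x_i))."""
        n = len(y)
        w = np.full(n, 1.0 / n)              # weights proportional to exp(-y_i f(x_i))
        combination = []                     # list of (alpha_k, h_k)
        for _ in range(t_max):
            errors = [(w * (stump_predict(h, X) != y)).sum() for h in pool]
            k = int(np.argmin(errors))       # direction giving the largest risk decrease
            eps = errors[k]
            if eps <= 0.0 or eps >= 0.5:     # degenerate cases: stop early in this sketch
                break
            alpha = 0.5 * np.log((1.0 - eps) / eps)   # exact minimization over the step size
            combination.append((alpha, pool[k]))
            w *= np.exp(-alpha * y * stump_predict(pool[k], X))
            w /= w.sum()
        return combination

    def decision(combination, X):
        f = sum(alpha * stump_predict(h, X) for alpha, h in combination)
        return np.sign(f), f                 # the classifier g(f_t) and the combination f_t

With t_max chosen as in Corollary 9 below (t_n = n^{1−ε}), this is the procedure whose consistency is analyzed in this paper.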
For f in the linear span of H, define the norm

‖f‖_⋆ = inf { ∑_i |α_i| : f = ∑_i α_i h_i, h_i ∈ H },

and let F_λ = {f : ‖f‖_⋆ ≤ λ}; we write F^t for the class of linear combinations of at most t base classifiers from H. For l > 0, π_l denotes truncation to the interval [−l, l], π_l(x) = max{−l, min{l, x}}, and for a class of functions F we write

π_l ∘ F = { f̃ : f̃ = π_l(f), f ∈ F },    g ∘ F = { f̃ : f̃ = g(f), f ∈ F }.

Finally, for a functional Q and a direction h, the directional derivative is

Q'(f; h) = ∂Q(f + λh)/∂λ |_{λ=0}.
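As a small illustration of the directional derivative just defined, the sketch below (our notation and helper names) evaluates Q'(f; h) for the empirical exponential risk in closed form, −(1/n) ∑_i Y_i h(X_i) exp(−Y_i f(X_i)), and checks it against a finite-difference approximation.

    import numpy as np

    def exp_risk(margins):
        """Empirical exponential risk (1/n) sum_i exp(-m_i), m_i = y_i f(x_i)."""
        return np.mean(np.exp(-margins))

    def exp_risk_directional(margins, y, h_vals):
        """Q'(f; h): derivative of (1/n) sum_i exp(-y_i (f + lam h)(x_i)) at lam = 0."""
        return -np.mean(y * h_vals * np.exp(-margins))

    rng = np.random.default_rng(0)
    y = rng.choice([-1.0, 1.0], size=200)
    f_vals = rng.normal(size=200)                 # values f(x_i) of the current combination
    h_vals = rng.choice([-1.0, 1.0], size=200)    # values h(x_i) of a candidate direction
    margins = y * f_vals

    lam = 1e-6                                    # finite-difference check
    finite_diff = (exp_risk(y * (f_vals + lam * h_vals)) - exp_risk(margins)) / lam
    print(exp_risk_directional(margins, y, h_vals), finite_diff)   # the two values agree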
The key difficulty with this approach is that the concentration inequalities behind the uniform convergence results are valid only for a suitably small class of suitably bounded functions. However, boosting in general and AdaBoost in particular may produce functions that cannot be appropriately bounded. To circumvent this difficulty, we rely on the observation that, for the purposes of classification, we can replace the function f returned by AdaBoost by any function f′ that satisfies sign(f′) = sign(f). Therefore we consider the clipped version π_λ ∘ f_t of the function returned by AdaBoost after t iterations. This clipping ensures that the functions f_t are suitably bounded. Furthermore, the complexity of the clipped class (as measured by its pseudo-dimension; see Pollard, 1984) grows slowly with the stopping time t, so we can show that the ϕ-risk of a clipped function is not much larger than its empirical ϕ-risk. Lemma 4 provides the necessary details. In order to compare the empirical ϕ-risk of the clipped function to that of a suitable reference sequence f̄_n, we first use the fact that the empirical ϕ-risk of a clipped function π_λ ∘ f_t is not much larger than the empirical ϕ-risk of f_t.
The next step is to relate R_{ϕ,n}(f_t) to R_{ϕ,n}(f̄_n). The choice of a suitable sieve depends on what can be shown about the progress of the algorithm. We consider an increasing sequence of l_⋆-balls, and define f̄_n as the (near) minimizer of the ϕ-risk in the appropriate l_⋆-ball. Theorems 6 and 8 show that as the stopping time increases, the empirical ϕ-risk of the function returned by AdaBoost is not much larger than that of f̄_n. Finally, Hoeffding's inequality shows that the empirical ϕ-risks of the reference functions f̄_n are close to their ϕ-risks. Combining all the pieces, the ϕ-risk of π_λ ∘ f_t approaches R^⋆_ϕ, provided the stopping time increases suitably slowly with the sample size. The consistency of AdaBoost follows.
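The clipping step can be illustrated directly. The sketch below (names ours) truncates a stand-in for an AdaBoost output to [−λ, λ] and checks the two facts used above: clipping does not change the sign of the classifier, and for the exponential loss it increases the empirical ϕ-risk by at most ϕ_λ = e^{−λ} (cf. Lemma 14 in Appendix E).

    import numpy as np

    def clip(f_vals, lam):
        """The truncation pi_lambda: clip f to the interval [-lambda, lambda]."""
        return np.clip(f_vals, -lam, lam)

    rng = np.random.default_rng(1)
    y = rng.choice([-1.0, 1.0], size=1000)
    f_vals = rng.normal(scale=5.0, size=1000)   # stand-in for values of an AdaBoost output f_t
    lam = 2.0

    risk = np.mean(np.exp(-y * f_vals))                     # R_n(f_t)
    clipped_risk = np.mean(np.exp(-y * clip(f_vals, lam)))  # R_n(pi_lambda(f_t))

    assert np.all(np.sign(clip(f_vals, lam)) == np.sign(f_vals))   # same classifier
    assert clipped_risk <= risk + np.exp(-lam)                     # at most phi_lambda larger
    print(risk, clipped_risk)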
We now describe our assumptions. First, we shall impose the following condition.

Condition 1  lim_{λ→∞} inf_{f ∈ F_λ} R_ϕ(f) = inf_f R_ϕ(f) = R^⋆_ϕ, where the infimum on the right-hand side is taken over all measurable functions f.
For many classes H , the above condition is satisfied for all possible distributions P . Lugosi and
Vayatis (2004, Lemma 1) discuss sufficient conditions for Condition 1. As an example of such a
class, we can take the class of indicators of all rectangles or the class of indicators of half-spaces
defined by hyperplanes or the class of binary trees with the number of terminal nodes equal to
d + 1 (we consider trees with terminal nodes formed by successive univariate splits), where d is the
dimensionality of X (see Breiman, 2004).
The following set of conditions deals with uniform convergence and convergence of the boosting
algorithm. The main theorem (Theorem 1) shows that these, together with Condition 1, suffice for
consistency of the boosting procedure. Later in this section we show that the conditions are satisfied
by AdaBoost.
Condition 2  Let n be the sample size. Let there exist non-negative sequences t_n → ∞, ζ_n → ∞ and a sequence {f̄_n}_{n=1}^∞ of reference functions such that

R_ϕ(f̄_n) → R^⋆_ϕ  as n → ∞,

and the following hold:

(a)  sup_{f ∈ π_{ζ_n} ∘ F^{t_n}} |R_ϕ(f) − R_{ϕ,n}(f)| → 0  a.s.;   (3)

(b)  max{0, R_{ϕ,n}(f̄_n) − R_ϕ(f̄_n)} → 0  a.s.;   (4)

(c)  max{0, R_{ϕ,n}(f_{t_n}) − R_{ϕ,n}(f̄_n)} → 0  a.s.   (5)
Theorem 1  Assume ϕ is classification calibrated and convex, and for ϕ_λ = inf_{x∈[−λ,λ]} ϕ(x) assume, without loss of generality, that

lim_{λ→∞} ϕ_λ = inf_{x∈(−∞,∞)} ϕ(x) = 0.   (6)

Let Condition 2 be satisfied. Then the boosting procedure stopped at step t_n returns a sequence of classifiers f_{t_n} almost surely satisfying L(g(f_{t_n})) → L^⋆ as n → ∞.
Remark 2 Note that Condition (6) could be replaced by the mild condition that the function ϕ is
bounded below.
Proof  For almost every outcome ω on the probability space (Ω, S, P) we can define sequences ε^1_n(ω) → 0, ε^2_n(ω) → 0 and ε^3_n(ω) → 0 such that for almost all ω the following inequalities are true:

R_ϕ(π_{ζ_n}(f_{t_n})) ≤ R_{ϕ,n}(π_{ζ_n}(f_{t_n})) + ε^1_n(ω)                                      by (3)
                     ≤ R_{ϕ,n}(f_{t_n}) + ε^1_n(ω) + ϕ_{ζ_n}                                       (7)
                     ≤ R_{ϕ,n}(f̄_n) + ε^1_n(ω) + ϕ_{ζ_n} + ε^2_n(ω)                                by (5)
                     ≤ R_ϕ(f̄_n) + ε^1_n(ω) + ϕ_{ζ_n} + ε^2_n(ω) + ε^3_n(ω)                         by (4).   (8)

Inequality (7) follows from the convexity of ϕ(·) (see Lemma 14 in Appendix E). By (6) and the choice of the sequence {f̄_n}_{n=1}^∞ we have R_ϕ(f̄_n) → R^⋆_ϕ and ϕ_{ζ_n} → 0, and from (8) it follows that R_ϕ(π_{ζ_n}(f_{t_n})) → R^⋆_ϕ a.s. Eventually we can use the result by Bartlett et al. (2006, Theorem 3) to conclude that

L(g(π_{ζ_n}(f_{t_n}))) → L^⋆  a.s.

But for ζ_n > 0 we have g(π_{ζ_n}(f_{t_n})) = g(f_{t_n}), therefore

L(g(f_{t_n})) → L^⋆  a.s.

Hence, the boosting procedure is consistent if stopped after t_n steps.
The almost sure formulation of Condition 2 does not provide explicit rates of convergence of L(g(f_{t_n})) to L^⋆. However, a slightly stricter form of Condition 2, which allows these rates to be calculated, is considered in Appendix A.
In the following sections, we show that Condition 2 can be satisfied for some choices of ϕ. We shall treat parts (a)–(c) separately.
The proof of consistency is based on the following result, which builds on the result by Koltchinskii and Panchenko (2002) and resembles a lemma due to Lugosi and Vayatis (2004, Lemma 2).
Lemma 4  Let L_{ϕ,ζ} be the Lipschitz constant of ϕ on [−ζ, ζ]. Then for V = d_{VC}(H), c = 24 ∫_0^1 √(ln(8e/ε²)) dε and any n, ζ > 0 and t > 0,

E sup_{f ∈ π_ζ ∘ F^t} |R_ϕ(f) − R_{ϕ,n}(f)| ≤ c ζ L_{ϕ,ζ} √( (V + 1)(t + 1) log₂[2(t + 1)/ln 2] / n ).
Now, if we choose ζ and δ as functions of n such that ∑_{n=1}^∞ δ(n) < ∞ and the right-hand side of (9) converges to 0 as n → ∞, we can appeal to the Borel-Cantelli lemma and conclude that for such a choice of ζ_n and δ_n Condition 2(a) holds.
Lemma 4, unlike Lemma 2 of Lugosi and Vayatis (2004), allows us to choose the number of steps t, which governs the complexity of the linear combination of base functions, and this is essential for the proof of consistency. It is easy to see that for AdaBoost (i.e., ϕ(x) = e^{−x}) we can choose ζ = κ ln n and t = n^{1−ε} with κ > 0, ε ∈ (0, 1) and 2κ − ε < 0.
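For concreteness, this schedule can be computed as in the following sketch (names ours; the particular values of ε and κ are arbitrary choices satisfying the stated constraints, not values prescribed by the paper).

    import math

    def adaboost_schedule(n, eps=0.5, kappa=0.2):
        """Stopping time t_n = n^(1-eps) and clipping level zeta_n = kappa*ln(n)."""
        assert 0.0 < eps < 1.0 and kappa > 0.0 and 2.0 * kappa < eps   # 2*kappa - eps < 0
        t_n = int(n ** (1.0 - eps))
        zeta_n = kappa * math.log(n)
        return t_n, zeta_n

    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, adaboost_schedule(n))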
To show that Condition 2(b) is satisfied for a variety of loss functions we use Hoeffding's inequality.

Theorem 5  Define the variation of a function ϕ on the interval [−a, a] (for a > 0) as

V_{ϕ,a} = sup_{x∈[−a,a]} ϕ(x) − inf_{x∈[−a,a]} ϕ(x).

If a sequence {f̄_n}_{n=1}^∞ satisfies the condition f̄_n(x) ∈ [−λ_n, λ_n] for all x ∈ X, where λ_n > 0 is chosen so that V_{ϕ,λ_n} = o(√(n / ln n)), then

max{0, R_{ϕ,n}(f̄_n) − R_ϕ(f̄_n)} → 0   almost surely as n → ∞.   (10)
Proof  Since we restricted the range of f̄_n to the interval [−λ_n, λ_n], we have, almost surely, ϕ(Y f̄_n(X)) ∈ [a, b], where b − a ≤ V_{ϕ,λ_n}. Therefore Hoeffding's inequality guarantees that for all ε_n

P( R_{ϕ,n}(f̄_n) − R_ϕ(f̄_n) ≥ ε_n ) ≤ exp( −2 n ε_n² / V²_{ϕ,λ_n} ) = δ_n.

To prove the statement of the theorem we require ε_n = o(1) and ∑ δ_n < ∞. Then we appeal to the Borel-Cantelli lemma to conclude that (10) holds. These restrictions are satisfied if

V²_{ϕ,λ_n} = o( n / ln n ),

and the statement of the theorem follows.
The choice of λ_n in the above theorem depends on the loss function ϕ. In the case of the AdaBoost loss ϕ(x) = e^{−x} we shall choose λ_n = κ ln n, where κ ∈ (0, 1/2); then V_{ϕ,λ_n} ≤ e^{λ_n} = n^κ = o(√(n / ln n)). One way to guarantee that the functions f̄_n satisfy the condition f̄_n(x) ∈ [−λ_n, λ_n] for all x ∈ X is to choose f̄_n ∈ F_{λ_n}.
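For the exponential loss and λ_n = κ ln n, the quantities entering this Hoeffding argument can be tabulated as in the sketch below (names and the particular κ, ν values are ours, chosen only to satisfy the constraints above).

    import math

    def hoeffding_delta(n, kappa=0.2, nu=0.2):
        """Failure probability delta_n = exp(-2 n eps_n^2 / V^2) for phi(x) = exp(-x)."""
        lam = kappa * math.log(n)                  # lambda_n = kappa * ln(n)
        V = math.exp(lam) - math.exp(-lam)         # variation of exp(-x) on [-lambda_n, lambda_n]
        eps_n = n ** (-nu)                         # eps_n -> 0; need 2*nu + 2*kappa < 1 for summability
        return math.exp(-2.0 * n * eps_n ** 2 / V ** 2)

    for n in (10**2, 10**3, 10**4, 10**5):
        print(n, hoeffding_delta(n))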
Theorem 6  Let the function Q(f) be convex in f and twice differentiable in all directions h ∈ H. Let Q^⋆ = lim_{λ→∞} inf_{f∈F_λ} Q(f). Assume that for all c₁, c₂ such that Q^⋆ < c₁ < c₂ < ∞ there exist constants 0 < β ≤ B < ∞ (depending on c₁ and c₂) with

β ≤ Q''(f; h) ≤ B   for all h ∈ H and all f such that c₁ ≤ Q(f) ≤ c₂.

Also assume the following approximate minimization scheme for γ ∈ (0, 1]. Define f_{k+1} = f_k + α_{k+1} h_{k+1} such that

Q(f_{k+1}) ≤ γ inf_{h∈H, α∈ℝ} Q(f_k + αh) + (1 − γ) Q(f_k)

and

Q(f_{k+1}) = inf_{α∈ℝ} Q(f_k + α h_{k+1}).
Then for any reference function f̄ and the sequence of functions f_m produced by the boosting algorithm, the following bound holds for all m > 0 such that Q(f_m) > Q(f̄):

Q(f_m) ≤ Q(f̄) + √( 8 B³ (Q(f₀) − Q(f̄))² / (γ² β³) ) · [ ln( (ℓ₀² + c₃ m) / ℓ₀² ) ]^{−1/2},   (11)

where ℓ₀ = ‖f₀ − f̄‖_⋆ and c₃ = 2(Q(f₀) − Q(f̄)) / β.
Remark 7  Results in Zhang and Yu (2005, e.g., Lemma 4.1) provide similar bounds under either an assumption of a bounded step size of the boosting algorithm or a positive lower bound on Q''(f; h) for all f, h. Since we consider boosting algorithms with unrestricted step size, the only option would be to assume a positive lower bound on the second derivative. While such an assumption is fine for the quadratic loss ϕ(x) = x², the second derivative R''_n(f; h) of the empirical risk for the exponential loss used by the AdaBoost algorithm cannot be bounded from below by a positive constant in general. Theorem 6 makes the milder assumption that the second derivative is positive for all f such that R(f) > R^⋆ (respectively, R_n(f) > R^⋆_n).
It is easy to see that the theorem above applies to the AdaBoost algorithm, since there we first choose the direction (base classifier) h_i and then compute the step size α_i as

α_i = (1/2) ln( (1 − ε_i) / ε_i ) = (1/2) ln( (R(f_i) − R'(f_i; h_i)) / (R(f_i) + R'(f_i; h_i)) ).

Now we only have to recall that this value of α_i corresponds to exact minimization in the direction h_i.
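That this closed-form step size is the exact line-search minimizer is easy to check numerically; the sketch below (names and synthetic data ours) compares the empirical exponential risk along the chosen direction at the closed-form α and at nearby values.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 500
    y = rng.choice([-1.0, 1.0], size=n)
    f_vals = 0.3 * rng.normal(size=n)                  # values of the current combination f_i
    h_vals = np.where(rng.random(n) < 0.7, y, -y)      # a base classifier h_i, roughly 30% error

    w = np.exp(-y * f_vals)
    w /= w.sum()
    eps = (w * (h_vals != y)).sum()                    # weighted error eps_i of h_i
    alpha_closed = 0.5 * np.log((1.0 - eps) / eps)     # the AdaBoost step size

    def risk(a):
        return np.mean(np.exp(-y * (f_vals + a * h_vals)))

    for a in (alpha_closed - 0.05, alpha_closed, alpha_closed + 0.05):
        print(round(a, 4), risk(a))                    # the middle value is the smallest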
From now on we specialize to AdaBoost and use ϕ(x) = e^{−x}; hence we drop the subscript ϕ in R_{ϕ,n} and R_ϕ and write R_n and R respectively.
Theorem 6 allows us to obtain an upper bound on the difference between the exp-risk of the function output by AdaBoost and the exp-risk of an appropriate reference function. For brevity, in the next theorem we make the assumption R^⋆ > 0, though a similar result can be stated for R^⋆ = 0. For completeness, the corresponding theorem is given in Appendix D.
Theorem 8  Assume R^⋆ > 0. Let t_n be the number of steps we run AdaBoost. Let λ_n = κ ln n, κ ∈ (0, 1/2). Let a > 1 be an arbitrary fixed number. Let {f̄_n}_{n=1}^∞ be a sequence of functions such that f̄_n ∈ F_{λ_n}. Then with probability at least 1 − δ_n, where δ_n = exp( −2(R^⋆)² n^{1−2κ} / a² ), the following holds:

R_n(f_{t_n}) ≤ R_n(f̄_n) + ( 2√2 (1 − R^⋆(a − 1)/a) / ( γ ((a − 1) R^⋆ / a)^{3/2} ) ) · [ ln( (λ_n² + 2 t_n (a/(a − 1) − R^⋆)/R^⋆) / λ_n² ) ]^{−1/2}.
Proof  For the exponential loss and ±1-valued base classifiers, the second directional derivative of the empirical risk is R''_n(f; h) = (1/n) ∑_{i=1}^n h(X_i)² exp(−Y_i f(X_i)) = R_n(f), so all the conditions in Theorem 6 are satisfied as long as R_n(f̄_n) > 0 (with Q(f) replaced by R_n(f)), and in Equation (11) we have B = R_n(f₀) = 1, β ≥ R_n(f̄_n), and ‖f₀ − f̄_n‖_⋆ ≤ λ_n. Since for t such that R_n(f_t) ≤ R_n(f̄_n) the theorem is trivially true, we only have to notice that exp(−Y_i f̄_n(X_i)) ∈ [0, n^κ], hence Hoeffding's inequality guarantees that

P( (1/n) ∑_{i=1}^n e^{−Y_i f̄_n(X_i)} − E e^{−Y f̄_n(X)} ≤ −R^⋆/a ) ≤ exp( −2(R^⋆)² n^{1−2κ} / a² ) = δ_n,

where we choose and fix the constant a > 1 arbitrarily and independently of n and the sequence {f̄_n}_{n=1}^∞. Therefore, with probability at least 1 − δ_n the empirical risk can be bounded from below as R_n(f̄_n) ≥ R(f̄_n) − R^⋆/a ≥ R^⋆ − R^⋆/a = R^⋆(a − 1)/a, since R(f̄_n) ≥ R^⋆. Therefore β ≥ R^⋆(a − 1)/a, and the result follows immediately from Equation (11) if we use the fact that R^⋆ > 0.
If we now choose the reference functions f̄_n ∈ F_{λ_n} so that

R(f̄_n) ≤ inf_{f∈F_{λ_n}} R(f) + ε_n

for some ε_n → 0, then together with Condition 1 this will imply R(f̄_n) → R^⋆ as n → ∞, and Condition 2(c) follows.
Corollary 9  Assume that d_{VC}(H) < ∞,

lim_{λ→∞} inf_{f∈F_λ} R(f) = R^⋆,

and t_n = n^{1−ε} for ε ∈ (0, 1). Then AdaBoost stopped at step t_n returns a sequence of classifiers almost surely satisfying L(g(f_{t_n})) → L^⋆.
Proof  First assume L^⋆ > 0. For the exponential loss function this implies R^⋆ > 0. As was suggested after the proof of Lemma 4, we may choose ζ_n = κ ln n with 2κ − ε < 0 (which also implies κ < 1/2) to satisfy Condition 2(a). Recall that the discussion after the proof of Theorem 8 suggests choosing the sequence {f̄_n}_{n=1}^∞ of reference functions such that f̄_n ∈ F_{λ_n} and

R(f̄_n) ≤ inf_{f∈F_{λ_n}} R(f) + ε_n

for ε_n → 0 and λ_n = κ ln n with κ ∈ (0, 1/2), to ensure that Condition 2(c) holds. Finally, as follows from the discussion after the proof of Theorem 5, the choice of the sequence {f̄_n}_{n=1}^∞ to satisfy Condition 2(c) also ensures that Condition 2(b) holds. Since the function ϕ(x) = e^{−x} is clearly
classification calibrated and the conditions of this corollary include Condition 1, all the conditions of Theorem 1 hold and the consistency of the AdaBoost algorithm follows.
For L^⋆ = 0 the proof is similar, but we need to use Theorem 13 in Appendix D instead of Theorem 8.
4. Discussion
We showed that AdaBoost is consistent if stopped sufficiently early, after t_n iterations, for t_n = O(n^{1−ε}) with ε ∈ (0, 1). We do not know whether this number can be increased. Results by Jiang (2002) imply that for some X and function class H the AdaBoost algorithm will achieve zero training error after t_n steps, where n²/t_n = o(1) (see also work by Mannor and Meir (2001, Lemma 1) for an example of X = R^d and H = {linear classifiers}, for which perfect separation on the training sample is guaranteed after 8n² ln n iterations); hence, if run for that many iterations, the AdaBoost algorithm does not produce a consistent classifier. We do not know what happens in between O(n^{1−ε}) and O(n² ln n). Narrowing this gap is a subject of further research.
The AdaBoost algorithm, as well as other versions of the boosting procedure, replaces the 0-1 loss with a convex function ϕ to overcome algorithmic difficulties associated with the non-convex optimization problem. In order to conclude that R_ϕ(f_n) → R^⋆_ϕ implies L(g(f_n)) → L^⋆ we want ϕ to be classification calibrated, and this requirement cannot be relaxed, as shown by Bartlett et al. (2006).
The statistical part of the analysis, summarized in Lemma 4 and Theorem 5, works for quite an arbitrary loss function ϕ. The only restriction imposed by Lemma 4 is that ϕ must be Lipschitz on any compact set. This requirement is an artifact of our proof and is caused by the use of the "contraction principle". It can be relaxed in some cases: Shen et al. (2003) use the classification calibrated loss function
calibrated loss function
2 , x < 0,
ψ(x) = 1 − x , 0 ≤ x < 1,
0 , x ≥ 1,
lim_{λ→∞} inf_{f∈F_λ} R(f) = R^⋆

and t_n = n^{1−ε} for ε ∈ (0, 1). Then the boosting procedure stopped at step t_n returns a sequence of classifiers almost surely satisfying L(g(f_{t_n})) → L^⋆.
We cannot draw an analogous conclusion for other loss functions. For example, for the logit loss ϕ(x) = ln(1 + e^{−x}), Lemma 4 and Theorem 5 work, since L_{ϕ,λ} = 1 and M_{ϕ,λ} = ln(1 + e^λ), hence choosing t_n, λ_n, ζ_n as for either the exponential or quadratic losses will work. The assumption of Theorem 6 also holds, with R''_{ϕ,n}(f; h) ≥ R_{ϕ,n}(f)/n, though the resulting inequality is trivial: the factor 1/n in this bound precludes us from finding an analog of Theorem 8. A similar problem arises in the case of the modified quadratic loss ϕ(x) = [max(1 − x, 0)]², for which R''_{ϕ,n}(f; h) ≥ 2/n. Generally, any loss function with "really flat" regions may cause trouble. Another issue is the very slow rate of convergence in Theorems 6 and 8. Hence further research intended either to improve the convergence rates or to extend the applicability of these theorems to loss functions other than exponential and quadratic is desirable.
Acknowledgments
This work was supported by the NSF under award DMS-0434383. The authors would like to thank
Peter Bickel for useful discussions, as well as Jean-Philippe Vert and two anonymous referees for
their comments and suggestions.
Appendix A.

Condition 3  Let n be the sample size. Let there exist non-negative sequences t_n → ∞, ζ_n → ∞, sequences δ^j_n, j = 1, 2, 3, such that ∑_{i=1}^∞ δ^j_i < ∞, sequences ε^k_n → 0, k = 1, 2, 3, and a sequence {f̄_n}_{n=1}^∞ of reference functions such that

R_ϕ(f̄_n) → R^⋆_ϕ  as n → ∞.
Theorem 11  Assume ϕ is classification calibrated and convex, and for ϕ_λ = inf_{x∈[−λ,λ]} ϕ(x) assume, without loss of generality, that

lim_{λ→∞} ϕ_λ = inf_{x∈(−∞,∞)} ϕ(x) = 0.   (15)

Let Condition 3 be satisfied. Then the boosting procedure stopped at step t_n returns a sequence of classifiers f_{t_n} almost surely satisfying L(g(f_{t_n})) → L^⋆ as n → ∞.
Inequalities (16), (18) and (17) hold with probability at least 1 − δ^1_n, 1 − δ^2_n and 1 − δ^3_n respectively. We assumed in Condition 3 that R_ϕ(f̄_n) → R^⋆_ϕ, and (15) implies that ϕ_{ζ_n} → 0 by the choice of the sequence ζ_n. Now we appeal to the Borel-Cantelli lemma and arrive at R_ϕ(π_{ζ_n}(f_{t_n})) → R^⋆_ϕ a.s. Eventually we can use Theorem 3 by Bartlett et al. (2006) to conclude that

L(g(π_{ζ_n}(f_{t_n}))) → L^⋆  a.s.
We could prove Theorem 11 by using the Borel-Cantelli lemma and appealing to Theorem 1,
but the above proof allows the following corollary on the rate of convergence.
Corollary 12  Let the conditions of Theorem 11 be satisfied. Then there exists a non-decreasing function ψ, such that ψ(0) = 0, and with probability at least 1 − δ^1_n − δ^2_n − δ^3_n

L(g(f_{t_n})) − L^⋆ ≤ ψ^{−1}( (ε^1_n + ε^2_n + ε^3_n + ϕ_{ζ_n}) + inf_{f∈F_{λ_n}} R_ϕ(f) − R^⋆_ϕ ),   (19)

where

ψ(θ) = ϕ(0) − inf { ((1 + θ)/2) ϕ(α) + ((1 − θ)/2) ϕ(−α) : α ∈ ℝ }.
The proof of Theorem 11 shows that for the function f_{t_n}, with probability at least 1 − δ^1_n − δ^2_n,

R_ϕ(f_{t_n}) − inf_{f∈F_{λ_n}} R_ϕ(f) ≤ ε^1_n + ε^2_n + ε^3_n + ϕ_{ζ_n}.
The second term under ψ^{−1} in (19) is an approximation error and, in general, it may decrease arbitrarily slowly. However, if it is known that it decreases sufficiently fast, the first term becomes an issue. For example, Corollary 9, even if the approximation error decreases sufficiently fast, will give a convergence rate of the order O((ln n)^{−1/4}). This follows from Example 1 of Bartlett et al. (2006), where it is shown that for AdaBoost (the exponential loss function) ψ^{−1}(x) ≤ √(2x), and the fact that both ε^1_n and ε^2_n, as well as ϕ_{ζ_n}, in Corollary 9 decrease at the rate O(n^{−α}) (in fact, the α's might be different for all three of them); hence everything is dominated by ε^3_n, which is O((ln n)^{−1/2}).
Then for V = d_{VC}(H), c = 24 ∫_0^1 √(ln(8e/ε²)) dε and any n, ζ > 0 and t > 0,

E sup_{f ∈ π_ζ ∘ F^t} |R_ϕ(f) − R_{ϕ,n}(f)| ≤ c ζ L_{ϕ,ζ} √( (V + 1)(t + 1) log₂[2(t + 1)/ln 2] / n ).
where σ_i are i.i.d. with P(σ_i = 1) = P(σ_i = −1) = 1/2. Then we use the "contraction principle" (see Ledoux and Talagrand, 1991, Theorem 4.12, pp. 112–113) with the function ψ(x) = (ϕ(x) − ϕ(0))/L_{ϕ,ζ} to get

E sup_{f ∈ π_ζ ∘ F^t} |R_ϕ(f) − R_{ϕ,n}(f)| ≤ 4 L_{ϕ,ζ} E sup_{f ∈ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n (−σ_i Y_i f(X_i)) |
                                           = 4 L_{ϕ,ζ} E sup_{f ∈ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n σ_i f(X_i) |.

Next we proceed to bound the supremum. Notice that functions in π_ζ ∘ F^t are bounded, clipped to absolute value at most ζ, therefore we can rescale π_ζ ∘ F^t by (2ζ)^{−1} and get

E sup_{f ∈ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n σ_i f(X_i) | = 2ζ E sup_{f ∈ (2ζ)^{−1} ∘ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n σ_i f(X_i) |.

Next, we use Dudley's entropy integral (Dudley, 1999) to bound the right-hand side above:

E sup_{f ∈ (2ζ)^{−1} ∘ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n σ_i f(X_i) | ≤ (12/√n) ∫_0^∞ √( ln N(ε, (2ζ)^{−1} ∘ π_ζ ∘ F^t, L_2(P_n)) ) dε.

Since, for ε > 1, the covering number N is 1, the upper integration limit can be taken as 1, and we can use Pollard's bound (Pollard, 1990) for F ⊆ [0, 1]^X,

N(ε, F, L_2(P)) ≤ 2 (4e/ε²)^{d_P(F)},

where d_P(F) is the pseudo-dimension, and obtain, for c̃ = 12 ∫_0^1 √(ln(8e/ε²)) dε,

E sup_{f ∈ (2ζ)^{−1} ∘ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n σ_i f(X_i) | ≤ c̃ √( d_P((2ζ)^{−1} ∘ π_ζ ∘ F^t) / n ).

Also notice that the constant c̃ does not depend on F^t or ζ. Next, since (2ζ)^{−1} ∘ π_ζ is non-decreasing, we use the inequality d_P((2ζ)^{−1} ∘ π_ζ ∘ F^t) ≤ d_P(F^t) (for example, Anthony and Bartlett, 1999, Theorem 11.3) to obtain

E sup_{f ∈ (2ζ)^{−1} ∘ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n σ_i f(X_i) | ≤ c̃ √( d_P(F^t) / n ).

And then, since Lemma 3 gives an upper bound on the pseudo-dimension of the class F^t, we have

E sup_{f ∈ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n σ_i f(X_i) | ≤ c ζ √( (V + 1)(t + 1) log₂[2(t + 1)/ln 2] / n ),
with the constant c above being independent of H, t and ζ. To prove the second statement we use McDiarmid's bounded difference inequality (Devroye et al., 1996, Theorem 9.2, p. 136), since for all i ∈ {1, . . . , n}

sup_{(x_j,y_j)_{j=1}^n, (x'_i,y'_i)} | sup_{f ∈ π_ζ ∘ F^t} |R_ϕ(f) − R_{ϕ,n}(f)| − sup_{f ∈ π_ζ ∘ F^t} |R_ϕ(f) − R'_{ϕ,n}(f)| | ≤ M_{ϕ,ζ} / n,

where R'_{ϕ,n}(f) is obtained from R_{ϕ,n}(f) by replacing the pair (x_i, y_i) with an independent pair (x'_i, y'_i). This completes the proof of the lemma.
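The quantity controlled in this proof is a Rademacher-type average. As a simplified illustration (names ours, and over a small finite pool of stumps rather than the full clipped class π_ζ ∘ F^t), it can be estimated by Monte Carlo as follows; for a finite class it is of order √(ln|H| / n).

    import numpy as np

    rng = np.random.default_rng(3)
    n, n_rounds = 200, 500
    X = rng.normal(size=(n, 2))
    thresholds = np.quantile(X[:, 0], np.linspace(0.05, 0.95, 19))
    H = np.array([np.where(X[:, 0] > t, 1.0, -1.0) for t in thresholds])   # |H| x n matrix of h(X_i)

    sups = []
    for _ in range(n_rounds):
        sigma = rng.choice([-1.0, 1.0], size=n)
        sups.append(np.max(np.abs(H @ sigma)) / n)     # sup_h |(1/n) sum_i sigma_i h(X_i)|
    print(np.mean(sups))                               # Monte Carlo estimate of the expectation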
Then for any reference function f̄ and the sequence of functions f_m produced by the boosting algorithm, the following bound holds for all m > 0 such that Q(f_m) > Q(f̄):

Q(f_m) ≤ Q(f̄) + √( 8 B³ (Q(f₀) − Q(f̄))² / (γ² β³) ) · [ ln( (ℓ₀² + c₃ m) / ℓ₀² ) ]^{−1/2}.
Proof  The statement of the theorem is a version of a result implicit in the proof of Theorem 1 by Bickel et al. (2006). If for some m we have Q(f_m) ≤ Q(f̄), then the theorem is trivially true for all m′ ≥ m. Therefore, we are going to consider only the case when Q(f_m) > Q(f̄). We shall also assume Q(f_{m+1}) ≥ Q(f̄) (the impact of this assumption will be discussed later). Define ε_m = Q(f_m) − Q(f̄). By convexity of Q(·),

|Q'(f_m; f_m − f̄)| ≥ ε_m.   (20)

Let f_m − f̄ = ∑_i α̃_i h̃_i, where α̃_i and h̃_i correspond to the best representation (with the l_1-norm of α̃ equal to the l_⋆-norm). Then from (20) and linearity of the derivative we have

ε_m ≤ ∑_i α̃_i Q'(f_m; h̃_i) ≤ sup_{h∈H} |Q'(f_m; h)| ∑_i |α̃_i|,
therefore

sup_{h∈H} |Q'(f_m; h)| ≥ ε_m / ‖f_m − f̄‖_⋆ = ε_m / ℓ_m.   (21)

Next,

Q(f_m + αh_m) = Q(f_m) + α Q'(f_m; h_m) + (1/2) α² Q''(f̃_m; h_m),

where f̃_m = f_m + α̃_m h_m for some α̃_m ∈ [0, α_m]. By assumption f̃_m is on the path from f_m to f_{m+1}, and we have assumed exact minimization in the given direction, hence f_{m+1} is the lowest point in the direction h_m starting from f_m, so we have the following bounds:

Q(f_{m+1}) ≥ Q(f_m) + inf_{α∈ℝ} ( α Q'(f_m; h_m) + (1/2) α² β ) = Q(f_m) − |Q'(f_m; h_m)|² / (2β).   (22)

On the other hand,

Q(f_m + α_m h_m) ≤ γ inf_{h∈H, α∈ℝ} Q(f_m + αh) + (1 − γ) Q(f_m)
                ≤ γ inf_{h∈H, α∈ℝ} ( Q(f_m) + α Q'(f_m; h) + (1/2) α² B ) + (1 − γ) Q(f_m)
                = Q(f_m) − γ sup_{h∈H} |Q'(f_m; h)|² / (2B).   (23)

Therefore, combining (22) and (23), we get

|Q'(f_m; h_m)| ≥ √(γβ/B) · sup_{h∈H} |Q'(f_m; h)|.   (24)

Another Taylor expansion, this time around f_{m+1} (and we again use the fact that f_{m+1} is the minimum on the path from f_m), gives us

Q(f_m) = Q(f_{m+1}) + (1/2) α_m² Q''(f̃̃_m; h_m),   (25)

where f̃̃_m is some (other) function on the path from f_m to f_{m+1}. Therefore, if |α_m| < √γ |Q'(f_m; h_m)| / B, then

Q(f_m) − Q(f_{m+1}) < γ |Q'(f_m; h_m)|² / (2B),

but by (23)

Q(f_m) − Q(f_{m+1}) ≥ γ sup_{h∈H} |Q'(f_m; h)|² / (2B) ≥ γ |Q'(f_m; h_m)|² / (2B),

therefore we conclude, by combining (24) and (21), that

|α_m| ≥ √γ |Q'(f_m; h_m)| / B ≥ γ √β sup_{h∈H} |Q'(f_m; h)| / B^{3/2} ≥ γ ε_m √β / (ℓ_m B^{3/2}).   (26)
Recall that

‖f_m − f̄‖_⋆ ≤ ‖f_{m−1} − f̄‖_⋆ + |α_{m−1}| ≤ ‖f₀ − f̄‖_⋆ + ∑_{i=0}^{m−1} |α_i|
            ≤ ‖f₀ − f̄‖_⋆ + √m ( ∑_{i=0}^{m−1} α_i² )^{1/2},
therefore, combining with (27) and (26), since the sequence ε_i is decreasing,

(2/β) (Q(f₀) − Q(f̄)) ≥ ∑_{i=0}^m α_i²
                      ≥ (γ²β/B³) ∑_{i=0}^m ε_i² / ℓ_i²
                      ≥ (γ²β/B³) ε_m² ∑_{i=0}^m 1 / ( ℓ₀ + √i ( ∑_{j=0}^{i−1} α_j² )^{1/2} )²
                      ≥ (γ²β/B³) ε_m² ∑_{i=0}^m 1 / ( ℓ₀ + √i ( 2(Q(f₀) − Q(f̄))/β )^{1/2} )²
                      ≥ (γ²β/(2B³)) ε_m² ∑_{i=0}^m 1 / ( ℓ₀² + i · 2(Q(f₀) − Q(f̄))/β ).

Since

∑_{i=0}^m 1/(a + bi) ≥ ∫_0^{m+1} dx/(a + bx) = (1/b) ln( (a + b(m + 1)) / a ),

then

(2/β) (Q(f₀) − Q(f̄)) ≥ ( γ²β² / (4B³ (Q(f₀) − Q(f̄))) ) ε_m² ln( (ℓ₀² + (2(Q(f₀) − Q(f̄))/β)(m + 1)) / ℓ₀² ).

Therefore

ε_m ≤ √( 8B³ (Q(f₀) − Q(f̄))² / (γ²β³) ) · [ ln( (ℓ₀² + (2(Q(f₀) − Q(f̄))/β)(m + 1)) / ℓ₀² ) ]^{−1/2}.   (28)
The proof of the above inequality for index m works as long as Q(f_{m+1}) ≥ Q(f̄). If f̄ is such that Q(f_m) ≥ Q(f̄) for all m, then we do not need to do anything else. However, if there exists m₀ such that Q(f_{m₀}) < Q(f̄) and Q(f_{m₀−1}) ≥ Q(f̄), then the above proof is not valid for index m₀ − 1. To overcome this difficulty, we notice that Q(f_{m₀−1}) is bounded from above by Q(f_{m₀−2}); therefore, to get a bound that holds for all m (except for m = 0), we may use the bound for ε_{m−1} to bound Q(f_m) − Q(f̄) = ε_m: shift (decrease) the index m on the right-hand side of (28) by one. This completes the proof of the theorem.
R_n(f_{t_n}) ≤ R_n(f̄_n) + √( 16 R_n³(f₀) |R_n(f₀) − R_n(f̄_n)|² (ln n)^{3κ} / (Cγ²) )
               × [ ln( ( (κ ln ln n)² + 4 |R_n(f₀) − R_n(f̄_n)| (ln n)^κ t_n / C ) / (κ ln ln n)² ) ]^{−1/2}.
Proof  For the exponential loss, the assumption R^⋆ = 0 is equivalent to L^⋆ = 0. It also implies that the function τ : λ ↦ inf_{f∈F_λ} R(f) decreases no faster than O(e^{−λ}). To see this, assume that for some λ there exists f ∈ F_λ such that L(g(f)) = 0 (i.e., we have achieved perfect classification). Clearly, for any a > 0,

R(af) = E e^{−Y a f(X)} = E ( e^{−Y f(X)} )^a ≥ ( inf_{x,y} e^{−y f(x)} )^a.

Therefore, choose λ_n = κ ln ln n. Then inf_{f∈F_{λ_n}} R(f) ≥ C (ln n)^{−κ}, where C depends on H and P, but does not depend on n. On the other hand, Hoeffding's inequality for f̄_n ∈ F_{λ_n} guarantees that
P( R_n(f̄_n) − R(f̄_n) ≤ −ε_n ) ≤ exp( −2 n ε_n² / (ln n)^{2κ} ) = δ_n.

The choice of ε_n = n^{−ν} for ν ∈ (0, 1/2) ensures that δ_n → 0. This allows us to conclude that with probability at least 1 − δ_n the empirical risk can be lower bounded as

R_n(f̄_n) ≥ R(f̄_n) − ε_n ≥ C (ln n)^{−κ} − n^{−ν} ≥ C / (2 (ln n)^κ)

for all sufficiently large n, so that in Theorem 6 we may take

β ≥ C / (2 (ln n)^κ).

Since for f̄_n such that R_n(f̄_n) > R_n(f₀) the theorem trivially holds, we only have to plug R_n(f̄_n) = 0, B = R_n(f₀) and β = C (ln n)^{−κ}/2 into (11) to get the statement of the theorem. Obviously, this bound also holds for R^⋆ > 0.
Appendix E.
Lemma 14  Let the function ϕ : ℝ → ℝ₊ ∪ {0} be convex. Then for any λ > 0 and any x ∈ ℝ,

ϕ(π_λ(x)) ≤ ϕ(x) + ϕ_λ,   where ϕ_λ = inf_{y∈[−λ,λ]} ϕ(y).

Proof  If x ∈ [−λ, λ] then the statement of the lemma is clearly true, since ϕ(π_λ(x)) = ϕ(x) and ϕ_λ ≥ 0. Without loss of generality assume x > λ; the case x < −λ is similar. Then we have two possibilities.

1. ϕ(x) ≥ ϕ(λ). Then ϕ(π_λ(x)) = ϕ(λ) ≤ ϕ(x) ≤ ϕ(x) + ϕ_λ.

2. ϕ(x) < ϕ(λ). Due to convexity, for any z < λ we have ϕ(z) > ϕ(λ), therefore ϕ_λ = ϕ(λ), and ϕ(π_λ(x)) = ϕ(λ) = ϕ_λ ≤ ϕ(x) + ϕ_λ, since ϕ(x) ≥ 0.
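The clipping inequality is also easy to verify numerically; the sketch below (our illustrative check, not part of the proof) tests it on a grid for the exponential loss and for the modified quadratic loss mentioned in the Discussion.

    import numpy as np

    losses = {
        "exponential":        lambda x: np.exp(-x),
        "modified quadratic": lambda x: np.maximum(1.0 - x, 0.0) ** 2,
    }
    lam = 2.0
    xs = np.linspace(-10.0, 10.0, 4001)
    for name, phi in losses.items():
        clipped = np.clip(xs, -lam, lam)
        phi_lam = phi(clipped).min()                       # inf of phi over [-lambda, lambda]
        assert np.all(phi(clipped) <= phi(xs) + phi_lam + 1e-12)
        print(name, "ok, phi_lambda =", phi_lam)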
References
Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Discussion of boosting papers. The
Annals of Statistics, 32(1):85–91, 2004.
Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk
bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging,
boosting and variants. Machine Learning, 36:105–139, 1999.
Peter J. Bickel, Ya’acov Ritov, and Alon Zakai. Some theory for generalized boosting algorithms.
Journal of Machine Learning Research, 7:705–732, May 2006.
Gilles Blanchard, Gábor Lugosi, and Nicolas Vayatis. On the rate of convergence of regularized
boosting classifiers. Journal of Machine Learning Research, 4:861–894, 2003.
Leo Breiman. Arcing the edge. Technical Report 486, Department of Statistics, University of
California, Berkeley, 1997.
Leo Breiman. Prediction games and arcing algorithms. Neural Computation, 11:1493–1517, 1999.
(Was Department of Statistics, U.C. Berkeley Technical Report 504, 1997).
Leo Breiman. Arcing classifiers (with discussion). The Annals of Statistics, 26(3):801–849, 1998.
(Was Department of Statistics, U.C. Berkeley Technical Report 460, 1996).
Leo Breiman. Population theory for predictor ensembles. The Annals of Statistics, 32(1):1–11,
2004. (See also Department of Statistics, U.C. Berkeley Technical Report 579, 2000).
Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.
Harris Drucker and Corinna Cortes. Boosting decision trees. In D.S. Touretzky, M.C. Mozer, and
M.E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 479–485.
M.I.T. Press, 1996.
Richard M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, Cambridge,
MA, 1999.
Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121:
256–285, 1995.
Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148–156, San Francisco, 1996. Morgan Kaufmann.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: A statistical
view of boosting. The Annals of Statistics, 28:337–407, 2000.
Adam J. Grove and Dale Schuurmans. Boosting in the limit: Maximizing the margin of learned
ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages
692–699, Menlo Park, CA, 1998. AAAI Press.
Wenxin Jiang. On weak base hypotheses and their implications for boosting regression and classification. The Annals of Statistics, 30:51–73, 2002.
Wenxin Jiang. Process consistency for AdaBoost. The Annals of Statistics, 32(1):13–29, 2004.
Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the
generalization error of combined classifiers. The Annals of Statistics, 30:1–50, 2002.
Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. Springer-Verlag, New York,
1991.
Gábor Lugosi and Nicolas Vayatis. On the Bayes-risk consistency of regularized boosting methods.
The Annals of Statistics, 32(1):30–55, 2004.
Shie Mannor and Ron Meir. Weak learners and improved rates of convergence in boosting. In
Advances in Neural Information Processing Systems, 13, pages 280–286, 2001.
Shie Mannor, Ron Meir, and Tong Zhang. Greedy algorithms for classification – consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713–742, 2003.
Llew Mason, Jonathan Baxter, Peter L. Bartlett, and Marcus Frean. Boosting algorithms as gradient descent. In S.A. Solla, T.K. Leen, and K.-R. Muller, editors, Advances in Neural Information Processing Systems, 12, pages 512–518. MIT Press, 2000.
David Pollard. Convergence of Stochastic Processes. Springer-Verlag, New York, 1984.
David Pollard. Empirical Processes: Theory and Applications, volume 2 of NSF-CBMS Regional Conference Series in Probability and Statistics. Institute of Mathematical Statistics, Hayward, CA, 1990.
J. Ross Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 725–730, Menlo Park, CA, 1996. AAAI Press.
Lev Reyzin and Robert E. Schapire. How boosting the margin can also boost classifier complexity. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 753–760, New York, NY, USA, 2006. ACM Press. ISBN 1-59593-383-2. doi: http://doi.acm.org/10.1145/1143844.1143939.
Robert E. Schapire. The strength of weak learnability. Machine Learning, 5:197–227, 1990.
Robert E. Schapire, Yoav Freund, Peter L. Bartlett, and Wee Sun Lee. Boosting the margin: A
new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:1651–1686,
1998.
Xiaotong Shen, George C. Tseng, Xuegong Zhang, and Wing H. Wong. On ψ-learning. Journal of
the American Statistical Association, 98(463):724–734, 2003.
Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk
minimization. The Annals of Statistics, 32(1):56–85, 2004.
Tong Zhang and Bin Yu. Boosting with early stopping: convergence and consistency. The Annals
of Statistics, 33:1538–1579, 2005.