AdaBoost Is Consistent
Abstract
The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n^{1−ε} iterations (for sample size n and ε ∈ (0, 1)), the sequence of risks of the classifiers it produces approaches the Bayes risk.
Keywords: boosting, adaboost, consistency
1. Introduction
Boosting algorithms are an important recent development in classification. These algorithms belong to a group of voting methods (see, for example, Schapire, 1990; Freund, 1995; Freund and Schapire, 1996, 1997; Breiman, 1996, 1998) that produce a classifier as a linear combination of base or weak classifiers. While empirical studies show that boosting is one of the best off-the-shelf classification algorithms (see Breiman, 1998), theoretical results do not give a complete explanation of its effectiveness.
The first formulations of boosting by Schapire (1990), Freund (1995), and Freund and Schapire (1996, 1997) considered boosting as an iterative algorithm that is run for a fixed number of iterations; at every iteration it chooses one of the base classifiers and assigns a weight to it, and it eventually outputs the classifier that is the weighted majority vote of the chosen classifiers. Later, Breiman (1997, 1998, 2004) pointed out that boosting is a gradient-descent-type algorithm (see also Friedman et al., 2000; Mason et al., 2000).
Experimental results by Drucker and Cortes (1996), Quinlan (1996), Breiman (1998), Bauer and Kohavi (1999) and Dietterich (2000) showed that boosting is a very effective method that often leads to a low test error. It was also noted that boosting continues to decrease test error long after the sample error becomes zero: though it keeps adding more weak classifiers to the linear combination of classifiers, the generalization error, perhaps surprisingly, usually does not increase. However, some of the experiments suggested that there might be problems, since boosting performed worse than bagging in the presence of noise (Dietterich, 2000), and boosting concentrated not only on the "hard" areas, but also on outliers and noise (Bauer and Kohavi, 1999). And indeed, further experiments, for example by Friedman et al. (2000), Grove and Schuurmans (1998) and Mason et al. (2000), see also Bickel et al. (2006), as well as some theoretical results (for example, Jiang, 2002), showed that boosting, run for an arbitrarily large number of steps, overfits, though it takes a very long time to do so.
Upper bounds on the risk of boosted classifiers were obtained, based on the fact that boosting
tends to maximize the margin of the training examples (Schapire et al., 1998; Koltchinskii and
Panchenko, 2002), but Breiman (1999) pointed out that margin-based bounds do not completely
explain the success of boosting methods. In particular, these results do not resolve the issue of
consistency: they do not explain under which conditions we may expect the risk to converge to the
Bayes risk. Recent work by Reyzin and Schapire (2006) shows that while maximization of the margin is useful, it should not be done at the expense of classifier complexity.
Breiman (2004) showed that under some assumptions on the underlying distribution "population boosting" converges to the Bayes risk as the number of iterations goes to infinity. Since the population version assumes infinite sample size, this does not imply a similar result for AdaBoost, especially given the results of Jiang (2002) showing that there are examples where the prediction error of AdaBoost is asymptotically suboptimal at t = ∞ (t is the number of iterations).
Several authors have shown that modified versions of AdaBoost are consistent. These modifications include restricting the l_1-norm of the combined classifier (Mannor et al., 2003; Blanchard et al., 2003; Lugosi and Vayatis, 2004; Zhang, 2004), and restricting the step size of the algorithm (Zhang and Yu, 2005). Jiang (2004) analyses the unmodified boosting algorithm and proves a process consistency property, under certain assumptions. Process consistency means that there exists a sequence (t_n) such that if AdaBoost with sample size n is stopped after t_n iterations, its risk approaches the Bayes risk. However, Jiang also imposes strong conditions on the underlying distribution: the distribution of X (the predictor) has to be absolutely continuous with respect to Lebesgue measure and the function F_B(X) = (1/2) ln(P(Y = 1|X)/P(Y = −1|X)) has to be continuous on X. Also, Jiang's proof is not constructive and does not give any indication of when the algorithm should be stopped. Bickel et al. (2006) prove a consistency result for AdaBoost, under the assumption that the probability distribution is such that the steps taken by the algorithm are not too large. In this paper, we study stopping rules that guarantee consistency. In particular, we are interested in AdaBoost, not a modified version. Our main result (Corollary 9) demonstrates that a simple stopping rule suffices for consistency: the number of iterations is a fixed function of the sample size. We assume only that the class of base classifiers has finite VC-dimension, and that the span of this class is sufficiently rich. Both assumptions are clearly necessary.
2. Notation
Here we describe the AdaBoost procedure formulated as a coordinate descent algorithm and introduce definitions and notation. We consider a binary classification problem. We are given X, the measurable (feature) space, and Y = {−1, 1}, the set of (binary) labels. We are given a sample S_n = {(X_i, Y_i)}_{i=1}^n of i.i.d. observations distributed as the random variable (X, Y) ∼ P, where P is an unknown distribution. Our goal is to construct a classifier g_n : X → Y based on this sample. The quality of the classifier g_n is given by the misclassification probability

L(g_n) = P(g_n(X) ≠ Y | S_n).

Of course we want this probability to be as small as possible and close to the Bayes risk

L^⋆ = inf_g L(g) = E min{η(X), 1 − η(X)},

where the infimum is taken over all possible (measurable) classifiers and η(·) is the conditional probability

η(x) = P(Y = 1 | X = x).
The infimum above is achieved by the Bayes classifier g^⋆(x) = g(2η(x) − 1), where

g(x) = 1 if x > 0,  and  g(x) = −1 if x ≤ 0.
The AdaBoost algorithm builds a linear combination f of base classifiers from a class H by greedy stagewise minimization of the empirical exponential risk

(1/n) ∑_{i=1}^n exp(−Y_i f(X_i)).
Many of our results are applicable to a broader family of such algorithms, where the function α ↦ exp(−α) is replaced by another function ϕ. Thus, for a function ϕ : ℝ → ℝ₊, we define the empirical ϕ-risk and the ϕ-risk,

R_{ϕ,n}(f) = (1/n) ∑_{i=1}^n ϕ(Y_i f(X_i))   and   R_ϕ(f) = E ϕ(Y f(X)).

Clearly, the function ϕ needs to be appropriate for classification, in the sense that a measurable f that minimizes R_ϕ(f) should have minimal risk. This is equivalent (see Bartlett et al., 2006) to ϕ satisfying the following condition ('classification calibration'): for all 0 ≤ η ≤ 1, η ≠ 1/2,

inf_{α: α(2η−1) ≤ 0} ( η ϕ(α) + (1 − η) ϕ(−α) ) > inf_{α ∈ ℝ} ( η ϕ(α) + (1 − η) ϕ(−α) ).
The choice of γ < 1 in the algorithm allows approximate minimization at each step: the chosen direction need only achieve a fraction γ of the best possible decrease of the empirical ϕ-risk (the scheme is spelled out before Theorem 6). Notice that the original formulation of AdaBoost assumed exact minimization in (2), which corresponds to γ = 1.
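To make the coordinate-descent view concrete, the following is a minimal sketch (not the authors' code) of AdaBoost run over a finite pool of ±1-valued decision stumps, with exact minimization in the chosen direction (γ = 1); the helper names and the use of stumps as the base class H are our illustrative assumptions.

    import numpy as np

    def stump_pool(X, n_thresholds=10):
        """A hypothetical finite pool H of +/-1-valued decision stumps."""
        pool = []
        for j in range(X.shape[1]):
            for thr in np.quantile(X[:, j], np.linspace(0.05, 0.95, n_thresholds)):
                for sign in (1.0, -1.0):
                    pool.append((j, thr, sign))
        return pool

    def stump_predict(h, X):
        j, thr, sign = h
        return sign * np.where(X[:, j] > thr, 1.0, -1.0)

    def adaboost(X, y, pool, t_max):
        """t_max greedy coordinate-descent steps on (1/n) sum_i exp(-y_i f(x_i))."""
        n = len(y)
        w = np.full(n, 1.0 / n)              # weights proportional to exp(-y_i f(x_i))
        combination = []                     # list of (alpha_k, h_k)
        for _ in range(t_max):
            errors = [(w * (stump_predict(h, X) != y)).sum() for h in pool]
            k = int(np.argmin(errors))       # direction giving the largest risk decrease
            eps = errors[k]
            if eps <= 0.0 or eps >= 0.5:     # degenerate cases: stop early in this sketch
                break
            alpha = 0.5 * np.log((1.0 - eps) / eps)   # exact minimization over the step size
            combination.append((alpha, pool[k]))
            w *= np.exp(-alpha * y * stump_predict(pool[k], X))
            w /= w.sum()
        return combination

    def decision(combination, X):
        f = sum(alpha * stump_predict(h, X) for alpha, h in combination)
        return np.sign(f), f                 # the classifier g(f_t) and the combination f_t

With t_max chosen as in Corollary 9 below (t_n = n^{1−ε}), this is the procedure whose consistency is analyzed in this paper.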
For f in the linear span of H, define the norm

‖f‖_⋆ = inf { ∑_i |α_i| : f = ∑_i α_i h_i, h_i ∈ H },

and let F_λ = {f : ‖f‖_⋆ ≤ λ}; we write F^t for the class of linear combinations of at most t base classifiers from H. For l > 0, π_l denotes truncation to the interval [−l, l], π_l(x) = max{−l, min{l, x}}, and for a class of functions F we write

π_l ∘ F = { f̃ : f̃ = π_l(f), f ∈ F },    g ∘ F = { f̃ : f̃ = g(f), f ∈ F }.

Finally, for a functional Q and a direction h, the directional derivative is

Q'(f; h) = ∂Q(f + λh)/∂λ |_{λ=0}.
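As a small illustration of the directional derivative just defined, the sketch below (our notation and helper names) evaluates Q'(f; h) for the empirical exponential risk in closed form, −(1/n) ∑_i Y_i h(X_i) exp(−Y_i f(X_i)), and checks it against a finite-difference approximation.

    import numpy as np

    def exp_risk(margins):
        """Empirical exponential risk (1/n) sum_i exp(-m_i), m_i = y_i f(x_i)."""
        return np.mean(np.exp(-margins))

    def exp_risk_directional(margins, y, h_vals):
        """Q'(f; h): derivative of (1/n) sum_i exp(-y_i (f + lam h)(x_i)) at lam = 0."""
        return -np.mean(y * h_vals * np.exp(-margins))

    rng = np.random.default_rng(0)
    y = rng.choice([-1.0, 1.0], size=200)
    f_vals = rng.normal(size=200)                 # values f(x_i) of the current combination
    h_vals = rng.choice([-1.0, 1.0], size=200)    # values h(x_i) of a candidate direction
    margins = y * f_vals

    lam = 1e-6                                    # finite-difference check
    finite_diff = (exp_risk(y * (f_vals + lam * h_vals)) - exp_risk(margins)) / lam
    print(exp_risk_directional(margins, y, h_vals), finite_diff)   # the two values agree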
The key difficulty with this approach is that the concentration inequalities behind the uniform convergence results are valid only for a suitably small class of suitably bounded functions. However, boosting in general and AdaBoost in particular may produce functions that cannot be appropriately bounded. To circumvent this difficulty, we rely on the observation that, for the purposes of classification, we can replace the function f returned by AdaBoost by any function f′ that satisfies sign(f′) = sign(f). Therefore we consider the clipped version π_λ ∘ f_t of the function returned by AdaBoost after t iterations. This clipping ensures that the functions f_t are suitably bounded. Furthermore, the complexity of the clipped class (as measured by its pseudo-dimension; see Pollard, 1984) grows slowly with the stopping time t, so we can show that the ϕ-risk of a clipped function is not much larger than its empirical ϕ-risk. Lemma 4 provides the necessary details. In order to compare the empirical ϕ-risk of the clipped function to that of a suitable reference sequence f̄_n, we first use the fact that the empirical ϕ-risk of a clipped function π_λ ∘ f_t is not much larger than the empirical ϕ-risk of f_t.
The next step is to relate R_{ϕ,n}(f_t) to R_{ϕ,n}(f̄_n). The choice of a suitable sieve depends on what can be shown about the progress of the algorithm. We consider an increasing sequence of l_⋆-balls, and define f̄_n as the (near) minimizer of the ϕ-risk in the appropriate l_⋆-ball. Theorems 6 and 8 show that as the stopping time increases, the empirical ϕ-risk of the function returned by AdaBoost is not much larger than that of f̄_n. Finally, Hoeffding's inequality shows that the empirical ϕ-risks of the reference functions f̄_n are close to their ϕ-risks. Combining all the pieces, the ϕ-risk of π_λ ∘ f_t approaches R^⋆_ϕ, provided the stopping time increases suitably slowly with the sample size. The consistency of AdaBoost follows.
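The clipping step can be illustrated directly. The sketch below (names ours) truncates a stand-in for an AdaBoost output to [−λ, λ] and checks the two facts used above: clipping does not change the sign of the classifier, and for the exponential loss it increases the empirical ϕ-risk by at most ϕ_λ = e^{−λ} (cf. Lemma 14 in Appendix E).

    import numpy as np

    def clip(f_vals, lam):
        """The truncation pi_lambda: clip f to the interval [-lambda, lambda]."""
        return np.clip(f_vals, -lam, lam)

    rng = np.random.default_rng(1)
    y = rng.choice([-1.0, 1.0], size=1000)
    f_vals = rng.normal(scale=5.0, size=1000)   # stand-in for values of an AdaBoost output f_t
    lam = 2.0

    risk = np.mean(np.exp(-y * f_vals))                     # R_n(f_t)
    clipped_risk = np.mean(np.exp(-y * clip(f_vals, lam)))  # R_n(pi_lambda(f_t))

    assert np.all(np.sign(clip(f_vals, lam)) == np.sign(f_vals))   # same classifier
    assert clipped_risk <= risk + np.exp(-lam)                     # at most phi_lambda larger
    print(risk, clipped_risk)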
We now describe our assumptions. First, we shall impose the following condition.

Condition 1  lim_{λ→∞} inf_{f ∈ F_λ} R_ϕ(f) = inf_f R_ϕ(f) = R^⋆_ϕ, where the infimum on the right-hand side is taken over all measurable functions f.
For many classes H , the above condition is satisfied for all possible distributions P . Lugosi and
Vayatis (2004, Lemma 1) discuss sufficient conditions for Condition 1. As an example of such a
class, we can take the class of indicators of all rectangles or the class of indicators of half-spaces
defined by hyperplanes or the class of binary trees with the number of terminal nodes equal to
d + 1 (we consider trees with terminal nodes formed by successive univariate splits), where d is the
dimensionality of X (see Breiman, 2004).
The following set of conditions deals with uniform convergence and convergence of the boosting
algorithm. The main theorem (Theorem 1) shows that these, together with Condition 1, suffice for
consistency of the boosting procedure. Later in this section we show that the conditions are satisfied
by AdaBoost.
Condition 2  Let n be the sample size. Let there exist non-negative sequences t_n → ∞, ζ_n → ∞ and a sequence {f̄_n}_{n=1}^∞ of reference functions such that

R_ϕ(f̄_n) → R^⋆_ϕ  as n → ∞,

and the following hold:

(a)  sup_{f ∈ π_{ζ_n} ∘ F^{t_n}} |R_ϕ(f) − R_{ϕ,n}(f)| → 0  a.s.;   (3)

(b)  max{0, R_{ϕ,n}(f̄_n) − R_ϕ(f̄_n)} → 0  a.s.;   (4)

(c)  max{0, R_{ϕ,n}(f_{t_n}) − R_{ϕ,n}(f̄_n)} → 0  a.s.   (5)
Theorem 1  Assume ϕ is classification calibrated and convex, and for ϕ_λ = inf_{x∈[−λ,λ]} ϕ(x) assume, without loss of generality, that

lim_{λ→∞} ϕ_λ = inf_{x∈(−∞,∞)} ϕ(x) = 0.   (6)

Let Condition 2 be satisfied. Then the boosting procedure stopped at step t_n returns a sequence of classifiers f_{t_n} almost surely satisfying L(g(f_{t_n})) → L^⋆ as n → ∞.
Remark 2 Note that Condition (6) could be replaced by the mild condition that the function ϕ is
bounded below.
Proof  For almost every outcome ω on the probability space (Ω, S, P) we can define sequences ε^1_n(ω) → 0, ε^2_n(ω) → 0 and ε^3_n(ω) → 0 such that for almost all ω the following inequalities are true:

R_ϕ(π_{ζ_n}(f_{t_n})) ≤ R_{ϕ,n}(π_{ζ_n}(f_{t_n})) + ε^1_n(ω)                                      by (3)
                     ≤ R_{ϕ,n}(f_{t_n}) + ε^1_n(ω) + ϕ_{ζ_n}                                       (7)
                     ≤ R_{ϕ,n}(f̄_n) + ε^1_n(ω) + ϕ_{ζ_n} + ε^2_n(ω)                                by (5)
                     ≤ R_ϕ(f̄_n) + ε^1_n(ω) + ϕ_{ζ_n} + ε^2_n(ω) + ε^3_n(ω)                         by (4).   (8)

Inequality (7) follows from the convexity of ϕ(·) (see Lemma 14 in Appendix E). By (6) and the choice of the sequence {f̄_n}_{n=1}^∞ we have R_ϕ(f̄_n) → R^⋆_ϕ and ϕ_{ζ_n} → 0, and from (8) it follows that R_ϕ(π_{ζ_n}(f_{t_n})) → R^⋆_ϕ a.s. Eventually we can use the result by Bartlett et al. (2006, Theorem 3) to conclude that

L(g(π_{ζ_n}(f_{t_n}))) → L^⋆  a.s.

But for ζ_n > 0 we have g(π_{ζ_n}(f_{t_n})) = g(f_{t_n}), therefore

L(g(f_{t_n})) → L^⋆  a.s.

Hence, the boosting procedure is consistent if stopped after t_n steps.
The almost sure formulation of Condition 2 does not provide explicit rates of convergence of L(g(f_{t_n})) to L^⋆. However, a slightly stricter form of Condition 2, which allows these rates to be calculated, is considered in Appendix A.
In the following sections, we show that Condition 2 can be satisfied for some choices of ϕ. We shall treat parts (a)–(c) separately.
The proof of consistency is based on the following result, which builds on the result by Koltchinskii and Panchenko (2002) and resembles a lemma due to Lugosi and Vayatis (2004, Lemma 2).
Lemma 4  Let L_{ϕ,ζ} be the Lipschitz constant of ϕ on [−ζ, ζ]. Then for V = d_{VC}(H), c = 24 ∫_0^1 √(ln(8e/ε²)) dε and any n, ζ > 0 and t > 0,

E sup_{f ∈ π_ζ ∘ F^t} |R_ϕ(f) − R_{ϕ,n}(f)| ≤ c ζ L_{ϕ,ζ} √( (V + 1)(t + 1) log₂[2(t + 1)/ln 2] / n ).
Now, if we choose ζ and δ as functions of n such that ∑_{n=1}^∞ δ(n) < ∞ and the right-hand side of (9) converges to 0 as n → ∞, we can appeal to the Borel-Cantelli lemma and conclude that for such a choice of ζ_n and δ_n Condition 2(a) holds.
Lemma 4, unlike Lemma 2 of Lugosi and Vayatis (2004), allows us to choose the number of steps t, which governs the complexity of the linear combination of base functions, and this is essential for the proof of consistency. It is easy to see that for AdaBoost (i.e., ϕ(x) = e^{−x}) we can choose ζ = κ ln n and t = n^{1−ε} with κ > 0, ε ∈ (0, 1) and 2κ − ε < 0.
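For concreteness, this schedule can be computed as in the following sketch (names ours; the particular values of ε and κ are arbitrary choices satisfying the stated constraints, not values prescribed by the paper).

    import math

    def adaboost_schedule(n, eps=0.5, kappa=0.2):
        """Stopping time t_n = n^(1-eps) and clipping level zeta_n = kappa*ln(n)."""
        assert 0.0 < eps < 1.0 and kappa > 0.0 and 2.0 * kappa < eps   # 2*kappa - eps < 0
        t_n = int(n ** (1.0 - eps))
        zeta_n = kappa * math.log(n)
        return t_n, zeta_n

    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, adaboost_schedule(n))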
To show that Condition 2(b) is satisfied for a variety of loss functions we use Hoeffding's inequality.

Theorem 5  Define the variation of a function ϕ on the interval [−a, a] (for a > 0) as

V_{ϕ,a} = sup_{x∈[−a,a]} ϕ(x) − inf_{x∈[−a,a]} ϕ(x).

If a sequence {f̄_n}_{n=1}^∞ satisfies the condition f̄_n(x) ∈ [−λ_n, λ_n] for all x ∈ X, where λ_n > 0 is chosen so that V_{ϕ,λ_n} = o(√(n / ln n)), then

max{0, R_{ϕ,n}(f̄_n) − R_ϕ(f̄_n)} → 0   almost surely as n → ∞.   (10)
Proof  Since we restricted the range of f̄_n to the interval [−λ_n, λ_n], we have, almost surely, ϕ(Y f̄_n(X)) ∈ [a, b], where b − a ≤ V_{ϕ,λ_n}. Therefore Hoeffding's inequality guarantees that for all ε_n

P( R_{ϕ,n}(f̄_n) − R_ϕ(f̄_n) ≥ ε_n ) ≤ exp( −2 n ε_n² / V²_{ϕ,λ_n} ) = δ_n.

To prove the statement of the theorem we require ε_n = o(1) and ∑ δ_n < ∞. Then we appeal to the Borel-Cantelli lemma to conclude that (10) holds. These restrictions are satisfied if

V²_{ϕ,λ_n} = o( n / ln n ),

and the statement of the theorem follows.
The choice of λ_n in the above theorem depends on the loss function ϕ. In the case of the AdaBoost loss ϕ(x) = e^{−x} we shall choose λ_n = κ ln n, where κ ∈ (0, 1/2); then V_{ϕ,λ_n} ≤ e^{λ_n} = n^κ = o(√(n / ln n)). One way to guarantee that the functions f̄_n satisfy the condition f̄_n(x) ∈ [−λ_n, λ_n] for all x ∈ X is to choose f̄_n ∈ F_{λ_n}.
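For the exponential loss and λ_n = κ ln n, the quantities entering this Hoeffding argument can be tabulated as in the sketch below (names and the particular κ, ν values are ours, chosen only to satisfy the constraints above).

    import math

    def hoeffding_delta(n, kappa=0.2, nu=0.2):
        """Failure probability delta_n = exp(-2 n eps_n^2 / V^2) for phi(x) = exp(-x)."""
        lam = kappa * math.log(n)                  # lambda_n = kappa * ln(n)
        V = math.exp(lam) - math.exp(-lam)         # variation of exp(-x) on [-lambda_n, lambda_n]
        eps_n = n ** (-nu)                         # eps_n -> 0; need 2*nu + 2*kappa < 1 for summability
        return math.exp(-2.0 * n * eps_n ** 2 / V ** 2)

    for n in (10**2, 10**3, 10**4, 10**5):
        print(n, hoeffding_delta(n))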
Theorem 6  Let the function Q(f) be convex in f and twice differentiable in all directions h ∈ H. Let Q^⋆ = lim_{λ→∞} inf_{f∈F_λ} Q(f). Assume that for all c₁, c₂ such that Q^⋆ < c₁ < c₂ < ∞ there exist constants 0 < β ≤ B < ∞ (depending on c₁ and c₂) with

β ≤ Q''(f; h) ≤ B   for all h ∈ H and all f such that c₁ ≤ Q(f) ≤ c₂.

Also assume the following approximate minimization scheme for γ ∈ (0, 1]. Define f_{k+1} = f_k + α_{k+1} h_{k+1} such that

Q(f_{k+1}) ≤ γ inf_{h∈H, α∈ℝ} Q(f_k + αh) + (1 − γ) Q(f_k)

and

Q(f_{k+1}) = inf_{α∈ℝ} Q(f_k + α h_{k+1}).
Then for any reference function f̄ and the sequence of functions f_m produced by the boosting algorithm, the following bound holds for all m > 0 such that Q(f_m) > Q(f̄):

Q(f_m) ≤ Q(f̄) + √( 8 B³ (Q(f₀) − Q(f̄))² / (γ² β³) ) · [ ln( (ℓ₀² + c₃ m) / ℓ₀² ) ]^{−1/2},   (11)

where ℓ₀ = ‖f₀ − f̄‖_⋆ and c₃ = 2(Q(f₀) − Q(f̄)) / β.
Remark 7  Results in Zhang and Yu (2005, e.g., Lemma 4.1) provide similar bounds under either an assumption of a bounded step size of the boosting algorithm or a positive lower bound on Q''(f; h) for all f, h. Since we consider boosting algorithms with unrestricted step size, the only option would be to assume a positive lower bound on the second derivative. While such an assumption is fine for the quadratic loss ϕ(x) = x², the second derivative R''_n(f; h) of the empirical risk for the exponential loss used by the AdaBoost algorithm cannot be bounded from below by a positive constant in general. Theorem 6 makes the milder assumption that the second derivative is positive for all f such that R(f) > R^⋆ (respectively, R_n(f) > R^⋆_n).
It is easy to see that the theorem above applies to the AdaBoost algorithm, since there we first choose the direction (base classifier) h_i and then compute the step size α_i as

α_i = (1/2) ln( (1 − ε_i) / ε_i ) = (1/2) ln( (R(f_i) − R'(f_i; h_i)) / (R(f_i) + R'(f_i; h_i)) ).

Now we only have to recall that this value of α_i corresponds to exact minimization in the direction h_i.
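That this closed-form step size is the exact line-search minimizer is easy to check numerically; the sketch below (names and synthetic data ours) compares the empirical exponential risk along the chosen direction at the closed-form α and at nearby values.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 500
    y = rng.choice([-1.0, 1.0], size=n)
    f_vals = 0.3 * rng.normal(size=n)                  # values of the current combination f_i
    h_vals = np.where(rng.random(n) < 0.7, y, -y)      # a base classifier h_i, roughly 30% error

    w = np.exp(-y * f_vals)
    w /= w.sum()
    eps = (w * (h_vals != y)).sum()                    # weighted error eps_i of h_i
    alpha_closed = 0.5 * np.log((1.0 - eps) / eps)     # the AdaBoost step size

    def risk(a):
        return np.mean(np.exp(-y * (f_vals + a * h_vals)))

    for a in (alpha_closed - 0.05, alpha_closed, alpha_closed + 0.05):
        print(round(a, 4), risk(a))                    # the middle value is the smallest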
From now on we specialize to AdaBoost and use ϕ(x) = e^{−x}; hence we drop the subscript ϕ in R_{ϕ,n} and R_ϕ and write R_n and R respectively.
Theorem 6 allows us to obtain an upper bound on the difference between the exp-risk of the function output by AdaBoost and the exp-risk of an appropriate reference function. For brevity, in the next theorem we make the assumption R^⋆ > 0, though a similar result can be stated for R^⋆ = 0. For completeness, the corresponding theorem is given in Appendix D.
Theorem 8  Assume R^⋆ > 0. Let t_n be the number of steps we run AdaBoost. Let λ_n = κ ln n, κ ∈ (0, 1/2). Let a > 1 be an arbitrary fixed number. Let {f̄_n}_{n=1}^∞ be a sequence of functions such that f̄_n ∈ F_{λ_n}. Then with probability at least 1 − δ_n, where δ_n = exp( −2(R^⋆)² n^{1−2κ} / a² ), the following holds:

R_n(f_{t_n}) ≤ R_n(f̄_n) + ( 2√2 (1 − R^⋆(a − 1)/a) / ( γ ((a − 1) R^⋆ / a)^{3/2} ) ) · [ ln( (λ_n² + 2 t_n (a/(a − 1) − R^⋆)/R^⋆) / λ_n² ) ]^{−1/2}.
Proof  For the exponential loss and ±1-valued base classifiers, the second directional derivative of the empirical risk is R''_n(f; h) = (1/n) ∑_{i=1}^n h(X_i)² exp(−Y_i f(X_i)) = R_n(f), so all the conditions in Theorem 6 are satisfied as long as R_n(f̄_n) > 0 (with Q(f) replaced by R_n(f)), and in Equation (11) we have B = R_n(f₀) = 1, β ≥ R_n(f̄_n), and ‖f₀ − f̄_n‖_⋆ ≤ λ_n. Since for t such that R_n(f_t) ≤ R_n(f̄_n) the theorem is trivially true, we only have to notice that exp(−Y_i f̄_n(X_i)) ∈ [0, n^κ], hence Hoeffding's inequality guarantees that

P( (1/n) ∑_{i=1}^n e^{−Y_i f̄_n(X_i)} − E e^{−Y f̄_n(X)} ≤ −R^⋆/a ) ≤ exp( −2(R^⋆)² n^{1−2κ} / a² ) = δ_n,

where we choose and fix the constant a > 1 arbitrarily and independently of n and the sequence {f̄_n}_{n=1}^∞. Therefore, with probability at least 1 − δ_n the empirical risk can be bounded from below as R_n(f̄_n) ≥ R(f̄_n) − R^⋆/a ≥ R^⋆ − R^⋆/a = R^⋆(a − 1)/a, since R(f̄_n) ≥ R^⋆. Therefore β ≥ R^⋆(a − 1)/a, and the result follows immediately from Equation (11) if we use the fact that R^⋆ > 0.
If we now choose the reference functions f̄_n ∈ F_{λ_n} so that

R(f̄_n) ≤ inf_{f∈F_{λ_n}} R(f) + ε_n

for some ε_n → 0, then together with Condition 1 this will imply R(f̄_n) → R^⋆ as n → ∞, and Condition 2(c) follows.
Corollary 9  Assume that d_{VC}(H) < ∞,

lim_{λ→∞} inf_{f∈F_λ} R(f) = R^⋆,

and t_n = n^{1−ε} for ε ∈ (0, 1). Then AdaBoost stopped at step t_n returns a sequence of classifiers almost surely satisfying L(g(f_{t_n})) → L^⋆.
Proof  First assume L^⋆ > 0. For the exponential loss function this implies R^⋆ > 0. As was suggested after the proof of Lemma 4, we may choose ζ_n = κ ln n with 2κ − ε < 0 (which also implies κ < 1/2) to satisfy Condition 2(a). Recall that the discussion after the proof of Theorem 8 suggests choosing the sequence {f̄_n}_{n=1}^∞ of reference functions such that f̄_n ∈ F_{λ_n} and

R(f̄_n) ≤ inf_{f∈F_{λ_n}} R(f) + ε_n

for ε_n → 0 and λ_n = κ ln n with κ ∈ (0, 1/2), to ensure that Condition 2(c) holds. Finally, as follows from the discussion after the proof of Theorem 5, the choice of the sequence {f̄_n}_{n=1}^∞ to satisfy Condition 2(c) also ensures that Condition 2(b) holds. Since the function ϕ(x) = e^{−x} is clearly
classification calibrated and the conditions of this corollary include Condition 1, all the conditions of Theorem 1 hold and the consistency of the AdaBoost algorithm follows.
For L^⋆ = 0 the proof is similar, but we need to use Theorem 13 in Appendix D instead of Theorem 8.
4. Discussion
We showed that AdaBoost is consistent if stopped sufficiently early, after t_n iterations, for t_n = O(n^{1−ε}) with ε ∈ (0, 1). We do not know whether this number can be increased. Results by Jiang (2002) imply that for some X and function class H the AdaBoost algorithm will achieve zero training error after t_n steps, where n²/t_n = o(1) (see also work by Mannor and Meir (2001, Lemma 1) for an example of X = R^d and H = {linear classifiers}, for which perfect separation on the training sample is guaranteed after 8n² ln n iterations); hence, if run for that many iterations, the AdaBoost algorithm does not produce a consistent classifier. We do not know what happens in between O(n^{1−ε}) and O(n² ln n). Narrowing this gap is a subject of further research.
The AdaBoost algorithm, as well as other versions of the boosting procedure, replaces the 0-1 loss with a convex function ϕ to overcome algorithmic difficulties associated with the non-convex optimization problem. In order to conclude that R_ϕ(f_n) → R^⋆_ϕ implies L(g(f_n)) → L^⋆ we want ϕ to be classification calibrated, and this requirement cannot be relaxed, as shown by Bartlett et al. (2006).
The statistical part of the analysis, summarized in Lemma 4 and Theorem 5, works for quite an arbitrary loss function ϕ. The only restriction imposed by Lemma 4 is that ϕ must be Lipschitz on any compact set. This requirement is an artifact of our proof and is caused by the use of the "contraction principle". It can be relaxed in some cases: Shen et al. (2003) use the classification calibrated loss function
calibrated loss function
2 , x < 0,
ψ(x) = 1 − x , 0 ≤ x < 1,
0 , x ≥ 1,
lim_{λ→∞} inf_{f∈F_λ} R(f) = R^⋆

and t_n = n^{1−ε} for ε ∈ (0, 1). Then the boosting procedure stopped at step t_n returns a sequence of classifiers almost surely satisfying L(g(f_{t_n})) → L^⋆.
We cannot draw an analogous conclusion for other loss functions. For example, for the logit loss ϕ(x) = ln(1 + e^{−x}), Lemma 4 and Theorem 5 work, since L_{ϕ,λ} = 1 and M_{ϕ,λ} = ln(1 + e^λ), hence choosing t_n, λ_n, ζ_n as for either the exponential or quadratic losses will work. The assumption of Theorem 6 also holds, with R''_{ϕ,n}(f; h) ≥ R_{ϕ,n}(f)/n, though the resulting inequality is trivial: the factor 1/n in this bound precludes us from finding an analog of Theorem 8. A similar problem arises in the case of the modified quadratic loss ϕ(x) = [max(1 − x, 0)]², for which R''_{ϕ,n}(f; h) ≥ 2/n. Generally, any loss function with "really flat" regions may cause trouble. Another issue is the very slow rate of convergence in Theorems 6 and 8. Hence further research intended either to improve the convergence rates or to extend the applicability of these theorems to loss functions other than exponential and quadratic is desirable.
Acknowledgments
This work was supported by the NSF under award DMS-0434383. The authors would like to thank
Peter Bickel for useful discussions, as well as Jean-Philippe Vert and two anonymous referees for
their comments and suggestions.
Appendix A.

Condition 3  Let n be the sample size. Let there exist non-negative sequences t_n → ∞, ζ_n → ∞, sequences δ^j_n, j = 1, 2, 3, such that ∑_{i=1}^∞ δ^j_i < ∞, sequences ε^k_n → 0, k = 1, 2, 3, and a sequence {f̄_n}_{n=1}^∞ of reference functions such that

R_ϕ(f̄_n) → R^⋆_ϕ  as n → ∞.
Theorem 11  Assume ϕ is classification calibrated and convex, and for ϕ_λ = inf_{x∈[−λ,λ]} ϕ(x) assume, without loss of generality, that

lim_{λ→∞} ϕ_λ = inf_{x∈(−∞,∞)} ϕ(x) = 0.   (15)

Let Condition 3 be satisfied. Then the boosting procedure stopped at step t_n returns a sequence of classifiers f_{t_n} almost surely satisfying L(g(f_{t_n})) → L^⋆ as n → ∞.
Inequalities (16), (18) and (17) hold with probability at least 1 − δ^1_n, 1 − δ^2_n and 1 − δ^3_n respectively. We assumed in Condition 3 that R_ϕ(f̄_n) → R^⋆_ϕ, and (15) implies that ϕ_{ζ_n} → 0 by the choice of the sequence ζ_n. Now we appeal to the Borel-Cantelli lemma and arrive at R_ϕ(π_{ζ_n}(f_{t_n})) → R^⋆_ϕ a.s. Eventually we can use Theorem 3 by Bartlett et al. (2006) to conclude that

L(g(π_{ζ_n}(f_{t_n}))) → L^⋆  a.s.
We could prove Theorem 11 by using the Borel-Cantelli lemma and appealing to Theorem 1,
but the above proof allows the following corollary on the rate of convergence.
Corollary 12  Let the conditions of Theorem 11 be satisfied. Then there exists a non-decreasing function ψ, such that ψ(0) = 0, and with probability at least 1 − δ^1_n − δ^2_n − δ^3_n

L(g(f_{t_n})) − L^⋆ ≤ ψ^{−1}( (ε^1_n + ε^2_n + ε^3_n + ϕ_{ζ_n}) + inf_{f∈F_{λ_n}} R_ϕ(f) − R^⋆_ϕ ),   (19)

where

ψ(θ) = ϕ(0) − inf { ((1 + θ)/2) ϕ(α) + ((1 − θ)/2) ϕ(−α) : α ∈ ℝ }.
The proof of Theorem 11 shows that for the function f_{t_n}, with probability at least 1 − δ^1_n − δ^2_n,

R_ϕ(f_{t_n}) − inf_{f∈F_{λ_n}} R_ϕ(f) ≤ ε^1_n + ε^2_n + ε^3_n + ϕ_{ζ_n}.
The second term under ψ^{−1} in (19) is an approximation error and, in general, it may decrease arbitrarily slowly. However, if it is known that it decreases sufficiently fast, the first term becomes an issue. For example, Corollary 9, even if the approximation error decreases sufficiently fast, will give a convergence rate of the order O((ln n)^{−1/4}). This follows from Example 1 of Bartlett et al. (2006), where it is shown that for AdaBoost (the exponential loss function) ψ^{−1}(x) ≤ √(2x), and the fact that both ε^1_n and ε^2_n, as well as ϕ_{ζ_n}, in Corollary 9 decrease at the rate O(n^{−α}) (in fact, the α's might be different for all three of them); hence everything is dominated by ε^3_n, which is O((ln n)^{−1/2}).
Then for V = d_{VC}(H), c = 24 ∫_0^1 √(ln(8e/ε²)) dε and any n, ζ > 0 and t > 0,

E sup_{f ∈ π_ζ ∘ F^t} |R_ϕ(f) − R_{ϕ,n}(f)| ≤ c ζ L_{ϕ,ζ} √( (V + 1)(t + 1) log₂[2(t + 1)/ln 2] / n ).
where σ_i are i.i.d. with P(σ_i = 1) = P(σ_i = −1) = 1/2. Then we use the "contraction principle" (see Ledoux and Talagrand, 1991, Theorem 4.12, pp. 112–113) with the function ψ(x) = (ϕ(x) − ϕ(0))/L_{ϕ,ζ} to get

E sup_{f ∈ π_ζ ∘ F^t} |R_ϕ(f) − R_{ϕ,n}(f)| ≤ 4 L_{ϕ,ζ} E sup_{f ∈ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n (−σ_i Y_i f(X_i)) |
                                           = 4 L_{ϕ,ζ} E sup_{f ∈ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n σ_i f(X_i) |.

Next we proceed to bound the supremum. Notice that functions in π_ζ ∘ F^t are bounded, clipped to absolute value at most ζ, therefore we can rescale π_ζ ∘ F^t by (2ζ)^{−1} and get

E sup_{f ∈ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n σ_i f(X_i) | = 2ζ E sup_{f ∈ (2ζ)^{−1} ∘ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n σ_i f(X_i) |.

Next, we use Dudley's entropy integral (Dudley, 1999) to bound the right-hand side above:

E sup_{f ∈ (2ζ)^{−1} ∘ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n σ_i f(X_i) | ≤ (12/√n) ∫_0^∞ √( ln N(ε, (2ζ)^{−1} ∘ π_ζ ∘ F^t, L_2(P_n)) ) dε.

Since, for ε > 1, the covering number N is 1, the upper integration limit can be taken as 1, and we can use Pollard's bound (Pollard, 1990) for F ⊆ [0, 1]^X,

N(ε, F, L_2(P)) ≤ 2 (4e/ε²)^{d_P(F)},

where d_P(F) is the pseudo-dimension, and obtain, for c̃ = 12 ∫_0^1 √(ln(8e/ε²)) dε,

E sup_{f ∈ (2ζ)^{−1} ∘ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n σ_i f(X_i) | ≤ c̃ √( d_P((2ζ)^{−1} ∘ π_ζ ∘ F^t) / n ).

Also notice that the constant c̃ does not depend on F^t or ζ. Next, since (2ζ)^{−1} ∘ π_ζ is non-decreasing, we use the inequality d_P((2ζ)^{−1} ∘ π_ζ ∘ F^t) ≤ d_P(F^t) (for example, Anthony and Bartlett, 1999, Theorem 11.3) to obtain

E sup_{f ∈ (2ζ)^{−1} ∘ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n σ_i f(X_i) | ≤ c̃ √( d_P(F^t) / n ).

And then, since Lemma 3 gives an upper bound on the pseudo-dimension of the class F^t, we have

E sup_{f ∈ π_ζ ∘ F^t} | (1/n) ∑_{i=1}^n σ_i f(X_i) | ≤ c ζ √( (V + 1)(t + 1) log₂[2(t + 1)/ln 2] / n ),
with the constant c above being independent of H, t and ζ. To prove the second statement we use McDiarmid's bounded difference inequality (Devroye et al., 1996, Theorem 9.2, p. 136), since for all i ∈ {1, . . . , n}

sup_{(x_j,y_j)_{j=1}^n, (x'_i,y'_i)} | sup_{f ∈ π_ζ ∘ F^t} |R_ϕ(f) − R_{ϕ,n}(f)| − sup_{f ∈ π_ζ ∘ F^t} |R_ϕ(f) − R'_{ϕ,n}(f)| | ≤ M_{ϕ,ζ} / n,

where R'_{ϕ,n}(f) is obtained from R_{ϕ,n}(f) by replacing the pair (x_i, y_i) with an independent pair (x'_i, y'_i). This completes the proof of the lemma.
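The quantity controlled in this proof is a Rademacher-type average. As a simplified illustration (names ours, and over a small finite pool of stumps rather than the full clipped class π_ζ ∘ F^t), it can be estimated by Monte Carlo as follows; for a finite class it is of order √(ln|H| / n).

    import numpy as np

    rng = np.random.default_rng(3)
    n, n_rounds = 200, 500
    X = rng.normal(size=(n, 2))
    thresholds = np.quantile(X[:, 0], np.linspace(0.05, 0.95, 19))
    H = np.array([np.where(X[:, 0] > t, 1.0, -1.0) for t in thresholds])   # |H| x n matrix of h(X_i)

    sups = []
    for _ in range(n_rounds):
        sigma = rng.choice([-1.0, 1.0], size=n)
        sups.append(np.max(np.abs(H @ sigma)) / n)     # sup_h |(1/n) sum_i sigma_i h(X_i)|
    print(np.mean(sups))                               # Monte Carlo estimate of the expectation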
Then for any reference function f̄ and the sequence of functions f_m produced by the boosting algorithm, the following bound holds for all m > 0 such that Q(f_m) > Q(f̄):

Q(f_m) ≤ Q(f̄) + √( 8 B³ (Q(f₀) − Q(f̄))² / (γ² β³) ) · [ ln( (ℓ₀² + c₃ m) / ℓ₀² ) ]^{−1/2}.
Proof  The statement of the theorem is a version of a result implicit in the proof of Theorem 1 by Bickel et al. (2006). If for some m we have Q(f_m) ≤ Q(f̄), then the theorem is trivially true for all m′ ≥ m. Therefore, we are going to consider only the case when Q(f_m) > Q(f̄). We shall also assume Q(f_{m+1}) ≥ Q(f̄) (the impact of this assumption will be discussed later). Define ε_m = Q(f_m) − Q(f̄). By convexity of Q(·),

|Q'(f_m; f_m − f̄)| ≥ ε_m.   (20)

Let f_m − f̄ = ∑_i α̃_i h̃_i, where α̃_i and h̃_i correspond to the best representation (with the l_1-norm of α̃ equal to the l_⋆-norm). Then from (20) and linearity of the derivative we have

ε_m ≤ ∑_i α̃_i Q'(f_m; h̃_i) ≤ sup_{h∈H} |Q'(f_m; h)| ∑_i |α̃_i|,
therefore

sup_{h∈H} |Q'(f_m; h)| ≥ ε_m / ‖f_m − f̄‖_⋆ = ε_m / ℓ_m.   (21)

Next,

Q(f_m + αh_m) = Q(f_m) + α Q'(f_m; h_m) + (1/2) α² Q''(f̃_m; h_m),

where f̃_m = f_m + α̃_m h_m for some α̃_m ∈ [0, α_m]. By assumption f̃_m is on the path from f_m to f_{m+1}, and we have assumed exact minimization in the given direction, hence f_{m+1} is the lowest point in the direction h_m starting from f_m, so we have the following bounds:

Q(f_{m+1}) ≥ Q(f_m) + inf_{α∈ℝ} ( α Q'(f_m; h_m) + (1/2) α² β ) = Q(f_m) − |Q'(f_m; h_m)|² / (2β).   (22)

On the other hand,

Q(f_m + α_m h_m) ≤ γ inf_{h∈H, α∈ℝ} Q(f_m + αh) + (1 − γ) Q(f_m)
                ≤ γ inf_{h∈H, α∈ℝ} ( Q(f_m) + α Q'(f_m; h) + (1/2) α² B ) + (1 − γ) Q(f_m)
                = Q(f_m) − γ sup_{h∈H} |Q'(f_m; h)|² / (2B).   (23)

Therefore, combining (22) and (23), we get

|Q'(f_m; h_m)| ≥ √(γβ/B) · sup_{h∈H} |Q'(f_m; h)|.   (24)

Another Taylor expansion, this time around f_{m+1} (and we again use the fact that f_{m+1} is the minimum on the path from f_m), gives us

Q(f_m) = Q(f_{m+1}) + (1/2) α_m² Q''(f̃̃_m; h_m),   (25)

where f̃̃_m is some (other) function on the path from f_m to f_{m+1}. Therefore, if |α_m| < √γ |Q'(f_m; h_m)| / B, then

Q(f_m) − Q(f_{m+1}) < γ |Q'(f_m; h_m)|² / (2B),

but by (23)

Q(f_m) − Q(f_{m+1}) ≥ γ sup_{h∈H} |Q'(f_m; h)|² / (2B) ≥ γ |Q'(f_m; h_m)|² / (2B),

therefore we conclude, by combining (24) and (21), that

|α_m| ≥ √γ |Q'(f_m; h_m)| / B ≥ γ √β sup_{h∈H} |Q'(f_m; h)| / B^{3/2} ≥ γ ε_m √β / (ℓ_m B^{3/2}).   (26)
Recall that

‖f_m − f̄‖_⋆ ≤ ‖f_{m−1} − f̄‖_⋆ + |α_{m−1}| ≤ ‖f₀ − f̄‖_⋆ + ∑_{i=0}^{m−1} |α_i|
            ≤ ‖f₀ − f̄‖_⋆ + √m ( ∑_{i=0}^{m−1} α_i² )^{1/2},
therefore, combining with (27) and (26), since the sequence ε_i is decreasing,

(2/β) (Q(f₀) − Q(f̄)) ≥ ∑_{i=0}^m α_i²
                      ≥ (γ²β/B³) ∑_{i=0}^m ε_i² / ℓ_i²
                      ≥ (γ²β/B³) ε_m² ∑_{i=0}^m 1 / ( ℓ₀ + √i ( ∑_{j=0}^{i−1} α_j² )^{1/2} )²
                      ≥ (γ²β/B³) ε_m² ∑_{i=0}^m 1 / ( ℓ₀ + √i ( 2(Q(f₀) − Q(f̄))/β )^{1/2} )²
                      ≥ (γ²β/(2B³)) ε_m² ∑_{i=0}^m 1 / ( ℓ₀² + i · 2(Q(f₀) − Q(f̄))/β ).

Since

∑_{i=0}^m 1/(a + bi) ≥ ∫_0^{m+1} dx/(a + bx) = (1/b) ln( (a + b(m + 1)) / a ),

then

(2/β) (Q(f₀) − Q(f̄)) ≥ ( γ²β² / (4B³ (Q(f₀) − Q(f̄))) ) ε_m² ln( (ℓ₀² + (2(Q(f₀) − Q(f̄))/β)(m + 1)) / ℓ₀² ).

Therefore

ε_m ≤ √( 8B³ (Q(f₀) − Q(f̄))² / (γ²β³) ) · [ ln( (ℓ₀² + (2(Q(f₀) − Q(f̄))/β)(m + 1)) / ℓ₀² ) ]^{−1/2}.   (28)
The proof of the above inequality for index m works as long as Q(f_{m+1}) ≥ Q(f̄). If f̄ is such that Q(f_m) ≥ Q(f̄) for all m, then we do not need to do anything else. However, if there exists m₀ such that Q(f_{m₀}) < Q(f̄) and Q(f_{m₀−1}) ≥ Q(f̄), then the above proof is not valid for index m₀ − 1. To overcome this difficulty, we notice that Q(f_{m₀−1}) is bounded from above by Q(f_{m₀−2}); therefore, to get a bound that holds for all m (except for m = 0), we may use the bound for ε_{m−1} to bound Q(f_m) − Q(f̄) = ε_m: shift (decrease) the index m on the right-hand side of (28) by one. This completes the proof of the theorem.
R_n(f_{t_n}) ≤ R_n(f̄_n) + √( 16 R_n³(f₀) |R_n(f₀) − R_n(f̄_n)|² (ln n)^{3κ} / (Cγ²) )
               × [ ln( ( (κ ln ln n)² + 4 |R_n(f₀) − R_n(f̄_n)| (ln n)^κ t_n / C ) / (κ ln ln n)² ) ]^{−1/2}.
Proof  For the exponential loss, the assumption R^⋆ = 0 is equivalent to L^⋆ = 0. It also implies that the function τ : λ ↦ inf_{f∈F_λ} R(f) decreases no faster than O(e^{−λ}). To see this, assume that for some λ there exists f ∈ F_λ such that L(g(f)) = 0 (i.e., we have achieved perfect classification). Clearly, for any a > 0,

R(af) = E e^{−Y a f(X)} = E ( e^{−Y f(X)} )^a ≥ ( inf_{x,y} e^{−y f(x)} )^a.

Therefore, choose λ_n = κ ln ln n. Then inf_{f∈F_{λ_n}} R(f) ≥ C (ln n)^{−κ}, where C depends on H and P, but does not depend on n. On the other hand, Hoeffding's inequality for f̄_n ∈ F_{λ_n} guarantees that
P( R_n(f̄_n) − R(f̄_n) ≤ −ε_n ) ≤ exp( −2 n ε_n² / (ln n)^{2κ} ) = δ_n.

The choice of ε_n = n^{−ν} for ν ∈ (0, 1/2) ensures that δ_n → 0. This allows us to conclude that with probability at least 1 − δ_n the empirical risk can be lower bounded as

R_n(f̄_n) ≥ R(f̄_n) − ε_n ≥ C (ln n)^{−κ} − n^{−ν} ≥ C / (2 (ln n)^κ)

for all sufficiently large n, so that in Theorem 6 we may take

β ≥ C / (2 (ln n)^κ).

Since for f̄_n such that R_n(f̄_n) > R_n(f₀) the theorem trivially holds, we only have to plug R_n(f̄_n) = 0, B = R_n(f₀) and β = C (ln n)^{−κ}/2 into (11) to get the statement of the theorem. Obviously, this bound also holds for R^⋆ > 0.
Appendix E.
Lemma 14  Let the function ϕ : ℝ → ℝ₊ ∪ {0} be convex. Then for any λ > 0 and any x ∈ ℝ,

ϕ(π_λ(x)) ≤ ϕ(x) + ϕ_λ,   where ϕ_λ = inf_{y∈[−λ,λ]} ϕ(y).

Proof  If x ∈ [−λ, λ] then the statement of the lemma is clearly true, since ϕ(π_λ(x)) = ϕ(x) and ϕ_λ ≥ 0. Without loss of generality assume x > λ; the case x < −λ is similar. Then we have two possibilities.

1. ϕ(x) ≥ ϕ(λ). Then ϕ(π_λ(x)) = ϕ(λ) ≤ ϕ(x) ≤ ϕ(x) + ϕ_λ.

2. ϕ(x) < ϕ(λ). Due to convexity, for any z < λ we have ϕ(z) > ϕ(λ), therefore ϕ_λ = ϕ(λ), and ϕ(π_λ(x)) = ϕ(λ) = ϕ_λ ≤ ϕ(x) + ϕ_λ, since ϕ(x) ≥ 0.
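The clipping inequality is also easy to verify numerically; the sketch below (our illustrative check, not part of the proof) tests it on a grid for the exponential loss and for the modified quadratic loss mentioned in the Discussion.

    import numpy as np

    losses = {
        "exponential":        lambda x: np.exp(-x),
        "modified quadratic": lambda x: np.maximum(1.0 - x, 0.0) ** 2,
    }
    lam = 2.0
    xs = np.linspace(-10.0, 10.0, 4001)
    for name, phi in losses.items():
        clipped = np.clip(xs, -lam, lam)
        phi_lam = phi(clipped).min()                       # inf of phi over [-lambda, lambda]
        assert np.all(phi(clipped) <= phi(xs) + phi_lam + 1e-12)
        print(name, "ok, phi_lambda =", phi_lam)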
References
Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Discussion of boosting papers. The
Annals of Statistics, 32(1):85–91, 2004.
Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk
bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging,
boosting and variants. Machine Learning, 36:105–139, 1999.
Peter J. Bickel, Ya’acov Ritov, and Alon Zakai. Some theory for generalized boosting algorithms.
Journal of Machine Learning Research, 7:705–732, May 2006.
Gilles Blanchard, Gábor Lugosi, and Nicolas Vayatis. On the rate of convergence of regularized
boosting classifiers. Journal of Machine Learning Research, 4:861–894, 2003.
Leo Breiman. Arcing the edge. Technical Report 486, Department of Statistics, University of
California, Berkeley, 1997.
Leo Breiman. Prediction games and arcing algorithms. Neural Computation, 11:1493–1517, 1999.
(Was Department of Statistics, U.C. Berkeley Technical Report 504, 1997).
Leo Breiman. Arcing classifiers (with discussion). The Annals of Statistics, 26(3):801–849, 1998.
(Was Department of Statistics, U.C. Berkeley Technical Report 460, 1996).
Leo Breiman. Population theory for predictor ensembles. The Annals of Statistics, 32(1):1–11,
2004. (See also Department of Statistics, U.C. Berkeley Technical Report 579, 2000).
Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.
Harris Drucker and Corinna Cortes. Boosting decision trees. In D.S. Touretzky, M.C. Mozer, and
M.E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 479–485.
M.I.T. Press, 1996.
Richard M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, Cambridge,
MA, 1999.
Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121:
256–285, 1995.
Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148–156, San Francisco, 1996. Morgan Kaufmann.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: A statistical
view of boosting. The Annals of Statistics, 28:337–407, 2000.
Adam J. Grove and Dale Schuurmans. Boosting in the limit: Maximizing the margin of learned
ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages
692–699, Menlo Park, CA, 1998. AAAI Press.
Wenxin Jiang. On weak base hypotheses and their implications for boosting regression and classification. The Annals of Statistics, 30:51–73, 2002.
Wenxin Jiang. Process consistency for AdaBoost. The Annals of Statistics, 32(1):13–29, 2004.
Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the
generalization error of combined classifiers. The Annals of Statistics, 30:1–50, 2002.
Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. Springer-Verlag, New York,
1991.
Gábor Lugosi and Nicolas Vayatis. On the Bayes-risk consistency of regularized boosting methods.
The Annals of Statistics, 32(1):30–55, 2004.
Shie Mannor and Ron Meir. Weak learners and improved rates of convergence in boosting. In
Advances in Neural Information Processing Systems, 13, pages 280–286, 2001.
Shie Mannor, Ron Meir, and Tong Zhang. Greedy algorithms for classification – consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713–742, 2003.
Llew Mason, Jonathan Baxter, Peter L. Bartlett, and Marcus Frean. Boosting algorithms as gradient descent. In S.A. Solla, T.K. Leen, and K.-R. Muller, editors, Advances in Neural Information Processing Systems, 12, pages 512–518. MIT Press, 2000.
David Pollard. Convergence of Stochastic Processes. Springer-Verlag, New York, 1984.
David Pollard. Empirical Processes: Theory and Applications, volume 2 of NSF-CBMS Regional Conference Series in Probability and Statistics. Institute of Mathematical Statistics, Hayward, CA, 1990.
J. Ross Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 725–730, Menlo Park, CA, 1996. AAAI Press.
Lev Reyzin and Robert E. Schapire. How boosting the margin can also boost classifier complexity. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 753–760, New York, NY, USA, 2006. ACM Press. ISBN 1-59593-383-2. doi: http://doi.acm.org/10.1145/1143844.1143939.
Robert E. Schapire. The strength of weak learnability. Machine Learning, 5:197–227, 1990.
Robert E. Schapire, Yoav Freund, Peter L. Bartlett, and Wee Sun Lee. Boosting the margin: A
new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:1651–1686,
1998.
Xiaotong Shen, George C. Tseng, Xuegong Zhang, and Wing H. Wong. On ψ-learning. Journal of
the American Statistical Association, 98(463):724–734, 2003.
Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk
minimization. The Annals of Statistics, 32(1):56–85, 2004.
Tong Zhang and Bin Yu. Boosting with early stopping: convergence and consistency. The Annals
of Statistics, 33:1538–1579, 2005.