Foundations of Machine Learning Lecture 6
Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Boosting
Weak Learning
(Kearns and Valiant, 1994)
Definition: concept class $C$ is weakly PAC-learnable if there exists a (weak) learning algorithm $L$ and $\gamma > 0$ such that for all $c \in C$, all $\delta > 0$, and all distributions $D$,

$$\Pr_{S \sim D}\Big[R(h_S) \le \tfrac{1}{2} - \gamma\Big] \ge 1 - \delta,$$

for samples $S$ of size $m = \mathrm{poly}(1/\delta)$ for a fixed polynomial.
Boosting Ideas
Finding simple, relatively accurate base classifiers is often not hard: weak learner. Main ideas:
- use the weak learner to create a strong learner. How?
- combine the base classifiers returned by the weak learner (ensemble method). But how should the base classifiers be combined?
AdaBoost - Preliminaries
(Freund and Schapire, 1997)

Family of base classifiers: $H \subseteq \{-1, +1\}^X$.

Iterative algorithm ($T$ rounds): at each round $t \in [1, T]$, maintains a distribution $D_t$ over the training sample and a convex combination hypothesis $f_t = \sum_{s=1}^{t} \alpha_s h_s$, with $\alpha_s \ge 0$ and $h_s \in H$. Weighted error of hypothesis $h$ at round $t$:

$$\Pr_{i \sim D_t}[h(x_i) \ne y_i] = \sum_{i=1}^{m} D_t(i)\, 1_{h(x_i) \ne y_i}.$$
AdaBoost
$H \subseteq \{-1, +1\}^X$.

(Freund and Schapire, 1997)

AdaBoost(S = ((x_1, y_1), ..., (x_m, y_m)))
  for i ← 1 to m do
      D_1(i) ← 1/m
  for t ← 1 to T do
      h_t ← base classifier in H with small error ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i]
      α_t ← ½ log((1 − ε_t)/ε_t)
      Z_t ← 2[ε_t(1 − ε_t)]^{1/2}      (normalization factor)
      for i ← 1 to m do
          D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
  f ← Σ_{t=1}^T α_t h_t
  return h = sgn(f)
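As a companion to the pseudocode, here is a minimal NumPy sketch of AdaBoost, assuming the finite pool of base classifiers is given through its matrix of predictions on the sample (all names, e.g. `adaboost` and `base_preds`, are illustrative, not from the lecture):

```python
import numpy as np

def adaboost(base_preds, y, T):
    """Sketch of AdaBoost for a finite pool of base classifiers.

    base_preds: (n_classifiers, m) array of {-1,+1} predictions on the sample.
    y:          (m,) array of labels in {-1,+1}.
    """
    m = y.shape[0]
    D = np.full(m, 1.0 / m)                       # D_1: uniform distribution
    alphas, chosen = [], []
    for t in range(T):
        # weighted error of every base classifier under D_t
        errs = ((base_preds != y) * D).sum(axis=1)
        k = int(np.argmin(errs))                  # base classifier with smallest error
        eps = errs[k]
        if eps == 0.0 or eps >= 0.5:              # degenerate rounds: stop early
            break
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t
        D *= np.exp(-alpha * y * base_preds[k])   # unnormalized D_{t+1}
        D /= D.sum()                              # dividing by Z_t
        alphas.append(alpha)
        chosen.append(k)
    f = (np.array(alphas)[:, None] * base_preds[chosen]).sum(axis=0)
    return np.sign(f), alphas, chosen
```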
Notes
Distributions $D_t$ over the training sample: originally uniform; at each round, the weight of a misclassified example is increased.

Observation: $D_{t+1}(i) = \dfrac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$, since

$$D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}
= \frac{D_{t-1}(i)\, e^{-\alpha_{t-1} y_i h_{t-1}(x_i)}\, e^{-\alpha_t y_i h_t(x_i)}}{Z_{t-1} Z_t}
= \frac{1}{m}\,\frac{e^{-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)}}{\prod_{s=1}^{t} Z_s}.$$
Weight assigned to base classifier $h_t$: $\alpha_t$ directly depends on the accuracy of $h_t$ at round $t$.
Illustration
[Figure: illustration of AdaBoost over rounds t = 1, 2, 3, ...; the classifiers obtained at successive rounds are combined with weights α_1, α_2, α_3, ....]
Bound on Empirical Error
(Freund and Schapire, 1997)

Theorem: the empirical error of the classifier output by AdaBoost verifies:

$$\widehat{R}(h) \le \exp\Big(-2 \sum_{t=1}^{T} \big(\tfrac{1}{2} - \epsilon_t\big)^2\Big).$$

If further, for all $t \in [1, T]$, $\gamma \le (\tfrac{1}{2} - \epsilon_t)$, then

$$\widehat{R}(h) \le \exp(-2\gamma^2 T).$$

$\gamma$ does not need to be known in advance: adaptive boosting.
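A quick numeric illustration of the theorem with made-up round errors: the training error is at most $\prod_t Z_t = \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)}$, which is in turn bounded by the two exponential expressions above (a sketch, not lecture code):

```python
import numpy as np

eps = np.array([0.30, 0.35, 0.40, 0.25, 0.38])   # hypothetical weighted errors eps_t
gamma = (0.5 - eps).min()                        # gamma <= 1/2 - eps_t for all t

prod_Z = np.prod(2 * np.sqrt(eps * (1 - eps)))   # product of the normalizers Z_t
bound1 = np.exp(-2 * np.sum((0.5 - eps) ** 2))   # exp(-2 sum_t (1/2 - eps_t)^2)
bound2 = np.exp(-2 * gamma ** 2 * len(eps))      # exp(-2 gamma^2 T)

print(prod_Z, bound1, bound2)                    # increasing sequence of bounds
assert prod_Z <= bound1 <= bound2
```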
Proof: since, as we saw, $D_{t+1}(i) = \dfrac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$,

$$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le 0}
\le \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i))
= \frac{1}{m} \sum_{i=1}^{m} \Big[ m \prod_{t=1}^{T} Z_t \Big] D_{T+1}(i)
= \prod_{t=1}^{T} Z_t.$$

Now, since $Z_t$ is a normalization factor,

$$Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}
= \sum_{i: y_i h_t(x_i) \ge 0} D_t(i)\, e^{-\alpha_t} + \sum_{i: y_i h_t(x_i) < 0} D_t(i)\, e^{\alpha_t}
= (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t}
= (1 - \epsilon_t) \sqrt{\tfrac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\tfrac{1 - \epsilon_t}{\epsilon_t}}
= 2\sqrt{\epsilon_t (1 - \epsilon_t)}.$$
Thus,

$$\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)}
= \prod_{t=1}^{T} \sqrt{1 - 4\big(\tfrac{1}{2} - \epsilon_t\big)^2}
\le \prod_{t=1}^{T} \exp\Big(-2\big(\tfrac{1}{2} - \epsilon_t\big)^2\Big)
= \exp\Big(-2 \sum_{t=1}^{T} \big(\tfrac{1}{2} - \epsilon_t\big)^2\Big).$$

Notes:
- $\alpha_t$ is the minimizer of $\alpha \mapsto (1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}$.
- since $(1 - \epsilon_t)\, e^{-\alpha_t} = \epsilon_t\, e^{\alpha_t}$, at each round, AdaBoost assigns the same probability mass to correctly classified and misclassified instances.
- for base classifiers taking values in $[-1, +1]$, $\alpha_t$ can be similarly chosen to minimize $Z_t$.
AdaBoost = Coordinate Descent
Objective Function: convex and differentiable.

$$F(\alpha) = \sum_{i=1}^{m} e^{-y_i f(x_i)} = \sum_{i=1}^{m} \exp\Big(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\Big).$$

[Figure: the exponential loss $x \mapsto e^{-x}$ is an upper bound on the zero-one loss.]
Direction: unit vector $\mathbf{e}_t$ with

$$\mathbf{e}_t = \operatorname*{argmin}_{t}\ \frac{dF(\alpha_{t-1} + \eta\, \mathbf{e}_t)}{d\eta}\Big|_{\eta = 0}.$$

Since $F(\alpha_{t-1} + \eta\, \mathbf{e}_t) = \sum_{i=1}^{m} \exp\big(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i) - \eta\, y_i h_t(x_i)\big)$,

$$\frac{dF(\alpha_{t-1} + \eta\, \mathbf{e}_t)}{d\eta}\Big|_{\eta = 0}
= -\sum_{i=1}^{m} y_i h_t(x_i) \exp\Big(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)\Big)
= -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i) \Big[ m \prod_{s=1}^{t-1} Z_s \Big]$$
$$= -\big[(1 - \epsilon_t) - \epsilon_t\big]\, m \prod_{s=1}^{t-1} Z_s
= \big[2\epsilon_t - 1\big]\, m \prod_{s=1}^{t-1} Z_s.$$

Thus, the direction corresponds to the base classifier with smallest error.
Step size: obtained by setting $\dfrac{dF(\alpha_{t-1} + \eta\, \mathbf{e}_t)}{d\eta} = 0$:

$$-\sum_{i=1}^{m} y_i h_t(x_i) \exp\Big(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)\Big) e^{-\eta y_i h_t(x_i)} = 0$$
$$\iff -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i) \Big[ m \prod_{s=1}^{t-1} Z_s \Big] e^{-\eta y_i h_t(x_i)} = 0$$
$$\iff -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i)\, e^{-\eta y_i h_t(x_i)} = 0$$
$$\iff -\big[(1 - \epsilon_t)\, e^{-\eta} - \epsilon_t\, e^{\eta}\big] = 0$$
$$\iff \eta = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}.$$

Thus, the step size matches the base classifier weight of AdaBoost.
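A small numeric check (illustrative, not from the slides) that the step size minimizing the exponential loss along the chosen coordinate coincides with AdaBoost's weight $\alpha_t = \tfrac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$; the use of scipy here is an assumption of this sketch:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
m = 200
y = rng.choice([-1, 1], size=m)
h = np.where(rng.random(m) < 0.7, y, -y)      # some base classifier, roughly 30% error
D = rng.random(m)
D /= D.sum()                                  # current distribution D_t over the sample

eps = D[h != y].sum()                          # weighted error eps_t
alpha_adaboost = 0.5 * np.log((1 - eps) / eps)

# Directional objective eta -> sum_i D_t(i) exp(-eta y_i h_t(x_i)),
# proportional to F(alpha_{t-1} + eta e_t) up to the constant factor m * prod_s Z_s.
obj = lambda eta: np.sum(D * np.exp(-eta * y * h))
eta_star = minimize_scalar(obj).x

print(alpha_adaboost, eta_star)                # the two coincide
assert abs(alpha_adaboost - eta_star) < 1e-5
```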
Alternative Loss Functions
boosting loss: $x \mapsto e^{-x}$
logistic loss: $x \mapsto \log_2(1 + e^{-x})$
square loss: $x \mapsto (1 - x)^2\, 1_{x \le 1}$
hinge loss: $x \mapsto \max(1 - x, 0)$
zero-one loss: $x \mapsto 1_{x < 0}$
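The same losses written out as NumPy functions of the margin value $u = y f(x)$, for quick comparison (an illustrative sketch):

```python
import numpy as np

boosting_loss = lambda u: np.exp(-u)                 # u -> e^{-u}
logistic_loss = lambda u: np.log2(1 + np.exp(-u))    # u -> log_2(1 + e^{-u})
square_loss   = lambda u: (1 - u) ** 2 * (u <= 1)    # u -> (1-u)^2 1_{u<=1}
hinge_loss    = lambda u: np.maximum(1 - u, 0)       # u -> max(1-u, 0)
zero_one_loss = lambda u: (u < 0).astype(float)      # u -> 1_{u<0}

u = np.linspace(-2, 2, 9)
for name, loss in [("boosting", boosting_loss), ("logistic", logistic_loss),
                   ("square", square_loss), ("hinge", hinge_loss),
                   ("zero-one", zero_one_loss)]:
    print(f"{name:>9}:", np.round(loss(u), 3))
```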
Relationship with Logistic Regression
Objective function:
m
(Lebanon and Lafferty, 2001)
F( ) =
i=1
log(1 + e
2yi (xi )
).
Solve exactly the same optimization problem, except for a normalization constraint required for logistic regression:
min D(p, 1)
p
subject to: p x
0
x
p(x)
y
p(x, y )[
j (x, y )
E[
p b
j |x]]
=0
X,
y Y
p(x, y ) = 1
difference between the algorithms.
with D(p, q) =
x
p(x)
y
p(x, y ) log
p(x, y ) q (x, y )
p(x, y ) + q (x, y ) .
Standard Use in Practice
Base learners: decision trees, quite often just decision stumps (trees of depth one).

Boosting stumps: data in $\mathbb{R}^N$, e.g., $N = 2$, $(\text{height}(x), \text{weight}(x))$.
- associate a stump to each component.
- pre-sort each component: $O(N m \log m)$.
- at each round, find the best component and threshold (see the sketch below).
- total complexity: $O((m \log m) N + m N T)$.
- stumps are not always weak learners: think of the XOR example!
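A sketch of a decision-stump weak learner consistent with the complexity discussion above: the $N$ components are pre-sorted once in $O(Nm\log m)$, and each boosting round sweeps all components and thresholds in $O(mN)$ to find the stump of smallest weighted error (function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def presort(X):
    """Sort each of the N components once: O(N m log m)."""
    return np.argsort(X, axis=0)                       # (m, N) index array

def best_stump(X, y, D, order):
    """One O(mN) sweep: return the stump (component j, threshold theta, sign s)
    of smallest D-weighted error, where the stump predicts s * sgn(x_j - theta).
    (Ties between equal feature values are ignored in this sketch.)"""
    m, N = X.shape
    best_err, best = np.inf, None
    for j in range(N):
        idx = order[:, j]
        xs, ys, ws = X[idx, j], y[idx], D[idx]
        err = ws[ys == -1].sum()                       # threshold below all points: predict +1 everywhere
        for e, s in ((err, +1), (1.0 - err, -1)):
            if e < best_err:
                best_err, best = e, (j, -np.inf, s)
        for k in range(m):
            # move the threshold just past xs[k]: point k is now predicted -1
            err += ws[k] if ys[k] == 1 else -ws[k]
            for e, s in ((err, +1), (1.0 - err, -1)):  # also consider the flipped stump
                if e < best_err:
                    best_err, best = e, (j, xs[k], s)
    return best, best_err

def stump_predict(X, stump):
    j, theta, s = stump
    return s * np.where(X[:, j] > theta, 1, -1)
```

Combined with the AdaBoost sketch given earlier, each selected triple $(j, \theta, s)$ plays the role of one base classifier $h_t$.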
Overfitting?
Assume that $\mathrm{VCdim}(H) = d$ and, for a fixed $T$, define

$$\mathcal{F}_T = \Big\{ \operatorname{sgn}\Big(\sum_{t=1}^{T} \alpha_t h_t + b\Big) : \alpha_t, b \in \mathbb{R},\ h_t \in H \Big\}.$$

$\mathcal{F}_T$ can form a very rich family of classifiers. It can be shown (Freund and Schapire, 1997) that

$$\mathrm{VCdim}(\mathcal{F}_T) \le 2(d + 1)(T + 1) \log_2\big((T + 1) e\big).$$

This suggests that AdaBoost could overfit for large values of $T$, and that is in fact observed in some cases, but in various others it is not!
Empirical Observations
Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore:

[Figure: training and test error curves as a function of the number of rounds, and the cumulative margin distribution graph, for boosting C4.5 decision trees on the letter dataset, as reported by (Schapire et al., 1998).]
L1-Margin Definitions
Definition: the margin of a point $x$ with label $y$ is (the algebraic distance of $\bar{x} = [h_1(x), \ldots, h_T(x)]^\top$ to the hyperplane $\alpha \cdot \bar{x} = 0$):

$$\rho(x) = \frac{y f(x)}{\|\alpha\|_1} = y\, \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} = y\, \frac{\alpha \cdot \bar{x}}{\|\alpha\|_1}.$$

Definition: the margin of the classifier for a sample $S = (x_1, \ldots, x_m)$ is the minimum margin of the points in that sample:

$$\rho = \min_{i \in [1, m]} y_i\, \frac{\alpha \cdot \bar{x}_i}{\|\alpha\|_1}.$$
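A tiny illustrative computation of these two definitions on made-up values (nothing here comes from the lecture):

```python
import numpy as np

# predictions h_t(x_i) of T = 3 base classifiers on m = 4 points, shape (T, m)
H = np.array([[+1, +1, -1, +1],
              [+1, -1, -1, +1],
              [-1, +1, -1, -1]], dtype=float)
y = np.array([+1, +1, -1, +1], dtype=float)
alpha = np.array([0.8, 0.3, 0.4])

f = alpha @ H                                      # f(x_i) = sum_t alpha_t h_t(x_i)
point_margins = y * f / np.abs(alpha).sum()        # rho(x_i) = y_i f(x_i) / ||alpha||_1
rho = point_margins.min()                          # margin of the classifier on S
print(point_margins, rho)
```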
Note:
- SVM margin: $\rho = \min_{i \in [1, m]} y_i\, \dfrac{w \cdot \Phi(x_i)}{\|w\|_2}$.
- Boosting margin: $\rho = \min_{i \in [1, m]} y_i\, \dfrac{\alpha \cdot \mathbf{H}(x_i)}{\|\alpha\|_1}$, with $\mathbf{H}(x) = [h_1(x), \ldots, h_T(x)]^\top$.

Distances: distance to the hyperplane $w \cdot x + b = 0$:

$$\frac{|w \cdot x + b|}{\|w\|_p}, \quad \text{with } \frac{1}{p} + \frac{1}{q} = 1.$$
Rademacher Complexity of Convex Hulls
Theorem: let $H$ be a set of functions mapping from $X$ to $\mathbb{R}$, and let the convex hull of $H$ be defined as

$$\operatorname{conv}(H) = \Big\{ \sum_{k=1}^{p} \mu_k h_k : p \ge 1,\ \mu_k \ge 0,\ \sum_{k=1}^{p} \mu_k \le 1,\ h_k \in H \Big\}.$$

Then, for any sample $S$, $\widehat{\mathfrak{R}}_S(\operatorname{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.

Proof:

$$\widehat{\mathfrak{R}}_S(\operatorname{conv}(H))
= \frac{1}{m} \operatorname{E}_{\sigma}\Big[ \sup_{h_k \in H,\, \mu \ge 0,\, \|\mu\|_1 \le 1} \sum_{i=1}^{m} \sigma_i \sum_{k=1}^{p} \mu_k h_k(x_i) \Big]
= \frac{1}{m} \operatorname{E}_{\sigma}\Big[ \sup_{h_k \in H}\ \sup_{\mu \ge 0,\, \|\mu\|_1 \le 1} \sum_{k=1}^{p} \mu_k \sum_{i=1}^{m} \sigma_i h_k(x_i) \Big]$$
$$= \frac{1}{m} \operatorname{E}_{\sigma}\Big[ \sup_{h_k \in H}\ \max_{k \in [1, p]} \sum_{i=1}^{m} \sigma_i h_k(x_i) \Big]
= \frac{1}{m} \operatorname{E}_{\sigma}\Big[ \sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i) \Big]
= \widehat{\mathfrak{R}}_S(H).$$
Margin Bound - Ensemble Methods
(Koltchinskii and Panchenko, 2002)
Corollary: let $H$ be a set of real-valued functions. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \operatorname{conv}(H)$:

$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho}\, \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$$
$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho}\, \widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

Proof: direct consequence of the margin bound of Lecture 4 and $\widehat{\mathfrak{R}}_S(\operatorname{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.
Margin Bound - Ensemble Methods
(Koltchinskii and Panchenko, 2002); see also (Schapire et al., 1998)
Corollary: let $H$ be a family of functions taking values in $\{-1, +1\}$ with VC dimension $d$. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \operatorname{conv}(H)$:

$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho} \sqrt{\frac{2d \log \frac{em}{d}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

Proof: follows directly from the previous corollary and the VC-dimension bound on the Rademacher complexity (see Lecture 3).
Notes
All of these bounds can be generalized to hold uniformly for all $\rho \in (0, 1)$, at the cost of an additional term $\sqrt{\frac{\log \log_2 \frac{2}{\rho}}{m}}$ and other minor constant-factor changes (Koltchinskii and Panchenko, 2002).

For AdaBoost, the bound applies to the functions

$$x \mapsto \frac{f(x)}{\|\alpha\|_1} = \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} \in \operatorname{conv}(H).$$

Note that $T$ does not appear in the bound.
Margin Distribution
Theorem: for any $\rho > 0$, the following holds:

$$\Pr\Big[\frac{y f(x)}{\|\alpha\|_1} \le \rho\Big] \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}}.$$

Proof: using the identity $D_{T+1}(i) = \dfrac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,

$$\frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le \rho \|\alpha\|_1}
\le \frac{1}{m} \sum_{i=1}^{m} \exp\big(-y_i f(x_i) + \rho \|\alpha\|_1\big)
= \frac{1}{m}\, e^{\rho \|\alpha\|_1} \sum_{i=1}^{m} \Big[ m \prod_{t=1}^{T} Z_t \Big] D_{T+1}(i)
= e^{\rho \|\alpha\|_1} \prod_{t=1}^{T} Z_t$$
$$= \prod_{t=1}^{T} e^{\rho \alpha_t}\, 2\sqrt{\epsilon_t (1 - \epsilon_t)}
= 2^T \prod_{t=1}^{T} \Big(\frac{1 - \epsilon_t}{\epsilon_t}\Big)^{\rho/2} \sqrt{\epsilon_t (1 - \epsilon_t)}
= 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}},$$

where we used $Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)}$ and $e^{\alpha_t} = \sqrt{(1 - \epsilon_t)/\epsilon_t}$.
Notes
If for all $t \in [1, T]$, $\gamma \le (\tfrac{1}{2} - \epsilon_t)$, then the upper bound can be bounded by

$$\Pr\Big[\frac{y f(x)}{\|\alpha\|_1} \le \rho\Big] \le \Big[ (1 - 2\gamma)^{1-\rho} (1 + 2\gamma)^{1+\rho} \Big]^{T/2}.$$

For $\rho < \gamma$, $(1 - 2\gamma)^{1-\rho} (1 + 2\gamma)^{1+\rho} < 1$ and the bound decreases exponentially in $T$.

For the bound to be convergent, $\rho = O(1/\sqrt{m})$; thus $\gamma = O(1/\sqrt{m})$ is roughly the condition on the edge value.
But, Does AdaBoost Maximize the Margin?
No: AdaBoost may converge to a margin that is significantly below the maximum margin (Rudin et al., 2004) (e.g., 1/3 instead of 3/8)!

Lower bound: AdaBoost can asymptotically achieve a margin that is at least $\rho_{\max}/2$ if the data is separable, under some conditions on the base learners (Rätsch and Warmuth, 2002).

Several boosting-type margin-maximization algorithms exist, but their performance in practice is not clear or not reported.
Outliers
AdaBoost assigns larger weights to harder examples.
Application: detecting mislabeled examples.

Dealing with noisy data: regularization based on the average weight assigned to a point (soft-margin idea for boosting) (Meir and Rätsch, 2003).
AdaBoost's Weak Learning Condition
Definition: the edge of a base classifier $h_t$ for a distribution $D$ over the training sample is

$$\gamma(t) = \frac{1}{2} - \epsilon_t = \frac{1}{2} \sum_{i=1}^{m} y_i h_t(x_i) D(i).$$

Condition: there exists $\gamma > 0$ such that for any distribution $D$ over the training sample and any base classifier $h_t$,

$$\gamma(t) \ge \gamma.$$
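A one-line numeric check on illustrative values that the two expressions for the edge agree, i.e. that $\tfrac{1}{2} - \epsilon_t = \tfrac{1}{2} \sum_i y_i h_t(x_i) D(i)$:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 50
y = rng.choice([-1, 1], size=m)
h = np.where(rng.random(m) < 0.65, y, -y)    # some base classifier
D = rng.random(m)
D /= D.sum()                                 # a distribution over the training sample

eps = D[h != y].sum()                        # weighted error
edge_from_error = 0.5 - eps
edge_from_correlation = 0.5 * np.sum(y * h * D)

print(edge_from_error, edge_from_correlation)
assert np.isclose(edge_from_error, edge_from_correlation)
```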
Zero-Sum Games
Definition: payoff matrix $M = (M_{ij}) \in \mathbb{R}^{m \times n}$:
- $m$ possible actions (pure strategies) for the row player;
- $n$ possible actions for the column player;
- $M_{ij}$: payoff for the row player (= loss for the column player) when row plays $i$ and column plays $j$.

Example (rock-paper-scissors):

              rock   paper   scissors
    rock        0     -1        1
    paper       1      0       -1
    scissors   -1      1        0
Mixed Strategies
(von Neumann, 1928)
Definition: the row player selects a distribution $p$ over the rows, the column player a distribution $q$ over the columns. The expected payoff for the row player is

$$\operatorname{E}_{i \sim p,\, j \sim q}[M_{ij}] = \sum_{i=1}^{m} \sum_{j=1}^{n} p_i M_{ij} q_j = p^\top M q.$$

von Neumann's minimax theorem:

$$\max_{p} \min_{q} p^\top M q = \min_{q} \max_{p} p^\top M q.$$

Equivalent form:

$$\max_{p} \min_{j \in [1, n]} p^\top M e_j = \min_{q} \max_{i \in [1, m]} e_i^\top M q.$$
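An illustrative check of the minimax theorem on the rock-paper-scissors matrix from the previous slide: solving the row player's LP for both $M$ and the column player's transposed game gives the same game value, here 0 (the LP formulation and scipy usage are my own sketch, not from the lecture):

```python
import numpy as np
from scipy.optimize import linprog

M = np.array([[ 0, -1,  1],      # rock-paper-scissors payoff matrix for the row player
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

def game_value(M):
    """Compute max_p min_j p^T M e_j via an LP: maximize v s.t. M^T p >= v, p in the simplex."""
    m, n = M.shape
    c = np.zeros(m + 1); c[-1] = -1.0                       # variables (p, v); minimize -v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])               # v - (M^T p)_j <= 0 for all j
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # sum_i p_i = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:-1], -res.fun

p_star, v_row = game_value(M)        # row player's max-min value
q_star, v_col = game_value(-M.T)     # column player's problem as a row game on -M^T
print(p_star, v_row, -v_col)         # optimal strategies are uniform; both values are 0
assert abs(v_row + v_col) < 1e-8     # max-min value = min-max value
```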
AdaBoost and Game Theory
Game:
- Player A: selects point $x_i$, $i \in [1, m]$.
- Player B: selects base learner $h_t$, $t \in [1, T]$.
- Payoff matrix $M \in \{-1, +1\}^{m \times T}$: $M_{it} = y_i h_t(x_i)$.

von Neumann's theorem (assume finite $H$):

$$2\gamma^* = \min_{D} \max_{h \in H} \sum_{i=1}^{m} D(i)\, y_i h(x_i) = \max_{\alpha} \min_{i \in [1, m]} \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\|\alpha\|_1} = \rho^*.$$
Consequences
Weak learning condition = non-zero margin:
- thus, it is possible to search for a non-zero margin;
- AdaBoost = (suboptimal) search for the corresponding $\alpha$.

Weak learning = strong condition: the condition implies linear separability with margin $2\gamma > 0$.
Linear Programming Problem
Maximizing the margin:

$$\rho = \max_{\alpha} \min_{i \in [1, m]} \frac{y_i (\alpha \cdot \bar{x}_i)}{\|\alpha\|_1}.$$

This is equivalent to the following convex optimization (LP) problem:

$$\max_{\alpha, \rho}\ \rho \quad \text{subject to:} \quad y_i (\alpha \cdot \bar{x}_i) \ge \rho,\ i \in [1, m]; \quad \|\alpha\|_1 = 1.$$

Note that

$$\frac{|\alpha \cdot \bar{x}|}{\|\alpha\|_1} = d_\infty(\bar{x}, \mathcal{H}), \quad \text{with } \mathcal{H} = \{\bar{x} : \alpha \cdot \bar{x} = 0\},$$

i.e., the $L_1$-normalized margin is the $\ell_\infty$ distance of $\bar{x}$ to the hyperplane $\mathcal{H}$.
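A sketch of this margin-maximization LP on a tiny made-up example, assuming (as is standard when $H$ is closed under negation) that $\alpha_t \ge 0$, so that $\|\alpha\|_1 = \sum_t \alpha_t$; the data and the use of scipy are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

# predictions h_t(x_i): rows are points, columns are T = 3 base classifiers
Hx = np.array([[+1, +1, -1],
               [+1, -1, +1],
               [+1, -1, -1]], dtype=float)
y = np.array([+1, +1, -1], dtype=float)
m, T = Hx.shape

# variables (alpha_1, ..., alpha_T, rho): maximize rho subject to
#   y_i (alpha . xbar_i) >= rho for all i,  alpha >= 0,  sum_t alpha_t = 1
c = np.zeros(T + 1); c[-1] = -1.0                          # minimize -rho
A_ub = np.hstack([-(y[:, None] * Hx), np.ones((m, 1))])    # rho - y_i(alpha . xbar_i) <= 0
b_ub = np.zeros(m)
A_eq = np.hstack([np.ones((1, T)), np.zeros((1, 1))])      # ||alpha||_1 = 1
b_eq = np.array([1.0])
bounds = [(0, None)] * T + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
alpha_star, rho_star = res.x[:-1], -res.fun
print(alpha_star, rho_star)    # here alpha* = (1/3, 1/3, 1/3) and rho* = 1/3
```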
[Figure: maximum-margin solutions with different norms: norm $\|\cdot\|_2$ vs. norm $\|\cdot\|_\infty$.]

AdaBoost gives an approximate solution for the LP problem.
Advantages of AdaBoost
Simple: straightforward implementation.

Efficient: complexity $O(mNT)$ for stumps; when $N$ and $T$ are not too large, the algorithm is quite fast.

Theoretical guarantees: but still many questions:
- AdaBoost is not designed to maximize the margin;
- regularized versions of AdaBoost.
Weaker Aspects
Parameters:
- need to determine $T$, the number of rounds of boosting: stopping criterion;
- need to determine the base learners: risk of overfitting or low margins.

Noise:
- severely damages the accuracy of AdaBoost (Dietterich, 2000);
- boosting algorithms based on convex potentials do not tolerate even low levels of random noise, even with $L_1$ regularization or early stopping (Long and Servedio, 2010).
References
Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2): 139-158, 2000.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1): 119-139, 1997.

G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In NIPS, pages 447-454, 2001.

Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003.

J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100: 295-320, 1928.

Cynthia Rudin, Ingrid Daubechies, and Robert E. Schapire. The dynamics of AdaBoost: cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5: 1557-1595, 2004.
References
Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Sydney, Australia, pages 334-350, July 2002.

Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.

Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5): 1651-1686, 1998.