Lecture 10: Ensemble Learning
DD2421
Atsuto Maki
Autumn, 2020
Background: Ensemble Learning
We will describe and investigate algorithms to
train weak classifiers/regressors and how to combine them
to construct a classifier/regressor more powerful than any
of the individual ones.
These methods are called ensemble learning, committee machines, etc.
Outline of the lecture
Wisdom of Crowds – why combine classifiers?
Bagging: static structure, parallel training
Forests: an extension of bagging
Boosting: static structure, serial training
(Example: face detection)
The Wisdom of Crowds
Crowd wiser than any individual
The collective knowledge of a diverse and independent
body of people typically exceeds the knowledge of any
single individual and can be harnessed by voting.
When? For which questions?
See The Wisdom of Crowds by James Surowiecki published in
2004 to see this idea applied to business.
What makes a crowd wise?
Four elements are required to form a wise crowd (J. Surowiecki):
Diversity of opinion. People in the crowd should have a
range of experiences, education and opinions.
Independence. A prediction by a person in the crowd is not
influenced by the other people in the crowd.
Decentralization. People have specializations and
local knowledge.
Aggregation. There is a mechanism for aggregating
all predictions into one single prediction.
Combining classifiers
We will exploit Wisdom of Crowds ideas for specific tasks by
combining classifier predictions,
aiming to combine independent and diverse classifiers.
But we will use labelled training data
to identify the expert classifiers in the pool;
to identify complementary classifiers;
to indicate how to best combine them.
Ensemble method: Bagging
Bootstrap Aggregating
Use bootstrap replicates of training set by
sampling with replacement.
Learn one model on each replicate, then combine them all.
Binary classification example
(Figure: the true decision boundary and a training data set Si.)
Estimate the true decision boundary with a decision tree
trained from some labeled training set Si .
High variance, Low bias classifiers
E.g. decision trees
High variance classifiers produce differing decision
boundaries which are highly dependent on the training
data.
Low bias classifiers produce decision boundaries which on
average are good approximations to the true decision
boundary.
Ensemble predictions using diverse high-variance, low-bias
classifiers reduce the variance of the ensemble classifier.
Estimated decision boundaries found using bootstrap replicates:
(Figure: the boundaries estimated from replicates S1, . . . , S10 differ visibly.)
Property of instability: small changes in the training data give very different decision boundaries.
Bagging - Bootstrap Aggregating
Input: Training data
S = {(x_1, y_1), . . . , (x_m, y_m)}
of inputs x_j ∈ R^d and their labels or real values y_j.
Iterate: for b = 1, . . . , B
1 Sample training examples, with replacement,
m′ times from S to create S_b (m′ ≤ m).
2 Use this bootstrap sample S_b to estimate the
regression or classification function f_b.
Output: The bagging estimate for
Classification:
$$f_{\mathrm{bag}}(x) = \arg\max_{1 \le k \le K} \sum_{b=1}^{B} \mathrm{Ind}\big(f_b(x) = k\big)$$
Regression:
$$f_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)$$
Note: Ind(x) = 1 if x = TRUE; otherwise, Ind(x) = 0.
Bagging is a procedure to reduce the variance of our
classifier (when labelled training data is limited).
→ A powerful algorithm for controlling overfitting.
Note: it only produces good results for high variance, low bias
classifiers.
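To make the procedure concrete, here is a minimal sketch of bagging in Python, assuming NumPy arrays and scikit-learn decision trees as the high-variance base learner (the helper names bagging_fit and bagging_predict are illustrative, not part of the lecture):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=100, m_prime=None, seed=0):
    # Train B trees, each on a bootstrap replicate S_b of the training set S.
    rng = np.random.default_rng(seed)
    m = len(X)
    m_prime = m if m_prime is None else m_prime      # m' <= m samples per replicate
    models = []
    for _ in range(B):
        idx = rng.integers(0, m, size=m_prime)       # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Majority vote: argmax_k sum_b Ind(f_b(x) = k).
    votes = np.stack([f_b.predict(X) for f_b in models])            # shape (B, n)
    classes = np.unique(votes)
    counts = np.stack([(votes == k).sum(axis=0) for k in classes])  # votes per class
    return classes[counts.argmax(axis=0)]

For regression, the only change is to replace the vote with the average (1/B) * sum_b f_b(x).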
Apply bagging to the original example
(Figure: the ground truth, the classifier trained from S1, and the bagged classifier with B = 100.)
If we bag a high-bias, low-variance classifier (oriented
horizontal and vertical lines), we don't get any benefit.
A cartoon depiction of how bagging works
We train an 8 detector on the dataset, containing {6,8,9}.
Construct two different resampled datasets, each by
sampling with replacement.
A cartoon depiction of how bagging works
(cont.)
The first resampled dataset omits the 9 and repeats the 8.
→ The detector learns a loop on top of the digit for an 8.
The second resampled dataset omits the 6 and repeats the 9.
→ The detector learns a loop on the bottom for an 8.
Each rule is brittle.
If we average their output, then the detector is robust;
maximal confidence only when both loops of the 8 are
present.
See Section 7.11 of Deep Learning (2016) by Ian Goodfellow et al.,
www.deeplearningbook.org.
Ensemble method: Forest
Decision/Random/Randomized Forest
Bagging + Random feature selection at each node
Decision Forests / Random Forests
Two kinds of randomness are involved:
- Sampling the training data (the same as in Bagging)
- Feature selection at each node
Trees are less correlated, i.e. there is even more variance
(diversity) between the weak learners.
A classifier well suited to multi-class problems.
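As a brief illustration (not taken from the lecture), scikit-learn's RandomForestClassifier exposes both kinds of randomness directly; the iris data is used here only to make the snippet runnable:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,     # number of trees (bagging replicates)
    bootstrap=True,       # sample the training data with replacement, as in bagging
    max_features="sqrt",  # random subset of features considered at each node/split
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:3]))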
Ensemble method: Boosting
Boosting started from a question:
Can a set of weak learners create a single strong
classifier, where a weak learner performs only
slightly better than chance? (Kearns, 1988)
Loop:
- Apply learner to weighted samples
- Increase weights of misclassified examples
Example: Ensemble Prediction
Voting of oriented hyper-planes can define convex regions.
(Figure: the green region is the true boundary; a high-bias classifier vs. a low-bias classifier.)
Low model complexity (small # of d.o.f.) ⇒ high bias
High model complexity (large # of d.o.f.) ⇒ low bias
Example (cont.)
A diverse and complementary set of high-bias classifiers,
with performance better than chance, combined by voting
$$f_V(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} h_t(x)\right)$$
can produce a classifier with low bias.
h_t ∈ H, where H is a family of possible weak classifier functions.
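A tiny numeric illustration (my own, not from the slides) of the voting rule: each weak classifier h_t returns -1 or +1, and the ensemble outputs the sign of their sum.

import numpy as np

def vote(weak_classifiers, x):
    # f_V(x) = sign(sum_t h_t(x)); a tie (sum = 0) gives 0 here.
    return np.sign(sum(h(x) for h in weak_classifiers))

# Three hand-made axis-aligned weak classifiers on a 2-D point:
stumps = [lambda x: 1 if x[0] > 0 else -1,
          lambda x: 1 if x[1] > 0 else -1,
          lambda x: 1 if x[0] + x[1] > 1 else -1]
print(vote(stumps, np.array([0.5, 0.8])))   # -> 1.0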
Ensemble Method: Boosting
Input: Training data S = {(x1 , y1 ), . . . , (xm , ym )} of inputs
xi and their labels yi ∈ {−1, 1} or real values.
H: a family of possible weak classifiers/regression functions.
Output: A strong classifier/regression function
$$f_T(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) \quad \text{or} \quad f_T(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$
a weighted sum of weak classifiers h_t ∈ H, t = 1, . . . , T
α_t: confidence/reliability
Ensemble Method: Boosting
Core ideas (consider only the case of classification here):
Performance of classifiers h_1, . . . , h_t helps define h_{t+1}.
Remember: each h_t ∈ H.
Maintain a weight w_i^(t) for each training example in S.
Large w_i^(t) ⇒ x_i has greater influence on the choice of h_t.
Iteration t: w_i^(t) gets increased / decreased
if x_i is wrongly / correctly classified by h_t.
Binary classification example
(Figure: the true decision boundary and the training data.)
H is the set of all possible oriented vertical and horizontal
lines.
Example: boosting rounds. At each round t the figures show the chosen weak
classifier, the re-weighted training points w_i^(t+1), and the current strong
classifier f_t(x):
Round 1: ε_1 = 0.19, α_1 = 1.45
Round 2: ε_2 = 0.1512, α_2 = 1.725
Round 3: ε_3 = 0.2324, α_3 = 1.1946
Round 4: ε_4 = 0.2714, α_4 = 0.9874
Round 5: ε_5 = 0.2616, α_5 = 1.0375
Round 6: ε_6 = 0.2262, α_6 = 1.2298
Round 7: ε_7 = 0.2680, α_7 = 1.0049
Round 8: ε_8 = 0.3282, α_8 = 0.7165
Round 9: ε_9 = 0.3048, α_9 = 0.8246
Round 10: ε_10 = 0.2943, α_10 = 0.8744
Round 11: ε_11 = 0.2876, α_11 = 0.9071
. . .
Round 21: ε_21 = 0.3491, α_21 = 0.6232
Adaboost Algorithm (Freund & Schapire, 1997)
Given: Labeled training data
S = {(x_1, y_1), . . . , (x_m, y_m)}
of inputs x_j ∈ R^d and their labels y_j ∈ {−1, 1}.
A set/class H of possible weak classifiers.
Initialize: Introduce a weight, w_j^(1), for each training sample.
Set w_j^(1) = 1/m for each j.
Adaboost Algorithm (cont.)
Iterate: for t = 1, . . . , T
1 Train weak classifier h_t ∈ H using S and w_1^(t), . . . , w_m^(t);
select the one that minimizes the training error:
$$\varepsilon_t = \sum_{j=1}^{m} w_j^{(t)} \, \mathrm{Ind}\big(y_j \neq h_t(x_j)\big)$$
(the sum of the weights of the misclassified samples)
2 Compute the reliability coefficient:
$$\alpha_t = \log_e \frac{1 - \varepsilon_t}{\varepsilon_t}$$
ε_t must be less than 0.5. Break out of the loop if ε_t ≈ 0.5.
3 Update the weights using: w_j^(t+1) = w_j^(t) exp(−α_t y_j h_t(x_j)).
Note the sign of y_j h_t(x_j).
4 Normalize the weights so that they sum to 1.
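A compact sketch of the AdaBoost loop above in Python, using scikit-learn depth-1 trees (decision stumps) as the weak-classifier family H (this particular choice of H, and the helper names, are assumptions; the α_t and the weight update follow the slides):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    # X: (m, d) array; y: (m,) array of labels in {-1, +1}.
    m = len(X)
    w = np.full(m, 1.0 / m)                    # w_j^(1) = 1/m
    weak, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.sum(w * (pred != y))          # weighted training error eps_t
        if eps >= 0.5 or eps == 0.0:           # no better than chance (or perfect): stop
            break
        alpha = np.log((1.0 - eps) / eps)      # reliability coefficient alpha_t
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified, down-weight correct
        w /= w.sum()                           # normalize so the weights sum to 1
        weak.append(h)
        alphas.append(alpha)
    return weak, alphas

def adaboost_predict(weak, alphas, X):
    # f_T(x) = sign(sum_t alpha_t h_t(x))
    return np.sign(sum(a * h.predict(X) for h, a in zip(weak, alphas)))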
Dropout
A regularization method for Artificial Neural
Networks
(can be seen also as an ensemble method)
Dropout: “ultimate” ensemble learning in ANN
Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov,
Dropout: A Simple Way to Prevent Neural Networks
from Overfitting. Journal of Machine Learning Research 15:
1929-1958, 2014.
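A minimal sketch of dropout on one layer's activations, assuming NumPy; this uses the common "inverted" scaling (divide by p at training time) rather than the paper's rescaling at test time, but the ensemble view is the same: every forward pass trains a different random sub-network, and the full network approximates their average at test time.

import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    if not training:
        return activations                         # test time: use the full network
    rng = np.random.default_rng(rng)
    mask = rng.random(activations.shape) < p       # keep each unit with probability p
    return activations * mask / p                  # rescale so the expected activation is unchanged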
Summary
Summary: Ensemble Prediction
Can combine many weak classifiers/regressors into a
stronger classifier; voting, averaging, bagging
if weak classifiers/regressors are better than random.
if there is sufficient de-correlation (independence)
amongst the weak classifiers/regressors.
Summary: Ensemble Prediction & Learning
Can combine many (high-bias) weak classifiers/regressors
into a strong classifier; boosting
if weak classifiers/regressors are chosen and combined
using knowledge of how well they and others performed on
the task on training data.
The selection and combination encourages the weak
classifiers to be complementary, diverse and de-correlated.
Appendix
P. Viola, M. J. Jones, Robust real-time face detection.
International Journal of Computer Vision 57(2): 137-154, 2004.
Viola & Jones Face Detection
Most state-of-the-art face detection in mobile phones,
digital cameras, etc. is/was based on this algorithm.
Example of a classifier constructed using the Boosting
algorithm.
Viola & Jones: Training data
Positive training examples:
Image patches corresponding to
faces - (xi , 1).
Negative training examples:
Random image patches from
images not containing faces -
(xj , −1).
Note: All patches are re-scaled to have the same size.
(Figure: positive training examples.)
Viola & Jones: Weak classifier
Input: x → apply filter: f_j(x) → output: h(x) = (f_j(x) > θ), i.e. FACE or NON-FACE.
Filters used compute differences between sums of pixels in adjacent
rectangles. (These can be computed very quickly using Integral
Images.)
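A short sketch of the integral-image trick with NumPy (the function names are mine): after one pass of cumulative sums, the pixel sum over any axis-aligned rectangle costs four lookups, so rectangle-difference filter responses are cheap to evaluate.

import numpy as np

def integral_image(img):
    # ii[r, c] = sum of img[:r, :c]; padded with a leading row/column of zeros.
    return np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    # Sum of img[r0:r1, c0:c1] from four corners of the integral image.
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()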
Viola & Jones: Filters Considered
Huge "library" of possible Haar-like filters, f_1, . . . , f_n, with
n ≈ 16,000,000.
Viola & Jones: AdaBoost training
Recap: define the weak classifier as
$$h_t(x) = \begin{cases} 1 & \text{if } f_{j_t}(x) > \theta_t \\ -1 & \text{otherwise} \end{cases}$$
Use AdaBoost to efficiently choose the best weak
classifiers and to combine them.
Remember: a weak classifier corresponds to a filter type and a
threshold.
Viola & Jones: AdaBoost training (cont.)
For t = 1, . . . , T
for each filter type j
1 Apply filter f_j to each example.
2 Sort the examples by their filter responses.
3 Select the best threshold for this classifier: θ_t^j.
4 Keep a record of the error of this classifier: ε_t^j.
Select the filter-threshold combination (weak classifier
j*) with minimum error. Then set j_t = j*, ε_t = ε_t^{j*}
and θ_t = θ_t^{j*}.
Re-weight the examples according to the AdaBoost formulae.
Note: there are many tricks to make this implementation more efficient;
a simple version of the per-filter threshold search is sketched below.
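An illustrative sketch (the helper name and the polarity/parity term are mine) of steps 1-4 for a single filter: sort the filter responses once, then a cumulative-sum scan gives the weighted error of every candidate threshold.

import numpy as np

def best_threshold(responses, labels, weights):
    # responses: f_j(x_i) for all examples; labels in {-1, +1}; weights sum to 1.
    order = np.argsort(responses)
    r, y, w = responses[order], labels[order], weights[order]
    w_pos = np.cumsum(w * (y == 1))    # weight of positives at or below each position
    w_neg = np.cumsum(w * (y == -1))   # weight of negatives at or below each position
    # Threshold after position i, predicting -1 below and +1 above:
    err_plus = w_pos + (w_neg[-1] - w_neg)   # positives below + negatives above are wrong
    err_minus = 1.0 - err_plus               # flipping the polarity flips every decision
    errors = np.minimum(err_plus, err_minus)
    i = int(np.argmin(errors))
    polarity = 1 if err_plus[i] <= err_minus[i] else -1
    return r[i], polarity, errors[i]         # (theta, parity, weighted error)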
Viola & Jones: Sliding window
Remember: better classification rates are obtained if we use a
classifier f_T with large T.
Given a new image, I, detect the faces in the image by:
for each plausible face size s
for each possible patch centre c
1 Extract sub-patch of size s at c from I.
2 Re-scale patch to size of training patches.
3 Apply detector to patch.
4 Keep record of s and c if the detector returns positive.
This is a lot of patches to examine. If T is very large,
processing an image will be very slow! (A rough sketch of the scan follows.)
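A rough sketch of this scan in Python; detect_faces is a hypothetical name, detector stands for the trained classifier f_T (returning a positive value for FACE), and scikit-image's resize stands in for the re-scaling step.

from skimage.transform import resize

def detect_faces(image, detector, sizes, stride=4, patch_size=24):
    detections = []
    for s in sizes:                                        # each plausible face size
        for r in range(0, image.shape[0] - s, stride):     # each patch position
            for c in range(0, image.shape[1] - s, stride):
                patch = image[r:r + s, c:c + s]            # extract sub-patch of size s
                patch = resize(patch, (patch_size, patch_size))  # match training patch size
                if detector(patch) > 0:                    # keep s and position if positive
                    detections.append((s, (r, c)))
    return detections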
Viola & Jones: Cascade of classifiers
But: only a tiny proportion of the patches will be faces, and
many of them will not look anything like a face.
Exploit this fact: introduce a cascade of increasingly
strong classifiers, given a nested set of classifier hypothesis
classes ("computational risk minimization").
(Figures: a % detection vs. % false positive curve, and the cascade: a sub-window
passes Classifier 1, Classifier 2, Classifier 3, . . .; a T output forwards it towards
FACE, while an F output at any stage rejects it as NON-FACE.)
Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001
Viola & Jones: Cascade of classifiers
(Figure: the cascaded classifier — a sub-window passes a 1-feature, a 5-feature and
a 20-feature classifier with cumulative false positive rates of about 50%, 20% and 2%;
an F output at any stage yields NON-FACE, passing every stage yields FACE.)
A 1-feature classifier achieves a 100% detection rate and
about a 50% false positive rate.
A 5-feature classifier achieves a 100% detection rate and a
40% false positive rate (20% cumulative), using data
from the previous stage.
A 20-feature classifier achieves a 100% detection rate
with a 10% false positive rate (2% cumulative).
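A compact sketch of how such a cascade is evaluated on one patch (stages is assumed to be a list of stage classifiers, cheapest first, each returning a positive score for FACE): cheap early stages reject most non-face patches, so the expensive later stages run only on the few survivors.

def cascade_classify(patch, stages):
    for stage in stages:           # e.g. the 1-feature, 5-feature, 20-feature classifiers
        if stage(patch) <= 0:      # any stage saying NON-FACE ends the computation
            return False           # NON-FACE
    return True                    # FACE: the patch passed every stage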
Viola & Jones: Typical Results