Ensemble methods combine the predictions of many models. In bagging (short for bootstrap aggregation), for example, a model is fit to each of B bootstrap samples (e.g., B = 50) and the B predictions are averaged to form the ensemble prediction. This lecture walks through the basics of three ensemble methods: bagging, random forests, and boosting.


Lecture 20: Bagging, Random Forests, Boosting
Reading: Chapter 8

STATS 202: Data mining and analysis

November 13, 2017

1 / 17
Classification and Regression trees, in a nutshell

• Grow the tree by recursively splitting the samples in a leaf Ri according to Xj > s, choosing (Ri, Xj, s) to maximize the drop in RSS.
  → Greedy algorithm.
• Create a sequence of subtrees T0, T1, . . . , Tm using a pruning algorithm.
• Select the best tree Ti (or the best α) by cross-validation.
  → Why might it be better to choose α instead of the tree Ti by cross-validation?

2 / 17
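For concreteness, here is a minimal R sketch of the grow/prune/cross-validate workflow above, using the tree package's cost-complexity pruning. The data frame Heart with factor response AHD is an assumption, not part of the slides; substitute your own data.

    library(tree)

    # Heart / AHD are hypothetical names; any data frame with a factor response works.
    fit <- tree(AHD ~ ., data = Heart)              # grow a large tree by greedy splitting
    cv  <- cv.tree(fit, FUN = prune.misclass)       # CV error along the pruning sequence T0, T1, ...
    best_size <- cv$size[which.min(cv$dev)]         # subtree size with the lowest CV error
    pruned <- prune.misclass(fit, best = best_size) # the selected subtree
    plot(pruned); text(pruned, pretty = 0)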
Example. Heart dataset.

How do we deal with categorical predictors?

[Figure: unpruned classification tree for the Heart data. The root split is on Thal; deeper splits use Ca, Slope, Oldpeak, MaxHR, ChestPain, Age, RestECG, RestBP, Chol, and Sex, with Yes/No predictions at the leaves.]
3 / 17
Categorical predictors

• If there are only 2 categories, then the split is obvious. We don't have to choose the splitting point s, as for a numerical variable.
• If there are more than 2 categories:
  - Order the categories according to the average of the response:

      ChestPain:a > ChestPain:c > ChestPain:b

  - Treat as a numerical variable with this ordering, and choose a splitting point s.
  - One can show that this is the optimal way of partitioning.

4 / 17
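A short base-R sketch of this ordering trick, assuming a hypothetical data frame d with a factor ChestPain and a 0/1 response AHD:

    # d, ChestPain and AHD are assumed names, not from the slides' code.
    lev_means <- tapply(d$AHD, d$ChestPain, mean)             # average response per category
    ord <- names(sort(lev_means))                             # categories ordered by that average
    x_num <- as.numeric(factor(d$ChestPain, levels = ord))    # recode as 1, 2, 3, ... in that order
    # Candidate splits are now ordinary thresholds x_num < s with s = 1.5, 2.5, ...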
Missing data

• Suppose we can assign every sample to a leaf Ri despite the missing data.
• When choosing a new split with variable Xj (growing the tree):
  - Only consider the samples which have the variable Xj.
  - In addition to choosing the best split, choose a second-best split using a different variable, and a third-best, ...
• To propagate a sample down the tree, if it is missing a variable to make a decision, try the second-best decision, or the third-best, ...

5 / 17
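The rpart package handles missing predictors in a similar spirit through "surrogate splits": at each node it stores backup splits on other variables, chosen to mimic the primary split, and falls back on them when the primary variable is missing. A sketch, again with a hypothetical Heart/AHD data frame:

    library(rpart)

    # Heart / AHD are assumed names; maxsurrogate controls how many backup splits are kept.
    fit <- rpart(AHD ~ ., data = Heart, method = "class",
                 control = rpart.control(maxsurrogate = 3, usesurrogate = 2))
    summary(fit)   # lists the primary split and its surrogate splits at each node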
Bagging
• Bagging = Bootstrap Aggregating.
• In the bootstrap, we replicate our dataset by sampling with replacement:
  - Original dataset: x = c(x1, x2, ..., x100)
  - Bootstrap samples:
      boot1 = sample(x, 100, replace = TRUE), ...,
      bootB = sample(x, 100, replace = TRUE).
• We used these samples to get the standard error of a parameter estimate:

$$ \mathrm{SE}(\hat{\beta}_1) \approx \sqrt{ \frac{1}{B-1} \sum_{b=1}^{B} \Big( \hat{\beta}_1^{(b)} - \frac{1}{B} \sum_{k=1}^{B} \hat{\beta}_1^{(k)} \Big)^2 } $$

6 / 17
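As a refresher, a base-R sketch of that bootstrap standard error for a regression slope, assuming a hypothetical data frame dat with columns y and x:

    set.seed(1)
    B <- 1000
    beta_boot <- replicate(B, {
      idx <- sample(nrow(dat), replace = TRUE)      # resample rows with replacement
      coef(lm(y ~ x, data = dat[idx, ]))["x"]       # refit and keep the slope estimate
    })
    sd(beta_boot)   # bootstrap estimate of the SE of the slope, matching the formula above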
Bagging

• In bagging we average the predictions of a model fit to many bootstrap samples.

Example: Bagging the lasso.
• Let ŷ^{L,b} be the prediction of the lasso applied to the b-th bootstrap sample.
• Bagging prediction:

$$ \hat{y}^{\mathrm{boot}} = \frac{1}{B} \sum_{b=1}^{B} \hat{y}^{L,b}. $$

7 / 17
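A sketch of bagging the lasso with glmnet; the predictor matrix x, response y, and matrix of new points x0 are assumed inputs:

    library(glmnet)

    set.seed(1)
    B <- 50
    preds <- sapply(1:B, function(b) {
      idx <- sample(nrow(x), replace = TRUE)                 # b-th bootstrap sample
      fit <- cv.glmnet(x[idx, ], y[idx], alpha = 1)          # lasso, lambda chosen by CV
      as.numeric(predict(fit, newx = x0, s = "lambda.min"))  # prediction y_hat^{L,b} at the new points
    })
    y_bag <- rowMeans(preds)   # bagging prediction: average over the B lasso fits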
When does Bagging make sense?

When a regression method or a classifier has a tendency to overfit, bagging reduces the variance of the prediction.

• When n is large, the empirical distribution is similar to the true distribution of the samples.
• Bootstrap samples are like independent realizations of the data.
• Bagging amounts to averaging the fits from many independent datasets, which would reduce the variance by a factor of 1/B.

8 / 17
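A toy numerical check of the 1/B claim for genuinely independent datasets (bootstrap samples are only approximately independent, so bagging gains somewhat less in practice):

    set.seed(1)
    B <- 50
    one_fit  <- function() mean(rnorm(30))                        # stand-in for one model's prediction
    single   <- replicate(2000, one_fit())                        # one fit per dataset
    averaged <- replicate(2000, mean(replicate(B, one_fit())))    # average of B independent fits
    var(single) / var(averaged)                                   # should be close to B = 50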
Bagging decision trees
• Disadvantage: every time we fit a decision tree to a bootstrap sample, we get a different tree T^b.
  → Loss of interpretability.
• For each predictor, add up the total amount by which the RSS (or Gini index) decreases every time we use the predictor in T^b.
• Average this total over the bootstrap estimates T^1, . . . , T^B.
[Figure: variable importance from bagged trees on the Heart data (total RSS/Gini decrease attributed to each predictor, averaged over trees and scaled from 0 to 100). Thal, Ca, and ChestPain rank highest; Fbs, RestECG, and ExAng rank lowest.]

9 / 17
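In R, bagging trees amounts to a random forest with mtry = p, and the randomForest package reports this style of importance measure (total Gini/RSS decrease averaged over trees) alongside a permutation-based one. A sketch on the hypothetical Heart data frame used earlier:

    library(randomForest)

    p <- ncol(Heart) - 1                                   # number of predictors (Heart is assumed)
    bag <- randomForest(AHD ~ ., data = Heart, mtry = p,   # mtry = p  =>  plain bagging
                        ntree = 500, importance = TRUE)
    importance(bag)    # per-predictor importance, incl. mean decrease in Gini
    varImpPlot(bag)    # bar plot like the variable-importance figure above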
Out-of-bag (OOB) error

• To estimate the test error of a bagging estimate, we could use cross-validation.
• Each time we draw a bootstrap sample, we only use about 63% of the observations.
• Idea: use the rest of the observations as a test set.
• OOB error:
  - For each sample xi, find the prediction ŷi^b for all bootstrap samples b which do not contain xi. There should be around 0.37B of them. Average these predictions to obtain ŷi^oob.
  - Compute the error (yi − ŷi^oob)².
  - Average the errors over all observations i = 1, . . . , n.

10 / 17
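randomForest reports this OOB estimate automatically: each observation is predicted only by the roughly 0.37B trees whose bootstrap samples excluded it. Continuing the hypothetical bagging fit bag from the previous sketch:

    head(bag$predicted)                    # OOB predictions, one per observation
    mean(bag$predicted != Heart$AHD)       # OOB misclassification error
    plot(bag$err.rate[, "OOB"], type = "l",
         xlab = "Number of trees", ylab = "OOB error")   # OOB error as trees are added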
Out-of-bag (OOB) error

[Figure: test error and OOB error versus the number of trees, for bagging and for random forests.]

The test error decreases as we increase B (the dashed line is the error for a plain decision tree).
11 / 17
Random Forests

Bagging has a problem:
→ The trees produced by different bootstrap samples can be very similar.

Random forests:
• We fit a decision tree to different bootstrap samples.
• When growing the tree, we select a random sample of m < p predictors to consider in each step.
• This will lead to very different (or "uncorrelated") trees from each sample.
• Finally, average the prediction of each tree.

12 / 17
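The only change relative to the bagging sketch above is mtry. A sketch with m ≈ √p, reusing the hypothetical Heart data and p from earlier:

    rf <- randomForest(AHD ~ ., data = Heart,
                       mtry = floor(sqrt(p)),   # consider m < p predictors at each split
                       ntree = 500)
    rf                                          # printed summary includes the OOB error estimate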
Random Forests vs. Bagging

[Figure: the same test and OOB error curves versus the number of trees, comparing bagging and random forests; the random forest curves lie below the bagging curves.]

13 / 17
Random Forests, choosing m

[Figure: test classification error versus the number of trees, for m = p, m = p/2, and m = √p.]

The optimal m is usually around √p, but this can be used as a tuning parameter.

14 / 17
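A sketch of treating m as a tuning parameter: compare test error over a few values of mtry, assuming a train/test split of the hypothetical Heart data with train a vector of training-row indices:

    ms <- c(p, floor(p / 2), floor(sqrt(p)))          # m = p (bagging), p/2, sqrt(p)
    test_err <- sapply(ms, function(m) {
      fit <- randomForest(AHD ~ ., data = Heart[train, ], mtry = m, ntree = 500)
      pred <- predict(fit, newdata = Heart[-train, ])
      mean(pred != Heart$AHD[-train])                 # test misclassification error
    })
    setNames(test_err, paste0("m=", ms))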
Boosting
1. Set f̂(x) = 0, and ri = yi for i = 1, . . . , n.
2. For b = 1, . . . , B, iterate:
   2.1 Fit a decision tree f̂^b with d splits to the response r1, . . . , rn.
   2.2 Update the prediction: f̂(x) ← f̂(x) + λ f̂^b(x).
   2.3 Update the residuals: ri ← ri − λ f̂^b(xi).
3. Output the final model:

$$ \hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^{b}(x). $$
15 / 17
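A hand-rolled version of this algorithm for regression, with shallow rpart trees as the base learner (maxdepth is used as a stand-in for the number of splits d; the data frame dat with numeric response y is hypothetical):

    library(rpart)

    boost_fit <- function(dat, B = 1000, lambda = 0.01, depth = 1) {
      r <- dat$y                                  # step 1: residuals start at the response
      trees <- vector("list", B)
      for (b in 1:B) {                            # step 2
        fb <- rpart(r ~ . - y, data = transform(dat, r = r),          # 2.1: shallow tree fit to r
                    control = rpart.control(maxdepth = depth, cp = 0))
        r <- r - lambda * predict(fb, dat)        # 2.2 / 2.3: shrink the fit, update residuals
        trees[[b]] <- fb
      }
      trees
    }

    boost_predict <- function(trees, newdata, lambda = 0.01) {
      # step 3: the final model is the sum of the shrunken trees
      Reduce(`+`, lapply(trees, function(tr) lambda * predict(tr, newdata)))
    }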
Boosting, intuitively

Boosting learns slowly:

We first use the samples that are easiest to predict, then slowly down-weight these cases, moving on to the harder samples.

16 / 17
Boosting vs. random forests

[Figure: test classification error versus the number of trees, for boosting with depth 1, boosting with depth 2, and a random forest with m = √p.]

The parameter λ = 0.01 in each case.


We can tune the model by CV using λ, d, B.
17 / 17
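In practice one would use a boosting package rather than the hand-rolled loop above; a sketch with gbm, tuning the same parameters (shrinkage = λ, interaction.depth = d, n.trees = B) by built-in cross-validation. It assumes the hypothetical Heart data with AHD recoded as 0/1 for the bernoulli loss:

    library(gbm)

    set.seed(1)
    fit <- gbm(AHD ~ ., data = Heart, distribution = "bernoulli",  # AHD assumed 0/1 numeric
               n.trees = 5000,          # B
               interaction.depth = 1,   # d
               shrinkage = 0.01,        # lambda
               cv.folds = 5)
    best_B <- gbm.perf(fit, method = "cv")                         # CV-chosen number of trees
    prob <- predict(fit, newdata = Heart, n.trees = best_B, type = "response")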
