AdaBoost
The AdaBoost algorithm, short for Adaptive Boosting, is a boosting technique used as an
ensemble method in machine learning.
It is called Adaptive Boosting because the weights are re-assigned to each instance, with
higher weights assigned to incorrectly classified instances.
Boosting is used to reduce bias as well as variance for supervised learning. It works
on the principle of learners growing sequentially. Except for the first, each
subsequent learner is grown from previously grown learners. In simple words, weak
learners are converted into strong ones. The AdaBoost algorithm works on the same
principle as boosting with a slight difference.
How Does AdaBoost Work?
It makes ‘n’ decision trees during the training period. When the first
decision tree/model is made, the records it classifies incorrectly are
given priority. These records are sent as input to the second model. The process
goes on until the number of base learners we have specified is reached. Repetition of
records is allowed with all boosting techniques.
In particular, the records that are incorrectly classified are used as input for the next
model. This process is repeated until the specified condition is met. There will be ‘n’
models, each built by taking the errors of the previous model into account. This is how
boosting works. The models 1, 2, 3, …, N are individual models, which here are
decision trees. All types of boosting models work on the same principle.
When a random forest is used, the algorithm makes ‘n’ full trees, each consisting of a
root node and several leaf nodes. Some trees might be bigger than others, and there is
no fixed depth in a random forest. With AdaBoost, however, the algorithm only makes a
tree with one node and two leaves, known as a stump.
These stumps are weak learners, which is exactly what boosting techniques prefer. The
order of the stumps is very important in AdaBoost: the error of the first stump influences
how the other stumps are made. (A short sketch of a stump follows below.)
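To make the idea of a stump concrete, here is a minimal sketch that fits a single depth-1 decision tree with scikit-learn; the tiny dataset and its values are made up purely for illustration.

```python
# A "stump" is simply a decision tree with a single split (max_depth=1).
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 5 records, 2 features, binary labels.
X = [[1, 20], [2, 30], [3, 10], [4, 40], [5, 25]]
y = [0, 0, 1, 1, 0]

stump = DecisionTreeClassifier(max_depth=1)  # one root node and two leaves
stump.fit(X, y)
print(stump.predict([[3, 15]]))              # prediction for a new record
```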
Work Flow
Step 1 – Creating the First Base Learner
1. Let’s assume there are only 5 records. All these records will be assigned a sample
weight.
2. The formula used for this is ‘W=1/N’ where N is the number of records.
3. The sample weight becomes 1/5 initially. Every record gets the same weight. In
this case, it’s 1/5.
4. The next step is to create the same number of stumps as there are features. If there
are 3 features, it will create 3 stumps.
5. Each of these stumps is a small decision tree, so this stage can be called the
stump base-learner model. Out of these 3 models, the algorithm selects
only one.
6. Two properties are considered while selecting a base learner – Gini and Entropy.
We must calculate Gini or Entropy the same way it is calculated for decision
trees.
7. The stump with the least value will be the first base learner.
8. The numbers below the leaves of each stump represent the correctly and incorrectly
classified records. Using these counts, the Gini or entropy index is calculated. The
stump that has the lowest entropy or Gini will be selected as the base learner.
9. Let’s assume that the entropy index is the lowest for stump 1. So, let’s take stump
1, i.e., feature 1, as our first base learner. Let’s assume feature (f1) has classified 4
records correctly and 1 incorrectly. (A rough code sketch of this step follows below.)
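The following is a rough sketch of Step 1, assuming a tiny made-up dataset with 5 records and 3 features: every record starts with weight 1/N, one stump is fit per feature, and the stump with the lowest weighted error, used here as a stand-in for the Gini/entropy comparison, becomes the first base learner. All data and variable names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 5 records, 3 features, binary labels.
X = np.array([[1, 0, 3],
              [2, 1, 1],
              [3, 0, 2],
              [4, 1, 5],
              [5, 0, 4]])
y = np.array([0, 0, 1, 1, 1])

# Step 1a: every record gets the same initial sample weight W = 1/N.
n = len(y)
sample_weights = np.full(n, 1 / n)           # [0.2, 0.2, 0.2, 0.2, 0.2]

# Step 1b: fit one stump (entropy criterion) per feature.
stumps = []
for f in range(X.shape[1]):
    stump = DecisionTreeClassifier(max_depth=1, criterion="entropy")
    stump.fit(X[:, [f]], y, sample_weight=sample_weights)
    stumps.append(stump)

# Step 1c: pick the "best" stump; weighted training error is used here
# as a simple proxy for comparing the stumps' impurity.
errors = [np.sum(sample_weights[stumps[f].predict(X[:, [f]]) != y])
          for f in range(len(stumps))]
base_learner = stumps[int(np.argmin(errors))]
print(errors)
```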
Step 2 – Calculating the Total Error (TE)
The total error is the sum of the sample weights of all incorrectly classified records.
In our case, there is only 1 misclassified record, so Total Error (TE) = 1/5.
Step 3 – Calculating Performance of the Stump
1. The formula for calculating the Performance of the Stump is:
Performance of the Stump = 1/2 * ln((1 - TE) / TE)
where ln is the natural log and TE is the Total Error.
2. In our case, TE is 1/5. By substituting the value of total error in the above formula
and solving it, we get the value for the performance of the stump as 0.693.
3. Why is it necessary to calculate the TE and the performance of a stump?
The answer is that we must update the sample weights before proceeding to the
next model or stage, because if the same weights are kept, the next model will
simply reproduce the output of the first one.
In boosting, incorrectly classified records get more preference than correctly
classified records.
In AdaBoost, both kinds of records are allowed to pass, but the wrong records are
repeated more often than the correct ones.
We must increase the weights of the wrongly classified records and decrease
the weights of the correctly classified records.
In the next step, we will update the weights based on the performance of the
stump. (A short numeric sketch of Steps 2 and 3 follows below.)
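Here is a minimal numeric sketch of Steps 2 and 3, using the same numbers as the running example (5 records, 1 misclassified): the total error is the sum of the sample weights of the misclassified records, and the stump's performance is 1/2 * ln((1 - TE) / TE).

```python
import numpy as np

sample_weight = 1 / 5                      # initial weight of every record
total_error = 1 * sample_weight            # one misclassified record -> TE = 0.2

# Performance of the stump: 0.5 * ln((1 - TE) / TE)
performance = 0.5 * np.log((1 - total_error) / total_error)
print(round(performance, 3))               # 0.693
```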
Step 4 – Updating Weights
1. For incorrectly classified records, the formula for updating weights is:
New Sample Weight = Sample Weight * e^(Performance)
In our case Sample weight = 1/5 so, 1/5 * e^ (0.693) = 0.399
2. For correctly classified records, we use the same formula with the performance
value being negative. This leads the weight for correctly classified records to be
reduced as compared to the incorrectly classified ones. The formula is:
New Sample Weight = Sample Weight * e^- (Performance)
Putting the values, 1/5 * e^-(0.693) = 0.100
3. The total sum of all the new sample weights should be 1. In this case, the total
updated weight of all the records is not 1 but 0.799 (0.399 + 4 × 0.100). To bring the
sum to 1, every updated weight must be divided by the total sum of the updated
weights. For example, the updated weight of the wrong record is 0.399, and
0.399/0.799 = 0.50.
0.50 is known as the normalized weight of that record; similarly, each correctly
classified record gets a normalized weight of 0.100/0.799 ≈ 0.13.
Normalized Weight = updated weight divided by the total sum of the updated
weights. (A short sketch of this step follows below.)
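Continuing with the same numbers, here is a rough sketch of Step 4: the misclassified record's weight is multiplied by e^(Performance), the correctly classified records' weights by e^-(Performance), and the results are normalized so that they sum to 1 again. Which record is the misclassified one is assumed purely for illustration.

```python
import numpy as np

performance = 0.693
weights = np.full(5, 1 / 5)                                    # initial weights
misclassified = np.array([False, True, False, False, False])   # assume record 2 was wrong

# Increase the weight of wrong records, decrease the weight of correct ones.
updated = np.where(misclassified,
                   weights * np.exp(performance),    # 0.2 * e^(0.693)  ~ 0.399
                   weights * np.exp(-performance))   # 0.2 * e^(-0.693) ~ 0.100

print(round(updated.sum(), 3))        # ~0.799, so the weights must be normalized
normalized = updated / updated.sum()  # wrong record ~0.50, correct ones ~0.13 each
print(normalized.round(2))
```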
Step 5 – Creating a New Dataset
1. Now, it’s time to create a new dataset from the previous one. In the new
dataset, the frequency of incorrectly classified records will be higher than that of
the correct ones.
2. The new dataset has to be created using the normalized weights, so it will tend
to select the wrong records for training. That dataset is used to build the
second decision tree/stump. To build the new dataset based on the normalized
weights, the algorithm divides the range from 0 to 1 into buckets.
3. So, our first bucket is from 0 – 0.13, the second from 0.13 –
0.63 (0.13 + 0.50), the third from 0.63 – 0.76 (0.63 + 0.13), and so on; each
bucket’s width is one record’s normalized weight.
4. After this, the algorithm runs 5 iterations to select records from the
old dataset.
5. Suppose in the 1st iteration the algorithm draws a random value of 0.46. That value
falls into the second bucket, so the corresponding record (the incorrectly
classified one) is selected into the new dataset.
6. It will again draw a random value, see which bucket it falls into, and select that
record for the new dataset. The same process is repeated 5 times.
7. There is a high probability for wrong records to get selected several times; these
selections form the new dataset.
For example, if row 2 was incorrectly classified in the previous dataset, it will
likely get selected multiple times.
Based on this new dataset, the algorithm will create a new decision tree/stump, and it will
repeat the same process from Step 1 until the specified number of stumps has been built and
the error is lower than it was at the initial stage. (A sketch of the bucket-based resampling
follows below.)
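Below is a rough sketch of the bucket-based resampling, assuming the normalized weights from the running example: the bucket edges are the cumulative sums of the normalized weights, N uniform random draws are mapped to buckets, and the heavily weighted (misclassified) record is likely to appear several times in the new dataset. The random seed and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalized weights from Step 4: record 2 (index 1) was misclassified.
updated = np.array([0.100, 0.399, 0.100, 0.100, 0.100])
normalized = updated / updated.sum()
buckets = np.cumsum(normalized)        # roughly [0.13, 0.63, 0.75, 0.88, 1.0]

# Draw 5 random values in [0, 1) and map each one to the bucket it falls into.
draws = rng.random(5)
selected_rows = np.searchsorted(buckets, draws)

# The selected row indices form the new dataset; index 1 (the wrong record)
# is likely to be repeated because its bucket covers half of the range.
print(selected_rows)
```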
How Does the Algorithm Decide Output for Test Data?
1. Suppose with the above dataset, the algorithm constructed 3 decision trees or
stumps.
2. The test dataset will pass through all the stumps which have been constructed by the
algorithm.
3. While passing through the 1st stump, the output it produces is 1. Passing through
the 2nd stump, the output generated once again is 1.
4. While passing through the 3rd stump, it gives the output as 0. In the AdaBoost
algorithm, too, a majority vote takes place among the stumps, in the same way as in
random forests (in the full algorithm, each stump’s vote is additionally weighted by its
performance). In this case, the final output will be 1. This is how the output for
test data is decided. (A short sketch follows below.)
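The sketch below shows how such a final output could be computed for one test record, assuming three already-trained stumps with hypothetical outputs and performances. Each stump's vote is weighted by its performance; with equal performances this reduces to the plain majority vote described above.

```python
import numpy as np

# Hypothetical outputs of the 3 stumps for one test record, and their performances.
stump_outputs = np.array([1, 1, 0])
performances = np.array([0.693, 0.520, 0.310])   # performance ("say") of each stump

# Weighted vote: compare the total performance of stumps voting 1 vs voting 0.
score_for_1 = performances[stump_outputs == 1].sum()
score_for_0 = performances[stump_outputs == 0].sum()
final_output = 1 if score_for_1 >= score_for_0 else 0
print(final_output)   # 1, matching the majority-vote result described above
```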
Important hyperparameters in AdaBoost
The following are the most important hyperparameters of
both AdaBoostClassifier() and AdaBoostRegressor() in scikit-learn.
base_estimator: This is the base learner used in AdaBoost algorithms. The
default and most common learner is a decision tree stump (a decision tree with
max_depth=1) as we discussed earlier.
n_estimators: The maximum number of estimators (models) to train
sequentially. The default is 50. We’ll measure the effect of this hyperparameter
soon.
learning_rate: This determines the weight applied to each estimator in the
boosting process. The default is 1. Smaller values such as 0.05 or 0.1 make the
algorithm train more slowly but often yield higher performance scores (typically in
combination with a larger n_estimators). We’ll measure the effect of this
hyperparameter soon. (A short usage sketch follows below.)
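Here is a minimal usage sketch of these hyperparameters with scikit-learn; the bundled breast-cancer dataset and the train/test split are chosen purely for illustration. The default base learner (a depth-1 stump) is used, so base_estimator does not need to be passed (recent scikit-learn versions call this argument estimator instead).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Example data; any binary classification dataset would work here.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators and learning_rate are the two knobs discussed above;
# the base learner defaults to a decision-tree stump (max_depth=1).
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
model.fit(X_train, y_train)

print(accuracy_score(y_test, model.predict(X_test)))
```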
Advantages and disadvantages
Advantages
1. AdaBoost is less prone to overfitting as the input parameters are not jointly
optimized.
2. The accuracy of weak classifiers can be improved by using AdaBoost. Nowadays,
AdaBoost is used for tasks such as text and image classification, not just binary
classification problems.
Disadvantages
1. The main disadvantage of AdaBoost is that it needs a high-quality dataset.
2. Noisy data and outliers have to be handled before adopting an AdaBoost algorithm.
3. AdaBoost learns progressively (each stump depends on the previous ones), so it
needs high-quality data, more so than Random Forest, for example.
4. It is also very sensitive to outliers and noise in the data, which should be removed
before training. It is also much slower than the XGBoost algorithm.