ENSEMBLE METHODS: INCREASING THE CLASSIFICATION ACCURACY
Ensemble methods
An ensemble for classification is a composite model, made up of a combination of classifiers.
Use a combination of models to increase accuracy
Popular ensemble methods:
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
Random Forests: a collection of decision trees
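As a concrete point of reference, the sketch below shows how these three popular ensemble styles can be instantiated with scikit-learn. The library choice, parameter values, and the synthetic data set are assumptions for illustration, not part of these notes.

# Hedged sketch: the three popular ensemble methods, instantiated with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "bagging": BaggingClassifier(n_estimators=25, random_state=0),             # vote/average over bootstrapped base classifiers
    "boosting": AdaBoostClassifier(n_estimators=25, random_state=0),           # weighted vote of sequentially learned classifiers
    "random forest": RandomForestClassifier(n_estimators=25, random_state=0),  # decision trees split on random attribute subsets
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy = {scores.mean():.3f}")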
ENSEMBLE METHODS
It combines a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*.
A given data set, D, is used to create k training sets, D1, D2, ..., Dk, where Di (1 ≤ i ≤ k) is used to generate classifier Mi.
Given a new data tuple to classify, the base classifiers each vote by returning a class prediction.
The ensemble returns a class prediction based on the votes of the base classifiers.
An ensemble tends to be more accurate than its base classifiers.
BAGGING: BOOTSTRAP AGGREGATION
Analogy: Diagnosis based on multiple doctors’ majority vote
Training
Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample)
A classifier model Mi is learned for each training set Di
Classification: classify an unknown sample X
Each classifier Mi returns its class prediction
The bagged classifier M* counts the votes and assigns the class
with the most votes to X
Prediction: bagging can also be applied to the prediction of continuous values by taking the average of the individual predictions for a given test tuple
Accuracy
Often significantly better than a single classifier derived from D
For noisy data: not considerably worse, and more robust
Proven to give improved accuracy in prediction
ALGORITHM - BAGGING
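A minimal sketch of the bagging procedure described above, assuming numpy arrays for the data and scikit-learn decision trees as the base classifiers (choices made here for illustration; the slides do not fix a base learner):

# Bagging sketch: k bootstrap samples, k base classifiers, majority vote at prediction time.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=10, seed=0):
    """Learn k base classifiers, each on a bootstrap sample of the d training tuples."""
    rng = np.random.default_rng(seed)
    d = len(X)
    models = []
    for _ in range(k):
        idx = rng.choice(d, size=d, replace=True)          # sample d tuples with replacement (bootstrap)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the base classifiers (assumes integer class labels 0..C-1)."""
    votes = np.array([m.predict(X) for m in models])       # shape (k, n_test)
    # For prediction of continuous values, one would return votes.mean(axis=0) instead.
    return np.array([np.bincount(col).argmax() for col in votes.T])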
BOOSTING
Analogy: Consult several doctors and combine their weighted diagnoses, where each doctor's weight is based on the accuracy of their previous diagnoses
How does boosting work?
Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
The final M* combines the votes of each individual
classifier, where the weight of each classifier's vote is a
function of its accuracy
The basic idea is that when we build a classifier, we want it
to focus more on the misclassified tuples of the previous
round.
Some classifiers may be better at classifying some
“difficult” tuples than others.
In this way, we build a series of classifiers that
complement each other.
ADAPTIVE BOOSTING (AdaBoost)
Suppose we are given D, a data set of d class-labeled tuples, (X1, y1), (X2, y2), ..., (Xd, yd), where yi is the class label of tuple Xi.
Initially, AdaBoost assigns each training tuple an equal weight of
1/d.
Generating k classifiers for the ensemble requires k rounds
through the rest of the algorithm.
In round i, the tuples from D are sampled to form a training set, Di, of size d.
Sampling with replacement is used: the same tuple may be selected more than once.
A classifier model, Mi, is derived from the training tuples of Di.
If a tuple was incorrectly classified, its weight is increased.
If a tuple was correctly classified, its weight is decreased.
These weights will be used to generate the training samples for
the classifier of the next round.
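As a small sketch of the sampling step just described (X and y are assumed to be the full training set D as numpy arrays; the names are illustrative, not from the notes):

import numpy as np

rng = np.random.default_rng(0)
d = len(X)
w = np.full(d, 1.0 / d)                            # initial equal weights of 1/d
idx = rng.choice(d, size=d, replace=True, p=w)     # draw Di of size d, with replacement, biased by the weights
X_i, y_i = X[idx], y[idx]                          # training set for classifier Mi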
ADAPTIVE BOOSTING (AdaBoost)
A tuple's weight reflects how difficult it is to classify: the higher the weight, the more often it has been misclassified
To compute the error rate of model Mi, we sum the weights of each of the tuples in Di that Mi misclassified:
error(Mi) = Σj wj × err(Xj)
where err(Xj) is the misclassification error of tuple Xj: if the tuple was misclassified, then err(Xj) is 1; otherwise, it is 0
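A tiny worked example with illustrative numbers (not from the notes): four tuples with weights 0.1, 0.2, 0.3, 0.4, of which the second and fourth are misclassified by Mi.

import numpy as np

w   = np.array([0.1, 0.2, 0.3, 0.4])    # tuple weights
err = np.array([0,   1,   0,   1])      # err(Xj): 1 if misclassified, else 0
error_Mi = np.sum(w * err)              # 0.2 + 0.4 = 0.6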
ADAPTIVE BOOSTING (AdaBoost)
“Once boosting is complete, how is the ensemble of classifiers
used to predict the class label of a tuple, X?”
Unlike bagging, where each classifier was assigned an equal
vote, boosting assigns a weight to each classifier’s vote, based
on how well the classifier performed
The lower a classifier’s error rate, the more accurate it is
and therefore, the higher its weight for voting should be
The weight of classifier Mi's vote is log((1 - error(Mi)) / error(Mi))
For each class, c, we sum the weights of each classifier that
assigned class c to X
The class with the highest sum is the “winner” and is returned as
the class prediction for tuple X
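A small numeric sketch of the vote weighting and class-sum rule described above (the error rates and predictions are illustrative, not from the notes):

import numpy as np

error = np.array([0.10, 0.25, 0.40])        # error rates of classifiers M1, M2, M3
alpha = np.log((1 - error) / error)         # each classifier's vote weight: log((1 - error(Mi)) / error(Mi))
preds = np.array([1, 0, 1])                 # class label each Mi assigns to tuple X

class_votes = {c: alpha[preds == c].sum() for c in np.unique(preds)}
prediction = max(class_votes, key=class_votes.get)   # the class with the highest weighted sum wins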
ALGORITHM - AdaBoost
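A minimal from-scratch sketch that follows the AdaBoost steps described in these slides: weighted resampling, training Mi, computing error(Mi), updating the tuple weights, and weighting each classifier's vote by log((1 - error(Mi)) / error(Mi)). Decision stumps as base learners, the error(Mi)/(1 - error(Mi)) update factor for correctly classified tuples, and discarding classifiers with error above 0.5 are assumptions taken from the standard AdaBoost.M1 formulation rather than from these slides.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=10, seed=0):
    """Sketch of AdaBoost for integer class labels in {0, 1}."""
    rng = np.random.default_rng(seed)
    d = len(X)
    w = np.full(d, 1.0 / d)                                 # every tuple starts with weight 1/d
    models, alphas, attempts = [], [], 0
    while len(models) < k and attempts < 10 * k:
        attempts += 1
        idx = rng.choice(d, size=d, replace=True, p=w)      # sample Di with replacement, biased by the weights
        m = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])   # decision stump as the base classifier
        miss = m.predict(X) != y
        error = np.sum(w[miss])                             # error(Mi): sum of the weights of misclassified tuples
        if error > 0.5:                                     # too weak: discard Mi and try the round again
            continue
        error = max(error, 1e-10)                           # avoid division by zero when Mi is perfect
        w[~miss] *= error / (1 - error)                     # decrease the weights of correctly classified tuples
        w /= w.sum()                                        # normalize so the weights again sum to 1
        models.append(m)
        alphas.append(np.log((1 - error) / error))          # weight of Mi's vote
    return models, np.array(alphas)

def adaboost_predict(models, alphas, X):
    votes = np.zeros((len(X), 2))                           # weighted vote totals for classes 0 and 1
    for m, a in zip(models, alphas):
        votes[np.arange(len(X)), m.predict(X)] += a
    return votes.argmax(axis=1)                             # class with the highest weighted sum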
RANDOM FOREST (BREIMAN 2001)
Random Forest:
Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
During classification, each tree votes and the most popular class is
returned
Two Methods to construct Random Forest:
Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split. The CART methodology is used to grow the trees to maximum size
Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes
(reduces the correlation between individual classifiers)
Comparable in accuracy to AdaBoost, but more robust to errors and
outliers
Insensitive to the number of attributes selected for consideration at
each split, and faster than bagging or boosting
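A brief scikit-learn sketch of the Forest-RI idea: each tree is grown on a bootstrap sample, and each split considers only a random subset of the attributes (here F = the square root of the number of attributes, a common default). The library, parameter names, and data set are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# n_estimators trees; at each node, max_features randomly selected attributes are candidates for the split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())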
CLASSIFICATION OF CLASS-IMBALANCED DATA SETS
Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
Traditional methods assume a balanced distribution of
classes and equal error costs: not suitable for class-imbalanced data
Typical methods for imbalanced data in two-class classification:
Oversampling: re-sample data from the positive class
Under-sampling: randomly eliminate tuples from the negative class
Threshold-moving: move the decision threshold, t, so that rare-class tuples are easier to classify, and hence there is less chance of costly false negative errors
Ensemble techniques: combine multiple classifiers, e.g., the ensemble methods introduced above
The class-imbalance problem remains difficult for multiclass tasks
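A small sketch of two of the methods listed above, oversampling and threshold-moving, assuming numpy arrays, a 0/1 class encoding with 1 as the rare positive class, and an illustrative threshold of 0.2 (none of these values come from the notes):

import numpy as np

def oversample_positive(X, y, pos_label=1, seed=0):
    """Re-sample the rare positive class (assumed to be the minority) until it matches the negative class."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == pos_label)[0]
    neg = np.where(y != pos_label)[0]
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)   # duplicate positive tuples with replacement
    idx = np.concatenate([neg, pos, extra])
    return X[idx], y[idx]

def threshold_moving_predict(model, X, t=0.2, pos_label=1):
    """Lower the decision threshold t so rare-class tuples are easier to classify as positive.
    Works with any fitted classifier exposing predict_proba, e.g., scikit-learn estimators."""
    p = model.predict_proba(X)[:, 1]            # probability of the positive class (classes assumed to be {0, 1})
    return np.where(p >= t, pos_label, 0)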