8 Classification

Chapter 8 covers the fundamental concepts of classification in machine learning, including decision tree induction, Bayes classification methods, and techniques for improving classification accuracy. It discusses the differences between supervised and unsupervised learning, the classification process, and various algorithms used for model construction and evaluation. Additionally, it addresses challenges such as overfitting and scalability, along with methods like ensemble techniques and pruning to enhance model performance.

Chapter 8.

Classification: Basic Concepts

■ Classification: Basic Concepts


■ Decision Tree Induction
■ Bayes Classification Methods
■ Rule Extraction from a Decision Tree
■ Model Evaluation and Selection
■ Techniques to Improve Classification Accuracy: Ensemble Methods
■ Summary

1
Types of machine learning
■ Supervised Machine Learning
■ Unsupervised Machine Learning
■ Semi-Supervised Machine Learning
■ Reinforcement Learning

2
Supervised learning
■ The training data (observations,
measurements, etc.) are accompanied by
labels indicating the class of the
observations
■ New data is classified based on the
training set
• Regression
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
• Classification
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines

3
Unsupervised learning
■ The class labels of training data are unknown
■ Given a set of measurements, observations,
etc. with the aim of establishing the
existence of classes or clusters in the data
■ K-means clustering
■ KNN (k-nearest neighbors)
■ Hierarchical clustering
■ Anomaly detection
■ Neural Networks
■ Principal Component Analysis
■ Independent Component Analysis
■ Apriori algorithm
■ Singular value decomposition

4
Classification vs. Regression

■ Classification
■ predicts categorical class labels (discrete or nominal)

■ classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
■ Regression (Numeric Prediction)
■ models continuous-valued functions, i.e., predicts unknown or missing values

■ Typical applications
■ Credit/loan approval:

■ Medical diagnosis: if a tumor is cancerous or not

■ Fraud detection: if a transaction is fraudulent or not

■ Web page categorization: which category a page belongs to

5
Classification—A Two-Step Process
1. Model Construction (Training)
2. Model usage (Testing)

■ Model construction:
■ Describing a set of predetermined classes
■ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
attribute
■ The set of tuples used for model construction is the training set
■ The model is represented as classification rules, decision trees, or mathematical formulae

6
Classification—A Two-Step Process
1. Model Construction (Training)
2. Model usage (Testing)

■ Model usage:
■ For classifying future or unknown objects
■ Estimate accuracy of the model
■ The known label of a test sample is compared with the classified result from the model

■ Accuracy rate is the percentage of test set samples that are correctly classified by the model

■ Test set is independent of training set (otherwise overfitting)

■ If the accuracy is acceptable, use the model to classify new data

7
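A minimal Python sketch of the two-step process (scikit-learn assumed available; the dataset and classifier here are illustrative, not from the slides):

# Step 1: model construction on a training set;
# Step 2: model usage on an independent test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)   # training
print("test accuracy:", model.score(X_test, y_test))     # testing
# If the accuracy is acceptable, use model.predict(...) on new data.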
Process (1): Model Construction

[Figure: the training data, together with a classification algorithm, produces the classifier (model). Example of a learned rule:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’]

8
Process (2): Using the Model in Prediction

[Figure: the classifier is first applied to the testing data to estimate accuracy, then to unseen data, e.g., the tuple (Jeff, Professor, 4) → Tenured?]
9
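The rule in the figure can be applied directly; for the unseen tuple (Jeff, Professor, 4) it predicts tenured = ‘yes’ (a sketch in Python):

def tenured(rank, years):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "professor" or years > 6 else "no"

print(tenured("professor", 4))   # Jeff -> 'yes'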
Chapter 8. Classification: Basic Concepts

■ Classification: Basic Concepts


■ Decision Tree Induction
■ Bayes Classification Methods
■ Rule Extraction from a Decision Tree
■ Model Evaluation and Selection
■ Techniques to Improve Classification Accuracy: Ensemble Methods
■ Summary

10
Generate decision tree

Input:
- D, a set of training tuples and their associated class labels
- attribute list
- Attribute selection method

Output:
- A decision tree.

11
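A sketch of the tree-generation algorithm in Python; the helper names and the toy data are illustrative, and the attribute selection method is passed in as a function (e.g., information gain, defined below):

from collections import Counter

def generate_decision_tree(D, attribute_list, select_attribute):
    """D: list of (tuple_dict, class_label) pairs."""
    labels = [label for _, label in D]
    if len(set(labels)) == 1:               # all tuples in the same class
        return labels[0]                    # -> leaf
    if not attribute_list:                  # no attributes left
        return Counter(labels).most_common(1)[0][0]   # majority voting
    A = select_attribute(D, attribute_list) # splitting criterion
    remaining = [a for a in attribute_list if a != A]
    tree = {A: {}}
    for v in {t[A] for t, _ in D}:          # one branch per value of A
        Dv = [(t, c) for t, c in D if t[A] == v]
        tree[A][v] = generate_decision_tree(Dv, remaining, select_attribute)
    return tree

D = [({"age": "youth", "student": "no"}, "no"),
     ({"age": "youth", "student": "yes"}, "yes"),
     ({"age": "senior", "student": "no"}, "yes")]
print(generate_decision_tree(D, ["age", "student"], lambda D, attrs: attrs[0]))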
Splitting criterion
■ The attribute selection method is used to determine the splitting criterion
■ The splitting criterion tells us which attribute to test at node N by determining the “best” way to separate or partition the tuples in D into individual classes
■ It indicates
■ the splitting attribute
■ the split-point or splitting subset
■ The splitting criterion is determined so that, ideally, the resulting partitions at each branch are as “pure” as possible

12
Split-point or splitting subset
■ The tuples in D are partitioned accordingly.
■ Let A be the splitting attribute. A has v distinct values, {a1, a2, …, av}, based on the training data.
■ There are three possible scenarios:
■ A is discrete-valued
■ A is continuous-valued
■ A is discrete-valued and a binary tree must be produced
13
Terminating conditions
1. All the tuples in partition D (represented at node N) belong to the same
class.
2. There are no remaining attributes on which the tuples may be further
partitioned
■ In this case, majority voting is employed

3. There are no tuples for a given branch


■ The computational complexity of the algorithm given training set D is O(n × |D| × log(|D|))
■ n = number of attributes describing the tuples in D.

■ |D| = number of training tuples in D.

14
Attribute selection Measures
■ Provides a ranking for each attribute describing the given training tuples
■ The attribute having the best score for the measure is chosen as the
splitting attribute for the given tuples.
■ If the splitting attribute is continuous-valued or if we are restricted to
binary trees, then, respectively, either a split point or a splitting subset
must also be determined as part of the splitting criterion.
■ Popular attribute selection measures
■ Information gain

■ Gain ratio

■ Gini Index

15
Information Gain (ID3/C4.5)
■ Select the attribute with the highest information gain
■ Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by
|Ci, D|/|D|
■ Expected information (entropy) needed to classify a tuple in D:

Info(D) = −Σi pi log2(pi), summed over the m classes

■ Information needed (after using A to split D into v partitions) to classify D:

InfoA(D) = Σj (|Dj|/|D|) × Info(Dj), summed over the v partitions

■ Information gained by branching on attribute A:

Gain(A) = Info(D) − InfoA(D)
16
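A sketch computing entropy and information gain for the age attribute of the 14-tuple buys_computer data used in the following worked example:

from math import log2

def info(counts):
    """Info(D) = -sum(pi * log2(pi)) over the classes."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

info_D = info([9, 5])   # 9 "yes", 5 "no" -> 0.940
# age partitions: youth (2 yes, 3 no), middle_aged (4, 0), senior (3, 2)
parts = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(p) / 14 * info(p) for p in parts)   # 0.694
print(round(info_D - info_age, 3))   # Gain(age) ≈ 0.247 (0.246 when intermediates are rounded)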
Calculate Entropy for senior

20
Gain of Age Attribute

21
Calculation information Gain of Income

22
Calculate Entropy for ‘medium’

23
Calculate Entropy for ‘low’

24
Gain of income

25
After Splitting

26
Decision tree
■ The final decision tree returned by the algorithm is shown in Figure

27
Computing Information-Gain for Continuous-Valued
Attributes
■ Let attribute A be a continuous-valued attribute
■ Must determine the best split point for A
■ Sort the values of A in increasing order
■ Typically, the midpoint between each pair of adjacent values is considered as a
possible split point
■ (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
■ The point with the minimum expected information requirement for A is selected
as the split-point for A
■ Split:
■ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in
D satisfying A > split-point
28
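A small sketch of the candidate split points (the attribute values shown are illustrative):

values = sorted([25, 32, 40, 46, 60])
midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
print(midpoints)   # [28.5, 36.0, 43.0, 53.0]
# Each midpoint is evaluated; the one minimizing Info_A(D) becomes the
# split-point, giving D1 (A <= split-point) and D2 (A > split-point).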
Gain Ratio for Attribute Selection (C4.5)
■ Information gain measure is biased towards attributes with a large number of values
■ C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain):

SplitInfoA(D) = −Σj (|Dj|/|D|) × log2(|Dj|/|D|)

■ GainRatio(A) = Gain(A)/SplitInfoA(D)
■ Ex. SplitInfoincome(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
■ gain_ratio(income) = 0.029/1.557 = 0.019
■ The attribute with the maximum gain ratio is selected as the splitting attribute

30
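A sketch verifying the gain-ratio numbers above (income partitions of size 4, 6, and 4):

from math import log2

def split_info(sizes):
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes)

si = split_info([4, 6, 4])      # 1.557
print(round(0.029 / si, 3))     # GainRatio(income) = 0.019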
Gini Index (CART, IBM IntelligentMiner)

■ If a data set D contains examples from n classes, the gini index, gini(D), is defined as

gini(D) = 1 − Σj pj²

where pj is the relative frequency of class j in D

■ If a data set D is split on A into two subsets D1 and D2, the gini index giniA(D) is defined as

giniA(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)

■ Reduction in Impurity:

Δgini(A) = gini(D) − giniA(D)

■ The attribute that provides the smallest ginisplit(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)

31
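A sketch of the Gini computations, using the class counts from the example on the next slide:

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_D = gini([9, 5])                                       # 0.459
# split on income: D1 = {low, medium} (7 yes, 3 no), D2 = {high} (2, 2)
gini_split = 10/14 * gini([7, 3]) + 4/14 * gini([2, 2])     # 0.443
print(round(gini_D - gini_split, 3))                        # reduction in impurity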
Computation of Gini Index
■ Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”:

gini(D) = 1 − (9/14)² − (5/14)² = 0.459

■ Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:

gini income∈{low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2) = 0.443

■ Gini{low,high} is 0.458 and Gini{medium,high} is 0.450. Thus, split on {low, medium} (and {high}) since it has the lowest Gini index
■ All attributes are assumed continuous-valued
■ May need other tools, e.g., clustering, to get the possible split values
■ Can be modified for categorical attributes
33
Comparing Attribute Selection Measures

■ The three measures, in general, return good results but


■ Information gain:
■ biased towards multivalued attributes
■ Gain ratio:
■ tends to prefer unbalanced splits in which one partition is much smaller than
the others
■ Gini index:
■ biased to multivalued attributes
■ has difficulty when # of classes is large
■ tends to favor tests that result in equal-sized partitions and purity in both
partitions
34
Other Attribute Selection Measures
■ CHAID: a popular decision tree algorithm, measure based on χ2 test for independence
■ C-SEP: performs better than info. gain and gini index in certain cases
■ G-statistic: has a close approximation to χ2 distribution
■ MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
■ The best tree as the one that requires the fewest # of bits to both (1) encode the tree, and (2)
encode the exceptions to the tree
■ Multivariate splits (partition based on multiple variable combinations)
■ CART: finds multivariate splits based on a linear comb. of attrs.
■ Which attribute selection measure is the best?
■ Most give good results; none is significantly superior to the others

35
Overfitting and Tree Pruning
■ Overfitting: An induced tree may overfit the training data
■ Too many branches, some may reflect anomalies due to noise or outliers

■ Poor accuracy for unseen samples

■ High Training Accuracy but Low Testing Accuracy

■ Two approaches to avoid overfitting


■ Prepruning:

■ Postpruning:

36
Prepruning
■ Halt tree construction early
■ Upon halting, the node becomes a leaf
■ The leaf may hold the most frequent class among the subset tuples or the probability
distribution of those tuples
■ When constructing a tree, measures such as statistical significance, information gain,
Gini index, and so on, can be used to assess the goodness of a split.
■ If partitioning the tuples at a node would result in a split that falls below a
prespecified threshold, then further partitioning of the given subset is halted.
■ There are difficulties, however, in choosing an appropriate threshold.
■ High thresholds could result in oversimplified trees,

■ whereas low thresholds could result in very little simplification.

37
Postpruning
■ Remove branches from a
“fully grown” tree—get a
sequence of progressively
pruned trees
■ Use a set of data different
from the training data to
decide which is the “best
pruned tree”

38
Repetition and Replication problem in Decision Trees

[Figure: repetition (an attribute is repeatedly tested along a given branch of the tree) and replication (duplicate subtrees exist within the tree).]

Solutions:
• multivariate splits (splits based on a combination of attributes)
• rule-based classifiers
39
Scalability problem in Classification for Large Databases
■ Why is decision tree induction popular?
■ relatively faster learning speed (than other classification methods)

■ convertible to simple and easy to understand classification rules

■ can use SQL queries for accessing databases

■ comparable classification accuracy with other methods

■ Scalability: Classifying data sets with millions of examples and hundreds of attributes
with reasonable speed
■ What if the training set of class-labeled tuples does not fit in memory?

40
Scalable Decision Tree Induction Methods

■ SLIQ (EDBT’96 — Mehta et al.)


■ Builds an index for each attribute; only the class list and the current attribute list reside in memory
■ SPRINT (VLDB’96 — J. Shafer et al.)
■ Constructs an attribute list data structure

■ PUBLIC (VLDB’98 — Rastogi & Shim)


■ Integrates tree splitting and tree pruning: stop growing the tree earlier

■ RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)


■ Builds an AVC-list (attribute, value, class label)

■ BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)


■ Uses bootstrapping to create several small samples

41
Scalability Framework for RainForest

■ Separates the scalability aspects from the criteria that determine the quality of
the tree
■ Builds an AVC-list: AVC (Attribute, Value, Class_label)
■ AVC-set (of an attribute X )
■ Projection of training dataset onto the attribute X and class label where
counts of individual class label are aggregated
■ AVC-group (of a node n )
■ Set of AVC-sets of all predictor attributes at the node n

42
Rainforest: Training Set and Its AVC Sets

Training Examples: the 14-tuple buys_computer data set

AVC-set on Age:
Age         | Buy_Computer: yes | no
youth       | 2 | 3
middle_aged | 4 | 0
senior      | 3 | 2

AVC-set on income:
income | Buy_Computer: yes | no
high   | 2 | 2
medium | 4 | 2
low    | 3 | 1

AVC-set on Student:
student | Buy_Computer: yes | no
yes     | 6 | 1
no      | 3 | 4

AVC-set on credit_rating:
Credit rating | Buy_Computer: yes | no
fair      | 6 | 2
excellent | 3 | 3

43
BOAT (Bootstrapped Optimistic Algorithm for Tree
Construction)
■ Use a statistical technique called bootstrapping to create several smaller
samples (subsets), each fits in memory
■ Each subset is used to create a tree, resulting in several trees
■ These trees are examined and used to construct a new tree T’
■ It turns out that T’ is very close to the tree that would be generated using the
whole data set together
■ Adv: requires only two scans of DB, an incremental alg.

44
Chapter 8. Classification: Basic Concepts

■ Classification: Basic Concepts


■ Decision Tree Induction
■ Bayes Classification Methods
■ Rule Extraction from a Decision Tree
■ Model Evaluation and Selection
■ Techniques to Improve Classification Accuracy: Ensemble Methods
■ Summary

45
Bayesian Classification: Why?
■ Supervised learning algorithm
■ A statistical classifier
■ Performs probabilistic prediction, i.e., predicts class membership
probabilities
■ The probability that a given tuple belongs to a particular class
■ Foundation: Based on Bayes’ Theorem.
■ Performance: A simple Bayesian classifier, naive Bayesian classifier, has
comparable performance with decision tree and selected neural
network classifiers

46
Why is it called Naïve Bayes?
■ Due to its assumption:
■ The occurrence of a certain feature is independent of the occurrence of other features.
■ For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple.
■ Hence each feature individually contributes to identifying that it is an apple, without depending on the others.
■ Also known as class conditional independence

47
Bayesian Classification
■ Problem statement:
■ Given features X1, X2, …, Xn
■ Predict a label Y

■ Applications:
■ Spam Classification
■ Given an email, predict whether it is spam or not
■ Medical Diagnosis
■ Given a list of symptoms, predict whether a patient has disease X or not
■ Weather
■ Based on temperature, humidity, etc… predict if it will rain tomorrow
■ Digit Recognition
Bayes’ Theorem: Basics
■ Probability of event A, given that event B is true:

P(A|B) = P(B|A) × P(A) / P(B)

posterior probability = (likelihood × prior probability) / (marginal probability, or evidence)

■ where A and B are events and P(B) ≠ 0

49
Naïve Bayes Classifier

■ Where y is the class variable and X = (x1, x2, …, xn) is a dependent feature vector of size n:

P(y|X) = P(X|y) × P(y) / P(X)

■ For example, X = {age = youth, Income = medium, Student = yes, Credit_rating = Fair} and y = Buys_computer = yes
■ So P(y|X) here means the probability of buying a computer given that the customer’s age is youth, income is medium, credit rating is fair, and he is a student.
■ Assumption: independence among the features
■ if any two events A and B are independent, then P(A,B) = P(A)P(B)
50
Classification Is to Derive the Maximum Posteriori

■ Now, as the denominator remains constant for a given input, we can remove that term:

P(y|X) ∝ P(y) × Πi P(xi|y)

■ We find the probability of the given set of inputs for all possible values of the class variable y and pick the output with maximum probability. This can be expressed as:

y = argmaxy P(y) Πi P(xi|y)

51
Naïve Bayes Classifier: Training Dataset

Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’

Data to be classified:
X = (age=youth,
Income = medium,
Student = yes,
Credit_rating = Fair)
Buys_computer=?

52
Naïve Bayes Classifier: An Example

■ Step 1:
■ P(Ci):
■ P(buys_computer = “yes”) = 9/14 = 0.643
■ P(buys_computer = “no”) = 5/14= 0.357

53
Naïve Bayes Classifier: An Example
■ Step 2:
■ Compute P(X|Ci) for each class
P(age = “youth” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “youth” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
54
Naïve Bayes Classifier: An Example

■ Step 3:
■ X = (age = youth, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X|buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

■ Step 4:
P(X|Ci) × P(Ci):
P(X|buys_computer = “yes”) × P(buys_computer = “yes”) = 0.044 × 0.643 = 0.028
P(X|buys_computer = “no”) × P(buys_computer = “no”) = 0.019 × 0.357 = 0.007

■ Step 5:
Therefore, X belongs to class (“buys_computer = yes”)
55
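A sketch reproducing Steps 1-5 of the worked example:

p_yes, p_no = 9/14, 5/14                    # Step 1: priors P(Ci)
# Steps 2-3: P(X|Ci) for X = (youth, medium, student=yes, fair)
px_yes = (2/9) * (4/9) * (6/9) * (6/9)      # 0.044
px_no = (3/5) * (2/5) * (1/5) * (2/5)       # 0.019
# Step 4: P(X|Ci) * P(Ci)
score_yes, score_no = px_yes * p_yes, px_no * p_no   # 0.028, 0.007
# Step 5: maximum posteriori class
print("buys_computer =", "yes" if score_yes > score_no else "no")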
Naïve Bayes Classifier
■ If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

g(x, μ, σ) = (1 / (√(2π) σ)) × e^(−(x−μ)² / (2σ²))

and P(xk|Ci) = g(xk, μCi, σCi)
56
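A sketch of the Gaussian class-conditional density (the mean and standard deviation values here are illustrative):

from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    return 1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# e.g., P(age = 35 | Ci) when age within Ci has mean 38 and std 12
print(gaussian(35, 38, 12))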
Avoiding the Zero-Probability Problem
■ Naïve Bayesian prediction requires each conditional prob. be non-zero. Otherwise,
the predicted prob. will be zero

■ Ex. Suppose a dataset with 1000 tuples, income=low (0), income= medium (990), and
income = high (10)
■ Use Laplacian correction (or Laplacian estimator)
■ Adding 1 to each case:
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
■ The “corrected” prob. estimates are close to their “uncorrected” counterparts

57
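A sketch of the Laplacian correction for the income example above:

counts = {"low": 0, "medium": 990, "high": 10}
total, k = sum(counts.values()), len(counts)   # add 1 to each of k=3 cases
corrected = {v: (c + 1) / (total + k) for v, c in counts.items()}
print(corrected)   # low: 1/1003, medium: 991/1003, high: 11/1003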
Naïve Bayes Classifier: Comments
■ Advantages
■ Easy to implement

■ Good results obtained in most of the cases

■ Disadvantages
■ Assumption: class conditional independence, therefore loss of accuracy

■ Practically, dependencies exist among variables

■ E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)


■ Dependencies among these cannot be modeled by Naïve Bayes Classifier

■ How to deal with these dependencies? Bayesian Belief Networks (Chapter 9)

58
Chapter 8. Classification: Basic Concepts

■ Classification: Basic Concepts


■ Decision Tree Induction
■ Bayes Classification Methods
■ Rule Extraction from a Decision Tree
■ Model Evaluation and Selection
■ Techniques to Improve Classification Accuracy: Ensemble Methods
■ Summary

59
Rule Extraction from a Decision Tree

■ Rules are easier to understand than large trees
■ One rule is created for each path from the root to a leaf
■ Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
■ Rules are mutually exclusive and exhaustive

[Figure: the buys_computer decision tree. The root tests age? with branches <=30, 31..40, and >40; the <=30 branch tests student? (no → no, yes → yes); the 31..40 branch is a “yes” leaf; the >40 branch tests credit rating? (excellent → no, fair → yes).]

■ Example: Rule extraction from our buys_computer decision tree

IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes

60
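The five extracted rules, written directly as mutually exclusive and exhaustive IF-THEN tests (a sketch):

def buys_computer(age, student, credit_rating):
    if age == "young":
        return "yes" if student == "yes" else "no"
    if age == "mid-age":
        return "yes"
    # age == "old"
    return "no" if credit_rating == "excellent" else "yes"

print(buys_computer("young", "yes", "fair"))   # -> yes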
Chapter 8. Classification: Basic Concepts

■ Classification: Basic Concepts


■ Decision Tree Induction
■ Bayes Classification Methods
■ Rule Extraction from a Decision Tree
■ Model Evaluation and Selection
■ Techniques to Improve Classification Accuracy: Ensemble Methods
■ Summary

61
Model Evaluation and Selection
■ Evaluation metrics: How can we measure accuracy? Other metrics to consider?
■ Use validation test set of class-labeled tuples instead of training set when assessing
accuracy
■ Methods for estimating a classifier’s accuracy:
■ Holdout method, random subsampling
■ Cross-validation
■ Bootstrap
■ Comparing classifiers:
■ Confidence intervals
■ Cost-benefit analysis and ROC Curves

62
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:

Actual class \ Predicted class | C1                   | ¬C1
C1                             | True Positives (TP)  | False Negatives (FN)
¬C1                            | False Positives (FP) | True Negatives (TN)

Example of Confusion Matrix:

Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             | 6954               | 46                | 7000
buy_computer = no              | 412                | 2588              | 3000
Total                          | 7366               | 2634              | 10000

■ Given m classes, an entry CMi,j in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j
■ May have extra rows/columns to provide totals
63
Confusion Matrix for multi class classification

64
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A \ P | C  | ¬C |
C     | TP | FN | P
¬C    | FP | TN | N
      | P’ | N’ | All

■ Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
Accuracy = (TP + TN)/All
■ Error rate: 1 − accuracy, or
Error rate = (FP + FN)/All
■ Class Imbalance Problem:
■ One class may be rare, e.g., fraud or HIV-positive
■ Significant majority of the negative class and minority of the positive class
■ Sensitivity: True Positive recognition rate
■ Sensitivity = TP/P
■ Specificity: True Negative recognition rate
■ Specificity = TN/N
65
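A sketch computing these metrics from the buys_computer confusion matrix shown earlier (TP = 6954, FN = 46, FP = 412, TN = 2588):

TP, FN, FP, TN = 6954, 46, 412, 2588
P, N = TP + FN, FP + TN
All = P + N
print("accuracy   :", (TP + TN) / All)   # 0.9542
print("error rate :", (FP + FN) / All)   # 0.0458
print("sensitivity:", TP / P)            # 0.9934
print("specificity:", TN / N)            # 0.8627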
Classifier Evaluation Metrics: Precision and Recall, and F-measures

■ Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive?

Precision = TP / (TP + FP)

■ Recall: completeness – what % of positive tuples did the classifier label as positive?

Recall = TP / (TP + FN)

■ Perfect score is 1.0
■ Inverse relationship between precision & recall
■ F measure (F1 or F-score): harmonic mean of precision and recall:

F = (2 × precision × recall) / (precision + recall)

■ Fβ: weighted measure of precision and recall:

Fβ = ((1 + β²) × precision × recall) / (β² × precision + recall)

■ assigns β times as much weight to recall as to precision
66
Classifier Evaluation Metrics: Example

Actual Class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes                   | 90           | 210         | 300   | 30.00 (sensitivity)
cancer = no                    | 140          | 9560        | 9700  | 98.56 (specificity)
Total                          | 230          | 9770        | 10000 | 96.40 (accuracy)

■ Precision = 90/230 = 39.13%   Recall = 90/300 = 30.00%

67
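A quick sketch verifying the numbers above (TP = 90, FP = 140, FN = 210):

TP, FP, FN = 90, 140, 210
precision = TP / (TP + FP)          # 90/230 = 0.3913
recall = TP / (TP + FN)             # 90/300 = 0.3000
f1 = 2 * precision * recall / (precision + recall)   # 0.3396
beta = 2                            # weight recall beta times as much
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
print(precision, recall, f1, f_beta)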
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
■ Holdout method
■ Given data is randomly partitioned into two independent sets

■ Training set (e.g., 2/3) for model construction

■ Test set (e.g., 1/3) for accuracy estimation

■ Random sampling: a variation of holdout

■ Repeat holdout k times, accuracy = avg. of the accuracies obtained

■ Cross-validation (k-fold, where k = 10 is most popular)


■ Randomly partition the data into k mutually exclusive subsets, each approximately
equal size
■ At the i-th iteration, use Di as the test set and the others as the training set
■ Leave-one-out: k folds where k = # of tuples, for small-sized data

■ *Stratified cross-validation*: folds are stratified so that class dist. in each fold is
approx. the same as that in the initial data

68
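A sketch of stratified 10-fold cross-validation with scikit-learn (assumed available; the dataset and model are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=skf)
print("mean accuracy:", scores.mean())   # average over the 10 folds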
Evaluating Classifier Accuracy: Bootstrap
■ Bootstrap
■ Works well with small data sets
■ Samples the given training tuples uniformly with replacement
■ i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the
training set
■ Several bootstrap methods, and a common one is the .632 bootstrap
■ A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368)
■ Repeat the sampling procedure k times; the overall accuracy of the model is:

Acc(M) = Σi [0.632 × Acc(Mi)test_set + 0.368 × Acc(Mi)train_set], summed over the k rounds
69
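A sketch of one .632 bootstrap round (illustrative):

import random

d = 1000
train = [random.randrange(d) for _ in range(d)]   # sample d times with replacement
test = set(range(d)) - set(train)                 # tuples never drawn
print(len(test) / d)   # close to 0.368 for large d
# Over k such rounds, Acc(M) averages
# 0.632 * Acc(Mi)_test_set + 0.368 * Acc(Mi)_train_set.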
Model Selection: ROC Curves

■ ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
■ Originated from signal detection theory
■ Shows the trade-off between the true positive rate and the false positive rate
■ The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
■ The area under the ROC curve is a measure of the accuracy of the model
■ A model with perfect accuracy will have an area of 1.0
■ The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
70
Model Selection: ROC Curves

■ If AUC = 1, the classifier can correctly distinguish between all the Positive and the Negative class points.
■ When 0.5 < AUC < 1, there is a high chance that the classifier will be able to distinguish the positive class values from the negative ones.
■ If AUC = 0.5, the classifier is not able to distinguish between Positive and Negative class points: it either predicts a random class or a constant class for all the data points.
■ If AUC had been 0, the classifier would predict all Negatives as Positives and all Positives as Negatives.
71
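A sketch computing an ROC curve and AUC from predicted class probabilities (scikit-learn assumed available; the data and model are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probs = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, probs)        # false/true positive rates
print("AUC:", roc_auc_score(y_te, probs))   # 1.0 = perfect, 0.5 = random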
Issues Affecting Model Selection
■ Accuracy
■ classifier accuracy: predicting class label
■ predictor accuracy: guessing value of predicted attributes
■ Speed
■ time to construct the model (training time)
■ time to use the model (classification/prediction time)
■ Robustness: handling noise and missing values
■ Scalability: efficiency in disk-resident databases
■ Interpretability
■ understanding and insight provided by the model
■ Other measures, e.g., goodness of rules, such as decision tree size or compactness of
classification rules
72
Chapter 8. Classification: Basic Concepts

■ Classification: Basic Concepts


■ Decision Tree Induction
■ Bayes Classification Methods
■ Rule-Based Classification
■ Rule Extraction from a Decision Tree
■ Techniques to Improve Classification Accuracy: Ensemble Methods
(Extra)
■ Summary
73
Ensemble Methods: Increasing the Accuracy

■ Ensemble methods
■ Use a combination of models to increase accuracy

■ Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
■ Popular ensemble methods
■ Bagging: averaging the prediction over a collection of classifiers

■ Boosting: weighted vote with a collection of classifiers

■ Ensemble: combining a set of heterogeneous classifiers

74
Bagging: Bootstrap Aggregation
■ Analogy: Diagnosis based on multiple doctors’ majority vote
■ Training
■ Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with
replacement from D (i.e., bootstrap)
■ A classifier model Mi is learned for each training set Di
■ Classification: classify an unknown sample X
■ Each classifier Mi returns its class prediction
■ The bagged classifier M* counts the votes and assigns the class with the most votes to X
■ Prediction: can be applied to the prediction of continuous values by taking the average value of each
prediction for a given test tuple
■ Accuracy
■ Often significantly better than a single classifier derived from D
■ For noisy data: not considerably worse, more robust
■ Proved improved accuracy in prediction

75
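A sketch of bagging with scikit-learn (assumed available; the estimator parameter name follows recent scikit-learn versions, and the data is illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=10,    # k models M1..Mk
                        bootstrap=True,     # each Di sampled with replacement
                        random_state=0).fit(X, y)
print(bag.predict(X[:1]))                   # majority vote of the 10 trees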
Boosting
■ Analogy: Consult several doctors, based on a combination of weighted
diagnoses—weight assigned based on the previous diagnosis accuracy
■ How does boosting work?
■ Weights are assigned to each training tuple
■ A series of k classifiers is iteratively learned
■ After a classifier Mi is learned, the weights are updated to allow the subsequent
classifier, Mi+1, to pay more attention to the training tuples that were
misclassified by Mi
■ The final M* combines the votes of each individual classifier, where the weight
of each classifier's vote is a function of its accuracy
■ Boosting algorithm can be extended for numeric prediction
■ Comparing with bagging: Boosting tends to have greater accuracy, but it also risks
overfitting the model to misclassified data

76
Adaboost (Freund and Schapire, 1997)
■ Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
■ Initially, all the weights of tuples are set the same (1/d)
■ Generate k classifiers in k rounds. At round i,
■ Tuples from D are sampled (with replacement) to form a training set Di of the same size
■ Each tuple’s chance of being selected is based on its weight
■ A classification model Mi is derived from Di
■ Its error rate is calculated using Di as a test set
■ If a tuple is misclassified, its weight is increased; otherwise, it is decreased
■ Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi’s error rate is the sum of the weights of the misclassified tuples:

error(Mi) = Σj wj × err(Xj)

■ The weight of classifier Mi’s vote is

log((1 − error(Mi)) / error(Mi))
77
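A sketch of AdaBoost with scikit-learn (assumed available; data illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)
# tuple weights are updated after each of the 50 rounds, and each
# round's vote is weighted by a function of its error rate
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print("training accuracy:", ada.score(X, y))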
Random Forest (Breiman 2001)
■ Random Forest:
■ Each classifier in the ensemble is a decision tree classifier and is generated using a random selection
of attributes at each node to determine the split
■ During classification, each tree votes and the most popular class is returned
■ Two Methods to construct Random Forest:
■ Forest-RI (random input selection): Randomly select, at each node, F attributes as candidates for
the split at the node. The CART methodology is used to grow the trees to maximum size
■ Forest-RC (random linear combinations): Creates new attributes (or features) that are a linear
combination of the existing attributes (reduces the correlation between individual classifiers)
■ Comparable in accuracy to Adaboost, but more robust to errors and outliers
■ Insensitive to the number of attributes selected for consideration at each split, and faster than bagging
or boosting

78
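A sketch of a Forest-RI-style random forest with scikit-learn (assumed available; data illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100,
                            max_features="sqrt",  # F candidate attributes per node
                            random_state=0).fit(X, y)
print(rf.predict(X[:1]))   # most popular class among the trees' votes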
Classification of Class-Imbalanced Data Sets

■ Class-imbalance problem: Rare positive example but numerous negative ones, e.g.,
medical diagnosis, fraud, oil-spill, fault, etc.
■ Traditional methods assume a balanced distribution of classes and equal error costs:
not suitable for class-imbalanced data
■ Typical methods for imbalance data in 2-class classification:
■ Oversampling: re-sampling of data from positive class

■ Under-sampling: randomly eliminate tuples from negative class

■ Threshold-moving: moves the decision threshold, t, so that the rare class tuples
are easier to classify, and hence, less chance of costly false negative errors
■ Ensemble techniques: Ensemble multiple classifiers introduced above

■ Still difficult for class imbalance problem on multiclass tasks

79
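A sketch of two of the techniques above, random oversampling and threshold-moving, on synthetic imbalanced data (scikit-learn and NumPy assumed available):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 1.6).astype(int)          # ~5% rare positive class

# Oversampling: re-sample positive-class tuples until the classes balance
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
extra = rng.choice(pos, size=len(neg), replace=True)
X_bal, y_bal = np.vstack([X[neg], X[extra]]), np.hstack([y[neg], y[extra]])

# Threshold-moving: lower the decision threshold t below 0.5 so that
# rare-class tuples are easier to classify (fewer costly false negatives)
model = LogisticRegression().fit(X, y)
pred = (model.predict_proba(X)[:, 1] >= 0.2).astype(int)
print("positives at t=0.5:", model.predict(X).sum(), "| at t=0.2:", pred.sum())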
Chapter 8. Classification: Basic Concepts

■ Classification: Basic Concepts


■ Decision Tree Induction
■ Bayes Classification Methods
■ Rule-Based Classification
■ Rule Extraction from a Decision Tree
■ Techniques to Improve Classification Accuracy: Ensemble Methods
■ Summary

80
Summary (I)
■ Classification is a form of data analysis that extracts models describing important
data classes.
■ Effective and scalable methods have been developed for decision tree induction,
Naive Bayesian classification, rule-based classification, and many other classification
methods.
■ Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F
measure, and Fβ measure.
■ Stratified k-fold cross-validation is recommended for accuracy estimation. Bagging
and boosting can be used to increase overall accuracy by learning and combining a
series of individual models.

81
Summary (II)
■ Significance tests and ROC curves are useful for model selection.
■ There have been numerous comparisons of the different classification methods; the
matter remains a research topic
■ No single method has been found to be superior over all others for all data sets
■ Issues such as accuracy, training time, robustness, scalability, and interpretability
must be considered and can involve trade-offs, further complicating the quest for an
overall superior method

82
