8 Classification
Types of machine learning
■ Supervised Machine Learning
■ Unsupervised Machine Learning
■ Semi-Supervised Machine Learning
■ Reinforcement Learning
Supervised learning
■ The training data (observations,
measurements, etc.) are accompanied by
labels indicating the class of the
observations
■ New data is classified based on the
training set
• Regression
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
• Classification
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
Unsupervised learning
■ The class labels of training data are unknown
■ Given a set of measurements, observations,
etc. with the aim of establishing the
existence of classes or clusters in the data
■ K-means clustering
■ Hierarchical clustering
■ Anomaly detection
■ Neural Networks
■ Principal Component Analysis
■ Independent Component Analysis
■ Apriori algorithm
■ Singular value decomposition
Classification vs. Regression
■ Classification
■ predicts categorical class labels (discrete or nominal)
■ classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
■ Regression (Numeric Prediction)
■ models continuous-valued functions, i.e., predicts unknown or missing values
■ Typical applications
■ Credit/loan approval
Classification—A Two-Step Process
1. Model Construction (Training)
2. Model usage (Testing)
■ Model construction:
■ Describing a set of predetermined classes
■ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
■ The set of tuples used for model construction is the training set
■ The model is represented as classification rules, decision trees, or mathematical formulae
Classification—A Two-Step Process
1. Model Construction (Training)
2. Model usage (Testing)
■ Model usage:
■ For classifying future or unknown objects
■ Estimate the accuracy of the model
■ The known label of each test sample is compared with the classified result from the model
■ Accuracy rate is the percentage of test set samples that are correctly classified by the model
Process (1): Model Construction
■ The training data is fed to a classification algorithm, which produces the classifier (model)
■ Example of a learned classification rule:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
■ The classifier is first run against testing data to estimate its accuracy, then applied to unseen data
■ Example: for the unseen tuple (Jeff, Professor, 4), the model predicts whether tenured = ‘yes’
Generate decision tree
Input:
- D, a set of training tuples and their associated class labels
- attribute_list
- Attribute_selection_method
Output:
- A decision tree
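A minimal sketch of this recursive procedure, assuming D is a list of dicts with a 'class' key and attribute_selection_method is a pluggable scoring function (both representations are illustrative, not the textbook's exact pseudocode):

from collections import Counter

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    classes = [t['class'] for t in D]
    # Terminating condition 1: all tuples belong to the same class -> leaf
    if len(set(classes)) == 1:
        return classes[0]
    # Terminating condition 2: no attributes remain -> majority-vote leaf
    if not attribute_list:
        return Counter(classes).most_common(1)[0][0]
    # Let the attribute selection method pick the "best" splitting attribute
    A = attribute_selection_method(D, attribute_list)
    tree = {A: {}}
    remaining = [a for a in attribute_list if a != A]
    for value in set(t[A] for t in D):
        Dj = [t for t in D if t[A] == value]   # partition of D on A = value
        tree[A][value] = generate_decision_tree(Dj, remaining,
                                                attribute_selection_method)
    return tree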
Splitting criterion
■ The attribute selection method is used to determine the splitting criterion
■ The splitting criterion tells us which attribute to test at node N by determining the “best” way to separate or partition the tuples in D into individual classes
■ Indicates
■ the splitting attribute, and possibly a split-point or splitting subset
Split-point or splitting subset
■ The tuples in D are partitioned accordingly.
■ Let A be the splitting attribute. A has v distinct values, {a1, a2, …, av}, based on the training data.
■ There are three possible scenarios:
■ A is discrete-valued
■ A is continuous-valued
■ A is discrete-valued and a binary tree must be produced
Terminating conditions
1. All the tuples in partition D (represented at node N) belong to the same
class.
2. There are no remaining attributes on which the tuples may be further
partitioned
■ In this case, majority voting is employed
Attribute Selection Measures
■ Provides a ranking for each attribute describing the given training tuples
■ The attribute having the best score for the measure is chosen as the
splitting attribute for the given tuples.
■ If the splitting attribute is continuous-valued or if we are restricted to
binary trees, then, respectively, either a split point or a splitting subset
must also be determined as part of the splitting criterion.
■ Popular attribute selection measures
■ Information gain
■ Gain ratio
■ Gini Index
Information Gain (ID3/C4.5)
■ Select the attribute with the highest information gain
■ Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
■ Expected information (entropy) needed to classify a tuple in D:
Info(D) = −Σi=1..m pi log2(pi)
■ Information still needed after using attribute A to split D into v partitions:
InfoA(D) = Σj=1..v (|Dj|/|D|) × Info(Dj)
■ Information gained by branching on attribute A:
Gain(A) = Info(D) − InfoA(D)
Worked example (buys_computer data)
■ Calculate Info(D) and the entropy of each age partition (including ‘senior’)
■ Gain of Age attribute
■ Calculate the information gain of Income: entropy for ‘medium’ and ‘low’
■ Gain of income
■ After splitting
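The sketch below reproduces these calculations; the per-value class counts are taken from the buys_computer training set used throughout this chapter (9 ‘yes’ and 5 ‘no’ tuples):

from math import log2

def info(counts):
    # Expected information Info(D) for a list of class counts
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

info_D = info([9, 5])                                 # ≈ 0.940 bits

# age partitions: youth (2 yes, 3 no), middle_aged (4, 0), senior (3, 2)
age = [[2, 3], [4, 0], [3, 2]]
gain_age = info_D - sum(sum(p) / 14 * info(p) for p in age)        # ≈ 0.246

# income partitions: high (2, 2), medium (4, 2), low (3, 1)
income = [[2, 2], [4, 2], [3, 1]]
gain_income = info_D - sum(sum(p) / 14 * info(p) for p in income)  # ≈ 0.029

print(gain_age, gain_income)   # age wins, so it becomes the root split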
Decision tree
■ The final decision tree returned by the algorithm splits on age at the root: the middle_aged branch is a pure ‘yes’ leaf, while the youth and senior branches are split further on student and credit_rating, respectively
Computing Information-Gain for Continuous-Valued Attributes
■ Let attribute A be a continuous-valued attribute
■ Must determine the best split point for A
■ Sort the values of A in increasing order
■ Typically, the midpoint between each pair of adjacent values is considered as a possible split point
■ (ai + ai+1)/2 is the midpoint between the values of ai and ai+1
■ The point with the minimum expected information requirement for A is selected as the split-point for A
■ Split:
■ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
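A sketch of this search, assuming pairs is a list of (value, class_label) tuples for the continuous attribute A:

from math import log2

def info(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def best_split_point(pairs):
    pairs = sorted(pairs)                        # sort the values of A
    labels = sorted(set(c for _, c in pairs))
    best_info, best_mid = float('inf'), None
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                             # identical adjacent values
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2    # candidate (ai + ai+1)/2
        d1 = [c for v, c in pairs if v <= mid]       # A <= split-point
        d2 = [c for v, c in pairs if v > mid]        # A > split-point
        expected = sum(len(d) / len(pairs) * info([d.count(l) for l in labels])
                       for d in (d1, d2))
        if expected < best_info:                 # minimum expected information
            best_info, best_mid = expected, mid
    return best_mid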
Gain Ratio for Attribute Selection (C4.5)
■ The information gain measure is biased towards attributes with a large number of values
■ C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization of information gain):
SplitInfoA(D) = −Σj=1..v (|Dj|/|D|) × log2(|Dj|/|D|)
■ GainRatio(A) = Gain(A)/SplitInfoA(D)
■ Ex. income splits D into partitions of size 4, 6, and 4, so SplitInfo_income(D) = 1.557, and
gain_ratio(income) = 0.029/1.557 = 0.019
■ The attribute with the maximum gain ratio is selected as the splitting attribute
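The same numbers reproduced as a short calculation (partition sizes 4, 6, and 4 for income = high, medium, and low):

from math import log2

def split_info(sizes):
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes)

gain_income = 0.029                               # from the earlier example
gain_ratio = gain_income / split_info([4, 6, 4])  # ≈ 0.029 / 1.557 ≈ 0.019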
Gini Index (CART, IBM IntelligentMiner)
■ If a data set D contains examples from n classes, the gini index gini(D) is defined as
gini(D) = 1 − Σi=1..n pi²
■ If D is split on attribute A into two subsets D1 and D2, the gini index of the split is
giniA(D) = (|D1|/|D|) × gini(D1) + (|D2|/|D|) × gini(D2)
■ Reduction in impurity:
Δgini(A) = gini(D) − giniA(D)
■ The attribute that provides the smallest giniA(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
Computation of Gini Index
■ Ex. D has 9 tuples with buys_computer = “yes” and 5 with “no”:
gini(D) = 1 − (9/14)² − (5/14)² = 0.459
■ Suppose the attribute income partitions D into D1: {low, medium} with 10 tuples and D2: {high} with 4 tuples:
gini_income∈{low,medium}(D) = (10/14) × gini(D1) + (4/14) × gini(D2) = 0.443
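The same computation as a short sketch; the (yes, no) counts per partition follow from the income counts used earlier (D1 = {low, medium}: 7 yes, 3 no; D2 = {high}: 2 yes, 2 no):

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_D = gini([9, 5])                                        # ≈ 0.459
gini_split = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])  # ≈ 0.443
reduction = gini_D - gini_split                              # impurity reduction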
Overfitting and Tree Pruning
■ Overfitting: an induced tree may overfit the training data
■ Too many branches, some may reflect anomalies due to noise or outliers
■ Two pruning approaches address this: prepruning and postpruning (described next)
Prepruning
■ Halt tree construction early
■ Upon halting, the node becomes a leaf
■ The leaf may hold the most frequent class among the subset tuples or the probability distribution of those tuples
■ When constructing a tree, measures such as statistical significance, information gain, Gini index, and so on, can be used to assess the goodness of a split
■ If partitioning the tuples at a node would result in a split that falls below a prespecified threshold, then further partitioning of the given subset is halted
■ There are difficulties, however, in choosing an appropriate threshold: high thresholds could result in oversimplified trees, while low thresholds could result in very little simplification
Postpruning
■ Remove branches from a “fully grown” tree, producing a sequence of progressively pruned trees
■ Use a set of data different from the training data to decide which is the “best pruned tree”
Repetition and Replication Problems in Decision Trees
■ Repetition: an attribute is repeatedly tested along a given branch of the tree
■ Replication: duplicate subtrees exist within the tree
■ Solutions:
• multivariate splits (splits based on a combination of attributes)
• rule-based classifiers
Scalability Problem in Classification for Large Databases
■ Why is decision tree induction popular?
■ relatively fast learning speed (compared with other classification methods)
■ Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
■ What if the training set of class-labeled tuples does not fit in memory?
Scalable Decision Tree Induction Methods
■ SLIQ (EDBT’96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
■ SPRINT (VLDB’96, J. Shafer et al.): constructs an attribute list data structure
Scalability Framework for RainForest
■ Separates the scalability aspects from the criteria that determine the quality of
the tree
■ Builds an AVC-list: AVC (Attribute, Value, Class_label)
■ AVC-set (of an attribute X )
■ Projection of training dataset onto the attribute X and class label where
counts of individual class label are aggregated
■ AVC-group (of a node n )
■ Set of AVC-sets of all predictor attributes at the node n
Rainforest: Training Set and Its AVC Sets
Counts below are buys_computer = yes / no.

AVC-set on age:
age          yes  no
youth         2    3
middle_aged   4    0
senior        3    2

AVC-set on income:
income   yes  no
high      2    2
medium    4    2
low       3    1

AVC-set on student:
student  yes  no
yes       6    1
no        3    4

AVC-set on credit_rating:
credit_rating  yes  no
fair            6    2
excellent       3    3
BOAT (Bootstrapped Optimistic Algorithm for Tree
Construction)
■ Use a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory
■ Each subset is used to create a tree, resulting in several trees
■ These trees are examined and used to construct a new tree T’
■ It turns out that T’ is very close to the tree that would be generated using the whole data set together
■ Advantage: requires only two scans of the DB; BOAT can also be run as an incremental algorithm
Bayesian Classification: Why?
■ Supervised learning algorithm
■ A statistical classifier
■ Performs probabilistic prediction, i.e., predicts class membership
probabilities
■ The probability that a given tuple belongs to a particular class
■ Foundation: Based on Bayes’ Theorem.
■ Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers
Why is it called Naïve Bayes?
■ Due to its core assumption:
■ The occurrence of a certain feature is independent of the occurrence of other features
■ For example, if a fruit is identified on the basis of color, shape, and taste, each feature is assumed to contribute to the class independently of the others
Bayesian Classification
■ Problem statement:
■ Given features X1, X2, …, Xn
■ Predict a label Y
■ Applications:
■ Spam Classification
■ Given an email, predict whether it is spam or not
■ Medical Diagnosis
■ Given a list of symptoms, predict whether a patient has disease X or not
■ Weather
■ Based on temperature, humidity, etc… predict if it will rain tomorrow
■ Digit Recognition
Bayes’ Theorem: Basics
■ Probability of event A, given that event B is true:
P(A|B) = P(B|A) × P(A) / P(B)
■ P(A|B) is the posterior probability, P(B|A) the likelihood, P(A) the prior probability, and P(B) the marginal probability (evidence)
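A quick numeric illustration with made-up spam-filter numbers (all probabilities below are hypothetical):

# P(spam) = 0.2; P("free" | spam) = 0.5; P("free" | not spam) = 0.05
p_spam, p_free_spam, p_free_ham = 0.2, 0.5, 0.05
p_free = p_free_spam * p_spam + p_free_ham * (1 - p_spam)  # evidence = 0.14
posterior = p_free_spam * p_spam / p_free                  # P(spam | "free") ≈ 0.714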
Naïve Bayes Classifier
■ We find the probability of a given set of inputs for all possible values of the class variable y and pick the output with maximum probability. This can be expressed as:
ŷ = argmax_y P(y) × Πi=1..n P(xi | y)
where P(xi | y) is the conditional probability of feature xi given class y
Naïve Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data to be classified:
X = (age = youth, income = medium, student = yes, credit_rating = fair)
buys_computer = ?
Naïve Bayes Classifier: An Example
■ Step 1:
■ P(Ci):
■ P(buys_computer = “yes”) = 9/14 = 0.643
■ P(buys_computer = “no”) = 5/14= 0.357
Naïve Bayes Classifier: An Example
■ Step 2: Compute P(X|Ci) for each class
P(age = “youth” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “youth” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
Naïve Bayes Classifier: An Example
■ Step 3: For X = (age = youth, income = medium, student = yes, credit_rating = fair), compute P(X|Ci):
P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
■ Step 4: Compute P(X|Ci) × P(Ci):
P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.044 × 0.643 = 0.028
P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.019 × 0.357 = 0.007
■ Step 5: Therefore, X belongs to class buys_computer = “yes”
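A compact sketch that reproduces these numbers, with the priors and conditional probabilities taken from Steps 1 and 2 above:

priors = {'yes': 9 / 14, 'no': 5 / 14}          # Step 1: P(Ci)
cond = {                                        # Step 2: P(xk | Ci)
    'yes': {'age=youth': 2 / 9, 'income=medium': 4 / 9,
            'student=yes': 6 / 9, 'credit=fair': 6 / 9},
    'no':  {'age=youth': 3 / 5, 'income=medium': 2 / 5,
            'student=yes': 1 / 5, 'credit=fair': 2 / 5},
}

x = ['age=youth', 'income=medium', 'student=yes', 'credit=fair']
scores = {}
for c in priors:
    p = 1.0
    for feature in x:                 # Step 3: naive independence product
        p *= cond[c][feature]
    scores[c] = p * priors[c]         # Step 4: multiply by the prior
print(max(scores, key=scores.get))    # Step 5: -> 'yes' (0.028 vs 0.007)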
Naïve Bayes Classifier
■ If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) × σ)) × exp(−(x − μ)² / (2σ²))
and P(xk|Ci) = g(xk, μCi, σCi), where μCi and σCi are estimated from the tuples of class Ci
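A one-function sketch of the Gaussian likelihood; the class statistics below (mu = 38, sigma = 12 for an age attribute) are hypothetical:

from math import sqrt, pi, exp

def gaussian(x, mu, sigma):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

p_age_given_ci = gaussian(35, 38, 12)   # plugs into the naive Bayes product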
Avoiding the Zero-Probability Problem
■ Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the whole predicted probability becomes zero
■ Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
■ Use the Laplacian correction (or Laplacian estimator)
■ Adding 1 to each case:
Prob(income = low) = 1/1003, Prob(income = medium) = 991/1003, Prob(income = high) = 11/1003
■ The “corrected” probability estimates stay close to their “uncorrected” counterparts, but none are zero
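The correction for the income example, as a short sketch:

counts = {'low': 0, 'medium': 990, 'high': 10}
total = sum(counts.values()) + len(counts)       # 1000 tuples + 3 added cases
smoothed = {v: (c + 1) / total for v, c in counts.items()}
# low: 1/1003, medium: 991/1003, high: 11/1003 -- no zero probabilities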
Naïve Bayes Classifier: Comments
■ Advantages
■ Easy to implement
■ Disadvantages
■ Assumes class conditional independence, which can cause a loss of accuracy when features are correlated
Rule Extraction from a Decision Tree
■ Rules are easier to understand than large trees
■ One rule is created for each path from the root to a leaf
■ Each attribute-value pair along a path forms a conjunct; the leaf node holds the class prediction
■ Example, from the tree rooted at age (branches <=30, 31..40, >40, with the first branch testing student and the last testing credit rating):
IF age <= 30 AND student = no THEN buys_computer = no
IF age <= 30 AND student = yes THEN buys_computer = yes
IF age = 31..40 THEN buys_computer = yes
IF age > 40 AND credit_rating = excellent THEN buys_computer = no
IF age > 40 AND credit_rating = fair THEN buys_computer = yes
Model Evaluation and Selection
■ Evaluation metrics: How can we measure accuracy? Other metrics to consider?
■ Use a test set of class-labeled tuples, rather than the training set, when assessing accuracy
■ Methods for estimating a classifier’s accuracy:
■ Holdout method, random subsampling
■ Cross-validation
■ Bootstrap
■ Comparing classifiers:
■ Confidence intervals
■ Cost-benefit analysis and ROC Curves
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A\P   C    ¬C
C     TP   FN   P
¬C    FP   TN   N
      P’   N’   All

■ Accuracy = (TP + TN) / All
■ Error rate = (FP + FN) / All
■ Class Imbalance Problem:
■ One class may be rare, e.g. fraud, or HIV-positive
■ Significant majority of the negative class
■ Sensitivity (true positive recognition rate) = TP / P
■ Specificity (true negative recognition rate) = TN / N
Classifier Evaluation Metrics: Precision and Recall, and F-measures
■ Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive
Precision = TP / (TP + FP)
■ Recall (completeness): what % of positive tuples the classifier labeled as positive
Recall = TP / (TP + FN)
■ F measure (F1): the harmonic mean of precision and recall
F = (2 × precision × recall) / (precision + recall)
■ Fβ: a weighted measure, where β is the weight of recall relative to precision
Classifier Evaluation Metrics: Example
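A minimal sketch computing these metrics from a confusion matrix; the counts below are illustrative, chosen to mimic a rare-positive (class-imbalanced) problem:

TP, FN, FP, TN = 90, 210, 140, 9560
P, N = TP + FN, FP + TN                  # actual positives and negatives
accuracy = (TP + TN) / (P + N)           # high (≈ 0.965), but misleading here
sensitivity = TP / P                     # only 0.30 on the rare class
specificity = TN / N                     # ≈ 0.986
precision = TP / (TP + FP)               # ≈ 0.391
recall = sensitivity
f1 = 2 * precision * recall / (precision + recall)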
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
■ Holdout method
■ Given data is randomly partitioned into two independent sets: a training set (typically 2/3) and a test set (1/3)
■ Random subsampling: a variation of holdout, repeating the holdout k times and averaging the accuracies
■ Cross-validation (k-fold, where k = 10 is most popular)
■ Randomly partition the data into k mutually exclusive subsets of approximately equal size; at the i-th iteration, use Di as the test set and the remaining folds as the training set
■ *Stratified cross-validation*: folds are stratified so that class dist. in each fold is approx. the same as that in the initial data
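A sketch of stratified 10-fold cross-validation using scikit-learn (the toy dataset stands in for real class-labeled tuples):

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())       # one accuracy estimate per fold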
Evaluating Classifier Accuracy: Bootstrap
■ Bootstrap
■ Works well with small data sets
■ Samples the given training tuples uniformly with replacement
■ i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the
training set
■ Several bootstrap methods exist; a common one is the .632 bootstrap
■ A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e⁻¹ = 0.368)
■ Repeat the sampling procedure k times; the overall accuracy of the model is
Acc(M) = (1/k) × Σi=1..k (0.632 × Acc(Mi)test + 0.368 × Acc(Mi)train)
Model Selection: ROC Curves
■ ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
■ Originated from signal detection theory
■ Shows the trade-off between the true positive rate and the false positive rate
■ The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
■ The area under the ROC curve (AUC) is a measure of the accuracy of the model
■ A model with perfect accuracy has an area of 1.0
■ The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
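A sketch of producing ROC data with scikit-learn (the toy data and the choice of model are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]        # positive-class probabilities
fpr, tpr, thresholds = roc_curve(y_te, scores)  # one (FPR, TPR) per threshold
print(roc_auc_score(y_te, scores))              # area under the ROC curve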
Ensemble Methods: Increasing the Accuracy
■ Ensemble methods
■ Use a combination of models to increase accuracy
Bagging: Bootstrap Aggregation
■ Analogy: Diagnosis based on multiple doctors’ majority vote
■ Training
■ Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with
replacement from D (i.e., bootstrap)
■ A classifier model Mi is learned for each training set Di
■ Classification: classify an unknown sample X
■ Each classifier Mi returns its class prediction
■ The bagged classifier M* counts the votes and assigns the class with the most votes to X
■ Prediction: can be applied to the prediction of continuous values by taking the average value of each
prediction for a given test tuple
■ Accuracy
■ Often significantly better than a single classifier derived from D
■ For noisy data: not considerably worse, more robust
■ Proven improved accuracy in prediction
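A bagging sketch with scikit-learn; note that recent versions use the estimator parameter (older releases called it base_estimator):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # base classifier Mi
    n_estimators=25,                      # k bootstrap training sets Di
    bootstrap=True,                       # sample with replacement from D
    random_state=0,
).fit(X, y)
print(bag.predict(X[:5]))                 # majority vote of the 25 trees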
Boosting
■ Analogy: consult several doctors, basing the final diagnosis on a combination of weighted diagnoses, where each weight is based on previous diagnostic accuracy
■ How does boosting work?
■ Weights are assigned to each training tuple
■ A series of k classifiers is iteratively learned
■ After a classifier Mi is learned, the weights are updated to allow the subsequent
classifier, Mi+1, to pay more attention to the training tuples that were
misclassified by Mi
■ The final M* combines the votes of each individual classifier, where the weight
of each classifier's vote is a function of its accuracy
■ Boosting algorithm can be extended for numeric prediction
■ Comparing with bagging: Boosting tends to have greater accuracy, but it also risks
overfitting the model to misclassified data
Adaboost (Freund and Schapire, 1997)
■ Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
■ Initially, all the weights of tuples are set the same (1/d)
■ Generate k classifiers in k rounds. At round i,
■ Tuples from D are sampled (with replacement) to form a training set Di of the same size
■ Each tuple’s chance of being selected is based on its weight
■ A classification model Mi is derived from Di
■ Its error rate is calculated using Di as a test set
■ If a tuple is misclassified, its weight is increased, o.w. it is decreased
■ Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi’s error rate is the weighted sum of the misclassified tuples:
error(Mi) = Σj wj × err(Xj)
■ The weight of classifier Mi’s vote is log((1 − error(Mi)) / error(Mi))
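An AdaBoost sketch with scikit-learn, using depth-1 decision stumps as the weak learners (parameter naming as in recent releases):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner
    n_estimators=50,                                # k boosting rounds
    random_state=0,
).fit(X, y)          # tuple weights are raised on misclassified examples
print(ada.score(X, y))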
Random Forest (Breiman 2001)
■ Random Forest:
■ Each classifier in the ensemble is a decision tree classifier and is generated using a random selection
of attributes at each node to determine the split
■ During classification, each tree votes and the most popular class is returned
■ Two Methods to construct Random Forest:
■ Forest-RI (random input selection): Randomly select, at each node, F attributes as candidates for
the split at the node. The CART methodology is used to grow the trees to maximum size
■ Forest-RC (random linear combinations): Creates new attributes (or features) that are a linear
combination of the existing attributes (reduces the correlation between individual classifiers)
■ Comparable in accuracy to Adaboost, but more robust to errors and outliers
■ Insensitive to the number of attributes selected for consideration at each split, and faster than bagging
or boosting
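A Forest-RI-style sketch with scikit-learn, where max_features controls how many candidate attributes F are examined at each split:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,        # number of trees in the ensemble
    max_features='sqrt',     # random subset of attributes per split
    random_state=0,
).fit(X, y)                  # prediction = majority vote across trees
print(rf.score(X, y))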
Classification of Class-Imbalanced Data Sets
■ Class-imbalance problem: Rare positive example but numerous negative ones, e.g.,
medical diagnosis, fraud, oil-spill, fault, etc.
■ Traditional methods assume a balanced distribution of classes and equal error costs:
not suitable for class-imbalanced data
■ Typical methods for imbalance data in 2-class classification:
■ Oversampling: re-sampling of data from positive class
■ Threshold-moving: moves the decision threshold, t, so that the rare class tuples
are easier to classify, and hence, less chance of costly false negative errors
■ Ensemble techniques: Ensemble multiple classifiers introduced above
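A minimal oversampling sketch: the rare positive class is resampled with replacement until the two classes are balanced (the synthetic dataset is illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
pos, neg = X[y == 1], X[y == 0]                 # rare vs. majority class
pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=0)
X_bal = np.vstack([neg, pos_up])
y_bal = np.hstack([np.zeros(len(neg)), np.ones(len(neg))])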
Summary (I)
■ Classification is a form of data analysis that extracts models describing important
data classes.
■ Effective and scalable methods have been developed for decision tree induction,
Naive Bayesian classification, rule-based classification, and many other classification
methods.
■ Evaluation metrics include accuracy, sensitivity, specificity, precision, recall, F measure, and Fβ measure.
■ Stratified k-fold cross-validation is recommended for accuracy estimation. Bagging
and boosting can be used to increase overall accuracy by learning and combining a
series of individual models.
Summary (II)
■ Significance tests and ROC curves are useful for model selection.
■ There have been numerous comparisons of the different classification methods; the
matter remains a research topic
■ No single method has been found to be superior over all others for all data sets
■ Issues such as accuracy, training time, robustness, scalability, and interpretability
must be considered and can involve trade-offs, further complicating the quest for an
overall superior method