Data Mining Concepts and Techniques

Chapter 6 discusses classification and prediction in data mining, defining classification as a process that predicts categorical class labels and prediction as modeling continuous-valued functions. It covers various classification methods such as decision trees, support vector machines, and Bayesian classification, along with issues related to data preparation and evaluation of classification methods. The chapter also emphasizes the importance of model construction, accuracy measures, and techniques like information gain and Gini index for attribute selection.


Chapter 6. Classification and Prediction

Data Mining: Concepts and Techniques (Chapter 6)
Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.

Chapter outline
 What is classification? What is prediction?
 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian classification
 Rule-based classification
 Classification by backpropagation
 Support Vector Machines (SVM)
 Associative classification
 Lazy learners (or learning from your neighbors)
 Other classification methods
 Prediction
 Accuracy and error measures
 Ensemble methods
 Model selection
 Summary

Classification vs. Prediction

 Classification
  predicts categorical class labels (discrete or nominal)
  constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
 Prediction
  models continuous-valued functions, i.e., predicts unknown or missing values
 Typical applications
  Credit approval
  Target marketing
  Medical diagnosis
  Fraud detection

Classification: A Two-Step Process

 Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae
 Model usage: classifying future or unknown objects
  Estimate the accuracy of the model
   The known label of each test sample is compared with the classified result from the model
   Accuracy rate is the percentage of test set samples that are correctly classified by the model
   The test set is independent of the training set; otherwise over-fitting will occur
  If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
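A minimal sketch of this two-step process (not from the slides), assuming scikit-learn is available; rank is encoded numerically (0 = Assistant Prof, 1 = Associate Prof, 2 = Professor) purely for illustration, and the training/test tuples mirror the tenure example on the next slides.

    # Step 1: construct a model from the training data.
    # Step 2: estimate accuracy on an independent test set, then classify an unseen tuple.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X_train = [[0, 3], [0, 7], [2, 2], [1, 7], [0, 6], [1, 3]]   # (rank, years)
    y_train = ["no", "yes", "yes", "yes", "no", "no"]
    X_test = [[0, 2], [1, 7], [2, 5], [0, 7]]                    # Tom, Merlisa, George, Joseph
    y_test = ["no", "no", "yes", "yes"]

    model = DecisionTreeClassifier()          # model construction (training step)
    model.fit(X_train, y_train)

    print(accuracy_score(y_test, model.predict(X_test)))   # accuracy on the independent test set
    print(model.predict([[2, 4]])[0])                       # unseen tuple (Jeff, Professor, 4)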
Process (1): Model Construction

 The training data is fed to a classification algorithm, which produces a classifier (the model)

    NAME    RANK            YEARS   TENURED
    Mike    Assistant Prof  3       no
    Mary    Assistant Prof  7       yes
    Bill    Professor       2       yes
    Jim     Associate Prof  7       yes
    Dave    Assistant Prof  6       no
    Anne    Associate Prof  3       no

 Learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Process (2): Using the Model in Prediction

 The classifier is evaluated on testing data and then applied to unseen data

    NAME     RANK            YEARS   TENURED
    Tom      Assistant Prof  2       no
    Merlisa  Associate Prof  7       no
    George   Professor       5       yes
    Joseph   Assistant Prof  7       yes

 Unseen data: (Jeff, Professor, 4) → Tenured? yes

Supervised vs. Unsupervised Learning

 Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set
 Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues: Data Preparation

 Data cleaning
  Preprocess data in order to reduce noise and handle missing values
 Relevance analysis (feature selection)
  Remove irrelevant or redundant attributes
 Data transformation
  Generalize and/or normalize data

Issues: Evaluating Classification Methods

 Accuracy
  Classifier accuracy: predicting the class label
  Predictor accuracy: guessing the value of the predicted attribute
 Speed
  Time to construct the model (training time)
  Time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency on disk-resident databases
 Interpretability: understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

Decision Tree Induction: Training Dataset

 This example follows Quinlan's ID3 (Playing Tennis)

    age     income  student  credit_rating  buys_computer
    <=30    high    no       fair           no
    <=30    high    no       excellent      no
    31…40   high    no       fair           yes
    >40     medium  no       fair           yes
    >40     low     yes      fair           yes
    >40     low     yes      excellent      no
    31…40   low     yes      excellent      yes
    <=30    medium  no       fair           no
    <=30    low     yes      fair           yes
    >40     medium  yes      fair           yes
    <=30    medium  yes      excellent      yes
    31…40   medium  no       excellent      yes
    31…40   high    yes      fair           yes
    >40     medium  no       excellent      no
Output: A Decision Tree for "buys_computer"

 Root test: age?
  age <=30 → student?
   student = no → no
   student = yes → yes
  age 31..40 → yes
  age >40 → credit_rating?
   credit_rating = excellent → no
   credit_rating = fair → yes

Algorithm for Decision Tree Induction

 Basic algorithm (a greedy algorithm)
  The tree is constructed in a top-down, recursive, divide-and-conquer manner
  At the start, all the training examples are at the root
  Attributes are categorical (continuous-valued attributes are discretized in advance)
  Examples are partitioned recursively based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
 Conditions for stopping partitioning
  All samples for a given node belong to the same class
  There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  There are no samples left
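As a sketch of this induction algorithm in practice (not the book's code, assuming pandas and scikit-learn are available), the training table above can be handed to a standard tree learner; the one-hot encoding step is an implementation detail needed because scikit-learn trees expect numeric inputs.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    rows = [
        ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
        ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
        ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
        ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
    ]
    df = pd.DataFrame(rows, columns=["age", "income", "student", "credit_rating", "buys_computer"])
    X = pd.get_dummies(df.drop(columns="buys_computer"))   # one-hot encode the categorical attributes
    y = df["buys_computer"]

    tree = DecisionTreeClassifier(criterion="entropy")     # information gain, as in ID3/C4.5
    tree.fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))  # text rendering of the induced tree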

Attribute Selection Measure: Information Gain (ID3/C4.5)

 Select the attribute with the highest information gain
 Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|
 Expected information (entropy) needed to classify a tuple in D:
   Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
 Information needed (after using A to split D into v partitions) to classify D:
   Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
 Information gained by branching on attribute A:
   Gain(A) = Info(D) - Info_A(D)

Attribute Selection: Information Gain

 Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)
   Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940
 For attribute age:

    age     p_i  n_i  I(p_i, n_i)
    <=30    2    3    0.971
    31…40   4    0    0
    >40     3    2    0.971

   Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694
 Here \frac{5}{14} I(2,3) means "age <=30" covers 5 of the 14 samples, with 2 yes's and 3 no's. Hence
   Gain(age) = Info(D) - Info_age(D) = 0.246
 Similarly,
   Gain(income) = 0.029
   Gain(student) = 0.151
   Gain(credit_rating) = 0.048
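A minimal sketch (not the book's code) that reproduces Gain(age) = 0.246, using only the age column and the class column from the training table above.

    from math import log2
    from collections import Counter

    def info(labels):
        # Info(D) = -sum_i p_i * log2(p_i), the expected information (entropy)
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain(attr_values, labels):
        # Gain(A) = Info(D) - Info_A(D), where attribute A is given by attr_values
        parts = {}
        for a, y in zip(attr_values, labels):
            parts.setdefault(a, []).append(y)
        info_a = sum(len(p) / len(labels) * info(p) for p in parts.values())
        return info(labels) - info_a

    age  = ["<=30", "<=30", "31..40", ">40", ">40", ">40", "31..40",
            "<=30", "<=30", ">40", "<=30", "31..40", "31..40", ">40"]
    buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
            "no", "yes", "yes", "yes", "yes", "yes", "no"]

    print(round(gain(age, buys), 3))   # 0.246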
Computing Information Gain for Continuous-Valued Attributes

 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
  Sort the values of A in increasing order
  Typically, the midpoint between each pair of adjacent values is considered as a possible split point
   (a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}
  The point with the minimum expected information requirement for A is selected as the split point for A
 Split:
  D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point

Gain Ratio for Attribute Selection (C4.5)

 The information gain measure is biased towards attributes with a large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization of information gain):
   SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)
 GainRatio(A) = Gain(A) / SplitInfo_A(D)
 Ex. splitting on income (4 low, 6 medium, and 4 high tuples):
   SplitInfo_income(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557
   GainRatio(income) = 0.029 / 1.557 = 0.019
 The attribute with the maximum gain ratio is selected as the splitting attribute

Gini Index (CART, IBM IntelligentMiner)

 If a data set D contains examples from n classes, the gini index, gini(D), is defined as
   gini(D) = 1 - \sum_{j=1}^{n} p_j^2
 where p_j is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
   gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)
 Reduction in impurity:
   \Delta gini(A) = gini(D) - gini_A(D)
 The attribute that provides the smallest gini_split(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (this requires enumerating all possible splitting points for each attribute)

Gini Index: Example

 Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no":
   gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459
 Suppose the attribute income partitions D into D1 = {low, medium} with 10 tuples and D2 = {high} with 4 tuples:
   gini_{income \in \{low, medium\}}(D) = \frac{10}{14} gini(D_1) + \frac{4}{14} gini(D_2) = 0.443
 The other binary splits on income, {low, high}/{medium} and {medium, high}/{low}, give 0.458 and 0.450, so {low, medium} (equivalently {high}) gives the lowest gini index and is the best binary split on income
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split values
 Can be modified for categorical attributes
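A small sketch that checks the values quoted above; the class counts per partition (D1 = {low, medium}: 7 yes / 3 no, D2 = {high}: 2 yes / 2 no) are tallied from the training table.

    def gini(counts):
        # gini(D) = 1 - sum_j p_j^2, given the class counts of a partition
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def gini_split(partitions):
        # gini_A(D) = sum_k (|D_k| / |D|) * gini(D_k) for a binary (or n-ary) split
        total = sum(sum(p) for p in partitions)
        return sum(sum(p) / total * gini(p) for p in partitions)

    print(round(gini([9, 5]), 3))                  # 0.459 for the whole data set
    print(round(gini_split([[7, 3], [2, 2]]), 3))  # 0.443 for the {low, medium} | {high} split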
Comparing Attribute Selection Measures

 The three measures, in general, return good results, but
  Information gain: biased towards multivalued attributes
  Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
  Gini index: biased towards multivalued attributes

Other Attribute Selection Measures

 CHAID: a popular decision tree algorithm; measure based on the χ2 test for independence
 C-SEP: performs better than information gain and gini index in certain cases
 G-statistic: has a close approximation to the χ2 distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  The best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
  CART: finds multivariate splits based on a linear combination of attributes
 Which attribute selection measure is the best?
  Most give good results; none is significantly superior to the others

Overfitting and Tree Pruning

 Overfitting: an induced tree may overfit the training data
  Too many branches, some of which may reflect anomalies due to noise or outliers
  Poor accuracy for unseen samples
 Two approaches to avoid overfitting
  Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
   Difficult to choose an appropriate threshold
  Postpruning: remove branches from a "fully grown" tree, producing a sequence of progressively pruned trees
   Use a set of data different from the training data to decide which is the "best pruned tree"

Enhancements to Basic Decision Tree Induction

 Allow for continuous-valued attributes
  Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
 Handle missing attribute values
  Assign the most common value of the attribute
  Assign a probability to each of the possible values
 Attribute construction
  Create new attributes based on existing ones that are sparsely represented
  This reduces fragmentation, repetition, and replication
Classification in Large Databases

 Classification: a classical problem extensively studied by statisticians and machine learning researchers
 Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
 Why decision tree induction in data mining?
  Relatively fast learning speed (compared with other classification methods)
  Convertible to simple and easy-to-understand classification rules
  Can use SQL queries for accessing databases
  Comparable classification accuracy with other methods

Scalable Decision Tree Induction Methods

 SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
 SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute list data structure
 PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning; stops growing the tree earlier
 RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): builds an AVC-list (attribute, value, class label)
 BOAT (PODS'99, Gehrke, Ganti, Ramakrishnan & Loh): uses bootstrapping to create several small samples

Bayesian Classification: Why?

 A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
 Foundation: based on Bayes' theorem
 Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers
 Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
 Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem: Basics

 Let X be a data sample ("evidence"); its class label is unknown
 Let H be the hypothesis that X belongs to class C
 Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
  E.g., X will buy a computer, regardless of age, income, …
 P(X): the probability that the sample data is observed
 P(X|H) (likelihood): the probability of observing sample X given that the hypothesis holds
  E.g., given that X will buy a computer, the probability that X is 31..40 with medium income

Bayesian Theorem

 Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
   P(H|X) = \frac{P(X|H) P(H)}{P(X)}
 Informally, this can be written as posterior = likelihood × prior / evidence
 Predict that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for the k classes
 Practical difficulty: requires initial knowledge of many probabilities, and incurs significant computational cost

Towards the Naïve Bayesian Classifier

 Let D be a training set of tuples and their associated class labels, with each tuple represented by an n-dimensional attribute vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm
 Classification is to derive the maximum posterior, i.e., the maximal P(Ci|X)
 This can be derived from Bayes' theorem:
   P(C_i|X) = \frac{P(X|C_i) P(C_i)}{P(X)}
 Since P(X) is constant for all classes, only P(X|C_i) P(C_i) needs to be maximized

Derivation of the Naïve Bayes Classifier

 A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
   P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \dots \times P(x_n|C_i)
 This greatly reduces the computation cost: only the class distribution needs to be counted
 If A_k is categorical, P(x_k|C_i) is the number of tuples in C_i having value x_k for A_k, divided by |C_{i,D}| (the number of tuples of C_i in D)
 If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a Gaussian distribution with mean µ and standard deviation σ:
   g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
 and P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
Naïve Bayesian Classifier: Training Dataset

 Classes: C1: buys_computer = 'yes', C2: buys_computer = 'no' (using the AllElectronics training data shown earlier)
 Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Naïve Bayesian Classifier: An Example

 P(Ci):
   P(buys_computer = "yes") = 9/14 = 0.643
   P(buys_computer = "no") = 5/14 = 0.357
 Compute P(X|Ci) for each class:
   P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
   P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
   P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
   P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
   P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
   P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
   P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
   P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
 For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
   P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
   P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
   P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
   P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
 Therefore, X belongs to the class buys_computer = "yes"
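A minimal naïve Bayes sketch (not the book's code) that reproduces the hand computation above for X = (age <= 30, income = medium, student = yes, credit_rating = fair).

    from collections import Counter, defaultdict

    data = [  # (age, income, student, credit_rating) -> buys_computer
        (("<=30", "high", "no", "fair"), "no"), (("<=30", "high", "no", "excellent"), "no"),
        (("31..40", "high", "no", "fair"), "yes"), ((">40", "medium", "no", "fair"), "yes"),
        ((">40", "low", "yes", "fair"), "yes"), ((">40", "low", "yes", "excellent"), "no"),
        (("31..40", "low", "yes", "excellent"), "yes"), (("<=30", "medium", "no", "fair"), "no"),
        (("<=30", "low", "yes", "fair"), "yes"), ((">40", "medium", "yes", "fair"), "yes"),
        (("<=30", "medium", "yes", "excellent"), "yes"), (("31..40", "medium", "no", "excellent"), "yes"),
        (("31..40", "high", "yes", "fair"), "yes"), ((">40", "medium", "no", "excellent"), "no"),
    ]

    priors = Counter(label for _, label in data)        # class counts: yes = 9, no = 5
    cond = defaultdict(Counter)                         # per-class (attribute index, value) counts
    for x, label in data:
        for k, v in enumerate(x):
            cond[label][(k, v)] += 1

    def score(x, label):
        # P(X|Ci) * P(Ci) under the class-conditional independence assumption
        p = priors[label] / len(data)
        for k, v in enumerate(x):
            p *= cond[label][(k, v)] / priors[label]
        return p

    x = ("<=30", "medium", "yes", "fair")
    for label in priors:
        print(label, round(score(x, label), 3))   # yes: ~0.028, no: ~0.007 -> predict "yes"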

Avoiding the Zero-Probability Problem

 Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:
   P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)
 Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
 Use the Laplacian correction (or Laplacian estimator)
  Adding 1 to each case:
   Prob(income = low) = 1/1003
   Prob(income = medium) = 991/1003
   Prob(income = high) = 11/1003
  The "corrected" probability estimates are close to their "uncorrected" counterparts

Naïve Bayesian Classifier: Comments

 Advantages
  Easy to implement
  Good results obtained in most of the cases
 Disadvantages
  Assumption of class-conditional independence, which causes a loss of accuracy
  Practically, dependencies exist among variables
   E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
   Dependencies among these cannot be modeled by the naïve Bayesian classifier
 How to deal with these dependencies? Bayesian belief networks
Bayesian Belief Networks

 A Bayesian belief network allows a subset of the variables to be conditionally independent
 A graphical model of causal relationships
  Represents dependencies among the variables
  Gives a specification of the joint probability distribution
 Nodes: random variables
 Links: dependencies
  If X and Y are the parents of Z, and Y is the parent of P, then there is no direct dependency between Z and P
  The network has no loops or cycles

Bayesian Belief Network: An Example

 Variables: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea
 The conditional probability table (CPT) for the variable LungCancer:

          (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
    LC    0.8      0.5       0.7       0.1
    ~LC   0.2      0.5       0.3       0.9

 The CPT gives the conditional probability for each possible combination of values of a node's parents
 Derivation of the probability of a particular combination of values x1, …, xn from the CPTs:
   P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(X_i))
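A sketch of that factored joint probability for a fragment of the network above. The LungCancer CPT values come from the table; the prior probabilities for FamilyHistory and Smoker are made-up placeholders, since the slides do not give them.

    p_fh = 0.10          # hypothetical P(FamilyHistory = true), for illustration only
    p_s = 0.30           # hypothetical P(Smoker = true), for illustration only
    p_lc_given = {       # CPT for LungCancer from the table, keyed by (FH, S)
        (True, True): 0.8, (True, False): 0.5, (False, True): 0.7, (False, False): 0.1,
    }

    def joint(fh, s, lc):
        # P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S); FH and S have no parents here
        p = (p_fh if fh else 1 - p_fh) * (p_s if s else 1 - p_s)
        p_lc = p_lc_given[(fh, s)]
        return p * (p_lc if lc else 1 - p_lc)

    print(joint(True, True, True))    # 0.10 * 0.30 * 0.8 = 0.024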

Training Bayesian Networks

 Several scenarios:
  Given both the network structure and all variables observable: learn only the CPTs
  Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
  Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
  Unknown structure, all hidden variables: no good algorithms are known for this purpose
 Ref.: D. Heckerman, Bayesian networks for data mining
Using IF-THEN Rules for Classification

 Represent the knowledge in the form of IF-THEN rules
  R: IF age = youth AND student = yes THEN buys_computer = yes
  Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
  n_covers = number of tuples covered by R
  n_correct = number of tuples correctly classified by R
  coverage(R) = n_covers / |D|   /* D: training data set */
  accuracy(R) = n_correct / n_covers
 If more than one rule is triggered, conflict resolution is needed
  Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
  Class-based ordering: decreasing order of prevalence or misclassification cost per class
  Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts

Rule Extraction from a Decision Tree

 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
 Rules are mutually exclusive and exhaustive
 Example: rule extraction from our buys_computer decision tree
  IF age = young AND student = no THEN buys_computer = no
  IF age = young AND student = yes THEN buys_computer = yes
  IF age = mid-age THEN buys_computer = yes
  IF age = old AND credit_rating = excellent THEN buys_computer = no
  IF age = old AND credit_rating = fair THEN buys_computer = yes

Rule Extraction from the Training Data

 Sequential covering algorithm: extracts rules directly from training data
 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
 Steps:
  Rules are learned one at a time
  Each time a rule is learned, the tuples covered by the rule are removed
  The process repeats on the remaining tuples until a termination condition, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
 Compare with decision-tree induction, which learns a set of rules simultaneously

How to Learn One Rule?

 Start with the most general rule possible: condition = empty
 Add new attribute tests by adopting a greedy depth-first strategy
  Pick the one that most improves the rule quality
 Rule-quality measures: consider both coverage and accuracy
  FOIL gain (in FOIL and RIPPER): assesses the information gained by extending the condition:
   FOIL\_Gain = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right)
  It favors rules that have high accuracy and cover many positive tuples
 Rule pruning is based on an independent set of test tuples:
   FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}
  where pos/neg are the numbers of positive/negative tuples covered by R
  If FOIL_Prune is higher for the pruned version of R, prune R
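A small sketch of the two rule-quality formulas above; the coverage counts (pos = 20, neg = 10 before extending the condition, pos' = 15, neg' = 2 after) are illustrative numbers, not values from the text.

    from math import log2

    def foil_gain(pos, neg, pos2, neg2):
        # FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))
        return pos2 * (log2(pos2 / (pos2 + neg2)) - log2(pos / (pos + neg)))

    def foil_prune(pos, neg):
        # FOIL_Prune(R) = (pos - neg) / (pos + neg), computed on an independent pruning set
        return (pos - neg) / (pos + neg)

    print(round(foil_gain(20, 10, 15, 2), 3))   # information gained by the more specific condition
    print(round(foil_prune(15, 2), 3))          # prune R if the pruned version scores higher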
Classification: A Mathematical Mapping

 Classification: predicts categorical class labels
 E.g., personal homepage classification
  x_i = (x1, x2, x3, …), y_i = +1 or -1
  x1: number of occurrences of the word "homepage"
  x2: number of occurrences of the word "welcome"
 Mathematically
  x ∈ X = ℜ^n, y ∈ Y = {+1, -1}
  We want a function f: X → Y

Linear Classification

 Binary classification problem
 [Figure: two classes of points, 'x' and 'o', separated by a line; the data above the line belongs to class 'x', the data below the line to class 'o']
 Examples: SVM, Perceptron, probabilistic classifiers

Discriminative Classifiers

 Advantages
  Prediction accuracy is generally high (as compared to Bayesian methods in general)
  Robust: works when training examples contain errors
  Fast evaluation of the learned target function (Bayesian networks are normally slow)
 Criticism
  Long training time
  Difficult to understand the learned function (weights); Bayesian networks can be used easily for pattern discovery
  Not easy to incorporate domain knowledge, which is easy in the form of priors on the data or distributions
Perceptron & Winnow

 Vectors: x, w; scalars: x, y, w
 Input: {(x1, y1), …}
 Output: a classification function f(x) such that
  f(xi) > 0 for yi = +1
  f(xi) < 0 for yi = -1
 f(x): w·x + b = 0, or w1 x1 + w2 x2 + b = 0
 Perceptron: updates w additively
 Winnow: updates w multiplicatively

Classification by Backpropagation

 Backpropagation: a neural network learning algorithm
 Started by psychologists and neurobiologists to develop and test computational analogues of neurons
 A neural network: a set of connected input/output units where each connection has a weight associated with it
 During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
 Also referred to as connectionist learning due to the connections between units
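A minimal sketch of the perceptron's additive update described above; the 2-D toy data, learning rate, and number of passes are illustrative assumptions, not taken from the slides.

    data = [((2.0, 3.0), +1), ((3.0, 3.5), +1), ((1.0, 0.5), -1), ((0.5, 1.5), -1)]
    w, b, lr = [0.0, 0.0], 0.0, 0.1

    for _ in range(20):                       # a few passes over the training data
        for (x1, x2), y in data:
            f = w[0] * x1 + w[1] * x2 + b     # f(x) = w . x + b
            if y * f <= 0:                    # misclassified: update w additively
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y

    print(w, b)                               # separating line w1*x1 + w2*x2 + b = 0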

Neural Network as a Classifier

 Weaknesses
  Long training time
  Requires a number of parameters typically best determined empirically, e.g., the network topology or "structure"
  Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of the "hidden units" in the network
 Strengths
  High tolerance to noisy data
  Ability to classify untrained patterns
  Well suited for continuous-valued inputs and outputs
  Successful on a wide array of real-world data
  Algorithms are inherently parallel
  Techniques have recently been developed for the extraction of rules from trained neural networks

A Neuron (= a Perceptron)

 The inputs x0, …, xn, with weights w0, …, wn, are combined in a weighted sum with a bias µ_k and passed through an activation function f to produce the output y
 For example, with a sign activation function:
   y = sign\left( \sum_{i=0}^{n} w_i x_i + \mu_k \right)
 The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping
A Multi-Layer Feed-Forward Neural Network

 Layers: an input layer fed by the input vector X, one or more hidden layers, and an output layer that emits the output vector
 Forward pass and weight/bias updates (with learning rate l):
   I_j = \sum_i w_{ij} O_i + \theta_j
   O_j = \frac{1}{1 + e^{-I_j}}
   Err_j = O_j (1 - O_j)(T_j - O_j)              (for an output unit)
   Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}     (for a hidden unit)
   w_{ij} = w_{ij} + (l)\, Err_j\, O_i
   \theta_j = \theta_j + (l)\, Err_j

How a Multi-Layer Neural Network Works

 The inputs to the network correspond to the attributes measured for each training tuple
 Inputs are fed simultaneously into the units making up the input layer
 They are then weighted and fed simultaneously to a hidden layer
 The number of hidden layers is arbitrary, although usually only one is used
 The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction
 The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer
 From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function

Defining a Network Topology

 First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer
 Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
 One input unit per domain value, each initialized to 0
 Output: for classification with more than two classes, one output unit per class is used
 Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights

Backpropagation

 Iteratively process a set of training tuples and compare the network's prediction with the actual known target value
 For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value
 Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer, hence "backpropagation"
 Steps
  Initialize weights (to small random numbers) and biases in the network
  Propagate the inputs forward (by applying the activation function)
  Backpropagate the error (by updating weights and biases)
  Check the terminating condition (when the error is very small, etc.)
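A sketch of these backpropagation updates for a network with one hidden layer of logistic units, trained on XOR; the layer sizes, learning rate, and number of epochs are illustrative choices, not values from the slides.

    import math, random

    random.seed(0)
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
    n_in, n_hid, l = 2, 3, 0.5                                  # l is the learning rate

    w_hid = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
    t_hid = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]   # hidden-unit biases
    w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]
    t_out = random.uniform(-0.5, 0.5)                           # output-unit bias

    def forward(x):
        # I_j = sum_i w_ij * O_i + theta_j ; O_j = 1 / (1 + e^(-I_j))
        o_hid = [1 / (1 + math.exp(-(sum(w_hid[j][i] * x[i] for i in range(n_in)) + t_hid[j])))
                 for j in range(n_hid)]
        o_out = 1 / (1 + math.exp(-(sum(w_out[j] * o_hid[j] for j in range(n_hid)) + t_out)))
        return o_hid, o_out

    for _ in range(5000):
        for x, target in data:
            o_hid, o_out = forward(x)                           # propagate the inputs forward
            err_out = o_out * (1 - o_out) * (target - o_out)    # error at the output unit
            err_hid = [o_hid[j] * (1 - o_hid[j]) * err_out * w_out[j] for j in range(n_hid)]
            for j in range(n_hid):                              # backpropagate: update w and theta
                w_out[j] += l * err_out * o_hid[j]
                for i in range(n_in):
                    w_hid[j][i] += l * err_hid[j] * x[i]
                t_hid[j] += l * err_hid[j]
            t_out += l * err_out

    print([round(forward(x)[1], 2) for x, _ in data])   # typically close to [0, 1, 1, 0]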
Backpropagation and Interpretability

 Efficiency of backpropagation: each epoch (one iteration through the training set) takes O(|D| × w) time, with |D| tuples and w weights, but the number of epochs can be exponential in n, the number of inputs, in the worst case
 Rule extraction from networks: network pruning
  Simplify the network structure by removing weighted links that have the least effect on the trained network
  Then perform link, unit, or activation value clustering
  The sets of input and activation values are studied to derive rules describing the relationship between the input and hidden unit layers
 Sensitivity analysis: assess the impact that a given input variable has on a network output; the knowledge gained from this analysis can be represented in rules

SVM: Support Vector Machines

 A relatively new classification method for both linear and nonlinear data
 It uses a nonlinear mapping to transform the original training data into a higher dimension
 Within the new dimension, it searches for the linear optimal separating hyperplane (i.e., "decision boundary")
 With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
 SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)

SVM: History and Applications

 Vapnik and colleagues (1992): groundwork from Vapnik and Chervonenkis' statistical learning theory of the 1960s
 Features: training can be slow, but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)
 Used both for classification and prediction
 Applications: handwritten digit recognition, object recognition, speaker identification, benchmark time-series prediction tests
SVM: General Philosophy, Margins, and Support Vectors

 [Figure: a small margin vs. a large margin between two classes; the training tuples lying on the margin boundaries are the support vectors]

SVM: When Data Is Linearly Separable

 Let the data D be (X1, y1), …, (X|D|, y|D|), where the Xi are the training tuples with associated class labels yi
 There are an infinite number of lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
 SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)

SVM: Linearly Separable

 A separating hyperplane can be written as
   W · X + b = 0
 where W = {w1, w2, …, wn} is a weight vector and b is a scalar (bias)
 For 2-D data it can be written as
   w0 + w1 x1 + w2 x2 = 0
 The hyperplanes defining the sides of the margin:
   H1: w0 + w1 x1 + w2 x2 >= 1 for yi = +1, and
   H2: w0 + w1 x1 + w2 x2 <= -1 for yi = -1
 Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
 This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints → quadratic programming (QP) → Lagrangian multipliers
Why Is SVM Effective on High-Dimensional Data?

 The complexity of the trained classifier is characterized by the number of support vectors rather than the dimensionality of the data
 The support vectors are the essential or critical training examples: they lie closest to the decision boundary (MMH)
 If all other training examples were removed and the training repeated, the same separating hyperplane would be found
 The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
 Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high

SVM: Linearly Inseparable

 Transform the original input data into a higher-dimensional space
 Search for a linear separating hyperplane in the new space

SVM: Kernel Functions

 Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)
 Typical kernel functions include the polynomial kernel, the Gaussian radial basis function (RBF) kernel, and the sigmoid kernel
 SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)

Scaling SVM by Hierarchical Micro-Clustering

 SVM is not scalable to the number of data objects in terms of training time and memory usage
 "Classifying Large Data Sets Using SVMs with Hierarchical Clusters", Hwanjo Yu, Jiong Yang, Jiawei Han, KDD'03
 CB-SVM (Clustering-Based SVM)
  Given a limited amount of system resources (e.g., memory), maximize the SVM performance in terms of accuracy and training speed
  Use micro-clustering to effectively reduce the number of points to be considered
  When deriving support vectors, de-cluster the micro-clusters near "candidate vectors" to ensure high classification accuracy
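A sketch (assuming scikit-learn is available) of a soft-margin SVM with an RBF kernel on a toy, linearly inseparable data set; the data generation and parameter values are illustrative, not from the slides.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # circular (nonlinear) class boundary

    clf = SVC(kernel="rbf", C=1.0, gamma="scale")       # K(Xi, Xj) = exp(-gamma * ||Xi - Xj||^2)
    clf.fit(X, y)

    print("training accuracy:", clf.score(X, y))
    print("support vectors per class:", clf.n_support_)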
CB-SVM: Clustering-Based SVM

 Training data sets may not even fit in memory
 Read the data set once (minimizing disk access)
  Construct a statistical summary of the data (i.e., hierarchical clusters) given a limited amount of memory
  The statistical summary maximizes the benefit of learning the SVM
  The summary plays a role in indexing SVMs
 Essence of micro-clustering (hierarchical indexing structure)
  Use a micro-cluster hierarchical indexing structure (CF-tree)
  Provide finer samples closer to the boundary and coarser samples farther from the boundary
  Selective de-clustering to ensure high accuracy

CB-SVM Algorithm: Outline

 Construct two CF-trees from the positive and negative data sets independently
  Needs one scan of the data set
 Train an SVM from the centroids of the root entries
 De-cluster the entries near the boundary into the next level
  The child entries de-clustered from the parent entries are accumulated into the training set, together with the non-declustered parent entries
 Train an SVM again from the centroids of the entries in the training set
 Repeat until nothing is accumulated

Selective Declustering

 The CF-tree is a suitable base structure for selective declustering
 De-cluster only a cluster Ei such that Di - Ri < Ds, where Di is the distance from the boundary to the center point of Ei and Ri is the radius of Ei
  De-cluster only the clusters whose subclusters have the possibility of being support clusters of the boundary
  "Support cluster": a cluster whose centroid is a support vector
Experiments on a Synthetic Dataset and on a Large Data Set

 [Figures: experimental results comparing CB-SVM with standard SVM training]

SVM vs. Neural Network

 SVM
  Relatively new concept
  Deterministic algorithm
  Nice generalization properties
  Hard to learn: learned in batch mode using quadratic programming techniques
  Using kernels, can learn very complex functions
 Neural Network
  Relatively old
  Nondeterministic algorithm
  Generalizes well but doesn't have a strong mathematical foundation
  Can easily be learned in an incremental fashion
  To learn complex functions, use a multilayer perceptron (not that trivial)

SVM Related Links

 SVM website: http://www.kernel-machines.org/
 Representative implementations
  LIBSVM: an efficient implementation of SVM, with multi-class classification, nu-SVM, one-class SVM, and various interfaces with Java, Python, etc.
  SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only the C language
  SVM-torch: another recent implementation, also written in C
SVM: Introductory Literature

 "Statistical Learning Theory" by Vapnik: extremely hard to understand, and contains many errors
 C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Knowledge Discovery and Data Mining, 2(2), 1998
  Better than Vapnik's book, but still too hard as an introduction, and the examples are not intuitive
 The book "An Introduction to Support Vector Machines" by N. Cristianini and J. Shawe-Taylor
  Also hard as an introduction, but the explanation of Mercer's theorem is better than in the above literature
 The neural network book by Haykin
  Contains one nice chapter introducing SVM

Associative Classification

 Associative classification
  Association rules are generated and analyzed for use in classification
  Search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels
  Classification: based on evaluating a set of rules of the form
   p1 ∧ p2 ∧ … ∧ pl → "A_class = C" (conf, sup)
 Why effective?
  It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time
  In many studies, associative classification has been found to be more accurate than some traditional classification methods, such as C4.5

Typical Associative Classification Methods

 CBA (Classification By Association: Liu, Hsu & Ma, KDD'98)
  Mines possible association rules of the form: cond-set (a set of attribute-value pairs) → class label
  Builds a classifier: organizes rules according to decreasing precedence based on confidence and then support
 CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM'01)
  Classification: statistical analysis on multiple rules
 CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03)
  Generation of predictive rules (FOIL-like analysis)
  High efficiency, accuracy similar to CMAR
 RCBT (Mining top-k covering rule groups for gene expression data, Cong et al., SIGMOD'05)
  Explores high-dimensional classification using top-k rule groups
  Achieves high classification accuracy and high run-time efficiency
Associative Classification May Achieve High Accuracy and Efficiency (Cong et al., SIGMOD'05)

 [Figure: experimental comparison of accuracy and run time for associative classification methods]

A Closer Look at CMAR

 CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM'01)
 Efficiency: uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset
 Rule pruning whenever a rule is inserted into the tree
  Given two rules R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) >= conf(R2), then R2 is pruned
  Prunes rules for which the rule antecedent and class are not positively correlated, based on a χ2 test of statistical significance
 Classification based on generated/pruned rules
  If only one rule satisfies tuple X, assign the class label of the rule
  If a rule set S satisfies X, CMAR
   divides S into groups according to class labels
   uses a weighted χ2 measure to find the strongest group of rules, based on the statistical correlation of rules within a group
   assigns X the class label of the strongest group

Lazy vs. Eager Learning

 Lazy vs. eager learning
  Lazy learning (e.g., instance-based learning): simply stores training data (or performs only minor processing) and waits until it is given a test tuple
  Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify
 Lazy: less time in training but more time in predicting
 Accuracy
  Lazy methods effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation to the target function
  Eager: must commit to a single hypothesis that covers the entire instance space
Lazy Learner: Instance-Based Methods

 Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
 Typical approaches
  k-nearest neighbor approach
   Instances represented as points in a Euclidean space
  Locally weighted regression
   Constructs a local approximation
  Case-based reasoning
   Uses symbolic representations and knowledge-based inference

The k-Nearest Neighbor Algorithm

 All instances correspond to points in the n-dimensional space
 The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
 The target function can be discrete- or real-valued
 For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to the query point xq
 Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
 [Figure: query point xq surrounded by positive (+) and negative (-) training examples]
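A minimal k-NN sketch (not the book's code): Euclidean distance plus a majority vote among the k nearest training examples; the toy points and k = 3 are illustrative assumptions.

    from collections import Counter
    from math import dist

    train = [((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((3.0, 3.5), "-"),
             ((4.0, 3.0), "-"), ((3.5, 4.0), "-"), ((0.5, 1.5), "+")]

    def knn_classify(xq, k=3):
        # return the most common class among the k training points nearest to xq
        neighbors = sorted(train, key=lambda item: dist(item[0], xq))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    print(knn_classify((1.2, 1.4)))   # "+": its 3 nearest neighbors are all positive
    print(knn_classify((3.6, 3.4)))   # "-"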

Discussion on the k-NN Algorithm

 k-NN for real-valued prediction for a given unknown tuple
  Returns the mean value of the k nearest neighbors
 Distance-weighted nearest neighbor algorithm
  Weight the contribution of each of the k neighbors according to their distance to the query xq, giving greater weight to closer neighbors:
   w \equiv \frac{1}{d(x_q, x_i)^2}
 Robust to noisy data by averaging the k nearest neighbors
 Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
  To overcome it, stretch the axes or eliminate the least relevant attributes

Case-Based Reasoning (CBR)

 CBR: uses a database of problem solutions to solve new problems
 Stores symbolic descriptions (tuples or cases), not points in a Euclidean space
 Applications: customer service (product-related diagnosis), legal rulings
 Methodology
  Instances represented by rich symbolic descriptions (e.g., function graphs)
  Search for similar cases; multiple retrieved cases may be combined
  Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
 Challenges
  Find a good similarity metric
  Indexing based on a syntactic similarity measure and, on failure, backtracking and adapting to additional cases
Genetic Algorithms (GA)

 Genetic algorithm: based on an analogy to biological evolution
 An initial population is created, consisting of randomly generated rules
  Each rule is represented by a string of bits
  E.g., IF A1 AND ¬A2 THEN C2 can be encoded as 100
  If an attribute has k > 2 values, k bits can be used
 Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring
  The fitness of a rule is represented by its classification accuracy on a set of training examples
  Offspring are generated by crossover and mutation
 The process continues until a population P evolves in which each rule in P satisfies a prespecified fitness threshold
 Slow but easily parallelizable

Rough Set Approach

 Rough sets are used to approximately or "roughly" define equivalence classes
 A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
 Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) is used to reduce the computational intensity

Fuzzy Set Approaches

 Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph)
 Attribute values are converted to fuzzy values
  E.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
 For a given new sample, more than one fuzzy value may apply
 Each applicable rule contributes a vote for membership in the categories
 Typically, the truth values for each predicted category are summed, and these sums are combined
What Is Prediction?

 (Numerical) prediction is similar to classification
  Construct a model
  Use the model to predict a continuous or ordered value for a given input
 Prediction is different from classification
  Classification refers to predicting a categorical class label
  Prediction models continuous-valued functions
 Major method for prediction: regression
  Model the relationship between one or more independent or predictor variables and a dependent or response variable
 Regression analysis
  Linear and multiple regression
  Nonlinear regression
  Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

Linear Regression

 Linear regression: involves a response variable y and a single predictor variable x
   y = w_0 + w_1 x
 where w0 (y-intercept) and w1 (slope) are regression coefficients
 Method of least squares: estimates the best-fitting straight line
   w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}
 Multiple linear regression: involves more than one predictor variable
  Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
  Ex. for 2-D data, we may have: y = w0 + w1 x1 + w2 x2
  Solvable by an extension of the least squares method or using software such as SAS or S-Plus
  Many nonlinear functions can be transformed into the above

Nonlinear Regression

 Some nonlinear models can be modeled by a polynomial function
 A polynomial regression model can be transformed into a linear regression model. For example,
   y = w_0 + w_1 x + w_2 x^2 + w_3 x^3
 is convertible to linear form with the new variables x2 = x^2 and x3 = x^3:
   y = w_0 + w_1 x + w_2 x_2 + w_3 x_3
 Other functions, such as the power function, can also be transformed to a linear model
 Some models are intractably nonlinear (e.g., sums of exponential terms)
  It is possible to obtain least squares estimates through extensive calculation on more complex formulae
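A sketch of the least squares estimates above for simple linear regression; the small (x, y) data set is an illustrative assumption chosen so that the fit is roughly y = 2x.

    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [1.9, 4.1, 6.0, 8.1, 9.9]

    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
         sum((x - x_bar) ** 2 for x in xs)      # slope
    w0 = y_bar - w1 * x_bar                     # intercept: w0 = y_bar - w1 * x_bar

    print(round(w0, 3), round(w1, 3))           # 0.0 and 2.0 for this data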
Other Regression-Based Models

 Generalized linear model:
   Foundation on which linear regression can be applied to modeling categorical response variables
   Variance of y is a function of the mean value of y, not a constant
   Logistic regression: models the prob. of some event occurring as a linear function of a set of predictor variables
   Poisson regression: models data that exhibit a Poisson distribution
 Log-linear models (for categorical data)
   Approximate discrete multidimensional prob. distributions
   Also useful for data compression and smoothing
 Regression trees and model trees
   Trees to predict continuous values rather than class labels

Regression Trees and Model Trees

 Regression tree: proposed in the CART system (Breiman et al. 1984)
   CART: Classification And Regression Trees
   Each leaf stores a continuous-valued prediction
   It is the average value of the predicted attribute for the training tuples that reach the leaf
 Model tree: proposed by Quinlan (1992)
   Each leaf holds a regression model, a multivariate linear equation for the predicted attribute
   A more general case than the regression tree
 Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model
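As an illustration of the last point, the following sketch (not from the original slides) fits a straight line and a CART-style regression tree to the same synthetic data; the use of scikit-learn and the data are assumptions chosen for the example.

# Compare a linear fit with a regression tree on data a line does not represent well
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)   # nonlinear target

lin = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=4).fit(X, y)       # each leaf predicts the mean of its training tuples

X_new = np.linspace(0, 6, 5).reshape(-1, 1)
print("linear model   :", lin.predict(X_new))
print("regression tree:", tree.predict(X_new))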
Predictive Modeling in Multidimensional Databases

 Predictive modeling: predict data values or construct generalized linear models based on the database data
 One can only predict value ranges or category distributions
 Method outline:
   Minimal generalization
   Attribute relevance analysis
   Generalized linear model construction
   Prediction
 Determine the major factors which influence the prediction
   Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
 Multi-level prediction: drill-down and roll-up analysis

Prediction: Numerical Data
Prediction: Categorical Data
Classifier Accuracy Measures

                      C1 (predicted)     C2 (predicted)
  C1 (actual)         true positives     false negatives
  C2 (actual)         false positives    true negatives

  classes              buy_computer = yes   buy_computer = no   total   recognition(%)
  buy_computer = yes   6954                 46                  7000    99.34
  buy_computer = no    412                  2588                3000    86.27
  total                7366                 2634                10000   95.52

 Accuracy of a classifier M, acc(M): percentage of test set tuples that are correctly classified by the model M
   Error rate (misclassification rate) of M = 1 − acc(M)
 Given m classes, CMi,j, an entry in a confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j
 Alternative accuracy measures (e.g., for cancer diagnosis)
   sensitivity = t-pos/pos /* true positive recognition rate */
   specificity = t-neg/neg /* true negative recognition rate */
   precision = t-pos/(t-pos + f-pos)
   accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
 This model can also be used for cost-benefit analysis

Predictor Error Measures

 Measure predictor accuracy: measure how far off the predicted value is from the actual known value
 Loss function: measures the error betw. yi and the predicted value yi'
   Absolute error: |yi − yi'|
   Squared error: (yi − yi')²
 Test error (generalization error): the average loss over the test set
   Mean absolute error: Σi=1..d |yi − yi'| / d
   Mean squared error: Σi=1..d (yi − yi')² / d
   Relative absolute error: Σi=1..d |yi − yi'| / Σi=1..d |yi − ȳ|
   Relative squared error: Σi=1..d (yi − yi')² / Σi=1..d (yi − ȳ)²
 The mean squared error exaggerates the presence of outliers
   Popularly used: the (square) root mean squared error and, similarly, the root relative squared error
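A small sketch (not part of the original slides) computing the measures defined above; the confusion-matrix counts come from the buy_computer example, and the continuous-valued data are made up.

# Classifier accuracy measures from a 2x2 confusion matrix
# (positive class = buy_computer = yes)
t_pos, f_neg = 6954, 46      # actual yes: predicted yes / predicted no
f_pos, t_neg = 412, 2588     # actual no:  predicted yes / predicted no
pos, neg = t_pos + f_neg, f_pos + t_neg

sensitivity = t_pos / pos                      # true positive recognition rate
specificity = t_neg / neg                      # true negative recognition rate
precision   = t_pos / (t_pos + f_pos)
accuracy    = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
print(sensitivity, specificity, precision, accuracy)   # accuracy is about 0.9542

# Predictor error measures for continuous predictions
def error_measures(y_true, y_pred):
    d = len(y_true)
    y_bar = sum(y_true) / d
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / d
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / d
    rae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / sum(abs(t - y_bar) for t in y_true)
    rse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / sum((t - y_bar) ** 2 for t in y_true)
    return mae, mse ** 0.5, rae, rse           # report RMSE rather than MSE

print(error_measures([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 3.0, 8.0]))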
Evaluating the Accuracy of a Classifier or Predictor (I)

 Holdout method
   Given data is randomly partitioned into two independent sets
     Training set (e.g., 2/3) for model construction
     Test set (e.g., 1/3) for accuracy estimation
   Random sampling: a variation of holdout
     Repeat holdout k times, accuracy = avg. of the accuracies obtained
 Cross-validation (k-fold, where k = 10 is most popular)
   Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
   At the i-th iteration, use Di as the test set and the others as the training set
   Leave-one-out: k folds where k = # of tuples, for small-sized data
   Stratified cross-validation: folds are stratified so that the class dist. in each fold is approx. the same as that in the initial data

Evaluating the Accuracy of a Classifier or Predictor (II)

 Bootstrap
   Works well with small data sets
   Samples the given training tuples uniformly with replacement
     i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
 Several bootstrap methods; a common one is the .632 bootstrap
   Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap, and the remaining 36.8% will form the test set (since (1 − 1/d)^d ≈ e⁻¹ = 0.368)
   Repeat the sampling procedure k times; overall accuracy of the model:
acc(M) = Σi=1..k (0.632 × acc(Mi)test_set + 0.368 × acc(Mi)train_set)
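A sketch of the .632 bootstrap loop described above (not from the original slides); train_fn and acc_fn are hypothetical placeholders for any learner and accuracy function, and the k per-round estimates are averaged here.

import random

def bootstrap_632(X, y, train_fn, acc_fn, k=10, seed=42):
    rng = random.Random(seed)
    d = len(X)
    total = 0.0
    for _ in range(k):
        sample = [rng.randrange(d) for _ in range(d)]      # d tuples drawn with replacement
        in_bag = set(sample)
        oob = [i for i in range(d) if i not in in_bag]     # roughly 36.8% of tuples form the test set
        model = train_fn([X[i] for i in sample], [y[i] for i in sample])
        acc_test = acc_fn(model, [X[i] for i in oob], [y[i] for i in oob])
        acc_train = acc_fn(model, [X[i] for i in sample], [y[i] for i in sample])
        total += 0.632 * acc_test + 0.368 * acc_train      # weighting from the formula above
    return total / k                                       # average over the k repetitions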

Ensemble Methods: Increasing the Accuracy

 Ensemble methods
   Use a combination of models to increase accuracy
   Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
 Popular ensemble methods
   Bagging: averaging the prediction over a collection of classifiers
   Boosting: weighted vote with a collection of classifiers
   Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation

 Analogy: diagnosis based on multiple doctors' majority vote
 Training
   Given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample)
   A classifier model Mi is learned for each training set Di
 Classification: classify an unknown sample X
   Each classifier Mi returns its class prediction
   The bagged classifier M* counts the votes and assigns the class with the most votes to X
 Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
 Accuracy
   Often significantly better than a single classifier derived from D
   For noisy data: not considerably worse, more robust
   Proven improved accuracy in prediction

Boosting

 Analogy: consult several doctors, based on a combination of weighted diagnoses, where each weight is assigned based on the previous diagnosis accuracy
 How boosting works?
   Weights are assigned to each training tuple
   A series of k classifiers is iteratively learned
   After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
   The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
 The boosting algorithm can be extended for the prediction of continuous values
 Comparing with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
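A minimal sketch of the bagging procedure above (not from the original slides); train_fn stands in for any base learner that returns a callable classifier.

import random
from collections import Counter

def bagging(X, y, train_fn, k=10, seed=1):
    rng = random.Random(seed)
    d = len(X)
    models = []
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]               # bootstrap sample Di
        models.append(train_fn([X[i] for i in idx], [y[i] for i in idx]))
    def classify(x):
        votes = Counter(m(x) for m in models)                    # each Mi casts one vote
        return votes.most_common(1)[0][0]                        # class with the most votes
    return classify

# For numeric prediction, return the average of the k predictions instead of voting.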
Adaboost (Freund and Schapire, 1997)

 Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
 Initially, all the weights of the tuples are set the same (1/d)
 Generate k classifiers in k rounds. At round i,
   Tuples from D are sampled (with replacement) to form a training set Di of the same size
   Each tuple's chance of being selected is based on its weight
   A classification model Mi is derived from Di
   Its error rate is calculated using Di as a test set
   If a tuple is misclassified, its weight is increased; otherwise it is decreased
 Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi's error rate is the sum of the weights of the misclassified tuples:
error(Mi) = Σj wj × err(Xj)
 The weight of classifier Mi's vote is
log( (1 − error(Mi)) / error(Mi) )
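The round structure above as a hedged sketch (not from the original slides); the base learner train_fn is a placeholder, and degenerate rounds are simply skipped rather than resampled.

import math, random

def adaboost(X, y, train_fn, k=10, seed=7):
    rng = random.Random(seed)
    d = len(X)
    w = [1.0 / d] * d                                    # all tuple weights start at 1/d
    models, alphas = [], []
    for _ in range(k):
        idx = rng.choices(range(d), weights=w, k=d)      # sample Di according to the weights
        Mi = train_fn([X[i] for i in idx], [y[i] for i in idx])
        miss = [Mi(X[i]) != y[i] for i in range(d)]
        error = sum(wi for wi, m in zip(w, miss) if m)   # sum of weights of misclassified tuples
        if error == 0 or error >= 0.5:
            continue                                     # skip degenerate rounds in this sketch
        alpha = math.log((1 - error) / error)            # weight of Mi's vote
        w = [wi * math.exp(alpha) if m else wi for wi, m in zip(w, miss)]
        s = sum(w)                                       # renormalize so the weights sum to 1
        w = [wi / s for wi in w]
        models.append(Mi)
        alphas.append(alpha)
    def classify(x):
        votes = {}
        for Mi, a in zip(models, alphas):
            label = Mi(x)
            votes[label] = votes.get(label, 0.0) + a     # weighted vote
        return max(votes, key=votes.get)
    return classify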
Summary (I)

 Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
 Effective and scalable methods have been developed for decision tree induction, naive Bayesian classification, Bayesian belief networks, rule-based classifiers, backpropagation, Support Vector Machines (SVM), associative classification, nearest-neighbor classifiers, and case-based reasoning, as well as for other classification methods such as genetic algorithms, rough set and fuzzy set approaches.
 Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Regression trees and model trees are also used for prediction.
Summary (II)

 Stratified k-fold cross-validation is a recommended method for accuracy estimation. Bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models.
 Significance tests and ROC curves are useful for model selection.
 There have been numerous comparisons of the different classification and prediction methods, and the matter remains a research topic.
 No single method has been found to be superior over all others for all data sets.
 Issues such as accuracy, training time, robustness, interpretability, and scalability must be considered and can involve trade-offs, further complicating the quest for an overall superior method.

References (1)

 C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
 C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
 L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
 C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
 P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. KDD'95.
 W. Cohen. Fast effective rule induction. ICML'95.
 G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. SIGMOD'05.
 A. J. Dobson. An Introduction to Generalized Linear Models. Chapman and Hall, 1990.
 G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99.
References (2)

 R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. John Wiley and Sons, 2001.
 U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI'94.
 Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 1997.
 J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. VLDB'98.
 J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT: Optimistic decision tree construction. SIGMOD'99.
 T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
 D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 1995.
 M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. RIDE'97.
 B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. KDD'98.
 W. Li, J. Han, and J. Pei. CMAR: Accurate and efficient classification based on multiple class-association rules. ICDM'01.

References (3)

 T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000.
 J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, Blackwell Business, 1994.
 M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. EDBT'96.
 T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
 S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
 J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
 J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML'93.
 J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
 J. R. Quinlan. Bagging, boosting, and C4.5. AAAI'96.
References (4)
 R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. VLDB'98.
 J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. VLDB'96.
 J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990.
 P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
 S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
 S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
 I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann, 2005.
 X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03.
 H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03.