Data Mining: Concepts and Techniques
— Chapter 6 —

Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved

Chapter 6. Classification and Prediction

What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by backpropagation
Support Vector Machines (SVM)
Associative classification
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary
Process (1): Model Construction

The model is built from the training set, using the values (class labels) in the class label attribute. The model is represented as classification rules, decision trees, or mathematical formulae.

Training data:

  NAME     RANK            YEARS  TENURED
  Mike     Assistant Prof  3      no
  Mary     Assistant Prof  7      yes
  Bill     Professor       2      yes
  Jim      Associate Prof  7      yes
  Dave     Assistant Prof  6      no
  Anne     Associate Prof  3      no

A classification algorithm learns the classifier (model) from the training data, e.g.:

  IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Process (2): Using the Model in Prediction

The classifier is applied to testing data to estimate accuracy, and then to unseen data, e.g., (Jeff, Professor, 4) → Tenured?

Testing data:

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes
Supervised vs. Unsupervised Learning

Supervised learning (classification): the training data are accompanied by labels indicating the class of each observation; new data are classified based on the training set
Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is establishing the existence of classes or clusters in the data
Issues: Data Preparation

Data cleaning
  Preprocess data in order to reduce noise and handle missing values

Issues: Evaluating Classification Methods

Accuracy
  Classifier accuracy: predicting the class label
  Predictor accuracy: guessing the value of predicted attributes
Decision Tree Induction: Training Dataset

This follows an example of Quinlan's ID3 (Playing Tennis).

  age    income  student  credit_rating  buys_computer
  <=30   high    no       fair           no
  <=30   high    no       excellent      no
  31…40  high    no       fair           yes
  >40    medium  no       fair           yes
  >40    low     yes      fair           yes
  >40    low     yes      excellent      no
  31…40  low     yes      excellent      yes
  <=30   medium  no       fair           no
  <=30   low     yes      fair           yes
  >40    medium  yes      fair           yes
  <=30   medium  yes      excellent      yes
  31…40  medium  no       excellent      yes
  31…40  high    yes      fair           yes
  >40    medium  no       excellent      no
Output: A Decision Tree for “buys_computer”

  age?
    <=30   → student?       (no → no, yes → yes)
    31..40 → yes
    >40    → credit rating? (excellent → no, fair → yes)

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)
  Tree is constructed in a top-down recursive divide-and-conquer manner
  At start, all the training examples are at the root
  Attributes are categorical (if continuous-valued, they are discretized in advance)
  Examples are partitioned recursively based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure, e.g., information gain (see the sketch below)
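The heuristic most associated with ID3 is information gain: Gain(A) = Info(D) − Info_A(D), where Info(D) is the entropy of the class distribution. A minimal sketch of that computation over the buys_computer training data above, in plain Python (standard library only; the attribute encodings simply mirror the table, with "31..40" standing for the 31…40 range):

from collections import Counter
from math import log2

# Training tuples from the buys_computer table: (age, income, student, credit_rating, class)
data = [
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),       (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),  ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),      (">40", "medium", "no", "excellent", "no"),
]
attributes = ["age", "income", "student", "credit_rating"]

def entropy(tuples):
    """Info(D) = -sum_i p_i * log2(p_i) over the class distribution of the tuples."""
    counts = Counter(t[-1] for t in tuples)
    total = len(tuples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(tuples, attr_index):
    """Gain(A) = Info(D) - sum_v (|D_v| / |D|) * Info(D_v), partitioning D on attribute A."""
    total = len(tuples)
    by_value = {}
    for t in tuples:
        by_value.setdefault(t[attr_index], []).append(t)
    expected = sum(len(part) / total * entropy(part) for part in by_value.values())
    return entropy(tuples) - expected

for i, name in enumerate(attributes):
    print(f"Gain({name}) = {info_gain(data, i):.3f}")
# age has the highest gain (~0.246), so it is chosen as the root split, matching the tree above.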
Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the gini index, gini(D), is defined as

  gini(D) = 1 − Σ_{j=1..n} p_j²

where p_j is the relative frequency of class j in D.

Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”:

  gini(D) = 1 − (9/14)² − (5/14)² = 0.459

(A quick check of this value appears in the sketch below.)
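As a check of the 0.459 value in the example, a minimal sketch of the gini computation in plain Python (the class counts 9 and 5 come from the example above):

def gini(class_counts):
    """gini(D) = 1 - sum_j p_j^2 over the relative class frequencies."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(round(gini([9, 5]), 3))  # 0.459 for 9 "yes" and 5 "no" tuples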
Comparing Attribute Selection Measures

The three measures, in general, return good results but
  Information gain: biased towards multivalued attributes
  Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
  Gini index: biased to multivalued attributes

Other Attribute Selection Measures

CHAID: a popular decision tree algorithm, measure based on the χ² test for independence
C-SEP: performs better than info. gain and gini index in certain cases
G-statistic: has a close approximation to the χ² distribution
MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred)
  The best tree is the one that requires the fewest # of bits to both (1) encode the tree, and (2) encode the exceptions to the tree
Multivariate splits (partition based on multiple variable combinations)
  CART: finds multivariate splits based on a linear comb. of attrs.
Which attribute selection measure is the best?
  Most give good results, none is significantly superior to the others
Overfitting and Tree Pruning

Overfitting: An induced tree may overfit the training data
  Too many branches, some may reflect anomalies due to noise or outliers
  Poor accuracy for unseen samples
Two approaches to avoid overfitting
  Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold
    Difficult to choose an appropriate threshold
  Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees
    Use a set of data different from the training data to decide which is the “best pruned tree”

Enhancements to Basic Decision Tree Induction

Allow for continuous-valued attributes
  Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
Handle missing attribute values
  Assign the most common value of the attribute
  Assign a probability to each of the possible values
Attribute construction
  Create new attributes based on existing ones that are sparsely represented
  This reduces fragmentation, repetition, and replication
Classification in Large Databases

Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed
Why decision tree induction in data mining?
  Relatively faster learning speed (than other classification methods)
  Comparable classification accuracy with other methods

Scalable Decision Tree Induction Methods

SLIQ (EDBT’96 — Mehta et al.): Builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (VLDB’96 — J. Shafer et al.): Constructs an attribute list data structure
PUBLIC (VLDB’98 — Rastogi & Shim): Integrates tree splitting and tree pruning: stop growing the tree earlier
BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh): Uses bootstrapping to create several small samples
Bayesian Classification: Why?

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: Based on Bayes’ Theorem
Performance: A simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem: Basics

Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X

Bayesian Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows the Bayes theorem:

  P(H|X) = P(X|H) P(H) / P(X)

(The naïve Bayesian classifier built on this theorem is illustrated in the sketch below.)
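The naïve Bayesian classifier applies this theorem under a class-conditional independence assumption, so P(Ci|X) is proportional to P(Ci) · Π_k P(x_k|Ci). A minimal sketch on the buys_computer training data shown earlier, assuming categorical attributes and no smoothing (plain frequency counts):

from collections import Counter

# (age, income, student, credit_rating, buys_computer) -- the 14 training tuples from earlier
data = [
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),       (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),  ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),      (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes_classify(data, x):
    """Pick the class C maximizing P(C) * prod_k P(x_k | C), with probabilities estimated by counting."""
    class_counts = Counter(t[-1] for t in data)
    best = None
    for c, n_c in class_counts.items():
        tuples_c = [t for t in data if t[-1] == c]
        score = n_c / len(data)                        # prior P(C)
        for k, value in enumerate(x):                  # likelihoods P(x_k | C)
            score *= sum(1 for t in tuples_c if t[k] == value) / n_c
        if best is None or score > best[1]:
            best = (c, score)
    return best

# Unknown tuple: age <= 30, income = medium, student = yes, credit_rating = fair
print(naive_bayes_classify(data, ("<=30", "medium", "yes", "fair")))  # ('yes', ~0.028)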
Rule Extraction from a Decision Tree

Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf

  age?
    <=30   → student?       (no → no, yes → yes)
    31..40 → yes
    >40    → credit rating? (excellent → no, fair → yes)

Using IF-THEN Rules for Classification

Represent the knowledge in the form of IF-THEN rules
  R: IF age = youth AND student = yes THEN buys_computer = yes
  Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy (see the sketch below)
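Coverage and accuracy are typically defined as coverage(R) = n_covers / |D| and accuracy(R) = n_correct / n_covers, where n_covers is the number of tuples satisfying the rule antecedent and n_correct the number of those the rule classifies correctly. A minimal sketch, assuming tuples are dicts and a rule is given by a predicate plus a predicted class (the helper names are illustrative, not from the book):

def rule_stats(data, condition, predicted_class, class_attr="buys_computer"):
    """coverage(R) = n_covers / |D|;  accuracy(R) = n_correct / n_covers."""
    covered = [t for t in data if condition(t)]
    if not covered:
        return 0.0, 0.0
    correct = sum(1 for t in covered if t[class_attr] == predicted_class)
    return len(covered) / len(data), correct / len(covered)

# Example rule R: IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
data = [
    {"age": "<=30", "student": "yes", "buys_computer": "yes"},
    {"age": "<=30", "student": "no",  "buys_computer": "no"},
    {"age": ">40",  "student": "yes", "buys_computer": "yes"},
]
cov, acc = rule_stats(data, lambda t: t["age"] == "<=30" and t["student"] == "yes", "yes")
print(cov, acc)  # coverage = 1/3, accuracy = 1.0 on this toy data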
Classification: A Mathematical Mapping

E.g., Personal homepage classification
  xi = (x1, x2, x3, …), yi = +1 or –1
  x1 : # of occurrences of the word “homepage”
  x2 : # of occurrences of the word “welcome”
Mathematically
  x ∈ X = ℜⁿ, y ∈ Y = {+1, –1}
  We want a function f: X → Y
[Figure omitted: linear classification — a line separating the two classes of points]

Weakness of such classifiers: it is difficult to understand the learned function (weights)
Perceptron & Winnow

Notation — vectors: x, w;  scalars: x, y, w
Input: {(x1, y1), …}
Output: classification function f(x) with
  f(xi) > 0 for yi = +1
  f(xi) < 0 for yi = –1
f(x): wx + b = 0, or w1x1 + w2x2 + b = 0
Perceptron: update w additively (see the sketch below)
Winnow: update w multiplicatively

Classification by Backpropagation

Backpropagation: A neural network learning algorithm
Started by psychologists and neurobiologists to develop and test computational analogues of neurons
A neural network: A set of connected input/output units where each connection has a weight associated with it
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
Also referred to as connectionist learning due to the connections between units
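To make the additive update of the Perceptron & Winnow slide concrete: the classic perceptron rule adds a multiple of the misclassified example's feature vector to w (Winnow instead multiplies the weights). A minimal sketch with a fixed learning rate and toy, linearly separable data (all names and values are illustrative):

def train_perceptron(samples, n_features, epochs=20, lr=1.0):
    """Perceptron: additive updates  w <- w + lr * y * x,  b <- b + lr * y  on each mistake."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:                       # y is +1 or -1
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:                # misclassified (or exactly on the boundary)
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Toy data: class +1 roughly when x1 + x2 > 1, else -1
samples = [([0.0, 0.0], -1), ([1.0, 1.0], +1), ([0.2, 0.1], -1), ([0.9, 0.8], +1)]
w, b = train_perceptron(samples, n_features=2)
print(w, b)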
A Multi-Layer Feed-Forward Neural Network

[Figure omitted: input vector X feeds the input layer; weighted connections wij link the input layer to the hidden layer and the hidden layer to the output layer]

  I_j = Σ_i w_ij O_i + θ_j
  O_j = 1 / (1 + e^(−I_j))
  Err_j = O_j (1 − O_j)(T_j − O_j)

How a Multi-Layer Neural Network Works

The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear regression: Given enough hidden units and enough training samples, they can closely approximate any function
(The per-unit computations are illustrated in the sketch below.)
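The three formulas above are the per-unit computations used by backpropagation: the net input I_j, the sigmoid output O_j, and the error term Err_j of an output unit. A minimal sketch of one forward pass through a tiny one-hidden-layer network plus the output-layer error, with made-up weights and inputs:

import math

def sigmoid(i):
    return 1.0 / (1.0 + math.exp(-i))

def layer_forward(inputs, weights, biases):
    """For each unit j: I_j = sum_i w_ij * O_i + theta_j,  then O_j = 1 / (1 + e^(-I_j))."""
    outputs = []
    for w_j, theta_j in zip(weights, biases):
        i_j = sum(w * o for w, o in zip(w_j, inputs)) + theta_j
        outputs.append(sigmoid(i_j))
    return outputs

# Toy network: 3 inputs -> 2 hidden units -> 1 output unit (made-up weights and biases)
x = [1.0, 0.0, 1.0]
hidden = layer_forward(x, weights=[[0.2, -0.3, 0.4], [-0.5, 0.1, 0.2]], biases=[-0.4, 0.2])
output = layer_forward(hidden, weights=[[-0.3, 0.2]], biases=[0.1])

target = 1.0
err_out = [o * (1 - o) * (target - o) for o in output]   # Err_j = O_j (1 - O_j)(T_j - O_j)
print(hidden, output, err_out)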
Backpropagation and Interpretability

Efficiency of backpropagation: Each epoch (one iteration through the training set) takes O(|D| × w), with |D| tuples and w weights, but the # of epochs can be exponential in n, the number of inputs, in the worst case
Rule extraction from networks: network pruning
  Simplify the network structure by removing weighted links that have the least effect on the trained network
  Then perform link, unit, or activation value clustering
  The sets of input and activation values are studied to derive rules describing the relationship between the input and hidden units
SVM—Kernel Functions

Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to instead apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi)·Φ(Xj)
Typical kernel functions include the polynomial kernel and the Gaussian radial basis function kernel (see the sketch below)
SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)

Scaling SVM by Hierarchical Micro-Clustering

SVM is not scalable in the number of data objects in terms of training time and memory usage
“Classifying Large Datasets Using SVMs with Hierarchical Clusters” by Hwanjo Yu, Jiong Yang, Jiawei Han, KDD’03
CB-SVM (Clustering-Based SVM)
  Given a limited amount of system resources (e.g., memory), maximize the SVM performance in terms of accuracy and training speed
  Use micro-clustering to effectively reduce the number of points to be considered
  When deriving support vectors, de-cluster micro-clusters near the “candidate vector” to ensure high classification accuracy
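Two typical kernel functions, as referenced in the kernel-function slide above, are the polynomial kernel K(Xi, Xj) = (Xi·Xj + 1)^h and the Gaussian radial basis function kernel K(Xi, Xj) = exp(−‖Xi − Xj‖² / 2σ²). A minimal sketch of both in plain Python (the parameter values are arbitrary):

import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, h=2):
    """K(x, y) = (x . y + 1)^h  -- polynomial kernel of degree h."""
    return (dot(x, y) + 1.0) ** h

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))  -- Gaussian radial basis function kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x, y = [1.0, 2.0], [0.5, 1.0]
print(polynomial_kernel(x, y), rbf_kernel(x, y))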
CB-SVM: Clustering-Based SVM

Construct two CF-trees from positive and negative data sets independently
  Needs one scan of the data set
Train an SVM from the centroids of the root entries
De-cluster the entries near the boundary into the next level
  The children entries de-clustered from the parent entries are accumulated into the training set with the non-declustered parent entries
Train an SVM again from the centroids of the entries in the training set
Repeat until nothing is accumulated

CF-Tree: Hierarchical Micro-cluster and Selective Declustering

The CF-tree is a suitable base structure for selective declustering
De-cluster only the cluster Ei such that Di − Ri < Ds, where Di is the distance from the boundary to the center point of Ei and Ri is the radius of Ei
  Decluster only the clusters whose subclusters have possibilities to be the support cluster of the boundary
  “Support cluster”: The cluster whose centroid is a support vector
Experiment on Synthetic Dataset / Experiment on a Large Data Set  [figures omitted]
SVM vs. Neural Network

SVM
  Nice generalization properties
  Hard to learn – learned in batch mode using quadratic programming techniques
  Using kernels can learn very complex functions
Neural Network
  Generalizes well but doesn't have a strong mathematical foundation
  Can easily be learned in incremental fashion
  To learn complex functions—use a multilayer perceptron (not that trivial)

SVM Related Links

LIBSVM: an efficient implementation of SVM, multi-class classification, nu-SVM, one-class SVM, including also various interfaces with Java, Python, etc.
SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only the C language
SVM-torch: another recent implementation, also written in C
SVM—Introduction Literature

“Statistical Learning Theory” by Vapnik: extremely hard to understand, containing many errors too
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998
Contains one nice chapter of SVM introduction
Associative Classification

Associative classification
  Association rules are generated and analyzed for use in classification
  Search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels
  Classification: Based on evaluating a set of rules in the form of
    p1 ∧ p2 … ∧ pl → “Aclass = C” (conf, sup)
Why effective?
  It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time
  In many studies, associative classification has been found to be more accurate than some traditional classification methods, such as C4.5

Typical Associative Classification Methods

CBA (Classification By Association: Liu, Hsu & Ma, KDD’98)
  Mine possible association rules in the form of Cond-set (a set of attribute-value pairs) → class label
  Build classifier: Organize rules according to decreasing precedence based on confidence and then support
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
  Classification: Statistical analysis on multiple rules
CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM’03)
  Generation of predictive rules (FOIL-like analysis)
  High efficiency, accuracy similar to CMAR
RCBT (Mining top-k covering rule groups for gene expression data, Cong et al. SIGMOD’05)
  Explore high-dimensional classification, using top-k rule groups
  Achieve high classification accuracy and high run-time efficiency
A Closer Look at CMAR

CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
Efficiency: Uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset
Rule pruning whenever a rule is inserted into the tree
  Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is pruned
Classification: Uses a weighted χ² measure to find the strongest group of rules, based on which the class label is determined

Associative Classification May Achieve High Accuracy and Efficiency (Cong et al. SIGMOD05)  [figure omitted]
Lazy vs. Eager Learning

Lazy vs. eager learning
  Lazy learning (e.g., instance-based learning): Simply stores training data (or does only minor processing) and waits until it is given a test tuple
  Eager learning (the above discussed methods): Given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify
Lazy: less time in training but more time in predicting
Accuracy
  Lazy methods effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation to the target function
  Eager: must commit to a single hypothesis that covers the entire instance space
Lazy Learner: Instance-Based Methods

Instance-based learning:
  Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
Typical approaches
  k-nearest neighbor approach
    Instances represented as points in a Euclidean space
  Locally weighted regression
  Case-based reasoning
    Uses symbolic representations and knowledge-based inference

The k-Nearest Neighbor Algorithm

All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
Target function could be discrete- or real-valued
For discrete-valued, k-NN returns the most common value among the k training examples nearest to xq (see the sketch below)
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure omitted: query point xq among + and − training examples]
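A minimal sketch of the k-NN classifier just described: Euclidean distance plus a majority vote among the k nearest training examples (the toy data is illustrative only):

from collections import Counter
import math

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(training, xq, k=3):
    """Return the most common class label among the k training examples nearest to xq."""
    neighbors = sorted(training, key=lambda pair: euclidean(pair[0], xq))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training data: (point, class)
training = [([1.0, 1.0], "+"), ([1.2, 0.8], "+"), ([0.9, 1.1], "+"),
            ([5.0, 5.0], "-"), ([5.2, 4.8], "-"), ([4.9, 5.1], "-")]
print(knn_classify(training, xq=[1.1, 1.0], k=3))  # "+"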
Discussion on the k-NN Algorithm

Distance-weighted nearest neighbor algorithm
  Weight the contribution of each of the k neighbors according to their distance to the query xq
    w ≡ 1 / d(xq, xi)²
  Give greater weight to closer neighbors
Robust to noisy data by averaging k-nearest neighbors
Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes
  To overcome it, stretch the axes or eliminate the least relevant attributes

Case-Based Reasoning (CBR)

Instances represented by rich symbolic descriptions (e.g., function graphs)
Search for similar cases; multiple retrieved cases may be combined
Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
Challenges
  Find a good similarity metric
  Indexing based on syntactic similarity measure, and when failure, backtracking, and adapting to additional cases
Genetic Algorithms (GA)

Genetic Algorithm: based on an analogy to biological evolution
An initial population is created consisting of randomly generated rules
  Each rule is represented by a string of bits
  E.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100 (see the sketch below)
  If an attribute has k > 2 values, k bits can be used
Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
  The fitness of a rule is represented by its classification accuracy on a set of training examples
  Offspring are generated by crossover and mutation
The process continues until a population P evolves in which each rule in P satisfies a prespecified fitness threshold
Slow but easily parallelizable
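To make the bit-string encoding concrete: with two boolean attributes A1 and A2, the rule IF A1 AND NOT A2 THEN C2 becomes the string 100, and a rule's fitness can be scored as its classification accuracy on the training examples it covers. A minimal sketch, assuming a particular (illustrative, not the book's) convention for the class bit:

def decode_rule(bits):
    """'100' -> (A1 required true, A2 required false, predicted class C2).
    Convention assumed here: leftmost bits are A1, A2; rightmost bit 1 -> C1, 0 -> C2."""
    a1, a2 = bits[0] == "1", bits[1] == "1"
    cls = "C1" if bits[2] == "1" else "C2"
    return a1, a2, cls

def fitness(bits, examples):
    """Fitness = classification accuracy of the rule on the training examples it covers."""
    a1_req, a2_req, cls = decode_rule(bits)
    covered = [(a1, a2, c) for a1, a2, c in examples if a1 == a1_req and a2 == a2_req]
    if not covered:
        return 0.0
    return sum(1 for _, _, c in covered if c == cls) / len(covered)

# Toy training examples: (A1, A2, class)
examples = [(True, False, "C2"), (True, False, "C2"), (True, True, "C1"), (False, False, "C1")]
print(fitness("100", examples))  # 1.0: IF A1 AND NOT A2 THEN C2 is always right on this data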
Rough Set Approach

Rough sets are used to approximately or “roughly” define equivalence classes
A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) is used to reduce the computation intensity

Fuzzy Set Approaches

Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as using a fuzzy membership graph)
Attribute values are converted to fuzzy values
  e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
For a given new sample, more than one fuzzy value may apply
Each applicable rule contributes a vote for membership in the categories
Typically, the truth values for each predicted category are summed, and these sums are combined
What Is Prediction?

(Numerical) prediction is similar to classification
  Construct a model
  Use the model to predict a continuous or ordered value for a given input
Prediction is different from classification
  Classification refers to predicting a categorical class label
  Prediction models continuous-valued functions
Major method for prediction: regression
  Model the relationship between one or more independent or predictor variables and a dependent or response variable
Regression analysis
  Linear and multiple regression
  Non-linear regression
  Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
Linear Regression

Linear regression: involves a response variable y and a single predictor variable x
  y = w0 + w1 x, where w0 (y-intercept) and w1 (slope) are the regression coefficients
Method of least squares: estimates the best-fitting straight line (see the sketch below)
  w1 = Σ_{i=1..|D|} (xi − x̄)(yi − ȳ) / Σ_{i=1..|D|} (xi − x̄)²
  w0 = ȳ − w1 x̄
Multiple linear regression: involves more than one predictor variable
  Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
  Ex. For 2-D data, we may have: y = w0 + w1 x1 + w2 x2
  Solvable by extension of the least squares method or using SAS, S-Plus
  Many nonlinear functions can be transformed into the above

Nonlinear Regression

Some nonlinear models can be modeled by a polynomial function, e.g.,
  y = w0 + w1 x + w2 x² + w3 x³
  which is convertible to a linear model with new variables x2 = x², x3 = x³
Other functions, such as the power function, can also be transformed to a linear model
Some models are intractably nonlinear (e.g., sum of exponential terms)
  It is still possible to obtain least squares estimates through extensive calculation on more complex formulae
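A minimal sketch of straight-line least squares, directly implementing w1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and w0 = ȳ − w1·x̄ (the toy data is made up for illustration):

def least_squares(xs, ys):
    """w1 = sum((xi - mean_x)(yi - mean_y)) / sum((xi - mean_x)^2);  w0 = mean_y - w1 * mean_x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    w0 = mean_y - w1 * mean_x
    return w0, w1

# Toy data points (x, y)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
w0, w1 = least_squares(xs, ys)
print(f"y = {w0:.1f} + {w1:.1f} x")   # y = 2.2 + 0.6 x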
Other Regression-Based Models

Generalized linear model:
  Foundation on which linear regression can be applied to modeling categorical response variables
  Variance of y is a function of the mean value of y, not a constant
  Logistic regression: models the prob. of some event occurring as a linear function of a set of predictor variables
  Poisson regression: models data that exhibit a Poisson distribution
Log-linear models (for categorical data):
  Approximate discrete multidimensional prob. distributions
  Also useful for data compression and smoothing
Regression trees and model trees
  Trees to predict continuous values rather than class labels

Regression Trees and Model Trees

Regression tree: proposed in the CART system (Breiman et al. 1984)
  CART: Classification And Regression Trees
  Each leaf stores a continuous-valued prediction
    It is the average value of the predicted attribute for the training tuples that reach the leaf
Model tree: proposed by Quinlan (1992)
  Each leaf holds a regression model—a multivariate linear equation for the predicted attribute
  A more general case than the regression tree
Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model
Classifier Accuracy Measures

Confusion matrix (actual class vs. predicted class):

              Predicted C1      Predicted C2
  Actual C1   True positive     False negative
  Actual C2   False positive    True negative

sensitivity = t-pos/pos   /* true positive recognition rate */

Predictor Error Measures

Relative absolute error:  Σ_{i=1..d} |yi − yi′|  /  Σ_{i=1..d} |yi − ȳ|
Relative squared error:   Σ_{i=1..d} (yi − yi′)² /  Σ_{i=1..d} (yi − ȳ)²
(Both are computed in the sketch below.)
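A minimal sketch computing the two relative error measures above, together with the plain mean absolute and mean squared errors they normalize (the actual and predicted values are toy numbers):

def error_measures(actual, predicted):
    """Mean absolute/squared error and their 'relative' versions, normalized by deviation from the mean."""
    d = len(actual)
    mean_y = sum(actual) / d
    abs_err = [abs(y - yp) for y, yp in zip(actual, predicted)]
    sq_err = [(y - yp) ** 2 for y, yp in zip(actual, predicted)]
    return {
        "mean_absolute_error": sum(abs_err) / d,
        "mean_squared_error": sum(sq_err) / d,
        "relative_absolute_error": sum(abs_err) / sum(abs(y - mean_y) for y in actual),
        "relative_squared_error": sum(sq_err) / sum((y - mean_y) ** 2 for y in actual),
    }

print(error_measures(actual=[3.0, 5.0, 7.0, 9.0], predicted=[2.5, 5.5, 6.0, 9.5]))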
Ensemble Methods

Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved composite model

Adaboost (Freund and Schapire, 1997)

The error rate of Mi is the sum of the weights of the misclassified tuples:
  error(Mi) = Σ_j w_j × err(Xj)
The weight of classifier Mi's vote is
  log( (1 − error(Mi)) / error(Mi) )
(Both quantities are computed in the sketch below.)
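A minimal sketch of the two Adaboost quantities above: the weighted error of a model Mi and its resulting vote weight. The tuple weights and misclassification flags are toy values; in the full algorithm the tuple weights are renormalized each round:

import math

def model_error(weights, misclassified):
    """error(Mi) = sum_j w_j * err(X_j), where err(X_j) is 1 if X_j is misclassified, else 0."""
    return sum(w for w, wrong in zip(weights, misclassified) if wrong)

def vote_weight(error):
    """Weight of classifier Mi's vote: log((1 - error(Mi)) / error(Mi))."""
    return math.log((1.0 - error) / error)

# Toy round: 5 tuples with normalized weights; tuples 2 and 5 were misclassified
weights = [0.2, 0.2, 0.2, 0.2, 0.2]
misclassified = [False, True, False, False, True]
err = model_error(weights, misclassified)
print(err, vote_weight(err))   # 0.4 and log(0.6 / 0.4) ≈ 0.405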
Summary (I)

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
Effective and scalable methods have been developed for decision tree induction, naive Bayesian classification, Bayesian belief networks, rule-based classifiers, backpropagation, Support Vector Machines (SVM), associative classification, nearest neighbor classifiers, and case-based reasoning, as well as other classification methods such as genetic algorithms, rough set, and fuzzy set approaches.
Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Regression trees and model trees are also used for prediction.
References (2)

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley and Sons, 2001.
U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI'94.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 1997.
J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. VLDB'98.
J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT — Optimistic Decision Tree Construction. SIGMOD'99.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 1995.
M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. RIDE'97.
B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. KDD'98.
W. Li, J. Han, and J. Pei. CMAR: Accurate and efficient classification based on multiple class-association rules. ICDM'01.

References (3)

T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000.
J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, Blackwell Business, 1994.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. EDBT'96.
T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML'93.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
J. R. Quinlan. Bagging, boosting, and C4.5. AAAI'96.
References (4)

R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. VLDB'98.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. VLDB'96.
J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990.
P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2ed. Morgan Kaufmann, 2005.
X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03.
H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03.