CLUSTERING AND
CLASSIFICATION
Overview
• Definition of Clustering
• Existing clustering methods
• Clustering examples
• Classification
• Classification examples
• Conclusion
Definition
• Clustering can be considered the most important
unsupervised learning technique; like every other
problem of this kind, it deals with finding structure in a
collection of unlabeled data.
• Simplifies data
• Detects patterns
• Useful in constructing data concepts
• An unsupervised learning process
Where to use clustering?
• Data mining
• Information retrieval
• Text mining
• Web analysis
• Marketing
• Medical diagnosis
Which method should I use?
• Distance-based
• Hierarchical
• Partitioning
• Probabilistic
Measuring Similarity
• Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, which is typically metric:
d(i, j)
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
• Weights should be associated with different variables
based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
• In the example shown in the figure, we can easily identify the four clusters
into which the data can be divided; the similarity criterion is distance: two or
more objects belong to the same cluster if they are “close” according to a
given distance. This is called distance-based clustering.
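As a concrete illustration, here is a minimal sketch of such a metric distance
function d(i, j); the Euclidean form and the per-variable weights are one
common choice among many, not the only one.

    import math

    def euclidean_distance(x, y):
        # Metric distance d(i, j) between two interval-scaled vectors.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def weighted_distance(x, y, w):
        # Weights reflect the application- and semantics-driven
        # importance of each variable, as noted above.
        return math.sqrt(sum(wi * (a - b) ** 2 for a, b, wi in zip(x, y, w)))

    # Two objects are in the same cluster if they are "close":
    print(euclidean_distance([1.6, 60.0], [1.7, 62.0]))
    print(weighted_distance([1.6, 60.0], [1.7, 62.0], [10.0, 0.1]))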
Hierarchical clustering
• Agglomerative (bottom up): start with each object in its own cluster
and repeatedly merge the closest pair of clusters.
• Divisive (top down): start with one all-inclusive cluster and
repeatedly split it.
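A hedged sketch of the agglomerative approach using SciPy's
hierarchical-clustering routines; the toy points and the average-link
choice are assumptions made for illustration.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy 2-D data: two visually obvious groups (assumed).
    X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
                  [8.0, 8.2], [7.9, 8.1], [8.1, 7.9]])

    # Agglomerative: start with singletons and repeatedly merge the
    # closest pair of clusters (average-link distance here).
    Z = linkage(X, method="average")

    # Cut the resulting dendrogram so that two clusters remain.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)  # e.g. [1 1 1 2 2 2]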
Classification Example: Grading
• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
• If 60 <= x < 70 then grade = D.
• If x < 50 then grade = F.
(The accompanying figure draws these rules as a decision tree that
tests x against the thresholds in turn, with leaves A, B, C, D, and F.)
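The rules map directly to code; a minimal sketch, leaving the
50 <= x < 60 range unhandled since the slide's rules do not cover it.

    def letter_grade(x):
        # Grading rules from the slide, applied top-down like the tree.
        if x >= 90:
            return "A"
        if x >= 80:
            return "B"
        if x >= 70:
            return "C"
        if x >= 60:
            return "D"
        if x < 50:
            return "F"
        return None  # 50 <= x < 60 is not covered by the slide's rules

    print(letter_grade(85))  # -> "B"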
Classification Techniques
• Approach:
1. Create a specific model by evaluating
training data (or using domain
experts’ knowledge).
2. Apply the developed model to new data.
• Classes must be predefined.
• The most common techniques use decision
trees (DTs), neural networks (NNs), distance
measures, or statistical methods.
Defining Classes
• Distance based: a class is the set of points close to a
representative point (figure).
• Partitioning based: classes are regions that partition the
attribute space (figure).
Classification Using Regression
• Division: use a regression function to
divide the area into regions.
• Prediction: use a regression function to
predict a class membership function.
Input includes the desired class.
Height Example Data
Name Gender Height Output1 Output2
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium
(Output1 and Output2 are two alternative class assignments for the
same tuples; the examples that follow use one or the other as the
target labeling.)
(Two figures, “Division” and “Prediction”, illustrate these two uses
of regression; they are not reproduced here.)
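A minimal sketch of the prediction variant, regressing 0/1 class labels
on height; the 0.5 cutoff and the binary Tall-vs-not encoding are
assumptions made for illustration.

    import numpy as np

    # Heights (m) and a 0/1 encoding of "Tall", loosely following
    # the table above; the encoding itself is an assumption.
    heights = np.array([1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2])
    is_tall = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

    # Prediction: regress class membership on height (y = w1*h + w0).
    w1, w0 = np.polyfit(heights, is_tall, deg=1)

    # Division: the point where the fitted line crosses 0.5 splits
    # the height axis into a "Tall" and a "not Tall" region.
    boundary = (0.5 - w0) / w1
    print(f"boundary ~= {boundary:.2f} m")
    print("Tall" if 1.95 * w1 + w0 >= 0.5 else "Not tall")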
Classification Using Distance
• Place items in the class to which they are
“closest”.
• Must determine the distance between an
item and a class.
• Classes represented by
– Centroid: central value (the mean of the members).
– Medoid: representative point (an actual member);
see the sketch below.
– Individual points
• Algorithm: KNN
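A small sketch contrasting the two representatives; the sample points
are assumptions.

    import numpy as np

    # Assumed members of one class (height in m, weight in kg).
    points = np.array([[1.6, 55.0], [1.7, 60.0], [1.9, 80.0]])

    # Centroid: the mean of the members; need not be an actual member.
    centroid = points.mean(axis=0)

    # Medoid: the actual member with the smallest total distance
    # to all other members.
    pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    medoid = points[pairwise.sum(axis=1).argmin()]

    print("centroid:", centroid)  # may lie between members
    print("medoid:  ", medoid)    # always an actual row of `points`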
K Nearest Neighbor
(KNN):
• Training set includes classes.
• Examine the K items nearest to the item
being classified.
• The new item is placed in the class
containing the most of its K neighbors.
• O(q) for each tuple to be classified,
where q is the size of the training set.
KNN (figure)
KNN Algorithm (pseudocode not reproduced here)
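In place of the missing pseudocode, a minimal KNN sketch; the toy
training pairs and the Counter-based tie-breaking are assumptions.

    import math
    from collections import Counter

    def euclidean(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def knn_classify(training, item, k=3):
        # training: list of (feature_vector, class_label) pairs.
        # Scanning all q pairs is the O(q)-per-tuple cost noted above.
        nearest = sorted(training, key=lambda t: euclidean(t[0], item))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]  # majority class among the k

    # Heights (m) with Output1-style labels from the table above:
    train = [([1.6], "Short"), ([1.7], "Short"), ([1.8], "Medium"),
             ([1.9], "Medium"), ([2.0], "Tall"), ([2.2], "Tall")]
    print(knn_classify(train, [1.94], k=3))  # -> "Medium"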
Classification Using Decision
Trees
• Partitioning based: divide the search
space into rectangular regions.
• A tuple is placed into a class based on
the region within which it falls.
• DT approaches differ in how the tree is
built: DT induction.
• Internal nodes are associated with an
attribute and arcs with values of that
attribute.
• Algorithms: ID3, C4.5, CART
Decision Tree
Given:
– D = {t1, …, tn} where ti=<ti1, …, tih>
– Database schema contains {A1, A2, …, Ah}
– Classes C = {C1, …, Cm}
A decision or classification tree is a tree
associated with D such that
– Each internal node is labeled with an attribute, Ai
– Each arc is labeled with a predicate that can be
applied to the attribute at its parent
– Each leaf node is labeled with a class, Cj
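A compact sketch of that structure: internal nodes carry an attribute
Ai, arcs carry predicates, leaves carry a class Cj. The threshold
values are assumptions loosely matching the height table.

    # Internal node: (attribute name, [(arc predicate, child), ...]);
    # a leaf is just a class label. Thresholds here are assumed.
    tree = ("Height", [
        (lambda h: h < 1.7,  "Short"),
        (lambda h: h < 1.95, "Medium"),
        (lambda h: True,     "Tall"),
    ])

    def classify(node, tuple_):
        if isinstance(node, str):       # leaf: labeled with a class Cj
            return node
        attribute, arcs = node          # internal node: attribute Ai
        for predicate, child in arcs:   # arcs: predicates on Ai
            if predicate(tuple_[attribute]):
                return classify(child, tuple_)

    print(classify(tree, {"Height": 1.88}))  # -> "Medium"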
DT Induction
(The tree-growing algorithm figure is not reproduced here.)
Comparing DTs
(Figures contrast a balanced tree with a deep, unbalanced one; the
balanced tree needs fewer comparisons on average.)
ID3
• Creates the tree using information-theory concepts and
tries to reduce the expected number of comparisons.
• ID3 chooses the split attribute with the highest
information gain, using entropy as the basis for the
calculation.
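A hedged sketch of the entropy and information-gain computations ID3
relies on, evaluated here for the Gender attribute against the Output1
labels of the height table.

    import math
    from collections import Counter

    def entropy(labels):
        # H(D) = -sum over classes of p * log2(p)
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    def information_gain(values, labels):
        # Gain(D, A) = H(D) - sum over v of |D_v|/|D| * H(D_v)
        n = len(labels)
        groups = {}
        for v, label in zip(values, labels):
            groups.setdefault(v, []).append(label)
        remainder = sum(len(g) / n * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    # Gender and Output1 from the height table (15 tuples):
    gender  = ["F", "M", "F", "F", "F", "M", "F", "M",
               "M", "M", "F", "M", "F", "F", "F"]
    output1 = ["Short", "Tall", "Medium", "Medium", "Short", "Medium",
               "Short", "Short", "Tall", "Tall", "Medium", "Medium",
               "Medium", "Medium", "Medium"]
    print(information_gain(gender, output1))

ID3 computes such a gain for every candidate attribute and splits on
the one with the largest value.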
Conclusion
• Clustering and classification are very useful in data mining.
• Both apply to text-based as well as graphical data.
• Clustering helps simplify data complexity.
• Classification assigns data to predefined classes.
• Both help detect hidden patterns in data.
Reference
• Dr. M.H. Dunham, http://engr.smu.edu/~mhd/dmbook/part2.ppt
• Dr. Sin-Min Lee, San Jose State University
• Mu-Yu Lu, San Jose State University
• A. Silberschatz, H.F. Korth, and S. Sudarshan, Database System Concepts