Chapter4 Clustering
Chapter4 Clustering
Clustering
Course Outcomes
ITC602.4: Implement the appropriate data mining methods like
classification, clustering or Frequent Pattern mining on large data sets.
Chapter 4. Clustering
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods - K-Means, K-Mediods
5. Hierarchical Methods - Agglomerative, Divisive, BIRCH
6. Density-Based Methods - DBSCAN
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10.Constraint-Based Clustering
11.Outlier Analysis
12.Summary Data Mining: Concepts and
04/14/21 Techniques 2
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities according to the characteristics
found and grouping similar data objects into
clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data
distribution
04/14/21As a preprocessing step for other algorithms 3
Clustering: Rich Applications and
Multidisciplinary Efforts
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering
feature spaces
Detect spatial clusters or for other spatial mining
tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar
Data Mining: Concepts and
04/14/21 access patterns Techniques 4
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to their
house type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters
should be clustered along continent faults
Data Mining: Concepts and
04/14/21 Techniques 5
Quality: What Is Good Clustering?
[ ]
Data matrix
(two modes)
x 11 .. . x 1f . .. x 1p
. .. .. . .. . . .. . ..
x i1 .. . x if . .. x ip
. .. .. . .. . . .. . ..
x n1 .. . x nf . .. x np
[ ]
Dissimilarity matrix
0
(one mode)
d( 2,1) 0
d( 3,1) d( 3,2) 0
: : :
d ( n,1) d( n,2) ... ... 0
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Standardize data
Calculate1 the mean absolute deviation:
s f = ( | x1 f −mf | + | x 2 f −m f | + . ..+ | x nf −m f | )
n
1
mf = ( x 1 f +x 2f + . . . +x nf ) .
n
where
Calculate the standardized
x if −m f measurement (z-
z if =
sf
score)
r if
f
f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled r −1
z if = if
M f −1
compute ranks rif and
04/14/21
and treat z as interval-scaled 21
Vector Objects
Vector objects: keywords in documents, gene
features in micro-arrays, etc.
Broad applications: information retrieval, biologic
taxonomy, etc.
Cosine measure
√
the cluster to its centroid
Σ Ni=1 ( t ip −c m ) 2
Rm =
N
Diameter: square root of average mean squared distance
between all pairs of points in the cluster
√
Σ i=N 1 Σ Ni=1 ( t ip −t iq ) 2
Dm =
N ( N −1)
Example
5
3 10
10
9
9
2 8
8
7
7
1 6
6
5
5
0 4
4
2 3
Assign
4 5 6 7 8 Update
9
3 10
3
2 each the 2
1
objects cluster 1
0
0
0 1 2 3 4 5 6 7 8 9 10 to means 0 1 2 3 4 5 6 7 8 9 10
most
similar reassign reassign
center
K=2
Arbitrarily choose
K object as initial
5 6 7 8 9 10
cluster
5 center
6 7 8 9 10 Update
the
cluster
means
10 10
9 9
8 8
7 7
6 6
5 5
4 4
3 3
2 2
1 1
0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
Total Cost = 20
9 9
10
8
8 8
7
6
Arbitrary Assign
5
choose k each
object as remaini
7 7
4
3
initial ng
6
2
medoids
6 object
to
1
0
0 1 2 3 4 5 6 7 8 9 10
nearest
medoid Randomly select a
5 5
K=2 s nonmedoid
4 4 Total Cost = 26 object,Oramdom
10 10
3 Do loop 3
9
8
Compute
9
8
Swapping 7 total cost 7
Until no O and
2
6
of 6
Oramdom
2
change
5 5
4
swapping 4
If quality is
3 3
1 1 2 2
improved. 1 1
0 0
0 0 Data
0 1 2
Mining:
3 4 5
Concepts
6 7 8 9
and
10 0 1 2 3 4 5 6 7 8 9 10
1 04/14/21 2 3 4 1 5 2 Techniques
6 3 7 4 8 5 9 6 10 7 38
PAM (Partitioning Around Medoids)
(1987)
PAM (Kaufman and Rousseeuw, 1987), built in Splus
Use real object to represent the cluster
Select k representative objects arbitrarily
For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
For each pair of i and h,
If TCih < 0, i is replaced by h
Then assign each non-selected object to the
most similar representative object
repeat steps 2-3 until there is no change
Data Mining: Concepts and
04/14/21 Techniques 39
PAM Clustering: Total swapping cost
10
TCih=jCjih
9
8
t j
t
7 j
h
10
i h
6 i
9
4
Cjih = d(j, h) - d(j, i) Cjih = 0
7
h
2
j
i
5
i h j
1
t 4
t
0
1 2 3 4 5 6 7 8 9
3
4 5 6 7 8 9 10
Go on in a non-descending fashion
9 9
8 8
6 6
5 5
4 4
3 3
2
Data Mining: Concepts and
2
04/14/21 Techniques 46
Dendrogram: Shows How the Clusters are Merged
8
7
7
6
6
5
5
4
4
3
3
√
the cluster to its centroid
Σ Ni=1 ( t ip −c m ) 2
Rm =
N
Diameter: square root of average mean squared distance
between all pairs of points in the cluster
√
Σ i=N 1 Σ Ni=1 ( t ip −t iq ) 2
Dm =
N ( N −1)
LS: Ni=1=Xi 9
SS: Ni=1=Xi2 8
CF = (5, (16,30),(54,190))
7
6
(3,4)
(2,6)
5
(4,5)
(4,7)
4
(3,8)
3
Non-leaf node
CF1 CF2 CF3 CF5
child1 child2 child3 child5
p belongs to NEps(q) p MinPts = 5
q
core point condition: Eps = 1 cm
Border
Eps = 1cm
Core MinPts = 5
(SIGMOD’99)
Produces a special order of the database wrt its
Reachability Distance o
p2 o
Max (core-distance (o), d (o, p))
MinPts = 5
r(p1, o) = 2.8cm. r(p2,o) =Data
4cm Mining: Concepts and
04/14/21 Techniques = 3 cm 67
Chapter 4. Clustering
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods - K-Means, K-Mediods
5. Hierarchical Methods - Agglomerative, Divisive, BIRCH
6. Density-Based Methods - DBSCAN
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10.Constraint-Based Clustering
11.Outlier Analysis
12.Summary Data Mining: Concepts and
04/14/21 Techniques 68
Outlier Analysis
What are outliers?
The set of objects are considerably dissimilar
defined
Categorization of approaches for outlier detection:
data distribution
Drawbacks
Nested-loop algorithm
Cell-based algorithm
Data Mining: Concepts and
04/14/21 Techniques 72
Density-Based Local
Outlier Detection
Distance-based outlier
detection is based on global
distance distribution
It encounters difficulties to
cannot identify o2 as an
outlier
Need the concept of local
Data Mining: Concepts and
outlier
04/14/21 Techniques 73
Outlier Discovery: Deviation-Based
Approach