Lecture 8-9
Basic techniques for cluster
detection
1
Overview
Lecture 8
• The problem of cluster detection
• Measuring proximity between data objects
• The K-means cluster detection method
Lecture 9
• The agglomeration cluster detection method
• Performance issues of the basic methods
• Cluster evaluation and interpretation
• Summary
2
Lecture 8
3
Clustering
• Clustering techniques can be categorized as either hierarchical or
nonhierarchical
• Hierarchical methods: agglomerative, divisive
• Nonhierarchical methods: K-means, SOM (self-organizing maps)
4
Clustering
• Divisive hierarchical clustering starts with the whole data set in one
cluster, and then repeatedly splits it into smaller and smaller clusters until
each observation is in its own cluster.
• Agglomerative hierarchical clustering starts with each observation in its
own cluster and repeatedly merges the most similar clusters until all
observations make up one big cluster.
5
Recall: Data Mining: Problems & Patterns
• Cluster detection
– Measure similarity among data objects and group them into
clusters accordingly
[Figure: input data points are fed into a clustering method, which outputs the cluster memberships of the data points]
6
Problem of Cluster Detection
• What is cluster detection?
• Clustering is a process of discovering clusters
• Cluster: a group of objects known as members
• The centre of a cluster is known as the centroid
• Members of a cluster are similar to each other (low intra-cluster
variation)
• Members of different clusters are different from each other (high
inter-cluster dissimilarity)
[Figure: clusters of points with their centroids marked]
7
Problem of Cluster Detection
• Basic elements of a clustering solution
• A sensible measure of similarity, e.g. Euclidean distance
• A goodness-of-fit function for evaluating the quality of the
resulting clusters, e.g. SSE
• An effective and efficient clustering algorithm, e.g. K-means
[Figure: two alternative groupings of the same points ("Good or Bad?"), contrasting internal (intra-cluster) variation with inter-cluster distance]
8
Problem of Cluster Detection
• Requirements for clustering solutions
• Scalability – speed, size
• Able to deal with attributes of different data types
• Able to discover clusters of arbitrary shapes
• Minimal requirements for domain knowledge to
determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input data records
• Able to deal with high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
9
Measures of Proximity
• Basics
• Proximity between two data objects is represented by
either similarity or dissimilarity
• Similarity: a numeric measure of the degree of alikeness,
dissimilarity: numeric measure of the degree of difference
between two objects
• Similarity measure and dissimilarity measure are often
convertible; normally dissimilarity is preferred
• Measure of dissimilarity:
• Measuring the difference between values of the corresponding
attributes
• Combining the measures of the differences into one value
10
Measures of Proximity
• A distance function d(x, y) is called a metric if it satisfies:
• d(x, y) ≥ 0 for all data objects x and y, and d(x, x) = 0
• d(x, y) = d(y, x), for all data objects x and y
• d(x, y) ≤ d(x, z) + d(z, y), for all data objects x, y and z
• The difference between values of a single attribute depends directly on
the domain type of the attribute.
• A proximity function that has only some of these properties can still be
useful.
• Some measure is better than no measure at all.
11
Measures of Proximity
• Value differences for different attribute types (a code sketch follows below)
• Nominal values
• If two names are the same, the difference is 0; otherwise it is the maximum
e.g. diff("John", "John") = 0, diff("John", "Mary") = 1 (the maximum)
• The same applies to binary values
e.g. diff(Yes, No) = 1 (the maximum)
• Ordinal values
• Different degrees of proximity can be compared
e.g. diff(A, B) < diff(A, D)
• Converting ordinal values to consecutive integers
e.g. A: 5, B: 4, C: 3, D: 2, E: 1, so A – B = 1 and A – D = 3
• Interval and ratio attributes
• Measured as the absolute difference between the two numerical values
• Unknown values
• Assume the unknown value to be maximally different from any other
value, e.g. diff(NULL, v) = |v|, diff(NULL, NULL) = the maximum
12
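To make the per-attribute rules above concrete, here is a minimal Python sketch (not from the module materials) of a difference function for a single attribute. The attribute-type labels, the choice of 1.0 as the maximum difference, and the scaling of ordinal and numeric differences into [0, 1] are assumptions for illustration.

```python
def attribute_diff(a, b, attr_type, value_range=1.0, ordinal_map=None):
    """Difference between two values of one attribute, scaled so the maximum is 1.0.

    attr_type: 'nominal', 'binary', 'ordinal' or 'numeric' (assumed labels).
    value_range: max - min of a numeric attribute, used for scaling.
    ordinal_map: dict mapping ordinal values to consecutive integers, e.g. {'A': 5, 'B': 4, ...}.
    """
    # Unknown values are assumed to be maximally different from anything else
    if a is None or b is None:
        return 1.0
    if attr_type in ('nominal', 'binary'):
        return 0.0 if a == b else 1.0          # match -> 0, mismatch -> maximum
    if attr_type == 'ordinal':
        scale = max(ordinal_map.values()) - min(ordinal_map.values())
        return abs(ordinal_map[a] - ordinal_map[b]) / scale
    # interval/ratio: absolute difference, scaled by the attribute's range
    return abs(a - b) / value_range


# Example uses, mirroring the slide (ordinal differences are scaled into [0, 1] here):
grades = {'A': 5, 'B': 4, 'C': 3, 'D': 2, 'E': 1}
print(attribute_diff('John', 'Mary', 'nominal'))                # 1.0
print(attribute_diff('A', 'B', 'ordinal', ordinal_map=grades))  # 0.25  (raw difference 1)
print(attribute_diff('A', 'D', 'ordinal', ordinal_map=grades))  # 0.75  (raw difference 3)
```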
Measures of Proximity
• Ratio of mismatched features for nominal attributes
Given two data objects i and j described by p nominal attributes, let m be the
number of attributes on which the values of the two objects match. Then

RMF(i, j) = (p − m) / p, which lies in [0, 1]

Row  Body Weight  Body Height  Blood Pressure  Blood Sugar  Habit      Class
1    heavy        short        high            3            smoker     P
2    heavy        short        high            1            nonsmoker  P
3    normal       tall         normal          3            nonsmoker  N
4    heavy        tall         normal          2            smoker     N
5    low          medium       normal          2            nonsmoker  N
6    low          tall         normal          1            nonsmoker  P
7    normal       medium       high            3            smoker     P
8    low          short        high            2            smoker     P
9    heavy        tall         high            2            nonsmoker  P
10   low          medium       normal          3            smoker     P
11   heavy        medium       normal          3            nonsmoker  N

e.g. d(row1, row2) = (6 − 4)/6 = 1/3
     d(row1, row3) = (6 − 1)/6 = 5/6
(A code sketch of this calculation follows below.)
13
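A minimal Python sketch of the RMF calculation above; the row data are copied from the table and the function name is just for illustration.

```python
# Ratio of mismatched features (RMF) for objects described by nominal attributes.
def rmf(i, j):
    p = len(i)                                   # number of attributes
    m = sum(1 for a, b in zip(i, j) if a == b)   # number of matching attributes
    return (p - m) / p

# Rows 1-3 of the table above (Body Weight, Body Height, Blood Pressure,
# Blood Sugar, Habit, Class).
row1 = ("heavy", "short", "high", 3, "smoker", "P")
row2 = ("heavy", "short", "high", 1, "nonsmoker", "P")
row3 = ("normal", "tall", "normal", 3, "nonsmoker", "N")

print(rmf(row1, row2))   # 0.333... = (6 - 4) / 6
print(rmf(row1, row3))   # 0.833... = (6 - 1) / 6
```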
Measures of Proximity
Distance functions for interval/ratio attributes
• Minkowski function
d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q)^(1/q)
where x_ik is the kth attribute value of object i and p is the number of attributes.
Special cases:
➢ Manhattan distance (q = 1): d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
➢ Euclidean distance (q = 2): d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)
➢ Supremum/Chebyshev distance (q = ∞): d(i, j) = max over t = 1…p of |x_it − x_jt|
14
Measures of Proximity
Distance functions for interval/ratio attributes
customerID  No of Trans  Revenue  Tenure (Months)
101         30           1000     20
102         40           400      30
103         35           300      30
104         20           1000     35
105         50           500      1
106         80           100      10
107         10           1000     2

d1(cust101, cust102) = |30 − 40| + |1000 − 400| + |20 − 30| = 620                (Manhattan)
d2(cust101, cust102) = sqrt((30 − 40)² + (1000 − 400)² + (20 − 30)²) ≈ 600.17    (Euclidean)
dmax(cust101, cust102) = |1000 − 400| = 600                                      (Chebyshev)

[Figure: customers plotted on Revenue vs. No. of Trans axes, illustrating the Manhattan, Euclidean and Chebyshev distances between customers 101 and 102]
(A code sketch of these calculations follows below.)
15
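A short Python sketch, using only the standard library, that reproduces the three distances above; the customer tuples and function names are illustrative.

```python
# Minkowski-family distances between two customers described by
# (No of Trans, Revenue, Tenure).
cust101 = (30, 1000, 20)
cust102 = (40, 400, 30)

def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

print(minkowski(cust101, cust102, 1))   # Manhattan: 620
print(minkowski(cust101, cust102, 2))   # Euclidean: ~600.17
print(chebyshev(cust101, cust102))      # Chebyshev: 600
```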
Measures of Proximity
Binary attributes
• Given two data objects i and j of p binary attributes,
• f00 : the number of attributes where i is 0 and j is 0
• f01 : the number of attributes where i is 0 and j is 1
• f10 : the number of attributes where i is 1 and j is 0
• f11 : the number of attributes where i is 1 and j is 1
• Simple mismatch coefficient (SMC) for symmetric binary attributes:
SMC(i, j) = (f01 + f10) / (f00 + f01 + f10 + f11)
• Jaccard coefficient (JC), defined for asymmetric binary attributes:
JC(i, j) = (f01 + f10) / (f01 + f10 + f11)
16
Measures of Proximity
• Binary attributes (example)
DocumentID  Query  Database  Programming  Interface  User  Usability  Network  Web  GUI  HTML
d1          1      1         0            0          0     0          0        0    0    0
d2          0      1         1            0          0     0          0        0    0    0
d3          0      1         0            0          0     0          0        0    0    0

SMC(d1, d2) = (f01 + f10) / (f00 + f01 + f10 + f11) = (1 + 1) / (7 + 1 + 1 + 1) = 2/10
JC(d1, d2) = (f01 + f10) / (f01 + f10 + f11) = (1 + 1) / (1 + 1 + 1) = 2/3
SMC suggests d1 and d2 are not that different; JC shows they are very different: two of the three words used differ.

SMC(d1, d3) = (0 + 1) / (8 + 0 + 1 + 1) = 1/10
JC(d1, d3) = (0 + 1) / (0 + 1 + 1) = 1/2
SMC suggests d1 and d3 are very similar; JC shows they are still quite different: one of the two words used differs.
(A code sketch of these calculations follows below.)
17
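A small Python sketch of the SMC and Jaccard (dissimilarity) coefficients defined above, applied to the document vectors d1-d3; the function names are illustrative.

```python
# SMC and Jaccard dissimilarity for binary vectors.
def binary_counts(x, y):
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    return f00, f01, f10, f11

def smc(x, y):
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f01 + f10) / (f00 + f01 + f10 + f11)

def jc(x, y):
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f01 + f10) / (f01 + f10 + f11)

d1 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
d2 = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
d3 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(smc(d1, d2), jc(d1, d2))   # 0.2  0.666...
print(smc(d1, d3), jc(d1, d3))   # 0.1  0.5
```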
Measures of Proximity
• Cosine similarity function
• Treating two data objects as vectors
• Similarity is measured as the cosine of the angle θ between the two vectors
• Similarity is 1 when θ = 0° and 0 when θ = 90°
• Similarity function:
cos(i, j) = (i • j) / (||i|| ||j||)
where i • j = i1 j1 + i2 j2 + ... + in jn and ||i|| = sqrt(i1² + i2² + ... + in²)
18
Measures of Proximity
• Cosine similarity function (example)
Given two data objects:
x = (3, 2, 0, 5), and y = (1, 0, 0, 0)
Since,
x • y = 3×1 + 2×0 + 0×0 + 5×0 = 3
||x|| = sqrt(3² + 2² + 0² + 5²) ≈ 6.16
||y|| = sqrt(1² + 0² + 0² + 0²) = 1
Then, the similarity between x and y: cos(x, y) = 3 / (6.16 × 1) ≈ 0.49
The dissimilarity between x and y: 1 − cos(x, y) ≈ 0.51
(A code sketch follows below.)
19
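A minimal Python sketch of the cosine calculation above; x and y are the example vectors and the function name is illustrative.

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = [3, 2, 0, 5]
y = [1, 0, 0, 0]
sim = cosine_similarity(x, y)
print(sim)        # ~0.49 (similarity)
print(1 - sim)    # ~0.51 (dissimilarity)
```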
Measures of Proximity
• Combining heterogeneous attributes
• Based on the principle of the ratio of mismatched features
• For the kth attribute, compute the dissimilarity d_k in [0, 1]
• Set the indicator variable δ_k as follows:
• δ_k = 0, if the kth attribute is an asymmetric binary attribute and both
objects have value 0 for that attribute
• δ_k = 1, otherwise
• Compute the overall distance between i and j as:
d(i, j) = (δ_1 d_1 + δ_2 d_2 + ... + δ_n d_n) / (δ_1 + δ_2 + ... + δ_n)
20
Measures of Proximity
• Combining heterogeneous attributes (example)
SubjectID  Body Height (cm)  Body Weight (KG)  Blood Pressure  Habit       Blood Sugar
S1         125               61                High            Smoker      High
S2         178               90                Normal          Nonsmoker   Low

• D(S1, S2) = [(1 × 0.8833) + (1 × 0.4603) + (1 × 1) + (1 × 1) + (1 × 1)] / (1 + 1 + 1 + 1 + 1)
= 4.3437 / 5
≈ 0.8687
(A code sketch follows below.)
Read the chapter for more explanation on the data and calculation.
21
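A Python sketch of the combined distance above. The normalizing ranges for height and weight are not shown on the slide (they come from the full data set in the chapter); the values 60 cm and 63 kg used here are assumptions chosen because they reproduce the slide's per-attribute dissimilarities of 0.8833 and 0.4603.

```python
# Combined (Gower-style) distance over heterogeneous attributes.
# Each attribute contributes a dissimilarity d_k in [0, 1] and an indicator delta_k.

HEIGHT_RANGE = 60.0   # assumed max - min of Body Height in the full data set
WEIGHT_RANGE = 63.0   # assumed max - min of Body Weight in the full data set

def combined_distance(s1, s2):
    height1, weight1, bp1, habit1, sugar1 = s1
    height2, weight2, bp2, habit2, sugar2 = s2
    terms = [
        abs(height1 - height2) / HEIGHT_RANGE,   # numeric, range-normalized
        abs(weight1 - weight2) / WEIGHT_RANGE,   # numeric, range-normalized
        0.0 if bp1 == bp2 else 1.0,              # nominal
        0.0 if habit1 == habit2 else 1.0,        # nominal
        0.0 if sugar1 == sugar2 else 1.0,        # ordinal, treated as match/mismatch here
    ]
    deltas = [1] * len(terms)                    # all attributes applicable for S1, S2
    return sum(d * w for d, w in zip(terms, deltas)) / sum(deltas)

S1 = (125, 61, "High", "Smoker", "High")
S2 = (178, 90, "Normal", "Nonsmoker", "Low")
print(combined_distance(S1, S2))   # ~0.8687
```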
K-means, a Basic Clustering Method
• Outline of main steps
1. Define the number of clusters (k)
2. Choose k data objects at random to serve as the initial centroids of
the k clusters
3. Assign each data object to the cluster represented by its nearest
centroid
4. Find a new centroid for each cluster by calculating the mean vector
of its members
5. Undo the memberships of all data objects, go back to Step 3, and
repeat the process until cluster membership no longer changes or a
maximum number of iterations is reached
(A code sketch of these steps follows below.)
22
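A compact Python/NumPy sketch of the five steps above (a plain-vanilla K-means, not the implementation used in the module); the function name, random seed and example data layout are my own choices for illustration.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Basic K-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k data objects at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for iteration in range(max_iters):
        # Step 3: assign each object to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when memberships no longer change
        if iteration > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean vector of its members
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

# Example: the body height/weight data used later in the lecture
X = np.array([[125, 61], [178, 90], [178, 92], [180, 83], [167, 85], [170, 89],
              [173, 98], [135, 40], [120, 35], [145, 70], [125, 50]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)      # cluster membership of each subject (depends on the random initial centroids)
print(centroids)   # with this well-separated data, typically ~(174.3, 89.5) and ~(130.0, 51.2)
```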
K-means, a Basic Clustering Method
• Illustration of the method:
23
K-means, a Basic Clustering Method
• Strengths & weaknesses
• Strengths
• Simple and easy to implement
• Quite efficient
• Weaknesses
• Need to specify the value of k, but we may not know what the
value should be beforehand
• Sensitive to the choice of initial k centroids: the result can be
non-deterministic
• Sensitive to noise
• Applicable only when the mean is meaningful for the given data set
24
K-means, a Basic Clustering Method
• Overcoming the weaknesses:
• Using cluster quality to determine the value of k
• Improving how the initial k centroids are chosen
• Running the clustering a number of times and selecting the result
with the highest quality
• Finding centres that are farther apart
• Dealing with noise
• Removing outliers before clustering
25
K-means, a Basic Clustering Method
• Value of k and cluster quality
[Figure: scree plot of cluster error (e.g. SSE) against the number of clusters; the error drops sharply and then levels off, and the 'elbow' suggests a suitable value of k]
(A code sketch follows below.)
26
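A sketch of how such a scree plot can be produced, assuming scikit-learn and matplotlib are available (neither is prescribed by the lecture); SSE here is the K-means inertia reported by the library, and the data are the body height/weight subjects used in the evaluation examples.

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X = np.array([[125, 61], [178, 90], [178, 92], [180, 83], [167, 85], [170, 89],
              [173, 98], [135, 40], [120, 35], [145, 70], [125, 50]], dtype=float)

ks = range(1, 7)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)   # sum of squared distances to the nearest centroid

plt.plot(list(ks), sse, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Cluster error (SSE)")
plt.title("Scree plot for choosing k")
plt.show()
```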
Lecture 9
27
Hierarchical Clustering
Create a hierarchy of clusters for cluster analysis.
• Two different approaches
• Divisive approach
• Develops the hierarchy from
the root to the leaves by
splitting bigger clusters into
smaller ones (similar to
bisecting K-means)
• Agglomerative approach
• Develops the hierarchy from
the leaf nodes to the root by
merging smaller clusters into
bigger ones.
28
Hierarchical Clustering
• Agglomerative Approach
• Take all n data objects as individual clusters and build an n × n
dissimilarity matrix that stores the distance between every pair of
data objects
• While the number of clusters > 1 do:
• Find the pair of data objects/clusters with the minimum distance
• Merge the two data objects/clusters into a bigger cluster
• Replace the entries in the matrix for the original clusters or
objects with the cluster tag of the newly formed cluster
• Re-calculate the relevant distances and update the matrix
(A code sketch of this procedure follows below.)
29
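A plain Python/NumPy sketch of the procedure above, using the single-link (minimum distance) rule described later in the lecture. For simplicity it recomputes cluster-to-cluster distances from the original matrix instead of shrinking the matrix in place, and it records each merge rather than drawing a dendrogram; all names and the example points are illustrative.

```python
import numpy as np

def single_link_agglomeration(X):
    """Merge clusters one pair at a time, always joining the closest pair (single link)."""
    n = len(X)
    # n x n dissimilarity matrix of Euclidean distances between all data objects
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = {i: [i] for i in range(n)}   # cluster tag -> member indices
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the minimum single-link distance
        tags = list(clusters)
        best = None
        for a in range(len(tags)):
            for b in range(a + 1, len(tags)):
                ca, cb = clusters[tags[a]], clusters[tags[b]]
                d = min(D[i, j] for i in ca for j in cb)  # single link: closest members
                if best is None or d < best[0]:
                    best = (d, tags[a], tags[b])
        d, ta, tb = best
        # Merge the two clusters under the first cluster's tag and record the step
        clusters[ta] = clusters[ta] + clusters.pop(tb)
        merges.append((ta, tb, d))
    return merges

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
for ta, tb, d in single_link_agglomeration(X):
    print(f"merge cluster {tb} into cluster {ta} at distance {d:.2f}")
```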
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
• A tree-like diagram that records the sequence of merges or splits
[Figure: six points grouped into nested clusters, alongside the corresponding dendrogram recording the order and the distances (from 0 to about 0.2) at which the merges occur]
30
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
• Any desired number of clusters can be obtained by ‘cutting’ the
dendrogram at the proper level
• They may correspond to meaningful taxonomies
• Example in biological sciences (e.g., animal kingdom, phylogeny
reconstruction, …)
31
The Agglomeration Method
• Illustration of the method
32
The Agglomeration Method
• Agglomeration approaches for
merging clusters
• Single link: the distance
between two closest points
• Complete link: the distance
between two farthest points
• Group average: the average of
all pair-wise distances
• Centroids: the distance between
the centroids
33
Single Link Agglomeration Method
• Similarity of two clusters is based on the two most similar (closest)
points in the different clusters
• Determined by one pair of points, i.e., by one link in the proximity graph.
34
Single Link Agglomeration Method (cont)
After s2 and s3 are merged, the distance from any other object to the new
cluster {s2, s3} is the minimum of its distances to s2 and to s3.
35
The Agglomeration Method
• Strengths and weaknesses
• Strengths
• Deterministic results
• Produces multiple alternative clusterings (one per level of the hierarchy)
• No need to specify the value of k beforehand
• Can create clusters of arbitrary shapes (single link)
• Weaknesses
• Does not scale up to large data sets
• Merges cannot be undone, unlike membership reassignment in K-means
36
Cluster Evaluation
• Cluster quality
• Principle:
• Low level of variation among members within a cluster (intra-cluster)
• High level of dissimilarity between clusters (inter-cluster)
• The measures
• Cohesion: sum of squared errors (SSE), within-cluster variation (wc),
sum of within-cluster variation (WC)
• Separation: sum of squared distances between cluster centroids
(BC)
• Combining the cohesion and separation, the ratio BC/WC is a good
indicator of overall quality (Q)
37
Cohesion
• Within-cluster variation: the average squared distance between the members
of a cluster and its centroid

SubjectID  Body Height  Body Weight  Cluster Tag
s1         125          61           2
s2         178          90           1
s3         178          92           1
s4         180          83           1
s5         167          85           1
s6         170          89           1
s7         173          98           1
s8         135          40           2
s9         120          35           2
s10        145          70           2
s11        125          50           2

C_k: cluster k, r_k: centroid of C_k

SSE(C_k) = Σ_{x ∈ C_k} d(x, r_k)²      (sum of squared errors)
wc(C_k) = SSE(C_k) / |C_k|             (within-cluster variation)
WC = Σ_{k=1}^{K} wc(C_k)               (total within-cluster variation)

Centroid of cluster 1: (174.33, 89.5); centroid of cluster 2: (130, 51.2)
C1 is a better quality cluster than C2: wc(C1) ≈ 45.806, wc(C2) ≈ 247.760,
so WC = 45.806 + 247.760 = 293.57
(A code sketch of these calculations follows below.)
38
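A Python/NumPy sketch that reproduces the cohesion numbers above; the variable and function names are illustrative.

```python
import numpy as np

# Body height/weight data with the cluster tags from the table above.
data = {
    1: np.array([[178, 90], [178, 92], [180, 83], [167, 85], [170, 89], [173, 98]], dtype=float),
    2: np.array([[125, 61], [135, 40], [120, 35], [145, 70], [125, 50]], dtype=float),
}

def wc_of_cluster(members):
    centroid = members.mean(axis=0)
    sse = np.sum(np.linalg.norm(members - centroid, axis=1) ** 2)  # SSE(Ck)
    return centroid, sse / len(members)                            # wc(Ck) = SSE / |Ck|

WC = 0.0
for tag, members in data.items():
    centroid, wc = wc_of_cluster(members)
    WC += wc
    print(tag, centroid, round(wc, 3))   # cluster 1: ~45.806, cluster 2: ~247.760

print(round(WC, 2))                      # total within-cluster variation: ~293.57
```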
Separation Measure
• If the cluster centroids are very dissimilar from each other, cluster quality
is higher.
• This can be measured using the distance between the cluster centroids:

BC = Σ_{1 ≤ j < k ≤ K} d(r_j, r_k)²

where r_j and r_k are the centroids of clusters C_j and C_k respectively, and
d is the Euclidean distance.

(Same subject data as in the cohesion example.)
Centroid of cluster 1: (174.33, 89.5); centroid of cluster 2: (130, 51.2)
39
Overall cluster quality
• Combining the cohesion and separation, the ratio BC/WC is a good
indicator of overall quality (Q):

Q = BC / WC

The larger the value of this ratio, the better the cluster quality.
A value close to 1 indicates poor-quality clusters: the distance between
clusters is close to the distance within clusters.
(Same subject data and centroids as in the cohesion and separation examples; a code sketch of BC and Q follows below.)
40
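A self-contained Python/NumPy sketch computing BC and the overall quality Q = BC/WC for the two example clusters; the printed values are my own calculation from this data, not figures given on the slides.

```python
import numpy as np
from itertools import combinations

# Clusters from the body height/weight example (cluster tag -> members).
clusters = {
    1: np.array([[178, 90], [178, 92], [180, 83], [167, 85], [170, 89], [173, 98]], dtype=float),
    2: np.array([[125, 61], [135, 40], [120, 35], [145, 70], [125, 50]], dtype=float),
}

centroids = {tag: m.mean(axis=0) for tag, m in clusters.items()}

# Cohesion: total within-cluster variation WC
WC = sum(np.sum(np.linalg.norm(m - centroids[tag], axis=1) ** 2) / len(m)
         for tag, m in clusters.items())

# Separation: BC = sum of squared distances between all pairs of centroids
BC = sum(np.linalg.norm(centroids[a] - centroids[b]) ** 2
         for a, b in combinations(centroids, 2))

Q = BC / WC
print(round(BC, 1), round(WC, 2), round(Q, 2))   # roughly 3432.3, 293.57, 11.69
```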
Cluster Evaluation & Interpretation
• Using cluster quality for clustering
• With K-means:
• Add an outer loop for different values of K (from low to high)
• At an iteration, conduct K-means clustering using the current K
• Measure the overall cluster quality and decide whether the
resulting cluster quality is acceptable
• If not, increase the value of K by 1 and repeat the process
• With agglomeration:
• Traverse the hierarchy level by level from the root
• At a level, evaluate the overall quality of clusters
• If the quality is acceptable, take the clusters at the level as the final
result. If not, move to the next level and repeat the process.
41
Summary
• A clustering solution must provide a sensible proximity function, effective
algorithm and a cluster evaluation function
• Proximity is normally measured by a distance function that combines
measures of value differences upon attributes
• The K-Means method continues to refine prototype partitions until
membership changes no longer occur
• The agglomeration method constructs all possible groupings of individual
data objects into a hierarchy of clusters
• Good clustering results mean high similarity among members of a cluster
and low similarity between members of different clusters
42
References
• Read Chapter 4 of Data Mining Techniques and Applications
Useful further references
• Tan, P-N., Steinbach, M. and Kumar, V. (2006), Introduction to Data Mining,
Addison-Wesley, Chapters 2 (section 2.4) and 8.
43