
Lecture 8-9

Basic techniques for cluster detection
Overview

Lecture 8
• The problem of cluster detection
• Measuring proximity between data objects
• The K-means cluster detection method

Lecture 9
• The agglomeration cluster detection method
• Performance issues of the basic methods
• Cluster evaluation and interpretation
• Summary

Lecture 8

Clustering

• Clustering techniques can be categorized as either hierarchical or nonhierarchical

  Clustering
    Hierarchical: Agglomerative, Divisive
    Nonhierarchical: K-means, SOM
Clustering

• Divisive hierarchical clustering starts with the whole data set in a single cluster and repeatedly splits clusters into smaller ones until each observation forms its own cluster.
• Agglomerative clustering starts with each observation as its own cluster and repeatedly merges the most similar clusters until all observations form one big cluster.
Recall: Data Mining: Problems & Patterns
• Cluster detection
– Measure similarity among data objects and group them into clusters accordingly

(Diagram: input data points → clustering method → cluster memberships of the data points)
Problem of Cluster Detection
• What is cluster detection?
• Clustering is a process of discovering clusters
• Cluster: a group of objects known as members
• The centre of a cluster is known as the centroid
• Members of a cluster are similar to each other (low intra-cluster variation)
• Members of different clusters are different from each other (high inter-cluster dissimilarity)

(Figure: example clusters with their centroids marked.)
Problem of Cluster Detection

• Basic elements of a clustering solution
  • A sensible measure for similarity, e.g. Euclidean distance
  • A goodness-of-fit function for evaluating the quality of the resulting clusters, e.g. SSE
  • An effective and efficient clustering algorithm, e.g. K-means

(Figure: example clusterings annotated with internal variation and inter-cluster distance: good or bad?)
Problem of Cluster Detection
• Requirements for clustering solutions
• Scalability – speed, size
• Able to deal with attributes of different data types
• Able to discover clusters of arbitrary shapes
• Minimal requirements for domain knowledge to
determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input data records
• Able to deal with high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

Measures of Proximity
• Basics
• Proximity between two data objects is represented by
either similarity or dissimilarity
• Similarity: a numeric measure of the degree of alikeness,
dissimilarity: numeric measure of the degree of difference
between two objects
• Similarity measure and dissimilarity measure are often
convertible; normally dissimilarity is preferred
• Measure of dissimilarity:
• Measuring the difference between values of the corresponding
attributes
• Combining the measures of the differences into one value

Measures of Proximity
• A distance function d(x, y) is called a metric if it satisfies:
  • d(x, y) ≥ 0 for all data objects x and y, and d(x, x) = 0
  • d(x, y) = d(y, x), for all data objects x and y
  • d(x, y) ≤ d(x, z) + d(z, y), for all data objects x, y and z
• The difference between values of a single attribute depends directly on the domain type of the attribute.
• A proximity function with only some of these properties can still be useful.
• Some measure is better than no measure at all.
Measures of Proximity
• Value differences for different attribute types
• Nominal values
  • If two values are the same, the difference is 0; otherwise it is the maximum,
    e.g. diff("John", "John") = 0, diff("John", "Mary") = maximum
  • The same applies to differences between binary values, e.g. diff(Yes, No) = maximum
• Ordinal values
  • Different degrees of proximity can be compared, e.g. diff(A, B) < diff(A, D)
  • Convert ordinal values to consecutive integers,
    e.g. A: 5, B: 4, C: 3, D: 2, E: 1, so A − B = 1 and A − D = 3
• Interval and ratio attributes
  • Measured as the absolute difference between the two numerical values
• Unknown values
  • Assume an unknown value is maximally different from any other value,
    e.g. diff(NULL, v) = |v|, diff(NULL, NULL) = maximum
Measures of Proximity
• Ratio of mismatched features for nominal attributes
Given two data objects i and j described by p nominal attributes, let m be the number of attributes on which the values of the two objects match. The ratio of mismatched features is

    RMF(i, j) = (p − m) / p,  a value in [0, 1]

Example (p = 6 attributes):

    Body Weight  Body Height  Blood Pressure  Blood Sugar  Habit      Class
    heavy        short        high            3            smoker     P
    heavy        short        high            1            nonsmoker  P
    normal       tall         normal          3            nonsmoker  N
    heavy        tall         normal          2            smoker     N
    low          medium       normal          2            nonsmoker  N
    low          tall         normal          1            nonsmoker  P
    normal       medium       high            3            smoker     P
    low          short        high            2            smoker     P
    heavy        tall         high            2            nonsmoker  P
    low          medium       normal          3            smoker     P
    heavy        medium       normal          3            nonsmoker  N

    d(row1, row2) = (6 − 4) / 6 = 1/3
    d(row1, row3) = (6 − 1) / 6 = 5/6
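As a rough sketch (not from the lecture), the RMF calculation can be written in a few lines of Python; the function name rmf and the tuple encoding of the rows are illustrative choices:

```python
def rmf(obj_i, obj_j):
    """Ratio of mismatched features between two objects with nominal attributes."""
    p = len(obj_i)                                          # number of attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)      # number of matches
    return (p - m) / p

row1 = ("heavy", "short", "high", 3, "smoker", "P")
row2 = ("heavy", "short", "high", 1, "nonsmoker", "P")
row3 = ("normal", "tall", "normal", 3, "nonsmoker", "N")

print(rmf(row1, row2))   # 0.333... = 1/3
print(rmf(row1, row3))   # 0.833... = 5/6
```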
Measures of Proximity
Distance functions for interval/ratio attributes

• Minkowski function

    d(i, j) = (|x_{i1} − x_{j1}|^q + |x_{i2} − x_{j2}|^q + ... + |x_{ip} − x_{jp}|^q)^{1/q}

  where x_{it} is the t-th feature value of object i and p is the number of features.

  Special cases:
  ➢ Manhattan distance (q = 1):   d(i, j) = |x_{i1} − x_{j1}| + |x_{i2} − x_{j2}| + ... + |x_{ip} − x_{jp}|
  ➢ Euclidean distance (q = 2):   d(i, j) = sqrt(|x_{i1} − x_{j1}|² + |x_{i2} − x_{j2}|² + ... + |x_{ip} − x_{jp}|²)
  ➢ Supremum/Chebyshev distance (q → ∞):   d(i, j) = max over t = 1..p of |x_{it} − x_{jt}|
Measures of Proximity
Distance functions for interval/ratio attributes

    customerID  No of Trans  Revenue  Tenure (Months)
    101         30           1000     20
    102         40           400      30
    103         35           300      30
    104         20           1000     35
    105         50           500      1
    106         80           100      10
    107         10           1000     2

    d_1(cust101, cust102)   = |30 − 40| + |1000 − 400| + |20 − 30| = 620               (Manhattan)
    d_2(cust101, cust102)   = sqrt((30 − 40)² + (1000 − 400)² + (20 − 30)²) ≈ 600.16   (Euclidean)
    d_max(cust101, cust102) = |1000 − 400| = 600                                       (Chebyshev)

(Figure: scatter plot of the customers over Revenue and No. of Trans, illustrating the Manhattan, Euclidean and Chebyshev distances between customers 101 and 102.)
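The three special cases can be checked against the worked example with a small Python sketch (the helper names are illustrative, not lecture code):

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length numeric vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def chebyshev(x, y):
    """Supremum (q -> infinity) distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

cust101 = (30, 1000, 20)   # (No of Trans, Revenue, Tenure)
cust102 = (40, 400, 30)

print(minkowski(cust101, cust102, 1))   # 620.0    (Manhattan)
print(minkowski(cust101, cust102, 2))   # ≈600.17  (Euclidean; shown as ≈600.16 on the slide)
print(chebyshev(cust101, cust102))      # 600      (Chebyshev)
```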
Measures of Proximity
Binary attributes
• Given two data objects i and j of p binary attributes,
• f00 : the number of attributes where i is 0 and j is 0
• f01 : the number of attributes where i is 0 and j is 1
• f10 : the number of attributes where i is 1 and j is 0
• f11 : the number of attributes where i is 1 and j is 1

• Simple mismatch coefficient (SMC) for symmetric values:

    SMC(i, j) = (f01 + f10) / (f00 + f01 + f10 + f11)

• Jaccard coefficient, defined for asymmetric values:

    JC(i, j) = (f01 + f10) / (f01 + f10 + f11)
Measures of Proximity
• Binary attributes (example)

    DocumentID  Query  Database  Programming  Interface  User  Usability  Network  Web  GUI  HTML
    d1          1      1         0            0          0     0          0        0    0    0
    d2          0      1         1            0          0     0          0        0    0    0
    d3          0      1         0            0          0     0          0        0    0    0

    SMC(d1, d2) = (f01 + f10) / (f00 + f01 + f10 + f11) = (1 + 1) / (7 + 1 + 1 + 1) = 2/10
    JC(d1, d2)  = (f01 + f10) / (f01 + f10 + f11) = (1 + 1) / (1 + 1 + 1) = 2/3
    SMC → not that different; JC → very different: a two-word (out of 3) difference

    SMC(d1, d3) = (f01 + f10) / (f00 + f01 + f10 + f11) = 1 / (8 + 1 + 1) = 1/10
    JC(d1, d3)  = (f01 + f10) / (f01 + f10 + f11) = 1 / (1 + 1) = 1/2
    SMC → very similar; JC → still quite different: a one-word (out of 2) difference
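A small Python sketch (illustrative helper names, not lecture code) reproduces these figures:

```python
def binary_counts(x, y):
    """Count the four agreement/disagreement cases for two 0/1 vectors."""
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    return f00, f01, f10, f11

def smc(x, y):
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f01 + f10) / (f00 + f01 + f10 + f11)

def jc(x, y):
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f01 + f10) / (f01 + f10 + f11)

d1 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
d2 = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
d3 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(smc(d1, d2), jc(d1, d2))   # 0.2  0.666...
print(smc(d1, d3), jc(d1, d3))   # 0.1  0.5
```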
Measures of Proximity
• Cosine similarity function
• Treating two data objects as vectors
• Similarity is measured via the angle θ between the two vectors
• Similarity is 1 when θ = 0°, and 0 when θ = 90°
• Similarity function:

    cos(i, j) = (i · j) / (||i|| · ||j||)

  where  i · j = Σ_{k=1..n} i_k · j_k   and   ||i|| = sqrt(Σ_{k=1..n} i_k²)
Measures of Proximity
• Cosine similarity function (example)

Given two data objects x = (3, 2, 0, 5) and y = (1, 0, 0, 0). Since
    x · y = 3·1 + 2·0 + 0·0 + 5·0 = 3
    ||x|| = sqrt(3² + 2² + 0² + 5²) ≈ 6.16
    ||y|| = sqrt(1² + 0² + 0² + 0²) = 1
the similarity between x and y is cos(x, y) = 3 / (6.16 × 1) ≈ 0.49,
and the dissimilarity between x and y is 1 − cos(x, y) ≈ 0.51.
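The same example in a short Python sketch (the cosine_similarity function is an illustrative helper, not lecture code):

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = (3, 2, 0, 5)
y = (1, 0, 0, 0)
sim = cosine_similarity(x, y)
print(round(sim, 2), round(1 - sim, 2))   # 0.49 0.51
```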
Measures of Proximity
• Combining heterogeneous attributes
• Based on the principle of ratio of mismatched features
• For the k-th attribute, compute the dissimilarity d_k in [0, 1]
• Set the indicator variable δ_k as follows:
  • δ_k = 0, if the k-th attribute is an asymmetric binary attribute and both objects have value 0 for the attribute
  • δ_k = 1, otherwise
• Compute the overall distance between i and j as:

    d(i, j) = ( Σ_{k=1..n} δ_k · d_k ) / ( Σ_{k=1..n} δ_k )
Measures of Proximity

• Combining heterogeneous attributes (example)

    SubjectID  Body Height (cm)  Body Weight (kg)  Blood Pressure  Habit      Blood Sugar
    S1         125               61                High            Smoker     High
    S2         178               90                Normal          Nonsmoker  Low

• D(S1, S2) = ((1 × 0.8833) + (1 × 0.4603) + (1 × 1) + (1 × 1) + (1 × 1)) / (1 + 1 + 1 + 1 + 1)
            = 4.3437 / 5
            = 0.8687

Read the chapter for more explanation on the data and the calculation.
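As a sketch of the combination rule (not the lecture's own code), the following Python function mixes range-scaled numeric differences with simple mismatches for nominal attributes; the height range of 60 and weight range of 63 are assumed from the chapter's example data:

```python
def combined_distance(x, y, attr_types, ranges=None):
    """Combine per-attribute dissimilarities in [0, 1] into one overall distance."""
    num, den = 0.0, 0.0
    for k, (a, b) in enumerate(zip(x, y)):
        kind = attr_types[k]
        if kind == 'asym_binary' and a == 0 and b == 0:
            delta, d_k = 0, 0.0                        # skip joint absences
        elif kind == 'numeric':
            delta, d_k = 1, abs(a - b) / ranges[k]     # scale difference into [0, 1]
        else:                                          # nominal: simple mismatch
            delta, d_k = 1, (0.0 if a == b else 1.0)
        num += delta * d_k
        den += delta
    return num / den

s1 = (125, 61, 'High', 'Smoker', 'High')
s2 = (178, 90, 'Normal', 'Nonsmoker', 'Low')
types = ['numeric', 'numeric', 'nominal', 'nominal', 'nominal']
ranges = {0: 60, 1: 63}   # assumed value ranges for height and weight
print(round(combined_distance(s1, s2, types, ranges), 4))   # ≈0.8687
```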
K-means, a Basic Clustering Method

• Outline of main steps (a minimal implementation sketch follows after this list)

  1. Define the number of clusters (k)
  2. Choose k data objects at random to serve as the initial centroids for the k clusters
  3. Assign each data object to the cluster represented by its nearest centroid
  4. Find a new centroid for each cluster by calculating the mean vector of its members
  5. Undo the memberships of all data objects, go back to Step 3, and repeat the process until cluster memberships no longer change or a maximum number of iterations is reached
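A minimal sketch of these steps in plain Python (the kmeans function name, the random seeding, and the example data borrowed from the cluster-evaluation slides are illustrative, not part of the lecture):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of numeric tuples. Returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 2: random initial centroids
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(x, centroids[c])))
            for x in points
        ]
        if new_labels == labels:               # stop when memberships no longer change
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean vector of its members.
        for c in range(k):
            members = [x for x, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, labels

data = [(125, 61), (178, 90), (178, 92), (180, 83), (167, 85), (170, 89),
        (173, 98), (135, 40), (120, 35), (145, 70), (125, 50)]
print(kmeans(data, k=2))
```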
K-means, a Basic Clustering Method
• Illustration of the method (figure: points being assigned to the nearest centroid and centroids being recomputed over successive iterations)
K-means, a Basic Clustering Method

• Strengths & weaknesses


• Strengths
• Simple and easy to implement
• Quite efficient
• Weaknesses
• Need to specify the value of k, but we may not know what the
value should be beforehand
• Sensitive to the choice of initial k centroids: the result can be
non-deterministic
• Sensitive to noise
• Applicable only when mean is meaningful to the given data set

K-means, a Basic Clustering Method

• Overcoming the weaknesses:


• Using cluster quality to determine the value of k
• Improving how the initial k centroids are chosen
• Running the clustering a number of times and selecting the result with the highest quality
• Finding centres that are farther apart
• Dealing with noise
• Removing outliers before clustering

K-means, a Basic Clustering Method
• Value of k and cluster quality

(Scree plot: cluster error, e.g. SSE, on the vertical axis against the number of clusters on the horizontal axis; the point where the curve flattens, the "elbow", suggests a suitable value of k. A sketch of how such a curve can be computed follows.)
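A small sketch of computing the SSE-versus-k curve, reusing the kmeans function sketched after the K-means slide (the sse helper and the example data are illustrative):

```python
def sse(points, centroids, labels):
    """Total sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((p - q) ** 2 for p, q in zip(x, centroids[l]))
        for x, l in zip(points, labels)
    )

data = [(125, 61), (178, 90), (178, 92), (180, 83), (167, 85), (170, 89),
        (173, 98), (135, 40), (120, 35), (145, 70), (125, 50)]

# Print SSE for increasing k and look for the elbow in the resulting curve.
for k in range(1, 6):
    centroids, labels = kmeans(data, k)   # kmeans as sketched earlier
    print(k, round(sse(data, centroids, labels), 1))
```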
Lecture 9

Hierarchical Clustering
Create a hierarchy of clusters for cluster analysis.

• Two different approaches


• Divisive approach
• Develops the hierarchy from
the root to the leaves by
splitting bigger clusters into
smaller ones (similar to
bisecting K-means)
• Agglomerative approach
• Develops the hierarchy from
the leaf nodes to the root by
merging smaller clusters into
bigger ones.

Hierarchical Clustering

• Agglomerative Approach
• Take all n data objects as individual clusters and build an n × n dissimilarity matrix that stores the distance between every pair of data objects.
• While the number of clusters > 1 do:
• Find a pair of data objects/clusters with the minimum distance
• Merge the two data objects/clusters into a bigger cluster
• Replace the entries in the matrix for the original clusters or
objects by the cluster tag of the newly formed cluster
• Re-calculate relevant distances and update the matrix

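In practice this procedure is rarely hand-coded; a sketch using SciPy's hierarchical-clustering routines (not part of the lecture) looks like this, using the example (height, weight) data from the evaluation slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([(125, 61), (178, 90), (178, 92), (180, 83), (167, 85), (170, 89),
              (173, 98), (135, 40), (120, 35), (145, 70), (125, 50)], dtype=float)

dists = pdist(X, metric='euclidean')             # pairwise dissimilarities
Z = linkage(dists, method='single')              # agglomerative merging, single link
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the hierarchy into 2 clusters
print(labels)
```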
Hierarchical Clustering

• Produces a set of nested clusters organized as a hierarchical tree


• Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits

(Figure: six data points grouped into nested clusters, and the corresponding dendrogram whose vertical axis records the dissimilarity at which each merge occurs.)
Strengths of Hierarchical Clustering

• Do not have to assume any particular number of clusters


• Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
• The clusters may correspond to meaningful taxonomies
  • Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
The Agglomeration Method
• Illustration of the method (figure: the two closest clusters are merged at each step until one cluster remains)
The Agglomeration Method

• Agglomeration approaches for


merging clusters
• Single link: the distance
between two closest points
• Complete link: the distance
between two farthest points
• Group average: the average of
all pair-wise distances
• Centroids: the distance between
the centroids

Single Link Agglomeration Method

• Similarity of two clusters is based on the two most similar (closest)


points in the different clusters
• Determined by one pair of points, i.e., by one link in the proximity graph.

Single Link Agglomeration Method (cont)

Once s2 and s3 have been merged, the distance from any other object to the new cluster {s2, s3} is taken as the minimum of its distances to s2 and to s3.
The Agglomeration Method

• Strengths and weaknesses


• Strengths
  • Deterministic results
  • Multiple possible versions of clustering (one for each level of the hierarchy)
  • No need to specify the value of k beforehand
  • Can create clusters of arbitrary shapes (single link)
• Weaknesses
  • Does not scale up to large data sets
  • Cannot undo a merge once it has been made, unlike K-means, which can reassign memberships
Cluster Evaluation
• Cluster quality
• Principle:
  • Low level of variation among members within a cluster (intra-cluster)
  • High level of dissimilarity between clusters (inter-cluster)
• The measures
  • Cohesion: sum of squared errors (SSE), within-cluster variation (wc), sum of within-cluster variations (WC)
  • Separation: sum of squared distances between cluster centroids (BC)
  • Combining cohesion and separation, the ratio BC/WC is a good indicator of overall quality (Q)
Cohesion
• Within-cluster variation: the average squared distance between the members of a cluster and its centroid, where C_k is cluster k and r_k is its centroid:

    SSE(C_k) = Σ_{x ∈ C_k} d(x, r_k)²      (sum of squared errors)

    wc(C_k) = SSE(C_k) / |C_k|             (within-cluster variation)

    WC = Σ_{k=1..K} wc(C_k)                (total within-cluster variation)

    SubjectID  Body Height  Body Weight  Cluster Tag
    s1         125          61           2
    s2         178          90           1
    s3         178          92           1
    s4         180          83           1
    s5         167          85           1
    s6         170          89           1
    s7         173          98           1
    s8         135          40           2
    s9         120          35           2
    s10        145          70           2
    s11        125          50           2

    Centroid of cluster 1: (174.33, 89.5)    Centroid of cluster 2: (130, 51.2)
    wc(C1) ≈ 45.806,  wc(C2) ≈ 247.760,  so WC = 45.806 + 247.760 ≈ 293.57

→ C1 is a better-quality cluster than C2. (A short calculation sketch follows.)
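A short Python sketch (assumed helper names, not lecture code) that reproduces the wc and WC values for the two clusters above:

```python
def centroid(points):
    """Mean vector of a list of numeric tuples."""
    return tuple(sum(v) / len(points) for v in zip(*points))

def cluster_sse(points, r):
    """Sum of squared Euclidean distances from the points to centroid r."""
    return sum(sum((p - q) ** 2 for p, q in zip(x, r)) for x in points)

cluster1 = [(178, 90), (178, 92), (180, 83), (167, 85), (170, 89), (173, 98)]
cluster2 = [(125, 61), (135, 40), (120, 35), (145, 70), (125, 50)]

wc_values = [cluster_sse(c, centroid(c)) / len(c) for c in (cluster1, cluster2)]
print([round(w, 3) for w in wc_values])   # ≈ [45.806, 247.76]
print(round(sum(wc_values), 2))           # WC ≈ 293.57
```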


Separation Measure

• If the cluster centroids are very dissimilar, the quality of the clusters is higher.
• This can be measured using the distances between the cluster centroids:

    BC = Σ_{1 ≤ j < k ≤ K} d(r_j, r_k)²

  where r_j and r_k are the centroids of clusters C_j and C_k, and d is the Euclidean distance.

(Same example data as on the previous slide: centroid of cluster 1 = (174.33, 89.5), centroid of cluster 2 = (130, 51.2).)
Overall cluster quality

• Combining cohesion and separation, the ratio BC/WC is a good indicator of overall quality (Q):

    Q = BC / WC

• The larger the value of this ratio, the better the cluster quality.
• A value close to 1 indicates poor-quality clusters: the distance between clusters is close to the distance within clusters.

(Same example data and centroids as on the previous slides; a calculation sketch follows.)
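The slides do not state numeric values for BC and Q; under the example centroids above and WC ≈ 293.57 from the cohesion slide, a quick sketch gives:

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

r1 = (174.33, 89.5)   # centroid of cluster 1
r2 = (130.0, 51.2)    # centroid of cluster 2

BC = sq_euclidean(r1, r2)         # only one centroid pair when K = 2
WC = 293.57                       # total within-cluster variation from the cohesion slide
Q = BC / WC
print(round(BC, 1), round(Q, 2))  # ≈ 3432.0  11.69
```

A ratio this much larger than 1 indicates clusters that are well separated relative to their internal variation.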
Cluster Evaluation & Interpretation

• Using cluster quality for clustering


• With K-means (see the sketch after this list):
  • Add an outer loop over different values of K (from low to high)
  • At each iteration, conduct K-means clustering using the current K
  • Measure the overall cluster quality and decide whether the resulting quality is acceptable
  • If not, increase the value of K by 1 and repeat the process
• With agglomeration:
  • Traverse the hierarchy level by level from the root
  • At each level, evaluate the overall quality of the clusters
  • If the quality is acceptable, take the clusters at that level as the final result; if not, move to the next level and repeat the process.
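A sketch of the K-means outer loop (the choose_k name and the quality threshold are illustrative assumptions; kmeans and sq_euclidean refer to the earlier sketches):

```python
def choose_k(points, k_max=10, q_threshold=5.0):
    """Increase K until the BC/WC quality ratio reaches an illustrative threshold."""
    for k in range(2, k_max + 1):
        centroids, labels = kmeans(points, k)
        # Cohesion: WC = sum over clusters of SSE(Ck) / |Ck|
        wc = 0.0
        for c in range(k):
            members = [x for x, l in zip(points, labels) if l == c]
            if members:
                wc += sum(sq_euclidean(x, centroids[c]) for x in members) / len(members)
        # Separation: BC = sum of squared distances between all centroid pairs
        bc = sum(sq_euclidean(centroids[i], centroids[j])
                 for i in range(k) for j in range(i + 1, k))
        if bc / wc >= q_threshold:       # quality acceptable: stop
            return k, labels
    return k_max, labels
```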
Summary

• A clustering solution must provide a sensible proximity function, an effective algorithm, and a cluster evaluation function
• Proximity is normally measured by a distance function that combines
measures of value differences upon attributes
• The K-Means method continues to refine prototype partitions until
membership changes no longer occur
• The agglomeration method constructs all possible groupings of individual
data objects into a hierarchy of clusters
• Good clustering results mean high similarity among members of a cluster
and low similarity between members of different clusters

References

• Read Chapter 4 of Data Mining Techniques and Applications

Useful further references


• Tan, P-N., Steinbach, M. and Kumar, V. (2006), Introduction to Data Mining,
Addison-Wesley, Chapters 2 (section 2.4) and 8.

