CLUSTERING
Bor-shen Lin
bslin@cs.ntust.edu.tw
http://www.cs.ntust.edu.tw/~bslin
CLUSTER ANALYSIS
A form of unsupervised learning
Automatic classification of samples into a number
of groups using a measure of association
Data in each group are similar
Input
A set of samples: X = { xi }
A measure of similarity: s
Output
A number of groups
A partition L = { G1, G2, …, GN }
G1 ∪ G2 ∪ … ∪ GN = X and Gi ∩ Gj = ∅ for i ≠ j
[Diagram: (X, s) → Cluster Analysis → L]
EXAMPLE
[Figure: the same set of sample points partitioned into K = 3 clusters (left) and K = 4 clusters (right)]
CLUSTERING FOR ANYTHING
Clustering for customers
Clustering for documents
Clustering for words
Clustering for images
Clustering for melodies
Clustering for objects
Clustering for design patterns
Clustering for styles
CLUSTERING OF DESIGN PATTERNS
BASIC CONCEPT OF CLUSTER ANALYSIS
Sample
A point in a multi-dimensional space
Objective of Cluster Analysis
Construct decision boundaries (classification surfaces) based
on unlabeled training data set
(Unsupervised learning)
Appropriate for exploration of interrelationships among
samples to make a preliminary assessment of the sample
structure
DIFFICULTIES OF CLUSTERING
Shape and size of data
Number of clusters
Determined according to desired resolution
There might be different results for the same data
(different # of clusters)
Humans can visualize data in at most 3 dimensions, so
exploring the structure of high-dimensional data has to
rely on clustering algorithms
FEATURES
Quantitative features
Continuous values: R, R+
Discrete values: { 0, 1 }, Z
Interval values: { x ≦ 20, 20 < x ≦ 40, x > 40 }
Qualitative features
Nominal or unordered: color is “blue” or “red”
Ordinal: military rank: “general” and “colonel”
SIMILARITY/DISTANCE MEASURE
Similarity
Symmetric: s(x, x’) = s(x’, x) for all x, x’ in X
Normalized: 0 ≦ s(x, x’) ≦ 1
Distance (dissimilarity) measure
d(x, x’) ≧ 0 (d(x, x) = 0)
d(x, x’) = d(x’, x)
Called a metric distance measure if the triangle
inequality holds:
d(x, x’’) ≦ d(x, x’) + d(x’, x’’)
METRICS FOR CONTINUOUS VALUES
Euclidean distance
$d_2(x_i, x_j) = \left[ \sum_k (x_{ik} - x_{jk})^2 \right]^{1/2}$
Block distance (L1 metric)
$d_1(x_i, x_j) = \sum_k |x_{ik} - x_{jk}|$
Minkowski metric
$d_p(x_i, x_j) = \left[ \sum_k |x_{ik} - x_{jk}|^p \right]^{1/p}$
Cosine similarity
$s_{\cos}(x_i, x_j) = \sum_k x_{ik} x_{jk} \,/\, \left[ \sum_k x_{ik}^2 \cdot \sum_k x_{jk}^2 \right]^{1/2}$
$s_{\cos}(x_i, x_j) = 1$ if $x_i = c \cdot x_j$, $c > 0$
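A minimal Python (NumPy) sketch of these metrics; the function names are illustrative, not from any particular library.

import numpy as np

def euclidean(xi, xj):
    return np.sqrt(np.sum((xi - xj) ** 2))              # d2: L2 metric

def block(xi, xj):
    return np.sum(np.abs(xi - xj))                      # d1: L1 (city-block) metric

def minkowski(xi, xj, p):
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)    # dp: general Lp metric

def cosine_similarity(xi, xj):
    # s_cos = <xi, xj> / sqrt(||xi||^2 * ||xj||^2); equals 1 when xi = c*xj, c > 0
    return np.dot(xi, xj) / np.sqrt(np.dot(xi, xi) * np.dot(xj, xj))

xi, xj = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(euclidean(xi, xj), block(xi, xj), minkowski(xi, xj, 3), cosine_similarity(xi, xj))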
METRICS FOR DISCRETE VALUES
For binary vectors xi and xj, count the attribute positions with each value pair:
a = #(1, 1), b = #(1, 0), c = #(0, 1), d = #(0, 0)
Simple Matching Coefficient (SMC)
Ssmc(xi, xj) = (a + d) / (a + b + c +d)
Jaccard Coefficient
Sjc(xi, xj) = a / (a + b + c)
Rao’s Coefficient
Src(xi, xj) = a / (a + b + c +d)
EXAMPLE
x1 = (0,0,1,1,0,1,0,1)
x2 = (0,1,1,0,0,1,0,0)
→ a = 2 for (1, 1)
b = 2 for (1, 0)
c = 1 for (0, 1)
d = 3 for (0, 0)
→ Ssmc(x1, x2) = 5 / 8
Sjc(x1, x2) = 2 / 5
Src(x1, x2) = 2 / 8
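The counts and coefficients of this example can be checked with a short Python sketch (the helper name is illustrative):

import numpy as np

def binary_counts(xi, xj):
    a = np.sum((xi == 1) & (xj == 1))
    b = np.sum((xi == 1) & (xj == 0))
    c = np.sum((xi == 0) & (xj == 1))
    d = np.sum((xi == 0) & (xj == 0))
    return a, b, c, d

x1 = np.array([0, 0, 1, 1, 0, 1, 0, 1])
x2 = np.array([0, 1, 1, 0, 0, 1, 0, 0])
a, b, c, d = binary_counts(x1, x2)
print((a + d) / (a + b + c + d))   # SMC     = 5/8
print(a / (a + b + c))             # Jaccard = 2/5
print(a / (a + b + c + d))         # Rao     = 2/8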
ADVANCED DISTANCE MEASURES
Consider the effect of the context (relative distance)
Mutual neighbor distance (MND)
MND(xi, xj) = NN(xi, xj) + NN(xj, xi)
NN(xi, xj): the neighbor number of xi with respect to xj,
i.e., NN(xi, xj) = n if xi is the n-th closest point to xj
[Figure: two configurations of the points; the left shows only A, B, C, while the right adds extra points D and E between A and B]
Left: NN(A, B) = 1, NN(B, A) = 1; NN(B, C) = 1, NN(C, B) = 2 → MND(A, B) = 2, MND(B, C) = 3
Right: NN(A, B) = 1, NN(B, A) = 4; NN(B, C) = 1, NN(C, B) = 2 → MND(A, B) = 5, MND(B, C) = 3
The positions of A and B are the same in both configurations, but their mutual neighbor distances differ!
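A minimal Python sketch of MND, assuming Euclidean distance between points; the helper names and the sample coordinates are purely illustrative.

import numpy as np

def nn_rank(points, i, j):
    """Rank of point i among the neighbors of point j (1 = closest)."""
    d = np.linalg.norm(points - points[j], axis=1)
    order = np.argsort(d)
    order = order[order != j]                    # exclude j itself
    return int(np.where(order == i)[0][0]) + 1

def mnd(points, i, j):
    return nn_rank(points, i, j) + nn_rank(points, j, i)

pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.5, 0.5]])   # hypothetical A, B, C
print(mnd(pts, 0, 1), mnd(pts, 1, 2))                  # 2, 3 (as in the left figure)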
CLUSTERING ALGORITHMS
Hierarchical Clustering Algorithm
Producing the hierarchy (dendrogram)
Partitional Clustering algorithm
Producing the partition of the data
HIERARCHICAL CLUSTERING ALGORITHMS
Divisive algorithm
A top-down process
Example: CART for clustering
Splits on discrete properties
Starts from the entire set of samples
Divides it into a partition of subsets
Agglomerative algorithm (aggregation)
A bottom-up process
Rain drops (small → big)
Regards each object as a cluster initially
The clusters are merged into larger clusters
A dendrogram is constructed
DISTANCE BETWEEN CLUSTERS
$D_{\min}(C_i, C_j) = \min \| x_i - x_j \|$, for $x_i \in C_i$, $x_j \in C_j$
$D_{\mathrm{mean}}(C_i, C_j) = \| m_i - m_j \|$, for centroids $m_i$ and $m_j$
$D_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{x_i \in C_i} \sum_{x_j \in C_j} \| x_i - x_j \|$
$D_{\max}(C_i, C_j) = \max \| x_i - x_j \|$, for $x_i \in C_i$, $x_j \in C_j$
[Figure: two clusters Ci and Cj; Dmin is the distance between their closest samples, Dmax the distance between their farthest samples]
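A short NumPy sketch of the four inter-cluster distances, assuming Euclidean distance between samples (function names are illustrative):

import numpy as np

def pairwise(Ci, Cj):
    # matrix of distances ||xi - xj|| for all xi in Ci, xj in Cj
    return np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

def d_min(Ci, Cj):  return pairwise(Ci, Cj).min()
def d_max(Ci, Cj):  return pairwise(Ci, Cj).max()
def d_avg(Ci, Cj):  return pairwise(Ci, Cj).mean()
def d_mean(Ci, Cj): return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))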
AGGLOMERATIVE CLUSTERING
1. Place each sample in its own cluster. Construct a
list of inter-cluster distances for all pairs of
samples, and sort this list in ascending order.
2. Step through the sorted list of distances, forming
for each distinct threshold value dk a graph of the
samples where pairs of samples closer than dk are
connected into a new cluster by a graph edge. If
all the samples are members of a connected graph,
stop. Otherwise, repeat this step.
3. The output of the algorithm is a nested hierarchy
of graphs, which can be cut at the desired
dissimilarity level forming a partition (clusters)
identified by simple connected components in the
corresponding subgraph.
EXAMPLE
Samples
x1=(0,2), x2=(0,0), x3=(1.5,0), x4=(5,0), x5=(5,2)
d(x1, x2) = 2, d(x1, x3) = 2.5, d(x1, x4) = 5.39,
d(x1, x5) = 5, d(x2, x3) = 1.5, d(x2, x4) = 5,
d(x2, x5) = 5.29, d(x3, x4) = 3.5,
d(x3, x5) = 4.03, d(x4, x5) = 2
Dendrogram (using Dmin): x2 and x3 merge at distance 1.5; x1 joins them at 2.0 and x4, x5 merge at 2.0; the two remaining clusters merge at 3.5
[Figure: scatter plot of the five samples and the corresponding dendrogram with merge levels 1.5, 2.0, and 3.5]
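This merge sequence can be reproduced with SciPy's single-linkage hierarchical clustering; a sketch, assuming SciPy is available (the plotting call is left commented out):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0, 2], [0, 0], [1.5, 0], [5, 0], [5, 2]])   # x1 ... x5
Z = linkage(X, method='single', metric='euclidean')        # single linkage = Dmin
print(Z[:, 2])           # merge distances: [1.5, 2.0, 2.0, 3.5]
# dendrogram(Z)          # would draw the tree shown on the slide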
MERGING OF CLUSTERS
[Figure: scatter plot of x1–x5 showing how the clusters are merged step by step, and the resulting dendrogram with leaves ordered x2, x3, x1, x4, x5]
CHAMELEON ALGORITHM
Basic idea: Improve the clustering quality by using a more elaborate
criterion when merging two clusters
Ci and Cj will be merged if the interconnectivity and closeness of
the merged cluster is very similar to that of Ci and Cj before
merging
Min-cut on a graph (of cluster Ci)
Partitioning a graph into two parts of roughly equal size such that
the total weight of the edges being cut is minimized
Total weight of the edges: stands for total interconnectivity (c.f.
internal links within political parties)
Min-cut: minimize the loss of connectivity for the graph after being cut
Interconnectivity I(Ci): total weight of edges being cut (total loss)
Closeness C(Ci): average weight of edges being cut (average loss)
CHAMELEON ALGORITHM
1. Divide all data into highly dense sub-clusters.
Construct an initial graph G = (V, E) by KNN, with the weight
of each edge e(vi, vj) representing the closeness between the
two samples
Recursively partition G into many small, unconnected
subgraphs by doing min-cut
Stop when certain criteria are satisfied
2. Merge sub-clusters into larger clusters.
For Ci and Cj, compute RI(Ci, Cj) and RC(Ci, Cj)
Relative interconnectivity
$RI(C_i, C_j) \equiv I(C_i \cup C_j) \,/\, [0.5 \cdot (I(C_i) + I(C_j))]$
Relative closeness
$RC(C_i, C_j) \equiv C(C_i \cup C_j) \,/\, [0.5 \cdot (C(C_i) + C(C_j))]$
$s(C_i, C_j) \equiv RC(C_i, C_j) \cdot RI(C_i, C_j)^{\alpha}$, where $0 \le \alpha \le 1$
Merge the pair with the largest s, i.e., minimize the decrease in interconnectivity and closeness after merging
PARTITIONAL CLUSTERING
Why?
Construction of dendrogram is computationally complex
for large data set
Global criterion
Euclidean square-error measure for the k-th cluster (see the code sketch at the end of this slide):
$E^2 = \sum_k e_k$, where $e_k = \sum_{x_i \in C_k} \| x_i - m_k \|^2$ and $m_k = \frac{1}{n_k} \sum_{x_i \in C_k} x_i$
Euclidean distance is not appropriate if the dimensions have
different physical meanings → use a stochastic model instead
Local criterion
Minimal mutual neighbor distance (MND)
Utilizing the local structure or context in the data
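A minimal sketch of the square-error criterion, assuming the partition is given as a list of per-cluster sample arrays (names are illustrative):

import numpy as np

def square_error(clusters):
    total = 0.0
    for Ck in clusters:                      # Ck: (n_k, dim) array of samples
        mk = Ck.mean(axis=0)                 # cluster centroid m_k
        total += np.sum((Ck - mk) ** 2)      # e_k = sum of squared distances
    return total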
K-MEANS CLUSTERING ALGORITHM
A partitional clustering algorithm employing a
square-error criterion
Simplest & most commonly used
Easy to implement
Time and space complexity is relatively small
Gives surprisingly good result if the clusters are
compact, hyperspherical in shape, and well
separated in the feature space
Works well for data whose clusters are far apart
Equivalent to the Kohonen net in ANN
Similar methods exist (e.g., C-means, K-medoids)
REDISTRIBUTION OF SAMPLES
[Figure: samples and three cluster centroids C1, C2, C3 (▲); each sample X (●) is reassigned to the cluster whose centroid is closest]
K-MEANS ALGORITHM
1. Select an initial partition with k clusters
containing randomly chosen samples, and
compute the centroids of the clusters.
2. Generate a new partition by (re-distributing)
assigning each sample to its closest cluster center
$k^* = \arg\min_k d(x, m_k)$ for every x
3. Compute new centroids of the clusters
$m_k = \frac{1}{n_k} \sum_{x_i \in C_k} x_i$ for cluster k
4. Repeat steps 2 and 3 until an optimum value of
the criterion function is found.
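A compact NumPy sketch of this loop, assuming Euclidean distance and k randomly chosen samples as the initial centroids (the function name and parameters are illustrative):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # step 1
    for _ in range(n_iter):
        # step 2: assign each sample to its closest centroid
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centroids[None, :, :],
                                          axis=2), axis=1)
        # step 3: recompute centroids (keep the old one if a cluster became empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                   # step 4: converged
            break
        centroids = new_centroids
    return labels, centroids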
CHARACTERISTICS OF K-MEANS
Time complexity: O(nkl)
n: number of samples
k: number of clusters
l: number of iterations
Space complexity: O(k+n)
Order-independent
It generates the same partition of the data at the end of
the partitioning process irrespective of the order in
which the samples are presented to the algorithm
INCREMENTAL CLUSTERING (ORDER-DEPENDENT)
1. Initially, there is no cluster.
2. Take the next point and compute the distance between the
point and every existing cluster.
3. If no minimum distance exists (there is no cluster yet),
create a new cluster with the point and go to step 2.
Otherwise, go to step 4.
4. If the minimum distance is smaller than a threshold, add
the point to the corresponding cluster. Otherwise, create a
new cluster with the point.
5. Go to step 2.
Order dependent
The order that data are presented influences the result
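A minimal sketch of such threshold-based incremental clustering, assuming Euclidean distance to running centroids; the function name and threshold parameter are illustrative.

import numpy as np

def incremental_cluster(X, threshold):
    centroids, members = [], []              # running centroids and member index lists
    for idx, x in enumerate(X):
        if not centroids:                    # no cluster yet: start the first one
            centroids.append(x.astype(float)); members.append([idx]); continue
        d = [np.linalg.norm(x - c) for c in centroids]
        k = int(np.argmin(d))
        if d[k] < threshold:                 # close enough: join cluster k
            members[k].append(idx)
            centroids[k] = X[members[k]].mean(axis=0)   # update its centroid
        else:                                # otherwise start a new cluster
            centroids.append(x.astype(float)); members.append([idx])
    return centroids, members                # result depends on the data order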
MAJOR PROBLEMS OF K-MEANS
Sensitive to the selection of the initial partition
May converge to a local minimum
Lack of available guidelines for
Choosing the initial partition
Adjusting the number of clusters
Selecting the stopping criterion
Very sensitive to noise and outlier data points
Every point has the same contribution to the centroid
K-medoids method: use an actual sample (the medoid) instead
of the mean as the cluster representative
→ less sensitive to noise and outliers
Based on global distance
Might not work well for the data with locally
connected patterns
K-MEANS FOR CATEGORICAL DATA
Centroids for clusters can NOT be computed
d(x, mk) cannot be calculated without mk,
but d(xi, xj) or s(xi, xj) can still be found
Use K-nearest neighbor (KNN) for reclustering data
x can be clustered according to the clusters of its KNNs
through majority voting
Do not compute the distance between x and centroid
Analogy: my choice follows the choices of my K best friends
Use the distribution of cluster for reclustering data
P(x|Ck) could be trained for each cluster
Example: customer x = [gender=male, age=20,
occupation=engineer, income=900k, …]
REDISTRIBUTION OF CATEGORICAL DATA (KNN)
x1 = (A,B,A,B,C,B)
x2 = (A,A,A,B,A,B)
x3 = (B,B,A,B,A,B)
x4 = (B,C,A,B,B,A)
x5 = (B,A,B,A,C,A)
x6 = (A,C,B,A,B,B)
C1 = { x1, x2, x3 }, C2 = { x4, x5, x6 }
For y = (A, C, A, B, C, A)
ssmc(y, x1)=0.66, ssmc(y, x2)=0.50, ssmc(y, x3)=0.33
ssmc (y, x4)=0.66, ssmc(y, x5)=0.33, ssmc(y, x6)=0.33
KNN for K=3 are x1, x2, x4 → voting C1 for y
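A short Python sketch that reproduces this example: the matching coefficient over categorical attributes followed by majority voting among the K = 3 nearest neighbors (variable names are illustrative).

samples = {
    'x1': 'ABABCB', 'x2': 'AAABAB', 'x3': 'BBABAB',    # cluster C1
    'x4': 'BCABBA', 'x5': 'BABACA', 'x6': 'ACBABB',    # cluster C2
}
cluster_of = {'x1': 'C1', 'x2': 'C1', 'x3': 'C1', 'x4': 'C2', 'x5': 'C2', 'x6': 'C2'}

def smc(u, v):                       # fraction of attribute positions that match
    return sum(a == b for a, b in zip(u, v)) / len(u)

y = 'ACABCA'
ranked = sorted(samples, key=lambda name: smc(y, samples[name]), reverse=True)
knn = ranked[:3]                                         # x1, x4, x2
votes = [cluster_of[name] for name in knn]
print(max(set(votes), key=votes.count))                  # 'C1'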
K-MEANS VS. GMM TRAINING
K-Means
Distance: Euclidean distance (for homogenous data)
Objective function: square error
Clusters have mutually exclusive data
Each point has equal contribution (weight) to mean
$m_k = \frac{1}{n_k} \sum_{x_i \in C_k} x_i$
GMM
Similarity: Gaussian probability (the distance is
normalized by the covariance: $(x - \mu_k)^t \Sigma_k^{-1} (x - \mu_k)$)
Objective function: expectation of log probabilities
All data points contribute to each cluster
$\mu_k = \frac{\sum_i l_i(k)\, x_i}{\sum_i l_i(k)}$, $\quad \Sigma_k = \frac{\sum_i l_i(k)\, (x_i - \mu_k)(x_i - \mu_k)^t}{\sum_i l_i(k)}$
where $l_i(k)$ is the responsibility (posterior probability) of cluster k for sample $x_i$
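A sketch of these weighted GMM mean/covariance updates (the M-step), assuming the responsibilities l[i, k] have already been computed in an E-step; function and variable names are illustrative.

import numpy as np

def gmm_m_step(X, l):
    n, dim = X.shape
    k = l.shape[1]
    Nk = l.sum(axis=0)                                       # effective cluster sizes
    mu = (l.T @ X) / Nk[:, None]                             # responsibility-weighted means
    sigma = np.zeros((k, dim, dim))
    for j in range(k):
        diff = X - mu[j]                                     # (n, dim)
        sigma[j] = (l[:, j, None] * diff).T @ diff / Nk[j]   # weighted covariances
    return mu, sigma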
K-MEANS FOR VECTOR QUANTIZATION
Select a set of prototype vectors to represent a
large vector space
Vectors in the vector space are encoded into
discrete symbols
Codebook: a set of codewords (vectors)
Example: data compression
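A tiny sketch of the encoding step in vector quantization, assuming a codebook has already been trained (e.g., by k-means); the function name is illustrative.

import numpy as np

def vq_encode(X, codebook):
    # index of the closest codeword for every input vector
    return np.argmin(np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2),
                     axis=1)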
EVALUATION FOR CLUSTER ANALYSIS
Assessment of the data domain
Estimating properties of the data in advance (how many
clusters, distances between clusters, …)
Cluster validity (verification on the results)
External assessment of validity
Compare the discovered structure to an a priori structure
Internal examination of validity
Determine if the discovered structure is intrinsically
appropriate or meaningful
Relative test
Adjust the algorithms and parameters to produce different
clusterings, then compare the results to judge which is
more applicable
SUMMARY
There is NO clustering technique that is universally
applicable in uncovering the variety of structures present in
multidimensional data sets.
User’s understanding of the problem and the corresponding
data types will be the best criteria to select the appropriate
method.
The data should be subjected to tests for clustering tendency
before applying a clustering algorithm, followed by a
validation of the clusters generated by the algorithm
There is no best clustering algorithm. Try several ones.
Cluster analysis can also be used as an intermediate-level
processing technique
e.g., class-based (word-class) n-grams, where words in the same class share probabilities
REFERENCE
Mehmed Kantardzic, Data Mining: Concepts, Models, Methods,
and Algorithms, Wiley-Interscience.