
CLUSTERING

Bor-shen Lin
bslin@cs.ntust.edu.tw
http://www.cs.ntust.edu.tw/~bslin
CLUSTER ANALYSIS
 A form of unsupervised learning
 Automatic classification of samples into a number
of groups using a measure of association
 Data in each group are similar
 Input
 A set of samples: X = { xi }
 A measure of similarity: s
 Output
 A number of groups N
 A partition L = { G1, G2, …, GN } such that
G1∪G2∪…∪GN = X and Gi∩Gj = ∅ for i ≠ j
EXAMPLE
(Figure: example data points partitioned into K = 3 clusters and into K = 4 clusters.)
CLUSTERING FOR ANYTHING
 Clustering for customers
 Clustering for documents
 Clustering for words
 Clustering for images
 Clustering for melodies
 Clustering for objects
 Clustering for design patterns
 Clustering for styles
 …
CLUSTERING OF DESIGN PATTERNS
BASIC CONCEPT OF CLUSTER ANALYSIS
 Sample
 A point in a multi-dimensional space
 Objective of Cluster Analysis
 Construct decision boundaries (classification surfaces) based on an unlabeled training data set (unsupervised learning)
 Appropriate for exploring the interrelationships among samples in order to make a preliminary assessment of the sample structure
DIFFICULTIES OF CLUSTERING
 Shape and size of the data
 Number of clusters
 Determined according to the desired resolution
 The same data may yield different results (different numbers of clusters)
 Humans can visualize data in at most three dimensions; exploring the structure of higher-dimensional data has to rely on clustering algorithms
FEATURES
 Quantitative features
 Continuous values: R, R+
 Discrete values: { 0, 1 }, Z
 Interval values: { x ≦ 20, 20 < x ≦ 40, x > 40 }

 Qualitative features
 Nominal (unordered): e.g. color is “blue” or “red”
 Ordinal: e.g. military ranks “general” and “colonel”
SIMILARITY/DISTANCE MEASURE
 Similarity measure
 Symmetric: s(x, x’) = s(x’, x) for all x, x’ in X
 Normalized: 0 ≦ s(x, x’) ≦ 1
 Distance (dissimilarity) measure
 Non-negative: d(x, x’) ≧ 0, with d(x, x) = 0
 Symmetric: d(x, x’) = d(x’, x)
 Called a metric distance measure if the triangle inequality holds:
d(x, x’’) ≦ d(x, x’) + d(x’, x’’)
METRICS FOR CONTINUOUS VALUES
 Euclidean distance (L2 metric)
 d2(xi, xj) = [Σk (xik – xjk)²]^(1/2)
 Block distance (L1 metric)
 d1(xi, xj) = Σk |xik – xjk|
 Minkowski metric
 dp(xi, xj) = [Σk |xik – xjk|^p]^(1/p)
 Cosine similarity
 scos(xi, xj) = Σk (xik · xjk) / [Σk xik² · Σk xjk²]^(1/2)
 scos(xi, xj) = 1 if xi = c · xj for some c > 0
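As a concrete illustration, here is a minimal NumPy sketch of these metrics (the function names are my own, not from the slides):

```python
import numpy as np

def minkowski(xi, xj, p=2):
    """Minkowski metric d_p; p = 2 gives the Euclidean distance, p = 1 the block (L1) distance."""
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

def cosine_similarity(xi, xj):
    """Cosine similarity; equals 1 exactly when xi = c * xj for some c > 0."""
    return np.dot(xi, xj) / np.sqrt(np.sum(xi ** 2) * np.sum(xj ** 2))

xi = np.array([0.0, 2.0])
xj = np.array([1.5, 0.0])
print(minkowski(xi, xj, p=2))      # Euclidean distance: 2.5
print(minkowski(xi, xj, p=1))      # block distance: 3.5
print(cosine_similarity(xi, xj))   # 0.0 (orthogonal vectors)
```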
METRICS FOR DISCRETE VALUES
 For binary vectors xi and xj, count the number of dimensions with each value pair:
(1, 1): a, (1, 0): b
(0, 1): c, (0, 0): d
 Simple Matching Coefficient (SMC)
 Ssmc(xi, xj) = (a + d) / (a + b + c +d)
 Jaccard Coefficient
 Sjc(xi, xj) = a / (a + b + c)
 Rao’s Coefficient
 Src(xi, xj) = a / (a + b + c +d)
EXAMPLE
 x1 = (0,0,1,1,0,1,0,1)
x2 = (0,1,1,0,0,1,0,0)
→ a = 2 for (1, 1)
b = 2 for (1, 0)
c = 1 for (0, 1)
d = 3 for (0, 0)
→ Ssmc(x1, x2) = 5 / 8
Sjc(x1, x2) = 2 / 5
Src(x1, x2) = 2 / 8
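The three coefficients on this slide are easy to check in code; a small sketch (helper names are mine):

```python
import numpy as np

def binary_coefficients(xi, xj):
    """Count the (1,1), (1,0), (0,1) and (0,0) dimension pairs of two binary vectors."""
    a = int(np.sum((xi == 1) & (xj == 1)))
    b = int(np.sum((xi == 1) & (xj == 0)))
    c = int(np.sum((xi == 0) & (xj == 1)))
    d = int(np.sum((xi == 0) & (xj == 0)))
    return a, b, c, d

x1 = np.array([0, 0, 1, 1, 0, 1, 0, 1])
x2 = np.array([0, 1, 1, 0, 0, 1, 0, 0])
a, b, c, d = binary_coefficients(x1, x2)
print((a + d) / (a + b + c + d))   # SMC     = 5/8
print(a / (a + b + c))             # Jaccard = 2/5
print(a / (a + b + c + d))         # Rao     = 2/8
```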
ADVANCED DISTANCE MEASURES
 Consider the effect of the context (relative distance)
 Mutual neighbor distance (MND)
MND(xi, xj) = NN(xi, xj) + NN(xj, xi)
NN(xi, xj): the neighbor rank of xi with respect to xj,
i.e. NN(xi, xj) = n if xi is the n-th closest point to xj
(Figure: two point configurations containing A, B and C; the second configuration also contains points D and E close to A.)
 Left configuration: NN(A, B) = 1, NN(B, A) = 1; NN(B, C) = 1, NN(C, B) = 2
→ MND(A, B) = 2, MND(B, C) = 3
 Right configuration: NN(A, B) = 1, NN(B, A) = 4; NN(B, C) = 1, NN(C, B) = 2
→ MND(A, B) = 5, MND(B, C) = 3
 The positions of A and B are identical in both configurations, yet their MND differs because of the surrounding context.
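A minimal NumPy sketch of the neighbor rank and MND (my own helper names; points is assumed to be an (n, 2) array of coordinates):

```python
import numpy as np

def neighbor_rank(points, j, i):
    """NN(x_i, x_j): the rank of x_i among the neighbors of x_j (1 = closest point to x_j)."""
    dists = np.linalg.norm(points - points[j], axis=1)
    order = np.argsort(dists)                 # order[0] is x_j itself (distance 0)
    return int(np.where(order == i)[0][0])    # position of x_i in that ordering

def mnd(points, i, j):
    """Mutual neighbor distance MND(x_i, x_j) = NN(x_i, x_j) + NN(x_j, x_i)."""
    return neighbor_rank(points, j, i) + neighbor_rank(points, i, j)

# Hypothetical coordinates for A, B, C on a line:
pts = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
print(mnd(pts, 0, 1))   # MND(A, B) = 1 + 1 = 2
```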
CLUSTERING ALGORITHMS
 Hierarchical Clustering Algorithm
 Producing the hierarchy (dendrogram)
 Partitional Clustering algorithm
 Producing the partition of the data
HIERARCHICAL CLUSTERING ALGORITHMS
 Divisive algorithm
 A top-down process
 Starts from the entire set of samples
 Divides it into a partition of subsets
 Example: CART for clustering, where the data are divided by discrete properties
 Agglomerative algorithm (aggregation)
 A bottom-up process (like rain drops merging: small → big)
 Initially regards each object as a cluster
 The clusters are then merged into larger and larger clusters
 A dendrogram is constructed
DISTANCE BETWEEN CLUSTERS
 Dmin(Ci, Cj) = min |xi – xj|, for xi in Ci, xj in Cj
 Dmean(Ci, Cj) = |mi – mj|, for centroids mi and mj
 Davg(Ci, Cj) = (1/(ni·nj)) Σi Σj |xi – xj|, for xi in Ci, xj in Cj
 Dmax(Ci, Cj) = max |xi – xj|, for xi in Ci, xj in Cj
(Figure: two clusters Ci and Cj; Dmin connects their closest pair of points, Dmax their farthest pair.)
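A short NumPy sketch of the four inter-cluster distances (my own function name; clusters are given as arrays of points):

```python
import numpy as np

def cluster_distances(Ci, Cj):
    """Return (Dmin, Dmean, Davg, Dmax) for two clusters of points."""
    pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)  # all |xi - xj|
    d_min = pairwise.min()
    d_max = pairwise.max()
    d_avg = pairwise.mean()                                             # (1/(ni*nj)) sum over all pairs
    d_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))          # distance between centroids
    return d_min, d_mean, d_avg, d_max

Ci = np.array([[0.0, 2.0], [0.0, 0.0], [1.5, 0.0]])   # x1, x2, x3 of the later example
Cj = np.array([[5.0, 0.0], [5.0, 2.0]])               # x4, x5
print(cluster_distances(Ci, Cj))                       # Dmin = 3.5 (between x3 and x4)
```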
AGGLOMERATIVE CLUSTERING
1. Place each sample in its own cluster. Construct a list of inter-cluster distances for all pairs of samples, and sort this list in ascending order.
2. Step through the sorted list of distances. For each distinct threshold value dk, form a graph of the samples in which pairs of samples closer than dk are connected by an edge (merging them into a new cluster). If all the samples become members of a single connected graph, stop; otherwise, repeat this step.
3. The output of the algorithm is a nested hierarchy of graphs, which can be cut at the desired dissimilarity level; the resulting partition (clusters) is identified by the connected components of the corresponding subgraph.
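Step 2 can be sketched directly with a distance threshold and connected components (function name is mine; uses SciPy's graph routines):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import connected_components

def threshold_clusters(X, dk):
    """Connect every pair of samples closer than dk and return the connected components."""
    D = squareform(pdist(X))                        # full pairwise distance matrix
    adjacency = ((D < dk) & (D > 0)).astype(int)    # edge between samples closer than dk
    n_clusters, labels = connected_components(adjacency, directed=False)
    return n_clusters, labels

X = np.array([[0, 2], [0, 0], [1.5, 0], [5, 0], [5, 2]])   # samples of the next slide
print(threshold_clusters(X, dk=2.1))   # (2, array([0, 0, 0, 1, 1])): {x1,x2,x3} and {x4,x5}
```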
EXAMPLE
 Samples
 x1 = (0, 2), x2 = (0, 0), x3 = (1.5, 0), x4 = (5, 0), x5 = (5, 2)
 d(x1, x2) = 2, d(x1, x3) = 2.5, d(x1, x4) = 5.39,
d(x1, x5) = 5, d(x2, x3) = 1.5, d(x2, x4) = 5,
d(x2, x5) = 5.39, d(x3, x4) = 3.5,
d(x3, x5) = 4.03, d(x4, x5) = 2
(Figure: scatter plot of the five samples and the dendrogram obtained with Dmin; merges occur at distances 1.5, 2.0, and 3.5.)
MERGING OF CLUSTERS
(Figure: the five samples x1–x5 plotted in the plane with the clusters circled as they merge, alongside the corresponding dendrogram with leaf order x2, x3, x1, x4, x5.)
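The same example can be reproduced with SciPy's hierarchical clustering (a sketch; single linkage corresponds to Dmin):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[0, 2], [0, 0], [1.5, 0], [5, 0], [5, 2]])   # x1 ... x5
Z = linkage(pdist(X), method='single')                     # single linkage = Dmin
print(Z[:, 2])                                             # merge distances: 1.5, 2.0, 2.0, 3.5
print(fcluster(Z, t=3, criterion='distance'))              # cutting below 3.5 gives {x1,x2,x3} and {x4,x5}
```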
CHAMELEON ALGORITHM
 Basic idea: Improve the clustering quality by using a more elaborate
criterion when merging two clusters
 Ci and Cj will be merged if the interconnectivity and closeness of
the merged cluster is very similar to that of Ci and Cj before
merging
 Min-cut on a graph (of cluster Ci)
 Partition the graph into two parts of roughly equal size such that the total weight of the edges being cut is minimized
 Total weight of the edges stands for total interconnectivity (c.f. inner links within political parties)
 Min-cut: minimize the loss of connectivity for the graph after it is cut
 Interconnectivity I(Ci): total weight of the edges being cut (total loss)
 Closeness C(Ci): average weight of the edges being cut (average loss)
CHAMELEON ALGORITHM
1. Divide all data into highly dense sub-clusters.
 Construct initial graph G=(V,E) by KNN with
weighted edge e(vi,vj): closeness between two
samples
 Recursively partition G into many small,
unconnected subgraphs by doing min-cut
 Stop when certain criteria are satisfied
2. Merge sub-clusters into larger clusters.
 For Ci and Cj, compute RI(Ci, Cj) and RC(Ci, Cj)
 Relative interconnectivity
RI(Ci, Cj) ≡ I(Ci∪Cj) / [0.5·(I(Ci)+I(Cj))]
 Relative closeness
RC(Ci, Cj) ≡ C(Ci∪Cj) / [0.5·(C(Ci)+C(Cj))]
 Merging criterion: s(Ci, Cj) ≡ RC(Ci, Cj)·RI(Ci, Cj)^α, 0 ≦ α ≦ 1
→ merge the pair of clusters whose interconnectivity and closeness decrease least after merging (the pair with the largest s)
PARTITIONAL CLUSTERING
 Why?
 Construction of dendrogram is computationally complex
for large data set
 Global criterion
 Euclidean square-error measure, summed over clusters:
E² = Σk dk, where dk = Σi (xik – mk)² and mk = (1/nk) Σi xik is the centroid of the k-th cluster
 Euclidean distance cannot be used if the dimensions have different physical meanings → use a stochastic model instead
 Local criterion
 Minimal mutual neighbor distance (MND)
 Utilizes the local structure or context in the data
K-MEANS CLUSTERING ALGORITHM
 A partitional clustering algorithm employing a square-error criterion
 The simplest and most commonly used method
 Easy to implement
 Time and space complexity are relatively small
 Gives surprisingly good results if the clusters are compact, hyperspherical in shape, and well separated in the feature space
 Works well for data whose clusters are far apart
 Equivalent to the Kohonen net in ANN
 Similar methods exist (e.g. C-means, K-medoids)
REDISTRIBUTION OF SAMPLES
(Figure: three cluster centroids C1, C2, C3 (▲) surrounded by their samples (x); a sample X (●) is redistributed to the centroid closest to it.)
K-MEANS ALGORITHM
1. Select an initial partition with k clusters containing randomly chosen samples, and compute the centroids of the clusters.
2. Generate a new partition by reassigning (redistributing) each sample to its closest cluster center:
k* = argmin_k d(x, mk) for every x
3. Compute the new centroids of the clusters:
mk = (1/nk) Σ xi over the samples xi in cluster k
4. Repeat steps 2 and 3 until an optimum value of the criterion function is found.
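A minimal NumPy sketch of these four steps (my own function name; the initial centroids are randomly chosen samples):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: alternate nearest-centroid assignment and centroid recomputation."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]          # step 1: initial centroids
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                                 # step 2: closest center
        new_centroids = np.array([X[labels == j].mean(axis=0)         # step 3: new centroids
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):                     # step 4: converged
            break
        centroids = new_centroids
    return labels, centroids
```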
CHARACTERISTICS OF K-MEANS
 Time complexity: O(nkl)
n: number of samples
k: number of clusters
l: number of iterations
 Space complexity: O(k+n)

 Order-independent
 It generates the same partition of the data at the end of
the partitioning process irrespective of the order in
which the samples are presented to the algorithm
ORDER-DEPENDENT
INCREMENTAL CLUSTERING
1. Initially, there is no cluster.
2. For every point, compute the distance between the point and every existing cluster.
3. If no such distance exists (there is no cluster yet), start a new cluster with the point and go to step 2 for the next point. Otherwise, go to step 4.
4. If the minimum distance is smaller than a threshold, add the point to the corresponding cluster. Otherwise, start a new cluster with the point.
5. Go to step 2 for the next point.
 Order-dependent
 The order in which the data are presented influences the result
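A sketch of one common variant of this procedure (the slide does not fix the cluster representative; here each cluster's centroid is recomputed after every addition):

```python
import numpy as np

def incremental_clustering(points, threshold):
    """Leader-style incremental clustering; the order of the points affects the result."""
    centroids, members = [], []
    for x in points:
        x = np.asarray(x, dtype=float)
        if not centroids:                                  # no cluster yet: start the first one
            centroids.append(x.copy()); members.append([x]); continue
        dists = [np.linalg.norm(x - c) for c in centroids]
        k = int(np.argmin(dists))
        if dists[k] < threshold:                           # close enough: join the nearest cluster
            members[k].append(x)
            centroids[k] = np.mean(members[k], axis=0)     # update that cluster's centroid
        else:                                              # too far from every cluster: new cluster
            centroids.append(x.copy()); members.append([x])
    return centroids, members
```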
MAJOR PROBLEMS OF K-MEANS
 Sensitive to the selection of the initial partition
 May converge to a local minimum
 Lack of available guidelines for
 Choosing the initial partition
 Adjusting the number of clusters
 Selecting the stopping criterion
 Very sensitive to noise and outlier data points
 Every point has the same contribution to the centroid
 K-medoids method: remove the outliers when
computing the centroids
→ less sensitive to noise and outliers
 Based on global distance
 Might not work well for the data with locally
connected patterns
K-MEANS FOR CATEGORICAL DATA
 Centroids for clusters can NOT be computed
 d(x, mk) cannot be calculated without mk, but d(xi, xj) or s(xi, xj) can still be found
 Use K-nearest neighbors (KNN) for re-clustering the data
 x can be clustered according to the clusters of its KNNs through majority voting
 Does not compute the distance between x and a centroid
 “My choice → according to the choices of my K best friends”
 Use the distribution of each cluster for re-clustering the data
 P(x|Ck) can be trained for each cluster
 Example: customer x = [gender=male, age=20, occupation=engineer, income=900k, …]
REDISTRIBUTION OF CATEGORICAL DATA (KNN)
 x1 = (A,B,A,B,C,B)
x2 = (A,A,A,B,A,B)
x3 = (B,B,A,B,A,B)
x4 = (B,C,A,B,B,A)
x5 = (B,A,B,A,C,A)
x6 = (A,C,B,A,B,B)
 C1 = { x1, x2, x3 }, C2 = { x4, x5, x6 }
 For y = (A, C, A, B, C, A):
ssmc(y, x1) = 0.66, ssmc(y, x2) = 0.50, ssmc(y, x3) = 0.33
ssmc(y, x4) = 0.66, ssmc(y, x5) = 0.33, ssmc(y, x6) = 0.33
 The KNNs for K = 3 are x1, x2, x4 → the majority vote assigns y to C1
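A small sketch that reproduces this vote (helper and variable names are mine):

```python
import numpy as np

def smc(a, b):
    """Simple matching coefficient for categorical vectors: fraction of matching positions."""
    return float(np.mean([u == v for u, v in zip(a, b)]))

samples = {'x1': 'ABABCB', 'x2': 'AAABAB', 'x3': 'BBABAB',    # cluster C1
           'x4': 'BCABBA', 'x5': 'BABACA', 'x6': 'ACBABB'}    # cluster C2
cluster = {'x1': 'C1', 'x2': 'C1', 'x3': 'C1', 'x4': 'C2', 'x5': 'C2', 'x6': 'C2'}

y = 'ACABCA'
ranked = sorted(samples, key=lambda name: smc(y, samples[name]), reverse=True)
knn = ranked[:3]                               # x1, x4, x2 (same set as on the slide)
votes = [cluster[name] for name in knn]
print(max(set(votes), key=votes.count))        # majority vote: C1
```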
K-MEANS VS. GMM TRAINING
 K-Means
 Distance: Euclidean distance (for homogeneous data)
 Objective function: square error
 Clusters have mutually exclusive data
 Each point has an equal contribution (weight) to the mean:
mk = (1/nk) Σ_{xi ∈ Ck} xi
 GMM
 Similarity: Gaussian probability (the distance is normalized by the covariance: (x – μk)ᵗ Σk⁻¹ (x – μk))
 Objective function: expectation of the log probabilities
 All data points contribute to each cluster, weighted by the responsibilities li(k):
μk = Σi li(k) xi / Σi li(k),  Σk = Σi li(k) (xi – μk)(xi – μk)ᵗ / Σi li(k)
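For the GMM side, a sketch of the responsibility-weighted updates (it assumes the responsibilities l[i, k] have already been computed in the E-step; with hard 0/1 responsibilities it reduces to the K-means centroid formula above):

```python
import numpy as np

def gmm_m_step(X, l):
    """M-step: update each component's mean and covariance from responsibilities l[i, k]."""
    n, d = X.shape
    k = l.shape[1]
    nk = l.sum(axis=0)                           # effective number of points per component
    mu = (l.T @ X) / nk[:, None]                 # mu_k = sum_i l_i(k) x_i / sum_i l_i(k)
    sigma = np.zeros((k, d, d))
    for j in range(k):
        diff = X - mu[j]                         # (x_i - mu_k)
        sigma[j] = (l[:, j, None] * diff).T @ diff / nk[j]
    return mu, sigma
```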
K-MEANS FOR VECTOR QUANTIZATION
 Select a set of prototype vectors to represent a large vector space
 Vectors in the vector space are encoded into discrete symbols
 Codebook: a set of codewords (prototype vectors)
 Example: data compression
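A minimal sketch of the encoding step, assuming the codebook is the set of centroids produced by K-means on training vectors:

```python
import numpy as np

def vq_encode(vectors, codebook):
    """Encode each vector as the index (discrete symbol) of its nearest codeword."""
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Decoding (reconstruction) simply looks the codewords up again: codebook[symbols]
```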
EVALUATION FOR CLUSTER ANALYSIS
 Assessment of the data domain
 Estimate properties of the data in advance (how many clusters, distances between clusters, …)
 Cluster validity (verification of the results)
 External assessment of validity
 Compare the discovered structure to an a priori structure
 Internal examination of validity
 Determine whether the discovered structure is intrinsically appropriate or meaningful
 Relative test
 Adjust the algorithms and parameters to produce different clusterings, then compare the results to judge which is more applicable
SUMMARY
 There is NO clustering technique that is universally applicable in uncovering the variety of structures present in multidimensional data sets.
 The user’s understanding of the problem and of the corresponding data types is the best criterion for selecting the appropriate method.
 The data should be subjected to tests for clustering tendency before applying a clustering algorithm, followed by a validation of the clusters generated by the algorithm.
 There is no best clustering algorithm. Try several.
 Cluster analysis can also be used as an intermediate-level processing technique.
 e.g. word-class n-grams (sharing probabilities)
REFERENCE
 Mehmed Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-Interscience.