CLUSTERING
Bor-shen Lin
bslin@cs.ntust.edu.tw
http://www.cs.ntust.edu.tw/~bslin
CLUSTER ANALYSIS
A form of unsupervised learning
Automatic classification of samples into a number
of groups using a measure of association
Data in each group are similar
Input
A set of samples: X = { xi }
A measure of similarity: s
Output
A number of groups
A partition L = { G1, G2, …, GN }
G1 ∪ G2 ∪ … ∪ GN = X and Gi ∩ Gj = ∅ for i ≠ j
[Diagram: (X, s) → Cluster Analysis → L]
EXAMPLE
[Figure: the same set of sample points partitioned into K = 3 clusters (left) and K = 4 clusters (right)]
CLUSTERING FOR ANYTHING
Clustering for customers
Clustering for documents
Clustering for words
Clustering for images
Clustering for melodies
Clustering for objects
Clustering for design patterns
Clustering for styles
CLUSTERING OF DESIGN PATTERNS
BASIC CONCEPT OF CLUSTER ANALYSIS
Sample
A point in a multi-dimensional space
Objective of Cluster Analysis
Construct decision boundaries (classification surfaces) based
on unlabeled training data set
(Unsupervised learning)
Appropriate for exploration of interrelationships among
samples to make a preliminary assessment of the sample
structure
DIFFICULTIES OF CLUSTERING
Shape and size of data
Number of clusters
Determined according to desired resolution
There might be different results for the same data
(different # of clusters)
Humans can visualize data in at most 3 dimensions, so
exploring the structure of high-dimensional data has to
rely on clustering algorithms
FEATURES
Quantitative features
Continuous values: R, R+
Discrete values: { 0, 1 }, Z
Interval values: { x ≦ 20, 20 < x ≦ 40, x > 40 }
Qualitative features
Nominal or unordered: color is “blue” or “red”
Ordinal: military rank: “general” and “colonel”
SIMILARITY/DISTANCE MEASURE
Similarity
Symmetric: s(x, x’) = s(x’, x) for all x, x’ in X
Normalized: 0 ≦ s(x, x’) ≦ 1
Distance (dissimilarity) measure
d(x, x’) ≧ 0 (d(x, x) = 0)
d(x, x’) = d(x’, x)
Called a metric distance measure if the triangle
inequality holds:
d(x, x’’) ≦ d(x, x’) + d(x’, x’’)
METRICS FOR CONTINUOUS VALUES
Euclidean distance
$d_2(x_i, x_j) = \left[ \sum_k (x_{ik} - x_{jk})^2 \right]^{1/2}$
Block distance (L1 metric)
$d_1(x_i, x_j) = \sum_k |x_{ik} - x_{jk}|$
Minkowski metric
$d_p(x_i, x_j) = \left[ \sum_k |x_{ik} - x_{jk}|^p \right]^{1/p}$
Cosine similarity
$s_{\cos}(x_i, x_j) = \sum_k x_{ik} x_{jk} \,/\, \left[ \sum_k x_{ik}^2 \cdot \sum_k x_{jk}^2 \right]^{1/2}$
$s_{\cos}(x_i, x_j) = 1$ if $x_i = c \cdot x_j$, $c > 0$
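A minimal Python (NumPy) sketch of these metrics; the function names are illustrative, not from any particular library.

import numpy as np

def euclidean(xi, xj):
    return np.sqrt(np.sum((xi - xj) ** 2))              # d2: L2 metric

def block(xi, xj):
    return np.sum(np.abs(xi - xj))                      # d1: L1 (city-block) metric

def minkowski(xi, xj, p):
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)    # dp: general Lp metric

def cosine_similarity(xi, xj):
    # s_cos = <xi, xj> / sqrt(||xi||^2 * ||xj||^2); equals 1 when xi = c*xj, c > 0
    return np.dot(xi, xj) / np.sqrt(np.dot(xi, xi) * np.dot(xj, xj))

xi, xj = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(euclidean(xi, xj), block(xi, xj), minkowski(xi, xj, 3), cosine_similarity(xi, xj))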
METRICS FOR DISCRETE VALUES
For binary vectors xi and xj, count the attribute positions with each value pair:
a = #(1, 1), b = #(1, 0), c = #(0, 1), d = #(0, 0)
Simple Matching Coefficient (SMC)
Ssmc(xi, xj) = (a + d) / (a + b + c +d)
Jaccard Coefficient
Sjc(xi, xj) = a / (a + b + c)
Rao’s Coefficient
Src(xi, xj) = a / (a + b + c +d)
EXAMPLE
x1 = (0,0,1,1,0,1,0,1)
x2 = (0,1,1,0,0,1,0,0)
→ a = 2 for (1, 1)
b = 2 for (1, 0)
c = 1 for (0, 1)
d = 3 for (0, 0)
→ Ssmc(x1, x2) = 5 / 8
Sjc(x1, x2) = 2 / 5
Src(x1, x2) = 2 / 8
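The counts and coefficients of this example can be checked with a short Python sketch (the helper name is illustrative):

import numpy as np

def binary_counts(xi, xj):
    a = np.sum((xi == 1) & (xj == 1))
    b = np.sum((xi == 1) & (xj == 0))
    c = np.sum((xi == 0) & (xj == 1))
    d = np.sum((xi == 0) & (xj == 0))
    return a, b, c, d

x1 = np.array([0, 0, 1, 1, 0, 1, 0, 1])
x2 = np.array([0, 1, 1, 0, 0, 1, 0, 0])
a, b, c, d = binary_counts(x1, x2)
print((a + d) / (a + b + c + d))   # SMC     = 5/8
print(a / (a + b + c))             # Jaccard = 2/5
print(a / (a + b + c + d))         # Rao     = 2/8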
ADVANCED DISTANCE MEASURES
Consider the effect of the context (relative distance)
Mutual neighbor distance (MND)
MND(xi, xj) = NN(xi, xj) + NN(xj, xi)
NN(xi, xj): the neighbor number of xi with respect to xj,
i.e., NN(xi, xj) = n if xi is the n-th closest point to xj
[Figure: two configurations of the points; the left shows only A, B, C, while the right adds extra points D and E between A and B]
Left: NN(A, B) = 1, NN(B, A) = 1; NN(B, C) = 1, NN(C, B) = 2 → MND(A, B) = 2, MND(B, C) = 3
Right: NN(A, B) = 1, NN(B, A) = 4; NN(B, C) = 1, NN(C, B) = 2 → MND(A, B) = 5, MND(B, C) = 3
The positions of A and B are the same in both configurations, but their mutual neighbor distances differ!
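A minimal Python sketch of MND, assuming Euclidean distance between points; the helper names and the sample coordinates are purely illustrative.

import numpy as np

def nn_rank(points, i, j):
    """Rank of point i among the neighbors of point j (1 = closest)."""
    d = np.linalg.norm(points - points[j], axis=1)
    order = np.argsort(d)
    order = order[order != j]                    # exclude j itself
    return int(np.where(order == i)[0][0]) + 1

def mnd(points, i, j):
    return nn_rank(points, i, j) + nn_rank(points, j, i)

pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.5, 0.5]])   # hypothetical A, B, C
print(mnd(pts, 0, 1), mnd(pts, 1, 2))                  # 2, 3 (as in the left figure)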
CLUSTERING ALGORITHMS
Hierarchical Clustering Algorithm
Producing the hierarchy (dendrogram)
Partitional Clustering algorithm
Producing the partition of the data
HIERARCHICAL CLUSTERING ALGORITHMS
Divisive algorithm
A top-down process
Example: CART for clustering
Splits on discrete properties
Starts from the entire set of samples
Divides it into a partition of subsets
Agglomerative algorithm (aggregation)
A bottom-up process
Rain drops (small → big)
Regards each object as a cluster initially
The clusters are merged into larger clusters
A dendrogram is constructed
DISTANCE BETWEEN CLUSTERS
$D_{\min}(C_i, C_j) = \min \| x_i - x_j \|$, for $x_i \in C_i$, $x_j \in C_j$
$D_{\mathrm{mean}}(C_i, C_j) = \| m_i - m_j \|$, for centroids $m_i$ and $m_j$
$D_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{x_i \in C_i} \sum_{x_j \in C_j} \| x_i - x_j \|$
$D_{\max}(C_i, C_j) = \max \| x_i - x_j \|$, for $x_i \in C_i$, $x_j \in C_j$
[Figure: two clusters Ci and Cj; Dmin is the distance between their closest samples, Dmax the distance between their farthest samples]
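A short NumPy sketch of the four inter-cluster distances, assuming Euclidean distance between samples (function names are illustrative):

import numpy as np

def pairwise(Ci, Cj):
    # matrix of distances ||xi - xj|| for all xi in Ci, xj in Cj
    return np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

def d_min(Ci, Cj):  return pairwise(Ci, Cj).min()
def d_max(Ci, Cj):  return pairwise(Ci, Cj).max()
def d_avg(Ci, Cj):  return pairwise(Ci, Cj).mean()
def d_mean(Ci, Cj): return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))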
AGGLOMERATIVE CLUSTERING
1. Place each sample in its own cluster. Construct a
list of inter-cluster distances for all pairs of
samples, and sort this list in ascending order.
2. Step through the sorted list of distances, forming
for each distinct threshold value dk a graph of the
samples where pairs of samples closer than dk are
connected into a new cluster by a graph edge. If
all the samples are members of a connected graph,
stop. Otherwise, repeat this step.
3. The output of the algorithm is a nested hierarchy
of graphs, which can be cut at the desired
dissimilarity level forming a partition (clusters)
identified by simple connected components in the
corresponding subgraph.
EXAMPLE
Samples
x1=(0,2), x2=(0,0), x3=(1.5,0), x4=(5,0), x5=(5,2)
d(x1, x2) = 2, d(x1, x3) = 2.5, d(x1, x4) = 5.39,
d(x1, x5) = 5, d(x2, x3) = 1.5, d(x2, x4) = 5,
d(x2, x5) = 5.29, d(x3, x4) = 3.5,
d(x3, x5) = 4.03, d(x4, x5) = 2
Dendrogram (using Dmin): x2 and x3 merge at distance 1.5; x1 joins them at 2.0 and x4, x5 merge at 2.0; the two remaining clusters merge at 3.5
[Figure: scatter plot of the five samples and the corresponding dendrogram with merge levels 1.5, 2.0, and 3.5]
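This merge sequence can be reproduced with SciPy's single-linkage hierarchical clustering; a sketch, assuming SciPy is available (the plotting call is left commented out):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0, 2], [0, 0], [1.5, 0], [5, 0], [5, 2]])   # x1 ... x5
Z = linkage(X, method='single', metric='euclidean')        # single linkage = Dmin
print(Z[:, 2])           # merge distances: [1.5, 2.0, 2.0, 3.5]
# dendrogram(Z)          # would draw the tree shown on the slide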
MERGING OF CLUSTERS
[Figure: scatter plot of x1–x5 showing how the clusters are merged step by step, and the resulting dendrogram with leaves ordered x2, x3, x1, x4, x5]
CHAMELEON ALGORITHM
Basic idea: Improve the clustering quality by using a more elaborate
criterion when merging two clusters
Ci and Cj will be merged if the interconnectivity and closeness of
the merged cluster is very similar to that of Ci and Cj before
merging
Min-cut on a graph (of cluster Ci)
Partitioning a graph into two parts of roughly equal size such that
the total weight of the edges being cut is minimized
Total weight of the edges: stands for total interconnectivity (c.f.
internal links within political parties)
Min-cut: minimize the loss of connectivity for the graph after being cut
Interconnectivity I(Ci): total weight of edges being cut (total loss)
Closeness C(Ci): average weight of edges being cut (average loss)
CHAMELEON ALGORITHM
1. Divide all data into highly dense sub-clusters.
Construct an initial graph G = (V, E) by KNN, with the weight
of each edge e(vi, vj) representing the closeness between the
two samples
Recursively partition G into many small, unconnected
subgraphs by doing min-cut
Stop when certain criteria are satisfied
2. Merge sub-clusters into larger clusters.
For Ci and Cj, compute RI(Ci, Cj) and RC(Ci, Cj)
Relative interconnectivity
$RI(C_i, C_j) \equiv I(C_i \cup C_j) \,/\, [0.5 \cdot (I(C_i) + I(C_j))]$
Relative closeness
$RC(C_i, C_j) \equiv C(C_i \cup C_j) \,/\, [0.5 \cdot (C(C_i) + C(C_j))]$
$s(C_i, C_j) \equiv RC(C_i, C_j) \cdot RI(C_i, C_j)^{\alpha}$, where $0 \le \alpha \le 1$
Merge the pair with the largest s, i.e., minimize the decrease in interconnectivity and closeness after merging
PARTITIONAL CLUSTERING
Why?
Construction of dendrogram is computationally complex
for large data set
Global criterion
Euclidean square-error measure for the k-th cluster (see the code sketch at the end of this slide):
$E^2 = \sum_k e_k$, where $e_k = \sum_{x_i \in C_k} \| x_i - m_k \|^2$ and $m_k = \frac{1}{n_k} \sum_{x_i \in C_k} x_i$
Euclidean distance is not appropriate if the dimensions have
different physical meanings → use a stochastic model instead
Local criterion
Minimal mutual neighbor distance (MND)
Utilizing the local structure or context in the data
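A minimal sketch of the square-error criterion, assuming the partition is given as a list of per-cluster sample arrays (names are illustrative):

import numpy as np

def square_error(clusters):
    total = 0.0
    for Ck in clusters:                      # Ck: (n_k, dim) array of samples
        mk = Ck.mean(axis=0)                 # cluster centroid m_k
        total += np.sum((Ck - mk) ** 2)      # e_k = sum of squared distances
    return total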
K-MEANS CLUSTERING ALGORITHM
A partitional clustering algorithm employing a
square-error criterion
Simplest & most commonly used
Easy to implement
Time and space complexity is relatively small
Gives surprisingly good result if the clusters are
compact, hyperspherical in shape, and well
separated in the feature space
Works well for data whose clusters are far apart
Equivalent to the Kohonen net in ANN
Similar methods exist (e.g., C-means, K-medoids)
REDISTRIBUTION OF SAMPLES
[Figure: samples and three cluster centroids C1, C2, C3 (▲); each sample X (●) is reassigned to the cluster whose centroid is closest]
K-MEANS ALGORITHM
1. Select an initial partition with k clusters
containing randomly chosen samples, and
compute the centroids of the clusters.
2. Generate a new partition by (re-distributing)
assigning each sample to its closest cluster center
$k^* = \arg\min_k d(x, m_k)$ for every x
3. Compute new centroids of the clusters
$m_k = \frac{1}{n_k} \sum_{x_i \in C_k} x_i$ for cluster k
4. Repeat steps 2 and 3 until an optimum value of
the criterion function is found.
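A compact NumPy sketch of this loop, assuming Euclidean distance and k randomly chosen samples as the initial centroids (the function name and parameters are illustrative):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # step 1
    for _ in range(n_iter):
        # step 2: assign each sample to its closest centroid
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centroids[None, :, :],
                                          axis=2), axis=1)
        # step 3: recompute centroids (keep the old one if a cluster became empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                   # step 4: converged
            break
        centroids = new_centroids
    return labels, centroids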
CHARACTERISTICS OF K-MEANS
Time complexity: O(nkl)
n: number of samples
k: number of clusters
l: number of iterations
Space complexity: O(k+n)
Order-independent
It generates the same partition of the data at the end of
the partitioning process irrespective of the order in
which the samples are presented to the algorithm
INCREMENTAL CLUSTERING (ORDER-DEPENDENT)
1. Initially, there is no cluster.
2. Take the next point and compute the distance between the
point and every existing cluster.
3. If no minimum distance exists (there is no cluster yet),
create a new cluster with the point and go to step 2.
Otherwise, go to step 4.
4. If the minimum distance is smaller than a threshold, add
the point to the corresponding cluster. Otherwise, create a
new cluster with the point.
5. Go to step 2.
Order dependent
The order that data are presented influences the result
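A minimal sketch of such threshold-based incremental clustering, assuming Euclidean distance to running centroids; the function name and threshold parameter are illustrative.

import numpy as np

def incremental_cluster(X, threshold):
    centroids, members = [], []              # running centroids and member index lists
    for idx, x in enumerate(X):
        if not centroids:                    # no cluster yet: start the first one
            centroids.append(x.astype(float)); members.append([idx]); continue
        d = [np.linalg.norm(x - c) for c in centroids]
        k = int(np.argmin(d))
        if d[k] < threshold:                 # close enough: join cluster k
            members[k].append(idx)
            centroids[k] = X[members[k]].mean(axis=0)   # update its centroid
        else:                                # otherwise start a new cluster
            centroids.append(x.astype(float)); members.append([idx])
    return centroids, members                # result depends on the data order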
MAJOR PROBLEMS OF K-MEANS
Sensitive to the selection of the initial partition
May converge to a local minimum
Lack of available guidelines for
Choosing the initial partition
Adjusting the number of clusters
Selecting the stopping criterion
Very sensitive to noise and outlier data points
Every point has the same contribution to the centroid
K-medoids method: use an actual sample (the medoid) instead
of the mean as the cluster representative
→ less sensitive to noise and outliers
Based on global distance
Might not work well for the data with locally
connected patterns
K-MEANS FOR CATEGORICAL DATA
Centroids for clusters can NOT be computed
d(x, mk) cannot be calculated without mk,
but d(xi, xj) or s(xi, xj) can still be found
Use K-nearest neighbor (KNN) for reclustering data
x can be clustered according to the clusters of its KNNs
through majority voting
Do not compute the distance between x and centroid
Analogy: my choice follows the choices of my K best friends
Use the distribution of cluster for reclustering data
P(x|Ck) could be trained for each cluster
Example: customer x = [gender=male, age=20,
occupation=engineer, income=900k, …]
REDISTRIBUTION OF CATEGORICAL DATA (KNN)
x1 = (A,B,A,B,C,B)
x2 = (A,A,A,B,A,B)
x3 = (B,B,A,B,A,B)
x4 = (B,C,A,B,B,A)
x5 = (B,A,B,A,C,A)
x6 = (A,C,B,A,B,B)
C1 = { x1, x2, x3 }, C2 = { x4, x5, x6 }
For y = (A, C, A, B, C, A)
ssmc(y, x1)=0.66, ssmc(y, x2)=0.50, ssmc(y, x3)=0.33
ssmc (y, x4)=0.66, ssmc(y, x5)=0.33, ssmc(y, x6)=0.33
KNN for K=3 are x1, x2, x4 → voting C1 for y
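A short Python sketch that reproduces this example: the matching coefficient over categorical attributes followed by majority voting among the K = 3 nearest neighbors (variable names are illustrative).

samples = {
    'x1': 'ABABCB', 'x2': 'AAABAB', 'x3': 'BBABAB',    # cluster C1
    'x4': 'BCABBA', 'x5': 'BABACA', 'x6': 'ACBABB',    # cluster C2
}
cluster_of = {'x1': 'C1', 'x2': 'C1', 'x3': 'C1', 'x4': 'C2', 'x5': 'C2', 'x6': 'C2'}

def smc(u, v):                       # fraction of attribute positions that match
    return sum(a == b for a, b in zip(u, v)) / len(u)

y = 'ACABCA'
ranked = sorted(samples, key=lambda name: smc(y, samples[name]), reverse=True)
knn = ranked[:3]                                         # x1, x4, x2
votes = [cluster_of[name] for name in knn]
print(max(set(votes), key=votes.count))                  # 'C1'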
K-MEANS VS. GMM TRAINING
K-Means
Distance: Euclidean distance (for homogenous data)
Objective function: square error
Clusters have mutually exclusive data
Each point has equal contribution (weight) to mean
$m_k = \frac{1}{n_k} \sum_{x_i \in C_k} x_i$
GMM
Similarity: Gaussian probability (the distance is
normalized by the covariance: $(x - \mu_k)^t \Sigma_k^{-1} (x - \mu_k)$)
Objective function: expectation of log probabilities
All data points contribute to each cluster
$\mu_k = \frac{\sum_i l_i(k)\, x_i}{\sum_i l_i(k)}$, $\quad \Sigma_k = \frac{\sum_i l_i(k)\, (x_i - \mu_k)(x_i - \mu_k)^t}{\sum_i l_i(k)}$
where $l_i(k)$ is the responsibility (posterior probability) of cluster k for sample $x_i$
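A sketch of these weighted GMM mean/covariance updates (the M-step), assuming the responsibilities l[i, k] have already been computed in an E-step; function and variable names are illustrative.

import numpy as np

def gmm_m_step(X, l):
    n, dim = X.shape
    k = l.shape[1]
    Nk = l.sum(axis=0)                                       # effective cluster sizes
    mu = (l.T @ X) / Nk[:, None]                             # responsibility-weighted means
    sigma = np.zeros((k, dim, dim))
    for j in range(k):
        diff = X - mu[j]                                     # (n, dim)
        sigma[j] = (l[:, j, None] * diff).T @ diff / Nk[j]   # weighted covariances
    return mu, sigma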
K-MEANS FOR VECTOR QUANTIZATION
Select a set of prototype vectors to represent a
large vector space
Vectors in the vector space are encoded into
discrete symbols
Codebook: a set of codewords (vectors)
Example: data compression
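A tiny sketch of the encoding step in vector quantization, assuming a codebook has already been trained (e.g., by k-means); the function name is illustrative.

import numpy as np

def vq_encode(X, codebook):
    # index of the closest codeword for every input vector
    return np.argmin(np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2),
                     axis=1)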
EVALUATION FOR CLUSTER ANALYSIS
Assessment of the data domain
Estimating properties of the data in advance (how many
clusters, distances between clusters, …)
Cluster validity (verification on the results)
External assessment of validity
Compare the discovered structure to an a priori structure
Internal examination of validity
Determine if the discovered structure is intrinsically
appropriate or meaningful
Relative test
Adjust the algorithms and parameters to produce different
clusterings, then compare the results to judge which is
more applicable
SUMMARY
There is NO clustering technique that is universally
applicable in uncovering the variety of structures present in
multidimensional data sets.
User’s understanding of the problem and the corresponding
data types will be the best criteria to select the appropriate
method.
The data should be subjected to tests for clustering tendency
before applying a clustering algorithm, followed by a
validation of the clusters generated by the algorithm
There is no best clustering algorithm. Try several ones.
Cluster analysis can also be used as an intermediate-level
processing technique
e.g., class-based (word-class) n-grams, where words in the same class share probabilities
REFERENCE
Mehmed Kantardzic, Data Mining: Concepts, Models, Methods,
and Algorithms, Wiley-Interscience.