Clustering
Slides are compiled from various online sources, with grateful acknowledgement to all those who made the content freely available on the web.
Outline
• What is clustering
• Challenges and applications
• Different types of clustering techniques
Clustering
• Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
• Clustering is a technique for finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• Clustering is often called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as would be the case in supervised learning.
An illustration
• The data set has three natural groups of data points, i.e., 3 natural clusters.
Notion of a Cluster can be Ambiguous
What Is Good Clustering?
• A good clustering method will produce high-quality clusters with
  • high intra-class similarity
  • low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
Applications
• Business Intelligence
  • Organize customers into groups and develop business strategies for enhanced customer relationship management.
• Image pattern recognition
  • Cluster hand-written digits.
  • Image compression: represent the image with a limited number of colors.
• Web search
  • Organize the search results into groups and present the results in a concise and easily accessible way.
• Biology
  • Determine the group structure in, e.g., a population of cells or microarray gene expression data.
• Security
  • Identify fraudulent behavior.
• Many others
Requirements of Clustering
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• Ability to handle high-dimensional data
• Incorporation of user-specified constraints
• Interpretability and usability
Major Clustering Approaches
• Partitioning algorithms: Construct various partitions and then evaluate them by some criterion.
• Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion.
• Grid-based algorithms: Partition the data space into a finite number of cells to form a grid structure and then form clusters from the cells in the grid structure. Clusters correspond to regions that are denser in data points than their surroundings.
• Density-based algorithms: Grow clusters from dense regions of the data space that are separated by regions of lower density (covered later with DBSCAN).
Partitioning Clustering Approach
• Partitioning algorithms construct a partition of a database of N objects into a set of K clusters.
• The partitioning clustering algorithm usually adopts the iterative optimization paradigm.
  • It starts with an initial partition and uses an iterative control strategy.
  • It tries swapping data points to see if such a swapping improves the quality of the clustering.
  • When swapping no longer yields any improvement in the clustering, it has found a locally optimal partitioning.
• In principle, the optimal partition is achieved by minimizing the sum of squared distances from each point to the "representative object" m_k of its cluster C_k:

    E = Σ_{k=1..K} Σ_{x ∈ C_k} d(x, m_k)²
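As a concrete reading of this criterion, here is a minimal sketch (in Python with numpy; the function name and the use of Euclidean distance are assumptions of this note, not something prescribed by the slides) that evaluates E for a given assignment of points to representative objects:

    import numpy as np

    def clustering_cost(X, labels, centroids):
        """Sum of squared Euclidean distances from each point to the
        representative object (centroid) of its cluster, i.e. the criterion E."""
        diffs = X - centroids[labels]      # vector from each point to its own centroid
        return float((diffs ** 2).sum())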
K-means algorithm
• Given the cluster number K, the K-means algorithm is carried out in three steps after initialization:
  • Initialization: set seed points (randomly)
  1. Assign each object to the cluster of the nearest seed point, measured with a specific distance metric.
  2. Compute new seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., the mean point, of the cluster).
  3. Go back to Step 1; stop when there are no more new assignments (i.e., membership in each cluster no longer changes).
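The steps above translate almost line for line into code. The following is a minimal sketch, assuming numpy, Euclidean distance, and randomly chosen data points as the initial seeds; the function name and defaults are illustrative, not part of the slides:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Minimal K-means: assign points to the nearest centroid, recompute
        centroids as cluster means, stop when memberships no longer change."""
        rng = np.random.default_rng(seed)
        # Initialization: pick k distinct data points as the initial seed centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = np.full(len(X), -1)
        for _ in range(max_iter):
            # Step 1: assign each object to the cluster of the nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            # Step 3: stop when no membership changes
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
            # Step 2: recompute each centroid as the mean of its current members
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = X[labels == j].mean(axis=0)
        return labels, centroids

    # The four "medicine" objects from the example on the following slides (A, B, C, D):
    X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
    labels, centroids = kmeans(X, k=2)
    print(labels)      # e.g. [0 0 1 1], i.e. {A, B} and {C, D}
    print(centroids)   # the centroids converge to (1.5, 1.0) and (4.5, 3.5), possibly in the other order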
K-means - Example
• Problem:
  • Suppose we have 4 types of medicines and each has two attributes (weight index and pH). Our goal is to group these objects into K = 2 groups of medicines.

    Medicine   Weight index   pH
    A          1              1
    B          2              1
    C          4              3
    D          5              4

  (Figure: the four objects plotted in the weight index vs. pH plane.)
K-means - Example
• Step 1: Use initial seed points for partitioning

    c1 = A = (1, 1),  c2 = B = (2, 1)

    d(D, c1) = √((5 − 1)² + (4 − 1)²) = 5
    d(D, c2) = √((5 − 2)² + (4 − 1)²) = 4.24

  Every object is assigned to the cluster of its nearest seed: A joins c1, while B, C and D join c2.
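The same distance computation in code (a tiny sketch using Python's standard math module; the variable names are only for illustration):

    import math

    A, B, C, D = (1, 1), (2, 1), (4, 3), (5, 4)
    c1, c2 = A, B                      # initial seed points

    def dist(p, q):
        """Euclidean distance between two 2-D points."""
        return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

    print(dist(D, c1))                 # 5.0
    print(round(dist(D, c2), 2))       # 4.24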
K-means - Example
• Step 2: Compute new centroids of the current partition
  Knowing the members of each cluster, we compute the new centroid of each group based on these new memberships:

    c1 = (1, 1)
    c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)
K-means - Example
• Step 2: Renew membership based on new centroids
  With the new centroids, B is now closer to c1 = (1, 1) than to c2 = (11/3, 8/3), so the memberships become {A, B} and {C, D}.
K-means - Example
• Step 3: Repeat the first two steps until convergence
  Knowing the members of each cluster, we compute the new centroid of each group based on these new memberships:

    c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
    c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)
K-means - Example
• Step 3: Repeat the first two steps until convergence
  Re-assigning the objects with these centroids changes no memberships, so the algorithm terminates with clusters {A, B} and {C, D}.
Strengths of k-means
• Strengths:
• Simple: easy to understand and to implement
• Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
Weaknesses of k-means
• The algorithm is only applicable if the mean is defined.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
  • Outliers are data points that are very far away from other data points.
  • Outliers could be errors in the data recording or some special data points with very different values.
Weaknesses of k-means: Problems with outliers
Weaknesses of k-means: To deal with outliers
• One method is to remove, during the clustering process, data points that are much farther away from the centroids than other data points.
  • To be safe, we may want to monitor these possible outliers over a few iterations and only then decide whether to remove them.
• Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
  • The remaining data points are then assigned to the clusters by distance or similarity comparison, or by classification.
Weaknesses of k-means
• The algorithm is sensitive to initial seeds.
Weaknesses of k-means
• If we use different seeds, we can get good results. (There are methods to help choose good seeds.)
Weaknesses of k-means
• The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
Hierarchical Clustering
• Hierarchical clustering approach
  • A typical clustering analysis approach via partitioning the data set sequentially
  • Constructs nested partitions layer by layer by grouping objects into a tree of clusters (without the need to know the number of clusters in advance)
  • Uses a (generalised) distance matrix as the clustering criterion
• Agglomerative vs. divisive
  • Two sequential clustering strategies for constructing a tree of clusters
  • Agglomerative: a bottom-up strategy
    • Initially, each data object is in its own (atomic) cluster
    • Then merge these atomic clusters into larger and larger clusters
  • Divisive: a top-down strategy
    • Initially, all objects are in one single cluster
    • Then the cluster is subdivided into smaller and smaller clusters
Hierarchical Clustering
• Agglomerative approach
  Initialization: each object is a cluster.
  Iteration: merge the two clusters that are most similar to each other, until all objects are merged into a single cluster.
  (Figure: a and b merge into ab, d and e merge into de, de and c merge into cde, and finally ab and cde merge into abcde.)
Hierarchical Clustering
• Divisive approach
  Initialization: all objects stay in one cluster.
  Iteration: select a cluster and split it into two sub-clusters, until each leaf cluster contains only one object.
  (Figure: abcde splits into ab and cde, cde splits into c and de, and finally ab and de split into the single objects a, b, d and e.)
Dendrogram
• A binary tree that shows how clusters are merged/split hierarchically
• Each node in the tree is a cluster; each leaf node is a singleton cluster
Dendrogram
• A clustering of the data objects is obtained by cutting the dendrogram
at the desired level, then each connected component forms a cluster
How to Merge Clusters?
• How to measure the distance between clusters?
  • Single-link
  • Complete-link
  • Average-link
  • Centroid distance
How to Define Inter-Cluster Distance
• Single-link: the distance between two clusters is the minimum distance between any pair of data objects belonging to different clusters:
    d_min(Ci, Cj) = min{ d(p, q) : p ∈ Ci, q ∈ Cj }
• Complete-link: the distance between two clusters is the maximum distance between any pair of data objects belonging to different clusters:
    d_max(Ci, Cj) = max{ d(p, q) : p ∈ Ci, q ∈ Cj }
How to Define Inter-Cluster Distance
• Average-link: the distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters:
    d_avg(Ci, Cj) = avg{ d(p, q) : p ∈ Ci, q ∈ Cj }
How to Define Inter-Cluster Distance
• Centroid distance: the distance between two clusters is the distance between their centroids:
    d_mean(Ci, Cj) = d(mi, mj), where mi and mj are the means of Ci and Cj
Cluster Distance Measures
Example: Given a data set of five objects characterized by a single continuous feature, assume that there are two clusters: C1: {a, b} and C2: {c, d, e}.

    Object:    a   b   c   d   e
    Feature:   1   2   4   5   6

1. Calculate the distance matrix.

         a   b   c   d   e
    a    0   1   3   4   5
    b    1   0   2   3   4
    c    3   2   0   1   2
    d    4   3   1   0   1
    e    5   4   2   1   0

2. Calculate the three cluster distances between C1 and C2.

  Single link:
    dist(C1, C2) = min{ d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e) }
                 = min{ 3, 4, 5, 2, 3, 4 } = 2

  Complete link:
    dist(C1, C2) = max{ d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e) }
                 = max{ 3, 4, 5, 2, 3, 4 } = 5

  Average link:
    dist(C1, C2) = ( d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e) ) / 6
                 = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21 / 6 = 3.5
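The same three cluster distances can be checked with a few lines of code (a small sketch in plain Python; the variable names are illustrative):

    # Worked example from this slide: five objects with a single feature.
    points = {"a": 1.0, "b": 2.0, "c": 4.0, "d": 5.0, "e": 6.0}
    C1, C2 = ["a", "b"], ["c", "d", "e"]

    # All pairwise distances between objects of different clusters (1-D: absolute difference)
    pair_dists = [abs(points[p] - points[q]) for p in C1 for q in C2]

    print(min(pair_dists))                       # single link   -> 2.0
    print(max(pair_dists))                       # complete link -> 5.0
    print(sum(pair_dists) / len(pair_dists))     # average link  -> 3.5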
Agglomerative Algorithm
• The agglomerative algorithm is carried out in three steps:
  1) Convert all object features into a distance matrix.
  2) Set each object as a cluster (thus, if we have N objects, we will have N clusters at the beginning).
  3) Repeat until the number of clusters is one (or a known number of clusters):
     ▪ Merge the two closest clusters.
     ▪ Update the distance matrix.
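These three steps can be sketched directly in code. The following is a minimal, unoptimized single-link version (assuming numpy and Euclidean distance; nothing here is prescribed by the slides, and instead of literally updating a distance matrix it recomputes single-link distances from the cluster members each round, which keeps the sketch short at the cost of efficiency):

    import numpy as np

    def agglomerative_single_link(X):
        """Return the sequence of merges (members_i, members_j, distance)
        produced by single-link agglomerative clustering."""
        # Step 1: convert object features into a distance matrix
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        # Step 2: each object starts as its own cluster
        clusters = [[i] for i in range(len(X))]
        merges = []
        # Step 3: repeatedly merge the two closest clusters until one remains
        while len(clusters) > 1:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    # single link: minimum distance between any pair of members
                    d = min(D[p, q] for p in clusters[i] for q in clusters[j])
                    if best is None or d < best[2]:
                        best = (i, j, d)
            i, j, d = best
            merges.append((clusters[i], clusters[j], d))
            clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
            del clusters[j]
        return merges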
Example
• Problem: clustering analysis with the agglomerative algorithm.
  (The slide shows the data matrix of the six objects A–F, the Euclidean distance formula, and the resulting distance matrix.)
Example
• Merge two closest clusters (iteration 1)
Example
• Update distance matrix (iteration 1)
Example
• Merge two closest clusters (iteration 2)
Example
• Update distance matrix (iteration 2)
Example
• Merge two closest clusters/update distance matrix (iteration 3)
Example
• Merge two closest clusters/update distance matrix (iteration 4)
Example
• Final result (meeting termination condition)
Example
• Dendrogram tree representation
  (Figure: dendrogram with the objects on the horizontal axis and the lifetime, i.e. merge distance, on the vertical axis.)
  1. In the beginning we have 6 clusters: A, B, C, D, E and F.
  2. We merge clusters D and F into cluster (D, F) at distance 0.50.
  3. We merge cluster A and cluster B into cluster (A, B) at distance 0.71.
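In practice a dendrogram like this can be produced with a library such as SciPy. The sketch below is an assumption of this note rather than anything the slides use, and it builds a single-link dendrogram for the earlier five-object example a–e (the raw coordinates of the A–F data are not given in the text):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    # Five 1-D objects from the "Cluster Distance Measures" example
    X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])

    Z = linkage(X, method="single")          # single-link agglomerative clustering
    dendrogram(Z, labels=["a", "b", "c", "d", "e"])
    plt.ylabel("lifetime (merge distance)")
    plt.show()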
Which Distance Measure is Better?
• Each method has both advantages and disadvantages; the choice is application-dependent. Single-link and complete-link are the most common methods.
• Single-link
  • Can find irregularly shaped clusters
  • Sensitive to outliers; suffers from the so-called chaining effect
  • To merge two groups, only one pair of points needs to be close, irrespective of all others; therefore clusters can be too spread out and not compact enough
• Complete-link, average-link, and centroid distance
  • Robust to outliers
  • Tend to break large clusters
  • Prefer spherical clusters
Density-Based Clustering Methods
• Clustering based on local connectivity and density functions
• Basic idea
  • Clusters are dense regions in the data space, separated by regions of lower object density
  • A cluster is defined as a maximal set of density-connected points
  • Each cluster has a considerably higher density of points than the area outside the cluster
• Major features:
  • Discovers clusters of arbitrary shape
  • Handles noise
  • Needs only one scan of the data
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Two global parameters:
  • Eps: maximum radius of the neighbourhood
  • MinPts: minimum number of points in the Eps-neighbourhood of a point
• Density = number of points within a specified radius Eps
• A point is a core point if it has at least a specified number of points (MinPts) within Eps
  • These are points in the interior of a cluster
• A border point has fewer than MinPts within Eps, but is in the neighbourhood of a core point
• A noise point is any point that is neither a core point nor a border point
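These three definitions map directly onto a small helper. The sketch below (Python with numpy, illustrative names, and the convention that a point counts itself in its Eps-neighbourhood; all of these are assumptions of this note) labels every point as core, border, or noise:

    import numpy as np

    def classify_points(X, eps, min_pts):
        """Label each point as 'core', 'border', or 'noise'."""
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        in_eps = D <= eps                       # Eps-neighbourhood membership (includes the point itself)
        core = in_eps.sum(axis=1) >= min_pts    # core point: at least MinPts neighbours within Eps
        labels = []
        for i in range(len(X)):
            if core[i]:
                labels.append("core")
            elif core[in_eps[i]].any():         # border point: in the Eps-neighbourhood of some core point
                labels.append("border")
            else:
                labels.append("noise")
        return labels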
Density-reachability
• An object q is directly density-reachable from an object p if p is a core object and q is in p's Eps-neighbourhood.
Density-reachability
• Density-Reachable (directly and indirectly):
• A point p is directly density-reachable from p2
• p2 is directly density-reachable from p1
• p1 is directly density-reachable from q
• p ← p2 ← p1 ←q form a chain
Density-Connectivity
• A pair of points p and q are density-connected if they are commonly density-reachable from some point o.
• Density-connectivity is symmetric.
DBSCAN
• Cluster: Given a data set D and parameters Eps and MinPts, a cluster C is a subset of D satisfying the following conditions:
  • For all points p, q: if p ∈ C and q is density-reachable from p with respect to Eps and MinPts, then q ∈ C (Maximality)
  • For all points p, q ∈ C: p is density-connected to q with respect to Eps and MinPts (Connectivity)
• Note: a cluster contains core points as well as border points.
DBSCAN-Algorithm
DBSCAN-Example
• Parameters
  • Eps = 2 cm
  • MinPts = 3

    for each o ∈ D do
        if o is not yet classified then
            if o is a core object then
                collect all objects density-reachable from o
                and assign them to a new cluster
            else
                assign o to NOISE
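The pseudocode above can be fleshed out into a runnable sketch. The version below is illustrative only (assuming numpy, Euclidean distance, and the label convention -1 = noise; it is not taken from the slides and is written for clarity rather than efficiency):

    import numpy as np

    def dbscan(X, eps, min_pts):
        """Minimal DBSCAN: for each unclassified point, if it is a core object,
        collect everything density-reachable from it into a new cluster;
        otherwise mark it (tentatively) as noise."""
        n = len(X)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        neighbours = [np.where(D[i] <= eps)[0] for i in range(n)]
        labels = np.full(n, -2)                    # -2 = not yet classified, -1 = noise
        cluster_id = -1

        for o in range(n):
            if labels[o] != -2:
                continue
            if len(neighbours[o]) < min_pts:       # o is not a core object
                labels[o] = -1                     # noise (may later turn out to be a border point)
                continue
            cluster_id += 1                        # o is a core object: start a new cluster
            labels[o] = cluster_id
            frontier = list(neighbours[o])
            while frontier:                        # expand the cluster
                q = frontier.pop()
                if labels[q] == -1:                # previously noise -> border point of this cluster
                    labels[q] = cluster_id
                if labels[q] != -2:
                    continue
                labels[q] = cluster_id
                if len(neighbours[q]) >= min_pts:  # q is also a core object: keep expanding from it
                    frontier.extend(neighbours[q])
        return labels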
(Figure: DBSCAN illustration with MinPts = 5, showing core points of a cluster C1 and the points P and P1.)
When DBSCAN Does NOT Work Well
(Figure: the original points and the corresponding DBSCAN result.)
DBSCAN: Sensitive to Parameters
Density Based Clustering: Discussion
• Advantages
• Clusters can have arbitrary shape and size
• Number of clusters is determined automatically
• Can separate clusters from surrounding noise
• Disadvantages
• Input parameters may be difficult to determine
• In some situations very sensitive to input parameter setting
Happy Learning!
Thank you!