Clustering
Slides are compiled from various online sources, with grateful acknowledgement to all those who made the content freely available on the web.
Outline
• What is clustering
• Challenges and applications
• Different types of clustering techniques
Clustering
• Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
• Clustering is a technique for finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• Clustering is often called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as would be the case in supervised learning.
An illustration
• The data set has three natural groups of data points, i.e., 3 natural clusters.
Notion of a Cluster can be Ambiguous
What Is Good Clustering?
• A good clustering method will produce high-quality clusters with
  • high intra-class similarity
  • low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
Applications
• Business Intelligence
  • Organize customers into groups and develop business strategies for enhanced customer relationship management.
• Image pattern recognition
  • Cluster hand-written digits.
  • Image compression: represent the image with a limited number of colors.
• Web search
  • Organize the search results into groups and present the results in a concise and easily accessible way.
• Biology
  • Determine the group structure in, e.g., a population of cells or microarray gene expression data.
• Security
  • Identify fraudulent behavior.
• Many others
Requirements of Clustering
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• Ability to handle high-dimensional data
• Incorporation of user-specified constraints
• Interpretability and usability
Major Clustering Approaches
• Partitioning algorithms: Construct various partitions and then evaluate them by some criterion.
• Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion.
• Grid-based algorithms: Partition the data space into a finite number of cells to form a grid structure and then form clusters from the cells in the grid structure. Clusters correspond to regions that are denser in data points than their surroundings.
• Density-based algorithms: Grow clusters from dense regions of the data space that are separated by regions of lower density (covered later with DBSCAN).
Partitioning Clustering Approach
• Partitioning algorithms construct a partition of a database of N objects into a set of K clusters.
• The partitioning clustering algorithm usually adopts the iterative optimization paradigm.
  • It starts with an initial partition and uses an iterative control strategy.
  • It tries swapping data points to see if such a swapping improves the quality of the clustering.
  • When swapping no longer yields any improvement in the clustering, it has found a locally optimal partitioning.
• In principle, the optimal partition is achieved by minimizing the sum of squared distances from each point to the "representative object" m_k of its cluster C_k:

    E = Σ_{k=1..K} Σ_{x ∈ C_k} d(x, m_k)²
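As a concrete reading of this criterion, here is a minimal sketch (in Python with numpy; the function name and the use of Euclidean distance are assumptions of this note, not something prescribed by the slides) that evaluates E for a given assignment of points to representative objects:

    import numpy as np

    def clustering_cost(X, labels, centroids):
        """Sum of squared Euclidean distances from each point to the
        representative object (centroid) of its cluster, i.e. the criterion E."""
        diffs = X - centroids[labels]      # vector from each point to its own centroid
        return float((diffs ** 2).sum())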
K-means algorithm
• Given the cluster number K, the K-means algorithm is carried out in three steps after initialization:
  • Initialization: set seed points (randomly)
  1. Assign each object to the cluster of the nearest seed point, measured with a specific distance metric.
  2. Compute new seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., the mean point, of the cluster).
  3. Go back to Step 1; stop when there are no more new assignments (i.e., membership in each cluster no longer changes).
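The steps above translate almost line for line into code. The following is a minimal sketch, assuming numpy, Euclidean distance, and randomly chosen data points as the initial seeds; the function name and defaults are illustrative, not part of the slides:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Minimal K-means: assign points to the nearest centroid, recompute
        centroids as cluster means, stop when memberships no longer change."""
        rng = np.random.default_rng(seed)
        # Initialization: pick k distinct data points as the initial seed centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = np.full(len(X), -1)
        for _ in range(max_iter):
            # Step 1: assign each object to the cluster of the nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            # Step 3: stop when no membership changes
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
            # Step 2: recompute each centroid as the mean of its current members
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = X[labels == j].mean(axis=0)
        return labels, centroids

    # The four "medicine" objects from the example on the following slides (A, B, C, D):
    X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
    labels, centroids = kmeans(X, k=2)
    print(labels)      # e.g. [0 0 1 1], i.e. {A, B} and {C, D}
    print(centroids)   # the centroids converge to (1.5, 1.0) and (4.5, 3.5), possibly in the other order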
K-means - Example
• Problem:
  • Suppose we have 4 types of medicines and each has two attributes (weight index and pH). Our goal is to group these objects into K = 2 groups of medicines.

    Medicine   Weight index   pH
    A          1              1
    B          2              1
    C          4              3
    D          5              4

  (Figure: the four objects plotted in the weight index vs. pH plane.)
K-means - Example
• Step 1: Use initial seed points for partitioning

    c1 = A = (1, 1),  c2 = B = (2, 1)

    d(D, c1) = √((5 − 1)² + (4 − 1)²) = 5
    d(D, c2) = √((5 − 2)² + (4 − 1)²) = 4.24

  Every object is assigned to the cluster of its nearest seed: A joins c1, while B, C and D join c2.
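The same distance computation in code (a tiny sketch using Python's standard math module; the variable names are only for illustration):

    import math

    A, B, C, D = (1, 1), (2, 1), (4, 3), (5, 4)
    c1, c2 = A, B                      # initial seed points

    def dist(p, q):
        """Euclidean distance between two 2-D points."""
        return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

    print(dist(D, c1))                 # 5.0
    print(round(dist(D, c2), 2))       # 4.24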
K-means - Example
• Step 2: Compute new centroids of the current partition
  Knowing the members of each cluster, we compute the new centroid of each group based on these new memberships:

    c1 = (1, 1)
    c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)
K-means - Example
• Step 2: Renew membership based on new centroids
  With the new centroids, B is now closer to c1 = (1, 1) than to c2 = (11/3, 8/3), so the memberships become {A, B} and {C, D}.
K-means - Example
• Step 3: Repeat the first two steps until convergence
  Knowing the members of each cluster, we compute the new centroid of each group based on these new memberships:

    c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
    c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)
K-means - Example
• Step 3: Repeat the first two steps until convergence
  Re-assigning the objects with these centroids changes no memberships, so the algorithm terminates with clusters {A, B} and {C, D}.
Strengths of k-means
• Strengths:
• Simple: easy to understand and to implement
• Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
Weaknesses of k-means
• The algorithm is only applicable if the mean is defined.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
  • Outliers are data points that are very far away from other data points.
  • Outliers could be errors in the data recording or some special data points with very different values.
Weaknesses of k-means: Problems with outliers
Weaknesses of k-means: To deal with outliers
• One method is to remove, during the clustering process, data points that are much farther away from the centroids than other data points.
  • To be safe, we may want to monitor these possible outliers over a few iterations and only then decide whether to remove them.
• Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
  • The remaining data points are then assigned to the clusters by distance or similarity comparison, or by classification.
Weaknesses of k-means
• The algorithm is sensitive to initial seeds.
Weaknesses of k-means
• If we use different seeds, we can get good results. (There are methods to help choose good seeds.)
Weaknesses of k-means
• The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
Hierarchical Clustering
• Hierarchical clustering approach
  • A typical clustering analysis approach via partitioning the data set sequentially
  • Constructs nested partitions layer by layer by grouping objects into a tree of clusters (without the need to know the number of clusters in advance)
  • Uses a (generalised) distance matrix as the clustering criterion
• Agglomerative vs. divisive
  • Two sequential clustering strategies for constructing a tree of clusters
  • Agglomerative: a bottom-up strategy
    • Initially, each data object is in its own (atomic) cluster
    • Then merge these atomic clusters into larger and larger clusters
  • Divisive: a top-down strategy
    • Initially, all objects are in one single cluster
    • Then the cluster is subdivided into smaller and smaller clusters
Hierarchical Clustering
• Agglomerative approach
  Initialization: each object is a cluster.
  Iteration: merge the two clusters that are most similar to each other, until all objects are merged into a single cluster.
  (Figure: a and b merge into ab, d and e merge into de, de and c merge into cde, and finally ab and cde merge into abcde.)
Hierarchical Clustering
• Divisive approach
  Initialization: all objects stay in one cluster.
  Iteration: select a cluster and split it into two sub-clusters, until each leaf cluster contains only one object.
  (Figure: abcde splits into ab and cde, cde splits into c and de, and finally ab and de split into the single objects a, b, d and e.)
Dendrogram
• A binary tree that shows how clusters are merged/split hierarchically
• Each node in the tree is a cluster; each leaf node is a singleton cluster
Dendrogram
• A clustering of the data objects is obtained by cutting the dendrogram
at the desired level, then each connected component forms a cluster
How to Merge Clusters?
• How to measure the distance between clusters?
  • Single-link
  • Complete-link
  • Average-link
  • Centroid distance
How to Define Inter-Cluster Distance
• Single-link: the distance between two clusters is the minimum distance between any pair of data objects belonging to different clusters:
    d_min(Ci, Cj) = min{ d(p, q) : p ∈ Ci, q ∈ Cj }
• Complete-link: the distance between two clusters is the maximum distance between any pair of data objects belonging to different clusters:
    d_max(Ci, Cj) = max{ d(p, q) : p ∈ Ci, q ∈ Cj }
How to Define Inter-Cluster Distance
• Average-link: the distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters:
    d_avg(Ci, Cj) = avg{ d(p, q) : p ∈ Ci, q ∈ Cj }
How to Define Inter-Cluster Distance
• Centroid distance: the distance between two clusters is the distance between their centroids:
    d_mean(Ci, Cj) = d(mi, mj), where mi and mj are the means of Ci and Cj
Cluster Distance Measures
Example: Given a data set of five objects characterized by a single continuous feature, assume that there are two clusters: C1: {a, b} and C2: {c, d, e}.

    Object:    a   b   c   d   e
    Feature:   1   2   4   5   6

1. Calculate the distance matrix.

         a   b   c   d   e
    a    0   1   3   4   5
    b    1   0   2   3   4
    c    3   2   0   1   2
    d    4   3   1   0   1
    e    5   4   2   1   0

2. Calculate the three cluster distances between C1 and C2.

  Single link:
    dist(C1, C2) = min{ d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e) }
                 = min{ 3, 4, 5, 2, 3, 4 } = 2

  Complete link:
    dist(C1, C2) = max{ d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e) }
                 = max{ 3, 4, 5, 2, 3, 4 } = 5

  Average link:
    dist(C1, C2) = ( d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e) ) / 6
                 = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21 / 6 = 3.5
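The same three cluster distances can be checked with a few lines of code (a small sketch in plain Python; the variable names are illustrative):

    # Worked example from this slide: five objects with a single feature.
    points = {"a": 1.0, "b": 2.0, "c": 4.0, "d": 5.0, "e": 6.0}
    C1, C2 = ["a", "b"], ["c", "d", "e"]

    # All pairwise distances between objects of different clusters (1-D: absolute difference)
    pair_dists = [abs(points[p] - points[q]) for p in C1 for q in C2]

    print(min(pair_dists))                       # single link   -> 2.0
    print(max(pair_dists))                       # complete link -> 5.0
    print(sum(pair_dists) / len(pair_dists))     # average link  -> 3.5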
Agglomerative Algorithm
• The agglomerative algorithm is carried out in three steps:
  1) Convert all object features into a distance matrix.
  2) Set each object as a cluster (thus, if we have N objects, we will have N clusters at the beginning).
  3) Repeat until the number of clusters is one (or a known number of clusters):
     ▪ Merge the two closest clusters.
     ▪ Update the distance matrix.
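These three steps can be sketched directly in code. The following is a minimal, unoptimized single-link version (assuming numpy and Euclidean distance; nothing here is prescribed by the slides, and instead of literally updating a distance matrix it recomputes single-link distances from the cluster members each round, which keeps the sketch short at the cost of efficiency):

    import numpy as np

    def agglomerative_single_link(X):
        """Return the sequence of merges (members_i, members_j, distance)
        produced by single-link agglomerative clustering."""
        # Step 1: convert object features into a distance matrix
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        # Step 2: each object starts as its own cluster
        clusters = [[i] for i in range(len(X))]
        merges = []
        # Step 3: repeatedly merge the two closest clusters until one remains
        while len(clusters) > 1:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    # single link: minimum distance between any pair of members
                    d = min(D[p, q] for p in clusters[i] for q in clusters[j])
                    if best is None or d < best[2]:
                        best = (i, j, d)
            i, j, d = best
            merges.append((clusters[i], clusters[j], d))
            clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
            del clusters[j]
        return merges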
Example
• Problem: clustering analysis with the agglomerative algorithm.
  (The slide shows the data matrix of the six objects A–F, the Euclidean distance formula, and the resulting distance matrix.)
Example
• Merge two closest clusters (iteration 1)
Example
• Update distance matrix (iteration 1)
Example
• Merge two closest clusters (iteration 2)
Example
• Update distance matrix (iteration 2)
Example
• Merge two closest clusters/update distance matrix (iteration 3)
Example
• Merge two closest clusters/update distance matrix (iteration 4)
Example
• Final result (meeting termination condition)
Example
• Dendrogram tree representation
  (Figure: dendrogram with the objects on the horizontal axis and the lifetime, i.e. merge distance, on the vertical axis.)
  1. In the beginning we have 6 clusters: A, B, C, D, E and F.
  2. We merge clusters D and F into cluster (D, F) at distance 0.50.
  3. We merge cluster A and cluster B into cluster (A, B) at distance 0.71.
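In practice a dendrogram like this can be produced with a library such as SciPy. The sketch below is an assumption of this note rather than anything the slides use, and it builds a single-link dendrogram for the earlier five-object example a–e (the raw coordinates of the A–F data are not given in the text):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    # Five 1-D objects from the "Cluster Distance Measures" example
    X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])

    Z = linkage(X, method="single")          # single-link agglomerative clustering
    dendrogram(Z, labels=["a", "b", "c", "d", "e"])
    plt.ylabel("lifetime (merge distance)")
    plt.show()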
Which Distance Measure is Better?
• Each method has both advantages and disadvantages; the choice is application-dependent. Single-link and complete-link are the most common methods.
• Single-link
  • Can find irregularly shaped clusters
  • Sensitive to outliers; suffers from the so-called chaining effect
  • To merge two groups, only one pair of points needs to be close, irrespective of all others; therefore clusters can be too spread out and not compact enough
• Complete-link, average-link, and centroid distance
  • Robust to outliers
  • Tend to break large clusters
  • Prefer spherical clusters
Density-Based Clustering Methods
• Clustering based on local connectivity and density functions
• Basic idea
  • Clusters are dense regions in the data space, separated by regions of lower object density
  • A cluster is defined as a maximal set of density-connected points
  • Each cluster has a considerably higher density of points than the area outside the cluster
• Major features:
  • Discovers clusters of arbitrary shape
  • Handles noise
  • Needs only one scan of the data
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Two global parameters:
  • Eps: maximum radius of the neighbourhood
  • MinPts: minimum number of points in the Eps-neighbourhood of a point
• Density = number of points within a specified radius Eps
• A point is a core point if it has at least a specified number of points (MinPts) within Eps
  • These are points in the interior of a cluster
• A border point has fewer than MinPts within Eps, but is in the neighbourhood of a core point
• A noise point is any point that is neither a core point nor a border point
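These three definitions map directly onto a small helper. The sketch below (Python with numpy, illustrative names, and the convention that a point counts itself in its Eps-neighbourhood; all of these are assumptions of this note) labels every point as core, border, or noise:

    import numpy as np

    def classify_points(X, eps, min_pts):
        """Label each point as 'core', 'border', or 'noise'."""
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        in_eps = D <= eps                       # Eps-neighbourhood membership (includes the point itself)
        core = in_eps.sum(axis=1) >= min_pts    # core point: at least MinPts neighbours within Eps
        labels = []
        for i in range(len(X)):
            if core[i]:
                labels.append("core")
            elif core[in_eps[i]].any():         # border point: in the Eps-neighbourhood of some core point
                labels.append("border")
            else:
                labels.append("noise")
        return labels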
Density-reachability
• An object q is directly density-reachable from an object p if p is a core object and q is in p's Eps-neighbourhood.
Density-reachability
• Density-Reachable (directly and indirectly):
• A point p is directly density-reachable from p2
• p2 is directly density-reachable from p1
• p1 is directly density-reachable from q
• p ← p2 ← p1 ←q form a chain
Density-Connectivity
• A pair of points p and q are density-connected if they are commonly density-reachable from some point o.
• Density-connectivity is symmetric.
DBSCAN
• Cluster: Given a data set D and parameters Eps and MinPts, a cluster C is a subset of D satisfying the following conditions:
  • For all points p, q: if p ∈ C and q is density-reachable from p with respect to Eps and MinPts, then q ∈ C (Maximality)
  • For all points p, q ∈ C: p is density-connected to q with respect to Eps and MinPts (Connectivity)
• Note: a cluster contains core points as well as border points.
DBSCAN-Algorithm
DBSCAN-Example
• Parameters
  • Eps = 2 cm
  • MinPts = 3

    for each o ∈ D do
        if o is not yet classified then
            if o is a core object then
                collect all objects density-reachable from o
                and assign them to a new cluster
            else
                assign o to NOISE
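The pseudocode above can be fleshed out into a runnable sketch. The version below is illustrative only (assuming numpy, Euclidean distance, and the label convention -1 = noise; it is not taken from the slides and is written for clarity rather than efficiency):

    import numpy as np

    def dbscan(X, eps, min_pts):
        """Minimal DBSCAN: for each unclassified point, if it is a core object,
        collect everything density-reachable from it into a new cluster;
        otherwise mark it (tentatively) as noise."""
        n = len(X)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        neighbours = [np.where(D[i] <= eps)[0] for i in range(n)]
        labels = np.full(n, -2)                    # -2 = not yet classified, -1 = noise
        cluster_id = -1

        for o in range(n):
            if labels[o] != -2:
                continue
            if len(neighbours[o]) < min_pts:       # o is not a core object
                labels[o] = -1                     # noise (may later turn out to be a border point)
                continue
            cluster_id += 1                        # o is a core object: start a new cluster
            labels[o] = cluster_id
            frontier = list(neighbours[o])
            while frontier:                        # expand the cluster
                q = frontier.pop()
                if labels[q] == -1:                # previously noise -> border point of this cluster
                    labels[q] = cluster_id
                if labels[q] != -2:
                    continue
                labels[q] = cluster_id
                if len(neighbours[q]) >= min_pts:  # q is also a core object: keep expanding from it
                    frontier.extend(neighbours[q])
        return labels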
(Figure: DBSCAN illustration with MinPts = 5, showing core points of a cluster C1 and the points P and P1.)
When DBSCAN Does NOT Work Well
(Figure: the original points and the corresponding DBSCAN result.)
DBSCAN: Sensitive to Parameters
Density Based Clustering: Discussion
• Advantages
• Clusters can have arbitrary shape and size
• Number of clusters is determined automatically
• Can separate clusters from surrounding noise
• Disadvantages
• Input parameters may be difficult to determine
• In some situations very sensitive to input parameter setting
Happy Learning!
Thank you!