Clustering Part2

Balanced Iterative Reducing and Clustering Using Hierarchies

(BIRCH)
• Agglomerative clustering method designed for clustering large amounts of numerical
data
• It introduces two concepts:
• Clustering feature (CF)
• Clustering feature tree (CF tree)
• These structures help the clustering method achieve good speed and scalability
on large databases.

• What does the BIRCH algorithm try to solve?

• Most existing algorithms DO NOT consider the case that a dataset can be
too large to fit in main memory
• They DO NOT concentrate on minimizing the number of scans of the dataset,
so I/O costs are very high
• The complexity of BIRCH is O(n), where n is the number of objects to be
clustered.
BIRCH: Key Components
• Clustering Feature (CF)
• A summary of the statistics for a given cluster: the 0th, 1st and 2nd moments
of the cluster from the statistical point of view
• A CF entry has sufficient information to calculate the centroid, radius,
diameter and many other distance measures
• The additivity theorem allows sub-clusters to be merged incrementally
• CF-Tree
• A height-balanced tree
• Two parameters:
• the number of entries in each node
• the diameter of all entries in a leaf node
• Leaf nodes are connected via prev and next pointers

Clustering Feature
Given a cluster of N data points {x1, …, xN}, its clustering feature is the triple
CF = (N, LS, SS)
where N is the number of points, LS = x1 + … + xN is the linear sum of the points,
and SS = x1² + … + xN² is their square sum (both computed per dimension for
multi-dimensional data).
Distance Measures
From CF = (N, LS, SS) the following can be computed directly, for example:
• Centroid: c = LS / N
• Radius: R = sqrt(SS/N − (LS/N)²), the average distance of member points from
the centroid
• Diameter: D = sqrt((2·N·SS − 2·LS²) / (N·(N−1))), the average pairwise distance
between points in the cluster
Clustering Feature
• Suppose cluster C1 has CF1 = (N1, LS1, SS1) and cluster C2 has CF2 = (N2, LS2, SS2)
• If we merge C1 with C2, the CF for the merged cluster C is
CF = CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
• Why CF?
• Summarized info for a single cluster
• Summarized info for two merged clusters, e.g.
CF3 = CF1 + CF2 = 〈3+3, (9+35, 10+36), (29+417, 38+440)〉 = 〈6, (44, 46), (446, 478)〉
• Additivity theorem: CF vectors add component-wise, so sub-clusters can be merged
incrementally (sketched below)
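A minimal sketch of a clustering feature in Python (class and method names are our own, written for the 1-d points of the worked example below; vectors work the same per coordinate):

import math

class ClusteringFeature:
    """CF = (N, LS, SS): count, linear sum and square sum of a cluster's points."""
    def __init__(self, n=0, ls=0.0, ss=0.0):
        self.n, self.ls, self.ss = n, ls, ss

    def add_point(self, x):
        # Absorbing a point only updates the three sufficient statistics.
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        # Additivity theorem: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2).
        return ClusteringFeature(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R = sqrt(SS/N - (LS/N)^2), the average distance from the centroid.
        return math.sqrt(max(self.ss / self.n - (self.ls / self.n) ** 2, 0.0))

For the points 0.5 and 0.25 from the example below, radius() returns 0.125, which matches the R ≈ 0.126 shown on the slides up to the rounding of SS.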

CF-Tree
• A CF tree is a height-balanced tree with two parameters:
branching factor B and threshold T.
• Branching factor
• B: each non-leaf node contains at most B entries of the form [CFi,
childi], where i = 1, 2, …, B, and childi is a pointer to its i-th child
node. CFi is the CF of the sub-cluster represented by childi.
• L: a leaf node contains at most L entries, each of the form [CFi],
where i = 1, 2, …, L. In addition, each leaf node has two pointers,
"prev" and "next", which are used to chain all leaf nodes together.
• A leaf node also represents a cluster made up of all the sub-clusters
represented by its entries.
• Threshold T
• The diameter (alternatively, the radius) of all entries in a leaf node is
at most T
• Leaf nodes are connected via prev and next pointers (see the node sketch below)
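A minimal sketch of this node layout in Python (class and field names are our own, not from the BIRCH paper):

class CFNode:
    """One node of a CF tree."""
    def __init__(self, is_leaf, max_entries):
        self.is_leaf = is_leaf
        self.max_entries = max_entries  # B for non-leaf nodes, L for leaf nodes
        self.entries = []  # non-leaf: [CF_i, child_i] pairs; leaf: CF_i entries only
        self.prev = None   # leaf nodes are chained via "prev"/"next" for fast scans
        self.next = None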

Example of CF Tree
(figure: a CF tree with B = 6 and L = 5; the root holds entries CF1, CF2, CF3, …, CF6
with child pointers child1 … child6, a non-leaf node holds CF9, CF10, CF11, …, CF13
with child pointers child1 … child5, and the leaf nodes, e.g. [CF90, CF91, …, CF94]
and [CF95, CF96, …, CF98], are chained through their prev and next pointers)
CF Tree Insertion
• Starting from the root, descend the tree by choosing at each level the child whose
CF is closest to the new point (under the chosen distance measure).
• At the leaf, find the closest entry and test whether absorbing the point keeps its
diameter (or radius) within the threshold T; if so, update that entry's CF.
• Otherwise, add a new entry for the point; if the leaf then exceeds L entries, split it.
• Update the CFs on the path back to the root; split non-leaf nodes that exceed B
entries, growing the tree in height when the root splits.
(a sketch of the leaf-level step follows)
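A sketch of the leaf-level step in Python (the helpers radius and try_insert_into_leaf are our own, written for the 1-d values and the T/L parameters of the worked example that follows):

import math

def radius(n, ls, ss):
    # R = sqrt(SS/N - (LS/N)^2) for an (n, ls, ss) clustering feature.
    return math.sqrt(max(ss / n - (ls / n) ** 2, 0.0))

def try_insert_into_leaf(leaf, x, T, L):
    """leaf: list of (n, ls, ss) entries. Returns True if x was placed without a split."""
    if leaf:
        # 1. Find the entry whose centroid is closest to x.
        best = min(range(len(leaf)), key=lambda i: abs(leaf[i][1] / leaf[i][0] - x))
        n, ls, ss = leaf[best]
        n2, ls2, ss2 = n + 1, ls + x, ss + x * x
        # 2. Absorb x if the updated entry still satisfies radius <= T.
        if radius(n2, ls2, ss2) <= T:
            leaf[best] = (n2, ls2, ss2)
            return True
    # 3. Otherwise open a new entry, provided the leaf has room (< L entries).
    if len(leaf) < L:
        leaf.append((1, x, x * x))
        return True
    return False  # the caller must split the leaf and update CFs up to the root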
Example
• Parameters: T = 0.15, L = 2, B = 2
• Input sequence: 0.5, 0.25, 0, 0.65, 1, 1.4, 1.1

1. Insert 0.5: Root-Node 0 holds CF1: n=1, LS=0.5, SS=0.25 (Leaf1: R=0).
2. Insert 0.25: the closest entry is CF1, and absorbing keeps the radius within T,
so CF1 becomes n=2, LS=0.75, SS=0.313 (Leaf1: R=0.126).
3. Insert 0: absorbing into CF1 would push the radius above T, so a new entry
CF2: n=1, LS=0, SS=0 is opened (Leaf2: R=0).
4. Insert 0.65: the node already holds L=2 entries, so it splits and the tree grows
a level. Root-Node 0 now holds CF12: n=3, LS=0.75, SS=0.313 (summarizing Node 1,
which holds CF1 and CF2) and CF3: n=1, LS=0.65, SS=0.423 (Node 2, Leaf3: R=0).
5. Insert 1: Node 2 gains a new entry CF4: n=1, LS=1, SS=1 (Leaf4: R=0); the root
summary becomes CF34: n=2, LS=1.65, SS=1.423.
6. Insert 1.4: Node 2 is full, so it splits into Node 2.1 (holding CF3 and CF4,
summarized by CF34: n=2, LS=1.65, SS=1.423) and Node 2.2 (CF5: n=1, LS=1.4,
SS=1.96; Leaf5: R=0); the root summary becomes CF345: n=3, LS=3.05, SS=3.383.
7. Do height balancing and continue with the remaining value 1.1.
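Running the leaf-level sketch from earlier on the first input values reproduces these steps (assuming the try_insert_into_leaf helper defined above):

leaf = []
for x in [0.5, 0.25, 0]:
    try_insert_into_leaf(leaf, x, T=0.15, L=2)
# leaf == [(2, 0.75, 0.3125), (1, 0, 0)]: 0.25 was absorbed into the first entry,
# 0 opened a second entry, and the next value 0.65 returns False (the radius would
# grow to about 0.165 > T with the leaf already full), forcing the split shown above.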
CLUSTERING THE SUB-CLUSTERS
• Once the CF tree is built, the leaf entries (sub-clusters) are clustered globally:
an existing clustering algorithm, e.g. agglomerative clustering, is applied to the
sub-clusters' CFs, treating each entry as a single point weighted by its count n.
Clustering Using REpresentatives (CURE)
• Drawbacks of traditional clustering algorithms
• The centroid-based approach (using dmean) considers
only one point as representative of a cluster: the
cluster centroid.
• The all-points approach (based on dmin) makes the
clustering algorithm extremely sensitive to outliers.
• Neither works well for non-spherical or
arbitrarily shaped clusters.
CURE: Approach
• CURE is positioned between the centroid-based and all-points extremes.
• A constant number c of well-scattered points is used to capture the
shape and extent of a cluster.
• These points are shrunk towards the centroid of the cluster by a factor α.
• The scattered, shrunk points are then used as representatives of
the cluster.
CURE: Approach
• The scattered-points approach alleviates the
shortcomings of both the centroid-based and
all-points methods.
• Since multiple representatives are used, the
splitting of large clusters is avoided.
• Multiple representatives allow the discovery of
non-spherical clusters.
• The shrinking phase affects outliers more than
other points, since their distance from the centroid
is decreased more than that of regular points.
CURE: Approach
• Initially, since all points are in separate clusters, each cluster is represented
by the single point it contains.
• Clusters are merged until they contain at least c points.
• The first scattered point in a cluster is the one farthest away from the
cluster's centroid.
• Each further scattered point is chosen so that its distance from the previously
chosen scattered points is maximal.
• Once c well-scattered points have been selected, they are shrunk by the factor α
(r = p + α·(mean − p)).
• After clusters have c representatives, the distance between two clusters is the
distance between the closest pair of representatives, one from each cluster
(see the sketch below).
• Every time two clusters are merged, their representatives are re-calculated.
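A sketch of the representative selection and shrinking in Python (the function name is our own; c and α are the parameters described above):

import math

def cure_representatives(points, c, alpha):
    """Pick up to c well-scattered points of a cluster, then shrink them toward its centroid."""
    d = len(points[0])
    mean = tuple(sum(p[i] for p in points) / len(points) for i in range(d))
    # The first scattered point is the one farthest from the centroid.
    scattered = [max(points, key=lambda p: math.dist(p, mean))]
    # Each further point maximizes its distance to the already chosen ones.
    while len(scattered) < min(c, len(points)):
        scattered.append(max((p for p in points if p not in scattered),
                             key=lambda p: min(math.dist(p, s) for s in scattered)))
    # Shrink toward the centroid: r = p + alpha * (mean - p).
    return [tuple(p[i] + alpha * (mean[i] - p[i]) for i in range(d))
            for p in scattered]

The inter-cluster distance described above is then simply min(math.dist(r1, r2) for r1 in reps_a for r2 in reps_b) over the two representative sets.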

Density-Based Clustering Methods
• Clustering based on local
connectivity and density functions
• Basic idea
• Clusters are dense regions in the data
space, separated by regions of lower
object density
• A cluster is defined as a maximal set of
density-connected points
• Each cluster has a considerably
higher density of points than its
surroundings
• Major features:
• Discovers clusters of arbitrary shape
• Requires only one scan of the data
Density-Based Spatial Clustering of Applications with Noise
(DBSCAN)
• Two global parameters:
• Eps: maximum radius of the neighbourhood
• MinPts: minimum number of points in an
Eps-neighbourhood of a point
• Density = number of points within a
specified radius r (Eps)
• A point is a core point if it has at least
a specified number of points (MinPts)
within Eps
• These are points in the interior of a
cluster
• A border point has fewer than MinPts points
within Eps, but is in the neighborhood of
a core point
• A noise point is any point that is neither a
core point nor a border point (a classification
sketch follows below).
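A sketch of this classification in Python (region_query and classify are our own helper names; we use the common "at least MinPts" convention):

import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

def classify(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise'."""
    is_core = [len(region_query(points, i, eps)) >= min_pts for i in range(len(points))]
    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in region_query(points, i, eps)):
            labels.append("border")  # not core itself, but in a core point's neighborhood
        else:
            labels.append("noise")
    return labels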

Density-reachability
• An object q is directly density-reachable from an
object p if p is a core object and q is in p's
Eps-neighborhood

• Example (only p is a core object):
• q is directly density-reachable from p
• p is not directly density-reachable from q, because q is not a core object
• Density-reachability is asymmetric
Density-reachability
• An object is (indirectly) density-reachable from another if it can be reached
through a chain of directly density-reachable objects

• Example: p is (indirectly) density-reachable from q
• q is not density-reachable from p
Density-Connectivity
• A pair of points p and q are
density-connected if they are both
density-reachable from a common point o
• Density-connectivity is symmetric
DBSCAN
• A cluster is defined as a maximal set of density-connected points
• Points not belonging to any cluster are noise
DBSCAN-Algorithm
Input: the data set D
Parameters: ε, MinPts

for each object p in D
    if p is a core object and not processed then
        C = retrieve all objects density-reachable from p
        mark all objects in C as processed
        report C as a cluster
    else
        mark p as outlier
    end if
end for
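A compact runnable version of this loop in Python (a sketch using a brute-force neighborhood search; real implementations use spatial indexes):

import math

def dbscan(points, eps, min_pts):
    """Return a cluster id (0, 1, ...) per point, or -1 for noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # provisional: may later turn out to be a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:  # expand with everything density-reachable from i
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not core
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point, expand through it
                queue.extend(j_neighbors)
        cluster += 1
    return labels

For example, dbscan([(0, 0), (0, 1), (1, 0), (5, 5)], eps=1.5, min_pts=3) returns [0, 0, 0, -1]: the three nearby points form one cluster and (5, 5) is noise.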

DBSCAN-Example
• Parameters
• ε = 2 cm
• MinPts = 3

for each o ∈ D do
    if o is not yet classified then
        if o is a core-object then
            collect all objects density-reachable from o
            and assign them to a new cluster
        else
            assign o to NOISE
Example: growing a cluster with MinPts = 5
(figure: ε-neighborhoods of points p and p1 while cluster C1 is expanded)

Seeding a cluster from a point p:
1. Check the ε-neighborhood of p.
2. If p has fewer than MinPts neighbors, mark p as an outlier and continue with
the next object.
3. Otherwise, mark p as processed and put all of its neighbors in cluster C.

Growing cluster C:
1. Check the unprocessed objects in C.
2. If there is no core object, return C.
3. Otherwise, randomly pick one core object p1, mark p1 as processed, and put
all unprocessed neighbors of p1 in cluster C.
When DBSCAN Does NOT Work Well
(figure: the original points and the clustering obtained with MinPts = 4, Eps = 9.75)

• Cannot handle varying densities
• Sensitive to parameters
DBSCAN: Sensitive to Parameters

Density-Based Clustering: Discussion
• Advantages
• Clusters can have arbitrary shape and size
• Number of clusters is determined automatically
• Can separate clusters from surrounding noise
• Disadvantages
• Input parameters may be difficult to determine
• In some situations very sensitive to input parameter setting

Validation Measures for Clustering
• External
• use external information not present in the data, e.g.:
• Dice coefficient (see the sketch below)
• Similarity
• Accuracy
• Internal
• evaluate the goodness of a clustering structure without reference to
external information
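As an example of an external measure, a sketch of the Dice coefficient computed over same-cluster point pairs (one common convention; the helper name is our own):

def dice_coefficient(labels_pred, labels_true):
    """Dice = 2|A ∩ B| / (|A| + |B|), where A and B are the sets of point pairs
    placed in the same cluster by each labeling."""
    def same_pairs(labels):
        n = len(labels)
        return {(i, j) for i in range(n) for j in range(i + 1, n)
                if labels[i] == labels[j]}
    a, b = same_pairs(labels_pred), same_pairs(labels_true)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

For example, dice_coefficient([0, 0, 1, 1], [1, 1, 0, 0]) returns 1.0, since cluster ids only name the groups.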
