Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH)
• Agglomerative clustering designed for clustering large amounts of numerical data
• It introduces two concepts:
• Clustering feature
• Clustering feature tree (CF tree)
• These structures help the clustering method achieve good speed and scalability
in large databases.
• What does the BIRCH algorithm try to solve?
• Most of the existing algorithms DO NOT consider the case that datasets can be
too large to fit in main memory
• They DO NOT concentrate on minimizing the number of scans of the dataset
• I/O costs are very high
• The complexity of BIRCH is O(n) where n is the number of objects to be
clustered.
BIRCH: Key Components
• Clustering Feature (CF)
• Summary of the statistics for a given cluster: the 0-th, 1st and 2nd moments
of the cluster from the statistical point of view
• A CF entry has sufficient information to calculate the centroid, radius,
diameter and many other distance measures
• The additivity theorem allows us to merge sub-clusters incrementally
• CF-Tree
• height-balanced tree
• two parameters:
• the maximum number of entries in each node (branching factor)
• the maximum diameter (or radius) of all entries in a leaf node (threshold)
• Leaf nodes are connected via prev and next pointers
Clustering Feature
• A Clustering Feature is a triple CF = (N, LS, SS), where N is the number of data points in the sub-cluster, LS is the linear sum of the points, and SS is the square sum of the points
Distance Measures
• Centroid: x0 = LS / N
• Radius (average distance of the member points from the centroid): R = sqrt( SS/N − (LS/N)² )
• Diameter (average pairwise distance between the member points): D = sqrt( (2·N·SS − 2·LS²) / (N·(N − 1)) )
• All of these can be computed from the CF triple alone, without revisiting the original points (a small computation sketch follows below)
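A minimal sketch of these computations for one-dimensional data (function names are illustrative, not from the BIRCH paper):

```python
from math import sqrt

def centroid(n, ls, ss):
    """Centroid of a sub-cluster summarized by CF = (n, LS, SS)."""
    return ls / n

def radius(n, ls, ss):
    """Average distance of the member points from the centroid."""
    return sqrt(max(ss / n - (ls / n) ** 2, 0.0))

def diameter(n, ls, ss):
    """Average pairwise distance between the member points."""
    if n < 2:
        return 0.0
    return sqrt(max((2 * n * ss - 2 * ls ** 2) / (n * (n - 1)), 0.0))

# Example: the sub-cluster {0.5, 0.25} used in the worked example later on
n, ls, ss = 2, 0.75, 0.3125
print(centroid(n, ls, ss))  # 0.375
print(radius(n, ls, ss))    # 0.125 (the slides show 0.126 because SS is rounded to 0.313 first)
print(diameter(n, ls, ss))  # 0.25
```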
Clustering Feature
• Suppose cluster C1 has CF1 = (N1, LS1, SS1) and cluster C2 has CF2 = (N2, LS2, SS2)
• If we merge C1 with C2, the CF of the merged cluster C3 is CF3 = CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2) (additivity theorem)
• Example: CF3 = CF1 + CF2 = 〈3+3, (9+35, 10+36), (29+417, 38+440)〉 = 〈6, (44, 46), (446, 478)〉
• Why CF? A CF is summarized info for a single cluster, and adding two CFs gives the summarized info for the merged cluster (verified in the sketch below)
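A quick check of the additivity theorem with the numbers from the example above (component-wise addition of the CF triples; the individual CF1/CF2 values are those implied by the sums):

```python
def merge_cf(cf1, cf2):
    """Merge two clustering features by component-wise addition."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

cf1 = (3, (9, 10), (29, 38))      # summarizes cluster C1
cf2 = (3, (35, 36), (417, 440))   # summarizes cluster C2
print(merge_cf(cf1, cf2))         # (6, (44, 46), (446, 478))
```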
CF-Tree
• A CF tree is a height-balanced tree with two parameters:
branching factor B and threshold T.
• Branching factor
• B: Each non-leaf node contains at most B entries of the form [CFi, childi], where i = 1, 2, …, B, and childi is a pointer to its i-th child node. CFi is the CF of the sub-cluster represented by childi.
• L: Each leaf node contains at most L entries, each of the form [CFi], where i = 1, 2, …, L. In addition, each leaf node has two pointers, "prev" and "next", which are used to chain all leaf nodes together.
• A leaf node also represents a cluster made up of all the sub-clusters represented by its entries.
• Threshold T
• The diameter (alternatively, the radius) of all entries in a leaf node is at most T
• Leaf nodes are connected via prev and next pointers
Example of CF Tree
[Figure: a CF tree with B = 6 and L = 5 — the root holds entries CF1 … CF6, each pointing to a child node; a non-leaf node holds entries CF9 … CF13; leaf nodes hold entries such as CF90 … CF94 and CF95 … CF98 and are chained together by prev/next pointers]
CF Tree Insertion
• Identify the appropriate leaf: starting from the root, recursively descend the tree by choosing the child whose CF is closest to the new entry
• Modify the leaf: if the closest leaf entry can absorb the new entry without its radius/diameter exceeding T, update that entry's CF; otherwise add a new entry to the leaf
• Split if necessary: if the leaf now has more than L entries, split it into two leaves
• Update the CFs on the path back to the root, splitting non-leaf nodes that exceed B entries (a simplified sketch of the absorb-or-create step follows below)
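The absorb-or-create decision at the leaf level can be sketched in a few lines. This is a simplified, hypothetical illustration for one-dimensional points that ignores the tree descent, node splitting, and the B/L limits:

```python
from math import sqrt

T = 0.15  # radius threshold used in the worked example that follows

def radius(n, ls, ss):
    """Average distance of the member points from the centroid."""
    return sqrt(max(ss / n - (ls / n) ** 2, 0.0))

def insert(leaf_entries, x):
    """Absorb x into the closest leaf entry if its radius stays <= T,
    otherwise start a new entry. Each entry is a dict with keys n, ls, ss."""
    best = min(leaf_entries, key=lambda e: abs(e["ls"] / e["n"] - x), default=None)
    if best is not None:
        n, ls, ss = best["n"] + 1, best["ls"] + x, best["ss"] + x * x
        if radius(n, ls, ss) <= T:                       # the point can be absorbed
            best.update(n=n, ls=ls, ss=ss)
            return leaf_entries
    leaf_entries.append({"n": 1, "ls": x, "ss": x * x})  # start a new sub-cluster
    return leaf_entries

entries = []
for x in [0.5, 0.25, 0, 0.65, 1, 1.4, 1.1]:
    insert(entries, x)
# Resulting sub-clusters: {0.5, 0.25}, {0}, {0.65}, {1, 1.1}, {1.4}
print(entries)
```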
Example
• Data points to insert (in order): 0.5, 0.25, 0, 0.65, 1, 1.4, 1.1
• Parameters: T = 0.15, L = 2, B = 2
• Insert 0.5:
Root-Node 0 holds CF1: n=1, LS=0.5, SS=0.25 (Leaf1: R=0)
Example
• Insert 0.25: it can be absorbed by CF1, since the radius of the updated entry stays below T
Root-Node 0 holds CF1: n=2, LS=0.75, SS=0.313 (Leaf1: R=0.126)
Example
• Insert 0: it is too far from CF1 (the radius would exceed T), so a new entry CF2 is created
Root-Node 0 holds CF1: n=2, LS=0.75, SS=0.313 and CF2: n=1, LS=0.0, SS=0 (Leaf1: R=0.126, Leaf2: R=0)
Example
• Insert 0.65: it cannot be absorbed by CF1 or CF2, and the root already holds B = 2 entries, so the root is split
Root-Node 0 holds CF12: n=3, LS=0.75, SS=0.313 and CF3: n=1, LS=0.65, SS=0.423
Node 1 holds CF1: n=2, LS=0.75, SS=0.313 and CF2: n=1, LS=0.0, SS=0 (Leaf1: R=0.126, Leaf2: R=0)
Node 2 holds CF3: n=1, LS=0.65, SS=0.423 (Leaf3: R=0)
Example
• Insert 1: it is too far from CF3, so a new entry CF4 is added in Node 2
Root-Node 0 holds CF12: n=3, LS=0.75, SS=0.313 and CF34: n=2, LS=1.65, SS=1.423
Node 1 holds CF1: n=2, LS=0.75, SS=0.313 and CF2: n=1, LS=0.0, SS=0 (Leaf1: R=0.126, Leaf2: R=0)
Node 2 holds CF3: n=1, LS=0.65, SS=0.423 and CF4: n=1, LS=1, SS=1 (Leaf3: R=0, Leaf4: R=0)
Example
• Insert 1.4: Node 2 is already full (two entries), so it is split into Node 2.1 and Node 2.2
Root-Node 0 holds CF12: n=3, LS=0.75, SS=0.313 and CF345: n=3, LS=3.05, SS=3.383
Node 1 holds CF1: n=2, LS=0.75, SS=0.313 and CF2: n=1, LS=0.0, SS=0 (Leaf1: R=0.126, Leaf2: R=0)
Node 2 holds CF34: n=2, LS=1.65, SS=1.423 and CF5: n=1, LS=1.4, SS=1.96
Node 2.1 holds CF3: n=1, LS=0.65, SS=0.423 and CF4: n=1, LS=1, SS=1; Node 2.2 holds CF5: n=1, LS=1.4, SS=1.96 (Leaf3, Leaf4, Leaf5: R=0)
• Do height balancing and continue with the remaining point (1.1)
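The CF statistics shown in this example can be reproduced directly from the member points of each sub-cluster; a small check using the radius formula from the Distance Measures slide:

```python
from math import sqrt

def cf(points):
    """Return (n, LS, SS, R) for a one-dimensional sub-cluster."""
    n, ls, ss = len(points), sum(points), sum(x * x for x in points)
    return n, ls, ss, sqrt(max(ss / n - (ls / n) ** 2, 0.0))

print(cf([0.5, 0.25]))  # ≈ (2, 0.75, 0.3125, 0.125); the slide rounds SS to 0.313
print(cf([0.0]))        # (1, 0.0, 0.0, 0.0)
print(cf([0.65, 1.0]))  # ≈ (2, 1.65, 1.4225, 0.175) — CF34 above
print(cf([1.4]))        # ≈ (1, 1.4, 1.96, 0.0)
```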
Clustering the Sub-Clusters
• Once the CF tree is built, each leaf entry is a compact summary (sub-cluster) of the data
• A global clustering algorithm (e.g., an agglomerative method) is then applied to the leaf CF entries, treating each entry's centroid as a single point, to produce the final clusters (see the sketch below)
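A minimal sketch of this final phase, assuming the leaf entries from the worked example above and using scikit-learn's AgglomerativeClustering on the sub-cluster centroids (the choice of algorithm and of two final clusters is illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Leaf CF entries (n, LS, SS) taken from the worked example above
leaf_cfs = [(2, 0.75, 0.3125), (1, 0.0, 0.0), (1, 0.65, 0.4225),
            (1, 1.0, 1.0), (1, 1.4, 1.96)]

# Represent each sub-cluster by its centroid LS / n (one-dimensional data here)
centroids = np.array([[ls / n] for n, ls, ss in leaf_cfs])

# Cluster the sub-cluster summaries instead of the raw points
labels = AgglomerativeClustering(n_clusters=2).fit_predict(centroids)
print(labels)   # e.g. [0 0 0 1 1] — the first three sub-clusters vs. the last two
```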
Clustering Using Representative (CURE)
• Drawbacks of Traditional Clustering Algorithms
• Centroid-based approach (using dmean) considers
only one point as representative of a cluster - the
cluster centroid.
• All-points approach (based on dmin) makes the
clustering algorithm extremely sensitive to outliers.
• Neither of these approaches works well for non-spherical or arbitrarily shaped clusters.
CURE: Approach
• CURE is positioned between centroid based and all point extremes.
• A constant number of well-scattered points is used to capture the shape and extent of a cluster.
• The points are shrunk towards the centroid of the cluster by a factor α.
• These well-scattered, shrunk points are used as the representatives of the cluster.
CURE: Approach
• The scattered-points approach alleviates the shortcomings of both the centroid-based and the all-points methods.
• Since multiple representatives are used, the splitting of large clusters is avoided.
• Multiple representatives allow for the discovery of non-spherical clusters.
• The shrinking phase will affect outliers more than
other points since their distance from the centroid
will be decreased more than that of regular points.
CURE: Approach
• Initially, since all points are in separate clusters, each cluster is defined by the single point it contains.
• Clusters are merged until they contain at least c points.
• The first scattered point in a cluster is the one that is farthest away from the cluster's centroid.
• Each further scattered point is chosen so that its distance from the previously chosen scattered points is maximal.
• Once c well-scattered points have been selected, they are shrunk towards the centroid by a factor α (r = p + α·(mean − p)).
• After clusters have c representatives, the distance between two clusters is the distance between the closest pair of representatives, one from each cluster.
• Every time two clusters are merged, their representatives are re-calculated (a sketch of the scattered-point selection and shrinking follows below).
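A sketch of the representative-selection and shrinking step described above (the function name and the sample points are illustrative):

```python
import numpy as np

def cure_representatives(points, c=4, alpha=0.5):
    """Pick c well-scattered points from a cluster and shrink them
    towards the centroid by a factor alpha (r = p + alpha * (mean - p))."""
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)

    # First scattered point: farthest from the centroid
    scattered = [points[np.argmax(np.linalg.norm(points - mean, axis=1))]]

    # Each further point maximizes its distance to the already chosen ones
    while len(scattered) < min(c, len(points)):
        dists = np.min(
            [np.linalg.norm(points - s, axis=1) for s in scattered], axis=0)
        scattered.append(points[np.argmax(dists)])

    # Shrink the scattered points towards the centroid
    return [p + alpha * (mean - p) for p in scattered]

reps = cure_representatives([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5]])
print(np.round(reps, 2))
```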
Density-Based Clustering Methods
• Clustering based on local
connectivity and density functions
• Basic idea
• Clusters are dense regions in the data
space, separated by regions of lower
object density
• A cluster is defined as a maximal set of
density connected points
• Each cluster has a considerably higher density of points than the area outside of the cluster
• Major features:
• Discover clusters of arbitrary shape
• One scan
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Two global parameters:
• Eps: Maximum radius of the neighbourhood
• MinPts: Minimum number of points in an
Eps-neighbourhood of that point
• Density = number of points within a
specified radius r (Eps)
• A point is a core point if it has more than
a specified number of points (MinPts)
within Eps
• These are points that are at the interior of a
cluster
• A border point has fewer than MinPts
within Eps, but is in the neighborhood of
a core point
• A noise point is any point that is neither a core point nor a border point (a sketch of this classification follows below).
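A brute-force sketch of this core/border/noise classification (a hypothetical helper, not the full DBSCAN algorithm; whether a point counts itself among its Eps-neighbors varies between formulations — here it is counted):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    X = np.asarray(X, dtype=float)
    # Neighborhood = all points within eps (including the point itself here)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps
    core = neighbors.sum(axis=1) >= min_pts

    labels = []
    for i in range(len(X)):
        if core[i]:
            labels.append("core")
        elif neighbors[i][core].any():        # within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5]]
print(classify_points(X, eps=1.5, min_pts=3))  # ['core', 'core', 'core', 'core', 'noise']
```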
Density-reachability
• An object q is directly density-reachable from
object p, if p is a core object and q is in p’s
Eps-neighborhood
• q is directly density-reachable from p
• p is not directly density-reachable from q
• Density-reachability is asymmetric
Density-reachability
• An object p is (indirectly) density-reachable from an object q if there is a chain of objects o1, …, on with o1 = q and on = p such that each oi+1 is directly density-reachable from oi
• p is (indirectly) density-reachable from q
• q is not density-reachable from p
Density-Connectivity
• A pair of points p and q are density-connected if they are both density-reachable from a common point o
• Density-connectivity is symmetric
DBSCAN
• A density-based cluster is a maximal set of density-connected points; objects not assigned to any cluster are treated as noise
DBSCAN-Algorithm
Input: The data set D
Parameter: ε, MinPts
For each object p in D
if p is a core object and not processed then
C = retrieve all objects density-reachable from p
mark all objects in C as processed
report C as a cluster
else mark p as outlier
end if
End For
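The pseudocode above can be turned into a compact runnable version; this minimal sketch uses a brute-force neighborhood search and assigns border points to the first cluster that reaches them:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """A compact DBSCAN sketch: -1 marks noise, 0..k-1 mark clusters."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighborhoods = [np.where(dists[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)                   # -1 = noise / unclassified
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    for p in range(n):
        if visited[p] or len(neighborhoods[p]) < min_pts:
            continue                          # skip non-core or processed objects
        # Grow a new cluster from the core object p
        queue = [p]
        visited[p] = True
        while queue:
            q = queue.pop()
            labels[q] = cluster_id
            if len(neighborhoods[q]) >= min_pts:      # q is a core object
                for r in neighborhoods[q]:
                    if not visited[r]:
                        visited[r] = True
                        queue.append(r)
        cluster_id += 1
    return labels

X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6], [3, 9]]
print(dbscan(X, eps=1.5, min_pts=3))   # two clusters of four points and one noise point
```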
DBSCAN-Example
• Parameter
• ε = 2 cm
• MinPts = 3
for each o ∈ D do
    if o is not yet classified then
        if o is a core-object then
            collect all objects density-reachable from o
            and assign them to a new cluster
        else
            assign o to NOISE
MinPts = 5
Starting from an object p:
1. Check the ε-neighborhood of p;
2. If p has fewer than MinPts neighbors, mark p as an outlier and continue with the next object;
3. Otherwise, mark p as processed and put all its neighbors in cluster C.
Then, to expand C:
1. Check the unprocessed objects in C;
2. If there is no core object, return C;
3. Otherwise, randomly pick one core object p1, mark p1 as processed, and put all unprocessed neighbors of p1 in cluster C.
When DBSCAN Does NOT Work Well
• Cannot handle varying densities
• Sensitive to parameters
[Figure: original points vs. the DBSCAN result with MinPts = 4, Eps = 9.75]
DBSCAN: Sensitive to Parameters
Density Based Clustering: Discussion
• Advantages
• Clusters can have arbitrary shape and size
• Number of clusters is determined automatically
• Can separate clusters from surrounding noise
• Disadvantages
• Input parameters may be difficult to determine
• In some situations very sensitive to input parameter setting
Validation Measures for Clustering
• External
• use external information not present in the data
• Internal
• evaluate the goodness of a clustering structure without reference to external information
Validation Measures for Clustering
• External
• Dice Coefficient
• Similarity
• Accuracy (a pairwise computation of such measures is sketched below)
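External measures such as the Dice coefficient and accuracy are often computed over pairs of points, comparing the clustering with ground-truth classes; a minimal sketch (the exact definitions used in the lecture may differ):

```python
from itertools import combinations

def pair_counts(labels_pred, labels_true):
    """Count point pairs on which the clustering and the ground truth
    agree / disagree (TP, FP, FN, TN)."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_cluster = labels_pred[i] == labels_pred[j]
        same_class = labels_true[i] == labels_true[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

pred = [0, 0, 0, 1, 1, 1]      # cluster labels produced by an algorithm
true = [0, 0, 1, 1, 1, 1]      # ground-truth class labels
tp, fp, fn, tn = pair_counts(pred, true)
print(2 * tp / (2 * tp + fp + fn))       # Dice coefficient over point pairs
print((tp + tn) / (tp + fp + fn + tn))   # pairwise accuracy (Rand index)
```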