Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH)
• Agglomerative clustering designed for clustering large amounts of numerical data
• It introduces two concepts:
• Clustering feature
• Clustering feature tree (CF tree)
• These structures help the clustering method achieve good speed and scalability
in large databases.
• What does the BIRCH algorithm try to solve?
• Most of the existing algorithms DO NOT consider the case that datasets can be
too large to fit in main memory
• They DO NOT concentrate on minimizing the number of scans of the dataset
• I/O costs are very high
• The complexity of BIRCH is O(n) where n is the number of objects to be
clustered.
BIRCH: Key Components
• Clustering Feature (CF)
• Summary of the statistics for a given cluster: the 0-th, 1st and 2nd moments
of the cluster from the statistical point of view
• A CF entry has sufficient information to calculate the centroid, radius,
diameter and many other distance measures
• The additivity theorem allows us to merge sub-clusters incrementally
• CF-Tree
• height-balanced tree
• two parameters:
• the maximum number of entries in each node (branching factor)
• the maximum diameter (or radius) of all entries in a leaf node (threshold)
• Leaf nodes are connected via prev and next pointers
Clustering Feature
• A Clustering Feature is a triple CF = (N, LS, SS), where N is the number of data points in the sub-cluster, LS is the linear sum of the points, and SS is the square sum of the points
Distance Measures
• Centroid: x0 = LS / N
• Radius (average distance of the member points from the centroid): R = sqrt( SS/N − (LS/N)² )
• Diameter (average pairwise distance between the member points): D = sqrt( (2·N·SS − 2·LS²) / (N·(N − 1)) )
• All of these can be computed from the CF triple alone, without revisiting the original points (a small computation sketch follows below)
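A minimal sketch of these computations for one-dimensional data (function names are illustrative, not from the BIRCH paper):

```python
from math import sqrt

def centroid(n, ls, ss):
    """Centroid of a sub-cluster summarized by CF = (n, LS, SS)."""
    return ls / n

def radius(n, ls, ss):
    """Average distance of the member points from the centroid."""
    return sqrt(max(ss / n - (ls / n) ** 2, 0.0))

def diameter(n, ls, ss):
    """Average pairwise distance between the member points."""
    if n < 2:
        return 0.0
    return sqrt(max((2 * n * ss - 2 * ls ** 2) / (n * (n - 1)), 0.0))

# Example: the sub-cluster {0.5, 0.25} used in the worked example later on
n, ls, ss = 2, 0.75, 0.3125
print(centroid(n, ls, ss))  # 0.375
print(radius(n, ls, ss))    # 0.125 (the slides show 0.126 because SS is rounded to 0.313 first)
print(diameter(n, ls, ss))  # 0.25
```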
Clustering Feature
• Suppose cluster C1 has CF1 = (N1, LS1, SS1) and cluster C2 has CF2 = (N2, LS2, SS2)
• If we merge C1 with C2, the CF of the merged cluster C3 is CF3 = CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2) (additivity theorem)
• Example: CF3 = CF1 + CF2 = 〈3+3, (9+35, 10+36), (29+417, 38+440)〉 = 〈6, (44, 46), (446, 478)〉
• Why CF? A CF is summarized info for a single cluster, and adding two CFs gives the summarized info for the merged cluster (verified in the sketch below)
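A quick check of the additivity theorem with the numbers from the example above (component-wise addition of the CF triples; the individual CF1/CF2 values are those implied by the sums):

```python
def merge_cf(cf1, cf2):
    """Merge two clustering features by component-wise addition."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

cf1 = (3, (9, 10), (29, 38))      # summarizes cluster C1
cf2 = (3, (35, 36), (417, 440))   # summarizes cluster C2
print(merge_cf(cf1, cf2))         # (6, (44, 46), (446, 478))
```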
CF-Tree
• A CF tree is a height-balanced tree with two parameters:
branching factor B and threshold T.
• Branching factor
• B: Each non-leaf node contains at most B entries of the form [CFi, childi], where i = 1, 2, …, B, and childi is a pointer to its i-th child node. CFi is the CF of the sub-cluster represented by childi.
• L: Each leaf node contains at most L entries, each of the form [CFi], where i = 1, 2, …, L. In addition, each leaf node has two pointers, "prev" and "next", which are used to chain all leaf nodes together.
• A leaf node also represents a cluster made up of all the sub-clusters represented by its entries.
• Threshold T
• The diameter (alternatively, the radius) of all entries in a leaf node is at most T
• Leaf nodes are connected via prev and next pointers
Example of CF Tree
[Figure: a CF tree with B = 6 and L = 5 — the root holds entries CF1 … CF6, each pointing to a child node; a non-leaf node holds entries CF9 … CF13; leaf nodes hold entries such as CF90 … CF94 and CF95 … CF98 and are chained together by prev/next pointers]
CF Tree Insertion
• Identify the appropriate leaf: starting from the root, recursively descend the tree by choosing the child whose CF is closest to the new entry
• Modify the leaf: if the closest leaf entry can absorb the new entry without its radius/diameter exceeding T, update that entry's CF; otherwise add a new entry to the leaf
• Split if necessary: if the leaf now has more than L entries, split it into two leaves
• Update the CFs on the path back to the root, splitting non-leaf nodes that exceed B entries (a simplified sketch of the absorb-or-create step follows below)
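The absorb-or-create decision at the leaf level can be sketched in a few lines. This is a simplified, hypothetical illustration for one-dimensional points that ignores the tree descent, node splitting, and the B/L limits:

```python
from math import sqrt

T = 0.15  # radius threshold used in the worked example that follows

def radius(n, ls, ss):
    """Average distance of the member points from the centroid."""
    return sqrt(max(ss / n - (ls / n) ** 2, 0.0))

def insert(leaf_entries, x):
    """Absorb x into the closest leaf entry if its radius stays <= T,
    otherwise start a new entry. Each entry is a dict with keys n, ls, ss."""
    best = min(leaf_entries, key=lambda e: abs(e["ls"] / e["n"] - x), default=None)
    if best is not None:
        n, ls, ss = best["n"] + 1, best["ls"] + x, best["ss"] + x * x
        if radius(n, ls, ss) <= T:                       # the point can be absorbed
            best.update(n=n, ls=ls, ss=ss)
            return leaf_entries
    leaf_entries.append({"n": 1, "ls": x, "ss": x * x})  # start a new sub-cluster
    return leaf_entries

entries = []
for x in [0.5, 0.25, 0, 0.65, 1, 1.4, 1.1]:
    insert(entries, x)
# Resulting sub-clusters: {0.5, 0.25}, {0}, {0.65}, {1, 1.1}, {1.4}
print(entries)
```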
Example
• Data points to insert (in order): 0.5, 0.25, 0, 0.65, 1, 1.4, 1.1
• Parameters: T = 0.15, L = 2, B = 2
• Insert 0.5:
Root-Node 0 holds CF1: n=1, LS=0.5, SS=0.25 (Leaf1: R=0)
Example
• Insert 0.25: it can be absorbed by CF1, since the radius of the updated entry stays below T
Root-Node 0 holds CF1: n=2, LS=0.75, SS=0.313 (Leaf1: R=0.126)
Example
• Insert 0: it is too far from CF1 (the radius would exceed T), so a new entry CF2 is created
Root-Node 0 holds CF1: n=2, LS=0.75, SS=0.313 and CF2: n=1, LS=0.0, SS=0 (Leaf1: R=0.126, Leaf2: R=0)
Example
• Insert 0.65: it cannot be absorbed by CF1 or CF2, and the root already holds B = 2 entries, so the root is split
Root-Node 0 holds CF12: n=3, LS=0.75, SS=0.313 and CF3: n=1, LS=0.65, SS=0.423
Node 1 holds CF1: n=2, LS=0.75, SS=0.313 and CF2: n=1, LS=0.0, SS=0 (Leaf1: R=0.126, Leaf2: R=0)
Node 2 holds CF3: n=1, LS=0.65, SS=0.423 (Leaf3: R=0)
Example
• Insert 1: it is too far from CF3, so a new entry CF4 is added in Node 2
Root-Node 0 holds CF12: n=3, LS=0.75, SS=0.313 and CF34: n=2, LS=1.65, SS=1.423
Node 1 holds CF1: n=2, LS=0.75, SS=0.313 and CF2: n=1, LS=0.0, SS=0 (Leaf1: R=0.126, Leaf2: R=0)
Node 2 holds CF3: n=1, LS=0.65, SS=0.423 and CF4: n=1, LS=1, SS=1 (Leaf3: R=0, Leaf4: R=0)
Example
• Insert 1.4: Node 2 is already full (two entries), so it is split into Node 2.1 and Node 2.2
Root-Node 0 holds CF12: n=3, LS=0.75, SS=0.313 and CF345: n=3, LS=3.05, SS=3.383
Node 1 holds CF1: n=2, LS=0.75, SS=0.313 and CF2: n=1, LS=0.0, SS=0 (Leaf1: R=0.126, Leaf2: R=0)
Node 2 holds CF34: n=2, LS=1.65, SS=1.423 and CF5: n=1, LS=1.4, SS=1.96
Node 2.1 holds CF3: n=1, LS=0.65, SS=0.423 and CF4: n=1, LS=1, SS=1; Node 2.2 holds CF5: n=1, LS=1.4, SS=1.96 (Leaf3, Leaf4, Leaf5: R=0)
• Do height balancing and continue with the remaining point (1.1)
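The CF statistics shown in this example can be reproduced directly from the member points of each sub-cluster; a small check using the radius formula from the Distance Measures slide:

```python
from math import sqrt

def cf(points):
    """Return (n, LS, SS, R) for a one-dimensional sub-cluster."""
    n, ls, ss = len(points), sum(points), sum(x * x for x in points)
    return n, ls, ss, sqrt(max(ss / n - (ls / n) ** 2, 0.0))

print(cf([0.5, 0.25]))  # ≈ (2, 0.75, 0.3125, 0.125); the slide rounds SS to 0.313
print(cf([0.0]))        # (1, 0.0, 0.0, 0.0)
print(cf([0.65, 1.0]))  # ≈ (2, 1.65, 1.4225, 0.175) — CF34 above
print(cf([1.4]))        # ≈ (1, 1.4, 1.96, 0.0)
```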
Clustering the Sub-Clusters
• Once the CF tree is built, each leaf entry is a compact summary (sub-cluster) of the data
• A global clustering algorithm (e.g., an agglomerative method) is then applied to the leaf CF entries, treating each entry's centroid as a single point, to produce the final clusters (see the sketch below)
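A minimal sketch of this final phase, assuming the leaf entries from the worked example above and using scikit-learn's AgglomerativeClustering on the sub-cluster centroids (the choice of algorithm and of two final clusters is illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Leaf CF entries (n, LS, SS) taken from the worked example above
leaf_cfs = [(2, 0.75, 0.3125), (1, 0.0, 0.0), (1, 0.65, 0.4225),
            (1, 1.0, 1.0), (1, 1.4, 1.96)]

# Represent each sub-cluster by its centroid LS / n (one-dimensional data here)
centroids = np.array([[ls / n] for n, ls, ss in leaf_cfs])

# Cluster the sub-cluster summaries instead of the raw points
labels = AgglomerativeClustering(n_clusters=2).fit_predict(centroids)
print(labels)   # e.g. [0 0 0 1 1] — the first three sub-clusters vs. the last two
```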
Clustering Using Representative (CURE)
• Drawbacks of Traditional Clustering Algorithms
• Centroid-based approach (using dmean) considers
only one point as representative of a cluster - the
cluster centroid.
• All-points approach (based on dmin) makes the
clustering algorithm extremely sensitive to outliers.
• Neither of these approaches works well for non-spherical or arbitrarily shaped clusters.
CURE: Approach
• CURE is positioned between centroid based and all point extremes.
• A constant number of well-scattered points is used to capture the shape and extent of a cluster.
• The points are shrunk towards the centroid of the cluster by a factor α.
• These well-scattered, shrunk points are used as the representatives of the cluster.
CURE: Approach
• The scattered-points approach alleviates the shortcomings of both the centroid-based and the all-points methods.
• Since multiple representatives are used, the splitting of large clusters is avoided.
• Multiple representatives allow for the discovery of non-spherical clusters.
• The shrinking phase will affect outliers more than
other points since their distance from the centroid
will be decreased more than that of regular points.
CURE: Approach
• Initially, since all points are in separate clusters, each cluster is defined by the single point it contains.
• Clusters are merged until they contain at least c points.
• The first scattered point in a cluster is the one that is farthest away from the cluster's centroid.
• Each further scattered point is chosen so that its distance from the previously chosen scattered points is maximal.
• Once c well-scattered points have been selected, they are shrunk towards the centroid by a factor α (r = p + α·(mean − p)).
• After clusters have c representatives, the distance between two clusters is the distance between the closest pair of representatives, one from each cluster.
• Every time two clusters are merged, their representatives are re-calculated (a sketch of the scattered-point selection and shrinking follows below).
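A sketch of the representative-selection and shrinking step described above (the function name and the sample points are illustrative):

```python
import numpy as np

def cure_representatives(points, c=4, alpha=0.5):
    """Pick c well-scattered points from a cluster and shrink them
    towards the centroid by a factor alpha (r = p + alpha * (mean - p))."""
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)

    # First scattered point: farthest from the centroid
    scattered = [points[np.argmax(np.linalg.norm(points - mean, axis=1))]]

    # Each further point maximizes its distance to the already chosen ones
    while len(scattered) < min(c, len(points)):
        dists = np.min(
            [np.linalg.norm(points - s, axis=1) for s in scattered], axis=0)
        scattered.append(points[np.argmax(dists)])

    # Shrink the scattered points towards the centroid
    return [p + alpha * (mean - p) for p in scattered]

reps = cure_representatives([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5]])
print(np.round(reps, 2))
```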
Density-Based Clustering Methods
• Clustering based on local
connectivity and density functions
• Basic idea
• Clusters are dense regions in the data
space, separated by regions of lower
object density
• A cluster is defined as a maximal set of
density connected points
• Each cluster has a considerably higher density of points than the area outside of the cluster
• Major features:
• Discover clusters of arbitrary shape
• One scan
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Two global parameters:
• Eps: Maximum radius of the neighbourhood
• MinPts: Minimum number of points in an
Eps-neighbourhood of that point
• Density = number of points within a
specified radius r (Eps)
• A point is a core point if it has more than
a specified number of points (MinPts)
within Eps
• These are points that are at the interior of a
cluster
• A border point has fewer than MinPts
within Eps, but is in the neighborhood of
a core point
• A noise point is any point that is neither a core point nor a border point (a sketch of this classification follows below).
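A brute-force sketch of this core/border/noise classification (a hypothetical helper, not the full DBSCAN algorithm; whether a point counts itself among its Eps-neighbors varies between formulations — here it is counted):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    X = np.asarray(X, dtype=float)
    # Neighborhood = all points within eps (including the point itself here)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps
    core = neighbors.sum(axis=1) >= min_pts

    labels = []
    for i in range(len(X)):
        if core[i]:
            labels.append("core")
        elif neighbors[i][core].any():        # within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5]]
print(classify_points(X, eps=1.5, min_pts=3))  # ['core', 'core', 'core', 'core', 'noise']
```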
Density-reachability
• An object q is directly density-reachable from
object p, if p is a core object and q is in p’s
Eps-neighborhood
• q is directly density-reachable from p
• p is not directly density-reachable from q
• Density-reachability is asymmetric
Density-reachability
• An object p is (indirectly) density-reachable from an object q if there is a chain of objects o1, …, on with o1 = q and on = p such that each oi+1 is directly density-reachable from oi
• p is (indirectly) density-reachable from q
• q is not density-reachable from p
Density-Connectivity
• A pair of points p and q are density-connected if they are both density-reachable from a common point o
• Density-connectivity is symmetric
DBSCAN
• A density-based cluster is a maximal set of density-connected points; objects not assigned to any cluster are treated as noise
DBSCAN-Algorithm
Input: The data set D
Parameter: ε, MinPts
For each object p in D
if p is a core object and not processed then
C = retrieve all objects density-reachable from p
mark all objects in C as processed
report C as a cluster
else mark p as outlier
end if
End For
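The pseudocode above can be turned into a compact runnable version; this minimal sketch uses a brute-force neighborhood search and assigns border points to the first cluster that reaches them:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """A compact DBSCAN sketch: -1 marks noise, 0..k-1 mark clusters."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighborhoods = [np.where(dists[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)                   # -1 = noise / unclassified
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    for p in range(n):
        if visited[p] or len(neighborhoods[p]) < min_pts:
            continue                          # skip non-core or processed objects
        # Grow a new cluster from the core object p
        queue = [p]
        visited[p] = True
        while queue:
            q = queue.pop()
            labels[q] = cluster_id
            if len(neighborhoods[q]) >= min_pts:      # q is a core object
                for r in neighborhoods[q]:
                    if not visited[r]:
                        visited[r] = True
                        queue.append(r)
        cluster_id += 1
    return labels

X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6], [3, 9]]
print(dbscan(X, eps=1.5, min_pts=3))   # two clusters of four points and one noise point
```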
DBSCAN-Example
• Parameter
• ε = 2 cm
• MinPts = 3
for each o ∈ D do
    if o is not yet classified then
        if o is a core-object then
            collect all objects density-reachable from o
            and assign them to a new cluster
        else
            assign o to NOISE
MinPts = 5
Starting from an object p:
1. Check the ε-neighborhood of p;
2. If p has fewer than MinPts neighbors, mark p as an outlier and continue with the next object;
3. Otherwise, mark p as processed and put all its neighbors in cluster C.
Then, to expand C:
1. Check the unprocessed objects in C;
2. If there is no core object, return C;
3. Otherwise, randomly pick one core object p1, mark p1 as processed, and put all unprocessed neighbors of p1 in cluster C.
When DBSCAN Does NOT Work Well
• Cannot handle varying densities
• Sensitive to parameters
[Figure: original points vs. the DBSCAN result with MinPts = 4, Eps = 9.75]
DBSCAN: Sensitive to Parameters
Density Based Clustering: Discussion
• Advantages
• Clusters can have arbitrary shape and size
• Number of clusters is determined automatically
• Can separate clusters from surrounding noise
• Disadvantages
• Input parameters may be difficult to determine
• In some situations very sensitive to input parameter setting
Validation Measures for Clustering
• External
• use external information not present in the data
• Internal
• evaluate the goodness of a clustering structure without reference to external information
Validation Measures for Clustering
• External
• Dice Coefficient
• Similarity
• Accuracy (a pairwise computation of such measures is sketched below)
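External measures such as the Dice coefficient and accuracy are often computed over pairs of points, comparing the clustering with ground-truth classes; a minimal sketch (the exact definitions used in the lecture may differ):

```python
from itertools import combinations

def pair_counts(labels_pred, labels_true):
    """Count point pairs on which the clustering and the ground truth
    agree / disagree (TP, FP, FN, TN)."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_cluster = labels_pred[i] == labels_pred[j]
        same_class = labels_true[i] == labels_true[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

pred = [0, 0, 0, 1, 1, 1]      # cluster labels produced by an algorithm
true = [0, 0, 1, 1, 1, 1]      # ground-truth class labels
tp, fp, fn, tn = pair_counts(pred, true)
print(2 * tp / (2 * tp + fp + fn))       # Dice coefficient over point pairs
print((tp + tn) / (tp + fp + fn + tn))   # pairwise accuracy (Rand index)
```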