
Hierarchical clustering

By Meenakshi Iyer (202418031)


Hierarchical clustering builds a nested hierarchy of clusters
represented by a dendrogram, which visually shows how data
points are grouped at each step.

But, What’s a Dendrogram?

The dendrogram is a tree-like structure that records every merge step the HC algorithm performs. In the dendrogram plot, the y-axis shows the distance (e.g., Euclidean) at which clusters are merged, and the x-axis shows all the data points of the given dataset.

Ref 1
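To make this concrete, here is a minimal sketch (assuming SciPy and Matplotlib are installed) that builds a dendrogram for a small synthetic dataset; the data and the Ward linkage choice are purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Small synthetic 2-D dataset (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)),
               rng.normal(5, 0.5, (10, 2))])

# Agglomerative clustering with Ward linkage; Z records every merge step
Z = linkage(X, method="ward")

# y-axis: merge distance, x-axis: data point indices
dendrogram(Z)
plt.xlabel("Data points")
plt.ylabel("Merge distance")
plt.show()
```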
Hierarchical Agglomerative Clustering (HAC)
● Approach: Known as a bottom-up method.
● Structure: Provides a hierarchical structure (dendrogram) that’s more informative than flat
clustering.
● No Pre-Specified Clusters: No need to define the number of clusters in advance.
● Process: Starts with each data point as its own cluster and successively merges pairs of clusters
until a single cluster is formed.

1. Compute distance matrix: compute the distance matrix for each pair of data points using a distance metric.
2. Merge: merge the two closest clusters.
3. Update distance matrix: update the distance matrix with regard to the new cluster.
4. Repeat 1, 2 & 3: repeat the steps until you get a single cluster.

Ref 2
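A minimal sketch of this bottom-up process with scikit-learn's AgglomerativeClustering (an assumed dependency); setting distance_threshold=0 with n_clusters=None makes it keep merging until a single cluster remains, exposing every merge step.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Illustrative 2-D data points
X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 5.0], [9.0, 9.0]])

# Bottom-up merging: distance_threshold=0 with n_clusters=None builds the
# full hierarchy (steps 1-4 above) instead of stopping at a fixed K
hac = AgglomerativeClustering(n_clusters=None, distance_threshold=0,
                              linkage="average")
hac.fit(X)

print(hac.children_)    # each row records one merge of two clusters
print(hac.distances_)   # the distance at which each merge happened
```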
Hierarchical Divisive Clustering
● Approach: Known as a top-down method.
● Structure: Builds a hierarchical structure (dendrogram) that shows nested clusters.
● No Pre-Specified Clusters: No need to specify the number of clusters in advance.
● Process: Starts with all data in one cluster and recursively splits into smaller clusters until reaching
individual data points or desired structure.

1. All data in one cluster: begin with a single cluster containing all data points.
2. Split the cluster: choose a criterion to split clusters (e.g., distance or variance) and divide the cluster into two sub-clusters.
3. Repeat 1 & 2: continue splitting each sub-cluster until you reach the desired number of clusters or a stopping condition.
4. Build hierarchical structure: record each split to create a tree-like hierarchy (similar to a dendrogram).

Ref 3
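Scikit-learn has no dedicated divisive hierarchical clustering, so the sketch below (a rough illustration, with a hypothetical helper name) approximates the top-down process by repeatedly bisecting the highest-variance cluster with 2-means; other splitting criteria are equally valid.

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, max_clusters=3):
    """Top-down splitting: start with one cluster holding all points,
    then repeatedly bisect the cluster with the largest variance."""
    clusters = [np.arange(len(X))]                 # step 1: all data in one cluster
    while len(clusters) < max_clusters:
        # step 2: choose a splitting criterion (here: largest variance)
        idx = max(range(len(clusters)),
                  key=lambda i: X[clusters[i]].var() if len(clusters[i]) > 1 else -1.0)
        members = clusters.pop(idx)
        # divide the chosen cluster into two sub-clusters with 2-means
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])      # steps 3-4: repeat and record splits
    return clusters

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 9.0], [9.2, 9.1]])
print(divisive_clustering(X, max_clusters=3))
```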
Maths
Linkage is the criterion used to decide the distance between clusters when merging or splitting
them. Linkage methods define how the distance between two clusters is calculated, which
directly affects the formation of the hierarchical structure.

The types of linkage that are typically used are:


1. Single Linkage
2. Complete Linkage
3. Average Linkage
4. Centroid Linkage
5. Ward's Method

Affects Cluster Shape: Different linkages create different shapes (e.g., single linkage = elongated,
complete linkage = compact).
Sensitivity to Outliers: Some methods (like single linkage) are more sensitive to outliers.
Computational Efficiency: Some linkages (e.g., complete linkage) are more computationally expensive.
Consistency: The choice of linkage impacts the consistency and structure of clustering results.
Adaptability: Certain methods (like average linkage) are versatile and handle various cluster shapes.

Ref 1,5 and 6


Single Linkage (Minimum Distance): the distance between two clusters is the shortest
distance between their closest points.

D(A,B) = min { d(a,b) : a ∈ A, b ∈ B }

● Here, D(A,B) is the distance between clusters A and B.
● d(a,b) represents the distance between points a ∈ A and b ∈ B.
Complete Linkage (Maximum Distance): the distance between two clusters is the farthest
distance between any two points belonging to the two different clusters. It is one of the popular
linkage methods as it forms tighter clusters than single linkage.

D(A,B) = max { d(a,b) : a ∈ A, b ∈ B }

● Measures the distance between the farthest points of the two clusters.


Average Linkage (Mean Distance): the distances between every pair of points (one from each
cluster) are added up and then divided by the total number of pairs to obtain the average distance
between the two clusters. It is also one of the most popular linkage methods.

D(A,B) = ( 1 / (∣A∣ · ∣B∣) ) ∑ d(a,b), summed over all a ∈ A and b ∈ B

● Here, ∣A∣ and ∣B∣ are the sizes of clusters A and B, respectively.
● Calculates the average of all pairwise distances between points in clusters A and B.
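The following sketch (assuming SciPy) runs the same toy data through single, complete and average linkage to show how the choice of linkage changes the resulting clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two tight groups and one isolated point (potential outlier)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [4.0, 4.0], [4.1, 4.2], [8.0, 0.0]])

for method in ["single", "complete", "average"]:
    Z = linkage(X, method=method)                    # merge heights depend on the linkage
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(f"{method:8s} -> labels: {labels}")
```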
Advantages

1. No Predefined Clusters: Does not require specifying the number of clusters, making it more flexible and
accurate.
2. Easy to Interpret: Produces a tree-like structure (dendrogram) that is visually intuitive and ideal for data
analysis.
3. Flexible and Versatile: Can handle different types of data, similarity functions, and distance measures; suitable
for both supervised and unsupervised tasks.
4. Scalable: Suitable for large datasets, though computationally expensive with many points.
5. Multiple Levels of Output: Offers detailed cluster insights at different hierarchical levels.

Disadvantages

1. Sensitive to Outliers: Outliers can distort cluster formation, leading to incorrect results.
2. Computationally Expensive: Time complexity increases with the number of data points, making it less efficient
for large datasets.
3. Interpretation Challenges: Dendrograms can be complex and hard to interpret, especially for large datasets.
4. Lack of Optimality: Does not guarantee the best clustering solution; dependent on the researcher’s judgment.
5. Requires Manual Cluster Selection: The number of clusters must be manually chosen, which can be
time-consuming and error-prone.

Ref 7 - 8
Hierarchical clustering

● Can stop at any desired cluster count (flexible).
● Works with clusters of various shapes, but generally less effective than k-means for distinctly spherical clusters.
● Forms a tree-like hierarchy (dendrogram).
● Computationally expensive (needs an n×n distance matrix), so it doesn’t scale well for large datasets.
● Results are reproducible.

K-Means Clustering

● Requires a pre-set number of clusters K.
● Best for spherical clusters.
● Uses the mean (or median) for cluster centers.
● Less computationally intensive; suitable for large datasets.
● Results can vary due to random initialization.
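A minimal sketch contrasting the two on the same data (assuming scikit-learn); here the hierarchy is cut by a distance threshold instead of a pre-set K, while K-means needs K up front and depends on its random initialization:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

X = np.random.default_rng(1).normal(size=(200, 2))

# Hierarchical: no K given; the tree is cut at a chosen distance threshold
hier = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0).fit(X)
print("hierarchical clusters found:", hier.n_clusters_)

# K-means: K must be fixed in advance; labels depend on the random seed
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("k-means inertia:", km.inertia_)
```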
BIRCH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an unsupervised algorithm
for hierarchical clustering on large datasets. It incrementally clusters multi-dimensional data, optimizing
clustering quality within memory and time constraints. BIRCH can accelerate k-means and Gaussian
mixture modeling, often requiring only a single database scan.

BIRCH was specifically designed to solve these problems:

● Scalability: Handles large datasets efficiently.


● Memory Efficiency: Avoids storing all data points, using a CF tree for compact representation.
● Robustness to Noise: Reduces the impact of noise and outliers on clustering.
● Handling of Complex Cluster Shapes: While it works best for spherical clusters, it can be
combined with other algorithms to refine clusters of various shapes.
● High-Dimensional Data: Reduces computational overhead in high-dimensional spaces.
● Hierarchical Structure: Builds clusters incrementally, offering a hierarchical view of the data.

Ref 9
BIRCH attempts to minimize the memory requirements of large datasets by summarizing the information
contained in dense regions or clusters as Clustering Feature (CF) entries.

The Clustering Feature (CF) entry of the cluster is defined as a triple

CF = (N, LS, SS)

N = number of data points in the cluster


LS = linear sum of the N data points: LS = ∑ Xi (i = 1 to N)
SS = squared sum of the N data points: SS = ∑ (Xi)² (i = 1 to N)
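A minimal sketch (illustrative only, not the actual BIRCH implementation) of how a CF triple is maintained incrementally and merged; here SS is kept as the scalar sum of squared values, matching the formula above.

```python
import numpy as np

class CF:
    """Clustering Feature: (N, LS, SS) summary of a set of points."""
    def __init__(self, dim):
        self.N = 0
        self.LS = np.zeros(dim)   # linear sum of the points
        self.SS = 0.0             # scalar sum of squared values

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.N += 1
        self.LS += x
        self.SS += float(x @ x)

    def merge(self, other):
        # CF additivity: two subclusters are merged by adding their CFs
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS

    def centroid(self):
        return self.LS / self.N

cf = CF(dim=1)
for x in [22, 9, 12]:
    cf.add([x])
print(cf.N, cf.LS, cf.SS, cf.centroid())   # 3, [43.], 709.0, [14.33...]
```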

CF Tree

A CF tree is a height-balanced, compact representation of the data, where each leaf-node entry represents a
subcluster rather than individual data points. Each node has a branching factor (the maximum number of entries),
and each leaf entry is bound by a threshold (the maximum size, i.e., diameter or radius, of its subcluster), which
together control the tree size. Higher thresholds create smaller trees.

Ref 10 -11
Consider a set of 1-D objects {22, 9, 12, 15, 18, 27, 11, 36, 10, 3, 14, 32}. Let the branching factor B = 2
and the diameter threshold T = 5.
BIRCH Parameters
1. threshold: Maximum radius of a subcluster at a leaf node; a new sample is merged into the closest subcluster only if the merged subcluster stays within this radius.
2. branching_factor: Maximum number of CF subclusters per node.
3. n_clusters: Final number of clusters after the global clustering step; if None, the intermediate subclusters are returned.

1. Load into memory by building the CF tree: incrementally constructs a CF tree that compactly summarizes the data into subclusters, using thresholds on the maximum subcluster size.
2. Condense into a desirable range by building a smaller CF tree: refines the CF tree by removing outliers and merging nearby subclusters to create a smaller, more efficient tree.
3. Global clustering: applies a clustering algorithm (e.g., hierarchical clustering) to the CF entries in the leaf nodes, either by specifying a cluster count or a diameter threshold.
4. Cluster refining: reassigns data points to their closest cluster centroids, improving accuracy and optionally discarding outliers.
Advantages
● Scalability: Efficiently handles large datasets with memory-saving, hierarchical structures.
● Speed: Fast traversal and clustering, ideal for real-time applications.
● Automatic Cluster Count: Determines clusters by adaptively splitting based on a set
threshold.

Disadvantages
● Parameter Sensitivity: Requires careful tuning of parameters like branching factor and
threshold.
● Cluster Shape Limitation: Works best with spherical clusters, struggling with complex shapes.

Conclusion
BIRCH is a powerful clustering algorithm for large-scale data, providing efficient and fast clustering.
While highly effective, it requires parameter tuning and may be less suitable for non-spherical
clusters, making it valuable for exploratory data analysis and data mining.
Points to Ponder
1) How would the ability to not pre-define the number of clusters affect the flexibility of your
clustering analysis?
2) Given that hierarchical clustering can be computationally expensive with large datasets, what
strategies would you use to handle scalability issues?
3) How could the multiple levels of output from hierarchical clustering (dendrogram) help in
analyzing and fine-tuning your clusters?
4) Considering the complexity of dendrograms, what approaches could you use to make hierarchical
clustering results more interpretable to a broader audience?
5) What methods or tools could help automate or guide the selection of the appropriate number of
clusters in hierarchical clustering?
6) How do you think BIRCH’s efficiency in handling large datasets compares to traditional clustering
algorithms like K-means in terms of memory usage and scalability?
7) Considering BIRCH assumes spherical clusters, how might you adapt or combine BIRCH with
other algorithms to handle clusters with more complex shapes, like elongated or non-convex
clusters?
8) What are some possible limitations or pitfalls of relying on BIRCH’s automatic cluster count
determination? Are there situations where you would prefer to manually specify the number of
clusters?
References
1. https://www.javatpoint.com/hierarchical-clustering-in-machine-learning
2. https://www.datacamp.com/tutorial/introduction-hierarchical-clustering-python
3. https://www.geeksforgeeks.org/hierarchical-clustering
4. https://www.geeksforgeeks.org/difference-between-k-means-and-hierarchical-clustering
5. https://www.saigeetha.in/post/hierarchical-clustering-types-of-linkages
6. https://en.wikipedia.org/wiki/Hierarchical_clustering
7. https://towardsdatascience.com/understanding-the-concept-of-hierarchical-clustering-technique-c6e8243758ec
8. https://codinginfinite.com/hierarchical-clustering-applications-advantages-and-disadvantages
9. https://en.wikipedia.org/wiki/BIRCH
10. https://www.youtube.com/watch?v=3tW2R_0ukK8
11. https://medium.com/@vipulddalal/birch-algorithm-with-working-example-a7b8fe047bd4
12. Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms
Thank you
