Hierarchical Clustering
Ref 1
Hierarchical Agglomerative Clustering (HAC)
● Approach: Known as a bottom-up method.
● Structure: Provides a hierarchical structure (dendrogram) that’s more informative than flat
clustering.
● No Pre-Specified Clusters: No need to define the number of clusters in advance.
● Process: Starts with each data point as its own cluster and successively merges pairs of clusters
until a single cluster is formed.
1. Compute the distance matrix for each pair of data points using a distance metric.
2. Merge the two closest clusters.
3. Update the distance matrix with regard to the new cluster.
4. Repeat the steps until you get a single cluster (a short code sketch follows).
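The whole bottom-up loop is available off the shelf; here is a minimal sketch using SciPy's hierarchical clustering utilities on a small made-up 2-D dataset (the data values, the "average" linkage, and the cut into two clusters are illustrative assumptions, not part of the slides):

# Sketch of bottom-up (agglomerative) clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Steps 1-4 in one call: linkage() computes pairwise distances,
# repeatedly merges the two closest clusters, and records every merge.
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram afterwards if a flat clustering is needed.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) would plot the full hierarchy (requires matplotlib).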
Ref 2
Hierarchical Divisive Clustering
● Approach: Known as a top-down method.
● Structure: Builds a hierarchical structure (dendrogram) that shows nested clusters.
● No Pre-Specified Clusters: No need to specify the number of clusters in advance.
● Process: Starts with all data in one cluster and recursively splits into smaller clusters until reaching
individual data points or desired structure.
1. All Data in One Cluster: Begin with a single cluster containing all data points.
2. Split the Cluster: Choose a criterion to split clusters (e.g., distance or variance) and divide the cluster into two sub-clusters.
3. Repeat 1 & 2: Continue splitting each sub-cluster until you reach the desired number of clusters or a stopping condition.
4. Build Hierarchical Structure: Record each split to create a tree-like hierarchy (similar to a dendrogram). A rough code sketch follows.
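Classical divisive algorithms (e.g., DIANA) are rarely packaged in common libraries, so the following is only a rough sketch of the top-down idea: recursively bisect the cluster with the largest spread using 2-means. The split criterion, the max_clusters stopping rule, and the toy data are illustrative assumptions, not the reference algorithm.

# Illustrative top-down (divisive) sketch: repeatedly split the highest-variance
# cluster into two sub-clusters with 2-means until max_clusters is reached.
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, max_clusters=3, random_state=0):
    clusters = [np.arange(len(X))]          # start: all points in one cluster
    while len(clusters) < max_clusters:
        # pick the cluster with the largest total spread to split next
        idx = max(range(len(clusters)),
                  key=lambda i: X[clusters[i]].var(axis=0).sum() * len(clusters[i]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=random_state).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]], dtype=float)
print(divisive(X, max_clusters=3))          # indices of the points in each cluster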
Ref 3
Maths
Linkage is the criterion used to decide the distance between clusters when merging or splitting
them. Linkage methods define how the distance between two clusters is calculated, which
directly affects the formation of the hierarchical structure.
● Affects Cluster Shape: Different linkages create different shapes (e.g., single linkage = elongated, complete linkage = compact).
● Sensitivity to Outliers: Some methods (like single linkage) are more sensitive to outliers.
● Computational Efficiency: Some linkages (e.g., complete linkage) are more computationally expensive.
● Consistency: The choice of linkage impacts the consistency and structure of clustering results.
● Adaptability: Certain methods (like average linkage) are versatile and handle various cluster shapes.
Single linkage: D(A,B) = min { d(a, b) : a ∈ A, b ∈ B }
Complete linkage: D(A,B) = max { d(a, b) : a ∈ A, b ∈ B }
Average linkage: D(A,B) = (1 / (|A|·|B|)) Σ_{a ∈ A} Σ_{b ∈ B} d(a, b)
● Here, |A| and |B| are the sizes of clusters A and B, respectively.
● Average linkage therefore takes the average of all pairwise distances between points in clusters A and B (illustrated in the sketch below).
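A minimal NumPy sketch of the three formulas, assuming two small hypothetical 1-D clusters and Euclidean distance:

# Compute single, complete, and average linkage distances between clusters A and B.
import numpy as np

A = np.array([[1.0], [2.0], [3.0]])
B = np.array([[8.0], [9.0]])

# all pairwise distances d(a, b) between A and B: shape (|A|, |B|)
d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

single   = d.min()    # D(A,B) = min of d(a, b)
complete = d.max()    # D(A,B) = max of d(a, b)
average  = d.mean()   # D(A,B) = sum of d(a, b) divided by |A|*|B|

print(single, complete, average)   # 5.0 8.0 6.5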
Advantages
1. No Predefined Clusters: Does not require specifying the number of clusters in advance, which makes the analysis more flexible; the dendrogram can be cut at whatever level suits the data.
2. Easy to Interpret: Produces a tree-like structure (dendrogram) that is visually intuitive and ideal for data
analysis.
3. Flexible and Versatile: Can handle different types of data, similarity functions, and distance measures; its output can also serve as input to downstream supervised tasks.
4. Scalable: Can be applied to reasonably large datasets, though it becomes computationally expensive as the number of points grows.
5. Multiple Levels of Output: Offers detailed cluster insights at different hierarchical levels.
Disadvantages
1. Sensitive to Outliers: Outliers can distort cluster formation, leading to incorrect results.
2. Computationally Expensive: Time complexity increases with the number of data points, making it less efficient
for large datasets.
3. Interpretation Challenges: Dendrograms can be complex and hard to interpret, especially for large datasets.
4. Lack of Optimality: Does not guarantee the best clustering solution; dependent on the researcher’s judgment.
5. Requires Manual Cluster Selection: Although no cluster count is needed up front, a final number of clusters must still be chosen by cutting the dendrogram, which can be time-consuming and error-prone.
Ref 7-8
Comparison: Hierarchical Clustering vs. K-Means Clustering
Ref 9
BIRCH attempts to minimize the memory requirements of large datasets by summarizing the information
contained in dense regions or clusters as Clustering Feature (CF) entries.
CF Tree
A CF tree is a height-balanced, compact representation of the data, where each leaf-node entry represents a subcluster rather than individual data points. Each node has a branching factor (the maximum number of entries it can hold), and each leaf subcluster is bounded by a threshold (the maximum diameter or radius a subcluster may have), which controls tree size: higher thresholds create smaller trees.
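As a rough illustration of why CF entries are compact, here is a tiny sketch of the (N, LS, SS) arithmetic for 1-D points: N is the point count, LS the linear sum, and SS the sum of squares, and entries are additive, which is what lets the CF tree summarize dense regions. The helper names and sample values (drawn from the worked example below) are illustrative, not a reference implementation.

# CF entry arithmetic for 1-D points: (N, LS, SS) is count, linear sum, sum of squares.
import math

def make_cf(points):
    return (len(points), sum(points), sum(p * p for p in points))

def merge_cf(cf1, cf2):
    # CF entries are additive: merging two subclusters just adds the triples.
    return (cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2])

def diameter(cf):
    # Average pairwise distance form of the subcluster diameter.
    n, ls, ss = cf
    if n < 2:
        return 0.0
    return math.sqrt((2 * n * ss - 2 * ls * ls) / (n * (n - 1)))

cf_a = make_cf([9, 10, 11, 12])
cf_b = make_cf([14, 15])
merged = merge_cf(cf_a, cf_b)
print(merged, diameter(merged))   # compare the diameter against T to allow or reject the merge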
Ref 10-11
Consider a set of 1-D objects {22, 9, 12, 15, 18, 27, 11, 36, 10, 3, 14, 32}. Let the branching factor B = 2 and the diameter threshold T = 5.
BIRCH Parameters
1. threshold: Maximum radius (or diameter, depending on the implementation) of a subcluster at a leaf node; a new point starts a new subcluster if merging it would exceed this limit.
2. branching_factor: Maximum number of subclusters per internal node.
3. n_clusters: Final number of clusters produced by the global clustering step; if None, the subclusters from the CF tree are returned as-is (see the usage sketch below).
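These parameter names match scikit-learn's Birch estimator, so a hedged usage sketch on the 1-D example above looks like the following. Note that scikit-learn's threshold is a subcluster radius rather than a diameter, so T = 5 is only approximated here by a radius of 2.5, and n_clusters = 3 is an arbitrary illustrative choice.

# Sketch: BIRCH on the 1-D example data with scikit-learn.
import numpy as np
from sklearn.cluster import Birch

X = np.array([22, 9, 12, 15, 18, 27, 11, 36, 10, 3, 14, 32],
             dtype=float).reshape(-1, 1)

model = Birch(threshold=2.5, branching_factor=2, n_clusters=3)
labels = model.fit_predict(X)
print(labels)                       # final cluster label per point
print(model.subcluster_centers_)    # centroids of the CF-tree leaf subclusters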
1. Loading: Build the CF tree in memory. Incrementally constructs a CF tree that compactly summarizes the data into subclusters, using the threshold to bound subcluster size.
2. Condensing: Build a smaller CF tree. Refines the CF tree by removing outliers and merging nearby subclusters to create a smaller, more efficient tree.
3. Global clustering: Applies a clustering algorithm (e.g., hierarchical clustering) to the CF entries in the leaf nodes, either by specifying a cluster count or a diameter threshold.
4. Cluster refining: Reassigns data points to their closest cluster centroids, improving accuracy and optionally discarding outliers.
Advantages
● Scalability: Efficiently handles large datasets with memory-saving, hierarchical structures.
● Speed: Fast traversal and clustering, ideal for real-time applications.
● Automatic Cluster Count: Determines clusters by adaptively splitting based on a set
threshold.
Disadvantages
● Parameter Sensitivity: Requires careful tuning of parameters like branching factor and
threshold.
● Cluster Shape Limitation: Works best with spherical clusters, struggling with complex shapes.
Conclusion
BIRCH is a powerful clustering algorithm for large-scale data, providing efficient and fast clustering. While highly effective and valuable for exploratory data analysis and data mining, it requires careful parameter tuning and may be less suitable for non-spherical clusters.
Points to Ponder
1) How would the ability to not pre-define the number of clusters affect the flexibility of your
clustering analysis?
2) Given that hierarchical clustering can be computationally expensive with large datasets, what
strategies would you use to handle scalability issues?
3) How could the multiple levels of output from hierarchical clustering (dendrogram) help in
analyzing and fine-tuning your clusters?
4) Considering the complexity of dendrograms, what approaches could you use to make hierarchical
clustering results more interpretable to a broader audience?
5) What methods or tools could help automate or guide the selection of the appropriate number of
clusters in hierarchical clustering?
6) How do you think BIRCH’s efficiency in handling large datasets compares to traditional clustering
algorithms like K-means in terms of memory usage and scalability?
7) Considering BIRCH assumes spherical clusters, how might you adapt or combine BIRCH with
other algorithms to handle clusters with more complex shapes, like elongated or non-convex
clusters?
8) What are some possible limitations or pitfalls of relying on BIRCH’s automatic cluster count
determination? Are there situations where you would prefer to manually specify the number of
clusters?
References
1. https://www.javatpoint.com/hierarchical-clustering-in-machine-learning
2. https://www.datacamp.com/tutorial/introduction-hierarchical-clustering-python
3. https://www.geeksforgeeks.org/hierarchical-clustering
4. https://www.geeksforgeeks.org/difference-between-k-means-and-hierarchical-clustering
5. https://www.saigeetha.in/post/hierarchical-clustering-types-of-linkages
6. https://en.wikipedia.org/wiki/Hierarchical_clustering
7. https://towardsdatascience.com/understanding-the-concept-of-hierarchical-clustering-technique-c6e8243758ec
8. https://codinginfinite.com/hierarchical-clustering-applications-advantages-and-disadvantages
9. https://en.wikipedia.org/wiki/BIRCH
10. https://www.youtube.com/watch?v=3tW2R_0ukK8
11. https://medium.com/@vipulddalal/birch-algorithm-with-working-example-a7b8fe047bd4
12. Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms
Thank you