Hierarchical Clustering
Dr. R. Srinivasan
Assistant Professor
Department of Computing Technologies
School of Computing
SRM Institute of Science and Technology
Hierarchical Clustering
1) Hierarchical clustering is another unsupervised machine learning algorithm, used to group unlabeled data points into clusters. It is also known as hierarchical cluster analysis (HCA).
2) In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.
3) The results of K-means clustering and hierarchical clustering may sometimes look similar, but the two algorithms differ in how they work: unlike K-means, hierarchical clustering does not require the number of clusters to be fixed in advance.
There are two approaches to hierarchical clustering:
1) Agglomerative: a bottom-up approach in which the algorithm starts by taking every data point as a single cluster and keeps merging the closest clusters until only one cluster is left (a short code sketch follows this list).
2) Divisive: the reverse of the agglomerative algorithm; it follows a top-down approach.
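As a quick illustration of the bottom-up approach in point 1, here is a minimal sketch using scikit-learn's AgglomerativeClustering on a small made-up 2-D dataset (the data values and the choice of two clusters are purely illustrative):

```python
# Minimal sketch: agglomerative (bottom-up) clustering of 6 unlabeled points.
# Assumes scikit-learn and NumPy are installed; the data is made up.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one tight group
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])  # another tight group

model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] -- the two well-separated groups
```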
Why hierarchical clustering?
We already have other clustering algorithms such as K-means clustering, so why do we need hierarchical clustering?
1) We have seen that K-means clustering has some challenges: it requires a predetermined number of clusters, and it tends to create clusters of roughly the same size.
2) Hierarchical clustering addresses these two challenges because it does not require the number of clusters to be known in advance.
Agglomerative Hierarchical clustering
The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the data points into clusters, it follows the bottom-up approach.
This means the algorithm treats each data point as a single cluster at the beginning and then repeatedly merges the closest pair of clusters.
It does this until all the clusters are merged into a single cluster that contains every data point.
This hierarchy of clusters is represented in the form of a dendrogram.
Steps:
Step-1: Treat each data point as a single cluster. If there are N data points, there will initially be N clusters.
Step-2: Take the two closest data points or clusters and merge them to form one cluster, leaving N-1 clusters.
Step-3: Again take the two closest clusters and merge them to form one cluster, leaving N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left.
Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram and use it to divide the clusters as per the problem.
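The five steps above can be sketched directly in a few lines of NumPy. The helper closest_pair and the use of the single-linkage (closest-points) distance are illustrative assumptions, not a definitive implementation:

```python
# Sketch of Steps 1-5: repeatedly merge the two closest clusters (single linkage).
import numpy as np

def closest_pair(clusters, X):
    """Return the indices of the two clusters whose closest points are nearest."""
    best = (0, 1, np.inf)
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(np.linalg.norm(X[a] - X[b])
                    for a in clusters[i] for b in clusters[j])
            if d < best[2]:
                best = (i, j, d)
    return best

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])

# Step 1: every data point starts as its own cluster (N clusters)
clusters = [[i] for i in range(len(X))]

# Steps 2-4: merge the two closest clusters until only one cluster is left
while len(clusters) > 1:
    i, j, d = closest_pair(clusters, X)
    print(f"merge {clusters[i]} + {clusters[j]} at distance {d:.2f}")
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

print(clusters)  # a single cluster containing all point indices
```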
Proximity Between Clusters
The distance between two clusters is crucial for hierarchical clustering. There are various ways to calculate the distance between two clusters, and the chosen measure decides the rule for clustering. These measures are called linkage methods.
Single Linkage: the shortest distance between the closest points of the two clusters.
Complete Linkage: the farthest distance between two points in two different clusters. It is one of the popular linkage methods, as it forms tighter clusters than single linkage.
Average Linkage: the distance between every pair of points, one from each cluster, is added up and divided by the total number of pairs to calculate the average distance between the two clusters. It is also one of the most popular linkage methods.
Centroid Linkage: the distance between the centroids of the two clusters.
Ward’s Method: uses the increase in squared error after merging to measure the similarity of two clusters.
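All of these linkage methods are available in SciPy; the short sketch below simply runs each one on the same made-up dataset and prints the height of the final merge (dataset and printout are illustrative only):

```python
# Compare linkage methods with SciPy's linkage(); assumes SciPy and NumPy are installed.
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])

# Each method defines the cluster-to-cluster distance differently,
# so the merge heights (third column of the linkage matrix) differ.
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)   # (N-1) x 4 merge table
    print(f"{method:>8}: final merge distance = {Z[-1, 2]:.3f}")
```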
[Figure: measures for the distance between two clusters (single, complete, and centroid linkage)]
Working of Dendrogram in Hierarchical Clustering
The dendrogram is a tree-like structure that records each merge step performed by the HC algorithm. In the dendrogram plot, the y-axis shows the Euclidean distances between the data points (or clusters) being merged, and the x-axis shows all the data points of the given dataset.
In the accompanying diagram, the left part shows how clusters are created in agglomerative clustering, and the right part shows the corresponding dendrogram.
As discussed above, the data points P2 and P3 combine first to form a cluster; correspondingly, a dendrogram link is created that connects P2 and P3 with a rectangular shape. Its height is decided by the Euclidean distance between the data points.
In the next step, P5 and P6 form a cluster, and the corresponding dendrogram link is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is slightly greater than that between P2 and P3.
Again, two new dendrogram links are created: one combining P1, P2, and P3, and another combining P4, P5, and P6.
At last, the final dendrogram link is created, combining all the data points together.
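The dendrogram described above can be reproduced with SciPy and Matplotlib. A sketch, assuming both libraries are installed (the point coordinates are made up, so the exact merge order may differ from the figure in the slide):

```python
# Build the merge history and plot the dendrogram (x-axis: points, y-axis: distance).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],   # P1, P2, P3
              [5.0, 5.0], [5.3, 4.9], [5.1, 5.2]])  # P4, P5, P6

Z = linkage(X, method="single")   # merge history of agglomerative HC

plt.figure(figsize=(6, 4))
dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5", "P6"])
plt.xlabel("Data points")
plt.ylabel("Euclidean distance")
plt.show()
```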
Agglomerative Hierarchical clustering
Advantages
1) No a priori information about the number of clusters is required.
2) Easy to implement, and gives the best results in some cases.
Disadvantages
1) The algorithm can never undo what was done previously: merges are irreversible.
2) A time complexity of at least O(n² log n) is required, where n is the number of data points.
3) Depending on the distance measure (linkage) chosen for merging, different algorithms can suffer from one or more
of the following:
i) Sensitivity to noise and outliers
ii) Breaking large clusters
iii) Difficulty handling different sized clusters and convex shapes
4) No objective function is directly minimized
5) Sometimes it is difficult to identify the correct number of clusters from the dendrogram.
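Regarding point 5, the usual practice is to "cut" the dendrogram at a chosen height, or to ask for a fixed number of clusters. A sketch with SciPy's fcluster, where the threshold 2.0 is an arbitrary illustrative choice:

```python
# Cut the dendrogram to obtain flat cluster labels.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])

Z = linkage(X, method="ward")

# Cut at height 2.0: every merge above this distance is undone
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)   # e.g. [1 1 1 2 2 2]

# Or ask directly for a fixed number of clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```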
Divisive Clustering
Divisive clustering works just the opposite of agglomerative clustering. It starts by considering all the data points as one big cluster and then keeps splitting them into smaller clusters until every data point ends up in its own cluster.
Thus, it is good at identifying large clusters. It follows a top-down approach and can be more efficient than agglomerative clustering. However, due to its implementation complexity, it does not have a predefined implementation in most major machine learning frameworks.
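Although major frameworks do not ship a divisive routine, the top-down idea can be illustrated by recursively splitting a cluster in two with K-means (a bisecting-style sketch). The split rule, the depth limit, and the function name divisive_split are assumptions made purely for illustration, not a standard library API:

```python
# Rough sketch of divisive (top-down) clustering via recursive 2-means splits.
import numpy as np
from sklearn.cluster import KMeans

def divisive_split(points, depth=0, max_depth=2):
    """Recursively split a cluster into two until max_depth or a single point remains."""
    if depth == max_depth or len(points) < 2:
        print("  " * depth + f"leaf cluster: {points.tolist()}")
        return
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    for k in (0, 1):
        divisive_split(points[labels == k], depth + 1, max_depth)

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])

divisive_split(X)   # starts from one big cluster and splits top-down
```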