Agglomerative Hierarchical Clustering
Master 1
Ingénierie des Systèmes Informatiques Intelligents
2019/2020
Hierarchical Cluster Analysis
Agglomerative (bottom-up):
• Begin with N clusters (one object per cluster)
• Repeat: merge the two most similar clusters
• Until all objects are in the same cluster

Divisive (top-down):
• Begin with 1 cluster (all objects together)
• Repeat: split the most dissimilar clusters
• Until every object is in its own cluster
Repeat
Merge the two closest clusters and update the distance matrix
Until only a single cluster remains
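A minimal Python sketch of this loop, assuming a precomputed symmetric distance matrix D and single-linkage merging; the name `agglomerate` is ours, not a library function:

```python
# A naive implementation of the agglomerative loop above.
import numpy as np

def agglomerate(D):
    """Merge the two closest clusters until only one remains.
    Returns the merges as (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(D))]  # start: one object per cluster
    merges = []
    while len(clusters) > 1:
        best = (0, 1, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: minimum pairwise distance between clusters
                d = min(D[a][b] for a in clusters[i] for b in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, d = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] += clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return merges
```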
The proximity between objects can be expressed as a distance matrix, computed with one of the following distance metrics (a code sketch follows the table):
Euclidean distance
In Cartesian coordinates, if $x_1 = (a_1, b_1)$ and $x_2 = (a_2, b_2)$ are two points in the plane, then the distance $d$ from $x_1$ to $x_2$ (or from $x_2$ to $x_1$) is given by
$$d(x_1, x_2) = \sqrt{(a_1 - a_2)^2 + (b_1 - b_2)^2}$$

Hamming distance
A metric for comparing two binary data strings of equal length: the Hamming distance is the number of bit positions in which the two strings differ.

Levenshtein distance
A string metric for measuring the difference between two sequences. The Levenshtein distance between two strings $a$ and $b$ (of length $|a|$ and $|b|$ respectively) is given by $\operatorname{lev}_{a,b}(|a|, |b|)$, where
$$\operatorname{lev}_{a,b}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0,\\ \min\begin{cases} \operatorname{lev}_{a,b}(i-1, j) + 1 \\ \operatorname{lev}_{a,b}(i, j-1) + 1 \\ \operatorname{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise.} \end{cases}$$

Gower distance (mixed data)
Computed, for a pair of individuals, as the average of partial dissimilarities across variables. Each partial dissimilarity (and thus the Gower distance) ranges in $[0, 1]$.

Manhattan distance
The distance between two points measured along axes at right angles. In a plane with $x_1 = (a_1, b_1)$ and $x_2 = (a_2, b_2)$,
$$d(x_1, x_2) = |a_1 - a_2| + |b_1 - b_2|$$
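Illustrative computations of these metrics with SciPy; the points and strings are examples of ours, not data from the course:

```python
import numpy as np
from scipy.spatial import distance

x1, x2 = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(distance.euclidean(x1, x2))  # sqrt((1-4)^2 + (2-6)^2) = 5.0
print(distance.cityblock(x1, x2))  # Manhattan: |1-4| + |2-6| = 7.0

# Hamming: SciPy returns the fraction of differing positions,
# so multiply by the length to get the bit count.
u, v = [1, 0, 1, 1], [1, 1, 1, 0]
print(distance.hamming(u, v) * len(u))  # 2 differing bits

def levenshtein(a, b):
    """Textbook dynamic-programming version of the recurrence above."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```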
Single linkage
• We merge at each step the two clusters whose two closest members have the smallest distance (the two clusters with the smallest minimum pairwise distance).

Complete linkage
• We merge the two clusters with the smallest maximum pairwise distance.
Average linkage
• A compromise between the sensitivity of the complete-link method to outliers and the tendency of the single-link method to form long chains (the chaining effect).
Centroid linkage
• The distance between two clusters C1 and C2 is the distance between their respective centers CC1 and CC2 (see the sketch below).
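A short sketch showing how these four criteria map onto SciPy's `linkage` function; the data are random and only illustrate the API:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
D = pdist(X)  # condensed Euclidean distance matrix

Z_single   = linkage(D, method='single')    # smallest minimum pairwise distance
Z_complete = linkage(D, method='complete')  # smallest maximum pairwise distance
Z_average  = linkage(D, method='average')   # mean of all pairwise distances
# Centroid linkage assumes Euclidean geometry, so pass the raw observations:
Z_centroid = linkage(X, method='centroid')  # distance between cluster centers
```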
Step 01
Look for the pair of samples that are the most similar (closest in the sense of having the lowest dissimilarity):
• B and F have the lowest dissimilarity, d = 2.0.
• These two samples are then joined in the same group G1 at a level of 2.0 in the first step of the dendrogram.
Step 02
Repeat step 01, computing the distance between the new group G1 = (B, F) and each remaining sample with the average-linkage update
$$d(G_1, X) = \frac{d(B, X) + d(F, X)}{2}$$
• The lowest dissimilarity is now 2.5, for A and E, which are joined in group G2 at a level of 2.5 in the dendrogram.

Step 03
• In the third step the lowest dissimilarity is 3.33, for C and G, which are joined in group G3 at a level of 3.33 in the dendrogram.
Step 04
• The lowest dissimilarity is now 4.25, between G2 = (A, E) and G3 = (C, G), which are merged in group G4 at a level of 4.25 in the dendrogram.

Step 05
• The lowest remaining dissimilarity is 7.77: G1 = (B, F) is joined with the singleton D in group G5 at a level of 7.77 in the dendrogram.

Step 06
• G4 and G5 are merged in the final group G6 at a level of 8.0.
Dendrogram
The last step (step 6) of this Agglomerative Hierarchical Clustering example joins groups G4 and G5 in G6 at a level of 8.0, which yields the final result of the cluster analysis.
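The six merges above can be written directly as a SciPy linkage matrix and drawn as a dendrogram; a sketch, with samples A..G indexed 0..6 and the merge levels taken from the example:

```python
# Each row of Z is (cluster i, cluster j, merge level, size of new cluster);
# newly created clusters are numbered 7, 8, ... in order of creation.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

Z = np.array([
    [1,  5,  2.00, 2],  # step 1: B + F   -> G1 (cluster 7)
    [0,  4,  2.50, 2],  # step 2: A + E   -> G2 (cluster 8)
    [2,  6,  3.33, 2],  # step 3: C + G   -> G3 (cluster 9)
    [8,  9,  4.25, 4],  # step 4: G2 + G3 -> G4 (cluster 10)
    [3,  7,  7.77, 3],  # step 5: D + G1  -> G5 (cluster 11)
    [10, 11, 8.00, 7],  # step 6: G4 + G5 -> G6
])

dendrogram(Z, labels=list("ABCDEFG"))
plt.axhline(y=5.0, linestyle='--')  # the cut level discussed below
plt.show()
```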
We cut the dendrogram at the level that gives the best separation of the data points into distinct classes, i.e., where the merged clusters are the most distant from each other (the largest gap between successive merge levels).
o The pairs of samples (B,F), (A,E) and (C,G) are fairly close.
o The sample D is a singleton; it separates itself entirely from all the others.
o Clusters (B,F), (A,E,C,G) and (D) are all formed below level 5.0; within each of them, the samples have more than 50% similarity.
o Clusters (A,E) and (C,G) are merged at a level of 4.25, while the next merge, of (B,F) with (D), happens at a level of 7.77, which represents a large gap. This gap makes level 5.0 a very convenient place to cut the tree.
o After cutting the dendrogram at a level of 5.0, we end up with three clusters, C1 = (B, F), C2 = (A, C, E, G) and C3 = (D), which can be seen as three categories of a categorical variable:

Sample:  A  B  C  D  E  F  G
Cluster: 2  1  2  3  2  1  2
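Using the linkage matrix Z built from the merge steps above, SciPy's `fcluster` reproduces this table; the numeric ids may be permuted, but the grouping is identical:

```python
# Cut the dendrogram at level 5.0 and read off the flat cluster labels.
import numpy as np
from scipy.cluster.hierarchy import fcluster

Z = np.array([[1, 5, 2.00, 2], [0, 4, 2.50, 2], [2, 6, 3.33, 2],
              [8, 9, 4.25, 4], [3, 7, 7.77, 3], [10, 11, 8.00, 7]])

labels = fcluster(Z, t=5.0, criterion='distance')
print(dict(zip("ABCDEFG", labels)))  # groups {B,F}, {A,C,E,G} and {D}
```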
Strengths
o Flexible: it does not even require a true distance metric; any dissimilarity measure can be used. For example (see the sketch after this list):
  Categorical data: Jaccard distance
  Strings: Levenshtein distance
  Mixed-type data: Gower distance
o A dendrogram can also be very useful for understanding our dataset.
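A brief sketch of two of these measures: SciPy's built-in Jaccard dissimilarity for binary vectors and, as an assumption on our part, the third-party `gower` package (pip install gower) for mixed data:

```python
# Jaccard dissimilarity on binary vectors (built into SciPy).
import numpy as np
from scipy.spatial.distance import jaccard

u = np.array([1, 0, 1, 1], dtype=bool)
v = np.array([1, 1, 0, 1], dtype=bool)
print(jaccard(u, v))  # 2 disagreements / 4 non-zero positions = 0.5

# Gower distance for mixed data -- assumes the third-party `gower` package:
# import gower, pandas as pd
# df = pd.DataFrame({"age": [25, 40], "city": ["Algiers", "Oran"]})
# print(gower.gower_matrix(df))
```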
Weaknesses
o High time complexity (the naive algorithm is O(n³) in time and O(n²) in memory).
o Poor solutions may be found without it being obvious that they are poor.
o Arbitrary decisions in the choice of the distance metric and the linkage criterion.
o Most hierarchical clustering techniques do not work when values are missing in the data.
o With certain types of data, it is difficult to determine how to compute a distance matrix.