Agglomerative Hierarchical Clustering
Master 1
Ingénierie des Systèmes Informatiques Intelligents
2019/2020
Hierarchical Cluster Analysis
Agglomerative (bottom-up):
• Begin with N clusters (one object per cluster)
• Repeat: merge the two most similar clusters
• Until all objects are in the same cluster

Divisive (top-down):
• Begin with 1 cluster (all objects together)
• Repeat: split the most dissimilar clusters
• Until every object is in its own cluster
Repeat
Merge the two closest clusters and update the distance matrix
Until only a single cluster remains
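A minimal Python sketch of this loop, assuming a precomputed symmetric distance matrix D and single-linkage merging; the name `agglomerate` is ours, not a library function:

```python
# A naive implementation of the agglomerative loop above.
import numpy as np

def agglomerate(D):
    """Merge the two closest clusters until only one remains.
    Returns the merges as (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(D))]  # start: one object per cluster
    merges = []
    while len(clusters) > 1:
        best = (0, 1, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: minimum pairwise distance between clusters
                d = min(D[a][b] for a in clusters[i] for b in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, d = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] += clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return merges
```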
The proximity between objects can be expressed as a distance matrix, computed with one of the following distance metrics (a code sketch follows the table):
Euclidean distance
In Cartesian coordinates, if $x_1 = (a_1, b_1)$ and $x_2 = (a_2, b_2)$ are two points in the plane, then the distance $d$ from $x_1$ to $x_2$ (or from $x_2$ to $x_1$) is given by
$$d(x_1, x_2) = \sqrt{(a_1 - a_2)^2 + (b_1 - b_2)^2}$$

Hamming distance
A metric for comparing two binary data strings of equal length: the Hamming distance is the number of bit positions in which the two strings differ.

Levenshtein distance
A string metric for measuring the difference between two sequences. The Levenshtein distance between two strings $a$ and $b$ (of length $|a|$ and $|b|$ respectively) is given by $\operatorname{lev}_{a,b}(|a|, |b|)$, where
$$\operatorname{lev}_{a,b}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0,\\ \min\begin{cases} \operatorname{lev}_{a,b}(i-1, j) + 1 \\ \operatorname{lev}_{a,b}(i, j-1) + 1 \\ \operatorname{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise.} \end{cases}$$

Gower distance (mixed data)
Computed, for a pair of individuals, as the average of partial dissimilarities across variables. Each partial dissimilarity (and thus the Gower distance) ranges in $[0, 1]$.

Manhattan distance
The distance between two points measured along axes at right angles. In a plane with $x_1 = (a_1, b_1)$ and $x_2 = (a_2, b_2)$,
$$d(x_1, x_2) = |a_1 - a_2| + |b_1 - b_2|$$
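Illustrative computations of these metrics with SciPy; the points and strings are examples of ours, not data from the course:

```python
import numpy as np
from scipy.spatial import distance

x1, x2 = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(distance.euclidean(x1, x2))  # sqrt((1-4)^2 + (2-6)^2) = 5.0
print(distance.cityblock(x1, x2))  # Manhattan: |1-4| + |2-6| = 7.0

# Hamming: SciPy returns the fraction of differing positions,
# so multiply by the length to get the bit count.
u, v = [1, 0, 1, 1], [1, 1, 1, 0]
print(distance.hamming(u, v) * len(u))  # 2 differing bits

def levenshtein(a, b):
    """Textbook dynamic-programming version of the recurrence above."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```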
Single linkage
• We merge at each step the two clusters whose two closest members have the smallest distance (the two clusters with the smallest minimum pairwise distance).

Complete linkage
• We merge the two clusters with the smallest maximum pairwise distance.
Average linkage
• A compromise between the sensitivity of the complete-link method to outliers and the tendency of the single-link method to form long chains (the chaining effect).
Centroid linkage
• The distance between two clusters C1 and C2 is the distance between their respective centers CC1 and CC2 (see the sketch below).
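A short sketch showing how these four criteria map onto SciPy's `linkage` function; the data are random and only illustrate the API:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
D = pdist(X)  # condensed Euclidean distance matrix

Z_single   = linkage(D, method='single')    # smallest minimum pairwise distance
Z_complete = linkage(D, method='complete')  # smallest maximum pairwise distance
Z_average  = linkage(D, method='average')   # mean of all pairwise distances
# Centroid linkage assumes Euclidean geometry, so pass the raw observations:
Z_centroid = linkage(X, method='centroid')  # distance between cluster centers
```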
Step 01
Look for the pair of samples that are the most similar (closest in the sense of having the lowest dissimilarity):
• B and F have the lowest dissimilarity, d = 2.0.
• These two samples are then joined in the same group G1 at a level of 2.0 in the first step of the dendrogram.
Step 02
Repeat step 01, computing the distance between the new group G1 = (B, F) and each remaining sample with the average-linkage update
$$d(G_1, X) = \frac{d(B, X) + d(F, X)}{2}$$
• The lowest dissimilarity is now 2.5, for A and E, which are joined in group G2 at a level of 2.5 in the dendrogram.

Step 03
• In the third step the lowest dissimilarity is 3.33, for C and G, which are joined in group G3 at a level of 3.33 in the dendrogram.
Step 04
• The lowest dissimilarity is now 4.25, between G2 = (A, E) and G3 = (C, G), which are merged in group G4 at a level of 4.25 in the dendrogram.

Step 05
• The lowest remaining dissimilarity is 7.77: G1 = (B, F) is joined with the singleton D in group G5 at a level of 7.77 in the dendrogram.

Step 06
• G4 and G5 are merged in the final group G6 at a level of 8.0.
Dendrogram
The last step (step 6) of this Agglomerative Hierarchical Clustering example joins groups G4 and G5 in G6 at a level of 8.0, which yields the final result of the cluster analysis.
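The six merges above can be written directly as a SciPy linkage matrix and drawn as a dendrogram; a sketch, with samples A..G indexed 0..6 and the merge levels taken from the example:

```python
# Each row of Z is (cluster i, cluster j, merge level, size of new cluster);
# newly created clusters are numbered 7, 8, ... in order of creation.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

Z = np.array([
    [1,  5,  2.00, 2],  # step 1: B + F   -> G1 (cluster 7)
    [0,  4,  2.50, 2],  # step 2: A + E   -> G2 (cluster 8)
    [2,  6,  3.33, 2],  # step 3: C + G   -> G3 (cluster 9)
    [8,  9,  4.25, 4],  # step 4: G2 + G3 -> G4 (cluster 10)
    [3,  7,  7.77, 3],  # step 5: D + G1  -> G5 (cluster 11)
    [10, 11, 8.00, 7],  # step 6: G4 + G5 -> G6
])

dendrogram(Z, labels=list("ABCDEFG"))
plt.axhline(y=5.0, linestyle='--')  # the cut level discussed below
plt.show()
```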
We cut the dendrogram at the level that gives the best separation of the data points into distinct classes, i.e., where the merged clusters are the most distant from each other (the largest gap between successive merge levels).
o The pairs of samples (B,F), (A,E) and (C,G) are fairly close.
o The sample D is a singleton; it separates itself entirely from all the others.
o Clusters (B,F), (A,E,C,G) and (D) are all formed below level 5.0; within each of them, the samples have more than 50% similarity.
o Clusters (A,E) and (C,G) are merged at a level of 4.25, while the next merge, of (B,F) with (D), happens at a level of 7.77, which represents a large gap. This gap makes level 5.0 a very convenient place to cut the tree.
o After cutting the dendrogram at a level of 5.0, we end up with three clusters, C1 = (B, F), C2 = (A, C, E, G) and C3 = (D), which can be seen as three categories of a categorical variable:

Sample:  A  B  C  D  E  F  G
Cluster: 2  1  2  3  2  1  2
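Using the linkage matrix Z built from the merge steps above, SciPy's `fcluster` reproduces this table; the numeric ids may be permuted, but the grouping is identical:

```python
# Cut the dendrogram at level 5.0 and read off the flat cluster labels.
import numpy as np
from scipy.cluster.hierarchy import fcluster

Z = np.array([[1, 5, 2.00, 2], [0, 4, 2.50, 2], [2, 6, 3.33, 2],
              [8, 9, 4.25, 4], [3, 7, 7.77, 3], [10, 11, 8.00, 7]])

labels = fcluster(Z, t=5.0, criterion='distance')
print(dict(zip("ABCDEFG", labels)))  # groups {B,F}, {A,C,E,G} and {D}
```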
Strengths
o Flexible: it does not even require a true distance metric; any dissimilarity measure can be used. For example (see the sketch after this list):
  Categorical data: Jaccard distance
  Strings: Levenshtein distance
  Mixed-type data: Gower distance
o A dendrogram can also be very useful for understanding our dataset.
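A brief sketch of two of these measures: SciPy's built-in Jaccard dissimilarity for binary vectors and, as an assumption on our part, the third-party `gower` package (pip install gower) for mixed data:

```python
# Jaccard dissimilarity on binary vectors (built into SciPy).
import numpy as np
from scipy.spatial.distance import jaccard

u = np.array([1, 0, 1, 1], dtype=bool)
v = np.array([1, 1, 0, 1], dtype=bool)
print(jaccard(u, v))  # 2 disagreements / 4 non-zero positions = 0.5

# Gower distance for mixed data -- assumes the third-party `gower` package:
# import gower, pandas as pd
# df = pd.DataFrame({"age": [25, 40], "city": ["Algiers", "Oran"]})
# print(gower.gower_matrix(df))
```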
Weaknesses
o High time complexity (the naive algorithm is O(n³) in time and O(n²) in memory).
o Poor solutions may be found without it being obvious that they are poor.
o Arbitrary decisions in the choice of the distance metric and the linkage criterion.
o Most hierarchical clustering techniques do not work when values are missing in the data.
o With certain types of data, it is difficult to determine how to compute a distance matrix.