Hierarchical Cluster Analysis

Agglomerative Hierarchical Clustering


Dr. Chaouki BOUFENAR

Master 1
Ingénierie des Systèmes Informatiques Intelligents
2019/2020
Agglomerative Hierarchical Clustering vs. Divisive Hierarchical Clustering

Agglomerative (bottom-up approach):
• Begin with N clusters (one object per cluster)
• Repeat: merge the two most similar clusters
• Until all objects are in the same cluster

Divisive (top-down approach):
• Begin with 1 cluster (all objects together)
• Repeat: split the most dissimilar cluster
• Until all objects are in their own cluster

04/18/2020 Master 1 ISII : Data Mining 2


Agglomerative Hierarchical Clustering (AHC)

Algorithm Agglomerative Hierarchical Clustering


Compute the distance matrix
Let each data point be a cluster

Repeat
Merge the two closest clusters and update the distance matrix
Until only a single cluster remains
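The pseudocode above can be sketched in plain Python. The point names and coordinates below are invented for illustration, and single linkage (minimum pairwise distance) is assumed as the merge criterion:

```python
# A minimal, unoptimized sketch of the AHC algorithm above, using
# single linkage. Point names and coordinates are illustrative only.
from math import dist  # Euclidean distance (Python 3.8+)

def ahc_single_linkage(points):
    """Merge clusters until one remains; return the merge history."""
    clusters = [[name] for name in points]  # one object per cluster
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest minimum pairwise distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        history.append((sorted(merged), round(d, 2)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

points = {"A": (0, 0), "B": (1, 0), "C": (4, 4), "D": (4.5, 4)}
for members, level in ahc_single_linkage(points):
    print(members, "merged at", level)
```

A production implementation would cache and update the distance matrix instead of recomputing all pairwise distances at every step.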

Distance Matrix (Dissimilarities)

The proximity between objects can be measured as a distance matrix using one of the following distance metrics:

Euclidean distance: In Cartesian coordinates, if x1 = (a1, b1) and x2 = (a2, b2) are two points in the Euclidean plane, then the distance d from x1 to x2 (equal to the distance from x2 to x1) is given by
    d(x1, x2) = √((a1 − a2)² + (b1 − b2)²)

Hamming distance: A metric for comparing two binary data strings of equal length. The Hamming distance is the number of bit positions in which the two strings differ.

Levenshtein distance: A string metric for measuring the difference between two sequences. The Levenshtein distance between two strings a and b (of length |a| and |b| respectively) is the minimum number of single-character insertions, deletions and substitutions required to change a into b.
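The three metrics above can be illustrated in Python; the example points and strings below are made up, not taken from the slides:

```python
# Illustrative implementations of the metrics defined above.
from math import dist  # Euclidean distance (Python 3.8+)

def hamming(s, t):
    """Number of positions where two equal-length strings differ."""
    assert len(s) == len(t), "Hamming distance needs equal lengths"
    return sum(a != b for a, b in zip(s, t))

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(dist((1, 2), (4, 6)))              # Euclidean: 5.0
print(hamming("10110", "10011"))         # 2 differing bit positions
print(levenshtein("kitten", "sitting"))  # 3 edits
```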


Gower distance (mixed data): Computed as the average of partial dissimilarities across variables. Each partial dissimilarity (and thus the Gower distance) ranges in [0, 1].

Manhattan distance: The distance between two points measured along axes at right angles. In a plane with x1 = (a1, b1) and x2 = (a2, b2),
    d(x1, x2) = |a1 − a2| + |b1 − b2|
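Both distances can be sketched in Python; the Gower version below is a simplified illustration (the record fields and ranges are invented), not a full implementation:

```python
# Manhattan distance and a simplified Gower distance for mixed data.
def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def gower(x, y, ranges):
    """Average of partial dissimilarities in [0, 1]: numeric fields are
    scaled by their range; categorical fields (range None) give 0 or 1."""
    parts = []
    for a, b, r in zip(x, y, ranges):
        if r is None:                    # categorical field
            parts.append(0.0 if a == b else 1.0)
        else:                            # numeric field
            parts.append(abs(a - b) / r)
    return sum(parts) / len(parts)

print(manhattan((1, 1), (4, 5)))                     # |1-4| + |1-5| = 7
print(gower((30, "red"), (40, "blue"), (50, None)))  # (10/50 + 1)/2 = 0.6
```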

Inter-clusters Distance (Dissimilarity)

The distance between two clusters, defined from the distances between objects belonging to the two clusters. Five common linkage criteria are:

Single linkage, Complete linkage, Average linkage, Centroid linkage, Average centroid linkage

 Simple or single linkage (Minimum)

• distance between closest elements in clusters

• we merge in each step the two clusters whose two closest members have the smallest distance (the
two clusters with the smallest minimum pairwise distance).

 Complete linkage (Maximum)

• distance between farthest elements in clusters

• we merge the two clusters with the smallest maximum pairwise distance

 Average linkage

• average of all pairwise distances

• a compromise between the sensitivity of the complete-link method to outliers and the tendency of the single-link method to form long chains
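The three pairwise-linkage rules above differ only in how the set of pairwise distances between the two clusters is aggregated. A tiny illustration on two invented one-dimensional clusters:

```python
# Single, complete and average linkage on two made-up 1-D clusters;
# the pairwise distances are plain absolute differences.
c1, c2 = [1.0, 2.0], [5.0, 9.0]
pairs = [abs(a - b) for a in c1 for b in c2]   # |1-5|, |1-9|, |2-5|, |2-9|
print("single:  ", min(pairs))                 # minimum pairwise distance
print("complete:", max(pairs))                 # maximum pairwise distance
print("average: ", sum(pairs) / len(pairs))    # mean of all pairwise distances
```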

 Centroid linkage

The distance between the centers CC1 and CC2 of two clusters C1 and C2 respectively

 Average Centroid linkage


The average centroid linkage distance is the average distance between the center of one cluster and all the objects belonging to the other cluster.
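Most of these criteria are available in SciPy's hierarchical clustering module (assuming SciPy is installed); the average-centroid variant has no direct SciPy equivalent. The sample coordinates below are invented:

```python
# Comparing merge heights under four linkage criteria with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[0.0, 0], [1, 0], [4, 4], [5, 4], [9, 9]])
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)        # (n-1) x 4 merge table
    print(method, np.round(Z[:, 2], 2))  # merge heights, one per step
```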

 Example
• The example starts from a proximity (dissimilarity/distance) matrix of seven samples, A to G (shown on the slide).

 Step 01
Look for the pair of samples that are the most similar (closest, in the sense of having the lowest dissimilarity):
• B and F have the lowest dissimilarity, d = 2.0
• These two samples are then joined in the same group at a level of 2.0 in the first step of the dendrogram.

 Step 02
Repeat Step 01, calculating the distance between the new group {B, F} and the other samples using the chosen linkage criterion.

 Steps 03–06
The merge-and-update procedure is repeated until a single cluster remains. At each step, the original slides show the updated distance matrices side by side for simple, complete and average linkage; these matrices are figures and are not reproduced here.


 Dendrogram

 A dendrogram, or tree diagram, is a graphical representation of a matrix of distances

 It makes interpretation of the structure of data much easier


 The horizontal axis represents samples and the vertical axis represents dissimilarities

 In the clustering of n objects or samples, there are n–1 nodes
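The properties above can be checked with SciPy (assuming it is installed): calling dendrogram with no_plot=True returns the tree structure without drawing, so the n−1 merge nodes and the leaf order along the horizontal axis can be inspected directly. The data below are invented:

```python
# Inspecting dendrogram structure for n = 5 invented samples.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
Z = linkage(X, method="single")
print("merge nodes:", Z.shape[0])    # n - 1 = 4 internal nodes
info = dendrogram(Z, no_plot=True)   # structure only, nothing is drawn
print("leaf order:", info["ivl"])    # sample order along the horizontal axis
```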

How to build a dendrogram?

We explain the process on an example, using the single linkage method.

 At the first step of the hierarchical clustering process, samples B and F are merged into group G1 at a level of 2.0, which represents their dissimilarity value.

 In the next step the lowest dissimilarity is 2.5, for A and E, which are joined into a group G2 at level 2.5 in the dendrogram.


 In the third step the lowest dissimilarity is 3.33, for C and G, which are joined into group G3 at level 3.33 in the dendrogram.

 At step 4, groups G2 and G3, which have the lowest dissimilarity of 3.75, are merged into G4 at level 3.75 in the dendrogram.

 The penultimate step (step 5) consists of joining groups G1 and G4, which have the closest distance in the proximity matrix. These two groups form group G5 at a level of 5.0 in the dendrogram.


 The last step (step 6) of this Agglomerative Hierarchical Clustering example joins G5 and the remaining singleton D into G6 at a level of 8.0, giving the final result of the cluster analysis.


 Cutting the Dendrogram

 We try to cut the dendrogram at the level that produces the best split of the data points into clearly distinct classes.

 We cut the dendrogram where the merged groups are the most distant, i.e., across the largest vertical gap.

In our example with complete linkage, we can see that:

o The pairs of samples (B,F), (A,E) and (C,G) are fairly close.

o Sample D is a singleton; it separates itself entirely from all the others.

o Clusters (B,F), (A,E,C,G) and (D) are formed below level 5.0. Within each of them, the samples have more than 50% similarity.

o Clusters (A,E) and (C,G) are merged at level 4.25, while the next merge, joining (B,F) with (D), happens at level 7.77, leaving a large gap. This gap is thus a very convenient place to cut the tree.
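Cutting a dendrogram at a chosen height corresponds to SciPy's fcluster with the "distance" criterion (assuming SciPy is installed); the points and cut level below are illustrative, not the lecture's seven samples:

```python
# Cutting the tree at height 5.0 to obtain flat cluster labels.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0], [0, 1], [5, 5], [5, 6], [20, 20]])
Z = linkage(X, method="complete")
labels = fcluster(Z, t=5.0, criterion="distance")  # cut at height 5.0
print(labels)  # one cluster id per sample
```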


o After cutting the dendrogram at level 5.0, we end up with three clusters C1, C2 and C3, which can be seen as three categories of a categorical variable:

Sample:   A  B  C  D  E  F  G
Cluster:  2  1  2  3  2  1  2
 Strengths
o Easy to understand and easy to do

o Flexible: it does not even require a true distance; any dissimilarity measure can be used. For example:
  Categorical data: Jaccard distance
  Strings: Levenshtein distance
  Mixed-type data: Gower distance
o A dendrogram can also be very useful in understanding our dataset.

 Weaknesses
o High time complexity (a naive implementation is O(n³) in time and O(n²) in memory)
o Poor solutions may be found without it being obvious that they are poor
o Arbitrary decisions in the choice of the distance metric and the linkage criterion
o Most hierarchical clustering techniques do not work when values are missing in the data
o With certain types of data, it is difficult to determine how to compute a distance matrix
