1731009606_Clustering_(Class_38-39)
Unsupervised Learning
Clustering is the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often according to some defined distance measure.
Training set: {x^(1), x^(2), …, x^(m)} (no labels y)
Different types of clustering (see the code sketch after this list):
1. Hierarchical algorithms: these find successive clusters using
previously established clusters.
1. Agglomerative ("bottom-up"): Agglomerative algorithms begin with
each element as a separate cluster and merge them into successively
larger clusters.
2. Divisive ("top-down"): Divisive algorithms begin with the whole set
and proceed to divide it into successively smaller clusters.
2. Partitional clustering: Partitional algorithms determine all clusters at
once. They include:
• K-means
• Fuzzy c-means clustering
• QT clustering algorithm
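As a rough illustration of the two families, here is a sketch only, assuming scikit-learn is available; the data points are made up for this example:

# Sketch: trying one partitional and one agglomerative algorithm on toy data.
# Assumes scikit-learn is installed; the points below are illustrative only.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

# Partitional: all K clusters are determined at once.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agglomerative ("bottom-up"): each point starts as its own cluster,
# and the closest clusters are merged successively.
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print(kmeans_labels)
print(agglo_labels)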
Common distance measures:
• The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters.
1. The Euclidean distance (also called the 2-norm distance) is given by:
d(x, y) = sqrt( Σ_{i=1..n} (x_i − y_i)^2 )
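A minimal sketch of this formula (the function name is just illustrative):

# Euclidean (2-norm) distance: d(x, y) = sqrt( Σ_i (x_i − y_i)^2 )
import numpy as np

def euclidean(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean([1.0, 1.0], [5.0, 7.0]))  # sqrt(16 + 36) ≈ 7.21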
Cluster Centroids
K-means algorithm assumptions
Input:
- K (number of clusters)
- Training set {x^(1), x^(2), …, x^(m)}
- x^(i) ∈ ℝ^n (drop the x_0 = 1 convention)
K-means algorithm
Randomly initialize K cluster centroids μ_1, μ_2, …, μ_K ∈ ℝ^n, then repeat the cluster-assignment and centroid-update steps shown below until the assignments stop changing.
K-means optimization objective
c^(i) = index of cluster (1, 2, …, K) to which example x^(i) is currently assigned
μ_k = cluster centroid k (μ_k ∈ ℝ^n)
μ_{c^(i)} = cluster centroid of the cluster to which example x^(i) has been assigned
Optimization objective:
J(c^(1), …, c^(m), μ_1, …, μ_K) = (1/m) Σ_{i=1..m} ‖x^(i) − μ_{c^(i)}‖^2
min over c^(1), …, c^(m), μ_1, …, μ_K of J(c^(1), …, c^(m), μ_1, …, μ_K)
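A small sketch of this cost (distortion) function, assuming NumPy arrays for the examples, assignments, and centroids (the variable names are illustrative):

# Distortion J(c, μ): mean squared distance between each example x^(i)
# and the centroid μ_{c^(i)} of the cluster it is currently assigned to.
import numpy as np

def distortion(X, c, mu):
    # X: (m, n) examples, c: (m,) cluster indices, mu: (K, n) centroids
    return np.mean(np.sum((X - mu[c]) ** 2, axis=1))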
K-means algorithm
Repeat {
    for i = 1 to m
        c^(i) := index (from 1 to K) of cluster centroid closest to x^(i)
    for k = 1 to K
        μ_k := average (mean) of points assigned to cluster k
}
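A minimal NumPy sketch of these two alternating steps (cluster assignment, then centroid update); it assumes the initial centroids are passed in and is not meant as a production implementation:

import numpy as np

def kmeans(X, mu, n_iter=10):
    # X: (m, n) training examples, mu: (K, n) initial centroids
    mu = mu.astype(float).copy()
    for _ in range(n_iter):
        # Cluster assignment: c^(i) := index of the centroid closest to x^(i)
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, K)
        c = np.argmin(d, axis=1)
        # Centroid update: μ_k := mean of the points assigned to cluster k
        for k in range(mu.shape[0]):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return c, mu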
K-Means Clustering algorithm
Random initialization
Should have K < m.
Randomly pick K training examples and set μ_1, …, μ_K equal to these K examples.
K-means can converge to a local optimum of the cost function J, so it is run several times with different random initializations and the clustering with the lowest cost is kept.
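A sketch of this initialization recipe, reusing the illustrative kmeans() and distortion() helpers from the sketches above: pick K distinct training examples as the initial centroids, run K-means several times, and keep the run with the lowest cost J.

import numpy as np

def kmeans_best_of(X, K, n_runs=50, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_runs):
        # Randomly pick K distinct training examples as initial centroids.
        mu0 = X[rng.choice(len(X), size=K, replace=False)]
        c, mu = kmeans(X, mu0)
        J = distortion(X, c, mu)
        if best is None or J < best[0]:
            best = (J, c, mu)
    return best  # (lowest cost, assignments, centroids)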
[Figure: two scatter plots of Weight versus Height (axes labelled 1-8) illustrating clustering]
A simple example showing the implementation of the k-means algorithm (using K=2)
Step 1:
Initialization: we randomly choose the following two centroids (K=2) for the two clusters.
In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
Step 2:
• We compute the Euclidean distance from each object to the two centroids and assign each object to the nearer one. Thus, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}.
• Their new centroids are obtained as the mean of the points assigned to each cluster.
Step 3:
• Now, using these new centroids, we compute the Euclidean distance of each object to each centroid, as shown in the table.
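The assignment step of this example can be sketched as follows; the initial centroids m1 = (1.0, 1.0) and m2 = (5.0, 7.0) are taken from Step 1, but the data points are illustrative placeholders, not the example's original data table:

import numpy as np

# Illustrative objects 1..7 (placeholder coordinates).
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
mu = np.array([[1.0, 1.0], [5.0, 7.0]])  # m1, m2 from Step 1

# Step 2: assign each object to the nearest centroid (Euclidean distance),
# then recompute each centroid as the mean of its assigned points.
d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
c = np.argmin(d, axis=1)
print("cluster of each object:", c + 1)
print("new centroids:", [X[c == k].mean(axis=0) for k in range(2)])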
Hierarchical Clustering
Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
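A short sketch with SciPy, assuming it is available: build the merge tree (dendrogram) and cut it to obtain a desired number of flat clusters (the data and linkage choice are illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

Z = linkage(X, method="average")   # agglomerative merge tree (dendrogram data)
labels = fcluster(Z, t=2, criterion="maxclust")  # cut so that 2 clusters remain
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) can plot the tree itself.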
Calculation of the Similarity Between Two Clusters
Calculating the similarity between two clusters is important for deciding whether to merge or divide clusters. Common measures include:
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Ward’s Method
Calculation of the Similarity Between Two Clusters
• MIN
Sim(C1,C2) = Min Sim(Pi, Pj) such that Pi ∈ C1 & Pj ∈ C2.
Also known as the single-linkage algorithm. The similarity of two clusters C1 and C2 is the minimum of the similarities between points Pi and Pj such that Pi belongs to C1 and Pj belongs to C2.
It can separate non-elliptical shapes as long as the gap between the two clusters is not small, but it cannot separate clusters properly if there is noise between them.
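Phrased in terms of distance (the usual counterpart of similarity), single linkage can be sketched as follows (the function name is illustrative):

import numpy as np

def single_link_distance(C1, C2):
    # MIN / single linkage: smallest pairwise distance between the clusters.
    d = np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=2)
    return d.min()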
Calculation of the Similarity Between Two Clusters
• MAX
Sim(C1,C2) = Max Sim(Pi, Pj) such that Pi ∈ C1 & Pj ∈ C2.
Also known as the complete-linkage algorithm: pick the two farthest points such that one point lies in cluster C1 and the other lies in cluster C2.
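Again in distance terms, a complete-linkage sketch (the function name is illustrative):

import numpy as np

def complete_link_distance(C1, C2):
    # MAX / complete linkage: largest pairwise distance between the clusters.
    d = np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=2)
    return d.max()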