Clustering

Unsupervised Learning
Clustering is the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often according to some defined distance measure.

Training set: {x(1), x(2), …, x(m)} (unlabeled examples; no labels y)
Different Types of clustering:
1. Hierarchical algorithms: these find successive clusters using previously established clusters.
   a. Agglomerative ("bottom-up"): agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters.
   b. Divisive ("top-down"): divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
2. Partitional clustering: partitional algorithms determine all clusters at once. They include:
   • K-means
   • Fuzzy c-means clustering
   • QT clustering algorithm
Common Distance measures:
• The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters.
1. The Euclidean distance (also called 2-norm distance) is given by:
   d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
2. The Manhattan distance (also called taxicab norm or 1-norm) is given by:
   d(x, y) = Σᵢ |xᵢ − yᵢ|
3. The maximum norm is given by:
   d(x, y) = maxᵢ |xᵢ − yᵢ|
4. The Mahalanobis distance corrects data for different scales and correlations in the variables:
   d(x, y) = √( (x − y)ᵀ S⁻¹ (x − y) ), where S is the covariance matrix of the data.
5. Inner product space: the angle between two vectors can be used as a distance measure when clustering high-dimensional data.
6. Hamming distance (sometimes edit distance) measures the minimum number of substitutions required to change one member into another.
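As a concrete illustration, the sketch below computes each of these measures for a pair of vectors with NumPy. The function names, example vectors, and covariance matrix are our own choices for illustration, not part of the slides.

```python
import numpy as np

def euclidean(x, y):
    # 2-norm distance
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # 1-norm (taxicab) distance
    return np.sum(np.abs(x - y))

def max_norm(x, y):
    # infinity-norm distance
    return np.max(np.abs(x - y))

def mahalanobis(x, y, S):
    # S is the covariance matrix of the data
    diff = x - y
    return np.sqrt(diff @ np.linalg.inv(S) @ diff)

def cosine_distance(x, y):
    # 1 minus the cosine of the angle between the vectors
    return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def hamming(a, b):
    # number of positions at which two equal-length sequences differ
    return int(np.sum(np.asarray(a) != np.asarray(b)))

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
S = np.array([[2.0, 0.3], [0.3, 1.0]])             # example covariance matrix (made up)

print(euclidean(x, y))                              # 5.0
print(manhattan(x, y))                              # 7.0
print(max_norm(x, y))                               # 4.0
print(mahalanobis(x, y, S))
print(cosine_distance(x, y))
print(hamming(list("karolin"), list("kathrin")))    # 3 substitutions
```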
Cluster Centroids
[Figures: unlabeled data points and their cluster centroids]
K-means algorithm: inputs

Input:
- K (number of clusters)
- Training set {x(1), x(2), …, x(m)}

x(i) ∈ ℝⁿ (drop the x₀ = 1 convention)
K-means algorithm

Randomly initialize K cluster centroids μ1, μ2, …, μK ∈ ℝⁿ

Repeat {
    for i = 1 to m
        c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
    for k = 1 to K
        μk := average (mean) of the points assigned to cluster k
}
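A minimal NumPy sketch of the loop above (random initialization, cluster-assignment step, move-centroid step). The function name, the data matrix X, and the fixed number of iterations are illustrative assumptions rather than part of the slides.

```python
import numpy as np

def kmeans(X, K, n_iters=10, seed=0):
    """Minimal K-means sketch: X is an (m, n) data matrix, K the number of clusters."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Randomly initialize the K cluster centroids with K distinct training examples
    mu = X[rng.choice(m, size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Cluster assignment step: c[i] = index of the centroid closest to X[i]
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # Move centroid step: mu[k] = mean of the points assigned to cluster k
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return c, mu

# Example usage with made-up data:
# c, mu = kmeans(np.random.default_rng(1).normal(size=(300, 2)), K=3)
```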
K-means optimization objective
c(i) = index of the cluster (1, 2, …, K) to which example x(i) is currently assigned
μk = cluster centroid k (μk ∈ ℝⁿ)
μc(i) = cluster centroid of the cluster to which example x(i) has been assigned
Optimization objective (distortion):
J(c(1), …, c(m), μ1, …, μK) = (1/m) Σᵢ ‖x(i) − μc(i)‖²
The algorithm minimizes J over c(1), …, c(m) and μ1, …, μK.
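For reference, the distortion J can be computed from the assignments c and centroids μ returned by the sketch above (again an illustrative helper, not taken from the slides):

```python
import numpy as np

def distortion(X, c, mu):
    # J = (1/m) * sum_i ||x(i) - mu_{c(i)}||^2
    diffs = X - mu[c]
    return np.mean(np.sum(diffs ** 2, axis=1))
```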
K-means clustering: random initialization
• Should have K < m.
• Randomly pick K training examples.
• Set μ1, …, μK equal to these K examples.
Local optima

Depending on the initialization of the cluster centroids, K-means can converge to different results (different local optima of the distortion J); this is known as the random initialization trap.
Random initialization (multiple restarts)
For i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get c(1), …, c(m), μ1, …, μK.
    Compute the cost function (distortion) J(c(1), …, c(m), μ1, …, μK).
}
Pick the clustering that gave the lowest cost J.
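As a sketch, the restart loop above could look like the following with scikit-learn; the dataset X, the choice K = 3, and the 100 restarts are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

best_model, best_cost = None, np.inf
for i in range(100):
    # One K-means run from a single random initialization
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=i).fit(X)
    if km.inertia_ < best_cost:            # inertia_ is the distortion (without the 1/m factor)
        best_model, best_cost = km, km.inertia_

labels = best_model.labels_
centroids = best_model.cluster_centers_
```

In practice, KMeans(n_clusters=3, n_init=100) performs these restarts internally and keeps the run with the lowest inertia.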


What is the right value of K?
Choosing the value of K
Elbow method: plot the cost function J against the number of clusters K (e.g. K = 1 to 8). As K increases, J decreases; the "elbow" of the curve, where the rate of decrease drops sharply, suggests a reasonable value of K. Often the curve has no clear elbow, in which case the method is inconclusive.

[Figures: cost function J versus K (no. of clusters), one curve with a clear elbow and one without]
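A sketch of the elbow plot using scikit-learn and matplotlib; the dataset and the range of K are assumed for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

Ks = range(1, 9)
costs = []
for K in Ks:
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    costs.append(km.inertia_)              # distortion for this K

plt.plot(list(Ks), costs, marker="o")
plt.xlabel("K (no. of clusters)")
plt.ylabel("Cost function J")
plt.show()
```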


Choosing the value of K
Sometimes, you’re running K-means to get clusters to use for some
later/downstream purpose. Evaluate K-means based on a metric for how
well it performs for that later purpose.

E.g. T-shirt sizing: cluster customers by height and weight, and pick K according to how many sizes you want to offer (for instance K = 3 versus K = 5 sizes).

[Figures: weight versus height scatter plots, clustered for the two choices of K]
A simple example showing the implementation of the k-means algorithm (using K = 2)
Step 1:
Initialization: we randomly choose the following two centroids (K = 2) for the two clusters.
In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
Step 2:
• Computing the distance of each object to m1 and m2 and assigning it to the closer centroid, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}.
• Their new centroids are the means of the points assigned to each cluster.
Step 3:
• Now, using these centroids, we compute the Euclidean distance of each object, as shown in the table.
• Therefore, the new clusters are {1, 2} and {3, 4, 5, 6, 7}.
• The next centroids are m1 = (1.25, 1.5) and m2 = (3.9, 5.1).
Step 4:
• The clusters obtained are again {1, 2} and {3, 4, 5, 6, 7}.
• Therefore, there is no change in the clusters.
• Thus, the algorithm comes to a halt here, and the final result consists of 2 clusters: {1, 2} and {3, 4, 5, 6, 7}.
[Plot: the seven data points with the final two clusters and their centroids]
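The coordinates of the seven objects appeared only in a table and plot on the original slides. The sketch below re-runs the example with a commonly used set of seven points that reproduces the clusters and centroids quoted above; treat these coordinates as an assumption.

```python
import numpy as np

# Assumed coordinates for the seven objects (not present in the extracted text)
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
mu = np.array([[1.0, 1.0], [5.0, 7.0]])    # initial centroids m1, m2 (Step 1)

for step in range(3):
    # Assign each object to the nearest centroid (Euclidean distance)
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    c = np.argmin(d, axis=1)
    # Recompute the centroids as the cluster means
    mu = np.array([X[c == k].mean(axis=0) for k in range(2)])
    clusters = [(np.where(c == k)[0] + 1).tolist() for k in range(2)]
    print("iteration", step + 1, clusters, mu.round(2))

# After two iterations the clusters settle at {1, 2} and {3, 4, 5, 6, 7},
# with centroids approximately (1.25, 1.5) and (3.9, 5.1), matching Steps 3-4.
```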
Weaknesses of K-means clustering
1. When the data set is small, the initial grouping largely determines the final clusters.
2. The number of clusters, K, must be determined beforehand. The algorithm also does not yield the same result on each run, since the resulting clusters depend on the initial random assignments.
3. We never know the "true" clusters: for the same data, presenting the points in a different order may produce different clusters when the data set is small.
4. It is sensitive to the initial conditions. Different initial conditions may produce different clusterings, and the algorithm may be trapped in a local optimum.
Applications of K-means clustering
• It is relatively efficient and fast: it runs in O(tkn) time, where n is the number of objects or points, k is the number of clusters, and t is the number of iterations.
• K-means clustering is widely applied in machine learning and data mining.
• It is used on acoustic data in speech understanding to convert waveforms into one of k categories (known as vector quantization), and for image segmentation.
• It is also used for choosing color palettes on old-fashioned graphical display devices and for image quantization.
CONCLUSION
• The K-means algorithm is useful for undirected knowledge discovery and is relatively simple. K-means has found widespread use in many fields, ranging from unsupervised learning of neural networks, pattern recognition, classification analysis, and artificial intelligence to image processing, machine vision, and many others.
Hierarchical Clustering
• Uses the distance matrix as the clustering criterion. This method does not require the number of clusters K as an input, but it needs a termination condition. Let's say we have five data points {a, b, c, d, e}.

[Figure: agglomerative clustering (AGNES) runs left to right over Steps 0-4, merging a, b, c, d, e into ab, de, cde, and finally abcde; divisive clustering (DIANA) runs right to left, splitting abcde back down into the individual points.]

• Agglomerative (AGNES): begins with each element as a separate cluster, and at each iteration the most similar clusters are merged with each other until one cluster (or K clusters) is formed.
• Divisive (DIANA): considers all the data points as a single cluster and, in each iteration, separates from the cluster the data points which are not similar. Each data point which is separated is considered an individual cluster. In the end, we are left with n clusters.
Dendrogram: Shows How the Clusters are Merged

Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired
level, then each connected component forms a cluster.
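A sketch of building and cutting a dendrogram with SciPy; the random data and the cut level are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))    # placeholder data

# Agglomerative clustering; 'single' linkage corresponds to the MIN criterion below
Z = linkage(X, method="single", metric="euclidean")

dendrogram(Z)                                        # shows how the clusters are merged
plt.show()

# Cutting the dendrogram at the desired level yields a flat clustering
labels = fcluster(Z, t=3, criterion="maxclust")      # e.g. cut so that 3 clusters remain
```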
“Calculation of the Similarity Between Two Clusters?”
Calculating the similarity between two clusters is important to merge or divide the clusters.

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Ward’s Method
“Calculation of the Similarity Between Two Clusters?”
• MIN
Sim(C1,C2) = Min Sim(Pi, Pj) such that Pi ∈ C1 & Pj ∈ C2.
Also known as single-linkage algorithm. The similarity of two clusters C1
and C2 is equal to the minimum of the similarity between points Pi and Pj
such that Pi belongs to C1 and Pj belongs to C2.

It can separate non-elliptical shapes as long as the gap between the two clusters is not small, but it cannot separate clusters properly if there is noise between the clusters.
“Calculation of the Similarity Between Two Clusters?”
• MAX
Sim(C1,C2) = Max Sim(Pi, Pj) such that Pi ∈ C1 & Pj ∈ C2
Pick the two farthest points such that one point lies in cluster C1 and the other point lies in cluster C2.

Also known as the complete-linkage algorithm; it does well in separating clusters if there is noise between the clusters, but it is biased towards globular clusters and tends to break large clusters.
“Calculation of the Similarity Between Two Clusters?”
• Group Average
sim(C1,C2) = ∑ sim(Pi, Pj) / (|C1|*|C2|), where Pi ∈ C1 & Pj ∈ C2
Take all pairs of points (one from each cluster), compute their similarities, and calculate the average of these similarities.

It does well in separating clusters if there is noise between the clusters, but it is biased towards globular clusters.
“Calculation of the Similarity Between Two Clusters?”
• Distance between centroids
Compute the centroids of two clusters C1 & C2 and take the similarity
between the two centroids as the similarity between two clusters. This is
a less popular technique in the real world.
“Calculation of the Similarity Between Two Clusters?”
• Ward’s Method

sim(C1,C2) = ∑ (dist(Pi, Pj))² / (|C1|*|C2|), where Pi ∈ C1 & Pj ∈ C2

It calculates the similarity between two clusters in the same way as Group Average, except that it sums the squares of the distances between Pi and Pj.

It also does well in separating clusters if there is noise between the clusters, but it is likewise biased towards globular clusters.
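These criteria map roughly onto SciPy's linkage methods: MIN → 'single', MAX → 'complete', Group Average → 'average', distance between centroids → 'centroid', and Ward's method → 'ward' (SciPy's Ward criterion is the variance-minimizing variant). The data and the cut into two clusters below are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))    # placeholder data

# One linkage matrix per similarity criterion discussed above
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                    # Euclidean distance by default
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes under each criterion
```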
Space and Time Complexity of the Hierarchical Clustering Technique
• The space required is very high, because the similarity matrix must be stored in RAM; this becomes a problem for a large number of data points.

Space complexity = O(m²), where m is the number of data points.

• The algorithm performs about m iterations, and in each iteration the similarity matrix must be updated and stored again, so the time complexity is also very high.

Time complexity = O(m³), where m is the number of data points.
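As a rough illustration of these costs (our own numbers, not from the slides): with m = 100,000 data points, the similarity matrix has m² = 10¹⁰ entries; stored as 8-byte floats that is about 80 GB (roughly 40 GB even if only the upper triangle is kept), which far exceeds typical RAM.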


Limitations of the Hierarchical Clustering Technique

• There is no global mathematical objective being optimized in hierarchical clustering (unlike the distortion J in K-means).

• Every approach to calculating the similarity between clusters has its own disadvantages.

• Hierarchical clustering has high space and time complexity, so this clustering algorithm cannot be used when we have huge amounts of data.
