Clustering (Unsupervised Learning)
Partitioned Clustering
• The data set is divided into a pre-defined number of non-overlapping subsets or clusters
• Each data point belongs to exactly one cluster
• One of the most common partitioned clustering algorithms is K-means
1.Initialization:
• Choose the number of clusters, K, and randomly initialize K cluster centroids. These centroids
can be randomly selected data points from the dataset.
2.Assign Data Points to Nearest Cluster:
• For each data point, calculate the distance to each cluster centroid.
• Assign each data point to the cluster whose centroid is the closest (usually based on Euclidean
distance).
3.Update Cluster Centroids:
• Recalculate the centroids of the clusters based on the mean of all data points assigned to each
cluster.
4.Repeat:
• Repeat steps 2 and 3 until convergence, i.e., until the cluster assignments no longer change or a
specified number of iterations is reached.
5.Convergence Criteria:
• Convergence can be determined by checking whether the cluster centroids are no longer
changing significantly between iterations or whether the assignments of data points to clusters
remain unchanged.
6.Final Output:
• Once the algorithm converges, the final output is a set of K clusters, each represented by its
centroid, and the data points assigned to each cluster.
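A minimal NumPy sketch of these six steps. The randomly generated 2-D dataset and the choice K = 3 are illustrative assumptions, not part of the original material:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))          # hypothetical 2-D dataset
K, max_iters = 3, 100                  # assumed number of clusters and iteration cap

# 1. Initialization: pick K random data points as the initial centroids
centroids = X[rng.choice(len(X), size=K, replace=False)]

for _ in range(max_iters):
    # 2. Assignment: label each point with the index of its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # 3. Update: move each centroid to the mean of the points assigned to it
    #    (if a cluster is empty, keep its old centroid)
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(K)
    ])

    # 4./5. Convergence: stop when the centroids no longer change significantly
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

# 6. Final output: K centroids and the cluster label of every data point
print(centroids)
print(labels[:10])
```

In practice a library implementation such as scikit-learn's KMeans would be used instead; the sketch above only mirrors the numbered steps.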
Partitioned Clustering (K-Means Clustering)
Partitional clustering assigns a set of n data points into k clusters through an iterative process.
Elbow Method
The elbow method is a graphical method for finding the optimal K value in a k-means
clustering algorithm.
It works by computing the WCSS (Within-Cluster Sum of Squares), i.e. the sum of the squared distances between the points in a cluster and that cluster's centroid.
Step 1: Run K-means for a range of K values (e.g., K = 1 to 10).
Step 2: For each K, calculate the within-cluster sum of squares (WCSS).
Step 3: Plot WCSS against K; the "elbow" point, where the curve stops dropping sharply, indicates the optimal K.
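A sketch of these steps, assuming scikit-learn and a synthetic blob dataset; KMeans exposes the WCSS as its inertia_ attribute, and the K range of 1 to 10 is an arbitrary illustrative choice:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical dataset with an underlying structure of 4 blobs
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

wcss = []
k_values = range(1, 11)
for k in k_values:
    # Step 1: run K-means for each candidate K
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Step 2: record the within-cluster sum of squares (inertia_ in scikit-learn)
    wcss.append(km.inertia_)

# Step 3: plot WCSS vs K and look for the "elbow" where the curve flattens
plt.plot(list(k_values), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()
```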
Hierarchical Clustering
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to
group the unlabeled datasets into a cluster and also known as hierarchical cluster analysis or
HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-
shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two methods differ in how they work: unlike the K-means algorithm, hierarchical clustering does not require the number of clusters to be predetermined.
The hierarchical clustering technique has two approaches:
1.Agglomerative: A bottom-up approach, in which the algorithm starts by treating each data point as a single cluster and keeps merging the closest clusters until only one cluster is left.
2.Divisive: The divisive algorithm is the reverse of the agglomerative one; it is a top-down approach that starts with all data points in a single cluster and recursively splits it.
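A minimal scikit-learn sketch of the agglomerative (bottom-up) approach; the blob dataset and the choices of 3 clusters and average linkage are assumptions for illustration only:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Hypothetical dataset; in practice you would use your own unlabeled data
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Bottom-up clustering: each point starts as its own cluster and the closest
# pairs of clusters are merged until 3 clusters remain
agg = AgglomerativeClustering(n_clusters=3, linkage="average").fit(X)
print(agg.labels_[:20])
```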
Agglomerative Hierarchical clustering
There are various ways to calculate the distance between two clusters, and these ways decide the
rule for clustering. These measures are called Linkage methods.
1. Single Linkage
2. Complete Linkage
3. Average Linkage
4. Centroid Linkage
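A SciPy sketch comparing these four linkage methods on a small, assumed random dataset; linkage builds the merge hierarchy and dendrogram draws the tree-shaped structure described earlier:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.random((8, 2))                       # hypothetical small dataset

methods = ["single", "complete", "average", "centroid"]
fig, axes = plt.subplots(1, len(methods), figsize=(16, 4))
for ax, method in zip(axes, methods):
    Z = linkage(X, method=method)            # pairwise merges under the chosen linkage rule
    dendrogram(Z, ax=ax)                     # tree-shaped structure (dendrogram)
    ax.set_title(f"{method} linkage")
plt.show()
```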
Single Linkage
Problem: Assume that the database D is given by the table below. Follow the single-link technique to find clusters in D. Use the Euclidean distance measure.
Step 1. Plot the objects in n-dimensional space (where n is the number of attributes). In
our case we have 2 attributes – x and y, so we plot the objects p1, p2, … p6 in 2-
dimensional space:
Step 2. Calculate the distance from each object (point) to all other points, using Euclidean
distance measure, and place the numbers in a distance matrix.
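A small SciPy sketch of this step. The coordinates of p1 … p6 are not reproduced in these notes, so the points below are placeholders rather than the original data:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Placeholder coordinates standing in for p1 ... p6 (not the original table)
points = np.array([
    [0.1, 0.6], [0.2, 0.3], [0.4, 0.4],
    [0.3, 0.1], [0.1, 0.4], [0.5, 0.3],
])

# Pairwise Euclidean distances arranged as a symmetric distance matrix
dist_matrix = squareform(pdist(points, metric="euclidean"))
print(np.round(dist_matrix, 2))
```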
Step 3. Identify the two clusters with the shortest distance in the matrix and merge them. Re-compute the distance matrix, since those two clusters now form a single cluster and no longer exist by themselves.
By looking at the distance matrix above, we see that p3 and p6 have the smallest distance of all: 0.11. So we merge those two into a single cluster and re-compute the distance matrix.
dist( (p3, p6), p1 ) = MIN( dist(p3, p1), dist(p6, p1) )
                     = MIN( 0.22, 0.23 )    // from the original matrix
                     = 0.22
Step 4. Repeat Step 3 until all clusters are merged.
Since we have merged (p2, p5) into one cluster, we now have a single entry for (p2, p5) in the table and no longer have p2 or p5 separately. Therefore, we need to re-compute the distance from all other points / clusters to our new cluster (p2, p5). The distance between (p3, p6) and (p2, p5) is calculated as follows:
dist( (p3, p6), (p2, p5) ) = MIN( dist(p3, p2), dist(p6, p2), dist(p3, p5), dist(p6, p5) )
                           = MIN( 0.15, 0.25, 0.28, 0.39 )    // from the original matrix
                           = 0.15
Since we have more clusters to merge, we continue to repeat Step 3.
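The merge-and-update loop of Steps 3 and 4 can be sketched directly on a distance matrix. This is a naive illustration of the MIN (single-linkage) update rule, not an efficient implementation, and the 4-point matrix below is hypothetical rather than the matrix from the example:

```python
import numpy as np

def single_linkage_merge(dist, clusters):
    """Merge the two closest clusters and update the matrix with the MIN rule."""
    # Mask the diagonal so we only compare distances between different clusters
    masked = dist + np.diag([np.inf] * len(dist))
    i, j = np.unravel_index(np.argmin(masked), masked.shape)
    i, j = min(i, j), max(i, j)

    # Single linkage: the distance from the merged cluster to every other
    # cluster is the MINIMUM of the two old distances
    merged_row = np.minimum(dist[i], dist[j])
    dist[i], dist[:, i] = merged_row, merged_row
    dist[i, i] = 0.0

    # Drop row/column j and record the merge
    dist = np.delete(np.delete(dist, j, axis=0), j, axis=1)
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
    return dist, clusters

# Hypothetical symmetric distance matrix for 4 points
dist = np.array([[0.00, 0.24, 0.22, 0.37],
                 [0.24, 0.00, 0.15, 0.20],
                 [0.22, 0.15, 0.00, 0.15],
                 [0.37, 0.20, 0.15, 0.00]])
clusters = [["p1"], ["p2"], ["p3"], ["p4"]]

while len(clusters) > 1:            # repeat Step 3 until all clusters are merged
    dist, clusters = single_linkage_merge(dist, clusters)
    print(clusters)
```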
Stopping condition
The user would like the data partitioned into several clusters for unsupervised
learning purposes.
The algorithm has to stop clustering at some point: either the user specifies the number of clusters they would like to have, or the algorithm has to make that decision on its own (for example, by cutting the dendrogram at a chosen distance threshold).
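A sketch of these two stopping options with SciPy; the random dataset and the threshold value are assumptions. fcluster can cut the merge hierarchy either at a user-chosen number of clusters or at a distance threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
X = rng.random((10, 2))                      # hypothetical data
Z = linkage(X, method="single")              # single-linkage merge history

# Option 1: the user specifies the number of clusters they would like
labels_by_k = fcluster(Z, t=3, criterion="maxclust")

# Option 2: cut the tree at a distance threshold instead of a fixed K
labels_by_dist = fcluster(Z, t=0.25, criterion="distance")

print(labels_by_k)
print(labels_by_dist)
```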
Density based clustering
Density-based clustering is an unsupervised, non-parametric approach: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors) and marks as outliers the points that lie alone in low-density regions (points whose nearest neighbors are too far away).
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the most widely used density-based clustering algorithm.
In the worked example, epsilon = 1.9 and minPts = 4:
• For each point, find all points whose distance to it is less than epsilon (1.9 in this case).
• If a point has fewer than 4 points in its epsilon-neighborhood, it is provisionally labeled noise.
• If a point has 4 or more points in its epsilon-neighborhood, it is a core point.
• A point that is not core but lies in the epsilon-neighborhood of a core point is a border point; a point that is neither core nor border remains noise (an outlier).
• Each core point, together with the core and border points reachable from it, forms a group (cluster).
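A scikit-learn sketch using the same parameter values as this example (eps = 1.9, minPts = 4); the points themselves are assumed, since the original figure is not reproduced here:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D points (two dense groups plus one far-away outlier)
X = np.array([[1, 2], [2, 2], [2, 3], [3, 2], [8, 7],
              [8, 8], [7, 7], [7, 8], [25, 80]])

# eps = 1.9 and min_samples = 4 mirror the epsilon / "at least 4 points" rule above
# (note: min_samples counts the point itself in its neighborhood)
db = DBSCAN(eps=1.9, min_samples=4).fit(X)

print(db.labels_)                 # cluster index per point; -1 marks noise/outliers
print(db.core_sample_indices_)    # indices of the core points
```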