
CLUSTERING

Clustering:
The process of grouping a set of abstract objects into classes of similar objects is known as clustering.
Points to Remember:
 A group of data objects is treated as one cluster.
 In cluster analysis, the set of data is first partitioned into groups on the basis of data similarity, and then labels are assigned to the groups.
 The biggest advantage of clustering over classification is that it can adapt to changes and helps single out the useful features that distinguish different groups.

Cluster Analysis
Cluster analysis, also known as clustering, groups similar data points into clusters. The goal is to ensure that data points within a cluster are more similar to each other than to those in other clusters. For example, e-commerce retailers use clustering to group customers based on their purchasing habits: one group may frequently buy fitness gear while another prefers electronics. This helps companies give personalized recommendations and improve the customer experience.

It is useful for:
 Scalability: It can efficiently handle large volumes of data.
 High Dimensionality: It can handle high-dimensional data.
 Adaptability to Different Data Types: It can work with numerical data like age and salary as well as categorical data like gender and occupation.
 Handling Noisy and Missing Data: Datasets often contain missing values or inconsistencies, and clustering can manage them.
 Interpretability: The output of clustering is easy to understand and apply in real-world scenarios.

Types of Clustering Techniques


Clustering can be broadly classified into several methods. The choice of
method depends on the type of data and the problem you're solving.


Partitioning Methods
Partitioning methods divide the data into K groups (clusters) where each data point belongs to only one group. These methods are used when you already know how many clusters you want to create. A common example is K-means clustering.
In K-means, the algorithm assigns each data point to the nearest centre and then updates the centre based on the average of all points in that group. This process repeats until the centres stop changing. It is used in real-life applications such as streaming platforms like Spotify grouping users based on their listening habits.

This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It requires the data analyst to specify the number of clusters that have to be generated.

In the partitioning method, given a database (D) that contains multiple (N) objects, the method constructs a user-specified number (K) of partitions of the data, in which each partition represents a cluster and a particular region. There are many algorithms that come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), CLARA (Clustering Large Applications), etc. Here we will see the working of the K-Means algorithm in detail.

K-Means (A Centroid-Based Technique): The K-means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the resulting similarity among data objects inside a group (intra-cluster) is high while the similarity of data objects with objects outside the cluster (inter-cluster) is low. The similarity of a cluster is determined with respect to the mean value of the cluster. It is a type of squared-error algorithm. At the start, K objects are chosen randomly from the dataset, each representing a cluster mean (centre). Each of the remaining data objects is assigned to the nearest cluster based on its distance from the cluster mean. The new mean of each cluster is then calculated with the added data objects.

Algorithm:
K-Means:
Input:
K: The number of clusters into which the dataset has to be divided

D: A dataset containing N number of objects

Output:
A set of K clusters
Method:
1. Randomly select K objects from the dataset (D) as the initial cluster centres (C).
2. (Re)assign each object to the cluster whose mean it is most similar to, based on the mean values.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated assignments.
4. Repeat steps 2 and 3 until no change occurs.

Figure - K-Means Clustering

For a worked example, refer to the running notes; a minimal code sketch of the algorithm is given below.
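
The steps above can be sketched in Python as follows (a minimal illustration using NumPy; the function name and the handling of details such as empty clusters are our own simplifications, not part of the original algorithm description):

import numpy as np

def k_means(data, k, max_iters=100, seed=0):
    """Minimal K-Means sketch: data is an (N, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select K objects as the initial cluster centres
    centres = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: (re)assign each object to its nearest centre
        distances = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recalculate each cluster mean from its assigned points
        # (an empty cluster would need special handling; ignored in this sketch)
        new_centres = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centres no longer change
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres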

Hierarchical Clustering
A hierarchical clustering method works by grouping data into a tree of clusters. Agglomerative hierarchical clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the following steps:
 Identify the two clusters that are closest together, and
 Merge these two most similar clusters. We continue these steps until all the clusters are merged together.

In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A dendrogram (a tree-like diagram that records the sequence of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view).

Let's say we have six data points A, B, C, D, E, and F.

 Step-1: Consider each letter as a single cluster and calculate the distance of each cluster from all the other clusters.
 Step-2: Comparable clusters are merged together to form a single cluster. Let's say cluster (B) and cluster (C) are very similar to each other, so we merge them; similarly for clusters (D) and (E). We are left with the clusters [(A), (BC), (DE), (F)].
 Step-3: We recalculate the proximities according to the algorithm and merge the two nearest clusters ((DE) and (F)) together to form the new clusters [(A), (BC), (DEF)].
 Step-4: Repeating the same process, the clusters DEF and BC are comparable and are merged together. We're now left with the clusters [(A), (BCDEF)].
 Step-5: At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
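
A brief sketch of this bottom-up merging in Python, using SciPy (the six one-dimensional coordinates standing in for A-F are made up for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical coordinates for the six points A, B, C, D, E and F
points = np.array([[1.0], [4.0], [4.5], [9.0], [9.5], [12.0]])

# Agglomerative (bottom-up) merging: the closest pair of clusters is merged first
Z = linkage(points, method="single")

# Cut the resulting tree into, say, 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# dendrogram(Z) draws the tree of merges (requires matplotlib)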

Divisive Hierarchical Clustering
Divisive hierarchical clustering is precisely the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, we start with all of the data points in a single cluster and, in every iteration, separate out the data points that are not comparable with the rest of their cluster. In the end, we are left with N clusters.


Figure - Divisive Hierarchical Clustering

DBSCAN (Density-Based Clustering)
DBSCAN is a density-based clustering algorithm that groups data points that are closely packed together and marks outliers as noise based on their density in the feature space. It identifies clusters as dense regions in the data space separated by areas of lower density. Unlike K-Means or hierarchical clustering, which assume clusters are compact and spherical, DBSCAN performs well in handling real-world data irregularities such as:
1. Arbitrary-Shaped Clusters: Clusters can take any shape, not just circular or convex.
2. Noise and Outliers: It effectively identifies and handles noise points without assigning them to any cluster.

Figure - Comparison of K-Means, hierarchical clustering and DBSCAN
The figure shows a data set where K-Means and hierarchical clustering handle compact, spherical clusters with varying noise tolerance, while DBSCAN manages arbitrary-shaped clusters and noise.
Key Parameters in DBSCAN


1. eps: This defines the radius of the neighborhood around a data point. If the distance between two points is less than or equal to eps, they are considered neighbors. A common method to determine eps is by analyzing the k-distance graph. Choosing the right eps is important:
 If eps is too small, most points will be classified as noise.
 If eps is too large, clusters may merge and the algorithm may fail to distinguish between them.
2. MinPts: This is the minimum number of points required within the eps radius to form a dense region. A general rule of thumb is to set MinPts >= D+1, where D is the number of dimensions in the dataset. For most cases a minimum value of MinPts = 3 is recommended. A short usage sketch in code is given below.
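
A minimal usage sketch with scikit-learn (the two-moons data and the eps and min_samples values are illustrative choices, not recommendations from these notes):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: arbitrary-shaped clusters that K-Means struggles with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the neighborhood radius; min_samples plays the role of MinPts (here D = 2)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                     # cluster index per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| noise points:", list(labels).count(-1))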

Figure - K-Means Clustering (flowchart)
Example: Suppose we want to group the visitors to a website using just
their age as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.


Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between Iteration 3 and Iteration 4, so we stop. Therefore, we get (16-29) and (36-66) as the 2 clusters obtained using the K-Means algorithm.
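
The same example can be reproduced with scikit-learn (a hedged sketch; with k-means++ initialisation the result usually, though not necessarily, matches the hand computation above):

import numpy as np
from sklearn.cluster import KMeans

ages = np.array([16, 16, 17, 20, 20, 21, 21, 22, 23, 29,
                 36, 41, 42, 43, 44, 45, 61, 62, 66], dtype=float).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ages)
print(km.cluster_centers_.ravel())   # expected to come out near 20.5 and 48.89
print(km.labels_)                    # cluster assignment for each age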

Grid-Based Methods
Grid-based clustering methods are a category of clustering algorithms that quantize the data space into a finite number of cells or grids, and then perform clustering operations on the grid structure. The key idea is to reduce the vast number of data objects to a much smaller number of grid cells, which can make the clustering process more efficient, especially for large datasets.
Grid-based clustering offers several advantages:
 Efficiency: By reducing the problem space to a finite number of cells, grid-based methods can significantly speed up the clustering process, especially for large datasets.
 Scalability: These methods can handle high-dimensional data more effectively by working with grids in subspaces.
 Flexibility: Grid-based methods can easily adapt to different data distributions and can be combined with other clustering techniques.


Process:
 Grid Creation: The data space is partitioned into a grid structure.
 Cell Assignment: Each data point is assigned to the corresponding grid
cell.
 Clustering Operations: Clustering algorithms are applied on the grid cells,
rather than individual data points.
 Cluster Refinement (Optional): In some methods, clusters can be refined further by examining the relationships between cells. (A toy sketch of these steps is given after this list.)
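
A toy sketch of this process in Python (the 2-D blob data, the 20x20 grid and the density threshold are illustrative assumptions; real grid-based algorithms such as STING or CLIQUE are considerably more refined):

import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
# Two synthetic blobs of 2-D points
X = np.vstack([rng.normal([2.0, 2.0], 0.3, (200, 2)),
               rng.normal([6.0, 5.0], 0.3, (200, 2))])

# Grid creation + cell assignment: count the points falling in each cell of a 20x20 grid
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=20)

# Clustering operation on cells: keep "dense" cells and join adjacent dense cells
dense = counts >= 5
cell_labels, n_clusters = ndimage.label(dense)
print("grid clusters found:", n_clusters)

# Cluster refinement: map each point back to the cluster of its grid cell
ix = np.clip(np.digitize(X[:, 0], xedges) - 1, 0, counts.shape[0] - 1)
iy = np.clip(np.digitize(X[:, 1], yedges) - 1, 0, counts.shape[1] - 1)
point_labels = cell_labels[ix, iy]      # label 0 means the point fell in a sparse cell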
Advantages:
 Fast Processing:
Grid-based clustering often has a faster processing time, especially for
large datasets, as the number of calculations is reduced.
 Scalability:
These methods can be more scalable for large datasets, as the clustering
operations are performed on the grid cells rather than individual data
points.
 Computational Complexity:
Grid-based methods can have lower computational complexity, making
them suitable for analyzing large and complex datasets.
Examples:
 STING (Statistical Information Grid):
A grid-based clustering algorithm that uses statistical information within the
grid cells to identify clusters.
 CLIQUE (CLustering In QUEst):
Another grid-based clustering algorithm that identifies clusters by
examining the density of grid cells.
 WaveCluster:
A grid-based clustering algorithm that uses a wavelet transform to analyze the data.

Applications:
Grid-based clustering is used in various applications, including:
 Spatial Data Mining: Analyzing spatial datasets, such as location data or
geographic information systems.
 Web Data Mining: Analyzing web usage patterns and identifying user
clusters.
 Image Processing: Segmenting images and identifying regions of interest.
 Data Mining: General clustering tasks, particularly for large and complex
datasets.

Clustering Evaluation Metrics
Clustering is a technique in machine learning that is used to group similar data points. While the algorithm does its job of uncovering patterns and structures in the data, it is important to judge how well it functions. Several metrics have been designed to evaluate the performance of these clustering algorithms.
Clustering is an unsupervised machine-learning approach that is used
to group comparable data points based on specific traits or attributes.
Clustering algorithms do not require labelled data, which makes them ideal
for finding patterns in large datasets. It is a widely used technique in
applications like customer segmentation, image recognition, anomaly
detection, etc.
There are multiple clustering algorithms, and each has its way of grouping
data points. Clustering metrics are used to evaluate all these algorithms.
Let us take a look at some of the most commonly used clustering metrics:
1. Silhouette Score
The Silhouette Score is a way to measure how good the clusters are in a
dataset. It helps us understand how well the data points have been
grouped. The score ranges from -1 to 1.
 A score close to 1 means a point fits really well in its group (cluster) and
is far from other groups.
 A score close to 0 means the point is on the border between two
clusters.
 A score close to -1 means the point might be in the wrong cluster.
Silhouette Score (S) for a data point i is calculated as:


S(i) = (b(i) − a(i)) / max(a(i), b(i))
where,
 a(i) is the average distance from i to other data points in the same
cluster.
 b(i) is the smallest average distance from i to data points in a different
cluster.
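A small sketch using scikit-learn's implementation (the blob data and k = 3 are illustrative assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette over all points: close to 1 is good, near 0 means overlapping clusters
print("Silhouette Score:", silhouette_score(X, labels))
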
2. Davies-Bouldin Index
The Davies-Bouldin Index (DBI) helps us measure how good the
clustering is in a dataset. It looks at how tight each cluster is
(compactness), and how far apart the clusters are (separation).
 Lower DBI = better, clearer clusters
 Higher DBI = messy, overlapping clusters
A lower score is better, because it means:
 Points in the same cluster are close to each other.
 Different clusters are far apart from one another.
Davies-Bouldin Index (DB) is calculated as:
DB = (1/k) Σ_{i=1}^{k} max_{j≠i} ( (R_ii + R_jj) / R_ij )
where,
 k is the total number of clusters.
 R_ii is the compactness of cluster i.
 R_jj is the compactness of cluster j.
 R_ij is the dissimilarity (distance) between cluster i and cluster j.
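
A corresponding sketch with scikit-learn (the same illustrative blob data as in the silhouette example):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Lower DBI means tighter and better-separated clusters
print("Davies-Bouldin Index:", davies_bouldin_score(X, labels))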

3. Calinski-Harabasz Index (Variance Ratio Criterion)
The Calinski-Harabasz Index measures how good the clusters are in a dataset.
It looks at:
 How close the points are inside each cluster.
 How far apart the clusters are.
A higher score is better, as it means the clusters are tight and well-
separated. It helps determine the ideal number of clusters.
Calinski-Harabasz Index (CH) is calculated as:
CH = (B / W) × ( (N − K) / (K − 1) )
where,
 B is the sum of squares between clusters.
 W is the sum of squares within clusters.
 N is the total number of data points.
 K is the number of clusters.
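
A sketch that also illustrates using the index to pick the number of clusters (the blob data and candidate k values are illustrative assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Higher CH is better; comparing a few k values hints at the ideal cluster count
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, calinski_harabasz_score(X, labels))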

4. Adjusted Rand Index (ARI)


The Adjusted Rand Index (ARI) helps us measure how accurate a clustering result is by comparing it to the true labels (ground truth).
It checks how well the pairs of points are grouped:
 Are the same pairs together in both the real and predicted clusters?
 Are different pairs also kept apart correctly?
The score ranges from -1 to 1:
 1 means perfect match - the clustering is exactly right.
 0 means random guess - no better than chance.
 Below 0 means worse than random - very poor clustering.
Adjusted Rand Index (ARI) is calculated as:
ARI = (RI − Expected RI) / (max(RI) − Expected RI)
where,
 RI is the Rand Index.
 Expected RI is the expected value of the Rand Index.
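
A tiny sketch with scikit-learn (the two label lists are hypothetical):

from sklearn.metrics import adjusted_rand_score

true_labels      = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # hypothetical ground truth
predicted_labels = [1, 1, 1, 0, 0, 2, 2, 2, 2]   # hypothetical clustering output

# ARI ignores the label names and only compares how pairs of points are grouped
print("ARI:", adjusted_rand_score(true_labels, predicted_labels))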

5. Mutual Information (MI)


Mutual Information measures how much two variables are related or
connected. In clustering, it compares how much the true cluster labels
match with the predicted labels. It shows how much knowing about one
variable helps us predict the other. The more agreement there is, the
higher the score.
 Higher values mean better agreement between the clusters.
Zero means no agreement at all.

MI between true labels Y and predicted labels Z is calculated as:
MI(Y, Z) = Σ_i Σ_j p(y_i, z_j) · log( p(y_i, z_j) / (p(y_i) · p(z_j)) )
where,
 y_i is a true label.
 z_j is a predicted label.
 p(y_i, z_j) is the joint probability of y_i and z_j.
 p(y_i) and p(z_j) are the marginal probabilities.
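
A tiny sketch with scikit-learn (the same hypothetical label lists as in the ARI example; the normalised variant is shown as well because raw MI is not bounded by 1):

from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

true_labels      = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # hypothetical ground truth
predicted_labels = [1, 1, 1, 0, 0, 2, 2, 2, 2]   # hypothetical clustering output

print("MI :", mutual_info_score(true_labels, predicted_labels))
# NMI rescales MI to [0, 1], which makes results easier to compare
print("NMI:", normalized_mutual_info_score(true_labels, predicted_labels))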
These clustering metrics help in evaluating the quality and performance of
clustering algorithms, allowing for informed decisions when selecting the
most suitable clustering solution for a given dataset.

**************************************************
