A Comprehensive Guide to Unsupervised Learning and
Clustering Algorithms
September 14, 2025
Contents
1 Basic Fundamental Questions of Unsupervised Learning
2 Difference Between Classification and Clustering
3 Four Types of Distance Measures
  3.1 Euclidean Distance
  3.2 Manhattan Distance
  3.3 Cosine Similarity/Distance
  3.4 Jaccard Distance
4 Comparisons of Four Clustering Algorithms
  4.1 K-means Clustering
  4.2 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
  4.3 Hierarchical Clustering
  4.4 Gaussian Mixture Model (GMM)
5 Pros and Cons of the Four Algorithms
  5.1 K-means Clustering
  5.2 DBSCAN
  5.3 Hierarchical Clustering
  5.4 Gaussian Mixture Model (GMM)
6 Which Algorithm is Suitable for Which Situation?
  6.1 When to Use K-means
  6.2 When to Use DBSCAN
  6.3 When to Use Hierarchical Clustering
  6.4 When to Use Gaussian Mixture Model
7 Challenges in Unsupervised Learning
8 Gaussian Mixture Models and the EM Algorithm
  8.1 Gaussian Mixture Models (GMM)
  8.2 Expectation-Maximization (EM) Algorithm
9 Pros and Cons of K-means Algorithm
  9.1 Pros of K-means
  9.2 Cons of K-means
10 Steps in K-means Algorithm
11 The Elbow Method for Optimal Value of k in K-means
  11.1 Concept of the Elbow Method
  11.2 Mathematical Formulation
  11.3 Steps to Implement the Elbow Method
  11.4 Interpretation of the Elbow Plot
  11.5 Advantages of the Elbow Method
  11.6 Limitations of the Elbow Method
12 Maximum Likelihood Calculation
1 Basic Fundamental Questions of Unsupervised Learning
Unsupervised learning is a type of machine learning where the model learns patterns from unlabeled
data. Unlike supervised learning, there are no predefined output labels to guide the learning process.
The fundamental questions in unsupervised learning include:
1. What structures exist in the data? Unsupervised learning aims to discover inherent patterns,
groupings, or relationships within the data without prior knowledge of what these might be.
2. How can we represent the data efficiently? Unsupervised learning seeks to find compact
representations of the data, such as through dimensionality reduction or feature extraction.
3. What are the underlying factors that generate the data? By identifying latent variables
or hidden structures, unsupervised learning attempts to understand the data-generating process.
4. How can we evaluate the quality of unsupervised learning? Without ground truth labels,
evaluating unsupervised learning algorithms becomes challenging and requires alternative metrics.
The primary goal of unsupervised learning is to explore the data and find meaningful patterns or
structures that can provide insights into the underlying phenomena.
2 Difference Between Classification and Clustering
Classification and clustering are both techniques for grouping data, but they differ fundamentally in
their approach and application:
Aspect           | Classification                        | Clustering
Learning Type    | Supervised Learning                   | Unsupervised Learning
Data Requirement | Labeled data (input-output pairs)     | Unlabeled data (only inputs)
Goal             | Predict labels for new data points    | Discover natural groupings in data
Output           | Predefined categories                 | Discovered clusters
Evaluation       | Accuracy, precision, recall, F1-score | Internal metrics (e.g., silhouette score)
Training Process | Learn from labeled examples           | Find patterns without guidance
Example          | Spam detection, image recognition     | Customer segmentation, anomaly detection

Table 1: Comparison between Classification and Clustering
In classification, the algorithm learns from a training set where each data point is associated with a
known label. The goal is to build a model that can accurately predict the label of new, unseen data points.
In contrast, clustering works with unlabeled data and aims to group similar data points together based
on their features or characteristics, without any prior knowledge of what these groups might represent.
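To make the contrast concrete, here is a minimal Python sketch using scikit-learn; the dataset, the logistic-regression classifier, and the K-means clusterer are illustrative choices, not prescribed by this guide. The classifier is trained on the labels y, while the clusterer only ever sees the feature matrix X:

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic 2-D data with two groups; y holds the "true" labels
X, y = make_blobs(n_samples=200, centers=2, random_state=1)

clf = LogisticRegression().fit(X, y)                  # classification: learns from labels
clusters = KMeans(n_clusters=2, n_init=10,
                  random_state=1).fit_predict(X)      # clustering: labels never used

print("predicted labels:   ", clf.predict(X[:5]))
print("discovered clusters:", clusters[:5])

Note that the discovered cluster indices are arbitrary: cluster 0 need not correspond to label 0, which is exactly why clustering results require interpretation.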
3 Four Types of Distance Measures
Distance measures are fundamental to clustering algorithms as they quantify the similarity or dissimi-
larity between data points. Here are four common distance measures used in clustering:
3.1 Euclidean Distance
Euclidean distance is the most common distance measure, representing the straight-line distance between
two points in Euclidean space. For two points x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n), the
Euclidean distance is defined as:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (1)
It is suitable for continuous variables and works well when clusters are spherical or globular.
3.2 Manhattan Distance
Manhattan distance, also known as city block distance or L1 distance, is the sum of absolute differences
between coordinates:
d(x, y) = \sum_{i=1}^{n} |x_i - y_i|    (2)
It is less sensitive to outliers than Euclidean distance and is useful when dealing with high-dimensional
data or when the movement is restricted to grid-like paths.
3.3 Cosine Similarity/Distance
Cosine similarity measures the cosine of the angle between two vectors, focusing on orientation rather
than magnitude:

\text{similarity}(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}    (3)
Cosine distance is then defined as:
d(x, y) = 1 − similarity(x, y) (4)
It is particularly useful for text analysis and high-dimensional sparse data where the magnitude of vectors
may not be as important as their direction.
3.4 Jaccard Distance
Jaccard distance measures dissimilarity between sample sets. For two sets A and B, the Jaccard similarity
is:
J(A, B) = \frac{|A \cap B|}{|A \cup B|}    (5)
The Jaccard distance is then:
dJ (A, B) = 1 − J(A, B) (6)
It is commonly used for binary data, such as when comparing sets of items or presence/absence features.
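The four measures are straightforward to compute directly. Below is a small NumPy sketch implementing Equations (1)-(6); the helper names and the example vectors and sets are made up for illustration:

import numpy as np

def euclidean(x, y):
    # Straight-line (L2) distance, Eq. (1)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # City-block (L1) distance, Eq. (2)
    return np.sum(np.abs(x - y))

def cosine_distance(x, y):
    # One minus the cosine of the angle between x and y, Eqs. (3)-(4)
    sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - sim

def jaccard_distance(a, b):
    # a and b are Python sets, Eqs. (5)-(6)
    return 1.0 - len(a & b) / len(a | b)

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(x, y), manhattan(x, y), cosine_distance(x, y))
print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))

SciPy's scipy.spatial.distance module offers equivalent functions (euclidean, cityblock, cosine, jaccard) for production use, though its jaccard operates on boolean vectors rather than Python sets.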
4 Comparisons of Four Clustering Algorithms
4.1 K-means Clustering
K-means is a centroid-based algorithm that partitions data into K clusters, where each data point belongs
to the cluster with the nearest mean (centroid). It aims to minimize the within-cluster sum of squares.
4.2 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a density-based algorithm that groups together points that are closely packed, marking as
outliers points that lie alone in low-density regions. It doesn’t require specifying the number of clusters
in advance.
4.3 Hierarchical Clustering
Hierarchical clustering builds a hierarchy of clusters either through a bottom-up approach (agglomerative)
or a top-down approach (divisive). It creates a tree-like structure called a dendrogram that shows the
arrangement of the clusters.
4.4 Gaussian Mixture Model (GMM)
GMM is a probabilistic model that assumes all data points are generated from a mixture of several
Gaussian distributions with unknown parameters. It provides soft clustering, where each data point has
a probability of belonging to each cluster.
Aspect                | K-means                  | DBSCAN                  | Hierarchical                | GMM
Cluster Shape         | Spherical                | Arbitrary               | Arbitrary                   | Elliptical
Number of Clusters    | Predefined (K)           | Automatic               | Determined by cutoff        | Predefined (K)
Outlier Handling      | Poor                     | Excellent               | Moderate                    | Good
Cluster Assignment    | Hard                     | Hard                    | Hard                        | Soft (probabilistic)
Scalability           | High                     | Moderate                | Low (O(n²) or worse)        | Moderate
Parameter Sensitivity | High (K, initialization) | High (eps, min samples) | Moderate (linkage criteria) | Moderate (K, initialization)
Interpretability      | High                     | Moderate                | High (dendrogram)           | Moderate

Table 2: Comparison of Four Clustering Algorithms
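As a hands-on complement to Table 2, the sketch below runs all four algorithms on the same synthetic dataset with scikit-learn. The dataset and the parameter values (eps, min_samples, the number of clusters/components) are illustrative, not tuned recommendations:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Three well-separated Gaussian blobs in 2-D
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)       # label -1 marks noise
hier_labels   = AgglomerativeClustering(n_clusters=3).fit_predict(X)
gmm           = GaussianMixture(n_components=3, random_state=42).fit(X)
gmm_labels    = gmm.predict(X)        # hard assignment
gmm_probs     = gmm.predict_proba(X)  # soft (probabilistic) assignment

for name, labels in [("K-means", kmeans_labels), ("DBSCAN", dbscan_labels),
                     ("Hierarchical", hier_labels), ("GMM", gmm_labels)]:
    print(name, "clusters found:", len(set(labels) - {-1}))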
5 Pros and Cons of the Four Algorithms
5.1 K-means Clustering
Pros:
• Simple to understand and implement
• Computationally efficient with time complexity O(tkn), where t is iterations, k is clusters, and n
is data points
• Scales well to large datasets
• Guarantees convergence (though to a local optimum)
• Easy to interpret results
Cons:
• Requires specifying the number of clusters (K) in advance
• Sensitive to initialization and may converge to local optima
• Assumes spherical clusters and struggles with irregular shapes
• Sensitive to outliers and noise
• May perform poorly with clusters of different sizes and densities
5.2 DBSCAN
Pros:
• Does not require specifying the number of clusters in advance
• Can find arbitrarily shaped clusters
• Robust to outliers and noise (identifies them as separate)
• Only two parameters to tune (eps and min samples)
• Works well with spatial data
Cons:
• Sensitive to parameter settings (eps and min samples)
• Struggles with clusters of varying densities
• Not deterministic; border points may be assigned to different clusters on different runs
• Computationally expensive for large datasets (though optimizations exist)
• May have difficulty with high-dimensional data (curse of dimensionality)
5.3 Hierarchical Clustering
Pros:
• Does not require specifying the number of clusters in advance
• Provides a dendrogram for visualization of cluster relationships
• Captures nested cluster structures
• Can use various distance metrics and linkage criteria
• Intuitive and easy to understand
Cons:
• Computationally expensive (O(n²) or worse time complexity)
• Sensitive to noise and outliers
• Once a decision is made to combine or split clusters, it cannot be undone
• May not scale well to large datasets
• Different linkage criteria can produce very different results
5.4 Gaussian Mixture Model (GMM)
Pros:
• Provides soft clustering (probabilistic cluster assignments)
• Can model elliptical clusters (more flexible than K-means)
• Provides a probabilistic framework for cluster membership
• Can be used for density estimation
• Works well with overlapping clusters
Cons:
• Requires specifying the number of components (clusters) in advance
• Computationally more intensive than K-means
• May converge to local optima
• Assumes data is generated from a mixture of Gaussian distributions
• Sensitive to initialization
• Can be difficult to interpret with many dimensions
6 Which Algorithm is Suitable for Which Situation?
The choice of clustering algorithm depends on the characteristics of the data and the specific requirements
of the application:
6.1 When to Use K-means
• When you have a rough idea of the number of clusters
• When clusters are expected to be spherical and similar in size
• When working with large datasets where computational efficiency is important
• When you need a simple, easy-to-interpret solution
• When the data is not too high-dimensional
6.2 When to Use DBSCAN
• When clusters have arbitrary, non-spherical shapes
• When the dataset contains noise or outliers that need to be identified
• When you don’t know the number of clusters in advance
• When working with spatial or geographic data
• When clusters have varying densities (though this can be challenging)
6.3 When to Use Hierarchical Clustering
• When you want to understand the hierarchical relationship between clusters
• When the number of clusters is unknown and you want to explore different levels of granularity
• When working with smaller datasets where computational cost is not a major concern
• When you need a dendrogram for visualization or presentation purposes
• When the underlying data has a natural hierarchical structure
6.4 When to Use Gaussian Mixture Model
• Data is generated from a mixture of a fixed number of Gaussian (normal) distributions.
• When you need probabilistic cluster assignments (soft clustering)
• When clusters are expected to have elliptical shapes
• When clusters may overlap
• When you want to model the underlying probability distribution of the data
• When you need density estimation in addition to clustering
7 Challenges in Unsupervised Learning
Unsupervised learning faces several unique challenges that distinguish it from supervised learning ap-
proaches:
Noisy Data Outliers and noise can distort patterns and reduce the effectiveness of algorithms.
Assumption Dependence Algorithms often rely on assumptions (e.g., cluster shapes), which may
not match the actual data structure.
Overfitting Risk Overfitting can occur when models capture noise instead of meaningful patterns
in the data.
Limited Guidance The absence of labels restricts the ability to guide the algorithm toward specific
outcomes.
Cluster Interpretability Results, such as clusters, may lack clear meaning or alignment with
real-world categories.
Sensitivity to Parameters Many algorithms require careful tuning of hyperparameters, such as
the number of clusters in k-means.
Lack of Ground Truth Unsupervised learning lacks labeled data, making it difficult to evaluate
the accuracy of results.
8 Gaussian Mixture Models and the EM Algorithm
8.1 Gaussian Mixture Models (GMM)
A Gaussian Mixture Model is a probabilistic model that assumes all the data points are generated from
a mixture of a finite number of Gaussian distributions with unknown parameters. GMMs can be used
for clustering, density estimation, and anomaly detection.
The probability density function of a GMM is given by:
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)    (7)
where:
• K is the number of Gaussian components (clusters)
• πk is the mixing coefficient for component k, satisfying πk ≥ 0 and \sum_{k=1}^{K} \pi_k = 1
• N (x|µk , Σk ) is the Gaussian distribution with mean µk and covariance matrix Σk
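To make Eq. (7) concrete, the following sketch evaluates a mixture density at a single point using SciPy's multivariate normal; the mixing coefficients, means, and covariances are made-up example values:

import numpy as np
from scipy.stats import multivariate_normal

pis = [0.5, 0.3, 0.2]                                  # mixing coefficients, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([0.0, 4.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])]

x = np.array([1.0, 1.0])
p_x = sum(pi * multivariate_normal(mean=mu, cov=S).pdf(x)
          for pi, mu, S in zip(pis, mus, Sigmas))
print("p(x) =", p_x)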
8.2 Expectation-Maximization (EM) Algorithm
The EM algorithm is an iterative method for finding maximum likelihood estimates of parameters in
statistical models with unobserved latent variables. For GMMs, the EM algorithm alternates between
two steps:
Expectation Step (E-step): Calculate the expected value of the latent variables (cluster assign-
ments) given the current parameter estimates:
\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}    (8)
where γ(znk ) is the probability that data point xn belongs to cluster k.
Maximization Step (M-step): Update the parameters to maximize the expected log-likelihood:
\pi_k^{\text{new}} = \frac{1}{N} \sum_{n=1}^{N} \gamma(z_{nk})    (9)

\mu_k^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}    (10)

\Sigma_k^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^T}{\sum_{n=1}^{N} \gamma(z_{nk})}    (11)
The E-step and M-step are repeated until convergence, when the change in the log-likelihood falls
below a threshold or a maximum number of iterations is reached.
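The update rules above translate almost line for line into code. Below is a compact NumPy sketch of EM for a GMM with full covariance matrices, assuming a fixed number of iterations and a small ridge term for numerical stability; a production implementation (for example sklearn.mixture.GaussianMixture) adds proper convergence checks and safeguards against degenerate components:

import numpy as np

def gaussian_pdf(X, mu, Sigma):
    # Multivariate normal density N(x | mu, Sigma) evaluated at each row of X
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

def em_gmm(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)                       # mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]   # initial means: random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma(z_nk), Eq. (8)
        resp = np.stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update pi, mu, Sigma, Eqs. (9)-(11)
        Nk = resp.sum(axis=0)
        pi = Nk / N
        mu = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, resp

Calling em_gmm(X, K=3) on an (N × d) data array returns the fitted mixing coefficients, means, covariances, and the final responsibilities γ(z_nk).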
9 Pros and Cons of K-means Algorithm
9.1 Pros of K-means
• Simplicity: K-means is conceptually straightforward and easy to implement, making it accessible
to practitioners with varying levels of expertise.
• Efficiency: With a time complexity of O(tkn), where t is the number of iterations, k is the number
of clusters, and n is the number of data points, K-means scales well to large datasets.
• Guaranteed Convergence: The algorithm is guaranteed to converge to a local optimum, though
not necessarily the global optimum.
• Adaptability: K-means can be easily adapted to different domains and can be combined with
other techniques for improved performance.
• Interpretability: The results of K-means are easy to understand and visualize, especially in
low-dimensional spaces.
9.2 Cons of K-means
• Sensitivity to Initialization: The final clustering result can vary significantly based on the
initial selection of centroids, potentially leading to suboptimal solutions.
• Requirement of K: The algorithm requires specifying the number of clusters in advance, which
is often unknown in real-world applications.
• Assumption of Spherical Clusters: K-means assumes clusters are spherical and equally sized,
making it unsuitable for datasets with clusters of different shapes, sizes, or densities.
• Sensitivity to Outliers: Outliers can significantly affect the position of centroids, leading to
poor clustering results.
• Local Optima: The algorithm may converge to a local optimum rather than the global optimum,
especially with poor initialization.
10 Steps in K-means Algorithm
The K-means algorithm follows an iterative process to partition data into K clusters. The algorithm
consists of the following steps:
1. Initialization: We begin by randomly selecting k cluster centroids. These centroids can be cho-
sen either randomly from the data points or initialized using more sophisticated methods like
k-means++ to improve convergence.
2. Assignment Step: Each data point is assigned to the nearest centroid, forming clusters. The
“nearest” centroid is determined using a distance metric, typically Euclidean distance. Mathemat-
ically, each data point xi is assigned to cluster Cj where:
j = \arg\min_{j} \|x_i - \mu_j\|^2    (12)

and µj represents the centroid of cluster j.
3. Update Step: After the assignment, we recalculate the centroid of each cluster by averaging the
points within it. The new centroid µj for cluster j is computed as:
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i    (13)
where |Cj | is the number of data points in cluster j.
4. Repeat: This process repeats until the centroids no longer change significantly or the maximum
number of iterations is reached. The algorithm converges when the assignments no longer change
between iterations, indicating that the clusters have stabilized.
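The four steps map directly onto a few lines of NumPy. The sketch below is a bare-bones illustration; it uses random initialization only, omits k-means++ and empty-cluster handling, and the function name kmeans is just an example:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # Step 1: random initialization
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Eq. 12)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points (Eq. 13)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

In practice, library implementations such as sklearn.cluster.KMeans are preferable, since they add k-means++ initialization and multiple restarts.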
11 The Elbow Method for Optimal Value of k in K-means
The Elbow Method is a popular heuristic used to determine the optimal number of clusters (k) in
K-means clustering. It helps in finding the value of k that provides the best balance between model
complexity and performance.
11.1 Concept of the Elbow Method
The Elbow Method runs K-means clustering on the dataset for a range of k values (typically from 1
to 10) and calculates the Within-Cluster Sum of Squares (WCSS) for each k. WCSS measures the
sum of squared distances between each point and its centroid within a cluster. As k increases, WCSS
naturally decreases because clusters become smaller and more compact. However, after a certain point,
the improvement in WCSS becomes marginal. The “elbow” point in the WCSS vs. k plot represents the
optimal k where adding another cluster doesn’t significantly improve the model.
11.2 Mathematical Formulation
The WCSS (also known as inertia) is calculated as:
\text{WCSS} = \sum_{i=1}^{k} \sum_{p \in C_i} \|p - \mu_i\|^2    (14)
where:
• k is the number of clusters
• Ci is the i-th cluster
• p is a data point in cluster Ci
• µi is the centroid of cluster Ci
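Eq. (14) can be computed directly from any clustering result. The small helper below is a sketch; the function name wcss and its arguments are illustrative, with X the data matrix, labels the cluster index of each point, and centroids the array of cluster centroids:

import numpy as np

def wcss(X, labels, centroids):
    # Within-cluster sum of squares, Eq. (14): squared distance of each
    # point to the centroid of the cluster it is assigned to.
    return sum(np.sum((X[labels == i] - centroids[i]) ** 2)
               for i in range(len(centroids)))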
11.3 Steps to Implement the Elbow Method
1. Run K-means for multiple k values: Execute the K-means algorithm for different values of k
(e.g., k = 1, 2, 3, ..., 10).
2. Calculate WCSS for each k: For each k value, compute the WCSS value after the algorithm
converges.
3. Plot the results: Create a line plot with k values on the x-axis and corresponding WCSS values
on the y-axis.
4. Identify the elbow point: Look for the point in the plot where the rate of decrease in WCSS
sharply changes, forming an “elbow” shape. This point represents the optimal k value (a code sketch follows below).
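These steps take only a few lines with scikit-learn, whose KMeans exposes the WCSS of Eq. (14) as the inertia_ attribute. The dataset and the range of k values below are illustrative:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 11)
wcss_values = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss_values, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method")
plt.show()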
11.4 Interpretation of the Elbow Plot
In the WCSS vs. k plot:
• The initial steep decline indicates significant improvement in clustering quality as k increases.
• The elbow point (where the slope changes dramatically) represents the optimal k value.
• After the elbow point, the curve flattens, indicating that additional clusters provide diminishing
returns.
Figure 1: Elbow method
11.5 Advantages of the Elbow Method
• Simplicity: Easy to understand and implement.
• Visual interpretation: Provides an intuitive graphical representation.
• Computational efficiency: Relatively fast compared to more complex validation methods.
11.6 Limitations of the Elbow Method
• Subjectivity: The elbow point is sometimes not clearly identifiable, especially with complex
datasets.
• Not always definitive: Some datasets may show multiple elbows or no clear elbow at all.
• Assumes spherical clusters: Works best when clusters are compact and spherical, which aligns
with K-means assumptions.
• Sensitive to initialization: Results may vary based on the initial centroid positions in K-means.
The Elbow Method remains one of the most widely used techniques for determining the optimal
number of clusters in K-means clustering due to its simplicity and intuitive nature, despite its limitations.
12 Maximum Likelihood Calculation
Figure 2: Maximum Likelihood calculation 1
Figure 3: Maximum Likelihood calculation 2
Figure 4: Maximum Likelihood calculation 3
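The figures above are not reproduced in this text. As a brief companion, the following is a standard worked example of a maximum likelihood calculation, assuming the simplest case of fitting a univariate Gaussian N(µ, σ²) to i.i.d. observations x_1, ..., x_N (the figures may treat a different or more elaborate case):

% Worked example: MLE of the mean and variance of a univariate Gaussian
\begin{align}
L(\mu, \sigma^2) &= \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(x_n - \mu)^2}{2\sigma^2}\right) \\
\log L(\mu, \sigma^2) &= -\frac{N}{2}\log(2\pi\sigma^2)
  - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 \\
\frac{\partial \log L}{\partial \mu} = 0 &\;\Longrightarrow\;
  \hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n \\
\frac{\partial \log L}{\partial \sigma^2} = 0 &\;\Longrightarrow\;
  \hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N}(x_n - \hat{\mu})^2
\end{align}

The same logic, applied to the GMM likelihood with latent component assignments, leads to the EM updates of Section 8.2.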