Unit 4
Cluster Analysis:
Cluster analysis, or clustering, is a type of unsupervised machine learning technique used to group a
set of objects in such a way that objects in the same group (or cluster) are more similar to each other
than to those in other groups. It's commonly used in data mining and exploratory data analysis to
uncover patterns or structure in data.
Key Concepts in Cluster Analysis:
Clusters: Groups of data points that are similar to each other based on certain features.
Centroid: The center of a cluster, typically represented as the mean of all points within the cluster.
Distance Metric: A measure used to quantify the similarity or dissimilarity between data points.
Common metrics include Euclidean distance, Manhattan distance, and cosine similarity.
Common Clustering Algorithms:
K-Means Clustering
Hierarchical Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Mean Shift Clustering
Gaussian Mixture Models (GMM)
Choosing the Right Algorithm:
• K-Means: Best for well-separated, spherical clusters. Requires specifying the number of clusters.
• Hierarchical Clustering: Useful for detailed clustering hierarchy and when the number of clusters is
unknown.
• DBSCAN: Ideal for clusters of arbitrary shape and dealing with noise, but requires careful parameter
tuning.
• Mean Shift: Good for finding clusters of arbitrary shape without requiring the number of clusters.
• GMM: Suitable for data that follows a Gaussian distribution or when you need probabilistic
clustering.
Example Workflow:
Data Preparation:
o Clean and preprocess the data (e.g., normalization, handling missing values).
Choose Algorithm:
o Select an appropriate clustering algorithm based on data characteristics and goals.
Apply Algorithm:
o Implement the chosen clustering algorithm on the dataset.
Evaluate Results:
o Assess the quality of the clusters using metrics like silhouette score or visual inspection.
Interpret and Use Clusters:
o Analyse the clusters to derive insights and apply them to business problems or research
questions.
Cluster analysis is a powerful tool for discovering hidden patterns and relationships in data, making
it invaluable in many fields such as marketing, biology, and image processing.
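Below is a minimal Python sketch of the workflow above using scikit-learn; the synthetic make_blobs dataset, the choice of K-Means, and the value K = 3 are illustrative assumptions rather than part of the notes.

# A minimal sketch of the clustering workflow, using scikit-learn.
# The toy dataset and the choice of K-Means with K=3 are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 1. Data preparation: create toy data and normalize the features.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)

# 2-3. Choose and apply an algorithm (here: K-Means with 3 clusters).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# 4. Evaluate the clusters with the silhouette score (closer to 1 is better).
print("Silhouette score:", silhouette_score(X, labels))

# 5. Interpret: e.g. inspect the cluster centroids in feature space.
print("Centroids:\n", kmeans.cluster_centers_)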
Hierarchical Clustering:
Hierarchical clustering is a type of clustering algorithm that builds a hierarchy of clusters either by
progressively merging smaller clusters (agglomerative) or by recursively splitting a larger cluster
(divisive). The result of hierarchical clustering is typically presented as a dendrogram—a tree-like
Steps:
• Consider each point (here labelled A to F) as a single cluster and calculate the distance of each cluster from all the other clusters.
• In the second step, comparable clusters are merged to form a single cluster. Say cluster (B) and cluster (C) are very similar to each other, so we merge them; similarly, clusters (D) and (E) are merged. We are left with the clusters [(A), (BC), (DE), (F)].
• We recalculate the proximity according to the algorithm and merge the two nearest clusters, (DE) and (F), to form the new clusters [(A), (BC), (DEF)].
• Repeating the same process, the clusters (DEF) and (BC) are comparable and are merged together to form a new cluster. We are now left with the clusters [(A), (BCDEF)].
• At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
Steps (Agglomerative Clustering):
Initialization:
o Start with each data point as its own cluster.
Calculate Distances:
o Compute the distance between each pair of clusters using the chosen distance metric and
linkage criterion.
Merge Clusters:
o Identify the pair of clusters with the smallest distance (or largest similarity) and merge them
into a single cluster.
While merging, we check the distance between every pair of clusters and merge the pair with the least distance (greatest similarity). The question, then, is how that distance between clusters is determined.
There are different ways of defining Inter Cluster distance/similarity. Some of them are:
1. Min Distance: Find the minimum distance between any point of one cluster and any point of the other cluster.
Steps (Divisive Clustering):
Initialization:
o Start with all data points in a single cluster.
Split Clusters:
o Identify the cluster that should be split based on the largest dissimilarity or another criterion.
Recalculate Distances:
o Compute distances between the newly created clusters and the remaining clusters.
Repeat:
o Continue splitting clusters until the desired number of clusters is achieved or a stopping
criterion is met.
Example Workflow:
Start with One Cluster: Begin with all data points in one cluster.
Split the Cluster: Identify and split the cluster with the highest dissimilarity.
Recalculate Distances: Update the distances between the newly formed clusters.
Iterate: Continue splitting until the desired number of clusters is achieved.
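Most libraries do not ship a ready-made divisive routine, so the following is only an illustrative Python sketch of the top-down idea: repeatedly bisect (with 2-means) the cluster that currently has the largest within-cluster sum of squares. The function name, the toy data, and the stopping rule are assumptions made for this example.

# Illustrative sketch of a divisive (top-down) strategy: repeatedly split the
# cluster with the largest within-cluster sum of squares using 2-means.
# This is a simplified stand-in, not a standard library routine.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def divisive_clustering(X, n_clusters=4):
    clusters = [np.arange(len(X))]          # start: one cluster with all points
    while len(clusters) < n_clusters:
        # pick the cluster with the largest total squared distance to its mean
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        worst = int(np.argmax(sse))
        idx = clusters.pop(worst)
        # split it into two with 2-means
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters

X, _ = make_blobs(n_samples=200, centers=4, random_state=1)
for i, idx in enumerate(divisive_clustering(X, n_clusters=4)):
    print(f"Cluster {i}: {len(idx)} points")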
Distance Metric: A method for measuring the dissimilarity between data points or clusters. Common
metrics include Euclidean distance, Manhattan distance, and others.
Linkage Criteria: The method used to determine the distance between clusters. Common linkage
criteria include:
o Single Linkage: The distance between the closest points in the two clusters (also known as
minimum linkage).
L(R, S) = min D(i, j), i ∈ R, j ∈ S
o Complete Linkage: The distance between the farthest points in the two clusters (also known
as maximum linkage).
L(R, S) = max D(i, j), i ∈ R, j ∈ S
o Average Linkage: The average distance between all pairs of points in the two clusters.
L(R, S) = (1 / (nR × nS)) ∑(i=1 to nR) ∑(j=1 to nS) D(i, j), i ∈ R, j ∈ S
where,
• nR : Number of data-points in R
• nS : Number of data-points in S
o Ward’s Linkage: Minimizes the variance within each cluster by merging clusters that result in
the smallest increase in total within-cluster variance.
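As a quick illustration of how these linkage criteria are used in practice, the following Python sketch runs agglomerative clustering on toy data with SciPy, comparing single, complete, average, and Ward linkage; the two-blob data and the cut into two flat clusters are assumptions.

# Agglomerative clustering with different linkage criteria, using SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])  # toy data

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                     # merge history (linkage matrix)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 flat clusters
    print(method, labels)

tree = dendrogram(Z, no_plot=True)   # the dendrogram structure for the last linkage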
Advantages of Hierarchical Clustering:
• No Need to Specify Number of Clusters: Unlike methods like K-Means, hierarchical clustering
does not require specifying the number of clusters in advance.
• Dendrogram Visualization: Provides a clear visual representation of the clustering process and
hierarchy.
• Flexibility: Can handle various types of data and distance metrics.
Disadvantages of Hierarchical Clustering:
• Computational Complexity: Can be computationally intensive, especially for large datasets.
• Scalability: May not scale well with very large datasets due to the quadratic complexity of
distance computations.
• Sensitivity to Noise: Can be sensitive to noise and outliers in the data.
Applications of Hierarchical Clustering:
• Gene Expression Analysis: Grouping genes with similar expression patterns.
• Customer Segmentation: Identifying customer groups with similar purchasing behaviors.
• Document Clustering: Organizing documents into hierarchical categories based on content
similarity.
• Image Analysis: Grouping similar images or features for image recognition tasks.
Hierarchical clustering is a powerful tool for exploratory data analysis and pattern discovery, offering insight into the structure of the data at multiple levels of granularity.
K-Means Clustering:
K-Means partitions the data into K clusters. We first choose the value of K, select K random points as the initial centroids, and then iterate over the following steps:
• Now we will assign each data point of the scatter plot to its closest centroid. We compute this by calculating the distance between each point and each centroid (for example, using Euclidean distance), and we draw a median line (perpendicular bisector) between the two centroids. Consider the image below:
From the image, it is clear that the points on the left side of the line are nearer to K1, the blue centroid, and the points on the right of the line are closer to the yellow centroid. Let's colour them blue and yellow for clear visualization.
• To refine the clusters, we repeat the process by choosing new centroids. Each new centroid is the center of gravity (mean) of the points currently assigned to that cluster, as below:
From the image we can see that one yellow point is on the left side of the line and two blue points are to the right of the line, so these three points will be reassigned to the new centroids.
• As we have the new centroids, we again draw the median line and reassign the data points accordingly.
Once the assignments no longer change, our model is ready; we can remove the assumed centroids, and the two final clusters will be as shown below:
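The assignment and centroid-update steps described above can be sketched in a few lines of NumPy; the toy data, K = 2, and the fixed number of iterations are assumptions made only for illustration.

# One assignment/update loop of K-Means in plain NumPy, mirroring the steps above:
# assign each point to the nearest centroid, then move each centroid to the mean
# (center of gravity) of its assigned points.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
centroids = X[rng.choice(len(X), size=2, replace=False)]   # assumed K = 2

for _ in range(10):
    # assignment step: distance from every point to every centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # update step: each centroid becomes the mean of its assigned points
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print("Final centroids:\n", centroids)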
The performance of the K-means clustering algorithm depends on the quality of the clusters it forms, and choosing the optimal number of clusters is a key task. There are different ways to find the optimal number of clusters; here we discuss one of the most widely used methods:
Elbow Method:
The Elbow method is one of the most popular ways to find the optimal number of clusters. This
method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which
defines the total variations within a cluster. The formula to calculate the value of WCSS (for 3
clusters) is given below:
WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²
In the above formula of WCSS,
∑(Pi in Cluster1) distance(Pi, C1)²: the sum of the squared distances between each data point in Cluster1 and its centroid C1; the other two terms are defined analogously.
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
• It executes K-means clustering on the given dataset for different values of K (typically ranging from 1 to 10).
• For each value of K, calculates the WCSS value.
• Plots a curve between calculated WCSS values and the number of clusters K.
• The sharp point of bend, where the plot looks like an arm's elbow, is taken as the best value of K.
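A minimal Python sketch of the elbow method is shown below; scikit-learn exposes the WCSS of a fitted K-Means model as inertia_, and the synthetic data and the K range of 1 to 10 are assumptions.

# Elbow method: run K-Means for K = 1..10 and record the WCSS (inertia_).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # toy data

wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)          # within-cluster sum of squares for this K

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()                            # the 'elbow' in the curve suggests K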
Characteristics:
• Objective Function: Minimizes the within-cluster sum of squares (variance), defined as:
J = ∑(i=1 to k) ∑(x ∈ Ci) ∥x − μi∥²
1. AR, MA, ARMA, and ARIMA Models
AR, MA, ARMA, and ARIMA models are used to forecast the observation at (t+1) based on the historical data of previous time spots recorded for the same observation. However, it is necessary to make sure that the time series is stationary over the historical observation period. If the time series is not stationary, we can apply differencing to the records and check whether the differenced series becomes stationary.
ACF (Auto Correlation Function)
The Auto Correlation Function takes into consideration all the past observations irrespective of their effect on the present or future time period. It calculates the correlation between the t and (t-k) time periods and includes all the lags or intervals between them. The correlation is typically computed using the Pearson correlation formula.
PACF (Partial Auto Correlation Function)
The PACF determines the partial correlation between time periods t and t-k. It does not take into consideration all the time lags between t and t-k. For example, today's stock price may depend on the stock price 3 days ago but not on yesterday's closing price. Hence we consider only the time lags that have a direct impact on the future time period, neglecting the insignificant time lags between the two time slots t and t-k.
How to differentiate when to use ACF and PACF?
Let's take an example of sweets sale and income generated in a village over a year. Under the
assumption that every 2 months there is a festival in the village, we take out the historical data of
sweets sale and income generated for 12 months. If we plot the data by month, we can observe that for the sweets sale we are interested only in alternate months, since the sale of sweets increases every two months. But if we are to forecast the income generated next month, we have to take into consideration all 12 months of the last year.
So in the above situation, we will use ACF to find out the income generated in the future but we will
be using PACF to find out the sweets sold in the next month.
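The ACF and PACF can be inspected with statsmodels, as in the hedged sketch below; the simulated AR(2) series is purely an assumption used to produce plots with a visible pattern.

# Inspecting ACF and PACF with statsmodels on a simulated AR(2) series.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(0)
y = np.zeros(300)
for t in range(2, 300):                      # y_t = 0.6*y_{t-1} + 0.3*y_{t-2} + noise
    y[t] = 0.6 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(y, lags=20, ax=axes[0])             # all lags, including indirect effects
plot_pacf(y, lags=20, ax=axes[1])            # only the direct effect of each lag
plt.show()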
AR (Auto-Regressive) Model
In an AR model, the current value of the series is regressed on its own previous values, and the impact of each previous time spot is decided by the coefficient factor at that particular period of time. The price of a share of any particular company X may depend on all the previous share prices in the time series. This kind of model calculates the regression over the past values of the time series and uses it to predict the present or future values; hence it is known as the Auto-Regressive (AR) model.
Consider an example of a milk distribution company that produces milk every month in the country. We want to calculate the amount of milk to be produced in the current month considering the milk produced over the last year. We begin by calculating the PACF values of all 12 lags with respect to the current month. If the PACF value of any particular month is more than a significant threshold, only those lags will be considered for the model analysis.
For example, in the PACF plot the values 1, 2, 3 up to 12 display the direct effect (PACF) of the milk production in the current month with respect to the given lag t. If we consider two significant values above the threshold, the model will be termed AR(2).
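A short sketch of fitting such an AR(2) model with statsmodels follows; the simulated series and the choice of two lags (mirroring the two significant PACF values above) are assumptions.

# Fitting an AR(2) model with statsmodels' AutoReg on a simulated series.
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(1)
y = np.zeros(200)
for t in range(2, 200):
    y[t] = 0.5 * y[t - 1] + 0.25 * y[t - 2] + rng.normal()

model = AutoReg(y, lags=2).fit()     # AR(2): two significant PACF lags assumed
print(model.params)                  # estimated coefficients
print(model.forecast(steps=3))       # forecast the next 3 periods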
MA (Moving Average) Model
In a Moving Average model, the current value depends not on the previous values themselves but on the unexpected external impacts occurring between time t and t-k. These unexpected impacts are known as Errors or Residuals. The impact of previous time spots is decided by the coefficient factor α at that particular period of time. The price of a share of any particular company X may depend on some company merger that happened overnight, or the company may have shut down due to bankruptcy. This kind of model calculates the residuals or errors of the past time series and uses them to predict the present or future values in the series; it is known as the Moving Average (MA) model.
Consider an example of cake distribution on your birthday. Let's assume that your mom asks you to bring pastries to the party. Every year you misjudge the number of invitees and end up bringing more or fewer pastries than required. The difference between the actual and expected quantities results in an error. To avoid this error for the current year, we apply the moving average model on the time series and calculate the number of pastries needed this year based on the past collective errors. Next, we calculate the ACF values of all the lags in the time series. If the ACF value of any particular month is more than a significant threshold, only those values will be considered for the model analysis.
For example, in the ACF plot the values 1, 2, 3 up to 12 display the total error (ACF) of the pastry count in the current month with respect to the given lag t, considering all the in-between lags between time t and the current month. If we consider two significant values above the threshold, the model will be termed MA(2).
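A corresponding sketch for an MA(2) model is below; since statsmodels fits MA models through its ARIMA class, the order (0, 0, 2) encodes two MA terms, and the simulated error-driven series is an assumption.

# Fitting an MA(2) model via statsmodels' ARIMA class with order (0, 0, 2).
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
e = rng.normal(size=200)
y = e.copy()
for t in range(2, 200):              # y_t = e_t + 0.6*e_{t-1} + 0.3*e_{t-2}
    y[t] = e[t] + 0.6 * e[t - 1] + 0.3 * e[t - 2]

ma2 = ARIMA(y, order=(0, 0, 2)).fit()
print(ma2.params)                    # estimated MA coefficients
print(ma2.forecast(steps=3))         # forecast the next 3 periods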
ARMA (Auto-Regressive Moving Average) Model
This model is a combination of the AR and MA models. In it, the impact of previous lags along with the residuals is considered for forecasting the future values of the time series. Here β represents the coefficients of the AR part and α represents the coefficients of the MA part.
Yt = β1·y(t−1) + α1·ε(t−1) + β2·y(t−2) + α2·ε(t−2) + β3·y(t−3) + α3·ε(t−3) + … + βk·y(t−k) + αk·ε(t−k)
Consider the ACF and PACF plots with their respective significant values. Let's assume that we consider only 1 significant value from the AR model and likewise 1 significant value from the MA model. The ARMA model obtained by combining the two will then be termed ARMA(1,1).
ARIMA (Auto-Regressive Integrated Moving Average) Model
The ARIMA model requires the time series to be stationary. To achieve this, we apply the differencing or Integrated method, in which we subtract the (t-1) value from the t value of the time series. If the series is still not stationary after the first differencing, we apply second-order differencing.
The ARIMA model is quite similar to the ARMA model except that it includes one more factor, Integrated (I), i.e. differencing, which is what the I in ARIMA stands for. In short, an ARIMA model combines the number of differences applied to make the series stationary, the number of previous lags, and the residual errors in order to forecast future values.
Consider again the ACF and PACF plots with their respective significant values. Let's assume that we consider only 1 significant value from the AR model and likewise 1 significant value from the MA model. Also, the series was initially non-stationary and we had to perform the differencing operation once in order to convert it into a stationary series. Hence the ARIMA model obtained from the combined values of the other two models, along with the single order of differencing, will be termed ARIMA(1,1,1).
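An ARIMA(1,1,1) fit matching this example can be sketched with statsmodels as follows; the simulated random-walk-with-drift series is an assumption used only to provide a non-stationary input.

# Fitting ARIMA(1,1,1): one AR lag, one order of differencing, one MA term.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
y = np.cumsum(0.5 + rng.normal(size=300))   # non-stationary level series

model = ARIMA(y, order=(1, 1, 1)).fit()     # (p, d, q) = (1, 1, 1)
print(model.params)                         # estimated AR, MA, and variance terms
print(model.forecast(steps=5))              # forecast the next 5 values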
All these models give us an insight, or at least a close enough prediction, for a particular time series. Which model suffices depends on the user's needs: if one model has a lower error rate than the others, it is preferred to choose that model.
Applications
• Economics: Forecasting GDP, inflation rates.
• Finance: Predicting stock prices, interest rates.
• Engineering: Analyzing system behaviors and control systems.
Assumptions
• The time series is stationary, meaning its statistical properties do not change over time.
• The residuals (errors) are normally distributed and uncorrelated.
2. ARCH Model (Autoregressive Conditional Heteroskedasticity)
The ARCH model, introduced by Robert Engle in 1982, is designed to model time series data where
the variance of the errors is not constant but changes over time.
ARCH Model Structure
The ARCH model is specified as:
ϵt=σt zt
σt^2 = α0 + α1·ϵ(t−1)^2 + α2·ϵ(t−2)^2 + ⋯ + αq·ϵ(t−q)^2
where:
• ϵt is the error term at time t.
• σt^2 is the conditional variance of ϵt given past information.
• zt is a white noise error term with zero mean and unit variance.
• α0,α1,…,αq are parameters.
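To make the equations concrete, the following NumPy sketch simulates an ARCH(1) process directly from the definitions above; the parameter values a0 = 0.2 and a1 = 0.5 are assumptions chosen only for illustration.

# Simulating an ARCH(1) process: sigma_t^2 = a0 + a1*eps_{t-1}^2, eps_t = sigma_t*z_t.
import numpy as np

rng = np.random.default_rng(0)
a0, a1 = 0.2, 0.5                            # illustrative parameter values
n = 1000
eps = np.zeros(n)
sigma2 = np.zeros(n)
sigma2[0] = a0 / (1 - a1)                    # unconditional variance as a start
for t in range(1, n):
    sigma2[t] = a0 + a1 * eps[t - 1] ** 2    # conditional variance from past error
    eps[t] = np.sqrt(sigma2[t]) * rng.normal()

print("sample variance:", eps.var(), "theoretical:", a0 / (1 - a1))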
Applications
• Finance: Modeling and forecasting volatility of financial markets, such as stock prices and
exchange rates.
• Econometrics: Analyzing economic time series where volatility changes over time.
Assumptions
• The error terms are conditionally heteroskedastic, meaning the variance changes over time
but is dependent on past errors.
• The model assumes that the conditional variance depends on past squared errors.
3. GARCH Model (Generalized Autoregressive Conditional Heteroskedasticity)
The GARCH model, introduced by Tim Bollerslev in 1986, extends the ARCH model to include past
forecast variances as well as past squared errors. This makes it more flexible and better suited for
capturing the persistence of volatility.
GARCH Model Structure
The GARCH(p, q) model is specified as:
ϵt=σtzt
σt^2 = α0 + α1·ϵ(t−1)^2 + α2·ϵ(t−2)^2 + ⋯ + αq·ϵ(t−q)^2 + β1·σ(t−1)^2 + β2·σ(t−2)^2 + ⋯ + βp·σ(t−p)^2
where:
• ϵt is the error term at time t.
• σt^2 is the conditional variance of ϵt given past information.
• zt is a white noise error term with zero mean and unit variance.
• α0, α1, …, αq and β1, β2, …, βp are parameters.
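A GARCH(1,1) model can be estimated with the third-party arch package, as in the hedged sketch below; the simulated return series and the parameter choices are assumptions made only for this example.

# Fitting a GARCH(1,1) volatility model with the `arch` package (assumed installed).
import numpy as np
from arch import arch_model

rng = np.random.default_rng(0)
returns = 100 * rng.normal(scale=0.01, size=1000)   # toy "daily returns" in percent

am = arch_model(returns, vol="GARCH", p=1, q=1)     # GARCH(1,1) volatility model
res = am.fit(disp="off")
print(res.params)                                   # mu, omega, alpha[1], beta[1]
print(res.forecast(horizon=5).variance.iloc[-1])    # 5-step-ahead variance forecast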