
Unit-4
Cluster Analysis:
Cluster analysis, or clustering, is a type of unsupervised machine learning technique used to group a
set of objects in such a way that objects in the same group (or cluster) are more similar to each other
than to those in other groups. It's commonly used in data mining and exploratory data analysis to
uncover patterns or structure in data.
Key Concepts in Cluster Analysis:
Clusters: Groups of data points that are similar to each other based on certain features.
Centroid: The center of a cluster, typically represented as the mean of all points within the cluster.
Distance Metric: A measure used to quantify the similarity or dissimilarity between data points.
Common metrics include Euclidean distance, Manhattan distance, and cosine similarity.
Common Clustering Algorithms:
K-Means Clustering
Hierarchical Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Mean Shift Clustering
Gaussian Mixture Models (GMM)
Choosing the Right Algorithm:
• K-Means: Best for well-separated, spherical clusters. Requires specifying the number of clusters.
• Hierarchical Clustering: Useful for detailed clustering hierarchy and when the number of clusters is
unknown.
• DBSCAN: Ideal for clusters of arbitrary shape and dealing with noise, but requires careful parameter
tuning.
• Mean Shift: Good for finding clusters of arbitrary shape without requiring the number of clusters.
• GMM: Suitable for data that follows a Gaussian distribution or when you need probabilistic
clustering.
Example Workflow:
Data Preparation:
o Clean and preprocess the data (e.g., normalization, handling missing values).
Choose Algorithm:
o Select an appropriate clustering algorithm based on data characteristics and goals.
Apply Algorithm:
o Implement the chosen clustering algorithm on the dataset.
Evaluate Results:
o Assess the quality of the clusters using metrics like silhouette score or visual inspection.
Interpret and Use Clusters:
o Analyse the clusters to derive insights and apply them to business problems or research
questions.
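This workflow can be sketched in Python with scikit-learn. The following is a minimal sketch, not part of the original notes: the random data, the choice of K-Means, and k=3 are all illustrative assumptions.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Data preparation: illustrative random data standing in for a real dataset
X = np.random.rand(200, 4)
X_scaled = StandardScaler().fit_transform(X)   # normalization step

# Choose and apply an algorithm: K-Means with an assumed k of 3
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)

# Evaluate results: silhouette score (closer to 1 means better-separated clusters)
print("Silhouette score:", silhouette_score(X_scaled, labels))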

Cluster analysis is a powerful tool for discovering hidden patterns and relationships in data, making
it invaluable in many fields such as marketing, biology, and image processing.

Hierarchical Clustering:
Hierarchical clustering is a type of clustering algorithm that builds a hierarchy of clusters either by
progressively merging smaller clusters (agglomerative) or by recursively splitting a larger cluster
(divisive). The result of hierarchical clustering is typically presented as a dendrogram—a tree-like diagram that illustrates the arrangement and similarity of clusters.
Key Concepts in Hierarchical Clustering:
Dendrogram: A tree diagram that shows the arrangement of clusters formed at various levels of
similarity or distance. The height of the branches represents the dissimilarity between clusters.
Agglomerative Hierarchical Clustering: A bottom-up approach where each data point starts as its
own cluster, and pairs of clusters are merged as one moves up the hierarchy.
It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). It produces a structure that is more informative than the unstructured set of clusters returned by flat clustering, and it does not require us to prespecify the number of clusters. Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been merged into a single cluster that contains all of the data.

Steps (worked example with points A–F):
• Treat each letter as a single cluster and calculate the distance from each cluster to all the other clusters.
• In the second step, comparable clusters are merged to form single clusters. Say cluster (B) and cluster (C) are very similar to each other, so we merge them; similarly for clusters (D) and (E). We are left with the clusters [(A), (BC), (DE), (F)].
• We recalculate the proximities according to the algorithm and merge the two nearest clusters, (DE) and (F), to obtain [(A), (BC), (DEF)].
• Repeating the same process, the clusters (DEF) and (BC) are comparable and are merged, leaving [(A), (BCDEF)].
• Finally, the two remaining clusters are merged into a single cluster [(ABCDEF)].
General Steps:
Initialization:
o Start with each data point as its own cluster.
Calculate Distances:
o Compute the distance between each pair of clusters using the chosen distance metric and
linkage criterion.
Merge Clusters:
o Identify the pair of clusters with the smallest distance (or largest similarity) and merge them
into a single cluster.
Update Distances:
o Recalculate distances between the new cluster and the remaining clusters.
Repeat:
o Continue merging clusters until all data points are in a single cluster or until the desired
number of clusters is achieved.
Construct Dendrogram:
o Plot the dendrogram to visualize the hierarchical structure of clusters.
Example Workflow:
Compute Pairwise Distances: Calculate the initial distances between all individual data points.
Merge Closest Clusters: Merge the closest clusters and update the distance matrix.
Iterate: Repeat the merging process until all points are in one cluster.
Cut Dendrogram: Determine the number of clusters by cutting the dendrogram at the desired level.
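A minimal sketch of this workflow with SciPy follows (the small 2-D array X, the Ward linkage method, and the cut into 3 clusters are illustrative assumptions, not from these notes):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Illustrative 2-D data points
X = np.array([[1, 2], [1, 4], [5, 8], [6, 8], [9, 1], [8, 2]], dtype=float)

# linkage() computes pairwise distances and merges the closest clusters step by step
Z = linkage(X, method='ward')        # other options: 'single', 'complete', 'average'

# Plot the dendrogram to visualise the merge hierarchy
dendrogram(Z)
plt.show()

# Cut the dendrogram to obtain a flat clustering, here into an assumed 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)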

Divisive Hierarchical Clustering:


A top-down approach in which all data points start in a single cluster, and clusters are recursively split into smaller clusters. Like the agglomerative approach, this algorithm does not require us to prespecify the number of clusters. Top-down clustering needs a method for splitting a cluster that contains the whole dataset, and it proceeds by splitting clusters recursively until every data point has been separated into its own singleton cluster.

Computing Distance Matrix:

While merging clusters, we check the distance between every pair of clusters and merge the pair with the least distance (most similarity). The question is how that distance is determined. There are different ways of defining inter-cluster distance/similarity. Some of them are:
1. Min Distance: the minimum distance between any pair of points, one from each cluster.
2. Max Distance: the maximum distance between any pair of points, one from each cluster.
3. Group Average: the average distance over every pair of points, one from each cluster.
4. Ward’s Method: the similarity of two clusters is based on the increase in squared error when the two clusters are merged.
For example, if we group a given dataset using different methods, we may get different results:

Steps:
Initialization:
o Start with all data points in a single cluster.
Split Clusters:
o Identify the cluster that should be split based on the largest dissimilarity or another criterion.
Recalculate Distances:
o Compute distances between the newly created clusters and the remaining clusters.
Repeat:
o Continue splitting clusters until the desired number of clusters is achieved or a stopping
criterion is met.
Example Workflow:
Start with One Cluster: Begin with all data points in one cluster.
Split the Cluster: Identify and split the cluster with the highest dissimilarity.
Recalculate Distances: Update the distances between the newly formed clusters.
Iterate: Continue splitting until the desired number of clusters is achieved.

Hierarchical Agglomerative vs Divisive Clustering

• Divisive clustering is more complex than agglomerative clustering, because in divisive clustering we need a flat clustering method as a “subroutine” to split each cluster until every data point is its own singleton cluster.
• Divisive clustering is more efficient if we do not generate a complete hierarchy all the way down to individual data leaves. The time complexity of a naive agglomerative clustering is O(n³), because we exhaustively scan the N x N distance matrix for the lowest distance in each of the N-1 iterations. Using a priority queue data structure we can reduce this complexity to O(n² log n), and with further optimizations it can be brought down to O(n²). For divisive clustering, given a fixed number of top levels and using an efficient flat algorithm like K-Means, the algorithm is linear in the number of patterns and clusters.
• A divisive algorithm can also be more accurate. Agglomerative clustering makes decisions by considering local patterns or neighboring points without initially taking into account the global distribution of the data, and these early decisions cannot be undone. Divisive clustering, in contrast, takes the global distribution of the data into consideration when making top-level partitioning decisions.

Distance Metric: A method for measuring the dissimilarity between data points or clusters. Common
metrics include Euclidean distance, Manhattan distance, and others.
Linkage Criteria: The method used to determine the distance between clusters. Common linkage
criteria include:
o Single Linkage: The distance between the closest points in the two clusters (also known as
minimum linkage).
L(R, S) = min D(i, j), i ∈ R, j ∈ S
o Complete Linkage: The distance between the farthest points in the two clusters (also known
as maximum linkage).
L(R, S) = max D(i, j), i ∈ R, j ∈ S
o Average Linkage: The average distance between all pairs of points in the two clusters.
L(R, S) = (1 / (nR × nS)) ∑(i=1 to nR) ∑(j=1 to nS) D(i, j), i ∈ R, j ∈ S
where,
• nR : Number of data-points in R
• nS : Number of data-points in S
o Ward’s Linkage: Minimizes the variance within each cluster by merging clusters that result in the smallest increase in total within-cluster variance.
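A small sketch of how these inter-cluster distances can be computed directly with NumPy/SciPy; the two example clusters R and S are made up purely for illustration:

import numpy as np
from scipy.spatial.distance import cdist

# Two illustrative clusters of 2-D points
R = np.array([[0.0, 0.0], [1.0, 0.0]])
S = np.array([[4.0, 3.0], [5.0, 4.0], [6.0, 3.0]])

D = cdist(R, S)                       # pairwise Euclidean distances between R and S

print("Single (min) linkage:  ", D.min())
print("Complete (max) linkage:", D.max())
print("Average linkage:       ", D.mean())   # 1/(nR*nS) * sum of all pairwise distances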
Advantages of Hierarchical Clustering:
• No Need to Specify Number of Clusters: Unlike methods like K-Means, hierarchical clustering
does not require specifying the number of clusters in advance.
• Dendrogram Visualization: Provides a clear visual representation of the clustering process and
hierarchy.
• Flexibility: Can handle various types of data and distance metrics.
Disadvantages of Hierarchical Clustering:
• Computational Complexity: Can be computationally intensive, especially for large datasets.
• Scalability: May not scale well with very large datasets due to the quadratic complexity of
distance computations.
• Sensitivity to Noise: Can be sensitive to noise and outliers in the data.
Applications of Hierarchical Clustering:
• Gene Expression Analysis: Grouping genes with similar expression patterns.
• Customer Segmentation: Identifying customer groups with similar purchasing behaviors.
• Document Clustering: Organizing documents into hierarchical categories based on content
similarity.
• Image Analysis: Grouping similar images or features for image recognition tasks.
Hierarchical clustering is a powerful tool for exploratory data analysis and pattern discovery, offering flexibility in understanding and visualizing the structure within datasets.

Single Linkage Method:


The Single Linkage Method is a technique used in hierarchical clustering to build a hierarchical tree,
also known as a dendrogram, which represents nested clusters of data points. This method is also
known as nearest neighbor clustering.
Key Concepts:
Hierarchical Clustering:
o Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of
clusters. It can be divided into two main types:
▪ Agglomerative (Bottom-Up): Starts with each data point as its own cluster and merges
clusters iteratively.
▪ Divisive (Top-Down): Starts with all data points in a single cluster and splits clusters
iteratively.
Single Linkage Method:
o The Single Linkage Method is a specific approach within agglomerative hierarchical clustering.
It merges clusters based on the minimum distance between them.
o In this method, the distance between two clusters is defined as the shortest distance between
any single pair of points, where one point is in the first cluster and the other point is in the
second cluster.
Steps in the Single Linkage Method:
Initialization:
o Start with each data point as its own individual cluster.
Calculate Pairwise Distances:
o Compute the distance between each pair of clusters. For two clusters Ci and Cj, the distance d(Ci, Cj) is given by d(Ci, Cj) = min { d(x, y) | x ∈ Ci, y ∈ Cj }, where d(x, y) is the distance between points x and y.
Merge Clusters:
o Find the pair of clusters with the smallest distance and merge them into a single cluster.
Update Distances:
o Update the distances between the newly formed cluster and the remaining clusters.
Repeat:
o Repeat steps 2-4 until all data points are merged into a single cluster or until the desired
number of clusters is obtained.
Construct Dendrogram:
o A dendrogram (a tree diagram) is constructed to visualize the clustering process, showing how
clusters are merged at each step.
Example:
Consider a set of data points in a 2D space. Here’s a simplified illustration of how the Single Linkage
Method works:
Start with individual clusters:
o Each data point starts as its own cluster.
Compute pairwise distances:
o Calculate the distance between all pairs of points and identify the smallest distance.
Merge clusters:
o Merge the pair of clusters with the smallest distance. For example, if the closest pair of points
are in clusters C1 and C2, merge C1 and C2.
Update distances:
o Recalculate the distance between the newly formed cluster and all other clusters using the
minimum distance criterion.
Repeat until complete:
o Continue merging clusters based on the smallest distance and update the dendrogram
accordingly.
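The steps above map directly onto SciPy's single-linkage implementation. The sketch below is illustrative (the five 2-D points are assumptions):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Illustrative data points, each starting as its own cluster
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0], [3.0, 5.0], [10.0, 10.0]])

# pdist gives the initial pairwise distances; linkage with method='single'
# then repeatedly merges the two clusters with the smallest minimum distance
Z = linkage(pdist(X), method='single')
print(Z)   # each row: the two clusters merged, their distance, and the new cluster size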
Advantages:
• Simplicity: The method is straightforward and easy to understand.
• Flexibility: Can work well with different types of distance metrics.
Limitations:
• Chaining Effect: Single linkage clustering can lead to the "chaining effect," where clusters may
be elongated and may form a chain-like structure rather than compact clusters.
• Sensitivity to Noise: Sensitive to noise and outliers, which can affect the clustering results.
Comparison with Other Methods:
• Complete Linkage Method: Unlike single linkage, which uses the minimum distance, complete
linkage uses the maximum distance between any two points in the clusters to determine the
distance between clusters.
• Average Linkage Method: Uses the average distance between all pairs of points in the
clusters.
Applications:
• Data Analysis: Useful for exploratory data analysis to understand the structure and
relationships within the data.
• Bioinformatics: Often used for clustering genes or proteins based on expression data.
The Single Linkage Method is a fundamental technique in hierarchical clustering that is particularly
useful for understanding data structure and forming clusters based on the closest pairwise distances
between data points.

K-Means and KNN Clustering Algorithms:
K-Means and K-Nearest Neighbors (KNN) are commonly used algorithms in machine learning, but
they serve different purposes and are applied in different contexts. Here’s a breakdown of each
algorithm:
K-Means Clustering:
K-Means is a clustering algorithm used to partition a dataset into k distinct, non-overlapping groups
or clusters. The objective is to minimize the variance within each cluster.
K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning and data science. In this topic, we will learn what the K-means clustering algorithm is and how it works, along with a Python implementation of k-means clustering.
What is K-Means Algorithm?
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled
dataset into different clusters. Here K defines the number of pre-defined clusters that need to be
created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.
"It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group that has similar properties."
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:


• Determines the best values for the K center points or centroids by an iterative process.
• Assigns each data point to its closest k-center. The data points that are near a particular k-center form a cluster.
Hence each cluster contains data points with some commonalities and is kept apart from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurred, go to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
Let's understand the above steps by considering the visual plots:
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given
below:

• Let's take the number of clusters k, i.e., K=2, to identify the dataset and put the points into different clusters. It means that here we will try to group this dataset into two different clusters.
• We need to choose k random points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we select the below two points as k points, which are not part of our dataset. Consider the below image:

• Now we will assign each data point of the scatter plot to its closest K-point or centroid. We compute this by applying the distance calculation between two points that we have studied, and then draw a median line between both centroids. Consider the below image:

From the above image, it is clear that the points on the left side of the line are nearer to the K1 (blue) centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.

• As we need to find the closest clusters, we repeat the process by choosing new centroids. To choose the new centroids, we compute the center of gravity of the points in each cluster, and find the new centroids as below:

• Next, we will reassign each data point to the new centroids. For this, we repeat the same process of finding a median line. The median line will be as in the below image:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, we go back to Step-4, which is finding new centroids or K-points.
• We repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the below image:

• As we have the new centroids, we again draw the median line and reassign the data points. The resulting image will be:

• We can see in the above image that there are no dissimilar data points on either side of the line, which means our model has converged. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:

How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends on the quality of the clusters it forms, but choosing the optimal number of clusters is a difficult task. There are several ways to find the optimal number of clusters; here we discuss the most widely used method for finding the number of clusters, or the value of K. The method is given below:
Elbow Method:
The Elbow method is one of the most popular ways to find the optimal number of clusters. This
method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which
defines the total variations within a cluster. The formula to calculate the value of WCSS (for 3
clusters) is given below:
WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²
In the above formula, ∑(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point and its centroid within Cluster 1, and the same holds for the other two terms.
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
• It executes K-means clustering on the given dataset for different K values (e.g., ranging from 1 to 10).
• For each value of K, it calculates the WCSS value.
• It plots a curve of the calculated WCSS values against the number of clusters K.
• The sharp bend in the curve, where the plot looks like an arm (the "elbow"), is taken as the best value of K.
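A minimal sketch of the elbow method with scikit-learn follows; the range 1–10 is illustrative and X stands for an assumed, already preprocessed feature matrix (KMeans exposes the WCSS value as its inertia_ attribute):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', random_state=42, n_init=10)
    km.fit(X)                    # X is the preprocessed feature matrix (assumed)
    wcss.append(km.inertia_)     # inertia_ = within-cluster sum of squares (WCSS)

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.show()                       # look for the "elbow" point in the curve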

The steps to be followed for the implementation are given below:


• Data Pre-processing
• Finding the optimal number of clusters using the elbow method
• Training the K-means algorithm on the training dataset
• Visualizing the clusters
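Continuing the elbow sketch above, once a value of K has been chosen (assumed here to be 5), training and visualising the clusters might look like this; the two-feature matrix X and K=5 are illustrative assumptions:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)          # cluster label for every point in X (assumed 2-D here)

# Visualising the clusters (works directly when X has two features)
plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', s=200, marker='X', label='Centroids')
plt.legend()
plt.show()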

Characteristics:
• Objective Function: Minimizes the within-cluster sum of squares (variance), defined as:
J = ∑(i=1 to k) ∑(x ∈ Ci) ‖x − μi‖²
where μi is the centroid of cluster Ci, and x is a data point in cluster Ci.
• Distance Metric: Typically uses Euclidean distance to measure similarity between data points
and centroids.
Pros:
• Simple and Fast: Efficient for large datasets with a fixed number of clusters.
• Scalability: Works well with a large number of data points and features.
Cons:
• Requires k to be Specified: You need to know the number of clusters in advance, which can be
challenging.
• Sensitivity to Initialization: Can converge to local minima depending on the initial centroids.
• Assumes Spherical Clusters: Performs best when clusters are spherical and of similar size.
K-Nearest Neighbors (KNN):
K-Nearest Neighbors (KNN) is a classification and regression algorithm used to predict the class or
value of a new data point based on the classes or values of its nearest neighbors.
How KNN Works:
Choose k:
o Select the number k of nearest neighbors to consider.
Distance Calculation:
o Compute the distance between the new data point and all existing data points using a distance
metric (e.g., Euclidean distance).
Identify Neighbors:
o Find the k nearest data points to the new data point.
Classification or Regression:
o Classification: Assign the most common class label among the k nearest neighbors.
o Regression: Predict the value as the average (or weighted average) of the values of the k
nearest neighbors.
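A minimal KNN classification sketch with scikit-learn follows; the Iris dataset and k=5 are illustrative choices, not taken from these notes:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# k = 5 nearest neighbours, Euclidean distance by default
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)    # "training" just stores the data (instance-based learning)

print("Test accuracy:", knn.score(X_test, y_test))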
Characteristics:
• Instance-Based Learning: KNN does not explicitly learn a model but makes predictions based
on the training data during the prediction phase.
• Distance Metric: Commonly uses Euclidean distance for continuous variables, but other
metrics (e.g., Manhattan, Minkowski) can also be used.
Pros:
• Simple to Implement: Easy to understand and implement.
• No Training Phase: Instantaneously adapts to new data without retraining.
• Versatile: Can be used for both classification and regression.
Cons:
• Computationally Expensive: Requires distance computations for each query point, which can
be slow for large datasets.
• Sensitive to Noise: Performance can be affected by noisy or irrelevant features.
• Scalability Issues: Becomes inefficient with a very large number of data points or high-
dimensional data.
Summary of Differences:
• Purpose:
o K-Means: Clustering algorithm for grouping similar data points into clusters.
o KNN: Classification or regression algorithm that makes predictions based on the
nearest neighbors.
• Output:
o K-Means: Produces k clusters with centroids.
o KNN: Provides a class label or value for new data points based on the nearest
neighbors.
• Training Phase:
o K-Means: Involves an iterative process to update centroids and assign data points
to clusters.
o KNN: No explicit training phase; predictions are made directly from the training
data.
• Use Cases:
o K-Means: Used for market segmentation, image compression, and anomaly
detection.
o KNN: Used for handwriting recognition, recommendation systems, and predictive
modeling.
Both algorithms are foundational in machine learning and data analysis, each suitable for different
types of tasks and data scenarios.

Time Series Forecasting: ARMA, ARCH & GARCH Models:


Time series forecasting involves predicting future values based on previously observed values in a
time series. Several statistical models are used for time series analysis, including ARMA, ARCH, and
GARCH models. Here’s a detailed overview of each:
1. ARMA Model (AutoRegressive Moving Average)
The ARMA model is a classical time series model that combines two components:
• AutoRegressive (AR): Captures the relationship between an observation and a number of
lagged observations (previous values).
• Moving Average (MA): Models the relationship between an observation and a number of
lagged forecast errors.

AR, MA, ARMA, and ARIMA models are used to forecast the observation at time (t+1) based on the historical data of previous time spots recorded for the same observation. However, it is necessary to make sure that the time series is stationary over the historical period of observation. If the time series is not stationary, we can apply differencing to the records and check whether the graph of the resulting time series is stationary over the time period.
ACF (Auto Correlation Function)
The auto-correlation function takes into consideration all the past observations, irrespective of their effect on the future or present time period. It calculates the correlation between the time periods t and (t-k), and includes all the lags or intervals between t and (t-k). The correlation is always calculated using the Pearson correlation formula.


PACF (Partial Auto-Correlation Function)
The PACF determines the partial correlation between the time periods t and t-k. It does not take into consideration all the time lags between t and t-k. For example, today's stock price may depend on the stock price 3 days ago, but it might not take into consideration yesterday's closing price. Hence we consider only the time lags that have a direct impact on the future time period, neglecting the insignificant time lags between the two time slots t and t-k.
How to differentiate when to use ACF and PACF?

Let's take an example of sweet sales and the income generated in a village over a year. Under the assumption that there is a festival in the village every 2 months, we take the historical data of sweet sales and income generated for 12 months. If we plot the data by month, we can observe that for the sweet sales we are interested only in alternate months, as the sale of sweets increases every two months. But if we are to consider the income generated next month, then we have to take into consideration all 12 months of the last year.
So in the above situation, we will use ACF to find out the income generated in the future, but we will be using PACF to find out the sweets sold in the next month.
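In practice, ACF and PACF are usually inspected as plots. A minimal sketch with statsmodels follows, where series is an assumed pandas Series of observations and lags=12 is illustrative:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# series is an assumed pandas Series (e.g., monthly sales)
plot_acf(series, lags=12)    # ACF: correlation of y_t with y_{t-k}, intermediate lags included
plot_pacf(series, lags=12)   # PACF: direct correlation of y_t with y_{t-k} only
plt.show()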
AR (Auto-Regressive) Model

The time period t is impacted by the observations at various previous slots t-1, t-2, t-3, ….., t-k. The impact of each previous time spot is decided by the coefficient factor for that particular period of time. For example, the price of a share of any particular company X may depend on all the previous share prices in the time series. A model that regresses on the past values of the series to calculate the present or future values is known as the Auto-Regressive (AR) model.
Yt = β₁*yₜ-₁ + β₂*yₜ-₂ + β₃*yₜ-₃ + ………… + βₖ*yₜ-ₖ

Consider the example of a milk distribution company that produces milk every month across the country. We want to calculate the amount of milk to be produced in the current month, considering the milk produced over the last year. We begin by calculating the PACF values of all 12 lags with respect to the current month. If the PACF value of a particular month is above a significance threshold, only those lags are considered in the model analysis.
For example, in the corresponding PACF plot the lags 1, 2, 3, … up to 12 display the direct effect (PACF) of the milk production in the current month with respect to the given lag t. If we consider two significant values above the threshold, then the model is termed AR(2).


MA (Moving Average) Model

The time period t is impacted by unexpected external factors at various previous slots t-1, t-2, t-3, ….., t-k. These unexpected impacts are known as errors or residuals. The impact of each previous time spot is decided by the coefficient factor α for that particular period of time. For example, the price of a share of any particular company X may move because of a company merger that happened overnight, or because the company shut down due to bankruptcy. A model that uses the residuals or errors of the past time series to calculate the present or future values is known as the Moving Average (MA) model.
Yt = α₁*Ɛₜ-₁ + α₂*Ɛₜ-₂ + α₃*Ɛₜ-₃ + ………… + αₖ*Ɛₜ-ₖ

Consider the example of cake distribution on a birthday. Let's assume that your mom asks you to bring pastries to the party. Every year you misjudge the number of invitees and end up bringing more or fewer cakes than required. The difference between the actual and expected amounts results in an error. To avoid this error for the current year, we apply the moving average model to the time series and calculate the number of pastries needed this year based on the past collective errors. Next, we calculate the ACF values of all the lags in the time series. If the ACF value of a particular month is above a significance threshold, only those lags are considered in the model analysis.
For example, in the corresponding ACF plot the lags 1, 2, 3, … up to 12 display the total error (ACF) of the pastry count in the current month with respect to the given lag t, considering all the in-between lags between time t and the current month. If we consider two significant values above the threshold, then the model is termed MA(2).
ARMA (Auto Regressive Moving Average) Model

This model is a combination of the AR and MA models. In this model, the impact of the previous lags along with the residuals is considered for forecasting the future values of the time series. Here β represents the coefficients of the AR part and α represents the coefficients of the MA part.
Yt = β₁*yₜ-₁ + α₁*Ɛₜ-₁ + β₂*yₜ-₂ + α₂*Ɛₜ-₂ + β₃*yₜ-₃ + α₃*Ɛₜ-₃ + ………… + βₖ*yₜ-ₖ + αₖ*Ɛₜ-ₖ
Consider the ACF and PACF plots described above, where the MA and AR significant values are identified. Let's assume that we consider only 1 significant value from the AR side and likewise 1 significant value from the MA side. Then the ARMA model obtained by combining the two models will be of order ARMA(1,1).


ARIMA (Auto-Regressive Integrated Moving Average) Model

We know that in order to apply the various models we must first convert the series into a stationary time series. To achieve this, we apply differencing (the Integrated step), where we subtract the t-1 value from the t value of the time series. If after the first differencing we are still unable to get a stationary time series, we apply second-order differencing.
The ARIMA model is quite similar to the ARMA model, except that it includes one more factor known as Integrated (I), i.e. differencing, which stands for the I in ARIMA. So, in short, the ARIMA model combines the number of differences already applied to the series in order to make it stationary, the number of previous lags, and the residual errors, in order to forecast future values.
Consider again the ACF and PACF plots with their respective significant values. Let's assume that we consider only 1 significant value from the AR side and likewise 1 significant value from the MA side. Also, the series was initially non-stationary and we had to perform the differencing operation once in order to convert it into a stationary set. Hence the ARIMA model obtained from the combined values of the other two models along with the Integrated operator can be written as ARIMA(1,1,1).
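A minimal sketch of fitting such a model with statsmodels follows; series is an assumed pandas Series of observations, and the order (1,1,1) simply matches the ARIMA(1,1,1) example above (an ARMA(1,1) would be order (1,0,1)):

from statsmodels.tsa.arima.model import ARIMA

# series is an assumed pandas Series of observations (e.g., monthly values)
model = ARIMA(series, order=(1, 1, 1))   # (p, d, q): 1 AR lag, 1 difference, 1 MA lag
result = model.fit()
print(result.summary())

forecast = result.forecast(steps=3)      # forecast the next 3 time periods
print(forecast)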


Conclusion:
All these models give us an insight, or at least a close enough prediction, about a particular time series. Which model to use depends on the user and which model best suffices their needs. If the error rate of one model is lower than that of the other models, then it is preferred that we choose the one which gives us the closest estimation.

Applications
• Economics: Forecasting GDP, inflation rates.
• Finance: Predicting stock prices, interest rates.
• Engineering: Analyzing system behaviors and control systems.
Assumptions
• The time series is stationary, meaning its statistical properties do not change over time.
• The residuals (errors) are normally distributed and uncorrelated.
2. ARCH Model (Autoregressive Conditional Heteroskedasticity)
The ARCH model, introduced by Robert Engle in 1982, is designed to model time series data where
the variance of the errors is not constant but changes over time.
ARCH Model Structure
The ARCH model is specified as:
ϵt = σt·zt
σt² = α0 + α1·ϵ(t−1)² + α2·ϵ(t−2)² + ⋯ + αq·ϵ(t−q)²
where:
• ϵt is the error term at time t.
• σt² is the conditional variance of ϵt given past information.
• zt is a white noise error term with zero mean and unit variance.
• α0, α1, …, αq are parameters.
Applications
• Finance: Modeling and forecasting volatility of financial markets, such as stock prices and
exchange rates.
• Econometrics: Analyzing economic time series where volatility changes over time.
Assumptions
• The error terms are conditionally heteroskedastic, meaning the variance changes over time
but is dependent on past errors.
• The model assumes that the conditional variance depends on past squared errors.
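To make the recursion concrete, here is a small NumPy sketch that simulates an ARCH(1) process directly from the equations above; the parameter values α0 = 0.2 and α1 = 0.5 are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
a0, a1 = 0.2, 0.5                  # illustrative ARCH(1) parameters
n = 1000

eps = np.zeros(n)
sigma2 = np.zeros(n)
sigma2[0] = a0 / (1 - a1)          # start at the unconditional variance

for t in range(1, n):
    sigma2[t] = a0 + a1 * eps[t - 1] ** 2                  # variance from past squared error
    eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()    # eps_t = sigma_t * z_t

print("sample variance of eps:", eps.var())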
3. GARCH Model (Generalized Autoregressive Conditional Heteroskedasticity)
The GARCH model, introduced by Tim Bollerslev in 1986, extends the ARCH model to include past
forecast variances as well as past squared errors. This makes it more flexible and better suited for
capturing the persistence of volatility.
GARCH Model Structure
The GARCH(p, q) model is specified as:
ϵt = σt·zt
σt² = α0 + α1·ϵ(t−1)² + α2·ϵ(t−2)² + ⋯ + αq·ϵ(t−q)² + β1·σ(t−1)² + β2·σ(t−2)² + ⋯ + βp·σ(t−p)²
where:
• ϵt is the error term.
• σt² is the conditional variance.
• zt is a white noise error term.
• α0, α1, …, αq are the parameters for past squared errors.
• β1, β2, …, βp are the parameters for past variances.
Applications
• Finance: Forecasting and modeling volatility in stock returns, risk management, and
derivatives pricing.
• Economics: Analyzing and predicting economic indicators with variable volatility.
Assumptions
• The conditional variance depends on past squared errors and past variances.
• The model assumes that the volatility clustering is persistent, meaning periods of high
volatility are followed by more high volatility.
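In practice, GARCH models are commonly fitted with the arch Python package. A minimal sketch follows, assuming returns is a series of asset returns (e.g., daily percentage returns) and using the common GARCH(1,1) choice:

from arch import arch_model

# returns is an assumed series of asset returns (e.g., daily % returns)
model = arch_model(returns, vol='GARCH', p=1, q=1, mean='Constant')
result = model.fit(disp='off')
print(result.summary())

# Forecast the conditional variance for the next 5 periods
forecast = result.forecast(horizon=5)
print(forecast.variance.iloc[-1])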
Summary
• ARMA: Useful for stationary time series where both the mean and variance are constant over
time.
• ARCH: Focuses on modeling changing volatility in a time series with conditional
heteroskedasticity.
• GARCH: An extension of ARCH that includes past variances in modeling, suitable for capturing
persistent volatility clustering.
These models are fundamental tools in time series analysis and forecasting, each addressing
different aspects of time-dependent data.
