
MIT School of Computing

Department of Computer Science & Engineering

Third Year Engineering

21BTCS003-Artificial Intelligence and Machine Learning

Class - T.Y. (SEM-II)

AY 2024-2025

Syllabus
UNIT – V Unsupervised Learning and Ensemble Learning
09 Hours
Overview of Unsupervised learning; Importance; Applications; Supervised Vs
Unsupervised Learning; Disadvantages; Types of unsupervised learning: Clustering &
Association; Clustering: Hierarchical clustering, K-means clustering and KNN clustering;
Association: Apriori Algorithm and FP Growth Algorithm;

Introduction to Ensemble Learning; Ensemble Learning Techniques: Bagging, Boosting and Stacking;

Dimensionality Reduction: Principal Component Analysis (PCA).


Text Books:
1. “Artificial Intelligence: A Modern Approach” by Stuart Russell and Peter Norvig
2. “Machine Learning: A Probabilistic Perspective” by Kevin P. Murphy

Reference Books:
1. “Pattern Recognition and Machine Learning” by Christopher Bishop
2. “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
3. “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” by Aurelien Geron
MOOC/ Video Lectures available at:
1. http://videolectures.net/andrew_ng/
2. https://nptel.ac.in/courses/106105077


What is Machine Learning


• Machine learning is a subfield of artificial intelligence (AI) that focuses
on developing algorithms and models that enable computers to learn
patterns, make predictions, and improve their performance over time
without explicit programming.
• It involves the utilization of statistical techniques and computational
power to enable systems to learn from data and make informed decisions.
• Machine Learning is the process of creating models which can perform a
certain task without the need for a human explicitly programming it to do
something.


Courtesy: https://blog.autify.com/en/machine-learning-in-software-testing/


Let's have a look at what we have done so far and what we will do next. (Courtesy: https://www.favouriteblog.com/15-algorithms-machine-learning-engineers/)


Courtesy: intellectualpoint.com/getting-started/how-to-launch-a-career-in-machine-learning/


WHAT IS UNSUPERVISED LEARNING?


• Unsupervised learning is a category of machine learning where the algorithm is trained on data without explicit guidance in the form of labeled outcomes.
• Unlike supervised learning, where the model is provided with labeled examples to learn from, unsupervised learning involves the algorithm discovering patterns, relationships, or structures within the data without predefined target labels.

(Figure: Types of Machine Learning and What is Unsupervised Learning? Source: technical-beast.blogspot.com)


Supervised vs. Unsupervised Learning


WHY IS IT IMPORTANT?
• Pattern Discovery:
• Unsupervised learning allows algorithms to identify hidden patterns, structures, and relationships within data without explicit
guidance from labeled examples. This is essential for understanding the underlying structure of complex datasets.
• Data Exploration and Preprocessing:
• Before engaging in supervised learning tasks, unsupervised learning can be used for exploratory data analysis. Techniques like
clustering and dimensionality reduction assist in understanding the data's inherent characteristics and reducing its complexity.
• Anomaly Detection:
• Unsupervised learning is effective in detecting anomalies or outliers in datasets. By learning the normal patterns within the
data, algorithms can identify instances that deviate from the norm, which is crucial in fields like fraud detection, cybersecurity,
and quality control.
• Clustering and Segmentation:
• Clustering techniques, such as K-Means, enable the grouping of similar data points into clusters. This is valuable for
segmentation in various domains, including customer segmentation in marketing, grouping genes in bioinformatics, or
categorizing news articles.
• Feature Learning:
• Unsupervised learning helps in automatically learning relevant features from raw data. Techniques like autoencoders and
restricted Boltzmann machines can discover informative representations of data, facilitating improved performance in
downstream tasks.

WHY IS IT IMPORTANT?
• Reducing Dimensionality:
• Unsupervised learning methods, like Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding
(t-SNE), aid in reducing the dimensionality of data. This is beneficial for visualization, computational efficiency, and mitigating
the curse of dimensionality.
• Generative Modeling:
• Unsupervised learning is used in generative modeling, where algorithms learn the underlying distribution of the data.
Generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), can then generate
new, realistic data samples.
• Self-Supervised Learning:
• Unsupervised learning is a foundation for self-supervised learning, where models are trained using the data itself to create
supervisory signals. This approach is increasingly popular in natural language processing and computer vision, leading to
improved performance on downstream tasks.
• Data Imputation and Denoising:
• In scenarios with missing or noisy data, unsupervised learning techniques can be employed for data imputation or denoising,
helping to recover missing or corrupted information.
• Advancing Artificial Intelligence:
• Unsupervised learning is fundamental to the broader field of artificial intelligence. It serves as a cornerstone for developing
more sophisticated algorithms and models that can adapt, generalize, and discover meaningful patterns in diverse datasets.

Key Characteristics:
1. No Labeled Output:
○ Unsupervised learning algorithms work with datasets where the desired output (labels or target values) is not
provided during training.
2. Exploratory Nature:
○ The primary goal is exploration and uncovering hidden patterns or structures within the data.
3. Clustering and Association:
○ Common tasks include clustering similar data points together and discovering associations or relationships among
variables.
4. Dimensionality Reduction:
○ Techniques like principal component analysis (PCA) are often used to reduce the dimensionality of the data,
focusing on the most informative features.
5. Anomaly Detection:
○ Unsupervised learning is employed in anomaly detection, where the algorithm identifies data instances that deviate
from the norm.
6. No Explicit Feedback:
○ Without explicit feedback, the algorithm must infer the underlying structure of the data based on inherent patterns.

Applications
1. Customer Segmentation:
○ Clustering customers based on purchasing behavior.
2. Recommendation Systems:
○ Identifying patterns in user preferences for personalized recommendations.
3. Anomaly Detection:
○ Detecting unusual behavior or outliers in network traffic, fraud detection, etc.
4. Image and Speech Recognition:
○ Learning representations of data without labeled examples.
5. Natural Language Processing (NLP):
○ Topic modeling and text clustering.


Challenges
1. Evaluation Metrics:
○ Assessing the performance of unsupervised learning models can be challenging due to the
absence of explicit labels.
2. Interpretability:
○ Understanding and interpreting the discovered patterns may be less straightforward compared to
supervised learning.
3. Data Preprocessing:
○ Handling missing data and outliers becomes critical in unsupervised learning.


Unsupervised Learning Algorithms


1. Clustering Algorithms:
○ k-Means Clustering
○ Hierarchical Clustering
○ DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
2. Dimensionality Reduction Techniques:
○ Principal Component Analysis (PCA)
○ t-Distributed Stochastic Neighbor Embedding (t-SNE)
3. Association Rule Mining:
○ Apriori Algorithm
○ FP-growth Algorithm
4. Neural Network-based Unsupervised Learning: Autoencoders and Self-Organizing Maps (SOMs)
5. Anomaly Detection Algorithms: One-Class SVM (Support Vector Machine) and Isolation Forest
6. Generative Models:
○ Variational Autoencoders (VAEs)
○ Generative Adversarial Networks (GANs)


Unsupervised Learning Algorithms


Cluster analysis finds the commonalities between the data objects and categorizes them as per the
presence and absence of those commonalities.
Association: An association rule is an unsupervised learning method which is used for finding the
relationships between variables in the large database.

Clustering Algorithms
Clustering is the type of Unsupervised Learning where you find patterns in the
data that you are working on. It may be the shape, size, colour etc. which can be
used to group data items or create clusters.

https://technical-beast.blogspot.com/2020/05/Unsupervised-Machine-learning.html

k-means Clustering
• k-Means is a popular unsupervised machine learning algorithm used for
clustering, a process of grouping similar data points together.
• The algorithm aims to partition data into k clusters, where each cluster
represents a group of data points that are more similar to each other than to
those in other clusters.

https://www.lancaster.ac.uk/stor-i-student-sites/harini-jayaraman/k-means-clustering/


k-means Clustering
Steps in K-Means:

1. Initialization:
○ Randomly select k initial cluster centroids. These centroids act as the starting points for the
clusters.
2. Assignment:
○ Assign each data point to the cluster whose centroid is closest in terms of Euclidean distance.
This step creates initial clusters.
3. Update Centroids:
○ Recalculate the centroids of the clusters based on the mean of the data points assigned to each
cluster.
4. Repeat:
○ Repeat the assignment and centroid update steps iteratively until convergence. Convergence
occurs when the assignment of data points to clusters no longer changes significantly.
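These four steps translate almost line-for-line into code. The following is a minimal NumPy sketch of the loop, not a reference implementation from the course; the toy data, the value of k, and the iteration cap are assumptions chosen only for illustration.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (Initialization): pick k distinct points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2 (Assignment): each point goes to the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3 (Update): each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4 (Repeat): stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of points (illustrative values only)
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])
print(kmeans(X, k=2))
```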


k-means Clustering
Example Application: Customer Segmentation

Objective: Imagine you have customer data with features like age and annual income. You want to group customers
into distinct segments based on these features for targeted marketing.

Steps:

1. Initialization:
○ Randomly choose k initial centroids, where k is the number of desired clusters.
2. Assignment:
○ Assign each customer to the cluster whose centroid is closest based on features like age and income.
3. Update Centroids:
○ Recalculate the centroids for each cluster by taking the mean of the features of the data points in that cluster.
4. Repeat:
○ Iteratively repeat the assignment and centroid update steps until the clusters stabilize.

Results: After convergence, the algorithm has grouped customers into k clusters, each representing a segment of
customers with similar age and income characteristics.

k-means Clustering
Key Considerations:

1. Choosing k:
○ The number of clusters (k) needs to be specified beforehand. Techniques like the elbow method
or silhouette analysis can help determine an optimal value for k.
2. Sensitive to Initializations:
○ K-Means can be sensitive to the initial placement of centroids. Multiple runs with different
initializations may be performed to improve results.
3. Euclidean Distance:
○ The algorithm relies on Euclidean distance, making it suitable for numeric data but less effective
for categorical data or data with irregular shapes.
4. Scalability:
○ K-Means may struggle with large datasets or high-dimensional data due to its computational
complexity.
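To make the elbow method mentioned above concrete, here is a hedged scikit-learn sketch: the synthetic dataset, the range of k values, and the use of inertia_ as the within-cluster sum of squares are illustrative choices, not part of the original slides.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four underlying groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = range(1, 10)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("within-cluster SSE (inertia)")
plt.title("Elbow method: pick k near the bend of the curve")
plt.show()
```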


k-means Clustering
Key Points in K-Means Clustering:
• Objective:
○ K-Means clustering aims to partition a dataset into k clusters, where each cluster represents a group of data
points that are more similar to each other than to those in other clusters.
• Unsupervised Learning:
○ K-Means is an unsupervised learning algorithm, meaning it doesn't require labeled target information during
training.
• Initialization:
○ The algorithm starts by randomly selecting k initial cluster centroids. The choice of initial centroids can
impact the final result.
• Assignment Step:
○ In the assignment step, each data point is assigned to the cluster whose centroid is closest. The most common
distance metric used is Euclidean distance.
• Centroid Update Step:
○ After assigning all data points, the algorithm updates the centroids of each cluster. The new centroids are
calculated as the mean of the data points in each cluster.
• Iteration:
○ The assignment and centroid update steps are repeated iteratively until convergence. Convergence occurs
when the assignment of data points to clusters no longer changes significantly.

k-means Clustering
Key Points in K-Means Clustering:
• Convergence Criteria:
○ Common convergence criteria include a maximum number of iterations or a small change in the centroids
between consecutive iterations.
• Number of Clusters (k):
○ The number of clusters (k) needs to be specified before running the algorithm. Techniques like the elbow
method or silhouette analysis can help determine an optimal value for k.
• Sensitivity to Initializations:
○ K-Means can be sensitive to the initial placement of centroids. Running the algorithm multiple times with
different initializations and selecting the best result is a common practice.
• Within-Cluster Variance:
○ The objective of K-Means is to minimize the within-cluster variance, which is the sum of squared distances
between each data point and the centroid of its assigned cluster.
• Euclidean Distance:
○ K-Means relies on Euclidean distance as the default distance metric. It measures the straight-line distance
between two points in a multidimensional space.
• Assumes Spherical Clusters:
○ K-Means assumes that clusters are spherical and equally sized, which may not be suitable for all types of
data.
• Scalability:
○ The algorithm can struggle with large datasets or high-dimensional data due to its computational complexity.

k-means Clustering
Applications of K-Means:

● Customer segmentation in marketing.


● Image compression in computer vision.
● Document categorization in natural language processing.
● Anomaly detection in cybersecurity.


k-means Clustering
Advantages of K-Means:

● Easy to implement: K-means clustering is an iterable algorithm and a relatively simple algorithm. In
fact, we can also perform k-means clustering manually as we did in the numerical example.
● Scalability: We can use k-means clustering for even 10 records or even 10 million records in a
dataset. It will give us results in both cases.
● Convergence: The k-means clustering algorithm is guaranteed to give us results. It guarantees
convergence. Thus, we will get the result of the execution of the algorithm for sure.
● Generalization: K-means clustering doesn’t apply to a specific problem. From numerical data to text
documents, you can use the k-means clustering algorithm on any dataset to perform clustering. It can
also be applied to datasets of different sizes having entirely different distributions in the dataset.
Hence, this algorithm is completely generalized.
● Choice of centroids: You can warm-start the choice of centroids in an easy manner. Hence, the
algorithm allows you to choose and assign centroids that fit well with the dataset.


k-means Clustering
Limitations:

● Deciding the number of clusters: In k-means clustering, you need to decide the number of clusters
by using the elbow method.
● Choice of initial centroids: The number of iterations in the clustering process completely depends on
the choice of centroids. Hence, you need to properly choose the centroids in the initial step for
maximizing the efficiency of the algorithm.
● Effect of outliers: In the execution of the k-means clustering algorithm, we use all the points in a
cluster to determine the centroids for the next iteration. If there are outliers in the dataset, they highly
affect the position of the centroids. Due to this, the clustering becomes inaccurate. To avoid this, you
can try to identify outliers and remove them in the data cleaning process.
● Curse of Dimensionality: If the number of dimensions in the dataset increases, the distance of the
data points from a given point starts converging to a specific value. Due to this, k-means clustering
that calculates the clusters based on the distance between the points becomes inefficient. To
overcome this problem, you can use advanced clustering algorithms like spectral clustering.
Alternatively, you can also try to reduce the dimensionality of the dataset while data preprocessing.


k-means Clustering Example


You are given 15 points in the Cartesian coordinate system as follows, and you need to form 3 clusters, i.e., K = 3. Solve this k-means clustering problem numerically.
1. First, we will randomly choose 3 centroids from the
given data. Let us consider A2 (2,6), A7 (5,10), and
A15 (6,11) as the centroids of the initial clusters.
Hence, we will consider that
● Centroid 1=(2,6) is associated with cluster 1.
● Centroid 2=(5,10) is associated with cluster 2.
● Centroid 3=(6,11) is associated with cluster 3.


k-means Clustering Example


2. Now we will find the euclidean distance between each point and the
centroids. Based on the minimum distance of each point from the
centroids, we will assign the points to a cluster.
At this point, we have completed the first iteration of the k-means
clustering algorithm and assigned each point into a cluster.
In the above table, you can observe that the point that is closest to the
centroid of a given cluster is assigned to the cluster.
3. Now, we will calculate the new centroid for each cluster.
● In cluster 1, we have 6 points i.e. A2 (2,6), A5 (6,4), A6 (1,2),
A10 (7,5), A12 (4,6), A14 (3,8). To calculate the new centroid for
cluster 1, we will find the mean of the x and y coordinates of each
point in the cluster. Hence, the new centroid for cluster 1 is
(3.833, 5.167).
● In cluster 2, we have 5 points i.e. A1 (2,10), A4 (6,9), A7 (5,10) ,
A8 (4,9), and A13 (3,10). Hence, the new centroid for cluster 2 is
(4, 9.6)
● In cluster 3, we have 4 points i.e. A3 (11,11), A9 (10,12), A11 (9,11), and A15 (6,11). Hence, the new centroid for cluster 3 is (9, 11.25).

(Table: Results from the 1st iteration of k-means clustering)

k-means Clustering Example


3. Now that we have calculated new centroids for each cluster, we will
calculate the distance of each data point from the new centroids. Then, we
will assign the points to clusters based on their distance from the centroids.
The results for this process have been given in the following table.
Now, we have completed the second iteration of the k-means clustering
algorithm and assigned each point into an updated cluster. In the table, you
can observe that the point closest to the new centroid of a given cluster is
assigned to the cluster.
4. Now, we will calculate the new centroid for each cluster for the third
iteration.
● In cluster 1, we have 5 points i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10
(7,5), and A12 (4,6). To calculate the new centroid for cluster 1, we
will find the mean of the x and y coordinates of each point in the
cluster. Hence, the new centroid for cluster 1 is (4, 4.6).
● In cluster 2, we have 7 points i.e. A1 (2,10), A4 (6,9), A7 (5,10) , A8
(4,9), A13 (3,10), A14 (3,8), and A15 (6,11). Hence, the new centroid
for cluster 2 is (4.143, 9.571)
● In cluster 3, we have 3 points i.e. A3 (11,11), A9 (10,12), and A11 (9,11). Hence, the new centroid for cluster 3 is (10, 11.333).

(Table: Results from the 2nd iteration of k-means clustering)

k-means Clustering Example


4. Now, we will calculate the distance of each data point from the new centroids. Then, we
will assign the points to clusters based on their distance from the centroids. The results for
this process have been given in the following table.
Now, we have completed the third iteration of the k-means clustering algorithm and
assigned each point into an updated cluster.
In the table, you can observe that the point that is closest to the new centroid of a given
cluster is assigned to the cluster.
5. Now, we will calculate the new centroid for each cluster for the third iteration.
● In cluster 1, we have 5 points i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), and A12
(4,6). To calculate the new centroid for cluster 1, we will find the mean of the x and y
coordinates of each point in the cluster. Hence, the new centroid for cluster 1 is (4,
4.6).
● In cluster 2, we have 7 points i.e. A1 (2,10), A4 (6,9), A7 (5,10) , A8 (4,9), A13
(3,10), A14 (3,8), and A15 (6,11). Hence, the new centroid for cluster 2 is (4.143,
9.571)
● In cluster 3, we have 3 points i.e. A3 (11,11), A9 (10,12), and A11 (9,11). Hence, the
new centroid for cluster 3 is (10, 11.333).
Here, you can observe that no point has changed its cluster compared to the previous
iteration. Due to this, the centroid also remains constant. Therefore, we will say that the
clusters have been stabilized. Hence, the clusters obtained after the third iteration are the final clusters made from the given dataset.

(Table: Results from the 3rd iteration of k-means clustering)
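The hand calculation above can be cross-checked in a few lines. The sketch below feeds the same 15 points and the same starting centroids (A2, A7, A15) to scikit-learn's KMeans; since Lloyd's algorithm is deterministic for a fixed initialization, the final centroids should agree with the (4, 4.6), (4.143, 9.571) and (10, 11.333) found above. Treat this as a verification aid, not part of the original example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Points A1..A15 from the worked example
X = np.array([[2, 10], [2, 6], [11, 11], [6, 9], [6, 4], [1, 2], [5, 10], [4, 9],
              [10, 12], [7, 5], [9, 11], [4, 6], [3, 10], [3, 8], [6, 11]], dtype=float)

# Same initial centroids as the example: A2 (2,6), A7 (5,10), A15 (6,11)
init = np.array([[2, 6], [5, 10], [6, 11]], dtype=float)

km = KMeans(n_clusters=3, init=init, n_init=1).fit(X)
print(np.round(km.cluster_centers_, 3))   # compare with the centroids computed by hand
print(km.labels_)                         # cluster index assigned to A1..A15
```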

k-means Clustering Example


If we plot the clusters on a graph, it looks as follows. In the plot, points in the clusters are shown using red, blue, and black markers, and the centroids of the clusters are marked with green circles.

Plot for K-Means Clustering



k-means Clustering Example


Apply the K-Means algorithm (K = 2) over the data (185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77) up to two iterations and show the clusters. Initially choose the first two objects as the initial centroids.

Given: number of clusters to be created (K) = 2, say c1 and c2; number of iterations = 2. The given data points can be represented in tabular form as:

Instances X Y
1 185 72
2 170 56
3 168 60
4 179 68
5 182 72
6 188 77


k-means Clustering Example


Consider first two objects as initial centroids:

Centroid for first cluster c1 = (185, 72)

Centroid for second cluster c2 = (170, 56)

Iteration 1: Now calculating similarity using the Euclidean distance measure: d = √((x₂ − x₁)² + (y₂ − y₁)²)



k-means Clustering Example


Representing the above information in tabular form, the resulting clusters after the first iteration are:

Distance of each data point from the cluster centroids:

Instances   X     Y     Distance (C1)   Distance (C2)   Cluster
1           185   72    0.0000          21.9317         C1
2           170   56    21.9317         0.0000          C2
3           168   60    20.8087         4.4721          C2
4           179   68    7.2111          15.0000         C1
5           182   72    3.0000          20.0000         C1
6           188   77    5.8310          27.6586         C1

k-means Clustering Example

Iteration 2: Now calculating the centroid for each cluster: c1 = mean of instances {1, 4, 5, 6} = (183.5, 72.25), and c2 = mean of instances {2, 3} = (169, 58).

Now, again calculating similarity:



k-means Clustering Example


Representing the above information in tabular form, the resulting clusters after the second iteration are:

Distance of each data point from the cluster centroids:

Instances   X     Y     Distance (C1)   Distance (C2)   Cluster
1           185   72    1.5207          21.2603         C1
2           170   56    21.1261         2.2361          C2
3           168   60    19.7563         2.2361          C2
4           179   68    6.1897          14.1421         C1
5           182   72    1.5207          19.1050         C1
6           188   77    6.5431          26.8701         C1

Since no point changed its cluster in the second iteration, the final clusters are {1, 4, 5, 6} and {2, 3}.
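The distance columns above follow directly from the Euclidean formula. A short sketch that recomputes the iteration-2 table (the centroid values are the means derived above; any differences are only in the displayed rounding):

```python
import numpy as np

# The six instances from the example
X = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77]], dtype=float)

# Centroids after iteration 1: mean of instances {1, 4, 5, 6} and mean of {2, 3}
c1 = X[[0, 3, 4, 5]].mean(axis=0)   # (183.5, 72.25)
c2 = X[[1, 2]].mean(axis=0)         # (169.0, 58.0)

for i, p in enumerate(X, start=1):
    d1, d2 = np.linalg.norm(p - c1), np.linalg.norm(p - c2)
    print(f"Instance {i}: d(C1)={d1:.4f}  d(C2)={d2:.4f}  -> {'C1' if d1 < d2 else 'C2'}")
```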


Hierarchical Clustering
• Hierarchical Clustering is an unsupervised machine learning algorithm that organizes data points into
a hierarchical tree-like structure based on their similarities. Unlike K-Means, which partitions data
into a fixed number of clusters, hierarchical clustering creates a tree of clusters, also known as a
dendrogram.
• The key difference between k means and hierarchical clustering is that k means divides the data into
a predetermined number of clusters (k) whereas, hierarchical clustering forms a hierarchy of clusters
without requiring the number of clusters beforehand.
• Hierarchical clustering technique can be of two types:
• Agglomerative: Starts with individual data points as separate clusters and iteratively merges the
most similar clusters until only one cluster remains.
• Divisive: Starts with all data points in one cluster and recursively splits it into smaller clusters
based on dissimilarities.


Hierarchical Clustering
Key Concepts:
• Similarity or Dissimilarity Measures: Hierarchical clustering requires a metric to quantify the
similarity or dissimilarity between data points. Common measures include Euclidean distance,
Manhattan distance, or correlation coefficients.
• Dendrogram: A dendrogram is a tree-like diagram that visually represents the merging
(agglomerative) or splitting (divisive) of clusters. It provides insight into the hierarchical relationships
between data points.
• Linkage Methods: Linkage methods determine how the similarity or dissimilarity between clusters
is computed. Common linkage methods include:
• Single Linkage: The minimum of all pairwise distance between elements in each pair of clusters is used to
measure the distance between two clusters.
• Complete Linkage: The maximum of all pairwise distance between elements in each pair of clusters is used
to measure the distance between two clusters.
• Average Linkage: The average of all pairwise distances between elements in each pair of clusters is used to
measure the distance between two clusters.
• Centroid linkage: Before merging, the distance between the two clusters’ centroids is considered.
• Ward's Method: It uses squared error to compute the similarity of the two clusters for merging.

Dendrogram
A dendrogram is a tree-like diagram used in hierarchical clustering to represent the relationships between data points
or clusters. It visually illustrates how data points or clusters are merged or split throughout the clustering process. Here
are the key components and features of a dendrogram:

Components of a Dendrogram:

1. Vertical Lines (Branches):


○ Represent individual data points or clusters at the bottom of the dendrogram.
2. Horizontal Lines (Fusion/Split):
○ Connect vertical lines to show the merging (agglomerative) or splitting (divisive) of clusters.
3. Height or Distance:
○ The vertical distance between the points where horizontal lines connect represents the dissimilarity or
distance at which clusters were merged or split.
4. Leaves:
○ The terminal points of the dendrogram, corresponding to individual data points or clusters.
5. Root:
○ The topmost point of the dendrogram where all data points or clusters are joined into a single cluster.


Hierarchical Clustering
Steps in Agglomerative Hierarchical Clustering:
1. Compute the proximity matrix using a distance metric, such
as Euclidean distance, to measure how close each pair of
data points are.
2. Use a linkage function, such as single, complete, or average
linkage, to group the closest data points or clusters into a
larger cluster, based on the distance matrix from the previous
step.
3. Repeat step 2 until all data points are merged into a single
cluster, or until a stopping criterion is met, such as a desired
number of clusters or a threshold distance.
The result of agglomerative clustering is a tree-like structure
called a dendrogram, which shows the hierarchical relationship
between the clusters. You can cut the dendrogram at different
levels to obtain different partitions of the data.
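SciPy implements exactly this procedure. The sketch below is a minimal illustration (the five 2-D points, the single-linkage choice, and the cut into two clusters are assumptions made only for the example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Small illustrative dataset
X = np.array([[1, 2], [2, 2], [2, 3], [8, 8], [9, 8]], dtype=float)

# Steps 1-3: build the full merge history from pairwise Euclidean distances
Z = linkage(X, method="single", metric="euclidean")

# The dendrogram visualizes the hierarchy of merges
dendrogram(Z)
plt.show()

# "Cutting" the dendrogram to obtain, say, 2 clusters
print(fcluster(Z, t=2, criterion="maxclust"))
```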


PROXIMITY MATRIX AND LINKAGE


The proximity matrix is a matrix consisting of the distance between each pair of data points. The distance is computed by a
distance function. Euclidean distance is one of the most commonly used distance functions.

The proximity matrix for n points x1, …, xn has entries d(xi, xj), the distance between points xi and xj.

In order to group the data points in a cluster, a linkage function is used where the values in the proximity matrix are taken and the
data points are grouped based on similarity. The newly formed clusters are linked to each other until they form a single cluster
containing all the data points.



● Step-1: Consider each alphabet as a single cluster and calculate the distance of one cluster from all the
other clusters.
● Step-2: In the second step comparable clusters are merged together to form a single cluster. Let’s say
cluster (B) and cluster (C) are very similar to each other therefore we merge them in the second step
similarly to cluster (D) and (E) and at last, we get the clusters [(A), (BC), (DE), (F)]
● Step-3: We recalculate the proximity according to the algorithm and merge the two nearest
clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
● Step-4: Repeating the same process; The clusters DEF and BC are comparable and merged together to
form a new cluster. We’re now left with clusters [(A), (BCDEF)].
● Step-5: At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].


Divisive Hierarchical clustering


Divisive Hierarchical clustering is precisely the opposite of Agglomerative Hierarchical
clustering.

In Divisive Hierarchical clustering, we take into account all of the data points as a single cluster
and in every iteration, we separate the data points from the clusters which aren’t comparable. In
the end, we are left with N clusters.



Numerical Example Agglomerative clustering


Given the following points A(1, 1), B(2, 3), C(3, 5), D(4, 5), E(6, 6), and F(7, 5), perform agglomerative clustering and show each step.
To perform clustering, we will first create a distance matrix consisting of the distance between each pair of points in the dataset (using Euclidean distance). The distance matrix looks as follows.
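Since the distance matrix itself is shown only as a figure in the slides, the short sketch below recomputes it for these six points with SciPy; the rounded values can be used to follow the merging steps that come next.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = {"A": (1, 1), "B": (2, 3), "C": (3, 5), "D": (4, 5), "E": (6, 6), "F": (7, 5)}
X = np.array(list(points.values()), dtype=float)

# Pairwise Euclidean distances arranged as a symmetric matrix
D = squareform(pdist(X, metric="euclidean"))
print(list(points))
print(np.round(D, 2))
```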


Numerical Example Agglomerative clustering

After calculating the distance matrix, we will start clustering the data points using the single
linkage approach.
Step 1: First, we will consider each data point as a single cluster. After this, we will start
combining the clusters.
Step 2: To combine the individual clusters, we can consider the following points.
1. Point A is closest to point B.
2. Point B is at a similar distance to points A and C.
3. Point C is closest to point D.
4. Point D is closest to point C.
5. Point E is closest to F.
6. Point F is closest to E.
From the above points, we can combine (C, D) and (E, F) as clusters. We will name them CD and
EF respectively. As we have ambiguity for points A and B, let us not combine them and treat them
as individual clusters.

Numerical Example Agglomerative clustering

Step 3: Now, we will calculate the minimum distance between clusters A, B, CD, and EF. You can observe that
1. Cluster A is closest to B.
2. Cluster B is closest to A as well as CD.
3. Cluster EF is closest to CD.
Using the above information, let us combine B and CD and name the cluster BCD.
Step 4: After this, the cluster BCD is at the same distance from A and EF. Hence, we will first merge BCD and EF to form BCDEF.
Step 5: Finally, we will merge A and BCDEF to form the cluster ABCDEF.
Using the above steps, we will get the following dendrogram.

Numerical Example Agglomerative clustering

In the above example, if we combine (A, B), (C, D), and (E, F) together in step 2 above, we will get clusters AB, CD, and EF.
Now, the minimum distance of CD is the same from AB and EF. Let us merge AB and CD to form ABCD first. Next, we will combine ABCD and EF to obtain the cluster ABCDEF.
As a result, we will get the following dendrogram.


Numerical Example Agglomerative clustering

Instead of combining AB and CD in the previous example, let us first combine CD and EF to form CDEF. Then, we can combine AB with CDEF to form the cluster ABCDEF. As a result, we will get the following dendrogram.

In the above example, there are many alternative paths at different steps. Due to this, we have obtained different dendrograms. Here, we have only used single linkage to calculate the clusters. Instead of the single linkage approach, you can also use the complete linkage and average linkage approaches, which might give you different clusters for the same dataset as well.

Why Does Agglomerative Clustering Give Different Clusters and Dendrograms?

Agglomerative clustering gives different clusters for two reasons.


● First, it is a greedy algorithm. Therefore, it looks for the closest clusters to combine them. Now,
if cluster C is at the same distance with different clusters A, B, D, and E, the algorithm can
merge C into any of these clusters and form one of the clusters AC, BC, CD, and CE. Therefore,
the availability of alternative paths while merging the clusters gives us different dendrograms
and different clusters as well.
● The algorithm can select any metric among single linkage, complete linkage, and average
linkage as the distance metric to calculate the distance between the clusters. If we take different
distance metrics in different runs of the algorithm, we will get different dendrograms and
different clusters as well.


Uses of Hierarchical Clustering


Hierarchical clustering can be used for several applications, ranging from customer segmentation to object
recognition.
MARKET SEGMENTATION
Companies can better understand their markets by identifying target groups based on certain traits like
demographics, personal interests or behaviors. Organizing consumers according to these characteristics enables
organizations to see what consumers care about and whose backgrounds or interests may align with their products
and services. Businesses can then tailor products, marketing ads and other materials according to the preferences
of target audiences.
GEO-SPATIAL ANALYSIS
Besides grouping individual consumers or customers based on specific traits, hierarchical clustering can also
group individuals based on their geographic location. Organizations can then view where their customer bases are
located, predict product demand in certain areas and adjust their marketing and business strategies accordingly.
IMAGE SEGMENTATION
When dealing with images, hierarchical clustering can distinguish between separate visual elements. For example,
the technique can discern between different facial features, aiding facial recognition technology. It can also
discriminate between cars and other objects like buildings and animals, powering image recognition technology.
ANOMALY DETECTION
Hierarchical clustering is also effective at detecting anomalies. By clustering data points into groups, hierarchical
clustering can isolate outliers that don’t belong to any cluster. Researchers can apply this method to root out errors
at different stages of the data collection process, preventing anomalies from impacting the accuracy of data sets.

KNN Algorithm
Key Concept:
• K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
• K-NN algorithm assumes the similarity between the new case/data and available
cases and put the new case into the category that is most similar to the available
categories.
• K-NN algorithm stores all the available data and classifies a new data point based
on the similarity. This means when new data appears then it can be easily classified
into a well suite category by using K- NN algorithm.
• K-NN algorithm can be used for Regression as well as for Classification but mostly
it is used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.

KNN Algorithm
Key Concept:
• KNN algorithm at the training phase just stores the dataset and when it gets new
data, then it classifies that data into a category that is much similar to the new data.
• Example: Suppose, we have an image of a creature that looks similar to cat and
dog, but we want to know either it is a cat or dog. So for this identification, we can
use the KNN algorithm, as it works on a similarity measure. Our KNN model will
find the similar features of the new data set to the cats and dogs images and based
on the most similar features it will put it in either cat or dog category.


Why do we need a K-NN Algorithm?


Suppose there are two categories, i.e., Category A and Category B, and we have a
new data point x1, so this data point will lie in which of these categories. To solve
this type of problem, we need a K-NN algorithm. With the help of K-NN, we can
easily identify the category or class of a particular dataset. Consider the below
diagram:


KNN Algorithm
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number of neighbors
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each
category.
Step-5: Assign the new data points to that category for which the number of
the neighbor is maximum.
Step-6: Our model is ready.


KNN Algorithm Example


Suppose we have a new data point and we need to put it in the required
category. Consider the below image:


KNN Algorithm Example


Firstly, we will choose the number of neighbors, so we will choose the k=5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:

d = √((x₂ − x₁)² + (y₂ − y₁)²)

KNN Algorithm Example


By calculating the Euclidean distance we got the nearest neighbors, as three nearest neighbors in
category A and two nearest neighbors in category B. Consider the below image:


How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN
algorithm:

• There is no particular way to determine the best value for "K", so we need to try some values to
find the best out of them. The most preferred value for K is 5.
• A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in the
model.
• Large values of K reduce the effect of noise, but can make class boundaries less distinct and increase computation.


Advantages of KNN Algorithm:


• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective if the training data is
large.

Disadvantages of KNN Algorithm:


• The value of K always needs to be determined, which may be complex at times.
• The computation cost is high because of
calculating the distance between the data points for
all the training samples.


Scenario:
• We want to classify a data point using KNN. Assume we already have the following dataset, where
each row represents a data point with two features (x1,x2) and its label.
Data Point   X1   X2   Label

A 1 1 Red

B 2 1 Red

C 4 3 Blue

D 5 4 Blue

• We want to classify a new data point P with coordinates (x1 = 3, x2 = 2) using k = 3.
Step 1: Calculate the distances
• We use the Euclidean distance formula:


Compute the distances between P(3, 2) and all other points:

1. Distance to A (1, 1): √((3−1)² + (2−1)²) = √5 ≈ 2.24
2. Distance to B (2, 1): √((3−2)² + (2−1)²) = √2 ≈ 1.41
3. Distance to C (4, 3): √((3−4)² + (2−3)²) = √2 ≈ 1.41
4. Distance to D (5, 4): √((3−5)² + (2−4)²) = √8 ≈ 2.83

Step 2: Sort the distances


• Sort the points by their distances to P(3, 2):

Data Point Distance Label

B 1.41 Red

C 1.41 Blue

A 2.24 Red

D 2.83 Blue

Step 3: Select the k=3 nearest neighbors


The 3 nearest neighbors are:
• B (Red)
• C (Blue)
• A (Red)

Step 4: Majority voting


Count the labels of the 3 nearest neighbors:
• Red: 2
• Blue: 1
The majority label is Red.

Final Classification:
The new data point P(3,2) is classified as Red.
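The same answer can be checked with a few lines of code. This sketch simply re-implements the distance / sort / vote steps for the toy data above (scikit-learn's KNeighborsClassifier would be the usual choice on real data).

```python
import numpy as np
from collections import Counter

# Training points and labels from the example
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
y = np.array(["Red", "Red", "Blue", "Blue"])
P = np.array([3, 2], dtype=float)
k = 3

# Steps 1-2: Euclidean distances to P, then sort
dists = np.linalg.norm(X - P, axis=1)
nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbours

# Steps 3-4: majority vote among the k nearest labels
label = Counter(y[nearest]).most_common(1)[0][0]
print(np.round(dists, 2), y[nearest], "->", label)   # expected: Red
```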


Association
● Association rule learning is a type of
unsupervised learning technique that checks
for the dependency of one data item on
another data item and maps accordingly so
that it can be more profitable.
● It tries to find some interesting relations or
associations among the variables of dataset. It
is based on different rules to discover the
interesting relations between variables in the
database.
● For example, if a customer buys bread, he
most likely can also buy butter, eggs, or milk,
so these products are stored within a shelf or
mostly nearby. Consider the diagram:


How does Association Rule Learning work?


Association rule learning works on the concept of if-then statements.

An association rule is an implication of the form

X→Y,
where X and Y are itemsets. The rule signifies that if X occurs in a transaction, then Y is likely to occur as well.

Here the "if" element is called the antecedent, and the "then" part is called the consequent. A relationship in which an association is found between two single items is known as single cardinality; as the number of items in a rule increases, the cardinality increases accordingly. To measure the associations between thousands of data items, several metrics are used.
These metrics are given below:

○ Support
○ Confidence
○ Lift


Support

Support is the frequency of an itemset, i.e., how frequently it appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X; it can be written as:

Support(X) = (number of transactions containing X) / (total number of transactions T)

Confidence

Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y occur together in the dataset given that X occurs. It is the ratio of the transactions that contain both X and Y to the number of records that contain X:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)

Lift

Lift is the strength of a rule, defined as:

Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

It is the ratio of the observed support to the support expected if X and Y were independent of each other. It has three possible values:

○ If Lift= 1: The probability of occurrence of antecedent and consequent is independent of each other.
○ Lift>1: It determines the degree to which the two itemsets are dependent to each other.
○ Lift<1: It tells us that one item is a substitute for other items, which means one item has a negative effect on another.
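A few lines of Python make the three metrics concrete. The five transactions below are made up purely for illustration; only the formulas mirror the definitions above.

```python
# Hypothetical transactions (illustrative only)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"butter", "milk"},
    {"bread", "butter", "eggs"},
]
N = len(transactions)

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / N

X, Y = {"bread"}, {"butter"}
sup_xy = support(X | Y)
confidence = sup_xy / support(X)
lift = sup_xy / (support(X) * support(Y))
print(f"support={sup_xy:.2f}  confidence={confidence:.2f}  lift={lift:.3f}")
```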


Example:
Consider a retail dataset where each transaction contains a list of items purchased by a customer. The goal is to identify association rules to
understand purchasing patterns. Here's an example of an association rule:

● Rule: {Diapers}→ {Baby Food}


● Interpretation: If a customer buys diapers, they are likely to buy baby food as well.
● Support: Percentage of transactions containing both diapers and baby food.
● Confidence: Probability of buying baby food given that diapers are purchased.

Algorithms:
Popular algorithms used for association rule learning include:

● Apriori Algorithm: Generates frequent itemsets and association rules based on support and confidence thresholds.
● FP-Growth (Frequent Pattern Growth): Constructs a compact data structure (FP-tree) to efficiently mine frequent itemsets.

Applications:
● Market Basket Analysis: Identifying purchasing patterns in retail datasets.
● Recommender Systems: Generating personalized recommendations based on user-item interactions.
● Healthcare: Discovering associations between medical conditions or treatments.


Apriori Algorithm in Machine Learning


The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on the databases that contain
transactions. With the help of these association rule, it determines how strongly or how weakly two objects are connected. This algorithm
uses a breadth-first search and Hash Tree to calculate the itemset associations efficiently. It is the iterative process for finding the frequent
itemsets from the large dataset.

What is Frequent Itemset?

Frequent itemsets are those items whose support is greater than the threshold value or user-specified minimum support. It means if A & B are
the frequent itemsets together, then individually A and B should also be the frequent itemset.

Suppose there are the two transactions: A= {1,2,3,4,5}, and B= {2,3,7}, in these two transactions, 2 and 3 are the frequent itemsets.

Steps for Apriori Algorithm

Below are the steps for the apriori algorithm:

Step-1: Determine the support of itemsets in the transactional database, and select the minimum support and confidence.

Step-2: Take all the itemsets in the transactions that have a support value higher than the minimum (selected) support value.

Step-3: Find all the rules of these subsets that have higher confidence value than the threshold or minimum confidence.
Step-4: Sort the rules in decreasing order of lift.

Apriori Algorithm Working

We will understand the apriori algorithm using an example and mathematical calculation:

Example: Suppose we have the following dataset that has various transactions, and from this dataset, we need to find the frequent
itemsets and generate the association rules using the Apriori algorithm:


Apriori Algorithm Working

Solution:

Step-1: Calculating C1 and L1:

● In the first step, we will create a table that contains


support count (The frequency of each itemset
individually in the dataset) of each itemset in the given
dataset. This table is called the Candidate set or C1.
● Now, we will take out all the itemsets that have a support count greater than or equal to the Minimum Support (2). This will give us the table for the frequent itemset L1.
● Since all the itemsets except E have a support count greater than or equal to the minimum support, the E itemset is removed.


Step-2: Candidate Generation C2, and L2:

• In this step, we will generate C2 with the help of L1. In C2, we will create the pairs of the itemsets of L1 in the form of subsets.
• After creating the subsets, we will again find the support count from the main transaction table of the dataset, i.e., how many times these pairs have occurred together in the given dataset. This gives us the table for C2.
• Again, we need to compare the C2 Support count with
the minimum support count, and after comparing, the
itemset with less support count will be eliminated from
the table C2. It will give us the below table for L2


Step-3: Candidate generation C3, and L3:

• For C3, we will repeat the same two processes, but now we will form the C3 table with subsets of three itemsets together, and will calculate the support count from the dataset. It will give the table below:
• Now we will create the L3 table. As we can see from
the above C3 table, there is only one combination of
itemset that has support count equal to the minimum
support count.
• So, the L3 will have only one combination, i.e., {A, B,
C}.


Step-4: Finding the association rules for the subsets:

To generate the association rules, first, we will create a new table with the possible rules from the frequent combination {A, B, C}. For each rule X → Y, we will calculate the Confidence using the formula sup(X ∪ Y) / sup(X). After calculating the confidence value for all rules, we will exclude the rules that have less confidence than the minimum threshold (50%).

Consider the below table:

As the given threshold or minimum confidence is 50%, the first three rules A∧B → C, B∧C → A, and A∧C → B can be considered as the strong association rules for the given problem.
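The generate-candidates / prune-by-support loop used above can be written compactly in plain Python. The transactions below are hypothetical (the slide's transaction table is shown only as a figure), so the itemsets printed here will not match the worked numbers; the minimum support count of 2 mirrors the example.

```python
# Hypothetical transaction database (illustrative only); minimum support is a count, as in the example
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C", "D"}]
min_support = 2

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support_count({i}) >= min_support]
all_frequent = list(frequent)

k = 2
while frequent:
    # Ck: candidate k-itemsets built by joining (k-1)-itemsets, then pruned by support
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support_count(c) >= min_support]
    all_frequent.extend(frequent)
    k += 1

for itemset in all_frequent:
    print(sorted(itemset), support_count(itemset))
```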


Disadvantages of Apriori Algorithm


The two primary drawbacks of the Apriori Algorithm are:
1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly scan the database.


What is FP Growth Algorithm?

● The FP-Growth Algorithm is an alternative way to find frequent item sets without candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information.

This algorithm works as follows:

a. First, it compresses the input database creating an FP-tree instance to represent frequent items.
b. After this first step, it divides the compressed database into a set of conditional databases, each associated with one frequent pattern.
c. Finally, each such database is mined separately.
● Using this strategy, the FP-Growth reduces the search costs by recursively looking for short patterns and then concatenating them into the long
frequent patterns.
● In large databases, holding the FP tree in the main memory is impossible. A strategy to cope with this problem is to partition the database into a set
of smaller databases (called projected databases) and then construct an FP-tree from each of these smaller databases.


FP-Tree (frequent-pattern tree)

• A compact data structure that stores quantitative information about frequent patterns in a database.
• Each transaction is read and then mapped onto a path in the FP-tree. This is done until all transactions have been read.
• Different transactions with common subsets allow the tree to remain compact because their paths overlap.
• A frequent Pattern Tree is made with the initial item sets of the database.
• The purpose of the FP tree is to mine the most frequent pattern.
• Each node of the FP tree represents an item of the item set.
• The root node represents null, while the lower nodes represent the item sets.
• The associations of the nodes with the lower nodes, that is, the item sets with the other item sets, are maintained while forming
the tree.
• Han defines the FP-tree as the tree structure given below:


FP-Tree (frequent-pattern tree)

Han defines the FP-tree as the tree structure given:

• One root is labelled as "null" with a set of item-prefix subtrees as children and a frequent-item-header table.
• Each node in the item-prefix subtree consists of three fields:
• Item-name: registers which item is represented by the node;
• Count: the number of transactions represented by the portion of the path reaching the node;
• Node-link: links to the next node in the FP-tree carrying the same item name or null if there is none.
• Each entry in the frequent-item-header table consists of two fields:
• Item-name: the same as in the node;
• Head of node-link: a pointer to the first node in the FP-tree carrying the item name.
• Additionally, the frequent-item-header table can hold the support count for each item.
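A minimal Python sketch of this node structure (field names are chosen to mirror Han's description; the parent and children fields are implementation conveniences, and this is not code from any particular library):

from dataclasses import dataclass, field
from typing import Optional, Dict

@dataclass
class FPNode:
    # One node of an item-prefix subtree
    item: Optional[str]                    # item-name (None for the root labelled "null")
    count: int = 0                         # transactions represented by the path reaching this node
    parent: Optional["FPNode"] = None      # added for upward traversal when mining prefix paths
    node_link: Optional["FPNode"] = None   # next node in the FP-tree carrying the same item-name
    children: Dict[str, "FPNode"] = field(default_factory=dict)

# Frequent-item-header table: item-name -> head of its node-link chain
header_table: Dict[str, FPNode] = {}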


Best case: this occurs when all transactions have the same itemset; the size of the FP-tree will be only a single branch of nodes.

Worst case: this occurs when every transaction has a unique item set. The space needed to store the tree is then greater than the space used to store the original data set, because the FP-tree requires additional space to store pointers between nodes and the counters for each item. The tree's complexity grows with each transaction's uniqueness.

Algorithm 1: FP-tree construction

Input: A transaction database DB and a minimum support threshold.

Output: FP-tree, the frequent-pattern tree of DB.

Method: The FP-tree is constructed as follows.

1. The first step is to scan the database to find the occurrences of the itemsets in the database. This step is the same as the first step of Apriori. The
count of 1-itemsets in the database is called support count or frequency of 1-itemset.
2. The second step is to construct the FP tree. For this, create the root of the tree. The root is represented by null.
3. The next step is to scan the database again and examine the transactions. Examine the first transaction and find out the itemset in it. The itemset
with the max count is taken at the top, and then the next itemset with the lower count. It means that the branch of the tree is constructed with
transaction itemsets in descending order of count.
4. The next transaction in the database is examined. Its items are ordered in descending order of count. If the leading items of this transaction are already present in another branch, then this transaction's branch shares that common prefix starting from the root. This means that the common items are linked to new nodes created for the remaining items of this transaction.
5. Also, the count of the itemset is incremented as it occurs in the transactions. The common node and new node count are increased by 1 as they are
created and linked according to transactions.
6. The next step is to mine the created FP Tree. For this, the lowest node is examined first, along with the links of the lowest nodes. The lowest node
represents the frequency pattern length 1. From this, traverse the path in the FP Tree. This path or paths is called a conditional pattern base.
A conditional pattern base is a sub-database consisting of prefix paths in the FP tree occurring with the lowest node (suffix).
7. Construct a Conditional FP Tree, formed by a count of itemsets in the path. The itemsets meeting the threshold support are considered in the
Conditional FP Tree.

8. Frequent Patterns are generated from the Conditional FP Tree.

Using this algorithm, the FP-tree is constructed in two database scans. The first scan collects and sorts the set of frequent items, and the second constructs the FP-tree.

FP Growth Example

The given data in the table below is a hypothetical dataset of transactions with each letter representing an item. Minimum support = 3.

Step-1: The frequency of each individual item is computed.

Step-2: The Frequent Pattern set is built, which will contain all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies. After insertion of the relevant items, the set L looks like this:

Step-3: Now, for each transaction, the respective Ordered-Item set is built. It is done by iterating over the Frequent Pattern set and checking if the current item is contained in the transaction in question. If the current item is contained, the item is inserted in the Ordered-Item set for the current transaction. The following table is built for all the transactions:


Step-4: Now, all the Ordered-Item sets are inserted into a Trie data structure.

a) Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after the other in the order of occurrence in the set, and the support count for each item is initialized as 1.

b) Inserting the set {K, E, O, Y}:
Till the insertion of the elements K and E, simply the support count is increased by 1. On inserting O we can see that there is no direct link between E and O, therefore a new node for the item O is initialized with the support count as 1 and item E is linked to this new node. On inserting Y, we first initialize a new node for the item Y with support count as 1 and link the new node of O with the new node of Y.


c) Inserting the set {K, E, M}:
Here simply the support count of each element is increased by 1.

d) Inserting the set {K, M, Y}:
Similar to step b), first the support count of K is increased, then new nodes for M and Y are initialized and linked accordingly.


e) Inserting the set {K, E, O}:
Here simply the support counts of the respective elements are increased. Note that the support count of the new node of item O is increased.

Step-5: Now, for each item, the Conditional Pattern Base is computed, which is the set of path labels of all the paths that lead to any node of the given item in the frequent-pattern tree. Note that the items in the below table are arranged in the ascending order of their frequencies.


Step-6: Now, for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set of elements that is common to all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.

Step-7: From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the below table.

Step-8: For each row, two types of association rules can be inferred; for example, for the first row, which contains the element Y, the rules K -> Y and Y -> K can be inferred. To determine the valid rule, the confidence of both rules is calculated, and the one with confidence greater than or equal to the minimum confidence value is retained.
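In practice, FP-Growth is usually run through a library rather than built by hand. The sketch below assumes the third-party pandas and mlxtend packages are installed; the transactions are the five ordered-item sets from the example above, the minimum support of 3 out of 5 transactions becomes 0.6, and the confidence threshold is chosen only for illustration.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ['K', 'E', 'M', 'O', 'Y'],
    ['K', 'E', 'O', 'Y'],
    ['K', 'E', 'M'],
    ['K', 'M', 'Y'],
    ['K', 'E', 'O'],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets with FP-Growth (minimum support = 3 of 5 transactions)
frequent_itemsets = fpgrowth(onehot, min_support=0.6, use_colnames=True)

# Generate association rules and keep those meeting the chosen confidence threshold
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.8)
print(frequent_itemsets)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])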


Dimensionality Reduction
Dimensionality reduction is a technique used in machine learning to reduce the number of input variables or dimensions
in a dataset while retaining its most relevant information.

• It is often used to simplify complex data and improve the accuracy and efficiency of models. Some common
methods of dimensionality reduction include Principal Component Analysis (PCA), Linear Discriminant Analysis
(LDA), and t-SNE (t-Distributed Stochastic Neighbor Embedding).
• These techniques work by identifying patterns and relationships within the data and projecting them onto a
lower-dimensional space.

Importance of Dimensionality Reduction

• It reduces the time and storage space required.


• It helps remove multi-collinearity which improves the interpretation of the parameters of the machine learning
model.
• It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D.
• It avoids the curse of dimensionality.
• It leads to better human interpretations and less computational cost with simplification of models.
• Enhanced model performance and reduced computational cost


Dimensionality Reduction
Key Concepts and Techniques:

1. Curse of Dimensionality:
○ High-dimensional datasets often suffer from the curse of dimensionality, leading to increased
computational complexity and potential overfitting.
2. Motivation:
○ Dimensionality reduction is motivated by the desire to overcome challenges associated with
high-dimensional data, such as computational inefficiency, increased storage requirements, and
the potential for noise or irrelevant features.
3. Two Approaches:
○ Feature Selection: Select a subset of the original features.
○ Feature Extraction: Create new features that are combinations of the original features.


Dimensionality Reduction
Benefits of Dimensionality Reduction:
1. Improved Computational Efficiency:
○ Reduced dimensionality leads to faster training and inference times for machine
learning models.
2. Enhanced Visualization:
○ Lower-dimensional representations are easier to visualize, aiding in data exploration
and interpretation.
3. Noise Reduction:
○ Eliminating irrelevant or noisy features can improve model generalization and
performance.
4. Memory Savings:
○ Storing and processing lower-dimensional data requires less memory and
computational resources.
5. Improved Model Performance:
○ Dimensionality reduction can lead to better model performance, particularly when
dealing with overfitting in high-dimensional spaces.


Dimensionality Reduction
Challenges:
1. Information Loss:
○ Dimensionality reduction may result in some loss of information, and the challenge is to
minimize this loss.
2. Selection of Techniques:
○ Choosing the most appropriate technique depends on the nature of the data and the specific goals
of the analysis.
3. Interpretability:
○ Reduced-dimensional representations may be more challenging to interpret, especially in
non-linear techniques.


Principal Component Analysis (PCA)


Principal Component Analysis (PCA) is a linear dimensionality reduction technique widely used in data analysis and machine learning. It aims to
transform high-dimensional data into a lower-dimensional space while preserving as much of the original variability as possible.

Key Concepts:

1. Variance and Covariance:


○ PCA focuses on capturing the variance in the data. It achieves this by identifying the principal components, which are the directions in
the data space along which the data varies the most.
○ Principal components are orthogonal to each other, meaning they are uncorrelated.
2. Principal Components:
○ The first principal component (PC1) explains the maximum variance in the data. Subsequent components (PC2, PC3, etc.) capture the
remaining variance in decreasing order.
○ Each principal component is a linear combination of the original features.
3. Eigenvalues and Eigenvectors:
○ PCA involves computing the eigenvalues and eigenvectors of the covariance matrix of the original data.
○ Eigenvalues represent the amount of variance captured by each principal component.
○ Eigenvectors represent the direction of maximum variance, corresponding to the principal components.
4. Projection:
○ The original data is projected onto the principal components, creating a new set of uncorrelated features that retain most of the
variability in the data.
○ The dimensionality is reduced by retaining only the top-k principal components.
5. Explained Variance:
○ The ratio of the sum of the eigenvalues for the first k principal components to the total sum of eigenvalues represents the proportion of
variance retained.

Principal Component Analysis (PCA)


Steps in PCA:
1. Standardization:
○ Standardize the features to have zero mean and unit variance. This is important to ensure that
features with larger scales do not dominate the PCA process.
2. Covariance Matrix:
○ Calculate the covariance matrix of the standardized data.
3. Eigenvalue Decomposition:
○ Compute the eigenvalues and eigenvectors of the covariance matrix.
4. Selection of Principal Components:
○ Sort the eigenvalues in decreasing order and choose the top-k eigenvectors corresponding to the
k largest eigenvalues to form the principal components.
5. Projection:
○ Project the original data onto the selected principal components to obtain the lower-dimensional
representation.
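These steps can be written directly with NumPy. The sketch below is a minimal, from-scratch illustration of the procedure; the random input data is only a placeholder.

import numpy as np

def pca(X, k):
    # Step 1: standardization (zero mean, unit variance per feature)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the standardized data
    cov = np.cov(X_std, rowvar=False)
    # Step 3: eigenvalue decomposition (eigh, since the covariance matrix is symmetric)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: sort eigenvalues in decreasing order and keep the top-k eigenvectors
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]
    explained_ratio = eigvals[:k] / eigvals.sum()
    # Step 5: project the standardized data onto the selected principal components
    return X_std @ components, explained_ratio

X = np.random.RandomState(0).rand(10, 4)   # placeholder data: 10 samples, 4 features
Z, ratio = pca(X, k=2)
print(Z.shape, ratio)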


Applications of PCA:
1. Dimensionality Reduction:
○ Reducing the number of features in high-dimensional datasets.
2. Data Visualization:
○ Visualizing data in lower-dimensional space for exploration and
interpretation.
3. Noise Reduction:
○ Identifying and retaining the most relevant features while reducing the
impact of noise.
4. Feature Engineering:
○ Creating new features that capture the most important patterns in the
data.
5. Compression:
○ Compressing data while preserving its essential characteristics.

Challenges and Considerations in PCA:


1. Linearity:
○ PCA assumes linear relationships between features. Non-linear
relationships may not be accurately captured.
2. Interpretability:
○ Interpreting the principal components may be challenging, especially
when they are linear combinations of many original features.
3. Standardization:
○ PCA is sensitive to the scale of the features, so standardization is
essential.



EXAMPLES- DIMENSIONALITY REDUCTION USING PCA

Suppose you have a dataset containing information about customers of a retail store. The dataset has multiple features such as age, income, spending score, etc. You want to reduce the dimensionality of this dataset using Principal Component Analysis (PCA). Given the following dataset:

| Customer ID | Age | Income ($) | Spending Score |
|-------------|-----|------------|----------------|
| 1           | 21  | 15000      | 75             |
| 2           | 30  | 30000      | 50             |
| 3           | 35  | 40000      | 60             |
| 4           | 40  | 60000      | 40             |
| 5           | 50  | 80000      | 90             |

Using PCA, reduce the dimensionality of this dataset to two principal components.
EXAMPLES- DIMENSIONALITY REDUCTION USING PCA
Step 1: Standardize the Data

● Standardize the dataset so that each feature has a mean of 0 and a standard deviation of 1. This step is important because PCA is
sensitive to the scale of the features.

Step 2: Calculate the Covariance Matrix

● Calculate the covariance matrix of the standardized dataset. The covariance matrix represents the relationships between the different
features.

Step 3: Calculate the Eigenvectors and Eigenvalues

● Compute the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors represent the directions of the new feature space,
while eigenvalues represent the magnitude of the variance in those directions.

Step 4: Select Principal Components

● Sort the eigenvalues in descending order and choose the top k eigenvectors corresponding to the k largest eigenvalues. These
eigenvectors will form the new basis for the reduced feature space.

Step 5: Project Data onto Principal Components

● Project the standardized dataset onto the selected principal components to obtain the reduced-dimensional dataset.
EXAMPLES- DIMENSIONALITY REDUCTION USING PCA

Let's assume that after performing PCA, we find the following eigenvalues and eigenvectors:

Eigenvalues:
[0.8, 0.5, 0.3, 0.1]

Eigenvectors:
[[0.5, -0.4, 0.7, 0.2],
 [-0.2, 0.8, 0.1, -0.5],
 [-0.6, -0.3, 0.4, 0.6],
 [0.5, 0.2, 0.5, -0.6]]

We select the first two eigenvectors and project the standardized dataset onto these components.
EXAMPLES- DIMENSIONALITY REDUCTION USING PCA

The reduced-dimensional dataset would be:

| Customer ID | PC1  | PC2  |
|-------------|------|------|
| 1           | 1.2  | 0.3  |
| 2           | -0.5 | -0.9 |
| 3           | -0.3 | 0.1  |
| 4           | -0.9 | 0.6  |
| 5           | 0.5  | 0.8  |

This reduced-dimensional dataset contains the principal components PC1 and PC2, which capture
most of the variance in the original dataset. The reduced dataset can be used for further analysis or
modeling while retaining the essential information from the original dataset.
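The same workflow can be run with scikit-learn on the customer features above (Customer ID is only an identifier, not a feature). This is a sketch: the numeric scores it produces will not match the purely illustrative PC1/PC2 table, which was assumed rather than computed.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Age, Income ($), Spending Score for the five customers
X = np.array([
    [21, 15000, 75],
    [30, 30000, 50],
    [35, 40000, 60],
    [40, 60000, 40],
    [50, 80000, 90],
])

X_std = StandardScaler().fit_transform(X)   # Step 1: standardize
pca = PCA(n_components=2)                   # Steps 2-4 handled internally
X_reduced = pca.fit_transform(X_std)        # Step 5: project onto PC1 and PC2

print(X_reduced)                            # 5 x 2 matrix of principal-component scores
print(pca.explained_variance_ratio_)        # proportion of variance retained by PC1 and PC2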

Ensemble Learning

• Introduction to Ensemble Learning
• Bagging and Boosting; Voting Classifier


Introduction to Ensemble Learning


● Ensemble Learning is a machine learning paradigm
where multiple models are combined to make predictions
or decisions.
● The idea is to leverage the strengths of diverse models
and overcome individual weaknesses, resulting in a more
accurate and robust overall prediction.
● Individual Models: These are the base models used in
the ensemble, each trained on a subset of the data or with
different parameters.
● Ensemble Model: The predictions from individual
models are combined or aggregated to produce a final
prediction that often outperforms any single model.

Courtesy: Bisunzu Mining company (SMB Bisunzu) _ société minière de Bisunzu


(smb-sarl.com)


Importance of combining multiple models


● Diversity: Ensemble methods benefit from using diverse models that have different
strengths and weaknesses, capturing a broader range of patterns in the data.
● Reduction of Overfitting: Combining multiple models helps mitigate overfitting, as errors
or biases in individual models are likely to cancel out when aggregated.
● Increased Stability: Ensemble methods provide more stable predictions, as they are less
sensitive to variations in the training data or specific model choices.
● Improved Generalization: By combining different algorithms or variations of the same
algorithm, ensemble learning can improve generalization performance on unseen data.
● Increased accuracy and robustness
● Overcoming bias and variance


Increased accuracy and robustness


● Consistent Performance: Ensemble models often achieve higher accuracy compared to
individual models, ensuring more consistent and reliable predictions.
● Robustness to Noise: Ensembles can filter out noise and outliers present in the data,
leading to more robust predictions in real-world scenarios.
● Better Handling of Complex Relationships: In cases where the relationship between
features and the target variable is complex, ensembles excel by capturing diverse aspects of
the underlying patterns.

Example: Random Forest, a popular ensemble method, combines multiple decision trees to

achieve higher accuracy and robustness in classification and regression tasks.


Overcoming Bias and variance


Bias Reduction:
● Diverse Models: Ensemble learning involves training multiple models, each with its own strengths,
weaknesses, and biases. By combining models that capture different aspects of the data, ensemble
methods reduce bias.
● Voting or Averaging: Ensemble methods often use techniques like voting or averaging to combine
predictions from individual models. This helps in achieving a more balanced and unbiased prediction
by considering the collective wisdom of diverse models.
Variance Reduction:
● Aggregation of Predictions: Ensemble learning aggregates predictions from multiple models,
leading to a reduction in the variance of the overall prediction. This is particularly beneficial when
individual models may overfit to specific patterns in the data.
● Error Compensation: Models in an ensemble may make errors on different subsets of the data. By
combining their predictions, ensemble methods compensate for individual model weaknesses,
resulting in a more stable and robust prediction.


Ensemble Learning Techniques


Ensemble learning can be performed in the following ways:
• Bagging (Bootstrap Aggregation)
  • Random Forest
• Boosting
  • AdaBoost
  • Gradient Boosting
  • XGBoost
• Stacking


Bagging (Bootstrap Aggregating)


Definition

● An ensemble machine learning technique


designed to improve the stability and accuracy of
predictive models, especially those that are prone
to overfitting.
● It involves training multiple instances of a base
learning algorithm on different subsets of the
training data, and then combining their
predictions to make a final, more robust
prediction.

Purpose

● Bagging helps reduce overfitting and variance, making the ensemble model more reliable and effective.

Courtesy: bagging machine learning explained - Vickey Lay (leomundoblog.blogspot.com)

How it works: Training multiple models on different subsets of the dataset


● Training Data: The original dataset is used for training the ensemble models.
● Bootstrap Sampling: Random subsets of the training data are created through bootstrap sampling,
allowing some instances to be repeated while others may be omitted.
● Base Models (1, 2, ..., N): Multiple base models (e.g., decision trees) are trained independently on
each bootstrap sample.
● Aggregation: The predictions from individual models are combined through methods like voting (for
classification) or averaging (for regression).
● Final Prediction: The aggregated predictions result in a final prediction that is often more robust and
accurate than predictions from individual models.

Example algorithms: Random Forest

● Popular algorithms like Random Forest use bagging to create a robust and accurate ensemble. It
utilizes an ensemble of decision trees trained on bootstrapped samples to achieve high predictive
accuracy and robustness.
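A short scikit-learn sketch of both ideas (dataset and hyperparameters are only illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Plain bagging: each base learner (a decision tree by default) is trained on a bootstrap sample
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Random Forest: bagging of decision trees plus random feature selection at each split
forest = RandomForestClassifier(n_estimators=50, random_state=0)

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())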


Advantages of Bagging:
● Variance Reduction: Bagging helps reduce the variance of the model by averaging or voting over
multiple independently trained models, thus making the final model more robust.
● Overfitting Mitigation: Since base models are trained on different subsets of the data, the ensemble is
less likely to overfit the training data compared to individual models.
● Improved Generalization: Bagging often leads to better generalization performance on unseen data,
enhancing the overall predictive power of the ensemble.


Boosting
● "Boosting" refers to an ensemble machine learning technique that combines
the predictions of multiple weak models to create a strong model.
● The primary goal of boosting is to improve the accuracy and generalization
of the model.

Example algorithms: AdaBoost, Gradient Boosting (GBM), XGBoost, and


LightGBM, each with its own variations and improvements. These algorithms
are widely used in various domains, such as finance, healthcare, and computer
vision.


How it works:
● Combining weak learners into a strong
learner
● In boosting, a series of weak learners
(models that perform slightly better
than random chance) are trained
sequentially.
● Each weak learner focuses on the
mistakes made by its predecessors,
with more emphasis on the
misclassified data points.
● The final model is a weighted sum of the individual weak learners, where the weights are assigned based on the performance of each learner.

Courtesy: https://towardsdatascience.com/boosting-algorithms-explained-d38f56ef3f30


How it works:
Step 1: The base algorithm reads
the data and assigns equal weight to
each sample observation.

Step 2: False predictions made by


the base learner are identified. In the
next iteration, these false predictions
are assigned to the next base learner
with a higher weightage on these
incorrect predictions.

Step 3: Repeat step 2 until the algorithm can correctly classify the output.

Courtesy: https://www.edureka.co/blog/boosting-machine-learning/

WHY IS BOOSTING USED?


● Let’s suppose that, given a data set containing images of cats and dogs, you were asked to build a model that can classify these images into two separate classes.


WHY IS BOOSTING USED?


▪ All these rules help us identify whether an image is a dog or a cat; however, if we were to classify an image based on an individual (single) rule, the prediction would be flawed. Each of these rules, individually, is called a weak learner because it is not strong enough on its own to classify an image as a cat or dog.

▪ Therefore, to make sure that our prediction is more accurate, we can combine the predictions from each of these weak learners by using the majority rule or a weighted average. This makes a strong learner model.


TYPES OF BOOSTING
There are three main ways through which boosting can be carried out:

▪Adaptive Boosting or AdaBoost


▪Gradient Boosting
▪XGBoost


ADAPTIVE BOOSTING


One of the most popular boosting algorithms is AdaBoost (Adaptive Boosting). Here's a simplified
overview of how AdaBoost works:
1. Initialization: Assign equal weights to all training samples.
2. Build Weak Model: Train a weak learner (e.g., a decision tree) on the data, giving more weight
to misclassified samples.
3. Compute Error: Calculate the error of the weak learner, considering the weighted samples.
4. Compute Weight: Assign a weight to the weak learner based on its performance (lower error
results in a higher weight).
5. Update Weights: Adjust the weights of the training samples, giving higher weight to
misclassified samples.
6. Repeat: Repeat steps 2-5 for a predetermined number of iterations or until a certain level of
accuracy is reached.
7. Final Model: Combine the weak learners into a strong model by assigning weights to their
predictions.
Boosting algorithms are effective in improving predictive performance and handling complex
relationships within the data. However, they can be sensitive to noisy data and outliers.
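The iterative procedure above is available ready-made in scikit-learn as AdaBoostClassifier; a minimal usage sketch (dataset and settings are placeholders):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 100 sequentially trained weak learners (shallow decision trees by default),
# each focusing more on the samples the previous ones misclassified
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())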


GRADIENT BOOSTING
In Gradient Boosting, base learners are generated sequentially in such a way that the
present base learner is always more effective than the previous one.

Each new learner optimizes the loss of the ensemble built so far. Gradient Boosting therefore has three components:

▪ A loss function that needs to be optimized.
▪ A weak learner for computing predictions and forming strong learners.
▪ An additive model that adds weak learners sequentially to minimize the loss function.
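An illustrative scikit-learn sketch of gradient boosting (hyperparameters are placeholders):

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each new tree is fitted to the gradient of the loss of the current ensemble,
# so every added learner corrects the errors of the previous ones
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
print(cross_val_score(gbm, X, y, cv=5).mean())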


XGBOOST
XGBoost is an advanced version of the gradient boosting method; it literally means eXtreme Gradient Boosting. The main aim of this algorithm is to increase the speed and efficiency of computation.
The main features provided by XGBoost are:
▪ Creates decision trees in parallel.
▪ Implements distributed computing methods for evaluating large and complex models.
▪ Uses out-of-core computing to analyze huge datasets.
▪ Implements cache optimization to make the best use of resources.
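A usage sketch, assuming the separate xgboost package is installed; its estimators follow the scikit-learn interface, so they can be cross-validated the same way:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)

# n_jobs=-1 uses all available cores for parallel tree construction
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, n_jobs=-1)
print(cross_val_score(xgb, X, y, cv=5).mean())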


STACKING
Stacking, short for "stacked generalization," is an ensemble learning technique that involves training multiple
models to combine their predictions. Instead of relying on a single model, stacking leverages the strengths of
diverse models to achieve improved predictive performance. The process typically involves the following steps:

1. Base Models:
○ Train multiple base models using the same dataset but with different algorithms or configurations.
2. Meta-Model (or Blender):
○ Introduce a meta-model, often referred to as a blender or meta-learner, which takes the predictions from
the base models as input features.
3. Training the Meta-Model:
○ Train the meta-model using the predictions from the base models as input features and the actual target
values. This meta-model learns to combine the predictions of the base models in an optimal way.
4. Predictions:
○ When making predictions on new data, the base models generate individual predictions, which are then
fed into the trained meta-model to produce the final prediction.
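A minimal scikit-learn sketch of these steps (the base models and meta-model are chosen only for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Level-0 base models trained on the same data
base_models = [
    ('dt', DecisionTreeClassifier(random_state=0)),
    ('svm', SVC(probability=True, random_state=0)),
]

# Level-1 meta-model (blender) learns how to combine the base predictions
stack = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression(max_iter=1000))
print(cross_val_score(stack, X, y, cv=5).mean())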


STACKING
Advantages of Stacking:

● Improved Predictive Performance: Stacking often leads to more accurate predictions compared to
individual base models.
● Model Diversity: The use of diverse base models contributes to the overall robustness and effectiveness of
the ensemble.
● Adaptability: Stacking can accommodate a variety of machine learning algorithms as base models, allowing
for flexibility in model selection.

Considerations:

● Complexity: Stacking introduces an additional layer of complexity, and careful tuning is necessary to
prevent overfitting.
● Data Size: Stacking may not provide significant benefits with small datasets, and the increased
computational cost might outweigh the advantages.


STACKING
Implementation:

● Common machine learning libraries, such as scikit-learn in Python, offer tools and functions to implement
stacking. These libraries provide easy-to-use interfaces for creating and training ensemble models.

Example:

Consider a stacking scenario in a classification problem where base models include a decision tree, a support
vector machine, and a neural network. The meta-model, often a logistic regression or another simple model,
learns to combine the predictions of these base models to achieve a more accurate and robust final prediction.


BOOSTING VS BAGGING
Boosting or Sequential ensemble
Here, the weak learners are sequentially
produced during the training phase. The
performance of the model is improved by
assigning a higher weightage to the
previous, incorrectly classified samples. An
example of boosting is the AdaBoost
algorithm.
Bagging or Parallel ensemble
Here, the weak learners are produced in parallel during the training phase. The performance of the model can be increased by training a number of weak learners in parallel on bootstrapped data sets. An example of bagging is the Random Forest algorithm.

Courtesy: https://towardsdatascience.com/ensemble-learning-bagging-boosting-3098079e5422

Courtesy: https://www.analyticsvidhya.com/blog/2023/01/ensemble-learning-methods-bagging-boosting-and-stacking/


Voting Classifier
• A Voting Classifier is an ensemble learning method that combines the predictions of multiple base
models to make a final prediction.
• Aggregating predictions of classification models to create a better classifier by predicting the class
with the most votes is known as a voting classifier.
• This approach leverages the wisdom of diverse models to achieve better overall performance than
individual models.

Courtesy: Ensemble Methods in Machine Learning: Examples - Analytics Yogi (vitalflux.com)

Voting Classifier
There are two main types of voting in a classifier: hard voting and soft voting.

1. Hard Voting: Here, each base model in the ensemble independently predicts the class, and the final prediction is
determined by a majority vote.

● Implementation:
○ Each base model "votes" for a class.
○ The class with the majority of votes is chosen as the final prediction.
● Use Case:
○ Appropriate when individual models are diverse and may excel in different aspects.

2. Soft Voting: Here, each base model assigns a probability to each class, and the final prediction is based on the
weighted average of these probabilities.

● Implementation:
○ Weighted average of predicted probabilities for each class.
○ The class with the highest average probability is selected as the final prediction.
● Use Case:
○ Effective when models provide probability estimates, allowing for a more nuanced decision.

Explore more at: Hard voting and Soft Voting - emanlee - 博客园 (cnblogs.com)
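In scikit-learn the two modes differ only in the voting argument; a small sketch (a full runnable example follows later in this section):

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

estimators = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('svm', SVC(probability=True)),   # probability=True lets the SVM supply class probabilities
    ('dt', DecisionTreeClassifier()),
]

hard_clf = VotingClassifier(estimators=estimators, voting='hard')  # majority vote on predicted labels
soft_clf = VotingClassifier(estimators=estimators, voting='soft')  # average of predicted class probabilities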

Advantages of Voting Classifier:


1. Improved Accuracy:
○ By combining predictions from multiple models, the ensemble often outperforms individual
models, leading to higher accuracy.
2. Robustness:
○ The ensemble is more robust to outliers or errors in individual models, as these tend to get
mitigated by the collective decision-making process.
3. Diversity:
○ Utilizes diverse models, each with its strengths and weaknesses, leading to a more
comprehensive understanding of the data.
4. Versatility:
○ Can be applied to a wide range of machine learning tasks, including classification and
regression.


Common Algorithms Used in Voting Classifiers:


1. Decision Trees:
○ Simple decision trees can serve as base models, contributing to the diversity of the ensemble.
2. Support Vector Machines (SVM):
○ SVMs with different kernels or parameters can be combined to form a robust ensemble.
3. Logistic Regression:
○ Logistic regression models can be included in the ensemble for their simplicity and
interpretability.
4. K-Nearest Neighbors (KNN):
○ KNN models with different distances or neighbors can contribute to the diversity of the
ensemble.


Voting Classifier implementation in scikit-learn

from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

# Load a dataset (e.g., the Iris dataset)
iris = load_iris()

# Create base models
model1 = LogisticRegression(max_iter=1000)
model2 = SVC(probability=True)      # probability=True also allows soft voting
model3 = DecisionTreeClassifier()

# Create a voting classifier (hard voting: majority vote on predicted labels)
voting_clf = VotingClassifier(estimators=[('lr', model1), ('svm', model2), ('dt', model3)], voting='hard')

# Evaluate the ensemble with 5-fold cross-validation
scores = cross_val_score(voting_clf, iris.data, iris.target, cv=5)
print(scores.mean())

Courtesy: ML | Voting Classifier using Sklearn - GeeksforGeeks
Explore more at: Voting Classifier in Machine Learning | Aman Kharwal (thecleverprogrammer.com)

