Unit 5
AY 2024-2025
MIT School of Computing
Department of Computer Science & Engineering
Syllabus
UNIT – V Unsupervised Learning and Ensemble Learning
09 Hours
Overview of Unsupervised learning; Importance; Applications; Supervised Vs
Unsupervised Learning; Disadvantages; Types of unsupervised learning: Clustering &
Association; Clustering: Hierarchical clustering, K-means clustering and KNN clustering;
Association: Apriori Algorithm and FP Growth Algorithm;
Text Books:
1. “Artificial Intelligence: A Modern Approach” by Stuart Russell and Peter Norvig
2. “Machine Learning: A Probabilistic Perspective” by Kevin P. Murphy
Reference Books:
1. “Pattern Recognition and Machine Learning” by Christopher Bishop
2. “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
3. “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” by Aurelien Geron
MOOC/ Video Lectures available at:
1. http://videolectures.net/andrew_ng/
2. https://nptel.ac.in/courses/106105077
Courtesy: https://blog.autify.com/en/machine-learning-in-software-testing/
Let's have a look at what we have done so far and what we will do next.
Courtesy: https://www.favouriteblog.com/15-algorithms-machine-learning-engineers/
Courtesy: https://intellectualpoint.com/getting-started/how-to-launch-a-career-in-machine-learning/
Supervised vs. Unsupervised Learning
WHY IS IT IMPORTANT?
• Pattern Discovery:
• Unsupervised learning allows algorithms to identify hidden patterns, structures, and relationships within data without explicit
guidance from labeled examples. This is essential for understanding the underlying structure of complex datasets.
• Data Exploration and Preprocessing:
• Before engaging in supervised learning tasks, unsupervised learning can be used for exploratory data analysis. Techniques like
clustering and dimensionality reduction assist in understanding the data's inherent characteristics and reducing its complexity.
• Anomaly Detection:
• Unsupervised learning is effective in detecting anomalies or outliers in datasets. By learning the normal patterns within the
data, algorithms can identify instances that deviate from the norm, which is crucial in fields like fraud detection, cybersecurity,
and quality control.
• Clustering and Segmentation:
• Clustering techniques, such as K-Means, enable the grouping of similar data points into clusters. This is valuable for
segmentation in various domains, including customer segmentation in marketing, grouping genes in bioinformatics, or
categorizing news articles.
• Feature Learning:
• Unsupervised learning helps in automatically learning relevant features from raw data. Techniques like autoencoders and
restricted Boltzmann machines can discover informative representations of data, facilitating improved performance in
downstream tasks.
WHY IS IT IMPORTANT?
• Reducing Dimensionality:
• Unsupervised learning methods, like Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding
(t-SNE), aid in reducing the dimensionality of data. This is beneficial for visualization, computational efficiency, and mitigating
the curse of dimensionality.
• Generative Modeling:
• Unsupervised learning is used in generative modeling, where algorithms learn the underlying distribution of the data.
Generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), can then generate
new, realistic data samples.
• Self-Supervised Learning:
• Unsupervised learning is a foundation for self-supervised learning, where models are trained using the data itself to create
supervisory signals. This approach is increasingly popular in natural language processing and computer vision, leading to
improved performance on downstream tasks.
• Data Imputation and Denoising:
• In scenarios with missing or noisy data, unsupervised learning techniques can be employed for data imputation or denoising,
helping to recover missing or corrupted information.
• Advancing Artificial Intelligence:
• Unsupervised learning is fundamental to the broader field of artificial intelligence. It serves as a cornerstone for developing
more sophisticated algorithms and models that can adapt, generalize, and discover meaningful patterns in diverse datasets.
Key Characteristics:
1. No Labeled Output:
○ Unsupervised learning algorithms work with datasets where the desired output (labels or target values) is not
provided during training.
2. Exploratory Nature:
○ The primary goal is exploration and uncovering hidden patterns or structures within the data.
3. Clustering and Association:
○ Common tasks include clustering similar data points together and discovering associations or relationships among
variables.
4. Dimensionality Reduction:
○ Techniques like principal component analysis (PCA) are often used to reduce the dimensionality of the data,
focusing on the most informative features.
5. Anomaly Detection:
○ Unsupervised learning is employed in anomaly detection, where the algorithm identifies data instances that deviate
from the norm.
6. No Explicit Feedback:
○ Without explicit feedback, the algorithm must infer the underlying structure of the data based on inherent patterns.
Applications
1. Customer Segmentation:
○ Clustering customers based on purchasing behavior.
2. Recommendation Systems:
○ Identifying patterns in user preferences for personalized recommendations.
3. Anomaly Detection:
○ Detecting unusual behavior or outliers in network traffic, fraud detection, etc.
4. Image and Speech Recognition:
○ Learning representations of data without labeled examples.
5. Natural Language Processing (NLP):
○ Topic modeling and text clustering.
Challenges
1. Evaluation Metrics:
○ Assessing the performance of unsupervised learning models can be challenging due to the
absence of explicit labels.
2. Interpretability:
○ Understanding and interpreting the discovered patterns may be less straightforward compared to
supervised learning.
3. Data Preprocessing:
○ Handling missing data and outliers becomes critical in unsupervised learning.
https://technical-beast.blogspot.com/2020/05/Unsupervised-Machine-learning.html
k-means Clustering
• k-Means is a popular unsupervised machine learning algorithm used for
clustering, a process of grouping similar data points together.
• The algorithm aims to partition data into k clusters, where each cluster
represents a group of data points that are more similar to each other than to
those in other clusters.
https://www.lancaster.ac.uk/stor-i-student-sites/harini-jayaraman/k-means-clustering/
k-means Clustering
Steps in K-Means:
1. Initialization:
○ Randomly select k initial cluster centroids. These centroids act as the starting points for the
clusters.
2. Assignment:
○ Assign each data point to the cluster whose centroid is closest in terms of Euclidean distance.
This step creates initial clusters.
3. Update Centroids:
○ Recalculate the centroids of the clusters based on the mean of the data points assigned to each
cluster.
4. Repeat:
○ Repeat the assignment and centroid update steps iteratively until convergence. Convergence
occurs when the assignment of data points to clusters no longer changes significantly.
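The four steps above can be written directly in a few lines of NumPy. This is an illustrative from-scratch sketch (the data, k, and stopping rule are made up for demonstration; empty-cluster handling is omitted for brevity):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.2], [1.1, 0.9], [5.0, 5.2], [5.1, 4.8], [0.9, 1.1], [4.9, 5.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)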
k-means Clustering
Example Application: Customer Segmentation
Objective: Imagine you have customer data with features like age and annual income. You want to group customers
into distinct segments based on these features for targeted marketing.
Steps:
1. Initialization:
○ Randomly choose k initial centroids, where k is the number of desired clusters.
2. Assignment:
○ Assign each customer to the cluster whose centroid is closest based on features like age and income.
3. Update Centroids:
○ Recalculate the centroids for each cluster by taking the mean of the features of the data points in that cluster.
4. Repeat:
○ Iteratively repeat the assignment and centroid update steps until the clusters stabilize.
Results: After convergence, the algorithm has grouped customers into k clusters, each representing a segment of
customers with similar age and income characteristics.
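A sketch of the same workflow with scikit-learn's KMeans, using a small made-up table of (age, annual income) values:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [age, annual income in $1000s]
customers = np.array([[22, 25], [25, 30], [47, 80],
                      [52, 90], [23, 27], [51, 85]])

# KMeans carries out initialization, assignment, and centroid updates internally.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(customers)
print("Segment of each customer:", kmeans.labels_)
print("Segment centroids (age, income):")
print(kmeans.cluster_centers_)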
k-means Clustering
Key Considerations:
1. Choosing k:
○ The number of clusters (k) needs to be specified beforehand. Techniques like the elbow method
or silhouette analysis can help determine an optimal value for k (see the sketch after this list).
2. Sensitive to Initializations:
○ K-Means can be sensitive to the initial placement of centroids. Multiple runs with different
initializations may be performed to improve results.
3. Euclidean Distance:
○ The algorithm relies on Euclidean distance, making it suitable for numeric data but less effective
for categorical data or data with irregular shapes.
4. Scalability:
○ K-Means may struggle with large datasets or high-dimensional data due to its computational
complexity.
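The elbow method mentioned in the first consideration can be sketched by printing the within-cluster sum of squares (called inertia in scikit-learn) for a range of k values. The data below is synthetic and the range of k is arbitrary:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Inertia = sum of squared distances of points to their closest centroid.
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 9), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
# The k at which the inertia curve bends sharply (the "elbow") is a reasonable choice.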
k-means Clustering
Key Points in K-Means Clustering:
• Objective:
○ K-Means clustering aims to partition a dataset into k clusters, where each cluster represents a group of data
points that are more similar to each other than to those in other clusters.
• Unsupervised Learning:
○ K-Means is an unsupervised learning algorithm, meaning it doesn't require labeled target information during
training.
• Initialization:
○ The algorithm starts by randomly selecting k initial cluster centroids. The choice of initial centroids can
impact the final result.
• Assignment Step:
○ In the assignment step, each data point is assigned to the cluster whose centroid is closest. The most common
distance metric used is Euclidean distance.
• Centroid Update Step:
○ After assigning all data points, the algorithm updates the centroids of each cluster. The new centroids are
calculated as the mean of the data points in each cluster.
• Iteration:
○ The assignment and centroid update steps are repeated iteratively until convergence. Convergence occurs
when the assignment of data points to clusters no longer changes significantly.
k-means Clustering
Applications of K-Means:
k-means Clustering
Advantages of K-Means:
● Easy to implement: K-means clustering is an iterative and relatively simple algorithm. In fact, we
can also perform k-means clustering manually, as in the numerical example.
● Scalability: We can use k-means clustering for even 10 records or even 10 million records in a
dataset. It will give us results in both cases.
● Convergence: The k-means clustering algorithm is guaranteed to give us results. It guarantees
convergence. Thus, we will get the result of the execution of the algorithm for sure.
● Generalization: K-means clustering is not tied to a specific problem. From numerical data to text
documents, you can use the k-means clustering algorithm on any dataset to perform clustering. It can
also be applied to datasets of different sizes with entirely different distributions. Hence, this
algorithm is very general.
● Choice of centroids: You can warm-start the choice of centroids in an easy manner. Hence, the
algorithm allows you to choose and assign centroids that fit well with the dataset.
k-means Clustering
Limitations:
● Deciding the number of clusters: In k-means clustering, you need to decide the number of clusters
by using the elbow method.
● Choice of initial centroids: The number of iterations in the clustering process completely depends on
the choice of centroids. Hence, you need to properly choose the centroids in the initial step for
maximizing the efficiency of the algorithm.
● Effect of outliers: In the execution of the k-means clustering algorithm, we use all the points in a
cluster to determine the centroids for the next iteration. If there are outliers in the dataset, they highly
affect the position of the centroids. Due to this, the clustering becomes inaccurate. To avoid this, you
can try to identify outliers and remove them in the data cleaning process.
● Curse of Dimensionality: If the number of dimensions in the dataset increases, the distance of the
data points from a given point starts converging to a specific value. Due to this, k-means clustering
that calculates the clusters based on the distance between the points becomes inefficient. To
overcome this problem, you can use advanced clustering algorithms like spectral clustering.
Alternatively, you can also try to reduce the dimensionality of the dataset while data preprocessing.
Numerical example: consider the following six instances, each with two features X and Y:

Instance   X     Y
1          185   72
2          170   56
3          168   60
4          179   68
5          182   72
6          188   77

Iteration 1: Assuming the first two instances are chosen as the initial centroids, C1 = (185, 72) and C2 = (170, 56), we calculate the similarity of each instance to the centroids using the Euclidean distance measure and assign each instance to the nearest centroid:

Instance   X     Y    Assigned cluster
1          185   72   C1
2          170   56   C2
3          168   60   C2
4          179   68   C1
5          182   72   C1
6          188   77   C1
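These first-iteration assignments can be verified with a few lines of Python (again assuming instances 1 and 2 as the initial centroids C1 and C2):

import numpy as np

data = np.array([[185, 72], [170, 56], [168, 60],
                 [179, 68], [182, 72], [188, 77]], dtype=float)
c1, c2 = data[0], data[1]   # assumed initial centroids (instances 1 and 2)

for i, point in enumerate(data, start=1):
    d1 = np.linalg.norm(point - c1)   # Euclidean distance to C1
    d2 = np.linalg.norm(point - c2)   # Euclidean distance to C2
    print(f"Instance {i}: d(C1)={d1:.2f}, d(C2)={d2:.2f} -> {'C1' if d1 <= d2 else 'C2'}")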
Hierarchical Clustering
• Hierarchical Clustering is an unsupervised machine learning algorithm that organizes data points into
a hierarchical tree-like structure based on their similarities. Unlike K-Means, which partitions data
into a fixed number of clusters, hierarchical clustering creates a tree of clusters, also known as a
dendrogram.
• The key difference between k-means and hierarchical clustering is that k-means divides the data into
a predetermined number of clusters (k), whereas hierarchical clustering forms a hierarchy of clusters
without requiring the number of clusters beforehand.
• Hierarchical clustering can be of two types:
• Agglomerative: Starts with individual data points as separate clusters and iteratively merges the
most similar clusters until only one cluster remains.
• Divisive: Starts with all data points in one cluster and recursively splits it into smaller clusters
based on dissimilarities.
Hierarchical Clustering
Key Concepts:
• Similarity or Dissimilarity Measures: Hierarchical clustering requires a metric to quantify the
similarity or dissimilarity between data points. Common measures include Euclidean distance,
Manhattan distance, or correlation coefficients.
• Dendrogram: A dendrogram is a tree-like diagram that visually represents the merging
(agglomerative) or splitting (divisive) of clusters. It provides insight into the hierarchical relationships
between data points.
• Linkage Methods: Linkage methods determine how the similarity or dissimilarity between clusters
is computed. Common linkage methods include:
• Single Linkage: The minimum of all pairwise distance between elements in each pair of clusters is used to
measure the distance between two clusters.
• Complete Linkage: The maximum of all pairwise distance between elements in each pair of clusters is used
to measure the distance between two clusters.
• Average Linkage: The average of all pairwise distances between elements in each pair of clusters is used to
measure the distance between two clusters.
• Centroid linkage: Before merging, the distance between the two clusters’ centroids is considered.
• Ward's Method: It uses squared error to compute the similarity of the two clusters for merging.
Dendrogram
A dendrogram is a tree-like diagram used in hierarchical clustering to represent the relationships between data points or clusters. It visually illustrates how data points or clusters are merged or split throughout the clustering process.
Hierarchical Clustering
Steps in Agglomerative Hierarchical Clustering:
1. Compute the proximity matrix using a distance metric, such
as Euclidean distance, to measure how close each pair of
data points are.
2. Use a linkage function, such as single, complete, or average
linkage, to group the closest data points or clusters into a
larger cluster, based on the distance matrix from the previous
step.
3. Repeat step 2 until all data points are merged into a single
cluster, or until a stopping criterion is met, such as a desired
number of clusters or a threshold distance.
The result of agglomerative clustering is a tree-like structure
called a dendrogram, which shows the hierarchical relationship
between the clusters. You can cut the dendrogram at different
levels to obtain different partitions of the data.
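SciPy provides these steps off the shelf. A brief sketch using synthetic points (the data and the choice of single linkage are illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [2, 2], [8, 8], [9, 8], [1, 1], [8, 9]], dtype=float)

# Steps 1-2: compute pairwise Euclidean distances and repeatedly merge the
# closest clusters using single linkage.
Z = linkage(X, method="single", metric="euclidean")

# Cutting the dendrogram: here into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print("Cluster labels:", labels)

# scipy.cluster.hierarchy.dendrogram(Z) can draw the tree (requires matplotlib).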
The proximity matrix consists of n points x1, ..., xn, where d(xi, xj) represents the distance between points xi and xj.
In order to group the data points in a cluster, a linkage function is used where the values in the proximity matrix are taken and the
data points are grouped based on similarity. The newly formed clusters are linked to each other until they form a single cluster
containing all the data points.
● Step-1: Consider each alphabet as a single cluster and calculate the distance of one cluster from all the
other clusters.
● Step-2: In the second step, comparable clusters are merged together to form a single cluster. Let’s say
cluster (B) and cluster (C) are very similar to each other, so we merge them; similarly, clusters (D) and
(E) are merged. We are left with the clusters [(A), (BC), (DE), (F)].
● Step-3: We recalculate the proximity according to the algorithm and merge the two nearest
clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
● Step-4: Repeating the same process; The clusters DEF and BC are comparable and merged together to
form a new cluster. We’re now left with clusters [(A), (BCDEF)].
● Step-5: At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
In Divisive Hierarchical clustering, we take into account all of the data points as a single cluster
and in every iteration, we separate the data points from the clusters which aren’t comparable. In
the end, we are left with N clusters.
After calculating the distance matrix, we will start clustering the data points using the single
linkage approach.
Step 1: First, we will consider each data point as a single cluster. After this, we will start
combining the clusters.
Step 2: To combine the individual clusters, we can consider the following points.
1. Point A is closest to point B.
2. Point B is at a similar distance to points A and C.
3. Point C is closest to point D.
4. Point D is closest to point C.
5. Point E is closest to F.
6. Point F is closest to E.
From the above points, we can combine (C, D) and (E, F) as clusters. We will name them CD and
EF respectively. As we have ambiguity for points A and B, let us not combine them and treat them
as individual clusters.
KNN Algorithm
Key Concept:
• K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the available
cases and puts the new case into the category that is most similar to the available
categories.
• The K-NN algorithm stores all the available data and classifies a new data point based
on similarity. This means that when new data appears, it can be easily classified
into a well-suited category using the K-NN algorithm.
• K-NN algorithm can be used for Regression as well as for Classification but mostly
it is used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training
set immediately; instead, it stores the dataset and performs an action on it at the time of
classification.
KNN Algorithm
Key Concept:
• KNN algorithm at the training phase just stores the dataset and when it gets new
data, then it classifies that data into a category that is much similar to the new data.
• Example: Suppose we have an image of a creature that looks similar to both a cat and a
dog, and we want to know whether it is a cat or a dog. For this identification, we can
use the KNN algorithm, as it works on a similarity measure. Our KNN model will
find the features of the new data that are most similar to the cat and dog images and, based
on those features, place it in either the cat or the dog category.
KNN Algorithm
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number of neighbors
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each
category.
Step-5: Assign the new data points to that category for which the number of
the neighbor is maximum.
Step-6: Our model is ready.
• There is no particular way to determine the best value for K, so we need to try several values to
find the best among them. A commonly used default value for K is 5.
• A very low value for K, such as K=1 or K=2, can be noisy and make the model sensitive to outliers.
• Larger values of K reduce the effect of noise but can blur the boundaries between categories.
Scenario:
• We want to classify a data point using KNN. Assume we already have the following dataset, where
each row represents a data point with two features (x1,x2) and its label.
Data Point   X1   X2   Label
A            1    1    Red
B            2    1    Red
C            4    3    Blue
D            5    4    Blue
• We want to classify a new data point P with coordinates (x1 = 3, x2 = 2) using k = 3.
Step 1: Calculate the distances
• We use the Euclidean distance formula: d(P, Q) = sqrt((x1_P - x1_Q)^2 + (x2_P - x2_Q)^2). Sorted in ascending order, the distances from P to each point are:
Data Point   Distance to P   Label
B            1.41            Red
C            1.41            Blue
A            2.24            Red
D            2.83            Blue
Final Classification:
With k = 3, the three nearest neighbours are B (Red), C (Blue), and A (Red). The majority class is Red, so the new data point P(3, 2) is classified as Red.
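The same result can be reproduced with scikit-learn's KNeighborsClassifier:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])   # points A, B, C, D
y = np.array(["Red", "Red", "Blue", "Blue"])

knn = KNeighborsClassifier(n_neighbors=3)        # k = 3, Euclidean distance by default
knn.fit(X, y)

print(knn.predict([[3, 2]]))                     # -> ['Red'] (neighbours B, C, A: 2 Red vs 1 Blue)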
Association
● Association rule learning is a type of
unsupervised learning technique that checks
for the dependency of one data item on
another and maps these dependencies so that
they can be exploited, for example to make
product placement more profitable.
● It tries to find some interesting relations or
associations among the variables of dataset. It
is based on different rules to discover the
interesting relations between variables in the
database.
● For example, if a customer buys bread, they
are most likely to also buy butter, eggs, or milk,
so these products are stored on the same shelf
or nearby.
An association rule has the form X → Y, where X and Y are itemsets. The rule signifies that if X occurs in a transaction, then Y is likely to occur as well.
The "if" part (X) is called the antecedent, and the "then" part (Y) is called the consequent. A relationship of this kind between two items is known as single cardinality;
as the number of items in a rule increases, the cardinality increases accordingly. To measure associations between thousands of data items, several metrics are used.
These metrics are given below:
○ Support
○ Confidence
○ Lift
Support
Support is the frequency of an itemset, i.e. how frequently it appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X:
Support(X) = (number of transactions containing X) / (total number of transactions)
Confidence
Confidence indicates how often the rule has been found to be true, i.e. how often X and Y occur together given that X occurs. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
Lift
Lift is the ratio of the observed support to the expected support if X and Y were independent of each other:
Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It has three possible ranges of values:
○ Lift = 1: the occurrences of the antecedent and the consequent are independent of each other.
○ Lift > 1: the two itemsets are positively dependent on each other; the higher the lift, the stronger the dependence.
○ Lift < 1: one item is a substitute for the other, i.e. one item has a negative effect on the occurrence of the other.
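These three metrics can be computed directly from a list of transactions. A small illustrative sketch (the basket data below is made up):

# Illustrative transactions (each set is one customer's basket).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / N

X, Y = {"bread"}, {"butter"}
sup_xy = support(X | Y)
conf = sup_xy / support(X)                  # P(Y | X)
lift = sup_xy / (support(X) * support(Y))   # > 1 means X and Y co-occur more than expected

print(f"support={sup_xy:.2f}, confidence={conf:.2f}, lift={lift:.2f}")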
Example:
Consider a retail dataset where each transaction contains a list of items purchased by a customer. The goal is to identify association rules to
understand purchasing patterns. An example of such a rule is {Bread} → {Butter}: customers who buy bread also tend to buy butter.
Algorithms:
Popular algorithms used for association rule learning include:
● Apriori Algorithm: Generates frequent itemsets and association rules based on support and confidence thresholds.
● FP-Growth (Frequent Pattern Growth): Constructs a compact data structure (FP-tree) to efficiently mine frequent itemsets.
Applications:
● Market Basket Analysis: Identifying purchasing patterns in retail datasets.
● Recommender Systems: Generating personalized recommendations based on user-item interactions.
● Healthcare: Discovering associations between medical conditions or treatments.
Frequent itemsets are those itemsets whose support is greater than the threshold value or user-specified minimum support. If {A, B} is a frequent
itemset, then A and B must individually be frequent itemsets as well.
Suppose there are two transactions: A = {1, 2, 3, 4, 5} and B = {2, 3, 7}; in these two transactions, 2 and 3 are the frequent items.
Step-1: Determine the support of itemsets in the transactional database, and select the minimum support and confidence.
Step-2: Take all the itemsets in the transactions with a support value higher than the minimum (selected) support value.
Step-3: Find all the rules of these subsets that have a confidence value higher than the threshold or minimum confidence.
Step-4: Sort the rules in decreasing order of lift.
We will understand the apriori algorithm using an example and mathematical calculation:
Example: Suppose we have the following dataset that has various transactions, and from this dataset, we need to find the frequent
itemsets and generate the association rules using the Apriori algorithm:
Solution:
• For C3, we will repeat the same two processes, but now we will form the C3 table with subsets of three itemsets together, and will calculate the support count from the dataset.
• Now we will create the L3 table. As we can see from the C3 table, there is only one combination of itemsets that has a support count equal to the minimum support count.
• So, L3 will have only one combination, i.e., {A, B, C}.
To generate the association rules, first, we will create a new table with the possible rules from the frequent combination {A, B, C}. For all the rules, we
will calculate the confidence using the formula Confidence(X → Y) = sup(X ∧ Y) / sup(X). After calculating the confidence value for all rules, we will exclude the rules that have
confidence lower than the minimum threshold (50%).
Since the given minimum confidence threshold is 50%, the rules A∧B → C, B∧C → A, and A∧C → B can be
considered strong association rules for the given problem.
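A compact sketch of the Apriori idea: generate candidate itemsets level by level, keep those that meet the minimum support, then form rules that meet the minimum confidence. This is a simplified illustration (it brute-forces candidate generation rather than pruning level by level) with made-up transactions:

from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
min_support, min_confidence, N = 0.4, 0.5, len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / N

# Level-wise generation of frequent itemsets (C1 -> L1 -> C2 -> L2 -> ...).
items = sorted({i for t in transactions for i in t})
frequent = {}
for size in range(1, len(items) + 1):
    candidates = [frozenset(c) for c in combinations(items, size)]
    level = {c: support(c) for c in candidates if support(c) >= min_support}
    if not level:
        break
    frequent.update(level)

# Rule generation: split each frequent itemset into antecedent -> consequent.
for itemset, sup in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            confidence = sup / support(antecedent)
            if confidence >= min_confidence:
                print(f"{set(antecedent)} -> {set(itemset - antecedent)}: "
                      f"support={sup:.2f}, confidence={confidence:.2f}")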
● The FP-Growth Algorithm is an alternative way to find frequent itemsets without candidate generation, thus improving performance. To do so,
it uses a divide-and-conquer strategy. The core of this method is a special data structure named the frequent-pattern tree
(FP-tree), which retains the itemset association information.
a. First, it compresses the input database creating an FP-tree instance to represent frequent items.
b. After this first step, it divides the compressed database into a set of conditional databases, each associated with one frequent pattern.
c. Finally, each such database is mined separately.
● Using this strategy, the FP-Growth reduces the search costs by recursively looking for short patterns and then concatenating them into the long
frequent patterns.
● In large databases, holding the FP tree in the main memory is impossible. A strategy to cope with this problem is to partition the database into a set
of smaller databases (called projected databases) and then construct an FP-tree from each of these smaller databases.
• A compact data structure that stores quantitative information about frequent patterns in a database.
• Each transaction is read and then mapped onto a path in the FP-tree. This is done until all transactions have been read.
• Different transactions with common subsets allow the tree to remain compact because their paths overlap.
• A frequent Pattern Tree is made with the initial item sets of the database.
• The purpose of the FP tree is to mine the most frequent pattern.
• Each node of the FP tree represents an item of the item set.
• The root node represents null, while the lower nodes represent the item sets.
• The associations of the nodes with the lower nodes, that is, the item sets with the other item sets, are maintained while forming
the tree.
• Han defines the FP-tree as the tree structure given below:
• One root is labelled as "null" with a set of item-prefix subtrees as children and a frequent-item-header table.
• Each node in the item-prefix subtree consists of three fields:
• Item-name: registers which item is represented by the node;
• Count: the number of transactions represented by the portion of the path reaching the node;
• Node-link: links to the next node in the FP-tree carrying the same item name or null if there is none.
• Each entry in the frequent-item-header table consists of two fields:
• Item-name: the same as in the node;
• Head of node-link: a pointer to the first node in the FP-tree carrying the item name.
• Additionally, the frequent-item-header table can have the count support for an item.
The best-case scenario occurs when all transactions have the same itemset; the size of the FP-tree will be only a single branch of nodes. The worst-case scenario occurs when every transaction has a unique item set, so the space needed to store the tree is greater than the space used to store the original dataset, because the FP-tree requires additional space to store pointers between nodes and the counters for each item. The tree's complexity grows with each transaction's uniqueness.
1. The first step is to scan the database to find the occurrences of the itemsets in the database. This step is the same as the first step of Apriori. The
count of 1-itemsets in the database is called support count or frequency of 1-itemset.
2. The second step is to construct the FP tree. For this, create the root of the tree. The root is represented by null.
3. The next step is to scan the database again and examine the transactions. Examine the first transaction and find out the itemset in it. The itemset
with the max count is taken at the top, and then the next itemset with the lower count. It means that the branch of the tree is constructed with
transaction itemsets in descending order of count.
4. The next transaction in the database is examined. The itemsets are ordered in descending order of count. If any itemset of this transaction is already
present in another branch, then this transaction branch would share a common prefix to the root.
This means that the common itemset is linked to the new node of another itemset in this transaction.
5. Also, the count of the itemset is incremented as it occurs in the transactions. The common node and new node count are increased by 1 as they are
created and linked according to transactions.
6. The next step is to mine the created FP Tree. For this, the lowest node is examined first, along with the links of the lowest nodes. The lowest node
represents the frequency pattern length 1. From this, traverse the path in the FP Tree. This path or paths is called a conditional pattern base.
A conditional pattern base is a sub-database consisting of prefix paths in the FP tree occurring with the lowest node (suffix).
7. Construct a Conditional FP Tree, formed by a count of itemsets in the path. The itemsets meeting the threshold support are considered in the
Conditional FP Tree.
Using this algorithm, the FP-tree is constructed in two database scans. The first scan collects and sorts the set of frequent items, and the second constructs the FP-tree.
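In practice, FP-Growth is usually run from a library rather than coded by hand. A sketch using the mlxtend package (assuming mlxtend and pandas are installed; the transactions mirror the K, E, M, O, Y example that follows):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [["K", "E", "M", "O", "Y"], ["K", "E", "O", "Y"],
                ["K", "E", "M"], ["K", "M", "Y"], ["K", "E", "O"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Mine frequent itemsets with minimum support 3/5 = 0.6, then derive rules.
frequent = fpgrowth(onehot, min_support=0.6, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])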
FP Growth Example: The given data is a hypothetical dataset of transactions, with each letter representing an item. Minimum support = 3.
Step-1: The frequency of each individual item is computed.
Step-2: The Frequent Pattern set is built, which contains all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies; the resulting set is called L.
Step-3: Now, for each transaction, the respective Ordered-Item set is built. This is done by iterating over the Frequent Pattern set and checking whether the current item is contained in the transaction in question. If it is, the item is inserted into the Ordered-Item set for that transaction. This is done for all the transactions.
Step-4: Now, all the Ordered-Item sets are inserted into a Trie data structure.
a) Inserting the set {K, E, M, O, Y}: here, all the items are simply linked one after the other in the order of occurrence in the set, and the support count for each item is initialized as 1.
b) Inserting the set {K, E, O, Y}: till the insertion of the elements K and E, the support count is simply increased by 1. On inserting O we can see that there is no direct link between E and O, therefore a new node for the item O is initialized with support count 1 and item E is linked to this new node. On inserting Y, we first initialize a new node for the item Y with support count 1 and link the new node of O with the new node of Y.
c) Inserting the set {K, E, M}.
d) Inserting the set {K, M, Y}.
e) Inserting the set {K, E, O}: here simply the support counts of the respective elements are increased. Note that the support count of the new node of item O is increased.
Step-5: Now, for each item, the Conditional Pattern Base is computed, which is the set of path labels of all the paths which lead to any node of the given item in the frequent-pattern tree. Note that the items are arranged in ascending order of their frequencies.
Step-6: Now, for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set of elements that is common to all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.
Step-7: From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item.
Step-8: For each row, two types of association rules can be inferred; for example, for the first row the rules K -> Y and Y -> K can be inferred. To determine the valid rule, the confidence of both rules is calculated, and the one with confidence greater than or equal to the minimum confidence value is retained.
Dimensionality Reduction
Dimensionality reduction is a technique used in machine learning to reduce the number of input variables or dimensions
in a dataset while retaining its most relevant information.
• It is often used to simplify complex data and improve the accuracy and efficiency of models. Some common
methods of dimensionality reduction include Principal Component Analysis (PCA), Linear Discriminant Analysis
(LDA), and t-SNE (t-Distributed Stochastic Neighbor Embedding).
• These techniques work by identifying patterns and relationships within the data and projecting them onto a
lower-dimensional space.
Dimensionality Reduction
Key Concepts and Techniques:
1. Curse of Dimensionality:
○ High-dimensional datasets often suffer from the curse of dimensionality, leading to increased
computational complexity and potential overfitting.
2. Motivation:
○ Dimensionality reduction is motivated by the desire to overcome challenges associated with
high-dimensional data, such as computational inefficiency, increased storage requirements, and
the potential for noise or irrelevant features.
3. Two Approaches:
○ Feature Selection: Select a subset of the original features.
○ Feature Extraction: Create new features that are combinations of the original features.
Dimensionality Reduction
Benefits of Dimensionality Reduction:
1. Improved Computational Efficiency:
○ Reduced dimensionality leads to faster training and inference times for machine
learning models.
2. Enhanced Visualization:
○ Lower-dimensional representations are easier to visualize, aiding in data exploration
and interpretation.
3. Noise Reduction:
○ Eliminating irrelevant or noisy features can improve model generalization and
performance.
4. Memory Savings:
○ Storing and processing lower-dimensional data requires less memory and
computational resources.
5. Improved Model Performance:
○ Dimensionality reduction can lead to better model performance, particularly when
dealing with overfitting in high-dimensional spaces.
Dimensionality Reduction
Challenges:
1. Information Loss:
○ Dimensionality reduction may result in some loss of information, and the challenge is to
minimize this loss.
2. Selection of Techniques:
○ Choosing the most appropriate technique depends on the nature of the data and the specific goals
of the analysis.
3. Interpretability:
○ Reduced-dimensional representations may be more challenging to interpret, especially in
non-linear techniques.
Key Concepts:
Applications of PCA:
1. Dimensionality Reduction:
○ Reducing the number of features in high-dimensional datasets.
2. Data Visualization:
○ Visualizing data in lower-dimensional space for exploration and
interpretation.
3. Noise Reduction:
○ Identifying and retaining the most relevant features while reducing the
impact of noise.
4. Feature Engineering:
○ Creating new features that capture the most important patterns in the
data.
5. Compression:
○ Compressing data while preserving its essential characteristics.
● Standardize the dataset so that each feature has a mean of 0 and a standard deviation of 1. This step is important because PCA is
sensitive to the scale of the features.
● Calculate the covariance matrix of the standardized dataset. The covariance matrix represents the relationships between the different
features.
● Compute the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors represent the directions of the new feature space,
while eigenvalues represent the magnitude of the variance in those directions.
● Sort the eigenvalues in descending order and choose the top k eigenvectors corresponding to the k largest eigenvalues. These
eigenvectors will form the new basis for the reduced feature space.
● Project the standardized dataset onto the selected principal components to obtain the reduced-dimensional dataset.
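These steps map one-to-one onto a few NumPy calls. A minimal sketch on synthetic data (the data and the choice of k = 2 components are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # 100 samples, 4 features (synthetic)

# 1. Standardize: zero mean, unit standard deviation per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data.
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvectors / eigenvalues of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: covariance matrices are symmetric

# 4. Sort by eigenvalue (descending) and keep the top k components.
order = np.argsort(eigvals)[::-1]
k = 2
components = eigvecs[:, order[:k]]

# 5. Project onto the principal components.
X_reduced = X_std @ components
print(X_reduced.shape)                        # (100, 2)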
EXAMPLE - DIMENSIONALITY REDUCTION USING PCA
Let's assume that after performing PCA, we find the following eigenvalues and eigenvectors:
Eigenvalues:
[0.8, 0.5, 0.3, 0.1]
Eigenvectors:
[[0.5, -0.4, 0.7, 0.2],
[-0.2, 0.8, 0.1, -0.5],
[-0.6, -0.3, 0.4, 0.6],
[0.5, 0.2, 0.5, -0.6]]
Keeping the two eigenvectors with the largest eigenvalues (0.8 and 0.5) and projecting the standardized data onto them yields a two-dimensional dataset.
This reduced-dimensional dataset contains the principal components PC1 and PC2, which capture
most of the variance in the original dataset. The reduced dataset can be used for further analysis or
modeling while retaining the essential information from the original dataset.
Ensemble Learning
• Introduction to Ensemble Learning
• Bagging and Boosting; Voting Classifier
Purpose
Example: Random Forest, a popular ensemble method, combines multiple decision trees to make the final prediction more accurate and effective.
Bagging
● Popular algorithms like Random Forest use bagging to create a robust and accurate ensemble. It
utilizes an ensemble of decision trees trained on bootstrapped samples to achieve high predictive
accuracy and robustness.
Advantages of Bagging:
● Variance Reduction: Bagging helps reduce the variance of the model by averaging or voting over
multiple independently trained models, thus making the final model more robust.
● Overfitting Mitigation: Since base models are trained on different subsets of the data, the ensemble is
less likely to overfit the training data compared to individual models.
● Improved Generalization: Bagging often leads to better generalization performance on unseen data,
enhancing the overall predictive power of the ensemble.
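A short scikit-learn sketch of bagging, using decision trees as the base models (the dataset and parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Bagging: 100 trees, each trained on a bootstrap sample of the data.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=True, random_state=0)
print("Bagging accuracy:", cross_val_score(bagging, X, y, cv=5).mean())

# Random Forest = bagging of decision trees plus random feature selection at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print("Random Forest accuracy:", cross_val_score(forest, X, y, cv=5).mean())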
Boosting
● "Boosting" refers to an ensemble machine learning technique that combines
the predictions of multiple weak models to create a strong model.
● The primary goal of boosting is to improve the accuracy and generalization
of the model.
How it works:
● Combining weak learners into a strong
learner
● In boosting, a series of weak learners
(models that perform slightly better
than random chance) are trained
sequentially.
● Each weak learner focuses on the
mistakes made by its predecessors,
with more emphasis on the
misclassified data points.
● The final model is a weighted sum of the weak learners.
Courtesy: https://towardsdatascience.com/boosting-algorithms-explained-d38f56ef3f30
How it works:
Step 1: The base algorithm reads
the data and assigns equal weight to
each sample observation.
TYPES OF BOOSTING
There are three main ways through which boosting can be carried out:
ADAPTIVE BOOSTING
One of the most popular boosting algorithms is AdaBoost (Adaptive Boosting). Here's a simplified
overview of how AdaBoost works:
1. Initialization: Assign equal weights to all training samples.
2. Build Weak Model: Train a weak learner (e.g., a decision tree) on the data, giving more weight
to misclassified samples.
3. Compute Error: Calculate the error of the weak learner, considering the weighted samples.
4. Compute Weight: Assign a weight to the weak learner based on its performance (lower error
results in a higher weight).
5. Update Weights: Adjust the weights of the training samples, giving higher weight to
misclassified samples.
6. Repeat: Repeat steps 2-5 for a predetermined number of iterations or until a certain level of
accuracy is reached.
7. Final Model: Combine the weak learners into a strong model by assigning weights to their
predictions.
Boosting algorithms are effective in improving predictive performance and handling complex
relationships within the data. However, they can be sensitive to noisy data and outliers.
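A minimal AdaBoost sketch in scikit-learn, using depth-1 decision trees ("stumps") as the weak learners (the dataset and parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each iteration fits a depth-1 tree (a weak learner) on re-weighted samples.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=200, learning_rate=0.5, random_state=0)
print("AdaBoost accuracy:", cross_val_score(ada, X, y, cv=5).mean())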
GRADIENT BOOSTING
In Gradient Boosting, base learners are generated sequentially in such a way that the
present base learner is always more effective than the previous one.
XGBOOST
XGBoost is an advanced version of the gradient boosting method; it stands for eXtreme Gradient Boosting. The main aim of this algorithm is to increase the speed and efficiency of computation.
The main features provided by XGBoost are:
▪ Parallelized decision tree construction.
▪Implementing distributed computing methods for
evaluating large and complex models.
▪Using Out-of-Core Computing to analyze huge
datasets.
▪Implementing cache optimization to make the best
use of resources.
STACKING
Stacking, short for "stacked generalization," is an ensemble learning technique that involves training multiple
models to combine their predictions. Instead of relying on a single model, stacking leverages the strengths of
diverse models to achieve improved predictive performance. The process typically involves the following steps:
1. Base Models:
○ Train multiple base models using the same dataset but with different algorithms or configurations.
2. Meta-Model (or Blender):
○ Introduce a meta-model, often referred to as a blender or meta-learner, which takes the predictions from
the base models as input features.
3. Training the Meta-Model:
○ Train the meta-model using the predictions from the base models as input features and the actual target
values. This meta-model learns to combine the predictions of the base models in an optimal way.
4. Predictions:
○ When making predictions on new data, the base models generate individual predictions, which are then
fed into the trained meta-model to produce the final prediction.
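scikit-learn's StackingClassifier implements this workflow directly. A brief sketch with two base models and a logistic-regression meta-model (the dataset and parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Base models (level 0) and a logistic-regression meta-model (level 1).
stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(max_depth=3)),
                ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold predictions of the base models train the meta-model
)
print("Stacking accuracy:", cross_val_score(stack, X, y, cv=5).mean())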
STACKING
Advantages of Stacking:
● Improved Predictive Performance: Stacking often leads to more accurate predictions compared to
individual base models.
● Model Diversity: The use of diverse base models contributes to the overall robustness and effectiveness of
the ensemble.
● Adaptability: Stacking can accommodate a variety of machine learning algorithms as base models, allowing
for flexibility in model selection.
Considerations:
● Complexity: Stacking introduces an additional layer of complexity, and careful tuning is necessary to
prevent overfitting.
● Data Size: Stacking may not provide significant benefits with small datasets, and the increased
computational cost might outweigh the advantages.
STACKING
Implementation:
● Common machine learning libraries, such as scikit-learn in Python, offer tools and functions to implement
stacking. These libraries provide easy-to-use interfaces for creating and training ensemble models.
Example:
Consider a stacking scenario in a classification problem where base models include a decision tree, a support
vector machine, and a neural network. The meta-model, often a logistic regression or another simple model,
learns to combine the predictions of these base models to achieve a more accurate and robust final prediction.
BOOSTING VS BAGGING
Boosting or Sequential ensemble
Here, the weak learners are sequentially
produced during the training phase. The
performance of the model is improved by
assigning a higher weightage to the
previous, incorrectly classified samples. An
example of boosting is the AdaBoost
algorithm.
Bagging or Parallel ensemble
Here, the weak learners are produced in parallel during the training phase. The performance of the model can be increased by training a number of weak learners in parallel on bootstrapped data sets. An example of bagging is the Random Forest algorithm.
Courtesy: https://towardsdatascience.com/ensemble-learning-bagging-boosting-3098079e5422
Courtesy: https://www.analyticsvidhya.com/blog/2023/01/ensemble-learning-methods-bagging-boosting-and-stacking/
Voting Classifier
• A Voting Classifier is an ensemble learning method that combines the predictions of multiple base
models to make a final prediction.
• Aggregating predictions of classification models to create a better classifier by predicting the class
with the most votes is known as a voting classifier.
• This approach leverages the wisdom of diverse models to achieve better overall performance than
individual models.
Voting Classifier
There are two main types of voting in a classifier: hard voting and soft voting.
1. Hard Voting: Here, each base model in the ensemble independently predicts the class, and the final prediction is
determined by a majority vote.
● Implementation:
○ Each base model "votes" for a class.
○ The class with the majority of votes is chosen as the final prediction.
● Use Case:
○ Appropriate when individual models are diverse and may excel in different aspects.
2. Soft Voting: Here, each base model assigns a probability to each class, and the final prediction is based on the
weighted average of these probabilities.
For example, a soft-voting classifier built from three different base models and evaluated on the Iris dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model1 = LogisticRegression(max_iter=1000)
model2 = SVC(probability=True)  # probability=True is required for soft voting
model3 = DecisionTreeClassifier()

# Soft voting averages the predicted class probabilities of the three base models.
voting_clf = VotingClassifier(estimators=[("lr", model1), ("svc", model2), ("dt", model3)], voting="soft")
cross_val_score(voting_clf, iris.data, iris.target, cv=5)  # 5-fold cross-validation accuracies

Explore more at: Voting Classifier in Machine Learning | Aman Kharwal (thecleverprogrammer.com)