ML Tech Neo Study
MODULE 1
CHAPTER 1
Introduction to Machine Learning
Syllabus
Machine Learning, Types of Machine Learning, Issues in Machine Learning, Applications of Machine Learning, Steps in developing a Machine Learning Application, Training Error, Generalization Error, Overfitting, Underfitting, Bias-Variance trade-off.
1.1 Machine Learning ... 1-2
UQ. What is Machine learning? How is it different from data mining? (MU(Comp.) - May 17, 5 Marks, May 19, 5 Marks) ... 1-2
UQ. Define Machine learning and explain with example importance of Machine Learning. (MU(Comp.) - Dec. 19, 5 Marks) ... 1-2
1.2 Key Terminology ... 1-3
UQ. What are the key tasks of Machine Learning? (MU(Comp.) - May 16, 5 Marks) ... 1-3
1.3 Types of Machine Learning ... 1-5
1.7 Applications of Machine Learning ... 1-9
UQ. Write short note on: Machine learning applications. (MU(Comp.) - May 16, May 17, 10 Marks) ... 1-10
1.8 Training error and generalization error ... 1-11
1.9 Underfitting, overfitting, bias and variance trade off ... 1-13
Chapter Ends
1.1 MACHINE LEARNING
UQ. What is Machine learning? How is it different from data mining? (MU(Comp.) - May 17, 5 Marks, May 19, 5 Marks)
UQ. Define Machine learning and explain with example importance of Machine Learning. (MU(Comp.) - Dec. 19, 5 Marks)
A machine that is as intellectually capable as humans has always attracted writers and early computer scientists who were excited about artificial intelligence and machine learning.
The first machine learning system was developed in the 1950s. In 1952, Samuel developed a program to play checkers. The program was able to observe positions in the game and learn a model that gives better moves for the machine player.
In 1957, Frank Rosenblatt designed the Perceptron, which is a simple classifier, but when it is combined in large numbers, in a network, it becomes a powerful tool.
In the 1960s, Minsky came up with the limitations of the perceptron. He showed that the X-OR problem could not be represented by a perceptron and that such inseparable data distributions cannot be handled. Following Minsky's work, neural network research went dormant until the 1980s.
Machine learning became very famous in the 1990s due to the introduction of statistics. The combination of computer science and statistics led to probabilistic approaches in Artificial Intelligence. The area further shifted towards data-driven techniques. As huge amounts of data became available, scientists started to design intelligent systems that are able to analyze and learn from data.
Machine learning is a category of Artificial Intelligence. In machine learning, computers have the ability to learn by themselves; explicit programming is not required.
Machine learning focuses on the study and development of algorithms that can learn from data and also make predictions on data.
Machine learning is defined by Tom Mitchell as: "A program learns from experience 'E' with respect to some class of tasks 'T' and performance measure 'P', if its performance on tasks in 'T', as measured by 'P', improves with experience 'E'." Here 'E' represents the past experienced data and 'T' represents the tasks such as prediction, classification, etc. An example of 'P' is accuracy in prediction, which we might want to increase.
Machine learning mainly focuses on the design and development of computer programs that can teach themselves to grow and change when exposed to new data.
Using machine learning we can collect information from a dataset by asking the computer to make some sense of the data. Machine learning is turning data into information.
Fig. 1.1.1: Machine Learning (a computer program maps input data to an output)
Fig. 1.1.1 is a schematic representation of an ML system. The ML system takes the training data and background knowledge as input. Background knowledge and data help the Learner program to provide a solution for a particular task or problem. The performance corresponding to the solution can also be measured. The ML system mainly comprises two components, a Learner and a Reasoner. The Learner uses the training data and background knowledge to build the model, and this model can be used by the Reasoner to provide the solution for a task.
Machine learning can be applied to many applications, from politics to geosciences. It is a tool that can be applied to many problems. Any application which needs to extract some information from data, and also take some action on the data, can benefit from machine learning methods.
Some of the applications are spam filtering in email, face recognition, product recommendations from Amazon.com and handwritten digit recognition.
1.2 KEY TERMINOLOGY
UQ. What are the key tasks of Machine Learning? (MU(Comp.) - May 16, 5 Marks)
Expert System
An expert system is a system which is developed using a training set, a testing set, knowledge representation, features, an algorithm and classification terminology.
(i) Training Set: A training set comprises training examples which will be used to train machine learning algorithms.
(ii) Testing Set: To test machine learning algorithms, what is usually done is to have a training set of data and a separate dataset, called a test set.
(iii) Knowledge Representation: Knowledge representation may be stored in the form of a set of rules. It may be an example from the training set or a probability distribution.
(iv) Features: Important properties or attributes.
(v) Classification: We classify the data based on features.
Process: Suppose we want to use a machine learning algorithm for classification. The next step is to train the algorithm, or allow it to learn. To train the algorithm we give it quality data, called a training set, as input.
Each training example has some features and one target variable. The target variable is what we will be trying to predict with our machine learning algorithms. In a training dataset the target variable is known. The machine learns by finding some relationship between the target variable and the features. In classification tasks the target variables are known as classes. It is assumed that there will be a limited number of classes.
The class or target variable that the training example belongs to is then compared to the predicted value, and we can get an idea about the accuracy of the algorithm.
Example: First we will see some terminologies that are frequently used in machine learning methods. Let's take an example where we want to design a classification system that will classify instances into either Acceptable or Unacceptable. This kind of system is a fascinating topic often associated with machine learning, called an expert system.
The features or attributes selected are Buying_Price, Maintenance_Price, Lug_Boot and Safety. The features of the various cars are stored in Table 1.2.1. Each example in Table 1.2.1 represents a record, and a record comprises four features.
In Table 1.2.1 all the features are categorical in nature and take limited, disjoint values. The first two features represent the buying price and the maintenance price of a car, such as high, medium and low. The third feature shows the luggage capacity of a car as small, medium or big. The fourth feature represents whether the car has safety measures or not.
Classification is one of the important tasks in machine learning. In this application, suppose we have all the information about a car's Buying_Price, Maintenance_Price, Lug_Boot and Safety. Many machine learning algorithms can learn, from a group of other cars, to classify the car as Acceptable or Unacceptable, which is the target variable.
In a classification task the target variable takes a discrete value, and in the task of regression its value could be continuous.
In a training dataset we have the value of the target variable. The relationship that exists between the features and the target variable is used by the machine for learning. The target variable here is the evaluation of the car. In the classification task the target variables are called classes. In classification systems it is assumed that the classes are limited in number.
Attributes or features are the individual values that, when combined with other features, make up a training example. These are usually the columns in a training or test set.
A training dataset and a testing dataset are used to test machine learning algorithms. First the training dataset is given as input to the program. The program uses this data to learn. Next, the test set is given to the program. The program decides which instance of the test data belongs to which class. The predicted output is compared with the actual output, and we can get an idea about the accuracy of the algorithm. There are better ways to use all the information in the training dataset and test dataset.
Assume that in the car evaluation classification system, we have tested the program and it meets the desired level of accuracy. Knowledge representation is used to check what the machine has learned. There are many ways in which knowledge can be represented.
We can use a set of rules or a probability distribution to represent the knowledge.
Many algorithms represent the knowledge in a way that is more interpretable to humans than others. In some situations we may not want to build an expert system but are interested only in the knowledge representation that is acquired from training a machine learning algorithm.
Table 1.2.1: Car evaluation classification based on four features (columns: Features, Target Variable)
Fig. 1.2.1: Features and target variable identified
1. Supervised Learning
Semi-supervised learning is a combination of supervised and unsupervised learning. In this there is some amount of labeled training data and also a large amount of unlabeled data, and you try to come up with a learning algorithm that works even when part of the training data is not labeled.
In a classification task, the aim is to predict the class of an instance of data. Another method in machine learning is regression.
Regression is the prediction of a numeric value. An example of regression is to draw a best-fit line which passes through some data points in order to generalize the data points.
Classification and regression are examples of supervised learning. These types of problems are called supervised because we are telling the algorithm what to predict.
The exact opposite of supervised learning is a task called unsupervised learning. In unsupervised learning, the target value or label is not given for the data. A problem in which similar items are grouped together is called clustering. In unsupervised learning, we may also want to find statistical values that describe the data. This is called density estimation. Another task of unsupervised learning may be reducing the data from many attributes to a small number so that we can properly visualize it in two or three dimensions.
Table 1.3.2: Unsupervised learning tasks
K-Means | Expectation Maximization
DBSCAN | Parzen Window
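To make the clustering entry in Table 1.3.2 concrete, here is a minimal K-Means sketch in Python (scikit-learn assumed); the sample points and the choice of two clusters are illustrative only.

```python
# Minimal K-Means sketch; the data and k = 2 are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one group of similar items
                   [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])  # another group
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster labels:", kmeans.labels_)          # which cluster each point belongs to
print("Cluster centres:", kmeans.cluster_centers_)
```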
2. How much training data is sufficient? What general bounds can be found to relate the confidence in learned hypotheses to the amount of training experience and the character of the learner's hypothesis space?
3. Prior knowledge held by the learner is used at which time and in what manner to guide the process of generalizing from examples? If we have approximately correct knowledge, will it be helpful even when it is only approximately correct?
4. What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
5. To reduce the task of learning to one or more function approximation problems, what will be the best approach? What specific functions should the system attempt to learn? Can this process itself be automated?
6. To improve the knowledge representation and to learn the target function, how can the learner automatically alter its representation?
UQ. Explain the steps required for selecting the right machine learning algorithm. (MU(Comp.) - May 16, 8 Marks)
With all the different algorithms available in machine learning, how can you select which one to use? First you need to focus on your goal: what are you trying to get out of this? What data do you have or can you collect? Secondly, you have to consider the data.
1. Goal: If you are trying to predict or forecast a target value, then you need to look into supervised learning. Otherwise, you have to use unsupervised learning.
(a) If you have chosen supervised learning, then next you need to focus on what your target value is.
If the target value is discrete (e.g. Yes / No, 1 / 2 / 3, A / B / C), then use classification.
If the target value is continuous, i.e. a number of values (e.g. 0 - 100, -99 to 99), then use regression.
(b) If you have chosen unsupervised learning, then next you need to focus on what your aim is.
If you want to fit your data into some discrete groups, then use clustering.
If you want a numerical estimate of how strongly the data fit into each group, then use a density estimation algorithm.
2. Data: Are the features continuous or nominal? Are there missing values in the features? If yes, what is the reason for the missing values? Are there outliers in the data? All of these characteristics of your data can help you narrow the algorithm selection process.
You could collect the samples by scraping a website and extracting data, from an RSS feed or an API, from a device (for example, to collect wind speed measurements), or from publicly available data.
2. Preparation of the Input Data
Once you have the input data, you need to check whether it is in a useable format or not.
Some algorithms can accept target variables and features as strings; some need them to be integers.
Some algorithms accept features only in a special format.
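As an illustration of converting string-valued features into the integer format some algorithms need, the sketch below uses pandas and scikit-learn's LabelEncoder; the car-style feature values are made up for the example.

```python
# Sketch of converting string features to integers; the feature values are illustrative.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    "Buying_Price": ["high", "low", "medium", "high"],
    "Safety":       ["yes", "no", "yes", "yes"],
})
encoders = {}
for column in data.columns:
    encoders[column] = LabelEncoder()
    data[column] = encoders[column].fit_transform(data[column])  # strings -> integers
print(data)
```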
7. Use It
In this step a real program is developed to do some task, and once again it is checked whether all the previous steps worked as you expected. You might encounter some new data and have to revisit steps 1-5.
Fig.: Training and testing phases. In the training phase, a feature extractor converts each input into features, and the features together with labels are fed to the machine learning algorithm. In the testing phase, the feature extractor converts unlabeled input into features, and the trained classifier model predicts the label.
1.7 APPLICATIONS OF MACHINE LEARNING
UQ. Write short note on: Machine learning applications. (MU(Comp.) - May 16, May 17, 10 Marks)
1. Learning Associations
An example of a retail application of machine learning is basket analysis for a supermarket chain, which is finding associations between products bought by customers: people who buy P typically also buy Q, and if there is a customer who buys Q and does not buy P, he or she is a potential P customer. Once we identify such customers, we can target them for cross-selling.
In finding an association rule, we are interested in learning a conditional probability of the form P(Q | P), where Q is the product we would like to condition on P, which is the product or products which we know the customer has already purchased.
P(Milk | Bread) = 0.7
implies that 70% of customers who buy bread also buy milk.
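The conditional probability P(Q | P) above can be estimated directly from transaction data as the ratio of transactions containing both products to transactions containing P; the tiny basket list below is invented purely to show the computation.

```python
# Estimating P(Milk | Bread) from a toy list of baskets (the baskets are made up).
baskets = [
    {"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread", "butter"},
    {"milk"}, {"bread", "milk"}, {"bread"},
]
with_bread = [b for b in baskets if "bread" in b]
with_bread_and_milk = [b for b in with_bread if "milk" in b]
confidence = len(with_bread_and_milk) / len(with_bread)   # P(Milk | Bread)
print("P(Milk | Bread) =", confidence)
```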
2. Classification
A credit is an amount of money loaned by a financial institution. It is important for the bank to be able to predict in advance the risk associated with a loan, which is the probability that the customer will default and not pay the whole amount back.
4. Unsupervised Learning
One of the important unsupervised learning problems is clustering. In clustering, the dataset is partitioned into meaningful sub-classes known as clusters. For example, suppose you want to decorate your home using given items.
Now you will classify them using unsupervised learning (no prior knowledge), and this classification can be on the basis of the colour of the items, the shape of the items, the material used for the items, the type of the items, or whatever way you would like.
5. Reinforcement Learning
There are some applications where the output of the system is a sequence of actions. In such applications the sequence of correct actions, instead of a single action, is important in order to reach the goal. An action is said to be good if it is part of a good policy. A machine learning program generates a policy by learning from previous good action sequences. Such methods are called reinforcement learning methods.
A good example of reinforcement learning is chess playing. In artificial intelligence and machine learning, one of the most important research areas is game playing. Games can be easily described, but at the same time, they are quite difficult to play well. Let's take the example of chess, which has a limited number of rules, but the game is very difficult because for each state there can be a large number of possible moves.
Another application of reinforcement learning is robot navigation. The robot can move in all possible directions at any point of time. The algorithm should reach the goal state from an initial state by learning the correct sequence of actions after conducting a number of trial runs.
When the system has unreliable and partial sensory information, reinforcement learning becomes complex. Let's take an example of a robot with incomplete camera information. Here the robot does not know its exact location.
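As a hedged sketch of how a policy can be learned from trial runs, the following tabular Q-learning loop solves a tiny one-dimensional corridor whose goal is the right-most cell; the corridor length, rewards and learning parameters are all assumptions made for the illustration, not part of the text above.

```python
# Tiny tabular Q-learning sketch for a 1-D corridor; all parameters are illustrative.
import random

n_states, goal = 6, 5               # states 0..5, the goal is state 5
actions = [-1, +1]                  # move left or right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(500):          # trial runs
    state = 0
    while state != goal:
        a = random.randrange(2) if random.random() < epsilon else Q[state].index(max(Q[state]))
        nxt = min(max(state + actions[a], 0), n_states - 1)
        reward = 1.0 if nxt == goal else 0.0
        Q[state][a] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][a])
        state = nxt

policy = ["left" if q[0] > q[1] else "right" for q in Q]
print("Learned policy:", policy)    # should prefer moving right, towards the goal
```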
1.8 TRAINING ERROR AND GENERALIZATION ERROR
Training Error
In machine learning, training a predictive model means finding a function f which maps a set of values X to a value Y. If we apply the model to the data it was trained on, we are calculating the training error. If we calculate the error on data which was unknown in the training phase, we are calculating the test (generalization) error. The training error is the average error over the n training samples,
E_train = (1/n) Σ_{i=1..n} error(f(X_i), Y_i)
while the generalization error is the expected error over the whole data distribution,
E = ∫∫ error(f(X), Y) P(X, Y) dX dY
In these equations the error is calculated over all possible values of X and Y; error(f(X_i), Y_i) represents whether the predicted and actual values are the same or not, and if not, by how much they differ, and P(X, Y) represents how often we expect to see such X and Y.
Let's see an example. Consider a college student trying to prepare for her final exam. A diligent student will strive to practice well and test her abilities using exams from previous years. Nonetheless, doing well on past exams is no guarantee that she will excel when it matters. For instance, the student might try to prepare by rote-learning the answers to the exam questions. This requires the student to memorize many things. She might even remember the answers to past exams perfectly. Another student might prepare by trying to understand the reasons for giving certain answers. In most cases, the latter student will do much better.
Let's see one more example. Consider the problem of trying to classify the outcomes of coin tosses (class 0: heads, class 1: tails) based on some contextual features that might be available. Suppose that the coin is fair. No matter what algorithm we come up with, the generalization error will always be 1/2. However, for most algorithms, we should expect our training error to be considerably lower, depending on the luck of the draw, even if we did not have any features! Consider the dataset {0, 1, 1, 1, 0, 1}. Our feature-less algorithm would have to fall back on always predicting the majority class, which from our limited sample appears to be 1. In this case, the model that always predicts class 1 will incur a training error of 1/3, considerably better than our generalization error. As we increase the amount of data, the probability that the fraction of heads deviates significantly from 1/2 diminishes, and our training error would come to match the generalization error.
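The coin-toss argument above is easy to simulate: a feature-less majority-class predictor is fit on a small training sample and its training error is compared with the 1/2 generalization error of a fair coin. The sample sizes below are arbitrary choices for the demonstration.

```python
# Simulating training error vs. generalization error for a fair coin (sizes are arbitrary).
import random

random.seed(0)
train = [random.randint(0, 1) for _ in range(6)]        # a small training sample
majority = 1 if sum(train) >= len(train) / 2 else 0     # feature-less majority-class model
train_error = sum(1 for y in train if y != majority) / len(train)

test = [random.randint(0, 1) for _ in range(100000)]    # fresh tosses approximate the distribution
generalization_error = sum(1 for y in test if y != majority) / len(test)
print("training error:", train_error)                   # typically below 1/2
print("generalization error:", generalization_error)    # close to 1/2
```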
When we train our models, we attempt to search for a function that fits the training data as well as possible. If the function is so flexible that it can catch on to spurious patterns just as easily as to true associations, then it might perform too well on the training data without producing a model that generalizes well to unseen data. This is precisely what we want to avoid or at least control. Many of the techniques in deep learning are heuristics and tricks aimed at guarding against overfitting.
When we have simple models and abundant data, we expect the generalization error to resemble the training error. When we work with more complex models and fewer examples, we expect the training error to go down but the generalization gap to grow.
Fig. 1.8.1: Training error and generalization error versus model capacity (underfitting zone, overfitting zone, generalization gap, optimal capacity)
1.9 UNDERFITTING, OVERFITTING, BIAS AND VARIANCE TRADE OFF
Let us consider that we are designing a machine learning model. A model is said to be a good machine learning model if it generalizes to any new input data from the problem domain in a proper way. This helps us to make predictions on future data that the model has never seen.
Now, suppose we want to check how well our machine learning model learns and generalizes to new data. For that we have overfitting and underfitting, which are majorly responsible for the poor performance of machine learning algorithms.
Before diving further, let's understand two important terms:
Bias: Assumptions made by a model to make a function easier to learn. (The algorithm's error rate on the training set is the algorithm's bias.)
Variance: If you train your model on training data and obtain a very low error, and upon changing the data and then training the same previous model you experience a high error, this is variance. (How much worse the algorithm does on the test set than on the training set is known as the algorithm's variance.)
Underfitting
A statistical model or a machine learning algorithm is said to have underfitting when it cannot capture the underlying trend of the data.
Underfitting destroys the accuracy of our machine learning model. Its occurrence simply means that our model or the algorithm does not fit the data well enough.
It usually happens when we have less data to build an accurate model and also when we try to build a linear model with non-linear data. In such cases the rules of the machine learning model are too easy and flexible to be applied on such minimal data, and therefore the model will probably make many wrong predictions.
Overfitting
A statistical model is said to be overfitted,when we train it with a lot
of data. When a model gets trained with so much
of data, it starts learning from the noise and inaccurate data entries in our data set. Then the
model does not categorize
the data correctly, because of too many details and noise.
The causes of overfitting are the
non-parametricand non-linear methods because these types of machine learning
algorithms have more freedom in building the model based on the dataset and therefore they can really build unrealistic
models.
A solution to avoid overfitting is using a linear algorithm if we have linear data or using the parameters like the
maximal depth if we are using decision trees.
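A common way to see underfitting and overfitting is to fit polynomials of increasing degree to noisy data and compare training and test errors; the sketch below does this with scikit-learn, and the degrees, noise level and sample size are illustrative choices rather than recommendations from the text.

```python
# Under/overfitting sketch: polynomial fits of increasing degree (all settings illustrative).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)   # non-linear data with noise
X_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test).ravel()

for degree in (1, 4, 15):                                     # too simple, reasonable, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    train_err = mean_squared_error(y, model.predict(X))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```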
Fig.: Under-fitting (too simple to explain the variance), appropriate fitting, and over-fitting (force-fitting - too good to be true)
Fig.: Generalization error versus model capacity, showing the underfitting zone (dominated by bias) and the overfitting zone (dominated by variance)
Chapter Ends
MODULE 3
CHAPTER 3
Ensemble Learning
Syllabus
Understanding Ensembles, K-fold cross validation, Boosting, Stumping, XGBoost, Bagging, Subagging, Random Forest, Comparison with Boosting, Different ways to combine classifiers.
Table of Contents
3.5.2 ... 3-7
GQ. Write a short note on: Bagging and Boosting. (10 Marks) ... 3-7
3.6 Random Forest ... 3-7
3.6.1 Random Forest ... 3-9
3.6.2 Random Forest Algorithm ... 3-9
3.6.3 Bagging ... 3-9
3.6.4 Row Sampling ... 3-10
3.6.5 Important Features of Random Forest ... 3-10
3.6.6 Difference Between Decision Tree ... 3-14
Remarks
An ensemble method is a technique that uses multiple independent similar or different models to derive an output.
3.2 CROSS-VALIDATION IN MACHINE LEARNING
Cross-validation is a technique for validating the model's efficiency by training it on a subset of the input data and testing it on a previously unseen, independent subset of the data.
In machine learning, we need to test the stability of the model. It means that, based only on the training dataset, we cannot judge how the model will behave on unseen data.
3.2.2 K-Fold Cross-Validation
In each set (fold), training and testing would be performed precisely once during this entire process. It helps us to avoid the overfitting that can occur when a model is trained using all of the data in a single shot and gives its best performance accuracy on that data.
This K-fold cross-validation helps us to build a model which is a generalized one.
It also addresses the challenge of the volume of data available for assessment, and it supports model building and hyper-parameter tuning. Here the data is split into test and train sets a number of times; this number is called K and should be assigned as a parameter (an integer). The model is validated multiple times based on the value of K.
The dataset X is divided randomly into K equal-sized parts X_i, i = 1, 2, ..., K.
We come across two problems with this. First, to keep the training set large, we allow validation sets that are small. Second, the training sets overlap considerably; namely, any two training sets share (K - 2) parts.
K is typically 10 or 30. As K increases, the percentage of training instances increases and we get more robust estimators, but the validation set becomes smaller.
Fig. 3.2.1: Shuffle the dataset, split it into training and test sets, split the training dataset into K folds, and use (K - 1) folds for training
During each run, one fold is considered for testing and the rest are used for training, and we move on with the iterations. The pictorial representation below gives an idea of the flow of the fold-defined size.
Fig. 3.2.2: Five-fold cross-validation; in each of the five runs a different fold is used as the test set and the remaining folds are used for training
Here, each data point is used once in the hold-out (test) set and K - 1 times for training. So over the full iteration, each fold is used exactly once for testing and the rest of the time for training. In the above setup with five folds, each run uses one fold (20%) for testing and four folds (80%) for training.
In each iteration, we will get an accuracy score, and we have to sum them and find the mean.
From this we can understand how consistently the data behaves across folds and conclude whether or not to go to production with this model.
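A hedged sketch of K-fold cross-validation with scikit-learn follows; the classifier, the dataset and K = 5 are example choices. The accuracy of each fold is averaged exactly as described above.

```python
# K-fold cross-validation sketch (dataset, model and K = 5 are example choices).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)   # shuffle, then split into K folds
scores = cross_val_score(model, X, y, cv=kfold)           # one accuracy score per fold
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```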
Fig. 3.2.3: All data is split into training data and test data; the training data is further divided into K folds, each split holds out a different test fold, and the final evaluation (accuracy) is done on the held-out test data
(i) The optimised value for K is 10 when used with data of a good size (i.e., it is the commonly used value).
threshold. However, rarely, multiple thresholds may be chosen and the stump therefore contains three or more leaves.
Decision stumps are often used as components (called "weak learners" or "base learners") in machine learning ensemble techniques such as bagging and boosting.
3.3.2 Remarks
(1) Meaning of stump in a decision tree
A decision stump is a decision tree which uses only a single attribute for splitting.
For discrete attributes, this means that the tree consists only of a single interior node (i.e., the root has only leaves as successor nodes).
If the attribute is numerical, the tree may be more complex.
AdaBoost was the first really successful boosting algorithm developed for the purpose of binary classification.
AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that combines multiple "weak classifiers" into a single "strong classifier".
AdaBoost is the best starting point for understanding boosting algorithms. It is adaptive in the sense that subsequent classifiers built are tweaked in favour of those instances misclassified by previous classifiers.
It is sensitive to noisy data and outliers.
AdaBoost uses multiple iterations to generate a single composite strong learner. It creates a strong learner by iteratively adding weak learners. During each phase of training, a new weak learner is added to the ensemble, and a weighting vector is adjusted to focus on examples that were misclassified in previous rounds. The result is a classifier that has higher accuracy than the weak learner classifiers.
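The following sketch shows AdaBoost in scikit-learn, whose default weak learner is a depth-1 decision stump; the dataset and the number of boosting rounds are illustrative assumptions.

```python
# AdaBoost sketch; the default weak learner is a depth-1 decision stump (settings illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = AdaBoostClassifier(n_estimators=50, random_state=0)  # 50 boosting rounds
model.fit(X_train, y_train)        # each round re-weights the examples misclassified so far
print("test accuracy:", model.score(X_test, y_test))
```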
3.5 XG-BOOST
XG-Boost stands for Extreme Gradient Boosting. It is written in C++ and optimises the training for gradient boosting.
To understand XG-Boost, we first need to understand trees, especially the decision tree.
This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions.
unified way.
(4) System features
(iv) Distributed computing for training very large models using a cluster of machines.
(v) Column block for parallel learning: The most time-consuming part of tree learning is to get the data into sorted order. In order to reduce the cost of sorting, the data is stored in column blocks in sorted order, in compressed format.
(5) XG-Boost features
(i) Regularised learning: The regularisation term helps to smooth the final learnt weights to avoid overfitting. The regularised objective will tend to select a model employing simple and predictive functions.
(ii) Gradient tree boosting: The tree ensemble model cannot be optimised using traditional optimization methods in Euclidean space. Instead, the model is trained in an additive manner.
Fig. 3.6.2: Parallel versus sequential combination of learners
Remark
Step 3: Each decision tree will generate an output.
Step 4: The final output is based on majority voting or averaging, for classification and regression respectively.
Actually, although a random forest is a collection of decision trees, there are differences in their behaviour.
Part of each bootstrap sample is not used to train the tree; instead it is used to evaluate its performance. These samples are called out-of-bag samples.
Advantages
1. It can be used in classification and regression problems.
2. It solves the problem of overfitting, as the output is based on majority voting or averaging.
3. Even if the data contains missing values or null values, it performs well.
4. Each decision tree created is independent of the others; this means that it shows the property of parallelisation.
5. It is highly stable, as the average of the answers given by a large number of trees is taken.
6. It maintains diversity, as all the attributes are not considered while making each decision tree, even though this is not true in all cases.
7. It is immune to the curse of dimensionality. Since each tree does not consider all the attributes, the feature space is reduced.
8. We need not segregate data into train and test sets, as there will always be about 30% of the data which is not seen by a decision tree built from the bootstrap sample.
Disadvantages
1. A random forest is highly complex when compared to a decision tree, where decisions can be made by following the path of the tree.
2. Whenever it has to make a prediction, each decision tree has to generate an output for the given input data.
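To tie together bootstrap sampling and the out-of-bag evaluation mentioned above, here is a scikit-learn sketch; the dataset and the number of trees are illustrative.

```python
# Random forest sketch with out-of-bag evaluation (dataset and tree count are illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,    # each tree is trained on a bootstrap sample, independently
    oob_score=True,      # evaluate each tree on the samples it did not see
    random_state=0,
)
forest.fit(X, y)
print("out-of-bag accuracy:", forest.oob_score_)
```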
Remark
Random forest / bagging:
1. Random forest can handle binary, continuous and categorical data.
2. It is not a boosting technique. In boosting, one learner learns from another, which in turn boosts the learning.
3. The trees in a random forest run in parallel; there is no interaction between these trees while building the trees.
4. Random forest and bagging are "bagging" algorithms that aim to reduce the complexity of models that overfit the training data.
5. Random forest tries to give more preference to hyper-parameters to optimise the model.
Boosting / XGBoost:
1. Boosting is an approach to increase the complexity of models that suffer from high bias, that is, models that underfit the training data.
2. Boosting reduces error mainly by reducing bias, and also to some extent variance, by aggregating the output from many models.
3. XGBoost always gives more importance to functional space when reducing the cost of a model.
4. Boosting methods also tend to be harder to tune than random forest.
Classes of classifiers
There are three classes of classifiers:
(i) Descriptive classifier: Descriptive classifiers are used to describe shape, size, texture or a pattern.
(ii) Instrument classifier: The handshapes of instrument classifiers describe how an object is handled.
(iii) Element classifier: It takes a set of different learners and combines them using new learning techniques.
Chapter Ends..
MODULE 5
CHAPTER 5
Learning with Clustering
Syllabus
Introduction to clustering with overview of distance metrics and major clustering approaches.
Graph Based Clustering: Clustering with minimal spanning tree, Model based Clustering: Expectation Maximization Algorithm, Density Based Clustering: DBSCAN.
5.1.2 Projecting the Data onto a Lower Dimensional Space ... 5-3
5.1.3 K-means and Spectral Clustering ... 5-4
5.2 Evaluation methods based on Ground Truth: Homogeneity ... 5-4
5.6.2 Examples on Hierarchical Clustering ... 5-8
UEx. 5.6.3 (MU - May 16, 10 Marks) ... 5-9
UEx. 5.6.4 (MU - May 17, 10 Marks) ... 5-14
5.7 Unsupervised Learning: Association Rules ... 5-15
5.7.1 Introduction to Association Rules ... 5-16
5.7.2 Apriori Algorithm ... 5-16
5.7.3 Performance Measures
5.7.4 Issues in Machine Learning ... 5-19
UQ. What are the issues in Machine learning? ... 5-19
UQ. Explain the steps required for selecting the right machine learning algorithm. (MU - May 16, 8 Marks) ... 5-19
5.7.6 Steps in Developing a Machine Learning Application ... 5-20
UQ. Explain the steps of developing Machine Learning applications. (MU - May 19, 10 Marks) ... 5-20
5.7.7 Applications of Machine Learning ... 5-21
UQ. Write short note on: Machine learning applications. (MU - May 16, May 17, 10 Marks) ... 5-21
5.8 Graph Based Clustering ... 5-22
5.8.1 Graph Clustering Algorithm ... 5-22
5.10.2 EM Algorithm ... 5-28
5.11 Expectation Maximisation Algorithm ... 5-29
5.11.1 Latent Variable Model ... 5-29
5.11.4 Multivariate Gaussian Model ... 5-29
5.11.5 Gaussian Mixture Models ... 5-30
5.12 Density Based Clustering ... 5-30
5.12.1 Irregularities in the Data ... 5-31
5.12.2 The basic Principle of DBSCAN Clustering ... 5-31
5.12.3 DBSCAN and K-means ... 5-31
Chapter Ends
5.1 INTRODUCTION TO CLUSTERING
Spectral Clustering
Spectral clustering is an algorithm which in many cases performs better than many traditional clustering algorithms.
It treats each data point as a graph node and thus transforms the clustering problem into a graph-partitioning problem.
Implementation of spectral clustering consists of three fundamental steps.
(i) Fully-connected Graph
Here, each point is connected to every other point with an undirected edge weighted by the distance between the two points. A Gaussian similarity metric is used to calculate the distance.
5.1.2 Projecting the Data onto a Lower Dimensional Space
There is a possibility that members of the same cluster may be far away from each other in the given dimensional space. Thus the dimensional space is reduced so that those points come closer in the reduced space.
The technique does not assume the data to follow any particular property, and this makes it able to answer a more generic class of problems.
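A short scikit-learn sketch of spectral clustering on two ring-shaped groups follows; the affinity choice and the sample data are illustrative assumptions, not prescriptions from the text.

```python
# Spectral clustering sketch (data and parameters are illustrative).
from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering

X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)
model = SpectralClustering(
    n_clusters=2,
    affinity="rbf",      # Gaussian similarity between every pair of points
    random_state=0,
)
labels = model.fit_predict(X)        # the graph is built, embedded and then partitioned
print("points in each cluster:", (labels == 0).sum(), (labels == 1).sum())
```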
5.1.3 K-means and Spectral Clustering
1. Spectral clustering: Data points are treated as nodes of a connected graph, and clusters are found by partitioning this graph, based on its spectral decomposition into subgraphs.
2. K-means clustering: It divides the objects into K clusters such that some metric relative to the centroids of the clusters is minimized.
2. Completeness: A perfectly complete clustering is one where all data points belonging to the same class are elements of the same cluster.
Trivial completeness: It is the case when all the data points are clustered into one cluster. It is the extreme case where homogeneity is minimum and completeness is maximum.
Fig. 5.2.1: Example clustering of labelled data points
In the above diagram, the clustering VB B)hc
cluster the data points are
homogeneous since in each Bh +C
complete because
of the same class label but it is
not
be adjusted in such way that either
a
notall data points of the same class label belong to the The factor ß can
the homogeneity is favoured or completeness of the
same class label.
algorithm is favoured.
evaluation metric is :
Red Red Class 1
The advantage of this
Green class 2 (i) It is independent of the number of class labels.
Green
Blue Blue class 3 (i) The number of clusters.
(ii) The size of the data.
Fig. 5.2.4
(iv) The clusteringalgorithmand
In the above diagram, the clustering is perfectly
(v) Is a reliable metric.
is because all data points of the same class
compiete. It
label belong to the same cluster. Remarks
5.2.3
But it is not homogeneous because the 1" cluster
contains data points of many class labels. (i) Homogeneity in clustering
Let us assume that there are N data samples, C different class labels, K clusters and a_ck data points belonging to class c and cluster k. Then the homogeneity h is given by
h = 1 - H(C|K) / H(C)
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won't change the score value.
Fig. 5.2.5: Estimated number of clusters = 3
where
H(C|K) = - Σ_{k=1..K} Σ_{c=1..C} (a_ck / N) log( a_ck / Σ_{c=1..C} a_ck )
and
H(C) = - Σ_{c=1..C} ( (Σ_{k=1..K} a_ck) / N ) log( (Σ_{k=1..K} a_ck) / N )
(ii) Homogeneity score: The score is useful to check whether the clustering algorithm meets an important requirement, i.e. a cluster should contain only samples belonging to a single class.
The completeness c is given by the formula
c = 1 - H(K|C) / H(K)
It is defined in terms of pairs of points, where (i) a is the number of pairs correctly labelled as belonging to the same subset (true positives), and (ii) b is the number of pairs correctly labelled as belonging to different subsets. The sum a + b can be regarded as the number of agreements between X and Y, and the remaining pairs as the number of disagreements between X and Y.
5.4 ADJUSTED RAND INDEX
The adjusted Rand index is the corrected-for-chance version of the Rand index. (Note: C(a, 2) = a(a - 1)/2, etc.)
This correction sets the baseline by using the expected similarity of all pair-wise comparisons between clusterings specified by a random model.
The Rand index assumes a value between 0 and +1; the adjusted Rand index may take negative values if the index is less than the expected index.
Remarks
1. Good adjusted Rand score (ARI): The adjusted Rand index (ARI) can be interpreted as follows:
(i) ARI >= 0.90: excellent recovery
(ii) 0.80 <= ARI < 0.90: good recovery
(iii) 0.65 <= ARI < 0.80: moderate recovery
(iv) ARI < 0.65: poor recovery
2. Negative ARI: A negative ARI says that the agreement is less than what is expected from a random result. This implies that the results are 'orthogonal' or 'complementary' to some extent.
5.4.1 The Contingency Table
Consider a set of n elements and two clusterings (groupings or partitions) of these elements, namely X = {X_1, X_2, ..., X_r} and Y = {Y_1, Y_2, ..., Y_s}.
The overlap between X and Y can be written in a contingency table {n_ij}, where each entry n_ij denotes the number of objects in common between X_i and Y_j: n_ij = |X_i ∩ Y_j|. We prepare a table whose row sums are a_i and whose column sums are b_1, b_2, ..., b_s.
5.4.2 Definition
The original adjusted Rand Index (ARI) is given by
ARI = [ Σ_ij C(n_ij, 2) - ( Σ_i C(a_i, 2) * Σ_j C(b_j, 2) ) / C(n, 2) ] / [ (1/2) ( Σ_i C(a_i, 2) + Σ_j C(b_j, 2) ) - ( Σ_i C(a_i, 2) * Σ_j C(b_j, 2) ) / C(n, 2) ]
where n_ij, a_i and b_j are values from the contingency table.
5.5 CONCEPTS OF WEAK AND EAGER LEARNER
The distinction between eager learners and lazy learners is based on when the algorithm learns from the data. A lazy learner delays learning from the data until it receives test data.
Lazy learner
(i) It just stores the dataset without learning from it.
(ii) It starts classifying data when it receives test data.
(iii) It takes less time for learning and more time for classifying data.
Eager learner
(i) When it receives the dataset, it starts classifying (learning).
(ii) It does not wait for test data to learn.
(iii) It takes a long time for learning and less time for classifying data.
Examples
In supervised learning, lazy learners include K-Nearest Neighbour and Case-based Reasoning.
Eager and lazy learning in machine learning
Eager learning methods construct a general, explicit description of the target function. It is based on the provided training examples.
Lazy learning methods simply store the data and generalize the data only when an explicit request is made.
Eager learners construct a classification model based on the given training data, even before receiving data for classification. An eager learner considers a single hypothesis that covers the entire instance space.
A lazy decision tree constructs a customised decision tree for each test instance, which consists of only a single path from the root to a leaf node.
Lazy algorithm
A lazy learning algorithm is a learning algorithm that can be applied by a lazy learning system (to solve a lazy learning task).
Lazy learner context
It lazily postpones any work until it receives testing records, and only performs the work which is necessary to predict their target values.
In machine learning, lazy learning is a learning method in which generalisation of the training data is delayed until a query is made to the system.
Naive Bayes is an eager learner
The Naive Bayes classification algorithm is one of the eager learning algorithms; it assumes class-conditional independence and it predicts the class label in the fastest manner.
5.5.1 Example of a Lazy Learner
Lazy learning refers to any machine learning process in which the majority of the computation is deferred to consultation time. The typical examples are: (i) instance-based learning and (ii) Lazy Bayesian Rules.
Lazy learning stands in contrast to eager learning, in which the majority of the computation occurs at training time, before a new instance is classified.
5.6 UNSUPERVISED LEARNING: HIERARCHICAL CLUSTERING
5.6.1 Hierarchical Clustering
Agglomerative Hierarchical Clustering
In agglomerative clustering, initially each data point is considered as a single cluster. In the next step, pairs of clusters are merged or agglomerated.
This step is repeated until all clusters have been merged into a single cluster. At the end a single cluster remains that contains all the data points.
Hierarchical clustering algorithms work in a top-down manner or a bottom-up manner. Bottom-up hierarchical clustering is known as hierarchical agglomerative clustering.
In agglomerative clustering, the clustering is represented as a dendogram as in Fig. 5.6.1, where each merge is represented by a horizontal line.
A clustering of the data objects is obtained by cutting the dendogram at the desired level; each connected component then forms a cluster.
The basic steps of agglomerative hierarchical clustering are as follows:
1. Compute the proximity matrix (distance matrix).
2. Assume each data point is a cluster.
3. Repeat:
4. Merge the two nearest clusters.
5. Update the proximity matrix.
6. Until only a single cluster remains.
In complete linkage, the distance between two clusters is considered to be equal to the greatest distance from any member of one cluster to any member of the other cluster:
D(r, s) = Max { d(i, j) : object i in cluster r and object j in cluster s }
In average linkage, the distance between any two clusters A and B is taken to be equal to the average of all distances between pairs of objects i in A and j in B, i.e., the mean distance between elements of each cluster:
D(r, s) = Mean { d(i, j) : object i in cluster r and object j in cluster s }
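The steps above can be carried out with SciPy, which also draws the dendrogram; the six two-dimensional points reuse the coordinates of D1-D6 from the worked example that follows, and single linkage is chosen to match it.

```python
# Agglomerative (single-linkage) clustering sketch with SciPy; points are D1..D6 from the example.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                   [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])
Z = linkage(points, method="single")          # repeatedly merge the two nearest clusters
dendrogram(Z, labels=["D1", "D2", "D3", "D4", "D5", "D6"])
plt.show()                                    # each horizontal join is one merge
```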
Fig.: Agglomerative clustering flowchart - merge the two closest clusters and update the distance matrix; repeat until the number of clusters is 1.
The distance of data point D1 = (0.4, 0.53) to D2 = (0.22, 0.38) is
D = sqrt((0.4 - 0.22)^2 + (0.53 - 0.38)^2) = 0.24
The distance of data point D1 = (0.4, 0.53) to D3 = (0.35, 0.32) is
D = sqrt((0.4 - 0.35)^2 + (0.53 - 0.32)^2) = 0.22
The distance of data point D1 = (0.4, 0.53) to D4 = (0.26, 0.19) is
D = sqrt((0.4 - 0.26)^2 + (0.53 - 0.19)^2) = 0.37
The distance of data point D1 = (0.4, 0.53) to D5 = (0.08, 0.41) is
D = sqrt((0.4 - 0.08)^2 + (0.53 - 0.41)^2) = 0.34
The distance of data point D1 = (0.4, 0.53) to D6 = (0.45, 0.30) is
D = sqrt((0.4 - 0.45)^2 + (0.53 - 0.30)^2) = 0.23
Distance matrix
D1   0
D2   0.24   0
D3   0.22   0.15   0
D4   0.37   0.20   0.15   0
D5   0.34   0.14   0.28   0.29   0
D6   0.23   0.25   0.11   0.22   0.39   0
     D1     D2     D3     D4     D5     D6
0.11 is smallest. D3 and D6 have the smallest distance. So, we combine these two in one cluster and recalculate the distance matrix.
Distance((D3, D6), D1) = min(distance(D3, D1), distance(D6, D1)) = min(0.22, 0.23) = 0.22
Distance((D3, D6), D2) = min(distance(D3, D2), distance(D6, D2)) = min(0.15, 0.25) = 0.15
Distance((D3, D6), D4) = min(distance(D3, D4), distance(D6, D4)) = min(0.15, 0.22) = 0.15
Distance((D3, D6), D5) = min(distance(D3, D5), distance(D6, D5)) = min(0.28, 0.39) = 0.28
Similarly we will calculate all distances.
Distance matrix
D1        0
D2        0.24   0
(D3, D6)  0.22   0.15   0
D4        0.37   0.20   0.15   0
D5        0.34   0.14   0.28   0.29   0
          D1     D2     (D3, D6)  D4   D5
0.14 is smallest. D2 and D5 have the smallest distance. So, we combine these two in one cluster and recalculate the distance matrix.
Distance((D3, D6), (D2, D5)) = min(distance(D3, D2), distance(D6, D2), distance(D3, D5), distance(D6, D5)) = min(0.15, 0.25, 0.28, 0.29) = 0.15
Distance matrix
D1        0
(D2, D5)  0.24   0
(D3, D6)  0.22   0.15   0
D4        0.37   0.20   0.15   0
          D1     (D2, D5)  (D3, D6)  D4
0.15 is smallest. (D2, D5) and (D3, D6), as well as D4 and (D3, D6), have the smallest distance. We can pick either one.
Distance matrix
D1               0
(D2, D5, D3, D6) 0.22   0
D4               0.37   0.15   0
                 D1     (D2, D5, D3, D6)  D4
0.15 is smallest. (D2, D5, D3, D6) and D4 have the smallest distance. So, we combine these two in one cluster and recalculate the distance matrix.
Distance matrix
D1                    0
(D2, D5, D3, D6, D4)  0.22   0
                      D1     (D2, D5, D3, D6, D4)
Now a single cluster remains (D2, D5, D3, D6, D4, D1). Next, we represent the final dendogram for single linkage.
Fig.: Dendrogram for single linkage (root: one node; leaves: the individual clusters D3, D6, D4, D2, D5, D1)
Now we solve the same example using complete linkage.
0.11 is smallest, so D3 and D6 are combined first and the distance matrix is recalculated with the maximum distance, e.g.
Distance((D3, D6), D1) = max(distance(D3, D1), distance(D6, D1)) = max(0.22, 0.23) = 0.23
Distance matrix
D1        0
D2        0.24   0
(D3, D6)  0.23   0.25   0
D4        0.37   0.20   0.22   0
D5        0.34   0.14   0.39   0.29   0
          D1     D2     (D3, D6)  D4   D5
0.14 is smallest, so D2 and D5 are merged next, giving
Distance matrix
D1        0
(D2, D5)  0.34   0
(D3, D6)  0.23   0.39   0
D4        0.37   0.29   0.22   0
          D1     (D2, D5)  (D3, D6)  D4
0.22 is smallest. Here (D3, D6) and D4 have the smallest distance. So, we combine these two in one cluster and recalculate the distance matrix, continuing in the same way until a single cluster remains.
Next, we represent the final dendogram for complete linkage (leaves D3, D6, D4, D2, D5, D1).
Now we solve the same example using average linkage, where the distance between clusters is the mean of the pairwise distances, e.g.
Distance((D3, D6), D1) = 1/2 (distance(D3, D1) + distance(D6, D1)) = 1/2 (0.22 + 0.23) = 0.23
Similarly, we calculate all distances and keep merging the closest clusters until a single cluster remains.
matrix.
0 min (6, 3) = 3
(P2. P))
=
D distance
distances.
0.24 will calculate all
(D2, Ds) Similarly, we
0
(D2, D5, D) P. P) P P4 Ps
0
(D. Dg, D) 0.26 distanc-
have smallest
(D3, Dg,D4 smallest. (P, P)
and P3
3 is
(D2. Ds, Di two in one
cluster and
recalculE
22-23) (M7-79)
(New syllabusw.e.f academic year
Distance((P1, P2, P3), P4) = min(distance(P1, P4), distance(P2, P4), distance(P3, P4)) = min(9, 7) = 7
Similarly, we will calculate all distances.
Distance matrix
(P1, P2, P3)   0
P4             7    0
P5             5    4    0
               (P1, P2, P3)   P4   P5
4 is smallest. P4 and P5 have the smallest distance. So, we combine these two in one cluster and recalculate the distance matrix.
Distance matrix
(P1, P2, P3)   0
(P4, P5)       5    0
               (P1, P2, P3)   (P4, P5)
Now a single cluster remains (P1, P2, P3, P4, P5). Next, we represent the final dendogram for single linkage (leaves P1, P2, P3, P4, P5).
Now we solve the same example using complete linkage.
2 is smallest. P1 and P2 have the smallest distance. So, we combine these two in one cluster and recalculate the distance matrix.
Distance((P1, P2), P3) = max(distance(P1, P3), distance(P2, P3)) = max(6, 3) = 6
Similarly, we will calculate all distances.
Distance matrix
(P1, P2)   0
P3         6    0
P4         10   7    0
P5         9    5    4    0
           (P1, P2)   P3   P4   P5
4 is smallest. P4 and P5 have the smallest distance. So, we combine these two in one cluster and recalculate the distance matrix.
Distance matrix
(P1, P2)   0
P3         6    0
(P4, P5)   10   7    0
           (P1, P2)   P3   (P4, P5)
6 is smallest. (P1, P2) and P3 have the smallest distance. So, we combine these two in one cluster and recalculate the distance matrix, and we continue until a single cluster remains.
Now we solve the same example using average linkage. The original distance matrix is
P1    0
P2    2     0
P3    6     3     0
P4    10    9     7     0
P5    9     8     5     4     0
      P1    P2    P3    P4    P5
2 is smallest. P1 and P2 have the smallest distance. So, we combine these two in one cluster and recalculate the distance matrix.
Distance((P1, P2), P3) = 1/2 (distance(P1, P3) + distance(P2, P3)) = 1/2 (6 + 3) = 4.5
Similarly, we will calculate all distances.
Distance matrix
(P1, P2)   0
P3         4.5    0
P4         9.5    7     0
P5         8.5    5     4     0
           (P1, P2)   P3   P4   P5
4 is smallest. P4 and P5 have the smallest distance. So, we combine these two in one cluster and recalculate the distance matrix.
Distance matrix
(P1, P2)   0
P3         4.5    0
(P4, P5)   9      6     0
           (P1, P2)   P3   (P4, P5)
4.5 is smallest. (P1, P2) and P3 have the smallest distance. So, we combine these two in one cluster and recalculate the distance matrix.
Distance matrix
(P1, P2, P3)   0
(P4, P5)       8    0
               (P1, P2, P3)   (P4, P5)
Finally all clusters are combined into a single cluster.
UEx. 5.6.3 (MU - May 16, 10 Marks)
Apply the Agglomerative clustering algorithm on the given data and draw the dendrogram. Show three clusters with their allocated points. Use the single-link method. The distances between the points A, B, C, D, E and F are given (values such as sqrt(2), sqrt(5), sqrt(8), sqrt(10), sqrt(13), sqrt(17), sqrt(18) and sqrt(20)).
Soln. :
Distance matrix
A    0
B    1.414    0
C    3.162    2.828    0
D    4.123    1        2.236    0
E    2.236    1        2.236    2       0
F    4.472    4.242    2        3       3.6    0
     A        B        C        D       E      F
1 is smallest. B, D and B, E have the smallest distance. We can select anyone. So, we combine B, D in one cluster and recalculate the distance matrix using single linkage.
Here is one more example, with six points P1 to P6, for which the distance matrix is
P1    0
P2    0.707    0
P3    5.656    4.949    0
P4    3.605    2.915    2.236    0
P5    4.242    3.535    1.414    1        0
P6    5.201    2.5      1.802    0.5      1.118    0
      P1       P2       P3       P4       P5       P6
0.5 is smallest. P4 and P6 have the smallest distance. We can select either; so, we combine these in one cluster and recalculate the distance matrix using complete linkage.
0.707 is smallest. P1 and P2 have the smallest distance. So, we combine these two in one cluster and recalculate the distance matrix. P5 is then merged with (P4, P6), giving the cluster (P4, P5, P6).
Distance matrix
(P1, P2)       0
P3             5.656    0
(P4, P5, P6)   5.201    2.236    0
               (P1, P2)   P3   (P4, P5, P6)
2.236 is smallest, so (P4, P5, P6) and P3 are combined together.
Distance matrix
(P1, P2)            0
(P3, P4, P5, P6)    5.656    0
                    (P1, P2)   (P3, P4, P5, P6)
Next we will combine all clusters into a single cluster.
Now we will solve using average linkage. Proceeding in the same way (P4 and P6, then P1 and P2, then P5 are merged), the distance matrix becomes
(P1, P2)       0
P3             5.302    0
(P4, P5, P6)   3.66     1.817    0
               (P1, P2)   P3   (P4, P5, P6)
1.817 is smallest, so (P4, P5, P6) and P3 are combined together.
Distance matrix
(P1, P2)            0
(P3, P4, P5, P6)    4.07    0
                    (P1, P2)   (P3, P4, P5, P6)
Next we will combine all clusters into a single cluster.
5.7 UNSUPERVISED LEARNING: ASSOCIATION RULES
5.7.1 Introduction to Association Rules
Associations are any things that generally occur together. Let us understand this with the help of an example: as a shop owner we are interested in finding out the products that generally sell together.
This might suggest the construction of an association rule - if candle then cake; however, this is predictable, since a large proportion of the people buying a cake also buy a candle.
Association rule mining algorithms are applied to find the frequent item sets from the huge amount of data in a database.
In market basket analysis, mostly the Apriori algorithm is used. The algorithm is used to find the products that can be bought together. The Apriori algorithm can also be used in the medical discipline to find the reactions of drugs on patients.
What is a Frequent Item set?
Item sets having support more than the set minimum support or threshold value are called frequent item sets. Let us see an example: if there are frequent item sets P and Q, then individually P and Q should also be frequent item sets.
Let us assume there are two transactions: P = {3, 4, 6, 7, 8} and Q = {4, 6, 9}; in these two transactions, 4 and 6 are the frequent item sets.
Apriori Algorithm Working
The Apriori algorithm works according to the following steps:
1. Find the support of item sets in the transactional database. Choose the minimum support and confidence.
2. From the transactions, consider all item sets with more support value than the minimum or chosen support value.
3. Identify all the rules of these subsets which have more confidence value than the threshold or minimum confidence.
4. Sort the rules in decreasing order of lift.
Example: Let us assume that we have the following dataset that contains various transactions. We want to identify the frequent item sets and create the association rules with the Apriori algorithm. In this example a minimum support of 2 and a minimum confidence of 50% are given.
Transaction ID | Item sets
T1 | 1, 2
T2 | 2, 4
T3 | 2, 3
T4 | 1, 2, 4
T5 | 1, 3
T6 | 2, 3
T7 | 1, 3
T8 | 1, 2, 3, 5
T9 | 1, 2, 3
Soln. :
Step 1: Calculating S1 and I1
Initially we will generate a table that contains the support count of each item set in the given dataset. This table is called the candidate set or S1.
Itemset | Support count
1 | 6
2 | 7
3 | 5
4 | 2
5 | 1
Next, we will consider all the item sets that have a support count greater than or equal to the minimum support, i.e. 2. This will give us the table for the frequent item set I1. Here all the item sets have a support count greater than or equal to the minimum support, except item 5, so the item set {5} will be removed.
Itemset | Support count
1 | 6
2 | 7
3 | 5
4 | 2
Step 2: Calculating S2 and I2
Next, we will create S2 using I1. In S2, we will generate pairs of the item sets of I1 in the form of subsets.
Once we create the subsets, we will again find the support count from the main transaction table. Here we will check how many times these pairs have occurred together in the given dataset. So, we will get the following table for S2:
Itemset | Support count
(1, 2) | 4
Now we will form the S3 table with subsets of three item sets together, and will calculate the support count from the dataset. It will give the following table.
After computing the confidence value for all rules, we will exclude the rules that have less confidence than the minimum threshold (50%), so the first three rules 1 ^ 2 -> 3, 2 ^ 3 -> 1 and 1 ^ 3 -> 2 can be considered as the strong association rules for the given problem.
from the dataset. It will give the following
1. Support

Support is defined as the number of occurrences of an item in the dataset. It is also defined as the fraction of the transactions T that contain the itemset I. For an itemset I and transactions T, it can be written as:

Support (I) = Freq (I, T) / T

2. Confidence

Confidence represents how likely the rule has been found to be true. We can also say it is how many times the items P and Q appear together in the dataset when the occurrence of P is already given. It is the ratio of the transactions that contain both P and Q to the number of transactions that contain P.
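The following is a minimal Python sketch (not part of the original text) that reproduces these support and confidence computations on the transaction dataset of the example above; the helper function names are our own.

```python
from itertools import combinations

# Transaction dataset from the worked example above (T1..T9)
transactions = [
    {1, 2}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return support_count(antecedent | consequent) / support_count(antecedent)

# Candidate set S1: support count of each single item
items = sorted(set().union(*transactions))
print({i: support_count({i}) for i in items})

# Pairs with support count >= minimum support (2), as in table S2
print([(p, support_count(set(p)))
       for p in combinations(items, 2) if support_count(set(p)) >= 2])

# Confidence of the rule 1 ^ 2 -> 3
print(confidence({1, 2}, {3}))
```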
5.7.4 Issues in Machine Learning

UQ. What are the issues in Machine learning ? (MU - May 15, 5 Marks)

1. Which algorithm should we select to learn general target functions from a specific training dataset ? What should be the settings for particular algorithms, so as to converge to the desired function, given sufficient training data ? Which algorithms perform best for which types of problems and representations ?
2. How much training data is sufficient ? What general bounds can be found to relate the confidence in the learned hypotheses to the amount of training data and the character of the learner's hypothesis space ?
3. When and how can prior knowledge held by the learner guide the process of generalizing from examples ? Can prior knowledge be helpful even when it is only approximately correct ?
4. What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem ?
5. To reduce the task of learning to one or more function approximation problems, what will be the best approach ? What specific functions should the system attempt to learn ? Can this process itself be automated ?
6. To improve the knowledge representation and to learn the target function, how can the learner automatically alter its representation ?

To choose the right algorithm, first decide whether you will use supervised or unsupervised learning :

(a) If you have chosen supervised learning, then next you need to focus on what your target value is.
    - If the target value is discrete (e.g. Yes / No, 1 / 2 / 3, A / B / C), then use Classification.
    - If the target value is continuous, i.e. a number of values (e.g. 0 to 100, -99 to 99), then use Regression.

(b) If you have chosen unsupervised learning, then next you need to focus on what your aim is.
    - If you want to fit your data into some discrete groups, then use Clustering.
    - If you want to find a numerical estimate of how strongly the data fits into each group, then use a density estimation algorithm.

Table 5.7.1 : Selection of Algorithm

2. Data : Are the features continuous or nominal ? Are there missing values in the features ? If yes, what is the reason for the missing values ? Are there outliers in the data ? To narrow the algorithm selection process, all of these characteristics of your data can help you.

4. The importance of this step is that it makes you understand that you don't have any garbage values coming in.

5. Train the algorithm : Good, clean data from the first two steps is given as input to the algorithm. The algorithm extracts information or knowledge. This knowledge is mostly stored in a format that is readily usable by the machine for the next two steps.
5.7.7 Applications of Machine Learning

(1) Association rule

In finding an association rule, we are interested in learning a conditional probability of the form P(Q | P), where Q is the product we would like to condition on P, the product or products which we know the customer has already purchased. For example,

P (Milk | Bread) = 0.7

implies that 70% of the customers who buy bread also buy milk.

(2) Classification

A credit is an amount of money loaned by a financial institution.

(Figure : customers plotted by their savings, with a 'Low-Risk' region marked.)

(3) Regression

Let's take the inputs as the area of the flat, its location, purchase year and other information that affects the rate of the flat. The output is the price of the flat. Applications where the output is numeric are regression problems.

Let X represent the flat features and Y the price of the flat. We can collect training data by surveying past purchase transactions, and the Machine Learning algorithm fits a function to this data to learn Y as a function of X for suitable values of w and w0 :

Y = w X + w0
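As an illustration (not from the original text), a least-squares fit of Y = wX + w0 can be computed as below; the flat areas and prices used are made-up numbers.

```python
import numpy as np

# Hypothetical training data: flat area in square feet (X) and price in lakhs (Y)
X = np.array([500.0, 750.0, 1000.0, 1250.0, 1500.0])
Y = np.array([45.0, 60.0, 78.0, 95.0, 110.0])

# Least-squares fit of Y = w * X + w0
w, w0 = np.polyfit(X, Y, deg=1)
print(w, w0)

# Predicted price of a 1100 sq. ft. flat
print(w * 1100 + w0)
```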
Now you will classify them using unsupervised learning (no prior knowledge), and this classification can be on the basis of the color of the items, the shape of the items, the material used for the items, the type of the items, or whatever way you would like.

(5) Reinforcement Learning

There are some applications where the output of the system is a sequence of actions. In such applications the sequence of correct actions, rather than a single action, is important in order to reach the goal. An action is said to be good if it is part of a good policy. A machine learning program generates a policy by learning previous good action sequences. Such methods are called reinforcement learning methods.

A good example of reinforcement learning is chess playing. In artificial intelligence and machine learning, one of the most important research areas is game playing. Games can be easily described, but at the same time they are quite difficult to play well. Let's take the example of chess, which has a limited number of rules, but the game is very difficult because for each state there can be a large number of possible moves.

Another application of reinforcement learning is robot navigation. The robot can move in all possible directions at any point of time. The algorithm should reach the goal state from an initial state by learning the correct sequence of actions after conducting a number of trial runs.

When the system has unreliable and partial sensory information, reinforcement learning becomes complex. Let's take the example of a robot with incomplete camera information. Here the robot does not know its exact location.

5.8 GRAPH BASED CLUSTERING

Graph clustering is an important subject, and deals with clustering using graphs.

The data of a clustering problem can be represented as a graph, where each element to be clustered is represented as a node and the distance between two elements is modeled by a certain weight on the edge linking the nodes.

Thus in graph clustering, elements within a cluster are connected to each other but have no connection to elements outside that cluster.

5.8.1 Graph Clustering Algorithm

The HCS (Highly Connected Subgraphs) clustering algorithm (also known by other names such as highly connected clusters / components / kernels) is an algorithm based on graph connectivity for cluster analysis.

It works by representing the similarity data in a similarity graph, and then finding all the highly connected subgraphs.

It does not make any prior assumptions on the number of clusters.

The HCS algorithm gives a clustering solution which is inherently meaningful in the application domain.

Fig. 5.8.1 : Graph partitioning; each connected component is a cluster
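Below is a small Python sketch of this idea (not from the original text). It assumes the networkx library is available and uses a simplified "highly connected" test (minimum edge cut larger than n/2), splitting along the minimum cut otherwise.

```python
import networkx as nx

def hcs(graph):
    """HCS-style clustering sketch: a subgraph is accepted as a cluster if its
    minimum edge cut is larger than n/2, otherwise it is split along that cut
    and both halves are processed recursively."""
    n = graph.number_of_nodes()
    if n <= 2:
        return [set(graph.nodes)]
    cut_edges = nx.minimum_edge_cut(graph)
    if len(cut_edges) > n / 2:          # "highly connected" subgraph
        return [set(graph.nodes)]
    g = graph.copy()
    g.remove_edges_from(cut_edges)
    parts = [graph.subgraph(c).copy() for c in nx.connected_components(g)]
    return [cluster for p in parts for cluster in hcs(p)]

# Similarity graph: two tightly connected groups joined by a single edge
G = nx.Graph([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)])
print(hcs(G))    # expected: [{1, 2, 3}, {4, 5, 6}]
```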
(2) Clustering as graph partitioning

Two things are required :

(i) An objective function to determine what would be the best way to 'cut' the edges of a graph. (Fig. 5.8.2)

(ii) An algorithm to find the optimal partition (optimal according to the objective function).

(3) Objective function for partitioning

Suppose we want to partition the set of vertices V into two sets, V1 and V2. One possible objective function is to minimise the graph cut :

Cut (V1, V2) = Σ (i ∈ V1, j ∈ V2) W_ij

where W_ij is the weight of the edge between nodes i and j.

(Fig. 5.8.3 : two example partitions of a small weighted graph, with cut values 0.2 and 0.4.)

(4) Objective function for partitioning : limitation of minimising the graph cut

The optimal solution might be to split off a single node from the rest of the graph (Fig. 5.8.4, with cut value 0.1). This is not a desirable solution.

(5) Objective function for partitioning

Our aim is not only to minimise the graph cut, but also to look for 'balanced' clusters :

Ratio Cut (V1, V2) = Cut (V1, V2) / |V1| + Cut (V1, V2) / |V2|

Normalised Cut (V1, V2) = Cut (V1, V2) / d(V1) + Cut (V1, V2) / d(V2),  where d(Vk) = Σ (i ∈ Vk) d_i  and  d_i = Σ_j W_ij

Here V1 and V2 are the sets of nodes in partitions 1 and 2, and |V| is the number of nodes in partition V. (Fig. 5.8.5)

(a) Example : Figs. 5.8.6 and 5.8.7 show two candidate partitions of the same weighted graph; their ratio cut values work out to approximately 0.12 and 0.13, and the corresponding normalised cut values are compared in the same way.

(b) Example : If the graph is unweighted (or all edges have the same weight), consider the two cuts, Cut 1 and Cut 2, shown in Figs. 5.8.8 and 5.8.9.
For Cut 1 (which splits off a single node) and Cut 2 (a more balanced split), the ratio cut values work out to about 1.2 and 0.67 respectively, so the balanced Cut 2 is preferred; the normalised cut leads to the same conclusion.

(7) Spectral clustering

Spectral properties of a graph :

(i) We find the eigenvalues and eigenvectors of the adjacency matrix. They can be used to represent a graph.

(ii) There exists a relationship between the spectral properties of a graph and the graph partitioning problem.

Method

Preliminaries

(Fig. 5.8.10 : for a graph consisting of two clusters, the adjacency matrix W can be arranged into two block-diagonal blocks, one per cluster.)

(8) Graph Laplacian matrix

From the degree matrix D and the adjacency matrix W we form the Laplacian, L = D - W.

(Fig. 5.8.11 : the Laplacian of the two-cluster graph also has a block structure.)

(Figure : 'Spectral clustering' scatter plots of the resulting clusters.)
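To make the block-structure idea concrete, here is a small NumPy sketch (not from the original text; the adjacency matrix is an illustrative one, since the matrices of Figs. 5.8.10 and 5.8.11 are not legible in the source). It forms L = D - W and partitions the graph using the sign of the eigenvector of the second-smallest eigenvalue (the Fiedler vector).

```python
import numpy as np

# Illustrative adjacency matrix W of a small graph with two natural groups
W = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

D = np.diag(W.sum(axis=1))          # degree matrix
L = D - W                           # graph Laplacian, L = D - W

# Fiedler vector: eigenvector of the second-smallest eigenvalue of L
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]
partition = fiedler > 0             # sign pattern gives a 2-way partition
print(partition)                    # e.g. nodes {0,1,2} vs. {3,4,5}

# Cut value of this partition: total weight of edges crossing it
cut = W[np.ix_(partition, ~partition)].sum()
print(cut)                          # 1.0 for this graph
```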
5.9 MST-BASED CLUSTERING

Here we discuss an MST-based clustering algorithm built around a 'cluster centre initialisation' algorithm, called CCIMST.

CCIMST is based on the geodesic distance and the dual densities of the points. We shall demonstrate that the inconsistent edge is located on the shortest path between the cluster centres, so that we can find the inconsistent edge using the lengths of the edges as well as the densities of their endpoints on the shortest path.

(1) Input : Dataset X = {x1, x2, ..., xn}, xi ∈ R^d.

The density ρi of point xi is defined in terms of exp( - D(xi, xj)² / (2σ²) ), where D(xi, xj) is the Euclidean distance between points xi and xj and σ is the variance.

Let T = (V, E') be the minimum spanning tree of the data, where E' = {e1, e2, ..., e(n-1)}, e ∈ E(G). There is one and only one path between each pair of vertices in T.

5.9.2 Definition : Geodesic Distance

Suppose P = {p1, p2, ..., pl} ⊆ V is the path between two vertices xu and xv in T, where each edge (pk, pk+1) ∈ E'. The geodesic distance between xu and xv is the sum of the lengths of the edges along this path.

Once the groups of cluster centres are achieved, the inter-cluster distance can be defined in different ways; one approach is based on the average pairwise distance between points in the different clusters. Based on the above idea, we use the inter-cluster distance based on the distance between the points at the intersection of two different clusters. We mention the detailed algorithm below.

5.9.3 Inconsistent Edge

For a dataset with K categories, the MST-based clustering method partitions the MST into K subtrees Ti by removing (K - 1) inconsistent edges. A weight cij is computed for each edge from the Euclidean distance D(xi, xj) and the local and global densities ei and ej of its endpoints xi and xj respectively. For each of the (K - 1) paths between cluster centres, we find and remove the inconsistent edge, which partitions the MST into K clusters.
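A much-simplified Python sketch of MST-based clustering follows (not the full CCIMST procedure: here plain edge length is used as the inconsistency measure, whereas CCIMST also uses geodesic distances and point densities). SciPy is assumed to be available.

```python
import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

# Toy dataset with K = 2 well-separated groups
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 5.1], [4.9, 5.3]])
K = 2

# Build the MST of the complete Euclidean graph
D = distance_matrix(X, X)
mst = minimum_spanning_tree(D).toarray()

# Remove the (K - 1) longest MST edges (edge length used as the
# inconsistency measure in this simplified sketch)
edges = np.argwhere(mst > 0)
lengths = mst[mst > 0]
for idx in np.argsort(lengths)[-(K - 1):]:
    i, j = edges[idx]
    mst[i, j] = 0.0

# Each remaining connected component of the pruned MST is one cluster
_, labels = connected_components(mst, directed=False)
print(labels)      # e.g. [0 0 0 1 1 1]
```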
The expectation maximisation algorithm performs maximum likelihood estimation in the presence of latent variables. First it estimates the values of the latent variables, then it optimises the model, and it repeats the two steps till convergence takes place.
5.10.1 Use of EM

(i) The Expectation Maximisation (EM) algorithm can find maximum-likelihood estimates for model parameters when the data is incomplete.
(ii) It is an iterative way to approximate the maximum of the likelihood function.
(iii) It can be used to fill in the missing data in a sample.
(iv) It can be used as the basis of unsupervised learning of clusters.
(v) It can be used for estimating the parameters of a Hidden Markov Model (HMM).
(vi) It can be used to find the values of latent variables.

We again mention that the essence of the Expectation-Maximisation algorithm is to use the available data of the dataset to estimate the missing data, and then to use that data to update the values of the parameters.

5.10.2 EM Algorithm

GQ. Write short note on : EM algorithm. (5 Marks)

In statistics, an expectation-maximisation (EM) algorithm is an iterative method. It finds maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.

The EM iteration alternates between :

(i) An expectation (E) step : it creates a function for the expectation of the log-likelihood, evaluated using the current estimate of the parameters, and
(ii) A maximisation (M) step : it computes parameters maximising the expected log-likelihood found in the E-step.

These parameter estimates are then used to determine the distribution of the latent variables in the next E-step. The above two steps are repeated till convergence takes place.

Fig. 5.10.1 : E-step (update the latent variables) and M-step (update the hypothesis)

Flow-chart for the EM algorithm : Start, choose initial values, perform the Expectation (E) step, perform the Maximization (M) step, check convergence; if not converged, return to the E-step, otherwise stop. (Fig. 5.10.2)

5.10.3 Advantages of EM Algorithm

(i) It is guaranteed that the likelihood will increase with each iteration.
(ii) The E-step and M-step are quite easy to implement for many problems.
(iii) Solutions to the M-steps exist in closed form.

Disadvantages of EM algorithm

(i) It has slow convergence.
(ii) It converges to a local optimum only.
(iii) It requires both the forward and backward probabilities (note that numerical optimisation requires only the forward probability).

5.10.4 Applications of EM-Algorithm

The latent variable model has several real-life applications in Machine Learning :

(i) It is used to calculate the Gaussian density of a function.
(ii) It helps to fill in the missing data in a sample.
(iii) It finds plenty of applications in domains like Natural Language Processing (NLP), Computer Vision, etc.
(iv) It is also helpful for image reconstruction in the fields of Medicine and Structural Engineering.
(v) It is used to estimate the parameters of mixture models like Gaussian Mixture Models.
5.11 EXPECTATION MAXIMISATION ALGORITHM

Purpose of the algorithm

In many of the problem statements of machine learning, we obtain many relevant features to build the required model, but only a few of these features are observable. (The non-observable features are also called latent features.)

As we do not have the values for the non-observable (latent) variables, the Expectation Maximisation algorithm tries to use the existing data to determine the optimum values for these variables, and then one can find the model parameters.

(Fig. 5.11.1 : a one-dimensional Gaussian density curve.)

The higher the value of σ (the standard deviation), the more the spread along the X-axis. In 1-D space, the probability density function of the Gaussian distribution is given by

p(x) = (1 / (σ √(2π))) exp( - (x - μ)² / (2σ²) )

A mixture of two Gaussians can then be written as

g(x | θ) = P g1(x | μ1, σ1) + (1 - P) g2(x | μ2, σ2)

and we find the parameters (P, μ1, σ1, μ2, σ2) from the data using the EM procedure. (Fig. 5.12.2)
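A compact Python sketch (not from the original text) of EM for exactly this two-component 1-D mixture is shown below; the synthetic data and initial guesses are made up for illustration, and NumPy/SciPy are assumed to be available.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (illustrative parameters)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

# Initial guesses for (P, mu1, sigma1, mu2, sigma2)
P, mu1, s1, mu2, s2 = 0.5, -1.0, 1.0, 1.0, 1.0

for _ in range(100):
    # E-step: responsibility of component 1 for each point
    p1 = P * norm.pdf(x, mu1, s1)
    p2 = (1 - P) * norm.pdf(x, mu2, s2)
    r = p1 / (p1 + p2)

    # M-step: re-estimate the parameters from the responsibilities
    P = r.mean()
    mu1 = np.sum(r * x) / r.sum()
    mu2 = np.sum((1 - r) * x) / (1 - r).sum()
    s1 = np.sqrt(np.sum(r * (x - mu1) ** 2) / r.sum())
    s2 = np.sqrt(np.sum((1 - r) * (x - mu2) ** 2) / (1 - r).sum())

print(P, mu1, s1, mu2, s2)   # close to 0.6, -2, 1, 3, 1.5
```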
5.12 DENSITY BASED CLUSTERING

For non-convex clusters and outliers / noise, the K-means algorithm has difficulties in identifying clusters with arbitrary shapes.

Density-based spatial clustering of applications with noise (DBSCAN) is a clustering method. Clusters are dense regions in the data space, separated by regions of lower density of points. The DBSCAN algorithm is based on this notion of 'clusters' and 'noise'. The basic concept is that for each point of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.

The DBSCAN algorithm requires two parameters :

1. eps : It defines the neighbourhood around a data point; if the distance between two points is lower than or equal to 'eps', then they are considered as neighbours. If the eps value is chosen very small, then a large part of the data will be considered as outliers. If it is chosen very large, then the clusters will merge and the majority of the data points will be in the same cluster. Hence the eps value is based on the k-distance graph.

2. MinPts : the minimum number of points required within the eps neighbourhood of a point.
Border point : A point which has fewer than MinPts points within eps, but which is in the neighbourhood of a core point.

Noise or outlier : A point which is neither a core point nor a border point.

(Figure : example with MinPts = 4 and eps = 1 unit, showing core points, a border point and the eps neighbourhood.)

We iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are noise.

5.12.2 The Basic Principle of DBSCAN Clustering

The main principle of DBSCAN is to find neighbourhoods of data points that exceed a certain density threshold.
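A short usage sketch (not from the original text, scikit-learn assumed available): the eps and min_samples arguments of sklearn's DBSCAN correspond to the eps and MinPts parameters discussed above, and the synthetic data is made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two dense blobs plus a far-away point that should be labelled as noise
X = np.vstack([rng.normal([0, 0], 0.3, (30, 2)),
               rng.normal([5, 5], 0.3, (30, 2)),
               [[10.0, 10.0]]])

# eps = neighbourhood radius, min_samples = MinPts
db = DBSCAN(eps=1.0, min_samples=4).fit(X)
print(db.labels_)        # cluster indices; noise points get the label -1
```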
Chapter Ends..
MODULE 6

CHAPTER 6

Dimensionality Reduction

Syllabus

Dimensionality Reduction Techniques, Principal Component Analysis, Linear Discriminant Analysis, Singular Value Decomposition.

6.1 Dimensionality Reduction Techniques ......................................................................................... 6-2
    UQ. Why Dimensionality Reduction is very important step in Machine Learning ? (MU(Comp.) - May 19, Dec. 19, 5 Marks) ................. 6-2
    6.1.1 Dimension Reduction Techniques in ML ............................................................................. 6-2
    6.1.1(A) Feature Selection ......................................................................................................... 6-3
    6.1.1(B) Feature Extraction ........................................................................................................ 6-3
6.2 Principal Component Analysis ..................................................................................................... 6-4
    UQ. Explain in detail Principal Component Analysis for Dimension Reduction. (MU(Comp.) - May 15, May 16, 10 Marks) ................... 6-4
    UQ. Use Principal Component Analysis (PCA) to arrive at the transformed matrix for the given data A. (MU(Comp.) - May 17, 10 Marks) ... 6-4
    UQ. Describe the steps to reduce dimensionality using Principal Component Analysis method by clearly stating the mathematical formulas used. (MU(Comp.) - May 19, Dec. 19, 5 Marks) ......... 6-5
    UQ. Write short note on ISA and compare it with PCA. (MU(Comp.) - May 19, 5 Marks) ............... 6-5
6.3 Linear Discriminant Analysis ....................................................................................................... 6-7
6.4 Singular Value Decomposition .................................................................................................... 6-8
6.1 DIMENSIONALITY REDUCTION TECHNIQUES

UQ. Why Dimensionality Reduction is very important step in Machine Learning ? (MU(Comp.) - May 19, Dec. 19, 5 Marks)

Dimension Reduction refers to the process of converting a set of data having vast dimensions into data with lesser dimensions, ensuring that it conveys similar information concisely. These techniques are typically used to obtain better features for a classification or regression problem.
follows. It indicates 2 dimensions P, and P2, which are given us
We should take a gander at the picture demonstrated as
somehow managed to utilize
a chance to state estimations of a few objects in cm (P) and inches (P). Presently,if we
of noise in the system,
Doth the measurements in machine learning,they will pass on comparabledata and present a lot
from 2D
present here is changed
over
dimension. Dimension of information
so it is better to simply utilizing one
(from P and P) to ID (Q), to make the information moderately less demanding to clarify.
www. www.
www.
P2
(inches)
P1(cm)
Q1
It decreases the time required for the same calculations : if there are fewer dimensions, there is less processing. An added advantage of having fewer dimensions is that it permits the use of algorithms that are unfit for a large number of dimensions.

It takes care of multi-collinearity, which enhances the performance of the model, and it removes redundant features. For example, it makes no sense to store a value in two different units (meters and inches).

Reducing the dimensions of data to 3-D or 2-D may enable us to plot and visualize it exactly. You can then observe patterns more clearly. Below we can see how 3-D data is converted into 2-D : a 2-D plane is identified, and the points are then represented on the two new axes z1 and z2.
Fig. 6.1.2 : Example of Dimension Reduction
It is also helpful in noise removal, and because of this we can enhance the performance of models.

The classifier's performance usually degrades for a large number of features.

(Figure : classifier performance versus the number of variables.)
6.1.1(A) Feature Selection

Given a set of features F = {X1, ..., XN}, the feature selection problem is to find a subset F' ⊆ F that maximizes the learner's ability to classify the patterns. Formally, F' should maximize some scoring function.
Feature Selection Steps

Feature selection is an optimization problem.

Step 1 : Search the space of possible feature subsets.

Step 2 : Pick the subset that is optimal or near-optimal with respect to some objective function.

6.1.1(B) Feature Extraction

A projection matrix W is computed so that the N-dimensional vectors are mapped to M-dimensional vectors with low error :

z = W^T x

Principal Component Analysis and Independent Component Analysis are feature extraction methods.
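As a quick illustration of z = W^T x (not from the original text; scikit-learn assumed available and the data made up), PCA can extract M = 2 features from N = 3 correlated inputs :

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 samples with N = 3 correlated features
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # third feature is nearly redundant

pca = PCA(n_components=2)      # extract M = 2 new features
Z = pca.fit_transform(X)       # project the centred data onto W

print(pca.components_.shape)   # (2, 3): the projection matrix W
print(Z.shape)                 # (100, 2): the extracted features
```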
6.2 PRINCIPAL COMPONENT ANALYSIS

UQ. Use Principal Component Analysis (PCA) to arrive at the transformed matrix for the given data A. (MU(Comp.) - May 17, 10 Marks)
UQ. Describe the steps to reduce dimensionality using the Principal Component Analysis method by clearly stating the mathematical formulas used. (MU(Comp.) - May 19, Dec. 19, 5 Marks)

UQ. Write short note on ISA and compare it with PCA. (MU(Comp.) - May 19, 5 Marks)
Dimensionality reduction is often the best starting point when dealing with high-dimensional data. It is used for a variety of reasons, from visualization to denoising, and in a wide range of applications, from signal processing to bioinformatics.

One of the most widely used dimensionality reduction tools is Principal Component Analysis (PCA). PCA implicitly assumes that the dataset under consideration is normally distributed and chooses the subspace which maximizes the projected variance.

We consider a centered dataset and construct the sample covariance matrix S; q-dimensional PCA is then equivalent to projecting onto the q-dimensional subspace spanned by the q eigenvectors of S with the largest eigenvalues.

In this technique, the variables are transformed into a new set of variables, which are linear combinations of the original variables. This new set of variables is known as the principal components. They are computed in such a way that the first principal component accounts for the largest possible part of the variation of the original data, after which each succeeding component has the highest possible remaining variance.

The second principal component must be orthogonal to the first principal component. In other words, it does its best to capture the variance in the data that is not captured by the first principal component. For a two-dimensional dataset, there can be only two principal components. Below is a depiction of the data and its first and second principal components; you can see that the second principal component is orthogonal to the first principal component.

Fig. 6.2.1 : The first and second principal components of a two-dimensional dataset

The principal components are sensitive to the scale of measurement; to settle this issue we should always standardize the variables before applying PCA. Also, applying PCA makes the dataset lose interpretability; if interpretability of the results is important for your analysis, PCA is not the correct technique for your task.
X̄ = 1.81,  Ȳ = 1.91

Now we will find the covariance matrix

C = [ Cxx  Cxy ]
    [ Cyx  Cyy ]

Cxx = Σ (X - X̄)² / (N - 1) = 0.6165

Cxy = Cyx = Σ (X - X̄)(Y - Ȳ) / (N - 1) = 0.61544

Cyy = Σ (Y - Ȳ)² / (N - 1) = 0.7165

C = [ 0.6165   0.61544 ]
    [ 0.61544  0.7165  ]

Now, to find the eigen values the following equation is used :

| C - λI | = 0

| 0.6165 - λ    0.61544     |
| 0.61544       0.7165 - λ  | = 0

By solving the above determinant we get a quadratic equation, and solving that equation we get the two eigen values λ1 = 0.0489 and λ2 = 1.284.

Now we will find the eigen vectors corresponding to these eigen values.

First we find the eigen vector corresponding to the first eigen value, λ1 = 0.0489 :

C × V1 = λ1 × V1

[ 0.6165   0.61544 ] [ V11 ]          [ V11 ]
[ 0.61544  0.7165  ] [ V12 ] = 0.0489 [ V12 ]

From the above equation we get two equations :

0.6165 V11 + 0.61544 V12 = 0.0489 V11      ...(1)
0.61544 V11 + 0.7165 V12 = 0.0489 V12      ...(2)

To find the eigen vector we can take either Equation (1) or (2); for both equations the answer will be the same. Let us take the first equation :

0.6165 V11 + 0.61544 V12 = 0.0489 V11
0.5676 V11 + 0.61544 V12 = 0
V11 = - 1.0842 V12

Now we assume V12 = 1, so V11 = - 1.0842, i.e.

V1 = [ - 1.0842 ]
     [   1      ]

As we have assumed the value of V12, we have to normalize V1 as follows :

V1N = [ - 1.0842 ]  /  √(1.0842² + 1²)  =  [ - 0.735 ]
      [   1      ]                         [   0.677 ]
value, 1.284
Similarlywe will find the second eigen vectorcorrespondingto second eigen
CxV xV2
o.6165 0.615441 1.284
Lo.61544 0.7165J LV12
V2
to normalized V2 as follows
As we are assumingthe value of V22 we have
V2N 0.922
V0.922+1L
0.677
V2N Lo.735 in
to the eigen vector corresponding to
maximum Eigen value,
Now we have to find the principlecomponent, it is equal
is Eigen vector VzN
this is maximum, hence principlecomponent
o.6771
Principlecomponent Lo.735
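The eigen computation above can be checked with a few lines of NumPy (not part of the original text) :

```python
import numpy as np

# Covariance matrix obtained in the worked example above
C = np.array([[0.6165, 0.61544],
              [0.61544, 0.7165]])

eigvals, eigvecs = np.linalg.eigh(C)    # eigh: for symmetric matrices
print(eigvals)                          # approx [0.049, 1.284]

# Principal component = eigenvector of the largest eigenvalue
pc = eigvecs[:, np.argmax(eigvals)]
print(pc)                               # approx [0.678, 0.735] (sign may be flipped)
```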
6.3 LINEAR DISCRIMINANT ANALYSIS

1. Compute the d-dimensional mean vectors for the different classes from the dataset.
2. Compute the scatter matrices (the in-between-class and within-class scatter matrices).
3. Compute the eigenvectors (e1, e2, ..., ed) and corresponding eigen values (λ1, λ2, ..., λd) for the scatter matrices.
4. Sort the eigenvectors by decreasing eigen values and choose the k eigenvectors with the largest eigen values to form a d × k dimensional matrix W (where every column represents an eigenvector).
5. Use this d × k eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the matrix multiplication Y = X × W (where X is an n × d-dimensional matrix representing the n samples, and Y are the transformed n × k-dimensional samples in the new subspace).
Ex. 6.3.1 : Compute the Linear Discriminant projection for the following two-dimensional dataset.

X1 = (x1, x2) = {(4,1), (2,4), (2,3), (3,6), (4,4)}
X2 = (x1, x2) = {(9,10), (6,8), (9,5), (8,7), (10,8)}

Soln. :

Step 1 : Compute the mean for class 1 and class 2

M1 = [3.00  3.60],   M2 = [8.40  7.60]

Step 2 : Compute the within-class scatter matrix

SW = S1 + S2

where S1 is the covariance matrix of class 1 and S2 is the covariance matrix of class 2 :

S1 = Σ (X - M1)(X - M1)^T
S2 = Σ (X - M2)(X - M2)^T

S1 = [  0.80  - 0.40 ]      S2 = [   1.84  - 0.04 ]
     [ - 0.40   2.60 ]           [ - 0.04    2.64 ]

SW = [   2.64  - 0.44 ]
     [ - 0.44    5.28 ]

Step 3 : Compute the between-class scatter matrix

SB = (M1 - M2)(M1 - M2)^T

SB = [ 29.16  21.60 ]
     [ 21.60  16.00 ]

Step 4 : Find the best LDA projection vector

w* = SW⁻¹ (M1 - M2) = [ - 0.91  - 0.39 ]
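These matrices can be reproduced with the short NumPy sketch below (not from the original text; small rounding differences from the hand computation may appear) :

```python
import numpy as np

X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

M1, M2 = X1.mean(axis=0), X2.mean(axis=0)

def scatter(X, M):
    """Class scatter: outer products of the deviations, averaged over the samples."""
    d = X - M
    return d.T @ d / len(X)

SW = scatter(X1, M1) + scatter(X2, M2)   # within-class scatter
SB = np.outer(M1 - M2, M1 - M2)          # between-class scatter

w = np.linalg.solve(SW, M1 - M2)   # w* = SW^-1 (M1 - M2)
w = w / np.linalg.norm(w)          # normalise the direction
print(M1, M2)                      # [3.  3.6] [8.4 7.6]
print(SW)
print(w)                           # approximately [-0.92, -0.39]
```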
6.4 SINGULAR VALUE DECOMPOSITION

A = U S V^T
Here, A represents an m × n matrix, U represents an m × n orthogonal matrix, S is an n × n diagonal matrix, and V is an n × n orthogonal matrix.
Ex. 6.4.1 : Find the SVD for the given matrix A.

Soln. :

First we calculate A^T A.

Next we calculate the eigen vectors V1 and V2 of A^T A using the method that we have seen in PCA, and the singular values (the square roots of the corresponding eigen values).

Then we compute A V1 and A V2, and obtain U1 and U2 by normalising them.

Finally, the SVD is written as A = U S V^T.
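Since the matrix of Ex. 6.4.1 is not legible in the source, the following NumPy sketch (not from the original text) uses an illustrative 2 × 2 matrix to show the same decomposition :

```python
import numpy as np

# Illustrative 2 x 2 matrix (assumed, not the matrix of Ex. 6.4.1)
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

U, S, Vt = np.linalg.svd(A)
print(U)          # orthogonal matrix U
print(S)          # singular values (diagonal of S), here [4., 2.]
print(Vt)         # V transpose

# Reconstruct A = U S V^T
print(U @ np.diag(S) @ Vt)
```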
Chapter Ends