
MODULE 1
CHAPTER 1
Introduction to Machine Learning

Syllabus
Machine Learning, Types of Machine Learning, Issues in Machine Learning, Application of Machine Learning, Steps in developing a Machine Learning Application, Training Error, Generalization Error, Overfitting, Underfitting, Bias-Variance trade-off.

1.1 Machine Learning
    UQ. What is Machine Learning? How is it different from data mining? (MU(Comp.) - May 17, 5 Marks; May 19, 5 Marks)
    UQ. Define Machine Learning and explain with example the importance of Machine Learning. (MU(Comp.) - Dec. 19, 5 Marks)
1.2 Key Terminology
    UQ. What are the key tasks of Machine Learning? (MU(Comp.) - May 16, 5 Marks)
1.3 Types of Machine Learning
    UQ. Explain how supervised learning is different from unsupervised learning. (MU(Comp.) - May 17, 5 Marks)
1.4 Issues in Machine Learning
    UQ. What are the issues in Machine Learning? (MU(Comp.) - May 15, 5 Marks)
1.5 How to Choose the Right Algorithm
    UQ. Explain the steps required for selecting the right machine learning algorithm. (MU(Comp.) - May 16, 8 Marks)
1.6 Steps in Developing a Machine Learning Application
    UQ. Explain the steps of developing Machine Learning applications. (MU(Comp.) - May 19, 10 Marks)
1.7 Applications of Machine Learning
    UQ. Write a short note on: Machine learning applications. (MU(Comp.) - May 16, May 17, 10 Marks)
1.8 Training Error and Generalization Error
1.9 Underfitting, Overfitting, Bias and Variance Trade-off
Chapter Ends
1.1 MACHINE LEARNING

UQ. What is Machine Learning? How is it different from data mining? (MU(Comp.) - May 17, 5 Marks; May 19, 5 Marks)
UQ. Define Machine Learning and explain with example the importance of Machine Learning. (MU(Comp.) - Dec. 19, 5 Marks)
A machine that is as intellectually capable as a human has always fascinated writers and early computer scientists who were excited about artificial intelligence and machine learning.

The first machine learning system was developed in the 1950s. In 1952, Arthur Samuel developed a program to play checkers. The program was able to observe game positions and learn a model that gave better moves for the machine player.

In 1957, Frank Rosenblatt designed the Perceptron, a simple classifier which, when combined in large numbers in a network, becomes a powerful tool.

Minsky later pointed out the limitations of the perceptron. He showed that the XOR problem could not be represented by a perceptron and that such linearly inseparable data distributions cannot be handled by it. Following Minsky's work, neural network research went dormant until the 1980s.

Machine learning became very popular in the 1990s due to the introduction of statistics. The combination of computer science and statistics led to probabilistic approaches in artificial intelligence. The area further shifted towards data-driven techniques. As huge amounts of data became available, scientists started to design intelligent systems that are able to analyze and learn from data.
Machine learning is a category of Artificial Intelligence. In machine learning, computers have the ability to learn by themselves; explicit programming is not required.

Machine learning focuses on the study and development of algorithms that can learn from data and also make predictions on data.

Machine learning is defined by Tom Mitchell as: "A program learns from experience 'E' with respect to some class of tasks 'T' and performance measure 'P', if its performance on tasks in 'T', as measured by 'P', improves with experience 'E'." Here 'E' represents the past experienced data and 'T' represents tasks such as prediction, classification, etc. An example of 'P' is the accuracy of prediction, which we might want to increase.

Machine learning mainly focuses on the design and development of computer programs that can teach themselves to grow and change when exposed to new data.

Using machine learning we can collect information from a dataset by asking the computer to make some sense of the data. Machine learning is turning data into information.

Fig. 1.1.1: Machine Learning (Data is fed to a Computer Program, which produces an Output)
Fig. 1.1.2 is a schematic representation of an ML system. The ML system takes training data and background knowledge as input. Background knowledge and data help the Learner to provide a solution program for a particular task or problem. The performance corresponding to the solution can also be measured. An ML system comprises two main components, a Learner and a Reasoner. The Learner uses the training data and background knowledge to build a model, and this model is used by the Reasoner to provide the solution for a task.

Machine learning can be applied to many applications, from politics to geosciences. It is a tool that can be applied to many problems. Any application which needs to extract some information from data, and also to take some action on data, can benefit from machine learning methods.

Some of the applications are spam filtering in email, face recognition, product recommendations from Amazon.com and handwritten digit recognition.

In detecting spam email, if you check for the occurrence of a single word it will not be very helpful. But by checking the occurrence of certain words used together, and combining this with the length of the email and other parameters, you could get a much clearer idea of whether the email is spam or not.

Machine learning is used by most companies to increase productivity, forecast weather, improve business decisions, detect disease and do many more things.

Machine learning uses statistics. There are many problems where the solution is not deterministic. There are certain problems for which we do not have enough information and also do not have enough computing power to model the problem properly. For these problems we need statistics; an example of such a problem is the prediction of the motivation and behaviour of humans, which is currently very difficult to model.

Fig. 1.1.2: Schematic diagram of Machine Learning (Data and Background Knowledge feed the Learner, which builds Models; the Reasoner uses the Models to produce a Solution for a given Problem/Task)

Machine learning = Take data + understand it + process it + extract value from it + visualize it + communicate it

1.2 KEY TERMINOLOGY

UQ. What are the key tasks of Machine Learning? (MU(Comp.) - May 16, 5 Marks)

Expert System

An expert system is a system which is developed using a training set, a testing set, knowledge representation, features, an algorithm and classification terminology.
(i) Training set: A training set comprises training examples which will be used to train machine learning algorithms.
(ii) Testing set: To test machine learning algorithms, what is usually done is to have a training set of data and a separate dataset, called a test set.
(iii) Knowledge representation: Knowledge representation may be stored in the form of a set of rules. It may be an example from the training set or a probability distribution.
(iv) Features: Important properties or attributes.
(v) Classification: We classify the data based on features.

Process: Suppose we want to use a machine learning algorithm for classification. The next step is to train the algorithm, or allow it to learn. To train the algorithm we give it quality data as input, called the training set.
Each training example has some features and one target variable. The target variable is what we will be trying to predict with our machine learning algorithms. In a training dataset the target variable is known. The machine learns by finding some relationship between the target variable and the features. In classification tasks the target variables are known as classes, and it is assumed that there will be a limited number of classes.
The class or target variable that a training example belongs to is then compared to the predicted value, and we can get an idea of the accuracy of the algorithm.

Example: First we will see some terminology that is frequently used in machine learning methods. Let us take an example in which we want to design a classification system that will classify car instances as either Acceptable or Unacceptable. This kind of system is a fascinating topic often related to machine learning, called an expert system.
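As a minimal illustration of these terms (a sketch, not part of the original text), the following assumes scikit-learn and its bundled iris dataset; the feature matrix X plays the role of the features and y is the target variable.

```python
# A minimal sketch of the train/test terminology above, using scikit-learn's
# bundled iris dataset; X holds the features and y holds the class (target variable).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # features and target variable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)     # separate training set and test set

model = DecisionTreeClassifier().fit(X_train, y_train)   # train the algorithm
y_pred = model.predict(X_test)                            # classify the test set
print("accuracy:", accuracy_score(y_test, y_pred))        # compare predicted vs. actual classes
```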

The features of various cars are stored in Table 1.2.1. The features or attributes selected are Buying_Price, Maintenance_Price, Lug_Boot and Safety. Four features belong to each example; every row of Table 1.2.1 represents a record comprising these features.

In Table 1.2.1 all the features are categorical in nature and take limited, disjoint values. The first two features represent the buying price and the maintenance price of a car, such as high, medium or low. The third feature shows the luggage capacity of the car as small, medium or big. The fourth feature represents whether the car has safety measures or not, and takes the value low, medium or high.

Classification is one of the important tasks in machine learning. In this application we want to evaluate a car out of a group of other cars. Suppose we have all information about the car's Buying_Price, Maintenance_Price, Lug_Boot and Safety. A classification method is used to evaluate a given car as Acceptable or Unacceptable. Many machine learning algorithms are available that can be used for classification. The target or response variable in this example is the evaluation of the car.

Suppose we have selected a machine learning algorithm to use for classification. The main task in classification is to train the algorithm, or allow it to learn. We give the experienced data as input to train the algorithm; this input is called the training data.

Let us assume the training dataset contains the 14 training records in Table 1.2.1. Each training record has four features and one target or response variable, as shown in Fig. 1.2.1. The machine learning algorithm is used to predict the target variable.

In a classification task the target variable takes a discrete value, while in a regression task its value can be continuous.

In a training dataset we have the value of the target variable. The relationship that exists between the features and the target variable is used by the machine for learning. The target variable here is the evaluation of the car. In classification tasks the target variables are known as classes, and in classification systems it is assumed that the classes are limited in number.

Attributes or features are the individual values that, when combined with other features, make up a training example. These are usually the columns in a training or test set.

A training dataset and a testing dataset are used to test machine learning algorithms. First the training dataset is given as input to the program; the program uses this data to learn. Next, the test set is given to the program, and the program decides which class each instance of test data belongs to. The predicted output is compared with the actual output, and we can get an idea of the accuracy of the algorithm. There are better ways to use all the information in the training dataset and test dataset.

Assume that in the car evaluation classification system we have tested the program and it meets the desired level of accuracy. Knowledge representation is used to check what the machine has learned. There are many ways in which knowledge can be represented: we can use a set of rules or a probability distribution to represent the knowledge. Some algorithms produce knowledge representations that are more interpretable to humans than others. In some situations we may not want to build an expert system; we may be interested only in the knowledge representation that is acquired from training a machine learning algorithm.
Table 1.2.1: Car evaluation classification based on four features

Buying_Price   Maintenance_Price   Lug_Boot   Safety   Evaluation
High           High                Small      High     Unacceptable
High           High                Small      Low      Unacceptable
Medium         High                Small      High     Acceptable
Low            Medium              Small      High     Acceptable
Low            Low                 Big        High     Acceptable
Low            Low                 Big        Low      Unacceptable
Medium         Low                 Big        Low      Acceptable
High           Medium              Small      High     Unacceptable
High           Low                 Big        High     Acceptable
Low            Medium              Big        High     Acceptable
High           Medium              Big        Low      Acceptable
Medium         Medium              Small      Low      Acceptable
Medium         High                Big        High     Acceptable
Low            Medium              Small      Low      Unacceptable

Fig. 1.2.1: Features and target variable identified (example record: Buying_Price = Low, Maintenance_Price = Low, Lug_Boot = Big, Safety = High are the features; Evaluation = Acceptable is the target variable)

1.3 TYPES OF MACHINE LEARNING

UQ. Explain how supervised learning is different from unsupervised learning. (MU(Comp.) - May 17, 5 Marks)

Some of the main types of machine learning are:

1. Supervised Learning

In this type of learning we use data which comprises inputs and corresponding outputs. For every instance of data we have an input 'X' and a corresponding output 'Y'. From this data the ML system builds a model so that, for a new observation 'X', it will try to find out what the corresponding 'Y' is.

In supervised learning the training data is labelled with the correct answers, e.g. "spam" or "ham". The two most important types of supervised learning are classification (where the outputs are discrete labels, as in spam filtering) and regression (where the outputs are real-valued).

Fig. 1.3.1: Supervised Learning (labelled pairs (Input-1, Output-1), ..., (Input-n, Output-n) are fed to the learning algorithm, which builds a model that maps a new input X to an output Y)
2. Unsupervised Learning

In unsupervised learning you are only given the input 'X'; there is no label attached to the data. Given the data points, you may want to form clusters or find some pattern.

Two important unsupervised learning tasks are dimension reduction and clustering.

Fig. 1.3.2: Unsupervised Learning (unlabelled inputs Input-1, ..., Input-n are fed to the learning algorithm)

3. Reinforcement Learning

In reinforcement learning you have an agent which is acting in an environment, and you want to find out what action the agent must take based on the reward or penalty that it receives.

Here an agent (e.g. a robot or a controller) seeks to learn the optimal actions to take based on the outcomes of past actions.

Fig. 1.3.3: Reinforcement Learning (the agent takes an action 'a' in the environment, the environment returns a new state S' and a reward, and the agent uses the reward to improve its behaviour)

Semi-supervised Learning

Semi-supervised learning is a combination of supervised and unsupervised learning. Here there is some amount of labelled training data and also a large amount of unlabelled data, and you try to come up with a learning algorithm that works even when much of the training data is not labelled.

In a classification task, the aim is to predict the class of an instance of data. Another method in machine learning is regression. Regression is the prediction of a numeric value. An example of regression is drawing a best-fit line which passes through some data points in order to generalize from them.

Classification and regression are examples of supervised learning. These types of problems are called supervised because we are telling the algorithm what to predict.

The exact opposite of supervised learning is a task called unsupervised learning. In unsupervised learning, no target value or label is given for the data. A problem in which similar items are grouped together is called clustering. In unsupervised learning, we may also want to find statistical values that describe the data; this is called density estimation. Another task of unsupervised learning is reducing data with many attributes down to a small number of attributes so that we can properly visualize it in two or three dimensions.

Table 1.3.1: Supervised learning tasks

Classification              Regression
K-Nearest Neighbours        Linear
Naive Bayes                 Locally weighted linear
Support Vector Machines     Ridge
Decision Trees              Lasso

Table 1.3.2: Unsupervised learning tasks

Clustering                  Density estimation
K-Means                     Expectation Maximization
DBSCAN                      Parzen Window
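The short sketch below (assuming scikit-learn and its bundled iris data, not part of the original text) contrasts the two settings: the supervised classifier is given the labels, while the unsupervised clustering algorithm sees only the features.

```python
# Contrast a supervised classifier with an unsupervised clustering algorithm
# on the same feature matrix.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are given to the learner.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clf.predict(X[:3]))          # predicted class labels

# Unsupervised: only X is given; the algorithm forms clusters on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:3])              # cluster assignments (no ground-truth labels used)
```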

1.4 ISSUES IN MACHINE LEARNING

UQ. What are the issues in Machine Learning? (MU(Comp.) - May 15, 5 Marks)
1. Which algorithm should we select to learn general target functions from a specific training dataset? What should be the settings for particular algorithms so that they converge to the desired function, given sufficient training data? Which algorithms perform best for which types of problems and representations?
2. How much training data is sufficient? What general bounds can be found relating the confidence in learned hypotheses to the amount of training experience and the character of the learner's hypothesis space?
3. At which time, and in what manner, can prior knowledge held by the learner be used to guide the process of generalizing from examples? Will it be helpful even when it is only approximately correct?
4. What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy affect the complexity of the learning problem?
5. What is the best approach to reduce the task of learning to one or more function approximation problems? What specific functions should the system attempt to learn? Can this process itself be automated?
6. How can the learner automatically alter its representation to improve its ability to represent and learn the target function?

1.5 HOW TO CHOOSE THE RIGHT ALGORITHM?

UQ. Explain the steps required for selecting the right machine learning algorithm. (MU(Comp.) - May 16, 8 Marks)

With all the different algorithms available in machine learning, how can you select which one to use? First you need to focus on your goal: what are you trying to get out of this, and what data do you have or can you collect? Secondly, you have to consider the data.

1. Goal: If you are trying to predict or forecast a target value, then you need to look into supervised learning. Otherwise, you have to use unsupervised learning.
   (a) If you have chosen supervised learning, then next you need to focus on what your target value is.
       If the target value is discrete (e.g. Yes/No, 1/2/3, A/B/C), then use classification.
       If the target value is continuous, i.e. a range of numbers (e.g. 0 to 100, -99 to 99), then use regression.
   (b) If you have chosen unsupervised learning, then next you need to focus on what your aim is.
       If you want to fit your data into some discrete groups, then use clustering.
       If you want a numerical estimate of how strongly the data fits into each group, then use a density estimation algorithm.
2. Data: Are the features continuous or nominal? Are there missing values in the features? If yes, what is the reason for the missing values? Are there outliers in the data? All of these properties of your data can help you narrow down the algorithm selection process.

Table 1.5.1: Selection of Algorithm

Target type    Supervised Learning    Unsupervised Learning
Discrete       Classification         Clustering
Continuous     Regression             Density Estimation
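Purely as an illustration (not from the textbook), the hypothetical helper below encodes the decision logic of Table 1.5.1; the function name and argument names are our own.

```python
# Encode Table 1.5.1: choose the algorithm family from whether a target value
# is being predicted and whether the target (or grouping) is discrete.
def choose_algorithm_family(predicting_target: bool, discrete: bool) -> str:
    if predicting_target:                       # supervised learning
        return "Classification" if discrete else "Regression"
    return "Clustering" if discrete else "Density Estimation"   # unsupervised learning

print(choose_algorithm_family(True, True))    # -> Classification
print(choose_algorithm_family(False, False))  # -> Density Estimation
```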
1.6 STEPS IN DEVELOPING A MACHINE LEARNING APPLICATION

UQ. Explain the steps of developing Machine Learning applications. (MU(Comp.) - May 19, 10 Marks)

1. Collection of Data

You could collect the samples by scraping a website and extracting data.
From an RSS feed or an API.
From a device, e.g. one that collects wind speed measurements.
From publicly available data.
2. Preparation of the Input Data

Once you have the input data, you need to check whether it is in a usable format or not. Some algorithms can accept target variables and features as strings; some need them to be integers. Some algorithms accept features only in a special format.

3. Analyse the Input Data

Look at the data you have prepared, for example in a text editor, to check that the collection and preparation steps are working properly and that you do not have a bunch of empty values. You can also examine the data to find out whether you can see any patterns or anything obvious, such as a few data points that differ greatly from the rest of the data. Plotting the data in one, two or three dimensions can also help; distil multiple dimensions down to two or three so that you can visualize the data. The importance of this step is that it makes sure you do not have any garbage values coming in.

4. Train the Algorithm

Good, clean data from the first three steps is given as input to the algorithm. The algorithm extracts information or knowledge. This knowledge is usually stored in a format that is readily usable by the machine for the next two steps. In the case of unsupervised learning there is no training step, because there is no target value; the complete data is used in the next step.

5. Test the Algorithm

In this step the information learned in the previous step is used. When you are checking an algorithm, you test it to find out whether it works properly or not. In the supervised case, you have some known values that can be used to evaluate the algorithm. In the unsupervised case, you may have to use some other metrics to evaluate the success. In either case, if you are not satisfied, you can go back to step 4, change some things and test again. Often the problem lies in the collection or preparation of the data, and you will have to go back to step 1.

6. Use It

In this step a real program is developed to do some task, and once again it is checked whether all the previous steps worked as expected. You might encounter some new data and have to revisit steps 1-5.

Fig. 1.6.1: Typical example of a Machine Learning application (Training phase: a feature extractor turns each input and its label into features fed to the machine learning algorithm; Testing phase: the feature extractor turns a new, unlabelled input into features, which the trained classifier model maps to a predicted label)
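As a compressed sketch of steps 1-6 (assuming scikit-learn, with its bundled wine dataset standing in for collected data), the following is illustrative rather than a prescribed implementation.

```python
# Steps 1-6 in miniature: collect, prepare, analyse, train, test, use.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)                      # 1. collect the data
X = StandardScaler().fit_transform(X)                  # 2. prepare: numeric, scaled features
assert not np.isnan(X).any()                           # 3. analyse: no empty/garbage values
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)        # 4. train the algorithm
print("test accuracy:", model.score(X_te, y_te))                 # 5. test the algorithm
print("prediction for a new sample:", model.predict(X_te[:1]))   # 6. use it
```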

1.7 APPLICATIONS OF MACHINE LEARNING

UQ. Write a short note on: Machine learning applications. (MU(Comp.) - May 16, May 17, 10 Marks)

1. Learning Associations

One example of a retail application of machine learning is basket analysis for a supermarket chain, which means finding associations between products bought by customers: if people who buy P typically also buy Q, and there is a customer who buys Q but does not buy P, he or she is a potential P customer. Once we identify such customers, we can target them for cross-selling.

In finding an association rule, we are interested in learning a conditional probability of the form P(Q | P), where Q is the product we would like to condition on P, the product (or products) which we know the customer has already purchased.

P(Milk | Bread) = 0.7

implies that 70% of customers who buy bread also buy milk.
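A tiny sketch (not in the original text) of how such a conditional probability could be estimated; the basket contents below are made up for illustration.

```python
# Estimate the association-rule probability
# P(Milk | Bread) = count(baskets with bread and milk) / count(baskets with bread).
transactions = [
    {"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread"},
    {"milk"}, {"bread", "milk"}, {"bread", "butter"},
]
bread_baskets = [t for t in transactions if "bread" in t]
p_milk_given_bread = sum("milk" in t for t in bread_baskets) / len(bread_baskets)
print(p_milk_given_bread)   # 3 of the 5 bread baskets also contain milk -> 0.6
```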
2. Classification

A credit is an amount of money loaned by a financial institution. It is important for the bank to be able to predict in advance the risk associated with a loan, which is the probability that the customer will default and not pay the whole amount back.

In credit scoring, the bank calculates the risk given the amount of credit and the information about the customer (income, savings, collaterals, profession, age, past financial history). The aim is to infer a general rule from this data, coding the association between a customer's attributes and his risk.

The machine learning system fits a model to the past data to be able to calculate the risk for a new application, and then decides to accept or refuse it accordingly. For example:

IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk

Other classification examples are optical character recognition, face recognition, medical diagnosis, speech recognition and biometrics.

Fig. 1.7.1: Classification for credit scoring (customers plotted by income and savings are separated into low-risk and high-risk regions by the learned rule)
3. Regression

Suppose we want to design a system that can predict the price of a flat. Let us take as inputs the area of the flat, the location, the purchase year and other information that affects the price. The output is the price of the flat. Applications where the output is numeric are regression problems.

Let X represent the flat's features (e.g. its area) and Y the price of the flat. We can collect training data by surveying past purchase transactions, and the machine learning algorithm fits a function to this data to learn Y as a function of X for suitable values of w and w0:

Y = w * X + w0

Fig. 1.7.2: Regression for prediction of the price of a flat (X: area of flat, Y: price of flat, with the fitted line Y = w*X + w0)
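As an illustrative sketch (the area/price numbers below are made up, not real data), NumPy's least-squares polynomial fit can recover w and w0 from such training pairs:

```python
# Fit Y = w*X + w0 by least squares on toy (area, price) pairs.
import numpy as np

area = np.array([50, 65, 80, 100, 120], dtype=float)      # X: area of flat (illustrative)
price = np.array([40, 52, 63, 80, 95], dtype=float)       # Y: price (illustrative units)

w, w0 = np.polyfit(area, price, deg=1)                    # slope and intercept of the line
print(f"w = {w:.3f}, w0 = {w0:.3f}")
print("predicted price for area 90:", w * 90 + w0)
```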

4. Unsupervised Learning

One of the important unsupervised learning problems is clustering. In clustering, a dataset is partitioned into meaningful subclasses known as clusters. For example, suppose you want to organize given items to decorate your home. You can group them using unsupervised learning (with no prior knowledge), and this grouping can be on the basis of the colour of the items, their shape, the material used, their type, or whatever way you would like.

5. Reinforcement Learning

There are some applications where the output of the system is a sequence of actions. In such applications the sequence of correct actions, rather than a single action, is important in order to reach the goal. An action is said to be good if it is part of a good policy. A machine learning program generates a policy by learning from previous good action sequences. Such methods are called reinforcement learning methods.

A good example of reinforcement learning is chess playing. In artificial intelligence and machine learning, game playing is one of the most important research areas. Games can be described easily, but at the same time they are quite difficult to play well. Chess, for example, has a limited number of rules, but the game is very difficult because for each state there can be a large number of possible moves.

Another application of reinforcement learning is robot navigation. The robot can move in any possible direction at any point of time. The algorithm should reach the goal state from an initial state by learning the correct sequence of actions after conducting a number of trial runs.

When the system has unreliable and partial sensory information, reinforcement learning becomes more complex. Consider, for example, a robot with incomplete camera information; the robot then does not know its exact location.

1.8 TRAINING ERROR AND GENERALIZATION ERROR

Training Error

In machine learning, training a predictive model means finding a function f which maps a set of values X to a value Y. If we apply the model to the data it was trained on, we are calculating the training error. If we calculate the error on data which was unknown in the training phase, we are calculating the test error. The training error is calculated as follows:

E_train = (1/n) * Σ_{i=1..n} error(f(X_i), Y_i)

In the above equation, n represents the number of training examples, f(X_i) represents the predicted value and Y_i represents the true or actual value. error(f(X_i), Y_i) expresses whether these two values are the same or not and, if not, by how much they differ.

Generalization Error

For supervised learning applications in machine learning and statistical learning theory, the generalization error is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data. The generalization error is calculated as follows:

E_gen = ∫ error(f(X), Y) P(X, Y) dX dY

In the above equation the error is calculated over all possible values of X and Y. error(f(X), Y) expresses whether these two values are the same or not and, if not, by how much they differ. P(X, Y) represents how often we expect to see such X and Y.
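A small sketch (assuming scikit-learn, not part of the original text) of how the two quantities are estimated in practice: the training error is the average loss on the fitted data, while the error on a held-out test set serves as an estimate of the generalization error.

```python
# Compare training error with held-out test error (an estimate of E_gen).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_error = 1 - model.score(X_tr, y_tr)   # average 0/1 loss on the training data
test_error = 1 - model.score(X_te, y_te)    # estimate of the generalization error
print(train_error, test_error)              # the training error is typically the smaller one
```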

Training Error versus Generalization Error

The training error is the error of our model as calculated on the training dataset, while the generalization error is the expectation of our model's error were we to apply it to an infinite stream of additional data examples drawn from the same underlying data distribution as our original sample.

Problematically, we can never calculate the generalization error exactly, because the stream of infinite data is an imaginary object. In practice, we must estimate the generalization error by applying our model to an independent test set constituted of a random selection of data examples that were withheld from our training set.

Let us see an example. Consider a college student trying to prepare for her final exam. A diligent student will strive to practice well and test her abilities using exams from previous years. Nonetheless, doing well on past exams is no guarantee that she will excel when it matters. For instance, the student might try to prepare by rote-learning the answers to the exam questions. This requires the student to memorize many things; she might even remember the answers to past exams perfectly. Another student might prepare by trying to understand the reasons for giving certain answers. In most cases, the latter student will do much better.

Let us see one more example. Consider the problem of trying to classify the outcomes of coin tosses (class 0: heads, class 1: tails) based on some contextual features that might be available. Suppose that the coin is fair. No matter what algorithm we come up with, the generalization error will always be 1/2. However, for most algorithms, we should expect our training error to be considerably lower, depending on the luck of the draw, even if we did not have any features! Consider the dataset {0, 1, 1, 1, 0, 1}. Our feature-less algorithm would have to fall back on always predicting the majority class, which from our limited sample appears to be 1. In this case, the model that always predicts class 1 will incur a training error of 1/3, considerably better than our generalization error. As we increase the amount of data, the probability that the fraction of heads deviates significantly from 1/2 diminishes, and our training error would come to match the generalization error.
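The quick simulation below (not in the original text) reproduces this argument numerically for a fair coin.

```python
# Majority-class prediction on a fair coin: the training error on a small sample
# can be well below 1/2, but it approaches 1/2 (the generalization error) as n grows.
import numpy as np

rng = np.random.default_rng(0)
for n in (6, 100, 10_000):
    tosses = rng.integers(0, 2, size=n)          # fair coin: 0 = heads, 1 = tails
    majority = int(tosses.mean() >= 0.5)         # always predict the majority class
    train_error = np.mean(tosses != majority)    # error on the observed sample
    print(n, round(float(train_error), 3))
```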

When we train our models, we attempt to search for a function that fits the training data as well as possible. If the function is so flexible that it can catch on to spurious patterns just as easily as to true associations, then it might perform too well on the training data without producing a model that generalizes well to unseen data. This is precisely what we want to avoid, or at least control. Many of the techniques in deep learning are heuristics and tricks aimed at guarding against overfitting.

When we have simple models and abundant data, we expect the generalization error to resemble the training error. When we work with more complex models and fewer examples, we expect the training error to go down but the generalization gap to grow.

Fig. 1.8.1: Training Error and Generalization Error (as model capacity increases, training error keeps decreasing while generalization error first decreases and then increases, giving an underfitting zone, an optimal capacity and an overfitting zone; the gap between the curves is the generalization gap)
1.9 UNDERFITTING, OVERFITTING, BIAS AND VARIANCE TRADE-OFF

Let us consider that we are designing a machine learning model. A model is said to be a good machine learning model if it generalizes to any new input data from the problem domain in a proper way. This helps us to make predictions on future data that the model has never seen.

Now, suppose we want to check how well our machine learning model learns and generalizes to new data. For that we look at overfitting and underfitting, which are majorly responsible for the poor performance of machine learning algorithms.

Before diving further, let us understand two important terms:

Bias: the assumptions made by a model to make the target function easier to learn. (The algorithm's error rate on the training set is the algorithm's bias.)

Variance: if you train your model on training data and obtain a very low error, and then upon changing the data and training the same model again you experience a high error, this is variance. (How much worse the algorithm does on the test set than on the training set is known as the algorithm's variance.)

Underfitting

A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data.

Underfitting destroys the accuracy of our machine learning model. Its occurrence simply means that our model or the algorithm does not fit the data well enough.

It usually happens when we have too little data to build an accurate model, or when we try to build a linear model with non-linear data. In such cases the rules of the machine learning model are too simple and generic for the data, and therefore the model will probably make a lot of wrong predictions.

Underfitting can be avoided by using more data and also by reducing the features through feature selection.

In a nutshell, underfitting = high bias and low variance.

Techniques to reduce underfitting:
1. Increase model complexity.
2. Increase the number of features, performing feature engineering.
3. Remove noise from the data.
4. Increase the number of epochs or the duration of training to get better results.

Overfitting

A statistical model is said to be overfitted when we train it with a lot of data. When a model gets trained with so much data, it starts learning from the noise and inaccurate data entries in our data set. The model then does not categorize the data correctly, because of too many details and noise.

The causes of overfitting are the non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model based on the dataset and therefore they can build unrealistic models.

A solution to avoid overfitting is using a linear algorithm if we have linear data, or using parameters such as the maximal depth if we are using decision trees.

In a nutshell, overfitting = high variance and low bias.

Techniques to reduce overfitting:
1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (keep an eye on the loss over the training period; as soon as the loss begins to increase, stop training).
4. Ridge regularization and Lasso regularization (sketched below).
5. Use dropout for neural networks to tackle overfitting.
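As a brief sketch of technique 4 (assuming scikit-learn; the synthetic data and the alpha values are illustrative), Ridge and Lasso add an L2/L1 penalty that shrinks the weights of an over-flexible linear model and thereby reduces variance:

```python
# Ridge (L2) and Lasso (L1) regularization keep the weights smaller than plain
# least squares when there are few samples and many features.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))                    # few samples, many features
y = X[:, 0] + 0.1 * rng.normal(size=30)          # only the first feature matters

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.05)):
    model.fit(X, y)
    print(type(model).__name__, "sum of |weights|:", round(float(np.abs(model.coef_).sum()), 2))
```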
Ideally, the case when the model makes predictions with zero error is said to have a good fit on the data. This situation is achievable at a spot between overfitting and underfitting. In order to understand it, we will have to look at the performance of our model over time while it is learning from the training dataset.

With the passage of time, our model keeps on learning, and thus the error of the model on the training and testing data keeps on decreasing. If it learns for too long, the model becomes more prone to overfitting due to the presence of noise and less useful details, and the performance of the model decreases. In order to get a good fit, we stop at a point just before the error starts increasing. At this point the model is said to have good skill on the training dataset as well as on our unseen testing dataset.
Bias-variance trade-off

So what is the right measure? Depending on the model at hand, a performance that lies between overfitting and underfitting is more desirable. This trade-off is the most integral aspect of machine learning model training. As we discussed, machine learning models fulfil their purpose when they generalize well. Generalization is bound by the two undesirable outcomes, high bias and high variance. Detecting whether the model suffers from either one is the sole responsibility of the model developer.

Fig. 1.9.1: Under-fitting (too simple to explain the variance), appropriate fitting, and over-fitting (force-fitting, too good to be true)



Fig. 1.9.2: Bias-variance trade-off as a function of model capacity (bias decreases and variance increases as capacity grows; generalization error is minimized at the optimal capacity, with an underfitting zone below it and an overfitting zone above it)
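The sketch below (assuming scikit-learn; the synthetic data and the chosen degrees are illustrative) reproduces the shape of Fig. 1.9.2 by sweeping model capacity via polynomial degree and printing training versus validation error.

```python
# Sweep model capacity: training error keeps falling, while validation error
# first falls (underfitting zone) and then rises (overfitting zone).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=80)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.3 * rng.normal(size=80)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

for degree in (1, 3, 9, 15):                         # increasing capacity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          round(mean_squared_error(y_tr, model.predict(X_tr)), 3),   # training error
          round(mean_squared_error(y_va, model.predict(X_va)), 3))   # generalization estimate
```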

Chapter Ends.

MODULE 3
CHAPTER 3
Ensemble Learning

Syllabus
Understanding Ensembles, K-fold cross validation, Boosting, Stumping, XGBoost, Bagging, Subagging, Random Forest, Comparison with Boosting, Different ways to combine classifiers.

3.1 Ensemble Learning
    GQ. Explain the concept of Ensemble learning. (5 Marks)
3.2 Cross-validation in Machine Learning
    3.2.1 Methods used for Cross-Validation
    3.2.2 K-Fold Cross-Validation
    3.2.3 Life Cycle of K-Fold Cross-Validation
    3.2.4 Thumb Rules Associated with K-Fold
    3.2.5 Some Remarks
3.3 Stumping Ensembles
    3.3.1 For Continuous Features
    3.3.2 Remarks
3.4 Boosting in Machine Learning
3.5 XG-Boost
    3.5.1 Decision Tree
    3.5.2 Table of Contents
    GQ. Write a short note on: Bagging and Boosting. (10 Marks)
3.6 Random Forest
    3.6.1 Random Forest
    3.6.2 Random Forest Algorithm
    3.6.3 Bagging
    3.6.4 Row Sampling
    3.6.5 Important Features of Random Forest
    3.6.6 Difference Between Decision Tree and Random Forest
    3.6.7 Advantages and Disadvantages of Random Forest Algorithm
3.7 Comparison Between Random Forest and Boosting
3.8 Different Ways to Combine Classifiers
Chapter Ends

3.1 ENSEMBLE LEARNING

GQ. Explain the concept of Ensemble learning. (5 Marks)

An ensemble is a machine learning model that combines the predictions from two or more models. The models that contribute to the ensemble are called ensemble members. They may be of the same type or of different types, and they may or may not be trained on the same training data.

Remarks

1. An ensemble method is a technique that uses multiple independent similar or different models to derive an output or make some predictions. For example, a random forest is an ensemble of multiple decision trees.

2. Ensemble learning may also use a single machine learning algorithm, such as an unpruned decision tree, and train each model on a different sample of the same training dataset.

The predictions made by the ensemble members are combined using simple statistics, such as voting or averaging.
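A minimal sketch (assuming scikit-learn, not part of the original text) of such an ensemble, with members of different types combined by majority voting:

```python
# An ensemble of three different model types whose predictions are combined by voting.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
], voting="hard")                                  # combine the members by majority vote
print(cross_val_score(ensemble, X, y, cv=5).mean())
```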

3.2 CROSS-VALIDATION IN MACHINE LEARNING

Cross-validation is a technique for validating model efficiency by training the model on a subset of the input data and testing it on a previously unseen subset of the input data. It is a technique for checking how a statistical model generalises to an independent dataset.

In machine learning, we need to test the stability of the model; we cannot judge it based only on the training dataset. For this purpose, we fit our model on the training dataset and then test it on a sample which is not part of the training dataset, before we deploy the model. This complete process comes under cross-validation.

The basic steps of cross-validation are:
(i) Reserve a subset of the dataset as a validation set.
(ii) Provide the training to the model using the training dataset.
(iii) Evaluate model performance using the validation set.
If the model performs well with the validation set, perform the further steps, else check for the issues.

3.2.1 Methods used for Cross-Validation

The common methods used for cross-validation are:
(1) Validation set approach
(2) Leave-P-out cross-validation
(3) Leave-one-out cross-validation
(4) K-fold cross-validation
(5) Stratified K-fold cross-validation

Among these, K-fold cross-validation is easy to understand, and its output is less biased than that of the other methods.

3.2.2 K-Fold Cross-Validation

In K-fold cross-validation, training and testing are performed on each set (fold) exactly once during the entire process. This helps us to avoid the overfitting that can occur when a model is trained and evaluated on all of the data in a single shot to get the best performance accuracy. K-fold cross-validation thus helps us to build a model which is a generalized one.
To apply K-fold cross-validation, we have to split the data into training, testing and validation sets, which raises the challenge of the volume of data available for each. Here the test and validation data support model building and hyper-parameter assessment. The model is validated multiple times based on a value which is called K and should be assigned as an INTEGER parameter.
The dataset X is divided randomly into K equal-sized parts X_i, i = 1, 2, ..., K.

To generate each train/validation pair, we keep one of the K parts out as the validation set and combine the remaining (K - 1) parts to form the training set. Doing this K times, each time leaving out another one of the K parts, we get K pairs:

V_1 = X_1, T_1 = X_2 ∪ X_3 ∪ ... ∪ X_K
V_2 = X_2, T_2 = X_1 ∪ X_3 ∪ ... ∪ X_K
...
V_K = X_K, T_K = X_1 ∪ X_2 ∪ ... ∪ X_(K-1)

We come across two problems with this. First, to keep the training set large, we allow validation sets that are small. Second, the training sets overlap considerably; namely, any two training sets share (K - 2) parts.

K is typically 10 or 30. As K increases, the percentage of training instances increases and we get more robust estimators, but the validation set becomes smaller. Also, there is the cost of training the classifier K times, which increases as K is increased.

As N increases, K can be smaller; if N is small, K should be large to allow large enough training sets (N is the number of instances).

One extreme case of K-fold cross-validation is leave-one-out, where, given a dataset of N instances, only one instance is left out as the validation set and training uses the remaining (N - 1) instances. We then get N separate pairs by leaving out a different instance at each iteration. This is typically used in applications such as medical diagnosis.
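The sketch below (assuming scikit-learn; the toy dataset of 10 instances is our own) shows the K pairs (T_i, V_i) defined above as index sets.

```python
# KFold hands back, for each fold i, the indices of T_i (K-1 parts) and V_i (1 part).
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)                      # a toy dataset of N = 10 instances
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    print(f"fold {i}: T_{i} = {X[train_idx]}, V_{i} = {X[val_idx]}")
# Setting n_splits equal to N gives the leave-one-out case described above.
```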

3.2.3 Life Cycle of K-Fold Cross-Validation

Let us take a generalised K value. If K = 5, it means we are splitting the given dataset into 5 folds and running the train and test process on each of them.

Fig. 3.2.1: Life cycle of K-fold cross-validation (shuffle the dataset, split it into K folds, use (K - 1) folds for training and always leave one fold for testing, take care of all transformations within the fold, and find the accuracy on each fold)

During each run, one fold is considered for testing and the rest are used for training, moving on with the iterations. The pictorial representation below gives an idea of the flow for the defined fold size.
Fig. 3.2.2: Five-fold split (in iteration 1, fold 1 is the test fold and the rest are training folds; in iteration 2, fold 2 is the test fold; and so on up to iteration 5)

Here, each data point is used exactly once in the hold-out (test) set and (K - 1) times in training, so over the full run each fold is used once for testing and the rest of the time for training. In the above 5-fold setting, each iteration uses one fold for testing and the remaining four folds for training.

In each iteration we get an accuracy score; we then sum these scores and find their mean.

From this we can understand how consistently the model behaves across the data, and we can conclude whether or not to go to production with this model.
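A short sketch (assuming scikit-learn) of exactly this procedure, one accuracy score per fold followed by their mean:

```python
# One accuracy score per fold, then the mean across the K folds.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```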

Fig. 3.2.3: K-fold cross-validation over the full data (all data is first split into training data and held-out test data; the training data is divided into training folds and a test fold for each split i, giving a score E_i per split; the per-split accuracies are compared, and a final evaluation is done on the held-out test data)

3.2.4 Thumb Rules Associated with K-Fold

We discuss a few thumb rules to keep in mind while dealing with K-fold.

(i) K should always be >= 2 and at most equal to the number of records. If K = 2, there are just 2 iterations; if K equals the number of records in the dataset, then each iteration uses 1 record for testing and (n - 1) records for training.
(ii) The commonly used, well-optimised value for K is 10, used with data of a good size.
(iii) If the K value is large, this leads to less variance across the training sets and limits the difference between models across the iterations.
(iv) The number of folds is generally inversely proportional to the size of the dataset, which means that if the dataset size is small, the number of folds can increase.
(v) Larger values of K eventually increase the running time of the cross-validation process.
3.2.5 Some Remarks

(1) Short note on K-fold cross-validation

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called K that refers to the number of groups that a given data sample is to be split into. Hence, the procedure is often called K-fold cross-validation.

(2) Purpose of K-fold cross-validation

K-fold cross-validation is one method that attempts to maximise the use of the available data for model training and then testing. It is particularly useful for assessing model performance, as it provides a range of accuracy scores across (somewhat) different data sets.

(3) Folds in K-fold cross-validation

The key configuration parameter for K-fold cross-validation is K, which defines the number of folds into which a given dataset is split. Common values are K = 3, K = 5 and K = 10, and the most popular value used in applied machine learning to evaluate models is K = 10.

(4) Difference between K-fold and cross-validation score

A cross-validation score is a function which evaluates a model on the data and returns the score; K-fold, on the other hand, is a class which lets you split the data into K folds.

(5) Does K-fold cross-validation prevent overfitting?

K-fold cross-validation will not reduce overfitting on its own, but using it generally gives a better insight into the model, which can help to avoid or reduce overfitting.
3.3 STUMPING ENSEMBLES

A decision stump is a machine learning model consisting of a one-level decision tree. That is, it is a decision tree with one internal node (the root) which is immediately connected to the terminal nodes (its leaves).

A decision stump makes a prediction based on the value of just a single input feature. Sometimes decision stumps are also called 1-rules.

[Fig. 3.3.1 shows an example of a decision stump that discriminates between two of the three classes of the iris flower data set, Iris versicolor and Iris virginica, by testing whether the petal width (in centimetres) exceeds 1.75. This particular stump achieves 94% accuracy on the iris dataset for these two classes.]

Fig. 3.3.1: Decision stump on the iris dataset (Petal width > 1.75? No: Iris versicolor; Yes: Iris virginica)

Depending on the type of the input features, several variations are possible. For nominal features, one may build a stump which contains a leaf for each possible feature value, or a stump with two leaves, one of which corresponds to some chosen category and the other to all the other categories. For binary features these two schemes are identical. A missing value may be treated as yet another category.
3.3.1 For Continuous Features

Usually, some threshold feature value is selected, and the stump contains two leaves, for values below and above the threshold. However, in rare cases, multiple thresholds may be chosen, and the stump then contains three or more leaves.

Decision stumps are often used as components (called "weak learners" or "base learners") in machine learning ensemble techniques such as bagging and boosting.

3.3.2 Remarks

(1) Meaning of a stump in a decision tree

A decision stump is a decision tree which uses only a single attribute for splitting. For discrete attributes, this means that the tree consists of only a single interior node (i.e., the root has only leaves as successor nodes). If the attribute is numerical, the tree may be more complex.

(2) Are decision stumps linear?

A decision stump is not a linear model. However, its decision boundary can be a line, even though the model is not linear.
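As a sketch (assuming scikit-learn), a depth-1 decision tree trained on the two iris classes discussed above behaves as a decision stump; the exact feature and threshold it picks are learned from the data.

```python
# A decision stump: a depth-1 decision tree on the versicolor vs. virginica classes.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
mask = y != 0                                    # keep versicolor (1) and virginica (2)
stump = DecisionTreeClassifier(max_depth=1).fit(X[mask], y[mask])
print("training accuracy:", stump.score(X[mask], y[mask]))
# max_depth=1 means a single internal node, i.e. one threshold test on one feature.
```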

3.4 BOOSTING IN MACHINE LEARNING

Boosting is an ensemble modelling technique that attempts to build a strong classifier from a number of weak classifiers. It is done by building a model using weak models in series. First, a model is built from the training data. Then a second model is built which tries to correct the errors present in the first model. This procedure is continued and models are added until either the complete training data set is predicted correctly or the maximum number of models has been added.

AdaBoost was the first really successful boosting algorithm developed for the purpose of binary classification. AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that combines multiple "weak classifiers" into a single "strong classifier".

AdaBoost is the best starting point for understanding boosting algorithms. It is adaptive in the sense that subsequent classifiers are tweaked in favour of those instances misclassified by previous classifiers. It is sensitive to noisy data and outliers.

AdaBoost uses multiple iterations to generate a single composite strong learner. It creates a strong learner by iteratively adding weak learners. During each phase of training, a new weak learner is added to the ensemble, and a weighting vector is adjusted to focus on examples that were misclassified in previous rounds. The result is a classifier that has higher accuracy than the weak learner classifiers.
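A minimal AdaBoost sketch (assuming scikit-learn, where the default weak learner is a decision stump); the dataset is illustrative.

```python
# Compare a single decision stump with an AdaBoost ensemble of stumps.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
weak = DecisionTreeClassifier(max_depth=1)                       # a decision stump
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)   # stumps are the default weak learner
print("stump alone   :", cross_val_score(weak, X, y, cv=5).mean())
print("AdaBoost (100):", cross_val_score(boosted, X, y, cv=5).mean())
```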

3.5 XG-BOOST

XG-Boost stands for Extreme Gradient Boosting. It is written in C++ and optimises the training for gradient boosting. To understand XG-Boost, we first need to understand trees, especially the decision tree.

3.5.1 Decision Tree

A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subsets at a node all have the same value of the target variable, or when splitting no longer adds value to the predictions.

3.5.2 Table of Contents

We mention below the contents of XG-Boost:
(1) The power of XG-Boost
(2) Why ensemble learning? (i) Bagging (ii) Boosting
(3) Splitting algorithms
(4) System features
(5) Unique features of XG-Boost and goals of XG-Boost

(1) The power of XG-Boost

The beauty of this powerful algorithm lies in its scalability, which drives fast learning through parallel and distributed computing and offers efficient memory usage. XG-Boost is a most useful, straightforward and robust solution.
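Purely as a usage sketch, and assuming the separate `xgboost` Python package is installed, training an XG-Boost classifier through its scikit-learn-style interface looks roughly like this (the parameter values are illustrative, not recommendations):

```python
# Train and evaluate an XG-Boost classifier on a bundled binary dataset.
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```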
(2) Why Ensemble Learning?

GQ. Write a short note on: Bagging and Boosting. (10 Marks)

XG-Boost is an ensemble learning method. It may not be sufficient to rely upon the results of just one machine learning model. Ensemble learning offers a systematic solution to combine the predictive power of multiple learners. The resultant is a single model which gives the aggregated output from several models.

The models that form the ensemble could be from the same learning algorithm or from different learning algorithms. Bagging and boosting are two widely used ensemble learners. First we discuss bagging.

(i) Bagging

While decision trees are among the most easily interpretable models, they exhibit highly variable behaviour. Consider a single training dataset that we randomly split into two parts, using each part to train a decision tree in order to obtain two models. When we fit both these models, they would yield different results. Decision trees are said to be associated with high variance due to this behaviour.

Bagging (bootstrap aggregation) helps to reduce the variance of any such learner. Data sampled with replacement is fed to these learners for training. The final prediction is the averaged output from all the learners.

(ii) Boosting

In boosting,the trees are built


tree. Each tree learns sequentiallysuch that each subsequenttree
from its aims to reduce the errors s
predecessorsand updates the residual errors. of the prevD

Hence, the tree that grows next in the sequence will learn from an updated version of the residuals.
The base learners in boosting are weak learners in which the bias is high, and the predictive power is only a little better
than random guessing.

Each of these weak learners contributes some vital information for prediction. This enables the boosting technique to
produce a strong learner by effectively combining these weak learners.
The final strong learner brings down both the bias and the variance.
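A minimal sketch contrasting the two ideas with scikit-learn (assumed available): BaggingClassifier averages independently trained trees, while GradientBoostingClassifier adds trees sequentially to correct the previous errors. The dataset and settings are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: trees trained in parallel on bootstrap samples, predictions averaged.
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Boosting: trees added one after another, each correcting the previous errors.
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean CV accuracy:", round(scores.mean(), 3))
```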

(3) Splitting algorithms
(i) Exact greedy algorithm : The main problem in tree learning is to find the best split. This algorithm gives all the
possible splits on all the features. It is computationally demanding to enumerate all the possible splits for
continuous features.
(ii) Approximate algorithm : The exact greedy algorithm is very powerful since it gives all possible splitting points
greedily. But it is not possible to carry it out efficiently when the data does not fit entirely into memory. The
approximate algorithm proposes candidate splitting points according to percentiles of the feature distribution. The
algorithm then maps the continuous features into buckets split by these candidate points, aggregates the statistics,
and finds the best solution among the proposals based on the aggregated statistics.
(iii) Weighted quantile sketch : One important step in the approximate algorithm is to propose candidate split points.
XG-Boost has a distributed weighted quantile sketch algorithm to effectively handle weighted data.
(iv) Sparsity-aware split finding : In many real-world problems, it is quite common for the input X to be sparse.
There are multiple possible causes for sparsity :
(a) Presence of missing values in the data
(b) Frequent zero entries in the statistics
(c) Artifacts of feature engineering such as one-hot encoding
It is important to make the algorithm aware of the sparsity pattern in the data. XG-Boost handles all sparsity patterns in a
unified way.
(4) System features
(i) Collecting statistics for each column can be parallelised, giving us a parallel algorithm for split-finding.
(ii) Cache-aware access : XG-Boost has been designed to make optimal use of hardware. This is done by allocating
internal buffers in each thread, where the gradient statistics can be stored.
(iii) Blocks for out-of-core computation, for very large datasets that don't fit into memory.
(iv) Distributed computing for training very large models using a cluster of machines.
(v) Column block for parallel learning : The most time-consuming part of tree learning is to get the data into sorted
order. In order to reduce the cost of sorting, the data is stored in column blocks in sorted order in compressed
format.
(5) XG-Boost features
(i) Regularised learning : The regularisation term helps to smooth the final learnt weights to avoid overfitting. The
regularised objective will tend to select a model employing simple and predictive functions.
(ii) Gradient tree boosting : The tree ensemble model cannot be optimised using traditional optimization methods in
Euclidean space. Instead, the model is trained in an additive manner.
(iii) Shrinkage and column subsampling :
o Besides the regularised objective, two additional techniques are used to further prevent overfitting.
o The first technique is shrinkage. Shrinkage scales newly added weights by a factor η after each step of tree
boosting.
o Similar to a learning rate in stochastic optimisation, shrinkage reduces the influence of each tree and leaves space
for future trees to improve the model.
o The second technique is column (feature) subsampling. This technique is used in random forest. Column
sub-sampling prevents over-fitting even more so than the traditional row sub-sampling. The usage of column
sub-samples also speeds up computations of the parallel algorithm.
(6) Goals of XG-Boost
(i) Execution speed : XG-Boost was almost always faster than the other benchmarked implementations from Python;
execution speed is really a deciding factor when compared to the other algorithms.
(ii) Model performance : XG-Boost dominates on structured or tabular datasets for classification and regression
predictive modelling problems.
(7) Conclusion
o XG-Boost is a faster algorithm when compared to other algorithms because of its parallel and distributed computing.
o XG-Boost is developed with deep considerations in terms of both systems optimisation and principles in machine
learning.
o The goal of this library is to push the extreme of the computation limits of machines to provide a scalable, portable
and accurate library.
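A minimal usage sketch with the xgboost Python package (assumed installed); the parameter values are illustrative only, chosen to echo the shrinkage (learning_rate) and column subsampling (colsample_bytree) ideas above.

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Regularised gradient boosting: learning_rate is the shrinkage factor,
# colsample_bytree is the column subsampling discussed above.
model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    colsample_bytree=0.8,
    n_jobs=-1,
)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```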

3.6 RANDOM FOREST

3.6.1 Random Forest
Random forest is a supervised machine learning algorithm that is used widely in classification and regression problems.
It builds decision trees on different samples and takes their majority vote for classification and average in case of
regression.
[Fig. 3.6.1 : Dataset fed to Decision tree-1 ... Decision tree-N, each giving Result-1 ... Result-N, combined by majority voting / averaging into the final result]

3.6.2 Random Forest Algorithm
Random forest algorithm can handle a data set containing continuous variables in case of regression and categorical
variables in case of classification.
It gives better results for classification problems.
Working of Random Forest Algorithm
Ensemble Technique
Ensemble means combining multiple models. Here a collection of models makes predictions rather than an individual
model.
Ensemble uses two types of methods :
1. Bagging : It creates a different training subset from sample training data with replacement. The final output is
based on majority voting. For example, Random forest.
2. Boosting : It creates sequential models such that the final model has the highest accuracy. Then it combines weak
learners into strong learners. For example, AdaBoost, XG-Boost.
[Fig. 3.6.2 : Bagging (parallel) versus Boosting (sequential)]
Remark

Random forest works on the Bagging principle.


3.6.3 Bagging
Bagging is also known as Bootstrap Aggregation. It is an ensemble technique used by random forest.

Bagging chooses a random sample from the data set.


Hence each model is generated from the samples (Bootstrap samples) provided by the original data with replacement. This
is known as row sampling.
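A tiny sketch of bootstrap (row) sampling with NumPy; the array `data` is only a stand-in for the training rows.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)          # stand-in for 10 training rows

# Row sampling WITH replacement: each bootstrap sample has the same size as
# the original data, so some rows repeat and roughly one third are left out.
bootstrap_samples = [rng.choice(data, size=data.shape[0], replace=True)
                     for _ in range(3)]
for i, sample in enumerate(bootstrap_samples, start=1):
    print(f"Bootstrap sample {i:02d}:", sample)
```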
3.6.4 Row Sampling
This step of row sampling with replacement is called bootstrap.
Each model is trained independently and generates results. The final output is based on majority voting after combining
the results of all models.
The output that is based on majority voting is known as aggregation.
[Fig. 3.6.3 : Original data → bootstrapping → classifiers → aggregation → ensemble classifier (Bagging)]
We observe that the bootstrap samples are taken from the actual data (Bootstrap sample 01, Bootstrap sample 02
and Bootstrap sample 03) with replacement.
Clearly each sample will not contain unique data.
Now, the models Model 01, Model 02 and Model 03 obtained from these bootstrap samples are trained independently.
Each model generates results, and based on majority voting the final output is obtained.
[Fig. 3.6.4 : Actual data → Bootstrap samples 01 / 02 / 03 → Model 01 / 02 / 03]
Steps involved in random forest algorithm
Step 1 : In random forest, n number of random records are taken from the data set having k number of records.
Step 2 : Individual decision trees are constructed for each sample.

Step 3 : Each decision tree will generate an output.
Step 4 : The final output is based on majority voting or averaging, for classification and regression respectively.
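A minimal sketch of these steps using scikit-learn's RandomForestClassifier (assumed available); the dataset is a placeholder.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample of the rows (Steps 1-2);
# their predictions are combined by majority voting (Steps 3-4).
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```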

3.6.5 Important Features of Random Forest
1. Diversity : Not all attributes / variables / features are considered while making an individual tree; each tree is different.
2. Immune to the curse of dimensionality : Since each tree does not consider all the features, the feature-space is reduced.
3. Parallelisation : Each tree is created independently out of different data and attributes. This implies that we can make
full use of the CPU to build random forests.
4. Train-Test split : In a random forest we don't have to segregate the data for training and testing. It is because there
will always be about 30% of the data which is not seen by the decision tree.
5. Stability : Stability arises because the result is based on majority voting / averaging.
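As a side note on the diversity point, the forest can report how much each attribute contributed across all its trees. A short, illustrative sketch (dataset chosen only for demonstration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(iris.data, iris.target)

# Because every tree sees a different subset of rows and features, averaging
# the impurity reductions gives a per-feature importance score.
for name, importance in zip(iris.feature_names, rf.feature_importances_):
    print(f"{name:25s} {importance:.3f}")
```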

3.6.6 Difference Between Decision Tree and Random Forest
Actually, Random forest is a collection of decision trees; still there are differences in their behaviour.

Sr. No. | Decision Trees | Random Forest
1. | Decision trees normally suffer from the problem of overfitting, since a tree is allowed to grow without any control. | Random forests are created from subsets of data and the final output is based on average or majority ranking; hence the problem of overfitting is taken care of.
2. | A single decision tree is faster in computation. | It is comparatively slower.
3. | When a dataset with features is taken as input by a decision tree, it will formulate some set of rules to do prediction. | Random forest randomly selects observations, builds a decision tree and the average result is taken. It does not use any set of formulas.

We conclude that if the trees are diverse and acceptable then random forests are much more successful than decision trees.
Important Hyperparameters
Hyperparameters are used in random forests to
(i) increase the predictive power of the models, or
(ii) make the model faster.
(I) The following hyperparameters increase the predictive power :
1. n_estimators : Number of trees the algorithm builds before averaging the predictions.
2. max_features : Maximum number of features random forest considers when splitting a node.
3. min_samples_leaf : It determines the minimum number of samples required at a leaf when splitting an internal node.
(II) The following hyperparameters increase the speed :
1. n_jobs : It tells the engine how many processors it is allowed to use. If the value is 1, it can use only one processor,
but if the value is -1, there is no limit.
2. random_state : It controls the randomness of the sample. The model will always produce the same results if it has a
definite value of random state and if it has been given the same hyperparameters and the same training data.
3. oob_score : OOB implies "out of the bag". It is a random forest cross-validation method. In this method, about
one-third of the sample is not used to train the model; instead it is used to evaluate the performance.
These samples are called out-of-bag samples.
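A short sketch showing these parameters together (scikit-learn assumed; the values are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=25, random_state=1)

rf = RandomForestClassifier(
    n_estimators=200,       # trees built before averaging
    max_features="sqrt",    # features considered at each split
    min_samples_leaf=2,     # minimum samples allowed in a leaf
    n_jobs=-1,              # use all processors
    random_state=1,         # reproducible sampling
    oob_score=True,         # evaluate on the out-of-bag samples
)
rf.fit(X, y)
print("OOB accuracy estimate:", rf.oob_score_)
```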

3.6.7 Advantages and Disadvantages of Random Forest Algorithm

Advantages
1. It can be used in classification and regression problems.
2. It solves the problem of overfitting, as the output is based on majority voting or averaging.
3. Even if the data contains missing values or null values, it performs well.
4. Each decision tree created is independent of the others. It means that it shows the property of parallelisation.
5. It is highly stable, as the average of the answers given by a large number of trees is taken.
6. It maintains diversity, as all the attributes are not considered while making each decision tree (even though this is not true
in all cases).
7. It is immune to the curse of dimensionality. Since each tree does not consider all the attributes, the feature-space is reduced.
8. We need not segregate data into train and test sets, as there will always be about 30% of the data which is not seen by the
decision tree built from the bootstrap sample.

Disadvantages
1. Random forest is highly complex when compared to decision trees where decisions can be made by following the path
of the tree.

2. Training time is more compared to other models due to its complexity.

3. Whenever it has to make a prediction, each decision tree has to generate an output for the given input data.

Remark
1. Random forest can handle binary, continuous and categorical data.

2. Random forest can handle missing values.


3. Random forest is a fast, simple, flexible and robust model with some limitations.

3.7 COMPARISON BETWEEN RANDOM FOREST AND BOOSTING

Sr. No. | Random Forest | Boosting
1. | Random forest is a bagging technique. | The decision trees are built additively, i.e. each decision tree is built one after another.
2. | It is not a boosting technique. | In boosting, one learner is learning from another, which in turn boosts the learning.
3. | The trees in a random forest run in parallel. | Boosting is an approach to increase the complexity of models that suffer from high bias, that is, models that underfit the training data.
4. | There is no interaction between these trees while building the trees. | Boosting reduces error mainly by reducing bias, and also to some extent variance, by aggregating the output from many models.
5. | Random forest and bagging are "bagging" algorithms that aim to reduce the complexity of models that overfit the training data. | XG-Boost always gives more importance to functional space when reducing the cost of a model.
6. | Random forest uses fully grown decision trees (low bias, high variance). It tackles the error reduction in the opposite way : by reducing variance. | Gradient boosting may not be a good choice if there is a lot of noise, as it can result in overfitting.
7. | Random forest tries to give more preference to hyperparameters to optimise the model. | Boosted trees also tend to be harder to tune than random forests.
8. | Random forest may not give better performance compared to gradient boosting. | Gradient boosting trees can be more accurate than random forests.
9. | The basic idea in random forest is that although a single tree may be inaccurate, the collective decisions of a bunch of trees are likely to be right most of the time. | We train gradient boosting trees to correct each other's errors, so they can capture complex patterns in the data.
10. | Forests are more robust and more accurate than a single tree. But they are harder to interpret, since each classification decision or regression output is not one but multiple decision paths. | If the data is noisy, the boosted trees may overfit and start modelling the noise.

3.8 DIFFERENT WAYS TO COMBINE CLASSIFIERS

(1) The simplest way of combining classifier output is to allow each classifier to make its own prediction and then
choose the plurality prediction as the "final" output. The simple voting scheme is easy to implement and easy to
understand, but it does not always produce the best possible results.
(2) The different ways to combine classifiers can be divided into two big groups :
Ensemble methods : Refers to sets of systems that combine to create a new system using the same learning
technique. Bagging and boosting are the most extended ones.
Hybrid methods : Take a set of different learners and combine them using new learning techniques.
Model ensembling is also used for combining different classifiers.

(3) Combining classifiers is better than selecting the best one
We mention two extensions of this method : (i) one using an extended set of meta-level features and (ii) the other
using multi-response model trees to learn at the meta-level. The latter extension performs better than existing
stacking approaches and better than selecting the best classifier by cross-validation.
Are ensemble classifiers always better than a single classifier? There is a possibility that an ensemble model performs
better than an individual model. If we build many of those and the individual classifiers are weak, then the overall
performance should be better than that of an individual model.
To use a stacking classifier : A simple way to achieve this is to split the training set in half, as shown in Fig. 3.8.1.
[Fig. 3.8.1 : Stacking - Level 1 base classifiers feeding a Level 2 meta-classifier]
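A minimal stacking sketch with scikit-learn's StackingClassifier (an illustrative setup, not the text's own figure). Note that this class uses cross-validated predictions from the level-1 learners rather than a literal half split, but it illustrates the same idea of a level-2 meta-learner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level 1: diverse base classifiers; Level 2: a meta-classifier trained on
# their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("tree", DecisionTreeClassifier())],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
print("Test accuracy:", stack.score(X_test, y_test))
```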

Classes of classifiers
There are three classes of classifiers :
(i) Descriptive classifier : Descriptive classifiers are used to describe shape, size, texture or a pattern.
(ii) Instrument classifier : The handshapes of instrument classifiers describe how an object is handled.
(iii) Element classifier : It takes a set of different learners and combines them using new learning techniques.

Chapter Ends..
MODULE 5

CHAPTER
Learning with Clustering
5
Syllabus
Introduction to clustering with overview of distance metrics and major clustering approaches.
Graph Based Clustering : Clustering with minimal spanning tree, Model based Clustering : Expectation Maximization
Algorithm, Density Based Clustering : DBSCAN.

5.1 Introduction to Clustering ... 5-3
5.1.1 Building the Similarity Graph ... 5-3
5.1.2 Projecting the Data onto a Lower Dimensional Space ... 5-3
5.1.3 K-means and Spectral Clustering ... 5-4
5.2 Evaluation Methods Based on Ground Truth Homogeneity ... 5-4
5.2.1 Trivial Homogeneity ... 5-4
5.2.2 Remark ... 5-4
5.2.3 Remarks ... 5-5
5.3 Adjusted Rand Index ... 5-6
5.4 Adjusted Rand Index ... 5-7
5.4.1 The Contingency Table ... 5-7
5.4.2 Definition ... 5-7
5.5 Concepts of Weak and Eager Learner ... 5-7
5.5.1 Example of Lazy Learner ... 5-8
5.6 Unsupervised Learning : Hierarchical Clustering ... 5-8
5.6.1 Hierarchical Clustering ... 5-8
5.6.2 Examples on Hierarchical Clustering ... 5-9
UEx. 5.6.3 (MU - May 16, 10 Marks) ... 5-14
UEx. 5.6.4 (MU - May 17, 10 Marks) ... 5-15
5.7 Unsupervised Learning : Association Rules ... 5-16
5.7.1 Introduction to Association Rules ... 5-16
5.7.2 Apriori Algorithm ... 5-16
5.7.3 Performance Measures ...
5.7.4 Issues in Machine Learning ... 5-19
UQ. What are the issues in Machine learning ? (MU - May 15, 5 Marks) ... 5-19
5.7.5 How to choose the right algorithm ? ... 5-19
UQ. Explain the steps required for selecting the right machine learning algorithm. (MU - May 16, 8 Marks) ... 5-19
5.7.6 Steps in Developing a Machine Learning Application ... 5-20
UQ. Explain the steps of developing Machine Learning applications. (MU - May 19, 10 Marks) ... 5-20
5.7.7 Applications of Machine Learning ... 5-21
UQ. Write short note on : Machine learning applications. (MU - May 16, May 17, 10 Marks) ... 5-21
5.8 Graph Based Clustering ... 5-22
5.8.1 Graph Clustering Algorithm ... 5-22
5.8.2 Method of Graph-based Clustering ... 5-22
5.9 Clustering with Minimal Spanning Tree ... 5-25
5.9.1 The Cluster Center-initialisation Method ... 5-25
5.9.2 Definition : Geodesic Distance ... 5-26
5.9.3 Inconsistent Edge ... 5-27
5.9.4 Internal Clustering Validation Index ... 5-27
5.10 Model Based Clustering ... 5-27
5.10.1 Use of EM ... 5-28
5.10.2 EM Algorithm ... 5-28
GQ. Write short note on : EM algorithm. (5 Marks) ... 5-28
5.10.3 Advantages of EM Algorithm ... 5-28
5.10.4 Applications of EM-Algorithm ... 5-28
5.11 Expectation Maximisation Algorithm ... 5-29
5.11.1 Latent Variable Model ... 5-29
5.11.2 Use of Latent Variable Model ... 5-29
5.11.3 Use-Case of EM Algorithm ... 5-29
5.11.4 Multivariate Gaussian Model ... 5-29
5.11.5 Gaussian Mixture Models ... 5-30
5.12 Density Based Clustering ... 5-30
5.12.1 Irregularities in the Data ... 5-31
5.12.2 The Basic Principle of DBSCAN Clustering ... 5-31
5.12.3 DBSCAN and K-means ... 5-31
Chapter Ends ...
5.1 INTRODUCTION TO CLUSTERING

Spectral Clustering
Spectral clustering is an algorithm which performs better than many traditional clustering algorithms in many cases.
It treats each data point as a graph node and thus transforms the clustering problem into a graph-partitioning problem.
Implementation of spectral clustering consists of three fundamental steps.

5.1.1 Building the Similarity Graph
This step builds the similarity graph in the form of an adjacency matrix, which is represented by A.
The adjacency matrix A can be constructed in the following manner :
(i) Epsilon-neighbourhood graph : First a parameter epsilon is fixed. Then every point is connected to all the points
which lie in its epsilon-radius. If all the distances between any two points are similar in scale, then the weights of
the edges (i.e. the distances between the points) are not stored, because they do not provide any additional
information. In this case, the graph drawn or built is undirected and unweighted.
(ii) K-nearest neighbours : Again, we fix a parameter k. Then for two vertices u and v, an edge is directed from u to v
only if v is among the k-nearest neighbours of u. We follow one of the following approaches :
(a) Direct an edge from u to v and from v to u if either v is among the k-nearest neighbours of u OR u is among
the k-nearest neighbours of v.
(b) Direct an edge from u to v and from v to u if v is among the k-nearest neighbours of u AND u is among the
k-nearest neighbours of v.
(iii) Fully-connected graph : Here, each point is connected with an undirected edge to every other point, weighted by
the distance between the two points. Here, the Gaussian similarity metric is used to calculate the distance.

5.1.2 Projecting the Data onto a Lower Dimensional Space
There is a possibility that members of the same cluster may be far away in the given dimensional space. Thus the
dimensional space is reduced so that those points are closer in the reduced dimensional space and can be clustered
together by a traditional clustering algorithm.
This is carried out by the Graph Laplacian matrix. To compute it, we define the degree of a node. The degree of the
i-th node is given by
d_i = Σ_{j : (i, j) ∈ E} w_ij
Note that w_ij is the weight of the edge between the nodes i and j as defined in the adjacency matrix.
The degree matrix is defined as D_ij = d_i if i = j, and 0 if i ≠ j.
Then the Graph Laplacian matrix is defined as L = D − A. This matrix is then normalized.
To reduce the dimensions, the eigenvalues and the respective eigenvectors are calculated. If the number of clusters is
k, then the first k eigenvalues and their eigenvectors are taken and stacked into a matrix such that the eigenvectors are
the columns.

3. Clustering the Data
The process involves clustering the reduced data using the K-means clustering technique.
First, each node is assigned a row of the normalized Graph Laplacian matrix. Then this data is clustered using any
traditional technique.
chine Leaming U-Sem.7-Comp.) (Learningwith Clustering)..Pageno.(5-
Properties
Homogeneitydescribes the closeness of the clustering
ssumption-Less : This clustering technique does not algorithm to this perfection.
Assu

assur the data to follow some property. This makes 2. Completeness:A perfectlycompleteclustering is one
is technique to answer a more-generic class of where all data-points belonging to the same class are
this

clustering problems. clustered into the same cluster.


and speed: This algorithm Completenessdescribes the closeness of the clustering
Fase ofimplementation algorithm to this perfection.
sis easier to implement
than other clusteringalgorithms
and is also very fast as it mainly consists of
mathematical computations. 5.2.1 Trivial Homogeneity
to
Not-scalable : This is time-consuming for dense case when the number of clusters is equal
It is the
it involves the building of matrices the number of data points and each point is in exactly
datasets, because
and computation
of eigenvalues and eigenvectors. one cluster.
while
Working of spectral clustering It takes place when homogeneity is highest
completeness is minimum.
clustering, the data points are treated as
. In spectral
is treated graph
nodes of a graph. Thus, clustering
as a
red
partitioning problem.
green
The nodes are mapped to a low dimensional space
then Black all
that can be easily segregated to form clusters.

Fig. 5.2.1
5.1.3 K-means and Spectral Clustering
Data points as nodes of a Trivial completeness
1. Spectral clustering :
when all the data points are clustered into
clusters found by partitioning It is the case
connected graph and are
into one cluster.
this graph. It is based on its spectral decomposition minimumn
It is the extreme case when homogeneity is
subgraphs.
and completeness is maximum.
K-means clustering: It divides the objects into
2.
K-clusters such that some metric relative to the

centroids of the clusters is minimised.

Advantage of spectral clustering


cluster "points" which
One advantage is its ability to
for this a "similarity",
necessarily vectors and to use
are not

which is less-restrictive than a distance.


Fig. 5.2.2

W5.2 EVALUATION METHODS BASED ON is


GROUND TRUTH HOMOGENEITY We assume that each data point in the above diagram
for Trivial Homogeneity and
of the different class label
is Trivial Completeness.
clustering technique
One of the disadvantages of any
its performance. a 5.2.2 Remark
it it is difficult to evaluate
that
V-measure was
metric of from completeness.
To tackle this problem, the The term homogeneous is different
basic concept is of the respective
developed. he In homogeneity the
V-measure requires whether in each cluster, each
he calculation of the cluster which we check
label.
calculation of two terms data is of the same class
point
clustering basic concept is of the respective
In completeness, the
1. homogeneous
omogeneity: perfectly A to the whether data points of cach
one where each cluster has
data-points belonging class label which
we check
cluster.
class label is in the
same
same class label.

Tech-Neo Publications.A SACHIN SHAH Venture

academic year 22-23)


(M7-79)
New Syllabusw.e.f
(LeamingwithClustering)...Page no.
(5-5)
Machine Leaming (MU-Sem.7-Comp.)
K ack
Class 1 and H(K) =- 2 CJlog C
Class 2 K=1

Thus the weighted V-measure Vp is given by the


Fig. 5.2.3

is perfectly following
In the above diagram, the clustering VB B)hc
cluster the data points are
homogeneous since in each Bh +C
complete because
of the same class label but it is
not
be adjusted in such way that either
a
notall data points of the same class label belong to the The factor ß can
the homogeneity is favoured or completeness of the
same class label.
algorithm is favoured.
evaluation metric is :
Red Red Class 1
The advantage of this
Green class 2 (i) It is independent of the number of class labels.

Green
Blue Blue class 3 (i) The number of clusters.
(ii) The size of the data.
Fig. 5.2.4
(iv) The clusteringalgorithmand
In the above diagram, the clustering is perfectly
(v) Is a reliable metric.
is because all data points of the same class
compiete. It
label belong to the same cluster. Remarks
5.2.3
But it is not homogeneous because the 1" cluster
contains data points of many class labels. (i) Homogeneity in clustering

Let us assume that there are N data samples, C A clustering result satisfies homogeneity if all of its
clusters and ak number of clusters contain only data points which are members of
differentclass labels, K a single class.
data-points belonging to the class C and cluster K.
Then the homogeneity h is given by the following This metric is independent of the absolute values of the
labels a permutation of the class or cluster label
h =1 CK)
HC) values won't change the score value.
Estimated number of clusters 3
Where
N C
Blue
HCK) - ack
log ack
K=1 C =1
Green
LC=1

Fig. 5.2.5
ack 2 ack
(ii) Homogeneity score
and H(C) =-
C 1
The score is useful to check whether the clustering
The completeness C is given by the formula algorithm meets an important requirement, i.e. a cluster
should contain only samples belonging to a single
C =1(KC) class.
H(K)
It is defined as
C

Where H(K,C) =- 2 log ack


C=1 K=1
h = 1- H(YueYpred)
HYtrue)
It is bounded between 0 and 1, with low values
indicatinga low
homogeneity.
(New Syllabus w.e.f academicyear 22-23) (M7-79)
Tech-Neo Publications...ASACHIN SHAH Venture
Machir Leaning (MU-Sem.7-Comp.)
(LeaningwithClustering)..Pageno. (5-6)
ADJUSTED RAND INDEX (i) Note that the denominator is the total number of pairs,
S.3 hence the Rand index represents the frequeney of
Rand is measure of the
The Rend ndex or measure a occurrence of agreements over the total pairs.
hetween two data clusterings.
similarity (i) Or, the probability that X and Y will agree on a
The Rand Index is used in statistics and in particularin randomly chosen pair.
data clustering.
Note that, nC - 0-0
form of the Rand
index is adjusted for the chance
It is called as Adjusted Rand (ii) Also rand index can be seen as a measure of the
grouping of elements.
Index. percentage of correct decisions made by the algorithm.
It can also be computed as,
From a mathematical standpoint, Rand index is related
to the accuracy. But it is also applicable when class
RI =
TP +TN
used. TP +FP +FN+TN
labels are not
Yellow Where
Black
- green Black TP is the number of true positives,
TN is the number of true negatives,
FP is the number of false positives,
Green
Yellow And FN is the number of false negatives.
Fig.5.3.1
Properties
dataset with the K
Example clustering for a means
(i) The rand index has a value between 0 and 1 with 0
(left) and Mean shift (right) algorithm. indicates that the two data clusterings do not agree on

Definition:Rand Indexx any pair of pointsand1


Consider a set of n elements. (ii) 1 indicating that the data clustering are exactly the
S =
la, a2, , a) and two partitions of S to same.
terms as follows:
compare
We define a, b, c, d in mathematical
X =
{X]. X2, ..
X,} a partitionofS into r subsets (i) a IS*I, where
and S = {(o o)l oi, oj, e X, o, 0e YAj
subsets.
{Y1. Y2... Ys} apartition of S
Y =
into s
(ii) b = IS*I where
With this we define
that in the
) a, the number of pairs of elements in S are
E Ye Y1
same subset in X and in the same subset in Y.
(ii) C = IS*I where
in S that are in
() b, the number of pairs of elements
subsets in Y.
different subsets in X and in different
(iv) d ISl where
in S that are in the
=

(n) c, the number of pairs of elements


same subset in X and in different
subsets in Y.
S [(oo)le K1.0E X20,jE Yal
in S that are in For some
(v) d, the number of pairs of elements
diterent subsets in X and in the
same subset in Y. 1si,jsn, i #j, 1 sk, k, k2 Sr
The Rand index R is k #k
R a+b
a+b+c+dnC2
a+b 1 SA,21,2$ S,A1 *2
Remark
as the number of labelled belonging
Clearly, (a + b) be considered can
is the number of pairs corectly
as

d as the number () a
agreements between X and Y and
c + and
to the same subset (true positive)
of disagreements between X and Y.
the number of pairs correctly
labelled as belonging
(i) b is
to differentsubsets.
SACHIN SHAH Venture
a Tech-Neo Publications..A
academic year 22-23) (M7-79)
Vew syllabus w.e.f
Machine Learning (MU-Sem.7-Comp.)
(Learningwith Clustering)...Pageno.(5-7
ADJUSTED RAND INDEX
a (a-1) etc
5.4 Note:2-Ca
index is the corrected for chance
The adjusted Rand Remarks
version of the Rand index.
1. Good adjusted Rand score (ARI)
This correction decides the base line by using the
The adjusted Rand index (ARI) should be as follows:
expected similarity of all pair-wise comparisons
between clusterings specified by a random model. (i) ARI20.90, excellent recovery
The Rand index assumes the value between 0 and + 1, (ii) 0.80 S ARI <0.90; good recovery
the adjusted Rand index may take negative values if (iii) 0.65 S ARI <0.80, moderate recovery
the index is less than the expected index.
ARI <0.65:Poorrecovery
5.4.1 The ConcingencyTable 2. Negative ARI

Consider a set of n elements and two clustering Negative ARI says that the agreement is less than what
(grouping or partitions) of these elements, namely, is expected from a random result.
X = {X, X2. X} and This implies that the results are 'orthogonal' or
Y = {Y1. Y2, .. Ys} complimentary' to some extent.

The overlap between X and Y can be mentioned in a M 5.5 CONCEPTS OF WEAK AND EAGER
contingency table {nji}, where each entry nj denotes
LEARNER
the number of objects in common between X; and Y
n XnY The distinction between easy learners and lazy learners

We prepare a table
is based on when the algorithmchooses from thedata
A lazy leaner delays choosing from the data, until it is

X Y Y Ys Sums asked to make a prediction, while an eager leamer


takes away from the data during training and uses this
X1 D11 12 is | a1 choice to make predictions.
It does not directly compare queries with instances in
X2 n21 22 2s a2 the dataset.

Lazy learner
X i) It just stores dataset without learning from it.

Sums bb2 b (ii) Start classifying data when it receives test data.
ii) It takes less time for learning and more time ror
5.4.2 Definition classifyingdata.
The originaladjustedRand Index is given by (ARI) Eager Learner
(1) When it receives data.set, it starts classitying
(learning).
i) It does not wait for test-data to learn.
i.j 11) It takes long time learning and less time classifyg
ARI
data.

1Li Examples
In supervised learning:

Lazy
Where nip a, bj are values from the
contingencytable K-NearestNeighbour,Case-brandReasoning

(New Syllabus w.e.f academic year 22-23) (M7-79)


Tech-NeoPublications...A
SACHIN SHAH
enture
Machine Leaming
(MU-Sem.7-Com (Learningwith Clustering)...Page no.(5-8)
instance defined in
The nearest neighbours of an are
Eager

terms of Euclidean distance.


Decision Tree, Naive Bayes, Artificial Neural
Lazy decislon tree
Networks.

Eager and
lazy learning in machine learning Lazy decision tree constructs a customised decision
tree for each test, which consists of only a single path from
leaning methods construct general, explicit
Eager
E the root to a leaf node.
description of
of the target function. It is based on
provided training cxamples. Lazy algorlthm
learning methods simply store the data and A lazy learning algorithm is a learning algorithm that
Lazy
eeneralize the data only when an explicit request is can be applied by a lazy learning system (to solve a lazy
e
made. learning task).
construct a classification model based
Eager leamers Lazy learner context
on the given training
data, even before receiving data
It lazily postpones any work until receiving testingg
for classification.
It considers a single hypothesis that records and only performs the work, which is necessary to
covers the entire instance space.
predict its Target values.
In machine learning, lazy learning
is a learning method
in which generalisation of the training data is delayed
until a query is made to the system.
5.6 UNSUPERVISEDLEARNING
HIERARCHICALCLUSTERING
Naive Bayes is an eager learner
Naive Bayes classification algorithm is one of eager 5.6.1 Hierarchical Clustering
learning algorithmsthat promotes the class conditional
Agglomerative HierarchicalClustering
independence.
In agglomerative clustering initially each data point is
And it predictsthe class label in the fastest manner.
considered as a single cluster. In the next step, pairs of
5.5.1 Example of Lazy Learner clusters are merged or agglomerated.

Lazy learningrefersto any machine learningprocess in This step is repeated until all clusters have been
which majority of computation defers to consultation
merged in to a single cluster. At the end a single cluster
time. remains that contains all the data points.
Thetypicalexamples are:
Hierarchical clustering algorithms works in top-down
G) Instance based learning and manner or bottom-up manner. Hierarchical clustering
i) Lazy Bayesian Rules. is known as Hierarchical agglomerative clustering.
In agglomerative, clustering is represented as a
Lazy learning stands in contrast to eager learning in
which the majority of computation occurs at training dendogram as in Fig. 5.6.1 where each merge is
time. represented by a horizontal line.

K-NN is learner. It does not learn a


a lazy Root: One node
discriminative function from the training data. Instead
it memorises the training dataset.
There is no training time in K-NN. The prediction step
in K-NN is
expensive.
When we supply training data to this algorithm, the
algorithmdoes not train itself at all.
-Leaf. Individual
E
Instance-based learning A B C
clusters

Stance based methods are referred to as lazy learning


Fig. 5.6.1: Dendogram
nethods because they delay processing until a new

Instance is classified.
Tech-Neo Publications..ASACHIN SHAH Venture
W
Sylabus w.e.f academic year 22-23) (M7-79)
Machine Learning(MU-Sem.7-Comp.) (LeamingwithClustering)..Pageno. (5-9)
A clustering of the data objects is obtained by cutting In complete linkage, the distance between two clusters
the dendogram at the desired level, then each is considered to be equal to greatest distance from any
connected forms a cluster. member of one cluster to any member of other cluster.
The basic steps of Agglomerative hierarchical D, s) = Max (d (i, j), object i cluster r and object j
clustering are as follows clusters
1 Compute the proximity matrix (distance matrix)
In average linkage, we consider the distance between
2 Assume each data point as a cluster. any two clusters A and B is taken to be equal to
3 Repeat average of all distances between pairs of object i in A
4. Merge the two nearest clusters. and j in B.ie., mean distance between elements of each
5. Update the proximity matrix other.

6. Until only a single cluster remains D,5) = Mean {d (i. j), object i- cluster r and objectj

In Agglomerative hierarchical clustering proximity clusters


matrix is symmetric i.e., the number on lower half will
be same as the numbers on top half. 5.6.2 Examples on Hierarchical Clustering
Different approaches to defining the distance between
Ex. 5.6.1 : The table shows the six data points. Use all link
clusters distinguish the different algorithm's i.e.,
methods to find clusters. Use Euclidian distance measure.
Single linkage, Complete linkage and Average linkage
clusters.
In single linkage, the distance between two clusters is D 0.4 0.53
considered to be equal to shortest distance from
any D2 0.22 0.38
member of one cluster to any member of other cluster.
D3 0.35 0.32
Stat D4 0.26 0.19
Object and their measured Ds 0.08 0.41
features
D 0.45 0.30

Compute distancematrix Soln.:


First we will solve using single linkage
Set object as cluster The distance of data
point

No of D =
(0.4, 0.53) to D2 (0.22, 0.38) is,
Yes Yes End
=

cluster=1
No
D
0.4-0.22)+(0.53 -0.38°
2
=0.24
The distance of data
Merge 2 closest clusters point
D =
(04, 0.53) to Dg =
(0.35, 0.32)is,
Update distancematrix D =
V0.4 0.35) + (0.53 -0.32) 0.22 =

Fig. 5.6.2 The distance of data


point
Dr, s) =
Min {d (i,j), object i cluster r and
object j D =
(0.4, 0.53) to D4 (0.26,0.19)is,
=

cluster s
D
= o4-0.26)+(0.53 0.19) =0.37
(New Syllabus w.e.f academic year 22-23)
(M7-79)
Tech-NeoPublications..ASACHIN
SHAH Venture
chineLeaming(MU-Sem.7-Comp.)
The
distance of data point
0.14 is
(Leaningwith Clustering)...Page no.(5-10)
D, =(0.4,0.53) to Ds = (0.08, 0.41) is, smallest. D, and D have
we
combine this two in one smallest distance. So,
matrix. cluster and recalculate distance
n V04-0.08)+(0.53 0.41)=0.34
Distance ((D3, D,).
distance of data point (D2, D;)) =
min (distance (Da, D2»»
The distance (D6. Da), distance
D= (0.4,0.53) to Dg= (0.45, 0.30) is, (D3, Ds), distance (D, D))
min (0.15, 0.25, 0.28,
0.29) 0.15 =

p V04-0.45) +(0.53-0.30 =0.23 Similarly,we will calculate all distances.


Similarly we will calculate all distances, Distance matrix
Distance matrix D 0
D (D,D)0.24
D2 0.24 0 (D3.D 0.22 0.15 0

D 0.22 0.15 0
D4 0.37 0.20
DA 0.37 0.20 0.15 0 0.15 0
D (D, D) D3, D) D
Ds 0.34 0.14 0.28 0.29 0.15 is smallest. (D2, Ds) and
(D3, Dg) as well as D4
and (D3, D) have smallest distance. We can
D 0.23|0.25 | 0.11 0.22 0.39 0 pick either one.
D D D3 D Ds D6 Distance matrix
0.11 is smallest. D3 and Dg have smallest distance. So, D 0
we combine this two in one cluster and recalculate distance (D2, Ds, D3, D) 0.22
matrix. D4 0.37 0.15
Distance ((Da, D). Di) = min (distance (D3, D).
D (D, Ds, D3, Dg) D4
distance (Dg, D,)) = min (0.22, 0.23) = 0.22 0.15 is smallest. (D2, D5, D3, Dg) and D have smallest
Distance ((D3. Dg), D) = min (distance (D3. D). distance. So, we combine this two in one cluster and
recalculate distance matrix.
distance(D6 D)) = min (0.15, 0.25) = 0.15
Distance ((D, D), D,) = min (distance (Dg, Da). Distance matrix
distance (Dg. DA) = min (0.15, 0.22) = 0.15 D 0

Distance (Da. Dg), D,) = min (distance(D. Ds). (D2, Ds, Dg. Dg.D 0.22 0
distance (D6. D5) = min (0.28, 0.39) = 0.28
D D2, Dg. Dg. Dg. D)
Similarly we will calculate all distances. Now a single cluster remains (D2, Ds, Dg, Dg, Dg, D)
Distance matrix Next, we represent the final dendogram for single
linkage as,
D 0 Root One node

D2 0.24 0
0
(D3.D)0.22 | 0.15
0.15
D4 0.37 0.20
0.29 0
Ds 0.34 0.14 0.28
D3 D6 D4 D5 D4
D1 Leaft Individual
clusters
D D2 (D3, D D4 Ds

Tech-Neo Publications.A SACHIN SHAH Venture


Wew Syllabus w.ef academic year 22-23) (M7-79)
Machine Leaming(MU-Sem.7-Comp.) (Learningwith Clustering)...Page no. (5-11)
Now will solve using complete linkage
we
Distance matrix
Distance matrix D 0

D 0 (D, D) 0.34 0

D 0.24 0 (D,. Dg, D) 0.37 0.39


Da 0.22 0.15 0
D (D3, Dg. D) (D3, Dg, D
D0.37 0.20 0.150 0.34 is smallest. (D2, D5) and D have smallest distance
D 0.34 0.14 0.28 0.29 so, we combine these two in one cluster and recalculate
0 distance matrix.
D0.23 0.25 0.11 0.22 0.39 0
Distance matrix
D D D3 D4 Ds D (D2, Ds, Di
0.11 is smallest. D3 and D have smallest distance. So, (D.Dg, D)L 0.39
we combine this two in one cluster and recalculate distance
matrix. D2, Ds. Di D3, D6. D)
Distance Now single cluster remains (D2, Ds. D, D3, Dg. DA)
a
((D3, D6), D,) max (distance(D3, D),
=

distance (D6, D,)) max (0.22, 0.23) 0.23 Next, we represent the final dendogram for
complete
=
=

Similarly, we will calculate all distances. linkage as,

Distance matrix
D1
D 0.24 o
(D3,D0.23 0.25 D3 D6 D4 D2 D5 D1

D 0.37 0.20 0.22 Now will solve


D4 we
using average linkage
0 0 Distance matrix
Ds 0.34 0.14 0.39 0.29 0
D D D3, D) D D5 D
0.14 is smallest. D2 and D2 0.24 0
Ds
have smallest distance. So,
we combine this two in one cluster and recalculate distance
matrix. D3 0.22 0.15 0
Distance matrix Da0.37 0.20 0.15 0
D5
Ds 0.340.14 0.280.29
D o 0
Ds
D6 0.23 0.25 0.11 0.22 0.39
(D2 Ds) 0.34 0 0

(D3,D0.23 0.39 D D2 D3 D Ds D
0.11 is smallest.
D 0.37 .29 0.22 0 Dg and Dg have smallest distance. So,
we combine this two in one cluster and
D (D2, Ds) (Dg, D) D distance matrix. recalculate
0.22 is smallest. Here
(D3, Dg) and Distance (D3, D), D,)= 1/2 (distance(D3, D)
D4 have smallest
distance. So, we combine these two in +distance D6. D,)) 1/2
one cluster and (0.22 +0.23) =0.23
=

recalculate distance matrix.


Similarly,we will calculateall distances.
(New Syllabus w.e.f academic year 22-23) (M7-79)
LETech-NeoPublications..ASACHIN SHAH
Venture
chineLearning(MU-Sem.7-Comp.) (Learning with Clustering)...Pageno.(5-12)
Distance matrix

0
D
D2 0.24 0

D3. D)0.23 0.2


D3 D6 D4 D2 D5 D1
0.37 0.20 0.19
Da
D and
0.34 0.29
Ex. 5.6.2: Apply single linkage, complete linkage
D
D 0.34 0.14 distance matrix and draw
average linkage on the following
D D (D3, D) D4 Ds dendogram.

0.14 is smallest. D>


and D_ have smallest distance. So, Soln.
cluster and recalculate
combine this two in First we will solve using single linkage
one
we
distance matrix.
Distance matrix
Distance matrix
P
D 0
P221
0
(D,D5) 0.29
Pa 6 3
0.27 0
D.D) 0.22 P10 9 7 0
0.15
D4 0.37 0.22
Ps95 4
D1 (D2. Ds) (D3, D) D4
P1 Pi P3 Pa Ps
have smallest distance. So, we
smallest distance. So, we
(D3, D) and D4 2 is smallest. P and P2 have
recalculate distance
cluster and distanc
combine this two in cluster and recalculate
one
combine this two in one

matrix.
matrix.

Distance matrix min (distance (P, Pa),


(P1. P2). Pz)
=
Distance

0 min (6, 3) = 3
(P2. P))
=

D distance
distances.
0.24 will calculate all
(D2, Ds) Similarly, we

0.27 0.26 Distance matrix


(D. D)
(D2, Ds) D3. Dg. Da
D smallest
P. Pa)
have
0.24 is smallest. (D, Ds) and Di 0
we combine
this two in one cluster and
P
distance. So,
recalculate distance matrix.
P4
Distance matrix Ps 8
4

0
(D2, D5, D) P. P) P P4 Ps
0
(D. Dg, D) 0.26 distanc-
have smallest
(D3, Dg,D4 smallest. (P, P)
and P3
3 is
(D2. Ds, Di two in one
cluster and
recalculE

remains (D2, D, Di, D3, D6, D4) So, we


combine this
NOw a single cluster for average distance matrix.
dendogram
VEXI, we represent the final

linkage as, SACHIN SHAH Vente


Publications...A
Tech-Neo

22-23) (M7-79)
(New syllabusw.e.f academic year
Machine Learning(MU-Sem.7-Comp.) (Learningwith Clustering)..Pageno. (5-13)
Distance (P,. P2. Pa). Pa) min (distance(P. 2 is smallest. P and P2 have smallest distance. So, we
=
Pa).
combine this two in one cluster and recalculate distance
distance (P2. Pa). distance (P3. Pa)) = min (9, 7) = 7
matrix.
Similarly. we will calculate all distances. Distance (P,. P2), P3) =max (distance(P1. P3),
distance (P2. Pa)) = max (6, 3) = 6
Distance matrix
Similarly, we will calculate all distances.
(P.P2. P3
Distance matrix
Pa 7
(P.P 0
P5 4 P3
P.Pz. Pa) Pa P's
4 is smallest. Pa and Ps have smallest distance.
PPa 10 7

Distance matrix Ps

P.P2. P) 0
(P1.P2)P3 Pa Ps
4 is smallest. P and Ps have smallest distance. So, we
(P. Ps) 5 0 combine this two in one cluster and recalculate distance
matrix.
(P1.P2. P3) Pa. Ps)
Now a single cluster remains (P1. P2, P3, P4. Ps)
Distance matrix
Next, we represent the final dendogram for single (P,P) 0
linkage as,
P3
Pa.P)10 7
P1.P) Ps Pa. Ps)
6 is smallest. (P1, P2) and P3 have smallest distance.
So, we combine this two in one cluster and recalculate
distance matrix.
P1 P2 P3 P4 P5

Now we will solve using complete linkage


Distance matrix
P.P2.P)
Distance matrix
(P4,Ps) 10
P1
(P. P2. Pa) (P4. Ps)
P 2 Now a single cluster remains (P. P2. P3, P4, Ps)
Next, we represent the final dendogram for complete
P3 3 linkage as,

P 10
8||o
P1 P2 Pa Pa Ps
P1 P2 P3 P4 P5

(New Syllabus w.e.f academic year 22-23) (M7-79) Tech-Neo Publications...ASACHIN SHAH Venture
achine Learning (MU--Sem.7-Comp.)

we will solve using average linkage (Leamingwith Clustering)....Pageno. (5-14


Now Now a
single cluster remains (P1,. P2, P3, P4. Ps)
Distance matrix
Next, we
represent the final dendogram for average
P linkage as,
P22
Pa 6 3 0

Pa1097 0
Ps985|4
P1 P2
o
Ps Pa Ps
and P1 P2 P3
2 is smallest. P1 P2 have smallest distance.
So, we
P4 P5
combine this two in one cluster and recalculate distance UEx.5.6.3 MU May 16,10Marks
matrix. Apply Agglomerative clustering algorithm on given data
and draw
Distance ((P1. P2). Pa) = 1/2 (distance(P1. P3), dendrogram. Show three clusters with its allocated
distance (P2. Pa)) = 1/2 (6, 3) = 4.5 points. Use singlelink method.
Similarly, we will calculate all distances. A B C D E F
Distance matrix A
VT 20
P.P) 0 B 0
VT8
P3 4.5 0
C VTO
P4 9.5
7 D V17
Ps 8.5 s|4o
P.P)Ps P Ps
E 0
V13
4 is smallest. P4 and Ps have smallest distance. So, we
20 ViB
F
V
combine this two in one cluster and recalculate distance Soln.:
matrix.
Distance matrix
Distance matrix
A
(P.P)
B 1.414 0
P3 4.5
C | 3.162 | 2.828
(P4.Ps) 9
6
P.P) P (P4. Ps) D 4.123 1 2.236
. 1S smallest. (P1, P) and P3 have smallest distance.
E 2.236 1 2.236 2 0
0, we combine this two in one cluster and recalculate
distance matrix.
Distance matrix
F 4.4724.242 2
33.6 o
A
A B C DE F
P.Pa. P 0 1 is smallest. B, D and B, E have smallest distance. We
(Pa, Ps) 8 can select anyone. So, we combine B, D in one cluster
and recalculate distance matrix using single linkage.
(P1. P2. P3) P4. Ps)

(Newyllabus w.e.f LTech-Neo Publications..ASACHIN SHAH Venture


academicyear 22-23) (M7-79)
Machine Learning (MU-Sem.7-Comp.)
(Learningwith Clustering)....Pageno. (5-15)
Distance matrix Soln.
First we will solve using complete linkage
A 0
Distance matrix
B,D 1.414 0
pl
C 3.162 2.236| 0
P2 0.7007
E 2.26 2.236 0
P3 5.656 949 0
F 4.472 3 2 3.6 0 P4 3.605 2.915 2.236 o
A B.D C E F
P5 4.242 3.535 1.414 1
I is smallest. B, D and E have smallest distance. So, we
combine this two in one cluster and recalculate distance P65.201 2.5 1.802 0.5 1.118 0
matrix. P5 P6
P1 P2 P3 P4
Distance matrix 0.5 is smallest. P4 and P6 have smallest distance. We
A 0 can select anyone. So, we combine this in one cluster
and recalculate distance matrix using complete linkage.
B.D.E 1.414
Distance matrix
C 3.1621o
P1
F 4.472
3 2 o P2 0.707 0
A B.D.E C F
P3 5.656 4.949 0
1 is smallest. B, D, E and C combined
are
together.
P4,P6 5.201 2.5 1.802 0
Distance matrix
A
P5 4.242 3.535 1414 1.118
0
P1 P2 P3 P4,P6 P5
B,D,E,C 1.414 0
0.707 is smallest. P1 and P2 have smallest
F 4.472 o we combine this two in one
distance. So,
cluster and recalculate
A B,D.EC F distance matrix.
In the questions three clusters are asked with their Distance matrix
allocated points. Three clusters are A, (B, D, E, C)
and F. P1,P2 0
P3 5.656
UEx. 5.6.4 MU-May 17,10Marks P4,P6 5.201 2.236
For the given set of
pointsidentifyclustersusingcomplete P5
link and average
link usingAgglomerativeclustering. 4.242|1414 1.118 0
P1,P2 P3 P4,P6 P5
A B
1.118 is smallest. P4, P6 and
P 1 1 P5 are combinedtogether.
Pa1.5 1.5 Distance matrix
P1,P2
|Ps 5
P3
0

PA 3 5.656
4 P4,P5,P6 5.201 2.236
Ps4 4
P1,P2 P3
P63 3.5 P4,P5,P6
(New Syllabus w.e.f academic
year 22-23) (M7-79)
Tech-NeoPublications..ASACHIN SHAH
Venture
ne
em.7-Comp.)
Learning(MU-Sem (Leaningwith Clustering)...Pageno. (5-16)
Mach
2.236 is allest. P4, P5, P6 and P3 are combined
Distance matrix
tOgether.

P1,P2 0
P1.P2
P3 5.302
P3,P4.P5,P6 |5.656
PIP2 P3,P4,PS,P6 P4,P5,P6 3.66 1.817
Next we will ombine all clusters in single cluster.
a
P1,P2 P3 P4,P5,P6
Now we wil solve using average 1linkage 1.817 is smallest. P4,P5,P6 and P3 are combined
Distance matrix together.

PI P1,P2
P2 0.707 P3,P4,P5,P6 4.07 0
P3 5.656 4.949
PI,P2 P3,P4,P5,P6
P4 3.605 2.915 2.236 0
Next we will combine all clusters in a single cluster.
PS 4.242 3.535 141410
P6 5.201 2.5 1.802 0.5 1.118 0
5.7 UNSUPERVISEDLEARNING
P1 P2 P3 P4 P5 P6 ASSOCIATION RULES www.oa
We
0.5 is smallest. P4 and P6 have smallest distance.
can select anyone .So, we combine this in one cluster 5.7.1 Introduction to Association Rules
and recalculate distance matrix using complete linkage.
Associations are any things that are generally occurred
Distance matrix together. Let us understand this with the help of ann
interested in finding
example, as a shop owner we are
P 0 out the products that are generally sell together.

P2 0.707 0 We will simplify this further, if there is a customer

likely that he will


who purchases milk then it is more
P3 5.656 4.949 0 that the
also purchase bread. From this we can say
0 customers who buy milk tend to also buy bread.
P4,P6 4.403 2.707 2.019
most clients buy cake.
will This
In a bakery shop
P5 |4.242 3.535| 1.414 1.059
means that there will be many frequent item sets

P2 P3 P4,P6 P5 such {candle, cake}.


P1 involving cake, as

of an association
the construction
0.707 is smallest. P1 and P2 have smallest distance. So, This might suggest
recalculate rule if candle then cake however, this is predictable
we combine this two in one cluster and
-

item set (and


distance matrix. given that {cake} is already a frequent
clearly at least as frequent as {candle,
cake }).
Distance matrix Of more interest would be the
converse rule if cake
considerable
then candle which expresses that
a

candle.
P1,P2 0 proportion of the people buying cake also buy a

P3 5.302 5.7.2 Apriori Algorithm

.019 used to create association rules


P4,P6 3.55 Frequent item sets are
created to work on the
1.059 0 in Apriori algorithm. It is
P5 3.888 1.414 Association rules
transactions.
databases that contain
P5 the two items are strongly
P1, P2 P3 P4, P6 are used to identify whether
or not.
combined together. connected to each other
1S smallest. P4, P6 and P5 are

SACHIN SHAH Venture


Tech-Neo Publications...A
(New Syllabus w.e.f academicyear 22-23) (M7-79)
Machine Leaming (MU-Sem.7-Comp.) (Loamingwith Clustering)....Pageno. (5-17)
Breadth first search and hash tree methods are used to find out the item sets in an optimum manner. Iterations are applied to find the frequent item sets from the huge amount of data in the database.

In market basket analysis mostly the Apriori algorithm is used. The algorithm is used to find the products that can be bought together. The Apriori algorithm can also be used in the medical discipline to find reactions of drugs on the patients.

What is a Frequent Item set?

Item sets having support more than the set minimum support or threshold value are called Frequent item sets. For example, if {P, Q} is a frequent item set, then individually P and Q should also be frequent item sets.

Let us assume there are two transactions: P = {3, 4, 6, 7, 8} and Q = {4, 6, 9}; in these two transactions, 4 and 6 are the frequent item sets.

Apriori Algorithm Working

The Apriori algorithm works according to the following steps:
1. Find the support of item sets in the transactional database. Choose the minimum support and confidence.
2. From the transactions, consider all item sets with support value greater than the minimum or chosen support value.
3. Identify all the rules of these subsets which have confidence value greater than the threshold or minimum confidence.
4. Sort the rules in decreasing order of lift.

Example: Let us assume that we have the following dataset that contains various transactions. We want to identify the frequent item sets and create the association rules with the Apriori algorithm. In this example minimum support 2 and minimum confidence 50% are given.

Transaction ID   Item sets
T1               1,2
T2               2,4
T3               2,3
T4               1,2,4
T5               1,3
T6               2,3
T7               1,3
T8               1,2,3,5
T9               1,2,3

Soln.

Step 1: Calculating S1 and I1

Initially we will generate a table that contains the support count of each item set in the given dataset. This table is called the Candidate set or S1.

Itemset   Support count
1         6
2         7
3         5
4         2
5         1

Next, we will consider all the item sets that have a support count greater than the Minimum Support, i.e. 2. This will give us the table for the frequent item set I1.

Here all the item sets have a support count greater than or equal to the minimum support, except item 5, so the {5} item set will be removed.

Itemset   Support count
1         6
2         7
3         5
4         2

Step 2: Calculating S2 and I2

Next, we will create S2 using I1. In S2, we will generate the pairs of the item sets of I1 in the form of subsets.

Once we create the subsets, we will again find the support count from the main transaction table. Here we will check how many times these pairs have occurred together in the given dataset. So, we will get the following table for S2:

Itemset   Support count
(1,2)     4
(1,3)     4
(1,4)     1
(2,3)     4
(2,4)     2
(3,4)     0

Now we will compare the S2 support counts with the minimum support count, and after doing this, the item sets with less support count will be removed from the table S2. It will give us the following table for I2.

Itemset   Support count
(1,2)     4
(1,3)     4
(2,3)     4
(2,4)     2

Step 3: Calculation of S3 and I3

For S3, we will follow the same two processes, but now we will form the S3 table with subsets of three item sets together, and will calculate the support count from the dataset. It will give the following table:

Itemset    Support count
(1,2,3)    2
(2,3,4)    1
(1,3,4)    0
(1,2,4)    1

Next we will generate the I3 table. From the S3 table we can see that there is only one combination of itemset that has a support count equal to the minimum support count. So, I3 will have only one combination, i.e., {1,2,3}.

Step 4: Identifying the association rules for the subsets

To create the association rules, we will create a new table with the possible rules from the above combination {1,2,3}. We will compute the Confidence for all the rules using the formula Confidence(A → B) = sup(A ^ B) / sup(A).

After computing the confidence value for all rules, we will exclude the rules that have less confidence than the minimum threshold (50%).

Consider the below table:

Rules        Support   Confidence
1 ^ 2 → 3    2         sup{(1 ^ 2) ^ 3} / sup(1 ^ 2) = 2/4 = 0.5 = 50%
2 ^ 3 → 1    2         sup{(2 ^ 3) ^ 1} / sup(2 ^ 3) = 2/4 = 0.5 = 50%
1 ^ 3 → 2    2         sup{(1 ^ 3) ^ 2} / sup(1 ^ 3) = 2/4 = 0.5 = 50%
3 → 1 ^ 2    2         sup{3 ^ (1 ^ 2)} / sup(3) = 2/5 = 0.4 = 40%
1 → 2 ^ 3    2         sup{1 ^ (2 ^ 3)} / sup(1) = 2/6 = 0.33 = 33.33%
2 → 1 ^ 3    2         sup{2 ^ (1 ^ 3)} / sup(2) = 2/7 = 0.28 = 28%

As the given threshold or minimum confidence is 50%, the first three rules, 1 ^ 2 → 3, 2 ^ 3 → 1, and 1 ^ 3 → 2, can be considered as the strong association rules for the given problem.

5.7.3 Performance Measures

Association rule mining is a kind of unsupervised learning which checks the dependency of one data item on another data item.

Based on this dependency, items are mapped to each other so that the result can be more profitable.

This method identifies some interesting relations or associations that exist between the variables of the dataset.

For association rule mining the Apriori algorithm can be used.

Association rule mining works on the concept of the If-and-Else statement, such as "if X then Y". In this, the "if" part is known as the antecedent, and the "then" part is known as the consequent.

Such types of relationships, in which we can find out some association or relation among two items, are called single cardinality. In this way rules are created. Here, if the number of items is more, then the cardinality also increases accordingly.
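The frequent item sets and the rule confidences in the worked example above can be reproduced with a short brute-force script. The sketch below is illustrative only: plain Python, the nine transactions hard-coded, and no candidate pruning as a real Apriori implementation would use.

from itertools import combinations

transactions = [{1, 2}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3}, {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
min_support, min_confidence = 2, 0.5

def support(itemset):
    # Number of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

items = sorted({i for t in transactions for i in t})
# Frequent item sets of size 1, 2 and 3 (brute force, for this tiny dataset)
frequent = [frozenset(c) for k in (1, 2, 3) for c in combinations(items, k)
            if support(set(c)) >= min_support]

# Rules A -> B built from the frequent 3-item set {1, 2, 3}
for triple in (f for f in frequent if len(f) == 3):
    for k in (1, 2):
        for antecedent in combinations(sorted(triple), k):
            a = set(antecedent)
            conf = support(triple) / support(a)
            if conf >= min_confidence:
                print(sorted(a), "->", sorted(triple - a), "confidence =", conf)

Running it prints exactly the three strong rules found by hand, each with confidence 0.5.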
Due to this, to measure the associations between larger numbers of data items, several metrics are used. These metrics are explained below.

1. Support

Support is defined as the number of occurrences of an item in the dataset. It is also defined as the fraction of the transactions T that contain the itemset I. For T transactions it can be written as:

Support(I) = Freq(I) / T

2. Confidence

Confidence represents how likely the rule has been found to be true. We can also say how many times the items P and Q appear together in the dataset, given that the occurrence of P is already known. It is the ratio of the transactions that contain both P and Q to the number of records that contain P.

Confidence = Freq(P, Q) / Freq(P)

3. Lift

Lift is defined as the strength of a rule.

Lift = Support(P, Q) / (Support(P) * Support(Q))

It is the ratio of the observed support measure to the expected support if P and Q were independent of each other. It has three possible ranges of values:

Lift = 1: the occurrence of the antecedent and the consequent are independent of each other.
Lift > 1: it defines the degree to which the two item sets are dependent on each other.
Lift < 1: it indicates that one item is an alternative (substitute) for the other items.

5.7.4 Issues in Machine Learning

UQ. What are the issues in Machine learning? (MU - May 15, 5 Marks)

1. Which algorithm should we select to learn general target functions from a specific training dataset? What should be the settings for particular algorithms, so as to converge to the desired function, given sufficient training data? Which algorithms perform best for which types of problems and representations?
2. How much training data is sufficient? What should be the general amount of data that can be found to relate the confidence in learned hypotheses to the amount of training experience and the character of the learner's hypothesis space?
3. At which time and in what manner can prior knowledge held by the learner be used to guide the process of generalizing from examples? If we have approximately correct knowledge, will it be helpful even when it is only approximately correct?
4. What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
5. To reduce the task of learning to one or more function approximation problems, what will be the best approach? What specific functions should the system attempt to learn? Can this process itself be automated?
6. To improve the knowledge representation and to learn the target function, how can the learner automatically alter its representation?

5.7.5 How to choose the right algorithm?

UQ. Explain the steps required for selecting the right machine learning algorithm. (MU - May 16, 8 Marks)

With all the different algorithms available in machine learning, how can you select which one to use? First you need to focus on your goal: what are you trying to get out of this? What data do you have or can you collect? Secondly you have to consider the data.

1. Goal: If you are trying to predict or forecast a target value, then you need to look into supervised learning. Otherwise, you have to use unsupervised learning.

(a) If you have chosen supervised learning, then next you need to focus on what your target value is.
- If the target value is discrete (e.g. Yes / No, 1 / 2 / 3, A / B / C), then use Classification.
- If the target value is continuous, i.e. a range of numeric values (e.g. 0 to 100, -99 to 99), then use Regression.

(b) If you have chosen unsupervised learning, then next you need to focus on what your aim is.
- If you want to fit your data into some discrete groups, then use Clustering.
- If you want to find a numerical estimate of how strong the fit into each group is, then use a density estimation algorithm.

2. Data: Are the features continuous or nominal? Are there missing values in the features? If yes, what is the reason for the missing values? Are there outliers in the data? To narrow the algorithm selection process, all of these features of your data can help you.

Table 5.7.1: Selection of Algorithm

             Supervised Learning   Unsupervised Learning
Discrete     Classification        Clustering
Continuous   Regression            Density Estimation

5.7.6 Steps in Developing a Machine Learning Application

UQ. Explain the steps of developing Machine Learning applications. (MU - May 19, 10 Marks)

1. Collection of Data

You could collect the samples from a website and extract data; from an RSS feed or an API; from a device (for example, to collect wind speed measurements); or from publicly available data.

2. Preparation of the input data

Once you have the input data, you need to check whether it is in a useable format or not. Some algorithms can accept target variables and features as strings; some need them to be integers. Some algorithms accept features in a special format.

3. Analyse the input data

Look at the data you have passed, in a text editor, to check that the collection and preparation of input data steps are working properly and that you don't have a bunch of empty values. You can also look at the data to find out if you can see any patterns or if there is anything obvious, such as a few data points that greatly differ from the remaining set of the data. Plotting data in 1, 2 or 3 dimensions can also help. Distil multiple dimensions down to 2 or 3 so that you can visualize the data.

4. The importance of this step is that it makes you understand that you don't have any garbage values coming in.

5. Train the algorithm

Good clean data from the first two steps is given as input to the algorithm. The algorithm extracts information or knowledge. This knowledge is mostly stored in a format that is readily useable by the machine for the next two steps. In case of unsupervised learning, the training step is not there because the target value is not present; the complete data is used in the next step.

6. Test the algorithm

In this step the information learned in the previous step is used. When you are checking an algorithm, you will test it to find out whether it works properly or not. In the supervised case, you have some known values that can be used to evaluate the algorithm. In the unsupervised case, you may have to use some other metrics to evaluate the success. In either case, if you are not satisfied, you can go back to step 4, change some things and test again. Often the problem occurs in the collection or preparation of data and you will have to go back to step 1.

7. Use it

In this step a real program is developed to do some task, and once again it is checked whether all the previous steps worked as you expected. You might encounter some new data and then have to revisit steps 1-5.

Fig. 5.7.1: Typical example of a Machine Learning Application (training phase: input -> feature extractor -> features + labels -> machine learning algorithm; testing phase: input -> feature extractor -> features -> model -> label)
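As a purely illustrative companion to these steps, the sketch below assumes scikit-learn is available and uses its bundled iris dataset as a stand-in for data you have collected and prepared yourself; it covers the train, test and evaluate steps.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                     # steps 1-2: collected and prepared input data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # step 5: train the algorithm
y_pred = model.predict(X_test)                           # step 6: test the algorithm
print("accuracy:", accuracy_score(y_test, y_pred))       # evaluate before using it (step 7)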

5.7.7 Applications of Machine Learning

UQ. Write short note on: Machine learning applications. (MU - May 16, May 17, 10 Marks)

(1) Learning Associations

A supermarket chain is one example: a retail application of machine learning is basket analysis, which is finding associations between products bought by customers.

If people who buy P typically also buy Q, and if there is a customer who buys Q and does not buy P, he or she is a potential P customer. Once we identify such customers, we can target them for cross-selling.

In finding an association rule, we are interested in learning a conditional probability of the form P(Q | P), where Q is the product we would like to condition on P, the product / products which we know the customer has already purchased.

P(Milk | Bread) = 0.7

It implies that 70% of customers who buy bread also buy milk.

(2) Classification

A credit is an amount of money loaned by a financial institution.

It is important for the bank to be able to predict in advance the risk associated with a loan, which is the probability that the customer will default and not pay the whole amount back.

In credit scoring, the bank calculates the risk given the amount of credit and the information about the customer (income, savings, collaterals, profession, age, past financial history). The aim is to infer a general rule from this data, coding the association between a customer's attributes and his risk.

A Machine Learning system fits a model to the past data to be able to calculate the risk for a new application, and then decides to accept or refuse it accordingly.

If income > Q1 and savings > Q2 then low-risk else high-risk

Other classification examples are optical character recognition, face recognition, medical diagnosis, speech recognition and biometrics.

Fig. 5.7.2: Classification for credit scoring (savings vs. income, with low-risk and high-risk regions)

(3) Regression

Suppose we want to design a system that can predict the price of a flat.

Let's take the inputs as the area of the flat, the location, the purchase year and other information that affects the rate of the flat.

The output is the price of the flat. Applications where the output is numeric are regression problems.

Let X represent the flat features and Y the price of the flat. We can collect training data by surveying past purchase transactions, and the Machine Learning algorithm fits a function to this data to learn Y as a function of X for suitable values of w and w0:

Y = wx + w0

Fig. 5.7.3: Regression for prediction of price of flat (Y: price of flat vs. X: area of flat)

(4) Unsupervised Learning

One of the important unsupervised learning problems is clustering. In clustering the dataset is partitioned into meaningful sub-classes known as clusters. For example, suppose you want to decorate your home using given items.
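The straight-line model Y = wx + w0 used in the regression example above can be fitted from surveyed (area, price) pairs in a couple of lines. The sketch below is illustrative only; the data values are made up for demonstration and NumPy is assumed.

import numpy as np

area = np.array([500, 750, 1000, 1250, 1500])   # X: area of flat (sq. ft), assumed data
price = np.array([40, 55, 72, 85, 101])         # Y: price of flat (in lakhs), assumed data

w, w0 = np.polyfit(area, price, deg=1)          # least-squares fit of Y = w*x + w0
print("w =", w, "w0 =", w0)
print("predicted price for 1100 sq. ft:", w * 1100 + w0)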

Now you will classify them using unsupervised learning (no prior knowledge), and this classification can be on the basis of colour of items, shape of items, material used for items, type of items, or whatever way you would like.

(5) Reinforcement Learning

There are some applications where the output of the system is a sequence of actions. In such applications the sequence of correct actions, instead of a single action, is important in order to reach the goal.

An action is said to be good if it is part of a good policy. A machine learning program generates a policy by learning previous good action sequences. Such methods are called reinforcement methods.

A good example of reinforcement learning is chess playing. In artificial intelligence and machine learning, one of the most important research areas is game playing.

Games can be easily described, but at the same time they are quite difficult to play well. Take the example of chess, which has a limited number of rules, but the game is very difficult because for each state there can be a large number of possible moves.

Another application of reinforcement learning is robot navigation. The robot can move in all possible directions at any point of time.

The algorithm should reach the goal state from an initial state by learning the correct sequence of actions after conducting a number of trial runs.

When the system has unreliable and partial sensory information, it makes reinforcement learning complex. Let's take the example of a robot with incomplete camera information: here the robot does not know its exact location.

5.8 GRAPH BASED CLUSTERING

Graph clustering is an important subject, and deals with clustering with graphs.

The data of a clustering problem can be represented as a graph, where each element to be clustered is represented as a node and the distance between two elements is modeled by a certain weight on the edge linking the nodes.

Thus in graph clustering, elements within a cluster are connected to each other but have no connection to elements outside that cluster.

5.8.1 Graph Clustering Algorithm

The HCS (Highly Connected Subgraphs) clustering algorithm (also known by other names such as highly connected clusters / components / kernels) is an algorithm based on graph connectivity for cluster analysis.

It works by representing the similarity data in a similarity graph, and then finding all the highly connected subgraphs.

It does not make any prior assumptions on the number of clusters.

The HCS algorithm gives a clustering solution which is inherently meaningful in the application domain.

5.8.2 Method of Graph-based Clustering

(1) Transform the data into a graph representation:
(i) Vertices are the data points to be clustered.
(ii) Edges are weighted based on the similarity between data points.

Fig. 5.8.1: Graph partitioning - each connected component is a cluster

(2) Clustering as graph partitioning

Two things are required:
(i) An objective function to determine what would be the best way to 'cut' the edges of a graph (Fig. 5.8.2).
(ii) An algorithm to find the optimal partition (optimal according to the objective function).

(3) Objective function for partitioning

Suppose we want to partition the set of vertices V into two sets: V1 and V2. One possible objective function is to minimise the graph cut:

Cut(V1, V2) = Σ_{i ∈ V1, j ∈ V2} W_ij,   where W_ij is the weight of the edge between nodes i and j.

Fig. 5.8.3: Two example cuts of a weighted graph (Cut = 0.2 and Cut = 0.4)

(4) Objective function for partitioning - limitation of minimising the graph cut

The optimal solution might be to split up a single node from the rest of the graph (Fig. 5.8.4, Cut = 0.1). This is not a desirable solution.

(5) Objective function for partitioning

Our aim is not only to minimise the graph cut, but also to look for "balanced" clusters.

Ratio Cut(V1, V2) = Cut(V1, V2) / |V1| + Cut(V1, V2) / |V2|

Normalised Cut(V1, V2) = Cut(V1, V2) / (Σ_{i ∈ V1} d_i) + Cut(V1, V2) / (Σ_{i ∈ V2} d_i),   where d_i = Σ_j W_ij

V1 and V2 are the sets of nodes in partitions 1 and 2; |V| is the number of nodes in partition V (Fig. 5.8.5).

(a) Example: for two candidate cuts of the same weighted graph, the cut of value 0.1 gives Ratio cut = 0.12 (Fig. 5.8.6), while the cut of value 0.2 gives Ratio cut = 0.13 (Fig. 5.8.7); the corresponding Normalised cut values are compared in the same way.

(b) Example: if the graph is unweighted (or every edge has the same weight) - see Figs. 5.8.8 and 5.8.9.

For the unweighted example, the partition of Fig. 5.8.8 gives Cut = 1 and that of Fig. 5.8.9 gives Cut = 2; comparing their Ratio cut and Normalised cut values again favours the more balanced partition.

(6) Algorithm for graph partitioning

Method of minimising the objective function:
(i) We can use a heuristic (greedy) approach to do this.
(ii) An elegant way to optimise the function is by using ideas from spectral graph theory. This leads to a class of algorithms known as spectral clustering.

(7) Spectral clustering

Spectral properties of a graph:
(i) We find the eigenvalues and eigenvectors of the adjacency matrix. They can be used to represent a graph.
(ii) There exists a relationship between the spectral properties of a graph and the graph partitioning problem.

Method:
(i) Start with a similarity / adjacency matrix, W, of a graph.
(ii) Define a diagonal matrix D with D_ij = Σ_k W_ik if i = j, and 0 otherwise.
If W is a binary 0-1 matrix, then D_ii represents the degree of node i.

Preliminaries

Fig. 5.8.10: For a graph consisting of two clusters (two connected components), the adjacency matrix W is made up of two block-diagonal blocks.

(8) Graph Laplacian matrix

Laplacian, L = D - W

Fig. 5.8.11: The Laplacian L = D - W of such a graph also has a block structure, one block per connected component.

(9) Properties of the graph Laplacian

(i) L = (D - W) is a symmetric matrix.
(ii) L is a positive semi-definite matrix. It means all eigenvalues are non-negative, i.e. λ ≥ 0.

Remark: Spectral properties of a graph (i.e. eigenvalues and eigenvectors) contain information about the clustering structure.

(10) Spectral clustering

Consider a data-set with N data points:
(i) Construct an N x N similarity matrix W.
(ii) Compute the N x N Laplacian matrix, L = D - W.
(iii) Compute the K "smallest" eigenvectors of L:
(a) Each eigenvector V_i is an N x 1 column vector.
(b) Create a matrix V containing the eigenvectors V1, V2, ..., VK as columns (one may exclude the first eigenvector).
(iv) Cluster the rows of V into K clusters using K-means or another clustering algorithm.
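A compact way to see these steps in practice is sketched below. This is an illustration, not from the text; it assumes NumPy and scikit-learn's KMeans, builds a small similarity graph, and clusters the rows of the eigenvector matrix exactly as in steps (i)-(iv).

import numpy as np
from sklearn.cluster import KMeans

# Toy similarity (adjacency) matrix W for 6 points forming two obvious groups
W = np.array([[0, 1, 1, 0,   0, 0],
              [1, 0, 1, 0,   0, 0],
              [1, 1, 0, 0.1, 0, 0],
              [0, 0, 0.1, 0, 1, 1],
              [0, 0, 0,   1, 0, 1],
              [0, 0, 0,   1, 1, 0]], dtype=float)

D = np.diag(W.sum(axis=1))            # degree matrix
L = D - W                             # unnormalised graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)  # eigenvectors sorted by ascending eigenvalue

K = 2
V = eigvecs[:, :K]                    # the K "smallest" eigenvectors as columns
labels = KMeans(n_clusters=K, n_init=10).fit_predict(V)
print(labels)                         # the two triangles end up in different clusters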
Example

Fig. 5.8.12: Comparison of K-Means clustering and spectral clustering on the same data set.

5.9 CLUSTERING WITH MINIMAL SPANNING TREE

There are two Euclidean minimum spanning tree based clustering algorithms:
(1) a K-constrained algorithm, and (2) an unconstrained algorithm.

(i) The K-constrained clustering algorithm produces a K-partition of a set of points for any given K. The algorithm constructs a minimum spanning tree of a set of representative points and removes edges that satisfy a predefined criterion. The process is repeated until K clusters are produced.

(ii) The unconstrained clustering algorithm partitions a point set into a group of clusters by maximally reducing the overall standard deviation of the edges in the Euclidean minimum spanning tree. It is constructed from a given point set, without prescribing the number of clusters.

The minimal spanning tree (MST) based clustering method can identify clusters of arbitrary shape by removing inconsistent edges.

The definition of the inconsistent edges is a major issue that has to be settled in all MST-based clustering algorithms.

Here we discuss an MST-based clustering algorithm through the 'cluster centre initialisation algorithm', called CCIMST.

CCIMST is based on geodesic distance and dual densities of the points.

We shall demonstrate that the inconsistent edge is located on the shortest path between the cluster centres, so that we can find the inconsistent edge using the lengths of the edges as well as the densities of their endpoints on the shortest path.

5.9.1 The Cluster Center-Initialisation Method

Among the cluster center initialisation methods, density peaks clustering (DPC) has been widely used.

The cluster center-initialisation method is based on DPC, so we briefly describe DPC.

In DPC, we assume that the
cluster centers are (Learningwith Clustering)...Pageno. (5-26)
characterisedby a higher density than Dataset X is
their represented by undirected completed
relatively large distance from neighbours
an
and by a graph G = (V, E), where
higher densities. points with
DPC uses two quantities: V {V1.V2V,),IEI=2
() the density pj of point x, and Each data point x; in data set X
Vi in correspondsto a verteex
(i) its distance , from graph G is represented by Xi. Let T (V, ET) denote
points of higher densities. the MST of G
=

The
=
(V, E), where Er {e1, e2. e-1)
density p; of point x; is defined as e e E(G).
=

Now, there is one and only one path between each pair
Pexp-DGa] of vertices in T.
2
Where, D (zi, xi) is the
Euclidean distance between
5.9.2 Definitlon: GeodesicDistance
points x and x, and o is variance.
Suppose P ={P1. P2,.Pele V is the
(1) Input:Dataset two vertices x, and
path between
x in T, where edge (P. Pk =1)e E',
X {x;, Xj; ¬ R,i= 1,2, .. n}, the
=

total numberof 1SKSI-1


points n, a predefinedparameter8. The geodesic distance between two vertices x; and x; is
(2) Output: The value of o. defined as
I-1
(3) Begin
(4) Calculate and sort the pairwise distance between points K=1
ascending order, thatis, d.dz, d Where D Px. Px+ 1) is the Euclideandistance between
The distance between the
two points x; and x
point xzj and the other points The Euclidean distance
with higher densities,denoted by 8,, is defined as between pairwise points is
replaced by geodesic distance, and this leads to the result
that the distance between
min (d ( pairwise points in the same cluster
x), ife<ej become smaller, while the distance between
pairwise points
i:PiP from the different cluster is
2) larger.
(d (x, Xi)), After the geodesic distance is
max
otherwise defined, the density pi of
point x is defined as:
When the density p; and distance o, for each point have
been calculated,the decision graph is further generated. Phexp 20
The points with relatively high and
p, o, are considered
as cluster centers. In
addition, the distance o, between the
the other points with higher densities is
points x; and
The Euclidean distance between defined as,
pairwise points is the
distance measure in DPC method.
min
This distance measure is suitable for the data sets with D x), if pi<Pj
convex shape, but it is not suitable for the data sets with non j:P<P
convex shape. To solve this
problem, we adopt a new max (D (, x)), otherwise
distance metric: known as geodesic distance.
geodesic distance between two points is the shortest Since we are
adapting the selected cluster centres to
distancebetween two points) data-sets with arbitrary shape, we introduce the concept of
Let X be a data-set with K clusters andn data points, global density and local density.
that is,
X =
{xjlx^e R",i=1,2,.n}
cluster separation and intra cluster
The variance o is regarded as scale factor. If the value In general, the inter
are used as the internal validation
of o is smal, the corresponding density Pj is taken as local compactness
measures.
density around the point xi.
calculation of intercluster separation can be
The
In contrast the larger the value of o, the corresponding
categorized intotwo classes
density p is regarded as global density around the point X
The points with relatively higher Pi and are (i) to take the distance of a single pair of points as the
considered as cluster centers, and hence corresponding two intercluster separation, and

groups of cluster centers are achieved. another is based on the average pairwise distance
(ii)
between points in the different clusters.
5.9.3 Inconsistent Edge
inter cluster
Based the above idea, we use the
on
For the dataset with K categories, the MST-based distance based on the distance between
the points at the
clustering method partitions the MST into K subtrees intersection of two different clusters.
K
Ti by removing(K-1)inconsistentedges. We mention the detailed algorithm:

We note that the inconsistent edge between two (1) Input:Dataset


vertices must be in the path connecting two cluster centers X={x;} xe R',i=1,2, n}
of the different clusters to which the two vertices belong. K.
The total number of points n, the number of clusters
There are paths among K cluster centers. We The clustering result
have to find (K - 1) paths from them. The method of
of
{ClusterCluster2,
..,ClusterkK}
selecting (K -

1) paths is the geodesic distances


pairs of cluster centres in T; and (K - 1) edges in (K-1) iniconsistentedges
distance
T correspond to the paths in T. Einc e, e2, , eK -1}, the geodesic
..

Dg i, X) between points x; and x.


After determining (K - 1) paths, then we have to find
(K- 1) inconsistent edges on each of the (K- 1) paths. (2) Output: Interclustersep.
Generally, the inconsistentedge has two features (3) Begin.
i) Its length is longer, (i) The densities of the two end
points are smaller. M 5.10 MODEL BASED CLUSTERING
Based on this fact, we define a new parameter for the
edge e connecting x; and x; in the path. Expectation maximisation clustering

Cij
Dx2 The expectation maximisation algorithm performs
maximum likelihood estimation in the presence of
e+
is the Euclidean distance between latent riables.
where D (X, x)
and and e; and e; are the local and global density First it estimates the values of the latent variables, then
points ; x;
of points x; and x; respectively. it optimises the model.

For the (K - 1) paths, we find and remnove the edge Then it repeats the two steps till convergence takes

with the largest value of eij and we obtain K


,
clusters. place.
Expectation maximisation (EM) clustering is an
2 5.9.4 Internal Clustering Validation Index
unsupervised clustering algorithm. It also extends to NLP
We have obtained two groups of cluster centers under applications, like Hidden Markov Models and medical
obtain two
local density and global density. And we imaging. For this it uses Latent Dirichlet Allocation.
of clustering results. We can find internal
groups
validation measures to obtain the optimal result from
the two clustering results.

(New Syllabusw.e.f
academic year 22-23) (M7-79) Tech-Neo Publications.ASACHIN SHAH Venture
Machine Leaming(MU-Sem.7-Comp.)
5.10.1 Use of EM (Leaming with Clustering)....Pageno. (5-28)
We again mention that the essence of
) The Expectation Maximisation (EM) algorithm can Expectation-
Maximisationalgorithm is to use the available data of
find maximum-likelibood estimates for model the dataset to estimate the
parameterswhen the data is incomplete. missing data and then we
use that data to
() It is an iterative way update values of the parameter.
the

likelihoodfunction.
to
approximate the maximum Flow-chart for EM algorlthm
(ii) This can be used to fill the Start
missing data in a sample. Initial values
(iv) It can be used on the basis of
clusters. unsupervisedlearning of
Expectation E-Step
(v)It can be used for the purpose
of estimating the
parametersof Hidden Markov Model (HMM). Maximization M-Step
(vi) It can be use to find the values of latent variables
No Yes
5.10.2 EM Algorithm Convergence Stop
GQ. Write short note on: EM algorithm. (5 Marks) Fig.5.10.2
In statistics, an
expectation-maximisation (EM) 5.10.3 Advantagesof EM Algorithm
algorithm is an iterative method.
() It is guaranteed that likelihood will increase with each
It finds maximum a
posterior (MAP) estimates of iteration.
parameters in statistical models, where model depends (i) The E-step and M-step are quite easy to implement for
on unobserved latent variables.
many problems.
The EM iteration alternates between: (ii) Solutions to the M-steps exist in the closed form.
i) II performs an expectation (E) step : that creates a Disadvantages of EM algorithm
function for the ) It has slow convergence
expectationof the log-likelihood
and it is evaluated using the current estimate for i) It makes convergence to the local optima only.

the parameters, and (ii) It requires both the probabilities, forward and backward
(note that, numerical optimisation requires only
(i) A maximisation step (M) : it computes parameters forward probability).
maximising the expected log-likelihoodfund on
a 5.10.4 Applicationsof EM-Algorithm
E-step.
The latent variable model has several real-life
These parameter-estimates are used to determine the
applicationsin Machine-learning:
distribution of the latent variables in the next E-step. () Use to calculate the Gaussian density of a function.
The above two steps are repeated till convergence takes i) It helps to fill in the missing data during a sample.
place. (ii) In domains like Natural Language Processing (NLP).
M-Step mputer Vision etc. It finds plenty of applications.
Update/typothesis (iv) It is also helpful in image reconstruction in the field of
Medicine, and structuralEngineering.
E-Step (v) Used to estimate the prameters of mixed models like
Update variables gaussian Mixture Models.

Fig. 5.10.1

M5.11 EXPECTATIONMAXIMISATION
ALGORITHM 1.0

SPurpose of algorithm
In of the
many problem statements of machine
leaming, we obtain many relevant features to build the --50
required model. But only a few of these features are X
observable. (The non-observable features are also
called Fig. 5.11.1
as latent features).
As we do not have the values for The higher the valve of o (standard deviation) more
non-observable
(latent) variables, the Expectation maximization would be the spread along X-axis.
algorithm tries to use the existing data to determine the In 1-D space, the probability density function of
optimum values for these variables and then one can Gaussian distribution is given by,
find model parameters 2

5.11.1 Latent Variable Model


2)-
A latent variable model consists of observable where is mean ando is variance of the distribution.
variables along with unobservable variables.
This is true only in 1-D only.
Observed variables in the dataset can be measured
In the case of two variables, we will have a 3D-bell
whereas unobserved (latent / hidden) variables are
curve instead of a 2-D bell shaped curve.
inferred from the observed variables.
The probability density function for 3 D-bell curve is
5.11.2 Use of Latent Variable Model given by,
II can be used to find the local maximum likelihood txl.2) xp -w2 a-»
(MLE) parameters for latent variables in a statistical or
mathematical model. where x is input vector, is the 2-D mean vector and

i) It is used to predict these missing values in the dataset 2is 2x2covariancematrix.


if we can predict the general form of
probability This can be generalized for d-dimension.
distribution associated with these latent variables.
(ii) The concept behind this algorithm is to use the
5.11.4 Multivariate Gaussian Model
observable samples of latent variables to predict the For the multivariate Gaussian model, we have x and
values of unobservable samples.
as vector of length d, and 2 would be a d x d
This process is repeated till the convergence of the covariancematrix.
value occurs.
Hence, for a dataset
having 'd' features, we would
5.11.3 Use-Case of EM Algorlthm have a mixture of k Gaussian distributions
(where k
represents the number of clusters), each having a
Normal distribution about which we are already certain mean vector and variance matrix.
familiar is also called Gaussian distribution. For finding mean and variance for each Gaussian, we
This distribution is used in the field of machine use a
technique called Expectation Maximization
learning and statistics. (EM).
We give below with a few Gaussian distribution with
differentvalues of o. 5.11.5 GaussianMbxture Models
The main assumption of these mixture
models is that
there are a certain number of Gaussian
distributions,
and each of these distributions a cluster.
represent
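In practice such a mixture is fitted with the EM procedure described in the previous section. The sketch below is an illustration, not from the text; it assumes scikit-learn and NumPy and fits a two-component Gaussian mixture to synthetic one-dimensional data.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussians (the two "clusters")
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1.5, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)   # EM runs inside fit()
print("weights:", gmm.weights_)       # estimates of p and (1 - p)
print("means:", gmm.means_.ravel())   # estimates of mu1 and mu2
print("variances:", gmm.covariances_.ravel())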
For example, the Gaussian mixture
(Learningwith Clustering)....Pageno. (5-30)
model of a
2 Gaussian distributions 5.12.1 IrregulariclesIn the Data
2 2
N(41.) and N (42, o) Data may contain irregularities like

Here have to estimate total


Clusters can be of arbitrary shape as shown in the
we
of 5 parameters:
a
2 figure.
(p. H1,o P20, ) where p is the probabilitythat 2. Data may contain noise.
the data comes from the first Gaussian distributionand
(1 p) is that it comes from second Gaussian
distribution.
Then the probability density function of the mixture
model is given by :
2 2

g%10) P81(xl41.o,)+(1-P)82lk2
We find (pl44.G ,) through '

EM iteration And that gives the best fit. database 3

Fig. 5.12.2
M5.12 DENSITY BASEDCLUSTERING
For non-convex clusters and outliers / noises, K-means
It is a clustering method and is Density-basedspatial algorithm has difficulties for identifying these clusters with
clustering of applications with noise (DBSCAN), arbitrary shapes.
clustering method. DBSCAN algorithmrequires two parameters:
Clusters are dense regions in the data space. They are L. eps
separated by regions of the lower density of points.
Around a data-point, it contains the neighbourhood. It
The DBSCAN algorithm is based on this notion of means If the distance between two points is lower or
"clusters" and "noise. equal to 'eps' then they are considered as neighbours
If the eps value is very smal, then large part of the
The basic concept is that for each point of a cluster, the data will be considered as outliers.
neighbourhoodof a given radius has to contain at least
If it is chosen very large, then clusters will merge and
a minimum number of points. majority of the data-points will be in the same clusters.
Hence the eps-valueis based the k-distance
on
graph.
2. Min Dts

It is the minimum number of data


points (neighbors)
within eps radius.
database 1
database 2 If the dataset is large, the larger value of MinDts must
be chosen.
Fig. 5.12.1
In general, the minimum MinDts can be derived from
Why DBSCAN? the number of dimensions D in the dataset.
Partitioning methods like K-means and hierarchical Thus, MinDts 2 D+1
clustering work for finding spherical-shadedclusters or The minimum value of MinDts should be at
convex clusters. least 3.
In this algorithm, we come across 3
In other words, they are suitable only for compact and types of Data
points:
well-separated clusters.
i) Core point: A point is a core
point if it has more than
Also, they are severely affected by the presence of
MinDts points within eps.
noise and outliers in the data.

i) Border point: A point which has fewer than MinDts We iterate through the remaining unvisited points in
within eps but it is in the the dataset.
neighbourhood of a corec
point. Those points that do not belong to any cluster are
(ii) Noise or outlier: A point which is not a core point or noise.
border point.
Min pts= 4 a5.12.2 The basic Principle of DBSCAN
Bordor point
eps 1 unit Clustering
The main principle of DBSCAN is to find the
neighbourhoods of data points, exceeding certain
density threshold.
Core eps

The density thereshold is defined by two parameters


point Noise
1. The radius of neighbourhood (eps) and
Fig. 5.12.33
We carry out DBSCAN
2 The minimum number of neighbours / data points
algorithm in the
following (MinDts) within the radius of the neighbourhood
steps
1. First find all the neighbour points in the eps and 5.12.3 DBSCAN and K-means
identify the core points or visited with more than
DBSCAN clustering cannot efficiently handle high
MinDts neighbours.
dimensional datasets.
2. For each core point if it is not
already assigned to K-means clustering does not work well with outliers
a cluster, create a new cluster. and noisy datasets.
3. Find recursively all its density connected points DBSCAN clustering efficiently handles outliers and
and assign them to the same cluster as the core noisy dataset.
point.
Advantages and disadvantages of DBSCAN
Point a and b are said to be density connected if there
1. DBSCAN does not
exist a point C which has a sufficient number of require a priori specification of
points number of clusters.
in its neighbors and both the points a and b are within
the eps distance. 2. It is able to identify noise data while clustering.
3. DBSCAN algorithm can find
This is a chaining process. Thus if b is neighbor of
C arbitrarily shaped and
Cis neighborof d, d is neighbor arbitrarily size clusters.
of e,
which in tum is
neighbor of a then it implies that b is neighbor of a.
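For reference, an illustration (not part of the text): DBSCAN as implemented in scikit-learn takes exactly the two parameters discussed above, eps and min_samples (MinPts), and labels noise points with -1.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # two arbitrarily-shaped clusters
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}))
print("noise points:", np.sum(labels == -1))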

Chapter Ends..
MODULE 6
CHAPTER
6 DimensionalityReduction

Syllabus

Dimensionality Reduction Techniques, Principal Component Analysis, Linear Discriminant Analysis, Singular Valued Decomposition.

6.1 DimensionalityReductionTechniques..
UQ. Why DimensionalityReduction is ******e************************es*******eeceeneee*****.
**************************** 6-2
Importantstep in Machine Learning?
MUComp)-May 19, Dec.19,5Marks).
very
6.1.1 Dimension Reduction *******a********************ennesseessesec****eence************************************6-2
Techniquesin
ML
6.1.1(A) Feature Selection.. ********************************esososs**ss******es*o************************************. 6-3
6.1.1(B) Feature Extraction. **************e*****.e***************an*******snaues****eeeneeeess**aa********.***************************.6-3
6.2 Principle ComponentAnalysis... ***aa****o******e*****.°°****e******************* **ssenssse ****s**************s************** ..6-4
*******esns************e******************** annesenea esss*****eossesen*****n*******sse****************************
UQ. Explain in detail Principal Component 6-4
Analysisfor Dimension Reduction.
UQ.
(MUComp)-May
Use
15,May 16,10Marks)..
Principal Componentanalysis (PCA) to arrive at the 6-4
transformedmatrix for the given data.
A- 3 1 05
MUComp)-May 17, 10 Marks. ******** **********sss**************a*******
Ua. Describe the steps to reduce s*a****.*astnossssssse***************oresenee******.4
dimensionalityusing Principal ComponentAnalysis method by clearly
mathematicalformulas used.
(MU(Comp)-May 19, Dec.19, 5 Marks).******************snsssesestating O
UQ. Write Short note on ISA and compare it with PCA.
6.3
(MU(Comp) -May 19, 5Marks). .6-5
Linear discriminant
6.4
analysis... 7
SingleValueD Decomposition. 6-88

UQ. Find the singularvalue decompositionof A=

(MUComp)-Dec.19,10 Marks). e**ee*ee *°eeees******at se**aoaaohea**ooaoeeossoee*e***o*ae*eaeoeseaee****eese eseeuee


.. 6-8
Chapter Ends. ********e*s******a*a******"s**sns**sassssnsensssm*assn**a*o*************************eoroerseesssssansosensse 6-9
(DimensionalityReduction)..Page no. (6-2)
Machine Leaming(MU-Sem.7-Comp.)
6.1 DIMENSIONALITY REDUCTION TECHNIQUES

?
UQ Why DimensionalityReduction is very Importantstep in Machine Learning
(MU(Comp) -May 19, Dec.19, 5 Marks)

arrangement of
information having tremendous
Dimension Reduction alludes to the way toward changing over an

comparable data briefly. These


guaranteeing that it passes
on
dimensions into informmation with lesser dimensions
for classification
acquire better properties
or
methods are normally utilized while tackling machine learning issues to

regressionproblem.
follows. It indicates 2 dimensions P, and P2, which are given us
We should take a gander at the picture demonstrated as
somehow managed to utilize
a chance to state estimations of a few objects in cm (P) and inches (P). Presently,if we
of noise in the system,
Doth the measurements in machine learning,they will pass on comparabledata and present a lot
from 2D
present here is changed
over
dimension. Dimension of information
so it is better to simply utilizing one
(from P and P) to ID (Q), to make the information moderately less demanding to clarify.
www. www.

www.

P2
(inches)

P1(cm)

Q1

Fig. 6.1.1 Exampleof DimensionReduction


There are number of methods using which, we can get k dimensions by reducing n dimensions (k < n) of informational
index. These k dimensions can be specifically distinguished (sifted) or can be a blend of dimensions (weighted
midpoints of dimensions) or new dimension(s) that speak to existing numerous dimensions well. A standout amongst
the most well-known use of this procedure is Image preparing
Now we will see the significanceof applying Dimension Reduction method
o It assists in information packing and diminishing the storage room required

It decreases the time which is required for doing same calculations. If the dimensions are less then processing will
be less, added advantage of having less dimensions is permission to use calculations unfit for countless.
It handles with multi-collinearity that is used to enhance the execution of the model. It evacuates excess
highlights. For example: it makes no sense to put away an incentive in two distinct units (meters and inches).
Lessening the dimensions of information to 3D or 2D may enable us to plot and visualize exactly. You would then
be able to watch designs all the more unmistakably.Beneath we can see that, howa 3D information is changed
over into 2D. It has distinguished the 20D plane at that point spoke to the focuses on these two new axis z, and z



Machine Learning(MU-Sem.7-Comp.) (DimensionalityReduction)....Pageno. (6-3)

0
20
20
0
-20 5

40 2
60 10
15 10-5 15
10 15-20

Z2 20 T

15

-10
-15
-20
80-60 -40 -20 o 20 40 60 80

Reduction
Fig. 6.1.2: Example of Dimension
enhance the execution of models.
and because of this we can
It is helpful in noise evacuation additionally
for a large number of features.
o The classifier's performanceusuallywill degrade
* **********Y******;***********

wwwe*w* :

***

***** **** *********

weenwwwwww.www.wwwwibowewrimmdrwww
Number of variables

Fig. 6.1.3 :Classifier performance and amount of data

6.1.1 Dimension Reduction Techniques in ML

a6.1.1(A) FeatureSelection
maximizes the
Given a set offeaturesF= {X1.X,}. The feature selection problem is to find a subsetFSF that
maximize some scoring function.
learner's ability to classifythe patterns. FinallyF should

Feature selection

LJ L
(DimensionalityReduction)..Page no. (6-4)
Machine Leaming(MU-Sem.7-Comp.)
Feature Selection Steps
Feature selection is an optimizationproblem.
Step 1: Search the space of possible feature subsets.

Step 2: Pick the subset that is optimal or near-optimalwith respect to some objective function.
Subset selection

There are d initial features and 2" possible subsets


Criteria to decide which subset is the best:
Cassifier based on these m features has the lowest probability of error of all such classifiers

It is not possible to go over all 2" possibilities, so we need some heuristics


Here we select uncorrelated features
Forward search
o Start from empty set of features
O Try each of remaining features
Estimate classification/regression error for adding specific feature
Select feature that gives maximum improvement in validation error
Stop when no significant improvement
Backward search
Start with original set of size d
Drop features with smallest impact on error.

6.1.1(B) Feature Extraction


Suppose a set of features F = {X,, *******.., Xy} is given. The Feature Extraction task is to map F to some feature set
F" that will maximize the learner's ability to classifypatterns.

Y2
Feature extraction

A projection matrix w is computed from N-dimensional to M-dimensional vectors to achieve low error.
Z w'x. Principlecomponentanalysis and Independentcomponentanalysis are the feature extraction methods.

2 PRINCIPLE COMPONENTANALYSIS

UQ Explain in detail PrincipalComponentAnalysisfor Dimension Reduction. (MU(Comp) -


May 15, May 16, 10 Marks)
UQ. Use PrincipalComponentanalysis (PCA) to arrive at the transformedmatrixfor the given data.

A430
.
(MUComp) -May 17, 10 Marks)
-------

Machine Leaming(MU-Sem.7-Comp.)
(DimensionalityReduction)...Pageno. (6
UO Describe the steps to reduce
dimensionality using Principal Component Analysis method
mathematicalformulas used. by cdearty Staung
UQ Write Short note on ISA and compare it with PCA.
(MU(Comp)-May19,Dec.19,5Marks)
(MUComp) -May 19,5 Marks)
.Dimensional reduction is regularly the best beginning
stage when dealing with high dimensionalinformation.
It is utilized for an assortment of reasons, from
perception to denoising, and in a wide range of uses, from signal
processing to bioinformatics.
A standout amongst the most
broadly utilized dimensionalreductiontools is PrincipalComponentAnalysis(PCA
PCA verifiably accept that the dataset under thought is typically dispersed, furthermore, chooses the subspace which
expands the anticipateddifference.
Wecosidera entereddata set, and developthe sample covariancematrix, at that point q-dimensionalPCA is identical
to anticipatingonto the
q-dimensionalsubspace spread over by the q eigenvectorsof S with biggest eigenvalues.
In this system, varñables are changed into another arrangement of variables, which are straight blend of unique
variables. These new arrangement of variables are known as principle components. They are calculated so that st
principlecomponent S represents a large portion of the conceivablevariety of unique informationafter which each
succeedingcomponent has the most noteworthyconceivablevariance.
The second principal component should be symmetrical to the primary principal component. As it were, it does its Dest
to catch the difference in the information that isn't caught by the primary principal component. For two-dimensional
dataset, there can be just two principal components. The following is a depiction of the information and its first and
second principal component. You can see that second principal component is symmetrical to first principal component.
"

****" 1principalcomponent

2 principal component
www.wwwwww.ww.twww.igwww.gewwiromrns

****

****

******* ************ww *********wvoww.ven

****

Fig. 6.2.1

The principal components are sensitive to the size of estimation, now to setle this issue we ought to dependably
institutionalize factors before applying PCA. Applying PCA to your infomational collection loses its importance. In
the event that interpretability of the outcomes is critical for your investigation, PCA isn't the corect system
for your task.

Ex.6.2.1: Apply PCA onthe followingdataand find the principlecomponent

X   2.5   0.5   2.2   1.9   3.1   2.3   2.0   1.0   1.5   1.1
Y   2.4   0.7   2.9   2.2   3.0   2.7   1.6   1.1   1.6   0.9
Soln.
First we will find the mean values

N= number of data points = 10

X 1.81



Machine Learning (MU-Sem.7-Comp.) (DimensionalityReduction)..Pageno. (6-6)

Ym 1.91
Now we will find the covariance matrix, C
=|C Cyw

-1
=0.6165

CY Cyx-2X-X) (Y-Y,)
N-1
= 0.61544

C Y
=0.7165
N-1

C 0.6165 0.61544
Lo.61544 0.7165
Now to find the eigen values following equation is used
IC-Al = 0

I 0.6165 0.61544 o
Lo.61544 07165J-
By solving the above determinant we will get quadratic equation and solving that equation we will get two eigen values
of Aas =0.0489 and =1.284
Now we will find the eigen vectors corresponding to eigen values
First we will find the first eigen vector corresponding to first eigen value, A=0.0489

CxV ,xV
.6165 0.61544
LO.61544 0.7165J Lv, =0.0489
=0.0489
From the above equation we will get two equations as
0.6165 V +0.61544 V12 = 0.0489 V
0.61544 V+0.7165 V12 = 0.0489 V

To find the eigen vector we can take either Equation (1) or (2), for both the equations answer will be the same, let's
take the first equation
= 0.6165 Vi +0.61544VI = 0.0489 V
0.5676 Vi1 +0.61544 V2
VI 0.61544V12
VI 1.0842V12
Now we will assume Vi2= 1 then V11=- 1.0842

V =0842
As we are assuming the value of Vi2 we have to normalized V, as follow

VIN VI.0842+TL.0842
J
-0.735
VIN 0677J
(Dimensionality Reduction)...Page
no. (6-7)
Machine Leaming(MU-Sem.7-Comp.)
value, 1.284
Similarlywe will find the second eigen vectorcorrespondingto second eigen
CxV xV2
o.6165 0.615441 1.284
Lo.61544 0.7165J LV12

From the above equationwe will get two equationsas


0.6165V 0.61544V2 = 0.0489 V
0.61544 V1+0.7165V22 0.0489V2
the equations answer will
be the same,
let s
To find the eigen vector we can take either Equation (1) or (2), for both
take the first equation
= 0.6165 V +0.61544Vn
1.284 V21-0.6675Va =-0.61544V22
V21 0.922Va
will assume V22=1 then V2 = 0.922
Now we

V2
to normalized V2 as follows
As we are assumingthe value of V22 we have

V2N 0.922
V0.922+1L

0.677
V2N Lo.735 in
to the eigen vector corresponding to
maximum Eigen value,
Now we have to find the principlecomponent, it is equal
is Eigen vector VzN
this is maximum, hence principlecomponent
o.6771
Principlecomponent Lo.735
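The eigen-decomposition above can be checked numerically. The following sketch is an illustration assuming NumPy (it is not part of the textbook solution); it recomputes the covariance matrix and the principal component for the data of Ex. 6.2.1.

import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

C = np.cov(x, y)                       # sample covariance matrix (divides by N - 1)
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues returned in ascending order
principal_component = eigvecs[:, -1]   # eigenvector of the largest eigenvalue
print(C)
print(principal_component)             # roughly [0.68, 0.73], matching the hand computation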

W 6.3 LINEAR DISCRIMINANT ANALYSIS


the pre-
is most commonly used as dimensionality reduction technique in
Linear Discriminant Analysis (LDA) a dataset onto a
processing step for pattern-classification and learning applications.The goal is to project
machine
of dimensionality") and also
lower-dimensional space with good class-separability in order avoid overfitting("curse
reduce computational costs.
Taxonomic
Ronald A. Fisher formulated the Linear Discriminant
in 1936 (The Use of Multiple Measurements in
for a 2-class
classifier. The originalLinear discriminant was described
Problems), and it also has some practicaluses as Discriminant
"multi-class Linear Discriminant Analysis" or "Multiple
problem, and it was then later generalizedas of classification)
C. R. Rao in 1948 (The utilization of multiple
measurements in problems biological
Analysis"by
The general LDA approach is very similar to a PrincipalComponentAnalysis but in addition to finding the component
the
axes that maximize the variance of our data (PCA), we are additionally interested in the axes that maximize
separationbetween multipleclasses (LDA).
smaller
LDA is to projecta feature space (a dataset n-dimensional samples)onto a
So, in a nutshell,often the goaB of an
1) while maintaining the class-discriminatory
information. In general, dimensionality
subspacek (where k n
-

costs for a given classification task, but it can also be helpful to


reduction does not reducing computational
only help
avoid overfitting by minimizing the error in parameter estimation ("curse of dimensionality").

(New Syllabus w.e.f academicyear 22-23) (M7-79)


Tech-Neo Publications.ASACHIN SHAH Venture
Machine Loarning(MU.-Sem.7-Comp.) (DimenslonlllyReduotion)...Pagono. (6-8)
Summarlxingthe LDA approach In 5 stepps
w ue the s general nteps for performinga lInear dincrimlnant analynin; we will explore them in more detail
in the following sections :

Compute the d-dimensional mean vectors for the different clawsen from the datwet.
2. Compute the scatter matrices (in-between-classand within-clamNACatter matrix).
3. Compute the eigenvectos(e, )and correspondingolgen values (a,
4. Sort the eigenvectors by decreasing cigen values and choONe k elgenvectors
g 4) forthewcatermatrlce.
with the largest elgen values to form a
dxk dimensional matrix W
(whereevery eolumn representsan elgenvector).
Use this d x k eigenvector matrix to transform the samples onto the new subspuce. This can be summarized by the
matrix multiplication: Y = Xx W (where X is an x d-dimensional matrix representing the n samples, and y are the
transformedn x k-dimensionalsamples in the new subspace).
Ex.6.3.1 Compute the Linear Discriminant projection for the following two-dimensional dataset.
X =
(x,, x3)= (4,1).(2,4).(2,3),(3,6),(4,4)
X = (%, x) =1(9,10)(6,8).9,5).(8,7).(10,8))

Soln.
Step 1: Compute mean for class 1 and class 2
M 3 3.601 M, = [8.40 7.60
Step 2: Compute within class scatter matrix
SW = S, +S2
S, is the covariance matrix of class 1 and S, is the covariance matrix for class 2

S1 = (1/N1) Σ (X - M1)(X - M1)^T over class 1 samples
S2 = (1/N2) Σ (X - M2)(X - M2)^T over class 2 samples

S1 = [ 0.80  -0.40 ]      S2 = [ 1.84  -0.04 ]
     [-0.40   2.64 ]           [-0.04   2.64 ]

SW = [ 2.64  -0.44 ]
     [-0.44   5.28 ]
Step 3 Compute between class scatter matrix
SB = (M1 - M2)(M1 - M2)^T

SB = [ 29.16   21.60 ]
     [ 21.60   16.00 ]
Step 4: Find best LDA projection vector
w* = SW^(-1) (M1 - M2) = [-0.91  -0.39]
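This projection can be verified numerically. The sketch below is illustrative (it assumes NumPy) and reproduces the class means, the within-class scatter and the projection direction for the data of Ex. 6.3.1.

import numpy as np

X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

M1, M2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - M1).T @ (X1 - M1) / len(X1)   # class covariance matrices
S2 = (X2 - M2).T @ (X2 - M2) / len(X2)
SW = S1 + S2                             # within-class scatter

w = np.linalg.solve(SW, M1 - M2)         # SW^-1 (M1 - M2)
w = w / np.linalg.norm(w)                # normalise the direction
print(np.round(w, 2))                    # close to the hand-computed [-0.91, -0.39]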

6.4 SINGLE VALUED DEcOMPOSITION

UQ Find thesingularvalue decompositionof A =|1 1

(MU(Comp) - Dec. 19, 10 Marks)


------ ----
In singular value decomposition method a matrix is decomposed into three other matrices:

A = USv
Machine Leaming(MU-Sem.7-Comp.)
Here, A represents m xn matnx. U (DimensionalityReduction)...Pageno.(6
representsm x n
orthogonal matrix. S is a n xn diagonal matnx
orthogonal matrix. and v **

Matrix U has the left singular vectors as


columns; S is a diagonal matrix which contains
right singular vectors as rows. In singular value singular values; andY has
Here the covariance matrix is diagonal. decompositionoriginal data present in a coordinatesystem nded.
is

To cakculate singular value decompositionwe


need to find the eigen
qolumns of V consists ot values and eigenvectorsof AA' and A A: I1
f eigen values from AA'eigenvectors
of A' The.

columns ofU consistsof the eigenvectorsof AA. The square roo


or A'A
representsthe singular values in S.
The singular values arc arranged in descendingorder and
stored as the diagonal entries of the S matrix. IDc gu lar
walues are always real numbers. If the matrix A is a
real matrix, then U and V are also real.

B.641:FindSVDfor A=
soln.
First we will alculateAA =
Now we will calculateeigen vectors V, and V,
using the method that we have seen in PCA

V, =

Ly

Next we will calculate AV, and AV2

AV, 22

Next we will calculate U, and U2

AV =
U, = =
SVD is written as,

A lah
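Because the numerical details of this example did not reproduce cleanly, here is an illustrative check of the same procedure using NumPy (not from the textbook); the 2 x 2 matrix chosen below is an assumption purely for demonstration.

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])           # example matrix (assumed for illustration)

U, S, Vt = np.linalg.svd(A)          # A = U @ diag(S) @ Vt
print("U =", U)
print("singular values =", S)        # square roots of the eigenvalues of A^T A
print("V^T =", Vt)
print(np.allclose(A, U @ np.diag(S) @ Vt))   # True: the factorisation reconstructs A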

ChapterEnds
