Combined ML
Combined ML
Combined ML
NEERAJ.GUPTA@GLA.AC.IN 2
LEARNING
Human learn our surrounding through 5 senses
eye,
ear,
nose,
tongue
and skin.
NEERAJ.GUPTA@GLA.AC.IN 3
LEARNING
1. Rote Learning (memorization): Memorizing things without
knowing the concept or logic behind them.
2. Passive Learning (instructions): Learning from a
teacher/expert.
3. Analogy (experience): Learning new things from our past
experience.
4. Inductive Learning (experience): On the basis of past
experience, formulating a generalized concept.
5. Deductive Learning: Deriving new facts from past facts.
NEERAJ.GUPTA@GLA.AC.IN 4
CONCEPT LEARNING
Inducing general functions from specific training examples is a
main issue of machine learning.
NEERAJ.GUPTA@GLA.AC.IN 6
CONCEPT LEARNING
A Formal Definition for Concept Learning:
Inferring a Boolean-valued function from training examples of
its input and output.
• An example for concept-learning is the learning of bird-concept
from the given examples of birds (positive examples) and non-
birds (negative examples).
• We are trying to learn the definition of a concept from given
examples.
SPORT EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 7
SPORT EXAMPLE
Concept to be learned:
Days in which Aldo can enjoy water sport
Attributes:
Sky: Sunny, Cloudy, Rainy Wind: Strong, Weak
AirTemp: Warm, Cold Water: Warm, Cool
Humidity: Normal, High Forecast: Same, Change
Instances in the training set:
(out of the 96 possible):
NEERAJ.GUPTA@GLA.AC.IN 8
HYPOTHESES REPRESENTATION
h is a set of constraints on attributes:
a specific value: e.g. Water = Warm
any value allowed: e.g. Water = ?
no value allowed: e.g. Water = Ø
Example hypothesis:
Sky AirTemp Humidity Wind Water Forecast
Sunny, ?, ?, Strong, ?, Same
Corresponding to boolean function:
Sunny(Sky) ∧ Strong(Wind) ∧ Same(Forecast)
H, hypotheses space, all possible h
NEERAJ.GUPTA@GLA.AC.IN 9
HYPOTHESIS SATISFACTION
An instance x satisfies an hypothesis h iff all the constraints expressed by h are
satisfied by the attribute values in x.
Example 1:
x1: Sunny, Warm, Normal, Strong, Warm, Same
h1: Sunny, ?, ?, Strong, ?, Same Satisfies?
Yes
Example 2:
x2: Sunny, Warm, Normal, Strong, Warm, Same
h2: Sunny, ?, ?, Ø, ?, Same Satisfies?
No
NEERAJ.GUPTA@GLA.AC.IN 10
FORMAL TASK DESCRIPTION
Given:
X all possible days, as described by the attributes
A set of hypothesis H, a conjunction of constraints on the attributes,
representing a function h: X {0, 1}
[h(x) = 1 if x satisfies h; h(x) = 0 if x does not satisfy h]
A target concept: c: X {0, 1} where
c(x) = 1 iff EnjoySport = Yes;
c(x) = 0 iff EnjoySport = No;
A training set of possible instances D: {x, c(x)}
Goal: find a hypothesis h in H such that
h(x) = c(x) for all x in D
Hopefully h will be able to predict outside D…
NEERAJ.GUPTA@GLA.AC.IN 11
THE INDUCTIVE LEARNING ASSUMPTION
We can at best guarantee that the output hypothesis fits the target
concept over the training data
Assumption: an hypothesis that approximates well the training data
will also approximate the target function over unobserved
examples
i.e. given a significant training set, the output hypothesis is able to
make predictions
NEERAJ.GUPTA@GLA.AC.IN 12
CONCEPT LEARNING AS SEARCH
Concept learning is a task of searching an hypotheses space
The representation chosen for hypotheses determines the search
space
In the example we have:
3 x 25 = 96 possible instances (6 attributes)
1 + 4 x 35= 973 semantically distinct hypothesis
considering that all the hypothesis with some are semantically
equivalent, i.e. inconsistent
Structuring the search space may help in searching more efficiently
NEERAJ.GUPTA@GLA.AC.IN 13
GENERAL TO SPECIFIC ORDERING
Consider: h1 = Sunny, ?, ?, Strong, ?, ?
h2 = Sunny, ?, ?, ?, ?, ?
Any instance classified positive by h1 will also be classified
positive by h2
h2 is more general than h1
Definition: hj g hk iff (x X ) [(hk(x) = 1) (hj (x)= 1)]
g more general or equal; >g strictly more general
Most general hypothesis: ?, ?, ?, ?, ?, ?
Most specific hypothesis: Ø, Ø, Ø, Ø, Ø, Ø NEERAJ.GUPTA@GLA.AC.IN 14
GENERAL TO SPECIFIC ORDERING: INDUCED STRUCTURE
NEERAJ.GUPTA@GLA.AC.IN 15
FIND-S: FINDING THE MOST SPECIFIC HYPOTHESIS
1.Initialize h to the most specific hypothesis in H
2.For each positive training instance:
for each attribute constraint a in h:
If the constraint a is satisfied by x then do nothing
else replace a in h by the next more general
constraint satified by x (move towards a more
general hp)
3.Output hypothesis h
NEERAJ.GUPTA@GLA.AC.IN 16
FIND-S IN ACTION
NEERAJ.GUPTA@GLA.AC.IN 17
EXAMPLE
To illustrate this algorithm, assume the learner is given the sequence of training
examples from the EnjoySport task
NEERAJ.GUPTA@GLA.AC.IN 18
PROPERTIES OF FIND-S
Find-S is guaranteed to output the most specific hypothesis within H that is
consistent with the positive training examples
The final hypothesis will also be consistent with the negative examples
Problems:
There can be more than one “most specific hypotheses”
We cannot say if the learner converged to the correct target
Why choose the most specific?
If the training examples are inconsistent, the algorithm can be mislead: no tolerance to rumor.
Negative example are not considered
NEERAJ.GUPTA@GLA.AC.IN 19
QUESTION
Consider the following data set having the data about which particular seeds are
poisonous.
NEERAJ.GUPTA@GLA.AC.IN 20
EXAMPLE
First we consider the hypothesis to be more specific hypothesis.
Hence, our hypothesis would be :
h = {ϕ, ϕ, ϕ, ϕ}
Instance 1 :
The data in example 1 is { GREEN, HARD, NO, WRINKLED }. We see that our initial hypothesis is
more specific and we have to generalize it for this example. Hence, the hypothesis becomes :
NEERAJ.GUPTA@GLA.AC.IN 21
EXAMPLE
Instance 3 :
Here we see that this example has a negative outcome. Hence we neglect this example and our
hypothesis remains the same.
NEERAJ.GUPTA@GLA.AC.IN 23
THANKS
NEERAJ.GUPTA@GLA.AC.IN 24
MACHINE LEARNING (ML-5)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
NEERAJ.GUPTA@GLA.AC.IN 1
AGENDA
Linear regression with one variable
NEERAJ.GUPTA@GLA.AC.IN 2
QUIZ
X Y
2 4
3 9
5 25
9 81
7 49
11 121
10.5 WHAT?
NEERAJ.GUPTA@GLA.AC.IN 3
QUIZ
X Y
2 4
3 9
5 25
9 81
7 49
11 121
10.5 110.25
NEERAJ.GUPTA@GLA.AC.IN 4
HOW DO YOU FIND THAT?
You find the relation between X and Y
NEERAJ.GUPTA@GLA.AC.IN 5
HOW DO YOU FIND THAT?
You find the relation between X and Y
Y = X.X =X^2
Y=f(X)
NEERAJ.GUPTA@GLA.AC.IN 6
HOW DO YOU FIND THAT?
You find the relation between X and Y
Y = X.X =X^2
Y=f(X)
Which one is dependent variable ?
NEERAJ.GUPTA@GLA.AC.IN 7
HOW DO YOU FIND THAT?
You find the relation between X and Y
Y = X.X =X^2
Y=f(X)
Which one is dependent variable ? ANSWER =
Y
NEERAJ.GUPTA@GLA.AC.IN 8
HOW DO YOU FIND THAT?
You find the relation between X and Y
Y = X.X =X^2
Y=f(X)
Which one is dependent variable ? ANSWER = Y
So What is X?
NEERAJ.GUPTA@GLA.AC.IN 9
HOW DO YOU FIND THAT?
You find the relation between X and Y
Y = X.X =X^2
Y=f(X)
Which one is dependent variable ? ANSWER = Y
SO What is X? X= Independent variable
NEERAJ.GUPTA@GLA.AC.IN 10
QUIZ
X Y
2 3
3 5
5 9
9 10
7 6.5
11 11.8
10.5 WHAT?
NEERAJ.GUPTA@GLA.AC.IN 11
QUIZ
X Y
2 3
3 5
5 9
9 10
7 6.5
11 11.8
10.5 Is it difficult to find out
the relation ?
NEERAJ.GUPTA@GLA.AC.IN 12
GRAPH IS SOLUTION ?
Y
14
12
10
0
0 2 4 6 8 10 12
NEERAJ.GUPTA@GLA.AC.IN 13
Y
14
12
10
0
0 2 4 6 8 10 12
NEERAJ.GUPTA@GLA.AC.IN 14
APPROXIMATION
Y
14
11.8
12
10
10
9
8
6.5
Y
6
5
4
3
0
0 2 4 6 8 10 12
X
NEERAJ.GUPTA@GLA.AC.IN 15
FIND THE EQUATION OF LINE ?
Two Points are given (3, 5) and (9,10)
Find equation of line ?
Y= m.X + c
NEERAJ.GUPTA@GLA.AC.IN 16
FIND THE EQUATION OF LINE ?
Two Points are given (3, 5) and (9,10)
Find equation of line ?
Y= m.X + c
Y= 0.83X+2.5
NEERAJ.GUPTA@GLA.AC.IN 17
DEFINITION
Finding the relation between dependent variable and
Independent variable is called Linear Regression.
OR
Finding the best fit line between dependent variable and
Independent variable is called Linear Regression.
NEERAJ.GUPTA@GLA.AC.IN 18
DEFINITION
Finding the relation between dependent variable and Independent variable is called
Linear Regression.
NEERAJ.GUPTA@GLA.AC.IN 19
DEFINITION
Finding the relation between dependent variable and Independent variable is called
Linear Regression.
Now X=10.5 , X=12 , 13, 2.5
Y= what?
What are you doing here?
(USES OF LINEAR REGRESSION)
FORECASTING
PREDICTION
NEERAJ.GUPTA@GLA.AC.IN 20
THE ERROR (RESIDUALS)
ERROR =GIVEN DATA(ACTUAL DATA) – PREDICTED DATA
e=Y(actual) –Y(predicted)
NEERAJ.GUPTA@GLA.AC.IN 21
THE ERROR (RESIDUALS)
ERROR =GIVEN DATA(ACTUAL DATA) – PREDICTED DATA
e=Y(actual) –Y(predicted)
“Question”
“How can we find the best fit line?”
NEERAJ.GUPTA@GLA.AC.IN 22
THE ERROR (RESIDUALS)
ERROR =GIVEN DATA(ACTUAL DATA) – PREDICTED DATA
e=Y(actual) –Y(predicted)
“Question”
“How can we find the best fit line?”
“Answer”
If Y(actual) =Y(predicted) or e=0
Or
Minimise the error
NEERAJ.GUPTA@GLA.AC.IN 23
HOW TO FIND BEST FIT LINE
Derivation of linear regression equations : (FOR SINGLE VARIABLE)
given a set of n points ( , ) on a scatterplot,
find the best-fit line, =a+ b
such that the sum of squared errors in Y, ∑( − ) is minimized.
NEERAJ.GUPTA@GLA.AC.IN 24
Linear regression with one variable
MODEL REPRESENTATION *
SRC : * Andrew NG
NEERAJ.GUPTA@GLA.AC.IN 25
500
Housing Prices
400
(Portland, OR)
300
Price 200
(in 1000s 100
of dollars)
0
0 500 1000 1500 2000 2500 3000
Size (feet2)
Supervised Learning Regression Problem
Given the “right answer” for Predict real-valued output
each example in the data.
Training set of Size in feet2 (x) Price ($) in 1000's (y)
housing prices 2104 460
(Portland, OR) 1416 232
1534 315
852 178
… …
Notation:
m = Number of training examples
x’s = “input” variable / features
y’s = “output” variable / “target” variable
Training Set How do we represent h ?
Learning Algorithm
Size of h Estimated
house price
SRC : * Andrew NG
NEERAJ.GUPTA@GLA.AC.IN 29
Size in feet2 (x) Price ($) in 1000's (y)
Training Set
2104 460
1416 232
1534 315
852 178
… …
Hypothesis:
‘s: Parameters
How to choose ‘s ?
NEERAJ.GUPTA@GLA.AC.IN 30
3 3 3
2 2 2
1 1 1
0 0 0
0 1 2 3 0 1 2 3 0 1 2 3
NEERAJ.GUPTA@GLA.AC.IN 31
y
NEERAJ.GUPTA@GLA.AC.IN 32
Linear regression with one variable
COST FUNCTION
INTUITION I *
SRC : * Andrew NG
NEERAJ.GUPTA@GLA.AC.IN 33
Simplified
Hypothesis:
Parameters:
Cost Function:
Goal:
NEERAJ.GUPTA@GLA.AC.IN 34
(for fixed , this is a function of x) (function of the parameter )
3 3
2 2
y
1 1
0 0
0 1 2 3 -0.5 0 0.5 1 1.5 2 2.5
x
NEERAJ.GUPTA@GLA.AC.IN 35
(for fixed , this is a function of x) (function of the parameter )
3 3
2 2
y
1 1
0 0
0 1 2 3 -0.5 0 0.5 1 1.5 2 2.5
x
NEERAJ.GUPTA@GLA.AC.IN 36
(for fixed , this is a function of x) (function of the parameter )
3 3
2 2
y
1 1
0 0
0 1 2 3 -0.5 0 0.5 1 1.5 2 2.5
x
NEERAJ.GUPTA@GLA.AC.IN 37
Linear regression with one variable
COST FUNCTION
INTUITION II *
SRC : * Andrew NG
NEERAJ.GUPTA@GLA.AC.IN 38
Hypothesis:
Parameters:
Cost Function:
Goal:
NEERAJ.GUPTA@GLA.AC.IN 39
(for fixed , this is a function of x) (function of the parameters )
500
400
100
0
0 500 1000 1500 2000 2500 3000
Size in feet2 (x)
NEERAJ.GUPTA@GLA.AC.IN 40
NEERAJ.GUPTA@GLA.AC.IN 41
(for fixed , this is a function of x) (function of the parameters )
NEERAJ.GUPTA@GLA.AC.IN 42
(for fixed , this is a function of x) (function of the parameters )
NEERAJ.GUPTA@GLA.AC.IN 43
(for fixed , this is a function of x) (function of the parameters )
NEERAJ.GUPTA@GLA.AC.IN 44
(for fixed , this is a function of x) (function of the parameters )
NEERAJ.GUPTA@GLA.AC.IN 45
Linear regression with one variable
GRADIENT DESCENT *
SRC : * Andrew NG
NEERAJ.GUPTA@GLA.AC.IN 46
Have some function
Want
Outline:
• Start with some
• Keep changing to reduce
until we hopefully end up at a minimum
NEERAJ.GUPTA@GLA.AC.IN 47
J()
NEERAJ.GUPTA@GLA.AC.IN 48
J()
NEERAJ.GUPTA@GLA.AC.IN 49
Gradient descent algorithm
NEERAJ.GUPTA@GLA.AC.IN 50
Linear regression with one variable
GRADIENT DESCENT
INTUITION*
SRC : * Andrew NG
NEERAJ.GUPTA@GLA.AC.IN 51
Gradient descent algorithm
NEERAJ.GUPTA@GLA.AC.IN 52
NEERAJ.GUPTA@GLA.AC.IN 53
If α is too small, gradient descent
can be slow.
NEERAJ.GUPTA@GLA.AC.IN 54
Linear regression with one variable
GRADIENT DESCENT FOR
LINEAR REGRESSION*
SRC : * Andrew NG
NEERAJ.GUPTA@GLA.AC.IN 55
Gradient descent algorithm Linear Regression Model
NEERAJ.GUPTA@GLA.AC.IN 56
NEERAJ.GUPTA@GLA.AC.IN 57
Gradient descent algorithm
update
and
simultaneously
NEERAJ.GUPTA@GLA.AC.IN 58
J()
NEERAJ.GUPTA@GLA.AC.IN 59
J()
NEERAJ.GUPTA@GLA.AC.IN 60
NEERAJ.GUPTA@GLA.AC.IN 61
(for fixed , this is a function of x) (function of the parameters )
NEERAJ.GUPTA@GLA.AC.IN 62
(for fixed , this is a function of x) (function of the parameters )
NEERAJ.GUPTA@GLA.AC.IN 63
(for fixed , this is a function of x) (function of the parameters )
NEERAJ.GUPTA@GLA.AC.IN 64
(for fixed , this is a function of x) (function of the parameters )
NEERAJ.GUPTA@GLA.AC.IN 65
(for fixed , this is a function of x) (function of the parameters )
NEERAJ.GUPTA@GLA.AC.IN 66
(for fixed , this is a function of x) (function of the parameters )
NEERAJ.GUPTA@GLA.AC.IN 67
(for fixed , this is a function of x) (function of the parameters )
NEERAJ.GUPTA@GLA.AC.IN 68
(for fixed , this is a function of x) (function of the parameters )
NEERAJ.GUPTA@GLA.AC.IN 69
(for fixed , this is a function of x) (function of the parameters )
NEERAJ.GUPTA@GLA.AC.IN 70
“Batch” Gradient Descent
NEERAJ.GUPTA@GLA.AC.IN 71
THANKS
NEERAJ.GUPTA@GLA.AC.IN 72
MACHINE LEARNING (ML-6)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
NEERAJ.GUPTA@GLA.AC.IN 1
AGENDA
Linear regression with multiple variable
Gradient descent for multiple variables
Gradient descent in practice I: Feature Scaling
SRC : * Andrew NG
NEERAJ.GUPTA@GLA.AC.IN 2
Multiple features (variables).
2104 460
1416 232
1534 315
852 178
… …
NEERAJ.GUPTA@GLA.AC.IN 3
Multiple features (variables).
2104 5 1 45 460
1416 3 2 40 232
1534 3 2 30 315
852 2 1 36 178
… … … … …
NEERAJ.GUPTA@GLA.AC.IN 4
Multiple features (variables).
Size (feet2) Number of Number of Price ($1000)
bedrooms floors Age of home (years)
2104 5 1 45 460
1416 3 2 40 232
1534 3 2 30 315
852 2 1 36 178
… … … … …
Notation:
= number of features
= input (features) of training example.
= value of feature in training example.
NEERAJ.GUPTA@GLA.AC.IN 5
Hypothesis:
Previously:
NEERAJ.GUPTA@GLA.AC.IN 6
For convenience of notation, define .
SRC : * Andrew NG
NEERAJ.GUPTA@GLA.AC.IN 8
Hypothesis:
Parameters:
Cost function:
Gradient descent:
Repeat
(simultaneously update )
NEERAJ.GUPTA@GLA.AC.IN 10
Linear Regression with multiple variables
SRC : * Andrew NG
NEERAJ.GUPTA@GLA.AC.IN 11
Feature Scaling
Idea: Make sure features are on a similar scale.
E.g. = size (0-2000 feet2) size (feet2)
= number of bedrooms (1-5)
number of bedrooms
NEERAJ.GUPTA@GLA.AC.IN 12
Feature Scaling
Get every feature into approximately a
range.
NEERAJ.GUPTA@GLA.AC.IN 13
Mean normalization
Replace with to make features have approximately zero mean
(Do not apply to ).
E.g.
NEERAJ.GUPTA@GLA.AC.IN 14
Linear Regression with multiple variables
SRC : * Andrew NG
NEERAJ.GUPTA@GLA.AC.IN 15
Gradient descent
NEERAJ.GUPTA@GLA.AC.IN 16
Making sure gradient descent is working correctly.
Example automatic
convergence test:
Declare convergence if
decreases by less than
in one iteration.
0 100 200 300 400
No. of iterations
NEERAJ.GUPTA@GLA.AC.IN 17
Making sure gradient descent is working correctly.
Gradient descent not working.
Use smaller .
No. of iterations
To choose , try
NEERAJ.GUPTA@GLA.AC.IN 19
Linear Regression with multiple variables
NEERAJ.GUPTA@GLA.AC.IN 20
Housing prices prediction
NEERAJ.GUPTA@GLA.AC.IN 21
Polynomial regression
Price
(y)
Size (x)
NEERAJ.GUPTA@GLA.AC.IN 22
Choice of features
Price
(y)
Size (x)
NEERAJ.GUPTA@GLA.AC.IN 23
Linear Regression with multiple variables
NORMAL EQUATION
NEERAJ.GUPTA@GLA.AC.IN 24
Gradient Descent
NEERAJ.GUPTA@GLA.AC.IN 25
Intuition: If 1D
(for every )
1 2104 5 1 45 460
1 1416 3 2 40 232
1 1534 3 2 30 315
1 852 2 1 36 178
NEERAJ.GUPTA@GLA.AC.IN 27
Examples:
Size (feet2) Number of Number of Price ($1000)
bedrooms floors Age of home (years)
1 2104 5 1 45 460
1 1416 3 2 40 232
1 1534 3 2 30 315
1 852 2 1 36 178
1 3000 4 1 38 540
NEERAJ.GUPTA@GLA.AC.IN 28
NEERAJ.GUPTA@GLA.AC.IN 29
THANKS
NEERAJ.GUPTA@GLA.AC.IN 30
MACHINE LEARNING (ML-7)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
Logistic regression (Classification)
SRC : * Andrew NG
NEERAJ.GUPTA@GLA.AC.IN 2
Classification
Malignant ?
(No) 0
Tumor Size Tumor Size
Logistic Regression:
Logistic Regression
HYPOTHESIS
REPRESENTATION
Logistic Regression Model
Want
0.5
Sigmoid function 0
Logistic function
Interpretation of Hypothesis Output
= estimated probability that y = 1 on input x
Example: If
z
Suppose predict “ “ if
predict “ “ if
Decision Boundary
x2
3
2
1 2 3 x1
Predict “ “ if
Non-linear decision boundaries
x2
-1 1 x1
-1
Predict “ “ if
x2
x1
Logistic Regression
COST FUNCTION
Training set:
m examples
“non-convex” “convex”
Logistic regression cost function
If y = 1
0 1
Logistic regression cost function
If y = 0
0 1
Logistic Regression
SIMPLIFIED COST FUNCTION AND
GRADIENT DESCENT
Logistic regression cost function
Logistic regression cost function
To fit parameters :
Want :
Repeat
Want :
Repeat
x2 x2
x1 x1
x2
One-vs-all (one-vs-rest):
x1
x2 x2
x1 x1
x2
Class 1:
Class 2:
Class 3:
x1
One-vs-all
Train a logistic regression classifier for each
class to predict the probability that .
NEERAJ.GUPTA@GLA.AC.IN 28
MACHINE LEARNING (ML-8)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
Machine Learning: Training, Testing, Evaluation
SRC : * Andrew NG
NEERAJ.GUPTA@GLA.AC.IN 2
EVALUATING THE HYPOTHESIS
Fail to generalize to new examples not in
training set.
NEERAJ.GUPTA@GLA.AC.IN 3
EVALUATING THE HYPOTHESIS
NEERAJ.GUPTA@GLA.AC.IN 4
TRAINING/TESTING PROCEDURE FOR LINEAR
REGRESSION
Learn parameter from training data (minimizing training error ( ))
NEERAJ.GUPTA@GLA.AC.IN 5
TRAINING/TESTING PROCEDURE FOR LOGISTIC
REGRESSION
Learn parameter from training data (minimizing training error ( ))
Compute test set error:
NEERAJ.GUPTA@GLA.AC.IN 6
PYTHON IMPLEMENTATION
NEERAJ.GUPTA@GLA.AC.IN 7
PYTHON IMPLEMENTATION
NEERAJ.GUPTA@GLA.AC.IN 8
PYTHON IMPLEMENTATION
NEERAJ.GUPTA@GLA.AC.IN 9
EVALUATION METRICS*
Training objective (cost function) is only a proxy for real world objectives.
Metrics are useful and important for evaluation.
Metrics help capture a business goal into a quantitative target.
Helps organize ML team effort towards that target. Generally in the form of
improving that metric on the dev set.
Useful for lower level tasks and debugging (e.g. diagnosing bias vs variance).
NEERAJ.GUPTA@GLA.AC.IN 11
BINARY CLASSIFICATION
x is input
y is binary output (0/1)
Model is ŷ= h(x)
Two types of models
Models that output a categorical class directly (K-nearest neighbor, Decision
tree)
Models that output a real valued score (SVM, Logistic Regression)
Score could be margin (SVM), probability
Need to pick a threshold
We focus on this type (the other type can be interpreted as an instance)
NEERAJ.GUPTA@GLA.AC.IN 12
SCORE BASED MODELS
NEERAJ.GUPTA@GLA.AC.IN 13
THRESHOLD -> CLASSIFIER -> POINT METRICS
NEERAJ.GUPTA@GLA.AC.IN 14
POINT METRICS: CONFUSION MATRIX
NEERAJ.GUPTA@GLA.AC.IN 15
POINT METRICS: TRUE POSITIVES
NEERAJ.GUPTA@GLA.AC.IN 16
POINT METRICS: TRUE NEGATIVES
NEERAJ.GUPTA@GLA.AC.IN 17
POINT METRICS: FALSE POSITIVES
NEERAJ.GUPTA@GLA.AC.IN 18
POINT METRICS: FALSE NEGATIVES
NEERAJ.GUPTA@GLA.AC.IN 19
FP AND FN ALSO CALLED TYPE-1 AND TYPE-2
ERRORS
NEERAJ.GUPTA@GLA.AC.IN 20
POINT METRICS: ACCURACY
NEERAJ.GUPTA@GLA.AC.IN 21
POINT METRICS: PRECISION
NEERAJ.GUPTA@GLA.AC.IN 22
POINT METRICS: POSITIVE RECALL (SENSITIVITY)
NEERAJ.GUPTA@GLA.AC.IN 23
POINT METRICS: NEGATIVE RECALL (SPECIFICITY)
NEERAJ.GUPTA@GLA.AC.IN 24
POINT METRICS: F1-SCORE
NEERAJ.GUPTA@GLA.AC.IN 25
POINT METRICS: CHANGING THRESHOLD
NEERAJ.GUPTA@GLA.AC.IN 26
POINT METRICS: CHANGING THRESHOLD
NEERAJ.GUPTA@GLA.AC.IN 27
SUMMARY METRICS: PRC (RECALL VS. PRECISION)
NEERAJ.GUPTA@GLA.AC.IN 28
ROC (RECEIVER OPERATING CHARACTERISTICS)
•ROC curve is a performance measurement for classification problem at various
thresholds settings.
•ROC is a probability curve and AUC represents degree or measure of separability.
•It tells how much model is capable of distinguishing between classes.
•Higher the AUC, better the model is at predicting 0s as 0s and 1s as 1s.
•By analogy, Higher the AUC, better the model is at distinguishing between patients
with disease and no disease.
NEERAJ.GUPTA@GLA.AC.IN 29
ROC CURVE
•The ROC curve is plotted with TPR against the FPR where TPR is on y-axis and FPR is
on the x-axis.
NEERAJ.GUPTA@GLA.AC.IN 30
DEFINING TERMS USED IN AUC AND ROC CURVE
TPR (True Positive Rate) / Recall /Sensitivity
NEERAJ.GUPTA@GLA.AC.IN 31
HOW TO SPECULATE THE PERFORMANCE OF THE
MODEL?
•An excellent model has AUC near to the 1 which means it has good measure of
separability.
•A poor model has AUC near to the 0 which means it has worst measure of
separability. In fact it means it is reciprocating the result. It is predicting 0s as 1s and
1s as 0s.
•AUC is 0.5, it means model has no class separation capacity whatsoever.
NEERAJ.GUPTA@GLA.AC.IN 32
INTERPRETATION OF ROC CURVE
As we know, ROC is a curve of probability. So lets plot the distributions of those
probabilities:
Red distribution curve is of the positive class (patients with disease) and green
distribution curve is of negative class(patients with no disease)
NEERAJ.GUPTA@GLA.AC.IN 33
INTERPRETATION OF ROC CURVE
NEERAJ.GUPTA@GLA.AC.IN 34
INTERPRETATION OF ROC CURVE
NEERAJ.GUPTA@GLA.AC.IN 35
INTERPRETATION OF ROC CURVE
NEERAJ.GUPTA@GLA.AC.IN 36
COMPARING ROC CURVES
NEERAJ.GUPTA@GLA.AC.IN 37
RELATION BETWEEN SENSITIVITY, SPECIFICITY,
FPR AND THRESHOLD
•Sensitivity and Specificity are inversely proportional to each other. So when we
increase Sensitivity, Specificity decreases and vice versa.
•When we decrease the threshold, we get more positive values thus it increases the
sensitivity and decreasing the specificity.
•Similarly, when we increase the threshold, we get more negative values thus we get
higher specificity and lower sensitivity.
NEERAJ.GUPTA@GLA.AC.IN 38
EXAMPLE: ROC
NEERAJ.GUPTA@GLA.AC.IN 39
THANKS
NEERAJ.GUPTA@GLA.AC.IN 40
MACHINE LEARNING (ML-9)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
NEERAJ.GUPTA@GLA.AC.IN 1
AGENDA
Bias Vs Variance
NEERAJ.GUPTA@GLA.AC.IN 2
NEERAJ.GUPTA@GLA.AC.IN 3
NEERAJ.GUPTA@GLA.AC.IN 4
NEERAJ.GUPTA@GLA.AC.IN 5
NEERAJ.GUPTA@GLA.AC.IN 6
NEERAJ.GUPTA@GLA.AC.IN 7
NEERAJ.GUPTA@GLA.AC.IN 8
NEERAJ.GUPTA@GLA.AC.IN 9
NEERAJ.GUPTA@GLA.AC.IN 10
NEERAJ.GUPTA@GLA.AC.IN 11
NEERAJ.GUPTA@GLA.AC.IN 12
NEERAJ.GUPTA@GLA.AC.IN 13
NEERAJ.GUPTA@GLA.AC.IN 14
NEERAJ.GUPTA@GLA.AC.IN 15
NEERAJ.GUPTA@GLA.AC.IN 16
NEERAJ.GUPTA@GLA.AC.IN 17
NEERAJ.GUPTA@GLA.AC.IN 18
NEERAJ.GUPTA@GLA.AC.IN 19
NEERAJ.GUPTA@GLA.AC.IN 20
NEERAJ.GUPTA@GLA.AC.IN 21
NEERAJ.GUPTA@GLA.AC.IN 22
NEERAJ.GUPTA@GLA.AC.IN 23
NEERAJ.GUPTA@GLA.AC.IN 24
NEERAJ.GUPTA@GLA.AC.IN 25
NEERAJ.GUPTA@GLA.AC.IN 26
NEERAJ.GUPTA@GLA.AC.IN 27
NEERAJ.GUPTA@GLA.AC.IN 28
NEERAJ.GUPTA@GLA.AC.IN 29
NEERAJ.GUPTA@GLA.AC.IN 30
NEERAJ.GUPTA@GLA.AC.IN 31
NEERAJ.GUPTA@GLA.AC.IN 32
NEERAJ.GUPTA@GLA.AC.IN 33
NEERAJ.GUPTA@GLA.AC.IN 34
NEERAJ.GUPTA@GLA.AC.IN 35
NEERAJ.GUPTA@GLA.AC.IN 36
NEERAJ.GUPTA@GLA.AC.IN 37
NEERAJ.GUPTA@GLA.AC.IN 38
NEERAJ.GUPTA@GLA.AC.IN 39
NEERAJ.GUPTA@GLA.AC.IN 40
NEERAJ.GUPTA@GLA.AC.IN 41
NEERAJ.GUPTA@GLA.AC.IN 42
NEERAJ.GUPTA@GLA.AC.IN 43
NEERAJ.GUPTA@GLA.AC.IN 44
NEERAJ.GUPTA@GLA.AC.IN 45
NEERAJ.GUPTA@GLA.AC.IN 46
NEERAJ.GUPTA@GLA.AC.IN 47
TRADE-OFF (BIAS VS VARIANCE)
NEERAJ.GUPTA@GLA.AC.IN 48
THANKS
NEERAJ.GUPTA@GLA.AC.IN 49
MACHINE LEARNING (ML-10)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
K Nearest Neighbor
NEERAJ.GUPTA@GLA.AC.IN 2
K-NEAREST-NEIGHBORS ALGORITHM
K nearest neighbors (KNN) is a simple algorithm that stores all
available cases and classifies new cases based on a similarity
measure (distance function)
If K=1, then the case is simply assigned to the class of its nearest
neighbor
DISTANCE FUNCTION MEASUREMENTS
HAMMING DISTANCE
For category variables, Hamming distance can be used.
K-NEAREST-NEIGHBORS
WHAT IS THE MOST POSSIBLE LABEL FOR C?
c
WHAT IS THE MOST POSSIBLE LABEL FOR C?
Solution: Looking for the nearest K neighbors of c.
Take the majority label as c’s label
Let’s suppose k = 3:
WHAT IS THE MOST POSSIBLE LABEL FOR C?
c
WHAT IS THE MOST POSSIBLE LABEL FOR C?
The 3 nearest points to c are: a, a and o.
Therefore, the most possible label for c is a.
PSEUDO CODE OF KNN
1. Load the data
2. Initialise the value of k
3. For getting the predicted class, iterate from 1 to total number of
training data points
1. Calculate the distance between test data and each row of training data. Here we will
use Euclidean distance as our distance metric since it’s the most popular method.
2. Sort the calculated distances in ascending order based on distance values
3. Get top k rows from the sorted array
4. Get the most frequent class of these rows
5. Return the predicted class
NEERAJ.GUPTA@GLA.AC.IN 12
REMARKS
CHOOSING THE MOST SUITABLE K
NORMALIZATION
NORMALIZATION
NORMALIZATION
NORMALIZATION
K-NEAREST NEIGHBOR CLASSIFICATION (KNN)
Unlike all the previous learning methods, kNN does not build model
from the training data.
To classify a test instance d, define k-neighborhood P as k nearest
neighbors of d
Count number n of training instances in P that belong to class cj
Estimate Pr(cj|d) as n/k
No training is needed. Classification time is linear in training set size for
each test case.
19
DISCUSSIONS
kNN can deal with complex and arbitrary decision boundaries.
Despite its simplicity, researchers have shown that the classification accuracy of kNN
can be quite strong and in many cases as accurate as those elaborated methods.
kNN is slow at the classification time
kNN does not produce an understandable model
20
EXERCISE
NEERAJ.GUPTA@GLA.AC.IN 21
EXERCISE
NEERAJ.GUPTA@GLA.AC.IN 22
EXERCISE
Suppose, you have given the following data where x and y are the 2 input variables
and Class is the dependent variable.
Q1. Suppose, you want to predict the class of new data point x=1 and y=1 using
eucludian distance in 3-NN. In which class this data point belong to?
NEERAJ.GUPTA@GLA.AC.IN 23
EXERCISE
Q2. In the previous question, you are now want use 7-NN instead of 3-KNN which of
the following x=1 and y=1 will belong to?
Q2. In the previous question, you are now want use 5-NN instead of 3-KNN which of
the following x=1 and y=1 will belong to?
NEERAJ.GUPTA@GLA.AC.IN 24
THANKS
NEERAJ.GUPTA@GLA.AC.IN 25
MACHINE LEARNING (ML-11)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oNaïve Bayes Classifier
NEERAJ.GUPTA@GLA.AC.IN 2
WHAT IS NAIVE BAYES ALGORITHM?
• It is a classification technique based on Bayes’ Theorem with an assumption of
independence among predictors.
• A Naive Bayes classifier assumes that the presence of a particular feature in a class
is unrelated to the presence of any other feature.
• Naive Bayes model is easy to build and particularly useful for very large data sets.
• Along with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.
NEERAJ.GUPTA@GLA.AC.IN 3
PREREQUISITES FOR BAYES’ THEOREM
What is an Experiment?
“An experiment is a planned operation carried out under controlled conditions.”
Tossing a coin, rolling a die, and drawing a card out of a well-shuffled pack of cards are all
examples of experiments.
NEERAJ.GUPTA@GLA.AC.IN 4
SAMPLE SPACE
The result of an experiment is called an outcome. The set of all possible outcomes of
an event is called the sample space.
For example, if our experiment is throwing dice and recording its outcome, the sample
space will be:
S1 = {1, 2, 3, 4, 5, 6}
What will be the sample when we’re tossing a coin?
S2 = {H, T}
NEERAJ.GUPTA@GLA.AC.IN 5
EVENT
An event is a set of outcomes (i.e. a subset of the sample space) of an experiment.
Let’s get back to the experiment of rolling a dice and define events E and F as:
E = An even number is obtained = {2, 4, 6}
F = A number greater than 3 is obtained = {4, 5, 6}
The probability of these events:
NEERAJ.GUPTA@GLA.AC.IN 7
RANDOM VARIABLE
Let’s take a simple example (refer to the above image as we go along). Define a
random variable X on the sample space of the experiment of tossing a coin. It takes a
value +1 if “Heads” is obtained and -1 if “Tails” is obtained. Then, X takes on values
+1 and -1 with equal probability of 1/2.
Consider that Y is the observed temperature (in Celsius) of a given place on a given
day. So, we can say that Y is a continuous random variable defined on the same space,
S = [0, 100] (Celsius Scale is defined from zero degree Celsius to 100 degrees
Celsius).
NEERAJ.GUPTA@GLA.AC.IN 8
EXHAUSTIVE EVENTS
A set of events is said to be exhaustive if at least one of the events must occur at
any time. Thus, two events A and B are said to be exhaustive if A ∪ B = S, the
sample space.
For example, let’s say that A is the event that a card drawn out of a pack is red and
B is the event that the card drawn is black. Here, A and B are exhaustive because
the sample space S = {red, black}. Pretty straightforward stuff, right?
NEERAJ.GUPTA@GLA.AC.IN 9
INDEPENDENT EVENTS
If the occurrence of one event does not have any effect on the occurrence of
another, then the two events are said to be independent. Mathematically, two events
A and B are said to be independent if:
NEERAJ.GUPTA@GLA.AC.IN 10
CONDITIONAL PROBABILITY
Consider that we’re drawing a card from a given deck.
What is the probability that it is a black card?
That’s easy – 1/2, right?
However, what if we know it was a black card – then what would be the probability
that it was a king?
This is where the concept of conditional probability comes into play.
Conditional probability is defined as the probability of an event A, given that
another event B has already occurred (i.e. A conditional B). This is represented by
P(A|B) and we can define it as:
P(A|B) = P(A ∩ B) / P(B)
NEERAJ.GUPTA@GLA.AC.IN 11
CONDITIONAL PROBABILITY
Let event A represent picking a king, and event B, picking a black card. Then, we find
P(A|B) using the above formula:
NEERAJ.GUPTA@GLA.AC.IN 12
WHAT IS BAYES’ THEOREM?
NEERAJ.GUPTA@GLA.AC.IN 13
WHAT IS BAYES’ THEOREM?
“Have you ever seen the popular TV show ‘Sherlock’
(or any crime thriller show)? Think about it – our
beliefs about the culprit change throughout the
episode. We process new evidence and refine our
hypothesis at each step.
NEERAJ.GUPTA@GLA.AC.IN 16
AN ILLUSTRATION OF BAYES’ THEOREM
Let’s solve a problem using Bayes’ Theorem. This will help you understand and
visualize where you can apply it.
There are 3 boxes labeled A, B, and C:
Box A contains 2 red and 3 black balls
Box B contains 3 red and 1 black ball
And box C contains 1 red ball and 4 black balls
The three boxes are identical and have an equal probability of getting picked.
Consider that a red ball is chosen. Then what is the probability that this red ball was
picked out of box A?
NEERAJ.GUPTA@GLA.AC.IN 17
CONTD…
We have prior probabilities P(A) = P(B) = P (C) = 1 / 3, since all boxes have equal
probability of getting picked.
NEERAJ.GUPTA@GLA.AC.IN 19
NAIVE BAYES’ CLASSIFIERS
Naive Bayes’ Classifiers are a set of probabilistic classifiers based on
the Bayes’ Theorem. The underlying assumption of these classifiers is
that all the features used for classification are independent of each
other.
That’s where the name ‘naive’ comes in since it is rare that we obtain a
set of totally independent features.
The way these classifiers work is exactly how we solved in the
illustration, just with a lot more features assumed to be independent of
each other.
NEERAJ.GUPTA@GLA.AC.IN 20
NAIVE BAYES’ CLASSIFIERS
Here, we need to find the probability P(Y|X) where X is an n-dimensional random
variable whose component random variables X_1, X_2, …., X_n are independent of
each other:
NEERAJ.GUPTA@GLA.AC.IN 22
CONTD…
NEERAJ.GUPTA@GLA.AC.IN 23
CONTD…
NEERAJ.GUPTA@GLA.AC.IN 24
ADVANTAGES & DISADVANTAGES
Advantages of Naïve Bayes Classifier:
•Naïve Bayes is one of the fast and easy ML algorithms to predict a class of
datasets.
•It can be used for Binary as well as Multi-class Classifications.
•It performs well in Multi-class predictions as compared to the other Algorithms.
•It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
•Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.
NEERAJ.GUPTA@GLA.AC.IN 25
APPLICATIONS OF NAÏVE BAYES CLASSIFIER:
It is used for Credit Scoring.
It is used in medical data classification.
It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
It is used in Text classification such as Spam filtering and Sentiment analysis.
NEERAJ.GUPTA@GLA.AC.IN 26
TYPES OF NAÏVE BAYES MODEL
There are three types of Naive Bayes Model, which are given below:
Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values
are sampled from the Gaussian distribution.
NEERAJ.GUPTA@GLA.AC.IN 27
TYPES OF NAÏVE BAYES MODEL
Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomial
distributed. It is primarily used for document classification problems, it means a particular
document belongs to which category such as Sports, Politics, education, etc.
The classifier uses the frequency of words for the predictors.
NEERAJ.GUPTA@GLA.AC.IN 28
TYPES OF NAÏVE BAYES MODEL
Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the predictor
variables are the independent Booleans variables. Such as if a particular word is present or
not in a document. This model is also famous for document classification tasks.
BernoulliNB implements the naive Bayes training and classification algorithms for data that is
distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features
but each one is assumed to be a binary-valued (Bernoulli, boolean) variable.
NEERAJ.GUPTA@GLA.AC.IN 29
THANKS
NEERAJ.GUPTA@GLA.AC.IN 30
MACHINE LEARNING (ML-12)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oDecision Tree
NEERAJ.GUPTA@GLA.AC.IN 2
DECISION TREE
NEERAJ.GUPTA@GLA.AC.IN 3
DECISION TREE
NEERAJ.GUPTA@GLA.AC.IN 4
ILLUSTRATING CLASSIFICATION TASK
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No Learn
8 No Small 85K Yes Model
9 No Medium 75K No
10 No Small 90K Yes
10
Apply
Tid Attrib1 Attrib2 Attrib3 Class Model
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ?
10
NEERAJ.GUPTA@GLA.AC.IN 5
EXAMPLES OF CLASSIFICATION TASK
Predicting tumor cells as benign or malignant
NEERAJ.GUPTA@GLA.AC.IN 6
EXAMPLE OF A DECISION TREE
Splitting Attributes
Tid Refund Marital Taxable
Status Income Cheat
NEERAJ.GUPTA@GLA.AC.IN 8
DECISION TREE CLASSIFICATION TASK
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No Learn
8 No Small 85K Yes Model
9 No Medium 75K No
10 No Small 90K Yes
10
NEERAJ.GUPTA@GLA.AC.IN 9
APPLY MODEL TO TEST DATA
Test Data
Start from the root of tree. Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married
TaxInc NO
< 80K > 80K
NO YES
NEERAJ.GUPTA@GLA.AC.IN 10
APPLY MODEL TO TEST DATA
Test Data
Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married
TaxInc NO
< 80K > 80K
NO YES
NEERAJ.GUPTA@GLA.AC.IN 11
APPLY MODEL TO TEST DATA Test Data
Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married
TaxInc NO
< 80K > 80K
NO YES
NEERAJ.GUPTA@GLA.AC.IN 12
APPLY MODEL TO TEST DATA Test Data
Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married
TaxInc NO
< 80K > 80K
NO YES
NEERAJ.GUPTA@GLA.AC.IN 13
APPLY MODEL TO TEST DATA Test Data
Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married
TaxInc NO
< 80K > 80K
NO YES
NEERAJ.GUPTA@GLA.AC.IN 14
APPLY MODEL TO TEST DATA
Test Data
Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married Assign Cheat to “No”
TaxInc NO
< 80K > 80K
NO YES
NEERAJ.GUPTA@GLA.AC.IN 15
DECISION TREE CLASSIFICATION TASK
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No Learn
8 No Small 85K Yes Model
9 No Medium 75K No
10 No Small 90K Yes
10
NEERAJ.GUPTA@GLA.AC.IN 16
DECISION TREE INDUCTION
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
NEERAJ.GUPTA@GLA.AC.IN 17
TREE INDUCTION
Greedy strategy.
Split the records based on an attribute test that optimizes certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
NEERAJ.GUPTA@GLA.AC.IN 18
TREE INDUCTION
Greedy strategy.
Split the records based on an attribute test that optimizes certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
NEERAJ.GUPTA@GLA.AC.IN 19
HOW TO SPECIFY TEST CONDITION?
Depends on attribute types
Nominal
Ordinal
Continuous
NEERAJ.GUPTA@GLA.AC.IN 20
SPLITTING BASED ON NOMINAL ATTRIBUTES
Multi-way split: Use as many partitions as distinct values.
CarType
Family Luxury
Sports
CarType OR CarType
{Sports, {Family,
Luxury} {Family} Luxury} {Sports}
NEERAJ.GUPTA@GLA.AC.IN 21
SPLITTING BASED ON ORDINAL ATTRIBUTES
Multi-way split: Use as many partitions as distinct values.
Size
Small Large
Medium
Size Size
{Small,
{Large}
OR {Medium,
{Small}
Medium} Large}
Size
{Small,
Large} {Medium}
NEERAJ.GUPTA@GLA.AC.IN 22
SPLITTING BASED ON CONTINUOUS ATTRIBUTES
NEERAJ.GUPTA@GLA.AC.IN 23
SPLITTING BASED ON CONTINUOUS ATTRIBUTES
NEERAJ.GUPTA@GLA.AC.IN 24
TREE INDUCTION
Greedy strategy.
Split the records based on an attribute test that optimizes certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
NEERAJ.GUPTA@GLA.AC.IN 25
HOW TO DETERMINE THE BEST SPLIT
Before Splitting: 10 records of class 0,
10 records of class 1
NEERAJ.GUPTA@GLA.AC.IN 26
HOW TO DETERMINE THE BEST SPLIT
Greedy approach:
Nodes with homogeneous class distribution are preferred
Need a measure of node impurity:
Non-homogeneous, Homogeneous,
High degree of impurity Low degree of impurity
NEERAJ.GUPTA@GLA.AC.IN 27
MEASURES OF NODE IMPURITY
Entropy
Gini Index
Misclassification error
NEERAJ.GUPTA@GLA.AC.IN 28
ENTROPY
NEERAJ.GUPTA@GLA.AC.IN 29
ENTROPY
https://en.wikipedia.org/wiki/Entropy_(information_theory)
NEERAJ.GUPTA@GLA.AC.IN 30
ENTROPY
NEERAJ.GUPTA@GLA.AC.IN 31
ENTROPY
yes
no
NEERAJ.GUPTA@GLA.AC.IN 32
HOW TO FIND THE BEST SPLIT
Before Splitting: C0 N00 M0
C1 N01
A? B?
Yes No Yes No
M1 M2 M3 M4
M12 M34
Gain = M0 – M12 vs M0 – M34 NEERAJ.GUPTA@GLA.AC.IN 33
MEASURE OF IMPURITY: GINI
Gini Index for a given node t :
GINI (t ) 1 [ p ( j | t )]2
j
Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least
interesting information
Minimum (0.0) when all records belong to one class, implying most interesting information
C1 0 C1 1 C1 2 C1 3
C2 6 C2 5 C2 4 C2 3
Gini=0.000 Gini=0.278 Gini=0.444 Gini=0.500
NEERAJ.GUPTA@GLA.AC.IN 34
EXAMPLES FOR COMPUTING GINI
GINI (t ) 1 [ p ( j | t )]2
j
NEERAJ.GUPTA@GLA.AC.IN 35
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 36
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 37
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 38
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 39
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 40
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 41
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 42
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 43
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 44
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 45
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 46
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 47
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 48
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 49
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 50
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 51
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 52
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 53
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 54
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 55
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 56
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 57
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 58
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 59
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 60
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 61
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 62
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 63
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 64
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 65
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 66
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 67
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 68
THANKS
NEERAJ.GUPTA@GLA.AC.IN 69
MACHINE LEARNING (ML-13)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oData preprocessing
NEERAJ.GUPTA@GLA.AC.IN 2
What is Data?
a collection of number assigned as value to
quantitative variable and/ or characters
assigned as value to qualitative variables, or
collection of records and their attributes
Ordinal
Used to rank order cases
Example: ranking (eg. movie on scale of 1-10), height (tall, medium, short), grades
Interval
Example: Calendar dates, longitude, latitude
Ratio
Same as interval variable but they have a “true zero”
Example: time, length, population, age
4
Properties of Attribute values
The type of an attributes depends on which of the following properties it
possess:
Distinctness: = ≠
Order: < >
Addition: + -
Multiplication: * /
Nominal: Distinctness
Ordinal: Distinctness, Order
5
Discrete and Continuous Attributes
Discrete Attribute
Has only a finite or countable infinite set of values
Examples: zip codes, counts, or the set of words in a collection of
documents
Often represented as integer variables.
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a finite number of digits.
6
Type of data sets
Record Data
Data Matrix
Transaction data
Graph Data
World wide web
Molecular structure
Ordered
Spatial data
Temporal data
Sequential data
7
Record Data
Data that consists of a collection of records, each of which consists of fixed set
of attributes
8
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data
objects can be thought of as points in a multidimensional space, where each
dimension represents a distinct attribute
Such data set can be represented by an m by n matrix, where there are m rows,
one for each object, and n columns, one for each attribute
9
Data Matrix Example for Documents
Each document becomes a `term' vector,
each term is a component (attribute) of the vector,
the value of each component is the number of times the corresponding term occurs in the
document.
10
Transaction Data
A typical type of record data, then
Each record (transaction) involves a set of items
11
Graph data
Example: Facebook graph and HTML links
12
Ordered data
Genetic sequence data
13
Data Quality
What kind of data quality problems?
How can we detect the problem with the data?
Missing values
Noise and outliers
Duplicate data
14
Data Quality: Missing Values
Reasons for missing values
Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
15
Data Quality: Noise
Noise refers to modification of original values
Examples: distortion of a person’s voice when talking on
16
Data Quality: Outliers
Outliers are data objects with characteristics that are considerably different
than most of the other data objects in the data set
17
Data Quality: Duplicate Data
Data set may include data objects that are duplicates, or almost duplicates of one
another
Major issue when merging data from heterogenous sources
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
18
Data Preprocessing
Imputation
Outlier management
Feature selection
19
Imputation (filling in) of missing data
Imputation is performed using a number of different algorithms, which can be
subdivided into single and multiple imputation methods.
Multiple-imputation methods
several likelihood- ordered choices for imputing the missing value are computed and one
“best” value is selected.
20
Imputation Contd…
Single imputation
Mean imputation
Hot deck imputation
Multiple imputation
21
Single imputation Contd…
Mean imputation
Mean imputation, also called unconditional mean
imputation, is a widely used imputation method
Mean imputation assumes that the mean of a
variable is the best estimate for any
case that has missing information on this variable
For continuous variable, each missing value is
imputed with the mean of known values for the
same variable
For categorical variable, the missing values of
are the mode of the observed values of same
variable
22
Advantages
fast,
simple,
Limitations
underestimation of the population variance
thus a small standard error
23
Single imputation Contd…
24
Advantages
preserves the population distribution
it is better than mean imputation
Limitations
distort correlations and covariances
25
Other type of Single imputation
Regression imputation
Cold-deck imputation
Sequential imputation
26
Multiple imputation
The idea of Multiple Imputation is to replace each missing value with multiple
acceptable values that represent a distribution of possibilities.
This results in a number of complete datasets (usually 3-10):
27
Outlier management
Outlier: A data object that deviates significantly from the normal objects as if
it were generated by a different mechanism
Ex.: Unusual credit card purchase
Outliers are different from the noise data
Noise is random error or variance in a measured variable
28
Types of Outliers
Three kinds:
Global,
Contextual
Collective
29
Types of Outliers Contd…
Contextual outlier (or conditional outlier)
Object is O if it deviates significantly based on a selected context
c
o
Ex. 48 C in Mathura: outlier? (depending on summer or winter?)
Attributes of data objects should be divided into two groups to detect O
c
Contextual attributes: defines the context, e.g., time & location
30
Types of Outliers Contd…
Collective Outliers
A subset of data objects collectively deviate significantly from the
whole data set, even if the individual data objects may not be
outliers
Applications: E.g., intrusion detection:
When a number of computers keep sending denial-of-service
packages to each other
Detection of collective outliers
Consider not only behavior of individual objects, but also that
of groups of objects
Need to have the background knowledge on the relationship
Unsupervised, and
Semi-supervised methods
proximity-based, and
clustering-based methods
32
Statistical Methods
Statistical methods (also known as model -based methods) assume that the normal data follow
some statistical model
The data not following the model are outliers.
Methods are divided into two categories: parametric vs. non-parametric
Parametric method
Assumes that the normal data is generated by a parametric distribution with parameter θ
The probability density function of the parametric distribution f(x, θ) gives the probability
Non-parametric method
Not assume an a-priori statistical model and determine the model from the input data
Not completely parameter free but consider the number and nature of the parameters are
Often assume that data are generated from a normal distribution, learn the parameters from the
input data, and identify the points with low probability as outliers
Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
Use the maximum likelihood method to estimate μ and σ
So, 24 is an outlier 34
Statistical Methods – Box Plot
Values less than Q1-1.5*IQR and greater than Q3+1.5*IQR are outliers
Consider the following dataset:
10.2, 14.1, 14.4. 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9,
15.1, 15.9, 16.4
Here,
Q2(median) = 14.6
Q1 = 14.4
Q3 = 14.9
IQR = Q3 – Q1 = 14.9 - 14.4 = 0.5
Outliers will be any points:
below Q1 – 1.5×IQR = 14.4 – 0.75 = 13.65 or
above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65
So, the outliers are at 10.2, 15.9, and 16.4. 35
Non-Parametric Methods: Detection Using Histogram
36
Proximity-Based Methods
An object is an outlier if the nearest neighbors of the object are far away, i.e., the
proximity of the object significantly deviates from the proximity of most of the
other objects in the same data set
Two types of proximity-based outlier detection methods
Distance-based outlier detection: An object o is an outlier if its neighborhood does not have
enough other points
Density-based outlier detection: An object o is an outlier if its density is relatively much
lower than that of its neighbors
37
Clustering-Based Outlier Detection
An object is an outlier if
it does not belong to any cluster,
One-Hot Encoding
39
Label Encoding
Label encoding assigns each unique value to a different integer.
This approach assumes an ordering of the categories:
Eg: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3)
This assumption makes sense in this example, because there is an indisputable ranking to the
categories. Not all categorical variables have a clear ordering in the values, but we refer to
those that do as ordinal variables.
For tree-based models (like decision trees and random forests), you can expect label
41
One-Hot Encoding Contd…
In contrast to label encoding, one-hot encoding does not assume an ordering of
the categories.
This approach to work particularly well if there is no clear ordering in the
categorical data (e.g., "Red" is neither more nor less than "Yellow").
We refer to categorical variables without an intrinsic ranking as nominal
variables.
One-hot encoding generally does not perform well if the categorical variable
takes on a large number of values (i.e., you generally won't use it for variables
taking more than 15 different values).
42
Feature selection
Features contain information about the target
Naïve view:
More feature
=>More information
=>More discrimination power
In practice:
Many reasons why this is not the case !
43
Feature selection Contd…
Curse of dimensionality
44
Feature selection Contd…
Data may contain many irrelevant and redundant variables (features) and often
comparably few (limited) training examples
Feature selection
A procedure in machine learning to find a subset of features that produces
‘better’ model for given dataset
Avoid overfitting and achieve better generalization ability
Reduce the storage requirement and training time
Interpretability
45
Feature Selection
Given a set of N features, the goal of feature selection is to select a subset of K
features (K << N) in order to minimize the classification error.
46
Feature Selection vs Feature Extraction
Feature Selection
New features represent a subset of the original features.
When classifying novel patterns, only a small number of features need to be computed (i.e.,
faster classification).
Feature Extraction
Projection to M<N dimension
New features are combinations (linear for PCA/LDA) of the original features (difficult to
interpret).
When classifying novel patterns, all features need to be computed.
47
Feature Selection: Main Steps
Feature selection is an optimization
problem.
48
Feature Selection: Main Steps (cont’d)
Search methods
Exhaustive
Heuristic
Randomized
Evaluation methods
Filter (Unsupervised)
Look at input only
Select the subset that has the most information
Wrapper (Supervised)
Train using selected subset
Estimate error on validation dataset
49
Search Methods
Assuming n features, an exhaustive search would require examining possible
subsets of size d.
In practice, heuristics are used to speed-up search but they cannot guarantee
optimality.
50
Evaluation Methods
Filter
Evaluation is independent of the classification
algorithm.
“goodness”
The objective function is based on the
information).
51
Evaluation Methods (cont’d)
Wrapper
Evaluation uses criteria related to
the classification algorithm.
“goodness”
The objective function is based
52
Filter vs Wrapper Methods (cont’d)
Filter Methods
Advantages
Much faster than wrapper methods since the objective function has lower computational
requirements.
The optimum feature set might work well with various classifiers as it is not tied to a
specific classifier.
Disadvantages
Achieve lower recognition accuracy compared to wrapper methods.
Have a tendency to select more features compared to wrapper methods.
53
Filter vs Wrapper Methods
Wrapper Methods
Advantages
Achieve higher recognition accuracy compared to filter methods since they use the classifier
itself in choosing the optimum set of features.
Disadvantages
Much slower compared to filter methods since the classifier must be trained and tested for
each candidate feature subset.
The optimum feature subset might not work well for other classifiers.
54
Forward selection
(heuristic search)
Estimate the classification/ regression error for adding each specific feature
55
Backward selection (BS)
(heuristic search)
56
THANKS
NEERAJ.GUPTA@GLA.AC.IN 57
MACHINE LEARNING (ML-14)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oMachine Learning
oTypes of Machine Learning
oUnsupervised Learning : Clustering
oUnsupervised Example
oApplications of Clustering
oK-Means Algorithm
oK-Means Examples
NEERAJ.GUPTA@GLA.AC.IN 2
WHEN DO WE USE MACHINE LEARNING?
ML is used when:
• Human expertise does not exist (navigating on Mars)
NEERAJ.GUPTA@GLA.AC.IN 7
SUPERVISED VS UNSUPERVISED LEARNING
Supervised learning: discover patterns in the data that relate data attributes
with a target (class) attribute.
Patterns are then utilized to predict the values of the target attribute in future data instances.
NEERAJ.GUPTA@GLA.AC.IN 8
UNSUPERVISED EXAMPLE
A bank wants to give credit card offers to its customers. Currently, they look at the details of
each customer and based on this information, decide which offer should be given to which
customer.
NEERAJ.GUPTA@GLA.AC.IN 10
APPLICATIONS OF CLUSTERING IN REAL-WORLD
Image Segmentation
Recommendation Engines
NEERAJ.GUPTA@GLA.AC.IN 12
K-MEANS CLUSTERING
NEERAJ.GUPTA@GLA.AC.IN 13
K-MEANS CLUSTERING
NEERAJ.GUPTA@GLA.AC.IN 14
K-MEANS CLUSTERING
NEERAJ.GUPTA@GLA.AC.IN 15
K-MEANS CLUSTERING
NEERAJ.GUPTA@GLA.AC.IN 16
K-MEANS CLUSTERING
NEERAJ.GUPTA@GLA.AC.IN 17
K-MEANS CLUSTERING
NEERAJ.GUPTA@GLA.AC.IN 18
K-MEANS ALGORITHM
NEERAJ.GUPTA@GLA.AC.IN 19
K-MEANS ALGORITHM
NEERAJ.GUPTA@GLA.AC.IN 20
EXAMPLE
Divide the given sample data in two (2) clusters using K-Means algorithm using
Euclidean Distance.
Sno. Height(H) Weight
(W)
1 185 72
2 170 56
3 168 60
4 179 68
5 182 72
6 188 77
7 180 71
8 180 70
9 183 84
10 180 88
11 180 67
12 177 76 NEERAJ.GUPTA@GLA.AC.IN 21
K-MEANS FOR NON-SEPARATED CLUSTERS
NEERAJ.GUPTA@GLA.AC.IN 22
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 23
SOLUTION
NEERAJ.GUPTA@GLA.AC.IN 24
SOLUTION
NEERAJ.GUPTA@GLA.AC.IN 25
SOLUTION
NEERAJ.GUPTA@GLA.AC.IN 26
SOLUTION
NEERAJ.GUPTA@GLA.AC.IN 27
SOLUTION
NEERAJ.GUPTA@GLA.AC.IN 28
SOLUTION
NEERAJ.GUPTA@GLA.AC.IN 29
SOLUTION
NEERAJ.GUPTA@GLA.AC.IN 30
SOLUTION
NEERAJ.GUPTA@GLA.AC.IN 31
SOLUTION
NEERAJ.GUPTA@GLA.AC.IN 32
SOLUTION
NEERAJ.GUPTA@GLA.AC.IN 33
VISUALIZATION OF SOLUTION
NEERAJ.GUPTA@GLA.AC.IN 34
VISUALIZATION OF SOLUTION
NEERAJ.GUPTA@GLA.AC.IN 35
HOW TO CHOOSE K: ELBOW METHOD
It is based on the sum of squared distance (SSE) between data points and their
assigned clusters’ centroids.
NEERAJ.GUPTA@GLA.AC.IN 36
DRAWBACKS
K-means algorithm is good in capturing structure of the data if clusters have
a spherical-like shape.
It always try to construct a nice spherical shape around the centroid.
The clusters which have a complicated geometric shapes, k-means does a
poor job in clustering the data.
NEERAJ.GUPTA@GLA.AC.IN 37
CLUSTERING QUALITY
Ideal clustering is characterized by minimal intra cluster distance and maximal
inter cluster distance.
There are majorly two types of measures to assess the clustering performance.
(i) Extrinsic Measures which require ground truth labels.
Rand index
(ii) Intrinsic Measures that does not require ground truth labels.
Silhouette Coefficient
NEERAJ.GUPTA@GLA.AC.IN 38
BASIC CLUSTERING METHODS
1. Partitioning methods
k-means
2. Hierarchical methods
Agglomerative (bottom-up) or divisive (top-down)
3. Density-based methods
4. Grid-based methods
NEERAJ.GUPTA@GLA.AC.IN 39
OVERVIEW OF CLUSTERING METHODS
NEERAJ.GUPTA@GLA.AC.IN 40
HIERARCHICAL METHODS
A hierarchical clustering method works by grouping data objects into a hierarchy or
“tree” of clusters.
Representing data objects in the form of a hierarchy is useful for data summarization
and visualization.
Agglomerative methods start with individual objects as clusters, which are iteratively
merged to form larger clusters.
Divisive methods initially let all the given objects form one cluster, which they
iteratively split into smaller clusters.
Hierarchical clustering methods can encounter difficulties regarding the selection of
merge or split points.
NEERAJ.GUPTA@GLA.AC.IN 41
AGGLOMERATIVE VERSUS DIVISIVE HIERARCHICAL
CLUSTERING
A hierarchical clustering method can be either agglomerative or divisive, depending
on whether the hierarchical decomposition is formed in a bottom-up (merging) or top
down (splitting) fashion.
NEERAJ.GUPTA@GLA.AC.IN 43
AGGLOMERATIVE HIERARCHICAL CLUSTERING
We assign each point to an individual cluster in this technique.
Suppose there are 4 data points. We will assign each of these points to a cluster and
hence will have 4 clusters in the beginning:
NEERAJ.GUPTA@GLA.AC.IN 44
AGGLOMERATIVE HIERARCHICAL CLUSTERING
Then, at each iteration, we merge the closest pair of clusters and repeat this step until only a single cluster
is left:
NEERAJ.GUPTA@GLA.AC.IN 45
DIVISIVE HIERARCHICAL CLUSTERING
Divisive hierarchical clustering works in the opposite way.
Instead of starting with n clusters (in case of n observations), we start with a single
cluster and assign all the points to that cluster.
So, it doesn’t matter if we have 10 or 1000 data points. All these points will belong
to the same cluster at the beginning:
NEERAJ.GUPTA@GLA.AC.IN 46
DIVISIVE HIERARCHICAL CLUSTERING
Now, at each iteration, we split the farthest point in the cluster and repeat this
process until each cluster only contains a single point:
We are splitting (or dividing) the clusters at each step, hence the name divisive
hierarchical clustering.
NEERAJ.GUPTA@GLA.AC.IN 47
STEPS TO PERFORM HIERARCHICAL CLUSTERING
We merge the most similar points or clusters in hierarchical clustering – we know this. Now the
question is –
?
Distance-based algorithm
NEERAJ.GUPTA@GLA.AC.IN 48
STEPS TO PERFORM HIERARCHICAL CLUSTERING
In hierarchical clustering, we have a concept called a proximity matrix. This stores
the distances between each point.
Suppose a teacher wants to divide her students into different groups. She has the
marks scored by each student in an assignment and based on these marks, she wants
to segment them into groups. There’s no fixed target here as to how many groups to
have. Since the teacher does not know what type of students should be assigned to
which group, it cannot be solved as a supervised learning problem. So, we will try to
apply hierarchical clustering here and segment the students into different groups.
Let’s take a sample of 5 students:
NEERAJ.GUPTA@GLA.AC.IN 49
CREATING A PROXIMITY MATRIX
Let’s make the 5 x 5 proximity matrix for our example:
NEERAJ.GUPTA@GLA.AC.IN 50
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Step 1: First, we assign all the points to an individual cluster
Step 2: Next, we will look at the smallest distance in the proximity matrix and merge
the points with the smallest distance. We then update the proximity matrix:
NEERAJ.GUPTA@GLA.AC.IN 51
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Here, the smallest distance is 3 and hence we will merge point 1 and 2.
NEERAJ.GUPTA@GLA.AC.IN 52
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Let’s look at the updated clusters and accordingly update the proximity matrix:
NEERAJ.GUPTA@GLA.AC.IN 53
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Now, we will again calculate the proximity matrix for these clusters:
NEERAJ.GUPTA@GLA.AC.IN 54
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Step 3: We will repeat step 2 until only a single cluster is left.
NEERAJ.GUPTA@GLA.AC.IN 56
HOW SHOULD WE CHOOSE THE NUMBER OF
CLUSTERS IN HIERARCHICAL CLUSTERING?
Whenever two clusters are merged, we will join them in this dendrogram and the
height of the join will be the distance between these points.
NEERAJ.GUPTA@GLA.AC.IN 57
HOW SHOULD WE CHOOSE THE NUMBER OF
CLUSTERS IN HIERARCHICAL CLUSTERING?
NEERAJ.GUPTA@GLA.AC.IN 58
HOW SHOULD WE CHOOSE THE NUMBER OF
CLUSTERS IN HIERARCHICAL CLUSTERING?
More the distance of the vertical lines in the dendrogram, more the distance
between those clusters. Now, we can set a threshold distance and draw a horizontal
line (Generally, we try to set the threshold in such a way that it cuts the tallest vertical
line)
NEERAJ.GUPTA@GLA.AC.IN 59
THANKS
NEERAJ.GUPTA@GLA.AC.IN 60
MACHINE LEARNING (ML-15)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oSupport Vector Machine
oTypes of SVM
oHyperplane
oSupport Vectors
oLinear SVM Mathematically
oExamples
oPros and Cons
NEERAJ.GUPTA@GLA.AC.IN 2
INTRODUCTION
•Support Vector Machine abbreviated as SVM
NEERAJ.GUPTA@GLA.AC.IN 3
INTRODUCTION
•SVM algorithm can be used for Face detection, image
classification, text categorization, etc.
NEERAJ.GUPTA@GLA.AC.IN 4
WHAT IS SUPPORT VECTOR MACHINE?
•A support vector machine is a machine learning model that is able to
generalize between two different classes if the set of labelled data is
provided in the training set to the algorithm.
NEERAJ.GUPTA@GLA.AC.IN 5
TYPES OF SVM
1. Linear SVM:
•Linear SVM is used for linearly separable data.
•It means if a dataset can be classified into two classes by using a single
straight line, then such data is termed as linearly separable data.
•The classifier is used called as Linear SVM classifier.
NEERAJ.GUPTA@GLA.AC.IN 6
TYPES OF SVM
2. Non-linear SVM:
•Non-Linear SVM is used for non-linearly separated data.
•It means if a dataset cannot be classified by using a straight line, then such
data is termed as non-linear data.
•The classifier used is called as Non-linear SVM classifier.
NEERAJ.GUPTA@GLA.AC.IN 7
SUPPORT VECTOR MACHINE
NEERAJ.GUPTA@GLA.AC.IN 8
SUPPORT VECTOR MACHINE
NEERAJ.GUPTA@GLA.AC.IN 9
SUPPORT VECTOR MACHINE
•To separate the two classes of data points, there are many
possible hyperplanes that could be chosen.
•The objective is to find a plane that has the maximum margin, i.e
the maximum distance between data points of both classes.
NEERAJ.GUPTA@GLA.AC.IN 11
HYPERPLANES
NEERAJ.GUPTA@GLA.AC.IN 12
HYPERPLANES
•Identify the right hyper-plane (Scenario-1):
NEERAJ.GUPTA@GLA.AC.IN 13
HYPERPLANES
Identify the right hyper-plane (Scenario-2):
NEERAJ.GUPTA@GLA.AC.IN 14
HYPERPLANES
Identify the right hyper-plane (Scenario-3):
NEERAJ.GUPTA@GLA.AC.IN 15
HYPERPLANES
Can we classify two classes (Scenario-4)?
NEERAJ.GUPTA@GLA.AC.IN 16
HYPERPLANES
Find the hyper-plane to segregate to classes (Scenario-5):
NEERAJ.GUPTA@GLA.AC.IN 17
HYPERPLANES
Find the hyper-plane to segregate to classes (Scenario-5):In the scenario below, we
can’t have linear hyper-plane between the two classes, so how does SVM classify
these two classes?
NEERAJ.GUPTA@GLA.AC.IN 18
HYPERPLANES
The SVM algorithm has a technique called the kernel trick. The SVM kernel is
a function that takes low dimensional input space and transforms it to a higher
dimensional space i.e. it converts not separable problem to separable
problem.
NEERAJ.GUPTA@GLA.AC.IN 19
SUPPORT VECTORS
•Support vectors are data points that are closer to the hyperplane and influence
the position and orientation of the hyperplane.
•Deleting the support vectors will change the position of the hyperplane.
NEERAJ.GUPTA@GLA.AC.IN 20
SUPPORT VECTORS
NEERAJ.GUPTA@GLA.AC.IN 21
Sec. 15.1
22
Sec. 15.1
GEOMETRIC MARGIN
wT x b
Distance from example to the separator is ry
w
Examples closest to the hyperplane are support vectors.
Margin ρ of the separator is the width of separation between support vectors of
classes.
x ρ
r
x′
w
23
Sec. 15.1
wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ −1 if yi = −1
24
Sec. 15.1
ρ wTxa + b = 1
Hyperplane
wT x + b = 0 wTxb + b = -1
This implies:
wT x + b = 0
wT(xa–xb) = 2
ρ = ||xa–xb||2 = 2/||w||2
25
Sec. 15.1
26
SVM EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 27
SVM EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 28
SVM EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 29
SVM EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 30
SVM EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 31
SVM EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 32
SVM EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 33
EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 34
SVM EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 35
SVM EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 36
SVM EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 37
SVM EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 38
PROS AND CONS ASSOCIATED WITH SVM
Pros:
•It works really well with a clear margin of separation
•It is effective in high dimensional spaces.
•It is effective in cases where the number of dimensions is greater than the
number of samples.
•It uses a subset of training points in the decision function (called support
vectors), so it is also memory efficient.
NEERAJ.GUPTA@GLA.AC.IN 39
PROS AND CONS ASSOCIATED WITH SVM
Cons:
•It doesn’t perform well when we have large data set because the required training
time is higher.
•It doesn’t perform very well, when the data set has more noise i.e. target classes are
overlapping.
NEERAJ.GUPTA@GLA.AC.IN 40
THANKS
NEERAJ.GUPTA@GLA.AC.IN 41
MACHINE LEARNING (ML-15)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oPrincipal Component Ananlysis (PCA)
NEERAJ.GUPTA@GLA.AC.IN 2
MOTIVATION
Clustering
One way to summarize a complex real-valued data point with a
single categorical variable
Dimensionality reduction
Another way to simplify complex high-dimensional data
Summarize data with a lower dimensional real valued vector
MOTIVATION
Clustering
One way to summarize a complex real-valued data point with a single
categorical variable
Dimensionality reduction
Another way to simplify complex high-dimensional data
Summarize data with a lower dimentional real valued vector
• Given data points in d dimensions
• Convert them to data points in r<d dimensions
• With minimal loss of information
NEED FOR PCA
High dimension data is extremely complex to process due to inconsistencies in the feature which increase
the computation time.
NEERAJ.GUPTA@GLA.AC.IN 5
NEED FOR PCA
NEERAJ.GUPTA@GLA.AC.IN 6
Data Compression
(cm)
Andrew Ng
Data Compression
Reduce data from
2D to 1D
(inches)
(cm)
Andrew Ng
Data Compression
Reduce data from 3D to 2D
Andrew Ng
Principal Component Analysis (PCA) problem formulation
Andrew Ng
STEP BY STEP PCA
NEERAJ.GUPTA@GLA.AC.IN 11
NEERAJ.GUPTA@GLA.AC.IN 12
COVARIANCE
Variance and Covariance:
Measure of the “spread” of a set of points around their center of mass(mean)
Variance:
Measure of the deviation from the mean for points in one dimension
Covariance:
Measure of how much each of the dimensions vary from the mean with respect to
each other
Example
EIGENVECTOR AND EIGENVALUE
Ax - λx = 0
Ax = λx
(A – λI)x = 0
NEERAJ.GUPTA@GLA.AC.IN 25
PRACTICE PROBLEMS BASED ON PCA
Step-01:
Get data.
The given feature vectors are-
x1 = (2, 1)
x2 = (3, 5)
x3 = (4, 3)
x4 = (5, 6)
x5 = (6, 7)
x6 = (7, 8)
NEERAJ.GUPTA@GLA.AC.IN 26
PRACTICE PROBLEMS BASED ON PCA
Step-02:
Calculate the mean vector (µ).
NEERAJ.GUPTA@GLA.AC.IN 27
PRACTICE PROBLEMS BASED ON PCA
Step-03:
Subtract mean vector (µ) from the given feature vectors.
x1 – µ = (2 – 4.5, 1 – 5) = (-2.5, -4)
x2 – µ = (3 – 4.5, 5 – 5) = (-1.5, 0)
x3 – µ = (4 – 4.5, 3 – 5) = (-0.5, -2)
x4 – µ = (5 – 4.5, 6 – 5) = (0.5, 1)
x5 – µ = (6 – 4.5, 7 – 5) = (1.5, 2)
x6 – µ = (7 – 4.5, 8 – 5) = (2.5, 3)
Feature vectors (xi) after subtracting mean vector (µ) are-
NEERAJ.GUPTA@GLA.AC.IN 28
PRACTICE PROBLEMS BASED ON PCA
Step-04:
Calculate the covariance matrix.
Covariance matrix is given by-
NEERAJ.GUPTA@GLA.AC.IN 29
PRACTICE PROBLEMS BASED ON PCA
NEERAJ.GUPTA@GLA.AC.IN 30
PRACTICE PROBLEMS BASED ON PCA
Now,
Covariance matrix
= (m1 + m2 + m3 + m4 + m5 + m6) / 6
NEERAJ.GUPTA@GLA.AC.IN 31
PRACTICE PROBLEMS BASED ON PCA
Step-05:
Calculate the eigen values and eigen vectors of the covariance matrix.
λ is an eigen value for a matrix M if it is a solution of the characteristic equation |M – λI|
= 0.
So, we have-
NEERAJ.GUPTA@GLA.AC.IN 32
PRACTICE PROBLEMS BASED ON PCA
From here,
(2.92 – λ)(5.67 – λ) – (3.67 x 3.67) = 0
16.56 – 2.92λ – 5.67λ + λ2 – 13.47 = 0
λ2 – 8.59λ + 3.09 = 0
NEERAJ.GUPTA@GLA.AC.IN 34
PRACTICE PROBLEMS BASED ON PCA
Solving these, we get-
2.92X1 + 3.67X2 = 8.22X1
3.67X1 + 5.67X2 = 8.22X2
On simplification, we get-
5.3X1 = 3.67X2 ………(1)
3.67X1 = 2.55X2 ………(2)
From (1) and (2), X1 = 0.69X2
From (2), the eigen vector is-
NEERAJ.GUPTA@GLA.AC.IN 35
PRACTICE PROBLEMS BASED ON PCA
Thus, principal component for the given data set is-
NEERAJ.GUPTA@GLA.AC.IN 36
PRACTICE PROBLEMS BASED ON PCA
Lastly, we project the data points onto the new subspace as-
NEERAJ.GUPTA@GLA.AC.IN 37
PRACTICE PROBLEMS BASED ON PCA
Use PCA Algorithm to transform the pattern (2, 1) onto the eigen vector in the
previous question.
The given feature vector is (2, 1).
NEERAJ.GUPTA@GLA.AC.IN 38
PRACTICE PROBLEMS BASED ON PCA
The feature vector gets transformed to
= Transpose of Eigen vector x (Feature Vector – Mean Vector)
NEERAJ.GUPTA@GLA.AC.IN 39
SIFT feature visualization
• The top three principal components of SIFT descriptors from a set of images are computed
• Map these principal components to the principal components of the RGB space
• pixels with similar colors share similar structures
Application: Image compression
ORIGINAL IMAGE
2 2 2 2
4 4 4 4
6 6 6 6
8 8 8 8
10 10 10 10
12 12 12 12
2 4 6 8 10 12 2 4 6 8 10 12 2 4 6 8 10 12 2 4 6 8 10 12
2 2 2 2
4 4 4 4
6 6 6 6
8 8 8 8
10 10 10 10
12 12 12 12
2 4 6 8 10 12 2 4 6 8 10 12 2 4 6 8 10 12 2 4 6 8 10 12
2 2 2 2
4 4 4 4
6 6 6 6
8 8 8 8
10 10 10 10
12 12 12 12
2 4 6 8 10 12 2 4 6 8 10 12 2 4 6 8 10 12 2 4 6 8 10 12
PCA COMPRESSION: 144D ) 6D
6 MOST IMPORTANT EIGENVECTORS
2 2 2
4 4 4
6 6 6
8 8 8
10 10 10
12 12 12
2 4 6 8 10 12 2 4 6 8 10 12 2 4 6 8 10 12
2 2 2
4 4 4
6 6 6
8 8 8
10 10 10
12 12 12
2 4 6 8 10 12 2 4 6 8 10 12 2 4 6 8 10 12
PCA COMPRESSION: 144D 3D
3 most important eigenvectors
2 2
4 4
6 6
8 8
10 10
12 12
2 4 6 8 10 12 2 4 6 8 10 12
10
12
2 4 6 8 10 12
PCA COMPRESSION: 144D 1D
60 most important eigenvectors
http://en.wikipedia.org/wiki/Discrete_cosine_transform
DIMENSIONALITY REDUCTION
PCA (Principal Component Analysis):
Find projection that maximize the variance
Multidimensional Scaling:
Find projection that best preserves inter-point distances
…
…
THANKS
NEERAJ.GUPTA@GLA.AC.IN 57
MACHINE LEARNING (ML-17)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oEnsemble Methods
NEERAJ.GUPTA@GLA.AC.IN 2
INTRODUCTION
Ensemble methods
Use a combination of models to increase accuracy
Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating
an improved model M*
INTRODUCTION
Two most popular ensemble methods are bagging and boosting.
Bagging: Training a bunch of individual models in a parallel way. Each model
is trained by a random subset of the data.
averaging the prediction over a collection of classifiers
1 error ( M i )
log
The weight of classifier Mi’s vote (αM) is error ( M i )
REFERENCES
Han, Jiawei, Jian Pei, and Micheline Kamber. Data mining: concepts and
techniques. Elsevier, 2011.
NEERAJ.GUPTA@GLA.AC.IN 14
MACHINE LEARNING (ML-18)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oArtificial Neural Network
History of Artificial Neural Networks
What is an Artificial Neural Networks?
How it works?
Learning
NEERAJ.GUPTA@GLA.AC.IN 2
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
History of the ANNs stems from the 1940s, the decade of the first electronic
computer.
However, the first important step took place in 1957 when Rosenblatt introduced the
first concrete neural model, the perceptron. Rosenblatt also took part in constructing
the first successful neurocomputer, the Mark I Perceptron. After this, the
development of ANNs has proceeded as described in Figure.
NEERAJ.GUPTA@GLA.AC.IN 3
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
Rosenblatt's original perceptron model contained only one layer. From this, a multi-
layered model was derived in 1960. At first, the use of the multi-layer perceptron
(MLP) was complicated by the lack of a appropriate learning algorithm.
In 1974, Werbos came to introduce a so-called backpropagation algorithm for the
three-layered perceptron network.
NEERAJ.GUPTA@GLA.AC.IN 4
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
In 1986, The application area of the MLP networks remained rather limited until the
breakthrough when a general back propagation algorithm for a multi-layered
perceptron was introduced by Rummelhart and Mclelland.
In 1982, Hopfield brought out his idea of a neural network. Unlike the neurons in
MLP, the Hopfield network consists of only one layer whose neurons are fully
connected with each other.
NEERAJ.GUPTA@GLA.AC.IN 5
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
Since then, new versions of the Hopfield network have been developed. The
Boltzmann machine has been influenced by both the Hopfield network and
the MLP.
NEERAJ.GUPTA@GLA.AC.IN 6
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
In 1988, Radial Basis Function (RBF) networks were first introduced by
Broomhead & Lowe. Although the basic idea of RBF was developed 30 years
ago under the name method of potential function, the work by Broomhead &
Lowe opened a new frontier in the neural network community.
NEERAJ.GUPTA@GLA.AC.IN 7
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
In 1982, A totally unique kind of network model is the Self-Organizing Map
(SOM) introduced by Kohonen. SOM is a certain kind of topological map
which organizes itself based on the input patterns that it is trained with. The
SOM originated from the LVQ (Learning Vector Quantization) network the
underlying idea of which was also Kohonen's in 1972.
NEERAJ.GUPTA@GLA.AC.IN 8
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
NEERAJ.GUPTA@GLA.AC.IN 9
ARTIFICIAL NEURAL NETWORK
A set of major aspects of a parallel distributed model
include:
a set of processing units (cells).
a state of activation for every unit, which equivalent to the output of the
unit.
connections between the units. Generally each connection is defined by a
weight.
a propagation rule, which determines the effective input of a unit from its
external inputs.
NEERAJ.GUPTA@GLA.AC.IN 10
ARTIFICIAL NEURAL NETWORK
an activation function, which determines the new level of
activation based on the effective input and the current activation.
an external input for each unit.
a method for information gathering (the learning rule).
an environment within which the system must operate, providing
input signals and _ if necessary _ error signals.
NEERAJ.GUPTA@GLA.AC.IN 11
COMPUTERS VS. NEURAL NETWORKS
“Standard” Computers Neural Networks
NEERAJ.GUPTA@GLA.AC.IN 12
WHY ARTIFICIAL NEURAL NETWORKS?
There are two basic reasons why we are interested in building artificial neural
networks (ANNs):
• Technical viewpoint: Some problems such as character recognition or the
prediction of future states of a system require massively parallel and adaptive
processing.
• Biological viewpoint: ANNs can be used to replicate and simulate
components of the human (or animal) brain, thereby giving us insight into
natural information processing.
NEERAJ.GUPTA@GLA.AC.IN 13
ARTIFICIAL NEURAL NETWORK
•The “building blocks” of neural networks are
the neurons.
• In technical systems, we also refer to them as units or nodes.
NEERAJ.GUPTA@GLA.AC.IN 15
HOW DO ANNS WORK?
An artificial neural network (ANN) is either a hardware
implementation or a computer program which strives to simulate
the information processing capabilities of its biological exemplar.
NEERAJ.GUPTA@GLA.AC.IN 16
HOW DO OUR BRAINS WORK?
The Brain is A massively parallel information processing
system.
Our brains are a huge network of processing elements. A
typical brain contains a network of 10 billion neurons.
NEERAJ.GUPTA@GLA.AC.IN 17
HOW DO OUR BRAINS WORK?
A processing element
Dendrites: Input
Cell body: Processor
Synaptic: Link
NEERAJ.GUPTA@GLA.AC.IN 18
Axon: Output
How do our brains work?
A processing element
NEERAJ.GUPTA@GLA.AC.IN 19
How do our brains work?
A processing element
NEERAJ.GUPTA@GLA.AC.IN 20
How do our brains work?
A processing element
The axon endings almost touch the dendrites or cell body of the
next neuron.
NEERAJ.GUPTA@GLA.AC.IN 22
How do our brains work?
A processing element
NEERAJ.GUPTA@GLA.AC.IN 23
How do our brains work?
A processing element
Neurotransmitters are chemicals which are released from the first neuron
and which bind to the
Second.
NEERAJ.GUPTA@GLA.AC.IN 24
How do our brains work?
A processing element
NEERAJ.GUPTA@GLA.AC.IN 27
How do ANNs work?
............
Input xm x2 x1
Processing ∑
∑= X1+X2 + ….+Xm =y
Output y
NEERAJ.GUPTA@GLA.AC.IN 28
How do ANNs work?
Not all inputs are equal
............
xm x2 x1
Input
wm .....
weights w2 w1
Output y
NEERAJ.GUPTA@GLA.AC.IN 29
How do ANNs work?
The signal is not passed down to the
next neuron verbatim
............
xm x2 x1
Input
wm ..... w2
weights w1
Processing ∑
Transfer Function
f(vk)
(Activation Function)
Output y
NEERAJ.GUPTA@GLA.AC.IN 30
The output is a function of the input, that is
affected by the weights, and the transfer
functions
NEERAJ.GUPTA@GLA.AC.IN 31
Activation Functions:
NEERAJ.GUPTA@GLA.AC.IN 32
Forward propagation: Vectorized implementation
NEERAJ.GUPTA@GLA.AC.IN 33
Neural Network learning its own features
NEERAJ.GUPTA@GLA.AC.IN 34
Other Network Architecture
NEERAJ.GUPTA@GLA.AC.IN 35
Simple example: AND
NEERAJ.GUPTA@GLA.AC.IN 36
Simple example: OR
NEERAJ.GUPTA@GLA.AC.IN 37
Artificial Neural Networks
An ANN can:
1. compute any computable function, by the
appropriate selection of the network
topology and weights values.
NEERAJ.GUPTA@GLA.AC.IN 38
Learning by trial‐and‐error
Continuous process of:
Trial:
Processing an input to produce an output (In terms of ANN: Compute
the output function of a given input)
Evaluate:
Evaluating this output by comparing the actual output with
the expected output.
Adjust:
Adjust the weights.
NEERAJ.GUPTA@GLA.AC.IN 39
How it works?
Set initial values of the weights randomly.
Input: truth table of the XOR
Do
Read input (e.g. 0, and 0)
Compute an output (e.g. 0.60543)
Compare it to the expected output. (Diff= 0.60543)
Modify the weights accordingly.
Loop until a condition is met
Condition: certain number of iterations
Condition: error threshold
NEERAJ.GUPTA@GLA.AC.IN 40
Design Issues
Initial weights (small random values ∈[‐1,1])
Transfer function (How the inputs and the weights are
combined to produce output?)
Error estimation
Weights adjusting
Number of neurons
Data representation
Size of training set
NEERAJ.GUPTA@GLA.AC.IN 41
Transfer Functions
Linear: The output is proportional to the total weighted
input.
NEERAJ.GUPTA@GLA.AC.IN 42
Error Estimation
The root mean square error (RMSE) is a frequently-used
measure of the differences between values predicted by a
model or an estimator and the values actually observed from
the thing being modeled or estimated
NEERAJ.GUPTA@GLA.AC.IN 43
Weights Adjusting
After each iteration, weights should be adjusted to
minimize the error.
– All possible weights
– Back propagation
NEERAJ.GUPTA@GLA.AC.IN 44
Back Propagation
Back-propagation is an example of supervised learning is used at
each layer to minimize the error between the layer’s response and
the actual data
The error at each hidden layer is an average of the evaluated error
Hidden layer networks are trained this way
NEERAJ.GUPTA@GLA.AC.IN 45
ANN EXAMPLE
NEERAJ.GUPTA@GLA.AC.IN 46
BACKPROPAGATION STEP BY STEP
NEERAJ.GUPTA@GLA.AC.IN 47
BACKPROPAGATION STEP BY STEP
• Neural network training is about finding weights that minimize prediction error. We
usually start our training with a set of randomly generated weights.
• Then, backpropagation is used to update the weights in an attempt to correctly map
arbitrary inputs to outputs.
NEERAJ.GUPTA@GLA.AC.IN 48
BACKPROPAGATION STEP BY STEP
initial weights will be as following: w1 = 0.11, w2 = 0.21, w3 = 0.12, w4 = 0.08, w5 = 0.14 and w6 = 0.15
NEERAJ.GUPTA@GLA.AC.IN 49
BACKPROPAGATION STEP BY STEP
Dataset
Our dataset has one sample with two inputs and one output.
NEERAJ.GUPTA@GLA.AC.IN 50
BACKPROPAGATION STEP BY STEP
Forward Pass
We will use given weights and inputs to predict the output. Inputs are multiplied by weights;
the results are then passed forward to next layer.
NEERAJ.GUPTA@GLA.AC.IN 51
BACKPROPAGATION STEP BY STEP
Calculating Error
Now, it’s time to find out how our network performed by calculating the difference between the
actual output and predicted one. It’s clear that our network output, or prediction, is not even
close to actual output. We can calculate the difference or the error as following.
NEERAJ.GUPTA@GLA.AC.IN 52
BACKPROPAGATION STEP BY STEP
Reducing Error
• The main goal of the training is to reduce the error or the difference
between prediction and actual output. Since actual output is constant,
“not changing”, the only way to reduce the error is to
change prediction value.
• how to change prediction value?
• It calculates the gradient of the error function with respect to the neural
network’s weights. The calculation proceeds backwards through the
network.
NEERAJ.GUPTA@GLA.AC.IN 57
BACKPROPAGATION STEP BY STEP
So to update w6 we can apply the following formula
Similarly, we can derive the update formula for w5 and any other weights existing between the output and
the hidden layer.
NEERAJ.GUPTA@GLA.AC.IN 58
BACKPROPAGATION STEP BY STEP
However, when moving backward to update w1, w2, w3 and w4 existing between input and hidden
layer, the partial derivative for the error function with respect to w1, for example, will be as following.
NEERAJ.GUPTA@GLA.AC.IN 59
BACKPROPAGATION STEP BY STEP
We can find the update formula for the remaining weights w2, w3 and w4 in the same way.
In summary, the update formulas for all weights will be as following:
NEERAJ.GUPTA@GLA.AC.IN 60
BACKPROPAGATION STEP BY STEP
We can rewrite the update formulas in matrices as following
NEERAJ.GUPTA@GLA.AC.IN 61
BACKPROPAGATION STEP BY STEP
Backward Pass
Using derived formulas we can find the new weights.
Learning rate: is a hyperparameter which means that we need to manually guess its value.
NEERAJ.GUPTA@GLA.AC.IN 62
BACKPROPAGATION STEP BY STEP
Now, using the new weights we will repeat the forward passed
NEERAJ.GUPTA@GLA.AC.IN 63
BACKPROPAGATION STEP BY STEP
https://hmkcode.com/netflow/
NEERAJ.GUPTA@GLA.AC.IN 64
Applications Areas
Function approximation
including time series prediction and modeling.
Classification
including patterns and sequences recognition, novelty
detection and sequential decision making.
(radar systems, face identification, handwritten text recognition)
Data processing
including filtering, clustering blinds source separation and
compression.
(data mining, e-mail Spam filtering)
NEERAJ.GUPTA@GLA.AC.IN 65
Advantages / Disadvantages
Advantages
Adapt to unknown situations
Powerful, it can model complex functions.
Ease of use, learns by example, and very little user
domain‐specific expertise needed
Disadvantages
Forgets
Not exact
Large complexity of the network structure
NEERAJ.GUPTA@GLA.AC.IN 66
Conclusion
Artificial Neural Networks are an imitation of the biological
neural networks, but much simpler ones.
The computing would have a lot to gain from neural networks.
Their ability to learn by example makes them very flexible and
powerful furthermore there is need to device an algorithm in
order to perform a specific task.
NEERAJ.GUPTA@GLA.AC.IN 67
Conclusion
Neural networks also contributes to area of research such a
neurology and psychology. They are regularly used to model
parts of living organizations and to investigate the internal
mechanisms of the brain.
Many factors affect the performance of ANNs, such as the
transfer functions, size of training sample, network topology,
weights adjusting algorithm, …
NEERAJ.GUPTA@GLA.AC.IN 68
Q. How does each neuron work in ANNS?
What is back propagation?
A neuron: receives input from many other neurons;
changes its internal state (activation) based on the
current input;
sends one output signal to many other neurons, possibly
including its input neurons (ANN is recurrent network).
NEERAJ.GUPTA@GLA.AC.IN 69
THANKS
NEERAJ.GUPTA@GLA.AC.IN 70
MACHINE LEARNING (ML-19)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oArtificial Neural Network
Gradient Descent
Stochastic Gradient Descent
Gradient Descent Vs Stochastic Gradient Descent
NEERAJ.GUPTA@GLA.AC.IN 2
OPTIMIZATION
Optimization is always the ultimate goal whether you are dealing with a real life
problem or building a software product.
Optimization basically means getting the optimal output for your problem.
NEERAJ.GUPTA@GLA.AC.IN 5
GRADIENT DESCENT ALGORITHM AND ITS VARIANTS
Gradient Descent is an optimization algorithm used for minimizing the cost
function in various machine learning algorithms. It is basically used for
updating the parameters of the learning model.
NEERAJ.GUPTA@GLA.AC.IN 7
BATCH GRADIENT DESCENT
NEERAJ.GUPTA@GLA.AC.IN 8
STOCHASTIC GRADIENT DESCENT
Stochastic Gradient Descent:
• This is a type of gradient descent which processes 1 training example per
iteration.
• Hence, the parameters are being updated even after one iteration in which
only a single example has been processed.
• Hence this is quite faster than batch gradient descent.
• But again, when the number of training examples is large, even then it
processes only one example which can be additional overhead for the
system as the number of iterations will be quite large.
NEERAJ.GUPTA@GLA.AC.IN 9
STOCHASTIC GRADIENT DESCENT
NEERAJ.GUPTA@GLA.AC.IN 10
MINI BATCH GRADIENT DESCENT:
Mini Batch gradient descent:
• This is a type of gradient descent which works faster than both batch
gradient descent and stochastic gradient descent.
• Here b examples where b<m are processed per iteration.
• So even if the number of training examples is large, it is processed in
batches of b training examples in one go.
• Thus, it works for larger training examples and that too with lesser number
of iterations.
NEERAJ.GUPTA@GLA.AC.IN 11
MINI BATCH GRADIENT DESCENT:
NEERAJ.GUPTA@GLA.AC.IN 12
THANKS
NEERAJ.GUPTA@GLA.AC.IN 13
MACHINE LEARNING (ML-20)
AN INTRODUCTION TO : DEEP LEARNING
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oAn introduction to: Deep Learning
aka or related to
Deep Neural Networks
Deep Structural Learning
Deep Belief Networks
etc,
NEERAJ.GUPTA@GLA.AC.IN 2
Machine Learning Basics
Machine learning is a field of computer science that gives computers the ability to
learn without being explicitly programmed
Machine Learning
Labeled Data algorithm
Training
Prediction
class A
class A
Anomaly Detection
Sequence labeling
http://mbjoseph.github.io/2013/11/27/measure.html
…
ML vs. Deep Learning
Most machine learning methods work well because of human-designed representations
and input features
ML becomes just optimizing weights to best make a final prediction
What is Deep Learning (DL) ?
A machine learning subfield of learning representations of data. Exceptional effective
at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of) representation by using a
hierarchy of multiple layers
If you provide the system tons of information, it begins to understand it and respond in
useful ways.
https://www.xenonstack.com/blog/static/public/uploads/media/machine-learning-vs-deep-learning.png
Why is DL useful?
o Manually designed features are often over-specified, incomplete and take a long
time to design and validate
o Learned Features are easy to adapt, fast to learn
o Deep learning provides a very flexible, (almost?) universal, learnable framework for
representing world, visual and linguistic information.
o Can learn both unsupervised and supervised
o Effective end-to-end joint system learning
o Utilize large amounts of training data
= ( + )
= ( + )
Activation functions
How do we train?
Demo
Training
Sample labeled Forward it Back-
data through the Update the
propagate the
network, get network weights
(batch) errors
predictions
learning rate
( ; , )
Activation functions
Non-linearities needed to learn complex (non-linear) representations of data, otherwise
the NN would be just a linear function
http://cs231n.github.io/assets/nn1/layer_sizes.jpeg
http://adilmoujahid.com/images/activation.png
- Sigmoid neurons saturate and kill gradients, thus NN will barely learn
• when the neuron’s activation are 0 or 1 (saturate)
• gradient at these regions almost zero
• almost no signal will flow to its weights
• if initial weights are too large then most neurons would saturate
Activation: Tanh
Takes a real-valued number and
“squashes” it into range between -1 and
1.
→ −1,1
http://adilmoujahid.com/images/activation.png
http://adilmoujahid.com/images/activation.png
http://wiki.bethanycrane.com/overfitting-of-data
https://www.neuraldesigner.com/images/learning/selection_error.svg
Regularization
Dropout
• Randomly drop units (along with their connections)
during training
• Each unit retained with fixed probability p,
independent of other units
• Hyper-parameter p to be chosen (tuned)
Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural
networks from overfitting." Journal of machine learning research (2014)
L2 = weight decay
• Regularization term that penalizes big weights, added to the
objective = +
• Weight decay value determines how dominant regularization is during
gradient computation
• Big weight decay coefficient big penalty for big weights
Early-stopping
• Use validation error to decide when to stop training
• Stop when monitored quantity has not improved after n subsequent epochs
• n is called patience
Loss functions and output
Classification Regression
f(x)=x
Convolutional
Input matrix 3x3 filter
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Convolutional Neural Networks (CNNs)
Main CNN idea for text:
Compute vectors for n-grams and group them afterwards
max pool
2x2 filters
and stride 2
https://shafeentejani.github.io/assets/images/pooling.gif
CNN for text classification
Severyn, Aliaksei, and Alessandro Moschitti. "UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment
Classification." SemEval@ NAACL-HLT. 2015.
CNN with multiple filters
https://pbs.twimg.com/media/C2j-8j5UsAACgEK.jpg
( ) ( )
ℎ = ( ℎ + )
( ) ( )
ℎ = ( ℎ + )
= ℎ ;ℎ
NEERAJ.GUPTA@GLA.AC.IN 26