Lecture 3

Machine Learning: Parametric Models

Dr. Kritesh K. Gupta
Assistant Professor
Amrita School of Artificial Intelligence,
Amrita Vishwa Vidyapeetham, Coimbatore, India

2nd July 2025
Correlation of the Parameters with Response

[Figure: three scatter plots illustrating zero, positive, and negative covariance]

• Cov(X1, X2) = 0 → X1 and X2 are not correlated
• Cov(X1, X2) > 0 → X1 and X2 are positively correlated
• Cov(X1, X2) < 0 → X1 and X2 are negatively correlated
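A minimal sketch of checking the sign of the covariance between two features with NumPy; the arrays x1, x2_pos and x2_neg are invented illustrative data, not values from the slides.

```python
import numpy as np

# Hypothetical feature vectors (illustrative values only)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2_pos = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # moves with x1
x2_neg = x2_pos[::-1]                           # moves against x1

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(X1, X2)
print(np.cov(x1, x2_pos)[0, 1])   # > 0  -> positively correlated
print(np.cov(x1, x2_neg)[0, 1])   # < 0  -> negatively correlated
```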
Regression Model: Linear Regression

Simple Linear Regression

X (Weight)   Y (Height)
74           170
80           180
75           175.5
…            …

New weight → Model → Predicted height

[Figure: height vs. weight scatter with the best-fit line; the vertical gap between an observed point yᵢ and the line is the error]

Errorᵢ = yᵢ − ŷᵢ        Objective: choose the line that minimises the error → best-fit line
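A small sketch fitting the best-fit line to the three (weight, height) pairs from the table with NumPy's least-squares polyfit; the "new weight" of 77 is an invented example value.

```python
import numpy as np

# The three (weight, height) pairs shown on the slide
weight = np.array([74.0, 80.0, 75.0])
height = np.array([170.0, 180.0, 175.5])

# Least-squares fit of height = b0 + b1 * weight (degree-1 polynomial)
b1, b0 = np.polyfit(weight, height, deg=1)

new_weight = 77.0                       # hypothetical "new weight"
predicted_height = b0 + b1 * new_weight
print(b0, b1, predicted_height)
```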
Mathematical understanding

ŷ = β0 + β1·x

ŷ : predicted height
x : weight
β0 : intercept (value of ŷ when x = 0)
β1 : slope = dy/dx (change in height per unit change in weight)

[Figure: two height-vs-weight plots, one line through the origin (β0 = 0) and one with a non-zero intercept (β0 ≠ 0)]
How is the best-fit line formed?

Cost function:

J(β0, β1) = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Objective: minimise J(β0, β1) to get the best-fit line.

[Figure: height-vs-weight scatter with candidate lines; each vertical gap is an error]

The optimal selection of β0 and β1 is the pair with the smallest cost.
Let's apply this

Data: x = {1, 2, 3}, y = {1, 2, 3}

ŷ = β0 + β1·x, with β0 fixed at 0; try β1 = 0, 0.5 and 1:

        β1 = 0              β1 = 0.5               β1 = 1
x = 1   ŷ = 0 + 0·1 = 0     ŷ = 0 + 0.5·1 = 0.5    ŷ = 0 + 1·1 = 1
x = 2   ŷ = 0 + 0·2 = 0     ŷ = 0 + 0.5·2 = 1      ŷ = 0 + 1·2 = 2
x = 3   ŷ = 0 + 0·3 = 0     ŷ = 0 + 0.5·3 = 1.5    ŷ = 0 + 1·3 = 3

[Figure: y vs. x plot of the three data points with the three candidate lines]
What's happening to the cost function?

J(β0, β1) = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

β1 = 0:    J(β1) = (1/3)[(1 − 0)² + (2 − 0)² + (3 − 0)²] = 14/3 ≈ 4.66
β1 = 0.5:  J(β1) = (1/3)[(1 − 0.5)² + (2 − 1)² + (3 − 1.5)²] = (1/3)[0.25 + 1 + 2.25] ≈ 1.166
β1 = 1:    J(β1) = (1/3)[(1 − 1)² + (2 − 2)² + (3 − 3)²] = 0

[Figure: J(β1) plotted against β1 — a convex, bowl-shaped curve with its minimum at β1 = 1]
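A small sketch that reproduces the three cost values above and sweeps β1 over a grid to trace the convex curve; NumPy only.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def cost(beta1, beta0=0.0):
    """Mean squared error J(beta0, beta1) for the toy data set."""
    y_hat = beta0 + beta1 * x
    return np.mean((y - y_hat) ** 2)

for b1 in (0.0, 0.5, 1.0):
    print(b1, round(cost(b1), 3))   # 4.667, 1.167, 0.0

# Sweep beta1 to trace the J(beta1) curve shown on the slide
grid = np.linspace(0, 2, 41)
J = [cost(b1) for b1 in grid]
```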
Convergence algorithm (optimise the change of βj)

Repeat until convergence {
    βj := βj − α · ∂J(βj)/∂βj
}

α : learning rate
∂J(βj)/∂βj : gradient of the cost with respect to βj, evaluated at the current weight βj

When the slope is negative:  βj(new) = βj(old) − α·(−ve)  →  βj(new) > βj(old)
When the slope is positive:  βj(new) = βj(old) − α·(+ve)  →  βj(new) < βj(old)

[Figure: convex cost curve with an initial weight βj stepping down toward the minimum]
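A minimal gradient-descent sketch for the simple-regression cost above; the learning rate and iteration count are arbitrary assumed choices, not values from the slides.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

beta0, beta1 = 0.0, 0.0      # initial weights
alpha = 0.1                  # learning rate (assumed value)

for _ in range(500):         # "repeat until convergence" (fixed count for simplicity)
    y_hat = beta0 + beta1 * x
    error = y_hat - y
    # Gradients of J = (1/n) * sum((y - y_hat)^2) w.r.t. beta0 and beta1
    grad_b0 = 2 * np.mean(error)
    grad_b1 = 2 * np.mean(error * x)
    beta0 -= alpha * grad_b0
    beta1 -= alpha * grad_b1

print(round(beta0, 3), round(beta1, 3))   # approaches beta0 = 0, beta1 = 1
```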
Multiple Linear Regression

Simple:   ŷ = β0 + β1·x
Multiple: ŷ = β0 + β1·x1 + β2·x2

β0 : intercept / bias
β1, β2 : slopes / weights

What is a generic equation for multiple regression?   ŷ = β0 + β1·x1 + β2·x2 + … + βp·xp

https://medium.com/analytics-vidhya/implementing-gradient-descent-for-multi-linear-regression-from-scratch-3e31c114ae12
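A hedged, vectorised sketch of gradient descent for multiple linear regression, in the spirit of the linked article; the feature matrix and targets below are invented illustrative data.

```python
import numpy as np

# Invented data: 4 samples, 2 features (x1, x2); y happens to equal x1 + 2*x2
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

Xb = np.c_[np.ones(len(X)), X]        # prepend a column of 1s for the bias beta0
beta = np.zeros(Xb.shape[1])          # [beta0, beta1, beta2]
alpha = 0.01                          # assumed learning rate

for _ in range(5000):
    grad = 2 / len(y) * Xb.T @ (Xb @ beta - y)   # gradient of the MSE cost
    beta -= alpha * grad

print(beta)   # approaches [0, 1, 2] for this toy data
```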
Ridge Regression: L2 Regularization
Ridge Regression (for overcoming overfitting)

Simple linear regression: ŷᵢ = β0 + β1·x, with cost
J(β0, β1) = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² → 0 when the SLR model overfits the training data.

Ridge adds an L2 penalty:

J(β0, β1) = [ (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² ] + λ Σ (slope)²

[Figure: Y vs. X with the training points fitted almost perfectly while the test points are missed — overfitting]

High training accuracy → low bias
Low test accuracy → high variance
λ vs. slope

As λ increases, the slope decreases: a larger penalty shrinks the coefficients toward zero.
Lasso Regression (for feature selection)

L1 Regularization:

J(β0, β1) = [ (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² ] + λ Σ |slope|
ElasticNet Regression

J(β0, β1) = [ (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² ] + λ1 Σ |slope| + λ2 Σ (slope)²

λ1 Σ |slope| : L1 term, for feature selection
λ2 Σ (slope)² : L2 term, penalty for overfitting
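A brief sketch of the three regularised regressors using scikit-learn (assumed available); the data and the alpha values are arbitrary illustrations, and scikit-learn's alpha plays the role of λ here.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                 # invented features
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)                      # L2: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)                      # L1: drives some coefficients to exactly 0
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

print(ridge.coef_)
print(lasso.coef_)    # note the exact zeros -> feature selection
print(enet.coef_)
```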
Classification Model: Logistic Regression

Can Linear Regression Solve a Classification Problem?

X: study hours   Y: output (Pass: 1, Fail: 0)
1                0
2                0
3                0
4                0
5                1
6                1
7                1

[Figure: Y vs. X (1–12) with the data points and a fitted regression line crossing 0.5]

Decision rule on the regression output:
ŷ = 1, if ŷ ≥ 0.5
ŷ = 0, if ŷ < 0.5
Can Linear Regression Solve a Classification Problem?

Now add an outlier: a student who studies 12 hours and passes (12, 1).

X: study hours   Y: output (Pass: 1, Fail: 0)
1                0
2                0
3                0
4                0
5                1
6                1
7                1
12               1

[Figure: the same plot with the outlier at x = 12; the fitted line tilts and the 0.5 crossing shifts]

Challenges:
1. Linear regression is not robust in handling outliers.
2. Predicted responses can be > 1 and < 0.

Hence, we need a mechanism that restricts the predicted responses to the range 0 to 1.
What can solve these challenges?

Pass the linear output through the sigmoid (logistic) function:

y = β0 + β1·x
σ(y) = 1 / (1 + e^(−y))

[Figure: the S-shaped σ(y) curve plotted over X = 1–12, bounded between 0 and 1 and crossing 0.5]
Sigmoid Function Nonlinearity

When z is large and positive (z ≥ 5):  σ(z) = 1 / (1 + e^(−z)) ≈ 1
When z is large and negative (z ≤ −5): σ(z) ≈ 0

[Figure: sigmoid curve saturating at 0 and 1]
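A tiny sketch of the sigmoid and its saturation behaviour, using NumPy only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))     # 0.5
print(sigmoid(5))     # ~0.993 -> effectively 1 for large positive z
print(sigmoid(-5))    # ~0.007 -> effectively 0 for large negative z
```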
Cost function

Linear Regression:
J(β0, β1) = (1/m) Σᵢ₌₁ᵐ (yᵢ − ŷᵢ)², with ŷᵢ = β0 + β1·x

Logistic Regression (same squared-error form, but the prediction passes through the sigmoid):
J(β0, β1) = (1/m) Σᵢ₌₁ᵐ (yᵢ − ŷᵢ)², with ŷᵢ = σ(zᵢ) = 1 / (1 + e^(−zᵢ)), where zᵢ = β0 + β1·x
Why doesn't gradient descent work (with this cost) in Logistic Regression?

[Figure: with the sigmoid inside the squared error, the cost surface is non-convex — a wavy curve with several local minima — so gradient descent can get stuck away from the global minimum]
Why can't Mean Squared Error be used as the cost function for Logistic Regression?

If the true label y is 1 but the predicted probability is σ(z) = 1 / (1 + e^(−z)) = 0.2:

Squaring the difference: (1 − 0.2)² = 0.64

[Figure: squared-error penalty vs. predicted probability for a true label of 1]

Because predicted probabilities lie between 0 and 1, the squared difference is at most 1, so MSE penalises misclassifications only weakly, even when the predicted probability is close to the wrong extreme (0 or 1); combined with the non-convexity above, this makes MSE a poor cost for logistic regression.
Log Loss or Cross Entropy Function

The log of the corrected probabilities, in logistic regression, is obtained by taking the natural log (base e) of the predicted probabilities.

J = −Σᵢ₌₁ᵐ [ yᵢ·log(ŷᵢ) + (1 − yᵢ)·log(1 − ŷᵢ) ]

Per-sample loss:
loss = −log(1 − ŷ)  if y = 0
loss = −log(ŷ)      if y = 1
Log Loss and Model Performance

J = −Σᵢ₌₁ᵐ [ yᵢ·log(ŷᵢ) + (1 − yᵢ)·log(1 − ŷᵢ) ]

Sample  True label (yᵢ)  Predicted probability (ŷᵢ)  Log loss
1       1                0.8                         −log(0.8)  = 0.223
2       0                0.3                         −log(0.7)  = 0.357
3       1                0.7                         −log(0.7)  = 0.357
4       0                0.2                         −log(0.8)  = 0.223
5       1                0.9                         −log(0.9)  = 0.105
6       0                0.1                         −log(0.9)  = 0.105
7       1                0.4                         −log(0.4)  = 0.916
8       0                0.6                         −log(0.4)  = 0.916
9       1                0.85                        −log(0.85) = 0.162
10      0                0.15                        −log(0.85) = 0.162

Total Log Loss = (1/10) Σᵢ₌₁¹⁰ Log Lossᵢ = 0.353

• Gradient of Log Loss: ∂(Log Lossᵢ)/∂β
• Parameter update: β := β − α · ∂(Log Lossᵢ)/∂β
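A short sketch that reproduces the table's average log loss of ≈0.353, using NumPy only.

```python
import numpy as np

y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
y_prob = np.array([0.8, 0.3, 0.7, 0.2, 0.9, 0.1, 0.4, 0.6, 0.85, 0.15])

# Per-sample loss: -log(p) if y = 1, -log(1 - p) if y = 0
losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(np.round(losses, 3))
print(round(losses.mean(), 3))   # ~0.353, matching the slide
```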
Multi-class classification (One vs. Rest [OVR])

Each class gets its own binary model (one-hot targets):

X   Y   O1  O2  O3
-   -   1   0   0
-   -   0   1   0
-   -   0   0   1
-   -   1   0   0

Model M1: I/P (X, Y), O/P: {O1}
Model M2: I/P (X, Y), O/P: {O2}
Model M3: I/P (X, Y), O/P: {O3}

[Figure: scatter of the classes in the X–Y plane with the three one-vs-rest decision boundaries M1, M2, M3]

For unknown features (X, Y), each model outputs a probability and the largest one decides the class, e.g.:

O1    O2    O3
0.25  0.20  0.55  → class 3
0.75  0.15  0.10  → class 1
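A hedged sketch of one-vs-rest logistic regression with scikit-learn; the 2-D toy data below is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Invented 2-D features and three class labels
X = np.array([[1, 1], [1, 2], [5, 5], [6, 5], [1, 6], [2, 7]])
y = np.array([0, 0, 1, 1, 2, 2])

# One binary logistic-regression model is fit per class, as on the slide
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

print(ovr.predict([[5.5, 5.0]]))        # predicted class
print(ovr.predict_proba([[5.5, 5.0]]))  # one probability per class; the largest wins
```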
Performance Measures

Performance measure for Regression: R-squared

R² = 1 − SS_residual / SS_total = 1 − [ Σᵢ (yᵢ − ŷᵢ)² ] / [ Σᵢ (yᵢ − ȳ)² ]

[Figure: Y vs. X scatter showing the fitted line ŷᵢ and the mean line ȳ; yᵢ − ŷᵢ is the residual and yᵢ − ȳ is the total deviation]
Performance measure for Regression: Adjusted R-squared

Adjusted R² = 1 − (1 − R²)(N − 1) / (N − P − 1)

N : number of data points
P : number of independent features

Number of independent features (P)                                   R-squared
1. Size of the house                                                  0.75
1. Size of the house  2. Number of rooms                              0.80
1. Size of the house  2. Number of rooms  3. Location                 0.85
1. Size of the house  2. Number of rooms  3. Location  4. Gender      0.87
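A quick sketch of the adjusted R² formula; N = 100 is an assumed sample size, and the R² values are the ones from the table above.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2)(N - 1) / (N - P - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 100  # assumed number of data points
for p, r2 in [(1, 0.75), (2, 0.80), (3, 0.85), (4, 0.87)]:
    print(p, round(adjusted_r2(r2, n, p), 4))
# The (N - 1)/(N - P - 1) factor discounts each extra feature,
# so adjusted R^2 rises more slowly than plain R^2.
```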
Performance measure for Regression: Mean Squared Error (MSE)

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Advantages:
• Differentiable
• It has one local/global minimum
• Fast convergence to the minimum

Disadvantages:
• Sensitive to outliers
Performance measure for Regression: Mean Absolute Error (MAE)

MAE = (1/n) Σᵢ₌₁ⁿ |yᵢ − ŷᵢ|

Advantages:
• Robust to outliers

Disadvantages:
• Convergence takes time (sub-gradient)
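A short sketch comparing MSE and MAE on invented predictions, one of which is an outlier, to illustrate MSE's outlier sensitivity.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2, 20.0])   # last prediction is an outlier

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))

print(round(mse, 3))   # dominated by the single large error (squared)
print(round(mae, 3))   # grows only linearly with the outlier
```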
Performance metrics for Classification

• Confusion Matrix

X1  X2  Y  Ŷ
-   -   0  1
-   -   1  1
-   -   0  0
-   -   1  1
-   -   1  1
-   -   0  1
-   -   1  0

                 Actual
                 1     0
Predicted  1     TP    FP
           0     FN    TN

• Accuracy

Accuracy = (TP + TN) / (TP + FP + TN + FN)
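A minimal sketch that builds the confusion matrix and accuracy for the Y and Ŷ columns above with scikit-learn (assumed available).

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [0, 1, 0, 1, 1, 0, 1]   # Y column from the slide table
y_pred = [1, 1, 0, 1, 1, 1, 0]   # Yhat column

# labels=[1, 0] puts class 1 first; rows are actual labels: [[TP, FN], [FP, TN]]
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total
```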
Performance metrics for Classification

• Precision

Precision = TP / (TP + FP)   (of everything predicted positive, how much is actually positive)

• Recall

Recall = TP / (TP + FN)   (of everything actually positive, how much is predicted positive)

                 Actual
                 1     0
Predicted  1     TP    FP
           0     FN    TN
Use Cases

• Example 1: spam [1] vs. not spam [0]
  A mail is not spam, but the model predicted spam → False Positive: cannot be afforded (FP ↓).
  A mail is spam, but the model predicted not spam → False Negative: does not cause much damage.

• Example 2: disease prediction
  A patient has diabetes [1], but the model predicted not diabetes [0] → False Negative: cannot be afforded (FN ↓).
  A patient does not have diabetes, but the model predicted diabetes → False Positive: does not cause much damage.
Performance metrics for Classification

• F-Score

Fβ = (1 + β²) × (Precision × Recall) / (β²·Precision + Recall)

In cases where reducing FP and FN are both necessary, we use β = 1:
F1 score = 2 × (Precision × Recall) / (Precision + Recall)

In cases where reducing FP is more important than FN, we use β = 0.5:
F0.5 score = (1 + 0.25) × (Precision × Recall) / (0.25·Precision + Recall)

In cases where reducing FN is more important than FP, we use β = 2:
F2 score = 5 × (Precision × Recall) / (4·Precision + Recall)
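A brief sketch computing F1, F0.5 and F2 from assumed precision and recall values (0.75 and 0.60 are invented for illustration); scikit-learn's fbeta_score computes the same quantity directly from labels.

```python
def f_beta(precision, recall, beta):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.75, 0.60   # assumed values for illustration
for beta in (1.0, 0.5, 2.0):
    print(beta, round(f_beta(precision, recall, beta), 3))
```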
