Machine Learning: Parametric Models
Dr. Kritesh K. Gupta
Assistant Professor
Amrita School of Artificial Intelligence,
Amrita Vishwa Vidyapeetham, Coimbatore, India
2nd July 2025
Correlation of the Parameters with Response
[Figure: three scatter plots of Y vs. x illustrating zero, positive, and negative covariance]

• Cov(X1, X2) = 0: X1 and X2 are not correlated
• Cov(X1, X2) > 0: X1 and X2 are positively correlated
• Cov(X1, X2) < 0: X1 and X2 are negatively correlated
Regression Model: Linear Regression
Simple Linear Regression
| X (Weight) | Y (Height) |
|------------|------------|
| 74         | 170        |
| 80         | 180        |

New weight → Model → Predicted height: 75 → 175.5

[Figure: scatter of height vs. weight with the best-fit line; the vertical gap between an observed point $y_i$ and its prediction $\hat{y}_i$ is the error, which the fit minimizes]
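A minimal sketch of this weight → height example, assuming scikit-learn is available; the slide's 175.5 is illustrative, so the exact prediction from a line fit to the two training points will differ slightly.

```python
# Minimal sketch of the slide's weight -> height example (assumes scikit-learn).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[74.0], [80.0]])  # weights from the slide's table
y = np.array([170.0, 180.0])    # corresponding heights

model = LinearRegression().fit(X, y)  # fits the best-fit line
print(model.predict([[75.0]]))        # predicted height for a new weight of 75
```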
Mathematical understanding
$$\hat{y} = \beta_0 + \beta_1 x$$

• $\hat{y}$: predicted height
• $x$: weight
• $\beta_0$: intercept (value of $\hat{y}$ when $x = 0$)
• $\beta_1$: slope $= \dfrac{dy}{dx} = \dfrac{\Delta\,\text{height}}{\Delta\,\text{weight}}$
How is the best-fit line formed?

Cost function:
$$J(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Objective (to get the best-fit line): optimal selection of $\beta_0$ and $\beta_1$ that minimizes the error.

[Figure: height vs. weight scatter showing the error between observed points and a candidate line]
Let's apply this

Data: $x = \{1, 2, 3\}$, $y = \{1, 2, 3\}$. Model: $\hat{y} = \beta_0 + \beta_1 x$ with $\beta_0 = 0$.

|       | β1 = 0          | β1 = 0.5            | β1 = 1          |
|-------|-----------------|---------------------|-----------------|
| x = 1 | ŷ = 0 + 0·1 = 0 | ŷ = 0 + 0.5·1 = 0.5 | ŷ = 0 + 1·1 = 1 |
| x = 2 | ŷ = 0 + 0·2 = 0 | ŷ = 0 + 0.5·2 = 1   | ŷ = 0 + 1·2 = 2 |
| x = 3 | ŷ = 0 + 0·3 = 0 | ŷ = 0 + 0.5·3 = 1.5 | ŷ = 0 + 1·3 = 3 |
What's happening to the cost function?

$$J(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

| β1  | ŷ for x = 1, 2, 3 | J(β1)                                                                      |
|-----|-------------------|----------------------------------------------------------------------------|
| 0   | 0, 0, 0           | $\frac{1}{3}[(1-0)^2 + (2-0)^2 + (3-0)^2] = \frac{14}{3} \approx 4.66$     |
| 0.5 | 0.5, 1, 1.5       | $\frac{1}{3}[(1-0.5)^2 + (2-1)^2 + (3-1.5)^2] = \frac{3.5}{3} \approx 1.166$ |
| 1   | 1, 2, 3           | $\frac{1}{3}[(1-1)^2 + (2-2)^2 + (3-3)^2] = 0$                             |

[Figure: J(β1) vs. β1, a convex curve with its minimum at β1 = 1]
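The three cost values can be checked with a short script; a minimal sketch in plain NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def cost(beta1, beta0=0.0):
    """Mean squared error J(beta0, beta1) for the toy dataset."""
    y_hat = beta0 + beta1 * x
    return np.mean((y - y_hat) ** 2)

for b1 in (0.0, 0.5, 1.0):
    print(b1, round(cost(b1), 3))   # 4.667, 1.167, 0.0
```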
Convergence algorithm (optimize the change of βj)

Repeat until convergence:
$$\beta_j := \beta_j - \alpha \, \frac{\partial J(\beta_j)}{\partial \beta_j}$$

where $\alpha$ is the learning rate and $\frac{\partial J(\beta_j)}{\partial \beta_j}$ is the gradient at the current $\beta_j$.

• When the slope is negative: $(\beta_j)_{new} = \beta_j - \alpha(-ve)$, so $(\beta_j)_{new} > (\beta_j)_{old}$ (moves right, toward the minimum)
• When the slope is positive: $(\beta_j)_{new} = \beta_j - \alpha(+ve)$, so $(\beta_j)_{new} < (\beta_j)_{old}$ (moves left, toward the minimum)

[Figure: convex cost curve with an initial weight (βj) descending toward the minimum]
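A minimal gradient-descent sketch on the toy dataset above; the learning rate and iteration cap are arbitrary choices, not from the slides:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

b0, b1 = 0.0, 0.0        # initial weights
alpha = 0.1              # learning rate (arbitrary choice)

for _ in range(1000):    # "repeat until convergence", capped at 1000 steps
    y_hat = b0 + b1 * x
    # partial derivatives of J = (1/n) * sum((y_i - y_hat_i)^2)
    db0 = -2.0 * np.mean(y - y_hat)
    db1 = -2.0 * np.mean((y - y_hat) * x)
    b0 -= alpha * db0
    b1 -= alpha * db1

print(round(b0, 3), round(b1, 3))   # approaches beta0 = 0, beta1 = 1
```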
Multiple Linear Regression

Simple: $\hat{y} = \beta_0 + \beta_1 x$
Multiple: $\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

• $\beta_0$: intercept / bias
• $\beta_1, \beta_2$: slopes / weights

What is the generic equation for multiple regression? With $p$ features: $\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$

https://medium.com/analytics-vidhya/implementing-gradient-descent-for-multi-linear-regression-from-scratch-3e31c114ae12
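A sketch of multiple regression with two features; the coefficients (1, 2, 3) and the random data are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical features x1, x2 and a response from y = 1 + 2*x1 + 3*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # recovers ~1.0 and ~[2.0, 3.0]
```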
Ridge Regression: L2 Regularization
Ridge Regression (for overcoming overfitting)

Plain SLR cost ($\lambda = 0$ case):
$$J(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \hat{y}_i = \beta_0 + \beta_1 x$$

Ridge cost with the L2 penalty:
$$J(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum (\text{slope})^2$$

[Figure: a curve fitting the training points tightly but missing the test points, i.e., overfitting]

• High training accuracy: low bias
• Low test accuracy: high variance
λ vs. slope

[Figure: as λ increases, the magnitude of the fitted slope shrinks toward 0]
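This shrinkage can be checked numerically; a sketch using scikit-learn's Ridge, whose `alpha` parameter plays the role of λ (its scaling convention differs slightly from the slide's formula):

```python
import numpy as np
from sklearn.linear_model import Ridge

x = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])

# as the penalty strength grows, the fitted slope shrinks toward 0
for lam in (0.1, 1.0, 10.0, 100.0):
    slope = Ridge(alpha=lam).fit(x, y).coef_[0]
    print(lam, round(slope, 3))
```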
Lasso Regression (for feature selection): L1 Regularization

$$J(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum |\text{slope}|$$
ElasticNet Regression

$$J(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda_1 \sum |\text{slope}| + \lambda_2 \sum (\text{slope})^2$$

The L1 term performs feature selection; the L2 term penalizes overfitting.
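A sketch contrasting the two penalties on made-up data where only the first of five features matters: ridge shrinks all weights, while lasso drives the irrelevant ones exactly to zero (feature selection). The data and penalty values below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# hypothetical data: only the first of five features affects y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)

print(Ridge(alpha=1.0).fit(X, y).coef_)  # all shrunk, none exactly 0
print(Lasso(alpha=0.1).fit(X, y).coef_)  # irrelevant weights driven to 0
```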
Classification Model: Logistic Regression
Can Linear Regression Solve Classification Problem?

| X: Study hours | Y: O/P (Pass: 1, Fail: 0) |
|----------------|---------------------------|
| 2              | 0                         |
| 3              | 0                         |
| 4              | 0                         |
| 5              | 1                         |
| 6              | 1                         |
| 7              | 1                         |

[Figure: fitted line over the 0/1 labels, with a 0.5 threshold on the Y axis]

Decision rule: $\hat{y} = 1$ if $\hat{y} \ge 0.5$; $\hat{y} = 0$ if $\hat{y} < 0.5$
Can Linear Regression Solve Classification Problem?

Same data, plus one outlier:

| X: Study hours | Y: O/P (Pass: 1, Fail: 0) |
|----------------|---------------------------|
| 2              | 0                         |
| 3              | 0                         |
| 4              | 0                         |
| 5              | 1                         |
| 6              | 1                         |
| 7              | 1                         |
| 12             | 1                         |

[Figure: the outlier at X = 12 drags the fitted line, shifting the 0.5 crossing point]

Challenges:
1. Not robust in handling outliers
2. Responses can be > 1 and < 0

Hence, we need a mechanism that can restrict the predicted responses between 0 and 1.
What can solve these challenges?

Pass the linear response through the sigmoid:
$$y = \beta_0 + \beta_1 x, \qquad \sigma(y) = \frac{1}{1 + e^{-y}}$$

[Figure: S-shaped sigmoid curve over the study-hours data, bounded between 0 and 1 with the 0.5 threshold]
Sigmoid Function Nonlinearity

• When $z$ is large and positive ($z \ge 5$): $\sigma(z) \approx 1$
• When $z$ is large and negative ($z \le -5$): $\sigma(z) \approx 0$
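A quick numerical check of this saturation behavior:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(5))    # ~0.993  (large positive z -> output near 1)
print(sigmoid(-5))   # ~0.007  (large negative z -> output near 0)
print(sigmoid(0))    # 0.5     (the decision boundary)
```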
Cost function

Linear Regression:
$$J(\beta_0, \beta_1) = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2, \qquad \hat{y}_i = \beta_0 + \beta_1 x$$

Logistic Regression (same MSE, but with a sigmoid prediction):
$$J(\beta_0, \beta_1) = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2, \qquad \hat{y}_i = \sigma(y_i) = \frac{1}{1 + e^{-y_i}}$$
Why doesn't gradient descent work in Logistic Regression?

Plugging the sigmoid into the squared-error cost makes $J$ non-convex: the curve has multiple local minima, so gradient descent can get stuck in a local minimum instead of reaching the global one.

[Figure: wavy, non-convex cost curve with several local minima]
Why can't Mean Squared Error be used as the cost function for Logistic Regression?

If the true label $y$ is 1 and the predicted probability is $\sigma(z) = \frac{1}{1 + e^{-z}} = 0.2$, squaring the difference gives $(1 - 0.2)^2 = 0.64$.

The squaring operation intensifies the impact of misclassifications, especially when the predicted class is close to 0 or 1.
Log Loss or Cross Entropy Function

The log of corrected probabilities, in logistic regression, is obtained by taking the natural log (base e) of the predicted probabilities.

$$J = -\sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

$$\text{loss} = \begin{cases} -\log(\hat{y}) & \text{if } y = 1 \\ -\log(1 - \hat{y}) & \text{if } y = 0 \end{cases}$$
Log Loss and Model Performance

$$J = -\sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

| Sample | True label (yi) | Predicted probability (ŷi) | Log loss           |
|--------|-----------------|----------------------------|--------------------|
| 1      | 1               | 0.8                        | -log(0.8) = 0.223  |
| 2      | 0               | 0.3                        | -log(0.7) = 0.357  |
| 3      | 1               | 0.7                        | -log(0.7) = 0.357  |
| 4      | 0               | 0.2                        | -log(0.8) = 0.223  |
| 5      | 1               | 0.9                        | -log(0.9) = 0.105  |
| 6      | 0               | 0.1                        | -log(0.9) = 0.105  |
| 7      | 1               | 0.4                        | -log(0.4) = 0.916  |
| 8      | 0               | 0.6                        | -log(0.4) = 0.916  |
| 9      | 1               | 0.85                       | -log(0.85) = 0.162 |
| 10     | 0               | 0.15                       | -log(0.85) = 0.162 |

Total Log Loss $= \frac{1}{10}\sum_{i=1}^{10} \text{Log Loss}_i = 0.353$

• Gradient of Log Loss: $\dfrac{\partial \, \text{Log Loss}_i}{\partial \beta_j}$
• Parameter update: $\beta_j := \beta_j - \alpha \, \dfrac{\partial \, \text{Log Loss}_i}{\partial \beta_j}$
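The table's numbers can be reproduced directly:

```python
import numpy as np

# the slide's ten samples: true labels and predicted probabilities
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
y_prob = np.array([0.8, 0.3, 0.7, 0.2, 0.9, 0.1, 0.4, 0.6, 0.85, 0.15])

loss = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(loss.round(3))          # per-sample losses from the table
print(round(loss.mean(), 3))  # total log loss ~ 0.353
```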
Multi-class classification (One vs. Rest [OVR])

| X | Y | O1 | O2 | O3 |
|---|---|----|----|----|
| - | - | 1  | 0  | 0  |
| - | - | 0  | 1  | 0  |
| - | - | 0  | 0  | 1  |
| - | - | 1  | 0  | 0  |

• Model M1 → I/P: (X, Y), O/P: {O1}
• Model M2 → I/P: (X, Y), O/P: {O2}
• Model M3 → I/P: (X, Y), O/P: {O3}

[Figure: three decision boundaries M1, M2, M3, one per class, in the (X, Y) feature plane]

For unknown features (X, Y), each model outputs a probability and the highest wins:
• 0.25, 0.20, 0.55 → class 3
• 0.75, 0.15, 0.10 → class 1
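A minimal sketch of the final OVR decision, using the slide's example probabilities:

```python
import numpy as np

# per-model probabilities for one unknown point: M1, M2, M3
scores = np.array([0.25, 0.20, 0.55])
print("predicted class:", np.argmax(scores) + 1)   # class 3 (O3 wins)

scores = np.array([0.75, 0.15, 0.10])
print("predicted class:", np.argmax(scores) + 1)   # class 1 (O1 wins)
```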
Performance Measures
Performance measure for Regression: R-squared

$$R^2 = 1 - \frac{SS_{residual}}{SS_{total}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

[Figure: scatter with the fitted line $\hat{y}_i$ and the mean line $\bar{y}$; residuals are measured against each]
Performance measure for Regression: Adjusted R-squared

| Independent features (P)                                          | R-squared |
|-------------------------------------------------------------------|-----------|
| 1. Size of the house                                              | 0.75      |
| 1. Size of the house, 2. Number of rooms                          | 0.80      |
| 1. Size of the house, 2. Number of rooms, 3. Location             | 0.85      |
| 1. Size of the house, 2. Number of rooms, 3. Location, 4. Gender  | 0.87      |

$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(N - 1)}{N - P - 1}$$

• N: number of data points
• P: number of independent features

R-squared never decreases when a feature is added, even an irrelevant one like gender; adjusted R-squared penalizes the extra feature.
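A sketch of both measures; the toy arrays and the N = 50 below are hypothetical, not from the slide:

```python
import numpy as np

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)       # SS_residual
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # SS_total
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r_squared, n, p):
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)

y     = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([3.2, 4.8, 7.1, 8.9])
print(round(r2(y, y_hat), 4))                  # close to 1 for a good fit
print(round(adjusted_r2(0.85, n=50, p=3), 3))  # slide's R^2 = 0.85, hypothetical N = 50
```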
Performance measure for Regression: Mean Squared Error (MSE)

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Advantages:
• Differentiable
• Has one local/global minimum
• Fast convergence to the minimum

Disadvantages:
• Sensitive to outliers
Performance measure for Regression: Mean Absolute Error (MAE)

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

Advantages:
• Robust to outliers

Disadvantages:
• Convergence takes time (sub-gradient)
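A small sketch contrasting the two metrics on made-up data with one outlier:

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0, 20.0])   # last point acts as an outlier
y_hat = np.array([2.5, 5.5, 6.5, 8.0])

mse = np.mean((y - y_hat) ** 2)    # squaring magnifies the outlier's error
mae = np.mean(np.abs(y - y_hat))   # absolute error grows only linearly
print(round(mse, 3), round(mae, 3))
```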
Performance metrics for Classification

• Confusion Matrix

| X1 | X2 | Y | Yhat |
|----|----|---|------|
| -  | -  | 0 | 1    |
| -  | -  | 1 | 1    |
| -  | -  | 0 | 0    |
| -  | -  | 1 | 1    |
| -  | -  | 1 | 1    |
| -  | -  | 0 | 1    |
| -  | -  | 1 | 0    |

|             | Actual 1 | Actual 0 |
|-------------|----------|----------|
| Predicted 1 | TP       | FP       |
| Predicted 0 | FN       | TN       |

• Accuracy
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
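Counting the four cells from the slide's seven (Y, Yhat) pairs:

```python
import numpy as np

y     = np.array([0, 1, 0, 1, 1, 0, 1])   # actual labels from the table
y_hat = np.array([1, 1, 0, 1, 1, 1, 0])   # predicted labels

tp = np.sum((y == 1) & (y_hat == 1))   # 3
tn = np.sum((y == 0) & (y_hat == 0))   # 1
fp = np.sum((y == 0) & (y_hat == 1))   # 2
fn = np.sum((y == 1) & (y_hat == 0))   # 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, round(accuracy, 3))   # 3 1 2 1 0.571
```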
Performance metrics for Classification

• Precision: of all the samples predicted positive, how many actually are positive
$$\text{Precision} = \frac{TP}{TP + FP}$$

• Recall: of all the actually positive samples, how many were predicted positive
$$\text{Recall} = \frac{TP}{TP + FN}$$
Use Cases

• Example 1: spam [1] vs. not spam [0]
  - A mail is not spam, but the model predicted spam → False Positive. Cannot be afforded (FP).
  - A mail is spam, but the model predicted not spam → False Negative. Does not cause much damage.

• Example 2: disease prediction
  - A patient has diabetes [1], but the model predicted not diabetes [0] → False Negative. Cannot be afforded (FN).
  - A patient doesn't have diabetes, but the model predicted diabetes → False Positive. Does not cause much damage.
Performance metrics for Classification

• F-Score
$$F_\beta = (1 + \beta^2)\,\frac{\text{Precision} \times \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

In the cases where reducing FP and FN are both necessary, we use β = 1:
$$F_1 = 2\,\frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

In the cases where reducing FP is more important than FN, we use β = 0.5:
$$F_{0.5} = (1 + 0.25)\,\frac{\text{Precision} \times \text{Recall}}{0.25 \cdot \text{Precision} + \text{Recall}}$$

In the cases where reducing FN is more important than FP, we use β = 2:
$$F_2 = 5\,\frac{\text{Precision} \times \text{Recall}}{4 \cdot \text{Precision} + \text{Recall}}$$
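A sketch of the three variants, using hypothetical precision/recall values for illustration:

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

p, r = 0.75, 0.60                    # made-up precision and recall
print(round(f_beta(p, r, 1.0), 3))   # F1: balances FP and FN
print(round(f_beta(p, r, 0.5), 3))   # F0.5: weights precision (FP) more
print(round(f_beta(p, r, 2.0), 3))   # F2: weights recall (FN) more
```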