CSM 6405: Symbolic ML II
Lecture 2: Linear and Multivariate Regression
Acknowledgement: Andrew Ng (Stanford University), Coursera
Prof. Dr. Md. Rakib Hassan
Dept. of Computer Science and Mathematics,
Bangladesh Agricultural University.
Email: rakib@bau.edu.bd
Linear Regression
with 1 Variable
Gradient Descent
❖ Have some function $J(\theta_0, \theta_1)$
❖ Find $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
❖ Outline:
❑ Start with some $\theta_0, \theta_1$
❑ Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum
Gradient Descent
[Figure slides: gradient descent illustrated on the cost surface $J(\theta_0, \theta_1)$]
Gradient Descent Algorithm
❖ Repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$   (for $j = 0$ and $j = 1$)
}
where $\alpha$ = learning rate
❖ Simultaneous update
❑ temp0 $:= \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
❑ temp1 $:= \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
❑ $\theta_0 :=$ temp0
❑ $\theta_1 :=$ temp1
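❖ A minimal Python sketch of one simultaneous update step (dJ_dtheta0 and dJ_dtheta1 are hypothetical helpers standing in for the partial derivatives derived later in these slides):

```python
def gd_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    # Evaluate both partial derivatives at the CURRENT point first...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ...then assign, so the update of theta1 never sees the new theta0.
    return temp0, temp1
```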
Gradient or Slope
❖ $\text{Gradient} = \dfrac{\text{Change in } Y}{\text{Change in } X}$
❖ Moving across to the right, the change in $X$ is positive (moving across to the left, it is negative).
❖ Up is positive, and down is negative.
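❖ A small worked example (the two points are chosen purely for illustration): for the line through $(1, 2)$ and $(3, 6)$,

$$\text{Gradient} = \frac{\text{Change in } Y}{\text{Change in } X} = \frac{6 - 2}{3 - 1} = \frac{4}{2} = 2$$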
Examples
[Figure slides: worked examples of computing the gradient (slope) of a line]
Gradient Descent
[Figure: $J(\theta_1)$ plotted against $\theta_1$, with the current $\theta_1$ to the right of the minimum]
$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_1) = \theta_1 - \alpha \cdot (\text{positive slope})$
so $\theta_1$ decreases, moving toward the minimum.
Gradient Descent
[Figure: $J(\theta_1)$ plotted against $\theta_1$, with the current $\theta_1$ to the left of the minimum]
$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_1) = \theta_1 - \alpha \cdot (\text{negative slope})$
so $\theta_1$ increases, again moving toward the minimum.
Fixed Learning Rate
❖ Gradient descent can converge to a local minimum,
even with the learning rate 𝛼 fixed.
[Figure: $J(\theta_1)$ with the gradient descent steps becoming smaller as $\theta_1$ approaches the minimum]
$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_1) = \theta_1 - \alpha \cdot (\text{positive slope})$
❖ As we approach a local minimum, the slope $\frac{\partial}{\partial \theta_1} J(\theta_1)$ shrinks, so gradient descent automatically takes smaller steps. Hence there is no need to decrease $\alpha$ over time.
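❖ A tiny Python sketch illustrating this (the quadratic $J(\theta_1) = \theta_1^2$ is an assumed toy cost, with derivative $2\theta_1$):

```python
theta, alpha = 4.0, 0.1
for i in range(5):
    step = alpha * 2 * theta   # step is proportional to the slope 2*theta
    theta -= step
    print(f"iter {i}: step = {step:.4f}, theta = {theta:.4f}")
# The printed steps shrink every iteration even though alpha stays fixed.
```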
Gradient Descent Algorithm
❖ Repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$
}
where $\alpha$ = learning rate
❖ Linear regression model:
$h_\theta(x) = \theta_0 + \theta_1 x$
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Gradient Descent Algorithm
❖ $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$
❖ $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
❖ $j = 0$: $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
❖ $j = 1$: $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$
Gradient Descent Algorithm
❖ Repeat until convergence {
    $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
    $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$
}
❖ Update $\theta_0$ and $\theta_1$ simultaneously
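❖ A minimal Python sketch of this algorithm (plain lists, no libraries, in the spirit of the assignment; variable names are illustrative):

```python
def gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        errors = [theta0 + theta1 * x[i] - y[i] for i in range(m)]
        grad0 = sum(errors) / m                               # dJ/dtheta0
        grad1 = sum(e * xi for e, xi in zip(errors, x)) / m   # dJ/dtheta1
        # Simultaneous update: both gradients were computed with the old thetas.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Example usage on a tiny made-up dataset:
# theta0, theta1 = gradient_descent([1, 2, 3], [2, 4, 6])
```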
Convergence
[Figure: gradient descent iterates converging to the minimum of the cost function]
Convex Function
[Figure: a convex (bowl-shaped) cost function]
❖ The squared-error cost $J(\theta_0, \theta_1)$ of linear regression is convex, so it has a single global minimum and gradient descent cannot get stuck in a local optimum.
Batch Gradient Descent
❖ “Batch”: Each step of gradient descent uses all the
training examples.
Linear Regression
with Multiple
Variables
MULTIPLE FEATURES
Multiple Features (Variables)
❖ Example: a dataset with $m = 47$ training examples, features $x_1, x_2, x_3, x_4$, and target $y$. [Table of sample data]
❖ Notation:
❑ $n$ = number of features
❑ $x^{(i)}$ = input (features) of the $i$-th training example
❑ $x_j^{(i)}$ = value of feature $j$ in the $i$-th training example
❖ Example: $x^{(2)} = \begin{bmatrix} 1416 \\ 3 \\ 2 \\ 40 \end{bmatrix}$, $x_3^{(2)} = 2$
Hypothesis
❖ $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n$
❖ For convenience of notation, define $x_0 = 1$.
❖ $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n = \theta^T x$
where $\theta^T = [\theta_0 \ \theta_1 \ \dots \ \theta_n]$ and $x = [x_0 \ x_1 \ \dots \ x_n]^T$
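❖ With NumPy, the vectorized hypothesis is a one-liner (the values below are illustrative; $x_0 = 1$ is prepended by hand):

```python
import numpy as np

theta = np.array([0.5, 1.2, -0.3])   # [theta0, theta1, theta2]
x = np.array([1.0, 2.0, 3.0])        # [x0 = 1, x1, x2]
h = theta @ x                        # h_theta(x) = theta^T x
```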
Gradient Descent
❖ Repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
    (simultaneously update $\theta_j$ for $j = 0, 1, \dots, n$)
}
❖ For example:
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$
$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_1^{(i)}$
$\theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_2^{(i)}$
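❖ A vectorized NumPy sketch of this update (X is assumed to be the $m \times (n+1)$ design matrix whose first column is all ones):

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, num_iters=1000):
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        errors = X @ theta - y       # h_theta(x^(i)) - y^(i) for every i
        grad = (X.T @ errors) / m    # all n+1 partial derivatives at once
        theta -= alpha * grad        # simultaneous update of every theta_j
    return theta
```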
Feature Scaling
❖ Make sure features are on a similar scale
❖ Get every feature into approximately a −1 ≤ 𝑥𝑖 ≤ 1
range.
❖ Examples:
❑ $x_1$ = size (0–2000 feet²)
❑ $x_2$ = number of bedrooms (1–5)
❖ Method 1: divide by the maximum value
❑ $x_1 = \dfrac{\text{size (feet}^2)}{2000}$
❑ $x_2 = \dfrac{\text{number of bedrooms}}{5}$
Method 2 – Mean Normalization
❖ $x_i := \dfrac{x_i - \mu_i}{\sigma_i}$
❑ $\mu_i$ = mean of feature $i$
❑ $\sigma_i$ = standard deviation of feature $i$
❖ Do not apply this to $x_0 = 1$
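❖ A short NumPy sketch of mean normalization, applied column-wise and skipping the $x_0 = 1$ column (names are illustrative):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature column of X to zero mean and unit standard
    deviation, leaving the constant x0 = 1 column untouched."""
    X = X.astype(float)
    mu = X[:, 1:].mean(axis=0)
    sigma = X[:, 1:].std(axis=0)
    X[:, 1:] = (X[:, 1:] - mu) / sigma
    return X, mu, sigma   # keep mu, sigma to normalize new inputs identically
```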
How to Choose Learning Rate?
❖ Make sure gradient descent is working correctly.
❖ Plot $J(\theta)$ against the number of iterations; if gradient descent is working, $J(\theta)$ decreases on every iteration.
[Figure: $\min_\theta J(\theta)$ vs. number of iterations]
❖ Example automatic convergence test: declare convergence if $J(\theta)$ decreases by less than $10^{-3}$ in one iteration.
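❖ A sketch of this test wrapped around a descent loop (step_fn and cost_fn are hypothetical callables for one update and for $J(\theta)$; the $10^{-3}$ threshold follows the slide):

```python
def run_until_converged(theta, step_fn, cost_fn, tol=1e-3, max_iters=10000):
    prev_cost = cost_fn(theta)
    for _ in range(max_iters):
        theta = step_fn(theta)        # one gradient descent update
        cost = cost_fn(theta)
        if prev_cost - cost < tol:    # automatic convergence test
            break
        prev_cost = cost
    return theta
```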
Learning Rate
❖ If $\alpha$ is too small, gradient descent can be slow.
❖ If $\alpha$ is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
❖ Try:
❑ …, 0.001, 0.003, …, 0.01, 0.03, … , 0.1, 0.3, …
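❖ One way to run this search (a sketch reusing gradient_descent_multi from the earlier slide; X and y are assumed to be prepared already):

```python
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    theta = gradient_descent_multi(X, y, alpha=alpha, num_iters=400)
    cost = ((X @ theta - y) ** 2).sum() / (2 * len(y))   # J(theta)
    print(f"alpha = {alpha}: J = {cost:.4f}")
```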
Polynomial Regression
❖ $y = -0.08x^2 + 1.4x - 0.1$
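❖ Polynomial regression can reuse the linear-regression machinery by treating powers of $x$ as extra features. A sketch (the inputs are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])      # illustrative inputs
X = np.column_stack([np.ones_like(x),   # x0 = 1
                     x,                  # x1 = x
                     x ** 2])            # x2 = x^2 as a new feature
# X can now go through the same gradient descent or normal equation code.
# Feature scaling matters here: x and x^2 have very different ranges.
```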
Analytical Solution of Linear Regression
❖ Example: $m = 4$ training examples. [Table of sample data]
Normal Equation
❖ $m$ examples: $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$
❖ $n$ features: $x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1}$
❖ Example: if $x^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \end{bmatrix}$, then $X = \begin{bmatrix} 1 & x_1^{(1)} \\ 1 & x_1^{(2)} \\ \vdots & \vdots \\ 1 & x_1^{(m)} \end{bmatrix}$ and $y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$
❖ $\theta = (X^T X)^{-1} X^T y$
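❖ A NumPy sketch of the normal equation (pinv is used instead of a plain inverse so the sketch also survives a singular $X^T X$, discussed later):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: theta = (X^T X)^(-1) X^T y."""
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)
```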
Plots of Sample Data
[Figures: best-fit lines for the sample data obtained using gradient descent and using the normal equation]
Gradient Descent vs Normal Eq.
($m$ = number of training examples, $n$ = number of features)
❖ Gradient descent
❑ Need to choose $\alpha$
❑ Needs many iterations
❑ Works well even when $n$ is large
❖ Normal equation
❑ No need to choose $\alpha$
❑ No need to iterate
❑ Need to compute $(X^T X)^{-1}$
o Complexity: $O(n^3)$
❑ Slow if $n$ is very large
Non-Invertible/ Singular/ Degenerate
❖ Matrices that cannot be inverted are called non-invertible, singular, or degenerate matrices
❑ Normal equation: $\theta = (X^T X)^{-1} X^T y$, which fails if $X^T X$ is singular
❖ Causes
o Redundant features (linearly dependent)
➢ Example: $x_1$ = size in feet², $x_2$ = size in m²
o Too many features (e.g., $m \le n$): more features than training examples
➢ Delete some features, or use regularization.
❖ Solution:
❑ Use pinv(X) in Matlab
o Moore-Penrose pseudo inverse
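❖ The NumPy counterpart of Matlab's pinv (a sketch; X and y as before):

```python
import numpy as np

# Moore-Penrose pseudoinverse: gives a least-squares solution even when
# X^T X is singular, e.g. with redundant features or m <= n.
theta = np.linalg.pinv(X) @ y
```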
Assignment
❖ Find the values of 𝜃 to best fit the sample data using
❑ Gradient descent:
o Size -> price
o Size, beds -> price
❑ Normal equation
o Size -> price
o Size, beds -> price
❖ Plots
❑ Best fit lines
❑ Cost functions (2D and 3D)
❖ Do’s and don’ts
❑ Do not use any library
❑ Only use gradient descent algorithm and normal equation given in the
slides
Assignment (Cont.)
❖ Any Tool
❑ Java, Python, Matlab, Octave, etc.
❖ Submit
❑ What files:
o Source code with proper documentation
❑ To: rakib@bau.edu.bd
❑ Deadline: 12-Nov-2020
❖ Marks:
❑ 20
❖ Marks deduction:
❑ 10 marks deducted per day for submission after the deadline
❑ X% deduction for X% similarity.
❑ 0 for > 50% similarity.