UCK358E – INTRODUCTION TO ARTIFICIAL INTELLIGENCE
SPRING ‘24
LECTURE 3
LINEAR AND POLYNOMIAL REGRESSION
Instructor: Barış Başpınar
Model Representation
[Scatter plot: Housing Prices (Portland, OR) — Price (in 1000s of dollars) vs. Size (feet²)]

Supervised Learning: we are given the "right answer" for each example in the data.
Regression Problem: predict a real-valued output.
Model Representation
Training set of housing prices (Portland, OR):

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …                          (m = 50)

Notation:
m = number of training examples
x's = "input" variable / features
y's = "output" variable / "target" variable
(x^(i), y^(i)) → the i-th training example
e.g. (x^(1), y^(1)) = (2104, 460), (x^(2), y^(2)) = (1416, 232)
Model Representation: linear regression
Training Set → Learning Algorithm → hypothesis h
Size of house x → h → Estimated house price

How do we represent h?
h_θ(x) = θ0 + θ1·x
[Plot: straight-line fit of price ($1000s) vs. size (feet²)]

Linear regression with one variable (univariate linear regression).
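As a minimal sketch (my own illustration, not from the slides), the univariate hypothesis can be written directly in Python; the function name and parameter values below are purely illustrative.

```python
def h(theta0, theta1, x):
    """Univariate linear regression hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Example: predicted price (in $1000s) of a 1250 ft^2 house,
# using illustrative parameter values theta0 = 50, theta1 = 0.15.
print(h(50.0, 0.15, 1250.0))  # -> 237.5
```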
Cost Function
Training Set:

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …                          (m = 50)

Hypothesis: h_θ(x) = θ0 + θ1·x
θ0, θ1: parameters
How do we choose θ0, θ1?
Cost Function
[Three example hypotheses plotted for x, y ∈ [0, 3]:
 θ0 = 1.5, θ1 = 0   → h(x) = 1.5 + 0·x
 θ0 = 0,   θ1 = 0.5 → h(x) = 0.5·x
 θ0 = 1,   θ1 = 0.5 → h(x) = 1 + 0.5·x]
Cost Function
Idea: choose θ0, θ1 so that h_θ(x) is close to y for our training examples (x, y).

Cost function (squared error function):
J(θ0, θ1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
where m is the number of training examples.

Goal: minimize J(θ0, θ1) over θ0, θ1.
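A short, vectorized sketch of this cost function (my own illustration, assuming NumPy arrays x and y of length m):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    residuals = theta0 + theta1 * x - y   # h_theta(x_i) - y_i for all i
    return np.sum(residuals ** 2) / (2 * m)
```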
Cost Function Intuition
Simplified setting: set θ0 = 0, so the line passes through the origin.

Hypothesis:     h_θ(x) = θ1·x
Parameter:      θ1
Cost function:  J(θ1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
Goal:           minimize J(θ1) over θ1

[Plot: toy training set with the three points (1, 1), (2, 2), (3, 3)]
Cost Function Intuition
Left: h_θ(x) (for fixed θ1, this is a function of x).   Right: J(θ1) (a function of the parameter θ1).

For θ1 = 1, the line passes exactly through the three training points (1, 1), (2, 2), (3, 3):

J(θ1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
      = (1/2m) Σ_{i=1}^{m} (θ1·x^(i) − y^(i))²

J(1) = (1/(2·3)) · (0² + 0² + 0²) = 0
Cost Function Intuition
Left: h_θ(x) (for fixed θ1, this is a function of x).   Right: J(θ1) (a function of the parameter θ1).

For θ1 = 0.5:

J(0.5) = (1/(2·3)) · ((0.5 − 1)² + (1 − 2)² + (1.5 − 3)²) = 3.5/6 ≈ 0.58
Cost Function Intuition
Left: h_θ(x) (for fixed θ1, this is a function of x).   Right: J(θ1) (a function of the parameter θ1).

For θ1 = 0:

J(0) = (1/(2·3)) · ((0 − 1)² + (0 − 2)² + (0 − 3)²) = 14/6 ≈ 2.33

Plotting J(θ1) for many values of θ1 traces out a bowl-shaped curve; the best fit is the value that
solves min_{θ1} J(θ1), which here is θ1 = 1.
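These worked values can be checked numerically. The toy dataset below is the one implied by the calculations above; the code is a small illustration of my own.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def J(theta1):
    """Cost for the simplified hypothesis h(x) = theta1 * x (theta0 = 0)."""
    m = len(x)
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

print(J(1.0))   # -> 0.0
print(J(0.5))   # -> ~0.58
print(J(0.0))   # -> ~2.33
```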
Cost Function Intuition
Back to the full problem:

Hypothesis:     h_θ(x) = θ0 + θ1·x
Parameters:     θ0, θ1
Cost function:  J(θ0, θ1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
Goal:           minimize J(θ0, θ1) over θ0, θ1
Cost Function Intuition
Left: h_θ(x) (for fixed θ0, θ1, this is a function of x) — price ($1000s) vs. size (feet²).
Right: J(θ0, θ1) (a function of the parameters θ0, θ1), now a surface over the (θ0, θ1) plane.
Cost Function Intuition
The surface J(θ0, θ1) can also be visualized with contour plots: each contour is a set of (θ0, θ1) values with equal cost.
Cost Function Intuition
[Examples: each choice of (θ0, θ1) is a point on the contour plot of J(θ0, θ1) (right) and corresponds
to one hypothesis line h_θ(x) plotted against the data (left); points near the minimum of J give lines
that fit the training data well.]
Gradient Descent
Have some function J(θ0, θ1).
Want min_{θ0, θ1} J(θ0, θ1).

Outline:
• Start with some θ0, θ1 (e.g., θ0 = 0, θ1 = 0).
• Keep changing θ0, θ1 to reduce J(θ0, θ1), until we hopefully end up at a minimum.
Gradient Descent
[3D surface plot of J(θ0, θ1) over the (θ0, θ1) plane: starting from different initial points,
gradient descent can descend into different local minima.]
Gradient Descent Algorithm
Repeat until convergence (simultaneously for j = 0 and j = 1):
    θ_j := θ_j − α · (∂/∂θ_j) J(θ0, θ1)

α is the learning rate: it controls how big a step we take on each update.

Correct (simultaneous update):
    temp0 := θ0 − α · (∂/∂θ0) J(θ0, θ1)
    temp1 := θ1 − α · (∂/∂θ1) J(θ0, θ1)
    θ0 := temp0
    θ1 := temp1

Incorrect (sequential update):
    temp0 := θ0 − α · (∂/∂θ0) J(θ0, θ1)
    θ0 := temp0
    temp1 := θ1 − α · (∂/∂θ1) J(θ0, θ1)   ← this derivative is evaluated with the already-updated θ0
    θ1 := temp1
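A small illustrative sketch of the simultaneous update for a single gradient-descent step (my own code; grad0 and grad1 are placeholders for the two partial derivatives, however they are computed):

```python
def gradient_step(theta0, theta1, grad0, grad1, alpha):
    """One gradient-descent step with a *simultaneous* update of both parameters.

    grad0, grad1 are dJ/dtheta0 and dJ/dtheta1 evaluated at the current (theta0, theta1).
    """
    temp0 = theta0 - alpha * grad0
    temp1 = theta1 - alpha * grad1
    return temp0, temp1  # both computed from the old (theta0, theta1)
```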
Gradient Descent Intuition
Consider the one-parameter case, θ1 := θ1 − α · (d/dθ1) J(θ1):

• If the slope (d/dθ1) J(θ1) ≥ 0 (positive slope), the update decreases θ1, moving it toward the minimum.
• If the slope (d/dθ1) J(θ1) ≤ 0 (negative slope), the update increases θ1, again moving it toward the minimum.
Gradient Descent Intuition
If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
Gradient Descent Intuition
Gradient descent can converge to a local minimum even with the learning rate α fixed:
at a local optimum the derivative is zero, so the update leaves θ1 unchanged, and as we approach
a local minimum the derivative shrinks, so gradient descent automatically takes smaller steps.
There is no need to decrease α over time.
Gradient Descent for Linear Regression
Apply the gradient descent algorithm to the linear regression model:

Gradient descent algorithm:
    repeat until convergence { θ_j := θ_j − α · (∂/∂θ_j) J(θ0, θ1) }   (for j = 0 and j = 1)

Linear regression model:
    h_θ(x) = θ0 + θ1·x
    J(θ0, θ1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
Gradient Descent for Linear Regression
In order to implement this algorithm, we need to calculate the partial derivatives:
(∂/∂θ_j) J(θ0, θ1) = (∂/∂θ_j) (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
                   = (∂/∂θ_j) (1/2m) Σ_{i=1}^{m} (θ0 + θ1·x^(i) − y^(i))²

(∂/∂θ0) J(θ0, θ1) = (1/m) Σ_{i=1}^{m} (θ0 + θ1·x^(i) − y^(i))          = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))

(∂/∂θ1) J(θ0, θ1) = (1/m) Σ_{i=1}^{m} (θ0 + θ1·x^(i) − y^(i)) · x^(i)  = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
Gradient Descent for Linear Regression
Substituting the partial derivatives into the update rule:

repeat until convergence {
    θ0 := θ0 − α · (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
    θ1 := θ1 − α · (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
}   (update θ0 and θ1 simultaneously)
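Putting the pieces together, here is a minimal sketch of batch gradient descent for univariate linear regression (my own illustration; x, y, alpha, and num_iters are assumed inputs):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x with squared error cost."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        errors = theta0 + theta1 * x - y      # h_theta(x_i) - y_i for all i
        grad0 = np.sum(errors) / m            # dJ/dtheta0
        grad1 = np.sum(errors * x) / m        # dJ/dtheta1
        # tuple assignment updates both parameters simultaneously
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Toy data from the cost-function example; the fit should approach theta0 = 0, theta1 = 1.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent(x, y, alpha=0.1, num_iters=2000))  # -> close to (0.0, 1.0)
```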
Gradient Descent for Linear Regression
[Sequence of snapshots: at each gradient-descent iteration, the current (θ0, θ1) moves across the
contour plot of J(θ0, θ1) toward its minimum, while the corresponding hypothesis line h_θ(x) fits
the housing data (price vs. size) increasingly well.]
Multiple Features (Variables)
Size (feet²)   Number of      Number of    Age of home    Price ($1000)
x1             bedrooms x2    floors x3    (years) x4     y
2104           5              1            45             460
1416           3              2            40             232
1534           3              2            30             315
852            2              1            36             178
…              …              …            …              …

Notation:
n = number of features
x^(i) = input (features) of the i-th training example
x_j^(i) = value of feature j in the i-th training example

E.g. x^(2) = [1416, 3, 2, 40]ᵀ and x_3^(2) = 2.
Multiple Features (Variables)
Hypothesis:
Previously: h_θ(x) = θ0 + θ1·x
Now:        h_θ(x) = θ0 + θ1·x1 + θ2·x2 + θ3·x3 + θ4·x4

For convenience of notation, define x0 = 1, so that
    x = [x0, x1, …, xn]ᵀ,   θ = [θ0, θ1, …, θn]ᵀ,   and   h_θ(x) = θᵀx.

This is multivariate linear regression.
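A quick sketch of the vectorized hypothesis θᵀx (my own illustration; assumes NumPy and that x already includes the x0 = 1 entry, with purely illustrative parameter values):

```python
import numpy as np

theta = np.array([80.0, 0.1, 10.0, 3.0, -2.0])   # [theta0, theta1, ..., theta4], illustrative values
x = np.array([1.0, 1416.0, 3.0, 2.0, 40.0])      # [x0 = 1, size, bedrooms, floors, age]

h = theta @ x                                    # h_theta(x) = theta^T x
print(h)
```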
Gradient Descent with Multiple Variables
Previously (n = 1):
repeat {
    θ0 := θ0 − α · (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
    θ1 := θ1 − α · (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
}   (simultaneously update θ0, θ1)

New algorithm (n ≥ 1):
repeat {
    θ_j := θ_j − α · (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_j^(i)
}   (simultaneously update θ_j for j = 0, 1, …, n, with x_0^(i) = 1)
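A vectorized sketch of the multivariate update (my own illustration; X is assumed to be the m × (n+1) design matrix whose first column is all ones):

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for h(x) = theta^T x; X includes the x0 = 1 column."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        errors = X @ theta - y              # h_theta(x^(i)) - y^(i) for all i
        gradient = (X.T @ errors) / m       # (1/m) * sum of errors * x_j^(i), for each j
        theta = theta - alpha * gradient    # simultaneous update of all theta_j
    return theta
```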
Gradient Descent: Feature Scaling
Idea: make sure features are on a similar scale, so that the contours of J(θ) are closer to circles and gradient descent converges faster.
E.g. x1 = size (0–2000 feet²), x2 = number of bedrooms (1–5): divide each feature by its range,
    x1 ← size (feet²) / 2000,    x2 ← number of bedrooms / 5.
Gradient Descent: Feature Scaling – Mean Normalization
Replace x_j with x_j − μ_j to make features have approximately zero mean (do not apply to x0 = 1).

E.g.  x1 ← (x1 − μ1) / s1,    x2 ← (x2 − μ2) / s2

where μ_j is the average value of x_j in the training set, and s_j is the range (max − min) or the standard deviation of x_j.
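A small illustrative sketch of mean normalization with NumPy (my own code; X_raw is assumed to hold the features without the x0 column):

```python
import numpy as np

def normalize_features(X_raw):
    """Mean-normalize each feature column: (x_j - mu_j) / s_j, using the standard deviation as s_j."""
    mu = X_raw.mean(axis=0)
    sigma = X_raw.std(axis=0)
    return (X_raw - mu) / sigma, mu, sigma

# The same mu and sigma must later be applied to any new example before prediction.
```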
Gradient Descent: Learning Rate
Making sure gradient descent is working correctly:
[Plot J(θ) against the number of iterations. If gradient descent is working, J(θ) decreases after every
iteration and flattens out as it converges; if J(θ) increases or oscillates, gradient descent is not
working — use a smaller α.]

- For sufficiently small α, J(θ) should decrease on every iteration.
- But if α is too small, gradient descent can be slow to converge.
Gradient Descent: Learning Rate
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration; it may not converge.

To choose α, try a range of values, each roughly 3× the previous one:
    …, 0.001, 0.003, 0.01, 0.03, 0.1, …
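One way to run this sweep in code (an illustrative sketch of my own, plotting J(θ) per iteration for several candidate learning rates; X is assumed to include the x0 = 1 column):

```python
import matplotlib.pyplot as plt
import numpy as np

def learning_rate_sweep(X, y, alphas=(0.001, 0.003, 0.01, 0.03, 0.1), num_iters=200):
    """Plot J(theta) vs. iteration number for several candidate learning rates."""
    m = len(y)
    for alpha in alphas:
        theta = np.zeros(X.shape[1])
        costs = []
        for _ in range(num_iters):
            errors = X @ theta - y
            costs.append(np.sum(errors ** 2) / (2 * m))   # J(theta) at this iteration
            theta -= alpha * (X.T @ errors) / m
        plt.plot(costs, label=f"alpha = {alpha}")
    plt.xlabel("No. of iterations")
    plt.ylabel("J(theta)")
    plt.legend()
    plt.show()
```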
Features and Polynomial Regression
Housing prices prediction: instead of using frontage and depth as two separate features,
    h_θ(x) = θ0 + θ1·frontage + θ2·depth,
we can define a single, more informative feature, the area x = frontage × depth, and use
    h_θ(x) = θ0 + θ1·x.
Features and Polynomial Regression
Choice of features: the hypothesis does not have to be linear in the raw input. For a curve of price (y)
against size (x), we can define new features from powers of the size, e.g. x1 = size, x2 = size², x3 = size³,
and fit a polynomial model such as
    h_θ(x) = θ0 + θ1·(size) + θ2·(size)² + θ3·(size)³
with the same machinery as multivariate linear regression. With such features, feature scaling becomes very important.
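A minimal sketch of polynomial regression via constructed features (my own illustration; it reuses the normalize_features and gradient_descent_multi sketches above, and the variable names size and price are assumptions):

```python
import numpy as np

def polynomial_design_matrix(size, degree=3):
    """Build mean-normalized features [size, size^2, ..., size^degree], plus the x0 = 1 column."""
    powers = np.column_stack([size ** d for d in range(1, degree + 1)])
    powers_norm, mu, sigma = normalize_features(powers)   # scaling matters for the higher powers
    ones = np.ones((len(size), 1))
    return np.hstack([ones, powers_norm]), mu, sigma

# X, mu, sigma = polynomial_design_matrix(size)
# theta = gradient_descent_multi(X, price, alpha=0.1, num_iters=5000)
```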
Normal Equation
Normal equation: a method to solve for θ analytically, in one step, without iterating.

Intuition (1D, θ a scalar): J(θ) is a quadratic in θ, so set dJ/dθ = 0 and solve for θ.

For θ ∈ R^(n+1): set (∂/∂θ_j) J(θ) = 0 for every j, and solve for θ0, θ1, …, θn.
The solution is
    θ = (XᵀX)⁻¹ Xᵀ y.
Normal Equation
Example (m = 4 training examples, with the x0 = 1 column added):

x0   Size (feet²)   Number of      Number of    Age of home    Price ($1000)
     x1             bedrooms x2    floors x3    (years) x4     y
1    2104           5              1            45             460
1    1416           3              2            40             232
1    1534           3              2            30             315
1    852            2              1            36             178

X is the m × (n+1) matrix whose rows are the inputs above (without the price column), y is the m-vector
of prices, and θ = (XᵀX)⁻¹ Xᵀ y.
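A short sketch of the normal equation with NumPy on the table above (my own illustration; pinv is used here so the code also copes with a non-invertible XᵀX, as with these 4 examples and 5 columns):

```python
import numpy as np

# Design matrix with the x0 = 1 column, and the target vector, from the table above.
X = np.array([[1, 2104, 5, 1, 45],
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

theta = np.linalg.pinv(X.T @ X) @ X.T @ y   # theta = (X^T X)^(-1) X^T y
print(theta)
```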
Gradient Descent vs Normal Equation
m training examples, n features.

Gradient Descent:
• Need to choose α.
• Needs many iterations.
• Works well even when n is large.

Normal Equation:
• No need to choose α.
• No need to iterate.
• Need to compute (XᵀX)⁻¹.
• Slow if n is very large.
References
A. Ng. Machine Learning, Lecture Notes.
I. Goodfellow, Y. Bengio and A. Courville, “Deep Learning”, 2016.