Linear Regression
Linear Regression
• Linear regression, or ordinary least squares (OLS), is the simplest and most classic linear method
for regression.
• Linear regression finds the parameters w and b that minimize the mean squared error between
predictions and the true regression targets, y, on the training set.
• The mean squared error is the average of the squared differences between the predictions and the true values.
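As a concrete illustration, here is a minimal NumPy sketch of that quantity; the candidate parameters w and b are illustrative values, not taken from the slides.

```python
import numpy as np

# Toy training data: house sizes and prices in $1000s (from the table below)
X = np.array([2104., 1416., 1534., 852.])
y = np.array([460., 232., 315., 178.])

# A candidate linear model: predictions = w * x + b
w, b = 0.2, 10.0
predictions = w * X + b

# Mean squared error: the average of the squared differences
mse = np.mean((predictions - y) ** 2)
print(f"MSE = {mse:.2f}")
```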
Linear regression with one variable
Model representation
[Figure: Housing Prices (Portland, OR), price (in 1000s of dollars) vs. size (feet²)]
Supervised Learning: the “right answer” is given for each example in the data.
Regression problem: predict a real-valued output. (Classification: discrete output values.)
Training set of housing prices (Portland, OR):

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …
Notation:
m = Number of training examples
x’s = “input” variable / features
y’s = “output” variable / “target” variable
(x, y) = one training example
$(x^{(i)}, y^{(i)})$ = the $i$-th training example
Training Set → Learning Algorithm → hypothesis $h$
The hypothesis $h$ maps from $x$ (size of house) to the estimated $y$ (estimated price).
How do we represent $h$?
$h_\theta(x) = \theta_0 + \theta_1 x$
Linear regression with one variable: univariate linear regression.
Linear regression with one variable
Cost function
Training Set ($m = 47$):

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
$\theta_0, \theta_1$: parameters
How to choose $\theta_0, \theta_1$?
[Figure: three example lines $h_\theta(x)$ for different choices of the parameters $\theta_0$ and $\theta_1$]
Idea: Choose $\theta_0, \theta_1$ so that $h_\theta(x)$ is close to $y$ for our training examples $(x, y)$.
Linear regression with one variable
Cost function intuition I
Simplified setting: $\theta_0 = 0$.
Hypothesis: $h_\theta(x) = \theta_1 x$
Parameter: $\theta_1$
Cost Function: $J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Goal: $\min_{\theta_1} J(\theta_1)$
[Figures: left panels show $h_\theta(x)$ for a fixed $\theta_1$ (a function of $x$); right panels show $J(\theta_1)$ (a function of the parameter $\theta_1$), traced out as $\theta_1$ varies]
Linear regression with one variable
Cost function intuition II
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
Parameters: $\theta_0, \theta_1$
Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
[Figure: $h_\theta(x)$ for fixed $\theta_0, \theta_1$ plotted over the training data, price ($ in 1000's) vs. size in feet² (x)]
[Figures: left, $h_\theta(x)$ for fixed $\theta_0, \theta_1$ (a function of $x$); right, a contour plot of $J(\theta_0, \theta_1)$ (a function of the parameters $\theta_0, \theta_1$)]
Each contour line shows the set of values of the parameters $(\theta_0, \theta_1)$ that share the same value of $J$.
Linear regression with one variable
Gradient descent
Have some function $J(\theta_0, \theta_1)$.
Want $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$.
Outline:
• Start with some initial $\theta_0, \theta_1$.
• Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum.
[Figures: surface plots of $J(\theta_0, \theta_1)$]
Gradient descent algorithm:
repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$   (for $j = 0$ and $j = 1$)
}
Correct (simultaneous update): compute $\text{temp0} := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$ and $\text{temp1} := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$, then assign $\theta_0 := \text{temp0}$, $\theta_1 := \text{temp1}$.
Incorrect: updating $\theta_0$ first and then using the new $\theta_0$ when computing the update for $\theta_1$.
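A minimal Python sketch of the simultaneous update; grad_J is a hypothetical helper assumed to return both partial derivatives evaluated at the current parameters.

```python
def gradient_descent_step(theta0, theta1, alpha, grad_J):
    """One simultaneous update of (theta0, theta1).

    grad_J(theta0, theta1) is assumed to return
    (dJ/dtheta0, dJ/dtheta1) at the *current* parameters.
    """
    d0, d1 = grad_J(theta0, theta1)
    # Compute both updates from the old values...
    temp0 = theta0 - alpha * d0
    temp1 = theta1 - alpha * d1
    # ...then assign them together (simultaneous update).
    return temp0, temp1
```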
Linear regression with one variable
Gradient descent intuition
Gradient descent algorithm
If α is too small, gradient descent
can be slow.
If α is too large, gradient descent
can overshoot the minimum. It may
fail to converge, or even diverge.
At a local optimum the derivative is zero, $\frac{d}{d\theta_1} J(\theta_1) = 0$, so the update leaves the current value of $\theta_1$ unchanged.
Gradient descent can converge to a local
minimum, even with the learning rate α fixed.
As we approach a local
minimum, gradient
descent will automatically
take smaller steps. So, no
need to decrease α over
time.
Linear regression with one variable
Gradient descent for linear regression
Applying the gradient descent algorithm to the linear regression model $h_\theta(x) = \theta_0 + \theta_1 x$ with $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$ gives:
repeat until convergence {
    $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
    $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
}
Update $\theta_0$ and $\theta_1$ simultaneously.
The cost function $J(\theta_0, \theta_1)$ for linear regression is always a bowl-shaped, convex quadratic function, so it has a single global minimum and no other local optima.
[Figures: a sequence of gradient descent steps; left, $h_\theta(x)$ for the current $\theta_0, \theta_1$ (a function of $x$); right, the corresponding position on the contour plot of $J(\theta_0, \theta_1)$ (a function of the parameters)]
“Batch” Gradient Descent
“Batch”: Each step of gradient descent
uses all the training examples.
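A compact NumPy sketch of batch gradient descent for univariate linear regression; the learning rate, iteration count, and toy data are illustrative choices.

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        errors = (theta0 + theta1 * x) - y   # uses ALL m examples each step
        # Simultaneous update of both parameters
        grad0 = errors.mean()
        grad1 = (errors * x).mean()
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Example on small, well-scaled data
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.5, 3.5])
print(batch_gradient_descent(x, y))
```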
Other variants of gradient descent (see the sketch below):
• Stochastic Gradient Descent
  • We repeatedly run through the training set.
  • Each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only.
• Mini-Batch Gradient Descent
  • Each step uses a small subset (mini-batch) of the entire dataset.
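A minimal sketch contrasting the stochastic and mini-batch updates for the same univariate model; the batch size and learning rate are illustrative.

```python
import numpy as np

def sgd_epoch(x, y, theta0, theta1, alpha=0.01):
    """One pass over the data, updating after every single example."""
    for xi, yi in zip(x, y):
        error = (theta0 + theta1 * xi) - yi
        theta0 -= alpha * error
        theta1 -= alpha * error * xi
    return theta0, theta1

def minibatch_epoch(x, y, theta0, theta1, alpha=0.01, batch_size=2):
    """One pass over the data, updating on small random batches."""
    idx = np.random.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        errors = (theta0 + theta1 * x[b]) - y[b]
        theta0 -= alpha * errors.mean()
        theta1 -= alpha * (errors * x[b]).mean()
    return theta0, theta1
```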
Linear Regression with multiple variables
Multiple features
Multiple features (variables).

Size (feet²)    Price ($1000)
2104            460
1416            232
1534            315
852             178
…               …
Multiple features (variables).

Size (feet²)    Number of bedrooms    Number of floors    Age of home (years)    Price ($1000)
2104            5                     1                   45                     460
1416            3                     2                   40                     232
1534            3                     2                   30                     315
852             2                     1                   36                     178
…               …                     …                   …                      …
Notation:
$n$ = number of features
$x^{(i)}$ = input (features) of the $i$-th training example
$x_j^{(i)}$ = value of feature $j$ in the $i$-th training example
For convenience of notation, define $x_0 = 1$, so each input is the $1 \times (n+1)$ row vector $x = [x_0, x_1, \dots, x_n]$ and the hypothesis becomes $h_\theta(x) = \theta^T x$.
Multivariate linear regression.
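A short NumPy sketch of the $x_0 = 1$ convention and the vectorized hypothesis $h_\theta(x) = \theta^T x$; the feature values come from the table above, while theta is an arbitrary illustrative vector.

```python
import numpy as np

# Features: size, bedrooms, floors, age (from the table above)
X = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [1534, 3, 2, 30],
    [ 852, 2, 1, 36],
], dtype=float)

# Prepend x0 = 1 to every example -> design matrix of shape (m, n+1)
X = np.hstack([np.ones((X.shape[0], 1)), X])

theta = np.array([80.0, 0.1, 10.0, 3.0, -2.0])  # illustrative parameters

# Vectorized hypothesis for all examples at once
predictions = X @ theta
print(predictions)
```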
Linear Regression with multiple variables
Gradient descent for multiple variables
Hypothesis: $h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$ (with $x_0 = 1$)
Parameters: $\theta = [\theta_0, \theta_1, \dots, \theta_n]^T$
Cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Gradient descent:
Repeat {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
} (simultaneously update for every $j = 0, \dots, n$)

New algorithm ($n \ge 1$):
Repeat {
    $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
} (simultaneously update $\theta_j$ for $j = 0, \dots, n$)

Previously ($n = 1$):
Repeat {
    $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
    $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
} (simultaneously update $\theta_0, \theta_1$)
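A vectorized NumPy sketch of this update, assuming X already includes the $x_0 = 1$ column; the learning rate and iteration count are illustrative.

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, num_iters=1000):
    """Gradient descent for multivariate linear regression.

    X: (m, n+1) design matrix with a leading column of ones.
    y: (m,) target vector.
    """
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(num_iters):
        errors = X @ theta - y            # (m,)
        gradient = (X.T @ errors) / m     # (n+1,)
        theta -= alpha * gradient         # simultaneous update of all theta_j
    return theta
```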
Linear Regression with multiple variables
Gradient descent in practice I: Feature Scaling
Feature Scaling
Idea: Make sure features are on a similar scale.
E.g. $x_1$ = size (0-2000 feet²), $x_2$ = number of bedrooms (1-5).
Feature Scaling
Get every feature into approximately a $-1 \le x_j \le 1$ range.
Mean normalization
Replace $x_j$ with $x_j - \mu_j$ to make features have approximately zero mean (do not apply to $x_0 = 1$).
E.g. $x_j := \dfrac{x_j - \mu_j}{s_j}$, where $\mu_j$ is the mean of feature $j$ and $s_j$ is its range (max minus min) or standard deviation.
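A small NumPy sketch of mean normalization that divides by the standard deviation (one common choice of $s_j$); the leading column of ones is left untouched.

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature column (except x0) to roughly zero mean, unit scale."""
    X = X.astype(float).copy()
    mu = X[:, 1:].mean(axis=0)           # per-feature means
    sigma = X[:, 1:].std(axis=0)         # per-feature standard deviations
    X[:, 1:] = (X[:, 1:] - mu) / sigma   # do not touch the x0 = 1 column
    return X, mu, sigma
```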
Linear Regression with multiple variables
Gradient descent in practice II: Learning rate
Gradient descent
- “Debugging”: how to make sure gradient descent is working correctly.
- How to choose the learning rate $\alpha$.
Making sure gradient descent is working correctly: plot $J(\theta)$ against the number of iterations; $J(\theta)$ should decrease after every iteration.
Example automatic convergence test: declare convergence if $J(\theta)$ decreases by less than some small threshold $\epsilon$ in one iteration.
[Figure: $J(\theta)$ vs. number of iterations, flattening out as gradient descent converges]
Making sure gradient descent is working correctly:
[Figures: $J(\theta)$ increasing or oscillating with the number of iterations, a sign that gradient descent is not working; use a smaller $\alpha$]
- For sufficiently small $\alpha$, $J(\theta)$ should decrease on every iteration.
- But if $\alpha$ is too small, gradient descent can be slow to converge.
Summary:
- If $\alpha$ is too small: slow convergence.
- If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration; it may not converge.
To choose $\alpha$, try a range of values spaced roughly by factors of 3 (e.g. …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …) and pick the largest value for which $J(\theta)$ still decreases steadily.
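A small sketch of the convergence check described above: track $J(\theta)$ per iteration and stop once the decrease falls below a threshold; the threshold and learning rate are illustrative values.

```python
import numpy as np

def cost(X, y, theta):
    errors = X @ theta - y
    return (errors @ errors) / (2 * len(y))

def gradient_descent_with_check(X, y, alpha=0.01, max_iters=10000, eps=1e-3):
    """Run gradient descent, stopping when J decreases by less than eps."""
    theta = np.zeros(X.shape[1])
    history = [cost(X, y, theta)]
    for _ in range(max_iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / len(y)
        history.append(cost(X, y, theta))
        if history[-2] - history[-1] < eps:   # automatic convergence test
            break
    return theta, history
```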
Linear Regression with multiple variables
Features and polynomial regression
Housing prices prediction
Polynomial regression: instead of fitting a straight line to price (y) vs. size (x), define new features from powers of the size, e.g.
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (quadratic) or $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$ (cubic).
[Figure: price (y) vs. size (x) with a curved polynomial fit]
Choice of features
Other feature choices are possible, e.g. $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 \sqrt{x}$. With polynomial features, feature scaling becomes especially important, since $x$, $x^2$, and $x^3$ take values on very different scales.
[Figure: price (y) vs. size (x) comparing candidate feature choices]
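A minimal sketch of polynomial regression as linear regression on engineered features; the degree and data values are illustrative, and the fit uses a least-squares solver rather than gradient descent.

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Build columns [1, x, x^2, ..., x^degree] from a 1-D feature."""
    return np.column_stack([x ** d for d in range(degree + 1)])

# Illustrative data: house sizes (scaled) and prices
x = np.array([852., 1416., 1534., 2104.]) / 1000.0   # scale the feature first
y = np.array([178., 232., 315., 460.])

X_poly = polynomial_design_matrix(x, degree=2)
theta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)   # fit the linear model
print(theta)
```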
Linear Regression with multiple variables
Normal equation
Gradient descent finds $\theta$ iteratively; the normal equation is a method to solve for $\theta$ analytically in one step.
Intuition: if $\theta$ is a single scalar (1D), $J(\theta)$ is a quadratic in $\theta$; set $\frac{d}{d\theta} J(\theta) = 0$ and solve for $\theta$.
For $\theta \in \mathbb{R}^{n+1}$: set $\frac{\partial}{\partial \theta_j} J(\theta) = 0$ (for every $j$) and solve for $\theta_0, \theta_1, \dots, \theta_n$.
Examples: $m = 4$ training examples, $n = 4$ features.

x₀    Size (feet²)    Number of bedrooms    Number of floors    Age of home (years)    Price ($1000)
1     2104            5                     1                   45                     460
1     1416            3                     2                   40                     232
1     1534            3                     2                   30                     315
1     852             2                     1                   36                     178

E.g. stack the inputs (including $x_0 = 1$) as rows of the design matrix $X \in \mathbb{R}^{m \times (n+1)}$ and the targets as the vector $y \in \mathbb{R}^{m}$. Then
$\theta = (X^T X)^{-1} X^T y$,
where $(X^T X)^{-1}$ is the inverse of the matrix $X^T X$.
Octave: pinv(X'*X)*X'*y
$m$ training examples, $n$ features.

Gradient Descent:
• Need to choose $\alpha$.
• Needs many iterations.
• Works well even when $n$ is large.

Normal Equation:
• No need to choose $\alpha$.
• No need to iterate.
• Need to compute $(X^T X)^{-1}$.
• Slow if $n$ is very large.
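The Octave one-liner above carries over directly; here is a NumPy sketch of the same closed-form solution, using the pseudo-inverse as the slide does.

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution theta = pinv(X^T X) X^T y.

    X: (m, n+1) design matrix including the column of ones.
    y: (m,) target vector.
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```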
Ridge Regression (L2 Regularization)
• Adds an L2 penalty to the least-squares cost: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$.
• The regularization parameter $\lambda \ge 0$ controls how strongly the coefficients are shrunk toward zero.
• Shrinks coefficients but does not set them exactly to zero, so all features are kept.
• Helps when features are correlated or when the unregularized model overfits.
Lasso (Least Absolute Shrinkage and Selection Operator) (L1 Regularization)
• Adds an L1 penalty to the least-squares cost: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{m} \sum_{j=1}^{n} |\theta_j|$.
• The L1 penalty can drive some coefficients to exactly zero, so lasso performs automatic feature selection.
• Useful when only a subset of the features is expected to matter.
• Has no closed-form solution in general; it is typically solved with iterative methods such as coordinate descent.
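A brief sketch of both penalties using scikit-learn; the alpha values are illustrative, and scikit-learn's alpha plays the role of the regularization strength $\lambda$.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Illustrative data: the features and prices from the earlier table
X = np.array([[2104, 5, 1, 45],
              [1416, 3, 2, 40],
              [1534, 3, 2, 30],
              [ 852, 2, 1, 36]], dtype=float)
y = np.array([460., 232., 315., 178.])

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can zero out coefficients

print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)
```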
Logistic Regression
Logistic Regression*
The starting point is linear regression, $h_\theta(x) = \theta^T x$, applied to a classification problem such as predicting malignancy from tumor size, with the prediction thresholded at 0.5.
[Figures: malignancy (0/1) vs. tumor size, with a fitted line and the 0.5 threshold]
*The name is given due to historical reasons.
Logistic Regression
• Consider a binary classification problem where $y \in \{0, 1\}$.
• The classifier's threshold is at $h_\theta(x) = 0.5$:
  • If $h_\theta(x) \ge 0.5$, predict $y = 1$.
  • If $h_\theta(x) < 0.5$, predict $y = 0$.
• Logistic regression guarantees $0 \le h_\theta(x) \le 1$.
Logistic Regression
• In linear regression: $h_\theta(x) = \theta^T x$
• In logistic regression: $h_\theta(x) = \mathrm{sigm}(\theta^T x)$
• where $\mathrm{sigm}(z) = \dfrac{1}{1 + e^{-z}}$ (the sigmoid function or the logistic function)
• Hence the name logistic regression.
• But it is a classifier that is extended from linear regression.
• Finally, $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$
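A tiny sketch of the sigmoid and the logistic hypothesis; theta and x are illustrative values.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Logistic regression hypothesis: sigmoid(theta^T x)."""
    return sigmoid(theta @ x)

theta = np.array([-3.0, 1.0, 1.0])   # illustrative parameters
x = np.array([1.0, 2.0, 2.0])        # x0 = 1 plus two features
print(h(theta, x))                   # sigmoid(1.0), roughly 0.73
```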
Logistic Regression
• $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$
• The task is to select parameters $\theta$ to fit the data.
[Figure: the sigmoid curve, rising from 0 through 0.5 to 1]
Logistic Regression
• $h_\theta(x)$ = estimated probability that $y = 1$ on input $x$.
• For example, if
  $x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumor size} \end{bmatrix}$
  and $h_\theta(x) = 0.7$, there is a 70% chance of the tumor being malignant.
• $h_\theta(x) = p(y = 1 \mid x; \theta)$, the probability that $y = 1$ given $x$, parameterized by $\theta$.
• $p(y = 0 \mid x; \theta) + p(y = 1 \mid x; \theta) = 1$
• $p(y = 0 \mid x; \theta) = 1 - p(y = 1 \mid x; \theta)$
Logistic Regression
Decision Boundary
Linear case: $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ with $\theta = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix}$.
Predict $y = 1$ if $-3 + x_1 + x_2 \ge 0$, i.e. $x_1 + x_2 \ge 3$.
This is a linear decision boundary obtained from the logistic regression model.
[Figure: the line $x_1 + x_2 = 3$ in the $(x_1, x_2)$ plane separating the two predicted classes]
Logistic Regression
Decision Boundary
Nonlinear case: $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$ with $\theta = \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$.
Predict $y = 1$ if $-1 + x_1^2 + x_2^2 \ge 0$, i.e. $x_1^2 + x_2^2 \ge 1$.
[Figure: the unit circle $x_1^2 + x_2^2 = 1$ in the $(x_1, x_2)$ plane, with $y = 1$ predicted outside the circle and $y = 0$ inside]
Logistic Regression
Cost Function
The cost is what a learning algorithm (hypothesis) has to pay if its prediction is $h_\theta(x)$ when the actual label is $y$.
Linear regression model: $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$ (MSE: mean squared error)
$\mathrm{Cost}(h_\theta(x), y) = \frac{1}{2} \left( h_\theta(x) - y \right)^2$
If we use the same cost function for logistic regression, whose hypothesis is a nonlinear function, it results in a nonconvex $J(\theta)$.
Gradient Descent
$\theta^{\text{new}} = \theta^{\text{old}} + \Delta\theta$, where $\Delta\theta = -\alpha \left. \dfrac{\partial J(\theta)}{\partial \theta} \right|_{\theta = \theta^{\text{old}}}$
The parameter $\alpha > 0$ plays an important role in the convergence of the algorithm.
If $\alpha$ is too small, the correction is small and convergence to the optimum point is slow.
If $\alpha$ is too large, the algorithm may oscillate around the optimum value and convergence is not possible.
[Figure: geometric interpretation of parameter optimization; in the gradient descent scheme, the correction of the parameters takes place in the direction that decreases the value of the cost function]
Logistic Regression
Cost Function
• If $\alpha$ is properly chosen, the algorithm converges to a stationary point of $J(\theta)$, which can be
  • a local minimum (e.g. $\theta_1^0$),
  • a global minimum (e.g. $\theta^0$),
  • or a saddle point (e.g. $\theta_2^0$).
• Which stationary point the algorithm converges to depends on the position of the initial point relative to the stationary points.
• Furthermore, the convergence speed depends on the form of the cost $J(\theta)$.
Logistic Regression
Cost Function for Logistic Regression
Logistic regression model:
$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
Case $y = 1$:
Cost $= 0$ if $y = 1$ and $h_\theta(x) = 1$.
But as $h_\theta(x) \to 0$, Cost $\to \infty$.
That is, if $h_\theta(x) = 0$ the model claims $p(y = 1 \mid x; \theta) = 0$, yet in fact $y = 1$, so we penalize the learning algorithm with a very large cost.
[Figure: $-\log(h_\theta(x))$ plotted against $h_\theta(x) \in (0, 1]$]
Case $y = 0$:
Cost $= 0$ if $y = 0$ and $h_\theta(x) = 0$, but as $h_\theta(x) \to 1$, Cost $\to \infty$.
[Figure: $-\log(1 - h_\theta(x))$ plotted against $h_\theta(x) \in [0, 1)$]
Logistic Regression
Cost Function for Logistic Regression
$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$
$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
Since $y \in \{0, 1\}$, the two cases can be written in a single expression:
$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$
$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( -y^{(i)} \log h_\theta(x^{(i)}) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right)$
• This cost function is derived in statistics from the principle of maximum likelihood estimation, which gives an efficient way to find parameters for many different models.
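A direct NumPy sketch of this cost; the small epsilon inside the logs is purely for numerical safety and is not part of the slide's formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, eps=1e-12):
    """J(theta) = (1/m) * sum of -y*log(h) - (1-y)*log(1-h)."""
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h + eps) - (1 - y) * np.log(1 - h + eps))
```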
Logistic Regression
Cost Function for Logistic Regression
$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$
• To fit the parameters $\theta$: solve $\min_\theta J(\theta)$. But how?
• To make a prediction on new data $x$: output $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$
Logistic Regression
Cost Function for Logistic Regression
$\min_\theta J(\theta)$ is achieved through gradient descent.
$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right)$
Repeat (simultaneously updating all $\theta_j$) {
    $\theta_j := \theta_j - \alpha \dfrac{\partial J(\theta)}{\partial \theta_j}$
}
which works out to
$\theta_j := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
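A closing sketch that puts the pieces together: gradient descent on the logistic cost using the update rule above; the learning rate, iteration count, and toy data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, num_iters=5000):
    """Fit logistic regression by gradient descent.

    X: (m, n+1) design matrix including the x0 = 1 column.
    y: (m,) labels in {0, 1}.
    """
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        gradient = X.T @ (h - y) / m   # (1/m) * sum (h - y) * x_j
        theta -= alpha * gradient      # simultaneous update of all theta_j
    return theta

# Tiny illustrative dataset: y = 1 when x1 + x2 is large
X = np.array([[1, 0.5, 0.5],
              [1, 1.0, 1.0],
              [1, 2.0, 2.5],
              [1, 3.0, 2.0]], dtype=float)
y = np.array([0., 0., 1., 1.])
theta = logistic_gradient_descent(X, y)
print(theta, sigmoid(X @ theta))
```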