Regression
• Linear regression
• Logistic regression
[Figure: Housing Prices (Portland, OR): Price (in 1000s of dollars) plotted against Size (feet²).]
Supervised Learning: we are given the "right answer" for each example in the data.
Regression problem: predict real-valued output.
Classification problem: predict discrete-valued output.
Training set of housing prices (Portland, OR):

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …
Notation:
m = number of training examples
x's = "input" variable / features
y's = "output" variable / "target" variable
(x, y) = one training example
(x^{(i)}, y^{(i)}) = the i-th training example
E.g. x^{(1)} = 2104, x^{(2)} = 1416, y^{(1)} = 460.
Training Set (the table of sizes x and prices y above) is fed to the learning algorithm, which outputs a hypothesis h mapping a size x to an estimated price.

Hypothesis: h_θ(x) = θ_0 + θ_1 x
θ_0, θ_1: parameters
How to choose the θ_i's?
Idea: choose θ_0, θ_1 so that h_θ(x) is close to y for our training examples (x, y).
Cost function

Hypothesis: h_θ(x) = θ_0 + θ_1 x
Parameters: θ_0, θ_1
Cost Function (squared error): J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})²
Goal: minimize_{θ_0, θ_1} J(θ_0, θ_1)

Simplified version (for intuition): fix θ_0 = 0 so that h_θ(x) = θ_1 x, the only parameter is θ_1, the cost function is J(θ_1), and the goal is minimize_{θ_1} J(θ_1).
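As an illustration (not from the original slides), this cost can be computed in a few vectorized Octave/MATLAB lines; computeCost is a hypothetical helper name, and X is assumed to be the m × (n+1) design matrix whose first column is all ones.

function J = computeCost(X, y, theta)
  m = length(y);                                    % number of training examples
  predictions = X * theta;                          % h_theta(x^(i)) for every example
  J = (1 / (2 * m)) * sum((predictions - y) .^ 2);  % squared error cost
end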
[Figures: for several values of θ_1, the left panel plots h_θ(x) = θ_1 x against the training data (for fixed θ_1, this is a function of x) and the right panel plots the corresponding point on J(θ_1) (a function of the parameter θ_1). At θ_1 = 1 the line passes through every point, h_θ(x^{(i)}) = y^{(i)}, so J(1) = 0; values of θ_1 farther from 1 give larger J(θ_1).]
Hypothesis: h_θ(x) = θ_0 + θ_1 x
Parameters: θ_0, θ_1
Cost Function: J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})²
Goal: minimize_{θ_0, θ_1} J(θ_0, θ_1)
Have some function J(θ_0, θ_1)
Want min_{θ_0, θ_1} J(θ_0, θ_1)
Outline:
• Start with some θ_0, θ_1 (say θ_0 = 0, θ_1 = 0)
• Keep changing θ_0, θ_1 to reduce J(θ_0, θ_1) until we hopefully end up at a minimum
Gradient descent algorithm
repeat until convergence {
  θ_j := θ_j − α ∂/∂θ_j J(θ_0, θ_1)   (for j = 0 and j = 1)
}

Correct: simultaneous update
temp0 := θ_0 − α ∂/∂θ_0 J(θ_0, θ_1)
temp1 := θ_1 − α ∂/∂θ_1 J(θ_0, θ_1)
θ_0 := temp0
θ_1 := temp1

Incorrect: non-simultaneous update
temp0 := θ_0 − α ∂/∂θ_0 J(θ_0, θ_1)
θ_0 := temp0
temp1 := θ_1 − α ∂/∂θ_1 J(θ_0, θ_1)   (already uses the updated θ_0)
θ_1 := temp1
Gradient descent algorithm
θ_j := θ_j − α ∂/∂θ_j J(θ_0, θ_1), where α is the learning rate.
If α is too small, gradient descent
can be slow.
If α is too large, gradient descent
can overshoot the minimum. It may
fail to converge, or even diverge.
At a local optimum the derivative is zero, so the update θ_1 := θ_1 − α · 0 leaves the current value of θ_1 unchanged.
Gradient descent can converge to a local minimum, even with the learning rate α fixed.
As we approach a local
minimum, gradient
descent will automatically
take smaller steps. So, no
need to decrease α over
time.
Gradient descent algorithm applied to the linear regression model
Linear regression model: h_θ(x) = θ_0 + θ_1 x, with cost J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})².
The partial derivatives are:
∂/∂θ_0 J(θ_0, θ_1) = (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})
∂/∂θ_1 J(θ_0, θ_1) = (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) · x^{(i)}

Gradient descent algorithm
repeat until convergence {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})
  θ_1 := θ_1 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) · x^{(i)}
}
(update θ_0 and θ_1 simultaneously)
[Figures: left, h_θ(x) for the current θ_0, θ_1 plotted over the training data (for fixed θ_0, θ_1, this is a function of x); right, a contour plot of J(θ_0, θ_1) (a function of the parameters θ_0, θ_1). Successive frames show gradient descent stepping toward the minimum of J while the fitted line improves.]
“Batch” Gradient Descent
“Batch”: Each step of gradient descent uses all the
training examples.
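A minimal Octave/MATLAB sketch of batch gradient descent for linear regression (not from the original slides; gradientDescentBatch is a hypothetical name, X is assumed to include the x0 = 1 column, and alpha/num_iters are user-chosen):

function [theta, J_history] = gradientDescentBatch(X, y, theta, alpha, num_iters)
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    grad = (1 / m) * X' * (X * theta - y);                        % gradient uses ALL m examples
    theta = theta - alpha * grad;                                 % simultaneous update of every theta_j
    J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);  % track J(theta) per iteration
  end
end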
Stochastic Gradient Descent (SGD)
It updates the parameters for each training example,
one by one.
Stochastic Gradient Descent (SGD)
Advantages:
- Can be faster than batch gradient descent, especially on large training sets.
- The frequent updates give a fairly detailed picture of the rate of improvement.
Disadvantages:
- The frequent updates are computationally more expensive than a single vectorized batch update per pass.
- The frequent updates also produce noisy gradients.
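A corresponding Octave/MATLAB sketch of stochastic gradient descent for linear regression (again a hypothetical illustration, not taken from the slides):

function theta = gradientDescentSGD(X, y, theta, alpha, num_epochs)
  m = length(y);
  for epoch = 1:num_epochs
    for k = randperm(m)                    % visit the examples in random order
      x_k = X(k, :)';                      % one training example as a column vector
      err = x_k' * theta - y(k);           % h_theta(x^(k)) - y^(k)
      theta = theta - alpha * err * x_k;   % update using this single example only
    end
  end
end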
Mini Batch Gradient Descent
A combination of the concepts of SGD and Batch Gradient Descent.
Mini Batch Gradient Descent
- Splits the training dataset into small batches
- Performs an update for each of these
batches.
Creates a balance between the
robustness of stochastic gradient descent and
the efficiency of batch gradient descent.
Mini-batch sizes commonly range between 50 and 256.
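A mini-batch variant can be sketched the same way (hypothetical illustration; b is the mini-batch size):

function theta = gradientDescentMiniBatch(X, y, theta, alpha, b, num_epochs)
  m = length(y);
  for epoch = 1:num_epochs
    idx = randperm(m);                                     % shuffle once per epoch
    for start = 1:b:m
      batch = idx(start : min(start + b - 1, m));          % indices of this mini-batch
      Xb = X(batch, :);
      yb = y(batch);
      grad = (1 / numel(batch)) * Xb' * (Xb * theta - yb); % gradient over the mini-batch only
      theta = theta - alpha * grad;                        % one update per mini-batch
    end
  end
end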
Multiple features (variables)

With a single feature:

Size (feet²)    Price ($1000)
2104            460
1416            232
1534            315
852             178
…               …

With multiple features:

Size (feet²)    Number of bedrooms    Number of floors    Age of home (years)    Price ($1000)
2104            5                     1                   45                     460
1416            3                     2                   40                     232
1534            3                     2                   30                     315
852             2                     1                   36                     178
…               …                     …                   …                      …
Notation:
n = number of features
x^{(i)} = input (features) of the i-th training example
x_j^{(i)} = value of feature j in the i-th training example
E.g. x^{(2)} = [1416; 3; 2; 40] and x_3^{(2)} = 2.
Hypothesis:
Previously (one variable): h_θ(x) = θ_0 + θ_1 x
Multiple variables: h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + … + θ_n x_n
For convenience of notation, define x_0 = 1 (i.e. x_0^{(i)} = 1 for every example), so that
h_θ(x) = θ_0 x_0 + θ_1 x_1 + … + θ_n x_n = θᵀx
where θᵀ is a 1 × (n + 1) matrix (row vector) and x is the (n + 1)-dimensional feature vector.
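For example (the numeric parameter values below are made up purely for illustration):

x = [1; 2104; 5; 1; 45];          % [x0; size; bedrooms; floors; age]
theta = [80; 0.1; 0.01; 3; -2];   % hypothetical parameter vector
h = theta' * x                    % predicted price in $1000s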
Multivariate linear regression
Hypothesis: h_θ(x) = θᵀx = θ_0 x_0 + θ_1 x_1 + … + θ_n x_n (with x_0 = 1)
Parameters: θ_0, θ_1, …, θ_n (an (n + 1)-dimensional vector θ)
Cost function: J(θ) = (1/2m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})²
Gradient descent:
Repeat {
  θ_j := θ_j − α ∂/∂θ_j J(θ)
}
(simultaneously update θ_j for every j = 0, …, n)
Gradient Descent

Previously (n = 1):
Repeat {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})
  θ_1 := θ_1 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x^{(i)}
}
(simultaneously update θ_0, θ_1)

New algorithm (n ≥ 1):
Repeat {
  θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}
}
(simultaneously update θ_j for j = 0, …, n)
Feature Scaling
Idea: make sure features are on a similar scale, e.g. x_1 = size (feet²)/2000 and x_2 = (number of bedrooms)/5.
Get every feature into approximately a −1 ≤ x_j ≤ 1 range; ranges that are far too big or far too small slow gradient descent down.

Mean normalization
Replace x_j with x_j − μ_j to make features have approximately zero mean (do not apply to x_0 = 1).
E.g. x_1 = (size − 1000)/2000 (avg. size 1000), x_2 = (#bedrooms − 2)/5 (1–5 bedrooms).
More generally, x_j := (x_j − μ_j)/s_j, where s_j is the range or standard deviation of feature j.
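A small Octave/MATLAB sketch of mean normalization plus scaling by the standard deviation (not from the slides; relies on elementwise broadcasting, available in Octave and recent MATLAB):

function [X_norm, mu, sigma] = featureNormalize(X)
  mu = mean(X);                 % 1 x n row of feature means
  sigma = std(X);               % 1 x n row of feature standard deviations
  X_norm = (X - mu) ./ sigma;   % subtract the mean and divide by sigma, column by column
end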
Gradient descent in practice
- "Debugging": how to make sure gradient descent is working correctly.
- How to choose the learning rate α.
Making sure gradient descent is working correctly.
Plot J(θ) against the number of iterations; J(θ) should decrease after every iteration.
Example automatic convergence test: declare convergence if J(θ) decreases by less than 10⁻³ in one iteration.
[Figure: J(θ) vs. no. of iterations (0–400), flattening out as gradient descent converges.]
Making sure gradient descent is working correctly.
[Figures: J(θ) vs. no. of iterations that increases or repeatedly rises and falls: gradient descent is not working; use a smaller α.]
- For sufficiently small α, J(θ) should decrease on every iteration.
- But if α is too small, gradient descent can be slow to converge.
Summary:
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration; may not converge.
To choose α, try a sequence of values spaced roughly 3× apart, e.g.
…, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
Polynomial regression
Fit a polynomial in the size, e.g. h_θ(x) = θ_0 + θ_1 x + θ_2 x² or θ_0 + θ_1 x + θ_2 x² + θ_3 x³, by treating x, x², x³ as separate features (feature scaling then becomes important, since the ranges of x² and x³ are much larger than that of x).
[Figure: Price (y) vs. Size (x) with polynomial fits.]

Choice of features
Other choices of features are possible, e.g. h_θ(x) = θ_0 + θ_1 (size) + θ_2 √(size).
[Figure: Price (y) vs. Size (x).]
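For instance, polynomial features can be built directly from the size column (a hypothetical illustration using the sizes from the table above):

sizes = [2104; 1416; 1534; 852];                                % size in feet^2
X_poly = [ones(length(sizes), 1), sizes, sizes.^2, sizes.^3];   % columns: 1, x, x^2, x^3
% Feature scaling matters here: x^2 and x^3 have far larger ranges than x.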
Normal equation: method to solve for θ analytically.
Intuition: if θ were a single scalar (1D), J(θ) = aθ² + bθ + c; set the derivative dJ/dθ to zero and solve for θ.
For θ ∈ ℝ^{n+1}: set ∂/∂θ_j J(θ) = 0 (for every j) and solve for θ_0, θ_1, …, θ_n.
The solution is θ = (XᵀX)⁻¹ Xᵀ y.
Examples (m training examples, n = 4 features, plus x_0 = 1):

x_0    Size (feet²)    Number of bedrooms    Number of floors    Age of home (years)    Price ($1000)
1      2104            5                     1                   45                     460
1      1416            3                     2                   40                     232
1      1534            3                     2                   30                     315
1      852             2                     1                   36                     178
1      3000            4                     1                   38                     540

X (the design matrix) stacks the rows (x^{(i)})ᵀ, giving an m × (n + 1) matrix; y is the m-dimensional vector of prices.
In general, with m examples (x^{(1)}, y^{(1)}), …, (x^{(m)}, y^{(m)}) and n features, the normal equation gives
θ = (XᵀX)⁻¹ Xᵀ y
Here (XᵀX)⁻¹ is the inverse of the matrix XᵀX. E.g. if X is m × (n + 1), then XᵀX is (n + 1) × (n + 1).
Set θ = (XᵀX)⁻¹ Xᵀ y; in Matlab/Octave: pinv(X'*X)*X'*y
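Using the table above, a complete (illustrative) computation looks like this:

X = [1 2104 5 1 45;
     1 1416 3 2 40;
     1 1534 3 2 30;
     1  852 2 1 36];             % m x (n+1) design matrix, first column x0 = 1
y = [460; 232; 315; 178];        % prices in $1000s
theta = pinv(X' * X) * X' * y    % closed-form solution of the normal equation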
With m training examples and n features:

Gradient Descent
• Need to choose α.
• Needs many iterations.
• Works well even when n is large.

Normal Equation
• No need to choose α.
• Don't need to iterate.
• Need to compute (XᵀX)⁻¹, roughly O(n³).
• Slow if n is very large.
Normal equation and non-invertibility
- What if XᵀX is non-invertible (singular / degenerate)?
- Matlab/Octave: pinv(X'*X)*X'*y still works, because pinv computes the pseudo-inverse, whereas inv(X'*X)*X'*y requires XᵀX to be invertible.

What if XᵀX is non-invertible?
• Redundant features (linearly dependent), e.g. x_1 = size in feet² and x_2 = size in m².
• Too many features (e.g. m ≤ n): delete some features, or use regularization.
Logistic regression (classification)
• Classification
• Hypothesis representation
• Decision boundary
• Cost function
• Multi-class classification: One-vs-all
Classification
Email: Spam / Not Spam?
Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign?
y ∈ {0, 1}
0: "Negative Class" (e.g., benign tumor)
1: "Positive Class" (e.g., malignant tumor)
First attempt: linear regression, h_θ(x) = θᵀx.
[Figure: Malignant? (1 = Yes, 0 = No) vs. Tumor Size, with a straight-line fit; points left of the 0.5 crossing are predicted "y = 0", points right of it "y = 1".]
Threshold the classifier output h_θ(x) at 0.5:
If h_θ(x) ≥ 0.5, predict "y = 1"
If h_θ(x) < 0.5, predict "y = 0"
This is a poor fit for classification: y is 0 or 1, but h_θ(x) can be > 1 or < 0.
Logistic Regression: 0 ≤ h_θ(x) ≤ 1

Logistic Regression Model
Want 0 ≤ h_θ(x) ≤ 1.
h_θ(x) = g(θᵀx), where g(z) = 1/(1 + e^{−z}) is the sigmoid function (also called the logistic function).
[Figure: g(z) rises from 0 to 1, crossing 0.5 at z = 0.]
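In Octave/MATLAB the sigmoid is one line (a sketch, not from the slides); it applies elementwise, so it also works on vectors and matrices:

function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));   % 1 / (1 + e^(-z)), elementwise
end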
Interpretation of Hypothesis Output
h_θ(x) = estimated probability that y = 1 on input x, i.e. h_θ(x) = P(y = 1 | x; θ), the "probability that y = 1, given x, parameterized by θ".
Example: if x = [x_0; x_1] = [1; tumorSize] and h_θ(x) = 0.7, tell the patient there is a 70% chance of the tumor being malignant.
Logistic regression
[Figure: the sigmoid g(z); g(z) ≥ 0.5 exactly when z ≥ 0.]
Suppose we predict "y = 1" if h_θ(x) ≥ 0.5, which happens whenever θᵀx ≥ 0,
and predict "y = 0" if h_θ(x) < 0.5, which happens whenever θᵀx < 0.
Decision Boundary
[Figure: two classes of training points in the (x_1, x_2) plane, separated by a straight line.]
E.g. h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2) with θ = [−3; 1; 1].
Predict "y = 1" if −3 + x_1 + x_2 ≥ 0, i.e. x_1 + x_2 ≥ 3.
The line x_1 + x_2 = 3 is the decision boundary.
Non-linear decision boundaries
[Figure: two classes of training points in the (x_1, x_2) plane, separable by the unit circle.]
E.g. h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1² + θ_4 x_2²) with θ = [−1; 0; 0; 1; 1].
Predict "y = 1" if −1 + x_1² + x_2² ≥ 0, i.e. x_1² + x_2² ≥ 1.
The decision boundary is the circle x_1² + x_2² = 1.
Training set: {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), …, (x^{(m)}, y^{(m)})}, m examples,
with x = [x_0; x_1; …; x_n], x_0 = 1, y ∈ {0, 1}, and h_θ(x) = 1/(1 + e^{−θᵀx}).
How to choose the parameters θ?
Cost function
Linear regression used J(θ) = (1/m) Σ_{i=1}^{m} (1/2)(h_θ(x^{(i)}) − y^{(i)})².
With the logistic h_θ(x) = 1/(1 + e^{−θᵀx}), this squared error cost is "non-convex" (many local optima); we want a "convex" cost so gradient descent is guaranteed to reach the global minimum.
(Reminder: log below is the logarithm function, here the natural logarithm with base e.)
Logistic regression cost function
Cost(h_θ(x), y) = −log(h_θ(x))        if y = 1
Cost(h_θ(x), y) = −log(1 − h_θ(x))    if y = 0

If y = 1:
[Figure: −log(h_θ(x)) for 0 ≤ h_θ(x) ≤ 1.]
Cost = 0 if h_θ(x) = 1. But as h_θ(x) → 0, Cost → ∞: if the hypothesis says P(y = 1 | x; θ) = 0 but actually y = 1, the learning algorithm is penalized by a very large cost.

If y = 0:
[Figure: −log(1 − h_θ(x)) for 0 ≤ h_θ(x) ≤ 1.]
Cost = 0 if h_θ(x) = 0, and Cost → ∞ as h_θ(x) → 1.
Logistic regression cost function
J(θ) = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^{(i)}), y^{(i)})
     = −(1/m) Σ_{i=1}^{m} [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ]
To fit parameters θ: min_θ J(θ).
To make a prediction given a new x: output h_θ(x) = 1/(1 + e^{−θᵀx}).
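Both the cost and its gradient vectorize nicely. The sketch below is illustrative (logisticCostFunction is a hypothetical name; X is the m × (n+1) design matrix, y a vector of 0/1 labels):

function [J, grad] = logisticCostFunction(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                       % h_theta(x^(i)) for all m examples
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));   % cross-entropy cost J(theta)
  grad = (1 / m) * X' * (h - y);                          % dJ/dtheta_j for every j
end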
Gradient Descent
J(θ) = −(1/m) Σ_{i=1}^{m} [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ]
Want min_θ J(θ):
Repeat {
  θ_j := θ_j − α ∂/∂θ_j J(θ)
}
(simultaneously update all θ_j)
Working out the derivative gives ∂/∂θ_j J(θ) = (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}.
Gradient Descent
Want min_θ J(θ):
Repeat {
  θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}
}
(simultaneously update all θ_j)
The algorithm looks identical to linear regression; only the definition of h_θ(x) has changed.
Optimization algorithm
Cost function J(θ). Want min_θ J(θ).
Given θ, we have code that can compute
- J(θ)
- ∂/∂θ_j J(θ)   (for j = 0, 1, …, n)
Gradient descent:
Repeat { θ_j := θ_j − α ∂/∂θ_j J(θ) }
Optimization algorithm
Given θ, we have code that can compute
- J(θ)
- ∂/∂θ_j J(θ)   (for j = 0, 1, …, n)

Optimization algorithms:
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS

Advantages of the last three:
- No need to manually pick α.
- Often faster than gradient descent.
Disadvantages:
- More complex.
Multiclass classification
Email foldering/tagging: Work, Friends, Family, Hobby
Medical diagnosis: Not ill, Cold, Flu
Weather: Sunny, Cloudy, Rain, Snow
Binary classification vs. multi-class classification:
[Figures: left, two classes of points in the (x_1, x_2) plane; right, three classes in the (x_1, x_2) plane.]

One-vs-all (one-vs-rest):
[Figures: the three-class problem is split into three binary problems; for each class i, the examples of class i are treated as positive and all other examples as negative, and a classifier h_θ^{(i)}(x) is fit.]
Class 1: h_θ^{(1)}(x)
Class 2: h_θ^{(2)}(x)
Class 3: h_θ^{(3)}(x)
h_θ^{(i)}(x) = P(y = i | x; θ)   (i = 1, 2, 3)
One-vs-all
Train a logistic regression classifier h_θ^{(i)}(x) for each class i to predict the probability that y = i.
On a new input x, to make a prediction, pick the class i that maximizes h_θ^{(i)}(x).
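A prediction step for one-vs-all might look like this sketch (predictOneVsAll and the K x (n+1) parameter matrix Theta are hypothetical names, one row of parameters per class):

function p = predictOneVsAll(Theta, X)
  h = 1 ./ (1 + exp(-(X * Theta')));   % m x K matrix; h(i, k) = h_theta^(k)(x^(i))
  [~, p] = max(h, [], 2);              % for each example, pick the most probable class
end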
Regularization
• The problem of overfitting
• Cost function
• Regularized logistic regression
Example: Linear regression (housing prices)
[Figures: Price vs. Size fitted with a straight line ("underfit" or "high bias"), a quadratic ("just right"), and a high-order polynomial that passes through every point ("overfit" or "high variance").]
Overfitting: if we have too many features, the learned hypothesis may fit the training set very well (J(θ) ≈ 0), but fail to generalize to new examples (e.g. fail to predict prices on new examples).
Example: Logistic regression (g = sigmoid function)
[Figures: decision boundaries in the (x_1, x_2) plane as higher-order polynomial features are added to g(θᵀx): a straight line (underfit), a reasonable curve (just right), and a highly contorted boundary (overfit).]
Addressing overfitting:
x_1 = size of house
x_2 = no. of bedrooms
x_3 = no. of floors
x_4 = age of house
x_5 = average income in neighborhood
x_6 = kitchen size
…
[Figure: Price vs. Size.]
Addressing overfitting:
Options:
1. Reduce number of features.
― Manually select which features to keep.
― Model selection algorithm.
2. Regularization.
― Keep all the features, but reduce the magnitude/values of the parameters θ_j.
― Works well when we have a lot of features, each of which contributes a bit to predicting y.
Intuition
[Figures: Price vs. Size of house, a quadratic fit θ_0 + θ_1 x + θ_2 x² vs. a quartic fit θ_0 + θ_1 x + θ_2 x² + θ_3 x³ + θ_4 x⁴.]
Suppose we penalize θ_3 and θ_4 and make them really small, e.g. minimize J(θ) + 1000·θ_3² + 1000·θ_4². The minimizer then has θ_3 ≈ 0 and θ_4 ≈ 0, and the quartic fit behaves essentially like the quadratic one.
Regularization.
Small values for the parameters θ_0, θ_1, …, θ_n
― "Simpler" hypothesis
― Less prone to overfitting
Housing:
― Features: x_1, x_2, …, x_n
― Parameters: θ_0, θ_1, θ_2, …, θ_n
Regularized cost function:
J(θ) = (1/2m) [ Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})² + λ Σ_{j=1}^{n} θ_j² ]
λ is the regularization parameter; note that θ_0 is not penalized (the sum starts at j = 1).
Regularization.
[Figure: Price vs. Size of house. If λ is set to an extremely large value, all of θ_1, …, θ_n are driven toward zero, h_θ(x) ≈ θ_0 (a horizontal line), and the hypothesis underfits.]
Regularized linear regression
J(θ) = (1/2m) [ Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})² + λ Σ_{j=1}^{n} θ_j² ]
Gradient descent
Repeat {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_0^{(i)}
  θ_j := θ_j − α [ (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)} + (λ/m) θ_j ]   (j = 1, …, n)
}
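One regularized gradient descent step can be written compactly (an illustrative sketch; theta(1) corresponds to θ_0 and is not penalized):

function theta = regularizedGradientStep(theta, X, y, alpha, lambda)
  m = size(X, 1);
  h = X * theta;                                             % linear regression hypothesis
  grad = (1 / m) * X' * (h - y);                             % unregularized gradient
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);   % add (lambda/m) * theta_j for j >= 1
  theta = theta - alpha * grad;                              % one gradient descent step
end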
Regularized logistic regression.
[Figure: a high-order polynomial decision boundary in the (x_1, x_2) plane that regularization keeps from overfitting.]
Cost function:
J(θ) = −(1/m) Σ_{i=1}^{m} [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ] + (λ/2m) Σ_{j=1}^{n} θ_j²
Gradient descent
Repeat {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_0^{(i)}
  θ_j := θ_j − α [ (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)} + (λ/m) θ_j ]   (j = 1, …, n)
}
(Same-looking update as regularized linear regression, but here h_θ(x) = 1/(1 + e^{−θᵀx}).)
Advanced optimization
function [jVal, gradient] = costFunction(theta)
  jVal = [ code to compute J(theta) ];
  gradient(1) = [ code to compute dJ/dtheta_0 ];
  gradient(2) = [ code to compute dJ/dtheta_1 ];
  gradient(3) = [ code to compute dJ/dtheta_2 ];
  ...
  gradient(n+1) = [ code to compute dJ/dtheta_n ];
end
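This cost function can then be passed to fminunc. The call below is a sketch of the standard usage; the value of n and the iteration limit are placeholders:

n = 2;                                                 % example: two features, so n + 1 = 3 parameters
options = optimset('GradObj', 'on', 'MaxIter', 400);   % we supply the gradient ourselves
initialTheta = zeros(n + 1, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);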
Reference:
- Machine Learning, Andrew Ng, coursera.org