Regression
• Linear regression
• Logistic regression
[Figure: Housing Prices (Portland, OR): Price (in 1000s of dollars) plotted against Size (feet²).]
Supervised Learning: we are given the "right answer" for each example in the data.
Regression problem: predict real-valued output.
Classification problem: predict discrete-valued output.
Training set of housing prices (Portland, OR):

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …
Notation:
m = number of training examples
x's = "input" variable / features
y's = "output" variable / "target" variable
(x, y) = one training example
(x^{(i)}, y^{(i)}) = the i-th training example
E.g. x^{(1)} = 2104, x^{(2)} = 1416, y^{(1)} = 460.
Training Set (the table of sizes x and prices y above) is fed to the learning algorithm, which outputs a hypothesis h mapping a size x to an estimated price.

Hypothesis: h_θ(x) = θ_0 + θ_1 x
θ_0, θ_1: parameters
How to choose the θ_i's?
Idea: choose θ_0, θ_1 so that h_θ(x) is close to y for our training examples (x, y).
Cost function

Hypothesis: h_θ(x) = θ_0 + θ_1 x
Parameters: θ_0, θ_1
Cost Function (squared error): J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})²
Goal: minimize_{θ_0, θ_1} J(θ_0, θ_1)

Simplified version (for intuition): fix θ_0 = 0 so that h_θ(x) = θ_1 x, the only parameter is θ_1, the cost function is J(θ_1), and the goal is minimize_{θ_1} J(θ_1).
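As an illustration (not from the original slides), this cost can be computed in a few vectorized Octave/MATLAB lines; computeCost is a hypothetical helper name, and X is assumed to be the m × (n+1) design matrix whose first column is all ones.

function J = computeCost(X, y, theta)
  m = length(y);                                    % number of training examples
  predictions = X * theta;                          % h_theta(x^(i)) for every example
  J = (1 / (2 * m)) * sum((predictions - y) .^ 2);  % squared error cost
end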
[Figures: for several values of θ_1, the left panel plots h_θ(x) = θ_1 x against the training data (for fixed θ_1, this is a function of x) and the right panel plots the corresponding point on J(θ_1) (a function of the parameter θ_1). At θ_1 = 1 the line passes through every point, h_θ(x^{(i)}) = y^{(i)}, so J(1) = 0; values of θ_1 farther from 1 give larger J(θ_1).]
Hypothesis: h_θ(x) = θ_0 + θ_1 x
Parameters: θ_0, θ_1
Cost Function: J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})²
Goal: minimize_{θ_0, θ_1} J(θ_0, θ_1)
Have some function J(θ_0, θ_1)
Want min_{θ_0, θ_1} J(θ_0, θ_1)
Outline:
• Start with some θ_0, θ_1 (say θ_0 = 0, θ_1 = 0)
• Keep changing θ_0, θ_1 to reduce J(θ_0, θ_1) until we hopefully end up at a minimum
Gradient descent algorithm
repeat until convergence {
  θ_j := θ_j − α ∂/∂θ_j J(θ_0, θ_1)   (for j = 0 and j = 1)
}

Correct: simultaneous update
temp0 := θ_0 − α ∂/∂θ_0 J(θ_0, θ_1)
temp1 := θ_1 − α ∂/∂θ_1 J(θ_0, θ_1)
θ_0 := temp0
θ_1 := temp1

Incorrect: non-simultaneous update
temp0 := θ_0 − α ∂/∂θ_0 J(θ_0, θ_1)
θ_0 := temp0
temp1 := θ_1 − α ∂/∂θ_1 J(θ_0, θ_1)   (already uses the updated θ_0)
θ_1 := temp1
Gradient descent algorithm
θ_j := θ_j − α ∂/∂θ_j J(θ_0, θ_1), where α is the learning rate.
If α is too small, gradient descent
can be slow.
If α is too large, gradient descent
can overshoot the minimum. It may
fail to converge, or even diverge.
At a local optimum the derivative is zero, so the update θ_1 := θ_1 − α · 0 leaves the current value of θ_1 unchanged.
Gradient descent can converge to a local minimum, even with the learning rate α fixed.
As we approach a local
minimum, gradient
descent will automatically
take smaller steps. So, no
need to decrease α over
time.
Gradient descent algorithm applied to the linear regression model
Linear regression model: h_θ(x) = θ_0 + θ_1 x, with cost J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})².
The partial derivatives are:
∂/∂θ_0 J(θ_0, θ_1) = (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})
∂/∂θ_1 J(θ_0, θ_1) = (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) · x^{(i)}

Gradient descent algorithm
repeat until convergence {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})
  θ_1 := θ_1 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) · x^{(i)}
}
(update θ_0 and θ_1 simultaneously)
[Figures: left, h_θ(x) for the current θ_0, θ_1 plotted over the training data (for fixed θ_0, θ_1, this is a function of x); right, a contour plot of J(θ_0, θ_1) (a function of the parameters θ_0, θ_1). Successive frames show gradient descent stepping toward the minimum of J while the fitted line improves.]
“Batch” Gradient Descent
“Batch”: Each step of gradient descent uses all the
training examples.
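A minimal Octave/MATLAB sketch of batch gradient descent for linear regression (not from the original slides; gradientDescentBatch is a hypothetical name, X is assumed to include the x0 = 1 column, and alpha/num_iters are user-chosen):

function [theta, J_history] = gradientDescentBatch(X, y, theta, alpha, num_iters)
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    grad = (1 / m) * X' * (X * theta - y);                        % gradient uses ALL m examples
    theta = theta - alpha * grad;                                 % simultaneous update of every theta_j
    J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);  % track J(theta) per iteration
  end
end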
Stochastic Gradient Descent (SGD)
It updates the parameters for each training example,
one by one.
Stochastic Gradient Descent (SGD)
Advantages:
- Can be faster than batch gradient descent, especially on large training sets.
- The frequent updates give a fairly detailed picture of the rate of improvement.
Disadvantages:
- The frequent updates are computationally more expensive than a single vectorized batch update per pass.
- The frequent updates also produce noisy gradients.
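A corresponding Octave/MATLAB sketch of stochastic gradient descent for linear regression (again a hypothetical illustration, not taken from the slides):

function theta = gradientDescentSGD(X, y, theta, alpha, num_epochs)
  m = length(y);
  for epoch = 1:num_epochs
    for k = randperm(m)                    % visit the examples in random order
      x_k = X(k, :)';                      % one training example as a column vector
      err = x_k' * theta - y(k);           % h_theta(x^(k)) - y^(k)
      theta = theta - alpha * err * x_k;   % update using this single example only
    end
  end
end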
Mini Batch Gradient Descent
A combination of the concepts of SGD and Batch Gradient Descent.
Mini Batch Gradient Descent
- Splits the training dataset into small batches
- Performs an update for each of these
batches.
Creates a balance between the
robustness of stochastic gradient descent and
the efficiency of batch gradient descent.
Mini-batch sizes commonly range between 50 and 256.
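A mini-batch variant can be sketched the same way (hypothetical illustration; b is the mini-batch size):

function theta = gradientDescentMiniBatch(X, y, theta, alpha, b, num_epochs)
  m = length(y);
  for epoch = 1:num_epochs
    idx = randperm(m);                                     % shuffle once per epoch
    for start = 1:b:m
      batch = idx(start : min(start + b - 1, m));          % indices of this mini-batch
      Xb = X(batch, :);
      yb = y(batch);
      grad = (1 / numel(batch)) * Xb' * (Xb * theta - yb); % gradient over the mini-batch only
      theta = theta - alpha * grad;                        % one update per mini-batch
    end
  end
end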
Multiple features (variables)

With a single feature:

Size (feet²)    Price ($1000)
2104            460
1416            232
1534            315
852             178
…               …

With multiple features:

Size (feet²)    Number of bedrooms    Number of floors    Age of home (years)    Price ($1000)
2104            5                     1                   45                     460
1416            3                     2                   40                     232
1534            3                     2                   30                     315
852             2                     1                   36                     178
…               …                     …                   …                      …
Notation:
n = number of features
x^{(i)} = input (features) of the i-th training example
x_j^{(i)} = value of feature j in the i-th training example
E.g. x^{(2)} = [1416; 3; 2; 40] and x_3^{(2)} = 2.
Hypothesis:
Previously (one variable): h_θ(x) = θ_0 + θ_1 x
Multiple variables: h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + … + θ_n x_n
For convenience of notation, define x_0 = 1 (i.e. x_0^{(i)} = 1 for every example), so that
h_θ(x) = θ_0 x_0 + θ_1 x_1 + … + θ_n x_n = θᵀx
where θᵀ is a 1 × (n + 1) matrix (row vector) and x is the (n + 1)-dimensional feature vector.
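For example (the numeric parameter values below are made up purely for illustration):

x = [1; 2104; 5; 1; 45];          % [x0; size; bedrooms; floors; age]
theta = [80; 0.1; 0.01; 3; -2];   % hypothetical parameter vector
h = theta' * x                    % predicted price in $1000s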
Multivariate linear regression
Hypothesis: h_θ(x) = θᵀx = θ_0 x_0 + θ_1 x_1 + … + θ_n x_n (with x_0 = 1)
Parameters: θ_0, θ_1, …, θ_n (an (n + 1)-dimensional vector θ)
Cost function: J(θ) = (1/2m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})²
Gradient descent:
Repeat {
  θ_j := θ_j − α ∂/∂θ_j J(θ)
}
(simultaneously update θ_j for every j = 0, …, n)
Gradient Descent

Previously (n = 1):
Repeat {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})
  θ_1 := θ_1 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x^{(i)}
}
(simultaneously update θ_0, θ_1)

New algorithm (n ≥ 1):
Repeat {
  θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}
}
(simultaneously update θ_j for j = 0, …, n)
Feature Scaling
Idea: make sure features are on a similar scale, e.g. x_1 = size (feet²)/2000 and x_2 = (number of bedrooms)/5.
Get every feature into approximately a −1 ≤ x_j ≤ 1 range; ranges that are far too big or far too small slow gradient descent down.

Mean normalization
Replace x_j with x_j − μ_j to make features have approximately zero mean (do not apply to x_0 = 1).
E.g. x_1 = (size − 1000)/2000 (avg. size 1000), x_2 = (#bedrooms − 2)/5 (1–5 bedrooms).
More generally, x_j := (x_j − μ_j)/s_j, where s_j is the range or standard deviation of feature j.
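A small Octave/MATLAB sketch of mean normalization plus scaling by the standard deviation (not from the slides; relies on elementwise broadcasting, available in Octave and recent MATLAB):

function [X_norm, mu, sigma] = featureNormalize(X)
  mu = mean(X);                 % 1 x n row of feature means
  sigma = std(X);               % 1 x n row of feature standard deviations
  X_norm = (X - mu) ./ sigma;   % subtract the mean and divide by sigma, column by column
end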
Gradient descent in practice
- "Debugging": how to make sure gradient descent is working correctly.
- How to choose the learning rate α.
Making sure gradient descent is working correctly.
Plot J(θ) against the number of iterations; J(θ) should decrease after every iteration.
Example automatic convergence test: declare convergence if J(θ) decreases by less than 10⁻³ in one iteration.
[Figure: J(θ) vs. no. of iterations (0–400), flattening out as gradient descent converges.]
Making sure gradient descent is working correctly.
[Figures: J(θ) vs. no. of iterations that increases or repeatedly rises and falls: gradient descent is not working; use a smaller α.]
- For sufficiently small α, J(θ) should decrease on every iteration.
- But if α is too small, gradient descent can be slow to converge.
Summary:
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration; may not converge.
To choose α, try a sequence of values spaced roughly 3× apart, e.g.
…, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
Polynomial regression
Fit a polynomial in the size, e.g. h_θ(x) = θ_0 + θ_1 x + θ_2 x² or θ_0 + θ_1 x + θ_2 x² + θ_3 x³, by treating x, x², x³ as separate features (feature scaling then becomes important, since the ranges of x² and x³ are much larger than that of x).
[Figure: Price (y) vs. Size (x) with polynomial fits.]

Choice of features
Other choices of features are possible, e.g. h_θ(x) = θ_0 + θ_1 (size) + θ_2 √(size).
[Figure: Price (y) vs. Size (x).]
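For instance, polynomial features can be built directly from the size column (a hypothetical illustration using the sizes from the table above):

sizes = [2104; 1416; 1534; 852];                                % size in feet^2
X_poly = [ones(length(sizes), 1), sizes, sizes.^2, sizes.^3];   % columns: 1, x, x^2, x^3
% Feature scaling matters here: x^2 and x^3 have far larger ranges than x.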
Normal equation: method to solve for θ analytically.
Intuition: if θ were a single scalar (1D), J(θ) = aθ² + bθ + c; set the derivative dJ/dθ to zero and solve for θ.
For θ ∈ ℝ^{n+1}: set ∂/∂θ_j J(θ) = 0 (for every j) and solve for θ_0, θ_1, …, θ_n.
The solution is θ = (XᵀX)⁻¹ Xᵀ y.
Examples (m training examples, n = 4 features, plus x_0 = 1):

x_0    Size (feet²)    Number of bedrooms    Number of floors    Age of home (years)    Price ($1000)
1      2104            5                     1                   45                     460
1      1416            3                     2                   40                     232
1      1534            3                     2                   30                     315
1      852             2                     1                   36                     178
1      3000            4                     1                   38                     540

X (the design matrix) stacks the rows (x^{(i)})ᵀ, giving an m × (n + 1) matrix; y is the m-dimensional vector of prices.
In general, with m examples (x^{(1)}, y^{(1)}), …, (x^{(m)}, y^{(m)}) and n features, the normal equation gives
θ = (XᵀX)⁻¹ Xᵀ y
Here (XᵀX)⁻¹ is the inverse of the matrix XᵀX. E.g. if X is m × (n + 1), then XᵀX is (n + 1) × (n + 1).
Set θ = (XᵀX)⁻¹ Xᵀ y; in Matlab/Octave: pinv(X'*X)*X'*y
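Using the table above, a complete (illustrative) computation looks like this:

X = [1 2104 5 1 45;
     1 1416 3 2 40;
     1 1534 3 2 30;
     1  852 2 1 36];             % m x (n+1) design matrix, first column x0 = 1
y = [460; 232; 315; 178];        % prices in $1000s
theta = pinv(X' * X) * X' * y    % closed-form solution of the normal equation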
With m training examples and n features:

Gradient Descent
• Need to choose α.
• Needs many iterations.
• Works well even when n is large.

Normal Equation
• No need to choose α.
• Don't need to iterate.
• Need to compute (XᵀX)⁻¹, roughly O(n³).
• Slow if n is very large.
Normal equation and non-invertibility
- What if XᵀX is non-invertible (singular / degenerate)?
- Matlab/Octave: pinv(X'*X)*X'*y still works, because pinv computes the pseudo-inverse, whereas inv(X'*X)*X'*y requires XᵀX to be invertible.

What if XᵀX is non-invertible?
• Redundant features (linearly dependent), e.g. x_1 = size in feet² and x_2 = size in m².
• Too many features (e.g. m ≤ n): delete some features, or use regularization.
Logistic regression (classification)
• Classification
• Hypothesis representation
• Decision boundary
• Cost function
• Multi-class classification: One-vs-all
Classification
Email: Spam / Not Spam?
Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign?
y ∈ {0, 1}
0: "Negative Class" (e.g., benign tumor)
1: "Positive Class" (e.g., malignant tumor)
First attempt: linear regression, h_θ(x) = θᵀx.
[Figure: Malignant? (1 = Yes, 0 = No) vs. Tumor Size, with a straight-line fit; points left of the 0.5 crossing are predicted "y = 0", points right of it "y = 1".]
Threshold the classifier output h_θ(x) at 0.5:
If h_θ(x) ≥ 0.5, predict "y = 1"
If h_θ(x) < 0.5, predict "y = 0"
This is a poor fit for classification: y is 0 or 1, but h_θ(x) can be > 1 or < 0.
Logistic Regression: 0 ≤ h_θ(x) ≤ 1

Logistic Regression Model
Want 0 ≤ h_θ(x) ≤ 1.
h_θ(x) = g(θᵀx), where g(z) = 1/(1 + e^{−z}) is the sigmoid function (also called the logistic function).
[Figure: g(z) rises from 0 to 1, crossing 0.5 at z = 0.]
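In Octave/MATLAB the sigmoid is one line (a sketch, not from the slides); it applies elementwise, so it also works on vectors and matrices:

function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));   % 1 / (1 + e^(-z)), elementwise
end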
Interpretation of Hypothesis Output
h_θ(x) = estimated probability that y = 1 on input x, i.e. h_θ(x) = P(y = 1 | x; θ), the "probability that y = 1, given x, parameterized by θ".
Example: if x = [x_0; x_1] = [1; tumorSize] and h_θ(x) = 0.7, tell the patient there is a 70% chance of the tumor being malignant.
Logistic regression
[Figure: the sigmoid g(z); g(z) ≥ 0.5 exactly when z ≥ 0.]
Suppose we predict "y = 1" if h_θ(x) ≥ 0.5, which happens whenever θᵀx ≥ 0,
and predict "y = 0" if h_θ(x) < 0.5, which happens whenever θᵀx < 0.
Decision Boundary
[Figure: two classes of training points in the (x_1, x_2) plane, separated by a straight line.]
E.g. h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2) with θ = [−3; 1; 1].
Predict "y = 1" if −3 + x_1 + x_2 ≥ 0, i.e. x_1 + x_2 ≥ 3.
The line x_1 + x_2 = 3 is the decision boundary.
Non-linear decision boundaries
[Figure: two classes of training points in the (x_1, x_2) plane, separable by the unit circle.]
E.g. h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1² + θ_4 x_2²) with θ = [−1; 0; 0; 1; 1].
Predict "y = 1" if −1 + x_1² + x_2² ≥ 0, i.e. x_1² + x_2² ≥ 1.
The decision boundary is the circle x_1² + x_2² = 1.
Training set: {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), …, (x^{(m)}, y^{(m)})}, m examples,
with x = [x_0; x_1; …; x_n], x_0 = 1, y ∈ {0, 1}, and h_θ(x) = 1/(1 + e^{−θᵀx}).
How to choose the parameters θ?
Cost function
Linear regression used J(θ) = (1/m) Σ_{i=1}^{m} (1/2)(h_θ(x^{(i)}) − y^{(i)})².
With the logistic h_θ(x) = 1/(1 + e^{−θᵀx}), this squared error cost is "non-convex" (many local optima); we want a "convex" cost so gradient descent is guaranteed to reach the global minimum.
(Reminder: log below is the logarithm function, here the natural logarithm with base e.)
Logistic regression cost function
Cost(h_θ(x), y) = −log(h_θ(x))        if y = 1
Cost(h_θ(x), y) = −log(1 − h_θ(x))    if y = 0

If y = 1:
[Figure: −log(h_θ(x)) for 0 ≤ h_θ(x) ≤ 1.]
Cost = 0 if h_θ(x) = 1. But as h_θ(x) → 0, Cost → ∞: if the hypothesis says P(y = 1 | x; θ) = 0 but actually y = 1, the learning algorithm is penalized by a very large cost.

If y = 0:
[Figure: −log(1 − h_θ(x)) for 0 ≤ h_θ(x) ≤ 1.]
Cost = 0 if h_θ(x) = 0, and Cost → ∞ as h_θ(x) → 1.
Logistic regression cost function
J(θ) = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^{(i)}), y^{(i)})
     = −(1/m) Σ_{i=1}^{m} [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ]
To fit parameters θ: min_θ J(θ).
To make a prediction given a new x: output h_θ(x) = 1/(1 + e^{−θᵀx}).
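Both the cost and its gradient vectorize nicely. The sketch below is illustrative (logisticCostFunction is a hypothetical name; X is the m × (n+1) design matrix, y a vector of 0/1 labels):

function [J, grad] = logisticCostFunction(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                       % h_theta(x^(i)) for all m examples
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));   % cross-entropy cost J(theta)
  grad = (1 / m) * X' * (h - y);                          % dJ/dtheta_j for every j
end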
Gradient Descent
J(θ) = −(1/m) Σ_{i=1}^{m} [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ]
Want min_θ J(θ):
Repeat {
  θ_j := θ_j − α ∂/∂θ_j J(θ)
}
(simultaneously update all θ_j)
Working out the derivative gives ∂/∂θ_j J(θ) = (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}.
Gradient Descent
Want min_θ J(θ):
Repeat {
  θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}
}
(simultaneously update all θ_j)
The algorithm looks identical to linear regression; only the definition of h_θ(x) has changed.
Optimization algorithm
Cost function J(θ). Want min_θ J(θ).
Given θ, we have code that can compute
- J(θ)
- ∂/∂θ_j J(θ)   (for j = 0, 1, …, n)
Gradient descent:
Repeat { θ_j := θ_j − α ∂/∂θ_j J(θ) }
Optimization algorithm
Given θ, we have code that can compute
- J(θ)
- ∂/∂θ_j J(θ)   (for j = 0, 1, …, n)

Optimization algorithms:
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS

Advantages of the last three:
- No need to manually pick α.
- Often faster than gradient descent.
Disadvantages:
- More complex.
Multiclass classification
Email foldering/tagging: Work, Friends, Family, Hobby
Medical diagnosis: Not ill, Cold, Flu
Weather: Sunny, Cloudy, Rain, Snow
Binary classification vs. multi-class classification:
[Figures: left, two classes of points in the (x_1, x_2) plane; right, three classes in the (x_1, x_2) plane.]

One-vs-all (one-vs-rest):
[Figures: the three-class problem is split into three binary problems; for each class i, the examples of class i are treated as positive and all other examples as negative, and a classifier h_θ^{(i)}(x) is fit.]
Class 1: h_θ^{(1)}(x)
Class 2: h_θ^{(2)}(x)
Class 3: h_θ^{(3)}(x)
h_θ^{(i)}(x) = P(y = i | x; θ)   (i = 1, 2, 3)
One-vs-all
Train a logistic regression classifier h_θ^{(i)}(x) for each class i to predict the probability that y = i.
On a new input x, to make a prediction, pick the class i that maximizes h_θ^{(i)}(x).
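A prediction step for one-vs-all might look like this sketch (predictOneVsAll and the K x (n+1) parameter matrix Theta are hypothetical names, one row of parameters per class):

function p = predictOneVsAll(Theta, X)
  h = 1 ./ (1 + exp(-(X * Theta')));   % m x K matrix; h(i, k) = h_theta^(k)(x^(i))
  [~, p] = max(h, [], 2);              % for each example, pick the most probable class
end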
Regularization
• The problem of overfitting
• Cost function
• Regularized logistic regression
Example: Linear regression (housing prices)
[Figures: Price vs. Size fitted with a straight line ("underfit" or "high bias"), a quadratic ("just right"), and a high-order polynomial that passes through every point ("overfit" or "high variance").]
Overfitting: if we have too many features, the learned hypothesis may fit the training set very well (J(θ) ≈ 0), but fail to generalize to new examples (e.g. fail to predict prices on new examples).
Example: Logistic regression (g = sigmoid function)
[Figures: decision boundaries in the (x_1, x_2) plane as higher-order polynomial features are added to g(θᵀx): a straight line (underfit), a reasonable curve (just right), and a highly contorted boundary (overfit).]
Addressing overfitting:
x_1 = size of house
x_2 = no. of bedrooms
x_3 = no. of floors
x_4 = age of house
x_5 = average income in neighborhood
x_6 = kitchen size
…
[Figure: Price vs. Size.]
Addressing overfitting:
Options:
1. Reduce number of features.
― Manually select which features to keep.
― Model selection algorithm.
2. Regularization.
― Keep all the features, but reduce the magnitude/values of the parameters θ_j.
― Works well when we have a lot of features, each of which contributes a bit to predicting y.
Intuition
[Figures: Price vs. Size of house, a quadratic fit θ_0 + θ_1 x + θ_2 x² vs. a quartic fit θ_0 + θ_1 x + θ_2 x² + θ_3 x³ + θ_4 x⁴.]
Suppose we penalize θ_3 and θ_4 and make them really small, e.g. minimize J(θ) + 1000·θ_3² + 1000·θ_4². The minimizer then has θ_3 ≈ 0 and θ_4 ≈ 0, and the quartic fit behaves essentially like the quadratic one.
Regularization.
Small values for the parameters θ_0, θ_1, …, θ_n
― "Simpler" hypothesis
― Less prone to overfitting
Housing:
― Features: x_1, x_2, …, x_n
― Parameters: θ_0, θ_1, θ_2, …, θ_n
Regularized cost function:
J(θ) = (1/2m) [ Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})² + λ Σ_{j=1}^{n} θ_j² ]
λ is the regularization parameter; note that θ_0 is not penalized (the sum starts at j = 1).
Regularization.
[Figure: Price vs. Size of house. If λ is set to an extremely large value, all of θ_1, …, θ_n are driven toward zero, h_θ(x) ≈ θ_0 (a horizontal line), and the hypothesis underfits.]
Regularized linear regression
J(θ) = (1/2m) [ Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})² + λ Σ_{j=1}^{n} θ_j² ]
Gradient descent
Repeat {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_0^{(i)}
  θ_j := θ_j − α [ (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)} + (λ/m) θ_j ]   (j = 1, …, n)
}
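One regularized gradient descent step can be written compactly (an illustrative sketch; theta(1) corresponds to θ_0 and is not penalized):

function theta = regularizedGradientStep(theta, X, y, alpha, lambda)
  m = size(X, 1);
  h = X * theta;                                             % linear regression hypothesis
  grad = (1 / m) * X' * (h - y);                             % unregularized gradient
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);   % add (lambda/m) * theta_j for j >= 1
  theta = theta - alpha * grad;                              % one gradient descent step
end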
Regularized logistic regression.
[Figure: a high-order polynomial decision boundary in the (x_1, x_2) plane that regularization keeps from overfitting.]
Cost function:
J(θ) = −(1/m) Σ_{i=1}^{m} [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ] + (λ/2m) Σ_{j=1}^{n} θ_j²
Gradient descent
Repeat {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_0^{(i)}
  θ_j := θ_j − α [ (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)} + (λ/m) θ_j ]   (j = 1, …, n)
}
(Same-looking update as regularized linear regression, but here h_θ(x) = 1/(1 + e^{−θᵀx}).)
Advanced optimization
function [jVal, gradient] = costFunction(theta)
  jVal = [ code to compute J(theta) ];
  gradient(1) = [ code to compute dJ/dtheta_0 ];
  gradient(2) = [ code to compute dJ/dtheta_1 ];
  gradient(3) = [ code to compute dJ/dtheta_2 ];
  ...
  gradient(n+1) = [ code to compute dJ/dtheta_n ];
end
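This cost function can then be passed to fminunc. The call below is a sketch of the standard usage; the value of n and the iteration limit are placeholders:

n = 2;                                                 % example: two features, so n + 1 = 3 parameters
options = optimset('GradObj', 'on', 'MaxIter', 400);   % we supply the gradient ourselves
initialTheta = zeros(n + 1, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);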
Reference:
- Machine Learning, Andrew Ng, coursera.org