ECS171: Machine Learning
Lecture 4: Optimization (LFD 3.3, SGD)
Cho-Jui Hsieh
UC Davis
Jan 22, 2018
Gradient descent
Optimization
Goal: find the minimizer of a function
    min_w f(w)
For now we assume f is twice differentiable
Machine learning algorithm: find the hypothesis that minimizes Ein
Convex vs Nonconvex
Convex function:
∇f(w*) = 0 ⇔ w* is a global minimum
A (twice-differentiable) function is convex if ∇²f(w) is positive semidefinite everywhere
Examples: linear regression, logistic regression, ...
Non-convex function:
∇f(w*) = 0 ⇔ w* is a global minimum, a local minimum, or a saddle point
Most algorithms only converge to a point with gradient = 0
Examples: neural networks, ...
Gradient Descent
Gradient descent: repeatedly do
    w_{t+1} ← w_t − α∇f(w_t)
α > 0 is the step size
This generates a sequence w_1, w_2, ... that converges to a solution with ∇f(w) = 0
Step size too large ⇒ divergence; too small ⇒ slow convergence
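As a concrete reference, a minimal gradient-descent loop in Python might look like this (the quadratic test function, step size, and stopping rule are illustrative choices, not part of the lecture):

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha=0.1, max_iter=100, tol=1e-6):
    """Repeat w <- w - alpha * grad_f(w) until the gradient is (nearly) zero."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(w)
        if np.linalg.norm(g) < tol:      # stationary point reached
            break
        w = w - alpha * g                # w_{t+1} = w_t - alpha * grad f(w_t)
    return w

# Example: f(w) = ||w||^2 has gradient 2w and minimizer w* = 0
w_star = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0])
```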
Why gradient descent?
Reason I: the gradient is the "steepest direction" to decrease the objective function locally
Reason II: successive approximation view
At each iteration, form an approximation of f(·) around w_t:
    f(w_t + d) ≈ g(d) := f(w_t) + ∇f(w_t)^T d + (1/(2α))‖d‖²
Update the solution by w_{t+1} ← w_t + d*, where d* = argmin_d g(d):
    ∇g(d*) = 0 ⇒ ∇f(w_t) + (1/α)d* = 0 ⇒ d* = −α∇f(w_t)
d* will decrease f(·) if the step size α is sufficiently small
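As a quick sanity check of this view, one can numerically minimize g and confirm it recovers −α∇f(w_t); this sketch uses an arbitrary quadratic f and scipy.optimize.minimize (both are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

# An arbitrary smooth test function f(w) = w_0^2 + 3 w_1^2
f = lambda w: w[0]**2 + 3 * w[1]**2
grad_f = lambda w: np.array([2 * w[0], 6 * w[1]])

alpha, w_t = 0.1, np.array([1.0, 2.0])
g = lambda d: f(w_t) + grad_f(w_t) @ d + (1 / (2 * alpha)) * (d @ d)

d_star = minimize(g, x0=np.zeros(2)).x    # numerically minimize g(d)
print(d_star, -alpha * grad_f(w_t))       # both approximately [-0.2, -1.2]
```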
Illustration of gradient descent
Form a quadratic approximation around w_t:
    f(w_t + d) ≈ g(d) = f(w_t) + ∇f(w_t)^T d + (1/(2α))‖d‖²
Minimize g(d):
    ∇g(d*) = 0 ⇒ ∇f(w_t) + (1/α)d* = 0 ⇒ d* = −α∇f(w_t)
Update:
    w_{t+1} = w_t + d* = w_t − α∇f(w_t)
Form another quadratic approximation around w_{t+1}:
    f(w_{t+1} + d) ≈ g(d) = f(w_{t+1}) + ∇f(w_{t+1})^T d + (1/(2α))‖d‖²
    d* = −α∇f(w_{t+1})
Update:
    w_{t+2} = w_{t+1} + d* = w_{t+1} − α∇f(w_{t+1})
When will it diverge?
Can diverge (f(w_{t+1}) > f(w_t)) if g is not an upper bound of f
When will it converge?
Always converges (f(w_{t+1}) < f(w_t)) when g is an upper bound of f
Convergence
Let L be the Lipschitz constant of the gradient, i.e., ∇²f(x) ⪯ LI for all x
Theorem: gradient descent converges if α < 1/L
In practice we do not know L, so the step size needs to be tuned when running gradient descent
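On a 1-D quadratic the effect of the step size is easy to see numerically (L, the step sizes, and the iteration count below are illustrative):

```python
L = 2.0                        # f(w) = (L/2) w^2, so grad f(w) = L * w
grad = lambda w: L * w

def run_gd(alpha, steps=20, w0=1.0):
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)    # each step multiplies w by (1 - alpha*L)
    return w

print(run_gd(alpha=0.4))   # alpha < 1/L = 0.5: converges toward 0
print(run_gd(alpha=1.2))   # alpha > 2/L: |1 - alpha*L| > 1, so it diverges
```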
Applying to Logistic Regression
Gradient descent for logistic regression:
Initialize the weights w_0
For t = 1, 2, ...
    Compute the gradient
        ∇Ein = −(1/N) Σ_{n=1}^N (y_n x_n) / (1 + e^{y_n w^T x_n})
    Update the weights: w ← w − η∇Ein
Return the final weights w
When to stop?
Fixed number of iterations, or
stop when ‖∇Ein‖ < ε
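Put together, the loop above might be implemented as follows (a sketch; the function name, defaults, and the vectorized gradient are illustrative, with labels assumed to be in {−1, +1}):

```python
import numpy as np

def logistic_gd(X, y, eta=0.1, max_iter=1000, eps=1e-6):
    """Gradient descent on Ein(w) = (1/N) sum_n log(1 + exp(-y_n w^T x_n)).

    X: (N, d) data matrix; y: (N,) labels in {-1, +1}.
    eta, max_iter, eps are illustrative defaults, not tuned values.
    """
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        margins = y * (X @ w)
        # grad Ein = -(1/N) * sum_n y_n x_n / (1 + exp(y_n w^T x_n))
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        if np.linalg.norm(grad) < eps:   # stop when the gradient is small
            break
        w -= eta * grad
    return w
```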
Stochastic Gradient descent
Large-scale Problems
Machine learning: usually minimizing the in-sample loss (training loss)
    min_w { (1/N) Σ_{n=1}^N ℓ(w^T x_n, y_n) } := Ein(w)   (linear model)
    min_w { (1/N) Σ_{n=1}^N ℓ(h_w(x_n), y_n) } := Ein(w)   (general hypothesis)
ℓ: loss function (e.g., ℓ(a, b) = (a − b)²)
Gradient descent:
    w ← w − η ∇Ein(w), where computing ∇Ein(w) is the main cost
In general, Ein(w) = (1/N) Σ_{n=1}^N f_n(w), where each f_n(w) only depends on (x_n, y_n)
Stochastic gradient
Gradient:
    ∇Ein(w) = (1/N) Σ_{n=1}^N ∇f_n(w)
Each gradient computation needs to go through all training samples:
slow when there are millions of samples
Is there a faster way to compute an "approximate gradient"?
Use stochastic sampling:
Sample a small subset B ⊆ {1, ..., N}
Estimate the gradient by
    ∇Ein(w) ≈ (1/|B|) Σ_{n∈B} ∇f_n(w)
|B|: batch size
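In code, the estimator is just the mean over a random subset of per-sample gradients (a sketch; the (N, d) layout of grads is an assumption for illustration):

```python
import numpy as np

def minibatch_grad(grads, batch_size, rng):
    """Estimate the full gradient from a random subset B of per-sample gradients.

    grads: (N, d) array whose n-th row is grad f_n(w) at the current w.
    """
    N = grads.shape[0]
    B = rng.choice(N, size=batch_size, replace=False)  # B ⊆ {1, ..., N}
    return grads[B].mean(axis=0)                       # (1/|B|) Σ_{n∈B} ∇f_n(w)
```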
Stochastic Gradient Descent (SGD)
Input: training data {x_n, y_n}_{n=1}^N
Initialize w (zero or random)
For t = 1, 2, ...
    Sample a small batch B ⊆ {1, ..., N}
    Update the parameters:
        w ← w − η_t (1/|B|) Σ_{n∈B} ∇f_n(w)
Extreme case: |B| = 1 ⇒ sample one training example at a time
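A minimal minibatch SGD loop, assuming a user-supplied grad_fn(w, Xb, yb) that returns the averaged gradient over the batch (the defaults and the t^{−a} decay are illustrative; decaying step sizes are discussed below):

```python
import numpy as np

def sgd(grad_fn, X, y, batch_size=32, eta0=0.1, a=0.5, epochs=10, seed=0):
    """Minibatch SGD: w <- w - eta_t * (1/|B|) * sum_{n in B} grad f_n(w).

    grad_fn(w, Xb, yb) must return the gradient averaged over the batch.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for _ in range(N // batch_size):
            t += 1
            B = rng.choice(N, size=batch_size, replace=False)  # sample batch
            eta_t = eta0 / t**a        # decaying step size (see slides below)
            w -= eta_t * grad_fn(w, X[B], y[B])
    return w
```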
Logistic Regression by SGD
Logistic regression:
    min_w (1/N) Σ_{n=1}^N log(1 + e^{−y_n w^T x_n}),   where f_n(w) = log(1 + e^{−y_n w^T x_n})
SGD for Logistic Regression
Input: training data {x_n, y_n}_{n=1}^N
Initialize w (zero or random)
For t = 1, 2, ...
    Sample a batch B ⊆ {1, ..., N}
    Update the parameters:
        w ← w − η_t (1/|B|) Σ_{n∈B} (−y_n x_n) / (1 + e^{y_n w^T x_n}),   where the summand is ∇f_n(w)
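The per-batch gradient plugs directly into the SGD loop sketched above (the synthetic data sizes, noise level, and the sgd() call are illustrative):

```python
import numpy as np

def logistic_batch_grad(w, Xb, yb):
    """Averaged gradient of f_n(w) = log(1 + exp(-y_n w^T x_n)) over a batch."""
    margins = yb * (Xb @ w)                       # y_n w^T x_n for each n in B
    return -(Xb * (yb / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

# Tiny synthetic demo (data sizes and noise level are arbitrary)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = np.sign(X @ w_true + 0.1 * rng.normal(size=200))
# w_hat = sgd(logistic_batch_grad, X, y)   # using the sgd() sketch above
```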
Why does SGD work?
The stochastic gradient is an unbiased estimator of the full gradient:
    E[(1/|B|) Σ_{n∈B} ∇f_n(w)] = (1/N) Σ_{n=1}^N ∇f_n(w) = ∇Ein(w)
So each iteration updates by
    gradient + zero-mean noise
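Unbiasedness is easy to check empirically: averaging many minibatch estimates should approach the full gradient (the random stand-in "gradients", batch size, and trial count here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=(1000, 5))   # stand-ins for the per-sample grad f_n(w)
full = grads.mean(axis=0)            # the full gradient (1/N) sum_n grad f_n(w)

# Average many minibatch estimates; by unbiasedness this approaches `full`
est = np.mean([grads[rng.choice(1000, 32, replace=False)].mean(axis=0)
               for _ in range(20000)], axis=0)
print(np.linalg.norm(est - full))    # small, and shrinks with more trials
```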
Stochastic gradient descent
In gradient descent, η (step size) is a fixed constant
Can we use a fixed step size for SGD?
SGD with a fixed step size cannot converge to global/local minimizers:
if w* is the minimizer, ∇f(w*) = (1/N) Σ_{n=1}^N ∇f_n(w*) = 0,
but (1/|B|) Σ_{n∈B} ∇f_n(w*) ≠ 0 if B is a strict subset
(Even if we reach the minimizer, SGD will move away from it)
Stochastic gradient descent: step size
To make SGD converge, the step size should decrease to 0:
    η_t → 0
usually at a polynomial rate: η_t ≈ t^{−a} for a constant a > 0
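For example, a polynomially decaying schedule can be written as follows (eta0 and a = 0.5 are illustrative values that must be tuned per problem):

```python
def step_size(t, eta0=1.0, a=0.5):
    """Polynomially decaying SGD step size: eta_t = eta0 * t**(-a)."""
    return eta0 * t ** (-a)   # t >= 1; eta_t -> 0 as t grows
```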
Stochastic gradient descent vs Gradient descent
Stochastic gradient descent:
Pros:
    cheaper computation per iteration
    faster convergence in the beginning
Cons:
    less stable, slower final convergence
    harder to tune the step size
(Figure from https://medium.com/@ImadPhd/gradient-descent-algorithm-and-its-variants-10f652806a3)
Revisit the Perceptron Learning Algorithm
Given classification data {x_n, y_n}_{n=1}^N
Learn a linear model:
    min_w (1/N) Σ_{n=1}^N ℓ(w^T x_n, y_n)
Consider the loss:
    ℓ(w^T x_n, y_n) = max(0, −y_n w^T x_n)
What's the gradient?
Revisit the Perceptron Learning Algorithm
ℓ(w^T x_n, y_n) = max(0, −y_n w^T x_n)
Consider two cases:
Case I: y_n w^T x_n > 0 (prediction correct)
    ℓ(w^T x_n, y_n) = 0
    (∂/∂w) ℓ(w^T x_n, y_n) = 0
Case II: y_n w^T x_n < 0 (prediction wrong)
    ℓ(w^T x_n, y_n) = −y_n w^T x_n
    (∂/∂w) ℓ(w^T x_n, y_n) = −y_n x_n
SGD update rule (|B| = 1): sample an index n, then
    w_{t+1} ← w_t                 if y_n w^T x_n ≥ 0 (prediction correct)
    w_{t+1} ← w_t + η_t y_n x_n   if y_n w^T x_n < 0 (prediction wrong)
This is equivalent to the Perceptron Learning Algorithm when η_t = 1
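A minimal sketch of this SGD view of the perceptron (the epochs, shuffling, and treating the boundary case y_n w^T x_n = 0 as a mistake are illustrative choices; the boundary case matters when w starts at zero):

```python
import numpy as np

def perceptron_sgd(X, y, epochs=10, eta=1.0, seed=0):
    """SGD with |B| = 1 on the loss max(0, -y_n w^T x_n).

    With eta = 1 this matches the Perceptron Learning Algorithm.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for n in rng.permutation(N):       # visit examples in random order
            if y[n] * (X[n] @ w) <= 0:     # mistake (boundary case included,
                w += eta * y[n] * X[n]     # so the zero init makes progress)
    return w
```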
Conclusions
Gradient descent
Stochastic gradient descent
Next class: LFD 2
Questions?