ECS171: Machine Learning
Lecture 4: Optimization (LFD 3.3, SGD)
Cho-Jui Hsieh
UC Davis
Jan 22, 2018
Gradient descent
Optimization
Goal: find the minimizer of a function
    min_w f(w)
For now we assume f is twice differentiable
Machine learning algorithm: find the hypothesis that minimizes Ein
Convex vs Nonconvex
Convex function:
∇f(w*) = 0 ⇔ w* is a global minimum
A (twice-differentiable) function is convex if ∇²f(w) is positive semidefinite everywhere
Examples: linear regression, logistic regression, ...
Non-convex function:
∇f(w*) = 0 ⇔ w* is a global minimum, a local minimum, or a saddle point
Most algorithms only converge to a point with gradient = 0
Examples: neural networks, ...
Gradient Descent
Gradient descent: repeatedly do
    w_{t+1} ← w_t − α∇f(w_t)
α > 0 is the step size
This generates a sequence w_1, w_2, ... that converges to a solution with ∇f(w) = 0
Step size too large ⇒ divergence; too small ⇒ slow convergence
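As a concrete reference, a minimal gradient-descent loop in Python might look like this (the quadratic test function, step size, and stopping rule are illustrative choices, not part of the lecture):

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha=0.1, max_iter=100, tol=1e-6):
    """Repeat w <- w - alpha * grad_f(w) until the gradient is (nearly) zero."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(w)
        if np.linalg.norm(g) < tol:      # stationary point reached
            break
        w = w - alpha * g                # w_{t+1} = w_t - alpha * grad f(w_t)
    return w

# Example: f(w) = ||w||^2 has gradient 2w and minimizer w* = 0
w_star = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0])
```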
Why gradient descent?
Reason I: the gradient is the "steepest direction" to decrease the objective function locally
Reason II: successive approximation view
At each iteration, form an approximation of f(·) around w_t:
    f(w_t + d) ≈ g(d) := f(w_t) + ∇f(w_t)^T d + (1/(2α))‖d‖²
Update the solution by w_{t+1} ← w_t + d*, where d* = argmin_d g(d):
    ∇g(d*) = 0 ⇒ ∇f(w_t) + (1/α)d* = 0 ⇒ d* = −α∇f(w_t)
d* will decrease f(·) if the step size α is sufficiently small
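As a quick sanity check of this view, one can numerically minimize g and confirm it recovers −α∇f(w_t); this sketch uses an arbitrary quadratic f and scipy.optimize.minimize (both are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

# An arbitrary smooth test function f(w) = w_0^2 + 3 w_1^2
f = lambda w: w[0]**2 + 3 * w[1]**2
grad_f = lambda w: np.array([2 * w[0], 6 * w[1]])

alpha, w_t = 0.1, np.array([1.0, 2.0])
g = lambda d: f(w_t) + grad_f(w_t) @ d + (1 / (2 * alpha)) * (d @ d)

d_star = minimize(g, x0=np.zeros(2)).x    # numerically minimize g(d)
print(d_star, -alpha * grad_f(w_t))       # both approximately [-0.2, -1.2]
```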
Illustration of gradient descent
Form a quadratic approximation around w_t:
    f(w_t + d) ≈ g(d) = f(w_t) + ∇f(w_t)^T d + (1/(2α))‖d‖²
Minimize g(d):
    ∇g(d*) = 0 ⇒ ∇f(w_t) + (1/α)d* = 0 ⇒ d* = −α∇f(w_t)
Update:
    w_{t+1} = w_t + d* = w_t − α∇f(w_t)
Form another quadratic approximation around w_{t+1}:
    f(w_{t+1} + d) ≈ g(d) = f(w_{t+1}) + ∇f(w_{t+1})^T d + (1/(2α))‖d‖²
    d* = −α∇f(w_{t+1})
Update:
    w_{t+2} = w_{t+1} + d* = w_{t+1} − α∇f(w_{t+1})
When will it diverge?
Can diverge (f(w_{t+1}) > f(w_t)) if g is not an upper bound of f
When will it converge?
Always converges (f(w_{t+1}) < f(w_t)) when g is an upper bound of f
Convergence
Let L be the Lipschitz constant of the gradient, i.e., ∇²f(x) ⪯ LI for all x
Theorem: gradient descent converges if α < 1/L
In practice we do not know L, so the step size needs to be tuned when running gradient descent
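On a 1-D quadratic the effect of the step size is easy to see numerically (L, the step sizes, and the iteration count below are illustrative):

```python
L = 2.0                        # f(w) = (L/2) w^2, so grad f(w) = L * w
grad = lambda w: L * w

def run_gd(alpha, steps=20, w0=1.0):
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)    # each step multiplies w by (1 - alpha*L)
    return w

print(run_gd(alpha=0.4))   # alpha < 1/L = 0.5: converges toward 0
print(run_gd(alpha=1.2))   # alpha > 2/L: |1 - alpha*L| > 1, so it diverges
```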
Applying to Logistic Regression
Gradient descent for logistic regression:
Initialize the weights w_0
For t = 1, 2, ...
    Compute the gradient
        ∇Ein = −(1/N) Σ_{n=1}^N (y_n x_n) / (1 + e^{y_n w^T x_n})
    Update the weights: w ← w − η∇Ein
Return the final weights w
When to stop?
Fixed number of iterations, or
stop when ‖∇Ein‖ < ε
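Put together, the loop above might be implemented as follows (a sketch; the function name, defaults, and the vectorized gradient are illustrative, with labels assumed to be in {−1, +1}):

```python
import numpy as np

def logistic_gd(X, y, eta=0.1, max_iter=1000, eps=1e-6):
    """Gradient descent on Ein(w) = (1/N) sum_n log(1 + exp(-y_n w^T x_n)).

    X: (N, d) data matrix; y: (N,) labels in {-1, +1}.
    eta, max_iter, eps are illustrative defaults, not tuned values.
    """
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        margins = y * (X @ w)
        # grad Ein = -(1/N) * sum_n y_n x_n / (1 + exp(y_n w^T x_n))
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        if np.linalg.norm(grad) < eps:   # stop when the gradient is small
            break
        w -= eta * grad
    return w
```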
Stochastic Gradient descent
Large-scale Problems
Machine learning: usually minimizing the in-sample loss (training loss)
    min_w { (1/N) Σ_{n=1}^N ℓ(w^T x_n, y_n) } := Ein(w)   (linear model)
    min_w { (1/N) Σ_{n=1}^N ℓ(h_w(x_n), y_n) } := Ein(w)   (general hypothesis)
ℓ: loss function (e.g., ℓ(a, b) = (a − b)²)
Gradient descent:
    w ← w − η ∇Ein(w), where computing ∇Ein(w) is the main cost
In general, Ein(w) = (1/N) Σ_{n=1}^N f_n(w), where each f_n(w) only depends on (x_n, y_n)
Stochastic gradient
Gradient:
    ∇Ein(w) = (1/N) Σ_{n=1}^N ∇f_n(w)
Each gradient computation needs to go through all training samples:
slow when there are millions of samples
Is there a faster way to compute an "approximate gradient"?
Use stochastic sampling:
Sample a small subset B ⊆ {1, ..., N}
Estimate the gradient by
    ∇Ein(w) ≈ (1/|B|) Σ_{n∈B} ∇f_n(w)
|B|: batch size
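In code, the estimator is just the mean over a random subset of per-sample gradients (a sketch; the (N, d) layout of grads is an assumption for illustration):

```python
import numpy as np

def minibatch_grad(grads, batch_size, rng):
    """Estimate the full gradient from a random subset B of per-sample gradients.

    grads: (N, d) array whose n-th row is grad f_n(w) at the current w.
    """
    N = grads.shape[0]
    B = rng.choice(N, size=batch_size, replace=False)  # B ⊆ {1, ..., N}
    return grads[B].mean(axis=0)                       # (1/|B|) Σ_{n∈B} ∇f_n(w)
```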
Stochastic Gradient Descent (SGD)
Input: training data {x_n, y_n}_{n=1}^N
Initialize w (zero or random)
For t = 1, 2, ...
    Sample a small batch B ⊆ {1, ..., N}
    Update the parameters:
        w ← w − η_t (1/|B|) Σ_{n∈B} ∇f_n(w)
Extreme case: |B| = 1 ⇒ sample one training example at a time
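A minimal minibatch SGD loop, assuming a user-supplied grad_fn(w, Xb, yb) that returns the averaged gradient over the batch (the defaults and the t^{−a} decay are illustrative; decaying step sizes are discussed below):

```python
import numpy as np

def sgd(grad_fn, X, y, batch_size=32, eta0=0.1, a=0.5, epochs=10, seed=0):
    """Minibatch SGD: w <- w - eta_t * (1/|B|) * sum_{n in B} grad f_n(w).

    grad_fn(w, Xb, yb) must return the gradient averaged over the batch.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for _ in range(N // batch_size):
            t += 1
            B = rng.choice(N, size=batch_size, replace=False)  # sample batch
            eta_t = eta0 / t**a        # decaying step size (see slides below)
            w -= eta_t * grad_fn(w, X[B], y[B])
    return w
```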
Logistic Regression by SGD
Logistic regression:
    min_w (1/N) Σ_{n=1}^N log(1 + e^{−y_n w^T x_n}),   where f_n(w) = log(1 + e^{−y_n w^T x_n})
SGD for Logistic Regression
Input: training data {x_n, y_n}_{n=1}^N
Initialize w (zero or random)
For t = 1, 2, ...
    Sample a batch B ⊆ {1, ..., N}
    Update the parameters:
        w ← w − η_t (1/|B|) Σ_{n∈B} (−y_n x_n) / (1 + e^{y_n w^T x_n}),   where the summand is ∇f_n(w)
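The per-batch gradient plugs directly into the SGD loop sketched above (the synthetic data sizes, noise level, and the sgd() call are illustrative):

```python
import numpy as np

def logistic_batch_grad(w, Xb, yb):
    """Averaged gradient of f_n(w) = log(1 + exp(-y_n w^T x_n)) over a batch."""
    margins = yb * (Xb @ w)                       # y_n w^T x_n for each n in B
    return -(Xb * (yb / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

# Tiny synthetic demo (data sizes and noise level are arbitrary)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = np.sign(X @ w_true + 0.1 * rng.normal(size=200))
# w_hat = sgd(logistic_batch_grad, X, y)   # using the sgd() sketch above
```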
Why does SGD work?
The stochastic gradient is an unbiased estimator of the full gradient:
    E[(1/|B|) Σ_{n∈B} ∇f_n(w)] = (1/N) Σ_{n=1}^N ∇f_n(w) = ∇Ein(w)
So each iteration updates by
    gradient + zero-mean noise
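Unbiasedness is easy to check empirically: averaging many minibatch estimates should approach the full gradient (the random stand-in "gradients", batch size, and trial count here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=(1000, 5))   # stand-ins for the per-sample grad f_n(w)
full = grads.mean(axis=0)            # the full gradient (1/N) sum_n grad f_n(w)

# Average many minibatch estimates; by unbiasedness this approaches `full`
est = np.mean([grads[rng.choice(1000, 32, replace=False)].mean(axis=0)
               for _ in range(20000)], axis=0)
print(np.linalg.norm(est - full))    # small, and shrinks with more trials
```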
Stochastic gradient descent
In gradient descent, η (step size) is a fixed constant
Can we use a fixed step size for SGD?
SGD with a fixed step size cannot converge to global/local minimizers:
if w* is the minimizer, ∇f(w*) = (1/N) Σ_{n=1}^N ∇f_n(w*) = 0,
but (1/|B|) Σ_{n∈B} ∇f_n(w*) ≠ 0 if B is a strict subset
(Even if we reach the minimizer, SGD will move away from it)
Stochastic gradient descent: step size
To make SGD converge, the step size should decrease to 0:
    η_t → 0
usually at a polynomial rate: η_t ≈ t^{−a} for a constant a > 0
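For example, a polynomially decaying schedule can be written as follows (eta0 and a = 0.5 are illustrative values that must be tuned per problem):

```python
def step_size(t, eta0=1.0, a=0.5):
    """Polynomially decaying SGD step size: eta_t = eta0 * t**(-a)."""
    return eta0 * t ** (-a)   # t >= 1; eta_t -> 0 as t grows
```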
Stochastic gradient descent vs Gradient descent
Stochastic gradient descent:
Pros:
    cheaper computation per iteration
    faster convergence in the beginning
Cons:
    less stable, slower final convergence
    harder to tune the step size
(Figure from https://medium.com/@ImadPhd/gradient-descent-algorithm-and-its-variants-10f652806a3)
Revisit the Perceptron Learning Algorithm
Given classification data {x_n, y_n}_{n=1}^N
Learn a linear model:
    min_w (1/N) Σ_{n=1}^N ℓ(w^T x_n, y_n)
Consider the loss:
    ℓ(w^T x_n, y_n) = max(0, −y_n w^T x_n)
What's the gradient?
Revisit the Perceptron Learning Algorithm
ℓ(w^T x_n, y_n) = max(0, −y_n w^T x_n)
Consider two cases:
Case I: y_n w^T x_n > 0 (prediction correct)
    ℓ(w^T x_n, y_n) = 0
    (∂/∂w) ℓ(w^T x_n, y_n) = 0
Case II: y_n w^T x_n < 0 (prediction wrong)
    ℓ(w^T x_n, y_n) = −y_n w^T x_n
    (∂/∂w) ℓ(w^T x_n, y_n) = −y_n x_n
SGD update rule (|B| = 1): sample an index n, then
    w_{t+1} ← w_t                 if y_n w^T x_n ≥ 0 (prediction correct)
    w_{t+1} ← w_t + η_t y_n x_n   if y_n w^T x_n < 0 (prediction wrong)
This is equivalent to the Perceptron Learning Algorithm when η_t = 1
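A minimal sketch of this SGD view of the perceptron (the epochs, shuffling, and treating the boundary case y_n w^T x_n = 0 as a mistake are illustrative choices; the boundary case matters when w starts at zero):

```python
import numpy as np

def perceptron_sgd(X, y, epochs=10, eta=1.0, seed=0):
    """SGD with |B| = 1 on the loss max(0, -y_n w^T x_n).

    With eta = 1 this matches the Perceptron Learning Algorithm.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for n in rng.permutation(N):       # visit examples in random order
            if y[n] * (X[n] @ w) <= 0:     # mistake (boundary case included,
                w += eta * y[n] * X[n]     # so the zero init makes progress)
    return w
```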
Conclusions
Gradient descent
Stochastic gradient descent
Next class: LFD 2
Questions?