CSE 151B/251B
Deep Learning
Rose Yu
ANNOUNCEMENT
Kaggle Group Signup
• Sign up as a group before April 18th
• Use Piazza's Search for Teammates function
• Sign up on the Google sheet first
• After that, register the info on Canvas to receive AWS
OPTIMIZATION
Learning as Optimization
[Figure: learning as optimization — the loss plotted as a function of a weight parameter]
to learn the weights, we need the derivative of the loss w.r.t. the weight
i.e. “how should the weight be updated to decrease the loss?”
$w = w - \alpha \frac{\partial L}{\partial w}$
with multiple weights, we need the gradient of the loss w.r.t. the weights
$w = w - \alpha \nabla_w L$, where $\alpha$ is the learning rate
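To make the update rule concrete, here is a minimal gradient-descent sketch on a 1-D quadratic loss (the loss, starting point, and learning rate are illustrative assumptions, not course code):

```python
# Minimal gradient descent sketch on the 1-D loss L(w) = (w - 3)^2.
# Loss, starting point, and learning rate are illustrative choices.

def loss(w):
    return (w - 3.0) ** 2

def dloss_dw(w):
    return 2.0 * (w - 3.0)          # dL/dw

w = 0.0                             # initial weight
alpha = 0.1                         # learning rate
for step in range(50):
    w = w - alpha * dloss_dw(w)     # w <- w - alpha * dL/dw

print(w, loss(w))                   # w approaches 3, the minimizer of L
```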
Gradient Descent
Convexity
Traditional loss functions are often assumed to be convex:
$L(x_2) - L(x_1) \ge \nabla L(x_1)^\top (x_2 - x_1)$
Easy to find global optima!
Strictly convex if the inequality is strict ($>$) for all $x_1 \ne x_2$
Convexity
• All local optima are global optima
• Strictly convex: unique global optimum
• Stochastic gradient descent will find the global optimum with a good learning rate
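A quick numerical sanity check of the first-order condition above, using an assumed convex quadratic $L(x) = x^\top A x$ with a positive semi-definite $A$ (purely illustrative, not from the slides):

```python
# Verify L(x2) - L(x1) >= grad L(x1)^T (x2 - x1) for a convex quadratic.
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
A = B.T @ B                          # symmetric positive semi-definite => L is convex

def L(x):
    return x @ A @ x

def grad_L(x):
    return 2.0 * A @ x

for _ in range(1000):
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    # first-order condition for convexity (small tolerance for floating point)
    assert L(x2) - L(x1) >= grad_L(x1) @ (x2 - x1) - 1e-9
```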
Poll
Optimizing Deep Neural Nets
[Figure: loss landscape visualizations — convex vs. non-convex (VGG-56); the non-convex surface contains saddle points]
Li, Hao, et al. "Visualizing the loss landscape of neural nets." Advances in Neural Information Processing Systems, 2018.
stochastic gradient descent
large-scale datasets (large $N$) lead to a memory bottleneck
stochastic gradient descent (SGD): $w = w - \alpha \tilde{\nabla}_w L$
use a stochastic gradient estimate to descend the surface of the loss function
batch gradient: $\nabla_w L = \frac{1}{n} \sum_{i=1}^{n} \nabla_w L(f(x_i; w), y_i)$
stochastic gradient: $\tilde{\nabla}_w L = \nabla_w L(f(x_i; w), y_i)$ for a single randomly sampled example $i$
the mini-batch size interpolates between these two extremes
Batch and Minibatch
mini-batch size ranges from $n$ (full batch) down to $1$ (pure SGD)

Batch (mini-batch size $= n$):
• Accurate gradient estimate
• Fast training
• Memory bottleneck
• Hard to parallelize
• Worse generalization

Stochastic (mini-batch size $= 1$):
• Approximate estimate
• Slow training
• Memory efficient
• Easy to parallelize
• Better generalization
Stochastic gradient needs to be an unbiased estimator:
$\nabla_w L(w) = \mathbb{E}_{\xi}\left[\tilde{\nabla}_w L(w, \xi)\right]$
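To see the unbiasedness numerically, the toy sketch below (a linear model with squared loss, all data made up) compares the full-batch gradient with a Monte Carlo estimate of the expected single-example gradient:

```python
# E[gradient of a uniformly sampled example] equals the full-batch (average) gradient,
# so the stochastic gradient is an unbiased estimator of the true gradient.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # toy inputs
y = rng.normal(size=100)             # toy targets
w = rng.normal(size=5)               # current weights

def per_example_grad(i, w):
    # gradient of 0.5 * (x_i @ w - y_i)^2 with respect to w
    return (X[i] @ w - y[i]) * X[i]

batch_grad = np.mean([per_example_grad(i, w) for i in range(len(y))], axis=0)

# Monte Carlo estimate of E[per-example gradient] under uniform sampling
samples = rng.integers(0, len(y), size=50_000)
mc_grad = np.mean([per_example_grad(i, w) for i in samples], axis=0)

print(np.max(np.abs(mc_grad - batch_grad)))   # close to 0: the estimator is unbiased
```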
Mini-Batch
• mini-batch implementation: https://d2l.ai/chapter_optimization/minibatch-sgd.html
• Mini-batch SGD is faster than both full-batch GD and single-example SGD
• Trade-off between convergence speed and computational efficiency
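For reference, here is a minimal NumPy mini-batch SGD loop for a linear model with squared loss; the model, data, batch size of 32, and learning rate are assumptions chosen for illustration (the d2l link above gives the reference implementation):

```python
# Mini-batch SGD sketch: shuffle each epoch, slice mini-batches, and take one
# gradient step per batch. Model, loss, batch size, and learning rate are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.01 * rng.normal(size=1024)

w = np.zeros(10)
alpha, batch_size = 0.1, 32

for epoch in range(20):
    perm = rng.permutation(len(y))               # reshuffle the data every epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(idx)   # mini-batch gradient of 0.5*MSE
        w -= alpha * grad                        # SGD update

print(np.max(np.abs(w - true_w)))                # close to 0 after training
```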
SGD is Not Enough
• Increased frequency and severity of bad local minima (relative to the global minimum)
• Pathological curvature, like the type seen in the well-known Rosenbrock function
An Ill-Conditioned Example
• Consider optimizing $f(x) = 0.1 x_1^2 + 2 x_2^2$
• The function is very flat along the direction of $x_1$
• The gradient is much larger and changes rapidly along the direction of $x_2$
[Figure: gradient descent trajectory with learning rate 0.4]
• Increasing the learning rate to 0.6 improves the convergence along the $x_1$ direction, but the overall solution is much worse
[Figure: gradient descent trajectory with learning rate 0.6]
• Key idea: can we average the past gradients?
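The behavior described above can be reproduced with a few lines of NumPy; the starting point is an arbitrary assumption:

```python
# Plain gradient descent on the ill-conditioned quadratic f(x) = 0.1*x1^2 + 2*x2^2.
# With lr = 0.4 progress along x1 is slow; with lr = 0.6 the x2 direction diverges.
import numpy as np

def grad(x):
    return np.array([0.2 * x[0], 4.0 * x[1]])   # gradient of f

def run_gd(lr, steps=20, x0=(-5.0, -2.0)):
    x = np.array(x0)
    for _ in range(steps):
        x = x - lr * grad(x)                    # plain gradient descent step
    return x

print(run_gd(lr=0.4))   # x2 has converged, but x1 is still far from 0
print(run_gd(lr=0.6))   # x2 oscillates and diverges: |1 - 0.6 * 4| = 1.4 > 1
```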
Momentum
• SGD oscillates a lot and converges slowly in the
direction with a flat landscape
• Use a "leaky" gradient: accumulate a fraction of the past gradient updates

$v_t = \beta v_{t-1} - g_{t,t-1}, \qquad w_{t+1} = w_t + v_t$

where $g_{t,t-1}$ is the (stochastic) gradient at step $t$; unrolling the recursion (with $v_0 = 0$) shows that $v_t$ is a geometrically weighted sum of all past gradients:

$v_t = \beta^2 v_{t-2} - \beta g_{t-1,t-2} - g_{t,t-1} = \cdots = -\sum_{\tau=0}^{t-1} \beta^{\tau} g_{t-\tau,\, t-\tau-1}$
Example
[Figure: trajectory with learning rate 0.6, momentum 0.25]
• momentum = 0 recovers the original SGD equation
• momentum improves learning even with the same learning rate

[Figure: trajectory with learning rate 0.6, momentum 0.5]
• increasing the momentum can lead to oscillating trajectories, but these are still much better than the diverged solutions
Momentum
• Momentum
physical interpretation: update a velocity

SGD without momentum: $w_{t+1} = w_t - \alpha \tilde{\nabla}_w L(w_t)$

SGD with momentum: $v_t = \beta v_{t-1} - \alpha \tilde{\nabla}_w L(w_t), \quad w_{t+1} = w_t + v_t$

• Nesterov Momentum
stronger theoretical guarantee; the gradient is evaluated at the look-ahead point $w_t + \beta v_{t-1}$:

$v_t = \beta v_{t-1} - \alpha \tilde{\nabla}_w L(w_t + \beta v_{t-1}), \quad w_{t+1} = w_t + v_t$
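Both updates can be written as small step functions; the sketch below (NumPy, with assumed values of $\alpha$ and $\beta$) follows the notation above and reuses the ill-conditioned quadratic from the earlier example:

```python
# Classical (heavy-ball) momentum and Nesterov momentum update steps.
# grad_fn(w) should return the (stochastic) gradient of the loss at w.
import numpy as np

def momentum_step(w, v, grad_fn, alpha=0.1, beta=0.9):
    v = beta * v - alpha * grad_fn(w)              # v_t = beta*v_{t-1} - alpha*grad(w_t)
    return w + v, v                                # w_{t+1} = w_t + v_t

def nesterov_step(w, v, grad_fn, alpha=0.1, beta=0.9):
    v = beta * v - alpha * grad_fn(w + beta * v)   # gradient at the look-ahead point
    return w + v, v

# Usage on the ill-conditioned quadratic f(x) = 0.1*x1^2 + 2*x2^2:
grad_fn = lambda x: np.array([0.2 * x[0], 4.0 * x[1]])
w, v = np.array([-5.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, grad_fn)
print(w)   # close to the minimizer (0, 0) in both directions
```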
Learning Rate
$w = w - \alpha \tilde{\nabla}_w L$, where $\alpha$ is the learning rate
Learning rate is essential to convergence speed and accuracy
[Figure: loss $L(w)$ vs. $w$ for a learning rate that is too low, too high, and just right]
Simple Strategy
• Divide Loss Function by Number of Examples:
$w_{t+1} = w_t - \frac{\alpha}{n} \tilde{\nabla}_w L(w_t)$
• Start with a large step size
• If the loss plateaus, divide the step size by 2: $\alpha_t = \frac{\alpha_0}{2^t}$ after $t$ halvings
• (Can also use advanced optimization methods)
• (Step size must decrease over time to guarantee
convergence to global optimum)
• Scale the learning rate by the iteration count: $\alpha_t = \frac{\alpha_0}{t + c}$
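A sketch of the two schedules just described, as plain Python helpers; the constants (patience, tolerance, c) are placeholders, not values from the course:

```python
# Two simple learning-rate schedules (constants are illustrative placeholders).

def halve_on_plateau(alpha, loss_history, patience=3, tol=1e-4):
    """Divide the step size by 2 if the loss has not improved for `patience` steps."""
    if (len(loss_history) > patience
            and loss_history[-1] > min(loss_history[:-patience]) - tol):
        return alpha / 2.0
    return alpha

def inverse_decay(alpha0, t, c=1.0):
    """Scale the learning rate by the iteration count: alpha_t = alpha0 / (t + c)."""
    return alpha0 / (t + c)

print([round(inverse_decay(1.0, t), 3) for t in range(5)])   # 1.0, 0.5, 0.333, 0.25, 0.2
```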
Adaptive Learning Rate
• Potential issues of a fixed learning rate:
• parameters for common features converge rather quickly to their optimal values
• infrequent features have not been observed often enough for their optimal values to be determined
• Remedy: let the learning rate scale according to the features:
$\alpha_t = \frac{\alpha_0}{\sqrt{s(i, t) + c}}$, where $s(i, t)$ counts the number of nonzeros for parameter dimension $i$ that we have observed up to time $t$
• This counting applies only to the input features, not to the gradients (a small sketch follows below)
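A small sketch of the per-feature count $s(i, t)$ on a made-up sparse design matrix; the square-root form follows the formula above:

```python
# Track how often each input feature has been nonzero so far, s(i, t), and use it
# to scale a per-feature learning rate. The sparse data here is made up.
import numpy as np

rng = np.random.default_rng(0)
# features with very different nonzero frequencies (0.9 down to 0.01)
X = (rng.random(size=(1000, 6)) < [0.9, 0.5, 0.2, 0.1, 0.05, 0.01]).astype(float)

alpha0, c = 0.5, 1.0
s = np.zeros(6)                       # s[i] = nonzero count of feature i observed so far
for t in range(len(X)):
    s += (X[t] != 0)
    alpha = alpha0 / np.sqrt(s + c)   # rarely observed features keep a larger rate

print(s)        # frequent features have large counts ...
print(alpha)    # ... and therefore small learning rates; rare features stay large
```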
Adaptive Learning Rate
• AdaGrad: adaptively scales the learning rate for each parameter dimension
• update: $w_{t+1} = w_t - \frac{\epsilon}{\delta + \sqrt{r_t}} \odot g$ (adaptive per-parameter learning rate applied to the gradient $g$)
• accumulator: $r_t = r_{t-1} + g \odot g$ (running sum of squared gradients)
• Hessian matrix:
$$H = \begin{bmatrix} \frac{\partial^2 L}{\partial w_1^2} & \cdots & \frac{\partial^2 L}{\partial w_1 \partial w_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 L}{\partial w_n \partial w_1} & \cdots & \frac{\partial^2 L}{\partial w_n^2} \end{bmatrix}$$
• $g \odot g$ approximates $\mathrm{Diag}(H)$, the diagonal of the Hessian
Adaptive Learning Rate
Cost is sensitive to the learning rate only in some directions in the parameter space
maintain a "memory" of previous gradients and scale the gradient per parameter
• AdaGrad
gradient: $g = \tilde{\nabla}_w L(w_t)$
approximate Hessian: $r_t = r_{t-1} + g \odot g$
update: $w_{t+1} = w_t - \frac{\epsilon}{\delta + \sqrt{r_t}} \odot g$
• RMSProp
gradient: $g = \tilde{\nabla}_w L(w_t)$
approximate Hessian: $r_t = \rho r_{t-1} + (1 - \rho)\, g \odot g$
update: $w_{t+1} = w_t - \frac{\epsilon}{\sqrt{\delta + r_t}} \odot g$
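Both updates, written as step functions in NumPy; $\epsilon$, $\delta$, and $\rho$ use common default values assumed here, not values from the course:

```python
# AdaGrad and RMSProp update steps, following the notation above:
# epsilon = global learning rate, delta = small constant for numerical stability,
# r = running sum (AdaGrad) or leaky average (RMSProp) of squared gradients.
import numpy as np

def adagrad_step(w, r, g, epsilon=0.01, delta=1e-7):
    r = r + g * g                                   # accumulate squared gradients
    w = w - (epsilon / (delta + np.sqrt(r))) * g    # per-parameter scaled update
    return w, r

def rmsprop_step(w, r, g, epsilon=0.01, rho=0.9, delta=1e-6):
    r = rho * r + (1.0 - rho) * g * g               # exponential moving average
    w = w - (epsilon / np.sqrt(delta + r)) * g
    return w, r

# Usage on the ill-conditioned quadratic from the earlier example:
grad_fn = lambda x: np.array([0.2 * x[0], 4.0 * x[1]])
w, r = np.array([-5.0, -2.0]), np.zeros(2)
for _ in range(2000):
    w, r = rmsprop_step(w, r, grad_fn(w))
print(w)   # progresses at a comparable rate in both directions, unlike plain GD
```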
Comparison
local minima and saddle points are largely not an issue
in many dimensions, the optimizer can move in exponentially more directions
http://sebastianruder.com/optimizing-gradient-descent/index.html