
Convex Optimization V

Presidency University

May, 2025
Stochastic gradient descent

- Consider minimizing an average of functions

      min_x  (1/m) ∑_{i=1}^m f_i(x)

- Since ∇ ∑_{i=1}^m f_i(x) = ∑_{i=1}^m ∇f_i(x), gradient descent would repeat:

      x^(k) = x^(k−1) − t_k · (1/m) ∑_{i=1}^m ∇f_i(x^(k−1)),   k = 1, 2, 3, . . .

- In comparison, stochastic gradient descent or SGD (or incremental gradient descent) repeats:

      x^(k) = x^(k−1) − t_k · ∇f_{i_k}(x^(k−1)),   k = 1, 2, 3, . . .

  where i_k ∈ {1, 2, . . . , m} is some index chosen at iteration k.


- Two rules for choosing the index i_k at iteration k:
  - Randomized rule: choose i_k ∈ {1, 2, . . . , m} uniformly at random
  - Cyclic rule: choose i_k = 1, 2, . . . , m, 1, 2, . . . , m, . . .
- The randomized rule is more common in practice. For the randomized rule, we note that

      E[∇f_{i_k}(x)] = ∇f(x),

  so we can view SGD as using an unbiased estimate of the gradient at each step.
- Main appeal of SGD:
  - Iteration cost is independent of m (the number of functions)
  - Can also be a big savings in terms of memory usage
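As a rough illustration (a minimal sketch, not part of the slides: the helper name grad_fi, the seed, and the diminishing schedule t_k = 1/k are illustrative choices), SGD with the randomized rule looks like this in Python:

    import numpy as np

    def sgd(grad_fi, m, x0, n_iter=1000):
        """SGD for min_x (1/m) * sum_i f_i(x).

        grad_fi(i, x) returns the gradient of the i-th function f_i at x.
        Uses the randomized rule and diminishing step sizes t_k = 1/k.
        """
        rng = np.random.default_rng(0)
        x = np.asarray(x0, dtype=float)
        for k in range(1, n_iter + 1):
            i_k = rng.integers(m)          # randomized rule: i_k uniform on {0, ..., m-1}
            t_k = 1.0 / k                  # diminishing step size
            x = x - t_k * grad_fi(i_k, x)  # stochastic gradient step, cost independent of m
        return x

Note that only one ∇f_i is evaluated per iteration, which is exactly why the iteration cost does not depend on m.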
Example: stochastic logistic regression

- Given (x_i, y_i) ∈ R^p × {0, 1}, i = 1, . . . , n, recall logistic regression:

      min_β  f(β) = (1/n) ∑_{i=1}^n ( −y_i x_i^T β + log(1 + exp(x_i^T β)) ),

  where f_i(β) denotes the i-th summand.
- The gradient computation ∇f(β) = (1/n) ∑_{i=1}^n (p_i(β) − y_i) x_i, with p_i(β) = exp(x_i^T β) / (1 + exp(x_i^T β)), is doable when n is moderate, but not when n is large.
- If we compare the full gradient (also called batch) against the stochastic gradient, we get:
  - One batch update costs O(np)
  - One stochastic update costs O(p)
- The following picture shows the behavior of full (batch) gradient descent compared to stochastic gradient descent for an example with n = 10, p = 2 (although this is not the setting SGD is usually used in).
- We note that far from the optimum SGD moves quickly, but close to the optimum gradient descent converges quickly while SGD struggles.
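To make the cost comparison concrete, here is a sketch of the two gradient computations (my own illustration; sigmoid, grad_full, and grad_i are assumed helper names). The stochastic gradient touches a single row x_i, so it costs O(p), while the batch gradient touches all n rows, costing O(np):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grad_full(beta, X, y):
        """Batch gradient: (1/n) * sum_i (p_i(beta) - y_i) x_i.  Cost O(np)."""
        p = sigmoid(X @ beta)
        return X.T @ (p - y) / len(y)

    def grad_i(beta, X, y, i):
        """Gradient of the i-th summand f_i: (p_i(beta) - y_i) x_i.  Cost O(p)."""
        p_i = sigmoid(X[i] @ beta)
        return (p_i - y[i]) * X[i]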
Step sizes

- Generally, in SGD we use diminishing step sizes, e.g., t_k = 1/k for k = 1, 2, 3, . . .
- Why not fixed step sizes? We can give an intuitive explanation.
- Suppose we take the cyclic rule for simplicity. If we set t_k = t for m updates in a row, we get:

      x^(k+m) = x^(k) − t ∑_{i=1}^m ∇f_i(x^(k+i−1)).

- Meanwhile, a full gradient step with step size t would give:

      x^(k+1) = x^(k) − t ∑_{i=1}^m ∇f_i(x^(k)).

- The difference here is t ∑_{i=1}^m [∇f_i(x^(k+i−1)) − ∇f_i(x^(k))], and if we hold t constant, this difference will not generally go to zero.
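A small numerical illustration of this point (my own toy example, not from the slides): take f_i(x) = (x − a_i)^2 / 2, so the minimizer of the average is the mean of the a_i. With a fixed step the cyclic iterates keep circling a point offset from the optimum, while with t_k = 1/k they settle down:

    import numpy as np

    a = np.array([0.0, 1.0, 4.0])            # minimizer of (1/m) * sum_i (x - a_i)^2 / 2 is mean(a)
    m, x_fixed, x_dimin, k = len(a), 10.0, 10.0, 0

    for sweep in range(200):                 # cyclic rule: i = 1, ..., m, 1, ..., m, ...
        for i in range(m):
            k += 1
            x_fixed -= 0.3 * (x_fixed - a[i])        # fixed step: hovers near, but never reaches, mean(a)
            x_dimin -= (1.0 / k) * (x_dimin - a[i])  # diminishing step t_k = 1/k: converges to mean(a)

    print(np.mean(a), x_fixed, x_dimin)      # roughly 1.667 (optimum), ~2.1 (fixed), ~1.667 (diminishing)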
Convergence rates

- Recall that for convex f, gradient descent with diminishing step sizes satisfies

      f(x^(k)) − f* = O(1/√k).

- When f is differentiable with a Lipschitz gradient, we get, for suitable fixed step sizes,

      f(x^(k)) − f* = O(1/k).

- What about SGD?
- For convex f, SGD with diminishing step sizes satisfies

      E[f(x^(k))] − f* = O(1/√k).

- Unfortunately, this does not improve when we further assume f has a Lipschitz gradient.
- Even worse is the following discrepancy!
- When f is strongly convex and has a Lipschitz gradient, gradient descent satisfies

      f(x^(k)) − f* = O(c^k),

  where c < 1.
- But under the same conditions, SGD gives us

      E[f(x^(k))] − f* = O(1/k).

- So stochastic methods do not enjoy the linear convergence rate of gradient descent under strong convexity.
- What can we do to improve SGD?
- One answer, especially from the machine learning community, is that these shortcomings of SGD are not very critical: for better out-of-sample performance of our estimators, we do not necessarily want to fully optimize the objective.
- Another answer is to use mini-batch SGD, which we shall now discuss.
Mini-batches

- Also common is mini-batch stochastic gradient descent, where we choose a random subset I_k ⊆ {1, 2, . . . , m} of size |I_k| = b ≪ m, and repeat:

      x^(k) = x^(k−1) − t_k · (1/b) ∑_{i ∈ I_k} ∇f_i(x^(k−1)),   k = 1, 2, 3, . . .

- Again, we are approximating the full gradient by an unbiased estimate:

      E[ (1/b) ∑_{i ∈ I_k} ∇f_i(x) ] = ∇f(x).

- Using mini-batches reduces the variance of our gradient estimate by a factor of 1/b, but is also b times more expensive.
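A minimal sketch of one mini-batch step (the helper name grad_fi and sampling I_k without replacement are illustrative assumptions):

    import numpy as np

    def minibatch_sgd_step(x, grad_fi, m, b, t_k, rng):
        """One mini-batch SGD step for min_x (1/m) * sum_i f_i(x).

        Samples I_k of size b uniformly without replacement and averages the b
        per-function gradients; this average is an unbiased estimate of the full gradient.
        """
        I_k = rng.choice(m, size=b, replace=False)          # random subset I_k, |I_k| = b << m
        g = np.mean([grad_fi(i, x) for i in I_k], axis=0)   # (1/b) * sum_{i in I_k} grad f_i(x)
        return x - t_k * g                                  # b gradient evaluations: b times the cost of one stochastic step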
Example (Contd.)

- Back to logistic regression, let's now consider a regularized version:

      min_{β ∈ R^p}  (1/n) ∑_{i=1}^n ( −y_i x_i^T β + log(1 + exp(x_i^T β)) ) + (λ/2) ||β||_2^2.

- Let us write the criterion as

      f(β) = (1/n) ∑_{i=1}^n f_i(β),   where   f_i(β) = −y_i x_i^T β + log(1 + exp(x_i^T β)) + (λ/2) ||β||_2^2.

- The full gradient computation is

      ∇f(β) = (1/n) ∑_{i=1}^n (p_i(β) − y_i) x_i + λβ.

- Comparison between methods:
  - One batch update costs O(np)
  - One mini-batch update costs O(bp)
  - One stochastic update costs O(p)
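Note that with this splitting each f_i carries the full regularizer, so every stochastic or mini-batch gradient includes the λβ term. A sketch of the per-sample gradient under that convention (helper names are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grad_fi_reg(beta, X, y, lam, i):
        """Gradient of f_i(beta) = -y_i x_i^T beta + log(1 + exp(x_i^T beta)) + (lam/2) ||beta||_2^2."""
        p_i = sigmoid(X[i] @ beta)
        return (p_i - y[i]) * X[i] + lam * beta   # the regularizer contributes lam * beta to every f_i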
Example: n = 10,000, p = 20, all methods using fixed step sizes

[Figure: suboptimality gap versus iteration number for batch, mini-batch, and stochastic updates.]

- We see that in terms of per-iteration performance, using mini-batches does help.

If we parametrize by flops

[Figure: suboptimality gap versus flop count for the same methods.]

- This demonstrates that mini-batch SGD does not really help in terms of optimality once we account for the flops per iteration.
End of the story?

- Summarizing, we get:
  - SGD can be super effective in terms of iteration cost and memory
  - But SGD is slow to converge and can't adapt to strong convexity
  - And mini-batches seem to be a wash in terms of flops (though they can still be useful in practice)
- Is this the end of the story for SGD?
- For a while, the answer was believed to be yes. Slow convergence for strongly convex functions was believed inevitable, as Nemirovski and others established matching lower bounds ... but this was for a more general stochastic problem, where

      f(x) = ∫ F(x, ξ) dP(ξ).

- A new wave of "variance reduction" work shows we can modify SGD to converge much faster for finite sums.
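One representative variance-reduction method for finite sums is SVRG; the following is a minimal sketch of the idea (the step size, loop lengths, and helper name grad_fi are illustrative choices, not prescribed here):

    import numpy as np

    def svrg(grad_fi, m, x0, t=0.1, n_outer=30, n_inner=None):
        """SVRG sketch for min_x (1/m) * sum_i f_i(x).

        Each outer pass takes a snapshot x_tilde and computes its full gradient mu once;
        the inner steps then use grad f_i(x) - grad f_i(x_tilde) + mu, which is still an
        unbiased estimate of grad f(x) but has much smaller variance near the optimum.
        """
        rng = np.random.default_rng(0)
        n_inner = n_inner or 2 * m
        x = np.asarray(x0, dtype=float)
        for _ in range(n_outer):
            x_tilde = x.copy()
            mu = np.mean([grad_fi(i, x_tilde) for i in range(m)], axis=0)   # full gradient at snapshot
            for _ in range(n_inner):
                i = rng.integers(m)
                g = grad_fi(i, x) - grad_fi(i, x_tilde) + mu                # variance-reduced direction
                x = x - t * g                                               # a fixed step size can be used
        return x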
SGD in large-scale ML

- SGD has really taken off in large-scale machine learning
- In many ML problems we don't care about optimizing to high accuracy; it doesn't pay off in terms of statistical performance
- Thus (in contrast to what classic theory says) fixed step sizes are commonly used in ML applications
- One trick is to experiment with step sizes using a small fraction of the training data before running SGD on the full data set ... many other heuristics are common (e.g., Bottou (2012), "Stochastic gradient descent tricks")
- Many variants provide better practical stability and convergence: momentum, acceleration, averaging, coordinate-adapted step sizes, variance reduction: AdaGrad, Adam, AdaMax, SVRG, SAG, SAGA ...
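A sketch of that step-size trick (the candidate grid, subsample fraction, and the run_sgd and loss helpers are my own illustrative assumptions):

    import numpy as np

    def tune_step_size(run_sgd, loss, X, y, candidates=(1.0, 0.1, 0.01, 0.001), frac=0.05, seed=0):
        """Try each fixed step size with a short SGD run on a small random subsample, keep the best.

        run_sgd(X_sub, y_sub, t) should return fitted parameters; loss(params, X_sub, y_sub)
        should return the training objective on the subsample.
        """
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(y), size=max(1, int(frac * len(y))), replace=False)
        X_sub, y_sub = X[idx], y[idx]
        scores = {t: loss(run_sgd(X_sub, y_sub, t), X_sub, y_sub) for t in candidates}
        return min(scores, key=scores.get)   # step size with the smallest objective on the subsample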
Early stopping

- Suppose p is large and we wanted to fit (say) a logistic regression model to data (x_i, y_i) ∈ R^p × {0, 1}, i = 1, . . . , n.
- We could solve (say) ℓ2-regularized logistic regression:

      min_{β ∈ R^p}  (1/n) ∑_{i=1}^n ( −y_i x_i^T β + log(1 + exp(x_i^T β)) )
      subject to  ||β||_2 ≤ t.

- We could also run gradient descent on the unregularized problem:

      min_{β ∈ R^p}  (1/n) ∑_{i=1}^n ( −y_i x_i^T β + log(1 + exp(x_i^T β)) )

  and stop early, i.e., terminate gradient descent well short of the global minimum.
- Consider the following, for a very small constant step size ε:
  - Start at β^(0) = 0, the solution to the regularized problem at t = 0
  - Perform gradient descent on the unregularized criterion:

        β^(k) = β^(k−1) − ε · (1/n) ∑_{i=1}^n (p_i(β^(k−1)) − y_i) x_i,   k = 1, 2, 3, . . .

    (we could equally well consider SGD)
  - Treat β^(k) as an approximate solution to the regularized problem with t = ||β^(k)||_2
- This is called early stopping for gradient descent.
- Why would we ever do this?
- It's both more convenient and potentially much more efficient than using explicit regularization.
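A sketch of this procedure for logistic regression (a rough illustration under the setup above; the step size, iteration budget, and helper names are my own choices):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def early_stopping_path(X, y, eps=1e-3, n_steps=500):
        """Gradient descent on the unregularized logistic loss, recording each iterate.

        Each beta^(k) is treated as an approximate solution of the constrained problem
        with radius t = ||beta^(k)||_2, so the recorded (t, beta) pairs trace out a
        regularization path indexed by the stopping time.
        """
        n, p = X.shape
        beta = np.zeros(p)                        # start at beta^(0) = 0, the solution at t = 0
        path = [(0.0, beta.copy())]
        for _ in range(n_steps):
            grad = X.T @ (sigmoid(X @ beta) - y) / n
            beta = beta - eps * grad              # small constant step size eps
            path.append((np.linalg.norm(beta), beta.copy()))
        return path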
An intriguing connection

- When we solve the ℓ2-regularized logistic problem for varying t, the solution path looks quite similar to the gradient descent path!
- The following figure shows, for p = 8, the solution path and the gradient descent path side by side.
Lots left to explore

- The connection holds beyond logistic regression, for an arbitrary loss
- In general, the gradient descent path will not coincide with the ℓ2-regularized path (as ε → 0), though in practice it seems to give competitive statistical performance
- We can extend the early stopping idea to mimic a generic regularizer (beyond ℓ2)
- There is a lot of literature on early stopping, but it's still not as well understood as it should be
- Early stopping is just one instance of implicit or algorithmic regularization ... many others are effective in large-scale ML, and they all should be better understood
