
Convex Optimization V

Presidency University

May, 2025
Stochastic gradient descent

- Consider minimizing an average of functions

      min_x  (1/m) ∑_{i=1}^m f_i(x)

- Since ∇ ∑_{i=1}^m f_i(x) = ∑_{i=1}^m ∇f_i(x), gradient descent would repeat:

      x^(k) = x^(k−1) − t_k · (1/m) ∑_{i=1}^m ∇f_i(x^(k−1)),   k = 1, 2, 3, . . .

- In comparison, stochastic gradient descent or SGD (or incremental gradient descent) repeats:

      x^(k) = x^(k−1) − t_k · ∇f_{i_k}(x^(k−1)),   k = 1, 2, 3, . . .

  where i_k ∈ {1, 2, . . . , m} is some index chosen at iteration k.


- Two rules for choosing the index i_k at iteration k:
  - Randomized rule: choose i_k ∈ {1, 2, . . . , m} uniformly at random
  - Cyclic rule: choose i_k = 1, 2, . . . , m, 1, 2, . . . , m, . . .
- The randomized rule is more common in practice. For the randomized rule, we note that

      E[∇f_{i_k}(x)] = ∇f(x),

  so we can view SGD as using an unbiased estimate of the gradient at each step.
- Main appeal of SGD:
  - Iteration cost is independent of m (the number of functions)
  - Can also be a big savings in terms of memory usage
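As a rough illustration (a minimal sketch, not part of the slides: the helper name grad_fi, the seed, and the diminishing schedule t_k = 1/k are illustrative choices), SGD with the randomized rule looks like this in Python:

    import numpy as np

    def sgd(grad_fi, m, x0, n_iter=1000):
        """SGD for min_x (1/m) * sum_i f_i(x).

        grad_fi(i, x) returns the gradient of the i-th function f_i at x.
        Uses the randomized rule and diminishing step sizes t_k = 1/k.
        """
        rng = np.random.default_rng(0)
        x = np.asarray(x0, dtype=float)
        for k in range(1, n_iter + 1):
            i_k = rng.integers(m)          # randomized rule: i_k uniform on {0, ..., m-1}
            t_k = 1.0 / k                  # diminishing step size
            x = x - t_k * grad_fi(i_k, x)  # stochastic gradient step, cost independent of m
        return x

Note that only one ∇f_i is evaluated per iteration, which is exactly why the iteration cost does not depend on m.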
Example: stochastic logistic regression

- Given (x_i, y_i) ∈ R^p × {0, 1}, i = 1, . . . , n, recall logistic regression:

      min_β  f(β) = (1/n) ∑_{i=1}^n ( −y_i x_i^T β + log(1 + exp(x_i^T β)) ),

  where f_i(β) denotes the i-th summand.
- The gradient computation ∇f(β) = (1/n) ∑_{i=1}^n (p_i(β) − y_i) x_i, with p_i(β) = exp(x_i^T β) / (1 + exp(x_i^T β)), is doable when n is moderate, but not when n is large.
- If we compare the full gradient (also called batch) against the stochastic gradient, we get:
  - One batch update costs O(np)
  - One stochastic update costs O(p)
- The following picture shows the behavior of full (batch) gradient descent compared to stochastic gradient descent for an example with n = 10, p = 2 (although this is not the setting SGD is usually used in).
- We note that far from the optimum SGD moves quickly, but close to the optimum gradient descent converges quickly while SGD struggles.
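To make the cost comparison concrete, here is a sketch of the two gradient computations (my own illustration; sigmoid, grad_full, and grad_i are assumed helper names). The stochastic gradient touches a single row x_i, so it costs O(p), while the batch gradient touches all n rows, costing O(np):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grad_full(beta, X, y):
        """Batch gradient: (1/n) * sum_i (p_i(beta) - y_i) x_i.  Cost O(np)."""
        p = sigmoid(X @ beta)
        return X.T @ (p - y) / len(y)

    def grad_i(beta, X, y, i):
        """Gradient of the i-th summand f_i: (p_i(beta) - y_i) x_i.  Cost O(p)."""
        p_i = sigmoid(X[i] @ beta)
        return (p_i - y[i]) * X[i]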
Step sizes

- Generally, in SGD we use diminishing step sizes, e.g., t_k = 1/k for k = 1, 2, 3, . . .
- Why not fixed step sizes? We can give an intuitive explanation.
- Suppose we take the cyclic rule for simplicity. If we set t_k = t for m updates in a row, we get:

      x^(k+m) = x^(k) − t ∑_{i=1}^m ∇f_i(x^(k+i−1)).

- Meanwhile, a full gradient step with step size t would give:

      x^(k+1) = x^(k) − t ∑_{i=1}^m ∇f_i(x^(k)).

- The difference here is t ∑_{i=1}^m [∇f_i(x^(k+i−1)) − ∇f_i(x^(k))], and if we hold t constant, this difference will not generally go to zero.
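A small numerical illustration of this point (my own toy example, not from the slides): take f_i(x) = (x − a_i)^2 / 2, so the minimizer of the average is the mean of the a_i. With a fixed step the cyclic iterates keep circling a point offset from the optimum, while with t_k = 1/k they settle down:

    import numpy as np

    a = np.array([0.0, 1.0, 4.0])            # minimizer of (1/m) * sum_i (x - a_i)^2 / 2 is mean(a)
    m, x_fixed, x_dimin, k = len(a), 10.0, 10.0, 0

    for sweep in range(200):                 # cyclic rule: i = 1, ..., m, 1, ..., m, ...
        for i in range(m):
            k += 1
            x_fixed -= 0.3 * (x_fixed - a[i])        # fixed step: hovers near, but never reaches, mean(a)
            x_dimin -= (1.0 / k) * (x_dimin - a[i])  # diminishing step t_k = 1/k: converges to mean(a)

    print(np.mean(a), x_fixed, x_dimin)      # roughly 1.667 (optimum), ~2.1 (fixed), ~1.667 (diminishing)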
Convergence rates

- Recall that for convex f, gradient descent with diminishing step sizes satisfies

      f(x^(k)) − f* = O(1/√k).

- When f is differentiable with a Lipschitz gradient, we get, for suitable fixed step sizes,

      f(x^(k)) − f* = O(1/k).

- What about SGD?
- For convex f, SGD with diminishing step sizes satisfies

      E[f(x^(k))] − f* = O(1/√k).

- Unfortunately, this does not improve when we further assume f has a Lipschitz gradient.
- Even worse is the following discrepancy!
- When f is strongly convex and has a Lipschitz gradient, gradient descent satisfies

      f(x^(k)) − f* = O(c^k),

  where c < 1.
- But under the same conditions, SGD gives us

      E[f(x^(k))] − f* = O(1/k).

- So stochastic methods do not enjoy the linear convergence rate of gradient descent under strong convexity.
- What can we do to improve SGD?
- One answer, especially from the machine learning community, is that these shortcomings of SGD are not very critical: for better out-of-sample performance of our estimators, we do not necessarily want to fully optimize the objective.
- Another answer is to use mini-batch SGD, which we shall now discuss.
Mini-batches

- Also common is mini-batch stochastic gradient descent, where we choose a random subset I_k ⊆ {1, 2, . . . , m} of size |I_k| = b ≪ m, and repeat:

      x^(k) = x^(k−1) − t_k · (1/b) ∑_{i ∈ I_k} ∇f_i(x^(k−1)),   k = 1, 2, 3, . . .

- Again, we are approximating the full gradient by an unbiased estimate:

      E[ (1/b) ∑_{i ∈ I_k} ∇f_i(x) ] = ∇f(x).

- Using mini-batches reduces the variance of our gradient estimate by a factor of 1/b, but is also b times more expensive.
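A minimal sketch of one mini-batch step (the helper name grad_fi and sampling I_k without replacement are illustrative assumptions):

    import numpy as np

    def minibatch_sgd_step(x, grad_fi, m, b, t_k, rng):
        """One mini-batch SGD step for min_x (1/m) * sum_i f_i(x).

        Samples I_k of size b uniformly without replacement and averages the b
        per-function gradients; this average is an unbiased estimate of the full gradient.
        """
        I_k = rng.choice(m, size=b, replace=False)          # random subset I_k, |I_k| = b << m
        g = np.mean([grad_fi(i, x) for i in I_k], axis=0)   # (1/b) * sum_{i in I_k} grad f_i(x)
        return x - t_k * g                                  # b gradient evaluations: b times the cost of one stochastic step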
Example (Contd.)

- Back to logistic regression, let's now consider a regularized version:

      min_{β ∈ R^p}  (1/n) ∑_{i=1}^n ( −y_i x_i^T β + log(1 + exp(x_i^T β)) ) + (λ/2) ||β||_2^2.

- Let us write the criterion as

      f(β) = (1/n) ∑_{i=1}^n f_i(β),   where   f_i(β) = −y_i x_i^T β + log(1 + exp(x_i^T β)) + (λ/2) ||β||_2^2.

- The full gradient computation is

      ∇f(β) = (1/n) ∑_{i=1}^n (p_i(β) − y_i) x_i + λβ.

- Comparison between methods:
  - One batch update costs O(np)
  - One mini-batch update costs O(bp)
  - One stochastic update costs O(p)
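Note that with this splitting each f_i carries the full regularizer, so every stochastic or mini-batch gradient includes the λβ term. A sketch of the per-sample gradient under that convention (helper names are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grad_fi_reg(beta, X, y, lam, i):
        """Gradient of f_i(beta) = -y_i x_i^T beta + log(1 + exp(x_i^T beta)) + (lam/2) ||beta||_2^2."""
        p_i = sigmoid(X[i] @ beta)
        return (p_i - y[i]) * X[i] + lam * beta   # the regularizer contributes lam * beta to every f_i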
Example: n = 10,000, p = 20, all methods using fixed step sizes

[Figure: suboptimality gap versus iteration number for batch, mini-batch, and stochastic updates.]

- We see that in terms of per-iteration performance, using mini-batches does help.

If we parametrize by flops

[Figure: suboptimality gap versus flop count for the same methods.]

- This demonstrates that mini-batch SGD does not really help in terms of optimality once we account for the flops per iteration.
End of the story?

- Summarizing, we get:
  - SGD can be super effective in terms of iteration cost and memory
  - But SGD is slow to converge and can't adapt to strong convexity
  - And mini-batches seem to be a wash in terms of flops (though they can still be useful in practice)
- Is this the end of the story for SGD?
- For a while, the answer was believed to be yes. Slow convergence for strongly convex functions was believed inevitable, as Nemirovski and others established matching lower bounds ... but this was for a more general stochastic problem, where

      f(x) = ∫ F(x, ξ) dP(ξ).

- A new wave of "variance reduction" work shows we can modify SGD to converge much faster for finite sums.
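One representative variance-reduction method for finite sums is SVRG; the following is a minimal sketch of the idea (the step size, loop lengths, and helper name grad_fi are illustrative choices, not prescribed here):

    import numpy as np

    def svrg(grad_fi, m, x0, t=0.1, n_outer=30, n_inner=None):
        """SVRG sketch for min_x (1/m) * sum_i f_i(x).

        Each outer pass takes a snapshot x_tilde and computes its full gradient mu once;
        the inner steps then use grad f_i(x) - grad f_i(x_tilde) + mu, which is still an
        unbiased estimate of grad f(x) but has much smaller variance near the optimum.
        """
        rng = np.random.default_rng(0)
        n_inner = n_inner or 2 * m
        x = np.asarray(x0, dtype=float)
        for _ in range(n_outer):
            x_tilde = x.copy()
            mu = np.mean([grad_fi(i, x_tilde) for i in range(m)], axis=0)   # full gradient at snapshot
            for _ in range(n_inner):
                i = rng.integers(m)
                g = grad_fi(i, x) - grad_fi(i, x_tilde) + mu                # variance-reduced direction
                x = x - t * g                                               # a fixed step size can be used
        return x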
SGD in large-scale ML

- SGD has really taken off in large-scale machine learning
- In many ML problems we don't care about optimizing to high accuracy; it doesn't pay off in terms of statistical performance
- Thus (in contrast to what classic theory says) fixed step sizes are commonly used in ML applications
- One trick is to experiment with step sizes using a small fraction of the training data before running SGD on the full data set ... many other heuristics are common (e.g., Bottou (2012), "Stochastic gradient descent tricks")
- Many variants provide better practical stability and convergence: momentum, acceleration, averaging, coordinate-adapted step sizes, variance reduction: AdaGrad, Adam, AdaMax, SVRG, SAG, SAGA ...
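A sketch of that step-size trick (the candidate grid, subsample fraction, and the run_sgd and loss helpers are my own illustrative assumptions):

    import numpy as np

    def tune_step_size(run_sgd, loss, X, y, candidates=(1.0, 0.1, 0.01, 0.001), frac=0.05, seed=0):
        """Try each fixed step size with a short SGD run on a small random subsample, keep the best.

        run_sgd(X_sub, y_sub, t) should return fitted parameters; loss(params, X_sub, y_sub)
        should return the training objective on the subsample.
        """
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(y), size=max(1, int(frac * len(y))), replace=False)
        X_sub, y_sub = X[idx], y[idx]
        scores = {t: loss(run_sgd(X_sub, y_sub, t), X_sub, y_sub) for t in candidates}
        return min(scores, key=scores.get)   # step size with the smallest objective on the subsample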
Early stopping

- Suppose p is large and we wanted to fit (say) a logistic regression model to data (x_i, y_i) ∈ R^p × {0, 1}, i = 1, . . . , n.
- We could solve (say) ℓ2-regularized logistic regression:

      min_{β ∈ R^p}  (1/n) ∑_{i=1}^n ( −y_i x_i^T β + log(1 + exp(x_i^T β)) )
      subject to  ||β||_2 ≤ t.

- We could also run gradient descent on the unregularized problem:

      min_{β ∈ R^p}  (1/n) ∑_{i=1}^n ( −y_i x_i^T β + log(1 + exp(x_i^T β)) )

  and stop early, i.e., terminate gradient descent well short of the global minimum.
- Consider the following, for a very small constant step size ε:
  - Start at β^(0) = 0, the solution to the regularized problem at t = 0
  - Perform gradient descent on the unregularized criterion:

        β^(k) = β^(k−1) − ε · (1/n) ∑_{i=1}^n (p_i(β^(k−1)) − y_i) x_i,   k = 1, 2, 3, . . .

    (we could equally well consider SGD)
  - Treat β^(k) as an approximate solution to the regularized problem with t = ||β^(k)||_2
- This is called early stopping for gradient descent.
- Why would we ever do this?
- It's both more convenient and potentially much more efficient than using explicit regularization.
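A sketch of this procedure for logistic regression (a rough illustration under the setup above; the step size, iteration budget, and helper names are my own choices):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def early_stopping_path(X, y, eps=1e-3, n_steps=500):
        """Gradient descent on the unregularized logistic loss, recording each iterate.

        Each beta^(k) is treated as an approximate solution of the constrained problem
        with radius t = ||beta^(k)||_2, so the recorded (t, beta) pairs trace out a
        regularization path indexed by the stopping time.
        """
        n, p = X.shape
        beta = np.zeros(p)                        # start at beta^(0) = 0, the solution at t = 0
        path = [(0.0, beta.copy())]
        for _ in range(n_steps):
            grad = X.T @ (sigmoid(X @ beta) - y) / n
            beta = beta - eps * grad              # small constant step size eps
            path.append((np.linalg.norm(beta), beta.copy()))
        return path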
An intriguing connection

- When we solve the ℓ2-regularized logistic problem for varying t, the solution path looks quite similar to the gradient descent path!
- The following figure shows, for p = 8, the solution path and the gradient descent path side by side.
Lots left to explore

- The connection holds beyond logistic regression, for an arbitrary loss
- In general, the gradient descent path will not coincide with the ℓ2-regularized path (as ε → 0), though in practice it seems to give competitive statistical performance
- We can extend the early stopping idea to mimic a generic regularizer (beyond ℓ2)
- There is a lot of literature on early stopping, but it's still not as well understood as it should be
- Early stopping is just one instance of implicit or algorithmic regularization ... many others are effective in large-scale ML, and they all should be better understood
