Adaptive Learning Rate for SGD [enhancement] · Issue #1261 · scikit-learn/scikit-learn

Closed

briancheung opened this issue Oct 21, 2012 · 21 comments

Comments

@briancheung
Contributor

I'm considering implementing a new adaptive learning rate algorithm proposed here:

http://arxiv.org/abs/1206.1106

Anyone have any feedback about where this should go? I was thinking it would be added somewhere around https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/sgd_fast.pyx#L383

Since SGD is also applicable to non-linear models (e.g. neural network models), are there any plans to move that outside of linear_model?

@GaelVaroquaux
Member

Hey Brian,

This paper is probably a great paper (I haven't read it), but it's also a very recent paper (the final version is not even out yet). We usually have a policy of waiting a year or two after a publication, to make sure that the various aspects of the publication are understood.

Now, if it requires only a small set of modifications to our existing codebase and makes our existing examples run much better, we can make an exception of course :). That said, I think that there are still a lot of much easier tricks to pull to improve our learning rate (the SGD guys would know better than me here).

@amueller
Member

@briancheung if you want to have adaptive learning rates, I recommend adagrad, which is not as new and works quite well. I think adding this to sklearn would be awesome. That is not totally trivial, though :-/
We don't really want to make the SGD code any slower.
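
For reference, the core AdaGrad update is small. Here is a minimal NumPy sketch of the per-coordinate rule (illustrative only, not the scikit-learn implementation; `eta0` and `eps` are just placeholder hyperparameter names):

```python
import numpy as np

def adagrad_step(w, grad, G, eta0=0.1, eps=1e-8):
    """One AdaGrad update: G accumulates squared gradients per coordinate,
    so frequently and strongly updated coordinates get ever smaller steps."""
    G += grad ** 2
    w -= eta0 * grad / (np.sqrt(G) + eps)
    return w, G

# Usage: initialize G = np.zeros_like(w), then call adagrad_step with the
# gradient of each sample or mini-batch.
```

The extra cost is one additional state vector plus a square root per updated coordinate, which is presumably part of why adding it without slowing the SGD code down is "not totally trivial".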

@briancheung
Contributor Author

That makes sense, as this is a very new article. The idea is fairly straightforward, though, and might make the SGD-based methods more automated and possibly suitable for non-stationary data. But I agree that this method might require some time to develop.

Here's a general summary of how it works: they approximate the variance of the probability distribution of the optimum for each sample (via a moving window of samples). When the variance is high, this indicates disagreement among samples about the gradient direction, and the second-order step size is penalized accordingly.
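
For concreteness, here is a very rough sketch of that idea. It is not the full vSGD algorithm from the paper (which also adapts the averaging windows and estimates the diagonal Hessian); `h_est` below simply stands in for that curvature estimate:

```python
def noisy_gradient_step(w, grad, g_avg, v_avg, h_est, decay=0.99, eps=1e-8):
    """Shrink the step when recent gradients disagree: g_avg**2 / v_avg is
    close to 1 for consistent gradients and close to 0 for noisy ones."""
    g_avg = decay * g_avg + (1 - decay) * grad       # running mean of gradients
    v_avg = decay * v_avg + (1 - decay) * grad ** 2  # running second moment
    eta = g_avg ** 2 / (h_est * v_avg + eps)         # per-parameter step size
    w -= eta * grad
    return w, g_avg, v_avg
```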

@amueller This looks like something related; I'll have to take a look, though it seems a bit more technical.

@iskandr
iskandr commented Dec 9, 2014

Now that some time has passed, has any consensus emerged as to whether this is a good learning rate scheme?

Also, Tom Schaul seems to have an implementation here: https://github.com/schaul/py-optim/blob/master/PyOptim/algorithms/vsgd.py

@agramfort
Member

Thanks Alex for pointing us to this.

Do you know what the license of this code is?

@hammer
Contributor
hammer commented Dec 9, 2014

@agramfort
Member

Cool, thanks!

@ogrisel
Member
ogrisel commented Dec 10, 2014

AdaGrad is the best-known scheme. Refinements such as AdaDelta and RMSProp seem to be gaining a lot of traction among deep learning practitioners, but this is still evolving.

AdaDelta in particular seems interesting, as its hyperparameters can seemingly be left at their default values and still give good convergence on a wide range of problems, which is particularly user-friendly.

Since SGD is also applicable to non-linear models (e.g. neural network models), are there any plans to move that outside of linear_model?

sklearn.linear_model.SGDClassifier will stay a linear model as it's heavily optimized for the linear case (I agree that SGDClassifier is a poor name though, SGDLinearClassifier would have been better).

For training non linear models with SGD, have a look at: #3204

@briancheung
Contributor Author

AdaDelta is fairly straightforward to implement and doesn't force eventual convergence like AdaGrad does. I've found it requires the least amount of tuning compared to things like SGD, Momentum, Nesterov Momentum, etc.
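
For reference, a minimal NumPy sketch of the AdaDelta update as described in Zeiler's paper (`rho` and `eps` are set to commonly used values; this is illustrative, not a scikit-learn API):

```python
import numpy as np

def adadelta_step(w, grad, Eg2, Edx2, rho=0.95, eps=1e-6):
    """One AdaDelta update: decaying averages of squared gradients (Eg2)
    and squared updates (Edx2) replace a hand-tuned global learning rate."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    w += dx
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return w, Eg2, Edx2
```

Because Eg2 is a decaying average rather than a lifetime sum, the step size does not have to shrink monotonically, which is what "doesn't force eventual convergence" refers to above.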

@ogrisel
Member
ogrisel commented Dec 11, 2014

I've found it requires the least amount of tuning over things like SGD, Momentum, Nesterov Momentum, etc.

This is consistent with occasional feedback I got from several practitioners. Could you please explain what you mean by "force eventual convergence like AdaGrad"?

@mblondel
Member

AdaDelta is fairly straightforward to implement and doesn't force eventual convergence like AdaGrad.

This sentence doesn't make any sense to me in a convex optimization setting, in which you can reach the global optimum. In the non-convex optimization setting, I guess what you mean is that Adadelta can avoid bad local minima?

I am personally rather -1 on adaptive learning rates which are mostly used in the deep learning community (Adadelta etc): the SGD classes in scikit-learn are for convex optimization. I am +1 for AdaGrad which works well and was originally designed for convex optimization (and online learning). Now if there's strong empirical evidence that Adadelta is better than AdaGrad even on convex optimization problems, I could change my mind :)

BTW, my project lightning has a fast implementation of AdaGrad with elastic-net penalty
https://github.com/mblondel/lightning
https://github.com/mblondel/lightning/blob/master/lightning/impl/adagrad_fast.pyx

It works really well and is indeed not very sensitive to the initial learning rate.

@briancheung
Contributor Author

This is consistent with occasional feedback I got from several practitioners. Could you please explain what you mean by "force eventual convergence like AdaGrad"?

The denominator in AdaGrad accumulates the squared gradients from all previous steps, so the effective learning rate decays towards zero. I could see this being useful in convex problems where the learning rate has to reach 0 at some point, but I don't know whether AdaGrad offers any guarantee that the learning rate approaches 0 only at the global minimum for convex problems. AdaDelta instead takes an exponentially decaying window of the RMS.
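
A toy comparison of the two denominators makes this visible (purely illustrative numbers):

```python
import numpy as np

grads = [np.array([1.0, -0.5]), np.array([0.8, -0.4]), np.array([0.05, 0.0])]

G = np.zeros(2)              # AdaGrad: lifetime sum of squared gradients
Eg2, rho = np.zeros(2), 0.9  # AdaDelta-style exponentially decaying average

for g in grads:
    G += g ** 2
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2

print(1.0 / np.sqrt(G + 1e-8))    # can only shrink as gradients accumulate
print(1.0 / np.sqrt(Eg2 + 1e-8))  # grows back once recent gradients get small
```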

I personally don't work with convex problems ;), so I can't directly comment about AdaGrad vs AdaDelta. But I did come across some visualizations comparing these methods on a few convex problems.

@ogrisel
Member
ogrisel commented Dec 12, 2014

@briancheung I tweeted those viz and people want to know the original author and the source code. Do you know where they come from?

@briancheung
Contributor Author

I found the link in this post by a user named alecradford on Reddit's /r/MachineLearning.

@ogrisel
Member
ogrisel commented Dec 12, 2014

Based on this discussion, my feeling is that AdaGrad can be considered to have passed the test of time, in terms of number of citations and replicated usefulness for penalized linear models, and therefore we could consider a PR to add support for AdaGrad as a new learning rate scheme in the SGDClassifier class.

"No more pesky learning rates", AdaDelta, RMSProp and the like seem to be mostly used for non-linear models, and are hence not suitable for inclusion in the sklearn.linear_model.SGDClassifier class itself.

Once the MLP PR has been successfully tested with SGD and momentum (which is the historical baseline we have to include in our implementation) and merged to master, we might want to discuss adaptive learning rates such as AdaDelta and RMSProp. However, I am still not sure which of AdaDelta and RMSProp will emerge as the go-to optimizer for architectures with multiple non-linearities, so it might still be too early for a PR. Out-of-repo implementations are surely encouraged, though.
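
For completeness, the classical momentum baseline mentioned above is tiny; here is a hedged sketch (illustrative only, not the MLP PR's actual code):

```python
def momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """Classical (heavy-ball) momentum: velocity is a decaying sum of past
    gradients, which smooths the SGD trajectory."""
    velocity = momentum * velocity - lr * grad
    w += velocity
    return w, velocity
```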

@andrwc
andrwc commented Apr 16, 2015

Bit of an old thread... If no one else is currently working on AdaGrad learning rates for the linear_model.SGD* classes I'm happy to do it.

@amueller
Member

I thought there was a PR somewhere, but maybe not?

@andrwc
andrwc commented Apr 30, 2015

Not that I could find. There was this:

#3729

which appears to have stalled; the AdaGrad/AdaDelta learning rate schemes were evaluated there at some length. That PR involved what appeared to be a pretty significant refactor of the existing SGD code.

Mentioned within that was: http://www.mblondel.org/lightning/ -- which has the implementation and a few others and follows the sklearn api conventions. I've not looked at the code but if that's not a candidate for being merged, I would propose to simply add AdaGrad learning rates using the existing sklearn code in sgd_fast etc... the api already exposes most of what's needed and it'd give some useful functionality.

@agramfort
Member
agramfort commented Apr 30, 2015 via email

@amueller
Member

Maybe #3729 was what I meant. Talk to @mblondel about lightning ;)

@adrinjalali
Member

I think we're not going to be working on this. Closing, but correct me if I'm wrong @lorentzenchr @ogrisel

@adrinjalali closed this as not planned on Apr 17, 2024