Adaptive Learning Rate for SGD [enhancement] · Issue #1261 · scikit-learn/scikit-learn

Closed

briancheung opened this issue Oct 21, 2012 · 21 comments

Comments

@briancheung
Contributor

I'm considering implementing a new adaptive learning rate algorithm proposed here:

http://arxiv.org/abs/1206.1106

Anyone have any feedback about where this should go? I was thinking it would be added somewhere around https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/sgd_fast.pyx#L383

Since SGD is also applicable to non-linear models (e.g. neural network models), are there any plans to move that outside of linear_model?

@GaelVaroquaux
Member

Hey Brian,

This paper is probably a great paper (I haven't read it), but it's also a very recent paper (the final version is not even out yet). We usually have a policy of waiting a year or two after a publication, to make sure that the various aspects of the publication are understood.

Now, if it requires only a small set of modifications to our existing codebase and makes our existing examples run much better, we can make an exception of course :). That said, I think that there are still a lot of much easier tricks to pull to improve our learning rate (the SGD guys would know better than me here).

@amueller
Member

@briancheung if you want to have adaptive learning rates, I recommend adagrad, which is not as new and works quite well. I think adding this to sklearn would be awesome. That is not totally trivial, though :-/
We don't really want to make the SGD code any slower.
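
For reference, the core AdaGrad update is small. Here is a minimal NumPy sketch of the per-coordinate rule (illustrative only, not the scikit-learn implementation; `eta0` and `eps` are just placeholder hyperparameter names):

```python
import numpy as np

def adagrad_step(w, grad, G, eta0=0.1, eps=1e-8):
    """One AdaGrad update: G accumulates squared gradients per coordinate,
    so frequently and strongly updated coordinates get ever smaller steps."""
    G += grad ** 2
    w -= eta0 * grad / (np.sqrt(G) + eps)
    return w, G

# Usage: initialize G = np.zeros_like(w), then call adagrad_step with the
# gradient of each sample or mini-batch.
```

The extra cost is one additional state vector plus a square root per updated coordinate, which is presumably part of why adding it without slowing the SGD code down is "not totally trivial".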

@briancheung
Contributor Author

That makes sense, as this is a very new article. The idea is fairly straightforward, though, and might make the SGD-based methods more automated and possibly suitable for non-stationary data. But I agree that this method might require some time to develop.

Here's a general summary of how it works: they approximate the variance of the probability distribution of the optimum for each sample (via a moving window of samples). When the variance is high, this indicates disagreement among samples about the gradient direction, and the second-order step size is penalized accordingly.
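
For concreteness, here is a very rough sketch of that idea. It is not the full vSGD algorithm from the paper (which also adapts the averaging windows and estimates the diagonal Hessian); `h_est` below simply stands in for that curvature estimate:

```python
def noisy_gradient_step(w, grad, g_avg, v_avg, h_est, decay=0.99, eps=1e-8):
    """Shrink the step when recent gradients disagree: g_avg**2 / v_avg is
    close to 1 for consistent gradients and close to 0 for noisy ones."""
    g_avg = decay * g_avg + (1 - decay) * grad       # running mean of gradients
    v_avg = decay * v_avg + (1 - decay) * grad ** 2  # running second moment
    eta = g_avg ** 2 / (h_est * v_avg + eps)         # per-parameter step size
    w -= eta * grad
    return w, g_avg, v_avg
```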

@amueller This looks like something related; I'll have to take a look, though it seems a bit more technical.

@iskandr
iskandr commented Dec 9, 2014

Now that some time has passed, has any consensus emerged as to whether this is a good learning rate scheme?

Also, Tom Schaul seems to have an implementation here: https://github.com/schaul/py-optim/blob/master/PyOptim/algorithms/vsgd.py

@agramfort
Member

Thanks Alex for pointing us to this.

Do you know what the license of this code is?

@hammer
Contributor
hammer commented Dec 9, 2014

@agramfort
Member

Cool, thanks!

@ogrisel
Member
ogrisel commented Dec 10, 2014

AdaGrad is the best-known scheme. Refinements such as AdaDelta and RMSProp seem to be gaining a lot of traction among deep learning practitioners, but this is still evolving.

AdaDelta in particular seems interesting, as its hyperparameters can seemingly be left at their default values and still give good convergence on a wide range of problems, which is particularly user-friendly.

Since SGD is also applicable to non-linear models (e.g. neural network models), are there any plans to move that outside of linear_model?

sklearn.linear_model.SGDClassifier will stay a linear model as it's heavily optimized for the linear case (I agree that SGDClassifier is a poor name though, SGDLinearClassifier would have been better).

For training non linear models with SGD, have a look at: #3204

@briancheung
Contributor Author

AdaDelta is fairly straightforward to implement and doesn't force eventual convergence like AdaGrad does. I've found it requires the least amount of tuning compared to things like SGD, Momentum, Nesterov Momentum, etc.
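
For reference, a minimal NumPy sketch of the AdaDelta update as described in Zeiler's paper (`rho` and `eps` are set to commonly used values; this is illustrative, not a scikit-learn API):

```python
import numpy as np

def adadelta_step(w, grad, Eg2, Edx2, rho=0.95, eps=1e-6):
    """One AdaDelta update: decaying averages of squared gradients (Eg2)
    and squared updates (Edx2) replace a hand-tuned global learning rate."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    w += dx
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return w, Eg2, Edx2
```

Because Eg2 is a decaying average rather than a lifetime sum, the step size does not have to shrink monotonically, which is what "doesn't force eventual convergence" refers to above.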

@ogrisel
Member
ogrisel commented Dec 11, 2014

I've found it requires the least amount of tuning over things like SGD, Momentum, Nesterov Momentum, etc.

This is consistent with occasional feedback I got from several practitioners. Could you please explain what you mean by "force eventual convergence like AdaGrad"?

@mblondel
Member

AdaDelta is fairly straightforward to implement and doesn't force eventual convergence like AdaGrad.

This sentence doesn't make any sense to me in a convex optimization setting, in which you can reach the global optimum. In the non-convex optimization setting, I guess what you mean is that Adadelta can avoid bad local minima?

I am personally rather -1 on adaptive learning rates which are mostly used in the deep learning community (Adadelta etc): the SGD classes in scikit-learn are for convex optimization. I am +1 for AdaGrad which works well and was originally designed for convex optimization (and online learning). Now if there's strong empirical evidence that Adadelta is better than AdaGrad even on convex optimization problems, I could change my mind :)

BTW, my project lightning has a fast implementation of AdaGrad with elastic-net penalty
https://github.com/mblondel/lightning
https://github.com/mblondel/lightning/blob/master/lightning/impl/adagrad_fast.pyx

It works really well and is indeed not very sensitive to the initial learning rate.

@briancheung
Contributor Author

This is consistent with occasional feedback I got from several practitioners. Could you please explain what you mean by "force eventual convergence like AdaGrad"?

The denominator in AdaGrad accumulates the squared gradients from all previous steps, so the effective learning rate decays towards zero. I could see this being useful in convex problems where the learning rate has to reach 0 at some point, but I don't know whether AdaGrad offers any guarantee that the learning rate approaches 0 only at the global minimum for convex problems. AdaDelta instead takes an exponentially decaying window of the RMS.
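
A toy comparison of the two denominators makes this visible (purely illustrative numbers):

```python
import numpy as np

grads = [np.array([1.0, -0.5]), np.array([0.8, -0.4]), np.array([0.05, 0.0])]

G = np.zeros(2)              # AdaGrad: lifetime sum of squared gradients
Eg2, rho = np.zeros(2), 0.9  # AdaDelta-style exponentially decaying average

for g in grads:
    G += g ** 2
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2

print(1.0 / np.sqrt(G + 1e-8))    # can only shrink as gradients accumulate
print(1.0 / np.sqrt(Eg2 + 1e-8))  # grows back once recent gradients get small
```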

I personally don't work with convex problems ;), so I can't directly comment about AdaGrad vs AdaDelta. But I did come across some visualizations comparing these methods on a few convex problems.

@ogrisel
Member
ogrisel commented Dec 12, 2014

@briancheung I tweeted those viz and people want to know the original author and the source code. Do you know where they come from?

@briancheung
Contributor Author

I found the link in this post by a user named alecradford on Reddit's /r/MachineLearning.

@ogrisel
Member
ogrisel commented Dec 12, 2014

Based on this discussion, my feeling is that AdaGrad can be considered to have passed the test of time, in terms of number of citations and replicated usefulness for penalized linear models, and therefore we could consider a PR to add support for AdaGrad as a new learning rate scheme in the SGDClassifier class.

"No more pesky learning rates", AdaDelta, RMSProp and the like seem to be mostly used for non-linear models, and are hence not suitable for inclusion in the sklearn.linear_model.SGDClassifier class itself.

Once the MLP PR has been successfully tested with SGD and momentum (which is the historical baseline we have to include in our implementation) and merged to master, we might want to discuss adaptive learning rates such as AdaDelta and RMSProp. However, I am still not sure which of AdaDelta and RMSProp will emerge as the go-to optimizer for architectures with multiple non-linearities, so it might still be too early for a PR. Out-of-repo implementations are surely encouraged, though.
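
For completeness, the classical momentum baseline mentioned above is tiny; here is a hedged sketch (illustrative only, not the MLP PR's actual code):

```python
def momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """Classical (heavy-ball) momentum: velocity is a decaying sum of past
    gradients, which smooths the SGD trajectory."""
    velocity = momentum * velocity - lr * grad
    w += velocity
    return w, velocity
```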

@andrwc
andrwc commented Apr 16, 2015

Bit of an old thread... If no one else is currently working on AdaGrad learning rates for the linear_model.SGD* classes I'm happy to do it.

@amueller
Member

I thought there was a PR somewhere, but maybe not?

@andrwc
andrwc commented Apr 30, 2015

Not that I could find. There was this:

#3729

which appears to have stalled; the AdaGrad/AdaDelta learning rate schemes were evaluated there at some length. That PR involved what appeared to be a pretty significant refactor of the existing SGD code.

Mentioned within that was: http://www.mblondel.org/lightning/ -- which has the implementation and a few others and follows the sklearn api conventions. I've not looked at the code but if that's not a candidate for being merged, I would propose to simply add AdaGrad learning rates using the existing sklearn code in sgd_fast etc... the api already exposes most of what's needed and it'd give some useful functionality.

@agramfort
Member
agramfort commented Apr 30, 2015 via email

@amueller
Member

Maybe #3729 was what I meant. Talk to @mblondel about lightning ;)

@adrinjalali
Member

I think we're not going to be working on this. Closing, but correct me if I'm wrong @lorentzenchr @ogrisel

@adrinjalali closed this as not planned on Apr 17, 2024