Adaptive Learning Rate for SGD [enhancement] #1261
Comments
Hey Brian, this paper is probably a great paper (I haven't read it), but it's also a very new one. Now, if it requires only a small set of modifications to our existing SGD code, it could be worth considering. |
@briancheung if you want to have adaptive learning rates, I recommend adagrad, which is not as new and works quite well. I think adding this to sklearn would be awesome. That is not totally trivial, though :-/ |
That makes sense, as this is a very new article. But the idea is fairly straightforward and might make the SGD-based methods more automated and possibly suitable for non-stationary data. I agree, though, that this method might require some time to develop. Here's a general summary of how it works: they approximate the variance of the probability distribution of the optimum (estimated over a moving window of samples). When this variance is high, it indicates that the samples disagree about the gradient direction, and the second-order step size is penalized accordingly. @amueller This looks like something related; I'll have to take a look, though it looks a bit more technical. |
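For reference, the adaptive rate in that paper works out to roughly eta_i = E[g_i]^2 / (h_i * E[g_i^2]) per parameter, where g is the stochastic gradient and h a diagonal curvature estimate. Below is a minimal, simplified Python sketch of that update; the function name, the fixed `decay` memory constant, and the way the curvature estimate is supplied are illustrative simplifications (the paper also adapts the memory size per parameter).

```python
import numpy as np

def vsgd_like_step(w, grad, curv, state, decay=0.95):
    """One per-parameter adaptive step in the spirit of Schaul et al. (2013).

    w     : current parameter vector
    grad  : stochastic gradient for the current sample
    curv  : estimate of the diagonal curvature (however it is obtained)
    state : dict of running averages kept across steps
    decay : fixed memory constant (the paper adapts this per parameter)
    """
    # Running averages of the gradient, squared gradient and curvature.
    state["g"] = decay * state["g"] + (1 - decay) * grad
    state["v"] = decay * state["v"] + (1 - decay) * grad ** 2
    state["h"] = decay * state["h"] + (1 - decay) * curv

    # When E[g]^2 is small relative to E[g^2], the gradients of different
    # samples disagree about the direction, so the step size shrinks.
    eta = state["g"] ** 2 / (state["h"] * state["v"] + 1e-12)
    return w - eta * grad

# Illustrative initialization; the paper bootstraps these averages from a
# few initial samples rather than using fixed values.
state = {"g": np.zeros(5), "v": np.ones(5), "h": np.ones(5)}
```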
Now that some time has passed, has any consensus emerged as to whether this is a good learning rate scheme? Also, Tom Schaul seems to have an implementation here: https://github.com/schaul/py-optim/blob/master/PyOptim/algorithms/vsgd.py |
Thanks Alex for pointing us to this. Do you know what the license of this code is? |
Cool, thanks! |
AdaGrad is the most well-known scheme. Refinements such as AdaDelta and RMSProp seem to be gaining a lot of traction among deep learning practitioners, but this is still evolving. AdaDelta in particular seems interesting, as its hyperparameters can seemingly be left at their default values and still give good convergence for a wide range of problems, which is particularly user-friendly.
sklearn.linear_model.SGDClassifier will stay a linear model, as it's heavily optimized for the linear case (I agree that SGDClassifier is a poor name, though; SGDLinearClassifier would have been better). For training non-linear models with SGD, have a look at: #3204 |
AdaDelta is fairly straightforward to implement and doesn't force eventual convergence like AdaGrad does. I've found it requires the least amount of tuning compared to things like SGD, Momentum, Nesterov Momentum, etc. |
This is consistent with occasional feedback I got from several practitioners. Could you please explain what you mean by " force eventual convergence like AdaGrad"? |
This sentence doesn't make sense to me in a convex optimization setting, in which you can reach the global optimum. In the non-convex setting, I guess what you mean is that AdaDelta can avoid bad local minima? I am personally rather -1 on adaptive learning rates that are mostly used in the deep learning community (AdaDelta etc.): the SGD classes in scikit-learn are for convex optimization. I am +1 for AdaGrad, which works well and was originally designed for convex optimization (and online learning). Now, if there's strong empirical evidence that AdaDelta is better than AdaGrad even on convex optimization problems, I could change my mind :) BTW, my project lightning has a fast implementation of AdaGrad with elastic-net penalty. It works really well and is indeed not very sensitive to the initial learning rate. |
The denominator in AdaGrad accumulates the squared gradients from all previous steps, so the effective learning rate will eventually go to zero. I could see this being useful in convex problems where the learning rate has to reach 0 at some point, but I don't know if AdaGrad offers any guarantee that the learning rate reaches 0 only at the global minimum for convex problems. AdaDelta instead takes an exponentially decaying window of the RMS. I personally don't work with convex problems ;), so I can't directly comment on AdaGrad vs. AdaDelta. But I did come across some visualizations comparing these methods on a few convex problems. |
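To make the difference concrete, here is a minimal NumPy sketch of the two update rules (function names, default hyperparameters and the epsilon constants are illustrative, not taken from any particular implementation):

```python
import numpy as np

def adagrad_update(w, g, acc, lr=0.1, eps=1e-8):
    # The accumulator only ever grows, so the effective step size
    # lr / sqrt(acc) decays towards zero as training goes on.
    acc = acc + g ** 2
    return w - lr * g / (np.sqrt(acc) + eps), acc

def adadelta_update(w, g, sq_grad, sq_step, rho=0.95, eps=1e-6):
    # Exponentially decaying averages: old gradients are forgotten,
    # so the step size is not forced to shrink towards zero.
    sq_grad = rho * sq_grad + (1 - rho) * g ** 2
    step = -np.sqrt(sq_step + eps) / np.sqrt(sq_grad + eps) * g
    sq_step = rho * sq_step + (1 - rho) * step ** 2
    return w + step, sq_grad, sq_step
```

Note that in AdaDelta the decaying RMS of previous steps in the numerator acts as an automatic, unit-correcting learning rate, which is why the rule has no explicit `lr` parameter.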
@briancheung I tweeted those viz and people want to know the original author and the source code. Do you know where they come from? |
I found the link from this post by a user named alecradford from Reddit's /r/MachineLearning. |
Based on this discussion, my feeling is that AdaGrad can be considered to have stood the test of time, in terms of number of citations and replicated usefulness for penalized linear models, and therefore we could consider a PR to add support for AdaGrad as a new learning rate scheme to the existing SGD estimators. "No more pesky learning rates", AdaDelta, RMSProp and the like seem to be mostly used for non-linear models, hence not suitable for inclusion in the linear_model module. Once the MLP PR has been successfully tested with SGD and momentum (which is the historical baseline we have to include in our implementation) and merged to master, we might want to discuss some adaptive learning rates such as AdaDelta and RMSProp. However, I am still not sure which of AdaDelta and RMSProp is going to emerge as the go-to optimizer for architectures with multiple non-linearities, so it might still be too early for a PR. Out-of-repo implementations are surely encouraged, though. |
Bit of an old thread... If no one else is currently working on AdaGrad learning rates for the linear_model.SGD* classes I'm happy to do it. |
I thought there was a PR somewhere, but maybe not? |
Not that I could find. There was this: which appears to be hanging, and where the AdaGrad/AdaDelta learning rate schemes were evaluated at some length. That PR was looking at what appeared to be a pretty significant refactor of the existing SGD code, which stalled. Mentioned within it was http://www.mblondel.org/lightning/, which has the AdaGrad implementation and a few others and follows the sklearn API conventions. I've not looked at the code, but if that's not a candidate for being merged, I would propose to simply add AdaGrad learning rates using the existing sklearn code in sgd_fast etc.; the API already exposes most of what's needed and it would give some useful functionality. |
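For illustration, here is a pure-Python sketch of what a per-feature AdaGrad step would compute for an L2-penalized logistic regression; the function name and default values are made up, and an actual contribution would live in the Cython code in sgd_fast and plug into the existing learning rate options:

```python
import numpy as np

def adagrad_logreg(X, y, alpha=1e-4, eta0=0.1, n_epochs=5, eps=1e-8):
    """Toy AdaGrad training loop for logistic regression; y in {-1, +1}."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    acc = np.zeros(n_features)  # per-feature sum of squared gradients
    for _ in range(n_epochs):
        for i in np.random.permutation(n_samples):
            margin = y[i] * np.dot(w, X[i])
            # Gradient of the log loss plus L2 penalty for one sample.
            g = -y[i] * X[i] / (1.0 + np.exp(margin)) + alpha * w
            acc += g ** 2
            # Per-feature step: rarely active features keep a larger rate.
            w -= eta0 * g / (np.sqrt(acc) + eps)
    return w
```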
Feel free to take over, or maybe @TomDLT will look into it at some point. |
I think we're not going to be working on this. Closing, but correct me if I'm wrong @lorentzenchr @ogrisel |
I'm considering implementing a new adaptive learning rate algorithm proposed here:
http://arxiv.org/abs/1206.1106
Anyone have any feedback about where this should go? I was thinking it would be added somewhere around https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/sgd_fast.pyx#L383
Since SGD is also applicable to non-linear models (e.g. neural network models), are there any plans to move that outside of linear_model?