[MRG + 2] Mlp with adam, nesterov's momentum, early stopping #5214
Conversation
Great, thanks :) |
Can you please also check the failing tests? |
@ogrisel and @kastnerkyle might be interested, too ;) |
I think the reason for the failures is that I changed the default values. Now that the default algorithm is 'adam' with a lower initial learning rate, it does not converge as fast as 'l-bfgs' on small datasets. I can fix the tests.
Did you check out generalization error also? I have found Adam is generally ... There is a paper about this, and also some other work by Choromanska et al.
The error/accuracy shown in the benchmark results is for the test set. I agree that well-tuned SGD + Nesterov momentum can outperform adam, but in my (admittedly limited) experience, the learning rate schedule needs to be carefully adjusted by monitoring the history of training/validation losses. I think a reasonable approach is to allow users to pass a function for learning rate scheduling.
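For illustration, a minimal sketch of what a user-supplied schedule could look like; passing a callable is only a suggestion here, and none of the names below are part of this PR:

```python
# Hypothetical user-supplied schedule mapping epoch -> learning rate
# (exponential decay chosen arbitrarily for the example).
def my_schedule(epoch, lr_init=0.01, decay=0.95):
    return lr_init * decay ** epoch

# A training loop could then query the callable once per epoch:
for epoch in range(5):
    lr = my_schedule(epoch)
    print("epoch %d: learning rate %.5f" % (epoch, lr))
```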
Performs binary classification using multi-layer perceptron
with l-bfgs algorithm. The prediction target is an XOR of the ...
this docstring is out-dated.
I'm not sure this example is very helpful, given that we have non-linear datasets in the "classifier comparison".
Looks good so far :) I'm checking the adam formulas right now. |
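For reference while checking, this is the Adam update from Kingma & Ba (2014) in a minimal NumPy sketch; variable names are mine and may not match the PR:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2014) for a single parameter array."""
    m = beta1 * m + (1 - beta1) * grad        # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # biased second raw moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# t starts at 1 and increases by one with every update
w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
w, m, v = adam_step(w, np.array([0.1, -0.2, 0.3]), m, v, t=1)
```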
Thanks for the comments @amueller , I'll work on these. And I think I also found the reason for the larger than expected fluctuation in training loss during training. |
yes that should also be fixed. |
self.loss_))
# adjusting learning rates
if self.learning_rate == 'invscaling':
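For context, the 'invscaling' schedule under discussion divides the initial rate by a power of the time step; a sketch of the rule, assuming the usual eta_t = eta_0 / t**power_t form (the PR's code may differ in detail):

```python
def invscaling_rate(learning_rate_init, t, power_t=0.5):
    # eta_t = eta_0 / t ** power_t
    return learning_rate_init / (t ** power_t)

print(invscaling_rate(0.01, t=1))    # 0.01
print(invscaling_rate(0.01, t=100))  # 0.001
```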
Does this make sense with adam? Probably not, right?
From the experiments, I am not sure if it ever makes sense actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean adjusting learning rates? Probably not. I think I intended to ignore `self.learning_rate` when `algorithm == 'adam'`. I think I just forgot to do this.
Or probably I can just set `self.learning_rate` to `'constant'` if `algorithm == 'adam'` in `_initialize`, and throw a warning if it was set to a different value.
You can't change `self.learning_rate`; parameters passed to `__init__` must not be changed. You can add an attribute `_learning_rate` that is set depending on `algorithm`, but that seems ugly. Maybe also defer this learning rate schedule to the optimizer.
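A sketch of the constraint being discussed, using a toy class; `TinyMLP` and `_learning_rate` are illustrative names, not code from this PR:

```python
import warnings

class TinyMLP:
    def __init__(self, algorithm='adam', learning_rate='constant'):
        # __init__ parameters are stored as given and never modified afterwards,
        # per scikit-learn convention.
        self.algorithm = algorithm
        self.learning_rate = learning_rate

    def _initialize(self):
        # Any derived value goes into a separate, private attribute instead.
        self._learning_rate = self.learning_rate
        if self.algorithm == 'adam' and self.learning_rate != 'constant':
            warnings.warn("learning_rate is ignored when algorithm='adam'")
            self._learning_rate = 'constant'
```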
That's probably a good idea, as we are already considering having another function in the optimizer that is called at the end of each epoch.
One thing I'm a bit concerned about is that we now have many parameters that only impact SGD, which is now not even the default. Adam doesn't make use of "momentum" or "nesterovs_momentum". It does currently take "invscaling" into account, but that is probably not a good idea. I think we should try to decouple the SGD heuristics from the rest of the parameters and code a bit more, but I'm not sure how best to do this.
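One possible way to decouple them, sketched with illustrative class names (not necessarily how the PR ends up doing it): each algorithm gets its own optimizer object holding only the hyperparameters it understands, and the estimator just forwards parameters and gradients.

```python
import numpy as np

class SGDOptimizer:
    """All SGD-only knobs (momentum, Nesterov) live here, not on the estimator."""

    def __init__(self, learning_rate_init=0.1, momentum=0.9, nesterov=True):
        self.learning_rate = learning_rate_init
        self.momentum = momentum
        self.nesterov = nesterov
        self.velocities = None  # lazily sized to match the parameters

    def update_params(self, params, grads):
        if self.velocities is None:
            self.velocities = [np.zeros_like(p) for p in params]
        for p, v, g in zip(params, self.velocities, grads):
            v *= self.momentum
            v -= self.learning_rate * g
            if self.nesterov:
                p += self.momentum * v - self.learning_rate * g
            else:
                p += v


class AdamOptimizer:
    """Adam exposes only its own hyperparameters; momentum, nesterov and
    invscaling simply do not exist here, so the estimator never has to
    special-case them."""

    def __init__(self, learning_rate_init=0.001, beta_1=0.9, beta_2=0.999,
                 epsilon=1e-8):
        self.learning_rate = learning_rate_init
        self.beta_1, self.beta_2, self.epsilon = beta_1, beta_2, epsilon
        self.t = 0
        # first/second moment estimates, updated as in the Adam sketch above
        self.ms, self.vs = None, None
```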
In the benchmark you posted in the top, all the SGD algorithms are slower by a factor of two with early stopping. Why is that? |
I think the main reason is that without early stopping, the stopping criterion relies on the training loss, which, as I mentioned, was only calculated on the last minibatch. This makes the training loss fluctuate a lot, and in turn makes it easier to stop early according to our stopping criterion. It should not be such an obvious problem after I fix this. I'm going to re-run the benchmarks.
That makes sense. |
So do we want to have a running average of the loss? Or only adjust the learning rate / stop on epoch level? |
Currently we don't keep any running average of the loss; it's all based on the current epoch and the history of the past few epochs. I didn't change the logic for the stopping criterion.
I meant running average over the batches. You wanted to use the loss on the whole training set, not just the last batch, right? How do you compute that? You don't want to go over the whole training set again just to compute the stopping criterion. |
Oh yes, that's right. I'm going to keep a sort of running average of the loss over batches, weighted by the size of the batch slice, since it can be different for the last batch.
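A sketch of that accumulation, assuming the per-batch loss is already available (`batch_loss` below is a stand-in value):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, batch_size = 1000, 128
accumulated_loss = 0.0

for start in range(0, n_samples, batch_size):
    stop = min(start + batch_size, n_samples)
    n_in_batch = stop - start            # the last slice may be smaller
    batch_loss = rng.rand()              # stand-in for the real minibatch loss
    accumulated_loss += batch_loss * n_in_batch

epoch_loss = accumulated_loss / n_samples  # size-weighted average over the epoch
```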
ok, cool. |
I refactored early stopping and the learning rate update a bit. It should be clearer now. And I re-ran the benchmarks after fixing the training loss calculation. Here are some plots. '_val' indicates the validation score, which is only calculated when early_stopping is on.
Adam seems to work pretty well, and with the early stopping seems to stop pretty quickly. So that seems reasonable to go with. |
By the way, here is what it looked like when I previously let the model decide when to stop training:
I made some changes to the documentation and fixed some mistakes. It should look much better now. I'll check again today to see if I missed anything.

For the structure of the code, I guess what needs more discussion is whether it is the model or the optimizer that has the right to decide when to stop training. The idea is that the model keeps track of ... There can be several ways of achieving this, of course, but ideally we want a way to perfectly define the border of behaviors for both the model and the optimizer.

First approach

I started with something as in here: https://github.com/glennq/scikit-learn/blob/5dd3443911e781006f92871ef9483fad92aa1497/sklearn/neural_network/multilayer_perceptron.py#L558 This gives all rights to the model. It is also allowed to check ...

Second approach

And then I came up with what it is now. Whenever ... I think the idea is that it's really the optimizer who decides, but only when the model asks. The optimizer is sort of like a boss approving requests from the model... I believe it's better than the first approach in that each party keeps track of its own state and only answers the queries it receives.

Third approach

Another approach I have in mind is that we let the optimizer decide. This would make the optimizer class more listener-like, though.

I personally think my current approach is fine, but I agree that it is not obvious which side decides when to stop.
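To make the second approach concrete, here is a rough sketch of the "model asks, optimizer approves" pattern; method and attribute names are illustrative and not taken verbatim from the PR:

```python
class BaseOptimizer:
    def trigger_stopping(self, msg, verbose):
        # Default: nothing left to try, approve the model's request to stop.
        if verbose:
            print(msg + " Stopping.")
        return True


class SGDOptimizer(BaseOptimizer):
    def __init__(self, learning_rate_init=0.1, lr_schedule='adaptive'):
        self.learning_rate = learning_rate_init
        self.lr_schedule = lr_schedule

    def trigger_stopping(self, msg, verbose):
        # With an adaptive schedule the optimizer may refuse and shrink the step
        # size instead, only agreeing to stop once the rate becomes tiny.
        if self.lr_schedule == 'adaptive' and self.learning_rate > 1e-6:
            self.learning_rate /= 5.0
            if verbose:
                print(msg + " Setting learning rate to %f" % self.learning_rate)
            return False
        return super().trigger_stopping(msg, verbose)


# Model side: when its no-improvement counter fires, it asks rather than decides.
#     if no_improvement_count > 2:
#         if self._optimizer.trigger_stopping("Loss did not improve.", self.verbose):
#             break
```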
I much prefer the first approach actually but it is mostly a matter of preference. The third approach is kind of the inverse of the first approach, except the optimizer (which is mostly simple / mathematical) is controlling the model (which is complex, since it usually handles or is involved in minibatches and the "training loop"). So I am a fan of the first approach when I do it on my own since the optimizer doesn't really need to know about epochs - just costs and gradients (and sometimes second order info). The second approach (what is already implemented) seems just fine to me. I will review again. |
On the gradient check - what we have now is fine IMO. Renaming seems useful, since our implementation will likely serve as an educational tool. We need to know whether ...
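For context, a gradient check of this kind is usually a finite-difference comparison against the analytic gradient; a generic sketch, not the PR's actual test code:

```python
import numpy as np

def numerical_grad(loss_fn, w, eps=1e-6):
    """Central finite differences, for comparison with an analytic gradient."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
    return grad

# Example: for loss(w) = sum(w**2) the analytic gradient is 2 * w.
w = np.array([0.5, -1.0, 2.0])
assert np.allclose(numerical_grad(lambda x: np.sum(x ** 2), w), 2 * w, atol=1e-5)
```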
+1 from me besides that naming - @arjoly and I agree that ... We can handle ...
Just rename the derivatives. I'm going to compare ...
@glennq I was just about to do the rename. Thanks. I'm going to merge this. It would be great if you can do the timing comparison, but I think it's a minor detail. |
(I'll merge if travis is green) |
can you please squash your commits though? |
no problem |
[MRG + 2] Mlp with adam, nesterov's momentum, early stopping
🍻 |
Thanks! Wohoooo! 🍻 After only like... 3 years? Thanks a lot @glennq and @eickenberg and @kastnerkyle for reviews :)
great job @IssamLaradji @glennq @amueller @ogrisel and anyone else I am missing |
Great! Thank you @amueller @kastnerkyle @eickenberg @ogrisel for reviews! |
Great job everyone!! :) :) |
Awesome !!! 🍻 |
Great job to everyone involved! (We should really try to learn from this, and make sure we avoid mega huge PRs in the future, as much as possible) |
great job everybody who ever worked on this!!! :)
F** yeah! I am really happy. This was excellent team work. Thanks you, to everybody who reviewed, and to @IssamLaradji and @glennq! |
🍻 !
Based on #3204 and #3939.
Code and results for timing and performance benchmarks on MNIST, 20 Newsgroups and RCV1 can be found at https://gist.github.com/glennq/44ca8b66770430ee10f9
Basically my observation is that adam performs consistently well. Early stopping does not improve performance for adam, but it can reduce training time significantly. Among the other update schemes, the adaptive learning rate (divide by 5 if not improving) with nesterov's momentum comes just after adam, followed by a constant learning rate with nesterov's momentum. The problem with the adaptive learning rate, though, is that training time is significantly longer.
Also, there are cases where early stopping can increase training time. I think this is because the heuristic we use (stop if the loss does not improve over the best loss by tol for more than two consecutive iterations) can be thrown off when the training loss has large variation while the validation score improves more smoothly.
This affects update schemes using momentum, but it does not seem to be a problem for adam.
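In sketch form, the heuristic described above (combined with the "divide by 5" adaptive step) looks roughly like this; the exact thresholds and bookkeeping may differ from the implementation:

```python
loss_per_epoch = [0.90, 0.60, 0.45, 0.44, 0.44, 0.44, 0.44]  # stand-in values
tol = 1e-4
best_loss = float("inf")
no_improvement_count = 0

for epoch, loss in enumerate(loss_per_epoch):
    if loss < best_loss - tol:
        no_improvement_count = 0          # meaningful improvement this epoch
    else:
        no_improvement_count += 1
    best_loss = min(best_loss, loss)
    if no_improvement_count > 2:          # no improvement for more than two epochs
        # stop, or divide the learning rate by 5 if the schedule is adaptive
        print("Triggered at epoch %d" % epoch)
        break
```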
The following is a visualized classifier comparison modified from http://scikit-learn.org/dev/auto_examples/classification/plot_classifier_comparison.html
I'll still work on benchmarking small datasets such as digits and iris, and will probably plot the training/validation loss at each iteration for more insight.