[MRG + 2] Mlp with adam, nesterov's momentum, early stopping by glennq · Pull Request #5214 · scikit-learn/scikit-learn

Merged: 8 commits merged into scikit-learn:master on Oct 23, 2015

Conversation

@glennq (Contributor) commented Sep 4, 2015

Based on #3204 and #3939.

Code and results for timing and performance benchmarks on MNIST, 20 Newsgroups and RCV1 can be found at https://gist.github.com/glennq/44ca8b66770430ee10f9

Basically, my observation is that adam performs consistently well. Early stopping does not improve performance for adam, but it can reduce training time significantly. Among the other update schemes, the adaptive learning rate (divide by 5 when not improving) with nesterov's momentum comes in just after adam, followed by a constant learning rate with nesterov's momentum. The problem with the adaptive learning rate, though, is that training takes significantly longer.

Also, there are cases where early stopping can increase training time. I think this is because the heuristic we use (stop if the loss has not improved over the best loss by tol for more than two consecutive iterations) can be thrown off when the training loss varies a lot while the validation score improves more smoothly.
This affects update schemes using momentum, but it does not seem to be a problem for adam.
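
As a rough illustration, here is a minimal standalone sketch of that stopping heuristic (not the code in this PR; train_one_epoch is a hypothetical callable and tol/max_iter are placeholder defaults):

```python
import numpy as np

def fit_with_early_stopping(train_one_epoch, max_iter=200, tol=1e-4):
    """train_one_epoch is a hypothetical callable returning this epoch's loss."""
    best_loss = np.inf
    no_improvement_count = 0
    for epoch in range(max_iter):
        loss = train_one_epoch()
        if loss > best_loss - tol:       # did not beat the best loss by at least tol
            no_improvement_count += 1
        else:
            no_improvement_count = 0
        best_loss = min(best_loss, loss)
        if no_improvement_count > 2:     # more than two consecutive non-improving epochs
            break                        # early stop
    return best_loss
```

A noisy monitored loss (for example, one computed only on the last minibatch) can trip this counter even while the validation score is still improving, which is the effect described above.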

The following is a visualized classifier comparison modified from http://scikit-learn.org/dev/auto_examples/classification/plot_classifier_comparison.html

[figure_1: classifier comparison plot]

I'll still work on benchmarking small datasets such as digits and iris, and probably plot training/validation loss for each iteration for more insight.

@amueller (Member) commented Sep 4, 2015

Great, thanks :)

@amueller (Member) commented Sep 4, 2015

Can you please also check the failing tests?

@amueller (Member) commented Sep 4, 2015

@ogrisel and @kastnerkyle might be interested, too ;)

@glennq (Contributor, Author) commented Sep 4, 2015

I think the tests are failing because I changed the default values. Now that the default algorithm is 'adam' with a lower initial learning rate, it does not converge as fast as 'l-bfgs' on small datasets.

I can fix the tests in neural_network, but I'm not sure whether I need to change set_fast_parameters in estimator_checks.py or simply change the default algorithm back to 'l-bfgs'.
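
For illustration, the second option would just mean pinning the solver in the affected tests, roughly like this (a sketch assuming the MLPClassifier class and the algorithm parameter as they exist on this branch; the dataset is only an example):

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier  # class added on this branch

digits = load_digits()
X, y = digits.data, digits.target

# On a small dataset the full-batch 'l-bfgs' algorithm converges in far fewer
# iterations than 'adam' with a low initial learning rate, so the test stays fast.
clf = MLPClassifier(algorithm='l-bfgs')
clf.fit(X, y)
print(clf.score(X, y))
```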


@kastnerkyle (Member):

Did you check generalization error also? I have found Adam is generally good, but usually has worse validation performance when compared to a well-tuned SGD + nesterov momentum.

There is a paper about this:
Train faster, generalize better: Stability of stochastic gradient descent
Moritz Hardt, Benjamin Recht, Yoram Singer
http://arxiv.org/abs/1509.01240

Also some other work by Choromanska et al.:
The Loss Surfaces of Multilayer Networks
http://arxiv.org/pdf/1412.0233.pdf


@glennq (Contributor, Author) commented Sep 4, 2015

The error/accuracy shown in the benchmark results is for the test set.

I think I agree that well-tuned SGD + nesterov momentum can outperform adam, but in my (admittedly limited) experience, the learning rate schedule needs to be carefully adjusted by monitoring the history of training/validation losses.
The 'halving if not improving' scheme works well in practice, but, similar to the 'adaptive' learning rate in the benchmarks, it can make training take much longer. This is not a problem for models that need days to train as long as it gives state-of-the-art performance, but I'm not sure that is what we want in scikit-learn.

I think a reasonable approach would be to allow users to pass a function for learning rate scheduling.
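
Such a user-supplied schedule might look roughly like this (purely hypothetical; neither the callable-schedule option nor these names exist in this PR):

```python
def step_decay(initial_lr, epoch, drop=0.5, every=10):
    """Hypothetical user-supplied schedule: halve the learning rate every `every` epochs."""
    return initial_lr * (drop ** (epoch // every))

# The training loop would query the schedule once per epoch:
for epoch in range(30):
    lr = step_decay(0.01, epoch)
    print(epoch, lr)  # the optimizer would use `lr` for this epoch's updates
```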

Inline review on the diff (example docstring):

> Performs binary classification using multi-layer perceptron
> with l-bfgs algorithm. The prediction target is an XOR of the

(Member): This docstring is outdated.
I'm not sure this example is very helpful, given that we have non-linear datasets in the "classifier comparison".

@amueller (Member) commented Sep 8, 2015

Looks good so far :) I'm checking the adam formulas right now.
It would be great to have an example that shows how early stopping is useful.
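
For reference, the update rule in question is Adam from Kingma & Ba (2014); a minimal standalone NumPy sketch of one parameter update in its canonical form (not necessarily the exact arrangement used in this PR's optimizer code):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2014) in canonical form."""
    t += 1
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v, t
```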

@glennq (Contributor, Author) commented Sep 8, 2015

Thanks for the comments @amueller, I'll work on these. I think I also found the reason for the larger-than-expected fluctuation in the training loss: self.loss_ is calculated based only on the latest mini-batch instead of the entire training set. I'm going to fix this as well.

@amueller (Member) commented Sep 8, 2015

Yes, that should also be fixed.

Inline review on the diff:

> self.loss_))
>
>     # adjusting learning rates
>     if self.learning_rate == 'invscaling':
(Member):

Does this make sense with adam? Probably not, right?
From the experiments, I am not sure if it ever makes sense actually.

@glennq (Contributor, Author):

You mean adjusting the learning rates? Probably not. I think I intended to ignore self.learning_rate when algorithm == 'adam', but I just forgot to do it.

@glennq (Contributor, Author):

Or probably I can just set self.learning_rate to 'constant' if algorithm == 'adam' in _initialize, and throw a warning if it was set to a different value.

(Member):

You can't change self.learning_rate; parameters passed to __init__ must not be modified. You could add an attribute _learning_rate that is set depending on the algorithm, but that seems ugly. Maybe also defer this learning rate schedule to the optimizer.

@glennq (Contributor, Author):

That's probably a good idea, as we are already considering having another function in the optimizer that is called at the end of each epoch.

@amueller (Member) commented Sep 8, 2015

One thing I'm a bit concerned about is that we now have many parameters that only affect SGD, which is not even the default any more. Adam doesn't make use of "momentum" or "nesterovs_momentum". It does currently take "invscaling" into account, but that is probably not a good idea.

I think we should try to decouple the SGD heuristics from the rest of the parameters and code a bit more, but I'm not sure how to do this best.

@amueller (Member) commented Sep 8, 2015

In the benchmark you posted in the top, all the SGD algorithms are slower by a factor of two with early stopping. Why is that?

@glennq (Contributor, Author) commented Sep 8, 2015

I think the main reason is that without early stopping, the stopping criterion relies on the training loss, which, as I mentioned, was only calculated on the last minibatch. This makes the training loss fluctuate a lot, which in turn makes it easier to stop early according to our stopping criterion.

This should be much less of a problem once I fix that. I'm going to re-run the benchmarks.

@amueller (Member) commented Sep 8, 2015

That makes sense.

@amueller (Member) commented Sep 8, 2015

So do we want to have a running average of the loss? Or only adjust the learning rate / stop on epoch level?

@glennq (Contributor, Author) commented Sep 8, 2015

Currently we don't keep any running average of the loss; it's all based on the current epoch and the history of the past few epochs. I didn't change the logic for the stopping criterion.

@amueller (Member) commented Sep 8, 2015

I meant running average over the batches. You wanted to use the loss on the whole training set, not just the last batch, right? How do you compute that? You don't want to go over the whole training set again just to compute the stopping criterion.

@glennq (Contributor, Author) commented Sep 8, 2015

Oh yes, that's right. I'm going to keep a sort of running average of the loss over batches, weighted by the size of each batch slice, since the last batch can be a different size.
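
A minimal sketch of such a batch-size-weighted accumulation (standalone, not the PR's code; the per-batch losses and sizes are assumed to be collected during the epoch):

```python
import numpy as np

def epoch_loss(batch_losses, batch_sizes):
    """Average per-batch losses weighted by batch size, so a smaller
    final slice does not get an equal vote."""
    batch_losses = np.asarray(batch_losses, dtype=float)
    batch_sizes = np.asarray(batch_sizes, dtype=float)
    return np.dot(batch_losses, batch_sizes) / batch_sizes.sum()

# e.g. three batches of 200 samples and a final slice of 50:
print(epoch_loss([0.52, 0.48, 0.45, 0.60], [200, 200, 200, 50]))
```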

@amueller (Member) commented Sep 8, 2015

ok, cool.

@glennq (Contributor, Author) commented Sep 10, 2015

I refactored early stopping and learning rate update a bit. It should be clearer now.

And I re-ran the benchmarks after fixing self.loss_, with plots of the training process; the updated code and results can be found in the gist: https://gist.github.com/glennq/44ca8b66770430ee10f9

Here are some plots. '_val' indicates the validation score, which is only calculated when early_stopping is on.

MNIST: [plot: mnist]

20 Newsgroups: [plot: 20_news_1]

RCV1: [plot: rcv1_2]

@amueller (Member):

The _val values are accuracy and therefore go up, right?

@amueller (Member):

Adam seems to work pretty well, and with the early stopping seems to stop pretty quickly. So that seems reasonable to go with.

@glennq (Contributor, Author) commented Oct 21, 2015

By the way, here is what it looked like when I previously let the model decide when to stop training:
https://github.com/glennq/scikit-learn/blob/5dd3443911e781006f92871ef9483fad92aa1497/sklearn/neural_network/multilayer_perceptron.py#L558

@glennq (Contributor, Author) commented Oct 22, 2015

I made some changes to the documentation and fixed some mistakes. It should look much better now. I'll check again today to see if I missed anything.

As for the structure of the code, I guess what needs more discussion is whether it is the model or the optimizer that gets to decide when to stop training.

The idea is that the model keeps track of _no_improvement_count, which can trigger stopping when it exceeds 2; but if lr_schedule == 'adaptive' and learning_rate > 1e-6 in the SGD optimizer, the optimizer only decreases learning_rate by a factor of 5 and _no_improvement_count is reset to zero. Training continues until learning_rate <= 1e-6.

There are several ways of achieving this, of course, but ideally we want to cleanly define the boundary of responsibilities between the model and the optimizer.

First approach

I started with something as in here: https://github.com/glennq/scikit-learn/blob/5dd3443911e781006f92871ef9483fad92aa1497/sklearn/neural_network/multilayer_perceptron.py#L558

This gives all control to the model. It is also allowed to check learning_rate in the optimizer and change it directly, without the optimizer knowing. I believe this is not good design.

Second approach

And then I came up with what is there now. Whenever _no_improvement_count > 2, the model knows it's probably time to stop, so it checks with the optimizer to see whether it is really OK to stop. The optimizer checks its learning_rate and lr_schedule and answers based on its own state. If the answer is True, the model stops training; if False, training continues and _no_improvement_count is reset to 0 (a minimal sketch of this interaction follows this comment).

I think the idea is that it's really the optimizer who decides, but only when the model asks. The optimizer is sort of like a boss approving requests of the model...

I believe it's better than the first approach in that each party keeps track of its own state and only answers the queries it receives.

Third approach

Another approach I have in mind is to let the optimizer decide.
We can either pass a reference to the model into the optimizer so it can reset _no_improvement_count directly, or just pass in _no_improvement_count each time, and the model resets _no_improvement_count whenever its value is larger than 2 but the optimizer decides not to stop.

This would make the optimizer class more listener-like, though.


I personally think my current approach is fine, but I agree that it is not obvious which side decides when to stop.
If I needed to change this, I'd personally prefer the third approach, but any suggestions or opinions are welcome.
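
For concreteness, a hand-written sketch of the second approach described above (the class and method names here are only illustrative, not the PR's actual code):

```python
class ToySGDOptimizer(object):
    """Sketch only: the optimizer answers a stop request based on its own state."""

    def __init__(self, learning_rate_init=0.1, lr_schedule='adaptive'):
        self.learning_rate = learning_rate_init
        self.lr_schedule = lr_schedule

    def ask_to_stop(self):
        # With the 'adaptive' schedule, instead of stopping, divide the
        # learning rate by 5 and keep training until it drops below 1e-6.
        if self.lr_schedule == 'adaptive' and self.learning_rate > 1e-6:
            self.learning_rate /= 5.0
            return False   # not OK to stop yet
        return True        # OK to stop


# In the model's training loop, schematically:
#     if no_improvement_count > 2:
#         if optimizer.ask_to_stop():
#             break                      # early stop
#         no_improvement_count = 0       # the optimizer lowered the lr instead
```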

@kastnerkyle (Member):

I much prefer the first approach actually but it is mostly a matter of preference. The third approach is kind of the inverse of the first approach, except the optimizer (which is mostly simple / mathematical) is controlling the model (which is complex, since it usually handles or is involved in minibatches and the "training loop"). So I am a fan of the first approach when I do it on my own since the optimizer doesn't really need to know about epochs - just costs and gradients (and sometimes second order info).

The second approach (what is already implemented) seems just fine to me. I will review again.

@kastnerkyle (Member):

On gradient check - what we have now is fine IMO.

Renaming seems useful, since our implementation will likely serve as an educational tool.

We need to know whether safe_sparse_dot pays a computational penalty. Since we are using it quite a bit (both fprop and bprop for n_layers), this could potentially be an issue. If we could benchmark some random floating-point dot products calling np.dot directly vs. safe_sparse_dot (a separate script is fine, or even a command-line call), I would like that. I am 90% sure the overhead is minimal compared to other things, but 100% would be better.
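
Such a benchmark could be as simple as the following sketch (dense float inputs only; array shapes and repeat counts are arbitrary):

```python
import timeit

import numpy as np
from sklearn.utils.extmath import safe_sparse_dot

rng = np.random.RandomState(0)
A = rng.rand(2000, 500)
B = rng.rand(500, 300)

# Compare the thin safe_sparse_dot wrapper against calling np.dot directly;
# on dense inputs the wrapper just dispatches to the regular dot product.
t_np = timeit.timeit(lambda: np.dot(A, B), number=100)
t_ssd = timeit.timeit(lambda: safe_sparse_dot(A, B), number=100)
print("np.dot:          %.4f s" % t_np)
print("safe_sparse_dot: %.4f s" % t_ssd)
```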

@kastnerkyle (Member):

+1 from me besides the naming; @arjoly and I agree that inplace_logistic_derivative is a good name, and similarly for the other derivatives.

We can handle safe_sparse_dot later if it is an issue.

@glennq (Contributor, Author) commented Oct 23, 2015

Just renamed the derivatives. I'm going to compare safe_sparse_dot with np.dot.

@amueller (Member):

@glennq I was just about to do the rename. Thanks. I'm going to merge this. It would be great if you can do the timing comparison, but I think it's a minor detail.

@amueller (Member):

(I'll merge if travis is green)

@amueller (Member):

can you please squash your commits though?

@glennq (Contributor, Author) commented Oct 23, 2015

no problem

kastnerkyle changed the title from "[MRG + 1] Mlp with adam, nesterov's momentum, early stopping" to "[MRG + 2] Mlp with adam, nesterov's momentum, early stopping" on Oct 23, 2015
amueller added a commit that referenced this pull request on Oct 23, 2015: [MRG + 2] Mlp with adam, nesterov's momentum, early stopping
amueller merged commit 965a715 into scikit-learn:master on Oct 23, 2015
@kastnerkyle (Member):

🍻

@amueller (Member):

Thanks! Wohoooo! 🍻 After only like... 3 years? Thanks a lot @glennq and @eickenberg and @kastnerkyle for reviews :)

@kastnerkyle (Member):

great job @IssamLaradji @glennq @amueller @ogrisel and anyone else I am missing

@glennq (Contributor, Author) commented Oct 23, 2015

Great! Thank you @amueller @kastnerkyle @eickenberg @ogrisel for reviews!

@IssamLaradji (Contributor):

Great job everyone!! :) :)

@arjoly (Member) commented Oct 23, 2015

Awesome !!! 🍻

@glouppe (Contributor) commented Oct 23, 2015

Great job to everyone involved!

(We should really try to learn from this, and make sure we avoid mega huge PRs in the future, as much as possible)

@eickenberg (Contributor):

great job everybody who ever worked on this!!! :)


@GaelVaroquaux (Member):

F** yeah! I am really happy. This was excellent team work.

Thank you to everybody who reviewed, and to @IssamLaradji and @glennq!

@agramfort (Member) commented Oct 24, 2015 via email
