[MRG + 2] Mlp with adam, nesterov's momentum, early stopping by glennq · Pull Request #5214 · scikit-learn/scikit-learn

Merged: 8 commits merged into scikit-learn:master on Oct 23, 2015

Conversation

@glennq (Contributor) commented Sep 4, 2015

Based on #3204 and #3939.

Code and results for timing and performance benchmarks on MNIST, 20 Newsgroups and RCV1 can be found at https://gist.github.com/glennq/44ca8b66770430ee10f9

Basically, my observation is that adam performs consistently well. Early stopping does not improve performance for adam, but it can reduce training time significantly. Among the other update schemes, the adaptive learning rate (divide by 5 when not improving) with nesterov's momentum comes in just after adam, followed by a constant learning rate with nesterov's momentum. The problem with the adaptive learning rate, though, is that training takes significantly longer.

Also, there are cases where early stopping can increase training time. I think this is because the heuristic we use (stop if the loss has not improved over the best loss by tol for more than two consecutive iterations) can be thrown off when the training loss varies a lot while the validation score improves more smoothly.
This affects update schemes using momentum, but it does not seem to be a problem for adam.
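
As a rough illustration, here is a minimal standalone sketch of that stopping heuristic (not the code in this PR; train_one_epoch is a hypothetical callable and tol/max_iter are placeholder defaults):

```python
import numpy as np

def fit_with_early_stopping(train_one_epoch, max_iter=200, tol=1e-4):
    """train_one_epoch is a hypothetical callable returning this epoch's loss."""
    best_loss = np.inf
    no_improvement_count = 0
    for epoch in range(max_iter):
        loss = train_one_epoch()
        if loss > best_loss - tol:       # did not beat the best loss by at least tol
            no_improvement_count += 1
        else:
            no_improvement_count = 0
        best_loss = min(best_loss, loss)
        if no_improvement_count > 2:     # more than two consecutive non-improving epochs
            break                        # early stop
    return best_loss
```

A noisy monitored loss (for example, one computed only on the last minibatch) can trip this counter even while the validation score is still improving, which is the effect described above.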

The following is a visualized classifier comparison modified from http://scikit-learn.org/dev/auto_examples/classification/plot_classifier_comparison.html

[figure_1: classifier comparison plot]

I'll still work on benchmarking small datasets such as digits and iris, and probably plot training/validation loss for each iteration for more insight.

@amueller (Member) commented Sep 4, 2015

Great, thanks :)

@amueller (Member) commented Sep 4, 2015

Can you please also check the failing tests?

@amueller (Member) commented Sep 4, 2015

@ogrisel and @kastnerkyle might be interested, too ;)

@glennq (Contributor, Author) commented Sep 4, 2015

I think the tests are failing because I changed the default values. Now that the default algorithm is 'adam' with a lower initial learning rate, it does not converge as fast as 'l-bfgs' on small datasets.

I can fix the tests in neural_network, but I'm not sure whether I need to change set_fast_parameters in estimator_checks.py or simply change the default algorithm back to 'l-bfgs'.
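
For illustration, the second option would just mean pinning the solver in the affected tests, roughly like this (a sketch assuming the MLPClassifier class and the algorithm parameter as they exist on this branch; the dataset is only an example):

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier  # class added on this branch

digits = load_digits()
X, y = digits.data, digits.target

# On a small dataset the full-batch 'l-bfgs' algorithm converges in far fewer
# iterations than 'adam' with a low initial learning rate, so the test stays fast.
clf = MLPClassifier(algorithm='l-bfgs')
clf.fit(X, y)
print(clf.score(X, y))
```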


@kastnerkyle (Member):

Did you check generalization error also? I have found Adam is generally good, but usually has worse validation performance when compared to a well-tuned SGD + nesterov momentum.

There is a paper about this:
Train faster, generalize better: Stability of stochastic gradient descent
Moritz Hardt, Benjamin Recht, Yoram Singer
http://arxiv.org/abs/1509.01240

Also some other work by Choromanska et al.:
The Loss Surfaces of Multilayer Networks
http://arxiv.org/pdf/1412.0233.pdf


@glennq (Contributor, Author) commented Sep 4, 2015

The error/accuracy shown in the benchmark results is for the test set.

I think I agree that well-tuned SGD + nesterov momentum can outperform adam, but in my (admittedly limited) experience, the learning rate schedule needs to be carefully adjusted by monitoring the history of training/validation losses.
The 'halving if not improving' scheme works well in practice, but, similar to the 'adaptive' learning rate in the benchmarks, it can make training take much longer. This is not a problem for models that need days to train as long as it gives state-of-the-art performance, but I'm not sure that is what we want in scikit-learn.

I think a reasonable approach would be to allow users to pass a function for learning rate scheduling.
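
Such a user-supplied schedule might look roughly like this (purely hypothetical; neither the callable-schedule option nor these names exist in this PR):

```python
def step_decay(initial_lr, epoch, drop=0.5, every=10):
    """Hypothetical user-supplied schedule: halve the learning rate every `every` epochs."""
    return initial_lr * (drop ** (epoch // every))

# The training loop would query the schedule once per epoch:
for epoch in range(30):
    lr = step_decay(0.01, epoch)
    print(epoch, lr)  # the optimizer would use `lr` for this epoch's updates
```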

Inline review on the diff (example docstring):

> Performs binary classification using multi-layer perceptron
> with l-bfgs algorithm. The prediction target is an XOR of the

(Member): This docstring is outdated.
I'm not sure this example is very helpful, given that we have non-linear datasets in the "classifier comparison".

@amueller (Member) commented Sep 8, 2015

Looks good so far :) I'm checking the adam formulas right now.
It would be great to have an example that shows how early stopping is useful.
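
For reference, the update rule in question is Adam from Kingma & Ba (2014); a minimal standalone NumPy sketch of one parameter update in its canonical form (not necessarily the exact arrangement used in this PR's optimizer code):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2014) in canonical form."""
    t += 1
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v, t
```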

@glennq (Contributor, Author) commented Sep 8, 2015

Thanks for the comments @amueller, I'll work on these. I think I also found the reason for the larger-than-expected fluctuation in the training loss: self.loss_ is calculated based only on the latest mini-batch instead of the entire training set. I'm going to fix this as well.

@amueller (Member) commented Sep 8, 2015

Yes, that should also be fixed.

Inline review on the diff:

> self.loss_))
>
>     # adjusting learning rates
>     if self.learning_rate == 'invscaling':
(Member):

Does this make sense with adam? Probably not, right?
From the experiments, I am not sure if it ever makes sense actually.

@glennq (Contributor, Author):

You mean adjusting the learning rates? Probably not. I think I intended to ignore self.learning_rate when algorithm == 'adam', but I just forgot to do it.

@glennq (Contributor, Author):

Or probably I can just set self.learning_rate to 'constant' if algorithm == 'adam' in _initialize, and throw a warning if it was set to a different value.

(Member):

You can't change self.learning_rate; parameters passed to __init__ must not be modified. You could add an attribute _learning_rate that is set depending on the algorithm, but that seems ugly. Maybe also defer this learning rate schedule to the optimizer.

@glennq (Contributor, Author):

That's probably a good idea, as we are already considering having another function in the optimizer that is called at the end of each epoch.

@amueller (Member) commented Sep 8, 2015

One thing I'm a bit concerned about is that we now have many parameters that only affect SGD, which is not even the default any more. Adam doesn't make use of "momentum" or "nesterovs_momentum". It does currently take "invscaling" into account, but that is probably not a good idea.

I think we should try to decouple the SGD heuristics from the rest of the parameters and code a bit more, but I'm not sure how to do this best.

@amueller (Member) commented Sep 8, 2015

In the benchmark you posted in the top, all the SGD algorithms are slower by a factor of two with early stopping. Why is that?

@glennq (Contributor, Author) commented Sep 8, 2015

I think the main reason is that without early stopping, the stopping criterion relies on the training loss, which, as I mentioned, was only calculated on the last minibatch. This makes the training loss fluctuate a lot, which in turn makes it easier to stop early according to our stopping criterion.

This should be much less of a problem once I fix that. I'm going to re-run the benchmarks.

@amueller (Member) commented Sep 8, 2015

That makes sense.

@amueller (Member) commented Sep 8, 2015

So do we want to have a running average of the loss? Or only adjust the learning rate / stop on epoch level?

@glennq (Contributor, Author) commented Sep 8, 2015

Currently we don't keep any running average of the loss; it's all based on the current epoch and the history of the past few epochs. I didn't change the logic for the stopping criterion.

@amueller (Member) commented Sep 8, 2015

I meant running average over the batches. You wanted to use the loss on the whole training set, not just the last batch, right? How do you compute that? You don't want to go over the whole training set again just to compute the stopping criterion.

@glennq (Contributor, Author) commented Sep 8, 2015

Oh yes, that's right. I'm going to keep a sort of running average of the loss over batches, weighted by the size of each batch slice, since the last batch can be a different size.
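
A minimal sketch of such a batch-size-weighted accumulation (standalone, not the PR's code; the per-batch losses and sizes are assumed to be collected during the epoch):

```python
import numpy as np

def epoch_loss(batch_losses, batch_sizes):
    """Average per-batch losses weighted by batch size, so a smaller
    final slice does not get an equal vote."""
    batch_losses = np.asarray(batch_losses, dtype=float)
    batch_sizes = np.asarray(batch_sizes, dtype=float)
    return np.dot(batch_losses, batch_sizes) / batch_sizes.sum()

# e.g. three batches of 200 samples and a final slice of 50:
print(epoch_loss([0.52, 0.48, 0.45, 0.60], [200, 200, 200, 50]))
```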

@amueller (Member) commented Sep 8, 2015

ok, cool.

@glennq (Contributor, Author) commented Sep 10, 2015

I refactored early stopping and learning rate update a bit. It should be clearer now.

And I re-ran the benchmarks after fixing self.loss_, with plots of the training process; the updated code and results can be found in the gist: https://gist.github.com/glennq/44ca8b66770430ee10f9

Here are some plots. '_val' indicates the validation score, which is only calculated when early_stopping is on.

MNIST: [plot: mnist]

20 Newsgroups: [plot: 20_news_1]

RCV1: [plot: rcv1_2]

@amueller (Member):

The _val values are accuracy and therefore go up, right?

@amueller (Member):

Adam seems to work pretty well, and with the early stopping seems to stop pretty quickly. So that seems reasonable to go with.

@glennq (Contributor, Author) commented Oct 21, 2015

By the way, here is what it looked like when I previously let the model decide when to stop training:
https://github.com/glennq/scikit-learn/blob/5dd3443911e781006f92871ef9483fad92aa1497/sklearn/neural_network/multilayer_perceptron.py#L558

@glennq (Contributor, Author) commented Oct 22, 2015

I made some changes to the documentation and fixed some mistakes. It should look much better now. I'll check again today to see if I missed anything.

As for the structure of the code, I guess what needs more discussion is whether it is the model or the optimizer that gets to decide when to stop training.

The idea is that the model keeps track of _no_improvement_count, which can trigger stopping when it exceeds 2; but if lr_schedule == 'adaptive' and learning_rate > 1e-6 in the SGD optimizer, the optimizer only decreases learning_rate by a factor of 5 and _no_improvement_count is reset to zero. Training continues until learning_rate <= 1e-6.

There are several ways of achieving this, of course, but ideally we want to cleanly define the boundary of responsibilities between the model and the optimizer.

First approach

I started with something as in here: https://github.com/glennq/scikit-learn/blob/5dd3443911e781006f92871ef9483fad92aa1497/sklearn/neural_network/multilayer_perceptron.py#L558

This gives all control to the model. It is also allowed to check learning_rate in the optimizer and change it directly, without the optimizer knowing. I believe this is not good design.

Second approach

And then I came up with what is there now. Whenever _no_improvement_count > 2, the model knows it's probably time to stop, so it checks with the optimizer to see whether it is really OK to stop. The optimizer checks its learning_rate and lr_schedule and answers based on its own state. If the answer is True, the model stops training; if False, training continues and _no_improvement_count is reset to 0 (a minimal sketch of this interaction follows this comment).

I think the idea is that it's really the optimizer who decides, but only when the model asks. The optimizer is sort of like a boss approving requests of the model...

I believe it's better than the first approach in that each party keeps track of its own state and only answers the queries it receives.

Third approach

Another approach I have in mind is to let the optimizer decide.
We can either pass a reference to the model into the optimizer so it can reset _no_improvement_count directly, or just pass in _no_improvement_count each time, and the model resets _no_improvement_count whenever its value is larger than 2 but the optimizer decides not to stop.

This would make the optimizer class more listener-like, though.


I personally think my current approach is fine, but I agree that it is not obvious which side decides when to stop.
If I needed to change this, I'd personally prefer the third approach, but any suggestions or opinions are welcome.
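
For concreteness, a hand-written sketch of the second approach described above (the class and method names here are only illustrative, not the PR's actual code):

```python
class ToySGDOptimizer(object):
    """Sketch only: the optimizer answers a stop request based on its own state."""

    def __init__(self, learning_rate_init=0.1, lr_schedule='adaptive'):
        self.learning_rate = learning_rate_init
        self.lr_schedule = lr_schedule

    def ask_to_stop(self):
        # With the 'adaptive' schedule, instead of stopping, divide the
        # learning rate by 5 and keep training until it drops below 1e-6.
        if self.lr_schedule == 'adaptive' and self.learning_rate > 1e-6:
            self.learning_rate /= 5.0
            return False   # not OK to stop yet
        return True        # OK to stop


# In the model's training loop, schematically:
#     if no_improvement_count > 2:
#         if optimizer.ask_to_stop():
#             break                      # early stop
#         no_improvement_count = 0       # the optimizer lowered the lr instead
```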

@kastnerkyle (Member):

I much prefer the first approach actually but it is mostly a matter of preference. The third approach is kind of the inverse of the first approach, except the optimizer (which is mostly simple / mathematical) is controlling the model (which is complex, since it usually handles or is involved in minibatches and the "training loop"). So I am a fan of the first approach when I do it on my own since the optimizer doesn't really need to know about epochs - just costs and gradients (and sometimes second order info).

The second approach (what is already implemented) seems just fine to me. I will review again.

@kastnerkyle (Member):

On gradient check - what we have now is fine IMO.

Renaming seems useful, since our implementation will likely serve as an educational tool.

We need to know whether safe_sparse_dot pays a computational penalty. Since we are using it quite a bit (both fprop and bprop for n_layers), this could potentially be an issue. If we could benchmark some random floating-point dot products calling np.dot directly vs. safe_sparse_dot (a separate script is fine, or even a command-line call), I would like that. I am 90% sure the overhead is minimal compared to other things, but 100% would be better.
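
Such a benchmark could be as simple as the following sketch (dense float inputs only; array shapes and repeat counts are arbitrary):

```python
import timeit

import numpy as np
from sklearn.utils.extmath import safe_sparse_dot

rng = np.random.RandomState(0)
A = rng.rand(2000, 500)
B = rng.rand(500, 300)

# Compare the thin safe_sparse_dot wrapper against calling np.dot directly;
# on dense inputs the wrapper just dispatches to the regular dot product.
t_np = timeit.timeit(lambda: np.dot(A, B), number=100)
t_ssd = timeit.timeit(lambda: safe_sparse_dot(A, B), number=100)
print("np.dot:          %.4f s" % t_np)
print("safe_sparse_dot: %.4f s" % t_ssd)
```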

@kastnerkyle (Member):

+1 from me besides the naming; @arjoly and I agree that inplace_logistic_derivative is a good name, and similarly for the other derivatives.

We can handle safe_sparse_dot later if it is an issue.

@glennq (Contributor, Author) commented Oct 23, 2015

Just renamed the derivatives. I'm going to compare safe_sparse_dot with np.dot.

@amueller (Member):

@glennq I was just about to do the rename. Thanks. I'm going to merge this. It would be great if you can do the timing comparison, but I think it's a minor detail.

@amueller (Member):

(I'll merge if travis is green)

@amueller (Member):

can you please squash your commits though?

@glennq (Contributor, Author) commented Oct 23, 2015

no problem

kastnerkyle changed the title from "[MRG + 1] Mlp with adam, nesterov's momentum, early stopping" to "[MRG + 2] Mlp with adam, nesterov's momentum, early stopping" on Oct 23, 2015
amueller added a commit that referenced this pull request on Oct 23, 2015: [MRG + 2] Mlp with adam, nesterov's momentum, early stopping
amueller merged commit 965a715 into scikit-learn:master on Oct 23, 2015
@kastnerkyle (Member):

🍻

@amueller (Member):

Thanks! Wohoooo! 🍻 After only like... 3 years? Thanks a lot @glennq and @eickenberg and @kastnerkyle for reviews :)

@kastnerkyle (Member):

great job @IssamLaradji @glennq @amueller @ogrisel and anyone else I am missing

@glennq (Contributor, Author) commented Oct 23, 2015

Great! Thank you @amueller @kastnerkyle @eickenberg @ogrisel for reviews!

@IssamLaradji (Contributor):

Great job everyone!! :) :)

@arjoly (Member) commented Oct 23, 2015

Awesome !!! 🍻

@glouppe (Contributor) commented Oct 23, 2015

Great job to everyone involved!

(We should really try to learn from this, and make sure we avoid mega huge PRs in the future, as much as possible)

@eickenberg (Contributor):

great job everybody who ever worked on this!!! :)


@GaelVaroquaux (Member):

F** yeah! I am really happy. This was excellent team work.

Thank you to everybody who reviewed, and to @IssamLaradji and @glennq!

@agramfort (Member) commented Oct 24, 2015 via email
