ENH better 'constant' learning rate schedule by ogrisel · Pull Request #22 · amueller/scikit-learn

Merged
merged 1 commit into amueller:mlp_refactoring on Dec 18, 2014

Conversation

@ogrisel commented on Dec 18, 2014

I switched the MNIST benchmark to a deeper and narrower network using SGD + momentum, with a better "constant" learning rate schedule that makes it possible to be aggressive without diverging:

...
Iteration 49, cost = 0.00434729

Classification performance:
===========================

Classifier               train-time   test-time   error-rate
------------------------------------------------------------
MultilayerPerceptron        157.24s       0.08s       0.0199
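
The exact schedule lives in the diff; as a rough illustration of the idea only (the divisor, patience and tol values below are assumptions, not the PR's), holding the step size constant while the cost improves and shrinking it when progress stalls is what lets you start with an aggressive learning rate without diverging:

# Illustrative sketch, not the PR's code: keep the learning rate constant while
# the per-epoch cost keeps improving, divide it once progress stalls.
# All constants here (divisor=5, patience=2, tol=1e-4) are assumptions.
def adaptive_constant_lr(costs, lr_init=1.0, divisor=5.0, patience=2, tol=1e-4):
    """Return the learning rate to use after observing the per-epoch costs."""
    lr = lr_init
    stalled = 0
    for prev, curr in zip(costs, costs[1:]):
        if prev - curr < tol:        # no significant improvement this epoch
            stalled += 1
            if stalled >= patience:  # progress stalled: shrink the step size
                lr /= divisor
                stalled = 0
        else:
            stalled = 0
    return lr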

@amueller (Owner) commented

Nice :) I am just experimenting with that, lol.

This one gives a 1.5% error rate but takes 1000s :-/

mlp = MultilayerPerceptronClassifier(hidden_layer_sizes=(800, 800), algorithm="sgd",
                                     random_state=42, verbose=10, max_iter=30,
                                     alpha=0, momentum=.9, learning_rate_init=1)

@amueller (Owner) commented

Does yours actually stop because of convergence? 157s seems very short for 400 iterations. I guess you tweaked the tol?

@ogrisel (Author) commented on Dec 18, 2014

Does yours actually stop because of convergence? 157s seems very short for 400 iterations. I guess you tweaked the tol?

Yes, tol is set to 1e-4 to keep the benchmark fast enough.

Actually, I am thinking that this convergence check would be better done on a validation set instead of the training set. That would get us early stopping for free, which is what people are actually interested in in practice. WDYT?
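
A minimal sketch of such a check, assuming per-epoch validation scores are recorded (the tol and patience values are illustrative, not the PR's):

# Illustrative stopping rule: stop once the last `patience` validation scores
# have failed to beat the earlier best by at least `tol`.
def should_stop(val_scores, tol=1e-4, patience=2):
    if len(val_scores) <= patience:
        return False
    best_before = max(val_scores[:-patience])
    return all(score < best_before + tol for score in val_scores[-patience:])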

@ogrisel (Author) commented on Dec 18, 2014

this one gives 1.5% but takes 1000s :-/

I am sure we could get better results with (2000, 2000) and dropout. Adding dropout is not easy with the current code base though. That might be a code smell.

@IssamLaradji commented
Nice changes.

I am thinking that this convergence check could better be done on a validation set instead of the training set

Are you suggesting we divide (X, y) into a training and a validation set when running .fit(X, y)? Isn't that risky, in that the algorithm will train on fewer training samples?

@ogrisel (Author) commented on Dec 18, 2014

Isn't that risky, in that the algorithm will train on fewer training samples?

Yes, but overfitting by training too much (especially without dropout) is an even bigger risk. I would reserve 10% of the (X, y) data as a validation set by default, and make it possible for the user to pass a custom validation set:

mlp.fit(X, y, X_validation=X_validation, y_validation=y_validation)

Also, if verbose >= 2, I would compute and report the .score() value on the training and validation sets in addition to the cost.
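
A sketch of that default split as a standalone helper (the function name is hypothetical, and the train_test_split import path shown is the modern one; the 2014 code base had it in sklearn.cross_validation):

from sklearn.model_selection import train_test_split

def split_validation(X, y, X_validation=None, y_validation=None, random_state=None):
    """Reserve 10% of (X, y) as a validation set unless one is passed explicitly."""
    if X_validation is None:
        X, X_validation, y, y_validation = train_test_split(
            X, y, test_size=0.1, random_state=random_state)
    return X, y, X_validation, y_validation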

@amueller (Owner) commented

I think we shouldn't do early stopping in the current PR. We should really rather merge this really fast and then iterate on it.

@ogrisel (Author) commented on Dec 18, 2014

We should really rather merge this really fast and then iterate on it.

+1 for merging my PR into your PR into @IssamLaradji's PR at least so that we can all iterate and experiment with the same code base.

I would really like to have a way to monitor the progress of the train score + validation score vs. epochs before we merge into master, both for L-BFGS and SGD. I don't think you can use and tune the hyperparameters of an MLP in practice without plotting those curves to get some intuition on what is going wrong. Doing brute-force randomized parameter search is too expensive.

Maybe the validation API is not good. Maybe we should introduce a monitor callback instead.
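
For reference, GradientBoosting exposes this as a monitor callable passed to fit, invoked after each stage as monitor(i, self, locals()), with a True return value requesting early stopping. A sketch of a callback with that signature that records the curves discussed above (the factory name and recording logic are illustrative, not an existing API of this branch):

def make_score_monitor(X_train, y_train, X_val, y_val):
    """Build a monitor(i, estimator, local_vars) callback that records scores."""
    history = {"train": [], "validation": []}

    def monitor(i, estimator, local_vars):
        history["train"].append(estimator.score(X_train, y_train))
        history["validation"].append(estimator.score(X_val, y_val))
        return False  # returning True would request early stopping

    return monitor, history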

@amueller (Owner) commented

You can monitor the verbose output, but I agree that is suboptimal.
A callback actually seems like a good idea. Do we want to do it the same way that GradientBoosting does it? I would rather not have a fit parameter.

@ogrisel (Author) commented on Dec 18, 2014

You can monitor the verbose output, but I agree that is suboptimal.

This is what I do, but you have no way to detect how much the network is overfitting and how that evolves with the number of epochs.

Yes, we should be consistent with GradientBoosting, as that makes it possible to address our use case.

@amueller (Owner) commented

For monitoring a validation set I used partial_fit ;)
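
One way to do that, sketched with placeholder names (the helper is hypothetical; it assumes the estimator exposes partial_fit with the usual classes argument on the first call, as the comment above suggests):

import numpy as np

def train_and_monitor(mlp, X_train, y_train, X_val, y_val, n_epochs=30):
    """Train one epoch at a time with partial_fit and record both scores."""
    classes = np.unique(y_train)
    curves = {"train": [], "validation": []}
    for epoch in range(n_epochs):
        if epoch == 0:
            mlp.partial_fit(X_train, y_train, classes=classes)
        else:
            mlp.partial_fit(X_train, y_train)
        curves["train"].append(mlp.score(X_train, y_train))
        curves["validation"].append(mlp.score(X_val, y_val))
    return curves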

amueller added a commit that referenced this pull request on Dec 18, 2014:
ENH better 'constant' learning rate schedule
@amueller merged commit a096d4a into amueller:mlp_refactoring on Dec 18, 2014