[MRG] Used stratified splits for early stopping in GBDT and MLP #13164
Merged: jnothman merged 10 commits into scikit-learn:master from NicolasHug:stratify_early_stopping on Mar 26, 2019
Commits
36eb7b9 Used stratified splits for early stopping in GBDT and MLP (NicolasHug)
966c6bc Added PR number in whatsnew (NicolasHug)
6caae0f fix ident issue (NicolasHug)
8037436 Don't stratify multilabel MLP (NicolasHug)
a5bade0 updated whatsnew (NicolasHug)
5a083f4 Merge branch 'master' into stratify_early_stopping (NicolasHug)
61d0255 should fix test (NicolasHug)
5a97311 Updated whatnew according to comments (NicolasHug)
a698ade Update early_stopping (or n_iter_no_change) doc to mention stratifica… (NicolasHug)
9e7304d removed double import (NicolasHug)
Review comments
Nitpick: This test is rather weak, since it relies on the fact that this error message is only raised by StratifiedShuffleSplit. I suggest:
I'm not sure about this: the test still has a (very low) probability of not failing just by chance, even if the split isn't stratified.
Also, it wouldn't work for MLP.
But I agree mine isn't great either... no strong opinion.
1 / 3**100 ≈ 2e-48, so I think the probability is low enough.
Why wouldn't it work for MLP?
Because MLP wouldn't raise an error whether the splits are stratified or not. The error is only raised by GBDTs because they use an init estimator (and that's only once #12983 is merged).
It's not just about raising an error when training on only one class.
It's about training the GBDT on C classes and training the init estimator on C - X classes, where X is the number of classes that aren't present after the (non-stratified) split.
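A quick sketch of the failure mode described above (toy data, purely illustrative, not from the PR): with a rare class, a plain shuffled split can drop that class from one side entirely, while `stratify=y` keeps every class on both sides of the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): class 2 is very rare.
y = np.array([0] * 10 + [1] * 10 + [2] * 2)
X = np.zeros((len(y), 1))

# Stratified split: each class keeps roughly its original proportion,
# so the rare class is guaranteed to land on both sides.
_, _, y_train, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

assert set(y_train) == {0, 1, 2}
assert set(y_val) == {0, 1, 2}
```

Without `stratify=y`, there is no such guarantee: depending on the shuffle, the training portion can end up with only a subset of the classes, which is exactly the C vs. C - X situation above.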
Hmm, wait: I thought train_test_split would raise an error if it couldn't populate y_train with at least one sample belonging to each class, but that's not the case. So this PR doesn't fix my original issue...
(I still think splits should be stratified, though.)
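Indeed, a quick check (toy example, not from the PR) confirms that a non-stratified train_test_split silently returns a y_train that is missing a class. Using `shuffle=False` here just makes the outcome deterministic, since the tail of the arrays becomes the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy example: the single class-1 sample sits at the end, and with
# shuffle=False the last 25% of samples become the test set.
y = np.array([0, 0, 0, 1])
X = np.zeros((4, 1))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False)

# No error is raised even though class 1 never reaches y_train.
assert set(y_train) == {0}
assert set(y_test) == {1}
```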
What would you think about a test that makes sure the GBDT and MLP can predict both classes on a very imbalanced dataset?
I'd design the test to fail on master but pass on this branch, so it would be a kind of non-regression test.
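A rough sketch of what such a test could look like for the GBDT side (names, sizes, and thresholds are illustrative, not the final test that was merged; it assumes the internal validation split is stratified so both classes reach the training portion):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Heavily imbalanced binary toy data (illustrative only).
rng = np.random.RandomState(42)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

gb = GradientBoostingClassifier(
    n_estimators=50,
    n_iter_no_change=5,          # enables early stopping via a held-out split
    validation_fraction=0.2,
    random_state=0,
)
# With a stratified internal split, fitting succeeds and both classes
# are seen during training.
gb.fit(X, y)

assert gb.n_estimators_ >= 1
```

The MLPClassifier variant would follow the same shape with `early_stopping=True`; as noted below, it only becomes a true non-regression test once MLP also errors out on a single-class training split.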
I agree splits should be stratified, both for binary and multiclass problems, to preserve imbalanced class distributions. In the extreme, a non-stratified split could even lead to training on only a subset of the classes, which can be problematic.
My suggested test is indeed a non-regression test, checking that both estimators use stratified splits on an over-simplistic 2-class toy problem. It relies on the fact that a (non-stratified) split that happened to put only one class in the training set would lead to an error. The error is already raised in GradientBoostingClassifier but not in MLPClassifier, which I consider a bug.
Well, until this "bug" is fixed, this test is not a non-regression test.
Also, even with such a small probability, it still feels weird to me and I'm not sure it's good practice. But I'm happy to hear what others have to say.