[MRG] Used stratified splits for early stopping in GBDT and MLP #13164
LGTM
def test_early_stopping_stratified():
    # Make sure data splitting for early stopping is stratified
Nitpick: This test is rather weak, since it relies on the fact that this error message is only raised by StratifiedShuffleSplit.
I suggest:
X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 0, 1, 1]
gbc = GradientBoostingClassifier(validation_fraction=0.5, n_iter_no_change=5)
# some non-stratified random splits would select only one class in y,
# leading to an error
for _ in range(100):
    gbc.fit(X, y)
I'm not sure about this, this test still has a (very low) probability to not fail just by chance even if the split isn't stratified.
Also it wouldn't work for MLP
But I agree mine isn't great either... no strong opinion
1 / 3**100 = 1e-48, I think the probability is low enough.
Why wouldn't it work for MLP ?
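For reference, the chance that the 100-fit loop above would miss a non-stratified split can be checked by enumeration. This is a minimal sketch, assuming each fit draws a uniformly random, independent split of the four samples into two training and two validation samples, and that an error occurs exactly when the training half contains a single class. Under those assumptions the figure comes out nearer 1e-18 than 1e-48, which still supports the "low enough" conclusion.

```python
from itertools import combinations

# Labels from the suggested toy problem.
y = [0, 0, 1, 1]
n_train = 2  # assumption: validation_fraction=0.5 leaves 2 of 4 samples for training

# Every possible choice of training indices under a plain (non-stratified) split.
train_sets = list(combinations(range(len(y)), n_train))

# Training sets that contain only one class would trigger the error.
single_class = [idx for idx in train_sets if len({y[i] for i in idx}) == 1]

p_error_per_fit = len(single_class) / len(train_sets)

# Probability that 100 independent non-stratified fits all avoid the error.
p_never_fails = (1 - p_error_per_fit) ** 100

print(len(train_sets), len(single_class), p_never_fails)
```

Two of the six possible training sets are single-class, so each fit would fail with probability 1/3 and the loop would pass by chance with probability (2/3)**100.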
Because MLP wouldn't raise an error whether splits are stratified or not. The error is only raised by GBDTs because they use an init estimator (and that's only once #12983 is merged)
It's not just about raising an error when training on only one class.
It's about training the GBDT on C classes, and training the init estimator on C - X classes, where X is the number of classes that aren't present after the (non-stratified) split
Hmm wait, I thought train_test_split would raise an error if it couldn't populate y_train with at least one sample belonging to each class, but that's not the case
So this PR doesn't fix my original issue...
(I still think splits should be stratified though)
What would you think about a test that makes sure the GBDT and MLP can predict both classes on a very imbalanced dataset?
I'd design the test to fail on master but pass on this branch, so that'd be some kind of non-regression test.
I agree splits should be stratified, both for binary and multiclass problems, to preserve imbalanced class distributions. In the extreme, a non-stratified split could even lead to training only on a subset of classes, which can be problematic.
My suggested test is indeed a non-regression test, checking that both estimators use stratified splits on an over-simplistic 2-class toy problem. It relies on the fact that a (non-stratified) split which happened to put only 1 class in the training set would lead to an error. The error is already raised in GradientBoostingClassifier but not in MLPClassifier, which I consider a bug.
Well until this "bug" is fixed this test is not a non-regression test.
Also, even with such a small probability, it still feels weird to me and I'm not sure that's good practice. But I'm happy to hear what others have to say
Make sure we have tests for multilabel with early stopping ... Multilabel might break when you try to stratify
Thanks @jnothman, I disabled stratification in MLP in the multilabel case. Is that the correct strategy? (GBDTs don't support multilabel classification)
I disabled stratification in MLP in the multilabel case. Is that the correct strategy?
I think it's the best we can do, at least for now.
That merge failed. Please fix.
I think you should also update the text for validation_fraction in all classifiers. Otherwise LGTM
doc/whats_new/v0.21.rst
@@ -183,6 +183,10 @@ Support for Python 3.4 and below has been officially dropped.
  the gradients would be incorrectly computed in multiclass classification
  problems. :issue:`12715` by :user:`Nicolas Hug<NicolasHug>`.

- |Fix| Early stopping is now checked on a stratified split for
This is unclear. Perhaps "Fixed a bug where validation sets for early stopping in XxClassifier were not sampled with stratification." ??
Thanks @NicolasHug
Reference Issues/PRs
What does this implement/fix? Explain your changes.
As discussed in #12983 (comment), most of the estimators (Perceptron, PassiveAggressive, SGDClassifier) use stratified splits when checking for early stopping.
As far as I can tell only MLP and GBDTs do not. This PR makes them use stratified split.
This is essential in particular for GBDTs, which use an init estimator trained after a call to train_test_split (for early stopping): if the split is not stratified, this init estimator might be trained on a subset of the data with missing classes, causing a bug.
Any other comments?
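The difference between the two kinds of split can be seen directly with train_test_split. The following is a rough sketch on the 4-sample toy data from the review thread, not the code changed by this PR; the helper name count_single_class_train and the choice of 200 seeds are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data borrowed from the review discussion: two samples per class.
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([0, 0, 1, 1])


def count_single_class_train(stratified, n_trials=200):
    """Count splits whose training half contains only one class.

    Hypothetical helper for illustration only.
    """
    count = 0
    for seed in range(n_trials):
        _, _, y_train, _ = train_test_split(
            X, y,
            test_size=0.5,
            random_state=seed,
            stratify=y if stratified else None,
        )
        if len(np.unique(y_train)) < 2:
            count += 1
    return count


plain = count_single_class_train(stratified=False)
strat = count_single_class_train(stratified=True)
print("single-class training sets without stratify:", plain)
print("single-class training sets with stratify:", strat)
```

With stratify=y, each class contributes one sample to each half, so the training set always sees both classes. Without it, roughly a third of the splits are expected to leave a class out of the training set, which is the situation described above for the init estimator.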