8000 [MRG] Used stratified splits for early stopping in GBDT and MLP by NicolasHug · Pull Request #13164 · scikit-learn/scikit-learn · GitHub

Merged: 10 commits merged into scikit-learn:master on Mar 26, 2019

Conversation

@NicolasHug (Member) commented Feb 14, 2019

Reference Issues/PRs

What does this implement/fix? Explain your changes.

As discussed in #12983 (comment), most estimators (Perceptron, PassiveAggressive, SGDClassifier) already use stratified splits for the early-stopping check.

As far as I can tell, only MLP and the GBDTs do not. This PR makes them use stratified splits.

This is essential in particular for the GBDTs, which fit an init estimator after a call to train_test_split (for early stopping): if the split is not stratified, the init estimator might be trained on a subset of the data with missing classes, causing a bug.
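A minimal sketch of the failure mode (toy data, not part of the PR): with an imbalanced dataset, a plain shuffle split can leave the rare class out of the training fold, while `stratify=y` keeps both classes in each fold.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: class 1 is rare (2 of 20 samples).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 18 + [1] * 2)

# A plain shuffle split may put both rare samples into the
# validation fold, so training would never see class 1.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5)

# With stratify=y the class ratio is preserved in both folds,
# so each fold gets exactly one of the two rare samples.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
assert set(y_tr) == {0, 1} and set(y_val) == {0, 1}
```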

Any other comments?

@TomDLT (Member) left a comment

LGTM



def test_early_stopping_stratified():
# Make sure data splitting for early stopping is stratified
@TomDLT (Member) commented Feb 14, 2019

Nitpick: This test is rather weak, since it relies on the fact that this error message is only raised by StratifiedShuffleSplit.

I suggest:

from sklearn.ensemble import GradientBoostingClassifier

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 0, 1, 1]
gbc = GradientBoostingClassifier(validation_fraction=0.5, n_iter_no_change=5)
# some non-stratified random splits would select only one class in y,
# leading to an error
for _ in range(100):
    gbc.fit(X, y)

@NicolasHug (Member, Author)

I'm not sure about this; the test still has a (very low) probability of not failing just by chance, even if the split isn't stratified.

Also it wouldn't work for MLP

@NicolasHug (Member, Author)

But I agree mine isn't great either... no strong opinion

@TomDLT (Member)

Each fit has a 1/3 chance of drawing a single-class split, so the chance all 100 fits pass by accident is (2/3)**100 ≈ 2.5e-18. I think the probability is low enough.
Why wouldn't it work for MLP ?

@NicolasHug (Member, Author)

Because MLP wouldn't raise an error whether splits are stratified or not. The error is only raised by GBDTs because they use an init estimator (and that's only once #12983 is merged)

@NicolasHug (Member, Author)

It's not just about raising an error when training on only one class.

It's about training the GBDT on C classes while the init estimator is trained on C - X classes, where X is the number of classes absent after the (non-stratified) split.

@NicolasHug (Member, Author)

Hmm, wait: I thought train_test_split would raise an error if it couldn't populate y_train with at least one sample belonging to each class, but that's not the case.

So this PR doesn't fix my original issue...

(I still think splits should be stratified though)
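To double-check this observation, a small hypothetical sketch (not from the PR): train_test_split raises no error even when the resulting y_train contains a single class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(8).reshape(4, 2)
y = np.array([0, 0, 1, 1])

# shuffle=False makes the split deterministic: the first half goes
# to train and the second half to test, so y_tr is single-class.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, shuffle=False)

# No error is raised even though y_tr = [0, 0] misses class 1.
assert set(y_tr) == {0}
```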

@NicolasHug (Member, Author)

What would you think about a test that makes sure the GBDT and MLP can predict both classes on a very imbalanced dataset?

I'd design the test to fail on master but pass on this branch, so that'd be some kind of non-regression test.

@TomDLT (Member)

I agree splits should be stratified, for both binary and multiclass problems, to preserve imbalanced class distributions. In the extreme, a non-stratified split could even lead to training on only a subset of the classes, which is problematic.

My suggested test is indeed a non-regression test, checking that both estimators use stratified splits on an over-simplistic 2-class toy problem. It relies on the fact that a (non-stratified) split that happened to take only one class into the training set would lead to an error. The error is already raised in GradientBoostingClassifier but not in MLPClassifier, which I consider a bug.

@NicolasHug (Member, Author)

Well, until this "bug" is fixed, this test is not a non-regression test.

Also, even with such a small probability, it still feels weird to me and I'm not sure it's good practice. But I'm happy to hear what others have to say.

@jnothman (Member) left a comment

Make sure we have tests for multilabel with early stopping ... Multilabel might break when you try to stratify

@NicolasHug (Member, Author)

Thanks @jnothman , I disabled stratification in MLP in the multilabel case. Is that the correct strategy?

(GBDTs don't support multilabel classification)
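A sketch of why stratification is disabled for multilabel targets (toy data, hypothetical): stratifying on a 2-D y groups samples by their label combination, and when combinations are too rare the stratified split raises a ValueError instead of splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(8).reshape(4, 2)
# Multilabel target: every label combination occurs only once.
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

# Stratifying on a 2-D target groups rows by label combination;
# singleton combinations make a stratified split impossible.
try:
    train_test_split(X, Y, test_size=0.5, stratify=Y)
except ValueError as e:
    print("stratified multilabel split failed:", e)
```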

@jnothman (Member) commented Feb 17, 2019 via email

@jnothman (Member)

That merge failed. Please fix.

@jnothman (Member) left a comment

I think you should also update the text for validation_fraction in all classifiers. Otherwise LGTM

@@ -183,6 +183,10 @@ Support for Python 3.4 and below has been officially dropped.
the gradients would be incorrectly computed in multiclass classification
problems. :issue:`12715` by :user:`Nicolas Hug<NicolasHug>`.

- |Fix| Early stopping is now checked on a stratified split for
Member

This is unclear. Perhaps "Fixed a bug where validation sets for early stopping in XxClassifier were not sampled with stratification." ??

@jnothman (Member) left a comment

Thanks @NicolasHug

@jnothman jnothman merged commit 4eb85b4 into scikit-learn:master Mar 26, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
3 participants