[MRG] Used stratified splits for early stopping in GBDT and MLP by NicolasHug · Pull Request #13164 · scikit-learn/scikit-learn
Merged: 10 commits, Mar 26, 2019
13 changes: 11 additions & 2 deletions doc/whats_new/v0.21.rst
@@ -26,8 +26,8 @@ random sampling procedures.
`max_leaf_nodes` are set. |Fix|
- :class:`linear_model.LogisticRegression` and
:class:`linear_model.LogisticRegressionCV` with 'saga' solver. |Fix|
- :class:`ensemble.GradientBoostingClassifier` for multiclass
classification. |Fix|
- :class:`ensemble.GradientBoostingClassifier` |Fix|
- :class:`neural_network.MLPClassifier` |Fix|
- :func:`svm.SVC.decision_function` and
:func:`multiclass.OneVsOneClassifier.decision_function`. |Fix|

@@ -183,6 +183,10 @@ Support for Python 3.4 and below has been officially dropped.
the gradients would be incorrectly computed in multiclass classification
problems. :issue:`12715` by :user:`Nicolas Hug<NicolasHug>`.

- |Fix| Fixed a bug in :class:`ensemble.GradientBoostingClassifier` where
validation sets for early stopping were not sampled with stratification.
:issue:`13164` by :user:`Nicolas Hug<NicolasHug>`.

- |Fix| Fixed a bug in :class:`ensemble.GradientBoostingClassifier` where
the default initial prediction of a multiclass classifier would predict the
classes priors instead of the log of the priors. :issue:`12983` by
@@ -422,6 +426,11 @@ Support for Python 3.4 and below has been officially dropped.
:class:`neural_network.MLPRegressor` where the option :code:`shuffle=False`
was being ignored. :issue:`12582` by :user:`Sam Waterbury <samwaterbury>`.

- |Fix| Fixed a bug in :class:`neural_network.MLPClassifier` where
validation sets for early stopping were not sampled with stratification. In
the multilabel case, however, splits are still not stratified.
:issue:`13164` by :user:`Nicolas Hug<NicolasHug>`.

:mod:`sklearn.pipeline`
.......................

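Both entries describe the same underlying change. As a minimal sketch (illustrative, not part of the diff) of what they refer to: with an imbalanced target, a plain shuffle split can distort the class ratio in the held-out validation set, while stratify=y preserves it.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((100, 2))
y = np.array([0] * 10 + [1] * 90)  # 10% / 90% class imbalance

# Plain shuffle split: validation class counts fluctuate with the seed.
_, _, _, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
print(np.bincount(y_val, minlength=2))

# Stratified split: validation keeps the 10% / 90% ratio exactly.
_, _, _, y_val = train_test_split(X, y, test_size=0.2, stratify=y,
                                  random_state=0)
print(np.bincount(y_val, minlength=2))  # [ 2 18]
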
6 changes: 4 additions & 2 deletions sklearn/ensemble/gradient_boosting.py
@@ -1447,10 +1447,12 @@ def fit(self, X, y, sample_weight=None, monitor=None):
y = self._validate_y(y, sample_weight)

if self.n_iter_no_change is not None:
stratify = y if is_classifier(self) else None
X, X_val, y, y_val, sample_weight, sample_weight_val = (
train_test_split(X, y, sample_weight,
random_state=self.random_state,
test_size=self.validation_fraction))
test_size=self.validation_fraction,
stratify=stratify))
if is_classifier(self):
if self.n_classes_ != np.unique(y).shape[0]:
# We choose to error here. The problem is that the init
@@ -1933,7 +1935,7 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
number, it will set aside ``validation_fraction`` size of the training
data as validation and terminate training when validation score is not
improving in all of the previous ``n_iter_no_change`` numbers of
iterations.
iterations. The split is stratified.

.. versionadded:: 0.20

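The core of the hunk above, restated as a hypothetical standalone helper (early_stopping_split is illustrative only, not a scikit-learn function): classifiers get a stratified validation split, regressors keep a plain shuffle split.

import numpy as np
from sklearn.base import is_classifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def early_stopping_split(est, X, y, validation_fraction, random_state=None):
    # Mirror of the fit() change: stratify only when est is a classifier.
    stratify = y if is_classifier(est) else None
    return train_test_split(X, y, random_state=random_state,
                            test_size=validation_fraction, stratify=stratify)

X = np.zeros((100, 3))
y = np.array([0] * 10 + [1] * 90)
X_train, X_val, y_train, y_val = early_stopping_split(
    GradientBoostingClassifier(), X, y, validation_fraction=0.1,
    random_state=0)
print(np.bincount(y_val))  # [1 9]: the 10% / 90% ratio is preserved
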
31 changes: 22 additions & 9 deletions sklearn/ensemble/tests/test_gradient_boosting.py
@@ -1265,8 +1265,8 @@ def test_gradient_boosting_early_stopping():
X_train, X_test, y_train, y_test = train_test_split(X, y,
random_state=42)
# Check if early_stopping works as expected
for est, tol, early_stop_n_estimators in ((gbc, 1e-1, 24), (gbr, 1e-1, 13),
(gbc, 1e-3, 36),
for est, tol, early_stop_n_estimators in ((gbc, 1e-1, 28), (gbr, 1e-1, 13),
(gbc, 1e-3, 70),
(gbr, 1e-3, 28)):
est.set_params(tol=tol)
est.fit(X_train, y_train)
@@ -1321,6 +1321,18 @@ def test_gradient_boosting_validation_fraction():
assert gbc.n_estimators_ < gbc3.n_estimators_


def test_early_stopping_stratified():
# Make sure data splitting for early stopping is stratified
TomDLT (Member) commented on Feb 14, 2019:

Nitpick: This test is rather weak, since it relies on the fact that this error message is only raised by StratifiedShuffleSplit.

I suggest:

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 0, 1, 1]
gbc = GradientBoostingClassifier(validation_fraction=0.5, n_iter_no_change=5)
# some non-stratified random splits would select only one class in y,
# leading to an error
for _ in range(100):
    gbc.fit(X, y)

NicolasHug (Member, Author) replied:

I'm not sure about this: the test would still have a (very low) probability of not failing just by chance, even if the split weren't stratified.

Also, it wouldn't work for MLP.

NicolasHug (Member, Author) added:

But I agree mine isn't great either... no strong opinion.

TomDLT (Member) replied:

1 / 3**100 = 1e-48; I think the probability is low enough.
Why wouldn't it work for MLP?
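
(A quick enumeration of those odds, as an editorial sketch; it assumes each of the C(4, 2) = 6 possible training subsets is equally likely under a non-stratified split.)

from itertools import combinations

y = [0, 0, 1, 1]
subsets = list(combinations(range(len(y)), 2))     # candidate train indices
single_class = [s for s in subsets if len({y[i] for i in s}) == 1]
p_error = len(single_class) / len(subsets)         # 2 / 6 = 1/3 per fit
print(p_error, (1 - p_error) ** 100)               # ~0.333, ~2.5e-18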

NicolasHug (Member, Author) replied:

Because MLP wouldn't raise an error whether splits are stratified or not. The error is only raised by GBDTs because they use an init estimator (and that's only once #12983 is merged).

NicolasHug (Member, Author) added:

It's not just about raising an error when training on only one class.

It's about training the GBDT on C classes while training the init estimator on C - X classes, where X is the number of classes that aren't present after the (non-stratified) split.

NicolasHug (Member, Author) added:

Hmm, wait: I thought train_test_split would raise an error if it couldn't populate y_train with at least one sample belonging to each class, but that's not the case.

So this PR doesn't fix my original issue...

(I still think splits should be stratified, though.)

NicolasHug (Member, Author) added:

What would you think about a test that makes sure the GBDT and MLP can predict both classes on a very imbalanced dataset?

I'd design the test to fail on master but pass on this branch, so it would act as a kind of non-regression test.

TomDLT (Member) replied:

I agree splits should be stratified, both for binary and multiclass problems, to preserve imbalanced class distributions. In the extreme, a non-stratified split could even lead to training on only a subset of the classes, which is problematic.

My suggested test is indeed a non-regression test, checking that both estimators use stratified splits on an over-simplistic 2-class toy problem. It relies on the fact that a (non-stratified) split that happened to take only one class into the training set would lead to an error. The error is already raised in GradientBoostingClassifier but not in MLPClassifier, which I consider a bug.

NicolasHug (Member, Author) replied:

Well, until this "bug" is fixed, this test is not a non-regression test.

Also, even with such a small probability, it still feels weird to me and I'm not sure it's good practice. But I'm happy to hear what others have to say.

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 0, 0, 1]

gbc = GradientBoostingClassifier(n_iter_no_change=5)
with pytest.raises(
ValueError,
match='The least populated class in y has only 1 member'):
gbc.fit(X, y)
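
The match string above is raised by the stratified splitter itself, so the test passes exactly because the early-stopping split is now stratified. A sketch triggering the same error directly (using train_test_split's default 0.25 test size):

from sklearn.model_selection import train_test_split

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 0, 0, 1]
try:
    train_test_split(X, y, stratify=y)
except ValueError as exc:
    print(exc)  # "The least populated class in y has only 1 member, ..."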


class _NoSampleWeightWrapper(BaseEstimator):
def __init__(self, est):
self.est = est
@@ -1381,19 +1393,20 @@ def test_gradient_boosting_init_wrong_methods(estimator, missing_method):


def test_early_stopping_n_classes():
# when doing early stopping (_, y_train, _, _ = train_test_split(X, y))
# when doing early stopping (_, _, y_train, _ = train_test_split(X, y))
# there might be classes in y that are missing in y_train. As the init
# estimator will be trained on y_train, we need to raise an error if this
# happens.

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 1, 1, 1]
gb = GradientBoostingClassifier(n_iter_no_change=5, random_state=4)
X = [[1]] * 10
y = [0, 0] + [1] * 8  # only 2 negative-class samples out of 10
gb = GradientBoostingClassifier(n_iter_no_change=5, random_state=0,
validation_fraction=8)
with pytest.raises(
ValueError,
match='The training data after the early stopping split'):
gb.fit(X, y)

# No error with another random seed
gb = GradientBoostingClassifier(n_iter_no_change=5, random_state=0)
gb.fit(X, y)
# No error if we let the training data be big enough
gb = GradientBoostingClassifier(n_iter_no_change=5, random_state=0,
                                validation_fraction=4)
gb.fit(X, y)
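
Note that validation_fraction=8 works here as an absolute sample count: the value is forwarded to train_test_split's test_size (see the fit() hunk above), which accepts both an int (number of samples) and a float (proportion). A sketch of the two forms (illustrative only):

from sklearn.model_selection import train_test_split

X = [[1]] * 10
y = [0, 0] + [1] * 8

# int: hold out exactly 8 samples, leaving 2 for training
_, _, y_train, y_val = train_test_split(X, y, test_size=8, stratify=y,
                                        random_state=0)
print(len(y_train), len(y_val))  # 2 8

# float: hold out 40% of the samples (4 of 10)
_, _, y_train, y_val = train_test_split(X, y, test_size=0.4, stratify=y,
                                        random_state=0)
print(len(y_train), len(y_val))  # 6 4
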
8 changes: 4 additions & 4 deletions sklearn/linear_model/passive_aggressive.py
@@ -37,8 +37,8 @@ class PassiveAggressiveClassifier(BaseSGDClassifier):
early_stopping : bool, default=False
Whether to use early stopping to terminate training when validation
score is not improving. If set to True, it will automatically set aside
a fraction of training data as validation and terminate training when
validation score is not improving by at least tol for
a stratified fraction of training data as validation and terminate
training when validation score is not improving by at least tol for
n_iter_no_change consecutive epochs.

.. versionadded:: 0.20
@@ -282,8 +282,8 @@ class PassiveAggressiveRegressor(BaseSGDRegressor):
early_stopping : bool, default=False
Whether to use early stopping to terminate training when validation
score is not improving. If set to True, it will automatically set aside
a fraction of training data as validation and terminate training when
validation score is not improving by at least tol for
a fraction of training data as validation and terminate
training when validation score is not improving by at least tol for
n_iter_no_change consecutive epochs.

.. versionadded:: 0.20
4 changes: 2 additions & 2 deletions sklearn/linear_model/perceptron.py
@@ -62,8 +62,8 @@ class Perceptron(BaseSGDClassifier):
early_stopping : bool, default=False
Whether to use early stopping to terminate training when validation
score is not improving. If set to True, it will automatically set aside
a fraction of training data as validation and terminate training when
validation score is not improving by at least tol for
a stratified fraction of training data as validation and terminate
training when validation score is not improving by at least tol for
n_iter_no_change consecutive epochs.

.. versionadded:: 0.20
8 changes: 4 additions & 4 deletions sklearn/linear_model/stochastic_gradient.py
@@ -828,8 +828,8 @@ class SGDClassifier(BaseSGDClassifier):
early_stopping : bool, default=False
Whether to use early stopping to terminate training when validation
score is not improving. If set to True, it will automatically set aside
a fraction of training data as validation and terminate training when
validation score is not improving by at least tol for
a stratified fraction of training data as validation and terminate
training when validation score is not improving by at least tol for
n_iter_no_change consecutive epochs.

.. versionadded:: 0.20
@@ -1433,8 +1433,8 @@ class SGDRegressor(BaseSGDRegressor):
early_stopping : bool, default=False
Whether to use early stopping to terminate training when validation
score is not improving. If set to True, it will automatically set aside
a fraction of training data as validation and terminate training when
validation score is not improving by at least tol for
a fraction of training data as validation and terminate
training when validation score is not improving by at least tol for
n_iter_no_change consecutive epochs.

.. versionadded:: 0.20
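The same wording change is applied across the SGD family (PassiveAggressiveClassifier, Perceptron, SGDClassifier), while the regressor docstrings are only rewrapped, since a regression target cannot be stratified. A usage sketch (illustrative) of what the new wording documents for a classifier:

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = np.r_[rng.randn(20, 2) - 2, rng.randn(180, 2) + 2]
y = np.array([0] * 20 + [1] * 180)  # 10% minority class

clf = SGDClassifier(early_stopping=True, validation_fraction=0.2,
                    n_iter_no_change=5, random_state=0)
clf.fit(X, y)
print(clf.n_iter_)  # training stopped once the validation score plateaued
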
9 changes: 7 additions & 2 deletions sklearn/neural_network/multilayer_perceptron.py
@@ -484,9 +484,13 @@ def _fit_stochastic(self, X, y, activations, deltas, coef_grads,
# early_stopping in partial_fit doesn't make sense
early_stopping = self.early_stopping and not incremental
if early_stopping:
# don't stratify in multilabel classification
should_stratify = is_classifier(self) and self.n_outputs_ == 1
stratify = y if should_stratify else None
X, X_val, y, y_val = train_test_split(
X, y, random_state=self._random_state,
test_size=self.validation_fraction)
test_size=self.validation_fraction,
stratify=stratify)
if is_classifier(self):
y_val = self._label_binarizer.inverse_transform(y_val)
else:
@@ -803,7 +807,8 @@ class MLPClassifier(BaseMultilayerPerceptron, ClassifierMixin):
score is not improving. If set to True, it will automatically set
aside 10% of training data as validation and terminate training when
validation score is not improving by at least tol for
``n_iter_no_change`` consecutive epochs.
``n_iter_no_change`` consecutive epochs. The split is stratified,
except in a multilabel setting.
Only effective when solver='sgd' or 'adam'

validation_fraction : float, optional, default 0.1
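The condition added in _fit_stochastic, restated outside the class (illustrative; y.ndim == 1 stands in for is_classifier(self) and self.n_outputs_ == 1 in the real code): single-output classification targets are stratified, while a 2-D multilabel indicator is deliberately left unstratified.

import numpy as np
from sklearn.model_selection import train_test_split

y_single = np.array([0, 0, 1, 1, 1, 1])           # binary / multiclass
y_multi = np.array([[0, 1], [1, 0], [1, 1]] * 2)  # multilabel indicator

for y in (y_single, y_multi):
    should_stratify = y.ndim == 1
    stratify = y if should_stratify else None
    X = np.zeros((len(y), 2))
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.5, stratify=stratify, random_state=0)
    print(should_stratify, len(y_val))
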
17 changes: 17 additions & 0 deletions sklearn/neural_network/tests/test_mlp.py
@@ -308,6 +308,11 @@ def test_multilabel_classification():
mlp.partial_fit(X, y, classes=[0, 1, 2, 3, 4])
assert_greater(mlp.score(X, y), 0.9)

# Make sure early stopping still works now that splitting is stratified by
# default (it is disabled for multilabel classification)
mlp = MLPClassifier(early_stopping=True)
mlp.fit(X, y).predict(X)


@pytest.mark.filterwarnings('ignore: The default value of multioutput') # 0.23
def test_multioutput_regression():
@@ -663,3 +668,15 @@ def test_n_iter_no_change_inf():

# validate _update_no_improvement_count() was always triggered
assert_equal(clf._no_improvement_count, clf.n_iter_ - 1)


def test_early_stopping_stratified():
# Make sure data splitting for early stopping is stratified
X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 0, 0, 1]

mlp = MLPClassifier(early_stopping=True)
with pytest.raises(
ValueError,
match='The least populated class in y has only 1 member'):
mlp.fit(X, y)
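
Finally, the kind of non-regression check floated in the review thread, sketched here for MLPClassifier (illustrative only; the separable clusters are made up for the example): after a stratified early-stopping split, the model still sees, and can predict, the minority class.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X = np.r_[rng.randn(30, 2) - 3, rng.randn(270, 2) + 3]
y = np.array([0] * 30 + [1] * 270)  # 10% minority class

mlp = MLPClassifier(early_stopping=True, validation_fraction=0.2,
                    random_state=0, max_iter=500)
mlp.fit(X, y)
assert set(mlp.predict(X)) == {0, 1}  # both classes predicted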