ENH allow to pass splitter for early stopping validation in HGBT by lorentzenchr · Pull Request #28440 · scikit-learn/scikit-learn · GitHub

ENH allow to pass splitter for early stopping validation in HGBT #28440


Closed

Conversation

lorentzenchr
Member

Reference Issues/PRs

Partially solves #18748. Alternative to #27124.

What does this implement/fix? Explain your changes.

This PR allows passing a splitter to the validation_fraction parameter of HistGradientBoostingClassifier and HistGradientBoostingRegressor.
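
A minimal sketch of the intended usage under this PR (the splitter-as-validation_fraction behavior is this PR's proposal and not part of a released scikit-learn; dataset and parameters are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# With this PR, validation_fraction would also accept a splitter object; the
# internal early-stopping validation set is then produced by that splitter
# instead of a plain fraction-based split.
clf = HistGradientBoostingClassifier(
    early_stopping=True,
    validation_fraction=ShuffleSplit(test_size=0.2, random_state=0),
).fit(X, y)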

Any other comments?


✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 930e4ab. Link to the linter CI: here

@adrinjalali
Member

The issue I have with passing a splitter is that there is no way to avoid leakage if there's any preprocessing involved, which, in the presence of outliers, is a real issue.

@lorentzenchr
Member Author

@adrinjalali see #18748 (comment). We were ok with some possibility of data leakage. It's a tradeoff: either X_val, y_val go through the same preprocessing as X_train, y_train (often desirable), or there is less data leakage (which then depends entirely on how a user creates X_val, y_val).

The issue I have with passing a splitter is that there is no way to avoid leakage if there's any preprocessing involved, which, in the presence of outliers, is a real issue.

Could you give an example?

@adrinjalali
Member

I don't think it's okay for us to be "fine" with "a bit of data leakage" at all.

Here was an example I wrote to show the effect on linear models: #26359 (comment)

We should be fixing Pipeline instead of accepting a somewhat data-leaky solution.

Also, with the previous PR there would have been a way for people to do the right thing without data leakage, maybe by implementing their own Pipeline (which would be easy), but with this one, we're somewhat removing that possibility.

@lorentzenchr
Member Author

TLDR: I don't think information leakage is a real problem here.

IIUC, you prefer the GridSearchCV version in #26359 (comment). For the estimator considered here that would mean:

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


X, y = make_regression(n_samples=200, n_features=500, n_informative=5, random_state=0)
X[:2] = X[:2] + 20

# Linear transformations of the features, like StandardScaler, have no effect on
# tree-based models, but for the sake of an example:
est_gs = GridSearchCV(
    make_pipeline(
        StandardScaler(),
        HistGradientBoostingRegressor(
            early_stopping=True,
            validation_fraction=ShuffleSplit(test_size=0.3, random_state=123),
        ),
    ),
    param_grid={"histgradientboostingregressor__max_depth": list(range(1, 6))},
    cv=5,
    cv=5,
).fit(X, y)

I don't see any leakage here. The advantage is that, inside HGBT, the ShuffleSplit for early stopping validation gets its X from the pipeline, and the pipeline gets it from GridSearchCV. So all pieces only see and process what they should.

Once metadata routing is activated for HGBT, we could then also pass a "splitting feature" like a customer ID to the splitter.
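
As a rough sketch of that idea (hypothetical: routing a groups argument into the inner splitter is not implemented, and GroupShuffleSplit plus the customer_id array are only for illustration):

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.normal(size=200)
customer_id = rng.integers(0, 20, size=200)  # hypothetical grouping column

# Hypothetical: with metadata routing enabled for HGBT, the groups could be
# requested by the inner splitter so that the early-stopping validation rows
# never share a customer with the training rows.
est = HistGradientBoostingRegressor(
    early_stopping=True,
    validation_fraction=GroupShuffleSplit(test_size=0.3, random_state=0),
)
# est.fit(X, y, groups=customer_id)  # not supported today; shown only as the intent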

@adrinjalali
Member

Linear transformations of the features, like StandardScaler, have no effect on tree-based models, but for the sake of an example:

Not all transformations are going to be linear though.

In that example, the data used for validation is transformed with statistics calculated from the validation set itself, therefore there is data leakage. The tree model might not care about this particular transformation, but early stopping is not just for tree-based models. This is an approximation of what I think we can do:

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=500, n_informative=5, random_state=0)
X[:2] = X[:2] + 20

# Validation set chosen before looking at the data.
X_val, y_val = X[:50], y[:50]
X, y = X[50:], y[50:]

est_gs = GridSearchCV(
    Pipeline(
        [
            ("standardscaler", StandardScaler()),
            (
                "histgradientboostingregressor",
                HistGradientBoostingRegressor(
                    early_stopping=True,
                    validation_fraction=ShuffleSplit(test_size=0.3, random_state=123),
                ).set_fit_request(X_val=True, y_val=True),
            ),
        ],
        # telling pipeline to transform these inputs up to the step which is
        # requesting them.
        transform_input=["X_val", "y_val"],
    ),
    param_grid={"histgradientboostingregressor__max_depth": list(range(1, 6))},
    cv=5,
).fit(X, y, X_val=X_val, y_val=y_val)
# this passes X_val, y_val to Pipeline, and Pipeline knows how to deal with
# them.

@lorentzenchr
Member Author
# telling pipeline to transform these inputs up to the step which is
# requesting them.
transform_input=["X_val", "y_val"],

@adrinjalali This is better proposed in a new issue and mentioned in #18748. (It's a missing piece of our pipeline indeed.)

@betatim
Member
betatim commented Feb 19, 2024

I think a relevant question here is: what is the result of the data leakage?

Off the top of my head I don't know what the effect of data leakage in the dataset used for early stopping is. Typically, data leakage leads to a performance estimate that is too optimistic (for example, if you re-use some of your training data to estimate your performance on unseen data). So I guess data leakage here would lead to early stopping happening too early, maybe? It isn't clear to me that data leakage changes the shape of the validation curve; I have no trouble believing that it shifts it up or down, but I don't know whether the shift should change as a function of the iteration. A constant shift wouldn't change the shape.

If the shape isn't changed, then early stopping will still happen at the correct iteration, assuming the stopping is based on some kind of "we have seen no improvement for a few iterations" decision. If the shape is changed, then we will possibly stop at the "wrong" iteration, though I wonder how big this effect is compared to the uncertainty from using a finite-sized validation set.

The important thing is that people should not use the value of the metric from the validation set as a way to estimate the performance on unseen data. I think that is something you shouldn't do anyway?

Does someone know what data leakage does to the shape of the curve of "early stopping metric vs iteration"?
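
A minimal sketch of how to look at that curve (the per-iteration validation score is stored in the fitted estimator's validation_score_ attribute; dataset and parameters here are arbitrary):

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=1_000, n_features=20, noise=5.0, random_state=0)

est = HistGradientBoostingRegressor(
    early_stopping=True,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=0,
).fit(X, y)

# validation_score_ holds the score on the internal validation set at each
# iteration; early stopping triggers once it stops improving for
# n_iter_no_change consecutive iterations.
print(est.n_iter_, est.validation_score_[-5:])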

@amueller
Member
amueller commented Mar 8, 2024

Typically, data leakage leads to a performance estimate that is too optimistic (for example, if you re-use some of your training data to estimate your performance on unseen data). So I guess data leakage here would lead to early stopping happening too early, maybe? It isn't clear to me that data leakage changes the shape of the validation curve; I have no trouble believing that it shifts it up or down, but I don't know whether the shift should change as a function of the iteration. A constant shift wouldn't change the shape.

Imagine doing feature selection in a pipeline and then doing early stopping. If your feature selection sees the validation set, it will make the problem much easier, so a much simpler model suffices, which would be underfit for the actual setting.

The feature selection case is kind of a worst-case scenario, but in that case the learning problem can be arbitrarily different from the original learning problem.
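
A rough way to probe that scenario (a sketch under assumptions: SelectKBest stands in for the leaky feature selection, staged_predict is used to recover the per-iteration validation curve, and the data and sizes are arbitrary):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error

X, y = make_regression(
    n_samples=500, n_features=200, n_informative=5, noise=10.0, random_state=0
)
X_train, y_train = X[:350], y[:350]
X_val, y_val = X[350:], y[350:]


def early_stopping_curve(selector_sees_val):
    # Fit the feature selector either on the training rows only ("clean") or on
    # all rows including the validation ones ("leaky"), then record the
    # validation MSE after each boosting iteration.
    if selector_sees_val:
        selector = SelectKBest(f_regression, k=10).fit(X, y)
    else:
        selector = SelectKBest(f_regression, k=10).fit(X_train, y_train)
    est = HistGradientBoostingRegressor(
        max_iter=200, early_stopping=False, random_state=0
    ).fit(selector.transform(X_train), y_train)
    return np.array(
        [
            mean_squared_error(y_val, pred)
            for pred in est.staged_predict(selector.transform(X_val))
        ]
    )


curve_clean = early_stopping_curve(selector_sees_val=False)
curve_leaky = early_stopping_curve(selector_sees_val=True)
# The iteration where each curve bottoms out is roughly where early stopping
# would kick in; with a leaky selector the curve, and hence the chosen
# iteration, can differ from the clean setting.
print("best iteration (clean):", curve_clean.argmin() + 1)
print("best iteration (leaky):", curve_leaky.argmin() + 1)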

@amueller
Member

@adrinjalali your solution actually overcomes some of the issues that @thomasjpfan and I had seen in doing this. @thomasjpfan what do you think?

@amueller
Member

@adrinjalali you probably want to remove validation_fraction from the suggested solution, I think.

@ogrisel
Member
ogrisel commented Mar 25, 2024

@adrinjalali we discussed this issue at today's dev meeting. Would you be interested in pushing through with a draft PR of the approach you suggested in #28440 (comment)?

@adrinjalali
Member

@ogrisel will do!

@adrinjalali
Member

With #28901 merged, we can move forward with validation set(s).

@lorentzenchr
Member Author

Now, we favor #27124 again and we can close this PR, right?

@lorentzenchr lorentzenchr deleted the hgbt_validation_splitter branch March 9, 2025 11:49