ENH allow to pass splitter for early stopping validation in HGBT by lorentzenchr · Pull Request #28440 · scikit-learn/scikit-learn · GitHub

ENH allow to pass splitter for early stopping validation in HGBT #28440


Closed

Conversation

lorentzenchr
Member

Reference Issues/PRs

Partially solves #18748. Alternative to #27124.

What does this implement/fix? Explain your changes.

This PR allows passing a splitter to the validation_fraction parameter of HistGradientBoostingClassifier and HistGradientBoostingRegressor.
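
A minimal sketch of the intended usage under this PR (the splitter-as-validation_fraction behavior is this PR's proposal and not part of a released scikit-learn; dataset and parameters are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# With this PR, validation_fraction would also accept a splitter object; the
# internal early-stopping validation set is then produced by that splitter
# instead of a plain fraction-based split.
clf = HistGradientBoostingClassifier(
    early_stopping=True,
    validation_fraction=ShuffleSplit(test_size=0.2, random_state=0),
).fit(X, y)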

Any other comments?


✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 930e4ab. Link to the linter CI: here

@adrinjalali
Member

The issue I have with passing a splitter is that there is no way to avoid leakage if there's any preprocessing involved, which, in the presence of outliers, is a real issue.

@lorentzenchr
Member Author

@adrinjalali see #18748 (comment). We were ok with some possibility of data leakage. It's a tradeoff: either X_val, y_val go through the same preprocessing as X_train, y_train (often desirable), or there is less data leakage (which then depends entirely on how a user creates X_val, y_val).

The issue I have with passing a splitter is that there is no way to avoid leakage if there's any preprocessing involved, which, in the presence of outliers, is a real issue.

Could you give an example?

@adrinjalali
Member

I don't think it's okay for us to be "fine" with "a bit of data leakage" at all.

Here was an example I wrote to show the effect on linear models: #26359 (comment)

We should be fixing Pipeline instead of accepting a somewhat data-leaky solution.

Also, with the previous PR there would have been a way for people to do the right thing without data leakage, maybe by implementing their own Pipeline (which would be easy), but with this one, we're somewhat removing that possibility.

@lorentzenchr
Member Author

TLDR: I don't think information leakage is a real problem here.

IIUC, you prefer the GridSearchCV version in #26359 (comment). For the estimator considered here that would mean:

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


X, y = make_regression(n_samples=200, n_features=500, n_informative=5, random_state=0)
X[:2] = X[:2] + 20

# Linear transformations of the features, like StandardScaler, have no effect on
# tree-based models, but for the sake of an example:
est_gs = GridSearchCV(
    make_pipeline(
        StandardScaler(),
        HistGradientBoostingRegressor(
            early_stopping=True,
            validation_fraction=ShuffleSplit(test_size=0.3, random_state=123),
        ),
    ),
    param_grid={"histgradientboostingregressor__max_depth": list(range(1, 6))},
    cv=5,
    cv=5,
).fit(X, y)

I don't see any leakage here. The advantage is that, inside HGBT, the ShuffleSplit for early stopping validation gets its X from the pipeline, and the pipeline gets it from GridSearchCV. So all pieces only see and process what they should.

Once metadata routing is activated for HGBT, we could then also pass a "splitting feature" like a customer ID to the splitter.
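
As a rough sketch of that idea (hypothetical: routing a groups argument into the inner splitter is not implemented, and GroupShuffleSplit plus the customer_id array are only for illustration):

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.normal(size=200)
customer_id = rng.integers(0, 20, size=200)  # hypothetical grouping column

# Hypothetical: with metadata routing enabled for HGBT, the groups could be
# requested by the inner splitter so that the early-stopping validation rows
# never share a customer with the training rows.
est = HistGradientBoostingRegressor(
    early_stopping=True,
    validation_fraction=GroupShuffleSplit(test_size=0.3, random_state=0),
)
# est.fit(X, y, groups=customer_id)  # not supported today; shown only as the intent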

@adrinjalali
Member

Linear transformations of the features, like StandardScaler, have no effect on tree-based models, but for the sake of an example:

Not all transformations are going to be linear though.

In that example, the data used for validation is transformed with statistics calculated from the validation set itself, therefore there is data leakage. The tree model might not care about this particular transformation, but early stopping is not just for tree-based models. This is an approximation of what I think we can do:

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=500, n_informative=5, random_state=0)
X[:2] = X[:2] + 20

# Validation set chosen before looking at the data.
X_val, y_val = X[:50], y[:50]
X, y = X[50:], y[50:]

est_gs = GridSearchCV(
    Pipeline(
        [
            ("standardscaler", StandardScaler()),
            (
                "histgradientboostingregressor",
                HistGradientBoostingRegressor(
                    early_stopping=True,
                    validation_fraction=ShuffleSplit(test_size=0.3, random_state=123),
                ).set_fit_request(X_val=True, y_val=True),
            ),
        ],
        # telling pipeline to transform these inputs up to the step which is
        # requesting them.
        transform_input=["X_val", "y_val"],
    ),
    param_grid={"histgradientboostingregressor__max_depth": list(range(1, 6))},
    cv=5,
).fit(X, y, X_val=X_val, y_val=y_val)
# this passes X_val, y_val to Pipeline, and Pipeline knows how to deal with
# them.

@lorentzenchr
Member Author
# telling pipeline to transform these inputs up to the step which is
# requesting them.
transform_input=["X_val", "y_val"],

@adrinjalali This is better proposed in a new issue and mentioned in #18748. (It's a missing piece of our pipeline indeed.)

@betatim
Member
betatim commented Feb 19, 2024

I think a relevant question here is: what is the result of the data leakage?

Off the top of my head I don't know what the effect of data leakage in the dataset used for early stopping is. Typically, data leakage leads to a performance estimate that is too optimistic (for example, if you re-use some of your training data to estimate your performance on unseen data). So I guess data leakage here would lead to early stopping happening too early, maybe? It isn't clear to me that data leakage changes the shape of the validation curve; I have no trouble believing that it shifts it up or down, but I don't know whether the shift should change as a function of the iteration. A constant shift wouldn't change the shape.

If the shape isn't changed, then early stopping will still happen at the correct iteration, assuming the stopping is based on some kind of "we have seen no improvement for a few iterations" decision. If the shape is changed, then we will possibly stop at the "wrong" iteration, though I wonder how big this effect is compared to the uncertainty from using a finite-sized validation set.

The important thing is that people should not use the value of the metric from the validation set as a way to estimate the performance on unseen data. I think that is something you shouldn't do anyway?

Does someone know what data leakage does to the shape of the curve of "early stopping metric vs iteration"?
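
A minimal sketch of how to look at that curve (the per-iteration validation score is stored in the fitted estimator's validation_score_ attribute; dataset and parameters here are arbitrary):

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=1_000, n_features=20, noise=5.0, random_state=0)

est = HistGradientBoostingRegressor(
    early_stopping=True,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=0,
).fit(X, y)

# validation_score_ holds the score on the internal validation set at each
# iteration; early stopping triggers once it stops improving for
# n_iter_no_change consecutive iterations.
print(est.n_iter_, est.validation_score_[-5:])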

@amueller
Member
amueller commented Mar 8, 2024

Typically, data leakage leads to a performance estimate that is too optimistic (for example, if you re-use some of your training data to estimate your performance on unseen data). So I guess data leakage here would lead to early stopping happening too early, maybe? It isn't clear to me that data leakage changes the shape of the validation curve; I have no trouble believing that it shifts it up or down, but I don't know whether the shift should change as a function of the iteration. A constant shift wouldn't change the shape.

Imagine doing feature selection in a pipeline and then doing early stopping. If your feature selection sees the validation set, it will make the problem much easier, so a much simpler model suffices, which would be underfit for the actual setting.

The feature selection case is kind of a worst-case scenario, but in that case the learning problem can be arbitrarily different from the original learning problem.
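
A rough way to probe that scenario (a sketch under assumptions: SelectKBest stands in for the leaky feature selection, staged_predict is used to recover the per-iteration validation curve, and the data and sizes are arbitrary):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error

X, y = make_regression(
    n_samples=500, n_features=200, n_informative=5, noise=10.0, random_state=0
)
X_train, y_train = X[:350], y[:350]
X_val, y_val = X[350:], y[350:]


def early_stopping_curve(selector_sees_val):
    # Fit the feature selector either on the training rows only ("clean") or on
    # all rows including the validation ones ("leaky"), then record the
    # validation MSE after each boosting iteration.
    if selector_sees_val:
        selector = SelectKBest(f_regression, k=10).fit(X, y)
    else:
        selector = SelectKBest(f_regression, k=10).fit(X_train, y_train)
    est = HistGradientBoostingRegressor(
        max_iter=200, early_stopping=False, random_state=0
    ).fit(selector.transform(X_train), y_train)
    return np.array(
        [
            mean_squared_error(y_val, pred)
            for pred in est.staged_predict(selector.transform(X_val))
        ]
    )


curve_clean = early_stopping_curve(selector_sees_val=False)
curve_leaky = early_stopping_curve(selector_sees_val=True)
# The iteration where each curve bottoms out is roughly where early stopping
# would kick in; with a leaky selector the curve, and hence the chosen
# iteration, can differ from the clean setting.
print("best iteration (clean):", curve_clean.argmin() + 1)
print("best iteration (leaky):", curve_leaky.argmin() + 1)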

@amueller
Member

@adrinjalali your solution actually overcomes some of the issues that @thomasjpfan and I had seen in doing this. @thomasjpfan what do you think?

@amueller
Member

@adrinjalali you probably want to remove validation_fraction from the suggested solution, I think.

@ogrisel
Member
ogrisel commented Mar 25, 2024

@adrinjalali we discussed this issue at today's dev meeting. Would you be interested in pushing through with a draft PR of the approach you suggested in #28440 (comment)?

@adrinjalali
Member

@ogrisel will do!

@adrinjalali
Member

With #28901 merged, we can move forward with validation set(s).

@lorentzenchr
Member Author

Now, we favor #27124 again and we can close this PR, right?

@lorentzenchr lorentzenchr deleted the hgbt_validation_splitter branch March 9, 2025 11:49