ENH allow to pass splitter for early stopping validation in HGBT #28440
Conversation
The issue I have with passing a splitter is that there is no way to avoid leakage if there's any preprocessing involved, which, in the case of existing outliers, is a real issue. |
@adrinjalali see #18748 (comment). We were OK with some possibility of data leakage; it's a tradeoff. |
Could you give an example? |
I don't think it's okay for us to be "fine" with "a bit of data leakage" at all. Here is an example I wrote to show the effect on linear models: #26359 (comment). We should be fixing this. Also, with the previous PR, there would have been a way for people to do the right thing w/o data leakage, maybe by implementing their own splitter. |
TLDR: I don't think information leakage is a real problem here. IIUC, you prefer the GridSearchCV version in #26359 (comment). For the estimator considered here, that would mean:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=500, n_informative=5, random_state=0)
X[:2, :] = X[:2, :] + 20

# Linear transformations of features, like StandardScaler, have no effect on
# tree-based models, but for the sake of an example:
est_gs = GridSearchCV(
    make_pipeline(
        StandardScaler(),
        HistGradientBoostingRegressor(
            early_stopping=True,
            validation_fraction=ShuffleSplit(test_size=0.3, random_state=123),
        ),
    ),
    param_grid={"histgradientboostingregressor__max_depth": list(range(1, 6))},
    cv=5,
).fit(X, y)
```

I don't see any leakage here. The advantage is that inside HGBT, the ShuffleSplit for early stopping validation gets the X from the pipeline, and the pipeline gets it from GridSearchCV. So all pieces only see and process what they should. Once metadata routing is activated for HGBT, we could then also pass a "splitting feature" like a customer ID to the splitter. |
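To illustrate that last point, a hypothetical sketch (not working code today: it assumes both this PR's splitter-valued `validation_fraction` and metadata routing of `groups` into that splitter, which is not implemented):

```python
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GroupShuffleSplit

# Group-aware early-stopping validation: rows sharing a customer ID would
# never be split between the training side and the validation side.
est = HistGradientBoostingRegressor(
    early_stopping=True,
    validation_fraction=GroupShuffleSplit(test_size=0.3, random_state=0),
)
# Hypothetical: `groups` (e.g. one customer ID per row) would have to be
# routed to the splitter used for the early-stopping split.
# est.fit(X, y, groups=customer_id)
```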
Not all transformations are going to be linear though. In that example, the data used for validation is transformed with statistics calculated from the validation set itself, therefore there is data leakage. The tree model might not care about this particular transformation, but early stopping is not just for tree-based models. This is an approximation of what I think we can do:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=500, n_informative=5, random_state=0)
X[:2, :] = X[:2, :] + 20

# Validation set chosen before looking at the data.
X_val, y_val = X[:50], y[:50]
X, y = X[50:], y[50:]

est_gs = GridSearchCV(
    Pipeline(
        [
            ("standardscaler", StandardScaler()),
            (
                "histgradientboostingregressor",
                HistGradientBoostingRegressor(
                    early_stopping=True,
                    validation_fraction=ShuffleSplit(test_size=0.3, random_state=123),
                ).set_fit_request(X_val=True, y_val=True),
            ),
        ],
        # telling the pipeline to transform these inputs up to the step which
        # is requesting them
        transform_input=["X_val", "y_val"],
    ),
    param_grid={"histgradientboostingregressor__max_depth": list(range(1, 6))},
    cv=5,
).fit(X, y, X_val=X_val, y_val=y_val)
# this passes X_val, y_val to the Pipeline, and the Pipeline knows how to
# deal with them
```
|
@adrinjalali This is better proposed in a new issue and mentioned in #18748. (It's a missing piece of our pipeline indeed.) |
I think a relevant question here is: what is the result of the data leakage? Off the top of my head I don't know what the effect of data leakage in the dataset used for early stopping is. Typically data leakage leads to a performance estimate that is too optimistic (for example if you re-use some of your training data to estimate your performance on unseen data). So I guess data leakage here will lead to the early stop happening too early, maybe?

It isn't clear to me that data leakage changes the shape of the validation curve. I have no trouble believing that it shifts it up/down, but I don't know if the shift should change as a function of the iteration; a constant shift wouldn't change the shape. If the shape isn't changed, then early stopping will still happen at the correct iteration, assuming the stopping is based on some kind of "we have seen no improvement for a few iterations" decision. If the shape is changed, then we will stop at possibly the "wrong" iteration, though I wonder how big this effect is compared to the uncertainty from using a finite-sized validation set.

The important thing is that people should not use the value of the metric from the validation set as a way to estimate performance on unseen data. I think that is something you shouldn't do anyway. Does someone know what data leakage does to the shape of the curve of "early stopping metric vs iteration"? |
Imagine doing feature selection in a pipeline and then doing early stopping. If your feature selection sees the validation set, it will make the problem much easier, so a much simpler model suffices, which would be underfit for the actual setting. The feature selection case is kind of a worst-case scenario, but in that case the learning problem can be arbitrarily different from the original learning problem. |
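A minimal sketch of this worst case, using only today's float-valued `validation_fraction` (the dataset, the SelectKBest preprocessing, and all sizes are illustrative assumptions, not from this discussion): the selector has already seen the rows that the estimator will internally hold out for early stopping, so the internal validation score tends to look better than the score on truly unseen rows; exact numbers will vary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(
    n_samples=400, n_features=1000, n_informative=2, noise=20, random_state=0
)

# Keep half of the rows as truly unseen data.
X_fit, X_unseen, y_fit, y_unseen = train_test_split(X, y, test_size=0.5, random_state=0)

# Leaky preprocessing: the selector is fit on all of X_fit, i.e. it has
# already seen the 30% of rows that the estimator will hold out internally
# for early-stopping validation.
selector = SelectKBest(f_regression, k=10).fit(X_fit, y_fit)

est = HistGradientBoostingRegressor(
    early_stopping=True,
    validation_fraction=0.3,
    scoring="r2",
    random_state=0,
).fit(selector.transform(X_fit), y_fit)

# The internal validation R^2 is computed on rows the selector has seen and
# therefore tends to be optimistic compared to truly unseen data.
print("best internal validation R^2:", est.validation_score_.max())
print("R^2 on truly unseen data    :", est.score(selector.transform(X_unseen), y_unseen))
```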
@adrinjalali your solution actually overcomes some of the issues that @thomasjpfan and I had seen in doing this. @thomasjpfan what do you think? |
@adrinjalali you probably want to remove |
@adrinjalali we discussed this issue at today's dev meeting. Would you be interested in pushing through with a draft PR of the approach you suggested in #28440 (comment)? |
@ogrisel will do! |
With #28901 merged, we can move forward with validation set(s). |
Now, we favor #27124 again and we can close this PR, right? |
Reference Issues/PRs
Partially solves #18748. Alternative to #27124.

What does this implement/fix? Explain your changes.
This PR allows passing splitters to the parameter `validation_fraction` of `HistGradientBoostingClassifier` and `HistGradientBoostingRegressor`.

Any other comments?
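For illustration, a minimal usage sketch of the proposed API (it assumes this PR's branch, where `validation_fraction` also accepts a CV splitter; in released scikit-learn versions it only accepts a float or None, so this will not run there; the dataset is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# With this PR, the early-stopping validation split is defined by the splitter
# instead of a plain fraction of the training data.
clf = HistGradientBoostingClassifier(
    early_stopping=True,
    validation_fraction=ShuffleSplit(test_size=0.2, random_state=0),
).fit(X, y)

print(clf.n_iter_)  # number of boosting iterations after early stopping
```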