Support early_stopping with custom validation_set #18748
Comments
Currently, the validation split is implemented as a mask array. We would need to:
Do you want to give it a try? |
Makes sense, forgot about this consideration. |
The issue I have with passing ... In many of the discussions I see here and elsewhere, people don't seem to have that requirement? |
Scikit-learn's API is centered around easy cross-validation, but I guess a lot of people use their own cross-validation tools, for better or worse. |
+1, would find this very useful across the different models that support early-stopping validation. TomDLT's idea of using indices seems like it could address adrinjalali's concern about preprocessing, since the validation data would still be split off from the training data once it gets to the model. That, or providing group labels and supporting group-aware splitting internally.

I have data where points from each group are highly correlated. I'm using group splitters to validate outside of training, but the training data is randomly split for validation inside the estimator, which ignores the groups (see the sketch below for the current workaround).

Of course, I understand that this has a lot of consequences outside of just these models, since sklearn has a consistent API across estimators. |
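To make the workaround above concrete, here is a minimal sketch using only existing scikit-learn API: the validation set is built manually with a group-aware splitter, but the estimator still re-splits the training data internally. The data here is synthetic and only illustrative.

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = rng.randint(0, 2, size=200)
groups = rng.randint(0, 20, size=200)  # 20 groups of correlated points

# Build a group-aware validation set manually.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(X, y, groups=groups))
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

# Today the estimator can only re-split X_train at random via validation_fraction,
# ignoring the groups; there is no way to hand (X_val, y_val) to it for early stopping.
clf = SGDClassifier(early_stopping=True, validation_fraction=0.1).fit(X_train, y_train)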
+1, would find this feature very useful. In my work we have a hand-crafted train and validation set, and ideally we want to stop training once performance on the validation set becomes worse for some number of iterations. |
I guess we could think of an API where the pipeline's fit accepts the validation set and routes it to the steps. |
Could we start a PR and work on this collectively? I still need to understand how the sklearn API works in general, but would love to learn more and contribute! |
Given the complexity of the issue, I would start with proposing an API here first, after learning about scikit-learn's API. Resolving this issue involves adding a new API to estimators that have early stopping, and requires thinking about how meta-estimators like Pipeline interact with this new API. Side note: skorch.net.NeuralNet has a |
Would we allow the same validation for transformers as well? Imagine an encoder at the first step of the pipeline, which would have its own early stopping criteria. Is the validation set used for that step the same as the validation set used for the last step of the pipeline? |
If we want to be simple, then yes. The transformer would get the non-transformed version for validation and the final step would get the transformed version. If we try to place the validation set into fit |
Note that auto-splitting from the training set inside the final classifier/regressor is problematic when this estimator is wrapped in a rebalancing meta-estimator to tackle target imbalance problems: rebalancing should happen only on training data, while early stopping, model selection and evaluation should only use metrics computed on data with the original class balance. I am not sure an auto-magical API would work for this. Making it possible to pass a manually prepared validation set might be the sanest way to deal with this situation (see the sketch below). |
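A minimal sketch of the ordering problem described above, assuming imbalanced-learn's RandomOverSampler is available as the rebalancing step: the validation set has to be carved out before rebalancing so that any early-stopping metric is computed on data with the original class balance. Today, early_stopping=True still re-splits the rebalanced data internally.

from imblearn.over_sampling import RandomOverSampler  # third-party, for illustration only
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.95, 0.05], random_state=0)

# Carve out the validation set *before* rebalancing.
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

# Today the internal validation split is taken from X_res, i.e. from rebalanced data;
# passing (X_val, y_val) explicitly is exactly what this issue asks for.
clf = HistGradientBoostingClassifier(early_stopping=True).fit(X_res, y_res)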
I made a similar point in #15127 (comment). |
Does this issue benefit from SLEP006 metadata routing? If yes, maybe an example code would be enough? |
I think if we keep the validation set fixed, then yes. |
That'll work. Is it then a parameter to ... LightGBM has |
So with metadata routing this would work:

X, y = load_iris(return_X_y=True)
X, X_eval, y, y_eval = train_test_split(X, y)
preprocessing = Pipeline(...)  # preprocessing steps
X = preprocessing.fit_transform(X, y)
X_eval = preprocessing.transform(X_eval)
gs = GridSearchCV(
    HistGradientBoostingClassifier().set_fit_request(X_eval=True, y_eval=True),
    ...
)
gs.fit(X, y, X_eval=X_eval, y_eval=y_eval)

But you can't put any preprocessing in a pipeline and pass that to the grid search, since in the grid search the preprocessing steps are re-fit with new parameters if we're tuning on them, and then the already-transformed X_eval no longer matches them. I think that's already a good step, but we'd also need to modify Pipeline to handle things which are supposed to be transformed before being fed to the next steps. Does that make sense? |
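To make that limitation concrete, here is a hedged sketch of the usage one would like to write but which is not supported today; the X_eval/y_eval fit parameters and their routing through a pipeline are hypothetical, and X, y, X_eval, y_eval are assumed to be prepared as above.

# Hypothetical, NOT supported today: shown only to illustrate the missing piece.
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", HistGradientBoostingClassifier().set_fit_request(X_eval=True, y_eval=True)),
])
gs = GridSearchCV(pipe, param_grid={"clf__max_depth": [3, None]})
# Problem: "scale" is re-fitted for every candidate, but nothing re-transforms
# X_eval before it reaches "clf", so it would arrive in the wrong feature space.
gs.fit(X, y, X_eval=X_eval, y_eval=y_eval)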
@adrinjalali Thanks. Yes, that makes sense. If nobody else is working on it or intends to do so, I'll soon open a PR for HGBT. |
The consensus of a longer discussion at the drafting meeting of 19.01.2024 was to go with passing splitter objects (option 2) as parameters to the estimator (e.g. HistGradientBoosting*); a sketch below illustrates what that could look like. |
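Purely to illustrate "option 2", a hedged sketch of what passing a splitter object might look like; the validation_split parameter and the groups routing shown here are hypothetical and do not exist in scikit-learn today.

# Hypothetical "option 2" API sketch; validation_split is not a real parameter.
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GroupShuffleSplit

# X, y, groups: your data
cv = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
clf = HistGradientBoostingClassifier(early_stopping=True, validation_split=cv)
# The estimator would call cv.split(X, y, groups) internally to carve out the
# validation set used for early stopping.
clf.fit(X, y, groups=groups)  # groups routing here is hypothetical too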
To be more precise regarding the above chart, callbacks are not an alternative way to provide the validation set. In order to have interesting callbacks (early stopping, monitoring, ...), the validation set must be accessible when the callback is called. However, the solution we chose for a unified API to provide a validation set is kind of orthogonal to the callbacks API. Then, once this question is solved, an early stopping callback will imo be a good solution to have a consistent API for early stopping across all estimators, instead of each estimator implementing its own version of early stopping (see the sketch below). |
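As a rough illustration of that last point, a shared early-stopping callback could look something like the following; the interface (on_iteration_end, how the estimator hands over the validation set) is entirely made up for this sketch and is not an existing scikit-learn API.

# Purely illustrative callback sketch; no such API exists in scikit-learn today.
class EarlyStoppingCallback:
    def __init__(self, patience=10, tol=1e-4):
        self.patience = patience
        self.tol = tol
        self.best_score = None
        self.n_bad_iters = 0

    def on_iteration_end(self, estimator, X_val, y_val):
        """Return True to tell the estimator to stop training."""
        score = estimator.score(X_val, y_val)
        if self.best_score is None or score > self.best_score + self.tol:
            self.best_score, self.n_bad_iters = score, 0
        else:
            self.n_bad_iters += 1
        return self.n_bad_iters >= self.patience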
I disagree with any of the "works" above. If you have a pipeline that consists of feature selection and an estimator, the feature selection will include the validation data, leading to potentially very misleading results. |
This is proposed here #28440 (comment)
Yet it's how it's been done in scikit-learn until now (HGBT, SGD, ...). But I agree that it does not prevent us from finding a better solution :)
Let's discuss it in the next one on March 22nd. I'll send a mail on the mailing list to make others aware. |
@amueller Are we talking about the same problem? In my understanding, the issue here is early stopping in the final estimator of a pipeline, not ES in a preprocessing step before that, nor an unbiased estimate of the out-of-sample performance à la cross validation. Sure, the preprocessing has an effect on the final estimator, but ES should just avoid overfitting or spare resources of the final estimator. As @betatim stated in #28440 (comment), as long as the validation curve is only y-shifted, everything is fine. Also sure, a mechanism to tell a pipeline to pass validation data through it is a missing piece in our API (even after metadata routing) and would solve this issue in the methodologically soundest way (or not sound at all, since we just put the burden on the user). It was proposed by @adrinjalali in #28440 (comment) and, IMHO, should be proposed in its own issue or even in a SLEP. |
The issue around feature selection in a pipeline is described here: #28440 (comment). Concretely, something like this:

pipe = make_pipeline(
    SequentialFeatureSelector(...),
    HistGradientBoostingClassifier(early_stopping=True),
) |
@lorentzenchr Sorry, I was busy and missed the March 22nd draft meeting (which is also somewhat inconvenient at 7am for me, but I would have made it if I'd known this was the issue being discussed).
It's been done that way in single estimators, and was impossible to do in pipelines. I.e. the issue we're talking about was absent because doing the wrong thing was impossible. Doing the right thing was also impossible, though. |
Cross-linking to a PR that stemmed from the discussion in #28440: |
Describe the workflow you want to enable
Today in SGDClassifier, the parameter early_stopping holds out a random fraction of the training data as a validation set; it would be useful to support a custom validation set chosen by the user.
Describe your proposed solution
for example:
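(The concrete example from the original issue text is not reproduced here; as a stand-in, the following is only a hypothetical sketch of the kind of API being requested. The validation_set fit parameter does not exist in scikit-learn.)

# Hypothetical illustration only; SGDClassifier.fit has no validation_set parameter.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = SGDClassifier(early_stopping=True)
clf.fit(X_train, y_train, validation_set=(X_val, y_val))  # hypothetical parameter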
EDIT
Broader Scope
The same applies to GradientBoosting* and HistGradientBoosting*.