GBDT support custom validation set · Issue #15127 · scikit-learn/scikit-learn · GitHub

Closed
qinhanmin2014 opened this issue Oct 4, 2019 · 8 comments

@qinhanmin2014
Member

Is it reasonable to support a custom validation set in GBDT? Currently we do train_test_split internally, but I think sometimes users want to do the split themselves (e.g., sometimes we want to use the first 80% of the dataset as the training set and the rest as the validation set).
xgboost, lightgbm and catboost all support custom validation sets.

@vachanda
Contributor

Hi @qinhanmin2014, just to clarify: are you talking about the BaseHistGradientBoosting class, where the validation split takes place?

P.S. I want to work on this and just needed the clarification.

@qinhanmin2014
Member Author

> Hi @qinhanmin2014, just to clarify: are you talking about the BaseHistGradientBoosting class, where the validation split takes place?

Yes, something like xgb/lgb/ctb. Be careful: there's only a +1 from me (we need +2 before making a decision).

@NicolasHug
Member

This could be used for any estimator that implements early stopping, not just the GBDTs.

What would the API look like? New arguments to fit? How does that work out in a pipeline? Or in general for the meta estimators?

Clearly that's not a trivial decision / design ;)

@candalfigomoro

@NicolasHug @TomDLT
The current validation set approach of scikit-learn tempts people to produce bad models. Suppose you perform minority-class oversampling, data augmentation, or some other preprocessing step. Ideally, the validation set should reflect the distribution of the test set and of live data. If you split the validation set from an already-preprocessed training set, it inherits the training set's biased distribution. Splitting a validation set from the training set is often bad practice.
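A small sketch of the pitfall described above, using made-up class counts: oversample the minority class *before* splitting, and the validation set no longer reflects the true class distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 900 + [1] * 100)  # true minority rate: 10%

# Naive: oversample the minority class first, then split off validation.
minority = np.where(y == 1)[0]
y_oversampled = np.concatenate([y, y[rng.choice(minority, 800)]])
rng.shuffle(y_oversampled)

# Take 20% as a "validation set" after oversampling.
val = y_oversampled[: len(y_oversampled) // 5]
# val.mean() is now close to 0.5, not the true 0.1 seen in live data,
# so validation metrics (and early stopping) are computed on the
# wrong distribution.
print(val.mean())
```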

@ogrisel
Member
ogrisel commented Oct 29, 2021

That's a very good point. Assuming the features are preprocessed by a pipeline, that would make the API quite complex to get right. For instance, assume that scikit-learn had a rebalancing meta-estimator (such as BalancedBagging* in imbalanced-learn) wrapping the GBDT model: the validation set would need to be computed using the upstream steps of the pipeline, before the meta-estimator that wraps the GBDT model.

I don't think it's possible or desirable to have an API that handles this methodological point automagically. But we should definitely make it possible to pass this kind of pre-computed validation set, and have an example documenting this kind of pipeline.

@fPkX6F1nGTX

How about joining forces with https://github.com/keras-team/keras and/or folding into tensorflow?

@vedranf
vedranf commented Mar 19, 2023

I tried to set up a grid search with GBDT from scikit-learn and XGBoost (using the custom validation set) in order to compare them. However, passing an eval set in fit:

grid.fit(X, y, groups=groups, clf__eval_set=[...])

results in somewhat expected:

TypeError: fit() got an unexpected keyword argument 'eval_set'

so I could only use either GBDT or XGBoost in the grid search, but not both estimators in the same run. Perhaps, as a first step for this issue, we could allow "eval_set" in kwargs, even if it isn't (yet) used internally for early stopping, so that one can at least directly compare XGBoost models with scikit-learn ones.

@lorentzenchr
Member

Let's centralize the discussion and join #18748.
