GBDT support custom validation set · Issue #15127 · scikit-learn/scikit-learn · GitHub

Closed
qinhanmin2014 opened this issue Oct 4, 2019 · 8 comments

@qinhanmin2014
Member

Is it reasonable to support a custom validation set in GBDT? Currently we do train_test_split internally, but I think sometimes users want to do the split themselves (e.g., sometimes we want to use the first 80% of the dataset as the training set and the rest as the validation set).
xgboost, lightgbm and catboost all support custom validation sets.

@vachanda
Contributor

Hi @qinhanmin2014, just to clarify: are you talking about the BaseHistGradientBoosting class, where the validation split takes place?

P.S. I want to work on this and just needed the clarification.

@qinhanmin2014
Member Author

> Hi @qinhanmin2014, just to clarify: are you talking about the BaseHistGradientBoosting class, where the validation split takes place?

Yes, something like xgb/lgb/ctb. Be careful: there's only a +1 from me (we need +2 before making a decision).

@NicolasHug
Member

This could be used for any estimator that implements early stopping, not just the GBDTs.

What would the API look like? New arguments to fit? How does that work out in a pipeline? Or in general for the meta estimators?

Clearly that's not a trivial decision / design ;)

@candalfigomoro

@NicolasHug @TomDLT
The current validation set approach of scikit-learn tempts people to produce bad models. Suppose you perform minority-class oversampling, data augmentation, or some other preprocessing step. Ideally, the validation set should reflect the distribution of the test set and of live data. If you split the validation set from an already-preprocessed training set, it inherits the training set's biased distribution. Splitting a validation set from the training set is often bad practice.
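A small sketch of the pitfall described above, using made-up class counts: oversample the minority class *before* splitting, and the validation set no longer reflects the true class distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 900 + [1] * 100)  # true minority rate: 10%

# Naive: oversample the minority class first, then split off validation.
minority = np.where(y == 1)[0]
y_oversampled = np.concatenate([y, y[rng.choice(minority, 800)]])
rng.shuffle(y_oversampled)

# Take 20% as a "validation set" after oversampling.
val = y_oversampled[: len(y_oversampled) // 5]
# val.mean() is now close to 0.5, not the true 0.1 seen in live data,
# so validation metrics (and early stopping) are computed on the
# wrong distribution.
print(val.mean())
```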

@ogrisel
Member
ogrisel commented Oct 29, 2021

That's a very good point. Assuming the features are preprocessed by a pipeline, that would make the API quite complex to get right. For instance, assume that scikit-learn had a rebalancing meta-estimator (such as BalancedBagging* in imbalanced-learn) wrapping the GBDT model: the validation set would need to be computed using the upstream steps of the pipeline, before the meta-estimator that wraps the GBDT model.

I don't think it's possible or desirable to have an API that handles this methodological point automagically. But we should definitely make it possible to pass this kind of pre-computed validation set, and have an example documenting this kind of pipeline.

@fPkX6F1nGTX

How about joining forces with https://github.com/keras-team/keras and/or folding into tensorflow?

@vedranf
vedranf commented Mar 19, 2023

I tried to set up a grid search with GBDT from scikit-learn and XGBoost (using the custom validation set) in order to compare them. However, passing an eval set in fit:

grid.fit(X, y, groups=groups, clf__eval_set=[...])

results in somewhat expected:

TypeError: fit() got an unexpected keyword argument 'eval_set'

so I could only use either GBDT or XGBoost in the grid search, but not both estimators in the same run. Perhaps, as a first step for this issue, we could allow "eval_set" in kwargs, even if it isn't (yet) used internally for early stopping, so that one can at least directly compare XGBoost models with scikit-learn ones.

@lorentzenchr
Member

Let's centralize the discussion and join #18748.
