Current situation
We are in the unfortunate situation of having 2 different versions of gradient boosting: the old estimators (GradientBoostingClassifier and GradientBoostingRegressor) as well as the new ones using binning and histogram strategies similar to LightGBM (HistGradientBoostingClassifier and HistGradientBoostingRegressor).
This makes advertising the new ones harder, e.g. #26826, and also results in a growing feature gap between the two.
Based on discussions in #27139 and during a monthly meeting (maybe not documented), I'd like to call for comments on the following:
Proposition
Unify both types of gradient boosting in a single class per task, i.e. keep the old names GradientBoostingClassifier and GradientBoostingRegressor and make them switch the underlying estimator class based on a parameter value, e.g. max_bins (None -> old classes, integer -> new classes); a rough sketch of such a dispatch is given below.
Note that binning and histograms are not the only difference; see the comparison below.
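As a minimal sketch of the idea (hypothetical, not a proposed implementation; the helper `make_gb_regressor` is made up and the naming conflicts listed in the parameter table below are ignored here):

```python
# Hypothetical sketch, not actual scikit-learn API: how a unified regressor
# could dispatch on `max_bins` (None -> old exact-split GBT, int -> new HGBT).
from sklearn.ensemble import (
    GradientBoostingRegressor,
    HistGradientBoostingRegressor,
)

def make_gb_regressor(max_bins=None, **common_params):
    """Return the old exact-split GBT when max_bins is None,
    otherwise the histogram-based HGBT with `max_bins` bins.

    `common_params` stands for parameters shared by both estimators.
    """
    if max_bins is None:
        return GradientBoostingRegressor(**common_params)
    return HistGradientBoostingRegressor(max_bins=max_bins, **common_params)

reg_old = make_gb_regressor(learning_rate=0.1)                 # old algorithm
reg_new = make_gb_regressor(max_bins=255, learning_rate=0.1)   # binned features
```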
Comparison
Algorithm
The old GBT uses Friedman's gradient boosting with a line search step. (The line search sometimes, e.g. for the log loss, uses a 2nd-order approximation and is therefore sometimes called "hybrid gradient-Newton boosting".) The trees are trained on the gradients. A tree searches for the best split among all (very many) split candidates for all features. After a single tree is fit, the terminal node values are re-computed, which corresponds to a line search step.
The new HGBT uses a 2nd-order approximation of the loss, i.e. gradients and hessians (as in the XGBoost paper, therefore sometimes called Newton boosting). In addition, it bins/discretizes the features X and uses a histogram of gradients/hessians/counts per feature. A tree then searches for the best split candidate, but there are only n_features * n_bins candidates (much fewer than in GBT).
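To make the difference in the number of split candidates concrete, here is a minimal counting sketch on toy data, assuming 255 bins per feature (the HGBT default) and continuous features with all-unique values:

```python
# Minimal sketch of how binning shrinks the split-candidate set.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # 1000 samples, 5 continuous features
n_bins = 255

# Old GBT: every threshold between consecutive sorted unique values of a
# feature is a split candidate.
n_exact = sum(np.unique(X[:, j]).size - 1 for j in range(X.shape[1]))

# New HGBT: each feature is discretized into at most n_bins bins, so there are
# at most n_features * (n_bins - 1) candidate thresholds in total.
n_binned = X.shape[1] * (n_bins - 1)

print(n_exact, n_binned)   # 4995 vs 1270 for this toy data
```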
| estimator | trees train on | terminal node values (consequence of the tree fit) | features X |
|---|---|---|---|
| GBT | gradients | re-computed in a line search | used as is |
| HGBT | gradients/hessians | sum(gradients)/sum(hessians) | binned/discretized |
In fact, one could use a 2nd-order loss approximation (gradients and hessians) without binning X, and vice versa, use binning while fitting trees on gradients only (without hessians).
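A minimal numerical sketch of the two leaf-value rules for the binary log loss (ignoring HGBT's l2_regularization and shrinkage); as noted above, for this particular loss the old GBT's line search is itself a one-step Newton update, so the two rules coincide here:

```python
# Leaf-value computation for the samples falling into one leaf, log loss example.
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100).astype(float)   # binary targets in the leaf
raw = rng.normal(size=100)                       # current raw predictions
p = 1.0 / (1.0 + np.exp(-raw))                   # predicted probabilities

g = p - y              # per-sample gradient of the log loss w.r.t. raw
h = p * (1.0 - p)      # per-sample hessian of the log loss w.r.t. raw

# HGBT-style value: one Newton step, -sum(gradients) / sum(hessians)
# (l2_regularization would be added to the denominator).
value_hgbt = -g.sum() / h.sum()

# GBT-style value: the line-search re-computation; for the log loss this is the
# same one-step Newton update, written in terms of the residuals y - p.
value_gbt = np.sum(y - p) / np.sum(p * (1.0 - p))

print(value_hgbt, value_gbt)   # identical for this loss
```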
Parameters
| HistGradientBoostingRegressor | GradientBoostingRegressor | Same | Comment |
|---|---|---|---|
| loss | loss | ✅ | |
| quantile | alpha | ❌ | |
| learning_rate | learning_rate | ✅ | |
| max_iter | n_estimators | ❌ | #12807 (comment) |
| max_leaf_nodes | max_leaf_nodes | ✅ | |
| max_depth | max_depth | ✅ | |
| min_samples_leaf | min_samples_leaf | ✅ | |
| l2_regularization | ⛔ | ❌ | |
| max_features | max_features | ✅ | |
| max_bins | ⛔ (nonsense) | ❌ | |
| categorical_features | ⛔ | ❌ | |
| monotonic_cst | ⛔ | ❌ | #27305 |
| interaction_cst | ⛔ | ❌ | |
| warm_start | warm_start | ✅ | |
| early_stopping | ⛔ | ❌ | |
| scoring | ⛔ | ❌ | |
| validation_fraction | validation_fraction | ✅ | |
| n_iter_no_change | n_iter_no_change | ✅ | |
| tol | tol | ✅ | |
| verbose | verbose | ✅ | |
| random_state | random_state | ✅ | |
| class_weight | ⛔ | ❌ | |
| ⛔ | subsample | ❌ | #16062 |
| ⛔ (nonsense) | criterion | ❌ | |
| ⛔ | min_samples_split | ❌ | |
| ⛔ | min_weight_fraction_leaf | ❌ | |
| ⛔ | min_impurity_decrease | ❌ | |
| ⛔ | init | ❌ | #27109 |
| ⛔ | ccp_alpha | ❌ | |
In fact, only the quantile/alpha and max_iter/n_estimators parameter pairs conflict in naming.
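As a hypothetical illustration (the mapping and the helper `translate_params` are made up, not a proposed API), these two conflicts could be bridged by a thin translation layer during a deprecation period:

```python
# Hypothetical sketch: translate the two conflicting old parameter names to
# their HGBT counterparts; all other shared names can pass through unchanged.
OLD_TO_NEW = {
    "alpha": "quantile",         # quantile level, e.g. for loss="quantile"
    "n_estimators": "max_iter",  # number of boosting iterations
}

def translate_params(params):
    """Map old GradientBoosting* parameter names to HGBT-style names."""
    return {OLD_TO_NEW.get(name, name): value for name, value in params.items()}

print(translate_params({"n_estimators": 100, "learning_rate": 0.1}))
# {'max_iter': 100, 'learning_rate': 0.1}
```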