[MRG] implement least absolute deviation loss in GBDTs #13896
Conversation
I think it's ready, ping @ogrisel @glemaitre @adrinjalali since you guys are the most familiar with the code ;)
# loss
# If the hessians are constant, we consider they are equal to 1.
# - This is correct for the half LS loss
# - For LAD loss, hessians are actually 0, but they are always
if that's the case, doesn't it make sense to actually set them to 0? I'm kinda worried about maintainability of this.
or even have a parameter in the loss class which is the function that gives you the constant hessians, i.e. for LAD it'd be np.zeros.
IMO it's more maintainable to have the convention that constant hessians are always 1 rather than have custom constant hessians for each loss. Especially when these hessians are never used, like here.
Also, this is something that's always implicit and not explained in the papers, but the hessians need to be treated as 1 (even though they're 0) in order for the leaf value computation to be an average instead of a sum.
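For illustration, a minimal sketch (not the actual implementation; the helper name and values are made up) of why treating the constant hessians as 1 turns the Newton leaf value, -sum(gradients) / sum(hessians), into a plain average of the gradients:

```python
import numpy as np

def leaf_value(gradients, hessians, l2_regularization=0.0):
    # Newton step for a leaf: -sum(g) / (sum(h) + lambda)
    # (shrinkage omitted for simplicity).
    return -gradients.sum() / (hessians.sum() + l2_regularization)

gradients = np.array([0.5, -1.0, 2.0, 0.25])

# LAD hessians are mathematically 0, but we pretend they are all 1 ...
hessians_as_ones = np.ones_like(gradients)
# ... so the leaf value reduces to the (negative) mean of the gradients:
assert np.isclose(leaf_value(gradients, hessians_as_ones), -gradients.mean())

# With hessians taken as 0 the denominator would vanish and the leaf value
# would no longer be a well-defined average, hence the "constant hessians
# are 1" convention.
```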
It looks good otherwise. You will need an entry in what's new as well.
I just went to check our docs, and noticed/remembered we still don't have good examples/user guide for these classes. But other than that, it seems the loss function names are not consistent with the other ensemble methods; for instance, the GBR calls this LAD, I think.
    ('binary_crossentropy', 2, 1),
    ('categorical_crossentropy', 3, 3),
])
@pytest.mark.skipif(Y_DTYPE != np.float64,
                    reason='Need 64 bits float precision for numerical checks')
def test_numerical_gradients(loss, n_classes, prediction_dim):
@pytest.mark.parametrize('seed', range(1))
O_o
^^
I can remove it, it's just convenient when testing locally.
Don't worry, that's still on my stack; I'm just waiting for a few more features.
Right, we broke name consistency with the other implementation already ('ls' vs 'least-squares', 'deviance' vs 'binary-crossentropy', etc.).
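Going back to test_numerical_gradients quoted above: the idea is to compare the analytical gradients of a loss against centered finite differences of the loss values. A minimal, self-contained sketch of that check for the absolute deviation loss (illustrative only; the real test goes through the loss classes):

```python
import numpy as np

rng = np.random.RandomState(0)
y_true = rng.normal(size=100)
raw_predictions = rng.normal(size=100)

def lad_loss(raw, y):
    return np.abs(y - raw)

# Analytical gradient of |y - raw| w.r.t. raw is sign(raw - y).
analytical_grad = np.sign(raw_predictions - y_true)

# Central finite differences.
eps = 1e-6
numerical_grad = (lad_loss(raw_predictions + eps, y_true)
                  - lad_loss(raw_predictions - eps, y_true)) / (2 * eps)

# Compare away from the kink at raw == y, where the gradient is undefined.
mask = np.abs(raw_predictions - y_true) > 1e-3
assert np.allclose(numerical_grad[mask], analytical_grad[mask], atol=1e-5)
```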
LGTM :)
@glemaitre you were OK with this I think, do you wanna have another look so we can merge? Thanks!
Ping @glemaitre ;) Maybe @ogrisel too?
@@ -668,7 +672,8 @@ class HistGradientBoostingRegressor(BaseHistGradientBoosting, RegressorMixin):

     Parameters
     ----------
-    loss : {'least_squares'}, optional (default='least_squares')
+    loss : {'least_squares', 'least_absolute_deviation'}, \
+        optional (default='least_squares')
Do we remove the 'optional' and the parentheses right away?
I'd rather keep the current docstring consistent with the other entries, but I'm +1 on updating all of them in another PR.
OK fine with me
I am still fine with the PR. But before merging, would it make sense to add a test in …
And regarding the test, WDYT?
I agree. It currently fails and I'm investigating why. It has to do with the initial predictions being different between LightGBM and sklearn.
OK, I sorted it out. Added a comment:

# - We don't check the least_absolute_deviation loss here. This is because
# LightGBM's computation of the median (used for the initial value of
# raw_prediction) is a bit off (they'll e.g. return midpoints when there
# is no need to). Since these tests only run 1 iteration, the
# discrepancy between the initial values leads to biggish differences in
# the predictions. These differences are much smaller with more
# iterations.
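To illustrate the "midpoint" remark with a hypothetical example: when the number of samples is even there is no unique minimizer of the absolute error, so two implementations can return different, equally valid "medians" as the initial raw prediction:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 10.0])

# NumPy's median is the midpoint of the two middle values:
print(np.median(y))  # 2.5

# ... but any value in [2.0, 3.0] minimizes the sum of absolute deviations
# equally well, so another library may legitimately pick e.g. 2.0 or 3.0.
sum_abs_dev = lambda c: np.abs(y - c).sum()
print(sum_abs_dev(2.0), sum_abs_dev(2.5), sum_abs_dev(3.0))  # 10.0 10.0 10.0
```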
OK, makes sense. Merging then. Thanks @NicolasHug
This PR implements the least absolute deviation (or MAE) loss for GBDTs.
It's not as trivial as it seems, since the leaf values have to be updated after the tree is trained. Indeed, for MAE, we train the trees to fit the gradients (or more accurately a Newton–Raphson step), but we need them to predict the median of the residuals. See the original paper for more details.
See also the current implementation of this for the old version of GBDTs, in particular _update_terminal_region() in sklearn/ensemble/gb_losses.py.
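Conceptually, the per-leaf update described above looks like the following sketch (the function and attribute names are made up for illustration; the actual implementation operates on the grower's finalized leaves and also handles details such as shrinkage):

```python
import numpy as np

def update_leaves_to_median(leaves, y_true, raw_predictions):
    # After the tree has been fit on the gradients, overwrite each leaf
    # value with the median of the residuals of the samples in that leaf,
    # which is what minimizes the absolute error for those samples.
    for leaf in leaves:
        indices = leaf.sample_indices  # samples routed to this leaf
        residuals = y_true[indices] - raw_predictions[indices]
        leaf.value = np.median(residuals)
```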
Will ping when this is ready.
This is a bit slower than LightGBM, since the leaf value updates are not done in parallel here.
python benchmarks/bench_hist_gradient_boosting.py --lightgbm --problem regression --n-samples-max 5000000 --n-trees 50 --loss least_absolute_deviation
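For reference, a minimal usage sketch of the new loss (assuming the histogram-based estimators still require the experimental import, as they did when this PR was opened):

```python
import numpy as np

# The histogram-based GBDTs are experimental and need explicit enabling.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
# Heavy-tailed noise: a typical use case for a robust (MAE) loss.
y = X[:, 0] + rng.standard_t(df=1.5, size=1000)

est = HistGradientBoostingRegressor(loss='least_absolute_deviation')
est.fit(X, y)
print(est.predict(X[:5]))
```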