Add sample_weight support to binning in HGBT #27117
Comments
More of a "for my education" question, but maybe also useful to think about: is there a way to determine a "correct" or "incorrect" set of bin edges for a dataset? I think there is no way to define what a "correct" set of bin edges is. There are smarter and less smart ways to choose the edges, and some sets of edges will lead to better performance than others, but I find it hard to come up with a way to define that a set of edges is "incorrect". When you compute how many samples fall between two edges, you do need to take the weights into account; at least, I think it is "incorrect" not to. The point being: maybe it is simpler to ignore the weights when determining the edges but take them into account when counting how many samples are in each bin.
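To make that last point concrete, weighted bin contents are easy to compute even when the edges were chosen without looking at the weights. A minimal NumPy sketch (illustrative only, not scikit-learn's binning code):

```python
import numpy as np

x = np.array([0.1, 0.4, 0.5, 0.9])
w = np.array([1.0, 3.0, 1.0, 2.0])
edges = np.array([0.0, 0.5, 1.0])  # edges chosen without using the weights

# Each sample contributes its weight to its bin, not a count of 1.
counts, _ = np.histogram(x, bins=edges, weights=w)
print(counts)  # [4. 3.]
```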
I don't follow. The general point is more that if you interpret weights as frequency weights (which is allowed), then you might very well want to take them into account when calculating quantiles as bin thresholds, such that roughly an equal sum of weights ends up in each bin.
I agree that it is smart to do this. What I was wondering is whether, if it is too hard, it would be easier to ignore the weights when computing the edges and only take them into account when computing the contents of each bin. Hence the question about whether ignoring the weights when computing the edges is incorrect/wrong, or (potentially) just leads to lower performance. For example, using equally spaced bins between the feature minimum and maximum isn't wrong, but it is likely to lead to lower performance than quantile-based binning.
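Under the frequency-weight interpretation above, an integer weight of k behaves like k copies of the sample, which gives a quick way to contrast the two choices of edges. A sketch (illustrative only; `np.repeat` stands in for a real weighted quantile routine):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
w = rng.integers(1, 10, size=1_000)

# Unweighted edges: ignore w entirely (simpler, arguably not "incorrect").
edges_plain = np.percentile(x, [25, 50, 75])

# Weighted edges: quantiles of the data with each x[i] repeated w[i] times.
edges_weighted = np.percentile(np.repeat(x, w), [25, 50, 75])

# Only the weighted edges put roughly an equal *sum of weights* in each bin.
bins = np.concatenate([[x.min()], edges_weighted, [x.max()]])
print(np.histogram(x, bins=bins, weights=w)[0])  # four roughly equal masses
```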
I think we should also expose the `subsample` parameter.
IIRC we use `np.percentile` to compute the bin thresholds, which doesn't take weights into account.
We have our own (quite simple) weighted percentile function in utils (`_weighted_percentile` in `sklearn.utils.stats`).
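For reference, a usage sketch of that helper. This assumes the private helper lives at `sklearn.utils.stats._weighted_percentile` with signature `(array, sample_weight, percentile=50)`; being private, it may move or change between versions:

```python
import numpy as np
from sklearn.utils.stats import _weighted_percentile  # private API, may change

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 1.0, 1.0, 5.0])  # the heavy weight pulls the median up

print(np.percentile(x, 50))                       # unweighted median: 2.5
print(_weighted_percentile(x, w, percentile=50))  # weighted median: 4.0
```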
Note that when the training set is large, we subsample it and find the bin thresholds only on that subsample. In that case we should probably use the weights when subsampling, as proposed in #14696 (comment). However, subsampling does not happen if the number of data points is smaller than the `subsample` threshold (200,000 by default).
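One way to read that proposal: draw the subsample with probability proportional to the weights, so heavily weighted rows are more likely to be retained. A minimal sketch (hypothetical helper, not scikit-learn code; sampling without replacement is one of several reasonable choices here):

```python
import numpy as np

def subsample_proportional_to_weight(X, sample_weight, subsample=200_000, seed=0):
    """Pick rows for threshold-finding with probability proportional to weight."""
    n = X.shape[0]
    if n <= subsample:  # mirrors the current behaviour: no subsampling needed
        return X
    rng = np.random.default_rng(seed)
    p = np.asarray(sample_weight, dtype=float)
    p = p / p.sum()
    idx = rng.choice(n, size=subsample, replace=False, p=p)
    return X[idx]
```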
Issue description

Use `sample_weight` in the binning of `HistGradientBoostingClassifier` and `HistGradientBoostingRegressor`, or allow it via an option. Currently, sample weights are ignored in the `_BinMapper`.

Some more context and history summarized by @NicolasHug here:
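At the public API level, `fit` already accepts `sample_weight`; per this issue, the weights reach the loss but not the threshold computation in `_BinMapper`. A small example of the call this issue is about (synthetic data, illustrative only):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1_000)
w = rng.integers(1, 10, size=1_000).astype(float)

# sample_weight is honoured in the loss and leaf values, but the bin
# edges are currently computed as if all weights were equal.
model = HistGradientBoostingRegressor().fit(X, y, sample_weight=w)
print(model.score(X, y))
```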