Add sample_weight support to binning in HGBT · Issue #27117 · scikit-learn/scikit-learn · GitHub

Add sample_weight support to binning in HGBT #27117


Open

lorentzenchr opened this issue Aug 20, 2023 · 7 comments

@lorentzenchr
Member

Use sample_weight in the binning of HistGradientBoostingClassifier and HistGradientBoostingRegressor, or allow it via an option.

Currently, sample weights are ignored in the _BinMapper.

Some more context and history summarized by @NicolasHug here:

> I agree that it would make sense to support SW in the binner (although this is nuanced, see note below). Reading back the original PR, this was discussed extensively:
>
> The estimators were still experimental at the time so we had more flexibility. Now that they're stable, bringing SW support in the Binner might require BC mitigations.
>
> Also as a side note, #15657 (comment) may be relevant here: it's not super clear to me how we should handle SW in an estimator that performs some sort of subsampling during the training process (as is the case here during Binning).

@betatim
Member
betatim commented Aug 21, 2023

More of a "for my education" question but also maybe useful to think about: is there a way to determine a "correct" or "incorrect" set of bin edges for a dataset?

I think there is no way to define what a "correct" set of bin edges is. There are smarter and less smart ways to choose the edges, and some sets of edges will lead to better performance than others. But I find it hard to come up with a way to define that a set of edges is "incorrect".

When you compute how many samples fall between two edges, you do need to take the weights into account; at least I think it is "incorrect" not to.

The point being: maybe it is simpler to ignore the weights when determining the edges but take them into account when calculating how many samples are in each bin.
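
As a purely illustrative NumPy sketch of that split (not the actual HGBT binning code): compute the edges from the unweighted data, but measure the per-bin content as a sum of weights.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
w = rng.uniform(0.1, 10.0, size=1000)

# Edges chosen without looking at the weights (10 quantile-based bins)...
edges = np.percentile(x, np.linspace(0, 100, 11))

# ...but the "content" of each bin is the sum of weights, not the raw count.
weighted_counts, _ = np.histogram(x, bins=edges, weights=w)
print(weighted_counts)
```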

@lorentzenchr
Member Author

> The point being: maybe it is simpler to ignore the weights when determining the edges but take them into account when calculating how many samples are in each bin.

I don't follow.

The general point is more that if you interpret the weights as frequency weights (which is allowed), then you might very well want to take them into account when calculating the quantiles used as bin thresholds, so that roughly equal sums of weights end up in each bin.
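
For integer-valued frequency weights, the expected behaviour is easy to state: a sample with weight w should influence the thresholds exactly as if it appeared w times in the data. A minimal NumPy sketch of that equivalence (again, not scikit-learn's actual binning code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(size=1000)
w = rng.integers(1, 5, size=1000)  # integer frequency weights

quantiles = np.linspace(0, 100, 11)[1:-1]

# Unweighted thresholds: what you get if the binner ignores sample_weight.
edges_unweighted = np.percentile(x, quantiles)

# Frequency-weight semantics: repeat each sample according to its weight.
# A weighted-quantile implementation should reproduce (roughly) these edges.
edges_weighted = np.percentile(np.repeat(x, w), quantiles)

print(edges_unweighted)
print(edges_weighted)  # differs whenever the weights are not all equal
```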

@betatim
Member
betatim commented Aug 22, 2023

> The general point is more that if you interpret the weights as frequency weights (which is allowed), then you might very well want to take them into account when calculating the quantiles used as bin thresholds, so that roughly equal sums of weights end up in each bin.

I agree that it is smart to do this. What I was wondering is whether, if weighted quantiles are too hard, it would be easier to ignore the weights when computing the edges and only take them into account when computing the contents of each bin. Hence the question of whether ignoring the weights when computing the edges is actually incorrect/wrong, or whether it (potentially) just leads to lower performance.

For example, using equally spaced bins between the feature minimum and maximum isn't wrong, but it is likely to lead to lower performance than quantile-based binning.

@lorentzenchr
Member Author

The _BinMapper already uses quantiles. So it would simply be a matter of using weighted quantiles instead.

I think we should also expose the subsample parameter of the bin mapper to give users more control.

@betatim
Member
betatim commented Aug 28, 2023

> The _BinMapper already uses quantiles. So it would simply be a matter of using weighted quantiles instead.

IIRC we use the percentile function from numpy, which doesn't support weights. I assume the reason they don't support it is that an efficient algorithm is difficult to implement, but that is speculation.

@lorentzenchr
Member Author

We have our own (quite simple) weighted percentile function in utils.
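
That helper is private, so its exact signature may differ, but the idea is a simple inverse-CDF weighted percentile. A self-contained sketch in the same spirit (not a copy of the actual utility in sklearn.utils):

```python
import numpy as np

def weighted_percentile(values, weights, percentile=50.0):
    """Inverse-CDF weighted percentile: the smallest value whose cumulative
    weight reaches the requested fraction of the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cdf = np.cumsum(weights)
    target = percentile / 100.0 * cdf[-1]
    return values[np.searchsorted(cdf, target)]

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 1.0, 1.0, 10.0])
print(weighted_percentile(x, w, 50))  # 4.0: the heavily weighted sample pulls the median up
```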

@ogrisel
Member
ogrisel commented Jul 5, 2024

Note that when the training set is large, we subsample it and find the bin thresholds only on the subsample. In that case we should probably use the weights only when subsampling, as proposed in #14696 (comment).

However, subsampling does not happen if the number of data points is smaller than int(2e5). In that case we could make _find_binning_thresholds respect sample_weight by calling our own _weighted_percentile instead of np.percentile internally. The case where len(distinct_values) <= max_bins might be tricky to get right though. Maybe we should no longer special-case it and instead call np.unique on the thresholds and accept that the resulting number of bins can be lower than 256.
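
A rough sketch of how both code paths could look, reusing the weighted_percentile helper sketched above; the function name, defaults, and structure are illustrative only and not the actual _find_binning_thresholds implementation:

```python
import numpy as np

def find_binning_thresholds_weighted(col, sample_weight, max_bins=255,
                                     subsample=int(2e5), random_state=None):
    """Illustrative only: weighted variant of the two binning code paths."""
    rng = np.random.default_rng(random_state)

    if col.shape[0] > subsample:
        # Large data: subsample with probability proportional to sample_weight
        # (the idea from #14696), then treat the subsample as unweighted.
        p = sample_weight / sample_weight.sum()
        idx = rng.choice(col.shape[0], size=subsample, replace=True, p=p)
        col = col[idx]
        sample_weight = np.ones(subsample)

    # Small (or already subsampled) data: weighted quantiles as thresholds.
    percentiles = np.linspace(0, 100, num=max_bins + 1)[1:-1]
    thresholds = np.array(
        [weighted_percentile(col, sample_weight, q) for q in percentiles]
    )

    # Instead of special-casing len(distinct_values) <= max_bins, deduplicate
    # the thresholds and accept that fewer than max_bins bins may result.
    return np.unique(thresholds)
```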
