MRG+2: Refactor - Farm out class_weight calcs to .utils #4190
Conversation
```python
if classes_missing:
    # Make missing classes' weight zero
    weight_k[np.in1d(y_full, list(classes_missing))] = 0.
```
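A standalone sketch of what the masking above does (the toy values for `y_full`, `weight_k`, and `classes_missing` are invented stand-ins for the variables in the diff):

```python
import numpy as np

# Stand-ins for the variables in the diff above
y_full = np.array([0, 1, 2, 1, 2, 0])
weight_k = np.ones_like(y_full, dtype=np.float64)
classes_missing = {2}  # classes absent from the (bootstrap) subsample

if classes_missing:
    # Make missing classes' weight zero
    weight_k[np.in1d(y_full, list(classes_missing))] = 0.

print(weight_k)  # [1. 1. 0. 1. 0. 1.]
```
(Newer numpy spells `np.in1d` as `np.isin`; the old name still works with a deprecation warning.)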
Maybe use `in1d` from `fixes` for old numpy.
Thanks, was not aware of the fix, will update.
Besides that, it looks great. I like all these red lines. Is there any other classifier that could benefit from this? :-)
Aside from those proposed in #4114, perhaps. This function is written mostly to help in multi-output and bootstrap settings where there are more moving parts, so the code reduction in linear models and/or svm would be minimal, if it could be used at all. Most linear model and svm classes seem to do different things with an intermediate variable.
I would say yes. The benefit is also that this function gets more testing.
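For the multi-output case mentioned above, `compute_sample_weight` accepts a 2D `y` with one class-weight dict per output and multiplies the per-output weights together per sample (a minimal sketch; the toy data is invented):

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy 2-output target: each row is one sample, each column one output
y = np.array([[0, 0],
              [1, 1]])

# One weight dict per output; per-sample weights are the product across outputs
w = compute_sample_weight([{0: 1.0, 1: 2.0}, {0: 3.0, 1: 4.0}], y)
print(w)  # [3. 8.]  (1*3 and 2*4)
```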
```diff
 # modify the sample weights with the corresponding class weight
-sample_weight *= cw[np.searchsorted(self.classes_, y)]
+sample_weight *= compute_sample_weight(self.class_weight, y)
```
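The old and new lines compute the same per-sample weights; a standalone sketch of the equivalence on toy data, with the class-weight lookup spelled out in numpy:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

classes_ = np.array([0, 1])    # sorted class labels, as in self.classes_
cw = np.array([0.5, 2.0])      # per-class weights aligned with classes_
y = np.array([0, 1, 1, 0])

# Old style: look up each sample's class weight via searchsorted
old = cw[np.searchsorted(classes_, y)]

# New style: let the utility expand the same mapping of class weights
new = compute_sample_weight({0: 0.5, 1: 2.0}, y)

print(old)  # [0.5 2.  2.  0.5]
print(new)  # [0.5 2.  2.  0.5]
```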
I am not totally familiar with `RidgeClassifierCV`, but note here ^ that I am now computing the implied `cw` variable over `y` rather than `Y`, as it is/was done in `RidgeClassifier`. I did some tests and it appears that `compute_class_weight` does not mind whether it gets a one-hot-encoded version of `y` or the original long version. Thus, I do not think that this change has any effect, but I am just putting it out there.
@agramfort, I guess there is also some benefit to having more explicit unit testing of the expanded class_weight.
LGTM +1 for merge
Thanks @agramfort. I just noticed that there was an in-place multiplication on sample_weight.
Anyone else have time to review?
```python
def test_compute_sample_weight():
    """Test (and demo) compute_sample_weight."""
```
Can you add a test for when `class_weight=None`?
Good point. Done & thx for the review @glouppe !
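A sketch of what such a test could assert (assumed behavior: with `class_weight=None`, every sample gets weight 1; the toy labels are invented):

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y = np.array([1, 1, 1, 2, 2, 3])

# With no class weighting requested, all samples are weighted equally
sw = compute_sample_weight(None, y)
print(sw)  # [1. 1. 1. 1. 1. 1.]
```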
Besides my small comment, +1 for merge. Thanks for the refactoring!
Given this function, it becomes quite easy to add class weights to estimators that already support sample weights. This includes Bagging or GradientBoosting, for example. (This can be done in the future, in a later PR, assuming you want to tackle this.)
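A hedged sketch of that idea: any estimator whose `fit` accepts a `sample_weight` parameter can emulate class weighting by passing in the expanded weights. `GradientBoostingClassifier` here is just one example, and the `'balanced'` spelling is the modern name for what this PR-era code called `'auto'`:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.RandomState(0)
X = rng.randn(60, 3)
y = np.array([0] * 50 + [1] * 10)  # imbalanced toy labels

# Emulate class_weight='balanced' via the sample_weight fit parameter
sw = compute_sample_weight("balanced", y)
clf = GradientBoostingClassifier(n_estimators=10, random_state=0)
clf.fit(X, y, sample_weight=sw)
print(clf.predict(X[:2]).shape)  # (2,)
```
With `'balanced'`, the total weight assigned to each class is equalized, so the minority class is no longer drowned out during fitting.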
Yep, I will refactor #4114 on top of this function, which should make the review of those estimators easier too. Thanks for the +1 @glouppe
Can this be merged @agramfort @glouppe ?
Merging, thanks for your work :)
Thanks!
With #4114 bringing a few more ensembles onboard the `class_weight` bandwagon, a fair bit of duplicated code is being proposed. I know that @amueller was originally concerned about code duplication in the original RF/Tree PR and I think that this function may alleviate some of that.

This PR farms out the calculations and some of the error checks for the `expanded_class_weight` variable to a new `sklearn.utils` function, `compute_sample_weight(class_weight, y, indices=None)`, and refactors the code from #3961 to utilise it. It also adds a bit more rigor to the input checks and tests.

Benefits:

- the `class_weight='subsample'` option is doing what we think it is doing

If merged, I shall also make appropriate mods to the code in #4114 to also take advantage of this helper function.
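A minimal usage sketch of the helper described above (toy data invented; `'balanced'` is the later spelling of the PR-era `'auto'` option):

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y = np.array([1, 1, 1, 2])

# Explicit per-class weights expand to one weight per sample
w_dict = compute_sample_weight({1: 2.0, 2: 4.0}, y)
print(w_dict)  # [2. 2. 2. 4.]

# 'balanced' weights are inversely proportional to class frequency:
# n_samples / (n_classes * bincount) -> 4/(2*3) and 4/(2*1)
w_bal = compute_sample_weight("balanced", y)
print(w_bal)  # [0.667 0.667 0.667 2.]
```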