MRG+2: Refactor - Farm out class_weight calcs to .utils #4190


Merged: 5 commits merged into scikit-learn:master from trevorstephens:refactor_cw on Feb 6, 2015

Conversation

trevorstephens (Contributor)

With #4114 bringing a few more ensembles onboard the class_weight bandwagon, a fair bit of duplicated code is being proposed. I know that @amueller was originally concerned about code duplication in the original RF/Tree PR and I think that this function may alleviate some of that.

This PR farms out the calculations, and some of the error checks, for the expanded_class_weight variable to a new sklearn.utils function, compute_sample_weight(class_weight, y, indices=None), and refactors the code from #3961 to utilise it. It also adds a bit more rigour to the input checks and tests.

Benefits:

  • Better unit testing to ensure the class_weight='subsample' option is doing what we think it is doing
  • Will make transitioning classifiers that don't support multi-output, such as the meta-estimators, somewhat easier in some distant future
  • Removes duplicated code (the main point really)

If merged, I shall also make the appropriate mods to the code in #4114 so that it takes advantage of this helper function.
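For a concrete feel of the proposed helper, here is a small usage sketch against the sklearn.utils API (note: the accepted keyword strings have changed across releases; at the time of this PR they were 'auto' and 'subsample', where current scikit-learn uses 'balanced'):

```python
import numpy as np
from sklearn.utils import compute_sample_weight

y = np.array([0, 0, 0, 1])

# An explicit per-class mapping is expanded to one weight per sample.
print(compute_sample_weight({0: 1.0, 1: 3.0}, y))  # [1. 1. 1. 3.]

# 'balanced' weights each class by n_samples / (n_classes * class_count),
# so the rare class 1 is up-weighted: 4 / (2 * 1) = 2.0.
print(compute_sample_weight("balanced", y))  # approx [0.667 0.667 0.667 2.]
```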

@coveralls

Coverage increased (+0.01%) to 94.8% when pulling f75c98b on trevorstephens:refactor_cw into 94157fa on scikit-learn:master.


if classes_missing:
    # Make missing classes' weight zero
    weight_k[np.in1d(y_full, list(classes_missing))] = 0.
Member
maybe use in1d from fixes for old numpy.

Contributor Author

Thanks, was not aware of the fix, will update.
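For reference, the snippet above with the suggested change applied (a sketch assuming the in1d backport that sklearn.utils.fixes provided for old numpy versions at the time):

```python
# Use the compat backport instead of calling np.in1d directly, so the
# code keeps working on old numpy versions.
from sklearn.utils.fixes import in1d

if classes_missing:
    # Make missing classes' weight zero
    weight_k[in1d(y_full, list(classes_missing))] = 0.
```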

@agramfort (Member)

Besides that, looks great. I like all these red lines.

Is there any other classifier that could benefit from this?

@trevorstephens (Contributor Author)

I like all these red lines.

:-)

Is there any other classifier that could benefit from this?

Aside from those proposed in #4114, perhaps RidgeClassifier (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/ridge.py#L600) could use it. Do you think the small amount of code deletion is worthwhile here, @agramfort?

This function is written mostly to help in multi-output and bootstrap settings, where there are more moving parts, so the code reduction in linear models and/or SVMs would be minimal, if it could be used at all. Most linear model and SVM classes seem to do different things with an intermediate class_weight than the trees and forests do, such as storing it as an attribute or passing it on to optimised C or Cython code, so the translation is difficult there.
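To illustrate the bootstrap use case mentioned above, a hedged sketch using the helper's indices argument (keyword per current scikit-learn; the forest-specific option discussed in this PR was called 'subsample'):

```python
import numpy as np
from sklearn.utils import compute_sample_weight

y = np.array([0, 0, 0, 1, 1, 1])
# A hypothetical bootstrap draw that over-represents class 0.
indices = np.array([0, 1, 2, 2, 3, 5])

# Class frequencies are taken from the resampled draw rather than from
# the full y, which is what the forests' subsampled weighting needs.
w = compute_sample_weight("balanced", y, indices=indices)
```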

@agramfort (Member)

Aside from those proposed in #4114, perhaps RidgeClassifier (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/ridge.py#L600) could use it. Do you think the small amount of code deletion is worthwhile here, @agramfort?

I would say yes. The benefit is also that this function gets more visibility in the code base.

  # modify the sample weights with the corresponding class weight
- sample_weight *= cw[np.searchsorted(self.classes_, y)]
+ sample_weight *= compute_sample_weight(self.class_weight, y)
Contributor Author

I am not totally familiar with RidgeClassifierCV, but note in the diff above that I am now computing the implied cw variable over y rather than over Y, as it was done in RidgeClassifier. I did some tests and it appears that compute_class_weight does not mind whether it gets a one-hot-encoded version of y or the original long version. Thus, I do not think this change has any effect, but just putting it out there.

@trevorstephens (Contributor Author)

I would say yes. The benefit is also that this function gets more visibility in the code base.

@agramfort, I guess there is also some benefit to having more explicit unit testing on what the expanded class_weight cw variable should look like. Implemented in the latest commit. Let me know if I have your +1 now, but please note my comment on the latest diff.

@agramfort (Member)

LGTM +1 for merge

@agramfort agramfort changed the title Refactor - Farm out class_weight calcs to .utils MRG+1: Refactor - Farm out class_weight calcs to .utils Feb 2, 2015
@trevorstephens (Contributor Author)

Thanks @agramfort. I just noticed that there was an in-place multiplication on sample_weight in the original implementation of RidgeClassifierCV. Will fix this tonight. In the meantime, is anyone able to be a second reviewer?

@trevorstephens (Contributor Author)

Anyone else have time to review?



def test_compute_sample_weight():
    """Test (and demo) compute_sample_weight."""
Contributor

Can you add a test when class_weight=None?

Contributor Author

Good point. Done & thx for the review @glouppe!
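For the record, a minimal sketch of what the added case checks (hypothetical values; with class_weight=None the helper applies no reweighting):

```python
import numpy as np
from numpy.testing import assert_array_almost_equal
from sklearn.utils import compute_sample_weight

# With class_weight=None every sample should get unit weight.
y = np.asarray([1, 1, 1, 2, 2, 2])
assert_array_almost_equal(compute_sample_weight(None, y), np.ones(6))
```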

@glouppe (Contributor) commented Feb 5, 2015

Besides my small comment, +1 for merge. Thanks for the refactoring!

@glouppe (Contributor) commented Feb 5, 2015

Is there any other classifier that could benefit from this?

Given this function, it becomes quite easy to add class weights to estimators that already support sample weights. This includes Bagging or GradientBoosting for example. (This can be done in the future, in a later PR, assuming you want to tackle this)
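A hedged sketch of that pattern (hypothetical glue code, not part of this PR): expand a class_weight spec into per-sample weights, then hand them to any estimator whose fit accepts sample_weight.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils import compute_sample_weight

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = np.r_[np.zeros(90, dtype=int), np.ones(10, dtype=int)]  # imbalanced toy data

# Turn the class_weight spec into per-sample weights and feed them to an
# estimator that already supports sample_weight in fit().
sw = compute_sample_weight("balanced", y)
clf = GradientBoostingClassifier(random_state=0).fit(X, y, sample_weight=sw)
```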

@trevorstephens (Contributor Author)

Given this function, it becomes quite easy to add class weights to estimators that already support sample weights. This includes Bagging or GradientBoosting for example. (This can be done in the future, in a later PR, assuming you want to tackle this)

Yep, I will refactor #4114 on top of this function, which should make the review of those estimators easier too. Thanks for the +1, @glouppe.

@trevorstephens trevorstephens changed the title MRG+1: Refactor - Farm out class_weight calcs to .utils MRG+2: Refactor - Farm out class_weight calcs to .utils Feb 5, 2015
@trevorstephens (Contributor Author)

Can this be merged @agramfort @glouppe ?

@glouppe (Contributor) commented Feb 6, 2015

Merging, thanks for your work :)

glouppe added a commit that referenced this pull request Feb 6, 2015
MRG+2: Refactor - Farm out class_weight calcs to .utils
@glouppe glouppe merged commit 420daac into scikit-learn:master Feb 6, 2015
@trevorstephens trevorstephens deleted the refactor_cw branch February 6, 2015 12:58
@trevorstephens (Contributor Author)

Thanks!
