MRG+2: Refactor - Farm out class_weight calcs to .utils #4190


Merged: 5 commits merged into scikit-learn:master from trevorstephens:refactor_cw on Feb 6, 2015

Conversation

trevorstephens (Contributor)

With #4114 bringing a few more ensembles onboard the class_weight bandwagon, a fair bit of duplicated code is being proposed. I know that @amueller was originally concerned about code duplication in the original RF/Tree PR and I think that this function may alleviate some of that.

This PR farms out the calculations, and some of the error checks, for the expanded_class_weight variable to a new sklearn.utils function, compute_sample_weight(class_weight, y, indices=None), and refactors the code from #3961 to utilise it. It also adds a bit more rigour to the input checks and tests.

Benefits:

  • Better unit testing to ensure the class_weight='subsample' option is doing what we think it is doing
  • Will make transitioning classifiers that don't support multi-output, such as the meta-estimators, somewhat easier in some distant future
  • Removes duplicated code (the main point really)

If merged, I shall also make the appropriate mods to the code in #4114 so that it takes advantage of this helper function.
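For a concrete feel of the proposed helper, here is a small usage sketch against the sklearn.utils API (note: the accepted keyword strings have changed across releases; at the time of this PR they were 'auto' and 'subsample', where current scikit-learn uses 'balanced'):

```python
import numpy as np
from sklearn.utils import compute_sample_weight

y = np.array([0, 0, 0, 1])

# An explicit per-class mapping is expanded to one weight per sample.
print(compute_sample_weight({0: 1.0, 1: 3.0}, y))  # [1. 1. 1. 3.]

# 'balanced' weights each class by n_samples / (n_classes * class_count),
# so the rare class 1 is up-weighted: 4 / (2 * 1) = 2.0.
print(compute_sample_weight("balanced", y))  # approx [0.667 0.667 0.667 2.]
```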

@coveralls

Coverage increased (+0.01%) to 94.8% when pulling f75c98b on trevorstephens:refactor_cw into 94157fa on scikit-learn:master.


if classes_missing:
    # Make missing classes' weight zero
    weight_k[np.in1d(y_full, list(classes_missing))] = 0.
Member
maybe use in1d from fixes for old numpy.

Contributor Author

Thanks, was not aware of the fix, will update.
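For reference, the snippet above with the suggested change applied (a sketch assuming the in1d backport that sklearn.utils.fixes provided for old numpy versions at the time):

```python
# Use the compat backport instead of calling np.in1d directly, so the
# code keeps working on old numpy versions.
from sklearn.utils.fixes import in1d

if classes_missing:
    # Make missing classes' weight zero
    weight_k[in1d(y_full, list(classes_missing))] = 0.
```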

@agramfort (Member)

Besides that, looks great. I like all these red lines.

Is there any other classifier that could benefit from this?

@trevorstephens (Contributor Author)

I like all these red lines.

:-)

Is there any other classifier that could benefit from this?

Aside from those proposed in #4114, perhaps RidgeClassifier (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/ridge.py#L600) could use it. Do you think the small amount of code deletion is worthwhile here, @agramfort?

This function is written mostly to help in multi-output and bootstrap settings, where there are more moving parts, so the code reduction in linear models and/or SVMs would be minimal, if it could be used at all. Most linear model and SVM classes seem to do different things with an intermediate class_weight than the trees and forests do, such as storing it as an attribute or passing it on to optimised C or Cython code, so the translation is difficult there.
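To illustrate the bootstrap use case mentioned above, a hedged sketch using the helper's indices argument (keyword per current scikit-learn; the forest-specific option discussed in this PR was called 'subsample'):

```python
import numpy as np
from sklearn.utils import compute_sample_weight

y = np.array([0, 0, 0, 1, 1, 1])
# A hypothetical bootstrap draw that over-represents class 0.
indices = np.array([0, 1, 2, 2, 3, 5])

# Class frequencies are taken from the resampled draw rather than from
# the full y, which is what the forests' subsampled weighting needs.
w = compute_sample_weight("balanced", y, indices=indices)
```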

@agramfort (Member)

Aside from those proposed in #4114, perhaps RidgeClassifier (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/ridge.py#L600) could use it. Do you think the small amount of code deletion is worthwhile here, @agramfort?

I would say yes. The benefit is also that this function gets more visibility in the code base.

  # modify the sample weights with the corresponding class weight
- sample_weight *= cw[np.searchsorted(self.classes_, y)]
+ sample_weight *= compute_sample_weight(self.class_weight, y)
Contributor Author

I am not totally familiar with RidgeClassifierCV, but note in the diff above that I am now computing the implied cw variable over y rather than over Y, as it was done in RidgeClassifier. I did some tests and it appears that compute_class_weight does not mind whether it gets a one-hot-encoded version of y or the original long version. Thus, I do not think this change has any effect, but just putting it out there.

@trevorstephens (Contributor Author)

I would say yes. The benefit is also that this function gets more visibility in the code base.

@agramfort, I guess there is also some benefit to having more explicit unit testing on what the expanded class_weight cw variable should look like. Implemented in the latest commit. Let me know if I have your +1 now, but please note my comment on the latest diff.

@agramfort (Member)

LGTM +1 for merge

@agramfort agramfort changed the title Refactor - Farm out class_weight calcs to .utils MRG+1: Refactor - Farm out class_weight calcs to .utils Feb 2, 2015
@trevorstephens (Contributor Author)

Thanks @agramfort. I just noticed that there was an in-place multiplication on sample_weight in the original implementation of RidgeClassifierCV. Will fix this tonight. In the meantime, is anyone able to be a second reviewer?

@trevorstephens (Contributor Author)

Anyone else have time to review?



def test_compute_sample_weight():
    """Test (and demo) compute_sample_weight."""
Contributor

Can you add a test when class_weight=None?

Contributor Author

Good point. Done & thx for the review @glouppe!
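For the record, a minimal sketch of what the added case checks (hypothetical values; with class_weight=None the helper applies no reweighting):

```python
import numpy as np
from numpy.testing import assert_array_almost_equal
from sklearn.utils import compute_sample_weight

# With class_weight=None every sample should get unit weight.
y = np.asarray([1, 1, 1, 2, 2, 2])
assert_array_almost_equal(compute_sample_weight(None, y), np.ones(6))
```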

@glouppe (Contributor) commented Feb 5, 2015

Besides my small comment, +1 for merge. Thanks for the refactoring!

@glouppe (Contributor) commented Feb 5, 2015

Is there any other classifier that could benefit from this?

Given this function, it becomes quite easy to add class weights to estimators that already support sample weights. This includes Bagging or GradientBoosting for example. (This can be done in the future, in a later PR, assuming you want to tackle this)
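A hedged sketch of that pattern (hypothetical glue code, not part of this PR): expand a class_weight spec into per-sample weights, then hand them to any estimator whose fit accepts sample_weight.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils import compute_sample_weight

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = np.r_[np.zeros(90, dtype=int), np.ones(10, dtype=int)]  # imbalanced toy data

# Turn the class_weight spec into per-sample weights and feed them to an
# estimator that already supports sample_weight in fit().
sw = compute_sample_weight("balanced", y)
clf = GradientBoostingClassifier(random_state=0).fit(X, y, sample_weight=sw)
```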

@trevorstephens (Contributor Author)

Given this function, it becomes quite easy to add class weights to estimators that already support sample weights. This includes Bagging or GradientBoosting for example. (This can be done in the future, in a later PR, assuming you want to tackle this)

Yep, I will refactor #4114 on top of this function, which should make the review of those estimators easier too. Thanks for the +1, @glouppe.

@trevorstephens trevorstephens changed the title MRG+1: Refactor - Farm out class_weight calcs to .utils MRG+2: Refactor - Farm out class_weight calcs to .utils Feb 5, 2015
@trevorstephens (Contributor Author)

Can this be merged @agramfort @glouppe ?

@glouppe (Contributor) commented Feb 6, 2015

Merging, thanks for your work :)

glouppe added a commit that referenced this pull request Feb 6, 2015
MRG+2: Refactor - Farm out class_weight calcs to .utils
@glouppe glouppe merged commit 420daac into scikit-learn:master Feb 6, 2015
@trevorstephens trevorstephens deleted the refactor_cw branch February 6, 2015 12:58
@trevorstephens (Contributor Author)

Thanks!
