8000 [MRG] class_weight for Bagging, AdaBoost & GradientBoosting classifiers by trevorstephens · Pull Request #4114 · scikit-learn/scikit-learn · GitHub

[MRG] class_weight for Bagging, AdaBoost & GradientBoosting classifiers #4114


Conversation

trevorstephens
Contributor

Initial crack at the rest of the ensembles, carrying on from #3961.

My biggest question is how (or whether) to handle class_weight='subsample' in GradientBoostingClassifier. In RandomForestClassifier I weighted based on the bootstrap sample, as I have done in BaggingClassifier in this PR, but the applicability to GBM is not so clear to me. Right now I simply use the 'auto' weighting as an initialization of the sample weights, much like the fit sample_weight param does today, and as is implemented in AdaBoostClassifier in this PR. My initial thought for 'subsample' was to fit the residuals against the re-weighted boosting sample_weight while not actually altering it and letting the iterations do their thing, but I feel we may be venturing far from the literature there...
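For readers following along, here is a minimal sketch of what the 'auto' heuristic amounts to when expanded into per-sample weights and handed to fit as an initialization; the helper name is hypothetical and this is not the PR's actual code:

```python
import numpy as np

def expand_auto_class_weight(y):
    """Per-sample weights so that each class contributes equal total weight."""
    classes, y_idx = np.unique(y, return_inverse=True)
    counts = np.bincount(y_idx)
    # 'auto' heuristic: n_samples / (n_classes * class_count)
    class_weight = len(y) / (len(classes) * counts.astype(float))
    return class_weight[y_idx]

y = np.array([0, 0, 0, 0, 1, 1])
sample_weight = expand_auto_class_weight(y)  # [0.75, 0.75, 0.75, 0.75, 1.5, 1.5]
# e.g. used as the initial weighting for a boosting fit:
# clf.fit(X, y, sample_weight=sample_weight)
```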

Any thoughts?

Additionally, I'm certain Travis will be upset with me over the infamous check_class_weight_classifiers in estimator_checks. GBM may get by thanks to the hack here: #3961 (diff), but the two meta-estimators do not expose min_weight_fraction_leaf directly and are sure to fail. As one solution, I could add another band-aid that checks whether the estimator has base_estimator and sets a decision tree with a similar bias against lightly weighted leaf nodes.

Will add individual estimator checks on top of this soon, of course.

Calling the previous reviewers for input: @glouppe @amueller @GaelVaroquaux , plus any of you other fine people who are willing to comment :-)

@glouppe
Contributor
glouppe commented Jan 19, 2015

CC: @pprett as well

@pprett
Member
pprett commented Jan 19, 2015

My initial thought for 'subsample' was to fit the residuals to the re-weighted boosting sample_weight while not actually altering it and letting the iterations do their things, but I feel that we may be venturing far away from the literature there...

@trevorstephens I'm not sure if I completely understand what you are planning to do...
As far as I understood, 'subsample' mode will set the class weights by applying weight balancing in the (bootstrap) sample. The only difference to RF is that (stochastic) GBRT does sampling w/o replacement.
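As an illustrative aside (my own sketch, not the PR's implementation, and the helper name is made up): under this reading, the balancing weights for each estimator are computed from the rows actually drawn for it, whether the draw is a bootstrap (Bagging/RF) or a without-replacement subsample (stochastic GBRT).

```python
import numpy as np

def subsample_balanced_weights(y, indices):
    """Balanced per-sample weights for y[indices], computed within that draw."""
    y_sub = y[indices]
    classes, y_idx = np.unique(y_sub, return_inverse=True)
    counts = np.bincount(y_idx)
    class_weight = len(y_sub) / (len(classes) * counts.astype(float))
    return class_weight[y_idx]

rng = np.random.RandomState(0)
y = np.array([0] * 8 + [1] * 2)
bootstrap_idx = rng.randint(0, len(y), len(y))             # with replacement
subsample_idx = rng.choice(len(y), size=5, replace=False)  # w/o replacement
w_bagging = subsample_balanced_weights(y, bootstrap_idx)
w_gbrt = subsample_balanced_weights(y, subsample_idx)
```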

@trevorstephens
Contributor Author

@pprett yes, that is how it is working... Let me try to be more coherent with my question :)

In RandomForestClassifier we have the original class labels, which each estimator classifies directly. In GradientBoostingClassifier we are regressing on the residuals. So my question is: does the idea of weighting by the subsample's classes (which the individual estimators never see directly) apply equally to GBMs?

@trevorstephens
Contributor Author

Added a (possibly questionable) hack to pass check_class_weight_classifiers in estimator_checks as well as individual estimator tests.

@@ -758,6 +760,9 @@ def check_class_weight_classifiers(name, Classifier):
        classifier.set_params(n_iter=100)
    if hasattr(classifier, "min_weight_fraction_leaf"):
        classifier.set_params(min_weight_fraction_leaf=0.01)
    if isinstance(classifier, (BaggingClassifier, AdaBoostClassifier)):
Member


I think the fix is good. If you have a better test let me know ;) Maybe add a comment saying "the dataset is noisy and these classifiers are not regularized" or something like that?

Contributor Author


Shall do. Also got rid of the doubling-down on class_weights; that just seemed cheeky to me :) Pared back the boosting rate for Ada instead to get a passing grade.

@trevorstephens
Contributor Author

I've been thinking about the 'subsample' option and feel that, even though it is less direct, the same idea holds: you want to emphasize the minority class if you choose this option.

Also, there are two ways to go in GradientBoostingClassifier: either weight based on the overall subsample distribution, or on the by-class distribution once it is broken down into the individual by-class regression tree fits (sketched below). I have implemented the "overall" version in the current commit. This could be moved down a level so that the weighting is determined at the [1, 0] level if multi-class. I suspect this will make little difference in actual performance, but just putting it out there.
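To make the two options concrete, here is a hedged illustration (an editor's sketch, not the PR's code) of the two levels at which balancing could be applied within one multi-class GBM stage, where each class gets its own regression tree fit against a 0/1 indicator column:

```python
import numpy as np

y = np.array([0, 0, 0, 1, 1, 2])   # multi-class labels in the subsample
K = len(np.unique(y))

# "Overall": one set of weights from the multi-class distribution,
# shared by all K per-class regression tree fits in the stage.
counts = np.bincount(y)
overall_w = (len(y) / (K * counts.astype(float)))[y]

# "By-class": weights recomputed per class from the 0/1 indicator column
# that the k-th regression tree is actually fit against.
by_class_w = []
for k in range(K):
    y_k = (y == k).astype(int)
    counts_k = np.bincount(y_k, minlength=2).astype(float)
    by_class_w.append((len(y_k) / (2 * counts_k))[y_k])
```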

I have also moved around some pep8 things in GBM in prior commits and moved the __init__ stuff around in this one to match the style of other sklearn classes.

Flipping to [MRG] but interested to hear what the tree people think.

@trevorstephens trevorstephens changed the title [WIP] class_weight for Bagging, AdaBoost & GradientBoosting classifiers [MRG] class_weight for Bagging, AdaBoost & GradientBoosting classifiers Jan 21, 2015
@trevorstephens
Contributor Author

I was thinking of implementing a new utils function that would do a lot of the duplicated work and error checks that are present in all of the ensemble (and d-tree) classifiers.

Something like utils.compute_sample_weight(class_weight, y, indices=None)

It could check the dimensions of y and return the class-balanced sample weights (what I have been calling expanded_class_weight), and automatically apply the bootstrap/subsample re-balancing when indices is not None.

This would make transitioning any of the meta estimators to multi-output down the road a little easier too if that's on the map. Anyone have any strong opinions on this?
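To sketch the idea (assuming the signature above; this is illustrative only, not the utility that eventually landed in scikit-learn):

```python
import numpy as np

def compute_sample_weight(class_weight, y, indices=None):
    """Expand a class_weight spec into per-sample weights for y.

    class_weight : dict {class: weight} or 'auto' for balanced weighting
    indices      : optional bootstrap/subsample indices; if given, the
                   'auto' balancing is computed on y[indices] rather than y.
    """
    y = np.asarray(y)
    y_ref = y if indices is None else y[indices]
    classes = np.unique(y_ref)
    if class_weight == 'auto':
        counts = np.array([(y_ref == c).sum() for c in classes], dtype=float)
        weights = {c: len(y_ref) / (len(classes) * n)
                   for c, n in zip(classes, counts)}
    else:
        weights = dict(class_weight)
    # classes absent from the reference sample fall back to a weight of 1.0
    return np.array([weights.get(label, 1.0) for label in y])
```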

@trevorstephens
Contributor Author

I'm closing this PR and will open two new ones: #4215, which refactors this code on top of #4190, and another (later) one that proposes the GBM-style mods that I implemented here. It seems these are two items that should be addressed separately, and lumping them together would probably muddy the waters for potential code reviewers, who are busy enough as it is :-)
