[MRG] class_weight for Bagging, AdaBoost & GradientBoosting classifiers #4114
trevorstephens wants to merge 3 commits into scikit-learn:master from …
Conversation
CC: @pprett as well
@trevorstephens I'm not sure if I completely understand what you are planning to do...
@pprett yes, that is how it is working... Let me try to be more coherent with my question :) In …
Added a (possibly questionable) hack to pass `check_class_weight_classifiers`.
sklearn/utils/estimator_checks.py
I think the fix is good. If you have a better test let me know ;) Maybe add a comment saying "the dataset is noisy and these classifiers are not regularized" or something like that?
Shall do. Also got rid of the doubling down on `class_weight`s, that just seemed cheeky to me :) Pared back the boosting rate for Ada instead to get a passing grade.
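For illustration, the kind of tweak being described, as a sketch only (the `learning_rate` value and dataset here are illustrative, not the PR's actual choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# The common-test dataset is noisy and these classifiers are not
# regularized, so a slower boosting rate keeps the up-weighted class
# from being drowned out by later iterations (value illustrative only).
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
clf = AdaBoostClassifier(learning_rate=0.5, random_state=0).fit(X, y)
```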
I've been thinking about the … Also, there's two ways to go in … I have also moved around some pep8 things in GBM in prior commits and moved the … Flipping to [MRG] but interested to hear what the tree people think.
I was thinking of implementing a new … Something like … It could check the dimension of … This would make transitioning any of the meta estimators to multi-output down the road a little easier too, if that's on the map. Anyone have any strong opinions on this?
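The dropped code spans above make the exact proposal unrecoverable, but a minimal sketch of the kind of shared helper being floated might look like this (name and signature hypothetical): expand `class_weight` into per-sample weights, checking the dimension of `y` to handle multi-output.

```python
import numpy as np

def expand_class_weight(class_weight, y):
    """Hypothetical helper: expand 'auto' or a {class: weight} mapping
    into one weight per sample, accepting 2-D y for multi-output."""
    y = np.asarray(y)
    if y.ndim == 1:
        y = y.reshape(-1, 1)  # single-output becomes one column
    weights = np.ones(y.shape[0])
    for k in range(y.shape[1]):  # multiply weights across outputs
        classes, counts = np.unique(y[:, k], return_counts=True)
        if class_weight == 'auto':
            # weight classes inversely proportional to their frequency
            cw = y.shape[0] / (len(classes) * counts)
        else:
            cw = np.array([class_weight.get(c, 1.0) for c in classes])
        weights *= cw[np.searchsorted(classes, y[:, k])]
    return weights
```

For reference, today's scikit-learn ships `sklearn.utils.class_weight.compute_sample_weight`, which plays a very similar role.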
I'm closing this PR and will open two new ones: #4215, which refactors this code on top of #4190, as well as another (later) one that proposes the GBM-style mods that I implemented here. Seems that these are two items that should be addressed separately and would probably muddy the waters for potential code reviewers, who are busy enough as it is :-)
Early crack at the rest of the ensembles, carrying on from #3961.
My biggest question is how (or if) to handle `class_weight='subsample'` in `GradientBoostingClassifier`. In `RandomForestClassifier`, I weighted based on the bootstrap sample, as I have done in `BaggingClassifier` in this PR. But the applicability to GBM is not so clear to me. Right now I simply use the 'auto' weighting as an initialization of the sample weights, as the fit `sample_weight` param does right now, and as is implemented in `AdaBoostClassifier` in this PR. My initial thought for 'subsample' was to fit the residuals to the re-weighted boosting `sample_weight` while not actually altering it and letting the iterations do their thing, but I feel that we may be venturing far away from the literature there... Any thoughts?
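To make the described 'auto' behavior concrete, a minimal sketch of the equivalence (the `auto_sample_weight` helper is hypothetical, not part of the PR):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def auto_sample_weight(y):
    """Weight each sample inversely proportional to its class frequency,
    i.e. the 'auto' heuristic: n_samples / (n_classes * bincount(y))."""
    classes, counts = np.unique(y, return_counts=True)
    class_weight = len(y) / (len(classes) * counts)
    return class_weight[np.searchsorted(classes, y)]

# Using class_weight='auto' to initialize the boosting sample weights is
# equivalent to passing the expanded weights through fit's sample_weight.
X = np.random.RandomState(0).rand(100, 3)
y = np.r_[np.zeros(80), np.ones(20)]
clf = GradientBoostingClassifier().fit(X, y, sample_weight=auto_sample_weight(y))
```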
Additionally, I'm certain Travis will be upset with me on the infamous `check_class_weight_classifiers` in `estimator_checks`. GBM may get by thanks to the hack here: #3961 (diff), but the two meta estimators do not call `min_weight_fraction_leaf` directly and are sure to fail. As one solution, I could add another bandaid to check if the estimator has `base_estimator` and set a decision tree with a similar bias against lightly weighted leaf nodes. Will add individual estimator checks on top of this soon, of course.
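The bandaid could read roughly like this sketch inside the common test (function name and parameter value illustrative; `base_estimator` was the meta-estimator attribute at the time):

```python
from sklearn.tree import DecisionTreeClassifier

def _regularize_for_class_weight_test(estimator):
    # Hypothetical sketch of the proposed bandaid: bias the underlying
    # trees against lightly weighted leaf nodes so the noisy common-test
    # dataset does not wash out the class weighting.
    if hasattr(estimator, "base_estimator"):
        # Meta-estimators (Bagging, AdaBoost): swap in a regularized tree.
        estimator.set_params(
            base_estimator=DecisionTreeClassifier(min_weight_fraction_leaf=0.01))
    elif hasattr(estimator, "min_weight_fraction_leaf"):
        # Estimators exposing the parameter directly (e.g. GBM).
        estimator.set_params(min_weight_fraction_leaf=0.01)
```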
Calling the previous reviewers for input: @glouppe @amueller @GaelVaroquaux, plus any of you other fine people who are willing to comment :-)