-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MAINT Split Splitter
into a BaseSplitter
and a Splitter
subclass to allow easier inheritance
#24990
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thank you for inspecting this and writing a comprehensive issue, @adam2392! |
I am not convinced. To me, it makes total sense: a splitter is composed of a criterion and it is in charge of dispatching the What is the concrete friction here? |
My main motivation is making the tree class more easily subclassable for other tree models. For example, when we discussed in the monthly dev meeting, there are quantile trees, honest trees, survival trees, oblique trees, unsupervised trees and more. The main reason that they are not included is the maintenance burden due to the tree code not being easy to work with. For this specific setting, the friction arises if one tries to leverage splitter and criterion code for an unsupervised tree model. In #24577 I advocate for making a base class for criterion to make code more modular and abstract. In this, I want to extend that notion for Splitter. Right now the Splitter claims to be abstract, yet in the Another concrete example of how the
It seems unnecessary and doesn't match what the docstrings of the |
I see the argument though to keep
|
Since the trees are, it quite makes sense. If we want to extend to unsupervised, I would only be passing |
I would not expect that we do a copy there but instead point to the original array which is equivalent to passing the pointer. Since I never know if Cython will use the Python or C assignment there, I would need to check the generated C code to affirm what I say. |
This is something that we actually need to do if we want to make it easier to inherit. So I am fine with this part. |
Yes agreed, but then for something like However, as you pointed out, we still want to maintain the chain This change would then allow us to do the separation of splitter into BaseSplitter and Splitter classes cleanly (#24990 (comment)). |
In summary if you and @jjerphan okay with it, then @jshinm and I will work on the following:
Lmk if this sounds correct @glemaitre. I've updated the Issue description to reflect this. I don't expect any performance regressions similar to how in Criterion PR (#24678) there were none. However, if there are any differences, most likely it would be speed improvements if any if Criterion is actually making an additional copy of |
I think it makes sense. I don't see any use case where we actually want to change |
In an unsupervised setting,
The rest is just extra imo. Thus the goal is to separate the notion of initializing the data vs initializing the sample-mask/pointers. Sure someone could define it as |
BaseSplitter
and a Splitter
subclass to allow easier inheritance
BaseSplitter
and a Splitter
subclass to allow easier inheritanceSplitter
into a BaseSplitter
and a Splitter
subclass to allow easier inheritance
Ah I see, I think I misunderstood what you meant here @glemaitre. The purpose of this change I propose is not to say we want to change The reason the change would work is that Before moving forward, I want to confirm again that the 3 proposed changes are valid? |
In theory, I would say yes. I think the best is to implement the changes you propose in dedicated and atomic PR so that we can see if there is any problem with each part of this refactoring. Does that work for you, @adam2392? Do you need any help? |
Hi @jjerphan thanks for the confirmation! No I think overall this change is pretty easy and simple. Would it be helpful for me to make one bigger PR to demonstrate what the overall change would look like and then do smaller sequential PRs for the actual merging? |
Yes, that's a good way to proceed. 👍 |
Summary
With #24678, we make it easier for the
Criterion
class to be inherited. However, the Splitter class can also leverage this improvement. We should separate the currentSplitter
class into an abstract base class for "any type of splitting"BaseSplitter
and an abstract supervisededly splitter class that requiresy
labelsSplitter
. By keeping the names the same, this change is mostly backwards-compatible.Moreover, we should refactor what
criterion.init
does into its two intended functionalities: i) setting data and ii) moving pointers around and updating criterion statistics.Proposed improvement
Based on discussion below, we want to preserve the
y
parameter passing chain ofTree -> Splitter -> Criterion
. With the exception ofsplitter.init
, all other functions can and should be part of the base class without any notion of whether or not the splitter is supervised, or unsupervised. In order to achieve this, we need to separate wherey
is passed to the criterion (currently it is done withinnode_reset
.criterion.init
into two functions for initializing the data and resetting the pointers (scikit-learn/sklearn/tree/_criterion.pyx
Lines 44 to 69 in b728b2e
i)
criterion.init(y, sample_weight, weighted_n_samples, samples)
, which initializes the data for criterion.ii)
criterion.set_pointer(start, end)
, which sets the pointers to the start/end of the samples we considersplitter.init
to pass the value ofy
to the criterion via the newcriterion.init
function (scikit-learn/sklearn/tree/_splitter.pyx
Line 96 in b728b2e
splitter.node_reset
to callcriterion.set_pointer
instead ofcriterion.init
. (scikit-learn/sklearn/tree/_splitter.pyx
Line 180 in 9c9c858
Splitter
intoBaseSplitter
andSplitter
class. TheBaseSplitter
will contain all the methods except forinit
.Splitter
will subclassBaseSplitter
and require theinit
function.This makes Splitter easier to subclass and also removes some of the interdependency of parameters between Tree, Splitter and Criterion.
Once the changes are made, one should verify:
tree
submodule's Cython code still builds (i.e.make clean
and thenpip install --verbose --no-build-isolation --editable .
should not error out)sklearn/tree
all passasv continuous --verbose --split --bench RandomForest upstream/main <new_branch_name>
and then for side-by-side comparisonasv compare main <new_branch_name>
Reference
As discussed in #24577 , I wrote up a doc on proposed improvements to the tree submodule that would:
cc: @jjerphan
@jshinm I think this is a good next issue to tackle because it is more involved, but does not require making any backward incompatible changes to the sklearn tree code.
The text was updated successfully, but these errors were encountered: