10000 MAINT Split `Splitter` into a `BaseSplitter` and a `Splitter` subclass to allow easier inheritance · Issue #24990 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

MAINT Split Splitter into a BaseSplitter and a Splitter subclass to allow easier inheritance #24990

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
adam2392 opened this issue Nov 20, 2022 · 15 comments · May be fixed by #25101
Open

MAINT Split Splitter into a BaseSplitter and a Splitter subclass to allow easier inheritance #24990

adam2392 opened this issue Nov 20, 2022 · 15 comments · May be fixed by #25101

Comments

@adam2392
Copy link
Member
adam2392 commented Nov 20, 2022

Summary

With #24678, we make it easier for the Criterion class to be inherited. However, the Splitter class can also leverage this improvement. We should separate the current Splitter class into an abstract base class for "any type of splitting" BaseSplitter and an abstract supervisededly splitter class that requires y labels Splitter. By keeping the names the same, this change is mostly backwards-compatible.

Moreover, we should refactor what criterion.init does into its two intended functionalities: i) setting data and ii) moving pointers around and updating criterion statistics.

Proposed improvement

Based on discussion below, we want to preserve the y parameter passing chain of Tree -> Splitter -> Criterion. With the exception of splitter.init, all other functions can and should be part of the base class without any notion of whether or not the splitter is supervised, or unsupervised. In order to achieve this, we need to separate where y is passed to the criterion (currently it is done within node_reset.

  1. Refactor criterion.init into two functions for initializing the data and resetting the pointers (
    cdef int init(self, const DOUBLE_t[:, ::1] y, DOUBLE_t* sample_weight,
    double weighted_n_samples, SIZE_t* samples, SIZE_t start,
    SIZE_t end) nogil except -1:
    """Placeholder for a method which will initialize the criterion.
    Returns -1 in case of failure to allocate memory (and raise MemoryError)
    or 0 otherwise.
    Parameters
    ----------
    y : array-like, dtype=DOUBLE_t
    y is a buffer that can store values for n_outputs target variables
    sample_weight : array-like, dtype=DOUBLE_t
    The weight of each sample
    weighted_n_samples : double
    The total weight of the samples being considered
    samples : array-like, dtype=SIZE_t
    Indices of the samples in X and y, where samples[start:end]
    correspond to the samples in this node
    start : SIZE_t
    The first sample to be used on this node
    end : SIZE_t
    The last sample used on this node
    """
    pass
    ):
    i) criterion.init(y, sample_weight, weighted_n_samples, samples), which initializes the data for criterion.
    ii) criterion.set_pointer(start, end), which sets the pointers to the start/end of the samples we consider
  2. Refactor splitter.init to pass the value of y to the criterion via the new criterion.init function (
    cdef int init(self,
    )
  3. Refactor splitter.node_reset to call criterion.set_pointer instead of criterion.init. (
    self.criterion.init(self.y,
    )
  4. Refactor Splitter into BaseSplitter and Splitter class. The BaseSplitter will contain all the methods except for init. Splitter will subclass BaseSplitter and require the init function.

This makes Splitter easier to subclass and also removes some of the interdependency of parameters between Tree, Splitter and Criterion.

Once the changes are made, one should verify:

  1. If tree submodule's Cython code still builds (i.e. make clean and then pip install --verbose --no-build-isolation --editable . should not error out)
  2. verify unit tests inside sklearn/tree all pass
  3. verify that the asv benchmarks do not show a performance regression.

asv continuous --verbose --split --bench RandomForest upstream/main <new_branch_name> and then for side-by-side comparison asv compare main <new_branch_name>

Reference

As discussed in #24577 , I wrote up a doc on proposed improvements to the tree submodule that would:

  1. make it easier for 3rd party packages to subclass existing sklearn tree code and
  2. make it easier for sklearn itself to make improvements to the tree code with many of the modern improvements to trees

cc: @jjerphan

@jshinm I think this is a good next issue to tackle because it is more involved, but does not require making any backward incompatible changes to the sklearn tree code.

@github-actions github-actions bot added the Needs Triage Issue requires triage label Nov 20, 2022
@adam2392 adam2392 changed the title [MAINT] Improve the handling of 'y' parameter in tree and splitter [MAINT] Improve the handling of 'y' parameter in splitter and criterion to make code more modular Nov 20, 2022
@jjerphan
Copy link
Member

Thank you for inspecting this and writing a comprehensive issue, @adam2392!

@glemaitre
Copy link
Member

Currently, the y parameter is passed around from tree -> splitter -> criterion, but this is not necessary and introduces unnecessary friction in subclassing the existing tree code.

I am not convinced. To me, it makes total sense: a splitter is composed of a criterion and it is in charge of dispatching the X and y. Initializing with an external y at the beginning is quite surprising if I am looking at the implementation because looking at the Splitter and Criterion code, I have no clue where it comes from.

What is the concrete friction here?

@adam2392
Copy link
Member Author

What is the concrete friction here?

My main motivation is making the tree class more easily subclassable for other tree models. For example, when we discussed in the monthly dev meeting, there are quantile trees, honest trees, survival trees, oblique trees, unsupervised trees and more. The main reason that they are not included is the maintenance burden due to the tree code not being easy to work with.

For this specific setting, the friction arises if one tries to leverage splitter and criterion code for an unsupervised tree model. In #24577 I advocate for making a base class for criterion to make code more modular and abstract. In this, I want to extend that notion for Splitter. Right now the Splitter claims to be abstract, yet in the init code by passing in y, it explicitly assumes it is supervised. However, a tree splitter's job is simply splitting "X". Criterion measures the values of those splits via some function of X, y, or combination of the two (in sklearn currently, it is done only as functions of y).

Another concrete example of how the Splitter class is not truly "abstract" modular base class is in splitter.node_reset, we might expect the node to reset the pointers and the values within criterion. However, it re-initializes y every time as well. This implicitly like the Criterion, assumes it is supervised.

cdef int node_reset(self, SIZE_t start, SIZE_t end,
                        double* weighted_n_node_samples) nogil except -1:
        """Reset splitter on node samples[start:end].

        Returns -1 in case of failure to allocate memory (and raise MemoryError)
        or 0 otherwise.

        Parameters
        ----------
        start : SIZE_t
            The index of the first sample to consider
        end : SIZE_t
            The index of the last sample to consider
        weighted_n_node_samples : ndarray, dtype=double pointer
            The total weight of those samples
        """

        self.start = start
        self.end = end

        self.criterion.init(self.y,
                            self.sample_weight,
                            self.weighted_n_samples,
                            &self.samples[0],
                            start,
                            end)

        weighted_n_node_samples[0] = self.criterion.weighted_n_node_samples
        return 0

It seems unnecessary and doesn't match what the docstrings of the criterion.in 8000 it and splitter.node_reset say. Moreover, it does not allow one to leverage criterion.node_reset code.

@adam2392
Copy link
Member Author

I am not convinced. To me, it makes total sense: a splitter is composed of a criterion and it is in charge of dispatching the X and y. Initializing with an external y at the beginning is quite surprising if I am looking at the implementation because looking at the Splitter and Criterion code, I have no clue where it comes from.

I see the argument though to keep y within splitter to pass to criterion. However, I wonder if you're okay then with these two suggestions:

  1. to make the code more readable
  • refactor criterion init into an initialization portion for y, sample_weight and such and a reset_pointers portion to reset start, pos, end pointers
  • pass y to Criterion when calling splitter.init to initialize the data
  • refactor node_reset inside splitter to reset pointers instead of "re-initializing the entire array"
  1. separate splitter into a BaseSplitter and Splitter abstract class?

@glemaitre
Copy link
Member

Right now the Splitter claims to be abstract, yet in the init code by passing in y, it explicitly assumes it is supervised.

Since the trees are, it quite makes sense. If we want to extend to unsupervised, I would only be passing y=None by default.

@glemaitre
Copy link
Member

refactor node_reset inside splitter to reset pointers instead of "re-initializing the entire array"

I would not expect that we do a copy there but instead point to the original array which is equivalent to passing the pointer. Since I never know if Cython will use the Python or C assignment there, I would need to check the generated C code to affirm what I say.

@glemaitre
Copy link
Member

separate splitter into a BaseSplitter and Splitter abstract class?

This is something that we actually need to do if we want to make it easier to inherit. So I am fine with this part.

@adam2392
Copy link
Member Author

Since the trees are, it quite makes sense. If we want to extend to unsupervised, I would only be passing y=None by default.

Yes agreed, but then for something like node_reset it seems like it would be a function within the abstract base class, BaseSplitter that does not need to know about what y is. node_reset is just resetting a pointer.

However, as you pointed out, we still want to maintain the chain y -> Tree -> Splitter -> Criterion. Then within the supervised subclass Splitter, we would need to move the passing of y from node_reset to the init function.

This change would then allow us to do the separation of splitter into BaseSplitter and Splitter classes cleanly (#24990 (comment)).

@adam2392
Copy link
Member Author
adam2392 commented Nov 20, 2022

In summary if you and @jjerphan okay with it, then @jshinm and I will work on the following:

  1. separate Splitter into a BaseSplitter and Splitter class
  2. refactor splitter.node_reset to only reset the pointers necessary and move the passing of the y array into the splitter.init function; this still maintains the y -> Tree -> Splitter -> Criterion chain.
  3. refactor criterion.init into two functions: i) initializing the data arrays (y, sample_weight, etc.) and ii) initializing the pointers.

Lmk if this sounds correct @glemaitre. I've updated the Issue description to reflect this.

I don't expect any performance regressions similar to how in Criterion PR (#24678) there were none. However, if there are any differences, most likely it would be speed improvements if any if Criterion is actually making an additional copy of y when calling criterion.init (#24990 (comment)).

@adam2392 adam2392 changed the title [MAINT] Improve the handling of 'y' parameter in splitter and criterion to make code more modular [MAINT, Cython, Tree] Separate the Splitter class into a "BaseSplitter" and a "Splitter" class to allow easier inheritance Nov 20, 2022
@glemaitre
Copy link
Member

refactor splitter.node_reset to only reset the pointers necessary and move the passing of the y array into the splitter.init function; this still maintains the y -> Tree -> Splitter -> Criterion chain.

I think it makes sense. I don't see any use case where we actually want to change y. node_reset could be an abstract method of BaseSplitter and defined as a reset of the pointer in Splitter for our supervised splitter. Someone could just define it as return None if there is no need in some unsupervised cases.

@adam2392
Copy link
Member Author

node_reset could be an abstract method of BaseSplitter and defined as a reset of the pointer in Splitter for our supervised splitter. Someone could just define it as return None if there is no need in some unsupervised cases.

In an unsupervised setting, node_reset can be the exact same as supervised setting. That is node_reset is just about resetting the "samples" i.e. rows (can be rows of y, X or both presumably). That really just involves the following parameters:

samples : array-like, dtype=SIZE_t
            A mask on the samples, showing which ones we want to use
        start : SIZE_t
            The first sample to use in the mask
        end : SIZE_t
            The last sample to use in the mask

The rest is just extra imo. Thus the goal is to separate the notion of initializing the data vs initializing the sample-mask/pointers. Sure someone could define it as return None, but at least for unsupervised, I noticed that splitter.node_reset is still useful, so why not keep it as part of the abstract BaseSplitter? Lmk if this is along the lines of what you're thinking.

@jjerphan jjerphan changed the title [MAINT, Cython, Tree] Separate the Splitter class into a "BaseSplitter" and a "Splitter" class to allow easier inheritance MAINT Split the Splitter class into a BaseSplitter and a Splitter subclass to allow easier inheritance Nov 21, 2022
@jjerphan jjerphan changed the title MAINT Split the Splitter class into a BaseSplitter and a Splitter subclass to allow easier inheritance MAINT Split Splitter into a BaseSplitter and a Splitter subclass to allow easier inheritance Nov 21, 2022
@adam2392
Copy link
Member Author

refactor splitter.node_reset to only reset the pointers necessary and move the passing of the y array into the splitter.init function; this still maintains the y -> Tree -> Splitter -> Criterion chain.
I think it makes sense. I don't see any use case where we actually want to change y. node_reset could be an abstract method of BaseSplitter and defined as a reset of the pointer in Splitter for our supervised splitter. Someone could just define it as return None if there is no need in some unsupervised cases.

Ah I see, I think I misunderstood what you meant here @glemaitre. The purpose of this change I propose is not to say we want to change y, but to separate the notion of "initialization" and "resetting" inside Criterion and the splitter.node_reset function. This would allow any subclass of Criterion to be supervised, or unsupervised. Moreover, any subclass of Splitter can also be supervised, or unsupervised.

The reason the change would work is that y is not being used when we call splitter.node_reset, except the very first time.

Before moving forward, I want to confirm again that the 3 proposed changes are valid?

@jjerphan
Copy link
Member

Before moving forward, I want to confirm again that the 3 proposed changes are valid?

In theory, I would say yes.

I think the best is to implement the changes you propose in dedicated and atomic PR so that we can see if there is any problem with each part of this refactoring.

Does that work for you, @adam2392? Do you need any help?

@adam2392
Copy link
Member Author

Hi @jjerphan thanks for the confirmation!

No I think overall this change is pretty easy and simple.

Would it be helpful for me to make one bigger PR to demonstrate what the overall change would look like and then do smaller sequential PRs for the actual merging?

@jjerphan
Copy link
Member

Yes, that's a good way to proceed. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants
0