8000 [RFC] Modularize 'Criterion' class inside tree submodule to allow 3rd party extensions for data-based criterion · Issue #24577 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[RFC] Modularize 'Criterion' class inside tree submodule to allow 3rd party extensions for data-based criterion #24577

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
adam2392 opened this issue Oct 4, 2022 · 7 comments

Comments

@adam2392
Copy link
Member
adam2392 commented Oct 4, 2022

Summary

There exists other criterions, where additional arguments would need to be passed in. The current Criterion class is not actually a generic Criterion class since it explicitly assumes that the Criterion is supervised-learning based and does NOT require the data X itself. The init(...) function explicitly passes in y.

I am trying to implement and subclass Criterion to implement Unsupervised Random Forests (https://arxiv.org/abs/1907.02844), which use the data within the Criterion. I would like to subclass the Criterion class for my own UnsupervisedCriterion. However, the init function assumes access to a y label, but everything else about it is fine and correct.

Instead I would like to define my own init function, but keep the rest of the Criterion API. This way, I can define an unsupervised criterion with the init function:

    cdef int init(self, const DTYPE_T[:, ::1] X, DOUBLE_t* sample_weight,
                  double weighted_n_samples, SIZE_t* samples, SIZE_t start,
                  SIZE_t end) nogil except -1

and I could then define the rest of the functions, while making sure I am in strict compliance with sklearn's Tree submodule.

Proposed Solution

Add a class called BaseCriterion and migrate all functions/properties of the current Criterion class, EXCEPT for init function to BaseCriterion.

This is completely backwards-compatable and ensures 3rd party API can leverage the existing API contract for Criterions using the BaseCriterion class.

Alternatives

Alternatively, I can do what scikit-survival does and pass in this additional data to __cinit__, but this is weird because you basically need to pass in data to a Criterion class before it's used with Tree/Splitter. Rn init is called within the Tree/Splitter, which ensures the y that Tree/Splitter have access to is the same as Criterion. If you are forced to pass things through __cinit__ it is quite possible to introduce runtime bugs because your data changes after you have already instantiated a criterion class.

Alternatives2

Another possibility that I'm not 100% sure on is that Criterion.init(...) is passed y and just resets it to the passed in y data. However, it never modifies y. Instead, it should only init the 'y' variable once.... Perhaps something like Criterion.init_data(y), which separates initialization of the data compared to initialization of different variables.

@github-actions github-actions bot added the Needs Triage Issue requires triage label Oct 4, 2022
@jjerphan
Copy link
Member
jjerphan commented Oct 6, 2022

cc @Vincent-Maladiere and @ogrisel who potentially might be interested in this regarding extension Criterion in 3rd party scikit-learn-like packages.

#24000 might also be relevant but has a broader scope.

@jjerphan jjerphan changed the title [REFACTOR] Modularize 'Criterion' class inside tree submodule to allow 3rd party extensions for data-based criterion [RFC] Modularize 'Criterion' class inside tree submodule to allow 3rd party extensions for data-based criterion Oct 6, 2022
@jjerphan jjerphan added the Refactor Code refactor label Oct 6, 2022
@adam2392
Copy link
Member Author
adam2392 commented Oct 6, 2022

FYI, I have also written up a more lengthy proposal on improving tree submodule. See #24000 (comment)

@jjerphan
Copy link
Member

Thanks for proposing a solutions with alternatives. The proposed solution makes sense to me.

Even if it modifies the current Cython API used by third party libraries, I think performing this change will help them in the long-term.

@lorentzenchr
Copy link
Member

@scikit-learn/core-devs This needs a decision.

@adam2392 What a about passing a fake y and just never use it?

In total, I‘m a bit concerned about building an unofficial API for a 3rd party library (where are the unsupervised trees supposed to live?) which thereby becomes official making development in scikit-learn harder.

@adam2392
Copy link
Member Author
adam2392 commented Jan 29, 2024

What a about passing a fake y and just never use it?

I think passing a fake y could work. I have not thought enough about it.

In total, I‘m a bit concerned about building an unofficial API for a 3rd party library (where are the unsupervised trees supposed to live?) which thereby becomes official making development in scikit-learn harder.

Totally understand and agree that most likely unsupervised trees, or other variations will not make it into scikit-learn in general.

The core of the issue was moreso on what are the relevant hard assumptions built into the Criterion object. Not every feature of statistical ML goes into scikit-learn, but many people extend the sklearn design for other packages (e.g. survival trees, EconML, scikit-image, etc.). If there is a way to make the design make more sense for a dev and also provide more flexibility to 3rd parties, it would be nice. Rn for example, I thought it didn't make sense that a Criterion must have knowledge of y.

This issue was meant to discuss that. By abstracting a "Base Criterion" object out, it defines the API/design for which a tree should compute metrics for performing a split.

For reference, here is where I have implemented a version of unsupervised trees: https://github.com/neurodata/scikit-tree/blob/main/sktree/tree/ 8000 unsupervised/_unsup_criterion.pyx.

@adrinjalali
Copy link
Member

Can't find it now, but a couple of years ago I was working on criterion, and finishing that would mean changing the API on this base criterion class.

I'm happy to have a much better developer API here, but we need to be very clear about the fact that it won't follow the usual stability / deprecation cycle as other parts of the library.

@adam2392
Copy link
Member Author
adam2392 commented Feb 1, 2024

Also, I am happy to contribute to the maintenance as well.

In general to summarize the points in this thread and others, private extension API is fine if:

  1. documentation clearly states no stability
  2. benchmarks against main demonstrate the trees do not lose performance in terms of runtime (accuracy is a given)
  3. the introduced abstractions are not difficult to maintain and are not too complex

I think the last point is where I see usually there is opposition, but from my experience, having a well-defined abstract API is a good thing.

I'm happy to have a much better developer API here, but we need to be very clear about the fact that it won't follow the usual stability / deprecation cycle as other parts of the library.

Awesome that there is positive sentiment in that direction. What do other ppl think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants
0