8000 [RFC] Modularize 'Criterion' class inside tree submodule to allow 3rd party extensions for data-based criterion · Issue #24577 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content
[RFC] Modularize 'Criterion' class inside tree submodule to allow 3rd party extensions for data-based criterion #24577
Open
@adam2392

Description

@adam2392

Summary

There exists other criterions, where additional arguments would need to be passed in. The current Criterion class is not actually a generic Criterion class since it explicitly assumes that the Criterion is supervised-learning based and does NOT require the data X itself. The init(...) function explicitly passes in y.

I am trying to implement and subclass Criterion to implement Unsupervised Random Forests (https://arxiv.org/abs/1907.02844), which use the data within the Criterion. I would like to subclass the Criterion class for my own UnsupervisedCriterion. However, the init function assumes access to a y label, but everything else about it is fine and correct.

Instead I would like to define my own init function, but keep the rest of the Criterion API. This way, I can define an unsupervised criterion with the init function:

    cdef int init(self, const DTYPE_T[:, ::1] X, DOUBLE_t* sample_weight,
                  double weighted_n_samples, SIZE_t* samples, SIZE_t start,
                  SIZE_t end) nogil except -1

and I could then define the rest of the functions, while making sure I am in strict compliance with sklearn's Tree submodule.

Proposed Solution

Add a class called BaseCriterion and migrate all functions/properties of the current Criterion class, EXCEPT for init function to BaseCriterion.

This is completely backwards-compatable and ensures 3rd party API can leverage the existing API contract for Criterions using the BaseCriterion class.

Alternatives

Alternatively, I can do what scikit-survival does and pass in this additional data to __cinit__, but this is weird because you basically need to pass in data to a Criterion class before it's used with Tree/Splitter. Rn init is called within the Tree/Splitter, which ensures the y that Tree/Splitter have access to is the same as Criterion. If you are forced to pass things through __cinit__ it is quite possible to introduce runtime bugs because your data changes after you have already instantiated a criterion class.

Alternatives2

Another possibility that I'm not 100% sure on is that Criterion.init(...) is passed y and just resets it to the passed in y data. However, it never modifies y. Instead, it should only init the 'y' variable once.... Perhaps something like Criterion.init_data(y), which separates initialization of the data compared to initialization of different variables.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0