-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[RFC] Modularize 'Criterion' class inside tree submodule to allow 3rd party extensions for data-based criterion #24577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
cc @Vincent-Maladiere and @ogrisel who potentially might be interested in this regarding extension #24000 might also be relevant but has a broader scope. |
FYI, I have also written up a more lengthy proposal on improving tree submodule. See #24000 (comment) |
Thanks for proposing a solutions with alternatives. The proposed solution makes sense to me. Even if it modifies the current Cython API used by third party libraries, I think performing this change will help them in the long-term. |
@scikit-learn/core-devs This needs a decision. @adam2392 What a about passing a fake y and just never use it? In total, I‘m a bit concerned about building an unofficial API for a 3rd party library (where are the unsupervised trees supposed to live?) which thereby becomes official making development in scikit-learn harder. |
I think passing a fake y could work. I have not thought enough about it.
Totally understand and agree that most likely unsupervised trees, or other variations will not make it into scikit-learn in general. The core of the issue was moreso on what are the relevant hard assumptions built into the Criterion object. Not every feature of statistical ML goes into scikit-learn, but many people extend the sklearn design for other packages (e.g. survival trees, EconML, scikit-image, etc.). If there is a way to make the design make more sense for a dev and also provide more flexibility to 3rd parties, it would be nice. Rn for example, I thought it didn't make sense that a Criterion must have knowledge of y. This issue was meant to discuss that. By abstracting a "Base Criterion" object out, it defines the API/design for which a tree should compute metrics for performing a split. For reference, here is where I have implemented a version of unsupervised trees: https://github.com/neurodata/scikit-tree/blob/main/sktree/tree/ 8000 unsupervised/_unsup_criterion.pyx. |
Can't find it now, but a couple of years ago I was working on criterion, and finishing that would mean changing the API on this base criterion class. I'm happy to have a much better developer API here, but we need to be very clear about the fact that it won't follow the usual stability / deprecation cycle as other parts of the library. |
Also, I am happy to contribute to the maintenance as well. In general to summarize the points in this thread and others, private extension API is fine if:
I think the last point is where I see usually there is opposition, but from my experience, having a well-defined abstract API is a good thing.
Awesome that there is positive sentiment in that direction. What do other ppl think? |
Summary
There exists other criterions, where additional arguments would need to be passed in. The current
Criterion
class is not actually a generic Criterion class since it explicitly assumes that the Criterion is supervised-learning based and does NOT require the dataX
itself. Theinit(...)
function explicitly passes iny
.I am trying to implement and subclass
Criterion
to implement Unsupervised Random Forests (https://arxiv.org/abs/1907.02844), which use the data within the Criterion. I would like to subclass theCriterion
class for my ownUnsupervisedCriterion
. However, theinit
function assumes access to ay
label, but everything else about it is fine and correct.Instead I would like to define my own
init
function, but keep the rest of the Criterion API. This way, I can define an unsupervised criterion with theinit
function:and I could then define the rest of the functions, while making sure I am in strict compliance with sklearn's Tree submodule.
Proposed Solution
Add a class called
BaseCriterion
and migrate all functions/properties of the currentCriterion
class, EXCEPT forinit
function toBaseCriterion
.This is completely backwards-compatable and ensures 3rd party API can leverage the existing API contract for Criterions using the
BaseCriterion
class.Alternatives
Alternatively, I can do what
scikit-survival
does and pass in this additional data to__cinit__
, but this is weird because you basically need to pass in data to a Criterion class before it's used with Tree/Splitter. Rninit
is called within the Tree/Splitter, which ensures they
thatTree/Splitter
have access to is the same asCriterion
. If you are forced to pass things through__cinit__
it is quite possible to introduce runtime bugs because your data changes after you have already instantiated a criterion class.Alternatives2
Another possibility that I'm not 100% sure on is that
Criterion.init(...)
is passedy
and just resets it to the passed iny
data. However, it never modifiesy
. Instead, it should only init the 'y' variable once.... Perhaps something likeCriterion.init_data(y)
, which separates initialization of the data compared to initialization of different variables.The text was updated successfully, but these errors were encountered: