Description
Summary
There exists other criterions, where additional arguments would need to be passed in. The current Criterion
class is not actually a generic Criterion class since it explicitly assumes that the Criterion is supervised-learning based and does NOT require the data X
itself. The init(...)
function explicitly passes in y
.
I am trying to implement and subclass Criterion
to implement Unsupervised Random Forests (https://arxiv.org/abs/1907.02844), which use the data within the Criterion. I would like to subclass the Criterion
class for my own UnsupervisedCriterion
. However, the init
function assumes access to a y
label, but everything else about it is fine and correct.
Instead I would like to define my own init
function, but keep the rest of the Criterion API. This way, I can define an unsupervised criterion with the init
function:
cdef int init(self, const DTYPE_T[:, ::1] X, DOUBLE_t* sample_weight,
double weighted_n_samples, SIZE_t* samples, SIZE_t start,
SIZE_t end) nogil except -1
and I could then define the rest of the functions, while making sure I am in strict compliance with sklearn's Tree submodule.
Proposed Solution
Add a class called BaseCriterion
and migrate all functions/properties of the current Criterion
class, EXCEPT for init
function to BaseCriterion
.
This is completely backwards-compatable and ensures 3rd party API can leverage the existing API contract for Criterions using the BaseCriterion
class.
Alternatives
Alternatively, I can do what scikit-survival
does and pass in this additional data to __cinit__
, but this is weird because you basically need to pass in data to a Criterion class before it's used with Tree/Splitter. Rn init
is called within the Tree/Splitter, which ensures the y
that Tree/Splitter
have access to is the same as Criterion
. If you are forced to pass things through __cinit__
it is quite possible to introduce runtime bugs because your data changes after you have already instantiated a criterion class.
Alternatives2
Another possibility that I'm not 100% sure on is that Criterion.init(...)
is passed y
and just resets it to the passed in y
data. However, it never modifies y
. Instead, it should only init the 'y' variable once.... Perhaps something like Criterion.init_data(y)
, which separates initialization of the data compared to initialization of different variables.