sklearn/tree/
by splitting the "Splitter" and "Partitioner" code #29459Description
Summary of the problem
There are quite a number of GH issues with the label tree
(https://github.com/scikit-learn/scikit-learn/issues?page=2&q=is%3Aopen+is%3Aissue+label%3Amodule%3Atree).
However, the code is a bit hard to approach as there's so many moving parts. For example, see: #18448 (comment), where users are intimidated by the Cython codebase. A large part of the complexity comes in the splitter, which contains the most algorithmic logic. I think with the work done by @thomasjpfan recently, we were able to pull apart the idea of a "partitioner" and "splitter" in the trees.
This problem also arises because https://github.com/scikit-learn/scikit-learn/pull/29437/files#diff-e2cca285e1e883ab1d427120dfa974c1ba83eb6e2f5d5f416bbd99717ca5f5fc makes the code-diff very large inside _splitter.pyx
file making it also harder to review and understand what changes affect the "splitter" and what changes affect the "partitioner".
Proposed change
I further propose to split these into separate files, so it is easier to maintain, read and determine what is affecting what. In addition, we can decrease the code complexity by creating an abstract base class for the DensePartitioner
and SparsePartitioner
.
See the following PR for an example of what would change. The _splitter.pyx
file would decrease by over 800 LOC, while it gets moved to _partitioner.pyx
.