MAINT, RFC Simplify the Cython code in sklearn/tree/ by splitting the "Splitter" and "Partitioner" code

@thomasjpfan

Summary of the problem

There are quite a number of GH issues with the label tree (https://github.com/scikit-learn/scikit-learn/issues?page=2&q=is%3Aopen+is%3Aissue+label%3Amodule%3Atree).

However, the code is a bit hard to approach as there's so many moving parts. For example, see: #18448 (comment), where users are intimidated by the Cython codebase. A large part of the complexity comes in the splitter, which contains the most algorithmic logic. I think with the work done by @thomasjpfan recently, we were able to pull apart the idea of a "partitioner" and "splitter" in the trees.

This problem also arises because https://github.com/scikit-learn/scikit-learn/pull/29437/files#diff-e2cca285e1e883ab1d427120dfa974c1ba83eb6e2f5d5f416bbd99717ca5f5fc makes the code-diff very large inside _splitter.pyx file making it also harder to review and understand what changes affect the "splitter" and what changes affect the "partitioner".

Proposed change

I further propose to split these into separate files, so it is easier to maintain, read and determine what is affecting what. In addition, we can decrease the code complexity by creating an abstract base class for the DensePartitioner and SparsePartitioner.

See the following PR for an example of what would change. The _splitter.pyx file would decrease by over 800 LOC, while it gets moved to _partitioner.pyx.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Summary of the problem

Proposed change

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Description

Summary of the problem

Proposed change

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions