8000 MAINT, RFC Simplify the Cython code in `sklearn/tree/` by splitting the "Splitter" and "Partitioner" code · Issue #29459 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content
MAINT, RFC Simplify the Cython code in sklearn/tree/ by splitting the "Splitter" and "Partitioner" code #29459
Closed
@adam2392

Description

@adam2392

Summary of the problem

There are quite a number of GH issues with the label tree (https://github.com/scikit-learn/scikit-learn/issues?page=2&q=is%3Aopen+is%3Aissue+label%3Amodule%3Atree).

However, the code is a bit hard to approach as there's so many moving parts. For example, see: #18448 (comment), where users are intimidated by the Cython codebase. A large part of the complexity comes in the splitter, which contains the most algorithmic logic. I think with the work done by @thomasjpfan recently, we were able to pull apart the idea of a "partitioner" and "splitter" in the trees.

This problem also arises because https://github.com/scikit-learn/scikit-learn/pull/29437/files#diff-e2cca285e1e883ab1d427120dfa974c1ba83eb6e2f5d5f416bbd99717ca5f5fc makes the code-diff very large inside _splitter.pyx file making it also harder to review and understand what changes affect the "splitter" and what changes affect the "partitioner".

Proposed change

I further propose to split these into separate files, so it is easier to maintain, read and determine what is affecting what. In addition, we can decrease the code complexity by creating an abstract base class for the DensePartitioner and SparsePartitioner.

See the following PR for an example of what would change. The _splitter.pyx file would decrease by over 800 LOC, while it gets moved to _partitioner.pyx.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0