Creating a Custom Splitter for Decision Trees with Scikit-learn #26031
I am designing a custom splitter for decision trees, similar to the BestSplitter (splitter="best") provided by Scikit-learn. My use case requires more competition across multiple candidate features, and the original implementation doesn't suit our needs. We also have a unique type of flattened data that we want to handle with this slight variation, which was previously implemented in Java using the WEKA decision tree. Our goal is to create a Scikit-like custom estimator in Python that integrates with our all-Python system, which would simplify the overall process and improve maintainability.

I've spent hours searching GitHub and Stack Overflow but couldn't find any detailed examples. I successfully built Scikit-learn from source and experimented with the _splitter.pyx file, but none of my changes on the Cython side seemed to take effect (e.g., basic printf() calls produced nothing on stdout, with or without the GIL). I did confirm that the Python API of Scikit-learn (_classes.py) printed information before calling the Cython code, which at first made me think I was on the right track, but then Cython got in the way.

Is this practically impossible to achieve in a relatively short amount of time (not months)? Or would some guidance help: any thoughts, past experience, tickets, or a past GitHub fork where someone has done something similar? Alternatively, if this task proves too cumbersome, can anyone recommend another reliable open-source Python decision tree implementation that we can modify or extend to include our custom splitting procedure at each node? We would prefer to stick with Scikit-learn, but if that isn't possible we'll have to find a way around it. Our desired implementation is actually relatively simple, but the additional abstraction layer introduced by Cython makes it challenging.
I'm willing to tackle this challenge, but without examples or guidance, I'm concerned it might not be a productive path. Happy to hear any guidance or experience you have on the matter 👌

PS: If it helps, I could share a rough Python snippet giving a high-level overview of what our variant would look like, independent of Scikit-learn.

Thank you so much in advance!
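For readers unfamiliar with what a "best" splitter does at each node, here is a minimal plain-Python sketch of an exhaustive split search. It assumes Gini impurity and numeric features, and the names `gini` and `best_split` are hypothetical, not part of Scikit-learn. It only illustrates the general idea and is not the custom variant described above, nor Scikit-learn's actual Cython implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """Exhaustively search every feature and candidate threshold.

    X: list of rows (lists of numbers), y: list of class labels.
    Returns (feature_index, threshold, weighted_impurity), or
    (None, None, impurity_of_y) if no split improves on the parent.
    """
    n = len(y)
    best = (None, None, gini(y))
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            # Impurity of the two children, weighted by their sizes.
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best


# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[2.0, 1.0], [3.0, 1.5], [10.0, 1.2], [11.0, 0.9]]
y = [0, 0, 1, 1]
feature, threshold, impurity = best_split(X, y)
```

A custom splitting rule would replace the scoring step inside the inner loop; the surrounding search over features and thresholds stays the same.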
Replies: 1 comment 18 replies
Migrating discussion here from #10251 (comment).

To actually extend the trees, you'll have to look through the source code. As the sklearn devs mentioned, the Cython API is not stable and can change quickly. Therefore, we wanted to stay in line with upstream as much as possible to reduce scope. The source code does, though, show how to extend both the Python API and the private Cython API.
Hi there,
I was finally able to add a new node split function on the Cython side of the decision tree implementation, working from the fork's branch as you suggested. I added my new hyperparameters wherever they were needed and have now implemented my strategy. It works flawlessly. By the end of the year, I hope to have time to publish an article on Medium explaining how I modified the node split procedure of a decision tree using scikit tree.
In the meantime, I wish you a fantastic week. Thank you for your wonderful support and your patience. I will also share the Medium article in this discussion when it's done as it is not…