Description
Summary
While missing-value support for decision trees has been added recently, it only works when the missing values are encoded in a dense array. Since `RandomForest*` and `ExtraTrees*` both support sparse `X`, a user who encodes `np.nan` inside a sparse `X` should be able to fit these models as well.
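To make the scenario concrete, here is a minimal sketch of what "encoding `np.nan` inside sparse `X`" looks like, using only NumPy and SciPy. The toy matrix and variable names are illustrative assumptions, not taken from the scikit-learn code base.

```python
import numpy as np
from scipy import sparse

# Toy data (hypothetical): mostly zeros, with two missing values.
X_dense = np.array(
    [
        [np.nan, 0.0, 1.0],
        [2.0, 0.0, 0.0],
        [0.0, 3.0, 0.0],
        [4.0, 0.0, np.nan],
    ]
)

# Convert to CSC, the column-oriented sparse format.
X_csc = sparse.csc_matrix(X_dense)

# NaN compares unequal to zero, so each NaN is kept as an explicitly
# stored entry, while true zeros stay implicit.
n_stored = X_csc.nnz
n_missing = int(np.isnan(X_csc.data).sum())
```

The missing values therefore sit in `X_csc.data` next to the ordinary non-zero values, which is why the sparse code path needs its own handling for them.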
Solution
Add missing-value logic to `SparsePartitioner` in `_partitioner.pyx`, and to `BestSparseSplitter` and `RandomSparseSplitter` in `_splitter.pyx`.
The logic is the same as in the dense case; it only has to account for `X` now being stored in sparse CSC format.
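As a rough illustration of the difference from the dense case, the pure-Python sketch below partitions the samples of one CSC column around a threshold. It is not the Cython implementation: the function `partition_csc_column` and its `missing_go_to_left` argument are hypothetical names chosen for this example, and it visits every row for clarity rather than exploiting sparsity.

```python
import math


def partition_csc_column(data, indices, indptr, n_samples, col,
                         threshold, missing_go_to_left):
    """Split row indices of one CSC column into (left, right) lists.

    Hypothetical helper for illustration only.
    """
    start, end = indptr[col], indptr[col + 1]

    # Rows that do not appear in indices[start:end] hold an implicit zero.
    values = [0.0] * n_samples
    for k in range(start, end):
        values[indices[k]] = data[k]

    left, right = [], []
    for row, value in enumerate(values):
        if math.isnan(value):
            # Missing values are all sent to the same side of the split.
            (left if missing_go_to_left else right).append(row)
        elif value <= threshold:
            left.append(row)
        else:
            right.append(row)
    return left, right


# A 4x3 CSC matrix written out as plain lists; column 0 is [nan, 2, 0, 4].
nan = float("nan")
data = [nan, 2.0, 4.0, 3.0, 1.0, nan]
indices = [0, 1, 3, 2, 0, 3]
indptr = [0, 3, 4, 6]

split_right = partition_csc_column(data, indices, indptr, 4, 0, 1.0, False)
split_left = partition_csc_column(data, indices, indptr, 4, 0, 1.0, True)
```

The point of the sketch is that a sparse column has three kinds of samples to place: explicitly stored values, implicit zeros, and explicitly stored NaNs. The dense case only has to distinguish values from NaNs.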
Misc.
FYI https://github.com/scikit-learn/scikit-learn/pull/27966 will introduce native support for missing values in the `ExtraTree*` models (i.e. random splitter).
One thing I noticed though as I went through the PR is that the current codebase still does not support missing values in the sparse splitter. I think this might be pretty easy to add, but should we re-open this issue technically?
Xref: #5870 (comment)
Originally posted by @adam2392 in #5870 (comment)