Closed
Description
Currently the tree fitting procedure tries all possible splits between unique values of each feature in find_best_split and _smallest_sample_larger_than in _tree.pyx. This can be very expensive for large datasets.
TMVA [1] implements this same procedure as well as a mode that histograms each feature with a fixed number of bins [2]:
The cut values are optimised by scanning over the variable range with a granularity that is set
via the option nCuts. The default value of nCuts=20 proved to be a good compromise between
computing time and step size. Finer stepping values did not increase noticeably the performance
of the BDTs. However, a truly optimal cut, given the training sample, is determined by setting
nCuts=-1. This invokes an algorithm that tests all possible cuts on the training sample and finds
the best one.
A similar feature in scikit-learn would be nice and should offer a speedup on large datasets with many features. Hopefully this issue can be converted to a PR soon (preferably after #1488 is merged). Just noting it here for future reference and to incite discussion.
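To make the comparison concrete, here is a minimal pure-Python sketch of the two strategies for a single feature. It is not the code in _tree.pyx and not TMVA's implementation: the function names (`exhaustive_best_split`, `binned_best_split`, `gini_from_counts`), the `n_bins` parameter, equal-width binning, and the use of Gini impurity as the criterion are all assumptions made for illustration. `n_bins` plays a role similar to nCuts, but the exact semantics in TMVA may differ.

```python
def gini_from_counts(counts):
    """Gini impurity of a node given its per-class sample counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def _weighted_gini(left, total, n):
    """Sample-weighted Gini impurity of the two children of a split."""
    right = [t - l for t, l in zip(total, left)]
    n_left = sum(left)
    n_right = n - n_left
    return (n_left * gini_from_counts(left)
            + n_right * gini_from_counts(right)) / n


def exhaustive_best_split(values, labels):
    """Try every midpoint between consecutive unique values (sort + scan)."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    total = [0] * len(classes)
    for _, y in pairs:
        total[index[y]] += 1
    left = [0] * len(classes)
    best_t, best_score = None, float("inf")
    for i in range(n - 1):
        x, y = pairs[i]
        left[index[y]] += 1
        nxt = pairs[i + 1][0]
        if nxt == x:  # no threshold can separate equal values
            continue
        score = _weighted_gini(left, total, n)
        if score < best_score:
            best_t, best_score = (x + nxt) / 2.0, score
    return best_t, best_score


def binned_best_split(values, labels, n_bins=20):
    """Histogram the feature into n_bins equal-width bins, then scan
    only the n_bins - 1 interior bin edges as candidate thresholds."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return None, float("inf")
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    width = (hi - lo) / n_bins
    # hist[b][k] = number of samples of class k falling in bin b
    hist = [[0] * len(classes) for _ in range(n_bins)]
    for x, y in zip(values, labels):
        b = min(int((x - lo) / width), n_bins - 1)
        hist[b][index[y]] += 1
    total = [sum(col) for col in zip(*hist)]
    n = len(values)
    left = [0] * len(classes)
    best_t, best_score = None, float("inf")
    for b in range(n_bins - 1):
        left = [l + h for l, h in zip(left, hist[b])]
        if sum(left) == 0 or sum(left) == n:  # one child would be empty
            continue
        score = _weighted_gini(left, total, n)
        if score < best_score:
            best_t, best_score = lo + width * (b + 1), score
    return best_t, best_score
```

The exhaustive scan needs a sort per feature and evaluates up to one candidate per sample, while the binned variant makes one linear pass to fill the histogram and then evaluates at most `n_bins - 1` candidates, at the cost of a threshold that is only accurate to the bin width.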
[1] http://tmva.sourceforge.net/
[2] http://tmva.sourceforge.net/docu/TMVAUsersGuide.pdf (sections 8.12.2 and 8.12.3)