Speed up decision tree fitting on large datasets

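Currently, the tree fitting procedure tries every possible split between consecutive unique values of each feature (see find_best_split and _smallest_sample_larger_than in _tree.pyx). This can be very expensive for large datasets, because the number of candidate thresholds per feature can grow with the number of samples.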
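
To make the cost concrete, here is a minimal pure-Python sketch of an exhaustive split search on a single feature. It is not the code in _tree.pyx: the function names (candidate_thresholds, gini, best_split_exhaustive) are hypothetical, and the Gini criterion and the midpoint thresholds are assumptions chosen for illustration.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive unique values of one feature."""
    unique = sorted(set(values))
    return [(lo + hi) / 2.0 for lo, hi in zip(unique, unique[1:])]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split_exhaustive(values, labels):
    """Try every candidate threshold; return (threshold, weighted impurity)."""
    n = len(values)
    best = (None, float("inf"))
    for t in candidate_thresholds(values):
        # Samples with x <= t go left, the rest go right.
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best
```

With n distinct values there are n - 1 candidate thresholds per feature, and this naive version rescans all samples for each one. A sorted, incremental scan avoids the rescans, but the number of candidates still grows with the data.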

TMVA [1] implements both this exhaustive procedure and a mode that histograms each feature with a fixed number of bins. From the TMVA Users Guide [2]:

> The cut values are optimised by scanning over the variable range with a granularity that is set
> via the option nCuts. The default value of nCuts=20 proved to be a good compromise between
> computing time and step size. Finer stepping values did not increase noticeably the performance
> of the BDTs. However, a truly optimal cut, given the training sample, is determined by setting
> nCuts=-1. This invokes an algorithm that tests all possible cuts on the training sample and finds
> the best one.

A similar option in scikit-learn would be useful and should speed up fitting on large datasets with many features. Hopefully this issue can be converted into a PR soon (preferably after #1488 is merged). I am noting it here for future reference and to invite discussion.

[1] http://tmva.sourceforge.net/
[2] http://tmva.sourceforge.net/docu/TMVAUsersGuide.pdf (sections 8.12.2 and 8.12.3)
