speedup for decision trees with large datasets #1532
Closed
@ndawe

Description

Currently the tree-fitting procedure tries all possible splits between consecutive unique values of each feature (see find_best_split and _smallest_sample_larger_than in _tree.pyx). This can be very expensive for large datasets.
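
For reference, here is a rough pure-Python sketch of what the exhaustive scan amounts to for a single feature. The actual code is Cython and uses an incremental impurity criterion; best_split_exhaustive and the plain variance criterion below are illustrative only, not the real implementation.

```python
import numpy as np

def best_split_exhaustive(x, y):
    """Try a threshold between every pair of consecutive distinct values."""
    order = np.argsort(x)                      # O(n log n) sort per feature
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue                           # no split between equal values
        threshold = 0.5 * (x_sorted[i - 1] + x_sorted[i])
        left, right = y_sorted[:i], y_sorted[i:]
        # weighted variance as a stand-in impurity criterion
        score = len(left) * left.var() + len(right) * right.var()
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold
```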

TMVA [1] implements both this exhaustive procedure and a mode that histograms each feature into a fixed number of bins [2]:

The cut values are optimised by scanning over the variable range with a granularity that is set via the option nCuts. The default value of nCuts=20 proved to be a good compromise between computing time and step size. Finer stepping values did not increase noticeably the performance of the BDTs. However, a truly optimal cut, given the training sample, is determined by setting nCuts=-1. This invokes an algorithm that tests all possible cuts on the training sample and finds the best one.

A similar feature in scikit-learn would be nice and should offer a speedup on large datasets with many features. Hopefully this issue can be converted to a PR soon (preferably after #1488 is merged). Just noting it here for future reference and to invite discussion.
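
A minimal sketch of the binned alternative, assuming evenly spaced candidate thresholds over the feature range in the spirit of TMVA's nCuts (best_split_binned and the variance criterion are again illustrative names, not proposed API):

```python
import numpy as np

def best_split_binned(x, y, n_cuts=20):
    """Test only the n_cuts - 1 interior boundaries of an even binning."""
    lo, hi = x.min(), x.max()
    if lo == hi:
        return None                            # constant feature, nothing to split
    # n_cuts bins over [lo, hi] give n_cuts - 1 interior candidate thresholds
    thresholds = np.linspace(lo, hi, n_cuts + 1)[1:-1]
    best_threshold, best_score = None, np.inf
    for threshold in thresholds:
        mask = x <= threshold
        left, right = y[mask], y[~mask]
        if len(left) == 0 or len(right) == 0:
            continue                           # one side empty: not a valid split
        score = len(left) * left.var() + len(right) * right.var()
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold
```

This sketch recomputes both sides for each candidate; a real implementation would presumably histogram the targets once per feature and evaluate all n_cuts - 1 candidates from the accumulated bin statistics, so each sample is touched once and no sort is needed.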

[1] http://tmva.sourceforge.net/
[2] http://tmva.sourceforge.net/docu/TMVAUsersGuide.pdf (sections 8.12.2 and 8.12.3)
