Adding Oblique Trees (Forest-RC) to the Cythonized Tree Module #20819
Open
@adam2392

Description

Describe the workflow you want to enable

I am part of the @neurodata team. We are proposing to add a cythonized module that allows for building oblique trees.

Oblique trees, otherwise known as Forest-RC, were originally proposed by Breiman (2001) [1] and would enable a more flexible class of decision trees. Breiman called this variant Forest-RC, in contrast to what most people now call random forest, which he called Forest-RI. He noted:

"Forest-RC does exceptionally well on the synthetic data sets. Overall, it compares more favorably to Adaboost than Forest-RI."

In other words, in the paper introducing random forests over 20 years ago, now with over 75,000 citations, Breiman presented two algorithms and argued that the one which splits on sparse linear combinations of features worked better than the axis-aligned one that most people have used since. More recently, several authors have corroborated these original results [2-6]. Oblique trees in general give trees a great deal of flexibility to take advantage of structure in the data.

[1]: Breiman, L. (2001). Random Forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324
[2]: Tomita, T. M., Browne, J., Shen, C., Chung, J., Patsolic, J. L., Falk, B., Yim, J., Priebe, C. E., Burns, R., Maggioni, M., & Vogelstein, J. T. (2020). Sparse Projection Oblique Randomer Forests. Journal of Machine Learning Research.
[3]: Tomita, T. M., Maggioni, M., & Vogelstein, J. T. (2017). ROFLMAO: Robust Oblique Forests with Linear MAtrix Operations. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 498–506. Society for Industrial and Applied Mathematics.
[4]: Rainforth, T., & Wood, F. (2015). Canonical Correlation Forests.
[5]: Perry, R., Li, A., Huynh, C., Tomita, T. M., Mehta, R., Arroyo, J., Patsolic, J., Falk, B., & Vogelstein, J. T. (2019). Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks.
[6]: Menze, B., Kelm, B., Splitthoff, D., Köthe, U., & Hamprecht, F. (2011). On Oblique Random Forests. 453–469. 10.1007/978-3-642-23783-6_29.
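
To make the Forest-RI/Forest-RC distinction concrete, here is a minimal, self-contained sketch of the splitting idea using only stock scikit-learn: project the data onto sparse random ±1 linear combinations of features and grow an ordinary axis-aligned tree in the projected space. This is only an illustration of the principle with a single fixed projection; the actual Forest-RC implementation draws fresh sparse projections at every split inside the splitter, so the numbers printed here are not a benchmark of the proposal.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=0)

# Candidate features are sparse linear combinations of the original features
# (mostly zero weights, a few +/-1 weights), in the spirit of Forest-RC.
n_proj = 20
proj = rng.choice([0.0, 0.0, 0.0, -1.0, 1.0], size=(X.shape[1], n_proj))
X_proj = X @ proj

axis_aligned = DecisionTreeClassifier(random_state=0)
oblique_like = DecisionTreeClassifier(random_state=0)
print("axis-aligned:", cross_val_score(axis_aligned, X, y, cv=5).mean())
print("oblique-like:", cross_val_score(oblique_like, X_proj, y, cv=5).mean())
```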

Describe your proposed solution

The implementation of Forest-RC (oblique forests) has been ported into a scikit-learn fork (neurodata#11). The Forest-RC Cython implementation builds cleanly on top of the existing Cython code for the regular decision trees and therefore does not affect the runtime or performance of Random Forests at all. Moreover, it improves upon RF on a number of CC-18 benchmarks (see the figures below).

The .pxd files can also be included in each scikit-learn release, so that other sklearn-compatible forest modules can build upon our implementation of the Oblique tree.
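
For users, the intent is a drop-in estimator that follows the standard scikit-learn API. The sketch below is only illustrative: the class name ObliqueRandomForestClassifier, its import path, and the feature_combinations parameter are assumptions made for this example and may not match the fork exactly.

```python
# Hypothetical usage sketch: the oblique class name, import path, and the
# feature_combinations parameter are illustrative assumptions, not final API.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ObliqueRandomForestClassifier  # hypothetical
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest_ri = RandomForestClassifier(n_estimators=100, random_state=0)
forest_rc = ObliqueRandomForestClassifier(
    n_estimators=100,
    feature_combinations=1.5,  # hypothetical: avg. features per candidate projection
    random_state=0,
)

print("Forest-RI:", cross_val_score(forest_ri, X, y, cv=5).mean())
print("Forest-RC:", cross_val_score(forest_rc, X, y, cv=5).mean())
```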

Relevant Benchmarks

To ease scikit-learn developer concerns about incorporating such a large code change, we can provide the following benchmarks and tests of the Cythonized Forest-RC vs. the Cythonized Forest-RI. These benchmarks were run on the OpenML CC-18 suite. max_features was set to n_features, so runtime was very high for both RF and oblique forests on e.g. Fashion-MNIST.

  1. Benchmark tests on CC-18 vs. Random Forests (see the two figures below): Forest-RC is significantly better than RF on this benchmark suite. A paired, one-sided Wilcoxon signed-rank test was run; Forest-RC was significantly better than RF with a p-value of roughly 1e-20 (a minimal sketch of this test is included after this list).

  [Two figures: per-dataset CC-18 benchmark results, Forest-RC vs. Forest-RI]

  2. Overall runtime (training times on CC-18) vs. Random Forests

  [Figure: CC-18 training-time comparison, Forest-RC vs. Forest-RI]

  3. Runtime dependence on the number of features and the number of samples: in addition, we benchmarked the runtime of oblique trees/forests using the code provided in benchmarks/bench_tree.py. The notebook is shown in this gist: https://gist.github.com/adam2392/a652aa1c88ba94f0d6aab96ccd3ade24

  4. Visualizing the intuitive differences between oblique and axis-aligned trees: per a discussion with @thomasjpfan, he wanted to see how we can educate an incoming user about when and how oblique trees differ. We can extend the existing example in sklearn. The following notebook shows the decision surface of an oblique tree vs. an axis-aligned tree: https://gist.github.com/adam2392/78519091104cecfeb8ff796eac7e8115
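
For reference, the significance test in item 1 can be reproduced with SciPy along the following lines; the accuracy arrays below are placeholders, assuming one accuracy per CC-18 dataset for each forest variant.

```python
# Sketch of the paired, one-sided Wilcoxon signed-rank test from item 1.
# acc_rc / acc_ri should hold per-dataset CC-18 accuracies (same dataset
# order); the random placeholders below only make the snippet runnable.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
acc_ri = rng.uniform(0.70, 0.95, size=72)                         # Forest-RI accuracies
acc_rc = np.clip(acc_ri + rng.normal(0.01, 0.01, size=72), 0, 1)  # Forest-RC accuracies

# H1: Forest-RC accuracies are greater than Forest-RI accuracies.
stat, p_value = wilcoxon(acc_rc, acc_ri, alternative="greater")
print(f"Wilcoxon statistic = {stat:.1f}, p-value = {p_value:.3g}")
```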

Additional context

cc: @jovo

Some action items are:

  1. Unit testing (I will probably need some direction as to where tests can be added, e.g. as parametrizations of existing tests, and what sort of tests the maintainers would like to see)
  2. Make the official PR so developers/maintainers can track progress
