[MRG + 3] Store values for all nodes instead of just leaves for decision trees by andosa · Pull Request #4655 · scikit-learn/scikit-learn

Closed
wants to merge 1 commit

Conversation

@andosa commented Apr 30, 2015

Store values for all nodes instead of just leaves, as discussed in #4644 and required in #2937.

Not surprisingly, the impact on performance is minimal.
Benchmark:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50000, n_features=50, n_informative=5,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

def test_dt(X, y):
    dt = DecisionTreeClassifier()
    dt.fit(X, y).predict(X)

Master:

In [4]: %timeit -r5 test_dt(X,y)
1 loops, best of 5: 5.96 s per loop

PR:

In [6]: %timeit -r5 test_dt(X,y)
1 loops, best of 5: 5.96 s per loop
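
For readers coming from #2937: with values stored for every node, the fitted tree can be walked and inspected at split nodes as well as leaves. A minimal sketch (mine, not part of this PR), using the public tree_.value, children_left and node_count attributes:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

t = clf.tree_
# t.value has shape (node_count, n_outputs, max_n_classes); with this PR
# the rows for split (non-leaf) nodes are populated as well, not only leaves.
for node_id in range(t.node_count):
    kind = "leaf " if t.children_left[node_id] == -1 else "split"
    print(node_id, kind, t.value[node_id].ravel())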

@amueller (Member) commented:

Can you try on, say, MNIST, covertype and/or newsgroups? There should be scripts in the benchmarks folder.

@andosa (Author) commented May 1, 2015

MNIST data (best of several runs for each).

Master

>python bench_mnist.py --classifiers CART

Classifier               train-time   test-time   error-rate
------------------------------------------------------------
CART                         81.48s       0.10s       0.1214

PR

Classifier               train-time   test-time   error-rate
------------------------------------------------------------
CART                         81.70s       0.09s       0.1214

Both actually vary by ±1 s on my machine.

@glouppe (Contributor) commented May 1, 2015

That is nice :) +1

cc @arjoly

@amueller (Member) commented May 1, 2015

LGTM (but having @arjoly sign off would obviously be nice ;)

@amueller amueller changed the title Store values for all nodes instead of just leaves for decision trees [MRG + 2] Store values for all nodes instead of just leaves for decision trees May 1, 2015
@arjoly (Member) commented May 4, 2015

Here are some results with ensembles. This looks good. +1

Master (MNIST)
Classifier               train-time   test-time   error-rate
------------------------------------------------------------
ExtraTrees                   48.47s       0.53s       0.0288
RandomForest                 46.93s       0.46s       0.0304
CART                         23.58s       0.02s       0.1214


This PR (MNIST)

Classifier               train-time   test-time   error-rate
------------------------------------------------------------
ExtraTrees                   49.21s       0.53s       0.0288
RandomForest                 46.86s       0.46s       0.0304
CART                         23.57s       0.18s       0.1214


Master (covertype)

Classifier   train-time test-time error-rate
--------------------------------------------
RandomForest  69.7185s   0.3960s     0.0334  
ExtraTrees    46.1684s   0.7399s     0.0366  
CART          16.2451s   0.0689s     0.0425  

This PR (covertype)

Classifier   train-time test-time error-rate
--------------------------------------------
RandomForest  67.9607s   0.3838s     0.0334  
ExtraTrees    44.3711s   0.5905s     0.0366  
CART          15.9335s   0.0721s     0.0425

@arjoly arjoly changed the title [MRG + 2] Store values for all nodes instead of just leaves for decision trees [MRG + 3] Store values for all nodes instead of just leaves for decision trees May 4, 2015
The lines added in the tree builder, as shown in the diff:

# Store value for all nodes, to facilitate tree/model inspection and interpretation
splitter.node_value(tree.value +
                    node_id * tree.value_stride)

Review comment (Member):

you can use a single line and still fit in less than 80 columns now that the indent level was reduced by 4 spaces:

                splitter.node_value(tree.value + node_id * tree.value_stride)
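
For context, a rough sketch (mine, not from the PR) of how that flat value buffer is indexed, assuming value_stride equals n_outputs * max_n_classes as in the Cython tree code:

import numpy as np

# Hypothetical sizes, for illustration only.
node_count, n_outputs, max_n_classes = 7, 1, 3
value_stride = n_outputs * max_n_classes  # assumed layout

# The builder writes into one flat double buffer; node i's block starts at
# offset i * value_stride, which is the pointer splitter.node_value() fills.
flat_value = np.zeros(node_count * value_stride)
node_id = 2
block = flat_value[node_id * value_stride:(node_id + 1) * value_stride]
# Reshaped, this block corresponds to what is exposed as tree_.value[node_id]
# at the Python level.
block = block.reshape(n_outputs, max_n_classes)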

@ogrisel (Member) commented May 4, 2015

Let me do the cosmetic fix and merge this.

@ogrisel (Member) commented May 4, 2015

Merged as 3594925
