MRG: Add sample weighting to decision trees by glouppe · Pull Request #1488 · scikit-learn/scikit-learn

Closed · 38 commits

Conversation

@glouppe (Contributor) commented Dec 23, 2012

This PR is the first part of #522. It adds sample weighting to decision trees and forests.
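For readers skimming the thread, a rough sketch of what sample weighting means for tree construction. This is an illustration in plain Python, not the PR's actual Cython code in `_tree`: the class counts used by the impurity criterion become weighted sums.

```python
# Illustrative only -- the real implementation lives in the Cython
# _tree module. With sample weights, the Gini criterion uses weighted
# class totals instead of raw counts.

def weighted_gini(labels, weights):
    """Gini impurity of a node, computed from weighted class totals."""
    totals = {}
    for y, w in zip(labels, weights):
        totals[y] = totals.get(y, 0.0) + w
    total = sum(totals.values())
    return 1.0 - sum((t / total) ** 2 for t in totals.values())

# With unit weights this reduces to the ordinary Gini impurity:
print(weighted_gini([0, 0, 1, 1], [1, 1, 1, 1]))  # 0.5
# Up-weighting class 0 makes the node look purer:
print(weighted_gini([0, 0, 1, 1], [3, 3, 1, 1]))  # 0.375
```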

tasks and `max_features=n_features` on regression problems. If "sqrt",
then `max_features=sqrt(n_features)`. If "log2", then
`max_features=log2(n_features)`. If None, then
`max_features=n_features`.
Member:

Why does None mean max_features=n_features rather than, say, 'max'? And why is the default not 'auto'?
Sorry for nitpicking, but this puzzled me the first time I used the tree estimators.

Contributor Author:

By default, a decision tree should not be random. Hence max_features=None.

Note however that max_features="auto" by default in forests.
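As a sketch of the semantics discussed above (the helper name here is hypothetical; the real logic lives inside the estimators):

```python
import math

def resolve_max_features(max_features, n_features):
    """Map the documented max_features options to a feature count."""
    if max_features is None:
        # None: consider all features, so a single tree is not random.
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    return max_features  # assume an integer was given

print(resolve_max_features(None, 64))    # 64
print(resolve_max_features("sqrt", 64))  # 8
print(resolve_max_features("log2", 64))  # 6
```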

@arjoly (Member) commented Dec 23, 2012

sklearn/ensemble/_gradient_boosting.c is modified, but there is no corresponding change to sklearn/ensemble/_gradient_boosting.pyx.

Thanks @glouppe and @ndawe for the PR.

@glouppe (Contributor Author) commented Dec 23, 2012

@arjoly _gradient_boosting.c needs to be regenerated because of changes in the cython _tree module it imports.

@glouppe (Contributor Author) commented Dec 23, 2012

Done! I also added the modifications done by @pprett to rewrite bootstrapping using sample weights. This should result in a noticeable speedup when using random forests with default parameters :)
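The bootstrapping-via-weights trick can be sketched like this (an illustrative reconstruction, not the PR's exact code): rather than materializing a resampled copy of the data for each tree, the tree receives integer sample weights counting how often each sample was drawn.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 10

# Draw a bootstrap sample of indices, with replacement...
indices = rng.randint(0, n_samples, n_samples)

# ...but instead of building X[indices] (a full copy of the data),
# record how often each sample was drawn and use that as its weight.
curr_sample_weight = np.bincount(indices, minlength=n_samples).astype(np.float64)

print(curr_sample_weight.sum())  # 10.0 -- total weight equals n_samples
# Samples that were never drawn get weight 0 and are ignored by the tree.
```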

@pprett (Member) commented Dec 23, 2012

I did some benchmarks to compare this PR against R's randomForest package (via rpy2).

You can see the results below: I used 100 trees, max_features=3, and min_samples_leaf=1 except for MNIST where I used 10 trees. All results are averaged over 3 runs.

I highlighted some of the "interesting" differences.

Most notable: the difference on the two synthetic datasets, hastie_10_2 and random_gaussians; both have relatively few features. The MNIST numbers might be affected by overhead from the rpy2 wrapper (the array is quite large...)


Columns: r = R's randomForest (via rpy2), py = this PR, master = scikit-learn master. Values are mean (std) over the runs.

madelon

       r                py               master
score  0.4178 (0.01)    0.4117 (0.01)    0.4156 (0.02)
train  10.6725 (0.12)   11.1650 (0.13)   22.2776 (0.05)
test   0.0895 (0.00)    0.0165 (0.00)    0.0158 (0.00)

hastie_10_2

       r                py               master
score  0.1383 (0.01)    0.1389 (0.00)    0.1393 (0.00)
train  0.3696 (0.01)    1.3119 (0.02)    1.7072 (0.00)
test   0.1651 (0.01)    0.1369 (0.00)    0.1284 (0.00)

spam

       r                py               master
score  0.0543 (0.00)    0.0438 (0.00)    0.0451 (0.00)
train  1.7327 (0.01)    2.2989 (0.01)    3.7778 (0.03)
test   0.0358 (0.00)    0.0253 (0.00)    0.0248 (0.00)

landsat

       r                py               master
score  0.0000 (0.00)    0.0000 (0.00)    0.0000 (0.00)
train  2.5476 (0.03)    4.3610 (0.02)    7.3381 (0.01)
test   0.1137 (0.00)    0.1164 (0.00)    0.1100 (0.00)

arcene

       r                py               master
score  0.2667 (0.01)    0.2467 (0.02)    0.2000 (0.02)
train  4.2622 (0.02)    4.9789 (0.07)    11.0852 (0.07)
test   0.1840 (0.01)    0.0061 (0.00)    0.0060 (0.00)

random_gaussian

       r                py               master
score  0.1422 (0.01)    0.1383 (0.01)    0.1400 (0.01)
train  0.3666 (0.00)    1.3010 (0.02)    1.7090 (0.03)
test   0.1538 (0.00)    0.1373 (0.00)    0.1292 (0.00)

mnist

       r                py               master
score  0.1176 (0.01)    0.0694 (0.00)    0.0699 (0.00)
train  128.7950 (2.39)  72.2359 (0.07)   147.0456 (3.91)
test   1.1124 (0.08)    0.0451 (0.00)    0.0467 (0.00)

@glouppe (Contributor Author) commented Dec 23, 2012

If you have time, would you mind running these on master too, to see the benefit of switching to sample weighting? I am curious :)

@pprett (Member) commented Dec 23, 2012

@glouppe I've added the master timings to the tables: a 2-fold improvement on all reasonably sized datasets. random_gaussian is quite interesting... R is way faster than our implementation... I'll have to investigate.

btw: you can find the benchmark script here https://gist.github.com/4089926 - it requires my landsat branch [1]

[1] https://github.com/pprett/scikit-learn/tree/landsat-dataset

@amueller (Member):

Any reason why master is more accurate on arcene? That seems a bit weird, right?

@amueller (Member):

Btw, pretty cool speedup overall :) I'll try to review this branch soon, but probably after Christmas (maybe @bdholt1 has a more qualified opinion?)

sample_mask=sample_mask, X_argsorted=X_argsorted,
sample_counts = np.bincount(indices)

if len(sample_counts) < n_samples:
Member:

There is a backport of the new bincount in utils. It provides the minlength keyword, which does exactly this ;)

Contributor Author:

Thanks for the hint! I changed that. Too bad all these functions are buried (and hidden) in our codebase :(

Member:

It is actually a standard numpy function, just not present in numpy 1.3.0. I added the backport recently.

If you have any ideas on how to make it more discoverable, that would be great :)
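For reference, the behaviour being discussed (numpy spells the keyword `minlength`):

```python
import numpy as np

indices = np.array([0, 0, 3])

# Without minlength, bincount stops at the largest index seen:
print(np.bincount(indices))                # [2 0 0 1]

# With minlength (numpy >= 1.6; older versions need the backport in
# scikit-learn's utils), the result is padded to the requested length:
print(np.bincount(indices, minlength=6))   # [2 0 0 1 0 0]
```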

@pprett (Member) commented Dec 24, 2012

@glouppe arcene is quite a noisy dataset - the standard deviation of the score is also quite large (0.02) - so IMHO the difference should be OK.

The two small datasets (hastie and random gaussians) indicate that there is considerable (constant) overhead somewhere in the code which is only noticeable when the number of features is quite low (both have only 10 features)...

@amueller (Member):

You are right; I think I misread the standard deviation by an order of magnitude. The difference is OK.
I guess the constant overhead is a matter for another PR ;)

@amueller (Member):

I think it would be great if we had a more explicit test, maybe on some synthetic data. For example, an XOR problem with a bit of label noise, where weighting the samples one way yields a horizontal decision boundary and weighting them the other way yields a vertical one.
I'm not sure that is a great test; maybe something with just 4 hand-picked samples on a line would be better, I don't know.

My impression is that there is currently no test for fine-grained changes to the weights, as is needed for boosting, for example.
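A hypothetical version of such a test (all names here are made up for illustration; a real test would go through the estimators' sample_weight argument): on XOR data, where no single axis-aligned split helps when samples are unweighted, the weights alone decide whether a vertical or a horizontal split wins.

```python
def gini(pairs):
    """Gini impurity from (label, weight) pairs."""
    total = sum(w for _, w in pairs)
    if total == 0.0:
        return 0.0
    by_class = {}
    for y, w in pairs:
        by_class[y] = by_class.get(y, 0.0) + w
    return 1.0 - sum((w / total) ** 2 for w in by_class.values())

def best_axis(X, y, sample_weight):
    """Axis (0 or 1) whose split at 0.5 gives the lowest weighted impurity."""
    scores = []
    for axis in (0, 1):
        left = [(yi, wi) for xi, yi, wi in zip(X, y, sample_weight)
                if xi[axis] < 0.5]
        right = [(yi, wi) for xi, yi, wi in zip(X, y, sample_weight)
                 if xi[axis] >= 0.5]
        w_left = sum(w for _, w in left)
        w_right = sum(w for _, w in right)
        scores.append((w_left * gini(left) + w_right * gini(right))
                      / (w_left + w_right))
    return scores.index(min(scores))

# XOR: unweighted, both axes are equally (un)informative.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]

# Up-weight the two left-hand points: a split on axis 1 (horizontal) wins.
assert best_axis(X, y, [5, 5, 1, 1]) == 1
# Up-weight the two bottom points: a split on axis 0 (vertical) wins.
assert best_axis(X, y, [5, 1, 5, 1]) == 0
```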

@pprett (Member) commented Dec 24, 2012

I profiled one of the small datasets: ~90% of the time is spent in Tree.build, i.e. in extension code. I ran yep to profile the extension; most of the time is spent in recursive_partition. 50% goes to find_best_split and roughly another 40% to malloc, dealloc, and (what looks like) fancy indexing.

Interesting: the time spent in ClassificationCriterion.update is roughly equal to the time spent in RandomState.shuffle / RandomState.permutation.

As far as I can see, there is no single statement or function to blame for the performance issue.

You can find the pprof call graph here: https://dl.dropbox.com/u/402667/pprof_forest_hastie.ps

@glouppe (Contributor Author) commented Dec 24, 2012

@pprett Thanks for the profiling! I think there are still places where things can be optimized. The one I have in mind is getting rid of permutation and shuffling self.features in place instead. This would avoid the useless memory allocations/deallocations that permutation incurs.

@amueller Yep, you are right. I'll add some more fine-grained tests later today.
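The optimization described above, in miniature (illustrative sketch):

```python
import numpy as np

rng = np.random.RandomState(42)
features = np.arange(10)

# rng.permutation allocates and returns a brand-new array on every call:
perm = rng.permutation(features)

# rng.shuffle reorders the existing array in place -- no temporary
# allocation in the hot loop:
rng.shuffle(features)

# Both are permutations of the same elements; only the allocation differs.
print(sorted(perm) == sorted(features))  # True
```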

@arjoly (Member) commented Dec 24, 2012

@glouppe Try using shuffle through a cached bound method, random_state_shuffle = random_state.shuffle. It should reduce the Python overhead a bit.

Not far from 100% coverage. Nice.

Name                                             Stmts   Miss  Cover   Missing
------------------------------------------------------------------------------
sklearn.tree.tree                                  204      8    96%   109, 127, 129, 133, 272, 275, 279, 306
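The bound-method caching suggested above looks like this (a micro-optimization: it only saves the repeated attribute lookup per loop iteration):

```python
import numpy as np

rng = np.random.RandomState(0)
features = np.arange(100)

# Hoist the bound method out of the loop so `rng.shuffle` is not
# looked up on every iteration:
rng_shuffle = rng.shuffle
for _ in range(1000):
    rng_shuffle(features)
```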

# Methods
cdef void resize(self, int capacity=*)

cpdef build(self, np.ndarray X, np.ndarray y,
np.ndarray sample_mask=*, np.ndarray X_argsorted=*)
Member:

What is the * syntax?

Contributor Author:

It is used for default values in pxd files. Don't ask why the syntax is like that, though ;)

http://docs.cython.org/src/reference/language_basics.html#optional-arguments
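In case it helps future readers, a minimal sketch of the pattern (paraphrasing the Cython docs linked above; the file names are the PR's):

```cython
# _tree.pxd -- the declaration only says "this argument has a default":
#
#   cpdef build(self, np.ndarray X, np.ndarray y,
#               np.ndarray sample_mask=*, np.ndarray X_argsorted=*)

# _tree.pyx -- the actual default value is given in the implementation:
#
#   cpdef build(self, np.ndarray X, np.ndarray y,
#               np.ndarray sample_mask=None, np.ndarray X_argsorted=None):
#       ...
```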

DOC: document negative weight treatment in the case of classification
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until
all leaves are pure or until all leaves contain less than
min_samples_split samples.
Note: this parameter is tree-specific.

min_samples_split : integer, optional (default=1)
min_samples_split : integer, optional (default=2)
Member:

Why 2? Something to do with sample weights?

Member:

How do you split a node with only one sample in it?

Member:

I must have misunderstood; IIRC this value used to be the minimum number of samples in a leaf node, but it is now the number of samples required to split an internal node.

Contributor Author:

@bdholt1 Nothing to do with sample weighting, sorry for the confusion. This was brought up in #1518 and I fixed it here to make things easier for myself.
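The semantics at stake, as a tiny sketch (hypothetical helper; the real check happens during tree building):

```python
def can_split(n_node_samples, min_samples_split=2):
    """A node may only be split if it holds at least min_samples_split samples."""
    return n_node_samples >= min_samples_split

# With the old default of 1 the condition was vacuous for non-empty
# nodes; 2 is the smallest value under which a split can actually
# produce two non-empty children.
assert not can_split(1)  # a single sample is always a leaf
assert can_split(2)
```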

@ndawe (Member) commented Jan 7, 2013

Repeating @glouppe: I think this PR is ready to be merged. Then we can begin working on AdaBoost.

@amueller (Member) commented Jan 7, 2013

Set the milestone to 0.13. I think we can get this :)

@glouppe (Contributor Author) commented Jan 8, 2013

/ping @pprett @bdholt1?

It would be nice to have this merged for the release :)

@pprett (Member) commented Jan 8, 2013

Just finished the review... looks very good to me! I re-ran all my benchmarks to check for performance regressions - couldn't find any.

Great work @glouppe and @ndawe - thanks a lot!

I'm +1000 for merge - our tree hugger will like this one!

X = iris.data
y = iris.target

dupplicates = rng.randint(0, X.shape[0], 1000)
Member:

Typo: "dupplicates" should be "duplicates".

@bdholt1 (Member) commented Jan 8, 2013

I've just finished my review. I didn't have time to build the code, but I went through it reasonably thoroughly and it looks good to me, so I'm happy to merge this in. It will be a great contribution to the tree module!

@bdholt1 (Member) commented Jan 8, 2013

+1 to merge, and congrats to @ndawe and @glouppe for this work.

@glouppe (Contributor Author) commented Jan 8, 2013

Thanks guys for the reviews! And thank you @ndawe for your work :)

@amueller When you come by, once Travis is done, feel free to click the green button or to apply your rebase-fu.

@glouppe (Contributor Author) commented Jan 8, 2013

@amueller Travis is happy.

@ndawe (Member) commented Jan 8, 2013

Thanks for the reviews! Time to begin work on the rest of #522.

@arjoly (Member) commented Jan 8, 2013

Congrats to @glouppe and @ndawe !!!

@glouppe (Contributor Author) commented Jan 8, 2013

@ndawe I'll check out AdaBoost from #522 into a new branch first thing tomorrow morning. Is that OK for you, or do you want to take the lead on that one?

@ndawe (Member) commented Jan 8, 2013

@glouppe either way is fine with me. I suppose we need to check out:

sklearn/ensemble/weight_boosting.py
sklearn/ensemble/__init__.py
examples/ensemble/plot_adaboost_*

from my treeweights branch. I could actually handle this right now, since I would like to start using this PR ASAP in my research (sample weights + AdaBoost).

@amueller (Member) commented Jan 8, 2013

Congrats everyone. I will now take a look at the tree code and consider the rebase.
Thanks @glouppe and @ndawe for all the work!

@glouppe (Contributor Author) commented Jan 8, 2013

@ndawe Okay then, wait for @amueller's rebase and then check out those files.

(Remember, however, that @pprett found quite a severe bug in the convergence of the implementation, so I am not sure it is safe to use for research purposes right now. At least not until we fix it.)

@glouppe (Contributor Author) commented Jan 8, 2013

@amueller The last merge into master conflicts on the what's new file. Can you fix that, or should I?

@amueller (Member) commented Jan 8, 2013

don't worry, will fix.

@ndawe (Member) commented Jan 8, 2013

@glouppe good point. I am anxious to see where the error is in my AdaBoost.

@amueller (Member) commented Jan 8, 2013

Merged by rebase. Thanks again guys!

@amueller closed this on Jan 8, 2013
@glouppe (Contributor Author) commented Jan 8, 2013

🍺

@ndawe (Member) commented Jan 8, 2013

@glouppe, I owe you 🍻
