MRG: Add sample weighting to decision trees #1488
Conversation
tasks and `max_features=n_features` on regression problems. If "sqrt",
then `max_features=sqrt(n_features)`. If "log2", then
`max_features=log2(n_features)`. If None, then
`max_features=n_features`.
Why does `None` mean `max_features=n_features` instead of, say, `'max'`? And why is the default not equal to `'auto'`? Sorry for nitpicking, but it puzzled me the first time I used tree estimators.
By default, a decision tree should not be random, hence `max_features=None`. Note, however, that `max_features="auto"` is the default in forests.
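For illustration, a minimal sketch of the practical consequence of the two defaults (using current scikit-learn estimator names; not part of the diff under review):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# max_features=None: every split considers all features, so a single tree
# fit on the same data is deterministic (no feature subsampling).
tree = DecisionTreeClassifier(max_features=None).fit(X, y)

# Forests, by contrast, default to a random feature subset per split
# (historically spelled max_features="auto"), which is where their
# randomness comes from.
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
```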
@arjoly Done! I also added the modifications done by @pprett to rewrite bootstrapping using sample weights. This should result in a noticeable speedup when using random forests with default parameters :)
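A hedged sketch of that idea (not the actual code from this PR): rather than materializing each bootstrap sample by fancy-indexing X, draw the indices, count them, and pass the counts as sample weights, so the data array is never copied.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bootstrap_tree(X, y, random_state):
    # Draw a bootstrap sample by counting how often each row is picked...
    n_samples = X.shape[0]
    indices = random_state.randint(0, n_samples, n_samples)
    counts = np.bincount(indices, minlength=n_samples)
    # ...then encode the multiplicities as sample weights instead of copying
    # rows; rows that were never drawn simply get a weight of zero.
    tree = DecisionTreeClassifier(random_state=random_state)
    return tree.fit(X, y, sample_weight=counts.astype(np.float64))
```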
I did some benchmarks to compare this PR against R's randomForest package (via rpy2). You can see the results below. I used 100 trees, max_features=3, and min_samples_leaf=1, except for MNIST where I used 10 trees. All results are averaged over 3 runs. I put emphasis on some "interesting" differences. Most notable: the difference for the two synthetic datasets hastie_10_2 and random_gaussians; both have relatively few features. MNIST might be due to the overhead of the rpy2 wrapper (the array is quite large...).

Datasets benchmarked: madelon, hastie_10_2, spam, landsat, arcene, random_gaussian, mnist.
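A rough sketch of the timing loop described above, using one of the synthetic datasets built into scikit-learn (the real script is linked in the next comment; the dataset and sizes here are placeholders):

```python
import time
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import RandomForestClassifier

# Placeholder dataset: hastie_10_2 is available as a generator in scikit-learn.
X, y = make_hastie_10_2(n_samples=12000, random_state=0)
X_train, y_train = X[:10000], y[:10000]
X_test, y_test = X[10000:], y[10000:]

times, scores = [], []
for run in range(3):  # results averaged over 3 runs
    clf = RandomForestClassifier(n_estimators=100, max_features=3,
                                 min_samples_leaf=1, random_state=run)
    t0 = time.time()
    clf.fit(X_train, y_train)
    times.append(time.time() - t0)
    scores.append(clf.score(X_test, y_test))

print("fit time: %.2fs  accuracy: %.3f" % (np.mean(times), np.mean(scores)))
```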
If you have time, would you mind running those on master to see the benefits from switching to sample weighting? I am curious :)
@glouppe I've added the master timings to the tables; a 2-fold improvement on all reasonably sized datasets. BTW, you can find the benchmark script here: https://gist.github.com/4089926 (it requires my landsat branch [1]).
[1] https://github.com/pprett/scikit-learn/tree/landsat-dataset
Any reason master is more accurate on arcene? That seems a bit weird, right?
Btw, pretty cool speedup overall :) I'll try to review this branch soon, but probably after Christmas (maybe @bdholt1 has a more qualified opinion?)
sample_mask=sample_mask, X_argsorted=X_argsorted,
sample_counts = np.bincount(indices)

if len(sample_counts) < n_samples:
There is a backport of the new bincount in utils. It provides the minlength keyword, which does exactly this ;)
Thanks for the hint! I changed that. Too bad all these functions are buried (and hidden) in our codebase :(
It is actually a standard numpy function, just not in 1.3.0. I added the backport recently.
If you have any ideas how to make it more discoverable, that would be great :)
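For illustration, assuming NumPy >= 1.6 where the keyword is available (older versions go through the backport mentioned above):

```python
import numpy as np

indices = np.array([0, 2, 2, 5])

# Without minlength, the result stops at the largest index seen...
np.bincount(indices)                # array([1, 0, 2, 0, 0, 1])

# ...with minlength, trailing zero counts are kept up to n_samples,
# which is exactly the padding the removed `if` branch was doing by hand.
np.bincount(indices, minlength=8)   # array([1, 0, 2, 0, 0, 1, 0, 0])
```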
@glouppe arcene is a quite noisy dataset - the score standard deviation is also quite large (0.02) - so IMHO the difference should be OK. The two small datasets (hastie and random gaussians) indicate that there is some considerable (constant) overhead somewhere in the code that is only noticeable when the number of features is quite low (both have only 10 features)...
You are right, I think I misread the standard deviation by an order of magnitude. The difference is OK.
I think it would be great if we had a more explicit test, maybe on some synthetic data. I just had the impression that there is no test for fine-grained changing of the weights, as is needed for boosting, for example.
I profiled one of the small datasets: ~90% of the time is spent in … Interesting: the time spent in ClassificationCriterion.update is roughly the time spent in RandomState.shuffle / RandomState.permutation. As far as I can see, there is no single statement/function to blame for the performance issue. You can find the pprof call graph here: https://dl.dropbox.com/u/402667/pprof_forest_hastie.ps
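For reference, a minimal way to reproduce this kind of profile with the standard-library profiler instead of pprof (the dataset and parameters here are only an example):

```python
import cProfile
import pstats

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import RandomForestClassifier

X, y = make_hastie_10_2(n_samples=2000, random_state=0)
clf = RandomForestClassifier(n_estimators=100, max_features=3, random_state=0)

# Profile a single fit and show the functions with the largest cumulative time.
# Note that Cython-level calls (e.g. Criterion.update) only show up here if
# they are visible to the Python profiler.
cProfile.runctx("clf.fit(X, y)", globals(), locals(), "forest.prof")
pstats.Stats("forest.prof").sort_stats("cumulative").print_stats(15)
```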
@pprett Thanks for the profiling! I think there are still places where things can be optimized. The one that I have in mind is getting rid of …

@amueller Yep, you are right. I'll add some more fine-grained tests later today.
@glouppe Try to use np.random.shuffle with …

Not far from 100% coverage. Nice.
# Methods
cdef void resize(self, int capacity=*)

cpdef build(self, np.ndarray X, np.ndarray y,
            np.ndarray sample_mask=*, np.ndarray X_argsorted=*)
What is the * syntax?
It is for default values in `pxd` files. Don't ask why it is like that though ;)
http://docs.cython.org/src/reference/language_basics.html#optional-arguments
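For illustration, a small made-up example of the pattern (the class and file names are hypothetical, not from this PR): the `=*` in the `.pxd` declaration only says "this argument has a default"; the actual value is given in the matching `.pyx` implementation.

```cython
# counter.pxd -- declaration file: `=*` marks that `amount` has a default
cdef class Counter:
    cdef public int total
    cpdef int add(self, int amount=*)
```

```cython
# counter.pyx -- implementation file: the real default value lives here
cdef class Counter:
    def __cinit__(self):
        self.total = 0

    cpdef int add(self, int amount=1):
        self.total += amount
        return self.total
```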
DOC: document negative weight treatment in the case of classification
max_depth : integer or None, optional (default=None)
    The maximum depth of the tree. If None, then nodes are expanded until
    all leaves are pure or until all leaves contain less than
    min_samples_split samples.
    Note: this parameter is tree-specific.

min_samples_split : integer, optional (default=1)
min_samples_split : integer, optional (default=2)
Why 2? Something to do with sample weights?
How do you split a node with only one sample in it?
I must have misunderstood; IIRC this value was the minimum number of samples in a leaf node, but it is now the number of samples required to split a node.
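A short hedged illustration of the distinction, using the tree estimator's public parameters:

```python
from sklearn.tree import DecisionTreeClassifier

# min_samples_split: an internal node must contain at least this many samples
# before a split is even attempted, hence 2 is the smallest useful value.
# min_samples_leaf: every child created by a split must keep at least this
# many samples, which is the "minimum samples in a leaf" reading above.
clf = DecisionTreeClassifier(min_samples_split=2, min_samples_leaf=1)
```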
Repeating @glouppe: I think this PR is ready to be merged. Then we can begin working on AdaBoost.
Set the milestone to 0.13. I think we can get this in :)
X = iris.data
y = iris.target

dupplicates = rng.randint(0, X.shape[0], 1000)
duplicates
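The snippet above belongs to a test checking that fitting on duplicated rows is equivalent to fitting on the original rows with integer sample weights; a hedged sketch of that idea (not the exact test added by the PR):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
iris = load_iris()
X, y = iris.data, iris.target

duplicates = rng.randint(0, X.shape[0], 1000)

# A tree fit on the explicitly duplicated rows...
clf1 = DecisionTreeClassifier(random_state=1).fit(X[duplicates], y[duplicates])

# ...should behave like a tree fit on the original rows, weighted by how
# often each row occurs in the duplicated sample (0 for rows never drawn).
weights = np.bincount(duplicates, minlength=X.shape[0]).astype(np.float64)
clf2 = DecisionTreeClassifier(random_state=1).fit(X, y, sample_weight=weights)

assert np.array_equal(clf1.predict(X[duplicates]), clf2.predict(X[duplicates]))
```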
I've just finished my review. I didn't have time to build the code, but I went through it reasonably thoroughly and it looks good to me, so I'm happy to merge this in. It will be a great contribution to the tree module!
@amueller Travis is happy.
Thanks for the reviews! Time to begin work on the rest of #522.
@glouppe Either way is fine for me. I suppose we need to check out sklearn/ensemble/weight_boosting.py from my treeweights branch. I could actually handle this right now, since I would like to begin using this PR ASAP in my research (using weights + AdaBoost).
@amueller The last merge in master conflicts with the what's new file. Can you fix that, or should I?
Don't worry, I will fix it.
@glouppe Good point. I am anxious to see where the error is in my AdaBoost.
Merged by rebase. Thanks again guys!
🍺
@glouppe, I owe you 🍻
This PR is the first part of #522. It adds sample weighting to decision trees and forests.