MRG: Add sample weighting to decision trees #1488
Conversation
tasks and `max_features=n_features` on regression problems. If "sqrt",
then `max_features=sqrt(n_features)`. If "log2", then
`max_features=log2(n_features)`. If None, then
`max_features=n_features`.
Why does `None` mean `max_features=n_features` instead of, say, `'max'`? And why is the default not equal to `'auto'`? Sorry for nitpicking, but it puzzled me the first time I used tree estimators.
By default, a decision tree should not be random, hence `max_features=None`. Note, however, that `max_features="auto"` is the default in forests.
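For illustration, a minimal sketch of the practical consequence of the two defaults (using current scikit-learn estimator names; not part of the diff under review):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# max_features=None: every split considers all features, so a single tree
# fit on the same data is deterministic (no feature subsampling).
tree = DecisionTreeClassifier(max_features=None).fit(X, y)

# Forests, by contrast, default to a random feature subset per split
# (historically spelled max_features="auto"), which is where their
# randomness comes from.
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
```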
@arjoly Done! I also added the modifications done by @pprett to rewrite bootstrapping using sample weights. This should result in a noticeable speedup when using random forests with default parameters :)
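A hedged sketch of that idea (not the actual code from this PR): rather than materializing each bootstrap sample by fancy-indexing X, draw the indices, count them, and pass the counts as sample weights, so the data array is never copied.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bootstrap_tree(X, y, random_state):
    # Draw a bootstrap sample by counting how often each row is picked...
    n_samples = X.shape[0]
    indices = random_state.randint(0, n_samples, n_samples)
    counts = np.bincount(indices, minlength=n_samples)
    # ...then encode the multiplicities as sample weights instead of copying
    # rows; rows that were never drawn simply get a weight of zero.
    tree = DecisionTreeClassifier(random_state=random_state)
    return tree.fit(X, y, sample_weight=counts.astype(np.float64))
```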
I did some benchmarks to compare this PR against R's randomForest package (via rpy2). You can see the results below. I used 100 trees, max_features=3, and min_samples_leaf=1, except for MNIST where I used 10 trees. All results are averaged over 3 runs. I put emphasis on some "interesting" differences. Most notable: the difference for the two synthetic datasets hastie_10_2 and random_gaussians; both have relatively few features. MNIST might be due to the overhead of the rpy2 wrapper (the array is quite large...).

Datasets benchmarked: madelon, hastie_10_2, spam, landsat, arcene, random_gaussian, mnist.
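A rough sketch of the timing loop described above, using one of the synthetic datasets built into scikit-learn (the real script is linked in the next comment; the dataset and sizes here are placeholders):

```python
import time
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import RandomForestClassifier

# Placeholder dataset: hastie_10_2 is available as a generator in scikit-learn.
X, y = make_hastie_10_2(n_samples=12000, random_state=0)
X_train, y_train = X[:10000], y[:10000]
X_test, y_test = X[10000:], y[10000:]

times, scores = [], []
for run in range(3):  # results averaged over 3 runs
    clf = RandomForestClassifier(n_estimators=100, max_features=3,
                                 min_samples_leaf=1, random_state=run)
    t0 = time.time()
    clf.fit(X_train, y_train)
    times.append(time.time() - t0)
    scores.append(clf.score(X_test, y_test))

print("fit time: %.2fs  accuracy: %.3f" % (np.mean(times), np.mean(scores)))
```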
If you have time, would you mind running those on master to see the benefits from switching to sample weighting? I am curious :)
@glouppe I've added the master timings to the tables; a 2-fold improvement on all reasonably sized datasets. BTW, you can find the benchmark script here: https://gist.github.com/4089926 (it requires my landsat branch [1]).
[1] https://github.com/pprett/scikit-learn/tree/landsat-dataset
Any reason master is more accurate on arcene? That seems a bit weird, right?
Btw, pretty cool speedup overall :) I'll try to review this branch soon, but probably after Christmas (maybe @bdholt1 has a more qualified opinion?)
sample_mask=sample_mask, X_argsorted=X_argsorted,
sample_counts = np.bincount(indices)

if len(sample_counts) < n_samples:
There is a backport of the new bincount in utils. It provides the minlength keyword, which does exactly this ;)
Thanks for the hint! I changed that. Too bad all these functions are buried (and hidden) in our codebase :(
It is actually a standard numpy function, just not in 1.3.0. I added the backport recently.
If you have any ideas how to make it more discoverable, that would be great :)
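For illustration, assuming NumPy >= 1.6 where the keyword is available (older versions go through the backport mentioned above):

```python
import numpy as np

indices = np.array([0, 2, 2, 5])

# Without minlength, the result stops at the largest index seen...
np.bincount(indices)                # array([1, 0, 2, 0, 0, 1])

# ...with minlength, trailing zero counts are kept up to n_samples,
# which is exactly the padding the removed `if` branch was doing by hand.
np.bincount(indices, minlength=8)   # array([1, 0, 2, 0, 0, 1, 0, 0])
```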
@glouppe arcene is a quite noisy dataset - the score standard deviation is also quite large (0.02) - so IMHO the difference should be OK. The two small datasets (hastie and random gaussians) indicate that there is some considerable (constant) overhead somewhere in the code that is only noticeable when the number of features is quite low (both have only 10 features)...
You are right, I think I misread the standard deviation by an order of magnitude. The difference is OK.
I think it would be great if we had a more explicit test, maybe on some synthetic data. I just had the impression that there is no test for fine-grained changing of the weights, as is needed for boosting, for example.
I profiled one of the small datasets: ~90% of the time is spent in … Interesting: the time spent in ClassificationCriterion.update is roughly the time spent in RandomState.shuffle / RandomState.permutation. As far as I can see, there is no single statement/function to blame for the performance issue. You can find the pprof call graph here: https://dl.dropbox.com/u/402667/pprof_forest_hastie.ps
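For reference, a minimal way to reproduce this kind of profile with the standard-library profiler instead of pprof (the dataset and parameters here are only an example):

```python
import cProfile
import pstats

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import RandomForestClassifier

X, y = make_hastie_10_2(n_samples=2000, random_state=0)
clf = RandomForestClassifier(n_estimators=100, max_features=3, random_state=0)

# Profile a single fit and show the functions with the largest cumulative time.
# Note that Cython-level calls (e.g. Criterion.update) only show up here if
# they are visible to the Python profiler.
cProfile.runctx("clf.fit(X, y)", globals(), locals(), "forest.prof")
pstats.Stats("forest.prof").sort_stats("cumulative").print_stats(15)
```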
@pprett Thanks for the profiling! I think there are still places where things can be optimized. The one that I have in mind is getting rid of …

@amueller Yep, you are right. I'll add some more fine-grained tests later today.
@glouppe Try to use np.random.shuffle with …

Not far from 100% coverage. Nice.
# Methods
cdef void resize(self, int capacity=*)

cpdef build(self, np.ndarray X, np.ndarray y,
            np.ndarray sample_mask=*, np.ndarray X_argsorted=*)
What is the * syntax?
It is for default values in `pxd` files. Don't ask why it is like that though ;)
http://docs.cython.org/src/reference/language_basics.html#optional-arguments
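For illustration, a small made-up example of the pattern (the class and file names are hypothetical, not from this PR): the `=*` in the `.pxd` declaration only says "this argument has a default"; the actual value is given in the matching `.pyx` implementation.

```cython
# counter.pxd -- declaration file: `=*` marks that `amount` has a default
cdef class Counter:
    cdef public int total
    cpdef int add(self, int amount=*)
```

```cython
# counter.pyx -- implementation file: the real default value lives here
cdef class Counter:
    def __cinit__(self):
        self.total = 0

    cpdef int add(self, int amount=1):
        self.total += amount
        return self.total
```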
DOC: document negative weight treatment in the case of classification
max_depth : integer or None, optional (default=None)
    The maximum depth of the tree. If None, then nodes are expanded until
    all leaves are pure or until all leaves contain less than
    min_samples_split samples.
    Note: this parameter is tree-specific.

min_samples_split : integer, optional (default=1)
min_samples_split : integer, optional (default=2)
Why 2? Something to do with sample weights?
How do you split a node with only one sample in it?
I must have misunderstood; IIRC this value was the minimum number of samples in a leaf node, but it is now the number of samples required to split a node.
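A short hedged illustration of the distinction, using the tree estimator's public parameters:

```python
from sklearn.tree import DecisionTreeClassifier

# min_samples_split: an internal node must contain at least this many samples
# before a split is even attempted, hence 2 is the smallest useful value.
# min_samples_leaf: every child created by a split must keep at least this
# many samples, which is the "minimum samples in a leaf" reading above.
clf = DecisionTreeClassifier(min_samples_split=2, min_samples_leaf=1)
```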
Repeating @glouppe: I think this PR is ready to be merged. Then we can begin working on AdaBoost.
Set the milestone to 0.13. I think we can get this in :)
X = iris.data
y = iris.target

dupplicates = rng.randint(0, X.shape[0], 1000)
duplicates
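The snippet above belongs to a test checking that fitting on duplicated rows is equivalent to fitting on the original rows with integer sample weights; a hedged sketch of that idea (not the exact test added by the PR):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
iris = load_iris()
X, y = iris.data, iris.target

duplicates = rng.randint(0, X.shape[0], 1000)

# A tree fit on the explicitly duplicated rows...
clf1 = DecisionTreeClassifier(random_state=1).fit(X[duplicates], y[duplicates])

# ...should behave like a tree fit on the original rows, weighted by how
# often each row occurs in the duplicated sample (0 for rows never drawn).
weights = np.bincount(duplicates, minlength=X.shape[0]).astype(np.float64)
clf2 = DecisionTreeClassifier(random_state=1).fit(X, y, sample_weight=weights)

assert np.array_equal(clf1.predict(X[duplicates]), clf2.predict(X[duplicates]))
```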
I've just finished my review. I didn't have time to build the code, but I went through it reasonably thoroughly and it looks good to me, so I'm happy to merge this in. It will be a great contribution to the tree module!
@amueller Travis is happy.
Thanks for the reviews! Time to begin work on the rest of #522.
@glouppe Either way is fine for me. I suppose we need to check out sklearn/ensemble/weight_boosting.py from my treeweights branch. I could actually handle this right now, since I would like to begin using this PR ASAP in my research (using weights + AdaBoost).
@amueller The last merge in master conflicts with the what's new file. Can you fix that, or should I?
Don't worry, I will fix it.
@glouppe Good point. I am anxious to see where the error is in my AdaBoost.
Merged by rebase. Thanks again guys!
🍺
@glouppe, I owe you 🍻
This PR is the first part of #522. It adds sample weighting to decision trees and forests.