[MRG] Tree speedup #946
Conversation
Gilles, you can find the benchmark here: https://gist.github.com/3090011 Personally, I think this benchmark is too simple; we should instead use multiple datasets with different characteristics (num_features vs. num_samples, feature types, regression vs. classification). Maybe we can re-use something from the GBRT benchmark code: https://github.com/pprett/scikit-learn/blob/gradient_boosting-benchmarks/benchmarks/bench_gbrt.py
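As an illustration of what such a multi-dataset benchmark could look like, here is a minimal sketch; the dataset names and shapes below are made up for illustration and are not the actual bench_gbrt.py settings:

```python
import time

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical benchmark sketch: "wide" (few samples, many features) vs.
# "tall" (many samples, few features); shapes are illustrative only.
datasets = {
    "wide": make_classification(n_samples=200, n_features=100, random_state=0),
    "tall": make_classification(n_samples=2000, n_features=10, random_state=0),
}

for name, (X, y) in datasets.items():
    clf = DecisionTreeClassifier(random_state=0)
    start = time.time()
    clf.fit(X, y)
    print("%s: %.4f s" % (name, time.time() - start))
```

A real benchmark would repeat each fit several times and report the best wall-clock time, and would add regression datasets as well.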
…ee-speedup Conflicts: sklearn/tree/tree.py
All tests in test_tree.py now pass, except graphviz and pickle. I'll leave it here for today. The next major issue is to implement …
Should I run this through the profiler if I find time, or are you still making major changes?
I am still making major changes. I'll let you know as soon as you can start reviewing the code in more depth. Thanks :)
…ee-speedup Conflicts: sklearn/tree/tree.py
@pprett Gradient Boosting tests still need to be fixed though :) I was thinking maybe we could do that together, since this PR involves many changes to the Tree API. In particular, I think that in …
Great - unfortunately, I'm a little busy at the moment; I don't think I can make it in the coming days...
Okay then, don't worry. I will have a look at it myself first.
What should be the expected behavior? Put NaNs if proba[k] is 0?
I suppose the RuntimeWarning is as good as it gets. Perhaps, if we are expecting this warning, we can catch it so it doesn't show up in the nosetests output?
I turned off the divide-by-zero warnings in the test suite. Regarding the other issue, it is the same as before. I don't know what's best... @pprett?
Hmm... personally, a warning and -inf are fine with me. I'm not a friend of exceptions, and -inf is better than NaN IMHO.
+1 for having -inf. OK for catching the warning in the tests, as we expect it to happen in this case.
Actually, np.log(0) already returns -inf and outputs the warning, so problem solved. I was rather discussing the test failure of …
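For reference, the behavior discussed above can be reproduced with a few lines of NumPy, including how a test suite can silence the expected divide-by-zero warning locally:

```python
import warnings

import numpy as np

# np.log(0) yields -inf (not NaN) and emits a RuntimeWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = np.log(0.0)

assert np.isneginf(result)
assert any(issubclass(w.category, RuntimeWarning) for w in caught)

# In a test suite, the expected warning can be silenced locally:
with np.errstate(divide="ignore"):
    silenced = np.log(0.0)  # still -inf, but no warning is emitted
assert np.isneginf(silenced)
```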
@glouppe this is again a float32 vs. float64 issue which caused two samples to flip rank - please change the ground truth to reflect the new ranking: …
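To illustrate the kind of precision issue being described, here is a small sketch (the values are made up) showing how two quantities that are distinct in float64 can collapse into an exact tie in float32, which is enough to flip a ranking:

```python
import numpy as np

# Two importances that differ by ~1e-9: distinct in float64...
a64 = np.array([0.100000001, 0.1], dtype=np.float64)
assert a64[0] > a64[1]

# ...but identical once cast to float32 (the float32 ulp near 0.1 is
# ~7.5e-9), so any ordering between them becomes an arbitrary tie break.
a32 = a64.astype(np.float32)
assert a32[0] == a32[1]
```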
@bdholt1 Can you check whether my last commit solves the issue?
You couldn't make this up if you tried!
Could it just be my build that's unstable? Or is it the GBRT feature importances algorithm that's unstable?
Ok - looks like it's pretty unstable... 1 sec.
Would it be possible to change the dataset used in the tests so that the feature importances are strictly monotonic (i.e., no two features have close importances)? Alternatively, could we implement a deterministic tie-breaking scheme, in case these are exact ties and the current scheme is non-deterministic because of the use of a dictionary or similar non-deterministic data structure somewhere?
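One possible deterministic tie-breaking scheme, sketched with a hypothetical helper (`stable_ranking` is not part of the code base): rank by importance and break exact ties by feature index using a stable sort:

```python
import numpy as np

def stable_ranking(importances):
    """Rank features by decreasing importance with a deterministic tie
    break: exact ties are ordered by feature index, because a stable
    sort preserves the original order of equal elements."""
    importances = np.asarray(importances, dtype=np.float64)
    return np.argsort(-importances, kind="stable")

# Features 0 and 2 tie at 0.2; the lower index always comes first.
assert list(stable_ranking([0.2, 0.5, 0.2, 0.1])) == [1, 0, 2, 3]
```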
@glouppe can you add this to the …
I think there is an issue with …
This caused the ranking to change again. I added the lines anyway and changed the ranking again in my last commit.
Any clue @pprett? I tried various things, but none seem to yield a stable ranking...
Gilles, are you using a 32-bit arch?
Yes.
I noticed reproducibility issues on 32-bit archs - don't know whether …
My last commit disables the test. Feel free to merge :)
+1
+1 for merge too - can't wait to bench gbm on the new master!
Okay then, I click the green button! Thanks all of you for the reviews :) I'll open a new issue regarding the …
Concerns issue #933
Hi guys!
This is a very, very early pull request to improve the training time of the tree module. It is so early that, please, don't look at the code for now; it still needs A LOT of work.
The main contribution is basically to rewrite the Tree class from tree.py as a Cython class in _tree.pyx. In its current state, this already achieves a ~2x speedup w.r.t. master. Not bad, but I expect better from the changes to come.
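To give an idea of what such a rewrite buys, here is a rough, hypothetical sketch in plain Python with NumPy (not the actual _tree.pyx code) of the array-based layout a flattened tree typically uses, with an iterative predict loop instead of per-node Python objects:

```python
import numpy as np

# Hypothetical flat tree with 3 nodes: node 0 splits on feature 0 at
# threshold 0.5; nodes 1 and 2 are leaves. A Cython rewrite can walk
# these typed arrays without touching Python objects in the hot loop.
children_left  = np.array([1, -1, -1])      # -1 marks a leaf
children_right = np.array([2, -1, -1])
feature        = np.array([0, -1, -1])      # split feature per node
threshold      = np.array([0.5, 0.0, 0.0])  # split threshold per node
value          = np.array([0.0, 0.0, 1.0])  # prediction stored per node

def predict_one(x):
    """Walk the flattened tree iteratively from the root to a leaf."""
    node = 0
    while children_left[node] != -1:  # loop until we reach a leaf
        if x[feature[node]] <= threshold[node]:
            node = children_left[node]
        else:
            node = children_right[node]
    return value[node]

assert predict_one([0.2]) == 0.0  # goes left of the threshold
assert predict_one([0.9]) == 1.0  # goes right of the threshold
```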
@pprett Could you point me to the benchmark you used in #923? Thanks :)
Update 16/07: All tests now pass. This is ready for merge.