MAINT Handle Criterion.samples using a memoryview #25005
Conversation
(Title changed from "samples inside Criterion classes from pointer to memoryview" to "MAINT Handle Criterion.samples using a memoryview".)
Thanks, @adam2392.
Here are a few comments.
Hmm, running the asv benchmark, one gets a regression in the time to fit dense data... Granted, I am using my laptop, which could be running stuff in the background... But would there be a reason this regression is actually real? cc: @jjerphan
The memoryview should be as efficient as its pointer counterpart in all the ways that we use it.
I would recommend running the asv benchmark on a (dedicated) machine without any other workload (or with CPU affinity and isolation set). I think the variations aren't due to this PR's changes, which should be harmless.
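As a minimal, hypothetical sketch of that advice (not part of this PR): pinning the process to a single core and taking the best of several repeats reduces the scheduling noise discussed above. `workload` here is a stand-in for the real benchmark body.

```python
import os
import timeit

# Restrict this process to one CPU core (Linux-only API), so background
# tasks on other cores perturb the measurement less.
if hasattr(os, "sched_setaffinity"):
    os.sched_setaffinity(0, {0})

def workload():
    # Stand-in for the actual benchmark body (e.g. fitting a forest).
    return sum(i * i for i in range(10_000))

# Best-of-N timing is less sensitive to transient background load
# than a single measurement.
best = min(timeit.repeat(workload, repeat=5, number=100))
print(f"best of 5 repeats: {best:.6f} s")
```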
@jshinm, can you help run the asv benchmarks here?
@adam2392 It seems like, as @jjerphan mentioned, there's a small, harmless variation between tests in my case (sklearn-main):

```
jshinm@jshinm-OMEN-by-HP-Laptop-16-b0xxx:~/Desktop/workstation/sklearn-jms/asv_benchmarks$ asv compare 167b2980 9c9c8582

All benchmarks:

       before           after         ratio
     [167b2980]       [9c9c8582]
     <main>           <samples~20>
  0.7230709112942986  0.7230709112942986  1.00  ensemble.HistGradientBoostingClassifierBenchmark.track_test_score
  0.9812160155622751  0.9812160155622751  1.00  ensemble.HistGradientBoostingClassifierBenchmark.track_train_score
  191M                186M                0.98  ensemble.RandomForestClassifierBenchmark.peakmem_fit('dense', 1)
  427M                422M                0.99  ensemble.RandomForestClassifierBenchmark.peakmem_fit('sparse', 1)
  195M                190M                0.98  ensemble.RandomForestClassifierBenchmark.peakmem_predict('dense', 1)
  411M                406M                0.99  ensemble.RandomForestClassifierBenchmark.peakmem_predict('sparse', 1)
- 4.47±0.01s          4.03±0.01s          0.90  ensemble.RandomForestClassifierBenchmark.time_fit('dense', 1)
  6.33±0.01s          6.38±0.02s          1.01  ensemble.RandomForestClassifierBenchmark.time_fit('sparse', 1)
  132±1ms             130±0.7ms           0.99  ensemble.RandomForestClassifierBenchmark.time_predict('dense', 1)
  847±1ms             871±2ms             1.03  ensemble.RandomForestClassifierBenchmark.time_predict('sparse', 1)
  0.7464271763500541  0.7464271763500541  1.00  ensemble.RandomForestClassifierBenchmark.track_test_score('dense', 1)
  0.8656423941766682  0.8656423941766682  1.00  ensemble.RandomForestClassifierBenchmark.track_test_score('sparse', 1)
  0.9968171694224932  0.9968171694224932  1.00  ensemble.RandomForestClassifierBenchmark.track_train_score('dense', 1)
  0.9996123288718864  0.9996123288718864  1.00  ensemble.RandomForestClassifierBenchmark.track_train_score('sparse', 1)
```

Test machine info:

```
os [Linux 5.15.0-53-generic]
arch [x86_64]
cpu [11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz]
num_cpu [16]
ram [65483276]
```
In that case, this is good to go on my end, @jjerphan. Unless you would like me to rename every `i`, `p`, `k` index pointer where `sample_indices` are used?
Thanks for reporting the benchmarks' results, @jshinm.
This PR LGTM modulo the last point @adam2392 mentioned:

> Unless you would like me to rename every i, p, k index pointers where sample_indices are used?

In fact, it is probably better to revert this change and keep using `i`, `p` and `k`, because those names are used in other places in this file, and performing the renaming would change the scope of this PR.
What do you think, @adam2392 ? Can you revert it back?
Sorry for having changed my mind.
```cython
cdef DOUBLE_t y_mean = 0.
cdef DOUBLE_t poisson_loss = 0.
cdef DOUBLE_t w = 1.0
cdef SIZE_t i, k, p
```
FYI @jshinm, you forgot to define this in your PR to address Julien's comments. In Cython, all variables must be typed, meaning the type is declared a priori to usage.
Cython will always surprise me. I am not really sure how it previously inferred and created those variables for us.
Done.
Now that it has been brought up, I think a parallel PR to improve the readability of the Cython code would be nice, though. I get extreme cognitive load when reading code with single letters as variables :p. Happy to do another PR, or to encourage @jshinm to start a quick one there :).
LGTM.
Thank you, @adam2392 and @jshinm!
I agree regarding cognitive load. I guess those are standard mathematical notations in algorithms that are handy once one is used to them, but they might be changed. Note that we also somewhat want to minimize cosmetic changes, because those come with other costs.
LGTM
Co-authored-by: Jong Shin <jshin.m@gmail.com>
Reference Issues/PRs
Fixes: #25004
Putting up a PR to document what the end result of #24678, #24987 and this PR would look like.
What does this implement/fix? Explain your changes.
Converts `SIZE_t* samples` to `const SIZE_t[:] samples`.

Any other comments?
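For illustration only: Python's built-in `memoryview` behaves analogously to the Cython typed memoryview this PR introduces. This is a sketch of the pointer-to-read-only-view idea, not the Cython API itself; the `'q'` typecode is merely a stand-in for `SIZE_t`.

```python
import array

# Stand-in for the underlying buffer previously accessed via SIZE_t*.
samples = array.array('q', [3, 1, 4, 1, 5])
view = memoryview(samples)

# A read-only view plays the role of `const SIZE_t[:]`: indexing still
# works, but writes through the view are rejected.
const_view = view.toreadonly()
print(const_view[2])

try:
    const_view[0] = 99
except TypeError:
    # Writing through a read-only view raises TypeError,
    # mirroring what `const` enforces at compile time in Cython.
    pass
```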
This is a downstream PR to #24678 and #24987 and should be reviewed/merged AFTER those are reviewed/merged.
I will rebase everything in sequence.
Cross-referencing: #17299 , #24875