MAINT Handle Criterion.samples using a memoryview #25005
Conversation
(Title changed from "samples inside Criterion classes from pointer to memoryview" to "MAINT Handle Criterion.samples using a memoryview".)
Thanks, @adam2392.
Here are a few comments.
Hmm, running the asv benchmark, one gets a regression in the time to fit dense data... Granted, I am using my laptop, which could be running stuff in the background... But would there be a reason this regression is actually real? cc: @jjerphan
The memoryview should be as efficient as its pointer counterpart in all the ways that we use it.
I would recommend running the asv benchmark on a (dedicated) machine without any other workload (or with CPU affinity and isolation set). I think the variations aren't due to this PR's changes, which should be harmless.
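As a minimal, hypothetical sketch of that advice (not part of this PR): pinning the process to a single core and taking the best of several repeats reduces the scheduling noise discussed above. `workload` here is a stand-in for the real benchmark body.

```python
import os
import timeit

# Restrict this process to one CPU core (Linux-only API), so background
# tasks on other cores perturb the measurement less.
if hasattr(os, "sched_setaffinity"):
    os.sched_setaffinity(0, {0})

def workload():
    # Stand-in for the actual benchmark body (e.g. fitting a forest).
    return sum(i * i for i in range(10_000))

# Best-of-N timing is less sensitive to transient background load
# than a single measurement.
best = min(timeit.repeat(workload, repeat=5, number=100))
print(f"best of 5 repeats: {best:.6f} s")
```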
@jshinm, can you help run the asv benchmarks here?
@adam2392 It seems like, as @jjerphan mentioned, there's a small, harmless variation between tests in my case (sklearn-main):

```
jshinm@jshinm-OMEN-by-HP-Laptop-16-b0xxx:~/Desktop/workstation/sklearn-jms/asv_benchmarks$ asv compare 167b2980 9c9c8582

All benchmarks:

       before           after         ratio
     [167b2980]       [9c9c8582]
     <main>           <samples~20>
  0.7230709112942986  0.7230709112942986  1.00  ensemble.HistGradientBoostingClassifierBenchmark.track_test_score
  0.9812160155622751  0.9812160155622751  1.00  ensemble.HistGradientBoostingClassifierBenchmark.track_train_score
  191M                186M                0.98  ensemble.RandomForestClassifierBenchmark.peakmem_fit('dense', 1)
  427M                422M                0.99  ensemble.RandomForestClassifierBenchmark.peakmem_fit('sparse', 1)
  195M                190M                0.98  ensemble.RandomForestClassifierBenchmark.peakmem_predict('dense', 1)
  411M                406M                0.99  ensemble.RandomForestClassifierBenchmark.peakmem_predict('sparse', 1)
- 4.47±0.01s          4.03±0.01s          0.90  ensemble.RandomForestClassifierBenchmark.time_fit('dense', 1)
  6.33±0.01s          6.38±0.02s          1.01  ensemble.RandomForestClassifierBenchmark.time_fit('sparse', 1)
  132±1ms             130±0.7ms           0.99  ensemble.RandomForestClassifierBenchmark.time_predict('dense', 1)
  847±1ms             871±2ms             1.03  ensemble.RandomForestClassifierBenchmark.time_predict('sparse', 1)
  0.7464271763500541  0.7464271763500541  1.00  ensemble.RandomForestClassifierBenchmark.track_test_score('dense', 1)
  0.8656423941766682  0.8656423941766682  1.00  ensemble.RandomForestClassifierBenchmark.track_test_score('sparse', 1)
  0.9968171694224932  0.9968171694224932  1.00  ensemble.RandomForestClassifierBenchmark.track_train_score('dense', 1)
  0.9996123288718864  0.9996123288718864  1.00  ensemble.RandomForestClassifierBenchmark.track_train_score('sparse', 1)
```

Test machine info:

```
os [Linux 5.15.0-53-generic]
arch [x86_64]
cpu [11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz]
num_cpu [16]
ram [65483276]
```
In that case, this is good to go on my end, @jjerphan. Unless you would like me to rename every `i`, `p`, `k` index pointer where `sample_indices` are used?
Thanks for reporting the benchmarks' results, @jshinm.
This PR LGTM modulo the last point @adam2392 mentioned:

> Unless you would like me to rename every i, p, k index pointers where sample_indices are used?

In fact, it is probably better to revert this change and keep using `i`, `p` and `k`, because those names are used in other places in this file, and performing the renaming would change the scope of this PR.
What do you think, @adam2392 ? Can you revert it back?
Sorry for having changed my mind.
```cython
cdef DOUBLE_t y_mean = 0.
cdef DOUBLE_t poisson_loss = 0.
cdef DOUBLE_t w = 1.0
cdef SIZE_t i, k, p
```
FYI @jshinm, you forgot to define this in your PR to address Julien's comments. In Cython, all variables must be typed, meaning the type is declared a priori to usage.
Cython will always surprise me. I am not really sure how it previously inferred and created those variables for us.
Done.
Now that it has been brought up, I think a parallel PR to improve the readability of the Cython code would be nice, though. I get extreme cognitive load when reading code with single letters as variables :p. Happy to do another PR, or to encourage @jshinm to start a quick one there :).
LGTM.
Thank you, @adam2392 and @jshinm!
I agree regarding cognitive load. I guess those are standard mathematical notations in algorithms that are handy once one is used to them, but they might be changed. Note that we also somewhat want to minimize cosmetic changes, because those come with other costs.
LGTM
Co-authored-by: Jong Shin <jshin.m@gmail.com>
Reference Issues/PRs
Fixes: #25004
Putting up a PR to document what the end result of #24678, #24987 and this PR would look like.
What does this implement/fix? Explain your changes.
Converts `SIZE_t* samples` to `const SIZE_t[:] samples`.

Any other comments?
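For illustration only: Python's built-in `memoryview` behaves analogously to the Cython typed memoryview this PR introduces. This is a sketch of the pointer-to-read-only-view idea, not the Cython API itself; the `'q'` typecode is merely a stand-in for `SIZE_t`.

```python
import array

# Stand-in for the underlying buffer previously accessed via SIZE_t*.
samples = array.array('q', [3, 1, 4, 1, 5])
view = memoryview(samples)

# A read-only view plays the role of `const SIZE_t[:]`: indexing still
# works, but writes through the view are rejected.
const_view = view.toreadonly()
print(const_view[2])

try:
    const_view[0] = 99
except TypeError:
    # Writing through a read-only view raises TypeError,
    # mirroring what `const` enforces at compile time in Cython.
    pass
```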
This is a downstream PR to #24678 and #24987 and should be reviewed/merged AFTER those are reviewed/merged.
I will rebase everything in sequence.
Cross-referencing: #17299 , #24875