MNT Uses memoryviews in tree criterion #22921

thomasjpfan · 2022-03-22T22:01:12Z

Continues #22868

This PR refactors sklearn/tree/_criterion.pxd to use memoryviews. This simplifies the code because the memoryview handles the strides and Python can manage the memory. I ran this benchmark with every criterion and different values of n_samples and noticed no change in performance. Here are the plots of the results, and here are the raw results on main and the raw results for this PR.

Note that memcpy and memset are still used because they benchmark better than their memoryview counterparts (mv[:] = 0.0 or mv[:] = other_mv).
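As a rough illustration of the two points above, here is a minimal pure-Python sketch. It is an analogy only: the PR uses Cython typed memoryviews, while this uses the built-in `memoryview` together with `ctypes.memset`. The names `grid`, `zero_with_slice` and `zero_with_memset` are hypothetical and do not appear in the PR.

```python
import ctypes
from array import array

# Six doubles stored contiguously, viewed as a 2 x 3 row-major block.
buf = array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
flat = memoryview(buf)
grid = flat.cast("B").cast("d", shape=[2, 3])

# The view carries its shape and strides, so indexing needs no manual offsets.
row, col = 1, 2
manual_offset = (row * grid.strides[0] + col * grid.strides[1]) // grid.itemsize
assert grid[row, col] == buf[manual_offset] == 6.0


def zero_with_slice(view):
    """Zero a 1-D view of doubles by slice assignment (the mv[:] = ... style)."""
    view[:] = array("d", bytes(view.nbytes))


def zero_with_memset(arr):
    """Zero the storage of an array.array with a single C-level memset call."""
    raw = (ctypes.c_char * (len(arr) * arr.itemsize)).from_buffer(arr)
    ctypes.memset(raw, 0, len(raw))
```

Both helpers leave the buffer filled with zeros. The sketch only shows the two styles side by side and makes no claim about their relative speed in Cython.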

sklearn/tree/_criterion.pyx

jjerphan

Thank you, @thomasjpfan.

Relying on numpy's allocator is indeed more appropriate.

sklearn/tree/_criterion.pyx

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

jeremiedbb

Nice simplification.

This is the only place where the strides are still needed. When I benchmark by using the memoryview directly, I get a runtime regression.

Was it significant? Do you still have the results for this?

jeremiedbb · 2022-03-26T01:27:01Z

Note that memcpy and memset are still used because they benchmark better than their memoryview counterparts (mv[:] = 0.0 or mv[:] = other_mv).

That's expected. memset natively works on blocks that have the size of the registers. The loop can't compete with that, even more so when not all optimisation flags are enabled.

thomasjpfan · 2022-03-26T19:47:53Z

Was it significant? Do you still have the results for this?

I reran the benchmarks for entropy with memoryviews (repeating them 30 times with different random seeds), and I got similar results for main, for memoryviews (pr_mv), and for pointers (pr_pointer):

From memory, my original benchmarks showed a ~1% runtime regression compared to main. Maybe something was running on my system during my original benchmarks.

If you are interested in running the benchmark:

python benchmark.py pr_results.json --config entropy

will store the results in pr_results.json and output the mean/std for the 30 runs.
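The contents of benchmark.py are not shown in this thread. The following is a hypothetical sketch of a harness with the same shape: 30 repetitions with different seeds, results written to a JSON file, and the mean/std reported. The `workload` function is a stand-in that sorts random data rather than fitting a tree, and every name here is an assumption.

```python
import json
import random
import statistics
import time


def workload(data):
    """Stand-in for the code under test (the real benchmark fits trees)."""
    return sorted(data)


def run_benchmark(out_path, n_repeats=30, n_samples=2_000):
    """Time the workload once per seed and store the durations as JSON."""
    durations = []
    for seed in range(n_repeats):
        # Generate the data outside the timed region so only the workload counts.
        rng = random.Random(seed)
        data = [rng.random() for _ in range(n_samples)]
        start = time.perf_counter()
        workload(data)
        durations.append(time.perf_counter() - start)

    results = {
        "n_repeats": n_repeats,
        "durations": durations,
        "mean": statistics.mean(durations),
        "std": statistics.stdev(durations),
    }
    with open(out_path, "w") as f:
        json.dump(results, f)
    return results
```

Running the same harness on main and on the PR branch and comparing the two means against their standard deviations gives a quick check of whether a difference of around 1% is distinguishable from noise.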

jeremiedbb · 2022-03-29T15:33:18Z

@jjerphan, you might want to take another look at it.

jjerphan

LGTM. Thank you, @thomasjpfan.

jjerphan · 2022-03-25T12:55:01Z

sklearn/tree/_criterion.pyx

@@ -319,8 +292,7 @@ cdef class ClassificationCriterion(Criterion):
        cdef SIZE_t offset = 0


This can be removed.

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

MNT Uses memoryviews in tree criterion

1eec472

github-actions bot added module:tree cython labels Mar 22, 2022

thomasjpfan commented Mar 22, 2022

jjerphan reviewed Mar 25, 2022

thomasjpfan and others added 3 commits March 25, 2022 10:51

Update sklearn/tree/_criterion.pyx

9a1119c

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

Apply suggestions from code review

c6127b8

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

CLN Remove unneeded imports

49800e6

jeremiedbb approved these changes Mar 26, 2022

ENH Use memoryviews

111efb7

jjerphan approved these changes Mar 29, 2022

jjerphan merged commit e113897 into scikit-learn:main Mar 29, 2022

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022

MNT Uses memoryviews in tree criterion (scikit-learn#22921)

b779407

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

jmloyola mentioned this pull request Apr 9, 2022

ENH Hellinger distance split criterion for classification trees #16478

Closed

5 tasks

thomasjpfan mentioned this pull request Apr 28, 2022

MNT Refactor splitter flow by removing indentation #23237

Merged

pedroilidio mentioned this pull request May 21, 2022

Cythonization error with pip install pedroilidio/bipartite_learn#3

Closed

c-bata mentioned this pull request Sep 10, 2022

Use pyproject.toml for setuptools and make Cython required. optuna/optuna-fast-fanova#7

Merged

harish1996 mentioned this pull request Apr 10, 2023

Update hellinger_distance_criterion.pyx EvgeniDubov/hellinger-distance-criterion#13

Closed
