ENH Introduce dtype preservation semantics in DistanceMetric objects. #27006
Conversation
Thank you, @Micky774.
We indeed need to add dtype preservation to the Cython implementations. Yet, I do not think we need to change the return type of methods to INPUT_DTYPE_t. One of my comments gives some details.
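A small illustration of why the methods' float64 return type can stay (a hedged pure-NumPy sketch, not the actual Cython code): widening float32 to float64 is exact, so returning a per-pair distance as float64 loses nothing; only the stored output array needs to carry the input dtype.

```python
import numpy as np

# Widening float32 -> float64 is exact: every float32 value is
# exactly representable as a float64, so a per-pair distance
# computed from float32 inputs can keep a float64 return type
# without changing its value.
x = np.float32(0.1)        # some float32 value (not exactly 0.1)
assert np.float64(x) == x  # exact widening conversion

# Only the stored output needs the input's dtype:
out = np.empty(1, dtype=np.float32)
out[0] = np.float64(x)     # narrowed back on assignment
print(out.dtype)           # float32
```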
I was concerned about the loss of precision for PairwiseDistancesReductions, but since the test suite passes I think we can accept one or the other: proceed as you prefer, @Micky774.
This reverts commit ee4faf4.
I've reintroduced the dtype-dependent signatures. Do you think this warrants a separate changelog entry considering …
I don't think this is necessary since …
LGTM. Thanks @Micky774
I merged …
…s. (scikit-learn#27006) Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Preserves dtype when computing distances, under the assumption that the precision of the input data implies the preferred precision of the output data. Note that accumulation still largely occurs using float64_t, with some exceptions.
Any other comments?
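The semantics described above can be sketched in pure NumPy (a hedged illustration, not the actual Cython implementation; the function name is hypothetical): accumulate in float64 for accuracy, then return the result in the input's dtype.

```python
import numpy as np

def euclidean_preserving_dtype(x, y):
    # Accumulate squared differences in float64 regardless of the
    # input dtype, mirroring the float64_t accumulation noted above.
    diff = x.astype(np.float64) - y.astype(np.float64)
    dist = np.sqrt(np.sum(diff * diff))
    # Return the distance in the input's dtype.
    return x.dtype.type(dist)

x32 = np.array([1.0, 2.0], dtype=np.float32)
y32 = np.array([4.0, 6.0], dtype=np.float32)
d = euclidean_preserving_dtype(x32, y32)
print(d, d.dtype)  # 5.0 float32
```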
Current benchmarks (generated here) suggest that there is no regression in the dense case (dist), and a 10-25% speedup in the sparse case (dist_csr).
Benchmark Plots
Memory profiling indicates a reduction of memory usage in this script from 763MiB to 382MiB.
cc: @jjerphan @OmarManzoor @thomasjpfan