ENH Introduce dtype preservation semantics in DistanceMetric objects. #27006
Conversation
Thank you, @Micky774.
We indeed need to add dtype preservation to the Cython implementations. Yet, I do not think we need to change the return type of methods to INPUT_DTYPE_t. One of my comments gives some details.
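A small illustration of why the methods' float64 return type can stay (a hedged pure-NumPy sketch, not the actual Cython code): widening float32 to float64 is exact, so returning a per-pair distance as float64 loses nothing; only the stored output array needs to carry the input dtype.

```python
import numpy as np

# Widening float32 -> float64 is exact: every float32 value is
# exactly representable as a float64, so a per-pair distance
# computed from float32 inputs can keep a float64 return type
# without changing its value.
x = np.float32(0.1)        # some float32 value (not exactly 0.1)
assert np.float64(x) == x  # exact widening conversion

# Only the stored output needs the input's dtype:
out = np.empty(1, dtype=np.float32)
out[0] = np.float64(x)     # narrowed back on assignment
print(out.dtype)           # float32
```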
I was concerned about the loss of precision for PairwiseDistancesReductions, but since the test suite passes I think we can accept one or the other: proceed as you prefer, @Micky774.
This reverts commit ee4faf4.
I've reintroduced the dtype-dependent signatures. Do you think this warrants a separate changelog entry considering …
I don't think this is necessary since …
LGTM. Thanks @Micky774
I merged …
…s. (scikit-learn#27006) Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Preserves dtype when computing distances, under the assumption that the precision of the input data implies the preferred precision of the output data. Note that accumulation still largely occurs using float64_t, with some exceptions.
Any other comments?
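The semantics described above can be sketched in pure NumPy (a hedged illustration, not the actual Cython implementation; the function name is hypothetical): accumulate in float64 for accuracy, then return the result in the input's dtype.

```python
import numpy as np

def euclidean_preserving_dtype(x, y):
    # Accumulate squared differences in float64 regardless of the
    # input dtype, mirroring the float64_t accumulation noted above.
    diff = x.astype(np.float64) - y.astype(np.float64)
    dist = np.sqrt(np.sum(diff * diff))
    # Return the distance in the input's dtype.
    return x.dtype.type(dist)

x32 = np.array([1.0, 2.0], dtype=np.float32)
y32 = np.array([4.0, 6.0], dtype=np.float32)
d = euclidean_preserving_dtype(x32, y32)
print(d, d.dtype)  # 5.0 float32
```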
Current benchmarks (generated here) suggest that there is no regression in the dense case (dist), and a 10-25% speedup in the sparse case (dist_csr).
Benchmark Plots
Memory profiling indicates a reduction of memory usage in this script from 763MiB to 382MiB.
cc: @jjerphan @OmarManzoor @thomasjpfan