POC 32bit datasets support for PairwiseDistancesReduction
#22590
Conversation
Also populate the .gitignore with new files
This allows keeping the same interface in Python, namely:
- PairwiseDistancesReduction.is_usable_for
- PairwiseDistancesReduction.valid_metrics
- PairwiseDistancesArgKmin.compute

while being able to route to the 32bit and 64bit implementations defined via Tempita.

The design pattern used here on PairwiseDistancesReduction and PairwiseDistancesArgKmin is the Facade design pattern. See: https://refactoring.guru/design-patterns/facade
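For illustration, a minimal sketch of this facade-style dispatch at the Python level (the class names follow this PR; the signatures and bodies are assumptions, and PairwiseDistancesArgKmin64 / PairwiseDistancesArgKmin32 stand for the Tempita-generated implementations):

import numpy as np

class PairwiseDistancesArgKmin:
    """Facade dispatching to the dtype-specific implementations."""

    @classmethod
    def compute(cls, X, Y, k, metric="euclidean"):
        # Route to the Tempita-generated implementation matching the dtype.
        if X.dtype == Y.dtype == np.float64:
            return PairwiseDistancesArgKmin64.compute(X, Y, k, metric)
        if X.dtype == Y.dtype == np.float32:
            return PairwiseDistancesArgKmin32.compute(X, Y, k, metric)
        raise ValueError(
            "Only float64 or float32 datasets pairs are supported. "
            f"Currently: X.dtype={X.dtype} and Y.dtype={Y.dtype}."
        )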
sklearn/metrics/_dist_metrics.pyx.tp
cdef inline np.ndarray _buffer_to_ndarray(const DTYPE_t* x, np.npy_intp n):
    # Wrap a memory buffer with an ndarray. Warning: this is not robust.
    # In particular, if x is deallocated before the returned array goes
    # out of scope, this could cause memory errors. Since there is not
    # a possibility of this for our use-case, this should be safe.

    # Note: this Segfaults unless np.import_array() is called above
    return PyArray_SimpleNewFromData(1, &n, DTYPECODE, <void*>x)


from libc.math cimport fabs, sqrt, exp, pow, cos, sin, asin

cdef DTYPE_t INF = np.inf
Note to reviewers: this has been moved under the Tempita loop to be templated.
sklearn/metrics/_dist_metrics.pyx.tp
######################################################################
# metric mappings
#  These map from metric id strings to class names
METRIC_MAPPING = {'euclidean': EuclideanDistance,
                  'l2': EuclideanDistance,
                  'minkowski': MinkowskiDistance,
                  'p': MinkowskiDistance,
                  'manhattan': ManhattanDistance,
                  'cityblock': ManhattanDistance,
                  'l1': ManhattanDistance,
                  'chebyshev': ChebyshevDistance,
                  'infinity': ChebyshevDistance,
                  'seuclidean': SEuclideanDistance,
                  'mahalanobis': MahalanobisDistance,
                  'wminkowski': WMinkowskiDistance,
                  'hamming': HammingDistance,
                  'canberra': CanberraDistance,
                  'braycurtis': BrayCurtisDistance,
                  'matching': MatchingDistance,
                  'jaccard': JaccardDistance,
                  'dice': DiceDistance,
                  'kulsinski': KulsinskiDistance,
                  'rogerstanimoto': RogersTanimotoDistance,
                  'russellrao': RussellRaoDistance,
                  'sokalmichener': SokalMichenerDistance,
                  'sokalsneath': SokalSneathDistance,
                  'haversine': HaversineDistance,
                  'pyfunc': PyFuncDistance}
Note to reviewers: this has been moved under the Tempita loop to be templated.
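For context, the Tempita loop in question looks roughly like this (a sketch; the {{bitness}} variable appears elsewhere in this PR's diffs, but the exact template layout may differ):

{{py:
# (suffix, input dtype) pairs for the generated specialisations
implementation_specific_values = [
    ('64', 'DTYPE_t'),
    ('32', 'cnp.float32_t'),
]
}}
{{for bitness, INPUT_DTYPE_t in implementation_specific_values}}

cdef class DistanceMetric{{bitness}}:
    cdef DTYPE_t dist(self, const {{INPUT_DTYPE_t}}* x1,
                      const {{INPUT_DTYPE_t}}* x2,
                      ITYPE_t size) nogil except -1:
        ...

{{endfor}}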
sklearn/metrics/_dist_metrics.pyx.tp
.. math::
   D(x, y) = \frac{N_{TF} + N_{FT}}{N_{TT} + N_{TF} + N_{FT}}
I removed this docstring because it was making the Tempita injection fail.
It can be reintroduced.
Indeed it would be good to reintroduce it with a notation that does not use backslashes (assuming they are the ones causing the problem).
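For instance, assuming the backslashes are indeed what breaks the Tempita injection, the same formula could be reintroduced in a backslash-free plain-text form:

.. math::
   D(x, y) = (N_TF + N_FT) / (N_TT + N_TF + N_FT)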
Previously, the upcast was done in the critical region, causing an unneeded upcast of one of the buffers. Now buffers are upcast only when and where necessary, without the previous duplication. Two methods are introduced to perform this upcast for each strategy. Yet, this adds some complexity to the templating.
cpdef DTYPE_t[::1] _sqeuclidean_row_norms(
    const DTYPE_t[:, ::1] X,
    ITYPE_t num_threads,
):
    """Compute the squared euclidean norm of the rows of X in parallel.

    This is faster than using np.einsum("ij, ij->i") even when using a single thread.
    """
    cdef:
        # Casting for X to remove the const qualifier is needed because APIs
        # exposed via scipy.linalg.cython_blas aren't reflecting the arguments'
        # const qualifier.
        # See: https://github.com/scipy/scipy/issues/14262
        DTYPE_t * X_ptr = <DTYPE_t *> &X[0, 0]
        ITYPE_t idx = 0
        ITYPE_t n = X.shape[0]
        ITYPE_t d = X.shape[1]
        DTYPE_t[::1] squared_row_norms = np.empty(n, dtype=DTYPE)

    for idx in prange(n, schedule='static', nogil=True, num_threads=num_threads):
        squared_row_norms[idx] = _dot(d, X_ptr + idx * d, 1, X_ptr + idx * d, 1)

    return squared_row_norms
Note to reviewer: this has been moved below, closer to the 32bit and 64bit definitions.
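To make the docstring's claim concrete, the single-threaded NumPy equivalent that _sqeuclidean_row_norms is compared against looks like this (a usage sketch with made-up data):

import numpy as np

X = np.random.rand(1000, 50)  # float64, C-contiguous

# Single-threaded equivalent of _sqeuclidean_row_norms(X, num_threads):
squared_row_norms = np.einsum("ij,ij->i", X, X)

assert squared_row_norms.shape == (1000,)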
from cython.parallel cimport parallel, prange

from ._dist_metrics cimport DatasetsPair, DenseDenseDatasetsPair
Note to reviewer: this has been moved below, closer to the 32bit and 64bit definitions.
cdef:
    readonly DatasetsPair datasets_pair

    # The number of threads that can be used is stored in effective_n_threads.
    #
    # The number of threads to use in the parallelisation strategy
    # (i.e. parallel_on_X or parallel_on_Y) can be smaller than effective_n_threads:
    # for small datasets, fewer threads might be needed to loop over pairs of chunks.
    #
    # Hence the number of threads that _will_ be used for looping over chunks
    # is stored in chunks_n_threads, allowing solely using what we need.
    #
    # Thus, an invariant is:
    #
    #                 chunks_n_threads <= effective_n_threads
    #
    ITYPE_t effective_n_threads
    ITYPE_t chunks_n_threads

    ITYPE_t n_samples_chunk, chunk_size

    ITYPE_t n_samples_X, X_n_samples_chunk, X_n_chunks, X_n_samples_last_chunk
    ITYPE_t n_samples_Y, Y_n_samples_chunk, Y_n_chunks, Y_n_samples_last_chunk

    bint execute_in_parallel_on_Y
Note to reviewer: this has been moved below, within the 32bit and 64bit definitions (done via Tempita).
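For illustration, a plain-Python sketch (with assumed names) of how chunks_n_threads can be derived so that this invariant holds:

def chunks_n_threads_for(n_samples, n_samples_chunk, effective_n_threads):
    # Number of chunks along the axis we parallelise on; the trailing
    # chunk may be smaller, hence the ceiling division.
    n_chunks = -(-n_samples // n_samples_chunk)
    # Use at most one thread per chunk, so that
    # chunks_n_threads <= effective_n_threads always holds.
    return min(n_chunks, effective_n_threads)

# Small dataset: only 3 of the 8 available threads are used.
assert chunks_n_threads_for(100, 40, 8) == 3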
Hence, PairwiseDistancesReduction now really is just an interface.
Conflicts:
    .gitignore
    sklearn/metrics/_dist_metrics.pxd.tp
    sklearn/metrics/_dist_metrics.pyx.tp
    sklearn/metrics/_pairwise_distances_reduction.pyx.tp
    sklearn/metrics/setup.py
    sklearn/metrics/tests/test_pairwise_distances_reduction.py
I can reproduce the crash locally in sklearn/metrics/tests/test_pairwise_distances_reduction.py::test_pairwise_distances_argkmin[42-parallel_on_X-float32-euclidean-0-50]:

Fatal Python error: Aborted
Current thread 0x00007fbcc71ee740 (most recent call first):
File "/home/ogrisel/code/scikit-learn/sklearn/metrics/tests/test_pairwise_distances_reduction.py", line 576 in test_pairwise_distances_argkmin
File "/home/ogrisel/mambaforge/envs/dev/lib/python3.10/site-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
File "/home/ogrisel/mambaforge/envs/dev/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall GDB hints at a double-free problem in the
|
Trying to investigate with valgrind configured to use the Python suppressions. Here is the output; I am not yet sure what to make of it.
I rebuilt scikit-learn with debug symbols (
Which is when starting the OpenMP region in this Cython function:

cdef void compute_exact_distances(self) nogil:
    cdef:
        ITYPE_t i, j
        ITYPE_t[:, ::1] Y_indices = self.argkmin_indices
        DTYPE_t[:, ::1] distances = self.argkmin_distances
    for i in prange(self.n_samples_X, schedule='static', nogil=True,
                    num_threads=self.effective_n_threads):
        for j in range(self.k):
            distances[i, j] = self.datasets_pair.distance_metric._rdist_to_dist(
                # Guard against eventual -0., causing nan production.
                max(distances[i, j], 0.)
            )
I re-ran the same test and this time I got:
This line is:

static void __pyx_tp_dealloc_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32(PyObject *o) {
  struct __pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32 *p = (struct __pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32 *)o;
  #if CYTHON_USE_TP_FINALIZE
  if (unlikely(PyType_HasFeature(Py_TYPE(o), Py_TPFLAGS_HAVE_FINALIZE) && Py_TYPE(o)->tp_finalize) && (!PyType_IS_GC(Py_TYPE(o)) || !_PyGC_FINALIZED(o))) {
    if (PyObject_CallFinalizerFromDealloc(o)) return;
  }
  #endif
  __Pyx_call_destructor(p->dist_middle_terms_chunks);
  __Pyx_call_destructor(p->X_c_upcast);
  __Pyx_call_destructor(p->Y_c_upcast);
  __PYX_XDEC_MEMVIEW(&p->X, 1);
  __PYX_XDEC_MEMVIEW(&p->Y, 1);
  (*Py_TYPE(o)->tp_free)(o);
}

So maybe the C++ code generated by Cython for the stack-allocated nested vectors is invalid. Maybe we could try to dynamically allocate those with the
The valgrind output seems to be pointing to problems with those same data structures, but in the constructor:
That is the line in the loop that does the upcasting:

# Upcasting X_c=X[X_start:X_end, :] from float32 to float64
for i in range(n_chunk_samples):
    for j in range(self.n_features):
        self.X_c_upcast[thread_num][i * self.n_features + j] = <DTYPE_t> self.X[X_start + i, j]
I tried to use a

diff --git a/sklearn/metrics/_pairwise_distances_reduction.pyx.tp b/sklearn/metrics/_pairwise_distances_reduction.pyx.tp
index 90ea78c11..ec68614b6 100644
--- a/sklearn/metrics/_pairwise_distances_reduction.pyx.tp
+++ b/sklearn/metrics/_pairwise_distances_reduction.pyx.tp
@@ -48,7 +48,7 @@ from .. import get_config
 from libc.stdlib cimport free, malloc
 from libc.stdio cimport printf
 from libc.float cimport DBL_MAX
-from libcpp.memory cimport shared_ptr, make_shared
+from libcpp.memory cimport shared_ptr, make_shared, unique_ptr, make_unique
 from libcpp.vector cimport vector
 from cython cimport final
 from cython.operator cimport dereference as deref
@@ -627,8 +627,8 @@ cdef class GEMMTermComputer{{bitness}}:
         vector[vector[DTYPE_t]] dist_middle_terms_chunks
 {{if need_upcast}}
-        vector[vector[DTYPE_t]] X_c_upcast
-        vector[vector[DTYPE_t]] Y_c_upcast
+        unique_ptr[vector[vector[DTYPE_t]]] X_c_upcast
+        unique_ptr[vector[vector[DTYPE_t]]] Y_c_upcast
 {{endif}}

     def __init__(self,
@@ -649,13 +649,13 @@ cdef class GEMMTermComputer{{bitness}}:
         self.dist_middle_terms_chunks = vector[vector[DTYPE_t]](self.effective_n_threads)
 {{if need_upcast}}
         return
 {{endif}}
@@ -743,7 +743,7 @@ cdef class GEMMTermComputer{{bitness}}:
         # Upcasting Y_c=Y[Y_start:Y_end, :] from float32 to float64
         for i in range(n_chunk_samples):
             for j in range(self.n_features):
-                self.Y_c_upcast[thread_num][i * self.n_features + j] = <DTYPE_t> self.Y[Y_start + i, j]
+                deref(self.Y_c_upcast)[thread_num][i * self.n_features + j] = <DTYPE_t> self.Y[Y_start + i, j]
 {{else}}
         return
 {{endif}}
@@ -776,8 +776,8 @@ cdef class GEMMTermComputer{{bitness}}:
         ITYPE_t K = X_c.shape[1]
         DTYPE_t alpha = - 2.
 {{if need_upcast}}
-        DTYPE_t * A = self.X_c_upcast[thread_num].data()
-        DTYPE_t * B = self.Y_c_upcast[thread_num].data()
+        DTYPE_t * A = deref(self.X_c_upcast)[thread_num].data()
+        DTYPE_t * B = deref(self.Y_c_upcast)[thread_num].data()
 {{else}}
         # Casting for A and B to remove the const is needed because APIs exposed via
         # scipy.linalg.cython_blas aren't reflecting the arguments' const qualifier.

but I still get the same failure. Same with a
My bad, the error was due to improper logic; f0fc839 fixes it. Yet, there are tests that aren't passing, likely due to numerical reasons.
Ok, that's great to know that there is no problem with the stack-allocated nested vector data structures, as I mistakenly suspected.
I think that it would be easier to debug and to review to (in this order):

What do you think?
        metric in cls.valid_metrics())


cdef class PairwiseDistancesArgKmin(PairwiseDistancesReduction):
Note to reviewers: similarly, PairwiseDistancesArgKmin is just an interface now, which dispatches to the correct dtype-specific implementation at runtime.
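A usage sketch of this dispatching interface (the compute signature shown is an assumption based on this PR):

import numpy as np
from sklearn.metrics._pairwise_distances_reduction import PairwiseDistancesArgKmin

X = np.random.rand(100, 10).astype(np.float32)
Y = np.random.rand(200, 10).astype(np.float32)

# The facade routes to the 32bit implementation based on the inputs' dtype;
# float64 inputs would be routed to the 64bit one.
argkmin_indices = PairwiseDistancesArgKmin.compute(X=X, Y=Y, k=5, metric="euclidean")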
f"Currently: X.dtype={X.dtype} and Y.dtype={Y.dtype}." | ||
) | ||
|
||
cdef class PairwiseDistancesRadiusNeighborhood(PairwiseDistancesReduction): |
Note to reviewers: similarly, PairwiseDistancesRadiusNeighborhood is just an interface now, which dispatches to the correct dtype-specific implementation at runtime.
cpdef DTYPE_t[::1] _sqeuclidean_row_norms64(
    const DTYPE_t[:, ::1] X,
    ITYPE_t num_threads,
):
    """Compute the squared euclidean norm of the rows of X in parallel.

    This is faster than using np.einsum("ij, ij->i") even when using a single thread.
    """
    cdef:
        # Casting for X to remove the const qualifier is needed because APIs
        # exposed via scipy.linalg.cython_blas aren't reflecting the arguments'
        # const qualifier.
        # See: https://github.com/scipy/scipy/issues/14262
        DTYPE_t * X_ptr = <DTYPE_t *> &X[0, 0]
        ITYPE_t i = 0
        ITYPE_t n = X.shape[0]
        ITYPE_t d = X.shape[1]
        DTYPE_t[::1] squared_row_norms = np.empty(n, dtype=DTYPE)

    for i in prange(n, schedule='static', nogil=True, num_threads=num_threads):
        squared_row_norms[i] = _dot(d, X_ptr + i * d, 1, X_ptr + i * d, 1)

    return squared_row_norms


cpdef DTYPE_t[::1] _sqeuclidean_row_norms32(
    const cnp.float32_t[:, ::1] X,
    ITYPE_t num_threads,
):
    """Compute the squared euclidean norm of the rows of X in parallel.

    This is faster than using np.einsum("ij, ij->i") even when using a single thread.
    """
    cdef:
        # Casting for X to remove the const qualifier is needed because APIs
        # exposed via scipy.linalg.cython_blas aren't reflecting the arguments'
        # const qualifier.
        # See: https://github.com/scipy/scipy/issues/14262
        cnp.float32_t * X_ptr = <cnp.float32_t *> &X[0, 0]
        ITYPE_t i = 0, j = 0
        ITYPE_t n = X.shape[0]
        ITYPE_t d = X.shape[1]
        DTYPE_t[::1] squared_row_norms = np.empty(n, dtype=DTYPE)

        # Thread-local buffer used to upcast the i-th row of X from 32bit to 64bit
        DTYPE_t * X_i_upcast_ptr

    with nogil, parallel(num_threads=num_threads):
        # Thread-local buffer allocation
        X_i_upcast_ptr = <DTYPE_t *> malloc(sizeof(DTYPE_t) * d)

        for i in prange(n, schedule='static'):
            # Upcasting the i-th row of X from 32bit to 64bit
            for j in range(d):
                X_i_upcast_ptr[j] = <DTYPE_t> deref(X_ptr + i * d + j)

            squared_row_norms[i] = _dot(d, X_i_upcast_ptr, 1, X_i_upcast_ptr, 1)

        free(X_i_upcast_ptr)

    return squared_row_norms
Note to reviewers: those are specialisations of _sqeuclidean_row_norms for each dtype.
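A quick consistency check between the two specialisations could look like this (a sketch, assuming both cpdef functions are importable from the module):

import numpy as np
from sklearn.metrics._pairwise_distances_reduction import (
    _sqeuclidean_row_norms32,
    _sqeuclidean_row_norms64,
)

X64 = np.random.rand(100, 10)
X32 = X64.astype(np.float32)

norms64 = np.asarray(_sqeuclidean_row_norms64(X64, num_threads=2))
norms32 = np.asarray(_sqeuclidean_row_norms32(X32, num_threads=2))

# The 32bit variant upcasts each row before the dot product, so both
# should agree up to float32 input rounding.
np.testing.assert_allclose(norms32, norms64, rtol=1e-5)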
Due to the refactoring, I'm closing this PR in favor of #23865.
Reference Issues/PRs
Follows #22134. Experimental POC to assess whether Tempita is sufficient.
What does this implement/fix? Explain your changes.
Full design proposal
Context
PairwiseDistancesReduction needs to support float32 and float64 DatasetsPairs. To do so, DatasetsPairs need to be adapted for float32 (X, Y), and concrete PairwiseDistancesReductions need to maintain the routing to those.

The current Cython extension types (i.e. cdef class) hierarchy only supports the 64bit implementation. It simply breaks down as follows:
Where FastEuclideanPairwiseDistancesArgKmin is called in most cases.

Problem
We need some flexibility to be able to support 32bit datasets while not duplicating the implementations. In this regard, templating (i.e. having classes be dtype-defined) and type covariance (i.e. if A extends B, then Class<A> extends Class<B>) would have come in handy to extend the current 64bit hierarchy to support 32bit. Yet, Cython supports neither templating in its language constructs nor type covariance.

Also, Cython offers support for fused types; however, they can't be used for Cython extension types' attributes, making this useful feature unusable in our context without some hacks.
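To make that limitation concrete, the following compiles for function arguments, but the commented-out attribute does not (a minimal sketch):

ctypedef fused floating:
    float
    double

# Fused types work for function and method arguments...
cdef floating first_element(floating[:, ::1] X) nogil:
    return X[0, 0]

# ...but Cython rejects a fused-typed extension type attribute,
# which is the limitation referred to above:
#
# cdef class DatasetsPairSketch:
#     cdef floating[:, ::1] X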
Proposed solution

Still, we can use Tempita to come up with a working solution preserving performance, at the cost of some maintenance. To perform this:

- 32bit and 64bit DistanceMetrics are defined via Tempita; the 64bit DistanceMetrics are still exposed via the current public API, but the 32bit versions must remain private.
- PairwiseDistancesReductions have been changed following the Facade design pattern, so as to keep the same Python interfaces (namely PairwiseDistancesReduction.is_usable_for, PairwiseDistancesReduction.valid_metrics, PairwiseDistancesArgKmin.compute) while having the concrete 32bit and 64bit implementations defined via Tempita, as follows:
Future extension solution

In the future, we could just use the same pattern. For instance, we could have:

TODO:
Hardware scalability
Adapting this script to use float32 datasets, we observe that this implementation scales linearly, similarly to its 64bit counterpart:
Raw results
Speed-ups between 1.0 (e7fb5b8) and this PR @ 65ebc92 (via ca9197a502bf1289db722a6261ff5fe7edf8e981)
Up to ×50 speed-ups in normal configurations.
Some regression when using small datasets and a high number of threads.
Benchmark plots for 1, 2, 4, 8, 16, 32, 64 and 128 cores (elided).
Any other comments?
Is this proposal too complicated?