POC 32bit datasets support for PairwiseDistancesReduction
#22590
Conversation
Also populate the .gitignore with new files
This allows keeping the same interface in Python, namely:
- PairwiseDistancesReduction.is_usable_for
- PairwiseDistancesReduction.valid_metrics
- PairwiseDistancesArgKmin.compute

while being able to route to the 32bit and 64bit implementations defined via Tempita.

The design pattern used here on PairwiseDistancesReduction and PairwiseDistancesArgKmin is the Facade design pattern. See: https://refactoring.guru/design-patterns/facade
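For illustration, a minimal sketch of this facade-style dispatch at the Python level (the class names follow this PR; the signatures and bodies are assumptions, and PairwiseDistancesArgKmin64 / PairwiseDistancesArgKmin32 stand for the Tempita-generated implementations):

import numpy as np

class PairwiseDistancesArgKmin:
    """Facade dispatching to the dtype-specific implementations."""

    @classmethod
    def compute(cls, X, Y, k, metric="euclidean"):
        # Route to the Tempita-generated implementation matching the dtype.
        if X.dtype == Y.dtype == np.float64:
            return PairwiseDistancesArgKmin64.compute(X, Y, k, metric)
        if X.dtype == Y.dtype == np.float32:
            return PairwiseDistancesArgKmin32.compute(X, Y, k, metric)
        raise ValueError(
            "Only float64 or float32 datasets pairs are supported. "
            f"Currently: X.dtype={X.dtype} and Y.dtype={Y.dtype}."
        )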
sklearn/metrics/_dist_metrics.pyx.tp
cdef inline np.ndarray _buffer_to_ndarray(const DTYPE_t* x, np.npy_intp n):
    # Wrap a memory buffer with an ndarray. Warning: this is not robust.
    # In particular, if x is deallocated before the returned array goes
    # out of scope, this could cause memory errors. Since there is not
    # a possibility of this for our use-case, this should be safe.

    # Note: this Segfaults unless np.import_array() is called above
    return PyArray_SimpleNewFromData(1, &n, DTYPECODE, <void*>x)


from libc.math cimport fabs, sqrt, exp, pow, cos, sin, asin

cdef DTYPE_t INF = np.inf
Note to reviewers: this has been moved under the Tempita loop to be templated.
sklearn/metrics/_dist_metrics.pyx.tp
######################################################################
# metric mappings
#  These map from metric id strings to class names
METRIC_MAPPING = {'euclidean': EuclideanDistance,
                  'l2': EuclideanDistance,
                  'minkowski': MinkowskiDistance,
                  'p': MinkowskiDistance,
                  'manhattan': ManhattanDistance,
                  'cityblock': ManhattanDistance,
                  'l1': ManhattanDistance,
                  'chebyshev': ChebyshevDistance,
                  'infinity': ChebyshevDistance,
                  'seuclidean': SEuclideanDistance,
                  'mahalanobis': MahalanobisDistance,
                  'wminkowski': WMinkowskiDistance,
                  'hamming': HammingDistance,
                  'canberra': CanberraDistance,
                  'braycurtis': BrayCurtisDistance,
                  'matching': MatchingDistance,
                  'jaccard': JaccardDistance,
                  'dice': DiceDistance,
                  'kulsinski': KulsinskiDistance,
                  'rogerstanimoto': RogersTanimotoDistance,
                  'russellrao': RussellRaoDistance,
                  'sokalmichener': SokalMichenerDistance,
                  'sokalsneath': SokalSneathDistance,
                  'haversine': HaversineDistance,
                  'pyfunc': PyFuncDistance}
Note to reviewers: this has been moved under the Tempita loop to be templated.
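For context, the Tempita loop in question looks roughly like this (a sketch; the {{bitness}} variable appears elsewhere in this PR's diffs, but the exact template layout may differ):

{{py:
# (suffix, input dtype) pairs for the generated specialisations
implementation_specific_values = [
    ('64', 'DTYPE_t'),
    ('32', 'cnp.float32_t'),
]
}}
{{for bitness, INPUT_DTYPE_t in implementation_specific_values}}

cdef class DistanceMetric{{bitness}}:
    cdef DTYPE_t dist(self, const {{INPUT_DTYPE_t}}* x1,
                      const {{INPUT_DTYPE_t}}* x2,
                      ITYPE_t size) nogil except -1:
        ...

{{endfor}}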
sklearn/metrics/_dist_metrics.pyx.tp
.. math::
   D(x, y) = \frac{N_{TF} + N_{FT}}{N_{TT} + N_{TF} + N_{FT}}
I removed this docstring because it was making the Tempita injection fail.
It can be reintroduced.
Indeed it would be good to reintroduce it with a notation that does not use backslashes (assuming they are the ones causing the problem).
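For instance, assuming the backslashes are indeed what breaks the Tempita injection, the same formula could be reintroduced in a backslash-free plain-text form:

.. math::
   D(x, y) = (N_TF + N_FT) / (N_TT + N_TF + N_FT)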
Previously, the upcast was done in the critical region, causing an unneeded upcast of one of the buffers. Now buffers are upcast only when and where necessary, without the previous duplication. Two methods are introduced to perform this upcast for each strategy. Yet, this adds some complexity to the templating.
cpdef DTYPE_t[::1] _sqeuclidean_row_norms(
    const DTYPE_t[:, ::1] X,
    ITYPE_t num_threads,
):
    """Compute the squared euclidean norm of the rows of X in parallel.

    This is faster than using np.einsum("ij, ij->i") even when using a single thread.
    """
    cdef:
        # Casting for X to remove the const qualifier is needed because APIs
        # exposed via scipy.linalg.cython_blas aren't reflecting the arguments'
        # const qualifier.
        # See: https://github.com/scipy/scipy/issues/14262
        DTYPE_t * X_ptr = <DTYPE_t *> &X[0, 0]
        ITYPE_t idx = 0
        ITYPE_t n = X.shape[0]
        ITYPE_t d = X.shape[1]
        DTYPE_t[::1] squared_row_norms = np.empty(n, dtype=DTYPE)

    for idx in prange(n, schedule='static', nogil=True, num_threads=num_threads):
        squared_row_norms[idx] = _dot(d, X_ptr + idx * d, 1, X_ptr + idx * d, 1)

    return squared_row_norms
Note to reviewer: this has been moved below, closer to the 32bit and 64bit definitions.
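To make the docstring's claim concrete, the single-threaded NumPy equivalent that _sqeuclidean_row_norms is compared against looks like this (a usage sketch with made-up data):

import numpy as np

X = np.random.rand(1000, 50)  # float64, C-contiguous

# Single-threaded equivalent of _sqeuclidean_row_norms(X, num_threads):
squared_row_norms = np.einsum("ij,ij->i", X, X)

assert squared_row_norms.shape == (1000,)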
from cython.parallel cimport parallel, prange

from ._dist_metrics cimport DatasetsPair, DenseDenseDatasetsPair
Note to reviewer: this has been moved below, closer to the 32bit and 64bit definitions.
cdef:
    readonly DatasetsPair datasets_pair

    # The number of threads that can be used is stored in effective_n_threads.
    #
    # The number of threads to use in the parallelisation strategy
    # (i.e. parallel_on_X or parallel_on_Y) can be smaller than effective_n_threads:
    # for small datasets, fewer threads might be needed to loop over pairs of chunks.
    #
    # Hence the number of threads that _will_ be used for looping over chunks
    # is stored in chunks_n_threads, allowing solely using what we need.
    #
    # Thus, an invariant is:
    #
    #                 chunks_n_threads <= effective_n_threads
    #
    ITYPE_t effective_n_threads
    ITYPE_t chunks_n_threads

    ITYPE_t n_samples_chunk, chunk_size

    ITYPE_t n_samples_X, X_n_samples_chunk, X_n_chunks, X_n_samples_last_chunk
    ITYPE_t n_samples_Y, Y_n_samples_chunk, Y_n_chunks, Y_n_samples_last_chunk

    bint execute_in_parallel_on_Y
Note to reviewer: this has been moved below, within the 32bit and 64bit definitions (done via Tempita).
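For illustration, a plain-Python sketch (with assumed names) of how chunks_n_threads can be derived so that this invariant holds:

def chunks_n_threads_for(n_samples, n_samples_chunk, effective_n_threads):
    # Number of chunks along the axis we parallelise on; the trailing
    # chunk may be smaller, hence the ceiling division.
    n_chunks = -(-n_samples // n_samples_chunk)
    # Use at most one thread per chunk, so that
    # chunks_n_threads <= effective_n_threads always holds.
    return min(n_chunks, effective_n_threads)

# Small dataset: only 3 of the 8 available threads are used.
assert chunks_n_threads_for(100, 40, 8) == 3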
Hence, PairwiseDistancesReduction now really is just an interface.
Conflicts:
    .gitignore
    sklearn/metrics/_dist_metrics.pxd.tp
    sklearn/metrics/_dist_metrics.pyx.tp
    sklearn/metrics/_pairwise_distances_reduction.pyx.tp
    sklearn/metrics/setup.py
    sklearn/metrics/tests/test_pairwise_distances_reduction.py
I can reproduce the crash locally in sklearn/metrics/tests/test_pairwise_distances_reduction.py::test_pairwise_distances_argkmin[42-parallel_on_X-float32-euclidean-0-50]:

Fatal Python error: Aborted
Current thread 0x00007fbcc71ee740 (most recent call first):
File "/home/ogrisel/code/scikit-learn/sklearn/metrics/tests/test_pairwise_distances_reduction.py", line 576 in test_pairwise_distances_argkmin
File "/home/ogrisel/mambaforge/envs/dev/lib/python3.10/site-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
File "/home/ogrisel/mambaforge/envs/dev/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall GDB hints at a double-free problem in the
|
Trying to investigate with valgrind configured to use the Python suppressions. Here is the output; I am not yet sure what to make of it.
I rebuilt scikit-learn with debug symbols (
Which is when starting the OpenMP region in this Cython function:

cdef void compute_exact_distances(self) nogil:
    cdef:
        ITYPE_t i, j
        ITYPE_t[:, ::1] Y_indices = self.argkmin_indices
        DTYPE_t[:, ::1] distances = self.argkmin_distances
    for i in prange(self.n_samples_X, schedule='static', nogil=True,
                    num_threads=self.effective_n_threads):
        for j in range(self.k):
            distances[i, j] = self.datasets_pair.distance_metric._rdist_to_dist(
                # Guard against eventual -0., causing nan production.
                max(distances[i, j], 0.)
            )
I re-ran the same test and this time I got:
This line is:

static void __pyx_tp_dealloc_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32(PyObject *o) {
  struct __pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32 *p = (struct __pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32 *)o;
  #if CYTHON_USE_TP_FINALIZE
  if (unlikely(PyType_HasFeature(Py_TYPE(o), Py_TPFLAGS_HAVE_FINALIZE) && Py_TYPE(o)->tp_finalize) && (!PyType_IS_GC(Py_TYPE(o)) || !_PyGC_FINALIZED(o))) {
    if (PyObject_CallFinalizerFromDealloc(o)) return;
  }
  #endif
  __Pyx_call_destructor(p->dist_middle_terms_chunks);
  __Pyx_call_destructor(p->X_c_upcast);
  __Pyx_call_destructor(p->Y_c_upcast);
  __PYX_XDEC_MEMVIEW(&p->X, 1);
  __PYX_XDEC_MEMVIEW(&p->Y, 1);
  (*Py_TYPE(o)->tp_free)(o);
}

So maybe the C++ code generated by Cython for the stack-allocated nested vectors is invalid. Maybe we could try to dynamically allocate those with the
The valgrind output seems to be pointing to problems with those same data structures, but in the constructor:
That is the line in the loop that does the upcasting:

# Upcasting X_c=X[X_start:X_end, :] from float32 to float64
for i in range(n_chunk_samples):
    for j in range(self.n_features):
        self.X_c_upcast[thread_num][i * self.n_features + j] = <DTYPE_t> self.X[X_start + i, j]
I tried to use a

diff --git a/sklearn/metrics/_pairwise_distances_reduction.pyx.tp b/sklearn/metrics/_pairwise_distances_reduction.pyx.tp
index 90ea78c11..ec68614b6 100644
--- a/sklearn/metrics/_pairwise_distances_reduction.pyx.tp
+++ b/sklearn/metrics/_pairwise_distances_reduction.pyx.tp
@@ -48,7 +48,7 @@ from .. import get_config
 from libc.stdlib cimport free, malloc
 from libc.stdio cimport printf
 from libc.float cimport DBL_MAX
-from libcpp.memory cimport shared_ptr, make_shared
+from libcpp.memory cimport shared_ptr, make_shared, unique_ptr, make_unique
 from libcpp.vector cimport vector
 from cython cimport final
 from cython.operator cimport dereference as deref
@@ -627,8 +627,8 @@ cdef class GEMMTermComputer{{bitness}}:
         vector[vector[DTYPE_t]] dist_middle_terms_chunks
 {{if need_upcast}}
-        vector[vector[DTYPE_t]] X_c_upcast
-        vector[vector[DTYPE_t]] Y_c_upcast
+        unique_ptr[vector[vector[DTYPE_t]]] X_c_upcast
+        unique_ptr[vector[vector[DTYPE_t]]] Y_c_upcast
 {{endif}}

     def __init__(self,
@@ -649,13 +649,13 @@ cdef class GEMMTermComputer{{bitness}}:
         self.dist_middle_terms_chunks = vector[vector[DTYPE_t]](self.effective_n_threads)
 {{if need_upcast}}
         return
 {{endif}}
@@ -743,7 +743,7 @@ cdef class GEMMTermComputer{{bitness}}:
         # Upcasting Y_c=Y[Y_start:Y_end, :] from float32 to float64
         for i in range(n_chunk_samples):
             for j in range(self.n_features):
-                self.Y_c_upcast[thread_num][i * self.n_features + j] = <DTYPE_t> self.Y[Y_start + i, j]
+                deref(self.Y_c_upcast)[thread_num][i * self.n_features + j] = <DTYPE_t> self.Y[Y_start + i, j]
 {{else}}
         return
 {{endif}}
@@ -776,8 +776,8 @@ cdef class GEMMTermComputer{{bitness}}:
         ITYPE_t K = X_c.shape[1]
         DTYPE_t alpha = - 2.
 {{if need_upcast}}
-        DTYPE_t * A = self.X_c_upcast[thread_num].data()
-        DTYPE_t * B = self.Y_c_upcast[thread_num].data()
+        DTYPE_t * A = deref(self.X_c_upcast)[thread_num].data()
+        DTYPE_t * B = deref(self.Y_c_upcast)[thread_num].data()
 {{else}}
         # Casting for A and B to remove the const is needed because APIs exposed via
         # scipy.linalg.cython_blas aren't reflecting the arguments' const qualifier.

but I still get the same failure. Same with a
My bad, the error was due to improper logic; f0fc839 fixes it. Yet, there are tests that aren't passing, likely due to numerical reasons.
Ok, that's great to know that there is no problem with the stack-allocated nested vector data structures, as I mistakenly suspected.
I think that it would be easier to debug and to review to (in this order):

What do you think?
        metric in cls.valid_metrics())


cdef class PairwiseDistancesArgKmin(PairwiseDistancesReduction):
Note to reviewers: similarly, PairwiseDistancesArgKmin is just an interface now, which dispatches to the correct dtype-specific implementation at runtime.
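A usage sketch of this dispatching interface (the compute signature shown is an assumption based on this PR):

import numpy as np
from sklearn.metrics._pairwise_distances_reduction import PairwiseDistancesArgKmin

X = np.random.rand(100, 10).astype(np.float32)
Y = np.random.rand(200, 10).astype(np.float32)

# The facade routes to the 32bit implementation based on the inputs' dtype;
# float64 inputs would be routed to the 64bit one.
argkmin_indices = PairwiseDistancesArgKmin.compute(X=X, Y=Y, k=5, metric="euclidean")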
f"Currently: X.dtype={X.dtype} and Y.dtype={Y.dtype}." | ||
) | ||
|
||
cdef class PairwiseDistancesRadiusNeighborhood(PairwiseDistancesReduction): |
Note to reviewers: similarly, PairwiseDistancesRadiusNeighborhood is just an interface now, which dispatches to the correct dtype-specific implementation at runtime.
cpdef DTYPE_t[::1] _sqeuclidean_row_norms64(
    const DTYPE_t[:, ::1] X,
    ITYPE_t num_threads,
):
    """Compute the squared euclidean norm of the rows of X in parallel.

    This is faster than using np.einsum("ij, ij->i") even when using a single thread.
    """
    cdef:
        # Casting for X to remove the const qualifier is needed because APIs
        # exposed via scipy.linalg.cython_blas aren't reflecting the arguments'
        # const qualifier.
        # See: https://github.com/scipy/scipy/issues/14262
        DTYPE_t * X_ptr = <DTYPE_t *> &X[0, 0]
        ITYPE_t i = 0
        ITYPE_t n = X.shape[0]
        ITYPE_t d = X.shape[1]
        DTYPE_t[::1] squared_row_norms = np.empty(n, dtype=DTYPE)

    for i in prange(n, schedule='static', nogil=True, num_threads=num_threads):
        squared_row_norms[i] = _dot(d, X_ptr + i * d, 1, X_ptr + i * d, 1)

    return squared_row_norms


cpdef DTYPE_t[::1] _sqeuclidean_row_norms32(
    const cnp.float32_t[:, ::1] X,
    ITYPE_t num_threads,
):
    """Compute the squared euclidean norm of the rows of X in parallel.

    This is faster than using np.einsum("ij, ij->i") even when using a single thread.
    """
    cdef:
        # Casting for X to remove the const qualifier is needed because APIs
        # exposed via scipy.linalg.cython_blas aren't reflecting the arguments'
        # const qualifier.
        # See: https://github.com/scipy/scipy/issues/14262
        cnp.float32_t * X_ptr = <cnp.float32_t *> &X[0, 0]
        ITYPE_t i = 0, j = 0
        ITYPE_t n = X.shape[0]
        ITYPE_t d = X.shape[1]
        DTYPE_t[::1] squared_row_norms = np.empty(n, dtype=DTYPE)

        # Thread-local buffer used to upcast the i-th row of X from 32bit to 64bit
        DTYPE_t * X_i_upcast_ptr

    with nogil, parallel(num_threads=num_threads):
        # Thread-local buffer allocation
        X_i_upcast_ptr = <DTYPE_t *> malloc(sizeof(DTYPE_t) * d)

        for i in prange(n, schedule='static'):
            # Upcasting the i-th row of X from 32bit to 64bit
            for j in range(d):
                X_i_upcast_ptr[j] = <DTYPE_t> deref(X_ptr + i * d + j)

            squared_row_norms[i] = _dot(d, X_i_upcast_ptr, 1, X_i_upcast_ptr, 1)

        free(X_i_upcast_ptr)

    return squared_row_norms
Note to reviewers: those are specialisations of _sqeuclidean_row_norms for each dtype.
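A quick consistency check between the two specialisations could look like this (a sketch, assuming both cpdef functions are importable from the module):

import numpy as np
from sklearn.metrics._pairwise_distances_reduction import (
    _sqeuclidean_row_norms32,
    _sqeuclidean_row_norms64,
)

X64 = np.random.rand(100, 10)
X32 = X64.astype(np.float32)

norms64 = np.asarray(_sqeuclidean_row_norms64(X64, num_threads=2))
norms32 = np.asarray(_sqeuclidean_row_norms32(X32, num_threads=2))

# The 32bit variant upcasts each row before the dot product, so both
# should agree up to float32 input rounding.
np.testing.assert_allclose(norms32, norms64, rtol=1e-5)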
Due to the refactoring, I'm closing this PR in favor of #23865.
Reference Issues/PRs
Follows #22134. Experimental POC to assess whether Tempita is sufficient.
What does this implement/fix? Explain your changes.
Full design proposal
Context
PairwiseDistancesReduction needs to support float32 and float64 DatasetsPairs. To do so, DatasetsPairs need to be adapted for float32 (X, Y), and concrete PairwiseDistancesReductions need to maintain the routing to those.

The current Cython extension types (i.e. cdef class) hierarchy only supports the 64bit implementation. It simply breaks down as follows:
Where FastEuclideanPairwiseDistancesArgKmin is called in most cases.

Problem
We need some flexibility to be able to support 32bit datasets while not duplicating the implementations. In this regard, templating (i.e. having classes be dtype-defined) and type covariance (i.e. if A extends B, then Class<A> extends Class<B>) would have come in handy to extend the current 64bit hierarchy to support 32bit. Yet, Cython supports neither templating in its language constructs nor type covariance.

Also, Cython offers support for fused types; however, they can't be used for Cython extension types' attributes, making this useful feature unusable in our context without some hacks.
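To make that limitation concrete, the following compiles for function arguments, but the commented-out attribute does not (a minimal sketch):

ctypedef fused floating:
    float
    double

# Fused types work for function and method arguments...
cdef floating first_element(floating[:, ::1] X) nogil:
    return X[0, 0]

# ...but Cython rejects a fused-typed extension type attribute,
# which is the limitation referred to above:
#
# cdef class DatasetsPairSketch:
#     cdef floating[:, ::1] X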
Proposed solution

Still, we can use Tempita to come up with a working solution preserving performance, at the cost of some maintenance. To perform this:

- 32bit and 64bit DistanceMetrics are defined via Tempita; the 64bit DistanceMetrics are still exposed via the current public API, but the 32bit versions must remain private.
- PairwiseDistancesReductions have been changed following the Facade design pattern, so as to keep the same Python interfaces (namely PairwiseDistancesReduction.is_usable_for, PairwiseDistancesReduction.valid_metrics, PairwiseDistancesArgKmin.compute) while having the concrete 32bit and 64bit implementations defined via Tempita, as follows:
Future extension solution

In the future, we could just use the same pattern. For instance, we could have:

TODO:
Hardware scalability
Adapting this script to use float32 datasets, we observe that this implementation scales linearly, similarly to its 64bit counterpart:
Raw results
Speed-ups between 1.0 (e7fb5b8) and this PR @ 65ebc92 (via ca9197a502bf1289db722a6261ff5fe7edf8e981)
Up to ×50 speed-ups in normal configurations.
Some regression when using small datasets and a high number of threads.
Benchmark plots for 1, 2, 4, 8, 16, 32, 64 and 128 cores (elided).
Any other comments?
Is this proposal too complicated?