POC 32bit datasets support for `PairwiseDistancesReduction` by jjerphan · Pull Request #22590 · scikit-learn/scikit-learn · GitHub

POC 32bit datasets support for PairwiseDistancesReduction #22590

Closed
wants to merge 32 commits into from

jjerphan
Member
@jjerphan jjerphan commented Feb 23, 2022

Reference Issues/PRs

Follows #22134. Experimental POC to assess if Tempita is sufficient.

What does this implement/fix? Explain your changes.

Full design proposal

Context

PairwiseDistancesReduction needs to support float32 and float64 DatasetPairs.

To do so, DatasetPairs needs to be adapted for float32 (X, Y), and the concrete PairwiseDistancesReductions need to maintain the routing to those.

The current hierarchy of Cython extension types (i.e. cdef class) only supports a 64bit implementation. It breaks down as follows:

                        (abstract)
                 PairwiseDistancesReduction
                            ^
                            |
                            |
                  (concrete 64bit implem.)
                       (Python API)
                  PairwiseDistancesArgKmin
                            ^
                            |
                            |
            (specialized concrete 64bit implem.)
            FastEuclideanPairwiseDistancesArgKmin

Where FastEuclideanPairwiseDistancesArgKmin is called in most cases.

Problem

We need some flexibility to support 32bit datasets without duplicating the implementations. In this regard, templating (i.e. having classes be dtype-defined) and type covariance (i.e. if A extends B, then Class<A> extends Class<B>) would have come in handy to extend the current 64bit hierarchy to support 32bit.

Yet, Cython does not support templating in its language constructs, nor does it support type covariance.

Cython does offer fused types, but they cannot be used for the attributes of Cython extension types, so this otherwise useful feature cannot be used in our context without some hacks.

Proposed solution

Still, we can use Tempita to come up with a working solution preserving performance at the cost of maintenance.

To perform this:

  • 32bit is now supported for DistanceMetrics
  • the 64bit implementations of DistanceMetrics are still exposed via the current public API, but the 32bit versions must remain private.
  • the layout of classes for PairwiseDistancesReductions has been changed using the facade design pattern, so as to keep the same Python interfaces (namely PairwiseDistancesReduction.is_usable_for, PairwiseDistancesReduction.valid_metrics, PairwiseDistancesArgKmin.compute) while having the concrete 32bit and 64bit implementations defined via Tempita, as follows:
                       (abstract)
                PairwiseDistancesReduction
                            ^
                            |
                            +------------------------------------------+--------------------------------------------------+
                            |                                          |                                                  |
                            |                                      (abstract)                                         (abstract)
                            |                             PairwiseDistancesReduction32                       PairwiseDistancesReduction64
                            |                                          ^                                                  ^
                            |                                          |                                                  |
                            |                                          |                                                  |
                            |                                          |                                                  |
                       (Python API)                         (concrete 32bit implem.)                           (concrete 64bit implem.)
                  PairwiseDistancesArgKmin                 PairwiseDistancesArgKmin32                          PairwiseDistancesArgKmin64
                                                                       |                                                  |
                                                                       |                                                  |
                                                                       |                                                  |
                                                                       |                                                  |
                                                       (specialized concrete 32bit implem.)               (specialized concrete 64bit implem.)
                                                     FastEuclideanPairwiseDistancesArgKmin32            FastEuclideanPairwiseDistancesArgKmin64
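
The facade layout above can be sketched in plain Python. This is only an illustration under assumptions: the real classes are Cython extension types generated via Tempita, whereas here `array.array` typecodes (`'f'` for float32, `'d'` for float64) stand in for dataset dtypes, the points are one-dimensional, and `_brute_force_argkmin` is a hypothetical helper, not scikit-learn code.

```python
from array import array


def _brute_force_argkmin(X, Y, k):
    # Hypothetical helper: for each 1-D point x, return the indices of the
    # k closest points of Y (absolute difference, ties broken by index).
    return [sorted(range(len(Y)), key=lambda j: abs(x - Y[j]))[:k] for x in X]


class PairwiseDistancesArgKmin32:
    """Stand-in for the concrete 32bit implementation."""

    @classmethod
    def compute(cls, X, Y, k):
        return cls.__name__, _brute_force_argkmin(X, Y, k)


class PairwiseDistancesArgKmin64:
    """Stand-in for the concrete 64bit implementation."""

    @classmethod
    def compute(cls, X, Y, k):
        return cls.__name__, _brute_force_argkmin(X, Y, k)


class PairwiseDistancesArgKmin:
    """Python API facade: same interface, routes on the datasets' dtype."""

    # Assumption: array typecodes play the role of float32 / float64 dtypes.
    _implementations = {
        "f": PairwiseDistancesArgKmin32,
        "d": PairwiseDistancesArgKmin64,
    }

    @classmethod
    def compute(cls, X, Y, k):
        if X.typecode != Y.typecode or X.typecode not in cls._implementations:
            raise ValueError("X and Y must both be float32 or both be float64")
        return cls._implementations[X.typecode].compute(X, Y, k)


xs, ys = [0.0, 10.0], [1.0, 9.0, 4.0]
print(PairwiseDistancesArgKmin.compute(array("f", xs), array("f", ys), k=2))
print(PairwiseDistancesArgKmin.compute(array("d", xs), array("d", ys), k=2))
```

Callers only ever touch `PairwiseDistancesArgKmin.compute`; which concrete class runs is decided by the data, which is the point of keeping the Python interface unchanged.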

Future extension solution

In the future, we could just use the same pattern. For instance we could have:


                           ...                                        ...                                                ...   
                            |                                          |                                                  |
                            |                                          |                                                  |
                            |                                          |                                                  |
                       (Python API)                         (concrete 32bit implem.)                           (concrete 64bit implem.)
          PairwiseDistancesRadiusNeighborhood          PairwiseDistancesRadiusNeighborhood32             PairwiseDistancesRadiusNeighborhood64
                                                                       |                                                  |
                                                                       |                                                  |
                                                                       |                                                  |
                                                                       |                                                  |
                                                       (specialized concrete 32bit implem.)               (specialized concrete 64bit implem.)
                                                 FastEuclideanPairwiseDistancesRadiusNeighborhood32  FastEuclideanPairwiseDistancesRadiusNeighborhood64
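
Reusing the pattern amounts to rendering one template once per (class name, bitsize) pair. The sketch below is a stand-in, not the PR's code: it uses Python's `string.Template` instead of Tempita and generates plain Python classes instead of Cython `cdef class` definitions; `CLASS_TEMPLATE` and `render` are hypothetical names.

```python
from string import Template

# One template, written once, with placeholders for the parts that vary.
CLASS_TEMPLATE = Template('''\
class ${name}${bits}:
    """Concrete ${bits}bit implementation (generated)."""
    dtype = "float${bits}"
''')


def render(names, bitsizes=(32, 64)):
    # Expand the template for every reduction name and every bitsize.
    return "\n".join(
        CLASS_TEMPLATE.substitute(name=name, bits=bits)
        for name in names
        for bits in bitsizes
    )


source = render(["PairwiseDistancesArgKmin", "PairwiseDistancesRadiusNeighborhood"])

# Execute the generated source to obtain the four classes.
namespace = {}
exec(source, namespace)

for class_name in sorted(n for n in namespace if n.startswith("Pairwise")):
    print(class_name, namespace[class_name].dtype)
```

Adding a new reduction then only means adding its name to the list, at the cost of reading generated rather than hand-written code.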

TODO:

  • fix the failing test
  • add more tests for 32bit datasets on user-facing interfaces
  • split this PR in smaller ones to ease reviews

Hardware scalability

Adapting this script to use float32 datasets, we assess that this implementation scales linearly, similarly to its 64bit counterpart:

speed_up_100000_100000_log

Raw results
    n_threads  n_train  n_test  n_features  mean_runtime  stderr_runtime
0           1   100000  100000          50     57.981657               0
1           2   100000  100000          50     29.401138               0
2           4   100000  100000          50     14.627211               0
3           8   100000  100000          50      7.748570               0
4          16   100000  100000          50      4.204991               0
5          32   100000  100000          50      2.385364               0
6          64   100000  100000          50      1.576305               0
7         128   100000  100000          50      2.115476               0
8           1   100000  100000         100     83.216700               0
9           2   100000  100000         100     42.717769               0
10          4   100000  100000         100     21.534403               0
11          8   100000  100000         100     10.926104               0
12         16   100000  100000         100      5.956875               0
13         32   100000  100000         100      3.348170               0
14         64   100000  100000         100      2.083073               0
15        128   100000  100000         100      3.822223               0
16          1   100000  100000         500    290.757614               0
17          2   100000  100000         500    142.708740               0
18          4   100000  100000         500     72.544749               0
19          8   100000  100000         500     35.726813               0
20         16   100000  100000         500     19.464046               0
21         32   100000  100000         500     10.771516               0
22         64   100000  100000         500      7.123072               0
23        128   100000  100000         500     11.439384               0
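
To read the scaling off the raw results, one can compute the speed-up relative to the single-thread run and the parallel efficiency (speed-up divided by thread count). A minimal sketch, using the `mean_runtime` values of the `n_features=50` rows above; the function name `scaling` is hypothetical.

```python
# (n_threads, mean_runtime) copied from the n_features=50 rows of the table.
runs = [
    (1, 57.981657),
    (2, 29.401138),
    (4, 14.627211),
    (8, 7.748570),
    (16, 4.204991),
    (32, 2.385364),
    (64, 1.576305),
    (128, 2.115476),
]


def scaling(runs):
    # Speed-up is relative to the 1-thread runtime; efficiency is the
    # fraction of the ideal (linear) speed-up actually obtained.
    base = dict(runs)[1]
    return [(n, base / t, base / t / n) for n, t in runs]


for n_threads, speed_up, efficiency in scaling(runs):
    print(f"{n_threads:>4} threads: x{speed_up:5.1f} speed-up, {efficiency:4.0%} efficiency")
```

On these rows the runtime keeps decreasing up to 64 threads and increases again at 128 threads.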

speed_up_1000000_10000_log

Raw results
    n_threads  n_train  n_test  n_features  mean_runtime  stderr_runtime
0           1  1000000   10000          50     57.369851               0
1           2  1000000   10000          50     29.368813               0
2           4  1000000   10000          50     14.890100               0
3           8  1000000   10000          50      7.564469               0
4          16  1000000   10000          50      3.912440               0
5          32  1000000   10000          50      2.094077               0
6          64  1000000   10000          50      1.356988               0
7         128  1000000   10000          50      1.528763               0
8           1  1000000   10000         100     81.371726               0
9           2  1000000   10000         100     42.803727               0
10          4  1000000   10000         100     21.626557               0
11          8  1000000   10000         100     11.082455               0
12         16  1000000   10000         100      5.795145               0
13         32  1000000   10000         100      3.061136               0
14         64  1000000   10000         100      2.006234               0
15        128  1000000   10000         100      2.012048               0
16          1  1000000   10000         500    286.566753               0
17          2  1000000   10000         500    149.337710               0
18          4  1000000   10000         500     75.545747               0
19          8  1000000   10000         500     38.256877               0
20         16  1000000   10000         500     19.557651               0
21         32  1000000   10000         500     11.193385               0
22         64  1000000   10000         500      9.533238               0
23        128  1000000   10000         500      8.433263               0

Speed-ups between 1.0 (e7fb5b8) and this PR @ 65ebc92 (via ca9197a502bf1289db722a6261ff5fe7edf8e981)

Up to ×50 speed-ups in normal configurations.
Some regression when using small datasets and a high number of threads.
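
In the asv output, the `ratio` column is the `after` time divided by the `before` time, so a speed-up factor is its inverse. The sketch below recomputes it from the timing cells of two lines copied from this comment; `parse_time` and `speed_up` are hypothetical helpers that only handle the units appearing here (`ms`, `s`, `m`).

```python
import re

# Seconds per unit, for the units used in this comment's asv output.
_UNIT_SECONDS = {"ms": 1e-3, "s": 1.0, "m": 60.0}
_TIME = re.compile(r"([\d.]+)±[\d.]+(ms|s|m)")


def parse_time(cell):
    # "51.1±0.3ms" -> 0.0511 (the ± error term is ignored).
    value, unit = _TIME.fullmatch(cell).groups()
    return float(value) * _UNIT_SECONDS[unit]


def speed_up(line):
    # Columns: sign, before, after, ratio, benchmark name.
    _sign, before, after, _ratio, _name = line.split(maxsplit=4)
    return parse_time(before) / parse_time(after)


improved = (
    "-         1.68±0s       51.1±0.3ms     0.03  pairwise_argkmin_estimator."
    "PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)"
)
regressed = (
    "+        90.5±3ms          216±8ms     2.38  pairwise_argkmin_estimator."
    "PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 10000, 100)"
)

for line in (improved, regressed):
    print(f"x{speed_up(line):.2f}")
```

A value above 1 is a speed-up (lines marked `-`), a value below 1 a regression (lines marked `+`).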

1 core
       before           after         ratio
     [998e8f20]       [65ebc927]
     <main>           <distance-metrics-32bit>
+      1.07±0.01m          1.18±0m     1.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_mean_shift(10000, 100000, 100)
-         993±1ms          889±1ms     0.90  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 10000, 100)
-        93.2±1ms       82.9±0.5ms     0.89  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 10000, 100)
-         1.97±0m          1.75±0m     0.89  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(10000, 100000, 100)
-        93.2±1ms       82.3±0.2ms     0.88  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 10000, 100)
-      93.1±0.4ms       81.4±0.2ms     0.87  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 1000, 100)
-      93.3±0.6ms       81.6±0.4ms     0.87  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 1000, 100)
-         1.01±0s          831±2ms     0.82  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 100000, 100)
-         1.01±0s          827±3ms     0.82  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 100000, 100)
-      5.97±0.01s       4.88±0.01s     0.82  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 100000, 100)
-      10.3±0.02s          8.06±0s     0.78  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 100000, 100)
-         1.04±0s          806±2ms     0.78  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 10000, 100)
-      10.3±0.03s          8.00±0s     0.77  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 100000, 100)
-         1.05±0s          806±3ms     0.77  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 10000, 100)
-      11.6±0.3ms       8.63±0.1ms     0.74  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 1000, 100)
-      11.7±0.3ms      8.65±0.04ms     0.74  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 1000, 100)
-       193±0.6ms       99.4±0.6ms     0.52  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-      20.7±0.3ms      10.4±0.08ms     0.50  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-         2.02±0s          998±2ms     0.49  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-         202±1ms       84.7±0.4ms     0.42  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         21.0±0s          8.28±0s     0.39  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 100000, 100)
-         2.11±0s          828±3ms     0.39  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
2 cores
       before           after         ratio
     [998e8f20]       [65ebc927]
     <main>           <distance-metrics-32bit>
-         970±2ms         857±50ms     0.88  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 10000, 100)
-         1.94±0m          1.66±0m     0.86  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(10000, 100000, 100)
-      5.74±0.01s       4.43±0.01s     0.77  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 100000, 100)
-      72.4±0.7ms       42.6±0.2ms     0.59  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 1000, 100)
-      72.4±0.9ms       42.5±0.2ms     0.59  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 1000, 100)
-        73.3±2ms       42.9±0.1ms     0.59  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 10000, 100)
-        73.7±2ms       43.1±0.1ms     0.58  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 10000, 100)
-         783±1ms        418±0.7ms     0.53  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 100000, 100)
-         782±2ms          416±1ms     0.53  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 100000, 100)
-         801±1ms          411±1ms     0.51  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 10000, 100)
-         804±1ms          411±1ms     0.51  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 10000, 100)
-      7.93±0.04s          4.04±0s     0.51  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 100000, 100)
-      7.93±0.03s          4.04±0s     0.51  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 100000, 100)
-      9.65±0.2ms      4.71±0.03ms     0.49  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 1000, 100)
-      9.76±0.2ms      4.68±0.03ms     0.48  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 1000, 100)
-      19.1±0.2ms      6.37±0.07ms     0.33  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-         175±1ms       51.7±0.3ms     0.30  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-         1.80±0s          503±1ms     0.28  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-         182±1ms       44.8±0.1ms     0.25  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.87±0s          423±1ms     0.23  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-      18.5±0.01s          4.15±0s     0.22  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
4 cores
       before           after         ratio
     [998e8f20]       [65ebc927]
     <main>           <distance-metrics-32bit>
-         1.91±0m          1.61±0m     0.84  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(10000, 100000, 100)
-      61.2±0.8ms       23.7±0.2ms     0.39  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 10000, 100)
-      61.3±0.6ms       23.7±0.3ms     0.39  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 10000, 100)
-      63.2±0.6ms       23.9±0.2ms     0.38  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 1000, 100)
-      63.0±0.6ms       23.8±0.2ms     0.38  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 1000, 100)
-      9.09±0.2ms      2.92±0.05ms     0.32  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 1000, 100)
-         679±1ms          218±1ms     0.32  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 100000, 100)
-         675±2ms          216±1ms     0.32  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 100000, 100)
-      9.44±0.2ms      2.95±0.06ms     0.31  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 1000, 100)
-         700±2ms          212±1ms     0.30  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 10000, 100)
-         698±1ms          211±1ms     0.30  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 10000, 100)
-      6.89±0.02s          2.06±0s     0.30  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 100000, 100)
-      6.88±0.03s          2.05±0s     0.30  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 100000, 100)
-      18.3±0.1ms      4.37±0.04ms     0.24  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-       163±0.9ms       27.9±0.1ms     0.17  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-         1.69±0s          262±1ms     0.15  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-       171±0.9ms       26.3±0.2ms     0.15  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.77±0s          217±1ms     0.12  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
8 cores
       before           after         ratio
     [998e8f20]       [65ebc927]
     <main>           <distance-metrics-32bit>
+         499±1ms          730±8ms     1.46  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 1000, 100)
-        111±10ms         94.3±7ms     0.85  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(1000, 1000, 100)
-         1.91±0m          1.60±0m     0.84  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(10000, 100000, 100)
-      10.7±0.4ms      3.55±0.06ms     0.33  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 1000, 100)
-      10.7±0.4ms      3.40±0.03ms     0.32  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 1000, 100)
-      20.2±0.4ms      4.84±0.04ms     0.24  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-      68.0±0.6ms       14.4±0.3ms     0.21  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 1000, 100)
-      68.6±0.6ms       14.3±0.3ms     0.21  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 1000, 100)
-      67.9±0.9ms       13.6±0.2ms     0.20  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 10000, 100)
-      67.4±0.7ms       13.5±0.2ms     0.20  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 10000, 100)
-         722±1ms        117±0.8ms     0.16  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 100000, 100)
-         721±1ms        116±0.8ms     0.16  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 100000, 100)
-         729±3ms        111±0.8ms     0.15  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 10000, 100)
-         727±2ms        111±0.8ms     0.15  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 10000, 100)
-      7.06±0.02s          1.07±0s     0.15  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 100000, 100)
-      7.06±0.03s          1.06±0s     0.15  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 100000, 100)
-       170±0.9ms       15.8±0.1ms     0.09  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-       179±0.7ms       16.2±0.2ms     0.09  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.73±0s          141±1ms     0.08  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-         1.79±0s        114±0.7ms     0.06  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
16 cores
       before           after         ratio
     [998e8f20]       [65ebc927]
     <main>           <distance-metrics-32bit>
+        13.1±1ms        28.0±10ms     2.13  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 1000, 100)
+         495±1ms         747±10ms     1.51  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 1000, 100)
+        22.5±1ms        32.3±10ms     1.43  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
+         1.67±0s        2.00±0.1s     1.20  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_mean_shift(1000, 100000, 100)
+         1.64±0s       1.94±0.03s     1.19  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_mean_shift(1000, 1000, 100)
+         1.64±0s        1.91±0.1s     1.16  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_mean_shift(1000, 10000, 100)
+         954±1ms       1.09±0.01s     1.15  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 10000, 100)
-       1.69±0.1s       1.53±0.02s     0.90  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_tsne(1000, 1000, 100)
-        67.7±2ms        58.3±20ms     0.86  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 10000, 100)
-         1.89±0m          1.58±0m     0.83  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(10000, 100000, 100)
-        67.1±2ms         44.0±1ms     0.66  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 10000, 100)
-        13.1±1ms      5.26±0.07ms     0.40  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 1000, 100)
-         171±3ms         56.0±6ms     0.33  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-        69.2±2ms       9.91±0.1ms     0.14  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 1000, 100)
-        69.4±2ms       9.60±0.1ms     0.14  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 1000, 100)
-         769±2ms       80.7±0.8ms     0.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 10000, 100)
-         767±3ms       80.0±0.7ms     0.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 10000, 100)
-         690±3ms       67.9±0.6ms     0.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 100000, 100)
-         687±3ms       67.4±0.6ms     0.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 100000, 100)
-      7.55±0.03s          580±2ms     0.08  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 100000, 100)
-      7.58±0.02s          581±2ms     0.08  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 100000, 100)
-         179±2ms       12.5±0.2ms     0.07  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.83±0s       98.6±0.9ms     0.05  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-         1.69±0s       79.7±0.5ms     0.05  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
32 cores
       before           after         ratio
     [998e8f20]       [65ebc927]
     <main>           <distance-metrics-32bit>
+         499±2ms          765±9ms     1.53  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 1000, 100)
+      1.77±0.01s        2.13±0.1s     1.20  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_mean_shift(1000, 10000, 100)
+      1.78±0.01s       2.09±0.06s     1.18  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_mean_shift(1000, 1000, 100)
+         968±2ms       1.14±0.02s     1.18  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 10000, 100)
+         1.79±0s       2.08±0.04s     1.16  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_mean_shift(1000, 100000, 100)
-       1.69±0.1s       1.42±0.01s     0.84  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_tsne(1000, 1000, 100)
-         1.89±0m          1.57±0m     0.83  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(10000, 100000, 100)
-        16.5±2ms       9.70±0.1ms     0.59  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 1000, 100)
-        16.4±2ms      8.91±0.09ms     0.54  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 1000, 100)
-         176±5ms       84.9±0.8ms     0.48  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-        25.5±2ms       10.9±0.2ms     0.43  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-        74.0±3ms       10.6±0.1ms     0.14  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 1000, 100)
-        74.5±3ms       10.4±0.1ms     0.14  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 1000, 100)
-         775±2ms       62.6±0.2ms     0.08  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 10000, 100)
-         775±2ms       62.4±0.2ms     0.08  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 10000, 100)
-         185±4ms       12.4±0.1ms     0.07  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         670±3ms       44.0±0.3ms     0.07  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 100000, 100)
-         669±3ms       43.9±0.3ms     0.07  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 100000, 100)
-      7.61±0.03s          334±3ms     0.04  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 100000, 100)
-      7.64±0.02s         334±20ms     0.04  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 100000, 100)
-         1.85±0s       80.3±0.2ms     0.04  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-         1.68±0s       51.1±0.3ms     0.03  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
64 cores
       before           after         ratio
     [998e8f20]       [65ebc927]
     <main>           <distance-metrics-32bit>
+        90.5±3ms          216±8ms     2.38  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 10000, 100)
+        90.5±4ms         184±20ms     2.03  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 10000, 100)
+         513±2ms         808±10ms     1.58  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 1000, 100)
+      1.01±0.01s       1.25±0.03s     1.24  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 10000, 100)
+      1.94±0.01s       2.38±0.08s     1.22  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_mean_shift(1000, 10000, 100)
+      1.96±0.01s       2.31±0.07s     1.18  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_mean_shift(1000, 1000, 100)
+      1.99±0.01s       2.28±0.07s     1.14  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_mean_shift(1000, 100000, 100)
-         689±1ms          621±4ms     0.90  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_affinity_propagation(1000, 100000, 100)
-         205±3ms          176±4ms     0.86  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-        27.1±5ms         22.5±9ms     0.83  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 1000, 100)
-         1.89±0m       1.56±0.01m     0.82  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(10000, 100000, 100)
-       1.89±0.1s       1.53±0.04s     0.81  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_tsne(1000, 1000, 100)
-        62.8±9ms         50.7±2ms     0.81  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_label_spreading(1000, 1000, 100)
-        60.4±9ms         48.1±2ms     0.80  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_label_propagation(1000, 1000, 100)
-        27.1±5ms       19.7±0.4ms     0.73  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 1000, 100)
-        37.4±5ms       22.7±0.2ms     0.61  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-        89.4±5ms        19.0±20ms     0.21  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 1000, 100)
-        89.2±5ms       14.4±0.2ms     0.16  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 1000, 100)
-         921±7ms          145±2ms     0.16  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 10000, 100)
-         921±8ms         95.0±2ms     0.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 10000, 100)
-         212±3ms       16.6±0.1ms     0.08  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         2.00±0s          110±2ms     0.05  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-        733±10ms       32.8±0.2ms     0.04  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 100000, 100)
-         728±9ms       31.6±0.3ms     0.04  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 100000, 100)
-      1.73±0.01s       36.3±0.2ms     0.02  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
128 cores
       before           after         ratio
     [998e8f20]       [65ebc927]
     <main>           <distance-metrics-32bit>
+         121±3ms        1.50±0.1s    12.40  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 10000, 100)
+         127±7ms       1.55±0.06s    12.25  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 10000, 100)
+        34.8±2ms         258±20ms     7.40  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 1000, 100)
+        32.5±2ms         211±10ms     6.52  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 1000, 100)
+         235±3ms       1.46±0.03s     6.25  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
+        44.9±2ms         257±10ms     5.73  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
+      5.78±0.02s       17.8±0.04s     3.08  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 100000, 100)
+      1.10±0.05s       2.66±0.03s     2.41  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 10000, 100)
+      2.37±0.02s          5.48±1s     2.31  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_mean_shift(1000, 100000, 100)
+        589±50ms       1.14±0.05s     1.94  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 1000, 100)
+         113±2ms          219±8ms     1.94  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 1000, 100)
+         113±2ms          191±7ms     1.69  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 1000, 100)
+      2.33±0.06s        3.55±0.3s     1.52  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_mean_shift(1000, 10000, 100)
+      2.36±0.03s       3.11±0.04s     1.32  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_mean_shift(1000, 1000, 100)
+      46.4±0.07s       1.01±0.01m     1.31  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_affinity_propagation(10000, 100000, 100)
+      1.07±0.01s       1.36±0.02s     1.26  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 10000, 100)
+      1.10±0.03s       1.39±0.04s     1.26  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 10000, 100)
+      10.4±0.08s       13.0±0.03s     1.24  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 100000, 100)
+      1.64±0.03s       1.93±0.05s     1.18  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(1000, 10000, 100)
+        446±20ms         492±40ms     1.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(1000, 1000, 100)
-      2.20±0.02s       1.39±0.04s     0.63  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-       21.3±0.2s        13.0±0.2s     0.61  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Any other comments?

Is this proposal too complicated?

@jjerphan jjerphan force-pushed the distance-metrics-32bit branch from 2b68863 to 8ae1cb6 Compare February 23, 2022 15:01
Also populate the .gitignore with new files
This allows keeping the same interface in Python, namely:

 - PairwiseDistancesReduction.is_usable_for
 - PairwiseDistancesReduction.valid_metrics
 - PairwiseDistancesArgKmin.compute

while being able to route to the 32bit and 64bit implementations
defined via Tempita.

The design pattern used here on PairwiseDistancesReduction
and PairwiseDistancesArgKmin is the Facade design pattern.

See: https://refactoring.guru/design-patterns/facade
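
A minimal sketch of the facade routing described in this commit message, written in plain Python rather than Cython. The classes `_ArgKmin32`, `_ArgKmin64`, `ArgKminFacade` and the helper `_brute_force_argkmin` are hypothetical stand-ins introduced for illustration; they are not the classes generated by the PR.

```python
import numpy as np


def _brute_force_argkmin(X, Y, k):
    # Squared Euclidean distances between all rows of X and Y,
    # accumulated in float64 regardless of the input dtype.
    diff = X[:, None, :].astype(np.float64) - Y[None, :, :].astype(np.float64)
    dist = (diff ** 2).sum(axis=-1)
    return np.argsort(dist, axis=1, kind="stable")[:, :k]


class _ArgKmin64:
    """Hypothetical stand-in for the templated 64bit implementation."""

    @classmethod
    def compute(cls, X, Y, k):
        return _brute_force_argkmin(X, Y, k)


class _ArgKmin32:
    """Hypothetical stand-in for the templated 32bit implementation."""

    @classmethod
    def compute(cls, X, Y, k):
        return _brute_force_argkmin(X, Y, k)


class ArgKminFacade:
    """Single Python-facing entry point that routes on the input dtype."""

    _implementations = {
        np.dtype(np.float64): _ArgKmin64,
        np.dtype(np.float32): _ArgKmin32,
    }

    @classmethod
    def is_usable_for(cls, X, Y):
        # Both datasets must share one of the supported dtypes.
        return X.dtype == Y.dtype and X.dtype in cls._implementations

    @classmethod
    def compute(cls, X, Y, k):
        if not cls.is_usable_for(X, Y):
            raise ValueError("X and Y must both be float32 or both be float64")
        return cls._implementations[X.dtype].compute(X, Y, k)
```

Callers only ever see `ArgKminFacade.is_usable_for` and `ArgKminFacade.compute`; which implementation runs is decided from `X.dtype` alone.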
@jjerphan jjerphan changed the title MAINT 32bit support for DistanceMetric MAINT 32bit support for DistanceMetric and PairwiseDistancesReduction Feb 24, 2022
Comment on lines 17 to 47

cdef inline np.ndarray _buffer_to_ndarray(const DTYPE_t* x, np.npy_intp n):
    # Wrap a memory buffer with an ndarray. Warning: this is not robust.
    # In particular, if x is deallocated before the returned array goes
    # out of scope, this could cause memory errors. Since there is not
    # a possibility of this for our use-case, this should be safe.

    # Note: this Segfaults unless np.import_array() is called above
    return PyArray_SimpleNewFromData(1, &n, DTYPECODE, <void*>x)


from libc.math cimport fabs, sqrt, exp, pow, cos, sin, asin
cdef DTYPE_t INF = np.inf

Member Author

Note to reviewers: this has been moved under the Tempita loop to be templated.
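
To illustrate what "moved under the Tempita loop" means, here is a plain-Python stand-in that renders one template once per dtype. It uses `str.format` instead of Tempita, and the suffixes, C type names and type codes in `IMPLEMENTATIONS` are assumptions chosen for the example, not the values used in the PR.

```python
# Plain-Python stand-in for a Tempita loop (this is NOT Tempita itself):
# a single template is rendered once per dtype, which yields one 32bit
# and one 64bit copy of every definition placed inside it.
TEMPLATE = """\
cdef {c_type} INF{suffix} = np.inf

cdef inline np.ndarray _buffer_to_ndarray{suffix}(const {c_type}* x, np.npy_intp n):
    return PyArray_SimpleNewFromData(1, &n, {type_code}, <void*>x)
"""

# (suffix, C type, NumPy type code): illustrative assumptions only.
IMPLEMENTATIONS = [
    ("64", "float64_t", "NPY_FLOAT64"),
    ("32", "float32_t", "NPY_FLOAT32"),
]


def render_all():
    """Return the generated source for every (suffix, type) combination."""
    return "\n".join(
        TEMPLATE.format(suffix=suffix, c_type=c_type, type_code=type_code)
        for suffix, c_type, type_code in IMPLEMENTATIONS
    )


if __name__ == "__main__":
    print(render_all())
```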

Comment on lines 44 to 72
######################################################################
# metric mappings
# These map from metric id strings to class names
METRIC_MAPPING = {'euclidean': EuclideanDistance,
                  'l2': EuclideanDistance,
                  'minkowski': MinkowskiDistance,
                  'p': MinkowskiDistance,
                  'manhattan': ManhattanDistance,
                  'cityblock': ManhattanDistance,
                  'l1': ManhattanDistance,
                  'chebyshev': ChebyshevDistance,
                  'infinity': ChebyshevDistance,
                  'seuclidean': SEuclideanDistance,
                  'mahalanobis': MahalanobisDistance,
                  'wminkowski': WMinkowskiDistance,
                  'hamming': HammingDistance,
                  'canberra': CanberraDistance,
                  'braycurtis': BrayCurtisDistance,
                  'matching': MatchingDistance,
                  'jaccard': JaccardDistance,
                  'dice': DiceDistance,
                  'kulsinski': KulsinskiDistance,
                  'rogerstanimoto': RogersTanimotoDistance,
                  'russellrao': RussellRaoDistance,
                  'sokalmichener': SokalMichenerDistance,
                  'sokalsneath': SokalSneathDistance,
                  'haversine': HaversineDistance,
                  'pyfunc': PyFuncDistance}

Member Author

Note to reviewers: this has been moved under the Tempita loop to be templated.

Comment on lines 863 to 864
.. math::
    D(x, y) = \frac{N_{TF} + N_{FT}}{N_{TT} + N_{TF} + N_{FT}}
Member Author

I removed this part of the docstring because it was making Tempita injection fail.

It can be reintroduced.

Member

Indeed it would be good to reintroduce it with a notation that does not use backslashes (assuming they are the ones causing the problem).

@jjerphan jjerphan marked this pull request as ready for review February 25, 2022 17:28
Previously, the upcast was done in the critical region,
which caused an unneeded upcast for one of the buffers.

Buffers are now upcast only when and where necessary,
without the duplication of the previous approach.

Two methods are introduced to perform this upcast
for each strategy.

Yet, this adds some complexity to the templating.
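
A rough NumPy sketch of the idea in this commit message: float32 chunks are cast into preallocated float64 buffers, and the X chunk is upcast once per outer iteration rather than inside the inner loop. The function names `upcast_chunk` and `chunked_gram` and the loop structure are assumptions for illustration; the PR implements this in templated Cython with per-thread buffers.

```python
import numpy as np


def upcast_chunk(chunk, buffer):
    """Return float64 data for `chunk`, copying only for non-float64 input."""
    if chunk.dtype == np.float64:
        return chunk  # already 64bit: no copy needed
    out = buffer[: chunk.shape[0], : chunk.shape[1]]
    out[...] = chunk  # element-wise cast into the preallocated buffer
    return out


def chunked_gram(X, Y, chunk_size):
    """Compute X @ Y.T chunk by chunk, accumulating in float64."""
    n_features = X.shape[1]
    X_buf = np.empty((chunk_size, n_features), dtype=np.float64)
    Y_buf = np.empty((chunk_size, n_features), dtype=np.float64)
    out = np.empty((X.shape[0], Y.shape[0]), dtype=np.float64)
    for xs in range(0, X.shape[0], chunk_size):
        # Upcast the X chunk once, outside the loop over Y chunks.
        X_c = upcast_chunk(X[xs:xs + chunk_size], X_buf)
        for ys in range(0, Y.shape[0], chunk_size):
            Y_c = upcast_chunk(Y[ys:ys + chunk_size], Y_buf)
            out[xs:xs + X_c.shape[0], ys:ys + Y_c.shape[0]] = X_c @ Y_c.T
    return out
```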
@jjerphan jjerphan changed the title MAINT 32bit support for DistanceMetric and PairwiseDistancesReduction MAINT 32bit support for PairwiseDistancesReduction Feb 28, 2022
Comment on lines 53 to 117
cpdef DTYPE_t[::1] _sqeuclidean_row_norms(
    const DTYPE_t[:, ::1] X,
    ITYPE_t num_threads,
):
    """Compute the squared euclidean norm of the rows of X in parallel.

    This is faster than using np.einsum("ij, ij->i") even when using a single thread.
    """
    cdef:
        # Casting for X to remove the const qualifier is needed because APIs
        # exposed via scipy.linalg.cython_blas aren't reflecting the arguments'
        # const qualifier.
        # See: https://github.com/scipy/scipy/issues/14262
        DTYPE_t * X_ptr = <DTYPE_t *> &X[0, 0]
        ITYPE_t idx = 0
        ITYPE_t n = X.shape[0]
        ITYPE_t d = X.shape[1]
        DTYPE_t[::1] squared_row_norms = np.empty(n, dtype=DTYPE)

    for idx in prange(n, schedule='static', nogil=True, num_threads=num_threads):
        squared_row_norms[idx] = _dot(d, X_ptr + idx * d, 1, X_ptr + idx * d, 1)

    return squared_row_norms
Member Author

Note to reviewer: this has been moved below, closer to the 32bit and 64bit definitions.
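
For reference, a sequential NumPy sketch of what `_sqeuclidean_row_norms` computes: one dot product of each row with itself, which matches the `np.einsum("ij, ij->i")` formulation mentioned in the docstring. The function name `sqeuclidean_row_norms` is an illustrative stand-in, and the sketch makes no claim about relative speed.

```python
import numpy as np


def sqeuclidean_row_norms(X):
    """Squared Euclidean norm of each row of X, one dot product per row."""
    n = X.shape[0]
    squared_row_norms = np.empty(n, dtype=X.dtype)
    for idx in range(n):
        # Same quantity as the BLAS _dot call on row idx in the Cython version.
        squared_row_norms[idx] = np.dot(X[idx], X[idx])
    return squared_row_norms


if __name__ == "__main__":
    X = np.array([[3.0, 4.0], [1.0, 2.0]])
    print(sqeuclidean_row_norms(X))
    print(np.einsum("ij, ij->i", X, X))
```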

from cython.parallel cimport parallel, prange

from ._dist_metrics cimport DatasetsPair, DenseDenseDatasetsPair
Member Author

Note to reviewer: this has been moved below, closer to the 32bit and 64bit definitions.

Comment on lines -144 to -210
cdef:
    readonly DatasetsPair datasets_pair

    # The number of threads that can be used is stored in effective_n_threads.
    #
    # The number of threads to use in the parallelisation strategy
    # (i.e. parallel_on_X or parallel_on_Y) can be smaller than effective_n_threads:
    # for small datasets, less threads might be needed to loop over pair of chunks.
    #
    # Hence the number of threads that _will_ be used for looping over chunks
    # is stored in chunks_n_threads, allowing solely using what we need.
    #
    # Thus, an invariant is:
    #
    #                 chunks_n_threads <= effective_n_threads
    #
    ITYPE_t effective_n_threads
    ITYPE_t chunks_n_threads

    ITYPE_t n_samples_chunk, chunk_size

    ITYPE_t n_samples_X, X_n_samples_chunk, X_n_chunks, X_n_samples_last_chunk
    ITYPE_t n_samples_Y, Y_n_samples_chunk, Y_n_chunks, Y_n_samples_last_chunk

    bint execute_in_parallel_on_Y
Member Author
@jjerphan jjerphan Feb 28, 2022

Note to reviewer: this has been moved below, within the 32bit and 64bit definitions (done via Tempita).

Hence, PairwiseDistancesReduction is now really just an interface.
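
The chunk bookkeeping in the attributes above can be sketched as follows. This is an assumption about how the values relate, not code from the PR: in particular, taking `chunks_n_threads` as the minimum of the number of chunks and `effective_n_threads` is one simple way to satisfy the stated invariant, and `chunk_layout` is a hypothetical helper name.

```python
def chunk_layout(n_samples, chunk_size, effective_n_threads):
    """Return (n_chunks, n_samples_last_chunk, chunks_n_threads).

    Assumes n_samples >= 1 and chunk_size >= 1.
    """
    n_full, remainder = divmod(n_samples, chunk_size)
    # A partial trailing chunk counts as one more chunk.
    n_chunks = n_full + (1 if remainder else 0)
    n_samples_last_chunk = remainder if remainder else chunk_size
    # Never use more threads than there are chunks to loop over, so that
    # chunks_n_threads <= effective_n_threads always holds.
    chunks_n_threads = min(n_chunks, effective_n_threads)
    return n_chunks, n_samples_last_chunk, chunks_n_threads


if __name__ == "__main__":
    # Small dataset on a large machine: only 4 of the 64 threads are needed.
    print(chunk_layout(1000, 256, 64))
```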

jjerphan added 4 commits April 4, 2022 14:34
Conflicts:
    .gitignore
    sklearn/metrics/_dist_metrics.pxd.tp
    sklearn/metrics/_dist_metrics.pyx.tp
    sklearn/metrics/_pairwise_distances_reduction.pyx.tp
    sklearn/metrics/setup.py
    sklearn/metrics/tests/test_pairwise_distances_reduction.py
@jjerphan jjerphan force-pushed the distance-metrics-32bit branch from df17323 to 3f6f2c6 Compare May 24, 2022 08:08
@ogrisel
Member
ogrisel commented May 25, 2022

I can reproduce the crash locally in test_pairwise_distances_argkmin[42-parallel_on_X-float32-euclidean-0-50]:

sklearn/metrics/tests/test_pairwise_distances_reduction.py::test_pairwise_distances_argkmin[42-parallel_on_X-float32-euclidean-0-50] Fatal Python error: Aborted

Current thread 0x00007fbcc71ee740 (most recent call first):
  File "/home/ogrisel/code/scikit-learn/sklearn/metrics/tests/test_pairwise_distances_reduction.py", line 576 in test_pairwise_distances_argkmin
  File "/home/ogrisel/mambaforge/envs/dev/lib/python3.10/site-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
  File "/home/ogrisel/mambaforge/envs/dev/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall

GDB hints at a double-free problem in the __dealloc__ of GEMMTermComputer32:

gdb --ex r --args python -m pytest sklearn/metrics/tests/test_pairwise_distances_reduction.py -k "test_pairwise_distances_argkmin[42-parallel_on_X-float32-euclidean-0-50]" -v
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737350469440) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737350469440) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737350469440, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7cc5476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7cab7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7d0c6f6 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7e5eb8c "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#6  0x00007ffff7d23d7c in malloc_printerr (str=str@entry=0x7ffff7e617d0 "double free or corruption (!prev)") at ./malloc/malloc.c:5664
#7  0x00007ffff7d25efc in _int_free (av=0x7ffff7e9cc80 <main_arena>, p=0x555556e64ae0, have_lock=<optimized out>) at ./malloc/malloc.c:4591
#8  0x00007ffff7d284d3 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#9  0x00007fffb31409fe in __pyx_tp_dealloc_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32(_object*) ()
   from /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so
#10 0x00007fffb314059d in __pyx_tp_dealloc_7sklearn_7metrics_29_pairwise_distances_reduction_FastEuclideanPairwiseDistancesArgKmin32(_object*) ()
   from /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so
#11 0x00007fffb3179b2b in __pyx_pw_7sklearn_7metrics_29_pairwise_distances_reduction_26PairwiseDistancesArgKmin32_1compute(_object*, _object*, _object*) ()
   from /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so
#12 0x0000555555697f4c in cfunction_call (func=0x7fffb2ed23e0, args=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.10.4/Objects/methodobject.c:543
#13 0x00007fffb3136e0e in __Pyx_PyObject_Call(_object*, _object*, _object*) () from /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so
#14 0x00007fffb316f7ba in __pyx_pw_7sklearn_7metrics_29_pairwise_distances_reduction_24PairwiseDistancesArgKmin_1compute(_object*, _object*, _object*) ()
[...]

@ogrisel
Member
ogrisel commented May 25, 2022

Trying to investigate with valgrind configured to use the Python suppressions:

valgrind --tool=memcheck --suppressions=$HOME/code/cpython/Misc/valgrind-python.supp python -m pytest sklearn/metrics/tests/test_pairwise_distances_reduction.py -k "test_pairwise_distances_argkmin[42-parallel_on_X-float32-euclidean-0-50]"

Here is the output. I am not yet sure what to make of this.

=============================================================================================== test session starts ================================================================================================
platform linux -- Python 3.10.4, pytest-7.1.1, pluggy-1.0.0
rootdir: /home/ogrisel/code/scikit-learn, configfile: setup.cfg
plugins: json-report-1.5.0, anyio-3.5.0, xdist-2.5.0, forked-1.4.0, metadata-2.0.1, timeout-2.1.0, json-0.4.0
collected 403 items / 402 deselected / 1 selected                                                                                                                                                                  

sklearn/metrics/tests/test_pairwise_distances_reduction.py ==39808== Warning: invalid file descriptor 1048564 in syscall close()
==39808== Warning: invalid file descriptor 1048565 in syscall close()
==39808== Warning: invalid file descriptor 1048566 in syscall close()
==39808== Warning: invalid file descriptor 1048567 in syscall close()
==39808==    Use --log-fd=<number> to select an alternative log fd.
==39808== Warning: invalid file descriptor 1048568 in syscall close()
==39808== Warning: invalid file descriptor 1048569 in syscall close()
==39664== Thread 11:
==39664== Invalid write of size 8
==39664==    at 0x4C4166A6: __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_18GEMMTermComputer32__parallel_on_X_init_chunk(__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32*, long, long, long) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C4128E8: __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_28PairwiseDistancesReduction32__parallel_on_X(__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_PairwiseDistancesReduction32*) [clone ._omp_fn.0] (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4A83D207: gomp_thread_start (team.c:125)
==39664==    by 0x49FFB42: start_thread (pthread_create.c:442)
==39664==    by 0x4A90BB3: clone (clone.S:100)
==39664==  Address 0x5826ff8 is 0 bytes after a block of size 5,000 alloc'd
==39664==    at 0x4849013: operator new(unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==39664==    by 0x4C4662EE: std::vector<double, std::allocator<double> >::_M_default_append(unsigned long) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C45FAB6: __pyx_pw_7sklearn_7metrics_29_pairwise_distances_reduction_18GEMMTermComputer32_1__init__(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x245634: type_call (typeobject.c:1133)
==39664==    by 0x4C41193D: __Pyx_PyObject_Call(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C43CF10: __pyx_pw_7sklearn_7metrics_29_pairwise_distances_reduction_39FastEuclideanPairwiseDistancesArgKmin32_3__init__(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x245634: type_call (typeobject.c:1133)
==39664==    by 0x4C41193D: __Pyx_PyObject_Call(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C448E8E: __pyx_pw_7sklearn_7metrics_29_pairwise_distances_reduction_26PairwiseDistancesArgKmin32_1compute(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x24BF4B: cfunction_call (methodobject.c:543)
==39664==    by 0x4C41193D: __Pyx_PyObject_Call(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C450DD9: __pyx_pw_7sklearn_7metrics_29_pairwise_distances_reduction_24PairwiseDistancesArgKmin_1compute(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664== 
==39664== Invalid write of size 8
==39664==    at 0x4C41657C: __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_18GEMMTermComputer32__parallel_on_X_pre_compute_and_reduce_distances_on_chunks(__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32*, long, long, long, long, long) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C412926: __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_28PairwiseDistancesReduction32__parallel_on_X(__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_PairwiseDistancesReduction32*) [clone ._omp_fn.0] (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4A83D207: gomp_thread_start (team.c:125)
==39664==    by 0x49FFB42: start_thread (pthread_create.c:442)
==39664==    by 0x4A90BB3: clone (clone.S:100)
==39664==  Address 0x615bb78 is 0 bytes after a block of size 5,000 alloc'd
==39664==    at 0x4849013: operator new(unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==39664==    by 0x4C4662EE: std::vector<double, std::allocator<double> >::_M_default_append(unsigned long) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C45FACD: __pyx_pw_7sklearn_7metrics_29_pairwise_distances_reduction_18GEMMTermComputer32_1__init__(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x245634: type_call (typeobject.c:1133)
==39664==    by 0x4C41193D: __Pyx_PyObject_Call(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C43CF10: __pyx_pw_7sklearn_7metrics_29_pairwise_distances_reduction_39FastEuclideanPairwiseDistancesArgKmin32_3__init__(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x245634: type_call (typeobject.c:1133)
==39664==    by 0x4C41193D: __Pyx_PyObject_Call(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C448E8E: __pyx_pw_7sklearn_7metrics_29_pairwise_distances_reduction_26PairwiseDistancesArgKmin32_1compute(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x24BF4B: cfunction_call (methodobject.c:543)
==39664==    by 0x4C41193D: __Pyx_PyObject_Call(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C450DD9: __pyx_pw_7sklearn_7metrics_29_pairwise_distances_reduction_24PairwiseDistancesArgKmin_1compute(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664== 

valgrind: m_mallocfree.c:303 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 5072, hi = 4636128830317133824.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.


host stacktrace:
==39664==    at 0x5804284A: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==39664==    by 0x58042977: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==39664==    by 0x58042B1B: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==39664==    by 0x5804C8CF: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==39664==    by 0x5803AE9A: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==39664==    by 0x580395B7: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==39664==    by 0x5803DF3D: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==39664==    by 0x58038868: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==39664==    by 0x100CC61D6F: ???

sched status:
  running_tid=11

Thread 1: status = VgTs_Yielding (lwpid 39664)
==39664==    at 0x4A83F133: do_spin (wait.h:56)
==39664==    by 0x4A83F133: do_wait (wait.h:66)
==39664==    by 0x4A83F2BB: gomp_team_barrier_wait_end (bar.c:112)
==39664==    by 0x4C412854: __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_28PairwiseDistancesReduction32__parallel_on_X(__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_PairwiseDistancesReduction32*) [clone ._omp_fn.0] (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4A8384A7: GOMP_parallel (parallel.c:178)
==39664==    by 0x4C411F42: __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_28PairwiseDistancesReduction32__parallel_on_X(__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_PairwiseDistancesReduction32*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C44904A: __pyx_pw_7sklearn_7metrics_29_pairwise_distances_reduction_26PairwiseDistancesArgKmin32_1compute(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x24BF4B: cfunction_call (methodobject.c:543)
==39664==    by 0x4C41193D: __Pyx_PyObject_Call(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C450DD9: __pyx_pw_7sklearn_7metrics_29_pairwise_distances_reduction_24PairwiseDistancesArgKmin_1compute(_object*, _object*, _object*) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x24BF4B: cfunction_call (methodobject.c:543)
==39664==    by 0x245297: _PyObject_MakeTpCall (call.c:215)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:112)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:99)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x2412D4: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x2412D4: _PyEval_EvalFrameDefault (ceval.c:4231)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x25A347: UnknownInlinedFun (call.c:267)
==39664==    by 0x25A347: UnknownInlinedFun (call.c:290)
==39664==    by 0x25A347: PyObject_Call (call.c:317)
==39664==    by 0x23E3C4: UnknownInlinedFun (ceval.c:5919)
==39664==    by 0x23E3C4: _PyEval_EvalFrameDefault (ceval.c:4277)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x23E3C4: UnknownInlinedFun (ceval.c:5919)
==39664==    by 0x23E3C4: _PyEval_EvalFrameDefault (ceval.c:4277)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x24029A: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x24029A: _PyEval_EvalFrameDefault (ceval.c:4181)
==39664==    by 0x2596AB: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x2596AB: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x2596AB: UnknownInlinedFun (call.c:342)
==39664==    by 0x2596AB: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x2596AB: method_vectorcall (classobject.c:53)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x24029A: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x24029A: _PyEval_EvalFrameDefault (ceval.c:4181)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x24459C: _PyObject_FastCallDictTstate (call.c:153)
==39664==    by 0x257458: _PyObject_Call_Prepend (call.c:431)
==39664==    by 0x32EFB4: slot_tp_call (typeobject.c:7489)
==39664==    by 0x245297: _PyObject_MakeTpCall (call.c:215)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:112)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:99)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x2412D4: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x2412D4: _PyEval_EvalFrameDefault (ceval.c:4231)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x23BF2E: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x23BF2E: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x23BF2E: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x23BF2E: _PyEval_EvalFrameDefault (ceval.c:4198)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x23E3C4: UnknownInlinedFun (ceval.c:5919)
==39664==    by 0x23E3C4: _PyEval_EvalFrameDefault (ceval.c:4277)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x24029A: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x24029A: _PyEval_EvalFrameDefault (ceval.c:4181)
==39664==    by 0x2596AB: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x2596AB: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x2596AB: UnknownInlinedFun (call.c:342)
==39664==    by 0x2596AB: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x2596AB: method_vectorcall (classobject.c:53)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x24029A: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x24029A: _PyEval_EvalFrameDefault (ceval.c:4181)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x24459C: _PyObject_FastCallDictTstate (call.c:153)
==39664==    by 0x257458: _PyObject_Call_Prepend (call.c:431)
==39664==    by 0x32EFB4: slot_tp_call (typeobject.c:7489)
==39664==    by 0x25A498: UnknownInlinedFun (call.c:305)
==39664==    by 0x25A498: PyObject_Call (call.c:317)
==39664==    by 0x23E3C4: UnknownInlinedFun (ceval.c:5919)
==39664==    by 0x23E3C4: _PyEval_EvalFrameDefault (ceval.c:4277)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x23BBCB: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x23BBCB: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x23BBCB: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x23BBCB: _PyEval_EvalFrameDefault (ceval.c:4213)
==39664==    by 0x2596AB: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x2596AB: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x2596AB: UnknownInlinedFun (call.c:342)
==39664==    by 0x2596AB: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x2596AB: method_vectorcall (classobject.c:53)
==39664==    by 0x23CBD0: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x23CBD0: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x23CBD0: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x23CBD0: _PyEval_EvalFrameDefault (ceval.c:4231)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x23E3C4: UnknownInlinedFun (ceval.c:5919)
==39664==    by 0x23E3C4: _PyEval_EvalFrameDefault (ceval.c:4277)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x23BBCB: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x23BBCB: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x23BBCB: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x23BBCB: _PyEval_EvalFrameDefault (ceval.c:4213)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x23CBD0: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x23CBD0: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x23CBD0: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x23CBD0: _PyEval_EvalFrameDefault (ceval.c:4231)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x23E3C4: UnknownInlinedFun (ceval.c:5919)
==39664==    by 0x23E3C4: _PyEval_EvalFrameDefault (ceval.c:4277)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x24029A: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x24029A: _PyEval_EvalFrameDefault (ceval.c:4181)
==39664==    by 0x2596AB: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x2596AB: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x2596AB: UnknownInlinedFun (call.c:342)
==39664==    by 0x2596AB: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x2596AB: method_vectorcall (classobject.c:53)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x24029A: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x24029A: _PyEval_EvalFrameDefault (ceval.c:4181)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x24459C: _PyObject_FastCallDictTstate (call.c:153)
==39664==    by 0x257458: _PyObject_Call_Prepend (call.c:431)
==39664==    by 0x32EFB4: slot_tp_call (typeobject.c:7489)
==39664==    by 0x245297: _PyObject_MakeTpCall (call.c:215)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:112)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:99)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x2412D4: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x2412D4: _PyEval_EvalFrameDefault (ceval.c:4231)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x23E3C4: UnknownInlinedFun (ceval.c:5919)
==39664==    by 0x23E3C4: _PyEval_EvalFrameDefault (ceval.c:4277)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x24029A: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x24029A: _PyEval_EvalFrameDefault (ceval.c:4181)
==39664==    by 0x2596AB: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x2596AB: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x2596AB: UnknownInlinedFun (call.c:342)
==39664==    by 0x2596AB: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x2596AB: method_vectorcall (classobject.c:53)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x24029A: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x24029A: _PyEval_EvalFrameDefault (ceval.c:4181)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x24459C: _PyObject_FastCallDictTstate (call.c:153)
==39664==    by 0x257458: _PyObject_Call_Prepend (call.c:431)
==39664==    by 0x32EFB4: slot_tp_call (typeobject.c:7489)
==39664==    by 0x245297: _PyObject_MakeTpCall (call.c:215)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:112)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:99)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x2412D4: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x2412D4: _PyEval_EvalFrameDefault (ceval.c:4231)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x23BBCB: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x23BBCB: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x23BBCB: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x23BBCB: _PyEval_EvalFrameDefault (ceval.c:4213)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x23BBCB: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x23BBCB: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x23BBCB: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x23BBCB: _PyEval_EvalFrameDefault (ceval.c:4213)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x23E3C4: UnknownInlinedFun (ceval.c:5919)
==39664==    by 0x23E3C4: _PyEval_EvalFrameDefault (ceval.c:4277)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x24029A: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x24029A: _PyEval_EvalFrameDefault (ceval.c:4181)
==39664==    by 0x2596AB: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x2596AB: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x2596AB: UnknownInlinedFun (call.c:342)
==39664==    by 0x2596AB: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x2596AB: method_vectorcall (classobject.c:53)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x24029A: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x24029A: _PyEval_EvalFrameDefault (ceval.c:4181)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x24459C: _PyObject_FastCallDictTstate (call.c:153)
==39664==    by 0x257458: _PyObject_Call_Prepend (call.c:431)
==39664==    by 0x32EFB4: slot_tp_call (typeobject.c:7489)
==39664==    by 0x245297: _PyObject_MakeTpCall (call.c:215)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:112)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:99)
==39664==    by 0x2412D4: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x2412D4: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x2412D4: _PyEval_EvalFrameDefault (ceval.c:4231)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x23BBCB: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x23BBCB: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x23BBCB: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x23BBCB: _PyEval_EvalFrameDefault (ceval.c:4213)
==39664==    by 0x24C3CE: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x24C3CE: UnknownInlinedFun (ceval.c:5065)
==39664==    by 0x24C3CE: _PyFunction_Vectorcall (call.c:342)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x24029A: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x24029A: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x24029A: _PyEval_EvalFrameDefault (ceval.c:4181)
==39664==    by 0x2F2B91: UnknownInlinedFun (pycore_ceval.h:46)
==39664==    by 0x2F2B91: _PyEval_Vector (ceval.c:5065)
==39664==    by 0x2F2AD6: PyEval_EvalCode (ceval.c:1134)
==39664==    by 0x2F9C1E: UnknownInlinedFun (bltinmodule.c:1056)
==39664==    by 0x2F9C1E: builtin_exec (bltinmodule.c.h:371)
==39664==    by 0x24C5BE: cfunction_vectorcall_FASTCALL (methodobject.c:430)
==39664==    by 0x23BBCB: UnknownInlinedFun (abstract.h:114)
==39664==    by 0x23BBCB: UnknownInlinedFun (abstract.h:123)
==39664==    by 0x23BBCB: UnknownInlinedFun (ceval.c:5867)
==39664==    by 0x23BBCB: _PyEval_EvalFrameDefault (ceval.c:4213)
client stack range: [0x1FFEFE7000 0x1FFF000FFF] client SP: 0x1FFEFFABC8
valgrind stack range: [0x1002CBE000 0x1002DBDFFF] top usage: 18984 of 1048576

Thread 2: status = VgTs_WaitSys syscall 202 (lwpid 39760)
==39664==    at 0x49FC197: __futex_abstimed_wait_common64 (futex-internal.c:57)
==39664==    by 0x49FC197: __futex_abstimed_wait_common (futex-internal.c:87)
==39664==    by 0x49FC197: __futex_abstimed_wait_cancelable64 (futex-internal.c:139)
==39664==    by 0x49FEAC0: __pthread_cond_wait_common (pthread_cond_wait.c:503)
==39664==    by 0x49FEAC0: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:627)
==39664==    by 0x7ED875B: blas_thread_server (in /home/ogrisel/mambaforge/envs/dev/lib/libopenblasp-r0.3.20.so)
==39664==    by 0x49FFB42: start_thread (pthread_create.c:442)
==39664==    by 0x4A90BB3: clone (clone.S:100)
client stack range: [0x9FF0000 0xA7EEFFF] client SP: 0xA7EECF0
valgrind stack range: [0x100993E000 0x1009A3DFFF] top usage: 5472 of 1048576

Thread 3: status = VgTs_WaitSys syscall 202 (lwpid 39761)
==39664==    at 0x49FC197: __futex_abstimed_wait_common64 (futex-internal.c:57)
==39664==    by 0x49FC197: __futex_abstimed_wait_common (futex-internal.c:87)
==39664==    by 0x49FC197: __futex_abstimed_wait_cancelable64 (futex-internal.c:139)
==39664==    by 0x49FEAC0: __pthread_cond_wait_common (pthread_cond_wait.c:503)
==39664==    by 0x49FEAC0: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:627)
==39664==    by 0x7ED875B: blas_thread_server (in /home/ogrisel/mambaforge/envs/dev/lib/libopenblasp-r0.3.20.so)
==39664==    by 0x49FFB42: start_thread (pthread_create.c:442)
==39664==    by 0x4A90BB3: clone (clone.S:100)
client stack range: [0xA7F1000 0xAFEFFFF] client SP: 0xAFEFCF0
valgrind stack range: [0x1009C56000 0x1009D55FFF] top usage: 5472 of 1048576

Thread 4: status = VgTs_WaitSys syscall 202 (lwpid 39762)
==39664==    at 0x49FC197: __futex_abstimed_wait_common64 (futex-internal.c:57)
==39664==    by 0x49FC197: __futex_abstimed_wait_common (futex-internal.c:87)
==39664==    by 0x49FC197: __futex_abstimed_wait_cancelable64 (futex-internal.c:139)
==39664==    by 0x49FEAC0: __pthread_cond_wait_common (pthread_cond_wait.c:503)
==39664==    by 0x49FEAC0: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:627)
==39664==    by 0x7ED875B: blas_thread_server (in /home/ogrisel/mambaforge/envs/dev/lib/libopenblasp-r0.3.20.so)
==39664==    by 0x49FFB42: start_thread (pthread_create.c:442)
==39664==    by 0x4A90BB3: clone (clone.S:100)
client stack range: [0xAFF2000 0xB7F0FFF] client SP: 0xB7F0CF0
valgrind stack range: [0x1009F5A000 0x100A059FFF] top usage: 5472 of 1048576

Thread 5: status = VgTs_WaitSys syscall 202 (lwpid 39763)
==39664==    at 0x49FC197: __futex_abstimed_wait_common64 (futex-internal.c:57)
==39664==    by 0x49FC197: __futex_abstimed_wait_common (futex-internal.c:87)
==39664==    by 0x49FC197: __futex_abstimed_wait_cancelable64 (futex-internal.c:139)
==39664==    by 0x49FEAC0: __pthread_cond_wait_common (pthread_cond_wait.c:503)
==39664==    by 0x49FEAC0: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:627)
==39664==    by 0x7ED875B: blas_thread_server (in /home/ogrisel/mambaforge/envs/dev/lib/libopenblasp-r0.3.20.so)
==39664==    by 0x49FFB42: start_thread (pthread_create.c:442)
==39664==    by 0x4A90BB3: clone (clone.S:100)
client stack range: [0xB7F3000 0xBFF1FFF] client SP: 0xBFF1CF0
valgrind stack range: [0x100A25E000 0x100A35DFFF] top usage: 5472 of 1048576

Thread 6: status = VgTs_WaitSys syscall 202 (lwpid 39764)
==39664==    at 0x49FC197: __futex_abstimed_wait_common64 (futex-internal.c:57)
==39664==    by 0x49FC197: __futex_abstimed_wait_common (futex-internal.c:87)
==39664==    by 0x49FC197: __futex_abstimed_wait_cancelable64 (futex-internal.c:139)
==39664==    by 0x49FEAC0: __pthread_cond_wait_common (pthread_cond_wait.c:503)
==39664==    by 0x49FEAC0: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:627)
==39664==    by 0x7ED875B: blas_thread_server (in /home/ogrisel/mambaforge/envs/dev/lib/libopenblasp-r0.3.20.so)
==39664==    by 0x49FFB42: start_thread (pthread_create.c:442)
==39664==    by 0x4A90BB3: clone (clone.S:100)
client stack range: [0xBFF4000 0xC7F2FFF] client SP: 0xC7F2CF0
valgrind stack range: [0x100A562000 0x100A661FFF] top usage: 5472 of 1048576

Thread 7: status = VgTs_WaitSys syscall 202 (lwpid 39765)
==39664==    at 0x49FC197: __futex_abstimed_wait_common64 (futex-internal.c:57)
==39664==    by 0x49FC197: __futex_abstimed_wait_common (futex-internal.c:87)
==39664==    by 0x49FC197: __futex_abstimed_wait_cancelable64 (futex-internal.c:139)
==39664==    by 0x49FEAC0: __pthread_cond_wait_common (pthread_cond_wait.c:503)
==39664==    by 0x49FEAC0: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:627)
==39664==    by 0x7ED875B: blas_thread_server (in /home/ogrisel/mambaforge/envs/dev/lib/libopenblasp-r0.3.20.so)
==39664==    by 0x49FFB42: start_thread (pthread_create.c:442)
==39664==    by 0x4A90BB3: clone (clone.S:100)
client stack range: [0xC7F5000 0xCFF3FFF] client SP: 0xCFF3CF0
valgrind stack range: [0x100A866000 0x100A965FFF] top usage: 5472 of 1048576

Thread 8: status = VgTs_WaitSys syscall 202 (lwpid 39766)
==39664==    at 0x49FC197: __futex_abstimed_wait_common64 (futex-internal.c:57)
==39664==    by 0x49FC197: __futex_abstimed_wait_common (futex-internal.c:87)
==39664==    by 0x49FC197: __futex_abstimed_wait_cancelable64 (futex-internal.c:139)
==39664==    by 0x49FEAC0: __pthread_cond_wait_common (pthread_cond_wait.c:503)
==39664==    by 0x49FEAC0: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:627)
==39664==    by 0x7ED875B: blas_thread_server (in /home/ogrisel/mambaforge/envs/dev/lib/libopenblasp-r0.3.20.so)
==39664==    by 0x49FFB42: start_thread (pthread_create.c:442)
==39664==    by 0x4A90BB3: clone (clone.S:100)
client stack range: [0xCFF6000 0xD7F4FFF] client SP: 0xD7F4CF0
valgrind stack range: [0x100AB6A000 0x100AC69FFF] top usage: 5472 of 1048576

Thread 9: status = VgTs_Yielding (lwpid 39801)
==39664==    at 0x4A83F133: do_spin (wait.h:56)
==39664==    by 0x4A83F133: do_wait (wait.h:66)
==39664==    by 0x4A83F2BB: gomp_team_barrier_wait_end (bar.c:112)
==39664==    by 0x4C412854: __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_28PairwiseDistancesReduction32__parallel_on_X(__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_PairwiseDistancesReduction32*) [clone ._omp_fn.0] (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4A83D207: gomp_thread_start (team.c:125)
==39664==    by 0x49FFB42: start_thread (pthread_create.c:442)
==39664==    by 0x4A90BB3: clone (clone.S:100)
client stack range: [0x54729000 0x54F27FFF] client SP: 0x54F27D68
valgrind stack range: [0x100E9C4000 0x100EAC3FFF] top usage: 3224 of 1048576

Thread 10: status = VgTs_WaitSys syscall 202 (lwpid 39802)
==39664==    at 0x4A83F1A7: futex_wake (futex.h:111)
==39664==    by 0x4C412854: __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_28PairwiseDistancesReduction32__parallel_on_X(__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_PairwiseDistancesReduction32*) [clone ._omp_fn.0] (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4A83D207: gomp_thread_start (team.c:125)
==39664==    by 0x49FFB42: start_thread (pthread_create.c:442)
==39664==    by 0x4A90BB3: clone (clone.S:100)
client stack range: [0x54F2A000 0x55728FFF] client SP: 0x55728D98
valgrind stack range: [0x100ECC8000 0x100EDC7FFF] top usage: 3224 of 1048576

Thread 11: status = VgTs_Runnable (lwpid 39803)
==39664==    at 0x8CE4051: dgemm_incopy_HASWELL (in /home/ogrisel/mambaforge/envs/dev/lib/libopenblasp-r0.3.20.so)
==39664==    by 0x7D850A1: dgemm_tn (in /home/ogrisel/mambaforge/envs/dev/lib/libopenblasp-r0.3.20.so)
==39664==    by 0x7CA96B8: dgemm_ (in /home/ogrisel/mambaforge/envs/dev/lib/libopenblasp-r0.3.20.so)
==39664==    by 0x4C498F75: __pyx_fuse_1__pyx_f_7sklearn_5utils_12_cython_blas__gemm (in /home/ogrisel/code/scikit-learn/sklearn/utils/_cython_blas.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C423F99: __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_18GEMMTermComputer32__compute_distances_on_chunks(__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32*, long, long, long, long, long) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C41819C: __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_39FastEuclideanPairwiseDistancesArgKmin32__compute_and_reduce_distances_on_chunks(__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_FastEuclideanPairwiseDistancesArgKmin32*, long, long, long, long, long) (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4C412941: __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_28PairwiseDistancesReduction32__parallel_on_X(__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_PairwiseDistancesReduction32*) [clone ._omp_fn.0] (in /home/ogrisel/code/scikit-learn/sklearn/metrics/_pairwise_distances_reduction.cpython-310-x86_64-linux-gnu.so)
==39664==    by 0x4A83D207: gomp_thread_start (team.c:125)
==39664==    by 0x49FFB42: start_thread (pthread_create.c:442)
==39664==    by 0x4A90BB3: clone (clone.S:100)
client stack range: [0x5572B000 0x55F29FFF] client SP: 0x55F29698
valgrind stack range: [0x100EFCC000 0x100F0CBFFF] top usage: 6920 of 1048576

Thread 13: status = VgTs_WaitSys syscall 202 (lwpid 39805)
==39664==    at 0x4A83F14C: futex_wait (futex.h:97)
==39664==    by 0x4A83F14C: do_wait (wait.h:67)
==39664==    by 0x4A83F202: gomp_barrier_wait_end (bar.c:48)
==39664==    by 0x4A83D229: gomp_simple_barrier_wait (simple-bar.h:60)
==39664==    by 0x4A83D229: gomp_thread_start (team.c:129)
==39664==    by 0x49FFB42: start_thread (pthread_create.c:442)
==39664==    by 0x4A90BB3: clone (clone.S:100)
client stack range: [0x5672D000 0x56F2BFFF] client SP: 0x56F2BE08
valgrind stack range: [0x100F5D4000 0x100F6D3FFF] top usage: 3736 of 1048576


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.

@ogrisel
Member
ogrisel commented May 25, 2022

I rebuilt scikit-learn with debug symbols (make clean && python setup.py build_ext --inplace --debug) and now I get the crash at a slightly different location. Here is the post mortem backtrace from gdb:

#0  0x00007ffff7d2847e in __GI___libc_free (mem=0x407e123de0000000) at ./malloc/malloc.c:3368
#1  0x00007ffff7d17ffd in get_cached_stack (memp=<synthetic pointer>, sizep=<synthetic pointer>) at ./nptl/allocatestack.c:137
#2  allocate_stack (stacksize=<synthetic pointer>, stack=<synthetic pointer>, pdp=<synthetic pointer>, attr=0x7fffb45a8760 <gomp_thread_attr>) at ./nptl/allocatestack.c:364
#3  __pthread_create_2_1 (newthread=newthread@entry=0x7fffffff8760, attr=attr@entry=0x7fffb45a8760 <gomp_thread_attr>, start_routine=start_routine@entry=0x7fffb458913b <gomp_thread_start>, 
    arg=arg@entry=0x7fffffff86f0) at ./nptl/pthread_create.c:647
#4  0x00007fffb458a0d1 in gomp_team_start (
    fn=fn@entry=0x7fffb3029b30 <_ZL110__pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_26PairwiseDistancesArgKmin32_compute_exact_distancesP86__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_PairwiseDistancesArgKmin32._omp_fn.0(void)>, data=data@entry=0x7fffffff8a60, nthreads=nthreads@entry=8, flags=flags@entry=0, team=0x555556efcd10, taskgroup=taskgroup@entry=0x0) at ../../../libgomp/team.c:845
#5  0x00007fffb45844a3 in GOMP_parallel (
    fn=0x7fffb3029b30 <_ZL110__pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_26PairwiseDistancesArgKmin32_compute_exact_distancesP86__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_PairwiseDistancesArgKmin32._omp_fn.0(void)>, data=0x7fffffff8a60, num_threads=8, flags=0) at ../../../libgomp/parallel.c:176
#6  0x00007fffb302b01d in __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_26PairwiseDistancesArgKmin32_compute_exact_distances (__pyx_v_self=0x555556db7560)
    at sklearn/metrics/_pairwise_distances_reduction.cpp:26318
#7  0x00007fffb303934c in __pyx_pf_7sklearn_7metrics_29_pairwise_distances_reduction_26PairwiseDistancesArgKmin32_6_finalize_results (__pyx_v_return_distance=<optimized out>, __pyx_v_self=0x555556db7560)
    at sklearn/metrics/_pairwise_distances_reduction.cpp:26637
#8  __pyx_pw_7sklearn_7metrics_29_pairwise_distances_reduction_26PairwiseDistancesArgKmin32_7_finalize_results (__pyx_v_self=0x555556db7560, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at sklearn/metrics/_pairwise_distances_reduction.cpp:26599
#9  0x0000555555697f4c in cfunction_call (func=0x7fffb2e73ba0, args=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.10.4/Objects/methodobject.c:543
#10 0x00007fffb3029a07 in __Pyx_PyObject_Call (kw=0x0, arg=0x7fffb32f6da0, func=0x7fffb2e73ba0) at sklearn/metrics/_pairwise_distances_reduction.cpp:55384
#11 __Pyx__PyObject_CallOneArg (arg=<optimized out>, func=0x7fffb2e73ba0) at sklearn/metrics/_pairwise_distances_reduction.cpp:56570
#12 __Pyx_PyObject_CallOneArg (func=func@entry=0x7fffb2e73ba0, arg=<optimized out>, arg@entry=0x5555558f9660 <_Py_TrueStruct>) at sklearn/metrics/_pairwise_distances_reduction.cpp:56589
#13 0x00007fffb3056d73 in __pyx_pf_7sklearn_7metrics_29_pairwise_distances_reduction_26PairwiseDistancesArgKmin32_compute (__pyx_v_cls=<optimized out>, __pyx_v_return_distance=1, 
    __pyx_v_strategy=<optimized out>, __pyx_v_metric_kwargs=<optimized out>, __pyx_v_chunk_size=0x7ffff760c3f0, __pyx_v_metric=<optimized out>, __pyx_v_k=<optimized out>, __pyx_v_Y=<optimized out>,

This is when starting the OpenMP parallel region in this Cython function:

    cdef void compute_exact_distances(self) nogil:
        cdef:
            ITYPE_t i, j
            ITYPE_t[:, ::1] Y_indices = self.argkmin_indices
            DTYPE_t[:, ::1] distances = self.argkmin_distances
        for i in prange(self.n_samples_X, schedule='static', nogil=True,
                        num_threads=self.effective_n_threads):
            for j in range(self.k):
                distances[i, j] = self.datasets_pair.distance_metric._rdist_to_dist(
                    # Guard against eventual -0., causing nan production.
                    max(distances[i, j], 0.)
                )
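To illustrate the guard in the inner loop above, here is a minimal pure-Python sketch. It assumes the Euclidean case, where the reduced distance is a squared distance and the conversion is a square root; the function name `rdist_to_dist` and the input values are hypothetical stand-ins, not the scikit-learn implementation.

```python
import math


def rdist_to_dist(rdist: float) -> float:
    """Hypothetical stand-in for a Euclidean rdist -> dist conversion."""
    # Clamp first: floating-point rounding can leave a tiny negative
    # squared distance. math.sqrt raises ValueError on a negative input,
    # while the C sqrt used by compiled code would return NaN instead.
    return math.sqrt(max(rdist, 0.0))


print(rdist_to_dist(4.0))     # regular case
print(rdist_to_dist(-1e-12))  # clamped to zero instead of failing
```

The clamp only changes values that are already below zero, so valid reduced distances are converted unchanged.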

@ogrisel
Member
ogrisel commented May 25, 2022

I re-ran the same test and this time I got:

#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737350469440) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737350469440) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737350469440, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7cc5476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7cab7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7d0c6f6 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7e5eb8c "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#6  0x00007ffff7d23d7c in malloc_printerr (str=str@entry=0x7ffff7e617d0 "double free or corruption (!prev)") at ./malloc/malloc.c:5664
#7  0x00007ffff7d25efc in _int_free (av=0x7ffff7e9cc80 <main_arena>, p=0x555556e67cf0, have_lock=<optimized out>) at ./malloc/malloc.c:4591
#8  0x00007ffff7d284d3 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#9  0x00007fffb31262ce in __gnu_cxx::new_allocator<double>::deallocate (__t=<optimized out>, __p=<optimized out>, this=0x555555e88470) at /usr/include/c++/11/ext/new_allocator.h:132
#10 std::allocator_traits<std::allocator<double> >::deallocate (__n=<optimized out>, __p=<optimized out>, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:496
#11 std::_Vector_base<double, std::allocator<double> >::_M_deallocate (__n=<optimized out>, __p=<optimized out>, this=0x555555e88470) at /usr/include/c++/11/bits/stl_vector.h:354
#12 std::_Vector_base<double, std::allocator<double> >::~_Vector_base (this=0x555555e88470, __in_chrg=<optimized out>) at /usr/include/c++/11/bits/stl_vector.h:335
#13 std::vector<double, std::allocator<double> >::~vector (this=0x555555e88470, __in_chrg=<optimized out>) at /usr/include/c++/11/bits/stl_vector.h:683
#14 std::_Destroy<std::vector<double, std::allocator<double> > > (__pointer=0x555555e88470) at /usr/include/c++/11/bits/stl_construct.h:151
#15 std::_Destroy_aux<false>::__destroy<std::vector<double, std::allocator<double> >*> (__last=0x555555e88530, __first=0x555555e88470) at /usr/include/c++/11/bits/stl_construct.h:163
#16 std::_Destroy<std::vector<double, std::allocator<double> >*> (__last=0x555555e88530, __first=<optimized out>) at /usr/include/c++/11/bits/stl_construct.h:196
#17 std::_Destroy<std::vector<double, std::allocator<double> >*, std::vector<double, std::allocator<double> > > (__last=0x555555e88530, __first=<optimized out>) at /usr/include/c++/11/bits/alloc_traits.h:854
#18 std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > >::~vector (this=0x555556d7af90, __in_chrg=<optimized out>)
    at /usr/include/c++/11/bits/stl_vector.h:680
#19 __Pyx_call_destructor<std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > > > (x=...)
    at sklearn/metrics/_pairwise_distances_reduction.cpp:339
#20 __pyx_tp_dealloc_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32 (o=0x555556d7ada0) at sklearn/metrics/_pairwise_distances_reduction.cpp:51450

This line is __Pyx_call_destructor(p->X_c_upcast); in:

static void __pyx_tp_dealloc_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32(PyObject *o) {
  struct __pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32 *p = (struct __pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32 *)o;
  #if CYTHON_USE_TP_FINALIZE
  if (unlikely(PyType_HasFeature(Py_TYPE(o), Py_TPFLAGS_HAVE_FINALIZE) && Py_TYPE(o)->tp_finalize) && (!PyType_IS_GC(Py_TYPE(o)) || !_PyGC_FINALIZED(o))) {
    if (PyObject_CallFinalizerFromDealloc(o)) return;
  }
  #endif
  __Pyx_call_destructor(p->dist_middle_terms_chunks);
  __Pyx_call_destructor(p->X_c_upcast);
  __Pyx_call_destructor(p->Y_c_upcast);
  __PYX_XDEC_MEMVIEW(&p->X, 1);
  __PYX_XDEC_MEMVIEW(&p->Y, 1);
  (*Py_TYPE(o)->tp_free)(o);
}

So maybe the C++ code generated by Cython for the stack-allocated nested vectors is invalid. Maybe we could try to allocate those dynamically with the new keyword and then del them in an explicit __dealloc__ method for GEMMTermComputer32.

@ogrisel
Copy link
Member
ogrisel commented May 25, 2022

The valgrind output seems to point to a problem with those same data structures: an out-of-bounds write into a buffer that was sized in the constructor:

==48327== Invalid write of size 8
==48327==    at 0x4C4166A6: __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_18GEMMTermComputer32__parallel_on_X_init_chunk(__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_GEMMTermComputer32*, long, long, long) (_pairwise_distances_reduction.cpp:20832)
==48327==    by 0x4C4128E8: __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_28PairwiseDistancesReduction32__parallel_on_X(__pyx_obj_7sklearn_7metrics_29_pairwise_distances_reduction_PairwiseDistancesReduction32*) [clone ._omp_fn.0] (_pairwise_distances_reduction.cpp:22628)
==48327==    by 0x4A83D207: gomp_thread_start (team.c:125)
==48327==    by 0x49FFB42: start_thread (pthread_create.c:442)
==48327==    by 0x4A90BB3: clone (clone.S:100)
==48327==  Address 0x47222748 is 0 bytes after a block of size 5,000 alloc'd
==48327==    at 0x4849013: operator new(unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==48327==    by 0x4C4662EE: allocate (new_allocator.h:127)
==48327==    by 0x4C4662EE: allocate (alloc_traits.h:464)
==48327==    by 0x4C4662EE: _M_allocate (stl_vector.h:346)
==48327==    by 0x4C4662EE: std::vector<double, std::allocator<double> >::_M_default_append(unsigned long) (vector.tcc:635)
==48327==    by 0x4C45FAB6: resize (stl_vector.h:940)
==48327==    by 0x4C45FAB6: __pyx_pf_7sklearn_7metrics_29_pairwise_distances_reduction_18GEMMTermComputer32___init__ (_pairwise_distances_reduction.cpp:20567)
==48327==    by 0x4C45FAB6: __pyx_pw_7sklearn_7metrics_29_pairwise_distances_reduction_18GEMMTermComputer32_1__init__(_object*, _object*, _object*) (_pairwise_distances_reduction.cpp:20424)

That is the line in the loop that does the upcasting:

sklearn/metrics/_pairwise_distances_reduction.pyx:2321
        # Upcasting X_c=X[X_start:X_end, :] from float32 to float64
        for i in range(n_chunk_samples):
            for j in range(self.n_features):
                self.X_c_upcast[thread_num][i * self.n_features + j] = <DTYPE_t> self.X[X_start + i, j]
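
To make the indexing in that loop concrete, here is a minimal pure-Python sketch of the same row-major upcast. The standard array module stands in for the float32 memoryview and for the per-thread std::vector; the names upcast_chunk and chunk_buffer are hypothetical. The explicit size check only illustrates the kind of condition under which a write would land just past the end of the buffer, and is not a claim about what caused the failure in this PR.

```python
from array import array


def upcast_chunk(X, n_features, start, end, buffer):
    """Copy rows X[start:end] (float32, row-major) into a float64 buffer.

    X and buffer are flat arrays; element (i, j) of the chunk is stored at
    buffer[i * n_features + j], as in the Cython loop.
    """
    n_chunk_samples = end - start
    needed = n_chunk_samples * n_features
    if len(buffer) < needed:
        # In C++ the equivalent write is undefined behaviour, not an error.
        raise ValueError(
            f"buffer holds {len(buffer)} values but the chunk needs {needed}"
        )
    for i in range(n_chunk_samples):
        for j in range(n_features):
            buffer[i * n_features + j] = X[(start + i) * n_features + j]
    return buffer


# 3 samples with 2 features each, stored as float32 ("f").
X = array("f", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
# One float64 ("d") buffer sized for a chunk of at most 2 samples.
chunk_buffer = array("d", [0.0] * (2 * 2))
upcast_chunk(X, n_features=2, start=1, end=3, buffer=chunk_buffer)
print(list(chunk_buffer))
```

Passing a chunk larger than the buffer was sized for (for example start=0, end=3 here) raises the ValueError instead of silently writing out of bounds.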

@ogrisel
Member
ogrisel commented May 25, 2022

I tried to use a unique_ptr as:

diff --git a/sklearn/metrics/_pairwise_distances_reduction.pyx.tp b/sklearn/metrics/_pairwise_distances_reduction.pyx.tp
index 90ea78c11..ec68614b6 100644
--- a/sklearn/metrics/_pairwise_distances_reduction.pyx.tp
+++ b/sklearn/metrics/_pairwise_distances_reduction.pyx.tp
@@ -48,7 +48,7 @@ from .. import get_config
 from libc.stdlib cimport free, malloc
 from libc.stdio cimport printf
 from libc.float cimport DBL_MAX
-from libcpp.memory cimport shared_ptr, make_shared
+from libcpp.memory cimport shared_ptr, make_shared, unique_ptr, make_unique
 from libcpp.vector cimport vector
 from cython cimport final
 from cython.operator cimport dereference as deref
@@ -627,8 +627,8 @@ cdef class GEMMTermComputer{{bitness}}:
         vector[vector[DTYPE_t]] dist_middle_terms_chunks
 
 {{if need_upcast}}
-        vector[vector[DTYPE_t]] X_c_upcast
-        vector[vector[DTYPE_t]] Y_c_upcast
+        unique_ptr[vector[vector[DTYPE_t]]] X_c_upcast
+        unique_ptr[vector[vector[DTYPE_t]]] Y_c_upcast
 {{endif}}
 
     def __init__(self,
@@ -649,13 +649,13 @@ cdef class GEMMTermComputer{{bitness}}:
         self.dist_middle_terms_chunks = vector[vector[DTYPE_t]](self.effective_n_threads)
 
 {{if need_upcast}}
         return
 {{endif}}
@@ -743,7 +743,7 @@ cdef class GEMMTermComputer{{bitness}}:
         # Upcasting Y_c=Y[Y_start:Y_end, :] from float32 to float64
         return
 {{endif}}
@@ -743,7 +743,7 @@ cdef class GEMMTermComputer{{bitness}}:
         # Upcasting Y_c=Y[Y_start:Y_end, :] from float32 to float64
         for i in range(n_chunk_samples):
             for j in range(self.n_features):
-                self.Y_c_upcast[thread_num][i * self.n_features + j] = <DTYPE_t> self.Y[Y_start + i, j]
+                deref(self.Y_c_upcast)[thread_num][i * self.n_features + j] = <DTYPE_t> self.Y[Y_start + i, j]
 {{else}}
         return
 {{endif}}
@@ -776,8 +776,8 @@ cdef class GEMMTermComputer{{bitness}}:
             ITYPE_t K = X_c.shape[1]
             DTYPE_t alpha = - 2.
 {{if need_upcast}}
-            DTYPE_t * A = self.X_c_upcast[thread_num].data()
-            DTYPE_t * B = self.Y_c_upcast[thread_num].data()
+            DTYPE_t * A = deref(self.X_c_upcast)[thread_num].data()
+            DTYPE_t * B = deref(self.Y_c_upcast)[thread_num].data()
 {{else}}
             # Casting for A and B to remove the const is needed because APIs exposed via
             # scipy.linalg.cython_blas aren't reflecting the arguments' const qualifier.

but I still get the same failure. Same with a shared_ptr. If I use the new operator, the problem goes away, but then we have to manage the memory manually in a __dealloc__.

@jjerphan
Member Author

My bad, the error was due to improper logic; f0fc839 fixes it. Some tests are still failing, though, likely for numerical reasons.

@ogrisel
Member
ogrisel commented May 30, 2022

> My bad, the error was due to improper logic; f0fc839 fixes it. Some tests are still failing, though, likely for numerical reasons.

OK, it's good to know that there is no problem with stack-allocated nested vector data structures, as I had mistakenly suspected.

@ogrisel
Member
ogrisel commented May 30, 2022

The error "ValueError: Buffer dtype mismatch, expected 'const DTYPE_t' but got 'float'" in https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=42567&view=logs&j=0a287ed6-22f4-5cb4-88b1-d5fcdc4d8b7e&t=98f1f182-d951-51b4-1bc1-ca049099b19d&l=8438 seems to be caused by the precomputed row_norms not being upcast in sklearn/cluster/_birch.py at line 758.

@jjerphan
Member Author

I think that it would be easier to debug and to review to (in this order):

What do you think?

metric in cls.valid_metrics())


cdef class PairwiseDistancesArgKmin(PairwiseDistancesReduction):
Member Author
@jjerphan jjerphan Jun 1, 2022

Note to reviewers: similarly, PairwiseDistancesArgKmin is now just an interface which dispatches to the correct dtype-specific implementation at runtime.

f"Currently: X.dtype={X.dtype} and Y.dtype={Y.dtype}."
)

cdef class PairwiseDistancesRadiusNeighborhood(PairwiseDistancesReduction):
Member Author

Note to reviewers: similarly, PairwiseDistancesRadiusNeighborhood is now just an interface which dispatches to the correct dtype-specific implementation at runtime.
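
The dispatch pattern described in these notes can be sketched in plain Python. Everything below is hypothetical: ArgKmin, ArgKmin32 and ArgKmin64 stand in for the interface and the dtype-specific Cython classes, array typecodes stand in for X.dtype and Y.dtype, and the error message only mimics the fragment quoted above.

```python
from array import array


class ArgKmin64:
    """Hypothetical float64-specific implementation."""

    @classmethod
    def compute(cls, X, Y):
        return "64bit implementation"


class ArgKmin32:
    """Hypothetical float32-specific implementation."""

    @classmethod
    def compute(cls, X, Y):
        return "32bit implementation"


class ArgKmin:
    """Interface only: picks the dtype-specific implementation at runtime."""

    # Array typecodes ("d" = float64, "f" = float32) stand in for dtypes.
    _implementations = {"d": ArgKmin64, "f": ArgKmin32}

    @classmethod
    def compute(cls, X, Y):
        if X.typecode != Y.typecode or X.typecode not in cls._implementations:
            raise ValueError(
                "Unsupported pair of dtypes. "
                f"Currently: X.dtype={X.typecode} and Y.dtype={Y.typecode}."
            )
        return cls._implementations[X.typecode].compute(X, Y)


print(ArgKmin.compute(array("f", [1.0]), array("f", [2.0])))
```

Callers only ever see the interface class, so adding another dtype in this sketch means registering one more entry in the mapping rather than changing call sites.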

Comment on lines +534 to +594
cpdef DTYPE_t[::1] _sqeuclidean_row_norms64(
    const DTYPE_t[:, ::1] X,
    ITYPE_t num_threads,
):
    """Compute the squared euclidean norm of the rows of X in parallel.

    This is faster than using np.einsum("ij, ij->i") even when using a single thread.
    """
    cdef:
        # Casting for X to remove the const qualifier is needed because APIs
        # exposed via scipy.linalg.cython_blas aren't reflecting the arguments'
        # const qualifier.
        # See: https://github.com/scipy/scipy/issues/14262
        DTYPE_t * X_ptr = <DTYPE_t *> &X[0, 0]
        ITYPE_t i = 0
        ITYPE_t n = X.shape[0]
        ITYPE_t d = X.shape[1]
        DTYPE_t[::1] squared_row_norms = np.empty(n, dtype=DTYPE)

    for i in prange(n, schedule='static', nogil=True, num_threads=num_threads):
        squared_row_norms[i] = _dot(d, X_ptr + i * d, 1, X_ptr + i * d, 1)

    return squared_row_norms


cpdef DTYPE_t[::1] _sqeuclidean_row_norms32(
    const cnp.float32_t[:, ::1] X,
    ITYPE_t num_threads,
):
    """Compute the squared euclidean norm of the rows of X in parallel.

    This is faster than using np.einsum("ij, ij->i") even when using a single thread.
    """
    cdef:
        # Casting for X to remove the const qualifier is needed because APIs
        # exposed via scipy.linalg.cython_blas aren't reflecting the arguments'
        # const qualifier.
        # See: https://github.com/scipy/scipy/issues/14262
        cnp.float32_t * X_ptr = <cnp.float32_t *> &X[0, 0]
        ITYPE_t i = 0, j = 0
        ITYPE_t n = X.shape[0]
        ITYPE_t d = X.shape[1]
        DTYPE_t[::1] squared_row_norms = np.empty(n, dtype=DTYPE)

        # To upcast the i-th row of X from 32bit to 64bit
        DTYPE_t * X_i_upcast_ptr

    with nogil, parallel(num_threads=num_threads):
        # Thread-local buffer allocation
        X_i_upcast_ptr = <DTYPE_t *> malloc(sizeof(DTYPE_t) * d)
        for i in prange(n, schedule='static'):

            # Upcasting the i-th row of X from 32bit to 64bit
            for j in range(d):
                X_i_upcast_ptr[j] = <DTYPE_t> deref(X_ptr + i * d + j)

            squared_row_norms[i] = _dot(d, X_i_upcast_ptr, 1, X_i_upcast_ptr, 1)

        free(X_i_upcast_ptr)

    return squared_row_norms
Member Author

Note to reviewers: these are specialisations of _sqeuclidean_row_norms for each dtype.
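
As a rough illustration of what the 32bit specialisation does, the sketch below upcasts one row at a time into a reusable float64 buffer before taking the dot product of the row with itself. It is plain Python using the standard array module; sq_row_norms32 is a hypothetical name, the loop is sequential rather than parallel, and a Python sum replaces the BLAS _dot call.

```python
from array import array


def sq_row_norms32(X, n, d):
    """Squared Euclidean row norms of a flat, row-major float32 array X."""
    norms = array("d", [0.0] * n)
    # Reusable float64 buffer, playing the role of the thread-local one.
    row_upcast = array("d", [0.0] * d)
    for i in range(n):
        # Upcast the i-th row of X from 32bit to 64bit.
        for j in range(d):
            row_upcast[j] = X[i * d + j]
        # Dot product of the upcast row with itself.
        norms[i] = sum(v * v for v in row_upcast)
    return norms


# Two rows of three float32 features each, i.e. shape (2, 3).
X = array("f", [1.0, 2.0, 2.0, 3.0, 4.0, 0.0])
print(list(sq_row_norms32(X, n=2, d=3)))
```

The output buffer is float64 in both specialisations, which is why only the input side needs a dtype-specific code path.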

@jjerphan jjerphan changed the title from "MAINT 32bit datasets support for PairwiseDistancesReduction" to "POC 32bit datasets support for PairwiseDistancesReduction" Jun 1, 2022
@jjerphan
Copy link
Member Author
jjerphan commented Jul 8, 2022

Due to the refactoring, I'm closing this PR in favor of #23865.

@jjerphan jjerphan closed this Jul 8, 2022
@jjerphan jjerphan deleted the distance-metrics-32bit branch October 21, 2022 14:13
Labels
cython float32 Issues related to support for 32bit data module:metrics No Changelog Needed