Implement SIMD optimization for `ManhattanDistance.dist` by Micky774 · Pull Request #11 · Micky774/scikit-learn

Status: Closed · wants to merge 9 commits
Conversation

@Micky774 (Owner)

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Provides a proof-of-concept SIMD implementation for ManhattanDistance.dist

Any other comments?

@jeremiedbb commented Mar 29, 2023

Is the performance gain really because we're currently not using SIMD?
I thought SSE2 instructions were quite standard and that most compilers are able to auto-vectorize for this instruction set (unlike for AVX2, and even less so for AVX512).

I wonder if the issue doesn't instead come from instruction-level parallelism. If I replace the current implementation with a manual unrolling:

        cdef intp_t n = size // 4    # number of full 4-element blocks
        cdef intp_t r = size % 4     # leftover elements

        # Four independent fabs/add chains per iteration expose
        # instruction-level parallelism (and give the compiler a chance
        # to vectorize).
        for j in range(n):
            d += (fabs(x1[0] - x2[0]) +
                  fabs(x1[1] - x2[1]) +
                  fabs(x1[2] - x2[2]) +
                  fabs(x1[3] - x2[3])
                 )
            x1 += 4; x2 += 4

        # scalar remainder
        for j in range(r):
            d += fabs(x1[j] - x2[j])

        return d

I get improvements as well. Looking at the assembly, I'm not able to tell whether it better leverages vectorization or instruction-level parallelism.

One other issue might be the use of fabs for float32: it's for double precision, so we might want fabsf instead, because currently float32 is slower than float64 on my machine. Edit: it's on purpose, we return a double; I read too quickly. I don't understand why float32 is slower than float64, though.
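(For reference, a minimal C illustration of the fabs/fabsf distinction discussed above; the function name is hypothetical:)

    #include <math.h>

    void example(float a, float b) {
        /* fabs takes and returns double: the float32 result of a - b is
           promoted to float64 before the absolute value is taken. */
        double d64 = fabs(a - b);
        /* fabsf stays in float32 end to end. */
        float d32 = fabsf(a - b);
        (void)d64; (void)d32;  /* silence unused-variable warnings */
    }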

An inline review thread was opened on these lines of the diff:

    d += fabs(x1[j] - x2[j])
    cdef cnp.intp_t j, simd_width = 16 / sizeof({{INPUT_DTYPE_t}})
    cdef cnp.intp_t remainder = size % simd_width
    if HAS_SIMD:
A reviewer commented:

Naive question: is there a benefit to making this a "compile time if statement"? And if yes, is that even possible/advisable?

@Micky774 (Owner, Author) replied:

There probably is a benefit, though I have no idea how significant.

One option would be to port the existing code for whichever function we try to accelerate to C as well and handle it all there via compile-time macros. Of course that makes things significantly more complicated, but it's the simplest, most straightforward option I can think of to accommodate it.

The reviewer followed up:

After reading (and thinking) a bit more, in particular about the NumPy setup, I think you have to have a runtime check — one that is even more "runtime" than this one, which relies on a compile-time constant. (Maybe except for SSE/SSE2, as this seems to be the minimum setting for NumPy?)
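For concreteness, such a runtime check could look like the following sketch, which relies on the GCC/clang builtin `__builtin_cpu_supports` (x86-only; illustrative, not what this PR does). A compile-time check like `#ifdef __SSE3__` instead bakes the decision into the binary at build time:

    #include <stdio.h>

    int main(void) {
        /* Queries the CPU at runtime, independent of the flags the
           binary was compiled with. */
        if (__builtin_cpu_supports("sse3"))
            printf("SSE3 available on this CPU\n");
        else
            printf("falling back to scalar code\n");
        return 0;
    }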

@jeremiedbb

I ran a quick benchmark on my laptop:

import numpy as np
from sklearn.metrics._pairwise_distances_reduction import ArgKmin
from threadpoolctl import threadpool_limits

threadpool_limits(2)  # or 1
X = np.random.uniform(size=(100000, 100)).astype(np.float32)  # or float64
Y = np.random.uniform(size=(10000, 100)).astype(np.float32)  # or float64

results = ArgKmin.compute(
    X=X,
    Y=Y,
    k=5,
    metric="manhattan",
    strategy="auto",
    return_distance=False,
)
                       2 threads    1 thread
    simd      float32     54s         101s
              float64     36s          65s
    no simd   float32     60s         115s
              float64     55s         105s
    unroll    float32     37s          66s
              float64     26s          48s

The current implementation in main is "no simd"; this PR is "simd", and "unroll" is the snippet I wrote in my previous comment.

Although the SIMD version improves over main, it seems that a simple manual unrolling can achieve at least the same kind of performance gain (we'd need more benchmarks of course, but this already gives a hint). Maybe the SIMD version should also do some manual loop unrolling?

@GaelVaroquaux

Maybe we should implement the suggestion by @jeremiedbb before the SIMD version. It seems that it will be much more straightforward and will lead to user benefits.

@ogrisel commented Mar 29, 2023

I tried @jeremiedbb's manual loop unrolling on my Apple M1 machine and it does not change the timings vs main. And more threads is always better (there is no hyperthreading on this platform).


@ogrisel commented Mar 29, 2023

> Maybe we should implement the suggestion by @jeremiedbb before the SIMD version. It seems that it will be much more straightforward and will lead to user benefits.

+1

SSE3 only works on x86 CPUs; we would need more code to support other architectures such as ARM64.

Furthermore, we could get better performance by using AVX2 or AVX512 intrinsics (wider vectors), but then we would need a runtime dispatch mechanism for those SIMD-optimized functions: we cannot assume that our x86 users will have a recent enough CPU with AVX512 instructions, so we would have to fall back to AVX2, and then to SSE3 (or even SSE2), if the widest vector instruction sets are not listed in the CPU flags.

I believe NumPy has implemented such a dispatch mechanism, but I haven't looked at it yet.
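For illustration, such a dispatch mechanism might look like the following C sketch. The kernel names are hypothetical, and each kernel would live in its own translation unit compiled with the matching `-mavx512f`/`-mavx2` flags:

    typedef double (*dist_fn)(const float *x1, const float *x2, long n);

    /* Illustrative kernel declarations; definitions would be compiled
       separately with the appropriate -m flags. */
    double dist_avx512(const float *x1, const float *x2, long n);
    double dist_avx2(const float *x1, const float *x2, long n);
    double dist_sse2(const float *x1, const float *x2, long n);

    /* Resolve the widest available kernel once at startup, then call
       through the returned function pointer. */
    static dist_fn resolve_dist(void) {
        if (__builtin_cpu_supports("avx512f")) return dist_avx512;
        if (__builtin_cpu_supports("avx2"))    return dist_avx2;
        return dist_sse2;  /* SSE2 is guaranteed on x86-64 */
    }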

@Micky774 (Owner, Author)

> Maybe we should implement the suggestion by @jeremiedbb before the SIMD version. It seems that it will be much more straightforward and will lead to user benefits.

I think this is definitely a good idea. It provides immediate verifiable benefits at no real cost.

I updated and reran the benchmarks on my machine with @jeremiedbb's unrolling, as well as the SIMD unrolling I added in 3faa339.

[benchmark plot: timings for simple unrolling vs. unrolling + SIMD, float32 and float64]

Unrolling + SIMD offers 2x performance for float32 and 1.36x for float64 compared to simple unrolling on main.

Note that the relatively milder results for float64 are expected with 128-bit SSE instructions, since each vector register holds only two float64 values versus four float32 values.

@jeremiedbb, re-running your benchmark snippet with your unrolled loops against the SIMD unrolled loops, I get times of 15.8s and 7.7s respectively.
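For readers following along, here is a rough sketch (not the PR's actual code) of the kind of SSE2 kernel under discussion; `manhattan_sse_f32` is a hypothetical name:

    #include <immintrin.h>  /* SSE intrinsics */
    #include <math.h>

    static double manhattan_sse_f32(const float *x1, const float *x2, long size) {
        /* Clearing the sign bit with a mask implements fabs without a branch. */
        const __m128 abs_mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
        __m128 acc = _mm_setzero_ps();
        long i = 0;
        /* 4 float32 |differences| per iteration */
        for (; i + 4 <= size; i += 4) {
            __m128 diff = _mm_sub_ps(_mm_loadu_ps(x1 + i), _mm_loadu_ps(x2 + i));
            acc = _mm_add_ps(acc, _mm_and_ps(diff, abs_mask));
        }
        /* horizontal sum of the 4 lanes */
        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        double d = (double)lanes[0] + lanes[1] + lanes[2] + lanes[3];
        /* scalar remainder */
        for (; i < size; i++)
            d += fabsf(x1[i] - x2[i]);
        return d;
    }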

@lorentzenchr

> I tried @jeremiedbb's manual loop unrolling on my Apple M1 machine and it does not change the timings vs main.

I observed the same behaviour for the loop unrolling in HGBT on my Mac (Intel CPU, with clang I think): the manual unrolling has no effect.

@jeremiedbb

> I tried @jeremiedbb's manual loop unrolling on my Apple M1 machine and it does not change the timings vs main.

> I observed the same behaviour for the loop unrolling in HGBT on my Mac (Intel CPU, with clang I think): the manual unrolling has no effect.

This is kind of expected. Clang is more clever than GCC when it comes to instruction-level parallelism. In such a loop, GCC is not able to tell (at least with default flags) that two steps of the loop are independent, so we have to show it the way.
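To make that concrete, here is a minimal C sketch (hypothetical, not from scikit-learn) of how independent accumulators expose that parallelism:

    #include <math.h>

    /* Two accumulators break the loop-carried dependency on a single `d`,
       so the CPU can overlap the two fabs/add chains even without SIMD. */
    double manhattan_ilp(const double *x1, const double *x2, long size) {
        double d0 = 0.0, d1 = 0.0;
        long i = 0;
        for (; i + 2 <= size; i += 2) {
            d0 += fabs(x1[i]     - x2[i]);      /* chain 0 */
            d1 += fabs(x1[i + 1] - x2[i + 1]);  /* chain 1, independent */
        }
        if (i < size)                           /* odd-length remainder */
            d0 += fabs(x1[i] - x2[i]);
        return d0 + d1;
    }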

@jeremiedbb

> I updated and reran the benchmarks on my machine

Nice, we can clearly see the two improvements here: one from SIMD and one from ILP 👍

@ogrisel commented Mar 30, 2023

We might want to explore the idea of using clang to build the Linux binaries of scikit-learn. It's possible that we would get various speed-ups for free, without having to do manual loop unrolling in our code.

The only problem would be to evaluate the impact of switching to llvm-openmp (used by default by clang when compiling/linking with -fopenmp) vs libgomp (a.k.a. GNU OpenMP, used by gcc when compiling/linking with -fopenmp). Notably, we discovered a bad interaction between llvm-openmp and intel-openmp, only on Linux, in the past.

@ogrisel commented Mar 30, 2023

Coming back to the use of intrinsics, another alternative would be to structure our code (and maybe use restrict annotations to resolve pointer aliasing issues; not sure if Cython would allow that) to make it possible for the compiler to automatically vectorize it.

We would still need dynamic dispatch based on CPU architecture flags, but maybe modern compilers can also generate code to do that automatically. I haven't investigated myself yet.
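As a sketch of the restrict idea in plain C (illustrative; the function name is hypothetical, and the floating-point reduction may additionally need `-ffast-math` or an OpenMP SIMD pragma before compilers will vectorize it):

    #include <math.h>

    /* restrict promises the compiler that x1 and x2 do not alias,
       removing one obstacle to auto-vectorization at -O3. */
    double manhattan_autovec(const double *restrict x1,
                             const double *restrict x2,
                             long size) {
        double d = 0.0;
        for (long i = 0; i < size; i++)
            d += fabs(x1[i] - x2[i]);
        return d;
    }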

@jeremiedbb

> and maybe use restrict

Unfortunately, Cython does not expose it.

@ogrisel commented Mar 30, 2023

BTW, thanks for the benchmark reports @Micky774, this is very interesting.

A generic comment, though: when presenting such duration benchmarks in perf-related PRs, I think it's important to set ylim to 0; otherwise it's hard to gauge the relative improvements, as the scale varies from one plot to the next (even in relative terms), which makes it hard to decode in which cases the improvements are important, or whether they are comparable across plots.

Alternatively, you could plot the speed-up factor vs main, in which case the scale of the plots will naturally always be the same (assuming main is always the slowest, or close to the slowest, when working on performance optimization PRs).

@jjerphan

Thank you for exploring this, @Micky774.

As @jeremiedbb indicated in scikit-learn#26010 (comment), I also think scikit-learn is not the right place to have this level of granularity. The added code and compilation complexity (C sources, includes in Cython, branching in the Cython implementation, and extra compilation flags) has a significant cost, especially if only a few distance metrics can be optimized using SIMD: do you know of other potential candidates?

SciPy already has a nice framework of efficient distance-metric implementations for the dense case. Still, those implementations are not exposed via Cython interfaces.

Do people think we could contribute and sync with the SciPy maintainers upstream, so that there is only one set of efficient distance-metric implementations in the ecosystem?

For now, I am +1 regarding using unrolling for ManhattanDistance. 👍

@Micky774 (Owner, Author)

Closing in favor of working on a separate library to provide DistanceMetric implementations. The quick-adoption route would be to allow metric arguments to accept DistanceMetric objects directly, letting users of the third-party library pass them as arguments. The long-term solution would be to develop a plugin system for DistanceMetric. These are not mutually exclusive, either.
