Implement SIMD optimization for `ManhattanDistance.dist` by Micky774 · Pull Request #11 · Micky774/scikit-learn

Status: Closed · wants to merge 9 commits
Conversation

@Micky774 (Owner)

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Provides a proof-of-concept SIMD implementation for ManhattanDistance.dist

Any other comments?

@jeremiedbb commented Mar 29, 2023

Is the performance gain really because we're currently not using SIMD?
I thought SSE2 instructions were quite standard and that most compilers are able to auto-vectorize for this instruction set (unlike for AVX2, and even less so for AVX512).

I wonder if the issue doesn't instead come from instruction-level parallelism. If I replace the current implementation with a manual unrolling:

        cdef intp_t n = size // 4    # number of full 4-element blocks
        cdef intp_t r = size % 4     # leftover elements

        # Four independent fabs/add chains per iteration expose
        # instruction-level parallelism (and give the compiler a chance
        # to vectorize).
        for j in range(n):
            d += (fabs(x1[0] - x2[0]) +
                  fabs(x1[1] - x2[1]) +
                  fabs(x1[2] - x2[2]) +
                  fabs(x1[3] - x2[3])
                 )
            x1 += 4; x2 += 4

        # scalar remainder
        for j in range(r):
            d += fabs(x1[j] - x2[j])

        return d

I get improvements as well. Looking at the assembly, I'm not able to tell whether it better leverages vectorization or instruction-level parallelism.

One other issue might be the use of fabs for float32: it's for double precision, so we might want fabsf instead, because currently float32 is slower than float64 on my machine. Edit: it's on purpose, we return a double; I read too quickly. I don't understand why float32 is slower than float64, though.
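(For reference, a minimal C illustration of the fabs/fabsf distinction discussed above; the function name is hypothetical:)

    #include <math.h>

    void example(float a, float b) {
        /* fabs takes and returns double: the float32 result of a - b is
           promoted to float64 before the absolute value is taken. */
        double d64 = fabs(a - b);
        /* fabsf stays in float32 end to end. */
        float d32 = fabsf(a - b);
        (void)d64; (void)d32;  /* silence unused-variable warnings */
    }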

An inline review thread was opened on these lines of the diff:

    d += fabs(x1[j] - x2[j])
    cdef cnp.intp_t j, simd_width = 16 / sizeof({{INPUT_DTYPE_t}})
    cdef cnp.intp_t remainder = size % simd_width
    if HAS_SIMD:
A reviewer commented:

Naive question: is there a benefit to making this a "compile time if statement"? And if yes, is that even possible/advisable?

@Micky774 (Owner, Author) replied:

There probably is a benefit, though I have no idea how significant.

One option would be to port the existing code for whichever function we try to accelerate to C as well and handle it all there via compile-time macros. Of course that makes things significantly more complicated, but it's the simplest, most straightforward option I can think of to accommodate it.

The reviewer followed up:

After reading (and thinking) a bit more, in particular about the NumPy setup, I think you have to have a runtime check — one that is even more "runtime" than this one, which relies on a compile-time constant. (Maybe except for SSE/SSE2, as this seems to be the minimum setting for NumPy?)
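For concreteness, such a runtime check could look like the following sketch, which relies on the GCC/clang builtin `__builtin_cpu_supports` (x86-only; illustrative, not what this PR does). A compile-time check like `#ifdef __SSE3__` instead bakes the decision into the binary at build time:

    #include <stdio.h>

    int main(void) {
        /* Queries the CPU at runtime, independent of the flags the
           binary was compiled with. */
        if (__builtin_cpu_supports("sse3"))
            printf("SSE3 available on this CPU\n");
        else
            printf("falling back to scalar code\n");
        return 0;
    }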

@jeremiedbb

I ran a quick benchmark on my laptop:

import numpy as np
from sklearn.metrics._pairwise_distances_reduction import ArgKmin
from threadpoolctl import threadpool_limits

threadpool_limits(2)  # or 1
X = np.random.uniform(size=(100000, 100)).astype(np.float32)  # or float64
Y = np.random.uniform(size=(10000, 100)).astype(np.float32)  # or float64

results = ArgKmin.compute(
    X=X,
    Y=Y,
    k=5,
    metric="manhattan",
    strategy="auto",
    return_distance=False,
)
                       2 threads    1 thread
    simd      float32     54s         101s
              float64     36s          65s
    no simd   float32     60s         115s
              float64     55s         105s
    unroll    float32     37s          66s
              float64     26s          48s

The current implementation in main is "no simd"; this PR is "simd", and "unroll" is the snippet I wrote in my previous comment.

Although the SIMD version improves over main, it seems that a simple manual unrolling can achieve at least the same kind of performance gain (we'd need more benchmarks of course, but this already gives a hint). Maybe the SIMD version should also do some manual loop unrolling?

@GaelVaroquaux

Maybe we should implement the suggestion by @jeremiedbb before the SIMD version. It seems that it will be much more straightforward and will lead to user benefits.

@ogrisel commented Mar 29, 2023

I tried @jeremiedbb's manual loop unrolling on my Apple M1 machine and it does not change the timings vs main. And more threads is always better (there is no hyperthreading on this platform).


@ogrisel commented Mar 29, 2023

> Maybe we should implement the suggestion by @jeremiedbb before the SIMD version. It seems that it will be much more straightforward and will lead to user benefits.

+1

SSE3 only works on x86 CPUs; we would need more code to support other architectures such as ARM64.

Furthermore, we could get better performance by using AVX2 or AVX512 intrinsics (wider vectors), but then we would need a runtime dispatch mechanism for those SIMD-optimized functions: we cannot assume that our x86 users will have a recent enough CPU with AVX512 instructions, so we would have to fall back to AVX2, and then to SSE3 (or even SSE2), if the widest vector instruction sets are not listed in the CPU flags.

I believe NumPy has implemented such a dispatch mechanism, but I haven't looked at it yet.
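For illustration, such a dispatch mechanism might look like the following C sketch. The kernel names are hypothetical, and each kernel would live in its own translation unit compiled with the matching `-mavx512f`/`-mavx2` flags:

    typedef double (*dist_fn)(const float *x1, const float *x2, long n);

    /* Illustrative kernel declarations; definitions would be compiled
       separately with the appropriate -m flags. */
    double dist_avx512(const float *x1, const float *x2, long n);
    double dist_avx2(const float *x1, const float *x2, long n);
    double dist_sse2(const float *x1, const float *x2, long n);

    /* Resolve the widest available kernel once at startup, then call
       through the returned function pointer. */
    static dist_fn resolve_dist(void) {
        if (__builtin_cpu_supports("avx512f")) return dist_avx512;
        if (__builtin_cpu_supports("avx2"))    return dist_avx2;
        return dist_sse2;  /* SSE2 is guaranteed on x86-64 */
    }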

@Micky774 (Owner, Author)

> Maybe we should implement the suggestion by @jeremiedbb before the SIMD version. It seems that it will be much more straightforward and will lead to user benefits.

I think this is definitely a good idea. It provides immediate verifiable benefits at no real cost.

I updated and reran the benchmarks on my machine with @jeremiedbb's unrolling, as well as the SIMD unrolling I added in 3faa339.

[benchmark plot: timings for simple unrolling vs. unrolling + SIMD, float32 and float64]

Unrolling + SIMD offers 2x performance for float32 and 1.36x for float64 compared to simple unrolling on main.

Note that the relatively milder results for float64 are expected with 128-bit SSE instructions, since each vector register holds only two float64 values versus four float32 values.

@jeremiedbb, re-running your benchmark snippet with your unrolled loops against the SIMD unrolled loops, I get times of 15.8s and 7.7s respectively.
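For readers following along, here is a rough sketch (not the PR's actual code) of the kind of SSE2 kernel under discussion; `manhattan_sse_f32` is a hypothetical name:

    #include <immintrin.h>  /* SSE intrinsics */
    #include <math.h>

    static double manhattan_sse_f32(const float *x1, const float *x2, long size) {
        /* Clearing the sign bit with a mask implements fabs without a branch. */
        const __m128 abs_mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
        __m128 acc = _mm_setzero_ps();
        long i = 0;
        /* 4 float32 |differences| per iteration */
        for (; i + 4 <= size; i += 4) {
            __m128 diff = _mm_sub_ps(_mm_loadu_ps(x1 + i), _mm_loadu_ps(x2 + i));
            acc = _mm_add_ps(acc, _mm_and_ps(diff, abs_mask));
        }
        /* horizontal sum of the 4 lanes */
        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        double d = (double)lanes[0] + lanes[1] + lanes[2] + lanes[3];
        /* scalar remainder */
        for (; i < size; i++)
            d += fabsf(x1[i] - x2[i]);
        return d;
    }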

@lorentzenchr

> I tried @jeremiedbb's manual loop unrolling on my Apple M1 machine and it does not change the timings vs main.

I observed the same behaviour for the loop unrolling in HGBT on my Mac (Intel CPU, with clang I think): the manual unrolling has no effect.

@jeremiedbb

> I tried @jeremiedbb's manual loop unrolling on my Apple M1 machine and it does not change the timings vs main.

> I observed the same behaviour for the loop unrolling in HGBT on my Mac (Intel CPU, with clang I think): the manual unrolling has no effect.

This is kind of expected. Clang is more clever than GCC when it comes to instruction-level parallelism. In such a loop, GCC is not able to tell (at least with default flags) that two steps of the loop are independent, so we have to show it the way.
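To make that concrete, here is a minimal C sketch (hypothetical, not from scikit-learn) of how independent accumulators expose that parallelism:

    #include <math.h>

    /* Two accumulators break the loop-carried dependency on a single `d`,
       so the CPU can overlap the two fabs/add chains even without SIMD. */
    double manhattan_ilp(const double *x1, const double *x2, long size) {
        double d0 = 0.0, d1 = 0.0;
        long i = 0;
        for (; i + 2 <= size; i += 2) {
            d0 += fabs(x1[i]     - x2[i]);      /* chain 0 */
            d1 += fabs(x1[i + 1] - x2[i + 1]);  /* chain 1, independent */
        }
        if (i < size)                           /* odd-length remainder */
            d0 += fabs(x1[i] - x2[i]);
        return d0 + d1;
    }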

@jeremiedbb

> I updated and reran the benchmarks on my machine

Nice, we can clearly see the two improvements here: one from SIMD and one from ILP 👍

@ogrisel commented Mar 30, 2023

We might want to explore the idea of using clang to build the Linux binaries of scikit-learn. It's possible that we would get various speed-ups for free, without having to do manual loop unrolling in our code.

The only problem would be to evaluate the impact of switching to llvm-openmp (used by default by clang when compiling/linking with -fopenmp) vs libgomp (a.k.a. GNU OpenMP, used by gcc when compiling/linking with -fopenmp). Notably, we discovered a bad interaction between llvm-openmp and intel-openmp, only on Linux, in the past.

@ogrisel commented Mar 30, 2023

Coming back to the use of intrinsics, another alternative would be to structure our code (and maybe use restrict annotations to resolve pointer aliasing issues; not sure if Cython would allow that) to make it possible for the compiler to automatically vectorize it.

We would still need dynamic dispatch based on CPU architecture flags, but maybe modern compilers can also generate code to do that automatically. I haven't investigated myself yet.
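As a sketch of the restrict idea in plain C (illustrative; the function name is hypothetical, and the floating-point reduction may additionally need `-ffast-math` or an OpenMP SIMD pragma before compilers will vectorize it):

    #include <math.h>

    /* restrict promises the compiler that x1 and x2 do not alias,
       removing one obstacle to auto-vectorization at -O3. */
    double manhattan_autovec(const double *restrict x1,
                             const double *restrict x2,
                             long size) {
        double d = 0.0;
        for (long i = 0; i < size; i++)
            d += fabs(x1[i] - x2[i]);
        return d;
    }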

@jeremiedbb

> and maybe use restrict

Unfortunately, Cython does not expose it.

@ogrisel commented Mar 30, 2023

BTW, thanks for the benchmark reports @Micky774, this is very interesting.

A generic comment, though: when presenting such duration benchmarks in perf-related PRs, I think it's important to set ylim to 0; otherwise it's hard to gauge the relative improvements, as the scale varies from one plot to the next (even in relative terms), which makes it hard to decode in which cases the improvements are important, or whether they are comparable across plots.

Alternatively, you could plot the speed-up factor vs main, in which case the scale of the plots will naturally always be the same (assuming main is always the slowest, or close to the slowest, when working on performance optimization PRs).

@jjerphan

Thank you for exploring this, @Micky774.

As @jeremiedbb indicated in scikit-learn#26010 (comment), I also think scikit-learn is not the right place to have this level of granularity. The added code and compilation complexity (C sources, includes in Cython, branching in the Cython implementation, and extra compilation flags) has a significant cost, especially if only a few distance metrics can be optimized using SIMD: do you know of other potential candidates?

SciPy already has a nice framework of efficient distance-metric implementations for the dense case. Still, those implementations are not exposed via Cython interfaces.

Do people think we could contribute and sync with the SciPy maintainers upstream, so that there is only one set of efficient distance-metric implementations in the ecosystem?

For now, I am +1 regarding using unrolling for ManhattanDistance. 👍

@Micky774 (Owner, Author)

Closing in favor of working on a separate library to provide DistanceMetric implementations. The quick-adoption route would be to allow metric arguments to accept DistanceMetric objects directly, letting users of the third-party library pass them as arguments. The long-term solution would be to develop a plugin system for DistanceMetric. These are not mutually exclusive, either.
