Implement SIMD optimization for ManhattanDistance.dist
#11
Conversation
Is the performance gain really because we're currently not using SIMD? I wonder if the improvement doesn't come from instruction-level parallelism instead. If I replace the current implementation with a manual unrolling:

```cython
cdef intp_t n = size // 4
cdef intp_t r = size % 4

for j in range(n):
    d += (fabs(x1[0] - x2[0]) +
          fabs(x1[1] - x2[1]) +
          fabs(x1[2] - x2[2]) +
          fabs(x1[3] - x2[3]))
    x1 += 4; x2 += 4
for j in range(r):
    d += fabs(x1[j] - x2[j])
return d
```

I get improvements as well. Looking at the assembly, I'm not able to tell whether it better leverages vectorization or instruction-level parallelism.
The review comment below refers to these lines of the diff:

```cython
d += fabs(x1[j] - x2[j])
cdef cnp.intp_t j, simd_width = 16 / sizeof({{INPUT_DTYPE_t}})
cdef cnp.intp_t remainder = size % simd_width
if HAS_SIMD:
```
Naive question: is there a benefit to making this a "compile time if statement"? And if yes, is that even possible/advisable?
There probably is a benefit, though I have no idea how significant it would be.
One option would be to port the existing code for whichever function we try to accelerate to C as well and handle it all there via compile-time macros. Of course that makes things significantly more complicated, but it's the most straightforward option I can think of to accommodate it.
After reading (and thinking) a bit more, in particular about the NumPy setup, I think you have to have a runtime check: one that is even more "runtime" than this one, which relies on a compile-time constant. (Maybe except for SSE/SSE2, as this seems to be the minimum setting for NumPy?)
I ran a quick benchmark on my laptop:

```python
import numpy as np
from sklearn.metrics._pairwise_distances_reduction import ArgKmin
from threadpoolctl import threadpool_limits

threadpool_limits(2)  # or 1

X = np.random.uniform(size=(100000, 100)).astype(np.float32)  # or float64
Y = np.random.uniform(size=(10000, 100)).astype(np.float32)   # or float64

results = ArgKmin.compute(
    X=X,
    Y=Y,
    k=5,
    metric="manhattan",
    strategy="auto",
    return_distance=False,
)
```
The current implementation in `main`: [timings elided]

Although the SIMD version improves over `main`, it seems that a simple manual unrolling can achieve at least the same kind of performance gain (we'd need more benchmarks of course, but this already gives a hint). Maybe the SIMD version should also do some manual loop unrolling?
Maybe we should implement the suggestion by @jeremiedbb before the SIMD version. It seems that it will be much more straightforward and will lead to user benefits.
I tried @jeremiedbb's manual loop unrolling on my Apple M1 machine and it does not change the timings vs `main`. And more threads is always better (there is no hyperthreading on this platform).
+1

SSE3 only works on x86 CPUs; we would need more code to support other architectures such as ARM64. Furthermore, we could get better performance by using AVX2 or AVX512 intrinsics (wider vectors), but then we would need a runtime dispatch mechanism for those SIMD-optimized functions: we cannot assume that our x86 users will have a recent enough CPU with AVX512 instructions, so we would have to fall back to AVX2 and then to SSE3 (or even SSE2) if the widest vector instruction sets are not listed in the CPU flags. I believe numpy has implemented such a dispatch mechanism, but I haven't looked at it yet.
I think this is definitely a good idea. It provides immediate verifiable benefits at no real cost.

I updated and reran the benchmarks on my machine with @jeremiedbb's unrolling, as well as the SIMD unrolling I added in 3faa339. [benchmark plots elided] Unrolling + SIMD offers 2x performance for […]. Note that the relatively milder results for […].

@jeremiedbb, re-running your benchmark snippet on your unrolled loops against the SIMD unrolled loops, I get times of […].
I observed the same behaviour for the loop unrolling in HGBT on my Mac (Intel CPU, with clang I think): the manual unrolling has no effect.
This is kind of expected. Clang is more clever than GCC when it comes to instruction-level parallelism. In such a loop, GCC is not able to tell (at least with default flags) that 2 steps of the loop are independent, so we have to show it the way.
Nice, we can clearly see the 2 improvements here: one from SIMD and one from ILP 👍
We might want to explore the idea of using clang to build the Linux binaries of scikit-learn. It's possible that we would get various speed-ups for free without having to do manual loop unrolling in our code. The only remaining problem would be to evaluate the impact of switching to llvm-openmp (used by default by clang when compiling/linking with […]).
Coming back to the use of intrinsics, another alternative would be to attempt to structure our code (and maybe use […]).

But still we would need dynamic dispatch based on CPU architecture flags. But maybe modern compilers can also generate code to do that automatically; I haven't investigated myself yet.
Unfortunately, Cython does not expose it.
BTW, thanks for the benchmark reports @Micky774, this is very interesting. A generic comment though: when presenting such duration benchmarks in perf-related PRs, I think it's important to set the y-limit at 0, otherwise it's hard to gauge the relative improvements, as the scale varies from one plot to the next (even in relative terms), which makes it hard to decode in which cases the improvements are important or whether they are comparable across plots. Alternatively, you could plot the speed-up factor vs […].
Thank you for exploring this, @Micky774. As @jeremiedbb indicated in scikit-learn#26010 (comment), I also think scikit-learn is not the right place to have this level of granularity. I also think that the added code and compilation complexity (C source, includes in Cython, branching in the Cython implementation, and extra compilation flags) has a significant cost, especially if there are only a few distance metrics which can be optimized using SIMD: do you know of other potential candidates?

SciPy already has a nice framework of efficient implementations of distance metrics for the dense case. Still, those implementations are not exposed via Cython interfaces. Do people think we could contribute and sync with the maintainers of SciPy upstream so that there is only one set of efficient distance metric implementations in the ecosystem?

For now, I am +1 on using unrolling for […].
Closing in favor of working on a separate library to provide […].
Reference Issues/PRs

What does this implement/fix? Explain your changes.

Provides a proof-of-concept SIMD implementation for ManhattanDistance.dist.

Any other comments?