-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
DRAFT ENH Introduce AVX
SIMD instructions for ManhattanDistances.dist
#27145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Thinks that came to my mind seeing this:
Otherwise I think this is pretty cool. |
That's something I'm hoping we can discuss a bit more. I'm unfamiliar with Rust, and although I know it is picking up popularity pretty quickly, afaik it does make it harder to work on relatively complex data structures (the antagonistic example is doubly-linked-lists) but I don't honestly know to what degree. My (comparative) familiarity with C++biases me. I'd love for someone more familiar with Rust to educate me in this regard :) I do think that there is potentially a lot of room to improve code readability by leveraging C++ templating and binding/importing via Cython. This is actually a line-item for discussion in the new native performance discussion group (see notes) although has not been broadly discussed yet. IMO it is easier to slowly-but-safely transition from pure Cython/C to some C++ in key areas than to introduce Rust, but I've not looked at any example implementations so I can still be swayed. I do also believe that the ability to include a relatively simple static dispatch to a widely supported SIMD target, propagating performance gains across a potentially wide swathe of the library is itself a strong merit for C++ inclusion (even if not widespread adoption).
Right, that seems to be a common opinion. I have it added as a submodule right now for (my) convenience but it is simple to vendor it as a header-only library (they're under BSD-3 so licensing is no issue either). I don't have a strong preference one way or the other so I think we can navigate around that. Edit: It's also not necessary at runtime at all, so we could try to specify it as a build-time requirement for users if they want SIMD acceleration and setup CI solutions for wheel building. I'm not sure what the best way to do that is, since (as usual) windows makes things tricky |
I would be very skeptical of that. C++ has become a monster. I would agree with you maybe 20 years ago though. But I've done enough C++ that I really wouldn't want us to maintain C++ code for multiple platforms.
I don't mind vendoring some C++ library at all, as long as we don't have to maintain it, and it doesn't give us much headache to support multiple platforms. And as long as the only thing we have to do is to add a very light binding layer which could also be auto-generated, then we should be fine. |
As discussed on the last monthly meeting (though it is not reported in its notes), the introduction of C++ code (be it done via this PR or another one) necessitates considering several technical aspects (packaging consideration, linkage and build system, depending on third party library or vendoring it, etc.) and several social aspects (when do we want to use C++ over existing and used solutions like Cython, who gets to maintain C++ implementations for those part of the project to have an healthy bus-factor, what are the conventions around C++, etc.) of scikit-learn. We decided that such choices needs to be discussed and voted in favor or in disfavor via a SLEP (I invoke @lorentzenchr for confirmation). IIRC, @Micky774 (let me know if I am wrong) proposed to author one; I also propose helping with writing it. |
Yes, let’s discuss in a slep which has as its goal the usage of simd instructions. Rust, for instance, would then be one alternative. Let me add here that we hit the boundaries of Cython several times, e.g. usage of tempita templates. |
Reference Issues/PRs
A follow-up to #26010
What does this implement/fix? Explain your changes.
This PR demonstrates the minimal infrastructure required for proper building and runtime dispatch of
AVX
SIMD instructions. This means that scikit-learn can optionally be built withAVX
instructions. If, at runtime,AVX
support is found on the runtime machine and scikit-learn was built withAVX
support, then the optimizedAVX
instructions are leveraged when computing distances. Any other case results in the usual scalar loops currently found onmain
.Concrete changes
_simd/
directory housing new filessimd.hpp
which provides a definition and extern declarations of templated Manhattan distance functions (xsimd_manhattan_dist
)simd.cpp
which provides externed declarations ofxsimd_manhattan_dist
. This file serves as a translation unit to target with-mavx
feature flags.avx_dist_metrics
library which is linked against to isolate the rest of_dist_metrics
from requiring additional feature flags (preventing auto-vectorization of unintended code, e.g. backup scalar loops)WITH_SIMD
macro to correctly compile SIMD optimized code, or default to a trivial implementation to satisfy compilationDIST_METRICS
macro to correctly allow inclusion of_dist_optim.cpp
in_dist_metrics.pyx
but not in other Cython files (necessary since anycimport
of_dist_metrics.pyx
forces the inclusion of_dist_optim.cpp
)Adds build and runtime dependency onUsesarchspec
to provide out-of-the-box supportxsimd
directly for runtime microarchitecture feature detection (to determine whether building withAVX
is possible, and whether running withAVX
is possible) via_runtime_check.pyx
and by compilingruntime_check_program
insetup.py
for a quick build-time check.xsimd
as a submodule (~2MB
but could be reduced by half if we only include theinclude
directory)self.{r}dist(...)
calls to<Class>.rdist_csr(self, ...)
to work around Cython's idiosyncrasies when compiling to C++TODO:
Resolve whether to add a dependency onUsesarchspec
or vendor itxsimd
directlyxsimd
as a submodule, vendor an instance, or only include as build-time dependency for wheels and conda-forgeAny other comments?
AVX
instructions were deployed in late 2011 and 2012 and thus are likely found in a significant portion of our users' machines. Honestly, we may want to consider going toAVX2
directly since it was released only a couple years later...I don't know of any comprehensive surveys in the ML field which indicate the presence of these features on work machines and laptops.
Initial benchmarks generated with this script (exported from a Jupyter notebook).
Note
PR_noruntime
is the case where the library is built withAVX
support, but the runtime check fails to detectAVX
capabilities.PR_nobuild
is built withoutAVX
capabilities.PR
is built withAVX
capabilities, and uses them at runtime.There are no slowdowns compared to main, and when built with
AVX
support, and it is detected at runtime, we see minor speedups even in cases of smalln_features
, to drastic speedups (4-7x
) whenn_features
is large.Plots