Introduce SIMD intrinsics for _dist_metrics.pyx
#26010
Comments
Hi @Micky774, thank you for exploring this! I think those really are encouraging results. 💯 If we aim for performance, I think using intrinsics is relevant. Yet the use of lower-level languages in scikit-learn has been discussed in the past, and I do not think maintainers agree on whether we should use them. If I remember correctly, for some maintainers such languages go against the approachability of the code for readers and against some elements of the culture of the project; personally, those are arguments which I do not understand, but I would be happy to continue the discussion. Alternatively, we could imagine relying on SciPy's implementations of distance metrics (which are performant), but this might also create supplementary maintenance work in their codebase. I am looking forward to other maintainers' opinions. In the meantime, I will review your PR.
But these are compile-time checks, right? So if the binaries are compiled on a machine that supports SSE3, the binary will try to execute the SSE3 instructions even if the end user's machine doesn't support them. Perhaps this is less of an issue for SSE3, since, as @Micky774 mentioned, it should be widely available. But FWIW, SIMD support and packaging is a hairy issue, and that might be a challenge for scikit-learn. See the PIL-SIMD README for its explanation of why it is not upstreamed into PIL directly.
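To make the packaging concern concrete, here is a minimal sketch of such a compile-time guard (illustrative code, not scikit-learn's; `manhattan_dist` is a hypothetical name). Whichever branch the preprocessor selects is baked into the wheel, so every downstream user gets the build machine's choice:

```cpp
// Illustrative compile-time guard: __SSE3__ is defined by GCC/Clang only
// when the *build* enables SSE3 (e.g. -msse3, -march=nocona). The end
// user's CPU is never consulted.
#include <cstddef>

#ifdef __SSE3__
#include <pmmintrin.h>  // SSE3 intrinsics (also pulls in SSE/SSE2)

double manhattan_dist(const double* x, const double* y, std::size_t n) {
    const __m128d sign_mask = _mm_set1_pd(-0.0);  // only the sign bit set
    __m128d acc = _mm_setzero_pd();
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d d = _mm_sub_pd(_mm_loadu_pd(x + i), _mm_loadu_pd(y + i));
        acc = _mm_add_pd(acc, _mm_andnot_pd(sign_mask, d));  // acc += |d|
    }
    acc = _mm_hadd_pd(acc, acc);  // horizontal add: an SSE3 instruction
    double total = _mm_cvtsd_f64(acc);
    for (; i < n; ++i)  // scalar tail
        total += x[i] > y[i] ? x[i] - y[i] : y[i] - x[i];
    return total;
}
#else
// Scalar fallback, selected at compile time for every user of this wheel.
double manhattan_dist(const double* x, const double* y, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        total += x[i] > y[i] ? x[i] - y[i] : y[i] - x[i];
    return total;
}
#endif
```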
In PyTorch, they work around this problem by compiling the optimized code with multiple instruction sets, and then leverage the PyTorch dispatcher at run time to figure out which code path to use depending on the user's machine support. But this is probably way too complex for scikit-learn. EDIT: as far as I can tell, the PyTorch setup is very similar to NumPy's.
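For contrast with the compile-time guard above, here is a minimal sketch of that runtime-dispatch pattern, using GCC/Clang's `__builtin_cpu_supports` (all function names are illustrative, not scikit-learn, NumPy, or PyTorch code; the real systems additionally compile each kernel in its own translation unit with its own `-m` flags, which is the part that complicates the build):

```cpp
#include <cstddef>

// Scalar baseline, always available.
static double manhattan_scalar(const double* x, const double* y, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        total += x[i] > y[i] ? x[i] - y[i] : y[i] - x[i];
    return total;
}

// Hypothetical SSE3 kernel, assumed to be compiled separately with -msse3.
double manhattan_sse3(const double* x, const double* y, std::size_t n);

using dist_fn = double (*)(const double*, const double*, std::size_t);

// The CPU is queried once, on the end user's machine, not at build time.
static dist_fn resolve_manhattan() {
    if (__builtin_cpu_supports("sse3"))
        return manhattan_sse3;
    return manhattan_scalar;
}

static const dist_fn manhattan_dist = resolve_manhattan();
```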
What I am wondering is whether these 2x and 1.5x speed-ups are also obtained at the level of the estimators. My feeling here is that we would introduce a layer of complexity that we would probably want to extend to a lot of subroutines. Is the complexity worth the 1.5x to 2x speed-up?
hi everyone, just wading in since i find this super interesting.

i think the main resistance would be to adding new c/c++ code to the codebase due to the maintenance burden on new contributors. (not sure how our stance on that changed since ive last been active though!)

> In PyTorch (https://github.com/pytorch/pytorch/tree/master/aten/src/ATen/native/cpu#readme), they work around this problem by compiling the optimized code with multiple instruction sets, and then leverage the pytorch dispatcher *at run time* to figure out which code-path to use depending on the user's machine support. But this is probably way too complex for scikit-learn.

fwiw i had to build a custom extension for pytorch that also involved a bundled compiled library (as numpy does). it was a bit weird and flimsy, and writing it was almost like a new skillset compared to the usual python build system; missing symbol errors are quite tricky to debug due to all the runtime magic. not sure if related, but pytorch wheels are also around 200MB even without cuda.
While thinking about this, I came across some links that are interesting to read:
Yes. Regarding this, we could either follow NumPy's strategy of building the superset and dispatching at runtime, or build minimally and specify which features we enable, to try to maximize portability. The PIL-SIMD discussion is based on enabling AVX2, a considerably larger ask than SSE3.

With that being said, NumPy and PyTorch have already built out viable runtime-dispatch systems for SIMD intrinsics. While we can't quite apply their solutions one-to-one (and even if we could, it would still be a lot of effort), there is a lot we could learn from them if we did decide to pursue that route. As an aside, I'm personally more familiar with NumPy's system. Just to minimize scope right now, I'd rather stay within SSE3.
Almost surely the gains would not be that dramatic for all estimators (or perhaps even for most estimators); however, this would help those where distance computations are a potent bottleneck. I imagine such estimators would be mostly those which now use the `PairwiseDistancesReduction` backend.
Thanks for offering this point. It is one of my greatest apprehensions towards this as well. I feel like only recently, with the immense work of several maintainers, has momentum for Cython contributions been increasing (shoutout to @thomasjpfan and @jjerphan for helping me learn Cython), and yet it still remains a rare target for non-maintainer contributions. The introduced SIMD intrinsics and surrounding C code would be even harder to foster support for.

With that being said, I think thorough education, both in documentation and in reviews, can help mitigate that. If we keep the use to a very limited scope and focus on clarity, then we can even develop those skills in our community. I've done some work on the NumPy side of things, and I do adore that about their community: a lot of folks are eager to learn new skills when fumbling around the codebase and seeing syntax they have never seen before :)
Re Cython vs C++: FWIW, I personally don't think that Cython (as we use it in scikit-learn) is a lower barrier to entry than C/C++. It might have been the case a long time ago, but today we make use of some very advanced Cython features and hit a bunch of edge cases, all of which require strong expertise. I remember spending days of trial and error trying to figure out how to do something in Cython that would have taken me 10 minutes in C, simply because there are far fewer resources on how to use Cython properly compared to C (cython/cython#2813, #17299). What makes C/C++ hard for most Python programmers isn't the syntax; it's the lack of automatic memory management and all the raw pointer handling. This isn't something that Cython makes easier, not in the scikit-learn codebase at least. I would suspect that those who are competent enough to write Cython code for scikit-learn would be just as competent at writing C/C++ code. (This is not a dig at Cython; I love Cython, and I know how important it is to the ecosystem.)
> i think the main resistance would be to adding new c/c++ code to the codebase due to the maintenance burden on new contributors. (not sure how our stance on that changed since ive last been active though!)
We're working on a plugin API for these use cases, but it's still early in development.
Could we have a benchmark at the level of an estimator? After a quick scan of the NumPy resources linked above, I think the infrastructure for all the work besides the code that actually uses the intrinsics is a significant amount of work.

I like the comment from Gael. One of the ideas (the idea?) behind plugins for computational backends is that anyone can write a plugin that does crazy things (as considered from the viewpoint of core scikit-learn) and try out other ideas to achieve performance improvements. Most of the time people think of GPUs, but why not have a plugin based on SIMD intrinsics?
I think Cython is still a good compromise and remains the best short- and medium-term solution for scikit-learn, IMO. The more we advance, the more we face limitations with Cython as a language. Beyond problems solved with workarounds based on undocumented features or aspects of Cython, examples of limitations the team has faced in this regard are:
After having spent several thousand hours reading and writing Cython code, and reading Cython's source code while trying to contribute to the project, I can say that:
Still, if we were to reconsider Cython against alternatives, I think we would need to perform a qualitative analysis beforehand, e.g. for an alternative X:
We also might want to perform an evidence-backed analysis (ideally: can we reach optimal and safe execution on a prototype first, without changing scikit-learn's UX and packaging?). This is not negligible work. I think the plugin API might make experimenting easier in this regard. 👍
I'm not very enthusiastic about such low-level optimizations, not only because of the added complexity and maintenance burden, but mainly because I think scikit-learn is not the appropriate place for that (most of the time).

Another thing to consider, IMO, is the balance between the potential gain, the added complexity, and the user interest. The more time we dedicate to performance improvements, the less time we have to improve the usability of the library, for instance. Since there are ongoing discussions regarding the future of the project, I think it's something to keep in mind.

Finally, regarding this PoC specifically, there might be simpler tricks that we should try before diving into SIMD. In general we should try simpler solutions first and increase complexity only when they fail. Here, for instance, a manual unrolling might give similar results for a lot less added complexity (see this quick benchmark: Micky774#11 (comment)).
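For readers unfamiliar with the trick, here is a rough sketch of what manual unrolling means in this context (illustrative only, not the exact code benchmarked in Micky774#11):

```cpp
// 4-way manually unrolled Manhattan distance. The four independent
// accumulators break the loop-carried dependency chain, so the compiler
// and CPU can overlap the additions (and possibly auto-vectorize).
#include <cmath>
#include <cstddef>

double manhattan_unrolled(const double* x, const double* y, std::size_t n) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += std::fabs(x[i]     - y[i]);
        a1 += std::fabs(x[i + 1] - y[i + 1]);
        a2 += std::fabs(x[i + 2] - y[i + 2]);
        a3 += std::fabs(x[i + 3] - y[i + 3]);
    }
    for (; i < n; ++i)  // remainder
        a0 += std::fabs(x[i] - y[i]);
    return (a0 + a1) + (a2 + a3);
}
```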
Just wanted to update the conversation on the issue to match the developments in the PR discussion: it seems the manual loop unrolling mainly benefits GCC-compiled code (or at least, the benefits are not observed when compiling with Clang); it is independent of the SIMD optimization, and the two can be combined (cf. Micky774#11 (comment), Micky774#11 (comment)).
I wanted to mention that you make wonderful points, and I feel like this is a conversation that will soon need to be had (even if just to affirm our use of Cython). Personally, I've been growing fond of the idea of writing in C/C++ directly and using Cython as a quick and easy way of binding to Python functions (e.g. my prototype SIMD implementation). I think SciPy does something similar (write in C++ directly, then wrap/bind to bring it into Python); we could look more closely at their solutions. Note, however, that I do not have much experience with this strategy and can't properly advocate for it (yet).
I think C/C++/SYCL implementations (cc @fcharras, who is working on GPU implementations for scikit-learn) with Python bindings (done with Cython, pybind11, or nanobind) are a suitable direction. Yet I think we first need to consider the current performance bottlenecks, the complexity that Cython currently handles for us (e.g. how to compile and link against BLAS?), the number of people interested in such development, and the ongoing work on performance (algorithm redesign, Array API support, Cython implementations, GPU implementations, etc.). Currently, I do not know what people think about exploring such alternatives for scikit-learn.
I will wave the plugin flag once more :D I think plugins are an excellent way to get an answer to all of the questions raised above, because anyone can write a plugin and use whatever technology they want to do it. Then we have some concrete code that can either be abandoned (we learnt something that we didn't realise before implementation), maintained as a separate thing (maybe you want to iterate much faster than scikit-learn releases, serve a small niche set of users, try exotic optimisation tricks, etc.), or integrated into the core of scikit-learn (it proves itself beneficial and not too burdensome to maintain). (Maybe the plugin machinery we are working on is not yet there in this fantastic land, but that is where we are trying to go. Or at least, trying to reach this promised land is what gets me excited about plugins.)
I think at this point I agree that there are better ways to handle this than natively supporting/requiring SIMD intrinsics in scikit-learn itself.
> 2. Provide separate SIMD-accelerated implementations of different distance metrics in perhaps a scikit-learn-contrib package
I would prefer exploring the notion of plugins, which we have already started exploring elsewhere to add optimized code. While plugins are a slightly longer-term option, I believe they can provide much more value to users, as they open the door to speeding up larger codebases, made of chains of operations, without modifying user-level code.
I also think developing plugins to experiment while keeping the UX of scikit-learn unchanged is the best approach. Working on a plugin for SIMD-accelerated implementations of distance metrics might also help make progress in the plugin discussions (see #22438). I started playing around a bit with xsimd. How does that sound to people? 🙂 Ping to @betatim and @Micky774, who showed interest in previous comments. 🏓
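For a flavour of what that could look like, here is a rough sketch against xsimd's documented API (untested here, so treat it as an assumption about the final shape; `manhattan_xsimd` is a hypothetical name):

```cpp
// Manhattan distance with xsimd: the batch width is chosen by the
// architecture xsimd is compiled for, so a single generic kernel covers
// SSE/AVX/NEON builds.
#include <cmath>
#include <cstddef>
#include "xsimd/xsimd.hpp"

float manhattan_xsimd(const float* x, const float* y, std::size_t n) {
    using batch = xsimd::batch<float>;
    constexpr std::size_t width = batch::size;
    batch acc(0.0f);
    std::size_t i = 0;
    for (; i + width <= n; i += width)
        acc += xsimd::abs(batch::load_unaligned(x + i) - batch::load_unaligned(y + i));
    float total = xsimd::reduce_add(acc);  // horizontal sum of the batch
    for (; i < n; ++i)                     // scalar tail
        total += std::fabs(x[i] - y[i]);
    return total;
}
```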
AFAIK, Apache Arrow's C++ implementation uses xsimd. More generally, can we ask some libraries about their experience with xsimd? A second point is that we would then officially allow more C++. I personally would be fine with this: C++ templates, for instance, seem way cleaner and more future-proof than Cython's Tempita templating (see the sketch below). The point is that this should be a deliberate decision.
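To make the templating comparison concrete, a generic sketch: one C++ function template covers both dtypes, where Tempita would generate two copies of the source text.

```cpp
// One generic kernel instead of two Tempita-generated variants.
#include <cmath>
#include <cstddef>

template <typename T>
T manhattan(const T* x, const T* y, std::size_t n) {
    T total = T(0);
    for (std::size_t i = 0; i < n; ++i)
        total += std::fabs(x[i] - y[i]);
    return total;
}

// Explicit instantiations for the two dtypes scikit-learn supports:
template float  manhattan<float>(const float*, const float*, std::size_t);
template double manhattan<double>(const double*, const double*, std::size_t);
```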
Yes, and the Firefox developers also recently chose to vendor xsimd and to use it for their web audio engine.
What do you think of first developing a plugin to isolate the C++ implementations from scikit-learn's codebase, then coming up with a comparison of execution time and maintainability, and finally deciding on a solution?
Just for info: NumPy went with Google's Highway in numpy/numpy#24018.
IIRC, the (implicit?) consensus on the mailing list was to first assess the value of efficient non-Euclidean distance-metric computations for end users.
Context
Pairwise distance computation is an essential part of many estimators in scikit-learn, and can take up a significant portion of run time in certain workflows. I believe that we may achieve significant performance gains in several (perhaps most) distance metric implementations by leveraging SIMD intrinsics.
Proof of Concept
I built a quick proof of concept just to see what kinds of performance gains we could observe with a potentially naive implementation of SIMD intrinsics. I chose to optimize the `ManhattanDistance.dist` function. This implementation uses intrinsics found in `SSE{1,2,3}`. To ensure that the instructions are supported, it checks for the presence of the `SSE3` instruction set (`SSE3` implies `SSE{1,2}`) and provides the optimized implementation if so. Otherwise it provides a dummy implementation just to appease Cython, and the main function falls back to the current implementation on `main`. Note that on most modern hardware, support for `SSE3` is a reasonable expectation (indeed NumPy assumes it is always present when optimization is enabled). For the specific implementation referred to here, please take a look at this PR: Micky774#11

Note that the full benefit of the intrinsics is gained when compiling with `-march="native"`; however, the benefit is still significant when compiling with `-march="nocona"`, as is often the default (e.g. when following the scikit-learn development instructions on Linux).
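For readers who don't want to open the PR, here is a condensed `float32` sketch of the shape of such a kernel (illustrative only; the actual implementation is in Micky774#11):

```cpp
// Condensed float32 Manhattan-distance kernel in the SSE3 style: processes
// 4 floats per iteration and falls back to scalar code for the tail.
#include <cstddef>
#include <pmmintrin.h>  // SSE3 (also pulls in SSE/SSE2)

float manhattan_sse3(const float* x, const float* y, std::size_t n) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  // only the sign bit set
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 d = _mm_sub_ps(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i));
        acc = _mm_add_ps(acc, _mm_andnot_ps(sign_mask, d));  // acc += |d|
    }
    acc = _mm_hadd_ps(acc, acc);  // SSE3 horizontal adds: 4 lanes -> 1 sum
    acc = _mm_hadd_ps(acc, acc);
    float total = _mm_cvtss_f32(acc);
    for (; i < n; ++i)            // scalar tail
        total += x[i] > y[i] ? x[i] - y[i] : y[i] - x[i];
    return total;
}
```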
Benchmarks
The following benchmarks were produced by this gist: https://gist.github.com/Micky774/567a5fa199c05d90c4c08625b077840e
Summary: the SIMD implementations are ~2x faster than the current implementation for `float32` and 1.5x faster for `float64`.
Plots
Discussion
I haven't looked too deeply into this yet, as I first wanted to see whether there was interest in the venture. I would love to hear other maintainers' thoughts on exploring this route in a bit more detail. Obviously SIMD implementations will bring added complexity with them, but the performance gains are pretty compelling. In my opinion, the tradeoff is worth it.
CC: @scikit-learn/core-devs