8000 Introduce SIMD intrinsics for `_dist_metrics.pyx` · Issue #26010 · scikit-learn/scikit-learn · GitHub
Open
Micky774 opened this issue Mar 28, 2023 · 23 comments

Comments

@Micky774 (Contributor) commented Mar 28, 2023

Context

Pairwise distance computation is an essential part of many estimators in scikit-learn, and can take up a significant portion of run time in certain workflows. I believe that we may achieve significant performance gains in several (perhaps most) distance metric implementations by leveraging SIMD intrinsics.

Proof of Concept

I built a quick proof of concept to see what kinds of performance gains we could observe with a potentially naive use of SIMD intrinsics. I chose to optimize the ManhattanDistance.dist function. This implementation uses intrinsics found in SSE{1,2,3}. To ensure that the instructions are supported, it checks for the presence of the SSE3 instruction set (SSE3 implies SSE{1,2}) and provides the optimized implementation if so. Otherwise it provides a dummy implementation just to appease Cython, and the main function falls back to the current implementation on main. Note that on most modern hardware, support for SSE3 is a reasonable expectation (indeed, NumPy assumes it is always present when optimization is enabled). For the specific implementation, please take a look at this PR: Micky774#11

Note that the full benefit of the intrinsics is gained when compiling with -march="native"; however, the benefit is still significant when compiling with -march="nocona", as is often the default (e.g. when following the scikit-learn development instructions on Linux).
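For a sense of what such a kernel looks like, here is a minimal sketch (not the PR's actual code) of a vectorized Manhattan distance for float32. It restricts itself to SSE2 intrinsics, which are baseline on x86-64 and so compile without extra flags; the PoC additionally uses SSE3 instructions such as horizontal adds:

```cpp
#include <immintrin.h>  // SSE2 intrinsics
#include <cmath>
#include <cstddef>

// Sketch: Manhattan distance over float32, 4 lanes per iteration.
float manhattan_sse(const float* x, const float* y, std::size_t n) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  // only the sign bit set
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 d = _mm_sub_ps(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i));
        // |d|: clear the sign bit of each lane
        acc = _mm_add_ps(acc, _mm_andnot_ps(sign_mask, d));
    }
    // Horizontal sum of the 4 accumulator lanes
    // (SSE3's _mm_hadd_ps would shorten this).
    __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(acc, shuf);
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    float total = _mm_cvtss_f32(sums);
    for (; i < n; ++i)  // scalar tail for n not divisible by 4
        total += std::fabs(x[i] - y[i]);
    return total;
}
```

The scalar tail mirrors the fallback path; a production version would also need the compile-time or runtime capability checks discussed in the comments below.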

Benchmarks

The following benchmarks were produced by this gist: https://gist.github.com/Micky774/567a5fa199c05d90c4c08625b077840e

Summary: The SIMD implementations are ~2x faster than the current implementation for float32 and ~1.5x faster for float64.

Plots

[Benchmark plots from the gist: runtime comparison of the SIMD vs. current implementations for float32 and float64]

Discussion

I haven't looked too deeply into this yet, as first I wanted to see whether there was interest in the venture. I would love to hear what the other maintainers' thoughts are regarding exploring this route in a bit more detail. Obviously SIMD implementations will bring with them added complexity, but the performance gains are pretty compelling. In my opinion, the tradeoff is worth it.

CC: @scikit-learn/core-devs

@jjerphan (Member)

Hi @Micky774,

Thank you for exploring this! I think those really are encouraging results. 💯

If we aim for performance, I think using intrinsics is relevant.

Yet in the past, the use of lower-level languages in scikit-learn has been discussed, and I do not think maintainers agree on whether we should use them.

If I remember correctly, for some maintainers the use of such languages makes the code less approachable for readers and goes against some elements of the project's culture. Personally, those are arguments I do not fully understand, but I would be happy to continue the discussion.

Alternatively, we could imagine relying on SciPy's implementations of distance metrics (which are performant), but this might also create supplementary maintenance work in their codebase.

I am looking forward to other maintainers' opinions. In the meantime, I will review your PR.

@NicolasHug (Member) commented Mar 29, 2023

To ensure that the instructions are supported, it checks for the presence of the SSE3 instruction set (SSE3 implies SSE{1,2}) and provides the optimized implementation if so

But these are compile-time checks right? So if the binaries are compiled on a machine that supports SSE3, the binary will try to execute the intrinsics instructions, even if the end-users' machine doesn't support it?

Perhaps this is less of an issue for SSE3 since, as @Micky774 mentioned, it should be widely available. But FWIW, SIMD support and packaging is a hairy issue, and that might be a challenge for scikit-learn. Here's what PIL-SIMD has to say about why it's not upstreamed into PIL directly:

Why do not contribute SIMD to the original Pillow
Well, it's not that simple. First of all, the original Pillow supports a large number of architectures, not just x86. But even for x86 platforms, Pillow is often distributed via precompiled binaries. In order for us to integrate SIMD into the precompiled binaries we'd need to execute runtime CPU capabilities checks. To compile the code this way we need to pass the -mavx2 option to the compiler. But with the option included, a compiler will inject AVX instructions even for SSE functions (i.e. interchange them) since every SSE instruction has its AVX equivalent. So there is no easy way to compile such library, especially with setuptools.

In PyTorch, they work around this problem by compiling the optimized code with multiple instruction sets, and then leveraging the PyTorch dispatcher at run time to figure out which code path to use depending on the user's machine. But this is probably way too complex for scikit-learn.

EDIT: as far as I can tell the pytorch setup is very similar to that of numpy that Tim linked to below #26010 (comment)
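To make the dispatch idea concrete, here is a hypothetical sketch of the function-pointer pattern (names are illustrative; the real NumPy/PyTorch machinery also handles build-system concerns, and the vectorized kernel would be compiled in a separate translation unit with the appropriate -m flags):

```cpp
#include <cmath>
#include <cstddef>

// Baseline kernel: always safe to run.
static float dist_scalar(const float* x, const float* y, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += std::fabs(x[i] - y[i]);
    return s;
}

// Stand-in for an SSE3-optimized kernel that would live in a translation
// unit compiled with -msse3; it simply forwards to the scalar version here.
static float dist_sse3(const float* x, const float* y, std::size_t n) {
    return dist_scalar(x, y, n);
}

using dist_fn = float (*)(const float*, const float*, std::size_t);

// Queried once (e.g. at module import); everything else calls through
// the returned pointer. __builtin_cpu_supports is a GCC/Clang builtin.
dist_fn select_dist() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("sse3")) return dist_sse3;
#endif
    return dist_scalar;  // safe fallback on older or non-x86 CPUs
}
```

The key point is that the capability check happens on the end user's machine, not on the build machine, so a single wheel can ship both code paths.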

@glemaitre (Member)

Summary: The SIMD implementations are ~2x faster than the current implementation for float32 and 1.5x faster for float64.

What I am wondering is if these 2x and 1.5x are also obtained at the level of the estimator. My feeling here is that we will introduce a layer of complexity that we probably want to extend to a lot of subroutines. Is the complexity worth the 1.5x to 2x speed-up?

@vene (Member) commented Mar 29, 2023 via email

@betatim (Member) commented Mar 29, 2023

While thinking about this I came across some interesting links to read, notably NumPy's documentation of its SIMD optimization infrastructure.

@Micky774 (Contributor, Author)

But these are compile-time checks right? So if the binaries are compiled on a machine that supports SSE3, the binary will try to execute the intrinsics instructions, even if the end-users' machine doesn't support it?

Yes. Regarding this, we could either follow NumPy's strategy of building the superset and dispatching at runtime, or build minimally and specify which features we enable, to maximize portability. The PIL-SIMD discussion is based on enabling AVX2; however, I don't think we would need to if we only wanted a minimal optimization.

With that being said, NumPy and PyTorch have already built out viable runtime dispatch systems for SIMD intrinsics. While we can't quite apply their solutions one-to-one (and even if we could it would still be a lot of effort), we have a lot we could learn from them if we did decide to pursue that solution. As an aside, I'm personally more familiar with NumPy's system.

Just to minimize scope right now, I'd rather stay within SSE3 instructions since those are still incredibly portable afaik.

What I am wondering is if these 2x and 1.5x are also obtained at the level of the estimator. My feeling here is that we will introduce a layer of complexity that we probably want to extend to a lot of subroutines. Is the complexity worth the 1.5x to 2x speed-up?

Almost surely the gains would not be that dramatic for all estimators (or perhaps even most estimators); however, this would help those where distance computations are a potent bottleneck. I imagine such estimators would mostly be those which now use the PairwiseDistanceReduction backend. I also think distance computation will steadily become a more prominent bottleneck over time.

I think the main resistance would be to adding new c/c++ code to the
codebase due to the maintenance burden on new contributors. (not sure how
our stance on that changed since ive last been active though!)

Thanks for offering this point. It is one of my greatest apprehensions towards this as well. I feel like only recently, with the immense work of several maintainers, has the momentum for Cython contributions been increasing (shoutout to @thomasjpfan and @jjerphan for helping me learn Cython), and yet it still remains a rare target for non-maintainer contributions. The introduced SIMD intrinsics and surrounding C code would be even harder to foster support for.

With that being said, I think thorough education -- both in documentation and reviews -- can help mitigate that. I think if we keep the use in a very limited scope and focus on clarity, then we can even develop those skills in our community. I've done some work on the NumPy side of things, and I do adore that about their community: a lot of folks are eager to learn new skills when fumbling around the codebase and seeing syntax they had never seen before :)

@NicolasHug (Member)

Re Cython vs C++:

FWIW, I personally don't think that Cython (as we use it in scikit-learn) is a lower barrier to entry than C / C++. It might have been the case a long time ago, but today we make use of some very advanced Cython features and hit a bunch of edge cases, all of which require strong expertise. I remember spending days of trial and error trying to figure out how to do something in Cython that would have taken me 10 minutes in C, simply because there are far fewer resources on how to use Cython properly compared to C (cython/cython#2813, #17299).

What makes C / C++ hard for most Python programmers isn't the syntax, it's the [lack of] memory management and all the memory pointer handling. This isn't something that Cython makes easier - not in the scikit-learn code base at least.

I would suspect that those who are competent enough to write Cython code for scikit-learn would be just as competent at writing C/C++ code.

(This is not a dig at Cython, I love Cython, and I know how important it is to the ecosystem).

@GaelVaroquaux (Member) commented Mar 29, 2023 via email

@betatim (Member) commented Mar 29, 2023

Could we have a benchmark on something like KMeans as an example of how the pure algorithm improvements translate to "estimator improvements"?

After a quick scan of the NumPy resources linked above, I think the infrastructure surrounding the code that actually uses the intrinsics represents a significant amount of work.

I like the comment from Gael. One of the ideas (the idea?) behind plugins for computational backends is that anyone can write a plugin that does crazy things (as considered from the viewpoint of core scikit-learn) and try out other ideas to achieve performance improvements. Most of the time people think of GPUs, but why not have a scikit-learn-with-all-the-AVXs plugin? You would probably need less infrastructure code because you can also just say "Don't install this if you are on an RPi!"

@jjerphan (Member) commented Mar 29, 2023

I think Cython is still a good compromise and the best short- and medium-term solution for scikit-learn IMO. Yet the more we advance, the more we face limitations of Cython as a language, beyond problems solved with workarounds based on undocumented features or aspects of Cython.

After having spent several thousand hours reading and writing Cython code, and reading Cython's source code while trying to contribute to the project, I can say that:

  • I like Cython for:

    • its concision
    • its learning curve thanks to interactive mode in Jupyter (I am probably biased for having programmed in C or C++ before)
    • the presence of interfaces to BLAS in SciPy which removes some burden from scikit-learn maintainers' life ❤️
  • I do not like Cython for:

    • the lack of idioms for coming up with flexible designs with reliable and safe execution
    • the lack of clear and unified documentation
    • the paths and experience for contributing to Cython itself (Pyrex predates PEP 8, a lot of complexity, an unfortunately small bus factor, etc.)
    • the small pool of developers (and thus reviewers), which might make authors and reviewers wait and perform context switches over months, making motivation fade quickly

Still, if we were to reconsider Cython against alternatives, I think we would need to perform a qualitative analysis beforehand, e.g. for an alternative X:

  • Can we enforce conventions?
  • Do people like developing using X over Cython?
  • Is it compelling and convenient for people to learn and to express their ideas with X over Cython?
  • Is X properly documented and tooled?
  • Does X have other limitations?

We might also want to perform an evidence-backed analysis (e.g. ideally: can we reach optimal and safe execution in a prototype first, without changing scikit-learn's UX and packaging?). This is not negligible work.

I think the plugin API might make experimenting easier in this regard. 👍

@jeremiedbb (Member)

I'm not very enthusiastic about such low-level optimizations, not only because of the added complexity and maintenance burden, but mainly because I think scikit-learn is not the appropriate place for that (most of the time).

What I mean is that, to me, pairwise distances (for instance, since the PoC is about that) are a generic brick of scientific computing. I don't think it's the role of scikit-learn to implement state-of-the-art pairwise distances. Instead, I think this role belongs to libraries that expose common tools for scientific computing like SciPy, or even to dedicated libraries like KeOps.

While I definitely see the value of optimizing such algorithms, I think it would benefit a lot more people if we were able to contribute that upstream rather than keep optimizing it on our own.

Another thing to consider imo is the balance between the potential gain, the added complexity and the user interest. The more time we dedicate to performance improvement the less time we have to improve the usability of the library for instance. Since there are ongoing discussions regarding the future of the project I think it's something to keep in mind.

Finally, regarding this PoC specifically, before diving into SIMD there might be simple tricks that we should try first. In general, we should try simpler solutions first and increase complexity when they fail. Here, for instance, a manual unrolling might give similar results for a lot less added complexity (see this quick benchmark: Micky774#11 (comment)).
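For context, the manual unrolling being referred to is roughly the following (a sketch, not the benchmarked code): accumulating into independent variables breaks the loop-carried dependency chain, which lets the compiler and the CPU's out-of-order engine overlap the additions, and often lets the compiler auto-vectorize without any intrinsics.

```cpp
#include <cmath>
#include <cstddef>

// Sketch: 4-way manually unrolled Manhattan distance with
// independent accumulators.
double manhattan_unrolled(const double* x, const double* y, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += std::fabs(x[i] - y[i]);
        s1 += std::fabs(x[i + 1] - y[i + 1]);
        s2 += std::fabs(x[i + 2] - y[i + 2]);
        s3 += std::fabs(x[i + 3] - y[i + 3]);
    }
    for (; i < n; ++i) s0 += std::fabs(x[i] - y[i]);  // scalar tail
    return (s0 + s1) + (s2 + s3);
}
```

Note that, for floats, the reassociated summation order can produce slightly different rounding than a straight loop, which is one reason compilers don't do this transformation by default.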

@Micky774 (Contributor, Author)

Finally, regarding this PoC specifically, before diving into SIMD there might be simple tricks that we should try first. In general, we should try simpler solutions first and increase complexity when they fail. Here, for instance, a manual unrolling might give similar results for a lot less added complexity (see this quick benchmark: Micky774#11 (comment)).

Just wanted to update the conversation on this issue to match the developments in the PR: it seems the manual loop unrolling mainly benefits GCC-compiled code (or at least, the benefits are not observed when compiling with Clang), and it is independent of the SIMD optimization, so the two can be combined (cf. Micky774#11 (comment), Micky774#11 (comment)).

@Micky774 (Contributor, Author) commented Mar 29, 2023

I think Cython is still a good compromise and the best short- and medium-term solution for scikit-learn IMO. Yet the more we advance, the more we face limitations of Cython as a language.

I wanted to mention that you make wonderful points, and I feel like this is a conversation that will soon need to be had (even if just to affirm our use of Cython). Personally, I've been growing fond of the idea of writing in C/C++ directly and using Cython as a quick and easy way of binding to Python functions (e.g. my prototype SIMD implementation). SciPy does something similar -- writing in C++ directly and wrapping/binding it for Python -- so we could look more closely at their solutions. Note, however, that I do not have much experience with this strategy and can't properly advocate for it (yet).

@jjerphan (Member)

I think C/C++/SYCL implementations (cc @fcharras, who is working on GPU implementations for scikit-learn) with Python bindings (done with Cython, pybind11, or nanobind) are well suited.

Yet I think we first need to consider current performance bottlenecks, the complexity currently handled by Cython (e.g. how to compile and link against BLAS?), the number of people interested in such development, and ongoing work on performance (algorithm redesign, Array API support, Cython implementations, GPU implementations, etc.).

Currently, I do not know what people think about exploring such alternatives for scikit-learn.

@betatim (Member) commented Mar 30, 2023

I will wave the plugin flag once more :D I think plugins are an excellent way to get answers to all of the questions raised above, because anyone can write a plugin and use whatever technology they want to do it. Then we have some concrete code that can either be abandoned (we learnt something we didn't realise before implementation), maintained as a separate thing (maybe you want to iterate much faster than scikit-learn releases, serve a small niche of users, use exotic optimisation tricks, etc.), or integrated into the core of scikit-learn (if it proves beneficial and not too burdensome to maintain).

(Maybe the plugin stuff we are working on is not yet there in this fantastic land, but that is where we are trying to go. Or at least trying to reach this promised land is what gets me excited about plugins.)

@Micky774 (Contributor, Author) commented Apr 7, 2023

I think at this point I agree that there are better ways to handle this than natively supporting/requiring SSE3. In particular, what do people think about the following strategy:

  1. Update some metric arguments to accept DistanceMetric objects directly
  2. Provide separate SIMD-accelerated implementations of different distance metrics in perhaps a scikit-learn-contrib package
  3. Offer them as an alternative option in the documentation for those users looking to squeeze out more performance

@GaelVaroquaux (Member) commented Apr 10, 2023 via email

@jjerphan (Member) commented Apr 10, 2023

I also think developing plugins to experiment with keeping the UX of scikit-learn unchanged is the best approach.

Working on a plugin for SIMD-accelerated implementations of distance metrics might also help make progress in the plugin discussions (see #22438).

I started playing around a bit with xsimd, and it looks like a good candidate for such implementations. xsimd supports runtime dispatch to the best architecture, as well as dispatch to the best implementation based on memory (non-)alignment. Having a few C++ functions using SIMD (similar to the ones in xsimd's examples) backing Cython interfaces à la DistanceMetric should work and should be sufficient.

How does that sound to people? 🙂

Ping to @betatim, and @Micky774 who showed interest in previous comments. 🏓

@lorentzenchr (Member)

AFAIK, Apache Arrow C++ uses xsimd. More generally, can we ask some libraries about their experience with xsimd? A second point is that we would then officially allow more C++. I personally would be fine with this. C++ templates, for instance, seem way cleaner and more future-proof than Cython's Tempita templating. The point is that this should be a deliberate decision.

@jjerphan (Member) commented Apr 14, 2023

AFAIK, Apache Arrow C++ uses xsimd

Yes, and the developers of Firefox have also recently chosen to vendor xsimd and to use it for their web audio engine.

@jjerphan (Member)

I personally would be fine with this. C++ templates, for instance, seem way cleaner and more future-proof than Cython's Tempita templating. The point is that this should be a deliberate decision.

What do you think of first developing a plugin to isolate the C++ implementations from scikit-learn's codebase, coming up with a comparison of execution time and maintainability, and then deciding on a solution?

@lorentzenchr (Member)

Just for info: NumPy went with Google's Highway in numpy/numpy#24018.

@jjerphan (Member) commented Dec 3, 2023

IIRC, the (implicit?) consensus on the mailing list was to first assess the value of efficient non-Euclidean distance metric computations for end users.

9 participants