8000 DRAFT ENH Introduce `AVX` SIMD instructions for `ManhattanDistances.dist` by Micky774 · Pull Request #27145 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

DRAFT ENH Introduce AVX SIMD instructions for ManhattanDistances.dist #27145

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

Micky774
Copy link
Contributor
@Micky774 Micky774 commented Aug 23, 2023

Reference Issues/PRs

A follow-up to #26010

What does this implement/fix? Explain your changes.

This PR demonstrates the minimal infrastructure required for proper building and runtime dispatch of AVX SIMD instructions. This means that scikit-learn can optionally be built with AVX instructions. If, at runtime, AVX support is found on the runtime machine and scikit-learn was built with AVX support, then the optimized AVX instructions are leveraged when computing distances. Any other case results in the usual scalar loops currently found on main.

Concrete changes

  • Introduction of _simd/ directory housing new files
  • Introduction of simd.hpp which provides a definition and extern declarations of templated Manhattan distance functions (xsimd_manhattan_dist)
  • Introduction of simd.cpp which provides externed declarations of xsimd_manhattan_dist. This file serves as a translation unit to target with -mavx feature flags.
  • Compiles the above into a composite avx_dist_metrics library which is linked against to isolate the rest of _dist_metrics from requiring additional feature flags (preventing auto-vectorization of unintended code, e.g. backup scalar loops)
  • Provides templated scalar loops in C++ which match our current Cython implementaiton
  • Adds WITH_SIMD macro to correctly compile SIMD optimized code, or default to a trivial implementation to satisfy compilation
  • Adds DIST_METRICS macro to correctly allow inclusion of _dist_optim.cpp in _dist_metrics.pyx but not in other Cython files (necessary since any cimport of _dist_metrics.pyx forces the inclusion of _dist_optim.cpp)
  • Adds build and runtime dependency on archspec to provide out-of-the-box support Uses xsimd directly for runtime microarchitecture feature detection (to determine whether building with AVX is possible, and whether running with AVX is possible) via _runtime_check.pyx and by compiling runtime_check_program in setup.py for a quick build-time check.
  • Includes xsimd as a submodule (~2MB but could be reduced by half if we only include the include directory)
  • Changes self.{r}dist(...) calls to <Class>.rdist_csr(self, ...) to work around Cython's idiosyncrasies when compiling to C++

TODO:

  • Resolve whether to add a dependency on archspec or vendor it Uses xsimd directly
  • Resolve whether to add xsimd as a submodule, vendor an instance, or only include as build-time dependency for wheels and conda-forge

Any other comments?

AVX instructions were deployed in late 2011 and 2012 and thus are likely found in a significant portion of our users' machines. Honestly, we may want to consider going to AVX2 directly since it was released only a couple years later...

I don't know of any comprehensive surveys in the ML field which indicate the presence of these features on work machines and laptops.

Initial benchmarks generated with this script (exported from a Jupyter notebook).

Note

PR_noruntime is the case where the library is built with AVX support, but the runtime check fails to detect AVX capabilities. PR_nobuild is built without AVX capabilities.
PR is built with AVX capabilities, and uses them at runtime.

There are no slowdowns compared to main, and when built with AVX support, and it is detected at runtime, we see minor speedups even in cases of small n_features, to drastic speedups (4-7x) when n_features is large.

Plots

e089e9cc-a6c4-4b62-a6ca-0d94e2babcaa

ad00c343-0c0f-4438-8ea5-b5a98d1660ac

@github-actions
Copy link
github-actions bot commented Aug 23, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: fd481b3. Link to the linter CI: here

@adrinjalali
Copy link
8000
Member

Thinks that came to my mind seeing this:

  • are we introducing C++? We don't really have any, except the ones we've vendored, so this has no precedence and I would rather intorduce rust than C++ in the project as a new-ish language.
  • definitely don't want to introduce a submodule, they're hard to work with, and confuse [almost] everyone. It would make it significantly harder to contribute to the project compared to what we have now.

Otherwise I think this is pretty cool.

@Micky774
Copy link
Contributor Author
Micky774 commented Aug 24, 2023
  • are we introducing C++? We don't really have any, except the ones we've vendored, so this has no precedence and I would rather intorduce rust than C++ in the project as a new-ish language.

That's something I'm hoping we can discuss a bit more. I'm unfamiliar with Rust, and although I know it is picking up popularity pretty quickly, afaik it does make it harder to work on relatively complex data structures (the antagonistic example is doubly-linked-lists) but I don't honestly know to what degree. My (comparative) familiarity with C++biases me. I'd love for someone more familiar with Rust to educate me in this regard :)

I do think that there is potentially a lot of room to improve code readability by leveraging C++ templating and binding/importing via Cython. This is actually a line-item for discussion in the new native performance discussion group (see notes) although has not been broadly discussed yet. IMO it is easier to slowly-but-safely transition from pure Cython/C to some C++ in key areas than to introduce Rust, but I've not looked at any example implementations so I can still be swayed.

I do also believe that the ability to include a relatively simple static dispatch to a widely supported SIMD target, propagating performance gains across a potentially wide swathe of the library is itself a strong merit for C++ inclusion (even if not widespread adoption).

  • definitely don't want to introduce a submodule, they're hard to work with, and confuse [almost] everyone. It would make it significantly harder to contribute to the project compared to what we have now.

Right, that seems to be a common opinion. I have it added as a submodule right now for (my) convenience but it is simple to vendor it as a header-only library (they're under BSD-3 so licensing is no issue either). I don't have a strong preference one way or the other so I think we can navigate around that.

Edit: It's also not necessary at runtime at all, so we could try to specify it as a build-time requirement for users if they want SIMD acceleration and setup CI solutions for wheel building. I'm not sure what the best way to do that is, since (as usual) windows makes things tricky

@adrinjalali
Copy link
Member

IMO it is easier to slowly-but-safely transition from pure Cython/C to some C++ in key areas than to introduce Rust, but I've not looked at any example implementations so I can still be swayed.

I would be very skeptical of that. C++ has become a monster. I would agree with you maybe 20 years ago though. But I've done enough C++ that I really wouldn't want us to maintain C++ code for multiple platforms.

I do also believe that the ability to include a relatively simple static dispatch to a widely supported SIMD target, propagating performance gains across a potentially wide swathe of the library is itself a strong merit for C++ inclusion (even if not widespread adoption).

I don't mind vendoring some C++ library at all, as long as we don't have to maintain it, and it doesn't give us much headache to support multiple platforms. And as long as the only thing we have to do is to add a very light binding layer which could also be auto-generated, then we should be fine.

@jjerphan
Copy link
Member

As discussed on the last monthly meeting (though it is not reported in its notes), the introduction of C++ code (be it done via this PR or another one) necessitates considering several technical aspects (packaging consideration, linkage and build system, depending on third party library or vendoring it, etc.) and several social aspects (when do we want to use C++ over existing and used solutions like Cython, who gets to maintain C++ implementations for those part of the project to have an healthy bus-factor, what are the conventions around C++, etc.) of scikit-learn.

We decided that such choices needs to be discussed and voted in favor or in disfavor via a SLEP (I invoke @lorentzenchr for confirmation).

IIRC, @Micky774 (let me know if I am wrong) proposed to author one; I also propose helping with writing it.

@lorentzenchr
Copy link
Member

Yes, let’s discuss in a slep which has as its goal the usage of simd instructions. Rust, for instance, would then be one alternative.
@Micky774 It does not need to be perfect to start a discussion.

Let me add here that we hit the boundaries of Cython several times, e.g. usage of tempita templates.

@jjerphan
Copy link
Member
jjerphan commented Sep 8, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0