[$] Optimize NumPy SIMD algorithms for Power VSX #13393
The main problem here is a lack of expertise and hardware among the developers. Do you know anyone who could help?
This has already been discussed with Ralf Gommers and Julian Taylor, who suggested available developers. Of course, anyone is welcome to work on the issue and claim the financial bounty.
Can I work on it?
@seiko2plus Are you familiar with Power VSX, and do you have the appropriate hardware to work with? This is one of those situations where we need an expert, but if you feel up to it, go ahead.
@charris, Yes, I'm familiar with Power/VSX, I have access to Power8 and Power9 hardware, and I also have wide experience with other SIMD extensions.
@seiko2plus Great! Go for it.
@charris, Thanks!
A bit of context for this:
I would recommend that if @seiko2plus or someone else tries to tackle this right now, they post a summary here of what they plan to do, or send an early WIP PR. It would be unfortunate if someone spent a lot of time and then in review we said that the changes should have been done differently. Second recommendation: please give @juliantaylor a bit of time (a couple of days) to respond before starting work, @seiko2plus. EDIT: given that this is the first bounty for NumPy, I also posted a message to the numpy-discussion mailing list about this: https://mail.python.org/pipermail/numpy-discussion/2019-April/079371.html
@rgommers, Thank you for the clarification, and sure, I'm going to follow your recommendations.
I currently cannot do it, please consider other people.
A solution to this should provide proper build changes to support the various architectures; right now it is focused on SSE/AVX only. I believe that if it is going to move forward, it will require more than just cloning simd.inc and duplicating the code there for another architecture. Theoretically, the compiler vector built-ins should provide a default implementation, and the x86*/ARM*/PPC64[LE]-specific SIMD specializations should each be a separate implementation that is built according to availability (decided by compiler/build? not all builds are native) and/or selected at runtime via feature testing. I started looking into this when I got an email about it 2 days ago; however, now I see someone is working on it already.
@dmiller423 I don't think anybody is working on this yet; in fact, I think the scope/management has yet to be determined. So stay around.
Yes, that's why I'm working on a different solution that aims to support multiple SIMD extensions [SSE, AVX2, AVX512, VSX, NEON], provide better build options with flexible control over the CPU baseline and dispatched features, and support multiple compilers [Intel, GCC, Clang, MSVC].
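(To make the universal-intrinsics idea concrete, here is a minimal sketch of how a single API can map onto each architecture's native intrinsics at compile time; the `univ_*` names are invented for illustration and are not the eventual NumPy API.)

```c
/* Minimal sketch of a "universal intrinsic": one name, one semantic,
 * mapped to the native instruction set at compile time.
 * The univ_* names are illustrative placeholders only. */
#if defined(__AVX2__)
    #include <immintrin.h>
    typedef __m256 univ_f32;
    #define univ_add_f32(A, B) _mm256_add_ps(A, B)
#elif defined(__SSE2__)
    #include <emmintrin.h>
    typedef __m128 univ_f32;
    #define univ_add_f32(A, B) _mm_add_ps(A, B)
#elif defined(__VSX__)
    #include <altivec.h>
    typedef __vector float univ_f32;
    #define univ_add_f32(A, B) vec_add(A, B)
#elif defined(__ARM_NEON)
    #include <arm_neon.h>
    typedef float32x4_t univ_f32;
    #define univ_add_f32(A, B) vaddq_f32(A, B)
#endif
```

A real layer of course needs far more than add: loads/stores, comparisons, shuffles, and masked operations are where most of the work lies.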
@seiko2plus Note that NumPy has its own templating machinery. You can see examples in the `*.c.src` files.
@seiko2plus so you're planning on making OpenCV a dependency?
@charris, thanks for pointing this out; sure, I will dive into NumPy's infrastructure, but my main focus for now is implementing the new API.
@dmiller423, definitely not, just inspired by OpenCV's HAL.
@dmiller423 I read the comment as "create a macro processing framework like OpenCV's", not "depend on OpenCV". We also create C code from Python, for instance.
Exactly. That kind of approach is not maintainable long-term.
Multiple compilers supported with the same code sounds good. This is also something @oleksandr-pavlyk mentioned; the Intel compiler currently doesn't work with the SIMD code in this repo, so Intel is making some changes to numpy when compiling with their compiler.
I'll also summarize some comments made by @juliantaylor on this recently: runtime-detected, compiler-generated code for einsum would be valuable. Adding support for the boolean operators should be relatively simple too. But as compilers cannot vectorize most of that for us, it might mean more code (or more templating if the intrinsics interfaces match the SSE ones).
Okay, now to who does this:
For the most part I've only looked into how the pieces fit together and some initial changes that would be required to modify the build system and templates. I stopped once I saw you wanted to wait for reviewers and a full game plan, as this is the best course. Duplicating work or racing to see who can whip up a quick implementation is a recipe for disaster long term. I'm open to implementing it; there is still some question as to the actual details of that implementation. Some thoughts: getting a nice implementation written for all compilers is going to be tough using intrinsics... It will require ifdef guards to check for compilers etc. I generally tend to use assembly if such a thing is a requirement, since it allows sticking to a single code base per architecture. Also, a default vectorized C implementation using extensions would work on both gcc and clang (possibly with some details changed for clang on ppc64le especially); I'm not sure if/how compilers such as icc or realview implement these, as I have not used them recently. It would require considerable R&D to properly decide details.
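(For reference, the GCC/Clang vector-extension fallback mentioned above could look like the following sketch; the `vf32x4` and `vadd_f32` names are made up for the example.)

```c
/* GCC/Clang generic vector extensions: the same C source compiles to
 * SSE/AVX on x86, VSX on POWER, and NEON on ARM, with no intrinsics. */
typedef float vf32x4 __attribute__((vector_size(16)));

static inline vf32x4 vadd_f32(vf32x4 a, vf32x4 b)
{
    return a + b;  /* element-wise add; lowered to a native SIMD add */
}
```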
Not applying for the bounty, but I'm currently working on an FFT library that may become useful for numpy at some point (see the recent comments of #11885). If someone can point me to documentation on which macros I need to check on PowerPC to test for VSX availability (i.e. the equivalents of x86's `__SSE2__`/`__AVX__`), I'd appreciate it.
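(As far as I know, the usual compile-time checks on GCC and Clang are the predefined macros in the sketch below; treat it as a starting point rather than authoritative documentation.)

```c
/* PowerPC counterparts of x86's __SSE2__/__AVX__ compile-time checks,
 * defined by GCC and Clang when the matching -m flags are in effect. */
#if defined(__VSX__)            /* -mvsx; implied by -mcpu=power7 and up */
    #include <altivec.h>
    /* VSX intrinsics available (including double precision) */
#elif defined(__ALTIVEC__)      /* -maltivec; classic 128-bit VMX only */
    #include <altivec.h>
#endif

#if defined(__POWER8_VECTOR__)  /* -mcpu=power8; adds e.g. 64-bit int ops */
    /* POWER8 vector extensions available */
#endif
```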
My thoughts on the matter: while I have a strong interest in it, please note I cannot promise to actually be able to review the changes in a timely manner, if at all. The first thing that needs to be figured out is how to compile different functions with different compiler flags within our distutils-based build infrastructure. The way to do it has influence on how the code layout could be structured. The current code uses gcc target and optimization attributes to compile specific functions for different targets, and these functions are only called on appropriate hardware (see the sketch after this comment). It may also be possible to just compile single files with different flags and have a bunch of them. For the actual implementations there are three categories:
Some general notes: do not mess with aligning memory unless it is necessary or it has reproducible performance benefits. Also keep in mind we have a templating function which should cover a fair share of cases.
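(A minimal illustration of the gcc target-attribute approach @juliantaylor describes, using x86 for concreteness; the `add_f32*` names are invented for the example, and the runtime check uses GCC's `__builtin_cpu_supports`.)

```c
/* One file, per-function ISA flags: GCC's target attribute lets a single
 * translation unit contain an AVX2 loop and a generic fallback. */
__attribute__((target("avx2")))
static void add_f32_avx2(float *dst, const float *a, const float *b, long n)
{
    for (long i = 0; i < n; i++)   /* auto-vectorized with AVX2 enabled */
        dst[i] = a[i] + b[i];
}

static void add_f32_scalar(float *dst, const float *a, const float *b, long n)
{
    for (long i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

void add_f32(float *dst, const float *a, const float *b, long n)
{
    if (__builtin_cpu_supports("avx2"))   /* runtime dispatch */
        add_f32_avx2(dst, a, b, n);
    else
        add_f32_scalar(dst, a, b, n);
}
```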
Since einsum has come up, let's bring @dgasmith in. He maintains an optimized einsum branch with different backends and is probably the current expert on that module.
@charris Thanks for pinging me on this. It seems that most topics are well covered and I don't see too much to add here. As a note, something that may be simpler is to autogenerate the Python loops and have numba/numexpr compile them on the fly.
That's not feasible, we can't accept either of those as a dependency for NumPy. Perhaps in a couple of years for Numba ....
Thanks for the thoughts though @dgasmith!
@rgommers, sorry for the late reply.
Here's my current road map:
And since I'm going to support other CPU architectures, and also because IBM has lately been very generous with me, I'm not going to claim this bounty.
Thanks @seiko2plus. Okay, it seems you're picking this up :)
That sounds like a good idea.
If this is for building wheels, then yes, that sounds fine. Note that we would prefer never to have to specify this for regular builds that are meant for use on the same system. That's not necessary today, so with improved runtime detection of CPU features it should definitely not be necessary. Also keep in mind this comment by @juliantaylor above: "The first thing that needs to be figured out is how to compile different functions with different compiler flags within our distutils based build infrastructure. The way to do it has influence on how the code layout could be structured."
That's up to you of course. We wouldn't complain if you donated it to the project instead :) Or to a charity of your choice. If IBM gets what it wants, I'm sure they'd be happy to pay .....
This does sound like a logical strategy.
Is anyone working on it? Can I pick it up?
Hi @barkovv, thanks for your interest. Actually, @seiko2plus is working on this and has two open PRs:
Work has slowed down, but I think it is reasonably far along. So I'd rather let @seiko2plus comment here; I suggest not starting over on this.
Thanks @seiko2plus. Hope you're doing well. Yes, we'll give you some time; please update us if the end of this month is also not feasible.
@rgommers @seiko2plus
PPC64LE Linux has a set of SSE/SSSE3 emulation intrinsics headers. An IBM team built NumPy with the headers and saw a 15% performance improvement on benchmarks that stressed the SSE intrinsics code path, which matched the improvement for x86. This functional hack confirms the benefit of SIMD for Power. x86 derives more benefit from AVX intrinsics, for which emulation has not been written for Power. The SIMD infrastructure should address that.
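(For the curious, this is roughly what using those emulation headers looks like; it assumes a recent GCC on ppc64le, which ships the x86 intrinsics compatibility headers.)

```c
/* On ppc64le, recent GCC provides the x86 intrinsics headers emulated on
 * top of VSX. Defining NO_WARN_X86_INTRINSICS silences the advisory
 * warning; existing SSE code then compiles unchanged. */
#define NO_WARN_X86_INTRINSICS
#include <emmintrin.h>              /* SSE2 API, backed by VSX */

static __m128 add4(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);        /* compiles to a VSX vector add */
}
```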
Thanks @edelsohn, makes sense that there's a benefit but still good to see it confirmed.
I think it does already, assuming you mean "AVX equivalent instructions as universal intrinsics".
By SIMD infrastructure, I mean the universal intrinsics. IBM observed that more of NumPy SIMD seems to utilize AVX/AVX2, for which there currently is no Power emulation. When NumPy is converted to universal intrinsics and the universal intrinsics are implemented for Power, then the AVX-equivalent benefit in NumPy will be exposed for Power as well.
Did anyone already try to use generic C/C++ code and optimize it using OpenMP? In tests with Tesseract, such code was comparable to hand-optimized code for AVX2, and the same source code worked for ARM NEON and PowerPC Altivec, too.
@stweil can you explain a bit more what you mean? SIMD instructions give faster single-threaded code, and when you say OpenMP I'm thinking about parallelism.
OpenMP gained support for vectorization in addition to parallelization some time ago.
Yes, OpenMP has different compiler pragmas for parallelism (not intended here) and for SIMD (that's what is desired here). C code which uses these pragmas, and which is compiled with the right compiler and compiler flags (optimization, CPU), automatically uses the SIMD machine instructions.
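(A minimal example of the OpenMP SIMD construct being discussed; `add_f32` is an invented name, and on GCC/Clang the `-fopenmp-simd` flag honors the pragma without linking the OpenMP runtime.)

```c
#include <stddef.h>

/* OpenMP 4.0 "simd" construct: a vectorization hint, no threading.
 * Build with -fopenmp-simd (GCC/Clang) to enable just the SIMD pragmas. */
void add_f32(float *restrict dst, const float *restrict a,
             const float *restrict b, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```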
However, I think (though I'm not sure) that OpenMP SIMD will only cover a small subset of what universal intrinsics can do. For example, I don't see how things like shuffling, masked operations, horizontal addition etc. could be achieved with OpenMP pragmas.
@stweil, see numpy/numpy/core/src/umath/fast_loop_macros.h, lines 106 to 135 at de3fcf1.
I think we can close this issue and declare the bounty complete. The original bounty says:
We have implemented the infrastructure needed to enable the porting of loops via Universal SIMD, and even ported some of the loops, so the enablement is complete.
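(For the record, a loop written against the universal intrinsics looks roughly like the simplified sketch below; it is condensed from the pattern the NumPy source uses, with the dispatch machinery omitted.)

```c
#include "simd/simd.h"   /* NumPy-internal universal intrinsics header */

#if NPY_SIMD             /* zero when no SIMD target is available */
void add_f32(float *dst, const float *a, const float *b, npy_intp n)
{
    const int vstep = npyv_nlanes_f32;        /* lanes per vector register */
    npy_intp i = 0;
    for (; i + vstep <= n; i += vstep) {      /* same source compiles to  */
        npyv_f32 va = npyv_load_f32(a + i);   /* SSE/AVX, VSX, or NEON    */
        npyv_f32 vb = npyv_load_f32(b + i);
        npyv_store_f32(dst + i, npyv_add_f32(va, vb));
    }
    for (; i < n; i++)                        /* scalar tail */
        dst[i] = a[i] + b[i];
}
#endif
```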
I completely agree. Sayed and the NumPy core developers have done an amazing job! Do you want me to close the issue? I have been deferring to the NumPy developers.
Thanks @edelsohn. Agreed that the solution exceeded expectations. I will close it then.
Awesome, thanks everyone for the excellent work!
Thank you for everything <3
NumPy contains SIMD-vectorized code for x86 SSE and AVX. This issue is a feature request to implement equivalent native support for Power VSX, achieving the speedup appropriate for VSX's SIMD vector width (128 bits).
EDIT (by @rgommers): link to bounty: https://www.bountysource.com/issues/73221262-optimize-numpy-simd-algorithms-for-power-vsx
The focus is PPC64LE Linux. If the optimization can be made portable to AIX (big endian), that's great, but it is not a strict requirement. In other words, if AIX continues to use the scalar code for now, that's okay.