[$] Optimize NumPy SIMD algorithms for Power VSX #13393

Closed
edelsohn opened this issue Apr 23, 2019 · 46 comments

@edelsohn commented Apr 23, 2019

NumPy contains SIMD vectorized code for x86 SSE and AVX. This issue is a feature request to implement equivalent native enablement for Power VSX, achieving a speedup appropriate to VSX's 128-bit SIMD vector width.

EDIT (by @rgommers): link to bounty: https://www.bountysource.com/issues/73221262-optimize-numpy-simd-algorithms-for-power-vsx

The focus is PPC64LE Linux. If the optimizations can be made portable to AIX (big endian), that's great, but it is not a strict requirement. In other words, if AIX continues to use the scalar code for now, that's okay.
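
For illustration, a minimal sketch (not NumPy code) of what such enablement looks like: the same four-floats-at-a-time addition that NumPy's SSE loops express with _mm_add_ps, written with the GCC/Clang Power vector intrinsics, assuming compilation with -mvsx:

#include <altivec.h>

/* Add two float arrays four lanes (128 bits) at a time, with a scalar
 * tail for the remainder. vec_xl/vec_xst are unaligned VSX load/store. */
void add_f32_vsx(const float *a, const float *b, float *out, long n)
{
    long i = 0;
    for (; i <= n - 4; i += 4) {
        __vector float va = vec_xl(0, a + i);
        __vector float vb = vec_xl(0, b + i);
        vec_xst(vec_add(va, vb), 0, out + i);
    }
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}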

@charris (Member) commented Apr 23, 2019

The main problem here is lack of expertise and hardware among the developers. Do you know anyone who could help?

@edelsohn (Author)

This has already been discussed with Ralf Gommers and Julian Taylor, who suggested available developers. Of course, anyone is welcome to work on the issue and claim the financial bounty.

@seiko2plus (Member)

Can I work on it?

@charris (Member) commented Apr 23, 2019

@seiko2plus Are you familiar with Power VSX, and do you have the appropriate hardware to work on it? This is one of those situations where we need an expert, but if you feel up to it, go ahead.

@seiko2plus (Member)

@charris, yes, I'm familiar with Power/VSX, I have access to Power8 and Power9 hardware, and I also have wide experience with other SIMD extensions.

@seiko2plus (Member)

@charris, thanks!

@rgommers (Member) commented Apr 23, 2019

A bit of context for this:

  • This is the first time (as far as I know) that anyone has offered a bounty for implementing a feature in NumPy. It's a bit of an experiment for us as a project.
  • We have pre-discussed this on the steering committee mailing list. @juliantaylor has done a lot of work on this before, so if he wants to tackle this that would be good.
  • Keep in mind that it's not just about implementing - the feature has to be merged as well. For one, there's reviewer effort (usually in short supply; if part of the bounty went to the reviewer or to the project, that would probably help ....). Also, a requirement for merging should be that VSX support does not substantially increase the maintenance burden (this is possible).
  • Keep in mind also that there are open PRs on AVX support. Please check those, and build on them or discuss in case of possible conflicts. E.g. AVX for floats? #11113 is relevant, and ENH: Use AVX for float32 implementation of np.sin & np.cos #13368 is active right now.
  • We have powerpc CI on TravisCI, which is good. The PR should ensure that VSX is exercised there. Building with GCC on AIX systems should also work. EDIT: clarified by @edelsohn in the description that AIX isn't a hard requirement.

I would recommend that if @seiko2plus or someone else tries to tackle this right now, they post a summary here of what they plan to do, or send an early WIP PR. It would be unfortunate if someone spends a lot of time and then in review we say that the changes should have been done differently.

Second recommendation: please give @juliantaylor a bit of time (a couple of days) to respond before starting work, @seiko2plus.

EDIT: given that this is the first bounty for NumPy, I also posted a message to the numpy-discussion mailing list about this, https://mail.python.org/pipermail/numpy-discussion/2019-April/079371.html

@seiko2plus (Member)

@rgommers, thank you for the clarification; I will certainly follow your recommendations.

@juliantaylor (Contributor)

I currently cannot do it, please consider other people.

@dmiller423

A solution to this should provide proper build changes to support the various architectures; right now the build is focused on SSE/AVX only. I believe that if this is going to move forward, it will require more than just cloning simd.inc and duplicating the code there for another architecture. Theoretically the compiler's vector built-ins should have a default implementation, with specializations for x86*/ARM*/PPC64[LE]-specific SIMD, each a separate implementation that is built according to availability (decided by the compiler/build? not all builds are native) and/or selected at runtime via feature testing. I started looking into this when I got an email about it two days ago, but now I see someone is working on it already.

@charris (Member) commented Apr 25, 2019

@dmiller423 I don't think anybody is working on this yet, in fact, I think the scope/management has yet to be determined. So stay around.

@seiko2plus (Member) commented Apr 25, 2019

@dmiller423, @charris,

it will require more than just cloning simd.inc and duplicating the code there for another architecture

Yes, that's why I'm working on a different solution that aims to support multiple SIMD extensions [SSE, AVX2, AVX512, VSX, NEON] in a friendly way, and to provide better build options with flexible control over the CPU baseline and dispatched features, while also supporting multiple compilers [Intel, GCC, Clang, MSVC].
The solution is based on OpenCV's wide universal intrinsics, but instead of using C++ templates it will use Python to generate C code at build time. I will release a WIP pull request after I finish the core.

@charris (Member) commented Apr 25, 2019

@seiko2plus Note that NumPy has its own templating machinery. You can see examples in the *.c.src files; the script that processes them is numpy/distutils/conv_template.py.
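
For reference, a small illustration of that templating syntax (an illustrative sketch, not an actual NumPy loop): conv_template.py repeats the block once per entry in the named lists, substituting @TYPE@ and @type@:

/**begin repeat
 * #TYPE = FLOAT, DOUBLE#
 * #type = npy_float, npy_double#
 */
/* expands to FLOAT_add_example and DOUBLE_add_example */
static void
@TYPE@_add_example(@type@ *out, const @type@ *a, const @type@ *b, npy_intp n)
{
    for (npy_intp i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
/**end repeat**/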

@dmiller423

@seiko2plus So you're planning on making OpenCV a dependency?

@seiko2plus (Member)

@charris, thanks for pointing this out; sure, I will dive into NumPy's infrastructure, but my main focus for now is implementing the new API.

So you're planning on making OpenCV a dependency?

@dmiller423, definitely not, just inspired by OpenCV's HAL.

@mattip (Member) commented Apr 25, 2019

@dmiller423 I read the comment as "create a macro processing framework like OpenCV", not "depend on OpenCV".

We also generate C code from Python; for instance, __umath_generated.c is generated by make_code in numpy/core/code_generators/generate_umath.py. That approach seems more appropriate when creating code more complicated than the templating system cleanly supports, or when templating is otherwise not a good fit.

@rgommers (Member)

A solution to this should provide proper build changes to support the various architectures; right now the build is focused on SSE/AVX only. I believe that if this is going to move forward, it will require more than just cloning simd.inc and duplicating the code there for another architecture.

Exactly. That kind of approach is not maintainable long-term.

Yes, that's why I'm working on a different solution that aims to support multiple SIMD extensions [SSE, AVX2, AVX512, VSX, NEON] in a friendly way, and to provide better build options with flexible control over the CPU baseline and dispatched features, while also supporting multiple compilers [Intel, GCC, Clang, MSVC].

Multiple compilers supported with the same code sounds good. This is also something @oleksandr-pavlyk mentioned; the Intel compiler currently doesn't work with the SIMD code in this repo, so Intel is making some changes to numpy when compiling with icc. (@oleksandr-pavlyk there's probably more to that story than my simplistic one-sentence summary, feel free to weigh in with ideas/wishes.)

I'll also summarize some comments made by @juliantaylor on this recently:
"""
gh-11113 switches our ufunc float SSE code to generated, runtime-detected SSE/AVX; adding PPC should just be a matter of adding another flag to our templating, with the rest done by GCC. Our integer code is already completely compiler-generated. For compiler-agnostic code it could be more difficult to avoid nearly duplicated code.

Runtime-detected, compiler-generated code for einsum is a little bit harder due to the lack of existing infrastructure, but likely feasible (adding AVX support there at the same time might actually be worthwhile).

Adding support for the boolean operators should be relatively simple too. But as compilers cannot vectorize most of that for us, it might mean more code (or more templating, if the intrinsics interfaces match the SSE ones).
"""

Okay, now to who does this:

  • @juliantaylor isn't available, and the rest of the core devs have not spoken up about wanting to do this. So it's up for grabs.
  • We have two interested people here that both seem to understand pretty well what is needed: @seiko2plus and @dmiller423. @seiko2plus showed up here slightly earlier, however @dmiller423 seems to have started already as well.
  • Unfortunately bounties don't really come with a "social model". I would like to avoid a situation where two people do lots of good work in parallel and in the end we have to choose which PR to prefer.
  • Ideally I'd like to see @seiko2plus and @dmiller423 talk to each other, and decide who is going to do this between themselves or even work together and split the bounty.
  • If that doesn't work out and you both want to do it, I propose that you both put up a "work plan" with an outline of how you want to approach changes to numpy. At that point the numpy core team can make an informed choice before anyone has done all the work.

@dmiller423 commented Apr 26, 2019

For the most part I've only looked into how the pieces fit together and some initial changes that would be required to modify the build system and templates. I stopped once I saw you wanted to wait for reviewers and a full game plan, as this is the best course. Duplicating work, or racing to see who can whip up a quick implementation, is a recipe for disaster long term.

I'm open to implementing it, though there is still some question as to the actual details of that implementation.
I also don't mind working with someone and splitting the bounty; I will let you (the maintainers) decide the best course of action.

Some thoughts: getting a nice implementation written for all compilers is going to be tough using intrinsics... It will require ifdef guards to check for compilers, etc. I generally tend to use assembly if such a thing is a requirement, since it allows sticking to a single code base per architecture. Also, a default vectorized C implementation using compiler extensions would work on both GCC and Clang (possibly with some details changed for Clang, on ppc64le especially); I'm not sure if/how compilers such as icc or RealView implement these, as I have not used them recently. It would require considerable R&D to properly decide the details.

@mreineck (Contributor) commented Apr 26, 2019

Not applying for the bounty, but I'm currently working on an FFT library that may become useful for numpy at some point (see the recent comments of #11885).

If someone can point me to documentation on which macros I need to check on PowerPC to test for VSX availability (i.e. the equivalents of __AVX__, __SSE2__ etc. on x86_64), it should be trivial to add support for the instructions.
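
For reference, on GCC and Clang the analogous predefined macros are __ALTIVEC__ and __VSX__, set when the corresponding extension is enabled (e.g. via -mvsx or -mcpu=power8); a sketch, worth double-checking against the compiler documentation:

#if defined(__VSX__)
    /* VSX (128-bit) path -- rough equivalent of __SSE2__/__AVX__ on x86_64 */
#elif defined(__ALTIVEC__)
    /* AltiVec-only path (older, pre-VSX) */
#else
    /* scalar fallback */
#endif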

@juliantaylor (Contributor) commented Apr 26, 2019

My thoughts on the matter: while I have a strong interest in it, please note that I cannot promise to review the changes in a timely manner, if at all.

The first thing that needs to be figured out is how to compile different functions with different compiler flags within our distutils-based build infrastructure. The way this is done influences how the code layout can be structured.

The current code uses GCC target and optimization attributes to compile specific functions for different targets, and these functions are only called on appropriate hardware.
This approach was chosen because it was quick and easy, but at least at the time GCC was the only free compiler capable of doing that, so for example clang never got autovectorized code. (There were plans to add that feature to clang too, but I have not kept up to date on its progress.)
To remove this limitation, the typical approach is to put the hardware-specific code into its own files and compile those with different optimization flags, achieving the same effect at the object level instead of the function level. This is not trivially possible with Python's distutils, or at least at the time I was not able to get it to work in a reasonable manner.
If it is still not easy, the best approach would be to do something similar to our npymath internal static library: move the SIMD code into its own static library where one can more easily configure different compiler flags, e.g. one library per architecture, statically linked into the umath library and used by the already-existing ufunc dispatching.
With this approach it needs to be decided what exactly goes into this library. Most of our existing vectorized code is very simple and follows the same pattern, a loop structure and a loop body, differing only in vector size and unrolling. To keep the architecture-specific parts minimal, the library should only contain the bodies and not the other overhead, but at least with the library approach this places an additional requirement of link-time optimization on the compiler (GCC can do it just fine, but it may increase the version requirement to get good performance).

If it is possible to just compile single files with different flags and have a bunch of <type>_<isa>.c type files, this problem would disappear, so it is definitely worth looking into wrangling distutils again.
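
A sketch of that object-level layout (file, function, and probe names here are illustrative, not existing NumPy APIs): each <type>_<isa>.c file is compiled with that ISA's flags, and a plain dispatcher, compiled with baseline flags only, picks an implementation at runtime:

/* add_f32_dispatch.c -- compiled with baseline flags only.
 * add_f32_avx2.c (built with -mavx2) and add_f32_vsx.c (built with
 * -mvsx) each define one of the functions declared below. */
typedef void (*add_f32_fn)(const float *, const float *, float *, long);

void add_f32_scalar(const float *, const float *, float *, long);
void add_f32_avx2(const float *, const float *, float *, long);
void add_f32_vsx(const float *, const float *, float *, long);

int cpu_supports_avx2(void);   /* hypothetical runtime probes */
int cpu_supports_vsx(void);

add_f32_fn resolve_add_f32(void)
{
#if defined(__x86_64__)
    if (cpu_supports_avx2()) {
        return add_f32_avx2;
    }
#elif defined(__powerpc64__)
    if (cpu_supports_vsx()) {
        return add_f32_vsx;
    }
#endif
    return add_f32_scalar;
}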

For the actual implementations there are three categories:

  • the simple stuff: basic float and integer math, shape-maintaining operations (e.g. array + array).
    We already have loop macros that allow compilers like GCC to autovectorize these very well (e.g. BINARY_LOOP_FAST; as an example, see the integer basic math functions, which all use this approach and have AVX2 runtime-dispatched variants).
    For powerpc the compiler will most likely be equally capable, and it is just a matter of having an appropriate powerpc ISA loop, as we do for x86 (where for integers we have generic and AVX2 variants), and adding an ISA identifier to numpy/core/code_generators/generate_umath.py (currently you just dump it there and have preprocessor macros decide whether it is actually used).

  • functions where handwritten intrinsics outperform the GCC compiler: these include some boolean loops, floating point comparisons (I think I opened a bug report for that, but as far as I know it is not solved yet) and maybe also some reductions (e.g. max/min); a sketch of such a reduction follows this list.
    For these functions it needs to be determined whether the compiler is good enough (a performance loss compared to handwritten code is fine for new powerpc code as long as it is faster than the scalar code, though in that case the x86 handwritten code should stay), and if not, a way needs to be found to avoid duplicating it.
    In many cases this might be possible by templating the intrinsics in similar ways to what we do now, which was intended to simplify extending from SSE to AVX. Though in retrospect much of that templating turned out to be a bad idea and harmed readability, mostly because either AVX did not outperform SSE at all (we are mostly memory-bandwidth bound in numpy) or because the templating was not sufficient to actually be usable with AVX.
    Also keep in mind that for the boolean code you might run into endian differences between powerpc64le and the big endian variants; as far as I know the bounty is only for powerpc64le, but it shouldn't break on the big endian variants.
    So in summary, it needs a case-by-case look at what the best approach is, and as most of the code is very basic, I think we can also accept some duplication.

  • einsum
    The einsum code also has SSE2 code and is even older than the vectorized ufunc code. It has no support for AVX and thus also no runtime dispatching infrastructure.
    It would be nice to have this corrected as well while adding powerpc vector support, but that might be quite a bit of work, as it is a lot of highly optimized code with lots of templating and specializations for different loop sizes. In particular, performance should not be lost for summations involving small loops, where function call overhead is significant.
    It is difficult for me to estimate how much work it is, but as most loops are just simple dot products, one could maybe replace a lot of it with compiler generation. So a feasible approach could be to replace all the dot products, pulling the whole lot of computation functions out into a template and adding a copy for each ISA.
    Then add runtime dispatching inside the function itself (as opposed to at program start, as for the current ufunc dispatching). As the entry point holds the GIL, you can avoid determining the runtime ISA multiple times.
    As this is not part of the bounty, AVX support does not need to be added here, but we would require the ability to do so ourselves without major changes for updates here to be acceptable.
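
The reduction sketch referenced in the second bullet: a handwritten VSX max over floats, the kind of loop that has historically beaten autovectorizers (illustrative only; assumes n >= 4 and compilation with -mvsx):

#include <altivec.h>

static float max_f32_vsx(const float *a, long n)
{
    /* keep a running 4-lane maximum, then reduce the lanes at the end */
    __vector float vmax = vec_xl(0, a);
    long i = 4;
    for (; i <= n - 4; i += 4) {
        vmax = vec_max(vmax, vec_xl(0, a + i));
    }
    float lanes[4];
    vec_xst(vmax, 0, lanes);
    float m = lanes[0];
    for (int k = 1; k < 4; k++) {
        if (lanes[k] > m) {
            m = lanes[k];
        }
    }
    for (; i < n; i++) {            /* scalar tail */
        if (a[i] > m) {
            m = a[i];
        }
    }
    return m;
}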

Some general notes:
Please do not write assembly code when intrinsics work: intrinsics are generally more readable, easier to understand for people familiar with C but unfamiliar with assembler, and they allow the compiler to optimize your code to fit the hardware, since it has full freedom over parameters like register allocation.

Do not mess with aligning memory unless it is necessary or has reproducible performance benefits.
E.g. current x86-type hardware performs the same for (vector-sized) aligned and unaligned data, at least in NumPy's context.
Much of the SSE2 code does deal with alignment, but that was written for much older hardware which is no longer that relevant. For the new ppc code we gladly accept small performance costs on old hardware (say, over 4 years old) if it makes the code simpler.

Also keep in mind we have a templating facility which should cover a fair share of cases; see numpy/distutils/conv_template.py. If it does not, extending it should be the preferred option.

@charris (Member) commented Apr 26, 2019

Since einsum has come up, let's bring @dgasmith in. He maintains an optimized einsum branch with different backends and is probably the current expert on that module.

@dgasmith (Contributor)

@charris Thanks for pinging me on this. It seems most topics are well covered and I don't see too much to add. As a note, something that may be simpler is to autogenerate the Python loops and have numba/numexpr compile them on the fly. The overall structure of einsum really lends itself to JIT techniques. The strategy with optimized einsum is to use a diverse set of backends to meet requirements, and building out a small JIT backend based on other libraries has come up several times.

At the same time, lots of folks seem to use einsum for low-latency operations, since you hit pure C functionality after a few very short Python steps, making tiny operations much faster than canonical Python-based array manipulation (example: a = np.arange(5); np.sum(a * a)). Any JIT technique is naturally going to come with a bit of overhead, and it depends on your operation. You may be able to pull some ideas from TCL as well.

@rgommers (Member)

As a note, something that may be simpler is to autogenerate the Python loops and have numba/numexpr compile them on the fly.

That's not feasible, we can't accept either of those as a dependency for NumPy. Perhaps in a couple of years for Numba ....

@rgommers (Member)

Thanks for the thoughts though @dgasmith!

@seiko2plus (Member)

@rgommers, sorry for the late reply,

If that doesn't work out and you both want to do it, I propose that you both put up a "work plan" with an outline of how you want to approach changes to numpy. At that point the numpy core team can make an informed choice before anyone has done all the work.

Here's my current roadmap:

  • improve runtime detection of CPU features (ENH: improve runtime detection of CPU features #13421)
  • improve build options (up next): implement two new options,
    --cpu-baseline (specify a list of enabled baseline CPU optimizations) and
    --cpu-dispatch (specify a list of dispatched CPU optimizations);
    also remove the use of target attributes and replace them with an
    alternative solution that is compatible with most compilers/platforms
  • implement NumPy's own version of universal intrinsics, mapping to
    (SSE[1 to 4.2], AVX, AVX2, AVX512[MIN[F,CD]], VSX[LE AND BE][2.06, 2.07, 3.0], NEON);
    see the sketch after this list
  • remove the current raw SIMD intrinsics and replace them with universal
    intrinsics, without losing performance
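
A minimal sketch of the universal-intrinsics idea (names here are illustrative, not the final NumPy API): one generic vector type and operation map to the native intrinsic of whichever extension the translation unit is compiled for:

#if defined(__AVX2__)
    #include <immintrin.h>
    typedef __m256 v_f32;                        /* 8 lanes on AVX2 */
    #define v_add_f32(A, B) _mm256_add_ps(A, B)
#elif defined(__VSX__)
    #include <altivec.h>
    typedef __vector float v_f32;                /* 4 lanes on VSX */
    #define v_add_f32(A, B) vec_add(A, B)
#elif defined(__ARM_NEON)
    #include <arm_neon.h>
    typedef float32x4_t v_f32;                   /* 4 lanes on NEON */
    #define v_add_f32(A, B) vaddq_f32(A, B)
#endif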

Unfortunately bounties don't really come with a "social model". I would like to avoid a situation where two people do lots of good work in parallel and in the end we have to choose which PR to prefer.

Since I'm going to support other CPU architectures anyway, and because IBM has lately been very generous with me, I'm not going to claim this bounty.

@rgommers (Member) commented May 7, 2019

Thanks @seiko2plus. Okay, it seems you're picking this up :)

improve runtime detection of CPU features

That sounds like a good idea.

improve build options, implement two new options, --cpu-baseline, ...

If this is for building wheels, then yes that sounds fine. Note that we would prefer never to have to specify this for regular builds that are meant for use on the same system. That's not necessary today, so with improved runtime detection of CPU features it should definitely not be necessary.

Also keep in mind this comment by @juliantaylor above: The first thing that needs to be figured out is how to compile different functions with different compiler flags within our distutils-based build infrastructure. The way this is done influences how the code layout can be structured.

And since I'm going to support other CPU architectures and also because IBM lately was very generous with me so I'm not going to claim for this bounty.

That's up to you of course. We wouldn't complain if you donated it to the project instead :) Or to a charity of your choice. If IBM gets what it wants, I'm sure they'll be happy to pay .....

remove the current raw SIMD intrinsics and replace them with universal intrinsics

This does sound like a logical strategy.

@barkovv commented Sep 1, 2019

Is anyone working on this? Can I pick it up?

@rgommers (Member) commented Sep 2, 2019

Hi @barkovv, thanks for your interest. Actually, @seiko2plus is working on this and has two open PRs.

Work has slowed down, but I think it is reasonably far along. So I'd rather let @seiko2plus comment here; I suggest not starting over on this.

@seiko2plus (Member)

Is anyone working on this? Can I pick it up?

@barkovv, yes, I'm working on it, but just as @rgommers mentioned, "the work has slowed down", due to reasons beyond my control; I truly apologize for that.
Please give me until the end of this month to at least finish pull request #13516.

@rgommers (Member) commented Sep 3, 2019

Thanks @seiko2plus. Hope you're doing well. Yes, we'll give you some time, please update us if the end of this month is also not feasible.

@barkovv commented Sep 7, 2019

@rgommers @seiko2plus
Okay, good luck!

@edelsohn (Author)

PPC64LE Linux has a set of SSE/SSSE3 emulation intrinsics headers. An IBM team built NumPy with those headers and saw a 15% performance improvement on benchmarks that stressed the SSE intrinsics code path, matching the improvement seen on x86. This functional hack confirms the benefit of SIMD for Power.

x86 derives more benefit from AVX intrinsics, for which no emulation has been written for Power. The SIMD infrastructure should address that.
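
For reference, a sketch of the emulation-header approach described above: recent GCC releases on powerpc64le ship x86 compatibility headers that translate SSE intrinsics to VSX (assumes GCC 8 or later; the define silences the port's warning about using them):

#define NO_WARN_X86_INTRINSICS  /* required to include x86 headers on Power */
#include <xmmintrin.h>

void add4_f32(const float *a, const float *b, float *out)
{
    /* SSE intrinsics, executed as VSX instructions on Power */
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}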

@rgommers (Member)

Thanks @edelsohn, makes sense that there's a benefit but still good to see it confirmed.

x86 derives more benefit from AVX intrinsics, for which no emulation has been written for Power. The SIMD infrastructure should address that.

I think it does already, assuming you mean "AVX equivalent instructions as universal intrinsics".

@edelsohn (Author)

By SIMD infrastructure, I mean the universal intrinsics. IBM observed that more of NumPy's SIMD code seems to utilize AVX/AVX2, for which there currently is no Power emulation. When NumPy is converted to universal intrinsics and the universal intrinsics are implemented for Power, the AVX-equivalent benefit in NumPy will be exposed for Power as well.

@stweil commented Dec 14, 2020

Did anyone already try using generic C/C++ code and optimizing it with OpenMP? In tests with Tesseract, such code was comparable to hand-optimized code for AVX2, and the same source code worked for ARM NEON and PowerPC AltiVec, too.

@rgommers (Member) commented Dec 14, 2020

@stweil can you explain a bit more what you mean? SIMD instructions give faster single-threaded code, and when you say OpenMP I'm thinking about parallelism.

@mreineck (Contributor)

OpenMP gained support for vectorization in addition to parallelization some time ago. See for example
https://www.openmp.org/spec-html/5.0/openmpsu42.html

@stweil commented Dec 14, 2020

Yes, OpenMP has different compiler pragmas for parallelism (not intended here) and for SIMD (which is what is desired here). C code which uses these pragmas, and which is compiled with the right compiler and compiler flags (optimization, CPU), automatically uses the SIMD machine instructions.
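
For illustration, a minimal sketch of that approach (build with e.g. gcc -O2 -fopenmp-simd plus the appropriate -mcpu/-march flags):

#include <stddef.h>

/* `omp simd` asks the compiler to vectorize this loop for whatever
 * SIMD ISA it targets: SSE/AVX on x86, VSX on Power, NEON on ARM. */
void add_f32(const float *a, const float *b, float *out, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}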

@mreineck (Contributor)

However, I think (though I'm not sure) that OpenMP SIMD covers only a small subset of what universal intrinsics can do. For example, I don't see how things like shuffling, masked operations, horizontal addition etc. could be achieved with OpenMP pragmas.

@seiko2plus (Member)

@stweil, OpenMP relies on the compiler's auto-vectorization, and NumPy already provides similar functionality:

/*
 * loop with contiguous specialization
 * op should be the code working on `tin in` and
 * storing the result in `tout *out`
 * combine with NPY_GCC_OPT_3 to allow autovectorization
 * should only be used where its worthwhile to avoid code bloat
 */
#define BASE_UNARY_LOOP(tin, tout, op) \
    UNARY_LOOP { \
        const tin in = *(tin *)ip1; \
        tout *out = (tout *)op1; \
        op; \
    }

#define UNARY_LOOP_FAST(tin, tout, op) \
    do { \
        /* condition allows compiler to optimize the generic macro */ \
        if (IS_UNARY_CONT(tin, tout)) { \
            if (args[0] == args[1]) { \
                BASE_UNARY_LOOP(tin, tout, op) \
            } \
            else { \
                BASE_UNARY_LOOP(tin, tout, op) \
            } \
        } \
        else { \
            BASE_UNARY_LOOP(tin, tout, op) \
        } \
    } \
    while (0)
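
(For context, a typical invocation in the loops sources looks roughly like UNARY_LOOP_FAST(npy_float, npy_float, *out = -in), applied inside a function compiled with NPY_GCC_OPT_3; this is an illustration from memory, not an exact quote.)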

@mattip (Member) commented Jan 7, 2021

I think we can close this issue and declare the bounty complete. The original bounty says:

NumPy contains SIMD vectorized code for x86 SSE and AVX. This issue is a feature request to implement equivalent native enablement for Power VSX, achieving a speedup appropriate to VSX's 128-bit SIMD vector width.

We have implemented the infrastructure needed to enable the porting of loops via Universal SIMD, and even ported some of the loops, so the enablement is complete.

@edelsohn (Author) commented Jan 8, 2021

I completely agree. Sayed and the NumPy core developers have done an amazing job!

Do you want me to close the issue? I have been deferring to the NumPy developers.

@mattip (Member) commented Jan 8, 2021

Thanks @edelsohn. Agreed that the solution exceeded expectations. I will close it then.

@mattip closed this as completed Jan 8, 2021
@rgommers (Member) commented Jan 8, 2021

Awesome, thanks everyone for the excellent work!

@seiko2plus (Member)

Thank you for everything <3
