-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
USIMD: Optimize the performace of np.einsum for all platforms #16641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Merge branch 'master' of https://github.com/numpy/numpy into einsum-neon
These never worked to the best of my knowledge, so disable it explicitly to hopefully simplify cleanup in other parts.
We discussed this briefly in the sprint. Correct me if I am wrong: this is a reimplementation of the SSE2 intrinsics in NEON. I wonder if we could already use the universal intrinsics to make this generic? |
Floats besides float64 were being coerced to integers and complex step sizes for the index trick classes would fail for complex64 input. Fixes numpy#16466
I'd be happy to change it to universal intrinsics,Is there any guidance document or example that I should read first? @mattip @seiko2plus |
not yet, but sure I can guide you |
Merge branch 'master' of https://github.com/numpy/numpy into einsum-neon
/**begin repeat2 | ||
* #i = 0, 2, 4, 6# | ||
*/ | ||
a = vmulq_f64(vld1q_f64(data0+@i@), vld1q_f64(data1+@i@)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace these intrinsics with:
vld1q_f64
-> npyv_load_f64
vmulq_f64
-> npyv_mul_f64
vaddq_f64
-> npyv_add_f64
vst1q_f64
-> npyv_store_f64
for more intrinsics, you have to search or dive in the NPYV
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW why you didn't use FMA3 here, well currently there's no mul-add intrinsic in NPYV but I guess It will not be that hard to impmlent it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make another PR for mul-add intrinsic in NPYV.
… of github.com:numpy/numpy into einsum-neon
Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>
@seiko2plus Thanks for your advice, Reconstruct I just make PS: USIMD is really powerful, saved so much coding time. 😃 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice to see use of the universal intrinsics.
.gitignore
Outdated
numpy/core/src/umath/_umath_tests.dispatch.h | ||
numpy/core/src/umath/_umath_tests.dispatch.sse41.c | ||
numpy/core/src/*/*.dispatch.*.c | ||
numpy/core/src/*/*.dispatch.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@seiko2plus shouldn't we be generating files in build/* rather than in the source directories, or is this similar to cython-generated files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, should be generated into build/src.*, but maybe @Qiyu8 use --inplace
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I used -i
option during build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you used inplace option -i
because you received error like that
: fatal error: einsum.dispatch.h: No such file or directory
well it's my bad, CcompilerOpt
should include build/src.*/numpy/core/src/multiarray
to the search dir, even if it wasn't added by 'core/setup.py'.
This issue should be fixed by #16858.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#16858 is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you @mattip.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep this ignore part, I build the source with -i
option, and the *.dispatch.*.c
will generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default people should not be using -i
since it messes up the source directory, as you discovered. Why don't you want to use the build directory? Is there another problem that using -i
fixes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
build directory is good , no problem if I don't use -i
option. will remove from ignore list.
self.c = np.arange(24000, dtype=dtype).reshape(20, 30, 40) | ||
self.c1 = np.arange(1200, dtype=dtype).reshape(30, 40) | ||
self.c2 = np.arange(480000, dtype=dtype) | ||
self.c3 = np.arange(600, dtype=dtype) | ||
F438 self.d = np.arange(10000, dtype=dtype).reshape(10,100,10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good at some point to rename these with meaningful names. Is there a reason to have the different size variants? Perhaps a very small variant would be good as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattip I found that meaningless naming is pretty common in every bench
module, should we open a separate rename PR? I want to focus on the universal intrinsics transform process in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. What about adding a very small variant as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt that small variant is not enough to evaluate the degree of optimization, I will test with small size and provide a report anyway.
@Qiyu8, you should dispatch the tabs that hold the functions pointers within |
Co-authored-by: Sayed Adel <seiko@imavr.com>
- paves the way for integer optimization - fix memory overflow/bus errors - improve the unrolling - activate the unrolling when NPYV isn't available - add support for fma3 - robust/cleanup - other minor fixes
- paves the way for integer optimization - fix memory overflow/bus errors - improve the unrolling - activate the unrolling when NPYV isn't available - add support for fma3 - robust/cleanup - other minor fixes
Co-authored-by: Sayed Adel <seiko@imavr.com>
Co-authored-by: Sayed Adel <seiko@imavr.com>
Co-authored-by: Sayed Adel <seiko@imavr.com>
Co-authored-by: Sayed Adel <seiko@imavr.com>
Co-authored-by: Sayed Adel <seiko@imavr.com>
Co-authored-by: Sayed Adel <seiko@imavr.com>
Co-authored-by: Sayed Adel <seiko@imavr.com>
Co-authored-by: Sayed Adel <seiko@imavr.com>
Co-authored-by: Sayed Adel <seiko@imavr.com>
Co-authored-by: Sayed Adel <seiko@imavr.com>
Co-authored-by: Sayed Adel <seiko@imavr.com>
Co-authored-by: Sayed Adel <seiko@imavr.com>
seven sub module of np.einsum is optimized using neon intrinsics. The optimization resulted in a performance improvement of 10~42%.
System info:
The optimization effect: