[cpu][vec] support reduce ops for add and max #144065
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144065
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 211ea77 with merge base 0431d47.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Generally OK, but it's better to have avx2 added as well; otherwise avx2 is going to be slow, which the caller might not be aware of.
Thanks! The avx2-related ops have been added.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Description
During the support of INT8 SDPA (pytorch/ao#1372), we found that `at::vec::vec_reduce_all<int32_t>` would fall into a slow scalar path when doing sum and max. So here, we support the two reduce-related ops `reduce_add` and `reduce_max` for `vec512` and `vec256`, using sequences of vector instructions.
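For illustration, here is a minimal sketch of the kind of shuffle-based horizontal reduction this refers to, written with raw AVX2 intrinsics. This is not the PR's code; the actual implementation lives in the `Vectorized` specializations for `vec256`/`vec512` and its exact instruction sequence may differ.

```cpp
#include <immintrin.h>
#include <cstdint>

// Illustrative horizontal sum of eight int32 lanes in a 256-bit AVX2 register.
static inline int32_t hsum_epi32_avx2(__m256i v) {
  // Fold the high 128-bit lane onto the low lane: 8 lanes -> 4 lanes.
  __m128i lo   = _mm256_castsi256_si128(v);
  __m128i hi   = _mm256_extracti128_si256(v, 1);
  __m128i sum4 = _mm_add_epi32(lo, hi);
  // 4 lanes -> 2 lanes, then 2 lanes -> 1 lane, via in-register shuffles.
  __m128i sum2 = _mm_add_epi32(sum4, _mm_shuffle_epi32(sum4, _MM_SHUFFLE(1, 0, 3, 2)));
  __m128i sum1 = _mm_add_epi32(sum2, _mm_shuffle_epi32(sum2, _MM_SHUFFLE(2, 3, 0, 1)));
  return _mm_cvtsi128_si32(sum1);
}

// Same folding pattern, with max instead of add.
static inline int32_t hmax_epi32_avx2(__m256i v) {
  __m128i lo   = _mm256_castsi256_si128(v);
  __m128i hi   = _mm256_extracti128_si256(v, 1);
  __m128i max4 = _mm_max_epi32(lo, hi);
  __m128i max2 = _mm_max_epi32(max4, _mm_shuffle_epi32(max4, _MM_SHUFFLE(1, 0, 3, 2)));
  __m128i max1 = _mm_max_epi32(max2, _mm_shuffle_epi32(max2, _MM_SHUFFLE(2, 3, 0, 1)));
  return _mm_cvtsi128_si32(max1);
}
```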
Details

- Support `reduce_add` and `reduce_max` for dtypes `int32` and `float32`, using sequences of vector instructions;
- Add `reduce` in the vec base, in order to simplify the code.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
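On the caller side (as in the INT8 SDPA kernel mentioned above), the reductions are expressed through `at::vec::vec_reduce_all`; with this change, the int32 sum/max cases should take the vectorized path instead of the element-by-element scalar loop. A hedged usage sketch, assuming the standard ATen vec headers and the two-argument `vec_reduce_all` overload (not copied from the PR):

```cpp
#include <ATen/cpu/vec/vec.h>
#include <ATen/cpu/vec/functional.h>
#include <cstdint>

using Vec = at::vec::Vectorized<int32_t>;

// Horizontal sum of one full vector of int32 loaded from memory.
int32_t row_sum(const int32_t* ptr) {
  Vec acc = Vec::loadu(ptr);
  return at::vec::vec_reduce_all<int32_t>(
      [](Vec& a, Vec& b) { return a + b; }, acc);
}

// Horizontal max of one full vector of int32 loaded from memory.
int32_t row_max(const int32_t* ptr) {
  Vec acc = Vec::loadu(ptr);
  return at::vec::vec_reduce_all<int32_t>(
      [](Vec& a, Vec& b) { return at::vec::maximum(a, b); }, acc);
}
```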