Optimize shard_dim_alltoall to use alltoall_single #148868
Conversation
As titled: previously `shard_dim_alltoall` used `all_to_all`, which can incur many copies because the per-rank chunks become non-contiguous during the splits, and `all_to_all` itself also incurs copies. This PR switches to `alltoall_single` so that tensor copies are minimized. Tested on all the shard-dim change tests and it works properly.
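For context, a minimal sketch (not the code from this PR; shapes and the plain tensors are arbitrary stand-ins for a DTensor local shard) of the difference between the two collectives at the Python `torch.distributed` level: `all_to_all` exchanges a list of per-rank chunks, each of which may need a contiguity copy, while `all_to_all_single` exchanges one contiguous send buffer and one contiguous receive buffer.

```python
# Illustrative sketch only, not the code from this PR. Assumes a process
# group is already initialized (e.g. launched with torchrun) and that the
# tensor's leading dimension is divisible by the world size.
import torch
import torch.distributed as dist

world = dist.get_world_size()
t = torch.randn(4 * world, 8)

# 1) List-based all_to_all: the input is chunked into per-rank tensors.
#    The chunks here are taken along dim 0 for simplicity; in the
#    shard-dim-change case they come from an inner dim, are non-contiguous,
#    and each one costs an extra copy, on top of any copies the backend
#    makes to pack and unpack the list.
in_chunks = [c.contiguous() for c in t.chunk(world, dim=0)]
out_chunks = [torch.empty_like(c) for c in in_chunks]
dist.all_to_all(out_chunks, in_chunks)
result_from_list = torch.cat(out_chunks, dim=0)

# 2) all_to_all_single: one contiguous send buffer and one contiguous
#    receive buffer; equal splits are implied when the split sizes are left
#    unset, so no per-chunk copies are needed on either side.
inp = t.contiguous()
out = torch.empty_like(inp)
dist.all_to_all_single(out, inp)
```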
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148868
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures, 1 Pending, 5 Unrelated Failures as of commit c63b4a2 with merge base d789c22.
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
CI failures are not related to the PR
sounds good to me
std::vector<int64_t> out_split_sizes;
std::vector<int64_t> in_split_sizes;
c10d::AllToAllOptions opts;
These are optional, so they don't have to be fed into alltoall_base?
They are required by the alltoall_base API :(
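At the Python level, `all_to_all_single` exposes the same split-size arguments, and leaving them as `None` (or empty lists) means both buffers are split evenly across ranks. A minimal sketch of that equivalence, assuming an initialized process group and an input that divides evenly:

```python
# Minimal sketch, assuming a process group is already initialized (e.g. via
# torchrun) and that the input divides evenly across ranks. Not code from
# this PR; it only illustrates the split-size semantics of all_to_all_single.
import torch
import torch.distributed as dist

world = dist.get_world_size()
inp = torch.randn(4 * world, 8)
out = torch.empty_like(inp)

# Explicit equal split sizes (counted in rows along dim 0)...
dist.all_to_all_single(
    out, inp, output_split_sizes=[4] * world, input_split_sizes=[4] * world
)

# ...behave the same as leaving the split sizes unset, which implies an even
# split of both buffers across all ranks.
dist.all_to_all_single(out, inp)
```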
@pytorchbot merge -i "failures are not related to the PR"
❌ 🤖 pytorchbot command failed:
Try
@pytorchbot merge -i
Fixes #ISSUE_NUMBER
cc @H-Huang @awgu @kwen2501 @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o