Optimize shard_dim_alltoall to use alltoall_single #148868
Conversation
As titled: previously `shard_dim_alltoall` used `all_to_all`, which can incur many copies because the per-rank chunks become non-contiguous during the splits, and `all_to_all` itself also incurs copies. This PR switches to `alltoall_single` so that tensor copies are minimized. Tested on all the shard-dim change tests and it works properly.
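For context, a minimal sketch (not the code from this PR; shapes and the plain tensors are arbitrary stand-ins for a DTensor local shard) of the difference between the two collectives at the Python `torch.distributed` level: `all_to_all` exchanges a list of per-rank chunks, each of which may need a contiguity copy, while `all_to_all_single` exchanges one contiguous send buffer and one contiguous receive buffer.

```python
# Illustrative sketch only, not the code from this PR. Assumes a process
# group is already initialized (e.g. launched with torchrun) and that the
# tensor's leading dimension is divisible by the world size.
import torch
import torch.distributed as dist

world = dist.get_world_size()
t = torch.randn(4 * world, 8)

# 1) List-based all_to_all: the input is chunked into per-rank tensors.
#    The chunks here are taken along dim 0 for simplicity; in the
#    shard-dim-change case they come from an inner dim, are non-contiguous,
#    and each one costs an extra copy, on top of any copies the backend
#    makes to pack and unpack the list.
in_chunks = [c.contiguous() for c in t.chunk(world, dim=0)]
out_chunks = [torch.empty_like(c) for c in in_chunks]
dist.all_to_all(out_chunks, in_chunks)
result_from_list = torch.cat(out_chunks, dim=0)

# 2) all_to_all_single: one contiguous send buffer and one contiguous
#    receive buffer; equal splits are implied when the split sizes are left
#    unset, so no per-chunk copies are needed on either side.
inp = t.contiguous()
out = torch.empty_like(inp)
dist.all_to_all_single(out, inp)
```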
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148868
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures, 1 Pending, 5 Unrelated Failures as of commit c63b4a2 with merge base d789c22.
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
CI failures are not related to the PR
sounds good to me
std::vector<int64_t> out_split_sizes;
std::vector<int64_t> in_split_sizes;
c10d::AllToAllOptions opts;
These are optional, so they don't have to be fed into alltoall_base?
They are required by the alltoall_base API :(
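At the Python level, `all_to_all_single` exposes the same split-size arguments, and leaving them as `None` (or empty lists) means both buffers are split evenly across ranks. A minimal sketch of that equivalence, assuming an initialized process group and an input that divides evenly:

```python
# Minimal sketch, assuming a process group is already initialized (e.g. via
# torchrun) and that the input divides evenly across ranks. Not code from
# this PR; it only illustrates the split-size semantics of all_to_all_single.
import torch
import torch.distributed as dist

world = dist.get_world_size()
inp = torch.randn(4 * world, 8)
out = torch.empty_like(inp)

# Explicit equal split sizes (counted in rows along dim 0)...
dist.all_to_all_single(
    out, inp, output_split_sizes=[4] * world, input_split_sizes=[4] * world
)

# ...behave the same as leaving the split sizes unset, which implies an even
# split of both buffers across all ranks.
dist.all_to_all_single(out, inp)
```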
@pytorchbot merge -i "failures are not related to the PR"
❌ 🤖 pytorchbot command failed:
Try
@pytorchbot merge -i
Fixes #ISSUE_NUMBER
cc @H-Huang @awgu @kwen2501 @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o