[a2av] 2D all-to-all-vdev by kwen2501 · Pull Request #155058 · pytorch/pytorch · GitHub

Conversation
@kwen2501 (Collaborator) commented Jun 3, 2025

Stack from ghstack (oldest at bottom):

A 2D AllToAllv shuffle is illustrated below:
(world_size = 2, ne = 2, where ne is the number of experts per rank)

        Source: |       Rank 0      |       Rank 1      |
                | c0 | c1 | c2 | c3 | d0 | d1 | d2 | d3 |

        Dest  : |       Rank 0      |       Rank 1      |
                | c0 | d0 | c1 | d1 | c2 | d2 | c3 | d3 |

where each c_i / d_i is a slice of the input tensor targeting expert i, with its length indicated by the input splits (in in_out_splits[0]).

That is, the 2D AllToAllv shuffle achieves a transpose from rank-major order at input to expert-major order at output.
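
For illustration, a minimal standalone sketch (not part of this PR) that enumerates the same chunk permutation:

// Sketch: print the rank-major -> expert-major permutation for the example
// above (world_size = 2, ne = 2, hence world_size * ne = 4 experts in total).
// A chunk sent by rank r for expert e sits at rank-major slot
// r * (world_size * ne) + e in the concatenated input and lands at
// expert-major slot e * world_size + r in the concatenated output.
#include <cstdio>

int main() {
  const int world_size = 2, ne = 2;
  const int num_experts = world_size * ne;
  for (int r = 0; r < world_size; ++r) {
    for (int e = 0; e < num_experts; ++e) {
      int in_slot = r * num_experts + e;   // rank-major (source) order
      int out_slot = e * world_size + r;   // expert-major (destination) order
      printf("rank %d, expert %d: input slot %d -> output slot %d\n",
             r, e, in_slot, out_slot);
    }
  }
  return 0;
}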

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

[ghstack-poisoned]
@pytorch-bot bot commented Jun 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155058

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit d849727 with merge base fa85434:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request Jun 3, 2025
ghstack-source-id: d0c5cfa
Pull-Request-resolved: #155058
@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Jun 3, 2025
@kwen2501 kwen2501 requested review from fduwjj, fegin and ngimel June 3, 2025 19:01
[ghstack-poisoned]
int64_t* splits_ptr = (int64_t*)(splits_hdl->get_buffer_ptrs()[rank]);

// Number of experts per rank
int ne = in_out_splits.stride(0) / world_size;
Collaborator:

you need to check that in_out_splits is contiguous, and check its sizes, not strides

@kwen2501 (Collaborator, Author) Jun 3, 2025:

in_out_splits is expected to be of shape [3, world_size * ne].
I will add a check. Thanks!

Collaborator:

Right, but if the tensor is not contiguous, its stride may satisfy this check while its size won't.

int ne = in_out_splits.stride(0) / world_size;
TORCH_CHECK(ne * world_size == in_out_splits.stride(0), "The row length of in_out_splits must be a multiple of world_size");
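
For reference, a minimal sketch of the checks being discussed (the message text and the final form in the PR may differ):

// Require a contiguous [3, world_size * ne] splits tensor and derive ne
// from its size rather than its stride.
TORCH_CHECK(in_out_splits.is_contiguous(), "in_out_splits must be contiguous");
TORCH_CHECK(in_out_splits.dim() == 2 && in_out_splits.size(0) == 3,
            "in_out_splits must have shape [3, world_size * ne]");
TORCH_CHECK(in_out_splits.size(1) % world_size == 0,
            "The row length of in_out_splits must be a multiple of world_size");
int ne = in_out_splits.size(1) / world_size;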

auto stream = at::cuda::getCurrentCUDAStream(input.device().index());
Collaborator:

you need to check that the current device matches the input's device (unless device guard code is generated automatically; I don't know how your function is registered), and then just use getCurrentCUDAStream() here, because you'd get the stream for the current device

Collaborator (Author):

I'd prefer that the API not require the user to set a device guard upfront.
So I will do something like the following here:

at::cuda::OptionalCUDAGuard gpuGuard(input.device());
auto stream = at::cuda::getCurrentCUDAStream();
nvshmemx_collective_launch(..., stream);

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Jun 3, 2025
ghstack-source-id: 5d2a22a
Pull-Request-resolved: #155058

Add device guard
@kwen2501 (Collaborator, Author) commented Jun 3, 2025

@ngimel Updated the PR with the changes mentioned above :)

@kwen2501 kwen2501 added the ciflow/trunk label Jun 5, 2025
void* output_ptr = out_hdl->get_buffer_ptrs()[rank];
int64_t* splits_ptr = (int64_t*)(splits_hdl->get_buffer_ptrs()[rank]);

auto split_shape = in_out_splits.sizes();
Collaborator:

I still don't see a check for in_out_splits.is_contiguous()

Collaborator (Author):

Sorry, added.

transpose from rank-major order at input to expert-major order at
output.
*/
auto input_hdl = c10d::symmetric_memory::rendezvous(input, group_name);
Collaborator:

you also need to check input/output dimensionality and contiguity

Collaborator (Author):

Added a contiguity check. I can't think of a dimensionality requirement here (can be 1D, 2D or n-D).

Collaborator:

also you need to check dtypes: integer for splits, and the same dtype for input and output
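
For reference, a minimal sketch of such dtype checks (the output tensor name `out` is an assumption; the PR's final code may differ):

// Splits are read as int64_t in the kernel, so require int64 here;
// input and output must share a dtype.
TORCH_CHECK(in_out_splits.scalar_type() == at::kLong, "in_out_splits must be int64");
TORCH_CHECK(input.scalar_type() == out.scalar_type(),
            "input and output must have the same dtype");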

int e = tid % ne;
// This does a transpose from rank-major order to expert-major order
int dst_offset = e * npes + mype;
nvshmem_int64_p(source_offsets + dst_offset, peer_offsets[tid], peer);
Collaborator:

it would also make sense here to check that there are no negative numbers in splits, and that the sum of splits does not exceed the input size?

Collaborator (Author):

Added a negative check.
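
For reference, a device-side sketch of such a guard (the name `input_splits` is illustrative; the PR's exact form may differ):

// Reject negative split values before using them as copy lengths.
CUDA_KERNEL_ASSERT(input_splits[tid] >= 0 && "splits must be non-negative");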

[ghstack-poisoned]
int num_blocks = std::min(world_size * ne, world_size > 8 ? 8 : 64);

// Stride at dim 0 (assuming input is contiguous, TODO)
size_t stride_bytes = input.stride(0) * input.element_size();
Collaborator:

here you are assuming that input.stride(0) == output.stride(0); you should check it

Collaborator (Author):

Added
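
For reference, a one-line sketch of that check (the output tensor name `out` is an assumption):

TORCH_CHECK(input.stride(0) == out.stride(0),
            "input and output must have the same stride at dim 0");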

int peer = eid % npes;
// Amount from `peer` for `e`
auto peer_size = output_splits[eid] * stride;
auto source_offset = source_offsets[eid] * stride;
Collaborator:

you need to check that these offsets are within the tensor, so there are no OOB reads

Collaborator (Author):

Added block_aggregate and a check against the input's size(0).
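
For reference, a sketch of the idea (the names `input_splits` and `input_dim0_size` are illustrative):

// Compute per-peer read offsets, get the block-wide total back from the
// prefix sum, and guard against out-of-bounds reads.
int64_t total = prefixSum(source_offsets, input_splits, npes * ne);
CUDA_KERNEL_ASSERT(total <= input_dim0_size && "splits exceed input size");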

[ghstack-poisoned]

// This is an exclusive prefix sum function that calculates read (or write) offsets for each peer.
__device__ void prefixSum(int64_t *odata, int64_t *idata, int n) {
__device__ int64_t prefixSum(int64_t *odata, int64_t *idata, int n) {
Collaborator:

btw, previously I've seen an int64 scan slow down a kernel big time; you might want to check the performance
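
For reference, a minimal warp-level sketch of an exclusive prefix sum that also returns the aggregate (assumes it is called by a full, converged warp with n <= 32; not the PR's implementation):

__device__ int64_t warpExclusiveScan(int64_t* odata, const int64_t* idata, int n) {
  int lane = threadIdx.x % 32;
  int64_t val = (lane < n) ? idata[lane] : 0;
  int64_t sum = val;
  // Inclusive scan across the warp via shuffles.
  for (int offset = 1; offset < 32; offset <<= 1) {
    int64_t up = __shfl_up_sync(0xffffffff, sum, offset);
    if (lane >= offset) sum += up;
  }
  if (lane < n) odata[lane] = sum - val;  // convert to exclusive
  // Broadcast the total (the inclusive sum at the last active lane).
  return __shfl_sync(0xffffffff, sum, n - 1);
}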

@kwen2501 (Collaborator, Author) commented Jun 6, 2025

@pytorchbot merge

@pytorchmergebot:

Merge failed

Reason: Not merging any PRs at the moment because there is a merge blocking https://github.com/pytorch/pytorch/labels/ci:%20sev issue open at:
#155265


@kwen2501 (Collaborator, Author) commented Jun 6, 2025

@pytorchbot merge

@pytorchmergebot:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@pytorchmergebot:

Starting merge as part of PR stack under #155172

pytorchmergebot pushed a commit that referenced this pull request Jun 6, 2025
The downstream consumer of the 2D all-to-all-v is often a grouped GEMM.
Today such GEMMs often have an alignment requirement on the chunk sizes within the grouped sequence, where each chunk carries the tokens headed for an expert. For example, `torch._group_mm` requires an alignment of 8.

This PR adds that alignment capability when the user passes in a `major_align` argument, so that no extra padding step is needed.

The key to supporting that is making the output offsets aligned to this value. (Output offsets are returned to the user in the 3rd row of `in_out_splits`, on device. The 2nd row, the output splits, is unaffected by this alignment value, i.e. it reflects the true number of tokens per expert.)

The algorithm is as follows.

![502413288_678786854922438_530852083153996358_n](https://github.com/user-attachments/assets/557624a3-150e-4ab6-ba8b-1dbaa5ac01ac)

In the detailed implementation, we use a warp scan to calculate the prefix sum over the "block" illustrated above. As a result, the "block" size, i.e. `npes`, is currently limited to the warp size of 32.
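
For reference, the alignment step amounts to rounding each output write offset up to a multiple of `major_align` (a sketch, not the PR's exact code):

// Round an offset up to the next multiple of `align` (align > 0);
// the split values themselves are left untouched.
__host__ __device__ inline int64_t align_up(int64_t x, int64_t align) {
  return (x + align - 1) / align * align;
}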

Pull Request resolved: #155172
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677, #155058
@github-actions github-actions bot deleted the gh/kwen2501/164/head branch July 9, 2025 02:19

Labels

ciflow/trunk · Merged · oncall: distributed · release notes: distributed (c10d)
