[SymmMem] Speed up tests #153677

kwen2501 · 2025-05-16T01:45:13Z

Stack from ghstack (oldest at bottom):

Use `MultiProcContinousTest` to avoid re-creating the ProcessGroup in each test instance.
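The general idea of sharing one expensive resource across all tests in a class can be sketched with plain `unittest`. This is only an illustration of the pattern, not how `MultiProcContinousTest` is implemented; `FakeProcessGroup`, `ContinuousStyleTest`, and `run_suite` are hypothetical names introduced for the example.

```python
import unittest


class FakeProcessGroup:
    """Hypothetical stand-in for an expensive-to-create process group."""

    init_count = 0  # counts how many times a group was constructed

    def __init__(self):
        FakeProcessGroup.init_count += 1

    def all_reduce_sum(self, values):
        # Toy "collective": just sums the local values.
        return sum(values)


class ContinuousStyleTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Created once for the whole class instead of once per test method.
        cls.pg = FakeProcessGroup()

    def test_one(self):
        self.assertEqual(self.pg.all_reduce_sum([1, 2]), 3)

    def test_two(self):
        self.assertEqual(self.pg.all_reduce_sum([3, 4]), 7)


def run_suite():
    """Run both tests and report how many groups were constructed."""
    FakeProcessGroup.init_count = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ContinuousStyleTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result, FakeProcessGroup.init_count
```

Had the group been built in `setUp` instead of `setUpClass`, the counter would grow with the number of test methods, which is the per-test cost this PR aims to remove.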

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

[ghstack-poisoned]

pytorch-bot · 2025-05-16T01:45:16Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153677

✅ No Failures

As of commit c5d9c3e with merge base fa85434:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 9b6f66e Pull-Request-resolved: #153677

fegin

LGTM. The failing tests look like they come from the previous PR.

ngimel

Very cool! How long do the symm mem tests run now?

ngimel · 2025-05-16T16:15:52Z

test/distributed/test_symmetric_memory.py

            [4, 8192, 8196],
-            [4, 8, 16],
+            [
+                8


Interesting, do you know why memory usage changed? All these tests use very little memory.

It is not a problem of the alignment, but of the number of tests we run continuously. It seems we either fail to release tensors or there is some flaw in the allocation logic (e.g. allocating more than needed).

[ghstack-poisoned]

ghstack-source-id: 56d9fdf Pull-Request-resolved: #153677

[ghstack-poisoned]

ghstack-source-id: 8c73378 Pull-Request-resolved: #153677

kwen2501 · 2025-05-19T06:33:15Z

@pytorchbot merge

pytorchmergebot · 2025-05-19T06:35:15Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2025-05-19T07:12:26Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 3, 3, linux.g4dn.12xlarge.nvidia.gpu)

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

[ghstack-poisoned]

ghstack-source-id: fce262d Pull-Request-resolved: #153677

[ghstack-poisoned]

ghstack-source-id: 4bca618 Pull-Request-resolved: #153677

[ghstack-poisoned]

ghstack-source-id: f0a32f6 Pull-Request-resolved: #153677

kwen2501 · 2025-05-26T03:31:36Z

@pytorchbot merge

pytorchmergebot · 2025-05-26T03:33:30Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

malfet · 2025-05-26T14:06:24Z

@pytorchbot revert -m "I don't know how, but you PRs keep escaping TD and breaking trunk oops I wrong" -c nosignal

malfet · 2025-05-26T14:08:24Z

Sorry, looks like infra is just unhappy

pytorchmergebot · 2025-05-26T14:08:24Z

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot · 2025-05-26T14:08:25Z

Don't want to revert based on edited command

A 2D AllToAllv shuffle is illustrated below: (`world_size` = 2, `ne` = 2, where `ne` is number of experts per rank)

```
Source: |       Rank 0      |       Rank 1      |
        | c0 | c1 | c2 | c3 | d0 | d1 | d2 | d3 |

Dest  : |       Rank 0      |       Rank 1      |
        | c0 | d0 | c1 | d1 | c2 | d2 | c3 | d3 |
```

where each `c_i` / `d_i` are slices of the `input` tensor, targeting expert `i`, with length indicated by input splits (in `in_out_splits[0]`). That is, the 2D AllToAllv shuffle achieves a transpose from rank-major order at input to expert-major order at output.

Pull Request resolved: #155058
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677
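The reordering in the diagram can be reproduced on a single process with plain lists. This is a minimal sketch of the data movement only, assuming each source rank holds one chunk per global expert; `shuffle_2d` is a hypothetical helper, not the SymmMem op, and each label stands in for a variable-length slice.

```python
def shuffle_2d(inputs, world_size, ne):
    """Simulate the rank-major to expert-major transpose.

    inputs[src] lists the chunks held by source rank `src`, ordered by
    global expert id (world_size * ne chunks per rank).
    Returns outputs[dst], the chunks received by destination rank `dst`.
    """
    outputs = []
    for dst in range(world_size):
        received = []
        for local_expert in range(ne):
            # Global id of this expert; experts are split evenly over ranks.
            global_expert = dst * ne + local_expert
            # Expert-major: gather this expert's chunk from every source rank.
            for src in range(world_size):
                received.append(inputs[src][global_expert])
        outputs.append(received)
    return outputs


source = [
    ["c0", "c1", "c2", "c3"],  # rank 0
    ["d0", "d1", "d2", "d3"],  # rank 1
]
dest = shuffle_2d(source, world_size=2, ne=2)
# dest == [["c0", "d0", "c1", "d1"], ["c2", "d2", "c3", "d3"]]
```

On each destination rank the chunks for one expert end up adjacent, which is the layout a grouped computation over experts would want.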

The downstream consumer of the 2D all-to-all-v is often a group GEMM. Today the GEMM often has an alignment requirement on the chunk sizes within the grouped sequence, where each chunk carries the tokens headed for an expert. For example, `torch._group_mm` requires an alignment of 8.

This PR adds that alignment capability, used when the user passes in a `major_align` argument, so that no extra padding step is needed.

The key to supporting this is aligning the output offsets to that value. (Output offsets are returned to the user in the 3rd row of `in_out_splits`, on device. The 2nd row, output splits, is unaffected by this alignment value, i.e. it reflects the true number of tokens for an expert.)

The algorithm is as follows.

![502413288_678786854922438_530852083153996358_n](https://github.com/user-attachments/assets/557624a3-150e-4ab6-ba8b-1dbaa5ac01ac)

In the detailed implementation, we use a warp scan to calculate the prefix sum on the "block" illustrated above. As a result, the "block" size, i.e. `npes`, is currently limited to the warp size of 32.

Pull Request resolved: #155172
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677, #155058
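The effect of aligning output offsets while leaving the splits untouched can be sketched on the host with a sequential prefix sum. This is a simplified illustration, not the kernel's algorithm (which is shown only in the image and uses a warp scan): it assumes one chunk per expert and that each chunk's start is rounded up to a multiple of `major_align`. `align_up` and `aligned_offsets` are hypothetical helpers, and how empty chunks are handled is not covered.

```python
def align_up(n, align):
    """Round n up to the nearest multiple of align."""
    return (n + align - 1) // align * align


def aligned_offsets(out_splits, major_align=1):
    """Exclusive prefix sum over chunk lengths with aligned chunk starts.

    out_splits: true number of tokens per chunk (left unchanged).
    Returns the start offset of each chunk in the output buffer.
    """
    offsets = []
    cursor = 0
    for n in out_splits:
        offsets.append(cursor)
        # Next chunk starts at the first aligned position past this one.
        cursor = align_up(cursor + n, major_align)
    return offsets


splits = [3, 10, 8, 1]
print(aligned_offsets(splits, major_align=8))  # [0, 8, 24, 32]
print(aligned_offsets(splits))                 # [0, 3, 13, 21]
```

The gaps between the end of one chunk and the aligned start of the next play the role of padding, so a consumer that needs aligned chunk starts can read the buffer directly.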

Update

e05e0b3

[ghstack-poisoned]

pytorch-bot bot added the oncall: distributed and topic: not user facing labels May 16, 2025

kwen2501 added a commit that referenced this pull request May 16, 2025

[SymmMem] Speed up tests

ce69368

ghstack-source-id: 9b6f66e Pull-Request-resolved: #153677

kwen2501 mentioned this pull request May 16, 2025

[Distributed][CI] Rework continuous TestCase #153653

Closed

kwen2501 requested review from fegin and ngimel May 16, 2025 01:49

fegin approved these changes May 16, 2025

Skylion007 approved these changes May 16, 2025

ngimel approved these changes May 16, 2025

Update

bac63a4

[ghstack-poisoned]

kwen2501 added a commit that referenced this pull request May 17, 2025

[SymmMem] Speed up tests

6252506

ghstack-source-id: 56d9fdf Pull-Request-resolved: #153677

Update

41316af

[ghstack-poisoned]

kwen2501 added a commit that referenced this pull request May 19, 2025

[SymmMem] Speed up tests

f6dd20d

ghstack-source-id: 8c73378 Pull-Request-resolved: #153677

pytorch-bot bot added the ciflow/trunk label May 19, 2025

pytorchmergebot added the merging label May 19, 2025

pytorchmergebot removed the merging label May 19, 2025

Update

feb7c4a

[ghstack-poisoned]

kwen2501 added a commit that referenced this pull request May 23, 2025

[SymmMem] Speed up tests

6c48ad5

ghstack-source-id: fce262d Pull-Request-resolved: #153677

Update

f2384a7

[ghstack-poisoned]

kwen2501 added a commit that referenced this pull request May 24, 2025

[SymmMem] Speed up tests

7538040

ghstack-source-id: 4bca618 Pull-Request-resolved: #153677

Update

c5d9c3e

[ghstack-poisoned]

kwen2501 added a commit that referenced this pull request May 24, 2025

[SymmMem] Speed up tests

b146ebd

ghstack-source-id: f0a32f6 Pull-Request-resolved: #153677

pytorchmergebot added the merging label May 26, 2025

pytorchmergebot closed this in 062387f May 26, 2025

pytorchmergebot added Merged and removed merging labels May 26, 2025

github-actions bot deleted the gh/kwen2501/154/head branch June 27, 2025 02:20
