[Async TP] Fuse matmul-reduce-scatters when reduce scatters have multiple users, and save fused node for backward instead of reduce_scatter node #149875

danielvegamyhre · 2025-03-24T19:57:19Z

Stack

[previous PR in stack] [Async TP] More robust support for rowwise scales when fusing matmul reduce-scatter #149247

TL;DR

This PR adds support in async TP for saving the reduce-scatter result for backward. Previously, saving it broke async TP under these torchtitan AC policies: no AC, per op SAC, and per layer SAC.

Context

In torchtitan's Llama3 per op SAC policy, we want to save the output of reduce_scatter ops for backward, which is useful for TP. The reduce_scatter output is also saved for no AC (since all activations are saved) and for per layer SAC (since we save the activations for N full layers, which do contain reduce-scatters for TP).

However, doing this causes incompatibility with Async TP for the AC policies above, for 2 reasons:

1. The graph pattern matching only matches reduce-scatter nodes with 1 user, but reduce-scatter nodes saved for backward have 2 users (the 2nd one being the return/output node, which saves it for backward).
2. The subgraph replacement logic, which replaces the users of the wait_tensor after the reduce-scatter with the new fused node, has no mechanism to save the fused node for backward instead of the reduce-scatter node. This means we cannot directly replace the subgraph, since we can't delete nodes which still have users (in this case, the output node is still using the reduce-scatter node).

To fix this, we do 2 things:

1. Add pattern matching logic to also match reduce-scatter nodes with 2 users, so we also perform fusion when the reduce-scatter is saved for backward.
2. When replacing the subgraph with the fused node, detect if the reduce-scatter was saved for backward, and if so, save the result of the fused node for backward instead. This enables us to properly erase the subgraph and prevents the memory leak reported in the issue "[Async TP] Activations not cleared after backward when reduce_scatter_tensor saved for backward by per op SAC" #149876.
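The two fixes above can be illustrated with a minimal sketch. It uses a toy `Node` class and a plain Python list as the graph, not the real `torch.fx` API or the code in this PR; `Node`, `replace_input`, `erase`, and `fuse` are hypothetical names introduced only for illustration. It assumes the saved reduce-scatter node has exactly the two users described above (the wait_tensor and the output node).

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison, so list.remove/index match the exact node
class Node:
    """Toy stand-in for a graph node; NOT the torch.fx.Node API."""
    name: str
    inputs: list = field(default_factory=list)
    users: list = field(default_factory=list)

    def __post_init__(self):
        # Register this node as a user of each of its inputs.
        for inp in self.inputs:
            inp.users.append(self)


def replace_input(user, old, new):
    """Rewire `user` to read `new` instead of `old`."""
    user.inputs = [new if i is old else i for i in user.inputs]
    old.users.remove(user)
    new.users.append(user)


def erase(graph, node):
    """Delete a node, refusing if anything still uses it."""
    if node.users:
        raise RuntimeError(f"{node.name} still has users")
    for inp in node.inputs:
        inp.users.remove(node)
    graph.remove(node)


def fuse(graph, mm, rs, wait, output):
    """Fuse mm -> reduce_scatter -> wait_tensor into one node (hypothetical helper)."""
    # Fix 1: accept 1 user (wait_tensor only) or 2 users (wait_tensor + output).
    if len(rs.users) not in (1, 2):
        return None
    fused = Node("fused_matmul_reduce_scatter", inputs=list(mm.inputs))
    graph.insert(graph.index(mm), fused)
    # Consumers of the wait_tensor result now read the fused result.
    for user in list(wait.users):
        replace_input(user, wait, fused)
    # Fix 2: if the reduce-scatter was saved for backward, save the fused node instead.
    if output in rs.users:
        replace_input(output, rs, fused)
    # Erase consumers before producers so each node is user-free when deleted.
    for node in (wait, rs, mm):
        erase(graph, node)
    return fused


# Build: a, b -> mm -> reduce_scatter -> wait_tensor -> relu -> output,
# with reduce_scatter also returned from the graph (saved for backward).
a, b = Node("a"), Node("b")
mm = Node("mm", [a, b])
rs = Node("reduce_scatter", [mm])
wait = Node("wait_tensor", [rs])
relu = Node("relu", [wait])
output = Node("output", [relu, rs])
graph = [a, b, mm, rs, wait, relu, output]

fused = fuse(graph, mm, rs, wait, output)
```

Without the rerouting step for the output node, `erase` would raise on the reduce-scatter node, which mirrors the "can't delete nodes which still have users" problem described in the context.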

Other changes

Continue to throw an error if we don't find any candidate all-gathers or reduce-scatters for fusion (since TP should have both) but DON'T throw an error if we don't fuse any matmul-reduce-scatters. This is because I've found there are actually valid graphs where we do fuse reduce scatters in the forward graph but not the backward graph (in the backward pass there are reduce-scatters but the producer op is an "add" not a mm/scaled_mm).
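The error policy above could look roughly like the following sketch. The function name, its parameters, and the messages are hypothetical and are not taken from this PR; the sketch also assumes that "both" means an error is raised when either kind of candidate is missing.

```python
import logging

log = logging.getLogger(__name__)


def check_async_tp_fusion(num_all_gathers, num_reduce_scatters, num_fused_reduce_scatters):
    """Hypothetical validation helper: returns True if any reduce-scatter was fused."""
    # TP graphs should contain both collectives, so a missing kind is an error.
    if num_all_gathers == 0 or num_reduce_scatters == 0:
        raise AssertionError(
            "async TP found no candidate all-gathers or reduce-scatters for fusion"
        )
    # Candidates exist but none fused: valid, e.g. a backward graph whose
    # reduce-scatter producer is an "add" rather than a mm/scaled_mm.
    if num_fused_reduce_scatters == 0:
        log.debug("no matmul-reduce-scatters fused in this graph")
        return False
    return True
```

For example, a backward graph with reduce-scatters fed only by "add" nodes would pass the check and simply report that nothing was fused.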

Test plan

All unit tests are passing
Visualized the graphs and verified the fusion is occurring properly.
Verified via manual torchtitan runs there is no memory leak / OOM occurring anymore.

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

pytorch-bot · 2025-03-24T19:57:22Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149875

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

danielvegamyhre added 17 commits March 20, 2025 05:43

update scatter dim with reshape

7e90d5a

more robust dim update

e877a0c

refactor symmetric memory

c725545

fix incorrect movedims that were affecting numerics

f8894bc

comments

b2d8e97

new checks and unit tests to catch silent failures

82115f5

comments, logging

e07a65f

better comments

2c6040c

clean up

fcbc112

handle pre mm reshape and post mm reshape separately

55d44a2

lint, comments

4454af3

lint

3c772b1

lint

cb2fe4a

test lint

001c6ef

separate code paths for fused matmul reduce scatter and fused scaled …

76b74ff

…matmul reduce scatter

lint

984fd3d

cleanup

cf7fe84

pytorch-bot bot added ciflow/inductor module: inductor labels Mar 24, 2025

danielvegamyhre marked this pull request as draft March 24, 2025 19:57

danielvegamyhre mentioned this pull request Mar 24, 2025

[Async TP] Activations not cleared after backward when reduce_scatter_tensor saved for backward by per op SAC #149876

Closed

danielvegamyhre changed the title ~~[WIP] [Async TP] Fuse matmul-reduce-scatters when reduce scatters have multiple users~~ [Async TP] Fuse matmul-reduce-scatters when reduce scatters have multiple users, and save fused node for backward instead of reduce_scatter node Mar 25, 2025

danielvegamyhre marked this pull request as ready for review March 25, 2025 01:55

pytorch-bot bot added oncall: distributed release notes: distributed (pipeline) labels Mar 25, 2025

danielvegamyhre force-pushed the scatter-dim branch from 5509e6f to 8c1a679 Compare March 25, 2025 15:01

cast to node to appease linter

2f3ced5

danielvegamyhre force-pushed the users branch from 16d3b5f to 2f3ced5 Compare March 25, 2025 15:02

danielvegamyhre closed this Mar 25, 2025

github-actions bot deleted the users branch April 27, 2025 02:22
