[FSDP] Enable async collectives in FSDP with MPI backend for compute/comm and comm/comm overlap by nariaki3551 · Pull Request #153215 · pytorch/pytorch


Open · wants to merge 5 commits into main

Conversation

nariaki3551 (Contributor) commented May 8, 2025

In FSDP1, all collectives are currently invoked with async_op=False. In addition, the MPI backend does not support CUDA stream-based scheduling, so collectives block both computation and other collectives. As a result, there is no overlap between communication and computation.

This PR enables async_op=True for all_gather and reduce_scatter and explicitly delays the corresponding .wait() calls, allowing FSDP with the MPI backend to overlap compute/communication and communication/communication.

This change only affects the MPI backend and does not change behavior for other backends.
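
Below is a minimal, stand-alone sketch of that pattern (not the PR's code; tensor shapes are arbitrary, and the MPI process group requires a PyTorch build with MPI support): issue the collective with async_op=True, keep the returned Work handle, run independent computation, and call .wait() only when the result is needed.

```python
# Minimal sketch of the async-collective pattern this PR applies inside FSDP.
# Not the PR's code; tensor sizes and names are illustrative.
import torch
import torch.distributed as dist

dist.init_process_group("mpi")  # requires a PyTorch build with MPI support
world_size = dist.get_world_size()

shard = torch.randn(1024)
gathered = [torch.empty_like(shard) for _ in range(world_size)]

# Issue the all-gather without blocking and keep the returned Work handle.
work = dist.all_gather(gathered, shard, async_op=True)

# Overlap: run computation that does not depend on the gathered tensors.
independent = torch.randn(1024).mul_(2.0)

# Delay the wait until the gathered result is actually consumed.
work.wait()
full = torch.cat(gathered)
print(full.sum() + independent.sum())

dist.destroy_process_group()
```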

Changes

  • Use async_op=True in:
    • all_gather during FlatParamHandle.unshard
    • reduce_scatter for gradient reduction
  • Add _all_gather_work and _reduce_scatter_work to FlatParamHandle to store in-flight work objects
  • Track in-flight collectives in _FSDPState (see the sketch after this list) to ensure:
    • At most one all_gather and one reduce_scatter are in flight at a time
    • A new collective is not issued until the previous one has completed via .wait()
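
Purely as an illustration of that tracking, the stand-alone sketch below mimics the "wait before re-issuing" bookkeeping. ShardedParam, unshard_async, and wait_unshard are hypothetical names; only _all_gather_work mirrors an attribute named above, and the class is not FSDP's FlatParamHandle.

```python
# Stand-alone illustration of the "at most one in-flight collective"
# bookkeeping described above. Hypothetical class, not FSDP internals.
import torch
import torch.distributed as dist


class ShardedParam:
    def __init__(self, shard: torch.Tensor, world_size: int):
        self.shard = shard
        self.gathered = [torch.empty_like(shard) for _ in range(world_size)]
        self._all_gather_work = None  # in-flight Work handle, if any

    def unshard_async(self) -> None:
        # Never issue a new all-gather while a previous one is still in flight.
        if self._all_gather_work is not None:
            self._all_gather_work.wait()
        self._all_gather_work = dist.all_gather(
            self.gathered, self.shard, async_op=True
        )

    def wait_unshard(self) -> None:
        # Called right before the gathered parameter is consumed.
        if self._all_gather_work is not None:
            self._all_gather_work.wait()
            self._all_gather_work = None
```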

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

pytorch-bot (bot) commented May 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153215

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 85aee04 with merge base 5683965:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot (bot) added the oncall: distributed and release notes: distributed (fsdp) labels May 8, 2025
nariaki3551 changed the title from "Enable async collectives in FSDP with MPI backend for compute/comm and comm/comm overlap" to "[FSDP] Enable async collectives in FSDP with MPI backend for compute/comm and comm/comm overlap" May 8, 2025
nariaki3551 marked this pull request as ready for review May 8, 2025 23:25
Skylion007 requested a review from awgu May 9, 2025 13:06
awgu (Collaborator) commented May 9, 2025

It is not obvious to me that this change is correct. FSDP1 is written such that waiting on the all-gather (or reduce-scatter respectively) is done by having the current/default/compute stream wait for the separate stream from which the all-gather is issued. I am not sure that waiting on an all-gather at the point of the next all-gather is doing the same thing as that -- in fact, I would be somewhat surprised if this is the same synchronization behavior.
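
For context, the stream-based ordering awgu describes looks roughly like the sketch below (CUDA backends only; the stream name and tensors are illustrative, not FSDP internals). The CPU is never blocked on a Work handle; the compute stream is simply ordered after the stream that issued the all-gather.

```python
# Minimal sketch of stream-ordered synchronization on CUDA backends.
# `unshard_stream` is illustrative and not an FSDP internal attribute.
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

shard = torch.randn(1024, device="cuda")
gathered = [torch.empty_like(shard) for _ in range(dist.get_world_size())]

unshard_stream = torch.cuda.Stream()
with torch.cuda.stream(unshard_stream):
    dist.all_gather(gathered, shard)  # enqueued from the side stream

# Order GPU work only: kernels launched afterwards on the current stream run
# after the all-gather, but the CPU is not blocked on a Work handle here.
torch.cuda.current_stream().wait_stream(unshard_stream)
print(torch.cat(gathered).sum())
```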

colesbury added the triaged label May 13, 2025
nariaki3551 (Contributor, Author)

@awgu

As you pointed out, the PR did not correctly enforce the ordering between communication and computation. This is because the MPI backend does not support CUDA streams, making wait_stream() ineffective for scheduling execution.

To address this, I have added a simple rule in this commit to clarify and enforce synchronization:

  • If stream_A.wait_stream(_unshard_stream) is called, we wait on the issued allgather.
  • If stream_A.wait_stream(_post_backward_stream) is called, we wait on the issued reduce_scatter.
  • Otherwise, we additionally call stream_B.synchronize() when stream_A.wait_stream(stream_B) is used.

This rule enables comm and comp to overlap under the MPI backend, while still preserving the execution order previously enforced via CUDA streams.

  1. unshard (allgather) is issued only after pre_unshard is complete.
  2. reduce_grad (reduce_scatter) is issued only after backward computation is complete.
  3. Forward/backward computation begins only after unshard (allgather) has completed.
  4. FSDP1 waits for the last reduce_scatter to complete in the _post_backward_final_callback function.
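
The rule could be expressed roughly as the wrapper below; patched_wait_stream and state are hypothetical names standing in for the commit's actual plumbing, while _unshard_stream, _post_backward_stream, _all_gather_work, and _reduce_scatter_work are the names used above.

```python
# Rough illustration of the rule described above, with hypothetical names.
# `state` stands in for the FSDP state object holding the in-flight handles.
def patched_wait_stream(stream_a, stream_b, state):
    if stream_b is state._unshard_stream:
        # Waiting on the unshard stream maps to waiting on the issued all_gather.
        if state._all_gather_work is not None:
            state._all_gather_work.wait()
            state._all_gather_work = None
    elif stream_b is state._post_backward_stream:
        # Waiting on the post-backward stream maps to the issued reduce_scatter.
        if state._reduce_scatter_work is not None:
            state._reduce_scatter_work.wait()
            state._reduce_scatter_work = None
    else:
        # The MPI backend cannot order work via CUDA streams, so additionally
        # synchronize the waited-on stream.
        stream_a.wait_stream(stream_b)
        stream_b.synchronize()
```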

Labels
oncall: distributed · open source · release notes: distributed (fsdp) · triaged