remove TORCH_NCCL_AVOID_RECORD_STREAMS, use stashed_for_allocator_safety_ to save the input ref #148553
This aims to thoroughly solve the problem described in https://discuss.pytorch.org/t/cuda-allocation-lifetime-for-inputs-to-distributed-all-reduce/191573.

`recordStream` can cause additional performance loss and can delay the release of memory. Saving a reference to each input tensor in `stashed_for_allocator_safety_` instead ensures that input tensors are not freed before their usage on the NCCL streams finishes.

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
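For context, here is a minimal C++ sketch contrasting the two lifetime-management strategies. This is not the actual `ProcessGroupNCCL` code: `WorkSketch`, `launchCollective`, and `useRecordStream` are illustrative names invented for the sketch, and only the `c10::cuda::CUDACachingAllocator::recordStream` call and the `stashed_for_allocator_safety_` member correspond to the mechanisms named above.

```cpp
// Sketch only: simplified stand-in for WorkNCCL's input-lifetime handling.
#include <vector>

#include <ATen/ATen.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAStream.h>

struct WorkSketch {
  // Holding a Tensor reference here pins each input allocation until the
  // work object drops it; the allocator's stream tracking is not involved.
  std::vector<at::Tensor> stashed_for_allocator_safety_;
};

void launchCollective(
    const at::Tensor& input,
    c10::cuda::CUDAStream ncclStream,
    WorkSketch& work,
    bool useRecordStream) {
  if (useRecordStream) {
    // Old path: tag input's block as in use on ncclStream. The caching
    // allocator can only reuse the block after an event recorded on
    // ncclStream completes, which delays reuse and adds per-call
    // bookkeeping inside the allocator.
    c10::cuda::CUDACachingAllocator::recordStream(
        input.storage().data_ptr(), ncclStream);
  } else {
    // New path: keep the tensor alive by refcount. Clearing the vector once
    // the work has synchronized with ncclStream returns the block to the
    // allocator immediately.
    work.stashed_for_allocator_safety_.push_back(input);
  }
  // ... enqueue the NCCL kernel on ncclStream ...
}
```

The design trade-off, as I understand it: stashing ties the allocation's lifetime to the work object, so memory becomes reusable as soon as the work is synchronized and the stash is cleared, rather than whenever the allocator next observes the recorded event's completion.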