8000 remove TORCH_NCCL_AVOID_RECORD_STREAMS,use stashed_for_allocator_safety_ to save the input ref by taozhiwei · Pull Request #148553 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

remove TORCH_NCCL_AVOID_RECORD_STREAMS,use stashed_for_allocator_safety_ to save the input ref #148553

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

taozhiwei
Copy link
Contributor
@taozhiwei taozhiwei commented Mar 5, 2025

Thoroughly solve the following problems: https://discuss.pytorch.org/t/cuda-allocation-lifetime-for-inputs-to-distributed-all-reduce/191573
recordStream can cause additional performance loss and result in memory not being released in a timely manner, save input tensors ref to stashed_for_allocator_safety_ can make sure input tensors are not freed before their usages on ncclStreams finish.

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

Copy link
pytorch-bot bot commented Mar 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148553

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit cb8d870 with merge base e0ea593 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Mar 5, 2025
@taozhiwei taozhiwei force-pushed the myfeature branch 2 times, most recently from 3dabe06 to 45ff1f8 Compare March 5, 2025 13:36
@bdhirsh bdhirsh requested review from kwen2501 and ngimel March 6, 2025 01:54
@bdhirsh bdhirsh added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 6, 2025
@ngimel
Copy link
Collaborator
ngimel commented Mar 6, 2025

Likely a duplicate of #148467

@kwen2501
Copy link
Contributor
kwen2501 commented Mar 6, 2025

Thanks! Glad to see the community is interested in it!
In terms of details, the change needed is more than just removing the env var and set the default to true. In particular:
when async_op=True, the PG could be holding the tensor forever if the user did not call work.wait().
This is the one reason we did not turn on avoid-record-stream by default for async mode.
#148590 is trying to solve that now.

@taozhiwei
Copy link
Contributor Author

Thanks! Glad to see the community is interested in it! In terms of details, the change needed is more than just removing the env var and set the default to true. In particular: when async_op=True, the PG could be holding the tensor forever if the user did not call work.wait(). This is the one reason we did not turn on avoid-record-stream by default for async mode. #148590 is trying to solve that now.

OK,I will close this PR

@taozhiwei taozhiwei closed this Mar 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (c10d) release notes category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0