[c10d][NCCL] Refactor coalesced storage #122651
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122651. Note: links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (3 unrelated failures) As of commit 3df4123 with merge base 5891c5b:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk.
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM in general. A couple of general comments:

- The vector forms of `coalescedDevice_` and `coalescedComm_` come from the time when multi-device-per-process was still supported. We did not refactor these two variables to single-device form (for no specific reason).
- While the code change looks good, it seems to have a trade-off: instead of a single check site in `endCoalescing`, like `assert coalescedComms_.size() == 1`, we now need a `TORCH_CHECK(coalescedComm_ == ncclComm, MULTI_DEVICE_ERROR_MSG);` at every place the member gets changed (see the sketch below).
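To make that trade-off concrete, here is a minimal sketch of the two validation strategies. This is not the actual ProcessGroupNCCL code: `recordColl` and the placeholder `NCCLComm` struct are invented names used only to show where the check lives in each form.

```cpp
// Minimal sketch only: simplified stand-ins for ProcessGroupNCCL members.
#include <algorithm>
#include <memory>
#include <stdexcept>
#include <vector>

struct NCCLComm {};  // placeholder for the real c10d NCCL communicator

// (a) Vector form: collectives only record their comm; the single
//     validation site is endCoalescing().
struct VectorForm {
  std::vector<std::shared_ptr<NCCLComm>> coalescedComms_;

  void recordColl(const std::shared_ptr<NCCLComm>& comm) {
    // No validation here; just remember each distinct comm.
    if (std::find(coalescedComms_.begin(), coalescedComms_.end(), comm) ==
        coalescedComms_.end()) {
      coalescedComms_.push_back(comm);
    }
  }

  void endCoalescing() {
    // One check site: all coalesced collectives must share a single comm.
    if (coalescedComms_.size() != 1) {
      throw std::runtime_error("multi-device coalescing is not supported");
    }
    coalescedComms_.clear();
  }
};

// (b) Scalar form (this PR): the check moves to every mutation site.
struct ScalarForm {
  std::shared_ptr<NCCLComm> coalescedComm_;

  void recordColl(const std::shared_ptr<NCCLComm>& comm) {
    if (coalescedComm_ == nullptr) {
      coalescedComm_ = comm;
    } else if (coalescedComm_ != comm) {
      // Same condition as TORCH_CHECK(coalescedComm_ == ncclComm, ...),
      // repeated wherever the member is assigned.
      throw std::runtime_error("multi-device coalescing is not supported");
    }
  }

  void endCoalescing() {
    coalescedComm_.reset();  // nothing left to validate here
  }
};
```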
if (coalescing_state_ & CoalActive) {
  coalescing_state_ |= CoalColl;
  if (coalescedDevice_.index() < 0) {
    coalescedDevice_ = device;
  } else {
    TORCH_CHECK(
        coalescedDevice_.index() == device.index(), MULTI_DEVICE_ERROR_MSG);
  }
  if (coalescedComm_ == nullptr) {
    coalescedComm_ = ncclComm;
  } else {
    TORCH_CHECK(coalescedComm_ == ncclComm, MULTI_DEVICE_ERROR_MSG);
  }
}
Curious, why would we need to add this code in `collectiveCoalesced`? Coalescing in this function is done via `torch::cuda::nccl::AutoNcclGroup nccl_group_guard`, rather than via `coalescing_state_`.
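For context, `AutoNcclGroup` is an RAII guard around NCCL's group calls. Below is a rough sketch of the pattern it embodies; the real `torch::cuda::nccl::AutoNcclGroup` has a different constructor and error handling, and only `ncclGroupStart`/`ncclGroupEnd` are real NCCL API.

```cpp
#include <nccl.h>

// Sketch of the RAII group-guard idea, not the actual AutoNcclGroup class.
class GroupGuard {
 public:
  GroupGuard() {
    // Every NCCL call issued while the guard is alive is batched into a
    // single grouped launch.
    ncclGroupStart();
  }
  ~GroupGuard() {
    ncclGroupEnd();
  }
  GroupGuard(const GroupGuard&) = delete;
  GroupGuard& operator=(const GroupGuard&) = delete;
};
```

So the grouped launch in `collectiveCoalesced` works without touching `coalescing_state_`; the question is whether the extra bookkeeping is still needed for other reasons (see the reply below).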
Would it be okay to investigate this addition in a second issue or PR, if it is really needed? That way we keep this PR a 1:1 refactor.
Okay, I will make it a 1:1 refactor. The reason we need this in `collectiveCoalesced` is that it fixes the wait behavior. Without it, `cm.works` is just an empty list:

with dist._coalescing_manager(group=pg_nccl, device=device, async_ops=True) as cm:
    dist.all_reduce(a)
    dist.all_reduce(b)
print(len(cm.works))  # prints 0
cm.wait()

So `cm.wait()` is a no-op, which is clearly not what was intended.
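A rough sketch of the control flow being described, with simplified names (`Work`, `PG`, and the state handling are stand-ins, not the actual `ProcessGroupNCCL` implementation): if `collectiveCoalesced` never marks the coalescing state, `endCoalescing` has nothing to hand back, so the coalescing manager collects no works and `cm.wait()` returns immediately.

```cpp
#include <memory>

struct Work {};  // stand-in for the c10d work object that wait() blocks on

struct PG {
  static constexpr int CoalActive = 1, CoalColl = 2;
  int coalescing_state_ = 0;

  void startCoalescing() { coalescing_state_ |= CoalActive; }

  void collectiveCoalesced() {
    // The addition under discussion: record that a collective actually
    // ran inside the coalescing region.
    if (coalescing_state_ & CoalActive) {
      coalescing_state_ |= CoalColl;
    }
  }

  std::shared_ptr<Work> endCoalescing() {
    const bool sawCollective = (coalescing_state_ & CoalColl) != 0;
    coalescing_state_ = 0;
    if (!sawCollective) {
      // Nothing was recorded: the caller gets no Work, cm.works stays
      // empty, and cm.wait() has nothing to block on.
      return nullptr;
    }
    return std::make_shared<Work>();
  }
};
```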
Filed an issue #122807. Will create a PR after this one is merged.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Merge failed. Reason: 2 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
The CI failures are unrelated, @pytorchbot merge -i
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 3 checks: pull / linux-jammy-py3.8-gcc11 / test (default, 2, 3, linux.2xlarge), pull / linux-jammy-py3.8-gcc11 / test (default, 3, 3, linux.2xlarge), pull / linux-focal-py3_8-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable)
The `coalescedDevice_` and `coalescedComms_` members are used inefficiently and, in the case of consecutive coalescing comms, can cause a read-before-write condition.

Pull Request resolved: pytorch#122651
Approved by: https://github.com/kwen2501, https://github.com/eqy

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang