[c10d][NCCL] Refactor coalesced storage by Aidyn-A · Pull Request #122651 · pytorch/pytorch


Closed

Conversation

Aidyn-A
Collaborator
@Aidyn-A Aidyn-A commented Mar 25, 2024

The coalescedDevice_ and coalescedComms_ are used inefficiently and, in the case of consecutive coalescing comms, can lead to a read-before-write condition.
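For concreteness, here is a minimal, self-contained sketch of the refactored shape of the state, under assumed simplified names (NCCLCommStub and CoalescingStateSketch are illustrative stand-ins, not the actual ProcessGroupNCCL members): the per-group device/communicator becomes a single value that startCoalescing resets, so a later coalescing group can never observe state left over from an earlier one.

#include <memory>
#include <utility>

struct NCCLCommStub {};  // stand-in for the real per-device NCCL communicator

class CoalescingStateSketch {
 public:
  void startCoalescing() {
    // Reset the single-valued state for this group; the old vector-based
    // storage had no such reset point, so stale entries could be read later.
    device_ = -1;
    comm_.reset();
  }

  void recordCollective(int device, std::shared_ptr<NCCLCommStub> comm) {
    if (device_ < 0) {
      device_ = device;        // first collective in the group writes the state
      comm_ = std::move(comm);
    }
    // Later collectives are validated against the recorded device/comm
    // (see the TORCH_CHECKs in the diff quoted further down the thread).
  }

  std::shared_ptr<NCCLCommStub> endCoalescing() {
    // Either the communicator written by this group, or empty if the group
    // issued no collectives; never a leftover from a previous group.
    return comm_;
  }

 private:
  int device_ = -1;                     // was: a std::vector of devices
  std::shared_ptr<NCCLCommStub> comm_;  // was: a std::vector of comms
};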

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

@Aidyn-A Aidyn-A added the oncall: distributed, module: nccl, and release notes: distributed (c10d) labels Mar 25, 2024
@Aidyn-A Aidyn-A self-assigned this Mar 25, 2024
pytorch-bot bot commented Mar 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122651

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 3df4123 with merge base 5891c5b:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Aidyn-A Aidyn-A changed the title [WIP][c10d][NCCL] Refactor coalesced storage [c10d][NCCL] Refactor coalesced storage Mar 26, 2024
@Aidyn-A Aidyn-A marked this pull request as ready for review March 26, 2024 16:28
@Aidyn-A Aidyn-A marked this pull request as draft March 26, 2024 21:55
@Aidyn-A Aidyn-A marked this pull request as ready for review March 27, 2024 00:22
@Aidyn-A Aidyn-A requested review from kwen2501, eqy and wconstab March 27, 2024 00:22
Contributor
@kwen2501 kwen2501 left a comment

LGTM in general.

A couple of general comments:

  1. The vector forms of coalescedDevice_ and coalescedComm_ come from the time when multi-device-per-process was still supported. We did not refactor these two variables to single-device form (for no specific reason).
  2. While the code change looks good, it seems to have a trade-off: instead of a single check site in endCoalescing, like `assert coalescedComms_.size() == 1`, we now need `TORCH_CHECK(coalescedComm_ == ncclComm, MULTI_DEVICE_ERROR_MSG);` at every place the state gets changed (a sketch of the old single-check style follows below).
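For contrast, a minimal self-contained sketch of the single-check-site style described in point 2 (hypothetical simplified names, not the exact pre-refactor code):

#include <cassert>
#include <vector>

struct CommStub {};                     // stand-in for the NCCL communicator
std::vector<CommStub*> coalescedComms;  // appended to by each coalesced collective

CommStub* endCoalescingOldStyle() {
  assert(!coalescedComms.empty() && "coalescing group issued no collectives");
  for (CommStub* c : coalescedComms) {
    // Single check site: every op in the group must have used the same comm.
    assert(c == coalescedComms.front() && "multi-device coalescing unsupported");
  }
  CommStub* comm = coalescedComms.front();
  // Without this clear(), a later coalescing group could read this group's
  // leftover entries, i.e. the read-before-write hazard the PR fixes.
  coalescedComms.clear();
  return comm;
}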

Comment on lines 2529 to 2543
if (coalescing_state_ & CoalActive) {
  coalescing_state_ |= CoalColl;
  if (coalescedDevice_.index() < 0) {
    coalescedDevice_ = device;
  } else {
    TORCH_CHECK(
        coalescedDevice_.index() == device.index(), MULTI_DEVICE_ERROR_MSG);
  }
  if (coalescedComm_ == nullptr) {
    coalescedComm_ = ncclComm;
  } else {
    TORCH_CHECK(coalescedComm_ == ncclComm, MULTI_DEVICE_ERROR_MSG);
  }
}

Contributor

Curious, why would we need to add this code in collectiveCoalesced?

Coalescing in this function is done via torch::cuda::nccl::AutoNcclGroup nccl_group_guard, rather than coalescing_state_.
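For context, AutoNcclGroup is an RAII guard around NCCL's group-call API; a rough sketch of the pattern such a guard follows (illustrative struct name, not the actual torch::cuda::nccl implementation):

#include <nccl.h>

// Every NCCL call issued while a guard instance is alive is batched into one
// group and launched together when the guard goes out of scope.
struct NcclGroupGuardSketch {
  NcclGroupGuardSketch() { ncclGroupStart(); }
  ~NcclGroupGuardSketch() { ncclGroupEnd(); }
};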

Contributor

Would it be okay to investigate this addition in a second issue or PR, if really needed? That way we keep this PR as a 1:1 refactor.

Collaborator Author

Okay, I will make it a 1:1 refactor. The reason we need this in collectiveCoalesced is that it fixes the wait behavior. Without it, cm.works is just an empty list:

    # assumes a, b are CUDA tensors on `device` and pg_nccl is an NCCL process group
    with dist._coalescing_manager(group=pg_nccl, device=device, async_ops=True) as cm:
        dist.all_reduce(a)
        dist.all_reduce(b)

    print(len(cm.works)) # prints 0
    cm.wait()

So cm.wait() is a no-op, which is clearly not what was intended.

Collaborator Author

Filed an issue #122807. Will create a PR after this one is merged.

@Aidyn-A
Collaborator Author
Aidyn-A commented Mar 27, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Mar 27, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@Aidyn-A
Collaborator Author
Aidyn-A commented Mar 27, 2024

The CI failures are unrelated, @pytorchbot merge -i

@Aidyn-A
Collaborator Author
Aidyn-A commented Mar 27, 2024

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks: pull / linux-jammy-py3.8-gcc11 / test (default, 2, 3, linux.2xlarge), pull / linux-jammy-py3.8-gcc11 / test (default, 3, 3, linux.2xlarge), pull / linux-focal-py3_8-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
The `coalescedDevice_` and `coalescedComms_` are used inefficiently and, in the case of consecutive coalescing comms, can lead to a read-before-write condition.

Pull Request resolved: pytorch#122651
Approved by: https://github.com/kwen2501, https://github.com/eqy
Labels
ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: nccl (Problems related to nccl support), oncall: distributed (Add this issue/PR to distributed oncall triage queue), open source, release notes: distributed (c10d) (release notes category)