Correctly set `CUDAGuard` in nccl collectives by oraluben · Pull Request #130921 · pytorch/pytorch · GitHub

Correctly set CUDAGuard in nccl collectives #130921


Closed
wants to merge 4 commits

Conversation

oraluben
Contributor
@oraluben oraluben commented Jul 17, 2024

Fixes #130414

#119421 removed some CUDAGuard.set_index calls, leaving some gpuGuard instances initialized but never updated.
That adds an implicit call to cudaSetDevice(0) on other ranks, which leads to #130414.
This PR adds those calls back.

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
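As background for the description above, here is a minimal, hedged sketch of the pattern this PR restores (not the literal diff; `run_collective_on` and its `device` argument are placeholders). A default-constructed `OptionalCUDAGuard` switches nothing, so the per-rank device index has to be set on it explicitly before any CUDA/NCCL call that could otherwise initialize device 0. ProcessGroupNCCL spells the type `at::cuda::OptionalCUDAGuard`; the `c10` spelling below refers to the same guard mechanism.

```cpp
// Sketch only: illustrates setting the guard to this rank's device.
#include <c10/core/Device.h>
#include <c10/cuda/CUDAGuard.h>

void run_collective_on(c10::Device device) {  // hypothetical helper
  c10::cuda::OptionalCUDAGuard gpuGuard;      // inert until given an index
  gpuGuard.set_index(device.index());         // the kind of call #119421 dropped
  // ... enqueue NCCL/CUDA work here; it now targets this rank's device
  // instead of implicitly initializing device 0 ...
}
```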

pytorch-bot bot commented Jul 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130921

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit e7cc8cf with merge base 2300bb2:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@oraluben oraluben changed the title from "Correctly set CUDAGuard in PGNCCL" to "Correctly set CUDAGuard in nccl collectives" Jul 17, 2024
@fduwjj
Contributor
fduwjj commented Jul 17, 2024

I guess set_cuda_device is already the recommended way to solve this issue. We might also need to update the warning messages to reflect that.

@soulitzer soulitzer requested review from kwen2501 and wconstab July 18, 2024 00:51
@soulitzer soulitzer added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jul 18, 2024
@oraluben
Contributor Author
oraluben commented Jul 18, 2024

I guess set_cuda_device is already the recommended way to solve this issue.

That's true. But we used to have a (currently broken) heuristic that tries to do that when the code does not set the device explicitly, and I'd prefer to fix the heuristic rather than leave it broken and say "this is not the recommended way".

Also, the warning is not printed with the default config. I added a commit to print it without extra env vars.

@@ -3912,13 +3910,13 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::barrier(const BarrierOptions& opts) {
// ensure that each process is on a different GPU
auto numGPUs = at::cuda::getNumGPUs();
int16_t deviceIdx = static_cast<int16_t>(rank_ % numGPUs);
-  LOG(INFO)
+  LOG(WARNING)
Contributor

I think this change makes sense to me.
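For context, a hedged sketch of the heuristic around the changed log line (simplified; `guessBarrierDevice` is a made-up helper, not the real barrier code). The point of switching from LOG(INFO) to LOG(WARNING) is visibility: with default settings, INFO-level c10 logs are usually suppressed unless the log level is lowered (e.g. via TORCH_CPP_LOG_LEVEL=INFO), so the device-guessing notice was effectively invisible.

```cpp
// Sketch of the barrier device heuristic around the changed log line.
#include <c10/util/Logging.h>
#include <cstdint>

int16_t guessBarrierDevice(int rank, int numGPUs) {  // hypothetical helper
  // Mirror of the heuristic shown in the hunk: spread ranks across GPUs.
  auto deviceIdx = static_cast<int16_t>(rank % numGPUs);
  // WARNING is shown under the default log level, so users actually see
  // that a device was guessed rather than explicitly set.
  LOG(WARNING) << "No device id passed to barrier(); guessing device "
               << deviceIdx << " for rank " << rank;
  return deviceIdx;
}
```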

@fduwjj
Contributor
fduwjj commented Jul 18, 2024

@oraluben I am not sure that bringing that heuristic back is what we plan to do. cc: @H-Huang

@fduwjj fduwjj requested review from H-Huang and shuqiangzhang July 18, 2024 15:55
@oraluben
Contributor Author

Any update on this?

@@ -2752,8 +2752,6 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::collectiveCoalesced(
std::make_shared<std::vector<at::Tensor>>(inputs);
}

-  at::cuda::OptionalCUDAGuard gpuGuard;

Contributor

Why is this guard being removed instead of updated to device?

Contributor Author

The AutoNcclGroup below does that

@@ -2591,7 +2591,7 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::collective(
work->stashed_for_allocator_safety_->push_back(input);
}

at::cuda::OptionalCUDAGuard gpuGuard;
Contributor

Is it the case that gpuGuard without index is not effective? Or does it default to device 0?
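As background for the question above (this is my reading of the guard semantics, not an answer from the thread): a default-constructed OptionalCUDAGuard is uninitialized and does not touch the current device at all; it only switches devices once set_index()/set_device() is called, and it restores the previous device on destruction. A minimal sketch:

```cpp
#include <c10/cuda/CUDAGuard.h>

void guard_semantics_sketch() {
  c10::cuda::OptionalCUDAGuard guard;  // uninitialized: current device unchanged
  // Any CUDA work here still runs on whatever device happens to be current
  // (often device 0 on a fresh process), which is the failure mode this PR targets.
  guard.set_index(1);                  // now the current device becomes cuda:1
}                                      // original device restored when guard is destroyed
```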

Contributor
@kwen2501 kwen2501 left a comment

Thanks for reporting the error and the contribution.
It seems the changes proposed here are now in main (fixed in a recent PR, #134357).
Still, appreciate your debugging!

@oraluben
Contributor Author

@kwen2501 thanks for the review anyway. I'd have appreciated it if you could have reviewed this earlier; that would have saved your time as well.

@oraluben oraluben closed this Sep 23, 2024
@kwen2501
Contributor

Sorry I was on leave.

@oraluben oraluben deleted the guard-collective branch October 23, 2024 09:19
Labels
- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- open source
- release notes: distributed (c10d) (release notes category)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Development

Successfully merging this pull request may close these issues.

[c10d] collective failure with single-device style (#119421) and nccl <= 2.12
6 participants