[3/N] Set correct device to CUDA guards by kwen2501 · Pull Request #134357 · pytorch/pytorch · GitHub

[3/N] Set correct device to CUDA guards #134357

Closed
wants to merge 8 commits

Conversation

Contributor
@kwen2501 commented Aug 23, 2024

Stack from ghstack (oldest at bottom):

In collective(), pointToPoint() and collectiveCoalesced(), CUDA guards were created with an unset (default) CUDA device. This was the cause of the illegal memory access (IMA) hit by the NaN checker in issue #134062.

With this fix, torch.cuda.set_device(device) is no longer needed to work around the IMA.

Also refactored a couple of places where the guard is created -- preferably, we create the guard with a known device rather than setting the device later.
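
For illustration, here is a minimal sketch of the pattern described above (assumptions: this is not the PR's exact diff, and the helper name launchOnDevice is made up; ProcessGroupNCCL.cpp spells the guard at::cuda::OptionalCUDAGuard). The point is constructing the CUDA guard from the device the collective already knows, instead of default-constructing it and relying on whatever device the caller left current.

```cpp
#include <c10/core/Device.h>
#include <c10/cuda/CUDAGuard.h>

// Sketch only: `device` stands in for the CUDA device that collective(),
// pointToPoint() and collectiveCoalesced() already receive.
void launchOnDevice(const c10::Device& device) {
  // Before (problematic): a default-constructed guard does not switch the
  // current CUDA device, so auxiliary kernels such as the NaN checker run on
  // whatever device is current -- typically device 0 unless the user called
  // torch.cuda.set_device(device) beforehand.
  //
  //   c10::cuda::OptionalCUDAGuard gpuGuard;
  //
  // After (the fix described above): construct the guard with the known
  // device so everything launched in this scope targets the right GPU.
  c10::cuda::OptionalCUDAGuard gpuGuard(device);

  // ... enqueue NCCL work and NaN-check kernels here ...
}
```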

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot bot commented Aug 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134357

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 2ba49ab with merge base 0dbc728:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels on Aug 23, 2024
kwen2501 added a commit that referenced this pull request Aug 23, 2024
ghstack-source-id: 94c9997
Pull Request resolved: #134357
kwen2501 requested a review from shuqiangzhang on August 23, 2024 22:33
@@ -2132,7 +2129,9 @@ std::shared_ptr<NCCLComm> ProcessGroupNCCL::getNCCLComm(
<< timerDeltaMs << " ms";
}

at::cuda::OptionalCUDAGuard gpuGuard;
// Get the device index
auto deviceIndex = device.index();
Contributor

Nit: do we need the deviceIndex var? Looks like we just pass device to the guard directly.

OK, probably we are using it somewhere; you moved the decl from below to above.
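
As a side note, a minimal sketch of the two spellings this nit contrasts (assumed, not the PR's actual code): passing the device straight to the guard versus keeping a deviceIndex local because later code in the function reuses the index.

```cpp
#include <c10/core/Device.h>
#include <c10/cuda/CUDAGuard.h>

void guardSpellings(const c10::Device& device) {
  // Spelling 1: no local variable; the guard is constructed from the device.
  {
    c10::cuda::OptionalCUDAGuard gpuGuard(device);
  }
  // Spelling 2: keep the index in a local because code further down the
  // function wants it too (which is why the declaration was moved up).
  {
    auto deviceIndex = device.index();
    c10::cuda::OptionalCUDAGuard gpuGuard(deviceIndex);
    // ... deviceIndex reused later, e.g. for logging or bookkeeping ...
  }
}
```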

Contributor
@wconstab left a comment

Oh I forgot to demand tests before I stamp. Please add tests :)

@@ -1103,11 +1103,8 @@ void ProcessGroupNCCL::abortCommsFromMap(
for (auto& it : ncclCommsMap) {
auto& devName = it.first;
auto& ncclComm = it.second;
at::cuda::OptionalCUDAGuard gpuGuard;
at::DeviceIndex deviceIndex = getIndexFromDeviceKey(devName);
Contributor
@shuqiangzhang commented Aug 26, 2024

For P2P comms, the deviceIndex could be -1 (invalid), since the keys in the map may not be device indices but rank-to-rank numbers. So we indeed need to check that deviceIndex >= 0.

Contributor Author

Thanks for the information. Let me revert this change here and add the above comments into the code.
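
For context, a minimal self-contained sketch of the check being kept (assumptions: indexFromDeviceKey and abortOneComm are hypothetical stand-ins for ProcessGroupNCCL's getIndexFromDeviceKey() and the abortCommsFromMap() loop body; the key formats follow the reviewer's description above):

```cpp
#include <string>

#include <c10/core/Device.h>
#include <c10/cuda/CUDAGuard.h>

// Hypothetical stand-in: collective comms are keyed by a bare device index
// (e.g. "0"), while P2P comms are keyed by a rank pair (e.g. "3:7"), from
// which no device index can be recovered, so -1 is returned.
c10::DeviceIndex indexFromDeviceKey(const std::string& devName) {
  if (devName.find(':') != std::string::npos) {
    return -1;
  }
  return static_cast<c10::DeviceIndex>(std::stoi(devName));
}

// Hypothetical stand-in for one iteration of the abortCommsFromMap() loop.
void abortOneComm(const std::string& devName /*, NCCLComm& ncclComm */) {
  c10::cuda::OptionalCUDAGuard gpuGuard;
  const auto deviceIndex = indexFromDeviceKey(devName);
  // For P2P keys the index is -1, so only pin the device when it is valid;
  // this is the deviceIndex >= 0 check the reviewer asked to keep.
  if (deviceIndex >= 0) {
    gpuGuard.set_index(deviceIndex);
  }
  // ... abort the NCCL communicator under the guard ...
}
```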

kwen2501 added a commit that referenced this pull request Aug 26, 2024
ghstack-source-id: 962b1ae
Pull Request resolved: #134357
kwen2501 added a commit that referenced this pull request Aug 26, 2024
ghstack-source-id: 97a5d15
Pull Request resolved: #134357
kwen2501 added a commit that referenced this pull request Aug 26, 2024
ghstack-source-id: e87bab4
Pull Request resolved: #134357
kwen2501 added a commit that referenced this pull request Aug 26, 2024
ghstack-source-id: f1a5b94
Pull Request resolved: #134357
@kwen2501
Contributor Author

> Oh I forgot to demand tests before I stamp. Please add tests :)

Testing with multiple colls now.

kwen2501 added a commit that referenced this pull request Aug 26, 2024
ghstack-source-id: f31d327
Pull Request resolved: #134357
Contributor
@shuqiangzhang left a comment

LGTM now, thanks for the fix.

@skip_if_lt_x_gpu(2)
def test_nan_check(self):
    # Not expecting an error, NaN check should not make legit code fail
    os.environ["TORCH_NCCL_NAN_CHECK"] = "1"
Contributor

Should we also set CUDA_LAUNCH_BLOCKING=1 in this test? Previously, the test failed only when it was enabled.

Contributor Author

I think I've seen the test fail too with CUDA_LAUNCH_BLOCKING=0 (when the test has two all_reduce's).

kwen2501 added the topic: bug fixes label on Aug 26, 2024
@kwen2501
Contributor Author

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label on Aug 26, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


@pytorchmergebot
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 79ad95b954943b0af3d1416a0b500ebb83724b9a returned non-zero exit code 1

Auto-merging torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
CONFLICT (content): Merge conflict in torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
error: could not apply 79ad95b954... [2/N] Add flag to control which rank should perform NaN check (#134345)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Details for Dev Infra team: raised by workflow job.

@kwen2501
Contributor Author

@pytorchbot merge -f "previously landed, reverted base PR, re-open, CI all green, rebase, land again"


@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Aug 29, 2024
pytorchmergebot pushed a commit that referenced this pull request Aug 29, 2024
By using a zeros() tensor instead of empty() tensor.

Pull Request resolved: #134707
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134345, #134357, #134701
pytorch-bot bot pushed a commit that referenced this pull request Sep 13, 2024

Pull Request resolved: #134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134300, #134345
tolleybot pushed commits to tolleybot/pytorch that referenced this pull request Sep 14, 2024, including the revert:
This reverts commit afc76c6.
Reverted pytorch#134357 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](pytorch#134300 (comment)))
Chao1Han pushed commits to Chao1Han/pytorch that referenced this pull request Sep 20, 2024.
github-actions bot deleted the gh/kwen2501/52/head branch on October 3, 2024 02:05
Labels
ciflow/trunk · Merged · oncall: distributed · release notes: distributed (c10d) · Reverted · topic: bug fixes
5 participants