[3/N] Set correct device to CUDA guards by kwen2501 · Pull Request #134357 · pytorch/pytorch · GitHub

[3/N] Set correct device to CUDA guards #134357

Closed
wants to merge 8 commits

Conversation

Contributor
@kwen2501 commented Aug 23, 2024

Stack from ghstack (oldest at bottom):

In collective(), pointToPoint() and collectiveCoalesced(), CUDA guards were created with an unset (default) CUDA device. This was the cause of the illegal memory access (IMA) hit by the NaN checker in issue #134062.

With this fix, torch.cuda.set_device(device) is no longer needed to work around the IMA.

Also refactored a couple of places where the guard is created -- preferably, we create the guard with a known device rather than setting the device later.
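
For illustration, here is a minimal sketch of the pattern described above (assumptions: this is not the PR's exact diff, and the helper name launchOnDevice is made up; ProcessGroupNCCL.cpp spells the guard at::cuda::OptionalCUDAGuard). The point is constructing the CUDA guard from the device the collective already knows, instead of default-constructing it and relying on whatever device the caller left current.

```cpp
#include <c10/core/Device.h>
#include <c10/cuda/CUDAGuard.h>

// Sketch only: `device` stands in for the CUDA device that collective(),
// pointToPoint() and collectiveCoalesced() already receive.
void launchOnDevice(const c10::Device& device) {
  // Before (problematic): a default-constructed guard does not switch the
  // current CUDA device, so auxiliary kernels such as the NaN checker run on
  // whatever device is current -- typically device 0 unless the user called
  // torch.cuda.set_device(device) beforehand.
  //
  //   c10::cuda::OptionalCUDAGuard gpuGuard;
  //
  // After (the fix described above): construct the guard with the known
  // device so everything launched in this scope targets the right GPU.
  c10::cuda::OptionalCUDAGuard gpuGuard(device);

  // ... enqueue NCCL work and NaN-check kernels here ...
}
```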

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot bot commented Aug 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134357

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 2ba49ab with merge base 0dbc728:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels on Aug 23, 2024
kwen2501 added a commit that referenced this pull request Aug 23, 2024
ghstack-source-id: 94c9997
Pull Request resolved: #134357
kwen2501 requested a review from shuqiangzhang on August 23, 2024 22:33
@@ -2132,7 +2129,9 @@ std::shared_ptr<NCCLComm> ProcessGroupNCCL::getNCCLComm(
<< timerDeltaMs << " ms";
}

at::cuda::OptionalCUDAGuard gpuGuard;
// Get the device index
auto deviceIndex = device.index();
Contributor

Nit: do we need the deviceIndex var? Looks like we just pass device to the guard directly.

OK, probably we are using it somewhere; you moved the decl from below to above.
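
As a side note, a minimal sketch of the two spellings this nit contrasts (assumed, not the PR's actual code): passing the device straight to the guard versus keeping a deviceIndex local because later code in the function reuses the index.

```cpp
#include <c10/core/Device.h>
#include <c10/cuda/CUDAGuard.h>

void guardSpellings(const c10::Device& device) {
  // Spelling 1: no local variable; the guard is constructed from the device.
  {
    c10::cuda::OptionalCUDAGuard gpuGuard(device);
  }
  // Spelling 2: keep the index in a local because code further down the
  // function wants it too (which is why the declaration was moved up).
  {
    auto deviceIndex = device.index();
    c10::cuda::OptionalCUDAGuard gpuGuard(deviceIndex);
    // ... deviceIndex reused later, e.g. for logging or bookkeeping ...
  }
}
```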

Contributor
@wconstab left a comment

Oh I forgot to demand tests before I stamp. Please add tests :)

@@ -1103,11 +1103,8 @@ void ProcessGroupNCCL::abortCommsFromMap(
for (auto& it : ncclCommsMap) {
auto& devName = it.first;
auto& ncclComm = it.second;
at::cuda::OptionalCUDAGuard gpuGuard;
at::DeviceIndex deviceIndex = getIndexFromDeviceKey(devName);
Contributor
@shuqiangzhang commented Aug 26, 2024

For P2P comms, the deviceIndex could be -1 (invalid), since the keys in the map may not be device indices but rank-to-rank numbers. So we indeed need to check that deviceIndex >= 0.

Contributor Author

Thanks for the information. Let me revert this change here and add the above comments into the code.
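
For context, a minimal self-contained sketch of the check being kept (assumptions: indexFromDeviceKey and abortOneComm are hypothetical stand-ins for ProcessGroupNCCL's getIndexFromDeviceKey() and the abortCommsFromMap() loop body; the key formats follow the reviewer's description above):

```cpp
#include <string>

#include <c10/core/Device.h>
#include <c10/cuda/CUDAGuard.h>

// Hypothetical stand-in: collective comms are keyed by a bare device index
// (e.g. "0"), while P2P comms are keyed by a rank pair (e.g. "3:7"), from
// which no device index can be recovered, so -1 is returned.
c10::DeviceIndex indexFromDeviceKey(const std::string& devName) {
  if (devName.find(':') != std::string::npos) {
    return -1;
  }
  return static_cast<c10::DeviceIndex>(std::stoi(devName));
}

// Hypothetical stand-in for one iteration of the abortCommsFromMap() loop.
void abortOneComm(const std::string& devName /*, NCCLComm& ncclComm */) {
  c10::cuda::OptionalCUDAGuard gpuGuard;
  const auto deviceIndex = indexFromDeviceKey(devName);
  // For P2P keys the index is -1, so only pin the device when it is valid;
  // this is the deviceIndex >= 0 check the reviewer asked to keep.
  if (deviceIndex >= 0) {
    gpuGuard.set_index(deviceIndex);
  }
  // ... abort the NCCL communicator under the guard ...
}
```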

kwen2501 added a commit that referenced this pull request Aug 26, 2024
ghstack-source-id: 962b1ae
Pull Request resolved: #134357
kwen2501 added a commit that referenced this pull request Aug 26, 2024
ghstack-source-id: 97a5d15
Pull Request resolved: #134357
kwen2501 added a commit that referenced this pull request Aug 26, 2024
ghstack-source-id: e87bab4
Pull Request resolved: #134357
kwen2501 added a commit that referenced this pull request Aug 26, 2024
ghstack-source-id: f1a5b94
Pull Request resolved: #134357
@kwen2501
Contributor Author

> Oh I forgot to demand tests before I stamp. Please add tests :)

Testing with multiple colls now.

kwen2501 added a commit that referenced this pull request Aug 26, 2024
ghstack-source-id: f31d327
Pull Request resolved: #134357
Contributor
@shuqiangzhang left a comment

LGTM now, thanks for the fix.

@skip_if_lt_x_gpu(2)
def test_nan_check(self):
    # Not expecting an error, NaN check should not make legit code fail
    os.environ["TORCH_NCCL_NAN_CHECK"] = "1"
Contributor

Should we also set CUDA_LAUNCH_BLOCKING=1 in this test? Previously, the test failed only when it was enabled.

Contributor Author

I think I've seen the test fail too with CUDA_LAUNCH_BLOCKING=0 (when the test has two all_reduce's).

kwen2501 added the topic: bug fixes label on Aug 26, 2024
@kwen2501
Contributor Author

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label on Aug 26, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


@pytorchmergebot
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 79ad95b954943b0af3d1416a0b500ebb83724b9a returned non-zero exit code 1

Auto-merging torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
CONFLICT (content): Merge conflict in torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
error: could not apply 79ad95b954... [2/N] Add flag to control which rank should perform NaN check (#134345)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Details for Dev Infra team: raised by workflow job.

@kwen2501
Contributor Author

@pytorchbot merge -f "previously landed, reverted base PR, re-open, CI all green, rebase, land again"


@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Aug 29, 2024
pytorchmergebot pushed a commit that referenced this pull request Aug 29, 2024
By using a zeros() tensor instead of empty() tensor.

Pull Request resolved: #134707
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134345, #134357, #134701
pytorch-bot bot pushed a commit that referenced this pull request Sep 13, 2024

Pull Request resolved: #134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134300, #134345
tolleybot pushed commits to tolleybot/pytorch that referenced this pull request Sep 14, 2024, including the revert:
This reverts commit afc76c6.
Reverted pytorch#134357 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](pytorch#134300 (comment)))
Chao1Han pushed commits to Chao1Han/pytorch that referenced this pull request Sep 20, 2024.
github-actions bot deleted the gh/kwen2501/52/head branch on October 3, 2024 02:05
Labels
ciflow/trunk · Merged · oncall: distributed · release notes: distributed (c10d) · Reverted · topic: bug fixes
5 participants