[1/N] Move NaN check onto NCCL stream by kwen2501 · Pull Request #134300 · pytorch/pytorch

Closed · wants to merge 7 commits

Conversation

kwen2501 (Contributor) commented Aug 23, 2024

Stack from ghstack (oldest at bottom):

This way, the lifetime of the tensor inspected by the NaN check is managed the same way as the tensors used by the NCCL, pre, and post kernels.
It also makes the NaN-check kernels show up on the NCCL stream line in visualizers. If they showed up on the compute line instead, users might be confused ("my code does not launch these kernels").

The check is therefore moved to after the point where we make the NCCL stream depend on the last compute kernel.

Also moved the declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.
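For illustration, here is a minimal C++ sketch of the intended ordering, under stated assumptions: `checkForNan` is declared here with a hypothetical stream parameter (the real signature may differ), and the helper `nanCheckOnNcclStream` is illustrative rather than the actual ProcessGroupNCCL code.

```cpp
// Sketch only: enqueue the NaN check on the NCCL stream *after* that stream has
// been made to depend on the last compute kernel, so the check's tensor lifetime
// handling and its trace timeline match the other kernels issued on that stream.
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDAStream.h>

// Hypothetical declaration mirroring this PR's checkForNan; the real signature may differ.
void checkForNan(const at::Tensor& tensor, c10::cuda::CUDAStream& stream);

void nanCheckOnNcclStream(const at::Tensor& input, c10::cuda::CUDAStream& ncclStream) {
  // `input` is assumed to be a CUDA tensor.
  // 1) Make the NCCL stream wait for the compute stream that produced `input`.
  at::cuda::CUDAEvent dep;
  dep.record(c10::cuda::getCurrentCUDAStream(input.device().index()));
  dep.block(ncclStream);

  // 2) Launch the NaN-check kernel on the NCCL stream, not on the compute stream.
  checkForNan(input, ncclStream);
}
```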

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

Differential Revision: D61957573

pytorch-bot (bot) commented Aug 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134300

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 7bf9615 with merge base 4655eb3:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added labels: oncall: distributed, release notes: distributed (c10d) (Aug 23, 2024)
kwen2501 added a commit that referenced this pull request Aug 23, 2024
ghstack-source-id: 477a8b0
Pull Request resolved: #134300
@@ -2667,6 +2664,10 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::collective(

at::cuda::OptionalCUDAGuard gpuGuard;
Contributor: Did you determine what the `gpuGuard` does and why it matters (or is it not important)?

Contributor: (I know `CUDAGuard(device_id)` can be used to set the device. What does it mean if you use `OptionalCUDAGuard` and do not set the device?)

kwen2501 (Contributor, Author): Likely just creating a CUDA context if there isn't one already. If no device id is given, the default device would be used.
See:
https://pytorch.org/cppdocs/api/structc10_1_1cuda_1_1_optional_c_u_d_a_guard.html
https://pytorch.org/cppdocs/api/structc10_1_1cuda_1_1_c_u_d_a_guard.html#structc10_1_1cuda_1_1_c_u_d_a_guard
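For illustration, a small sketch of the difference between the two guards, based on the docs linked above (function and variable names here are made up):

```cpp
#include <c10/cuda/CUDAGuard.h>

void guardExamples(c10::DeviceIndex dev) {
  {
    // CUDAGuard: switches the current device to `dev` immediately,
    // and restores the previous device when the guard goes out of scope.
    c10::cuda::CUDAGuard guard(dev);
    // ... kernels launched here run on `dev` ...
  }
  {
    // OptionalCUDAGuard with no device: does nothing until a device is set,
    // so the current device (and any kernels launched) stay wherever they were.
    c10::cuda::OptionalCUDAGuard maybeGuard;
    // ... current device unchanged here ...
    maybeGuard.set_index(dev);  // only now does it switch (and restore on destruction)
  }
}
```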

Contributor: Yes, we might need to set the device. For similar 'GPU0' bugs related to NCCL abort (#127363), we just set the context at the PTD level as a mitigation.

wconstab (Contributor):

I want the PR description updated to clarify:

  1. moving to the NCCL stream is a preference we think is better, not a requirement
  2. ensuring the NaN check is run on the proper GPU (via device guard) solves the bug

(I'm not sure whether what I said in 1 and 2 is even true, but I want your PR description to educate me about those details.)

wconstab (Contributor):

Can you also add tests? I think you can put a couple of tests on each PR that cover the changes that PR fixed.

kwen2501 (Contributor, Author):

  1. moving to the NCCL stream is a preference we think is better, not a requirement

Agree.

  2. ensuring the NaN check is run on the proper GPU (via device guard) solves the bug

Indeed, this PR doesn't fix anything related to which device the NaN kernel runs on. Maybe there will be a third PR.

Also, I wonder if

at::cuda::OptionalCUDAGuard gpuGuard;

is the cause of the additional context created on GPU 0 by all processes, as one sees when typing nvidia-smi.

kwen2501 pushed updated ghstack commits with the same description as above. [ghstack-poisoned]
kwen2501 (Contributor, Author) commented Aug 26, 2024

  Can you also add tests? I think you can put a couple of tests on each PR that cover the changes that PR fixed.

Relying on the original test for this PR, as this PR is a "cosmetic" change -- the pass requirement is that the original test continues to work.

shuqiangzhang (Contributor) left a review comment: seeing the test passing in the 3rd PR.

kwen2501 added the topic: not user facing label and removed the release notes: distributed (c10d) label (Aug 26, 2024)
kwen2501 (Contributor, Author): @pytorchbot merge

pytorch-bot added the ciflow/trunk label (Aug 26, 2024)
pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot (Collaborator):

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see the pytorch-bot wiki.

kwen2501 (Contributor, Author): @pytorchbot merge

pytorchmergebot (Collaborator): Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot pushed a commit that referenced this pull request Aug 27, 2024
Fixes #134062.
For example, in the case of broadcast / scatter, only the root rank should perform the NaN check.

Pull Request resolved: #134345
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134300
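For illustration, a hedged sketch of the rule this follow-up (#134345) introduces; the helper below is hypothetical, not the actual ProcessGroupNCCL code:

```cpp
// Sketch: for rooted collectives such as broadcast and scatter, only the root
// rank holds meaningful input data, so only the root rank runs the NaN check.
bool shouldRunNanCheck(bool nanCheckEnabled, int rank, int root, bool isRootedCollective) {
  if (!nanCheckEnabled) {
    return false;
  }
  if (isRootedCollective) {
    return rank == root;  // e.g. broadcast, scatter
  }
  return true;  // e.g. allreduce, allgather: every rank contributes inputs
}
```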
kwen2501 (Contributor, Author):

@kwen2501 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

kwen2501 (Contributor, Author): @pytorchbot merge

pytorchmergebot (Collaborator): Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).

kwen2501 added a commit that referenced this pull request Aug 29, 2024 (the same #134345 change as above; ghstack-source-id: a8f91cd).
pytorch-bot bot pushed a commit that referenced this pull request Sep 13, 2024 (the #134300 change above; approved by https://github.com/shuqiangzhang and https://github.com/wconstab).
malfet pushed a commit to aditew01/pytorch that referenced this pull request Sep 13, 2024 (the pytorch#134345 change above).
pytorch-bot bot pushed a commit that referenced this pull request Sep 13, 2024
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA (illegal memory access) hitting the NaN checker in issue #134062.

With this fix, `torch.cuda.set_device(device)` is no longer needed to work around the IMA.

Also refactored a couple of places where the guard is created -- preferably, we create the guard with a known device rather than setting the device later.

Pull Request resolved: #134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134300, #134345
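For illustration, a sketch of the pattern described above (#134357), assuming the guard can be constructed directly from the collective's device; the surrounding function is hypothetical, and ProcessGroupNCCL spells the guard `at::cuda::OptionalCUDAGuard`:

```cpp
#include <c10/cuda/CUDAGuard.h>

void collectiveBodySketch(c10::Device device) {
  // Before: a device-less guard; if nothing ever sets it, kernels (including the
  // NaN check) run on whatever device happens to be current, often GPU 0.
  //   at::cuda::OptionalCUDAGuard gpuGuard;
  //
  // After: bind the guard to the known communication device up front.
  c10::cuda::OptionalCUDAGuard gpuGuard(device);

  // ... the NaN check and NCCL enqueue now see the correct current device ...
}
```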
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024 (the pytorch#134345 change above).
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024 (the pytorch#134357 change above).
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024
This reverts commit afc76c6.

Reverted pytorch#134357 on behalf of https://github.com/kwen2501 due to breaking builds of MTIA ([comment](pytorch#134300 (comment)))
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024
This reverts commit 94caba4.

Reverted pytorch#134300 on behalf of https://github.com/kwen2501 due to breaking builds of MTIA ([comment](pytorch#134300 (comment)))
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024 (re-land of the pytorch#134300 change above; Differential Revision: [D61957573](https://our.internmc.facebook.com/intern/diff/D61957573)).
Chao1Han pushed a series of commits to Chao1Han/pytorch that referenced this pull request Sep 20, 2024 (mirroring the pytorch#134300, pytorch#134345, and pytorch#134357 changes, their reverts, and the re-land above).
github-actions bot deleted the gh/kwen2501/50/head branch October 3, 2024 02:05
Labels: ciflow/trunk, Merged, oncall: distributed, Reverted, topic: not user facing
Projects: None yet

4 participants