[1/N] Move NaN check onto NCCL stream by kwen2501 · Pull Request #134300 · pytorch/pytorch

Closed · wants to merge 7 commits

Conversation

kwen2501 (Contributor) commented Aug 23, 2024

Stack from ghstack (oldest at bottom):

This way, the lifetime of the tensor inspected by the NaN check is managed the same way as the tensors used by the NCCL, pre, and post kernels.
It also makes the NaN-check kernels show up on the NCCL stream line in visualizers. If they showed up on the compute line instead, users might be confused ("my code does not launch these kernels").

The check is therefore moved to after the point where we make the NCCL stream depend on the last compute kernel.

Also moved the declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.
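For illustration, here is a minimal C++ sketch of the intended ordering, under stated assumptions: `checkForNan` is declared here with a hypothetical stream parameter (the real signature may differ), and the helper `nanCheckOnNcclStream` is illustrative rather than the actual ProcessGroupNCCL code.

```cpp
// Sketch only: enqueue the NaN check on the NCCL stream *after* that stream has
// been made to depend on the last compute kernel, so the check's tensor lifetime
// handling and its trace timeline match the other kernels issued on that stream.
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDAStream.h>

// Hypothetical declaration mirroring this PR's checkForNan; the real signature may differ.
void checkForNan(const at::Tensor& tensor, c10::cuda::CUDAStream& stream);

void nanCheckOnNcclStream(const at::Tensor& input, c10::cuda::CUDAStream& ncclStream) {
  // `input` is assumed to be a CUDA tensor.
  // 1) Make the NCCL stream wait for the compute stream that produced `input`.
  at::cuda::CUDAEvent dep;
  dep.record(c10::cuda::getCurrentCUDAStream(input.device().index()));
  dep.block(ncclStream);

  // 2) Launch the NaN-check kernel on the NCCL stream, not on the compute stream.
  checkForNan(input, ncclStream);
}
```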

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

Differential Revision: D61957573

pytorch-bot (bot) commented Aug 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134300

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 7bf9615 with merge base 4655eb3:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added labels: oncall: distributed, release notes: distributed (c10d) (Aug 23, 2024)
kwen2501 added a commit that referenced this pull request Aug 23, 2024
ghstack-source-id: 477a8b0
Pull Request resolved: #134300
@@ -2667,6 +2664,10 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::collective(

at::cuda::OptionalCUDAGuard gpuGuard;
Contributor: Did you determine what the `gpuGuard` does and why it matters (or is it not important)?

Contributor: (I know `CUDAGuard(device_id)` can be used to set the device. What does it mean if you use `OptionalCUDAGuard` and do not set the device?)

kwen2501 (Contributor, Author): Likely just creating a CUDA context if there isn't one already. If no device id is given, the default device would be used.
See:
https://pytorch.org/cppdocs/api/structc10_1_1cuda_1_1_optional_c_u_d_a_guard.html
https://pytorch.org/cppdocs/api/structc10_1_1cuda_1_1_c_u_d_a_guard.html#structc10_1_1cuda_1_1_c_u_d_a_guard
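For illustration, a small sketch of the difference between the two guards, based on the docs linked above (function and variable names here are made up):

```cpp
#include <c10/cuda/CUDAGuard.h>

void guardExamples(c10::DeviceIndex dev) {
  {
    // CUDAGuard: switches the current device to `dev` immediately,
    // and restores the previous device when the guard goes out of scope.
    c10::cuda::CUDAGuard guard(dev);
    // ... kernels launched here run on `dev` ...
  }
  {
    // OptionalCUDAGuard with no device: does nothing until a device is set,
    // so the current device (and any kernels launched) stay wherever they were.
    c10::cuda::OptionalCUDAGuard maybeGuard;
    // ... current device unchanged here ...
    maybeGuard.set_index(dev);  // only now does it switch (and restore on destruction)
  }
}
```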

Contributor: Yes, we might need to set the device. For similar 'GPU0' bugs related to NCCL abort (#127363), we just set the context at the PTD level as a mitigation.

wconstab (Contributor):

I want the PR description updated to clarify:

  1. moving to the NCCL stream is a preference we think is better, not a requirement
  2. ensuring the NaN check is run on the proper GPU (via device guard) solves the bug

(I'm not sure whether what I said in 1 and 2 is even true, but I want your PR description to educate me about those details.)

wconstab (Contributor):

Can you also add tests? I think you can put a couple of tests on each PR that cover the changes that PR fixed.

kwen2501 (Contributor, Author):

  1. moving to the NCCL stream is a preference we think is better, not a requirement

Agree.

  2. ensuring the NaN check is run on the proper GPU (via device guard) solves the bug

Indeed, this PR doesn't fix anything related to which device the NaN kernel runs on. Maybe there will be a third PR.

Also, I wonder if

at::cuda::OptionalCUDAGuard gpuGuard;

is the cause of the additional context created on GPU 0 by all processes, as one sees when typing nvidia-smi.

kwen2501 pushed updated ghstack commits with the same description as above. [ghstack-poisoned]
kwen2501 (Contributor, Author) commented Aug 26, 2024

  Can you also add tests? I think you can put a couple of tests on each PR that cover the changes that PR fixed.

Relying on the original test for this PR, as this PR is a "cosmetic" change -- the pass requirement is that the original test continues to work.

shuqiangzhang (Contributor) left a review comment: seeing the test passing in the 3rd PR.

kwen2501 added the topic: not user facing label and removed the release notes: distributed (c10d) label (Aug 26, 2024)
kwen2501 (Contributor, Author): @pytorchbot merge

pytorch-bot added the ciflow/trunk label (Aug 26, 2024)
pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot (Collaborator):

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see the pytorch-bot wiki.

kwen2501 (Contributor, Author): @pytorchbot merge

pytorchmergebot (Collaborator): Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot pushed a commit that referenced this pull request Aug 27, 2024
Fixes #134062.
For example, in the case of broadcast / scatter, only the root rank should perform the NaN check.

Pull Request resolved: #134345
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134300
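For illustration, a hedged sketch of the rule this follow-up (#134345) introduces; the helper below is hypothetical, not the actual ProcessGroupNCCL code:

```cpp
// Sketch: for rooted collectives such as broadcast and scatter, only the root
// rank holds meaningful input data, so only the root rank runs the NaN check.
bool shouldRunNanCheck(bool nanCheckEnabled, int rank, int root, bool isRootedCollective) {
  if (!nanCheckEnabled) {
    return false;
  }
  if (isRootedCollective) {
    return rank == root;  // e.g. broadcast, scatter
  }
  return true;  // e.g. allreduce, allgather: every rank contributes inputs
}
```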
kwen2501 (Contributor, Author):

@kwen2501 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

kwen2501 (Contributor, Author): @pytorchbot merge

pytorchmergebot (Collaborator): Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).

kwen2501 added a commit that referenced this pull request Aug 29, 2024 (the same #134345 change as above; ghstack-source-id: a8f91cd).
pytorch-bot bot pushed a commit that referenced this pull request Sep 13, 2024 (the #134300 change above; approved by https://github.com/shuqiangzhang and https://github.com/wconstab).
malfet pushed a commit to aditew01/pytorch that referenced this pull request Sep 13, 2024 (the pytorch#134345 change above).
pytorch-bot bot pushed a commit that referenced this pull request Sep 13, 2024
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA (illegal memory access) hitting the NaN checker in issue #134062.

With this fix, `torch.cuda.set_device(device)` is no longer needed to work around the IMA.

Also refactored a couple of places where the guard is created -- preferably, we create the guard with a known device rather than setting the device later.

Pull Request resolved: #134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134300, #134345
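For illustration, a sketch of the pattern described above (#134357), assuming the guard can be constructed directly from the collective's device; the surrounding function is hypothetical, and ProcessGroupNCCL spells the guard `at::cuda::OptionalCUDAGuard`:

```cpp
#include <c10/cuda/CUDAGuard.h>

void collectiveBodySketch(c10::Device device) {
  // Before: a device-less guard; if nothing ever sets it, kernels (including the
  // NaN check) run on whatever device happens to be current, often GPU 0.
  //   at::cuda::OptionalCUDAGuard gpuGuard;
  //
  // After: bind the guard to the known communication device up front.
  c10::cuda::OptionalCUDAGuard gpuGuard(device);

  // ... the NaN check and NCCL enqueue now see the correct current device ...
}
```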
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024 (the pytorch#134345 change above).
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024 (the pytorch#134357 change above).
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024
This reverts commit afc76c6.

Reverted pytorch#134357 on behalf of https://github.com/kwen2501 due to breaking builds of MTIA ([comment](pytorch#134300 (comment)))
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024
This reverts commit 94caba4.

Reverted pytorch#134300 on behalf of https://github.com/kwen2501 due to breaking builds of MTIA ([comment](pytorch#134300 (comment)))
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024 (re-land of the pytorch#134300 change above; Differential Revision: [D61957573](https://our.internmc.facebook.com/intern/diff/D61957573)).
Chao1Han pushed a series of commits to Chao1Han/pytorch that referenced this pull request Sep 20, 2024 (mirroring the pytorch#134300, pytorch#134345, and pytorch#134357 changes, their reverts, and the re-land above).
github-actions bot deleted the gh/kwen2501/50/head branch October 3, 2024 02:05
Labels: ciflow/trunk, Merged, oncall: distributed, Reverted, topic: not user facing
Projects: None yet

4 participants