[1/N] Move NaN check onto NCCL stream #134300
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134300.
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 unrelated failure.) As of commit 7bf9615 with merge base 4655eb3. FLAKY: the following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -2667,6 +2664,10 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::collective(
at::cuda::OptionalCUDAGuard gpuGuard;
Did you determine what the `gpuGuard` does and why it matters (or is it not important)?
(I know `CUDAGuard(device_id)` can be used to set the device. What does it mean if you use `OptionalCUDAGuard` and do not set the device?)
Likely just creating a CUDA context if there isn't one already.
If no device id is given, the default device would be used.
See:
https://pytorch.org/cppdocs/api/structc10_1_1cuda_1_1_optional_c_u_d_a_guard.html
https://pytorch.org/cppdocs/api/structc10_1_1cuda_1_1_c_u_d_a_guard.html#structc10_1_1cuda_1_1_c_u_d_a_guard
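For readers following along, a minimal sketch of the difference between the two guard types (illustrative only, not the actual `ProcessGroupNCCL` code):

```cpp
// Illustrative sketch only -- not the actual ProcessGroupNCCL code.
#include <ATen/cuda/CUDAGuard.h>

void guardExample(at::Device device) {
  {
    // CUDAGuard must be given a device and switches to it immediately;
    // the previous device is restored when the guard goes out of scope.
    at::cuda::CUDAGuard guard(device);
    // ... kernels launched here target `device` ...
  }
  {
    // An OptionalCUDAGuard constructed without a device does nothing
    // until a device is set; if it is never set, whatever device is
    // current (device 0 on a fresh thread) stays in effect.
    at::cuda::OptionalCUDAGuard maybeGuard;
    maybeGuard.set_device(device); // only now does the switch happen
  }
}
```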
Yes, we might need to set the device. In similar 'GPU0' bugs related to NCCL abort (#127363), we just set the context at the PTD level as a mitigation.
I want the PR description updated to clarify this.
(I'm not sure whether what I said in 1 and 2 is even true, but I want your PR description to educate me about those details.)
Can you also add tests? I think you can put a couple of tests on each PR that cover the changes that PR fixed.
Agree
Indeed, this PR didn't fix anything related to the device on which the NaN kernel runs. Also, I wonder if this is the cause of the additional context on GPU 0 created by all processes when one types `nvidia-smi`.
So that the tensor's lifetime management is the same as the management built for the NCCL pre- and post-kernels. Also, on visualizers, the check kernels then show up on the NCCL stream line; otherwise, if they showed up on the compute line, users might get confused ("my code does not have these kernels"). The check is thus moved to after the point where we make the NCCL stream depend on the last compute kernel. Also moved the declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.
cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o
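A minimal sketch of the ordering this describes, assuming a hypothetical `checkForNan(tensor, stream)` launcher (the real code differs in details):

```cpp
// Sketch of the stream ordering described above; `checkForNan` is a
// hypothetical launcher, the rest is the standard ATen/c10 CUDA API.
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDAStream.h>

void enqueueNanCheck(const at::Tensor& input, c10::cuda::CUDAStream ncclStream) {
  // 1. Record an event on the current (compute) stream, after the last
  //    compute kernel that produced `input`.
  at::cuda::CUDAEvent done;
  done.record(c10::cuda::getCurrentCUDAStream(input.get_device()));

  // 2. Make the NCCL stream wait on that event, so everything enqueued
  //    on it afterwards is ordered after the producing compute kernel.
  done.block(ncclStream);

  // 3. Only now enqueue the NaN check on the NCCL stream: it appears on
  //    the NCCL stream line in profilers, and the tensor's lifetime can
  //    be managed exactly like the NCCL pre/post kernels.
  // checkForNan(input, ncclStream);  // hypothetical launcher
}
```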
Relying on the original test for this PR, as this PR is a "cosmetic" change -- the pass requirement is that the original test continues to work.
Seeing the test passing in the 3rd PR.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #134062. For example, in the case of broadcast / scatter, only the root rank should perform the NaN check.
Pull Request resolved: #134345 Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab ghstack dependencies: #134300
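The gist of that follow-up, as a hedged sketch (names are illustrative, not the exact PyTorch internals):

```cpp
// Illustrative helper: for rooted collectives (broadcast, scatter), only
// the root rank holds valid send data, so only it should run the check.
// Non-root buffers are outputs to be received and may be uninitialized,
// so checking them would read garbage.
bool shouldRunNanCheck(int myRank, int rootRank, bool isRootedCollective) {
  return !isRootedCollective || myRank == rootRank;
}
```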
@kwen2501 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: #134300 Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA (illegal memory access) facing the NaN checker in issue #134062. With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA. Also refactored a couple of places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later.
Pull Request resolved: #134357 Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang ghstack dependencies: #134300, #134345
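A hedged before/after sketch of that fix (simplified; the actual diff in ProcessGroupNCCL.cpp differs):

```cpp
// Simplified sketch of the guard-construction change.
#include <ATen/cuda/CUDAGuard.h>

void collectiveSketch(at::Device device) {
  // Before: a default-constructed guard left the device unset, so kernels
  // (e.g. the NaN check) could land on the wrong device -> IMA.
  //   at::cuda::OptionalCUDAGuard gpuGuard;
  //
  // After: construct the guard with the known communication device, so no
  // caller-side torch.cuda.set_device(device) workaround is needed.
  at::cuda::OptionalCUDAGuard gpuGuard(device);
  // ... enqueue NCCL work and the NaN check; both target `device` ...
}
```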
This reverts commit afc76c6. Reverted #134357 on behalf of https://github.com/kwen2501 due to breaking builds of MTIA (see comment on #134300).
…pytorch#134345)" This reverts commit 2fe7e33. Reverted pytorch#134345 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](pytorch#134300 (comment)))
This reverts commit 94caba4. Reverted #134300 on behalf of https://github.com/kwen2501 due to breaking builds of MTIA (see comment on #134300).
Differential Revision: [D61957573](https://our.internmc.facebook.com/intern/diff/D61957573) Pull Request resolved: #134300 Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
Stack from ghstack (oldest at bottom):
So that the tensor's lifetime management is the same as the management built for the NCCL pre- and post-kernels.
Also, on visualizers, the check kernels then show up on the NCCL stream line. Otherwise, if they showed up on the compute line, users might get confused ("my code does not have these kernels").
The check is thus moved to after the point where we make the NCCL stream depend on the last compute kernel.
Also moved the declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.
cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
Differential Revision: D61957573