[3/N] Set correct device to CUDA guards #134357
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134357. Note: links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Unrelated Failure as of commit 2ba49ab with merge base 0dbc728.
NEW FAILURE - The following job has failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -2132,7 +2129,9 @@ std::shared_ptr<NCCLComm> ProcessGroupNCCL::getNCCLComm(
        << timerDeltaMs << " ms";
  }

  at::cuda::OptionalCUDAGuard gpuGuard;
  // Get the device index
  auto deviceIndex = device.index();
Nit: do we need the deviceIndex var? Looks like we could just pass device to the guard directly.
OK, probably we are using it somewhere; you moved the decl from below to above.
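For illustration, a minimal sketch of what the nit suggests, with a hypothetical function name (not the PR's actual code): the guard accepts the `at::Device` directly, so a separate index variable is only needed if the index is reused elsewhere.

```cpp
#include <ATen/cuda/CUDAGuard.h>

// Hypothetical helper, for illustration only.
void useGuard(at::Device device) {
  // OptionalCUDAGuard can be constructed from the device directly and
  // makes it current for this scope; no separate deviceIndex variable
  // is required here.
  at::cuda::OptionalCUDAGuard gpuGuard(device);
  // auto deviceIndex = device.index();  // only if the index is needed later
}
```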
Oh I forgot to demand tests before I stamp. Please add tests :)
@@ -1103,11 +1103,8 @@ void ProcessGroupNCCL::abortCommsFromMap(
  for (auto& it : ncclCommsMap) {
    auto& devName = it.first;
    auto& ncclComm = it.second;
    at::cuda::OptionalCUDAGuard gpuGuard;
    at::DeviceIndex deviceIndex = getIndexFromDeviceKey(devName);
For P2P comms, the deviceIndex could be -1 (invalid), since the keys in the map may not be device indices but rank-to-rank numbers. So we do indeed need to check that deviceIndex >= 0.
Thanks for the information. Let me revert this change here and add the above comments into the code.
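A minimal sketch of the check discussed above, reusing the names from the diff (`devName`, `getIndexFromDeviceKey`); it illustrates the pattern rather than reproducing the verbatim PR code:

```cpp
// Inside the loop over ncclCommsMap shown in the diff above:
at::cuda::OptionalCUDAGuard gpuGuard;
// For P2P communicators the map key encodes rank-to-rank numbers rather
// than a device index, so getIndexFromDeviceKey() can return -1.
at::DeviceIndex deviceIndex = getIndexFromDeviceKey(devName);
if (deviceIndex >= 0) {
  // Only switch the current CUDA device when the key encodes a real index.
  gpuGuard.set_index(deviceIndex);
}
```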
Testing with multiple collectives now.
LGTM now, thanks for the fix.
@skip_if_lt_x_gpu(2)
def test_nan_check(self):
    # Not expecting an error, NaN check should not make legit code fail
    os.environ["TORCH_NCCL_NAN_CHECK"] = "1"
Should we also set CUDA_LAUNCH_BLOCKING=1 in this test? This test failed before only when it was enabled.
I think I've seen the test fail even with CUDA_LAUNCH_BLOCKING=0 (when the test has two all_reduce's).
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Command
Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -f "previously landed, reverted base PR, re-open, CI all green, rebase, land again"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: #134701 Approved by: https://github.com/wconstab ghstack dependencies: #134345, #134357
By using a zeros() tensor instead of empty() tensor. Pull Request resolved: #134707 Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab ghstack dependencies: #134345, #134357, #134701
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue #134062. With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA. Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later. Pull Request resolved: #134357 Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang ghstack dependencies: #134300, #134345
This reverts commit 13114da. Reverted pytorch#134357 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#134357 (comment)))
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue pytorch#134062. With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA. Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later. Pull Request resolved: pytorch#134357 Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang ghstack dependencies: pytorch#134300, pytorch#134345
This reverts commit afc76c6. Reverted pytorch#134357 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](pytorch#134300 (comment)))
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue pytorch#134062. With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA. Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later. Pull Request resolved: pytorch#134357 Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang ghstack dependencies: pytorch#134345
Pull Request resolved: pytorch#134701 Approved by: https://github.com/wconstab ghstack dependencies: pytorch#134345, pytorch#134357
By using a zeros() tensor instead of empty() tensor. Pull Request resolved: pytorch#134707 Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab ghstack dependencies: pytorch#134345, pytorch#134357, pytorch#134701
Stack from ghstack (oldest at bottom):

In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue #134062.

With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA.

Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later.

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
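To make the change concrete, here is a minimal, hypothetical before/after sketch of the guard construction described above (illustrative function names, not the actual ProcessGroupNCCL code):

```cpp
#include <ATen/cuda/CUDAGuard.h>

// Before: a default-constructed guard does not switch the current CUDA
// device, so helper kernels (e.g. the NaN check) may launch on whatever
// device happens to be current -- the source of the IMA in #134062.
void collectiveBefore(at::Device device) {
  at::cuda::OptionalCUDAGuard gpuGuard;
  // ... enqueue NCCL work and helper kernels intended for `device` ...
}

// After: constructing the guard with the collective's device makes it
// current for this scope (and restores the previous device on exit), so
// callers no longer need torch.cuda.set_device(device) as a workaround.
void collectiveAfter(at::Device device) {
  at::cuda::OptionalCUDAGuard gpuGuard(device);
  // ... enqueue NCCL work and helper kernels intended for `device` ...
}
```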