You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue #134062.
With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA.
Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later.
Pull Request resolved: #134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134345
0 commit comments