[c10d] Turn off default non-blocking API mode to work around hang in … · pytorch/pytorch@87fc5af · GitHub

Commit 87fc5af

kwen2501 authored and pytorchmergebot committed
[c10d] Turn off default non-blocking API mode to work around hang in NCCL 2.26 (#154055)
Work around issues like #153960, #152623.

NCCL 2.26 seems to introduce a random hang in non-blocking API mode. This PR opts out of non-blocking mode to work around it. Previously, torch turned it on by default in eager init (i.e. when `device_id` is passed) to avoid init overhead.

Pull Request resolved: #154055
Approved by: https://github.com/atalman
1 parent fae6f6c commit 87fc5af

1 file changed: +6 -3 lines changed

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

Lines changed: 6 additions & 3 deletions
@@ -1102,9 +1102,12 @@ bool ProcessGroupNCCL::useNonblocking() {
     useNonblocking_ = nbEnv;
   }
   // 3rd priority: automatically use nonblocking if we are in eager init mode
-  else if (getBoundDeviceId()) {
-    useNonblocking_ = true;
-  }
+  // Note: this automatic selection is disabled in torch 2.7.1 to work around a
+  // hang in NCCL 2.26 in non-blocking mode. We can revisit if NCCL fixes the
+  // bug. See https://github.com/pytorch/pytorch/issues/153960
+  // else if (getBoundDeviceId()) {
+  //   useNonblocking_ = true;
+  // }
   // 4th priority: otherwise, nonblocking = false to preserve old behavior
   else {
     useNonblocking_ = false;
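
The effect of the hunk above on the selection order can be modelled with a small stand-alone sketch. This is not the PyTorch source: `resolveNonblocking` and all of its parameters are hypothetical names chosen for illustration, and the first-priority explicit setting is an assumption, since the hunk shows only the environment-derived value (`nbEnv`), the eager-init branch, and the final fallback. The flag `autoEnableOnEagerInit` stands in for the branch that this commit comments out.

```cpp
#include <optional>

// Hypothetical model of the priority chain; names are illustrative only.
// explicitSetting: an explicit per-group choice, if any (assumed 1st priority).
// envSetting:      a value taken from the environment, if set (2nd priority).
// eagerInit:       true when a device was bound at init (`device_id` passed).
// autoEnableOnEagerInit: true models the old behavior, false models this commit.
bool resolveNonblocking(std::optional<bool> explicitSetting,
                        std::optional<bool> envSetting,
                        bool eagerInit,
                        bool autoEnableOnEagerInit) {
  if (explicitSetting.has_value()) {
    return *explicitSetting;  // an explicit choice always wins
  }
  if (envSetting.has_value()) {
    return *envSetting;       // next, the environment override
  }
  if (autoEnableOnEagerInit && eagerInit) {
    return true;              // 3rd priority, disabled by this commit
  }
  return false;               // 4th priority: blocking mode
}
```

Under this model, a caller that passes `device_id` and sets nothing else gets non-blocking mode before the change and blocking mode after it, while an explicit or environment-provided setting is honored either way.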
