[5/N] Reconcile barrier and NaN checker (#134707) · pytorch/pytorch@5470fcd · GitHub
Commit 5470fcd
kwen2501 authored and pytorchmergebot committed
[5/N] Reconcile barrier and NaN checker (#134707)
By using a zeros() tensor instead of empty() tensor.

Pull Request resolved: #134707
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134345, #134357, #134701
1 parent d91b49d commit 5470fcd

2 files changed, 4 additions and 1 deletion

test/distributed/test_c10d_nccl.py

Lines changed: 1 addition & 0 deletions
@@ -407,6 +407,7 @@ def test_nan_check(self):
         t = torch.ones(3, 4, dtype=torch.bfloat16, device=device)
         c10d.broadcast(x, src=0)
         c10d.all_reduce(t)
+        c10d.barrier()
         c10d.destroy_process_group()
         # reset env
         os.environ["TORCH_NCCL_NAN_CHECK"] = "0"

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

Lines changed: 3 additions & 1 deletion
@@ -4064,8 +4064,10 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::barrier(const BarrierOptions& opts) {
   auto barDevice = at::Device(at::DeviceType::CUDA, barDevIdx);
 
   // Create a dummy tensor on the device
+  // Note: we use zeros() instead of empty() to prevent barrier from triggering
+  // alarm when NaN checker is enabled.
   at::Tensor barrierTensor =
-      at::empty({1}, at::TensorOptions().device(barDevice).dtype(at::kFloat));
+      at::zeros({1}, at::TensorOptions().device(barDevice).dtype(at::kFloat));
 
   // All reduce to achieve the barrier
   auto work = allreduce_impl(barrierTensor);
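
The note in the hunk above says that an empty() tensor can set off the NaN checker. As a rough illustration of why, the sketch below uses only the Python standard library: uninitialized memory holds whatever bytes were left behind, and some of those bit patterns decode to NaN as float32, whereas all-zero bytes always decode to 0.0. The helper names `decode_float32` and `would_trip_nan_check` are hypothetical and are not part of PyTorch; the real check runs on the device inside ProcessGroupNCCL, so treat this as an analogy under those assumptions, not the actual implementation.

```python
import math
import struct


def decode_float32(raw: bytes) -> float:
    """Interpret 4 raw bytes as a little-endian IEEE 754 float32."""
    return struct.unpack("<f", raw)[0]


def would_trip_nan_check(raw: bytes) -> bool:
    """Hypothetical stand-in for a NaN check over a one-element float buffer."""
    return math.isnan(decode_float32(raw))


# What zeros() guarantees: every byte is 0, which decodes to 0.0.
zeroed = bytes(4)

# One possible leftover pattern in uninitialized memory (assumed for
# illustration): 0x7FC00000 is a quiet NaN in float32.
leftover = struct.pack("<I", 0x7FC00000)

print(would_trip_nan_check(zeroed))    # False
print(would_trip_nan_check(leftover))  # True
```

Whether a given empty() allocation actually contains a NaN pattern depends on what previously occupied that memory, so the failure would be intermittent; zero-initializing the barrier tensor removes that dependence.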
