Description
🐛 Describe the bug
When running
python test/distributed/test_c10d_nccl.py -k test_nan_assert_float16 on an H100x2 platform,
the current nightly (and likely v2.5.0 RC) produces the following CUDA error:
It did not check return code, because:
With ghcr.io/pytorch/pytorch-nightly:2.5.0.dev20240818-cuda12.4-cudnn9-devel, the test did not generate any errors other than the failed assertion check.
Bisected to #134300 (cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @atalman @malfet )
i.e. this commit 3645634
To reproduce on a 2xGPU platform:
docker pull ghcr.io/pytorch/pytorch-nightly:2.6.0.dev20240918-cuda12.4-cudnn9-devel
clone pytorch and check out the above commit (3645634)
run:
python test/distributed/test_c10d_nccl.py -k test_nan_assert_float16
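When comparing the two images, it can help to record the process exit code alongside the output, so that a CUDA error and a plain assertion failure are easier to tell apart. The following is only a minimal sketch: `REPRO_CMD` and `run_and_report` are hypothetical names introduced here, not part of the PyTorch test suite, and the script assumes it is run from the root of the pytorch checkout inside the container.

```python
import subprocess
import sys

# Command from the report above; assumes the current directory is the
# root of the pytorch checkout. REPRO_CMD is a hypothetical name.
REPRO_CMD = [
    sys.executable,
    "test/distributed/test_c10d_nccl.py",
    "-k",
    "test_nan_assert_float16",
]


def run_and_report(cmd, timeout=None):
    """Run cmd and return (returncode, combined stdout/stderr text).

    Hypothetical helper for this report; it does not interpret the
    output, it only captures it together with the exit code.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout
```

Calling `run_and_report(REPRO_CMD)` in each container and comparing the returned exit code and captured text gives a side-by-side view of the two builds.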
Versions
Bisected to #134300 (cc @kwen2501 @atalman @malfet )
i.e. this commit 3645634