[c10d][nccl][cuda] Regression (unspecified CUDA launch error) with test_c10d_nccl · Issue #136390 · pytorch/pytorch

nWEIdia opened this issue Sep 20, 2024 · 12 comments
Labels: module: nccl (Problems related to nccl support), oncall: distributed (Add this issue/PR to distributed oncall triage queue)
nWEIdia (Collaborator) commented Sep 20, 2024

🐛 Describe the bug

When running
python test/distributed/test_c10d_nccl.py -k test_nan_assert_float16
on a 2x H100 platform, the current nightly (and likely the v2.5.0 RC) produces the following CUDA error:

[screenshot of the CUDA error output, attached in the original issue]

The test did not check the return code, because:
[screenshot attached in the original issue]

Tested with ghcr.io/pytorch/pytorch-nightly:2.5.0.dev20240818-cuda12.4-cudnn9-devel (an earlier nightly), the test did not generate any errors other than failing the assertion check.

Bisected to #134300 (cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @atalman @malfet )
i.e. this commit 3645634

To reproduce on a platform with 2 GPUs:

1. docker pull ghcr.io/pytorch/pytorch-nightly:2.6.0.dev20240918-cuda12.4-cudnn9-devel
2. Clone pytorch and check out the commit above (3645634).
3. Run: python test/distributed/test_c10d_nccl.py -k test_nan_assert_float16

Versions

Bisected to #134300 (cc @kwen2501 @atalman @malfet )
i.e. this commit 3645634

cc @eqy @Aidyn-A @ptrblck

eqy (Collaborator) commented Sep 21, 2024

Note that the forward fix for gpuGuard/DeviceGuard in #134357 doesn't seem to fix the issue.

The current working theory is that something goes wrong in the communicator abort/cleanup, as the following diff seems to make the problem go away:

diff --git a/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp b/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
index 878bb7c8be..297ed05318 100644
--- a/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
+++ b/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
@@ -2655,8 +2655,23 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::collective(
   op_id_++;

   auto device = getDevice(input);
+  at::cuda::OptionalCUDAGuard gpuGuard(device);
   const auto key = getKeyFromDevice(device);
+
+  if (nanCheck) {
+    auto currentStream = at::cuda::getCurrentCUDAStream(device.index());
+
+    checkForNan(input, currentStream);
+  }
+
   auto ncclComm = getNCCLComm(key, device, opType);
+  // Used many times below, so we stash the unordered_map lookup
+  auto ncclStream = ncclStreams_.at(key);
+
+  // First let NCCL streams wait for input tensors allocation streams
+  syncStream(device, ncclEvents_[key], ncclStream);
+
+

   if (coalescing_state_ & CoalActive) {
     coalescing_state_ |= CoalColl;
@@ -2673,12 +2688,6 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::collective(
     }
   }

-  // Used many times below, so we stash the unordered_map lookup
-  auto ncclStream = ncclStreams_.at(key);
-
-  // First let NCCL streams wait for input tensors allocation streams
-  syncStream(device, ncclEvents_[key], ncclStream);
-
   std::vector<at::Tensor> inputs{input};
   std::vector<at::Tensor> outputs{output};

@@ -2697,12 +2706,6 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::collective(
     work->stashed_for_allocator_safety_->push_back(input);
   }

-  at::cuda::OptionalCUDAGuard gpuGuard(device);
-
-  if (nanCheck) {
-    checkForNan(input, ncclStream);
-  }
-
   // Start event should only be recorded before the ncclGroupStart()
   if (work->timingEnabled_) {
     work->ncclStartEvent_->record(ncclStream);

However, moving getNCCLComm before the NaN check causes the problem to reappear.

eqy added the oncall: distributed and module: nccl labels on Sep 21, 2024
kwen2501 (Contributor) commented Sep 21, 2024

Hi, the "unspecified launch failure" is expected, because in this test we inject NaN into the buffer to see whether the NaN checker fires.

The test was put in skip_return_code_checks because the main thread may not always be the first one to catch the CUDA error -- sometimes the watchdog catches it first. In that case, the with self.assertRaises(RuntimeError): block will fail. But I agree that the test should be improved, maybe by disabling the watchdog.

I don't have the feeling that the library code has an issue, though.

kwen2501 self-assigned this on Sep 21, 2024
eqy (Collaborator) commented Sep 21, 2024

@kwen2501 I agree the unspecified launch failure part is expected; the more concerning part is that we see a SIGABRT somewhere in shutdown/abort, or a double free (when running with CUDA_LAUNCH_BLOCKING).

kwen2501 (Contributor) commented

Hmm, let me give it a try. The watchdog may issue SIGABRT in some cases and settings (it may be the default setting too, because people wanted a reliable shutdown rather than a watchdog hang).

nWEIdia (Collaborator, Author) commented Sep 22, 2024

> Hi, the "unspecified launch failure" is expected, because in this test we inject NaN into the buffer to see whether the NaN checker fires.
>
> The test was put in skip_return_code_checks because the main thread may not always be the first one to catch the CUDA error -- sometimes the watchdog catches it first. In that case, the with self.assertRaises(RuntimeError): block will fail. But I agree that the test should be improved, maybe by disabling the watchdog.
>
> I don't have the feeling that the library code has an issue, though.

Re "may not always": in our testing the error is completely deterministic.

kwen2501 (Contributor) commented

I printed the C++ stack trace, and it seems the error is thrown from tensor destruction:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /home/kw2501/local/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
#7 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) from ??:0
#8 c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::insert_events(c10::cuda::CUDACachingAllocator::Native::(anonymous namespace)::Block*) from CUDACachingAllocator.cpp:0
#9 c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::free(c10::cuda::CUDACachingAllocator::Native::(anonymous namespace)::Block*) from CUDACachingAllocator.cpp:0
#10 c10::cuda::CUDACachingAllocator::Native::local_raw_delete(void*) from crtstuff.c:0
#11 c10::StorageImpl::~StorageImpl() from crtstuff.c:0
#12 c10::TensorImpl::~TensorImpl() [clone .localalias] from TensorImpl.cpp:0
#13 THPVariable_clear(THPVariable*) from python_variable.cpp:0
#14 THPVariable_subclass_dealloc(_object*) from ??:0

I guess the flow is as follows:

  • The NaN checker detects a NaN and triggers a __trap() from the device (see the sketch below).
  • The CUDA context is poisoned, so any subsequent CUDA call returns an error; for example, NCCL raises "unspecified launch failure".
  • The program starts to tear down, and when it deallocates a tensor, the CUDA caching allocator also hits the poisoned CUDA context; but instead of just throwing an error, it goes for a "harder" kill -- SIGABRT.
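
A minimal sketch of such a device-side check (this is not PyTorch's actual checkForNan kernel; the kernel name and loop shape are made up for illustration):

```cpp
#include <cuda_runtime.h>
#include <math.h>

// Minimal sketch of a device-side NaN check, not PyTorch's actual
// checkForNan kernel. Each thread scans a strided slice of the buffer and
// calls __trap() on the first NaN it finds; the trap aborts the kernel and
// poisons the CUDA context, so every later CUDA call (NCCL launches, event
// recording, allocator frees) returns "unspecified launch failure".
__global__ void check_for_nan_kernel(const float* data, size_t n) {
  for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += (size_t)gridDim.x * blockDim.x) {
    if (isnan(data[i])) {
      __trap();
    }
  }
}
```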

kwen2501 (Contributor) commented

I don't know whether torch offers an API for resetting the CUDA device. According to the CUDA doc:

> No more commands can be sent to this device until cudaDeviceReset() is called to reinitialize the device.

(The statement is in the assert section, but I guess it applies to __trap() as well.)

It seems that, if we'd like to exit this test gracefully, we'd need to change the behavior of the CUDACachingAllocator. I wonder if you have an alternative suggestion.

Or is the ask to check the process's return code and confirm it is SIGABRT(6)? We could do that instead of skipping the return code check. Please advise.
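
For reference, the reset the CUDA doc refers to is the plain runtime call sketched below. This is the raw CUDA API, not a PyTorch API, and resetting the device would also destroy PyTorch's allocations, streams, and communicators on that device, so it would not be a drop-in recovery path:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Sketch only: the raw CUDA runtime reset mentioned in the CUDA docs.
// It tears down everything on the device (allocations, streams, contexts),
// so PyTorch's caching allocator and process group state would be invalid
// afterwards.
void reset_device_after_trap(int device_index) {
  cudaSetDevice(device_index);
  cudaError_t err = cudaDeviceReset();
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaDeviceReset failed: %s\n",
                 cudaGetErrorString(err));
  }
}
```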

eqy (Collaborator) commented Sep 23, 2024

Is there a reason we need to poison the CUDA context in this case? __trap seems a bit aggressive compared to, e.g., CUDA_KERNEL_ASSERT, which is used in other places:

CUDA_KERNEL_ASSERT(-sizes[i] <= index && index < sizes[i] && "index out of bounds");

If that's an acceptable alternative, I can try it out and see if it helps.
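
As a rough sketch of what that swap might look like, using a plain device-side assert (roughly what CUDA_KERNEL_ASSERT boils down to when device asserts are enabled); the kernel below is illustrative only, not the actual change:

```cpp
#include <cuda_runtime.h>
#include <cassert>
#include <math.h>

// Rough sketch, not the actual patch: the same strided NaN scan as a
// __trap()-based check, but failing via a device-side assert instead.
// The reported error becomes "device-side assert triggered" rather than
// "unspecified launch failure".
__global__ void check_for_nan_assert_kernel(const float* data, size_t n) {
  for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += (size_t)gridDim.x * blockDim.x) {
    assert(!isnan(data[i]) && "NaN detected before collective");
  }
}
```

As the follow-up below shows, this variant still poisons the context and ends in a SIGABRT at teardown; only the reported error message changes.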

kwen2501 (Contributor) commented Sep 23, 2024

There isn't a strong preference between __trap and CUDA_KERNEL_ASSERT; we just want a way to stop the launch of the next CUDA kernel (e.g. NCCL ops). Ideally, though, the instruction should have good performance, because the NaN check runs on the fly.

kwen2501 (Contributor) commented

I tried CUDA_KERNEL_ASSERT and also got SIGABRT(6).
The stack is similar to that of __trap():

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /home/kw2501/local/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
#7 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) from ??:0
#8 c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::insert_events(c10::cuda::CUDACachingAllocator::Native::(anonymous namespace)::Block*) from CUDACachingAllocator.cpp:0
#9 c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::free(c10::cuda::CUDACachingAllocator::Native::(anonymous namespace)::Block*) from CUDACachingAllocator.cpp:0
#10 c10::cuda::CUDACachingAllocator::Native::local_raw_delete(void*) from crtstuff.c:0
#11 c10::StorageImpl::~StorageImpl() from crtstuff.c:0
#12 c10::TensorImpl::~TensorImpl() [clone .localalias] from TensorImpl.cpp:0
#13 THPVariable_clear(THPVariable*) from python_variable.cpp:0
#14 THPVariable_subclass_dealloc(_object*) from ??:0
SIGABRT(6), PID: 4015157, Thread 4015157: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x86 (0x7fc7fdb9e8b6 in /home/kw2501/local/pytorch/torch/lib/libc10.so)
frame #1: c10::FatalSignalHandler::fatalSignalHandler(int) + 0x25a (0x7fc7fdb9ef5a in /home/kw2501/local/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x3e6f0 (0x7fc7ff03e6f0 in /lib64/libc.so.6)
frame #3: <unknown function> + 0x8b99c (0x7fc7ff08b99c in /lib64/libc.so.6)
frame #4: raise + 0x16 (0x7fc7ff03e646 in /lib64/libc.so.6)
frame #5: abort + 0xd3 (0x7fc7ff0287f3 in /lib64/libc.so.6)
frame #6: <unknown function> + 0xa1b21 (0x7fc7e52a1b21 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0xad52c (0x7fc7e52ad52c in /lib64/libstdc++.so.6)
frame #8: <unknown function> + 0xac4f9 (0x7fc7e52ac4f9 in /lib64/libstdc++.so.6)
frame #9: __gxx_personality_v0 + 0x9a (0x7fc7e52acc7a in /lib64/libstdc++.so.6)
frame #10: <unknown function> + 0x112d4 (0x7fc7fdf382d4 in /lib64/libgcc_s.so.1)
frame #11: _Unwind_Resume + 0x12e (0x7fc7fdf38d0e in /lib64/libgcc_s.so.1)
frame #12: <unknown function> + 0x10404 (0x7fc7fdf5d404 in /home/kw2501/local/pytorch/torch/lib/libc10_cuda.so)
frame #13: <unknown function> + 0x1db5c (0x7fc7fdf6ab5c in /home/kw2501/local/pytorch/torch/lib/libc10_cuda.so)
frame #14: <unknown function> + 0x556f80 (0x7fc7fcd56f80 in /home/kw2501/local/pytorch/torch/lib/libtorch_python.so)
frame #15: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fc7fdb6a539 in /home/kw2501/local/pytorch/torch/lib/libc10.so)
frame #16: <unknown function> + 0x826b28 (0x7fc7fd026b28 in /home/kw2501/local/pytorch/torch/lib/libtorch_python.so)
frame #17: THPVariable_subclass_dealloc(_object*) + 0x2a6 (0x7fc7fd026e46 in /home/kw2501/local/pytorch/torch/lib/libtorch_python.so)

eqy (Collaborator) commented Sep 23, 2024

Opened #136486 in case we want to consider an alternative that incurs the overhead of a sync but surfaces a recoverable failure instead.
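
A minimal sketch of that style of check (host-side, paying for a synchronization but surfacing a catchable error and leaving the CUDA context intact); this is illustrative only, not the code in #136486, and the helper name is made up:

```cpp
#include <ATen/ATen.h>
#include <c10/util/Exception.h>

// Illustrative only, not the code in #136486: a host-side NaN check that
// synchronizes on a tiny reduction and throws a catchable c10::Error,
// leaving the CUDA context usable so teardown can proceed normally.
void check_for_nan_host_side(const at::Tensor& input) {
  // isnan().any() runs on the GPU; item<bool>() syncs the current stream
  // and copies the one-element result back to the host.
  const bool has_nan = input.isnan().any().item<bool>();
  TORCH_CHECK(!has_nan, "NaN detected in tensor before NCCL collective");
}
```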

nWEIdia (Collaborator, Author) commented May 13, 2025

See #153479.
Some setups running this unit test hit a 300-second timeout.
