[c10d][nccl][cuda] Regression (unspecific cuda launch error) with test_c10d_nccl · Issue #136390 · pytorch/pytorch · GitHub
[c10d][nccl][cuda] Regression (unspecific cuda launch error) with test_c10d_nccl #136390
Open
@nWEIdia

🐛 Describe the bug

When running
python test/distributed/test_c10d_nccl.py -k test_nan_assert_float16 on a 2x H100 platform,

the current nightly (and likely the v2.5.0 RC) produces the following CUDA error:

image

The return code was not checked, because:
image

With ghcr.io/pytorch/pytorch-nightly:2.5.0.dev20240818-cuda12.4-cudnn9-devel, the test produced no errors other than the failed assertion check.

Bisected to #134300 (cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @atalman @malfet )
i.e. this commit 3645634

To reproduce on a 2-GPU platform:

1. docker pull ghcr.io/pytorch/pytorch-nightly:2.6.0.dev20240918-cuda12.4-cudnn9-devel
2. Clone pytorch and check out the above commit (3645634).
3. Run:
   python test/distributed/test_c10d_nccl.py -k test_nan_assert_float16
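
For anyone re-running the last step, a minimal sketch of a wrapper that runs the test and prints the process exit code explicitly may help when comparing the two nightlies. The helper names `build_repro_command` and `run_and_report` are hypothetical, not part of PyTorch. The sketch assumes it is called from the root of a pytorch checkout inside the container.

```python
import subprocess
import sys

# Test file and filter copied from the reproduction steps above.
TEST_FILE = "test/distributed/test_c10d_nccl.py"
TEST_FILTER = "test_nan_assert_float16"


def build_repro_command(python=sys.executable):
    """Return the argv list for the failing test (hypothetical helper)."""
    return [python, TEST_FILE, "-k", TEST_FILTER]


def run_and_report(cmd, cwd=None):
    """Run cmd, echo its output, print the exit code, and return it."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    sys.stdout.write(proc.stdout)
    sys.stderr.write(proc.stderr)
    print(f"exit code: {proc.returncode}")
    return proc.returncode
```

Calling `run_and_report(build_repro_command())` from the checkout root runs the same command as step 3 and makes the exit code visible alongside the test output.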

Versions

Bisected to #134300 (cc @kwen2501 @atalman @malfet )
i.e. this commit 3645634

cc @eqy @Aidyn-A @ptrblck

Labels

module: nccl (Problems related to nccl support)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)