
[CUDA] test_c10d_nccl test_extra_cuda_context failure due to _helper_test_extra_cuda_context_by_memory #153122


Open
nWEIdia opened this issue May 8, 2025 · 2 comments
Labels
module: cuda (Related to torch.cuda, and CUDA support in general)
module: tests (Issues related to tests, not the torch.testing module)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

nWEIdia (Collaborator) commented May 8, 2025

While trying to replace the cuda 11.8 distributed jobs with cuda 12.6 (PR), test_extra_cuda_context failed, and I had to raise the 1.5x memory heuristic to 1.7x to temporarily work around the failure.

Once this is properly fixed, the threshold should be rolled back from 1.7 to 1.5.
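For context, a minimal sketch of the idea behind the memory-based helper, assuming torch.cuda.mem_get_info as the measurement primitive (the exact bookkeeping in _helper_test_extra_cuda_context_by_memory may differ):

```python
# Illustrative sketch, not the exact test code. Assumes
# dist.init_process_group has already been called and each rank owns one GPU.
import torch
import torch.distributed as dist

def check_no_extra_context_by_memory(rank: int, multiplier: float = 1.7) -> None:
    device = torch.device(f"cuda:{rank}")
    x = torch.empty((1,), device=device)
    if rank == 0:
        # Baseline snapshot; includes rank 0's own context.
        free, total = torch.cuda.mem_get_info(device)
        used_before = float(total - free)
    dist.all_reduce(x)  # the collective that could leak contexts onto device 0
    torch.cuda.synchronize(device)
    if rank == 0:
        free, total = torch.cuda.mem_get_info(device)
        used_after = float(total - free)
        # An extra context on device 0 would roughly double the usage, so
        # anything above `multiplier` x the baseline is treated as a failure.
        assert used_after < used_before * multiplier, (
            f"used {used_after:.0f} B vs baseline {used_before:.0f} B: "
            "extra CUDA context suspected"
        )
```

One plausible reason this heuristic is fragile: the threshold compares whole-device memory, so anything else that changes allocation on device 0 across CUDA or NCCL versions can shift the ratio.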

Previously failed job:
https://github.com/pytorch/pytorch/actions/runs/14656019861/job/41132964287

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @ptrblck @msaroufim @eqy @jerryzh168 @mruberry @ZainRizvi @tinglvv @atalman @malfet

eqy (Collaborator) commented May 8, 2025

Adding a note for myself: we might want to take a closer look at this one, since the CUDA 11 -> CUDA 12 transition did have a lot of issues around extra CUDA context creation.

jbschlosser added the oncall: distributed, module: cuda, and module: tests labels May 8, 2025
pytorchmergebot pushed a commit that referenced this issue May 20, 2025
… to cu126-sm75 (#151594)

This PR moves the distributed CUDA CI jobs from CUDA 11.8 to CUDA 12.6.
In doing so, a few unit test failures were exposed; some if not all of them will take a while to root-cause and fix, so they are temporarily skipped after filing the issues below:

#153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle; Ubuntu 20.04 does not work, Ubuntu 22.04 works, Amazon Linux 2023 skips - what is the Sandcastle OS?)
#153122 CUDA context related
#153517 NCCL regression; a future NCCL release may fix it

See: #147383

Pull Request resolved: #151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever
pytorchmergebot pushed a commit that referenced this issue May 22, 2025
…12.6 (#151594)

This PR moves the distributed CUDA CI jobs from CUDA 11.8 to CUDA 12.6.
In doing so, a few unit test failures were exposed; some if not all of them will take a while to root-cause and fix, so they are temporarily skipped after filing the issues below:

#153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle; Ubuntu 20.04 does not work, Ubuntu 22.04 works, Amazon Linux 2023 skips - what is the Sandcastle OS?)
#153122 CUDA context related
#153517 NCCL regression; a future NCCL release may fix it
#154073 skip test_symmetric_memory on CUDA 12.6 until it is fixed

See: #147383

Pull Request resolved: #151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever, https://github.com/huydhn, https://github.com/kwen2501
nWEIdia (Collaborator, Author) commented May 23, 2025

From the discussion here: #154174 (comment)
It looks like we are going to remove this _by_memory test in favor of the NVML-based check.

I will likely close this issue soon without further investigation.
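For reference, a minimal sketch of what an NVML-based check could look like (the pynvml calls are real APIs; the scaffolding around them is an assumption, not the actual helper):

```python
# Each process holding a CUDA context on a device shows up in NVML's
# compute-process list, so counting entries detects extra contexts
# directly instead of via a memory heuristic.
import pynvml

def count_cuda_contexts(device_index: int) -> int:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        return len(pynvml.nvmlDeviceGetComputeRunningProcesses(handle))
    finally:
        pynvml.nvmlShutdown()

# In the test, each rank should hold exactly one context on its own device,
# e.g.: assert count_cuda_contexts(rank) == 1
```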

cc @ngimel @kwen2501 @atalman
