[CUDA] test_c10d_nccl test_extra_cuda_context failure due to _helper_test_extra_cuda_context_by_memory #153122
Labels
module: cuda
Related to torch.cuda, and CUDA support in general
module: tests
Issues related to tests (not the torch.testing module)
oncall: distributed
Add this issue/PR to distributed oncall triage queue
Uh oh!
There was an error while loading. Please reload this page.
While trying to replace cuda11.8 distributed jobs by cuda 12.6 (PR), test_extra_cuda_context failed and I had to increase the 1.5x heuristic to 1.7 to temporarily workaround the failure.
When this is finally fixed, I would roll back the 1.7 to 1.5.
Previously failed job:
https://github.com/pytorch/pytorch/actions/runs/14656019861/job/41132964287
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @ptrblck @msaroufim @eqy @jerryzh168 @mruberry @ZainRizvi @tinglvv @atalman @malfet
The text was updated successfully, but these errors were encountered: