[CI][CUDA][Distributed] test_non_blocking_with_eager_init timeout

### 🐛 Describe the bug

NCCL 2.26.x may have introduced a regression that causes the `test_non_blocking_with_eager_init` unit test to time out.

This was exposed in https://hud.pytorch.org/pr/151594; the specific failure is in the log at https://ossci-raw-job-status.s3.amazonaws.com/log/42176270701:
```
_________ ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init __________
2025-05-13T23:33:57.4331770Z Traceback (most recent call last):
2025-05-13T23:33:57.4332326Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 614, in wrapper
2025-05-13T23:33:57.4332459Z     self._join_processes(fn)
2025-05-13T23:33:57.4333104Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 869, in _join_processes
2025-05-13T23:33:57.4333241Z     self._check_return_codes(elapsed_time)
2025-05-13T23:33:57.4333858Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in _check_return_codes
2025-05-13T23:33:57.4333972Z     raise RuntimeError(
2025-05-13T23:33:57.4334258Z RuntimeError: Process 0 terminated or timed out after 300.10095739364624 seconds
2025-05-13T23:33:57.4334494Z ----------------------------- Captured stdout call -----------------------------
2025-05-13T23:33:57.4334676Z Timing out after 300 seconds and killing subprocesses.
2025-05-13T23:33:57.4334980Z _________ ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init __________
2025-05-13T23:33:57.4335115Z Traceback (most recent call last):
2025-05-13T23:33:57.4335658Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 614, in wrapper
2025-05-13T23:33:57.4335773Z     self._join_processes(fn)
2025-05-13T23:33:57.4336358Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 869, in _join_processes
2025-05-13T23:33:57.4336557Z     self._check_return_codes(elapsed_time)
2025-05-13T23:33:57.4337160Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in _check_return_codes
2025-05-13T23:33:57.4337269Z     raise RuntimeError(
2025-05-13T23:33:57.4337573Z RuntimeError: Process 0 terminated or timed out after 304.54658126831055 seconds
2025-05-13T23:33:57.4337792Z ----------------------------- Captured stdout call -----------------------------
2025-05-13T23:33:57.4337973Z Timing out after 300 seconds and killing subprocesses.
```

### Versions

Nightly Docker image, e.g. `ghcr.io/pytorch/pytorch-nightly:2.8.0.dev20250511-cuda12.6-cudnn9-devel`
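
To confirm whether a given environment is on the suspected NCCL series, something like the following sketch may help. It assumes `torch.cuda.nccl.version()` returns a tuple such as `(2, 26, 2)`, as it does in recent builds; the helper names `is_nccl_2_26` and `bundled_nccl_version` are made up for illustration and are not part of PyTorch.

```python
def is_nccl_2_26(version):
    """Return True if an NCCL version tuple such as (2, 26, 2) is in the 2.26.x series."""
    return tuple(version[:2]) == (2, 26)


def bundled_nccl_version():
    """Return the NCCL version PyTorch reports, or None if it cannot be determined."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch

        # Assumption: recent builds return a tuple like (major, minor, patch).
        return tuple(torch.cuda.nccl.version())
    except Exception:
        # PyTorch missing, built without NCCL, or an unexpected return type.
        return None


if __name__ == "__main__":
    version = bundled_nccl_version()
    if version is None:
        print("Could not determine the NCCL version from this PyTorch install.")
    elif is_nccl_2_26(version):
        print(f"NCCL {version} is in the 2.26.x series suspected in this issue.")
    else:
        print(f"NCCL {version} is outside the 2.26.x series.")
```

Running this inside the nightly image and inside an image with an older NCCL would show whether the timeout correlates with the version, which is the hypothesis here rather than a confirmed cause.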


cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @seemethere @malfet @pytorch/pytorch-dev-infra @ptrblck @eqy @tinglvv @atalman 
