Description
🐛 Describe the bug
NCCL 2.26.x may have introduced a regression that causes the test_non_blocking_with_eager_init unit test to time out.
The failure surfaced in https://hud.pytorch.org/pr/151594; the relevant log is https://ossci-raw-job-status.s3.amazonaws.com/log/42176270701:
_________ ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init __________
2025-05-13T23:33:57.4331770Z Traceback (most recent call last):
2025-05-13T23:33:57.4332326Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 614, in wrapper
2025-05-13T23:33:57.4332459Z self._join_processes(fn)
2025-05-13T23:33:57.4333104Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 869, in _join_processes
2025-05-13T23:33:57.4333241Z self._check_return_codes(elapsed_time)
2025-05-13T23:33:57.4333858Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in _check_return_codes
2025-05-13T23:33:57.4333972Z raise RuntimeError(
2025-05-13T23:33:57.4334258Z RuntimeError: Process 0 terminated or timed out after 300.10095739364624 seconds
2025-05-13T23:33:57.4334494Z ----------------------------- Captured stdout call -----------------------------
2025-05-13T23:33:57.4334676Z Timing out after 300 seconds and killing subprocesses.
2025-05-13T23:33:57.4334980Z _________ ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init __________
2025-05-13T23:33:57.4335115Z Traceback (most recent call last):
2025-05-13T23:33:57.4335658Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 614, in wrapper
2025-05-13T23:33:57.4335773Z self._join_processes(fn)
2025-05-13T23:33:57.4336358Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 869, in _join_processes
2025-05-13T23:33:57.4336557Z self._check_return_codes(elapsed_time)
2025-05-13T23:33:57.4337160Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in _check_return_codes
2025-05-13T23:33:57.4337269Z raise RuntimeError(
2025-05-13T23:33:57.4337573Z RuntimeError: Process 0 terminated or timed out after 304.54658126831055 seconds
2025-05-13T23:33:57.4337792Z ----------------------------- Captured stdout call -----------------------------
2025-05-13T23:33:57.4337973Z Timing out after 300 seconds and killing subprocesses.
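For triaging longer CI logs with the same failure signature, the timed-out tests can be pulled out mechanically. The following is a minimal sketch, not part of PyTorch's tooling: the names `Timeout`, `TOKEN`, `find_timeouts`, and `SAMPLE` are hypothetical, and the regular expression assumes the pytest section header and the `RuntimeError` message look exactly as they do in the log above.

```python
import re
from typing import List, NamedTuple, Optional


class Timeout(NamedTuple):
    test: str       # pytest section header, e.g. "Class.test_name"
    process: int    # rank reported in the RuntimeError message
    seconds: float  # elapsed time reported in the RuntimeError message


# Matches either a pytest section header ("____ name ____") or the
# timeout message raised by _check_return_codes (format assumed from the log).
TOKEN = re.compile(
    r"_{3,} (?P<test>\S+) _{3,}"
    r"|Process (?P<rank>\d+) terminated or timed out after (?P<secs>[\d.]+) seconds"
)


def find_timeouts(log: str) -> List[Timeout]:
    """Return one Timeout per timeout message, attributed to the nearest preceding header."""
    results: List[Timeout] = []
    current: Optional[str] = None
    for match in TOKEN.finditer(log):
        if match.group("test") is not None:
            current = match.group("test")
        elif current is not None:
            results.append(
                Timeout(current, int(match.group("rank")), float(match.group("secs")))
            )
    return results


# Two excerpts taken from the log above.
SAMPLE = (
    "_________ ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init __________\n"
    "2025-05-13T23:33:57.4334258Z RuntimeError: Process 0 terminated or timed out "
    "after 300.10095739364624 seconds\n"
)

if __name__ == "__main__":
    for hit in find_timeouts(SAMPLE):
        print(hit)
```

Because the scan runs over the whole text rather than line by line, it also works on logs whose line breaks were lost when copied.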
Versions
nightly docker image, e.g. ghcr.io/pytorch/pytorch-nightly:2.8.0.dev20250511-cuda12.6-cudnn9-devel
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @seemethere @malfet @pytorch/pytorch-dev-infra @ptrblck @eqy @tinglvv @atalman