[CI][CUDA][Distributed] test_non_blocking_with_eager_init timeout · Issue #153517 · pytorch/pytorch · GitHub
[CI][CUDA][Distributed] test_non_blocking_with_eager_init timeout #153517
@nWEIdia

Description

🐛 Describe the bug

NCCL 2.26.x may have introduced a regression that causes the test_non_blocking_with_eager_init unit test to time out.

This was exposed in https://hud.pytorch.org/pr/151594; see in particular the log at https://ossci-raw-job-status.s3.amazonaws.com/log/42176270701:
_________ ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init __________
2025-05-13T23:33:57.4331770Z Traceback (most recent call last):
2025-05-13T23:33:57.4332326Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 614, in wrapper
2025-05-13T23:33:57.4332459Z self._join_processes(fn)
2025-05-13T23:33:57.4333104Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 869, in _join_processes
2025-05-13T23:33:57.4333241Z self._check_return_codes(elapsed_time)
2025-05-13T23:33:57.4333858Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in _check_return_codes
2025-05-13T23:33:57.4333972Z raise RuntimeError(
2025-05-13T23:33:57.4334258Z RuntimeError: Process 0 terminated or timed out after 300.10095739364624 seconds
2025-05-13T23:33:57.4334494Z ----------------------------- Captured stdout call -----------------------------
2025-05-13T23:33:57.4334676Z Timing out after 300 seconds and killing subprocesses.
2025-05-13T23:33:57.4334980Z _________ ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init __________
2025-05-13T23:33:57.4335115Z Traceback (most recent call last):
2025-05-13T23:33:57.4335658Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 614, in wrapper
2025-05-13T23:33:57.4335773Z self._join_processes(fn)
2025-05-13T23:33:57.4336358Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 869, in _join_processes
2025-05-13T23:33:57.4336557Z self._check_return_codes(elapsed_time)
2025-05-13T23:33:57.4337160Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in _check_return_codes
2025-05-13T23:33:57.4337269Z raise RuntimeError(
2025-05-13T23:33:57.4337573Z RuntimeError: Process 0 terminated or timed out after 304.54658126831055 seconds
2025-05-13T23:33:57.4337792Z ----------------------------- Captured stdout call -----------------------------
2025-05-13T23:33:57.4337973Z Timing out after 300 seconds and killing subprocesses.
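
For triage across several CI runs, the timeout entries in a log like the one above can be pulled out mechanically. The following is only a sketch: `find_timeouts` is a hypothetical helper, not part of PyTorch, and the two patterns are assumptions derived solely from the message format visible in this log.

```python
import re

# Patterns inferred from the log excerpt in this issue (an assumption,
# not a documented format): a pytest-style section header and the
# RuntimeError message raised by _check_return_codes.
HEADER = r"_{3,} (\S+) _{3,}"
TIMEOUT = r"Process (\d+) terminated or timed out after ([\d.]+) seconds"
EVENT = re.compile(f"{HEADER}|{TIMEOUT}")


def find_timeouts(log_text):
    """Return (test_name, rank, seconds) for each timeout message found.

    Works on both line-broken and flattened logs because it scans the
    whole text rather than individual lines.
    """
    results = []
    current_test = None
    for match in EVENT.finditer(log_text):
        name, rank, seconds = match.groups()
        if name is not None:
            # A new section header: remember which test follows.
            current_test = name
        else:
            results.append((current_test, int(rank), float(seconds)))
    return results


sample = (
    "_________ ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init __________\n"
    "2025-05-13T23:33:57.4334258Z RuntimeError: Process 0 terminated or timed out "
    "after 300.10095739364624 seconds\n"
)
print(find_timeouts(sample))
```

Both failures in this report name process 0 and exceed the 300-second limit, which is what the extracted tuples would show for the full log.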

Versions

nightly docker image, e.g. ghcr.io/pytorch/pytorch-nightly:2.8.0.dev20250511-cuda12.6-cudnn9-devel
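
To check whether a given environment links the suspected NCCL series, something like the following may help. It is a sketch under assumptions: `nccl_in_suspect_series` and `local_nccl_version` are hypothetical helpers, and the `(2, 26)` default reflects only the suspicion stated in this issue, not a confirmed affected range. `torch.cuda.nccl.version()` is the existing PyTorch call used to read the version.

```python
def nccl_in_suspect_series(version, suspect=(2, 26)):
    """Return True when the (major, minor) part of `version` equals `suspect`."""
    return tuple(version[:2]) == tuple(suspect)


def local_nccl_version():
    """Return the NCCL version PyTorch was built with, or None if unavailable."""
    try:
        import torch

        return tuple(torch.cuda.nccl.version())
    except Exception:
        # Broad on purpose: torch may be missing, or built without NCCL.
        return None


if __name__ == "__main__":
    version = local_nccl_version()
    if version is None:
        print("NCCL version unavailable in this environment")
    else:
        print(version, "suspect series:", nccl_in_suspect_series(version))
```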

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @seemethere @malfet @pytorch/pytorch-dev-infra @ptrblck @eqy @tinglvv @atalman

Metadata

Assignees

No one assigned

    Labels

    module: ci (Related to continuous integration)
    module: nccl (Problems related to nccl support)
    module: regression (It used to work, and now it doesn't)
    oncall: distributed (Add this issue/PR to distributed oncall triage queue)
