ProcessGroupGlooTest.test_scatter_stress_cuda is flaky #15963
@ssnl

Example errors:

  1. build: https://circleci.com/gh/pytorch/pytorch/550843?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
======================================================================
Jan 11 15:06:07 FAIL: test_scatter_stress_cuda (__main__.ProcessGroupGlooTest)
Jan 11 15:06:07 ----------------------------------------------------------------------
Jan 11 15:06:07 Traceback (most recent call last):
Jan 11 15:06:07   File "test_c10d.py", line 451, in wrapper
Jan 11 15:06:07     self._join_processes(fn)
Jan 11 15:06:07   File "test_c10d.py", line 496, in _join_processes
Jan 11 15:06:07     self._check_return_codes(elapsed_time)
Jan 11 15:06:07   File "test_c10d.py", line 507, in _check_return_codes
Jan 11 15:06:07     self.assertEqual(p.exitcode, first_process.exitcode)
Jan 11 15:06:07   File "/var/lib/jenkins/workspace/test/common_utils.py", line 443, in assertEqual
Jan 11 15:06:07     super(TestCase, self).assertLessEqual(abs(x - y), prec, message)
Jan 11 15:06:07 AssertionError: 12 not less than or equal to 1e-05 : 
  2. build: https://circleci.com/gh/pytorch/pytorch/550696?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
======================================================================
Jan 11 13:48:43 ERROR: test_scatter_stress_cuda (__main__.ProcessGroupGlooTest)
Jan 11 13:48:43 ----------------------------------------------------------------------
Jan 11 13:48:43 Traceback (most recent call last):
Jan 11 13:48:43   File "test_c10d.py", line 451, in wrapper
Jan 11 13:48:43     self._join_processes(fn)
Jan 11 13:48:43   File "test_c10d.py", line 496, in _join_processes
Jan 11 13:48:43     self._check_return_codes(elapsed_time)
Jan 11 13:48:43   File "test_c10d.py", line 506, in _check_return_codes
Jan 11 13:48:43     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time))
Jan 11 13:48:43 RuntimeError: Process 0 terminated or timed out after 30.030478715896606 seconds
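
The two tracebacks appear to show two different failure modes in the same harness step: in the first build the workers' exit codes disagree (an absolute difference of 12), and in the second a worker did not finish within roughly 30 seconds. The following is a minimal sketch of that kind of check, not the code in test_c10d.py. The names `join_processes`, `check_return_codes`, and `TIMEOUT_SECONDS` are hypothetical, and it uses `subprocess` rather than whatever process machinery the real test uses.

```python
import subprocess
import sys
import time

# Hypothetical value; the log only shows a failure after about 30 seconds.
TIMEOUT_SECONDS = 30


def join_processes(procs, timeout=TIMEOUT_SECONDS):
    """Wait for all processes against one shared deadline.

    Returns (exit_codes, elapsed_seconds). A process that does not finish
    in time is killed and recorded as None.
    """
    start = time.time()
    deadline = start + timeout
    codes = []
    for p in procs:
        remaining = max(0.0, deadline - time.time())
        try:
            codes.append(p.wait(timeout=remaining))
        except subprocess.TimeoutExpired:
            p.kill()
            p.wait()
            codes.append(None)
    return codes, time.time() - start


def check_return_codes(codes, elapsed):
    """Raise if any process timed out or if the exit codes disagree."""
    for i, code in enumerate(codes):
        if code is None:
            raise RuntimeError(
                "Process {} terminated or timed out after {} seconds".format(i, elapsed)
            )
    first = codes[0]
    for code in codes:
        diff = abs(code - first)
        if diff > 1e-5:
            raise AssertionError("{} not less than or equal to 1e-05".format(diff))


if __name__ == "__main__":
    # Two stand-in workers: one exits with 0, the other with 12.
    procs = [
        subprocess.Popen([sys.executable, "-c", "import sys; sys.exit({})".format(c)])
        for c in (0, 12)
    ]
    codes, elapsed = join_processes(procs, timeout=10)
    try:
        check_return_codes(codes, elapsed)
    except (AssertionError, RuntimeError) as exc:
        print("check failed:", exc)
```

Under these assumptions, a mismatch surfaces as an `AssertionError` about the numeric difference between exit codes, while a hung worker surfaces as a `RuntimeError`. That would explain why the same flaky test can report either `FAIL` or `ERROR`.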

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @teng-li

Labels: module: c10d, module: flaky-tests, oncall: distributed, triaged
