Open
Labels
module: c10d (Issues/PRs related to collective communications and process groups)
module: flaky-tests (Problem is a flaky test in CI)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
Example errors:
======================================================================
Jan 11 15:06:07 FAIL: test_scatter_stress_cuda (__main__.ProcessGroupGlooTest)
Jan 11 15:06:07 ----------------------------------------------------------------------
Jan 11 15:06:07 Traceback (most recent call last):
Jan 11 15:06:07 File "test_c10d.py", line 451, in wrapper
Jan 11 15:06:07 self._join_processes(fn)
Jan 11 15:06:07 File "test_c10d.py", line 496, in _join_processes
Jan 11 15:06:07 self._check_return_codes(elapsed_time)
Jan 11 15:06:07 File "test_c10d.py", line 507, in _check_return_codes
Jan 11 15:06:07 self.assertEqual(p.exitcode, first_process.exitcode)
Jan 11 15:06:07 File "/var/lib/jenkins/workspace/test/common_utils.py", line 443, in assertEqual
Jan 11 15:06:07 super(TestCase, self).assertLessEqual(abs(x - y), prec, message)
Jan 11 15:06:07 AssertionError: 12 not less than or equal to 1e-05 :
======================================================================
Jan 11 13:48:43 ERROR: test_scatter_stress_cuda (__main__.ProcessGroupGlooTest)
Jan 11 13:48:43 ----------------------------------------------------------------------
Jan 11 13:48:43 Traceback (most recent call last):
Jan 11 13:48:43 File "test_c10d.py", line 451, in wrapper
Jan 11 13:48:43 self._join_processes(fn)
Jan 11 13:48:43 File "test_c10d.py", line 496, in _join_processes
Jan 11 13:48:43 self._check_return_codes(elapsed_time)
Jan 11 13:48:43 File "test_c10d.py", line 506, in _check_return_codes
Jan 11 13:48:43 raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time))
Jan 11 13:48:43 RuntimeError: Process 0 terminated or timed out after 30.030478715896606 seconds
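For anyone triaging this without the harness open, here is a rough sketch of how the two messages above could arise. It is not the actual `test_c10d.py` code: the function name `check_return_codes`, the list-of-exit-codes input, and the `None` check are assumptions made for illustration. Only the error strings and the `abs(x - y) <= prec` comparison are taken from the tracebacks. Read this way, the first failure means two worker exit codes differed by 12, and the second means process 0 had no exit code to report once the join gave up after roughly 30 seconds.

```python
def check_return_codes(exitcodes, elapsed_time, prec=1e-5):
    """Hypothetical stand-in for the checks seen in the tracebacks.

    exitcodes: one entry per worker process. None is assumed to mean the
    worker had no exit code available when the join finished.
    """
    first = exitcodes[0]
    for i, code in enumerate(exitcodes):
        # Second traceback: a worker without an exit code is reported
        # as terminated or timed out.
        if code is None:
            raise RuntimeError(
                'Process {} terminated or timed out after {} seconds'.format(
                    i, elapsed_time))
        # First traceback: exit codes are compared numerically, so a
        # mismatch surfaces as "<difference> not less than or equal to <prec>".
        diff = abs(code - first)
        if not diff <= prec:
            raise AssertionError(
                '{} not less than or equal to {}'.format(diff, prec))


if __name__ == '__main__':
    # All workers agree: no error.
    check_return_codes([0, 0], elapsed_time=1.0)
    # One worker exits with a different code: AssertionError.
    try:
        check_return_codes([0, 12], elapsed_time=1.0)
    except AssertionError as e:
        print('AssertionError:', e)
    # First worker has no exit code: RuntimeError.
    try:
        check_return_codes([None, 0], elapsed_time=30.03)
    except RuntimeError as e:
        print('RuntimeError:', e)
```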
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @teng-li