[PP] Fix disabled flaky tests (#154856) · ROCm/pytorch@d71a41b

Commit d71a41b

H-Huang authored and iupaikov-amd committed
[PP] Fix disabled flaky tests (pytorch#154856)
Fixes pytorch#154373, pytorch#154391, pytorch#154408, pytorch#154443, pytorch#154481

Because MultiProcContinousTest now executes the tests with 8 GPUs instead of 2 (pytorch#153653), our PP tests comparing gradients have become flakier due to the longer pipeline. The gradients are still close, but we need to relax the tolerance.

Pull Request resolved: pytorch#154856
Approved by: https://github.com/Skylion007
1 parent 803d7b6 commit d71a41b

File tree: 1 file changed (+1, -1 lines)

test/distributed/pipelining/test_schedule_multiproc.py

Lines changed: 1 addition & 1 deletion
@@ -513,7 +513,7 @@ def test_grad_with_manual_interleaved(self, ScheduleClass, use_new_runtime):
         for name, p in stage_module.named_parameters():
             ref_p = ref_submod.get_parameter(name)
             try:
-                torch.testing.assert_close(p.grad, ref_p.grad, rtol=1e-5, atol=4e-5)
+                torch.testing.assert_close(p.grad, ref_p.grad, rtol=1e-5, atol=1e-3)
             except AssertionError:
                 print(f"Gradient test failed for {name}: {p.grad} vs {ref_p.grad}")
                 raise
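
For context on why relaxing atol helps, here is a minimal standalone sketch (not part of the commit; the tensor sizes and noise scale are illustrative assumptions): torch.testing.assert_close accepts two tensors when |actual - expected| <= atol + rtol * |expected| holds elementwise, so a larger atol absorbs the extra floating-point drift that accumulates across the longer 8-GPU pipeline while still catching gradients that genuinely diverge.

# Illustrative sketch only (not from the commit): how the rtol/atol pair in
# torch.testing.assert_close behaves for gradients that carry small
# accumulated floating-point drift, as in a longer pipeline.
import torch

ref_grad = torch.randn(1024)
# Hypothetical drift of roughly 1e-4, standing in for the error that builds
# up over more pipeline stages and microbatches.
pp_grad = ref_grad + 1e-4 * torch.randn_like(ref_grad)

# The old tolerance (atol=4e-5) can reject drift of this size...
try:
    torch.testing.assert_close(pp_grad, ref_grad, rtol=1e-5, atol=4e-5)
except AssertionError as err:
    print("old tolerance failed:", err)

# ...while the relaxed tolerance from this commit still passes, because the
# check is |actual - expected| <= atol + rtol * |expected| elementwise.
torch.testing.assert_close(pp_grad, ref_grad, rtol=1e-5, atol=1e-3)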
