Prevent rendezvous shutdown on worker restarts #124819
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124819
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (4 Unrelated Failures)
As of commit b109a19 with merge base 5007312:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from c3b0882 to d74e90d.
Would it be possible to add a test for this? e.g., in test/distributed/elastic/rendezvous? CCing @kurman
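A minimal sketch of the kind of regression test being requested: restarting workers must not shut the rendezvous down. The `FakeAgent` below is a self-contained stand-in, not the real `SimpleElasticAgent`, and the test file location is an assumption:

```python
# Hedged sketch of a regression test for this PR; FakeAgent is a stand-in
# for SimpleElasticAgent, reduced to the two methods that matter here.
import unittest
from unittest.mock import MagicMock


class FakeAgent:
    def __init__(self, rdzv_handler):
        self._rdzv_handler = rdzv_handler

    def _stop_workers(self, worker_group, is_restart=False):
        # Only tear down the rendezvous on a real shutdown.
        if not is_restart:
            self._rdzv_handler.shutdown()

    def _restart_workers(self, worker_group):
        self._stop_workers(worker_group, is_restart=True)


class RestartKeepsRendezvousAliveTest(unittest.TestCase):
    def test_restart_workers_does_not_shutdown_rendezvous(self):
        rdzv = MagicMock()
        agent = FakeAgent(rdzv)
        agent._restart_workers(worker_group=None)
        # The rendezvous must survive a worker restart.
        rdzv.shutdown.assert_not_called()


if __name__ == "__main__":
    unittest.main()
```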
@pytorchmergebot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
Force-pushed from fc24e09 to 57bb2f5.
LGTM, cc: @wconstab
```diff
@@ -480,7 +480,7 @@ def _start_workers(self, worker_group: WorkerGroup) -> Dict[int, Any]:
         raise NotImplementedError

     @abc.abstractmethod
-    def _stop_workers(self, worker_group: WorkerGroup) -> None:
+    def _stop_workers(self, worker_group: WorkerGroup, isRestart: bool = False) -> None:
```
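For context, a minimal sketch of how the new flag flows through the agent API, using the `is_restart` spelling adopted after the review nit below. The call site matches the traceback further down (`api.py:728` calls `self._stop_workers(worker_group, is_restart=True)`); the class body here is illustrative, not the exact upstream code:

```python
# Simplified sketch; not the exact torch/distributed/elastic code.
import abc
from typing import Any


class SimpleElasticAgent(abc.ABC):
    @abc.abstractmethod
    def _stop_workers(self, worker_group: Any, is_restart: bool = False) -> None:
        # Implementations stop the workers; when is_restart is True they
        # should leave the rendezvous alive so ranks can rejoin.
        raise NotImplementedError

    def _restart_workers(self, worker_group: Any) -> None:
        # The restart path marks the stop as a restart, which is what
        # prevents the rendezvous shutdown this PR targets.
        self._stop_workers(worker_group, is_restart=True)
```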
nit: rename to is_restart
Done
Force-pushed from 57bb2f5 to b109a19.
@pytorchmergebot merge
Merge failed
Reason: This PR needs a `release notes:` label. If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`. If not, please add the `topic: not user facing` label.
To add a label, you can comment to pytorchbot, for example `@pytorchbot label "topic: not user facing"`.
For more information, see the PyTorch AutoLabel Bot wiki.
Details for Dev Infra team: Raised by workflow job.
@pytorchmergebot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
(#126002)

Failure Info:

```bash
(pt) betterman@bjys1009:/projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test$ pytest api_test.py -k test_restart_workers
=========================== test session starts ===========================
platform linux -- Python 3.10.8, pytest-8.1.1, pluggy-1.4.0
rootdir: /projs/framework/betterman/code/pytorch_new
configfile: pytest.ini
plugins: hypothesis-6.15.0, rerunfailures-14.0, flakefinder-1.1.0, xdist-3.3.1
collecting 1 item
/projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test/api_test.py:123: PytestCollectionWarning: cannot collect test class 'TestAgent' because it has a __init__ constructor (from: test/distributed/elastic/agent/server/test/api_test.py)
  class TestAgent(SimpleElasticAgent):
collected 29 items / 28 deselected / 1 selected
Running 1 items in this shard

api_test.py F                                                       [100%]

================================ FAILURES =================================
________________ SimpleElasticAgentTest.test_restart_workers ______________
Traceback (most recent call last):
  File "/usr/local/python3.10/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/local/python3.10/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/local/python3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test/api_test.py", line 368, in test_restart_workers
    agent._restart_workers(worker_group)
  File "/projs/framework/betterman/code/pytorch_new/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/projs/framework/betterman/code/pytorch_new/torch/distributed/elastic/agent/server/api.py", line 728, in _restart_workers
    self._stop_workers(worker_group, is_restart=True)
TypeError: TestAgent._stop_workers() got an unexpected keyword argument 'is_restart'
========================= short test summary info =========================
FAILED [0.0054s] api_test.py::SimpleElasticAgentTest::test_restart_workers - TypeError: TestAgent._stop_workers() got an unexpected keyword argument 'is_restart'
==================== 1 failed, 28 deselected in 7.37s =====================
```

Caused by #124819.

Pull Request resolved: #126002
Approved by: https://github.com/ezyang
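The failure comes from a `SimpleElasticAgent` subclass whose `_stop_workers` override predates the new keyword. A minimal self-contained illustration of the fix (the `AgentBase`/`FixedTestAgent` names are stand-ins; the real `TestAgent` lives in test/distributed/elastic/agent/server/test/api_test.py):

```python
# Stand-in base class, reduced to the call that triggered the TypeError.
class AgentBase:
    def _restart_workers(self, worker_group) -> None:
        self._stop_workers(worker_group, is_restart=True)

    def _stop_workers(self, worker_group, is_restart: bool = False) -> None:
        raise NotImplementedError


class FixedTestAgent(AgentBase):
    def _stop_workers(self, worker_group, is_restart: bool = False) -> None:
        # Accepting is_restart (even if unused here) avoids:
        # TypeError: _stop_workers() got an unexpected keyword argument 'is_restart'
        pass


FixedTestAgent()._restart_workers(worker_group=None)  # no TypeError
```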
In #117066, shutdown of the rendezvous was added when a worker shuts down. This is incorrect: the rendezvous is actually shut down in [this file](https://github.com/pytorch/pytorch/blob/fa6f9eb2be07f6289d2ab4e781077f7fc75dbe55/torch/distributed/launcher/api.py#L290), but it should not be shut down when a signal is received. See also [this pull request](#67749). #124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case, but that fix only triggers when the agent restarts the training, not when the rendezvous was already shut down beforehand. Removing both of these changes restores the original behavior: the rendezvous should only be shut down when a run completes or fails, not when a single worker leaves.

Fixes #150916
Fixes #147064

Pull Request resolved: #152525
Approved by: https://github.com/kiukchung
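A hedged sketch of the teardown ordering this revert restores: the launcher owns the rendezvous and shuts it down once the whole run finishes, while the agent restarts workers without touching it. The names loosely follow torch/distributed/launcher/api.py but are illustrative, not the upstream implementation:

```python
# Illustrative only; not the actual torch/distributed/launcher code.
class StubRendezvousHandler:
    def shutdown(self) -> None:
        print("rendezvous shut down")


def launch_agent(run) -> None:
    rdzv_handler = StubRendezvousHandler()
    try:
        # The agent may restart workers many times inside run(); none of
        # those restarts should reach rdzv_handler.shutdown().
        run()
    finally:
        # Single teardown point: the run completed or failed for good.
        rdzv_handler.shutdown()


launch_agent(lambda: print("training run finished"))
```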
Fixes #123678
Summary
When a rank leaves and joins back, the workers are restarted, and during the restart the rendezvous is shut down. This change prevents the rendezvous from being shut down during worker restarts.
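A hedged sketch of the behavior the summary describes: the worker-stop path only tears the rendezvous down on a real shutdown, never on a restart. The attribute names (`_pcontext`, `_rdzv_handler`) are assumptions and may not match the upstream `LocalElasticAgent` exactly:

```python
# Illustrative sketch; attribute names are assumptions, not upstream code.
class AgentSketch:
    def __init__(self, pcontext, rdzv_handler):
        self._pcontext = pcontext
        self._rdzv_handler = rdzv_handler

    def _stop_workers(self, worker_group, is_restart: bool = False) -> None:
        self._shutdown(is_restart=is_restart)

    def _shutdown(self, is_restart: bool = False) -> None:
        self._pcontext.close()  # always stop the worker processes
        if not is_restart:
            # Only tear down the rendezvous on a real shutdown; a restart
            # keeps it alive so the departed rank can rejoin.
            self._rdzv_handler.shutdown()
```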
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @dzhulgakov