10000 Prevent rendezvous shutdown on worker restarts by sathyanarays · Pull Request #124819 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

Prevent rendezvous shutdown on worker restarts #124819

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

sathyanarays
Copy link
Contributor
@sathyanarays sathyanarays commented Apr 24, 2024

Fixes #123678

Summary

When the rank leaves and joins back, the workers are restarted and while restarting the rendezvous is shut down. This change prevents rendezvous shutdown during worker restarts.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @dzhulgakov

Copy link
pytorch-bot bot commented Apr 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124819

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit b109a19 with merge base 5007312 (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link
linux-foundation-easycla bot commented Apr 24, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: sathyanarays / name: Sathyanarayanan Saravanamuthu (b109a19)

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Apr 24, 2024
@soulitzer soulitzer requested a review from kurman April 24, 2024 16:30
@soulitzer soulitzer added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Apr 24, 2024
Copy link
Collaborator
@eqy eqy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to add a test for this? e.g., in test/distributed/elastic/rendezvous?

@eqy eqy added the module: elastic Related to torch.distributed.elastic label Apr 30, 2024
@eqy
Copy link
Collaborator
eqy commented Apr 30, 2024

CCing @kurman

@sathyanarays
Copy link
Contributor Author

@eqy @kurman Added a test case. The newly added test case fails in the main branch and passes with this fix. Please review the changes once.

@sathyanarays sathyanarays requested a review from eqy May 2, 2024 09:33
@eqy
Copy link
Collaborator
eqy commented May 6, 2024

@pytorchmergebot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased rendezvous_fix onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout rendezvous_fix && git pull --rebase)

Copy link
Contributor
@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, cc: @wconstab

@@ -480,7 +480,7 @@ def _start_workers(self, worker_group: WorkerGroup) -> Dict[int, Any]:
raise NotImplementedError

@abc.abstractmethod
def _stop_workers(self, worker_group: WorkerGroup) -> None:
def _stop_workers(self, worker_group: WorkerGroup, isRestart: bool = False) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename to is_restart

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@eqy
Copy link
Collaborator
eqy commented May 8, 2024

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 8, 2024
@pytorchmergebot
Copy link
Collaborator
pytorchmergebot commented May 8, 2024 8000

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team Raised by workflow job

@eqy eqy added the topic: not user facing topic category label May 8, 2024
@eqy
Copy link
Collaborator
eqy commented May 8, 2024

@pytorchmergebot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request May 14, 2024
#126002)

Failure Info:
```bash
(pt) betterman@bjys1009:/projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test$ pytest api_test.py -k test_restart_workers
=============================================================================================================================================== test session starts ================================================================================================================================================
platform linux -- Python 3.10.8, pytest-8.1.1, pluggy-1.4.0
rootdir: /projs/framework/betterman/code/pytorch_new
configfile: pytest.ini
plugins: hypothesis-6.15.0, rerunfailures-14.0, flakefinder-1.1.0, xdist-3.3.1
collecting 1 item                                                                                                                                                                                                                                                                                                  /
projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test/api_test.py:123: PytestCollectionWarning: cannot collect test class 'TestAgent' because it has a __init__ constructor (from: test/distributed/elastic/agent/server/test/api_test.py)
  class TestAgent(SimpleElasticAgent):
collected 29 items / 28 deselected / 1 selected
Running 1 items in this shard

api_test.py F                                                                                                                                                                                                                                                                                                [100%]

===================================================================================================================================================== FAILURES =====================================================================================================================================================
___________________________________________________________________________________________________________________________________ SimpleElasticAgentTest.test_restart_workers ____________________________________________________________________________________________________________________________________
Traceback (most recent call last):
  File "/usr/local/python3.10/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/local/python3.10/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/local/python3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test/api_test.py", line 368, in test_restart_workers
    agent._restart_workers(worker_group)
  File "/projs/framework/betterman/code/pytorch_new/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/projs/framework/betterman/code/pytorch_new/torch/distributed/elastic/agent/server/api.py", line 728, in _restart_workers
    self._stop_workers(worker_group, is_restart=True)
TypeError: TestAgent._stop_workers() got an unexpected keyword argument 'is_restart'
============================================================================================================================================= short test summary info ==============================================================================================================================================
FAILED [0.0054s] api_test.py::SimpleElasticAgentTest::test_restart_workers - TypeError: TestAgent._stop_workers() got an unexpected keyword argument 'is_restart'
========================================================================================================================================= 1 failed, 28 deselected in 7.37s =========================================================================================================================================
```
Caused by #124819 .

Pull Request resolved: #126002
Approved by: https://github.com/ezyang
georgkaleido added a commit to georgkaleido/pytorch that referenced this pull request Apr 30, 2025
In pytorch#117066, shutdown of the rendezvous was added if a worker shuts down. This is incorrect, because the rendezvous is actually shutdown in [this file](https://github.com/pytorch/pytorch/blob/fa6f9eb2be07f6289d2ab4e781077f7fc75dbe55/torch/distributed/launcher/api.py#L290) but should not be shutdown if a signal is received. See also [this pull request](pytorch#67749).

pytorch#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But this is only triggered if the agent restarts the training, but not if the shutdown of the rendezvous happened before.

Removing both these changes restores the original behavior.
The rendezvous should only be shutdown if a run completes or fails, not for a single worker leaving.
pytorchmergebot pushed a commit to georgkaleido/pytorch that referenced this pull request May 13, 2025
In pytorch#117066, shutdown of the rendezvous was added if a worker shuts down. This is incorrect, because the rendezvous is actually shutdown in [this file](https://github.com/pytorch/pytorch/blob/fa6f9eb2be07f6289d2ab4e781077f7fc75dbe55/torch/distributed/launcher/api.py#L290) but should not be shutdown if a signal is received. See also [this pull request](pytorch#67749).

pytorch#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But this is only triggered if the agent restarts the training, but not if the shutdown of the rendezvous happened before.

Removing both these changes restores the original behavior.
The rendezvous should only be shutdown if a run completes or fails, not for a single worker leaving.
pytorchmergebot pushed a commit that referenced this pull request May 14, 2025
In #117066, shutdown of the rendezvous was added if a worker shuts down. This is incorrect, because the rendezvous is actually shutdown in [this file](https://github.com/pytorch/pytorch/blob/fa6f9eb2be07f6289d2ab4e781077f7fc75dbe55/torch/distributed/launcher/api.py#L290) but should not be shutdown if a signal is received. See also [this pull request](#67749).

#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But this is only triggered if the agent restarts the training, but not if the shutdown of the rendezvous happened before.

Removing both these changes restores the original behavior. The rendezvous should only be shutdown if a run completes or fails, not for a single worker leaving.

Fixes #150916
Fixes #147064

Pull Request resolved: #152525
Approved by: https://github.com/kiukchung
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged module: elastic Related to torch.distributed.elastic oncall: distributed Add this issue/PR to distributed oncall triage queue open source topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[TorchElastic] Shutdown behavior appears incorrect and breaks rendezvous
7 participants
0