elastic: do not shutdown rendezvous on leaving workers #152525

Closed
wants to merge 2 commits

Conversation

georgkaleido (Contributor) commented Apr 30, 2025

In #117066, shutting down the rendezvous was added when a worker shuts down. This is incorrect: the rendezvous is already shut down in torch/distributed/launcher/api.py, but it should not be shut down when a signal is received. See also pull request #67749.

#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. However, that fix only takes effect when the agent restarts the training, not when the rendezvous was already shut down beforehand.

Removing both of these changes restores the original behavior: the rendezvous should only be shut down when a run completes or fails, not when a single worker leaves.
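
For context, here is a minimal sketch of the shutdown ownership this change restores, under a simplified agent/launcher split; every name in it (FakeRendezvous, Agent, launch_run, SignalException) is an illustrative stand-in rather than the actual torchelastic internals:

```python
# Illustrative sketch only: the names below are hypothetical, not torchelastic's API.
import signal


class SignalException(Exception):
    """Raised from the signal handler so the run unwinds without closing the rendezvous."""


class FakeRendezvous:
    """Stand-in for a rendezvous handler shared by all agents in the job."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        # Once closed, surviving workers can no longer re-rendezvous.
        self.closed = True


class Agent:
    """Stand-in for an elastic agent supervising one node's workers."""

    def __init__(self, rdzv: FakeRendezvous) -> None:
        self.rdzv = rdzv
        signal.signal(signal.SIGTERM, self._handler)

    def _handler(self, signum, frame):
        # A node leaving on a signal must NOT call self.rdzv.shutdown():
        # the rendezvous belongs to the whole job, and closing it here
        # prevents the remaining agents from restarting the training group.
        raise SignalException(f"got {signal.Signals(signum).name}")

    def run(self) -> None:
        ...  # supervise workers; may raise SignalException


def launch_run(agent: Agent):
    try:
        result = agent.run()
    except SignalException:
        # This node is leaving; keep the rendezvous open so the surviving
        # agents can re-rendezvous and continue the job without it.
        raise
    except Exception:
        agent.rdzv.shutdown()  # the run as a whole failed: tear it down
        raise
    else:
        agent.rdzv.shutdown()  # the run completed: tear it down
        return result
```

The point mirrored from the description: shutdown() runs exactly when the run completes or fails, never from the signal path a leaving worker takes.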

Fixes #150916
Fixes #147064

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

pytorch-bot bot commented Apr 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152525

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 93d9ede with merge base 7243c69:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the oncall: distributed and release notes: distributed (torchelastic) labels Apr 30, 2025
HDCharles requested review from XuehaiPan and zdevito May 2, 2025 04:00
mikaylagawarecki requested a review from wconstab May 2, 2025 15:45
mikaylagawarecki added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 2, 2025
mikaylagawarecki requested a review from d4l3k May 2, 2025 15:45
d4l3k requested a review from kiukchung May 12, 2025 22:43
georgkaleido (Contributor, Author)

@pytorchbot merge

pytorch-bot bot commented May 13, 2025

The pull workflow has not been scheduled for this PR yet. This could be because the author doesn't have permission to run the workflows, or because skip-checks keywords were added to the PR/commits; aborting merge. Please get/give approval for the workflows and/or remove skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

d4l3k (Member) commented May 13, 2025

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label May 13, 2025
pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

d4l3k (Member) commented May 13, 2025

@pytorchbot rebase

pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot (Collaborator)

Successfully rebased elastic-close onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via `git checkout elastic-close && git pull --rebase`)

pytorch-bot bot removed the ciflow/trunk label May 13, 2025
d4l3k (Member) commented May 13, 2025

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label May 13, 2025
pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

d4l3k (Member) commented May 14, 2025

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels
ciflow/trunk: Trigger trunk jobs on your pull request
Merged
oncall: distributed: Add this issue/PR to distributed oncall triage queue
open source
release notes: distributed (torchelastic)
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Elastic training crashes on killed agent
[torch/elastic] unexpected behavior of torch elastic
6 participants