elastic: do not shutdown rendezvous on leaving workers #152525

Closed
wants to merge 2 commits

Conversation

georgkaleido (Contributor) commented Apr 30, 2025

In #117066, shutting down the rendezvous was added when a worker shuts down. This is incorrect: the rendezvous is already shut down in torch/distributed/launcher/api.py, but it should not be shut down when a signal is received. See also pull request #67749.

#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. However, that fix only takes effect when the agent restarts the training, not when the rendezvous was already shut down beforehand.

Removing both of these changes restores the original behavior: the rendezvous should only be shut down when a run completes or fails, not when a single worker leaves.
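
For context, here is a minimal sketch of the shutdown ownership this change restores, under a simplified agent/launcher split; every name in it (FakeRendezvous, Agent, launch_run, SignalException) is an illustrative stand-in rather than the actual torchelastic internals:

```python
# Illustrative sketch only: the names below are hypothetical, not torchelastic's API.
import signal


class SignalException(Exception):
    """Raised from the signal handler so the run unwinds without closing the rendezvous."""


class FakeRendezvous:
    """Stand-in for a rendezvous handler shared by all agents in the job."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        # Once closed, surviving workers can no longer re-rendezvous.
        self.closed = True


class Agent:
    """Stand-in for an elastic agent supervising one node's workers."""

    def __init__(self, rdzv: FakeRendezvous) -> None:
        self.rdzv = rdzv
        signal.signal(signal.SIGTERM, self._handler)

    def _handler(self, signum, frame):
        # A node leaving on a signal must NOT call self.rdzv.shutdown():
        # the rendezvous belongs to the whole job, and closing it here
        # prevents the remaining agents from restarting the training group.
        raise SignalException(f"got {signal.Signals(signum).name}")

    def run(self) -> None:
        ...  # supervise workers; may raise SignalException


def launch_run(agent: Agent):
    try:
        result = agent.run()
    except SignalException:
        # This node is leaving; keep the rendezvous open so the surviving
        # agents can re-rendezvous and continue the job without it.
        raise
    except Exception:
        agent.rdzv.shutdown()  # the run as a whole failed: tear it down
        raise
    else:
        agent.rdzv.shutdown()  # the run completed: tear it down
        return result
```

The point mirrored from the description: shutdown() runs exactly when the run completes or fails, never from the signal path a leaving worker takes.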

Fixes #150916
Fixes #147064

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

pytorch-bot bot commented Apr 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152525

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 93d9ede with merge base 7243c69:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the oncall: distributed and release notes: distributed (torchelastic) labels Apr 30, 2025
HDCharles requested review from XuehaiPan and zdevito May 2, 2025 04:00
mikaylagawarecki requested a review from wconstab May 2, 2025 15:45
mikaylagawarecki added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 2, 2025
mikaylagawarecki requested a review from d4l3k May 2, 2025 15:45
d4l3k requested a review from kiukchung May 12, 2025 22:43
georgkaleido (Contributor, Author)

@pytorchbot merge

pytorch-bot bot commented May 13, 2025

The pull workflow has not been scheduled for this PR yet. This could be because the author doesn't have permission to run the workflows, or because skip-checks keywords were added to the PR/commits; aborting merge. Please get/give approval for the workflows and/or remove skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

d4l3k (Member) commented May 13, 2025

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label May 13, 2025
pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

d4l3k (Member) commented May 13, 2025

@pytorchbot rebase

pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot (Collaborator)

Successfully rebased elastic-close onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via `git checkout elastic-close && git pull --rebase`)

pytorch-bot bot removed the ciflow/trunk label May 13, 2025
d4l3k (Member) commented May 13, 2025

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label May 13, 2025
pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

d4l3k (Member) commented May 14, 2025

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels
ciflow/trunk: Trigger trunk jobs on your pull request
Merged
oncall: distributed: Add this issue/PR to distributed oncall triage queue
open source
release notes: distributed (torchelastic)
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Elastic training crashes on killed agent
[torch/elastic] unexpected behavior of torch elastic
6 participants