8000 (torch/elastic) fix scale down bug caused by calling rdzv_handler.shutdown() on premature agent failures by kiukchung · Pull Request #67749 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

(torch/elastic) fix scale down bug caused by calling rdzv_handler.shutdown() on premature agent failures #67749

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

kiukchung
Copy link
Collaborator
@kiukchung kiukchung commented Nov 3, 2021

Summary: Fixes: #67742

Test Plan: deferring the testing to whoever is actually going to land this code - this diff is only for demonstrative purposes

Differential Revision: D32129005

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

@facebook-github-bot
Copy link
Contributor
facebook-github-bot commented Nov 3, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit e3dc0d9 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-vulkan-bionic-py3.6-clang9 / build (1/1)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

2021-11-05T03:57:03.3721330Z �[36;1m echo "ERR...t available for the merge-base of your branch"�[0m
2021-11-05T03:57:03.3714684Z �[36;1mfi�[0m
2021-11-05T03:57:03.3715218Z �[36;1m# Covers the case where a previous tag doesn't exist for the tree�[0m
2021-11-05T03:57:03.3716072Z �[36;1m# this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly�[0m
2021-11-05T03:57:03.3716750Z �[36;1mif ! git rev-parse "$MERGE_BASE:.circleci/docker"; then�[0m
2021-11-05T03:57:03.3717456Z �[36;1m  echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit"�[0m
2021-11-05T03:57:03.3718040Z �[36;1m  exit 1�[0m
2021-11-05T03:57:03.3718312Z �[36;1mfi�[0m
2021-11-05T03:57:03.3718871Z �[36;1mPREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker")�[0m
2021-11-05T03:57:03.3719747Z �[36;1m# If no image exists but the hash is the same as the previous hash then we should error out here�[0m
2021-11-05T03:57:03.3720523Z �[36;1mif [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then�[0m
2021-11-05T03:57:03.3721330Z �[36;1m  echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"�[0m
2021-11-05T03:57:03.3722106Z �[36;1m  echo "       contact the PyTorch team to restore the original images"�[0m
2021-11-05T03:57:03.3722566Z �[36;1m  exit 1�[0m
2021-11-05T03:57:03.3722839Z �[36;1mfi�[0m
2021-11-05T03:57:03.3723202Z �[36;1mecho ::set-output name=rebuild::yes�[0m
2021-11-05T03:57:03.3734110Z shell: /usr/bin/bash -e {0}
2021-11-05T03:57:03.3734407Z env:
2021-11-05T03:57:03.3734946Z   BUILD_ENVIRONMENT: linux-vulkan-bionic-py3.6-clang9
2021-11-05T03:57:03.3735978Z   DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.6-clang9
2021-11-05T03:57:03.3737073Z   SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
2021-11-05T03:57:03.3737978Z   XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Nov 3, 2021
@pytorch-probot
Copy link
pytorch-probot bot commented Nov 3, 2021
CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/kiukchung/pytorch/blob/e3dc0d99fec3a0d2f21fbc35aa1eee3178cbecdb/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-vulkan-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-dynamic ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3.6-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers ✅ triggered
linux-xenial-py3.6-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux 8000 ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
docker-builds ciflow/all 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
linux-xenial-py3-clang5-mobile-code-analysis ciflow/all, ciflow/linux, ciflow/mobile 🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D32129005

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D32129005

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D32129005

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D32129005

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D32129005

Copy link
Contributor
@cbalioglu cbalioglu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

…tdown() on premature agent failures (pytorch#67749)

Summary:
Pull Request resolved: pytorch#67749

Fixes: pytorch#67742

Test Plan:
Added unittests.

Validated manually:

```
# start agent 0
$ torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py

# start agent 1
torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py

# kill agent 0
CTRL+C (SIGINT) or kill -15 (SIGTERM)

# restart it
torchrun --rdzv_backend c10d --rdzv_id 123 --rdzv_endpoint localhost:29500 --nnodes 1:2 --nproc_per_node 1 --monitor_interval 1 test.py
```

Reviewed By: cbalioglu

Differential Revision: D32129005

fbshipit-source-id: 4e695d0b3397951d375ecee321add5faf0cfa3ea
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D32129005

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in f6402c4.

georgkaleido added a commit to georgkaleido/pytorch that referenced this pull request Apr 30, 2025
In pytorch#117066, shutdown of the rendezvous was added if a worker shuts down. This is incorrect, because the rendezvous is actually shutdown in [this file](https://github.com/pytorch/pytorch/blob/fa6f9eb2be07f6289d2ab4e781077f7fc75dbe55/torch/distributed/launcher/api.py#L290) but should not be shutdown if a signal is received. See also [this pull request](pytorch#67749).

pytorch#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But this is only triggered if the agent restarts the training, but not if the shutdown of the rendezvous happened before.

Removing both these changes restores the original behavior.
The rendezvous should only be shutdown if a run completes or fails, not for a single worker leaving.
pytorchmergebot pushed a commit to georgkaleido/pytorch that referenced this pull request May 13, 2025
In pytorch#117066, shutdown of the rendezvous was added if a worker shuts down. This is incorrect, because the rendezvous is actually shutdown in [this file](https://github.com/pytorch/pytorch/blob/fa6f9eb2be07f6289d2ab4e781077f7fc75dbe55/torch/distributed/launcher/api.py#L290) but should not be shutdown if a signal is received. See also [this pull request](pytorch#67749).

pytorch#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But this is only triggered if the agent restarts the training, but not if the shutdown of the rendezvous happened before.

Removing both these chang
8000
es restores the original behavior.
The rendezvous should only be shutdown if a run completes or fails, not for a single worker leaving.
pytorchmergebot pushed a commit that referenced this pull request May 14, 2025
In #117066, shutdown of the rendezvous was added if a worker shuts down. This is incorrect, because the rendezvous is actually shutdown in [this file](https://github.com/pytorch/pytorch/blob/fa6f9eb2be07f6289d2ab4e781077f7fc75dbe55/torch/distributed/launcher/api.py#L290) but should not be shutdown if a signal is received. See also [this pull request](#67749).

#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But this is only triggered if the agent restarts the training, but not if the shutdown of the rendezvous happened before.

Removing both these changes restores the original behavior. The rendezvous should only be shutdown if a run completes or fails, not for a single worker leaving.

Fixes #150916
Fixes #147064

Pull Request resolved: #152525
Approved by: https://github.com/kiukchung
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed fb-exported Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[torch/elastic] Scale down does not work correctly when agent is killed with SIGINT, SIGTERM
3 participants
0