[torch/elastic] unexpected behavior of torch elastic · Issue #147064 · pytorch/pytorch

Closed · shinytang6 opened this issue Feb 13, 2025 · 17 comments
Labels: module: elastic (Related to torch.distributed.elastic), oncall: distributed (Add this issue/PR to distributed oncall triage queue), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@shinytang6 commented Feb 13, 2025

🐛 Describe the bug

Hi all, I conducted some simple tests with torch elastic to understand its behavior under node failures, and I encountered several outcomes that do not match the official documentation.

Fault Tolerance & Elasticity test

Master node A command:

$ torchrun  --nnodes=1:2 --nproc-per-node=1 --rdzv-id=0 --rdzv-backend=c10d --rdzv-endpoint=MASTER_ADDR:MASTER_PORT --max-restarts=10 elastic-demo.py

Worker node B command:

$ torchrun  --nnodes=1:2 --nproc-per-node=1 --rdzv-id=0 --rdzv-backend=c10d --rdzv-endpoint=MASTER_ADDR:MASTER_PORT --max-restarts=10 elastic-demo.py
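
The contents of elastic-demo.py are not shown here; for context, a minimal hypothetical sketch of what such a script could look like (using the gloo backend so it runs without GPUs; my actual script uses NCCL, as mentioned further down) is:

# Hypothetical stand-in for elastic-demo.py: all-reduce a tensor in a loop so
# that a vanished peer makes the collective hang, which is what the cases
# below exercise. torchrun supplies the env:// rendezvous variables.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")  # the real script uses "nccl"
    rank = dist.get_rank()
    for step in range(100_000):
        t = torch.ones(1)
        dist.all_reduce(t)  # blocks or fails here when a peer node disappears
        if rank == 0 and step % 1000 == 0:
            print(f"step={step} world_size={dist.get_world_size()} sum={t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()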

Case 1

  • Both nodes start the task simultaneously, and the training begins normally.
  • After terminating the worker node B task (using Ctrl+C or kill -15), master node A hangs and training stalls.
  • Restarting the worker node B task sometimes results in an error (torch.distributed.elastic.rendezvous.api.RendezvousClosedError), but it occasionally restarts successfully. This behavior is irregular, and the --max-restarts parameter does not seem to take effect: the error occurs whether I increase or decrease its value, and it appears to depend on the timing of the rejoin (I am not sure about that).

Case 2

  • Both nodes start the task simultaneously, and the training begins normally.
  • After terminating the worker node B task (using kill -9), master node A hangs and the training stalls.
  • Restarting the worker node B task allows the training to restart, but the --max-restarts parameter does not seem to take effect either.

Case 3

  • Both nodes start the task simultaneously, and the training begins normally.
  • After terminating master node A’s task (using ctrl+c, kill -15, or kill -9), the entire training crashes immediately.

The detailed error message:

Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
    result = self._invoke_run(role)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 829, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 652, in _initialize_workers
    self._rendezvous(worker_group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 489, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1125, in next_rendezvous
    self._op_executor.run(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 667, in run
    raise RendezvousClosedError
torch.distributed.elastic.rendezvous.api.RendezvousClosedError

So my questions are:

  1. Is the differing behavior across signals (SIGINT, SIGTERM, SIGKILL) expected?
  2. Why does the --max-restarts parameter not seem to affect the restart behavior? Is there something I'm missing in the configuration or use of this parameter?

Versions

torch version:

$ pip show torch
Name: torch
Version: 2.4.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /opt/conda/lib/python3.8/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, bitsandbytes, deepspeed, flash_attn, flash_attn_1, peft, torchaudio, torchpippy, torchvision, transformer_engine, trl

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @dzhulgakov

@shinytang6 (Author) commented

By the way, when I switch to torch==2.2.0, the RendezvousClosedError errors no longer occur, but --max-restarts still does not seem to take effect.

@janeyx99 added the oncall: distributed and module: elastic labels Feb 13, 2025
@kiukchung (Collaborator) commented Feb 13, 2025

Thanks for the details. Could you clarify in your examples above:

  1. Which process was terminated? Was it the torchrun (aka agent) process or the PyTorch worker process (the process running elastic-demo.py)?
  2. Which process hangs?

The topology of the actual UNIX processes when invoking the torchrun command as shown above on both nodes looks like below:

[Image: diagram of the UNIX process topology created by the torchrun command on each node]

@kiukchung (Collaborator) commented Feb 13, 2025

I was able to repro Case 1 (RendezvousClosedError) by running two agents on a single node.

What I ran (on my desktop):

$ torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29500 --nnodes=1:2 --nproc_per_node=1 --max_restarts=3 repro.py

# on a different terminal (same host) after `repro.py` starts running 
# to ensure that the first torchrun hosts the c10d store
$ torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29500 --nnodes=1:2 --nproc_per_node=1 --max_restarts=3 repro.py

Ctrl+C-ing the second torchrun and then restarting it causes the first to error out with:

E0213 12:07:20.086000 870572 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 876053) of binary: /usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/bin/python3.11
I0213 12:07:20.086000 870572 torch/distributed/elastic/agent/server/api.py:889] [default] Worker group FAILED. 3/3 attempts left; will restart worker group
I0213 12:07:20.086000 870572 torch/distributed/elastic/agent/server/api.py:699] [default] Stopping worker group
I0213 12:07:20.087000 870572 torch/distributed/elastic/agent/server/api.py:677] [default] Rendezvous'ing worker group
Traceback (most recent call last):
  File "/usr/local/google/home/kiuk/.pyenv/versions/venv311/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 899, in _invoke_run
    self._restart_workers(self._worker_group)
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 702, in _restart_workers
    self._initialize_workers(worker_group)
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1162, in next_rendezvous
    self._op_executor.run(join_op, deadline, self._get_deadline)
  File "/usr/local/google/home/kiuk/.pyenv/versions/3.11.4/envs/venv311/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 676, in run
    raise RendezvousClosedError

Will take a closer look and report back here.

@kiukchung self-assigned this Feb 13, 2025
@shinytang6 (Author) commented Feb 14, 2025

To provide further context, I initialize the process group with dist.init_process_group("nccl") (with the default timeout of 10 minutes). If I decrease the timeout to 10 seconds, the RendezvousClosedError issue seems to be alleviated; however, it still occurs after worker node B exits and rejoins several times (--max-restarts is set to a very large number).
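
For reference, a minimal sketch of the timeout change described above (assuming the NCCL backend, as stated):

from datetime import timedelta
import torch.distributed as dist

# The default process-group timeout is 10 minutes; lowering it makes hanging
# collectives fail sooner when a peer goes away, so the agent can react earlier.
dist.init_process_group("nccl", timeout=timedelta(seconds=10))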

@shinytang6 (Author) commented

Thanks for the details. Could you clarify in your examples above:

  1. Which process was terminated? Was it the torchrun (aka agent) process or the PyTorch worker process (the process running elastic-demo.py)?
  2. Which process hangs?

The topology of the actual UNIX processes when invoking the torchrun command as shown above on both nodes looks like below:

[Image: diagram of the UNIX process topology created by the torchrun command on each node]

  1. I killed the torchrun process on worker node B.
  2. The hang issue on master node A is resolved; see my comment here ([torch/elastic] Scale down does not work correctly when agent is killed with SIGINT, SIGTERM #67742 (comment)).

@shinytang6 (Author) commented

Another discovery regarding the --max-restarts parameter: if I exit and rejoin worker node B very quickly, it seems possible to completely bypass the --max-restarts limit, allowing infinite restarts even with --max-restarts=0. Only when I exit node B and wait for a while does node A register it as a restart.
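
In case it helps reproduce this, a hypothetical sketch of the quick exit/rejoin loop on node B (the command and the MASTER_ADDR:MASTER_PORT placeholder are taken from the commands at the top of this issue):

import signal
import subprocess
import time

CMD = [
    "torchrun", "--nnodes=1:2", "--nproc-per-node=1", "--rdzv-id=0",
    "--rdzv-backend=c10d", "--rdzv-endpoint=MASTER_ADDR:MASTER_PORT",
    "--max-restarts=0", "elastic-demo.py",
]

for attempt in range(5):
    proc = subprocess.Popen(CMD)        # start node B's agent
    time.sleep(30)                      # let training run for a bit
    proc.send_signal(signal.SIGTERM)    # terminate the agent gracefully
    proc.wait()
    # Rejoin immediately: node A does not always count this as a restart,
    # so --max-restarts=0 does not stop the job.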

@kiukchung (Collaborator) commented

Another discovery regarding the --max-restarts parameter: if I exit and rejoin worker node B very quickly, it seems possible to completely bypass the --max-restarts limit, allowing infinite restarts even with --max-restarts=0. Only when I exit node B and wait for a while does node A register it as a restart.

Thanks for the note; I will try a repro and cut a separate bug report for this once I confirm it.

@d4l3k (Member) commented Feb 21, 2025

I think --max-restarts is enforced on a per-agent basis. The idea is that torchelastic will restart your worker process N times.

When you exit node B, are you restarting torchelastic or only the training script?

IIRC, if you're restarting the whole torchelastic agent/process, you instead need to use the scheduler to manage max retries.
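
To illustrate the per-agent point, a minimal sketch (field names as I recall them from torch.distributed.launcher.api; double-check against your torch version): max_restarts is part of the agent-side LaunchConfig, so each torchrun/agent process keeps its own restart budget, and relaunching the agent starts that budget over.

from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def train():
    # placeholder for the per-worker entrypoint (what elastic-demo.py does)
    ...

if __name__ == "__main__":
    config = LaunchConfig(
        min_nodes=1,
        max_nodes=2,
        nproc_per_node=1,
        run_id="0",
        rdzv_backend="c10d",
        rdzv_endpoint="MASTER_ADDR:MASTER_PORT",
        max_restarts=10,  # budget tracked by this agent only, not across the job
    )
    elastic_launch(config, train)()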

@fduwjj added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Apr 23, 2025
@NikitaShalagin commented

@shinytang6 Have you found any temporary workarounds? Like custom signal handling, or perhaps monkeypatching torch?

@shinytang6 (Author) commented

@shinytang6 Have you found any temporary workarounds? Like custom signal handling, or perhaps monkeypatching torch?

@NikitaShalagin No, I had to downgrade to PyTorch 2.2.0, where torch elastic works.

@shinytang6 (Author) commented

By the way, is there any progress on this issue? cc @kiukchung

@NikitaShalagin commented

@shinytang6 Have you tried the new torch 2.7 to see whether it fixes this issue?

@shinytang6 (Author) commented Apr 28, 2025

@shinytang6 Have you tried the new torch 2.7 to see whether it fixes this issue?

@NikitaShalagin Not yet; I have only tried 2.2.0 (works) and 2.3.x & 2.4.x (do not work).

@NikitaShalagin commented

@shinytang6 By the way, I've tested on 2.2.0 and 2.2.2. 2.2.2 does not work, so the root cause is probably somewhere between those versions.

@shinytang6 (Author) commented

@shinytang6 By the way, I've tested on 2.2.0 and 2.2.2. 2.2.2 does not work, so the root cause is probably somewhere between those versions.

@NikitaShalagin Good to know. I checked the version I tested before and confirmed that the one that worked was torch 2.2.0.

@kiukchung (Collaborator) commented

By the way, is there any progress on this issue? cc @kiukchung

Hi @shinytang6, I haven't had a chance to dig into it. I will try looking into it this week unless @d4l3k has already looked into it.

@georgkaleido (Contributor) commented

From my investigation, two things are at play here:

Cases 1 & 3:
Leaving workers may (depending on whether they succeed with the state update) close the rendezvous.
This is fixed in #152525

Case 2:
Here, the killed worker cannot accidentally close the rendezvous (as it is killed), but the agent does not correctly start a new rendezvous. This is also described in #111646 and fixed in #151220.
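
A highly simplified, hypothetical sketch (not the actual torch.distributed code paths) of why the signals behave differently under this explanation: a graceful agent exit runs cleanup that can mark the rendezvous closed, while kill -9 skips cleanup entirely, so the rendezvous stays open but no new one is started correctly.

class ToyRendezvous:
    """Stand-in for the shared c10d rendezvous state."""
    def __init__(self):
        self.closed = False

    def shutdown(self):
        # Once closed, any node that later tries to (re)join sees
        # RendezvousClosedError, as in Cases 1 and 3 above.
        self.closed = True

def agent(rdzv):
    try:
        raise KeyboardInterrupt  # Ctrl+C / kill -15 interrupts the agent ...
    finally:
        rdzv.shutdown()          # ... but cleanup still runs and closes the
                                 # rendezvous. With kill -9 the process is gone
                                 # before any cleanup can run, so the rendezvous
                                 # stays open (Case 2).

rdzv = ToyRendezvous()
try:
    agent(rdzv)
except KeyboardInterrupt:
    pass
print("rendezvous closed after graceful exit:", rdzv.closed)  # -> True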
