🐛 Describe the bug
Hi all, I ran some simple tests with torch elastic to understand its behavior under node failures, and I observed several outcomes that do not match the official documentation.
Fault Tolerance & Elasticity test
Master node A command:
$ torchrun --nnodes=1:2 --nproc-per-node=1 --rdzv-id=0 --rdzv-backend=c10d --rdzv-endpoint=MASTER_ADDR:MASTER_PORT --max-restarts=10 elastic-demo.py
Worker node B command:
$ torchrun --nnodes=1:2 --nproc-per-node=1 --rdzv-id=0 --rdzv-backend=c10d --rdzv-endpoint=MASTER_ADDR:MASTER_PORT --max-restarts=10 elastic-demo.py
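The contents of elastic-demo.py are not essential to the problem; a minimal stand-in could look like the sketch below (a sketch only, assuming a basic DDP training loop that relies on the env:// initialization torchrun provides; gloo on CPU, nccl when GPUs are used):

# elastic-demo.py -- minimal sketch of the kind of training script used above.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE, so env:// init works out of the box.
    dist.init_process_group(backend="gloo")  # "nccl" when training on GPUs
    rank = dist.get_rank()
    model = DDP(torch.nn.Linear(10, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for step in range(100000):
        opt.zero_grad()
        loss = model(torch.randn(8, 10)).sum()
        loss.backward()
        opt.step()
        if rank == 0 and step % 100 == 0:
            print(f"step {step}", flush=True)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()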
Case 1
- Both nodes start the task simultaneously, and the training begins normally.
- After terminating the worker node B task (using Ctrl+C or kill -15), master node A hangs and the training stalls.
- Restarting the worker node B task sometimes fails with torch.distributed.elastic.rendezvous.api.RendezvousClosedError, but occasionally it restarts successfully. The behavior is irregular, and the --max-restarts parameter does not seem to take effect: the error occurs regardless of increasing or decreasing its value and seems to depend on the timing of the rejoin (though I'm not sure about that).
Case 2
- Both nodes start the task simultaneously, and the training begins normally.
- After terminating the worker node B task (using kill -9), master node A hangs and the training stalls.
- Restarting the worker node B task allows the training to restart, but the --max-restarts parameter does not seem to take effect here either (see the sketch after this list).
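To make the restart bookkeeping in Cases 1 and 2 visible, the restart-related environment variables that torchrun sets can be printed at the top of the training script (a sketch; it only observes TORCHELASTIC_RESTART_COUNT and TORCHELASTIC_MAX_RESTARTS and does not change behavior):

# Add at the top of elastic-demo.py to observe the agent's restart bookkeeping (sketch).
import os

print(
    f"rank={os.environ.get('RANK')} "
    f"restart_count={os.environ.get('TORCHELASTIC_RESTART_COUNT')} "
    f"max_restarts={os.environ.get('TORCHELASTIC_MAX_RESTARTS')}",
    flush=True,
)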
Case 3
- Both nodes start the task simultaneously, and the training begins normally.
- After terminating master node A’s task (using Ctrl+C, kill -15, or kill -9), the entire training crashes immediately.
The detailed error message:
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
    result = self._invoke_run(role)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 829, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 652, in _initialize_workers
    self._rendezvous(worker_group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 489, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1125, in next_rendezvous
    self._op_executor.run(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 667, in run
    raise RendezvousClosedError
torch.distributed.elastic.rendezvous.api.RendezvousClosedError
So my questions are:
- Is the behavior of different signals (SIGINT, SIGTERM, SIGKILL) expected?
- Why does the --max-restarts parameter not seem to affect the restart behavior? Is there something I'm missing in the configuration or use of this parameter? (A minimal single-node sketch of the behavior I expected follows below.)
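For context, my expectation from the docs is that --max-restarts bounds how many times the agent restarts the local worker group after a worker failure or membership change. A minimal single-node sketch of that expected behavior (crash_once.py and the /tmp marker path are made up purely for illustration):

# crash_once.py -- hypothetical single-node repro, not the script from the report above.
# Run with: torchrun --standalone --nproc-per-node=1 --max-restarts=2 crash_once.py
# Expected: the first attempt exits non-zero, the agent restarts the worker,
# and once max-restarts is exhausted torchrun itself fails.
import os
import sys

MARKER = "/tmp/crash_once.marker"  # assumed scratch path, delete between runs

if not os.path.exists(MARKER):
    open(MARKER, "w").close()
    print(f"attempt {os.environ.get('TORCHELASTIC_RESTART_COUNT')}: crashing", flush=True)
    sys.exit(1)  # simulate a worker failure

print(
    f"attempt {os.environ.get('TORCHELASTIC_RESTART_COUNT')}: restarted successfully",
    flush=True,
)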
Versions
torch version:
$ pip show torch
Name: torch
Version: 2.4.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /opt/conda/lib/python3.8/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, bitsandbytes, deepspeed, flash_attn, flash_attn_1, peft, torchaudio, torchpippy, torchvision, transformer_engine, trl
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @dzhulgakov