Elastic training crashes on killed agent #150916
Comments
@d4l3k for the elastic issue. "handle nodes joining or leaving during training" seems similar to torchft.
@andreacarrara-polimi calling pytorch/torch/distributed/elastic/agent/server/api.py, lines 718 to 720 (at f76b7ef).
@kiukchung is this intended behavior? Is there a better way to cleanly scale down a cluster?
Seems related to: #147064
I wasn't aware of torchft. Instead, I care about the elasticity of the underlying infrastructure. I've implemented checkpointing manually, which is standard practice. This kind of elasticity is exactly what Elastic focuses on, as outlined on the first page of its documentation.
I tested killing the agent on the other node.
In an additional test, I stopped the other machine entirely. This caused the rendezvous node to crash with the same error as in the original issue. In further tests, even killing the worker alone gives inconsistent results: sometimes the agent restarts it successfully up to `--max-restarts`.
Please try to reproduce this or share a working example if it behaves differently on your setup.
I agree. That's why I mentioned it in my post. While related, my issue presents a different case of buggy behavior in Elastic. That said, have you made any progress on #147064?
@kiukchung Are there any known, cheap-to-implement temporary workarounds for this? I tried wrapping torchrun in a simple shell script that catches SIGTERM and then sends `kill -9` to the workers and to the agent (the idea being that the workers won't call `shutdown()`), but it still results in a "rendezvous closed" error.
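For reference, a minimal sketch of the kind of wrapper described above; the script structure, kill order, and use of `pkill` are assumptions, and as noted it does not avoid the rendezvous error:

```bash
#!/usr/bin/env bash
# Sketch of the wrapper described above (assumed structure, not a verified fix):
# on SIGTERM, SIGKILL the worker processes first, then the agent, so the agent
# never reaches its rendezvous shutdown() path. As reported, the surviving node
# still fails with a "rendezvous closed" error.

torchrun "$@" &          # the elastic agent
AGENT_PID=$!

on_term() {
  pkill -9 -P "$AGENT_PID" 2>/dev/null   # kill the workers (the agent's children)
  kill -9 "$AGENT_PID" 2>/dev/null       # then kill the agent itself
}
trap on_term TERM

wait "$AGENT_PID"
```

The intent is that SIGKILL prevents the workers and the agent from running any shutdown handler that would close the rendezvous, but in practice the other node still reports the rendezvous as closed.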
🐛 Describe the bug
I'm trying to use Elastic to handle nodes joining or leaving during training. My setup runs two EC2 instances (Ubuntu 24.04, g4dn.xlarge, NVIDIA Tesla T4, driver 550, PyTorch in a venv). My script is minimal, reproducible and attached here. It's a simplified version of this example. Each node runs:
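The exact command is in the attached script; based on the description (two nodes, one worker per node, node 0 hosting the rendezvous), it would look roughly like the sketch below. The endpoint address, port, job id, and restart count are placeholders, not the values from the original setup:

```bash
# Hypothetical reconstruction of the launch command; the real values live in
# the attached script. --node-rank=0 on the node hosting the rendezvous,
# --node-rank=1 on the other node.
torchrun \
  --nnodes=1:2 \
  --nproc-per-node=1 \
  --max-restarts=3 \
  --rdzv-backend=c10d \
  --rdzv-id=elastic-job \
  --rdzv-endpoint=<node0-ip>:29500 \
  --node-rank=0 \
  train.py
```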
Each node launches one agent, which manages one worker. The node with `--node-rank=0` acts as the rendezvous server. If I kill its process, the training crashes as expected, since it's a single point of failure. However, the problem is with the other node. Killing its worker results in correct behavior, as the agent restarts it up to the value of `--max-restarts`. But when I kill its agent, the training crashes instead of continuing with the rendezvous node only. The full traceback of the exception is included below.

On the rendezvous node:
On the other node:
This happens regardless of whether I use `CTRL-C` or `kill <PID>`. I've reviewed several related issues and discussions, like #67616, #67742, #147064 and this post. None of them address this scenario. Let me know if this behavior is expected or if I'm missing something. I'm happy to provide more details if needed.

Versions
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k