[torch/elastic] unexpected behavior of torch elastic #147064
Comments
btw, when I switch to torch==2.2.0, |
Thanks for the details. Could you clarify in your examples above:
The topology of the actual UNIX processes when invoking the |
I was able to repro Case 1 (RendezvousClosedError) by running two agents on a single node. What I ran (on my desktop):
Will take a closer look and report back here. |
To provide further context, I initialize the process group with the code |
Another discovery regarding the |
thanks for the note, will try a repro and cut a separate bug report for this once I confirm it. |
I think … When you … IIRC, if you're restarting the whole torchelastic agent/process, you instead need to use the scheduler to manage max retries. |
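For illustration, the idea of managing retries outside the agent can be sketched as a plain wrapper loop. This is only a sketch under stated assumptions: `run_with_retries` is a hypothetical helper, not part of torch elastic, and on a real cluster the scheduler's own retry policy would play this role. It relaunches the whole command when it exits with a non-zero code, independent of whatever `--max-restarts` does inside the agent.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_retries=3, backoff_s=0.0):
    """Run cmd and relaunch it while it exits non-zero.

    Hypothetical helper (not a torch elastic API). The command is launched
    at most 1 + max_retries times. Returns (last_exit_code, attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        # Launch the whole agent process and wait for it to finish.
        code = subprocess.run(cmd).returncode
        if code == 0 or attempts > max_retries:
            return code, attempts
        # Give the rendezvous backend some time before rejoining.
        time.sleep(backoff_s)


# Hypothetical usage; the actual launch arguments are whatever you already
# pass to the agent:
#   code, attempts = run_with_retries(["torchrun", ...], max_retries=3, backoff_s=5.0)
#   sys.exit(code)
```

The retry budget here lives entirely outside the agent, so it still applies when the agent process itself is killed, which is the situation described in this thread.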
@shinytang6 Have you found any temporary workarounds? Like custom signal handling or maybe monkeypatching torch? |
@NikitaShalagin No, I had to downgrade to PyTorch 2.2.0, where torch elastic works. |
btw, is there any progress on this issue? cc @kiukchung |
@shinytang6 have you tried the new torch 2.7 to see whether it fixes this issue? |
@NikitaShalagin not yet, I only tried 2.2.0 (works) and 2.3.x & 2.4.x (do not work) |
@shinytang6 by the way, I've tested on 2.2.0 and 2.2.2 |
@NikitaShalagin Good to know. I checked the version I tested before, and found that the version that worked was torch 2.2.0 |
hi @shinytang6, haven't had a chance to dig into it. Will try looking into it this week unless @d4l3k has already looked into it. |
From my investigation, two things are at play here:
Case 1 & 3:
Case 2: |
🐛 Describe the bug
Hi all, I ran some simple tests with torch elastic to understand its behavior under node failures, and I encountered several outcomes that differ from what the official documentation describes.
Fault Tolerance & Elasticity test
Master node A command:
Worker node B command:
Case 1
The --max-restarts parameter does not seem to take effect; it occurs regardless of increasing or decreasing its value and appears to depend on the timing of the rejoining (not sure about that).
Case 2
The --max-restarts parameter does not seem to take effect either.
Case 3
The detailed error message:
So my questions are:
Why does the --max-restarts parameter not seem to affect the restart behavior? Is there something I'm missing in the configuration or use of this parameter?
Versions
torch version:
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @dzhulgakov