-
Notifications
You must be signed in to change notification settings - Fork 24.2k
[c10d] Consolidate monitoring thread in PGNCCL #153668
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: gh/fduwjj/138/base
Are you sure you want to change the base?
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153668
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit a7b7195 with merge base 7243c69 ( FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This is the start of a series of efforts to consolidating auxiliary threads in PGNCCL, aka watchdog and heartbeat_monitoring threads. Right now we launch these two threads per PG instances, i.e., if users create hundred or thousand instances of PG or subPGs, we will end up with that twice many side threads which is not efficient. We have a RFC to consolidate them (#146956). Right now both threads are assigned with so many functionalities so it is hard to do the consolidations in one shot, we will try to split it into at least two steps (PRs) to make it easier to test and review. First of all, we start with the heartbeat monitoring thread which is relatively lightweight and conceptually easier to consolidate. What we did in this PR: 1. Make the heartbeat thread class (static) instead of launching it per PGNCCL instance. We make all the logic or variables used in this thread either global or static so that it does not call instance specific API. 2. Remove the dependency on PGStatus which is PG instance specific. (We need to do more around PG Status if we want to consolidate watchdog thread later but it is out of the scope of this PR.) 3. Move the error propagation check to watchdog thread which is more relevant. This is totally fine since we rolled out EventCache out fully so watchdog hang is rare now. Today there are two major functions inside heartbeat monitoring thread today: 1. Check the heartbeat of watchdog thread every 8 minutes, we make the heartbeat of watchdog global instead of instance specific. If there are watchdog hang, it should be global. (I am open to better solutions to this one). 2. We check TCPStore every 30 sec to see if any watchdog timeout happens on other ranks, if so we will initiate a dump signal on the current rank as well. Previously we only let the thread on the default PG instance to do #2, now with this consolidation, we do FR dump signal every 30 secs and heartbeat check every 8 mins at the same time. If we break the polling loop early we will wait until the full heartbeat timeout (8 mins) before killing the whole program (when we first built heartbeat thread we want to directly kill the program when the watchdog or the whole program hang at CudaEvent destory or NCCL Abort). cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k [ghstack-poisoned]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh this is an interesting approach -- hadn't considered just making everything static but it's oddly elegant
How much work do you think it would be to pull the watchdog into a separate class? Would be nice to have some abstractions here since PGNCCL is a big class. Could also generalize the watchdog potentially for usage in other PGs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just some minor comments.
std::atomic<bool> ProcessGroupNCCL::terminateHeartbeatMonitorThread_{false}; | ||
c10::once_flag ProcessGroupNCCL::initFlag_; | ||
std::atomic_uint64_t ProcessGroupNCCL::heartbeat_; | ||
int ProcessGroupNCCL::heartbeatTimeoutInSec_; | ||
int ProcessGroupNCCL::waitTimeoutDumpInMilSec_; | ||
int ProcessGroupNCCL::coordCheckIntervalMilSec_; | ||
std::atomic<bool> ProcessGroupNCCL::watchdogHeartbeatMonitorEnabled_; | ||
std::mutex ProcessGroupNCCL::monitorMutex_; | ||
std::condition_variable ProcessGroupNCCL::monitorWakeUpCV_; | ||
bool ProcessGroupNCCL::dumpOnTimeoutOrEx_; | ||
std::string ProcessGroupNCCL::globalLogPrefix_; | ||
std::thread ProcessGroupNCCL::ncclHeartbeatMonitorThread_; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are these forward declaration for?
static c10::once_flag initFlag_; | ||
|
||
// Heartbeat of watchdog thread. | ||
std::atomic_uint64_t heartbeat_{}; | ||
static std::atomic_uint64_t heartbeat_; | ||
|
||
// The time interval used for deciding whether there is no watchdog heartbeat. | ||
int heartbeatTimeoutInSec_; | ||
static int heartbeatTimeoutInSec_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can pack all these variables and the std::thread ncclHeartbeatMonitorThread_
into a HeartbeatMonitor
class. Then create a static object. And init of all these values can be moved into constructor of the HeartbeatMonitor
class (instead of in constructor of ProcessGroupNCCL
)
This seems too long. Checking a counter across thread would not introduce much overhead. (and here the counter need not lock protection imo, thus even faster) |
Stack from ghstack (oldest at bottom):
This is the start of a series of efforts to consolidating auxiliary threads in PGNCCL, aka watchdog and heartbeat_monitoring threads. Right now we launch these two threads per PG instances, i.e., if users create hundred or thousand instances of PG or subPGs, we will end up with that twice many side threads which is not efficient. We have a RFC to consolidate them (#146956). Right now both threads are assigned with so many functionalities so it is hard to do the consolidations in one shot, we will try to split it into at least two steps (PRs) to make it easier to test and review.
First of all, we start with the heartbeat monitoring thread which is relatively lightweight and conceptually easier to consolidate. What we did in this PR:
Today there are two major functions inside heartbeat monitoring thread today:
Previously we only let the thread on the default PG instance to do #2, now with this consolidation, we do FR dump signal every 30 secs and heartbeat check every 8 mins at the same time. If we break the polling loop early we will wait until the full heartbeat timeout (8 mins) before killing the whole program (when we first built heartbeat thread we want to directly kill the program when the watchdog or the whole program hang at CudaEvent destory or NCCL Abort).
cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k