Update on "[WIP] Consolidate watchdog and monitoring thread" · pytorch/pytorch@a7b7195 · GitHub
Commit a7b7195

Update on "[WIP] Consolidate watchdog and monitoring thread"
This is the start of a series of efforts to consolidate the auxiliary threads in PGNCCL, i.e., the watchdog and heartbeat-monitoring threads. Today we launch both threads per PG instance, so if users create hundreds or thousands of PGs or subPGs, we end up with twice that many side threads, which is inefficient. We have an RFC to consolidate them (#146956). Both threads currently carry so many responsibilities that consolidating them in one shot would be hard, so we split the work into at least two steps (PRs) to make it easier to test and review. We start with the heartbeat-monitoring thread, which is relatively lightweight and conceptually easier to consolidate.

What we did in this PR:
1. Make the heartbeat-monitoring thread class-wide (static) instead of launching it per PGNCCL instance. All logic and variables used by this thread are made global or static so that it does not call instance-specific APIs.
2. Remove the dependency on PGStatus, which is PG-instance specific. (More work around PGStatus is needed if we want to consolidate the watchdog thread later, but that is out of scope for this PR.)
3. Move the error-propagation check into the watchdog thread, where it is more relevant. This is safe because EventCache is now fully rolled out, so watchdog hangs are rare.

The heartbeat-monitoring thread has two major functions today:
1. Check the watchdog thread's heartbeat every 8 minutes. We make the watchdog heartbeat global instead of instance-specific; if the watchdog hangs, that is a global condition. (I am open to better solutions here.)
2. Check TCPStore every 30 seconds to see whether a watchdog timeout happened on another rank; if so, initiate a dump signal on the current rank as well.

Previously only the thread on the default PG instance performed #2. With this consolidation, the single thread checks for FR dump signals every 30 seconds and checks the heartbeat every 8 minutes in the same loop. One caveat: if we break the polling loop early, we wait the full heartbeat timeout (8 minutes) before killing the whole program (when we first built the heartbeat thread, the intent was to kill the program directly when the watchdog or the whole program hangs at CudaEvent destroy or NCCL abort). cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k [ghstack-poisoned]
1 parent 769b736 commit a7b7195

File tree

2 files changed: +2 −3 lines

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

+1 −2
@@ -935,7 +935,6 @@ std::atomic<bool> ProcessGroupNCCL::watchdogHeartbeatMonitorEnabled_;
 std::mutex ProcessGroupNCCL::monitorMutex_;
 std::condition_variable ProcessGroupNCCL::monitorWakeUpCV_;
 bool ProcessGroupNCCL::dumpOnTimeoutOrEx_;
-bool ProcessGroupNCCL::propagatePgError_;
 std::string ProcessGroupNCCL::globalLogPrefix_;
 std::thread ProcessGroupNCCL::ncclHeartbeatMonitorThread_;

@@ -969,6 +968,7 @@ ProcessGroupNCCL::ProcessGroupNCCL(
   desyncDebug_ = getCvarBool(TORCH_NCCL_DESYNC_DEBUG, false) ||
       (dist_debug_level_ >= DebugLevel::Detail);
   rethrowCUDAErrors_ = getCvarBool(TORCH_NCCL_RETHROW_CUDA_ERRORS, true);
+  propagatePgError_ = getCvarBool(TORCH_NCCL_PROPAGATE_ERROR, false);
   // logging C++ stack isn't safe. Introduce a variable to control it.
   logCppStackOnUncleanShutdown_ =
       getCvarBool(TORCH_NCCL_LOG_CPP_STACK_ON_UNCLEAN_SHUTDOWN, true);

@@ -988,7 +988,6 @@ ProcessGroupNCCL::ProcessGroupNCCL(
   // both timeout and other errors.
   dumpOnTimeoutOrEx_ = getCvarBool(TORCH_NCCL_DUMP_ON_TIMEOUT, true) ||
       (dist_debug_level_ >= DebugLevel::Detail);
-  propagatePgError_ = getCvarBool(TORCH_NCCL_PROPAGATE_ERROR, false);
   watchdogHeartbeatMonitorEnabled_.store(
       getCvarBool(TORCH_NCCL_ENABLE_MONITORING, true));
   globalLogPrefix_ = c10::str("[Global Rank ", globalRank(), "] ");

torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp

+1 −1
@@ -1314,7 +1314,7 @@ class TORCH_API ProcessGroupNCCL : public Backend {

   // Whether or not to propagate detected errors to all ranks in the same PG
   // through TCPStore.
-  static bool propagatePgError_;
+  bool propagatePgError_;

   // Whether or not to sleep after an exception is thrown in the watchdog.
   bool sleepAfterException_{};

0 commit comments
