Update on "[WIP] Consolidate watchdog and monitoring thread" · pytorch/pytorch@a7b7195 · GitHub
Commit a7b7195

Update on "[WIP] Consolidate watchdog and monitoring thread"
This is the start of a series of efforts to consolidate the auxiliary threads in PGNCCL, i.e., the watchdog and heartbeat-monitoring threads. Today we launch both threads per PG instance, so if users create hundreds or thousands of PGs or subPGs, we end up with twice that many side threads, which is inefficient. We have an RFC to consolidate them (#146956). Both threads currently carry so many responsibilities that consolidating them in one shot would be hard, so we split the work into at least two steps (PRs) to make it easier to test and review. We start with the heartbeat-monitoring thread, which is relatively lightweight and conceptually easier to consolidate.

What we did in this PR:
1. Make the heartbeat-monitoring thread class-wide (static) instead of launching it per PGNCCL instance. All logic and variables used by this thread are made global or static so that it does not call instance-specific APIs.
2. Remove the dependency on PGStatus, which is PG-instance specific. (More work around PGStatus is needed if we want to consolidate the watchdog thread later, but that is out of scope for this PR.)
3. Move the error-propagation check into the watchdog thread, where it is more relevant. This is safe because EventCache is now fully rolled out, so watchdog hangs are rare.

The heartbeat-monitoring thread has two major functions today:
1. Check the watchdog thread's heartbeat every 8 minutes. We make the watchdog heartbeat global instead of instance-specific; if the watchdog hangs, that is a global condition. (I am open to better solutions here.)
2. Check TCPStore every 30 seconds to see whether a watchdog timeout happened on another rank; if so, initiate a dump signal on the current rank as well.

Previously only the thread on the default PG instance performed #2. With this consolidation, the single thread checks for FR dump signals every 30 seconds and checks the heartbeat every 8 minutes in the same loop. One caveat: if we break the polling loop early, we wait the full heartbeat timeout (8 minutes) before killing the whole program (when we first built the heartbeat thread, the intent was to kill the program directly when the watchdog or the whole program hangs at CudaEvent destroy or NCCL abort). cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k [ghstack-poisoned]
1 parent 769b736 commit a7b7195

File tree

2 files changed: +2 −3 lines

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

+1 −2
@@ -935,7 +935,6 @@ std::atomic<bool> ProcessGroupNCCL::watchdogHeartbeatMonitorEnabled_;
 std::mutex ProcessGroupNCCL::monitorMutex_;
 std::condition_variable ProcessGroupNCCL::monitorWakeUpCV_;
 bool ProcessGroupNCCL::dumpOnTimeoutOrEx_;
-bool ProcessGroupNCCL::propagatePgError_;
 std::string ProcessGroupNCCL::globalLogPrefix_;
 std::thread ProcessGroupNCCL::ncclHeartbeatMonitorThread_;

@@ -969,6 +968,7 @@ ProcessGroupNCCL::ProcessGroupNCCL(
   desyncDebug_ = getCvarBool(TORCH_NCCL_DESYNC_DEBUG, false) ||
       (dist_debug_level_ >= DebugLevel::Detail);
   rethrowCUDAErrors_ = getCvarBool(TORCH_NCCL_RETHROW_CUDA_ERRORS, true);
+  propagatePgError_ = getCvarBool(TORCH_NCCL_PROPAGATE_ERROR, false);
   // logging C++ stack isn't safe. Introduce a variable to control it.
   logCppStackOnUncleanShutdown_ =
       getCvarBool(TORCH_NCCL_LOG_CPP_STACK_ON_UNCLEAN_SHUTDOWN, true);

@@ -988,7 +988,6 @@ ProcessGroupNCCL::ProcessGroupNCCL(
   // both timeout and other errors.
   dumpOnTimeoutOrEx_ = getCvarBool(TORCH_NCCL_DUMP_ON_TIMEOUT, true) ||
       (dist_debug_level_ >= DebugLevel::Detail);
-  propagatePgError_ = getCvarBool(TORCH_NCCL_PROPAGATE_ERROR, false);
   watchdogHeartbeatMonitorEnabled_.store(
       getCvarBool(TORCH_NCCL_ENABLE_MONITORING, true));
   globalLogPrefix_ = c10::str("[Global Rank ", globalRank(), "] ");

torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp

+1 −1
@@ -1314,7 +1314,7 @@ class TORCH_API ProcessGroupNCCL : public Backend {

   // Whether or not to propagate detected errors to all ranks in the same PG
   // through TCPStore.
-  static bool propagatePgError_;
+  bool propagatePgError_;

   // Whether or not to sleep after an exception is thrown in the watchdog.
   bool sleepAfterException_{};

0 commit comments
