[c10d] Consolidate monitoring thread in PGNCCL #153668

fduwjj · 2025-05-15T22:09:59Z

Stack from ghstack (oldest at bottom):

-> [c10d] Consolidate monitoring thread in PGNCCL #153668

This is the start of a series of efforts to consolidating auxiliary threads in PGNCCL, aka watchdog and heartbeat_monitoring threads. Right now we launch these two threads per PG instances, i.e., if users create hundred or thousand instances of PG or subPGs, we will end up with that twice many side threads which is not efficient. We have a RFC to consolidate them (#146956). Right now both threads are assigned with so many functionalities so it is hard to do the consolidations in one shot, we will try to split it into at least two steps (PRs) to make it easier to test and review.

First of all, we start with the heartbeat monitoring thread which is relatively lightweight and conceptually easier to consolidate. What we did in this PR:

Make the heartbeat thread class (static) instead of launching it per PGNCCL instance. We make all the logic or variables used in this thread either global or static so that it does not call instance specific API.
Remove the dependency on PGStatus which is PG instance specific. (We need to do more around PG Status if we want to consolidate watchdog thread later but it is out of the scope of this PR.)
Move the error propagation check to watchdog thread which is more relevant. This is totally fine since we rolled out EventCache out fully so watchdog hang is rare now.

Today there are two major functions inside heartbeat monitoring thread today:

Check the heartbeat of watchdog thread every 8 minutes, we make the heartbeat of watchdog global instead of instance specific. If there are watchdog hang, it should be global. (I am open to better solutions to this one).
We check TCPStore every 30 sec to see if any watchdog timeout happens on other ranks, if so we will initiate a dump signal on the current rank as well.

Previously we only let the thread on the default PG instance to do #2, now with this consolidation, we do FR dump signal every 30 secs and heartbeat check every 8 mins at the same time. If we break the polling loop early we will wait until the full heartbeat timeout (8 mins) before killing the whole program (when we first built heartbeat thread we want to directly kill the program when the watchdog or the whole program hang at CudaEvent destory or NCCL Abort).

cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k

[ghstack-poisoned]

ghstack-source-id: 14b2ad7 Pull Request resolved: #153668

pytorch-bot · 2025-05-15T22:10:04Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153668

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit a7b7195 with merge base 7243c69 ():

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge) (gh) (similar failure)
backends/xnnpack/test/ops/test_conv1d.py::TestConv1d::test_qs8_conv1d_batchnorm_seq

This comment was automatically generated by Dr. CI and updates every 15 minutes.

This is the start of a series of efforts to consolidating auxiliary threads in PGNCCL, aka watchdog and heartbeat_monitoring threads. Right now we launch these two threads per PG instances, i.e., if users create hundred or thousand instances of PG or subPGs, we will end up with that twice many side threads which is not efficient. We have a RFC to consolidate them (#146956). Right now both threads are assigned with so many functionalities so it is hard to do the consolidations in one shot, we will try to split it into at least two steps (PRs) to make it easier to test and review. First of all, we start with the heartbeat monitoring thread which is relatively lightweight and conceptually easier to consolidate. What we did in this PR: 1. Make the heartbeat thread class (static) instead of launching it per PGNCCL instance. We make all the logic or variables used in this thread either global or static so that it does not call instance specific API. 2. Remove the dependency on PGStatus which is PG instance specific. (We need to do more around PG Status if we want to consolidate watchdog thread later but it is out of the scope of this PR.) 3. Move the error propagation check to watchdog thread which is more relevant. This is totally fine since we rolled out EventCache out fully so watchdog hang is rare now. Today there are two major functions inside heartbeat monitoring thread today: 1. Check the heartbeat of watchdog thread every 8 minutes, we make the heartbeat of watchdog global instead of instance specific. If there are watchdog hang, it should be global. (I am open to better solutions to this one). 2. We check TCPStore every 30 sec to see if any watchdog timeout happens on other ranks, if so we will initiate a dump signal on the current rank as well. Previously we only let the thread on the default PG instance to do #2, now with this consolidation, we do FR dump signal every 30 secs and heartbeat check every 8 mins at the same time. If we break the polling loop early we will wait until the full heartbeat timeout (8 mins) before killing the whole program (when we first built heartbeat thread we want to directly kill the program when the watchdog or the whole program hang at CudaEvent destory or NCCL Abort). cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k [ghstack-poisoned]

ghstack-source-id: 777a504 Pull Request resolved: #153668

d4l3k

Oh this is an interesting approach -- hadn't considered just making everything static but it's oddly elegant

How much work do you think it would be to pull the watchdog into a separate class? Would be nice to have some abstractions here since PGNCCL is a big class. Could also generalize the watchdog potentially for usage in other PGs

kwen2501

LGTM, just some minor comments.

kwen2501 · 2025-05-16T20:50:36Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

+std::atomic<bool> ProcessGroupNCCL::terminateHeartbeatMonitorThread_{false};
+c10::once_flag ProcessGroupNCCL::initFlag_;
+std::atomic_uint64_t ProcessGroupNCCL::heartbeat_;
+int ProcessGroupNCCL::heartbeatTimeoutInSec_;
+int ProcessGroupNCCL::waitTimeoutDumpInMilSec_;
+int ProcessGroupNCCL::coordCheckIntervalMilSec_;
+std::atomic<bool> ProcessGroupNCCL::watchdogHeartbeatMonitorEnabled_;
+std::mutex ProcessGroupNCCL::monitorMutex_;
+std::condition_variable ProcessGroupNCCL::monitorWakeUpCV_;
+bool ProcessGroupNCCL::dumpOnTimeoutOrEx_;
+std::string ProcessGroupNCCL::globalLogPrefix_;
+std::thread ProcessGroupNCCL::ncclHeartbeatMonitorThread_;
+


what are these forward declaration for?

kwen2501 · 2025-05-16T20:52:40Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp

+  static c10::once_flag initFlag_;
+
  // Heartbeat of watchdog thread.
-  std::atomic_uint64_t heartbeat_{};
+  static std::atomic_uint64_t heartbeat_;

  // The time interval used for deciding whether there is no watchdog heartbeat.
-  int heartbeatTimeoutInSec_;
+  static int heartbeatTimeoutInSec_;


You can pack all these variables and the std::thread ncclHeartbeatMonitorThread_ into a HeartbeatMonitor class. Then create a static object. And init of all these values can be moved into constructor of the HeartbeatMonitor class (instead of in constructor of ProcessGroupNCCL)

kwen2501 · 2025-05-16T20:57:27Z

Check the heartbeat of watchdog thread every 8 minutes

This seems too long. Checking a counter across thread would not introduce much overhead. (and here the counter need not lock protection imo, thus even faster)

[WIP] Consolidate watchdog and monitoring thread

769b736

[ghstack-poisoned]

fduwjj added a commit that referenced this pull request May 15, 2025

[WIP] Consolidate watchdog and monitoring thread

5a3fefd

ghstack-source-id: 14b2ad7 Pull Request resolved: #153668

pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels May 15, 2025

fduwjj added a commit that referenced this pull request May 15, 2025

[WIP] Consolidate watchdog and monitoring thread

05a260d

ghstack-source-id: 777a504 Pull Request resolved: #153668

fduwjj changed the title ~~[WIP] Consolidate watchdog and monitoring thread~~ [c10d] Consolidate monitoring thread in PGNCCL May 15, 2025

fduwjj requested review from kwen2501, d4l3k, eqy, wconstab and H-Huang May 15, 2025 22:38

d4l3k reviewed May 16, 2025

View reviewed changes

kwen2501 approved these changes May 16, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[c10d] Consolidate monitoring thread in PGNCCL #153668

[c10d] Consolidate monitoring thread in PGNCCL #153668

[c10d] Consolidate monitoring thread in PGNCCL #153668

Are you sure you want to change the base?

[c10d] Consolidate monitoring thread in PGNCCL #153668

Conversation

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153668

✅ You can merge normally! (1 Unrelated Failure)

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment