8000 [c10d] Consolidate monitoring thread in PGNCCL by fduwjj · Pull Request #153668 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

[c10d] Consolidate monitoring thread in PGNCCL #153668

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: gh/fduwjj/138/base
Choose a base branch
from

Conversation

fduwjj
Copy link
Contributor
@fduwjj fduwjj commented May 15, 2025

Stack from ghstack (oldest at bottom):

This is the start of a series of efforts to consolidating auxiliary threads in PGNCCL, aka watchdog and heartbeat_monitoring threads. Right now we launch these two threads per PG instances, i.e., if users create hundred or thousand instances of PG or subPGs, we will end up with that twice many side threads which is not efficient. We have a RFC to consolidate them (#146956). Right now both threads are assigned with so many functionalities so it is hard to do the consolidations in one shot, we will try to split it into at least two steps (PRs) to make it easier to test and review.

First of all, we start with the heartbeat monitoring thread which is relatively lightweight and conceptually easier to consolidate. What we did in this PR:

  1. Make the heartbeat thread class (static) instead of launching it per PGNCCL instance. We make all the logic or variables used in this thread either global or static so that it does not call instance specific API.
  2. Remove the dependency on PGStatus which is PG instance specific. (We need to do more around PG Status if we want to consolidate watchdog thread later but it is out of the scope of this PR.)
  3. Move the error propagation check to watchdog thread which is more relevant. This is totally fine since we rolled out EventCache out fully so watchdog hang is rare now.

Today there are two major functions inside heartbeat monitoring thread today:

  1. Check the heartbeat of watchdog thread every 8 minutes, we make the heartbeat of watchdog global instead of instance specific. If there are watchdog hang, it should be global. (I am open to better solutions to this one).
  2. We check TCPStore every 30 sec to see if any watchdog timeout happens on other ranks, if so we will initiate a dump signal on the current rank as well.

Previously we only let the thread on the default PG instance to do #2, now with this consolidation, we do FR dump signal every 30 secs and heartbeat check every 8 mins at the same time. If we break the polling loop early we will wait until the full heartbeat timeout (8 mins) before killing the whole program (when we first built heartbeat thread we want to directly kill the program when the watchdog or the whole program hang at CudaEvent destory or NCCL Abort).

cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k

fduwjj added a commit that referenced this pull request May 15, 2025
ghstack-source-id: 14b2ad7
Pull Request resolved: #153668
Copy link
pytorch-bot bot commented May 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153668

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit a7b7195 with merge base 7243c69 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels May 15, 2025
8000
This is the start of a series of efforts to consolidating auxiliary threads in PGNCCL, aka watchdog and heartbeat_monitoring threads. Right now we launch these two threads per PG instances, i.e., if users create hundred or thousand instances of PG or subPGs, we will end up with that twice many side threads which is not efficient. We have a RFC to consolidate them (#146956). Right now both threads are assigned with so many functionalities so it is hard to do the consolidations in one shot, we will try to split it into at least two steps (PRs) to make it easier to test and review.

First of all, we start with the heartbeat monitoring thread which is relatively lightweight and conceptually easier to consolidate. What we did in this PR:
1. Make the heartbeat thread class (static) instead of launching it per PGNCCL instance. We make all the logic or variables used in this thread either global or static so that it does not call instance specific API.
2. Remove the dependency on PGStatus which is PG instance specific. (We need to do more around PG Status if we want to consolidate watchdog thread later but it is out of the scope of this PR.)
3. Move the error propagation check to watchdog thread which is more relevant. This is totally fine since we rolled out EventCache out fully so watchdog hang is rare now.

Today there are two major functions inside heartbeat monitoring thread today: 
1. Check the heartbeat of watchdog thread every 8 minutes, we make the heartbeat of watchdog global instead of instance specific. If there are watchdog hang, it should be global. (I am open to better solutions to this one).
2. We check TCPStore every 30 sec to see if any watchdog timeout happens on other ranks, if so we will initiate a dump signal on the current rank as well. 

Previously we only let the thread on the default PG instance to do #2, now with this consolidation, we do FR dump signal every 30 secs and heartbeat check every 8 mins at the same time. If we break the polling loop early we will wait until the full heartbeat timeout (8 mins) before killing the whole program (when we first built heartbeat thread we want to directly kill the program when the watchdog or the whole program hang at CudaEvent destory or NCCL Abort).

cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k 

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request May 15, 2025
ghstack-source-id: 777a504
Pull Request resolved: #153668
@fduwjj fduwjj changed the title [WIP] Consolidate watchdog and monitoring thread [c10d] Consolidate monitoring thread in PGNCCL May 15, 2025
@fduwjj fduwjj requested review from kwen2501, d4l3k, eqy, wconstab and H-Huang May 15, 2025 22:38
Copy link
Member
@d4l3k d4l3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is an interesting approach -- hadn't considered just making everything static but it's oddly elegant

How much work do you think it would be to pull the watchdog into a separate class? Would be nice to have some abstractions here since PGNCCL is a big class. Could also generalize the watchdog potentially for usage in other PGs

Copy link
Contributor
@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just some minor comments.

Comment on lines +928 to +940
std::atomic<bool> ProcessGroupNCCL::terminateHeartbeatMonitorThread_{false};
c10::once_flag ProcessGroupNCCL::initFlag_;
std::atomic_uint64_t ProcessGroupNCCL::heartbeat_;
int ProcessGroupNCCL::heartbeatTimeoutInSec_;
int ProcessGroupNCCL::waitTimeoutDumpInMilSec_;
int ProcessGroupNCCL::coordCheckIntervalMilSec_;
std::atomic<bool> ProcessGroupNCCL::watchdogHeartbeatMonitorEnabled_;
std::mutex ProcessGroupNCCL::monitorMutex_;
std::condition_variable ProcessGroupNCCL::monitorWakeUpCV_;
bool ProcessGroupNCCL::dumpOnTimeoutOrEx_;
std::string ProcessGroupNCCL::globalLogPrefix_;
std::thread ProcessGroupNCCL::ncclHeartbeatMonitorThread_;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are these forward declaration for?

Comment on lines +1175 to +1181
static c10::once_flag initFlag_;

// Heartbeat of watchdog thread.
std::atomic_uint64_t heartbeat_{};
static std::atomic_uint64_t heartbeat_;

// The time interval used for deciding whether there is no watchdog heartbeat.
int heartbeatTimeoutInSec_;
static int heartbeatTimeoutInSec_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can pack all these variables and the std::thread ncclHeartbeatMonitorThread_ into a HeartbeatMonitor class. Then create a static object. And init of all these values can be moved into constructor of the HeartbeatMonitor class (instead of in constructor of ProcessGroupNCCL)

@kwen2501
Copy link
Contributor
kwen2501 commented May 16, 2025

Check the heartbeat of watchdog thread every 8 minutes

This seems too long. Checking a counter across thread would not introduce much overhead. (and here the counter need not lock protection imo, thus even faster)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0