8000 [PyTorch][NCCL PG][Resubmit D67193887] Change getNCCLCommDumpMap to use new ncclCommDumpAll API by jiayulu · Pull Request #153636 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

[PyTorch][NCCL PG][Resubmit D67193887] Change getNCCLCommDumpMap to use new ncclCommDumpAll API #153636

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
H 8000 ide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions torch/csrc/distributed/c10d/NCCLUtils.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -280,6 +280,13 @@ class NCCLComm {

ncclUniqueId getNcclId();
at::DeviceIndex getDeviceIndex();
#if defined(IS_NCCLX) && defined(NCCL_COMM_GET_UNIQUE_HASH)
uint64_t getNcclUniqueHash() {
uint64_t ncclUniqueHash = 0;
C10D_NCCL_CHECK(ncclCommGetUniqueHash(ncclComm_, &ncclUniqueHash), std::nullopt);
return ncclUniqueHash;
}
#endif

// Must not be copyable
NCCLComm(const NCCLComm&) = delete;
Expand Down
20 changes: 16 additions & 4 deletions torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
8000
Original file line number Diff line number Diff line change
Expand Up @@ -364,11 +364,20 @@ static void attachAllocatorHooks() {
static std::
unordered_map<std::string, std::unordered_map<std::string, std::string>>
getNCCLCommDumpMap() {
#if (defined(IS_NCCLX) || defined(USE_ROCM)) && defined(NCCL_COMM_DUMP)
#if (defined(IS_NCCLX) || defined(USE_ROCM)) && defined(NCCL_COMM_DUMP) && \
defined(NCCL_COMM_GET_UNIQUE_HASH)
std::unordered_map<
std::string /* ncclUniqueID */,
std::string /* CommHash */,
std::unordered_map<std::string, std::string> /* dump from this comm */>
ncclDumpMap;
#ifdef NCCL_COMM_DUMP_ALL
auto res = ncclCommDumpAll(ncclDumpMap);
if (res == ncclSuccess) {
return ncclDumpMap;
}
// Fall back to dump from each comm if ncclCommDumpAll failed
#endif // NCCL_COMM_DUMP_ALL

// dump_nccl_trace is only called from the default PG (local_id_=0), but we
// want to dump from all comms so we need to iterate over ncclCommMemPoolMap,
// which is static
Expand All @@ -382,8 +391,11 @@ static std::
}
}
for (auto& ncclComm : allNCCLComms) {
std::string ncclUniqueIDStr = buildNcclUniqueIdStr(ncclComm->getNcclId());
ncclDumpMap[ncclUniqueIDStr] = ncclComm->ncclCommDump();
std::stringstream ss;
ss << std::hex << ncclComm->getNcclUniqueHash();
std::string ncclUniqueHashStr = ss.str();
Comment on lines +394 to +396
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::stringstream ss;
ss << std::hex << ncclComm->getNcclUniqueHash();
std::string ncclUniqueHashStr = ss.str();
std::string ncclUniqueHashStr = fmt::format("{:x}", ncclComm->getNcclUniqueHash());

More efficient and readable to use libfmt here?


ncclDumpMap[ncclUniqueHashStr] = ncclComm->ncclCommDump();
Copy link
Collaborator
@Skylion007 Skylion007 May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ncclDumpMap[ncclUniqueHashStr] = ncclComm->ncclCommDump();
ncclDumpMap[std::move(ncclUniqueHashStr)] = ncclComm->ncclCommDump();

Or even remove the temporary variable entirely if you want.

}
return ncclDumpMap;
#else
Expand Down
Loading
0