Fix support for nccl < 2.17 #145719
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145719
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit ebf3f48 with merge base 762724f. UNSTABLE - the following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
nit: "(since there's a static assert checking NCCL >= 2.7:"
No, it's already 2.7 now: pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp, lines 42 to 44 at f16ce3c.
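A rough sketch of the guard at those lines (the exact code in the repo may differ; `NCCL_MAJOR`/`NCCL_MINOR` come from `nccl.h`):

```cpp
#include <nccl.h>

// Compile-time guard: reject builds against NCCL older than 2.7.
static_assert(
    (NCCL_MAJOR > 2) || (NCCL_MAJOR == 2 && NCCL_MINOR >= 7),
    "NCCL version must be 2.7 or later");
```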
I also got a little confused when seeing 2.4 vs 2.7 in #141914; it looks like 2.4 is the typo? I didn't manage to find an NCCL < 2.8 to validate whether 2.7 really works, and the same goes for 2.4 (2.8 tested).
Oops, sorry about the confusion about 2.4 vs 2.7. Also, when I tried to simplify the code in #141914, I too ran into test timeouts.
Do you plan to fix this? I can wait for your PR to be merged first, or I can also try to resolve it; the failure seems stable on CUDA 11.8.
@c-p-i-o I've verified that the update should fix the failure. The cause is that torch/csrc/distributed/c10d/NCCLUtils.hpp and torch/csrc/distributed/c10d/quantization/quantization_gpu.cu didn't previously include the header that contains the version checks.
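As an illustration of that failure mode (the header and macro names below are hypothetical, not PyTorch's actual ones): if a translation unit never includes the header that defines the version-gating macros, its guards see undefined macros and compile the wrong path.

```cpp
// nccl_version_gates.h - hypothetical header centralizing the checks.
#pragma once
#include <nccl.h>

#if (NCCL_MAJOR > 2) || (NCCL_MAJOR == 2 && NCCL_MINOR >= 17)
#define HYPOTHETICAL_NCCL_HAS_NEW_API 1
#endif

// some_file.cu - sketch. Without the include above, the #ifdef below
// never sees the macro and the new-API path is silently dropped;
// adding the include is the kind of fix described in the comment above.
void launchCollective() {
#ifdef HYPOTHETICAL_NCCL_HAS_NEW_API
  // use the NCCL >= 2.17 API
#else
  // fallback for older NCCL
#endif
}
```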
Go ahead and land your PR! I'll abandon mine - no problem!
The remaining test failure needs to be investigated:
https://github.com/pytorch/pytorch/actions/runs/13013038548/job/36309457605?pr=145719
bf16 uses
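The comment above is truncated; assuming it refers to bfloat16 support, the relevant detail is that ncclBfloat16 only exists in NCCL >= 2.10, so the datatype has to be version-gated, roughly like this (a sketch; the macro name is illustrative):

```cpp
#include <nccl.h>

// Gate bfloat16 on NCCL >= 2.10, where ncclBfloat16 was introduced.
#if defined(NCCL_MAJOR) && \
    ((NCCL_MAJOR > 2) || (NCCL_MAJOR == 2 && NCCL_MINOR >= 10))
#define HAS_NCCL_BF16_DATATYPE 1
#else
#define HAS_NCCL_BF16_DATATYPE 0
#endif
```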
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: trunk / linux-focal-rocm6.3-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4). Details for Dev Infra team: raised by workflow job.
ping, we need another approval to run lint here :)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Command
Details for Dev Infra team: raised by workflow job.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Rebase failed due to Command
Raised by https://github.com/pytorch/pytorch/actions/runs/13602354396
Hi, just wondering: do we still have the build issue for NCCL < 2.17?
I didn't test on main, but it looks like we do.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: trunk / cuda12.4-py3.10-gcc9-sm80 / build. Details for Dev Infra team: raised by workflow job.
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
Fix build failure with older (< 2.17) NCCL.
Refactor NCCL version-related code: consolidate the version checks from various places into torch/csrc/cuda/nccl.h and unify some style (#if to #ifdef), which I hope improves the maintainability of the NCCL part (a sketch of the style change follows the cc list below).
Resolves #141914
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
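A minimal sketch of the #if-to-#ifdef unification described above, assuming the convention that feature macros are defined only when the feature is available (NCCL_HAS_COMM_NONBLOCKING is used purely as an example name):

```cpp
// Before: a plain #if silently treats an undefined macro as 0, so it
// compiles the "feature absent" path on older NCCL only by accident
// (and -Wundef warns about it).
#if NCCL_HAS_COMM_NONBLOCKING
void useNonblockingComm();
#endif

// After: #ifdef keys explicitly on whether the macro is defined at all,
// matching a define-only-when-available convention.
#ifdef NCCL_HAS_COMM_NONBLOCKING
void useNonblockingComm();
#endif
```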