Fix support for nccl < 2.17 by oraluben · Pull Request #145719 · pytorch/pytorch · GitHub

Fix support for nccl < 2.17 #145719


Open · oraluben wants to merge 22 commits into main

Conversation

@oraluben (Contributor) commented Jan 27, 2025

Fix the build failure with older (< 2.17) NCCL.

Refactor the NCCL-version-related code:

  1. Fix the failure against old NCCL versions introduced by [PGNCCL] Use non-blocking mode by default in eager init #138527;
  2. Remove checks that have become dead code because those NCCL versions are no longer supported (since there's a static assert checking NCCL >= 2.7: [rfc][be] static assert that nccl version is >= 2.7 #142023);
  3. Move the NCCL macros from various places to torch/csrc/cuda/nccl.h and unify some of the style (#if to #ifdef), which I hope improves the maintainability of the NCCL code; see the sketch after this description.

Resolves #141914

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
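
As a rough sketch of the guard style in item 3 (the flag name and layout below are illustrative, not necessarily what torch/csrc/cuda/nccl.h actually contains): feature availability is derived from the NCCL version macros once, in one header, and the rest of the code only tests the resulting flag with #ifdef.

// Illustrative sketch only -- not the actual torch/csrc/cuda/nccl.h.
#include <nccl.h>  // provides NCCL_MAJOR / NCCL_MINOR

// Hypothetical flag: non-blocking communicator init needs NCCL >= 2.17.
#if defined(NCCL_MAJOR) && defined(NCCL_MINOR) && \
    ((NCCL_MAJOR > 2) || (NCCL_MAJOR == 2 && NCCL_MINOR >= 17))
#define NCCL_HAS_COMM_NONBLOCKING 1
#endif

void initComm() {
#ifdef NCCL_HAS_COMM_NONBLOCKING
  // Eager, non-blocking communicator initialization (NCCL >= 2.17).
#else
  // Blocking initialization fallback for older NCCL builds.
#endif
}

Keeping the version arithmetic in a single header means each call site only needs #ifdef on one flag, which is easier to audit than repeating the major/minor comparison everywhere.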

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels on Jan 27, 2025
pytorch-bot bot commented Jan 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145719

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit ebf3f48 with merge base 762724f:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@oraluben (Contributor, Author)

@pytorchbot label "topic: not user facing"

pytorch-bot bot added the topic: not user facing label on Jan 27, 2025
oraluben marked this pull request as ready for review on January 27, 2025 08:08
@wconstab (Contributor)

nit: "(since there's a static assert checking NCCL >= 2.7:"
I think you had a typo in your PR description; it should be >= 2.4, right?

@oraluben (Contributor, Author) commented Jan 28, 2025

it should be >= 2.4, right?

No, it's already 2.7 now:

static_assert(
(NCCL_MAJOR == 2 && NCCL_MINOR >= 7) || (NCCL_MAJOR > 2),
"NCCL version must be 2.7 or later");

I also got a bit confused when I saw 2.4 vs 2.7 in #141914; it looks like 2.4 is the typo?

Actually, I couldn't find an NCCL < 2.8 to validate whether 2.7 really works, and the same goes for 2.4 (2.8 is tested).

@c-p-i-o (Contributor) commented Jan 28, 2025

Oops, sorry about the confusion between 2.4 and 2.7.

Also, when I tried to simplify the code in #141914, I too ran into test timeouts.
So there's definitely something nefarious going on that needs to be looked at to get the failing tests to pass.

@oraluben (Contributor, Author)

Also, when I tried to simplify the code in #141914, I too ran into test timeouts.
So there's definitely something nefarious going on that needs to be looked at to get the failing tests to pass.

Do you plan to fix this? I can wait for your PR to be merged first, or I can try to resolve it myself; the failure seems to reproduce reliably on CUDA 11.8.

@oraluben (Contributor, Author)

@c-p-i-o I've verified that the update should fix the failure. The cause is that torch/csrc/distributed/c10d/NCCLUtils.hpp and torch/csrc/distributed/c10d/quantization/quantization_gpu.cu did not previously include the header that contains the checks.
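
For context, a minimal self-contained illustration of the failure mode described above (the macro name is hypothetical and this is not the actual PyTorch source): in the C preprocessor, an undefined identifier evaluates to 0 inside #if, so a file that forgets to include the header defining the check silently compiles the "unsupported" branch instead of producing an error.

// Illustration only. Suppose the guard lives in a shared header:
//   // nccl_version_guards.h (hypothetical)
//   #define HAS_NCCL_BF16 1   // defined only when NCCL is new enough
// and the file below uses the guard but never includes that header:

// #include "nccl_version_guards.h"   // <-- the missing include

#if HAS_NCCL_BF16        // undefined here, so the preprocessor treats it as 0
constexpr bool kBf16Enabled = true;
#else
constexpr bool kBf16Enabled = false;  // silently chosen on every build
#endif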

@c-p-i-o (Contributor) commented Jan 28, 2025

Do you plan to fix this? I can wait for your PR to be merged first, or I can try to resolve it myself; the failure seems to reproduce reliably on CUDA 11.8.

Go ahead and land your PR! I'll abandon mine - no problem!

@c-p-i-o (Contributor) commented Jan 29, 2025

The remaining test failure needs to be investigated:

distributed/algorithms/quantization/test_quantization.py::DistQuantizationTests::test_all_to_all_bfp16
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/distributed/algorithms/quantization/test_quantization.py", line 314, in <module>
    run_tests()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1266, in run_tests
    assert len(failed_tests) == 0, "{} unit test(s) failed:\n\t{}".format(
AssertionError: 1 unit test(s) failed:
    distributed/algorithms/quantization/test_quantization.py::DistQuantizationTests::test_all_to_all_bfp16

https://github.com/pytorch/pytorch/actions/runs/13013038548/job/36309457605?pr=145719

@oraluben (Contributor, Author) commented Jan 29, 2025

Looks like the test itself has a bug. The macro was not working as expected before, so bf16 support in NCCL was never actually enabled?

bfp16 uses _FloatToBfloat16Quantized, which uses fp16 on CPU and bf16 on GPU if supported. Previously, the bf16 support in NCCL was never enabled.
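
A rough sketch of the gating being discussed (the flag name is a placeholder, and the 2.10 cut-off is an assumption about when ncclBfloat16 became available, not something stated in this PR): with the guard header actually visible, the quantized collective can use ncclBfloat16 on new enough NCCL and fall back to half precision otherwise.

#include <nccl.h>

// Placeholder flag; assumes ncclBfloat16 is available from NCCL 2.10 onward.
#if (NCCL_MAJOR > 2) || (NCCL_MAJOR == 2 && NCCL_MINOR >= 10)
#define NCCL_HAS_BF16_DATATYPE 1
#endif

// Wire datatype used by the quantized all_to_all in this sketch.
inline ncclDataType_t quantizedWireDtype() {
#ifdef NCCL_HAS_BF16_DATATYPE
  return ncclBfloat16;  // bf16 path, only reachable if the guard is visible
#else
  return ncclHalf;      // fp16 fallback on older NCCL
#endif
}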

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-rocm6.3-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4)

Details for Dev Infra team (raised by workflow job)

@oraluben (Contributor, Author) commented Feb 8, 2025

Ping, we need another approval to run lint here :)

@oraluben (Contributor, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch merge --squash __pull-request-145719__init__ returned non-zero exit code 1

Auto-merging torch/csrc/distributed/c10d/NCCLUtils.cpp
Auto-merging torch/csrc/distributed/c10d/NCCLUtils.hpp
Auto-merging torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
CONFLICT (content): Merge conflict in torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Squash commit -- not updating HEAD
Automatic merge failed; fix conflicts and then commit the result.
Details for Dev Infra team (raised by workflow job)

@oraluben (Contributor, Author) commented Mar 1, 2025

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/145719/head returned non-zero exit code 1

Rebasing (1/15)
Auto-merging torch/csrc/distributed/c10d/NCCLUtils.cpp
Auto-merging torch/csrc/distributed/c10d/NCCLUtils.hpp
CONFLICT (content): Merge conflict in torch/csrc/distributed/c10d/NCCLUtils.hpp
Auto-merging torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
CONFLICT (content): Merge conflict in torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
error: could not apply 580a675a0b0... Support nccl >= 2.7
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 580a675a0b0... Support nccl >= 2.7

Raised by https://github.com/pytorch/pytorch/actions/runs/13602354396

pytorch-bot bot removed the ciflow/trunk label on Mar 1, 2025
@kwen2501 (Contributor)

Hi, just wondering if we still have the build issue for NCCL < 2.17?

@oraluben (Contributor, Author)

Hi, just wondering if we still have the build issue for NCCL < 2.17?

I didn't test on main, but it looks like we do.

@oraluben (Contributor, Author)

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label on Mar 11, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / cuda12.4-py3.10-gcc9-sm80 / build

Details for Dev Infra team (raised by workflow job)

github-actions bot

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions bot added the Stale label on May 10, 2025
Labels
ciflow/trunk · oncall: distributed · open source · release notes: distributed (c10d) · Stale · topic: not user facing