[CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6 by nWEIdia · Pull Request #151594 · pytorch/pytorch · GitHub

[CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6 #151594


Closed
nWEIdia wants to merge 1 commit into pytorch:main from nWEIdia:main-run-distributed-on-cu126

Conversation

nWEIdia (Collaborator) commented Apr 17, 2025

This PR moves the distributed CUDA CI jobs from CUDA 11.8 to CUDA 12.6.
In doing so, a few unit test failures were exposed. Some, if not all, of them would take a while to root-cause and fix, so they are temporarily skipped after filing tracking issues (a sketch of the skip pattern is included after the list):

#153479 test_nan_assert shows tricky platform-dependent behavior (e.g. skip_but_pass_in_sandcastle; Ubuntu 20.04 does not work, Ubuntu 22.04 works, Amazon Linux 2023 is skipped; which OS does Sandcastle run?)
#153122 CUDA context related
#153517 NCCL regression, future NCCL may fix it
#154073 skip test_symmetric_memory on CUDA 12.6 until it is fixed

See: #147383
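
For illustration only, here is a minimal sketch of the kind of temporary, issue-linked skip described above. It uses a plain `unittest.skipIf` gate keyed on `torch.version.cuda` rather than the internal helpers the test suite actually uses (such as `skip_but_pass_in_sandcastle`); the test name and the exact gating condition are placeholders, not the PR's code.

```python
# Minimal sketch (not the exact code in this PR): temporarily skipping a
# distributed test on CUDA >= 12.6 while a tracking issue stays open.
import unittest

import torch
from torch.testing._internal.common_utils import run_tests, TestCase

# Toolkit version PyTorch was built with; None on CPU-only builds.
_cuda = torch.version.cuda
_CUDA_12_6_OR_LATER = _cuda is not None and tuple(int(x) for x in _cuda.split(".")[:2]) >= (12, 6)


class ExampleDistributedTests(TestCase):
    @unittest.skipIf(
        _CUDA_12_6_OR_LATER,
        "Temporarily skipped on CUDA 12.6+; see https://github.com/pytorch/pytorch/issues/154073",
    )
    def test_symmetric_memory_placeholder(self):
        # Hypothetical stand-in for the real test body.
        self.assertTrue(True)


if __name__ == "__main__":
    run_tests()
```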

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @tinglvv @eqy @ptrblck @atalman @malfet

@nWEIdia nWEIdia requested a review from a team as a code owner April 17, 2025 18:13
pytorch-bot bot commented Apr 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151594

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d307432 with merge base 2618977:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Apr 17, 2025
@nWEIdia nWEIdia added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 17, 2025
malfet (Contributor) previously requested changes Apr 18, 2025

You are migrating those jobs to pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9, but there is no such docker image build rule.

@nWEIdia nWEIdia removed the ciflow/trunk Trigger trunk jobs on your pull request label Apr 18, 2025
@janeyx99 janeyx99 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 18, 2025
@nWEIdia nWEIdia force-pushed the main-run-distributed-on-cu126 branch from 47ec03c to 0de6ee2 Compare May 2, 2025 17:45
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 2, 2025
@nWEIdia nWEIdia requested review from malfet and yf225 May 2, 2025 17:54
@nWEIdia nWEIdia force-pushed the main-run-distributed-on-cu126 branch from b519923 to e9c0220 Compare May 7, 2025 22:25
@nWEIdia nWEIdia added the keep-going Don't stop on first failure, keep running tests until the end label May 7, 2025
@nWEIdia nWEIdia force-pushed the main-run-distributed-on-cu126 branch from e9c0220 to 616f477 Compare May 8, 2025 21:57
nWEIdia (Collaborator, Author) commented May 9, 2025

Need to rebase, but saving the GHA link before rebasing: https://github.com/pytorch/pytorch/actions/runs/14935419951/job/41963368332

nWEIdia (Collaborator, Author) commented May 13, 2025

Noticed another workaround for the test_nan_assert_float16 failure:
#151723 (comment)

Can we remove the pg.allreduce that was recently added in #151723? I am not certain it actually passed the CUDA 11.8 distributed job check.

This actually causes the test_nan_assert jobs to be skipped on platforms that are not CUDA12_AND_ABOVE (see the sketch of such a gate below).
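
For reference, a hedged sketch of how a CUDA12_AND_ABOVE gate like the one mentioned above could be derived from `torch.version.cuda`; the actual definition in the test suite may differ, so treat this as an illustration only.

```python
# Sketch only: one plausible definition of a CUDA12_AND_ABOVE gate; the real
# flag used by the tests may be computed differently.
import torch


def _cuda_major() -> int:
    ver = torch.version.cuda  # e.g. "12.6"; None on CPU-only builds
    return int(ver.split(".")[0]) if ver else 0


CUDA12_AND_ABOVE = _cuda_major() >= 12

# A test decorated like the line below would then be skipped on any platform
# that is not CUDA12_AND_ABOVE, matching the behavior described above:
# @unittest.skipIf(not CUDA12_AND_ABOVE, "test_nan_assert needs CUDA 12+")
```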

nWEIdia added a commit to nWEIdia/pytorch that referenced this pull request May 13, 2025
@nWEIdia nWEIdia requested review from wconstab and kwen2501 May 13, 2025 22:24
@nWEIdia nWEIdia added the ciflow/trunk Trigger trunk jobs on your pull request label May 14, 2025
nWEIdia (Collaborator, Author) commented May 14, 2025

atalman (Contributor) left a comment

lgtm

huydhn (Contributor) commented May 21, 2025

@pytorchbot revert -m 'Sorry for reverting your change but it seems to fail a distributed test in trunk' -c nosignal

distributed/test_symmetric_memory.py::SymmMemSingleProcTest::test_stream_write_value32 (GH job link, HUD commit link)

I'm not sure what happened here; maybe target determination missed this test. It's easier to revert and see.

pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request May 21, 2025
Revert "[CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 (#151594)"

This reverts commit 8cabd23.

Reverted #151594 on behalf of https://github.com/huydhn due to: Sorry for reverting your change but it seems to fail a distributed test in trunk (comment)
pytorchmergebot (Collaborator)

@nWEIdia your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels May 21, 2025
nWEIdia (Collaborator, Author) commented May 21, 2025

@pytorchbot rebase -b main

pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

pytorchmergebot (Collaborator)

Successfully rebased main-run-distributed-on-cu126 onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout main-run-distributed-on-cu126 && git pull --rebase)

Use cu126+gcc11 docker image

Change test_extra_cuda_context non pynvml path heuristics, Update message to show 70%

Distributed job: Do not use "linux-focal-cuda12_6-py3_10-gcc11"  build-environment as it
is used by the "default" cuda 12.6 job

Skip test_allgather_float8 due to SM<90 and Skip
test_non_blocking_with_eager_init due to regressions with NCCL < 2.27

Skip test_non_blocking_with_eager_init, test_nan_assert

Skip another test due to pytorch#154073
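
As a rough illustration of what an extra-CUDA-context check can look like (not the code touched by the commit message above, which only tweaks the fallback heuristics used when pynvml is absent), the sketch below counts the compute processes holding a context on each GPU via pynvml; a distributed test can treat "more processes than expected ranks" on a device as an extra-context failure.

```python
# Illustrative sketch, not the PR's implementation: use pynvml to count the
# compute processes that currently hold a CUDA context on each GPU.
import pynvml  # provided by the nvidia-ml-py / pynvml package


def contexts_per_gpu() -> list[int]:
    pynvml.nvmlInit()
    try:
        counts = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
            counts.append(len(procs))
        return counts
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    # One entry per GPU; an unexpectedly high count hints at an extra context.
    print(contexts_per_gpu())
```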
@nWEIdia nWEIdia force-pushed the main-run-distributed-on-cu126 branch from 049be83 to d307432 Compare May 22, 2025 01:54
@nWEIdia nWEIdia changed the title [CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 [CI][CUDA] Move cuda 11.8 distributed pull jobs to cuda 12.6 May 22, 2025
@nWEIdia nWEIdia changed the title [CI][CUDA] Move cuda 11.8 distributed pull jobs to cuda 12.6 [CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6 May 22, 2025
kwen2501 (Contributor) left a comment

Unblocking. Left two comments.

Comment on lines +3683 to +3685
device = torch.device(f"cuda:{self.rank:d}")
if not sm_is_or_higher_than(device, 9, 0):
    self.skipTest("FP8 reduction support begins with sm90 capable devices")
Contributor

This is an all-gather test, so there is no need to skip for lack of reduction support I guess?

Collaborator Author

Agreed, this skip reason seems a bit inconsistent.

I am using that message mainly because it is what gets printed when trying to execute this test:

`torch.distributed.DistBackendError: NCCL error in: /my_workspace/gitlab-pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3551, invalid argument (run with NCCL_DEBUG=WARN for details), ...
ncclInvalidArgument: Invalid value for an argument.
Last error:
FP8 reduction support begins with sm90 capable devices.

To execute this test, run the following from the base repo dir:
python test/distributed/test_c10d_nccl.py NcclProcessGroupWithDispatchedCollectivesTests.test_allgather_float8_float8_e5m2`

Contributor

Hmm, the error is thrown from NCCL -- could it be too restrictive?

Collaborator Author

It could be the case; cc @kiskra-nvidia.
I could use standalone NCCL tests to check all-gather with FP8 on the machine (a minimal sketch follows).
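
Here is a minimal sketch of such a standalone check, under the assumption that a single-rank NCCL process group plus a try/except around the collective is enough to probe whether NCCL accepts a float8_e5m2 all-gather on this machine; this is not the PR's test code, and the addresses and tensor sizes are arbitrary.

```python
# Minimal single-process probe (an assumption of what a "standalone" check
# could look like): does NCCL accept an all-gather on float8_e5m2 here?
import os

import torch
import torch.distributed as dist

if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=0, world_size=1)
    device = torch.device("cuda:0")
    x = torch.ones(8, device=device).to(torch.float8_e5m2)
    out = torch.empty(8, device=device, dtype=torch.float8_e5m2)
    try:
        # All-gather is a pure copy, so in principle it should not need
        # sm90 FP8 *reduction* support; this probes what NCCL actually does.
        dist.all_gather_into_tensor(out, x)
        torch.cuda.synchronize()
        print("float8_e5m2 all_gather succeeded")
    except Exception as exc:  # e.g. DistBackendError raised from NCCL
        print(f"float8_e5m2 all_gather failed: {exc}")
    finally:
        dist.destroy_process_group()
```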


Weird. Checking...

nWEIdia (Collaborator, Author) commented May 22, 2025

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

Labels
ci-no-td (Do not run TD on this PR)
ciflow/trunk (Trigger trunk jobs on your pull request)
keep-going (Don't stop on first failure, keep running tests until the end)
Merged
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
open source
Reverted
topic: not user facing (topic category)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)