[CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 #151594
base: main
Conversation
🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/151594. Note: links to docs will display an error until the docs builds have completed.
❗ There is 1 currently active SEV; if your PR is affected, please review it.
✅ No failures as of commit a8fcd55 with merge base cfee904.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
You are migrating those jobs to pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9, but there is no such docker image build rule.
Force-pushed from 47ec03c to 0de6ee2.
Force-pushed from b519923 to e9c0220.
Force-pushed from e9c0220 to 616f477.
Need to rebase, but saving the GHA link before rebasing: https://github.com/pytorch/pytorch/actions/runs/14935419951/job/41963368332
Force-pushed from a97f921 to ef32efb.
Noticed another workaround for the test_nan_assert_float16 failure: can we remove the pg.allreduce that was recently added in #151723? I am not certain it actually passed the cuda11.8 distributed job check, and it causes the test_nan_assert tests to skip on platforms that are not CUDA12_AND_ABOVE.
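For reference, a minimal sketch of the kind of CUDA-version-gated skip being discussed; CUDA12_AND_ABOVE is the name used in the comment above, but how it is computed here, the use of unittest.skipIf, and the helper/class names are assumptions rather than the actual test code.

```python
# Hedged sketch, not the actual PyTorch test code: it only illustrates gating
# test_nan_assert on a CUDA 12+ check similar to the CUDA12_AND_ABOVE condition
# mentioned above. The helper and class names are hypothetical.
import unittest

import torch


def _cuda12_and_above() -> bool:
    # True only when running against a CUDA build with major version >= 12.
    if not torch.cuda.is_available() or torch.version.cuda is None:
        return False
    return int(torch.version.cuda.split(".")[0]) >= 12


class NanAssertExample(unittest.TestCase):
    @unittest.skipIf(not _cuda12_and_above(), "requires CUDA >= 12")
    def test_nan_assert_float16(self):
        ...  # body elided; the real test lives in the distributed test suite


if __name__ == "__main__":
    unittest.main()
```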
For the PR_time_benchmarks failure, there are more like this. Cross link: #149370
lgtm
The pr_benchmark failure seems to point primarily to a "REGRESSION" rather than to missing benchmarks. Update: it no longer fails after rebasing: https://github.com/pytorch/pytorch/actions/runs/15051405343/job/42310764668
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
no such manifest: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9:5e3a845da711d9e7a1b6d216d4b5da9b4062f0a8 as in https://ossci-raw-job-status.s3.amazonaws.com/log/40794351091
…-environment as it is used by the "default" cuda 12.6 job
Skip test_non_blocking_with_eager_init due to NCCL < 2.27.
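A minimal sketch of what an NCCL-version gate for this test could look like; the 2.27 threshold comes from the line above, while the skip mechanism, wrapper, and class names shown here are assumptions, not the actual change.

```python
# Hedged sketch, not the actual change: skip the eager-init test when the
# bundled NCCL is older than 2.27. torch.cuda.nccl.version() returns a tuple
# such as (2, 26, 2); the wrapper and class names are hypothetical.
import unittest

import torch


def _nccl_version() -> tuple:
    try:
        return tuple(torch.cuda.nccl.version())
    except Exception:
        return (0, 0, 0)  # NCCL not available in this build


class EagerInitExample(unittest.TestCase):
    @unittest.skipIf(_nccl_version() < (2, 27), "requires NCCL >= 2.27")
    def test_non_blocking_with_eager_init(self):
        ...  # body elided; the real test lives in the distributed test suite
```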
Successfully rebased.
Force-pushed from 9d63cfd to 0a08508.
@pytorchbot merge
This PR has pending changes requested. Please address the comments and update the PR before merging.
This PR moves the distributed CUDA CI jobs from CUDA 11.8 to CUDA 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so they are temporarily skipped after filing the tracking issues below (a sketch of such a temporary skip follows the list):
#153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle; Ubuntu 20.04 does not work, Ubuntu 22.04 works, Amazon Linux 2023 works - what is the Sandcastle OS?)
#153122 CUDA context related
#153517 NCCL regression; a future NCCL release may fix it
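As noted above, here is a hedged sketch of how such a temporary, issue-referencing skip typically looks; skip_but_pass_in_sandcastle is named in the first item, but the import path and the exact usage shown here are assumptions.

```python
# Hedged sketch of a temporary, issue-referencing skip. The helper name
# skip_but_pass_in_sandcastle appears in the issue list above; the import path
# below is assumed from PyTorch's common test utilities, with a plain
# unittest.skip fallback so the sketch stays self-contained.
import unittest

try:
    from torch.testing._internal.common_utils import skip_but_pass_in_sandcastle
except ImportError:
    def skip_but_pass_in_sandcastle(reason):
        return unittest.skip(reason)


class DistributedExample(unittest.TestCase):
    @skip_but_pass_in_sandcastle("temporarily skipped after the cu118 -> cu126 move, see #153479")
    def test_nan_assert(self):
        ...  # body elided; real tests live in test/distributed
```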
See: #147383
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @tinglvv @eqy @ptrblck @atalman @malfet