[CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 by nWEIdia · Pull Request #151594 · pytorch/pytorch
Open · nWEIdia wants to merge 12 commits into main from main-run-distributed-on-cu126
Conversation

nWEIdia (Collaborator) commented Apr 17, 2025

This PR moves the distributed CUDA CI jobs from CUDA 11.8 to CUDA 12.6.
In doing so, a few unit test failures were exposed; some, if not all, would take a while to root-cause and fix, so they are temporarily skipped (see the sketch below) after filing the following issues:

#153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle; Ubuntu 20.04 does not work, Ubuntu 22.04 works, Amazon Linux 2023 works: what OS does Sandcastle run?)
#153122 CUDA context related
#153517 NCCL regression; a future NCCL release may fix it

See: #147383

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @tinglvv @eqy @ptrblck @atalman @malfet
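
For context, a minimal sketch of the kind of temporary skip applied while the tracking issues stay open. The test name below is hypothetical and the decorator is plain `unittest.skipIf`; the real suite relies on PyTorch's internal skip_but_pass_in_sandcastle helpers mentioned above.

```python
# Minimal sketch (hypothetical test name) of temporarily skipping a test that
# fails on the new CUDA 12.6 CI while a tracking issue remains open.
import unittest

import torch


class ExampleDistributedTest(unittest.TestCase):
    @unittest.skipIf(
        torch.version.cuda is not None and torch.version.cuda.startswith("12.6"),
        "Known failure on CUDA 12.6 CI, tracked in pytorch/pytorch#153122",
    )
    def test_placeholder_collective(self):
        # The real test bodies live under test/distributed/; this stand-in
        # only demonstrates the skip decorator.
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```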

@nWEIdia nWEIdia requested a review from a team as a code owner April 17, 2025 18:13
pytorch-bot bot commented Apr 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151594

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV; if your PR is affected, please review it.

✅ No Failures

As of commit a8fcd55 with merge base cfee904:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Apr 17, 2025
@nWEIdia nWEIdia added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 17, 2025
malfet (Contributor) left a comment

You are migrating those jobs to pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9, but there is no such Docker image build rule.

@nWEIdia nWEIdia removed the ciflow/trunk Trigger trunk jobs on your pull request label Apr 18, 2025
@janeyx99 janeyx99 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 18, 2025
@nWEIdia nWEIdia force-pushed the main-run-distributed-on-cu126 branch from 47ec03c to 0de6ee2 Compare May 2, 2025 17:45
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 2, 2025
@nWEIdia nWEIdia requested review from malfet and yf225 May 2, 2025 17:54
@nWEIdia nWEIdia force-pushed the main-run-distributed-on-cu126 branch from b519923 to e9c0220 Compare May 7, 2025 22:25
@nWEIdia nWEIdia added the keep-going Don't stop on first failure, keep running tests until the end label May 7, 2025
@nWEIdia nWEIdia force-pushed the main-run-distributed-on-cu126 branch from e9c0220 to 616f477 Compare May 8, 2025 21:57
nWEIdia (Collaborator, Author) commented May 9, 2025

Need to rebase, but saving the GHA link before rebasing: https://github.com/pytorch/pytorch/actions/runs/14935419951/job/41963368332

nWEIdia (Collaborator, Author) commented May 13, 2025

Noticed another workaround for the test_nan_assert_float16 failure:
#151723 (comment)

Can we remove the pg.allreduce that was recently added in #151723? I am not certain it actually passed the cuda11.8 distributed job check.

It also causes test_nan_assert to be skipped on platforms that do not satisfy CUDA12_AND_ABOVE (see the gate sketch below).
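
For readers following along, a minimal sketch of what a CUDA12_AND_ABOVE-style gate typically checks; the helper below is an assumption for illustration, not the exact code from #151723.

```python
# Hypothetical reimplementation of a CUDA12_AND_ABOVE-style gate; the real
# constant lives in PyTorch's distributed test utilities.
import torch


def cuda12_and_above() -> bool:
    # torch.version.cuda is None on CPU-only or ROCm builds.
    if torch.version.cuda is None:
        return False
    major = int(torch.version.cuda.split(".")[0])
    return major >= 12


CUDA12_AND_ABOVE = cuda12_and_above()
# Tests gated on this flag run only on CUDA 12+ builds and are skipped on
# cu118 jobs, which is why the change in #151723 can hide test_nan_assert
# from the cuda11.8 distributed check.
```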

nWEIdia added a commit to nWEIdia/pytorch that referenced this pull request May 13, 2025
@nWEIdia nWEIdia requested review from wconstab and kwen2501 May 13, 2025 22:24
@nWEIdia nWEIdia added the ciflow/trunk Trigger trunk jobs on your pull request label May 14, 2025
nWEIdia (Collaborator, Author) commented May 14, 2025

atalman (Contributor) left a comment

lgtm

nWEIdia (Collaborator, Author) commented May 15, 2025

The pr_benchmark_failure seems to point primarily at the following REGRESSION:

2025-05-14T08:11:32.9226783Z REGRESSION: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') failed, actual result 44313208084 is 3.15% higher than expected 42960000000 ±2.50% if this is an expected regression, please update the expected results.
2025-05-14T08:11:32.9228366Z please update all results that changed significantly, and not only the failed ones

rather than at the missing benchmarks:

2025-05-14T08:11:32.9256052Z MISSING REGRESSION TEST: benchmark ('basic_InlineMod_eager', 'compile_time_instruction_count') does not have a regression test enabled for it.
2025-05-14T08:11:32.9257315Z MISSING REGRESSION TEST: benchmark ('mm_loop_inductor_gpu', 'compile_time_instruction_count') does not have a regression test enabled for it.
2025-05-14T08:11:32.9258589Z MISSING REGRESSION TEST: benchmark ('mm_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') does not have a regression test enabled for it.
2025-05-14T08:11:32.9259893Z MISSING REGRESSION TEST: benchmark ('basic_NestedModule_eager', 'compile_time_instruction_count') does not have a regression test enabled for it.

Update: it no longer fails after rebasing https://github.com/pytorch/pytorch/actions/runs/15051405343/job/42310764668
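
To make the threshold in that message concrete, here is a small back-of-the-envelope check; the values are copied from the log above and the variable names are illustrative, not the actual benchmark harness code.

```python
# Reproduces the arithmetic in the REGRESSION line: the measured
# compile_time_instruction_count is compared to the expected value
# with a 2.50% tolerance band.
expected = 42_960_000_000
actual = 44_313_208_084
tolerance = 0.025  # ±2.50%

deviation = (actual - expected) / expected
print(f"deviation: {deviation:+.2%}")  # -> +3.15%

if abs(deviation) > tolerance:
    print("REGRESSION: outside the accepted band; update the expected "
          "results if the change is intentional")
```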

nWEIdia (Collaborator, Author) commented May 15, 2025

@pytorchbot rebase -b main

pytorchmergebot (Collaborator) commented:

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

pytorchmergebot (Collaborator) commented:

Successfully rebased main-run-distributed-on-cu126 onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout main-run-distributed-on-cu126 && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the main-run-distributed-on-cu126 branch from 9d63cfd to 0a08508 Compare May 15, 2025 17:28
nWEIdia (Collaborator, Author) commented May 15, 2025

@pytorchbot merge

pytorch-bot bot commented May 15, 2025

This PR has pending changes requested. Please address the comments and update the PR before merging.

Labels
ciflow/trunk (Trigger trunk jobs on your pull request), keep-going (Don't stop on first failure, keep running tests until the end), oncall: distributed (Add this issue/PR to distributed oncall triage queue), open source, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
7 participants