[CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 #151594
base: main
Conversation
🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/151594. Note: links to docs will display an error until the docs builds have completed.
❗ There is 1 currently active SEV; if your PR is affected, please review it.
✅ No failures as of commit a8fcd55 with merge base cfee904.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
You are migrating those jobs to pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9, but there is no such docker image build rule.
Force-pushed from 47ec03c to 0de6ee2.
Force-pushed from b519923 to e9c0220.
Force-pushed from e9c0220 to 616f477.
Need to rebase, but saving the GHA link before rebasing: https://github.com/pytorch/pytorch/actions/runs/14935419951/job/41963368332
Force-pushed from a97f921 to ef32efb.
Noticed another workaround for the test_nan_assert_float16 failure: can we remove the pg.allreduce that was recently added in #151723? I am not certain it actually passed the cuda11.8 distributed job check, and it causes the test_nan_assert tests to skip on platforms that are not CUDA12_AND_ABOVE.
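For reference, a minimal sketch of the kind of CUDA-version-gated skip being discussed; CUDA12_AND_ABOVE is the name used in the comment above, but how it is computed here, the use of unittest.skipIf, and the helper/class names are assumptions rather than the actual test code.

```python
# Hedged sketch, not the actual PyTorch test code: it only illustrates gating
# test_nan_assert on a CUDA 12+ check similar to the CUDA12_AND_ABOVE condition
# mentioned above. The helper and class names are hypothetical.
import unittest

import torch


def _cuda12_and_above() -> bool:
    # True only when running against a CUDA build with major version >= 12.
    if not torch.cuda.is_available() or torch.version.cuda is None:
        return False
    return int(torch.version.cuda.split(".")[0]) >= 12


class NanAssertExample(unittest.TestCase):
    @unittest.skipIf(not _cuda12_and_above(), "requires CUDA >= 12")
    def test_nan_assert_float16(self):
        ...  # body elided; the real test lives in the distributed test suite


if __name__ == "__main__":
    unittest.main()
```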
For the PR_time_benchmarks failure, there are more like this. Cross link: #149370
lgtm
The pr_benchmark failure seems to point primarily to a "REGRESSION" rather than to missing benchmarks. Update: it no longer fails after rebasing: https://github.com/pytorch/pytorch/actions/runs/15051405343/job/42310764668
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
no such manifest: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9:5e3a845da711d9e7a1b6d216d4b5da9b4062f0a8 as in https://ossci-raw-job-status.s3.amazonaws.com/log/40794351091
…-environment as it is used by the "default" cuda 12.6 job
Skip test_non_blocking_with_eager_init due to NCCL < 2.27.
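A minimal sketch of what an NCCL-version gate for this test could look like; the 2.27 threshold comes from the line above, while the skip mechanism, wrapper, and class names shown here are assumptions, not the actual change.

```python
# Hedged sketch, not the actual change: skip the eager-init test when the
# bundled NCCL is older than 2.27. torch.cuda.nccl.version() returns a tuple
# such as (2, 26, 2); the wrapper and class names are hypothetical.
import unittest

import torch


def _nccl_version() -> tuple:
    try:
        return tuple(torch.cuda.nccl.version())
    except Exception:
        return (0, 0, 0)  # NCCL not available in this build


class EagerInitExample(unittest.TestCase):
    @unittest.skipIf(_nccl_version() < (2, 27), "requires NCCL >= 2.27")
    def test_non_blocking_with_eager_init(self):
        ...  # body elided; the real test lives in the distributed test suite
```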
Successfully rebased.
Force-pushed from 9d63cfd to 0a08508.
@pytorchbot merge
This PR has pending changes requested. Please address the comments and update the PR before merging.
This PR moves the distributed CUDA CI jobs from CUDA 11.8 to CUDA 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so they are temporarily skipped after filing the tracking issues below (a sketch of such a temporary skip follows the list):
#153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle; Ubuntu 20.04 does not work, Ubuntu 22.04 works, Amazon Linux 2023 works - what is the Sandcastle OS?)
#153122 CUDA context related
#153517 NCCL regression; a future NCCL release may fix it
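As noted above, here is a hedged sketch of how such a temporary, issue-referencing skip typically looks; skip_but_pass_in_sandcastle is named in the first item, but the import path and the exact usage shown here are assumptions.

```python
# Hedged sketch of a temporary, issue-referencing skip. The helper name
# skip_but_pass_in_sandcastle appears in the issue list above; the import path
# below is assumed from PyTorch's common test utilities, with a plain
# unittest.skip fallback so the sketch stays self-contained.
import unittest

try:
    from torch.testing._internal.common_utils import skip_but_pass_in_sandcastle
except ImportError:
    def skip_but_pass_in_sandcastle(reason):
        return unittest.skip(reason)


class DistributedExample(unittest.TestCase):
    @skip_but_pass_in_sandcastle("temporarily skipped after the cu118 -> cu126 move, see #153479")
    def test_nan_assert(self):
        ...  # body elided; real tests live in test/distributed
```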
See: #147383
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @tinglvv @eqy @ptrblck @atalman @malfet