[CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6 #151594
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151594
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit d307432 with merge base 2618977.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
You are migrating those jobs to pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9, but there is no such docker image build rule.
Need to rebase, but saving the GHA link before rebasing: https://github.com/pytorch/pytorch/actions/runs/14935419951/job/41963368332
Noticed another workaround for the test_nan_assert_float16 failure: can we remove the pg.allreduce that was recently added in #151723? I am not certain it actually passed the cuda118 distributed job check, and it causes the test_nan_assert jobs to be skipped on platforms that are not CUDA12_AND_ABOVE.
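For context, here is a minimal sketch of what such a CUDA-12-and-above gate can look like, assuming a torch.version.cuda based check (the helper name and decorator below are illustrative, not the actual gating used in the distributed test suite):

```python
import unittest

import torch


def cuda_12_and_above() -> bool:
    # torch.version.cuda is a string such as "12.6", or None on CPU-only builds.
    ver = torch.version.cuda
    return ver is not None and int(ver.split(".")[0]) >= 12


class NanAssertExample(unittest.TestCase):
    # Gating like this skips the test everywhere the toolkit is below CUDA 12,
    # which is how a cuda118 job can silently stop exercising test_nan_assert.
    @unittest.skipUnless(cuda_12_and_above(), "requires CUDA >= 12")
    def test_nan_assert_float16(self):
        # Placeholder body; the real test exercises NCCL's NaN-assert path.
        x = torch.full((4,), float("nan"), dtype=torch.float16, device="cuda")
        self.assertTrue(torch.isnan(x).all())
```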
For the PR_time_benchmarks failure, there are more failures like this one. Cross link: #149370
lgtm
@pytorchbot revert -m 'Sorry for reverting your change but it seems to fail a distributed test in trunk' -c nosignal

Failing test: distributed/test_symmetric_memory.py::SymmMemSingleProcTest::test_stream_write_value32 (GH job link, HUD commit link). I'm not sure what happened here; maybe target determination missed this test. It's easier to revert and see.
@pytorchbot successfully started a revert job. Check the current status here.
…124-sm75 to cu126-sm75 (#151594)"
This reverts commit 8cabd23. Reverted #151594 on behalf of https://github.com/huydhn due to: Sorry for reverting your change but it seems to fail a distributed test in trunk ([comment](#151594 (comment)))
@nWEIdia your PR has been successfully reverted.
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Successfully rebased |
- Use cu126+gcc11 docker image
- Change test_extra_cuda_context non-pynvml path heuristics; update message to show 70%
- Distributed job: do not use the "linux-focal-cuda12_6-py3_10-gcc11" build-environment, as it is used by the "default" cuda 12.6 job
- Skip test_allgather_float8 due to SM < 90
- Skip test_non_blocking_with_eager_init due to regressions with NCCL < 2.27
- Skip test_non_blocking_with_eager_init, test_nan_assert
- Skip another test due to pytorch#154073
Unblocking. Left two comments.
device = torch.device(f"cuda:{self.rank:d}")
if not sm_is_or_higher_than(device, 9, 0):
    self.skipTest("FP8 reduction support begins with sm90 capable devices")
This is an all-gather test, so there is no need to skip for lack of reduction support I guess?
Agreed, this skip reason seems a bit inconsistent.
I'm using it mainly because this message is displayed when trying to execute the test:
`torch.distributed.DistBackendError: NCCL error in: /my_workspace/gitlab-pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3551, invalid argument (run with NCCL_DEBUG=WARN for details), ...
ncclInvalidArgument: Invalid value for an argument.
Last error:
FP8 reduction support begins with sm90 capable devices.
To execute this test, run the following from the base repo dir:
python test/distributed/test_c10d_nccl.py NcclProcessGroupWithDispatchedCollectivesTests.test_allgather_float8_float8_e5m2`
Hmm, the error is thrown from NCCL -- could it be too restrictive?
It could be the case, cc @kiskra-nvidia
I could use standalone nccl tests to check allgather with FP8 on the machine.
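For illustration, a PyTorch-level standalone check along those lines might look like the sketch below (an assumption-laden repro, not an existing test in the repo: it assumes two local GPUs, the NCCL backend, and a free port 29500):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # All-gather a float8 tensor; on pre-sm90 devices with older NCCL this is
    # where ncclInvalidArgument ("FP8 reduction support begins with sm90 ...")
    # has been observed, even though no reduction is involved.
    x = torch.ones(8, device="cuda").to(torch.float8_e5m2)
    out = torch.empty(8 * world_size, device="cuda", dtype=torch.float8_e5m2)
    dist.all_gather_into_tensor(out, x)
    torch.cuda.synchronize()

    if rank == 0:
        print("all_gather_into_tensor on float8_e5m2 succeeded")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```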
Weird. Checking...
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This PR moves the distributed CUDA CI jobs from CUDA 11.8 to CUDA 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so they are temporarily skipped after filing the tracking issues below (a sketch of the skip pattern follows the issue list):
- #153479: test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle; Ubuntu 20.04 does not work, Ubuntu 22.04 works, Amazon Linux 2023 skips; what is the Sandcastle OS?)
- #153122: CUDA context related
- #153517: NCCL regression; a future NCCL release may fix it
- #154073: skip test_symmetric_memory for CUDA 12.6 until it is fixed
See: #147383
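As a rough illustration of the temporary-skip pattern (a sketch under assumptions; the real tests use helpers from torch.testing._internal and gate on more precise conditions), a skip tied to one of the tracking issues might look like:

```python
import unittest

import torch


class SymmetricMemoryExample(unittest.TestCase):
    # Hypothetical gate: skip on CUDA 12.6 builds and point at the tracking issue.
    @unittest.skipIf(
        torch.version.cuda is not None and torch.version.cuda.startswith("12.6"),
        "Temporarily skipped on CUDA 12.6; tracked in pytorch/pytorch#154073",
    )
    def test_stream_write_value32(self):
        # Placeholder body; the real test lives in distributed/test_symmetric_memory.py.
        self.assertTrue(torch.cuda.is_available())
```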
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @tinglvv @eqy @ptrblck @atalman @malfet