[CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6 by nWEIdia · Pull Request #151594 · pytorch/pytorch · GitHub

[CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6 #151594


Closed
nWEIdia wants to merge 1 commit into pytorch:main from nWEIdia:main-run-distributed-on-cu126

Conversation

nWEIdia (Collaborator) commented Apr 17, 2025

This PR moves the distributed CUDA CI jobs from CUDA 11.8 to CUDA 12.6.
In doing so, a few unit test failures were exposed. Some, if not all, of them would take a while to root-cause and fix, so they are temporarily skipped after filing tracking issues (a sketch of the skip pattern is included after the list):

#153479 test_nan_assert shows tricky platform-dependent behavior (e.g. skip_but_pass_in_sandcastle; Ubuntu 20.04 does not work, Ubuntu 22.04 works, Amazon Linux 2023 is skipped; which OS does Sandcastle run?)
#153122 CUDA context related
#153517 NCCL regression, future NCCL may fix it
#154073 skip test_symmetric_memory on CUDA 12.6 until it is fixed

See: #147383
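
For illustration only, here is a minimal sketch of the kind of temporary, issue-linked skip described above. It uses a plain `unittest.skipIf` gate keyed on `torch.version.cuda` rather than the internal helpers the test suite actually uses (such as `skip_but_pass_in_sandcastle`); the test name and the exact gating condition are placeholders, not the PR's code.

```python
# Minimal sketch (not the exact code in this PR): temporarily skipping a
# distributed test on CUDA >= 12.6 while a tracking issue stays open.
import unittest

import torch
from torch.testing._internal.common_utils import run_tests, TestCase

# Toolkit version PyTorch was built with; None on CPU-only builds.
_cuda = torch.version.cuda
_CUDA_12_6_OR_LATER = _cuda is not None and tuple(int(x) for x in _cuda.split(".")[:2]) >= (12, 6)


class ExampleDistributedTests(TestCase):
    @unittest.skipIf(
        _CUDA_12_6_OR_LATER,
        "Temporarily skipped on CUDA 12.6+; see https://github.com/pytorch/pytorch/issues/154073",
    )
    def test_symmetric_memory_placeholder(self):
        # Hypothetical stand-in for the real test body.
        self.assertTrue(True)


if __name__ == "__main__":
    run_tests()
```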

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @tinglvv @eqy @ptrblck @atalman @malfet

@nWEIdia nWEIdia requested a review from a team as a code owner April 17, 2025 18:13
pytorch-bot bot commented Apr 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151594

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d307432 with merge base 2618977:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Apr 17, 2025
@nWEIdia nWEIdia added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 17, 2025
malfet (Contributor) previously requested changes Apr 18, 2025

You are migrating those jobs to pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9, but there is no such docker image build rule.

@nWEIdia nWEIdia removed the ciflow/trunk Trigger trunk jobs on your pull request label Apr 18, 2025
@janeyx99 janeyx99 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 18, 2025
@nWEIdia nWEIdia force-pushed the main-run-distributed-on-cu126 branch from 47ec03c to 0de6ee2 Compare May 2, 2025 17:45
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 2, 2025
@nWEIdia nWEIdia requested review from malfet and yf225 May 2, 2025 17:54
@nWEIdia nWEIdia force-pushed the main-run-distributed-on-cu126 branch from b519923 to e9c0220 Compare May 7, 2025 22:25
@nWEIdia nWEIdia added the keep-going Don't stop on first failure, keep running tests until the end label May 7, 2025
@nWEIdia nWEIdia force-pushed the main-run-distributed-on-cu126 branch from e9c0220 to 616f477 Compare May 8, 2025 21:57
nWEIdia (Collaborator, Author) commented May 9, 2025

Need to rebase, but saving the GHA link before rebasing: https://github.com/pytorch/pytorch/actions/runs/14935419951/job/41963368332

nWEIdia (Collaborator, Author) commented May 13, 2025

Noticed another workaround for the test_nan_assert_float16 failure:
#151723 (comment)

Can we remove the pg.allreduce that was recently added in #151723? I am not certain it actually passed the CUDA 11.8 distributed job check.

This actually causes the test_nan_assert jobs to be skipped on platforms that are not CUDA12_AND_ABOVE (see the sketch of such a gate below).
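
For reference, a hedged sketch of how a CUDA12_AND_ABOVE gate like the one mentioned above could be derived from `torch.version.cuda`; the actual definition in the test suite may differ, so treat this as an illustration only.

```python
# Sketch only: one plausible definition of a CUDA12_AND_ABOVE gate; the real
# flag used by the tests may be computed differently.
import torch


def _cuda_major() -> int:
    ver = torch.version.cuda  # e.g. "12.6"; None on CPU-only builds
    return int(ver.split(".")[0]) if ver else 0


CUDA12_AND_ABOVE = _cuda_major() >= 12

# A test decorated like the line below would then be skipped on any platform
# that is not CUDA12_AND_ABOVE, matching the behavior described above:
# @unittest.skipIf(not CUDA12_AND_ABOVE, "test_nan_assert needs CUDA 12+")
```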

nWEIdia added a commit to nWEIdia/pytorch that referenced this pull request May 13, 2025
@nWEIdia nWEIdia requested review from wconstab and kwen2501 May 13, 2025 22:24
@nWEIdia nWEIdia added the ciflow/trunk Trigger trunk jobs on your pull request label May 14, 2025
nWEIdia (Collaborator, Author) commented May 14, 2025

atalman (Contributor) left a comment

lgtm

huydhn (Contributor) commented May 21, 2025

@pytorchbot revert -m 'Sorry for reverting your change but it seems to fail a distributed test in trunk' -c nosignal

distributed/test_symmetric_memory.py::SymmMemSingleProcTest::test_stream_write_value32 (GH job link, HUD commit link)

I'm not sure what happened here; maybe target determination missed this test. It's easier to revert and see.

pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request May 21, 2025
Revert "[CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 (#151594)"

This reverts commit 8cabd23.

Reverted #151594 on behalf of https://github.com/huydhn due to: Sorry for reverting your change but it seems to fail a distributed test in trunk (comment)
pytorchmergebot (Collaborator)

@nWEIdia your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels May 21, 2025
nWEIdia (Collaborator, Author) commented May 21, 2025

@pytorchbot rebase -b main

pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

pytorchmergebot (Collaborator)

Successfully rebased main-run-distributed-on-cu126 onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout main-run-distributed-on-cu126 && git pull --rebase)

Use cu126+gcc11 docker image

Change test_extra_cuda_context non pynvml path heuristics, Update message to show 70%

Distributed job: Do not use "linux-focal-cuda12_6-py3_10-gcc11"  build-environment as it
is used by the "default" cuda 12.6 job

Skip test_allgather_float8 due to SM<90 and Skip
test_non_blocking_with_eager_init due to regressions with NCCL < 2.27

Skip test_non_blocking_with_eager_init, test_nan_assert

Skip another test due to pytorch#154073
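
As a rough illustration of what an extra-CUDA-context check can look like (not the code touched by the commit message above, which only tweaks the fallback heuristics used when pynvml is absent), the sketch below counts the compute processes holding a context on each GPU via pynvml; a distributed test can treat "more processes than expected ranks" on a device as an extra-context failure.

```python
# Illustrative sketch, not the PR's implementation: use pynvml to count the
# compute processes that currently hold a CUDA context on each GPU.
import pynvml  # provided by the nvidia-ml-py / pynvml package


def contexts_per_gpu() -> list[int]:
    pynvml.nvmlInit()
    try:
        counts = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
            counts.append(len(procs))
        return counts
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    # One entry per GPU; an unexpectedly high count hints at an extra context.
    print(contexts_per_gpu())
```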
@nWEIdia nWEIdia force-pushed the main-run-distributed-on-cu126 branch from 049be83 to d307432 Compare May 22, 2025 01:54
@nWEIdia nWEIdia changed the title [CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 [CI][CUDA] Move cuda 11.8 distributed pull jobs to cuda 12.6 May 22, 2025
@nWEIdia nWEIdia changed the title [CI][CUDA] Move cuda 11.8 distributed pull jobs to cuda 12.6 [CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6 May 22, 2025
kwen2501 (Contributor) left a comment

Unblocking. Left two comments.

Comment on lines +3683 to +3685
device = torch.device(f"cuda:{self.rank:d}")
if not sm_is_or_higher_than(device, 9, 0):
    self.skipTest("FP8 reduction support begins with sm90 capable devices")
Contributor

This is an all-gather test, so there is no need to skip for lack of reduction support I guess?

Collaborator Author

Agreed, this skip reason seems a bit inconsistent.

I am using that message mainly because it is what gets printed when trying to execute this test:

`torch.distributed.DistBackendError: NCCL error in: /my_workspace/gitlab-pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3551, invalid argument (run with NCCL_DEBUG=WARN for details), ...
ncclInvalidArgument: Invalid value for an argument.
Last error:
FP8 reduction support begins with sm90 capable devices.

To execute this test, run the following from the base repo dir:
python test/distributed/test_c10d_nccl.py NcclProcessGroupWithDispatchedCollectivesTests.test_allgather_float8_float8_e5m2`

Contributor

Hmm, the error is thrown from NCCL -- could it be too restrictive?

Collaborator Author

It could be the case; cc @kiskra-nvidia.
I could use standalone NCCL tests to check all-gather with FP8 on the machine (a minimal sketch follows).
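
Here is a minimal sketch of such a standalone check, under the assumption that a single-rank NCCL process group plus a try/except around the collective is enough to probe whether NCCL accepts a float8_e5m2 all-gather on this machine; this is not the PR's test code, and the addresses and tensor sizes are arbitrary.

```python
# Minimal single-process probe (an assumption of what a "standalone" check
# could look like): does NCCL accept an all-gather on float8_e5m2 here?
import os

import torch
import torch.distributed as dist

if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=0, world_size=1)
    device = torch.device("cuda:0")
    x = torch.ones(8, device=device).to(torch.float8_e5m2)
    out = torch.empty(8, device=device, dtype=torch.float8_e5m2)
    try:
        # All-gather is a pure copy, so in principle it should not need
        # sm90 FP8 *reduction* support; this probes what NCCL actually does.
        dist.all_gather_into_tensor(out, x)
        torch.cuda.synchronize()
        print("float8_e5m2 all_gather succeeded")
    except Exception as exc:  # e.g. DistBackendError raised from NCCL
        print(f"float8_e5m2 all_gather failed: {exc}")
    finally:
        dist.destroy_process_group()
```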


Weird. Checking...

nWEIdia (Collaborator, Author) commented May 22, 2025

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

Labels
ci-no-td (Do not run TD on this PR)
ciflow/trunk (Trigger trunk jobs on your pull request)
keep-going (Don't stop on first failure, keep running tests until the end)
Merged
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
open source
Reverted
topic: not user facing (topic category)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)