[CI][CUDA] Move away from cuda12.4, Add cuda12.6 eager CI tests by tinglvv · Pull Request #148602 · pytorch/pytorch · GitHub

[CI][CUDA] Move away from cuda12.4, Add cuda12.6 eager CI tests #148602

Closed
wants to merge 6 commits into from

Conversation

tinglvv
Collaborator
@tinglvv tinglvv commented Mar 5, 2025

#145570

Breaking #140793 into separate eager and inductor benchmark changes to unblock it.

cc @atalman @malfet @nWEIdia @ptrblck

@tinglvv tinglvv requested review from a team and jeffdaily as code owners March 5, 2025 21:04
pytorch-bot bot commented Mar 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148602

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 14 Pending

As of commit 69d5251 with merge base 98458e5 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Mar 5, 2025
@nWEIdia nWEIdia added the keep-going Don't stop on first failure, keep running tests until the end label Mar 5, 2025
Collaborator
@nWEIdia nWEIdia left a comment

@nWEIdia nWEIdia changed the title Move away from 12.4, Add 12.6 eager CI tests [CI][CUDA] Move away from cuda12.4, Add cuda12.6 eager CI tests Mar 5, 2025
@tinglvv
Collaborator Author
tinglvv commented Mar 5, 2025

Failure:
#2 [internal] load metadata for docker.io/nvidia/cuda:12.4-devel-ubuntu20.04
#2 ERROR: docker.io/nvidia/cuda:12.4-devel-ubuntu20.04: not found
I missed replacing this test. Let me modify it.

@atalman
Contributor
atalman commented Mar 6, 2025

> Failure #2 [internal] load metadata for docker.io/nvidia/cuda:12.4-devel-ubuntu20.04 #2 ERROR: docker.io/nvidia/cuda:12.4-devel-ubuntu20.04: not found, missed to replace this test. let me modify.

This looks like a transient error in our infra with calculate-docker-image.

@atalman atalman added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 6, 2025
@tinglvv
Collaborator Author
tinglvv commented Mar 6, 2025

Thanks, Andrey, for skipping the test. Waiting for a green signal from the tests before merging.

@colesbury colesbury added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 6, 2025
@atalman atalman requested a review from malfet March 6, 2025 21:08
Contributor
@atalman atalman left a comment

lgtm

@malfet malfet added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/slow labels Mar 7, 2025
@malfet
Contributor
malfet commented Mar 7, 2025

If you are modifying slow and periodic workflows, you should test those as well.

ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
HALIDE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.12-triton-cpu)
- CUDA_VERSION=12.4
+ CUDA_VERSION=12.6
Contributor

This sounds like an oxymoron: why does a CPU build need CUDA (and what is this config for to begin with)?

@@ -326,15 +326,15 @@ case "$image" in
EXECUTORCH=yes
;;
pytorch-linux-jammy-py3.12-halide)
- CUDA_VERSION=12.4
+ CUDA_VERSION=12.6

Contributor

Halide is part of inductor, so not sure why you are modifying it here.
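
For readers unfamiliar with the file under review: both hunks above sit inside a `case "$image" in` block that sets build variables per Docker image name. A minimal sketch of that pattern follows. The two image names and the 12.6 value are taken from the diff; the function name `cuda_version_for_image` and the fallback branch are hypothetical and only serve the illustration, so this is not the actual script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): print the CUDA version
# configured for a given Docker image name.
cuda_version_for_image() {
  local image="$1"
  local cuda_version=""
  case "$image" in
    pytorch-linux-jammy-py3.12-triton-cpu)
      # Value after this PR, per the first hunk
      cuda_version=12.6
      ;;
    pytorch-linux-jammy-py3.12-halide)
      # Value after this PR, per the second hunk
      cuda_version=12.6
      ;;
    *)
      # Assumption: unknown images get no CUDA version
      cuda_version=""
      ;;
  esac
  printf '%s\n' "$cuda_version"
}

cuda_version_for_image pytorch-linux-jammy-py3.12-halide
```

Because each image name has its own branch, a version bump like this one touches every branch that pins `CUDA_VERSION`, which is why the review questions whether the triton-cpu and halide entries belong in an eager-only change.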

Contributor
@malfet malfet left a comment

Let's test it in prod

@malfet
Contributor
malfet commented Mar 7, 2025

@pytorchbot merge -f "Spartaa aa aa!"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Labels
ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR
ciflow/slow
ciflow/trunk Trigger trunk jobs on your pull request
keep-going Don't stop on first failure, keep running tests until the end
Merged
open source
topic: not user facing topic category
triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
8 participants