Move ROCm MI300 jobs to unstable to make CI green #145790
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145790

Note: links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (3 unrelated failures.) As of commit cd3e7fc with merge base 639dd54: UNSTABLE - the following jobs failed, but were likely due to flakiness present on trunk and have been marked as unstable.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge

Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud.

@pytorchbot merge

Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes the reason behind moving the tests to unstable initially (#145790). We ensure GPU isolation for each pod within Kubernetes by propagating the drivers selected for the pod from the Kubernetes layer up to the `docker run` invocation in pytorch. Now we stick with the GPUs assigned to the pod in the first place, and there is no overlap between the test runners. Pull Request resolved: #145829 Approved by: https://github.com/jeffdaily
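The propagation described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual runner code: the variable name `POD_ASSIGNED_GPUS` and the image name are assumptions, and the real scripts in pytorch may differ. The idea is that only the device files the Kubernetes layer assigned to the pod are forwarded to `docker run`, so the container never sees another runner's GPUs.

```shell
# Hypothetical: list of render nodes assigned to this pod by the k8s layer.
POD_ASSIGNED_GPUS="/dev/dri/renderD128,/dev/dri/renderD129"

# Build one --device flag per assigned GPU device file.
DEVICE_FLAGS=""
for dev in ${POD_ASSIGNED_GPUS//,/ }; do
  DEVICE_FLAGS="${DEVICE_FLAGS} --device=${dev}"
done

# /dev/kfd is the shared ROCm compute interface; the per-GPU render nodes
# above are what provide the isolation between runners.
echo "docker run --device=/dev/kfd${DEVICE_FLAGS} <pytorch-test-image>"
```

With only the pod's own render nodes mapped in, two test runners on the same MI300 host can no longer contend for the same GPU.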
```yaml
with:
  build-environment: linux-focal-rocm6.3-py3.10
```
Just curious: what is the point of doing something like that for a workflow that cannot be triggered by `ciflow/XYZ` labels, and is therefore not merge blocking?
Fixes #145790 Needs #145504 to be merged first to resolve an artifact uploading issue with MI300 runners. This PR moves the unstable ROCm MI300 jobs back to stable. The change to unstable was introduced through this [PR](#145790). This was because the MI300s were failing with a [docker daemon](https://github.com/pytorch/pytorch/actions/runs/13015957622/job/36306779536) issue, which has been resolved. Pull Request resolved: #146675 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
This is a temporary change to reduce intermittent test failures. The jobs can be moved back once those machines get better runner isolation.

This also sneaks in a small fix to the build step of all the ROCm jobs so that it runs on Linux Foundation runners (the get-label-type dependency). The inductor-rocm-mi300 workflow already had it, but it was missing in the rocm-mi300 workflow.
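The get-label-type dependency mentioned above can be sketched as a workflow fragment like the following. This is a simplified illustration under assumptions: the reusable-workflow paths and job names are abbreviations, not the exact contents of the rocm-mi300 workflow file.

```yaml
# Simplified sketch (assumed paths/names): the build job gains a `needs` edge
# on get-label-type so its runner label is resolved first, routing the build
# to Linux Foundation runners.
jobs:
  get-label-type:
    uses: ./.github/workflows/_runner-determinator.yml  # assumed path

  linux-focal-rocm6_3-py3_10-build:
    needs: get-label-type
    uses: ./.github/workflows/_linux-build.yml  # assumed path
    with:
      build-environment: linux-focal-rocm6.3-py3.10
```

The inductor-rocm-mi300 workflow already carried this `needs: get-label-type` edge; this PR adds the equivalent to rocm-mi300.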
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd