Move ROCm MI300 jobs to unstable to make CI green by ZainRizvi · Pull Request #145790 · pytorch/pytorch · GitHub

Move ROCm MI300 jobs to unstable to make CI green #145790


Closed
wants to merge 3 commits into from

Conversation

ZainRizvi
Contributor
@ZainRizvi ZainRizvi commented Jan 27, 2025

This is a temporary change to reduce intermittent test failures. The jobs can be moved back once those machines get better runner isolation.

This also sneaks in a small fix so that the build step of every ROCm job runs on Linux Foundation runners (via the get-label-type dependency). The inductor-rocm-mi300 workflow already had it, but it was missing from the rocm-mi300 workflow.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@ZainRizvi ZainRizvi requested a review from a team as a code owner January 27, 2025 23:00
pytorch-bot bot commented Jan 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145790

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit cd3e7fc with merge base 639dd54 (image):

UNSTABLE - The following jobs failed, but the failures were likely due to flakiness present on trunk, and the jobs have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
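
As a rough illustration of the verdict above, the sketch below separates failed jobs into merge-blocking and unstable ones. It is not Dr. CI's implementation: the rule that a job counts as unstable when its name contains the word "unstable" is an assumption made for this example, and the job names are hypothetical.

```python
def split_failures(failed_jobs: list[str]) -> tuple[list[str], list[str]]:
    """Split failed job names into (blocking, unstable).

    Assumption: a job is unstable if its name contains "unstable".
    """
    blocking: list[str] = []
    unstable: list[str] = []
    for name in failed_jobs:
        if "unstable" in name.lower():
            unstable.append(name)
        else:
            blocking.append(name)
    return blocking, unstable


def can_merge(failed_jobs: list[str]) -> bool:
    """A change is mergeable when no blocking job has failed."""
    blocking, _ = split_failures(failed_jobs)
    return not blocking


# Hypothetical job names for illustration only.
failed = [
    "rocm-mi300 / test (default, 1, 6) (unstable)",
    "rocm-mi300 / test (default, 2, 6) (unstable)",
]
print(can_merge(failed))                     # True
print(can_merge(failed + ["trunk / lint"]))  # False
```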

@pytorch-bot pytorch-bot bot added the ciflow/rocm, module: rocm, and topic: not user facing labels Jan 27, 2025
@ZainRizvi
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Jan 27, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@pytorch-bot pytorch-bot bot had a problem deploying to upload-benchmark-results January 27, 2025 23:34 Error
@pytorch-bot pytorch-bot bot temporarily deployed to upload-benchmark-results January 28, 2025 00:09 Inactive
@pytorch-bot pytorch-bot bot temporarily deployed to upload-benchmark-results January 28, 2025 00:10 Inactive
@jataylo jataylo added the ciflow/unstable label Jan 28, 2025
@pytorch-bot pytorch-bot bot temporarily deployed to upload-benchmark-results January 28, 2025 10:41 Inactive
@pytorch-bot pytorch-bot bot had a problem deploying to upload-benchmark-results January 28, 2025 10:41 Failure
@ZainRizvi
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Jan 28, 2025
Fixes the reason behind moving the tests to unstable initially. (#145790)
We ensure GPU isolation for each pod within Kubernetes by propagating the drivers selected for the pod from the Kubernetes layer up to the docker run in pytorch here.
Now we stick with the GPUs assigned to the pod in the first place, so there is no overlap between the test runners.

Pull Request resolved: #145829
Approved by: https://github.com/jeffdaily
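
The isolation approach in this commit message can be illustrated roughly as follows. This is a sketch under stated assumptions, not the code from #145829: the `ASSIGNED_GPU_DEVICES` environment variable, its comma-separated format, and the image name are hypothetical. `/dev/kfd` and `/dev/dri/renderD*` are the device nodes that ROCm containers conventionally receive through the `--device` flag of `docker run`.

```python
import os


def _parse_devices(assigned: str) -> list[str]:
    """Parse a comma-separated list of device paths, ignoring blanks."""
    return [d.strip() for d in assigned.split(",") if d.strip()]


def docker_device_args(assigned: str) -> list[str]:
    """Build `docker run` device flags for only the GPUs assigned to this pod."""
    devices = _parse_devices(assigned)
    if not devices:
        raise ValueError("no GPUs assigned to this pod")
    # /dev/kfd is the shared ROCm compute interface; each render node is one GPU.
    return ["--device=/dev/kfd"] + [f"--device={d}" for d in devices]


def pods_overlap(pod_a: str, pod_b: str) -> bool:
    """Return True if two pods would expose the same GPU to their containers."""
    return bool(set(_parse_devices(pod_a)) & set(_parse_devices(pod_b)))


# Hypothetical variable name; the default value is only for illustration.
assigned = os.environ.get(
    "ASSIGNED_GPU_DEVICES", "/dev/dri/renderD128,/dev/dri/renderD129"
)
command = ["docker", "run", *docker_device_args(assigned), "example-image"]
print(" ".join(command))
```

Passing an explicit device list, rather than exposing every device node on the host, is what keeps two runners on the same machine from touching each other's GPUs in this sketch.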
with:
build-environment: linux-focal-rocm6.3-py3.10
Contributor

Just curious, what is the point of doing something like that for a workflow that cannot be triggered by ciflow/XYZ labels and, as such, is not merge blocking?

pytorchmergebot pushed a commit that referenced this pull request Feb 28, 2025
Fixes #145790
Needs #145504 to be merged first to resolve an artifact uploading issue with MI300 runners.

This PR moves rocm unstable MI300 back to stable. The change to unstable was introduced through this [PR](#145790). This was because the MI300s were failing with a [docker daemon](https://github.com/pytorch/pytorch/actions/runs/13015957622/job/36306779536) issue which has been resolved.

Pull Request resolved: #146675
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
@github-actions github-actions bot deleted the zainr/rocm-unstable branch March 1, 2025 02:11
majing921201 pushed a commit to majing921201/pytorch that referenced this pull request Mar 4, 2025
Fixes pytorch#145790
Needs pytorch#145504 to be merged first to resolve an artifact uploading issue with MI300 runners.

This PR moves rocm unstable MI300 back to stable. The change to unstable was introduced through this [PR](pytorch#145790). This was because the MI300s were failing with a [docker daemon](https://github.com/pytorch/pytorch/actions/runs/13015957622/job/36306779536) issue which has been resolved.

Pull Request resolved: pytorch#146675
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Labels
ciflow/rocm (Trigger "default" config CI on ROCm)
ciflow/trunk (Trigger trunk jobs on your pull request)
ciflow/unstable (Run all experimental or flaky jobs on PyTorch unstable workflow)
Merged
module: rocm (AMD GPU support for Pytorch)
topic: not user facing (topic category)
5 participants