Ensure GPU isolation for kubernetes pod MI300 runners. #145829
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145829
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 6 Pending as of commit 363aabf with merge base 5c5306e.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Rerunning the failed lint job.
@pytorchbot merge -f "workflow-only change. confirmed mi200 runners still work with this change, mi300 can only be tested post-merge"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes the issue that originally caused these tests to be moved to unstable (#145790).
We ensure GPU isolation for each pod within Kubernetes by propagating the GPU devices selected for the pod from the Kubernetes layer down to the docker run invocation in PyTorch here.
Each test runner now sticks with the GPUs originally assigned to its pod, so there is no overlap between concurrent runners on the same host.
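As a rough illustration of the idea (not the actual workflow change in this PR), the sketch below forwards only the device nodes that the Kubernetes device plugin exposed inside the pod to the docker run that executes the tests, rather than mounting every GPU on the host. The image name, test script, and variable names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: expose to the test container only the GPUs assigned to
# this Kubernetes pod, so concurrent runners on the same host cannot see or
# clobber each other's devices.
set -euo pipefail

DOCKER_DEVICE_ARGS=()

# ROCm needs /dev/kfd plus the render nodes; inside the pod, only the render
# nodes for the GPUs granted by the device plugin are present under /dev/dri.
if [[ -e /dev/kfd ]]; then
  DOCKER_DEVICE_ARGS+=(--device /dev/kfd)
fi
for node in /dev/dri/renderD*; do
  [[ -e "$node" ]] && DOCKER_DEVICE_ARGS+=(--device "$node")
done

# Pass exactly those devices to docker run instead of, e.g., mounting all of
# /dev/dri from the host.
docker run --rm \
  "${DOCKER_DEVICE_ARGS[@]}" \
  --group-add video \
  my-test-image:latest \
  ./run_tests.sh
```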
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd