Ensure GPU isolation for kubernetes pod MI300 runners. #145829
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145829
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 6 Pending as of commit 363aabf with merge base 5c5306e.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Rerunning the failed lint job.
@pytorchbot merge -f "workflow-only change. confirmed mi200 runners still work with this change, mi300 can only be tested post-merge"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes the issue that originally caused these tests to be moved to unstable (#145790).
We ensure GPU isolation for each pod within Kubernetes by propagating the GPU devices selected for the pod from the Kubernetes layer down to the docker run invocation in PyTorch here.
Each test runner now sticks with the GPUs originally assigned to its pod, so there is no overlap between concurrent runners on the same host.
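As a rough illustration of the idea (not the actual workflow change in this PR), the sketch below forwards only the device nodes that the Kubernetes device plugin exposed inside the pod to the docker run that executes the tests, rather than mounting every GPU on the host. The image name, test script, and variable names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: expose to the test container only the GPUs assigned to
# this Kubernetes pod, so concurrent runners on the same host cannot see or
# clobber each other's devices.
set -euo pipefail

DOCKER_DEVICE_ARGS=()

# ROCm needs /dev/kfd plus the render nodes; inside the pod, only the render
# nodes for the GPUs granted by the device plugin are present under /dev/dri.
if [[ -e /dev/kfd ]]; then
  DOCKER_DEVICE_ARGS+=(--device /dev/kfd)
fi
for node in /dev/dri/renderD*; do
  [[ -e "$node" ]] && DOCKER_DEVICE_ARGS+=(--device "$node")
done

# Pass exactly those devices to docker run instead of, e.g., mounting all of
# /dev/dri from the host.
docker run --rm \
  "${DOCKER_DEVICE_ARGS[@]}" \
  --group-add video \
  my-test-image:latest \
  ./run_tests.sh
```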
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd