Resolve zip file permission issue when uploading artifacts on ROCm MI300 CI runners #145504
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145504
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (4 Unrelated Failures)
As of commit 0caccf9 with merge base b63c601:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Even after removing all diagnostic steps, the MI200 and MI300 runs successfully uploaded artifacts:
@jeffdaily I think this PR is ready to merge. Please review and approve.
EDIT: @jeffdaily pointed out that the failing shards in the MI300 run did NOT run the permission change step, so zipping the artifacts failed: https://github.com/pytorch/pytorch/actions/runs/13535735120/job/37827995651#step:20:120. The step needs more tweaking so that it also runs on failed jobs.
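For readers following along, here is a rough Python sketch of the idea discussed above: make the test-report files readable before they are zipped, and run that fix-up even when the tests themselves fail. It is not the workflow change made in this PR. The function names (`make_readable`, `run_then_fix_permissions`, `zip_reports`) are hypothetical, and the sketch assumes the process running it is allowed to `chmod` the files (for example, it owns them or runs with elevated privileges).

```python
import os
import stat
import zipfile

READ_ALL = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
EXEC_ALL = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def _add_bits(path, bits):
    """OR extra permission bits into a path's mode; return True if it changed."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = mode | bits
    if new_mode != mode:
        os.chmod(path, new_mode)
        return True
    return False


def make_readable(root):
    """Hypothetical helper: make every file under `root` readable and every
    directory readable and traversable. Returns the number of paths changed."""
    changed = int(_add_bits(root, READ_ALL | EXEC_ALL))
    # Top-down walk: a subdirectory is fixed before os.walk descends into it.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                changed += _add_bits(path, READ_ALL | EXEC_ALL)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                changed += _add_bits(path, READ_ALL)
    return changed


def run_then_fix_permissions(run_tests, report_dir):
    """Run the tests, then fix permissions whether or not the tests failed."""
    try:
        run_tests()
    finally:
        # Runs on success and on failure, like a step that is not skipped
        # when an earlier step fails.
        make_readable(report_dir)


def zip_reports(report_dir, zip_path):
    """Zip every file under `report_dir`; unreadable files would raise here."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for dirpath, _, filenames in os.walk(report_dir):
            for name in filenames:
                full = os.path.join(dirpath, name)
                zf.write(full, os.path.relpath(full, report_dir))
    return zip_path
```

The `try`/`finally` is the point of the sketch: if the permission fix only ran after a successful test step, failing shards would still hit the zip error reported in this PR.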
After fix in 0caccf9, we see the following, as desired:
@jeffdaily Please re-review.
@pytorchbot merge -f "Unrelated CI failures; confirmed that artifact uploading is working for MI300 CI jobs"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Fixes #145790
Needs #145504 to be merged first to resolve an artifact uploading issue with MI300 runners.
This PR moves rocm unstable MI300 back to stable. The change to unstable was introduced through this [PR](#145790). This was because the MI300s were failing with a [docker daemon](https://github.com/pytorch/pytorch/actions/runs/13015957622/job/36306779536) issue which has been resolved.
Pull Request resolved: #146675
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
…300 CI runners (pytorch#145504)

E.g.: https://github.com/pytorch/pytorch/actions/runs/13500418791/job/37719437613#step:19:120

```
Beginning upload of artifact content to blob storage
Error: An error has occurred while creating the zip file for upload
Error: EACCES: permission denied, open '/home/runner/_work/pytorch/pytorch/test/test-reports/backends.xeon.test_launch_1.1_22ba1133f3fcd140_.log'
/home/runner/_work/_actions/actions/upload-artifact/v4/dist/upload/index.js:3459
            throw new Error('An error has occurred during zip creation for the artifact');
            ^
```

Pull Request resolved: pytorch#145504
Approved by: https://github.com/jeffdaily
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
E.g.: https://github.com/pytorch/pytorch/actions/runs/13500418791/job/37719437613#step:19:120
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd