Resolve zip file permission issue when uploading artifacts on ROCm MI300 CI runners by amdfaa · Pull Request #145504 · pytorch/pytorch · GitHub

Resolve zip file permission issue when uploading artifacts on ROCm MI300 CI runners #145504


Closed
amdfaa wants to merge 28 commits


amdfaa
Contributor
@amdfaa amdfaa commented Jan 23, 2025

E.g.: https://github.com/pytorch/pytorch/actions/runs/13500418791/job/37719437613#step:19:120

```
Beginning upload of artifact content to blob storage
Error: An error has occurred while creating the zip file for upload
Error: EACCES: permission denied, open '/home/runner/_work/pytorch/pytorch/test/test-reports/backends.xeon.test_launch_1.1_22ba1133f3fcd140_.log'
/home/runner/_work/_actions/actions/upload-artifact/v4/dist/upload/index.js:3459
    throw new Error('An error has occurred during zip creation for the artifact');
    ^
```
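
The `EACCES` above means the process creating the zip could not open a file under `test/test-reports`. As a minimal sketch for spotting such files before the upload step, the snippet below applies the classic Unix owner/group/other read check to every regular file in a directory. The function names are hypothetical, the check ignores ACLs and root's override, and none of this is taken from the PR's actual changes.

```python
import os
import stat
from typing import Iterable, List


def readable_by(mode: int, owner_uid: int, owner_gid: int,
                uid: int, gids: Iterable[int]) -> bool:
    """Classic Unix read check from mode bits (ignores ACLs and root's override)."""
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)
    if owner_gid in set(gids):
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


def unreadable_files(root: str, uid: int, gids: Iterable[int]) -> List[str]:
    """Return regular files under root that the given user could not open for reading."""
    gid_set = set(gids)
    blocked = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if stat.S_ISREG(st.st_mode) and not readable_by(
                st.st_mode, st.st_uid, st.st_gid, uid, gid_set
            ):
                blocked.append(path)
    return blocked
```

For example, `unreadable_files("test/test-reports", os.getuid(), os.getgroups())` would list the files the current user cannot read under these assumptions.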

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch topic: not user facing topic category labels Jan 23, 2025
pytorch-bot bot commented Jan 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145504

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit 0caccf9 with merge base b63c601:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jataylo jataylo added test-config/distributed ciflow/rocm Trigger "default" config CI on ROCm ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ci-no-td Do not run TD on this PR and removed test-config/distributed labels Jan 23, 2025
@pytorch-bot pytorch-bot bot temporarily deployed to upload-benchmark-results January 23, 2025 18:33 Inactive
@jithunnair-amd jithunnair-amd added ciflow/unstable Run all experimental or flaky jobs on PyTorch unstable workflow ciflow/rocm Trigger "default" config CI on ROCm and removed ciflow/unstable Run all experimental or flaky jobs on PyTorch unstable workflow ciflow/rocm Trigger "default" config CI on ROCm labels Feb 26, 2025
@jithunnair-amd
Collaborator
jithunnair-amd commented Feb 26, 2025

Even after removing all diagnostic steps, MI200 and MI300 runs successfully uploaded artifacts:
MI200: https://github.com/pytorch/pytorch/actions/runs/13535735012/job/37827986270
MI300: https://github.com/pytorch/pytorch/actions/runs/13535735120/job/37827995393

@jeffdaily I think this PR is ready to merge. Please review and approve.

EDIT: @jeffdaily pointed out that the failing shards in the MI300 run did NOT run the permission change step, so zipping the artifacts failed: https://github.com/pytorch/pytorch/actions/runs/13535735120/job/37827995651#step:20:120. More tweaking is needed to ensure the step also runs on failed jobs.
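
The point in the EDIT above is that the permission change has to run even when the tests fail. The workflow change itself is not shown in this conversation, so the following is only a Python analogy with hypothetical names: it runs a test command and then, in a `finally` clause, adds read bits under the reports directory regardless of the exit code. It assumes the calling process is allowed to `chmod` those files (owner or root).

```python
import os
import stat
import subprocess
import sys

READ_BITS = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def make_readable(root: str) -> int:
    """Add read bits (plus traverse bits on directories) under root.

    Returns the number of entries whose mode was changed.
    """
    changed = 0
    # Top-down walk: a directory is fixed before os.walk descends into it.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISLNK(st.st_mode):
                continue  # leave symlinks alone
            extra = READ_BITS | (EXEC_BITS if stat.S_ISDIR(st.st_mode) else 0)
            old_mode = stat.S_IMODE(st.st_mode)
            new_mode = old_mode | extra
            if new_mode != old_mode:
                os.chmod(path, new_mode)
                changed += 1
    return changed


def run_then_fix(test_cmd, reports_dir: str) -> int:
    """Run the tests, then fix permissions whether or not they passed."""
    try:
        return subprocess.run(test_cmd).returncode
    finally:
        # Runs on success, on a non-zero exit code, and on exceptions.
        make_readable(reports_dir)
```

With this shape, a failing shard still leaves its reports readable for whatever packs them up afterwards, which is the behavior the EDIT asks for.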

@jithunnair-amd jithunnair-amd changed the title [DO NOT MERGE] Resolve zip file permission issue when uploading artifacts on ROCm MI300 CI runners Resolve zip file permission issue when uploading artifacts on ROCm MI300 CI runners Feb 26, 2025
@jithunnair-amd jithunnair-amd marked this pull request as ready for review February 26, 2025 23:12
@jithunnair-amd jithunnair-amd requested a review from a team as a code owner February 26, 2025 23:12
@jithunnair-amd jithunnair-amd added ciflow/unstable Run all experimental or flaky jobs on PyTorch unstable workflow ciflow/rocm Trigger "default" config CI on ROCm and removed ciflow/unstable Run all experimental or flaky jobs on PyTorch unstable workflow ciflow/rocm Trigger "default" config CI on ROCm labels Feb 28, 2025
@jithunnair-amd
Collaborator

After fix in 0caccf9, we see the following, as desired:

@jeffdaily Please re-review.

@jithunnair-amd
Collaborator

@pytorchbot merge -f "Unrelated CI failures; confirmed that artifact uploading is working for MI300 CI jobs"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Feb 28, 2025
Fixes #145790
Needs #145504 to be merged first to resolve an artifact uploading issue with MI300 runners.

This PR moves the ROCm MI300 jobs from unstable back to stable. The change to unstable was introduced in this [PR](#145790) because the MI300 runners were failing with a [docker daemon](https://github.com/pytorch/pytorch/actions/runs/13015957622/job/36306779536) issue, which has since been resolved.

Pull Request resolved: #146675
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
majing921201 pushed a commit to majing921201/pytorch that referenced this pull request Mar 4, 2025
…300 CI runners (pytorch#145504)

E.g.: https://github.com/pytorch/pytorch/actions/runs/13500418791/job/37719437613#step:19:120
```
Beginning upload of artifact content to blob storage
Error: An error has occurred while creating the zip file for upload
Error: EACCES: permission denied, open '/home/runner/_work/pytorch/pytorch/test/test-reports/backends.xeon.test_launch_1.1_22ba1133f3fcd140_.log'
/home/runner/_work/_actions/actions/upload-artifact/v4/dist/upload/index.js:3459
    throw new Error('An error has occurred during zip creation for the artifact');
    ^
```

Pull Request resolved: pytorch#145504
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Labels
ci-no-td Do not run TD on this PR ciflow/rocm Trigger "default" config CI on ROCm ciflow/unstable Run all experimental or flaky jobs on PyTorch unstable workflow Merged module: rocm AMD GPU support for Pytorch open source topic: not user facing topic category