[CI] MacOS15-M2 runners are unstable · Issue #149999 · pytorch/pytorch · GitHub

Open
malfet opened this issue Mar 26, 2025 · 8 comments
Labels
module: ci (Related to continuous integration), module: flaky-tests (Problem is a flaky test in CI), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module), unstable

Comments

malfet (Contributor) commented Mar 26, 2025

🐛 Describe the bug

Not sure if the tests are exposing some sort of problem, or if we are doing something weird with the infra, but since #149900 landed, about 5% of the runs finish prematurely with "runner lost communication with the server".
Examples on-trunk (HUD link):

Versions

CI

cc @seemethere @pytorch/pytorch-dev-infra @clee2000
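
One quick way to quantify how many of the Mac runners have dropped offline is to query the self-hosted runner API. A minimal sketch using the gh CLI (assumes admin:org scope on the pytorch org; the label filter reuses the macos-m1-13 / macos-m2-15 labels from the config commands further down):

# list macOS runners in the pytorch org with their online/offline status
gh api --paginate /orgs/pytorch/actions/runners \
  --jq '.runners[] | select([.labels[].name] | any(test("macos-m1-13|macos-m2-15"))) | "\(.name)\t\(.status)"'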

malfet added the module: ci, module: flaky-tests, triaged, and unstable labels Mar 26, 2025
malfet (Contributor, Author) commented Mar 26, 2025

Not sure if this is related, but at this point there are no m1-13 nor m2-15 runners available.

Looks like all 18 m1-13 runners got their runner token invalidated.
Fixed by using SSM to log into them and running:

# become the runner user, stop the stale runner inside tmux (Ctrl-C), start a fresh session
sudo su - ec2-user
tmux a
^Ctrl-C
tmux
# wipe the old install, re-download, re-register with the macos-m1-13 label, and start the runner
rm -rf runner; mkdir runner; cd runner
curl -o actions-runner-osx-arm64-2.323.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.323.0/actions-runner-osx-arm64-2.323.0.tar.gz
tar xzf ./actions-runner-osx-arm64-2.323.0.tar.gz
./config.sh --url https://github.com/pytorch --token ADDYOURTOKENHERE --labels macos-m1-13 --unattended; ./run.sh

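For reference, both the SSM login and a fresh registration token (to paste in place of ADDYOURTOKENHERE above) can be obtained from the CLI. A sketch, assuming AWS SSM access to the host and admin:org scope on the pytorch org; the instance ID is a placeholder:

# open a shell on the affected Mac host via SSM (placeholder instance ID)
aws ssm start-session --target i-0123456789abcdef0
# mint a short-lived org-level runner registration token for config.sh
gh api -X POST /orgs/pytorch/actions/runners/registration-token --jq .token
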
jeanschmidt (Contributor):

I believe this might be unrelated. Yesterday all macOS runners got disconnected (internal dashboard). Still investigating why.

jeanschmidt (Contributor):

@malfet did you try force-running the spa at that time? Did that work?

malfet (Contributor, Author) commented Mar 26, 2025

@malfet did you try force-running the spa at that time? Did that work?

@jeanschmidt no, I did not try that, as it was late and I needed to get runners back up quickly.

jeanschmidt (Contributor):

It should be faster and simpler in most cases. Exactly what to do when errors like this happen is documented in our GHA infra runbook.

Still, thanks for having a look.

Did you troubleshoot why this happens?

malfet (Contributor, Author) commented Mar 26, 2025

Did you troubleshoot why this happens?

Sorry, I might have conflated two issues here: what is happening to M1-13 (no idea, would appreciate your help) and what is happening to M2-15 (probably OOM: those machines have 32 GB of RAM, so some of the larger tests run there).
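
If the OOM theory needs confirming on an M2-15 host, macOS's built-in tools can show whether the kernel killed processes under memory pressure. A rough sketch; the log predicate is a guess and may need tuning:

# current system-wide memory pressure on the host
memory_pressure
# search recent kernel memorystatus (jetsam) events around the time the job dropped
log show --last 2h --predicate 'eventMessage CONTAINS[c] "memorystatus"' --style compact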

malfet (Contributor, Author) commented Apr 1, 2025

Happened again: only one M2Pro runner remains and running the spa fails (cc: @jeanschmidt), so I'm now running the following script (the only difference from last time is that I give the tmux session the name gha_daemon):

sudo su - ec2-user
tmux a
^Ctrl-C
# this time the tmux session gets a name, so it is easier to find later
tmux new -s gha_daemon
# same re-registration as before, but with the macos-m2-15 label
rm -rf runner; mkdir runner; cd runner
curl -o actions-runner-osx-arm64-2.323.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.323.0/actions-runner-osx-arm64-2.323.0.tar.gz
tar xzf ./actions-runner-osx-arm64-2.323.0.tar.gz
./config.sh --url https://github.com/pytorch --token ADDYOURTOKENHERE --labels macos-m2-15 --unattended; ./run.sh
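
As an alternative to keeping the runner alive in tmux, the extracted runner package ships a service helper that registers it with launchd. A sketch, assuming it is run from the same runner directory after config.sh has succeeded:

# install and start the runner as a launchd service so it survives session loss
./svc.sh install
./svc.sh start
./svc.sh status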

clee2000 moved this to Cold Storage in PyTorch OSS Dev Infra Apr 1, 2025
malfet (Contributor, Author) commented May 13, 2025

Something has started happening again: yesterday evening we had only 6 runners, and this morning we are down to 2.
