[CI] MacOS15-M2 runners are unstable · Issue #149999 · pytorch/pytorch · GitHub

Open
malfet opened this issue Mar 26, 2025 · 8 comments
Labels
module: ci (Related to continuous integration), module: flaky-tests (Problem is a flaky test in CI), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module), unstable

Comments

malfet (Contributor) commented Mar 26, 2025

🐛 Describe the bug

Not sure if the tests are exposing some sort of problem, or if we are doing something weird with the infra, but since #149900 landed, about 5% of the runs finish prematurely with "runner lost communication with the server".
Examples on-trunk (HUD link):

Versions

CI

cc @seemethere @pytorch/pytorch-dev-infra @clee2000
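
One quick way to quantify how many of the Mac runners have dropped offline is to query the self-hosted runner API. A minimal sketch using the gh CLI (assumes admin:org scope on the pytorch org; the label filter reuses the macos-m1-13 / macos-m2-15 labels from the config commands further down):

# list macOS runners in the pytorch org with their online/offline status
gh api --paginate /orgs/pytorch/actions/runners \
  --jq '.runners[] | select([.labels[].name] | any(test("macos-m1-13|macos-m2-15"))) | "\(.name)\t\(.status)"'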

malfet added the module: ci, module: flaky-tests, triaged, and unstable labels Mar 26, 2025
malfet (Contributor, Author) commented Mar 26, 2025

Not sure if this is related, but at this point there are no m1-13 nor m2-15 runners available.

Looks like all 18 m1-13 runners got their runner token invalidated.
Fixed by using SSM to log into them and running:

# become the runner user, stop the stale runner inside tmux (Ctrl-C), start a fresh session
sudo su - ec2-user
tmux a
^Ctrl-C
tmux
# wipe the old install, re-download, re-register with the macos-m1-13 label, and start the runner
rm -rf runner; mkdir runner; cd runner
curl -o actions-runner-osx-arm64-2.323.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.323.0/actions-runner-osx-arm64-2.323.0.tar.gz
tar xzf ./actions-runner-osx-arm64-2.323.0.tar.gz
./config.sh --url https://github.com/pytorch --token ADDYOURTOKENHERE --labels macos-m1-13 --unattended; ./run.sh

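For reference, both the SSM login and a fresh registration token (to paste in place of ADDYOURTOKENHERE above) can be obtained from the CLI. A sketch, assuming AWS SSM access to the host and admin:org scope on the pytorch org; the instance ID is a placeholder:

# open a shell on the affected Mac host via SSM (placeholder instance ID)
aws ssm start-session --target i-0123456789abcdef0
# mint a short-lived org-level runner registration token for config.sh
gh api -X POST /orgs/pytorch/actions/runners/registration-token --jq .token
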
jeanschmidt (Contributor):

I believe this might be unrelated. Yesterday all macOS runners got disconnected (internal dashboard). Still investigating why.

jeanschmidt (Contributor):

@malfet did you try force-running the spa at that time? Did that work?

malfet (Contributor, Author) commented Mar 26, 2025

@malfet did you try force-running the spa at that time? Did that work?

@jeanschmidt no, I did not try that, as it was late and I needed to get runners back up quickly.

jeanschmidt (Contributor):

It should be faster and simpler in most cases. Exactly what to do when errors like this happen is documented in our GHA infra runbook.

Still, thanks for having a look.

Did you troubleshoot why this happens?

malfet (Contributor, Author) commented Mar 26, 2025

Did you troubleshoot why this happens?

Sorry, I might have conflated two issues here: what is happening to M1-13 (no idea, would appreciate your help) and what is happening to M2-15 (probably OOM: those machines have 32 GB of RAM, so some of the larger tests run there).
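
If the OOM theory needs confirming on an M2-15 host, macOS's built-in tools can show whether the kernel killed processes under memory pressure. A rough sketch; the log predicate is a guess and may need tuning:

# current system-wide memory pressure on the host
memory_pressure
# search recent kernel memorystatus (jetsam) events around the time the job dropped
log show --last 2h --predicate 'eventMessage CONTAINS[c] "memorystatus"' --style compact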

malfet (Contributor, Author) commented Apr 1, 2025

Happened again: only one M2Pro runner remains and running the spa fails (cc: @jeanschmidt), so I'm now running the following script (the only difference from last time is that I give the tmux session the name gha_daemon):

sudo su - ec2-user
tmux a
^Ctrl-C
# this time the tmux session gets a name, so it is easier to find later
tmux new -s gha_daemon
# same re-registration as before, but with the macos-m2-15 label
rm -rf runner; mkdir runner; cd runner
curl -o actions-runner-osx-arm64-2.323.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.323.0/actions-runner-osx-arm64-2.323.0.tar.gz
tar xzf ./actions-runner-osx-arm64-2.323.0.tar.gz
./config.sh --url https://github.com/pytorch --token ADDYOURTOKENHERE --labels macos-m2-15 --unattended; ./run.sh
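
As an alternative to keeping the runner alive in tmux, the extracted runner package ships a service helper that registers it with launchd. A sketch, assuming it is run from the same runner directory after config.sh has succeeded:

# install and start the runner as a launchd service so it survives session loss
./svc.sh install
./svc.sh start
./svc.sh status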

clee2000 moved this to Cold Storage in PyTorch OSS Dev Infra Apr 1, 2025
malfet (Contributor, Author) commented May 13, 2025

Something has started happening again: yesterday evening we had only 6 runners, and this morning we are down to 2.
