[CI] MacOS15-M2 runners are unstable #149999
Comments
Not sure if this is related, but at this point there are neither m1-13 nor m2-15 runners available. It looks like all 18 m1-13 runners had their runner tokens invalidated.
I believe this might be unrelated. Yesterday all macOS runners got disconnected (internal dashboard). Still investigating why.
@malfet did you try to force-run the spa at that time? Did that work?
@jeanschmidt no, I did not try that, as it was late and I needed to get the runners back up quickly.
It should be faster and simpler in most cases. Our GHA infra runbook documents exactly what to do when errors like this happen. Still, thanks for having a look. Did you troubleshoot why this happens?
Sorry, I might have conflated two issues here: what is happening to M1-13 (no idea, would appreciate your help) and what is happening to M2-15 (probably OOM, as those machines have 32 GB of RAM, so some larger tests are running there).
Happened again: only one M2Pro runner remains and running the spa fails (cc: @jeanschmidt), so I'm now running the following script (the only difference from last time is that I assign the tmux session a name).
Something started happening again: yesterday evening we had only 6 runners, and this morning we are down to 2.
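For anyone who wants to watch the runner count drop without opening a dashboard, here is a minimal sketch that counts online runners per label. It assumes the input is the parsed JSON from GitHub's "list self-hosted runners" REST endpoint (a `runners` list whose entries carry `status` and `labels`). The label names `macos-m1-13` and `macos-m2-15` and the sample payload are hypothetical stand-ins, not the real runner labels from this repository.

```python
def online_runners_by_label(payload, labels_of_interest):
    """Count online runners for each label of interest.

    `payload` is assumed to have the shape of the GitHub REST
    "list self-hosted runners" response: {"runners": [{"status": ...,
    "labels": [{"name": ...}, ...]}, ...]}.
    """
    counts = {label: 0 for label in labels_of_interest}
    for runner in payload.get("runners", []):
        if runner.get("status") != "online":
            continue  # offline runners do not count as available
        names = {label["name"] for label in runner.get("labels", [])}
        for label in labels_of_interest:
            if label in names:
                counts[label] += 1
    return counts


# Hypothetical sample payload; names and labels are made up for illustration.
sample_payload = {
    "runners": [
        {"name": "a", "status": "online", "labels": [{"name": "macos-m1-13"}]},
        {"name": "b", "status": "offline", "labels": [{"name": "macos-m1-13"}]},
        {"name": "c", "status": "online", "labels": [{"name": "macos-m2-15"}]},
        {"name": "d", "status": "online", "labels": [{"name": "macos-m2-15"}]},
    ]
}

if __name__ == "__main__":
    print(online_runners_by_label(sample_payload, ["macos-m1-13", "macos-m2-15"]))
```

Running something like this on a schedule and alerting when a count falls below a threshold would catch a drop from 6 to 2 overnight, though the threshold and the polling interval are judgment calls.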
🐛 Describe the bug
Not sure if the tests are exposing some sort of problem, or we are doing something weird with the infra, but since #149900 landed, about 5% of the runs finish prematurely with "runner lost communication with the server".
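To put a number on the "about 5%" estimate, a rough sketch like the following could compute the share of jobs whose failure message mentions the lost-communication error. It assumes the failure messages have already been collected into a list of strings (one per job, `None` for jobs with no message); how they are collected is out of scope here, and the function name and sample data are hypothetical.

```python
# Substring of the error text quoted in this issue.
LOST_COMM = "lost communication with the server"


def lost_communication_rate(job_messages):
    """Return the fraction of jobs whose message contains LOST_COMM.

    `job_messages` is a list with one entry per job: the failure
    message as a string, or None if the job reported no message.
    """
    total = len(job_messages)
    if total == 0:
        return 0.0  # avoid dividing by zero when there are no jobs
    lost = sum(
        1 for msg in job_messages if msg and LOST_COMM in msg.lower()
    )
    return lost / total


if __name__ == "__main__":
    # Hypothetical sample: 19 clean jobs and 1 job that lost its runner.
    messages = [None] * 19 + ["Runner lost communication with the server."]
    print(f"{lost_communication_rate(messages):.1%}")
```

Comparing this rate for runs before and after a given commit would show whether the change in question actually moved it.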
Examples on-trunk (HUD link):
Versions
CI
cc @seemethere @pytorch/pytorch-dev-infra @clee2000