MacOS tests have not been running for a few weeks · Issue #142206 · pytorch/pytorch · GitHub

Closed
malfet opened this issue Dec 6, 2024 · 7 comments
Labels
high priority · module: ci (Related to continuous integration) · module: regression (It used to work, and now it doesn't) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

malfet (Contributor) commented Dec 6, 2024

#135386 rendered the regular MacOS test shard useless.

For example, https://github.com/pytorch/pytorch/actions/runs/12191328925/job/34010247281?pr=141921 finishes in 18 seconds for PR #141921, which could have some effect on Mac tests.

Versions

CI

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @seemethere @pytorch/pytorch-dev-infra

malfet added the module: ci (Related to continuous integration) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Dec 6, 2024
malfet (Contributor, Author) commented Dec 9, 2024

@clee2000 found the culprit, #135386, and has a fix: #142270

malfet added the ci: sev (critical failure affecting PyTorch CI) label Dec 9, 2024
malfet changed the title from "TargetDeterminator skips all MacOS tests for PR that can affect MacOS" to "MacOS tests has not been running for few weeks" Dec 9, 2024
malfet added the module: regression (It used to work, and now it doesn't) and high priority labels and removed the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Dec 9, 2024
malfet (Contributor, Author) commented Dec 9, 2024

Landed @clee2000's #142270 to enable testing
Reverts:

AttributeError: '_OpNamespace' 'fsdp' object has no attribute 'all_gather_copy_in'
Traceback (most recent call last):
  File "/Users/ec2-user/runner/_work/pytorch/pytorch/test/inductor/test_flex_attention.py", line 135, in <module>
    if torch.ops.mkldnn._is_mkldnn_bf16_supported()
  File "/Users/ec2-user/runner/_work/_temp/conda_environment_12245441371/lib/python3.9/site-packages/torch/_ops.py", line 1232, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'mkldnn' object has no attribute '_is_mkldnn_bf16_supported'
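
For reference, the mkldnn failure above is the general shape of these errors: an op is looked up on a `torch.ops` namespace that does not define it in this build, and `_OpNamespace.__getattr__` raises `AttributeError`. A minimal, hypothetical guard for that pattern (shown only for illustration, not one of the forward fixes that landed) could look like this:

```
import torch

def mkldnn_bf16_supported() -> bool:
    # Probe the op namespace instead of calling the attribute directly, so
    # builds where the op is absent fall back to False. getattr() with a
    # default absorbs the AttributeError raised by _OpNamespace.__getattr__.
    op = getattr(torch.ops.mkldnn, "_is_mkldnn_bf16_supported", None)
    return bool(op()) if op is not None else False

print(mkldnn_bf16_supported())
```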

Forward fixes:

malfet (Contributor, Author) commented Dec 9, 2024
AIs (action items):

  • @kit1980 will have a look at shard 1 of MacOS and categorize the failures
  • @malfet to look at shard 2
  • @atalman to look at shard 3

It doesn't really work this way, though, as closing a failure on one shard likely causes a rebalance.

clee2000 (Contributor) commented Dec 9, 2024

For the people looking through the tests, I merged #142421 to enable keep-going/continue-on-error on trunk for the Mac default tests.

Red signal will show up later, but you can see failing tests mid-run on HUD by clicking the "additional test failures" button. When the run is finished, you can also search for "consistently: " in the logs.
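
For anyone triaging from the command line, here is a minimal sketch of pulling those markers out of a finished run's logs. It assumes the GitHub CLI (`gh`) is installed and authenticated; the run id below (the run linked in the first comment) is used purely as an example.

```
import subprocess

RUN_ID = "12191328925"  # example run id; substitute the run you are triaging

# Fetch the full log for the run via the GitHub CLI and keep only the lines
# containing the "consistently: " marker mentioned above.
log = subprocess.run(
    ["gh", "run", "view", RUN_ID, "--repo", "pytorch/pytorch", "--log"],
    capture_output=True, text=True, check=True,
).stdout

for line in log.splitlines():
    if "consistently: " in line:
        print(line)
```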

malfet added a commit that referenced this issue Dec 10, 2024
malfet added a commit that referenced this issue Dec 10, 2024
As `torch._C._scatter` is only defined for CUDA/ROCm (and maybe XPU?)

This is a regression introduced by #141098 that went unnoticed due to #142206

Test plan:
```
python test_autograd.py -v -k test_dataparallel_saved_tensors_hooks
```

Before this change it failed with
```
ERROR: test_dataparallel_saved_tensors_hooks (__main__.TestMultithreadAutograd.test_dataparallel_saved_tensors_hooks)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
    method(*args, **kwargs)
    ~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/test/test_autograd.py", line 13074, in test_dataparallel_saved_tensors_hooks
    model = torch.nn.DataParallel(Model())
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/parallel/data_parallel.py", line 153, in __init__
    raise RuntimeError("no available devices were found")
RuntimeError: no available devices were found
```

After this change it passes
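
To make the failure mode above concrete: `DataParallel` needs at least one visible accelerator device, which the MacOS runners do not have, so such tests have to be guarded or skipped there. The snippet below is a hypothetical sketch of that kind of guard, not necessarily the approach the actual fix takes.

```
import unittest

import torch
import torch.nn as nn


class Model(nn.Module):
    def forward(self, x):
        return x * 2


class TestDataParallelGuard(unittest.TestCase):
    # Hypothetical guard: skip on machines without CUDA devices (e.g. the
    # MacOS runners) instead of failing with "no available devices were found".
    @unittest.skipUnless(torch.cuda.is_available(), "requires CUDA devices")
    def test_dataparallel_construction(self):
        model = nn.DataParallel(Model())
        self.assertIsInstance(model, nn.DataParallel)


if __name__ == "__main__":
    unittest.main()
```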
pytorchmergebot pushed a commit that referenced this issue Dec 10, 2024
Where `int64_t` is `long long` rather than `long` (as it is on MacOS)

This fixes test regression introduced by #140597 that went undetected due to #142206

Pull Request resolved: #142440
Approved by: https://github.com/kit1980
pytorchmergebot pushed a commit that referenced this issue Dec 10, 2024
Pull Request resolved: #142448
Approved by: https://github.com/kit1980
malfet added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label and removed the triage review label Dec 10, 2024
malfet (Contributor, Author) commented Dec 10, 2024

Mitigated, see successful run here

malfet added the triage review label and removed the ci: sev (critical failure affecting PyTorch CI) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Dec 10, 2024
malfet (Contributor, Author) commented Dec 10, 2024

What would be good to discuss in a post-mortem:

  • How it could have been detected earlier (@wdvr already has some dashboards); a rough sketch of one such check follows below
  • Over-testing (considering that CI was essentially a no-op, having to revert 3 and land 3 commits is not that bad)
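
On the first point, one kind of check that could have caught this earlier is flagging "successful" MacOS test jobs that finish suspiciously fast, which is what a shard that skipped everything looks like (the 18-second job in the first comment). The sketch below uses the public GitHub Actions jobs API; the job-name filter, the 5-minute threshold, and the `requests` dependency are all assumptions for illustration.

```
import os
from datetime import datetime

import requests  # third-party; assumed to be available

RUN_ID = 12191328925  # the run linked in the first comment, as an example
TOKEN = os.environ["GITHUB_TOKEN"]  # assumed to be set in the environment
MIN_SECONDS = 300  # illustrative threshold for "suspiciously fast"

resp = requests.get(
    f"https://api.github.com/repos/pytorch/pytorch/actions/runs/{RUN_ID}/jobs",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

for job in resp.json()["jobs"]:
    # The name filter is a guess at how the MacOS test jobs are labeled.
    if job["conclusion"] != "success" or "macos" not in job["name"].lower():
        continue
    started = datetime.fromisoformat(job["started_at"].replace("Z", "+00:00"))
    completed = datetime.fromisoformat(job["completed_at"].replace("Z", "+00:00"))
    duration = (completed - started).total_seconds()
    if duration < MIN_SECONDS:
        print(f"suspiciously short job: {job['name']} ({duration:.0f}s)")
```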

mori360 pushed a commit to mori360/pytorch that referenced this issue Dec 11, 2024
Pull Request resolved: pytorch#142440
Approved by: https://github.com/kit1980
mori360 pushed a commit to mori360/pytorch that referenced this issue Dec 11, 2024
Pull Request resolved: pytorch#142448
Approved by: https://github.com/kit1980
malfet moved this to Postmortem in PyTorch OSS Dev Infra Dec 11, 2024
bluenote10 pushed a commit to bluenote10/pytorch that referenced this issue Dec 14, 2024
Pull Request resolved: pytorch#142440
Approved by: https://github.com/kit1980
bluenote10 pushed a commit to bluenote10/pytorch that referenced this issue Dec 14, 2024
Pull Request resolved: pytorch#142448
Approved by: https://github.com/kit1980
Esquains pushed a commit to Esquains/study1 that referenced this issue Dec 15, 2024
Pull Request resolved: pytorch/pytorch#142440
soulitzer added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label and removed the triage review label Dec 16, 2024
malfet (Contributor, Author) commented Feb 4, 2025

No post-mortem discussion ever happened, but tests are running now, so closing
