MacOS tests have not been running for a few weeks · Issue #142206 · pytorch/pytorch · GitHub

Closed
malfet opened this issue Dec 6, 2024 · 7 comments
Labels
high priority · module: ci (Related to continuous integration) · module: regression (It used to work, and now it doesn't) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

malfet (Contributor) commented Dec 6, 2024

#135386 rendered the regular MacOS test shard useless.

For example, https://github.com/pytorch/pytorch/actions/runs/12191328925/job/34010247281?pr=141921 finishes in 18 seconds for PR #141921, which could have some effect on Mac tests.

Versions

CI

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @seemethere @pytorch/pytorch-dev-infra

malfet added the module: ci (Related to continuous integration) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Dec 6, 2024
malfet (Contributor, Author) commented Dec 9, 2024

@clee2000 found the culprit, #135386, and has a fix: #142270

malfet added the ci: sev (critical failure affecting PyTorch CI) label Dec 9, 2024
malfet changed the title from "TargetDeterminator skips all MacOS tests for PR that can affect MacOS" to "MacOS tests has not been running for few weeks" Dec 9, 2024
malfet added the module: regression (It used to work, and now it doesn't) and high priority labels and removed the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Dec 9, 2024
malfet (Contributor, Author) commented Dec 9, 2024

Landed @clee2000's #142270 to enable testing
Reverts:

AttributeError: '_OpNamespace' 'fsdp' object has no attribute 'all_gather_copy_in'
Traceback (most recent call last):
  File "/Users/ec2-user/runner/_work/pytorch/pytorch/test/inductor/test_flex_attention.py", line 135, in <module>
    if torch.ops.mkldnn._is_mkldnn_bf16_supported()
  File "/Users/ec2-user/runner/_work/_temp/conda_environment_12245441371/lib/python3.9/site-packages/torch/_ops.py", line 1232, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'mkldnn' object has no attribute '_is_mkldnn_bf16_supported'
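
For reference, the mkldnn failure above is the general shape of these errors: an op is looked up on a `torch.ops` namespace that does not define it in this build, and `_OpNamespace.__getattr__` raises `AttributeError`. A minimal, hypothetical guard for that pattern (shown only for illustration, not one of the forward fixes that landed) could look like this:

```
import torch

def mkldnn_bf16_supported() -> bool:
    # Probe the op namespace instead of calling the attribute directly, so
    # builds where the op is absent fall back to False. getattr() with a
    # default absorbs the AttributeError raised by _OpNamespace.__getattr__.
    op = getattr(torch.ops.mkldnn, "_is_mkldnn_bf16_supported", None)
    return bool(op()) if op is not None else False

print(mkldnn_bf16_supported())
```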

Forward fixes:

malfet (Contributor, Author) commented Dec 9, 2024
AIs (action items):

  • @kit1980 will have a look at shard 1 of MacOS and categorize the failures
  • @malfet to look at shard 2
  • @atalman to look at shard 3

It doesn't really work this way, though, as closing a failure on one shard likely causes a rebalance.

clee2000 (Contributor) commented Dec 9, 2024

For the people looking through the tests, I merged #142421 to enable keep-going/continue-on-error on trunk for the Mac default tests.

Red signal will show up later, but you can see failing tests mid-run on HUD by clicking the "additional test failures" button. When the run is finished, you can also search for "consistently: " in the logs.
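
For anyone triaging from the command line, here is a minimal sketch of pulling those markers out of a finished run's logs. It assumes the GitHub CLI (`gh`) is installed and authenticated; the run id below (the run linked in the first comment) is used purely as an example.

```
import subprocess

RUN_ID = "12191328925"  # example run id; substitute the run you are triaging

# Fetch the full log for the run via the GitHub CLI and keep only the lines
# containing the "consistently: " marker mentioned above.
log = subprocess.run(
    ["gh", "run", "view", RUN_ID, "--repo", "pytorch/pytorch", "--log"],
    capture_output=True, text=True, check=True,
).stdout

for line in log.splitlines():
    if "consistently: " in line:
        print(line)
```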

malfet added a commit that referenced this issue Dec 10, 2024
malfet added a commit that referenced this issue Dec 10, 2024
As `torch._C._scatter` is only defined for CUDA/ROCm (and maybe XPU?)

This is a regression introduced by #141098 that went unnoticed due to #142206

Test plan:
```
python test_autograd.py -v -k test_dataparallel_saved_tensors_hooks
```

Before this change it failed with
```
ERROR: test_dataparallel_saved_tensors_hooks (__main__.TestMultithreadAutograd.test_dataparallel_saved_tensors_hooks)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
    method(*args, **kwargs)
    ~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/test/test_autograd.py", line 13074, in test_dataparallel_saved_tensors_hooks
    model = torch.nn.DataParallel(Model())
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/parallel/data_parallel.py", line 153, in __init__
    raise RuntimeError("no available devices were found")
RuntimeError: no available devices were found
```

After this change it passes
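
To make the failure mode above concrete: `DataParallel` needs at least one visible accelerator device, which the MacOS runners do not have, so such tests have to be guarded or skipped there. The snippet below is a hypothetical sketch of that kind of guard, not necessarily the approach the actual fix takes.

```
import unittest

import torch
import torch.nn as nn


class Model(nn.Module):
    def forward(self, x):
        return x * 2


class TestDataParallelGuard(unittest.TestCase):
    # Hypothetical guard: skip on machines without CUDA devices (e.g. the
    # MacOS runners) instead of failing with "no available devices were found".
    @unittest.skipUnless(torch.cuda.is_available(), "requires CUDA devices")
    def test_dataparallel_construction(self):
        model = nn.DataParallel(Model())
        self.assertIsInstance(model, nn.DataParallel)


if __name__ == "__main__":
    unittest.main()
```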
pytorchmergebot pushed a commit that referenced this issue Dec 10, 2024
Where `int64_t` is `long long` rather than `long` (as it is on MacOS)

This fixes test regression introduced by #140597 that went undetected due to #142206

Pull Request resolved: #142440
Approved by: https://github.com/kit1980
pytorchmergebot pushed a commit that referenced this issue Dec 10, 2024
Pull Request resolved: #142448
Approved by: https://github.com/kit1980
malfet added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label and removed the triage review label Dec 10, 2024
malfet (Contributor, Author) commented Dec 10, 2024

Mitigated, see successful run here

malfet added the triage review label and removed the ci: sev (critical failure affecting PyTorch CI) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Dec 10, 2024
malfet (Contributor, Author) commented Dec 10, 2024

What would be good to discuss in a post-mortem:

  • How it could have been detected earlier (@wdvr already has some dashboards); a rough sketch of one such check follows below
  • Over-testing (considering that CI was essentially a no-op, having to revert 3 and land 3 commits is not that bad)
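
On the first point, one kind of check that could have caught this earlier is flagging "successful" MacOS test jobs that finish suspiciously fast, which is what a shard that skipped everything looks like (the 18-second job in the first comment). The sketch below uses the public GitHub Actions jobs API; the job-name filter, the 5-minute threshold, and the `requests` dependency are all assumptions for illustration.

```
import os
from datetime import datetime

import requests  # third-party; assumed to be available

RUN_ID = 12191328925  # the run linked in the first comment, as an example
TOKEN = os.environ["GITHUB_TOKEN"]  # assumed to be set in the environment
MIN_SECONDS = 300  # illustrative threshold for "suspiciously fast"

resp = requests.get(
    f"https://api.github.com/repos/pytorch/pytorch/actions/runs/{RUN_ID}/jobs",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

for job in resp.json()["jobs"]:
    # The name filter is a guess at how the MacOS test jobs are labeled.
    if job["conclusion"] != "success" or "macos" not in job["name"].lower():
        continue
    started = datetime.fromisoformat(job["started_at"].replace("Z", "+00:00"))
    completed = datetime.fromisoformat(job["completed_at"].replace("Z", "+00:00"))
    duration = (completed - started).total_seconds()
    if duration < MIN_SECONDS:
        print(f"suspiciously short job: {job['name']} ({duration:.0f}s)")
```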

mori360 pushed a commit to mori360/pytorch that referenced this issue Dec 11, 2024
Pull Request resolved: pytorch#142440
Approved by: https://github.com/kit1980
mori360 pushed a commit to mori360/pytorch that referenced this issue Dec 11, 2024
Pull Request resolved: pytorch#142448
Approved by: https://github.com/kit1980
malfet moved this to Postmortem in PyTorch OSS Dev Infra Dec 11, 2024
bluenote10 pushed a commit to bluenote10/pytorch that referenced this issue Dec 14, 2024
Pull Request resolved: pytorch#142440
Approved by: https://github.com/kit1980
bluenote10 pushed a commit to bluenote10/pytorch that referenced this issue Dec 14, 2024
Pull Request resolved: pytorch#142448
Approved by: https://github.com/kit1980
Esquains pushed a commit to Esquains/study1 that referenced this issue Dec 15, 2024
Pull Request resolved: pytorch/pytorch#142440
soulitzer added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label and removed the triage review label Dec 16, 2024
malfet (Contributor, Author) commented Feb 4, 2025

No post-mortem discussion ever happened, but tests are running now, so closing
