Add device guard for xpu conv on multi device by guangyey · Pull Request #153067 · pytorch/pytorch · GitHub


Closed · guangyey wants to merge 10 commits

Conversation

@guangyey (Collaborator) commented May 7, 2025

Stack from ghstack (oldest at bottom):

Motivation

fixes #153022
The root cause is that the XPU backend registers the convolution op using m.impl, which bypasses the device guard logic typically added by the code generation system. This can lead to unexpected behavior if the current device isn't explicitly set.
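
For illustration, before this fix a user-side workaround was to make the target device current explicitly, so that ops lacking an internal guard still ran against the intended device. Below is a minimal sketch (hypothetical, not taken from the PR), assuming a build with at least two XPU devices:

import torch

# Hypothetical workaround (not from the PR): make xpu:1 the current device
# up front, so an op registered without a device guard still executes on
# the intended device.
torch.xpu.set_device(1)

# Or scope the selection with the equivalent context manager:
# with torch.xpu.device(1):
#     ...  # run the model here

The fix itself establishes the correct current device inside the XPU convolution via a device guard, matching what the code generation system would normally emit, so callers no longer need to set the device explicitly.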

Additional Context

run the following script

import torch
import torchvision.models as models

torch.manual_seed(0)

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

device = torch.device('xpu:1')  # 'xpu:0'
model = model.to(device=device, dtype=torch.float16)
data = data.to(device, dtype=torch.float16)


with torch.no_grad():
    ret = model(data)
    print(ret)

print("Execution finished")

The output is

         -9.2102e-02, -7.7588e-01, -1.4111e+00, -9.2383e-01,  6.4551e-01,
         -6.0730e-03, -7.8271e-01, -1.1904e+00, -4.1602e-01,  3.2715e-02,
         -4.9854e-01, -6.3623e-01, -8.5107e-01, -6.8555e-01, -9.4434e-01,
         -8.8672e-01, -6.7969e-01, -6.9824e-01, -2.8882e-01,  2.0312e+00]],
       device='xpu:1', dtype=torch.float16)
Execution finished
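
As a sanity check (hypothetical, not part of the PR's test suite), one could compare the xpu:1 result against a float32 CPU reference; the loose tolerances below are an assumption chosen for float16:

# Hypothetical parity check, continuing from the script above: compare the
# fp16 xpu:1 output against a float32 CPU reference with loose tolerances.
with torch.no_grad():
    ref = model.to('cpu', dtype=torch.float32)(data.to('cpu', dtype=torch.float32))
torch.testing.assert_close(ret.cpu().float(), ref, rtol=2e-2, atol=2e-2)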

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168

pytorch-bot (bot) commented May 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153067

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 39068ab with merge base 9608e7f:

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label May 7, 2025
guangyey added a commit that referenced this pull request May 7, 2025
ghstack-source-id: a76a64b
Pull Request resolved: #153067
guangyey added a commit that referenced this pull request May 7, 2025
ghstack-source-id: 3f3d1e9
Pull Request resolved: #153067
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label May 7, 2025
@guangyey guangyey added the release notes: xpu release notes category label May 7, 2025
@guangyey guangyey added this to the 2.7.1 milestone May 7, 2025
@guangyey guangyey moved this to Review Required in PyTorch Intel May 7, 2025
guangyey added a commit that referenced this pull request May 7, 2025
ghstack-source-id: e236a3e
Pull Request resolved: #153067
guangyey added a commit that referenced this pull request May 7, 2025
ghstack-source-id: a44e2f1
Pull Request resolved: #153067
guangyey added a commit that referenced this pull request May 7, 2025
ghstack-source-id: 09128ed
Pull Request resolved: #153067
guangyey added a commit that referenced this pull request May 7, 2025
ghstack-source-id: 1577980
Pull Request resolved: #153067
guangyey added a commit that referenced this pull request May 7, 2025
ghstack-source-id: e8ea35c
Pull Request resolved: #153067
@guangyey guangyey added the keep-going Don't stop on first failure, keep running tests until the end label May 7, 2025
@guangyey guangyey requested a review from albanD May 7, 2025 18:26
@albanD (Collaborator) left a comment

Sure

@guangyey guangyey added the ciflow/trunk Trigger trunk jobs on your pull request label May 8, 2025
guangyey added 6 commits May 8, 2025 00:40
[ghstack-poisoned] ×6
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 job failed: xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 1, 4, linux.idc.xpu)


@etaf (Collaborator) commented May 8, 2025

@EikanWang @chuanqi129 This job timed out because XPU CI splits the test tasks into four jobs, but the split is not well balanced: different test cases take significantly different amounts of time to run, so some jobs finish in 2 hours while others run for over 4 hours without completing. Re-running the workflow doesn't help, since the job split is already fixed for a given commit. A workaround is to rebase the branch, which retriggers the splitting and may, by chance, produce a more balanced distribution.

@guangyey (Collaborator, Author) commented May 9, 2025

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request May 9, 2025
ghstack-source-id: b80cb82
Pull Request resolved: #153067
@pytorchmergebot (Collaborator)

Successfully rebased gh/guangyey/146/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/153067)

guangyey added a commit that referenced this pull request May 9, 2025
ghstack-source-id: 9bbca67
Pull Request resolved: #153067
@guangyey (Collaborator, Author) commented May 9, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@guangyey (Collaborator, Author) commented May 9, 2025

Finally, the CI signal is green.
@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@github-project-automation github-project-automation bot moved this from Review Required to Done in PyTorch Intel May 9, 2025
@chuanqi129 (Collaborator)

@guangyey please cherry-pick this PR to release/2.7 branch for v2.7.1 patch release also

@guangyey (Collaborator, Author)

@pytorchbot cherry-pick --onto release/2.7 -c critical

pytorchbot pushed a commit that referenced this pull request May 11, 2025
Pull Request resolved: #153067
Approved by: https://github.com/albanD, https://github.com/EikanWang

(cherry picked from commit e06a080)
@pytorchbot (Collaborator)

Cherry picking #153067

The cherry pick PR is at #153345 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:


atalman pushed a commit that referenced this pull request May 14, 2025
Add device guard for xpu conv on multi device (#153067)

(cherry picked from commit e06a080)

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
Labels
ciflow/trunk - Trigger trunk jobs on your pull request
ciflow/xpu - Run XPU CI tasks
keep-going - Don't stop on first failure, keep running tests until the end
Merged
module: cpu - CPU specific problem (e.g., perf, algorithm)
open source
release notes: xpu - release notes category
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

7 participants