Add device guard for xpu conv on multi device by guangyey · Pull Request #153067 · pytorch/pytorch · GitHub


Closed · guangyey wants to merge 10 commits

Conversation

@guangyey (Collaborator) commented May 7, 2025

Stack from ghstack (oldest at bottom):

Motivation

fixes #153022
The root cause is that the XPU backend registers the convolution op using m.impl, which bypasses the device guard logic typically added by the code generation system. This can lead to unexpected behavior if the current device isn't explicitly set.
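
For illustration, before this fix a user-side workaround was to make the target device current explicitly, so that ops lacking an internal guard still ran against the intended device. Below is a minimal sketch (hypothetical, not taken from the PR), assuming a build with at least two XPU devices:

import torch

# Hypothetical workaround (not from the PR): make xpu:1 the current device
# up front, so an op registered without a device guard still executes on
# the intended device.
torch.xpu.set_device(1)

# Or scope the selection with the equivalent context manager:
# with torch.xpu.device(1):
#     ...  # run the model here

The fix itself establishes the correct current device inside the XPU convolution via a device guard, matching what the code generation system would normally emit, so callers no longer need to set the device explicitly.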

Additional Context

run the following script

import torch
import torchvision.models as models

torch.manual_seed(0)

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

device = torch.device('xpu:1')  # 'xpu:0'
model = model.to(device=device, dtype=torch.float16)
data = data.to(device, dtype=torch.float16)


with torch.no_grad():
    ret = model(data)
    print(ret)

print("Execution finished")

The output is

         -9.2102e-02, -7.7588e-01, -1.4111e+00, -9.2383e-01,  6.4551e-01,
         -6.0730e-03, -7.8271e-01, -1.1904e+00, -4.1602e-01,  3.2715e-02,
         -4.9854e-01, -6.3623e-01, -8.5107e-01, -6.8555e-01, -9.4434e-01,
         -8.8672e-01, -6.7969e-01, -6.9824e-01, -2.8882e-01,  2.0312e+00]],
       device='xpu:1', dtype=torch.float16)
Execution finished
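
As a sanity check (hypothetical, not part of the PR's test suite), one could compare the xpu:1 result against a float32 CPU reference; the loose tolerances below are an assumption chosen for float16:

# Hypothetical parity check, continuing from the script above: compare the
# fp16 xpu:1 output against a float32 CPU reference with loose tolerances.
with torch.no_grad():
    ref = model.to('cpu', dtype=torch.float32)(data.to('cpu', dtype=torch.float32))
torch.testing.assert_close(ret.cpu().float(), ref, rtol=2e-2, atol=2e-2)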

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168

pytorch-bot (bot) commented May 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153067

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 39068ab with merge base 9608e7f:

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label May 7, 2025
guangyey added a commit that referenced this pull request May 7, 2025
ghstack-source-id: a76a64b
Pull Request resolved: #153067
guangyey added a commit that referenced this pull request May 7, 2025
ghstack-source-id: 3f3d1e9
Pull Request resolved: #153067
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label May 7, 2025
@guangyey guangyey added the release notes: xpu release notes category label May 7, 2025
@guangyey guangyey added this to the 2.7.1 milestone May 7, 2025
@guangyey guangyey moved this to Review Required in PyTorch Intel May 7, 2025
guangyey added a commit that referenced this pull request May 7, 2025
ghstack-source-id: e236a3e
Pull Request resolved: #153067
guangyey added a commit that referenced this pull request May 7, 2025
ghstack-source-id: a44e2f1
Pull Request resolved: #153067
guangyey added a commit that referenced this pull request May 7, 2025
ghstack-source-id: 09128ed
Pull Request resolved: #153067
guangyey added a commit that referenced this pull request May 7, 2025
ghstack-source-id: 1577980
Pull Request resolved: #153067
guangyey added a commit that referenced this pull request May 7, 2025
ghstack-source-id: e8ea35c
Pull Request resolved: #153067
@guangyey guangyey added the keep-going Don't stop on first failure, keep running tests until the end label May 7, 2025
@guangyey guangyey requested a review from albanD May 7, 2025 18:26
@albanD (Collaborator) left a comment

Sure

@guangyey guangyey added the ciflow/trunk Trigger trunk jobs on your pull request label May 8, 2025
guangyey added 6 commits May 8, 2025 00:40
[ghstack-poisoned] ×6
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 job failed: xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 1, 4, linux.idc.xpu)


@etaf (Collaborator) commented May 8, 2025

@EikanWang @chuanqi129 This job timed out because XPU CI splits the test tasks into four jobs, but the split is not well balanced: different test cases take significantly different amounts of time to run, so some jobs finish in 2 hours while others run for over 4 hours without completing. Re-running the workflow doesn't help, since the job split is already fixed for a given commit. A workaround is to rebase the branch, which retriggers the splitting and may, by chance, produce a more balanced distribution.

@guangyey (Collaborator, Author) commented May 9, 2025

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request May 9, 2025
ghstack-source-id: b80cb82
Pull Request resolved: #153067
@pytorchmergebot (Collaborator)

Successfully rebased gh/guangyey/146/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/153067)

guangyey added a commit that referenced this pull request May 9, 2025
ghstack-source-id: 9bbca67
Pull Request resolved: #153067
@guangyey (Collaborator, Author) commented May 9, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@guangyey (Collaborator, Author) commented May 9, 2025

Finally, the CI signal is green.
@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@github-project-automation github-project-automation bot moved this from Review Required to Done in PyTorch Intel May 9, 2025
@chuanqi129 (Collaborator)

@guangyey please cherry-pick this PR to release/2.7 branch for v2.7.1 patch release also

@guangyey (Collaborator, Author)

@pytorchbot cherry-pick --onto release/2.7 -c critical

pytorchbot pushed a commit that referenced this pull request May 11, 2025
Pull Request resolved: #153067
Approved by: https://github.com/albanD, https://github.com/EikanWang

(cherry picked from commit e06a080)
@pytorchbot (Collaborator)

Cherry picking #153067

The cherry pick PR is at #153345 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:


atalman pushed a commit that referenced this pull request May 14, 2025
Add device guard for xpu conv on multi device (#153067)

(cherry picked from commit e06a080)

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
Labels
ciflow/trunk - Trigger trunk jobs on your pull request
ciflow/xpu - Run XPU CI tasks
keep-going - Don't stop on first failure, keep running tests until the end
Merged
module: cpu - CPU specific problem (e.g., perf, algorithm)
open source
release notes: xpu - release notes category
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

7 participants