[Intel GPU] Avoid copy when the input of Matmul is broadcasted #143784
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143784
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 2 Unrelated Failures as of commit b5def8a with merge base 30cbf13.
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
```cpp
void undo_broadcast_on_batch(at::Tensor& m1, at::Tensor& m2) {
  // oneDNN supports having one of src and wei broadcasted on the batch dim.
  auto tensor_dim = m1.dim();
  TORCH_CHECK(
```
Broadcast on the batch (BS) dim only serves performance. Please do NOT treat this check as a must-have.
deleted
```cpp
  if (tensor_dim == 2)
    return;
  auto undo_broadcast = [](at::Tensor& tensor) {
    if (tensor.stride(1) == 0 || tensor.stride(2) == 0) {
```
Why could a particular stride be zero when a tensor is not a scalar tensor?
I don't think we support broadcast on the m, n, k dims, so the strides of the last two dims cannot be zero. The tensor may not be a scalar.
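For context, a minimal standalone sketch (not code from this PR) showing how a batch-dim broadcast yields a zero stride while the last two dims keep nonzero strides:

```cpp
#include <ATen/ATen.h>
#include <iostream>

int main() {
  // Batch dim has size 1, so expand() can broadcast it without copying.
  at::Tensor t = at::randn({1, 4, 5});
  at::Tensor b = t.expand({8, 4, 5});
  std::cout << b.stride(0) << "\n";                        // 0: broadcasted batch dim
  std::cout << b.stride(1) << " " << b.stride(2) << "\n";  // 5 1: m/k dims stay nonzero
  return 0;
}
```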
```cpp
  if (m1.stride(0) == 0 && m2.stride(0) == 0) {
    // oneDNN does not support both src and wei being broadcasted on the
    // batch dim, so we materialize a copy of the smaller operand.
    if (m1.size(1) < m2.size(2)) {
      m1 = m1.contiguous();
    } else {
      m2 = m2.contiguous();
    }
  }
```
Let's separate broadcast logic from the contiguous logic.
Meanwhile, please help refine `is_onednn_mat_strides` a little bit.
```cpp
dnnl::memory::dims strides = get_onednn_strides(tensor);
// Compute the span of storage actually addressed: the offset of the last
// element plus one. A span smaller than numel() implies elements alias.
int64_t storage_size = 1;
for (size_t dim = 0; dim < tensor_dim; ++dim)
  storage_size += (sizes[dim] - 1) * strides[dim];
if (storage_size < tensor.numel())
  return false;
```
The above code snippet could be refined as follows.
```cpp
if (at::has_internal_overlap(tensor) == at::MemOverlap::Yes)
  return false;
```
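A hedged sketch of how that refinement could sit inside a stride-checking helper (the helper name and surrounding checks below are placeholders, not the PR's actual `is_onednn_mat_strides` body):

```cpp
#include <ATen/ATen.h>
#include <ATen/MemoryOverlap.h>

// Placeholder helper: reject tensors whose strides make distinct indices
// alias the same storage, delegating the span arithmetic to ATen.
static bool strides_usable_by_onednn(const at::Tensor& tensor) {
  if (at::has_internal_overlap(tensor) == at::MemOverlap::Yes) {
    return false;
  }
  // ... the remaining oneDNN-specific stride checks would follow here ...
  return true;
}
```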
To add the ciflow label: this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
Force-pushed from 5a3c51b to 366ed3b (Compare)
UT is already covered here: https://github.com/pytorch/pytorch/blob/main/test/xpu/test_gemm.py#L313
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased
Force-pushed from fdff049 to 923f683 (Compare)
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here
Successfully rebased
Force-pushed from a456c1b to f4910e2 (Compare)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few of them are: xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 4, 4, linux.idc.xpu). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 4 checks: xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 1, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 2, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 3, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 4, 4, linux.idc.xpu). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…iguous (#144759) We should not always call contiguous on the dst of matmul; we already removed the copy of the matmul input in #143784. I also fixed an accuracy issue by using the oneDNN sum post-op instead of binary add in the in-place case, to avoid a UT failure. Pull Request resolved: #144759. Approved by: https://github.com/EikanWang
Avoid copy when the input of Matmul is 3D and broadcasted on the batch dim. oneDNN supports implicit broadcast semantics, i.e., src can be broadcast into weight if the corresponding dimension in src is 1 (and vice versa). On Max 1100, timm resmlp_12_224 amp_fp16 inference with bs=128 improves from 42 ms to 13.7 ms with torch.compile and from 57.5 ms to 32 ms in eager mode.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
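For illustration, a minimal standalone sketch (an assumed example, not code from this PR) of the pattern the change optimizes: a 3-D matmul whose first operand is broadcast on the batch dim.

```cpp
#include <ATen/ATen.h>

int main() {
  // Expanded without a copy: a.stride(0) == 0 on the batch dim.
  at::Tensor a = at::randn({1, 64, 32}).expand({128, 64, 32});
  at::Tensor b = at::randn({128, 32, 16});
  // With this PR, the XPU/oneDNN path can consume `a` directly through
  // implicit broadcast instead of first materializing a contiguous copy.
  at::Tensor c = at::bmm(a, b);  // result shape: {128, 64, 16}
  return 0;
}
```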