[Intel GPU] Avoid copy when the input of Matmul is broadcasted by jianyizh · Pull Request #143784 · pytorch/pytorch · GitHub

[Intel GPU] Avoid copy when the input of Matmul is broadcasted #143784


Closed
wants to merge 7 commits

Conversation

@jianyizh (Contributor) commented Dec 24, 2024:

Avoid the copy when the input of Matmul is 3D and broadcasted on the batch dim. oneDNN supports implicit broadcast semantics, i.e., src can be broadcasted into weight if the corresponding dimension in src is 1 (and vice versa). On Max 1100, timm resmlp_12_224 amp_fp16 inference with bs=128 improves from 42 ms to 13.7 ms with torch.compile and from 57.5 ms to 32 ms in eager mode.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
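
For illustration, a minimal standalone sketch of the pattern this PR targets (plain libtorch on CPU tensors for simplicity; the PR itself concerns the XPU/oneDNN path). An expand() on the batch dim yields a stride-0 view, which previously forced a contiguous copy before the oneDNN matmul:

  #include <torch/torch.h>
  #include <iostream>

  int main() {
    // Broadcast m1 on the batch dim: expand() creates a view with
    // stride(0) == 0 instead of copying data.
    auto m1 = torch::randn({1, 64, 64}).expand({128, 64, 64});
    auto m2 = torch::randn({128, 64, 32});
    std::cout << m1.stride(0) << std::endl;  // 0: broadcasted batch dim

    // Before this PR the XPU path materialized m1 with .contiguous();
    // with implicit broadcast, oneDNN consumes the stride-0 view directly.
    auto out = torch::bmm(m1, m2);  // shape [128, 64, 32]
    std::cout << out.sizes() << std::endl;
  }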

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Dec 24, 2024
pytorch-bot bot commented Dec 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143784

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 2 Unrelated Failures

As of commit b5def8a with merge base 30cbf13:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jianyizh (Contributor, Author) commented:

@pytorchbot label "topic: not user facing"

void undo_broadcast_on_batch(at::Tensor& m1, at::Tensor& m2) {
  // oneDNN supports one of src and wei being broadcasted on the batch dim
  auto tensor_dim = m1.dim();
  TORCH_CHECK(
Collaborator:

Broadcast on the batch dim only serves performance. Please do not make this check a must-have.

Contributor (Author):

deleted

  if (tensor_dim == 2)
    return;
  auto undo_broadcast = [](at::Tensor& tensor) {
    if (tensor.stride(1) == 0 || tensor.stride(2) == 0) {
Collaborator:

Why could a particular stride be zero when a tensor is not a scalar tensor?

Contributor (Author):

I don't think we support broadcast on the m, n, k dims, so the strides of the last two dims cannot be zero. A tensor with a zero stride need not be a scalar: expanding a size-1 batch dim also produces a zero stride there.
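
To ground the exchange, here is a hedged sketch of what the helper under discussion could look like, reusing the diff's undo_broadcast_on_batch name; the actual PR logic may differ, and the case where both inputs are broadcasted is handled separately in the hunk below:

  #include <ATen/ATen.h>

  // Sketch only: collapse an expand()ed (stride-0) batch dim back to size 1
  // so oneDNN can broadcast it implicitly, instead of calling .contiguous().
  void undo_broadcast_on_batch(at::Tensor& m1, at::Tensor& m2) {
    if (m1.dim() == 2)
      return;  // plain 2D matmul, no batch dim to undo
    auto undo_broadcast = [](at::Tensor& t) {
      if (t.stride(0) == 0) {
        // The batch dim was broadcasted: view it as batch size 1.
        t = t.select(0, 0).unsqueeze(0);
      }
    };
    undo_broadcast(m1);
    undo_broadcast(m2);
  }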

Comment on lines 228 to 254
if (m1.stride(0) == 0 && m2.stride(0) == 0) {
  // oneDNN does not support both src and wei being broadcasted on the
  // batch dim, so we copy the smaller one.
  if (m1.size(1) < m2.size(2)) {
    m1 = m1.contiguous();
  } else {
    m2 = m2.contiguous();
  }
}
Collaborator:

Let's separate broadcast logic from the contiguous logic.

Collaborator:

Meanwhile, please help refine is_onednn_mat_strides a little bit.

  dnnl::memory::dims strides = get_onednn_strides(tensor);
  int64_t storage_size = 1;
  for (size_t dim = 0; dim < tensor_dim; ++dim)
    storage_size += (sizes[dim] - 1) * strides[dim];
  if (storage_size < tensor.numel())
    return false;

The above code snippet could be refined as follows.

  if (at::has_internal_overlap(tensor) == at::MemOverlap::Yes) return false;
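
For intuition, both checks reject tensors whose logical elements alias the same storage. A minimal illustration (the function name here is illustrative, not from the PR):

  #include <ATen/ATen.h>
  #include <ATen/MemoryOverlap.h>

  bool overlap_example() {
    // An expand()ed tensor aliases storage: 12 logical elements, 4 stored.
    auto t = at::randn({1, 4}).expand({3, 4});  // stride(0) == 0
    // Manual span check: 1 + (3-1)*0 + (4-1)*1 = 4 < t.numel() == 12.
    // at::has_internal_overlap reaches the same verdict:
    return at::has_internal_overlap(t) == at::MemOverlap::Yes;  // true
  }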

@EikanWang EikanWang added the ciflow/xpu Run XPU CI tasks label Dec 24, 2024
pytorch-bot bot commented Dec 24, 2024

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Dec 24, 2024
@EikanWang EikanWang added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module and ciflow/xpu Run XPU CI tasks labels Dec 24, 2024

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Dec 24, 2024
@jianyizh jianyizh marked this pull request as ready for review December 26, 2024 05:49
@EikanWang EikanWang added ciflow/xpu Run XPU CI tasks and removed module: cpu CPU specific problem (e.g., perf, algorithm) labels Dec 29, 2024

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Dec 29, 2024
@EikanWang EikanWang added the ciflow/xpu Run XPU CI tasks label Dec 29, 2024
@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Dec 29, 2024
@jianyizh (Contributor, Author) commented:

@pytorchbot rebase

@pytorchmergebot (Collaborator):

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator):

Successfully rebased jianyi/broadcast onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout jianyi/broadcast && git pull --rebase)

@jianyizh (Contributor, Author) commented Feb 6, 2025:

@pytorchbot rebase

@pytorchmergebot (Collaborator):

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator):

Successfully rebased jianyi/broadcast onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout jianyi/broadcast && git pull --rebase)

@pytorchmergebot (Collaborator):

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot (Collaborator):

Successfully rebased jianyi/broadcast onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout jianyi/broadcast && git pull --rebase)

@EikanWang (Collaborator) commented:

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 13, 2025
@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator):

Merge failed

Reason: 1 jobs have failed, first few of them are: xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 4, 4, linux.idc.xpu)

Details for Dev Infra team: raised by workflow job.

@EikanWang (Collaborator) commented:

@pytorchbot merge -i

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged while ignoring the following 4 checks: xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 1, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 2, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 3, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 4, 4, linux.idc.xpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Feb 27, 2025
…iguous (#144759)

We should not always call contiguous on the dst of matmul. We have already removed the copy of the matmul input in #143784.

I also fixed an accuracy issue by using the oneDNN sum post-op instead of binary add in the in-place case, to avoid a UT failure.

Pull Request resolved: #144759
Approved by: https://github.com/EikanWang
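
For context on the sum-vs-binary-add distinction mentioned in the commit message, a hedged oneDNN sketch (the function name is illustrative): a sum post-op accumulates into the existing destination buffer, which matches in-place semantics, while a binary add post-op reads a separate tensor:

  #include <oneapi/dnnl/dnnl.hpp>

  // Sketch: configuring a matmul primitive's post-ops.
  dnnl::primitive_attr make_attr_with_sum() {
    dnnl::post_ops po;
    // sum post-op: dst = matmul(src, wei) + 1.0f * dst (in-place accumulate)
    po.append_sum(1.0f);
    // A binary add post-op would instead read a separate tensor, e.g.:
    //   po.append_binary(dnnl::algorithm::binary_add, other_md);
    // where other_md is the memory::desc of that extra input.
    dnnl::primitive_attr attr;
    attr.set_post_ops(po);
    return attr;
  }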
aditew01 pushed a commit that referenced this pull request Feb 28, 2025 (same commit, #144759).
Labels
ciflow/trunk (Trigger trunk jobs on your pull request), ciflow/xpu (Run XPU CI tasks), Merged, module: cpu (CPU specific problem, e.g., perf, algorithm), open source, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Projects
Status: Done

4 participants