Upgrade submodule oneDNN to v3.7 · Pull Request #147498 · pytorch/pytorch


Closed

Conversation

@yanbing-j (Collaborator) commented Feb 20, 2025:

This PR upgrades the oneDNN submodule to v3.7.

Improvements

  • Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
  • Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
  • Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
  • Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
  • Added bfloat16 support for SDPA and implemented fp16 and bf16 GEMM kernels in SDPA (see the sketch after this list).
  • Fixed f16 matmul accuracy, an issue where SDPA could not be dispatched to the ukernel, bf16/fp16/fp32 convolution performance regressions, an INT8 kernel page fault, a deconvolution precision issue on complex128 and fp64, and a gemm correctness issue in float16.
  • Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
  • Improved bf16 to fp32 reorder performance.
  • Improved bf16 reorder performance.
  • Improved bf16 convolution with ACL.
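
As a rough illustration of the SDPA coverage mentioned above, the sketch below exercises the bf16 path from the PyTorch side (a minimal sketch, not code from this PR; whether oneDNN's kernel is actually dispatched depends on the build, platform, and input shapes):

```python
import torch
import torch.nn.functional as F

# Query/key/value in bfloat16; shape is (batch, heads, seq_len, head_dim).
q = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16)
k = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16)
v = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16)

# On CPU builds with oneDNN, this may route to the bf16 SDPA kernels
# added in v3.7; the dispatch decision is an internal detail and not guaranteed.
out = F.scaled_dot_product_attention(q, k, v)
print(out.dtype, out.shape)  # torch.bfloat16 torch.Size([1, 8, 128, 64])
```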

Fixes #136348.
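
For anyone reproducing the validation below, one way to confirm that a local build actually picked up the new submodule is to inspect the build configuration string (a sketch; the exact wording of the version line varies across PyTorch builds):

```python
import torch

# "mkldnn" is PyTorch's legacy name for the oneDNN integration.
assert torch.backends.mkldnn.is_available()

# torch.__config__.show() includes the bundled oneDNN/MKL-DNN version;
# after this PR it should report v3.7.
for line in torch.__config__.show().splitlines():
    if "oneDNN" in line or "MKL-DNN" in line:
        print(line)
```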

Validation results on CPU

1. NLP models accuracy/inference/training

![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks

![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

Validation results on XPU

Accuracy is the same as the baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

Validation results on ARM

![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

cc @gujinghui @PenghuiCheng @XiaobingSuper @jianyuh @jgong5 @mingfeima @sanchitintel @ashokei @jingxu10 @min-jean-cho @Guobing-Chen @Xia-Weiwen @snadampal @malfet @milpuz01 @aditew01 @nikhil-arm @fadara01 @voznesenskym @penguinwu @EikanWang @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @yf225

pytorch-bot commented Feb 20, 2025:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147498

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit 1e329e1 with merge base 90e3a3d:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the ciflow/inductor, ciflow/linux-aarch64, module: inductor, module: mkldnn, and topic: not user facing labels on Feb 20, 2025
@yanbing-j added the ciflow/trunk, ciflow/xpu, module: arm, and intel labels on Feb 20, 2025
@yanbing-j yanbing-j requested a review from fadara01 February 20, 2025 03:12
@fadara01 (Collaborator) left a comment:


Looks good to me!

@mingfeima (Collaborator) left a comment:


LGTM

@facebook-github-bot (Contributor) commented:

@atalman has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@atalman (Contributor) left a comment:


lgtm

@atalman (Contributor) commented Feb 24, 2025:

@pytorchmergebot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@xuhancn (Collaborator) commented Feb 25, 2025:

@pytorchbot rebase -b main

@pytorchmergebot (Collaborator) commented:

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot (Collaborator) commented:

Rebase failed due to

Aborting rebase because rebasing the branch resulted in the same sha as the target branch.
This usually happens because the PR has already been merged.  Please rebase locally and push.

Raised by https://github.com/pytorch/pytorch/actions/runs/13523596472

@xuhancn (Collaborator) commented Feb 25, 2025:

@pytorchbot rebase -b main

@pytorch pytorch deleted a comment from pytorchmergebot Feb 25, 2025
@pytorch pytorch deleted a comment from pytorchmergebot Feb 25, 2025
@pytorchmergebot (Collaborator) commented:

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot (Collaborator) commented:

Rebase failed due to

Aborting rebase because rebasing the branch resulted in the same sha as the target branch.
This usually happens because the PR has already been merged.  Please rebase locally and push.

Raised by https://github.com/pytorch/pytorch/actions/runs/13524908106

@yanbing-j force-pushed the yanbing/upgrade_onednn_v3.7 branch from a024a93 to 1e329e1 on February 26, 2025 00:49
@yanbing-j (Collaborator, Author) commented:

@xuhancn I have rebased to main manually.

xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Feb 26, 2025
DDEle added a commit to DDEle/pytorch that referenced this pull request Feb 26, 2025
DDEle added a commit to DDEle/pytorch that referenced this pull request Feb 27, 2025
aditew01 pushed a commit that referenced this pull request Feb 28, 2025
aditew01 pushed a commit that referenced this pull request Feb 28, 2025
This reverts commit 576ed1e.

Reverted #147498 on behalf of https://github.com/wdvr due to failing some tests on trunk (see the comment below).
aditew01 pushed a commit that referenced this pull request Feb 28, 2025
@vpirogov commented:

The Windows CI failure is root-caused to an MSVC 19.29.30158.0 bug that breaks the binary primitive constructor in C++17 mode. PR uxlfoundation/oneDNN#2783 with a workaround is in validation. We'll tag oneDNN v3.7.1 as soon as it lands.

pytorch-bot bot pushed a commit that referenced this pull request Mar 3, 2025
pytorchmergebot pushed a commit that referenced this pull request Mar 4, 2025
Motivation
===

This PR is part of the oneDNN upstreaming plan stated in #114848. SDPA support is provided via an overridable variant on the XPU backend. Besides the added `Attention.cpp` file, `Graph.h` is added to hold utilities for the oneDNN graph, including those for kernel/compiled-graph caching. In addition, a selection of test cases in `test/test_transformers.py` is copied into the new `test/xpu/test_transformers.py` and modified accordingly to provide additional tests beyond `./third_party/torch-xpu-ops/test/xpu/test_ops_xpu.py`.

Depends on OneDNN version v3.7 upgrade in #147498
Depends on BUILD_GRAPH switch in #147608

Pull Request resolved: #147614
Approved by: https://github.com/jansel, https://github.com/EikanWang
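
As a hedged illustration of what the XPU SDPA path in #147614 enables at the Python level (a sketch assuming a recent PyTorch build with XPU support; `torch.xpu` may be absent in older builds, and dispatch to the oneDNN graph kernels is an internal detail):

```python
import torch
import torch.nn.functional as F

# Runs only on builds with an available XPU device.
if torch.xpu.is_available():
    q, k, v = (torch.randn(1, 8, 128, 64, device="xpu", dtype=torch.float16)
               for _ in range(3))
    out = F.scaled_dot_product_attention(q, k, v)
    print(out.shape)  # torch.Size([1, 8, 128, 64])
```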
pytorchmergebot pushed a commit to min-jean-cho/pytorch that referenced this pull request Mar 5, 2025
github-actions bot (Contributor) commented:

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Apr 29, 2025
@yanbing-j yanbing-j closed this Apr 30, 2025
Labels
ci-no-td (Do not run TD on this PR) · ciflow/binaries (Trigger all binary build and upload jobs on the PR) · ciflow/inductor · ciflow/linux-aarch64 (linux aarch64 CI workflow) · ciflow/trunk (Trigger trunk jobs on your pull request) · ciflow/xpu (Run XPU CI tasks) · intel (This tag is for PR from Intel) · Merged · module: arm (Related to ARM architectures builds of PyTorch. Includes Apple M1) · module: inductor · module: mkldnn (Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration) · open source · Reverted · Stale · topic: not user facing (topic category)
Development

Successfully merging this pull request may close these issues.

[channels_last] Segmentation fault with aten.convolution
10 participants