Upgrade submodule oneDNN to v3.7 #147498
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147498
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Unrelated Failure
As of commit 1e329e1 with merge base 90e3a3d.
NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Looks good to me!
LGTM
@atalman has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
lgtm
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Rebase failed due to
Raised by https://github.com/pytorch/pytorch/actions/runs/13523596472
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Rebase failed due to
Raised by https://github.com/pytorch/pytorch/actions/runs/13524908106
Force-pushed from a024a93 to 1e329e1.
@xuhancn I have rebased onto main manually.
This reverts commit e72b4c6.
This reverts commit 576ed1e. Reverted #147498 on behalf of https://github.com/wdvr due to failing some tests on trunk (see the comment below).
The Windows CI failure is root-caused to an MSVC 19.29.30158.0 bug that breaks the binary primitive constructor in C++17 mode. PR uxlfoundation/oneDNN#2783 with a workaround is in validation. We'll tag oneDNN v3.7.1 as soon as it lands.
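For context, that constructor is exercised whenever a `dnnl::binary` primitive is created. Below is a minimal sketch of the code path against the oneDNN v3.x C++ API, written for this note as an element-wise f32 add on CPU; it is only an illustration of the affected constructor, not the actual failing PyTorch code.

```cpp
// Minimal sketch: constructing and running a oneDNN binary (element-wise add)
// primitive with the v3.x C++ API. The dnnl::binary(pd) constructor is the
// step reported to be miscompiled by MSVC 19.29 in C++17 mode.
#include <oneapi/dnnl/dnnl.hpp>

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Two f32 inputs and one output, all with the same 2x3x4x5 shape.
    memory::desc md({2, 3, 4, 5}, memory::data_type::f32, memory::format_tag::nchw);
    memory src0(md, eng), src1(md, eng), dst(md, eng);

    // Primitive descriptor (engine-first constructor, as in oneDNN v3.x),
    // followed by the binary primitive constructor itself.
    auto pd = binary::primitive_desc(eng, algorithm::binary_add, md, md, md);
    auto add = binary(pd);

    add.execute(strm, {{DNNL_ARG_SRC_0, src0},
                       {DNNL_ARG_SRC_1, src1},
                       {DNNL_ARG_DST, dst}});
    strm.wait();
    return 0;
}
```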
Motivation
===
This PR is part of the oneDNN upstreaming plan, as stated in #114848 (comment). SDPA support is provided via the overridable variant on the XPU backend. Besides the added `Attention.cpp` file, `Graph.h` is added to hold utilities for the oneDNN graph, including those for kernel/compiled-graph caching. In addition, a selection of test cases in `test/test_transformers.py` is copied into the new `test/xpu/test_transformers.py` and modified accordingly to provide additional tests beyond `./third_party/torch-xpu-ops/test/xpu/test_ops_xpu.py`.

Depends on the oneDNN v3.7 upgrade in #147498.
Depends on the BUILD_GRAPH switch in #147608.

Pull Request resolved: #147614
Approved by: https://github.com/jansel, https://github.com/EikanWang
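The kernel/compiled-graph caching mentioned above amounts to a lookup keyed on the attention configuration, so graph compilation happens once per shape/dtype combination. The sketch below illustrates that idea in plain C++ under invented names (`GraphCacheKey`, `CompiledGraphCache`, `CompiledGraph`); it is a hypothetical example, not the actual code in `Graph.h`.

```cpp
// Hypothetical sketch of a compiled-graph cache keyed on the attention
// configuration. All names are invented for illustration; they are not the
// utilities actually defined in Graph.h.
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

struct GraphCacheKey {
    std::vector<int64_t> shape;  // e.g. {batch, heads, seq_len, head_dim}
    int dtype = 0;               // encoded input data type
    bool is_causal = false;      // attention variant flag

    bool operator==(const GraphCacheKey& o) const {
        return shape == o.shape && dtype == o.dtype && is_causal == o.is_causal;
    }
};

struct GraphCacheKeyHash {
    size_t operator()(const GraphCacheKey& k) const {
        size_t h = std::hash<int>()(k.dtype) ^ (k.is_causal ? 0x9e3779b9u : 0u);
        for (int64_t d : k.shape) h = h * 31 + std::hash<int64_t>()(d);
        return h;
    }
};

// Stand-in for a compiled oneDNN graph partition.
struct CompiledGraph {};

class CompiledGraphCache {
 public:
    // Returns the cached compiled graph for `key`, building it on first use.
    std::shared_ptr<CompiledGraph> get_or_create(
            const GraphCacheKey& key,
            const std::function<std::shared_ptr<CompiledGraph>()>& build) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;  // cache hit: reuse
        auto compiled = build();                    // cache miss: compile once
        cache_.emplace(key, compiled);
        return compiled;
    }

 private:
    std::mutex mutex_;
    std::unordered_map<GraphCacheKey, std::shared_ptr<CompiledGraph>, GraphCacheKeyHash> cache_;
};

int main() {
    CompiledGraphCache cache;
    GraphCacheKey key{{1, 16, 128, 64}, /*dtype=*/1, /*is_causal=*/true};
    // The builder lambda runs only the first time this key is seen.
    auto g1 = cache.get_or_create(key, [] { return std::make_shared<CompiledGraph>(); });
    auto g2 = cache.get_or_create(key, [] { return std::make_shared<CompiledGraph>(); });
    return g1 == g2 ? 0 : 1;  // same cached instance on the second lookup
}
```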
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
This PR upgrades the oneDNN submodule to v3.7.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids); a minimal matmul sketch follows this description.
- Improved performance of the int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data types on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on the Xe3 architecture.
- Added bfloat16 support for SDPA and implemented fp16 and bf16 GEMM kernels in SDPA.
- Fixed f16 matmul accuracy, SDPA not being dispatched to the ukernel, bf16/fp16/fp32 convolution performance, an int8 kernel page fault, a deconvolution precision issue on complex128 and fp64, and a GEMM correctness issue in float16.
- Improved bf16 matmul performance with an fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.

Fixes #136348.

## Validation results on CPU

Result charts are attached for:
1. NLP models accuracy/inference/training
2. Torchbench CPU userbenchmark inference & training
3. Inductor quantization
4. Dynamo benchmarks

## Validation results on XPU

Accuracy is the same as baseline; performance charts are attached.

## Validation results on ARM

Result charts are attached.

Pull Request resolved: #147498
Approved by: https://github.com/fadara01, https://github.com/mingfeima, https://github.com/atalman
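As a concrete reference for the matmul items in the Improvements list above, here is a minimal sketch of a bf16 x bf16 -> fp32 matmul using the oneDNN v3.x C++ API. The shapes and layouts are arbitrary choices for illustration; this is not code from the PR, and on CPUs without bf16 support the primitive descriptor creation may throw as unimplemented.

```cpp
// Minimal sketch: bf16 x bf16 -> fp32 matmul with the oneDNN v3.x C++ API.
// oneDNN picks AMX/AVX-512 bf16 kernels at runtime when the CPU supports them.
#include <oneapi/dnnl/dnnl.hpp>

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    const memory::dim M = 64, K = 256, N = 128;
    memory::desc src_md({M, K}, memory::data_type::bf16, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::bf16, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    memory src(src_md, eng), wei(wei_md, eng), dst(dst_md, eng);

    // Engine-first primitive_desc constructor, as in oneDNN v3.x.
    // May throw on CPUs where bf16 matmul is not implemented.
    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md);
    auto mm = matmul(pd);

    mm.execute(strm, {{DNNL_ARG_SRC, src},
                      {DNNL_ARG_WEIGHTS, wei},
                      {DNNL_ARG_DST, dst}});
    strm.wait();
    return 0;
}
```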
cc @gujinghui @PenghuiCheng @XiaobingSuper @jianyuh @jgong5 @mingfeima @sanchitintel @ashokei @jingxu10 @min-jean-cho @Guobing-Chen @Xia-Weiwen @snadampal @malfet @milpuz01 @aditew01 @nikhil-arm @fadara01 @voznesenskym @penguinwu @EikanWang @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @yf225