[inductor] [cpp] improve cache blocking with CPU info #129348

chunyuan-w · 2024-06-24T03:59:17Z

Stack from ghstack (oldest at bottom):

Description

For the single-thread case, this PR improves cache blocking in the CPP GEMM template using CPU info (the L1 and L2 cache sizes). Mc_blocks and Kc_blocks are calculated so that the following conditions hold:
- size_of_B < L1
- size_of_A < 0.5 * L2
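
As a rough illustration of the two conditions above, the following Python sketch picks the largest block counts that keep the B tile strictly below L1 and the A tile strictly below half of L2. It is not the code from this PR: the function name `cache_blocking`, its parameters, the register block sizes `Mr`/`Nr`/`Kr`, the assumed tile shapes for A and B, and the cache sizes in the usage line are all assumptions made for illustration.

```python
def _max_blocks_below(budget_bytes, bytes_per_block, upper):
    """Largest n <= upper with n * bytes_per_block < budget_bytes.

    Falls back to 1 if even a single block does not fit.
    """
    n = (budget_bytes - 1) // bytes_per_block
    return max(1, min(upper, n))


def cache_blocking(m_blocks, k_blocks, Mr, Nr, Kr, itemsize, l1_bytes, l2_bytes):
    """Hypothetical helper: return (Mc_blocks, Kc_blocks).

    Assumed tile shapes (not taken from the PR):
      B tile: (Kc_blocks * Kr) x Nr elements, kept below L1
      A tile: (Mc_blocks * Mr) x (Kc_blocks * Kr) elements, kept below 0.5 * L2
    """
    # size_of_B < L1: each K block adds Kr * Nr elements to the B tile.
    kc_blocks = _max_blocks_below(l1_bytes, Kr * Nr * itemsize, k_blocks)
    # size_of_A < 0.5 * L2: each M block adds Mr * (Kc_blocks * Kr) elements.
    mc_blocks = _max_blocks_below(
        l2_bytes // 2, Mr * kc_blocks * Kr * itemsize, m_blocks
    )
    return mc_blocks, kc_blocks


# Illustrative numbers only: BF16 (2 bytes), 32x32x32 register blocks,
# 48 KiB L1 and 2 MiB L2.
print(cache_blocking(64, 64, 32, 32, 32, 2, 48 * 1024, 2 * 1024 * 1024))
```

Kc_blocks is chosen first because the A-tile footprint depends on it; with the illustrative numbers above, both block counts end up well below the 64 blocks available in each dimension.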

For the multi-thread case, the task decomposition among threads needs to be tuned together with cache blocking. The cache blocking change is disabled there for now, and a follow-up PR will cover multi-thread optimizations.

Performance

No regressions. Models with > 3% performance speedup are listed below:

BF16 single thread (measured on CPU with AMX support)

static shape

Model Family	Model Name	Speedup
torchbench	detectron2_fasterrcnn_r_101_dc5	4%

dynamic shape

Model Family	Model Name	Speedup
torchbench	detectron2_fasterrcnn_r_101_dc5	4%

FP32 single thread (measured on Ice Lake)

static shape

Model Family	Model Name	Speedup
torchbench	basic_gnn_edgecnn	10%

dynamic shape

Model Family	Model Name	Speedup
torchbench	basic_gnn_edgecnn	10%

Next step

The E2E-level improvement is limited for the following reasons:

- For several HF models, the gemm template kernel improves at the kernel level, but it either remains slower than the ATen kernel (and thus won't be selected during autotune) or only improves from worse than ATen to similar to ATen, so there is no E2E-level performance change.
- For some models, the gemm template kernel gets > 10% performance improvement with this PR, but since the kernel time is only about 3% of the E2E time, there is no significant E2E-level improvement.
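
The second reason is essentially Amdahl's law. A minimal sketch of the arithmetic, assuming that "> 10% performance improvement" means the kernel runs about 1.10x faster and that everything outside the kernel is unchanged (the function name `e2e_speedup` is made up for this example):

```python
def e2e_speedup(kernel_fraction, kernel_speedup):
    """End-to-end speedup when only a fraction of the runtime gets faster."""
    new_time = (1.0 - kernel_fraction) + kernel_fraction / kernel_speedup
    return 1.0 / new_time


# Kernel is about 3% of E2E time and becomes 1.10x faster
# (an assumed reading of "10% improvement").
print(f"{(e2e_speedup(0.03, 1.10) - 1.0) * 100:.2f}% E2E speedup")
```

Under these assumptions the end-to-end gain is roughly 0.3%, well below the 3% threshold used for the tables above.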

We will continue to look for further optimizations in the gemm template kernel in follow-up PRs.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

[ghstack-poisoned]

pytorch-bot · 2024-06-24T03:59:19Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129348

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit b09b641 with merge base 6c2c8ee:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Lint / lintrunner-noclang / linux-job (gh) (trunk failure)
>>> Lint for test/inductor/test_aot_inductor_package.py:
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 1, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
inductor/test_flex_attention.py::TestFlexAttention::test_fw_bw_graph_correctness
trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 1, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
inductor/test_flex_attention.py::TestFlexAttention::test_fw_bw_graph_correctness

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 54c1796 Pull Request resolved: #129348

torch/_inductor/codegen/cpp_micro_gemm.py

This PR improves the cache blocking in CPP GEMM template with the CPU info (the L1 and L2 cache size):
1. Calculate the `Mc_blocks` and `Kc_blocks` based on the below condition:
   - size_of_B < L1
   - size_of_A < 0.5 * L2
2. Use non-temporal tile load `_tile_stream_loadd` for A to keep B in L1.

**TODOs:**
- [ ] Collect benchmark data before and after this change

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 voznesenskym penguinwu EikanWang Guobing-Chen zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

This PR improves the cache blocking in CPP GEMM template with the CPU info (the L1 and L2 cache size). `Mc_blocks` and `Kc_blocks` are calculated based on the below condition:
- size_of_B < L1
- size_of_A < 0.5 * L2

**TODOs:**
- [ ] Collect benchmark data before and after this change

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 voznesenskym penguinwu EikanWang Guobing-Chen zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

ghstack-source-id: 040bbba Pull Request resolved: #129348

ghstack-source-id: f6eb034 Pull Request resolved: #129348

Use non-temporal tile load `_tile_stream_loadd` for A to keep B in L1. Verified AMP static shapes and dynamic shapes on CPU with AMX support and no obvious performance boost (no regression either) at end-to-end level. We're expecting to get performance gain when adding #129348 (also in this ghstack) on top of this PR. Pull Request resolved: #129455 Approved by: https://github.com/jgong5

chunyuan-w · 2024-07-17T08:25:51Z

@pytorchbot rebase

pytorchmergebot · 2024-07-17T08:27:19Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.


ghstack-source-id: c11aef6 Pull Request resolved: #129348

[ghstack-poisoned]

jgong5 · 2024-07-20T06:31:04Z

@pytorchbot merge

pytorchmergebot · 2024-07-20T06:33:22Z

Merge failed

Reason: This PR needs a `release notes:` label
If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`.

If not, please add the `topic: not user facing` label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

jgong5 · 2024-07-20T06:45:14Z

@pytorchbot merge

pytorchmergebot · 2024-07-20T06:48:05Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: pytorch#129348 Approved by: https://github.com/jgong5, https://github.com/jansel ghstack dependencies: pytorch#130675, pytorch#130690

ghstack-source-id: 53015fa Pull Request resolved: pytorch/pytorch#129348

[inductor] [cpp] improve cache blocking with CPU info

b934f8c

[ghstack-poisoned]

pytorch-bot bot added the ciflow/inductor, module: cpu, and module: inductor labels Jun 24, 2024

chunyuan-w marked this pull request as draft June 24, 2024 03:59

pytorchbot added the open source label Jun 24, 2024

chunyuan-w added a commit that referenced this pull request Jun 24, 2024

[inductor] [cpp] improve cache blocking with CPU info

f64de59

ghstack-source-id: 54c1796 Pull Request resolved: #129348

jgong5 reviewed Jun 25, 2024

torch/_inductor/codegen/cpp_micro_gemm.py (outdated)

chunyuan-w mentioned this pull request Jun 25, 2024

[inductor] [cpp] use non-temporal tile load for A #129455

Closed

chunyuan-w added 3 commits June 25, 2024 02:25

chunyuan-w added a commit that referenced this pull request Jul 2, 2024

[inductor] [cpp] improve cache blocking with CPU info

009386d

ghstack-source-id: 040bbba Pull Request resolved: #129348

chunyuan-w added a commit that referenced this pull request Jul 10, 2024

[inductor] [cpp] improve cache blocking with CPU info

1a1d7a5

ghstack-source-id: f6eb034 Pull Request resolved: #129348

chunyuan-w marked this pull request as ready for review July 17, 2024 05:21

chunyuan-w requested a review from jgong5 July 17, 2024 05:21

chunyuan-w added a commit that referenced this pull request Jul 18, 2024

[inductor] [cpp] improve cache blocking with CPU info

e7d2bf5

ghstack-source-id: c11aef6 Pull Request resolved: #129348

jgong5 approved these changes Jul 18, 2024

jgong5 added the ciflow/trunk label Jul 18, 2024

jgong5 requested a review from jansel July 18, 2024 06:56

Update

b09b641

[ghstack-poisoned]

jansel approved these changes Jul 19, 2024

pytorchmergebot added the merging label Jul 20, 2024

pytorchmergebot removed the merging label Jul 20, 2024

jgong5 added the topic: not user facing label Jul 20, 2024

pytorchmergebot added the merging label Jul 20, 2024

pytorchmergebot closed this in a831969 Jul 20, 2024

pytorchmergebot added Merged and removed merging labels Jul 20, 2024

francograndegmailcom pushed a commit to francograndegmailcom/pytorch-pytorch that referenced this pull request Jul 23, 2024

[inductor] [cpp] improve cache blocking with CPU info

bdc6af6

ghstack-source-id: 53015fa Pull Request resolved: pytorch/pytorch#129348

henrylhtsang mentioned this pull request Jul 31, 2024

[BE][typing] fix types in common pruning #132309

Closed

github-actions bot deleted the gh/chunyuan-w/18/head branch August 20, 2024 01:58

jgong5 mentioned this pull request Aug 24, 2024

[RFC] Add Cpp Template for GEMM related ops via max-autotune for Inductor CPU #125683

Open

18 tasks
