[inductor][cpp][gemm] improve large bs perf with better cache blocking #132729
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/132729
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit 37b7b4a with merge base f951fcd.
BROKEN TRUNK - The following job failed but was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
# MxNxK dims respectively. The blockings are separated by comma and the unit is
# the number of register blocks.
# For example, "4,1,10" means 4 register blocks on M, 1 on N and 10 on K respectively.
gemm_cache_blocking = os.environ.get("TORCHINDUCTOR_CPP_GEMM_CACHE_BLOCKING", None)
Please add a test that enables this config via @config.patch(gemm_cache_blocking="...")
Thanks. Added.
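For reference, a minimal sketch of such a test; the test name, shapes, and the exact patch target (`cpp.gemm_cache_blocking` under `torch._inductor.config`) are assumptions rather than the test actually added in this PR:

```python
# Hypothetical sketch of a test that enables the new knob via config.patch.
# A real test would also enable max-autotune so the C++ GEMM template is exercised.
import torch
from torch._inductor import config


@config.patch({"cpp.gemm_cache_blocking": "4,1,10"})  # 4 register blocks on M, 1 on N, 10 on K
def test_gemm_cache_blocking():
    m = torch.nn.Linear(256, 256, bias=False).eval()
    x = torch.randn(8, 256)
    with torch.no_grad():
        torch.testing.assert_close(torch.compile(m)(x), m(x), atol=1e-4, rtol=1e-4)
```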
const int64_t n_size = n_end - n_start;
const int64_t nc_block_end = std::min(nc + Nc_blocks, n_block_end); // FIXME: maybe exceeding N?
Since we always do `padded_n`, I guess it will not exceed N. Maybe we can remove this FIXME.
Good point. Updated the comment.
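A small sketch of why the padding argument holds (hypothetical names, just to make the bound explicit):

```python
# Illustrative sketch: block indices are derived from the padded N, so
# min(nc + Nc_blocks, n_block_end) can cover columns beyond the logical N
# but never beyond the padded (allocated) range.
import math

N, Nr, Nc_blocks = 1000, 16, 6
padded_n = math.ceil(N / Nr) * Nr          # 1008: N rounded up to a multiple of Nr
n_block_end = padded_n // Nr               # 63 register blocks in total

for nc in range(0, n_block_end, Nc_blocks):
    nc_block_end = min(nc + Nc_blocks, n_block_end)
    assert nc_block_end * Nr <= padded_n   # stays inside the padded buffer
```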
# The following is the solution for 4*Mc*Nc + Mc*Kc_bytes = L2,
# assuming Mc == Nc for good data reuse.
M_max = (math.sqrt(Kc_bytes * Kc_bytes + 16 * L2) - Kc_bytes) / 8
if M_max < Mc_blocks * M0:
Should it be an `assert` theoretically? But considering we use some approximated calculation, I guess it should be fine.
It is not an assertion. When it is false, we use the default one, which is `Mt_blocks`.
ghstack-source-id: c108154 Pull Request resolved: pytorch#132729
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
#132730) Pull Request resolved: #132730 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel ghstack dependencies: #132729
…ing etc. (#133312) Indent the template instructions separately from the generated code, for readability. Also, rename M0,N0,K0 to Mr,Nr,Kr ("r" meaning "register") for consistent naming. Pull Request resolved: #133312 Approved by: https://github.com/Skylion007, https://github.com/leslie-fang-intel ghstack dependencies: #132729, #132730
…) (#135438) Fix #134686. PR #132729 makes the GEMM template faster for one of the GEMMs in xcit_large_24_p8_224:
SingleProcess AUTOTUNE benchmarking takes 1.7088 seconds and 1.9207 seconds precompiling
AUTOTUNE linear_unary(12544x3072, 768x3072, 768)
  cpp_packed_gemm_2 2.9371 ms 100.0%
  _linear_pointwise 3.1584 ms 93.0%
But it is slower than ATen in the e2e run due to different cache behavior. The access to the input data (12544x3072) is LLC latency bound and bottlenecks arise from memory synchronization (data transfers and coherence updates across processors). This PR tries to mitigate the problem by cooperatively loading different chunks of the input data from the different processors that share it. Pull Request resolved: #135438 Approved by: https://github.com/leslie-fang-intel
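As a rough illustration of the cooperative-loading idea (a schematic, hypothetical sketch; it is not the partitioning code in #135438):

```python
# Schematic: threads that share the same rows of the input A each copy/prefetch
# a *different* K-chunk of that shared block, instead of every thread streaming
# the whole block. Pure illustration, single-threaded here.
def cooperative_chunk(K, num_sharing_threads, tid):
    """Return the [start, end) K-range that thread `tid` loads on behalf of the
    group of threads sharing this block of A."""
    chunk = (K + num_sharing_threads - 1) // num_sharing_threads
    start = tid * chunk
    return start, min(K, start + chunk)


K, threads = 3072, 4
print([cooperative_chunk(K, threads, t) for t in range(threads)])
# [(0, 768), (768, 1536), (1536, 2304), (2304, 3072)]
```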
pytorch#132729) Improve the cache blocking by reducing Mc_blocks to make A reside in L2 and be reused by B as much as possible. This improves large bs perf for both scenarios: 1) N is large and K is of medium size; 2) K is large. Different strategies are used to handle these scenarios; check the notes in `get_cache_blocking` in the changes.

Measured with 56-core Intel(R) Xeon(R) CPU Max 9480, jemalloc 5.1 and Intel OMP, bf16. Run with code cache of B matrix (weights).

Model Shapes | Before Optimization | After Optimization | Speedup | onednn linear | Speedup over onednn
-- | -- | -- | -- | -- | --
M=1024, N=12288, K=4096 (Llama2-8b) | 5.69 ms | 3.71 ms | 1.53 | 4.53 ms | 1.22
M=1024, N=4096, K=4096 (Llama2-8b) | 1.69 ms | 1.63 ms | 1.04 | 2.05 ms | 1.26
M=1024, N=22016, K=4096 (Llama2-8b) | 10.32 ms | 6.57 ms | 1.57 | 8.46 ms | 1.29
M=1024, N=4096, K=11008 (Llama2-8b) | 5.21 ms | 3.26 ms | 1.60 | 4.65 ms | 1.43
M=1024, N=5120, K=4096 (Llama3-8b) | 1.99 ms | 1.78 ms | 1.12 | 2.31 ms | 1.30
M=1024, N=28672, K=4096 (Llama3-8b) | 13.41 ms | 8.56 ms | 1.57 | 10.96 ms | 1.28
M=1024, N=4096, K=14336 (Llama3-8b) | 6.93 ms | 4.31 ms | 1.61 | 6.24 ms | 1.45

Pull Request resolved: pytorch#132729 Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/jansel
Stack from ghstack (oldest at bottom):
Improve the cache blocking by reducing Mc_blocks to make A reside in L2 and be reused by B as much as possible. This improves large bs perf for both scenarios: 1) N is large and K is of medium size; 2) K is large. Different strategies are used to handle these scenarios; check the notes in `get_cache_blocking` in the changes.

Measured with 56-core Intel(R) Xeon(R) CPU Max 9480, jemalloc 5.1 and Intel OMP, bf16. Run with code cache of B matrix (weights).
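To illustrate the strategy at a high level, here is a simplified sketch under assumed cache sizes; the function, parameter names, and heuristics are illustrative only and not the actual `get_cache_blocking` implementation:

```python
import math


def sketch_cache_blocking(M, N, K, Mr, Nr, Kr, dtype_bytes=2,
                          L1=48 * 1024, L2=2 * 1024 * 1024):
    """Simplified sketch: choose Kc so an Nr x (Kc*Kr) panel of B stays in L1,
    then shrink Mc (== Nc) so the A tile fits the assumed L2 budget and gets
    reused across many columns of B. Not the real get_cache_blocking."""
    # Number of Kr-sized register blocks of B that fit in L1 for one Nr-wide panel.
    Kc = min(math.ceil(K / Kr), max(1, L1 // (Nr * Kr * dtype_bytes)))
    Kc_bytes = Kc * Kr * dtype_bytes
    # Positive root of 4*Mc*Nc + Mc*Kc_bytes = L2 with Mc == Nc
    # (same expression as the note quoted in the diff above).
    M_max = (math.sqrt(Kc_bytes * Kc_bytes + 16 * L2) - Kc_bytes) / 8
    Mc = max(1, min(math.ceil(M / Mr), int(M_max // Mr)))
    Nc = min(Mc, math.ceil(N / Nr))  # square-ish blocks for good data reuse
    return Mc, Nc, Kc


# e.g. one of the Llama2 shapes from the table, with made-up register-block sizes
print(sketch_cache_blocking(M=1024, N=12288, K=4096, Mr=32, Nr=32, Kr=1))
```

The point mirrored from the description is that Mc is capped by the L2-derived bound rather than by the per-thread Mt_blocks, so the A tile stays cache-resident while B streams through it.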
cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang