[RFC] Add Cpp Template for GEMM related ops via max-autotune for Inductor CPU #125683

jgong5 · 2024-05-07T14:39:18Z

🚀 The feature, motivation and pitch

Motivation

torch.compile provides a "max-autotune" mode. For CUDA, the Inductor backend uses online benchmark results to select the best-performing kernel from several candidates, including ATen kernels and template-based kernels implemented with Triton and CUTLASS. These kernels are primarily designed to accelerate GEMM-related operations. For CPU, however, this "max-autotune" mechanism is not yet supported, and only ATen kernels are currently used.

This RFC proposes the introduction of similar template-based code generation support for GEMM-related operations on CPUs, implemented with C++ and activated through the "max-autotune" mode of torch.compile. By utilizing the autotuning mechanism of Inductor, users are expected to achieve enhanced performance for GEMM-related operations beyond the capabilities of ATen-based implementations.
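The selection step described above (benchmark every candidate, keep the fastest) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Inductor's implementation: `benchmark`, `autotune_select`, and the candidate names are hypothetical and do not correspond to any Inductor API.

```python
import time
from typing import Callable, Dict, Tuple


def benchmark(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs of `fn`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def autotune_select(
    choices: Dict[str, Callable[[], object]],
    timer: Callable[[Callable[[], object]], float] = benchmark,
) -> Tuple[str, Dict[str, float]]:
    """Time every candidate kernel and return the fastest name plus all timings.

    `choices` maps a hypothetical candidate name (for example an ATen kernel
    or a generated C++ template kernel) to a zero-argument callable.
    """
    timings = {name: timer(fn) for name, fn in choices.items()}
    best_name = min(timings, key=timings.get)
    return best_name, timings
```

The `timer` argument is injectable so that the selection logic can be exercised with deterministic fake timings instead of real measurements.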

Approaches

At a high level, the autotuning and template infrastructure from CUDA is mature enough to be adapted for CPU usage. We plan to extend the existing autotuning code to support CPU and to develop the C++ template abstraction by referencing its CUTLASS template counterpart. In addition, CPU-specific challenges need to be addressed for optimal performance: thread decomposition, data layout arrangement (e.g., weight prepacking), and data blocking at the various levels of the memory hierarchy. Based on our previous experience, we employ a two-level abstraction to implement GEMMs: an outer loop that manages thread decomposition and cache blocking, and an inner micro-kernel that handles register blocking and various CPU architecture-specific optimizations. This approach allows flexible performance tuning at multiple levels and direct use of low-level CPU hardware acceleration.
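To make the two-level structure concrete, here is a minimal pure-Python sketch. It assumes row-major lists of lists and made-up block sizes (`mc`, `nc`, `kc` for cache blocks, `mr`, `nr` for register tiles); the function names are hypothetical and the code says nothing about how the generated C++ actually looks. Thread decomposition is left out: in a real kernel the outermost `mb`/`nb` loops are what would be split across threads.

```python
from typing import List

Matrix = List[List[float]]


def micro_kernel(a: Matrix, b: Matrix, c: Matrix,
                 m0: int, n0: int, k0: int,
                 mr: int, nr: int, kc: int) -> None:
    """Inner level: accumulate an mr x nr tile of C over kc steps of K.

    In a real micro-kernel the `acc` tile would be held in vector registers.
    """
    acc = [[0.0] * nr for _ in range(mr)]
    for k in range(k0, k0 + kc):
        for i in range(mr):
            a_ik = a[m0 + i][k]
            for j in range(nr):
                acc[i][j] += a_ik * b[k][n0 + j]
    for i in range(mr):
        for j in range(nr):
            c[m0 + i][n0 + j] += acc[i][j]


def blocked_gemm(a: Matrix, b: Matrix,
                 mc: int = 4, nc: int = 4, kc: int = 4,
                 mr: int = 2, nr: int = 2) -> Matrix:
    """Outer level: walk cache-sized blocks and call the micro-kernel per tile."""
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for mb in range(0, m, mc):              # cache block over M
        m_end = min(mb + mc, m)
        for nb in range(0, n, nc):          # cache block over N
            n_end = min(nb + nc, n)
            for kb in range(0, k, kc):      # cache block over K
                k_len = min(kc, k - kb)
                for m0 in range(mb, m_end, mr):      # register tile rows
                    for n0 in range(nb, n_end, nr):  # register tile columns
                        micro_kernel(a, b, c, m0, n0, kb,
                                     min(mr, m_end - m0),
                                     min(nr, n_end - n0),
                                     k_len)
    return c
```

Separating the two levels this way means block sizes can be tuned in the outer function without touching the tile computation, and the micro-kernel can be swapped per instruction set without touching the loop nest.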

Key Components

Autotune Infrastructure for CPU: Generalizing and extending BenchmarkRequest with CPU support and Cpp module loader.
Cpp Template Infrastructure: Template abstractions similar to those of the CUTLASS template, such as CppTemplate, CppTemplateKernel, and CppTemplateBuffer, together with the MicroGemm micro-kernel abstraction that Cpp GEMM templates can use.
Micro Kernel Templates: Responsible for register blocking, instruction selection, and other CPU architecture-specific optimizations.
Cpp Templates: Various GEMM-related Cpp templates (single GEMM, weight-only quantized GEMM, attention, MLP, etc.) that are responsible for thread decomposition, cache blocking, and outer-loop scheduling that calls into micro-kernels. Packed GEMM support is included.
Epilogue Fusion: Requires support from Cpp templates, micro-kernel templates, and Cpp kernels.
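As a rough illustration of what epilogue fusion means here, the sketch below applies a pointwise epilogue (bias add followed by ReLU) to each output element as soon as its reduction over K is complete, rather than in a separate pass over the output. The names `gemm_with_epilogue` and `bias_relu` and the tile sizes are hypothetical; the RFC does not specify the fusion interface.

```python
from typing import Callable, List

Matrix = List[List[float]]


def gemm_with_epilogue(a: Matrix, b: Matrix,
                       epilogue: Callable[[float, int], float],
                       tile_m: int = 2, tile_n: int = 2) -> Matrix:
    """Compute epilogue(A @ B) tile by tile, fusing the epilogue into the GEMM."""
    m, k, n = len(a), len(a[0]), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, tile_m):
        for n0 in range(0, n, tile_n):
            for i in range(m0, min(m0 + tile_m, m)):
                for j in range(n0, min(n0 + tile_n, n)):
                    acc = 0.0
                    for p in range(k):
                        acc += a[i][p] * b[p][j]
                    # Fused epilogue: applied once the full K reduction is
                    # done, before the value is written to the output.
                    out[i][j] = epilogue(acc, j)
    return out


def bias_relu(bias: List[float]) -> Callable[[float, int], float]:
    """Build an epilogue that adds a per-column bias and then applies ReLU."""
    def epilogue(value: float, col: int) -> float:
        return max(value + bias[col], 0.0)
    return epilogue
```

Note that the epilogue may only run after the complete K reduction for an element, which is why fusion needs cooperation between the outer template (which knows when a tile is final) and the micro-kernel.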

Task Breakdowns

1. Autotune Infrastructure for CPU ([inductor] autotune benchmark support for cpu #125159)
2. Cpp Template Infrastructure ([inductor][cpp] GEMM template (infra and fp32) #124021)

3. Micro Kernel Templates

3.1 General FP32/BF16/FP16 MicroGemm based on ATen VEC ([inductor][cpp] GEMM template (infra and fp32) #124021 etc.)
3.2 BF16 AMX MicroGemm for x86 ([inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion #126068 etc.)
3.3 FP16 AMX MicroGemm for x86
3.4 INT8 AMX MicroGemm for x86 ([Inductor][CPP] Enable Quantized Linear with AMX MicroGEMM #129220)
3.5 INT8 Weight-quantized MicroGemm for x86 (Inductor-CPU WoQ int8 GEMM micro-kernel with scale epilogue #131887)
3.6 INT4 Weight-quantized MicroGemm for x86
3.7 MicroGemms for ARM

Cpp Template

6. Performance Tuning (ongoing work)

6.1 Thread blocking optimization ([inductor][cpp][gemm] improve thread blocking heuristics #131024, [inductor][cpp][gemm] support k slicing for static shapes #130821 etc.)
6.2 Cache blocking optimization ([inductor] [cpp] improve cache blocking with CPU info #129348, [inductor][cpp][gemm] improve large bs perf with better cache blocking #132729, [inductor] [cpp] use non-temporal tile load for A #129455 etc.)
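For readers unfamiliar with thread blocking, the toy heuristic below shows the kind of decision involved: given the number of blocks along M and N and a thread count, pick a 2D thread grid that minimizes the largest number of blocks any single thread must process. This is a hypothetical illustration only. It is not the heuristic from the PRs listed above, and it ignores K slicing and cache effects.

```python
import math
from typing import Optional, Tuple


def thread_grid(m_blocks: int, n_blocks: int, num_threads: int) -> Tuple[int, int]:
    """Pick (threads_m, threads_n) with threads_m * threads_n == num_threads.

    The cost of a grid is the largest number of blocks assigned to one thread;
    the first grid with the lowest cost wins.
    """
    best: Optional[Tuple[int, int]] = None
    best_cost: Optional[int] = None
    for threads_m in range(1, num_threads + 1):
        if num_threads % threads_m != 0:
            continue  # only exact factorizations of the thread count
        threads_n = num_threads // threads_m
        cost = math.ceil(m_blocks / threads_m) * math.ceil(n_blocks / threads_n)
        if best_cost is None or cost < best_cost:
            best, best_cost = (threads_m, threads_n), cost
    assert best is not None  # threads_m == 1 is always a valid factorization
    return best
```

With a tall problem (many M blocks, one N block) this puts all threads along M, and with a wide problem it puts them along N.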

Alternatives

No response

Additional context

No response

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire

This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC #125683 for more background info.

1. Cpp template infrastructure

Template abstractions similar to those of the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, and the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.

2. Initial FP32 gemm template

This involves a GEMM template implementation, `CppPackedGemmTemplate`, that supports GEMM with a constant weight (`B`). It requires `N` to be a multiple of the register blocking size while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). It then invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls, implemented with the ATen vec abstraction.

3. Correctness and performance

The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including kernel optimizations as well as fusions. Compared to the ATen kernels, which are implemented with MKL, perf gains are observed only on a select number of models. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Details are below.

Static shapes

| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12x
cait_m36_384: 1.18x

Dynamic shapes

| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x

cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang
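The requirement in the PR description that `N` be a multiple of the register blocking size, together with prepacking of the constant `B` matrix, can be illustrated with a small pure-Python sketch. The layout chosen here (one contiguous K x nr panel per group of `nr` columns) is an assumption made for illustration; the actual packed layout used by `CppPackedGemmTemplate` is not described in the text, and `prepack_b` and `gemm_packed` are hypothetical names.

```python
from typing import List

Matrix = List[List[float]]


def prepack_b(b: Matrix, nr: int) -> List[List[float]]:
    """Repack a K x N weight into N // nr panels, each a flat K x nr buffer.

    Each panel holds the nr columns a micro-kernel tile reads, stored
    contiguously so the inner loop walks memory sequentially.
    """
    n = len(b[0])
    if n % nr != 0:
        raise ValueError("N must be a multiple of the register block size nr")
    packed = []
    for n0 in range(0, n, nr):
        panel: List[float] = []
        for row in b:
            panel.extend(row[n0:n0 + nr])
        packed.append(panel)
    return packed


def gemm_packed(a: Matrix, packed: List[List[float]], nr: int) -> Matrix:
    """Multiply A (M x K) by a weight that was prepacked with `prepack_b`."""
    m, k = len(a), len(a[0])
    n = len(packed) * nr
    c = [[0.0] * n for _ in range(m)]
    for panel_idx, panel in enumerate(packed):
        for i in range(m):
            for p in range(k):
                a_ip = a[i][p]
                base = p * nr  # start of row p inside this panel
                for j in range(nr):
                    c[i][panel_idx * nr + j] += a_ip * panel[base + j]
    return c
```

Because the weight is constant at inference time, the packing cost is paid once, while `M` can still vary from call to call.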