🚀 The feature, motivation and pitch
Motivation
Attention is the critical performance bottleneck in current LLM models, and FlexAttention is a good choice for covering the broad range of attention variants in the transformers family of models. With FlexAttention, it is easy for us to enable paged attention and fused SDPA in the transformers repo on the XPU device. It also provides a candidate path for handling attention in LLM ecosystem libraries, e.g., vLLM and SGLang, on the XPU device.
FlexAttention is also a good starting point for maturing the Intel Triton-based GEMM kernels. FlexAttention provides both a flex_attention kernel and a flex_decoding kernel, covering compute-bound and memory-bound GEMM computation, and the different shapes needed to serve LLM inference should also be supported, e.g. head_dim = 64, 96, 128, 256.
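For reference, here is a minimal sketch of how FlexAttention expresses an attention variant (causal masking) through the public torch.nn.attention.flex_attention API, which torch.compile lowers to the Triton template kernels mentioned above. The "xpu" device string is an assumption pending the enablement described in this issue; the same code runs today on CUDA.

```python
# Minimal sketch, not the tracked implementation: causal attention via FlexAttention.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# "xpu" is assumed to work once this feature lands; fall back to CUDA otherwise.
device = "xpu" if torch.xpu.is_available() else "cuda"

B, H, S, D = 2, 8, 1024, 128  # head_dim=128 is one of the shapes listed above
q, k, v = (torch.randn(B, H, S, D, device=device, dtype=torch.float16) for _ in range(3))

# Causal masking expressed as a mask_mod; other variants (ALiBi, sliding window,
# paged attention layouts) only change this small function, not the kernel.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device=device)

# torch.compile lowers this call to the Triton flex_attention template kernel.
compiled_flex_attention = torch.compile(flex_attention)
out = compiled_flex_attention(q, k, v, block_mask=block_mask)
print(out.shape)  # torch.Size([2, 8, 1024, 128])
```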
Our Plan
As you know, FlexAttention is flexible enough to cover all kinds of attention variants, which also means the dependent software stack needs to be robust enough to cooperate with the Triton template kernels. So, landing XPU + FlexAttention in torch-2.8 remains a stretch goal.
PR List
FlexAttention is still under active development and its API is not yet stable.
Alternatives
No response
Additional context
No response
cc @chauhang @penguinwu @zou3519 @ydwu4 @bdhirsh @gujinghui @EikanWang @fengyuan14 @guangyey @Chillee @drisspg @yanboliang @BoyuanFeng