🚀 The feature, motivation and pitch
Motivation
Attention is the critical performance bottleneck in current LLM models, and FlexAttention is a good choice for covering the broad range of attention variants in the transformers family of models. With FlexAttention, it is easy for us to enable paged attention and fused SDPA in the transformers repo on the XPU device. It also provides a candidate path for handling attention in LLM ecosystem libraries, e.g., vLLM and SGLang, on the XPU device.
FlexAttention is also a good starting point for maturing the Intel Triton-based GEMM kernels. FlexAttention provides both a flex_attention kernel and a flex_decoding kernel, covering compute-bound and memory-bound GEMM computation, and the different shapes needed to serve LLM inference should also be supported, e.g. head_dim = 64, 96, 128, 256.
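For reference, here is a minimal sketch of how FlexAttention expresses an attention variant (causal masking) through the public torch.nn.attention.flex_attention API, which torch.compile lowers to the Triton template kernels mentioned above. The "xpu" device string is an assumption pending the enablement described in this issue; the same code runs today on CUDA.

```python
# Minimal sketch, not the tracked implementation: causal attention via FlexAttention.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# "xpu" is assumed to work once this feature lands; fall back to CUDA otherwise.
device = "xpu" if torch.xpu.is_available() else "cuda"

B, H, S, D = 2, 8, 1024, 128  # head_dim=128 is one of the shapes listed above
q, k, v = (torch.randn(B, H, S, D, device=device, dtype=torch.float16) for _ in range(3))

# Causal masking expressed as a mask_mod; other variants (ALiBi, sliding window,
# paged attention layouts) only change this small function, not the kernel.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device=device)

# torch.compile lowers this call to the Triton flex_attention template kernel.
compiled_flex_attention = torch.compile(flex_attention)
out = compiled_flex_attention(q, k, v, block_mask=block_mask)
print(out.shape)  # torch.Size([2, 8, 1024, 128])
```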
Our Plan
As you know, FlexAttention is flexible enough to cover all kinds of attention variants, which also means the dependent software stack needs to be robust enough to cooperate with the Triton template kernels. So, landing XPU + FlexAttention in torch-2.8 remains a stretch goal.
PR List
FlexAttention is still under active development and its API is not yet stable.
Alternatives
No response
Additional context
No response
cc @chauhang @penguinwu @zou3519 @ydwu4 @bdhirsh @gujinghui @EikanWang @fengyuan14 @guangyey @Chillee @drisspg @yanboliang @BoyuanFeng