Improve performance on RoPE (and code around it). · Issue #1597 · NVIDIA/Fuser · GitHub
Improve performance on RoPE (and code around it).  #1597
@wujingyue

Description

I'm separating this from #1502. While we can get rid of cat in some cases, improving nvFuser's codegen for slice and cat will still benefit the RoPE module and the QKV split around it.

https://github.com/NVIDIA/Fuser/blob/bug1597/qkv_split_rope.py has the fusion definition for the forward pass.

Below is a whiteboard illustration for convenience:

[Image: whiteboard illustration]

Legend:

  • S = slice
  • C = concat
  • R = reshape/view
  • T = permute
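For readers without the whiteboard image, here is a minimal pure-Python sketch of where each of the four legend ops typically shows up in a QKV split followed by RoPE. It is not the fusion definition from the linked script: the fused `[q | k | v]` layout, the rotate-half RoPE convention, and the helper names `split_qkv`, `permute_seq_heads`, `rotate_half`, and `apply_rope` are all assumptions made for illustration.

```python
import math

def split_qkv(fused, n_heads, head_dim):
    """S + R: slice an assumed fused [q | k | v] row, then reshape each part."""
    hidden = n_heads * head_dim
    # S: three slices of the fused row (layout is an assumption)
    parts = [fused[i * hidden:(i + 1) * hidden] for i in range(3)]
    # R: view each flat part as [n_heads][head_dim]
    return [
        [p[h * head_dim:(h + 1) * head_dim] for h in range(n_heads)]
        for p in parts
    ]

def permute_seq_heads(x):
    """T: permute [seq][heads][dim] to [heads][seq][dim]."""
    return [list(per_head) for per_head in zip(*x)]

def rotate_half(x):
    """S + C: slice x into two halves, negate the second, concat swapped."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]        # S: two slices
    return [-v for v in x2] + x1       # C: concat

def apply_rope(x, pos, base=10000.0):
    """Rotate-half RoPE on one head vector x at sequence position pos."""
    d = len(x)
    half = d // 2
    angles = [pos * base ** (-2.0 * i / d) for i in range(half)]
    # cos/sin are duplicated across both halves (rotate-half convention)
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    rot = rotate_half(x)
    return [xi * c + ri * s for xi, c, ri, s in zip(x, cos, rot, sin)]

# Usage: split one fused row, then rotate the first query head at position 3.
q, k, v = split_qkv([float(i) for i in range(12)], n_heads=2, head_dim=2)
q0_rotated = apply_rope(q[0], pos=3)
```

Each slice and concat here materializes a new list; the issue is about having nvFuser generate better code for the corresponding tensor ops.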
