Making Mamba first-class citizen in PyTorch #120189

Open
yanboliang opened this issue Feb 19, 2024 · 5 comments
Labels
- module: higher order operators (torch.cond and similar)
- module: pt2-dispatcher (PT2 dispatcher-related issues, e.g. aotdispatch, functionalization, faketensor, custom-op)
- needs research (We need to decide whether or not this merits inclusion, based on research world)
- oncall: pt2
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@yanboliang (Contributor) commented Feb 19, 2024

🚀 The feature, motivation and pitch

Mamba is a new SSM (State Space Model) architecture developed to address Transformers' computational inefficiency on long sequences. It has attracted increasing attention recently due to its faster inference and linear scaling in sequence length. We are exploring how to support Mamba as a first-class citizen in PyTorch.
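
For context, the core computation is a diagonal, input-dependent linear recurrence over the sequence. Below is a minimal sequential sketch in PyTorch; the names and shapes are purely illustrative, not the reference Mamba code.

```python
import torch

def diagonal_ssm_reference(a, bx, c):
    """Sequential reference for a diagonal state-space recurrence:
        h_t = a_t * h_{t-1} + (B_t x_t)   (elementwise, since A_t is diagonal)
        y_t = <c_t, h_t>
    This runs in O(T) dependent steps; the goal discussed in this issue is to
    express the same computation as a parallel associative scan that
    torch.compile can lower efficiently.
    """
    T, D = bx.shape
    h = bx.new_zeros(D)
    ys = []
    for t in range(T):
        h = a[t] * h + bx[t]          # state update
        ys.append((c[t] * h).sum())   # output projection
    return torch.stack(ys)

# Toy shapes; the real model batches this and makes a, B, C input-dependent.
T, D = 8, 4
y = diagonal_ssm_reference(torch.rand(T, D), torch.rand(T, D), torch.rand(T, D))
```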

To better understand the gaps and coordinate these ongoing efforts, we created the following doc to track the requested features and issues. Feel free to comment if you have any feedback!
https://docs.google.com/document/d/1rNNByFrOjOQOBM6ZZqnOqc-LGMRmdnQifKfbY_KalnM/edit?usp=sharing

cc @ezyang @msaroufim @bdhirsh @anijain2305 @zou3519 @Chillee @ydwu4 @peterbell10 @lezcano @aakhundov @chauhang

@lezcano (Collaborator) commented Feb 19, 2024

The document, as written, is a bit too optimistic.

The current tl.associative_scan (and as such @peterbell10's implementation in #119430) only supports pointwise accumulation functions, so we will only be able to implement SSMs with diagonal matrices, where matrix multiplication reduces to pointwise multiplication.

We should be able to do this once:

  1. [HOP][inductor] Add higher order associative scan operator #119430 lands
  2. We extend our scan operation to support multiple inputs and outputs (this shouldn't be too difficult).
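
Concretely, the reason diagonal matrices suffice for a pointwise scan is that each step is the elementwise affine map h -> a_t * h + b_t, and composing two such maps is again elementwise affine. A minimal sketch of that combine function in plain PyTorch follows (illustrative code, not the #119430 API); carrying the (a, b) pair through the scan is exactly what the multiple-input/output extension in point 2 is needed for.

```python
import torch

def combine(left, right):
    # Compose two elementwise-affine maps h -> a*h + b; applying `left` first
    # and then `right` gives h -> (a2*a1)*h + (a2*b1 + b2). This composition is
    # associative, so a parallel scan over (a_t, b_t) pairs recovers the
    # recurrence h_t = a_t * h_{t-1} + b_t.
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

# Check against the sequential recurrence with h_{-1} = 0.
T, D = 16, 4
a, b = torch.rand(T, D), torch.rand(T, D)

h, expected = torch.zeros(D), []
for t in range(T):
    h = a[t] * h + b[t]
    expected.append(h.clone())

acc, scanned = (a[0], b[0]), [b[0]]
for t in range(1, T):
    acc = combine(acc, (a[t], b[t]))
    scanned.append(acc[1])

assert torch.allclose(torch.stack(expected), torch.stack(scanned))
```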

@yanboliang (Contributor, Author) commented Feb 19, 2024

@lezcano Yes, we only discussed SSMs for diagonal matrices in the doc, but we should be able to extend to more general SSMs after this.

@ezyang added the triaged, needs research, and oncall: pt2 labels on Feb 20, 2024
@zou3519 added the module: higher order operators and module: pt2-dispatcher labels on Feb 22, 2024
Jokeren pushed a commit to triton-lang/triton that referenced this issue Feb 23, 2024
This PR implements a `reverse` argument for associative scan similar to
the jax implementation. While this can be implemented using the tl.flip
command, @Jokeren advised me that this would be very inefficient and
that this should be done in the associative scan itself.

The implementation can be summarized as `flip(scan(flip(x)))`. However
the flip needs to happen along three axes: warp, lanes, chunks. To flip
the chunks, I simply reverse the vector of values. To flip the lanes, I
use a butterfly shuffle to efficiently reverse the lanes. To flip the
warp (needed for the slow case) I flip the indexing of the warps
themselves.

I additionally modified the scan tests to include the new reverse
implementation.

## Why is this needed? 
This was needed originally for the implementation of the Mamba model
(https://srush.github.io/annotated-mamba/hard.html) to compute the
backward pass of the models. I thought pretty hard about whether this
could be done by any kind of recomputation, but it seems pretty
necessary to be able to do a reverse accumulation in order to take a dot
product in the kernel. (perhaps also relevant to
pytorch/pytorch#120189 )
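
The flip(scan(flip(x))) identity the commit describes can be sanity-checked in plain PyTorch; this is only a minimal sketch of the identity, while the PR's actual contribution is performing the flips inside the Triton scan (at the chunk/lane/warp level) rather than materializing flipped tensors.

```python
import torch

def reverse_cumsum_via_flip(x):
    # A reverse (right-to-left) scan expressed as flip(scan(flip(x))).
    # The Triton PR folds these flips into the scan itself instead of
    # flipping in memory, which is why using tl.flip alone would be slow.
    return torch.cumsum(x.flip(-1), dim=-1).flip(-1)

x = torch.arange(5.0)                            # [0, 1, 2, 3, 4]
expected = torch.tensor([10., 10., 9., 7., 4.])  # suffix sums of x
assert torch.equal(reverse_cumsum_via_flip(x), expected)
```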

binarman pushed a commit to binarman/triton that referenced this issue Apr 2, 2024 (same commit message as above)
@bhack (Contributor) commented Jun 11, 2024

I hope we could also include Mamba2 coverage: https://arxiv.org/abs/2405.21060

state-spaces/mamba#355

@bhack (Contributor) commented Jul 29, 2024

The associative scan reverse support was merged in Triton in April (triton-lang/triton#3177), and a fix for a minor bug is WIP at triton-lang/triton#4362.

What is the status on the PyTorch side?

@bhack (Contributor) commented Dec 14, 2024

/cc @bohnstingl @ydwu4

#95408 (comment)
