TinyChat support for GQA and memory efficient loading by kentang-mit · Pull Request #90 · mit-han-lab/llm-awq

Merged

merged 2 commits into main from dev/tinychat_update_0918 on Oct 2, 2023

Conversation

@kentang-mit (Contributor)

Add the implementation from @ys-2020 that allows deployment of LLaMA-2-70B models on Orin-64G. Also add a faster context-stage implementation compared with last week's release.
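For a rough idea of what memory-efficient loading of an already-quantized checkpoint can look like in PyTorch, here is a minimal sketch. It is not the code added in this PR; the helper name `load_checkpoint_low_memory` and the assumption that `model` already holds correctly shaped quantized buffers are placeholders.

```python
import torch

def load_checkpoint_low_memory(model, checkpoint_path):
    # Illustrative sketch: load the serialized state dict onto CPU, then copy
    # each tensor into the (already-quantized) model one at a time and free the
    # CPU copy immediately, so the host never holds a full fp16 model alongside
    # the checkpoint.
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in state_dict:
                param.copy_(state_dict[name].to(param.device, non_blocking=True))
                del state_dict[name]  # drop the CPU copy right away
        for name, buf in model.named_buffers():
            if name in state_dict:
                buf.copy_(state_dict[name].to(buf.device, non_blocking=True))
                del state_dict[name]
    return model
```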

@casper-hansen (Contributor)

Is it correctly understood that AWQ now solely uses the modified GEMM kernel for processing context? Also, is the modified GEMM kernel faster than the previous one?

@kentang-mit (Contributor, Author)

Hi Casper,

AWQ solely uses the GEMM kernel for context processing. I think its speed is on par with the previous version, but it has been adapted to work with the latest weight packing scheme.

Best,
Haotian
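As an illustration of what a 4-bit weight packing scheme typically looks like (eight 4-bit values per int32), here is a small sketch; `pack_int4` is a hypothetical helper, and the bit layout shown is not necessarily the one the updated TinyChat kernels expect.

```python
import torch

def pack_int4(weight_int4: torch.Tensor) -> torch.Tensor:
    """Pack eight 4-bit values (held as ints in [0, 15]) into one int32.

    weight_int4 has shape (out_features, in_features); the result has shape
    (out_features, in_features // 8). Illustrative layout only; real kernels
    may interleave values differently.
    """
    assert weight_int4.shape[-1] % 8 == 0
    w = weight_int4.to(torch.int32)
    packed = torch.zeros(
        (*w.shape[:-1], w.shape[-1] // 8), dtype=torch.int32, device=w.device
    )
    for i in range(8):
        # Nibble i of each group of 8 goes into bit positions 4*i .. 4*i + 3.
        packed |= w[..., i::8] << (4 * i)
    return packed
```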

@casper-hansen (Contributor) commented on Sep 21, 2023

Hi Haotian, thank you for answering! I have tested the new GEMM kernel. It is a good improvement over GEMV: 2x faster than GEMV, but 5-6x slower than the original GEMM kernel.

Context-stage speed:

  • GEMM (original): 2400 tokens/s on 7B
  • GEMM (new): 440 tokens/s on 7B
  • GEMV: 234 tokens/s on 7B
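For reference, a rough sketch of how context-stage (prefill) throughput like the numbers above can be measured; `prefill_tokens_per_second` is a hypothetical helper, and the warmup and iteration counts are arbitrary.

```python
import time
import torch

@torch.inference_mode()
def prefill_tokens_per_second(model, input_ids, n_warmup=3, n_iters=10):
    """Rough prefill throughput: tokens processed per second for a single
    forward pass over `input_ids` (shape [1, seq_len]). Assumes a CUDA model."""
    for _ in range(n_warmup):
        model(input_ids)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        model(input_ids)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return input_ids.numel() * n_iters / elapsed
```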

@kentang-mit (Contributor, Author)

Interesting. Is that just for the context stage? We haven't done any formal benchmarks yet, but we will definitely work on further improving the speed of this kernel.

@casper-hansen (Contributor)

Yes, this is only for the context stage.

Let me put it this way: in general, the new GEMV kernel is 20% faster at token generation than the original GEMM kernel. The only problem is that the context stage can be slow if you want to use 4k/8k/16k contexts.
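A common way to keep the GEMV kernel's fast decoding while still running the context stage through a GEMM path is to dispatch on the number of input tokens. The sketch below only illustrates that pattern; `awq_gemm_forward` and `awq_gemv_forward` are placeholder callables, not the repo's actual kernel entry points.

```python
import torch

def quant_linear_forward(x, qweight, scales, zeros,
                         awq_gemm_forward, awq_gemv_forward):
    """Dispatch between a GEMM kernel (context/prefill, many tokens) and a
    GEMV kernel (decode, one token per sequence). The two kernel callables are
    placeholders for whatever the backend actually provides."""
    num_tokens = x.shape[:-1].numel()
    if num_tokens > 1:
        # Prefill: many tokens at once, compute-bound -> GEMM kernel.
        return awq_gemm_forward(x, qweight, scales, zeros)
    # Decode: a single token, memory-bound -> GEMV kernel.
    return awq_gemv_forward(x, qweight, scales, zeros)
```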

@Sakits (Collaborator) left a review comment

Looks good to me.

@tonylins merged commit be78265 into main on Oct 2, 2023
@ys-2020 deleted the dev/tinychat_update_0918 branch on March 30, 2024