TinyChat support for GQA and memory efficient loading by kentang-mit · Pull Request #90 · mit-han-lab/llm-awq

Merged

merged 2 commits into main from dev/tinychat_update_0918 on Oct 2, 2023

Conversation

@kentang-mit (Contributor)

Add the implementation from @ys-2020 that allows deployment of LLaMA-2-70B models on Orin-64G. Also add a faster context-stage implementation compared with last week's release.
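For a rough idea of what memory-efficient loading of an already-quantized checkpoint can look like in PyTorch, here is a minimal sketch. It is not the code added in this PR; the helper name `load_checkpoint_low_memory` and the assumption that `model` already holds correctly shaped quantized buffers are placeholders.

```python
import torch

def load_checkpoint_low_memory(model, checkpoint_path):
    # Illustrative sketch: load the serialized state dict onto CPU, then copy
    # each tensor into the (already-quantized) model one at a time and free the
    # CPU copy immediately, so the host never holds a full fp16 model alongside
    # the checkpoint.
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in state_dict:
                param.copy_(state_dict[name].to(param.device, non_blocking=True))
                del state_dict[name]  # drop the CPU copy right away
        for name, buf in model.named_buffers():
            if name in state_dict:
                buf.copy_(state_dict[name].to(buf.device, non_blocking=True))
                del state_dict[name]
    return model
```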

@casper-hansen (Contributor)

Is it correctly understood that AWQ now solely uses the modified GEMM kernel for processing context? Also, is the modified GEMM kernel faster than the previous one?

@kentang-mit (Contributor, Author)

Hi Casper,

AWQ solely uses the GEMM kernel for context processing. I think its speed is on par with the previous version, but it has been adapted to work with the latest weight packing scheme.

Best,
Haotian
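As an illustration of what a 4-bit weight packing scheme typically looks like (eight 4-bit values per int32), here is a small sketch; `pack_int4` is a hypothetical helper, and the bit layout shown is not necessarily the one the updated TinyChat kernels expect.

```python
import torch

def pack_int4(weight_int4: torch.Tensor) -> torch.Tensor:
    """Pack eight 4-bit values (held as ints in [0, 15]) into one int32.

    weight_int4 has shape (out_features, in_features); the result has shape
    (out_features, in_features // 8). Illustrative layout only; real kernels
    may interleave values differently.
    """
    assert weight_int4.shape[-1] % 8 == 0
    w = weight_int4.to(torch.int32)
    packed = torch.zeros(
        (*w.shape[:-1], w.shape[-1] // 8), dtype=torch.int32, device=w.device
    )
    for i in range(8):
        # Nibble i of each group of 8 goes into bit positions 4*i .. 4*i + 3.
        packed |= w[..., i::8] << (4 * i)
    return packed
```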

@casper-hansen (Contributor) commented on Sep 21, 2023

Hi Haotian, thank you for answering! I have tested the new GEMM kernel. It is a good improvement over GEMV: 2x faster than GEMV, but 5-6x slower than the original GEMM kernel.

Context-stage speed:

  • GEMM (original): 2400 tokens/s on 7B
  • GEMM (new): 440 tokens/s on 7B
  • GEMV: 234 tokens/s on 7B
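For reference, a rough sketch of how context-stage (prefill) throughput like the numbers above can be measured; `prefill_tokens_per_second` is a hypothetical helper, and the warmup and iteration counts are arbitrary.

```python
import time
import torch

@torch.inference_mode()
def prefill_tokens_per_second(model, input_ids, n_warmup=3, n_iters=10):
    """Rough prefill throughput: tokens processed per second for a single
    forward pass over `input_ids` (shape [1, seq_len]). Assumes a CUDA model."""
    for _ in range(n_warmup):
        model(input_ids)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        model(input_ids)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return input_ids.numel() * n_iters / elapsed
```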

@kentang-mit (Contributor, Author)

Interesting. Is that just for the context stage? We haven't done any formal benchmarks yet, but we will definitely work on further improving the speed of this kernel.

@casper-hansen (Contributor)

Yes, this is only for the context stage.

Let me put it this way: in general, the new GEMV kernel is 20% faster at token generation than the original GEMM kernel. The only problem is that the context stage can be slow if you want to use 4k/8k/16k contexts.
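A common way to keep the GEMV kernel's fast decoding while still running the context stage through a GEMM path is to dispatch on the number of input tokens. The sketch below only illustrates that pattern; `awq_gemm_forward` and `awq_gemv_forward` are placeholder callables, not the repo's actual kernel entry points.

```python
import torch

def quant_linear_forward(x, qweight, scales, zeros,
                         awq_gemm_forward, awq_gemv_forward):
    """Dispatch between a GEMM kernel (context/prefill, many tokens) and a
    GEMV kernel (decode, one token per sequence). The two kernel callables are
    placeholders for whatever the backend actually provides."""
    num_tokens = x.shape[:-1].numel()
    if num_tokens > 1:
        # Prefill: many tokens at once, compute-bound -> GEMM kernel.
        return awq_gemm_forward(x, qweight, scales, zeros)
    # Decode: a single token, memory-bound -> GEMV kernel.
    return awq_gemv_forward(x, qweight, scales, zeros)
```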

@Sakits (Collaborator) left a review comment

Looks good to me.

@tonylins merged commit be78265 into main on Oct 2, 2023
@ys-2020 deleted the dev/tinychat_update_0918 branch on March 30, 2024