Introduce New Lookup-Table(LUT)-Based Matrix Multiplication Method (TMAC) #13206
Conversation
- [WIP] Fit llama.cpp build visibility. Runtime error-free. Wrong outputs.
- [WIP] Remove some deprecated codes.
- [Fix] ggml_tmac_transform_tensor should use *data as the original data. And gather code logics in ggml_tmac_can_mul_mat.
- Change tuning profile time back to 5000ms.
- Hard code bits/groupsize/sym.
- GPTQ Llama correct.
- Unify quantization_config loading.
Today I suddenly found this PR (because I left GitHub on 07/18/2024, came back on 01/29/2025, and missed a lot of wonderful/standout PRs) and accordingly found that impressive paper, which I'm now reading again and again. I'm currently working on an implementation of int8-based mulmat on the Qualcomm Hexagon NPU. Maybe this standout approach from MSRA can be used in ggml-hexagon (a dedicated llama.cpp backend for the Qualcomm Hexagon NPU).
After reading your team's outstanding paper again, I tried to dig into the source code in this PR, but I can't build it:
It works on my Linux machine. Thanks so much!
Hi @slaren, could you take a quick look at this new pull request and see if it's basically all right?
ggml/src/kompute
Is there a reason that the kompute module is added here again, and the one under ggml-kompute is not moved out?
@inforithmics Nope, it was added back by mistake. I have removed it.
Unexpectedly slow performance on Apple M4 Max for Llama-3-8b-EfficientQAT-w2g128-GPTQ compared to AGX Orin. I used the following command to run your code on the AGX and the M4 Max:
The performance on the Apple M4 Max is considerably slower than on the NVIDIA AGX Orin. On M4 Max:
On AGX Orin 64 GB:
This is contrary to my expectation, as the M4 Max's CPU is generally considered to be 3-5 times faster than the AGX Orin's CPU.
Hi @QingtaoLi1, sorry for the delay. Before going into more depth, can you explain briefly what the motivation was for implementing this as an extra buffer type in the CPU backend? The reason I ask is that we would need the T-MAC quantization types to be integrated into ggml seamlessly alongside the current quantization types. If the way quantization types are currently represented in ggml is not compatible with T-MAC, we would need to investigate in which ways it could be adapted to support it. The core functions such as …
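For reference, the existing ggml quantization types referred to above are fixed-size blocks stored contiguously in the tensor data. A minimal sketch, paraphrased from ggml's quantization headers and shown here only for illustration, not part of this PR:

```c
// Sketch of an existing ggml block type (Q4_0): the tensor data is simply this
// fixed-size struct repeated block after block, with no external metadata.
#include <stdint.h>

#define QK4_0 32                 // elements per block

typedef uint16_t ggml_half;      // fp16 storage

typedef struct {
    ggml_half d;                 // per-block scale (fp16)
    uint8_t   qs[QK4_0 / 2];     // 32 x 4-bit quants, two per byte
} block_q4_0;
// dequantization: x[i] = d * (q_i - 8), where q_i is the i-th 4-bit value
```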
@Zijie-Tian It may be a bug. We will test on Apple devices later.
@slaren Okay!
For the newly added data types, we need to re-order the weight layout for efficient LUT computation, as well as fit the float-type scales. In the previous PR, you mentioned #10446 (amx) as an example of packing the weights, so I imitated the amx implementation to add the new buffer type.
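As a rough illustration of the kind of weight re-ordering this refers to, here is a minimal sketch that splits 2-bit weights into two 1-bit planes so that each plane can later serve as LUT indices. The real T-MAC packing additionally tiles and interleaves the planes for SIMD table lookups, so this is an assumption-laden sketch, not the PR's layout:

```c
// Split packed 2-bit weights into two 1-bit planes (illustrative only).
#include <stdint.h>
#include <stddef.h>

// src:            n 2-bit weights, packed four per byte (values 0..3)
// plane0, plane1: n bits each, packed eight per byte, must be zero-initialized
static void repack_2bit_to_bitplanes(const uint8_t *src,
                                     uint8_t *plane0, uint8_t *plane1, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        uint8_t q = (src[i / 4] >> (2 * (i % 4))) & 0x3;
        plane0[i / 8] |= (uint8_t)(( q       & 1) << (i % 8));
        plane1[i / 8] |= (uint8_t)(((q >> 1) & 1) << (i % 8));
    }
}
```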
We have studied the existing ggml types. Our conclusion is that i-quants cannot be supported by the current T-MAC method, because i-quant is a vector-quantization method while T-MAC works on scalar quantization.
Here we have two main differences.
In our first implementation, this change was indeed put in ggml functions like …
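To make the scalar-quantization point above concrete, here is a minimal sketch of the LUT idea for a single 1-bit weight plane: groups of weight bits index a table of precomputed activation partial sums instead of being dequantized and multiplied. The group size, the handling of scales, and the combination of multiple bit planes are assumptions for illustration, not this PR's kernel:

```c
// LUT-based dot product for one 1-bit weight plane (illustrative sketch).
#include <stdint.h>
#include <stddef.h>

#define G        4            // activations folded into one LUT index
#define LUT_SIZE (1 << G)

// Precompute all 2^G possible partial sums for one group of G activations.
static void build_lut(const float *act_group, float lut[LUT_SIZE]) {
    for (int mask = 0; mask < LUT_SIZE; ++mask) {
        float s = 0.0f;
        for (int j = 0; j < G; ++j) {
            if (mask & (1 << j)) {
                s += act_group[j];
            }
        }
        lut[mask] = s;
    }
}

// Dot product of a packed 1-bit weight row with the activations; every G weight
// bits become a table index. n must be a multiple of 8. In a real kernel the
// LUTs are built once per activation block and reused across many weight rows;
// a 2-bit weight would be handled as two planes combined as plane0 + 2 * plane1,
// followed by the per-group scale (and zero point for the _1 types).
static float lut_dot_1bit_plane(const uint8_t *w_bits, const float *act, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i += G) {
        float lut[LUT_SIZE];
        build_lut(act + i, lut);
        uint8_t idx = (uint8_t)((w_bits[i / 8] >> (i % 8)) & (LUT_SIZE - 1));
        acc += lut[idx];
    }
    return acc;
}
```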
@Zijie-Tian I've tested the w2g128 model on M2 Ultra, and it seems to work well now.
I have been thinking about this, and I think it would be ok to add new tensor types that do not conform exactly to the ggml structure of fixed-size blocks organized sequentially in the tensor data stream. To do this, however, we would need to add some safety checks to ensure that these types are not used in an unsupported way, for example by forbidding creating views of tensors of these types. Functions like …

About the use of extra buffer types: this is intended for cases where the standard layout of a type can be reorganized to perform better on the current hardware. If these types only have one layout, there is no need to use an extra buffer type, and the code should be integrated into the CPU backend normally.

On the last point, I do not think that we would want to add types that cannot be created using the tools in ggml and llama.cpp. I would expect the quantization code to be integrated into ggml as well, and the types supported by tools such as …
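On the tooling point, the existing types are all reachable through ggml's generic quantization entry point; a minimal sketch of that call pattern with an existing type (Q4_0) is shown below for context. Whether and how the TMAC_* types would plug into the same path is exactly the open question in this discussion:

```c
// Sketch: the generic quantization entry point that llama.cpp's quantization
// tooling funnels through, shown with an existing type.
#include "ggml.h"
#include <stddef.h>
#include <stdint.h>

// Quantize nrows rows of n_per_row floats from src into dst; returns bytes written.
static size_t quantize_rows_q4_0(const float *src, void *dst,
                                 int64_t nrows, int64_t n_per_row) {
    return ggml_quantize_chunk(GGML_TYPE_Q4_0, src, dst,
                               /*start   =*/ 0,
                               nrows, n_per_row,
                               /*imatrix =*/ NULL);
}
```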
This is a re-submitted PR of #10181. We have refactored all the code to meet the requirements and to follow the constructive discussions with @slaren in the previous one.
Different from #10181, we integrate all the LUT code under ggml/, so no third-party dependencies are needed anymore, and the CMakeLists.txt changes are minor.
Instead of a single new data type INT_N, this time we introduce a series of TMAC_* data types to avoid external meta information loading. New data types include one for Bitnet-like models (1 tensor with 1 scale value), and several for GPTQ models (group quantized, e.g. w2g64). We have listed some common GPTQ dtypes, and it's easy to extend to more bits and group sizes. Following existing data types, _0 means no zero points, and _1 means having zero points.
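For illustration, the arithmetic implied by a w2g64-style group-quantized type (one scale per group of 64 weights, plus a zero point for the _1 variants) looks roughly as follows; the fixed mid-point used for the _0 case is an assumption in the spirit of symmetric GPTQ, not taken from this PR:

```c
// Group-wise dequantization of a 2-bit weight (illustrative sketch only).
#include <stdint.h>

// scale and zero are per-group values (e.g. one pair per 64 weights for w2g64).
static float dequant_2bit(uint8_t q /* 0..3 */, float scale, float zero, int has_zero) {
    if (has_zero) {
        return scale * ((float) q - zero);   // _1 types: stored zero point
    }
    return scale * ((float) q - 2.0f);       // _0 types: assumed fixed mid-point
}
```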
How to Use It
Since there are no third-party dependencies, the build/run pipeline is quite similar to the existing one.
For other devices, e.g. Apple, the script is similar or the same.
Speed
TBD
Model size
TBD
Perplexity
TBD
Note
There is an option desc_act in the GPTQ model config: True means the weight columns are re-ordered during quantization, while False means the weights are in the original order. We only support desc_act=False for now. The other case can likely be supported, but it needs quite a bit of engineering effort; we think it better to open another PR for it after this one if it is required.
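To illustrate what the option changes (an assumption based on common GPTQ conventions, not this PR's code): with desc_act=False a column's quantization group is simply its index divided by the group size, whereas desc_act=True introduces an explicit per-column group index (commonly called g_idx) that any packed layout would have to honor when selecting scales and zero points:

```c
// Indexing difference behind desc_act (sketch based on common GPTQ conventions).
#include <stdint.h>

// desc_act = False: columns are grouped in their original order.
static int group_of_col_plain(int col, int group_size) {
    return col / group_size;
}

// desc_act = True: columns were re-ordered during quantization, so an explicit
// per-column group index must be consulted.
static int group_of_col_reordered(int col, const int32_t *g_idx) {
    return g_idx[col];
}
```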
TODO list
[x] Adapt and test Bitnet.
[x] Adapt and test Q4_0 and TQ types.
[x] Support BF16 model conversion.
[x] Support F16 scales and zero points.
[ ] Fix T-MAC gguf quantization of embed/output_weights.
[ ] Support kernel tuning with threadpool. This will probably break the current build encapsulation of targets llama/ggml/ggml-cpu/ggml-base.