Feature request
Compared to GPTQ, AWQ is more accurate and has much better inference performance.
Motivation
I will run benchmarks with TGI version 1.0.3 to compare the performance of GPTQ (without act order), GPTQ (with act order), and AWQ models; launch commands are sketched after the model list.
TheBloke/Llama-2-7b-Chat-GPTQ::gptq-4bit-128g-actorder_True
abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq
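For reference, this is roughly how the two models would be launched for the benchmark. The GPTQ invocation uses flags that exist in TGI 1.0.3; the `--quantize awq` flag is an assumption, since AWQ support is exactly what this issue proposes:

```shell
# GPTQ baseline (supported in TGI 1.0.3); the act-order variant is
# selected via the model revision
text-generation-launcher \
  --model-id TheBloke/Llama-2-7b-Chat-GPTQ \
  --revision gptq-4bit-128g-actorder_True \
  --quantize gptq

# AWQ candidate; --quantize awq is the flag proposed here, not yet in 1.0.3
text-generation-launcher \
  --model-id abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq \
  --quantize awq
```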
Your contribution
After the benchmarks are completed, I will publish a new branch with AWQ support.
The AWQ kernel code is compiled from huggingface#1019.
The AWQ code has been pushed to the awq branch and compiled on an A10 machine; it can be used for inference on an A100 machine via the registry.cn-hangzhou.aliyuncs.com/zt_gcr/hf-infer:awq image (example below).
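A minimal sketch of serving the AWQ model from the prebuilt image on the A100, assuming the image keeps the standard TGI launcher entrypoint and accepts the proposed `--quantize awq` flag:

```shell
# Pull and serve the AWQ model with the prebuilt image
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  registry.cn-hangzhou.aliyuncs.com/zt_gcr/hf-infer:awq \
  --model-id abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq \
  --quantize awq
```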
Environment
torch version: 2.0.1+cu117
GPU: Nvidia A100 40G
Nvidia Driver Version: 530.30.02
CUDA Version: 12.1
Running parameters
default
Benchmark parameters
default
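With default parameters, the benchmark would be run from inside the serving container along these lines. The tool name and `--tokenizer-name` flag match the stock TGI benchmark utility; running it via `docker exec` against the container above is an assumption about the setup:

```shell
# Attach to the running TGI container and benchmark with the tool's
# built-in default batch sizes and sequence lengths
docker exec -it <container-id> \
  text-generation-benchmark \
  --tokenizer-name abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq
```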