Feature request
Compared to GPTQ, AWQ is more accurate and has much better inference performance.
Motivation
I will run benchmarks with TGI version 1.0.3 to compare the performance of GPTQ (without act order), GPTQ (with act order), and AWQ models; launch commands are sketched after the model list.
TheBloke/Llama-2-7b-Chat-GPTQ::gptq-4bit-128g-actorder_True
abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq
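For reference, this is roughly how the two models would be launched for the benchmark. The GPTQ invocation uses flags that exist in TGI 1.0.3; the `--quantize awq` flag is an assumption, since AWQ support is exactly what this issue proposes:

```shell
# GPTQ baseline (supported in TGI 1.0.3); the act-order variant is
# selected via the model revision
text-generation-launcher \
  --model-id TheBloke/Llama-2-7b-Chat-GPTQ \
  --revision gptq-4bit-128g-actorder_True \
  --quantize gptq

# AWQ candidate; --quantize awq is the flag proposed here, not yet in 1.0.3
text-generation-launcher \
  --model-id abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq \
  --quantize awq
```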
Your contribution
After the benchmarks are completed, I will publish a new branch with AWQ support.
The AWQ kernel code is compiled from huggingface#1019.
The AWQ code has been pushed to the awq branch and compiled on an A10 machine; it can be used for inference on an A100 machine via the registry.cn-hangzhou.aliyuncs.com/zt_gcr/hf-infer:awq image (example below).
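A minimal sketch of serving the AWQ model from the prebuilt image on the A100, assuming the image keeps the standard TGI launcher entrypoint and accepts the proposed `--quantize awq` flag:

```shell
# Pull and serve the AWQ model with the prebuilt image
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  registry.cn-hangzhou.aliyuncs.com/zt_gcr/hf-infer:awq \
  --model-id abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq \
  --quantize awq
```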
Environment
torch version: 2.0.1+cu117
GPU: Nvidia A100 40G
Nvidia Driver Version: 530.30.02
CUDA Version: 12.1
Running parameters
default
Benchmark parameters
default
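With default parameters, the benchmark would be run from inside the serving container along these lines. The tool name and `--tokenizer-name` flag match the stock TGI benchmark utility; running it via `docker exec` against the container above is an assumption about the setup:

```shell
# Attach to the running TGI container and benchmark with the tool's
# built-in default batch sizes and sequence lengths
docker exec -it <container-id> \
  text-generation-benchmark \
  --tokenizer-name abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq
```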