Add support for AWQ quantized models · Issue #1 · ZJUICI/text-generation-inference
@zTaoplus

Feature request

Compared to GPTQ, AWQ is more accurate and has much better inference performance.

Motivation

upstream issues

I will run benchmarks using version 1.0.3 of TGI to compare the performance of GPTQ (without act order), GPTQ (act order), and AWQ models; a rough client-side timing sketch is given after the model list below.

TheBloke/Llama-2-7b-Chat-GPTQ::gptq-4bit-128g-actorder_True
abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq
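
For illustration only, here is a minimal client-side timing sketch, assuming two TGI instances are already serving a GPTQ and an AWQ model on localhost ports 8080 and 8081. The endpoints, prompt, and token budget are placeholders, not the actual benchmark setup; the real numbers will come from TGI's own benchmark tooling.

```python
import time

import requests

# Hypothetical endpoints: one TGI instance per quantized model.
ENDPOINTS = {
    "gptq": "http://localhost:8080",
    "awq": "http://localhost:8081",
}


def decode_tokens_per_second(base_url: str, prompt: str, max_new_tokens: int = 128) -> float:
    """Time a single TGI /generate call and return a rough tokens/s figure.

    Rough because it assumes the model emits all max_new_tokens
    (no early EOS) and includes prefill time in the measurement.
    """
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    start = time.perf_counter()
    resp = requests.post(f"{base_url}/generate", json=payload, timeout=300)
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    return max_new_tokens / elapsed


for name, url in ENDPOINTS.items():
    tps = decode_tokens_per_second(url, "Explain quantization in one paragraph.")
    print(f"{name}: {tps:.1f} tokens/s")
```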

Your contribution

After the benchmarks are completed, I will publish a new branch with AWQ support.
The AWQ kernel code is compiled from huggingface#1019.

The AWQ code has been pushed to the awq branch and compiled on an A10 machine; it can be used for inference on an A100 machine via the registry.cn-hangzhou.aliyuncs.com/zt_gcr/hf-infer:awq image.
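
A quick smoke test of a container started from that image might look like the following sketch, assuming the container's port is mapped to localhost:8080 (adjust to match your `docker run -p` mapping). huggingface_hub's InferenceClient can talk to a TGI endpoint directly:

```python
from huggingface_hub import InferenceClient

# Assumes the hf-infer:awq container is already running and
# serving TGI's standard HTTP API on port 8080.
client = InferenceClient(model="http://localhost:8080")

output = client.text_generation(
    "What is AWQ quantization?",
    max_new_tokens=64,
)
print(output)
```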

Environment

torch version: 2.0.1+cu117
GPU: Nvidia A100 40G
Nvidia Driver Version: 530.30.02
CUDA Version: 12.1
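
(Not from the original report: one way to read these figures back from a running Python environment is sketched below.)

```python
import torch

print("torch version:", torch.__version__)        # e.g. 2.0.1+cu117
print("CUDA (torch build):", torch.version.cuda)  # 11.7 here; nvidia-smi reports the driver's 12.1
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # e.g. NVIDIA A100 40G
```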

Running parameters

default

Benchmark parameters

default

Result Statistics Bar

[image: bar chart of benchmark result statistics]

For more details, please scroll down.
