Description
4-bit quantization tends to come at the cost of some loss in output quality. GPTQ is a state-of-the-art post-training quantization method that shows negligible output quality loss compared with the prior state of the art in 4-bit (and 3-bit/2-bit) quantization, and even compared with uncompressed fp16 inference.
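For context, here is a minimal sketch of the core GPTQ idea (not the reference implementation): weights are quantized one column at a time, and the quantization error of each column is propagated into the not-yet-quantized columns using the Cholesky factor of the inverse Hessian built from calibration activations. The `gptq_quantize` helper, the symmetric per-row grid, and the damping constant below are illustrative assumptions; real implementations use per-group scales/zero points and blocked updates.

```python
import numpy as np

def gptq_quantize(W, H, bits=4, damp=0.01):
    """Sketch of GPTQ-style column-by-column quantization.

    W: (rows, cols) fp weight matrix.
    H: (cols, cols) Hessian proxy 2 * X @ X.T from calibration activations.
    Returns a quantized copy of W.
    """
    cols = W.shape[1]
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)

    # Dampen the Hessian so it is invertible, then take the upper
    # Cholesky factor of its inverse (as in the GPTQ paper).
    H = H + damp * np.mean(np.diag(H)) * np.eye(cols)
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T  # upper triangular

    # Simplified symmetric per-row quantization grid.
    maxq = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / maxq

    for j in range(cols):
        w = W[:, j]
        q = np.clip(np.round(w / scale), -maxq - 1, maxq) * scale
        Q[:, j] = q
        # Spread this column's quantization error over the remaining
        # columns, weighted by the inverse-Hessian row.
        err = (w - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])

    return Q
```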
It would be good to see benchmarks for the existing implementation. Its 4-bit quantization may introduce substantial quality loss, or it may not; we would need benchmarks to know.
The related GPTQ-for-LLaMA project has some benchmarks available for its implementation.
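As a starting point, a perplexity comparison in the style of the GPTQ-for-LLaMA benchmarks could look like the sketch below. It assumes a transformers-compatible checkpoint; the `MODEL_PATH` placeholder and the `wikitext2_perplexity` helper are illustrative, not part of any existing tooling. Run it once for the fp16 baseline and once for the 4-bit model and compare the numbers.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: point this at the fp16 baseline and then at the
# 4-bit checkpoint; the rest of the script is identical for both runs.
MODEL_PATH = "path/to/checkpoint"

def wikitext2_perplexity(model_path, ctx=2048):
    """Perplexity over non-overlapping wikitext-2 test chunks."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map="auto"
    )
    model.eval()

    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

    nll_sum, n_tokens = 0.0, 0
    for start in range(0, ids.size(1) - ctx, ctx):
        chunk = ids[:, start : start + ctx].to(model.device)
        with torch.no_grad():
            # labels == input_ids makes the model return the mean
            # next-token negative log-likelihood over the chunk
            loss = model(chunk, labels=chunk).loss
        nll_sum += loss.item() * (ctx - 1)
        n_tokens += ctx - 1
    return torch.exp(torch.tensor(nll_sum / n_tokens)).item()

if __name__ == "__main__":
    print(f"{MODEL_PATH}: ppl = {wikitext2_perplexity(MODEL_PATH):.3f}")
```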
References:
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
The case for 4-bit precision: k-bit Inference Scaling Laws
Related work:
https://github.com/qwopqwop200/GPTQ-for-LLaMA/