Linear Quantization for Neural Networks

The motivation behind this project is to reduce model size and memory usage for inference. Storing large models can be burdensome, especially when dealing with edge devices. A model with a smaller footprint often simplifies and streamlines the inference process.

The project lets developers quantize any Hugging Face model from the HF Hub without losing much performance or accuracy. The underlying method is simple linear quantization (both asymmetric and symmetric), enabling quick, ready-to-go quantization on the fly.

Linear Quantization

  • Symmetric Quantization: The range of quantized values is symmetric around zero. This is simpler but might be less accurate if the data is not centered around zero.
  • Asymmetric Quantization: The range of quantized values is not necessarily symmetric around zero and uses a zero-point offset. This can provide better accuracy for data that is not centered around zero.
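
For reference, here is a minimal sketch of both schemes in PyTorch. The function names and signatures are illustrative assumptions, not the library's actual API:

    import torch

    def asymmetric_quantize(x: torch.Tensor, bits: int = 8):
        """Map x to unsigned integers using a scale and a zero-point offset."""
        qmin, qmax = 0, 2**bits - 1
        scale = (x.max() - x.min()) / (qmax - qmin)
        zero_point = int(round(qmin - (x.min() / scale).item()))
        q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.uint8)
        return q, scale, zero_point

    def symmetric_quantize(x: torch.Tensor, bits: int = 8):
        """The range is symmetric around zero, so no zero-point is needed."""
        qmax = 2 ** (bits - 1) - 1  # 127 for int8
        scale = x.abs().max() / qmax
        q = torch.clamp(torch.round(x / scale), -qmax, qmax).to(torch.int8)
        return q, scale

    # Dequantization reverses the mapping:
    # x ≈ (q.float() - zero_point) * scale  (asymmetric)
    # x ≈ q.float() * scale                 (symmetric)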

The W8A16LL layer implements a W8A16 scheme: weights are quantized to 8 bits (W8), while activations remain in 16-bit floating point (A16) during the forward pass of the quantized linear layer.
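
Conceptually, such a layer stores int8 weights with per-output-channel scales and casts them back to the activation dtype on the fly. The class below is a rough sketch under those assumptions, not the actual W8A16LL implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class W8A16LinearSketch(nn.Module):
        """Illustrative W8A16 linear layer: int8 weights, 16-bit activations."""

        def __init__(self, in_features, out_features, bias=True, dtype=torch.bfloat16):
            super().__init__()
            # int8 weights plus one scale per output row (symmetric, so no zero-point)
            self.register_buffer(
                "int8_weights",
                torch.randint(-128, 127, (out_features, in_features), dtype=torch.int8),
            )
            self.register_buffer("scales", torch.ones(out_features, dtype=dtype))
            self.bias = nn.Parameter(torch.zeros(out_features, dtype=dtype)) if bias else None

        def forward(self, x):
            # Dequantize on the fly and run the matmul in the activation's 16-bit dtype
            w = self.int8_weights.to(x.dtype) * self.scales.unsqueeze(1)
            out = F.linear(x, w)
            return out + self.bias if self.bias is not None else out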

Getting Started

Usage

  1. Download and Load a Pre-trained Model:

    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
    import torch
    
    model_id = "facebook/opt-350m"
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    
    # Get the original model size   
    print(f"Original footprint of the model: {model.get_memory_footprint()/1e+6} MB")
    # The Original footprint of the model: 662.392832 MB
    
  2. Run and Test the Original Model:

    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    print(pipe("What are we having for dinner?"))
    # [{'generated_text': "What are we having for dinner?\nI'm having a steak and a salad.\nI'm"}]
    
    
  3. Load and Quantize Linear Layers:

    from qllm import LinearQuantizer
    from qllm.layers import W8A16LL
    
    LinearQuantizer.replace_and_quantize_modules(model, W8A16LL, ['lm_head'])
    
    # Get the quantized model size
    print(f"Model footprint after quantization: {model.get_memory_footprint()/1e+6} MB")
    # Model footprint after quantization: 359.799808 MB
    
  4. Test the Quantized Model:

    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    print(pipe("What are we having for dinner?"))
    # [{'generated_text': "What are we having for dinner?\nI'm having a steak dinner.\nI'm having a"}]
    
    

Observation

There is around a 45.68% reduction in model footprint: (662.392832 MB − 359.799808 MB) / 662.392832 MB ≈ 45.68%. The exact saving varies by model, but even a reduction of around 30% is a significant help at inference time.

Contribution

We warmly welcome contributions to the qllm project! Whether you're interested in discussing quantization techniques or talking about neural nets in general, feel free to start a discussion.
