
Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark

License: MIT

Code for the paper "Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark". Our codebase is built upon AutoGPTQ.

Authors (* Equal Contribution): Pingzhi Li*, Xiaolong Jin*, Yu Cheng, and Tianlong Chen.

Overview

Large Language Models (LLMs) have become foundational in natural language processing, with performance improving as model size grows. The Mixture-of-Experts (MoE) approach scales LLMs more efficiently by activating only a subset of experts per token, reducing FLOPs, but it incurs significant memory overhead and therefore calls for model compression. Post-training quantization, a popular compression technique, proves less effective when applied directly to MoE models because it overlooks MoE's inherent sparsity. This paper explores several MoE structure-aware quantization heuristics, ranging from coarse to fine granularity: from whole MoE blocks down to individual linear weights. Our investigation reveals a critical principle: different MoE structures (i.e., blocks, experts, linear layers) require different numbers of weight bits for effective and efficient quantization. These conclusions are supported by extensive benchmarking across two representative MoE models and six tasks. We further introduce novel enhancements, a linear-weight outlier scorer and an MoE block scorer, to more accurately identify the weights that require higher bit allocations. Subsequent experiments additionally validate our findings in the setting of both weight and activation quantization.
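To give a rough sense of the outlier-scorer idea, here is a minimal sketch of mixed-precision bit allocation over linear layers. The scoring rule (ratio of maximum to mean absolute weight magnitude), the `outlier_score` / `allocate_bits` names, and the bit-width choices are illustrative assumptions, not necessarily the exact criterion or interface used in this codebase:

```python
# Hypothetical sketch: rank nn.Linear layers by an outlier score and give the
# top-scoring fraction more weight bits. Not the paper's exact scoring rule.
import torch
import torch.nn as nn


def outlier_score(weight: torch.Tensor) -> float:
    """Ratio of the largest absolute weight to the mean absolute weight."""
    w = weight.detach().abs()
    return (w.max() / w.mean().clamp_min(1e-8)).item()


def allocate_bits(model: nn.Module, high_bits: int = 4, low_bits: int = 2,
                  high_bit_ratio: float = 0.25) -> dict[str, int]:
    """Assign high_bits to the top fraction of layers by outlier score."""
    scores = {
        name: outlier_score(module.weight)
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear)
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_high = int(len(ranked) * high_bit_ratio)
    return {name: (high_bits if i < n_high else low_bits)
            for i, name in enumerate(ranked)}
```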

Getting Started

conda create -n qmoe python=3.10
conda activate qmoe
pip install -r requirements.txt
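Since the codebase builds on AutoGPTQ, the underlying quantization flow looks roughly like the standard AutoGPTQ usage below. This is only a sketch: the repository's own scripts and arguments may differ, and the model name and calibration text are placeholders.

```python
# Sketch of the vanilla AutoGPTQ flow this codebase extends; placeholders only.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"  # placeholder MoE checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantize_config = BaseQuantizeConfig(
    bits=4,          # weight bit-width
    group_size=128,  # per-group quantization granularity
    desc_act=False,  # disable activation-order reordering for simplicity
)

# A few tokenized calibration samples (use a real calibration set in practice).
examples = [tokenizer("Mixture-of-Experts models activate experts sparsely.")]

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)
model.save_quantized("mixtral-8x7b-4bit-gptq")
```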
