Computer Science > Machine Learning

arXiv:2208.09225 (cs)

[Submitted on 19 Aug 2022 (v1), last revised 23 Feb 2024 (this version, v2)]

Title: FP8 Quantization: The Power of the Exponent

Authors: Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, Tijmen Blankevoort

Abstract: When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale instead. This paper investigates in depth this benefit of the floating point format for neural network inference. We detail the choices that can be made for the FP8 format, including the important choice of the number of bits for the mantissa and exponent, and show analytically in which settings these choices give better performance. Then we show how these findings translate to real networks, provide an efficient implementation for FP8 simulation, and present a new algorithm that enables the learning of both the scale parameters and the number of exponent bits in the FP8 format. Our chief conclusion is that when doing post-training quantization for a wide range of networks, the FP8 format is better than INT8 in terms of accuracy, and the choice of the number of exponent bits is driven by the severity of outliers in the network. We also conduct experiments with quantization-aware training, where the difference between the formats disappears as the network is trained to reduce the effect of outliers.
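
To make the mantissa/exponent trade-off in the abstract concrete, the following is a minimal sketch of simulated 8-bit floating point quantization. It is not the paper's implementation. The function names `fp8_grid` and `quantize`, the IEEE-style fixed bias, the use of subnormals, the absence of reserved inf/NaN codes, and the max-based per-tensor scale are all assumptions made for illustration.

```python
import numpy as np

def fp8_grid(exp_bits, man_bits):
    """Non-negative values of a toy float format with 1 sign bit.

    Assumptions (not taken from the paper): IEEE-style bias
    2**(exp_bits - 1) - 1, subnormals when the exponent field is 0,
    and no codes reserved for inf/NaN.
    """
    assert exp_bits >= 1 and 1 + exp_bits + man_bits == 8
    bias = 2 ** (exp_bits - 1) - 1
    values = []
    for p in range(2 ** exp_bits):        # exponent field
        for k in range(2 ** man_bits):    # mantissa field
            frac = k / 2 ** man_bits
            if p == 0:                    # subnormal: no implicit leading 1
                values.append(2.0 ** (1 - bias) * frac)
            else:                         # normal: implicit leading 1
                values.append(2.0 ** (p - bias) * (1.0 + frac))
    return np.unique(np.array(values))    # sorted ascending

def quantize(x, exp_bits, man_bits):
    """Round x to the nearest grid value after a per-tensor scale.

    The scale maps max|x| onto the largest grid value; this is a simple
    stand-in for a tuned or learned scale.
    """
    x = np.asarray(x, dtype=np.float64)
    grid = fp8_grid(exp_bits, man_bits)
    max_abs = np.max(np.abs(x))
    if max_abs == 0.0:
        return x.copy()
    scale = max_abs / grid[-1]
    mag = np.abs(x) / scale
    # Brute-force nearest neighbour; the grid has at most 128 entries.
    idx = np.abs(mag[..., None] - grid).argmin(axis=-1)
    return np.sign(x) * grid[idx] * scale
```

Calling `quantize(w, 4, 3)` and `quantize(w, 2, 5)` on the same tensor `w` and comparing the mean squared error against `w` is one way to explore how the exponent/mantissa split interacts with outliers; more exponent bits widen the dynamic range at the cost of fewer values per octave.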

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2208.09225 [cs.LG]
	(or arXiv:2208.09225v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2208.09225

Submission history

From: Andrey Kuzmin
[v1] Fri, 19 Aug 2022 09:03:00 UTC (6,708 KB)
[v2] Fri, 23 Feb 2024 13:49:45 UTC (6,790 KB)
