Post-training quantization is a technique for reducing model size, useful for deploying models on the web and in storage-limited environments such as mobile devices. TensorFlow.js's converter module supports reducing the numeric precision of weights to 16-bit and 8-bit integers after model training is complete, which reduces model size by approximately 50% and 75%, respectively, relative to the original 32-bit float weights.
The following figure provides an intuitive understanding of the degree to which weight values are discretized under the 16- and 8-bit quantization regimes. The figure is based on a zoomed-in view of a sinusoidal wave.
This example focuses on how such quantization of weights affects the model's prediction accuracy.
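As a rough illustration of the discretization described above, the following sketch quantizes one period of a sine wave to 8 and 16 bits and measures the worst-case round-trip error. It assumes a simple min/max affine scheme; the helper names (`quantize`, `dequantize`, `max_error`) are hypothetical, and the converter's actual implementation may differ in detail.

```python
import math


def quantize(values, bits):
    """Map floats to integer codes in [0, 2**bits - 1] by min/max affine scaling."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    # Guard against a constant input, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate float values from integer codes."""
    return [c * scale + lo for c in codes]


def max_error(values, bits):
    """Largest absolute difference between the original and round-tripped values."""
    codes, scale, lo = quantize(values, bits)
    restored = dequantize(codes, scale, lo)
    return max(abs(a - b) for a, b in zip(values, restored))


# One period of a sine wave, standing in for a set of weight values.
wave = [math.sin(2 * math.pi * i / 1000) for i in range(1000)]

err8 = max_error(wave, 8)
err16 = max_error(wave, 16)
print(f"max error at 8 bits:  {err8:.6f}")
print(f"max error at 16 bits: {err16:.8f}")
```

Under this scheme the round-trip error is bounded by half a quantization step, so the 16-bit error comes out far smaller than the 8-bit error.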
This demo on quantization consists of four examples:
- housing: this demo evaluates the effect of quantization on the accuracy of a multi-layer perceptron regression model.
- mnist: this demo evaluates the effect of quantization on the accuracy of a relatively small deep convnet trained on the MNIST handwritten digits dataset. Without quantization, the convnet can achieve close-to-perfect (i.e., ~99.5%) test accuracy.
- fashion-mnist: this demo evaluates the effect of quantization on the accuracy of another small deep convnet trained on a problem slightly harder than MNIST. In particular, it is based on the Fashion MNIST dataset. The original, non-quantized model has an accuracy of 92%-93%.
- MobileNetV2: this demo evaluates quantized and non-quantized versions of MobileNetV2 (width = 1.0) on a sample of 1000 images from the ImageNet dataset. This subset is based on the sampling done by https://github.com/ajschumacher/imagen.
In the first three demos, quantizing the weights to 16 or 8 bits does not have any significant effect on the accuracy. In the MobileNetV2 demo, however, quantizing the weights to 8 bits leads to a significant deterioration in accuracy, as measured by the top-1 and top-5 accuracies. See example results in the table below:
Dataset and Model | Original (no-quantization) | 16-bit quantization | 8-bit quantization |
---|---|---|---|
housing: multi-layer regressor | MAE=0.311984 | MAE=0.311983 | MAE=0.312780 |
MNIST: convnet | accuracy=0.9952 | accuracy=0.9952 | accuracy=0.9952 |
Fashion MNIST: convnet | accuracy=0.922 | accuracy=0.922 | accuracy=0.9211 |
MobileNetV2 | top-1 accuracy=0.618; top-5 accuracy=0.788 | top-1 accuracy=0.624; top-5 accuracy=0.789 | top-1 accuracy=0.280; top-5 accuracy=0.490 |
MAE stands for mean absolute error (lower is better).
These results show that the same quantization technique can affect different problems in different ways.
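For readers unfamiliar with the metrics in the table above, here is a minimal sketch of how MAE and top-k accuracy are commonly computed. The function names and the toy data are hypothetical and are not taken from this example's evaluation scripts.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |target - prediction| over all examples (lower is better)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def top_k_accuracy(scores, labels, k):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(labels)


# Toy data: three examples, three classes.
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
toy_labels = [1, 1, 0]

print("MAE:", mean_absolute_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print("top-1:", top_k_accuracy(toy_scores, toy_labels, k=1))
print("top-2:", top_k_accuracy(toy_scores, toy_labels, k=2))
```

Top-5 accuracy, as reported for MobileNetV2, is the same computation with `k=5`.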
An additional factor affecting the over-the-wire size of models under quantization is the gzip ratio. This factor should be taken into account because gzip is widely used to transmit large files over the web.
Most non-quantized models (i.e., models with 32-bit float weights) are not very compressible, because the noise-like variation in their weight parameters contains few repeating patterns. The same is true for models with weights quantized at 16-bit precision. However, when models are quantized at 8-bit precision, there is usually a significant increase in the gzip compression ratio. The `yarn quantize-and-evaluate*` commands in this example (see sections below) not only evaluate accuracy, but also calculate the gzip compression ratio of model files under different levels of quantization. The table below summarizes the compression ratios of the four models covered by this example (higher is better):
gzip compression ratio = (total size of the model.json and weight files) / (size of the gzipped tarball)
Model | Original (no-quantization) | 16-bit quantization | 8-bit quantization |
---|---|---|---|
housing: multi-layer regressor | 1.121 | 1.161 | 1.388 |
MNIST: convnet | 1.082 | 1.037 | 1.184 |
Fashion MNIST: convnet | 1.078 | 1.048 | 1.229 |
MobileNetV2 | 1.085 | 1.063 | 1.271 |
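The ratio defined above can be computed along the following lines. This is a sketch using Python's standard `tarfile` module on in-memory data, not the script this example actually runs; the function name `gzip_ratio` and the file name `weights.bin` are hypothetical.

```python
import io
import random
import tarfile


def gzip_ratio(files):
    """Return (total raw size of the files) / (size of a gzipped tarball of them).

    `files` maps a file name to its contents as bytes.
    """
    raw_size = sum(len(data) for data in files.values())
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    # The tarball is complete once the `with` block has closed it.
    return raw_size / len(buffer.getvalue())


# Noise-like bytes (few repeating patterns) versus highly repetitive bytes.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(100_000))
repetitive = b"\x00" * 100_000

print("noisy ratio:     ", gzip_ratio({"weights.bin": noisy}))
print("repetitive ratio:", gzip_ratio({"weights.bin": repetitive}))
```

Noise-like data yields a ratio close to 1, while data with many repeated values compresses far better, which is the effect that coarser quantization tends to produce.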
To prepare for the housing demo, install the dependencies:
yarn
To train the model from scratch and save it, do:
yarn train-housing
If you are running on a CUDA-compatible Linux system, try training with GPU acceleration:
yarn train-housing --gpu
To perform quantization on the model saved in the `yarn train-housing` step and evaluate the effects on the model's test accuracy, do:
yarn quantize-and-evaluate-housing
To prepare for the MNIST demo, install the dependencies:
yarn
To train the model from scratch and save it, do:
yarn train-mnist
or with CUDA acceleration:
yarn train-mnist --gpu
To perform quantization on the model saved in the `yarn train-mnist` step and evaluate the effects on the model's test accuracy, do:
yarn quantize-and-evaluate-mnist
The command also calculates the ratio of gzip compression for the model's saved artifacts under the three different levels of quantization (no-quantization, 16-bit, and 8-bit).
To prepare for the Fashion MNIST demo, install the dependencies:
yarn
To train the model from scratch and save it, do:
yarn train-fashion-mnist
or with CUDA acceleration:
yarn train-fashion-mnist --gpu
To perform quantization on the model saved in the `yarn train-fashion-mnist` step and evaluate the effects on the model's test accuracy, do:
yarn quantize-and-evaluate-fashion-mnist
Unlike the previous three demos, the MobileNetV2 demo doesn't involve a model training step. Instead, the model is loaded as a Keras application and converted to the TensorFlow.js format for quantization and evaluation.
The non-quantized and quantized versions of MobileNetV2 are evaluated on a sample of 1000 images from the ImageNet dataset. The image files are downloaded from the hosted location on the web. This subset is based on the sampling done by https://github.com/ajschumacher/imagen.
All these steps can be performed with a single command:
yarn quantize-and-evaluate-MobileNetV2