Abstract
AI inference accelerators have drawn extensive attention, but no previous work performs a holistic and systematic benchmarking of them. First, an end-to-end AI inference pipeline consists of six stages spanning both the host and the accelerator, whereas previous work mainly evaluates hardware execution performance, which is only one stage on the accelerator. Second, there is a lack of systematic evaluation of different optimizations on AI inference accelerators. Using six representative AI workloads and a typical AI inference accelerator, Diannao, based on the Cambricon ISA, we implement five frequently used AI inference optimizations as user-configurable hyper-parameters. We explore the optimization space by sweeping the hyper-parameters and quantifying each optimization's effect on the chosen metrics. We also provide cross-platform comparisons between Diannao and traditional platforms (Intel CPUs and Nvidia GPUs). Our evaluation provides several new observations and insights, shedding light on a comprehensive understanding of AI inference accelerators' performance and instructing the co-design of upper-level optimizations and the underlying hardware architecture.
Notes
Common pre-processing includes image decoding, resizing, padding, cropping, channel arrangement, and normalization. Different DNN workloads adopt different pre-processing techniques according to their requirements.
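As a rough illustration, a pre-processing pipeline of this kind could look like the Pillow/NumPy sketch below; the resize and crop sizes and the per-channel normalization constants are common ImageNet-style assumptions for illustration, not the exact settings used by every workload in the paper.

```python
import numpy as np
from PIL import Image

def preprocess(image: Image.Image, size: int = 224) -> np.ndarray:
    """Decode -> resize -> center-crop -> normalize -> channel arrangement."""
    image = image.convert("RGB").resize((256, 256))              # resize
    left = (256 - size) // 2
    image = image.crop((left, left, left + size, left + size))   # center crop
    x = np.asarray(image, dtype=np.float32) / 255.0              # HWC, [0, 1]
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)     # ImageNet-style stats
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    x = (x - mean) / std                                         # normalize per channel
    return x.transpose(2, 0, 1)                                  # HWC -> CHW

if __name__ == "__main__":
    # Synthetic image so the sketch runs without an input file.
    img = Image.fromarray(np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8))
    print(preprocess(img).shape)   # (3, 224, 224)
```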
References
AnandTech: https://www.anandtech.com/show/12815/cambricon-makers-of-huaweis-kirin-npu-ip-build-a-big-ai-chip-and-pcie-card, (2018)
Cambricon: Cambricon cnrt. http://www.cambricon.com/index.php?m=content&c=index&a=lists&catid=71
Cambricon MLU100, http://www.cambricon.com/index.php?c=page&id=20
Chen, W. et al.: Compressing neural networks with the hashing trick. In: Proceedings of the International Conference on Machine Learning, pp. 2285–2294 (2015)
Chen, T., et al.: Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM ASPLOS 49(4), 269–284 (2014)
Courbariaux M et al. (2015) BinaryConnect: Training deep neural networks with binary weights during propagations. In: NeurIPS, pp. 3123–3131
DeepBench, https://github.com/baidu-research/DeepBench
Denil, M., et al.: Predicting parameters in deep learning. Adv Neural Inform Process Syst 26, 2148–2156 (2013)
Dean J et al. (2012) Large scale distributed deep networks. In: Advances in Neural Information Processing Systems 25. Curran Associates, Inc., pp. 1223–1231
Deng J, et al. (2009) Imagenet: A large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition, pp 248–255
Everingham, M. et al. (2012) The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results
Google: Edge-tpu. https://cloud.google.com/edge-tpu
Google: What Makes TPU Fine Tuned to Deep Learning. https://cloud.google.com/blog/products/ai-machine-learning/what-makes-tpus-fine-tuned-for-deep-learning
Gray J (1993) Database and transaction processing performance handbook
Hao T et al. (2018) Edge AIBench: towards comprehensive end-to-end edge computing benchmarking. International Symposium on Benchmarking, Measuring and Optimization, Springer, Cham, pp. 23-30
Han S, Mao H, Dally WJ (2016) Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In: ICLR
Hennessy, J.L., Patterson, D.A.: A new golden age for computer architecture. Commun ACM 62(2), 48–60 (2019)
He K et al. (2015) Deep residual learning for image recognition. CoRR, vol. abs/1512.03385
Huawei: Huawei Ascend 310 Accelerator. http://ascend.huawei.com (2020)
Huang G et al. (2016) Densely connected convolutional networks. CoRR, vol. abs/1608.06993
Howard AG et al. (2017) Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, vol. abs/1704.04861
Iandola FN et al. (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, vol. abs/1602.07360
Jain, S., et al.: Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks. Proc Mach Learn Syst 2, 112–128 (2020)
Jiang Z et al. (2021) HPC AI500 V2.0: The methodology, tools, and metrics for benchmarking HPC AI systems. In: IEEE CLUSTER
Jouppi, N.P. et al.: In-datacenter performance analysis of a tensor processing unit. In: ACM/IEEE ISCA. IEEE, pp. 1–12 (2017)
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105
Lee D, Kim B (2018) Retraining-based iterative weight quantization for deep neural networks. CoRR, vol. abs/1805.11233
Li J. et al.: Characterizing the i/o pipeline in the deployment of cnns on commercial accelerators. IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking. IEEE, pp. 137-144 (2020)
Liu, S., et al.: Cambricon: an instruction set architecture for neural networks. ACM/IEEE ISCA 44(3), 393–405 (2016)
Liu W et al. (2016) SSD: Single shot multibox detector. [Online]. http://arxiv.org/abs/1512.02325
Luo C et al. (2018) AIoT Bench: towards comprehensive benchmarking mobile and embedded device intelligence. In: International Symposium on Benchmarking, Measuring and Optimization. Springer, Cham, pp. 31–35
Ma X et al. (2019) PCONV: the missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices. CoRR, vol. abs/1909.05073
Mishra R et al. (2020) A Survey on Deep Neural Network Compression: Challenges, Overview, and Solutions. CoRR, vol. abs/2010.03954
Mittal D et al. (2018) Recovering from random pruning: On the plasticity of deep convolutional neural networks. CoRR, vol. abs/1801.10447
Niu W et al. (2020) PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning. In: ACM ASPLOS, pp. 907–922
Reddi VJ et al. (2020) Mlperf inference benchmark. In: ACM/IEEE ISCA, pp. 446–459
Sze, V., et al.: How to evaluate deep neural network processors: Tops/w (alone) considered harmful. IEEE Solid-State Circ Mag 12(3), 28–41 (2020)
Tang F et al. (2021) AIBench Training: Balanced industry-standard AI training benchmarking. In: IEEE ISPASS. IEEE Computer Society
Tao, J.-H., et al.: BenchIP: Benchmarking intelligence processors. J Comput Sci Technol 33(1), 1–23 (2018)
Turner J et al. (2018) Characterising across-stack optimisations for deep convolutional neural networks. In: IISWC, pp 101–110
Wang Y et al. A systematic methodology for analysis of deep learning hardware and software platforms. In: Proceedings of Machine Learning and Systems
Williams, S., et al.: Roofline: an insightful visual performance model for multicore architectures. Commun ACM 52(4), 65–76 (2009)
Zhao, R., et al.: Improving neural network quantization without retraining using outlier channel splitting. Ser Proc Mach Learn Res 97, 7543–7552 (2019). (PMLR)
Zhou A, Yao A, Guo Y, Xu L, Chen Y (2017) Incremental network quantization: Towards lossless CNNs with low-precision weights. CoRR, vol. abs/1702.03044. [Online]. http://arxiv.org/abs/1702.03044
Appendix A
A.1 Implementation Details on Diannao
Considering the diversity of network architectures, there is no one-size-fits-all algorithm for quantization and pruning. Research by Jain et al. (2020) shows that some networks need a tailored algorithm or a retraining-based method to compensate for the drop in model quality. Studying more general pruning and quantization algorithms (Mishra et al. 2020) is still an open problem and beyond the scope of this paper. Here we briefly introduce our implementation of pruning and quantization.
A.1.1 Quantization
Diannao is equipped with a large number of INT8-based ALUs. We implement INT8 quantization, which means that the model parameters are stored as 8-bit fixed-point integers instead of the original floating-point numbers (Diannao uses FP16 as its floating-point format). Model parameters are usually composed of three parts: weights, activations, and biases. Considering that biases account for only a small proportion of the overall parameters, we only quantize weights and activations. The computation of quantized parameters can be summarized by the following formula:

$$stored\_integers = \mathrm{round}\left(\frac{real\_number}{scaling\_factor}\right)$$

where \(real\_number\) refers to the parameters before quantization and \(stored\_integers\) refers to the parameters after quantization. The \(scaling\_factor\) prevents overflows or underflows when computing the lower-precision results.
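As an illustration, the following is a minimal NumPy sketch of symmetric per-tensor INT8 quantization consistent with the formula above; the function names and the choice of scaling factor (the largest absolute value mapped to 127) are our own assumptions, not the exact Diannao runtime implementation.

```python
import numpy as np

def quantize_int8(real_numbers: np.ndarray):
    """Map FP weights/activations to 8-bit integers plus a scaling factor.

    Assumption: symmetric per-tensor quantization, with the largest
    absolute value mapped to 127 so the INT8 range does not overflow.
    """
    scaling_factor = np.abs(real_numbers).max() / 127.0
    stored_integers = np.clip(np.round(real_numbers / scaling_factor),
                              -128, 127).astype(np.int8)
    return stored_integers, scaling_factor

def dequantize_int8(stored_integers: np.ndarray, scaling_factor: float):
    """Recover approximate real values from the stored integers."""
    return stored_integers.astype(np.float32) * scaling_factor

if __name__ == "__main__":
    weights = np.random.randn(64, 64).astype(np.float32)   # toy FP weights
    q, s = quantize_int8(weights)
    error = np.abs(weights - dequantize_int8(q, s)).max()
    print(f"scaling_factor={s:.6f}, max quantization error={error:.6f}")
```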
A.1.2 Weight Pruning
Inside Diannao, there are also a large number of sparse computing units. We only prune the weights of convolutional and fully-connected layers, because the weights of these two types of layers occupy most of the parameters of the entire model. Sparsity is a decimal between 0 and 1, referring to the percentage of zero-valued weights in the model; we use it to reflect the effect of the weight pruning optimization. Motivated by Deep Compression (Han et al. 2016), in each convolutional and fully-connected layer, we sort the weights and zero out those with the lowest magnitude according to the target sparsity. To show the effect of weight pruning on model quality and inference throughput, we gradually increase the sparsity from 0.01 to 0.9 with an increment step of 0.01 while keeping other optimizations fixed.
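The sketch below illustrates this magnitude-based, per-layer pruning with NumPy; the function name, the dictionary-of-tensors "model", and the sparsity values shown are illustrative assumptions rather than the exact on-device implementation.

```python
import numpy as np

def prune_layer_weights(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the lowest magnitude."""
    flat = np.abs(weights).ravel()
    num_to_prune = int(sparsity * flat.size)
    if num_to_prune == 0:
        return weights.copy()
    # Magnitude threshold: everything at or below it is set to zero.
    threshold = np.sort(flat)[num_to_prune - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

if __name__ == "__main__":
    # Toy "model": one convolutional and one fully-connected weight tensor.
    model = {
        "conv1": np.random.randn(64, 3, 3, 3).astype(np.float32),
        "fc1": np.random.randn(1000, 512).astype(np.float32),
    }
    for sparsity in (0.1, 0.5, 0.9):   # a few points from the 0.01..0.9 sweep
        pruned = {k: prune_layer_weights(w, sparsity) for k, w in model.items()}
        measured = np.mean([np.mean(w == 0) for w in pruned.values()])
        print(f"target sparsity={sparsity:.2f}, measured zero ratio={measured:.2f}")
```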
A.2 An Example of Efficient Network Deployment
Table 6 presents the best configuration candidates in terms of end-to-end throughput. We obtain these configurations by looking up the database (discussed in Sect. 6.2). To illustrate the trade-off between throughput and model quality, we present four configurations for each workload. The pre-defined target quality serves as the minimum requirement for model quality.
For DenseNet121, the target quality is 0.73. It achieves the highest end-to-end throughput at the configuration (sparse, INT8, 1, 4, 1, 8); however, the model accuracy does not meet the requirement, so this configuration is discarded. The configuration (Dense, FP16, 1, 4, 1, 8), which reaches the second-highest end-to-end throughput, is then chosen since it satisfies the accuracy requirement. We follow the same method to select the best configuration for the remaining workloads.
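As an illustration of this selection rule, the sketch below picks, for each workload, the highest-throughput configuration whose measured quality meets the target; the record layout and the numbers in it are hypothetical stand-ins for the database discussed in Sect. 6.2, not measured results.

```python
from typing import Optional

# Hypothetical records from the optimization database (illustrative numbers):
# each entry holds a configuration tuple, its end-to-end throughput, and quality.
RECORDS = [
    {"workload": "DenseNet121", "config": ("sparse", "INT8", 1, 4, 1, 8),
     "throughput": 1450.0, "quality": 0.70},
    {"workload": "DenseNet121", "config": ("Dense", "FP16", 1, 4, 1, 8),
     "throughput": 1210.0, "quality": 0.74},
]

def select_config(workload: str, target_quality: float) -> Optional[dict]:
    """Return the highest-throughput configuration meeting the quality target."""
    candidates = [r for r in RECORDS
                  if r["workload"] == workload and r["quality"] >= target_quality]
    return max(candidates, key=lambda r: r["throughput"], default=None)

if __name__ == "__main__":
    best = select_config("DenseNet121", target_quality=0.73)
    print(best["config"] if best else "no configuration meets the target quality")
```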