Abstract
AI inference accelerators have drawn extensive attention, but no previous work performs a holistic and systematic benchmarking of them. First, an end-to-end AI inference pipeline consists of six stages spanning both the host and the accelerator, whereas previous work mainly evaluates hardware execution performance, which is only one stage on the accelerator. Second, there is a lack of systematic evaluation of different optimizations on AI inference accelerators. Using six representative AI workloads and a typical AI inference accelerator, Diannao, based on the Cambricon ISA, we implement five frequently used AI inference optimizations as user-configurable hyper-parameters. We explore the optimization space by sweeping the hyper-parameters and quantifying each optimization's effect on the chosen metrics. We also provide cross-platform comparisons between Diannao and traditional platforms (Intel CPUs and Nvidia GPUs). Our evaluation provides several new observations and insights, shedding light on a comprehensive understanding of AI inference accelerators' performance and instructing the co-design of upper-level optimizations and the underlying hardware architecture.
Notes
Common pre-processing includes image decoding, resizing, padding, cropping, channel arrangement, and normalization. Different DNN workloads adopt different pre-processing techniques according to their requirements.
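As a rough illustration, a pre-processing pipeline of this kind could look like the Pillow/NumPy sketch below; the resize and crop sizes and the per-channel normalization constants are common ImageNet-style assumptions for illustration, not the exact settings used by every workload in the paper.

```python
import numpy as np
from PIL import Image

def preprocess(image: Image.Image, size: int = 224) -> np.ndarray:
    """Decode -> resize -> center-crop -> normalize -> channel arrangement."""
    image = image.convert("RGB").resize((256, 256))              # resize
    left = (256 - size) // 2
    image = image.crop((left, left, left + size, left + size))   # center crop
    x = np.asarray(image, dtype=np.float32) / 255.0              # HWC, [0, 1]
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)     # ImageNet-style stats
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    x = (x - mean) / std                                         # normalize per channel
    return x.transpose(2, 0, 1)                                  # HWC -> CHW

if __name__ == "__main__":
    # Synthetic image so the sketch runs without an input file.
    img = Image.fromarray(np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8))
    print(preprocess(img).shape)   # (3, 224, 224)
```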
References
AnandTech: https://www.anandtech.com/show/12815/cambricon-makers-of-huaweis-kirin-npu-ip-build-a-big-ai-chip-and-pcie-card, (2018)
Cambricon: Cambricon cnrt. http://www.cambricon.com/index.php?m=content&c=index&a=lists&catid=71
Cambricon MLU100, http://www.cambricon.com/index.php?c=page&id=20
Chen, W. et al.: Compressing neural networks with the hashing trick. In: Proceedings of the International Conference on Machine Learning, pp. 2285–2294 (2015)
Chen, T., et al.: Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM ASPLOS 49(4), 269–284 (2014)
Courbariaux M et al. (2015) BinaryConnect: Training deep neural networks with binary weights during propagations. In: NeurIPS, pp. 3123–3131
DeepBench, https://github.com/baidu-research/DeepBench
Denil, M., et al.: Predicting parameters in deep learning. Adv Neural Inform Process Syst 26, 2148–2156 (2013)
Dean J et al. (2012) Large scale distributed deep networks. In: Advances in Neural Information Processing Systems 25. Curran Associates, Inc., pp. 1223–1231
Deng J, et al. (2009) Imagenet: A large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition, pp 248–255
Everingham, M. et al. (2012) The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results
Google: Edge-tpu. https://cloud.google.com/edge-tpu
Google: What Makes TPU Fine Tuned to Deep Learning. https://cloud.google.com/blog/products/ai-machine-learning/what-makes-tpus-fine-tuned-for-deep-learning
Gray J (1993) Database and transaction processing performance handbook
Hao T et al. (2018) Edge AIBench: towards comprehensive end-to-end edge computing benchmarking. International Symposium on Benchmarking, Measuring and Optimization, Springer, Cham, pp. 23-30
Han S, Mao H, Dally WJ (2016) Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In: ICLR
Hennessy, J.L., Patterson, D.A.: A new golden age for computer architecture. Commun ACM 62(2), 48–60 (2019)
He K et al. (2015) Deep residual learning for image recognition. CoRR, vol. abs/1512.03385
Huawei: Huawei Ascend 310 Accelerator. http://ascend.huawei.com (2020)
Huang G et al. (2016) Densely connected convolutional networks. CoRR, vol. abs/1608.06993
Howard AG et al. (2017) Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, vol. abs/1704.04861
Iandola FN et al. (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, vol. abs/1602.07360
Jain, S., et al.: Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks. Proc Mach Learn Syst 2, 112–128 (2020)
Jiang Z et al. (2021) HPC AI500 V2.0: The methodology, tools, and metrics for benchmarking HPC AI systems. In: IEEE CLUSTER
Jouppi, N.P. et al.: In-datacenter performance analysis of a tensor processing unit. In: ACM/IEEE ISCA. IEEE, pp. 1–12 (2017)
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105
Lee D, Kim B (2018) Retraining-based iterative weight quantization for deep neural networks. CoRR, vol. abs/1805.11233
Li J. et al.: Characterizing the i/o pipeline in the deployment of cnns on commercial accelerators. IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking. IEEE, pp. 137-144 (2020)
Liu, S., et al.: Cambricon: an instruction set architecture for neural networks. ACM/IEEE ISCA 44(3), 393–405 (2016)
Liu W et al. (2016) SSD: Single shot multibox detector. [Online]. http://arxiv.org/abs/1512.02325
Luo C et al. (2018) AIoT Bench: towards comprehensive benchmarking mobile and embedded device intelligence. In: International Symposium on Benchmarking, Measuring and Optimization. Springer, Cham, pp. 31–35
Ma X et al. (2019) PCONV: the missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices. CoRR, vol. abs/1909.05073
Mishra R et al. (2020) A Survey on Deep Neural Network Compression: Challenges, Overview, and Solutions. CoRR, vol. abs/2010.03954
Mittal D et al. (2018) Recovering from random pruning: On the plasticity of deep convolutional neural networks. CoRR, vol. abs/1801.10447
Niu W et al. (2020) PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning. In: ACM ASPLOS, pp. 907–922
Reddi VJ et al. (2020) Mlperf inference benchmark. In: ACM/IEEE ISCA, pp. 446–459
Sze, V., et al.: How to evaluate deep neural network processors: Tops/w (alone) considered harmful. IEEE Solid-State Circ Mag 12(3), 28–41 (2020)
Tang F et al. (2021) AIBench Training: Balanced industry-standard AI training benchmarking. In: IEEE ISPASS. IEEE Computer Society
Tao, J.-H., et al.: BenchIP: Benchmarking intelligence processors. J Comput Sci Technol 33(1), 1–23 (2018)
Turner J et al. (2018) Characterising across-stack optimisations for deep convolutional neural networks. In: IISWC, pp 101–110
Wang Y et al. A systematic methodology for analysis of deep learning hardware and software platforms. In: Proceedings of Machine Learning and Systems
Williams, S., et al.: Roofline: an insightful visual performance model for multicore architectures. Commun ACM 52(4), 65–76 (2009)
Zhao, R., et al.: Improving neural network quantization without retraining using outlier channel splitting. Ser Proc Mach Learn Res 97, 7543–7552 (2019). (PMLR)
Zhou A, Yao A, Guo Y, Xu L, Chen Y (2017) Incremental network quantization: Towards lossless CNNs with low-precision weights. CoRR, vol. abs/1702.03044. [Online]. http://arxiv.org/abs/1702.03044
Appendix A
A.1 Implementation Details on Diannao
Considering the diversity of network architectures, there is no one-size-fits-all algorithm for quantization and pruning. Research by Jain et al. (2020) shows that some networks need a tailored algorithm or a retraining-based method to compensate for the drop in model quality. Studying more general pruning and quantization algorithms (Mishra et al. 2020) is still an open problem and beyond the scope of this paper. Here we briefly introduce our implementation of pruning and quantization.
A.1.1 Quantization
Diannao is equipped with a large number of INT8-based ALUs. We implement INT8 quantization, which means that the model parameters are stored as 8-bit fixed-point integers instead of the original floating-point numbers (Diannao uses FP16 as its floating-point format). Model parameters are usually composed of three parts: weights, activations, and biases. Considering that biases account for only a small proportion of the overall parameters, we only quantize weights and activations. The computation of quantized parameters can be summarized by the following formula:

$$stored\_integers = \mathrm{round}\left(\frac{real\_number}{scaling\_factor}\right)$$

where \(real\_number\) refers to the parameters before quantization and \(stored\_integers\) refers to the parameters after quantization. The \(scaling\_factor\) prevents overflows or underflows when computing the lower-precision results.
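As an illustration, the following is a minimal NumPy sketch of symmetric per-tensor INT8 quantization consistent with the formula above; the function names and the choice of scaling factor (the largest absolute value mapped to 127) are our own assumptions, not the exact Diannao runtime implementation.

```python
import numpy as np

def quantize_int8(real_numbers: np.ndarray):
    """Map FP weights/activations to 8-bit integers plus a scaling factor.

    Assumption: symmetric per-tensor quantization, with the largest
    absolute value mapped to 127 so the INT8 range does not overflow.
    """
    scaling_factor = np.abs(real_numbers).max() / 127.0
    stored_integers = np.clip(np.round(real_numbers / scaling_factor),
                              -128, 127).astype(np.int8)
    return stored_integers, scaling_factor

def dequantize_int8(stored_integers: np.ndarray, scaling_factor: float):
    """Recover approximate real values from the stored integers."""
    return stored_integers.astype(np.float32) * scaling_factor

if __name__ == "__main__":
    weights = np.random.randn(64, 64).astype(np.float32)   # toy FP weights
    q, s = quantize_int8(weights)
    error = np.abs(weights - dequantize_int8(q, s)).max()
    print(f"scaling_factor={s:.6f}, max quantization error={error:.6f}")
```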
A.1.2 Weight Pruning
Inside Diannao, there are also a large number of sparse computing units. We only prune the weights of convolutional and fully-connected layers, because the weights of these two types of layers occupy most of the parameters of the entire model. Sparsity is a decimal between 0 and 1, referring to the percentage of zero-valued weights in the model; we use it to reflect the effect of the weight pruning optimization. Motivated by Deep Compression (Han et al. 2016), in each convolutional and fully-connected layer, we sort the weights and zero out those with the lowest magnitude according to the target sparsity. To show the effect of weight pruning on model quality and inference throughput, we gradually increase the sparsity from 0.01 to 0.9 with an increment step of 0.01 while keeping other optimizations fixed.
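The sketch below illustrates this magnitude-based, per-layer pruning with NumPy; the function name, the dictionary-of-tensors "model", and the sparsity values shown are illustrative assumptions rather than the exact on-device implementation.

```python
import numpy as np

def prune_layer_weights(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the lowest magnitude."""
    flat = np.abs(weights).ravel()
    num_to_prune = int(sparsity * flat.size)
    if num_to_prune == 0:
        return weights.copy()
    # Magnitude threshold: everything at or below it is set to zero.
    threshold = np.sort(flat)[num_to_prune - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

if __name__ == "__main__":
    # Toy "model": one convolutional and one fully-connected weight tensor.
    model = {
        "conv1": np.random.randn(64, 3, 3, 3).astype(np.float32),
        "fc1": np.random.randn(1000, 512).astype(np.float32),
    }
    for sparsity in (0.1, 0.5, 0.9):   # a few points from the 0.01..0.9 sweep
        pruned = {k: prune_layer_weights(w, sparsity) for k, w in model.items()}
        measured = np.mean([np.mean(w == 0) for w in pruned.values()])
        print(f"target sparsity={sparsity:.2f}, measured zero ratio={measured:.2f}")
```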
A.2 An Example of Efficient Network Deployment
Table 6 presents the best configuration candidates in terms of end-to-end throughput. We obtain these configurations by looking up the database (discussed in Sect. 6.2). To illustrate the trade-off between throughput and model quality, we present four configurations for each workload. The pre-defined target quality serves as the minimum requirement for model quality.
For DenseNet121, the target quality is 0.73. It achieves the highest end-to-end throughput at the configuration (sparse, INT8, 1, 4, 1, 8); however, the model accuracy does not meet the requirement, so this configuration is discarded. The configuration (Dense, FP16, 1, 4, 1, 8), which reaches the second-highest end-to-end throughput, is then chosen since it satisfies the accuracy requirement. We follow the same method to select the best configuration for the remaining workloads.
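As an illustration of this selection rule, the sketch below picks, for each workload, the highest-throughput configuration whose measured quality meets the target; the record layout and the numbers in it are hypothetical stand-ins for the database discussed in Sect. 6.2, not measured results.

```python
from typing import Optional

# Hypothetical records from the optimization database (illustrative numbers):
# each entry holds a configuration tuple, its end-to-end throughput, and quality.
RECORDS = [
    {"workload": "DenseNet121", "config": ("sparse", "INT8", 1, 4, 1, 8),
     "throughput": 1450.0, "quality": 0.70},
    {"workload": "DenseNet121", "config": ("Dense", "FP16", 1, 4, 1, 8),
     "throughput": 1210.0, "quality": 0.74},
]

def select_config(workload: str, target_quality: float) -> Optional[dict]:
    """Return the highest-throughput configuration meeting the quality target."""
    candidates = [r for r in RECORDS
                  if r["workload"] == workload and r["quality"] >= target_quality]
    return max(candidates, key=lambda r: r["throughput"], default=None)

if __name__ == "__main__":
    best = select_config("DenseNet121", target_quality=0.73)
    print(best["config"] if best else "no configuration meets the target quality")
```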