research-article

Open access

Performance Optimization of the HPCG Benchmark on the Sunway TaihuLight Supercomputer

Authors:

Qiao Sun

ACM Transactions on Architecture and Code Optimization (TACO), Volume 15, Issue 1

Article No.: 11, Pages 1 - 20

https://doi.org/10.1145/3182177

Published: 22 March 2018

Abstract

In this article, we present some key techniques for optimizing HPCG on Sunway TaihuLight and demonstrate how to achieve high performance in memory-bound applications by exploiting specific characteristics of the hardware architecture. In particular, we utilize a block multicoloring approach for parallelization and propose methods such as requirement-based data mapping and customized gather collective to enhance the effective memory bandwidth. Experiments indicate that the optimized HPCG code can sustain 77% of the theoretical memory bandwidth and scale to the full system of more than 10 million cores, with an aggregated performance of 480.8 Tflop/s and a weak scaling efficiency of 87.3%.
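
The abstract names a block multicoloring approach but does not describe it on this page. As a rough illustration of the general idea only, the sketch below colors the blocks of a 3D structured grid by the parity of their block coordinates. It assumes a 27-point stencil (as in the HPCG problem) and blocks at least one grid point wide, so that a point only couples to points in its own block or in a touching block. The function names `block_color`, `color_blocks`, and `blocks_adjacent` are hypothetical, and the eight-color parity scheme is an assumption chosen for simplicity, not necessarily the authors' scheme.

```python
from itertools import product


def block_color(bx, by, bz):
    """Return one of 8 colors from the parity of the block coordinates."""
    return (bx % 2) + 2 * (by % 2) + 4 * (bz % 2)


def color_blocks(nbx, nby, nbz):
    """Group all block coordinates of an nbx x nby x nbz block grid by color."""
    groups = {c: [] for c in range(8)}
    for b in product(range(nbx), range(nby), range(nbz)):
        groups[block_color(*b)].append(b)
    return groups


def blocks_adjacent(a, b):
    """True if two distinct blocks touch by a face, an edge, or a corner."""
    return a != b and all(abs(p - q) <= 1 for p, q in zip(a, b))


# Hypothetical 4 x 4 x 4 grid of blocks (assumption, for illustration only).
groups = color_blocks(4, 4, 4)

# Blocks sharing a color never touch, so a Gauss-Seidel style sweep could
# process one color at a time and update all blocks of that color in parallel.
for color, blocks in groups.items():
    print(color, len(blocks))
```

Two blocks with the same color either coincide or differ by at least two in some block coordinate, which is why no same-colored blocks touch under these assumptions.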

Recommendations

Enabling and scaling the HPCG benchmark on the newest generation Sunway supercomputer with 42 million heterogeneous cores
SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

We study and evaluate performance optimization techniques for the HPCG benchmark on the newest generation Sunway supercomputer. Specifically, a two-level blocking scheme is proposed to expose adequate parallelism in the symmetric Gauss-Seidel kernel ...

Optimizing Convolutional Neural Networks on the Sunway TaihuLight Supercomputer

The Sunway TaihuLight supercomputer is powered by SW26010, a new 260-core processor designed with on-chip fusion of heterogeneous cores. In this article, we present our work on optimizing the training process of convolutional neural networks (CNNs) on ...

A hierarchical grid algorithm for accelerating high-performance conjugate gradient benchmark on sunway many-core processor
ICCIP '17: Proceedings of the 3rd International Conference on Communication and Information Processing

This paper presents analysis and optimizations for High Performance Conjugate Gradient benchmark (HPCG) on the Sunway many-core processor. For modern multi-core and many-core processors, HPCG always presents a poor performance and under-utilizes ...

Published In

ACM Transactions on Architecture and Code Optimization Volume 15, Issue 1

March 2018

401 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3199680

Editor:
Koen De Bosschere
Ghent University

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 March 2018

Accepted: 01 December 2017

Revised: 01 November 2017

Received: 01 September 2017

Published in TACO Volume 15, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

CAS Key Research Program of Frontier Sciences
Natural Science Foundation of China
National Key R&D Plan of China

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 19
Total Downloads: 2,075
Downloads (Last 12 months): 412
Downloads (Last 6 weeks): 31

Reflects downloads up to 03 Sep 2024

Citations

Cited By

Tan Y, Nishimura M, Abdelhamid R, Guo B, Gao Q, Yamaguchi Y. (2024). Systolic Array-Based Many-Core Processor with Simultaneous Dual-Instruction Issuance. Proceedings of the 14th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies, 119-125. DOI: 10.1145/3665283.3665298. Online publication date: 19-Jun-2024
https://dl.acm.org/doi/10.1145/3665283.3665298
Yuan F, Yang X, Li S, Dong D, Huang C, Wang Z. (2024). Optimizing Multi-Grid Preconditioned Conjugate Gradient Method on Multi-Cores. IEEE Transactions on Parallel and Distributed Systems 35:5, 768-779. DOI: 10.1109/TPDS.2024.3372473. Online publication date: May-2024
https://doi.org/10.1109/TPDS.2024.3372473
Wang Y, Liu J, Zhu X, Zhang Q, Li S, Wang Q. (2023). Improving Structured Grid-Based Sparse Matrix-Vector Multiplication and Gauss–Seidel Iteration on GPDSP. Applied Sciences 13:15, 8952. DOI: 10.3390/app13158952. Online publication date: 3-Aug-2023
https://doi.org/10.3390/app13158952
Ye Y, Guo H, Wang B, Wang P, Chen D, Li F. (2023). Coupled Incomplete Cholesky and Jacobi Preconditioned Conjugate Gradient on the New Generation of Sunway Many-Core Architecture. IEEE Transactions on Computers 72:11, 3326-3339. DOI: 10.1109/TC.2023.3296884. Online publication date: 1-Nov-2023
https://doi.org/10.1109/TC.2023.3296884
Zhang Z, Liu J, Li S, Yuan F, Zhu X, Wang Q, Zhang J. (2023). Optimizing 3D Mantle Convection Simulations on Multi-cores. 2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 154-161. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys60770.2023.00030. Online publication date: 17-Dec-2023
https://doi.org/10.1109/HPCC-DSS-SmartCity-DependSys60770.2023.00030
Pan J, Xiao L, Tian M, Liu T, Wang Y. (2023). swParaFEM: a highly efficient parallel finite element solver on Sunway many-core architecture. The Journal of Supercomputing 79:10, 11427-11451. DOI: 10.1007/s11227-023-05114-5. Online publication date: 1-Jul-2023
https://dl.acm.org/doi/10.1007/s11227-023-05114-5
Tan J, Pang J, Liu C. (2022). What Factors Affect the Performance of Software after Migration: A Case Study on Sunway TaihuLight Supercomputer. IEICE Transactions on Information and Systems E105.D:1, 26-30. DOI: 10.1587/transinf.2021MPL0003. Online publication date: 1-Jan-2022
https://doi.org/10.1587/transinf.2021MPL0003
Agrawal T, Malakar P. (2022). IPMPI: Improved MPI Communication Logger. 2022 IEEE/ACM International Workshop on Exascale MPI (ExaMPI), 31-40. DOI: 10.1109/ExaMPI56604.2022.00009. Online publication date: Nov-2022
https://doi.org/10.1109/ExaMPI56604.2022.00009
Zhang M, Yang H, Wu S, Sun S. (2022). Parallel multilevel domain decomposition preconditioners for monolithic solution of non-isothermal flow in reservoir simulation. Computers & Fluids 232, 105183. DOI: 10.1016/j.compfluid.2021.105183. Online publication date: Jan-2022
https://doi.org/10.1016/j.compfluid.2021.105183
Chen H, Wang Z, Xiao X, Li J, Dong X, Zhang X. (2022). SunwayURANS: 3D full-annulus URANS simulations of transonic axial compressors on Sunway TaihuLight. The Journal of Supercomputing 78:17, 19167-19187. DOI: 10.1007/s11227-022-04628-8. Online publication date: 1-Nov-2022
https://dl.acm.org/doi/10.1007/s11227-022-04628-8
