research-article
Open access

Performance Optimization of the HPCG Benchmark on the Sunway TaihuLight Supercomputer

Published: 22 March 2018

Abstract

In this article, we present some key techniques for optimizing HPCG on Sunway TaihuLight and demonstrate how to achieve high performance in memory-bound applications by exploiting specific characteristics of the hardware architecture. In particular, we utilize a block multicoloring approach for parallelization and propose methods such as requirement-based data mapping and customized gather collective to enhance the effective memory bandwidth. Experiments indicate that the optimized HPCG code can sustain 77% of the theoretical memory bandwidth and scale to the full system of more than 10 million cores, with an aggregated performance of 480.8 Tflop/s and a weak scaling efficiency of 87.3%.
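
As a rough illustration of the block multicoloring idea named in the abstract, the sketch below partitions a 3D grid into blocks and gives each block one of eight colors based on the parity of its block coordinates. It assumes a 27-point stencil (the stencil used by the HPCG reference problem) and checks that no two distinct blocks of the same color contain neighboring points, so all blocks of one color could be swept in parallel. The function names, grid size, and block shapes are hypothetical choices for illustration; the article's actual coloring scheme and data layout are not described on this page and may differ.

```python
from itertools import product


def block_color(i, j, k, block=(2, 2, 2)):
    """Return (block index, color) for grid point (i, j, k).

    The color is built from the parity of each block coordinate,
    which yields at most 8 colors. This is an assumed scheme, not
    necessarily the one used in the article.
    """
    bx, by, bz = i // block[0], j // block[1], k // block[2]
    color = (bx % 2) + 2 * (by % 2) + 4 * (bz % 2)
    return (bx, by, bz), color


def same_color_blocks_are_independent(n=6, block=(2, 2, 2)):
    """Check that 27-point-stencil neighbors in different blocks differ in color."""
    offsets = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]
    for i, j, k in product(range(n), repeat=3):
        blk, color = block_color(i, j, k, block)
        for di, dj, dk in offsets:
            ni, nj, nk = i + di, j + dj, k + dk
            if not (0 <= ni < n and 0 <= nj < n and 0 <= nk < n):
                continue  # neighbor lies outside the grid
            nblk, ncolor = block_color(ni, nj, nk, block)
            if nblk != blk and ncolor == color:
                return False  # two distinct same-color blocks touch
    return True


def colors_used(n=6, block=(2, 2, 2)):
    """Return the sorted list of colors that appear on an n x n x n grid."""
    points = product(range(n), repeat=3)
    return sorted({block_color(i, j, k, block)[1] for i, j, k in points})


if __name__ == "__main__":
    print(same_color_blocks_are_independent())
    print(colors_used())
```

Because stencil neighbors differ by at most one in each coordinate, their block indices also differ by at most one, while two distinct blocks of equal color differ by at least two in some coordinate. That is why the check passes for any positive block shape under this parity coloring.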


Information & Contributors

Information

Published In

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 1
March 2018
401 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3199680
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 March 2018
Accepted: 01 December 2017
Revised: 01 November 2017
Received: 01 September 2017
Published in TACO Volume 15, Issue 1

Author Tags

  1. HPCG
  2. Sunway TaihuLight
  3. heterogeneous many-core processor
  4. performance optimization

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • CAS Key Research Program of Frontier Sciences
  • Natural Science Foundation of China
  • National Key R&D Plan of China

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 412
  • Downloads (Last 6 weeks): 31
Reflects downloads up to 03 Sep 2024

Citations

Cited By

  • (2024) Systolic Array-Based Many-Core Processor with Simultaneous Dual-Instruction Issuance. Proceedings of the 14th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies, 119-125. DOI: 10.1145/3665283.3665298. Online publication date: 19-Jun-2024
  • (2024) Optimizing Multi-Grid Preconditioned Conjugate Gradient Method on Multi-Cores. IEEE Transactions on Parallel and Distributed Systems 35:5, 768-779. DOI: 10.1109/TPDS.2024.3372473. Online publication date: May-2024
  • (2023) Improving Structured Grid-Based Sparse Matrix-Vector Multiplication and Gauss–Seidel Iteration on GPDSP. Applied Sciences 13:15, 8952. DOI: 10.3390/app13158952. Online publication date: 3-Aug-2023
  • (2023) Coupled Incomplete Cholesky and Jacobi Preconditioned Conjugate Gradient on the New Generation of Sunway Many-Core Architecture. IEEE Transactions on Computers 72:11, 3326-3339. DOI: 10.1109/TC.2023.3296884. Online publication date: 1-Nov-2023
  • (2023) Optimizing 3D Mantle Convection Simulations on Multi-cores. 2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 154-161. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys60770.2023.00030. Online publication date: 17-Dec-2023
  • (2023) swParaFEM: a highly efficient parallel finite element solver on Sunway many-core architecture. The Journal of Supercomputing 79:10, 11427-11451. DOI: 10.1007/s11227-023-05114-5. Online publication date: 1-Jul-2023
  • (2022) What Factors Affect the Performance of Software after Migration: A Case Study on Sunway TaihuLight Supercomputer. IEICE Transactions on Information and Systems E105.D:1, 26-30. DOI: 10.1587/transinf.2021MPL0003. Online publication date: 1-Jan-2022
  • (2022) IPMPI: Improved MPI Communication Logger. 2022 IEEE/ACM International Workshop on Exascale MPI (ExaMPI), 31-40. DOI: 10.1109/ExaMPI56604.2022.00009. Online publication date: Nov-2022
  • (2022) Parallel multilevel domain decomposition preconditioners for monolithic solution of non-isothermal flow in reservoir simulation. Computers & Fluids 232, 105183. DOI: 10.1016/j.compfluid.2021.105183. Online publication date: Jan-2022
  • (2022) SunwayURANS: 3D full-annulus URANS simulations of transonic axial compressors on Sunway TaihuLight. The Journal of Supercomputing 78:17, 19167-19187. DOI: 10.1007/s11227-022-04628-8. Online publication date: 1-Nov-2022
