Abstract
We have been developing a proprietary interconnect technology called the Tightly Coupled Accelerators (TCA) architecture to improve communication latency and bandwidth between accelerators (GPUs) on different nodes. This paper presents a Conjugate Gradient (CG) benchmark implementation using the TCA and the results of a performance evaluation on the HA-PACS/TCA system, a proof-of-concept GPU cluster based on the TCA concept. The implementation is based on the CG benchmark in the NAS Parallel Benchmarks, and it is parallelized by a two-dimensional decomposition of the matrix data. Using the TCA improves communication performance over an MPI/InfiniBand implementation for small benchmark classes. This study also shows that a CG implementation with a two-dimensional decomposition exploits the TCA interconnect more effectively than one with a one-dimensional decomposition.
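To make the structure concrete, below is a minimal serial sketch of the CG iteration underlying the benchmark; it is not the authors' code, and the matrix, problem size, and tolerance are illustrative stand-ins. The comments mark the steps that become inter-node communication (over TCA/PEACH2 or MPI/InfiniBand) under the two-dimensional decomposition.

/* Minimal serial sketch of the CG iteration (illustrative, not the NPB code).
 * Comments mark the operations that the 2-D decomposition turns into
 * inter-node communication in the paper's parallel implementation. */
#include <stdio.h>
#include <math.h>

#define N 4
#define MAX_ITER 25

/* y = A * x (a dense stand-in for the benchmark's sparse matvec) */
static void matvec(double A[N][N], const double x[N], double y[N]) {
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
    }
}

static double dot(const double x[N], const double y[N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += x[i] * y[i];
    return s;
}

int main(void) {
    /* Symmetric positive definite example matrix */
    double A[N][N] = {{4, 1, 0, 0},
                      {1, 4, 1, 0},
                      {0, 1, 4, 1},
                      {0, 0, 1, 4}};
    double b[N] = {1, 2, 3, 4};
    double x[N] = {0}, r[N], p[N], q[N];

    /* r = b - A*x; with x = 0, r = b */
    for (int i = 0; i < N; i++) { r[i] = b[i]; p[i] = r[i]; }
    double rho = dot(r, r);              /* parallel: allreduce over all ranks */

    for (int iter = 0; iter < MAX_ITER && sqrt(rho) > 1e-12; iter++) {
        matvec(A, p, q);                 /* parallel: reduce partial results
                                            across a process row, then exchange
                                            q to match the distribution of p;
                                            this is the communication the 2-D
                                            decomposition maps onto TCA links */
        double alpha = rho / dot(p, q);  /* parallel: allreduce */
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        double rho_new = dot(r, r);      /* parallel: allreduce */
        double beta = rho_new / rho;
        rho = rho_new;
        for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
    }

    printf("x = [%g %g %g %g]\n", x[0], x[1], x[2], x[3]);
    return 0;
}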
Notes
1. The CG benchmark program is originally written in Fortran.
2. This is because using two or more sub-clusters entails a hybrid utilization of the TCA/PEACH2 and MPI/IB, and because using two or more GPUs requires additional considerations to use the TCA/PEACH2 effectively. Both the hybrid utilization and the multi-GPU usage are left as future work.
3. The theoretical peak bandwidth of dual-rail InfiniBand QDR is 8 GB/s, which is equivalent to that of PCIe Gen3 x8 (a back-of-the-envelope check follows these notes).
4. Similar performance results were reported for the NPB-LU benchmark by Pennycook et al. [10].
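As a rough check of the figures in note 3, derived from the standard link parameters rather than taken from the paper:

\[
\underbrace{2 \times 4 \times 10\,\mathrm{Gb/s} \times \tfrac{8}{10}}_{\text{dual-rail IB QDR (8b/10b)}} = 64\,\mathrm{Gb/s} = 8\,\mathrm{GB/s},
\qquad
\underbrace{8 \times 8\,\mathrm{GT/s} \times \tfrac{128}{130}}_{\text{PCIe Gen3 x8 (128b/130b)}} \approx 63\,\mathrm{Gb/s} \approx 7.9\,\mathrm{GB/s}.
\]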
References
Bailey, D.H., Schreiber, R.S., Simon, H.D., Venkatakrishnan, V., Weeratunga, S.K., Barszcz, E., Barton, J.T., Browning, D.S., Carter, R.L., Dagum, L., Fatoohi, R.A., Frederickson, P.O., Lasinski, T.A.: The NAS parallel benchmarks - summary and preliminary results. In: Proceedings of SC 1991, pp. 158–165 (1991)
Grewe, D., Wang, Z., O’Boyle, M.F.P.: Portable mapping of data parallel programs to OpenCL for heterogeneous systems. In: Proceedings of CGO 2013, pp. 1–10. IEEE (2013)
Hanawa, T., Fujii, H., Fujita, N., Odajima, T., Matsumoto, K., Boku, T.: Evaluation of FFT for GPU cluster using tightly coupled accelerators architecture. In: Proceedings of Cluster 2015, pp. 635–641. IEEE (2015)
Hanawa, T., Kodama, Y., Boku, T., Sato, M.: Tightly coupled accelerators architecture for minimizing communication latency among accelerators. In: Proceedings of IPDPSW 2013, pp. 1030–1039. IEEE (2013)
Kodama, Y., Hanawa, T., Boku, T., Sato, M.: PEACH2: an FPGA-based PCIe network device for tightly coupled accelerators. ACM SIGARCH Comput. Architect. News 42(4), 3–8 (2014)
Lee, S., Vetter, J.S.: Early evaluation of directive-based GPU programming models for productive exascale computing. In: Proceedings of SC 2012 (2012)
Matsumoto, K., Hanawa, T., Kodama, Y., Fujii, H., Boku, T.: Implementation of CG method on GPU cluster with proprietary interconnect TCA for GPU direct communication. In: Proceedings of IPDPSW 2015, pp. 647–655. IEEE (2015)
NVIDIA: NVIDIA GPUDirect. https://developer.nvidia.com/gpudirect. Accessed 25 Aug 2016
Panda, D.K.: MVAPICH2-GDR (MVAPICH2 with GPUDirect RDMA). http://mvapich.cse.ohio-state.edu/overview/. Accessed 25 Aug 2016
Pennycook, S.J., Hammond, S.D., Jarvis, S.A., Mudalige, G.R.: Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark. SIGMETRICS Perform. Eval. Rev. 38(4), 23–29 (2011)
Xu, R., Tian, X., Chandrasekaran, S., Yan, Y., Chapman, B.: NAS parallel benchmarks for GPGPUs using a directive-based programming model. In: Brodman, J., Tu, P. (eds.) LCPC 2014. LNCS, vol. 8967, pp. 67–81. Springer, Cham (2015). doi:10.1007/978-3-319-17473-0_5
Acknowledgements
The present study was supported by the Japan Science and Technology Agency's CREST program entitled "Research and Development of Unified Environment on Accelerated Computing and Interconnection for Post-Petascale Era." The authors would like to thank the Center for Computational Sciences, University of Tsukuba for allowing us to use the HA-PACS/TCA system as part of the Interdisciplinary Collaborative Research Program.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Matsumoto, K., Fujita, N., Hanawa, T., Boku, T. (2017). Implementation and Evaluation of NAS Parallel CG Benchmark on GPU Cluster with Proprietary Interconnect TCA. In: Dutra, I., Camacho, R., Barbosa, J., Marques, O. (eds) High Performance Computing for Computational Science – VECPAR 2016. VECPAR 2016. Lecture Notes in Computer Science(), vol 10150. Springer, Cham. https://doi.org/10.1007/978-3-319-61982-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61981-1
Online ISBN: 978-3-319-61982-8