Abstract
We have been developing a proprietary interconnect technology called the Tightly Coupled Accelerators (TCA) architecture to improve communication latency and bandwidth between accelerators (GPUs) on different nodes. This paper presents a Conjugate Gradient (CG) benchmark implementation using the TCA and the results of a performance evaluation on the HA-PACS/TCA system, a proof-of-concept GPU cluster based on the TCA concept. The implementation is based on the CG benchmark in the NAS Parallel Benchmarks, and it is parallelized by a two-dimensional decomposition of the matrix data. Using the TCA improves communication performance over an MPI/InfiniBand implementation for small benchmark classes. This study also shows that a CG implementation with a two-dimensional decomposition exploits the TCA interconnect more effectively than one with a one-dimensional decomposition.
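To make the structure concrete, below is a minimal serial sketch of the CG iteration underlying the benchmark; it is not the authors' code, and the matrix, problem size, and tolerance are illustrative stand-ins. The comments mark the steps that become inter-node communication (over TCA/PEACH2 or MPI/InfiniBand) under the two-dimensional decomposition.

/* Minimal serial sketch of the CG iteration (illustrative, not the NPB code).
 * Comments mark the operations that the 2-D decomposition turns into
 * inter-node communication in the paper's parallel implementation. */
#include <stdio.h>
#include <math.h>

#define N 4
#define MAX_ITER 25

/* y = A * x (a dense stand-in for the benchmark's sparse matvec) */
static void matvec(double A[N][N], const double x[N], double y[N]) {
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
    }
}

static double dot(const double x[N], const double y[N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += x[i] * y[i];
    return s;
}

int main(void) {
    /* Symmetric positive definite example matrix */
    double A[N][N] = {{4, 1, 0, 0},
                      {1, 4, 1, 0},
                      {0, 1, 4, 1},
                      {0, 0, 1, 4}};
    double b[N] = {1, 2, 3, 4};
    double x[N] = {0}, r[N], p[N], q[N];

    /* r = b - A*x; with x = 0, r = b */
    for (int i = 0; i < N; i++) { r[i] = b[i]; p[i] = r[i]; }
    double rho = dot(r, r);              /* parallel: allreduce over all ranks */

    for (int iter = 0; iter < MAX_ITER && sqrt(rho) > 1e-12; iter++) {
        matvec(A, p, q);                 /* parallel: reduce partial results
                                            across a process row, then exchange
                                            q to match the distribution of p;
                                            this is the communication the 2-D
                                            decomposition maps onto TCA links */
        double alpha = rho / dot(p, q);  /* parallel: allreduce */
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        double rho_new = dot(r, r);      /* parallel: allreduce */
        double beta = rho_new / rho;
        rho = rho_new;
        for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
    }

    printf("x = [%g %g %g %g]\n", x[0], x[1], x[2], x[3]);
    return 0;
}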
Notes
1. The CG benchmark program is originally written in Fortran.
2. This is because using two or more sub-clusters entails a hybrid utilization of the TCA/PEACH2 and MPI/IB, and because using two or more GPUs requires additional considerations to use the TCA/PEACH2 effectively. Both the hybrid utilization and the multi-GPU usage are left as future work.
3. The theoretical peak bandwidth of dual-rail InfiniBand QDR is 8 GB/s, which is equivalent to that of PCIe Gen3 x8 (a back-of-the-envelope check follows these notes).
4. Similar performance results were reported for the NPB-LU benchmark by Pennycook et al. [10].
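As a rough check of the figures in note 3, derived from the standard link parameters rather than taken from the paper:

\[
\underbrace{2 \times 4 \times 10\,\mathrm{Gb/s} \times \tfrac{8}{10}}_{\text{dual-rail IB QDR (8b/10b)}} = 64\,\mathrm{Gb/s} = 8\,\mathrm{GB/s},
\qquad
\underbrace{8 \times 8\,\mathrm{GT/s} \times \tfrac{128}{130}}_{\text{PCIe Gen3 x8 (128b/130b)}} \approx 63\,\mathrm{Gb/s} \approx 7.9\,\mathrm{GB/s}.
\]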
References
Bailey, D.H., Schreiber, R.S., Simon, H.D., Venkatakrishnan, V., Weeratunga, S.K., Barszcz, E., Barton, J.T., Browning, D.S., Carter, R.L., Dagum, L., Fatoohi, R.A., Frederickson, P.O., Lasinski, T.A.: The NAS parallel benchmarks - summary and preliminary results. In: Proceedings of SC 1991, pp. 158–165 (1991)
Grewe, D., Wang, Z., O’Boyle, M.F.P.: Portable mapping of data parallel programs to OpenCL for heterogeneous systems. In: Proceedings of CGO 2013, pp. 1–10. IEEE (2013)
Hanawa, T., Fujii, H., Fujita, N., Odajima, T., Matsumoto, K., Boku, T.: Evaluation of FFT for GPU cluster using tightly coupled accelerators architecture. In: Proceedings of Cluster 2015, pp. 635–641. IEEE (2015)
Hanawa, T., Kodama, Y., Boku, T., Sato, M.: Tightly coupled accelerators architecture for minimizing communication latency among accelerators. In: Proceedings of IPDPSW 2013, pp. 1030–1039. IEEE (2013)
Kodama, Y., Hanawa, T., Boku, T., Sato, M.: PEACH2: an FPGA-based PCIe network device for tightly coupled accelerators. ACM SIGARCH Comput. Architect. News 42(4), 3–8 (2014)
Lee, S., Vetter, J.S.: Early evaluation of directive-based GPU programming models for productive exascale computing. In: Proceedings of SC 2012 (2012)
Matsumoto, K., Hanawa, T., Kodama, Y., Fujii, H., Boku, T.: Implementation of CG method on GPU cluster with proprietary interconnect TCA for GPU direct communication. In: Proceedings of IPDPSW 2015, pp. 647–655. IEEE (2015)
NVIDIA: NVIDIA GPUDirect. https://developer.nvidia.com/gpudirect. Accessed 25 Aug 2016
Panda, D.K.: MVAPICH2-GDR (MVAPICH2 with GPUDirect RDMA). http://mvapich.cse.ohio-state.edu/overview/. Accessed 25 Aug 2016
Pennycook, S.J., Hammond, S.D., Jarvis, S.A., Mudalige, G.R.: Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark. SIGMETRICS Perform. Eval. Rev. 38(4), 23–29 (2011)
Xu, R., Tian, X., Chandrasekaran, S., Yan, Y., Chapman, B.: NAS parallel benchmarks for GPGPUs using a directive-based programming model. In: Brodman, J., Tu, P. (eds.) LCPC 2014. LNCS, vol. 8967, pp. 67–81. Springer, Cham (2015). doi:10.1007/978-3-319-17473-0_5
Acknowledgements
The present study was supported by the Japan Science and Technology Agency's CREST program entitled "Research and Development of Unified Environment on Accelerated Computing and Interconnection for Post-Petascale Era." The authors would like to thank the Center for Computational Sciences, University of Tsukuba for allowing us to use the HA-PACS/TCA system as part of the Interdisciplinary Collaborative Research Program.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Matsumoto, K., Fujita, N., Hanawa, T., Boku, T. (2017). Implementation and Evaluation of NAS Parallel CG Benchmark on GPU Cluster with Proprietary Interconnect TCA. In: Dutra, I., Camacho, R., Barbosa, J., Marques, O. (eds) High Performance Computing for Computational Science – VECPAR 2016. VECPAR 2016. Lecture Notes in Computer Science(), vol 10150. Springer, Cham. https://doi.org/10.1007/978-3-319-61982-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61981-1
Online ISBN: 978-3-319-61982-8