
Implementation and Evaluation of NAS Parallel CG Benchmark on GPU Cluster with Proprietary Interconnect TCA

  • Conference paper

High Performance Computing for Computational Science – VECPAR 2016 (VECPAR 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10150)

Abstract

We have been developing a proprietary interconnect technology, the Tightly Coupled Accelerators (TCA) architecture, to improve communication latency and bandwidth between accelerators (GPUs) on different nodes. This paper presents a Conjugate Gradient (CG) benchmark implementation using the TCA and its performance evaluation on the HA-PACS/TCA system, a proof-of-concept GPU cluster built on the TCA concept. The implementation is based on the CG benchmark in the NAS Parallel Benchmarks and is parallelized by a two-dimensional decomposition of the matrix data. For small benchmark classes, the TCA implementation achieves better communication performance than an MPI/InfiniBand implementation. The study also shows that a CG implementation with a two-dimensional decomposition makes better use of the interconnect, and is therefore better suited to the TCA, than one with a one-dimensional decomposition.
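
To make the parallelization concrete, the following is a minimal sketch, in C with MPI, of the communication pattern behind a two-dimensional matrix decomposition for CG. It is not the authors' code: the real benchmark uses a sparse matrix, a reduction plus transpose exchange, and the paper's TCA version routes such exchanges over PEACH2 rather than MPI/InfiniBand. The grid layout, the block size NLOC, and all names here are illustrative assumptions.

    /* spmv_2d.c -- sketch of CG communication with a 2D decomposition.
     * Build: mpicc spmv_2d.c -lm    Run: mpiexec -n 4 ./a.out        */
    #include <mpi.h>
    #include <math.h>
    #include <stdio.h>

    #define NLOC 4  /* rows/columns of the local matrix block (assumed) */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Arrange the P ranks as a sqrt(P) x sqrt(P) grid; each rank owns
         * one NLOC x NLOC block of A and an NLOC-element piece of p. */
        int grid = (int)(sqrt((double)size) + 0.5);
        int myrow = rank / grid;
        MPI_Comm row_comm;
        MPI_Comm_split(MPI_COMM_WORLD, myrow, rank, &row_comm);

        double A[NLOC][NLOC], p[NLOC], partial[NLOC], q[NLOC];
        for (int i = 0; i < NLOC; i++) {
            p[i] = 1.0;
            for (int j = 0; j < NLOC; j++)
                A[i][j] = (i == j) ? 2.0 : 0.0;  /* toy diagonal block */
        }

        /* Step 1: multiply the local block by the local vector piece. */
        for (int i = 0; i < NLOC; i++) {
            partial[i] = 0.0;
            for (int j = 0; j < NLOC; j++)
                partial[i] += A[i][j] * p[j];
        }

        /* Step 2: sum the partial products across the process row.  With a
         * 2D decomposition this reduction involves only the sqrt(P) ranks
         * of one row -- the short, latency-bound messages the TCA targets. */
        MPI_Allreduce(partial, q, NLOC, MPI_DOUBLE, MPI_SUM, row_comm);

        /* Step 3: the CG dot product, a scalar reduction over all ranks. */
        double dot_local = 0.0, dot;
        for (int i = 0; i < NLOC; i++)
            dot_local += p[i] * q[i];
        MPI_Allreduce(&dot_local, &dot, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        if (rank == 0)
            printf("p.Ap = %f\n", dot);

        MPI_Comm_free(&row_comm);
        MPI_Finalize();
        return 0;
    }

A one-dimensional decomposition would instead reduce each vector segment over all P ranks at once; confining the reductions to rows of sqrt(P) ranks is what lets the 2D variant benefit from the TCA's low-latency small-message transfers.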


Notes

  1. The CG benchmark program is originally written in Fortran.

  2. This is because using two or more sub-clusters entails hybrid use of the TCA/PEACH2 and MPI/IB, and because using two or more GPUs requires additional considerations for the TCA/PEACH2 to be used effectively. Both the hybrid use and multi-GPU support are left as future work.

  3. The theoretical peak bandwidth of dual-rail InfiniBand QDR is 8 GB/s, which is equivalent to that of PCIe Gen3 x8 (see the check after these notes).

  4. Similar performance results were reported for the NPB-LU benchmark by Pennycook et al. [10].
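
As a back-of-the-envelope check of the figures in note 3, assuming the standard 8b/10b and 128b/130b line encodings (which the note does not spell out):

    \text{IB QDR, dual rail:}\quad 2 \times 4\ \text{lanes} \times 10\,\mathrm{Gb/s} \times \tfrac{8}{10} = 64\,\mathrm{Gb/s} = 8\,\mathrm{GB/s}

    \text{PCIe Gen3 x8:}\quad 8\ \text{lanes} \times 8\,\mathrm{GT/s} \times \tfrac{128}{130} \approx 63.0\,\mathrm{Gb/s} \approx 7.9\,\mathrm{GB/s}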

References

  1. Bailey, D.H., Schreiber, R.S., Simon, H.D., Venkatakrishnan, V., Weeratunga, S.K., Barszcz, E., Barton, J.T., Browning, D.S., Carter, R.L., Dagum, L., Fatoohi, R.A., Frederickson, P.O., Lasinski, T.A.: The NAS parallel benchmarks - summary and preliminary results. In: Proceedings of SC 1991, pp. 158–165 (1991)


  2. Grewe, D., Wang, Z., O’Boyle, M.F.P.: Portable mapping of data parallel programs to OpenCL for heterogeneous systems. In: Proceedings of CGO 2013, pp. 1–10. IEEE (2013)


  3. Hanawa, T., Fujii, H., Fujita, N., Odajima, T., Matsumoto, K., Boku, T.: Evaluation of FFT for GPU cluster using tightly coupled accelerators architecture. In: Proceedings of Cluster 2015, pp. 635–641. IEEE (2015)


  4. Hanawa, T., Kodama, Y., Boku, T., Sato, M.: Tightly coupled accelerators architecture for minimizing communication latency among accelerators. In: Proceedings of IPDPSW 2013, pp. 1030–1039. IEEE (2013)


  5. Kodama, Y., Hanawa, T., Boku, T., Sato, M.: PEACH2: an FPGA-based PCIe network device for tightly coupled accelerators. ACM SIGARCH Comput. Architect. News 42(4), 3–8 (2014)


  6. Lee, S., Vetter, J.S.: Early evaluation of directive-based GPU programming models for productive exascale computing. In: Proceedings of SC 2012 (2012)


  7. Matsumoto, K., Hanawa, T., Kodama, Y., Fujii, H., Boku, T.: Implementation of CG method on GPU cluster with proprietary interconnect TCA for GPU direct communication. In: Proceedings of IPDPSW 2015, pp. 647–655. IEEE (2015)


  8. NVIDIA: NVIDIA GPUDirect. https://developer.nvidia.com/gpudirect. Accessed 25 Aug 2016

  9. Panda, D.K.: MVAPICH2-GDR (MVAPICH2 with GPUDirect RDMA). http://mvapich.cse.ohio-state.edu/overview/. Accessed 25 Aug 2016

  10. Pennycook, S.J., Hammond, S.D., Jarvis, S.A., Mudalige, G.R.: Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark. SIGMETRICS Perform. Eval. Rev. 38(4), 23–29 (2011)


  11. Xu, R., Tian, X., Chandrasekaran, S., Yan, Y., Chapman, B.: NAS parallel benchmarks for GPGPUs using a directive-based programming model. In: Brodman, J., Tu, P. (eds.) LCPC 2014. LNCS, vol. 8967, pp. 67–81. Springer, Cham (2015). doi:10.1007/978-3-319-17473-0_5



Acknowledgements

The present study was supported by the Japan Science and Technology Agency’s CREST program entitled “Research and Development of Unified Environment on Accelerated Computing and Interconnection for Post-Petascale Era.” The authors would like to thank the Center for Computational Sciences, University of Tsukuba for allowing us to use the HA-PACS/TCA system as part of the interdisciplinary Collaborative Research Program.

Author information

Correspondence to Kazuya Matsumoto.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Matsumoto, K., Fujita, N., Hanawa, T., Boku, T. (2017). Implementation and Evaluation of NAS Parallel CG Benchmark on GPU Cluster with Proprietary Interconnect TCA. In: Dutra, I., Camacho, R., Barbosa, J., Marques, O. (eds) High Performance Computing for Computational Science – VECPAR 2016. VECPAR 2016. Lecture Notes in Computer Science, vol 10150. Springer, Cham. https://doi.org/10.1007/978-3-319-61982-8_14


  • DOI: https://doi.org/10.1007/978-3-319-61982-8_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61981-1

  • Online ISBN: 978-3-319-61982-8
