DOI: 10.1145/3145617.3145619
research-article

Beyond 16GB: Out-of-Core Stencil Computations

Published: 12 November 2017

Abstract

Stencil computations are a key class of applications, widely used in the scientific computing community, and one that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately, such architectures come with only a limited amount of fast memory, which restricts the size of the problems that can be solved efficiently. In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large-scale stencil codes implemented using the OPS domain-specific language, such as CloverLeaf 2D, CloverLeaf 3D, and OpenSBLI. We introduce a number of techniques and optimisations to help manage data resident in fast memory and minimise data movement. Evaluating our work on Intel's Knights Landing platform as well as NVIDIA P100 GPUs, we demonstrate that it is possible to solve problems 3 times larger than the on-chip memory size with at most a 15% loss in efficiency.
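As a rough illustration of the tiling idea mentioned in the abstract, the sketch below applies several sweeps of a 1D three-point stencil one tile at a time, so that only a tile plus its halo would need to sit in fast memory at once. It is a simplified overlapped-tiling example written for this page, not the OPS implementation; the function names (`step`, `run_untiled`, `run_tiled`), the averaging stencil, and the halo-by-redundant-computation scheme are all assumptions chosen for brevity.

```python
def step(u):
    """One 3-point averaging sweep; the two end cells are held fixed."""
    out = u[:]
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def run_untiled(u, steps):
    """Reference: sweep the whole array `steps` times."""
    for _ in range(steps):
        u = step(u)
    return u


def run_tiled(u, steps, tile):
    """Same result, computed tile by tile (hypothetical out-of-core scheme).

    Each tile is copied together with a halo of `steps` cells per side,
    standing in for an upload to fast memory. All sweeps run on that small
    buffer; halo cells go stale by one cell per sweep, so only the tile
    interior is copied back.
    """
    n = len(u)
    result = u[:]
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo = max(0, start - steps)   # left edge of tile + halo
        hi = min(n, end + steps)     # right edge of tile + halo
        buf = u[lo:hi]               # "fast memory" working set
        for _ in range(steps):
            buf = step(buf)
        result[start:end] = buf[start - lo:end - lo]
    return result
```

Calling `run_tiled(data, steps=4, tile=8)` on a list of floats should give the same values as `run_untiled(data, 4)`, while touching at most `tile + 2 * steps` cells at a time. The price is redundant computation in the halos, which is one reason production frameworks may prefer other tiling shapes.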

Published In

MCHPC'17: Proceedings of the Workshop on Memory Centric Programming for HPC
November 2017
43 pages
ISBN:9781450351317
DOI:10.1145/3145617

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. KNL
  3. Performance
  4. Stacked Memory
  5. Tiling
  6. Unified Memory

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • UK Engineering and Physical Sciences Research Council
  • Hungarian Academy of Sciences

Conference

SC '17

Article Metrics

  • Downloads (last 12 months): 6
  • Downloads (last 6 weeks): 2
Reflects downloads up to 04 Sep 2024.

Cited By

  • (2021) Accelerating parallel CFD codes on modern vector processors using blockettes. Proceedings of the Platform for Advanced Scientific Computing Conference, 1-9. DOI: 10.1145/3468267.3470615. Online publication date: 5-Jul-2021
  • (2021) Automatic Parallelisation of Structured Mesh Computations with SYCL. 2021 IEEE International Conference on Cluster Computing (CLUSTER), 821-822. DOI: 10.1109/Cluster48925.2021.00083. Online publication date: Sep-2021
  • (2020) A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU. IEICE Transactions on Information and Systems E103.D:12, 2421-2434. DOI: 10.1587/transinf.2020PAP0014. Online publication date: 1-Dec-2020
  • (2020) Preliminary Experience with OpenMP Memory Management Implementation. OpenMP: Portable Multi-Level Parallelism on Modern Systems, 313-327. DOI: 10.1007/978-3-030-58144-2_20. Online publication date: 1-Sep-2020
  • (2019) Explicit Data Layout Management for Autotuning Exploration on Complex Memory Topologies. 2019 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC), 58-63. DOI: 10.1109/MCHPC49590.2019.00015. Online publication date: Nov-2019
  • (2018) Optimizing inter-nest data locality in imperfect stencils based on loop blocking. The Journal of Supercomputing 74:10, 5432-5460. DOI: 10.5555/3288339.3288365. Online publication date: 1-Oct-2018
  • (2018) Optimizing inter-nest data locality in imperfect stencils based on loop blocking. The Journal of Supercomputing 74:10, 5432-5460. DOI: 10.1007/s11227-018-2443-1. Online publication date: 30-May-2018
