Beyond 16GB: Out-of-Core Stencil Computations

Authors:

István Z. Reguly,

Gihan R. Mudalige,

Michael B. Giles

MCHPC'17: Proceedings of the Workshop on Memory Centric Programming for HPC

Pages 20 - 29

https://doi.org/10.1145/3145617.3145619

Published: 12 November 2017

Abstract

Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately, such architectures come with a limited amount of fast memory, which limits the size of the problems that can be solved efficiently. In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large-scale stencil codes implemented using the OPS domain-specific language, such as CloverLeaf 2D, CloverLeaf 3D, and OpenSBLI. We introduce a number of techniques and optimisations to help manage data resident in fast memory and minimise data movement. Evaluating our work on Intel's Knights Landing platform as well as NVIDIA P100 GPUs, we demonstrate that it is possible to solve problems 3 times larger than the on-chip memory size with at most 15% loss in efficiency.
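
As a rough illustration of the idea behind tiling across several stencil sweeps, the following Python sketch processes a 1D grid one tile at a time, copying each tile plus a halo into a small buffer that stands in for fast memory. It uses overlapped tiling with redundant halo computation, which is a simpler stand-in and not necessarily the scheme OPS applies. The function names (`stencil_step`, `untiled`, `tiled`), the 3-point averaging stencil, and the grid and tile sizes are all assumptions made for illustration.

```python
def stencil_step(u):
    """One sweep of a 3-point average over the interior; end points are copied."""
    n = len(u)
    out = list(u)
    for i in range(1, n - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def untiled(u, steps):
    """Reference version: every sweep touches the whole grid."""
    for _ in range(steps):
        u = stencil_step(u)
    return u


def tiled(u, steps, tile):
    """Run all sweeps tile by tile (hypothetical overlapped-tiling sketch).

    Each tile is copied with a halo of `steps` points into `buf`, which
    stands in for fast memory. All sweeps run on the buffer, and only the
    tile's own points are written back to the result.
    """
    n = len(u)
    result = list(u)
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo = max(start - steps, 0)   # halo, clipped at the domain boundary
        hi = min(end + steps, n)
        buf = u[lo:hi]               # "upload" tile plus halo
        for _ in range(steps):
            buf = stencil_step(buf)
        result[start:end] = buf[start - lo:end - lo]  # "download" tile only
    return result


# Illustrative check on made-up data: both versions should agree.
u0 = [float((i * 7) % 11) for i in range(64)]
print(tiled(u0, steps=3, tile=16) == untiled(u0, steps=3))
```

The halo width equals the number of sweeps because each sweep invalidates one more point inward from any buffer edge that is not a true domain boundary. Larger tiles amortise the redundant halo work but need more fast memory.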

Published In

MCHPC'17: Proceedings of the Workshop on Memory Centric Programming for HPC

November 2017

43 pages

ISBN:9781450351317

DOI:10.1145/3145617

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

UK Engineering and Physical Sciences Research Council
Hungarian Academy of Sciences

Conference

SC '17

Sponsor:

SIGHPC

SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 17, 2017

Denver, CO, USA
