FPDetect: Efficient Reasoning About Stencil Programs Using Selective Direct Evaluation

Authors:

Sriram Krishnamoorthy,

Ganesh Gopalakrishnan,

Ramakrishna Tipireddy

ACM Transactions on Architecture and Code Optimization (TACO), Volume 17, Issue 3

Article No.: 19, Pages 1 - 27

https://doi.org/10.1145/3402451

Published: 17 August 2020

Abstract

We present FPDetect, a low-overhead approach for detecting logical errors and soft errors affecting stencil computations without generating false positives. We develop an offline analysis that tightly estimates the number of floating-point bits preserved across stencil applications. This estimate rigorously bounds the values expected in the data space of the computation. Violations of this bound can be attributed with certainty to errors. FPDetect helps synthesize error detectors customized for user-specified levels of accuracy and coverage. FPDetect also enables overhead reduction techniques based on deploying these detectors coarsely in space and time. Experimental evaluations demonstrate the practicality of our approach.
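The detection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 3-point stencil, the coefficient `c`, the function names, and the hand-picked `min_bits` threshold are all assumptions made for illustration. In FPDetect the number of preserved bits comes from a rigorous offline analysis, whereas this sketch simply takes it as a parameter. The sketch iterates a stencil for `t` steps, directly evaluates one point from the initial data using the composed `t`-step stencil weights, and flags an error when the two values agree in fewer than `min_bits` leading bits.

```python
import math


def step(u, c=0.25):
    """Apply one 3-point stencil step with periodic boundaries (illustrative stencil)."""
    n = len(u)
    return [c * u[(i - 1) % n] + (1 - 2 * c) * u[i] + c * u[(i + 1) % n]
            for i in range(n)]


def direct_eval(u0, i, t, c=0.25):
    """Evaluate point i after t steps directly from u0 via the composed stencil weights."""
    weights = {0: 1.0}  # maps offset -> weight of the composed stencil
    for _ in range(t):
        nxt = {}
        for off, w in weights.items():
            for d, k in ((-1, c), (0, 1 - 2 * c), (1, c)):
                nxt[off + d] = nxt.get(off + d, 0.0) + w * k
        weights = nxt
    n = len(u0)
    return sum(w * u0[(i + off) % n] for off, w in weights.items())


def bits_agree(a, b):
    """Approximate number of leading bits on which a and b agree."""
    if a == b:
        return 53.0  # full double-precision significand
    return -math.log2(abs(a - b) / max(abs(a), abs(b)))


def detect(u0, t, i, min_bits, corrupt=None):
    """Return True if an error is flagged at point i after t steps.

    min_bits is an assumed threshold, not a derived bound.
    corrupt, if given, is (step_index, point, delta): a hypothetical injected fault.
    """
    u = list(u0)
    for s in range(t):
        u = step(u)
        if corrupt is not None and corrupt[0] == s:
            u[corrupt[1]] += corrupt[2]
    return bits_agree(u[i], direct_eval(u0, i, t)) < min_bits
```

With smooth initial data such as `u0 = [2 + math.sin(2 * math.pi * k / 32) for k in range(32)]`, a clean run with `detect(u0, 4, 10, 40)` should not be flagged, while `detect(u0, 4, 10, 40, corrupt=(1, 10, 1e-3))` should be. Checking only selected points at selected time steps is what keeps such a detector cheap; whether it avoids false positives depends entirely on how soundly `min_bits` is chosen.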

Supplementary Material

a19-das-suppl.pdf (das.zip)

Supplemental movie, appendix, image, and software files for FPDetect: Efficient Reasoning About Stencil Programs Using Selective Direct Evaluation


Cited By

Khalifa D, Martel M (2022). Constrained Precision Tuning. 2022 8th International Conference on Control, Decision and Information Technologies (CoDIT), 230-236. DOI: 10.1109/CoDIT55151.2022.9804011. Online publication date: 17-May-2022.
https://doi.org/10.1109/CoDIT55151.2022.9804011
Das A, Briggs I, Gopalakrishnan G, Krishnamoorthy S, Panchekha P, Cuicchi C, Qualters I, Kramer W (2020). Scalable yet rigorous floating-point error analysis. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.5555/3433701.3433768. Online publication date: 9-Nov-2020.
https://dl.acm.org/doi/10.5555/3433701.3433768
Das A, Briggs I, Gopalakrishnan G, Krishnamoorthy S, Panchekha P (2020). Scalable yet Rigorous Floating-Point Error Analysis. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.1109/SC41405.2020.00055. Online publication date: Nov-2020.
https://doi.org/10.1109/SC41405.2020.00055


Published In

ACM Transactions on Architecture and Code Optimization Volume 17, Issue 3

September 2020

200 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3415154

Editor:
David Kaeli
Northeastern University, USA

Copyright © 2020 ACM.

© 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 August 2020

Accepted: 01 May 2020

Revised: 01 April 2020

Received: 01 November 2019

Published in TACO Volume 17, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

NSF CCF
U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research
