research-article
Open access

FPDetect: Efficient Reasoning About Stencil Programs Using Selective Direct Evaluation

Published: 17 August 2020

Abstract

We present FPDetect, a low-overhead approach for detecting logical errors and soft errors affecting stencil computations without generating false positives. We develop an offline analysis that tightly estimates the number of floating-point bits preserved across stencil applications. This estimate rigorously bounds the values expected in the data space of the computation. Violations of this bound can be attributed with certainty to errors. FPDetect helps synthesize error detectors customized for user-specified levels of accuracy and coverage. FPDetect also enables overhead reduction techniques based on deploying these detectors coarsely in space and time. Experimental evaluations demonstrate the practicality of our approach.
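The detection idea in the abstract can be illustrated with a small sketch. This is not FPDetect's implementation: it assumes a 1D three-point stencil with periodic boundaries, and the names `step`, `composed_coeffs`, `direct_eval`, `detector_ok`, `run`, and `flip_bit` are hypothetical. The 40-bit threshold is a placeholder chosen for the demo. In the paper that number would come from the rigorous offline analysis, so the sketch does not inherit the no-false-positive guarantee. The detector evaluates one point T steps ahead directly from the initial data and checks that the iterated result agrees with it to the required number of bits.

```python
import struct
import numpy as np

def step(u, c):
    """One application of the 3-point stencil c with periodic boundaries."""
    return c[0] * np.roll(u, 1) + c[1] * u + c[2] * np.roll(u, -1)

def composed_coeffs(c, T):
    """Coefficients of the single (2T+1)-point stencil equal to T applications of c."""
    k = np.array([1.0])
    for _ in range(T):
        k = np.convolve(k, c)
    return k

def direct_eval(u0, i, k, T):
    """Value at point i after T steps, computed in one shot from u0."""
    idx = np.arange(i - T, i + T + 1) % len(u0)
    return float(np.dot(k, u0[idx]))

def detector_ok(u_T, u0, i, k, T, bits):
    """True if the iterated value at i agrees with direct evaluation to `bits` bits.

    `bits` is an assumed input here, not a rigorously derived bound.
    """
    expected = direct_eval(u0, i, k, T)
    return bool(abs(u_T[i] - expected) <= 2.0 ** (-bits) * abs(expected))

def flip_bit(x, bit):
    """Flip one bit of the IEEE 754 binary64 encoding of x."""
    (n,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", n ^ (1 << bit)))
    return y

def run(u0, c, T, fault=None):
    """Iterate the stencil T times; fault=(t, i, bit) flips a bit of u[i] after step t."""
    u = u0.copy()
    for t in range(T):
        u = step(u, c)
        if fault is not None and fault[0] == t:
            u[fault[1]] = flip_bit(u[fault[1]], fault[2])
    return u

if __name__ == "__main__":
    u0 = 1.0 + np.linspace(0.0, 1.0, 64, endpoint=False) ** 2
    c = np.array([0.25, 0.5, 0.25])
    T = 8
    k = composed_coeffs(c, T)
    print("clean run passes:", detector_ok(run(u0, c, T), u0, 10, k, T, bits=40))
    faulty = run(u0, c, T, fault=(3, 10, 40))
    print("faulty run passes:", detector_ok(faulty, u0, 10, k, T, bits=40))
```

A detector placed at point `i` only sees faults inside that point's dependence cone over the T steps, which is why placing detectors coarsely in space and time trades coverage against overhead.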

Supplementary Material

a19-das-suppl.pdf (das.zip)
Supplemental movie, appendix, image, and software files for FPDetect: Efficient Reasoning About Stencil Programs Using Selective Direct Evaluation

Cited By

  • (2022) Constrained Precision Tuning. 2022 8th International Conference on Control, Decision and Information Technologies (CoDIT), 230-236. DOI: 10.1109/CoDIT55151.2022.9804011. Online publication date: 17-May-2022
  • (2020) Scalable yet rigorous floating-point error analysis. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.5555/3433701.3433768. Online publication date: 9-Nov-2020
  • (2020) Scalable yet Rigorous Floating-Point Error Analysis. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.1109/SC41405.2020.00055. Online publication date: Nov-2020

Published In

ACM Transactions on Architecture and Code Optimization, Volume 17, Issue 3
September 2020
200 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3415154
© 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 August 2020
Accepted: 01 May 2020
Revised: 01 April 2020
Received: 01 November 2019
Published in TACO Volume 17, Issue 3

Author Tags

  1. Soft error detection
  2. affine analysis
  3. floating point round-off error
  4. interval analysis
  5. silent data corruption
  6. software bug detection
  7. stencil computations

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • NSF CCF
  • U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research
