Abstract
Building fault-tolerant applications and systems is one of today's most important research topics. Fault tolerance is becoming increasingly essential in shared-memory parallel programs and in multi-core and many-core architectures, because shrinking transistor sizes lead to a growing number of failures. Few techniques for fault-tolerant OpenMP programs have been studied so far. The existing works are based on checkpoint and recovery and on static thread-level redundancy. However, these approaches may exhibit scalability issues when the number of cores increases or when the workload is unbalanced. To overcome these issues, we present in this paper a dynamic task-level redundancy technique for fault-tolerant OpenMP applications. Our method dynamically applies Triple Modular Redundancy (TMR) to OpenMP tasks through a dedicated runtime and uses majority voting to guarantee correct results. Our flexible fault-tolerant OpenMP approach has been evaluated for performance and fault coverage, and it showed small overhead with good error detection and recovery rates.
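The core idea of Triple Modular Redundancy with majority voting can be sketched in a few lines of C with OpenMP tasks. This is only an illustrative sketch, not the dedicated runtime described in the paper: the replication here is written by hand at the application level rather than applied dynamically, and the names `task_fn`, `vote3`, `run_tmr`, and `square` are hypothetical. It also assumes the replicated task is a side-effect-free function of its input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task signature: a side-effect-free function of its input. */
typedef long (*task_fn)(long input);

/* Example task used only for illustration. */
static long square(long x) { return x * x; }

/* Majority vote over three replica results.
 * Returns 0 and stores the agreed value when at least two replicas match.
 * Returns -1 when all three disagree: the error is detected but cannot
 * be masked by voting alone. */
static int vote3(long a, long b, long c, long *out)
{
    if (a == b || a == c) { *out = a; return 0; }
    if (b == c)           { *out = b; return 0; }
    return -1;
}

/* Run one logical task as three OpenMP tasks, then vote on the results.
 * Without OpenMP support the pragmas are ignored and the three replicas
 * simply run one after another. */
static int run_tmr(task_fn fn, long input, long *out)
{
    long r[3] = {0, 0, 0};
    assert(fn != NULL && out != NULL);

    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < 3; ++i) {
            /* Each replica writes to its own slot of r. */
            #pragma omp task shared(r) firstprivate(i, fn, input)
            r[i] = fn(input);
        }
        /* Wait for all three replicas before voting. */
        #pragma omp taskwait
    }
    return vote3(r[0], r[1], r[2], out);
}
```

A caller would invoke `run_tmr(square, 6, &result)` and check the return value before using `result`. A single corrupted replica is outvoted by the other two, while three mutually different results are reported as an unrecoverable error, which a fuller implementation might handle by re-executing the task.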
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tahan, O., Shawky, M. (2012). Using Dynamic Task Level Redundancy for OpenMP Fault Tolerance. In: Herkersdorf, A., Römer, K., Brinkschulte, U. (eds) Architecture of Computing Systems – ARCS 2012. ARCS 2012. Lecture Notes in Computer Science, vol 7179. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28293-5_3
DOI: https://doi.org/10.1007/978-3-642-28293-5_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28292-8
Online ISBN: 978-3-642-28293-5
eBook Packages: Computer Science (R0)