Abstract
Dynamic Parallel Schedules (DPS) is a flow-graph-based framework for developing parallel applications on clusters of workstations. The DPS flow graph execution model enables automatic pipelined parallel execution of applications. DPS supports graceful degradation of parallel applications when nodes fail. The fault-tolerance mechanism relies on a set of backup threads held in the volatile storage of alternate nodes. These backup threads are kept up to date by duplicating transmitted data objects and by periodic checkpointing. The state of a failed node can be reconstructed on its backup threads by re-executing the application from the last checkpoint. A valid execution order is automatically deduced from the flow graph. Adding fault tolerance to a DPS application requires only minor changes to the application’s source code. The present contribution focuses on the development of fault-tolerant parallel applications with DPS from a programmer’s perspective.
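The recovery idea summarized in the abstract (duplicate each data object to a backup, checkpoint periodically, replay from the last checkpoint after a failure) can be illustrated with a minimal sketch. This is not the DPS API: the names WorkerThread, BackupThread, and run are hypothetical, everything runs in a single process, and the replay order is simply arrival order, whereas DPS deduces a valid order from the flow graph.

```python
import copy


class WorkerThread:
    """Hypothetical stateful thread: keeps a running total of data objects."""

    def __init__(self):
        self.state = {"total": 0, "count": 0}

    def process(self, data_object):
        self.state["total"] += data_object
        self.state["count"] += 1


class BackupThread:
    """Hypothetical backup, standing in for volatile storage on another node."""

    def __init__(self):
        self.checkpoint = {"total": 0, "count": 0}
        self.log = []  # duplicated data objects received since the last checkpoint

    def duplicate(self, data_object):
        # Every data object sent to the worker is also sent here.
        self.log.append(data_object)

    def store_checkpoint(self, state):
        # A new checkpoint makes the older duplicated objects unnecessary.
        self.checkpoint = copy.deepcopy(state)
        self.log.clear()

    def recover(self):
        # Rebuild the lost state: start at the checkpoint, re-execute the log.
        worker = WorkerThread()
        worker.state = copy.deepcopy(self.checkpoint)
        for data_object in self.log:
            worker.process(data_object)
        return worker


def run(data_objects, checkpoint_every, fail_after):
    """Process all objects, simulating one node failure after `fail_after` objects."""
    worker, backup = WorkerThread(), BackupThread()
    for i, obj in enumerate(data_objects, start=1):
        backup.duplicate(obj)
        worker.process(obj)
        if i % checkpoint_every == 0:
            backup.store_checkpoint(worker.state)
        if i == fail_after:
            worker = backup.recover()  # the original worker is considered lost
    return worker.state
```

In this sketch, calling run([1, 2, 3, 4, 5, 6, 7], checkpoint_every=3, fail_after=5) yields the same final state as a failure-free run, because the backup replays the two objects received after the checkpoint taken at object 3.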
© 2006 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Gerlach, S., Schaeli, B., Hersch, R.D. (2006). Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules: A Programmer’s Perspective. In: Kohlas, J., Meyer, B., Schiper, A. (eds) Dependable Systems: Software, Computing, Networks. Lecture Notes in Computer Science, vol 4028. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11808107_9
DOI: https://doi.org/10.1007/11808107_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-36821-2
Online ISBN: 978-3-540-36823-6
eBook Packages: Computer Science (R0)