Abstract
The economy of scale offers cloud computing virtually unlimited cost effective processing potentials. Theoretically, prices under fair market conditions should reflect the most reasonable costs of computations. The fairness is ensured by the mutual agreements between the sellers and the buyers. Resource use efficiency is automatically optimized in the process. While there is no lack of incentives for the cloud provider to offer auction-based computing platform, using these volatile platform for practical computing is a challenge for existing programming paradigms. This paper reports a methodology and a toolkit designed to tame the challenges for MPI applications.
Unlike existing MPI fault tolerance tools, we emphasize on dynamically adjusted optimal checkpoint-restart (CPR) intervals. We introduce a formal model, then a HPC application toolkit, named SpotMPI, to facilitate the practical execution of real MPI applications on volatile auction-based cloud platforms. Our models capture the intrinsic dependencies between critical time consuming elements by leveraging instrumented performance parameters and publicly available resource bidding histories. We study algorithms with different computing v.s. communication complexities. Our results show non-trivial insights into the optimal bidding and application scaling strategies.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Starcluster (2010), http://web.mit.edu/stardev/cluster/
Amazon hpc cluster instances (2011), http://aws.amazon.com/ec2/hpc-applications/
Agbaria, A.M., Friedman, R.: Starfish: fault-tolerant dynamic mpi programs on clusters of workstations. In: Proceedings of the Eighth International Symposium on High Performance Distributed Computing, 1999, pp. 167–176 (1999)
Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the Spring Joint Computer Conference, April 18-20, pp. 483–485. ACM, New York (1967)
Andrzejak, A., Kondo, D., Yi, S.: Decision model for cloud computing under sla constraints. In: Proc. IEEE Int Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS) Symp., pp. 257–266 (2010)
Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., Warfield, A.: Xen and the art of virtualization. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pp. 164–177. ACM, New York (2003)
Blathras, K., Szyld, D.B., Shi, Y.: Timing models and local stopping criteria for asynchronous iterative algorithms. Journal of Parallel and Distributed Computing 58(3), 446–465 (1999)
Borthakur, D.: The hadoop distributed file system: Architecture and design (2007), http://developer.yahoo.com/hadoop/tutorial/
Chohan, N., Castillo, C., Spreitzer, M., Steinder, M., Tantawi, A., Krintz, C.: See spot run: using spot instances for mapreduce workflows. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, p. 7. USENIX Association (2010)
Daly, J.T.: A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems 22(3), 303–312 (2006)
Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
Fagg, G., Dongarra, J.: Ft-mpi: Fault tolerant mpi, supporting dynamic applications in a dynamic world. In: Dongarra, J., Kacsuk, P., Podhorszki, N. (eds.) PVM/MPI 2000. LNCS, vol. 1908, pp. 346–353. Springer, Heidelberg (2000)
Graham, R.L., Choi, S.E., Daniel, D.J., Desai, N.N., Minnich, R.G., Rasmussen, C.E., Risinger, L.D., Sukalski, M.W.: A network-failure-tolerant message-passing system for terascale clusters. International Journal of Parallel Programming 31(4), 285–303 (2003)
Hargrove, P.H., Duell, J.C.: Berkeley lab checkpoint/restart (blcr) for linux clusters. In: Journal of Physics: Conference Series, vol. 46, p. 494. IOP Publishing (2006)
Hursey, J.: Coordinated Checkpoint/Restart Process Fault Tolerance for MPI Applications on HPC Systems. PhD thesis, Indiana University, Bloomington, IN, USA (July 2010)
Hursey, J., Squyres, J.M., Mattox, T.I., Lumsdaine, A.: The design and implementation of checkpoint/restart process fault tolerance for open mpi. In: Proc. IEEE Int. Parallel and Distributed Processing Symp. IPDPS 2007, pp. 1–8 (2007)
Iosup, A., Ostermann, S., Yigitbasi, N., Prodan, R., Fahringer, T., Epema, D.: Performance analysis of cloud computing services for many-tasks scientific computing. IEEE Transactions on Parallel and Distributed Systems 22(6), 931–945 (2011)
Litzkow, M., Tannenbaum, T., Basney, J., Livny, M.: Checkpoint and migration of unix processes in the condor distributed processing system. Technical report, Technical Report (1997)
Lusk, E.: Fault tolerance in mpi programs. Special issue of the Journal High Performance Computing Applications, IJHPCA (2002)
Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B.R.: Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: Proc. Int. High Performance Computing, Networking, Storage and Analysis (SC) Conf. for, pp. 1–11 (2010)
Rao, S., Alvisi, L., Vin, H.M.: Egida: An extensible toolkit for low-overhead fault-tolerance. In: Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, 1999. Digest of Papers, pp. 48–55. IEEE, Los Alamitos (1999)
Shi, J.Y.: Program scalability analysis. In: International Conference on Distributed and Parallel Processing. Geogetown University, Washington D.C (1997)
Shi, J.Y., Taifi, M., Khreishah, A., Wu, J.: Sustainable gpu computing at scale. In: 14th IEEE International Conference in Computational Science and Engneering 2011 (2011)
Stellner, G.: Cocheck: Checkpointing and process migration for mpi. In: Proceedings of the 10th International Parallel Processing Symposium, IPPS 1996, pp. 526–531. IEEE Computer Society, Washington, DC, USA (1996)
Vecchiola, C., Pandey, S., Buyya, R.: High-performance cloud computing: A view of scientific applications. In: Proc. 10th Int. Pervasive Systems, Algorithms, and Networks (ISPAN) Symp., pp. 4–16 (2009)
Yi, S., Kondo, D., Andrzejak, A.: Reducing costs of spot instances via checkpointing in the amazon elastic compute cloud. In: 2010 IEEE 3rd International Conference on Cloud Computing, pp. 236–243. IEEE, Los Alamitos (2010)
Young, J.W.: A first order approximation to the optimum checkpoint interval. Communications of the ACM 17(9), 530–531 (1974)
Youseff, L., Wolski, R., Gorda, B., Krintz, C.: Evaluating the performance impact of xen on mpi and process execution for hpc systems. In: Proceedings of the 2nd International Workshop on Virtualization Technology in Distributed computing, p. 1. IEEE Computer Society, Los Alamitos (2006)
Zhang, Q., Grses, E., Boutaba, R., Xiao, J.: Dynamic resource allocation for spot markets in clouds. In: Proceedings of the 11th USENIX Conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services (2011)
Zheng, G., Shi, L., Kalé, L.V.: Ftc-charm++: An in-memory checkpoint-based fault tolerant runtime for charm++ and mpi. In: 2004 IEEE International Conference on Cluster Computing, pp. 93–103. IEEE, Los Alamitos (2004)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Taifi, M., Shi, J.Y., Khreishah, A. (2011). SpotMPI: A Framework for Auction-Based HPC Computing Using Amazon Spot Instances. In: Xiang, Y., Cuzzocrea, A., Hobbs, M., Zhou, W. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2011. Lecture Notes in Computer Science, vol 7017. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24669-2_11
Download citation
DOI: https://doi.org/10.1007/978-3-642-24669-2_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24668-5
Online ISBN: 978-3-642-24669-2
eBook Packages: Computer ScienceComputer Science (R0)