Abstract
Over the past several years, a heterogeneous computing (HC) system has become more competitive as a commercial computing platform than a homogeneous system. With the growing scale of HC systems, network failures become inevitable. To achieve high performance, communication reliability should be considered while designing reliability-aware task scheduling algorithms. In this paper, we propose a new algorithm called RMSR (Replication-based scheduling for Maximizing System Reliability), which incorporates task communication into system reliability. To maximize communication reliability, an improved algorithm which searches all optimal reliability communication paths for current tasks is proposed. During the task replication phase, the task reliability threshold is determined by users and each task has dynamic replicas. Our comparative studies for both randomly generated graphs and application graphs of real-world problems show that our RMSR algorithm outperforms existing scheduling algorithms in terms of system reliability. For randomly generated graphs, several factors affecting the performance are analyzed in the paper. For an application graph of a real-world problem with a fixed DAG, the system reliability of the RMSR algorithm is at most influenced by one factor.
Similar content being viewed by others
References
Arabnejad, H., Barbosa, J.G.: List scheduling algorithm for heterogeneous systems by an optimistic cost table. IEEE Trans. Parallel Distrib. Syst. 25(3), 682–694 (2014)
Benoit, A., Hakem, M., Robert, Y.: Fault tolerant scheduling of precedence task graphs on heterogeneous platforms. In: IEEE International Symposium on Parallel and Distributed Processing, 2008. IPDPS 2008, pp. 1–8 (2008)
Bittencourt, L.F., Sakellariou, R., Madeira, E.R.: Dag scheduling using a lookahead variant of the heterogeneous earliest finish time algorithm. In: 2010 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 27–34. IEEE (2010)
Chiu, C.C., Yeh, Y.S., Chou, J.S.: A fast algorithm for reliability-oriented task assignment in a distributed system. Comput. Commun. 25(17), 1622–1630 (2002)
Chung, Y.C., Ranka, S.: Applications and performance analysis of a compile-time optimization approach for list scheduling algorithms on distributed memory multiprocessors. In: Supercomputing’92., Proceedings, pp. 512–521. IEEE (1992)
Daoud, M.I., Kharma, N.: A high performance algorithm for static task scheduling in heterogeneous distributed computing systems. J. Parallel Distrib. Comput. 68(4), 399–409 (2008)
Dogan, A., Ozguner, F.: Optimal and suboptimal reliable scheduling of precedence-constrained tasks in heterogeneous distributed computing. In: Proceedings of the 2000 International Workshops on Parallel Processing, 2000, pp. 429–436. IEEE (2000)
Dogan, A., Ozguner, F.: Matching and scheduling algorithms for minimizing execution time and failure probability of applications in heterogeneous computing. IEEE Trans. Parallel Distrib. Syst. 13(3), 308–323 (2002)
Dongarra, J.J., Jeannot, E., Saule, E., Shi, Z.: Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems. In: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pp. 280–288. ACM (2007)
El-Rewini, H., Lewis, T.G.: Scheduling parallel program tasks onto arbitrary target machines. J. Parallel Distrib. Comput. 9(2), 138–153 (1990)
Elegbede, A.C., Chu, C., Adjallah, K.H., Yalaoui, F.: Reliability allocation through cost minimization. IEEE Trans. Reliab. 52(1), 106–111 (2003)
Hsieh, C.C.: Optimal task allocation and hardware redundancy policies in distributed computing systems. Eur. J. Oper. Res. 147(2), 430–447 (2003)
Iverson, M.A., Özgüner, F., Follen, G.J.: Parallelizing existing applications in a distributed heterogeneous environment. In: 4TH Heterogeneous Computing Workshop (HCW’95. Citeseer (1995)
Kartik, S., Murthy, C.S.R.: Task allocation algorithms for maximizing reliability of distributed computing systems. IEEE Trans. Comput. 46(6), 719–724 (1997)
Kwok, Y.K., Ahmad, I.: Benchmarking and comparison of the task graph scheduling algorithms. J. Parallel Distrib. Comput. 59(3), 381–422 (1999)
Liu, G., Poh, K.L., Xie, M.: Iterative list scheduling for heterogeneous computing. J. Parallel Distrib. Comput. 65(5), 654–665 (2005)
Michael, R.G., Johnson, D.S.: Computers and intractability: A guide to the theory of np-completeness. WH Freeman &Co., San Francisco (1979)
Plank, J.S., Elwasif, W.R.: Experimental assessment of workstation failures and their impact on checkpointing systems. In: Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, 1998. Digest of Papers. pp. 48–57. IEEE (1998)
Qin, X., Jiang, H.: A dynamic and reliability-driven scheduling algorithm for parallel real-time jobs executing on heterogeneous clusters. J. Parallel Distrib. Comput. 65(8), 885–900 (2005)
Raj, J.S., Rachel, I.S.: A survey on reliability scheduling on grid computing. In: 2013 7th International Conference on Intelligent Systems and Control (ISCO), pp. 331–334. IEEE (2013)
Ramírez-Alcaraz, J.M., Tchernykh, A., Yahyapour, R., Schwiegelshohn, U., Quezada-Pina, A., González-García, J.L., Hirales-Carbajal, A.: Job allocation strategies with user run time estimates for online scheduling in hierarchical grids. Journal of Grid Computing 9(1), 95–116 (2011)
Sakellariou, R., Zhao, H.: A hybrid heuristic for dag scheduling on heterogeneous systems. In: Proceedings. 18th International Parallel and Distributed Processing Symposium, 2004. p. 111. IEEE (2004)
Shatz, S.M., Wang, J.P., Goto, M.: Task allocation for maximizing reliability of distributed computer systems. IEEE Trans. Comput. 41(9), 1156–1168 (1992)
Sih, G.C., Lee, E.A.: A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. IEEE Trans. Parallel Distrib. Syst. 4(2), 175–187 (1993)
Srinivasan, S., Jha, N.K.: Safety and reliability driven task allocation in distributed systems. IEEE Trans. Parallel Distrib. Syst. 10(3), 238–251 (1999)
Tang, X., Li, K., Li, R., Veeravalli, B.: Reliability-aware scheduling strategy for heterogeneous distributed computing systems. J. Parallel Distrib. Comput. 70(9), 941–952 (2010)
Tang, X., Li, K., Liao, G.: An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems. Clust. Comput., 1–13 (2014)
Tang, X., Li, K., Qiu, M., Sha, E.H.M.: A hierarchical reliability-driven scheduling algorithm in grid systems. J. Parallel Distrib. Comput. 72(4), 525–535 (2012)
Tao, Y., Jin, H., Shi, X.: Grid workflow scheduling based on reliability cost. In: Proceedings of the 2nd international conference on Scalable information systems, p. 12. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering) (2007)
Tom, P.A., Murthy, C.: Algorithms for reliability-oriented module allocation in distributed computing systems. J. Syst. Softw. 40(2), 125–138 (1998)
Topcuoglu, H., Hariri, S., Wu, M.Y.: Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst. 13(3), 260–274 (2002)
Tosun, S.: Energy-and reliability-aware task scheduling onto heterogeneous mpsoc architectures. J. Supercomput. 62(1), 265–289 (2012)
Ullman, J.D.: NP-complete scheduling problems. J. Comput. Syst. Sci. 10(3), 384–393 (1975)
Wang, X., Yeo, C.S., Buyya, R., Su, J.: Optimizing the makespan and reliability for workflow applications with reputation and a look-ahead genetic algorithm. Futur. Gener. Comput. Syst. 27(8), 1124–1134 (2011)
Wu, M.Y., Gajski, D.D.: Hypertool: A programming aid for message-passing systems. IEEE Trans. Parallel Distrib. Syst. 1(3), 330–343 (1990)
Zhang, J., Gu, J.: An approach to analyze grid service reliability subject to failures. In: 4th International Conference on Computer Science & Education, 2009. ICCSE’09. pp. 343–347. IEEE (2009)
Zhang, X., Zagorodnov, D., Hiltunen, M., Marzullo, K., Schlichting, R.D.: Fault-tolerant grid services using primary-backup: feasibility and performance. In: 2004 IEEE International Conference on Cluster Computing, pp. 105–114. IEEE (2004)
Zhao, B., Aydin, H., Zhu, D.: On maximizing reliability of real-time embedded applications under hard energy constraint. IEEE Trans. Ind. Inf. 6(3), 316–328 (2010)
Zhao, L., Ren, Y., Sakurai, K.: A resource minimizing scheduling algorithm with ensuring the deadline and reliability in heterogeneous systems. In: 2011 IEEE International Conference on Advanced Information Networking and Applications (AINA), pp. 275–282. IEEE (2011)
Zhao, L., Ren, Y., Xiang, Y., Sakurai, K.: Fault-tolerant scheduling with dynamic number of replicas in heterogeneous systems. In: 2010 12th IEEE International Conference on High Performance Computing and Communications (HPCC), pp. 434–441. IEEE (2010)
Zheng, Q., Veeravalli, B., Tham, C.K.: On the design of fault-tolerant scheduling strategies using primary-backup approach for computational grids with low replication costs. IEEE Trans. Comput. 58(3), 380–393 (2009)
Zhu, D., Melhem, R., Mossé, D.: The effects of energy management on reliability in real-time embedded systems. In: IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004. pp. 35–40. IEEE (2004)
Acknowledgments
A preliminary version of the paper will be presented in the Fourth Workshop on Parallel Computing and Optimization in conjunction with IPDPS 2014, Phoenix, Arizona, May 23, 2014. The authors would like to thank the three anonymous reviewers for their valuable and helpful comments on improving the manuscript.
Author information
Authors and Affiliations
Corresponding author
Additional information
The research was partially funded by the Key Program of National Natural Science Foundation of China (Grant Nos. 61133005, 61432005), the National Outstanding Youth Science Program of National Natural Science Foundation of China (Grant No. 61625202), the International Science & Technology Cooperation Program of China (Grant No. 2015DFA11240), the National Key R&D Program of China (Grant No. 2016YFB0201402), the Hunan Provincial Innovation Foundation For Postgraduate (Grant No. CX2016B066), and the Outstanding Graduate Student Innovation Fund Program of Collaborative Innovation Center of High Performance Computing.
Rights and permissions
About this article
Cite this article
Wang, S., Li, K., Mei, J. et al. A Reliability-aware Task Scheduling Algorithm Based on Replication on Heterogeneous Computing Systems. J Grid Computing 15, 23–39 (2017). https://doi.org/10.1007/s10723-016-9386-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10723-016-9386-7