Abstract
Failure instances in distributed computing systems (DCSs) exhibit temporal and spatial correlations: a single failure can trigger a set of further failures, either simultaneously or in succession within a short time interval. In this work, we propose a correlated failure prediction approach (CFPA) to predict correlated failures of computing elements (CEs) in DCSs. The approach models correlated-failure patterns using probabilistic shared risk groups and predicts correlated failures by applying association rule mining in parallel. We conduct extensive experiments to evaluate the feasibility and effectiveness of CFPA using both failure traces from Los Alamos National Lab and simulated datasets. The experimental results show that the proposed approach outperforms other approaches in both failure prediction performance and execution time, and can potentially provide better prediction performance in a larger system.












Notes
The operator “\(\backslash\)” denotes set minus, as in “\(X \backslash Y\)”, which means that \(Y\) is excluded from \(X\).
Each \(I_{k}\) corresponds to a certain CE in a DCS.
Acknowledgments
The authors would like to thank the anonymous reviewers for their invaluable suggestions, which have been incorporated to improve the quality of the paper. This work was supported by the National Natural Science Foundation of China (No. 61372108).
Appendix : Notations
Table 6 summarizes the notations we used.
Cite this article
Zheng, W., Wang, Z., Huang, H. et al. SPSRG: a prediction approach for correlated failures in distributed computing systems. Cluster Comput 19, 1703–1721 (2016). https://doi.org/10.1007/s10586-016-0633-2
DOI: https://doi.org/10.1007/s10586-016-0633-2