Proctor: A Semi-Supervised Performance Anomaly Diagnosis Framework for Production HPC Systems

  • Conference paper
  • In: High Performance Computing (ISC High Performance 2021)

Abstract

Diagnosing performance variation in High-Performance Computing (HPC) systems is a challenging problem due to the size and complexity of these systems. Application performance variation leads to premature termination of jobs, decreased energy efficiency, or wasted computing resources. Manual root-cause analysis of performance variation based on system telemetry has become increasingly time-intensive because it relies on human experts and the volume of telemetry data keeps growing. Recent methods use supervised machine learning models to automatically diagnose previously encountered performance anomalies in compute nodes. However, supervised machine learning models require large labeled data sets for training. This labeled-data requirement is restrictive for many real-world application domains, including HPC systems, because collecting labeled data is challenging and time-consuming, especially for anomalies that occur only sparsely.

This paper proposes a novel semi-supervised framework that diagnoses previously encountered performance anomalies in HPC systems using a limited number of labeled data points, which makes it more suitable for production system deployment. Our framework first learns the characteristics of performance anomalies from historical telemetry data in an unsupervised fashion, and then leverages supervised classifiers to identify anomaly types. While most semi-supervised approaches typically do not use anomalous samples, our framework takes advantage of a few labeled anomalous samples to classify anomaly types. We evaluate our framework on a production HPC system and on a testbed HPC cluster, and show that it achieves a 60% F1-score on average, outperforming state-of-the-art supervised methods by 11%, while maintaining an average anomaly miss rate of 0.06%.
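
As an illustration of the two-stage idea described in the abstract, the sketch below wires an unsupervised reconstruction model, trained on unlabeled historical telemetry, to a supervised anomaly-type classifier trained on a handful of labeled anomalous samples. All concrete choices here (an MLP-based autoencoder, a random-forest classifier, a 99th-percentile error threshold, and the synthetic data) are illustrative assumptions, not the published implementation; see the repository linked in the Notes for the authors' actual code.

```python
# Minimal sketch of a two-stage semi-supervised diagnosis pipeline inspired by
# the abstract. Component choices (MLP autoencoder, random forest, threshold)
# are assumptions for illustration only.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-node telemetry feature vectors.
X_unlabeled = rng.normal(size=(2000, 32))        # historical, mostly healthy telemetry
X_labeled = rng.normal(size=(40, 32)) + 3.0      # a few labeled anomalous samples
y_labeled = rng.integers(0, 4, size=40)          # anomaly type for each labeled sample

# Stage 1 (unsupervised): learn normal behavior with an autoencoder-style
# regressor trained to reconstruct its own input.
autoencoder = MLPRegressor(hidden_layer_sizes=(16, 8, 16), max_iter=500, random_state=0)
autoencoder.fit(X_unlabeled, X_unlabeled)

def reconstruction_error(X):
    """Mean squared reconstruction error per sample."""
    return np.mean((autoencoder.predict(X) - X) ** 2, axis=1)

# Flag a sample as anomalous when its error exceeds a percentile of the errors
# observed on historical data (the threshold choice is an assumption).
threshold = np.percentile(reconstruction_error(X_unlabeled), 99)

# Stage 2 (supervised): classify the anomaly type using the few labeled samples.
clf = RandomForestClassifier(random_state=0).fit(X_labeled, y_labeled)

def diagnose(X_new):
    """Return -1 for samples deemed healthy, otherwise the predicted anomaly type."""
    flagged = reconstruction_error(X_new) > threshold
    labels = np.full(len(X_new), -1)
    if flagged.any():
        labels[flagged] = clf.predict(X_new[flagged])
    return labels

print(diagnose(rng.normal(size=(5, 32)) + 3.0))  # shifted samples are likely flagged
```

The design point this sketch tries to capture is that the unsupervised stage only needs plentiful unlabeled telemetry, while the type classifier needs just a small labeled set, which matches the limited-label setting the paper targets.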

Notes

  1. Our implementation is available at: https://github.com/peaclab/Proctor.

Acknowledgment

This work has been partially funded by Sandia National Laboratories. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under Contract DE-NA0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

Author information

Corresponding author

Correspondence to Burak Aksar.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Aksar, B., et al. (2021). Proctor: A Semi-Supervised Performance Anomaly Diagnosis Framework for Production HPC Systems. In: Chamberlain, B.L., Varbanescu, A.L., Ltaief, H., Luszczek, P. (eds.) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol. 12728. Springer, Cham. https://doi.org/10.1007/978-3-030-78713-4_11

  • DOI: https://doi.org/10.1007/978-3-030-78713-4_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-78712-7

  • Online ISBN: 978-3-030-78713-4

  • eBook Packages: Computer Science (R0)
