HybridTune: Spatio-Temporal Performance Data Correlation for Performance Diagnosis of Big Data Systems

Rui Ren¹,², Jiechao Cheng³, Xi-Wen He¹, Lei Wang¹, Jian-Feng Zhan¹, Wan-Ling Gao¹ & Chun-Jie Luo¹,²


Abstract

With the rapidly growing interest in Big Data, improving the performance of Big Data systems becomes increasingly important. The first step is to analyze and diagnose the performance bottlenecks of these systems. Currently, there are two major solutions. One is the purely data-driven diagnosis approach, which may be very time-consuming; the other is the rule-based analysis method, which usually requires prior knowledge. For Big Data applications such as Spark workloads, we observe that the tasks in the same stage normally execute the same or similar code on each data partition. On the basis of this stage similarity and the distributed characteristics of Big Data systems, we analyze the behavior of Big Data applications in terms of both system and micro-architectural metrics of each stage. Furthermore, for different performance problems, we propose a hybrid approach that combines prior rules and machine learning algorithms to detect performance anomalies, such as straggler tasks, task assignment imbalance, data skew, abnormal nodes, and outlier metrics. Following this methodology, we design and implement a lightweight, extensible tool named HybridTune, and measure its overhead and anomaly detection effectiveness using the BigDataBench benchmarks. Our experiments show that the overhead of HybridTune is only 5%, and the accuracy of the outlier detection algorithm reaches up to 93%. Finally, we report several use cases of diagnosing Spark and Hadoop workloads using BigDataBench, which demonstrate the potential use of HybridTune.
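The stage-similarity observation can be illustrated with a minimal sketch. The abstract does not state the actual rules or thresholds used by HybridTune, so everything below is an assumption made for illustration: the function name `find_stragglers`, the 1.5x-median threshold, and the sample durations are all hypothetical. The idea is only that, if tasks in one stage run similar code, a task whose duration is far above the stage median is a candidate straggler.

```python
from statistics import median


def find_stragglers(durations_by_task, factor=1.5):
    """Return the IDs of tasks whose duration exceeds factor x the stage median.

    durations_by_task maps a task ID to its run time in seconds for ONE stage.
    The factor of 1.5 is an illustrative assumption, not a value from the paper.
    """
    if not durations_by_task:
        return []
    stage_median = median(durations_by_task.values())
    return sorted(
        task_id
        for task_id, duration in durations_by_task.items()
        if duration > factor * stage_median
    )


# Hypothetical task durations (seconds) for a single stage.
stage = {"t0": 10.2, "t1": 9.8, "t2": 10.5, "t3": 31.0, "t4": 10.1}
print(find_stragglers(stage))  # prints ['t3']
```

A fixed rule like this is cheap to evaluate but needs a hand-picked threshold, which is the kind of prior knowledge the abstract attributes to rule-based methods; a learned outlier detector would replace the threshold at the cost of more computation.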



Author information

Authors and Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Rui Ren, Xi-Wen He, Lei Wang, Jian-Feng Zhan, Wan-Ling Gao & Chun-Jie Luo
University of Chinese Academy of Sciences, Beijing, 100049, China
Rui Ren & Chun-Jie Luo
School of Computing, National University of Singapore, Singapore, 117417, Singapore
Jiechao Cheng


Corresponding author

Correspondence to Jian-Feng Zhan.

Electronic supplementary material

ESM 1 (PDF 396 kb)


About this article

Cite this article

Ren, R., Cheng, J., He, XW. et al. HybridTune: Spatio-Temporal Performance Data Correlation for Performance Diagnosis of Big Data Systems. J. Comput. Sci. Technol. 34, 1167–1184 (2019). https://doi.org/10.1007/s11390-019-1968-y


Received: 06 September 2018
Revised: 04 September 2019
Published: 22 November 2019
Issue Date: November 2019
DOI: https://doi.org/10.1007/s11390-019-1968-y
