Investigating the performance of Hadoop and Spark platforms on machine learning algorithms

Ali Mostafaeipour¹,
Amir Jahangard Rafsanjani²,
Mohammad Ahmadi² &
…
Joshuva Arockia Dhanraj³ (ORCID: orcid.org/0000-0001-5048-7775)

Abstract

One of the most challenging issues in big data research is the inability to process a large volume of information in a reasonable time. Hadoop and Spark are two frameworks for distributed data processing. Hadoop is a very popular, general-purpose platform for big data processing. Spark is an open-source framework whose in-memory programming model makes it well suited to iterative algorithms. In this paper, the Hadoop and Spark frameworks are evaluated and compared in terms of runtime, memory usage, network usage, and CPU efficiency. To this end, the K-nearest neighbor (KNN) algorithm is implemented on datasets of different sizes in both frameworks. The results show that the KNN algorithm runs 4 to 4.5 times faster on Spark than on Hadoop. The evaluations also show that Hadoop consumes more resources, including CPU and network, and it is concluded that Spark uses the CPU more efficiently than Hadoop. On the other hand, Hadoop uses less memory than Spark.
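
To make the workload concrete, the following is a minimal single-machine sketch of how a KNN classification is commonly split into a map step and a reduce step. It is not the authors' implementation and uses neither the Hadoop nor the Spark API; the function names `local_top_k` and `knn_predict`, the toy data, and the Euclidean distance metric are all assumptions made for illustration.

```python
import heapq
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def local_top_k(partition, query, k):
    """Map step (hypothetical helper): k nearest (distance, label) pairs in one partition."""
    return heapq.nsmallest(
        k, ((dist(features, query), label) for features, label in partition)
    )


def knn_predict(partitions, query, k):
    """Reduce step (hypothetical helper): merge local candidates, then majority vote."""
    candidates = [
        pair for part in partitions for pair in local_top_k(part, query, k)
    ]
    nearest = heapq.nsmallest(k, candidates)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data split into two lists that stand in for blocks of a distributed dataset.
partitions = [
    [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((5.0, 5.0), "b")],
    [((0.2, 0.1), "a"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b")],
]

print(knn_predict(partitions, (0.05, 0.05), k=3))
```

Each partition only ever emits k candidates, so the merge step handles at most k times the number of partitions, regardless of dataset size. In a disk-based MapReduce job the intermediate candidates are written out between the two steps, whereas an in-memory engine can keep them resident, which is one plausible contributor to the kind of runtime gap the abstract reports.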

Author information

Authors and Affiliations

Industrial Engineering Department, Yazd University, Yazd, Iran
Ali Mostafaeipour
Computer Engineering Department, Yazd University, Yazd, Iran
Amir Jahangard Rafsanjani & Mohammad Ahmadi
Centre for Automation and Robotics (ANRO), Department of Mechanical Engineering, Hindustan Institute of Technology and Science, Chennai, 603103, India
Joshuva Arockia Dhanraj

Corresponding author

Correspondence to Joshuva Arockia Dhanraj.

About this article

Cite this article

Mostafaeipour, A., Jahangard Rafsanjani, A., Ahmadi, M. et al. Investigating the performance of Hadoop and Spark platforms on machine learning algorithms. J Supercomput 77, 1273–1300 (2021). https://doi.org/10.1007/s11227-020-03328-5

Published: 13 May 2020
Issue Date: February 2021
DOI: https://doi.org/10.1007/s11227-020-03328-5
