Abstract
This paper presents CirroData, a high-performance SQL-on-Hadoop system designed for Big Data analytics workloads. As a home-grown enterprise-level online analytical processing (OLAP) system with more than seven-year research and development (R&D) experiences, we share our design details to the community about how to achieve high performance in CirroData. Multiple optimization techniques have been discussed in the paper. The effectiveness and the efficiency of all these techniques have been proved by our customers’ daily usage. Benchmark-level studies, as well as several real application case studies of CirroData, have been presented in this paper. Our evaluations show that CirroData can outperform various types of counterpart database systems in the community, such as “Spark+Hive”, “Spark+HBase”, Impala, DB-X/Y, Greenplum, HAWQ, and others. CirroData can achieve up to 4.99x speedup compared with Greenplum, HAWQ, and Spark in the standard TPC-H queries. Application-level evaluations demonstrate that CirroData outperforms “Spark+Hive” and “Spark+HBase” by up to 8.4x and 38.8x, respectively. In the meantime, CirroData achieves the performance speedups for some application workloads by up to 20x, 100x, 182.5x, 92.6x, and 55.5x as compared with Greenplum, DB-X, Impala, DB-Y, and HAWQ, respectively.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Fox G C, Qiu J, Kamburugamuve S, Jha S, Luckow A. HPC-ABDS high performance computing enhanced Apache big data stack. In Proc. the 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 2015, pp.1057-1066.
Qiu J, Jha S, Luckow A, Fox G C. Towards HPCABDS: An initial high-performance big data stack. http://grids.ucs.indiana.edu/ptliupages/publications/nisthpc-abds.pdf, June 2019.
Shvachko K, Kuang H, Radia S, Chansler R. The Hadoop distributed file system. In Proc. the 26th IEEE Symposium on Mass Storage Systems and Technologies, May 2010, Article No. 9.
Zaharia M, Chowdhury M, Franklin M J, Shenker S, Stoica I. Spark: Cluster computing with working sets. In Proc. the 2nd USENIX Workshop on Hot Topics in Cloud Computing, June 2010, Article No. 5.
Thusoo A, Sarma J S, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R. Hive — A warehousing solution over a Map-Reduce framework. Proceedings of the VLDB Endowment, 2009, 2(2): 1626-1629.
Kornacker M, Behm A, Bittorf V et al. Impala: A modern, open-source SQL engine for Hadoop. In Proc. the 7th Biennial Conference on Innovative Data Systems Research, January 2015, Article No. 5.
Chang L, Wang ZW, Ma T et al. HAWQ: A massively parallel processing SQL engine in Hadoop. In Proc. the 2014 ACM SIGMOD International Conference on Management of Data, June 2014, pp.1223-1234.
Costea A, Ionescu A, Raducanu B et al. VectorH: Taking SQL-on-Hadoop to the next level. In Proc. the 2016 International Conference on Management of Data, June 2016, pp.1105-1117.
Hunt P, Konar M, Junqueira F P, Reed B. ZooKeeper: Wait-free coordination for Internet-scale systems. In Proc. the 2010 USENIX Annual Technical Conference, June 2010, Article No. 14.
Chris L, Adve V. LLVM: A compilation framework for lifelong program analysis & transformation. In Proc. the 2nd IEEE/ACM International Symposium on Code Generation and Optimization, March 2004, pp.75-88.
Neumann T. Efficiently compiling efficient query plans for modern hardware. Proceedings of the VLDB Endowment, 2011, 4(9): 539-550.
Neumann T, Leis V. Compiling database queries into machine code. IEEE Data Eng. Bull., 2014, 37(1): 3-11.
Shamgunov N. The MemSQL in-memory database system. In Proc. the 2nd International Workshop on in Memory Data Management and Analytics, September 2014, Article No. 1.
Tan C K. Vitesse DB: 100% PostgreSQL, 100X faster for analytics. The 2nd South Bay PostgreSQL Meetup, 2015. https://www.meetup.com/postgresql-1/events/221039792/, Nov. 2019.
Lu X, Liang F, Wang B, Zha L, Xu Z. DataMPI: Extending MPI to Hadoop-like big data computing. In Proc. the 28th International Parallel and Distributed Processing Symposium, May 2014, pp.829-838.
Liang F, Lu X. Accelerating iterative big data computing through MPI. Journal of Computer Science and Technology, Mar. 2015, 30(2): 283-294.
Gugnani S, Lu X, Qi H L, Zha L, Panda D K. Characterizing and accelerating indexing techniques on distributed ordered tables. In Proc. the 2017 IEEE International Conference on Big Data, December 2017, pp.173-182.
Kemper A, Neumann T. HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In Proc. the 27th International Conference on Data Engineering, April 2011, pp.195-206.
Pavlo A, Angulo G, Arulraj J et al. Self-driving database management systems. In Proc. the 8th Biennial Conference on Innovative Data Systems Research, January 2017, Article No. 14.
Thusoo A, Sarma J S, Jain N, Shao Z, Chakka P, Zhang N, Antony S, Liu H, Murthy R. Hive — A petabyte scale data warehouse using Hadoop. In Proc. the 26th IEEE International Conference on Data Engineering, March 2010, pp.996-1005.
Barber R, Garcia-Arellano C, Grosman R et al. Evolving databases for new-gen big data applications. In Proc. the 8th Biennial Conference on Innovative Data Systems Research, January 2017, Article No. 2.
Kallman R, Kimura H, Natkins J et al. H-Store: A high-performance, distributed main memory transaction processing system. Proceedings of the VLDB Endowment, 2008, 1(2): 1496-1499.
Pavlo A, Jones E P, Zdonik S. On predictive modeling for optimizing transaction execution in parallel OLTP systems. Proceedings of the VLDB Endowment, 2011, 5(2): 85-96.
Serafini M, Mansour E, Aboulnaga A, Salem K, Rafiq T, Minhas U F. Accordion: Elastic scalability for database systems supporting distributed transactions. Proceedings of the VLDB Endowment, 2014, 7(12): 1035-1046.
Taft R, Mansour E, Serafini M, Duggan J, Elmore A J, Aboulnaga A, Pavlo A, Stonebraker M. E-store: Fine-grained elastic partitioning for distributed transaction processing systems. Proceedings of the VLDB Endowment, 2014, 8(3): 245-256.
Mahajan K, Chowdhury M, Akella A, Chawla S. Dynamic query re-planning using QOOP. In Proc. the 13th USENIX Symposium on Operating Systems Design and Implementation, October 2018, pp.253-267.
Cowling J A, Liskov B. Granola: Low-overhead distributed transaction coordination. In Proc. the 2012 USENIX Annual Technical Conference, June 2012, pp.223-235.
Färber F, May N, Lehner W et al. The SAP HANA database — An architecture overview. IEEE Data Eng. Bull., 2012, 35(1): 28-33.
Lee J, Kwon Y S, Färber F et al. SAP HANA distributed inmemory database system: Transaction, session, and metadata management. In Proc. the 29th International Conference on Data Engineering, April 2013, pp.1165-1173.
Thomson A, Diamond T, Weng S C, Ren K, Shao P, Abadi D J. Calvin: Fast distributed transactions for partitioned database systems. In Proc. the 2012 ACM SIGMOD International Conference on Management of Data, May 2012, pp.1-12.
Author information
Authors and Affiliations
Corresponding author
Electronic supplementary material
ESM 1
(PDF 550 kb)
Rights and permissions
About this article
Cite this article
Jin, ZH., Shi, H., Hu, YX. et al. CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance. J. Comput. Sci. Technol. 35, 194–208 (2020). https://doi.org/10.1007/s11390-020-9536-z
Received:
Revised:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11390-020-9536-z