Tradeoffs between Parallel Database Systems, Hadoop, and HadoopDB as Platforms for Petabyte-Scale Analysis

Daniel J. Abadi

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6187)

Included in the following conference series:

International Conference on Scientific and Statistical Database Management

Abstract

As the market demand for analyzing data sets of increasing variety and scale continues to explode, the software options for performing this analysis are beginning to proliferate. No fewer than a dozen companies selling parallel database products have launched in the past few years to meet this demand. At the same time, MapReduce-based options, such as the open source Hadoop framework, are becoming increasingly popular, and a plethora of research publications in the past two years demonstrate how MapReduce can be used to accelerate and scale various data analysis tasks.

Both parallel databases and MapReduce-based options have strengths and weaknesses that a practitioner must be aware of before selecting an analytical data management platform. In this talk, I describe some experiences in using these systems, and the advantages and disadvantages of their popular implementations. I then discuss a hybrid system that we are building at Yale University, called HadoopDB, that attempts to combine the advantages of both types of platforms. Finally, I discuss our experience in using HadoopDB for both traditional decision support workloads (i.e., TPC-H) and scientific data management (analyzing the UniProt protein sequence, function, and annotation data).

Author information

Authors and Affiliations

Yale University
Daniel J. Abadi

Editor information

Editors and Affiliations

Institute of Computer Science, University of Heidelberg, 69120, Heidelberg, Germany
Michael Gertz
Dept. of Computer Science and Genome Center, University of California, One Shields Avenue, 95616, Davis, CA, USA
Bertram Ludäscher

Cite this paper

Abadi, D.J. (2010). Tradeoffs between Parallel Database Systems, Hadoop, and HadoopDB as Platforms for Petabyte-Scale Analysis. In: Gertz, M., Ludäscher, B. (eds) Scientific and Statistical Database Management. SSDBM 2010. Lecture Notes in Computer Science, vol 6187. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13818-8_1

DOI: https://doi.org/10.1007/978-3-642-13818-8_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13817-1
Online ISBN: 978-3-642-13818-8
eBook Packages: Computer Science, Computer Science (R0)
