PLI $$^+$$ : efficient clustering of cloud databases | Distributed and Parallel Databases

Dai Hai Ton That ORCID: orcid.org/0000-0002-3935-2471¹,
James Wagner¹,
Alexander Rasin¹ &
…
Tanu Malik¹

329 Accesses
Explore all metrics

Abstract

Commercial cloud database services increase availability of data and provide reliable access to data. Routine database maintenance tasks such as clustering, however, increase the costs of hosting data on commercial cloud instances. Clustering causes an I/O burst; clustering in one-shot depletes I/O credit accumulated by an instance and increases the cost of hosting data. An unclustered database decreases query performance by scanning large amounts of data, gradually depleting I/O credits. In this paper, we introduce Physical Location Index Plus (${PLI}^{\small {{+}}}$), an indexing method for databases hosted on commercial cloud. ${PLI}^{\small {{+}}}$ relies on internal knowledge of data layout, building a physical location index, which maps a range of physical co-locations with a range of attribute values to create approximately sorted buckets. As new data is inserted, writes are partitioned in memory based on incoming data distribution. The data is written to physical locations on disk in block-based partitions to favor large granularity I/O. Incoming SQL queries on indexed attribute values are rewritten in terms of the physical location ranges. As a result, ${PLI}^{\small {{+}}}$ does not decrease query performance on an unclustered cloud database instance, DBAs may choose to cluster the instance when they have sufficiently large I/O credit available for clustering thus delaying the need for clustering. We evaluate query performance over ${PLI}^{\small {{+}}}$ by comparing it with clustered, unclustered (secondary) indexes, and log-structured merge trees on real datasets. Experiments show that ${PLI}^{\small {{+}}}$ significantly delays clustering, and yet does not degrade query performance—thus achieving higher level of sortedness than unclustered indexes and log-structured merge trees. We also evaluate the quality of clustering by introducing a measure of interval sortedness, and the size of index.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

PatchIndex: exploiting approximate constraints in distributed databases

Article Open access 06 March 2021

A Universal Distributed Indexing Scheme for Data Centers with Tree-Like Topologies

Adaptive partitioning and indexing for in situ query processing

Article 15 November 2019

Notes

A DBMS deployed on a cloud platform.
We use the cost estimates from [4]; each GB at this time costs about $0.125.
This is a skeleton tree. At the initial stage, all tree nodes are initialized with meta data (e.g., intervals, pointers, etc.) but nodes initially do not contain any data.

References

Agrawal, S., Narasayya, V., Yang, B.: Integrating vertical and horizontal partitioning into automated physical database design. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD ’04, pp 359–370. ACM, New York (2004). https://doi.org/10.1145/1007568.1007609
Agrawal, S., Chaudhuri, S., Kollar, L., Marathe, A., Narasayya, V., Syamala, M.: Database tuning advisor for microsoft sql server 2005: demo. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD ’05, pp. 930–932. ACM, New York (2005). https://doi.org/10.1145/1066157.1066292
Amazon: Amazon EBS product details. https://aws.amazon.com/ebs/details/ (2017a)
Amazon: Amazon RDS for PostgreSQL pricing. https://aws.amazon.com/rds/postgresql/pricing/ (2017b)
Amazon: Storage for Amazon RDS. https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html (2017c)
Ang, C.H., Tan, K.P.: The interval B-tree. Inf. Process. Lett. 53(2), 85–89 (1995)
Article MathSciNet MATH Google Scholar
Catlett, C., Malik, T., Goldstein, B., Giuffrida, J., Shao, Y., Panella, A., Eder, D., Zanten, Ev, Mitchum, R., Thaler, S., Foster, I.T.: Plenario: an open data discovery and exploration platform for urban science. IEEE Data Eng. Bull. 37(4), 27–42 (2014)
Google Scholar
Consortium, G.P.: A global reference for human genetic variation. Nature 526(7571), 68 (2015)
Article Google Scholar
Curino, C., Jones, E., Zhang, Y., Madden, S.: Schism: a workload-driven approach to database replication and partitioning. Proc VLDB Endow. 3(1–2), 48–57 (2010). https://doi.org/10.14778/1920841.1920853
Article Google Scholar
Garfinkel, S.L.: Carving contiguous and fragmented files with fast object validation. Digit. Investig. 4, 2–12 (2007)
Article Google Scholar
Gray, R.M.: Entropy and Information Theory. Springer, New York (1990)
Book MATH Google Scholar
Jannen, W., Yuan, J., Zhan, Y., Akshintala, A., Esmet, J., Jiao, Y., Mittal, A., Pandey, P., Reddy, P., Walsh, L., Bender, M., Farach-Colton, M., Johnson, R., Kuszmaul, B.C., Porter, D.E.: BetrFS: a right-optimized write-optimized file system. In: USENIX Conference on File and Storage Technologies (FAST) (2015)
Jindal, A., Dittrich, J.: Relax and let the database do the partitioning online. In: Castellanos, M., Dayal, U., Lehner, W. (eds.) Enabling Real-Time Business Intelligence, pp. 65–80. Springer, Berlin (2012)
Chapter Google Scholar
Kersten, M.L., Manegold, S., et al.: Cracking the database store. In: CIDR, vol. 5, pp. 4–7 (2005)
Kimura, H., Huo, G., Rasin, A., Madden, S., Zdonik, S.B.: Correlation maps: a compressed access method for exploiting soft functional dependencies. Proc. VLDB Endow. 2(1), 1222–1233 (2009)
Article Google Scholar
Li, Y., He, B., Yang, R.J., Luo, Q., Yi, K.: Tree indexing on solid state drives. Proc. VLDB Endow. 3, 1–2 (2010)
Article Google Scholar
National Center for Biotechnology Information UNLoM: Genbank. https://www.ncbi.nlm.nih.gov/genbank/ (2017)
NASA: Nasa earth exchange. https://docs.opendata.aws/nasa-nex/readme.html (2018)
O’Neil, P., Cheng, E., Gawlick, D., O’Neil, E.: The log-structured merge-tree (LSM-tree). Acta Inf. 33(4), 351–385 (1996)
Article MATH Google Scholar
Oracle: Managing index-organized tables. https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables012.htm#ADMIN01506 (2018)
Pivarski, J., Elmer, P., Bockelman, B., Zhang, Z.: Fast access to columnar, hierarchical data via code transformation. ArXiv e-prints. arXiv:1708.08319 (2017)
Ramakrishnan, R., Gehrke, J.: Database Management Systems, 3rd edn. McGraw-Hill Inc., New York (2003)
MATH Google Scholar
Richard G.G. III, Roussev, V.: Scalpel: a frugal. High performance file carver. In: Digital Forensic Research Workshop (DFRWS) (2005)
Schneider, T.: Unified New York City taxi and uber data. https://github.com/toddwschneider/nyc-taxi-data (2016a)
Schneider, T.W.: Unified New York city taxi and Uber data. https://github.com/toddwschneider/nyc-taxi-data (2016b)
Sears, R., Ramakrishnan, R.: bLSM: a general purpose log structured merge tree. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 217–228 (2012)
Seshadri, P., Swami, A.: Generalized partial indexes. In: Proceedings of the Eleventh International Conference on Data Engineering, pp. 420–427 (1995)
Szalay, A.S., Gray, J., Thakar, A.R., Kunszt, P.Z., Malik, T., Raddick, J., Stoughton, C., vandenBerg, J.: The SDSS Skyserver: public access to the Sloan digital sky server data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 570–581 (2002)
Wagner, J., Rasin, A., Malik, T., Heart, K., Jehle, H., Grier, J.: Database forensic analysis with DBCarver. In: Conference of Innovative Database Research, pp. 84–90 (2017a)
Wagner, J., Rasin, A., That, D.H.T., Malik, T.: PLI: augmenting live databases with custom clustered indexes. In: Proceedings of the International Conference on Scientific and Statistical Database Management (2017b)
Walck, C.: Handbook on Statistical Distributions for Experimentalists. University of Stockholm, Stockholm (1996)
Google Scholar
Wang, X., Burns, R.C., Malik, T.: LifeRaft: data-driven, batch processing for the exploration of scientific databases. In: Proceedings Biennial Conference on Innovative Data Systems Research (CIDR) (2009)
Wang, X., Perlman, E., Burns, R., Malik, T., Budavári, T., Meneveau, C., Szalay, A.: Jaws: job-aware workload scheduling for the exploration of turbulence simulations. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11. IEEE Computer Society, Washington DC (2010)

Download references

Author information

Authors and Affiliations

College of Computing and Digital Media, DePaul University, Chicago, IL, USA
Dai Hai Ton That, James Wagner, Alexander Rasin & Tanu Malik

Authors

Dai Hai Ton That
View author publications
You can also search for this author in PubMed Google Scholar
James Wagner
View author publications
You can also search for this author in PubMed Google Scholar
Alexander Rasin
View author publications
You can also search for this author in PubMed Google Scholar
Tanu Malik
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dai Hai Ton That.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ton That, D.H., Wagner, J., Rasin, A. et al. PLI$^+$: efficient clustering of cloud databases. Distrib Parallel Databases 37, 177–208 (2019). https://doi.org/10.1007/s10619-018-7252-2

Download citation

Published: 06 October 2018
Issue Date: 15 March 2019
DOI: https://doi.org/10.1007/s10619-018-7252-2

PLI\(^+\): efficient clustering of cloud databases

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

PatchIndex: exploiting approximate constraints in distributed databases

A Universal Distributed Indexing Scheme for Data Centers with Tree-Like Topologies

Adaptive partitioning and indexing for in situ query processing

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

PLI\(^+\): efficient clustering of cloud databases

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

PatchIndex: exploiting approximate constraints in distributed databases

A Universal Distributed Indexing Scheme for Data Centers with Tree-Like Topologies

Adaptive partitioning and indexing for in situ query processing

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now