[go: up one dir, main page]

Skip to main content
Log in

PLI\(^+\): efficient clustering of cloud databases

  • Published:
Distributed and Parallel Databases Aims and scope Submit manuscript

Abstract

Commercial cloud database services increase availability of data and provide reliable access to data. Routine database maintenance tasks such as clustering, however, increase the costs of hosting data on commercial cloud instances. Clustering causes an I/O burst; clustering in one-shot depletes I/O credit accumulated by an instance and increases the cost of hosting data. An unclustered database decreases query performance by scanning large amounts of data, gradually depleting I/O credits. In this paper, we introduce Physical Location Index Plus (\({PLI}^{\small {{+}}}\)), an indexing method for databases hosted on commercial cloud. \({PLI}^{\small {{+}}}\) relies on internal knowledge of data layout, building a physical location index, which maps a range of physical co-locations with a range of attribute values to create approximately sorted buckets. As new data is inserted, writes are partitioned in memory based on incoming data distribution. The data is written to physical locations on disk in block-based partitions to favor large granularity I/O. Incoming SQL queries on indexed attribute values are rewritten in terms of the physical location ranges. As a result, \({PLI}^{\small {{+}}}\) does not decrease query performance on an unclustered cloud database instance, DBAs may choose to cluster the instance when they have sufficiently large I/O credit available for clustering thus delaying the need for clustering. We evaluate query performance over \({PLI}^{\small {{+}}}\) by comparing it with clustered, unclustered (secondary) indexes, and log-structured merge trees on real datasets. Experiments show that \({PLI}^{\small {{+}}}\) significantly delays clustering, and yet does not degrade query performance—thus achieving higher level of sortedness than unclustered indexes and log-structured merge trees. We also evaluate the quality of clustering by introducing a measure of interval sortedness, and the size of index.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14

Similar content being viewed by others

Notes

  1. A DBMS deployed on a cloud platform.

  2. We use the cost estimates from [4]; each GB at this time costs about $0.125.

  3. This is a skeleton tree. At the initial stage, all tree nodes are initialized with meta data (e.g., intervals, pointers, etc.) but nodes initially do not contain any data.

References

  1. Agrawal, S., Narasayya, V., Yang, B.: Integrating vertical and horizontal partitioning into automated physical database design. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD ’04, pp 359–370. ACM, New York (2004). https://doi.org/10.1145/1007568.1007609

  2. Agrawal, S., Chaudhuri, S., Kollar, L., Marathe, A., Narasayya, V., Syamala, M.: Database tuning advisor for microsoft sql server 2005: demo. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD ’05, pp. 930–932. ACM, New York (2005). https://doi.org/10.1145/1066157.1066292

  3. Amazon: Amazon EBS product details. https://aws.amazon.com/ebs/details/ (2017a)

  4. Amazon: Amazon RDS for PostgreSQL pricing. https://aws.amazon.com/rds/postgresql/pricing/ (2017b)

  5. Amazon: Storage for Amazon RDS. https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html (2017c)

  6. Ang, C.H., Tan, K.P.: The interval B-tree. Inf. Process. Lett. 53(2), 85–89 (1995)

    Article  MathSciNet  MATH  Google Scholar 

  7. Catlett, C., Malik, T., Goldstein, B., Giuffrida, J., Shao, Y., Panella, A., Eder, D., Zanten, Ev, Mitchum, R., Thaler, S., Foster, I.T.: Plenario: an open data discovery and exploration platform for urban science. IEEE Data Eng. Bull. 37(4), 27–42 (2014)

    Google Scholar 

  8. Consortium, G.P.: A global reference for human genetic variation. Nature 526(7571), 68 (2015)

    Article  Google Scholar 

  9. Curino, C., Jones, E., Zhang, Y., Madden, S.: Schism: a workload-driven approach to database replication and partitioning. Proc VLDB Endow. 3(1–2), 48–57 (2010). https://doi.org/10.14778/1920841.1920853

    Article  Google Scholar 

  10. Garfinkel, S.L.: Carving contiguous and fragmented files with fast object validation. Digit. Investig. 4, 2–12 (2007)

    Article  Google Scholar 

  11. Gray, R.M.: Entropy and Information Theory. Springer, New York (1990)

    Book  MATH  Google Scholar 

  12. Jannen, W., Yuan, J., Zhan, Y., Akshintala, A., Esmet, J., Jiao, Y., Mittal, A., Pandey, P., Reddy, P., Walsh, L., Bender, M., Farach-Colton, M., Johnson, R., Kuszmaul, B.C., Porter, D.E.: BetrFS: a right-optimized write-optimized file system. In: USENIX Conference on File and Storage Technologies (FAST) (2015)

  13. Jindal, A., Dittrich, J.: Relax and let the database do the partitioning online. In: Castellanos, M., Dayal, U., Lehner, W. (eds.) Enabling Real-Time Business Intelligence, pp. 65–80. Springer, Berlin (2012)

    Chapter  Google Scholar 

  14. Kersten, M.L., Manegold, S., et al.: Cracking the database store. In: CIDR, vol. 5, pp. 4–7 (2005)

  15. Kimura, H., Huo, G., Rasin, A., Madden, S., Zdonik, S.B.: Correlation maps: a compressed access method for exploiting soft functional dependencies. Proc. VLDB Endow. 2(1), 1222–1233 (2009)

    Article  Google Scholar 

  16. Li, Y., He, B., Yang, R.J., Luo, Q., Yi, K.: Tree indexing on solid state drives. Proc. VLDB Endow. 3, 1–2 (2010)

    Article  Google Scholar 

  17. National Center for Biotechnology Information UNLoM: Genbank. https://www.ncbi.nlm.nih.gov/genbank/ (2017)

  18. NASA: Nasa earth exchange. https://docs.opendata.aws/nasa-nex/readme.html (2018)

  19. O’Neil, P., Cheng, E., Gawlick, D., O’Neil, E.: The log-structured merge-tree (LSM-tree). Acta Inf. 33(4), 351–385 (1996)

    Article  MATH  Google Scholar 

  20. Oracle: Managing index-organized tables. https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables012.htm#ADMIN01506 (2018)

  21. Pivarski, J., Elmer, P., Bockelman, B., Zhang, Z.: Fast access to columnar, hierarchical data via code transformation. ArXiv e-prints. arXiv:1708.08319 (2017)

  22. Ramakrishnan, R., Gehrke, J.: Database Management Systems, 3rd edn. McGraw-Hill Inc., New York (2003)

    MATH  Google Scholar 

  23. Richard G.G. III, Roussev, V.: Scalpel: a frugal. High performance file carver. In: Digital Forensic Research Workshop (DFRWS) (2005)

  24. Schneider, T.: Unified New York City taxi and uber data. https://github.com/toddwschneider/nyc-taxi-data (2016a)

  25. Schneider, T.W.: Unified New York city taxi and Uber data. https://github.com/toddwschneider/nyc-taxi-data (2016b)

  26. Sears, R., Ramakrishnan, R.: bLSM: a general purpose log structured merge tree. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 217–228 (2012)

  27. Seshadri, P., Swami, A.: Generalized partial indexes. In: Proceedings of the Eleventh International Conference on Data Engineering, pp. 420–427 (1995)

  28. Szalay, A.S., Gray, J., Thakar, A.R., Kunszt, P.Z., Malik, T., Raddick, J., Stoughton, C., vandenBerg, J.: The SDSS Skyserver: public access to the Sloan digital sky server data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 570–581 (2002)

  29. Wagner, J., Rasin, A., Malik, T., Heart, K., Jehle, H., Grier, J.: Database forensic analysis with DBCarver. In: Conference of Innovative Database Research, pp. 84–90 (2017a)

  30. Wagner, J., Rasin, A., That, D.H.T., Malik, T.: PLI: augmenting live databases with custom clustered indexes. In: Proceedings of the International Conference on Scientific and Statistical Database Management (2017b)

  31. Walck, C.: Handbook on Statistical Distributions for Experimentalists. University of Stockholm, Stockholm (1996)

    Google Scholar 

  32. Wang, X., Burns, R.C., Malik, T.: LifeRaft: data-driven, batch processing for the exploration of scientific databases. In: Proceedings Biennial Conference on Innovative Data Systems Research (CIDR) (2009)

  33. Wang, X., Perlman, E., Burns, R., Malik, T., Budavári, T., Meneveau, C., Szalay, A.: Jaws: job-aware workload scheduling for the exploration of turbulence simulations. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11. IEEE Computer Society, Washington DC (2010)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dai Hai Ton That.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Ton That, D.H., Wagner, J., Rasin, A. et al. PLI\(^+\): efficient clustering of cloud databases. Distrib Parallel Databases 37, 177–208 (2019). https://doi.org/10.1007/s10619-018-7252-2

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10619-018-7252-2

Keywords