[go: up one dir, main page]

skip to main content
10.1145/3323873.3325048acmconferencesArticle/Chapter ViewAbstractPublication PagesicmrConference Proceedingsconference-collections
short-paper
Public Access

qwLSH: Cache-conscious Indexing for Processing Similarity Search Query Workloads in High-Dimensional Spaces

Published: 05 June 2019 Publication History

Abstract

Similarity search queries in high-dimensional spaces are an important type of queries in many domains such as image processing, machine learning, etc. %Since exact similarity search indexing techniques suffer from the well-knowncurse of dimensionality in high-dimensional spaces, approximate search techniques are often utilized instead. Locality Sensitive Hashing (LSH) has been shown to be an effective approximate search method for solving similarity search queries in high-dimensional spaces. Often, queries in real-world settings arrive as part of a query workload. LSH and its variants are particularly designed to solve single queries effectively. They suffer from one major drawback while executing query workloads: they do not take into consideration important data characteristics for effective cache utilization while designing the index structures. In this paper, we presentqwLSH, an index structure %for efficiently processing similarity search query workloads in high-dimensional spaces. We that intelligently divides a given cache during processing of a query workload by using novel cost models. Experimental results show that, given a query workload,qwLSH is able to perform faster than existing techniques due to its unique cost models and strategies to reduce cache misses. %We further present different caching strategies for efficiently processing similarity search query workloads. We evaluate our proposed unique design and cost models ofqwLSH on real datasets against state-of-the-art LSH-based techniques.

References

[1]
1998. Mnist. http://yann.lecun.com/exdb/mnist/
[2]
2004. Sift. http://corpus-texmex.irisa.fr/
[3]
2005. E2LSH. https://www.mit.edu/~andoni/LSH/
[4]
2006. Audio. http://www.cs.princeton.edu/cass/audio.tar.gz
[5]
2006. P53. https://archive.ics.uci.edu/ml/datasets/p53+Mutants
[6]
Daniar Achakeev, Bernhard Seeger, and PeterWidmayer. 2012. Sort-based Queryadaptive Loading of R-trees. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM '12). ACM, New York, NY, USA, 2080--2084.
[7]
Ahmed M. Aly, Hazem Elmeleegy, Yan Qi, and Walid Aref. 2016. Kangaroo: Workload-Aware Processing of Range Data and Range Queries in Hadoop. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM '16). ACM, New York, NY, USA, 397--406.
[8]
Ahmed M. Aly, Ahmed R. Mahmood, Mohamed S. Hassan,Walid G. Aref, Mourad Ouzzani, Hazem Elmeleegy, and Thamir Qadah. 2015. AQWA: Adaptive Query Workload Aware Partitioning of Big Spatial Data. Proc. VLDB Endow. 8, 13 (Sept. 2015), 2062--2073.
[9]
Saikat Basu, Sangram Ganguly, Supratik Mukhopadhyay, Robert DiBiano, Manohar Karki, and Ramakrishna Nemani. 2015. DeepSat: A Learning Framework for Satellite Imagery. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL '15). ACM, New York, NY, USA, Article 37, 10 pages.
[10]
Mayank Bawa, Tyson Condie, and Prasanna Ganesan. 2005. LSH Forest: Selftuning Indexes for Similarity Search. In Proceedings of the 14th International Conference on World Wide Web (WWW '05). ACM, New York, NY, USA, 651--660.
[11]
Konstantin Berlin, Sergey Koren, Chen-Shan Chin, James P Drake, Jane M Landolin, and Adam M Phillippy. 2015. Assembling large genomes with singlemolecule sequencing and locality-sensitive hashing. Nature Biotechnology 33, 6 (jun 2015), 623--630.
[12]
Ruben Buaba, Mohamed Gebril, Abdollah Homaifar, Eric Kihn, and Mikhail Zhizhin. 2010. Locality Sensitive Hashing for satellite images using texture feature vectors. In 2010 IEEE Aerospace Conference. IEEE, 1--10.
[13]
Jeremy Buhler. 2001. Efficient large-scale sequence comparison by localitysensitive hashing. Bioinformatics 17, 5 (may 2001), 419--428.
[14]
Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. 2010. Schism: A Workload-driven Approach to Database Replication and Partitioning. Proc. VLDB Endow. 3, 1--2 (Sept. 2010), 48--57.
[15]
Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. 2004. Localitysensitive Hashing Scheme Based on P-stable Distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry (SCG '04). ACM, New York, NY, USA, 253--262.
[16]
Junhao Gan, Jianlin Feng, Qiong Fang, and Wilfred Ng. 2012. Locality-sensitive Hashing Scheme Based on Dynamic Collision Counting. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12). ACM, New York, NY, USA, 541--552.
[17]
Jinyang Gao, H.V. Jagadish, Beng Chin Ooi, and Sheng Wang. 2015. Selective Hashing: Closing the Gap Between Radius Search and k-NN Search. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). ACM, New York, NY, USA, 349--358.
[18]
M. Gibas, G. Canahuate, and H. Ferhatosmanoglu. 2008. Online Index Recommendations for High-Dimensional Databases Using Query Workloads. IEEE Transactions on Knowledge and Data Engineering 20, 2 (Feb 2008), 246--260.
[19]
Richard A. Hankins and Jignesh M. Patel. 2003. Effect of Node Size on the Performance of Cache-conscious B+-trees. In Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '03). ACM, New York, NY, USA, 283--294.
[20]
Qiang Huang, Jianlin Feng, Yikai Zhang, Qiong Fang, and Wilfred Ng. 2015. Query-aware Locality-sensitive Hashing for Approximate Nearest Neighbor Search. Proc. VLDB Endow. 9, 1 (Sept. 2015), 1--12.
[21]
Piotr Indyk and Rajeev Motwani. 1998. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing (STOC '98). ACM, New York, NY, USA, 604--613.
[22]
Yingfan Liu, Jiangtao Cui, Zi Huang, Hui Li, and Heng Tao Shen. 2014. SK-LSH: An Efficient Index Structure for Approximate Nearest Neighbor Search. Proc. VLDB Endow. 7, 9 (May 2014), 745--756.
[23]
Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. 2007. Multiprobe LSH: Efficient Indexing for High-dimensional Similarity Search. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07). VLDB Endowment, 950--961. http://dl.acm.org/citation.cfm?id=1325851.1325958
[24]
Parth Nagarkar and K. Selçuk Candan. 2014. HCS: Hierarchical Cut Selection for Efficiently Processing Queries on Data Columns using Hierarchical Bitmap Indices. In Proceedings of the 17th International Conference on Extending Database Technology, EDBT 2014, Athens, Greece, March 24--28, 2014. 271--282.
[25]
Parth Nagarkar, K. Selçuk Candan, and Aneesha Bhat. 2015. Compressed Spatial Hierarchical Bitmap (cSHB) Indexes for Efficiently Processing Spatial Range Query Workloads. PVLDB 8, 12 (2015), 1382--1393. http://www.vldb.org/pvldb/ vol8/p1382-nagarkar.pdf
[26]
Andrew Pavlo, Carlo Curino, and Stanley Zdonik. 2012. Skew-aware Automatic Database Partitioning in Shared-nothing, Parallel OLTP Systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12). ACM, New York, NY, USA, 61--72.
[27]
Abdul Quamar, K. Ashwin Kumar, and Amol Deshpande. 2013. SWORD: Scalable Workload-aware Data Placement for Transactional Workloads. In Proceedings of the 16th International Conference on Extending Database Technology (EDBT '13). ACM, New York, NY, USA, 430--441.
[28]
Jun Rao and Kenneth A. Ross. 2000. Making B+- Trees Cache Conscious in Main Memory. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00). ACM, New York, NY, USA, 475--486.
[29]
Zeehasham Rasheed, Huzefa Rangwala, and Daniel Barbará. 2013. 16S rRNA metagenome clustering and diversity estimation using locality sensitive hashing. BMC Systems Biology 7, Suppl 4 (oct 2013), S11.
[30]
Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. 2008. LabelMe: A Database and Web-Based Tool for Image Annotation. International Journal of Computer Vision 77, 1 (01 May 2008), 157--173.
[31]
Stefan Sprenger, Steffen Zeuch, and Ulf Leser. 2017. Cache-Sensitive Skip List: Efficient Range Queries on Modern CPUs. In Data Management on New Hardware, Spyros Blanas, Rajesh Bordawekar, Tirthankar Lahiri, Justin Levandoski, and Andrew Pavlo (Eds.). Springer International Publishing, Cham, 1--17.
[32]
Yifang Sun, Wei Wang, Jianbin Qin, Ying Zhang, and Xuemin Lin. 2014. SRS: Solving C-approximate Nearest Neighbor Queries in High Dimensional Euclidean Space with a Tiny Index. Proc. VLDB Endow. 8, 1 (Sept. 2014), 1--12.
[33]
Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden, and Pradeep Dubey. 2013. Streaming Similarity Search over One Billion Tweets Using Parallel Locality-sensitive Hashing. Proc. VLDB Endow. 6, 14 (Sept. 2013), 1930--1941.
[34]
Yufei Tao, Ke Yi, Cheng Sheng, and Panos Kalnis. 2010. Efficient and Accurate Nearest Neighbor and Closest Pair Search in High-dimensional Space. ACM Trans. Database Syst. 35, 3, Article 20 (July 2010), 46 pages.
[35]
Kostas Tzoumas, Man Lung Yiu, and Christian S. Jensen. 2009. Workload-aware Indexing of Continuously Moving Objects. Proc. VLDB Endow. 2, 1 (Aug. 2009), 1186--1197.
[36]
Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. 2014. Hashing for Similarity Search: A Survey. CoRR abs/1408.2927 (2014). http://arxiv.org/abs/ 1408.2927
[37]
C. E. Yoon, O. OReilly, K. J. Bergen, and G. C. Beroza. 2015. Earthquake detection through computationally efficient similarity search. Science Advances 1, 11 (dec 2015), e1501057--e1501057.
[38]
Kamen Yotov, Tom Roeder, Keshav Pingali, John Gunnels, and Fred Gustavson. 2007. An Experimental Comparison of Cache-oblivious and Cache-conscious Programs. In Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '07). ACM, New York, NY, USA, 93--104.
[39]
Yuxin Zheng, Qi Guo, Anthony K.H. Tung, and Sai Wu. 2016. LazyLSH: Approximate Nearest Neighbor Search for Multiple Distance Functions with a Single Index. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 2023--2037.

Cited By

View all
  • (2020)Efficient Bitmap-based Indexing and Retrieval of Similarity Search Image Queries2020 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI)10.1109/SSIAI49293.2020.9094616(58-61)Online publication date: Mar-2020
  • (2020)mmLSH: A Practical and Efficient Technique for Processing Approximate Nearest Neighbor Queries on Multimedia DataSimilarity Search and Applications10.1007/978-3-030-60936-8_4(47-61)Online publication date: 14-Oct-2020
  • (2020)Improving Locality Sensitive Hashing by Efficiently Finding Projected Nearest NeighborsSimilarity Search and Applications10.1007/978-3-030-60936-8_25(323-337)Online publication date: 14-Oct-2020

Index Terms

  1. qwLSH: Cache-conscious Indexing for Processing Similarity Search Query Workloads in High-Dimensional Spaces

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ICMR '19: Proceedings of the 2019 on International Conference on Multimedia Retrieval
    June 2019
    427 pages
    ISBN:9781450367653
    DOI:10.1145/3323873
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 05 June 2019

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. high-dimensional search
    2. locality sensitive hashing
    3. nearest-neighbor search

    Qualifiers

    • Short-paper

    Funding Sources

    Conference

    ICMR '19
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 254 of 830 submissions, 31%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)29
    • Downloads (Last 6 weeks)6
    Reflects downloads up to 04 Sep 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2020)Efficient Bitmap-based Indexing and Retrieval of Similarity Search Image Queries2020 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI)10.1109/SSIAI49293.2020.9094616(58-61)Online publication date: Mar-2020
    • (2020)mmLSH: A Practical and Efficient Technique for Processing Approximate Nearest Neighbor Queries on Multimedia DataSimilarity Search and Applications10.1007/978-3-030-60936-8_4(47-61)Online publication date: 14-Oct-2020
    • (2020)Improving Locality Sensitive Hashing by Efficiently Finding Projected Nearest NeighborsSimilarity Search and Applications10.1007/978-3-030-60936-8_25(323-337)Online publication date: 14-Oct-2020

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Get Access

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media