
Learning hash index based on a shallow autoencoder


Abstract

The hash index plays an important role in improving query efficiency in databases. Because traditional hash algorithms cannot exploit the distribution of the original data, they often suffer from high collision rates on large-scale datasets. In addition, some traditional hashing algorithms are data-dependent and cannot be accelerated in parallel. The learned index offers a new approach to index design in data management systems: its key idea is to treat the index as a model that can be learned. In this paper, we propose a learning hash algorithm based on a shallow autoencoder that makes full use of the characteristics of the original data and takes advantage of the parallelism of matrix operations. Compared with traditional hash functions, the proposed method therefore achieves a lower collision rate and higher efficiency. Finally, we verify the effectiveness of the proposed method through a series of experiments on synthetic and real datasets. The results show that the proposed hash algorithm substantially reduces the collision rate and computation time while improving retrieval efficiency.
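
To make the learned-hash idea concrete, the following minimal sketch shows the general mechanics: a small neural model trained on the key distribution maps each normalized key to a bucket, and its collision rate is compared with a plain modulo hash. The synthetic lognormal keys, the one-hidden-layer network, the empirical-CDF training target, the bucket count, and the modulo baseline are all assumptions made for this illustration; they are not the exact architecture, training objective, or datasets of the shallow autoencoder proposed in the paper.

    # Minimal sketch of a learned hash function (illustrative only; not the
    # authors' exact model). A tiny one-hidden-layer network is trained to
    # approximate the empirical CDF of the keys; scaling its output by the
    # table size yields the bucket index.
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic, skewed keys (lognormal), normalised to [0, 1].
    keys = np.sort(rng.lognormal(mean=0.0, sigma=0.5, size=20_000))
    x = ((keys - keys.min()) / (keys.max() - keys.min())).reshape(-1, 1)

    # Training target: the empirical CDF of the keys (rank / N).
    t = (np.arange(len(keys)) / (len(keys) - 1)).reshape(-1, 1)

    # Shallow network 1 -> 16 -> 1, trained with full-batch gradient descent.
    H, lr = 16, 0.5
    W1 = rng.normal(0.0, 0.5, (1, H)); b1 = np.zeros(H)
    W2 = rng.normal(0.0, 0.5, (H, 1)); b2 = np.zeros(1)

    for _ in range(3000):
        h = np.tanh(x @ W1 + b1)            # hidden layer
        y = h @ W2 + b2                     # predicted position in [0, 1]
        err = y - t
        gW2 = h.T @ err / len(x); gb2 = err.mean(axis=0)
        dh = (err @ W2.T) * (1.0 - h ** 2)  # backpropagate through tanh
        gW1 = x.T @ dh / len(x); gb1 = dh.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2

    # Learned hash: scale the model output to a bucket index.
    n_buckets = 50_000
    pred = (np.tanh(x @ W1 + b1) @ W2 + b2).ravel()
    learned_slot = np.clip((pred * n_buckets).astype(int), 0, n_buckets - 1)

    # Baseline: plain modulo hash on the integer-scaled keys.
    baseline_slot = (keys * 1e6).astype(np.int64) % n_buckets

    def collision_rate(slots):
        # Fraction of keys that share their bucket with at least one other key.
        _, counts = np.unique(slots, return_counts=True)
        return 1.0 - (counts == 1).sum() / len(slots)

    print(f"learned hash collision rate: {collision_rate(learned_slot):.3f}")
    print(f"modulo hash collision rate : {collision_rate(baseline_slot):.3f}")

Because the model output approximates the cumulative distribution of the keys, scaling it by the table size tends to spread skewed keys more evenly across buckets than a distribution-agnostic hash, which is the property a learned hash index exploits; the forward pass also reduces to matrix operations that can be batched and parallelised.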




Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Notes

  1. https://www.kaggle.com/therohk/examine-the-examiner

  2. https://nijianmo.github.io/amazon/index.html


Acknowledgements

We thank the editor and anonymous reviewers for their valuable comments and feedback. This work was supported by the Guangxi Natural Science Foundation (No. 2018GXNSFDA281049), the Science and Technology Major Project of Guangxi Province (No. AA19046004), the National Natural Science Foundation of China (Nos. 62062027, U1811264, 62167002 and 61862013), the Guangxi Key Laboratory of Trusted Software (No. kx202203), and the Innovation Project of GUET Graduate Education (No. 2022YCXS079).

Author information


Corresponding author

Correspondence to You Li.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lin, Y., Huang, Z. & Li, Y. Learning hash index based on a shallow autoencoder. Appl Intell 53, 14999–15010 (2023). https://doi.org/10.1007/s10489-022-04274-w

