Abstract
The hash index plays an important role in improving query efficiency in databases. Because traditional hash algorithms cannot use the original data distribution, there is often a high collision rate in large-scale datasets. Additionally, some traditional hashing algorithms are data-dependent and cannot be accelerated in parallel. The learned index provides a new method for index design in data management systems. The key idea of the learned index is to consider the index as a model that can be learned. In this paper, we propose a learning hash algorithm based on a shallow autoencoder that can make full use of the original data characteristics and take advantage of the parallelism of matrix operations. Therefore, compared with traditional hash functions, the proposed method has a lower collision rate and higher efficiency. Finally, we verify the effectiveness of the proposed method through a series of experiments on synthetic datasets and real datasets. Experimental results show that the proposed hash algorithm has considerable advantages in reducing the collision rate and computing time while improving the retrieval efficiency.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Dittrich J, Nix J, Schon̈ C (2021) The next 50 years in database indexing or: the case for automatically generated index structures. Proc VLDB Endow 15(3):527–540. https://doi.org/10.14778/3494124.3494136
Hu D, Chen Z, Wu J, Sun J, Chen H (2021) Persistent memory hash indexes: an experimental evaluation. Proc VLDB Endow 14(5):785–798. https://doi.org/10.14778/3446095.3446101
Pagh R, Rodler FF (2004) Cuckoo hashing. J Algorithms 51(2):122–144. https://doi.org/10.1016/j.jalgor.2003.12.002
Dietzfelbinger M, Karlin A, Mehlhorn K, Meyer auf der Heide F, Rohnert H, Tarjan RE (1994) Dynamic perfect hashing: upper and lower bounds. SIAM J Comput 23(4):738–761. https://doi.org/10.1137/S0097539791194094
Richter S, Alvarez V, Dittrich J (2015) A seven-dimensional analysis of hashing methods and its implications on query processing. PVLDB 9(3):96–107. https://doi.org/10.14778/2850583.2850585
Rivest RL (1992) The MD5 message-digest algorithm. RFC 1321:1–21. https://doi.org/10.17487/RFC1321
GelB P, Klus S, Schuster I, Schütte C (2021) Feature space approximation for kernel-based supervised learning. Knowl-Based Syst 221:106935. https://doi.org/10.1016/j.knosys.2021.106935
Zheng Z, Zheng L, Yang Y (2018) A discriminatively learned cnn embedding for person reidentification. ACM Trans Multimedia Comput Commun Appl 14(1):13–11320. https://doi.org/10.1145/3159171
Zhao F, Huang Y, Wang L, Tan T (2015) Deep semantic ranking based hashing for multi-label image retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 1556–1564
Cao Y, Qi H, Gui J, Li S, Li K (2019) General distributed hash learning on image descriptors for k-nearest neighbor search. IEEE Signal Process Lett 26(5):750–754. https://doi.org/10.1109/LSP.2019.2907777
Zhang B, Qian J, Xie X, Xin Y, Dong Y (2021) Capsnet-based supervised hashing. Appl Intell 51(8):5912–5926. https://doi.org/10.1007/s10489-020-02180-7
Xu X, Lai Z-H, Chen Yd (2020) Relaxed locality preserving supervised discrete hashing. IEEE Trans Big Data:1–1. https://doi.org/10.1109/TBDATA.2020.3027379
Kraska T, Beutel A, Chi EH, Dean J, Polyzotis N (2018) The case for learned index structures. In: Proceedings of the 2018 international conference on management of data, pp 489–504. https://doi.org/10.1145/3183713.3196909
Kipf A, Marcus R, van Renen A, Stoian M, Kemper A, Kraska T, Neumann T (2020) Radixspline: a single-pass learned index. In: Proceedings of the third international workshop on exploiting artificial intelligence techniques for data management, pp 1–5. https://doi.org/10.1145/3401071.3401659
Ferragina P, Vinciguerra G (2020) The pgm-index: a fully-dynamic compressed learned index with provable worst-case bounds. Proc VLDB Endow 13(8):1162–1175. https://doi.org/10.14778/3389133.3389135
Tang C, Wang Y, Dong Z, Hu G, Wang Z, Wang M, Chen H (2020) Xindex: a scalable learned index for multicore data storage. In: Proceedings of the 25th ACM SIGPLAN symposium on principles and practice of parallel programming, pp 308–320. https://doi.org/10.1145/3332466.3374547
Galakatos A, Markovitch M, Binnig C, Fonseca R, Kraska T (2019) Fiting-tree: a data-aware index structure. In: Proceedings of the 2019 international conference on management of data, pp 1189–1206. https://doi.org/10.1145/3299869.3319860
Ding J, Minhas UF, Yu J, Wang C, Do J, Li Y, Zhang H, Chandramouli B, Gehrke J, Kossmann D., Lomet D, Kraska T (2020) Alex: an updatable adaptive learned index. In: Proceedings of the 2020 ACM SIGMOD international conference on management of data, pp 969–984. https://doi.org/10.1145/3318464.3389711
Li P, Hua Y, Zuo P, Jia J (2019) A scalable learned index scheme in storage systems. arXiv:1905.06256
Li X, Li J, Wang X (2019) Aslm: adaptive single layer model for learned index. In: DASFAA 2019 international workshops: BDMS, BDQM, and GDMA, pp 80–95. https://doi.org/10.1007/978-3-030-18590-9_6
Wu J, Zhang Y, Chen S, Wang J, Chen Y, Xing C (2021) Updatable learned index with precise positions. Proc VLDB Endow 14(8):1276–1288. https://doi.org/10.14778/3457390.3457393
Lu B, Ding J, Lo E, Minhas UF, Wang T (2021) Apex: a high-performance learned index on persistent memory. Proc VLDB Endow 15(3):597–610. https://doi.org/10.14778/3494124.3494141
Vitter JS, Chen W-C (1987) The Design and Analysis of Coalesced Hashing. Oxford University Press Inc., Oxford
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual acm symposium on theory of computing, pp 604–613. https://doi.org/10.1145/276698.276876
Smeulders AWM, Worring M, Santini S, Gupta A, Jain R (2000) Content-based image retrieval at the end of the early years. IEEE Trans Pattern Anal Mach Intell 22(12):1349–1380. https://doi.org/10.1109/34.895972
Guo J, Zhang S, Li J (2016) Hash learning with convolutional neural networks for semantic based image retrieval. In: Advances in knowledge discovery and data mining, pp 227–238. https://doi.org/10.1007/978-3-319-31753-3_19
Weiss Y, Torralba A, Fergus R (2008) Spectral hashing. In: Proceedings of the twenty-second annual conference on neural information processing systems (NIPS 2008), pp 1753–1760
Song J, He T, Gao L, Xu X, Hanjalic A, Shen HT (2018) Binary generative adversarial networks for image retrieval. In: Proceedings of the AAAI conference on artificial intelligence, pp 394–401
Chafik S, El Yacoubi MA, Daoudi I, El Ouardi H (2019) Unsupervised deep neuron-per-neuron hashing. Appl Intell 49(6):2218–2232. https://doi.org/10.1007/s10489-018-1353-5
Wang J, Xu S, Zheng F, Lu K, Song J, Shao L (2021) Learning efficient hash codes for fast graph-based data similarity retrieval. IEEE Trans Image Process 30:6321–6334. https://doi.org/10.1109/TIP.2021.3093387
Vincent P, Larochelle H, Bengio Y, Manzagol P-A (2008) Extracting and composing robust features with denoising autoencoders. In: ICML ’08
Reichenbach L (2018) Learned index structures: an evalution of their performance in key-value storage solutions. PhD thesis, Universität Hamburg
Ni J, Li J, McAuley J (2019) Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 188–197. https://doi.org/10.18653/v1/D19-1018
Sedgewick R (2001) Part 3: searching algorithms. Algorithms in C, pp 231–234. Pearson Education, Boston
Aho AV, Lam MS, Sethi R, Ullman JD (2007) Compilers: principles, techniques, and tools, Pearson Education, India, Bengaluru
Ritchie DM, Kernighan BW, Lesk ME (1988) The C programming language. Prentice Hall, Englewood Cliffs
Preiss BR (1999) Data structure and algorithms: with object-oriented design patterns in Java. Wiley, Hoboken
Fowler G, Noll LC, Vo K-P, Eastlake DE, Hansen T (2022) The FNV non-cryptographic hash algorithm. Internet-draft, Internet Engineering Task Force. Work in Progress. https://datatracker.ietf.org/doc/draft-eastlake-fnv/18/
Acknowledgements
We thank the editor and anonymous reviewers for their valuable comments and feedbacks. This work was supported by Guangxi Natural Science Foundations(No. 2018GXNSFDA281049), Science and Technology Major Project of Guangxi Province (No. AA19046004), National Natural Science Foundation of China (Nos. 62062027, U1811264, 62167002 and 61862013), Guangxi Key Laboratory of Trusted Software (No. kx202203) and the Innovation Project of GUET Graduate Education (No. 2022YCXS079).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lin, Y., Huang, Z. & Li, Y. Learning hash index based on a shallow autoencoder. Appl Intell 53, 14999–15010 (2023). https://doi.org/10.1007/s10489-022-04274-w
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489-022-04274-w