[go: up one dir, main page]

skip to main content
research-article

Multiple Set Matching with Bloom Matrix and Bloom Vector

Published: 09 February 2020 Publication History

Abstract

Bloom Filter is a space-efficient probabilistic data structure for checking the membership of elements in a set. Given multiple sets, a standard Bloom Filter is not sufficient when looking for the items to which an element or a set of input elements belong. An example case is searching for documents with keywords in a large text corpus, which is essentially a multiple set matching problem where the input is single or multiple keywords, and the result is a set of possible candidate documents. This article solves the multiple set matching problem by proposing two efficient Bloom Multifilters called Bloom Matrix and Bloom Vector, which generalize the standard Bloom Filter. Both structures are space-efficient and answer queries with a set of identifiers for multiple set matching problems. The space efficiency can be optimized according to the distribution of labels among multiple sets: Uniform and Zipf. Bloom Vector efficiently exploits the Zipf distribution of data for further space reduction. Indeed, both structures are much more space-efficient compared with the state-of-the-art, Bloofi. The results also highlight that a Lookup operation on Bloom Matrix is significantly faster than on Bloom Vector and Bloofi.

References

[1]
Paulo Sérgio Almeida, Carlos Baquero, Nuno Preguiça, and David Hutchison. 2007. Scalable Bloom Filters. Inform. Process. Lett. 101, 6 (2007), 255--261.
[2]
Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (July 1970), 422--426.
[3]
Flavio Bonomi, Michael Mitzenmacher, Rina Panigrah, Sushil Singh, and George Varghese. 2006. Beyond Bloom Filters: From approximate membership checks to approximate state machines. SIGCOMM Comput. Commun. Rev. 36, 4 (Aug. 2006), 315--326.
[4]
Prosenjit Bose, Hua Guo, Evangelos Kranakis, Anil Maheshwari, Pat Morin, Jason Morrison, Michiel Smid, and Yihui Tang. 2008. On the false-positive rate of Bloom filters. Inform. Process. Lett. 108, 4 (2008), 210--213.
[5]
Sergey Brin and Lawrence Page. 2012. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer Networks 56, 18 (2012), 3825--3833.
[6]
Ana Cardoso-Cachopo. 2007. Improving Methods for Single-label Text Categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa.
[7]
F. Chang, Wu chang Feng, and Kang Li. 2004. Approximate caches for packet classification. In Proceedings of the 23rd Annual Joint Conference of the IEEE Computer and Communications Societies, Vol. 4. 2196--2207.
[8]
Bernard Chazelle, Joe Kilian, Ronitt Rubinfeld, and Ayellet Tal. 2004. The Bloomier Filter: An efficient data structure for static support lookup tables. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’04). Society for Industrial and Applied Mathematics, Philadelphia, PA, 30--39.
[9]
Saar Cohen and Yossi Matias. 2003. Spectral Bloom Filters. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD’03). ACM, New York, NY, 241--252.
[10]
Adina Crainiceanu and Daniel Lemire. 2015. Bloofi: Multidimensional Bloom filters. Inf. Syst. 54 (2015), 311--324.
[11]
F. M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen. 2003. PlanetP: Using gossiping to build content addressable peer-to-peer information sharing communities. In Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing. 236--246.
[12]
Li Fan, Pei Cao, J. Almeida, and A. Z. Broder. 2000. Summary cache: A scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking 8, 3 (Jun 2000), 281--293.
[13]
Phillipa Gill, Martin Arlitt, Zongpeng Li, and Anirban Mahanti. 2007. Youtube traffic characterization: A view from the edge. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (IMC’07). ACM, New York, NY, 15--28.
[14]
D. Guo, J. Wu, H. Chen, Y. Yuan, and X. Luo. 2010. The dynamic Bloom Filters. IEEE Trans. Knowl. Data Eng. 22, 1 (Jan 2010), 120--133.
[15]
Joseph M. Hellerstein, Jose M. Faleiro, Joseph Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, and Chenggang Wu. 2019. Serverless computing: One step forward, two steps back. In Proceedings of the 9th Biennial Conference on Innovative Data Systems Research (CIDR’19). Retrieved from http://cidrdb.org/cidr2019/papers/p119-hellerstein-cidr19.pdf.
[16]
Konstantinos G. Kakoulis and Ioannis G. Tollis. 2006. Algorithms for the multiple label placement problem. Computational Geometry 35, 3 (2006), 143--161.
[17]
Konstantinos G. Kakoulis and Ioannis G. Tollis. 2013. Labeling algorithms. In Handbook of Graph Drawing and Visualization. CRC Press, 489--515. Retrieved from http://cs.brown.edu/∼rt/gdhandbook/chapters/labeling.pdf.
[18]
Dilip Kumar Krishnappa, Michael Zink, Carsten Griwodz, and Pål Halvorsen. 2015. Cache-centric video recommendation: An approach to improve the efficiency of YouTube caches. ACM Trans. Multimedia Comput. Commun. Appl. 11, 4, Article 48 (June 2015), 20.
[19]
A. Kumar, J. Xu, and J. Wang. 2006. Space-code Bloom Filter for efficient per-flow traffic measurement. IEEE J-SAC 24, 12 (Dec 2006), 2327--2339.
[20]
Lailong Luo, Deke Guo, Richard T. B. Ma, Ori Rottenstreich, and Xueshan Luo. 2018. Optimizing Bloom Filter: Challenges, solutions, and comparisons. IEEE Commun. Surv. Tutor. 21 (2018), 1912--1949.
[21]
Yong Woon Park, Kun Hyo Baek, and Ki Dong Chung. 2000. Reducing network traffic using two-layered cache servers for continuous media data on the Internet. In Proceedings of the 24th Annual International Computer Software and Applications Conference. 389--394.
[22]
Michael O. Rabin. 1989. Efficient dispersal of information for security, load balancing, and fault tolerance. J. ACM 36, 2 (April 1989), 335--348.
[23]
Patrick Reynolds and Amin Vahdat. 2003. Efficient peer-to-peer keyword searching. In Proceedings of the ACM/IFIP/USENIX 2003 International Conference on Middleware (Middleware’03). Springer-Verlag New York, Inc., New York, NY, 21--40.
[24]
S. C. Rhea and J. Kubiatowicz. 2002. Probabilistic location and routing. In Proceedings. of the 21st Annual Joint Conference of the IEEE Computer and Communications Societies, Vol. 3. 1248--1257.
[25]
Behrooz A. Shirazi, Krishna M. Kavi, and Ali R. Hurson (Eds.). 1995. Scheduling and Load Balancing in Parallel and Distributed Systems. IEEE Computer Society Press, Los Alamitos, CA.
[26]
János Tapolcai, József Bíró, Péter Babarczi, András Gulyás, Zalán Heszberger, and Dirk Trossen. 2015. Optimal false-positive-free Bloom Filter design for scalable multicast forwarding. IEEE/ACM Trans. Netw. 23, 6 (Dec. 2015), 1832--1845.
[27]
S. Tarkoma, C. E. Rothenberg, and E. Lagerspetz. 2012. Theory and practice of Bloom Filters for distributed systems. IEEE Communications Surveys Tutorials 14, 1 (First 2012), 131--155.
[28]
Chenyang Xu, Qin Liu, and Weixiong Rao. 2016. BMF: An indexing structure to support multi-element check. In Web-Age Information Management, Bin Cui, Nan Zhang, Jianliang Xu, Xiang Lian, and Dexi Liu (Eds.). Springer International Publishing, Cham, 441--453.
[29]
Tong Yang, Alex X. Liu, Muhammad Shahzad, Yuankun Zhong, Qiaobin Fu, Zi Li, Gaogang Xie, and Xiaoming Li. 2016. A shifting Bloom Filter framework for set queries. Proc. VLDB Endow. 9, 5 (Jan. 2016), 408--419.

Cited By

View all
  • (2023)A Framework for Scalable Object Storage and Retrieval Considering Privacy Concerns: A Case Study on the Signature Detection2023 9th International Conference on Web Research (ICWR)10.1109/ICWR57742.2023.10139243(237-241)Online publication date: 3-May-2023
  • (2021)A Stateful Bloom Filter for Per-Flow State MonitoringIEEE Transactions on Network Science and Engineering10.1109/TNSE.2021.30574598:2(1399-1413)Online publication date: 1-Apr-2021
  • (2021)A Bloom Filter Application for Processing Big Datasets through MapReduce Framework2021 International Conference on Information Technologies (InfoTech)10.1109/InfoTech52438.2021.9548638(1-5)Online publication date: 16-Sep-2021
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Knowledge Discovery from Data
ACM Transactions on Knowledge Discovery from Data  Volume 14, Issue 2
April 2020
322 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3382774
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 February 2020
Accepted: 01 November 2019
Revised: 01 October 2019
Received: 01 August 2019
Published in TKDD Volume 14, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Bloom filter
  2. Bloomier filter
  3. Zipf distribution
  4. big data
  5. multiple sets
  6. uniform distribution

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Business Finland 5G-FORCE research project
  • Academy of Finland

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)9
  • Downloads (Last 6 weeks)1
Reflects downloads up to 14 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2023)A Framework for Scalable Object Storage and Retrieval Considering Privacy Concerns: A Case Study on the Signature Detection2023 9th International Conference on Web Research (ICWR)10.1109/ICWR57742.2023.10139243(237-241)Online publication date: 3-May-2023
  • (2021)A Stateful Bloom Filter for Per-Flow State MonitoringIEEE Transactions on Network Science and Engineering10.1109/TNSE.2021.30574598:2(1399-1413)Online publication date: 1-Apr-2021
  • (2021)A Bloom Filter Application for Processing Big Datasets through MapReduce Framework2021 International Conference on Information Technologies (InfoTech)10.1109/InfoTech52438.2021.9548638(1-5)Online publication date: 16-Sep-2021
  • (2021)A Survey on Bloom Filter for Multiple SetsModeling, Simulation and Optimization10.1007/978-981-15-9829-6_60(775-789)Online publication date: 18-Mar-2021

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media