Dima: a distributed in-memory similarity-based query processing system

Published: 01 August 2017

Abstract

Data analysts in industry spend more than 80% of their time on data cleaning and integration over the whole data analytics process, owing to data errors and inconsistencies. This calls for effective query processing techniques that tolerate such errors and inconsistencies. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports two core similarity-based query operations: similarity search and similarity join. Dima extends the SQL programming interface so that users can easily invoke these two operations in their data analysis jobs. To avoid expensive data transformation in a distributed environment, we design selectable signatures, where two records approximately match only if they share common signatures. More importantly, we can adaptively select the signatures to balance the workload. Dima builds signature-based global indexes and local indexes to support efficient similarity search and join. Since Spark is one of the most widely adopted distributed in-memory computing systems, we have seamlessly integrated Dima into Spark and developed effective query optimization techniques in Spark. To the best of our knowledge, this is the first full-fledged distributed in-memory system that supports similarity-based query processing. We demonstrate our system in several scenarios, including entity matching, web table integration, and query recommendation.
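The general idea of signature-based filtering can be illustrated with a minimal single-machine sketch. It is not Dima's implementation: the abstract does not specify how selectable signatures are constructed, so the sketch assumes the simplest possible signature scheme, in which every token of a record is a signature, and it assumes Jaccard similarity with a threshold greater than 0. The function names (`build_index`, `similarity_search`, `similarity_join`) and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_index(records):
    """Inverted index: signature -> ids of records carrying it.

    Assumption: every token is a signature (a simplification, not Dima's scheme).
    """
    index = defaultdict(set)
    for rid, tokens in records.items():
        for tok in tokens:
            index[tok].add(rid)
    return index


def similarity_search(query, records, index, threshold):
    """Ids of records with Jaccard(query, record) >= threshold (threshold > 0)."""
    q = frozenset(query)
    # Filter step: only records sharing at least one signature are candidates.
    candidates = set()
    for tok in q:
        candidates |= index.get(tok, set())
    # Verify step: compute the exact similarity for each candidate.
    return sorted(rid for rid in candidates if jaccard(q, records[rid]) >= threshold)


def similarity_join(records, index, threshold):
    """All id pairs (i, j), i < j, with Jaccard >= threshold (threshold > 0)."""
    # Filter step: pairs that co-occur under some signature.
    candidate_pairs = set()
    for rids in index.values():
        candidate_pairs.update(combinations(sorted(rids), 2))
    # Verify step: keep only pairs that really meet the threshold.
    return sorted(
        (i, j)
        for i, j in candidate_pairs
        if jaccard(records[i], records[j]) >= threshold
    )


# Hypothetical sample data: record id -> token set.
records = {
    1: frozenset("dima distributed similarity join".split()),
    2: frozenset("distributed similarity join system".split()),
    3: frozenset("learned cardinality estimation".split()),
}
index = build_index(records)
```

With this data, `similarity_join(records, index, 0.5)` verifies only the pair (1, 2), because record 3 shares no signature with the others and is never compared. Using every token as a signature is safe for any threshold above 0 but generates many candidates; a system like the one described would choose fewer, more selective signatures and distribute the index across workers.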

Published In

Proceedings of the VLDB Endowment, Volume 10, Issue 12
August 2017
427 pages
ISSN:2150-8097
Publisher

VLDB Endowment
