Efficient large-scale distance-based join queries in spatialhadoop

Francisco García-García¹,
Antonio Corral ORCID: orcid.org/0000-0002-0069-4642¹,
Luis Iribarne¹,
Michael Vassilakopoulos² &
…
Yannis Manolopoulos³

766 Accesses
5 Altmetric
Explore all metrics

Abstract

Efficient processing of Distance-Based Join Queries (DBJQs) in spatial databases is of paramount importance in many application domains. The most representative and known DBJQs are the K Closest Pairs Query (KCPQ) and the ε Distance Join Query (εDJQ). These types of join queries are characterized by a number of desired pairs (K) or a distance threshold (ε) between the components of the pairs in the final result, over two spatial datasets. Both are expensive operations, since two spatial datasets are combined with additional constraints. Given the increasing volume of spatial data originating from multiple sources and stored in distributed servers, it is not always efficient to perform DBJQs on a centralized server. For this reason, this paper addresses the problem of computing DBJQs on big spatial datasets in SpatialHadoop, an extension of Hadoop that supports efficient processing of spatial queries in a cloud-based setting. We propose novel algorithms, based on plane-sweep, to perform efficient parallel DBJQs on large-scale spatial datasets in SpatialHadoop. We evaluate the performance of the proposed algorithms in several situations with large real-world as well as synthetic datasets. The experiments demonstrate the efficiency and scalability of our proposed methodologies.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

New plane-sweep algorithms for distance-based join queries in spatial databases

Article 27 February 2016

Voronoi-Diagram Based Partitioning for Distance Join Query Processing in SpatialHadoop

Enhancing SpatialHadoop with Closest Pair Queries

Notes

References

García-García F, Corral A, Iribarne L, Vassilakopoulos M, Manolopoulos Y (2016) Enhancing spatialhadoop with closest pair queries. In: ADBIS Conference, pp 212–225
Shekhar S, Chawla S (2003) Spatial databases - a tour. Prentice Hall, New Jersey
Google Scholar
Samet H (1990) Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison-Wesley, Boston
Google Scholar
Schiller JH, Voisard A (eds) (2004) Location-Based Services. Morgan Kaufmann, Burlington
Rigaux P, Scholl M, Voisard A (2002) Spatial databases - with applications to GIS. Elsevier, San Francisco
Google Scholar
Leong Hou U, Mamoulis N, Yiu ML (2008) Computation and monitoring of exclusive closest pairs. Trans Knowl Data Eng 20(12):1641–1654
Article Google Scholar
Ahmadi E, Nascimento MA (2016) K-closest pairs queries in road networks. In: MDM Conference, pp 232–241
Corral A, Manolopoulos Y, Theodoridis Y, Vassilakopoulos M (2004) Algorithms for processing k-closest-pair queries in spatial databases. Data Knowl Eng 49(1):67–104
Article Google Scholar
Roumelis G, Corral A, Vassilakopoulos M, Manolopoulos Y (2014) A new plane-sweep algorithm for the k-closest-pairs query. In: SOFSEM Conference, pp 478–490
Gao Y, Chen L, Li X, Yao B, Chen G (2015) Efficient k-closest pair queries in general metric spaces. VLDB J 24(3):415–439
Article Google Scholar
Roumelis G, Vassilakopoulos M, Corral A, Manolopoulos Y (2016) New plane-sweep algorithms for distance-based join queries in spatial databases. GeoInformatica 20(4):571–628
Article Google Scholar
Zhang C, Li F, Jestes J (2012) Efficient parallel kNN joins for large data in MapReduce. In: EDBT Conference, pp 38–49
Lu W, Shen Y, Chen S, Ooi BC (2012) Efficient processing of k nearest neighbor joins using MapReduce. PVLDB 5(10):1016–1027
Google Scholar
Wang K, Han J, Tu B, Dai J, Zhou W, Song X (2010) Accelerating spatial data processing with MapReduce. In: ICPADS Conference, pp 229–236
Nodarakis N, Pitoura E, Sioutas S, Tsakalidis AK, Tsoumakos D, Tzimas G (2016) kdann+: A rapid aknn classifier for big data. Trans Large-Scale Data-Knowl-Centered Syst 24:139–168
Google Scholar
Silva YN, Reed JM (2012) Exploiting mapreduce-based similarity joins. In: SIGMOD Conference, pp 693–696
Dean J, Ghemawat S (2004) Mapreduce: Simplified data processing on large clusters. In: 137–150
Li F, Ooi BC, Özsu MT, Wu S (2014) Distributed data management using mapreduce. ACM Comput Surv 46(3):31:1–31:42
Google Scholar
Chen CLP, Zhang C (2014) Data-intensive applications, challenges, techniques and technologies: A survey on big data. Inf Sci 275:314–347
Article Google Scholar
Giachetta R (2015) A framework for processing large scale geospatial and remote sensing data in mapreduce environment. Comput Graph 49:37–46
Article Google Scholar
Gani A, Siddiqa A, Shamshirband S, Hanum F (2016) A survey on indexing techniques for big data: taxonomy and performance evaluation. Knowl Inf Syst 46(2):241–284
Article Google Scholar
Doulkeridis C, Nørvåg K (2014) A survey of large-scale analytical query processing in mapreduce. VLDB J 23(3):355–380
Article Google Scholar
Eldawy A, Mokbel MF (2015) Spatialhadoop: A mapreduce framework for spatial data. In: ICDE Conference, pp 1352–1363
Shi J, Qiu Y, Minhas UF, Jiao L, Wang C, Reinwald B, Ȯzcan F (2015) Clash of the titans: Mapreduce vs. spark for large scale data analytics. PVLDB 8(13):2110–2121
Google Scholar
Lu J, Güting RH (2012) Parallel secondo: Boosting database engines with Hadoop. In: ICPADS Conference, pp 738–743
Aji A, Wang F, Vo H, Lee R, Liu Q, Zhang X, Saltz JH (2013) Hadoop-GIS: A high performance spatial data warehousing system over MapReduce. PVLDB 6(11):1009–1020
Google Scholar
Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive - A warehousing solution over a MapReduce framework. PVLDB 2(2):1626–1629
Google Scholar
You S, Zhang J, Gruenwald L (2015) Large-scale spatial join query processing in cloud. In: ICDE Workshops, pp 34–41
Yu J, Wu J, Sarwat M (2015) Geospark: a cluster computing framework for processing large-scale spatial data. In: SIGSPATIAL Conference, pp 70:1–70:4
Xie D, Li F, Yao B, Li G, Zhou L, Guo M (2016) Simba: Efficient in-memory spatial analytics. In: SIGMOD Conference, pp 1071–1085
Tang M, Yu Y, Malluhi QM, Ouzzani M, Aref WG (2016) Locationspark: A distributed in-memory data management system for big spatial data. PVLDB 9(13):1565–1568
Google Scholar
Li Z, Huang Q, Carbone GJ, Hu F (2017) A high performance query analytical framework for supporting data-intensive climate studies, Computers. Comput Environ Urban Syst 62:210–221
Article Google Scholar
Buck JB, Watkins N, LeFevre J, Ioannidou K, Maltzahn C, Polyzotis N, Brandt SA (2011) Scihadoop: array-based query processing in hadoop. In: SC Conference, pp 66:1–66:11
Eldawy A, Mokbel MF, Al-Harthi S, Alzaidy A, Tarek K, Ghani S (2015) SHAHED: A mapreduce-based system for querying and visualizing spatio-temporal satellite data. In: ICDE Conference, pp 1585–1596
Palamuttam R, Mogrovejo RM, Mattmann C, Wilson B, Whitehall K, Verma R, McGibbney LJ, Ramirez PM (2015) Scispark: Applying in-memory distributed computing to weather event detection and tracking. In: Conference on Big Data, pp 2020–2026
Zhang S, Han J, Liu Z, Wang K, Feng S (2009) Spatial queries evaluation with MapReduce. In: GCC Conference, pp 287–292
Ma Q, Yang B, Qian W, Zhou A (2009) Query processing of massive trajectory data based on MapReduce. In: CloudDb Conference, pp 9–16
Akdogan A, Demiryurek U, Demiryurek FB, Shahabi C (2010) Voronoi-based geospatial query processing with MapReduce. In: CloudCom Conference, pp 9–16
Maillo J, Triguero I, Herrera F (2015) A mapreduce-based k-nearest neighbor approach for big data classification. In: TrustCom/BigDataSE/ISPA Conference, pp 167–172
Park Y, Min J, Shim K (2013) Parallel computation of skyline and reverse skyline queries using mapreduce. PVLDB 6(14):2002–2013
Google Scholar
Zhang J, Jiang X, Ku W, Qin X (2016) Efficient parallel skyline evaluation using mapreduce. IEEE Trans Parallel Distrib Syst 27(7):1996–2009
Article Google Scholar
Ji C, Li Z, Qu W, Xu Y, Li Y (2014) Scalable nearest neighbor query processing based on inverted grid index. J Netw Comput Appl 44:172–182
Article Google Scholar
Zhang S, Han J, Liu Z, Wang K, Xu Z (2009) SJMR: parallelizing spatial join with MapReduce on clusters. In: CLUSTER Conference, pp 1–8
Patel JM, DeWitt DJ (1996) Partition based spatial-merge join. In: SIGMOD Conference, pp 259–270
Kim Y, Shim K (2012) Parallel top-k similarity join algorithms using MapReduce. In: ICDE Conference, pp 510–521
Jacox EH, Samet H (2008) Metric space similarity joins. ACM Trans Database Syst 33(2):1–38
Article Google Scholar
Gupta H, Chawda B, Negi S, Faruquie TA, Subramaniam LV, Mohania MK (2013) Processing multi-way spatial joins on map-reduce. In: EDBT Conference, pp 113–124
Wang H, Belhassena A (2017) Parallel trajectory search based on distributed index. Inf Sci 388-399:62–83
Article Google Scholar
Eldawy A, Li Y, Mokbel MF, Janardan R (2013) Cg_hadoop: computational geometry in mapreduce. In: SIGSPATIAL Conference, pp 284–293
Pertesis D, Doulkeridis C (2015) Efficient skyline query processing in spatialhadoop. Inf Syst 54:325–335
Article Google Scholar
Corral A, Manolopoulos Y, Theodoridis Y, Vassilakopoulos M (2000) Closest pair queries in spatial databases. In: SIGMOD Conference, pp 189–200
Hjaltason GR, Samet H (1998) Incremental distance join algorithms for spatial databases. In: SIGMOD Conference, pp 237–248
Shin H, Moon B, Lee S (2003) Adaptive and incremental processing for distance join queries. IEEE Trans Knowl Data Eng 15(6):1561–1578
Article Google Scholar
Yang C, Lin K (2002) An index structure for improving closest pairs and related join queries in spatial databases. In: IDEAS Conference, pp 140–149
Gutierrez G, Sȧez P (2013) The k closest pairs in spatial databases - when only one set is indexed. GeoInformatica 17(4):543–565
Article Google Scholar
Eldawy A, Alarabi L, Mokbel MF (2015) Spatial partitioning techniques in spatial hadoop. PVLDB 8(12):1602–1613
Google Scholar
Preparata FP, Shamos MI (1985) Computational Geometry - An Introduction. Springer, Berlin
Book Google Scholar
Corral A, Almendros-Jimėnez JM (2007) A performance comparison of distance-based query algorithms using r-trees in spatial databases. Inf Sci 177(11):2207–2237
Article Google Scholar
Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to Algorithms, 3rd edn. MIT Press, Cambridge
Google Scholar
Chaudhuri S, Motwani R, Narasayya VR (1999) On random sampling over joins. In: SIGMOD Conference, pp 263–274
Corral A, Vassilakopoulos M (2005) On approximate algorithms for distance-based queries using r-trees. Comput J 48(2):220–238
Article Google Scholar
Leutenegger ST, Edgington JM, Lopez MA (1997) Str: A simple and efficient algorithm for r-tree packing. In: ICDE Conference, pp 497–506
Papadopoulos AN, Nanopoulos A, Manolopoulos Y (2006) Processing distance join queries with constraints. Comput J 49(3):281–296
Article Google Scholar
Mamoulis N, Papadias D, Multiway spatial joins ACM (2001) Trans. Database Syst 26(4):424–475
Article Google Scholar
Corral A, Manolopoulos Y, Theodoridis Y, Vassilakopoulos M (2004) Multi-way distance join queries in spatial databases. GeoInformatica 8(4):373–402
Article Google Scholar
Vo H, Aji A, Wang F (2014) SATO: a spatial data partitioning framework for scalable query processing. In: SIGSPATIAL Conference, pp 545–548
Aji A, Vo H, Wang F Effective spatial data partitioning for scalable query processing. arXiv:1509.00910

Download references

Acknowledgements

Work of all authors funded by the MINECO research project [TIN2013-41576-R]. We would like to thank Prof. Goce Trajcevski for providing us interesting comments to enrich the article, and we would like also thank the anonymous reviewers for their constructive remarks.

Author information

Authors and Affiliations

Department on Informatics, University of Almeria, 04120, Almeria, Spain
Francisco García-García, Antonio Corral & Luis Iribarne
Department of Electrical and Computer Engineering, University of Thessaly, 38221, Volos, Greece
Michael Vassilakopoulos
Department of Informatics, Aristotle University, 54124, Thessaloniki, Greece
Yannis Manolopoulos

Authors

Francisco García-García
View author publications
You can also search for this author in PubMed Google Scholar
Antonio Corral
View author publications
You can also search for this author in PubMed Google Scholar
Luis Iribarne
View author publications
You can also search for this author in PubMed Google Scholar
Michael Vassilakopoulos
View author publications
You can also search for this author in PubMed Google Scholar
Yannis Manolopoulos
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Francisco García-García.

Additional information

A preliminary partial version of this work appeared in [1].

Rights and permissions

Reprints and permissions

About this article

Cite this article

García-García, F., Corral, A., Iribarne, L. et al. Efficient large-scale distance-based join queries in spatialhadoop. Geoinformatica 22, 171–209 (2018). https://doi.org/10.1007/s10707-017-0309-y

Download citation

Received: 03 March 2017
Revised: 24 July 2017
Accepted: 05 September 2017
Published: 20 September 2017
Issue Date: April 2018
DOI: https://doi.org/10.1007/s10707-017-0309-y

Efficient large-scale distance-based join queries in spatialhadoop

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

New plane-sweep algorithms for distance-based join queries in spatial databases

Voronoi-Diagram Based Partitioning for Distance Join Query Processing in SpatialHadoop

Enhancing SpatialHadoop with Closest Pair Queries

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Efficient large-scale distance-based join queries in spatialhadoop

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

New plane-sweep algorithms for distance-based join queries in spatial databases

Voronoi-Diagram Based Partitioning for Distance Join Query Processing in SpatialHadoop

Enhancing SpatialHadoop with Closest Pair Queries

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now