Abstract
Efficient processing of Distance-Based Join Queries (DBJQs) in spatial databases is of paramount importance in many application domains. The most representative and known DBJQs are the K Closest Pairs Query (KCPQ) and the ε Distance Join Query (εDJQ). These types of join queries are characterized by a number of desired pairs (K) or a distance threshold (ε) between the components of the pairs in the final result, over two spatial datasets. Both are expensive operations, since two spatial datasets are combined with additional constraints. Given the increasing volume of spatial data originating from multiple sources and stored in distributed servers, it is not always efficient to perform DBJQs on a centralized server. For this reason, this paper addresses the problem of computing DBJQs on big spatial datasets in SpatialHadoop, an extension of Hadoop that supports efficient processing of spatial queries in a cloud-based setting. We propose novel algorithms, based on plane-sweep, to perform efficient parallel DBJQs on large-scale spatial datasets in SpatialHadoop. We evaluate the performance of the proposed algorithms in several situations with large real-world as well as synthetic datasets. The experiments demonstrate the efficiency and scalability of our proposed methodologies.















Similar content being viewed by others
References
García-García F, Corral A, Iribarne L, Vassilakopoulos M, Manolopoulos Y (2016) Enhancing spatialhadoop with closest pair queries. In: ADBIS Conference, pp 212–225
Shekhar S, Chawla S (2003) Spatial databases - a tour. Prentice Hall, New Jersey
Samet H (1990) Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison-Wesley, Boston
Schiller JH, Voisard A (eds) (2004) Location-Based Services. Morgan Kaufmann, Burlington
Rigaux P, Scholl M, Voisard A (2002) Spatial databases - with applications to GIS. Elsevier, San Francisco
Leong Hou U, Mamoulis N, Yiu ML (2008) Computation and monitoring of exclusive closest pairs. Trans Knowl Data Eng 20(12):1641–1654
Ahmadi E, Nascimento MA (2016) K-closest pairs queries in road networks. In: MDM Conference, pp 232–241
Corral A, Manolopoulos Y, Theodoridis Y, Vassilakopoulos M (2004) Algorithms for processing k-closest-pair queries in spatial databases. Data Knowl Eng 49(1):67–104
Roumelis G, Corral A, Vassilakopoulos M, Manolopoulos Y (2014) A new plane-sweep algorithm for the k-closest-pairs query. In: SOFSEM Conference, pp 478–490
Gao Y, Chen L, Li X, Yao B, Chen G (2015) Efficient k-closest pair queries in general metric spaces. VLDB J 24(3):415–439
Roumelis G, Vassilakopoulos M, Corral A, Manolopoulos Y (2016) New plane-sweep algorithms for distance-based join queries in spatial databases. GeoInformatica 20(4):571–628
Zhang C, Li F, Jestes J (2012) Efficient parallel kNN joins for large data in MapReduce. In: EDBT Conference, pp 38–49
Lu W, Shen Y, Chen S, Ooi BC (2012) Efficient processing of k nearest neighbor joins using MapReduce. PVLDB 5(10):1016–1027
Wang K, Han J, Tu B, Dai J, Zhou W, Song X (2010) Accelerating spatial data processing with MapReduce. In: ICPADS Conference, pp 229–236
Nodarakis N, Pitoura E, Sioutas S, Tsakalidis AK, Tsoumakos D, Tzimas G (2016) kdann+: A rapid aknn classifier for big data. Trans Large-Scale Data-Knowl-Centered Syst 24:139–168
Silva YN, Reed JM (2012) Exploiting mapreduce-based similarity joins. In: SIGMOD Conference, pp 693–696
Dean J, Ghemawat S (2004) Mapreduce: Simplified data processing on large clusters. In: 137–150
Li F, Ooi BC, Özsu MT, Wu S (2014) Distributed data management using mapreduce. ACM Comput Surv 46(3):31:1–31:42
Chen CLP, Zhang C (2014) Data-intensive applications, challenges, techniques and technologies: A survey on big data. Inf Sci 275:314–347
Giachetta R (2015) A framework for processing large scale geospatial and remote sensing data in mapreduce environment. Comput Graph 49:37–46
Gani A, Siddiqa A, Shamshirband S, Hanum F (2016) A survey on indexing techniques for big data: taxonomy and performance evaluation. Knowl Inf Syst 46(2):241–284
Doulkeridis C, Nørvåg K (2014) A survey of large-scale analytical query processing in mapreduce. VLDB J 23(3):355–380
Eldawy A, Mokbel MF (2015) Spatialhadoop: A mapreduce framework for spatial data. In: ICDE Conference, pp 1352–1363
Shi J, Qiu Y, Minhas UF, Jiao L, Wang C, Reinwald B, Ȯzcan F (2015) Clash of the titans: Mapreduce vs. spark for large scale data analytics. PVLDB 8(13):2110–2121
Lu J, Güting RH (2012) Parallel secondo: Boosting database engines with Hadoop. In: ICPADS Conference, pp 738–743
Aji A, Wang F, Vo H, Lee R, Liu Q, Zhang X, Saltz JH (2013) Hadoop-GIS: A high performance spatial data warehousing system over MapReduce. PVLDB 6(11):1009–1020
Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive - A warehousing solution over a MapReduce framework. PVLDB 2(2):1626–1629
You S, Zhang J, Gruenwald L (2015) Large-scale spatial join query processing in cloud. In: ICDE Workshops, pp 34–41
Yu J, Wu J, Sarwat M (2015) Geospark: a cluster computing framework for processing large-scale spatial data. In: SIGSPATIAL Conference, pp 70:1–70:4
Xie D, Li F, Yao B, Li G, Zhou L, Guo M (2016) Simba: Efficient in-memory spatial analytics. In: SIGMOD Conference, pp 1071–1085
Tang M, Yu Y, Malluhi QM, Ouzzani M, Aref WG (2016) Locationspark: A distributed in-memory data management system for big spatial data. PVLDB 9(13):1565–1568
Li Z, Huang Q, Carbone GJ, Hu F (2017) A high performance query analytical framework for supporting data-intensive climate studies, Computers. Comput Environ Urban Syst 62:210–221
Buck JB, Watkins N, LeFevre J, Ioannidou K, Maltzahn C, Polyzotis N, Brandt SA (2011) Scihadoop: array-based query processing in hadoop. In: SC Conference, pp 66:1–66:11
Eldawy A, Mokbel MF, Al-Harthi S, Alzaidy A, Tarek K, Ghani S (2015) SHAHED: A mapreduce-based system for querying and visualizing spatio-temporal satellite data. In: ICDE Conference, pp 1585–1596
Palamuttam R, Mogrovejo RM, Mattmann C, Wilson B, Whitehall K, Verma R, McGibbney LJ, Ramirez PM (2015) Scispark: Applying in-memory distributed computing to weather event detection and tracking. In: Conference on Big Data, pp 2020–2026
Zhang S, Han J, Liu Z, Wang K, Feng S (2009) Spatial queries evaluation with MapReduce. In: GCC Conference, pp 287–292
Ma Q, Yang B, Qian W, Zhou A (2009) Query processing of massive trajectory data based on MapReduce. In: CloudDb Conference, pp 9–16
Akdogan A, Demiryurek U, Demiryurek FB, Shahabi C (2010) Voronoi-based geospatial query processing with MapReduce. In: CloudCom Conference, pp 9–16
Maillo J, Triguero I, Herrera F (2015) A mapreduce-based k-nearest neighbor approach for big data classification. In: TrustCom/BigDataSE/ISPA Conference, pp 167–172
Park Y, Min J, Shim K (2013) Parallel computation of skyline and reverse skyline queries using mapreduce. PVLDB 6(14):2002–2013
Zhang J, Jiang X, Ku W, Qin X (2016) Efficient parallel skyline evaluation using mapreduce. IEEE Trans Parallel Distrib Syst 27(7):1996–2009
Ji C, Li Z, Qu W, Xu Y, Li Y (2014) Scalable nearest neighbor query processing based on inverted grid index. J Netw Comput Appl 44:172–182
Zhang S, Han J, Liu Z, Wang K, Xu Z (2009) SJMR: parallelizing spatial join with MapReduce on clusters. In: CLUSTER Conference, pp 1–8
Patel JM, DeWitt DJ (1996) Partition based spatial-merge join. In: SIGMOD Conference, pp 259–270
Kim Y, Shim K (2012) Parallel top-k similarity join algorithms using MapReduce. In: ICDE Conference, pp 510–521
Jacox EH, Samet H (2008) Metric space similarity joins. ACM Trans Database Syst 33(2):1–38
Gupta H, Chawda B, Negi S, Faruquie TA, Subramaniam LV, Mohania MK (2013) Processing multi-way spatial joins on map-reduce. In: EDBT Conference, pp 113–124
Wang H, Belhassena A (2017) Parallel trajectory search based on distributed index. Inf Sci 388-399:62–83
Eldawy A, Li Y, Mokbel MF, Janardan R (2013) Cg_hadoop: computational geometry in mapreduce. In: SIGSPATIAL Conference, pp 284–293
Pertesis D, Doulkeridis C (2015) Efficient skyline query processing in spatialhadoop. Inf Syst 54:325–335
Corral A, Manolopoulos Y, Theodoridis Y, Vassilakopoulos M (2000) Closest pair queries in spatial databases. In: SIGMOD Conference, pp 189–200
Hjaltason GR, Samet H (1998) Incremental distance join algorithms for spatial databases. In: SIGMOD Conference, pp 237–248
Shin H, Moon B, Lee S (2003) Adaptive and incremental processing for distance join queries. IEEE Trans Knowl Data Eng 15(6):1561–1578
Yang C, Lin K (2002) An index structure for improving closest pairs and related join queries in spatial databases. In: IDEAS Conference, pp 140–149
Gutierrez G, Sȧez P (2013) The k closest pairs in spatial databases - when only one set is indexed. GeoInformatica 17(4):543–565
Eldawy A, Alarabi L, Mokbel MF (2015) Spatial partitioning techniques in spatial hadoop. PVLDB 8(12):1602–1613
Preparata FP, Shamos MI (1985) Computational Geometry - An Introduction. Springer, Berlin
Corral A, Almendros-Jimėnez JM (2007) A performance comparison of distance-based query algorithms using r-trees in spatial databases. Inf Sci 177(11):2207–2237
Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to Algorithms, 3rd edn. MIT Press, Cambridge
Chaudhuri S, Motwani R, Narasayya VR (1999) On random sampling over joins. In: SIGMOD Conference, pp 263–274
Corral A, Vassilakopoulos M (2005) On approximate algorithms for distance-based queries using r-trees. Comput J 48(2):220–238
Leutenegger ST, Edgington JM, Lopez MA (1997) Str: A simple and efficient algorithm for r-tree packing. In: ICDE Conference, pp 497–506
Papadopoulos AN, Nanopoulos A, Manolopoulos Y (2006) Processing distance join queries with constraints. Comput J 49(3):281–296
Mamoulis N, Papadias D, Multiway spatial joins ACM (2001) Trans. Database Syst 26(4):424–475
Corral A, Manolopoulos Y, Theodoridis Y, Vassilakopoulos M (2004) Multi-way distance join queries in spatial databases. GeoInformatica 8(4):373–402
Vo H, Aji A, Wang F (2014) SATO: a spatial data partitioning framework for scalable query processing. In: SIGSPATIAL Conference, pp 545–548
Aji A, Vo H, Wang F Effective spatial data partitioning for scalable query processing. arXiv:1509.00910
Acknowledgements
Work of all authors funded by the MINECO research project [TIN2013-41576-R]. We would like to thank Prof. Goce Trajcevski for providing us interesting comments to enrich the article, and we would like also thank the anonymous reviewers for their constructive remarks.
Author information
Authors and Affiliations
Corresponding author
Additional information
A preliminary partial version of this work appeared in [1].
Rights and permissions
About this article
Cite this article
García-García, F., Corral, A., Iribarne, L. et al. Efficient large-scale distance-based join queries in spatialhadoop. Geoinformatica 22, 171–209 (2018). https://doi.org/10.1007/s10707-017-0309-y
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10707-017-0309-y