[go: up one dir, main page]

Skip to main content

Advertisement

Log in

Efficient large-scale distance-based join queries in spatialhadoop

  • Published:
GeoInformatica Aims and scope Submit manuscript

Abstract

Efficient processing of Distance-Based Join Queries (DBJQs) in spatial databases is of paramount importance in many application domains. The most representative and known DBJQs are the K Closest Pairs Query (KCPQ) and the ε Distance Join Query (εDJQ). These types of join queries are characterized by a number of desired pairs (K) or a distance threshold (ε) between the components of the pairs in the final result, over two spatial datasets. Both are expensive operations, since two spatial datasets are combined with additional constraints. Given the increasing volume of spatial data originating from multiple sources and stored in distributed servers, it is not always efficient to perform DBJQs on a centralized server. For this reason, this paper addresses the problem of computing DBJQs on big spatial datasets in SpatialHadoop, an extension of Hadoop that supports efficient processing of spatial queries in a cloud-based setting. We propose novel algorithms, based on plane-sweep, to perform efficient parallel DBJQs on large-scale spatial datasets in SpatialHadoop. We evaluate the performance of the proposed algorithms in several situations with large real-world as well as synthetic datasets. The experiments demonstrate the efficiency and scalability of our proposed methodologies.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15

Similar content being viewed by others

Notes

  1. http://spark.apache.org/

  2. http://spatialhadoop.cs.umn.edu/datasets.html

References

  1. García-García F, Corral A, Iribarne L, Vassilakopoulos M, Manolopoulos Y (2016) Enhancing spatialhadoop with closest pair queries. In: ADBIS Conference, pp 212–225

  2. Shekhar S, Chawla S (2003) Spatial databases - a tour. Prentice Hall, New Jersey

    Google Scholar 

  3. Samet H (1990) Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison-Wesley, Boston

    Google Scholar 

  4. Schiller JH, Voisard A (eds) (2004) Location-Based Services. Morgan Kaufmann, Burlington

  5. Rigaux P, Scholl M, Voisard A (2002) Spatial databases - with applications to GIS. Elsevier, San Francisco

    Google Scholar 

  6. Leong Hou U, Mamoulis N, Yiu ML (2008) Computation and monitoring of exclusive closest pairs. Trans Knowl Data Eng 20(12):1641–1654

    Article  Google Scholar 

  7. Ahmadi E, Nascimento MA (2016) K-closest pairs queries in road networks. In: MDM Conference, pp 232–241

  8. Corral A, Manolopoulos Y, Theodoridis Y, Vassilakopoulos M (2004) Algorithms for processing k-closest-pair queries in spatial databases. Data Knowl Eng 49(1):67–104

    Article  Google Scholar 

  9. Roumelis G, Corral A, Vassilakopoulos M, Manolopoulos Y (2014) A new plane-sweep algorithm for the k-closest-pairs query. In: SOFSEM Conference, pp 478–490

  10. Gao Y, Chen L, Li X, Yao B, Chen G (2015) Efficient k-closest pair queries in general metric spaces. VLDB J 24(3):415–439

    Article  Google Scholar 

  11. Roumelis G, Vassilakopoulos M, Corral A, Manolopoulos Y (2016) New plane-sweep algorithms for distance-based join queries in spatial databases. GeoInformatica 20(4):571–628

    Article  Google Scholar 

  12. Zhang C, Li F, Jestes J (2012) Efficient parallel kNN joins for large data in MapReduce. In: EDBT Conference, pp 38–49

  13. Lu W, Shen Y, Chen S, Ooi BC (2012) Efficient processing of k nearest neighbor joins using MapReduce. PVLDB 5(10):1016–1027

    Google Scholar 

  14. Wang K, Han J, Tu B, Dai J, Zhou W, Song X (2010) Accelerating spatial data processing with MapReduce. In: ICPADS Conference, pp 229–236

  15. Nodarakis N, Pitoura E, Sioutas S, Tsakalidis AK, Tsoumakos D, Tzimas G (2016) kdann+: A rapid aknn classifier for big data. Trans Large-Scale Data-Knowl-Centered Syst 24:139–168

    Google Scholar 

  16. Silva YN, Reed JM (2012) Exploiting mapreduce-based similarity joins. In: SIGMOD Conference, pp 693–696

  17. Dean J, Ghemawat S (2004) Mapreduce: Simplified data processing on large clusters. In: 137–150

  18. Li F, Ooi BC, Özsu MT, Wu S (2014) Distributed data management using mapreduce. ACM Comput Surv 46(3):31:1–31:42

    Google Scholar 

  19. Chen CLP, Zhang C (2014) Data-intensive applications, challenges, techniques and technologies: A survey on big data. Inf Sci 275:314–347

    Article  Google Scholar 

  20. Giachetta R (2015) A framework for processing large scale geospatial and remote sensing data in mapreduce environment. Comput Graph 49:37–46

    Article  Google Scholar 

  21. Gani A, Siddiqa A, Shamshirband S, Hanum F (2016) A survey on indexing techniques for big data: taxonomy and performance evaluation. Knowl Inf Syst 46(2):241–284

    Article  Google Scholar 

  22. Doulkeridis C, Nørvåg K (2014) A survey of large-scale analytical query processing in mapreduce. VLDB J 23(3):355–380

    Article  Google Scholar 

  23. Eldawy A, Mokbel MF (2015) Spatialhadoop: A mapreduce framework for spatial data. In: ICDE Conference, pp 1352–1363

  24. Shi J, Qiu Y, Minhas UF, Jiao L, Wang C, Reinwald B, Ȯzcan F (2015) Clash of the titans: Mapreduce vs. spark for large scale data analytics. PVLDB 8(13):2110–2121

    Google Scholar 

  25. Lu J, Güting RH (2012) Parallel secondo: Boosting database engines with Hadoop. In: ICPADS Conference, pp 738–743

  26. Aji A, Wang F, Vo H, Lee R, Liu Q, Zhang X, Saltz JH (2013) Hadoop-GIS: A high performance spatial data warehousing system over MapReduce. PVLDB 6(11):1009–1020

    Google Scholar 

  27. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive - A warehousing solution over a MapReduce framework. PVLDB 2(2):1626–1629

    Google Scholar 

  28. You S, Zhang J, Gruenwald L (2015) Large-scale spatial join query processing in cloud. In: ICDE Workshops, pp 34–41

  29. Yu J, Wu J, Sarwat M (2015) Geospark: a cluster computing framework for processing large-scale spatial data. In: SIGSPATIAL Conference, pp 70:1–70:4

  30. Xie D, Li F, Yao B, Li G, Zhou L, Guo M (2016) Simba: Efficient in-memory spatial analytics. In: SIGMOD Conference, pp 1071–1085

  31. Tang M, Yu Y, Malluhi QM, Ouzzani M, Aref WG (2016) Locationspark: A distributed in-memory data management system for big spatial data. PVLDB 9(13):1565–1568

    Google Scholar 

  32. Li Z, Huang Q, Carbone GJ, Hu F (2017) A high performance query analytical framework for supporting data-intensive climate studies, Computers. Comput Environ Urban Syst 62:210–221

    Article  Google Scholar 

  33. Buck JB, Watkins N, LeFevre J, Ioannidou K, Maltzahn C, Polyzotis N, Brandt SA (2011) Scihadoop: array-based query processing in hadoop. In: SC Conference, pp 66:1–66:11

  34. Eldawy A, Mokbel MF, Al-Harthi S, Alzaidy A, Tarek K, Ghani S (2015) SHAHED: A mapreduce-based system for querying and visualizing spatio-temporal satellite data. In: ICDE Conference, pp 1585–1596

  35. Palamuttam R, Mogrovejo RM, Mattmann C, Wilson B, Whitehall K, Verma R, McGibbney LJ, Ramirez PM (2015) Scispark: Applying in-memory distributed computing to weather event detection and tracking. In: Conference on Big Data, pp 2020–2026

  36. Zhang S, Han J, Liu Z, Wang K, Feng S (2009) Spatial queries evaluation with MapReduce. In: GCC Conference, pp 287–292

  37. Ma Q, Yang B, Qian W, Zhou A (2009) Query processing of massive trajectory data based on MapReduce. In: CloudDb Conference, pp 9–16

  38. Akdogan A, Demiryurek U, Demiryurek FB, Shahabi C (2010) Voronoi-based geospatial query processing with MapReduce. In: CloudCom Conference, pp 9–16

  39. Maillo J, Triguero I, Herrera F (2015) A mapreduce-based k-nearest neighbor approach for big data classification. In: TrustCom/BigDataSE/ISPA Conference, pp 167–172

  40. Park Y, Min J, Shim K (2013) Parallel computation of skyline and reverse skyline queries using mapreduce. PVLDB 6(14):2002–2013

    Google Scholar 

  41. Zhang J, Jiang X, Ku W, Qin X (2016) Efficient parallel skyline evaluation using mapreduce. IEEE Trans Parallel Distrib Syst 27(7):1996–2009

    Article  Google Scholar 

  42. Ji C, Li Z, Qu W, Xu Y, Li Y (2014) Scalable nearest neighbor query processing based on inverted grid index. J Netw Comput Appl 44:172–182

    Article  Google Scholar 

  43. Zhang S, Han J, Liu Z, Wang K, Xu Z (2009) SJMR: parallelizing spatial join with MapReduce on clusters. In: CLUSTER Conference, pp 1–8

  44. Patel JM, DeWitt DJ (1996) Partition based spatial-merge join. In: SIGMOD Conference, pp 259–270

  45. Kim Y, Shim K (2012) Parallel top-k similarity join algorithms using MapReduce. In: ICDE Conference, pp 510–521

  46. Jacox EH, Samet H (2008) Metric space similarity joins. ACM Trans Database Syst 33(2):1–38

    Article  Google Scholar 

  47. Gupta H, Chawda B, Negi S, Faruquie TA, Subramaniam LV, Mohania MK (2013) Processing multi-way spatial joins on map-reduce. In: EDBT Conference, pp 113–124

  48. Wang H, Belhassena A (2017) Parallel trajectory search based on distributed index. Inf Sci 388-399:62–83

    Article  Google Scholar 

  49. Eldawy A, Li Y, Mokbel MF, Janardan R (2013) Cg_hadoop: computational geometry in mapreduce. In: SIGSPATIAL Conference, pp 284–293

  50. Pertesis D, Doulkeridis C (2015) Efficient skyline query processing in spatialhadoop. Inf Syst 54:325–335

    Article  Google Scholar 

  51. Corral A, Manolopoulos Y, Theodoridis Y, Vassilakopoulos M (2000) Closest pair queries in spatial databases. In: SIGMOD Conference, pp 189–200

  52. Hjaltason GR, Samet H (1998) Incremental distance join algorithms for spatial databases. In: SIGMOD Conference, pp 237–248

  53. Shin H, Moon B, Lee S (2003) Adaptive and incremental processing for distance join queries. IEEE Trans Knowl Data Eng 15(6):1561–1578

    Article  Google Scholar 

  54. Yang C, Lin K (2002) An index structure for improving closest pairs and related join queries in spatial databases. In: IDEAS Conference, pp 140–149

  55. Gutierrez G, Sȧez P (2013) The k closest pairs in spatial databases - when only one set is indexed. GeoInformatica 17(4):543–565

    Article  Google Scholar 

  56. Eldawy A, Alarabi L, Mokbel MF (2015) Spatial partitioning techniques in spatial hadoop. PVLDB 8(12):1602–1613

    Google Scholar 

  57. Preparata FP, Shamos MI (1985) Computational Geometry - An Introduction. Springer, Berlin

    Book  Google Scholar 

  58. Corral A, Almendros-Jimėnez JM (2007) A performance comparison of distance-based query algorithms using r-trees in spatial databases. Inf Sci 177(11):2207–2237

    Article  Google Scholar 

  59. Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to Algorithms, 3rd edn. MIT Press, Cambridge

    Google Scholar 

  60. Chaudhuri S, Motwani R, Narasayya VR (1999) On random sampling over joins. In: SIGMOD Conference, pp 263–274

  61. Corral A, Vassilakopoulos M (2005) On approximate algorithms for distance-based queries using r-trees. Comput J 48(2):220–238

    Article  Google Scholar 

  62. Leutenegger ST, Edgington JM, Lopez MA (1997) Str: A simple and efficient algorithm for r-tree packing. In: ICDE Conference, pp 497–506

  63. Papadopoulos AN, Nanopoulos A, Manolopoulos Y (2006) Processing distance join queries with constraints. Comput J 49(3):281–296

    Article  Google Scholar 

  64. Mamoulis N, Papadias D, Multiway spatial joins ACM (2001) Trans. Database Syst 26(4):424–475

    Article  Google Scholar 

  65. Corral A, Manolopoulos Y, Theodoridis Y, Vassilakopoulos M (2004) Multi-way distance join queries in spatial databases. GeoInformatica 8(4):373–402

    Article  Google Scholar 

  66. Vo H, Aji A, Wang F (2014) SATO: a spatial data partitioning framework for scalable query processing. In: SIGSPATIAL Conference, pp 545–548

  67. Aji A, Vo H, Wang F Effective spatial data partitioning for scalable query processing. arXiv:1509.00910

Download references

Acknowledgements

Work of all authors funded by the MINECO research project [TIN2013-41576-R]. We would like to thank Prof. Goce Trajcevski for providing us interesting comments to enrich the article, and we would like also thank the anonymous reviewers for their constructive remarks.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Francisco García-García.

Additional information

A preliminary partial version of this work appeared in [1].

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

García-García, F., Corral, A., Iribarne, L. et al. Efficient large-scale distance-based join queries in spatialhadoop. Geoinformatica 22, 171–209 (2018). https://doi.org/10.1007/s10707-017-0309-y

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10707-017-0309-y

Keywords