High-Performance Overlay Analysis of Massive Geographic Polygons That Considers Shape Complexity in a Cloud Environment
1. Introduction
2. Relevant Work
2.1. Shape Complexity
2.2. Overlay Analysis
3. Methodology
3.1. Basic Overlay Analysis Algorithm Running on Each Computing Node
3.1.1. Hormann Algorithm and Improvement of Intersection Degeneration Problem
- Calculating the intersections of the clipped and target polygons
- Judging from the direction of the vector line segment whether each intersection point is an entry point or an exit point, and adding the entry points to the vertex sequence of the clipping result polygon
- Comparing the azimuth intervals at the degenerate vertices of the intersection points and adding the vertices whose azimuth intervals overlap to the vertex sequence of the clipping result polygon
- Forming a new polygon (clipping result) in accordance with the sequence of vertices
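The first two steps above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the function names are hypothetical, only proper crossings are handled, and the azimuth-interval treatment of degenerate intersections (third step) is deliberately left out. Entry and exit flags are assigned by testing whether the first vertex of the target polygon lies inside the clipping polygon and then alternating along the boundary.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def segment_intersection(p1: Point, p2: Point, q1: Point, q2: Point
                         ) -> Optional[Tuple[Point, float, float]]:
    """Return (point, alpha_p, alpha_q) for a proper crossing, else None."""
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    sx, sy = q2[0] - q1[0], q2[1] - q1[1]
    denom = rx * sy - ry * sx
    if denom == 0:
        return None  # parallel or collinear: a degenerate case, not handled here
    t = ((q1[0] - p1[0]) * sy - (q1[1] - p1[1]) * sx) / denom
    u = ((q1[0] - p1[0]) * ry - (q1[1] - p1[1]) * rx) / denom
    if 0 < t < 1 and 0 < u < 1:
        return (p1[0] + t * rx, p1[1] + t * ry), t, u
    return None  # touching at an endpoint is also treated as degenerate


def point_in_polygon(pt: Point, poly: List[Point]) -> bool:
    """Even-odd ray casting test."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def label_intersections(target: List[Point], clip: List[Point]
                        ) -> List[Tuple[Point, str]]:
    """Walk the target boundary and flag each crossing as 'entry' or 'exit'."""
    inside = point_in_polygon(target[0], clip)
    labelled = []
    for i in range(len(target)):
        a, b = target[i], target[(i + 1) % len(target)]
        hits = []
        for j in range(len(clip)):
            hit = segment_intersection(a, b, clip[j], clip[(j + 1) % len(clip)])
            if hit is not None:
                hits.append(hit)
        hits.sort(key=lambda h: h[1])  # order crossings along the target edge
        for point, _, _ in hits:
            inside = not inside  # each proper crossing toggles the state
            labelled.append((point, "entry" if inside else "exit"))
    return labelled


target_square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
clip_square = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
labels = label_intersections(target_square, clip_square)
```

For the two overlapping squares in the example, the walk finds one entry point and one exit point on the boundary of the target polygon; a full clipper would then trace between such points to assemble the result polygon.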
3.1.2. Effect of Shape Complexity on Parallel Clipping Efficiency
3.2. Data Balancing and Partitioning Method that Considers Polygon Shape Complexity
3.2.1. Data Partitioning and Loading Strategy
- (1) Determine the order of the Hilbert curve, generate the Hilbert grid and the Hilbert curve, number the Hilbert curve sequentially, and obtain the Hilbert grid coding set.
- (2) Calculate the center point of each polygon's MBR, find the grid cell that contains it, and use the Hilbert code of that cell as the Hilbert code of the polygon to obtain the Hilbert coding set of the polygons.
- (3) In accordance with the number of computing nodes M, divide the Hilbert coding set of the polygons into M partitions, and calculate the start–stop Hilbert codes of the polygons in each partition.
- (4) Merge the grid cells of each Hilbert partition to obtain the partition polygons.
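Steps (1) to (3) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation evaluated in this paper: the function names are hypothetical, the grid extent is supplied by the caller, and the vertex count of a polygon is used as a simple stand-in for its shape complexity, since the complexity measure itself is not defined in this section. The Hilbert code is computed with the standard bitwise conversion from grid coordinates to curve position.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Extent = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def hilbert_index(order: int, x: int, y: int) -> int:
    """Position of grid cell (x, y) on a Hilbert curve over a 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate the quadrant so the sub-curve is oriented correctly
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def polygon_code(polygon: Sequence[Point], extent: Extent, order: int) -> int:
    """Hilbert code of the grid cell containing the polygon's MBR center."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
    xmin, ymin, xmax, ymax = extent
    n = 1 << order
    col = min(n - 1, int((cx - xmin) / (xmax - xmin) * n))
    row = min(n - 1, int((cy - ymin) / (ymax - ymin) * n))
    return hilbert_index(order, col, row)


def split_balanced(coded: List[Tuple[int, float]], m: int) -> List[List[int]]:
    """Cut the sorted codes into m contiguous runs of similar total weight.

    `coded` holds (hilbert_code, weight) pairs; the weight is the assumed
    complexity proxy (here: vertex count) and the total must be positive.
    """
    ordered = sorted(coded)
    total = sum(w for _, w in ordered)
    parts: List[List[int]] = [[] for _ in range(m)]
    running = 0.0
    for code, weight in ordered:
        # assign by the midpoint of the item's span on the cumulative weight axis
        k = min(m - 1, int((running + weight / 2) * m / total))
        parts[k].append(code)
        running += weight
    return parts


def partition_polygons(polygons: List[Sequence[Point]], extent: Extent,
                       order: int, m: int) -> List[List[int]]:
    coded = [(polygon_code(p, extent, order), float(len(p))) for p in polygons]
    return split_balanced(coded, m)
```

The start and stop codes of a partition are then the first and last entries of each non-empty list returned by `split_balanced`. Because the cuts are made on cumulative weight rather than on polygon count, a few very complex polygons can fill a partition on their own, which is the balancing effect this subsection aims for.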
3.2.2. R-tree Index Construction
3.3. Process Design of Distributed Parallel Overlay Analysis
3.4. Algorithmic Analysis
4. Experimental Study
4.1. Experimental Design
4.1.1. Computing Equipment
4.1.2. Experimental Data
4.1.3. Experimental Scene
- How much does Spark parallel computing improve the performance of overlay analysis compared with desktop software?
- How much better does the parallel overlay analysis algorithm proposed in this paper perform compared with direct use of the Spark computing paradigm?
- How much influence does the complexity difference of a geographic polygon have on parallel overlay analysis?
4.2. Test Process and Results
4.2.1. Compare the Performance Differences of Four Modes: ArcMap, Spark_original, Spark_NoComlexity and Spark_improved
- (1) When the number of polygons is less than 10 million, the efficiency of the Spark_original mode is even lower than that of the ArcMap mode. When the number of polygons exceeds 50,000, the Spark_improved mode takes less time than the ArcMap mode. When the number of polygons exceeds 1 million, the ArcMap mode takes twice as long as the Spark_improved mode. As the amount of data increases, the time consumption of the ArcMap mode rises sharply, whereas the time-consumption curve of the Spark_improved mode remains relatively flat.
- (2) The Spark_original mode is less efficient than the Spark_improved mode, and the gap widens as the number of polygons grows. This shows that using Spark directly for overlay analysis is very inefficient and that the algorithm must be optimized according to the characteristics of spatial data and geographical computation.
- (3) The time-consumption curves show that Spark_improved takes almost half as much time as Spark_NoComlexity, a larger gain than expected. This may be related to the experimental data: as noted in Section 4.1.2, the data contain many polygons with high shape complexity. Many large polygons may therefore be assigned to the same computational partition, which leads to data skew.
4.2.2. Compare the Performance Differences of Four Modes: Spark_original, Spark_MBR, Spark_MBR_Hilbert and Spark_MBR_Hilbert_R-tree
- (1) Adopting the MBR filtering strategy alone increases the efficiency of overlay computation by two to four times, which indicates that this strategy filters out a large number of invalid overlay computations. The specific efficiency improvement depends on the size, shape, and spatial distribution of the polygons in the target and clipped layers.
- (2) The Hilbert partitioning algorithm based on polygon shape complexity is used to allocate data to each computing node. When the amount of data reaches millions of polygons, the computing performance can be doubled, and the advantage becomes more evident as the amount of data increases. The experimental data verify that the spatial aggregation characteristics of Hilbert partitioning that considers polygon complexity can considerably improve spatial analysis algorithms.
- (3) Index construction can generally improve the efficiency of data access, but it also incurs a certain amount of computational overhead. After adding the R-tree index strategy on top of the first two steps, the overlay calculation time at each order of magnitude increases slightly when the amount of data is less than 5 million. When the amount of data exceeds 5 million, the overlay calculation time decreases compared with the case without the R-tree index. At that scale, the data access time saved by the R-tree index more than offsets the time consumed by building the index.
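The MBR filtering step discussed in item (1) can be pictured with the following Python sketch. The function names are hypothetical and the brute-force pairing is only for illustration; in the Spark_MBR_Hilbert_R-tree mode an R-tree index would replace the inner loop. The point is that a cheap rectangle test discards most polygon pairs before the far more expensive clipping is attempted.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbr(polygon: Sequence[Point]) -> Box:
    """Minimum bounding rectangle of a polygon given as a vertex sequence."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def mbrs_overlap(a: Box, b: Box) -> bool:
    """True if the two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_pairs(targets: List[Sequence[Point]],
                    clips: List[Sequence[Point]]) -> List[Tuple[int, int]]:
    """Index pairs (target, clip) whose MBRs overlap; only these need clipping."""
    target_boxes = [mbr(t) for t in targets]
    clip_boxes = [mbr(c) for c in clips]
    return [(i, j)
            for i, tb in enumerate(target_boxes)
            for j, cb in enumerate(clip_boxes)
            if mbrs_overlap(tb, cb)]
```

Pairs rejected by `mbrs_overlap` cannot intersect, so skipping them does not change the overlay result; pairs that pass are only candidates and still have to go through the polygon clipping algorithm.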
4.2.3. Cluster Acceleration Performance Testing of the Proposed Algorithm
4.3. Analysis of Experimental Results
5. Conclusions
| Equipment | Quantity | Hardware Configuration | Operating System | Software | Remark |
| --- | --- | --- | --- | --- | --- |
| Portable computer | 1 | ThinkPad T470p, 8 vCores, 16 GB RAM, SSD (Solid State Drive) | Windows 10 | ArcMap 10.4.1 | Single-computer experiment for desktop overlay analysis |
| X86 server | 6 | DELL R720, 24 cores, 64 GB RAM, HDD (Hard Disk Drive) | CentOS 7 | Hadoop 2.7, Spark 2.3.1 | Spark computing cluster |
| Mode Abbreviation | Equipment | Data Storage Mode | Notes |
| --- | --- | --- | --- |
| ArcMap | 1 portable computer with ArcMap | Local file system | Uses the Clip tool of the Toolbox to perform overlay analysis on the portable computer. |
| Spark_original | Multiple X86 servers with Spark | HDFS | Partitions the data randomly and performs parallel overlay analysis without any improvement. |
| Spark_improved | Multiple X86 servers with Spark | HDFS | Fully implements the parallel overlay analysis process of Section 3.3, including the Hilbert partitioning method that considers polygon shape complexity. |
| Spark_NoComlexity | Multiple X86 servers with Spark | HDFS | Identical to the Spark_improved mode except that polygon shape complexity is not considered. |
| Spark_MBR | Multiple X86 servers with Spark | HDFS | Based on the Spark_original mode; MBR filtering is performed first, followed by parallel overlay analysis. |
| Spark_MBR_Hilbert | Multiple X86 servers with Spark | HDFS | Based on the Spark_original mode; MBR filtering and Hilbert partitioning are added. |
| Spark_MBR_Hilbert_R-tree | Multiple X86 servers with Spark | HDFS | Based on the Spark_original mode; MBR filtering, Hilbert partitioning, and R-tree index creation are added. |
