CN103235974B - A method for improving the processing efficiency of massive spatial data

Info

Publication number: CN103235974B
Application number: CN201310148086.8A
Authority: CN
Inventors: 李连发; 王阳; 赵斯思; 王劲峰; 梁金能
Original assignee: Institute of Geographic Sciences and Natural Resources of CAS
Current assignee: Institute of Geographic Sciences and Natural Resources of CAS
Priority date: 2013-04-25
Filing date: 2013-04-25
Publication date: 2015-10-28
Anticipated expiration: 2033-04-25
Also published as: CN103235974A

Abstract

A method for improving the processing efficiency of massive spatial data, comprising the steps of common operator extraction, common operator parallel strategy design, common operator parallel implementation, common operator invocation, and common operator combination. The invention extracts the basic and general-purpose parts of spatial data processing as common operators and parallelizes them with MPI. In terms of data scale, it can process massive spatial data with millions of samples and hundreds of attributes, which existing spatial data processing software cannot compute, and it does so efficiently and correctly. Task submission and parameter setting are done interactively through web pages, and all computation is concentrated on the server side, so the load on the client is small and operation is simple.

Description

A Method of Improving the Processing Efficiency of Massive Spatial Data

Technical field

The invention relates to a method for improving the processing efficiency of massive spatial data based on common operators and high-performance computing. It provides a parallel computing framework for spatial data processing that runs stably and efficiently on a server, thereby improving processing efficiency, and it also examines how spatial data processing methods can be applied in multiple fields.

Background art

With the rapid development of aerospace remote sensing, sensor technology, and the Internet, the speed and scale at which spatiotemporal data are acquired keep growing (from GB to PB). The number of instances, attributes, and classes in the data has surged as well, producing large, high-dimensional data sets. Because spatial data processing algorithms are computationally complex and spatial information is itself complex, processing large spatial data sets takes a great deal of time. Meanwhile, in fields such as public health, disaster early warning, and population spatialization, many non-specialists need to use the complex models of spatial data processing, yet existing models generally cannot be customized to their needs and therefore cannot be widely adopted.

For spatial data processing operations, well-known foreign GIS platforms such as Arc/Info and MGE offer complete and efficient tool libraries. However, tools can only be combined and made to cooperate at a coarse granularity; the fine-grained algorithms inside a tool cannot be used individually or combined. To address this, the basic and shared parts of spatial statistics can be extracted as common operators. Common operators avoid repeated writing of spatial data processing code, ease interaction between algorithms, and allow spatial data processing models to be customized for an application.

Advances in modern science and technology have greatly promoted the development of computational science; new generations of computers far surpass earlier ones in computing power and speed. In practice, however, because of the speed limits of physical components and the state of technology, a single processor falls far short of the computing resources demanded by challenging large-scale problems in many modern fields. Besides strengthening the processor itself, parallel processing is therefore an effective way to increase computing power.

While multi-core processors have developed rapidly, the software industry has been far from prepared for them. Two challenges follow: how to develop spatial data processing applications conveniently and quickly on emerging multi-core cluster platforms so as to provide more efficient online services, and, more importantly, how to isolate upper-level application developers from the underlying multi-core platform so that the parallel spatial data processing algorithms they develop can be used easily.

Overall, research on existing spatial data operation models, algorithms, theories, and methods is not deep enough, and the results obtained have not been implemented promptly for further application. The rapid development of high-performance computing environments, especially multi-core environments, gives spatial data processing much more room to develop. Designing a parallel computing framework for spatial data processing based on common operators is therefore the best way to separate the underlying high-performance platform from upper-level application development and to bring spatial data processing to a wider range of industries.

Summary of the invention

Technical problem solved by the invention: to overcome the deficiencies of the prior art, the invention proposes a method for improving the processing efficiency of massive spatial data. The basic and general-purpose parts of spatial data processing are extracted as common operators and parallelized with MPI. In terms of data scale, the method can process massive spatial data with millions of samples and hundreds of attributes, which existing spatial data processing software cannot compute, and it does so efficiently and correctly. Task submission and parameter setting are done interactively through web pages, and all computation is concentrated on the server side, so the load on the client is small and operation is simple.

Technical solution of the invention: a method for improving the processing efficiency of massive spatial data, comprising the following steps:

(1) Common operator extraction

(1.1) Spatial data processing methods are divided, according to their inputs and outputs, implementation approach, and function, into four parts: preprocessing, spatial feature exploration, spatial information calculation, and result inference. Each part contains several spatial processing models, and a single model performs one complete spatial data processing function, such as classification or interpolation;

(1.2) The spatial processing models in each part of (1.1) are examined and, following the principles of functional completeness and indivisibility, each model is split into several independent modules. Each module serves as a common operator whose result is either the input data or input condition of other common operators later in the workflow, or directly the final result;

(1.3) The extracted set of common operators is screened and duplicates are removed, yielding the set of common operators to be parallelized for acceleration;

At this point the common operators of all spatial data processing methods have been extracted; they must next be parallelized to achieve acceleration.

(2) Common operator parallel strategy design

(2.1) Each common operator obtained in step (1.3) is divided into finer computing units. A single computing unit performs only one simplest complete computation, for example taking an expectation or a logarithm. The computing units run sequentially one after another, while each unit is parallel internally;

(2.2) The type of each computing unit is determined in turn and a data partitioning and distribution strategy is drawn up. If all computing units are Local or Focal (neighborhood) computations, raster data are partitioned by rows, while vector data must take spatial topology into account and are partitioned so that the data on any single node remain complete. If a Global computation is included, every node needs the data for its computation, so the data are not partitioned but sent to all nodes using a broadcast strategy. The basic unit of a broadcast is the process; a process is one computing and communication unit, usually one core of a CPU. After a process receives the data it joins the broadcasters and sends the data on to the remaining processes on its own node and to all processes on other nodes;

(2.3) After the data partitioning strategy has been designed, the parallel strategy of the computing units is designed. Computing units fall into two kinds: global parameter computation and per-sample loop computation. The parallel strategy for global parameter computation is designed first; the available strategies are domain decomposition and functional decomposition. Because a global parameter computation can usually be expressed as a mathematical formula, the formula can be decomposed and the spatial data to be processed assigned to multiple processes.

(2.4) The parallel strategy for per-sample loop computation is then designed. Because each computation depends only on the sample's own value and the global parameters, and not on the computation of other samples, a data-parallel strategy can be used in which the samples are distributed evenly across the processes.

At this point the parallel strategies of all common operators have been designed, and the common operators can be implemented according to these strategies using a specific programming language and parallel interface.
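The partition-or-broadcast decision of step (2.2) can be sketched in a few lines. This is a minimal illustration in Python rather than the C++/MPI of the invention; the function names `choose_distribution` and `row_blocks` and the lowercase kind labels are assumptions made for the example, and vector data partitioning by topology is not covered.

```python
from typing import List, Tuple


def choose_distribution(unit_kinds: List[str]) -> str:
    """Return 'broadcast' if any computing unit is global, otherwise 'partition'."""
    allowed = {"local", "focal", "global"}
    unknown = set(unit_kinds) - allowed
    if unknown:
        raise ValueError(f"unknown computing-unit kinds: {sorted(unknown)}")
    return "broadcast" if "global" in unit_kinds else "partition"


def row_blocks(n_rows: int, n_procs: int) -> List[Tuple[int, int]]:
    """Split raster rows into n_procs contiguous half-open ranges [start, stop)."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    base, extra = divmod(n_rows, n_procs)
    blocks, start = [], 0
    for rank in range(n_procs):
        # The first `extra` ranks each take one additional row.
        stop = start + base + (1 if rank < extra else 0)
        blocks.append((start, stop))
        start = stop
    return blocks
```

For example, `row_blocks(10, 3)` gives the ranges `(0, 4)`, `(4, 7)`, `(7, 10)`. A real Focal computation would usually also need a few overlapping boundary rows per block, which the text does not specify and this sketch omits.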

(3) Common operator parallel implementation

(3.1) Following the data partitioning and distribution strategy and the computing-unit parallel strategy of step (2), four parallel primitives are designed on top of MPI (Message Passing Interface, a message-passing parallel library): Map (distribute), Reduce, Broadcast, and Multiplex (cross operation). These extend the MPI function library and improve the transmission efficiency of common operators on large data, especially massive spatial data;

(3.2) Using the four parallel primitives of step (3.1) together with MPI functions, the common operators are parallelized in code written in the high-level language C++, yielding a set of efficiently running parallel common operators;

(3.3) The common operators implemented in step (3.2) are tested for parallel efficiency on a single node and on a multi-node cluster, I/O and communication costs are measured, and the operators are improved iteratively until executable parallel common operators that meet the requirements are obtained.

At this point all common operators have been implemented in parallel; each is compiled into an independent executable that runs efficiently on a high-performance computing platform.
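The text does not say how parallel efficiency is scored in step (3.3). A common convention, assumed here, is speedup S = T1 / Tp and efficiency E = S / p for p processes. The sketch below applies that convention; the timings in the usage note are made-up illustrative numbers, not measurements from the patent.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S = T1 / Tp for wall-clock times in the same unit."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n_procs: int) -> float:
    """Parallel efficiency E = S / p; 1.0 would be ideal linear scaling."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return speedup(t_serial, t_parallel) / n_procs
```

With a hypothetical serial run of 120 s and an 8-process run of 20 s, `speedup(120.0, 20.0)` is 6.0 and `efficiency(120.0, 20.0, 8)` is 0.75. Measuring I/O and communication time separately, as the step requires, helps explain where the remaining 25% goes.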

(4) Common operator invocation

(4.1) The executables of the common operators obtained in step (3.3) are deployed on a high-performance cluster and a daemon is written. The daemon on the cluster is a service that starts with the system and runs in the background; it handles parameter parsing, task execution, and result feedback.

(4.2) Once the daemon is running, the user submits the parameters required by a common operator through a web page in the client browser, and the web server writes the parameters to the database;

(4.3) The daemon reads the common operator's parameters from the database and parses them into a hash table of key-value pairs, in which the key is the parameter name and the value is the parameter value. Concatenating all key-value pairs in the hash table yields the command that expresses the spatial data processing task to be run;

(4.4) The daemon runs the task command obtained in (4.3), writing the run output and logs to the database and the computed results to disk;

(4.5) The web server retrieves the output and logs from the disk and the database, assembles the run output, logs, and results into a web page, and returns it to the user. Once the user has received the results and output, the invocation of the common operator is complete;

For simple spatial data processing, in which only a single common operator is used, the workflow ends here: the user can submit the operator's parameters through the web page and obtain the results, output, and logs.
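Step (4.3), turning the key-value hash table into a runnable command, might look like the following sketch. The `--key=value` flag style, the executable name `./discretize`, and the parameter names are assumptions for illustration; the patent does not give the command format. `mpirun -np` is the usual launcher invocation for MPI programs.

```python
import shlex
from typing import Dict, List


def build_command(executable: str, params: Dict[str, str], n_procs: int) -> List[str]:
    """Join the parameter hash table into an argument vector for one operator."""
    argv = ["mpirun", "-np", str(n_procs), executable]
    for key in sorted(params):  # sorted so the command is reproducible
        argv.append(f"--{key}={params[key]}")
    return argv


def render(argv: List[str]) -> str:
    """Shell-quoted form of the command, suitable for writing to a log."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

Keeping the command as a list and passing it to `subprocess.run(argv)` avoids shell interpretation of user-supplied values, which matters because the parameters originate from a web form.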

(5) Common operator combination

If complex spatial data processing is required, or the spatial data processing requirements of a particular domain must be met, step (4) is skipped and step (5) is performed instead.

(5.1) The executables of the common operators obtained in step (3) are deployed on a high-performance cluster and a daemon is written.

(5.2) The logical structure of the complex or domain-specific spatial data processing to be performed is analyzed to obtain the required common operators and the logical relationships among them, including execution order, dependencies, and the relationships between operator inputs and outputs;

(5.3) Based on the logical relationships obtained in step (5.2), the common operators are connected with directed lines in a visual complex-model editor to obtain a visual model;

(5.4) The complex-model editor converts the visual model into an ordered set of instructions and submits the instruction set to the database;

(5.5) The daemon reads the instruction set from the database and parses it, determines the dependencies, runs the instructions step by step, and writes logs to the database;

(5.6) After all instructions of step (5.5) have run to completion in order, the daemon writes the results of the spatial processing to disk and the web server returns them to the user. If a run fails, it is rolled back according to the logs and the error information is returned to the user. This completes the construction of the method for improving the processing efficiency of massive spatial data.
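Steps (5.3) to (5.5) amount to turning a directed graph of operators into an execution order that respects dependencies. One way to sketch this, assuming the model is stored as a mapping from each operator to the operators it depends on, is a topological sort with Python's standard `graphlib` (Python 3.9 or later). The operator names in the usage note are illustrative.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def order_operators(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return operators in an order where every dependency runs first.

    Raises graphlib.CycleError if the connections form a cycle, in which
    case no valid execution order exists.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

For a chain in which discretization needs regularized input and naive Bayes needs discretized input, `order_operators({"discretize": {"regularize"}, "naive_bayes": {"discretize"}})` returns `["regularize", "discretize", "naive_bayes"]`. Operators with no path between them could run concurrently, which this sequential ordering does not exploit.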

The domain decomposition parallel strategy of step (2.3) is carried out as follows: the non-overlapping regions of a partial differential equation are decomposed, so that the discretized equations become a number of independent simple equation-solving problems plus one global problem coupled to each of the simple equations. The functional decomposition parallel strategy is carried out as follows: when a system of linear equations is solved by Newton iteration, the two independent procedures of evaluating the function and evaluating the derivative can be assigned to different computers.

Compared with the prior art, the invention has the following advantages:

(1) By using the techniques set out in the parallel strategy design and parallel implementation steps of the claims, the invention takes full advantage of high-performance cluster computing and of the performance offered by the computing hardware, greatly improving the processing efficiency of massive spatial data. The invention also solves problems that arise when spatial data are brought into high-performance computing, such as how to partition vector data according to spatial relationships and how to communicate data among processes on multiple nodes.

(2) A parallel computing framework for spatial data processing is built from common operators. The parallel granularity is fine and the framework is highly customizable: with the complex-model editor the operators can be combined freely according to the final application, so that the models of spatial data processing can be applied in many industries.

(3) As described in step 3 of the claims, common operator invocation, the final result of the invention is offered to users free of charge as a web service. Non-specialist users need not be concerned with implementation details and only provide the inputs the system requires. The server side is deployed on a high-performance Linux cluster, which makes complex spatial data processing easier to use. Specialist users are also given more parameter controls, allowing more precise results.

Brief description of the drawings

Figure 1 shows the common operator framework for spatial data processing;

Figure 2 is a flow chart of the method of the invention;

Figure 3 shows the parallel strategy design for spatial variogram fitting;

Figure 4 shows the parallel strategy design for high-accuracy surface modeling;

Figure 5 shows the parallel strategy design for MSN;

Figure 6 shows the parallel strategy design for multi-unit sandwich sampling;

Figure 7 shows the four parallel primitives;

Figure 8 shows the implementation steps and discretization flow of the high-performance Bayesian classifier;

Figure 9 shows the implementation steps of structure learning and parameter learning in the high-performance Bayesian classifier;

Figure 10 shows the parallel efficiency of the high-performance Bayesian classifier.

Detailed description of the embodiments

As shown in Figures 1 and 2, the invention mainly comprises the following steps:

Common operator extraction

Spatial data processing is generally simple in how its resources are organized, being computation on a single class of data. Algorithmically, computation on a single class of attributes or geometries is relatively simple, whereas statistical inference is more complex. In short, the more sources the resources involved come from, the more complex the spatial data processing and the more its implementation requires coordination across many aspects.

The invention divides spatial data processing methods by purpose into four parts: preprocessing, spatial feature exploration, spatial information calculation, and result inference, each containing several spatial processing models. The characteristics of the models in each part are determined and their basic and general-purpose parts are extracted as common operators, ensuring that each common operator is an independent module whose result can serve as the input data or condition of other common operators or directly as the final result. The resulting set of common operators is then screened to remove duplicated and non-parallelizable operators.

Data preprocessing performs rough processing of existing spatial data to supply subsequent models with data in the format they require; the common operators extracted include distribution transformation, regularization, and discretization. Spatial feature exploration performs exploratory calculations of spatial distribution and correlation to obtain the global and local clustering characteristics of the spatial data; the common operators that can be extracted include Moran's I, Getis-Ord G, and spatial scan statistics. Spatial information calculation aims to obtain the overall characteristic parameters of an area and to turn discrete sample values into a continuous surface by spatial interpolation; the common operators extracted include the semivariogram, block matrix transformation, and matrix eigenvalue solution. Result inference trains a model on known sample information and then infers unknown sample values; the common operators extracted include maximum likelihood estimation, EM, and naive Bayes.

Common operator parallel strategy design

The common operators obtained in the previous step are divided into finer computing units, each performing only one simplest complete computation, such as taking an expectation or a logarithm. The computing units run sequentially one after another, while each unit is parallel internally. The type of each computing unit is determined in turn and a data partitioning and distribution strategy is drawn up. If all computing units are Local or Focal (neighborhood) computations, raster data are partitioned by rows, while vector data must take spatial topology into account and are partitioned so that the data on any single node remain complete. If a Global computation is included, every node needs the data, so the data are not partitioned but sent to all nodes using a broadcast strategy: after a process receives the data it joins the broadcasters and sends the data on to other processes.
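A broadcast in which every receiver becomes a sender doubles the number of data holders each round, so P processes are reached in about log2(P) rounds. The sketch below only computes such a send schedule; it is a pure-Python illustration, the function name `broadcast_schedule` is an assumption, and it ignores the distinction between processes on the same node and on other nodes.

```python
from typing import List, Tuple


def broadcast_schedule(n_procs: int) -> List[List[Tuple[int, int]]]:
    """Rounds of (sender, receiver) rank pairs; rank 0 initially holds the data."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    rounds = []
    have = 1  # ranks 0 .. have-1 currently hold the data
    while have < n_procs:
        # Every current holder sends to one rank that does not yet have the data.
        sends = [(src, src + have) for src in range(have) if src + have < n_procs]
        rounds.append(sends)
        have = min(2 * have, n_procs)
    return rounds
```

For 8 processes the schedule has 3 rounds: rank 0 sends to 1, then ranks 0 and 1 send to 2 and 3, then ranks 0 to 3 send to 4 to 7. A single sender serving all 7 receivers in turn would need 7 sequential sends instead.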

After the data partitioning strategy has been designed, the parallel strategy of the computing units is designed. Computing units fall into two kinds: global parameter computation and per-sample loop computation. The parallel strategy for global parameter computation is designed first; in general the formula itself is decomposed and the computation assigned to multiple processes. The available strategies are domain decomposition and functional decomposition. Domain decomposition: the non-overlapping regions of a partial differential equation are decomposed, so that the discretized equations become a number of independent small problems plus one global problem coupled to each small problem. Functional decomposition: when a system of linear equations is solved by Newton iteration, the two independent procedures of evaluating the function and evaluating the derivative can be assigned to different computers. The parallel strategy for per-sample loop computation is then designed: because each computation depends only on the sample's own value and the global parameters, and not on the computation of other samples, a data-parallel strategy can be used in which the samples are distributed evenly across the processes.
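The two kinds of computing unit can be illustrated with a small example: a global parameter (mean and standard deviation) whose formula decomposes into per-chunk partial sums, followed by a per-sample loop that needs only the sample and those global parameters. Z-score standardization is used here purely as an illustrative computation; the chunks stand in for processes, and all function names are assumptions.

```python
import math
from functools import reduce
from typing import Iterable, List, Sequence, Tuple

Moments = Tuple[int, float, float]  # (count, sum, sum of squares)


def partial_moments(chunk: Sequence[float]) -> Moments:
    """Per-process part of the decomposed formula."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a: Moments, b: Moments) -> Moments:
    """Reduce step: partial results add component-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def global_mean_std(chunks: Iterable[Sequence[float]]) -> Tuple[float, float]:
    """Global parameters from the combined partial moments (population std)."""
    n, s, ss = reduce(combine, (partial_moments(c) for c in chunks))
    mean = s / n
    variance = max(ss / n - mean * mean, 0.0)  # clamp tiny negative rounding error
    return mean, math.sqrt(variance)


def standardize(chunk: Sequence[float], mean: float, std: float) -> List[float]:
    """Per-sample loop: each value depends only on itself and the global parameters."""
    if std == 0.0:
        raise ValueError("standard deviation is zero")
    return [(x - mean) / std for x in chunk]
```

Splitting samples as `chunks = [samples[r::3] for r in range(3)]` deals them out almost evenly to three hypothetical processes. The sum-of-squares formula is chosen for brevity; it is less numerically stable than a two-pass or Welford-style update on data with a large mean.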

The detailed parallel strategy design of each common operator is as follows:

(1) Data preprocessing

Regularization and distribution transformation use a strategy that combines data partitioning with asynchronous parallelism. They can handle multivariate data, and the user can choose which dimensions to regularize or transform. Discretization aims to discretize numeric data with an optimized algorithm so that classification learning achieves the best results. Its parallel strategy mainly uses synchronous parallel algorithms and pipelining. When candidate breakpoints are computed, the data are processed in parallel: each process computes its breakpoints and the results are gathered at the root process. When breakpoints are screened, each process computes the importance of its breakpoints and finds its own most important one; a reduce operation then yields the globally most important breakpoint, and the consistency requirement determines whether the loop continues.
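The breakpoint screening step is a local-maximum-then-reduce pattern. The text does not define "importance", so the sketch below assumes information gain of a binary split as the score; candidate cuts are dealt out to hypothetical processes, each finds its best cut, and a reduce keeps the overall best. The consistency-driven outer loop is omitted.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy in bits of a label sequence."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values: Sequence[float], labels: Sequence[str], cut: float) -> float:
    """Assumed importance score: entropy reduction from splitting at `cut`."""
    left = [l for v, l in zip(values, labels) if v <= cut]
    right = [l for v, l in zip(values, labels) if v > cut]
    n = len(labels)
    return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)


def local_best(values, labels, cuts: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Best (gain, cut) among one process's candidate cuts, or None if it has none."""
    return max(((info_gain(values, labels, c), c) for c in cuts), default=None)


def global_best(local_results: List[Optional[Tuple[float, float]]]) -> Optional[Tuple[float, float]]:
    """Reduce step: keep the largest (gain, cut) over all processes."""
    return max((r for r in local_results if r is not None), default=None)
```

Only one `(gain, cut)` pair per process crosses the network in the reduce, regardless of how many candidates each process evaluated, which is what keeps the communication cost of this step small.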

(2) Spatial correlation exploration

Spatial variogram fitting fits several different functions and then uses R² to select the best fitting parameters. The parallel strategy design is shown in Figure 3: asynchronous parallelism is used at the top level, while at the next level domain decomposition can be used to compute the variability in different directions. A synchronous parallel design is used to select the optimal model and improve solution efficiency. The main process selects, from the results of the n child processes, the variogram model with the largest R², that is, the highest accuracy.
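Model selection by R² can be sketched as follows. The candidate shapes (spherical, exponential, Gaussian, in their common practical-range forms), the zero-nugget assumption, and the grid search over the range parameter are all choices made for this example, not details from the patent. Each model's fit is independent, so in a parallel setting each could run in its own process, with the main process taking the maximum.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Fit = Tuple[float, float, float]  # (r_squared, sill, range)


def spherical(h: float, a: float) -> float:
    r = h / a
    return 1.5 * r - 0.5 * r ** 3 if h < a else 1.0


def exponential(h: float, a: float) -> float:
    return 1.0 - math.exp(-3.0 * h / a)


def gaussian(h: float, a: float) -> float:
    return 1.0 - math.exp(-3.0 * (h / a) ** 2)


MODELS: Dict[str, Callable[[float, float], float]] = {
    "spherical": spherical, "exponential": exponential, "gaussian": gaussian,
}


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    mean = sum(observed) / len(observed)
    ss_tot = sum((y - mean) ** 2 for y in observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot


def fit_model(shape, lags, gammas, ranges) -> Fit:
    """Grid-search the range; the sill has a closed-form least-squares value."""
    best = None
    for a in ranges:
        f = [shape(h, a) for h in lags]
        sill = sum(g * x for g, x in zip(gammas, f)) / sum(x * x for x in f)
        r2 = r_squared(gammas, [sill * x for x in f])
        if best is None or r2 > best[0]:
            best = (r2, sill, a)
    return best


def select_model(lags, gammas, ranges) -> Tuple[str, Fit]:
    """Fit every candidate and return the one with the largest R²."""
    fits = {name: fit_model(shape, lags, gammas, ranges) for name, shape in MODELS.items()}
    name = max(fits, key=lambda k: fits[k][0])
    return name, fits[name]
```

Here `lags` are positive lag distances and `gammas` the empirical semivariances at those lags; a production fit would normally also estimate a nugget and use a proper optimizer rather than a coarse grid.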

(3) Spatial interpolation

(i) Kriging interpolation. Because interpolation is a batch task, the interpolation of each point can be assigned to one MPI process, giving very good parallelism and performance gain. When computing resources far exceed the number of points to be interpolated, the embedded k-nearest-neighbor search and linear-system solution are themselves considered for parallel optimization. Interpolation has two steps: (a) find the N points nearest to the target point; (b) compute the result with some interpolation method. For Kriging interpolation these two steps become a k-nearest-neighbor search and the solution of a system of linear equations.

Common operator 1: nearest-neighbor search. Implementation options: (a) the original algorithm, brute force; (b) optimized serial algorithms based on spatial partitioning and index trees, a typical example being ANN; (c) a parallel algorithm using divide and conquer (the results computed on each part of the data are merged).
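Options (a) and (c) can be compared directly. In the divide-and-conquer version each part of the data yields its own k nearest candidates, and merging those candidates gives the global answer, because any globally nearest point is also among the k nearest within its own part. The sketch runs the parts sequentially; the function names and the strided split are assumptions for the example.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def knn_brute(points: Sequence[Point], target: Point, k: int) -> List[Point]:
    """Brute force: the k points closest to target, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, target))


def knn_divide_and_conquer(points: Sequence[Point], target: Point,
                           k: int, n_parts: int) -> List[Point]:
    """Local k-nearest per part (one part per process), then merge the candidates."""
    parts = [points[r::n_parts] for r in range(n_parts)]
    local = [knn_brute(part, target, k) for part in parts]
    merged = [p for candidates in local for p in candidates]
    return knn_brute(merged, target, k)
```

The merge handles at most `k * n_parts` candidates, so its cost is independent of the total number of points. When several points are exactly equidistant from the target, the two functions may return different but equally valid points.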

Common operator 2: solving systems of linear equations. A mature parallel linear algebra library such as LINPACK or the Intel Math Kernel Library is used. If the order of the system is not high, domain decomposition and functional decomposition strategies are used, with each linear system run on a single machine to reduce communication cost. Domain decomposition: the non-overlapping regions of a partial differential equation are decomposed, so that the discretized equations become a number of independent small problems plus one global problem coupled to each small problem. Functional decomposition: when a system of linear equations is solved by Newton iteration, the two independent procedures of evaluating the function and evaluating the derivative can be assigned to different computers.

(2) High-precision surface modeling

The parallel strategy for high-precision surface modeling is shown in Figure 4. The underlying problem is a constrained least-squares problem; iterating on it yields a numerical surface that supports fast, efficient temperature interpolation. The parallel strategy as a whole uses a symmetric mode, and the top level uses domain decomposition. The overall problem is decomposed into three parts: automatic fitting of spatial variation, solution of the block-triangular transformed equations, and least-squares solution. The parallel strategy inside each of these three common operators is domain decomposition.

(4) Areal population parameter estimation

(a) MSN, the unbiased optimal estimator of the mean of a heterogeneous surface, improves the accuracy of the areal mean estimate for the study area.

The parallel strategy is shown in Figure 5. Overall it uses a master-slave mode, with domain decomposition at the top level. The problem reduces to parallelizing two common operators, automatic fitting of spatial variation and the unbiased optimal solution of the Gaussian equations, both of which use domain decomposition.

(b) Multi-unit sandwich sampling

The parallel strategy is shown in Figure 6. Overall it uses a master-slave mode. The problem is reduced to three basic common operators: automatic fitting of spatial variation, calculation of similarity coefficients, and the unbiased optimal solution of the Gaussian equations. The parallel strategy inside each common operator is domain decomposition.

The Sandwich spatial sampling model is composed of the stratified similarity calculation and the Gaussian equations of the unbiased optimal estimate.

(5) Spatial classification

Spatial DAG classification inference. This model is based on Bayesian networks and incorporates spatial factors. It consists of three steps, network structure learning, network parameter learning, and network inference, each with its own parallel strategy.

Network structure learning generally involves two aspects, model selection and model optimization. Model selection fixes the criterion by which candidate models are judged, such as a scoring function (K2, BIC, AIC, and so on); model optimization searches for the best model, for example by hill climbing. A single call to the scoring function is cheap, but the function is called repeatedly, so there is no need to parallelize it internally. Hill climbing is a metaheuristic local search with three local operators: add an edge, delete an edge, and reverse an edge. Starting from a given initial structure, it selects an operator that improves the score of the Bayesian network and iterates. Each iteration must find the local operator that raises the network score the most: the candidate evaluations are distributed over multiple threads, and a Reduce operation returns the best local operator. After that operator is applied, the set of applicable local operators is recomputed and redistributed, and the process repeats until the network score no longer improves.
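The hill-climbing loop with a distribute-then-Reduce step can be sketched as below. This is an illustrative sketch under stated assumptions: all function names are hypothetical, the workers are simulated with list slices instead of threads or MPI ranks, the search starts from an empty graph, and a toy score (closeness to a known target edge set) stands in for K2, BIC, or AIC.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Kahn's algorithm: the graph is acyclic if every node can be removed."""
    indeg = {n: 0 for n in nodes}
    for _, v in edges:
        indeg[v] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for u, v in edges:
            if u == n:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return removed == len(nodes)

def candidate_structures(nodes, edges):
    """Apply the three local operators (add, delete, reverse) wherever legal."""
    out = []
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            out.append(edges - {(u, v)})                # delete edge
            out.append((edges - {(u, v)}) | {(v, u)})   # reverse edge
        elif (v, u) not in edges:
            out.append(edges | {(u, v)})                # add edge
    return [e for e in out if is_acyclic(nodes, e)]

def hill_climb(nodes, score, n_workers=4):
    edges, best = frozenset(), score(frozenset())
    while True:
        moves = candidate_structures(nodes, edges)
        shares = [moves[w::n_workers] for w in range(n_workers)]   # distribute candidates
        local_best = [max(((score(m), sorted(m)) for m in share), default=None)
                      for share in shares]                          # each worker's best
        top = max((b for b in local_best if b is not None), default=None)  # Reduce
        if top is None or top[0] <= best:
            return edges, best          # no operator improves the score any further
        best, edges = top[0], frozenset(top[1])

target = frozenset({("A", "B"), ("B", "C")})
learned, final_score = hill_climb(["A", "B", "C"], lambda e: -len(e ^ target))
print(sorted(learned), final_score)
```

Ties between equally scored candidates are broken by comparing the sorted edge lists, which keeps the result deterministic regardless of how the candidates are dealt out.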

Network parameter learning determines the conditional probability distribution of each node for a given topology. The parameter learning method currently used in the present invention is EM, which consists of an E step and an M step. The E step uses a data-partitioning parallel strategy: each calculation depends on a single sample only, so the processes communicate just once, after all calculations finish, to obtain the final result. The M step uses the conditional independence of the Bayesian network and the expected sufficient statistics from the E step; because the likelihood function decomposes under the completed data set, the local likelihood functions are computed in parallel.

Network inference uses a data-block parallel strategy. Each inference needs only the influencing-factor data of the current sample and the Bayesian network with its conditional probabilities to classify the decision factor. The parallel strategy therefore concerns the data alone: it is enough to split the data into blocks.
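A minimal sketch of data-block parallel inference follows, assuming a naive Bayes classifier over discretized features. The function names, the class labels, and the probability tables are made up for illustration, and the blocks are processed in a loop that stands in for independent processes.

```python
import math

def classify(sample, priors, cond):
    """Return the class with the highest log posterior under naive Bayes.
    cond[c][i][v] is P(feature i = v | class c)."""
    def log_posterior(c):
        return math.log(priors[c]) + sum(
            math.log(cond[c][i][v]) for i, v in enumerate(sample))
    return max(priors, key=log_posterior)

def classify_in_blocks(samples, priors, cond, n_procs):
    """Split the samples into blocks and classify each block independently."""
    size = -(-len(samples) // n_procs)          # ceiling division
    blocks = [samples[r * size:(r + 1) * size] for r in range(n_procs)]
    # Each block needs only its own samples plus the shared network,
    # so no block ever waits on another.
    per_block = [[classify(s, priors, cond) for s in block] for block in blocks]
    return [label for block in per_block for label in block]

# Illustrative tables: one feature, speed level 0 (slow) or 1 (fast).
priors = {"indoor": 0.6, "in_car": 0.4}
cond = {"indoor": [{0: 0.9, 1: 0.1}], "in_car": [{0: 0.2, 1: 0.8}]}
samples = [(0,), (1,), (0,), (1,), (1,)]
labels = classify_in_blocks(samples, priors, cond, n_procs=2)
print(labels)
```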

(6) Spatio-temporal pattern recognition

Parallelization scheme for the spatio-temporal scan statistic and hotspot/cluster detection:

(1) Select candidate clusters: a. divide all grid points into n mutually disjoint subsets and assign them to n parallel processes; b. in each process, compute the set of candidate clusters of its subset, free of duplicates; c. for every pair of subsets, remove duplicate candidate clusters.

(2) Find the most likely cluster from the real observations: a. distribute the observed case counts to the n parallel processes; b. in each process, compute the likelihood ratio of each candidate cluster it holds; c. find the most likely cluster, the one that maximizes the likelihood ratio.

(3) Monte Carlo simulation: compute the maximized likelihood ratio independently in several parallel processes, obtaining N maximized likelihood ratios in total, stored in the respective processes.

(4) Compute the statistical significance (p-value) of the most likely cluster: a. in each parallel process, sort the maximized likelihood ratios it holds; b. merge the sorted results of the processes pairwise until a single sorted sequence containing all N likelihood ratios remains.
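The local sort and pairwise merge of the final step can be sketched as follows. The text does not state how the p-value is read off the sorted sequence; the sketch assumes the usual Monte Carlo rank rule, p = (number of simulated values at least as large as the observed one + 1) / (N + 1). Function names and the numbers are illustrative.

```python
import bisect
import heapq

def merge_pairwise(sorted_lists):
    """Merge sorted lists two at a time until a single sorted list remains."""
    lists = [list(s) for s in sorted_lists]
    while len(lists) > 1:
        merged = [list(heapq.merge(lists[i], lists[i + 1]))
                  for i in range(0, len(lists) - 1, 2)]
        if len(lists) % 2:              # the odd list out waits for the next round
            merged.append(lists[-1])
        lists = merged
    return lists[0] if lists else []

def monte_carlo_p_value(observed, simulated_per_process):
    """Rank the observed maximum likelihood ratio among the simulated ones."""
    local_sorted = [sorted(values) for values in simulated_per_process]   # per-process sort
    all_sorted = merge_pairwise(local_sorted)                             # pairwise merge
    n_at_least = len(all_sorted) - bisect.bisect_left(all_sorted, observed)
    return (n_at_least + 1) / (len(all_sorted) + 1)

simulated = [[1.0, 3.0, 2.0], [5.0, 0.5], [4.0, 2.5, 1.5, 3.5]]   # three processes, N = 9
p_value = monte_carlo_p_value(4.5, simulated)
print(p_value)
```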

Parallel implementation of common operators

Each common operator is parallelized according to the parallel strategy designed for it. In a distributed networked computer system, processes communicate by message passing. The popular message-passing parallel programming environments are MPI (Message Passing Interface) and PVM (Parallel Virtual Machine); of these, MPI has become the most important parallel programming tool thanks to its portability, rich functionality, and efficiency.

Based on the three kinds of operation mentioned in step 2, four parallel primitives (Map, Reduce, Broadcast, Multiplex) are designed on top of MPI. As shown in Figure 7, they extend MPI functions to improve transfer efficiency for large data, spatial data in particular. Compared with ordinary Map-Reduce, they have finer parallel granularity, support multiple stages, and use a different communication mechanism.

(a) Map is implemented with MPI_Scatter, MPI_Send, and MPI_Recv, and distributes original and intermediate data to all processes in the current communicator. Raster data are integer or floating-point values, which MPI supports natively, and are sent directly; vector data must be serialized into a binary string and sent as Char.

(b) Reduce is implemented with MPI_Reduce and gathers the results computed by each process at the root process.

(c) Broadcast is implemented with MPI_Bcast and broadcasts the result held by a single process to all processes.

(d) Multiplex is implemented with MPI_Allgather, MPI_Bcast, MPI_Send, and MPI_Recv, and broadcasts the data read or the results computed by every process to all processes.
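The data movement of the four primitives can be illustrated without MPI by treating a Python list as "one slot per process". This is only a simulation of the semantics under that assumption; the function names are hypothetical and nothing here calls an MPI library.

```python
import operator

def map_scatter(data, n_procs):
    """Map: split data into n_procs contiguous blocks, one per process."""
    size, extra = divmod(len(data), n_procs)
    blocks, start = [], 0
    for rank in range(n_procs):
        end = start + size + (1 if rank < extra else 0)
        blocks.append(data[start:end])
        start = end
    return blocks

def reduce_to_root(per_process_values, op):
    """Reduce: combine one value per process into a single value at the root."""
    result = per_process_values[0]
    for value in per_process_values[1:]:
        result = op(result, value)
    return result

def broadcast(value, n_procs):
    """Broadcast: every process ends up holding the root's value."""
    return [value for _ in range(n_procs)]

def multiplex(per_process_blocks):
    """Multiplex: every process ends up holding every process's block."""
    return [list(per_process_blocks) for _ in per_process_blocks]

blocks = map_scatter(list(range(10)), 3)
total = reduce_to_root([sum(b) for b in blocks], operator.add)
means = broadcast(total / 10, 3)
views = multiplex(blocks)
print(blocks, total, means)
```

Map followed by Reduce and Broadcast is the pattern for a global statistic such as a mean; Multiplex is the pattern needed when every block must be combined with every other block.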

The following takes Moran's I as an example to illustrate the concrete steps of implementing a common operator in parallel.

Here the local Moran's I value must be computed for every influencing-factor variable of the samples, one variable at a time. Each computation has three stages (computing the expectation, then [formula missing], then [formula missing]). The three stages run serially, and each stage is parallel internally. After the sample data are split into blocks, the first two stages need only within-block calculations whose results are gathered at the master process and then broadcast to all processes. The last stage involves both within-block and between-block calculations, and the within-block and between-block values are summed. Between-block calculation requires sending a large amount of data, so the processes send data to one another directly to keep any process from waiting too long.

The steps of the parallel Moran's I computation are as follows:

(1) process 0 reads the raster image block by block and sends the blocks to the other processes in turn;

(2) every process takes the block it received and computes the sum of the observations in it;

(3) Reduce computes the sum of all observations and from it the mean, and the mean is broadcast to all processes;

(4) every process computes, for its own data block, [formula missing];

(5) every process broadcasts the block it received to the other processes. On receiving a block, each process performs the cross operation (Multiplex) between it and the block it received in step 1, and computes [formula missing];

(6) Reduce collects the partial sums [formulas missing] from all processes and yields the final Moran's I value;

(7) the Moran's I value is written into the sample attributes.
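The block structure of these steps can be sketched as below. Because the formulas are not recoverable from the text, the sketch assumes the standard global Moran's I, I = (n / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2, with W the sum of all weights; the weight function, the function name, and the sample values are likewise assumptions, and the processes are simulated with a list of blocks.

```python
def morans_i_blocked(values, weight, n_procs):
    n = len(values)
    # Step 1: deal contiguous blocks of (index, value) pairs to the processes.
    indexed = list(enumerate(values))
    size = -(-n // n_procs)                     # ceiling division
    blocks = [indexed[r * size:(r + 1) * size] for r in range(n_procs)]
    # Steps 2-3: local sums are reduced to a global mean known to every process.
    mean = sum(sum(x for _, x in block) for block in blocks) / n
    # Step 4: each process sums squared deviations over its own block.
    local_sq = [sum((x - mean) ** 2 for _, x in block) for block in blocks]
    # Step 5: each process crosses its own block with every block it receives.
    local_cross, local_w = [], []
    for mine in blocks:
        cross = w_sum = 0.0
        for other in blocks:
            for i, xi in mine:
                for j, xj in other:
                    w = weight(i, j)
                    cross += w * (xi - mean) * (xj - mean)
                    w_sum += w
        local_cross.append(cross)
        local_w.append(w_sum)
    # Step 6: reduce the partial sums and combine them.
    return (n / sum(local_w)) * sum(local_cross) / sum(local_sq)

def adjacent(i, j):
    """Assumed weights: 1 for neighbouring cells on a line, 0 otherwise."""
    return 1.0 if abs(i - j) == 1 else 0.0

result = morans_i_blocked([1.0, 2.0, 3.0, 4.0, 5.0], adjacent, n_procs=3)
print(result)
```

The result does not depend on the number of blocks, since the within-block and between-block contributions together cover every ordered pair exactly once.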

Common operator invocation

After being implemented in step 3, each common operator is compiled into a standalone executable that the daemon can call with different parameters. (The daemon is a service on the Linux system that schedules and runs computing tasks and writes results and logs to the server database.) The steps are as follows:

(a) the user submits the parameters of the common operator calculation through a web page in the client browser, and the Web server writes them into the server database;

(b) the daemon reads the parameters from the server database and concatenates them into the task to be executed;

(c) the daemon submits the task, runs the command, captures the program's output and log through pipes and writes them into the server database, and writes the computed results to the server disk;

(d) the Web server retrieves the output and log from the server disk and database, and assembles the run output, log, and results into a web page returned to the user.
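Steps (b) and (c) can be sketched as follows. The flag names and file names are taken from the data preprocessing command given in this section; everything else is an assumption for illustration: the function names are hypothetical, the parameters come from a dict rather than a database, the captured output is returned as a record rather than written to a database, and a small Python one-liner stands in for the real task so the sketch runs without MPI.

```python
import shlex
import subprocess
import sys

def build_command(executable, n_procs, params):
    """Step (b): concatenate parameter name/value pairs into one command line."""
    parts = ["mpirun", "-np", str(n_procs), executable]
    for name, value in params.items():          # dicts keep insertion order
        parts += [f"-{name}", str(value)]
    return shlex.join(parts)

def run_task(argv):
    """Step (c): run the task and capture its output and log through pipes."""
    done = subprocess.run(argv, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, text=True)
    return {"returncode": done.returncode, "output": done.stdout, "log": done.stderr}

command = build_command(
    "GeoPreprocessing", 2,
    {"a": 2, "co": 0, "k": 0, "p": "gps_people_s.csv", "c": "0,1,2", "o": "re_dis.csv"},
)
print(command)

# Stand-in task: prints to stdout (output) and stderr (log).
record = run_task([sys.executable, "-c",
                   "import sys; print('done'); print('step ok', file=sys.stderr)"])
print(record["returncode"], record["output"].strip(), record["log"].strip())
```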

Advanced users can also install an ssh client and invoke the operators from the command line, where more parameters are available. The commands and detailed parameter descriptions for data preprocessing and spatial DAG classification inference are given below.

(1) Data preprocessing

mpirun -np 2 GeoPreprocessing -a 2 -co 0 -k 0 -p gps_people_s.csv -c 0,1,2 -o re_dis.csv

Parameters:

-a selects the algorithm: 0 for log, 1 for normal, 2 for discretization

-p is the input file path

-o is the output file path

-c lists the columns to process, counted from 0; for raster data it lists the bands

-co sets whether to output the class information once discretization is complete: 0 for no, 1 for yes; the default is 0

-k sets whether to filter breakpoints with k-means clustering: 0 for no, 1 for yes; the default is 0

-C is the column holding the decision attribute for the discretization algorithm

(2) Spatial DAG classification inference

mpirun -np 2 ParBayes -a StrLeaning_HC -p re_dis.csv -cn 2 -dn 6 -c 0,1,2,3,4

-a selects the structure learning or parameter learning algorithm

-p is the input file path

-c lists the columns to process, counted from 0

-cn is the node size of an ordinary node, that is, its number of value categories; the default is 2

-dn is the node size of the decision node; the default is 6

Combination of common operators

A common operator generally implements only one specific function or algorithm, for example the discretization operator or the variogram operator. To build complex functionality or tailor functions to a particular domain, several common operators must be combined in a definite structure into a complex model. Every common operator has its own algorithm parameters and interface. The operator-based parallel framework for spatial data processing provides a visual complex-model editor that helps users express the logical structure of each workflow as a visual model, which the framework then converts into commands executable on the server. When the framework calls common operators, parallelism exists not only inside each operator but also between operators that do not depend on one another, which improves the running efficiency of the whole complex model.

The implementation steps are as follows:

(a) study the logical structure of the workflow to identify the required common operators and the logical relationships among them;

(b) following the logical structure obtained in step (a), connect the common operators with directed lines in the visual complex-model editor to obtain a visual model;

(c) once the model is edited, the complex-model editor converts the visual model into an ordered instruction set and submits it to the server database;

(d) the daemon reads the instruction set from the server database, interprets it, determines the dependencies, runs the instructions step by step, and writes the log into the server database;

(e) after all the instructions of step (d) have run in order, the daemon writes the spatial processing results to the server and the Web server returns them to the user; if a run fails, it is rolled back according to the log and the error information is returned to the user.
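Resolving the dependencies of a visual model into an execution order can be sketched with a topological sort. The sketch groups operators into "waves" whose members do not depend on one another and could therefore run at the same time. The operator names, in particular `discretize_new_samples`, and the function name are hypothetical; the patent does not describe how the daemon resolves dependencies internally.

```python
from graphlib import TopologicalSorter

def execution_waves(dependencies):
    """Group operators into waves; operators within a wave are independent.
    dependencies maps each operator to the set of operators it needs first."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                      # raises CycleError on circular models
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

# A model shaped like the Bayesian classifier workflow, with one extra
# independent preprocessing branch added for illustration.
model = {
    "structure_learning": {"discretization"},
    "parameter_learning": {"structure_learning"},
    "inference": {"parameter_learning", "discretize_new_samples"},
}
waves = execution_waves(model)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

`graphlib` is part of the Python standard library from version 3.9 onward.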

The following takes a high-performance Bayesian classifier as an example to illustrate the concrete steps of implementing the present invention:

(1) Data description (sample data can be downloaded from http://159.226.110.219/)

Test case: using known samples (trajectory data), study the relationship between the influencing-factor variables (such as moving speed and a person's activity level) and the class variable (the person's activity state), and then infer the class of samples whose target variable is unknown.

Influencing-factor variables: 17 feature variables, including moving speed, the person's activity level, the maximum speed within the 10 minutes before and after, the distance difference over the 10 minutes before and after, and GPS measurement parameters.

Class variable: the person's activity state, one of indoors, outdoors, working outdoors, or in a car.

Data volume: 120,000 rows of vector data with known class variable. 90% of the data are used for learning; the class variable of the remaining 10% is inferred and compared with the true values to obtain the inference accuracy.

(2) Test environment

Using the data above, the parallel Bayesian classifier implemented here was tested on a cluster. The cluster has four nodes; each node has two quad-core Intel(R) Xeon(R) E5520 CPUs (eight cores in total), 16 GB of its own memory, and a 200 GB disk, and the nodes are connected by Gigabit Ethernet.

(3) Implementation steps

As shown in Figure 8, the implementation consists of data preprocessing (discretization), Bayesian network structure learning, Bayesian network parameter learning, and Bayesian network classification inference.

Discretization converts continuous variables into discrete (categorical) ones; discretized variables are more robust to noise. Bayesian network learning requires discretized data, whereas most sample attributes obtained in practice are continuous (for example, a person's activity level and the various GPS parameters), so the data must be discretized first. The algorithm used in the present invention is based on breakpoint importance.

Spatial DAG classification inference relies mainly on a Bayesian network. As shown in Figure 9, structure learning uses hill climbing to obtain a Bayesian network that is optimal both locally and globally; parameter learning uses EM (a maximum-likelihood algorithm) to obtain a Bayesian network with prior probabilities; inference uses Naive Bayes.

The operators can be invoked in the following three ways:

(1) In the model editor each common operator is an object, and the relationships among inputs, outputs, parameters, and objects are expressed by directed connecting lines.

(2) Log in to http://159.226.110.219/ and register; then choose Add Content -> Data Preprocessing_Discretization in the navigation bar and enter the parameters required for the calculation (if the data are too large, upload them by ftp and choose File attach at the input file field). Carry out the remaining operations in the same way, following the steps in Figure 7, to obtain the final result.

(3) Enter the commands directly over ssh:

mpirun -np 4 GeoPreprocessing -a 2 -co 0 -k 0 -p gps_people_s.shp -c 0-17 -o re_dis.csv

mpirun -np 4 ParBayes -a StrLeaning_HC -p re_dis.shp -c 0-17 -o re.str

mpirun -np 4 ParBayes -a ParLeaning_EM -p re_dis.shp -s re.str -c 0-17 -o re.bys

mpirun -np 4 ParBayes -a Inferring_Naive -p data_infer.shp -b re.bys -c 0-17

(4) Results

Parallel efficiency is a measure of how well a parallel algorithm speeds up. The parallel efficiency on n cores is P_n = (t_1 / t_n) / n, where t_n is the time taken by n cores to execute the algorithm and t_1 is the single-core execution time.
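As a worked example, taking parallel efficiency as speedup (single-core time divided by n-core time) divided by the number of cores, the calculation is a one-liner. The timings below are invented for illustration and are not measurements from the patent.

```python
def parallel_efficiency(t1, tn, n):
    """Speedup t1 / tn divided by the number of cores n."""
    return (t1 / tn) / n

# Hypothetical timings: 100 s on one core, 4 s on 32 cores.
efficiency = parallel_efficiency(t1=100.0, tn=4.0, n=32)
print(efficiency)
```

An efficiency of 1.0 would mean perfect linear speedup; communication and serial sections push the value below 1.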

The Bayesian classifier has an average parallel efficiency of 0.84; the parallel efficiency of each step is shown in Figure 8. This greatly improves the efficiency of classification inference on massive spatial data. Under ten-fold cross-validation of the final classification results, the inference accuracy (pd value) is 0.85. Curves A, B, C, and D in Figure 10 show the parallel efficiency of the four common operators: discretization, hill climbing, EM, and naive Bayes, respectively. Because of inter-process communication and the non-parallelizable parts of the algorithms, parallel efficiency falls as the number of processes grows. The slope of each curve, that is, how fast efficiency falls, is determined by three factors: the number of communications, the communication volume, and the fraction of the algorithm that can be parallelized. Overall, the parallel Bayesian classifier achieves high parallel efficiency and markedly speeds up classification and prediction on massive spatial data, which also shows that the present invention is effective in improving the processing efficiency of massive spatial data.

For parallel discretization, the parallel efficiency still reaches 0.8 on 32 cores (curve A in Figure 10). The main reason is that most of the work in discretization, namely candidate breakpoint selection, breakpoint importance calculation, and breakpoint filtering, can be done locally, so the communication overhead is very small.

Bayesian network structure learning with the BIC score and hill-climbing optimization has an average parallel efficiency of 0.77 (curve B in Figure 10). Parallel efficiency falls as the number of iterations grows, because each iteration randomly generates a DAG, a step that cannot be parallelized, and because every evaluation of the local operators requires communication to obtain the highest-scoring local operator and to broadcast its score, operation type, and participating nodes.

Bayesian network parameter learning uses the EM algorithm, with an average parallel efficiency of 0.80 (curve C in Figure 10). In the E step, the expected sufficient statistics of each sample are computed independently of one another, which gives very good data parallelism. The M step can also be parallelized, with a small amount of communication, by decomposing the likelihood function. However, the M step can use at most as many processes as there are variables, so its degree of parallelism is low when there are few variables. Since the M step accounts for very little of the total computing time, the overall parallel efficiency remains relatively high.

Classification inference uses naive Bayes, with an average parallel efficiency of 0.82 (curve D in Figure 10). The classification of each sample is independent of the others and depends only on the known Bayesian network with its conditional probabilities. Once the data are split into blocks, no communication is needed at all, and each process writes its results to disk or to the database on its own, so the parallel efficiency is very good.

Parts of the present invention not described in detail belong to the well-known art in this field.

The above are only some specific embodiments of the present invention, but the scope of protection of the present invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for improving the processing efficiency of massive spatial data, characterized by comprising the following steps:

(1) common operator extraction

(1.1) according to differences in input and output, implementation approach, function, and usage, spatial data processing methods are divided into four parts: preprocessing, spatial feature exploration, spatial information calculation, and result inference; each part comprises multiple spatial processing models, and a single model can complete one full spatial data processing function;

(1.2) the spatial processing models contained in each part of (1.1) are studied; following the principles of functional completeness and indivisibility, each spatial processing model is divided into multiple independent modules, each module serving as a common operator whose result is either the input data or input condition of other common operators later in the workflow, or directly the final result;

(1.3) the extracted set of common operators is screened to remove duplicates, yielding the set of common operators that need parallel acceleration;

at this point the common operators in all spatial data processing methods have been extracted, and the common operators must then be parallelized to achieve acceleration;

(2) common operator parallel strategy design

(2.1) each common operator obtained in step (1.3) is divided into finer computing units; a single computing unit performs only one simplest complete calculation, such as computing an expectation or a logarithm; the computing units execute serially in sequence, and each computing unit is parallel internally;

(2.2) the type of each computing unit is determined one by one and a data partitioning and distribution strategy is formulated: if all computing units are local (Local) or neighborhood (Focal) calculations, raster data are partitioned by rows, while vector data must take spatial topological relationships into account and are partitioned on the principle of keeping the data of a single node complete; if a global (Global) calculation is involved, every node needs all the data for its computation, so the data are not partitioned but sent to all nodes using a broadcast strategy; the basic unit in broadcasting is the process, where a process is one computing and communication unit, generally one CPU core; after obtaining the data, each process joins the broadcasters and sends the data to the remaining processes of its own node and to all processes of the other nodes;

(2.3) after the data partitioning strategy is designed, the parallel strategy of the computing units is designed; computing units are divided into global parameter calculations and per-sample loop calculations; first, the parallel strategy for global parameter calculation is designed, the available strategies being domain decomposition and functional decomposition; because a global parameter is expressed as a mathematical formula, the formula is decomposed and the spatial data to be processed are distributed to multiple processes;

(2.4) the parallel strategy for per-sample loop calculations is designed: because each calculation depends only on the individual sample value and the global parameters and is independent of other samples, a data-parallel strategy can be adopted, distributing the samples evenly among the processes;

at this point the parallel strategies of all common operators have been designed, and the common operators are implemented according to these strategies using the C++ programming language and a parallel interface;

(3) public operator Parallel Implementation

(3.1) according to the paralleling tactic of the deblocking distribution policy mentioned in step (2) and computing unit, based on the parallel storehouse of MPI and Effect-based operation passing interface, design four kinds of parallel primitives, comprise distribution Map, stipulations Reduce, broadcast Broadcast, crossing operation Multiplex, thus realize expansion to MPI function library, improve the transfer efficiency of public operator especially massive spatial data under large data qualification;

(3.2) Using the four parallel primitives from step (3.1) together with the MPI functions, write code in the high-level language C++ to parallelize the public operators, obtaining a set of efficiently running parallel public operators;

(3.3) Test the parallel efficiency of each public operator implemented in step (3.2) on a single node and on a multi-node cluster, collect statistics on input/output (IO) and communication cost, and keep improving the implementation until parallel public operators whose performance meets the requirements are obtained;

At this point, all public operators have been implemented in parallel, and each public operator can be compiled into a standalone executable file that runs efficiently on a high-performance computing platform;
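The efficiency test in step (3.3) does not name specific metrics. As one possible sketch, the conventional speedup and parallel efficiency can be computed from measured run times, together with the share of the parallel run spent on IO and communication. The function name, parameter names and the choice of these three metrics are assumptions, not taken from the claim.

```python
def parallel_metrics(t_serial, t_parallel, n_procs, t_io=0.0, t_comm=0.0):
    """Summarize one timing run of a parallel operator (all times in seconds)."""
    speedup = t_serial / t_parallel          # how many times faster than serial
    efficiency = speedup / n_procs           # 1.0 would be ideal linear scaling
    overhead = (t_io + t_comm) / t_parallel  # share of run time not spent computing
    return {
        "speedup": speedup,
        "efficiency": efficiency,
        "overhead_fraction": overhead,
    }
```

A run that takes 120 s serially and 20 s on 8 processes, with 3 s of IO and 2 s of communication, gives a speedup of 6, an efficiency of 0.75 and an overhead fraction of 0.25.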

(4) Public operator invocation

(4.1) Deploy the executable files of the public operators obtained in step (3.3) on the high-performance computing cluster and write a daemon. The daemon starts with the system and runs as a background service on the cluster, where it performs parameter parsing, task execution, and result feedback;

(4.2) After the daemon has started, the user submits the parameters required by the public operator through a web page in the client browser, and the Web server writes the parameters into the database;

(4.3) The daemon reads the public operator's calculation parameters from the database and interprets them into a hash table containing multiple key-value pairs, in which the Key is the parameter name and the Value is the parameter value. Concatenating all key-value pairs in the hash table yields the instruction for the spatial data processing task to be carried out;

(4.4) The daemon runs the task instruction obtained in step (4.3), writing the run output and logs to the database while writing the computation results to disk;

(4.5) The Web server retrieves the output information, logs, and computation results from the database and disk, organizes them into a web page, and returns the page to the user. Once the user has obtained the operation results and output information, the public operator invocation process ends;

For simple spatial data processing, that is, when only a single public operator is used, the whole process ends here: the user submits the public operator parameters through the web page and obtains the operation results, output information, and logs;
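The key-value splicing in step (4.3) can be sketched as follows. The claim does not specify the instruction format, so the `--name value` convention, the function name, and the operator name `kriging` used in the usage note are all hypothetical.

```python
import shlex

def build_instruction(executable, params):
    """Join an operator executable and its parameter hash table into one command line.

    Each Key becomes a --name flag followed by its Value; shlex.join quotes
    values that contain spaces so the line can be split back unambiguously.
    """
    parts = [executable]
    for key, value in params.items():
        parts += [f"--{key}", str(value)]
    return shlex.join(parts)
```

Calling `build_instruction("kriging", {"input": "data/soil samples.shp", "np": 8})` produces a single command string in which the path containing a space is quoted; `shlex.split` recovers the original argument list from it.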

(5) Public operator combination

If complex spatial data processing is required, or the spatial data processing requirements of a specific field must be met, skip step (4) and perform step (5) directly;

(5.1) Deploy the executable files of the public operators obtained in step (3) on the high-performance computing cluster and write the daemon;

(5.2) Study the logical structure of the complex spatial data processing or field-specific spatial data processing to be carried out, and obtain the required public operators and the logical relationships among them, including the execution order of the public operators, their dependencies, and the relationships between their inputs and outputs;

(5.3) According to the logical relationships obtained in step (5.2), combine the public operators with directed connecting lines in the visual complex model editor to obtain a visual model;

(5.4) The complex model editor converts the resulting visual model into an ordered instruction set and submits the instruction set to the database;

(5.5) The daemon reads the instruction set from the database and interprets it, runs the instructions step by step once the dependencies have been determined, and writes the logs to the database;

(5.6) After all instructions in step (5.5) have been run in sequence, the daemon writes the spatial processing results to disk and returns them to the user through the Web server. If a run fails, the daemon rolls back according to the logs and returns the error information to the user. At this point, the method for improving the processing of massive spatial data has been constructed.
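Determining a run order from the dependencies in steps (5.2) to (5.5) amounts to ordering a directed graph of operators. A minimal sketch, assuming the visual model has already been reduced to a mapping from each operator to the operators it depends on, uses a topological sort from the Python standard library. The function name and the operator names in the usage note are hypothetical, and the claim does not state which ordering algorithm the editor or daemon actually uses.

```python
from graphlib import TopologicalSorter

def order_instructions(dependencies):
    """Return the operators in an order that respects every dependency.

    dependencies maps each operator name to the set of operators whose
    output it needs. A cyclic model raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

With `{"clean": set(), "buffer": {"clean"}, "interpolate": {"clean"}, "overlay": {"interpolate", "buffer"}}`, the returned list starts with `clean` and ends with `overlay`; the relative order of `buffer` and `interpolate` is not fixed because neither depends on the other.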

2. The method for improving the processing efficiency of massive spatial data according to claim 1, characterized in that: the domain decomposition parallel strategy in step (2.3) is implemented by decomposing the partial differential equations over non-overlapping regions, so that the discretized equation is reduced to several independent simple equations plus one global problem associated with each simple equation; the function decomposition parallel strategy is implemented as follows: when a system of linear equations is solved by Newton iteration, the evaluation of the function values and the evaluation of the derivative values are two independent processes that can be assigned to different computers.
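The function decomposition in claim 2 can be illustrated with a small sketch in which, at each Newton step, the function value and the derivative value are evaluated as two independent tasks. Several simplifications are assumed here: two threads stand in for the different computers, a scalar equation stands in for a system of equations, and the function name and tolerance are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def newton_function_decomposition(f, df, x0, tol=1e-12, max_iter=50):
    """Newton iteration in which f(x) and f'(x) are evaluated as separate tasks."""
    x = x0
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(max_iter):
            # The two evaluations do not depend on each other, so they are
            # submitted together and collected before the update.
            fx_future = pool.submit(f, x)
            dfx_future = pool.submit(df, x)
            step = fx_future.result() / dfx_future.result()
            x -= step
            if abs(step) < tol:
                break
    return x
```

For instance, `newton_function_decomposition(lambda x: x * x - 2, lambda x: 2 * x, 1.0)` converges to the square root of 2.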