
CN115952385B - Parallel supernode sorting method and system for solving large-scale sparse equations


Info

Publication number
CN115952385B
Authority
CN
China
Prior art keywords
matrix
data
block
blocks
row
Prior art date
Legal status
Active
Application number
CN202310224172.6A
Other languages
Chinese (zh)
Other versions
CN115952385A (en)
Inventor
张赞军
王英龙
杨美红
潘景山
田敏
刘弢
杜伟
Current Assignee
National Supercomputing Center in Jinan
Original Assignee
National Supercomputing Center in Jinan
Priority date
Filing date
Publication date
Application filed by National Supercomputing Center in Jinan
Priority to CN202310224172.6A
Publication of CN115952385A
Application granted
Publication of CN115952385B

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a parallel supernode ordering method and system for solving large-scale sparse equation systems, in the technical field of high-performance computing. For the supernode block matrix generated during LU decomposition of a sparse matrix, the matrix data are mapped cyclically, by block-matrix row and column, onto a two-dimensional process grid, and the upper-triangular data of the block matrix are mapped by transposition onto the processes that handle the lower-triangular data. A dynamic resource-allocation strategy then assigns memory to each process in the grid according to the number of row matrix blocks actually mapped to it. This saves a large amount of memory, improves memory scalability and the scale scalability of the sparse solve, and solves the problem that conventional ordering methods cannot handle large-scale sparse linear systems.

Description

Parallel supernode sorting method and system for solving large-scale sparse equations

Technical Field

The present invention relates to the field of high-performance computing, and in particular to a parallel supernode sorting method and system for solving large-scale sparse systems of equations.

Background Art

With the rapid development of supercomputers, simulation computing in scientific fields such as electromagnetics, petroleum, and oceanography has advanced enormously, and solving large-scale sparse linear systems through high-performance computing (HPC) is becoming ever more important in areas such as electromagnetic simulation, oil and gas exploration, and earth climate modeling. Typically, a simulation first generates a system of boundary partial differential equations from the specific physical model, then discretizes the solution space by domain partitioning into multiple subspaces. Through interpolation and similar methods, each subspace yields a linear system $A_i x_i = b_i$, where $A_i$ is the coefficient matrix of the $i$-th subspace and $x_i$ and $b_i$ are its vectors, $x_i$ being the solution sought. Finally, the systems of all subspaces are assembled into one global linear system $Ax = b$ to be solved. The coefficient matrix $A$ of this system is sparse, that is, a large proportion of its elements are zero; $A$ is therefore called a sparse matrix, and the system is commonly called a sparse linear system. If the order of the sparse matrix exceeds one million, ten million, or even hundreds of millions, the system is called a super-large-scale sparse linear system, or simply a large-scale sparse system.

At present, computer-based solvers for sparse linear systems generally use either direct methods or iterative methods. A direct solver takes the sparse matrix $A$ as the data-processing object of the processor's computation and, based on Gaussian elimination, applies LU decomposition (a form of matrix factorization), generating supernode block matrices during the solve. It computes $A = LU$, substitutes this into $Ax = b$, sets $y = Ux$, solves the triangular system $Ly = b$, and finally solves $Ux = y$ to obtain the solution $x$ of the system.

Many software packages and programs exist for solving sparse linear systems, but they suit small and medium-scale problems. When solving super-large-scale sparse linear systems, they run into the "memory wall" and "communication wall" problems, namely:

(1) The "memory wall": because the system of equations is so large, the program needs a large amount of memory to store the data used in the computation. When the memory requirement exceeds the maximum memory of the computing environment, the program crashes and the solve cannot complete.

(2) The "communication wall": current sparse solvers are distributed programs that divide the solution task among a number of subprograms, interconnected over a network and running on different compute nodes, which coordinate the solve by communicating both control information and matrix data. As the sparse matrix grows, so does the communication load between subprograms; as with the memory wall, exceeding the available communication bandwidth inevitably crashes the program, and the solve fails.

Therefore, when the traditional supernode ordering scheme is used to allocate program memory while solving a large-scale sparse linear system, the memory-wall and communication-wall problems above often cause runtime errors at this stage, and the solve of the large-scale sparse linear system fails.

Summary of the Invention

To remedy the deficiencies of the prior art described above, the present invention provides a parallel supernode sorting method and system for solving large-scale sparse equation systems. For the supernode block matrix generated during LU decomposition of the sparse matrix, the matrix data are mapped cyclically, by block-matrix row and column, onto a two-dimensional process grid, and the upper-triangular data of the block matrix are mapped by transposition onto the processes handling the lower-triangular data. A dynamic resource-allocation strategy assigns memory to each process in the grid according to the number of row matrix blocks actually mapped to it, saving a large amount of memory and improving memory scalability and the scale scalability of the sparse solve.

In a first aspect, the present disclosure provides a parallel supernode sorting method for solving large-scale sparse equation systems.

A parallel supernode sorting method for solving large-scale sparse equation systems comprises the following steps:

according to the large-scale sparse linear system to be solved, obtain a sparse matrix, the sparse matrix serving as the data-processing object of the processor's computation;

during the processor's computation, construct a two-dimensional process grid, and map each non-zero matrix block of the sparse matrix's supernode block matrix to the corresponding process of the grid by cyclic mapping over block-matrix rows and columns;

for each process, using MPI collective communication, send the data of the U matrix blocks that the process is responsible for to the processes responsible for the corresponding matrix columns;

after each process receives its data, match and compare the received data against the data of the L matrix blocks it is responsible for, determine the minimum matching data block number, count the data blocks smaller than that number and their quantity, derive the dependency relationships from those counts, and sort the data blocks by dependency;

allocate memory to the processes that own the data blocks according to the sorted order.

In a second aspect, the present disclosure provides a parallel supernode sorting system for solving large-scale sparse equation systems.

A parallel supernode sorting system for solving large-scale sparse equation systems comprises:

a sparse matrix acquisition module, configured to obtain a sparse matrix from the large-scale sparse linear system to be solved, the sparse matrix serving as the data-processing object of the processor's computation;

a node matrix mapping module, configured to construct a two-dimensional process grid during the processor's computation and map each non-zero matrix block of the sparse matrix's supernode block matrix to the corresponding process of the grid by cyclic mapping over block-matrix rows and columns;

a process data acquisition module, configured to send, for each process and via MPI collective communication, the data of the U matrix blocks that the process is responsible for to the processes responsible for the corresponding matrix columns;

a sorting module, configured to match and compare, after each process receives its data, the received data against the L matrix data the process is responsible for, determine the minimum matching data block number, count the data blocks smaller than that number and their quantity, derive the dependency relationships from those counts, and sort the data blocks by dependency;

a memory allocation module, configured to allocate memory to the processes that own the data blocks according to the sorted order.

One or more of the technical solutions above have the following beneficial effects:

1. The invention provides a parallel supernode ordering method and system for solving large-scale sparse equation systems. For the supernode block matrix generated during LU decomposition of the sparse matrix, the matrix data are mapped cyclically by block-matrix row and column onto a two-dimensional process grid, the upper-triangular data of the block matrix being mapped by transposition onto the processes handling the lower-triangular data; a dynamic resource-allocation strategy assigns memory to each process in the grid according to the number of row matrix blocks actually mapped to it, saving a large amount of memory and improving both memory scalability and the scale scalability of the sparse solve.

2. For a sparse matrix, the matrix blocks are spread across the processes, so no single process handles every row of the U matrix; moreover, because the matrix is sparse, the number of blocks per row assigned to any process is smaller still. The invention therefore allocates memory dynamically: following the actual U matrix data, a memory block is allocated to a process only when a data block maps to it. This saves a great deal of memory; combined with the distributed MPI programming approach, it improves solve performance, raises the concurrency and efficiency of the computation, and reduces wasted time.

Brief Description of the Drawings

The accompanying drawings, which form part of the present invention, provide further understanding of it; the exemplary embodiments and their descriptions explain the invention and do not unduly limit it.

FIG. 1 is a flow chart of the parallel supernode sorting method of Embodiment 1;

FIG. 2 is a schematic diagram of the structure of the Pr*Pc process grid;

FIG. 3 is a schematic diagram of the structure of the supernode block matrix of a sparse matrix;

FIG. 4 is a schematic diagram of mapping a 5*5 block matrix onto a 2*3 process grid;

FIG. 5 is a schematic diagram of the data sent to processes in Embodiment 1;

FIG. 6 is a flow chart of determining the data each process needs to send in Embodiment 1;

FIG. 7 is a flow chart of sending data and sorting in Embodiment 1.

Detailed Description

It should be noted that the following detailed descriptions are exemplary and intended to provide further explanation of the invention. Unless otherwise specified, all technical and scientific terms used herein have the meanings commonly understood by those of ordinary skill in the art to which the invention belongs.

Note also that the terminology used here serves only to describe specific embodiments and is not intended to limit the exemplary embodiments of the invention. Unless the context clearly indicates otherwise, singular forms are intended to include the plural; it should further be understood that the terms "comprises" and/or "comprising", when used in this specification, indicate the presence of features, steps, operations, devices, components and/or combinations thereof.

Embodiment 1

This embodiment provides a parallel supernode sorting method for solving large-scale sparse equation systems, as shown in FIG. 1, comprising the following steps:

Step S1: according to the large-scale sparse linear system to be solved, obtain a sparse matrix, the sparse matrix serving as the data-processing object of the processor's computation;

Step S2: during the processor's computation, construct a two-dimensional process grid, and map each non-zero matrix block of the sparse matrix's supernode block matrix to the corresponding process of the grid by cyclic mapping over block-matrix rows and columns;

Step S3: for each process, using MPI collective communication, send the data of the U matrix blocks that the process is responsible for to the processes responsible for the corresponding matrix columns;

Step S4: after each process receives its data, match and compare it against the L matrix data the process is responsible for, determine the minimum matching data block number, count the data blocks smaller than that number and their quantity, derive the dependency relationships from those counts, and sort the data blocks by dependency;

Step S5: allocate memory to the processes that own the data blocks according to the sorted order, then proceed with the subsequent data processing and computation. These steps complete the supernode sorting of the sparse matrix.

In this embodiment, first, in step S1, a sparse matrix is formed from the coefficients of the large-scale sparse linear system to be solved, and this sparse matrix serves as the data-processing object of the processor's computation.

In step S2, during the processor's computation, a two-dimensional process grid is constructed, and each non-zero matrix block of the sparse matrix's supernode block matrix is mapped to the processes of the grid by cyclic mapping over block-matrix rows and columns.

The parallel solver for sparse linear systems solves in a distributed fashion over a process grid, shown in FIG. 2. When a job containing numProc processes starts, each process is assigned a unique ID in the range [0, numProc-1]. To keep computation and communication load-balanced during the solve, a two-dimensional process grid with Pr*Pc = numProc is constructed, where Pr and Pc are its numbers of rows and columns. Process IDs are laid out left to right in increasing order starting from 0; the process at the end of the first row has ID Pc-1, the next process (ID Pc) is placed at the left end of the next row, and so on, building a grid of Pr rows and Pc columns.
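By way of illustration, the row-major layout just described can be computed directly from an MPI rank. The following is a minimal C sketch, not code from the patent; the grid shape Pr = 2, Pc = 3 is assumed for the example:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, numProc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numProc);

    int Pr = 2, Pc = 3;              /* assumed grid shape, Pr*Pc == numProc */
    if (numProc != Pr * Pc && rank == 0)
        fprintf(stderr, "grid shape does not match numProc\n");

    int myrow = rank / Pc;           /* this rank's row in the grid    */
    int mycol = rank % Pc;           /* this rank's column in the grid */
    printf("rank %d -> grid position (%d, %d)\n", rank, myrow, mycol);

    MPI_Finalize();
    return 0;
}
```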

The supernode block matrix of the sparse matrix is an N*N block matrix, shown in FIG. 3. The diagonal blocks are square matrix blocks; an off-diagonal block falls into one of two cases: first, all its elements are 0, in which case it is called a zero block; second, some of its rows (for blocks below the diagonal) or some of its columns (for blocks above the diagonal) contain non-zero elements while the remaining rows or columns are all zero, in which case it is called a non-zero block.

Each non-zero block of the supernode block matrix must be mapped to the corresponding process of the grid. This embodiment maps the non-zero blocks onto the process grid cyclically over block-matrix rows and columns. Under this scheme, the mapping from a matrix block to a process ID is determined by the following formulas:

For an N*N block matrix, given the block $A_{I,J}$ at position (I, J), with 0 ≤ I ≤ N-1 and 0 ≤ J ≤ N-1, and a Pr*Pc process grid:

nrb = N/Pr, the number of block-matrix rows assigned to each process;

ncb = N/Pc, the number of block-matrix columns assigned to each process;

pr = I mod Pr, the grid row of the process that owns the block;

pc = J mod Pc, the grid column of the process that owns the block;

p = pr*Pc + pc, the ID of the process that owns the block.

Take a 2*3 process grid (Pr*Pc = 2*3) as an example: it contains numProc = 6 processes, with IDs 0, 1, 2, 3, 4, and 5. As shown in FIG. 4, row-and-column cyclic mapping places the 25 blocks $A_{I,J}$ (0 ≤ I ≤ N-1, 0 ≤ J ≤ N-1, N = 5) of a 5*5 block matrix onto the 2*3 grid. Taking block $A_{3,4}$ as an example:

pr, the grid row of the owning process: pr = I mod Pr = 3 mod 2 = 1, so block $A_{3,4}$ maps to grid row 1;

pc, the grid column of the owning process: pc = J mod Pc = 4 mod 3 = 1, so block $A_{3,4}$ maps to grid column 1;

p, the process ID of the owning process: p = pr*Pc + pc = 1*3 + 1 = 4, so block $A_{3,4}$ belongs to process 4.

Given the structure of the blocks and the block-to-grid mapping above, some of the blocks mapped to a process from a given row of the block matrix may be zero blocks; to save resources, zero blocks are not retained and only non-zero blocks are kept. In the coefficient-matrix solve, the region above the diagonal blocks is collectively called the U matrix and the region below it the L matrix, so the matrix consists of three regions: the diagonal blocks, the U matrix, and the L matrix.

In step S3, to sort the supernode matrix of the sparse matrix, each process needs information about the U matrix blocks corresponding to the L matrix blocks it is responsible for; since another process may hold that information, it must be obtained through MPI collective communication. MPI (message-passing interface) is a language-independent communication protocol. MPI collective communication involves all processes of a communicator and typically provides communication, aggregation, and synchronization. Commonly used collective functions include MPI_Reduce, MPI_Allreduce, MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Scan, and MPI_Reduce_scatter; different collective functions realize different communication patterns. This embodiment obtains the required information via MPI collective communication, specifically through the following steps:

For each process, traverse each row of blocks in the U matrix the process handles and determine that row's global row index in the global matrix; from the global row index, determine the grid column index of the block-matrix column whose index equals that global row index; meanwhile count the non-zero blocks of the U matrix the process handles;

obtain the global column index, in the global matrix, of each block in each U-matrix row the process handles, and determine the global row index of the corresponding L matrix, i.e. the grid row index of the block whose global row index equals that global column index;

from the grid row and column indices so determined, identify the process p that handles the corresponding L matrix block, accumulate the number of data blocks p must receive, record the number of data blocks per matrix row destined for p, and record the global block-matrix column index of every block sent to p;

in the job's communication domain, run the MPI collective function MPI_Alltoall on the data gathered and recorded above to exchange it between processes; each process thereby learns how many blocks it must obtain from each other process, how many blocks each other process sends it, and how many blocks per row each other process sends.

Specifically, the data to be sent must first be determined, as shown in FIG. 6, through the following steps:

Step S3.1: for each process, determine the local matrix row index ind_lr of each row of blocks in the U matrix that this process (the current process) handles; ind_lr denotes the process's local matrix row index.

Step S3.2: determine the global row index ind_r of the block row to be processed via ind_r = ind_lr*Pr + myrow, where myrow is the row of the process grid this process occupies and mycol is its column.

Step S3.3: from the global row index, determine via pc = ind_r % Pc the grid column index pc of the block-matrix column whose index equals ind_r.

Step S3.4: build this process's pointer to the array of that row's data blocks.

Step S3.5: test whether the pointer is null. If it is null, the blocks of the row with global index ind_r handled by this process are all zero blocks; in that case return to step S3.1 and process the next row this process handles. Otherwise non-zero blocks exist, and the variable count tallies the non-zero U-matrix blocks this process handles, i.e. count += the number of blocks in the row with global row index ind_r.

Step S3.6: through steps S3.1 to S3.5 above, traverse all blocks of the row corresponding to the global row index ind_r.

Step S3.7: obtain the global column index j, in the global matrix, of each block of the row with global row index ind_r.

Step S3.8: via pr = j % Pr, determine the L-matrix grid row pr corresponding to this U-matrix column-index block, i.e. the grid row index of the block whose row index equals its column index.

Step S3.9: from row index pr and column index pc, via p = pr*Pc + pc, identify the process p that handles the corresponding L matrix block. Define the array sendcnt[p], shown in FIG. 5(a), to accumulate the number of data blocks process p must receive; define the pointer LineSendCnt, pointing at the data structure shown in FIG. 5(b), to hold the number of data blocks of the matrix rows destined for p; finally, define the pointer blocks_s to hold the global block-matrix column index of every block sent to p.
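A hedged C sketch of steps S3.1 through S3.9 follows. It is not the patent's code: the local U storage is abstracted behind two assumed arrays, row_ptr (per-row lists of global block-column indices, NULL marking an all-zero row as in step S3.5) and row_len (their lengths).

```c
#include <stdlib.h>

int *count_sends(int local_rows, int **row_ptr, const int *row_len,
                 int Pr, int Pc, int myrow, int numProc, int *count)
{
    int *sendcnt = calloc(numProc, sizeof(int));
    for (int ind_lr = 0; ind_lr < local_rows; ind_lr++) {
        if (row_ptr[ind_lr] == NULL)      /* S3.5: all-zero row, skip   */
            continue;
        int ind_r = ind_lr * Pr + myrow;  /* S3.2: global block row     */
        int pc = ind_r % Pc;              /* S3.3: grid column of the   */
                                          /* same-numbered block column */
        *count += row_len[ind_lr];        /* S3.5: non-zero block tally */
        for (int k = 0; k < row_len[ind_lr]; k++) {
            int j  = row_ptr[ind_lr][k];  /* S3.7: global column index  */
            int pr = j % Pr;              /* S3.8: L-block grid row     */
            int p  = pr * Pc + pc;        /* S3.9: owner of the L block */
            sendcnt[p]++;                 /* blocks p must receive      */
        }
    }
    return sendcnt;
}
```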

The above steps traverse each process's U matrix blocks. The data each process has determined are then sent to the corresponding processes, as shown in FIG. 7, specifically:

Step S3.10: via the MPI collective function MPI_Alltoall, send within the job's communication domain the sendcnt values each process must receive from the others; via the MPI_Alltoallv communication function, send the blocks_s data to the processes, each storing it in blocks_r, so every process obtains the block index numbers of the data blocks it receives from the others; via MPI_Alltoall, send the LineSendCnt data, each process saving it in lineRecvCnt, so every process knows which row each piece of data it receives corresponds to.
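A sketch of the step S3.10 exchange is given below, reusing the buffer names above (blocks_s, sendcnt); it is illustrative, not the patent's code. MPI_Alltoall first exchanges the per-peer counts, then MPI_Alltoallv ships the variable-length lists of global block indices.

```c
#include <mpi.h>
#include <stdlib.h>

int *exchange_blocks(const int *blocks_s, int *sendcnt, int numProc,
                     MPI_Comm comm, int **recvcnt_out)
{
    int *recvcnt = malloc(numProc * sizeof(int));
    MPI_Alltoall(sendcnt, 1, MPI_INT, recvcnt, 1, MPI_INT, comm);

    /* displacements derived from the counts */
    int *sdispls = malloc(numProc * sizeof(int));
    int *rdispls = malloc(numProc * sizeof(int));
    sdispls[0] = rdispls[0] = 0;
    for (int p = 1; p < numProc; p++) {
        sdispls[p] = sdispls[p - 1] + sendcnt[p - 1];
        rdispls[p] = rdispls[p - 1] + recvcnt[p - 1];
    }
    int total = rdispls[numProc - 1] + recvcnt[numProc - 1];

    int *blocks_r = malloc(total * sizeof(int));  /* received block indices */
    MPI_Alltoallv(blocks_s, sendcnt, sdispls, MPI_INT,
                  blocks_r, recvcnt, rdispls, MPI_INT, comm);

    free(sdispls); free(rdispls);
    *recvcnt_out = recvcnt;
    return blocks_r;
}
```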

In step S4, once each process has the data it needs, it matches and compares them against the L matrix data it handles, determines the minimum matching data block number, counts the blocks smaller than that number and their quantity, derives the dependency relationships from those counts, and sorts the blocks by dependency, specifically:

Step S4.1: having obtained the data above, each process matches and compares them against its local L matrix data. Blocks for which the L-matrix column number equals the U-matrix row number are matched in increasing order; as soon as an L-matrix block's row index equals a U-matrix block's column index, matching ends and that index is saved in the array match[id]. Since the matrix is square, id (parameter I in FIG. 7) denotes the matrix row or column index.

Step S4.2: call the MPI collective function MPI_Allreduce to perform a minimum reduction over the match array, so every process obtains, for each L-matrix column of the global matrix and the corresponding U-matrix row, the minimum matching data block number at which the block id numbers agree.
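The minimum reduction of step S4.2 maps directly onto MPI_Allreduce with the MPI_MIN operation. A sketch, assuming match holds one entry per block column and N_col (an illustrative name) is the number of block columns:

```c
#include <mpi.h>
#include <stdlib.h>

/* Returns the element-wise global minimum of match over all ranks. */
int *reduce_match_min(const int *match, int N_col)
{
    int *match_min = malloc(N_col * sizeof(int));
    MPI_Allreduce(match, match_min, N_col, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
    return match_min;
}
```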

Step S4.3: each process traverses its L matrix blocks, the received blocks_r, and the match data, counting, for each column of the matrix and the corresponding row, the blocks whose number is smaller than the minimum matching data block number. In this step, mindata[i] is defined to hold the blocks satisfying the condition, and minCount[i] to hold the total number of such blocks in column i.

Step S4.4: via the MPI collective functions MPI_Allgather and MPI_Allgatherv, collect from all processes the per-row data smaller than the minimum matching data block number and merge them into the global set of blocks below the minimum matching data block number. After this step, mindata holds the block numbers of all blocks in each matrix column, and minCount holds the per-column block totals.
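Step S4.4 is the standard count-then-gather pattern. A sketch, assuming my_count and my_mindata hold this process's local count and block numbers (names chosen for illustration):

```c
#include <mpi.h>
#include <stdlib.h>

/* Gather every process's qualifying block numbers into one global array. */
int *gather_mindata(const int *my_mindata, int my_count,
                    int numProc, int *total_out)
{
    int *all_counts = malloc(numProc * sizeof(int));
    MPI_Allgather(&my_count, 1, MPI_INT,
                  all_counts, 1, MPI_INT, MPI_COMM_WORLD);

    int *displs = malloc(numProc * sizeof(int));
    displs[0] = 0;
    for (int p = 1; p < numProc; p++)
        displs[p] = displs[p - 1] + all_counts[p - 1];
    int total = displs[numProc - 1] + all_counts[numProc - 1];

    int *mindata = malloc(total * sizeof(int));
    MPI_Allgatherv(my_mindata, my_count, MPI_INT,
                   mindata, all_counts, displs, MPI_INT, MPI_COMM_WORLD);

    free(all_counts); free(displs);
    *total_out = total;
    return mindata;
}
```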

Step S4.5: derive the dependency relationships from these counts, sort the blocks in mindata by dependency, and generate an ordering of the matrix column numbers, stored in the array perm_c_supno. The larger a column's block total, the deeper the dependency of its diagonal block on the other columns' blocks; columns with deeper dependencies are ordered later.
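A sketch of the ordering of step S4.5, assuming minCount[i] holds column i's block total as described above; columns with smaller totals (fewer dependencies) come first in perm_c_supno:

```c
#include <stdlib.h>

static const int *g_minCount;   /* column block totals, set before qsort */

static int by_dependency(const void *a, const void *b)
{
    int ca = g_minCount[*(const int *)a];
    int cb = g_minCount[*(const int *)b];
    return (ca > cb) - (ca < cb);   /* ascending: fewer dependencies first */
}

/* Fill perm_c_supno with column numbers ordered by dependency depth. */
void order_columns(int N, const int *minCount, int *perm_c_supno)
{
    for (int i = 0; i < N; i++)
        perm_c_supno[i] = i;
    g_minCount = minCount;
    qsort(perm_c_supno, N, sizeof(int), by_dependency);
}
```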

Because LU decomposition of the matrix factors a diagonal block first, then updates the L-matrix row blocks and U-matrix column blocks in the same column and row as that diagonal block, and finally updates the matrix blocks at the intersections of those row and column blocks, each diagonal block depends on its row and column blocks. If the row and column of a diagonal block contain no such blocks, it can be factored independently without affecting the factorization of other diagonal blocks; the blocks can therefore be ordered, and certain diagonal blocks can be factored together with the diagonal blocks independent of them without changing the result of the LU decomposition.

Finally, in step S5, memory is allocated to the processes owning the data blocks according to the sorted order, completing the supernode sorting of the sparse matrix.

In existing methods, every process of the job is uniformly allocated an array as large as the number of matrix rows (or columns) to hold the number of blocks per row sent to each process. With the limited computing power of the past and modest model sizes this method sufficed, but as science and technology advance, more and more problems require solving large or even super-large models, and memory frequently cannot meet the program's needs, so the run fails. In practical research, for instance when solving the sparse linear systems of super-large electromagnetic problems, this approach is often no longer adequate and the solve becomes impossible.

In fact, for a sparse matrix, the blocks are spread across the processes, so no process can handle every row of the U matrix; and because the matrix is sparse, the number of blocks per row assigned to a process is smaller still. This embodiment therefore allocates memory dynamically: following the actual U matrix data, a memory block is allocated to a process only when a data block maps to it. This saves a great deal of memory; combined with the distributed MPI programming approach, it ensures the solver has good memory scalability, improves its solve performance, raises the concurrency and efficiency of the computation, and reduces wasted time.
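A sketch of the dynamic-allocation idea, with an assumed record type (not the patent's data structures): memory is sized by the number of rows actually mapped to this process (rows_here) rather than by the full block-matrix order N.

```c
#include <stdlib.h>

/* Assumed record type for one locally-held block row. */
typedef struct {
    int  global_row;   /* global block-row index                */
    int  nblocks;      /* non-zero blocks of this row held here */
    int *block_cols;   /* their global block-column indices     */
} RowRec;

/* Allocate per-row records only for the rows_here rows that actually
 * map a non-zero block to this process, instead of N entries. */
RowRec *alloc_row_records(int rows_here)
{
    return malloc(rows_here * sizeof(RowRec));
}
```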

Embodiment 2

This embodiment proposes a parallel supernode sorting system for solving large-scale sparse equation systems, comprising:

a sparse matrix acquisition module, configured to obtain a sparse matrix from the large-scale sparse linear system to be solved, the sparse matrix serving as the data-processing object of the processor's computation;

a node matrix mapping module, configured to construct a two-dimensional process grid during the processor's computation and map each non-zero matrix block of the sparse matrix's supernode block matrix to the corresponding process of the grid by cyclic mapping over block-matrix rows and columns;

a process data acquisition module, configured to send, for each process and via MPI collective communication, the data of the U matrix blocks that the process is responsible for to the processes responsible for the corresponding matrix columns;

a sorting module, configured to match and compare, after each process receives its data, the received data against the L matrix data the process is responsible for, determine the minimum matching data block number, count the data blocks smaller than that number and their quantity, derive the dependency relationships from those counts, and sort the data blocks by dependency;

a memory allocation module, configured to allocate memory to the processes that own the data blocks according to the sorted order.

The steps of Embodiment 2 correspond to those of Embodiment 1; for details see the relevant description of Embodiment 1. Those skilled in the art will appreciate that the modules or steps above may be implemented on a general-purpose computing device; alternatively, they may be implemented in program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, or fabricated separately as individual integrated-circuit modules, or several of these modules or steps may be fabricated as a single integrated-circuit module. The invention is not limited to any particular combination of hardware and software.

The above are merely preferred embodiments of the invention and do not limit it; various changes and variations are possible for those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the invention shall fall within its scope of protection.

Although the specific embodiments of the invention have been described above with reference to the accompanying drawings, this does not limit the scope of protection; those skilled in the art should understand that the various modifications or variations they could make, without inventive effort, on the basis of the technical solution of the invention remain within its scope of protection.

Claims (10)

1. A parallel supernode ordering method for solving a large-scale sparse equation set is characterized by comprising the following steps:
acquiring a sparse matrix according to a large-scale sparse linear equation set to be solved, wherein the sparse matrix is used as a data processing object in the operation process of a processor;
in the operation process of a processor, a two-dimensional process grid is constructed, and each non-zero matrix block in a super node block matrix of a sparse matrix is mapped into a corresponding process of the process grid in a block matrix row and column cyclic mapping mode;
for each process, based on MPI collective communication, sending the data of the U matrix block that each process is responsible for to the process responsible for the corresponding matrix column of the U matrix block;
after each process obtains corresponding data, carrying out matching comparison according to the obtained data and the data of the responsible L matrix blocks, determining the minimum matching data block number, counting the data blocks smaller than the minimum matching data block number and the number thereof, determining a dependency relationship according to the number, and sequencing the data blocks based on the dependency relationship;
and distributing the memory for the process corresponding to the data blocks according to the ordered data blocks.
2. The parallel supernode ordering method for large-scale sparse equation set solution according to claim 1, wherein each non-zero matrix block $A_{I,J}$ in the supernode block matrix of the sparse matrix is mapped cyclically, by block matrix rows and columns, into a corresponding process of the Pr*Pc process grid;
the mapping mode is as follows:
pr = I mod Pr, representing the row of the process grid of the process corresponding to the matrix block;
pc = J mod Pc, representing the column of the process grid of the process corresponding to the matrix block;
p = pr*Pc + pc, representing the process ID number corresponding to the matrix block.
3. The parallel supernode ordering method for large-scale sparse system of equations solution according to claim 1, wherein for each process, based on MPI collective communication, sending the data of the U matrix block for which each process is responsible to the process responsible for the corresponding matrix column of the U matrix block includes:
traversing each row of matrix blocks in the U matrix which is responsible for processing by each process according to each process, determining a global row index of each row of matrix blocks in a global matrix, determining column indexes of process grids corresponding to columns of the block matrix with the same value as the global row index according to the global row index, and simultaneously counting the number of non-zero matrix blocks in the U matrix which is responsible for processing by each process;
acquiring a global column index of each row of matrix blocks in a global matrix in a U matrix which is responsible for processing by each process, and determining a global row index of an L matrix corresponding to the global column index, namely determining a row index of a process grid corresponding to a matrix block with the same global row index and global column index;
based on the determined row index and column index of the process grid, determining a process p for processing matrix blocks corresponding to the L matrix, accumulating the number of data blocks required to be received by the process p, storing the number of data blocks of matrix rows corresponding to the p process, and storing column index numbers of global block matrices of each matrix block sent to the p process.
4. The parallel supernode ordering method for large-scale sparse system of equations solution of claim 3, further comprising:
in the job communication domain, running the MPI collective communication function MPI_Alltoall on the acquired and stored data and transmitting inter-process data, wherein each process acquires the number of matrix blocks required to be acquired from other processes, the number of matrix blocks transmitted from other processes, and the number of matrix blocks of each row transmitted from other processes.
5. The parallel supernode ordering method for solving large-scale sparse equation set according to claim 1, wherein after each process obtains corresponding data, performing matching comparison according to the obtained data and the data of the responsible L matrix, determining a minimum matching data block number, counting data blocks smaller than the minimum matching data block number and the number thereof, determining a dependency according to the number, and ordering the data blocks based on the dependency, comprising:
after each process obtains data, the obtained data and the local L matrix data are matched and compared; data blocks for which the column number of the L matrix equals the row number of the U matrix are matched in order from small to large; if the row index of a data block of the L matrix and the column index of a data block of the U matrix are the same, the matching is finished and that index number is stored in an array match[id], where id is the row index or column index of the matrix;
calling the MPI collective communication function MPI_Allreduce, carrying out a minimum reduction on the match[id] array, so that each process obtains, for each column of the L matrix of the global matrix and each corresponding row of the U matrix, the minimum matching data block number at which the block id numbers are the same;
each process respectively counts the number of data blocks smaller than the minimum matching data block number in each column and the corresponding row;
collecting, through the MPI collective communication functions MPI_Allgather and MPI_Allgatherv, the data of all processes for which each row is smaller than the minimum matching data block number, combining them to form the set of data blocks of the global matrix smaller than the minimum matching data block number, and storing the block numbers of all matrix blocks of each column through an array mindata;
and determining a dependency relationship according to the number of the data blocks, and ordering the data blocks in the array mindata based on the dependency relationship to generate an ordering of matrix column numbers.
6. The parallel supernode ordering method for large-scale sparse equation set solution according to claim 5, wherein the determining the dependency according to the number of data blocks, ordering the data blocks in the set mindata based on the dependency, generating an ordering of matrix column numbers, comprises:
the array mindata stores the block numbers of all matrix blocks in each column of the matrix, and the total number of matrix blocks in each matrix column is counted; the larger the total number of blocks in a column, the deeper the dependency of that column on the other matrix blocks, and column blocks with a deeper dependency are ordered later.
7. A parallel supernode ordering system for large-scale sparse system of equations solutions, comprising:
the sparse matrix acquisition module is used for acquiring a sparse matrix according to a large-scale sparse linear equation set to be solved, and the sparse matrix is used as a data processing object in the operation process of the processor;
the node matrix mapping module is used for constructing a two-dimensional process grid in the operation process of the processor, and mapping each non-zero matrix block in the super node block matrix of the sparse matrix into a corresponding process of the process grid in a block matrix row and column cyclic mapping mode;
the process data acquisition module is used for transmitting, for each process and based on MPI collective communication, the data of the U matrix block that the process is responsible for to the process responsible for the corresponding matrix column of the U matrix block;
the sequencing module is used for carrying out matching comparison on the acquired data and the data of the responsible L matrix after acquiring corresponding data in each process, determining the minimum matching data block number, counting the data blocks smaller than the minimum matching data block number and the quantity thereof, determining the dependency relationship according to the quantity, and sequencing the data blocks based on the dependency relationship;
and the memory allocation module is used for allocating memory to the process corresponding to the data block according to the ordered data block.
8. The parallel supernode ordering system for large-scale sparse system of equations according to claim 7, wherein each non-zero matrix block $A_{I,J}$ in the supernode block matrix of the sparse matrix is mapped cyclically, by block matrix rows and columns, into a corresponding process of the Pr*Pc process grid;
the mapping mode is as follows:
pr = I mod Pr, representing the row of the process grid of the process corresponding to the matrix block;
pc = J mod Pc, representing the column of the process grid of the process corresponding to the matrix block;
p = pr*Pc + pc, representing the process ID number corresponding to the matrix block.
9. The parallel supernode ordering system for large-scale sparse system of equations solution of claim 7, wherein for each process, based on MPI collective communication, sending data of a U matrix block for which each process is responsible to the process responsible for the corresponding matrix column of the U matrix block, comprises:
traversing each row of matrix blocks in the U matrix which is responsible for processing by each process according to each process, determining a global row index of each row of matrix blocks in a global matrix, determining column indexes of process grids corresponding to columns of the block matrix with the same value as the global row index according to the global row index, and simultaneously counting the number of non-zero matrix blocks in the U matrix which is responsible for processing by each process;
acquiring a global column index of each row of matrix blocks in a global matrix in a U matrix which is responsible for processing by each process, and determining a global row index of an L matrix corresponding to the global column index, namely determining a row index of a process grid corresponding to a matrix block with the same global row index and global column index;
based on the determined row index and column index of the process grid, determining a process p for processing matrix blocks corresponding to the L matrix, accumulating the number of data blocks required to be received by the process p, storing the number of data blocks of matrix rows corresponding to the p process, and storing column index numbers of global block matrices of each matrix block sent to the p process.
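A hedged sketch of the send-side bookkeeping in claim 9 follows; it assumes the matching L block of a U block at global block coordinates (i, j) sits at the transposed position (j, i), consistent with the mapping of claim 8, and all names (u_rows, count_sends, etc.) are illustrative.

```c
/* Sketch: for each U block this process owns at global block coordinates
 * (i, j), derive the destination process that owns the transposed L block
 * and tally how many blocks each destination will receive. */
void count_sends(const int *u_rows, const int *u_cols, int n_blocks,
                 int Pr, int Pc,
                 int *send_count /* length Pr*Pc */,
                 int *dest       /* length n_blocks */) {
    for (int q = 0; q < Pr * Pc; ++q) send_count[q] = 0;
    for (int k = 0; k < n_blocks; ++k) {
        int i  = u_rows[k];     /* global block-row index of the U block    */
        int j  = u_cols[k];     /* global block-column index of the U block */
        int pc = i % Pc;        /* grid column from the global row index    */
        int pr = j % Pr;        /* grid row from the global column index    */
        int p  = pr * Pc + pc;  /* process owning the L block at (j, i)     */
        dest[k] = p;
        send_count[p]++;        /* blocks process p must receive from us    */
    }
}
```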
10. The parallel supernode ordering system for solving large-scale sparse systems of equations according to claim 7, further comprising:
running the MPI collective communication function MPI_Alltoall on the acquired and stored data within the operating communication domain to perform inter-process data transmission, whereby each process obtains the number of matrix blocks, the amount of matrix block data, and the number of matrix blocks per row that it needs to acquire from each of the other processes.
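As a minimal illustration of this exchange step, the sketch below uses MPI_Alltoall to trade per-destination block counts so that each process learns how many blocks to expect from every peer; the function name exchange_counts and the surrounding bookkeeping are assumptions, not the patent's code.

```c
#include <mpi.h>

/* Each process sends one int (its outgoing block count) to every process;
 * afterwards recv_count[q] holds the number of blocks this process must
 * receive from process q. Both arrays have one entry per process. */
void exchange_counts(const int *send_count, int *recv_count, MPI_Comm comm) {
    MPI_Alltoall(send_count, 1, MPI_INT,
                 recv_count, 1, MPI_INT, comm);
}
```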
CN202310224172.6A 2023-03-10 2023-03-10 Parallel supernode sorting method and system for solving large-scale sparse equations Active CN115952385B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310224172.6A CN115952385B (en) 2023-03-10 2023-03-10 Parallel supernode sorting method and system for solving large-scale sparse equations

Publications (2)

Publication Number Publication Date
CN115952385A CN115952385A (en) 2023-04-11
CN115952385B true CN115952385B (en) 2023-05-05

Family

ID=85903321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310224172.6A Active CN115952385B (en) 2023-03-10 2023-03-10 Parallel supernode sorting method and system for solving large-scale sparse equations

Country Status (1)

Country Link
CN (1) CN115952385B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117252145B (en) * 2023-11-15 2024-02-09 湘潭大学 Parallel solving method for large-scale structural linear equation set in chip simulation
CN118152716B (en) * 2024-05-09 2024-07-26 巨霖科技(上海)有限公司 Matrix solving method, computer equipment, storage medium and program product

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2657842A1 (en) * 2012-04-23 2013-10-30 Fujitsu Limited Workload optimization in a multi-processor system executing sparse-matrix vector multiplication
WO2015079665A1 (en) * 2013-11-29 2015-06-04 パナソニック株式会社 Transmission method, transmission device, receiving method, and receiving device
CN105260342A (en) * 2015-09-22 2016-01-20 浪潮(北京)电子信息产业有限公司 Solving method and system for symmetric positive definite linear equation set
CN115454586A (en) * 2016-12-31 2022-12-09 英特尔公司 System, method and apparatus for heterogeneous computing

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2631004C (en) * 2007-05-09 2016-07-19 Universite De Sherbrooke Image reconstruction methods based on block circulant system matrices
JP6601222B2 (en) * 2016-01-04 2019-11-06 富士通株式会社 Matrix operation program, matrix partitioning method, and parallel processing apparatus
US10372507B2 (en) * 2016-12-31 2019-08-06 Intel Corporation Compute engine architecture to support data-parallel loops with reduction operations
US11216732B2 (en) * 2018-05-31 2022-01-04 Neuralmagic Inc. Systems and methods for generation of sparse code for convolutional neural networks

Similar Documents

Publication Publication Date Title
CN115952385B (en) Parallel supernode sorting method and system for solving large-scale sparse equations
CN110555012B (en) Data migration method and device
CN1956457B (en) Method and apparatus for arranging mesh work in mesh computing system
CN111400555B (en) Graph data query task processing method and device, computer equipment and storage medium
Tsalouchidou et al. Scalable dynamic graph summarization
CN101114273A (en) Executing an allgather operation with an alltoallv operation in a parallel computer
US20130227244A1 (en) Workload-aware distributed data processing apparatus and method for processing large data based on hardware acceleration
Schlag et al. Scalable edge partitioning
Ou et al. Architecture-independent locality-improving transformations of computational graphs embedded in k-dimensions
CN103886508A (en) Mass farmland data monitoring method and system
CN112925789A (en) Spark-based space vector data memory storage query method and system
CN114217920A (en) Job scheduling method and device, computer cluster and computer readable storage medium
Kune et al. Genetic algorithm based data-aware group scheduling for big data clouds
Sao et al. A communication-avoiding 3D LU factorization algorithm for sparse matrices
CN115801896A (en) Calculation network node distribution method and device, electronic equipment and storage medium
CN116775041A (en) Big data real-time decision engine based on stream computing framework and RETE algorithm
CN112948123A (en) Spark-based grid hydrological model distributed computing method
CN109992372A (en) A method and device for data processing based on map reduction
CN114567634A (en) Method, system, storage medium and electronic device for calculating E-level graph facing backward
CN106933882B (en) Big data increment calculation method and device
CN115952239B (en) Expression-based distributed hierarchical computing system, electronic device and storage medium
CN114860460B (en) Database acceleration method and device and computer equipment
CN114297260B (en) Distributed RDF data query method, device and computer equipment
CN111967590B (en) Heterogeneous multi-XPU machine learning system for recommender system matrix factorization
Tarmur et al. Parallel classification of spatial points into geographical regions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant