Background
In the field of computational fluid dynamics, a direct numerical simulation (Direct Numerical Simulation, DNS) can be performed by using a spectral element method to solve the Navier-Stokes equations without simplification by a turbulence model, so that turbulent fluid motion in complex application scenarios can be captured. However, the computational cost of the spectral element method is high: the main hotspot function in each solution time step is the construction of the Helmholtz operator Ax for the velocity and pressure equations, whose structure is sketched below.
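The original pseudocode is not reproduced in this text; the following fragment is only an illustrative sketch, assuming the standard tensor-product form of the spectral element Helmholtz operator, of why each element generates a stream of small dense matrix multiplications of the shapes discussed below. The function name and the data layout are hypothetical.

```cuda
// Illustrative sketch only: the per-element contractions behind the Helmholtz operator.
// D is the N x N differentiation matrix; u holds Nel elements of N^3 values each,
// stored with the contracted index slowest. Geometric factors, the Helmholtz
// coefficient and the transposed contractions are omitted.
void helmholtzContractionSketch(int Nel, int N, const double* D, const double* u, double* w) {
    const int N2 = N * N, N3 = N * N * N;
    for (int e = 0; e < Nel; ++e) {                    // loop over spectral elements
        const double* ue = u + e * N3;
        double*       we = w + e * N3;
        // Shape (N x N) * (N x N^2): derivative along the slowest tensor index.
        for (int i = 0; i < N; ++i)
            for (int c = 0; c < N2; ++c) {
                double s = 0.0;
                for (int k = 0; k < N; ++k)
                    s += D[i * N + k] * ue[k * N2 + c];
                we[i * N2 + c] = s;
            }
        // The remaining contractions have shapes (N^2 x N) * (N x N) and
        // N slabs of (N x N) * (N x N); together these small dense products
        // dominate the runtime of the operator construction.
    }
}
```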
The large number of small dense matrix multiplications generated during Helmholtz operator construction occupies most of the program's running time, and their efficiency directly affects the computational cost of DNS. Parallel optimization tailored to the size and number characteristics of these matrices is therefore the key to improving the parallel efficiency of the program.
With the increasing demands of scientific computing, traditional Helmholtz operator solution methods based on a single CPU can hardly meet the requirements of large-scale data processing and high-precision computation. In recent years, heterogeneous computing systems have become an effective way to tackle large-scale computation owing to their high parallel computing power and energy efficiency advantages.
DCU (Deep Computing Unit) is a coprocessor product with a GPGPU architecture from Hygon Information, suited to accelerating compute-intensive tasks and equipped with a large number of compute units and high-bandwidth memory. However, applying existing Helmholtz operator solution schemes in a CPU-DCU heterogeneous computing environment still faces many challenges, such as:
(1) Load balancing between the CPU and the DCU. Because the computing power of the CPU and the DCU differs greatly, tasks and data cannot simply be divided evenly; a reasonable distribution targeted at the application's characteristics is needed.
(2) Memory access overhead. Dense matrix multiplication is a memory-access-intensive task; how to make full use of the cache, relieve the data transfer bottleneck, reduce the memory access time overhead, and improve overall computing efficiency is a technical problem to be solved.
(3) A single small dense matrix multiplication involves only a small amount of computation; how to schedule these tasks on the DCU so as to make full use of its computing power is the key to improving parallel efficiency.
Based on the above, the present invention provides a heterogeneous parallel Helmholtz operator construction method for DCU clusters.
Disclosure of Invention
The invention aims to provide a heterogeneous parallel Helmholtz operator construction method for DCU clusters, to solve the problems of high memory access overhead, uneven task allocation, and insufficient utilization of DCU computing power caused by the large number of small dense matrix multiplications generated during Helmholtz operator construction in a DCU environment. The invention realizes multi-level prefetching for small dense matrix multiplication and a reasonable distribution of computation tasks, reduces memory access overhead, makes full use of the DCU's computing performance, and can greatly improve the program's running speed.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The heterogeneous parallel Helmholtz operator construction method for DCU clusters comprises the following steps:
(1) On the basis of blocked matrix multiplication, when different result submatrices Ctile are calculated, the submatrices Atile and Btile are first read in full into shared memory, which reduces the number of global memory accesses and lowers memory access latency;
For the multiplication of the submatrices Atile and Btile, registers are introduced to further improve data reuse; each time, one row of Atile and one column of Btile are read into registers and used to calculate a part of Ctile, so as to realize memory access optimization;
(2) Task decomposition algorithm design:
The matrices generated during Helmholtz operator construction in a DCU environment have sizes that depend on the polynomial interpolation order of the spectral element method;
The matrix shapes are divided into three cases, N×N×N, N²×N×N and N×N×N², specifically as follows:
1) For the N×N×N case, a single matrix multiplication is treated as one block to be distributed among thread groups;
2) For the N²×N×N case, the result matrix C has size N²×N. When N<8, the matrix is split into N×N×N-scale matrix multiplications and task decomposition is performed by the method of case 1); when N∈[8,24], the number of matrix rows is at least 64, so the block size is adjusted to 64×N and each thread's load is N elements;
3) For the N×N×N² case, a blocking idea similar to that for the N²×N-scale matrix is adopted: when N<8, the matrix is split into N×N×N-scale matrix multiplications and task decomposition is performed by the method of case 1); when N∈[8,24], the number of columns of the result matrix C is far greater than the number of rows, so the block size is N×64, the matrix is divided into several blocks, and the results within a block are distributed to threads by column. With this block shape, the data of each sub-matrix are no longer contiguous in the original matrix, but the 64 elements in the same row lie at contiguous memory addresses, and in each round the 64 threads of a thread bundle read 64 elements at contiguous memory addresses.
Preferably, the size of the register tile used in step (1) is tileN.
Preferably, for case 1), N blocks are generated for each grid in the Helmholtz operator construction process. Since the matrix dimension N may not be divisible by the thread group dimension, in order to avoid load imbalance within the same thread group, each block is decomposed into four parts for calculation; the decomposition flow is as follows:
① The number of blocks elementChunkNum generated by a single grid and the total number of blocks totalChunkNum are calculated:
elementChunkNum=N
totalChunkNum=Nel*elementChunkNum
② The global z-axis coordinate of a thread is the global number of the thread group to which it belongs, and the number of the block the group is responsible for is obtained from this global number:
gridGroupId=blockDim.z*blockIdx.z+threadIdx.z
③ The block is divided into four parts that are calculated in sequence; each part is processed by the threads in the thread group, and each thread obtains the starting address of its sub-block Ctile from its x, y coordinates and the sub-block size, and writes its calculation result there.
Preferably, the case 2) specifically includes the following:
① The number of blocks elementChunkNum generated by a single grid and the total number of blocks totalChunkNum are calculated; if N² is not exactly divisible by 64, the remaining rows are likewise assigned to a thread group for calculation:
totalChunkNum=Nel*elementChunkNum
② The block to be processed is obtained according to the global number of the thread group, and each thread in the thread group calculates one row of elements, reading one element of the A matrix and one row of the B matrix at a time to achieve data reuse within the block; for memory access coalescing optimization, a block prefetching mode is used for the A matrix, reading a block of size 64×N for one thread bundle to access.
Compared with the prior art, the invention provides a heterogeneous parallel Helmholtz operator construction method for DCU clusters, which has the following beneficial effects:
(1) The invention solves the memory access overhead problem of small dense matrix multiplication. Fine-grained access to main memory is no longer performed; instead, the on-chip shared memory and registers of the DCU are fully utilized and the matrices are prefetched block by block in parallel, significantly reducing memory access overhead.
(2) The invention ensures that tasks are reasonably distributed on the DCU and makes full use of the DCU's computing capability. The acceleration effect increases with the interpolation order; the speed-up ratio ranges from 3.61 at the lowest to 30.04 at the highest.
(3) The invention solves the load balancing problem between the CPU and the DCU and among DCUs, and improves the scalability of the program. A further speed-up can be obtained by using multiple DCUs: when the interpolation order reaches 24, the speed-up ratio is improved by a further 17.9% compared with a single DCU.
Detailed Description
The following is a clear and complete description of the embodiments of the present invention with reference to the accompanying drawings; it is apparent that the described embodiments are only some, not all, of the embodiments of the present invention.
The invention provides a DCU cluster-oriented heterogeneous parallel Helmholtz operator construction method, which comprises the following steps:
(1) Block matrix multiplication memory optimization:
Matrix multiplication is a memory-access-intensive task, and the main direction of optimization is to reduce the number of global memory accesses. In blocked matrix multiplication, the computation of different result submatrices Ctile may require access to the same submatrix Atile or Btile; that is, there is data reuse among multiple threads. The submatrices Atile and Btile are therefore first read in full into shared memory. When the threads in a single thread group access contiguous memory addresses, the DCU memory controller can obtain all the data with a single read, greatly reducing memory access latency.
For the multiplication of the submatrices Atile and Btile, registers are introduced to further improve data reuse. Memory access optimization is achieved by reading one row of Atile and one column of Btile into registers each time and using them to compute a part of Ctile. The size of the register tile used is tileN.
(2) Task decomposition algorithm design:
The calculations generated during construction of the Helmholtz operator are dominated by a large number of small-scale matrices, whose size depends on the polynomial interpolation order of the spectral element method. The order range used in the program is N∈[4,24], and by the shape of the resulting matrices three cases are distinguished: N×N×N, N²×N×N and N×N×N². The invention performs task decomposition optimization for each of the three cases, described as follows.
(1) For the N×N×N case, matrix multiplications of this form are more numerous than the other types, but the computation in a single matrix multiplication is small. A single matrix multiplication is therefore treated as one block and allocated to a thread group. The 64 threads of a thread bundle run simultaneously, and one thread bundle contains several thread groups, so the number of threads in a thread group must divide 64 exactly; the optional thread group sizes are 2×2, 4×4 and 8×8. N blocks are generated for each grid in the Helmholtz operator construction process, and since the matrix dimension N may not be divisible by the thread group dimension, in order to avoid load imbalance within the same thread group, each block is decomposed into four parts for calculation. The decomposition flow is as follows:
① The number of blocks elementChunkNum generated by a single grid and the total number of blocks totalChunkNum are calculated:
elementChunkNum=N
totalChunkNum=Nel*elementChunkNum
② The global z-axis coordinate of a thread is the global number of the thread group to which it belongs; the number of the block that the group is responsible for is obtained from this global number:
gridGroupId=blockDim.z*blockIdx.z+threadIdx.z
③ The block is disassembled into four parts that are calculated in sequence. Each part is processed by the threads in the thread group; each thread obtains the starting address of its sub-block Ctile from its x, y coordinates and the sub-block size, and writes its calculation result there.
(2) For the N²×N×N case, matrix multiplication of this form is computationally intensive, and the result matrix C has size N²×N. When N<8, the matrix is split into N×N×N-scale matrix multiplications and task decomposition follows the method of case (1). When N∈[8,24], the number of matrix rows is at least 64, enough to occupy one thread bundle for parallelization, so the block size is adjusted to 64×N and each thread's load is N elements. The specific flow is as follows:
① The number of blocks elementChunkNum generated by a single grid and the total number of blocks totalChunkNum are calculated. If N² is not exactly divisible by 64, i.e. there are leftover rows, these are likewise assigned to a thread group for calculation:
totalChunkNum=Nel*elementChunkNum
② The block to be processed is acquired according to the global number of the thread group, and each thread in the thread group is responsible for calculating one row of elements. Reading one element of the A matrix and one row of the B matrix at a time enables data reuse within the block. For memory access coalescing optimization, a block prefetching mode is used for the A matrix: a block of size 64×N is read for one thread bundle to access.
(3) For the N×N×N² case, matrix multiplication of this shape is similar to the previous one, so a blocking approach similar to that for the N²×N-scale matrix is adopted. When N<8, the matrix is split into N×N×N-scale matrix multiplications and task decomposition follows the method of case (1). When N∈[8,24], the number of columns of the result matrix C is far greater than the number of rows, so the block size is N×64; the matrix is divided into several blocks, the results within a block are distributed to threads by column, and the same memory optimization idea as for the previous matrix is adopted. The difference is that with this block shape the data of each sub-matrix are no longer contiguous in the original matrix; instead, the 64 elements in the same row lie at contiguous memory addresses, and in each round the 64 threads of a thread bundle read exactly 64 elements at contiguous addresses, so memory access coalescing can be achieved without using shared memory. A minimal sketch of this access pattern is given below.
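The following kernel is only a minimal sketch of this access pattern, not the implementation of the invention. It assumes one 64-thread group per block (blockDim = (64, 1, 1)), that 64 divides N² so remainder columns are ignored, and that A and B are stored per grid element; the kernel name and parameters are hypothetical.

```cuda
// Hypothetical sketch of the N x N x N^2 case for N in [8,24]: one thread bundle
// handles one N x 64 block of C, each thread owns one column, so in every round the
// 64 threads touch 64 contiguous addresses of B and C and the accesses coalesce
// without shared memory.
__global__ void nxn2Chunks(const double* A, const double* B, double* C,
                           int N, int totalChunkNum) {
    int N2 = N * N;
    int elementChunkNum = N2 / 64;        // assumes 64 divides N^2 in this sketch
    int groupId = blockIdx.z;             // one thread group per block (blockDim.z = 1)
    int lane    = threadIdx.x;            // 0..63, the column of the block owned by this thread

    for (int chunk = groupId; chunk < totalChunkNum; chunk += gridDim.z) {
        int elementId = chunk / elementChunkNum;
        int chunkId   = chunk % elementChunkNum;
        const double* Ae = A + elementId * N * N;    // this element's N x N matrix (assumed per element)
        const double* Be = B + elementId * N * N2;   // this element's N x N^2 matrix
        double*       Ce = C + elementId * N * N2;
        int col = chunkId * 64 + lane;               // this thread's column within the element

        for (int r = 0; r < N; ++r) {                // one round per output row
            double acc = 0.0;
            for (int k = 0; k < N; ++k)
                acc += Ae[r * N + k] * Be[k * N2 + col];   // B reads coalesce across the bundle
            Ce[r * N2 + col] = acc;                         // C writes coalesce across the bundle
        }
    }
}
```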
The method for constructing heterogeneous parallel Helmholtz operators for DCU clusters, which is provided by the invention, is described below with reference to the accompanying drawings and specific examples, and specifically comprises the following contents.
Example 1:
Referring to fig. 1, in the present invention the DCU organizes threads in blocks, where the number of a thread is a 3-dimensional vector describing the position of the thread within the whole thread block; the thread organization structure can be set when a kernel function is launched. A program can use blockDim.x, blockDim.y and blockDim.z to obtain the size information of the block. For blocked matrix multiplication memory optimization, a memory-optimized matrix multiplication can be divided into four steps:
(1) Data is loaded from global memory.
(2) The data are stored into shared memory.
(3) Data is loaded from the shared memory into the registers.
(4) A computing operation is performed.
The algorithm flow of shared memory loading and register loading is as follows:
This embodiment uses a parallel carrying strategy. Because the on-chip shared memory space is relatively large, an A matrix block of size Mtile×K and a B matrix block of size K×Ntile can be carried into shared memory at the same time, where a_shared denotes the A matrix part in on-chip shared memory. Each thread computes its blockThreadId to determine its carrying start position and carries one data item every blockSize positions. In this way, the data addresses carried by the threads in each round are contiguous. A sketch of this flow is given below:
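A minimal sketch of this carrying step, assuming Mtile, Ntile and K are compile-time constants, that A and B already point at contiguous blocks in global memory, and that a_shared and b_shared point at shared memory; the helper name is hypothetical and the syntax is CUDA-style.

```cuda
// Hypothetical helper sketching the cooperative carrying of one A block (MTILE x K)
// and one B block (K x NTILE) into shared memory. Each thread starts at its
// blockThreadId and copies one element every blockSize positions, so the global
// addresses touched in each round are contiguous and the accesses coalesce.
template <int MTILE, int NTILE, int K>
__device__ void loadTiles(const double* A, const double* B,
                          double* a_shared, double* b_shared) {
    int blockThreadId = threadIdx.y * blockDim.x + threadIdx.x;  // linear id within the group
    int blockSize     = blockDim.x * blockDim.y;                 // number of threads in the group
    for (int i = blockThreadId; i < MTILE * K; i += blockSize)
        a_shared[i] = A[i];
    for (int i = blockThreadId; i < K * NTILE; i += blockSize)
        b_shared[i] = B[i];
    __syncthreads();   // the carried blocks become visible to all threads in the group
}
```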
In the register loading step, a_reg denotes the A matrix part stored in registers. Register capacity is small, so before each vector operation in the matrix multiplication, one row of the A matrix and one column of the B matrix in shared memory are carried into registers, as sketched below.
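The following is only a sketch of a common register-tiling pattern consistent with the description above; the exact orientation of the fragments in the original implementation may differ. TILEN, NTILE and K are assumed compile-time constants, and the helper name is hypothetical.

```cuda
// Hypothetical helper: per k step, one fragment of a_shared and one fragment of
// b_shared are moved into registers (a_reg, b_reg) and used to update this thread's
// TILEN x TILEN portion of Ctile, which is accumulated in registers (c_reg).
template <int TILEN, int NTILE, int K>
__device__ void multiplyTiles(const double* a_shared, const double* b_shared,
                              double c_reg[TILEN][TILEN]) {
    double a_reg[TILEN], b_reg[TILEN];
    int row0 = threadIdx.y * TILEN;          // first row of this thread's sub-block
    int col0 = threadIdx.x * TILEN;          // first column of this thread's sub-block
    for (int k = 0; k < K; ++k) {
        for (int i = 0; i < TILEN; ++i) a_reg[i] = a_shared[(row0 + i) * K + k];
        for (int j = 0; j < TILEN; ++j) b_reg[j] = b_shared[k * NTILE + col0 + j];
        for (int i = 0; i < TILEN; ++i)
            for (int j = 0; j < TILEN; ++j)
                c_reg[i][j] += a_reg[i] * b_reg[j];   // part of Ctile accumulated in registers
    }
}
```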
For the N×N×N matrix task decomposition, each thread calculates the starting address of the sub-block it is responsible for from its x and y coordinates; first, the memory address offsets are calculated according to the four-part division:
rowOffset=threadIdx.y*tileN*N
colOffset=threadIdx.x*tileN
resRowOffset=(blockDim.y*tileN+threadIdx.y)*N
resColOffset=blockDim.x*tileN+threadIdx.x
The starting memory addresses of the four parts can be calculated from combinations of these four offsets. The matrix blocks are then allocated cyclically among the thread groups until all computations are completed, with the number of blocks computed simultaneously each time equal to the number of global thread groups, i.e. the scale of the global z-axis coordinate, gridDim.z. The specific algorithm flow is sketched below:
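A minimal sketch of this flow, simplified to the upper-left part of the four-part decomposition (the remainder parts follow the same pattern with the res offsets) and assuming that each block's A, B and C matrices are stored contiguously one after another; the shared memory and register staging shown earlier is omitted so that the fragment focuses on the task decomposition, and the kernel name is hypothetical.

```cuda
// Hypothetical kernel fragment for the N x N x N case: each thread group walks the
// blocks with a stride equal to the number of global thread groups and computes its
// tileN x tileN sub-block of the upper-left part of C for every block it owns.
template <int tileN>
__global__ void nxnChunks(const double* A, const double* B, double* C,
                          int N, int totalChunkNum) {
    int gridGroupId = blockDim.z * blockIdx.z + threadIdx.z;   // global thread-group number
    int groupCount  = gridDim.z * blockDim.z;                  // thread groups along the global z axis

    int rowOffset = threadIdx.y * tileN * N;                   // row offset of this thread's sub-block
    int colOffset = threadIdx.x * tileN;                       // column offset of this thread's sub-block

    for (int chunk = gridGroupId; chunk < totalChunkNum; chunk += groupCount) {
        const double* Ac = A + chunk * N * N;                  // this block's N x N matrices,
        const double* Bc = B + chunk * N * N;                  // assumed stored contiguously
        double*       Cc = C + chunk * N * N;
        // Part 1: upper-left square block, one tileN x tileN sub-block per thread.
        for (int i = 0; i < tileN; ++i)
            for (int j = 0; j < tileN; ++j) {
                double acc = 0.0;
                for (int k = 0; k < N; ++k)
                    acc += Ac[(threadIdx.y * tileN + i) * N + k]
                         * Bc[k * N + threadIdx.x * tileN + j];
                Cc[rowOffset + colOffset + i * N + j] = acc;
            }
        // Parts 2-4 (upper-right, lower-left and lower-right remainder blocks) follow
        // the same pattern using resRowOffset and resColOffset; omitted for brevity.
    }
}
```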
The four matrix partitions generated by the decomposition in this example are respectively:
(1) The upper left block is a square matrix of order tileN*blockDim.x, which can be uniformly distributed among the threads in the thread group; the sub-block size is tileN×tileN.
(2) The upper right block is a slender matrix of size (tileN*blockDim.y)×res; the sub-block size is tileN×1, and res*blockDim.x threads in the thread group participate in the operation.
(3) The lower left block is a short, wide matrix of size res×(tileN*blockDim.y); the sub-block size is 1×tileN, and res*blockDim.y threads in the thread group participate in the operation.
(4) The lower right block is a square matrix of size res×res, which involves only a small amount of computation, so the computation of each element is assigned to one thread in the thread group.
The algorithm flows for the N²×N×N and N×N×N² types are similar, differing only in the block prefetching scheme used when the block multiplication is finally computed. The task decomposition algorithm flow for N∈[8,24] is given here, taking the N²×N×N type as an example:
This example adopts a 3D thread structure: different thread groups are distinguished by their z coordinate, and the blocks are cyclically allocated among them for calculation. Since the number of matrix rows in this case is at least 64, enough to make full use of one thread bundle, the matrix is no longer partitioned among thread groups. For each thread group, the calculation flow is as follows:
(1) The offset chunkOffset of the responsible block is obtained from the global number of the thread group.
chunkOffset=elementId*N³+chunkId*chunkSize
(2) The blocks Achunk, Bchunk and Cchunk that the group is responsible for are acquired according to the offset.
(3) Each thread uses the rowOffset offset to associate the portions of Achunk and Cchunk it is responsible for with the corresponding shared memory addresses Atile and Ctile, where Atile is used for prefetching A and Ctile for pre-storing the results.
(4) The corresponding result Ctile is computed in shared memory.
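A minimal sketch of this flow under simplifying assumptions: one 64-thread group per block (blockDim = (64, 1, 1)), 64 dividing N² so that remainder rows are ignored, and B stored per grid element; the kernel name, the MAX_N bound and the staging details are hypothetical.

```cuda
// Hypothetical kernel for the N^2 x N x N case: one thread bundle per 64 x N block of C,
// A prefetched into shared memory, each thread computing one row, and the results
// pre-stored in shared memory so that the final write-back to global memory coalesces.
#define MAX_N 24   // upper bound of the interpolation order used in the program

__global__ void n2xnChunks(const double* A, const double* B, double* C,
                           int N, int elementChunkNum, int totalChunkNum) {
    __shared__ double Atile[64 * MAX_N];   // prefetched 64 x N block of A
    __shared__ double Ctile[64 * MAX_N];   // pre-stored 64 x N block of results

    int N3        = N * N * N;
    int chunkSize = 64 * N;
    int groupId   = blockIdx.z;            // global thread-group number (blockDim.z = 1 here)
    int lane      = threadIdx.x;           // 0..63 within the thread bundle

    for (int chunk = groupId; chunk < totalChunkNum; chunk += gridDim.z) {
        int elementId   = chunk / elementChunkNum;
        int chunkId     = chunk % elementChunkNum;
        int chunkOffset = elementId * N3 + chunkId * chunkSize;   // cf. step (1) above
        const double* Be = B + elementId * N * N;                 // this element's N x N matrix

        // Coalesced block prefetch of A into shared memory (the 64 x N block described above).
        for (int i = lane; i < chunkSize; i += 64)
            Atile[i] = A[chunkOffset + i];
        __syncthreads();

        // Each thread computes one row: one element of A and one row of B per step.
        double acc[MAX_N];
        for (int j = 0; j < N; ++j) acc[j] = 0.0;
        for (int k = 0; k < N; ++k) {
            double a = Atile[lane * N + k];
            for (int j = 0; j < N; ++j) acc[j] += a * Be[k * N + j];
        }
        for (int j = 0; j < N; ++j) Ctile[lane * N + j] = acc[j];
        __syncthreads();

        // Coalesced write-back of the whole block from the pre-stored Ctile.
        for (int i = lane; i < chunkSize; i += 64)
            C[chunkOffset + i] = Ctile[i];
        __syncthreads();
    }
}
```

Pre-storing the results in Ctile before the write-back means that the 64 threads of the bundle write contiguous addresses in each round, which is presumably why C is staged in shared memory in step (3) above.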
While the invention has been described with reference to preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.