
CN119356862A - Heterogeneous parallel Helmholtz operator construction method for DCU clusters - Google Patents

Heterogeneous parallel Helmholtz operator construction method for DCU clusters

Info

Publication number
CN119356862A
Authority
CN
China
Prior art keywords: matrix, thread, block, dcu, memory
Prior art date
2024-10-09
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411402685.2A
Other languages
Chinese (zh)
Inventor
胡长军
武嘉琛
董玲玉
王琦
陈婷婷
陈丹丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2024-10-09
Filing date
2024-10-09
Publication date
2025-01-24
Application filed by University of Science and Technology Beijing (USTB)
Priority to CN202411402685.2A
Publication of CN119356862A
Legal status: Pending

Landscapes

  • Complex Calculations (AREA)

Abstract


The invention discloses a heterogeneous parallel Helmholtz operator construction method for DCU clusters, belonging to the technical field of computational fluid dynamics. The method comprises two main parts: memory-access optimization for blocked matrix multiplication and task decomposition algorithm design. Compared with existing designs, the invention no longer performs fine-grained accesses to main memory; instead, it makes full use of the on-chip shared memory and registers of the DCU to prefetch matrix blocks in parallel, significantly reducing memory-access overhead. In addition, the invention distributes tasks across the DCU so that its computing power is fully utilized, and the speedup grows with the interpolation order. Finally, it solves the load-balancing problems between the CPU and the DCU and among multiple DCUs, improving the scalability of the program; using multiple DCUs yields a further increase in speedup.

Description

Heterogeneous parallel Helmholtz operator construction method for DCU cluster
Technical Field
The invention relates to the technical field of computational fluid dynamics, in particular to a method for constructing heterogeneous parallel Helmholtz operators for DCU clusters.
Background
In the field of computational fluid dynamics, direct numerical simulation (Direct Numerical Simulation, DNS) with the spectral element method can solve the Navier-Stokes equations without simplification by a turbulence model, and can therefore capture turbulent fluid motion in complex application scenarios. The computational complexity of the spectral element method is high, however: the main hotspot function in each solution time step is the construction of the Helmholtz operator product Ax for the velocity and pressure equations. The algorithm pseudocode is as follows:
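(The pseudocode figure of the original publication is not reproduced in this text. Below is a minimal host-side sketch reconstructed from the three matrix shapes analyzed later in this description; the helper gemm and the names U, D, Dt, ur, us, ut are illustrative assumptions, not the patent's own code.)

#include <cstddef>

// Reference dense product C (M x Ncols) = A (M x K) * B (K x Ncols), row-major.
static void gemm(int M, int Ncols, int K,
                 const double* A, const double* B, double* C) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < Ncols; ++j) {
            double s = 0.0;
            for (int k = 0; k < K; ++k)
                s += A[i * K + k] * B[k * Ncols + j];
            C[i * Ncols + j] = s;
        }
}

// Per-element tensor-product derivatives: three batches of small dense
// matrix products whose shapes are exactly the three cases treated below.
void helmholtz_gradients(int Nel, int N, const double* U, const double* D,
                         const double* Dt, double* ur, double* us, double* ut) {
    for (int e = 0; e < Nel; ++e) {
        const double* u = U + (std::size_t)e * N * N * N;
        gemm(N, N * N, N, D, u, ur);      // u viewed as N x N^2:  the N x N x N^2 case
        for (int k = 0; k < N; ++k)       // one product per slab: N products of N x N x N
            gemm(N, N, N, u + (std::size_t)k * N * N, Dt, us + (std::size_t)k * N * N);
        gemm(N * N, N, N, u, Dt, ut);     // u viewed as N^2 x N:  the N^2 x N x N case
        // Scaling by geometric factors and the transposed products that
        // assemble A*u on this element are omitted from the sketch.
    }
}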
The large number of small dense matrix multiplications generated during Helmholtz operator construction occupies most of the program's running time, and their efficiency directly determines the computational cost of DNS. Parallel optimization targeted at the size and number characteristics of the matrices in Helmholtz operator construction is therefore the key to improving the parallel efficiency of the program.
With the growing demands of scientific computing, traditional single-CPU Helmholtz operator solvers struggle to meet the requirements of large-scale data processing and high-precision computation. In recent years, heterogeneous computing systems have become an effective way to tackle large-scale computing problems, thanks to their high parallel computing power and energy-efficiency advantages.
The DCU (Deep Computing Unit) is a coprocessor product of Hygon Information Technology's GPGPU architecture, suited to accelerating compute-intensive tasks and equipped with a large number of compute units and high-bandwidth memory. However, applying existing Helmholtz operator solvers in CPU-DCU heterogeneous computing environments still faces many challenges, such as:
(1) Load balancing between the CPU and the DCU. Because the computing power of the CPU and the DCU differ greatly, tasks and data cannot simply be divided evenly; a distribution tailored to the application's characteristics is required.
(2) Memory-access overhead. Dense matrix multiplication is a memory-intensive computing task; fully utilizing the cache, relieving the data-transfer bottleneck, reducing memory-access time, and improving overall computing efficiency is a technical problem to be solved.
(3) A single small dense matrix multiplication involves little computation; scheduling these tasks on the DCU so as to make full use of its computing power is the key to improving parallel efficiency.
Based on the above, the invention provides a heterogeneous parallel Helmholtz operator construction method for DCU clusters.
Disclosure of Invention
The invention aims to provide a heterogeneous parallel Helmholtz operator construction method for DCU clusters, to address the problems raised in a DCU environment by the large number of small dense matrix multiplications generated during Helmholtz operator construction: uneven task allocation, high memory-access overhead, and insufficient utilization of DCU computing power. The invention realizes multi-level prefetching for small dense matrix multiplication and a reasonable distribution of computing tasks, reduces memory-access overhead, fully exploits DCU computing performance, and can greatly increase program running speed.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the heterogeneous parallel Helmholtz operator construction method for DCU clusters comprises the following:
(1) Blocked matrix multiplication memory-access optimization: on the basis of blocked matrix multiplication, when computing the different result sub-matrices Ctile, the sub-matrices Atile and Btile are first read in full into the shared memory space, reducing the number of global memory accesses and lowering memory-access latency;
For the multiplication of the sub-matrices Atile and Btile, registers are introduced to further improve the efficiency of data reuse; each time the registers are used, one row of Atile and one column of Btile are read and used to compute part of Ctile, achieving memory-access optimization;
(2) Task decomposition algorithm design:
For the matrices generated during Helmholtz operator construction in a DCU environment, the matrix size depends on the polynomial interpolation order of the spectral element method;
The matrix shapes fall into three cases, N×N×N, N²×N×N, and N×N×N², specifically as follows:
1) For the N×N×N case, a single matrix multiplication is treated as one block and distributed among thread groups;
2) For the N²×N×N case, the result matrix C has size N²×N; when N<8, the matrix is split into N matrix multiplications of size N×N×N and task decomposition follows the method of case 1); when N ∈ [8,24], the matrix has at least 64 rows, so the block size is adjusted to 64×N and each thread's load is N elements;
3) For the N×N×N² case, a blocking idea similar to that of the N²×N matrix is adopted: when N<8, the matrix is split into N matrix multiplications of size N×N×N and task decomposition follows the method of case 1); when N ∈ [8,24], the number of columns of the result matrix C far exceeds the number of rows, so the block size becomes N×64, the matrix is divided into multiple blocks, and the results within a block are assigned to threads by column; under this block shape the data of each sub-matrix is no longer contiguous in the original matrix, but the 64 elements of a given row lie at consecutive memory addresses, and in each round the 64 threads of a wavefront read 64 elements at consecutive addresses.
Preferably, the registers described in step (1) have size tile×tile.
Preferably, for case 1), each grid element generates N blocks during Helmholtz operator construction. Since the matrix dimension N may not be evenly divisible by the thread-group dimension, and to avoid load imbalance within a thread group, each block is decomposed into four parts for computation. The decomposition flow is as follows:
① The number of blocks elementChunkNum and the total number of blocks totalChunkNum generated by a single grid element are calculated:
elementChunkNum=N
totalChunkNum=Nel*elementChunkNum
② The global z-axis coordinate of the thread is the global number of the thread group to which the thread belongs, and the number of the responsible block is acquired according to the global number:
gridGroupId=blockDim.z*blockIdx.z+threadIdx.z
③ The block is decomposed into four parts that are computed in sequence; each part is processed by the threads of a thread group, and each thread obtains the starting address of its sub-block of Ctile from its x, y coordinates and the sub-block size, for writing the computed result.
Preferably, the case 2) specifically includes the following:
① Compute the number of blocks elementChunkNum generated by a single grid element and the total number of blocks totalChunkNum; if N² is not a multiple of 64, so that there are leftover rows, these are likewise assigned to a thread group for computation:
totalChunkNum=Nel*elementChunkNum
② Obtain the responsible block from the thread group's global number; each thread in the group computes one row of elements. Reading one element of the A matrix and one row of the B matrix at a time achieves data reuse within the block. For memory-access coalescing, block prefetching is used for the A matrix: a block of size 64×N is read for one wavefront to access.
Compared with the prior art, the invention provides a heterogeneous parallel Helmholtz operator construction method for DCU clusters, which has the following beneficial effects:
(1) The invention solves the memory-access overhead problem of small dense matrix multiplication. Fine-grained access to main memory is no longer performed; instead, the DCU's on-chip shared memory and registers are fully exploited and matrix blocks are prefetched in parallel, markedly reducing memory-access cost.
(2) The invention distributes tasks reasonably on the DCU and fully utilizes its computing capacity. The speedup grows with the interpolation order, ranging from 3.61 at the lowest to 30.04 at the highest.
(3) The invention solves the load-balancing problems between the CPU and the DCU and among multiple DCUs, improving the scalability of the program. Using multiple DCUs yields a further gain; at interpolation order 24 the speedup improves by a further 17.9% over a single DCU.
Drawings
FIG. 1 is a schematic diagram of a 3D thread structure according to embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of the N×N×N matrix thread allocation strategy mentioned in embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of the N²×N×N matrix thread allocation strategy set forth in embodiment 1 of the present invention;
FIG. 4 is a schematic diagram of the N×N×N² matrix thread allocation strategy proposed in embodiment 1 of the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings; the embodiments described are evidently only some, and not all, of the embodiments of the invention.
The invention provides a DCU cluster-oriented heterogeneous parallel Helmholtz operator construction method, which comprises the following steps:
(1) Block matrix multiplication memory optimization:
Matrix multiplication is a memory-intensive task, and the main direction of optimization is to reduce the number of global memory accesses. In blocked matrix multiplication, the computation of different result sub-matrices Ctile may require access to the same sub-matrix Atile or Btile; that is, there is data reuse among multiple threads. The sub-matrices Atile and Btile are therefore first read in full into the shared memory space. When the threads of a single thread group access consecutive memory addresses, the DCU memory controller can fetch all the data with a single read, greatly reducing memory-access latency.
For the multiplication of the sub-matrices Atile and Btile, introducing registers further improves the efficiency of data reuse. Memory-access optimization is achieved by reading, each time, one row of Atile and one column of Btile into registers and using them to compute a portion of Ctile. The registers used have size tile×tile. A kernel sketch follows.
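A minimal HIP/CUDA-style kernel sketch of this two-level prefetch is given below. It is a sketch of the technique rather than the patent's exact kernel: the tile sizes BM, BN, BK, TM, TN and all identifiers are assumptions, each thread keeps a TM×TN register patch of Ctile (the tile×tile registers of the text), and M, N, K are assumed to be multiples of the tile sizes.

// Launch with dim3 block(16, 16) and dim3 grid(N / 64, M / 64).
constexpr int BM = 64, BN = 64, BK = 16, TM = 4, TN = 4;

__global__ void matmul_tiled(const double* A, const double* B, double* C,
                             int M, int N, int K) {
    __shared__ double As[BM][BK];      // Atile in on-chip shared memory
    __shared__ double Bs[BK][BN];      // Btile in on-chip shared memory
    double acc[TM][TN] = {};           // register-resident patch of Ctile
    double aReg[TM], bReg[TN];         // one column of Atile / one row of Btile

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int tid = ty * blockDim.x + tx;               // 0..255
    const int rowBase = blockIdx.y * BM + ty * TM;
    const int colBase = blockIdx.x * BN + tx * TN;

    for (int t = 0; t < K; t += BK) {
        // Cooperative prefetch: the 256 threads stride through the BM x BK and
        // BK x BN blocks so each round touches consecutive global addresses.
        for (int i = tid; i < BM * BK; i += blockDim.x * blockDim.y)
            As[i / BK][i % BK] = A[(blockIdx.y * BM + i / BK) * K + t + i % BK];
        for (int i = tid; i < BK * BN; i += blockDim.x * blockDim.y)
            Bs[i / BN][i % BN] = B[(t + i / BN) * N + blockIdx.x * BN + i % BN];
        __syncthreads();
        for (int k = 0; k < BK; ++k) {
            // Read one column of Atile and one row of Btile into registers,
            // then accumulate a rank-1 update of the Ctile patch.
            for (int i = 0; i < TM; ++i) aReg[i] = As[ty * TM + i][k];
            for (int j = 0; j < TN; ++j) bReg[j] = Bs[k][tx * TN + j];
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TN; ++j)
                    acc[i][j] += aReg[i] * bReg[j];
        }
        __syncthreads();
    }
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            C[(rowBase + i) * N + colBase + j] = acc[i][j];
}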
(2) Task decomposition algorithm design:
The computations generated during Helmholtz operator construction are dominated by a large number of small matrices whose size depends on the polynomial interpolation order of the spectral element method. The order range used in the program is N ∈ [4, 24], and by the shape of the resulting matrices three cases arise: N×N×N, N²×N×N, and N×N×N². The invention performs task decomposition optimization for each of the three cases, described below.
(1) For the N×N×N case, matrix multiplications of this shape are more numerous than the other types, but a single multiplication involves little computation. A single matrix multiplication is therefore allocated among thread groups as one block. Threads run simultaneously in wavefronts of 64, and a wavefront is composed of whole thread groups, so the number of threads in a thread group must divide 64 evenly; the optional thread-group sizes are 2×2, 4×4, and 8×8. Each grid element generates N blocks during Helmholtz operator construction, and the matrix dimension N may not be evenly divisible by the thread-group dimension; to avoid load imbalance within a thread group, each block is decomposed into four parts for computation. The decomposition flow is as follows:
① The number of blocks elementChunkNum and the total number of blocks totalChunkNum generated by a single grid element are calculated:
elementChunkNum=N
totalChunkNum=Nel*elementChunkNum
② The thread's global z-axis coordinate is the global number of the thread group to which it belongs; the number of the block the group is responsible for is obtained from this global number:
gridGroupId=blockDim.z*blockIdx.z+threadIdx.z
③ The block is decomposed into four parts that are computed in sequence. Each part is processed by the threads of a thread group; each thread obtains the starting address of its sub-block of Ctile from its x, y coordinates and the sub-block size, for writing the computed result.
(2) For the N²×N×N case, matrix multiplication of this form involves more computation, producing a result matrix C of size N²×N. When N<8, the matrix is split into N matrix multiplications of size N×N×N, and task decomposition follows the method of case (1). When N ∈ [8,24], the number of matrix rows reaches at least 64, enough to occupy one wavefront for parallelization, so the block size is adjusted to 64×N and each thread's load is N elements. The specific flow is as follows:
① The number of blocks elementChunkNum and the total number of blocks totalChunkNum generated by a single grid element are calculated. If N² is not a multiple of 64, i.e., there are leftover rows, these are likewise assigned to a thread group for computation:
totalChunkNum=Nel*elementChunkNum
② The responsible block is obtained from the thread group's global number, and each thread in the group is responsible for computing one row of elements. Reading one element of the A matrix and one row of the B matrix at a time enables data reuse within the block. For memory-access coalescing, block prefetching is used for the A matrix, reading a block of size 64×N for one wavefront to access.
(3) For the N×N×N² case, matrix multiplication of this shape is similar to the previous one, so a blocking approach like that of the N²×N matrix is adopted. When N<8, the matrix is split into N matrix multiplications of size N×N×N, and task decomposition follows the method of case (1). When N ∈ [8,24], the number of columns of the result matrix C far exceeds the number of rows, so the block size becomes N×64; the matrix is divided into multiple blocks, the results within a block are assigned to threads by column, and memory optimization follows the same idea as before. The difference is that under this block shape the data of each sub-matrix is no longer contiguous in the original matrix; instead, the 64 elements of a given row lie exactly at consecutive memory addresses, and in each round the 64 threads of a wavefront read exactly 64 elements at consecutive addresses, so memory-access coalescing is achieved without shared memory. A hedged sketch of this case follows.
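In the sketch below, the kernel name and launch configuration are assumptions, and batching over the Nel grid elements is omitted. Each wavefront of 64 threads owns 64 consecutive result columns, so every round of B reads and C writes coalesces without shared memory.

// C (N x N^2) = A (N x N) * B (N x N^2), row-major.
// Launch with 64 threads per block and ceil(N*N / 64.0) blocks in x.
__global__ void matmul_NxNxN2(const double* A, const double* B, double* C, int N) {
    const int N2 = N * N;
    const int col = blockIdx.x * 64 + threadIdx.x;   // this thread's result column
    if (col >= N2) return;                           // guard for leftover columns
    for (int i = 0; i < N; ++i) {                    // each row of the result
        double s = 0.0;
        for (int k = 0; k < N; ++k)
            s += A[i * N + k] * B[k * N2 + col];     // 64 consecutive B addresses per round
        C[i * N2 + col] = s;
    }
}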
The method for constructing heterogeneous parallel Helmholtz operators for DCU clusters, which is provided by the invention, is described below with reference to the accompanying drawings and specific examples, and specifically comprises the following contents.
Example 1:
Referring to FIG. 1, the DCU organizes threads into blocks: a thread's number is a 3-dimensional vector describing its position within the whole thread block, and the thread organization can be configured when a kernel function is launched. A program can use blockDim.x, blockDim.y, and blockDim.z to obtain the size of the block. For blocked matrix multiplication memory optimization, a memory-optimized matrix multiplication can be divided into four steps:
(1) Data is loaded from global memory.
(2) The data is stored into shared memory.
(3) Data is loaded from the shared memory into the registers.
(4) A computing operation is performed.
The algorithm flow of shared-memory loading and register loading is as follows.
This embodiment uses a parallel carrying strategy. Because the on-chip shared memory is comparatively large, an A-matrix block of size Mtile×K and a B-matrix block of size K×Ntile can be carried into shared memory together each time, where a_shared denotes the A-matrix portion in on-chip shared memory. Each thread computes its blockThreadId to determine its carrying start position and carries one datum at every interval of blockSize; in this way, the addresses carried by the threads in each round are consecutive.
Here a_reg denotes the A-matrix portion stored in registers. Register capacity is small, so one row of the A block and one column of the B block in shared memory are carried into registers before each vector multiply in the matrix multiplication. A sketch of both loading stages is given below.
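The following is a minimal sketch consistent with the carrying strategy just described, keeping the text's names a_shared, blockThreadId, blockSize, and a_reg; the grid layout, the divisibility of M and N by Mtile and Ntile, and the symmetric handling of the B side are assumptions of the illustration.

// One thread group computes one Mtile x Ntile tile of C = A * B.
// Launch with dim3 block(Ntile, Mtile), dim3 grid(N / Ntile, M / Mtile),
// and (Mtile*K + K*Ntile) * sizeof(double) bytes of dynamic shared memory.
__global__ void carry_and_multiply(const double* A, const double* B, double* C,
                                   int M, int N, int K, int Mtile, int Ntile) {
    extern __shared__ double smem[];
    double* a_shared = smem;                 // Mtile x K block of A
    double* b_shared = smem + Mtile * K;     // K x Ntile block of B
    const int blockSize = blockDim.x * blockDim.y;
    const int blockThreadId = threadIdx.y * blockDim.x + threadIdx.x;

    // Parallel carry: each thread starts at blockThreadId and strides by
    // blockSize, so the addresses carried in each round are consecutive.
    for (int i = blockThreadId; i < Mtile * K; i += blockSize)
        a_shared[i] = A[(blockIdx.y * Mtile + i / K) * K + (i % K)];
    for (int i = blockThreadId; i < K * Ntile; i += blockSize)
        b_shared[i] = B[(i / Ntile) * N + blockIdx.x * Ntile + (i % Ntile)];
    __syncthreads();

    // Register carry: one A value and one B value are staged in registers
    // before every multiply-accumulate of the inner product.
    const int row = threadIdx.y, col = threadIdx.x;
    double acc = 0.0;
    for (int k = 0; k < K; ++k) {
        double a_reg = a_shared[row * K + k];
        double b_reg = b_shared[k * Ntile + col];
        acc += a_reg * b_reg;
    }
    C[(blockIdx.y * Mtile + row) * N + blockIdx.x * Ntile + col] = acc;
}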
For N×N×N matrix task decomposition, each thread computes the starting address of its sub-block from its x and y coordinates, first computing the memory-address offsets according to the four-part division:
rowOffset=threadIdx.y*tileN*N
colOffset=threadIdx.x*tileN
resRowOffset=(blockDim.y*tileN+threadIdx.y)*N
resColOffset=blockDim.x*tileN+threadIdx.x
The starting memory addresses of the four parts are obtained by combining these four offsets. Matrix blocks are then allocated cyclically among the thread groups until all computations are complete, the number of blocks computed simultaneously in each round being the global number of thread groups, i.e., the extent of the global z-axis coordinate. The specific algorithm flow is sketched below:
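A hedged skeleton of this flow is given below. Only part (1), the upper-left block, is written out; the three residual parts follow the res* offsets and the enumeration given next, and the packed operand layout is an assumption of the sketch.

// One N x N x N product per chunk; thread groups (indexed along z) cycle
// over all chunks. Operands are assumed packed as totalChunkNum consecutive
// N x N matrices. tileN is the per-thread sub-block edge.
__global__ void matmul_NxNxN(const double* A, const double* B, double* C,
                             int N, int totalChunkNum, int tileN) {
    const int gridGroupId = blockDim.z * blockIdx.z + threadIdx.z;
    const int numGroups = gridDim.z * blockDim.z;    // groups computing at once
    for (int chunk = gridGroupId; chunk < totalChunkNum; chunk += numGroups) {
        const double* Ae = A + (size_t)chunk * N * N;
        const double* Be = B + (size_t)chunk * N * N;
        double*       Ce = C + (size_t)chunk * N * N;
        // Part (1), the upper-left square: one tileN x tileN sub-block per thread.
        const int rowOffset = threadIdx.y * tileN;
        const int colOffset = threadIdx.x * tileN;
        for (int i = 0; i < tileN; ++i)
            for (int j = 0; j < tileN; ++j) {
                double s = 0.0;
                for (int k = 0; k < N; ++k)
                    s += Ae[(rowOffset + i) * N + k] * Be[k * N + colOffset + j];
                Ce[(rowOffset + i) * N + colOffset + j] = s;
            }
        // Parts (2)-(4): the right-edge, bottom-edge, and corner residues are
        // handled with resRowOffset/resColOffset and are omitted here.
    }
}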
The four matrix parts produced by the decomposition in this example are:
(1) The upper-left part is a square matrix of edge tileN·blockDim.x, which is distributed uniformly among the threads of the thread group; each sub-block has size tileN×tileN.
(2) The upper-right part is a tall, narrow matrix of size (tileN·blockDim.y)×res, with sub-blocks of size tileN×1; res·blockDim.y threads of the thread group take part in the computation.
(3) The lower-left part is a short, wide matrix of size res×(tileN·blockDim.x), with sub-blocks of size 1×tileN; res·blockDim.x threads of the thread group take part in the computation.
(4) The lower-right part is a square matrix of edge res, which involves little computation, so the computation of each element is assigned to one thread of the thread group.
The algorithm flows of the N²×N×N and N×N×N² types are similar, differing only in the block-prefetch scheme when the block product is finally computed. The task decomposition flow for N ∈ [8,24] is given here, using the N²×N×N type as an example:
This example adopts a 3D thread structure: different thread groups are distinguished by their z coordinate, and blocks are allocated to them cyclically for computation. Since the number of matrix rows in this case is at least 64, enough to make full use of one wavefront, the matrix need no longer be subdivided within a thread group. For each thread group, the calculation flow is as follows:
(1) The offset chunkOffset of the responsible block is obtained from the global number of the thread group.
chunkOffset = elementId*N³ + chunkId*chunkSize
(2) The responsible blocks A_chunk, B_chunk, and C_chunk are obtained from this offset.
(3) Each thread uses its rowOffset to map its portion of A_chunk and C_chunk to the corresponding shared-memory addresses A_tile and C_tile, A_tile for prefetching and C_tile for pre-storing results.
(4) The corresponding result C_tile is computed in shared memory. A sketch of this flow follows.
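The sketch below assumes one thread group of 64 threads per block (blockDim.z == 1) and a packed operand layout; chunksPerElement stands for elementChunkNum, and the handling of leftover rows when N² is not a multiple of 64 is omitted.

// Per grid element: the A block is N^2 x N, the B block is N x N, and the
// C block is N^2 x N. Each thread group processes one 64 x N chunk of rows;
// each of its 64 threads owns one result row. Launch with dim3 block(64, 1, 1)
// and 2 * 64 * N * sizeof(double) bytes of dynamic shared memory.
__global__ void matmul_N2xNxN(const double* A, const double* B, double* C,
                              int N, int Nel, int chunksPerElement) {
    extern __shared__ double smem[];
    double* A_tile = smem;               // prefetched 64 x N chunk of A
    double* C_tile = smem + 64 * N;      // pre-stored 64 x N chunk of C
    const int chunkSize = 64 * N;

    const int group = blockDim.z * blockIdx.z + threadIdx.z;  // global group number
    const int elementId = group / chunksPerElement;
    const int chunkId = group % chunksPerElement;
    if (elementId >= Nel) return;
    // (1) offset of the responsible chunk inside the batched arrays
    const size_t chunkOffset =
        (size_t)elementId * N * N * N + (size_t)chunkId * chunkSize;

    const int row = threadIdx.x;         // one result row per thread
    // (3) block prefetch of A: the wavefront strides through the chunk so
    // that each round reads 64 consecutive global addresses.
    for (int i = row; i < chunkSize; i += 64)
        A_tile[i] = A[chunkOffset + i];
    __syncthreads();

    // (4) read one element of A and one row of B at a time, reusing the B row
    // across the whole thread group; partial results live in C_tile.
    for (int j = 0; j < N; ++j) C_tile[row * N + j] = 0.0;
    for (int k = 0; k < N; ++k) {
        const double a = A_tile[row * N + k];
        const double* Brow = B + (size_t)elementId * N * N + (size_t)k * N;
        for (int j = 0; j < N; ++j) C_tile[row * N + j] += a * Brow[j];
    }
    for (int j = 0; j < N; ++j)
        C[chunkOffset + (size_t)row * N + j] = C_tile[row * N + j];
}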
While the invention has been described with reference to preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (4)

1. A heterogeneous parallel Helmholtz operator construction method for DCU clusters, characterized by comprising the following:
(1) Blocked matrix multiplication memory-access optimization: on the basis of blocked matrix multiplication, when computing the different result sub-matrices Ctile, the sub-matrices Atile and Btile are read in full into the shared memory space, reducing the number of global memory accesses and lowering memory-access latency;
for the multiplication of the sub-matrices Atile and Btile, registers are introduced to further improve the efficiency of data reuse; each time the registers are used, one row of Atile and one column of Btile are read and used to compute part of Ctile, achieving memory-access optimization;
(2) Task decomposition algorithm design:
for the matrices generated during Helmholtz operator construction in a DCU environment, the matrix size depends on the polynomial interpolation order of the spectral element method;
the matrix shapes fall into three cases, N×N×N, N²×N×N, and N×N×N², specifically as follows:
1) For the N×N×N case: a single matrix multiplication is treated as one block and distributed among thread groups;
2) For the N²×N×N case: the result matrix C has size N²×N; when N<8, the matrix is split into N matrix multiplications of size N×N×N and task decomposition follows the method of case 1); when N ∈ [8,24], the matrix has at least 64 rows, the block size is adjusted to 64*N, and each thread's load is N elements;
3) For the N×N×N² case: a blocking idea similar to that of the N²×N matrix is adopted; when N<8, the matrix is split into N matrix multiplications of size N×N×N and task decomposition follows the method of case 1); when N ∈ [8,24], the number of columns of the result matrix C far exceeds the number of rows, the block size becomes N*64, the matrix is divided into multiple blocks, and the results within a block are assigned to threads by column; under this block shape the data of each sub-matrix is no longer contiguous in the original matrix, the 64 elements of a given row lie at consecutive memory addresses, and in each round the 64 threads of a wavefront read 64 elements at consecutive addresses.
2. The heterogeneous parallel Helmholtz operator construction method for DCU clusters according to claim 1, characterized in that the registers in step (1) have size tile*tile.
3. The heterogeneous parallel Helmholtz operator construction method for DCU clusters according to claim 1, characterized in that, for case 1), each grid element generates N blocks during Helmholtz operator construction; since the matrix dimension N may not be evenly divisible by the thread-group dimension, and to avoid load imbalance within a thread group, each block is decomposed into four parts for computation, with the following decomposition flow:
① Compute the number of blocks elementChunkNum generated by a single grid element and the total number of blocks totalChunkNum:
elementChunkNum=N
totalChunkNum=Nel*elementChunkNum
② The thread's global z-axis coordinate is the global number of the thread group it belongs to, from which the number of the responsible block is obtained:
gridGroupId=blockDim.z*blockIdx.z+threadIdx.z
③ The block is decomposed into four parts that are computed in sequence; each part is processed by the threads of a thread group, and each thread obtains the starting address of its sub-block of Ctile from its x, y coordinates and the sub-block size, for writing the computed result.
4. The heterogeneous parallel Helmholtz operator construction method for DCU clusters according to claim 1, characterized in that case 2) specifically comprises the following:
① Compute the number of blocks elementChunkNum generated by a single grid element and the total number of blocks totalChunkNum; if N² is not a multiple of 64, the leftover rows are likewise assigned to a thread group for computation:
totalChunkNum=Nel*elementChunkNum
② Obtain the responsible block from the thread group's global number, each thread in the group computing one row of elements; read one element of the A matrix and one row of the B matrix at a time to achieve data reuse within the block; for memory-access coalescing, use block prefetching for the A matrix, reading a 64*N block for one wavefront to access.
CN202411402685.2A 2024-10-09 2024-10-09 Heterogeneous parallel Helmholtz operator construction method for DCU clusters Pending CN119356862A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411402685.2A CN119356862A (en) 2024-10-09 2024-10-09 Heterogeneous parallel Helmholtz operator construction method for DCU clusters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411402685.2A CN119356862A (en) 2024-10-09 2024-10-09 Heterogeneous parallel Helmholtz operator construction method for DCU clusters

Publications (1)

Publication Number Publication Date
CN119356862A (en) 2025-01-24

Family

ID=94299729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411402685.2A Pending CN119356862A (en) 2024-10-09 2024-10-09 Heterogeneous parallel Helmholtz operator construction method for DCU clusters

Country Status (1)

Country Link
CN (1) CN119356862A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination