
CN106528054A - GPU (Graphics Processing Unit) accelerated dense vector addition computing method - Google Patents

GPU (Graphics Processing Unit) accelerated dense vector addition computing method

Info

Publication number
CN106528054A
Authority
CN
China
Prior art keywords
vector
gpu
thread
threads
kernel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610955722.1A
Other languages
Chinese (zh)
Inventor
邹风华
琚天鹏
李鸣
李一鸣
郝韶航
周赣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201610955722.1A
Publication of CN106528054A
Status: Pending

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a GPU-accelerated dense vector addition calculation method. The method is suitable for accelerating the addition of dense vectors: A+B=C, where A and B are the vectors to be added and C is the result vector. The specific steps of the algorithm are: the CPU generates the data required to compute the vector addition on the GPU; the CPU then transfers this data to the GPU; the task of computing A+B is assigned to GPU threads; and the kernel function that performs the vector addition is executed on the GPU. In the present invention, the main tasks of the CPU are to generate and transfer data and to schedule the main program, while the vector addition itself is completed entirely by the GPU kernel function; the highly parallel hardware of the GPU greatly improves the speed of dense vector addition.

Description

GPU-Accelerated Dense Vector Addition Calculation Method

Technical Field

The invention belongs to the field of high-performance computing applications in power systems, and in particular relates to a GPU-accelerated dense vector addition calculation method.

Background Art

A graphics processing unit (GPU) far exceeds a CPU in the number of processing units and is a many-core parallel processor. Traditionally, the GPU was responsible only for graphics rendering, while most other processing was handled by the CPU. Modern GPUs have developed into multi-core, multi-threaded, programmable processors with powerful computing capability and extremely high memory bandwidth. In the general-purpose computing model, the GPU works as a coprocessor of the CPU and completes high-performance computation through reasonable task allocation and decomposition.

The computation of dense vectors is inherently parallel: the calculations at corresponding positions of the vectors are independent of one another, with no data dependencies, so they can naturally be processed in parallel and are well suited to GPU acceleration.

Such operations can be accomplished through reasonable scheduling between the CPU and the GPU.

Summary of the Invention

Purpose of the invention: To overcome the deficiencies of the prior art, the present invention provides a GPU-accelerated dense vector addition calculation method that resolves the technical defect that dense vector addition is time-consuming.

Technical solution: To achieve the above purpose, the technical solution of the present invention is as follows:

A GPU-accelerated dense vector addition calculation method, suitable for accelerating the addition of dense vectors: A+B=C, where A and B are the vectors to be added and C is the result vector. The method comprises:

(1) The CPU allocates the data storage space required for GPU computation and transfers the data required for GPU computation to the GPU;

(2) The task of adding the corresponding elements of vectors A and B is distributed across a large number of threads on the GPU;

(3) The kernel function Kernel_plus, which performs the vector addition, is executed on the GPU.

In step (1), the CPU prepares the data required by the GPU kernel function as follows: vectors A, B, and C are all stored in array format, and the data comprise the number of vector elements n, the input vectors DataA and DataB, and the result vector DataC.
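
Although the patent does not give source code, step (1) maps onto standard CUDA memory management. The following host-side sketch is illustrative only, assuming single-precision (float) elements; the function name prepare_device_data and the h_/d_ naming convention are assumptions, not part of the patent.

#include <cuda_runtime.h>

/* Allocate device storage for DataA, DataB and DataC, and copy the
   inputs from host (CPU) memory to device (GPU) memory. */
void prepare_device_data(const float *h_A, const float *h_B, int n,
                         float **d_A, float **d_B, float **d_C)
{
    size_t bytes = (size_t)n * sizeof(float);

    cudaMalloc((void **)d_A, bytes);   /* space for vector A (DataA) */
    cudaMalloc((void **)d_B, bytes);   /* space for vector B (DataB) */
    cudaMalloc((void **)d_C, bytes);   /* space for the result C (DataC) */

    cudaMemcpy(*d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(*d_B, h_B, bytes, cudaMemcpyHostToDevice);
}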

In step (2), the task of adding the n elements of vectors A and B is distributed across a large number of GPU threads; that is, each thread of the kernel function is responsible for adding one pair of corresponding elements of vectors A and B, e.g. DataA[1]+DataB[1]=DataC[1].

In step (3), the kernel function that performs the vector addition is defined as Kernel_plus<Nblocks, Nthreads>. The thread block size is fixed at 128, i.e. Nthreads = 128; the number of thread blocks Nblocks is n/128, so the total number of threads is n. The kernel function Kernel_plus<Nblocks, Nthreads> is invoked to compute the vector addition.

The computation flow of the kernel function Kernel_plus<Nblocks, Nthreads> is:

(4.1) CUDA automatically assigns each thread a thread block index blockID and a within-block thread index threadID;

(4.2) blockID is assigned to the variable tid and threadID is assigned to the variable k; thread k within thread block tid is then identified by tid and k;

(4.3) The global index is computed as t = tid*128 + k;

(4.4) Thread k in thread block tid performs the addition of the t-th elements of vectors A and B: Ct = At + Bt, where At, Bt, and Ct are the t-th elements of vectors A, B, and C, respectively.
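
As a concrete illustration, one way such a kernel could be written in CUDA C is sketched below. blockIdx.x and threadIdx.x play the roles of blockID and threadID from steps (4.1)-(4.2). The float element type and the bounds check are assumptions added here for robustness; the patent itself only specifies the indexing scheme.

/* One thread per element: thread k of block tid computes C[t] = A[t] + B[t]
   with t = tid*128 + k, mirroring steps (4.1)-(4.4). */
__global__ void Kernel_plus(const float *A, const float *B, float *C, int n)
{
    int tid = blockIdx.x;        /* (4.2) thread block index blockID */
    int k   = threadIdx.x;       /* (4.2) within-block thread index threadID */
    int t   = tid * 128 + k;     /* (4.3) global element index */

    if (t < n)                   /* guard for n not a multiple of 128 */
        C[t] = A[t] + B[t];      /* (4.4) Ct = At + Bt */
}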

Beneficial effects: Compared with the prior art, the CPU generates the data required to compute the vector addition on the GPU and then transfers this data to the GPU; the vector addition task A+B=C is distributed among GPU threads; and the kernel function that performs the vector addition is executed on the GPU. This greatly reduces the computation time of dense vector addition.

Brief Description of the Drawings

Figure 1 is a schematic flowchart of the present invention.

Detailed Description

The present invention is further described below in conjunction with the accompanying drawings.

As shown in Figure 1, the present invention is a GPU-accelerated dense vector addition calculation method, suitable for accelerating the addition of dense vectors: A+B=C, where A and B are the vectors to be added and C is the result vector. The method comprises:

(1) The CPU allocates the data storage space required for GPU computation, generates the data required to compute the vector addition on the GPU, and transfers this data to the GPU;

(2) The task of adding the corresponding elements of vectors A and B is distributed across a large number of threads on the GPU;

(3) The kernel function Kernel_plus, which performs the vector addition, is executed on the GPU.

In step (1), the CPU prepares the data required by the GPU kernel function as follows: vectors A, B, and C are all stored in array format, and the data comprise the number of vector elements n, the input vectors DataA and DataB, and the result vector DataC.

In step (2), each thread of the kernel function is responsible for adding one pair of corresponding elements of vectors A and B, e.g. DataA[1]+DataB[1]=DataC[1].

In step (3), the kernel function that performs the vector addition is defined as Kernel_plus<Nblocks, Nthreads>. The thread block size is fixed at 128, i.e. Nthreads = 128; the number of thread blocks Nblocks is n/128, so the total number of threads is n. The kernel function Kernel_plus<Nblocks, Nthreads> is invoked to compute the vector addition. The computation flow of the kernel function Kernel_plus<Nblocks, Nthreads> is:

(4.1) CUDA automatically assigns each thread a thread block index blockID and a within-block thread index threadID;

(4.2) blockID is assigned to the variable tid and threadID is assigned to the variable k; thread k within thread block tid is then identified by tid and k;

(4.3) The global index is computed as t = tid*128 + k;

(4.4) Thread k in thread block tid performs the addition of the t-th elements of vectors A and B: Ct = At + Bt, where At, Bt, and Ct are the t-th elements of vectors A, B, and C, respectively.
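
For completeness, a hypothetical end-to-end host program using the Kernel_plus sketch above might look as follows. One practical caveat: Nblocks = n/128 covers every element only when n is an exact multiple of 128, so this sketch rounds the grid size up to (n + 127)/128 and relies on the in-kernel bounds check; that safeguard is an assumption of the sketch, not part of the patent.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 20;                       /* example vector length */
    size_t bytes = (size_t)n * sizeof(float);

    /* Step (1): generate the input data on the CPU. */
    float *h_A = (float *)malloc(bytes);
    float *h_B = (float *)malloc(bytes);
    float *h_C = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    /* Step (1): allocate device memory and transfer the inputs. */
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);
    cudaMalloc((void **)&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    /* Steps (2)-(3): one thread per element, 128 threads per block. */
    int Nthreads = 128;
    int Nblocks  = (n + Nthreads - 1) / Nthreads;  /* rounded up (assumption) */
    Kernel_plus<<<Nblocks, Nthreads>>>(d_A, d_B, d_C, n);

    /* Copy the result back and spot-check it. */
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", h_C[0]);                 /* expect 3.000000 */

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}

Compiled with nvcc, this host program launches one GPU thread per vector element, which is exactly the task assignment described above.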

The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (4)

1. A GPU-accelerated dense vector addition calculation method, suitable for accelerating the addition of dense vectors: A+B=C, where A and B are the vectors to be added and C is the result vector, characterized in that the method comprises:

(1) the CPU allocates the data storage space required for GPU computation and transfers the data required for GPU computation to the GPU;

(2) the task of adding the corresponding elements of vectors A and B is distributed across a large number of threads on the GPU;

(3) the kernel function Kernel_plus, which performs the vector addition, is executed on the GPU.

2. The GPU-accelerated dense vector addition calculation method according to claim 1, characterized in that in step (1) the CPU prepares the data required by the GPU kernel function as follows: vectors A, B, and C are all stored in array format, and the data comprise the number of vector elements n, the input vectors DataA and DataB, and the result vector DataC.

3. The GPU-accelerated dense vector addition calculation method according to claim 1, characterized in that in step (2) the task of adding the n elements of vectors A and B is distributed across a large number of GPU threads; that is, each thread of the kernel function is responsible for adding one pair of corresponding elements of vectors A and B, e.g. DataA[1]+DataB[1]=DataC[1].

4. The GPU-accelerated dense vector addition calculation method according to claim 1, characterized in that in step (3) the kernel function that performs the vector addition is defined as Kernel_plus<Nblocks, Nthreads>; the thread block size is fixed at 128, i.e. Nthreads = 128; the number of thread blocks Nblocks is n/128; the total number of threads is n; and the kernel function Kernel_plus<Nblocks, Nthreads> is invoked to compute the vector addition;

the computation flow of the kernel function Kernel_plus<Nblocks, Nthreads> is:

(4.1) CUDA automatically assigns each thread a thread block index blockID and a within-block thread index threadID;

(4.2) blockID is assigned to the variable tid and threadID is assigned to the variable k; thread k within thread block tid is then identified by tid and k;

(4.3) the global index is computed as t = tid*128 + k;

(4.4) thread k in thread block tid performs the addition of the t-th elements of vectors A and B: Ct = At + Bt, where At, Bt, and Ct are the t-th elements of vectors A, B, and C, respectively.
CN201610955722.1A 2016-11-03 2016-11-03 GPU (Graphics Processing Unit) accelerated dense vector addition computing method Pending CN106528054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610955722.1A CN106528054A (en) 2016-11-03 2016-11-03 GPU (Graphics Processing Unit) accelerated dense vector addition computing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610955722.1A CN106528054A (en) 2016-11-03 2016-11-03 GPU (Graphics Processing Unit) accelerated dense vector addition computing method

Publications (1)

Publication Number Publication Date
CN106528054A 2017-03-22

Family

ID=58325470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610955722.1A Pending CN106528054A (en) 2016-11-03 2016-11-03 GPU (Graphics Processing Unit) accelerated dense vector addition computing method

Country Status (1)

Country Link
CN (1) CN106528054A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080208942A1 (en) * 2007-02-23 2008-08-28 Nara Won Parallel Architecture for Matrix Transposition
CN101937425A (en) * 2009-07-02 2011-01-05 北京理工大学 Matrix Parallel Transpose Method Based on GPU Many-Core Platform
US8984043B2 (en) * 2009-12-23 2015-03-17 Intel Corporation Multiplying and adding matrices
CN103543989A (en) * 2013-11-11 2014-01-29 镇江中安通信科技有限公司 Adaptive parallel processing method aiming at variable length characteristic extraction for big data
CN105574809A (en) * 2015-12-16 2016-05-11 天津大学 Matrix exponent-based parallel calculation method for electromagnetic transient simulation graphic processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
我是郭俊辰: "CUDA编程入门:向量加法和矩阵乘法" [Introduction to CUDA Programming: Vector Addition and Matrix Multiplication], https://blog.csdn.net/u014030117/article/details/45952971 *

Similar Documents

Publication Publication Date Title
CN112330523B (en) Computational optimization for low-precision machine learning operations
TWI656481B (en) Method,computer-readable medium and system associated with merge-based parallelized consumption of sequences
US11106261B2 (en) Optimal operating point estimator for hardware operating under a shared power/thermal constraint
CN110032395B (en) Unified register file for improved resource utilization
US20200234129A1 (en) Techniques for removing masks from pruned neural networks
US11043028B2 (en) Reducing level of detail of a polygon mesh to decrease a complexity of rendered geometry within a scene
US12141689B2 (en) Data compression for a neural network
CN109993683A (en) Machine Learning Sparse Computation Mechanisms for Arbitrary Neural Networks, Arithmetic Computation Microarchitectures for Training Mechanisms, and Sparsity
EP3742350A1 (en) Parallelization strategies for training a neural network
CN105183562B (en) A method of rasterizing data are carried out based on CUDA technologies to take out rank
US9235512B2 (en) System, method, and computer program product for graphics processing unit (GPU) demand paging
US12189571B2 (en) Dual pipeline parallel systolic array
DE102022130536A1 (en) SELF-VOCING THREAD DISTRIBUTION POLICY
CN102591709A (en) Shapefile master-slave type parallel writing method based on OGR (open geospatial rule)
EP4109303A1 (en) Using sparsity metadata to reduce systolic array power consumption
US20220309017A1 (en) Multi-format graphics processing unit docking board
CN104615584B (en) The method for solving vectorization calculating towards GPDSP extensive triangular linear equation group
JP5238876B2 (en) Information processing apparatus and information processing method
WO2021147567A1 (en) Convolutional operation method and chip
DE102024102284A1 (en) SCALARIZING INSTRUCTIONS FOR SIMT ARCHITECTURES
US20240111532A1 (en) Lock-free unordered in-place compaction
CN103577160A (en) Characteristic extraction parallel-processing method for big data
CN106528054A (en) GPU (Graphics Processing Unit) accelerated dense vector addition computing method
CN113222099A (en) Convolution operation method and chip
Kasagi et al. An Efficient Technique for Large Mini-batch Challenge of DNNs Training on Large Scale Cluster

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170322