
CN106528054A - GPU (Graphics Processing Unit) accelerated dense vector addition computing method - Google Patents

GPU (Graphics Processing Unit) accelerated dense vector addition computing method

Info

Publication number
CN106528054A
Authority
CN
China
Prior art keywords
vector
gpu
thread
threads
kernel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610955722.1A
Other languages
Chinese (zh)
Inventor
邹风华
琚天鹏
李鸣
李一鸣
郝韶航
周赣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201610955722.1A
Publication of CN106528054A
Status: Pending

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a GPU-accelerated dense vector addition calculation method. The method is suitable for accelerating the addition of dense vectors: A+B=C, where A and B are the vectors to be added and C is the result vector. The specific steps of the algorithm are: the CPU generates the data required to compute the vector addition on the GPU; the CPU then transfers this data to the GPU; the task of computing A+B is assigned to GPU threads; and the kernel function that performs the vector addition is executed on the GPU. In the present invention, the main tasks of the CPU are to generate and transfer data and to schedule the main program, while the vector addition itself is completed entirely by the GPU kernel function; the highly parallel hardware of the GPU greatly improves the speed of dense vector addition.

Description

GPU-Accelerated Dense Vector Addition Calculation Method

Technical Field

The invention belongs to the field of high-performance computing applications in power systems, and in particular relates to a GPU-accelerated dense vector addition calculation method.

Background Art

A graphics processing unit (GPU) far exceeds a CPU in the number of processing units and is a many-core parallel processor. Traditionally, the GPU was responsible only for graphics rendering, while most other processing was handled by the CPU. Modern GPUs have developed into multi-core, multi-threaded, programmable processors with powerful computing capability and extremely high memory bandwidth. In the general-purpose computing model, the GPU works as a coprocessor of the CPU and completes high-performance computation through reasonable task allocation and decomposition.

The computation of dense vectors is inherently parallel: the calculations at corresponding positions of the vectors are independent of one another, with no data dependencies, so they can naturally be processed in parallel and are well suited to GPU acceleration.

Such operations can be accomplished through reasonable scheduling between the CPU and the GPU.

Summary of the Invention

Purpose of the invention: To overcome the deficiencies of the prior art, the present invention provides a GPU-accelerated dense vector addition calculation method that resolves the technical defect that dense vector addition is time-consuming.

Technical solution: To achieve the above purpose, the technical solution of the present invention is as follows:

A GPU-accelerated dense vector addition calculation method, suitable for accelerating the addition of dense vectors: A+B=C, where A and B are the vectors to be added and C is the result vector. The method comprises:

(1) The CPU allocates the data storage space required for GPU computation and transfers the data required for GPU computation to the GPU;

(2) The task of adding the corresponding elements of vectors A and B is distributed across a large number of threads on the GPU;

(3) The kernel function Kernel_plus, which performs the vector addition, is executed on the GPU.

In step (1), the CPU prepares the data required by the GPU kernel function as follows: vectors A, B, and C are all stored in array format, and the data comprise the number of vector elements n, the input vectors DataA and DataB, and the result vector DataC.
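
Although the patent does not give source code, step (1) maps onto standard CUDA memory management. The following host-side sketch is illustrative only, assuming single-precision (float) elements; the function name prepare_device_data and the h_/d_ naming convention are assumptions, not part of the patent.

#include <cuda_runtime.h>

/* Allocate device storage for DataA, DataB and DataC, and copy the
   inputs from host (CPU) memory to device (GPU) memory. */
void prepare_device_data(const float *h_A, const float *h_B, int n,
                         float **d_A, float **d_B, float **d_C)
{
    size_t bytes = (size_t)n * sizeof(float);

    cudaMalloc((void **)d_A, bytes);   /* space for vector A (DataA) */
    cudaMalloc((void **)d_B, bytes);   /* space for vector B (DataB) */
    cudaMalloc((void **)d_C, bytes);   /* space for the result C (DataC) */

    cudaMemcpy(*d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(*d_B, h_B, bytes, cudaMemcpyHostToDevice);
}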

In step (2), the task of adding the n elements of vectors A and B is distributed across a large number of GPU threads; that is, each thread of the kernel function is responsible for adding one pair of corresponding elements of vectors A and B, e.g. DataA[1]+DataB[1]=DataC[1].

In step (3), the kernel function that performs the vector addition is defined as Kernel_plus<Nblocks, Nthreads>. The thread block size is fixed at 128, i.e. Nthreads = 128; the number of thread blocks Nblocks is n/128, so the total number of threads is n. The kernel function Kernel_plus<Nblocks, Nthreads> is invoked to compute the vector addition.

The computation flow of the kernel function Kernel_plus<Nblocks, Nthreads> is:

(4.1) CUDA automatically assigns each thread a thread block index blockID and a within-block thread index threadID;

(4.2) blockID is assigned to the variable tid and threadID is assigned to the variable k; thread k within thread block tid is then identified by tid and k;

(4.3) The global index is computed as t = tid*128 + k;

(4.4) Thread k in thread block tid performs the addition of the t-th elements of vectors A and B: Ct = At + Bt, where At, Bt, and Ct are the t-th elements of vectors A, B, and C, respectively.
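
As a concrete illustration, one way such a kernel could be written in CUDA C is sketched below. blockIdx.x and threadIdx.x play the roles of blockID and threadID from steps (4.1)-(4.2). The float element type and the bounds check are assumptions added here for robustness; the patent itself only specifies the indexing scheme.

/* One thread per element: thread k of block tid computes C[t] = A[t] + B[t]
   with t = tid*128 + k, mirroring steps (4.1)-(4.4). */
__global__ void Kernel_plus(const float *A, const float *B, float *C, int n)
{
    int tid = blockIdx.x;        /* (4.2) thread block index blockID */
    int k   = threadIdx.x;       /* (4.2) within-block thread index threadID */
    int t   = tid * 128 + k;     /* (4.3) global element index */

    if (t < n)                   /* guard for n not a multiple of 128 */
        C[t] = A[t] + B[t];      /* (4.4) Ct = At + Bt */
}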

Beneficial effects: Compared with the prior art, the CPU generates the data required to compute the vector addition on the GPU and then transfers this data to the GPU; the vector addition task A+B=C is distributed among GPU threads; and the kernel function that performs the vector addition is executed on the GPU. This greatly reduces the computation time of dense vector addition.

Brief Description of the Drawings

Figure 1 is a schematic flowchart of the present invention.

Detailed Description

The present invention is further described below in conjunction with the accompanying drawings.

As shown in Figure 1, the present invention is a GPU-accelerated dense vector addition calculation method, suitable for accelerating the addition of dense vectors: A+B=C, where A and B are the vectors to be added and C is the result vector. The method comprises:

(1) The CPU allocates the data storage space required for GPU computation, generates the data required to compute the vector addition on the GPU, and transfers this data to the GPU;

(2) The task of adding the corresponding elements of vectors A and B is distributed across a large number of threads on the GPU;

(3) The kernel function Kernel_plus, which performs the vector addition, is executed on the GPU.

In step (1), the CPU prepares the data required by the GPU kernel function as follows: vectors A, B, and C are all stored in array format, and the data comprise the number of vector elements n, the input vectors DataA and DataB, and the result vector DataC.

In step (2), each thread of the kernel function is responsible for adding one pair of corresponding elements of vectors A and B, e.g. DataA[1]+DataB[1]=DataC[1].

In step (3), the kernel function that performs the vector addition is defined as Kernel_plus<Nblocks, Nthreads>. The thread block size is fixed at 128, i.e. Nthreads = 128; the number of thread blocks Nblocks is n/128, so the total number of threads is n. The kernel function Kernel_plus<Nblocks, Nthreads> is invoked to compute the vector addition. The computation flow of the kernel function Kernel_plus<Nblocks, Nthreads> is:

(4.1) CUDA automatically assigns each thread a thread block index blockID and a within-block thread index threadID;

(4.2) blockID is assigned to the variable tid and threadID is assigned to the variable k; thread k within thread block tid is then identified by tid and k;

(4.3) The global index is computed as t = tid*128 + k;

(4.4) Thread k in thread block tid performs the addition of the t-th elements of vectors A and B: Ct = At + Bt, where At, Bt, and Ct are the t-th elements of vectors A, B, and C, respectively.
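
For completeness, a hypothetical end-to-end host program using the Kernel_plus sketch above might look as follows. One practical caveat: Nblocks = n/128 covers every element only when n is an exact multiple of 128, so this sketch rounds the grid size up to (n + 127)/128 and relies on the in-kernel bounds check; that safeguard is an assumption of the sketch, not part of the patent.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 20;                       /* example vector length */
    size_t bytes = (size_t)n * sizeof(float);

    /* Step (1): generate the input data on the CPU. */
    float *h_A = (float *)malloc(bytes);
    float *h_B = (float *)malloc(bytes);
    float *h_C = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    /* Step (1): allocate device memory and transfer the inputs. */
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);
    cudaMalloc((void **)&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    /* Steps (2)-(3): one thread per element, 128 threads per block. */
    int Nthreads = 128;
    int Nblocks  = (n + Nthreads - 1) / Nthreads;  /* rounded up (assumption) */
    Kernel_plus<<<Nblocks, Nthreads>>>(d_A, d_B, d_C, n);

    /* Copy the result back and spot-check it. */
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", h_C[0]);                 /* expect 3.000000 */

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}

Compiled with nvcc, this host program launches one GPU thread per vector element, which is exactly the task assignment described above.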

The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (4)

1. A GPU-accelerated dense vector addition calculation method, suitable for accelerating the addition of dense vectors: A+B=C, where A and B are the vectors to be added and C is the result vector, characterized in that the method comprises:

(1) the CPU allocates the data storage space required for GPU computation and transfers the data required for GPU computation to the GPU;

(2) the task of adding the corresponding elements of vectors A and B is distributed across a large number of threads on the GPU;

(3) the kernel function Kernel_plus, which performs the vector addition, is executed on the GPU.

2. The GPU-accelerated dense vector addition calculation method according to claim 1, characterized in that in step (1) the CPU prepares the data required by the GPU kernel function as follows: vectors A, B, and C are all stored in array format, and the data comprise the number of vector elements n, the input vectors DataA and DataB, and the result vector DataC.

3. The GPU-accelerated dense vector addition calculation method according to claim 1, characterized in that in step (2) the task of adding the n elements of vectors A and B is distributed across a large number of GPU threads; that is, each thread of the kernel function is responsible for adding one pair of corresponding elements of vectors A and B, e.g. DataA[1]+DataB[1]=DataC[1].

4. The GPU-accelerated dense vector addition calculation method according to claim 1, characterized in that in step (3) the kernel function that performs the vector addition is defined as Kernel_plus<Nblocks, Nthreads>; the thread block size is fixed at 128, i.e. Nthreads = 128; the number of thread blocks Nblocks is n/128; the total number of threads is n; and the kernel function Kernel_plus<Nblocks, Nthreads> is invoked to compute the vector addition;

the computation flow of the kernel function Kernel_plus<Nblocks, Nthreads> is:

(4.1) CUDA automatically assigns each thread a thread block index blockID and a within-block thread index threadID;

(4.2) blockID is assigned to the variable tid and threadID is assigned to the variable k; thread k within thread block tid is then identified by tid and k;

(4.3) the global index is computed as t = tid*128 + k;

(4.4) thread k in thread block tid performs the addition of the t-th elements of vectors A and B: Ct = At + Bt, where At, Bt, and Ct are the t-th elements of vectors A, B, and C, respectively.
CN201610955722.1A 2016-11-03 2016-11-03 GPU (Graphics Processing Unit) accelerated dense vector addition computing method Pending CN106528054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610955722.1A CN106528054A (en) 2016-11-03 2016-11-03 GPU (Graphics Processing Unit) accelerated dense vector addition computing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610955722.1A CN106528054A (en) 2016-11-03 2016-11-03 GPU (Graphics Processing Unit) accelerated dense vector addition computing method

Publications (1)

Publication Number Publication Date
CN106528054A 2017-03-22

Family

ID=58325470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610955722.1A Pending CN106528054A (en) 2016-11-03 2016-11-03 GPU (Graphics Processing Unit) accelerated dense vector addition computing method

Country Status (1)

Country Link
CN (1) CN106528054A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080208942A1 (en) * 2007-02-23 2008-08-28 Nara Won Parallel Architecture for Matrix Transposition
CN101937425A (en) * 2009-07-02 2011-01-05 北京理工大学 Matrix Parallel Transpose Method Based on GPU Many-Core Platform
US8984043B2 (en) * 2009-12-23 2015-03-17 Intel Corporation Multiplying and adding matrices
CN103543989A (en) * 2013-11-11 2014-01-29 镇江中安通信科技有限公司 Adaptive parallel processing method aiming at variable length characteristic extraction for big data
CN105574809A (en) * 2015-12-16 2016-05-11 天津大学 Matrix exponent-based parallel calculation method for electromagnetic transient simulation graphic processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
我是郭俊辰: "CUDA编程入门:向量加法和矩阵乘法" [Introduction to CUDA Programming: Vector Addition and Matrix Multiplication], https://blog.csdn.net/u014030117/article/details/45952971 *

Similar Documents

Publication Publication Date Title
CN112330523B (en) Computational optimization for low-precision machine learning operations
TWI656481B (en) Method,computer-readable medium and system associated with merge-based parallelized consumption of sequences
US11106261B2 (en) Optimal operating point estimator for hardware operating under a shared power/thermal constraint
CN110032395B (en) Unified register file for improved resource utilization
US20200234129A1 (en) Techniques for removing masks from pruned neural networks
US11043028B2 (en) Reducing level of detail of a polygon mesh to decrease a complexity of rendered geometry within a scene
US12141689B2 (en) Data compression for a neural network
CN109993683A (en) Machine Learning Sparse Computation Mechanisms for Arbitrary Neural Networks, Arithmetic Computation Microarchitectures for Training Mechanisms, and Sparsity
EP3742350A1 (en) Parallelization strategies for training a neural network
CN105183562B (en) A method of rasterizing data are carried out based on CUDA technologies to take out rank
US9235512B2 (en) System, method, and computer program product for graphics processing unit (GPU) demand paging
US12189571B2 (en) Dual pipeline parallel systolic array
DE102022130536A1 (en) SELF-VOCING THREAD DISTRIBUTION POLICY
CN102591709A (en) Shapefile master-slave type parallel writing method based on OGR (open geospatial rule)
EP4109303A1 (en) Using sparsity metadata to reduce systolic array power consumption
US20220309017A1 (en) Multi-format graphics processing unit docking board
CN104615584B (en) The method for solving vectorization calculating towards GPDSP extensive triangular linear equation group
JP5238876B2 (en) Information processing apparatus and information processing method
WO2021147567A1 (en) Convolutional operation method and chip
DE102024102284A1 (en) SCALARIZING INSTRUCTIONS FOR SIMT ARCHITECTURES
US20240111532A1 (en) Lock-free unordered in-place compaction
CN103577160A (en) Characteristic extraction parallel-processing method for big data
CN106528054A (en) GPU (Graphics Processing Unit) accelerated dense vector addition computing method
CN113222099A (en) Convolution operation method and chip
Kasagi et al. An Efficient Technique for Large Mini-batch Challenge of DNNs Training on Large Scale Cluster

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170322