
CN103294648A - Block matrix multiplication vectorization method supporting vector processor with multiple MAC (multiply accumulate) operational units - Google Patents


Info

Publication number
CN103294648A
Authority
CN
China
Prior art keywords
matrix
submatrix
vector
sub
data
Prior art date
Legal status
Granted
Application number
CN2013101664113A
Other languages
Chinese (zh)
Other versions
CN103294648B (en)
Inventor
刘仲
陈书明
窦强
郭阳
刘衡竹
田希
龚国辉
陈海燕
彭元喜
万江华
刘胜
陈跃跃
扈啸
吴家铸
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201310166411.3A
Publication of CN103294648A
Application granted
Publication of CN103294648B
Legal status: Active

Landscapes

  • Complex Calculations (AREA)

Abstract

A block matrix multiplication vectorization method for a vector processor with multiple MAC (multiply-accumulate) units. The process is: (1) from the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units per VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal sub-matrix block size blocksize, that is, the numbers of columns and rows of the sub-matrices of the multiplier matrix B and the numbers of rows and columns of the sub-matrices of the multiplicand matrix A; (2) divide the vector memory capacity s into two storage areas of equal size, Buffer0 and Buffer1, and perform the sub-matrix multiplications in a ping-pong fashion alternating between Buffer0 and Buffer1 until the whole matrix multiplication is complete. The invention is simple to implement, convenient to operate, increases the parallelism of the vector processor, and improves the computing efficiency of the processor.

Description

Block matrix multiplication vectorization method for vector processors with multiple MAC units

Technical Field

The invention relates to the technical field of data processing, and in particular to a block matrix multiplication vectorization method for vector processors with multiple MAC units.

Background

With the growing demand for high-performance computing in compute-intensive applications such as the solution of large dense linear systems, 4G wireless communications, radar signal processing, high-definition video, and digital image processing, computer architecture has changed significantly and many new architectures have appeared, such as many-core architectures, heterogeneous multi-core architectures, stream processor architectures, and vector processor architectures. These new architectures integrate multiple processor cores on a single chip, each core containing a rich set of functional units, which greatly increases the computing performance of the chip; at the same time, they pose new challenges for software development. Because a large number of existing programs and algorithms were designed for single-core processors, the main difficulty today is how to exploit parallelism at every level and to parallelize and vectorize these application algorithms efficiently for architectures with many cores and many functional units.

"Matrix multiplication" is one of the core kernels of high-performance computing (HPC). It is a typical compute-intensive and memory-intensive application: it places very high demands on the processor's multiply-accumulate (MAC) throughput and memory bandwidth, and its time complexity is high, about O(N³), where N is the matrix size. The traditional triple-loop implementation of matrix multiplication has a low ratio of computation to memory access; cache misses and the overhead of moving matrix data take a large share of the run time, so the processor's computing efficiency is low. Blocked matrix multiplication splits the product of large matrices into a series of sub-matrix products. By choosing the sub-matrix block size reasonably, usually blocksize <= sqrt(M/3) where M is the cache capacity, all data accesses during a sub-matrix product can hit in the cache; reducing the computation time of each sub-matrix product lowers the computation time of the whole large matrix multiplication, which greatly improves the processor's computing efficiency.
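The cache-blocking rule above can be sketched as a small helper (an illustration only; the function name and the 32 KB cache figure are assumptions, not from the patent):

```python
import math

def cache_block_size(cache_bytes: int, elem_bytes: int) -> int:
    """Largest square block edge satisfying blocksize <= sqrt(M/3),
    where M is the cache capacity in elements: three blocksize x blocksize
    tiles (one each of A, B and C) must fit in the cache at once."""
    m_elems = cache_bytes // elem_bytes   # cache capacity M, in elements
    return math.isqrt(m_elems // 3)       # floor(sqrt(M/3))

# e.g. a 32 KB data cache and 4-byte floats: 8192 elements, blocksize 52
```

With 32 KB of cache and 4-byte elements this yields a block edge of 52, since 3 tiles of 52 x 52 single-precision values occupy about 31.7 KB.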

Figure 1 is a schematic diagram of the general structure of a vector processor with multiple MAC units. It comprises a Scalar Processing Unit (SPU) and a Vector Processing Unit (VPU). The SPU handles scalar computation and flow control; the VPU handles vector computation and consists of several Vector Processing Elements (VPEs), each of which contains several MAC units (MAC0, MAC1, ...) as well as other functional units such as an ALU and a BP unit. The SPU and VPU provide data channels for transferring and exchanging data. The vector data access unit supports vector Load/Store and provides a large dedicated vector memory rather than the cache mechanism of a single-core processor, so existing blocked matrix multiplication methods are not suited to this kind of vector processor. There is therefore an urgent need for an efficient blocked matrix multiplication vectorization method for vector processors with multiple MAC units, so that the computing efficiency of the vector processor can be fully exploited.

Summary of the Invention

The technical problem addressed by the invention is this: given the technical problems of the prior art, the invention provides a blocked matrix multiplication vectorization method for vector processors with multiple MAC units that is simple to implement, convenient to operate, increases the parallelism of the vector processor, and improves the computing efficiency of the processor.

To solve the above technical problem, the invention adopts the following technical solution:

A blocked matrix multiplication vectorization method for a vector processor with multiple MAC units, with the following steps:

(1) From the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units per VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal sub-matrix block size blocksize, that is, the numbers of columns and rows of the sub-matrices of the multiplier matrix B and the numbers of rows and columns of the sub-matrices of the multiplicand matrix A;

(2) Divide the vector memory capacity s into two storage areas of equal size, Buffer0 and Buffer1, and perform the sub-matrix multiplications in a ping-pong fashion alternating between Buffer0 and Buffer1, until the whole matrix multiplication is complete.

As a further refinement of the invention:

In step (1), the sub-matrices of the multiplier matrix B have p*m columns and (s/2/2)/(p*m*d) rows; once the block size of the sub-matrices of B has been determined, the block size of the sub-matrices of the multiplicand matrix A is determined: the sub-matrices of A have as many rows and columns as the sub-matrices of B have rows, i.e. (s/2/2)/(p*m*d).

The scalar processing unit (SPU) of the vector processor reads, one after another, each element in each row of the multiplicand sub-matrix A and expands it into a vector; the vector processing unit (VPU) reads row data of the multiplier sub-matrix B (e.g. row B0) and multiply-accumulates it with each such expanded vector. When row A0 of the multiplicand sub-matrix has been traversed, row C0 of the result sub-matrix C has been computed; when all rows of the multiplicand sub-matrix A have been traversed, the computation of the result sub-matrix C is complete.

In step (2), storage area Buffer0 holds the sub-matrices of the multiplier matrix B and of the output result matrix C for the current sub-matrix multiplication; meanwhile the DMA controller moves the sub-matrix data of B needed for the next sub-matrix multiplication into storage area Buffer1, and moves the previous result sub-matrix out to external memory.

Compared with the prior art, the advantages of the invention are: the optimal sub-matrix block size blocksize is determined from the architectural characteristics of the vector processor and the data size of the matrix elements, which effectively raises the processor's ratio of computation to memory access; implementing the sub-matrix multiplications with double-buffered ping-pong effectively overlaps data-transfer time with compute time, reducing the total computation time. The scalar processing unit (SPU) of the vector processor reads the row data of the multiplicand sub-matrix and expands each element into a vector, which the vector processing unit (VPU) multiply-accumulates element-wise with rows of the multiplier sub-matrix read row by row, avoiding column accesses and reduction sums over vector data. These advantages make the method of the invention simple to implement and convenient to operate; it fully exploits the instruction-, data-, and task-level parallelism of the vector processor and raises the processor's computing efficiency above 90%, thus making full use of the high-performance computing capability of vector processors with multiple MAC units.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the general structure of a vector processor with multiple MAC units.

Figure 2 is a flow diagram of the method of the invention.

Figure 3 is a flow diagram of determining the optimal sub-matrix block size from the architectural characteristics of the vector processor in a specific embodiment.

Figure 4 is an operational diagram of a concrete sub-matrix multiplication in the invention.

Figure 5 is a flow diagram of implementing the sub-matrix multiplication with double-buffered ping-pong in a specific embodiment.

Detailed Description

The invention is described in further detail below with reference to the drawings and specific embodiments.

As shown in Figure 2, the blocked matrix multiplication vectorization method of the invention for vector processors with multiple MAC units proceeds as follows:

(1) First, from the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units per VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal sub-matrix block size blocksize, that is, the numbers of columns and rows of the sub-matrices of the multiplier matrix B and the numbers of rows and columns of the sub-matrices of the multiplicand matrix A.

(2) Divide the vector memory capacity s into two storage areas of equal size, Buffer0 and Buffer1, and perform the sub-matrix multiplications in a ping-pong fashion alternating between Buffer0 and Buffer1, until the whole matrix multiplication is complete.

As shown in Figure 3, in a concrete application the optimal sub-matrix block size blocksize is determined from the number p of VPEs of the vector processor, the number m of MAC units per VPE, the vector memory capacity s, and the data size d of a matrix element. The sub-matrices of the multiplier matrix B have p*m columns and (s/2/2)/(p*m*d) rows. Once the block size of the sub-matrices of B is fixed, the block size of the sub-matrices of the multiplicand matrix A follows: the sub-matrices of A have as many rows and columns as the sub-matrices of B have rows, i.e. (s/2/2)/(p*m*d). For example, assume the matrix elements are single-precision floating-point data of size 4 B (bytes), the vector memory capacity is 1024 KB, the number of VPEs is p = 16, and the number of MAC units per VPE is m = 2. Then the sub-matrices of the multiplier matrix B have p*m = 16*2 = 32 columns and (1024*1024/2/2)/(16*2*4) = 2048 rows, and the sub-matrices of the multiplicand matrix A have 2048 rows and 2048 columns.
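The sizing rule can be expressed as a short helper function (a sketch for illustration; the function name is ours, not from the patent), reproducing the worked example above:

```python
def submatrix_dims(p, m, s, d):
    """Sub-matrix shapes per the patent's sizing rule.

    p: number of VPEs; m: MAC units per VPE;
    s: vector memory capacity in bytes; d: element size in bytes.
    B gets p*m columns (one per MAC lane) and (s/2/2)/(p*m*d) rows:
    half of s is one ping-pong buffer, and half of that buffer holds B.
    A's sub-matrix is square, with edge equal to B's row count."""
    b_cols = p * m
    b_rows = (s // 2 // 2) // (b_cols * d)
    return (b_rows, b_rows), (b_cols, b_rows)   # (A rows, A cols), (B cols, B rows)

# Worked example from the text: p=16, m=2, s=1024 KB, d=4 B
a_shape, b_shape = submatrix_dims(16, 2, 1024 * 1024, 4)
# a_shape == (2048, 2048): A is 2048 x 2048
# b_shape == (32, 2048):   B has 32 columns and 2048 rows
```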

As shown in Figure 4, in this embodiment the sub-matrix of the multiplier matrix B has 4 columns and 8 rows, and the sub-matrix of the multiplicand matrix A has 8 rows and 8 columns. The method is as follows: the scalar processing unit (SPU) of the vector processor reads, one after another, each element in each row of the multiplicand sub-matrix A and expands it into a vector; in Figure 4, element a00 of row A0 is expanded into the vector (a00, a00, a00, a00) and element a01 into the vector (a01, a01, a01, a01). The vector processing unit (VPU) reads row B0 of the multiplier sub-matrix B, (b00, b01, b02, b03), and multiply-accumulates it with each element of the expanded vectors. When row A0 of the multiplicand sub-matrix has been traversed, row C0 of the result sub-matrix C, (c00, c01, c02, c03), has been computed. When all rows of the multiplicand sub-matrix A have been traversed, the computation of the result sub-matrix C is complete.
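The row-broadcast scheme described above can be sketched in plain Python (an illustration of the data flow only; on the real processor the inner loop runs across the VPE/MAC lanes in hardware, and the function name is ours):

```python
def block_matmul_row_broadcast(A, B):
    """Multiply two blocks the way the text describes: each element
    a[i][k] (read by the SPU and broadcast across the lanes) is
    multiply-accumulated with row B[k] into result row C[i].
    No column accesses and no cross-lane reduction are needed."""
    n, kdim = len(A), len(A[0])
    cols = len(B[0])
    C = [[0.0] * cols for _ in range(n)]
    for i in range(n):                 # one result row C[i] per row of A
        for k in range(kdim):          # SPU: read scalar a[i][k], broadcast
            a_ik = A[i][k]
            for j in range(cols):      # VPU lanes: MAC into row C[i]
                C[i][j] += a_ik * B[k][j]
    return C

# Toy check, 2x3 times 3x2:
# block_matmul_row_broadcast([[1, 2, 3], [4, 5, 6]],
#                            [[7, 8], [9, 10], [11, 12]])
# -> [[58.0, 64.0], [139.0, 154.0]]
```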

As shown in Figure 5, the blocked matrix multiplication in this embodiment implements the sub-matrix multiplications with double-buffered ping-pong. The vector memory capacity s is divided into two storage areas of equal size, Buffer0 and Buffer1. Buffer0 holds the sub-matrices of the multiplier matrix B and of the output result matrix C for the current sub-matrix multiplication, while the DMA controller moves the sub-matrix data of B needed for the next sub-matrix multiplication into Buffer1 and moves the previous result sub-matrix out to external memory.
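The ping-pong schedule can be sketched as follows (a sequential simulation for illustration; on the real processor the DMA transfers overlap the computation in time, and all names here are ours, not from the patent):

```python
def pingpong_process(blocks, compute, load, store):
    """Double-buffered ping-pong schedule (sketch): while one buffer is
    being computed on, the 'DMA' stands ready to pre-load the next block
    into the other buffer and to drain the previous result.  load, compute
    and store are plain callables standing in for DMA transfers and the
    VPU; this simulation runs them sequentially rather than overlapped."""
    buffers = [None, None]
    results = []
    cur = 0
    buffers[cur] = load(blocks[0])              # prime the first buffer
    for i in range(len(blocks)):
        nxt = cur ^ 1                           # the other buffer
        if i + 1 < len(blocks):
            buffers[nxt] = load(blocks[i + 1])  # DMA: prefetch next block
        results.append(store(compute(buffers[cur])))  # VPU: compute, drain
        cur = nxt                               # swap roles: ping-pong
    return results

# Toy run: "compute" squares each block, load/store pass through.
out = pingpong_process([1, 2, 3], compute=lambda x: x * x,
                       load=lambda b: b, store=lambda r: r)
# out == [1, 4, 9]
```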

In summary, the blocked matrix multiplication vectorization method for vector processors with multiple MAC units implemented by the invention determines the optimal sub-matrix block size blocksize from the architectural characteristics of the vector processor. Implementing the sub-matrix multiplications with double-buffered ping-pong effectively overlaps data-transfer time with compute time and reduces the total computation time. The scalar processing unit (SPU) of the vector processor reads the row data of the multiplicand sub-matrix and expands each element into a vector, which the vector processing unit (VPU) multiply-accumulates element-wise with rows of the multiplier sub-matrix read row by row, avoiding column accesses and reduction sums over vector data. These advantages make the method of the invention simple to implement and convenient to operate; it fully exploits the instruction-, data-, and task-level parallelism of the vector processor and raises the processor's computing efficiency above 90%, making full use of the high-performance computing capability of vector processors with multiple MAC units.

The above are only preferred embodiments of the invention; the scope of protection of the invention is not limited to the above embodiments, and all technical solutions within the spirit of the invention fall within its scope of protection. It should be noted that, for those of ordinary skill in the art, several improvements and refinements made without departing from the principle of the invention shall also be regarded as falling within the scope of protection of the invention.

Claims (4)

1. A blocked matrix multiplication vectorization method for a vector processor with multiple MAC units, characterized in that the process is:
(1) from the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units per VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal sub-matrix block size blocksize, that is, the numbers of columns and rows of the sub-matrices of the multiplier matrix B and the numbers of rows and columns of the sub-matrices of the multiplicand matrix A;
(2) divide the vector memory capacity s into two storage areas of equal size, Buffer0 and Buffer1, and perform the sub-matrix multiplications in a ping-pong fashion alternating between Buffer0 and Buffer1, until the whole matrix multiplication is complete.
2. The method of claim 1, characterized in that in step (1) the sub-matrices of the multiplier matrix B have p*m columns and (s/2/2)/(p*m*d) rows; after the block size of the sub-matrices of B is determined, the block size of the sub-matrices of the multiplicand matrix A is determined: the sub-matrices of A have as many rows and columns as the sub-matrices of B have rows, i.e. (s/2/2)/(p*m*d).
3. The method of claim 2, characterized in that the scalar processing unit (SPU) of the vector processor reads, one after another, each element in each row of the multiplicand sub-matrix A and expands it into a vector; the vector processing unit (VPU) reads row B0 of the multiplier sub-matrix B and multiply-accumulates it with each element of that vector; when row A0 of the multiplicand sub-matrix has been traversed, row C0 of the result sub-matrix C has been computed; when all rows of the multiplicand sub-matrix A have been traversed, the computation of the result sub-matrix C is complete.
4. The method of any one of claims 1 to 3, characterized in that in step (2) storage area Buffer0 holds the sub-matrices of the multiplier matrix B and of the output result matrix C for the current sub-matrix multiplication, while the DMA controller moves the sub-matrix data of B needed for the next sub-matrix multiplication into storage area Buffer1 and moves the previous result sub-matrix out to external memory.
CN201310166411.3A 2013-05-08 2013-05-08 Blocked matrix multiplication vectorization method for a vector processor with multiple MAC units Active CN103294648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310166411.3A CN103294648B (en) 2013-05-08 2013-05-08 Blocked matrix multiplication vectorization method for a vector processor with multiple MAC units


Publications (2)

Publication Number Publication Date
CN103294648A true CN103294648A (en) 2013-09-11
CN103294648B CN103294648B (en) 2016-06-01

Family

ID=49095548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310166411.3A Active CN103294648B (en) Blocked matrix multiplication vectorization method for a vector processor with multiple MAC units

Country Status (1)

Country Link
CN (1) CN103294648B (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461465A (en) * 2014-12-29 2015-03-25 南京大学 High-efficiency controller based on ping-pong operation and method thereof
CN104899182A (en) * 2015-06-09 2015-09-09 中国人民解放军国防科学技术大学 Matrix multiplication acceleration method for supporting variable blocks
CN105426344A (en) * 2015-11-09 2016-03-23 南京大学 Matrix calculation method of distributed large-scale matrix multiplication based on Spark
CN106445471A (en) * 2016-10-13 2017-02-22 北京百度网讯科技有限公司 Processor and method for executing matrix multiplication on processor
CN108509384A (en) * 2017-02-24 2018-09-07 富士通株式会社 Computational methods, information processing unit, calculation procedure and information processing system
CN108805273A (en) * 2018-05-20 2018-11-13 复旦大学 Door control unit accelerates the hardware circuit implementation of operation in a kind of LSTM
CN108985450A (en) * 2018-06-28 2018-12-11 中国人民解放军国防科技大学 Vector processor-oriented convolution neural network operation vectorization method
CN109086075A (en) * 2017-10-30 2018-12-25 上海寒武纪信息科技有限公司 Artificial intelligence process device and the method for executing Matrix Multiplication vector instruction using processor
US10338919B2 (en) 2017-05-08 2019-07-02 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
CN110263296A (en) * 2019-05-18 2019-09-20 南京惟心光电系统有限公司 A kind of matrix-vector multiplier and its operation method based on photoelectricity computing array
US10454680B2 (en) 2016-11-01 2019-10-22 Beijing Baidu Netcom Science And Technology Co., Ltd. RSA decryption processor and method for controlling RSA decryption processor
CN110415157A (en) * 2018-04-26 2019-11-05 华为技术有限公司 A kind of calculation method and device of matrix multiplication
CN111045958A (en) * 2018-10-11 2020-04-21 展讯通信(上海)有限公司 Acceleration engine and processor
CN111737292A (en) * 2020-07-16 2020-10-02 腾讯科技(深圳)有限公司 Data retrieval method and related device
CN111902813A (en) * 2018-03-27 2020-11-06 Sk电信有限公司 Apparatus and method for convolution operation
CN112346852A (en) * 2019-08-06 2021-02-09 脸谱公司 Distributed physical processing of matrix summation operations
CN112446007A (en) * 2019-08-29 2021-03-05 上海华为技术有限公司 Matrix operation method, operation device and processor
CN112948758A (en) * 2021-02-24 2021-06-11 上海商汤智能科技有限公司 Data processing method and device and chip
CN114489496A (en) * 2022-01-14 2022-05-13 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligence accelerator
US11556337B2 (en) 2021-04-12 2023-01-17 Analog Devices International Unlimited Company Parallel matrix multiplication technique optimized for memory fetches
US11990137B2 (en) 2018-09-13 2024-05-21 Shanghai Cambricon Information Technology Co., Ltd. Image retouching method and terminal device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102018110607A1 (en) 2017-05-08 2018-11-08 Nvidia Corporation Generalized acceleration of matrix multiplication and accumulation operations

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7844630B2 (en) * 2007-09-01 2010-11-30 International Business Machines Corporation Method and structure for fast in-place transformation of standard full and packed matrix data formats
CN102214160A (en) * 2011-07-08 2011-10-12 中国科学技术大学 Single-accuracy matrix multiplication optimization method based on loongson chip 3A
CN102446160A (en) * 2011-09-06 2012-05-09 中国人民解放军国防科学技术大学 Matrix multiplication implementation method for double-precision SIMD (Single instruction multiple data) component


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
纪坤, et al.: "Research and Implementation of a Blocked Algorithm for Matrix Triangular Decomposition", Computer Applications and Software *
陈晶, et al.: "Analysis of Distributed Parallel Matrix Multiplication Algorithms", Measurement & Control Technology *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461465A (en) * 2014-12-29 2015-03-25 南京大学 High-efficiency controller based on ping-pong operation and method thereof
CN104899182A (en) * 2015-06-09 2015-09-09 中国人民解放军国防科学技术大学 Matrix multiplication acceleration method for supporting variable blocks
CN104899182B (en) * 2015-06-09 2017-10-31 中国人民解放军国防科学技术大学 A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks
CN105426344A (en) * 2015-11-09 2016-03-23 南京大学 Matrix calculation method of distributed large-scale matrix multiplication based on Spark
CN106445471A (en) * 2016-10-13 2017-02-22 北京百度网讯科技有限公司 Processor and method for executing matrix multiplication on processor
US10140251B2 (en) 2016-10-13 2018-11-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Processor and method for executing matrix multiplication operation on processor
US10454680B2 (en) 2016-11-01 2019-10-22 Beijing Baidu Netcom Science And Technology Co., Ltd. RSA decryption processor and method for controlling RSA decryption processor
CN108509384A (en) * 2017-02-24 2018-09-07 富士通株式会社 Computational methods, information processing unit, calculation procedure and information processing system
CN108509384B (en) * 2017-02-24 2022-04-12 富士通株式会社 Computing method, information processing device, computing program, and information processing system
US10884734B2 (en) 2017-05-08 2021-01-05 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US10338919B2 (en) 2017-05-08 2019-07-02 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
CN109086075A (en) * 2017-10-30 2018-12-25 上海寒武纪信息科技有限公司 Artificial intelligence processor and method for executing matrix multiplication vector instruction by using same
US12050887B2 (en) 2017-10-30 2024-07-30 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US11922132B2 (en) 2017-10-30 2024-03-05 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN109086075B (en) * 2017-10-30 2021-06-08 上海寒武纪信息科技有限公司 Artificial intelligence processor and method for executing matrix multiplication vector instruction by using same
US11762631B2 (en) 2017-10-30 2023-09-19 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN111902813B (en) * 2018-03-27 2024-05-07 Sapeon韩国株式会社 Apparatus and method for convolution operation
CN111902813A (en) * 2018-03-27 2020-11-06 Sk电信有限公司 Apparatus and method for convolution operation
CN110415157A (en) * 2018-04-26 2019-11-05 华为技术有限公司 A kind of calculation method and device of matrix multiplication
CN110415157B (en) * 2018-04-26 2024-01-30 华为技术有限公司 Matrix multiplication calculation method and device
CN108805273A (en) * 2018-05-20 2018-11-13 复旦大学 Door control unit accelerates the hardware circuit implementation of operation in a kind of LSTM
CN108985450B (en) * 2018-06-28 2019-10-29 中国人民解放军国防科技大学 A Vectorization Method for Convolutional Neural Network Operations Oriented to Vector Processors
CN108985450A (en) * 2018-06-28 2018-12-11 中国人民解放军国防科技大学 Vectorization method for convolutional neural network operations oriented to vector processors
US11990137B2 (en) 2018-09-13 2024-05-21 Shanghai Cambricon Information Technology Co., Ltd. Image retouching method and terminal device
US11996105B2 (en) 2018-09-13 2024-05-28 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US12094456B2 (en) 2018-09-13 2024-09-17 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and system
US12057110B2 (en) 2018-09-13 2024-08-06 Shanghai Cambricon Information Technology Co., Ltd. Voice recognition based on neural networks
US12057109B2 (en) 2018-09-13 2024-08-06 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN111045958A (en) * 2018-10-11 2020-04-21 展讯通信(上海)有限公司 Acceleration engine and processor
CN110263296A (en) * 2019-05-18 2019-09-20 南京惟心光电系统有限公司 A kind of matrix-vector multiplier and its operation method based on photoelectricity computing array
CN112346852A (en) * 2019-08-06 2021-02-09 脸谱公司 Distributed physical processing of matrix summation operations
CN112446007A (en) * 2019-08-29 2021-03-05 上海华为技术有限公司 Matrix operation method, operation device and processor
CN111737292A (en) * 2020-07-16 2020-10-02 腾讯科技(深圳)有限公司 Data retrieval method and related device
CN112948758A (en) * 2021-02-24 2021-06-11 上海商汤智能科技有限公司 Data processing method and device and chip
US11556337B2 (en) 2021-04-12 2023-01-17 Analog Devices International Unlimited Company Parallel matrix multiplication technique optimized for memory fetches
CN114489496A (en) * 2022-01-14 2022-05-13 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligence accelerator
CN114489496B (en) * 2022-01-14 2024-05-21 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligent accelerator

Also Published As

Publication number Publication date
CN103294648B (en) 2016-06-01

Similar Documents

Publication Publication Date Title
CN103294648B (en) Block matrix multiplication vectorization method supporting vector processor with multiple MAC (multiply accumulate) operational units
JP6977239B2 (en) Matrix multiplier
CN104899182B (en) Matrix multiplication acceleration method supporting variable blocks
US9697176B2 (en) Efficient sparse matrix-vector multiplication on parallel processors
US11620818B2 (en) Spatially sparse neural network accelerator for multi-dimension visual analytics
US10140251B2 (en) Processor and method for executing matrix multiplication operation on processor
CN103336758B (en) The sparse matrix storage means of a kind of employing with the sparse row of compression of local information and the SpMV implementation method based on the method
KR101687081B1 (en) Processing method and apparatus for single-channel convolution layer, and processing method and apparatus for multi-channel convolution layer
US8984043B2 (en) Multiplying and adding matrices
WO2019205617A1 (en) Calculation method and apparatus for matrix multiplication
CN103440121A (en) Triangular matrix multiplication vectorization method for a vector processor
WO2018107476A1 (en) Memory access device, computing device and device applied to convolutional neural network computation
US9164690B2 (en) System, method, and computer program product for copying data between memory locations
CN106846235B (en) A convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instructions
CN110175670B (en) A method and system for implementing YOLOv2 detection network based on FPGA
CN113743599B (en) Computing device and server of convolutional neural network
WO2022001550A1 (en) Address generation method, related device and storage medium
WO2024012180A1 (en) Matrix calculation method and device
WO2022068328A1 (en) Data migration method and apparatus, and processor and calculation device
CN104615584B (en) Vectorized method for solving large-scale triangular linear equation systems on GPDSP
CN116127261B (en) Matrix multiply-accumulate method and device in processor and electronic equipment
CN104636316B (en) Method for large-scale matrix multiplication computation on GPDSP
CN104317754B (en) Strided data transfer optimization method for heterogeneous computing systems
CN110377874B (en) Convolution operation method and system
CN111859277A (en) A Vectorized Realization Method of Sparse Matrix-Vector Multiplication

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant