
CN103294648A - Block matrix multiplication vectorization method supporting vector processor with multiple MAC (multiply accumulate) operational units - Google Patents


Info

Publication number
CN103294648A
Authority
CN
China
Prior art keywords
matrix
submatrix
vector
sub
data
Prior art date
Legal status
Granted
Application number
CN2013101664113A
Other languages
Chinese (zh)
Other versions
CN103294648B (en)
Inventor
刘仲
陈书明
窦强
郭阳
刘衡竹
田希
龚国辉
陈海燕
彭元喜
万江华
刘胜
陈跃跃
扈啸
吴家铸
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201310166411.3A
Publication of CN103294648A
Application granted
Publication of CN103294648B
Legal status: Active

Landscapes

  • Complex Calculations (AREA)

Abstract

A block matrix multiplication vectorization method for a vector processor with multiple MAC (multiply-accumulate) units. The process is: (1) from the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units per VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal sub-matrix block size blocksize, that is, the numbers of columns and rows of the sub-matrices of the multiplier matrix B and the numbers of rows and columns of the sub-matrices of the multiplicand matrix A; (2) divide the vector memory capacity s into two storage areas of equal size, Buffer0 and Buffer1, and perform the sub-matrix multiplications in a ping-pong fashion alternating between Buffer0 and Buffer1 until the whole matrix multiplication is complete. The invention is simple to implement, convenient to operate, increases the parallelism of the vector processor, and improves the computing efficiency of the processor.

Description

Block matrix multiplication vectorization method for vector processors with multiple MAC units

Technical Field

The invention relates to the technical field of data processing, and in particular to a block matrix multiplication vectorization method for vector processors with multiple MAC units.

Background

With the growing demand for high-performance computing in compute-intensive applications such as the solution of large dense linear systems, 4G wireless communications, radar signal processing, high-definition video, and digital image processing, computer architecture has changed significantly and many new architectures have appeared, such as many-core architectures, heterogeneous multi-core architectures, stream processor architectures, and vector processor architectures. These new architectures integrate multiple processor cores on a single chip, each core containing a rich set of functional units, which greatly increases the computing performance of the chip; at the same time, they pose new challenges for software development. Because a large number of existing programs and algorithms were designed for single-core processors, the main difficulty today is how to exploit parallelism at every level and to parallelize and vectorize these application algorithms efficiently for architectures with many cores and many functional units.

"Matrix multiplication" is one of the core kernels of high-performance computing (HPC). It is a typical compute-intensive and memory-intensive application: it places very high demands on the processor's multiply-accumulate (MAC) throughput and memory bandwidth, and its time complexity is high, about O(N³), where N is the matrix size. The traditional triple-loop implementation of matrix multiplication has a low ratio of computation to memory access; cache misses and the overhead of moving matrix data take a large share of the run time, so the processor's computing efficiency is low. Blocked matrix multiplication splits the product of large matrices into a series of sub-matrix products. By choosing the sub-matrix block size reasonably, usually blocksize <= sqrt(M/3) where M is the cache capacity, all data accesses during a sub-matrix product can hit in the cache; reducing the computation time of each sub-matrix product lowers the computation time of the whole large matrix multiplication, which greatly improves the processor's computing efficiency.
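The cache-blocking rule above can be sketched as a small helper (an illustration only; the function name and the 32 KB cache figure are assumptions, not from the patent):

```python
import math

def cache_block_size(cache_bytes: int, elem_bytes: int) -> int:
    """Largest square block edge satisfying blocksize <= sqrt(M/3),
    where M is the cache capacity in elements: three blocksize x blocksize
    tiles (one each of A, B and C) must fit in the cache at once."""
    m_elems = cache_bytes // elem_bytes   # cache capacity M, in elements
    return math.isqrt(m_elems // 3)       # floor(sqrt(M/3))

# e.g. a 32 KB data cache and 4-byte floats: 8192 elements, blocksize 52
```

With 32 KB of cache and 4-byte elements this yields a block edge of 52, since 3 tiles of 52 x 52 single-precision values occupy about 31.7 KB.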

Figure 1 is a schematic diagram of the general structure of a vector processor with multiple MAC units. It comprises a Scalar Processing Unit (SPU) and a Vector Processing Unit (VPU). The SPU handles scalar computation and flow control; the VPU handles vector computation and consists of several Vector Processing Elements (VPEs), each of which contains several MAC units (MAC0, MAC1, ...) as well as other functional units such as an ALU and a BP unit. The SPU and VPU provide data channels for transferring and exchanging data. The vector data access unit supports vector Load/Store and provides a large dedicated vector memory rather than the cache mechanism of a single-core processor, so existing blocked matrix multiplication methods are not suited to this kind of vector processor. There is therefore an urgent need for an efficient blocked matrix multiplication vectorization method for vector processors with multiple MAC units, so that the computing efficiency of the vector processor can be fully exploited.

Summary of the Invention

The technical problem addressed by the invention is this: given the technical problems of the prior art, the invention provides a blocked matrix multiplication vectorization method for vector processors with multiple MAC units that is simple to implement, convenient to operate, increases the parallelism of the vector processor, and improves the computing efficiency of the processor.

To solve the above technical problem, the invention adopts the following technical solution:

A blocked matrix multiplication vectorization method for a vector processor with multiple MAC units, with the following steps:

(1) From the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units per VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal sub-matrix block size blocksize, that is, the numbers of columns and rows of the sub-matrices of the multiplier matrix B and the numbers of rows and columns of the sub-matrices of the multiplicand matrix A;

(2) Divide the vector memory capacity s into two storage areas of equal size, Buffer0 and Buffer1, and perform the sub-matrix multiplications in a ping-pong fashion alternating between Buffer0 and Buffer1, until the whole matrix multiplication is complete.

As a further refinement of the invention:

In step (1), the sub-matrices of the multiplier matrix B have p*m columns and (s/2/2)/(p*m*d) rows; once the block size of the sub-matrices of B has been determined, the block size of the sub-matrices of the multiplicand matrix A is determined: the sub-matrices of A have as many rows and columns as the sub-matrices of B have rows, i.e. (s/2/2)/(p*m*d).

The scalar processing unit (SPU) of the vector processor reads, one after another, each element in each row of the multiplicand sub-matrix A and expands it into a vector; the vector processing unit (VPU) reads row data of the multiplier sub-matrix B (e.g. row B0) and multiply-accumulates it with each such expanded vector. When row A0 of the multiplicand sub-matrix has been traversed, row C0 of the result sub-matrix C has been computed; when all rows of the multiplicand sub-matrix A have been traversed, the computation of the result sub-matrix C is complete.

In step (2), storage area Buffer0 holds the sub-matrices of the multiplier matrix B and of the output result matrix C for the current sub-matrix multiplication; meanwhile the DMA controller moves the sub-matrix data of B needed for the next sub-matrix multiplication into storage area Buffer1, and moves the previous result sub-matrix out to external memory.

Compared with the prior art, the advantages of the invention are: the optimal sub-matrix block size blocksize is determined from the architectural characteristics of the vector processor and the data size of the matrix elements, which effectively raises the processor's ratio of computation to memory access; implementing the sub-matrix multiplications with double-buffered ping-pong effectively overlaps data-transfer time with compute time, reducing the total computation time. The scalar processing unit (SPU) of the vector processor reads the row data of the multiplicand sub-matrix and expands each element into a vector, which the vector processing unit (VPU) multiply-accumulates element-wise with rows of the multiplier sub-matrix read row by row, avoiding column accesses and reduction sums over vector data. These advantages make the method of the invention simple to implement and convenient to operate; it fully exploits the instruction-, data-, and task-level parallelism of the vector processor and raises the processor's computing efficiency above 90%, thus making full use of the high-performance computing capability of vector processors with multiple MAC units.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the general structure of a vector processor with multiple MAC units.

Figure 2 is a flow diagram of the method of the invention.

Figure 3 is a flow diagram of determining the optimal sub-matrix block size from the architectural characteristics of the vector processor in a specific embodiment.

Figure 4 is an operational diagram of a concrete sub-matrix multiplication in the invention.

Figure 5 is a flow diagram of implementing the sub-matrix multiplication with double-buffered ping-pong in a specific embodiment.

Detailed Description

The invention is described in further detail below with reference to the drawings and specific embodiments.

As shown in Figure 2, the blocked matrix multiplication vectorization method of the invention for vector processors with multiple MAC units proceeds as follows:

(1) First, from the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units per VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal sub-matrix block size blocksize, that is, the numbers of columns and rows of the sub-matrices of the multiplier matrix B and the numbers of rows and columns of the sub-matrices of the multiplicand matrix A.

(2) Divide the vector memory capacity s into two storage areas of equal size, Buffer0 and Buffer1, and perform the sub-matrix multiplications in a ping-pong fashion alternating between Buffer0 and Buffer1, until the whole matrix multiplication is complete.

As shown in Figure 3, in a concrete application the optimal sub-matrix block size blocksize is determined from the number p of VPEs of the vector processor, the number m of MAC units per VPE, the vector memory capacity s, and the data size d of a matrix element. The sub-matrices of the multiplier matrix B have p*m columns and (s/2/2)/(p*m*d) rows. Once the block size of the sub-matrices of B is fixed, the block size of the sub-matrices of the multiplicand matrix A follows: the sub-matrices of A have as many rows and columns as the sub-matrices of B have rows, i.e. (s/2/2)/(p*m*d). For example, assume the matrix elements are single-precision floating-point data of size 4 B (bytes), the vector memory capacity is 1024 KB, the number of VPEs is p = 16, and the number of MAC units per VPE is m = 2. Then the sub-matrices of the multiplier matrix B have p*m = 16*2 = 32 columns and (1024*1024/2/2)/(16*2*4) = 2048 rows, and the sub-matrices of the multiplicand matrix A have 2048 rows and 2048 columns.
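The sizing rule can be expressed as a short helper function (a sketch for illustration; the function name is ours, not from the patent), reproducing the worked example above:

```python
def submatrix_dims(p, m, s, d):
    """Sub-matrix shapes per the patent's sizing rule.

    p: number of VPEs; m: MAC units per VPE;
    s: vector memory capacity in bytes; d: element size in bytes.
    B gets p*m columns (one per MAC lane) and (s/2/2)/(p*m*d) rows:
    half of s is one ping-pong buffer, and half of that buffer holds B.
    A's sub-matrix is square, with edge equal to B's row count."""
    b_cols = p * m
    b_rows = (s // 2 // 2) // (b_cols * d)
    return (b_rows, b_rows), (b_cols, b_rows)   # (A rows, A cols), (B cols, B rows)

# Worked example from the text: p=16, m=2, s=1024 KB, d=4 B
a_shape, b_shape = submatrix_dims(16, 2, 1024 * 1024, 4)
# a_shape == (2048, 2048): A is 2048 x 2048
# b_shape == (32, 2048):   B has 32 columns and 2048 rows
```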

As shown in Figure 4, in this embodiment the sub-matrix of the multiplier matrix B has 4 columns and 8 rows, and the sub-matrix of the multiplicand matrix A has 8 rows and 8 columns. The method is as follows: the scalar processing unit (SPU) of the vector processor reads, one after another, each element in each row of the multiplicand sub-matrix A and expands it into a vector; in Figure 4, element a00 of row A0 is expanded into the vector (a00, a00, a00, a00) and element a01 into the vector (a01, a01, a01, a01). The vector processing unit (VPU) reads row B0 of the multiplier sub-matrix B, (b00, b01, b02, b03), and multiply-accumulates it with each element of the expanded vectors. When row A0 of the multiplicand sub-matrix has been traversed, row C0 of the result sub-matrix C, (c00, c01, c02, c03), has been computed. When all rows of the multiplicand sub-matrix A have been traversed, the computation of the result sub-matrix C is complete.
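The row-broadcast scheme described above can be sketched in plain Python (an illustration of the data flow only; on the real processor the inner loop runs across the VPE/MAC lanes in hardware, and the function name is ours):

```python
def block_matmul_row_broadcast(A, B):
    """Multiply two blocks the way the text describes: each element
    a[i][k] (read by the SPU and broadcast across the lanes) is
    multiply-accumulated with row B[k] into result row C[i].
    No column accesses and no cross-lane reduction are needed."""
    n, kdim = len(A), len(A[0])
    cols = len(B[0])
    C = [[0.0] * cols for _ in range(n)]
    for i in range(n):                 # one result row C[i] per row of A
        for k in range(kdim):          # SPU: read scalar a[i][k], broadcast
            a_ik = A[i][k]
            for j in range(cols):      # VPU lanes: MAC into row C[i]
                C[i][j] += a_ik * B[k][j]
    return C

# Toy check, 2x3 times 3x2:
# block_matmul_row_broadcast([[1, 2, 3], [4, 5, 6]],
#                            [[7, 8], [9, 10], [11, 12]])
# -> [[58.0, 64.0], [139.0, 154.0]]
```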

As shown in Figure 5, the blocked matrix multiplication in this embodiment implements the sub-matrix multiplications with double-buffered ping-pong. The vector memory capacity s is divided into two storage areas of equal size, Buffer0 and Buffer1. Buffer0 holds the sub-matrices of the multiplier matrix B and of the output result matrix C for the current sub-matrix multiplication, while the DMA controller moves the sub-matrix data of B needed for the next sub-matrix multiplication into Buffer1 and moves the previous result sub-matrix out to external memory.
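The ping-pong schedule can be sketched as follows (a sequential simulation for illustration; on the real processor the DMA transfers overlap the computation in time, and all names here are ours, not from the patent):

```python
def pingpong_process(blocks, compute, load, store):
    """Double-buffered ping-pong schedule (sketch): while one buffer is
    being computed on, the 'DMA' stands ready to pre-load the next block
    into the other buffer and to drain the previous result.  load, compute
    and store are plain callables standing in for DMA transfers and the
    VPU; this simulation runs them sequentially rather than overlapped."""
    buffers = [None, None]
    results = []
    cur = 0
    buffers[cur] = load(blocks[0])              # prime the first buffer
    for i in range(len(blocks)):
        nxt = cur ^ 1                           # the other buffer
        if i + 1 < len(blocks):
            buffers[nxt] = load(blocks[i + 1])  # DMA: prefetch next block
        results.append(store(compute(buffers[cur])))  # VPU: compute, drain
        cur = nxt                               # swap roles: ping-pong
    return results

# Toy run: "compute" squares each block, load/store pass through.
out = pingpong_process([1, 2, 3], compute=lambda x: x * x,
                       load=lambda b: b, store=lambda r: r)
# out == [1, 4, 9]
```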

In summary, the blocked matrix multiplication vectorization method for vector processors with multiple MAC units implemented by the invention determines the optimal sub-matrix block size blocksize from the architectural characteristics of the vector processor. Implementing the sub-matrix multiplications with double-buffered ping-pong effectively overlaps data-transfer time with compute time and reduces the total computation time. The scalar processing unit (SPU) of the vector processor reads the row data of the multiplicand sub-matrix and expands each element into a vector, which the vector processing unit (VPU) multiply-accumulates element-wise with rows of the multiplier sub-matrix read row by row, avoiding column accesses and reduction sums over vector data. These advantages make the method of the invention simple to implement and convenient to operate; it fully exploits the instruction-, data-, and task-level parallelism of the vector processor and raises the processor's computing efficiency above 90%, making full use of the high-performance computing capability of vector processors with multiple MAC units.

The above are only preferred embodiments of the invention; the scope of protection of the invention is not limited to the above embodiments, and all technical solutions within the spirit of the invention fall within its scope of protection. It should be noted that, for those of ordinary skill in the art, several improvements and refinements made without departing from the principle of the invention shall also be regarded as falling within the scope of protection of the invention.

Claims (4)

1. A blocked matrix multiplication vectorization method for a vector processor with multiple MAC units, characterized in that the process is:
(1) from the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units per VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal sub-matrix block size blocksize, that is, the numbers of columns and rows of the sub-matrices of the multiplier matrix B and the numbers of rows and columns of the sub-matrices of the multiplicand matrix A;
(2) divide the vector memory capacity s into two storage areas of equal size, Buffer0 and Buffer1, and perform the sub-matrix multiplications in a ping-pong fashion alternating between Buffer0 and Buffer1, until the whole matrix multiplication is complete.
2. The method of claim 1, characterized in that in step (1) the sub-matrices of the multiplier matrix B have p*m columns and (s/2/2)/(p*m*d) rows; after the block size of the sub-matrices of B is determined, the block size of the sub-matrices of the multiplicand matrix A is determined: the sub-matrices of A have as many rows and columns as the sub-matrices of B have rows, i.e. (s/2/2)/(p*m*d).
3. The method of claim 2, characterized in that the scalar processing unit (SPU) of the vector processor reads, one after another, each element in each row of the multiplicand sub-matrix A and expands it into a vector; the vector processing unit (VPU) reads row B0 of the multiplier sub-matrix B and multiply-accumulates it with each element of that vector; when row A0 of the multiplicand sub-matrix has been traversed, row C0 of the result sub-matrix C has been computed; when all rows of the multiplicand sub-matrix A have been traversed, the computation of the result sub-matrix C is complete.
4. The method of any one of claims 1 to 3, characterized in that in step (2) storage area Buffer0 holds the sub-matrices of the multiplier matrix B and of the output result matrix C for the current sub-matrix multiplication, while the DMA controller moves the sub-matrix data of B needed for the next sub-matrix multiplication into storage area Buffer1 and moves the previous result sub-matrix out to external memory.
CN201310166411.3A 2013-05-08 2013-05-08 Blocked matrix multiplication vectorization method for a vector processor with multiple MAC units Active CN103294648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310166411.3A CN103294648B (en) 2013-05-08 2013-05-08 Blocked matrix multiplication vectorization method for a vector processor with multiple MAC units


Publications (2)

Publication Number Publication Date
CN103294648A true CN103294648A (en) 2013-09-11
CN103294648B CN103294648B (en) 2016-06-01

Family

ID=49095548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310166411.3A Active CN103294648B (en) Blocked matrix multiplication vectorization method for a vector processor with multiple MAC units

Country Status (1)

Country Link
CN (1) CN103294648B (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461465A (en) * 2014-12-29 2015-03-25 南京大学 High-efficiency controller based on ping-pong operation and method thereof
CN104899182A (en) * 2015-06-09 2015-09-09 中国人民解放军国防科学技术大学 Matrix multiplication acceleration method for supporting variable blocks
CN105426344A (en) * 2015-11-09 2016-03-23 南京大学 Matrix calculation method of distributed large-scale matrix multiplication based on Spark
CN106445471A (en) * 2016-10-13 2017-02-22 北京百度网讯科技有限公司 Processor and method for executing matrix multiplication on processor
CN108509384A (en) * 2017-02-24 2018-09-07 富士通株式会社 Computational methods, information processing unit, calculation procedure and information processing system
CN108805273A (en) * 2018-05-20 2018-11-13 复旦大学 Door control unit accelerates the hardware circuit implementation of operation in a kind of LSTM
CN108985450A (en) * 2018-06-28 2018-12-11 中国人民解放军国防科技大学 Vector processor-oriented convolution neural network operation vectorization method
CN109086075A (en) * 2017-10-30 2018-12-25 上海寒武纪信息科技有限公司 Artificial intelligence process device and the method for executing Matrix Multiplication vector instruction using processor
US10338919B2 (en) 2017-05-08 2019-07-02 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
CN110263296A (en) * 2019-05-18 2019-09-20 南京惟心光电系统有限公司 A kind of matrix-vector multiplier and its operation method based on photoelectricity computing array
US10454680B2 (en) 2016-11-01 2019-10-22 Beijing Baidu Netcom Science And Technology Co., Ltd. RSA decryption processor and method for controlling RSA decryption processor
CN110415157A (en) * 2018-04-26 2019-11-05 华为技术有限公司 A kind of calculation method and device of matrix multiplication
CN111045958A (en) * 2018-10-11 2020-04-21 展讯通信(上海)有限公司 Acceleration engine and processor
CN111737292A (en) * 2020-07-16 2020-10-02 腾讯科技(深圳)有限公司 Data retrieval method and related device
CN111902813A (en) * 2018-03-27 2020-11-06 Sk电信有限公司 Apparatus and method for convolution operation
CN112346852A (en) * 2019-08-06 2021-02-09 脸谱公司 Distributed physical processing of matrix summation operations
CN112446007A (en) * 2019-08-29 2021-03-05 上海华为技术有限公司 Matrix operation method, operation device and processor
CN112948758A (en) * 2021-02-24 2021-06-11 上海商汤智能科技有限公司 Data processing method and device and chip
CN114489496A (en) * 2022-01-14 2022-05-13 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligence accelerator
US11556337B2 (en) 2021-04-12 2023-01-17 Analog Devices International Unlimited Company Parallel matrix multiplication technique optimized for memory fetches
US11990137B2 (en) 2018-09-13 2024-05-21 Shanghai Cambricon Information Technology Co., Ltd. Image retouching method and terminal device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102018110607A1 (en) 2017-05-08 2018-11-08 Nvidia Corporation Generalized acceleration of matrix multiplication and accumulation operations

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7844630B2 (en) * 2007-09-01 2010-11-30 International Business Machines Corporation Method and structure for fast in-place transformation of standard full and packed matrix data formats
CN102214160A (en) * 2011-07-08 2011-10-12 中国科学技术大学 Single-accuracy matrix multiplication optimization method based on loongson chip 3A
CN102446160A (en) * 2011-09-06 2012-05-09 中国人民解放军国防科学技术大学 Matrix multiplication implementation method for double-precision SIMD (Single instruction multiple data) component


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
纪坤, et al.: "Research and Implementation of a Blocked Algorithm for Matrix Triangular Decomposition", Computer Applications and Software *
陈晶, et al.: "Analysis of Distributed Parallel Matrix Multiplication Algorithms", Measurement & Control Technology *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461465A (en) * 2014-12-29 2015-03-25 南京大学 High-efficiency controller based on ping-pong operation and method thereof
CN104899182A (en) * 2015-06-09 2015-09-09 中国人民解放军国防科学技术大学 Matrix multiplication acceleration method for supporting variable blocks
CN104899182B (en) * 2015-06-09 2017-10-31 中国人民解放军国防科学技术大学 A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks
CN105426344A (en) * 2015-11-09 2016-03-23 南京大学 Matrix calculation method of distributed large-scale matrix multiplication based on Spark
CN106445471A (en) * 2016-10-13 2017-02-22 北京百度网讯科技有限公司 Processor and method for executing matrix multiplication on processor
US10140251B2 (en) 2016-10-13 2018-11-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Processor and method for executing matrix multiplication operation on processor
US10454680B2 (en) 2016-11-01 2019-10-22 Beijing Baidu Netcom Science And Technology Co., Ltd. RSA decryption processor and method for controlling RSA decryption processor
CN108509384A (en) * 2017-02-24 2018-09-07 富士通株式会社 Computational methods, information processing unit, calculation procedure and information processing system
CN108509384B (en) * 2017-02-24 2022-04-12 富士通株式会社 Computing method, information processing device, computing program, and information processing system
US10884734B2 (en) 2017-05-08 2021-01-05 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US10338919B2 (en) 2017-05-08 2019-07-02 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
CN109086075A (en) * 2017-10-30 2018-12-25 上海寒武纪信息科技有限公司 Artificial intelligence processor and method for executing matrix multiplication vector instruction by using same
US12050887B2 (en) 2017-10-30 2024-07-30 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US11922132B2 (en) 2017-10-30 2024-03-05 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN109086075B (en) * 2017-10-30 2021-06-08 上海寒武纪信息科技有限公司 Artificial intelligence processor and method for executing matrix multiplication vector instruction by using same
US11762631B2 (en) 2017-10-30 2023-09-19 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN111902813B (en) * 2018-03-27 2024-05-07 Sapeon韩国株式会社 Apparatus and method for convolution operation
CN111902813A (en) * 2018-03-27 2020-11-06 Sk电信有限公司 Apparatus and method for convolution operation
CN110415157A (en) * 2018-04-26 2019-11-05 华为技术有限公司 A kind of calculation method and device of matrix multiplication
CN110415157B (en) * 2018-04-26 2024-01-30 华为技术有限公司 Matrix multiplication calculation method and device
CN108805273A (en) * 2018-05-20 2018-11-13 复旦大学 Door control unit accelerates the hardware circuit implementation of operation in a kind of LSTM
CN108985450B (en) * 2018-06-28 2019-10-29 中国人民解放军国防科技大学 A Vectorization Method for Convolutional Neural Network Operations Oriented to Vector Processors
CN108985450A (en) * 2018-06-28 2018-12-11 中国人民解放军国防科技大学 Vectorization method for convolutional neural network operations oriented to vector processors
US11990137B2 (en) 2018-09-13 2024-05-21 Shanghai Cambricon Information Technology Co., Ltd. Image retouching method and terminal device
US11996105B2 (en) 2018-09-13 2024-05-28 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US12094456B2 (en) 2018-09-13 2024-09-17 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and system
US12057110B2 (en) 2018-09-13 2024-08-06 Shanghai Cambricon Information Technology Co., Ltd. Voice recognition based on neural networks
US12057109B2 (en) 2018-09-13 2024-08-06 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN111045958A (en) * 2018-10-11 2020-04-21 展讯通信(上海)有限公司 Acceleration engine and processor
CN110263296A (en) * 2019-05-18 2019-09-20 南京惟心光电系统有限公司 A kind of matrix-vector multiplier and its operation method based on photoelectricity computing array
CN112346852A (en) * 2019-08-06 2021-02-09 脸谱公司 Distributed physical processing of matrix summation operations
CN112446007A (en) * 2019-08-29 2021-03-05 上海华为技术有限公司 Matrix operation method, operation device and processor
CN111737292A (en) * 2020-07-16 2020-10-02 腾讯科技(深圳)有限公司 Data retrieval method and related device
CN112948758A (en) * 2021-02-24 2021-06-11 上海商汤智能科技有限公司 Data processing method and device and chip
US11556337B2 (en) 2021-04-12 2023-01-17 Analog Devices International Unlimited Company Parallel matrix multiplication technique optimized for memory fetches
CN114489496A (en) * 2022-01-14 2022-05-13 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligence accelerator
CN114489496B (en) * 2022-01-14 2024-05-21 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligent accelerator

Also Published As

Publication number Publication date
CN103294648B (en) 2016-06-01

Similar Documents

Publication Publication Date Title
CN103294648B (en) Block matrix multiplication vectorization method supporting vector processor with multiple MAC (multiply accumulate) operational units
JP6977239B2 (en) Matrix multiplier
CN104899182B (en) Matrix multiplication acceleration method supporting variable blocks
US9697176B2 (en) Efficient sparse matrix-vector multiplication on parallel processors
US11620818B2 (en) Spatially sparse neural network accelerator for multi-dimension visual analytics
US10140251B2 (en) Processor and method for executing matrix multiplication operation on processor
CN103336758B (en) The sparse matrix storage means of a kind of employing with the sparse row of compression of local information and the SpMV implementation method based on the method
KR101687081B1 (en) Processing method and apparatus for single-channel convolution layer, and processing method and apparatus for multi-channel convolution layer
US8984043B2 (en) Multiplying and adding matrices
WO2019205617A1 (en) Calculation method and apparatus for matrix multiplication
CN103440121A (en) Triangular matrix multiplication vectorization method for a vector processor
WO2018107476A1 (en) Memory access device, computing device and device applied to convolutional neural network computation
US9164690B2 (en) System, method, and computer program product for copying data between memory locations
CN106846235B (en) A convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instructions
CN110175670B (en) A method and system for implementing YOLOv2 detection network based on FPGA
CN113743599B (en) Computing device and server of convolutional neural network
WO2022001550A1 (en) Address generation method, related device and storage medium
WO2024012180A1 (en) Matrix calculation method and device
WO2022068328A1 (en) Data migration method and apparatus, and processor and calculation device
CN104615584B (en) Vectorized method for solving large-scale triangular linear equation systems on GPDSP
CN116127261B (en) Matrix multiply-accumulate method and device in processor and electronic equipment
CN104636316B (en) Method for large-scale matrix multiplication computation on GPDSP
CN104317754B (en) Strided data transfer optimization method for heterogeneous computing systems
CN110377874B (en) Convolution operation method and system
CN111859277A (en) A Vectorized Realization Method of Sparse Matrix-Vector Multiplication

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant