CN102411558B

CN102411558B - Vector processor oriented large matrix multiplied vectorization realizing method

Info

Publication number: CN102411558B
Application number: CN201110338108.8A
Authority: CN
Inventors: 刘仲; 陈书明; 陈跃跃; 曾咏涛; 刘衡竹; 陈海燕; 龚国辉; 彭元喜; 陈胜刚
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2011-10-31
Filing date: 2011-10-31
Publication date: 2015-05-13
Anticipated expiration: 2031-10-31
Also published as: CN102411558A

Abstract

The invention discloses a vectorized realization method of large matrix multiplication oriented to a vector processor, comprising the following steps: (1) inputting a multiplicand matrix A and a multiplier matrix B; and the multiplier matrix B are transported to the vector storage unit respectively; when transporting, the 1st to nth rows in the multiplier matrix B are sequentially sorted into the 1st to nth columns; (2) the multiplier matrix A and the multiplier The elements in a column of the number matrix B are respectively loaded into K parallel processing units, and multiplied one by one; the result of multiplication is reduced and summed in a designated parallel processing unit; the summation result is used as a result matrix element Stored in the vector storage unit; (3) move forward to the next row of the multiplicand matrix A and the next column of the multiplier matrix B, repeat step (2) until the calculation of all data frames is completed, and the result matrix element is obtained. The resulting matrix C. The invention has simple principle and convenient operation, and can improve calculation efficiency.

Description

A Vectorization Implementation Method of Large Matrix Multiplication Oriented to Vector Processor

技术领域 technical field

本发明主要涉及向量处理器以及数据处理领域，尤其涉及一种大矩阵相乘的向量化实现方法。The invention mainly relates to the fields of vector processors and data processing, in particular to a vectorized realization method of multiplication of large matrices.

背景技术 Background technique

在许多科学计算任务和应用中都会涉及到矩阵乘法运算，如图像处理，通信系统中的信号编解码等，对于规模较大的矩阵相乘计算任务，由于涉及到大量的乘法和加法运算，需要占用大量的计算时间。如何在处理器上简单而高效的实现矩阵乘法运算一直是业界的研究热点。Matrix multiplication operations are involved in many scientific computing tasks and applications, such as image processing, signal encoding and decoding in communication systems, etc. For large-scale matrix multiplication calculation tasks, due to the large number of multiplication and addition operations involved, it is necessary to Takes a lot of computing time. How to implement matrix multiplication operations on processors simply and efficiently has always been a research hotspot in the industry.

在传统的标量处理器上，研究人员已经提出了多种有效的矩阵相乘实现方法，以减少数据在运算过程中的排序操作对完成整个矩阵相乘的运算的影响。但是，随着高清视频编解码、3G无线通信、雷达信号处理等高密集、实时运算应用的不断涌现，单芯片难以满足这类应用的高密度实时计算需求，向量处理器得到了广泛应用。如图1所示，为一个向量处理器的典型结构，其具有处理器和程序存储器和数据存储器(两者均可以为任意的可访问存储器，包括外部高速缓冲存储器、外部RAM等)。向量处理器的处理器分为标量处理部件和向量处理部件两个部分，通常向量处理部件内有K个并行处理单元(PE)，这些处理单元都有各自的运算部件和寄存器，处理单元间能通过规约指令进行的数据交互，如并行处理单元之间的数据相加、比较等。标量处理单元主要负责流控和逻辑判断指令的处理，而向量处理单元主要负责密集型的数据计算。向量处理单元运算所用的数据由向量数据存储单元提供。一般地，如图2所示，向量数据存储单元的BANK(存储体)的个数与向量处理单元的处理单元个数K是一致的。On traditional scalar processors, researchers have proposed a variety of effective matrix multiplication implementation methods to reduce the impact of data sorting operations on the completion of the entire matrix multiplication operation. However, with the continuous emergence of high-intensity and real-time computing applications such as high-definition video codecs, 3G wireless communications, and radar signal processing, it is difficult for a single chip to meet the high-density real-time computing needs of such applications, and vector processors have been widely used. As shown in Fig. 1, it is a typical structure of a vector processor, which has a processor, program memory and data memory (both of which can be any accessible memory, including external cache memory, external RAM, etc.). The processor of the vector processor is divided into two parts: the scalar processing part and the vector processing part. Usually, there are K parallel processing units (PE) in the vector processing part. These processing units have their own computing parts and registers. Data interaction through specification instructions, such as data addition and comparison between parallel processing units. The scalar processing unit is mainly responsible for the processing of flow control and logic judgment instructions, while the vector processing unit is mainly responsible for intensive data calculation. The data used for the operation of the vector processing unit is provided by the vector data storage unit. Generally, as shown in FIG. 2 , the number of BANKs (memory banks) of the vector data storage unit is consistent with the number K of processing units of the vector processing unit.

申请号为“200380107095.7”的专利文献，公开了英特尔公司提出的一个专利使用SIMD寄存器的小矩阵有效乘法，将被乘数矩阵A的对角线载入处理器的不同寄存器中，并将乘数矩阵B载入至少一个在纵向按序排列的寄存器中。通过移动一个元素，有选择地将寄存器中的乘数矩阵B的每列中的乘法和加法元素同已移动的一列中的上个元素一起移动至该列的前端。将被乘数矩阵A的对角线乘以乘数矩阵B的列，它们的结果被加到结果矩阵C的列的结果和上。该方法在矩阵规模较小的情况下是能获得比较好的效果，但是随着矩阵规模的逐渐增大，难以取得好的性能表现。因此，如何在向量处理器上实现高效的大矩阵乘法运算是当前面临的一个困难。The patent document with the application number "200380107095.7" discloses a patent proposed by Intel Corporation using SIMD registers for efficient multiplication of small matrices, loading the diagonals of the multiplicand matrix A into different registers of the processor, and loading the multipliers The matrix B is loaded into at least one register arranged vertically in sequence. By shifting one element, the multiplication and addition elements in each column of the multiplier matrix B in the register are selectively moved to the front of the column together with the previous element in the shifted column. The diagonals of the multiplicand matrix A are multiplied by the columns of the multiplier matrix B, and their results are added to the resulting sum of the columns of the result matrix C. This method can achieve better results when the size of the matrix is small, but it is difficult to achieve good performance as the size of the matrix gradually increases. Therefore, how to realize efficient large-matrix multiplication operation on vector processors is currently a difficult problem.

发明内容Contents of the invention

本发明所要解决的技术问题是：针对现有技术存在的问题，本发明提供一种原理简单、操作方便、能充分利用向量处理器的多级并行性特点且易于实现的面向向量处理器的大矩阵相乘的向量化实现方法。The technical problem to be solved by the present invention is: aiming at the problems existing in the prior art, the present invention provides a vector processor-oriented large-scale A vectorized implementation of matrix multiplication.

为解决上述技术问题，本发明采用以下技术方案：In order to solve the problems of the technologies described above, the present invention adopts the following technical solutions:

一种面向向量处理器的大矩阵相乘的向量化实现方法，包括以下步骤：A vectorized implementation method for multiplication of large matrices oriented to vector processors, comprising the following steps:

(1)输入被乘数矩阵A和乘数矩阵B；通过DMA控制器将被乘数矩阵A和乘数矩阵B分别搬运到向量存储单元中；在搬运过程中，将乘数矩阵B进行重排序，即将乘数矩阵B中的第1～n行依次排序为第1～n列；(1) Input the multiplicand matrix A and the multiplier matrix B; the multiplicand matrix A and the multiplier matrix B are transported to the vector storage unit respectively by the DMA controller; during the transport process, the multiplier matrix B is re- Sorting, that is, sorting the 1st to nth rows in the multiplier matrix B into the 1st to nth columns;

(2)将被乘数矩阵A一行中的元素和乘数矩阵B中一列中的元素分别加载到K个并行处理单元中，并一一对应相乘；将相乘的结果在一指定的并行处理单元中归约求和；将求和结果作为一个结果矩阵元素存储到向量存储单元中；(2) Load the elements in one row of the multiplicand matrix A and the elements in one column of the multiplier matrix B into K parallel processing units respectively, and multiply them one by one; Reduction and summation in the processing unit; storing the summation result as a result matrix element in the vector storage unit;

(3)顺移到被乘数矩阵A的下一行和乘数矩阵B的下一列，重复步骤(2)直至完成所有数据帧的计算，得到由结果矩阵元素组成的结果矩阵C。(3) Move forward to the next row of the multiplicand matrix A and the next column of the multiplier matrix B, repeat step (2) until the calculation of all data frames is completed, and a result matrix C composed of result matrix elements is obtained.

作为本发明的进一步改进：As a further improvement of the present invention:

所述搬运过程中，被乘数矩阵A的每一行组织成一个数据帧，乘数矩阵B的每一列组织成一个数据帧，当所述数据帧的元素个数不等于向量处理器中并行处理单元的个数K的倍数时，在数据帧尾部补0使得每个数据帧的元素个数等于并行处理单元的个数K的倍数。In the handling process, each row of the multiplier matrix A is organized into a data frame, and each column of the multiplier matrix B is organized into a data frame. When the number of elements of the data frame is not equal to the parallel processing in the vector processor When the number of units is a multiple of K, add 0 at the end of the data frame so that the number of elements in each data frame is equal to the multiple of the number K of parallel processing units.

与现有技术相比，本发明的优点在于：Compared with the prior art, the present invention has the advantages of:

本发明的面向向量处理器的大矩阵相乘的向量化实现方法，通过在DMA控制器搬运数据的过程中实现乘数矩阵B的数据重排序，同时还充分利用向量处理器中的向量部件多个并行处理单元能同时进行相同运算操作的特点来进行大量的同类型操作，从而大大的提高了计算矩阵乘法的效率，且步骤简单，易于实现。The vectorized implementation method of the vector processor-oriented large matrix multiplication of the present invention realizes the data reordering of the multiplier matrix B in the process of data transfer by the DMA controller, and also makes full use of the multiplicity of vector components in the vector processor. A large number of operations of the same type can be performed by a parallel processing unit that can perform the same operation at the same time, thereby greatly improving the efficiency of computing matrix multiplication, and the steps are simple and easy to implement.

附图说明 Description of drawings

图1是典型的向量处理器结构示意图。Figure 1 is a schematic diagram of a typical vector processor structure.

图2是图1的向量处理器中的向量数据存储单元的结构示意图。FIG. 2 is a schematic structural diagram of a vector data storage unit in the vector processor of FIG. 1 .

图3是本发明的总流程示意图。Fig. 3 is a schematic diagram of the overall flow of the present invention.

图4是本发明实施例1中用DMA控制器实现乘数矩阵B元素重排序示意图。FIG. 4 is a schematic diagram of implementing reordering of B elements of the multiplier matrix by using a DMA controller in Embodiment 1 of the present invention.

图5是本发明实施例2中的被乘数矩阵A和乘数矩阵B的元素在图2所示的向量数据存储单元中的存放形式示意图；图5(1)为本发明实施例2中的被乘数矩阵A的元素在图2所示的向量数据存储单元中的存放形式示意图；图5(2)为本发明实施例2中的乘数矩阵B的元素在图2所示的向量数据存储单元中的存放形式示意图。Fig. 5 is a schematic diagram of the storage form of elements of the multiplicand matrix A and the multiplier matrix B in the vector data storage unit shown in Fig. 2 in embodiment 2 of the present invention; Fig. 5 (1) is in embodiment 2 of the present invention The storage form schematic diagram of the element of multiplicand matrix A in the vector data storage unit shown in Fig. 2; Fig. 5 (2) is the vector shown in Fig. 2 for the element of multiplier matrix B in the embodiment of the present invention 2 Schematic diagram of the storage form in the data storage unit.

图6为本发明实施例2的被乘数矩阵A(16×16)和乘数矩阵B(16×16)加载到K个并行处理单元中的示意图。FIG. 6 is a schematic diagram of loading multiplicand matrix A (16×16) and multiplier matrix B (16×16) into K parallel processing units according to Embodiment 2 of the present invention.

图7是本发明实施例2的被乘数矩阵A(16×16)和乘数矩阵B(16×16)的矩阵乘法实现步骤示意图。Fig. 7 is a schematic diagram of implementation steps of matrix multiplication of multiplicand matrix A (16×16) and multiplier matrix B (16×16) in Embodiment 2 of the present invention.

图8为本发明实施例3中的被乘数矩阵A和乘数矩阵B的元素在图2所示的向量数据存储单元中的存放形式示意图；图8(1)为本发明实施例3中的被乘数矩阵A的元素在图2所示的向量数据存储单元中的存放形式示意图；图8(2)为本发明实施例3中的乘数矩阵B的元素在图2所示的向量数据存储单元中的存放形式示意图。Fig. 8 is a schematic diagram of the storage form of elements of the multiplicand matrix A and the multiplier matrix B in the vector data storage unit shown in Fig. 2 in Embodiment 3 of the present invention; Fig. 8 (1) is in Embodiment 3 of the present invention The storage form schematic diagram of the element of the multiplicand matrix A in the vector data storage unit shown in Fig. 2; Fig. 8 (2) is the vector shown in Fig. 2 for the element of the multiplier matrix B in the embodiment of the present invention 3 Schematic diagram of the storage form in the data storage unit.

图9是本发明实施例3的被乘数矩阵A(26×22)和乘数矩阵B(22×27)加载到K个并行处理单元中的示意图。FIG. 9 is a schematic diagram of loading multiplicand matrix A (26×22) and multiplier matrix B (22×27) into K parallel processing units according to Embodiment 3 of the present invention.

图10是本发明实施例3的被乘数矩阵A(26×22)和乘数矩阵B(22×27)的矩阵乘法实现步骤示意图。Fig. 10 is a schematic diagram of implementation steps of matrix multiplication of multiplicand matrix A (26×22) and multiplier matrix B (22×27) in Embodiment 3 of the present invention.

具体实施方式 Detailed ways

以下将结合说明书附图和具体实施例对本发明作进一步详细说明。The present invention will be described in further detail below in conjunction with the accompanying drawings and specific embodiments.

实施例1：Example 1:

如图3所示，本发明的面向向量处理器的大矩阵相乘的向量化实现方法，包括以下步骤：As shown in Fig. 3, the vectorized implementation method of multiplication of large matrices oriented to vector processors of the present invention comprises the following steps:

1、输入被乘数矩阵A和乘数矩阵B；通过DMA控制器将被乘数矩阵A和乘数矩阵B分别搬运到向量存储单元中，搬运过程中，如图4所示，将乘数矩阵B进行重排序，即将乘数矩阵B中的第1～n行依次排序为第1～n列。1. Input the multiplicand matrix A and the multiplier matrix B; the multiplicand matrix A and the multiplier matrix B are respectively transported to the vector storage unit through the DMA controller. During the transport process, as shown in Figure 4, the multiplier The matrix B is reordered, that is, the 1st to nth rows in the multiplier matrix B are sequentially sorted into the 1st to nth columns.

通过DMA控制器的配置，可以将被乘数矩阵A的每一行组织成一个数据帧，乘数矩阵B的每一列组织成一个数据帧，整个乘数矩阵B共可分成p个数据帧。当数据帧的元素个数不等于向量处理器中并行处理单元的个数K的倍数时，在数据帧尾部补0使得每个数据帧的元素个数等于并行处理单元的个数K的倍数。Through the configuration of the DMA controller, each row of the multiplicand matrix A can be organized into a data frame, each column of the multiplier matrix B can be organized into a data frame, and the entire multiplier matrix B can be divided into p data frames. When the number of elements in the data frame is not equal to the multiple of the number K of parallel processing units in the vector processor, add 0 at the end of the data frame so that the number of elements in each data frame is equal to the multiple of the number K of parallel processing units.

2、将被乘数矩阵A的一行数据帧和乘数矩阵B的一列数据帧中的元素分别加载到K个并行处理单元中，并一一对应相乘；相乘的结果在一指定的并行处理单元中归约求和；求和结果作为一个结果矩阵元素存储到向量存储单元中。2. Load the elements in a row of data frames of the multiplicand matrix A and a column of data frames of the multiplier matrix B into K parallel processing units, and multiply them one by one; the result of the multiplication is in a specified parallel processing unit Reduction and summation in the processing unit; the summation result is stored as a result matrix element in the vector storage unit.

3、顺移到被乘数矩阵A的下一行和乘数矩阵B的下一列，重复步骤2到3直至完成所有数据帧的计算，得到由结果矩阵元素组成的结果矩阵C。3. Move forward to the next row of the multiplicand matrix A and the next column of the multiplier matrix B, repeat steps 2 to 3 until the calculation of all data frames is completed, and a result matrix C composed of result matrix elements is obtained.

对于m*n的被乘数矩阵A乘以n*p的乘数矩阵B的运算，可得到m*p的矩阵C。其在数学公式上可表示为：(0≤i＜m，0≤j＜p)。结果矩阵C的元素C_ij是由被乘数矩阵A相应的行元素A_ik和乘数矩阵B相应的列元素B_kj进行点积操作计算求得。For the operation of multiplying the multiplicand matrix A of m*n by the multiplier matrix B of n*p, the matrix C of m*p can be obtained. It can be expressed in mathematical formula as: (0≤i<m, 0≤j<p). The element C _ij of the result matrix C is calculated by the dot product operation of the corresponding row element A _ik of the multiplicand matrix A and the corresponding column element B _kj of the multiplier matrix B.

实施例2：Example 2:

如图7所示，采用本发明的面向向量处理器的大矩阵相乘的向量化实现方法，计算规模为16×16的矩阵乘以规模为16×16的矩阵(向量处理单元个数K为8)，包括以下步骤：As shown in Fig. 7, adopt the vectorization realization method of the multiplication of large matrix facing vector processor of the present invention, the calculation scale is that the matrix of 16 * 16 is multiplied by the matrix of 16 * 16 in scale (the number of vector processing units K is 8), including the following steps:

1、如图6所示，输入被乘数矩阵A(16×16)和乘数矩阵B(16×16)；通过DMA搬运被乘数矩阵A和乘数矩阵B到向量存储单元，这个过程中实现乘数矩阵B的重排序(重排序方法与实施例1相同)，被乘数矩阵A和乘数矩阵B在向量单元的存放方式如图5(1)和图5(2)所示。1. As shown in Figure 6, input the multiplicand matrix A (16×16) and the multiplier matrix B (16×16); transfer the multiplicand matrix A and the multiplier matrix B to the vector storage unit through DMA, this process Realize the reordering of multiplier matrix B in (the reordering method is the same as embodiment 1), the storage mode of multiplicand matrix A and multiplier matrix B in vector unit is as shown in Figure 5 (1) and Figure 5 (2) .

2、将被乘数矩阵A的一行元素和乘数矩阵B的一列的元素加载到向量处理单元中，由于被乘数矩阵A和乘数矩阵B的规模都为16×16，所以要分两次将被乘数矩阵A和乘数矩阵B加载。加载入向量处理单元的对应的元素进行乘法操作，由于向量处理单元的个数是8，而被乘数元素和乘数元素的个数是分别16，这个步骤的乘法操作应进行两次。2. Load the elements of one row of the multiplicand matrix A and the elements of one column of the multiplier matrix B into the vector processing unit. Since the scales of the multiplicand matrix A and the multiplier matrix B are both 16×16, they need to be divided into two times will be loaded by multiplier matrix A and multiplier matrix B. The corresponding elements loaded into the vector processing unit are multiplied. Since the number of vector processing units is 8, and the numbers of multiplicand elements and multiplier elements are 16 respectively, the multiplication operation in this step should be performed twice.

3、用归约指令将步骤2中每个向量处理单元所计算出的结果进行相加操作，所得的结果归约到指定的处理单元X，同样的这个步骤的操作也应进行两次。3. Use the reduction instruction to add the results calculated by each vector processing unit in step 2, and reduce the result to the designated processing unit X. The same operation of this step should also be performed twice.

4、将上述在X单元所得的两个结果进行相加操作，得出一个结果矩阵元素并存入向量存储单元。4. Add the above two results obtained in the X unit to obtain a result matrix element and store it in the vector storage unit.

5、顺移到被乘数矩阵A的下一行和乘数矩阵B的下一列，重复上述过程中的步骤2到步骤4以计算得到整个结果矩阵C＝A×B。5. Move forward to the next row of the multiplicand matrix A and the next column of the multiplier matrix B, and repeat steps 2 to 4 in the above process to obtain the entire result matrix C=A×B.

实施例3：Example 3:

如图10所示，本发明的面向向量处理器的大矩阵相乘的向量化实现方法，计算规模为26×22的矩阵乘以规模为22×27的矩阵(向量处理单元个数K为8)，包括以下步骤：As shown in Fig. 10, the vectorized implementation method of multiplication of large matrices oriented to vector processors of the present invention, the calculation scale is 26 × 22 matrix multiplied by 22 × 27 matrices (the number of vector processing units K is 8 ), including the following steps:

1、如图9所示，通过DMA搬运被乘数矩阵A和乘数矩阵B到向量存储单元，这个过程中实现乘数矩阵B的重排序(重排序方法与实施例1相同)，而且还对被乘数矩阵A和乘数矩阵B进行了补0操作，被乘数矩阵A和乘数矩阵B在向量单元的存放方式如图8(1)和图8(2)。1. As shown in FIG. 9, the multiplicand matrix A and the multiplier matrix B are transferred to the vector storage unit by DMA, and the reordering of the multiplier matrix B is realized in this process (the reordering method is the same as in embodiment 1), and the The multiplicand matrix A and the multiplier matrix B are complemented with 0, and the storage methods of the multiplicand matrix A and the multiplier matrix B in the vector unit are shown in Figure 8(1) and Figure 8(2).

2、将被乘数矩阵A的一行元素和乘数矩阵B的一列的元素加载到向量处理单元中，这里被乘数矩阵A的行和乘数矩阵B的列是已经经过补0了，所以要分3次将被乘数矩阵A和乘数矩阵B加载。加载入向量处理单元的对应的元素进行乘法操作，由于向量处理单元的个数是8，而补0后被乘数元素和乘数元素的个数都为24，这个步骤的乘法操作应进行3次。2. Load the elements of one row of the multiplicand matrix A and the elements of one column of the multiplier matrix B into the vector processing unit. Here, the rows of the multiplier matrix A and the columns of the multiplier matrix B have been filled with 0, so To be divided into 3 times will be loaded by multiplier matrix A and multiplier matrix B. The corresponding elements loaded into the vector processing unit are multiplied. Since the number of vector processing units is 8, and the number of multiplicand elements and multiplier elements is 24 after complementing 0, the multiplication operation in this step should be 3 Second-rate.

3、用归约指令将步骤2中每个向量处理单元所计算出的结果进行相加操作，所得的结果归约到指定的处理单元X，同样的这个步骤的操作也应进行3次。3. Use the reduction instruction to add the results calculated by each vector processing unit in step 2, and reduce the result to the designated processing unit X. The same operation of this step should also be performed 3 times.

4、将上述在X单元所得的3个结果进行相加操作，得出一个结果矩阵元素并存入向量存储单元。4. Add the above three results obtained in the X unit to obtain a result matrix element and store it in the vector storage unit.

5、顺移到被乘数矩阵A的下一行和乘数矩阵B的下一列，重复上述过程中的步骤2到步骤4可以计算得到整个结果矩阵C＝A×B。5. Move forward to the next row of the multiplicand matrix A and the next column of the multiplier matrix B, and repeat steps 2 to 4 in the above process to obtain the entire result matrix C=A×B.

可以根据具体的向量处理器的结构和指令集按照以上的步骤编写出简单高效的代码实现大矩阵的相乘。本发明的方法对于程序员来说，是简单易懂的，有利于程序员编码实现。According to the specific structure and instruction set of the vector processor, a simple and efficient code can be written according to the above steps to realize the multiplication of large matrices. The method of the present invention is simple and easy to understand for programmers, and is beneficial for programmers to code and realize.

以上所述仅是本发明的优选实施方式，本发明的保护范围并不仅局限于上述实施例，凡属于本发明思路下的技术方案均属于本发明的保护范围。应当指出，对于本技术领域的普通技术人员来说，在不脱离本发明原理前提下的若干改进和润饰，应视为本发明的保护范围。The above descriptions are only preferred implementations of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions under the idea of the present invention belong to the protection scope of the present invention. It should be pointed out that for those skilled in the art, some improvements and modifications without departing from the principle of the present invention should be regarded as the protection scope of the present invention.

Claims

1. A vectorized implementation method for multiplication of large matrices for vector processors, characterized in that it comprises the following steps:

(1) Input the multiplicand matrix A and the multiplier matrix B; transfer the multiplicand matrix A and the multiplier matrix B to the vector storage unit through the DMA controller; Sorting, that is, sorting the 1st to nth columns in the multiplier matrix B into the 1st to nth rows;

In the handling process, each row of the multiplier matrix A is organized into a data frame, and each column of the multiplier matrix B is organized into a data frame. When the number of elements of the data frame is not equal to the parallel processing in the vector processor When the number of units is a multiple of K, add 0 at the end of the data frame so that the number of elements in each data frame is equal to the multiple of the number K of parallel processing units;

(2) Load the elements in one row of the multiplicand matrix A and the elements in one row of the multiplier matrix B into K parallel processing units, and multiply them one by one; Reduction and summation in the processing unit; storing the summation result as a result matrix element in the vector storage unit;

(3) Move forward to the next row of the multiplier matrix B, repeat steps (2) and (3) until the calculation of one row of the multiplicand matrix A and all the rows of the multiplier matrix B is completed, and a row of elements of the result matrix C is calculated ;

(4) Move forward to the next row of the multiplicand matrix A, and repeat steps (2) (3) (4) until the calculation of all rows of the multiplicand matrix A is completed, and all row elements of the result matrix C are obtained.