CN117785114A

CN117785114A - A matrix multiplication execution method, device, electronic equipment and storage medium

Info

Publication number: CN117785114A
Application number: CN202311835934.2A
Authority: CN
Inventors: 李庚�; 刘建
Original assignee: Guangdong Greater Bay Area Institute of Integrated Circuit and System
Current assignee: Guangdong Greater Bay Area Institute of Integrated Circuit and System
Priority date: 2023-12-27
Filing date: 2023-12-27
Publication date: 2024-03-29

Abstract

Embodiments of the invention disclose a matrix multiplication execution method, device, electronic equipment and storage medium. The method comprises: in response to a matrix multiplication operation instruction, acquiring a first matrix to be operated on, a second matrix to be operated on and an initial matrix to be updated; determining, according to the cache space capacity, a blocking stride for matrix blocking; blocking the matrices according to the blocking stride to obtain a first sub-matrix of the first matrix to be operated on, a second sub-matrix of the second matrix to be operated on and a third sub-matrix of the initial matrix to be updated; selecting, according to the blocking stride, a storage vector register and a pre-stored vector register from the vector registers; and sequentially fetching the corresponding sub-matrices from the storage vector register and the pre-stored vector register and performing matrix multiplication until an operation end condition is met, so as to obtain the matrix multiplication result. The technical solution provided by the embodiments of the invention improves the efficiency of matrix multiplication by optimizing function performance.

Description

A matrix multiplication execution method, device, electronic equipment and storage medium

Technical field

Embodiments of the present invention relate to the technical field of numerical analysis, and in particular to a matrix multiplication execution method, device, electronic equipment and storage medium.

Background

Numerical computing is an important tool for scientific research, and high-performance computing built on numerical computing has become one of the key scientific and technological elements representing a country's comprehensive strength. It is of great significance in fields such as atomic physics, computational chemistry, bioinformatics, atmospheric science and artificial intelligence. Numerical linear algebra is the core computational method in the field of numerical computing. It uses matrices as its main form of data representation and matrix operations as the ultimate goal of problem solving.

Because numerical linear algebra is widely used in scientific and engineering computing, researchers have continually developed high-performance numerical linear algebra libraries, one of the better known being BLAS (Basic Linear Algebra Subprograms). The linear algebra operations in BLAS fall into three levels: vector-vector operations are Level-1 BLAS, matrix-vector operations are Level-2 BLAS, and matrix-matrix operations are Level-3 BLAS.

In the prior art, matrix multiplication takes the dot product of a row of matrix A with a column of matrix B and then updates the corresponding element of matrix C. However, whether the matrices are stored in memory in row-major or column-major format, when the cache fetches a row of matrix A or a column of matrix B from main memory, one of the two accesses touches non-contiguous addresses and therefore incurs a large number of cache misses and translation lookaside buffer (TLB) misses. This lowers the compute-to-memory-access ratio and greatly affects computation speed. In addition, because of this reduced computation speed, the computational efficiency of numerical linear algebra in the prior art is not high enough, and programs run slowly.
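The access pattern described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names `naive_matmul` and `column_walk_offsets` are hypothetical, and the flat row-major layout is an assumption chosen for illustration. The inner loop walks a row of A contiguously but walks a column of B with a stride of n elements, which is the non-contiguous access that causes the misses.

```python
def naive_matmul(a, b, m, k, n):
    """Multiply an m x k matrix by a k x n matrix.

    Both inputs are flat lists in row-major order (an assumption made
    for this illustration).
    """
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                # The index into a advances by 1 per step (contiguous row walk);
                # the index into b advances by n per step (strided column walk).
                acc += a[i * k + p] * b[p * n + j]
            c[i * n + j] = acc
    return c


def column_walk_offsets(k, n, j):
    """Flat offsets touched when reading column j of a row-major k x n matrix."""
    return [p * n + j for p in range(k)]
```

For a row-major 3 x 4 matrix, `column_walk_offsets(3, 4, 1)` gives offsets that are 4 elements apart, so consecutive reads of one column land far from each other once n is large.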

Summary of the invention

The invention provides a matrix multiplication execution method, device, electronic equipment and storage medium, so as to improve the efficiency of matrix multiplication.

In a first aspect, an embodiment of the invention provides a matrix multiplication execution method, the method comprising:

in response to a matrix multiplication operation instruction, acquiring a first matrix to be operated on, a second matrix to be operated on and an initial matrix to be updated;

determining, according to the cache space capacity, a blocking stride for matrix blocking;

blocking the first matrix to be operated on, the second matrix to be operated on and the initial matrix to be updated according to the blocking stride, to obtain a first sub-matrix of the first matrix to be operated on, a second sub-matrix of the second matrix to be operated on and a third sub-matrix of the initial matrix to be updated, respectively;

selecting, according to the blocking stride, a storage vector register and a pre-stored vector register from the vector registers, wherein the storage vector register is used to store the first sub-matrix, the second sub-matrix and the third sub-matrix of the current operation cycle, and the pre-stored vector register is used to store the first sub-matrix and the second sub-matrix of the next operation cycle; and

sequentially fetching the corresponding sub-matrices from the storage vector register and the pre-stored vector register and performing matrix multiplication until an operation end condition is met, to obtain the matrix multiplication result.

In a second aspect, an embodiment of the invention further provides a matrix multiplication execution device, the device comprising:

a matrix acquisition module, configured to acquire, in response to a matrix multiplication operation instruction, a first matrix to be operated on, a second matrix to be operated on and an initial matrix to be updated;

a blocking stride determination module, configured to determine, according to the cache space capacity, a blocking stride for matrix blocking;

a sub-matrix obtaining module, configured to block the first matrix to be operated on, the second matrix to be operated on and the initial matrix to be updated according to the blocking stride, to obtain a first sub-matrix of the first matrix to be operated on, a second sub-matrix of the second matrix to be operated on and a third sub-matrix of the initial matrix to be updated, respectively;

a register selection module, configured to select, according to the blocking stride, a storage vector register and a pre-stored vector register from the vector registers, wherein the storage vector register is used to store the first sub-matrix, the second sub-matrix and the third sub-matrix of the current operation cycle, and the pre-stored vector register is used to store the first sub-matrix and the second sub-matrix of the next operation cycle; and

a result obtaining module, configured to sequentially fetch the corresponding sub-matrices from the storage vector register and the pre-stored vector register and perform matrix multiplication until an operation end condition is met, to obtain the matrix multiplication result.

In a third aspect, an embodiment of the invention further provides an electronic device, comprising:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the matrix multiplication execution method of any embodiment of the invention.

In a fourth aspect, an embodiment of the invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, enable the computer processor to perform any matrix multiplication execution method provided by the embodiments of the invention.

In the embodiments of the invention, a first matrix to be operated on, a second matrix to be operated on and an initial matrix to be updated are acquired in response to a matrix multiplication operation instruction; a blocking stride for matrix blocking is determined according to the cache space capacity; the first matrix to be operated on, the second matrix to be operated on and the initial matrix to be updated are blocked according to the blocking stride, to obtain a first sub-matrix of the first matrix to be operated on, a second sub-matrix of the second matrix to be operated on and a third sub-matrix of the initial matrix to be updated, respectively; a storage vector register and a pre-stored vector register are selected from the vector registers according to the blocking stride, the storage vector register being used to store the first sub-matrix, the second sub-matrix and the third sub-matrix of the current operation cycle and the pre-stored vector register being used to store the first sub-matrix and the second sub-matrix of the next operation cycle; and the corresponding sub-matrices are sequentially fetched from the storage vector register and the pre-stored vector register and multiplied until an operation end condition is met, to obtain the matrix multiplication result. The technical solution provided by the embodiments of the invention improves the efficiency of matrix multiplication by optimizing function performance.

It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the invention, nor to limit the scope of the invention. Other features of the invention will become readily understood from the following description.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a flow chart of a matrix multiplication execution method provided according to Embodiment 1 of the invention;

Figure 2 is a flow chart of another matrix multiplication execution method provided according to Embodiment 2 of the invention;

Figure 3 is a schematic structural diagram of a matrix multiplication execution device provided according to Embodiment 3 of the invention;

Figure 4 is a schematic structural diagram of an electronic device implementing a matrix multiplication execution method, provided according to Embodiment 4 of the invention.

Detailed description

To enable those skilled in the art to better understand the solution of the invention, the technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention without creative effort shall fall within the scope of protection of the invention.

It should be noted that the terms "first", "second" and the like in the specification, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.

It should also be noted that, in the technical solution of the invention, the collection, storage, use, processing, transmission, provision and disclosure of the relevant data all comply with the relevant laws and regulations and do not violate public order and good morals.

Embodiment 1

Figure 1 is a flow chart of a matrix multiplication execution method provided by Embodiment 1 of the invention. This embodiment is applicable where function performance is optimized to improve the efficiency of matrix multiplication. The method can be performed by a matrix multiplication execution device, which can be implemented in hardware and/or software and can be configured in an electronic device. The electronic device may be a terminal device, a server or the like; the embodiment of the invention does not limit this.

As shown in Figure 1, the matrix multiplication execution method provided by the embodiment of the invention comprises the following steps:

S110. In response to a matrix multiplication operation instruction, acquire a first matrix to be operated on, a second matrix to be operated on and an initial matrix to be updated.

Specifically, when a numerical computation requires matrix multiplication, the first matrix to be operated on, the second matrix to be operated on and the initial matrix to be updated can be acquired once the matrix multiplication operation instruction has been responded to. The embodiment of the invention does not limit how the matrix multiplication operation instruction is responded to, nor how the first matrix to be operated on, the second matrix to be operated on and the initial matrix to be updated are acquired.

The first matrix to be operated on and the second matrix to be operated on can be understood as the matrices to be operated on after the matrix multiplication operation instruction is responded to. For example, matrix multiplication can be implemented with single-precision general matrix-matrix multiplication (SGEMM) from the Level-3 BLAS numerical library. SGEMM is a general matrix multiplication routine; by modifying its function arguments, it can multiply arbitrary matrices. When a matrix multiplication is performed with SGEMM, the following formula applies:

C = alpha × op(A) × op(B) + beta × C;

Here, alpha and beta are scalar factors, and A, B and C are the operand matrices. Note that A and B are the first matrix to be operated on and the second matrix to be operated on, acquired in response to the matrix multiplication operation instruction. The initial matrix to be updated may be obtained from the acquired A and B according to the above formula; the output matrix C computed in this way can serve as the initial matrix to be updated.
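The formula can be made concrete with a small reference sketch in Python. This is an illustration only, not the SGEMM routine itself: the function name `gemm_reference` and its `trans_a` / `trans_b` flags are hypothetical, matrices are plain nested lists, and op(X) is taken to be either X or its transpose, as in the usual BLAS convention for real matrices. Single-precision arithmetic, storage order and leading dimensions are deliberately ignored.

```python
def transpose(x):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*x)]


def gemm_reference(alpha, a, b, beta, c, trans_a=False, trans_b=False):
    """Return alpha * op(A) * op(B) + beta * C as a new nested list.

    op(X) is X itself, or its transpose when the matching flag is set
    (an assumption following the usual BLAS convention).
    """
    op_a = transpose(a) if trans_a else a
    op_b = transpose(b) if trans_b else b
    m, k, n = len(op_a), len(op_b), len(op_b[0])
    if len(op_a[0]) != k or len(c) != m or len(c[0]) != n:
        raise ValueError("incompatible matrix dimensions")
    return [
        [
            alpha * sum(op_a[i][p] * op_b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]
```

Setting `beta` to zero discards the previous contents of C, while a non-zero `beta` accumulates the scaled product into the scaled existing C, which is why C acts as both an input and the matrix to be updated.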

In response to the matrix multiplication operation instruction, the first matrix to be operated on, the second matrix to be operated on and the initial matrix to be updated are acquired, so that the subsequent numerical computation can be carried out on the acquired operand matrices and the initial matrix to be updated.

S120. Determine, according to the cache space capacity, a blocking stride for matrix blocking.

Specifically, the blocking stride for matrix blocking can be determined according to the limit imposed by the cache capacity of the computer's memory system. Note that, from bottom to top, the computer storage hierarchy comprises three levels: main memory, cache and registers. The levels differ in storage capacity and in access latency in clock cycles. The embodiment of the invention does not limit how the blocking stride is determined from the cache space capacity.

Determining the blocking stride from the cache space capacity facilitates the subsequent matrix computation.
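Since the embodiment leaves the rule for choosing the blocking stride open, the following Python sketch shows only one possible heuristic, not the patented rule. It assumes that one square block from each of the three matrices should fit in the cache at the same time and that the stride should be a multiple of the vector width. The function name `choose_block_stride` and the parameters `elem_bytes` and `vector_lanes` are hypothetical.

```python
import math


def choose_block_stride(cache_bytes, elem_bytes=4, vector_lanes=4):
    """Pick a square blocking stride s such that 3 * s * s * elem_bytes <= cache_bytes.

    Assumed heuristic: one block each of A, B and C is resident at once.
    The result is rounded down to a multiple of vector_lanes and never
    drops below vector_lanes, so a very small cache may still be exceeded.
    """
    max_elems_per_block = cache_bytes // (3 * elem_bytes)
    stride = math.isqrt(max_elems_per_block)
    stride -= stride % vector_lanes
    return max(stride, vector_lanes)
```

With a 32 KiB cache and 4-byte elements this heuristic yields a stride of 52, since three 52 x 52 blocks occupy 32448 bytes. A production kernel would more likely use separate strides per dimension and per cache level.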

S130. Block the first matrix to be operated on, the second matrix to be operated on and the initial matrix to be updated according to the blocking stride, to obtain a first sub-matrix of the first matrix to be operated on, a second sub-matrix of the second matrix to be operated on and a third sub-matrix of the initial matrix to be updated, respectively.

Specifically, according to the determined blocking stride, the first matrix to be operated on, the second matrix to be operated on and the initial matrix to be updated can be blocked, yielding the first sub-matrix of the first matrix to be operated on, the second sub-matrix of the second matrix to be operated on and the third sub-matrix of the initial matrix to be updated. The embodiment of the invention does not limit the blocking scheme used when blocking the first matrix to be operated on, the second matrix to be operated on and the initial matrix to be updated.

It should be noted that, in the computer storage hierarchy, the central processing unit (CPU), which is the computing and control core of the computer system, generally accesses registers, cache, main memory and hard disk at different speeds. Registers store temporary data and instructions. Because the CPU reads and writes registers directly, access is very fast, usually on the order of nanoseconds, which makes registers the fastest storage medium inside the CPU. Cache is a storage medium located between the CPU and main memory and is used to speed up access to main memory. It can be divided into multiple levels such as L1, L2 and L3: the L1 cache is closest to the CPU and is therefore very fast, while the L3 cache is farther from the CPU and is therefore slower. Cache access times generally range from a few nanoseconds to a little over ten nanoseconds. Main memory, the main storage medium for programs and data in a computer system, usually refers to random access memory (RAM). It is internal memory that exchanges data directly with the CPU, can be read and written at any time, and is fast; it typically serves as temporary data storage for the operating system and other running programs. While RAM is operating, information can be written to (stored at) or read from (fetched from) any specified address at any time. RAM differs from read-only memory (ROM) in that its data is volatile: the stored data is lost once power is removed. In computers and digital systems, RAM is used to temporarily store programs, data and intermediate results. Compared with registers and cache, however, main memory is slower to access, generally taking tens to hundreds of nanoseconds. The hard disk, the device used for permanent data storage in a computer system, is the slowest storage medium; its access time is far longer than that of main memory, generally on the order of milliseconds or more.

Optionally, after the first matrix to be operated on, the second matrix to be operated on and the initial matrix to be updated are blocked according to the blocking stride to obtain the first sub-matrix of the first matrix to be operated on, the second sub-matrix of the second matrix to be operated on and the third sub-matrix of the initial matrix to be updated, the method further comprises: storing the first sub-matrix of the first matrix to be operated on in a preset row storage format; and storing the second sub-matrix of the second matrix to be operated on in a preset column storage format.

Specifically, once blocking according to the blocking stride has produced the first sub-matrix of the first matrix to be operated on, the second sub-matrix of the second matrix to be operated on and the third sub-matrix of the initial matrix to be updated, the first sub-matrix can be stored in the preset row storage format and the second sub-matrix can be stored in the preset column storage format. The embodiment of the invention does not limit the specific form of the preset row storage format or the preset column storage format.

In summary, for example, the matrices in main memory can be partitioned by matrix blocking so that the resulting smaller sub-matrices can be placed in the cache. The CPU can then fetch elements directly from the cache during computation, which gives fast access to the matrix elements.
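The row and column storage of the sub-matrices can be sketched as a packing step. The Python below is an illustration under assumptions, because the patent leaves the concrete storage formats open: the source matrix is assumed to be a flat row-major list with leading dimension `ld`, and the helper names `pack_block_rows` and `pack_block_cols` are hypothetical.

```python
def pack_block_rows(mat, ld, row0, col0, rows, cols):
    """Copy a rows x cols block out of a flat row-major matrix, row by row.

    Used here for sub-matrices of A, so that each row of the block is
    contiguous in the packed buffer.
    """
    return [mat[(row0 + i) * ld + (col0 + j)] for i in range(rows) for j in range(cols)]


def pack_block_cols(mat, ld, row0, col0, rows, cols):
    """Copy a rows x cols block out of a flat row-major matrix, column by column.

    Used here for sub-matrices of B, so that each column of the block is
    contiguous in the packed buffer.
    """
    return [mat[(row0 + i) * ld + (col0 + j)] for j in range(cols) for i in range(rows)]
```

After packing, the dot product of a block row of A with a block column of B reads both operands from consecutive addresses, which removes the strided walk described in the background.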

S140. Select, according to the blocking stride, a storage vector register and a pre-stored vector register from the vector registers; the storage vector register is used to store the first sub-matrix, the second sub-matrix and the third sub-matrix of the current operation cycle; the pre-stored vector register is used to store the first sub-matrix and the second sub-matrix of the next operation cycle.

Specifically, according to the blocking stride, a storage vector register and a pre-stored vector register can be selected from the vector registers. The storage vector register stores the first sub-matrix, the second sub-matrix and the third sub-matrix of the current operation cycle; the pre-stored vector register stores the first sub-matrix and the second sub-matrix of the next operation cycle.

With the storage vector register holding the sub-matrices of the current operation cycle and the pre-stored vector register holding the first and second sub-matrices of the next operation cycle, subsequent work can proceed from the sub-matrices already held in these registers. Reading the sub-matrices from the registers in this way improves the computational efficiency of the matrix operation.
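The split between a storage register and a pre-stored register resembles double buffering, which can be sketched as follows. Python cannot model vector registers, so this sketch only shows the ordering of loads relative to computation; the function name `run_double_buffered`, the `multiply_add` callback and the event log are all hypothetical and are not part of the patent.

```python
def run_double_buffered(a_blocks, b_blocks, c_block, multiply_add):
    """Accumulate c_block over all cycles while preloading the next operands.

    `storage` holds the A and B sub-matrices of the current cycle and
    `pre_stored` holds those of the next cycle. Returns the updated
    c_block and a log of the load/compute order.
    """
    if len(a_blocks) != len(b_blocks):
        raise ValueError("a_blocks and b_blocks must have the same length")
    log = []
    if not a_blocks:
        return c_block, log
    storage = (a_blocks[0], b_blocks[0])
    log.append("load 0")
    for t in range(len(a_blocks)):
        pre_stored = None
        if t + 1 < len(a_blocks):
            # Fetch the operands of the next cycle before computing this one.
            pre_stored = (a_blocks[t + 1], b_blocks[t + 1])
            log.append(f"load {t + 1}")
        c_block = multiply_add(c_block, storage[0], storage[1])
        log.append(f"compute {t}")
        # The preloaded operands become the current ones for the next cycle.
        storage = pre_stored
    return c_block, log
```

On real hardware the load for cycle t + 1 can overlap with the arithmetic of cycle t, which is where the benefit comes from; in this sequential sketch the log merely records that each load is issued before the preceding computation.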

S150. Sequentially fetch the corresponding sub-matrices from the storage vector register and the pre-stored vector register and perform matrix multiplication until an operation end condition is met, to obtain the matrix multiplication result.

Specifically, the corresponding sub-matrices are fetched in turn from the storage vector register and the pre-stored vector register, and matrix multiplication is performed on the fetched sub-matrices; when the operation end condition is met, the matrix multiplication result is obtained. The embodiment of the invention does not limit the operation end condition.

Performing the matrix multiplication sub-matrix by sub-matrix, and taking the result once the operation end condition is met, effectively improves the computational efficiency of the matrix operation.
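Steps S130 to S150 can be summarized by a blocked multiplication loop. The Python sketch below is an illustration under assumptions rather than the patented kernel: it uses a single stride for all three dimensions, nested lists instead of vector registers, and takes "all blocks visited" as the operation end condition, which the embodiment leaves open. The name `blocked_matmul` is hypothetical.

```python
def blocked_matmul(a, b, c, stride):
    """Update c in place with c += a * b, one stride x stride block at a time.

    a is m x k, b is k x n and c is m x n, all stored as lists of rows.
    Edge blocks smaller than the stride are handled by clamping with min().
    """
    m, k, n = len(a), len(b), len(b[0])
    for i0 in range(0, m, stride):          # block row of A and C
        for j0 in range(0, n, stride):      # block column of B and C
            for p0 in range(0, k, stride):  # block along the shared dimension
                for i in range(i0, min(i0 + stride, m)):
                    for j in range(j0, min(j0 + stride, n)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + stride, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

Because each block of C is updated from blocks of A and B that are small enough to stay in the cache, every element loaded is reused several times before it is evicted, which raises the compute-to-memory-access ratio.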

Embodiment 2

Figure 2 is a flow chart of another matrix multiplication execution method provided by Embodiment 2 of the invention. The technical solution of this embodiment is a further optimization of the optional technical solutions described above.

Further, "determining, according to the cache space capacity, a blocking stride for matrix blocking" is refined into "determining a matrix blocking count (the number of times blocking is performed) according to the matrix dimension sizes of the first matrix to be operated on, the second matrix to be operated on and the initial matrix to be updated; and determining, according to the cache space capacity, the blocking strides of the first matrix to be operated on, the second matrix to be operated on and the initial matrix to be updated in different dimensions for each matrix blocking count", so as to improve the efficiency of matrix multiplication. For parts not described in this embodiment, reference may be made to the relevant descriptions in the other embodiments, which are not repeated here.

如图2所示，本发明实施例提供的另一种矩阵乘法运算执行方法，具体包括如下步骤：As shown in Figure 2, another matrix multiplication execution method provided by an embodiment of the present invention specifically includes the following steps:

S210、响应于矩阵乘法操作指令，获取第一待操作矩阵、第二待操作矩阵和初始待更新矩阵。S210. In response to the matrix multiplication operation instruction, obtain the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated.

S220. Determine the number of matrix blocking passes according to the matrix dimensions of the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated.

Specifically, in response to the matrix multiplication operation instruction, the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated are obtained, and the number of matrix blocking passes is then determined according to the sizes of the obtained matrices. In particular, the embodiment of the present invention does not limit the number of matrix blocking passes.

After the matrix multiplication operation instruction has been responded to, the number of matrix blocking passes can be determined from the sizes of the obtained first matrix to be operated, second matrix to be operated and initial matrix to be updated. The sub-matrices of each matrix can then be obtained according to that number of passes, and subsequent work can be carried out on the obtained sub-matrices.

S230. According to the cache space capacity, determine the blocking strides of the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated in each dimension for each blocking pass.

S240、根据分块步幅，对第一待操作矩阵、第二待操作矩阵和初始待更新矩阵进行矩阵分块，分别得到第一待操作矩阵的第一子矩阵、第二待操作矩阵的第二子矩阵和初始待更新矩阵的第三子矩阵。S240, performing matrix blocking on the first matrix to be operated, the second matrix to be operated, and the initial matrix to be updated according to the blocking stride, to obtain a first sub-matrix of the first matrix to be operated, a second sub-matrix of the second matrix to be operated, and a third sub-matrix of the initial matrix to be updated, respectively.

Specifically, according to the cache space capacity, the blocking strides of the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated in each dimension can be determined for each blocking pass. According to the determined blocking strides, matrix blocking can be performed on the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated, yielding the corresponding first sub-matrix of the first matrix to be operated, second sub-matrix of the second matrix to be operated and third sub-matrix of the initial matrix to be updated.

需要说明的是，根据分块步幅，在对第一待操作矩阵、第二待操作矩阵和初始待更新矩阵进行矩阵分块时，可以是对上述各矩阵按行或按列进行分块，进而得到相应的子矩阵。本发明实施例对矩阵分块时，所选取的矩阵分块方式不进行限制。It should be noted that according to the blocking step, when the first to-be-operated matrix, the second to-be-operated matrix and the initial to-be-updated matrix are divided into matrix blocks, each of the above matrices can be divided into blocks by rows or columns. Then the corresponding sub-matrix is obtained. The embodiment of the present invention does not limit the matrix blocking method selected when the matrix is divided into blocks.

Optionally, performing matrix blocking on the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated according to the blocking stride, to obtain respectively the first sub-matrix of the first matrix to be operated, the second sub-matrix of the second matrix to be operated and the third sub-matrix of the initial matrix to be updated, includes: performing matrix blocking on the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated according to their blocking strides in each dimension for each blocking pass, to obtain, for each blocking pass, a first block matrix of the first matrix to be operated, a second block matrix of the second matrix to be operated and a third block matrix of the initial matrix to be updated; and determining the first block matrix of the first matrix to be operated obtained in the last blocking pass as the first sub-matrix, the second block matrix of the second matrix to be operated as the second sub-matrix, and the third block matrix of the initial matrix to be updated as the third sub-matrix.

Specifically, when matrix blocking is performed on the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated according to the blocking stride, the first sub-matrix of the first matrix to be operated, the second sub-matrix of the second matrix to be operated and the third sub-matrix of the initial matrix to be updated can be obtained respectively. One specific way of obtaining them is to perform matrix blocking on the three matrices according to their blocking strides in each dimension for each blocking pass, which yields, for each blocking pass, the first block matrix of the first matrix to be operated, the second block matrix of the second matrix to be operated and the third block matrix of the initial matrix to be updated.

According to the blocking strides of the different matrices in each dimension for each blocking pass, each matrix can be blocked, yielding the block matrices of each matrix for each blocking pass. In addition, the first block matrix of the first matrix to be operated obtained in the last blocking pass is determined as the first sub-matrix, the second block matrix of the second matrix to be operated is determined as the second sub-matrix, and the third block matrix of the initial matrix to be updated is determined as the third sub-matrix, so that subsequent work such as matrix computation can be carried out on the sub-matrices obtained at this point.
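The multi-pass blocking described above can be illustrated with a minimal Python sketch. The helper names `split` and `last_level_blocks`, the extent 20 and the strides 8 and 4 are hypothetical and chosen only for illustration; the embodiment itself does not limit how the blocking is carried out.

```python
def split(extent, stride):
    """Return (start, size) pairs that tile `extent` in steps of `stride`."""
    return [(s, min(stride, extent - s)) for s in range(0, extent, stride)]


def last_level_blocks(extent, strides):
    """Apply one stride per blocking pass; each pass re-splits the previous blocks.

    The blocks left after the final pass play the role of the sub-matrices.
    """
    blocks = [(0, extent)]
    for stride in strides:
        blocks = [(start + off, size)
                  for start, length in blocks
                  for off, size in split(length, stride)]
    return blocks


# Hypothetical example: one dimension of length 20, blocked by 8 and then by 4.
m_blocks = last_level_blocks(20, [8, 4])
print(m_blocks)
```

Each tuple is the starting index and length of one last-pass block along a single dimension; applying the same helper to every dimension of a matrix gives the row and column ranges of its sub-matrices.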

S250. According to the blocking stride, select storage vector registers and pre-stored vector registers from the vector registers; the storage vector registers are used to store the first sub-matrix, the second sub-matrix and the third sub-matrix of the current operation cycle; the pre-stored vector registers are used to store the first sub-matrix and the second sub-matrix of the next operation cycle.

具体的，根据已获取到的分块步幅，可从各向量寄存器中选取存储向量寄存器和预存向量寄存器，其中，存储向量寄存器可用于存储当前运算周期下的第一子矩阵、第二子矩阵和第三子矩阵；预存向量寄存器可用于存储下一运算周期下的第一子矩阵和第二子矩阵。Specifically, according to the acquired block stride, a storage vector register and a pre-stored vector register can be selected from each vector register, wherein the storage vector register can be used to store the first sub-matrix, the second sub-matrix and the third sub-matrix in the current operation cycle; the pre-stored vector register can be used to store the first sub-matrix and the second sub-matrix in the next operation cycle.

Optionally, selecting storage vector registers and pre-stored vector registers from the vector registers according to the blocking stride includes: determining, according to the blocking strides of the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated in the last blocking pass, the storage vector registers used to store the first sub-matrix, the second sub-matrix and the third sub-matrix of the current operation cycle; and selecting from the vector registers, according to the register quantity of the storage vector registers, the pre-stored vector registers used to store the first sub-matrix and the second sub-matrix of the next operation cycle.

Specifically, the storage vector registers used to store the first sub-matrix, the second sub-matrix and the third sub-matrix of the current operation cycle are determined according to the blocking strides of the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated in the last blocking pass; the pre-stored vector registers used to store the first sub-matrix and the second sub-matrix of the next operation cycle are then selected from the vector registers according to the register quantity of the storage vector registers. In particular, the embodiment of the present invention does not limit the manner in which the storage vector registers and the pre-stored vector registers are selected from the vector registers according to the blocking stride.

Optionally, determining the storage vector registers used to store the first sub-matrix, the second sub-matrix and the third sub-matrix of the current operation cycle according to the blocking strides of the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated in the last blocking pass includes: determining, according to the blocking stride of the first matrix to be operated in the last blocking pass, a first storage vector register used to store the first sub-matrix of the current operation cycle; determining, according to the blocking stride of the second matrix to be operated in the last blocking pass, a second storage vector register used to store the second sub-matrix of the current operation cycle; determining, according to the blocking stride of the initial matrix to be updated in the last blocking pass, a third storage vector register used to store the third sub-matrix of the current operation cycle; and determining the first storage vector register, the second storage vector register and the third storage vector register as the storage vector registers.

Specifically, when the storage vector registers used to store the first sub-matrix, the second sub-matrix and the third sub-matrix of the current operation cycle are determined according to the blocking strides of the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated in the last blocking pass, the first storage vector register used to store the first sub-matrix of the current operation cycle can be determined according to the blocking stride of the first matrix to be operated in the last blocking pass. The second storage vector register used to store the second sub-matrix of the current operation cycle can be determined according to the blocking stride of the second matrix to be operated in the last blocking pass, and the third storage vector register used to store the third sub-matrix of the current operation cycle can be determined according to the blocking stride of the initial matrix to be updated in the last blocking pass. In addition, the first storage vector register, the second storage vector register and the third storage vector register can be determined as the storage vector registers. In particular, the embodiment of the present invention does not limit the manner of determining the storage vector register corresponding to each sub-matrix in the current operation cycle. Here, the sub-matrices are the first sub-matrix, the second sub-matrix and the third sub-matrix, and the storage vector registers corresponding to them are the first storage vector register, the second storage vector register and the third storage vector register.

其中，根据存储向量寄存器的寄存器数量，从各向量寄存器中选取用于存储下一运算周期下的第一子矩阵和第二子矩阵的预存向量寄存器，包括：根据第一存储向量寄存器的第一寄存器数量，从各向量寄存器中选取用于存储下一运算周期下的第一子矩阵的第一预存向量寄存器；根据第二存储向量寄存器的第二寄存器数量，从各向量寄存器中选取用于存储下一运算周期下的第二子矩阵的第二预存向量寄存器。Among them, according to the register quantity of the storage vector register, the pre-stored vector register for storing the first sub-matrix and the second sub-matrix in the next operation cycle is selected from each vector register, including: according to the first register quantity of the first storage vector register, the first pre-stored vector register for storing the first sub-matrix in the next operation cycle is selected from each vector register; according to the second register quantity of the second storage vector register, the second pre-stored vector register for storing the second sub-matrix in the next operation cycle is selected from each vector register.

Specifically, when the pre-stored vector registers used to store the first sub-matrix and the second sub-matrix of the next operation cycle are selected from the vector registers according to the register quantity of the storage vector registers, the first pre-stored vector register used to store the first sub-matrix of the next operation cycle can be selected from the vector registers according to the first register quantity of the first storage vector register, and the second pre-stored vector register used to store the second sub-matrix of the next operation cycle can be selected from the vector registers according to the second register quantity of the second storage vector register. In particular, the embodiment of the present invention does not limit the manner of selecting, according to the register quantity of each storage vector register, the pre-stored vector register corresponding to each sub-matrix of the next operation cycle. It should be noted that the storage vector registers referred to here are the first storage vector register and the second storage vector register, and the pre-stored vector registers are the first pre-stored vector register and the second pre-stored vector register.

According to the blocking stride of the first matrix to be operated in the last blocking pass, the first storage vector register used to store the first sub-matrix of the current operation cycle is determined; according to the blocking stride of the second matrix to be operated in the last blocking pass, the second storage vector register used to store the second sub-matrix of the current operation cycle is determined; according to the blocking stride of the initial matrix to be updated in the last blocking pass, the third storage vector register used to store the third sub-matrix of the current operation cycle is determined; and the first, second and third storage vector registers are determined as the storage vector registers, which store the first sub-matrix, the second sub-matrix and the third sub-matrix of the current operation cycle. According to the first register quantity of the first storage vector register, the first pre-stored vector register used to store the first sub-matrix of the next operation cycle is selected from the vector registers; according to the second register quantity of the second storage vector register, the second pre-stored vector register used to store the second sub-matrix of the next operation cycle is selected from the vector registers. The sub-matrices can then be read from the vector registers and the numerical computation on each sub-matrix can be carried out.

S260、依次从存储向量寄存器和预存向量寄存器中获取相应子矩阵进行矩阵乘法运算，直到满足运算结束条件，得到矩阵乘法运算结果。S260: Obtain the corresponding sub-matrices from the storage vector register and the pre-stored vector register in sequence and perform matrix multiplication until the operation end condition is met, and the matrix multiplication result is obtained.

具体的，通过依次从存储向量寄存器和预存向量寄存器中获取相应的子矩阵进行矩阵乘法运算，在直至满足运算结束条件的情况下，可得到相对应的矩阵乘法运算结果。综上所陈述的技术方案，可获取到最终的矩阵乘法运算结果。Specifically, by sequentially obtaining corresponding sub-matrices from the storage vector register and the pre-stored vector register to perform matrix multiplication operations, the corresponding matrix multiplication operation results can be obtained when the operation end conditions are met. In summary, the technical solution described above can obtain the final matrix multiplication operation result.

In an implementation of an optional embodiment, a matrix A of size m×k, a matrix B of size k×n and a matrix C of size m×n are given as an example, where A and B are the input matrices and C is the output matrix to be updated.

Matrix blocking is performed on the three matrices A, B and C obtained above. One specific blocking scheme is as follows: along the n dimension with step nc, matrix B is split by columns into sub-matrices b of size k×nc; along the n dimension with step nc, matrix C is split by columns into sub-matrices c of size m×nc; along the k dimension with step kc, matrix A is split by columns into sub-matrices a of size m×kc; along the k dimension with step kc, each sub-matrix b is split by rows into sub-matrices bb of size kc×nc; along the m dimension with step mc, sub-matrix a is split by rows into sub-matrices aa of size mc×kc; along the m dimension with step mc, sub-matrix c is split by rows into sub-matrices cc of size mc×nc; along the nc dimension with step nr, sub-matrix bb is split by columns into sub-matrices bbb of size kc×nr; along the nc dimension with step nr, sub-matrix cc is split by columns into sub-matrices ccc of size mc×nr; along the mc dimension with step mr, sub-matrix aa is split by rows into sub-matrices aaa of size mr×kc; along the mc dimension with step mr, sub-matrix ccc is split by rows into sub-matrices cccc of size mr×nr.
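The blocking order described above corresponds to a five-level loop nest around a small inner kernel. The following Python sketch is one possible reading of that scheme, not the patent's hand-written assembly; the function names `naive_gemm` and `blocked_gemm`, the loop variable names and the small test sizes are assumptions made for illustration.

```python
def naive_gemm(A, B, C):
    """Reference triple loop: returns C + A @ B as a new matrix."""
    m, k, n = len(A), len(B), len(B[0])
    return [[C[i][j] + sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def blocked_gemm(A, B, C, mc, kc, nc, mr, nr):
    """C += A @ B using the nc / kc / mc / nr / mr blocking order; updates C in place."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):                                 # b: k x nc,   c: m x nc
        for pc in range(0, k, kc):                             # a: m x kc,   bb: kc x nc
            for ic in range(0, m, mc):                         # aa: mc x kc, cc: mc x nc
                for jr in range(jc, min(jc + nc, n), nr):      # bbb: kc x nr, ccc: mc x nr
                    for ir in range(ic, min(ic + mc, m), mr):  # aaa: mr x kc, cccc: mr x nr
                        # Inner kernel: cccc += aaa @ bbb
                        for i in range(ir, min(ir + mr, ic + mc, m)):
                            for j in range(jr, min(jr + nr, jc + nc, n)):
                                acc = C[i][j]
                                for p in range(pc, min(pc + kc, k)):
                                    acc += A[i][p] * B[p][j]
                                C[i][j] = acc
    return C


# Small integer example (sizes chosen arbitrarily) so the comparison is exact.
A = [[i + p for p in range(7)] for i in range(5)]   # 5 x 7
B = [[p - j for j in range(6)] for p in range(7)]   # 7 x 6
C0 = [[1] * 6 for _ in range(5)]                    # 5 x 6
C_blocked = blocked_gemm(A, B, [row[:] for row in C0], mc=4, kc=3, nc=4, mr=2, nr=2)
print(C_blocked == naive_gemm(A, B, C0))
```

The `min(...)` bounds handle dimensions that are not multiples of the strides; a tuned kernel would typically pad or special-case those edge blocks instead.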

In particular, after the above blocking operations are completed, the resulting sub-matrices of A, B and C have sizes mr×kc, kc×nr and mr×nr respectively. The sub-matrix of C can then be computed using the vector registers together with the Level-1 (i.e. L1) cache. The blocking parameters mc, kc, nc, mr and nr described above are all chosen according to the cache capacity of the specific processor. In the present invention, the blocking parameters are constrained by the Level-1 cache, for example by the small capacity of L1. A reasonable division should satisfy the following formulas (1) and (2):

其中，mc、kc、nc、mr和nr均不能超过L1大小，综上所述mc、kc、nc、mr和nr可分别取96kb、96kb、64kb、16kb和4kb。Among them, mc, kc, nc, mr and nr cannot exceed the L1 size. In summary, mc, kc, nc, mr and nr can be 96kb, 96kb, 64kb, 16kb and 4kb respectively.

Next, data rearrangement is performed on the obtained sub-matrices. After blocking, the floating-point numbers stored in matrices A and B are no longer contiguous in memory. This seriously reduces the cache hit rate: the cache cannot fetch contiguous data each time it reads from main memory, so a large number of cache misses occur and computing efficiency suffers. To avoid this, the kc×nc sub-matrix of B (i.e. bb) and the mc×kc sub-matrix of A (i.e. aa) are rearranged along the kc direction so that the data inside them occupy contiguous addresses, which improves the cache hit rate. First, the kc×nc sub-matrix of B is split by columns into several kc×4 small matrices, each small matrix is stored by rows, and the small matrices are then concatenated in order. Second, the mc×kc sub-matrix of A is split by rows into several 16×kc small matrices, each small matrix is stored by columns, and the small matrices are then concatenated in order.
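The rearrangement described above can be sketched in Python as two packing helpers. The names `pack_b` and `pack_a` are hypothetical, and the example uses panel widths of 2 purely to keep the output short; the text above uses panels of kc×4 for B and 16×kc for A.

```python
def pack_b(bb, nr):
    """Pack a kc x nc block into consecutive kc x nr panels, each stored row by row."""
    kc, nc = len(bb), len(bb[0])
    packed = []
    for j0 in range(0, nc, nr):          # one panel per group of nr columns
        for p in range(kc):              # rows of the panel, top to bottom
            packed.extend(bb[p][j0:j0 + nr])
    return packed


def pack_a(aa, mr):
    """Pack an mc x kc block into consecutive mr x kc panels, each stored column by column."""
    mc, kc = len(aa), len(aa[0])
    packed = []
    for i0 in range(0, mc, mr):          # one panel per group of mr rows
        for p in range(kc):              # columns of the panel, left to right
            packed.extend(aa[i][p] for i in range(i0, min(i0 + mr, mc)))
    return packed


# Tiny illustrative blocks (values are arbitrary).
aa = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]   # mc = 4, kc = 3
bb = [[1, 2, 3, 4], [5, 6, 7, 8]]                        # kc = 2, nc = 4
packed_a = pack_a(aa, mr=2)
packed_b = pack_b(bb, nr=2)
print(packed_a)
print(packed_b)
```

After packing, step p of the inner kernel reads mr consecutive values of A and nr consecutive values of B, so both operands are traversed with unit stride.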

Then, vector registers are allocated for the rearranged sub-matrices. Vector registers are mainly used for vectorized, accelerated computation. Generally, the more vector registers are used, the greater the gain in computing performance, so with only a limited number of vector registers available, a sound allocation strategy is needed to make the fullest use of them. In the present invention, the hardware vector registers are 128 bits wide and each can hold 4 single-precision floating-point numbers. Therefore mr/4 vector registers (mr is 16 in the example above) are used to store the A matrix elements, nr/4 vector registers (nr is 4 in the example above) are used to store the B matrix elements, and mr×nr/4 vector registers are used to store the C matrix elements. A further 4 vector registers hold the 16 prefetched A matrix elements and 1 vector register holds the 4 prefetched B matrix elements, so the total number of vector registers used is 4 + 1 + 16 + 4 + 1 = 26.
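The register count above can be reproduced with a short Python calculation. The function name `register_budget` and its parameter names are hypothetical; the sketch assumes, as the text does, that the prefetch registers mirror the number of registers holding the current A and B elements.

```python
def register_budget(mr, nr, vector_bits=128, element_bits=32):
    """Count vector registers for the current tiles and the prefetched A and B elements."""
    lanes = vector_bits // element_bits            # single-precision values per register
    current = {"A": mr // lanes, "B": nr // lanes, "C": mr * nr // lanes}
    prefetch = {"A": current["A"], "B": current["B"]}   # assumed to mirror the current A/B counts
    total = sum(current.values()) + sum(prefetch.values())
    return lanes, current, prefetch, total


lanes, current, prefetch, total = register_budget(mr=16, nr=4)
print(lanes, current, prefetch, total)
```

With mr = 16 and nr = 4 the total is 26, which fits within the 32 architectural vector registers of the RISC-V vector extension.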

Then, when the obtained data is rearranged, the required data can be loaded into the cache in advance, so that during computation the data is fetched directly from the cache and less time is spent waiting for it to be loaded from memory. Data prefetching can also reduce cache misses, which greatly improves computing efficiency. When writing the assembly computation code, processor pipeline stalls caused by conflicts such as data dependencies should be avoided as far as possible. The present invention therefore takes the characteristics of the hardware pipeline into account and reorders the assembly instructions to avoid the stalls caused by RAW (read-after-write) hazards, so that the computing power of the processor is used efficiently, the pipeline is kept full, and the computing efficiency of GEMM is improved.
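The interplay between the current operands and the prefetched operands can be sketched structurally in Python. This is only an illustration of the load-ahead pattern: Python cannot model pipeline timing, the name `micro_kernel` and the buffer names are hypothetical, and the panels are assumed to be packed so that step p of A occupies mr consecutive values and step p of B occupies nr consecutive values, with kc at least 1.

```python
def micro_kernel(a_panel, b_panel, c_tile, mr, nr, kc):
    """c_tile (mr x nr, row-major) += aaa @ bbb, reading packed panels step by step."""
    # "Storage" buffers: operands of the current step.
    a_cur, b_cur = a_panel[0:mr], b_panel[0:nr]
    for p in range(kc):
        # "Pre-stored" buffers: fetch the operands of step p + 1 before computing step p,
        # so the loads do not depend on the results of the arithmetic below.
        if p + 1 < kc:
            a_next = a_panel[(p + 1) * mr:(p + 2) * mr]
            b_next = b_panel[(p + 1) * nr:(p + 2) * nr]
        # Rank-1 update of the mr x nr tile with the current operands.
        for i in range(mr):
            for j in range(nr):
                c_tile[i * nr + j] += a_cur[i] * b_cur[j]
        # The prefetched operands become the current ones for the next step.
        if p + 1 < kc:
            a_cur, b_cur = a_next, b_next
    return c_tile


# aaa = [[1, 2, 3], [4, 5, 6]] packed by columns; bbb = [[1, 0], [0, 1], [1, 1]] packed by rows.
a_panel = [1, 4, 2, 5, 3, 6]
b_panel = [1, 0, 0, 1, 1, 1]
tile = micro_kernel(a_panel, b_panel, [0, 0, 0, 0], mr=2, nr=2, kc=3)
print(tile)
```

In an assembly kernel the same pattern lets the loads for the next step overlap with the multiply-accumulate instructions of the current step, which is one way to keep dependent instructions apart.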

The technical solution adopted in the embodiment of the present invention accelerates matrix multiplication through vectorization and thereby greatly improves computing efficiency. It can run on any processor that supports the RISC-V vector extension instruction set (version 0.7.1), and no additional hardware is required. In addition, the present invention redesigns the computation kernel for the matrix dimensions and implements it in hand-written assembly code. By taking the characteristics of the hardware processor into account, the optimal matrix blocking strategy, matrix blocking parameters and register allocation strategy can be calculated and determined. Therefore, compared with the original triple-loop matrix multiplication, the optimized method implemented by the present invention is considerably faster.

In the embodiment of the present invention, the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated are obtained in response to a matrix multiplication operation instruction, and the number of matrix blocking passes is determined according to the matrix dimensions of these matrices; according to the cache space capacity, the blocking strides of each matrix in each dimension are determined for each blocking pass, and matrix blocking is performed on the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated, to obtain, for each blocking pass, the first block matrix of the first matrix to be operated, the second block matrix of the second matrix to be operated and the third block matrix of the initial matrix to be updated; the first block matrix of the first matrix to be operated obtained in the last blocking pass is determined as the first sub-matrix, the second block matrix of the second matrix to be operated as the second sub-matrix, and the third block matrix of the initial matrix to be updated as the third sub-matrix; according to the blocking stride of the first matrix to be operated in the last blocking pass, the first storage vector register used to store the first sub-matrix of the current operation cycle is determined; according to the blocking stride of the second matrix to be operated in the last blocking pass, the second storage vector register used to store the second sub-matrix of the current operation cycle is determined; according to the blocking stride of the initial matrix to be updated in the last blocking pass, the third storage vector register used to store the third sub-matrix of the current operation cycle is determined; the first, second and third storage vector registers are determined as the storage vector registers; according to the first register quantity of the first storage vector register, the first pre-stored vector register used to store the first sub-matrix of the next operation cycle is selected from the vector registers; according to the second register quantity of the second storage vector register, the second pre-stored vector register used to store the second sub-matrix of the next operation cycle is selected from the vector registers; the storage vector registers store the first sub-matrix, the second sub-matrix and the third sub-matrix of the current operation cycle; the pre-stored vector registers store the first sub-matrix and the second sub-matrix of the next operation cycle; and the corresponding sub-matrices are obtained in sequence from the storage vector registers and the pre-stored vector registers for matrix multiplication until the operation end condition is met, yielding the matrix multiplication result. The technical solution provided by the embodiment of the present invention improves the efficiency of matrix multiplication by optimizing function performance.

实施例三Embodiment 3

图3为本发明实施例三提供的一种矩阵乘法运算执行装置的结构示意图。如图3所示，该一种矩阵乘法运算执行装置包括：矩阵获取模块310、分块步幅确定模块320、子矩阵得到模块330、寄存器选取模块340和结果得到模块350。其中：FIG. 3 is a schematic structural diagram of a matrix multiplication execution device provided in Embodiment 3 of the present invention. As shown in FIG. 3, the matrix multiplication execution device includes: a matrix acquisition module 310, a block stride determination module 320, a sub-matrix obtaining module 330, a register selection module 340 and a result obtaining module 350. Wherein:

矩阵获取模块310，用于响应于矩阵乘法操作指令，获取第一待操作矩阵、第二待操作矩阵和初始待更新矩阵；The matrix acquisition module 310 is configured to acquire the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated in response to the matrix multiplication operation instruction;

分块步幅确定模块320，用于根据缓存空间容量，确定用于进行矩阵分块的分块步幅；The blocking stride determination module 320 is used to determine the blocking stride for matrix partitioning according to the cache space capacity;

子矩阵得到模块330，用于根据分块步幅，对第一待操作矩阵、第二待操作矩阵和初始待更新矩阵进行矩阵分块，分别得到第一待操作矩阵的第一子矩阵、第二待操作矩阵的第二子矩阵和初始待更新矩阵的第三子矩阵；The sub-matrix obtaining module 330 is used to perform matrix blocking on the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated according to the block stride, and to obtain, respectively, the first sub-matrix of the first matrix to be operated, the second sub-matrix of the second matrix to be operated and the third sub-matrix of the initial matrix to be updated;

寄存器选取模块340，用于根据分块步幅，从各向量寄存器中选取存储向量寄存器和预存向量寄存器；存储向量寄存器用于存储当前运算周期下的第一子矩阵、第二子矩阵和第三子矩阵；预存向量寄存器用于存储下一运算周期下的第一子矩阵和第二子矩阵；The register selection module 340 is used to select a storage vector register and a pre-stored vector register from the vector registers according to the block stride; the storage vector register is used to store the first sub-matrix, the second sub-matrix and the third sub-matrix in the current operation cycle; the pre-stored vector register is used to store the first sub-matrix and the second sub-matrix in the next operation cycle;

结果得到模块350，用于依次从存储向量寄存器和预存向量寄存器中获取相应子矩阵进行矩阵乘法运算，直到满足运算结束条件，得到矩阵乘法运算结果。The result obtaining module 350 is used to sequentially obtain the corresponding sub-matrices from the storage vector register and the pre-stored vector register and perform the matrix multiplication operation until the operation end condition is met to obtain the matrix multiplication operation result.
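
The data flow through modules 310 to 350 can be sketched in plain Python. This is a minimal illustration, not the patented implementation: the function name `blocked_matmul`, the list-of-lists matrix layout, and the use of an ordinary variable (`prefetched`) to stand in for the pre-stored vector registers are all assumptions made for the example. It computes the updated matrix as C + A·B one block at a time, loading the operands of the next operation cycle before the current cycle's arithmetic runs.

```python
def blocked_matmul(a, b, c, mb, nb, kb):
    """Return c + a @ b, computed block by block (hypothetical sketch).

    a is m x k, b is k x n, c is m x n, all lists of lists.
    mb, nb, kb are the block strides along the m, n and k dimensions.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [row[:] for row in c]  # the "initial matrix to be updated"

    def load(i0, j0, p0):
        # Slice one sub-matrix of A and one of B (edge blocks may be smaller).
        a_blk = [row[p0:p0 + kb] for row in a[i0:i0 + mb]]
        b_blk = [row[j0:j0 + nb] for row in b[p0:p0 + kb]]
        return a_blk, b_blk

    k_starts = list(range(0, k, kb))
    for i0 in range(0, m, mb):
        for j0 in range(0, n, nb):
            # Fill the stand-in for the pre-stored registers before the first cycle.
            prefetched = load(i0, j0, k_starts[0])
            for idx in range(len(k_starts)):
                # The pre-stored operands become the current ("storage") operands.
                a_blk, b_blk = prefetched
                if idx + 1 < len(k_starts):
                    # Load the next cycle's operands ahead of the arithmetic.
                    prefetched = load(i0, j0, k_starts[idx + 1])
                for i, a_row in enumerate(a_blk):
                    for j in range(len(b_blk[0])):
                        out[i0 + i][j0 + j] += sum(
                            a_row[p] * b_blk[p][j] for p in range(len(b_blk))
                        )
    return out
```

In pure Python the load and the arithmetic do not actually overlap; the sketch only shows the ordering that lets hardware overlap them when the two operand sets live in separate registers.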

可选的，分块步幅确定模块320，包括：Optionally, the block stride determination module 320 includes:

分块次数确定单元，用于根据第一待操作矩阵、第二待操作矩阵和初始待更新矩阵的矩阵维度大小，确定矩阵分块次数；A blocking count determination unit, used to determine the matrix blocking count according to the matrix dimensions of the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated;

分块步幅确定单元，用于根据缓存空间容量，确定在各矩阵分块次数下第一待操作矩阵、第二待操作矩阵和初始待更新矩阵在不同维度下的分块步幅。A block stride determination unit, used to determine, according to the cache space capacity, the block strides of the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated in different dimensions at each matrix blocking count.
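
One way to read the two units above is sketched below. The text does not give the rule that maps cache capacity to strides, so everything specific here is an assumption: the names `choose_block_strides` and `blocking_plan`, the 8-byte element size, the requirement that one block of each of the three matrices fits in the capacity at once, and the use of one blocking pass per capacity level.

```python
import math


def choose_block_strides(m, n, k, capacity_bytes, elem_bytes=8):
    """Pick strides (mb, nb, kb) so that an mb x kb block of A, a kb x nb
    block of B and an mb x nb block of C fit in capacity_bytes together.
    Assumed heuristic: equal strides, clamped to the matrix dimensions."""
    budget = capacity_bytes // elem_bytes  # capacity in elements
    if budget < 3:
        raise ValueError("capacity too small for one element per block")
    side = math.isqrt(budget // 3)  # 3 * side * side <= budget
    return min(m, side), min(n, side), min(k, side)


def blocking_plan(m, n, k, level_capacities, elem_bytes=8):
    """Return one (mb, nb, kb) triple per blocking pass, outermost first.
    level_capacities is assumed to be ordered from largest to smallest;
    the length of the result plays the role of the matrix blocking count."""
    plan = []
    for capacity in level_capacities:
        mb, nb, kb = choose_block_strides(m, n, k, capacity, elem_bytes)
        plan.append((mb, nb, kb))
        m, n, k = mb, nb, kb  # the next pass subdivides the blocks of this pass
    return plan
```

For example, `blocking_plan(1024, 1024, 1024, [256 * 1024, 32 * 1024])` yields two passes; the strides of the last pass are the ones that would size the sub-matrices placed in vector registers.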

可选的，分块步幅确定模块320，包括矩阵分块单元，具体用于：Optionally, the block stride determination module 320 includes a matrix block unit, specifically used for:

根据分块步幅，对第一待操作矩阵、第二待操作矩阵和初始待更新矩阵进行矩阵分块，分别得到第一待操作矩阵的第一子矩阵、第二待操作矩阵的第二子矩阵和初始待更新矩阵的第三子矩阵，包括：According to the block stride, the first matrix to be operated, the second matrix to be operated, and the initial matrix to be updated are matrix-blocked to obtain a first sub-matrix of the first matrix to be operated, a second sub-matrix of the second matrix to be operated, and a third sub-matrix of the initial matrix to be updated, respectively, including:

分块矩阵得到子单元，用于根据各矩阵分块次数下的第一待操作矩阵、第二待操作矩阵和初始待更新矩阵在不同维度下的分块步幅，对第一待操作矩阵、第二待操作矩阵和初始待更新矩阵进行矩阵分块，得到各矩阵分块次数下的第一待操作矩阵的第一分块矩阵、第二待操作矩阵的第二分块矩阵和初始待更新矩阵的第三分块矩阵；A block matrix obtaining subunit, used to perform matrix blocking on the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated according to their block strides in different dimensions at each matrix blocking count, to obtain, at each matrix blocking count, the first block matrix of the first matrix to be operated, the second block matrix of the second matrix to be operated and the third block matrix of the initial matrix to be updated;

子矩阵确定子单元，用于将最后一次矩阵分块次数下的第一待操作矩阵的第一分块矩阵确定为第一子矩阵，将第二待操作矩阵的第二分块矩阵确定为第二子矩阵，以及将初始待更新矩阵的第三分块矩阵确定为第三子矩阵。A sub-matrix determination subunit, used to determine the first block matrix of the first matrix to be operated at the last matrix blocking count as the first sub-matrix, determine the second block matrix of the second matrix to be operated as the second sub-matrix, and determine the third block matrix of the initial matrix to be updated as the third sub-matrix.
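
The repeated blocking described by these two subunits can be illustrated as follows. The helper names `split` and `final_submatrices` and the list-of-lists representation are hypothetical; the sketch only shows that each pass subdivides the blocks of the previous pass and that the blocks of the last pass are taken as the sub-matrices.

```python
def split(matrix, row_stride, col_stride):
    """Cut one matrix into blocks of at most row_stride x col_stride,
    scanning block rows first, then block columns."""
    rows, cols = len(matrix), len(matrix[0])
    blocks = []
    for r in range(0, rows, row_stride):
        for c in range(0, cols, col_stride):
            blocks.append(
                [row[c:c + col_stride] for row in matrix[r:r + row_stride]]
            )
    return blocks


def final_submatrices(matrix, strides_per_pass):
    """Apply one blocking pass per (row_stride, col_stride) pair and
    return the blocks produced by the last pass."""
    current = [matrix]
    for row_stride, col_stride in strides_per_pass:
        current = [
            block
            for parent in current
            for block in split(parent, row_stride, col_stride)
        ]
    return current
```

With a 4 x 4 matrix and the passes `[(2, 4), (2, 2)]`, the first pass produces two 2 x 4 blocks and the second pass turns them into four 2 x 2 sub-matrices.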

可选的，分块步幅确定模块320，包括寄存器选择单元，具体用于：Optionally, the block step determination module 320 includes a register selection unit, specifically used for:

根据分块步幅，从各向量寄存器中选取存储向量寄存器和预存向量寄存器，包括：According to the block step, the storage vector register and the pre-stored vector register are selected from each vector register, including:

存储寄存器确定子单元，用于根据最后一次矩阵分块次数下第一待操作矩阵、第二待操作矩阵和初始待更新矩阵的分块步幅，确定用于存储当前运算周期下的第一子矩阵、第二子矩阵和第三子矩阵的存储向量寄存器；A storage register determination subunit, used to determine, according to the block strides of the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated at the last matrix blocking count, the storage vector registers for storing the first sub-matrix, the second sub-matrix and the third sub-matrix in the current operation cycle;

预存寄存器确定子单元，用于根据存储向量寄存器的寄存器数量，从各向量寄存器中选取用于存储下一运算周期下的第一子矩阵和第二子矩阵的预存向量寄存器。A pre-stored register determination subunit, used to select from the vector registers, according to the number of storage vector registers, the pre-stored vector registers for storing the first sub-matrix and the second sub-matrix in the next operation cycle.

可选的，存储寄存器确定子单元，具体用于：Optionally, the storage register determination subunit is specifically used for:

根据最后一次矩阵分块次数下的第一待操作矩阵的分块步幅，确定用于存储当前运算周期下的第一子矩阵的第一存储向量寄存器；Determine the first storage vector register used to store the first sub-matrix in the current operation cycle according to the block stride of the first matrix to be operated on in the last matrix block count;

根据最后一次矩阵分块次数下的第二待操作矩阵的分块步幅，确定用于存储当前运算周期下的第二子矩阵的第二存储向量寄存器；Determine the second storage vector register used to store the second sub-matrix in the current operation cycle according to the block stride of the second matrix to be operated on at the last matrix block count;

根据最后一次矩阵分块次数下的初始待更新矩阵的分块步幅，确定用于存储当前运算周期下的第三子矩阵的第三存储向量寄存器；Determine the third storage vector register used to store the third sub-matrix in the current operation cycle according to the block stride of the initial matrix to be updated in the last matrix block count;

将第一存储向量寄存器、第二存储向量寄存器和第三存储向量寄存器确定为存储寄存器。The first storage vector register, the second storage vector register and the third storage vector register are determined as storage registers.

可选的，预存寄存器确定子单元，具体用于：Optionally, the pre-stored register determination subunit is specifically used for:

根据第一存储向量寄存器的第一寄存器数量，从各向量寄存器中选取用于存储下一运算周期下的第一子矩阵的第一预存向量寄存器；Select the first pre-stored vector register for storing the first sub-matrix in the next operation cycle from each vector register according to the number of first registers of the first storage vector register;

根据第二存储向量寄存器的第二寄存器数量，从各向量寄存器中选取用于存储下一运算周期下的第二子矩阵的第二预存向量寄存器。According to the second register quantity of the second storage vector register, a second pre-stored vector register for storing the second sub-matrix in the next operation cycle is selected from each vector register.
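
The register selection above can be sketched as a simple allocation. The concrete numbers are assumptions for illustration only: a register file of 32 vector registers, `lanes` elements per register, and a sub-matrix occupying as many registers as its element count requires. The function name `select_registers` and the dictionary keys are hypothetical. The pre-stored sets mirror the sizes of the first and second storage sets, and no pre-stored set is reserved for the third sub-matrix, matching the text.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive b."""
    return -(-a // b)


def select_registers(mb, nb, kb, lanes, total_regs=32):
    """Assign register indices to the storage and pre-stored roles.

    mb, nb, kb are the block strides of the last blocking pass;
    lanes is the number of elements one vector register holds (assumed).
    """
    n_a = ceil_div(mb * kb, lanes)  # registers for the first sub-matrix
    n_b = ceil_div(kb * nb, lanes)  # registers for the second sub-matrix
    n_c = ceil_div(mb * nb, lanes)  # registers for the third sub-matrix
    needed = 2 * n_a + 2 * n_b + n_c
    if needed > total_regs:
        raise ValueError(f"need {needed} registers, only {total_regs} available")

    free = iter(range(total_regs))

    def take(count):
        return [next(free) for _ in range(count)]

    return {
        "store_a": take(n_a),
        "store_b": take(n_b),
        "store_c": take(n_c),
        "prefetch_a": take(n_a),  # same count as store_a
        "prefetch_b": take(n_b),  # same count as store_b
    }
```

A stride choice that does not fit, such as `select_registers(8, 8, 4, 4)`, raises `ValueError`, which is one way a stride determination step could be told to pick smaller blocks.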

可选的，子矩阵得到模块330，还包括矩阵存储单元，包括：Optionally, the sub-matrix obtaining module 330 also includes a matrix storage unit, including:

行存储子单元，用于将第一待操作矩阵的第一子矩阵基于预设的行存储方式进行存储；以及，A row storage subunit, used to store the first sub-matrix of the first matrix to be operated based on a preset row storage method; and,

列存储子单元，用于将第二待操作矩阵的第二子矩阵基于预设的列存储方式进行存储。The column storage subunit is used to store the second sub-matrix of the second matrix to be operated based on a preset column storage method.
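
The benefit of storing the first sub-matrix by rows and the second by columns is that each output element then reads two contiguous runs. The sketch below assumes flat Python lists as the storage and uses hypothetical helper names (`pack_row_major`, `pack_col_major`, `dot_packed`); it is an illustration of the layout, not the stored format used by the device.

```python
def pack_row_major(block):
    """Flatten a sub-matrix row by row."""
    return [value for row in block for value in row]


def pack_col_major(block):
    """Flatten a sub-matrix column by column."""
    rows, cols = len(block), len(block[0])
    return [block[r][c] for c in range(cols) for r in range(rows)]


def dot_packed(a_packed, b_packed, i, j, k):
    """Element (i, j) of A @ B, where A is packed by rows, B by columns,
    and k is the shared inner dimension. Both slices are contiguous."""
    a_row = a_packed[i * k:(i + 1) * k]
    b_col = b_packed[j * k:(j + 1) * k]
    return sum(x * y for x, y in zip(a_row, b_col))
```

For `A = [[1, 2], [3, 4]]` and `B = [[5, 6], [7, 8]]`, the packed forms are `[1, 2, 3, 4]` and `[5, 7, 6, 8]`, and `dot_packed(..., 0, 1, 2)` reads `[1, 2]` and `[6, 8]` to produce 22.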

本发明实施例所提供的一种矩阵乘法运算执行装置可执行本发明任意实施例所提供的一种矩阵乘法运算执行方法，具备执行方法相应的功能模块和有益效果。A matrix multiplication execution device provided by an embodiment of the present invention can execute a matrix multiplication execution method provided by any embodiment of the present invention, and has functional modules and beneficial effects corresponding to the execution method.

实施例四Embodiment 4

图4示出了可以用来实施本发明的实施例的电子设备400的结构示意图。电子设备旨在表示各种形式的数字计算机，诸如，膝上型计算机、台式计算机、工作台、个人数字助理、服务器、刀片式服务器、大型计算机、和其它适合的计算机。电子设备还可以表示各种形式的移动装置，诸如，个人数字处理、蜂窝电话、智能电话、可穿戴设备(如头盔、眼镜、手表等)和其它类似的计算装置。本文所示的部件、它们的连接和关系、以及它们的功能仅仅作为示例，并且不意在限制本文中描述的和/或者要求的本发明的实现。FIG. 4 shows a schematic structural diagram of an electronic device 400 that can be used to implement embodiments of the present invention. Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (eg, helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementation of the invention described and/or claimed herein.

如图4所示，电子设备400包括至少一个处理器410，以及与至少一个处理器410通信连接的存储器，如只读存储器(ROM)420、随机访问存储器(RAM)430等，其中，存储器存储有可被至少一个处理器执行的计算机程序，处理器410可以根据存储在只读存储器(ROM)420中的计算机程序或者从存储单元480加载到随机访问存储器(RAM)430中的计算机程序，来执行各种适当的动作和处理。在(RAM)430中，还可存储电子设备400操作所需的各种程序和数据。处理器410、(ROM)420以及(RAM)430通过总线440彼此相连。输入/输出(I/O)接口450也连接至总线440。As shown in FIG. 4, the electronic device 400 includes at least one processor 410 and a memory communicatively connected to the at least one processor 410, such as a read-only memory (ROM) 420 and a random access memory (RAM) 430, wherein the memory stores a computer program executable by the at least one processor. The processor 410 can perform various appropriate actions and processing according to a computer program stored in the ROM 420 or a computer program loaded from the storage unit 480 into the RAM 430. The RAM 430 may also store various programs and data required for the operation of the electronic device 400. The processor 410, the ROM 420 and the RAM 430 are connected to each other through a bus 440. An input/output (I/O) interface 450 is also connected to the bus 440.

电子设备400中的多个部件连接至I/O接口450，包括：输入单元460，例如键盘、鼠标等；输出单元470，例如各种类型的显示器、扬声器等；存储单元480，例如磁盘、光盘等；以及通信单元490，例如网卡、调制解调器、无线通信收发机等。通信单元490允许电子设备400通过诸如因特网的计算机网络和/或各种电信网络与其他设备交换信息/数据。A number of components in the electronic device 400 are connected to the I/O interface 450, including: an input unit 460, such as a keyboard, a mouse, etc.; an output unit 470, such as various types of displays, speakers, etc.; a storage unit 480, such as a disk, an optical disk, etc.; and a communication unit 490, such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 490 allows the electronic device 400 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.

处理器410可以是各种具有处理和计算能力的通用和/或专用处理组件。处理器410的一些示例包括但不限于中央处理单元(CPU)、图形处理单元(GPU)、各种专用的人工智能(AI)计算芯片、各种运行机器学习模型算法的处理器、数字信号处理器(DSP)、以及任何适当的处理器、控制器、微控制器等。处理器410执行上文所描述的各个方法和处理，例如一种矩阵乘法运算执行方法。The processor 410 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 410 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc. The processor 410 performs the methods and processes described above, such as the matrix multiplication execution method.

在一些实施例中，一种矩阵乘法运算执行方法可被实现为计算机程序，其被有形地包含于计算机可读存储介质，例如存储单元480。在一些实施例中，计算机程序的部分或者全部可以经由(ROM)420和/或通信单元490而被载入和/或安装到电子设备400上。当计算机程序加载到(RAM)430并由处理器410执行时，可以执行上文描述的一种矩阵乘法运算执行方法的一个或多个步骤。备选地，在其他实施例中，处理器410可以通过其他任何适当的方式(例如，借助于固件)而被配置为执行一种矩阵乘法运算执行方法。In some embodiments, the matrix multiplication execution method may be implemented as a computer program that is tangibly contained in a computer-readable storage medium, such as the storage unit 480. In some embodiments, part or all of the computer program may be loaded and/or installed on the electronic device 400 via the ROM 420 and/or the communication unit 490. When the computer program is loaded into the RAM 430 and executed by the processor 410, one or more steps of the matrix multiplication execution method described above may be performed. Alternatively, in other embodiments, the processor 410 may be configured to execute the matrix multiplication execution method in any other appropriate manner (e.g., by means of firmware).

本文中以上描述的系统和技术的各种实施方式可以在数字电子电路系统、集成电路系统、场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、芯片上系统的系统(SOC)、负载可编程逻辑设备(CPLD)、计算机硬件、固件、软件、和/或它们的组合中实现。这些各种实施方式可以包括：实施在一个或者多个计算机程序中，该一个或者多个计算机程序可在包括至少一个可编程处理器的可编程系统上执行和/或解释，该可编程处理器可以是专用或者通用可编程处理器，可以从存储系统、至少一个输入装置、和至少一个输出装置接收数据和指令，并且将数据和指令传输至该存储系统、该至少一个输入装置、和该至少一个输出装置。Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

用于实施本发明的方法的计算机程序可以采用一个或多个编程语言的任何组合来编写。这些计算机程序可以提供给通用计算机、专用计算机或其他可编程数据处理装置的处理器，使得计算机程序当由处理器执行时使流程图和/或框图中所规定的功能/操作被实施。计算机程序可以完全在机器上执行、部分地在机器上执行，作为独立软件包部分地在机器上执行且部分地在远程机器上执行或完全在远程机器或服务器上执行。Computer programs for implementing the methods of the invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that the computer program, when executed by the processor, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. A computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

在本发明的上下文中，计算机可读存储介质可以是有形的介质，其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的计算机程序。计算机可读存储介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备，或者上述内容的任何合适组合。备选地，计算机可读存储介质可以是机器可读信号介质。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. Computer-readable storage media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

为了提供与用户的交互，可以在电子设备上实施此处描述的系统和技术，该电子设备具有：用于向用户显示信息的显示装置(例如，CRT(阴极射线管)或者LCD(液晶显示器)监视器)；以及键盘和指向装置(例如，鼠标或者轨迹球)，用户可以通过该键盘和该指向装置来将输入提供给电子设备。其它种类的装置还可以用于提供与用户的交互；例如，提供给用户的反馈可以是任何形式的传感反馈(例如，视觉反馈、听觉反馈、或者触觉反馈)；并且可以用任何形式(包括声输入、语音输入或者、触觉输入)来接收来自用户的输入。To provide interaction with a user, the systems and techniques described herein may be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the electronic device. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).

可以将此处描述的系统和技术实施在包括后台部件的计算系统(例如，作为数据服务器)、或者包括中间件部件的计算系统(例如，应用服务器)、或者包括前端部件的计算系统(例如，具有图形用户界面或者网络浏览器的用户计算机，用户可以通过该图形用户界面或者该网络浏览器来与此处描述的系统和技术的实施方式交互)、或者包括这种后台部件、中间件部件、或者前端部件的任何组合的计算系统中。可以通过任何形式或者介质的数字数据通信(例如，通信网络)来将系统的部件相互连接。通信网络的示例包括：局域网(LAN)、广域网(WAN)、区块链网络和互联网。The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: a local area network (LAN), a wide area network (WAN), a blockchain network, and the Internet.

计算系统可以包括客户端和服务器。客户端和服务器一般远离彼此并且通常通过通信网络进行交互。通过在相应的计算机上运行并且彼此具有客户端-服务器关系的计算机程序来产生客户端和服务器的关系。服务器可以是云服务器，又称为云计算服务器或云主机，是云计算服务体系中的一项主机产品，以解决了传统物理主机与VPS服务中，存在的管理难度大，业务扩展性弱的缺陷。A computing system may include clients and servers. Clients and servers are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and VPS services.

应该理解，可以使用上面所示的各种形式的流程，重新排序、增加或删除步骤。例如，本发明中记载的各步骤可以并行地执行也可以顺序地执行也可以不同的次序执行，只要能够实现本发明的技术方案所期望的结果，本文在此不进行限制。It should be understood that various forms of the process shown above may be used, with steps reordered, added or deleted. For example, each step described in the present invention can be executed in parallel, sequentially, or in different orders. As long as the desired results of the technical solution of the present invention can be achieved, there is no limitation here.

上述具体实施方式，并不构成对本发明保护范围的限制。本领域技术人员应该明白的是，根据设计要求和其他因素，可以进行各种修改、组合、子组合和替代。任何在本发明的精神和原则之内所作的修改、等同替换和改进等，均应包含在本发明保护范围之内。The above-mentioned specific embodiments do not constitute a limitation on the scope of the present invention. It will be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions are possible depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims

1. A method of performing a matrix multiplication operation, comprising:

responding to a matrix multiplication operation instruction, and acquiring a first matrix to be operated, a second matrix to be operated and an initial matrix to be updated;

determining a block stride for matrix blocking according to the cache space capacity;

performing matrix blocking on the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated according to the block stride, to respectively obtain a first sub-matrix of the first matrix to be operated, a second sub-matrix of the second matrix to be operated and a third sub-matrix of the initial matrix to be updated;

selecting a storage vector register and a pre-stored vector register from the vector registers according to the block stride; the storage vector register is used for storing the first sub-matrix, the second sub-matrix and the third sub-matrix in the current operation cycle; the pre-stored vector register is used for storing the first sub-matrix and the second sub-matrix in the next operation cycle;

and sequentially acquiring the corresponding sub-matrices from the storage vector register and the pre-stored vector register to perform the matrix multiplication operation until the operation end condition is met, so as to obtain a matrix multiplication operation result.

2. The method of claim 1, wherein determining a block stride for matrix blocking according to the cache space capacity comprises:

determining a matrix blocking count according to the matrix dimension sizes of the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated;

and determining the block strides of the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated in different dimensions according to the cache space capacity.

3. The method according to claim 2, wherein performing matrix partitioning on the first to-be-operated matrix, the second to-be-operated matrix, and the initial to-be-updated matrix according to the partitioning step to obtain a first sub-matrix of the first to-be-operated matrix, a second sub-matrix of the second to-be-operated matrix, and a third sub-matrix of the initial to-be-updated matrix, respectively, includes:

performing matrix blocking on the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated according to their block strides in different dimensions at each matrix blocking count, to obtain a first block matrix of the first matrix to be operated, a second block matrix of the second matrix to be operated and a third block matrix of the initial matrix to be updated at each matrix blocking count;

determining the first block matrix of the first matrix to be operated at the last matrix blocking count as the first sub-matrix, determining the second block matrix of the second matrix to be operated as the second sub-matrix, and determining the third block matrix of the initial matrix to be updated as the third sub-matrix.

4. The method of claim 2, wherein selecting a storage vector register and a pre-stored vector register from the vector registers according to the block stride comprises:

determining storage vector registers for storing the first sub-matrix, the second sub-matrix and the third sub-matrix in the current operation cycle according to the block strides of the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated at the last matrix blocking count;

and selecting, from the vector registers, pre-stored vector registers for storing the first sub-matrix and the second sub-matrix in the next operation cycle according to the number of storage vector registers.

5. The method of claim 4, wherein determining the storage vector registers for storing the first sub-matrix, the second sub-matrix and the third sub-matrix in the current operation cycle according to the block strides of the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated at the last matrix blocking count comprises:

determining a first storage vector register for storing the first sub-matrix in the current operation cycle according to the block stride of the first matrix to be operated at the last matrix blocking count;

determining a second storage vector register for storing the second sub-matrix in the current operation cycle according to the block stride of the second matrix to be operated at the last matrix blocking count;

determining a third storage vector register for storing the third sub-matrix in the current operation cycle according to the block stride of the initial matrix to be updated at the last matrix blocking count;

and determining the first storage vector register, the second storage vector register and the third storage vector register as the storage vector registers.

6. The method of claim 5, wherein selecting, from the vector registers, pre-stored vector registers for storing the first sub-matrix and the second sub-matrix in the next operation cycle according to the number of storage vector registers comprises:

selecting, from the vector registers, a first pre-stored vector register for storing the first sub-matrix in the next operation cycle according to the first register count of the first storage vector register;

and selecting, from the vector registers, a second pre-stored vector register for storing the second sub-matrix in the next operation cycle according to the second register count of the second storage vector register.

7. The method of claim 1, wherein after performing matrix blocking on the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated according to the block stride to respectively obtain the first sub-matrix of the first matrix to be operated, the second sub-matrix of the second matrix to be operated and the third sub-matrix of the initial matrix to be updated, the method further comprises:

storing the first sub-matrix of the first matrix to be operated based on a preset row storage mode; and

storing the second sub-matrix of the second matrix to be operated based on a preset column storage mode.

8. A matrix multiplication execution apparatus, comprising:

the matrix acquisition module is used for responding to the matrix multiplication operation instruction to acquire a first matrix to be operated, a second matrix to be operated and an initial matrix to be updated;

the block stride determination module is used for determining a block stride for matrix blocking according to the cache space capacity;

the sub-matrix obtaining module is used for performing matrix blocking on the first matrix to be operated, the second matrix to be operated and the initial matrix to be updated according to the block stride, so as to respectively obtain a first sub-matrix of the first matrix to be operated, a second sub-matrix of the second matrix to be operated and a third sub-matrix of the initial matrix to be updated;

the register selection module is used for selecting a storage vector register and a pre-stored vector register from the vector registers according to the block stride; the storage vector register is used for storing the first sub-matrix, the second sub-matrix and the third sub-matrix in the current operation cycle; the pre-stored vector register is used for storing the first sub-matrix and the second sub-matrix in the next operation cycle;

the result obtaining module is used for sequentially obtaining the corresponding sub-matrices from the storage vector register and the pre-stored vector register to perform the matrix multiplication operation until the operation end condition is met, so as to obtain a matrix multiplication operation result.

9. An electronic device, comprising:

one or more processors;

a memory for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the matrix multiplication execution method as claimed in any one of claims 1 to 7.

10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the matrix multiplication execution method according to any one of claims 1 to 7.