CN117454068B - Method and computing device for implementing matrix multiplication operation - Google Patents
- Publication number
- CN117454068B CN117454068B CN202311502624.9A CN202311502624A CN117454068B CN 117454068 B CN117454068 B CN 117454068B CN 202311502624 A CN202311502624 A CN 202311502624A CN 117454068 B CN117454068 B CN 117454068B
- Authority
- CN
- China
- Prior art keywords
- matrix
- dimension
- dimensional
- dimensional matrix
- core
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/52—Multiplying; Dividing
- G06F7/523—Multiplying only
Abstract
The present disclosure provides a method and computing device for implementing matrix multiplication operations. The method includes: converting a first matrix of the matrix multiplication operation into a first two-dimensional matrix and loading the first two-dimensional matrix into a first buffer for the first matrix within a tensor core, wherein the first matrix is a three-dimensional matrix; determining whether a second matrix of the matrix multiplication operation is a two-dimensional matrix or a three-dimensional matrix; in response to the second matrix being a two-dimensional matrix, loading the second matrix into a second buffer for the second matrix within the tensor core; in response to the second matrix being a three-dimensional matrix, converting the second matrix into a second two-dimensional matrix and loading the second two-dimensional matrix into the second buffer; and performing, in the tensor core, matrix multiplication on the first two-dimensional matrix and either the second matrix or the second two-dimensional matrix.
Description
Technical Field
The present disclosure relates generally to the field of processors, and more particularly, to a method and computing device for implementing matrix multiplication operations.
Background
Modern scientific fields such as weather forecasting, oil exploration, and nuclear physics rely heavily on computer-based simulation, at the core of which are matrix computations representing state transitions. Computer graphics processing, deep learning, and related fields likewise depend heavily on matrix multiplication operations.
When performing a matrix multiplication operation on two matrices, hardware constraints such as those of a GEMM (General Matrix Multiplication) unit or Tensor Core make it necessary to partition the input matrices into blocks and perform the matrix multiplication operation on the block matrices.
However, when the remainder of the corresponding dimension of an input matrix (e.g., the batch dimension) modulo the hardware multiplication granularity is small, hardware utilization is low.
Disclosure of Invention
In view of the above, the present disclosure provides a method of performing matrix multiplication by applying dimension folding to input matrices of different shapes.
According to one aspect of the present disclosure, a method of implementing matrix multiplication operations is provided. The method includes: converting a first matrix of the matrix multiplication operation into a first two-dimensional matrix and loading the first two-dimensional matrix into a first buffer for the first matrix within a tensor core, wherein the first matrix is a three-dimensional matrix; determining whether a second matrix of the matrix multiplication operation is a two-dimensional matrix or a three-dimensional matrix; in response to the second matrix being a two-dimensional matrix, loading the second matrix into a second buffer for the second matrix within the tensor core; in response to the second matrix being a three-dimensional matrix, converting the second matrix into a second two-dimensional matrix and loading the second two-dimensional matrix into the second buffer; and performing, in the tensor core, matrix multiplication on the first two-dimensional matrix and either the second matrix or the second two-dimensional matrix.
In some implementations, converting the first matrix of the matrix multiplication operation into the first two-dimensional matrix and loading it into the first buffer for the first matrix includes: loading the first matrix from a first memory outside the tensor core into a vector core through a data load operation; determining, in the vector core, a first dimension and a second dimension of the first two-dimensional matrix based on a second dimension and a third dimension of the first matrix; storing the first two-dimensional matrix from the vector core into a second memory near the tensor core through a data store operation; and loading the first two-dimensional matrix into the first buffer based on a size of the first buffer.
In some implementations, determining, in the vector core, the first and second dimensions of the first two-dimensional matrix based on the second and third dimensions of the first matrix includes: determining whether the dimension, of the second and third dimensions of the first matrix, that is to be merged with the first dimension of the first matrix satisfies a predetermined condition; in response to determining that the dimension to be merged with the first dimension of the first matrix satisfies the predetermined condition, determining a remainder value of that dimension modulo the hardware multiplication granularity of the tensor core; determining whether the remainder value is greater than or equal to a predetermined threshold, wherein the predetermined threshold is associated with the hardware multiplication granularity of the tensor core; in response to determining that the remainder value is greater than or equal to the predetermined threshold, blocking the first matrix according to the first dimension of the first matrix and the dimension to be merged with it, and reading the first matrix as the first two-dimensional matrix according to the other of the second and third dimensions of the first matrix; and in response to determining that the remainder value is less than the predetermined threshold, folding the first matrix along its first dimension to read it as the first two-dimensional matrix.
In some implementations, converting the second matrix into a second two-dimensional matrix and loading it into the second buffer for the second matrix within the tensor core includes: loading the second matrix from a first memory outside the tensor core into a vector core through a multiple-data load operation; determining, in the vector core, a first dimension and a second dimension of the second two-dimensional matrix based on the first dimension and the second dimension of the first two-dimensional matrix; storing the second two-dimensional matrix from the vector core into a second memory near the tensor core through a multiple-data store operation; and loading the second two-dimensional matrix into the second buffer based on a size of the second buffer.
In some implementations, determining, in the vector core, the first dimension and the second dimension of the second two-dimensional matrix based on the first dimension and the second dimension of the first two-dimensional matrix includes: determining, based on the first dimension of the first two-dimensional matrix, whether the second dimension of the first matrix has been merged with the first dimension of the first matrix; if so, blocking the first and third dimensions of the second matrix and reading the second matrix as the second two-dimensional matrix according to the second dimension of the second matrix; and if not, blocking the first and second dimensions of the second matrix and reading the second matrix as the second two-dimensional matrix according to the third dimension of the second matrix.
According to another aspect of the present disclosure, a computing device is provided. The computing device includes: a tensor core including a general matrix multiplication unit, a first buffer for a first matrix of a matrix multiplication operation, and a second buffer for a second matrix of the matrix multiplication operation, wherein the first matrix is a three-dimensional matrix; a first memory located outside the tensor core and configured to store the first matrix and the second matrix before the matrix multiplication operation; a vector core configured to convert the first matrix and the second matrix into a first two-dimensional matrix and a second two-dimensional matrix, respectively; and a second memory located near the tensor core and configured to store the first two-dimensional matrix and the second two-dimensional matrix. The tensor core is configured to load the second matrix into the second buffer when the second matrix is a two-dimensional matrix, or to load the second two-dimensional matrix into the second buffer when the second matrix is a three-dimensional matrix, and to perform matrix multiplication on the first two-dimensional matrix and either the second matrix or the second two-dimensional matrix.
In some implementations, the vector core is configured to load the first matrix from the first memory through a data load operation, determine a first dimension and a second dimension of the first two-dimensional matrix based on a second dimension and a third dimension of the first matrix, and store the first two-dimensional matrix to the second memory through a data store operation.
In some implementations, the vector core is configured to: determine whether the dimension, of the second and third dimensions of the first matrix, that is to be merged with the first dimension of the first matrix satisfies a predetermined condition; in response to determining that it does, determine a remainder value of that dimension modulo the hardware multiplication granularity of the tensor core; determine whether the remainder value is greater than or equal to a predetermined threshold, wherein the predetermined threshold is associated with the hardware multiplication granularity of the tensor core; in response to determining that the remainder value is greater than or equal to the predetermined threshold, block the first matrix according to the first dimension of the first matrix and the dimension to be merged with it, and read the first matrix as the first two-dimensional matrix according to the other of the second and third dimensions of the first matrix; and in response to determining that the remainder value is less than the predetermined threshold, splice (fold) the first matrix along its first dimension to form the first two-dimensional matrix.
In some implementations, the vector core is configured to load the second matrix from the first memory through a data load operation, determine a first dimension and a second dimension of the second two-dimensional matrix based on the first dimension and the second dimension of the first two-dimensional matrix, and store the second two-dimensional matrix to the second memory through a data store operation.
In some implementations, the vector core is configured to: determine, based on the first dimension of the first two-dimensional matrix, whether the second dimension of the first matrix has been merged with the first dimension of the first matrix; if so, block the first and third dimensions of the second matrix and read the second matrix as the second two-dimensional matrix according to the second dimension of the second matrix; and if not, block the first and second dimensions of the second matrix and read the second matrix as the second two-dimensional matrix according to the third dimension of the second matrix.
Drawings
The disclosure will be better understood and other objects, details, features and advantages of the disclosure will become more apparent by reference to the description of specific embodiments thereof given in the following drawings.
Fig. 1A and 1B show schematic diagrams of matrix multiplication operations for two different cases.
FIG. 2 shows a schematic diagram of a computing device for performing a matrix multiplication operation in accordance with an embodiment of the invention.
Fig. 3 shows a schematic flow chart of a method for implementing a matrix multiplication operation according to an embodiment of the invention.
Fig. 4 illustrates a further detailed flow chart of a process of converting and loading a first matrix according to some embodiments of the invention.
Fig. 5 illustrates a further detailed schematic diagram of a process of determining a first dimension and a second dimension of a first two-dimensional matrix according to some embodiments of the invention.
Fig. 6 illustrates a further detailed flow chart of a process of converting and loading a second matrix according to some embodiments of the invention.
FIG. 7 illustrates a further detailed schematic diagram of a process of determining a first dimension and a second dimension of a second two-dimensional matrix according to some embodiments of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "comprising" and its variations as used herein are open-ended, i.e., "including but not limited to". The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one embodiment" and "some embodiments" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same objects.
Fig. 1A and 1B show schematic diagrams of matrix multiplication operations for two different cases.
In the example shown in fig. 1A, of the two input matrices of the matrix multiplication operation, the first matrix A is a three-dimensional matrix (Batch, M, K), the second matrix B is a two-dimensional matrix (K, N), and the output matrix C after the matrix multiplication operation is a three-dimensional matrix (Batch, M, N). The last dimension (third dimension K) of the first matrix A is the same as the first dimension K of the second matrix B (i.e., the last dimension of the former matches the first dimension of the latter), so the matrix multiplication operation is feasible.
In the example shown in fig. 1B, of the two input matrices of the matrix multiplication operation, the first matrix A is a three-dimensional matrix (Batch, M, K), the second matrix B is a three-dimensional matrix (Batch, K, N), and the output matrix C after the matrix multiplication operation is a two-dimensional matrix (M, N). Here, the last dimension (third dimension K) of the first matrix A is shown as matching the second dimension K of the second matrix B. Those skilled in the art will understand that this representation is merely for convenience of description: in an actual matrix multiplication operation, the second matrix B may be transposed so that the matrix multiplication between the first matrix A and the transposed second matrix B is feasible, i.e., the last dimension of the former matches the first dimension of the latter. A description of this transposition operation is omitted herein.
More generally, it suffices that the two matrices of a matrix multiplication operation share at least one dimension, without requiring that dimension to be in a specific position. In this case, before performing the matrix multiplication operation, at least one of the two original matrices may be transposed so that the last dimension of the transposed first matrix A is the same as the first dimension of the second matrix B; this is not described further herein.
There are two main methods for performing the matrix multiplication operation. One is to loop over the Batch dimension in the outer layer and perform an M×N×K matrix operation in the inner layer. For example, for the case of fig. 1A, the first matrix A may be sliced along the Batch dimension into Batch two-dimensional matrices (M, K), each of which is multiplied with the second matrix B (K, N), yielding Batch two-dimensional matrices (M, N) that are stacked into the output three-dimensional matrix C (Batch, M, N). For the case of fig. 1B, the first matrix A and the second matrix B may each be sliced along the Batch dimension into Batch two-dimensional matrices (M, K) and (K, N), respectively; each pair is multiplied, and the resulting Batch two-dimensional matrices (M, N) are superimposed to produce the output two-dimensional matrix (M, N). However, in this approach the dimensions M, N, and K of the two-dimensional matrices are unchanged, and hardware utilization remains low when their remainders modulo the hardware multiplication granularity are small.
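The two batch-loop cases just described can be sketched in NumPy; the shapes below are arbitrary illustrations, not values from the disclosure:

```python
import numpy as np

Batch, M, K, N = 4, 3, 5, 6
A = np.random.rand(Batch, M, K)

# Case of fig. 1A: B is two-dimensional; loop over Batch and stack.
B2d = np.random.rand(K, N)
C_stacked = np.stack([A[b] @ B2d for b in range(Batch)])  # (Batch, M, N)

# Case of fig. 1B: B is three-dimensional; per-batch products are
# superimposed (accumulated) into a single (M, N) output.
B3d = np.random.rand(Batch, K, N)
C_summed = sum(A[b] @ B3d[b] for b in range(Batch))       # (M, N)

print(C_stacked.shape, C_summed.shape)
```

Each inner `@` here stands for one M×N×K matrix operation issued to the hardware; the loop itself is the outer Batch cycle.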
Another method is to reshape the input three-dimensional matrix into a two-dimensional matrix in software before the first matrix A and the second matrix B perform the matrix multiplication operation, and then perform the matrix multiplication operation on the converted two-dimensional matrix in hardware. For example, in the case of fig. 1A, the first matrix A may be reshaped into a two-dimensional matrix (Batch×M, K), which is then multiplied with the second matrix B (K, N) to obtain an output two-dimensional matrix (Batch×M, N); this can be reshaped into the output three-dimensional matrix C (Batch, M, N) as needed. In this way, the reshaping operation converts the input matrix from a three-dimensional matrix into a two-dimensional matrix, so that at most one dimension of the input matrix leaves a remainder against the hardware multiplication granularity, which improves hardware utilization. However, the reshaping performed in software introduces additional memory-access operations, increasing total time consumption and degrading overall performance.
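The reshaping method for the fig. 1A case, and its equivalence to the batch loop, can be checked with a small NumPy sketch (illustrative shapes only):

```python
import numpy as np

Batch, M, K, N = 4, 3, 5, 6
A = np.random.rand(Batch, M, K)
B = np.random.rand(K, N)

# Reshape (Batch, M, K) -> (Batch*M, K): one large 2-D matmul instead of
# Batch small ones, so only the single folded dimension Batch*M can leave
# a remainder against the hardware multiplication granularity.
A2d = A.reshape(Batch * M, K)
C2d = A2d @ B                        # (Batch*M, N)
C = C2d.reshape(Batch, M, N)         # reshape back into C if needed

# Equivalent to the per-batch loop of the first method:
C_loop = np.stack([A[b] @ B for b in range(Batch)])
print(np.allclose(C, C_loop))
```

In hardware, unlike in this sketch, the reshape is not free: it is the extra memory traffic that the paragraph above identifies as the drawback of this method.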
In both of these matrix multiplication methods, if M or K is not aligned with the hardware multiplication granularity, the information in the Batch dimension cannot be fully exploited. In particular, when the remainder of M or K modulo the hardware multiplication granularity is a small number such as 2, 4, or 8, the utilization of the matrix multiplication hardware and of the buffers is low, so overall performance is poor; and if a reshaping pass is inserted, larger memory traffic is generated, which also degrades overall performance.
To address these problems, the present disclosure provides a method of optimizing the matrix multiplication operation by performing dimension folding on the two input matrices, which makes full use of the processing capability of the matrix multiplication hardware and improves buffer utilization.
FIG. 2 shows a schematic diagram of a computing device 200 for performing a matrix multiplication operation in accordance with an embodiment of the invention. As shown in fig. 2, the computing device 200 may include a tensor core 210 for performing a matrix multiplication operation, the tensor core 210 having a given hardware multiplication granularity that limits the maximum dimension of the input matrix (i.e., the maximum of M or K) that the tensor core 210 can handle.
The computing device 200 may also include a first memory 220 outside of the tensor core 210. The first memory 220 is used to store the first matrix a and the second matrix B before the tensor core 210 performs the matrix multiplication operation, and to store the output matrix C after the matrix multiplication operation is finished. The first memory 220 may be, for example, a high-bandwidth memory (High Bandwidth Memory, HBM), which has the advantage of high throughput and high bandwidth, and may be widely used in fields with high requirements for access to large data volumes, such as Artificial Intelligence (AI) computation, etc.
Computing device 200 may also include vector core 230. The vector kernel 230 is used to convert a three-dimensional input matrix into a two-dimensional matrix.
The computing device 200 may also include a second memory 240 in the vicinity of the tensor core 210. The second memory 240 may be composed of a plurality of registers for storing a two-dimensional matrix generated by the vector core 230 converting the three-dimensional input matrix. The second memory 240 is typically directly connected to the tensor core 210 as the main memory of the tensor core 210 without further intermediate conversion, so that the data required for the tensor core 210 to calculate is stored as close as possible to the tensor core 210.
In some embodiments, the tensor core 210, the vector core 230, and the second memory 240 may be part of a separate stream processing unit 250. Although only one stream processing unit 250 is shown in fig. 2, those skilled in the art will appreciate that computing device 200 may contain a plurality of such stream processing units 250. In this case, the first memory 220 may serve multiple stream processing units 250 and may exchange data and instructions with each stream processing unit 250 of the computing device 200 through a high-speed bus or network-on-chip.
In some embodiments, the tensor core 210 may further include a general matrix multiplication unit (GEMM) 212, a first buffer (A_buffer) 214 for the first matrix A, and a second buffer (B_buffer) 216 for the second matrix B. The first buffer 214 and the second buffer 216 may be memories within the tensor core 210 dedicated to it, wherein the first buffer 214 buffers one vector of the first matrix A at a time at the hardware multiplication granularity of the GEMM 212, and the second buffer 216 buffers one vector of the second matrix B at a time at the same granularity. That is, the sizes of the first buffer 214 and the second buffer 216 are determined by the hardware multiplication granularity of the GEMM 212 or the tensor core 210. For example, where the GEMM 212 is an n×n two-dimensional accumulator array, the first buffer 214 and the second buffer 216 each buffer one vector of length n×n at a time for the first matrix A and the second matrix B, respectively. The GEMM 212 performs a matrix multiplication operation on the vectors input from the first buffer 214 and the second buffer 216 each time it is triggered by a matrix multiplication instruction received by the tensor core 210.
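As a rough illustration of why a dimension much smaller than the granularity wastes this tiling, the following sketch counts tiles for an assumed n×n GEMM array; the function and the numbers are illustrative assumptions, not part of the disclosure:

```python
import math

def tile_counts(M, K, N, n):
    """Illustrative tile count for a (M, K) x (K, N) product on an
    n-by-n GEMM accumulator array: one tile pair per instruction,
    with partial tiles padded up to full granularity."""
    return math.ceil(M / n) * math.ceil(K / n) * math.ceil(N / n)

# With granularity n = 8 and M = 2, six of every eight rows of each
# A-tile are padding -- the low utilization the Background describes.
print(tile_counts(2, 64, 64, 8))
```

The product M = 2 still costs as many instructions as M = 8 would, which is the utilization loss that folding the Batch dimension into M aims to recover.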
Furthermore, in some embodiments, within the tensor core 210, the two-dimensional matrices from the second memory 240 may also be transposed (not shown) before being loaded into the first buffer 214 and the second buffer 216.
Fig. 3 shows a schematic flow chart of a method 300 for implementing a matrix multiplication operation according to an embodiment of the invention. The method 300 may be performed by or in the computing device 200. It is assumed here that the first matrix A and the second matrix B of the matrix multiplication operation have been stored in the first memory 220 in advance, the first matrix A being a three-dimensional matrix (Batch, M, K) as shown in fig. 1A and 1B, and the second matrix B being either a two-dimensional matrix (K, N) as shown in fig. 1A or a three-dimensional matrix (Batch, K, N) as shown in fig. 1B. Depending on whether the second matrix B is a two-dimensional or a three-dimensional matrix, the method 300 processes it differently.
As shown in fig. 3, at block 310, the first matrix A of the matrix multiplication operation may be converted into a first two-dimensional matrix A' and loaded into the first buffer 214 for the first matrix A within the tensor core 210. For example, assuming that the first matrix A is a three-dimensional matrix (Batch, M, K), the converted first two-dimensional matrix A' is a two-dimensional matrix (Batch×M, K) or (M, Batch×K). A more specific embodiment of converting and loading the first matrix A is described below with reference to fig. 4.
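The two candidate foldings of the first matrix A can be expressed as NumPy reshapes; this is an illustrative sketch with arbitrary shapes, not the disclosed address-conversion implementation:

```python
import numpy as np

Batch, M, K = 4, 2, 3
A = np.random.rand(Batch, M, K)

# Fold Batch into the second dimension M: (Batch, M, K) -> (Batch*M, K).
A_fold_M = A.reshape(Batch * M, K)

# Fold Batch into the third dimension K: (Batch, M, K) -> (M, Batch*K).
# The M axis must be moved in front of Batch before flattening.
A_fold_K = A.transpose(1, 0, 2).reshape(M, Batch * K)

print(A_fold_M.shape, A_fold_K.shape)   # (8, 3) (2, 12)
```

Which folding is chosen depends on which of M and K is to be merged with Batch, as blocks 3141 and 3142 below decide.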
At block 320, it may be determined whether the second matrix B of the matrix multiplication operation is a two-dimensional matrix or a three-dimensional matrix, such as the matrix shown in fig. 1A or the matrix shown in fig. 1B.
If the second matrix B is a two-dimensional matrix (e.g., the two-dimensional matrix (K, N) shown in fig. 1A), then at block 330 the second matrix B may be loaded directly into the second buffer 216 for the second matrix B within the tensor core 210.
If the second matrix B is a three-dimensional matrix (e.g., the three-dimensional matrix (Batch, K, N) shown in FIG. 1B), then at block 340 the second matrix B may be converted to a second two-dimensional matrix B' and loaded into the second buffer 216 for the second matrix B within the tensor core 210. Here, a more specific embodiment of the conversion and loading of the second matrix B is described below with reference to fig. 6, which corresponds to the conversion and loading manner of the first matrix a.
Then, at block 350, matrix multiplication is performed in the tensor core 210 on the first two-dimensional matrix A' and the second matrix B (in the case of block 330) or the second two-dimensional matrix B' (in the case of block 340).
Note that the conversion and loading of the first matrix a may be performed concurrently with the loading of the second matrix B (block 330) or the conversion and loading (block 340) or may be performed in sequential order, the specific order of execution being not limited herein.
Fig. 4 shows a further detailed flow chart of the process (block 310) of converting and loading the first matrix A according to some embodiments of the invention.

As shown in fig. 4, at block 312, the first matrix A may be loaded from the first memory 220 outside the tensor core 210 into the vector core 230 through a multiple-data load (LDM) operation.

At block 314, in the vector core 230, a first dimension and a second dimension of the first two-dimensional matrix A' may be determined based on the second dimension M and the third dimension K of the first matrix A. The conversion of the first matrix A by the vector core 230 is achieved through address conversion of each element in the first matrix A. Depending on the second dimension M and the third dimension K of the first matrix A, the converted first two-dimensional matrix A' may be (Batch×M, K) or (M, Batch×K), as described in detail below.
Fig. 5 illustrates a further detailed schematic diagram of a process of determining a first dimension and a second dimension of a first two-dimensional matrix a' (block 314), according to some embodiments of the invention.
As shown in fig. 5, at block 3141, it may be determined whether the dimension, of the second dimension M and the third dimension K of the first matrix A, that needs to be merged with the first dimension Batch satisfies a predetermined condition. Here, the predetermined condition may mean that the second dimension M or the third dimension K is small in absolute terms or small relative to the other.
For example, assume that the second dimension M of the first matrix A is designated as the dimension that needs to be merged with the first dimension Batch. In this case, if the second dimension M << the third dimension K, or the absolute value of the second dimension M is small (for example, in the case where the hardware multiplication granularity of the tensor core is 64, the second dimension M is 2, 4, or 8), the second dimension M of the first matrix A is considered to satisfy the predetermined condition.
For another example, assume that the third dimension K of the first matrix A is designated as the dimension that needs to be merged with the first dimension Batch. In this case, if the third dimension K << the second dimension M, or the absolute value of the third dimension K is small (for example, in the case where the hardware multiplication granularity of the tensor core is 64, the third dimension K is 2, 4, or 8), the third dimension K of the first matrix A is considered to satisfy the predetermined condition.
It will be appreciated by those skilled in the art that the second dimension M and the third dimension K are merely exemplary and are not intended to be limiting in any way as to the order or size of the second dimension and the third dimension.
At block 3142, if it is determined that the dimension that needs to be merged with the first dimension Batch of the first matrix A (the second dimension M or the third dimension K) satisfies the predetermined condition, the remainder value k of that dimension modulo the hardware multiplication granularity Kd of the tensor core 210 may be determined, i.e., k = M % Kd or k = K % Kd, where % represents the remainder operation. Here, the hardware multiplication granularity Kd of the tensor core 210 is assumed to be 64, i.e., the tensor core 210 can process a matrix multiplication of size Kd at a time, but it will be understood by those skilled in the art that the present invention is not limited thereto, and the hardware multiplication granularity Kd can take other values, e.g., 8 x 8, 16 x 16, etc., depending on the particular hardware implementation of the computing device 200 or the tensor core 210.
At block 3143, it may be determined whether the remainder value k is greater than or equal to a predetermined threshold, wherein the predetermined threshold is associated with a hardware multiplication granularity Kd of the tensor core 210. For example, the predetermined threshold may be half or less of the hardware multiplication granularity Kd of the tensor core 210.
If it is determined that the remainder value k is greater than or equal to the predetermined threshold, then at block 3144, the first matrix A may be partitioned according to the first dimension Batch of the first matrix A and the dimension to be merged (M or K), and read as the first two-dimensional matrix A' in the order of the other of the dimensions M and K. That is, in this case, the first matrix A is in fact directly regarded as a two-dimensional matrix of (Batch x M, K) or (M, Batch x K) for data reading, without splicing the partitioned blocks of the first matrix A, that is, without changing the storage manner of the matrix elements.
On the other hand, if it is determined that the remainder value k is less than the predetermined threshold, then at block 3145, the first matrix A may be spliced along its first dimension Batch to form the first two-dimensional matrix A'. That is, in the case where k is small, splicing along the first dimension Batch of the first matrix A is required, that is, the storage manner of the matrix elements is changed.
The method of partitioning and splicing the first matrix A (Batch, M, K) in blocks 3144 and 3145 may be, for example, that described in Chinese patent application 202210440010.1.
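The remainder test of blocks 3142 through 3145 can be sketched as follows. The helper function is hypothetical, and the values Kd = 64 and a threshold of half the granularity merely follow the example figures given above:

```python
# Hypothetical sketch of the decision in blocks 3142-3145. It applies to a
# dimension that already satisfies the predetermined condition of block 3141
# (i.e., the dimension to be merged with Batch is small). Kd and the
# threshold Kd // 2 are example values; actual hardware may differ.
def choose_conversion(merged_dim: int, Kd: int = 64) -> str:
    k = merged_dim % Kd      # remainder value: k = M % Kd or k = K % Kd
    threshold = Kd // 2      # predetermined threshold: half of Kd
    if k >= threshold:
        # Block 3144: read the matrix directly as two-dimensional,
        # without changing the storage of its elements.
        return "direct-read"
    # Block 3145: splice along the Batch dimension (storage is changed).
    return "splice"
```

For instance, a merged dimension of 40 (remainder 40 >= 32) would be read directly, while a merged dimension of 4 (remainder 4 < 32) would trigger splicing.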
In the manner described above, the first matrix A may be converted into a first two-dimensional matrix A' of (Batch x M, K) or (M, Batch x K) suitable for the hardware processing capability of the tensor core 210.
Continuing with FIG. 4, at block 316, the first two-dimensional matrix A' may be stored from the vector core 230 to the second memory 240 near the tensor core 210 through a multiple data store (STM) operation.
Here, the LDM operation and the STM operation are single-instruction read and write operations on a plurality of registers, wherein the LDM operation can sequentially load the data of a range of addresses in the address space into a plurality of registers of the vector core 230 through a single LDM instruction, and the STM operation can sequentially write the data of a plurality of registers to a range of addresses of the second memory 240 through a single STM instruction.
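As a purely functional analogy (the real LDM and STM are single hardware instructions of the vector core, not library calls), the effect of the two operations can be modeled as a bulk copy between an address range and a group of registers:

```python
# Toy functional model of LDM/STM. Illustrative only: real LDM/STM move a
# contiguous address range to/from vector-core registers in one instruction.
def ldm(memory: list, addr: int, n: int) -> list:
    """Load n consecutive words starting at addr into 'registers'."""
    return memory[addr:addr + n]

def stm(memory: list, addr: int, regs: list) -> None:
    """Write the contents of 'registers' to consecutive words at addr."""
    memory[addr:addr + len(regs)] = regs
```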
At block 318, the first two-dimensional matrix a' may be loaded into the first buffer 214 based on the size of the first buffer 214.
In this way, a three-dimensional input matrix of the matrix multiplication operation can be converted into a suitable two-dimensional matrix, and the manner of loading data into the corresponding buffer in the tensor core 210 is improved, so that the buffer space is fully utilized, the number of data loads is reduced, and the computational performance is improved.
Fig. 6 illustrates a further detailed flow chart of a process of converting and loading the second matrix B (block 340) according to some embodiments of the invention. In the embodiment of block 340, the second matrix B is a three-dimensional matrix, and the conversion and loading of the second matrix B depend on the conversion and loading of the first matrix A described above in connection with figs. 4 and 5.
As shown in fig. 6, at block 342, the second matrix B may be loaded from the first memory 220 outside the tensor core 210 to the vector core 230 through a multiple data load (LDM) operation.
At block 344, in the vector core 230, a first dimension (K or Batch x K) and a second dimension (Batch x N or N) of the second two-dimensional matrix B' may be determined based on the first dimension (Batch x M or M) and the second dimension (K or Batch x K) of the first two-dimensional matrix A'.
Specifically, as described above, the first two-dimensional matrix A' may be determined to be (Batch x M, K) or (M, Batch x K). In the case where the first two-dimensional matrix A' is (Batch x M, K), the first dimension Batch of the first matrix A is merged with the second dimension M, and therefore the second matrix B should merge its first dimension Batch with its third dimension N. On the other hand, in the case where the first two-dimensional matrix A' is (M, Batch x K), the first dimension Batch of the first matrix A is merged with the third dimension K, so the second matrix B should merge its first dimension Batch with its second dimension K, to ensure that a matrix multiplication operation can be performed between the first two-dimensional matrix A' and the second two-dimensional matrix B'.
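The pairing rule above can be summarized in a short sketch; the helper name `paired_shapes` is hypothetical and only restates which dimension of the second matrix B must absorb Batch so that the inner dimensions of A' and B' agree:

```python
# Hypothetical helper: given which dimension of A was merged with Batch,
# return the shapes of A' and the matching B' so that A' x B' is a valid
# matrix multiplication (inner dimensions equal).
def paired_shapes(Batch, M, K, N, a_merges_M: bool):
    if a_merges_M:
        # A' = (Batch x M, K)  =>  B merges Batch with N: B' = (K, Batch x N)
        return (Batch * M, K), (K, Batch * N)
    # A' = (M, Batch x K)  =>  B merges Batch with K: B' = (Batch x K, N)
    return (M, Batch * K), (Batch * K, N)
```

In either branch, the second dimension of A' equals the first dimension of B', which is the compatibility condition the text requires.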
Fig. 7 illustrates a further detailed schematic diagram of a process of determining the first dimension and the second dimension of the second two-dimensional matrix B' (block 344), according to some embodiments of the invention.
As shown in fig. 7, at block 3442, it may be determined, based on the first dimension (Batch x M or M) of the first two-dimensional matrix A', whether the second dimension M of the first matrix A is merged with the first dimension Batch of the first matrix A. That is, this is equivalent to determining whether the second dimension M or the third dimension K of the first matrix A is merged with the first dimension Batch.
If it is determined that the second dimension M of the first matrix A is merged with the first dimension Batch of the first matrix A (i.e., the first two-dimensional matrix A' is (Batch x M, K)), then at block 3444, the second matrix B may be partitioned according to the first dimension Batch and the third dimension N of the second matrix B, and read as the second two-dimensional matrix B' in the order of the second dimension K of the second matrix B. That is, in this case, the second matrix B is in fact directly regarded as a two-dimensional matrix of (K, Batch x N) for data reading, without splicing the partitioned blocks of the second matrix B, that is, without changing the storage manner of the matrix elements.
On the other hand, if it is determined that the second dimension M of the first matrix A is not merged with the first dimension Batch of the first matrix A (i.e., the first dimension Batch of the first matrix A is merged with the third dimension K), then at block 3446, the second matrix B may be partitioned according to the first dimension Batch and the second dimension K of the second matrix B, and read as the second two-dimensional matrix B' of (Batch x K, N) in the order of the third dimension N of the second matrix B.
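Both branches (blocks 3444 and 3446) can be sketched in NumPy for a second matrix B of shape (Batch, K, N). This is illustrative only: the vector core performs the conversion by address computation rather than by materializing copies, and the branch-3446 layout needs no data movement at all:

```python
import numpy as np

Batch, K, N = 2, 3, 4
B = np.arange(Batch * K * N).reshape(Batch, K, N)  # second matrix B (Batch, K, N)

# Block 3444: A' is (Batch x M, K), so merge Batch with N -> B' = (K, Batch x N),
# reading in the order of the second dimension K (elements are reordered here).
B_prime_1 = B.transpose(1, 0, 2).reshape(K, Batch * N)

# Block 3446: A' is (M, Batch x K), so merge Batch with K -> B' = (Batch x K, N),
# which matches the storage order and is a direct re-view of the data.
B_prime_2 = B.reshape(Batch * K, N)
```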
In the above manner, the second matrix B can be converted into a second two-dimensional matrix B' suitable for the hardware processing capability of the tensor core 210.
Continuing with FIG. 6, at block 346, the second two-dimensional matrix B' may be stored from the vector core 230 to the second memory 240 near the tensor core 210 through a multiple data store (STM) operation.
At block 348, a second two-dimensional matrix B' may be loaded into the second buffer 216 based on the size of the second buffer 216.
In this way, a three-dimensional input matrix of the matrix multiplication operation can be converted into a suitable two-dimensional matrix, and the manner of loading data into the corresponding buffer in the tensor core 210 is improved, so that the buffer space is fully utilized, the number of data loads is reduced, and the computational performance is improved.
A computing device and method for implementing matrix multiplication operations in accordance with the present disclosure are described above with reference to the accompanying drawings. It will be appreciated by those skilled in the art that the execution of the methods described above is not limited to the order shown in the figures and described above, but may be performed in any other reasonable order. Furthermore, the computing device need not include all of the components shown in the figures, but may include only those components necessary to perform the functions described in this disclosure, and the manner of connection of those components is not limited to the form shown in the figures.
The present disclosure may be implemented as a method, computing device, system, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing aspects of the present disclosure. The computing device may include at least one processor and at least one memory coupled to the at least one processor, which may store instructions for execution by the at least one processor. The instructions, when executed by the at least one processor, may perform the method described above.
In one or more exemplary designs, the functions described in this disclosure may be implemented in hardware, software, firmware, or any combination thereof. For example, if implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The various units of the apparatus disclosed herein may be implemented using discrete hardware components or may be integrally implemented on one hardware component, such as a processor. For example, the various illustrative logical blocks, modules, and circuits described in connection with the disclosure may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
Those of ordinary skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The previous description of the disclosure is provided to enable any person of ordinary skill in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311502624.9A CN117454068B (en) | 2023-11-10 | 2023-11-10 | Method and computing device for implementing matrix multiplication operation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117454068A CN117454068A (en) | 2024-01-26 |
CN117454068B true CN117454068B (en) | 2025-01-24 |
Family
ID=89594629
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311502624.9A Active CN117454068B (en) | 2023-11-10 | 2023-11-10 | Method and computing device for implementing matrix multiplication operation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117454068B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118193443A (en) * | 2024-03-19 | 2024-06-14 | 上海壁仞科技股份有限公司 | Data loading method for processor, computing device and medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111191784A (en) * | 2018-11-14 | 2020-05-22 | 辉达公司 | Transposed sparse matrix multiplied by dense matrix for neural network training |
CN111353126A (en) * | 2018-12-20 | 2020-06-30 | 卡雷公司 | Block matrix multiplication system |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021119907A1 (en) * | 2019-12-16 | 2021-06-24 | Intel Corporation | Technology to minimize negative impact of cache conflicts caused by incompatible leading dimensions in matrix multiplication and convolution kernels without dimension padding |
CN111860838B (en) * | 2020-07-24 | 2022-12-20 | 苏州浪潮智能科技有限公司 | A fully connected layer computing method and device for a neural network |
US20220075842A1 (en) * | 2020-09-04 | 2022-03-10 | Nvidia Corporation | Processor and system for automatic fusion of matrix multiplication and reduction operations |
US20220391320A1 (en) * | 2021-05-24 | 2022-12-08 | Industry-Academic Cooperation Foundation, Yonsei University | Operation device of convolutional neural network, operation method of convolutional neural network and computer program stored in a recording medium to execute the method thereof |
CN113468469A (en) * | 2021-06-02 | 2021-10-01 | 北京迈格威科技有限公司 | Convolution processing method and device of feature graph executed by computer and electronic equipment |
GB2608791B (en) * | 2021-06-29 | 2023-11-22 | Imagination Tech Ltd | Neural network comprising matrix multiplication |
US20230118058A1 (en) * | 2021-10-18 | 2023-04-20 | Tencent America LLC | Systems and methods for accelerating a neural network using a unified sparse tensor core |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11907830B2 (en) | Neural network architecture using control logic determining convolution operation sequence | |
US11449576B2 (en) | Convolution operation processing method and related product | |
CN108304922B (en) | Computing device and computing method for neural network computing | |
US10769749B2 (en) | Processor, information processing apparatus, and operation method of processor | |
EP4510038A2 (en) | Neural network comprising matrix multiplication | |
US20230021204A1 (en) | Neural network comprising matrix multiplication | |
GB2560600A (en) | Nueral Network Hardware | |
CN117454068B (en) | Method and computing device for implementing matrix multiplication operation | |
CN110837483B (en) | Tensor dimension transformation method and device | |
CN117435855B (en) | Method for performing convolution operation, electronic device, and storage medium | |
CN118643005B (en) | Computing device, data processing method for computing device, computer readable storage medium and computer program product | |
US20210082082A1 (en) | Data processing method and processing circuit | |
GB2611521A (en) | Neural network accelerator with a configurable pipeline | |
CN118708153A (en) | A hardware accelerator and hardware acceleration method based on three-dimensional number theory transformation | |
CN119204129A (en) | A DNN computing engine | |
CN119513470A (en) | Memory device and operation method thereof | |
JPH0869407A (en) | Processor with data memory |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |