
CN117454068B - Method and computing device for implementing matrix multiplication operation - Google Patents


Info

Publication number
CN117454068B
Authority
CN
China
Prior art keywords
matrix
dimension
dimensional
dimensional matrix
core
Prior art date
Legal status
Active
Application number
CN202311502624.9A
Other languages
Chinese (zh)
Other versions
CN117454068A (en)
Inventor
Name withheld at inventor's request
Current Assignee
Shanghai Bi Ren Technology Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd
Priority to CN202311502624.9A
Publication of CN117454068A
Application granted
Publication of CN117454068B
Status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 — Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 — Complex mathematical operations
    • G06F 17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 — Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 — Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 — Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52 — Multiplying; Dividing
    • G06F 7/523 — Multiplying only

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The present disclosure provides a method and computing device for implementing matrix multiplication operations. The method includes: converting a first matrix of the matrix multiplication operation, which is a three-dimensional matrix, into a first two-dimensional matrix and loading the first two-dimensional matrix into a first buffer for the first matrix within a tensor core; determining whether a second matrix of the matrix multiplication operation is a two-dimensional matrix or a three-dimensional matrix; in response to the second matrix being a two-dimensional matrix, loading the second matrix into a second buffer for the second matrix within the tensor core; in response to the second matrix being a three-dimensional matrix, converting the second matrix into a second two-dimensional matrix and loading the second two-dimensional matrix into the second buffer; and performing matrix multiplication, in the tensor core, on the first two-dimensional matrix and either the second matrix or the second two-dimensional matrix.

Description

Method and computing device for implementing matrix multiplication operation
Technical Field
The present disclosure relates generally to the field of processors, and more particularly, to a method and computing device for implementing matrix multiplication operations.
Background
Modern scientific fields such as weather forecasting, oil exploration, and nuclear physics rely on a large number of computer-based simulations, the core of which is matrix computation representing state transitions. The fields of computer graphics processing, deep learning, and the like also rely heavily on matrix multiplication operations.
When performing a matrix multiplication operation on two matrices, hardware constraints such as those of GEMM (General Matrix Multiplication) units or Tensor Cores make it necessary to block (tile) the input matrices and perform the matrix multiplication on the resulting block matrices.
However, hardware utilization is low when the remainder of a corresponding dimension of the input matrix (e.g., the batch dimension) modulo the hardware multiplication granularity is small.
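As a hedged illustration of this effect (the numbers are hypothetical examples, not values taken from this patent), the waste can be quantified as the fraction of hardware lanes doing useful work in the final, partially filled block:

```python
# Illustrative sketch: utilization of the final (partial) block when a
# matrix dimension is not a multiple of the hardware multiplication
# granularity Kd.  Kd = 64 is an assumed example value.
def tail_utilization(dim: int, kd: int = 64) -> float:
    """Fraction of hardware lanes doing useful work in the last block."""
    remainder = dim % kd
    return 1.0 if remainder == 0 else remainder / kd

print(tail_utilization(2))    # a dimension of 2 uses only 2/64 of the lanes
print(tail_utilization(64))   # an aligned dimension uses all lanes
```

For a dimension of 2 against a granularity of 64, only about 3% of the multiplier lanes do useful work, which is the inefficiency the folding method below targets.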
Disclosure of Invention
In view of the above, the present disclosure provides a method of performing matrix multiplication by performing dimensional folding for input matrices of different shapes.
According to one aspect of the present disclosure, a method of implementing matrix multiplication operations is provided. The method includes: converting a first matrix of the matrix multiplication operation, which is a three-dimensional matrix, into a first two-dimensional matrix and loading the first two-dimensional matrix into a first buffer for the first matrix within a tensor core; determining whether a second matrix of the matrix multiplication operation is a two-dimensional matrix or a three-dimensional matrix; in response to the second matrix being a two-dimensional matrix, loading the second matrix into a second buffer for the second matrix within the tensor core; in response to the second matrix being a three-dimensional matrix, converting the second matrix into a second two-dimensional matrix and loading the second two-dimensional matrix into the second buffer; and performing matrix multiplication, in the tensor core, on the first two-dimensional matrix and either the second matrix or the second two-dimensional matrix.
In some implementations, converting the first matrix of the matrix multiplication operation into the first two-dimensional matrix and loading it into the first buffer for the first matrix includes: loading the first matrix from a first memory outside the tensor core into a vector core through a data load operation; determining, in the vector core, the first and second dimensions of the first two-dimensional matrix based on the second and third dimensions of the first matrix; storing the first two-dimensional matrix from the vector core into a second memory near the tensor core through a data store operation; and loading the first two-dimensional matrix into the first buffer based on the size of the first buffer.
In some implementations, determining, in the vector core, the first and second dimensions of the first two-dimensional matrix based on the second and third dimensions of the first matrix includes: determining whether the dimension, among the second and third dimensions of the first matrix, that is to be merged with the first dimension of the first matrix satisfies a predetermined condition; in response to determining that the dimension to be merged satisfies the predetermined condition, determining the remainder value of that dimension modulo the hardware multiplication granularity of the tensor core; determining whether the remainder value is greater than or equal to a predetermined threshold, wherein the predetermined threshold is associated with the hardware multiplication granularity of the tensor core; in response to determining that the remainder value is greater than or equal to the predetermined threshold, blocking the first matrix according to its first dimension and the dimension to be merged, and reading the first matrix as the first two-dimensional matrix according to the other of its second and third dimensions; and in response to determining that the remainder value is less than the predetermined threshold, splicing the first matrix along its first dimension to form the first two-dimensional matrix.
In some implementations, converting the second matrix into the second two-dimensional matrix and loading it into the second buffer for the second matrix within the tensor core includes: loading the second matrix from the first memory outside the tensor core into the vector core through a multiple-data load operation; determining, in the vector core, the first and second dimensions of the second two-dimensional matrix based on the first and second dimensions of the first two-dimensional matrix; storing the second two-dimensional matrix from the vector core into the second memory near the tensor core through a multiple-data store operation; and loading the second two-dimensional matrix into the second buffer based on the size of the second buffer.
In some implementations, determining the first and second dimensions of the second two-dimensional matrix in the vector core based on the first and second dimensions of the first two-dimensional matrix includes: determining, based on the first dimension of the first two-dimensional matrix, whether the second dimension of the first matrix was merged with the first dimension of the first matrix; if so, blocking the first and third dimensions of the second matrix and reading the second matrix as the second two-dimensional matrix according to the second dimension of the second matrix; and if not, blocking the first and second dimensions of the second matrix and reading the second matrix as the second two-dimensional matrix according to the third dimension of the second matrix.
According to another aspect of the present disclosure, a computing device is provided. The computing device includes: a tensor core including a general matrix multiplication unit, a first buffer for a first matrix of a matrix multiplication operation, and a second buffer for a second matrix of the matrix multiplication operation, wherein the first matrix is a three-dimensional matrix; a first memory located outside the tensor core and configured to store the first matrix and the second matrix before the matrix multiplication operation; a vector core configured to convert the first matrix and the second matrix into a first two-dimensional matrix and a second two-dimensional matrix, respectively; and a second memory located near the tensor core and configured to store the first two-dimensional matrix and the second two-dimensional matrix. The tensor core is configured to load the second matrix into the second buffer when the second matrix is a two-dimensional matrix, or to load the second two-dimensional matrix into the second buffer when the second matrix is a three-dimensional matrix, and to perform matrix multiplication on the first two-dimensional matrix and either the second matrix or the second two-dimensional matrix.
In some implementations, the vector core is configured to load the first matrix from the first memory through a data load operation, determine a first dimension and a second dimension of the first two-dimensional matrix based on a second dimension and a third dimension of the first matrix, and store the first two-dimensional matrix to the second memory through a data store operation.
In some implementations, the vector core is configured to: determine whether the dimension, among the second and third dimensions of the first matrix, that is to be merged with the first dimension of the first matrix satisfies a predetermined condition; in response to determining that the dimension to be merged satisfies the predetermined condition, determine the remainder value of that dimension modulo the hardware multiplication granularity of the tensor core; determine whether the remainder value is greater than or equal to a predetermined threshold, wherein the predetermined threshold is associated with the hardware multiplication granularity of the tensor core; in response to determining that the remainder value is greater than or equal to the predetermined threshold, block the first matrix according to its first dimension and the dimension to be merged, and read the first matrix as the first two-dimensional matrix according to the other of its second and third dimensions; and in response to determining that the remainder value is less than the predetermined threshold, splice the first matrix along its first dimension to form the first two-dimensional matrix.
In some implementations, the vector core is configured to load the second matrix from the first memory through a data load operation, determine a first dimension and a second dimension of the second two-dimensional matrix based on the first dimension and the second dimension of the first two-dimensional matrix, and store the second two-dimensional matrix to the second memory through a data store operation.
In some implementations, the vector core is configured to: determine, based on the first dimension of the first two-dimensional matrix, whether the second dimension of the first matrix was merged with the first dimension of the first matrix; if so, block the first and third dimensions of the second matrix and read the second matrix as the second two-dimensional matrix according to the second dimension of the second matrix; and if not, block the first and second dimensions of the second matrix and read the second matrix as the second two-dimensional matrix according to the third dimension of the second matrix.
Drawings
The disclosure will be better understood and other objects, details, features and advantages of the disclosure will become more apparent by reference to the description of specific embodiments thereof given in the following drawings.
Fig. 1A and 1B show schematic diagrams of matrix multiplication operations for two different cases.
FIG. 2 shows a schematic diagram of a computing device for performing a matrix multiplication operation in accordance with an embodiment of the invention.
Fig. 3 shows a schematic flow chart of a method for implementing a matrix multiplication operation according to an embodiment of the invention.
Fig. 4 illustrates a further detailed flow chart of a process of converting and loading a first matrix according to some embodiments of the invention.
Fig. 5 illustrates a further detailed schematic diagram of a process of determining a first dimension and a second dimension of a first two-dimensional matrix according to some embodiments of the invention.
Fig. 6 illustrates a further detailed flow chart of a process of converting and loading a second matrix according to some embodiments of the invention.
FIG. 7 illustrates a further detailed schematic diagram of a process of determining a first dimension and a second dimension of a second two-dimensional matrix according to some embodiments of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "comprising" and its variations as used herein are open ended, i.e., "including but not limited to". The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one embodiment" and "some embodiments" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same objects.
Fig. 1A and 1B show schematic diagrams of matrix multiplication operations for two different cases.
In the example shown in fig. 1A, of the two input matrices of the matrix multiplication operation, the first matrix A is a three-dimensional matrix (Batch, M, K), the second matrix B is a two-dimensional matrix (K, N), and the output matrix C after the matrix multiplication operation is a three-dimensional matrix (Batch, M, N). The last dimension (the third dimension, K) of the first matrix A is the same as the first dimension K of the second matrix B (i.e., the last dimension of the former equals the first dimension of the latter), so that the matrix multiplication operation is feasible.
In the example shown in fig. 1B, of the two input matrices of the matrix multiplication operation, the first matrix A is a three-dimensional matrix (Batch, M, K), the second matrix B is a three-dimensional matrix (Batch, K, N), and the output matrix C after the matrix multiplication operation is a two-dimensional matrix (M, N). Here, the last dimension (the third dimension, K) of the first matrix A is shown as the same as the second dimension K of the second matrix B. Those skilled in the art will understand that this representation is merely for convenience of description: in an actual matrix multiplication operation, the second matrix B may be transposed so that a matrix multiplication between the first matrix A and the transposed second matrix B is possible, i.e., the last dimension of the former equals the first dimension of the latter. Description of such a transposition operation is omitted herein.
Further, more generally, at least one dimension of two matrices performing a matrix multiplication operation may be the same, without requiring a specific order of that dimension. In this case, before performing the matrix multiplication operation, matrix transposition may be performed on at least one of the two original matrices such that the last dimension of the transposed first matrix a is the same as the first dimension of the second matrix B, which is not described herein.
There are two main methods for performing the matrix multiplication operation. The first is to loop over the Batch dimension in the outer layer while the inner layer performs an M×N×K matrix operation. For example, for the case of fig. 1A, the first matrix A may be sliced into Batch two-dimensional matrices (M, K) along the Batch dimension, and each two-dimensional matrix (M, K) may then be multiplied with the second matrix B (K, N), yielding Batch two-dimensional matrices (M, N) that are stacked into the output three-dimensional matrix C (Batch, M, N). For the case of fig. 1B, the first matrix A and the second matrix B may each be sliced into Batch two-dimensional matrices (M, K) and (K, N) along the Batch dimension; each pair of two-dimensional matrices (M, K) and (K, N) is then multiplied, and the resulting Batch two-dimensional matrices (M, N) are summed to produce the output two-dimensional matrix (M, N). However, in this approach the dimensions of the two-dimensional matrices (M, K and N) are unchanged, and hardware utilization remains low when their remainders modulo the hardware multiplication granularity are small.
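A minimal NumPy sketch of this batch-loop method, covering both the Fig. 1A and Fig. 1B cases (NumPy is used purely as a stand-in for the hardware), might look like:

```python
import numpy as np

# Sketch of the batch-loop method: the outer loop runs over Batch and the
# inner step is a plain 2-D (M, K) x (K, N) multiplication.
def batched_matmul_loop(A, B):
    batch = A.shape[0]
    if B.ndim == 2:
        # Fig. 1A: B is (K, N); stack per-batch results into (Batch, M, N).
        return np.stack([A[b] @ B for b in range(batch)])
    # Fig. 1B: B is (Batch, K, N); per-batch products are summed into (M, N).
    out = A[0] @ B[0]
    for b in range(1, batch):
        out = out + A[b] @ B[b]
    return out
```

Note that each inner multiply still has the original M and K dimensions, which is exactly why small remainders of M or K persist in this method.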
The second method is to reshape the input three-dimensional matrix into a two-dimensional matrix in software before the first matrix A and the second matrix B are multiplied, and then perform the matrix multiplication on the converted two-dimensional matrix in hardware. For example, in the case of fig. 1A, the first matrix A may be reshaped into a two-dimensional matrix (Batch×M, K), which is then multiplied by the second matrix B (K, N) to obtain an output two-dimensional matrix (Batch×M, N); this may be reshaped into the output three-dimensional matrix C (Batch, M, N) as needed. Because the reshape converts the input matrix from a three-dimensional matrix to a two-dimensional matrix, at most one partially filled block remains for the folded dimension relative to the hardware multiplication granularity, which improves hardware utilization. However, the reshape performed by software introduces additional memory-access operations, increasing the total time consumed and degrading overall performance.
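The reshape-based method for the Fig. 1A case can be sketched as follows (again with NumPy as a stand-in; in NumPy the reshape of a contiguous array is nearly free, whereas on the accelerator described here it is this step that incurs the extra memory traffic):

```python
import numpy as np

# Sketch of the software-reshape method (Fig. 1A case): fold (Batch, M, K)
# into one large 2-D operand (Batch*M, K), run a single 2-D GEMM, then
# restore the (Batch, M, N) output shape.
def matmul_via_reshape(A, B):
    batch, M, K = A.shape
    A2d = A.reshape(batch * M, K)     # (Batch*M, K)
    C2d = A2d @ B                     # (Batch*M, N)
    return C2d.reshape(batch, M, B.shape[1])
```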
In both of the above methods, if M or K is not aligned with the hardware multiplication granularity, the information of the Batch dimension cannot be fully utilized. In particular, when the remainder of M or K modulo the hardware multiplication granularity is a small number such as 2, 4 or 8, the utilization of both the matrix multiplication hardware and the buffers is low, so overall performance is poor; and if a reshape step is inserted instead, the larger memory traffic again degrades overall performance.
To address these problems, the present disclosure provides a method that optimizes the matrix multiplication operation by dimensionally folding the two input matrices, which makes full use of the processing capability of the matrix multiplication hardware and improves buffer utilization.
FIG. 2 shows a schematic diagram of a computing device 200 for performing a matrix multiplication operation in accordance with an embodiment of the invention. As shown in fig. 2, the computing device 200 may include a tensor core 210 for performing a matrix multiplication operation, the tensor core 210 having a given hardware multiplication granularity that limits the maximum dimension of the input matrix (i.e., the maximum of M or K) that the tensor core 210 can handle.
The computing device 200 may also include a first memory 220 outside of the tensor core 210. The first memory 220 is used to store the first matrix a and the second matrix B before the tensor core 210 performs the matrix multiplication operation, and to store the output matrix C after the matrix multiplication operation is finished. The first memory 220 may be, for example, a high-bandwidth memory (High Bandwidth Memory, HBM), which has the advantage of high throughput and high bandwidth, and may be widely used in fields with high requirements for access to large data volumes, such as Artificial Intelligence (AI) computation, etc.
Computing device 200 may also include vector core 230. The vector kernel 230 is used to convert a three-dimensional input matrix into a two-dimensional matrix.
The computing device 200 may also include a second memory 240 in the vicinity of the tensor core 210. The second memory 240 may be composed of a plurality of registers for storing a two-dimensional matrix generated by the vector core 230 converting the three-dimensional input matrix. The second memory 240 is typically directly connected to the tensor core 210 as the main memory of the tensor core 210 without further intermediate conversion, so that the data required for the tensor core 210 to calculate is stored as close as possible to the tensor core 210.
In some embodiments, the tensor core 210, the vector core 230, and the second memory 240 may be part of a separate stream processing unit 250. Although only one stream processing unit 250 is shown in fig. 2, those skilled in the art will appreciate that computing device 200 may contain a plurality of such stream processing units 250. In this case, the first memory 220 may serve multiple stream processing units 250 and may exchange data and instructions with each stream processing unit 250 of the computing device 200 through a high-speed bus or network-on-chip.
In some embodiments, the tensor core 210 may further include a general matrix multiplication (GEMM) unit 212, a first buffer (A_buffer) 214 for the first matrix A, and a second buffer (B_buffer) 216 for the second matrix B. The first buffer 214 and the second buffer 216 may be memories within the tensor core 210 dedicated to it: the first buffer 214 buffers one block of the first matrix A at a time at the hardware multiplication granularity of the GEMM unit 212, and the second buffer 216 buffers one vector of the second matrix B at a time at that granularity. That is, the sizes of the first buffer 214 and the second buffer 216 are determined by the hardware multiplication granularity of the GEMM unit 212 or the tensor core 210. For example, where the GEMM unit 212 is an N×N two-dimensional accumulator array, the first buffer 214 and the second buffer 216 each buffer one vector of length N×N at a time for the first matrix A and the second matrix B, respectively. The GEMM unit 212 performs a matrix multiplication operation on the vectors input from the first buffer 214 and the second buffer 216 each time it is triggered by a matrix multiplication instruction received by the tensor core 210.
Furthermore, in some embodiments, in the tensor core 210, the two-dimensional matrix from the second memory 240 may also be transposed (not shown) before the first buffer 214 and the second buffer 216.
Fig. 3 shows a schematic flow chart of a method 300 for implementing a matrix multiplication operation according to an embodiment of the invention. The method 300 may be performed by the computing device 200 or in the computing device 200. Here, it is assumed that a first matrix a and a second matrix B of the matrix multiplication operation have been stored in the first memory 220 in advance, and the first matrix a is a three-dimensional matrix (Batch, M, K) as shown in fig. 1A and 1B, and the second matrix B may be a two-dimensional matrix (K, N) as shown in fig. 1A or a three-dimensional matrix (Batch, K, N) as shown in fig. 1B. Depending on whether the second matrix B is a two-dimensional matrix or a three-dimensional matrix, different processing is performed on the second matrix B in the method 300.
As shown in fig. 3, at block 310, a first matrix a of a matrix multiplication operation may be converted to a first two-dimensional matrix a' and loaded into a first buffer 214 for the first matrix a within the tensor core 210. For example, assuming that the first matrix a is a three-dimensional matrix (Batch, M, K), the converted first two-dimensional matrix a' is a two-dimensional matrix (batch×m, K) or (M, batch×k). Here, a more specific embodiment of converting and loading the first matrix a is described below with reference to fig. 4.
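The two possible foldings of the first matrix A can be sketched in NumPy as follows; which folding is chosen depends on the dimension checks described later with reference to Fig. 5, and `merge_dim` is a parameter name introduced here purely for illustration:

```python
import numpy as np

# Sketch of the two foldings of the first matrix A (Batch, M, K).
def fold_first_matrix(A, merge_dim: str):
    batch, M, K = A.shape
    if merge_dim == "M":
        # Merge M into Batch: (Batch, M, K) -> (Batch*M, K).
        return A.reshape(batch * M, K)
    if merge_dim == "K":
        # Merge K into Batch: bring M to the front, then lay the K-slices
        # of all batches side by side: (Batch, M, K) -> (M, Batch*K).
        return A.transpose(1, 0, 2).reshape(M, batch * K)
    raise ValueError("merge_dim must be 'M' or 'K'")
```

In the "K" case the transpose makes row m of the result hold A[0, m, :], A[1, m, :], … concatenated, which is the layout a (M, Batch×K) operand needs.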
At block 320, it may be determined whether the second matrix B of the matrix multiplication operation is a two-dimensional matrix or a three-dimensional matrix, such as the matrix shown in fig. 1A or the matrix shown in fig. 1B.
If the second matrix B is a two-dimensional matrix (e.g., the two-dimensional matrix (K, N) shown in fig. 1A), then at block 330 the second matrix B may be loaded directly into the second buffer 216 for the second matrix B within the tensor core 210.
If the second matrix B is a three-dimensional matrix (e.g., the three-dimensional matrix (Batch, K, N) shown in FIG. 1B), then at block 340 the second matrix B may be converted to a second two-dimensional matrix B' and loaded into the second buffer 216 for the second matrix B within the tensor core 210. Here, a more specific embodiment of the conversion and loading of the second matrix B is described below with reference to fig. 6, which corresponds to the conversion and loading manner of the first matrix a.
Then, at block 350, matrix multiplication is performed in the tensor core 210 on the first two-dimensional matrix A' and either the second matrix B (in the case of block 330) or the second two-dimensional matrix B' (in the case of block 340).
Note that the conversion and loading of the first matrix a may be performed concurrently with the loading of the second matrix B (block 330) or the conversion and loading (block 340) or may be performed in sequential order, the specific order of execution being not limited herein.
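Putting blocks 310–350 together, a hedged end-to-end sketch in NumPy (ignoring the buffer-loading steps of the real hardware, and fixing one folding choice per case purely for illustration) is:

```python
import numpy as np

# End-to-end sketch of method 300.  Folding choices here are fixed for
# illustration: M is merged with Batch in the Fig. 1A case, and K is
# merged with Batch on both operands in the Fig. 1B case.
def method_300(A, B):
    batch, M, K = A.shape
    if B.ndim == 2:
        # Blocks 310 + 330: fold A to (Batch*M, K), multiply by (K, N).
        C2d = A.reshape(batch * M, K) @ B
        return C2d.reshape(batch, M, -1)
    # Blocks 310 + 340: fold A to (M, Batch*K) and B to (Batch*K, N);
    # one 2-D multiply then accumulates the per-batch products into (M, N).
    A2d = A.transpose(1, 0, 2).reshape(M, batch * K)
    B2d = B.reshape(batch * K, B.shape[2])
    return A2d @ B2d
```

In the three-dimensional case, folding K into Batch on both operands is what lets a single 2-D multiply produce the summed (M, N) output of Fig. 1B.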
Fig. 4 shows a further detailed flow chart of a process (block 310) of converting and loading a first matrix a according to some embodiments of the invention.
As shown in fig. 4, at block 312, a first matrix a may be loaded from a first memory 220 outside of the tensor core 210 to the vector core 230 through a multiple data Load (LDM) operation.
At block 314, in the vector kernel 230, a first dimension and a second dimension of the first two-dimensional matrix a' may be determined based on the second dimension M and the third dimension K of the first matrix a. Here, the conversion of the first matrix a by the vector core 230 is achieved by address conversion of each element in the first matrix a. Depending on the second dimension M and the third dimension K of the first matrix a, the transformed first two-dimensional matrix a' may be (Batch x M, K) or (M, batch x K), as described in detail below.
Fig. 5 illustrates a further detailed schematic diagram of a process of determining a first dimension and a second dimension of a first two-dimensional matrix a' (block 314), according to some embodiments of the invention.
As shown in fig. 5, at block 3141, it may be determined whether the dimension, among the second dimension M and the third dimension K of the first matrix A, that needs to be merged with the first dimension Batch satisfies a predetermined condition. Here, the predetermined condition may mean that this dimension is small in absolute terms or small relative to the other dimension.
For example, assume that the second dimension M of the first matrix a is specified as the dimension that needs to be merged with the first dimension Batch. In this case, if the second dimension M < < the third dimension K, or the absolute value of the second dimension M is small (for example, in the case where the hardware multiplication granularity of the tensor kernel is 64, the second dimension M is 2,4, or 8), the second dimension M of the first matrix a is considered to satisfy the predetermined condition.
For another example, assume that the third dimension K of the first matrix A is designated as the dimension that needs to be merged with the first dimension Batch. In this case, if the third dimension K << the second dimension M, or the absolute value of the third dimension K is small (for example, the third dimension K is 2, 4, or 8 in the case where the hardware multiplication granularity of the tensor core is 64), the third dimension K of the first matrix A is considered to satisfy the predetermined condition.
It will be appreciated by those skilled in the art that the second dimension M and the third dimension K are merely exemplary and are not intended to be limiting in any way as to the order or size of the second dimension and the third dimension.
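The predetermined condition can be sketched as a small predicate. Note that the specific "much smaller" ratio and the set of small values below are illustrative assumptions; the text fixes only the examples 2, 4, and 8 for a granularity of 64:

```python
def merge_dim_satisfies_condition(merge_dim, other_dim,
                                  small_values=(2, 4, 8), ratio=8):
    """Hypothetical check of the 'predetermined condition': the dimension to
    be merged with Batch is either absolutely small (e.g. 2, 4 or 8 when the
    hardware granularity Kd is 64) or much smaller than the other dimension.
    The small_values tuple and the ratio threshold are assumptions made for
    illustration, not values fixed by the text."""
    return merge_dim in small_values or merge_dim * ratio <= other_dim
```

For instance, M = 4 satisfies the condition regardless of K, while M = 64 with K = 64 does not.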
At block 3142, if it is determined that the dimension (the second dimension M or the third dimension K) that needs to be merged with the first dimension Batch of the first matrix A satisfies the predetermined condition, the remainder value k of that dimension modulo the hardware multiplication granularity Kd of the tensor core 210 may be determined, i.e., k = M % Kd or k = K % Kd, where % represents the modulo operation. Here, the hardware multiplication granularity Kd of the tensor core 210 is assumed to be 64, i.e., the tensor core 210 can process a matrix multiplication of size Kd at a time, but it will be understood by those skilled in the art that the invention is not limited thereto, and the hardware multiplication granularity Kd can take other values, e.g., 8 x 8, 16 x 16, etc., depending on different hardware implementations of the computing device 200 or the tensor core 210.
At block 3143, it may be determined whether the remainder value k is greater than or equal to a predetermined threshold, wherein the predetermined threshold is associated with a hardware multiplication granularity Kd of the tensor core 210. For example, the predetermined threshold may be half or less of the hardware multiplication granularity Kd of the tensor core 210.
If it is determined that the remainder value k is greater than or equal to the predetermined threshold, then at block 3144, the first matrix A may be partitioned according to the first dimension Batch of the first matrix A and the dimension (M or K) to be merged, and the first matrix A may be read as the first two-dimensional matrix A' according to the other of the dimensions M and K. That is, in this case, the first matrix A is directly regarded as a two-dimensional matrix of (Batch x M, K) or (M, Batch x K) for data reading, without stitching the partitioned blocks of the first matrix A, that is, without changing the storage manner of the matrix elements.
On the other hand, if it is determined that the remainder value k is less than the predetermined threshold, then at block 3145, the first matrix A may be stitched along the first dimension Batch of the first matrix A to form the first two-dimensional matrix A'. That is, in the case where k is small, it is necessary to stitch along the first dimension Batch of the first matrix A, that is, to change the storage manner of the matrix elements.
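The decision of blocks 3142-3145 can be sketched as follows. The threshold of Kd // 2 is one of the choices the text allows ("half or less of" Kd) and is assumed here for concreteness:

```python
def conversion_strategy(merge_dim, kd=64):
    """Sketch of the decision in blocks 3142-3145: compute the remainder of
    the dimension to be merged modulo the hardware multiplication granularity
    Kd, then compare it against a threshold tied to Kd (assumed here to be
    Kd // 2)."""
    k = merge_dim % kd           # k = M % Kd or k = K % Kd
    threshold = kd // 2          # assumption: half of Kd
    if k >= threshold:
        return "read_directly_as_2d"   # block 3144: no data movement
    return "stitch_along_batch"        # block 3145: re-lay out along Batch
```

For example, a merged dimension of 40 (remainder 40 >= 32) is read directly as 2D, while a merged dimension of 4 (remainder 4 < 32) is stitched along Batch.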
The methods of partitioning (block 3144) and stitching (block 3145) the first matrix A (Batch, M, K) are described, for example, in Chinese patent application 202210440010.1.
In the manner described above, the first matrix A may be converted into a first two-dimensional matrix A' of (Batch x M, K) or (M, Batch x K) suited to the hardware processing capability of the tensor core 210.
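As a concrete illustration of the two resulting layouts, the following sketch with plain Python nested lists (not the patent's implementation) shows why the (Batch x M, K) form needs no data movement for a row-major matrix, while the (M, Batch x K) form changes the element order:

```python
def view_as_batchm_k(a3d):
    """Re-read a row-major (Batch, M, K) nested list as a (Batch*M, K)
    matrix. The flat element order is unchanged: no element moves."""
    return [row for batch in a3d for row in batch]


def stitch_as_m_batchk(a3d):
    """Form the (M, Batch*K) matrix: row m concatenates row m of every
    batch. This changes the storage order, i.e. requires data movement."""
    m_size = len(a3d[0])
    return [[x for batch in a3d for x in batch[m]] for m in range(m_size)]
```

With a (2, 2, 2) input [[[1, 2], [3, 4]], [[5, 6], [7, 8]]], the first view yields rows [1, 2], [3, 4], [5, 6], [7, 8] in the original order, while the stitched form interleaves batches into rows [1, 2, 5, 6] and [3, 4, 7, 8].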
Continuing with FIG. 4, at block 316, the first two-dimensional matrix A' may be stored from the vector core 230 to the second memory 240 near the tensor core 210 via a multiple-data store (STM) operation.
Here, the LDM operation and the STM operation are single-instruction read and write operations on a plurality of registers: the LDM operation can sequentially load the data of a section of addresses in an address space into a plurality of registers of the vector core 230 through a single LDM instruction, and the STM operation can sequentially write the data of a plurality of registers to a section of addresses in the second memory 240 through a single STM instruction.
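The semantics of these two operations can be modeled with a toy sketch, treating memory as a flat word array and the register file as a list; this is an illustration of the contiguous-transfer behavior, not the hardware instruction encoding:

```python
def ldm(memory, base, count):
    """Toy model of a multiple-data load (LDM): one operation reads `count`
    consecutive words starting at address `base` into a register list."""
    return memory[base:base + count]


def stm(memory, base, regs):
    """Toy model of a multiple-data store (STM): one operation writes a
    register list back to consecutive addresses starting at `base`."""
    memory[base:base + len(regs)] = regs
```

The point of both operations is that a single instruction moves a whole contiguous block, rather than issuing one load or store per element.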
At block 318, the first two-dimensional matrix a' may be loaded into the first buffer 214 based on the size of the first buffer 214.
In this way, a three-dimensional input matrix of the matrix multiplication operation can be converted into a suitable two-dimensional matrix, and the loading method of data to the corresponding buffer in the tensor core 210 is improved, so that the buffer space is fully utilized, the data loading times are reduced, and the calculation performance is improved.
Fig. 6 illustrates a further detailed flow chart of the process of converting and loading the second matrix B (block 340) according to some embodiments of the invention. In the embodiment of block 340, the second matrix B is a three-dimensional matrix, and the conversion and loading process of the second matrix B depends on the conversion and loading process of the first matrix A described above in connection with figs. 4 and 5.
As shown in fig. 6, at block 342, the second matrix B may be loaded from the first memory 220 outside the tensor core 210 to the vector core 230 through a multiple-data load (LDM) operation.
At block 344, in the vector core 230, a first dimension (K or Batch x K) and a second dimension (Batch x N or N) of the second two-dimensional matrix B' may be determined based on the first dimension (Batch x M or M) and the second dimension (K or Batch x K) of the first two-dimensional matrix A'.
Specifically, as described above, the first two-dimensional matrix A' may be determined as (Batch x M, K) or (M, Batch x K). In the case where the first two-dimensional matrix A' is (Batch x M, K), the first dimension Batch of the first matrix A is merged with the second dimension M, and therefore the second matrix B should merge its first dimension Batch with its third dimension N. On the other hand, in the case where the first two-dimensional matrix A' is (M, Batch x K), the first dimension Batch of the first matrix A is merged with the third dimension K, so the second matrix B should merge its first dimension Batch with its second dimension K, to ensure that a matrix multiplication operation can be performed between the first two-dimensional matrix A' and the second two-dimensional matrix B'.
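The shape pairing described above can be sketched as a small helper; the string tag for the merged dimension is a hypothetical convention for illustration:

```python
def matching_b_prime_shape(a_merged_dim, Batch, K, N):
    """Given which dimension of A ("M" or "K") was merged with Batch, return
    the 2D shape of B' that keeps the inner K dimension compatible, per the
    text:
      A' = (Batch*M, K)  ->  B' = (K, Batch*N)   (Batch merged with N)
      A' = (M, Batch*K)  ->  B' = (Batch*K, N)   (Batch merged with K)
    """
    if a_merged_dim == "M":
        return (K, Batch * N)
    return (Batch * K, N)
```

For example, with Batch = 2, K = 8, N = 16, merging M yields a B' of shape (8, 32), while merging K yields (16, 16).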
Fig. 7 illustrates a further detailed schematic diagram of a process of determining a first dimension and a second dimension of a second two-dimensional matrix B' (block 344), according to some embodiments of the invention.
As shown in fig. 7, at block 3442, it may be determined, based on the first dimension (Batch x M or M) of the first two-dimensional matrix A', whether the second dimension M of the first matrix A is merged with the first dimension Batch of the first matrix A. That is, this is equivalent to determining whether the second dimension M or the third dimension K of the first matrix A is merged with the first dimension Batch.
If it is determined that the second dimension M of the first matrix A is merged with the first dimension Batch of the first matrix A (i.e., the first two-dimensional matrix A' is (Batch x M, K)), then at block 3444, the second matrix B may be partitioned according to the first dimension Batch and the third dimension N of the second matrix B, and the second matrix B may be read as the second two-dimensional matrix B' in the order of the second dimension K of the second matrix B. That is, in this case, the second matrix B is directly regarded as a two-dimensional matrix of (K, Batch x N) or (Batch x K, N) for data reading, without stitching the partitioned blocks of the second matrix B, that is, without changing the storage manner of the matrix elements.
On the other hand, if it is determined that the second dimension M of the first matrix A is not merged with the first dimension Batch of the first matrix A (i.e., the first dimension Batch of the first matrix A is merged with the third dimension K), then at block 3446, the second matrix B may be partitioned according to the first dimension Batch and the second dimension K of the second matrix B, and the second matrix B may be read as the second two-dimensional matrix B' according to the third dimension N of the second matrix B.
In the above manner, the second matrix B can be converted into a second two-dimensional matrix B' suitable for the hardware processing capability of the tensor core 210.
Continuing with FIG. 6, at block 346, the second two-dimensional matrix B' may be stored from the vector core 230 to the second memory 240 near the tensor core 210 via a multiple-data store (STM) operation.
At block 348, a second two-dimensional matrix B' may be loaded into the second buffer 216 based on the size of the second buffer 216.
In this way, a three-dimensional input matrix of the matrix multiplication operation can be converted into a suitable two-dimensional matrix, and the loading method of data to the corresponding buffer in the tensor core 210 is improved, so that the buffer space is fully utilized, the data loading times are reduced, and the calculation performance is improved.
A computing device and method for implementing matrix multiplication operations in accordance with the present disclosure have been described above with reference to the accompanying drawings. It will be appreciated by those skilled in the art that the execution of the methods described above is not limited to the order shown in the figures and described above, but may proceed in any other reasonable order. Furthermore, the computing device need not include all of the components shown in the figures; it may include only those components necessary to perform the functions described in this disclosure, and the manner of connecting these components is not limited to the form shown in the figures.
The present disclosure may be implemented as a method, computing device, system, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing aspects of the present disclosure. The computing device may include at least one processor and at least one memory coupled to the at least one processor, which may store instructions for execution by the at least one processor. The instructions, when executed by the at least one processor, may perform the method described above.
In one or more exemplary designs, the functions described in this disclosure may be implemented in hardware, software, firmware, or any combination thereof. For example, if implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The various units of the apparatus disclosed herein may be implemented using discrete hardware components or may be integrally implemented on one hardware component, such as a processor. For example, the various illustrative logical blocks, modules, and circuits described in connection with the disclosure may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
Those of ordinary skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The previous description of the disclosure is provided to enable any person of ordinary skill in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for implementing a matrix multiplication operation, comprising:
converting a first matrix of the matrix multiplication operation into a first two-dimensional matrix via a vector core of a computing device, and loading the first two-dimensional matrix via a tensor core of the computing device into a first buffer for the first matrix in the tensor core, wherein the first matrix is a three-dimensional matrix;
determining whether a second matrix of the matrix multiplication operation is a two-dimensional matrix or a three-dimensional matrix;
in response to the second matrix being a two-dimensional matrix, loading the second matrix via the tensor core into a second buffer for the second matrix in the tensor core;
in response to the second matrix being a three-dimensional matrix, converting the second matrix into a second two-dimensional matrix via the vector core and loading the second two-dimensional matrix via the tensor core into the second buffer for the second matrix in the tensor core; and
performing matrix multiplication on the first two-dimensional matrix and the second matrix or the second two-dimensional matrix in the tensor core.

2. The method of claim 1, wherein converting the first matrix of the matrix multiplication operation into the first two-dimensional matrix and loading it into the first buffer for the first matrix in the tensor core comprises:
loading the first matrix from a first memory outside the tensor core to the vector core through a multiple-data load operation;
determining, in the vector core, a first dimension and a second dimension of the first two-dimensional matrix based on a second dimension and a third dimension of the first matrix;
storing the first two-dimensional matrix from the vector core to a second memory near the tensor core through a multiple-data store operation; and
loading the first two-dimensional matrix into the first buffer based on a size of the first buffer.

3. The method of claim 2, wherein determining, in the vector core, the first dimension and the second dimension of the first two-dimensional matrix based on the second dimension and the third dimension of the first matrix comprises:
determining whether a dimension, among the second dimension and the third dimension of the first matrix, that needs to be merged with a first dimension of the first matrix satisfies a predetermined condition;
in response to determining that the dimension that needs to be merged with the first dimension of the first matrix satisfies the predetermined condition, determining a remainder value of that dimension modulo a hardware multiplication granularity of the tensor core;
determining whether the remainder value is greater than or equal to a predetermined threshold, wherein the predetermined threshold is associated with the hardware multiplication granularity of the tensor core;
in response to determining that the remainder value is greater than or equal to the predetermined threshold, partitioning the first matrix according to the first dimension of the first matrix and the dimension that needs to be merged with the first dimension of the first matrix, and reading the first matrix as the first two-dimensional matrix according to the other of the second dimension and the third dimension of the first matrix; and
in response to determining that the remainder value is less than the predetermined threshold, stitching the first matrix along the first dimension of the first matrix to form the first two-dimensional matrix.

4. The method of claim 2, wherein converting the second matrix into the second two-dimensional matrix and loading it into the second buffer for the second matrix in the tensor core comprises:
loading the second matrix from the first memory outside the tensor core to the vector core through a multiple-data load operation;
determining, in the vector core, a first dimension and a second dimension of the second two-dimensional matrix based on the first dimension or the second dimension of the first two-dimensional matrix;
storing the second two-dimensional matrix from the vector core to the second memory near the tensor core through a multiple-data store operation; and
loading the second two-dimensional matrix into the second buffer based on a size of the second buffer.

5. The method of claim 4, wherein determining, in the vector core, the first dimension and the second dimension of the second two-dimensional matrix based on the first dimension or the second dimension of the first two-dimensional matrix comprises:
determining, based on the first dimension of the first two-dimensional matrix, whether the second dimension of the first matrix has been merged with the first dimension of the first matrix;
if it is determined that the second dimension of the first matrix has been merged with the first dimension of the first matrix, partitioning the second matrix according to the first dimension and the third dimension of the second matrix, and reading the second matrix as the second two-dimensional matrix according to the second dimension of the second matrix; and
if it is determined that the second dimension of the first matrix has not been merged with the first dimension of the first matrix, partitioning the second matrix according to the first dimension and the second dimension of the second matrix, and reading the second matrix as the second two-dimensional matrix according to the third dimension of the second matrix.

6. A computing device, comprising:
a tensor core comprising a general matrix multiplication unit, a first buffer for a first matrix of a matrix multiplication operation, and a second buffer for a second matrix of the matrix multiplication operation, wherein the first matrix is a three-dimensional matrix;
a first memory outside the tensor core, configured to store the first matrix and the second matrix before the matrix multiplication operation;
a vector core configured to convert the first matrix and the second matrix into a first two-dimensional matrix and a second two-dimensional matrix; and
a second memory near the tensor core, configured to store the first two-dimensional matrix and the second two-dimensional matrix,
wherein the tensor core is configured to:
load the first two-dimensional matrix into the first buffer,
load the second matrix into the second buffer when the second matrix is a two-dimensional matrix, or load the second two-dimensional matrix into the second buffer when the second matrix is a three-dimensional matrix, and
perform matrix multiplication on the first two-dimensional matrix and the second matrix or the second two-dimensional matrix.

7. The computing device of claim 6, wherein the vector core is configured to:
load the first matrix from the first memory through a multiple-data load operation;
determine a first dimension and a second dimension of the first two-dimensional matrix based on a second dimension and a third dimension of the first matrix; and
store the first two-dimensional matrix to the second memory through a multiple-data store operation.

8. The computing device of claim 7, wherein the vector core is configured to:
determine whether a dimension, among the second dimension and the third dimension of the first matrix, that needs to be merged with a first dimension of the first matrix satisfies a predetermined condition;
in response to determining that the dimension that needs to be merged with the first dimension of the first matrix satisfies the predetermined condition, determine a remainder value of that dimension modulo a hardware multiplication granularity of the tensor core;
determine whether the remainder value is greater than or equal to a predetermined threshold, wherein the predetermined threshold is associated with the hardware multiplication granularity of the tensor core;
in response to determining that the remainder value is greater than or equal to the predetermined threshold, partition the first matrix according to the first dimension of the first matrix and the dimension that needs to be merged with the first dimension of the first matrix, and read the first matrix as the first two-dimensional matrix according to the other of the second dimension and the third dimension of the first matrix; and
in response to determining that the remainder value is less than the predetermined threshold, stitch the first matrix along the first dimension of the first matrix to form the first two-dimensional matrix.

9. The computing device of claim 6, wherein the vector core is configured to:
load the second matrix from the first memory through a multiple-data load operation;
determine a first dimension and a second dimension of the second two-dimensional matrix based on the first dimension or the second dimension of the first two-dimensional matrix; and
store the second two-dimensional matrix in the second memory through a multiple-data store operation.

10. The computing device of claim 9, wherein the vector core is configured to:
determine, based on the first dimension of the first two-dimensional matrix, whether the second dimension of the first matrix has been merged with the first dimension of the first matrix;
if it is determined that the second dimension of the first matrix has been merged with the first dimension of the first matrix, partition the second matrix according to the first dimension and the third dimension of the second matrix, and read the second matrix as the second two-dimensional matrix according to the second dimension of the second matrix; and
if it is determined that the second dimension of the first matrix has not been merged with the first dimension of the first matrix, partition the second matrix according to the first dimension and the second dimension of the second matrix, and read the second matrix as the second two-dimensional matrix according to the third dimension of the second matrix.
CN202311502624.9A 2023-11-10 2023-11-10 Method and computing device for implementing matrix multiplication operation Active CN117454068B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311502624.9A CN117454068B (en) 2023-11-10 2023-11-10 Method and computing device for implementing matrix multiplication operation


Publications (2)

Publication Number Publication Date
CN117454068A CN117454068A (en) 2024-01-26
CN117454068B true CN117454068B (en) 2025-01-24

Family

ID=89594629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311502624.9A Active CN117454068B (en) 2023-11-10 2023-11-10 Method and computing device for implementing matrix multiplication operation

Country Status (1)

Country Link
CN (1) CN117454068B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118193443A (en) * 2024-03-19 2024-06-14 上海壁仞科技股份有限公司 Data loading method for processor, computing device and medium

Citations (2)

Publication number Priority date Publication date Assignee Title
CN111191784A (en) * 2018-11-14 2020-05-22 辉达公司 Transposed sparse matrix multiplied by dense matrix for neural network training
CN111353126A (en) * 2018-12-20 2020-06-30 卡雷公司 Block matrix multiplication system

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
WO2021119907A1 (en) * 2019-12-16 2021-06-24 Intel Corporation Technology to mininimize negative impact of cache conflicts caused by incompatible leading dimensions in matrix multiplication and convolution kernels without dimension padding
CN111860838B (en) * 2020-07-24 2022-12-20 苏州浪潮智能科技有限公司 A fully connected layer computing method and device for a neural network
US20220075842A1 (en) * 2020-09-04 2022-03-10 Nvidia Corporation Processor and system for automatic fusion of matrix multiplication and reduction operations
US20220391320A1 (en) * 2021-05-24 2022-12-08 Industry-Academic Cooperation Foundation, Yonsei University Operation device of convolutional neural network, operation method of convolutional neural network and computer program stored in a recording medium to execute the method thereof
CN113468469A (en) * 2021-06-02 2021-10-01 北京迈格威科技有限公司 Convolution processing method and device of feature graph executed by computer and electronic equipment
GB2608791B (en) * 2021-06-29 2023-11-22 Imagination Tech Ltd Neural network comprising matrix multiplication
US20230118058A1 (en) * 2021-10-18 2023-04-20 Tencent America LLC Systems and methods for accelerating a neural network using a unified sparse tensor core


Also Published As

Publication number Publication date
CN117454068A (en) 2024-01-26

Similar Documents

Publication Publication Date Title
US11907830B2 (en) Neural network architecture using control logic determining convolution operation sequence
US11449576B2 (en) Convolution operation processing method and related product
CN108304922B (en) Computing device and computing method for neural network computing
US10769749B2 (en) Processor, information processing apparatus, and operation method of processor
EP4510038A2 (en) Neural network comprising matrix multiplication
US20230021204A1 (en) Neural network comprising matrix multiplication
GB2560600A (en) Nueral Network Hardware
CN117454068B (en) Method and computing device for implementing matrix multiplication operation
CN110837483B (en) Tensor dimension transformation method and device
CN117435855B (en) Method for performing convolution operation, electronic device, and storage medium
CN118643005B (en) Computing device, data processing method for computing device, computer readable storage medium and computer program product
US20210082082A1 (en) Data processing method and processing circuit
GB2611521A (en) Neural network accelerator with a configurable pipeline
CN118708153A (en) A hardware accelerator and hardware acceleration method based on three-dimensional number theory transformation
CN119204129A (en) A DNN computing engine
CN119513470A (en) Memory device and operation method thereof
JPH0869407A (en) Processor with data memory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant