CN119249053A

CN119249053A - Computation method, device, equipment and medium

Info

Publication number: CN119249053A
Application number: CN202411412083.5A
Authority: CN
Inventors: 施佳鑫; 谢夏婷; 张辉; 王京
Original assignee: Kunlun Core Beijing Technology Co ltd
Current assignee: Kunlun Core Beijing Technology Co ltd
Priority date: 2024-10-10
Filing date: 2024-10-10
Publication date: 2025-01-03

Abstract

The disclosure provides an operation method, apparatus, device, and medium, relating to the technical field of artificial intelligence and, in particular, to the technical field of chips. The method includes, in response to a need to perform an inverse quantization operation on a first matrix based on a first quantization coefficient and a first offset and to perform a matrix multiplication operation on the dequantized first matrix and a second matrix whose data amount is smaller than that of the first matrix: performing a first operation with an arithmetic logic unit, the first operation including calculating, for each row of the second matrix, a sum of the plurality of data elements of that row; performing a second operation with a matrix operation unit, the second operation including determining a first intermediate result that indicates a product of the first quantization coefficient and the matrix multiplication result of the first matrix and the second matrix; performing a multiplication operation based on the first offset and the sums of the plurality of rows of the second matrix to obtain a second intermediate result; and performing an accumulation operation based on the first intermediate result and the second intermediate result to obtain a target operation result.

Description

Operation method, device, equipment and medium

Technical Field

The present disclosure relates to the field of artificial intelligence technology, and in particular, to the field of chip technology, and more particularly, to an operation method, an apparatus, an electronic device, a computer readable storage medium, and a computer program product.

Background

Artificial intelligence is the discipline of studying how to make a computer mimic certain mental processes and intelligent behaviors of a person (e.g., learning, reasoning, thinking, and planning), and involves both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like. Artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.

With the development of artificial intelligence technology, more and more applications based on it achieve effects far exceeding those of traditional algorithms. Deep learning is both data-intensive and computation-intensive, and it is also a rapidly evolving, iterative class of algorithms. In deep learning algorithms, various operation types are applied to improve the efficiency with which a neural network model processes complex tasks; for example, data elements may be dequantized by an inverse quantization operation.

The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.

Disclosure of Invention

The present disclosure provides an operation method, apparatus, electronic device, computer readable storage medium, and computer program product.

According to an aspect of the present disclosure, there is provided an operation method implemented by an operation apparatus, wherein the operation apparatus includes a matrix operation unit and an arithmetic logic unit. The method includes: in response to receiving a target operation request to perform an inverse quantization operation on a first matrix based on a first quantization coefficient and a first offset and to perform a matrix multiplication operation on the dequantized first matrix and a second matrix, wherein a data amount of the first matrix is larger than a data amount of the second matrix, performing a first operation with the arithmetic logic unit, wherein the first operation includes, for each row of the second matrix, calculating a sum of a plurality of data elements of the row; performing a second operation with the matrix operation unit, wherein the second operation includes determining a first intermediate result based on the first quantization coefficient, the first matrix, and the second matrix, wherein the first intermediate result indicates a product of the first quantization coefficient and the matrix multiplication result of the first matrix and the second matrix; performing a multiplication operation based on the first offset and the sums of the plurality of rows of the second matrix to obtain a second intermediate result; and performing an accumulation operation based on the first intermediate result and the second intermediate result to obtain a target operation result.

According to an aspect of the present disclosure, there is provided an operation apparatus including: a matrix operation unit; an arithmetic logic unit; a first operation unit configured to, in response to receiving a target operation request to perform an inverse quantization operation on a first matrix based on a first quantization coefficient and a first offset and to perform a matrix multiplication operation on the dequantized first matrix and a second matrix, wherein the data amount of the first matrix is larger than the data amount of the second matrix, perform a first operation with the arithmetic logic unit, wherein the arithmetic logic unit is configured to calculate, for each row of the second matrix, a sum of a plurality of data elements of the row; a second operation unit configured to perform a second operation with the matrix operation unit, wherein the matrix operation unit is configured to determine a first intermediate result based on the first quantization coefficient, the first matrix, and the second matrix, wherein the first intermediate result indicates a product of the first quantization coefficient and the matrix multiplication result of the first matrix and the second matrix; a third operation unit configured to perform a multiplication operation based on the first offset and the sums of the plurality of rows of the second matrix to obtain a second intermediate result; and a fourth operation unit configured to perform an accumulation operation based on the first intermediate result and the second intermediate result to obtain a target operation result.

According to an aspect of the present disclosure, there is provided a chip including the operation apparatus as described above.

According to an aspect of the present disclosure, there is provided an electronic device comprising at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the above-described method of operation.

According to an aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the above-described operation method.

According to an aspect of the present disclosure, there is provided a computer program product comprising a computer program, wherein the computer program is capable of implementing the above-described operation method when being executed by a processor.

According to one or more embodiments of the present disclosure, the operation efficiency of performing an inverse quantization operation including an offset and a matrix multiplication operation on a matrix may be improved.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.

FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an exemplary embodiment of the present disclosure;

FIG. 2 illustrates a flow chart of an operation method according to an exemplary embodiment of the present disclosure;

FIG. 3 shows a block diagram of a computing device according to an exemplary embodiment of the present disclosure;

FIG. 4 shows a block diagram of a computing device according to an exemplary embodiment of the present disclosure; and

FIG. 5 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.

The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.

In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable execution of the method of operation.

In some embodiments, server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.

In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.

The user may use client devices 101, 102, 103, 104, 105, and/or 106 to send an operation request or data to be operated on. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.

Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux or Linux-like operating systems (e.g., GOOGLE Chrome OS), or various mobile operating systems, such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, personal digital assistants (PDAs), and the like. Wearable devices may include head mounted displays (such as smart glasses) and other devices. The gaming system may include various handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.

Network 110 may be any of a variety of networks known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an Ethernet-based network, a token ring, a Wide Area Network (WAN), the Internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., Bluetooth, Wi-Fi), and/or any combination of these and/or other networks.

The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture that involves virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.

The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.

In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.

In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. The cloud server is a host product in a cloud computing service system that addresses the drawbacks of difficult management and weak service scalability found in traditional physical host and Virtual Private Server (VPS) services.

The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. Database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. Database 130 may be of different categories. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to commands.

In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.

The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.

In the matrix operation process, in order to reduce the requirements of corresponding hardware computing resources and storage resources of each data element in the matrix, the matrix can be quantized to reduce the data volume of matrix operation, for example, high-precision floating points can be quantized into integers, so that the operation difficulty is reduced, and the hardware resources are saved.

The data quantization process may be implemented based on linear transformation operations. For example, when symmetric quantization is performed on data to be quantized, the quantization operation can be implemented based on the following formula:

q=round(x/scale)

where x is an original data element, q is the quantized data element, scale is the quantization coefficient, and round denotes a rounding operation. The quantization result is obtained by dividing x by scale and rounding, which maps the numerical range of the original data elements to a quantized target numerical range centered on zero. In this case, when an inverse quantization operation needs to be performed on q, the original data element x can be approximately restored by multiplying q by scale.
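The symmetric scheme described above (divide by scale and round to quantize, multiply by scale to restore) can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function names, the value of scale, and the sample data are assumptions, and Python's built-in round (which rounds ties to even) stands in for the unspecified rounding rule.

```python
def quantize_symmetric(x, scale):
    """Quantize one value: q = round(x / scale)."""
    return round(x / scale)


def dequantize_symmetric(q, scale):
    """Approximately restore the original value: x' = q * scale."""
    return q * scale


scale = 0.05                                  # hypothetical quantization coefficient
values = [-1.27, -0.3, 0.0, 0.42, 1.0]        # hypothetical original data elements
quantized = [quantize_symmetric(v, scale) for v in values]
restored = [dequantize_symmetric(q, scale) for q in quantized]

# The restoration error of each element is bounded by half a quantization step.
max_error = max(abs(v - r) for v, r in zip(values, restored))
```

Zero maps to zero, which reflects the target range being centered on zero.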

In order to control the distribution of quantized data more accurately, an offset bias may be introduced in addition to the quantization coefficient, in which case the quantization operation may be implemented based on the following formula:

q=round((x-bias)/scale)

In this case, when an inverse quantization operation needs to be performed on q, a linear transformation based on the quantization coefficient scale and the offset bias is applied as follows:

x’=q×scale+bias
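A minimal sketch of quantization with an offset and of the inverse quantization x'=q×scale+bias is given below. The quantization direction is written as the inverse of that linear transformation, which is an assumption, as are the function names, the parameter values, and the sample data.

```python
def quantize_affine(x, scale, bias):
    """Quantize with an offset, assumed to be q = round((x - bias) / scale)."""
    return round((x - bias) / scale)


def dequantize_affine(q, scale, bias):
    """Inverse quantization: x' = q * scale + bias."""
    return q * scale + bias


scale, bias = 0.01, 2.0                       # hypothetical coefficient and offset
values = [2.0, 2.5, 3.14159, 4.55]            # data concentrated away from zero
quantized = [quantize_affine(v, scale, bias) for v in values]
restored = [dequantize_affine(q, scale, bias) for q in quantized]
max_error = max(abs(v - r) for v, r in zip(values, restored))
```

With these values the offset shifts the data so that the quantized elements fall in the range 0 to 255, which a symmetric scheme with the same scale could not achieve.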

In general, processing matrix data involves a large number of matrix multiplication operations. When a matrix involved in a multiplication operation requires an asymmetric inverse quantization operation, the offset in that operation causes a large amount of overhead, occupies a large amount of hardware resources, and affects operation efficiency.

Based on this, the present disclosure provides an operation method. When an inverse quantization operation including an offset and a matrix multiplication operation need to be performed on a first matrix, the matrix operation unit obtains the matrix multiplication result scaled by the quantization coefficient, while the computation introduced by the offset is split off to the arithmetic logic unit. The row elements of the second matrix are summed first and the multiplication is then performed on the row sums, which reduces the number of multiplication operations, lowers the computation load on the matrix multiplication array, and improves operation efficiency.

Fig. 2 shows a flowchart of an operation method 200 according to an exemplary embodiment of the present disclosure. In this embodiment, the method 200 is implemented by an operation apparatus comprising a matrix operation unit and an arithmetic logic unit. As shown in fig. 2, the method 200 comprises:

Step S210, in response to receiving a target operation request to perform an inverse quantization operation on a first matrix based on a first quantization coefficient and a first offset and to perform a matrix multiplication operation on the dequantized first matrix and a second matrix, wherein the data amount of the first matrix is larger than the data amount of the second matrix, performing a first operation with the arithmetic logic unit, wherein the first operation comprises, for each row of the second matrix, calculating a sum of a plurality of data elements of the row;

Step S220, performing a second operation with the matrix operation unit, wherein the second operation comprises determining a first intermediate result based on the first quantization coefficient, the first matrix, and the second matrix, wherein the first intermediate result indicates the product of the first quantization coefficient and the matrix multiplication result of the first matrix and the second matrix;

Step S230, performing a multiplication operation based on the first offset and the sums of the plurality of rows of the second matrix to obtain a second intermediate result; and

Step S240, performing an accumulation operation based on the first intermediate result and the second intermediate result to obtain a target operation result.
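Steps S210 to S240 can be sketched in plain Python and checked against the straightforward approach of dequantizing the first matrix before multiplying. The sketch assumes a scalar quantization coefficient and a scalar offset, and assumes that the second matrix is the left operand of the product; the function names and sample values are hypothetical, and the two hardware units are only indicated in comments.

```python
def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def reference(X, qW, scale, bias):
    """Baseline: dequantize every element of qW, then multiply."""
    W = [[q * scale + bias for q in row] for row in qW]
    return matmul(X, W)


def split_method(X, qW, scale, bias):
    """Steps S210 to S240 with a scalar scale and a scalar bias (assumed)."""
    row_sums = [sum(row) for row in X]                            # S210: arithmetic logic unit
    first = [[scale * v for v in row] for row in matmul(X, qW)]   # S220: matrix operation unit
    second = [[bias * s for _ in qW[0]] for s in row_sums]        # S230: offset times row sums
    return [[f + g for f, g in zip(f_row, s_row)]                 # S240: accumulation
            for f_row, s_row in zip(first, second)]


X = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]      # second matrix, b x a = 2 x 3
qW = [[3, -1, 4], [0, 2, -5], [7, 1, 1]]     # first matrix before dequantization, a x a
scale, bias = 0.5, 0.25                      # hypothetical coefficient and offset

expected = reference(X, qW, scale, bias)
result = split_method(X, qW, scale, bias)
max_diff = max(abs(r - e) for r_row, e_row in zip(result, expected)
               for r, e in zip(r_row, e_row))
```

The two results agree because each element of the product with the dequantized matrix separates into a term scaled by the quantization coefficient and a term equal to the offset times the corresponding row sum.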

By applying the method 200, when inverse quantization operation and matrix multiplication operation including offset are required to be performed on the first matrix, on the basis of obtaining a matrix multiplication result quantized based on a quantization coefficient by using the matrix operation unit, the calculated amount caused by the offset is split into the arithmetic logic unit, and the number of times of multiplication operation is reduced by summing row elements of the second matrix and then performing multiplication operation on the sum of rows, so that the calculation load of the matrix multiplication operation array is reduced, and the operation efficiency is improved.

By way of example, the computational overhead that can be saved by applying the method 200 will be described below.

In one example, the first matrix has a size of a×a and the second matrix has a size of b×a. If the above method 200 is not applied, a×a inverse quantization operations need to be performed on the first matrix, corresponding to a×a multiplication operations and a×a addition operations. After the dequantized first matrix is obtained, b×a×a multiply-accumulate operations need to be performed in the matrix multiplication operation. It will be appreciated that the data elements of the multiply-accumulate operations in this case are dequantized data elements.

By applying the above-described method 200, the data elements of the b×a×a multiply-accumulate operations in step S220 are no longer the dequantized data elements but the data elements before dequantization. It will be appreciated that the numerical range or data bit width of the data elements before dequantization is smaller than that of the dequantized data elements, so that computational resources can be saved. For example, b×a×a floating point multiply-accumulate operations may be changed in this step to multiply-accumulate operations between floating point numbers and fixed point numbers. On this basis, step S210 involves 2×b×a addition operations and step S230 involves b×a multiplication operations. Since the data amount (a×a) of the first matrix is larger than the data amount (b×a) of the second matrix, the operation amount can be saved by applying the above method.
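The operation counts quoted in this example can be tabulated for concrete sizes. The sketch below simply encodes the counts stated in the text; the function names and the values of a and b are hypothetical.

```python
def extra_ops_without_method(a, b):
    """Dequantizing the a x a first matrix up front: one multiply and one add per element."""
    return {"mul": a * a, "add": a * a}


def extra_ops_with_method(a, b):
    """Counts stated in the text for step S210 (additions) and step S230 (multiplications)."""
    return {"mul": b * a, "add": 2 * b * a}


a, b = 4096, 8                       # hypothetical sizes with b much smaller than a
before = extra_ops_without_method(a, b)
after = extra_ops_with_method(a, b)
mac_count = b * a * a                # multiply-accumulate count, the same in both cases
saving = sum(before.values()) - sum(after.values())
```

In both cases the multiply-accumulate count is unchanged; what differs is the operand type of those operations and the number of additional multiplications and additions around them.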

Meanwhile, by applying the above method 200, the computation overhead related to the quantization coefficient and the computation overhead related to the offset can be distributed to two hardware operation units, namely, a matrix operation unit and an arithmetic logic unit, which can execute operations in parallel, and the operation efficiency is improved by improving the operation parallelism.

With continued reference to the above example, by performing the addition operation in step S210 on the second matrix, the sums of the b rows of the second matrix can be obtained, yielding a vector of length b, that is, the equivalent of a b×1 matrix. On this basis, in step S230, the offset may be expressed as a 1×a matrix, and a b×a second intermediate result is then obtained by matrix multiplication, so that it can conveniently be accumulated with the b×a first intermediate result of step S220.

It will be appreciated that the above-described approach of performing matrix multiplication on the b×1 matrix and the 1×a matrix merely serves to obtain the b×a second intermediate result more conveniently, so that it can be accumulated with the first intermediate result. In some examples, the b×a second intermediate result may be obtained in step S230 by other methods; for example, multiplication may be performed based on the offset and the sums of the b rows of the second matrix, and the multiplication results may then be arranged according to a preset rule to obtain the b×a second intermediate result.
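The shape bookkeeping of step S230 can be illustrated as the product of a b×1 column of row sums and a 1×a offset row. In the sketch below the offset entries are hypothetical and are allowed to differ per column; with a single scalar offset all entries of the row would be equal.

```python
def outer(column, row):
    """Product of a b x 1 column and a 1 x a row, giving a b x a matrix."""
    return [[c * r for r in row] for c in column]


X = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]    # second matrix, b x a = 2 x 3
row_sums = [sum(row) for row in X]          # b x 1 result of step S210
bias_row = [0.25, -0.5, 1.0]                # offset expressed as a 1 x a matrix (hypothetical)
second_intermediate = outer(row_sums, bias_row)

# Same result obtained the long way: X times an a x a matrix whose every row is bias_row.
offset_matrix = [bias_row for _ in range(len(bias_row))]
long_way = [[sum(x * w for x, w in zip(x_row, col)) for col in zip(*offset_matrix)]
            for x_row in X]
```

The second intermediate result has the same b×a shape as the first intermediate result, so the two can be accumulated element by element.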

According to some embodiments, the method 200 further comprises performing a splitting operation on the first matrix and the second matrix to obtain a plurality of sub-matrix groups, each sub-matrix group comprising a first sub-matrix and a second sub-matrix, the plurality of sub-matrix groups including a first sub-matrix group and a second sub-matrix group. In this case, performing the first operation with the arithmetic logic unit in step S210 comprises: in response to determining that the first operation for the first sub-matrix group has been completed, performing the first operation for the second sub-matrix group with the arithmetic logic unit while performing the second operation for the first sub-matrix group with the matrix operation unit. By splitting the original matrices into a plurality of sub-matrix groups in this way, the operations of the matrix operation unit and the operations of the arithmetic logic unit can be interleaved across the sub-matrix groups, which improves the utilization of the hardware units and optimizes operation efficiency through pipelining.

In some examples, the first matrix and the second matrix are split into a sub-matrix group A, a sub-matrix group B, and a sub-matrix group C. In this case, the operation corresponding to each sub-matrix group includes at least a step of performing the first operation on the second sub-matrix with the arithmetic logic unit and a step of performing the second operation on the first sub-matrix and the second sub-matrix with the matrix operation unit. Once the first operation for sub-matrix group A has been completed in the arithmetic logic unit, the first operation for sub-matrix group B can be performed while the matrix operation unit performs the second operation for sub-matrix group A. Arranging the operation flow of sub-matrix groups A, B, and C across the matrix operation unit and the arithmetic logic unit in this way saves operation time and improves operation efficiency.
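The interleaving of sub-matrix groups A, B, and C can be illustrated with a small timeline simulation. This is a scheduling sketch only: the durations are hypothetical, each unit is assumed to process one group at a time, and the second operation for a group is assumed to start after that group's first operation, as in the example above.

```python
def pipelined_timeline(groups, alu_time, matrix_time):
    """Return (group, alu_start, alu_end, matrix_start, matrix_end) for each group."""
    alu_free = 0
    matrix_free = 0
    timeline = []
    for group in groups:
        alu_start = alu_free
        alu_end = alu_start + alu_time
        # The second operation for a group starts once its first operation is
        # done and the matrix operation unit has finished the previous group.
        matrix_start = max(alu_end, matrix_free)
        matrix_end = matrix_start + matrix_time
        timeline.append((group, alu_start, alu_end, matrix_start, matrix_end))
        alu_free, matrix_free = alu_end, matrix_end
    return timeline


groups = ["A", "B", "C"]
alu_time, matrix_time = 1, 3        # hypothetical durations in arbitrary time units
timeline = pipelined_timeline(groups, alu_time, matrix_time)
pipelined_total = timeline[-1][4]
sequential_total = len(groups) * (alu_time + matrix_time)
```

In this simulation the first operation for group B starts at the same moment as the second operation for group A, so the time spent in the arithmetic logic unit for groups B and C is hidden behind the matrix operation unit.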

According to some embodiments, the target operation request is determined as follows: in response to receiving a first operation request to perform an inverse quantization operation on the first matrix based on the first quantization coefficient and the first offset, to perform an inverse quantization operation on a third matrix based on a second quantization coefficient and a second offset, and to perform a matrix multiplication operation on the dequantized first matrix and the dequantized third matrix, and in response to determining that the data amount of the first matrix is greater than the data amount of the third matrix, the inverse quantization operation is performed on the third matrix based on the second quantization coefficient and the second offset to obtain the second matrix, and the target operation request, to perform the inverse quantization operation on the first matrix and the matrix multiplication operation on the dequantized first matrix and the second matrix, is determined. Therefore, when inverse quantization operations need to be performed on both matrices before a matrix multiplication operation, the matrix with the smaller data amount can be dequantized first, so that the optimized operation described in the method 200 can be applied in a targeted manner to the first matrix with the larger data amount, further improving operation efficiency.
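The decision described here, dequantizing the smaller matrix up front and leaving the larger one for the split computation, can be sketched as follows. The request is represented as a plain dictionary, which is purely an illustrative assumption, as are the function names and sample values.

```python
def data_amount(matrix):
    """Number of data elements in a list-of-lists matrix."""
    return len(matrix) * len(matrix[0])


def dequantize(q_matrix, scale, bias):
    """Element-wise inverse quantization: x' = q * scale + bias."""
    return [[q * scale + bias for q in row] for row in q_matrix]


def build_target_request(q_first, first_params, q_third, third_params):
    """Return a target request if the first matrix is the larger one, otherwise None."""
    if data_amount(q_first) <= data_amount(q_third):
        return None
    scale3, bias3 = third_params
    second = dequantize(q_third, scale3, bias3)   # smaller matrix is dequantized now
    return {"first_matrix": q_first,              # stays quantized for steps S210 to S240
            "first_params": first_params,
            "second_matrix": second}


q_first = [[3, -1, 4], [0, 2, -5], [7, 1, 1]]    # larger matrix, 3 x 3
q_third = [[2, 4, -2]]                            # smaller matrix, 1 x 3
request = build_target_request(q_first, (0.5, 0.25), q_third, (0.5, 1.0))
```

When the size relation is reversed the sketch returns None; how such a case is handled is not covered by this illustration.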

According to some embodiments, the first matrix is a weight matrix of a fully connected layer of a neural network model, and the second matrix is an input matrix of the fully connected layer. Therefore, the method can conduct targeted optimization in the matrix multiplication step related to the full connection layer of the neural network model calculation so as to improve the model calculation efficiency.

According to some embodiments, the first matrix is an attention weight matrix of an attention network and the second matrix is a value matrix of the attention network. Therefore, the method can conduct targeted optimization in a matrix multiplication step related to the attention network so as to improve the calculation efficiency of the model.

In some examples, the method 200 described above may be applied to the inference computation of a Transformer model, which in this example includes a fully connected layer and an attention network. The inference computation process may specifically include the following steps:

First, the input sequence X needs to be linearly transformed into a query vector Q, a key vector K, and a value vector V:

Q=X·W_Q, K=X·W_K, V=X·W_V

where W_Q, W_K, and W_V are weight matrices of the fully connected layer. On this basis, nonlinear activation can be performed based on the result of the linear transformation.

In some examples, a progressive quantization mode may be applied for the input sequence X and a channel-by-channel quantization mode may be applied for the weight matrix W to enhance quantization effects.

In this step, the inverse quantization operation of the input sequence X and the weight matrix W corresponds to the following operational formulas, respectively:

X=scale_X*qX+bias_X

W=scale_W*qW+bias_W

where scale_X and scale_W are quantization coefficients, qX and qW are the data elements before inverse quantization, and bias_X and bias_W are offsets. In this example, X is of size B×d_model and W is of size d_model×d_model, where B is the size of the data batch and d_model is the hidden layer dimension of the model. In general, B is much smaller than d_model. In this case, the inverse quantization calculation can be performed on qX first, and Q can then be calculated from qW_Q and the inverse-quantized X by applying the method 200 described above. The calculation can be implemented, for example, based on the following formula:

Q=X·(scale_WQ*qW_Q)+rowsum(X)·bias_WQ

In the formula, rowsum(X) corresponds to the operation of summing each row of the second matrix (the inverse-quantized X) in step S221.
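The identity behind this formula can be checked numerically. The sketch below assumes a scalar scale_WQ and a scalar bias_WQ, and the function names are illustrative only. It compares the fused form, which never materializes the inverse-quantized weight matrix, against the direct approach of inverse-quantizing W and then multiplying.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def fused_dequant_matmul(x, q_w, scale_w, bias_w):
    """X·(scale_W*qW) + rowsum(X)·bias_W, using only the quantized weights."""
    row_sums = [sum(row) for row in x]      # rowsum(X): one value per row of X
    prod = matmul(x, q_w)                   # product on the raw quantized values
    return [[scale_w * prod[i][j] + row_sums[i] * bias_w
             for j in range(len(q_w[0]))] for i in range(len(x))]


def reference_dequant_matmul(x, q_w, scale_w, bias_w):
    """Inverse-quantize every element of W first, then multiply."""
    w = [[scale_w * v + bias_w for v in row] for row in q_w]
    return matmul(x, w)
```

The fused form replaces one multiply-add per element of the large matrix W with one row sum per row of the much smaller matrix X.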

In some examples, the inference calculation process of the attention network may specifically include the steps of:

first, the dot product of the query vector Q and the key vector K needs to be calculated to obtain the similarity score between the two:

scores=Q·K^T

the attention weight S can be further calculated based on the similarity score:

By performing matrix multiplication based on the attention weight S and the value vector V, the output result O of the attention network can be obtained:

O=S·V

In this example, the query vector Q is 1×d in size, and the key vector K and the value vector V are L×d in size. In this case, the above method 200 may be applied to optimize the matrix multiplication operations involved in the similarity score and the output result O, and the operations may be implemented, for example, based on the following formulas:

scores=Q·(scale_K*qK)^T+Q·bias_K^T

O=S·(scale_V*qV)+rowsum(S)*bias_V

In this example, d is the dimension of a head in the attention mechanism and L is the sequence length. Typically, d is much smaller than L, in which case Q and S, which have the smaller data amounts, can be used as the second matrix in the method 200 to save computation.
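A minimal sketch of the two attention formulas follows, under stated assumptions: scale_K, bias_K, scale_V and bias_V are scalars (so Q·bias_K^T reduces to rowsum(Q)*bias_K), and the attention weight S is obtained with a softmax, which is a common choice assumed here rather than taken from the text. All function names are illustrative.

```python
import math


def fused_scores(q, q_k, scale_k, bias_k):
    """scores = Q·(scale_K*qK)^T + rowsum(Q)*bias_K for a 1×d query."""
    q_sum = sum(q)
    return [scale_k * sum(a * b for a, b in zip(q, k_row)) + q_sum * bias_k
            for k_row in q_k]


def softmax(xs):
    """Numerically stable softmax (assumed form of the attention weights)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fused_output(s, q_v, scale_v, bias_v):
    """O = S·(scale_V*qV) + rowsum(S)*bias_V for a 1×L weight row."""
    s_sum = sum(s)
    return [scale_v * sum(s[l] * q_v[l][j] for l in range(len(s))) + s_sum * bias_v
            for j in range(len(q_v[0]))]


def attention_reference(q, q_k, scale_k, bias_k, q_v, scale_v, bias_v):
    """Inverse-quantize K and V in full, then run the same attention steps."""
    k = [[scale_k * e + bias_k for e in row] for row in q_k]
    v = [[scale_v * e + bias_v for e in row] for row in q_v]
    s = softmax([sum(a * b for a, b in zip(q, row)) for row in k])
    return [sum(s[l] * v[l][j] for l in range(len(s))) for j in range(len(v[0]))]
```

Chaining `fused_scores`, `softmax` and `fused_output` should agree with `attention_reference` up to floating point rounding, while touching only the quantized K and V.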

According to some embodiments, the arithmetic logic unit comprises a plurality of adders, and performing the first operation with the arithmetic logic unit in step S202 comprises calculating the sums of a plurality of rows of the second matrix in parallel with the plurality of adders. Therefore, an arithmetic logic unit with a certain degree of operation parallelism can be used to achieve highly concurrent operation, further improving operation efficiency.

In some examples, the matrix to be operated on may be floating point data, and the arithmetic logic unit may include n floating point adders, i.e., it may achieve an operation parallelism of n. On this basis, the arithmetic logic unit may further include a plurality of registers for temporarily storing intermediate results to support the arithmetic logic operation.

In this case, the summation operation of step S221 may be implemented based on the following formula:

rowsum(A)[i]=Σ_j A[i][j]

where A[i][j] corresponds to the data element at position (i, j) in the second matrix A. In this example, each floating point adder of the arithmetic logic unit, together with one register, completes one set of addition computations; specifically, each set of floating point adder and register initializes v0=0. The n sets of floating point adders and registers may read the n data elements A[i][0] in the first operation cycle, each set calculating v0=v0+data_i (data_i=A[i][0]). Similarly, the n data elements A[i][1] can be read in the second operation cycle to calculate v0=v0+data_i (data_i=A[i][1]), and so on, so that the sum of the data elements of each row is recorded in v0.
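The cycle-by-cycle accumulation can be simulated in software as a rough sketch. The function name is hypothetical, the lanes run sequentially here rather than truly in parallel, and processing the rows in passes of n when the matrix has more than n rows is an assumption not stated in the text.

```python
def row_sums_with_adders(a, n):
    """Simulate n adder/register pairs, each accumulating one row per pass."""
    rows, cols = len(a), len(a[0])
    sums = [0.0] * rows
    cycles = 0
    for base in range(0, rows, n):              # each pass covers up to n rows
        lanes = range(base, min(base + n, rows))
        v0 = {i: 0.0 for i in lanes}            # every register starts at v0 = 0
        for j in range(cols):                   # one column is read per cycle
            for i in lanes:                     # conceptually simultaneous lanes
                v0[i] = v0[i] + a[i][j]         # v0 = v0 + data_i, data_i = A[i][j]
            cycles += 1
        for i in lanes:
            sums[i] = v0[i]
    return sums, cycles
```

With a 3×3 matrix and n=2, the simulation needs two passes of three cycles each, so the cycle count scales with the number of columns times the number of passes.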

According to an aspect of the present disclosure, there is also provided an arithmetic device. Fig. 3 shows a block diagram of a computing device 300 according to an exemplary embodiment of the present disclosure. As shown in fig. 3, the apparatus 300 includes:

a matrix operation unit 310;

an arithmetic logic unit 320;

a first operation unit 330 configured to perform a first operation with the arithmetic logic unit in response to receiving a target operation request to perform an inverse quantization operation on a first matrix based on a first quantization coefficient and a first offset and to perform a matrix multiplication operation on the quantized first matrix and a second matrix, the first matrix having a data amount larger than a data amount of the second matrix, wherein the arithmetic logic unit 320 is configured to calculate, for each row of the second matrix, a sum of a plurality of data elements of the row;

a second operation unit 340 configured to perform a second operation using the matrix operation unit, wherein the matrix operation unit 310 is configured to determine a first intermediate result based on the first quantization coefficient, the first matrix, and the second matrix, wherein the first intermediate result indicates a product of a matrix multiplication result of the first matrix and the second matrix and the first quantization coefficient;

a third operation unit 350 configured to perform a multiplication operation based on the first offset and the sums of the plurality of rows of the second matrix to obtain a second intermediate result; and

a fourth operation unit 360 configured to perform an accumulation operation based on the first intermediate result and the second intermediate result to obtain a target operation result.

According to some embodiments, the apparatus 300 further comprises a splitting unit configured to perform a splitting operation on the first matrix and the second matrix to obtain a plurality of sub-matrix groups, each sub-matrix group comprising a first sub-matrix and a second sub-matrix, wherein the first operation unit 330 is configured to, in response to determining that the first operation on a first sub-matrix group is completed, perform the first operation on a second sub-matrix group with the arithmetic logic unit while the matrix operation unit performs the second operation on the first sub-matrix group.

According to some embodiments, the target operation request is determined by a determining unit, the determining unit including: an inverse quantization subunit configured to, in response to receiving a first operation request to perform an inverse quantization operation on a first matrix based on a first quantization coefficient and a first offset, to perform an inverse quantization operation on a third matrix based on a second quantization coefficient and a second offset, and to perform a matrix multiplication operation on the quantized first matrix and the quantized third matrix, and in response to determining that the data amount of the first matrix is larger than the data amount of the third matrix, perform the inverse quantization operation on the third matrix based on the second quantization coefficient and the second offset to obtain the second matrix; and a determining subunit configured to determine the target operation request to perform the inverse quantization operation on the first matrix based on the first quantization coefficient and the first offset and to perform a matrix multiplication operation on the quantized first matrix and the second matrix.

According to some embodiments, the first matrix is a weight matrix of a fully connected layer of a neural network model, and the second matrix is an input matrix of the fully connected layer.

According to some embodiments, the first matrix is an attention weight matrix of an attention network and the second matrix is a value matrix of the attention network.

According to some embodiments, the arithmetic logic unit comprises a plurality of adders, and the first operation unit 330 is configured to calculate the sums of a plurality of rows of the second matrix in parallel with the plurality of adders.

Fig. 4 shows a block diagram of a computing device 400 according to an exemplary embodiment of the present disclosure. As shown in fig. 4, the apparatus 400 includes a matrix operation unit 410, an arithmetic logic unit 420, a first operation unit 430, a second operation unit 440, and a third operation unit 450. In this example, the matrix operation unit 410 may be a calculation array composed of a plurality of arithmetic logic units 411, which may be used to implement two-dimensional matrix multiplication (i.e., the second operation). The arithmetic logic unit 420 may include a plurality of adders 421 to perform the first operation described above. The first, second and third operation units 430, 440 and 450 are used to implement steps S210, S220 and S240 of the method 200 described above. In this example, the operation in step S230 of performing a multiplication operation based on the first offset and the sums of the plurality of rows of the second matrix to obtain the second intermediate result may be implemented by the matrix operation unit 410, that is, by performing a multiplication operation based on a row vector corresponding to the first offset and a column vector formed by the sums of the plurality of rows of the second matrix to obtain the second intermediate result.

In this example, when the method 200 is applied to the fully connected layer matrix multiplication operation described above (i.e., Q=X·(scale_WQ*qW_Q)+rowsum(X)·bias_WQ), the hardware implementation process may include the following steps:

Step S1, the calculation of tmp1=rowsum(X) is completed by the adders 421 of the arithmetic logic unit 420. In this example, tmp1 may be a one-dimensional matrix made up of the sums of the plurality of rows of the X matrix.

Step S2, the calculation of tmp2=tmp1×bias_WQ is completed by the matrix operation unit 410. In this example, bias_WQ may be a one-dimensional matrix derived from the value of the offset.

Step S3, the matrix operation unit 410 completes the calculation of result_d=scale_WQ*(X*qW_Q)+tmp2 to obtain the output result result_d.
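Steps S1 to S3 can be mirrored in software as a rough sketch. The function names are hypothetical, and treating scale_WQ and bias_WQ as vectors with one entry per output column is an assumption made here to match a channel-by-channel quantization of the weights.

```python
def step_s1_rowsum(x):
    """S1 (arithmetic logic unit): tmp1 holds one sum per row of X."""
    return [sum(row) for row in x]


def step_s2_outer(tmp1, bias_wq):
    """S2 (matrix operation unit): column vector tmp1 times row vector bias_WQ."""
    return [[t * b for b in bias_wq] for t in tmp1]


def step_s3_accumulate(x, q_w, scale_wq, tmp2):
    """S3 (matrix operation unit): result_d = scale_WQ*(X*qW_Q) + tmp2."""
    inner, cols = len(q_w), len(q_w[0])
    return [[scale_wq[j] * sum(x[i][k] * q_w[k][j] for k in range(inner)) + tmp2[i][j]
             for j in range(cols)] for i in range(len(x))]
```

For X=[[1, 2], [3, 4]], qW_Q=[[1, 0], [2, 1]], scale_WQ=[0.5, 2.0] and bias_WQ=[1.0, -1.0], the three steps give [[5.5, 1.0], [12.5, 1.0]], the same as inverse-quantizing the weights first and then multiplying.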

As described above, when the data amount is large, the original matrices may be split, and step S1 and steps S2-S3 may be performed in a pipelined manner on the split result, i.e., the arithmetic logic unit 420 and the matrix operation unit 410 operate in parallel, so as to improve operation efficiency.
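The pipelining can be illustrated with a sequential simulation. This sketch rests on several assumptions: the matrices are split along the shared inner dimension, the scale and offset are scalars, and the two units are modeled as two slots of a per-step schedule rather than real concurrent hardware. The names `split_inner` and `pipelined_fc` are hypothetical.

```python
def split_inner(x, q_w, tile):
    """Split X by columns and qW by rows into matching sub-matrix groups."""
    groups = []
    for start in range(0, len(q_w), tile):
        stop = min(start + tile, len(q_w))
        groups.append(([row[start:stop] for row in x], q_w[start:stop]))
    return groups


def pipelined_fc(x, q_w, scale_w, bias_w, tile):
    """At each step the ALU slot handles group t while the matrix slot handles group t-1."""
    groups = split_inner(x, q_w, tile)
    rows, cols = len(x), len(q_w[0])
    result = [[0.0] * cols for _ in range(rows)]
    schedule = []                      # (ALU group index, matrix unit group index) per step
    pending = None                     # row sums waiting for the matrix unit
    for step in range(len(groups) + 1):
        alu_job = step if step < len(groups) else None
        mat_job = pending[0] if pending is not None else None
        schedule.append((alu_job, mat_job))
        if pending is not None:        # S2-S3 on the previous group
            _, sums = pending
            x_sub, w_sub = groups[mat_job]
            for i in range(rows):
                for j in range(cols):
                    dot = sum(x_sub[i][k] * w_sub[k][j] for k in range(len(w_sub)))
                    result[i][j] += scale_w * dot + sums[i] * bias_w
        if alu_job is not None:        # S1 on the current group
            pending = (alu_job, [sum(r) for r in groups[alu_job][0]])
        else:
            pending = None
    return result, schedule
```

For two groups the schedule is [(0, None), (1, 0), (None, 1)]: apart from the first and last step, both units have work in every step.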

According to an aspect of the present disclosure, there is also provided a chip including the arithmetic device 300 as described above.

According to an aspect of the present disclosure, there is also provided an electronic device comprising at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the method of operation described above.

According to an aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the above-described operation method.

According to an aspect of the present disclosure, there is also provided a computer program product, including a computer program, wherein the computer program implements the above-mentioned operation method when being executed by a processor.

Referring to fig. 5, a block diagram of an electronic device 500 that may be a server or a client of the present disclosure, which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 5, the apparatus 500 includes a computing unit 501 that can perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, ROM 502, and RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

The various components in the device 500 are connected to an I/O interface 505, including an input unit 506, an output unit 507, a storage unit 508, and a communication unit 509. The input unit 506 may be any type of device capable of inputting information to the device 500. The input unit 506 may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 507 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 508 may include, but is not limited to, magnetic disks and optical disks. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network such as the Internet and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth(TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.

The computing unit 501 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the respective methods and processes described above, for example, an arithmetic method. For example, in some embodiments, the operational methods may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by computing unit 501, one or more steps of the above-described operation method may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the method of operation in any other suitable way (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), the Internet, and a blockchain network.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.

It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.

While embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the methods, systems, and apparatus described above are merely illustrative embodiments or examples and that the scope of the present disclosure is not limited by these embodiments or examples. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims

1. A computing method implemented by a computing device, wherein the computing device comprises a matrix operation unit and an arithmetic logic unit, and the method comprises:

in response to receiving a target operation request to perform a dequantization operation on a first matrix based on a first quantization coefficient and a first offset and to perform a matrix multiplication operation on the quantized first matrix and a second matrix, wherein the data amount of the first matrix is greater than the data amount of the second matrix,

performing a first operation using the arithmetic logic unit, wherein the first operation includes:

for each row of the second matrix, calculating the sum of multiple data elements of the row;

performing a second operation using the matrix operation unit, wherein the second operation includes:

determining a first intermediate result based on the first quantization coefficient, the first matrix, and the second matrix, wherein the first intermediate result indicates a product of a matrix multiplication result of the first matrix and the second matrix and the first quantization coefficient;

performing a multiplication operation based on the first offset and a sum of a plurality of rows of the second matrix to obtain a second intermediate result; and

performing an accumulation operation based on the first intermediate result and the second intermediate result to obtain a target operation result.

2. The method of claim 1, further comprising:

performing a splitting operation on the first matrix and the second matrix to obtain a plurality of sub-matrix groups, each sub-matrix group including a first sub-matrix and a second sub-matrix,

wherein performing the first operation using the arithmetic logic unit includes:

in response to determining that the first operation on a first sub-matrix group is completed, performing the first operation on a second sub-matrix group using the arithmetic logic unit while the matrix operation unit performs the second operation on the first sub-matrix group.

3. The method according to claim 1 or 2, wherein the target operation request is determined by:

in response to receiving a first operation request to perform a dequantization operation on a first matrix based on a first quantization coefficient and a first offset, to perform a dequantization operation on a third matrix based on a second quantization coefficient and a second offset, and to perform a matrix multiplication operation on the quantized first matrix and the quantized third matrix, and in response to determining that the data amount of the first matrix is greater than the data amount of the third matrix,

performing an inverse quantization operation on the third matrix based on the second quantization coefficient and the second offset to obtain the second matrix; and

determining a target operation request to perform a dequantization operation on the first matrix based on the first quantization coefficient and the first offset and to perform a matrix multiplication operation on the quantized first matrix and the second matrix.

4. The method of claim 3, wherein the first matrix is a weight matrix of a fully connected layer of a neural network model, and the second matrix is an input matrix of the fully connected layer.

5. The method of claim 3, wherein the first matrix is an attention weight matrix of an attention network, and the second matrix is a value matrix of the attention network.

6. The method of any one of claims 1-5, wherein the arithmetic logic unit comprises a plurality of adders, and wherein performing the first operation using the arithmetic logic unit comprises:

calculating the sums of the plurality of rows of the second matrix in parallel using the plurality of adders.

7. A computing device, comprising:

a matrix operation unit;

an arithmetic logic unit;

a first operation unit configured to, in response to receiving a target operation request to perform a dequantization operation on a first matrix based on a first quantization coefficient and a first offset and to perform a matrix multiplication operation on the quantized first matrix and a second matrix, wherein the data amount of the first matrix is greater than the data amount of the second matrix, perform a first operation using the arithmetic logic unit, wherein the arithmetic logic unit is configured to: for each row of the second matrix, calculate the sum of multiple data elements of the row;

a second operation unit configured to perform a second operation using the matrix operation unit, wherein the matrix operation unit is configured to: determine a first intermediate result based on the first quantization coefficient, the first matrix, and the second matrix, wherein the first intermediate result indicates a product of a matrix multiplication result of the first matrix and the second matrix and the first quantization coefficient;

a third operation unit configured to perform a multiplication operation based on the first offset and a sum of a plurality of rows of the second matrix to obtain a second intermediate result; and

a fourth operation unit configured to perform an accumulation operation based on the first intermediate result and the second intermediate result to obtain a target operation result.

8. The apparatus of claim 7, further comprising:

a splitting unit configured to perform a splitting operation on the first matrix and the second matrix to obtain a plurality of sub-matrix groups, each sub-matrix group including a first sub-matrix and a second sub-matrix,

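wherein the first operation unit is configured to: in response to determining that the first operation on a first sub-matrix group is completed, perform the first operation on a second sub-matrix group using the arithmetic logic unit while the matrix operation unit performs the second operation on the first sub-matrix group.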

9. The apparatus according to claim 7 or 8, wherein the target operation request is determined by using the following determination unit, the determination unit comprising:

a dequantization subunit configured to, in response to receiving a first operation request to perform a dequantization operation on a first matrix based on a first quantization coefficient and a first offset, to perform a dequantization operation on a third matrix based on a second quantization coefficient and a second offset, and to perform a matrix multiplication operation on the quantized first matrix and the quantized third matrix, and in response to determining that the data amount of the first matrix is greater than the data amount of the third matrix, to perform a dequantization operation on the third matrix based on the second quantization coefficient and the second offset to obtain the second matrix; and

a determination subunit configured to determine a target operation request for performing a dequantization operation on the first matrix based on the first quantization coefficient and the first offset and performing a matrix multiplication operation on the quantized first matrix and the second matrix.

10. The device of claim 9, wherein the first matrix is a weight matrix of a fully connected layer of a neural network model, and the second matrix is an input matrix of the fully connected layer.

11. The apparatus of claim 9, wherein the first matrix is an attention weight matrix of an attention network, and the second matrix is a value matrix of the attention network.

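12. The apparatus of any one of claims 7 to 11, wherein the arithmetic logic unit comprises a plurality of adders, and wherein the first operation unit is configured to: calculate the sums of the plurality of rows of the second matrix in parallel using the plurality of adders.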

13. A chip comprising the computing device according to any one of claims 7 to 12.

14. An electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions that can be executed by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1 to 6.

15. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the method according to any one of claims 1 to 6.

16. A computer program product, comprising a computer program, wherein the computer program implements the method according to any one of claims 1 to 6 when executed by a processor.