
CN119312839A - Low-complexity Transformer attention module prediction method and device - Google Patents


Info

Publication number
CN119312839A
CN119312839A (application CN202411179921.9A)
Authority
CN
China
Prior art keywords
matrix
shift
attention
result
multiplication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411179921.9A
Other languages
Chinese (zh)
Inventor
胡杨
王辉征
方佳豪
韩慧明
尹首一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202411179921.9A
Publication of CN119312839A
Legal status: Pending

Landscapes

  • Complex Calculations (AREA)

Abstract

The present application relates to a low-complexity Transformer attention module prediction method, apparatus, computer device, computer readable storage medium and computer program product. The method comprises: acquiring a first leading zero count result obtained by leading zero count processing on a first matrix of a neural network model, where the attention weight matrix of the neural network model is the first matrix and the word element matrix input into the neural network model is the second matrix, or the word element matrix input into the neural network model is the first matrix and the attention weight matrix is the second matrix; performing multiplication approximate shift processing on the second matrix based on the first leading zero count result to obtain a query matrix and a key matrix; and determining a prediction result of the attention matrix based on the query matrix and the key matrix, the prediction result being used to characterize the matching degree of context word elements. By adopting the method, the hardware resource overhead can be reduced.

Description

Low-complexity Transformer attention module prediction method and device
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a low-complexity Transformer attention module prediction method, apparatus, computer device, computer readable storage medium, and computer program product.
Background
Neural network models based on the Transformer architecture have achieved great success in natural language processing, thanks to the Attention mechanism's ability to capture contextual relationships. However, as the input sequence grows, the computational and storage complexity of the attention mechanism increases dramatically, because the computation and storage requirements of the attention module are proportional to the square of the sequence length. How to effectively reduce the complexity of the attention module is therefore a focus of current academic research.
Recent studies have found that, thanks to the redundancy naturally present in human language, the sparsity between tokens can be exploited to reduce the computational and storage complexity of the attention module. Specifically, an estimate of the attention matrix may be obtained by some computation of relatively low complexity; the values of each row of the estimated attention matrix are sorted, the K largest values are selected, these K values are considered to have the larger influence on the attention, and in the subsequent computation only the calculations corresponding to these K values are needed. In this way, the computation and storage of attention in the formal calculation can be effectively reduced. Therefore, a low-complexity Transformer attention module prediction method that enables attention matrix estimation is highly desirable.
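To make the top-K screening step concrete, the following minimal sketch (not part of the claimed method; the matrix size, K value, and function name are illustrative assumptions) keeps the K largest entries of each row of an estimated attention matrix:

```python
import numpy as np

def topk_attention_mask(attn_est: np.ndarray, k: int) -> np.ndarray:
    # Keep only the K largest entries of each row of the estimated
    # attention matrix; the formal computation is then restricted
    # to the surviving positions.
    mask = np.zeros_like(attn_est, dtype=bool)
    topk_idx = np.argpartition(attn_est, -k, axis=-1)[:, -k:]
    np.put_along_axis(mask, topk_idx, True, axis=-1)
    return mask

attn_est = np.random.randn(8, 8)            # hypothetical 8x8 estimate
mask = topk_attention_mask(attn_est, k=3)   # 3 surviving entries per row
```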
A related low-complexity Transformer attention module prediction method performs attention matrix prediction using low-bit quantization. Specifically, 4-bit low-precision quantization is performed on the word element matrix corresponding to the text input into the model, 4-bit quantization is performed on the query weight matrix W_Q and the key weight matrix W_K, and 4-bit multiplication is performed after quantization to obtain the corresponding query (Query, Q) matrix and key (Key, K) matrix; the Q matrix and the K matrix are then further quantized to 4 bits, and the attention matrix is obtained after quantization. However, the above method requires an integer multiplier in its hardware implementation, and the hardware overhead of a 4-bit integer multiplier is still large. Therefore, the related low-complexity Transformer attention module prediction method has a large hardware resource overhead.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a low-complexity Transformer attention module prediction method, apparatus, computer device, computer readable storage medium, and computer program product that can reduce hardware resource overhead.
In a first aspect, the present application provides a low-complexity Transformer attention module prediction method, comprising:
acquiring a first leading zero count result obtained by leading zero count processing on a first matrix of a neural network model, where an attention weight matrix of the neural network model is the first matrix and a word element matrix input into the neural network model is a second matrix, or the word element matrix input into the neural network model is the first matrix and the attention weight matrix of the neural network model is the second matrix;
performing multiplication approximate shift processing on the second matrix based on the first leading zero count result to obtain a query matrix and a key matrix;
and determining a prediction result of an attention matrix based on the query matrix and the key matrix, where the prediction result of the attention matrix is used to characterize the matching degree of context word elements.
In one embodiment, the determining the prediction result of the attention matrix based on the query matrix and the key matrix includes:
Performing leading zero counting processing on a third matrix of the neural network model to obtain a second leading zero counting result corresponding to the third matrix, wherein the query matrix is the third matrix, the transpose of the key matrix is a fourth matrix, or the transpose of the key matrix is the third matrix, and the query matrix is the fourth matrix;
and based on the second leading zero count result, performing multiplication approximate shift processing on the fourth matrix to obtain a prediction result of the attention matrix.
In one embodiment, the first matrix is the attention weight matrix, and the obtaining a first leading zero count result obtained by performing leading zero count processing on the first matrix of the neural network model includes:
Performing leading zero counting processing on an attention weight matrix of a neural network model to obtain a first leading zero counting result corresponding to the attention weight matrix;
storing the first leading zero counting result into a target storage unit;
When estimating the attention matrix, the first leading zero count result is read from the target storage unit.
In one embodiment, the second leading zero count result includes the leading zero number corresponding to each first element in the third matrix, and the performing multiplication approximate shift processing on the fourth matrix based on the second leading zero count result to obtain a prediction result of the attention matrix includes:
For each group of multiplication elements in the transposed multiplication process of the query matrix and the key matrix, determining a shift object element corresponding to a second element in the group of multiplication elements according to a first element in the group of multiplication elements, wherein the first element is an element in the third matrix, and the second element is an element in the fourth matrix;
performing shift processing on the shift object element based on the leading zero number corresponding to the first element to obtain a first shift result corresponding to the group of multiplied elements;
performing sign bit expansion on the first shift result to obtain a product corresponding to the group of multiplication elements;
And determining a prediction result of the attention matrix based on the products corresponding to the multiplication elements of the groups.
In one embodiment, the determining, according to the first element in the set of multiplied elements, a shift object element corresponding to the second element in the set of multiplied elements includes:
If the first element in the group of multiplied elements is a negative number, inverting the original code of the second element in the group of multiplied elements according to the bit to obtain a shift object element corresponding to the second element;
and if the first element in the group of multiplied elements is a positive number, taking the original code of the second element in the group of multiplied elements as a shifting object element corresponding to the second element.
In one embodiment, the performing shift processing on the shift target element based on the number of leading zeros corresponding to the first element to obtain a first shift result corresponding to the set of multiplied elements includes:
determining the sign bit of a first shift result corresponding to the group of multiplication elements according to the sign bit of the first element and the sign bit of the second element;
Determining the shift bit number corresponding to the shift object element according to the bit width of the second element and the leading zero number corresponding to the first element;
according to the shift bit number, shifting the shift object element leftwards to obtain the effective bit of the first shift result corresponding to the group of multiplication elements;
And forming a first shift result corresponding to the group of multiplication elements by using the sign bit of the first shift result and the valid bit of the first shift result.
In one embodiment, the second leading zero count result includes the leading zero number corresponding to each first element in the third matrix, and the performing multiplication approximate shift processing on the fourth matrix based on the second leading zero count result to obtain a prediction result of the attention matrix includes:
For each group of multiplication elements in the transposed multiplication process of the query matrix and the key matrix, performing shift processing on a second element in the group of multiplication elements based on the leading zero number corresponding to the first element in the group of multiplication elements to obtain a second shift result corresponding to the group of multiplication elements;
determining an original code corresponding to each attention element in the prediction result of the attention matrix based on the second shift result corresponding to each group of multiplication elements;
and decoding the original codes corresponding to the attention elements to obtain the prediction result of the attention matrix.
In a second aspect, the present application also provides a low-complexity Transformer attention module prediction apparatus, including:
The acquisition module is used for acquiring a first leading zero count result obtained by leading zero count processing of a first matrix of the neural network model, wherein the attention weight matrix of the neural network model is the first matrix, the word element matrix input into the neural network model is the second matrix, or the word element matrix input into the neural network model is the first matrix, and the attention weight matrix of the neural network model is the second matrix;
the processing module is used for carrying out multiplication approximate shift processing on the second matrix based on the first leading zero counting result to obtain a query matrix and a key matrix;
and the determining module is used for determining the prediction result of the attention matrix based on the query matrix and the key matrix, wherein the prediction result of the attention matrix is used for representing the matching degree of the context word elements.
In a third aspect, the present application also provides a computer device comprising a memory storing a computer program and a processor which, when executing the computer program, implements the steps of the first aspect described above.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the first aspect described above.
In a fifth aspect, the application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the first aspect described above.
The low-complexity transducer attention module prediction method, device, computer equipment, computer readable storage medium and computer program product are used for obtaining a first leading zero counting result obtained by leading zero counting on a first matrix of a neural network model, wherein an attention weight matrix of the neural network model is the first matrix, a word element matrix of the neural network model is input to be a second matrix, or a word element matrix of the neural network model is input to be the first matrix, the attention weight matrix of the neural network model is the second matrix, multiplication approximate shift processing is carried out on the second matrix based on the first leading zero counting result to obtain a query matrix and a key matrix, and a prediction result of the attention matrix is determined based on the query matrix and the key matrix, wherein the prediction result of the attention matrix is used for representing the matching degree of context word elements. Therefore, a unilateral leading zero conversion mechanism is provided, only one of the two multipliers is required to be subjected to leading zero detection, and the other multiplier is subjected to multiplication approximate shift processing based on the leading zero detection result, so that multiplication approximation can be realized through shift, and the estimated attention moment array is not required to be realized by adopting an integer multiplier for hardware realization, so that the hardware resource expenditure can be effectively reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other related drawings may be obtained from these drawings by those of ordinary skill in the art without inventive effort.
FIG. 1 is a flow diagram of a low-complexity Transformer attention module prediction method in one embodiment;
FIG. 2 is a flow diagram of the leading zero count processing performed on 0011 (+3) and 1101 (-3), respectively, in one embodiment;
FIG. 3 is a flow diagram of the steps for determining a prediction result of an attention matrix based on a query matrix and a key matrix in one embodiment;
FIG. 4 is a flow diagram of the steps for obtaining a first leading zero count result obtained by leading zero count processing on a first matrix of a neural network model in one embodiment;
FIG. 5 is a flow diagram of an implementation of a low-complexity Transformer attention module prediction method in one embodiment;
FIG. 6 is a flow diagram of the steps for performing multiplication approximate shift processing on a fourth matrix based on a second leading zero count result to obtain a prediction result of an attention matrix in one embodiment;
FIG. 7 is a flow diagram of the step of determining a shift object element corresponding to a second element in a set of multiplied elements according to a first element in the set in one embodiment;
FIG. 8 is a flow diagram of the step of performing shift processing on a shift object element based on the leading zero count corresponding to a first element to obtain a first shift result corresponding to a set of multiplied elements in one embodiment;
FIG. 9 is a schematic diagram of the multiplication approximate shift processing performed on a fourth matrix based on a second leading zero count result in one embodiment;
FIG. 10 is a flow diagram of the step of performing multiplication approximate shift processing on a fourth matrix based on a second leading zero count result to obtain a prediction result of an attention matrix in another embodiment;
FIG. 11 is a flow diagram of a low-complexity Transformer attention module prediction method in another embodiment;
FIG. 12 is a schematic diagram of the multiplication approximate shift processing performed on a fourth matrix based on a second leading zero count result in another embodiment;
FIG. 13 is a block diagram of a low-complexity Transformer attention module prediction apparatus in one embodiment;
FIG. 14 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in fig. 1, a low-complexity Transformer attention module prediction method is provided. This embodiment is illustrated by applying the method to a terminal; it can be understood that the method can also be applied to a server, or to a system comprising the terminal and the server and implemented through interaction between the terminal and the server. The terminal can be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, Internet of Things device or portable wearable device; the Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart vehicle-mounted device, projection device, or the like. The portable wearable device may be a smart watch, smart bracelet, head-mounted device, or the like; the head-mounted device may be a Virtual Reality (VR) device, an Augmented Reality (AR) device, smart glasses, or the like. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. In this embodiment, the method includes the following steps:
Step 101, obtaining a first leading zero counting result obtained by leading zero counting processing on a first matrix of the neural network model.
The attention weight matrix of the neural network model is a first matrix, and the word element matrix input into the neural network model is a second matrix. Or the word element matrix input into the neural network model is a first matrix, and the attention weight matrix of the neural network model is a second matrix.
In the embodiment of the present application, the neural network model is a neural network model based on the Transformer architecture and includes the Attention mechanism. A word element (token) matrix corresponding to a language sequence is input into the attention module of the neural network model to obtain the attention matrix. The attention matrix is used to characterize a measure of the contextual importance of each token with respect to the tokens at other locations in the language sequence, i.e., the matching degree of context word elements; thus, the attention matrix may also be called a context token association matrix or a context token matching degree matrix. Specifically, the word element matrix X is multiplied by the query weight matrix W_Q and the key weight matrix W_K respectively to obtain the corresponding query (Query, Q) matrix and key (Key, K) matrix, and the Q matrix is then multiplied by the transpose of the K matrix to obtain the Attention matrix. This process can be expressed as:
Q = X × W_Q
K = X × W_K
Attention = Q × K^T
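For reference, the exact computation just described can be sketched as follows; this is a generic illustration with hypothetical shapes and names, not the low-complexity method of the present application:

```python
import numpy as np

def attention_scores(X, W_Q, W_K):
    # Exact attention-score computation: Q = X x W_Q, K = X x W_K,
    # Attention = Q x K^T; X is the word element (token) matrix.
    Q = X @ W_Q
    K = X @ W_K
    return Q @ K.T

X = np.random.randn(16, 64)     # 16 tokens, hypothetical model width 64
W_Q = np.random.randn(64, 64)
W_K = np.random.randn(64, 64)
A = attention_scores(X, W_Q, W_K)   # 16 x 16 attention matrix
```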
The attention weight matrix includes the query weight matrix and the key weight matrix. A leading zero is a 0 bit preceding the leftmost 1 of a binary number; for example, for the binary number 00101110 the leading zero count is 2. A leading zero counter (LZC) performs the leading zero count processing, counting the number of 0 bits before the most significant 1 of a binary number, i.e., LZC(00101110) = 2.
In one example, when estimating the attention matrix, the terminal performs a first leading zero count process on a first matrix of the neural network model to obtain a first leading zero count result.
In one example, where the first matrix is an attention weight matrix, the first leading zero count result includes a query weight matrix leading zero count result and a key weight matrix leading zero count result. And the terminal respectively carries out leading zero counting processing on the query weight matrix and the key weight matrix to obtain a leading zero counting result of the query weight matrix and a leading zero counting result of the key weight matrix.
In one example, in the case that the first matrix is a lemma matrix, the terminal performs a leading zero count process on the lemma matrix to obtain a first leading zero count result.
The specific process of leading zero count processing on a matrix is as follows; the leading zero count result of a matrix comprises the leading zero count corresponding to each element in the matrix. For each element in the matrix, if the binary number of the element is positive, the terminal counts the leading zeros of the binary number to obtain the leading zero count corresponding to the element. If the binary number of the element is negative, the terminal first converts the binary number into its original code (sign-magnitude form); the terminal then ignores the most significant sign bit 1 of the original code and counts the leading zeros to obtain the leading zero count corresponding to the element. For example, 0011 (+3) has a leading zero count of 2; 1101 (-3) is first converted to the original code 1011 (-3), the most significant sign bit is ignored, and the leading zero count is 2. A schematic flow chart of the leading zero count processing for 0011 (+3) and 1101 (-3) is shown in fig. 2.
In one example, the terminal performs a leading zero count process on the matrix using a priority encoder or logic gate.
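A minimal software sketch of this leading zero count for signed values, under the sign-magnitude convention described above (the bit width and function name are illustrative assumptions):

```python
def leading_zero_count(value: int, width: int = 4) -> int:
    # Count the 0 bits before the most significant 1 of |value|; the sign
    # bit of the original code is ignored, so LZC(+3) == LZC(-3) == 2.
    magnitude = abs(value)
    if magnitude == 0:
        return width            # an all-zero word is all leading zeros
    return width - magnitude.bit_length()

assert leading_zero_count(3) == 2     # 0011
assert leading_zero_count(-3) == 2    # original code 1011, sign ignored
assert leading_zero_count(0b00101110, width=8) == 2
```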
Step 102, based on the first leading zero count result, performing multiplication approximate shift processing on the second matrix to obtain a query matrix and a key matrix.
In the embodiment of the present application, the second matrix may be a one-hot sequence matrix or a non-one-hot sequence matrix. The multiplication approximate shift processing obtains an approximate result of a binary multiplication by shifting binary numbers, and may or may not include one-hot sequence conversion. A one-hot sequence is a binary vector of 0s and 1s in which exactly one bit is 1, e.g., 00010000. A one-hot sequence converter (OSC) converts a binary number into a one-hot sequence by keeping only its most significant 1, e.g., 00101110 -> 00100000.
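A sketch of the one-hot sequence conversion as defined here, keeping only the most significant 1 (the function name is an illustrative assumption):

```python
def one_hot_convert(value: int) -> int:
    # Keep only the most significant set bit: 00101110 -> 00100000.
    if value == 0:
        return 0
    return 1 << (value.bit_length() - 1)

assert one_hot_convert(0b00101110) == 0b00100000
```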
In one example, where the first matrix is an attention weight matrix, the first leading zero count result includes a query weight matrix leading zero count result and a key weight matrix leading zero count result. The terminal performs multiplication approximate shift processing on the word element matrix based on the leading zero counting result of the query weight matrix to obtain the query matrix. The terminal performs multiplication approximate shift processing on the word element matrix based on the key weight matrix leading zero counting result to obtain a key matrix.
In one example, in the case that the first matrix is a word element matrix, the terminal performs a multiplication approximate shift process on the query weight matrix based on the first leading zero count result to obtain the query matrix. And the terminal performs multiplication approximate shift processing on the key weight matrix based on the first leading zero counting result to obtain the key matrix.
The principle by which the multiplication approximate shift processing realizes the multiplication approximation is as follows. An integer (INT) type binary number x can be expressed mathematically as x = sign × 2^(W - LO - 1) × M, where sign denotes its sign, W denotes the quantization bit width, LO denotes the number of leading zeros, and M denotes the mantissa, whose value lies in the interval [1, 2). Taking the value 3 at 4-bit quantization as an example, 3 = (+1) × 2^(4-2-1) × 1.5; its binary representation is 3 = (0011)b, with leading zero count LO = 2. Thus, the multiplication of two binary numbers can be expressed as follows:
x × y = XOR(S_x, S_y) × 2^(W_x + W_y - (LO_x + LO_y) - 2) × (M_x × M_y)
where S_x and S_y denote the sign bits of x and y; when the two numbers are multiplied, the sign bit of the product is the exclusive OR of the sign bits of the two operands. W_x and W_y denote the quantization bit widths of x and y, LO_x and LO_y denote their leading zero counts, and M_x and M_y denote their mantissas. Since the mantissas lie in [1, 2) and their contribution is relatively small, the multiplication of the two numbers can be approximated by a shift based on the leading zeros of x and y, i.e.
x × y ≈ XOR(S_x, S_y) × 2^(W_x + W_y - (LO_x + LO_y) - 2) × M_x
For example, the calculation 00011000 (24) × 00000110 (6) = 10010000 (144) can be approximated by 00010000 (16) × 00000110 (6) = 01100000 (96). It can be seen that the result 01100000 corresponds to shifting 00000110 left by four bits, and that the number of bits shifted corresponds to the bit width 8 minus the leading zero count of 00011000 (which is 3) minus 1.
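The example above can be reproduced with the following sketch of the shift-based approximation for unsigned operands (the bit width and function name are illustrative assumptions):

```python
def approx_multiply_unsigned(x: int, y: int, width: int = 8) -> int:
    # Replace x by its one-hot form 2^(width - LO_x - 1), i.e. drop its
    # mantissa, so that x * y reduces to a left shift of y.
    lo_x = width - x.bit_length()   # leading zero count of x
    shift = width - lo_x - 1        # e.g. 8 - 3 - 1 = 4 for x = 24
    return y << shift

# 24 * 6 = 144 (10010000) is approximated as 16 * 6 = 96 (01100000)
assert approx_multiply_unsigned(0b00011000, 0b00000110) == 0b01100000
```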
Step 103, determining a prediction result of the attention matrix based on the query matrix and the key matrix.
Wherein the prediction result of the attention matrix is used for representing the matching degree of the context word elements.
In the embodiment of the present application, the prediction result of the attention matrix is an approximation result of the attention matrix, and may be an approximation result of multiplication of the query matrix and the transpose of the key matrix.
In one example, the terminal performs multiplication approximation processing on the query matrix and the transpose of the key matrix to obtain the prediction result of the attention matrix. The multiplication approximation processing obtains an approximate result of a binary multiplication, and may or may not include one-hot sequence conversion; it includes the multiplication approximate shift processing.
In one example, the terminal performs leading zero count processing on a third matrix of the neural network model to obtain a second leading zero count result corresponding to the third matrix. The terminal then performs one-hot sequence conversion on the fourth matrix to obtain a converted fourth matrix, and shifts the converted fourth matrix based on the second leading zero count result to obtain the prediction result of the attention matrix. Here the query matrix is the third matrix and the transpose of the key matrix is the fourth matrix, or the transpose of the key matrix is the third matrix and the query matrix is the fourth matrix.
In the low-complexity Transformer attention module prediction method described above, a first leading zero count result obtained by leading zero counting on a first matrix of the neural network model is acquired, where the attention weight matrix of the neural network model is the first matrix and the word element matrix input into the neural network model is the second matrix, or the word element matrix input into the neural network model is the first matrix and the attention weight matrix is the second matrix; multiplication approximate shift processing is performed on the second matrix based on the first leading zero count result to obtain a query matrix and a key matrix; and a prediction result of the attention matrix, used to characterize the matching degree of context word elements, is determined based on the query matrix and the key matrix. A single-sided leading zero conversion mechanism is thus provided: only one of the two operands of each multiplication needs leading zero detection, and multiplication approximate shift processing is performed on the other operand based on the detection result, so that the multiplication is approximated by a shift. The estimated attention matrix therefore requires no integer multiplier in hardware, which effectively reduces the hardware resource overhead, reduces the chip area, and reduces the power consumption and energy cost during chip operation. Moreover, the object of the multiplication approximate shift processing in this single-sided leading zero conversion mechanism can be the original data without one-hot sequence conversion, so no one-hot sequence converter is needed; the hardware resource overhead and hardware area can thus be further reduced, the chip area further decreased, and the power consumption and energy cost during chip operation further lowered.
In one exemplary embodiment, as shown in FIG. 3, the specific process of determining the predicted outcome of an attention matrix based on a query matrix and a key matrix includes the steps of:
Step 301, performing leading zero counting processing on a third matrix of the neural network model to obtain a second leading zero counting result corresponding to the third matrix.
Wherein the query matrix is a third matrix and the transpose of the key matrix is a fourth matrix. Or the transpose of the key matrix is the third matrix and the query matrix is the fourth matrix.
In the embodiment of the application, the second leading zero counting result is a leading zero counting result obtained by leading zero counting processing on a third matrix of the neural network model.
Step 302, based on the second leading zero count result, performing multiplication approximate shift processing on the fourth matrix to obtain a prediction result of the attention matrix.
In the low-complexity Transformer attention module prediction method described above, leading zero count processing is performed on a third matrix of the neural network model to obtain a second leading zero count result corresponding to the third matrix, and multiplication approximate shift processing is performed on a fourth matrix based on the second leading zero count result to obtain the prediction result of the attention matrix. The transpose multiplication of the query matrix and the key matrix thus also adopts the newly proposed single-sided leading zero conversion mechanism: only one of the two operands needs leading zero detection, and the other operand undergoes multiplication approximate shift processing based on the detection result, so that the multiplication is approximated by a shift, the estimated attention matrix requires no integer multiplier in hardware, and the hardware resource overhead can be effectively reduced.
In an exemplary embodiment, as shown in fig. 4, the first matrix is an attention weight matrix, and the specific process of obtaining the first leading zero count result obtained by leading zero count processing on the first matrix of the neural network model includes the following steps:
Step 401, performing leading zero counting processing on an attention weight matrix of the neural network model to obtain a first leading zero counting result corresponding to the attention weight matrix.
In the embodiment of the application, in the network deployment stage of the neural network model, leading zero counting processing is carried out on the attention weight matrix of the neural network model to obtain a first leading zero counting result corresponding to the attention weight matrix.
Step 402, storing the first leading zero count result in the target storage unit.
In the embodiment of the application, the terminal determines the second bit number, used to store each element of the first leading zero count result, according to the first bit number, i.e. the bit width, of the elements in the attention weight matrix; for example, B = ceil(log2(A + 1)), where B is the second bit number and A is the first bit number. Then, for each decimal element in the first leading zero count result, the terminal converts the element into a binary number with a quantization bit width of the second bit number, obtaining the converted first leading zero count result. The terminal then stores the converted first leading zero count result into a target storage unit. The target storage unit may be a dynamic random access memory (Dynamic Random Access Memory, DRAM).
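The bit-width relation is reconstructed here as B = ceil(log2(A + 1)), an assumption that matches the 8-bit/4-bit example given below; a sketch:

```python
import math

def lzc_storage_bits(weight_width: int) -> int:
    # A weight of bit width A has a leading zero count in 0..A, i.e.
    # A + 1 possible values, so B = ceil(log2(A + 1)) bits suffice.
    return math.ceil(math.log2(weight_width + 1))

assert lzc_storage_bits(8) == 4   # an 8-bit weight's count fits in 4 bits
```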
In step 403, when estimating the attention matrix, the first leading zero count result is read from the target storage unit.
In the low-complexity Transformer attention module prediction method described above, leading zero counting is performed on the attention weight matrix of the neural network model to obtain the first leading zero count result corresponding to the attention weight matrix, the first leading zero count result is stored in the target storage unit, and when the attention matrix is estimated, the first leading zero count result is read from the target storage unit. In neural network inference, once network training is completed, the weights of the network are known; the present application exploits this property by performing leading zero detection on the attention weight matrix in advance and storing the result in the storage unit, which reduces the storage bit width of the data and the bandwidth of data transmission. For example, for the 8-bit binary number 00011001 the leading zero count is 3, which is expressed in binary as 0011; storing 00011001 directly requires 8 bits, whereas storing the leading zero count requires only 4 bits, saving storage capacity as well as the bandwidth needed when reading the data from the storage unit into the chip. Meanwhile, leading zero detection is no longer needed each time the attention matrix is subsequently estimated, which reduces the conversion overhead and further reduces the hardware resource cost.
In one embodiment, a flow diagram of an implementation of the low-complexity Transformer attention module prediction method is shown in fig. 5. In the network deployment stage, the terminal preprocesses the two weight matrices W_Q and W_K, converts each element into its leading zero count, and stores the converted result in the target storage unit DRAM. When estimating the Attention matrix, i.e. in the formal calculation, the terminal directly reads the leading zero counts of the weight matrices W_Q and W_K from the DRAM, performs multiplication approximate shift processing on the input token matrix X to obtain the matrices Q and K, performs leading zero counting on the Q matrix, and performs multiplication approximate shift processing on the K matrix based on the count result to obtain the predicted Attention matrix. In this way, the method exploits the fact that the weights of a neural network are known at inference time and reduces the need for one-hot sequence conversion; the estimation comprises only four overall calculation steps, so very few processing steps are needed to estimate the attention matrix, and the overhead of unnecessary one-hot sequence conversion and decoding modules is saved, thereby effectively reducing the hardware area cost.
In an exemplary embodiment, as shown in fig. 6, the second leading zero count result includes the leading zero number corresponding to each first element in the third matrix, and based on the second leading zero count result, the specific process of performing the multiplicative approximate shift processing on the fourth matrix to obtain the prediction result of the attention matrix includes the following steps:
Step 601, for each set of multiplication elements in the transposed multiplication process of the query matrix and the key matrix, determining a shift object element corresponding to a second element in the set of multiplication elements according to a first element in the set of multiplication elements.
Wherein the first element is an element in the third matrix and the second element is an element in the fourth matrix.
In the embodiment of the application, a set of multiplied elements comprises a first element and a second element. In the case where the query matrix is the third matrix and the transpose of the key matrix is the fourth matrix, the column number of the first element in a set of multiplied elements is the same as the row number of the second element in that set. In the case where the transpose of the key matrix is the third matrix and the query matrix is the fourth matrix, the row number of the first element is the same as the column number of the second element. The shift object element is the object on which the shift processing is performed.
Step 602, performing shift processing on the shift object element based on the leading zero count corresponding to the first element, to obtain a first shift result corresponding to the set of multiplied elements.
In the embodiment of the present application, the first shift result is the result of shifting the shift object element based on the leading zero count corresponding to the first element.
Step 603, performing sign bit expansion on the first shift result to obtain a product corresponding to the set of multiplied elements.
Step 604, determining a prediction result of the attention matrix based on the products corresponding to the multiplied elements of each group.
In the embodiment of the application, for each attention element in the prediction result of the attention matrix, the terminal determines the target sets of multiplied elements corresponding to the attention element according to the row and column numbers of the attention element and the row and column numbers of the first element and the second element in each set of multiplied elements. The terminal then performs first accumulation processing on the products corresponding to the target sets of multiplied elements to obtain the attention element; specifically, the terminal accumulates the products according to the two's complement algorithm. The terminal then assembles the attention elements into the prediction result of the attention matrix. The first accumulation processing is accumulation according to the two's complement algorithm.
In one example, when the query matrix is the third matrix and the transpose of the key matrix is the fourth matrix, the terminal determines a set of multiplied elements in which the row number of the first element equals the row number of the attention element and the column number of the second element equals the column number of the attention element as a target set of multiplied elements for that attention element. When the transpose of the key matrix is the third matrix and the query matrix is the fourth matrix, the terminal determines a set of multiplied elements in which the column number of the first element equals the column number of the attention element and the row number of the second element equals the row number of the attention element as a target set of multiplied elements.
In the low-complexity Transformer attention module prediction method described above, for each set of multiplied elements in the transpose multiplication of the query matrix and the key matrix, a shift object element corresponding to the second element is determined according to the first element, shift processing is performed on the shift object element based on the leading zero count corresponding to the first element to obtain a first shift result, sign bit expansion is performed on the first shift result to obtain the product corresponding to the set of multiplied elements, and the prediction result of the attention matrix is determined based on the products corresponding to the sets of multiplied elements. The shift object element derived from one operand is thus shifted directly according to the leading zero count of the other operand and then sign-extended, so that the multiplication approximation is realized without separate one-hot sequence conversion and decoding, and the hardware resource overhead and hardware area can be further reduced.
In an exemplary embodiment, as shown in fig. 7, the specific process of determining, according to a first element in the set of multiplied elements, a shift object element corresponding to a second element in the set of multiplied elements includes the following steps:
In step 701, if the first element in the set of multiplied elements is a negative number, the original code of the second element in the set of multiplied elements is inverted according to the bits, so as to obtain a shift object element corresponding to the second element.
In the embodiment of the application, if the first element in the group of multiplied elements is a negative number, the terminal performs bit-wise inversion containing sign bits on the original code of the second element in the group of multiplied elements to obtain the shift object element corresponding to the second element.
In step 702, if the first element in the set of multiplied elements is a positive number, the original code of the second element in the set of multiplied elements is used as the shift target element corresponding to the second element.
In the low-complexity Transformer attention module prediction method described above, if the first element in the set of multiplied elements is negative, the original code of the second element is inverted bit by bit to obtain the shift object element corresponding to the second element; if the first element is positive, the original code of the second element itself is used as the shift object element. In this way, no one-hot sequence conversion is needed when performing the multiplication approximation; only the mantissa of one of the two operands is ignored, the error is small, and the accuracy of the attention matrix estimation can be ensured.
In an exemplary embodiment, as shown in fig. 8, the specific process of performing shift processing on the shift object element based on the leading zero count corresponding to the first element to obtain the first shift result corresponding to the set of multiplied elements includes the following steps:
Step 801, determining the sign bit of the first shift result corresponding to the set of multiplied elements according to the sign bit of the first element and the sign bit of the second element.
In the embodiment of the present application, if the sign bit of the first element is the same as the sign bit of the second element, the terminal determines 0 as the sign bit of the first shift result corresponding to the set of multiplication elements. If the sign bit of the first element is different from the sign bit of the second element, the terminal determines that 1 is the sign bit of the first shift result corresponding to the multiplied element.
Step 802, determining the shift bit number corresponding to the shift object element according to the bit width of the second element and the leading zero count corresponding to the first element.
In the embodiment of the application, the terminal subtracts the leading zero count of the first element and 1 from the bit width of the second element to obtain the shift bit number corresponding to the shift object element, i.e., shift bit number = bit width of the second element - leading zero count of the first element - 1.
Step 803, shifting the shift object element leftwards by the shift bit number to obtain the valid bits of the first shift result corresponding to the set of multiplied elements.
In the embodiment of the application, the terminal shifts the shift object element leftwards by the shift bit number to obtain the valid bits of the first shift result corresponding to the set of multiplied elements.
Step 804, the sign bit of the first shift result and the valid bit of the first shift result are formed into the first shift result corresponding to the set of multiplication elements.
In the low-complexity Transformer attention module prediction method described above, the sign bit of the first shift result is determined from the sign bits of the first and second elements, the shift bit number is determined from the bit width of the second element and the leading zero count of the first element, the shift object element is shifted left by the shift bit number to obtain the valid bits of the first shift result, and the sign bit and the valid bits are assembled into the first shift result corresponding to the set of multiplied elements. The shift object element derived from one operand is thus shifted left, the shift amount being the bit width minus the leading zero count of the other operand minus 1, so that the multiplication approximation is realized while only the mantissa of one of the two operands is ignored; the error is small, and the accuracy of the attention matrix estimation can be ensured.
In one embodiment, a schematic diagram of the multiplication approximate shift processing performed on the fourth matrix based on the second leading zero count result is shown in fig. 9. The transpose multiplication of the query matrix and the key matrix is [x0, x1] × [y0, y1]^T = z = [x0×y0 + x1×y1], with x0 = +3, x1 = +4, y0 = -3, y1 = -3; the transpose of the key matrix is the third matrix, the query matrix is the fourth matrix, the first set of multiplied elements is x0 (second element) and y0 (first element), and the second set of multiplied elements is x1 (second element) and y1 (first element). For the first set of multiplied elements, +3 (x0) is expressed in 4 bits as 0011. Because -3 (y0) is negative, the terminal inverts the original code of x0 bit by bit to obtain the shift object element 1100 corresponding to x0. Because x0 is positive and y0 is negative, so that their sign bits differ, the terminal determines 1 as the sign bit of the first shift result corresponding to this set of multiplied elements. The leading zero count LO of -3 (y0) is 2, so the shift bit number is 4-2-1 = 1; the terminal shifts 1100 left by one bit to obtain the valid bits of the first shift result, assembles the sign bit and the valid bits into the first shift result 10001000 corresponding to this set of multiplied elements, and performs sign bit expansion on 10001000 to obtain the product 11111000 (-8). For the second set of multiplied elements, +4 (x1) is expressed in 4 bits as 0100. Because -3 (y1) is negative, the terminal inverts the original code of x1 bit by bit to obtain the shift object element 1011 corresponding to x1. Because the sign bits of x1 and y1 differ, the terminal determines 1 as the sign bit of the first shift result. The leading zero count LO of -3 (y1) is 2, so the shift bit number is 4-2-1 = 1; the terminal shifts 1011 left by one bit to obtain the valid bits of the first shift result, assembles the first shift result 10000110, and performs sign bit expansion on 10000110 to obtain the product 11110110 (-10). The terminal then adds the two products according to the two's complement algorithm to obtain the attention element 11101110 (-18, approximating the exact value -21). The terminal then assembles this attention element into the prediction result of the attention matrix [11101110].
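The per-pair flow of steps 601-604 (and 801-804) can be modeled numerically as in the following sketch, which reproduces the worked example above; the function names are illustrative assumptions, and this simplified model does not capture every operand-sign corner of the hardware encoding:

```python
def original_code(v: int, width: int = 4) -> int:
    # Sign-magnitude ("original code"): sign bit at the MSB, |v| below it.
    return ((1 if v < 0 else 0) << (width - 1)) | abs(v)

def approx_product(x: int, y: int, width: int = 4) -> int:
    # Numeric model of steps 601-604: y contributes only its sign and
    # leading zero count; x supplies the shift object element.
    lo_y = width - abs(y).bit_length()   # LZC of y, sign bit ignored
    shift = width - lo_y - 1             # shift bit number (step 802)
    obj = original_code(x, width)
    if y < 0:                            # step 701: bitwise inversion
        obj = ~obj & ((1 << width) - 1)
    shifted = obj << shift               # step 803: left shift
    total = width + shift
    if shifted & (1 << (total - 1)):     # step 603: sign bit expansion,
        shifted -= 1 << total            # i.e. two's complement reread
    return shifted

products = [approx_product(3, -3), approx_product(4, -3)]
assert products == [-8, -10]     # 11111000 and 11110110 above
assert sum(products) == -18      # attention element 11101110 (exact: -21)
```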
In an exemplary embodiment, as shown in fig. 10, the second leading zero count result includes the leading zero number corresponding to each first element in the third matrix, and based on the second leading zero count result, the specific process of performing the multiplicative approximate shift processing on the fourth matrix to obtain the prediction result of the attention matrix includes the following steps:
Step 1001, for each set of multiplied elements in the transpose multiplication of the query matrix and the key matrix, performing shift processing on the second element in the set based on the leading zero count corresponding to the first element in the set, to obtain a second shift result corresponding to the set of multiplied elements.
Wherein the first element is an element in the third matrix and the second element is an element in the fourth matrix.
In the embodiment of the application, the terminal determines the sign bit of the second shift result corresponding to the set of multiplied elements according to the sign bits of the first element and the second element. The terminal then determines the shift bit number according to the bit width of the second element and the leading zero count of the first element, shifts the second element left by the shift bit number to obtain the valid bits of the second shift result, and assembles the sign bit and the valid bits into the second shift result corresponding to the set of multiplied elements. The second shift result is the result of shifting the second element based on the leading zero count corresponding to the first element. It can be understood that this process is similar to the process by which the terminal shifts the shift object element based on the leading zero count of the first element to obtain the first shift result, i.e., to steps 801-804.
Step 1002, determining an original code corresponding to each attention element in the prediction result of the attention matrix based on the second shift result corresponding to each group of multiplication elements.
In the embodiment of the application, aiming at each attention element in the prediction result of the attention matrix, the terminal determines the target group multiplication element corresponding to the attention element according to the row number and the column number of the attention element and the row number corresponding to the first element and the second element in each group multiplication element. And then, the terminal performs second accumulation processing on the second shift result corresponding to each target group multiplication element to obtain the original code corresponding to the attention element.
Specifically, for each accumulation object in the second accumulation processing, if the accumulation object is negative, the terminal negates the digit on each bit of the accumulation object (each original bit value becomes its negative); if the accumulation object is positive, the terminal keeps the digit on each bit unchanged. The terminal then accumulates the accumulation objects bit by bit to obtain the result of the second accumulation processing.
In step 1003, decoding is performed on the original codes corresponding to the attention elements to obtain the prediction result of the attention matrix.
In the embodiment of the application, the terminal converts the original codes corresponding to the attention elements into two complementary codes respectively to obtain the prediction result of the attention matrix.
Specifically, for each attention element, the terminal performs weighted summation according to the numerical value on each bit (bit) of the original code corresponding to the attention element and the weight corresponding to each bit to obtain the decimal number corresponding to the attention element. Then, the terminal converts the decimal number into a binary number to obtain a binary number corresponding to the attention element. Then, the terminal converts the binary number into a two's complement to obtain a prediction result of the attention matrix.
In the low-complexity Transformer attention module prediction method described above, for each set of multiplied elements in the transpose multiplication of the query matrix and the key matrix, the second element is shifted based on the leading zero count corresponding to the first element to obtain a second shift result, the original code corresponding to each attention element in the prediction result is determined based on the second shift results of the sets of multiplied elements, and the original codes are decoded to obtain the prediction result of the attention matrix. In this way, when two operands are multiplied, the bits of one operand are shifted left directly, the shift amount being the bit width minus the leading zero count of the other operand minus 1, and the result of the shift accumulation is then decoded, so that the multiplication approximation is realized without one-hot sequence conversion and the hardware overhead is further reduced.
In another embodiment, a flow chart of an implementation of the low-complexity Transformer attention module prediction method is shown in fig. 11. In the network deployment stage, the terminal preprocesses the two weight matrices W_Q and W_K, computes the leading zero count of each element, and stores the converted results in a target storage unit (DRAM). When estimating the Attention matrix, i.e. during formal calculation, the terminal reads the leading zero numbers of the weight matrices W_Q and W_K directly from the DRAM, performs shift processing and the second accumulation processing (i.e. shift-accumulate processing) on the input Token matrix X, and decodes the result to obtain the matrices Q and K. The terminal then performs leading zero counting on the Q matrix, performs shift processing and the second accumulation processing on the K matrix based on that count, decodes the result, and converts it into two's complement, thereby obtaining the predicted Attention matrix.
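Reusing approx_mul from the sketch above, the whole flow can be mimicked in software as three shift-accumulate matrix products (a rough analogue of the hardware flow in fig. 11, not the patented implementation; here the leading zero numbers are recomputed on the fly rather than read from DRAM, and decoding is folded into ordinary integer arithmetic):

```python
def transpose(M):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*M)]


def approx_matmul(A, B, width: int = 8):
    """Shift-accumulate matrix product built on approx_mul."""
    return [[sum(approx_mul(A[i][k], B[k][j], width) for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def predict_attention(X, W_Q, W_K, width: int = 8):
    """Estimate the attention matrix Q @ K.T using shifts only."""
    Q = approx_matmul(X, W_Q, width)   # LZCs of W_Q would come from DRAM
    K = approx_matmul(X, W_K, width)   # LZCs of W_K would come from DRAM
    # Q/K entries are wider than the 8-bit inputs, so widen the LZC field.
    return approx_matmul(Q, transpose(K), width=32)
```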
In another embodiment, a schematic diagram of the multiplicative approximate shift processing of the fourth matrix based on the second leading zero count result is shown in fig. 12. The query matrix multiplied by the transpose of the key matrix is [x0, x1] × [y0, y1]^T = z = [x0 × y0 + x1 × y1], where x0 = +3, x1 = +4, y0 = −3, y1 = −3; the transpose of the key matrix is the third matrix and the query matrix is the fourth matrix; the first group of multiplied elements is x0 (second element) and y0 (first element), and the second group is x1 (second element) and y1 (first element). For the first group, the 4-bit binary representation of +3 (x0) is 0011; since x0 is positive and y0 is negative, their sign bits differ, so the terminal sets the sign bit of the second shift result for this group to 1. The leading zero number of −3 (y0) is 2, so the shift bit number is 4 − 2 − 1 = 1; the terminal shifts 0011 left by 1 bit to obtain the valid bits 0110 of the second shift result, and the sign bit together with the valid bits forms the second shift result 10110 for this group. For the second group, the 4-bit representation of +4 (x1) is 0100; since x1 is positive and y1 is negative, their sign bits differ, so the terminal sets the sign bit of the second shift result for this group to 1. The leading zero number of −3 (y1) is 2, so the shift bit number is 4 − 2 − 1 = 1; the terminal shifts 0100 left by 1 bit to obtain the valid bits 1000, and the sign bit together with the valid bits forms the second shift result 11000 for this group. The terminal then performs the second accumulation processing on the two second shift results to obtain the original code [0 0 0 0 −1 −1 −1 0] corresponding to the attention element, and decodes this original code to obtain the attention element 11110010 (the two's complement of −14). Finally, the terminal assembles this attention element into the prediction result of the attention matrix [11110010].
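The fig. 12 numbers can be reproduced with the sketches above (again purely illustrative):

```python
# 4-bit operands, 8-bit accumulator, as in the fig. 12 example.
x, y = [3, 4], [-3, -3]
shifts = [approx_mul(a, b, width=4) for a, b in zip(x, y)]  # [-6, -8]
code = accumulate(shifts, width=8)     # [0, -1, -1, -1, 0, 0, 0, 0] (LSB first)
print(f"{decode(code, width=8):08b}")  # prints 11110010, i.e. -14 (exact: -21)
```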
In one embodiment, the specific process of step 102 may include the following. The first leading zero count result includes the leading zero number corresponding to each third element in the first matrix. For each group of multiplied elements in the multiplication of the word element matrix by the query weight matrix and in the multiplication of the word element matrix by the key weight matrix, the terminal determines the shift object element corresponding to the fourth element in the group according to the third element in the group, where the third element is an element of the first matrix and the fourth element is an element of the second matrix. The terminal then shifts the shift object element based on the leading zero number corresponding to the third element to obtain a third shift result corresponding to the group, performs sign bit extension on the third shift result to obtain the product corresponding to the group, and determines the query matrix and the key matrix based on the products corresponding to the groups of multiplied elements. It will be appreciated that the specific process of the above steps is similar to that of steps 601-604.
The specific process by which the terminal determines the shift object element corresponding to the fourth element in the group of multiplied elements according to the third element in the group may include the following. If the third element in the group is a negative number, the terminal inverts the original code of the fourth element bit by bit to obtain the shift object element corresponding to the fourth element. If the third element in the group is a positive number, the terminal takes the original code of the fourth element as the shift object element corresponding to the fourth element. It will be appreciated that the specific process of the above steps is similar to that of steps 701-702.
The specific process by which the terminal shifts the shift object element based on the leading zero number corresponding to the third element to obtain the third shift result corresponding to the group of multiplied elements may include the following. The terminal determines the sign bit of the third shift result according to the sign bit of the third element and the sign bit of the fourth element. The terminal then determines the shift bit number corresponding to the shift object element according to the bit width of the fourth element and the leading zero number corresponding to the third element, shifts the shift object element to the left by that number of bits to obtain the valid bits of the third shift result, and composes the third shift result corresponding to the group from the sign bit of the third shift result and the valid bits of the third shift result. It will be appreciated that the specific process of the above steps is similar to that of steps 801-804.
The specific process by which the terminal determines the query matrix and the key matrix based on the products corresponding to the groups of multiplied elements may include the following. For each query element in the query matrix, the terminal determines, according to the row and column numbers of the query element and the row and column numbers corresponding to the third and fourth elements in each group, the target groups of multiplied elements corresponding to the query element among the groups in the multiplication of the word element matrix by the query weight matrix. The terminal then performs the first accumulation processing on the products corresponding to these target groups to obtain the query element, and assembles the query elements into the query matrix. Likewise, for each key element in the key matrix, the terminal determines, according to the row and column numbers of the key element and the row and column numbers corresponding to the third and fourth elements in each group, the target groups of multiplied elements corresponding to the key element among the groups in the multiplication of the word element matrix by the key weight matrix, performs the first accumulation processing on the products corresponding to these target groups to obtain the key element, and assembles the key elements into the key matrix. It will be appreciated that both processes are similar to the specific process of step 604.
In another embodiment, the specific process of step 102 may include the following. The first leading zero count result includes the leading zero number corresponding to each third element in the first matrix. For each group of multiplied elements in the multiplication of the word element matrix by the query weight matrix and in the multiplication of the word element matrix by the key weight matrix, the terminal shifts the fourth element in the group based on the leading zero number corresponding to the third element in the group to obtain a fourth shift result corresponding to the group, where the third element is an element of the first matrix and the fourth element is an element of the second matrix. The terminal then determines the original code corresponding to each query element in the query matrix and the original code corresponding to each key element in the key matrix based on the fourth shift results corresponding to the groups. Specifically, for each query element in the query matrix, the terminal determines, according to the row and column numbers of the query element and the row and column numbers corresponding to the third and fourth elements in each group, the target groups of multiplied elements corresponding to the query element among the groups in the multiplication of the word element matrix by the query weight matrix, and performs the second accumulation processing on the fourth shift results corresponding to these target groups to obtain the original code corresponding to the query element. For each key element in the key matrix, the terminal determines, according to the row and column numbers of the key element and the row and column numbers corresponding to the third and fourth elements in each group, the target groups of multiplied elements corresponding to the key element among the groups in the multiplication of the word element matrix by the key weight matrix, and performs the second accumulation processing on the fourth shift results corresponding to these target groups to obtain the original code corresponding to the key element. The terminal then decodes the original codes corresponding to the query elements to obtain the query matrix, and decodes the original codes corresponding to the key elements to obtain the key matrix. It will be appreciated that the specific process of the above steps is similar to that of steps 1001-1003.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and they may be executed in other orders. Moreover, at least some of the steps in those flowcharts may include multiple sub-steps or stages, which are not necessarily executed at the same moment but may be executed at different moments; the execution order of these sub-steps or stages is not necessarily sequential, and they may be executed in turn or alternately with at least some of the other steps, sub-steps, or stages.
Based on the same inventive concept, an embodiment of the application further provides a low-complexity Transformer attention module prediction apparatus for implementing the above low-complexity Transformer attention module prediction method. The implementation of the solution provided by the apparatus is similar to that described in the above method, so for specific limitations in the embodiments of the low-complexity Transformer attention module prediction apparatus provided below, reference may be made to the above limitations on the low-complexity Transformer attention module prediction method, which are not repeated here.
In one exemplary embodiment, as shown in FIG. 13, a low-complexity Transformer attention module prediction apparatus 1300 is provided, comprising an acquisition module 1310, a processing module 1320, and a determination module 1330, wherein:
An obtaining module 1310, configured to obtain a first leading zero count result obtained by performing leading zero count processing on a first matrix of a neural network model, where an attention weight matrix of the neural network model is the first matrix, a word element matrix input to the neural network model is a second matrix, or a word element matrix input to the neural network model is the first matrix, and an attention weight matrix of the neural network model is the second matrix;
a processing module 1320, configured to perform a multiplicative approximate shift process on the second matrix based on the first leading zero count result, to obtain a query matrix and a key matrix;
A determining module 1330, configured to determine a prediction result of an attention matrix based on the query matrix and the key matrix, where the prediction result of the attention matrix is used to characterize a matching degree of the context word elements.
Optionally, the determining module 1330 is specifically configured to:
Performing leading zero counting processing on a third matrix of the neural network model to obtain a second leading zero counting result corresponding to the third matrix, wherein the query matrix is the third matrix, the transpose of the key matrix is a fourth matrix, or the transpose of the key matrix is the third matrix, and the query matrix is the fourth matrix;
and based on the second leading zero counting result, multiplying, approximating and shifting the fourth matrix to obtain a prediction result of the attention matrix.
Optionally, the first matrix is the attention weight matrix, and the obtaining module 1310 is specifically configured to:
Performing leading zero counting processing on an attention weight matrix of a neural network model to obtain a first leading zero counting result corresponding to the attention weight matrix;
storing the first leading zero counting result into a target storage unit;
When estimating the attention matrix, the first leading zero count result is read from the target storage unit.
Optionally, the second leading zero count result includes the leading zero number corresponding to each first element in the third matrix, and the determining module 1330 is specifically configured to:
For each group of multiplication elements in the transposed multiplication process of the query matrix and the key matrix, determining a shift object element corresponding to a second element in the group of multiplication elements according to a first element in the group of multiplication elements, wherein the first element is an element in the third matrix, and the second element is an element in the fourth matrix;
performing shift processing on the shift object element based on the leading zero number corresponding to the first element to obtain a first shift result corresponding to the group of multiplied elements;
performing sign bit expansion on the first shift result to obtain a product corresponding to the group of multiplication elements;
And determining a prediction result of the attention matrix based on the products corresponding to the multiplication elements of the groups.
Optionally, the determining module 1330 is specifically configured to:
If the first element in the group of multiplied elements is a negative number, inverting the original code of the second element in the group of multiplied elements according to the bit to obtain a shift object element corresponding to the second element;
and if the first element in the group of multiplied elements is a positive number, taking the original code of the second element in the group of multiplied elements as a shifting object element corresponding to the second element.
Optionally, the determining module 1330 is specifically configured to:
determining the sign bit of a first shift result corresponding to the group of multiplication elements according to the sign bit of the first element and the sign bit of the second element;
Determining the shift bit number corresponding to the shift object element according to the bit width of the second element and the leading zero number corresponding to the first element;
according to the shift bit number, shifting the shift object element leftwards to obtain the effective bit of the first shift result corresponding to the group of multiplication elements;
And forming a first shift result corresponding to the group of multiplication elements by using the sign bit of the first shift result and the valid bit of the first shift result.
Optionally, the second leading zero count result includes the leading zero number corresponding to each first element in the third matrix, and the determining module 1330 is specifically configured to:
For each group of multiplication elements in the transposed multiplication process of the query matrix and the key matrix, performing shift processing on a second element in the group of multiplication elements based on the leading zero number corresponding to the first element in the group of multiplication elements to obtain a second shift result corresponding to the group of multiplication elements;
determining an original code corresponding to each attention element in the prediction result of the attention matrix based on the second shift result corresponding to each group of multiplication elements;
and decoding the original codes corresponding to the attention elements to obtain the prediction result of the attention matrix.
The various modules in the low-complexity Transformer attention module prediction apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In an exemplary embodiment, a computer device, which may be a terminal, is provided, and its internal structure may be as shown in fig. 14. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized through Wi-Fi, a mobile cellular network, near field communication (NFC), or other technologies. The computer program, when executed by the processor, implements a low-complexity Transformer attention module prediction method. The display unit of the computer device is used to form a visual picture and may be a display screen, a projection device, or a virtual reality imaging device; the display screen may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, a key, a trackball, or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
It will be appreciated by those skilled in the art that the structure shown in fig. 14 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements are applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an exemplary embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program; the processor, when executing the computer program, implements the steps of the method embodiments described above.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are both information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to meet the related regulations.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by a computer program stored on a non-volatile computer-readable storage medium, which, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, database, or other media used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM) or external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, but are not limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, data processing logic units based on quantum computing, artificial intelligence (AI) processors, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of these technical features, they should be considered to be within the scope of this specification.
The foregoing examples illustrate only a few embodiments of the application and are described in detail, but they are not to be construed as limiting the scope of the application. It should be noted that several variations and improvements can be made by those of ordinary skill in the art without departing from the concept of the application, and these all fall within the scope of protection of the application. Accordingly, the scope of protection of the application shall be subject to the appended claims.

Claims (10)

1. A low-complexity Transformer attention module prediction method, characterized in that the method comprises:
obtaining a first leading zero count result obtained by performing leading zero counting on a first matrix of a neural network model, wherein the attention weight matrix of the neural network model is the first matrix and the word element matrix input to the neural network model is a second matrix, or the word element matrix input to the neural network model is the first matrix and the attention weight matrix of the neural network model is the second matrix;
performing multiplicative approximate shift processing on the second matrix based on the first leading zero count result to obtain a query matrix and a key matrix; and
determining a prediction result of an attention matrix based on the query matrix and the key matrix, wherein the prediction result of the attention matrix is used to characterize the matching degree of context word elements.
2. The method according to claim 1, characterized in that determining the prediction result of the attention matrix based on the query matrix and the key matrix comprises:
performing leading zero counting on a third matrix of the neural network model to obtain a second leading zero count result corresponding to the third matrix, wherein the query matrix is the third matrix and the transpose of the key matrix is a fourth matrix, or the transpose of the key matrix is the third matrix and the query matrix is the fourth matrix; and
performing multiplicative approximate shift processing on the fourth matrix based on the second leading zero count result to obtain the prediction result of the attention matrix.
3. The method according to claim 1, characterized in that the first matrix is the attention weight matrix, and obtaining the first leading zero count result obtained by performing leading zero counting on the first matrix of the neural network model comprises:
performing leading zero counting on the attention weight matrix of the neural network model to obtain the first leading zero count result corresponding to the attention weight matrix;
storing the first leading zero count result in a target storage unit; and
reading the first leading zero count result from the target storage unit when estimating the attention matrix.
4. The method according to claim 2, characterized in that the second leading zero count result includes the leading zero number corresponding to each first element in the third matrix, and performing multiplicative approximate shift processing on the fourth matrix based on the second leading zero count result to obtain the prediction result of the attention matrix comprises:
for each group of multiplied elements in the multiplication of the query matrix by the transpose of the key matrix, determining, according to the first element in the group, the shift object element corresponding to the second element in the group, wherein the first element is an element of the third matrix and the second element is an element of the fourth matrix;
shifting the shift object element based on the leading zero number corresponding to the first element to obtain a first shift result corresponding to the group of multiplied elements;
performing sign bit extension on the first shift result to obtain the product corresponding to the group of multiplied elements; and
determining the prediction result of the attention matrix based on the products corresponding to the groups of multiplied elements.
5. The method according to claim 4, characterized in that determining, according to the first element in the group of multiplied elements, the shift object element corresponding to the second element in the group comprises:
if the first element in the group of multiplied elements is a negative number, inverting the original code of the second element bit by bit to obtain the shift object element corresponding to the second element; and
if the first element in the group of multiplied elements is a positive number, taking the original code of the second element as the shift object element corresponding to the second element.
6. The method according to claim 4, characterized in that shifting the shift object element based on the leading zero number corresponding to the first element to obtain the first shift result corresponding to the group of multiplied elements comprises:
determining the sign bit of the first shift result corresponding to the group according to the sign bit of the first element and the sign bit of the second element;
determining the shift bit number corresponding to the shift object element according to the bit width of the second element and the leading zero number corresponding to the first element;
shifting the shift object element to the left by the shift bit number to obtain the valid bits of the first shift result corresponding to the group; and
composing the first shift result corresponding to the group from the sign bit of the first shift result and the valid bits of the first shift result.
7. The method according to claim 2, characterized in that the second leading zero count result includes the leading zero number corresponding to each first element in the third matrix, and performing multiplicative approximate shift processing on the fourth matrix based on the second leading zero count result to obtain the prediction result of the attention matrix comprises:
for each group of multiplied elements in the multiplication of the query matrix by the transpose of the key matrix, shifting the second element in the group based on the leading zero number corresponding to the first element in the group to obtain a second shift result corresponding to the group, wherein the first element is an element of the third matrix and the second element is an element of the fourth matrix;
determining the original code corresponding to each attention element in the prediction result of the attention matrix based on the second shift results corresponding to the groups of multiplied elements; and
decoding the original codes corresponding to the attention elements to obtain the prediction result of the attention matrix.
8. A low-complexity Transformer attention module prediction apparatus, characterized in that the apparatus comprises:
an acquisition module configured to obtain a first leading zero count result obtained by performing leading zero counting on a first matrix of a neural network model, wherein the attention weight matrix of the neural network model is the first matrix and the word element matrix input to the neural network model is a second matrix, or the word element matrix input to the neural network model is the first matrix and the attention weight matrix of the neural network model is the second matrix;
a processing module configured to perform multiplicative approximate shift processing on the second matrix based on the first leading zero count result to obtain a query matrix and a key matrix; and
a determination module configured to determine a prediction result of an attention matrix based on the query matrix and the key matrix, wherein the prediction result of the attention matrix is used to characterize the matching degree of context word elements.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN202411179921.9A 2024-08-27 2024-08-27 Low-complexity Transformer attention module prediction method and device Pending CN119312839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411179921.9A CN119312839A (en) 2024-08-27 2024-08-27 Low-complexity Transformer attention module prediction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411179921.9A CN119312839A (en) 2024-08-27 2024-08-27 Low-complexity Transformer attention module prediction method and device

Publications (1)

Publication Number Publication Date
CN119312839A true CN119312839A (en) 2025-01-14

Family

ID=94183945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411179921.9A Pending CN119312839A (en) 2024-08-27 2024-08-27 Low-complexity Transformer attention module prediction method and device

Country Status (1)

Country Link
CN (1) CN119312839A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination