CN118503605B - Data processing method, computing device and storage medium - Google Patents
- Publication number
- CN118503605B (application CN202410954370.2A)
- Authority
- CN
- China
- Prior art keywords
- process data
- vector
- dimensional
- vectors
- dimensional vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
Abstract
Embodiments of the present invention relate to a data processing method, a computing device, and a storage medium. The method comprises the following steps: determining a three-dimensional vector based on the input sequence, the three-dimensional vector comprising: query vectors, key vectors, and value vectors; grouping at least one of the three-dimensional vectors to divide the three-dimensional vectors into a plurality of three-dimensional vector groupings; performing attention-related calculations using the current kernel and based at least on the current three-dimensional vector group to obtain current process data; and in response to obtaining the current process data, performing an attention-related calculation with a next kernel and based at least on the current process data and the next three-dimensional vector group to obtain the next process data until final data is determined to generate an attention output based on the final data. The invention can realize the attention calculation for the ultra-long input sequence without increasing the number of processors.
Description
Technical Field
Embodiments of the present invention relate generally to the field of data processing and, more particularly, relate to a data processing method, computing device, and storage medium.
Background
Attention mechanisms (Attention Mechanism) are widely used in machine learning fields such as Natural Language Processing (NLP) and Computer Vision (CV) to help models focus on important information when processing complex tasks, thereby improving performance. Specifically, the attention mechanism enables the model, as it processes an input sequence, to dynamically assign a different weight to each element of that sequence, so that the model can concentrate on the information in the input sequence that is relevant to the current task.
However, with the attention mechanism, the length of the input sequence that a model can process is limited by processor memory. In other words, existing data processing methods cannot realize attention computation over ultra-long input sequences without increasing memory.
Disclosure of Invention
In view of the above problems, the present invention provides a data processing method, so that attention calculation for an ultra-long input sequence can be implemented without increasing memory.
According to a first aspect of the present invention, there is provided a data processing method, comprising: determining a three-dimensional vector based on the input sequence, the three-dimensional vector comprising: query vectors, key vectors, and value vectors; grouping at least one of the three-dimensional vectors to divide the three-dimensional vectors into a plurality of three-dimensional vector groupings; performing attention-related calculations using the current kernel and based at least on the current three-dimensional vector group to obtain current process data; and in response to obtaining the current process data, performing an attention-related computation using the next kernel and based at least on the current process data and the next three-dimensional vector group to obtain the next process data until final data is determined to generate an attention output based on the final data, wherein the final data is the process data resulting from performing the attention-related computation based at least on the last three-dimensional vector group.
In some embodiments, grouping at least one of the three-dimensional vectors comprises: grouping at least one of the three-dimensional vectors to obtain a plurality of sub-vectors of that at least one vector; and grouping the three-dimensional vectors based at least on the plurality of sub-vectors to obtain a plurality of three-dimensional vector groupings.
In some embodiments, each three-dimensional vector grouping of the plurality of three-dimensional vector groupings includes at least one sub-vector of each of the three-dimensional vectors that are grouped.
In some embodiments, each sub-vector is included in only one of the plurality of three-dimensional vector groupings.
In some embodiments, the process data includes: first process data, second process data, and third process data. In these embodiments, obtaining current process data includes: maximum value calculation is conducted on the query vector and the key vector in the current three-dimensional vector group so as to obtain first process data; summing up the query vector and the key vector in the current three-dimensional vector group so as to obtain second process data; and performing product calculation on the query vector, the key vector and the value vector in the current three-dimensional vector group so as to obtain third process data.
In some embodiments, the process data includes: the method comprises the steps of obtaining first process data, second process data and third process data, wherein obtaining current process data comprises: maximum value calculation is conducted on the query vector and the key vector in the current three-dimensional vector group so as to obtain first process data; summing up the query vector and the key vector in the current three-dimensional vector group so as to obtain second process data; and performing product calculation on the query vector, the key vector and the value vector in the current three-dimensional vector group so as to obtain third process data.
In some embodiments, obtaining current process data further comprises: configuring original process data in a current kernel, and initializing the original process data to obtain initialized original process data, wherein the initialized original process data comprises: the method comprises the steps of initializing first original process data, initializing second original process data and initializing third original process data; and calculating to obtain current process data based on the initialized original process data.
In some embodiments, deriving the initialized raw process data includes: in response to the current three-dimensional vector grouping being the first three-dimensional vector grouping, making values of the initialized first raw process data, the initialized second raw process data, and the initialized third raw process data all 0; in response to the current three-dimensional vector grouping not being the first three-dimensional vector grouping, values of the initialized first raw process data, the initialized second raw process data, and the initialized third raw process data are made equal to values of the first process data, the second process data, and the third process data, respectively, of the previous process data.
In some embodiments, generating the attention output further comprises: gradients for each of the three-dimensional vectors are calculated for use in generating the attention output.
In some embodiments, computing the gradient of each of the three-dimensional vectors comprises: dividing a plurality of variables obtained by operation based on an input sequence, wherein the plurality of variables comprise: gradient variables obtained by gradient operation based on all three-dimensional vectors of the input sequence, and correlation variables, wherein the correlation variables are determined by predetermined function operation based on products of query vectors and key vectors; and determining a gradient for each of the three-dimensional vectors based on the partitioned plurality of variables.
In some embodiments, partitioning the plurality of variables that are computed based on the input sequence includes: dividing the gradient variables based on the number of sub-vectors of the query vector; and dividing the relevance variable based on the number of sub-vectors of the query vector and the number of sub-vectors of the key vector.
In some embodiments, the correlation variable comprises: a first relevance variable, the first relevance variable determined by performing a normalized exponential function on the product of the query vector and the key vector; and a second correlation variable, the second correlation variable being determined by performing a transposition function based on the first correlation variable.
According to a second aspect of the present invention there is provided a computing device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect of the invention.
According to a third aspect of the present invention there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect of the present invention.
According to a fourth aspect of the present invention there is provided a computer program product tangibly stored on a non-transitory computer-readable medium and comprising machine executable instructions which, when executed, cause a machine to perform the steps in the method of the first aspect of the present invention.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
The above and other features, advantages and aspects of embodiments of the present invention will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements.
Fig. 1 shows a flow chart of a data processing method according to an embodiment of the invention.
Fig. 2A to 2C show schematic diagrams of exemplary embodiments of the data processing method of the present invention.
Fig. 3 shows a flow chart of a data processing method for reverse computation according to an embodiment of the invention.
Fig. 4A to 4C show schematic diagrams of exemplary embodiments of the data processing method of the present invention.
FIG. 5 schematically illustrates a block diagram of a computing device suitable for use in implementing embodiments of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "comprising" and its variations as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
In the field of natural language processing, a common deep learning model architecture is the Transformer, which is based on the Attention mechanism. The attention mechanism allows the model, as it processes text, to focus on information at different positions in the input sequence, helping the model filter out, from a multitude of information, the information that is helpful to it. That is, the core of the attention mechanism is to select the information that is most critical to the current task, so that attention can be concentrated on that information.
According to the basic principle of the attention mechanism, a three-dimensional vector is generally first generated based on an input sequence input into a model, and the three-dimensional vector includes: query vector Q, key vector K, value vector V. On the basis, based on the generated query vector Q, key vector K and value vector V, attention calculation can be performed according to the following formula (1) to calculate corresponding attention weight and attention output;
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V    (1)
Wherein, d_k represents the dimension size of the key vector K. Specifically, the query vector Q and the key vector K may be subjected to, for example, a scaled dot-product calculation to obtain an attention weight matrix, and the value vector V is then weighted and summed based on the attention weight matrix to obtain the attention output.
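As a minimal sketch of formula (1) — the NumPy realization and the function name are illustrative, not the patent's implementation — the computation can be written as:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention per formula (1)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled dot products of Q and K
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weight matrix
    return weights @ V                              # weighted sum of value vectors
```

For instance, if Q is all zeros, every score is equal, the weights are uniform, and each output row is the mean of the rows of V.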
In practice, the aforementioned attention calculations may be performed on a processor such as a Graphics Processing Unit (GPU) or a General-Purpose GPU (GPGPU). Taking the GPU as an example, an existing data processing method for attention computation reads input data from the GPU's High Bandwidth Memory (HBM) into its Static Random Access Memory (SRAM), performs the attention computation in SRAM, and writes the result back to HBM. Typically, this entire computation is done in one kernel of the GPU. Because data must be constantly exchanged between HBM and SRAM, much time is consumed, so the computation is slow and inefficient. Moreover, since GPU SRAM capacity is small and the memory required for attention computation grows quadratically with the input-sequence length N, input sequences exceeding a threshold length are limited by GPU memory and their attention computation cannot be realized.
In summary, the existing data processing method has the following disadvantages: the attention calculation for the very long input sequence cannot be realized without increasing the memory.
In some embodiments, to enable processing of input sequences exceeding a threshold length, the processor memory available for attention calculations may be increased by increasing the number of processors. That is, the attention computation may be split into multiple parts, one per processor (such as a GPU), so that each GPU is responsible for part of the computation, and a final attention output is then derived from the results of all GPUs. However, once the number of GPUs reaches a threshold, the memory limit of each GPU again constrains the attention calculations that can be performed, making it difficult to process still longer input sequences, such as sequences longer than a million tokens.
To at least partially address one or more of the above problems, as well as other potential problems, example embodiments of the present invention provide a data processing scheme. In this scheme, at least one of the three-dimensional vectors determined based on an input sequence is grouped so as to divide the three-dimensional vectors into a plurality of three-dimensional vector groupings; attention-related calculations are performed using the current kernel and based at least on the current three-dimensional vector grouping to obtain current process data; and in response to obtaining the current process data, an attention-related computation is performed using the next kernel and based at least on the current process data and the next three-dimensional vector grouping to obtain the next process data. This enables the attention-related computation to be carried out across multiple kernels of the same processor, so that each processor can handle a longer input sequence, thereby enabling attention computation for ultra-long input sequences without increasing the number of processors.
The data processing scheme of the present invention will be described below with reference to fig. 1 to 4C. Fig. 1 shows a flow chart of a data processing method 100 according to an embodiment of the invention. It should be appreciated that method 100 may also include additional actions not shown and/or may omit actions shown, the scope of the invention being not limited in this respect.
In step 102, a three-dimensional vector is determined based on the input sequence.
With respect to three-dimensional vectors, it means: query vector Q, key vector K, and value vector V.
According to an embodiment of the present invention, an input sequence may be linearly transformed to obtain a query vector Q, a key vector K, and a value vector V corresponding to the input sequence.
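As an illustrative sketch of this step, the linear transformation can be realized with three projection matrices. The matrices W_q, W_k, and W_v are hypothetical learned parameters (random here for illustration), and all names and sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_model, d_k = 8, 16, 16        # sequence length, embedding width, head width (illustrative)
X = rng.normal(size=(N, d_model))  # embedded input sequence

# Hypothetical learned projection matrices (randomly initialized for illustration)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Linear transformations of the input sequence yield Q, K, and V
Q, K, V = X @ W_q, X @ W_k, X @ W_v
```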
At step 104, at least one of the three-dimensional vectors is grouped to divide the three-dimensional vector into a plurality of three-dimensional vector groupings.
By grouping at least one of the three-dimensional vectors, it is meant that at least one of the query vector Q, the key vector K, and the value vector V may be grouped to obtain a plurality of sub-vectors of that vector. For example, in some embodiments, only the query vector Q may be grouped to obtain a plurality of query sub-vectors Qn of the query vector Q. In still other embodiments, the key vector K and the value vector V may be grouped to obtain a plurality of key sub-vectors Kn of the key vector K and a plurality of value sub-vectors Vn of the value vector V. In still other embodiments, the query vector Q, the key vector K, and the value vector V may all be grouped to yield a plurality of query sub-vectors Qn, key sub-vectors Kn, and value sub-vectors Vn.
According to embodiments of the present invention, a grouped vector in the three-dimensional vector may include two or more sub-vectors. For example, when the query vector Q is grouped, in some examples the grouped query vector Q may include two query sub-vectors (such as a first query sub-vector Q1 and a second query sub-vector Q2); in yet another example, the grouped query vector Q may include three query sub-vectors (such as a first query sub-vector Q1, a second query sub-vector Q2, and a third query sub-vector Q3). The invention is not limited in this regard.
Further, according to an embodiment of the present invention, when multiple vectors of the three-dimensional vector are grouped, the number of sub-vectors obtained for each vector may be the same or different. For example, in the example above of grouping the key vector K and the value vector V, the number of resulting key sub-vectors Kn may be the same as the number of value sub-vectors Vn.
Regarding deriving a plurality of three-dimensional vector groupings, it may include: the three-dimensional vectors are grouped based at least on a plurality of sub-vectors obtained by grouping at least one of the three-dimensional vectors to obtain a plurality of three-dimensional vector groupings.
For example, in the example above of grouping only the query vector Q, assume that a first query sub-vector Q1 and a second query sub-vector Q2 are obtained after grouping. In this case, the three-dimensional vectors may be grouped based on the first query sub-vector Q1 and the second query sub-vector Q2 to obtain a first three-dimensional vector grouping and a second three-dimensional vector grouping, where the first grouping includes: the first query sub-vector Q1, the key vector K, and the value vector V; and the second grouping includes: the second query sub-vector Q2, the key vector K, and the value vector V.
For another example, in the example above of grouping the key vector K and the value vector V, assume that a first key sub-vector K1 and a second key sub-vector K2 are obtained by grouping the key vector K, and a first value sub-vector V1 and a second value sub-vector V2 are obtained by grouping the value vector V. In this case, the three-dimensional vectors may be grouped based on the sub-vectors K1, K2, V1, and V2 to obtain two three-dimensional vector groupings, where the first grouping includes: the query vector Q, the first key sub-vector K1, and the first value sub-vector V1; and the second grouping includes: the query vector Q, the second key sub-vector K2, and the second value sub-vector V2.
From the above, in embodiments of the present invention, each three-dimensional vector grouping may include at least one sub-vector of each of the three-dimensional vectors that are grouped, and further, according to some embodiments of the present invention, each sub-vector is included in only one of the plurality of three-dimensional vector groupings.
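The K/V-grouping variant above can be sketched as follows (the function name and shapes are illustrative): each grouping pairs the full query vector Q with one key sub-vector and one value sub-vector, and each sub-vector appears in exactly one grouping.

```python
import numpy as np

def group_kv(Q, K, V, num_groups):
    """Split K and V row-wise into num_groups sub-vectors and pair each
    (Kn, Vn) with the full query vector Q, yielding the three-dimensional
    vector groupings."""
    K_subs = np.array_split(K, num_groups, axis=0)
    V_subs = np.array_split(V, num_groups, axis=0)
    return [(Q, K_n, V_n) for K_n, V_n in zip(K_subs, V_subs)]
```

For instance, with an 8-row K and V and num_groups=2, this yields two groupings whose key and value sub-vectors each hold 4 rows.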
At step 106, attention-related calculations are performed using the current kernel and based at least on the current three-dimensional vector groupings to obtain current process data.
As for the process data, according to an embodiment of the present invention, it may include first process data (Max) calculated based on the maximum value, second process data (Sum) calculated based on the summation, and third process data (Q') calculated based on the vector product.
Specifically, according to an embodiment of the present invention, obtaining current process data may include: maximum value calculation is conducted on the query vector and the key vector in the current three-dimensional vector group so as to obtain first process data; summing up the query vector and the key vector in the current three-dimensional vector group so as to obtain second process data; and performing product calculation on the query vector, the key vector and the value vector in the current three-dimensional vector group so as to obtain third process data.
Taking as an example a current three-dimensional vector grouping that includes the query vector Q, the first key sub-vector K1, and the first value sub-vector V1, according to an embodiment of the present invention, obtaining the process data of the current grouping may include: performing maximum-value calculation on the query vector Q and the first key sub-vector K1 to obtain first process data Max1; summing the query vector Q and the first key sub-vector K1 to obtain second process data Sum1; and performing product calculation on the query vector Q, the first key sub-vector K1, and the first value sub-vector V1 to obtain third process data Q1'. That is, in this example, the set of process data obtained for the grouping including Q, K1, and V1 includes: first process data Max1, calculated based on maximum values over Q and K1; second process data Sum1, calculated based on the summation over Q and K1; and third process data Q1', calculated based on the vector products of Q, K1, and V1.
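The excerpt names the three pieces of process data (Max, Sum, Q') but does not give their exact formulas. The sketch below assumes the standard online-softmax update (as used in blockwise attention methods), which matches the description: a running maximum over scores, a running (rescaled) exponential sum, and a running (rescaled) product with the value sub-vector. All names are illustrative:

```python
import numpy as np

def process_group(Q, K_n, V_n, Max0, Sum0, Qp0):
    """Attention-related computation for one grouping (Q, K_n, V_n).

    Max0, Sum0, Qp0 are the initialized raw process data; the function
    returns the updated first/second/third process data (Max1, Sum1, Q1').
    The formulas are an assumption (standard online-softmax update)."""
    d_k = K_n.shape[-1]
    scores = Q @ K_n.T / np.sqrt(d_k)             # partial attention logits
    Max1 = np.maximum(Max0, scores.max(axis=-1))  # first process data: running max
    scale = np.exp(Max0 - Max1)                   # rescales stats from earlier groupings
    p = np.exp(scores - Max1[:, None])
    Sum1 = Sum0 * scale + p.sum(axis=-1)          # second process data: running sum
    Qp1 = Qp0 * scale[:, None] + p @ V_n          # third process data: running weighted product
    return Max1, Sum1, Qp1
```

Note that with Sum0 and Qp0 zero-initialized, the value of Max0 does not affect the final quotient Qp/Sum, because the softmax normalization cancels any common per-row offset; this is why an all-zero initialization of the raw process data is sound.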
Further, in order to calculate process data of a next three-dimensional vector group on the basis of the process data calculated for the current three-dimensional vector group, according to the inventive concept of the present invention, raw process data may be preconfigured in each core and initialized such that current process parameters may then be calculated based at least on these raw process data and the current three-dimensional vector group in the current core. Specifically, according to an embodiment of the present invention, obtaining current process data may further include: configuring original process data in a current kernel, and initializing the original process data to obtain initialized original process data; and calculating to obtain current process data based on the initialized original process data.
Regarding raw process data, similar to the process data, it includes: first raw process data (Max 0), second raw process data (Sum 0) and third raw process data (Q0').
With respect to the initialized raw process data, it comprises: the method includes initializing first raw process data, initializing second raw process data, and initializing third raw process data.
According to an embodiment of the present invention, in response to the current three-dimensional vector grouping being the first three-dimensional vector grouping, the values of the initialized first raw process data, the initialized second raw process data, and the initialized third raw process data are all 0; in response to the current three-dimensional vector grouping not being the first three-dimensional vector grouping, the values of the initialized first raw process data, the initialized second raw process data, and the initialized third raw process data are made equal to the values of the first process data, the second process data, and the third process data, respectively, of the previous process data.
For example, assuming that the three-dimensional vector grouping in the current kernel is the first of the plurality of three-dimensional vector groupings, the raw process data of the current kernel may be initialized such that the initialized first, second, and third raw process data are all 0, i.e.
Max0 = 0
Sum0 = 0
Q0’ = 0
Assuming that the three-dimensional vector grouping in the current kernel is not the first of the plurality of groupings, that is, attention-related computation for at least one other grouping has already been performed, the raw process data of the current kernel may be initialized based on the most recently calculated process data. If that previous process data includes first process data Max1, second process data Sum1, and third process data Q1', then the first raw process data Max0, the second raw process data Sum0, and the third raw process data Q0' of the current kernel are initialized to the same values, respectively, that is
Max0 = Max1
Sum0 = Sum1
Q0’ = Q1’
In summary, according to the inventive concept of the present invention, a plurality of sub-vectors may be obtained by grouping at least one of three-dimensional vectors, and the division of the three-dimensional vectors may be implemented based on the sub-vectors, so that the three-dimensional vectors may be divided into a plurality of three-dimensional vector groups. On this basis, these three-dimensional vector groups can then be assigned to different computation cores in order to perform attention-related computation for these three-dimensional vector groups in turn, wherein the data communication costs are significantly reduced by loading and initializing process data in each core, so as to avoid loading the complete three-dimensional vector and all intermediate data related to the input sequence in each core.
In response to obtaining the current process data, at step 108, an attention-related calculation is performed using the next kernel, based at least on the current process data and the next three-dimensional vector group, to obtain the next process data.
According to an embodiment of the invention, before the attention-related calculation is performed with the next kernel on the next three-dimensional vector group, the raw process data of the next kernel is initialized with the current process data, as described above, so that the values of its first, second, and third raw process data equal the first, second, and third process data of the current process data, respectively. The attention-related calculation is then performed based on the initialized raw process data and the next three-dimensional vector group, yielding the next process data. This continues until the attention-related calculations for all three-dimensional vector groups are completed and the last set of process data (hereinafter also referred to as "final data") is obtained.
For example, in some embodiments, the three-dimensional vectors determined based on the input sequence are divided into M three-dimensional vector groups (M being an integer greater than 1), including a first group, a second group, and so on up to an Mth group. According to the inventive concept of the present invention, an attention-related calculation may be performed based on the process data Max1, Sum1, and Q1' calculated for the first group and on the second three-dimensional vector group to obtain the process data Max2, Sum2, and Q2'; similarly, an attention-related calculation may be performed based on Max2, Sum2, and Q2' and on the third three-dimensional vector group to obtain Max3, Sum3, and Q3'; and so on, until the final data MaxM, SumM, and QM' are calculated.
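The chaining of process data across the M groups reduces to a simple fold over the groups. In the sketch below, `step` stands in for one kernel's attention-related computation; both names are illustrative assumptions rather than an API from the text:

```python
def run_groups(groups, step, init_state):
    """Thread process data through all M three-dimensional vector
    groups: each kernel starts from the previous kernel's output."""
    state = init_state  # (Max0, Sum0, Q0') of the first kernel
    for group in groups:
        state = step(state, group)
    return state  # final data: (MaxM, SumM, QM')

# toy check with scalar "process data": the step accumulates a running sum
final = run_groups([1, 2, 3], lambda s, g: s + g, 0)
```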
According to an embodiment of the invention, final data may then be determined based on the last set of process data in order to generate the attention output based on the final data.
As described above, the final data is the process data resulting from the attention-related calculation based on the last three-dimensional vector group. According to an embodiment of the present invention, the final data may include first data, second data, and third data, which are, respectively, the first, second, and third process data obtained by performing the attention-related computation based on the last three-dimensional vector group.
Generating an attention output based on the final data may refer to calculating a final result (Final_R) related to the attention output based on the second data SumM and the third data QM' in the final data, according to the following formula (2):
Final_R = QM' / SumM (2)
In summary, the present invention performs attention-related computation by dividing the three-dimensional vectors determined based on an input sequence into a plurality of three-dimensional vector groups and assigning those groups, in sequence, to different cores among a plurality of cores. Process data is introduced into each core: the raw process data of the current core is initialized from the process data calculated in the previous core, and the attention-related calculation is performed with the initialized process data and the current core's three-dimensional vector group. As a result, the final attention result is obtained without extra memory for intermediate data and without an additional merging operation. The invention thus partitions an ultra-long input sequence reasonably and uses multiple cores of the same processor to execute the attention-related computation, so that a single processor can better handle a longer input sequence and attention computation for ultra-long input sequences can be performed without increasing the number of processors.
As described above, data must be read into SRAM for attention-related computation, and SRAM capacity is generally small. Therefore, according to some embodiments of the present invention, each vector in a three-dimensional vector group written into a kernel may be further partitioned (tiled), and the three-dimensional vector group may then be divided into a plurality of subgroups based on the partitioned vectors. On this basis, the process data of all subgroups is computed iteratively, yielding the process data corresponding to the three-dimensional vector group.
For example, in one case, based on a blocking operation on the vectors, a three-dimensional vector group is divided into two subgroups, denoted as a first subgroup and a second subgroup, where each subgroup includes respective three-dimensional vectors q, k, v. For clarity of description, the three-dimensional vectors of the first subgroup are denoted here as q1, k1, v1, and those of the second subgroup as q2, k2, v2. In this example, according to the inventive concept of the present invention, an attention-related calculation may first be performed based on q1, k1, v1, and first intermediate process data corresponding to the first subgroup determined. In response to determining the first intermediate process data, an attention-related calculation is then performed based on q2, k2, v2, and second intermediate process data is calculated based on the first intermediate process data; this second intermediate process data may be regarded as the process data corresponding to the whole three-dimensional vector group. The specific parameter calculations are as described above and are not repeated here.
Exemplary embodiments of the data processing scheme of the present invention will be described below in connection with fig. 2A-2C.
Fig. 2A to 2C show schematic diagrams of exemplary embodiments of the data processing method of the present invention.
As shown in fig. 2A, the three-dimensional vectors determined based on the input sequence include: a query vector Q, a key vector K, and a value vector V. Further, in the embodiment shown in fig. 2A, the key vector K and the value vector V are each divided into two sub-vectors: the key vector K into a first key sub-vector K1 and a second key sub-vector K2, and the value vector V into a first value sub-vector V1 and a second value sub-vector V2. The group consisting of the query vector Q, the first key sub-vector K1, and the first value sub-vector V1 is regarded as the first three-dimensional vector group, and the group consisting of the query vector Q, the second key sub-vector K2, and the second value sub-vector V2 as the second three-dimensional vector group.
As shown in fig. 2B, a first kernel kernel_1 may be utilized to calculate a first set of process data, i.e., Max1, Sum1, and Q1', based on the first three-dimensional vector group (i.e., the query vector Q, the first key sub-vector K1, and the first value sub-vector V1). Specifically, according to an embodiment of the present invention, a set of raw process data including first raw process data Max0, second raw process data Sum0, and third raw process data Q0' may first be configured in kernel_1. Since the three-dimensional vector group to be processed in kernel_1 is the first group, the raw process data is initialized such that Max0 = 0, Sum0 = 0, Q0' = 0. Max1, Sum1, and Q1' are then calculated based on the initialized raw process data Max0, Sum0, and Q0' and the query vector Q, the first key sub-vector K1, and the first value sub-vector V1.
As shown in fig. 2C, a second kernel kernel_2 may be utilized to calculate a second set of process data, i.e., Max2, Sum2, and Q2', based on the second three-dimensional vector group (i.e., the query vector Q, the second key sub-vector K2, and the second value sub-vector V2). Specifically, according to an embodiment of the present invention, a set of raw process data including first raw process data Max0, second raw process data Sum0, and third raw process data Q0' may be configured in kernel_2. Since the group to be processed in kernel_2 is the second three-dimensional vector group rather than the first, the raw process data is initialized with the Max1, Sum1, and Q1' computed in fig. 2B, such that Max0 = Max1, Sum0 = Sum1, Q0' = Q1'. Max2, Sum2, and Q2' are then calculated based on the initialized raw process data Max0, Sum0, and Q0' (equal to Max1, Sum1, and Q1') and the query vector Q, the second key sub-vector K2, and the second value sub-vector V2.
Further, in the example shown in figs. 2A to 2C, the second three-dimensional vector group is the last group, so the second set of process data Max2, Sum2, and Q2' is the final data. According to an embodiment of the invention, a third kernel may then be utilized to generate an attention output based on the final data. Specifically, the final result (Final_R) for generating the attention output may be determined, according to formula (2) above, from the second data (the second process data Sum2) and the third data (the third process data Q2') in the final data, i.e.
Final_R = Q2’ / Sum2
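The two-kernel flow of figs. 2A to 2C can be reproduced numerically. The NumPy sketch below is an illustrative reconstruction, not the patented implementation: the array shapes, the `attention_step` name, and the running-max rescaling step are assumptions chosen so that the final division Q2' / Sum2 matches exact softmax attention:

```python
import numpy as np

def attention_step(q, k_blk, v_blk, max0, sum0, acc0):
    """One kernel's attention-related computation for a (K, V) block.

    max0/sum0/acc0 are the initialized raw process data; returns the
    updated (Max, Sum, Q') triple described in the text."""
    s = q @ k_blk.T                                    # scores, (Lq, Lb)
    max1 = np.maximum(max0, s.max(axis=1, keepdims=True))
    scale = np.exp(max0 - max1)                        # rescale old state
    p = np.exp(s - max1)
    sum1 = sum0 * scale + p.sum(axis=1, keepdims=True) # running denominator
    acc1 = acc0 * scale + p @ v_blk                    # unnormalized output
    return max1, sum1, acc1

rng = np.random.default_rng(0)
Lq, Lk, d = 4, 6, 8
Q = rng.standard_normal((Lq, d))
K = rng.standard_normal((Lk, d))
V = rng.standard_normal((Lk, d))
K1, K2 = K[:3], K[3:]                # first / second key sub-vectors
V1, V2 = V[:3], V[3:]                # first / second value sub-vectors

# kernel_1: raw process data initialized to 0 (first group)
m0 = np.zeros((Lq, 1)); s0 = np.zeros((Lq, 1)); a0 = np.zeros((Lq, d))
m1, s1, a1 = attention_step(Q, K1, V1, m0, s0, a0)
# kernel_2: raw process data initialized from kernel_1's output
m2, s2, a2 = attention_step(Q, K2, V2, m1, s1, a1)

final_r = a2 / s2                    # formula (2): Final_R = Q2' / Sum2
```

Because the accumulators start at zero, the zero initialization of Max0 is harmless: the rescaling factor multiplies zeros in the first kernel, and every later kernel's state is consistently re-expressed relative to the new running maximum, so the result equals full softmax attention over the undivided K and V.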
In general, the foregoing details the forward-computation portion of the attention calculation under the data processing scheme of the present invention. According to the inventive concept, on the basis of a data processing scheme such as the data processing method 100 shown in fig. 1, gradients of the three-dimensional vectors determined based on the input sequence may also be calculated, realizing the backward-computation portion of the attention calculation. In particular, according to embodiments of the present invention, the gradient of each of the three-dimensional vectors may be calculated for use in generating the attention output. This is described in detail below in connection with fig. 3.
Fig. 3 illustrates a flow chart of a data processing method 300 for backward computation according to an embodiment of the invention. It should be appreciated that the method 300 may also include additional actions not shown and/or may omit actions shown; the scope of the invention is not limited in this respect.
At step 302, a plurality of variables calculated based on the input sequence is divided, the plurality of variables including gradient variables and correlation variables.
The gradient variable is a variable obtained by performing gradient operation based on all three-dimensional vectors of the input sequence.
With respect to the correlation variable, it is determined by a predetermined function operation based on the product of the query vector Q and the key vector K, which are two of the three-dimensional vectors determined based on the input sequence. According to an embodiment of the present invention, the correlation variable includes: a first correlation variable, determined by performing a normalized exponential function (also called a Softmax function) on the product of the query vector Q and the key vector K; and a second correlation variable, determined by performing a transpose function based on the first correlation variable.
According to an embodiment of the present invention, dividing the plurality of variables calculated based on the input sequence may include: dividing the gradient variable based on the number of sub-vectors of the query vector Q in the forward calculation; and dividing the correlation variable based on the number of sub-vectors of the query vector Q and the number of sub-vectors of the key vector K in the forward calculation. For example, if in the forward computation the query vector Q is split into two sub-vectors, namely a first query sub-vector Q1 and a second query sub-vector Q2, and the key vector K is split into two sub-vectors, namely a first key sub-vector K1 and a second key sub-vector K2, then the gradient variable is split into two parts and the correlation variable into four parts. In another example, if the query vector Q is not grouped in the forward computation and the key vector K is split into two sub-vectors, the gradient variable is not split and the correlation variable is split into two parts. Similarly, if the query vector Q is split into two sub-vectors and the key vector K into three, the gradient variable is split into two parts and the correlation variable into six parts.
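The splitting rule above (gradient variable follows the query split; correlation variable follows the product of both splits) can be stated as a one-line helper; the function name is an illustrative assumption:

```python
def partition_counts(n_q_subvectors, n_k_subvectors):
    """Number of pieces for the gradient variable and the correlation
    variable: the gradient variable is divided per query sub-vector,
    the correlation variable per (query sub-vector, key sub-vector) pair."""
    return n_q_subvectors, n_q_subvectors * n_k_subvectors
```

The three worked examples in the text correspond to `partition_counts(2, 2)`, `partition_counts(1, 2)`, and `partition_counts(2, 3)`.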
At step 304, a gradient for each of the three-dimensional vectors is determined based on the partitioned plurality of variables.
According to an embodiment of the present invention, intermediate results related to gradients of the same vector may be calculated based on different combinations of the divided variables, respectively, and then at least one of an addition and a connection operation may be performed with respect to the intermediate results to determine the gradient of the vector.
Figs. 4A to 4C show schematic diagrams of exemplary embodiments of the data processing method of the present invention. In this embodiment, in the forward calculation, the query vector Q and the key vector K determined based on the input sequence are each divided into two groups.
Fig. 4A shows a schematic diagram of calculating the gradient of the value vector V.
As shown in fig. 4A, the gradient variable 401A calculated based on the input sequence may be divided into two parts dO_1 and dO_2, and the product of the query vector Q and the key vector K (denoted 402A) may be divided into four parts QK_1, QK_2, QK_3, and QK_4. Intermediate results dV_1, dV_2, dV_3, and dV_4 relating to the gradient of the value vector V can then be calculated based on the divided gradient variable 401A and the divided product 402A.
On this basis, at least one of addition and concatenation operations may be performed on the intermediate results dV_1, dV_2, dV_3, and dV_4 to determine the gradient of the value vector V. For example, dV_1 and dV_2 may be added to obtain a first addition result dV_12; dV_3 and dV_4 may be added to obtain a second addition result dV_34; and a concatenation operation may be performed on dV_12 and dV_34, thereby obtaining the gradient dV of the value vector V.
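The dV computation of fig. 4A can be checked on a toy example. In the sketch below, P stands in for the softmax weights (the result of applying Softmax to QK^T), the sizes are illustrative, and the mapping of the four block products onto dV_1 through dV_4 is an assumption about the figure's labeling:

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 4, 2                       # toy sizes, chosen for illustration
dO = rng.standard_normal((L, d))  # gradient w.r.t. the attention output
P = rng.standard_normal((L, L))   # stand-in for the softmax weights

# split dO over query rows, and P into four (queries x keys) blocks
dO_1, dO_2 = dO[:2], dO[2:]
P_11, P_12 = P[:2, :2], P[:2, 2:]
P_21, P_22 = P[2:, :2], P[2:, 2:]

# intermediate results (naming assumed to follow fig. 4A): dV = P^T @ dO
dV_1 = P_11.T @ dO_1
dV_2 = P_21.T @ dO_2
dV_3 = P_12.T @ dO_1
dV_4 = P_22.T @ dO_2

dV_12 = dV_1 + dV_2       # first addition result
dV_34 = dV_3 + dV_4       # second addition result
dV = np.concatenate([dV_12, dV_34], axis=0)  # connection operation
```

Additions combine contributions from the two query splits to the same key rows, while the concatenation stacks the two key splits back into the full gradient, so the result equals the unpartitioned P.T @ dO.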
Fig. 4B shows a schematic diagram of calculating the gradient of the key vector K.
As shown in fig. 4B, the query vector Q (denoted 401B) is divided into two parts Q_1 and Q_2, and the correlation variable 402B, determined by performing the Softmax function on the product of the query vector Q and the key vector K, is divided into four parts dS_1, dS_2, dS_3, and dS_4. Intermediate results dK_1, dK_2, dK_3, and dK_4 relating to the gradient of the key vector K can then be calculated based on the divided query vector Q and the divided correlation variable 402B.
On this basis, at least one of addition and concatenation operations may be performed on the intermediate results dK_1, dK_2, dK_3, and dK_4 to determine the gradient of the key vector K. For example, dK_1 and dK_2 may be added to obtain a first addition result dK_12; dK_3 and dK_4 may be added to obtain a second addition result dK_34; and a concatenation operation may be performed on dK_12 and dK_34, thereby obtaining the gradient dK of the key vector K.
Fig. 4C shows a schematic diagram of calculating the gradient of the query vector Q.
As shown in fig. 4C, the key vector K (denoted 401C) is divided into two parts K_1 and K_2, and the correlation variable 402C, determined by performing the Softmax function on the product of the query vector Q and the key vector K followed by a transpose function, is divided into four parts dS'_1, dS'_2, dS'_3, and dS'_4. Intermediate results dQ_1, dQ_2, dQ_3, and dQ_4 relating to the gradient of the query vector Q can then be calculated based on the divided key vector K and the divided correlation variable 402C.
On this basis, at least one of addition and concatenation operations may be performed on the intermediate results dQ_1, dQ_2, dQ_3, and dQ_4 to determine the gradient of the query vector Q. For example, dQ_1 and dQ_2 may be added to obtain a first addition result dQ_12; dQ_3 and dQ_4 may be added to obtain a second addition result dQ_34; and a concatenation operation may be performed on dQ_12 and dQ_34, thereby obtaining the gradient dQ of the query vector Q.
In summary, the data processing scheme of the invention enables a single processor to handle attention computation for ultra-long input sequences, reduces inter-processor communication cost, and realizes both the forward and backward portions of the attention computation without extra per-kernel memory or additional merging operations.
Fig. 5 schematically illustrates a block diagram of a computing device 500 suitable for implementing embodiments of the present invention. The device 500 may be used to implement the method 100 shown in fig. 1 and the method 300 shown in fig. 3. As shown in fig. 5, the device 500 includes a central processing unit (CPU) 501, which can perform various suitable actions and processes in accordance with computer program instructions stored in a read-only memory (ROM) 502 or loaded from a storage unit 508 into a random access memory (RAM) 503. The RAM 503 can also store various programs and data required for the operation of the device 500. The CPU 501, the ROM 502, and the RAM 503 are connected to one another through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Various components in the device 500 are connected to the I/O interface 505, including an input unit 506, an output unit 507, a storage unit 508, and a communication unit 509. The CPU 501 performs the various methods and processes described above, e.g., the method 100 or the method 300. For example, in some embodiments, the method 100 or the method 300 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the CPU 501, one or more of the operations of the method 100 or the method 300 described above may be performed. Alternatively, in other embodiments, the CPU 501 may be configured to perform one or more actions of the method 100 and/or the method 300 in any other suitable manner (e.g., by means of firmware).
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, such that the circuitry can execute the computer readable program instructions.
These computer readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
The above is only an alternative embodiment of the present invention and is not intended to limit the present invention, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (11)
1. A method of data processing, comprising:
Determining a three-dimensional vector based on the input sequence, the three-dimensional vector comprising: query vectors, key vectors, and value vectors;
grouping at least one of the three-dimensional vectors to obtain a plurality of sub-vectors of the at least one-dimensional vector;
grouping the three-dimensional vectors based at least on the plurality of sub-vectors to obtain a plurality of three-dimensional vector groups, wherein each sub-vector is included in only one of the plurality of three-dimensional vector groups;
Performing attention-related calculations using the current kernel and based at least on the current three-dimensional vector group to obtain current process data; and
In response to obtaining the current process data, performing an attention-related computation using a next kernel and based at least on the current process data and a next three-dimensional vector group to obtain next process data until final data is determined to generate the attention output based on the final data, wherein the final data is process data resulting from performing an attention-related computation based at least on a last three-dimensional vector group.
2. The method of claim 1, wherein the process data comprises: first process data, second process data and third process data,
Wherein obtaining the current process data comprises:
performing a maximum value calculation on the query vector and the key vector in the current three-dimensional vector group to obtain the first process data;
performing a summation calculation on the query vector and the key vector in the current three-dimensional vector group to obtain the second process data; and
performing a product calculation on the query vector, the key vector, and the value vector in the current three-dimensional vector group to obtain the third process data.
3. The method of claim 2, wherein obtaining the current process data further comprises:
configuring original process data in a current kernel, and initializing the original process data to obtain initialized original process data, wherein the initialized original process data comprises: the method comprises the steps of initializing first original process data, initializing second original process data and initializing third original process data; and
The current process data is calculated based on the initialized raw process data.
4. The method of claim 3, wherein obtaining initialized raw process data comprises:
In response to the current three-dimensional vector grouping being a first three-dimensional vector grouping, making values of the initialized first raw process data, the initialized second raw process data, and the initialized third raw process data all 0;
In response to the current three-dimensional vector packet not being the first three-dimensional vector packet, values of the initialized first raw process data, the initialized second raw process data, and the initialized third raw process data are equal to values of first process data, second process data, and third process data, respectively, of previous process data.
5. The method of claim 1, wherein generating the attention output further comprises:
gradients of each of the three-dimensional vectors are calculated for use in generating the attention output.
6. The method of claim 5, wherein calculating the gradient of each of the three-dimensional vectors comprises:
Dividing a plurality of variables obtained based on an input sequence operation, wherein the plurality of variables comprise: gradient variables obtained by gradient operation based on all three-dimensional vectors of the input sequence, and correlation variables, wherein the correlation variables are determined by predetermined function operation based on products of the query vector and the key vector; and
Based on the partitioned plurality of variables, a gradient of each of the three-dimensional vectors is determined.
7. The method of claim 6, wherein dividing the plurality of variables calculated based on the input sequence comprises:
Dividing the gradient variables based on the number of sub-vectors of the query vector; and
The relevance variable is partitioned based on a number of sub-vectors of the query vector and a number of sub-vectors of the key vector.
8. The method of claim 6, wherein the correlation variable comprises:
a first correlation variable, the first correlation variable determined by performing a normalized exponential function on a product of the query vector and the key vector; and
a second correlation variable determined by performing a transpose function based on the first correlation variable.
9. A computing device, comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor;
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
11. A computer program product, characterized in that it is tangibly stored on a non-transitory computer-readable medium and comprises machine-executable instructions which, when executed, cause a machine to perform the steps in the method according to any of claims 1-8.
Published as CN118503605A on 2024-08-16; granted as CN118503605B on 2024-10-18.
US20220147790A1 (en) | Deep Polynomial Neural Networks | |
CN109308194B (en) | Method and apparatus for storing data | |
US12130886B1 (en) | Tensor automatic differentiation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||