
CN119295293A - Optimization method and related equipment of attention operator based on Ascend AI processor - Google Patents

Optimization method and related equipment of attention operator based on Ascend AI processor

Info

Publication number
CN119295293A
Authority
CN
China
Prior art keywords
input data
target
feature map
interpolation
interpolation point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411806740.4A
Other languages
Chinese (zh)
Other versions
CN119295293B (en)
Inventor
刘海峰
李厚强
王晓芸
艾坤
叶溪石
常峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Zhongke Leinao Intelligent Technology Co ltd
University of Science and Technology of China USTC
Original Assignee
Hefei Zhongke Leinao Intelligent Technology Co ltd
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Zhongke Leinao Intelligent Technology Co ltd, University of Science and Technology of China USTC filed Critical Hefei Zhongke Leinao Intelligent Technology Co ltd
Priority to CN202411806740.4A priority Critical patent/CN119295293B/en
Publication of CN119295293A publication Critical patent/CN119295293A/en
Application granted granted Critical
Publication of CN119295293B publication Critical patent/CN119295293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract


The present invention relates to the field of data processing technology, and provides an optimization method and related equipment for an attention operator based on an Ascend AI processor. By transferring input data of a multi-scale deformable attention operator to an AI Core, the input data is directly used inside the AI Core to determine the position and size of a target feature map and the coordinates of a target interpolation point, and the reference coordinates and corresponding reference interpolation indexes required for the target interpolation point are calculated, thereby avoiding the need to transfer a large amount of data to a CPU for processing. Through parallel computing and a local interpolation strategy, the target interpolation value and its weight of the target interpolation point are calculated on the AI Core, and at the same time, the fifth input data related to the target interpolation weight is transferred to the AI Core, and the output value calculation of the attention operator is completed inside the AI Core, thereby reducing the number of data transmissions between the inside and outside of the processor and improving the overall processing efficiency.

Description

Attention operator optimization method based on Ascend AI processor and related equipment
Technical Field
The invention relates to the technical field of data processing, and in particular to an attention operator optimization method based on an Ascend AI processor and related equipment.
Background
In the field of computer vision and image processing, the multi-scale deformable attention operator (Multi-Scale Deformable Attention) has become a key technology for improving model performance as an efficient attention mechanism, by virtue of its ability to adaptively weight features at different scales and extract highly discriminative features. However, on the Ascend AI processor, a high-performance domestically developed AI hardware platform, directly implementing and invoking this operator faces technical challenges.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the related art to some extent.
Therefore, a first objective of the present invention is to provide an optimization method for the attention operator based on the Ascend AI processor, which realizes efficient operation of the multi-scale deformable attention operator on the Ascend AI processor through deep fusion of the algorithm and the hardware architecture, thereby improving model inference speed.
To this end, an embodiment of the first aspect of the invention provides an attention operator optimization method based on an Ascend AI processor, comprising: obtaining first input data, second input data, third input data, fourth input data and fifth input data of the attention operator, and transferring the second input data, the third input data and the fourth input data to an AI Core; determining the width and height of a target feature map according to the second input data, determining the start position of the target feature map in a spliced feature map according to the third input data, and determining the coordinates of a target interpolation point according to the fourth input data, wherein the spliced feature map is obtained by splicing all input feature maps of the attention operator, the target feature map is contained in the spliced feature map, and the target interpolation point is contained in the target feature map; obtaining at least four reference coordinates of the target interpolation point according to the coordinates of the target interpolation point; determining indexes of at least four reference interpolation values corresponding to the at least four reference coordinates according to the width and height of the target feature map, the start position of the target feature map in the spliced feature map, the coordinates of the target interpolation point, and the at least four reference coordinates of the target interpolation point; and transferring the at least four reference interpolation values from the first input data to the AI Core according to the indexes.
According to the attention operator optimization method based on the Ascend AI processor of the embodiment of the invention, the input data of the multi-scale deformable attention operator (the second, third and fourth input data, together with the part of the first input data required subsequently) are transferred to the AI Core (the core acceleration unit of the Ascend AI processor) through a data transfer strategy, reducing the cost and delay of data movement. The position and size of the target feature map and the coordinates of the target interpolation point are determined directly inside the AI Core from the second, third and fourth input data, and the reference coordinates and the corresponding reference interpolation indexes required for the target interpolation point are calculated there, avoiding the need to move a large amount of data to the CPU. Through parallel computing and a local interpolation strategy, the target interpolation value and its weight are calculated directly on the AI Core; at the same time, the fifth input data related to the target interpolation weight is also transferred to the AI Core, so that the calculation of the target interpolation weight is completed inside the AI Core, reducing CPU intervention and improving overall efficiency. Finally, the output value of the multi-scale deformable attention operator is computed inside the AI Core and transferred out to the subsequent processing unit, reducing the number of data transfers between the inside and outside of the processor and improving overall processing efficiency.
In addition, the data processing method according to the above embodiment of the present invention may further have the following additional technical features:
As an alternative embodiment, after transferring the at least four reference interpolation values from the first input data to the AI Core according to the indexes, the method further comprises:
respectively calculating at least four reference interpolation weights corresponding to the at least four reference interpolation values;
determining the target interpolation value of the target interpolation point according to the at least four reference interpolation values and the at least four reference interpolation weights;
transferring the fifth input data to the AI Core, and determining the target interpolation weight of the target interpolation point according to the fifth input data;
determining the output value of the attention operator according to the target interpolation value and the target interpolation weight;
and transferring the output value out of the AI Core.
As an alternative embodiment, acquiring the first input data, the second input data, the third input data, the fourth input data, and the fifth input data of the attention operator includes:
storing the first input data, the second input data, the third input data, the fourth input data and the fifth input data in an external storage area of the AI Core, wherein the first input data comprises a spliced feature map formed by splicing L input feature maps; the second input data comprises the width and height of each of the L input feature maps, with size [L, 2]; the third input data comprises the start position of each input feature map in the spliced feature map, with size [L]; the fourth input data comprises the coordinates of the interpolation points sampled from the input feature maps of the first input data, with size [bs, Lq, M, L, P, 2]; and the fifth input data comprises the attention weights corresponding to the coordinates of those interpolation points, with size [bs, Lq, M, L, P]; wherein bs denotes the batch size, Lq denotes the number of query points, M denotes the number of attention heads, L denotes the number of input feature maps, and P denotes the number of sampling points per feature map.
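The five inputs and their shapes can be sketched with NumPy; all concrete values (bs, Lq, M, L, P, and the embedding dimension D, which the text does not name) are illustrative assumptions, not taken from the patent:

```python
import numpy as np

# Hypothetical example dimensions; D (channel depth) is an assumption.
bs, Lq, M, L, P, D = 2, 4, 8, 4, 4, 32
# Heights/widths of the L input feature maps (an example pyramid).
hw = np.array([[16, 16], [8, 8], [4, 4], [2, 2]])            # second input, [L, 2]
num_keys = int((hw[:, 0] * hw[:, 1]).sum())                   # total positions in the spliced map
value = np.random.rand(bs, num_keys, M, D)                    # first input: spliced feature map
level_start = np.concatenate([[0], np.cumsum(hw[:, 0] * hw[:, 1])[:-1]])  # third input, [L]
sampling_loc = np.random.rand(bs, Lq, M, L, P, 2)             # fourth input
attn_weight = np.random.rand(bs, Lq, M, L, P)                 # fifth input

print(level_start.tolist())  # → [0, 256, 320, 336]
```

The start positions are simply the running sum of the per-level areas, which is how a target feature map is located inside the spliced map.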
As an alternative embodiment, the transporting the second input data, the third input data, and the fourth input data to the AI Core includes:
transferring the second transfer data of size [L, 2] from the 0 start position of the second input data in the external storage area to the AI Core;
transferring the third transfer data of size [L] from the 0 start position of the third input data in the external storage area to the AI Core;
and transferring the fourth transfer data of size [M, L, P, 2] from the coreIdx × M × L × P × 2 start position of the fourth input data in the external storage area to the AI Core.
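The per-core start offsets reduce to simple index arithmetic; the sketch below assumes coreIdx enumerates the (batch, query) pairs one AI Core handles, and all values are hypothetical examples:

```python
# Hypothetical per-core transfer offsets for the fourth and fifth input data.
M, L, P = 8, 4, 4
core_idx = 3

fourth_offset = core_idx * M * L * P * 2   # start of this core's [M, L, P, 2] slice
fifth_offset = core_idx * M * L * P        # start of this core's [M, L, P] slice
print(fourth_offset, fifth_offset)  # → 768 384
```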
As an alternative embodiment, determining the width and the height of the target feature map according to the second input data includes:
transforming the dimensions of the second transfer data from [L, 2] to [2, L, P, M], and splitting the dimension-transformed second transfer data along the highest dimension into the width and the height of the target feature map, wherein the width of the target feature map has size [L, P, M] and the height of the target feature map has size [L, P, M].
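The [L, 2] → [2, L, P, M] transform and the split along the highest dimension can be sketched as a broadcast plus an axis permutation. Shapes only; which of the two slices is width and which is height depends on the storage order of the second input data, so the names below are assumptions:

```python
import numpy as np

L, P, M = 4, 4, 8
hw = np.arange(L * 2, dtype=float).reshape(L, 2)   # stand-in for the second transfer data

# Replicate each per-level pair across P sampling points and M heads,
# then move the size-2 axis to the front so it can be split.
expanded = np.broadcast_to(hw[:, :, None, None], (L, 2, P, M))
transformed = expanded.transpose(1, 0, 2, 3)       # [2, L, P, M]
width, height = transformed[0], transformed[1]     # each [L, P, M]
print(width.shape, height.shape)  # → (4, 4, 8) (4, 4, 8)
```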
As an optional embodiment, determining the start position of the target feature map in the spliced feature map according to the third input data includes:
transforming the dimensions of the third transfer data from [L] to [L, P, M].
As an alternative embodiment, determining the coordinates of the target interpolation point according to the fourth input data includes:
transforming the dimensions of the fourth transfer data from [M, L, P, 2] to [2, L, P, M], and splitting the dimension-transformed fourth transfer data along the highest dimension into the first coordinate and the second coordinate of the target interpolation point, wherein the first coordinate of the target interpolation point has size [L, P, M] and the second coordinate of the target interpolation point has size [L, P, M].
As an alternative embodiment, obtaining at least four reference coordinates of the target interpolation point according to the coordinates of the target interpolation point includes:
x₁ = ⌊x⌋, y₁ = ⌊y⌋;  x₂ = ⌊x⌋ + 1, y₂ = ⌊y⌋;  x₃ = ⌊x⌋, y₃ = ⌊y⌋ + 1;  x₄ = ⌊x⌋ + 1, y₄ = ⌊y⌋ + 1
wherein ⌊·⌋ is the floor (round-down) function, x is the first coordinate of the target interpolation point, y is the second coordinate of the target interpolation point, (x₁, y₁) is the first reference coordinate of the target interpolation point, (x₂, y₂) is the second reference coordinate of the target interpolation point, (x₃, y₃) is the third reference coordinate of the target interpolation point, and (x₄, y₄) is the fourth reference coordinate of the target interpolation point.
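Read as standard bilinear-interpolation corners, the four reference coordinates around a fractional sampling point can be sketched as follows (the example point is hypothetical):

```python
import math

x, y = 2.3, 5.7                        # first and second coordinates of the target point
x0, y0 = math.floor(x), math.floor(y)  # rounded-down corner
corners = [(x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)]
print(corners)  # → [(2, 5), (3, 5), (2, 6), (3, 6)]
```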
As an alternative embodiment, determining the index of at least four reference interpolations corresponding to at least four reference coordinates according to the width and the height of the target feature map, the position of the target feature map in the stitching feature map, the coordinates of the target interpolation point, and the at least four reference coordinates of the target interpolation point, includes:
h₁₂ = ⌊y⌋,  h₃₄ = ⌊y⌋ + 1,  w₁₃ = ⌊x⌋,  w₂₄ = ⌊x⌋ + 1
idx₁ = S + h₁₂ · W + w₁₃,  idx₂ = S + h₁₂ · W + w₂₄,  idx₃ = S + h₃₄ · W + w₁₃,  idx₄ = S + h₃₄ · W + w₂₄
wherein H is the height of the target feature map, W is the width of the target feature map, h₁₂ is the index of the first and second reference coordinates along the height of the target feature map, h₃₄ is the index of the third and fourth reference coordinates along the height of the target feature map, w₁₃ is the index of the first and third reference coordinates along the width of the target feature map, w₂₄ is the index of the second and fourth reference coordinates along the width of the target feature map, S is the start position of the target feature map in the spliced feature map obtained from the third input data, coreIdx is the index of the AI Core, headIdx is the index of the attention head in the self-attention mechanism (coreIdx and headIdx further offset the indexes to address the corresponding batch, query and head within the first input data), and idx₁, idx₂, idx₃ and idx₄ are the indexes of the first, second, third and fourth reference interpolation values corresponding to the first, second, third and fourth reference coordinates respectively.
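A minimal sketch of the reference-index computation, assuming row-major storage of each feature map inside the spliced feature map and omitting the batch/head offsets; the `ref_index` helper and all values are hypothetical:

```python
def ref_index(start, width, h, w):
    # Flat index of position (h, w) of a feature map that begins at `start`
    # in the spliced feature map, assuming row-major storage.
    return start + h * width + w

start, width = 256, 8       # e.g. a level starting at offset 256 with width 8
h0, w0 = 5, 3               # rounded-down height/width indexes of the target point
idx = [ref_index(start, width, h, w) for h in (h0, h0 + 1) for w in (w0, w0 + 1)]
print(idx)  # → [299, 300, 307, 308]
```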
As an alternative embodiment, calculating at least four reference interpolation weights corresponding to at least four reference interpolations, respectively, includes:
lh = y − ⌊y⌋,  lw = x − ⌊x⌋
w₁ = (1 − lh) · (1 − lw),  w₂ = (1 − lh) · lw,  w₃ = lh · (1 − lw),  w₄ = lh · lw
wherein w₁ is the weight of the first reference interpolation value, w₂ is the weight of the second reference interpolation value, w₃ is the weight of the third reference interpolation value, w₄ is the weight of the fourth reference interpolation value, lh is the fractional index of the target interpolation point along the height of the target feature map, and lw is the fractional index of the target interpolation point along the width of the target feature map.
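The four weights are the standard bilinear-interpolation weights; a quick numeric check (with a hypothetical sampling point) shows they sum to 1:

```python
x, y = 2.3, 5.7
lw, lh = x - int(x), y - int(y)   # fractional offsets within the cell
w1 = (1 - lh) * (1 - lw)
w2 = (1 - lh) * lw
w3 = lh * (1 - lw)
w4 = lh * lw
print(round(w1 + w2 + w3 + w4, 10))  # → 1.0
```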
As an alternative embodiment, determining the target interpolation of the target interpolation point based on at least four reference interpolations and at least four reference interpolation weights includes:
v = w₁ · v₁ + w₂ · v₂ + w₃ · v₃ + w₄ · v₄
wherein v is the target interpolation value of the target interpolation point, v₁ is the first reference interpolation value, v₂ is the second reference interpolation value, v₃ is the third reference interpolation value, and v₄ is the fourth reference interpolation value, and w₁ to w₄ are the corresponding reference interpolation weights.
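Combining the corner values and weights gives the complete bilinear sample; the toy 2-D map below is a sketch of the computation, not the TBE kernel:

```python
import math
import numpy as np

def bilinear(fmap, x, y):
    # Bilinear sample of a 2-D map at fractional coordinates (x, y).
    x0, y0 = math.floor(x), math.floor(y)
    lw, lh = x - x0, y - y0
    v1, v2 = fmap[y0, x0], fmap[y0, x0 + 1]
    v3, v4 = fmap[y0 + 1, x0], fmap[y0 + 1, x0 + 1]
    return ((1 - lh) * (1 - lw) * v1 + (1 - lh) * lw * v2
            + lh * (1 - lw) * v3 + lh * lw * v4)

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear(fmap, 1.5, 1.5))  # → 7.5
```

At (1.5, 1.5) the four corners 5, 6, 9, 10 each get weight 0.25, giving 7.5.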
As an alternative embodiment, the transporting of the fifth input data to the AI Core includes:
transferring the fifth transfer data of size [M, L, P] from the coreIdx × M × L × P start position of the fifth input data in the external storage area to the AI Core.
To achieve the above objects, an embodiment of the second aspect of the present invention provides an attention operator optimization system based on an Ascend AI processor, comprising:
An input module configured to obtain first, second, third, fourth, and fifth input data of an attention operator, and to carry the second, third, and fourth input data to an AI Core;
The data preparation module is configured to determine the width and the height of a target feature map according to the second input data, determine the initial position of the target feature map in the spliced feature map according to the third input data, and determine the coordinates of a target interpolation point according to the fourth input data, wherein the spliced feature map is obtained by splicing the input feature maps of all attention operators, the target feature map is contained in the spliced feature map, and the target interpolation point is contained in the target feature map;
The interpolation calculation module is configured to obtain at least four reference coordinates of the target interpolation point according to the coordinates of the target interpolation point, determine indexes of at least four reference interpolations corresponding to the at least four reference coordinates according to the width and the height of the target feature map, the starting position of the target feature map in the spliced feature map, the coordinates of the target interpolation point and the at least four reference coordinates of the target interpolation point, and convey the at least four reference interpolations to the AI Core from the first input data according to the indexes.
According to the attention operator optimization system based on the Ascend AI processor of the embodiment of the invention, the input data of the multi-scale deformable attention operator (the second, third and fourth input data, together with the part of the first input data required subsequently) are transferred to the AI Core (the core acceleration unit of the Ascend AI processor) through a data transfer strategy, reducing the cost and delay of data movement. The position and size of the target feature map and the coordinates of the target interpolation point are determined directly inside the AI Core from the second, third and fourth input data, and the reference coordinates and the corresponding reference interpolation indexes required for the target interpolation point are calculated there, avoiding the need to move a large amount of data to the CPU. Through parallel computing and a local interpolation strategy, the target interpolation value and its weight are calculated directly on the AI Core; at the same time, the fifth input data related to the target interpolation weight is also transferred to the AI Core, so that the calculation of the target interpolation weight is completed inside the AI Core, reducing CPU intervention and improving overall efficiency. Finally, the output value of the multi-scale deformable attention operator is computed inside the AI Core and transferred out to the subsequent processing unit, reducing the number of data transfers between the inside and outside of the processor and improving overall processing efficiency.
To achieve the above objects, an embodiment of the third aspect of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above attention operator optimization method based on the Ascend AI processor when executing the program.
To achieve the above objects, an embodiment of the fourth aspect of the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to execute the above attention operator optimization method based on the Ascend AI processor.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings described below show only some embodiments of the invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of the architecture of an Ascend AI processor according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a hardware architecture of an AI Core according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of an optimization system for the attention operator based on an Ascend AI processor according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a target interpolation point and a reference interpolation point according to an embodiment of the present invention.
FIG. 5 is a flowchart of an optimization method for the attention operator based on an Ascend AI processor according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of the spliced feature map when L = 4 according to an embodiment of the present invention.
FIG. 7 is a flowchart of an optimization method for the attention operator based on an Ascend AI processor according to an embodiment of the invention.
Fig. 8 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.
It is to be noted that unless otherwise defined, technical or scientific terms used herein should be taken in a general sense as understood by one of ordinary skill in the art to which the present invention belongs. The terms "first," "second," and the like, as used herein, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
As described in the background section, the multi-scale deformable attention operator (Multi-Scale Deformable Attention) is an attention mechanism for computer vision tasks. It has achieved good results in image processing and computer vision tasks and is widely used in various application fields. Its advantage is that features can be weighted adaptively at different scales, thereby extracting more discriminative features. Meanwhile, because its local and sparse attention weights only need to be applied to the corresponding values, it can greatly accelerate model convergence and improve inference efficiency.
Currently, due to limitations of the hardware architecture and operator optimization, the multi-scale deformable attention operator cannot be executed efficiently and directly on the core acceleration unit (the AI Core) of the Ascend AI processor. Instead, it must be decomposed into a complex sequence of operations centered on grid sampling and processed by the AI CPU. This processing mode not only increases computational complexity but also significantly reduces model inference speed, restricting the performance advantages and application potential of the Ascend AI processor in visual computing tasks.
Therefore, a multi-scale deformable attention operator optimization method for the Ascend AI processor is needed, which aims to achieve efficient operation of this operator on the AI Core through deep fusion of the algorithm and the hardware architecture, thereby improving model inference speed and meeting growing computer vision application requirements.
The technical scheme of the invention is further described in detail through specific examples.
Referring to FIG. 1, a schematic diagram of the architecture of an Ascend AI processor is provided in an embodiment of the invention.
The architecture of the Ascend AI processor mainly comprises the AI Core, the AI CPU, a data vision preprocessing module, a task scheduler, and the L2/HBM/DDR memory areas.
The AI Core is the computing engine of the Ascend AI processor, specially designed for the large-scale matrix, vector and tensor operations in deep learning. These operations are critical in the forward propagation and backward propagation of neural networks. Through highly parallelized computing units (such as the matrix multiplication unit) and an optimized instruction set, the AI Core provides extremely high computing efficiency and meets the strict compute and speed requirements of AI applications.
AI CPU plays an auxiliary role in AI processor, mainly responsible for processing those tasks that are logically complex, branch intensive, or specific computation intensive tasks (e.g. complex operations) that AI Core cannot directly handle. In addition, the AI CPU is also responsible for the management of processing control flow and data flow, and ensures the stability and high efficiency of the overall operation of the AI processor. Compared with AI Core, AI CPU is more flexible in processing non-matrix, non-vector, non-tensor type calculation, and can cope with various complex calculation scenes.
The data vision preprocessing module is mainly used for preprocessing input data, including but not limited to data cleaning, normalization, data enhancement and the like. In AI applications, high quality input data is a prerequisite for good model performance, and therefore the importance of the data preprocessing module is self-evident. By optimizing the data preprocessing flow, the efficiency and accuracy of data processing can be improved, thereby providing more valuable data input for subsequent AI Core and AI CPU calculations.
The task scheduler is responsible for coordinating task allocation and scheduling among the AI Core, AI CPU and memory resources. According to the nature and priority of the task, the use of the computing resource and the memory resource is reasonably arranged so as to ensure the high efficiency and the stability of the overall operation of the AI processor. Task schedulers typically employ a variety of optimization strategies, such as load balancing, priority scheduling, etc., to cope with different computing scenarios and demands.
The L2 cache, HBM (High Bandwidth Memory) and DDR (Double Data Rate memory) together form the Global Memory (GM) system of the Ascend AI processor. These memory resources provide the necessary data storage space for the AI Core and the AI CPU. Among them, HBM is particularly important in AI applications due to its high bandwidth and low latency, as it can significantly improve data access speed and reduce latency in the calculation process, while DDR provides a richer data storage option for the AI processor due to its large capacity and cost effectiveness.
Referring to fig. 2, a schematic diagram of a hardware architecture of an AI Core according to an embodiment of the present invention is shown.
The AI Core is mainly divided into a computing Unit, a storage system and a control Unit, wherein the computing Unit mainly comprises three basic computing resources, namely a matrix computing Unit (Cube Unit), a Vector computing Unit (Vector Unit) and a scalar computing Unit (Scalar Unit).
Among them, matrix computation units (Cube units) are mainly used for processing tensor computation intensive tasks such as convolution and matrix multiplication. These operations are very common in deep learning algorithms, which are core computations in the neural network forward and backward propagation processes. Vector computing units (Vector units) are mainly used to perform Vector computation intensive tasks. Vector computation is efficient in processing data sequences, array operations, etc., and can accelerate key steps in many AI algorithms. The scalar computing unit (Scalar Unit) is primarily responsible for executing control logic of the program, such as branch decisions, loop control, and the like. This unit, although not the most computationally intensive part of the AI Core, is critical to ensure proper program execution and flow control.
The memory system can be divided into the L0A/L0B/L0C buffers, the Unified Buffer (UB), and scalar buffers/registers.
The L0A/L0B/L0C buffers are optimized for tensor, vector and specific types of computation respectively; by storing frequently accessed data in a buffer closer to the computing units, the latency and bandwidth consumption of accessing external storage (such as the L2 cache, HBM or DDR) can be reduced. The Unified Buffer (UB) provides a more flexible buffering mechanism supporting different types of computation and data access modes, and can be shared by multiple computing units to improve buffer utilization and reduce data redundancy. Scalar buffers/registers are used primarily to store the data and operation results required by the scalar computing unit; since scalar data is typically small and frequently accessed, storing it in registers further increases access speed. The L1 cache, located inside the AI Core, temporarily stores data that needs to be reused; by reducing the number of reads and writes over the bus, the L1 cache can significantly reduce latency and improve overall computing performance.
The control unit is mainly responsible for coordinating the operation between the computing unit and the storage system, ensuring the correct execution of instructions and the efficient flow of data, and comprises components such as an instruction decoder, an instruction queue, a branch predictor and the like to support complex control flows and data flow management.
In addition, operators of the Ascend AI processor that run on the AI Core are called TBE operators, which have higher execution efficiency. Ascend provides two ways to develop TBE operators: TBE-DSL (Tensor Boost Engine - Domain Specific Language) and TBE-TIK (Tensor Iterator Kernel). The TBE-DSL development mode provides pre-packaged function interfaces for developers to call, but supports only simple combinations of these interfaces. The TBE-TIK development mode has the highest flexibility: the AI Core's low-level instructions are opened directly to developers, and the resulting operators run more efficiently.
Therefore, in the embodiment of the application, a TBE-TIK development mode is adopted.
Referring to FIG. 3, a schematic diagram of an optimization system for the attention operator based on an Ascend AI processor is provided in an embodiment of the invention.
Specifically, the method can be implemented in five modules: an input module 302, a data preparation module 304, an interpolation calculation module 306, a matrix multiplication module 308, and an output module 310.
Wherein the input module 302 is responsible for carrying input data of the multi-scale deformable attention operator from the external storage area (GM) into the Unified Buffer (UB) of the AI Core.
The data preparation module 304 is responsible for sorting and dimension transforming the input data carried by the input module 302, so as to facilitate the use of the subsequent calculation module. In addition, some initialization and dimensional transformation of constant vectors, dimensional transformation of intermediate results of the operation module are also performed in the module.
The interpolation calculation module 306 calculates interpolation points based on a bilinear interpolation algorithm. Two of its inputs are the coordinates of the interpolation point, and two are the width and height of the feature map where the interpolation point is located; its output is the value of the interpolation point. The module mainly completes three calculation tasks: first, calculating the coordinates of the 4 reference interpolation points near the target interpolation point based on the two inputs; second, calculating the weights w1, w2, w3, w4 of the 4 reference interpolation points; third, calculating the value val of the target interpolation point.
The matrix multiplication module 308 is responsible for calculating matrix multiplication. It has two inputs: the sampling data obtained from the first input data based on the fourth input data, and the fifth input data. Its output is the matrix product of the two inputs in the L×P dimension, performed bs×Lq×M times in total.
The output module 310 is responsible for transferring output data from the unified buffer (UB) of the AI Core to the external memory area (GM) for output.
Specifically, the data preparation module 304 performs the computation on the storage conversion unit of AI Core, and the common data dimension conversion operations are:
dimension transformation, for example, transforms [ bs, lq, M, L, P ] to [ bs, lq, L, P, M ].
Dimension expansion, e.g., transform [ L ] to [ M, L, P ].
Data concatenation, for example, concatenating 4 [ M, L, P ] data into [4, M, L, P ] sized data.
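The three data-layout operations above can be sketched with NumPy on the host side (an illustrative analogy only; inside a TIK kernel these are carried out by the storage conversion unit, and the concrete shapes here are hypothetical):

```python
import numpy as np

bs, lq, M, L, P = 1, 2, 8, 4, 4

# Dimension transformation: [bs, lq, M, L, P] -> [bs, lq, L, P, M]
x = np.zeros((bs, lq, M, L, P))
x_t = np.transpose(x, (0, 1, 3, 4, 2))

# Dimension expansion: [L] -> [M, L, P]
v = np.arange(L)
v_e = np.broadcast_to(v[None, :, None], (M, L, P))

# Data concatenation: four [M, L, P] tensors -> [4, M, L, P]
parts = [np.full((M, L, P), i) for i in range(4)]
stacked = np.stack(parts, axis=0)
```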
The interpolation calculation module 306 has two inputs: the coordinates of the target interpolation point, denoted (locH, locW), with dimension [bs, lq, M, L, P], and the width and height of the target feature map where the target interpolation point is located, denoted (spatialH, spatialW). The output is the value of the target interpolation point, denoted val, with dimension [bs, lq, M, L, P, D]. The module performs the calculation on the AI Core's vector calculation unit. The basic calculation flow is to execute a loop bs×lq×M×L×P times; for each iteration:
First, the coordinates of the 4 reference points near the target interpolation point are calculated based on the two inputs:

(⌊locH⌋, ⌊locW⌋), (⌊locH⌋, ⌊locW⌋ + 1), (⌊locH⌋ + 1, ⌊locW⌋), (⌊locH⌋ + 1, ⌊locW⌋ + 1)
Referring to fig. 4, a schematic diagram of a target interpolation point and a reference interpolation point according to an embodiment of the present invention is provided.
Wherein ⌊·⌋ is the floor function, locH is the first coordinate of the target interpolation point, and locW is the second coordinate of the target interpolation point.
It should be noted that locH can be understood as the coordinate of the target interpolation point in the height dimension of the target feature map, and locW as its coordinate in the width dimension. Centered on the target interpolation point in FIG. 4, the first reference coordinate (⌊locH⌋, ⌊locW⌋) is the coordinate of the reference interpolation point in the upper left corner, the second reference coordinate (⌊locH⌋, ⌊locW⌋ + 1) is that of the upper right corner, the third reference coordinate (⌊locH⌋ + 1, ⌊locW⌋) is that of the lower left corner, and the fourth reference coordinate (⌊locH⌋ + 1, ⌊locW⌋ + 1) is that of the lower right corner.
Further, the 4 reference coordinates are sent to the input module, which indexes into the first input data in the external storage area at the corresponding coordinate positions and carries 4 pieces of data of length D (i.e., the reference interpolation corresponding to each reference interpolation point) into the UB memory area of the AI Core, obtaining v1, v2, v3, v4 (each of dimension [D]).
Further, the weights of the 4 reference points around the target interpolation point are calculated based on the two inputs. With lh = locH − ⌊locH⌋ and lw = locW − ⌊locW⌋:

w1 = (1 − lh) × (1 − lw)
w2 = (1 − lh) × lw
w3 = lh × (1 − lw)
w4 = lh × lw

Wherein w1 is the weight of the first reference interpolation, w2 the weight of the second reference interpolation, w3 the weight of the third reference interpolation, and w4 the weight of the fourth reference interpolation.
Finally, the target interpolation val of the target interpolation point (dimension [D]) is calculated:

val = w1 × v1 + w2 × v2 + w3 × v3 + w4 × v4

Wherein val is the target interpolation of the target interpolation point, and v1, v2, v3, v4 are the first, second, third, and fourth reference interpolations.
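The three-step flow described for this module (reference coordinates, weights, value) amounts to standard bilinear interpolation, which can be sketched in NumPy for a single sampling point (a host-side reference only; feat, loc_h, loc_w are hypothetical names, and boundary clamping is omitted):

```python
import numpy as np

def bilinear_sample(feat, loc_h, loc_w):
    """Sample feat ([H, W, D]) at fractional (loc_h, loc_w) by bilinear interpolation."""
    h_low, w_low = int(np.floor(loc_h)), int(np.floor(loc_w))
    h_high, w_high = h_low + 1, w_low + 1
    lh, lw = loc_h - h_low, loc_w - w_low
    # Weights of the 4 reference points (upper-left, upper-right, lower-left, lower-right)
    w1, w2, w3, w4 = (1 - lh) * (1 - lw), (1 - lh) * lw, lh * (1 - lw), lh * lw
    v1 = feat[h_low, w_low]
    v2 = feat[h_low, w_high]
    v3 = feat[h_high, w_low]
    v4 = feat[h_high, w_high]
    return w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4
```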
For the matrix multiplication module, there are two inputs: one is the sampling data obtained from the first input data based on the fourth input data (i.e., the target interpolation val), and the other is the fifth input data, denoted attnWeight, with dimension [bs, lq, M, L, P]. The output is the final calculation result of the operator. The module performs a matrix multiplication of an [L×P] matrix and an [L×P, D] matrix a total of bs×Lq×M times, on the matrix calculation unit of the AI Core.
The parallel optimization means used in the embodiment of the application comprise multi-core parallel, vectorization parallel, access optimization and Double buffer mechanism, wherein:
For multi-core parallelism, the Ascend AI processor includes a plurality of AI Cores that are independent from each other and can execute the same calculation task in parallel. In the TBE-TIK operator development tool, the parallelism of multi-core parallelism can be set on the outermost loop; setting the parallelism equal to or greater than the actual number of AI Cores allows the scheduler to perform full-load scheduling.
In the embodiment of the present invention, the parallelism of the multi-core parallelism is set to bs×lq, and the index on each AI Core is expressed as coreIdx.
For vectorized parallelism, vectorization is typically implemented using Single Instruction Multiple Data (SIMD) instructions. For the AI Core, the matrix computation unit can complete one fp16 16×16 by 16×16 matrix multiplication (4096 multiply-accumulates) per clock cycle. The vector computation unit can perform two vector operations of length 128 on fp16 data per clock cycle. The storage conversion unit can complete one fp16 16×16 matrix transpose per clock cycle.
In the embodiment of the present invention, the parallelism of vectorization parallelism of each AI Core is set to m×l×p.
I.e. for each AI Core:
in the data preparation module, data of size M×L×P×2 is transferred from the fourth input data GM memory area, starting at position coreIdx×M×L×P×2, into the UB memory area as locH (dimension [M, L, P]) and locW (dimension [M, L, P]).
Data of size M×L×P is transferred from the fifth input data GM memory area, starting at position coreIdx×M×L×P, into the UB memory area as attnWeight (dimension [M, L, P]).
Data of size M×L×P×4 (4 reference points) ×D in total is transferred from the first input data GM memory area into the UB memory area as v1, v2, v3, v4 (each of dimension [M, L, P, D]).
In the interpolation calculation module, a vector calculation unit is required to calculate data of m×l×p at a time.
In the matrix multiplication module, since matrix multiplication is required in the dimension of l×p, a cycle of M is set, and for each cycle, matrix multiplication of the [ l×p ] matrix and the [ l×p, D ] matrix is calculated.
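The per-head loop of M small matrix multiplications can be sketched as follows (a NumPy stand-in for the matrix calculation unit; the array names and random inputs are illustrative):

```python
import numpy as np

M, L, P, D = 8, 4, 4, 32
rng = np.random.default_rng(0)
sampled = rng.random((M, L * P, D))   # target interpolations gathered for one (bs, lq) pair
attn_w = rng.random((M, L * P))       # attention weights for the same sampling points

out = np.empty((M, D))
for m in range(M):                    # loop of M; each step is a [L*P] x [L*P, D] product
    out[m] = attn_w[m] @ sampled[m]
```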
For memory access optimization: in the multi-scale deformable attention operator, the first input data needs to be transferred from the GM memory area to the UB memory area bs×Lq×M×L×P times in total during actual operation, and the addresses transferred each time are discontinuous. This causes a memory access bottleneck and becomes a key target for optimization.
In this scheme, the physical relationship between the first input data, the second input data, and the third input data is considered, and an on-chip caching approach is adopted to reduce interaction with the external storage area. Meanwhile, considering that sampling coordinates within the same input feature map tend to be more contiguous across the L input feature maps, the data of dimension [M, L, P] in the AI Core are transformed to [L, P, M], making L the first dimension. The method comprises the following steps:
Setting an on-chip cache for the first input data in the UB memory area, with size [K, M, D];
The coordinates of the 4 reference points obtained by the interpolation calculation module (each of dimension [M, L, P]) first undergo a dimension transformation, and the transformed dimension is [L, P, M];
And splicing the coordinates of the 4 reference points together to obtain data with the dimensions of [ L,4, P, M ].
A loop of number L is constructed for each loop:
obtaining the maximum value max and the minimum value min in the [4, P, M] length data and the difference diff between them (diff = max − min);
When diff <= K, pulling diff×M×D data into the cache starting from position min of the first input data GM storage area, and reading the subsequent v1, v2, v3, v4 directly from the cache;
When diff > K, v1, v2, v3, v4 are read from the first input data GM storage area.
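The cache decision in the loop above can be sketched as follows (host-side NumPy; gm here stands for rows of the first input data, each row being the M×D block at one spatial position, K is the cache capacity in rows, and all names are illustrative):

```python
import numpy as np

def gather_rows(gm, coords, K):
    """Gather gm rows at `coords`; bulk-load one contiguous span when it fits in the cache."""
    lo, hi = int(coords.min()), int(coords.max())
    diff = hi - lo
    if diff <= K:                 # one contiguous burst from GM, then index the on-chip copy
        cache = gm[lo:hi + 1]
        return cache[coords - lo]
    return gm[coords]             # span too large: fall back to discontinuous GM reads
```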
For the Double buffer mechanism, vector instructions, matrix instructions, and storage conversion instructions on the AI Core are issued through different queues to the vector operation unit, matrix operation unit, and storage conversion unit, respectively, for scheduling. Instructions in these different queues can be executed in parallel.
In this scheme, the interpolation calculation module is executed on the vector operation unit, the matrix multiplication module is executed on the matrix calculation unit, and the data preparation module is executed on the storage conversion unit. Parallel execution is realized among the three modules through a double buffer mechanism.
The following describes the flow of the optimization method of the attention operator based on the Ascend AI processor in detail with reference to the embodiments.
Referring to FIG. 5, a flowchart of an optimization method for an attention operator based on an Ascend AI processor according to an embodiment of the present invention is shown.
Step S502, the first input data, the second input data, the third input data, the fourth input data, and the fifth input data of the attention operator are acquired, and the second input data, the third input data, and the fourth input data are transferred to the AI Core.
As an alternative embodiment, acquiring the first, second, third, fourth and fifth input data of the attention operator includes storing the first, second, third, fourth and fifth input data in an external storage area of the AI Core. The first input data comprises a stitched feature map formed by stitching L input feature maps, with size [bs, S, M, D]; the second input data comprises the width and height of each of the L input feature maps, with size [L, 2]; the third input data comprises the starting position of each of the L input feature maps in the stitched feature map, with size [L]; the fourth input data comprises the coordinates of the interpolation points sampled from the input feature maps of the first input data, with size [bs, lq, M, L, P, 2]; the fifth input data comprises the attention weights corresponding to the coordinates of the interpolation points sampled from the input feature maps of the first input data, with size [bs, lq, M, L, P]. Here bs represents the number of samples used in one training pass, S represents the length of the stitched feature map, M represents the number of attention heads in the self-attention mechanism, D represents the hidden dimension, L represents the number of input feature maps, Lq represents the decoder query length, and P represents the number of interpolation points.
Specifically, the output of the multi-scale deformable attention operator is represented as output data (i.e., the output values), of size (bs, lq, M, D). The five inputs are represented as:
The first input data In1 has a size of [ bs, S, M, D ], and represents a total input feature map (i.e., a mosaic feature map) formed by stitching L input feature maps.
The second input data In2 has a size of L,2, and represents the width and height of each of the L input feature maps.
The third input data In3 has a size of L, and represents a start position of each of the L input feature maps In the spliced feature map.
The fourth input data In4 has a size of [ bs, lq, M, L, P,2], and represents coordinates of sampling points sampled from the first input data (i.e., coordinates of interpolation points sampled from the input feature map of the first input data), and is required to be sampled bs×lq×m×l×p times In total.
The fifth input data has a size of [bs, lq, M, L, P] and represents the attention weights, which are in one-to-one correspondence with the sampling points obtained from the first input data; that is, the first input data is sampled bs×lq×M×L×P times in total, corresponding to the bs×lq×M×L×P attention weights of the fifth input data.
Where bs (Batch Size) represents the number of samples processed simultaneously during one training pass; in deep learning, the data is divided into multiple batches, each containing a fixed number of samples, for iterative training. S denotes the length of the stitched feature map. M represents the number of attention heads in the self-attention (Multi-Head Attention) mechanism. Each head independently learns a different representation of the input data, and the representations are spliced together to obtain more information; multi-head attention is one of the core components of the Transformer model, widely used in natural language processing and computer vision tasks. D represents the hidden dimension. L represents the number of input feature maps; in a computer vision task, when feature maps of multiple scales are processed, L may refer to the number of feature maps at different scales. Lq represents the ratio of the encoder feature map length to the decoder query length; in multi-scale deformable attention or other similar mechanisms, Lq represents a length metric or factor associated with a query (Query), which can be used to adjust the manner in which the query interacts with keys (Keys) or values (Values), particularly when processing sequences or feature maps of different lengths.
P represents the number of target interpolation points, in a multi-scale deformable attention mechanism, P generally represents the number of points sampled at each query or location, and deformable attention allows the model to dynamically adjust the sampling location according to the characteristics of the input data, so that relevant information can be more accurately captured, and P can determine the degree of refinement of the sampling at each query or location, which has a certain effect on the performance of the model.
Referring to fig. 6, a schematic diagram of a stitching feature map when l=4 is provided in an embodiment of the present invention.
As an alternative embodiment, the relationship between the first input data and the second input data is that the first input data In1 is formed by stitching L feature maps, where feature map l has size N[l]×M×D, N[l] = In2[l][0] × In2[l][1], and 0 <= l < L.
The relationship between the first input data In1 and the third input data In3 is that the third input data In3 represents a starting position of each of the L input feature maps In the first input data In 1.
The relationship between the second input data In2 and the third input data In3 is In3[l] = In3[l−1] + In2[l−1][0] × In2[l−1][1] for 0 < l < L, with In3[0] = 0.
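Under the layout above, the starting positions can be computed with this recurrence (a sketch; shapes is a hypothetical list of (height, width) pairs taken from In2):

```python
def start_positions(shapes):
    # In3[0] = 0; In3[l] = In3[l-1] + H_{l-1} * W_{l-1}
    starts = [0]
    for h, w in shapes[:-1]:
        starts.append(starts[-1] + h * w)
    return starts
```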
As an alternative embodiment, transferring the second, third and fourth input data to the AI Core includes: transferring second transfer data of size [L, 2] from the second input data at start position 0 of the external storage area to the AI Core; transferring third transfer data of size [L] from the third input data at start position 0 of the external storage area to the AI Core; and transferring fourth transfer data of size [M, L, P, 2] from the fourth input data at start position coreIdx×M×L×P×2 of the external storage area to the AI Core.
Step S504, determining the width and the height of a target feature map according to the second input data, determining the initial position of the target feature map in the spliced feature map according to the third input data, and determining the coordinates of target interpolation points according to the fourth input data, wherein the spliced feature map is obtained by splicing the input feature maps of all attention operators, the target feature map is contained in the spliced feature map, and the target interpolation points are contained in the target feature map.
As an alternative embodiment, determining the width and the height of the target feature map according to the second input data comprises transforming the dimension of the second transfer data from [ L,2] to [2, L, P, M ], and splitting the second transfer data after the dimension transformation into the width and the height of the target feature map according to the highest dimension, wherein the width of the target feature map is [ L, P, M ], and the height of the target feature map is [ L, P, M ].
As an alternative embodiment, determining the position of the target feature map in the stitched feature map based on the third input data further comprises transforming the dimension of the third transfer data from [ L ] to [ L, P, M ].
As an alternative embodiment, determining the coordinates of the target interpolation point according to the fourth input data comprises transforming the dimension of the fourth transfer data from [ M, L, P,2] to [2, L, P, M ], and splitting the fourth transfer data after dimension transformation into the first coordinates and the second coordinates of the target interpolation point according to the highest dimension, wherein the first coordinates of the target interpolation point are [ L, P, M ], and the second coordinates of the target interpolation point are [ L, P, M ].
Specifically, it is further necessary to sort and dimension transform the transfer data loaded into the UB memory. The method comprises the following steps:
Transforming the dimension of the second transfer data from [ L,2] to [2, L, P, M ], and splitting into two input data according to the highest dimension, which are respectively denoted by spatialH (the size is [ L, P, M ]) and spatialW (the size is [ L, P, M ]);
transforming the dimension of the third transfer data from [ L ] to [ L, P, M ];
Converting the dimension of the fourth transfer data from [ M, L, P,2] to [2, L, P, M ], and splitting the fourth transfer data into two input data according to the highest dimension, wherein the two input data are respectively denoted by locW (with the size of [ L, P, M ]) and locH (with the size of [ L, P, M ]);
In addition, a one-dimensional array of length M needs to be created, storing (1, 2, 3, …, M), denoted mIdx; its dimension is then transformed from [M] to [L, P, M].
Step S506, at least four reference coordinates of the target interpolation point are obtained according to the coordinates of the target interpolation point.
As an alternative embodiment, obtaining at least four reference coordinates of the target interpolation point according to the coordinates of the target interpolation point includes:

(⌊locH⌋, ⌊locW⌋), (⌊locH⌋, ⌊locW⌋ + 1), (⌊locH⌋ + 1, ⌊locW⌋), (⌊locH⌋ + 1, ⌊locW⌋ + 1)

Wherein ⌊·⌋ is the floor function, locH is the first coordinate of the target interpolation point, locW is the second coordinate of the target interpolation point, (⌊locH⌋, ⌊locW⌋) is the first reference coordinate of the target interpolation point, (⌊locH⌋, ⌊locW⌋ + 1) the second reference coordinate, (⌊locH⌋ + 1, ⌊locW⌋) the third reference coordinate, and (⌊locH⌋ + 1, ⌊locW⌋ + 1) the fourth reference coordinate.
Step S508, determining indexes of at least four reference interpolations corresponding to the at least four reference coordinates according to the width and the height of the target feature map, the initial position of the target feature map in the spliced feature map, the coordinates of the target interpolation point and the at least four reference coordinates of the target interpolation point, and conveying the at least four reference interpolations to the AI Core from the first input data according to the indexes.
As an optional embodiment, determining the indexes of the at least four reference interpolations corresponding to the at least four reference coordinates according to the width and height of the target feature map, the position of the target feature map in the stitched feature map, the coordinates of the target interpolation point, and the at least four reference coordinates of the target interpolation point includes:

idx1 = (startIdx + hLow × spatialW + wLow) × M + mIdx
idx2 = (startIdx + hLow × spatialW + wHigh) × M + mIdx
idx3 = (startIdx + hHigh × spatialW + wLow) × M + mIdx
idx4 = (startIdx + hHigh × spatialW + wHigh) × M + mIdx

Wherein spatialH is the height of the target feature map, spatialW is the width of the target feature map, hLow is the index of the first and second reference coordinates in the height of the target feature map, hHigh is the index of the third and fourth reference coordinates in the height of the target feature map, wLow is the index of the first and third reference coordinates in the width of the target feature map, wHigh is the index of the second and fourth reference coordinates in the width of the target feature map, startIdx is the starting position of the target feature map obtained from the third input data, coreIdx is the index of the AI Core, mIdx is the index of the attention head in the self-attention mechanism, and idx1, idx2, idx3, idx4 are the indexes of the first, second, third, and fourth reference interpolations corresponding to the first, second, third, and fourth reference coordinates.
It should be noted that this calculation executes SIMD vectorized parallelism with parallelism L×P×M, and the dimensions of the 4 reference point coordinates are all [L, P, M].
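Assuming the stitched feature map is stored row-major as [S, M, D] within one batch sample, the four indexes (counted in units of D-length vectors) can be sketched as follows (an illustrative reconstruction; the function and parameter names are assumptions, not the patent's own notation):

```python
def ref_indices(start_idx, spatial_w, h_low, w_low, m_idx, M):
    """Indexes of the 4 reference points (upper-left, upper-right, lower-left, lower-right)."""
    def idx(h, w):
        return (start_idx + h * spatial_w + w) * M + m_idx
    return (idx(h_low, w_low), idx(h_low, w_low + 1),
            idx(h_low + 1, w_low), idx(h_low + 1, w_low + 1))
```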
Referring to fig. 7, another flow chart of the optimization method of attention operator based on the rising AI processor according to the embodiment of the invention is shown.
Step S702, calculating at least four reference interpolation weights corresponding to the at least four reference interpolations, respectively.
As an alternative embodiment, calculating the at least four reference interpolation weights corresponding to the at least four reference interpolations respectively includes, with lh = locH − ⌊locH⌋ and lw = locW − ⌊locW⌋:

w1 = (1 − lh) × (1 − lw)
w2 = (1 − lh) × lw
w3 = lh × (1 − lw)
w4 = lh × lw

Wherein w1 is the weight of the first reference interpolation, w2 the weight of the second reference interpolation, w3 the weight of the third reference interpolation, w4 the weight of the fourth reference interpolation, lh is the offset of the target interpolation point in the height of the target feature map, and lw is the offset of the target interpolation point in the width of the target feature map.
In this step, the dimensions of w1, w2, w3, and w4 are expanded from [L, P, M] to [L, P, M, D].
Step S704, determining a target interpolation of the target interpolation point according to the at least four reference interpolations and the at least four reference interpolation weights.
As an alternative embodiment, determining the target interpolation of the target interpolation point based on the at least four reference interpolations and the at least four reference interpolation weights includes:

val = w1 × v1 + w2 × v2 + w3 × v3 + w4 × v4

Wherein val is the target interpolation of the target interpolation point, and v1, v2, v3, v4 are the first, second, third, and fourth reference interpolations.
In this step, val is subjected to a dimensional transformation from [ L, P, M, D ] to [ M, D, L, P ].
Step S706, the fifth input data is carried to the AI Core, and the target interpolation weight of the target interpolation point is determined according to the fifth input data.
As an alternative embodiment, the transporting of the fifth input data to the AI Core includes:
from the fifth input data at the coreIdx x M x L x P start position of the external storage area, the fifth transfer data of size [ M, L, P ] is transferred to AI Core.
Step S708, determining the output value of the attention operator according to the target interpolation and the target interpolation weight.
As an alternative embodiment, in this step val (dimension [M, D, L, P]) is taken as input one, the target interpolation weight obtained from the fifth input data in step S706 is taken as input two, and a matrix multiplication of the form [D, L×P] × [L×P] → [D] is performed, finally obtaining the output value with dimension [M, D].
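This final step can be sketched per AI Core as follows (a NumPy reference only; the reshape mirrors the [M, D, L, P] layout described above, and the random inputs are illustrative):

```python
import numpy as np

M, D, L, P = 8, 32, 4, 4
rng = np.random.default_rng(1)
val = rng.random((M, D, L, P))     # target interpolations, dimension [M, D, L, P]
attn_w = rng.random((M, L, P))     # target interpolation weights, dimension [M, L, P]

# Per head: [D, L*P] x [L*P] -> [D]; stacking over heads gives the [M, D] output
out = np.stack([val[m].reshape(D, L * P) @ attn_w[m].reshape(L * P) for m in range(M)])
```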
Step S710, the output value is carried out of the AI Core.
As an alternative embodiment, the output value is carried to the output data external storage area starting at position coreIdx×M×D.
It should be noted that the Ascend AI processor used in the tests of the embodiment of the present invention is the Ascend 910ProB, and the development tool uses CANN 7.0.0 RC1. In addition, the optimization method of the attention operator based on the Ascend AI processor can use other versions according to actual requirements.
Referring to Table 1, the operating efficiencies of the multi-scale deformable attention operator on the Ascend 910ProB are compared.
Specifically, the AI CPU implementation represents decomposing the multi-scale deformable attention operator on the Ascend AI processor into a set of operations with grid sampling as the core; grid sampling, as the core operation, can only be run on the AI CPU.
The AI Core implementation represents the AI Core end-to-end implementation of the multi-scale deformable attention operator proposed by this scheme; [bs, S, lq, M, D, L, P] represents the different input sizes.
TABLE 1
As can be seen from Table 1, this scheme brings approximately 10× acceleration, greatly improving the operating efficiency of the operator on the Ascend AI processor.
Referring to Table 2, the operating speeds of a model using the multi-scale deformable attention operator on the Ascend 910ProB are compared.
TABLE 2
Specifically, implementation one represents decomposing the multi-scale deformable attention operator on the Ascend AI processor into a set of operations with grid sampling as the core; grid sampling, as the core operation, can only be run on the AI CPU.
Implementation two represents the AI Core end-to-end implementation of the multi-scale deformable attention operator proposed by this scheme.
GroundingDINO in Table 2 is a target detection model based on the Transformer architecture. The model comprises 6 layers of multi-scale deformable attention operators in each of the Encoder and Decoder modules, where the input size of the multi-scale deformable attention operator in the Encoder module is [bs, S, lq, M, D, L, P] = [1, 34000, 34000, 8, 32, 4, 4], and that in the Decoder module is [bs, S, lq, M, D, L, P] = [1, 34000, 900, 8, 32, 4, 4].
As can be seen from Table 2, by adopting this scheme, the overall running speed of the model is improved by approximately 6×.
As can be seen from the above, the present invention provides a method for optimizing an attention operator based on an Ascend AI processor, which uses a data handling strategy to carry the input data of the multi-scale deformable attention operator (the second, third, and fourth input data, and the subsequently required part of the first input data) into the AI Core (the core acceleration unit of the Ascend AI processor), so as to reduce the overhead and latency of data movement. The position and size of the target feature map and the coordinates of the target interpolation point are determined directly in the AI Core using the second, third, and fourth input data, and the reference coordinates and corresponding reference interpolation indexes required for the target interpolation point are calculated, avoiding the need to move a large amount of data to the CPU. The target interpolation and its weight are calculated directly on the AI Core through parallel computation and a local interpolation strategy; meanwhile, the fifth input data related to the target interpolation weight is also carried to the AI Core, so that the calculation of the target interpolation weight is completed inside the AI Core, reducing CPU intervention and improving overall efficiency. Finally, the output value of the multi-scale deformable attention operator is computed in the AI Core and carried out to the subsequent processing unit, reducing the number of data transfers between the inside and outside of the processor and improving overall processing efficiency.
It should be noted that, the method of the embodiment of the present invention may be performed by a single device, for example, a computer or a server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the method of an embodiment of the present invention, the devices interacting with each other to accomplish the method.
It should be noted that the foregoing describes some embodiments of the present invention. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same inventive concept, and corresponding to the method provided by any of the above embodiments, the invention also provides an optimization system for an attention operator based on the Ascend AI processor.
FIG. 3 is a schematic diagram of an optimization system for an attention operator based on the Ascend AI processor according to an embodiment of the invention.
The optimization system 300 of attention operator based on the rising AI processor includes:
an input module 302 configured to obtain first, second, third, fourth, and fifth input data of an attention operator, and to carry the second, third, and fourth input data to an AI Core;
The data preparation module 304 is configured to determine the width and the height of the target feature map according to the second input data, determine the starting position of the target feature map in the spliced feature map according to the third input data, and determine the coordinates of the target interpolation point according to the fourth input data, where the spliced feature map is obtained by splicing the input feature maps of all attention operators, the target feature map is included in the spliced feature map, and the target interpolation point is included in the target feature map;
The interpolation calculation module 306 is configured to obtain at least four reference coordinates of the target interpolation point according to the coordinates of the target interpolation point, determine indexes of at least four reference interpolations corresponding to the at least four reference coordinates according to the width and the height of the target feature map, the starting position of the target feature map in the spliced feature map, the coordinates of the target interpolation point and the at least four reference coordinates of the target interpolation point, and transfer the at least four reference interpolations from the first input data to the AI Core according to the indexes.
Optionally, the interpolation computation module 306 is further configured to:
respectively calculating at least four reference interpolation weights corresponding to the at least four reference interpolations;
determining target interpolation of a target interpolation point according to at least four reference interpolation values and at least four reference interpolation weights;
Carrying the fifth input data to an AI Core, and determining a target interpolation weight of a target interpolation point according to the fifth input data;
Determining an output value of an attention operator according to the target interpolation and the target interpolation weight;
and carrying the output value out of the AI Core.
Optionally, the input module 302 is further configured to:
Storing the first input data, the second input data, the third input data, the fourth input data and the fifth input data in an external storage area of the AI Core, wherein: the first input data comprises a spliced feature map formed by splicing L input feature maps, of size [bs, S, M, D]; the second input data comprises the width and height of each of the L input feature maps, of size [L, 2]; the third input data comprises the starting position of each of the L input feature maps within the spliced feature map, of size [L]; the fourth input data comprises the coordinates of the interpolation points sampled on the input feature maps of the first input data, of size [bs, Lq, M, L, P, 2]; and the fifth input data comprises the attention weights corresponding to the coordinates of those sampled interpolation points, of size [bs, Lq, M, L, P]. Here bs denotes the number of samples used in one training pass, S the length of the input feature map, M the number of attention heads in the self-attention mechanism, D the hidden dimension, L the number of input feature maps, Lq the ratio of the encoder feature map length to the decoder Query length, and P the number of target interpolation points.
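As a concrete illustration of these layouts, the five inputs can be sketched in NumPy. The dimension values below (bs, M, D, L, Lq, P) and the per-level shapes are arbitrary stand-ins chosen for illustration, not values prescribed by the method:

```python
import numpy as np

# Hypothetical sizes, for illustration only.
bs, M, D, L, Lq, P = 1, 8, 32, 4, 100, 4

# Per-level spatial (height, width); S is the total length of the spliced feature map.
shapes = np.array([[32, 32], [16, 16], [8, 8], [4, 4]])   # second input data, [L, 2]
S = int((shapes[:, 0] * shapes[:, 1]).sum())

value = np.random.rand(bs, S, M, D)                       # first input data: spliced feature map
# Third input data: start of each level inside the spliced feature map.
start_index = np.concatenate(([0], np.cumsum(shapes[:, 0] * shapes[:, 1])[:-1]))
sampling_loc = np.random.rand(bs, Lq, M, L, P, 2)         # fourth input data: interpolation-point coords
attn_weight = np.random.rand(bs, Lq, M, L, P)             # fifth input data: attention weights
```

With these per-level shapes, S = 1360 and the starting positions are [0, 1024, 1280, 1344], illustrating how the third input data is derived from the second.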
Optionally, the interpolation computation module 306 is further configured to:
carrying second transfer data of size [L, 2] from the 0 starting position of the second input data in the external storage area to the AI Core;
carrying third transfer data of size [L] from the 0 starting position of the third input data in the external storage area to the AI Core;
carrying fourth transfer data of size [M, L, P, 2] from the coreIdx*M*L*P*2 starting position of the fourth input data in the external storage area to the AI Core.
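These per-core starting positions are plain linear element offsets (coreIdx*M*L*P*2 for the fourth input data and, as described later, coreIdx*M*L*P for the fifth), so each AI Core picks up a disjoint slice of the sampled coordinates and weights. A minimal sketch, with hypothetical sizes:

```python
def transfer_offsets(core_idx: int, M: int, L: int, P: int):
    """Element offsets of the per-core slices in the external storage area:
    the fourth input data advances by M*L*P*2 elements per AI Core (two
    coordinates per interpolation point), the fifth by M*L*P elements."""
    return core_idx * M * L * P * 2, core_idx * M * L * P

# Hypothetical example: core index 3 with M=8 heads, L=4 levels, P=4 points.
off4, off5 = transfer_offsets(3, 8, 4, 4)
```

Here off4 = 768 and off5 = 384, i.e. the coordinate slice is exactly twice as long as the weight slice, matching the trailing 2 in [M, L, P, 2].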
Optionally, the data preparation module 304 is further configured to:
transforming the dimensions of the second transfer data from [L, 2] to [2, L, P, M], and splitting the dimension-transformed second transfer data along the highest dimension into the width and the height of the target feature map, wherein the width of the target feature map has size [L, P, M] and the height of the target feature map has size [L, P, M].
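One way to realize this [L, 2] → [2, L, P, M] transform and highest-dimension split is a transpose followed by a broadcast. The NumPy sketch below assumes the P and M axes are filled by repeating the per-level values, and that each row of the second transfer data stores (width, height); the in-row order is an assumption, not fixed by the text:

```python
import numpy as np

def split_width_height(second_transfer: np.ndarray, P: int, M: int):
    """second_transfer: [L, 2], each row assumed to be (width, height).
    Returns (width, height), each of size [L, P, M]."""
    L = second_transfer.shape[0]
    t = second_transfer.T                                       # [L, 2] -> [2, L]
    t = np.broadcast_to(t[:, :, None, None], (2, L, P, M))      # -> [2, L, P, M]
    width, height = t[0], t[1]                                  # split along highest dim
    return width, height

w, h = split_width_height(np.array([[32, 16], [8, 4]]), P=2, M=3)
```

The same transpose-and-broadcast pattern applies to the [M, L, P, 2] → [2, L, P, M] transform of the fourth transfer data described below.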
Optionally, the data preparation module 304 is further configured to:
The dimension of the third transfer data is transformed from [ L ] to [ L, P, M ].
Optionally, the data preparation module 304 is further configured to:
transforming the dimensions of the fourth transfer data from [M, L, P, 2] to [2, L, P, M], and splitting the dimension-transformed fourth transfer data along the highest dimension into a first coordinate and a second coordinate of the target interpolation point, wherein the first coordinate of the target interpolation point has size [L, P, M] and the second coordinate of the target interpolation point has size [L, P, M].
Optionally, the interpolation computation module 306 is further configured to:
compute h_low = floor(y), w_low = floor(x), h_high = h_low + 1 and w_high = w_low + 1, where floor(.) is the rounding-down function and x and y are the first and second coordinates of the target interpolation point (taken here to run along the width and height of the target feature map, respectively); the first reference coordinate of the target interpolation point is then (h_low, w_low), the second reference coordinate is (h_low, w_high), the third reference coordinate is (h_high, w_low), and the fourth reference coordinate is (h_high, w_high).
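The formula block above (whose equation images did not survive extraction) describes the standard four bilinear neighbors of a sampling point; a sketch of that computation, in which the assignment of the first/second coordinate to the width/height axis is an assumption:

```python
import math

def reference_coords(x: float, y: float):
    """Four reference coordinates of a target interpolation point (x, y),
    assuming x runs along the width and y along the height:
    (h_low, w_low), (h_low, w_high), (h_high, w_low), (h_high, w_high)."""
    h_low, w_low = math.floor(y), math.floor(x)
    h_high, w_high = h_low + 1, w_low + 1
    return (h_low, w_low), (h_low, w_high), (h_high, w_low), (h_high, w_high)

p1, p2, p3, p4 = reference_coords(5.7, 2.3)  # -> (2, 5), (2, 6), (3, 5), (3, 6)
```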
Optionally, the interpolation computation module 306 is further configured to:
compute the index formulas, which take as inputs the height of the target feature map, the width of the target feature map, the index shared by the first and second reference coordinates along the height of the target feature map, the index shared by the third and fourth reference coordinates along the height, the index shared by the first and third reference coordinates along the width, the index shared by the second and fourth reference coordinates along the width, the second input data, the index of the AI Core, and the index of the attention head in the self-attention mechanism; from these they yield the index of the first reference interpolation corresponding to the first reference coordinate, the index of the second reference interpolation corresponding to the second reference coordinate, the index of the third reference interpolation corresponding to the third reference coordinate, and the index of the fourth reference interpolation corresponding to the fourth reference coordinate.
Optionally, the interpolation computation module 306 is further configured to:
compute, with l_h = y - floor(y) and l_w = x - floor(x), where y and x are the indexes of the target interpolation point along the height and width of the target feature map respectively: the weight of the first reference interpolation as (1 - l_h)*(1 - l_w), the weight of the second reference interpolation as (1 - l_h)*l_w, the weight of the third reference interpolation as l_h*(1 - l_w), and the weight of the fourth reference interpolation as l_h*l_w.
Optionally, the interpolation computation module 306 is further configured to:
compute the target interpolation of the target interpolation point as v = w1*v1 + w2*v2 + w3*v3 + w4*v4, where v1, v2, v3 and v4 are the first, second, third and fourth reference interpolations and w1, w2, w3 and w4 are their respective weights.
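Taken together, the weight and combination formulas are ordinary bilinear interpolation; a scalar sketch, where v1..v4 are the four reference interpolations fetched at (h_low, w_low), (h_low, w_high), (h_high, w_low) and (h_high, w_high):

```python
import math

def bilinear(v1, v2, v3, v4, x, y):
    """Target interpolation at (x, y) from the four reference interpolations."""
    lh, lw = y - math.floor(y), x - math.floor(x)  # fractional offsets
    w1 = (1 - lh) * (1 - lw)   # weight of the first reference interpolation
    w2 = (1 - lh) * lw         # weight of the second reference interpolation
    w3 = lh * (1 - lw)         # weight of the third reference interpolation
    w4 = lh * lw               # weight of the fourth reference interpolation
    return w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4

# At the midpoint of a unit cell all four weights are 0.25, so the result
# is the average of the four corners: (1 + 2 + 3 + 4) / 4 = 2.5.
val = bilinear(1.0, 2.0, 3.0, 4.0, x=0.5, y=0.5)
```

Because the four weights always sum to 1, interpolating a constant field returns that constant, which is a quick sanity check for any implementation.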
Optionally, the interpolation computation module 306 is further configured to:
carrying fifth transfer data of size [M, L, P] from the coreIdx*M*L*P starting position of the fifth input data in the external storage area to the AI Core.
According to the above optimization system for an attention operator based on an Ascend AI processor, a data handling strategy carries the input data of the multi-scale deformable attention operator (the second, third and fourth input data, together with the subsequently required part of the first input data) into the AI Core (the core acceleration unit of the Ascend AI processor), reducing the overhead and latency of data movement. The position and size of the target feature map and the coordinates of the target interpolation point are determined directly in the AI Core from the second, third and fourth input data, and the reference coordinates required by the target interpolation point, together with the indexes of the corresponding reference interpolations, are computed there as well, avoiding the need to move large amounts of data to the CPU (Central Processing Unit). Through parallel computation and a local interpolation strategy, the target interpolation and the weights of the target interpolation point are calculated directly on the AI Core; the fifth input data, which relates to the target interpolation weight, is likewise carried to the AI Core, where the calculation of the target interpolation weight is completed, reducing CPU intervention and improving overall efficiency. Finally, the output value of the multi-scale deformable attention operator is computed in the AI Core and carried out to the subsequent processing unit, reducing the amount of data transferred between the inside and the outside of the processor and improving overall processing efficiency.
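Viewed numerically, the final combination performed in the AI Core reduces to a weighted sum of target interpolations with the target interpolation weights from the fifth input data. A NumPy sketch for a single query, with hypothetical stand-in shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, P, D = 2, 3, 4, 5                 # heads, levels, points, hidden dims (illustrative)
sampled = rng.random((M, L, P, D))      # target interpolations, one per head/level/point
weights = rng.random((M, L, P))         # target interpolation weights (fifth input data)

# Output value of the operator for this query: per-point weighting,
# then summation over the level and point axes.
out = (sampled * weights[..., None]).sum(axis=(1, 2))   # [M, D]
```

The broadcasted multiply-and-sum is equivalent to contracting the level and point axes, which is exactly the reduction the operator performs before the output value is carried out of the AI Core.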
For convenience of description, the above system is described as being divided by function into various modules. Of course, when implementing the present invention, the functions of the modules may be implemented in one or more pieces of software and/or hardware.
The system of the foregoing embodiment is configured to implement the corresponding method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, and corresponding to the method of any of the above embodiments, the present invention also provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of the above embodiments when executing the program.
Fig. 8 shows a more specific hardware architecture of an electronic device provided by the present embodiment, which may include a processor 810, a memory 820, an input/output interface 830, a communication interface 840, and a bus 850. Wherein processor 810, memory 820, input/output interface 830, and communication interface 840 enable communication connections among each other within the device via bus 850.
The processor 810 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present specification.
The memory 820 may be implemented in the form of ROM (Read-Only Memory), RAM (Random Access Memory), static storage, dynamic storage, etc. The memory 820 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present specification are implemented in software or firmware, the relevant program code is stored in the memory 820 and invoked by the processor 810 for execution.
The input/output interface 830 is used for connecting with an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown in the figure) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
The communication interface 840 is used to connect a communication module (not shown in the figure) to enable communication interaction between the device and other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 850 includes a path to transfer information between components of the device (e.g., processor 810, memory 820, input/output interface 830, and communication interface 840).
It should be noted that although the above-described device only shows processor 810, memory 820, input/output interface 830, communication interface 840, and bus 850, in an implementation, the device may include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The electronic device of the foregoing embodiment is configured to implement the corresponding method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, the present invention also provides a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method according to any of the above embodiments, corresponding to the method according to any of the above embodiments.
The non-transitory computer-readable storage medium described above can be any available medium or data storage device that can be accessed by a computer, including, but not limited to, magnetic storage (e.g., floppy disks, hard disks, magnetic tapes, magneto-optical disks (MO), etc.), optical storage (e.g., CD, DVD, BD, HVD, etc.), and semiconductor storage (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid-state disk (SSD)), etc.
The storage medium of the above embodiments stores computer instructions for causing the computer to perform the method described in any of the above exemplary method portions, and has the advantages of the corresponding method embodiments, which are not described herein.
Furthermore, although the operations of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may change their order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with appropriate combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
It should be noted that unless otherwise defined, technical or scientific terms used in the embodiments of the present invention should be given the ordinary meaning as understood by one of ordinary skill in the art to which the present invention belongs. The terms "first," "second," and the like, as used in embodiments of the present invention, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
While the spirit and principles of the present invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments; nor does the division into aspects imply that features in those aspects cannot be used to advantage in combination, such division being merely for convenience of description. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

Claims (15)

1.一种基于昇腾AI处理器的注意力算子的优化方法,其特征在于,包括:1. A method for optimizing an attention operator based on an Ascend AI processor, comprising: 获取注意力算子的第一输入数据、第二输入数据、第三输入数据、第四输入数据以及第五输入数据,将所述第二输入数据、所述第三输入数据以及所述第四输入数据搬运到AICore;Obtain the first input data, the second input data, the third input data, the fourth input data, and the fifth input data of the attention operator, and transfer the second input data, the third input data, and the fourth input data to AICore; 根据所述第二输入数据确定目标特征图的宽度和高度,根据所述第三输入数据确定所述目标特征图在拼接特征图的起始位置,根据所述第四输入数据确定目标插值点的坐标;其中,所述拼接特征图是将所有所述注意力算子的输入特征图进行拼接得到的,所述目标特征图包含于所述拼接特征图,所述目标插值点包含于所述目标特征图;Determine the width and height of the target feature map according to the second input data, determine the starting position of the target feature map in the spliced feature map according to the third input data, and determine the coordinates of the target interpolation point according to the fourth input data; wherein the spliced feature map is obtained by splicing the input feature maps of all the attention operators, the target feature map is included in the spliced feature map, and the target interpolation point is included in the target feature map; 根据所述目标插值点的坐标得到所述目标插值点的至少四个参考坐标;Obtaining at least four reference coordinates of the target interpolation point according to the coordinates of the target interpolation point; 根据所述目标特征图的宽度和高度、所述目标特征图在拼接特征图的起始位置、所述目标插值点的坐标以及所述目标插值点的至少四个参考坐标确定与所述至少四个参考坐标相应的至少四个参考插值的索引,并根据所述索引从所述第一输入数据中将所述至少四个参考插值搬运到AI Core。Determine the indexes of at least four reference interpolations corresponding to the at least four reference coordinates according to the width and height of the target feature map, the starting position of the target feature map in the spliced feature map, the coordinates of the target interpolation point, and the at least four reference coordinates of the target interpolation point, and move the at least four reference interpolations from the first input data to the AI Core according to the indexes. 2. 
根据权利要求1所述的基于昇腾AI处理器的注意力算子的优化方法,其特征在于,在所述根据所述索引从所述第一输入数据中将所述至少四个参考插值搬运到AI Core之后,所述方法还包括:2. The method for optimizing the attention operator based on the Ascend AI processor according to claim 1, characterized in that after transferring the at least four reference interpolations from the first input data to the AI Core according to the index, the method further comprises: 分别计算与所述至少四个参考插值相应的至少四个参考插值权重;respectively calculating at least four reference interpolation weights corresponding to the at least four reference interpolation values; 根据所述至少四个参考插值以及所述至少四个参考插值权重确定所述目标插值点的目标插值;Determining a target interpolation value of the target interpolation point according to the at least four reference interpolation values and the at least four reference interpolation weights; 将所述第五输入数据搬运到AI Core,根据所述第五输入数据确定所述目标插值点的目标插值权重;The fifth input data is transferred to the AI Core, and a target interpolation weight of the target interpolation point is determined according to the fifth input data; 根据所述目标插值以及所述目标插值权重确定所述注意力算子的输出值;Determining an output value of the attention operator according to the target interpolation and the target interpolation weight; 将所述输出值搬出所述AI Core。Move the output value out of the AI Core. 3.根据权利要求2所述的基于昇腾AI处理器的注意力算子的优化方法,其特征在于,所述获取注意力算子的第一输入数据、第二输入数据、第三输入数据、第四输入数据以及第五输入数据,包括:3. 
The method for optimizing the attention operator based on the Ascend AI processor according to claim 2, wherein the step of obtaining the first input data, the second input data, the third input data, the fourth input data, and the fifth input data of the attention operator comprises: 将所述第一输入数据、所述第二输入数据、所述第三输入数据、所述第四输入数据以及所述第五输入数据存储至所述AI Core的外部存储区域;其中,所述第一输入数据包括由L个输入特征图拼接而成的拼接特征图,大小为 [bs,S,M,D];所述第二输入数据包括L个输入特征图中,每个输入特征图的宽度和高度,大小为 [L,2];所述第三输入数据包括L个输入特征图中,每个输入特征图在所述拼接特征图中的起始位置,大小为 [L];所述第四输入数据包括从所述第一输入数据的输入特征图上进行采样的插值点的坐标,大小为 [bs,Lq,M,L,P,2];所述第五输入数据包括与所述第一输入数据的输入特征图上进行采样的插值点的坐标相应的注意力权重,大小为 [bs,Lq,M,L,P];bs表示一次训练过程中使用的样本数量,S表示输入特征图的长度,M表示自注意力机制中注意力头的数量,D表示隐藏维度,L表示输入特征图的个数,Lq表示编码器特征图长度与解码器Query长度的比值,P表示目标插值点的个数。The first input data, the second input data, the third input data, the fourth input data and the fifth input data are stored in the external storage area of the AI Core; wherein the first input data includes a concatenated feature map formed by concatenating L input feature maps, and the size is [bs, S, M, D]; the second input data includes the width and height of each input feature map in the L input feature maps, and the size is [L, 2]; the third input data includes the starting position of each input feature map in the concatenated feature map in the L input feature maps, and the size is [L]; the fourth input data includes the coordinates of the interpolation points sampled from the input feature map of the first input data, and the size is [bs, Lq, M, L, P, 2]; the fifth input data includes the attention weights corresponding to the coordinates of the interpolation points sampled on the input feature map of the first input data, and the size is [bs,Lq,M,L,P]; bs represents the number of samples used in a training process, S represents the length of the input feature map, M represents the number of attention heads in the self-attention mechanism, D represents the hidden dimension, L represents the number of input feature maps, Lq represents the ratio of 
the encoder feature map length to the decoder Query length, and P represents the number of target interpolation points. 4. 根据权利要求3所述的基于昇腾AI处理器的注意力算子的优化方法,其特征在于,所述将所述第二输入数据、所述第三输入数据以及所述第四输入数据搬运到AI Core,包括:4. The method for optimizing the attention operator based on the Ascend AI processor according to claim 3, wherein the step of transferring the second input data, the third input data, and the fourth input data to the AI Core comprises: 从所述第二输入数据位于所述外部存储区域的0起始位置处,搬运大小为[L,2]的第二转移数据到AI Core;From the second input data located at the starting position 0 of the external storage area, transfer the second transfer data of size [L, 2] to the AI Core; 从所述第三输入数据位于所述外部存储区域的0起始位置处,搬运大小为[L]的第三转移数据到AI Core;From the third input data located at the starting position 0 of the external storage area, transfer the third transfer data of size [L] to the AI Core; 从所述第四输入数据位于所述外部存储区域的coreIdx*M*L*P*2起始位置处,搬运大小为[M,L,P,2]的第四转移数据到AI Core。From the fourth input data located at the starting position of coreIdx*M*L*P*2 in the external storage area, the fourth transfer data of size [M, L, P, 2] is transferred to the AI Core. 5.根据权利要求4所述的基于昇腾AI处理器的注意力算子的优化方法,其特征在于,所述根据所述第二输入数据确定目标特征图的宽度和高度,包括:5. The method for optimizing the attention operator based on the Ascend AI processor according to claim 4, wherein determining the width and height of the target feature map according to the second input data comprises: 将所述第二转移数据的维度由[L,2]变换为[2,L,P,M],按照最高维度将维度变换后的第二转移数据拆分为所述目标特征图的宽度以及高度;其中,所述目标特征图的宽度大小为[L,P,M],所述目标特征图的高度大小为[L,P,M]。The dimension of the second transfer data is transformed from [L, 2] to [2, L, P, M], and the dimensionally transformed second transfer data is split into the width and height of the target feature map according to the highest dimension; wherein the width of the target feature map is [L, P, M], and the height of the target feature map is [L, P, M]. 6.根据权利要求5所述的基于昇腾AI处理器的注意力算子的优化方法,其特征在于,所述根据所述第三输入数据确定所述目标特征图在拼接特征图的位置,还包括:6. 
The method for optimizing the attention operator based on the Ascend AI processor according to claim 5, characterized in that the step of determining the position of the target feature map in the spliced feature map according to the third input data further comprises: 将所述第三转移数据的维度由[L]变换为[L,P,M]。The dimension of the third transfer data is transformed from [L] to [L, P, M]. 7.根据权利要求6所述的基于昇腾AI处理器的注意力算子的优化方法,其特征在于,所述根据所述第四输入数据确定目标插值点的坐标,包括:7. The method for optimizing the attention operator based on the Ascend AI processor according to claim 6, wherein determining the coordinates of the target interpolation point according to the fourth input data comprises: 将所述第四转移数据的维度由[M,L,P,2]变换为[2,L,P,M],按照最高维度将维度变换后的第四转移数据拆分为所述目标插值点的第一坐标以及第二坐标;其中,所述目标插值点的第一坐标大小为[L,P,M],所述目标插值点的第二坐标大小为[L,P,M]。The dimension of the fourth transfer data is transformed from [M, L, P, 2] to [2, L, P, M], and the dimensionally transformed fourth transfer data is split into a first coordinate and a second coordinate of the target interpolation point according to the highest dimension; wherein the size of the first coordinate of the target interpolation point is [L, P, M], and the size of the second coordinate of the target interpolation point is [L, P, M]. 8.根据权利要求7所述的基于昇腾AI处理器的注意力算子的优化方法,其特征在于,所述根据所述目标插值点的坐标得到所述目标插值点的至少四个参考坐标,包括:8. 
The method for optimizing the attention operator based on the Ascend AI processor according to claim 7, characterized in that the step of obtaining at least four reference coordinates of the target interpolation point according to the coordinates of the target interpolation point comprises: 其中,为取整函数,为目标插值点的第一坐标,为目标插值点的第二坐标,为目标插值点的第一参考坐标,为目标插值点的第二参考坐标,为目标插值点的第三参考坐标,为目标插值点的第四参考坐标。in, is the rounding function, is the first coordinate of the target interpolation point, is the second coordinate of the target interpolation point, is the first reference coordinate of the target interpolation point, is the second reference coordinate of the target interpolation point, is the third reference coordinate of the target interpolation point, The fourth reference coordinate of the target interpolation point. 9.根据权利要求8所述的基于昇腾AI处理器的注意力算子的优化方法,其特征在于,所述根据所述目标特征图的宽度和高度、所述目标特征图在拼接特征图的位置、所述目标插值点的坐标以及所述目标插值点的至少四个参考坐标确定与所述至少四个参考坐标相应的至少四个参考插值的索引,包括:9. The method for optimizing the attention operator based on the Ascend AI processor according to claim 8, characterized in that the step of determining the indexes of at least four reference interpolations corresponding to the at least four reference coordinates according to the width and height of the target feature map, the position of the target feature map in the spliced feature map, the coordinates of the target interpolation point, and the at least four reference coordinates of the target interpolation point comprises: 其中,为目标特征图的高度,为目标特征图的宽度,为第一参考坐标以及第二参考坐标在目标特征图的高度上的索引,为第三参考坐标以及第四参考坐标在目标特征图的高度上的索引,为第一参考坐标以及第三参考坐标在目标特征图的宽度上的索引,为第二参考坐标以及第四参考坐标在目标特征图的宽度上的索引,为第二输入数据,为AI Core的索引,为自注意力机制中注意力头的索引,为所述第一参考坐标对应的第一参考插值的索引,为所述第二参考坐标对应的第二参考插值的索引,为所述第三参考坐标对应的第三参考插值的索引,为所述第四参考坐标对应的第四参考插值的索引。in, is the height of the target feature map, is the width of the target feature map, is the index of the first reference coordinate and the second reference coordinate at the height of the target feature map, is the index of the third reference coordinate and the 
fourth reference coordinate at the height of the target feature map, is the index of the first reference coordinate and the third reference coordinate on the width of the target feature map, is the index of the second reference coordinate and the fourth reference coordinate on the width of the target feature map, is the second input data, is the index of AI Core, is the index of the attention head in the self-attention mechanism, is the index of the first reference interpolation value corresponding to the first reference coordinate, is the index of the second reference interpolation value corresponding to the second reference coordinate, is the index of the third reference interpolation value corresponding to the third reference coordinate, is the index of the fourth reference interpolation value corresponding to the fourth reference coordinate. 10.根据权利要求9所述的基于昇腾AI处理器的注意力算子的优化方法,其特征在于,所述分别计算与所述至少四个参考插值相应的至少四个参考插值权重,包括:10. The method for optimizing the attention operator based on the Ascend AI processor according to claim 9, wherein the calculating the at least four reference interpolation weights corresponding to the at least four reference interpolations respectively comprises: 其中,为第一参考插值的权重,为第二参考插值的权重,为第三参考插值的权重,为第四参考插值的权重,为目标插值点在目标特征图的高度上的索引,为目标插值点在目标特征图的宽度上的索引。in, is the weight of the first reference interpolation, is the weight of the second reference interpolation, is the weight of the third reference interpolation, is the weight of the fourth reference interpolation, is the index of the target interpolation point at the height of the target feature map, The index of the target interpolation point on the width of the target feature map. 11.根据权利要求10所述的基于昇腾AI处理器的注意力算子的优化方法,其特征在于,所述根据所述至少四个参考插值以及所述至少四个参考插值权重确定所述目标插值点的目标插值,包括:11. 
The method for optimizing the attention operator based on the Ascend AI processor according to claim 10, characterized in that the determining the target interpolation value of the target interpolation point according to the at least four reference interpolations and the at least four reference interpolation weights comprises computing

f = w1·f1 + w2·f2 + w3·f3 + w4·f4

where f is the target interpolation of the target interpolation point, f1 is the first reference interpolation, f2 is the second reference interpolation, f3 is the third reference interpolation, f4 is the fourth reference interpolation, and w1, w2, w3 and w4 are the corresponding reference interpolation weights.

12. The method for optimizing the attention operator based on the Ascend AI processor according to claim 11, characterized in that the transferring the fifth input data to the AI Core comprises:

starting from the coreIdx*M*L*P position of the fifth input data in the external storage area, transferring fifth transfer data of size [M, L, P] to the AI Core.

13. An optimization system for an attention operator based on an Ascend AI processor, characterized by comprising:

an input module, configured to obtain first input data, second input data, third input data, fourth input data and fifth input data of the attention operator, and to transfer the second input data, the third input data and the fourth input data to the AI Core;

a data preparation module, configured to determine the width and height of a target feature map according to the second input data, determine the starting position of the target feature map in a spliced feature map according to the third input data, and determine the coordinates of a target interpolation point according to the fourth input data; wherein the spliced feature map is obtained by splicing the input feature maps of all the attention operators, the target feature map is contained in the spliced feature map, and the target interpolation point is contained in the target feature map;

an interpolation calculation module, configured to obtain at least four reference coordinates of the target interpolation point according to the coordinates of the target interpolation point, determine indexes of at least four reference interpolations corresponding to the at least four reference coordinates according to the width and height of the target feature map, the starting position of the target feature map in the spliced feature map, the coordinates of the target interpolation point and the at least four reference coordinates of the target interpolation point, and transfer the at least four reference interpolations from the first input data to the AI Core according to the indexes.

14. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, the method for optimizing the attention operator based on the Ascend AI processor according to any one of claims 1 to 12 is implemented.

15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions, and the computer instructions are used to cause a computer to execute the method for optimizing the attention operator based on the Ascend AI processor according to any one of claims 1 to 12.
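The weighted-sum rule in claim 11 is the standard bilinear-interpolation formula: the value at a fractional coordinate is the weight-blended sum of its four integer-grid neighbours. A minimal NumPy sketch for illustration only (function and variable names are our own; the patent does not specify an implementation):

```python
import numpy as np

def bilinear_sample(feature, x, y):
    """Sample a [H, W] feature map at fractional (x, y) from the four
    reference values (f1..f4 in the claim) and their bilinear weights."""
    h, w = feature.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    # Clamp the far neighbours so border points stay in range.
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    # Reference interpolation weights (w1..w4 in the claim).
    wx1, wy1 = x - x0, y - y0
    wx0, wy0 = 1.0 - wx1, 1.0 - wy1
    # f = w1*f1 + w2*f2 + w3*f3 + w4*f4
    return (wx0 * wy0 * feature[y0, x0] +
            wx1 * wy0 * feature[y0, x1] +
            wx0 * wy1 * feature[y1, x0] +
            wx1 * wy1 * feature[y1, x1])
```

For example, sampling the centre of a 2×2 map [[0, 1], [2, 3]] gives the mean of the four neighbours, 1.5, since all four weights are 0.25.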
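The coreIdx*M*L*P offset in claim 12 is flat-index tiling: each AI Core copies its own contiguous [M, L, P] slab out of external storage, so slabs never overlap and together cover the whole tensor. A hedged sketch of the address arithmetic (the names M, L, P follow the claim; everything else is illustrative):

```python
def core_slab_offset(core_idx, M, L, P):
    """Start offset (in elements) of the [M, L, P] slab that core
    `core_idx` moves from external storage to its local buffer."""
    return core_idx * M * L * P

def slab_bounds(core_idx, M, L, P):
    """Half-open element range [start, end) covered by one core's slab."""
    start = core_slab_offset(core_idx, M, L, P)
    return start, start + M * L * P
```

With M=4, L=3, P=2, core 0 starts at element 0, core 1 at 24, core 2 at 48, and consecutive ranges tile the data exactly.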
CN202411806740.4A 2024-12-10 2024-12-10 Optimization method and related equipment of attention operator based on Ascend AI processor Active CN119295293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411806740.4A CN119295293B (en) 2024-12-10 2024-12-10 Optimization method and related equipment of attention operator based on Ascend AI processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411806740.4A CN119295293B (en) 2024-12-10 2024-12-10 Optimization method and related equipment of attention operator based on Ascend AI processor

Publications (2)

Publication Number Publication Date
CN119295293A true CN119295293A (en) 2025-01-10
CN119295293B CN119295293B (en) 2025-03-18

Family

ID=94155842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411806740.4A Active CN119295293B (en) 2024-12-10 2024-12-10 Optimization method and related equipment of attention operator based on Ascend AI processor

Country Status (1)

Country Link
CN (1) CN119295293B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114281874A (en) * 2021-11-19 2022-04-05 Peking University Index conversion method and device based on Ascend AI processor
CN114329325A (en) * 2021-11-19 2022-04-12 Peking University An optimization method of batch matrix multiplication operator based on Ascend AI processor
CN114327630A (en) * 2022-01-05 2022-04-12 Peking University High-performance operator generation method suitable for Huawei Ascend chip
CN115185587A (en) * 2022-05-30 2022-10-14 Peng Cheng Laboratory AI processor-based general matrix multiplier processing method and device
CN116685964A (en) * 2021-12-31 2023-09-01 Huawei Technologies Co., Ltd. Processing method of operation acceleration, using method of operation accelerator and operation accelerator
WO2023165290A1 (en) * 2022-03-04 2023-09-07 Huawei Technologies Co., Ltd. Data processing method and apparatus, and electronic device and storage medium
CN117078553A (en) * 2023-08-25 2023-11-17 Hangzhou Zhiyuan Research Institute Co., Ltd. Image defogging method based on multi-scale deep learning


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MENGMENG WU: "HMDA: A Hybrid Model with Multi-scale Deformable Attention for Medical Image Segmentation", IEEE Journal of Biomedical and Health Informatics, 7 October 2024 (2024-10-07) *
LAN Hong; LIU Qinyi: "Scene graph-to-image generation model with graph attention network", Journal of Image and Graphics, no. 08, 12 August 2020 (2020-08-12) *

Also Published As

Publication number Publication date
CN119295293B (en) 2025-03-18

Similar Documents

Publication Publication Date Title
US11922132B2 (en) Information processing method and terminal device
CN107341547B (en) Apparatus and method for performing convolutional neural network training
CN109284825B (en) Apparatus and method for performing LSTM operations
KR102470264B1 (en) Apparatus and method for performing reverse training of a fully-connected layer neural network
US20200050918A1 (en) Processing apparatus and processing method
US12141468B1 (en) Matrix transpose hardware acceleration
US12057110B2 (en) Voice recognition based on neural networks
US20240184462A1 (en) Electronic device with storage device data conversion
US12125124B1 (en) Matrix transpose hardware acceleration
WO2017185335A1 (en) Apparatus and method for executing batch normalization operation
US12039360B2 (en) Operation method of host processor and accelerator, and electronic device including the same
CN119295293B (en) Optimization method and related equipment of attention operator based on Ascend AI processor
CN111667060B (en) Deep learning algorithm compiling method and device and related products
CN111860814A (en) Device and method for executing batch normalization operation
CN114692865B (en) A neural network quantization training method, device and related products
CN113469365B (en) Reasoning and compiling method based on neural network model and related products thereof
CN113469337B (en) Compiling method for optimizing neural network model and related products thereof
US11544213B2 (en) Neural processor
CN117940934A (en) Data processing apparatus and method
CN114692847B (en) Data processing circuit, data processing method and related products
WO2024016894A1 (en) Method for training neural network and related device
CN115438778A (en) Integrated circuit device for executing Winograd convolution
CN115221104A (en) Data processing device, data processing method and related product
CN115438777A (en) Device for performing Winograd convolution forward transform on neuron data
CN114692849A (en) Inverse transformation unit, device and board card for inverse transformation of Winograd convolution bit-to-bit multiplication data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant