
CN118012631B - Operator execution method, processing device, storage medium and program product - Google Patents

Operator execution method, processing device, storage medium and program product

Info

Publication number
CN118012631B
CN118012631B
Authority
CN
China
Prior art keywords
execution unit
execution
operator
output data
local memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410411948.XA
Other languages
Chinese (zh)
Other versions
CN118012631A (en)
Inventor
Name withheld at inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Beijing Bilin Technology Development Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202410411948.XA priority Critical patent/CN118012631B/en
Publication of CN118012631A publication Critical patent/CN118012631A/en
Application granted granted Critical
Publication of CN118012631B publication Critical patent/CN118012631B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

An operator execution method, a processing device, a storage medium, and a program product relate to the technical field of artificial intelligence and are used to reduce memory-access latency during operator execution. The operator execution method is applicable to a processing device having N execution units, each of which has a corresponding local memory. The method comprises the following steps: a first execution unit stores the output data obtained after executing a first operator into the local memory corresponding to a second execution unit, where that local memory has an address space with a mapping relationship to the first execution unit; the second execution unit obtains the output data by reading its local memory and executes a second operator with the output data as the second operator's input data; the second operator is a successor operator of the first operator.

Description

Operator execution method, processing device, storage medium and program product
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an operator execution method, a processing device, a storage medium, and a program product.
Background
Artificial intelligence (AI) is a discipline in which a computer simulates intelligent human behaviors such as learning, reasoning, and thinking, and it can be applied to scenarios such as image inference and speech inference. The operations of an artificial intelligence model may be implemented by operators in a computational graph, where the computational graph is a multigraph structure representing the computational tasks and data-flow processes of the model. An operator refers to any of the various operations performed on the data of each layer in the model; for example, the convolution operation that a convolutional layer performs on its input data is a convolution operator.
In general, the computation of an operator requires multiple processing units to cooperate. Owing to the physical arrangement of each processing unit and its corresponding local memory, a processing unit may incur a larger latency when accessing the data of other processing units. In memory-intensive scenarios, a great deal of time is spent on memory access during computation, which leads to overly long computation times and low computational efficiency.
Therefore, a scheme is needed to reduce memory-access latency during operator execution.
Disclosure of Invention
The present application provides an operator execution method and a processing device for executing operators, which are used to reduce memory-access latency during operator execution.
In a first aspect, the present application provides an operator execution method applicable to a processing device having N execution units, each of which has a corresponding local memory. The method comprises the following steps: a first execution unit stores the output data obtained after executing a first operator into the local memory corresponding to a second execution unit, where that local memory has an address space with a mapping relationship to the first execution unit; the second execution unit obtains the output data by reading its local memory and executes a second operator with the output data as the second operator's input data; the second operator is a successor operator of the first operator.
In this technical scheme, the first execution unit stores the output data obtained after executing the first operator into the local memory corresponding to the second execution unit, so that the second execution unit can read its input data from its own local memory when executing the second operator. Since reading data from local memory is much faster than reading it from the memories of other execution units, the memory-access time of operator execution is reduced and computational efficiency is improved.
In one possible design, the N execution units are divided into a plurality of execution groups, and the first execution unit and the second execution unit belong to the same execution group. Within the same execution group, the local memory of each execution unit has an address space with a mapping relationship to every other execution unit, and the output data of any first execution unit that executes the first operator is input data for every execution unit in the group when it executes the second operator. Storing, by the first execution unit, the output data obtained after the first operator is executed into the local memory corresponding to the second execution unit comprises: any first execution unit writing the output data of the executed first operator into the address space conforming to the corresponding mapping relationship in the local memory of each execution unit in the execution group.
In one possible design, the N execution units are divided into a plurality of execution groups, and the first execution unit and the second execution unit belong to the same execution group. Each execution group is provided with an execution unit serving as its core; the second execution unit is the core execution unit, and the first execution unit is any execution unit. The output data of each first execution unit that executes the first operator is input data for the second execution unit when it executes the second operator. Storing, by the first execution unit, the output data obtained after the first operator is executed into the local memory corresponding to the second execution unit comprises: any first execution unit writing the output data of the executed first operator into the address space conforming to the corresponding mapping relationship in the local memory of the second execution unit.
In one possible design, the N execution units are divided into a plurality of execution groups, and the first execution unit and the second execution unit belong to the same execution group. Each execution group is provided with an execution unit serving as its core; within the same execution group, the local memory of each execution unit has an address space with a mapping relationship to the core execution unit. The first execution unit is the core execution unit, and the second execution unit is any execution unit. Storing, by the first execution unit, the output data obtained after the first operator is executed into the local memory corresponding to the second execution unit comprises: the first execution unit writing the output data of the executed first operator into the address space conforming to the corresponding mapping relationship in the local memory of each second execution unit.
In one possible design, the N execution units are divided into a plurality of execution groups, each of which is provided with an execution unit serving as its core; the first execution unit is the core execution unit of each execution group, and every execution unit executes the second operator. The local memory of each execution unit is divided into a plurality of contiguous storage areas, each of which has an address-space mapping relationship with a first execution unit. Storing, by the first execution unit, the output data obtained after the first operator is executed into the local memory corresponding to the second execution unit comprises: storing the output data obtained after any first execution unit executes the first operator into the address space conforming to the corresponding mapping relationship in the contiguous storage areas of the respective second execution units.
In one possible design, the output data obtained after any first execution unit finishes executing the first operator is input data for that same execution unit when it executes the second operator; that is, the first execution unit and the second execution unit are the same execution unit. Storing, by the first execution unit, the output data obtained after the first operator is executed into the local memory corresponding to the second execution unit comprises: any first execution unit writing the output data of the executed first operator into its own local memory.
In one possible design, the method further comprises: determining the execution group to which each execution unit belongs, and the execution unit serving as the core of each execution group, based on the layout of the execution units on the chip, the access bandwidth of the local memories, and the distance between each execution unit and its local memory.
In a second aspect, an embodiment of the present application provides a processing device for executing operators, comprising N execution units and a local memory corresponding to each execution unit, the N execution units being divided into a plurality of execution groups. Any execution unit is adapted to invoke stored program instructions and to execute the method described in any possible design of the first aspect according to the obtained program instructions.
In a third aspect, embodiments of the present application provide a computer-readable storage medium storing computer-readable instructions which, when read and executed by a computer, cause the method described in any possible design of the first aspect to be implemented.
In a fourth aspect, embodiments of the present application provide a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer device, cause the computer device to perform the steps of any possible design of the first aspect.
Drawings
In order to describe the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application; other drawings may be obtained from them without inventive effort by a person skilled in the art.
FIG. 1 is a schematic diagram of a processing apparatus according to an embodiment of the present application;
FIG. 2 is a flow chart of an operator execution method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a default mode according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an execution group according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a mode one according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a second mode according to an embodiment of the present application;
FIG. 7 is a schematic diagram of mode three according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a mode four provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In embodiments of the present application, "a plurality" refers to two or more. The words "first," "second," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance or order.
Fig. 1 schematically shows the structure of a processing device according to an embodiment of the present application. The processing device includes N execution units (execution units 101-1 to 101-N) and a memory 102.
Each execution unit may be configured to perform operator operations. The memory 102 may be a high bandwidth memory (HBM) or another type of memory, and each execution unit accesses the memory 102 via the bus 103. Each execution unit has a corresponding local memory within the memory 102; referring to fig. 1, the memory space 102-1 allocated to execution unit 101-1 serves as the local memory of execution unit 101-1, and the memory space 102-N allocated to execution unit 101-N serves as the local memory of execution unit 101-N. The local memory corresponding to each execution unit can store the output data of that execution unit's operator operations, and can also store the output data of other execution units' operator operations. The output data of the current operator operation may serve as the input data of the next operator operation.
The processing device of the present application may include other structures in addition to those described above, which the present application does not specifically limit.
FIG. 2 schematically illustrates a flowchart of an operator execution method according to an embodiment of the present application. The method is applicable to a processing device having N execution units, each with its own local memory. As shown in fig. 2, the method includes the following steps:
Step 201, the first execution unit stores the output data obtained after the first operator is executed, into a local memory corresponding to the second execution unit.
Operators refer to the various operations performed on data at each layer of an artificial intelligence model. Operator types include, but are not limited to: convolution operators, fully connected operators, pooling operators, batch normalization (BatchNorm) operators, Sobel operators, reshape (Reshape) operators, transpose (Transpose) operators, and the like. The combination of operators in an artificial intelligence model can be obtained by parsing the computational graph corresponding to the model's operations.
An operator may be computed by one or more execution units on a chip, and the number of execution units participating in the computation can be determined according to the operator type. The first execution unit is the execution unit that executes the first operator and may be one or more execution units; likewise, the second execution unit is the execution unit that executes the second operator and may be one or more execution units. The local memory corresponding to the second execution unit has an address space with a mapping relationship to the first execution unit. The second operator is the successor of the first operator: the output data that the first execution unit obtains by executing the first operator is exactly the input data that the second execution unit needs when executing the second operator, so the first execution unit stores that output data in the local memory corresponding to the second execution unit.
Step 202, the second execution unit obtains the output data by reading the local memory, and executes the second operator by taking the output data as the input data of the second operator.
The execution units write data to the local memories by means of unified memory access (uniform memory access, UMA), in which multiple execution units access shared memory through the same bus. The execution units read data from the local memories by means of non-uniform memory access (NUMA), in which multiple execution units access their respective memories in parallel. Under non-uniform memory access, an execution unit accesses its own local memory fastest, and access becomes slower as the memory distance grows. In the present application, the first execution unit stores the output data obtained after executing the first operator into the local memory corresponding to the second execution unit, so that the second execution unit can read its input data from its own local memory when executing the second operator, which reduces memory-access time.
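As an illustrative sketch (not part of the patent itself), the producer-to-consumer handoff of steps 201 and 202 can be modeled in a few lines of Python. The class name, slot-naming scheme, and operator bodies are hypothetical placeholders:

```python
class ExecutionUnit:
    """Minimal model of an execution unit with its own local memory."""
    def __init__(self, uid):
        self.uid = uid
        self.local_memory = {}   # slot name -> data

    def first_operator(self, x):
        return x * 2             # stand-in for the real first operator

    def second_operator(self, x):
        return x + 1             # stand-in for the successor operator

# Step 201: the first unit writes its output directly into the address
# space mapped to it inside the SECOND unit's local memory.
def step_201(first, second, input_data):
    out = first.first_operator(input_data)
    second.local_memory["from_unit_%d" % first.uid] = out

# Step 202: the second unit reads from its OWN local memory (the fast
# NUMA-local path) and feeds the data to the second operator.
def step_202(second, producer_uid):
    data = second.local_memory["from_unit_%d" % producer_uid]
    return second.second_operator(data)

u0, u10 = ExecutionUnit(0), ExecutionUnit(10)
step_201(u0, u10, 3)
print(step_202(u10, 0))   # 3 * 2 = 6, then 6 + 1 = 7
```

The point of the sketch is only the data path: the write in step 201 targets the consumer's memory, so the read in step 202 never crosses to a remote memory.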
In concrete implementations, the operator execution method has the following modes:
(1) Default mode
The output data obtained after any first execution unit finishes executing the first operator is input data for the second execution unit executing the second operator, where the first execution unit and the second execution unit are the same execution unit. In the default mode, each first execution unit writes the output data of the executed first operator into its own local memory. As a result, in the default mode the local memory of each execution unit stores that unit's own output data from executing the first operator.
Referring to fig. 3, taking execution units 0, 2, 8, and 10 as an example, execution unit 0 writes its output data 0 of the executed first operator (the output data may be a tensor) into the local memory of execution unit 0. Execution unit 2 writes its output data 2 into the local memory of execution unit 2. Execution units 8 and 10 behave similarly. In the end, the local memories of execution units 0, 2, 8, and 10 each store that unit's own first-operator output. When execution units 0, 2, 8, and 10 execute the second operator, each obtains the first-operator output by reading its own local memory and uses it as input data for the second operator.
Similarly, every other execution unit participating in the first operator's computation in the default mode likewise writes its first-operator output into its own local memory.
The default mode is applicable when each execution unit, in executing the second operator, needs only its own output data from executing the first operator. In this case, each execution unit stores the output data obtained after executing the first operator in its own local memory, so that it can read its input data from its own local memory when executing the second operator.
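The default mode described above can be sketched as follows; the unit ids match the example of fig. 3, while the operator body and memory layout are illustrative placeholders:

```python
# Default mode: each execution unit writes its first-operator output into
# its OWN local memory, then reads it back when executing the second operator.
local_memories = {uid: {} for uid in (0, 2, 8, 10)}   # uid -> local memory

def first_operator(uid):
    return [uid] * 4          # placeholder output "tensor"

for uid, memory in local_memories.items():
    memory["output"] = first_operator(uid)            # write to own memory

# Second operator (here a placeholder sum): every unit reads only from
# its own local memory, so no cross-unit access occurs.
results = {uid: sum(memory["output"]) for uid, memory in local_memories.items()}
print(results)   # {0: 0, 2: 8, 8: 32, 10: 40}
```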
(2) Mode one
In mode one, the N execution units are divided into a plurality of execution groups, and the first execution unit and the second execution unit belong to the same execution group. Specifically, the execution group to which each execution unit belongs can be determined based on the layout of the execution units on the chip, the access bandwidth of the local memories, and the distance between each execution unit and its local memory, so that the overall bandwidth utilization and access latency are optimal.
For example, fig. 4 shows one way of dividing execution groups. The chip in fig. 4 has 16 execution units (each circle represents one execution unit), laid out as shown. For this layout, the 16 execution units may be divided into 4 execution groups: referring to fig. 4, the first group includes execution units 0, 2, 8, 10; the second group includes execution units 1, 3, 9, 11; the third group includes execution units 4, 6, 12, 14; and the fourth group includes execution units 5, 7, 13, 15.
It should be noted that the above division is only an example for 16 execution units. The actual number of execution groups, and which execution units each group contains, may be determined according to the layout of the execution units on the chip, the access bandwidth of the local memories, and the distance between each execution unit and its local memory.
In mode one, the local memory of each execution unit in the same execution group has address spaces with mapping relationships to every other execution unit in the group. The output data of any first execution unit that executes the first operator is input data for every execution unit in the group when it executes the second operator, so any first execution unit writes the output data of the executed first operator into the address space conforming to the corresponding mapping relationship in the local memory of each execution unit in the group. As a result, in this mode the local memory of every execution unit of an execution group stores the full set of first-operator outputs of all the group's execution units.
Taking the first execution group as an example, referring to fig. 5, the local memories of execution units 0, 2, 8, and 10 each have address spaces with mapping relationships to execution units 0, 2, 8, and 10; that is, each of these local memories is allocated address spaces for storing the first-operator outputs of execution units 0, 2, 8, and 10. Execution unit 0 writes its output data 0 of the executed first operator into the address space allocated for output data 0 in the local memories of execution units 0, 2, 8, and 10, respectively. Execution unit 2 writes its output data 2 into the address space allocated for output data 2 in the same four local memories. Execution units 8 and 10 behave similarly. In the end, the local memories of execution units 0, 2, 8, and 10 each store the full set of first-operator outputs (output data 0, output data 2, output data 8, and output data 10). When executing the second operator, execution units 0, 2, 8, and 10 each read their own local memory to obtain the full set of first-operator outputs and use them as input data for the second operator.
Similarly, in the second execution group, execution units 1, 3, 9, and 11 write their first-operator outputs 1, 3, 9, and 11 into the address spaces allocated for them in the local memories of execution units 1, 3, 9, and 11, respectively. The other execution groups are analogous and are not described again here.
Specifically, each execution unit is assigned a corresponding address offset. After any execution unit finishes executing the first operator and obtains its output data, the base address of the local memory of each execution unit in the group is added to that unit's address offset to obtain the address at which the output data should be stored in each local memory; the execution unit then writes the output data of the executed first operator into each corresponding address space.
Mode one is applicable when, for any execution group, the first operator is computed by every execution unit of the group, each unit obtaining its own output data, and the second operator is likewise computed by every execution unit of the group, with each unit needing the full set of first-operator outputs as input when executing the second operator. In this case, each execution unit stores the output data obtained after executing the first operator into the local memory corresponding to every execution unit in the group, so that when executing the second operator each execution unit can read its input data (the full set of first-operator outputs) from its own local memory. Because reading from local memory is fast, the memory-access time of operator execution is reduced and computational efficiency is improved.
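The broadcast pattern of mode one, including the base-address-plus-offset addressing described above, can be sketched as follows. The base address, slot size, and group membership are illustrative values, not taken from the patent:

```python
# Mode one: every unit in the group broadcasts its first-operator output
# into the mapped address space of EVERY group member's local memory.
GROUP = (0, 2, 8, 10)
BASE, SLOT = 0x1000, 0x100
OFFSET = {uid: i * SLOT for i, uid in enumerate(GROUP)}   # per-producer offset

memories = {uid: {} for uid in GROUP}     # uid -> {address: data}

def broadcast(producer, output):
    addr = BASE + OFFSET[producer]        # base address + producer's offset
    for memory in memories.values():      # write into every member's memory
        memory[addr] = output

for uid in GROUP:
    broadcast(uid, "out%d" % uid)

# Every local memory now holds the full set of first-operator outputs,
# so each unit can read all of its second-operator inputs locally.
assert all(len(memory) == len(GROUP) for memory in memories.values())
```

Note that the same producer offset addresses the same slot in every member's memory, which is exactly the "address space conforming to the corresponding mapping relationship" of the description.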
(3) Mode two
In mode two, the N execution units are divided into a plurality of execution groups, and the first execution unit and the second execution unit belong to the same execution group; the groups are formed as in mode one, which is not repeated here. In mode two, each execution group is provided with an execution unit serving as its core. As the hub of the group, the core execution unit can perform operations that aggregate the output data of all execution units in the group; for example, a reduction operator can be computed on the core execution unit. Specifically, the core execution unit of each group can be determined based on the layout of the execution units on the chip, the access bandwidth of the local memories, and the distance between each execution unit and its local memory, so that the overall bandwidth utilization and access latency are optimal.
For example, referring to fig. 4, execution unit 10 may serve as the core of execution units 0, 2, 8, and 10 in the first execution group; execution unit 9 as the core of execution units 1, 3, 9, and 11 in the second group; execution unit 6 as the core of execution units 4, 6, 12, and 14 in the third group; and execution unit 5 as the core of execution units 5, 7, 13, and 15 in the fourth group.
In mode two, the second execution unit is the core execution unit of the group, the first execution unit is any execution unit in the group, and the output data of a first execution unit executing the first operator is input data for the second execution unit executing the second operator. Any first execution unit writes the output data of the executed first operator into the address space conforming to the corresponding mapping relationship in the local memory of the second execution unit. As a result, in mode two the local memory of the core execution unit of each group stores the full set of first-operator outputs of the group's execution units.
Taking the first execution group as an example, referring to fig. 6, the execution unit 10 is set as the core execution unit in the first execution group, and the local memory of the execution unit 10 has an address space having a mapping relationship with the execution units 0, 2, 8, and 10, that is, the local memory of the execution unit 10 is allocated with an address space storing the output data of the execution units 0, 2, 8, and 10 for executing the first operator. The execution unit 0 writes the output data 0 of the first operator into an address space allocated for the output data 0 of the execution unit 0 in the local memory of the execution unit 10. The execution unit 2 writes the output data 2 of the first operator into an address space allocated for the output data 2 of the execution unit 2 in a local memory of the execution unit 10. The execution unit 8 is similar to the execution unit 10, and finally, the local memory of the execution unit 10 stores the total output data (output data 0, output data 2, output data 8 and output data 10) of each execution unit executing the first operator in the second mode. When executing the second operator, the execution unit 10 obtains the total output data (output data 0, output data 2, output data 8 and output data 10) of each execution unit executing the first operator by reading the local memory, and executes the second operator as the input data of the second operator.
Similarly, in mode two, the execution units 1, 3, 9, and 11 of the second execution group write their output data 1, 3, 9, and 11 of the executed first operator into the address spaces allocated for them in the local memory of the execution unit 9, the core of that group. The other execution groups are similar and are not described in detail here.
Specifically, each execution unit is assigned a corresponding address offset. After any execution unit finishes executing the first operator and obtains its output data, the address at which that output data is to be stored is obtained by adding the execution unit's address offset to the base address of the local memory of the core execution unit of the group. Each execution unit then writes its output data of the executed first operator into the corresponding address space.
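The offset-based addressing just described can be illustrated with a short sketch. This is a hypothetical Python model, not part of the application: the base address, output size, and helper names are all assumptions made for illustration.

```python
# Hypothetical model of mode two's offset-based write (all names and
# constants are illustrative assumptions, not part of the application).

LOCAL_MEM_BASE = 0x4000          # base address of the core unit's local memory
OUTPUT_SIZE = 0x100              # bytes of output data per execution unit

# address offset assigned to each execution unit of the group (0, 2, 8, 10)
OFFSETS = {0: 0 * OUTPUT_SIZE, 2: 1 * OUTPUT_SIZE,
           8: 2 * OUTPUT_SIZE, 10: 3 * OUTPUT_SIZE}

def dest_address(unit_id: int) -> int:
    """Base address of the core's local memory plus the unit's offset."""
    return LOCAL_MEM_BASE + OFFSETS[unit_id]

core_memory = {}                 # stand-in for the core unit's local memory

def write_output(unit_id: int, data: bytes) -> None:
    """A producer unit writes its first-operator output to its own slot."""
    core_memory[dest_address(unit_id)] = data

for uid in (0, 2, 8, 10):
    write_output(uid, f"output {uid}".encode())
```

After the loop, the core unit (execution unit 10 in the example) holds the full output data of the group in its own local memory and can read it directly when executing the second operator.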
Mode two is applicable to any execution group in which the first operator is computed by every execution unit of the group, each obtaining its own output data, while the second operator is computed only by the execution unit serving as the core of the group, which needs the full output data of all execution units as its input. In this case, each execution unit stores the output data obtained after executing the first operator in the local memory of the core execution unit, so that when the core execution unit executes the second operator it can read its input data (the full output data of the first operator) from its own local memory. Since reading data from local memory is fast, this reduces the memory-access time of the operator computation and improves computational efficiency.
(IV) Mode three
In mode three, the N execution units are divided into a plurality of execution groups, the first execution unit and the second execution unit belong to the same execution group, and each execution group is provided with an execution unit serving as its core. For the partitioning of the execution groups, refer to mode one; for the setting of the core execution unit in each group, refer to mode two; the details are not repeated here.
In mode three, within the same execution group, the local memory of each execution unit has an address space with a mapping relationship to the execution unit serving as the core. The first execution unit is the core execution unit, and the second execution unit is any execution unit in the group. The first execution unit writes the output data of the executed first operator into the address space of each second execution unit's local memory that conforms to the corresponding mapping relationship. Finally, in mode three, the local memory of every execution unit in the group stores the output data of the core execution unit after executing the first operator.
Taking the first execution group as an example, referring to fig. 7, the execution unit 10 is set as the core execution unit of the first execution group, and the local memories of the execution units 0, 2, 8, and 10 all have address spaces with a mapping relationship to the core execution unit 10; that is, each of these local memories is allocated an address space for storing the output data of the execution unit 10 after executing the first operator. The execution unit 10 writes its output data 10 of the executed first operator into the address space allocated for it in the local memory of each of the execution units 0, 2, 8, and 10. Finally, in mode three, the local memories of the execution units 0, 2, 8, and 10 all store the output data of the execution unit 10 after executing the first operator (output data 10). When the execution units 0, 2, 8, and 10 execute the second operator, each obtains the output data 10 by reading its own local memory and uses it as the input data of the second operator.
Similarly, in mode three, the execution unit 9 serving as the core of the second execution group writes its output data 9 of the executed first operator into the address spaces allocated for it in the local memories of the execution units 1, 3, 9, and 11. The other execution groups are similar and are not described in detail here.
Specifically, four address spaces of the same size may be allocated on the first execution unit, the output data written into each of the four address spaces, and the output data in each address space then mapped into the local memory of the corresponding second execution unit. For example, four equal-sized address spaces are allocated on the core execution unit 10 and the output data 10 is written into each of them; the output data 10 in the first address space is then mapped into the local memory of the execution unit 0, the output data 10 in the second address space into the local memory of the execution unit 2, the output data 10 in the third address space into the local memory of the execution unit 8, and the output data 10 in the fourth address space into the local memory of the execution unit 10.
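The broadcast of the core's output to every group member can be sketched with the same kind of hypothetical model (the dictionary-based memory and all names are illustrative assumptions):

```python
# Hypothetical model of mode three's broadcast write (illustrative only).

GROUP = (0, 2, 8, 10)            # members of the first execution group
CORE = 10                        # execution unit serving as the core

# each member's local memory reserves an address space for the core's output
local_memory = {uid: {} for uid in GROUP}

def broadcast_output(core_id: int, data: bytes) -> None:
    """The core writes one equal-sized copy of its first-operator output
    into the reserved address space of every group member's local memory."""
    for uid in GROUP:
        local_memory[uid][f"out_{core_id}"] = data

broadcast_output(CORE, b"output 10")
```

Each member can then read the core's output from its own local memory when executing the second operator, instead of fetching it from the core's memory.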
Mode three is applicable to any execution group in which the first operator is computed only by the execution unit serving as the core of the group, which obtains the output data, while the second operator is computed by every execution unit of the group, each of which needs the core's output data as its input. In this case, the core execution unit stores the output data obtained after executing the first operator in the local memory of every execution unit in the group, so that when each execution unit executes the second operator it can read its input data (the output data of the first operator produced by the core execution unit) from its own local memory. Since reading data from local memory is fast, this reduces the memory-access time of the operator computation and improves computational efficiency.
(V) Mode four
In mode four, the N execution units are divided into a plurality of execution groups, each execution group is provided with an execution unit serving as its core, and the first execution units are the core execution units of the execution groups. For the partitioning of the execution groups, refer to mode one; for the setting of the core execution unit in each group, refer to mode two; the details are not repeated here.
In mode four, every execution unit executes the second operator. The local memories of the execution units are divided into a plurality of continuous storage areas, and each continuous storage area has address spaces with a mapping relationship to the first execution units. Each first execution unit stores the output data obtained after executing the first operator into the address spaces of the continuous storage areas corresponding to the plurality of second execution units that conform to the corresponding mapping relationship. Finally, in mode four, each continuous storage area stores the full output data of the core execution units of all execution groups after executing the first operator.
Referring to fig. 8, continuous storage area one is composed of the local memories of the execution units 0 to 7, and continuous storage area two is composed of the local memories of the execution units 8 to 15. Both continuous storage areas have address spaces with mapping relationships to the first execution units; that is, each area is allocated address spaces for storing the output data of the core execution units 5, 6, 9, and 10 after executing the first operator. The core execution units 5, 6, 9, and 10 write their output data 5, 6, 9, and 10 of the executed first operator into both continuous storage area one and continuous storage area two. Finally, both areas store the full output data (output data 5, output data 6, output data 9, and output data 10) of the core execution units after executing the first operator. When any one of the execution units 0 to 15 executes the second operator, it obtains the full output data of the core execution units by reading the nearby continuous storage area and uses it as the input data of the second operator: an execution unit among units 0 to 7 reads continuous storage area one, and an execution unit among units 8 to 15 reads continuous storage area two.
Mode four is applicable when the first operator is computed by the core execution unit of each execution group, each core obtaining its own output data, while the second operator is computed by every execution unit, each of which needs the full output data of all core execution units as its input. In this case, each core execution unit stores the output data obtained after executing the first operator in the continuous storage areas corresponding to the plurality of second execution units, so that when each execution unit executes the second operator it can read its input data (the full output data of the first operator produced by the core execution units) from a nearby memory. Since reading data from a nearby memory is faster than reading it from a remote memory, this reduces the memory-access time of the operator computation and improves computational efficiency.
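The continuous-storage-area layout of mode four can be sketched in the same hypothetical style (the two-area layout and all names are assumptions made for illustration):

```python
# Hypothetical model of mode four (illustrative only): each core writes its
# output into every continuous storage area, so any unit can read the full
# output set from the area that contains its own local memory.

CORES = (5, 6, 9, 10)                    # core execution unit of each group
AREAS = {"area one": range(0, 8),        # local memories of units 0..7
         "area two": range(8, 16)}       # local memories of units 8..15

storage = {name: {} for name in AREAS}   # stand-in for the two areas

def publish(core_id: int, data: bytes) -> None:
    """A core writes its first-operator output into every storage area."""
    for name in storage:
        storage[name][core_id] = data

for cid in CORES:
    publish(cid, f"output {cid}".encode())

def read_full_output(unit_id: int) -> dict:
    """A unit reads the continuous storage area it belongs to."""
    for name, units in AREAS.items():
        if unit_id in units:
            return storage[name]
    raise ValueError(f"unknown unit {unit_id}")
```

For example, execution unit 3 reads area one and execution unit 12 reads area two, and both obtain the same four outputs produced by the core execution units.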
Based on the same technical concept, an embodiment of the present application provides a processing apparatus for executing operators, including N execution units and a local memory corresponding to each execution unit, the N execution units being divided into a plurality of execution groups. Any execution unit is configured to invoke stored program instructions and, according to the obtained program instructions, perform the operator execution method described in any of the above modes.

Based on the same technical concept, an embodiment of the present application provides a computer-readable storage medium storing computer-readable instructions which, when read and executed by a computer, cause the operator execution method described in any of the above modes to be implemented.

Based on the same technical concept, an embodiment of the present application further provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer device, cause the computer device to perform the steps of the operator execution method described in any of the above modes.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. An operator execution method is characterized by being suitable for processing equipment with N execution units; each execution unit is provided with a local memory corresponding to each execution unit; the N execution units are divided into a plurality of execution groups; the first execution unit and the second execution unit belong to the same execution group, or each execution group is provided with an execution unit serving as a core, and the first execution unit is the execution unit serving as the core in each execution group; the method comprises the following steps:
the first execution unit stores the output data obtained after the first operator is executed into a local memory corresponding to the second execution unit; the local memory corresponding to the second execution unit is provided with an address space with a mapping relation with the first execution unit;
The second execution unit reads a local memory to acquire the output data, and executes a second operator by taking the output data as input data of the second operator; the second operator is a subsequent operator to the first operator.
2. The method of claim 1, wherein the first execution unit and the second execution unit belong to the same execution group;
in the same execution group, the local memory of each execution unit has an address space with mapping relation with other execution units;
the output data of any first execution unit executing the first operator is the input data of each execution unit executing the second operator in the execution group;
The first execution unit stores the output data obtained after the first operator is executed into a local memory corresponding to the second execution unit in a unified memory access mode, and the method comprises the following steps:
and writing the output data of the first operator which is executed by any first execution unit into an address space which accords with a corresponding mapping relation in a local memory of each execution unit in the execution group.
3. The method of claim 1, wherein the first execution unit and the second execution unit belong to the same execution group;
each execution group is provided with an execution unit serving as a core; the second execution unit is an execution unit serving as a core; the first execution unit is any execution unit;
The output data of each first execution unit executing the first operator is the input data of the second execution unit executing the second operator;
The first execution unit stores the output data obtained after the first operator is executed into a local memory corresponding to the second execution unit in a unified memory access mode, and the method comprises the following steps:
and writing the output data of which the first operator is executed into an address space which accords with a corresponding mapping relation in a local memory of the second execution unit by any first execution unit.
4. The method of claim 1, wherein the first execution unit and the second execution unit belong to the same execution group;
each execution group is provided with an execution unit serving as a core;
in the same execution group, the local memory of each execution unit has an address space with a mapping relation with the execution unit serving as a core;
the first execution unit is an execution unit serving as a core, and the second execution unit is any execution unit;
The first execution unit stores the output data obtained after the first operator is executed into a local memory corresponding to the second execution unit in a unified memory access mode, and the method comprises the following steps:
and the first execution unit writes the output data after the first operator is executed into an address space which accords with the corresponding mapping relation in the local memory of each second execution unit.
5. The method according to claim 1, wherein each execution group is provided with an execution unit as a core and the first execution unit is an execution unit as a core in each execution group;
Each execution unit executes a second operator; the local memory of each execution unit is divided into a plurality of continuous storage areas; each continuous storage area and the first execution unit have a mapping relation of an address space;
The first execution unit stores the output data obtained after the first operator is executed into a local memory corresponding to the second execution unit in a unified memory access mode, and the method comprises the following steps:
And storing the output data obtained after the first operator is executed by any first execution unit into an address space which accords with the corresponding mapping relation in the continuous storage areas corresponding to the second execution units.
6. The method of claim 1, wherein the output data obtained after the execution of the first operator by any one of the first execution units is input data of the second operator executed by the second execution unit; the first execution unit and the second execution unit are the same execution unit;
The first execution unit stores the output data obtained after the first operator is executed into a local memory corresponding to the second execution unit in a unified memory access mode, and the method comprises the following steps:
and writing the output data of which the first operator is executed into a local memory of the first execution unit by any first execution unit.
7. The method according to any one of claims 1 to 6, further comprising:
And determining an execution group to which each execution unit belongs and an execution unit serving as a core in each execution group based on the layout of each execution unit on the chip, the access bandwidth of the local memory and the distance between each execution unit and the local memory.
8. A processing device that executes an operator, comprising: n execution units and local memories corresponding to the execution units; the N execution units are divided into a plurality of execution groups;
Any execution unit is arranged to invoke stored program instructions, and to execute the method according to any of claims 1 to 7 in accordance with the obtained program instructions.
9. A computer readable storage medium comprising computer readable instructions which, when read and executed by a computer, cause the method of any one of claims 1 to 7 to be implemented.
10. A computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when executed by a computer device, cause the computer device to carry out the steps of the method according to any one of claims 1-7.
CN202410411948.XA 2024-04-07 2024-04-07 Operator execution method, processing device, storage medium and program product Active CN118012631B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410411948.XA CN118012631B (en) 2024-04-07 2024-04-07 Operator execution method, processing device, storage medium and program product


Publications (2)

Publication Number Publication Date
CN118012631A 2024-05-10
CN118012631B 2024-07-05





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant