CN112967172B - Data processing device, method, computer equipment and storage medium

Info

Publication number: CN112967172B
Application number: CN202110221038.1A
Authority: CN
Inventors: 周军; 常亮; 周亮; 何翔; 赵能
Original assignee: University of Electronic Science and Technology of China; Chengdu Sensetime Technology Co Ltd
Current assignee: University of Electronic Science and Technology of China; Chengdu Sensetime Technology Co Ltd
Priority date: 2021-02-26
Filing date: 2021-02-26
Publication date: 2024-09-17
Anticipated expiration: 2041-02-26
Also published as: WO2022179074A1; CN112967172A

Abstract

The present disclosure provides a data processing apparatus, a method, a computer device, and a storage medium, wherein the apparatus includes: a plurality of first storage units and a computing unit; the computing unit includes a processing engine (PE) array; the plurality of first storage units are respectively connected to PEs in the PE array; each PE is configured to perform read/write access to the first storage unit connected to it; and the plurality of first storage units are configured to store the data transferred during the read/write access of the connected PEs. According to the embodiments of the present disclosure, different PEs in the PE array that are connected to different first storage units can access those first storage units in parallel, which improves the efficiency of reading data from and storing data to the first storage units, and thereby improves data processing efficiency.

Description

Data processing device, method, computer equipment and storage medium

Technical Field

The present disclosure relates to the field of computer technology, and in particular, to a data processing apparatus, a data processing method, a computer device, and a storage medium.

Background

Image processing algorithms are widely applied in fields such as image recognition and target detection, and are commonly implemented on artificial intelligence (AI) accelerator hardware architectures. The hardware architecture of a conventional AI accelerator mainly comprises a storage unit, a computing unit, a control unit, and the like. The core computing unit generally consists of a two-dimensional processing engine (PE) array and a register array (local register file); the storage unit may consist of storage at different levels of the hierarchy, including double data rate (DDR) memory, static random access memory (SRAM), registers, caches, and the like. The input data stream is buffered and transferred through the different storage spaces and enters the register array corresponding to the PE array; the PE array reads the data from the register array and performs arithmetic (or logic) operations on it, and the result is finally written back to the corresponding storage space.

The current way of controlling the storage of input data into the register array has the problem of inefficient data processing.

Disclosure of Invention

The embodiment of the disclosure at least provides a data processing device, a data processing method, computer equipment and a storage medium.

In a first aspect, an embodiment of the present disclosure provides a data processing apparatus, including: a plurality of first storage units and a computing unit; the computing unit comprises a processing engine (PE) array; the plurality of first storage units are respectively connected to PEs in the PE array; each PE is configured to perform read/write access to the first storage unit connected to it; and the plurality of first storage units are configured to store the data transferred during the read/write access of the connected PEs.

Therefore, different PEs connected with different first storage units in the PE array can access different first storage units in parallel, the efficiency of reading data from the first storage units is improved, the efficiency of storing the data to the first storage units is improved, and the data processing efficiency is improved.

In an alternative embodiment, the plurality of first storage units are configured to be connected to different PE groups in the PE array respectively.

In an alternative embodiment, each first storage unit is connected to a PE in a PE group; different PEs belong to different PE groups, respectively.

In an alternative embodiment, one PE group includes a plurality of PEs in the PE array that have a physical connection relationship, and the plurality of PEs are located in the same row, in the same half row, or in the same block of the hardware layout.

Therefore, allocating corresponding first storage units to the PEs in the PE array increases the number of data transmission channels, so that more data can be transferred at the same time when the PEs perform read/write access to the first storage units, which improves data transmission efficiency; it also makes the data processing device more flexible, so that it can adapt to different data processing requirements.

In an optional implementation manner, the PE is configured to perform a read access to the connected first storage unit in a first processing period to obtain first data corresponding to the PE; and/or in a second processing period, performing write access to the connected first storage unit, and storing the second data generated by the PE to the connected first storage unit.

In an alternative embodiment, different PEs connected to the same first storage unit perform read/write access to that first storage unit in different processing cycles; and/or, in the same processing cycle, each of the PE groups connected to different first storage units has one PE that performs read/write access to its connected first storage unit.

In an alternative embodiment, in each PE group connected to a different first storage unit, PEs having the same relative position perform read/write access to the connected first storage unit in the same processing cycle.

Therefore, a plurality of PEs can synchronously read/write access to different first storage units, and the data transmission efficiency is improved.

In an alternative embodiment, the apparatus further comprises: a control unit; the control unit is configured to generate a first control signal based on a data processing instruction and transmit the first control signal to the PE; and the PE is configured to, in response to the first control signal transmitted by the control unit, read the first data to be processed by the PE from the first storage unit connected to the PE.

In an alternative embodiment, the control unit is further configured to generate a second control signal based on the data processing instruction, and transmit the second control signal to the PE; and the PE is used for responding to the second control signal transmitted by the control unit and writing the data generated by the PE into a first storage unit connected with the PE.

In an alternative embodiment, the apparatus further comprises: a data scheduler; the control unit is further configured to generate a third control signal based on the data processing instruction and transmit the third control signal to the data scheduler; the data scheduler is configured to perform write access to the first storage unit based on the third control signal.

Therefore, with the data scheduler acting as the medium for data transmission, large volumes of data can be transferred efficiently and in order, which avoids transmission errors.

In an alternative embodiment, the device further comprises a second storage unit; the data scheduler is configured to read the data to be processed corresponding to each first storage unit from the second storage unit, and to store the data to be processed corresponding to each first storage unit into that first storage unit based on a first data storage address carried in the third control signal; the data to be processed corresponding to each first storage unit comprises: the data that the PEs connected to that first storage unit need to read.

In an alternative embodiment, the control unit is further configured to generate a fourth control signal based on the data processing instruction, and transmit the fourth control signal to the data scheduler; the data scheduler is further configured to perform a read access to the first storage unit based on the fourth control signal.

In an alternative embodiment, the data scheduler is configured to read result data from the plurality of first storage units and store the result data in a second storage unit based on the fourth control signal; wherein the result data includes: data that is generated by the PEs connected to a first storage unit and stored in that first storage unit.
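The two scheduler directions described above (writing pending data into the first storage units, then reading result data back) can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in: `DataScheduler`, `scatter`, `gather`, the dictionary-based memories, and the single `base_address` parameter are assumptions made for illustration, since the disclosure does not specify how the control signals or addresses are encoded.

```python
from typing import Dict, List


class DataScheduler:
    """Toy model of the data scheduler (all names are illustrative assumptions)."""

    def __init__(self, num_first_units: int) -> None:
        # Each first storage unit is modelled as an address -> value map.
        self.first_units: List[Dict[int, int]] = [{} for _ in range(num_first_units)]

    def scatter(self, second_storage: Dict[int, List[int]], base_address: int) -> None:
        """Third control signal: write pending data into each first storage unit.

        second_storage maps a first-storage-unit index to the data its PEs need;
        base_address stands in for the first data storage address in the signal.
        """
        for unit_index, values in second_storage.items():
            for offset, value in enumerate(values):
                self.first_units[unit_index][base_address + offset] = value

    def gather(self, base_address: int, count: int) -> Dict[int, List[int]]:
        """Fourth control signal: read result data from every first storage unit."""
        return {
            unit_index: [unit[base_address + offset] for offset in range(count)]
            for unit_index, unit in enumerate(self.first_units)
        }


scheduler = DataScheduler(num_first_units=2)
scheduler.scatter({0: [10, 11], 1: [20, 21]}, base_address=4)
# Stand-in for PE processing: a PE overwrites one input with a result value.
scheduler.first_units[0][4] = 100
results = scheduler.gather(base_address=4, count=2)
```

In this sketch `results` plays the role of the data that would be stored in the second storage unit.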

In a second aspect, an embodiment of the present disclosure further provides a data processing method, applied to a data processing apparatus, where the data processing apparatus includes: a plurality of first storage units and a computing unit; the computing unit comprises a processing engine (PE) array; the plurality of first storage units are respectively connected to PEs in the PE array; the data processing method comprises the following steps: each PE performs read/write access to the first storage unit connected to it; and the plurality of first storage units store the data transferred during the read/write access of the connected PEs.

In an alternative embodiment, the plurality of first storage units are respectively connected to different PE groups in the PE array.

In an alternative embodiment, the PE performs read/write access to the connected first storage unit, including: the PE performs read access on the connected first storage unit in a first processing period to obtain first data corresponding to the PE; and/or in a second processing period, performing write access to the connected first storage unit, and storing the second data generated by the PE to the connected first storage unit.

In an alternative embodiment, the PE performs read/write access to the connected first storage unit, including: different PEs connected to the same first storage unit perform read/write access to that first storage unit in different processing cycles; and/or, in the same processing cycle, each of the PE groups connected to different first storage units has one PE that performs read/write access to its connected first storage unit.

In an alternative embodiment, the step in which, in the same processing cycle, each of the PE groups connected to different first storage units has one PE that performs read/write access to its connected first storage unit includes: in the PE groups connected to different first storage units, the PEs with the same relative position perform read/write access to their connected first storage units in the same processing cycle.

In an alternative embodiment, the data processing apparatus further comprises a control unit; the data processing method further comprises the following steps: the control unit generates a first control signal based on a data processing instruction and transmits the first control signal to the PE; and the PE responds to the received first control signal transmitted by the control unit, and reads the first data to be processed by the PE from a first storage unit connected with the PE.

In an alternative embodiment, the method further comprises: the control unit generates a second control signal based on the data processing instruction and transmits the second control signal to the PE; and the PE responds to receiving a second control signal transmitted by the control unit, and writes second data generated by the PE into a first storage unit connected with the PE.

In an alternative embodiment, the data processing apparatus further comprises a data scheduler; the data processing method further comprises the following steps: the control unit generates a third control signal based on the data processing instruction and transmits the third control signal to the data scheduler; the data scheduler performs write access to the first storage unit based on the third control signal.

In an alternative embodiment, the data processing apparatus further comprises a second storage unit; the data scheduler reads the data to be processed corresponding to each first storage unit from the second storage unit, and stores the data to be processed corresponding to each first storage unit into that first storage unit based on the first data storage address carried in the third control signal; the data to be processed corresponding to each first storage unit comprises: the data that the PEs connected to that first storage unit need to read.

In an alternative embodiment, the method further comprises: the control unit generates a fourth control signal based on the data processing instruction and transmits the fourth control signal to the data scheduler; the data scheduler performs read access to the first storage unit based on the fourth control signal.

In an alternative embodiment, the data scheduler performs read access to the first storage unit based on the fourth control signal, including: the data scheduler reads result data from the plurality of first storage units based on the fourth control signal and stores the result data into a second storage unit; wherein the result data includes: data that is generated by the PEs connected to a first storage unit and stored in that first storage unit.

In a third aspect, an optional implementation of the disclosure further provides a computer device, including: an instruction memory and a data processing apparatus provided in the first aspect of the present disclosure.

In a fourth aspect, alternative implementations of the disclosure also provide a computer readable storage medium having stored thereon a computer program which when executed performs the steps of the second aspect, or any of the possible implementations of the second aspect, described above.

The description of the effects of the data processing method, the computer device, and the computer readable storage medium refers to the description of the data processing apparatus, and is not repeated here.

The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.

Drawings

In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required for the embodiments are briefly described below. These drawings are incorporated in and constitute a part of the specification; they show embodiments consistent with the present disclosure and, together with the description, serve to illustrate its technical solutions. It is to be understood that the following drawings illustrate only certain embodiments of the present disclosure and are therefore not to be considered limiting of its scope; a person of ordinary skill in the art may derive other related drawings from them without inventive effort.

FIG. 1 shows a schematic diagram of a data processing apparatus provided by an embodiment of the present disclosure;

FIG. 2 illustrates a schematic diagram of a PE array provided in an embodiment of the disclosure;

FIG. 3 is a schematic diagram of an internal structure of a PE according to an embodiment of the disclosure;

FIG. 4a is a schematic diagram illustrating a connection manner between a first storage unit and a PE array according to an embodiment of the disclosure;

FIG. 4b is a schematic diagram illustrating another connection manner between a first storage unit and a PE array according to an embodiment of the disclosure;

FIG. 5 is a schematic diagram of a data processing apparatus according to an embodiment of the disclosure when performing data processing;

FIG. 6 shows a flowchart of a data processing method provided by an embodiment of the present disclosure.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the disclosed embodiments generally described and illustrated herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.

Research has found that when an AI accelerator hardware architecture is used to process image data, the image data to be processed generally has to be transferred from an external memory to the registers contained in the PEs of the PE array, so that the computing unit in each PE can read the corresponding image data from its register and process it. However, the register array formed by the registers of the different PEs shares a single bus whose bandwidth is limited. To avoid data collisions on the bus, the data required by each register has to enter the corresponding register from the external memory one by one. Transferring the image to be processed into the register array therefore takes a large amount of time, and data processing efficiency is low.

In addition, after the PE array has processed the image data, it generates result data that needs to be stored in the external memory; here too, the bus has to transmit the result data generated by the different PEs to the external memory one by one. Storing the result data in the external memory is therefore time-consuming, which lowers the efficiency of data transfer and, in turn, of data processing.

Based on the above research, the present disclosure provides a data processing apparatus that includes a plurality of first storage units. Different first storage units are respectively connected to different PEs in a PE array, and each PE in the PE array can perform read/write access to the first storage unit connected to it. Different PEs that are connected to different first storage units can therefore access those first storage units in parallel, which improves the efficiency of reading data from and storing data to the first storage units, and thereby improves data processing efficiency.

The defects of the above scheme were all identified by the inventors through practice and careful study. Therefore, both the discovery of the above problems and the solutions proposed below should be regarded as contributions made by the inventors in the course of the present disclosure.
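The difference between the shared-bus arrangement and the multi-unit arrangement can be made concrete with a rough cycle count. The following Python sketch is a simplified model, not a measurement: it assumes one transfer per clock cycle on the shared bus and on each first storage unit, and an even split of PEs across units. The function names are hypothetical.

```python
import math


def shared_bus_cycles(num_pes: int) -> int:
    """Registers are filled one by one over a single shared bus."""
    return num_pes


def multi_unit_cycles(num_pes: int, num_first_units: int) -> int:
    """PEs on different first storage units transfer in parallel;
    PEs that share a first storage unit still take turns."""
    return math.ceil(num_pes / num_first_units)


bus = shared_bus_cycles(16)          # 16 PEs on one bus
banked = multi_unit_cycles(16, 4)    # 16 PEs spread over 4 first storage units
```

Under these assumptions the 16-PE, 4-unit configuration needs a quarter of the transfer cycles of the single shared bus.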

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.

For the sake of understanding the present embodiment, a detailed description will be given first of a data processing apparatus provided in an embodiment of the present disclosure.

Referring to fig. 1, a schematic diagram of a data processing apparatus according to an embodiment of the disclosure is shown, where the apparatus includes: a plurality of first storage units (the figure shows first storage unit 0 to first storage unit 3) and a computing unit; the computing unit comprises a processing engine (PE) array; the first storage units are respectively connected to PEs in the PE array (the figure shows PE0 to PE15); wherein,

the PE is used for performing read/write access to the connected first storage unit; and

the plurality of first storage units are used for storing the data transferred during the read/write access of the connected PEs.

Illustratively, the computing unit includes at least one PE array. The physical connections between any PE in the PE array and the other PEs are shown in fig. 2: the PEs together form a two-dimensional (2D) torus network, and each PE may be physically connected to the PEs located above, below, to the left of, and to the right of it.

The PE array includes PEs 22, which actually perform the operation tasks, and PEs 21, which are located at the edge positions of the PE array; to distinguish them, the PEs 21 at the edge positions are labeled "halo" in fig. 2. A PE22 can perform operations such as multiply-accumulate (MAC) on data. The PEs 21 form a ring around the periphery of the PE array. Because the PEs 22 may shift data between different PEs while processing it, the PEs 21 in the peripheral ring ensure that data is not lost when it moves between PEs inside the PE array.

In addition, referring to fig. 3, a schematic diagram of the internal structure of a PE according to an embodiment of the disclosure is shown. This example includes a PE22, which actually performs the operation tasks, and a PE21 connected to it, which is located at an edge position of the PE array. The PE22 includes a memory access module 33, denoted M1; an arithmetic logic unit (ALU) 34, denoted ALU1; an internal register 35, denoted r0_1; and a shift register file 36, wherein:
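As a small illustration of four-direction connectivity in a 2D torus, the Python sketch below computes the neighbors of a PE at a given grid position. It assumes plain wrap-around at the array edges and row/column coordinates; the text does not state how fig. 2 realizes the wrap (for instance, whether it passes through the edge PEs), so `torus_neighbors` is only a hypothetical helper.

```python
from typing import Dict, Tuple


def torus_neighbors(row: int, col: int, rows: int, cols: int) -> Dict[str, Tuple[int, int]]:
    """Return the (row, col) of the up/down/left/right neighbors on a 2D torus."""
    return {
        "up": ((row - 1) % rows, col),       # wraps from the first row to the last
        "down": ((row + 1) % rows, col),     # wraps from the last row to the first
        "left": (row, (col - 1) % cols),     # wraps from the first column to the last
        "right": (row, (col + 1) % cols),    # wraps from the last column to the first
    }


corner = torus_neighbors(0, 0, rows=4, cols=4)
```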

The memory access module M1 is configured to perform read/write access to the first storage unit connected to the PE22. When performing read access, M1 may transmit the data acquired from the first storage unit to the internal register r0_1 or to ALU1, where it waits to be processed by ALU1 in the PE22; or M1 may transmit the acquired data to the shift register file corresponding to the PE22, so that it is passed on to the PE21 connected to the PE22. When performing write access, M1 may write the result operation data obtained by ALU1 into the corresponding first storage unit.

The arithmetic logic unit ALU1 is configured to process the received data. Because processing may involve several intermediate calculation steps, the intermediate operation data may be transferred to an internal register in the PE and fetched from it for further processing in the next calculation. Once the result operation data is obtained, it can, depending on the actual data processing instruction, be transmitted to the memory access module M1 to wait for output, or be transmitted to the shift register file corresponding to the PE22 to wait for transfer to the PE21 connected to the PE22.

Here, because the PE21 does not perform data operations, it may contain no ALU, which lowers the hardware requirement and therefore the cost; alternatively, an ALU may be present but not actually perform data operations, which reduces the complexity of device integration. In fig. 3, the connections that may exist when an ALU2 is present in the PE21, similar to those of ALU1 in the PE22, are shown in dashed lines.

The internal register r0_1 is configured to receive and store the data that M1 reads from the first storage unit connected to the PE22; or, being connected to ALU1, to store the intermediate operation data that ALU1 generates and pass it back to ALU1 so that ALU1 can obtain the result operation data, which the register then stores; or to transmit the result operation data to M1.

Here, because the PE21 may only pass data between PEs or only receive data transferred from the first storage unit, it may contain no internal register, which lowers the hardware requirement and therefore the cost; alternatively, an internal register may be present but not perform any storage task, which reduces the complexity of device integration. In fig. 3, the connections that may exist when an internal register r0_2 is present in the PE21, similar to those of the internal register r0_1 in the PE22, are shown in dashed lines.

The shift register file is configured to transmit the data acquired by the PE to the other PEs connected to it. In fig. 3, the shift register file 36 of the PE22 may be connected by circuit to the shift register files of the PEs to which the PE22 is connected in the four directions up, down, left, and right; accordingly, the shift register file contains 4 shift registers, R1, R2, R3, and R4. Similarly, the PE21 has a shift register file 37 containing R1', R2', R3', and R4'. Data is transferred from the PE22 to the PE21 through the shift register in shift register file 36 that is connected to shift register file 37. For example, when the shift register R4 in the PE22 is connected to the shift register R4' in the PE21, after the PE22 transfers data to R4, R4' in the PE21 may receive the data from R4, so that the PE21 receives the data.

The other PEs have a similar internal structure; the above is only an example, and the details are not repeated here.
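The interplay of the memory access module, the ALU, the internal register, and the shift register file can be sketched as a small Python model. This is a behavioral toy under stated assumptions, not the circuit of fig. 3: the class `ProcessingEngine`, its method names, the dictionary used as a first storage unit, and the immediate copy between linked shift registers are all hypothetical simplifications.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProcessingEngine:
    """Toy PE: memory access module, ALU, internal register, shift registers."""
    storage: Dict[int, int]   # the connected first storage unit (address -> value)
    r0: int = 0               # internal register (r0_1 in fig. 3)
    shift: List[int] = field(default_factory=lambda: [0, 0, 0, 0])  # R1..R4

    def load(self, address: int) -> int:
        """Memory access module: read access to the connected first storage unit."""
        return self.storage[address]

    def mac(self, a: int, b: int) -> None:
        """ALU: multiply-accumulate into the internal register."""
        self.r0 += a * b

    def store(self, address: int) -> None:
        """Memory access module: write the result to the first storage unit."""
        self.storage[address] = self.r0

    def shift_out(self, slot: int, neighbor: "ProcessingEngine") -> None:
        """Put r0 into shift register `slot`; the linked neighbor register receives it."""
        self.shift[slot] = self.r0
        neighbor.shift[slot] = self.shift[slot]


unit = {0: 2, 1: 3, 2: 0}
pe22 = ProcessingEngine(storage=unit)
pe21 = ProcessingEngine(storage=unit)   # edge PE, used here only to receive data
pe22.mac(pe22.load(0), pe22.load(1))    # r0 = 2 * 3
pe22.store(2)                           # write access with the result
pe22.shift_out(3, pe21)                 # R4 -> R4'
```

The edge PE in this sketch never touches its own `r0`, mirroring the case in which an internal register is present but unused.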

Different PEs in the PE array may be connected to different first storage units or to the same first storage unit. A PE and the first storage unit connected to it can transfer data in both directions: the PE can read data from the first storage unit and write processed data back to it.

Each first storage unit may be connected to a plurality of PEs; the first storage unit may therefore include a plurality of storage units, corresponding in number to the connected PEs, for storing the data read or written by each of those PEs.

Specifically, when determining the connection relationship between the plurality of first storage units and the plurality of PEs in the PE array, for example, the plurality of PEs may be first grouped, and a corresponding first storage unit may be determined for each group of PEs after the grouping. When grouping a plurality of PEs, a plurality of PEs having a physical connection relationship may be regarded as one PE group, and the plurality of PEs may be located in the same row, or in the same half row, or in the same block in terms of hardware layout.

For example, in one possible implementation, each row of PEs may be used as a PE group, as shown in fig. 4a, a schematic diagram of one manner of connecting the first storage units to the PE array. In fig. 4a, the first row of PEs forms a PE group denoted G0, the second row forms a PE group denoted G1, and so on, until the nth row forms a PE group denoted Gn. First storage unit 0 is allocated to PE group G0, first storage unit 1 to PE group G1, and so on, until first storage unit n is allocated to PE group Gn.

In another possible embodiment, referring to fig. 4b, a schematic diagram of another manner of connecting the first storage units to the PE array is shown. In fig. 4b, every two rows of PEs form one PE group: the first and second rows form a PE group denoted G0, the third and fourth rows form a PE group denoted G1, and so on. The rows of PEs in the PE array are thus divided into n different PE groups, and n corresponding first storage units are allocated to the n PE groups, numbered, for example, from first storage unit 0 upward.

In particular, each PE in the PE array may itself be used as a PE group, with a corresponding first storage unit allocated to each PE; that is, each PE in the PE array corresponds to one first storage unit. Dividing the first storage units this finely maximizes the throughput of data exchange and reduces the time spent transferring data.
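The three grouping options above (one row per group, two rows per group, one PE per group) can be expressed as a simple index mapping. The Python sketch below assumes row-major PE numbering and that the group index equals the first storage unit index; `group_by_rows` and `group_per_pe` are hypothetical helper names.

```python
from typing import Dict, List


def group_by_rows(rows: int, cols: int, rows_per_group: int) -> Dict[int, List[int]]:
    """Map first-storage-unit index -> PE indices, grouping whole rows together."""
    groups: Dict[int, List[int]] = {}
    for pe in range(rows * cols):
        unit = (pe // cols) // rows_per_group   # row index, then group index
        groups.setdefault(unit, []).append(pe)
    return groups


def group_per_pe(rows: int, cols: int) -> Dict[int, List[int]]:
    """Every PE forms its own group with its own first storage unit."""
    return {pe: [pe] for pe in range(rows * cols)}


one_row = group_by_rows(4, 4, rows_per_group=1)    # fig. 4a style
two_rows = group_by_rows(4, 4, rows_per_group=2)   # fig. 4b style
per_pe = group_per_pe(4, 4)
```

For a 4x4 array this yields 4, 2, and 16 first storage units respectively, which shows the trade-off between the number of parallel channels and the number of storage units to be wired.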

Here, the number of PEs included in each of the plurality of PE groups may be the same or different, for example, in order to reduce the influence of the complexity of wiring in a circuit or the like, a PE that is closer in physical connection relationship is taken as one PE group, or in a more targeted device, a plurality of PEs having a stronger arithmetic processing function are taken as one PE group. That is, the specific manner of determining the PE group may be determined according to practical situations, which is not limited herein.

When the PE performs read/write access to the first storage unit, aiming at the condition of performing read access to the first storage unit: for example, in a first processing period, a read access can be performed on the connected first storage unit to obtain first data corresponding to the PE;

For example, in the case of performing write access to the first storage unit, write access may be performed to the connected first storage unit in the second processing cycle, and the second data generated by the PE may be stored in the connected first storage unit.

Wherein the processing period can be determined according to the actual data processing procedure, and in the processing step of multiplying and adding the data, for example, two or three clock periods can be included because the calculation is simpler; in processing steps such as weighted filtering of data, four or five clock cycles may be involved due to the relatively complex computation. That is, the number of clock cycles included in a processing cycle is related to the actual processing procedure, and the number of clock cycles included in different processing cycles may be the same or different.

In addition, because a plurality of PE groups exist, and when the data transmission is carried out for the PE in the PE groups, the data transmission between each group of PE and the corresponding first storage unit is realized in a single instruction multiple data stream (Single Instruction Multiple Data, SIMD) mode, so that different PEs connected with the same first storage unit can carry out read/write access on the same first storage unit in different data processing periods; and/or, the PE groups connected with different first storage units respectively have one PE in the same processing period to perform read/write access on the connected first storage units.

For example, for the plurality of first storage units and the plurality of PEs shown in fig. 1, in one processing cycle, PEs at the same location in each group of PEs may perform read access to the corresponding first storage units, taking the embodiment corresponding to fig. 1 as an example, the number of first storage units is 4, where: first storage unit 0, first storage unit 1, first storage unit 2, and first storage unit 3, the PE connected to first storage unit 0 includes: PE0, PE1, PE2, PE3, wherein the PE connected to the first memory unit 1 comprises PE4, PE5, PE6, PE7, the PE connected to the first memory unit 2 comprises PE8, PE9, PE10, PE11, and the PE connected to the first memory unit 3 comprises PE12, PE13, PE14, PE15; in this example, the first processing period may include 4 clock cycles; in the first clock cycle, PE0 performs a read access to the first memory cell 0, PE4 performs a read access to the first memory cell 1, PE8 performs a read access to the first memory cell 2, and PE12 performs a read access to the first memory cell 3.

PE0, PE4, PE8, and PE12 may then store the read data in their corresponding internal memories, so that a PE containing an arithmetic logic unit can operate on the read data, while a PE without an arithmetic logic unit can hold the read data and wait for the data movement of the next processing cycle or for other data transmission.

In the second clock cycle, PE1 performs a read access on first storage unit 0 while PE5 performs a read access on first storage unit 1, PE9 performs a read access on first storage unit 2, and PE13 performs a read access on first storage unit 3. In the third clock cycle, PE2 performs a read access on first storage unit 0 while PE6 performs a read access on first storage unit 1, PE10 performs a read access on first storage unit 2, and PE14 performs a read access on first storage unit 3. In the fourth clock cycle, PE3 performs a read access on first storage unit 0 while PE7 performs a read access on first storage unit 1, PE11 performs a read access on first storage unit 2, and PE15 performs a read access on first storage unit 3. In this way, within the first processing cycle the PEs perform read access on the first storage units, the first data stored in each first storage unit and awaiting processing is transmitted to the internal registers of the corresponding PEs, and the PEs then wait for further data access.
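The access pattern walked through above can be summarized as a small schedule. The following is only a sketch under the fig. 1 assumptions (4 first storage units, 4 PEs per group, PEs numbered consecutively within each group); the function name `read_schedule` is hypothetical and not taken from the disclosure.

```python
# Hypothetical model of the read schedule described in the text (fig. 1 layout assumed).
NUM_UNITS = 4       # first storage units 0..3
PES_PER_GROUP = 4   # PEs connected to each first storage unit


def read_schedule(num_units=NUM_UNITS, pes_per_group=PES_PER_GROUP):
    """Return, for each clock cycle, the list of (pe_index, unit_index) read accesses."""
    schedule = []
    for clock in range(pes_per_group):
        # In each clock cycle, the PE at the same position in every group reads.
        accesses = [(unit * pes_per_group + clock, unit) for unit in range(num_units)]
        schedule.append(accesses)
    return schedule


schedule = read_schedule()
for clock, accesses in enumerate(schedule, start=1):
    line = ", ".join(f"PE{pe} -> first storage unit {unit}" for pe, unit in accesses)
    print(f"clock cycle {clock}: {line}")
```

Each first storage unit is accessed by exactly one PE per clock cycle, which is why the four units can be read in parallel without contention.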

In another possible implementation, when the number of PEs in the PE array is large and the image to be processed is small, only some of the PEs in the PE array may be needed to process the image. In that case, some PEs do not access their corresponding first storage unit in the first processing cycle and instead continue to wait for the data processing instruction of the next processing cycle.
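As a rough illustration of this case, the number of PEs that actually need to work can be estimated from the image size and the sub-image size each PE handles. The helper name `active_pe_count`, the 4×4 default sub-image size, and the cap at 16 PEs are assumptions borrowed from the fig. 1 example, not part of the disclosure.

```python
import math


def active_pe_count(image_h, image_w, tile=4, total_pes=16):
    """Estimate how many PEs receive a sub-image (hypothetical helper)."""
    tiles_needed = math.ceil(image_h / tile) * math.ceil(image_w / tile)
    # Assumption: at most one sub-image per PE in a single pass.
    return min(tiles_needed, total_pes)


print(active_pe_count(16, 16))  # every PE in a 16-PE array is used
print(active_pe_count(8, 8))    # only 4 PEs are used; the rest stay idle
```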

Specifically, when the PE performs read access to the first storage unit, a control unit in the data processing device generates a first control signal based on a data processing instruction and transmits the first control signal to the PE, and the PE reads first data to be processed by the PE from the first storage unit connected with the PE in response to receiving the first control signal transmitted by the control unit.

The data processing instructions may include instructions that control the PE to operate on the data in the first storage unit, for example a data transfer instruction (MOV), an add instruction (ADD), a subtract instruction (SUB), or a logical AND instruction (AND).
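To make the instruction types above concrete, the following sketch executes them on a toy register file. Only the mnemonics MOV, ADD, SUB, and AND come from the text; the tuple encoding `(op, dst, src...)`, the register names, and the `execute` function are assumptions made purely for illustration.

```python
import operator

# Hypothetical opcode table: the mnemonics come from the text,
# the mapping to Python operators is an assumption.
OPS = {
    "ADD": operator.add,
    "SUB": operator.sub,
    "AND": operator.and_,
}


def execute(instruction, registers):
    """Apply one instruction of the assumed form (op, dst, src...) to a register dict."""
    op, dst, *srcs = instruction
    if op == "MOV":
        registers[dst] = registers[srcs[0]]      # data transfer
    else:
        a, b = (registers[s] for s in srcs)
        registers[dst] = OPS[op](a, b)           # arithmetic or logic operation
    return registers


regs = {"r0": 6, "r1": 3}
execute(("ADD", "r2", "r0", "r1"), regs)  # r2 = 6 + 3
execute(("SUB", "r3", "r0", "r1"), regs)  # r3 = 6 - 3
execute(("AND", "r4", "r0", "r1"), regs)  # r4 = 0b110 & 0b011
execute(("MOV", "r5", "r0"), regs)        # r5 = r0
print(regs)
```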

Taking the processing of any image by the data processing apparatus as an example, after the image to be processed has been stored in the first storage unit, the control unit can generate a first control signal based on the data transfer instruction. The first control signal includes the data address that the PE receiving it accesses when performing a read access on the first storage unit, and it controls that PE to read data from the corresponding first storage unit and store the read data in its internal register.

For example, because first storage unit 0 shown in fig. 1 is connected to PE0 to PE3, it may include four corresponding data storage spaces, denoted s0, s1, s2, and s3. The first control signal transmitted by the control unit to PE0 may include, for example, the address of s0; after receiving the first control signal, PE0 can read the corresponding data from the data storage space s0 in the connected first storage unit 0 according to the address carried in the signal.

The manner in which other PEs read data from the corresponding first storage unit is similar to the manner in which PE0 reads data from the first storage unit 0, and will not be described herein.
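The addressed read described above can be sketched as follows. This is a minimal model, assuming that a first control signal carries only a space name such as "s0" as its data address; the class and method names are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class FirstStorageUnit:
    spaces: Dict[str, Any]          # data storage spaces, e.g. {"s0": ..., "s1": ...}


@dataclass
class FirstControlSignal:
    address: str                    # data address the PE should read, e.g. "s0"


@dataclass
class PE:
    index: int
    storage: FirstStorageUnit       # the first storage unit this PE is connected to
    register: Optional[Any] = None  # internal register holding the read data

    def on_first_control_signal(self, signal: FirstControlSignal) -> None:
        # Read access: fetch the addressed data and keep it in the internal register.
        self.register = self.storage.spaces[signal.address]


unit0 = FirstStorageUnit(spaces={"s0": [1, 2], "s1": [3, 4], "s2": [5, 6], "s3": [7, 8]})
pe0 = PE(index=0, storage=unit0)
pe0.on_first_control_signal(FirstControlSignal(address="s0"))
print(pe0.register)  # [1, 2]
```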

In addition, the image to be processed may be stored in the first storage unit in, for example, the following manner: the control unit generates a third control signal based on the data processing instruction and transmits the third control signal to a data scheduler in the data processing apparatus; the data scheduler performs write access on the first storage unit based on the third control signal.

The third control signal may, for example, carry a first data storage address, where the first data storage address is used to determine a storage location of the data to be processed stored in the first storage unit.

In a specific implementation, the data processing apparatus further includes a second storage unit, which may include an external memory and is configured to store data to be processed, such as an original image or a feature map. The embodiments of the present disclosure take the processing of an original image as an example to describe the data processing procedure of the data processing apparatus in detail. Taking the PE array shown in fig. 1 as an example, if each PE can process sub-image data composed of 4×4 pixels and the image size (in pixels) is 16×16, the image can be divided evenly so that each of the 16 PEs processes one 4×4 sub-image. The data of the 16 resulting sub-images can then be stored in the second storage unit to await reading by the data scheduler. Because the data stored in the second storage unit can be processed directly by the PEs, transferring it from the second storage unit to the first storage units requires only the data transmission itself, with no further slicing of the data; this reduces the processing tasks of the data processing apparatus during data transmission and improves transmission efficiency. In addition, the data stored in the second storage unit can be used directly as the data to be processed corresponding to each first storage unit, which also makes it easier for the PEs connected to a first storage unit to read their data to be processed.

Specifically, the data scheduler reads the data to be processed corresponding to each first storage unit from the second storage unit, and stores the data to be processed corresponding to each first storage unit into that first storage unit based on the first data storage address carried in the third control signal. The data to be processed corresponding to each first storage unit includes the data that the PEs connected to that first storage unit need to read.
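The division into sub-images and the scheduler's write access can be sketched as below. The row-major assignment of sub-images to PEs, the "s0" to "s3" space names used as storage addresses, and the function names are all assumptions for illustration; the disclosure does not specify which sub-image goes to which PE.

```python
def split_into_tiles(image, tile=4):
    """Cut a square image (list of rows) into tile x tile sub-images, row-major."""
    n = len(image)
    tiles = []
    for r in range(0, n, tile):
        for c in range(0, n, tile):
            tiles.append([row[c:c + tile] for row in image[r:r + tile]])
    return tiles


def schedule_write(second_storage, num_units=4, pes_per_group=4):
    """Model the data scheduler: copy each sub-image into its first storage unit."""
    first_units = [dict() for _ in range(num_units)]
    for pe_index, sub_image in enumerate(second_storage):
        # Assumption: sub-image i belongs to PE i, whose group selects the unit.
        unit, slot = divmod(pe_index, pes_per_group)
        first_units[unit][f"s{slot}"] = sub_image
    return first_units


# A 16x16 test image whose pixel value encodes its position.
image = [[r * 16 + c for c in range(16)] for r in range(16)]
second_storage = split_into_tiles(image)      # 16 sub-images of 4x4 pixels
first_units = schedule_write(second_storage)
print(len(second_storage), first_units[0]["s0"][0])
```

Because the second storage unit already holds PE-sized sub-images, `schedule_write` only copies data and never has to slice it, which mirrors the efficiency argument in the text.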

After a first storage unit has stored the data that its connected PEs need to read, each PE waits for the control unit to transmit a control signal; upon receiving the first control signal sent by the control unit, the PE reads the corresponding data from its first storage unit and processes it. A complex image processing algorithm, such as convolution of an image, includes multiple steps such as weighted summation, so multiple pieces of intermediate data may arise during processing. The intermediate data may, for example, be temporarily stored in the internal memory of the corresponding PE and then be used directly in the next processing step, until all data processing tasks for the original image are completed.

Alternatively, the intermediate data may be transferred to the first storage unit. However, because the intermediate data is not the final output result data and requires further processing, the intermediate data in the first storage unit need not be output to the second storage unit.

Specifically, the control unit may generate a second control signal based on the data processing instruction and transmit the second control signal to the PE; in response to receiving the second control signal transmitted by the control unit, the PE writes the data it has generated into the first storage unit connected to it.

The second control signal is similar to the first control signal: it includes the data address that the PE receiving it accesses when performing a write access on the first storage unit, and it controls that PE to write data into the corresponding first storage unit. The first storage unit thus receives the data written by the corresponding PE and holds it for output to the second storage unit, from which the processing result of the original image is obtained.

After the PEs complete all processing steps on the data of the original image, the result data for output is obtained. At this point, the control unit can also generate a fourth control signal and transmit it to the data scheduler; the data scheduler reads the result data from the plurality of first storage units based on the fourth control signal and stores the result data into the second storage unit. The result data includes data that was generated by the PEs connected to a first storage unit and stored in that first storage unit.

Specifically, the fourth control signal may carry a second data storage address, which indicates where in the second storage unit the data scheduler is to store the result data. Alternatively, the fourth control signal may not carry the second data storage address.

For example, the data scheduler may read result data generated by PE0, PE1, PE2, and PE3, respectively, from the first storage unit 0, that is, result data stored in the four data storage spaces s0, s1, s2, and s3 of the first storage unit 0, and then store the result data in the second storage unit, thereby obtaining a processing result of the original image.

In one possible implementation, the control unit may further control the plurality of pieces of result data output to the second storage unit to be spliced in order, so that the result data obtained for the plurality of sub-images into which the original image was divided is restored into the result data corresponding to the whole original image.
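The splicing step can be sketched as the inverse of dividing the image into sub-images. The row-major ordering of the result sub-images and the function name `splice_tiles` are assumptions; the disclosure only states that the result data is spliced sequentially.

```python
def splice_tiles(tiles, tiles_per_row=4):
    """Reassemble equally sized sub-images (row-major order assumed) into one image."""
    tile_height = len(tiles[0])
    image = []
    for start in range(0, len(tiles), tiles_per_row):
        row_of_tiles = tiles[start:start + tiles_per_row]
        for r in range(tile_height):
            # Concatenate row r of every sub-image in this band, left to right.
            image.append([value for tile in row_of_tiles for value in tile[r]])
    return image


# 16 result sub-images of 4x4 pixels; each is filled with its own index for clarity.
result_tiles = [[[index] * 4 for _ in range(4)] for index in range(16)]
restored = splice_tiles(result_tiles)
print(len(restored), len(restored[0]))  # 16 16
```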

The embodiment of the disclosure also provides a specific example of convolution processing of the original image a with the data processing apparatus.

Referring to fig. 5, a schematic diagram of the data processing apparatus performing data processing is shown. There are 4 first storage units, denoted PE_RAM0 to PE_RAM3, and the PE array includes 16 PEs, denoted PE0 to PE15.

Wherein PE0 to PE3 are taken as one PE group, PE4 to PE7 are taken as one PE group, PE8 to PE11 are taken as one PE group, PE12 to PE15 are taken as one PE group, and are respectively denoted as G0, G1, G2 and G3.

After the PE groups are determined, PE_RAM0 among the first storage units is taken as the first storage unit corresponding to G0; PE_RAM1 is taken as the first storage unit corresponding to G1; PE_RAM2 is taken as the first storage unit corresponding to G2; and PE_RAM3 is taken as the first storage unit corresponding to G3.

When the data processing apparatus performs the operation of a convolution layer, the control unit generates a third control signal C3 based on the data processing instruction and sends it to the data scheduler. The data scheduler performs read access on the second storage unit, which stores the data corresponding to the original image A, and then stores the data used for the convolution calculation from the second storage unit into the first storage units.

Then, the control unit sends a first control signal C1 to the PEs, and each PE working in the PE array reads first data to be processed from the corresponding first storage unit, and then performs corresponding calculation.

C1 controls the following operations: in the first clock cycle, PE0, PE4, PE8, and PE12, which correspond to PE_RAM0 to PE_RAM3 respectively, each read their corresponding first data to be processed; in the second clock cycle, PE1, PE5, PE9, and PE13 each read their corresponding first data to be processed; in the third clock cycle, PE2, PE6, PE10, and PE14 each read their corresponding first data to be processed; in the fourth clock cycle, PE3, PE7, PE11, and PE15 each read their corresponding first data to be processed.

Then, PE0 to PE15 each process their corresponding first data to be processed, for example by performing a convolution operation on the first data, to obtain second data.

Here, the second data is the result data.

After the PEs in the PE array have processed the first data to obtain the second data, the control unit sends a second control signal C2 to the PEs, and the second data in each PE is written into the first storage unit corresponding to that PE. The control unit then transmits a fourth control signal C4 to the data scheduler, causing the data scheduler to read the result data out of the first storage units and store it in the second storage unit.
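The sequence of control signals C3, C1, C2, and C4 in this example can be sketched end to end as follows. This is a simplified software model, not the hardware behavior: the `process` callback stands in for the convolution operation, and the list-based PE_RAM layout, the function name `run_pipeline`, and the one-value-per-PE data are assumptions made for brevity.

```python
NUM_UNITS = 4   # PE_RAM0..PE_RAM3
GROUP = 4       # PEs per group (G0..G3)


def run_pipeline(second_storage_in, process):
    """Model C3 -> C1 -> compute -> C2 -> C4 for 16 PEs (hypothetical sketch)."""
    # C3: the data scheduler writes the data to be processed into the first storage units.
    pe_ram = [[None] * GROUP for _ in range(NUM_UNITS)]
    for pe, data in enumerate(second_storage_in):
        unit, slot = divmod(pe, GROUP)
        pe_ram[unit][slot] = data

    # C1: in each clock cycle, one PE per group reads from its first storage unit.
    registers = [None] * (NUM_UNITS * GROUP)
    for clock in range(GROUP):
        for unit in range(NUM_UNITS):
            registers[unit * GROUP + clock] = pe_ram[unit][clock]

    # Each PE turns its first data into second data (stand-in for convolution).
    registers = [process(value) for value in registers]

    # C2: each PE writes its second data back to its first storage unit.
    for clock in range(GROUP):
        for unit in range(NUM_UNITS):
            pe_ram[unit][clock] = registers[unit * GROUP + clock]

    # C4: the data scheduler reads the result data out to the second storage unit.
    return [pe_ram[unit][slot] for unit in range(NUM_UNITS) for slot in range(GROUP)]


result = run_pipeline(list(range(16)), lambda x: x * 2)
print(result)
```

Doubling each value is used only so that the output is easy to check by eye; any per-PE operation could be passed as `process`.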

It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.

Based on the same inventive concept, the embodiments of the present disclosure further provide a data processing method corresponding to the data processing apparatus. Because the method solves the problem on a principle similar to that of the data processing apparatus in the embodiments of the present disclosure, the implementation of the method may refer to the implementation of the apparatus, and repeated description is omitted.

Referring to fig. 6, a schematic diagram of a data processing method according to an embodiment of the disclosure is shown, where the data processing method is applied to a data processing apparatus; the data processing method comprises the following steps:

S601: PE performs read/write access on the connected first storage unit;

S602: the plurality of first storage units store data transmitted in the read/write access process of the connected PE.

The embodiments of the present disclosure also provide a computer device, comprising: an instruction memory and the data processing apparatus provided by the embodiments of the present disclosure.

The data processing apparatus provided by the embodiments of the present disclosure may include a chip, an AI chip, and the like. The computer device provided in the embodiments of the present disclosure may include an intelligent terminal such as a mobile phone, or may be other devices, servers, etc. that may be used to perform data processing, which is not limited herein.

The embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the data processing method described in the method embodiments above. The storage medium may be a volatile or non-volatile computer-readable storage medium.

Embodiments of the present disclosure further provide a computer program product, where the computer program product carries program code, where instructions included in the program code may be used to perform steps of a data processing method described in the foregoing method embodiments, and specifically reference may be made to the foregoing method embodiments, which are not described herein.

Wherein the above-mentioned computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in essence or a part contributing to the prior art or a part of the technical solution, or in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present disclosure. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

Finally, it should be noted that the foregoing embodiments are merely specific implementations of the present disclosure, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art will appreciate that, within the technical scope disclosed herein, anyone skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and shall all fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims

1. A data processing apparatus, comprising: a first storage unit and a calculation unit; the computing unit comprises a processing engine PE array; a plurality of PEs in the PE array form a torus network, any PE having different PEs physically connected in a plurality of different orientations; the plurality of first storage units are respectively connected with PEs in the PE array;

The PE is used for performing read/write access on the connected first storage unit; the PEs in the array include: PE in PE array for completing operation task, and PE in edge position of PE array for ensuring data not to be lost when moving between different PEs in PE array;

the plurality of first storage units are used for storing data transmitted in the read/write access process of the connected PE.

2. The data processing apparatus according to claim 1, wherein the plurality of first storage units are respectively connected to different PE groups in the PE array.

3. A data processing apparatus according to claim 1 or 2, wherein each first storage unit is connected to a PE of a group of PEs; different PEs belong to different PE groups, respectively.

4. A data processing apparatus according to claim 3, wherein said one PE group includes a plurality of PEs having a physical connection relationship in said PE array, and wherein the plurality of PEs are located in the same row, or in the same half row, or in the same block in a hardware layout.

5. The data processing apparatus according to any one of claims 1 to 4, wherein the PE is configured to perform a read access to the connected first storage unit in a first processing cycle to obtain first data corresponding to the PE; and/or

And in a second processing period, performing write access on the connected first storage unit, and storing the second data generated by the PE into the connected first storage unit.

6. A data processing apparatus according to any one of claims 1-5, wherein different PEs to which the same first memory unit is connected perform read/write access to the same first memory unit in different processing cycles;

and/or,

The PE groups connected with different first storage units respectively have one PE in the same processing period to perform read/write access on the connected first storage units.

7. The data processing apparatus according to claim 6, wherein among the groups of PEs to which the different first storage units are connected, PEs having the same relative position perform read/write access to the connected first storage units in the same processing cycle.

8. The data processing apparatus according to any one of claims 1 to 7, further comprising: a control unit;

the control unit is used for generating a first control signal based on a data processing instruction and transmitting the first control signal to the PE;

and the PE is used for responding to the first control signal transmitted by the control unit and reading the first data to be processed by the PE from a first storage unit connected with the PE.

9. The data processing apparatus according to claim 8, wherein the control unit is further configured to generate a second control signal based on the data processing instruction, and transmit the second control signal to the PE;

And the PE is used for responding to the second control signal transmitted by the control unit and writing the data generated by the PE into a first storage unit connected with the PE.

10. The data processing apparatus according to claim 8 or 9, further comprising: a data scheduler;

The control unit is further used for generating a third control signal based on the data processing instruction and transmitting the third control signal to the data scheduler;

the data scheduler is configured to perform write access to the first storage unit based on the third control signal.

11. The data processing apparatus of claim 10, further comprising a second storage unit;

The data scheduler is configured to read data to be processed corresponding to each first storage unit from the second storage unit, and store the data to be processed corresponding to each first storage unit into the corresponding first storage unit based on a first data storage address carried in the third control signal;

the data to be processed corresponding to each first storage unit comprises: the data that the PEs connected with each first storage unit need to read.

12. The data processing apparatus according to claim 10 or 11, wherein the control unit is further configured to generate a fourth control signal based on the data processing instruction, and to deliver the fourth control signal to the data scheduler;

The data scheduler is further configured to perform a read access to the first storage unit based on the fourth control signal.

13. The data processing apparatus according to claim 12, wherein the data scheduler is configured to read result data from the plurality of first storage units and store the result data into a second storage unit based on the fourth control signal;

Wherein the result data includes: data generated by the PE connected with the first storage unit and stored in the first storage unit.

14. A data processing method, characterized by being applied to a data processing apparatus, the data processing apparatus comprising: a first storage unit and a calculation unit; the computing unit comprises a processing engine PE array; a plurality of PEs in the PE array form a torus network, any PE having different PEs physically connected in a plurality of different orientations; the plurality of first storage units are respectively connected with PEs in the PE array; the data processing method comprises the following steps:

The PE performs read/write access on the connected first storage unit; the PEs in the array include: PE in PE array for completing operation task, and PE in edge position of PE array for ensuring data not to be lost when moving between different PEs in PE array;

the plurality of first storage units store data transmitted in the read/write access process of the connected PE.

15. The method of claim 14, wherein the PE has read/write access to the connected first storage unit, comprising:

the PE performs read access on the connected first storage unit in a first processing period to obtain first data corresponding to the PE; and/or

16. A data processing method according to claim 14 or 15, wherein the PE has read/write access to the connected first storage unit, comprising:

Different PEs connected with the same first storage unit perform read/write access on the same first storage unit in different processing periods;

and/or,

17. The method according to claim 16, wherein the group of PEs to which the different first storage units are connected have read/write access to the connected first storage units by one PE respectively in the same processing cycle, comprising: in each PE group connected to different first storage units, the PEs with the same relative positions perform read/write access to the connected first storage units in the same processing cycle.

18. A data processing method according to any one of claims 14-17, wherein the data processing apparatus further comprises a control unit; the data processing method further comprises the following steps:

The control unit generates a first control signal based on a data processing instruction and transmits the first control signal to the PE;

And the PE responds to the received first control signal transmitted by the control unit, and reads the first data to be processed by the PE from a first storage unit connected with the PE.

19. The data processing method of claim 18, further comprising:

the control unit generates a second control signal based on the data processing instruction and transmits the second control signal to the PE;

And the PE responds to receiving a second control signal transmitted by the control unit, and writes second data generated by the PE into a first storage unit connected with the PE.

20. A data processing method according to claim 18 or 19, wherein the data processing apparatus further comprises a data scheduler; the data processing method further comprises the following steps:

The control unit generates a third control signal based on the data processing instruction and transmits the third control signal to the data scheduler;

The data scheduler performs write access to the first storage unit based on the third control signal.

21. The data processing method according to claim 20, wherein the data processing apparatus further comprises a second storage unit;

The data scheduler reads the data to be processed corresponding to each first storage unit from the second storage unit, and stores the data to be processed corresponding to each first storage unit into the corresponding first storage unit based on the first data storage address carried in the third control signal;

22. A data processing method according to claim 20 or 21, further comprising:

The control unit generates a fourth control signal based on the data processing instruction and transmits the fourth control signal to the data scheduler;

the data scheduler performs read access to the first storage unit based on the fourth control signal.

23. The data processing method of claim 22, wherein the data scheduler performs read access to the first storage unit based on the fourth control signal, comprising:

The data scheduler reads result data from the plurality of first storage units based on the fourth control signal and stores the result data into a second storage unit;

24. A computer device, comprising: instruction memory and data processing apparatus according to any one of claims 1 to 13.

25. A computer-readable storage medium, on which a computer program is stored, which computer program, when run by a data processing apparatus, performs the steps of the data processing method according to any one of claims 14 to 23.