
CN110766150A - Regional parallel data loading device and method in deep convolutional neural network hardware accelerator - Google Patents

Regional parallel data loading device and method in deep convolutional neural network hardware accelerator

Info

Publication number
CN110766150A
Authority
CN
China
Prior art keywords
parallel
data
access
input
register array
Prior art date
Legal status
Withdrawn
Application number
CN201910979031.9A
Other languages
Chinese (zh)
Inventor
杨旭光
林森
伍世聪
Current Assignee
Beijing Xinqi Technology Co Ltd
Original Assignee
Beijing Xinqi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xinqi Technology Co Ltd
Priority to CN201910979031.9A
Publication of CN110766150A
Legal status: Withdrawn

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a region-parallel data loading device and method for deep convolution, which supply high-bandwidth input data to an array of parallel hardware-accelerated computation units; because the input data are regionalized, the circuit design is greatly simplified. The device comprises: a parallel input register array, which provides a fast register region in which the input feature map held in the input buffer is rearranged; and a parallel input data access engine, which accesses the data in the parallel input register array region by region and in parallel, simplifying the interconnect structure and saving area and power consumption.

Description

Regional parallel data loading device and method in deep convolutional neural network hardware accelerator
Technical Field
The invention belongs to the fields of computer hardware and hardware acceleration for deploying artificial neural network algorithms, and relates to digital integrated circuit design; in particular, it concerns a key device for handling the input data of a deep convolutional neural network hardware acceleration chip and a design method thereof.
Background
A deep convolutional neural network consists of a number of specific neuron layers and hidden layers, most of them convolutional layers whose main operator is matrix or vector convolution. The computation is characterized by a large volume of input data with strong spatial coupling: the data used by successive convolutions overlap heavily, and the inputs are usually extracted from tensor-format data according to a fixed spatial rule. Convolutional layers therefore demand both large compute power and large amounts of data. When an artificial neural network is deployed on an embedded edge chip, the limited hardware resources of the accelerator force the data to be tiled; each tile should exploit the spatial coupling of the input data as much as possible, so that bandwidth is not wasted on refilling duplicated data. Moreover, for the different network algorithms commonly used across fields and industrial scenarios, the data tiling scheme should be as simple and easy to implement as possible, otherwise low-cost, rapidly deployed applications cannot be supported.
Patent document 1 (publication No. CN105488565A) discloses an arithmetic device and method for an acceleration chip that accelerates deep neural network algorithms; to overcome the problem that a large number of intermediate values are generated and must be stored, inflating the required main-memory space, the device is provided with intermediate-value storage areas that reduce the number of reads and writes of intermediate values to main memory. Patent document 2 (publication No. US2017/0103316A1) discloses a method, system and apparatus for a convolutional neural network accelerator in which a Unified Buffer is designed. Because different neural network layers differ in size and in degree of data reuse, accelerator resources are easily wasted: patent document 1 must be paired with other heterogeneous processors to help with the data-arrangement task and has been drifting toward server-side applications, while patent document 2, owing to its very large parallel-compute and storage scale, can only be used in server and data-center scenarios.
Patent document 3 (application publication No. CN107341544A) discloses a reconfigurable accelerator based on a partitionable array and an implementation method thereof, in which a scratch-pad memory buffer is designed for data reuse. The reconfigurable-computing approach solves the resource-waste problem, but its data tiling and arrangement method is very complex, so redeploying a new network is very difficult. Patent document 4 (publication No. US20180341495A1) discloses a convolutional neural network accelerator and method in which a cache device supplies the data required for parallel acceleration; however, that design is too tightly coupled with the central processing unit and its implementation complexity is too high.
Disclosure of Invention
The invention provides a region-parallel data loading device in a deep convolutional neural network hardware parallel accelerator, and a method thereof, which reduce application complexity and cost, reduce the complexity of the hardware circuit design, reduce chip area and power consumption, and at the same time provide high-throughput, high-performance parallel data bandwidth.
To achieve the above object, an embodiment of the present invention provides a region-parallel data loading apparatus, which includes:
a parallel input register array, which provides a fast register region in which the input feature map held in the input buffer is rearranged; the registered data serve as the input data for the high-bandwidth computation of the parallel acceleration computing unit array and can be accessed randomly, or simultaneously in parallel and concurrently;
a parallel input data access engine, which performs regionalized parallel and concurrent access to the data in the parallel input register array; it does not need to access the register array as a whole, and causes no data loss.
In the region-parallel data loading device, the parallel input register array buffers the feature map that is stored in the input buffer and was output by the hidden layer preceding the current deep convolutional neural network algorithm layer; it provides a fast register region for data rearrangement, which simplifies the arrangement of the input data. The parallel input register array can be accessed repeatedly, and when its data is invalidated, new data can be written quickly from the input buffer. The register array supports random access, simultaneous parallel access and multi-way concurrent access, and the number of concurrent ways is not less than the number of parallel acceleration computing units in the deep convolutional neural network hardware parallel accelerator.
In the region-parallel data loading device, the parallel input data access engine performs regionalized parallel and concurrent access to the data in the parallel input register array; the access is neither serial access nor random access over the full address space, and the number of concurrent access ways is not less than the number of parallel acceleration computing units in the deep convolutional neural network hardware parallel accelerator. The regional data in the parallel input register array are accessed repeatedly according to a fixed rule, so the regional data-coupling characteristic of the input feature map of a convolutional layer is exploited inside each data region, and duplicated data need not be written into the parallel input register array over and over again.
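As a reading aid, the following minimal Python sketch models this organization; the class and method names are illustrative only and are not taken from the patent.

# A minimal conceptual sketch (my reading of the description, not the patent's
# circuit): the parallel input register array (IRA) is statically divided into
# one region per concurrent access way; regions overlap by Kw-1 / Kh-1 entries,
# which is how neighbouring convolution windows reuse data without refilling.
class RegionParallelIRA:
    def __init__(self, rh, rw, bh, bw, kh, kw):
        self.regs = [[0.0] * rw for _ in range(rh)]  # the register array
        self.bh, self.bw, self.kh, self.kw = bh, bw, kh, kw

    def fill(self, tile):
        # load/rearrange one tile of the input feature map from the input buffer
        for y, row in enumerate(tile):
            self.regs[y][:len(row)] = row

    def read_way(self, py, px, ly, lx):
        # way (py, px) serves one compute unit and only reads inside its fixed
        # region, whose origin is (py*Bh, px*Bw) and whose size is
        # (Bh+Kh-1) x (Bw+Kw-1); (ly, lx) is the local offset shared by all ways
        return self.regs[py * self.bh + ly][px * self.bw + lx]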
The embodiment of the invention also provides a design method of the regional parallel data loading device, which comprises the following methods and principles:
the size of the parallel input register array is related to the instantiated size of the parallel computing unit array and satisfies a specific design formula;
the regions of the parallel input register array are accessed in a regionalized, parallel and concurrent manner, the access being neither serial access nor random access over the full address space; for simultaneous, concurrent multi-way access, the address calculation follows a specific rule, and the conversion rule is simple. This design method simplifies the hardware circuitry inside the hardware engine module and reduces area and power consumption.
The invention has the following effects:
1. It simplifies the interconnection between the hardware parallel computing unit array and the input device.
2. It simplifies the spatial complexity of arranging data between the output device and main storage.
3. It simplifies the address calculation for software-configured data and for partitioning data macro-blocks.
4. It improves the practical utilization efficiency of the hardware parallel computing unit array.
5. It is better suited to implementation on a low-cost embedded ASIC chip.
Drawings
FIG. 1 is a block diagram of the data flow and apparatus of an embodiment of the region-parallel data loading method according to the present invention;
FIG. 2 is a region data flow diagram for the 1st convolution calculation of the embodiment;
FIG. 3 is a region data flow diagram for the 2nd convolution calculation of the embodiment;
FIG. 4 is a region data flow diagram for the 3rd convolution calculation of the embodiment;
FIG. 5 is a structural diagram of the region-parallel data loading apparatus according to the present invention.
Description of the reference numerals
1 parallel hardware computing unit array (Process Elements Array, PEA)
12 parallel output register array (Output Register Array, ORA)
101 convolution computing element (Process Element, PE)
202 parallel input register array (Input Register Array, IRA)
203 parallel input data access engine (IRA Data Access Engine, IDE)
Detailed Description
The invention is described in further detail below with reference to the figures and examples.
Fig. 1 shows the data flow and apparatus structure of the method for loading region-parallel data in a deep convolutional neural network hardware accelerator according to the present invention; the apparatus includes a parallel input register array (IRA) 202 and a parallel input data access engine (IDE) 203. A simplified connection of the apparatus of the present invention to a parallel hardware computing unit array (PEA) 1 is also illustrated; the array 1 consists of several parallel hardware computing units (PEs) 101.
The parallel input register array (IRA) 202 is composed of a specific number of registers and provides a fast register region for data rearrangement, which simplifies the arrangement of the input data. The parallel input register array can be accessed repeatedly, and when its data is invalidated, new data can be written quickly from the input buffer. The register array supports random access, simultaneous parallel access and multi-way concurrent access, and the number of concurrent ways is not less than the number of parallel acceleration computing units in the deep convolutional neural network hardware parallel accelerator.
As shown in fig. 1, the register units in 202 are statically divided into regions, marked by the ellipses above 202 in the figure; the number of regions represents the capability for concurrent, parallel access. The multi-way access engine in 203 does not need to randomly access every address of the whole array 202: each way only needs to access the register sub-array of its corresponding region, and the correspondence can be fixed, which further reduces the circuit scale.
In one embodiment, the PEA is a rectangular 2-dimensional array; the instantiated size of the PEs on the PEA is Pw wide and Ph high, so the number of instantiated PEs is P = Pw × Ph, and a convolution kernel of size Kw × Kh is executed with parallel acceleration. The IRA is likewise treated as a 2-dimensional array whose instantiated size is Rw wide and Rh high; the number of convolutions the PEA can perform per IRA fill is B = Bw × Bh, and the results are registered in the parallel output register array (ORA) 12. The invention provides a design method for the region-parallel data loading device in a deep convolutional neural network hardware accelerator, comprising the following steps:
estimating the total instantiation size P of the PEA, P = Pw × Ph, according to the target algorithms of the specific field and the computational requirements of the commonly used networks, in combination with the real-time requirements of the industrial application and the theoretical compute-power range of the parallel accelerator device; the number of concurrent, parallel data input ways is also P; as marked in fig. 1, the parallel computing acceleration requirement is satisfied if the number of ways of the input data access engine (IDE) 203 is also P;
knowing, from the design architecture of the PE, its maximum output compute power and efficiency, and combining this with the chip design choices, selecting the number of convolutions B that the PEA completes after each IRA data update; as marked in fig. 1, the number of register units in the parallel output register array (ORA) 12 is P × B;
the invention provides the size of the region-parallel input data register array in the deep convolutional neural network hardware accelerator as: Rw = Bw × Pw + Kw − 1, Rh = Bh × Ph + Kh − 1 (a numeric sketch of these sizing rules follows these steps);
assuming that the coordinates of a PE are (px, py), the upper-left coordinates (x, y) of the input data region corresponding to that PE are calculated as: x = px × Bw + px % Bw + fw(Kw), y = py × Bh + py / Bh + fh(Kh);
the functions fw() and fh() depend on the maximum convolution kernel size that the embodiment chooses to support: fw(x) = max(supported_Kw)/Bw − x/Bw and fh(x) = max(supported_Kh)/Bh − x/Bh, respectively;
the size of the input data region (dw × dh) for a PE is calculated as: dw = Kw + Bw − 1, dh = Kh + Bh − 1;
an input data access engine (IDE) 203 accesses the respective regions; the access sequence and the positions within every region are identical, which simplifies the address-generation circuit design;
in the case of a convolution stride Stride > 1, the supported number of convolutions becomes B = (Bw/Stride) × (Bh/Stride), i.e. the number of PEA computations that can be supported after each IRA fill decreases by that factor.
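For illustration only, the following sketch evaluates the sizing and region-placement rules above, as reconstructed here, for an assumed configuration that matches the K = 7, B = 2 × 2 example of fig. 5; the function names and the max_kw/max_kh defaults are mine, not the patent's.

# Illustrative evaluation of the sizing rules; all parameter values are assumed.
def ira_size(pw, ph, kw, kh, bw, bh):
    rw = bw * pw + kw - 1              # Rw = Bw*Pw + Kw - 1
    rh = bh * ph + kh - 1              # Rh = Bh*Ph + Kh - 1
    return rw, rh

def pe_region_origin(px, py, kw, kh, bw, bh, max_kw=7, max_kh=7):
    fw = max_kw // bw - kw // bw       # fw(Kw) = max(supported_Kw)/Bw - Kw/Bw
    fh = max_kh // bh - kh // bh       # fh(Kh) = max(supported_Kh)/Bh - Kh/Bh
    x = px * bw + px % bw + fw         # upper-left x of this PE's input region
    y = py * bh + py // bh + fh        # upper-left y of this PE's input region
    return x, y

def pe_region_size(kw, kh, bw, bh):
    return kw + bw - 1, kh + bh - 1    # dw x dh

print(ira_size(4, 4, 7, 7, 2, 2))          # (14, 14): a 14 x 14 IRA for a 4 x 4 PEA
print(pe_region_origin(0, 0, 7, 7, 2, 2))  # (0, 0): region origin of the upper-left PE
print(pe_region_size(7, 7, 2, 2))          # (8, 8): 64 registers, as in fig. 5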
The invention provides the following computation order for the region-parallel input data in the deep convolutional neural network hardware accelerator:
as shown in fig. 2, within the current region data, all IDE ways synchronously and concurrently access the Kw × Kh sub-region at the upper-left corner, supporting the corresponding PEs to perform the 1st convolution calculation;
as shown in fig. 3, the Kw × Kh sub-region accessed synchronously and concurrently by all IDE ways is translated to the right by Stride data, supporting the corresponding PEs to perform the 2nd convolution calculation; the result of the previous (1st) calculation is stored in the ORA;
as shown in fig. 4, after the Kw × Kh sub-region accessed synchronously and concurrently by all IDE ways has traversed all horizontal positions, it moves down by the stride and the (Bw+1)-th convolution calculation begins; the previous results are stored in Bw register units of the ORA;
the above steps are repeated until all Bw × Bh convolutions have been calculated (a sketch of this traversal follows below).
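As a reading aid only, the following sketch enumerates the window origins visited inside one PE's region in the order described above (left to right by the stride, then down); the parameter values are assumed and the helper name is not from the patent.

# Sketch of the access order inside one PE's region: the Kw x Kh window is
# swept left-to-right by Stride, then down by Stride, giving Bw*Bh windows.
def window_origins(bw, bh, stride=1):
    origins = []
    for row in range(bh):                # vertical window index
        for col in range(bw):            # horizontal window index
            origins.append((row * stride, col * stride))
    return origins

# For Bw = Bh = 2 and stride 1 this yields [(0, 0), (0, 1), (1, 0), (1, 1)]:
# the 1st, 2nd, 3rd and 4th convolutions of figs. 2-4, whose 7 x 7 windows
# are centered on r33, r34, r43 and r44 for the upper-left PE of fig. 5.
print(window_origins(2, 2))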
Fig. 5 is a detailed structural diagram of the region-parallel data loading apparatus according to the present invention, in which the IDE apparatus 203 includes the address decoder and multiplexer 2031 from the IDE to each IRA region, the address translator 2033, and the address generator 2032.
As shown in fig. 5, when the upper-left PE p0, with (px, py) = (0, 0), performs the first convolution with K = 7, the center point of the convolution window is located at r33; according to the above rules of the present invention, the other three convolutions are centered at r34, r43 and r44, and the dashed box encloses the whole region that p0 needs to access, 64 input data in total. The dashed arrow indicates that the 64 data of the region are connected to the 1-out-of-64 multiplexer 2031. The local address generator 2032 produces the local address sequence; the address translator 2033 converts the address of each way into an address adjusted by the local offset of its region; the multiplexer 2031 decodes this address to select one datum and outputs it to the PE, until all 4 convolutions are completed.
As shown in fig. 5, while p0 is accelerating the computation on its corresponding region, the other PEs are performing the array computation in parallel and synchronously. The center point of the IRA region corresponding to each PE is marked with a cross. The addresses and the access order within the respective regions are the same for all PEs, which simplifies the control circuit.
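The following sketch is my own simplification of the fig. 5 data path, with assumed names; it illustrates why one shared local-address sequence suffices: a single generator drives every way, each way merely adds its fixed region offset, and a 64-to-1 multiplexer selects one datum from that PE's 8 × 8 region.

# Simplified model of the fig. 5 data path (names and structure assumed).
def local_address_sequence(kw, kh, bw, bh, stride=1):
    # local addresses inside a dw x dh region, identical for every PE/way;
    # the outer loops follow the window traversal order of figs. 2-4
    dw = kw + bw - 1
    for wy in range(bh):
        for wx in range(bw):
            for ky in range(kh):
                for kx in range(kw):
                    yield (wy * stride + ky) * dw + (wx * stride + kx)

def read_for_pe(ira, origin_y, origin_x, local_addr, dw):
    # address translation: the shared local address is split into a local
    # (row, col) offset and shifted by this PE's fixed region origin;
    # the indexing models the per-way 64-to-1 selection
    ly, lx = divmod(local_addr, dw)
    return ira[origin_y + ly][origin_x + lx]

# Example usage for the upper-left PE (region origin (0, 0), dw = 8):
# for addr in local_address_sequence(7, 7, 2, 2):
#     datum = read_for_pe(ira, 0, 0, addr, dw=8)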
With the region-parallel input device and the above method of dividing the regions accessed by the PEs, the invention meets the required input-data throughput, the circuit design is simplified, and the input-data partitioning method remains simple and intuitive.
The present invention may be described in the general context of general and/or extended instructions executed by a central controller, such as a software program. Software programs generally include routines, objects, components, data structures, reference models and the like that perform particular tasks or implement particular data types.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A region parallel data loading apparatus in a deep convolutional neural network hardware parallel accelerator, comprising:
a parallel input register array, which provides a fast register region in which the input feature map held in the input buffer is rearranged; the registered data serve as the input data for the high-bandwidth computation of the parallel acceleration computing unit array and can be accessed randomly, or simultaneously in parallel and concurrently;
a parallel input data access engine, which performs regionalized parallel and concurrent access to the data in the parallel input register array; it does not need to access the register array as a whole, and causes no data loss.
2. The parallel input register array of claim 1, wherein, for the feature map that is stored in the input buffer and was output by the hidden layer preceding the deep convolutional neural network algorithm layer, the parallel input register array provides a fast register region for data rearrangement, which simplifies the arrangement of the input data; the parallel input register array can be accessed repeatedly, and when its data is invalidated, new data can be written quickly from the input buffer; the register array supports random access, simultaneous parallel access and multi-way concurrent access, and the number of concurrent ways is not less than the number of parallel acceleration computing units in a deep convolutional neural network hardware parallel accelerator.
3. A method of designing the parallel input register array of claims 1 to 2, characterized in that its size is related to the instantiated size of the parallel computing unit array and satisfies a specific design formula.
4. The parallel input data access engine of claim 1, comprising:
performing regionalized parallel and concurrent access to the data in the parallel input register array, the access being neither serial access nor random access over the full address space, with the number of concurrent access ways not less than the number of parallel acceleration computing units in a deep convolutional neural network hardware parallel accelerator;
accessing the regional data in the parallel input register array repeatedly according to a fixed rule, so that the regional data-coupling characteristic of the input feature map of the convolutional neural network algorithm layer is exploited inside each data region, and duplicated data need not be written into the parallel input register array over and over again.
5. A method of designing the parallel input data access engine of claims 1 and 4, characterized in that the regions of the parallel input register array are accessed in a regionalized, parallel and concurrent manner, the access being neither serial access nor random access over the full address space; for simultaneous, concurrent multi-way access, the address calculation follows a specific rule, and the conversion rule is simple. The design method simplifies the hardware circuitry in the hardware engine module and reduces area and power consumption.
CN201910979031.9A 2019-10-15 2019-10-15 Regional parallel data loading device and method in deep convolutional neural network hardware accelerator Withdrawn CN110766150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910979031.9A CN110766150A (en) 2019-10-15 2019-10-15 Regional parallel data loading device and method in deep convolutional neural network hardware accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910979031.9A CN110766150A (en) 2019-10-15 2019-10-15 Regional parallel data loading device and method in deep convolutional neural network hardware accelerator

Publications (1)

Publication Number Publication Date
CN110766150A true CN110766150A (en) 2020-02-07

Family

ID=69331196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910979031.9A Withdrawn CN110766150A (en) 2019-10-15 2019-10-15 Regional parallel data loading device and method in deep convolutional neural network hardware accelerator

Country Status (1)

Country Link
CN (1) CN110766150A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022021459A1 (en) * 2020-07-29 2022-02-03 中国科学院深圳先进技术研究院 Data pre-loading apparatus and data pre-loading method, and computer-readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
WW01 Invention patent application withdrawn after publication
Application publication date: 20200207