CN110766150A - Regional parallel data loading device and method in deep convolutional neural network hardware accelerator - Google Patents
Regional parallel data loading device and method in deep convolutional neural network hardware accelerator
- Publication number
- CN110766150A CN201910979031.9A CN201910979031A
- Authority
- CN
- China
- Prior art keywords
- parallel
- data
- access
- input
- register array
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
- 238000011068 loading method Methods 0.000 title claims abstract description 16
- 238000000034 method Methods 0.000 title claims description 24
- 238000013527 convolutional neural network Methods 0.000 title claims description 20
- 238000013461 design Methods 0.000 claims abstract description 20
- 238000010586 diagram Methods 0.000 claims abstract description 13
- 230000008707 rearrangement Effects 0.000 claims abstract description 6
- 238000004364 calculation method Methods 0.000 claims description 27
- 230000001133 acceleration Effects 0.000 claims description 15
- 230000008878 coupling Effects 0.000 claims description 4
- 238000010168 coupling process Methods 0.000 claims description 4
- 238000005859 coupling reaction Methods 0.000 claims description 4
- 238000006243 chemical reaction Methods 0.000 claims description 2
- 238000013528 artificial neural network Methods 0.000 description 5
- 238000012545 processing Methods 0.000 description 3
- 239000002699 waste material Substances 0.000 description 3
- 230000008569 process Effects 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 230000001360 synchronised effect Effects 0.000 description 2
- 238000009825 accumulation Methods 0.000 description 1
- 238000003491 array Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 230000003247 decreasing effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000001537 neural effect Effects 0.000 description 1
- 210000002569 neuron Anatomy 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Neurology (AREA)
- Complex Calculations (AREA)
Abstract
The invention provides a regional parallel data loading device and method for deep convolution, which meet the requirement of providing high-bandwidth data input to the execution unit array of a parallel hardware accelerator, while the regionalization of the input data greatly simplifies the circuit design. The device comprises: a parallel input register array, which provides a fast register area in which the input feature map held in the input cache can be rearranged; and a parallel input data access engine, which performs regionalized parallel access to the data in the parallel input register array, simplifying the interconnect structure and saving area and power consumption.
Description
Technical Field
The invention belongs to the fields of computer hardware and hardware acceleration for deploying artificial neural network algorithms, and relates to digital integrated circuit design, in particular to a device for handling the input data of a deep convolutional neural network hardware acceleration chip and a design method thereof.
Background
A deep convolutional neural network algorithm consists of a number of specific neuron algorithm layers and hidden layers, chiefly convolutional layers, whose main operator is the convolution of matrices or vectors. The computation is characterized by a large input data volume, spatial coupling of the input data (the data used by one convolution frequently overlaps data that has already been used), and input data that must be extracted from tensor-format data according to a particular spatial rule. Convolutional layers demand enormous computing power and large amounts of data. When an artificial neural network algorithm is deployed on an embedded end-side chip, the limited hardware resources of the acceleration chip force the data to be segmented; each segment should exploit the spatial coupling of the input data as far as possible, avoiding the bandwidth wasted by refilling repeated data. For the different artificial neural network algorithms commonly used in different fields and industrial scenarios, the data segmentation should be as simple and easy to implement as possible, otherwise low-cost, rapid-deployment applications cannot be supported.
Patent document 1 (publication No. CN105488565A) discloses an arithmetic device and method for an acceleration chip that accelerates a deep neural network algorithm; to overcome the problem that a large number of intermediate values are generated and must be stored, increasing the required main memory, the device is provided with intermediate value storage areas, reducing the number of main-memory reads and writes of intermediate values. Patent document 2 (publication No. US20170103316A1) discloses a method, system and apparatus for a convolutional neural network accelerator in which a Unified Buffer is designed. Because different neural network algorithm layers differ in size and in degree of data reuse, accelerator resources are easily wasted; patent 1 must be paired with other heterogeneous processors to help with data arrangement and is gradually moving toward server-side applications, while patent 2, owing to its very large parallel-computation and storage scale, can only be used in server and data-center scenarios.
Patent document 3 (application publication No. CN107341544A) discloses a reconfigurable accelerator based on a partitionable array and an implementation method thereof, in which a scratchpad memory buffer is designed to implement data reuse. Although the reconfigurable-computing approach solves the resource-waste problem, its data segmentation and arrangement method is very complex, so redeploying a new network is very difficult. Patent document 4 (publication No. US20180341495A1) discloses a convolutional neural network accelerator and method in which a cache device supplies the data required for parallel acceleration; it is too tightly coupled to the central processing unit design, and its implementation complexity is too high.
Disclosure of Invention
The invention provides a regional parallel data loading device in a deep convolutional neural network hardware parallel accelerator, and a method thereof, which reduce the complexity and cost of application, reduce the complexity of the hardware circuit design, reduce chip area and power consumption, and at the same time provide high-throughput, high-performance parallel data bandwidth.
To achieve the above object, an embodiment of the present invention provides a regional parallel data loading apparatus, comprising:
a parallel input register array, which provides a fast register area in which the input feature map held in the input cache can be rearranged; the registered data serve as the input data for high-bandwidth computation by the parallel accelerated computation unit array and can be accessed randomly, or simultaneously in parallel and concurrently;
a parallel input data access engine, which performs regionalized parallel and concurrent access to the data in the parallel input register array; it does not need to access the register array as a whole, and no data are lost.
In the regional parallel data loading device, the parallel input register array caches the feature map, stored in the input cache, that was output by the hidden layer preceding the current deep convolutional neural network algorithm layer, and provides a fast register area for data rearrangement, which simplifies input data arrangement. The parallel input register array can be accessed repeatedly, and when its data become invalid, new data can quickly be written in from the input cache. The register array supports random access, simultaneous parallel access and multi-way concurrent access, and the number of concurrent ways is not less than the number of parallel acceleration computing units in the deep convolutional neural network hardware parallel accelerator.
In the regional parallel data loading device, the parallel input data access engine performs regionalized parallel and concurrent access to the data in the parallel input register array; the access is neither serial access nor random access over the full address space, and the number of concurrent access ways is not less than the number of parallel acceleration computing units in the deep convolutional neural network hardware parallel accelerator. The regional data in the parallel input register array are accessed repeatedly according to a fixed rule, so that the regional data-coupling characteristic of the convolutional layer's input feature map is exploited within each data region, and repeated data need not be written into the parallel input register array again and again.
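To make the regional data-reuse benefit concrete, the following Python sketch (an illustration with assumed example numbers, not part of the claimed device) compares the input volume of one regional fill with the volume that independent per-window loading would require, using the region-size relation dw = Kw + Bw − 1, dh = Kh + Bh − 1 described later in this specification.

```python
# Sketch: input reuse within one IRA region (illustrative only; example numbers assumed).
# For a Bw x Bh tile of outputs from a Kw x Kh convolution with stride 1,
# the shared input region holds (Kw + Bw - 1) x (Kh + Bh - 1) elements,
# whereas loading each window independently would move Bw * Bh * Kw * Kh elements.

def region_reuse(Kw, Kh, Bw, Bh):
    shared = (Kw + Bw - 1) * (Kh + Bh - 1)   # one regional fill of the IRA
    naive = Bw * Bh * Kw * Kh                # per-window loads without reuse
    return shared, naive

shared, naive = region_reuse(Kw=7, Kh=7, Bw=2, Bh=2)
print(shared, naive)   # 64 vs. 196 input elements in this example
```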
An embodiment of the invention also provides a design method for the regional parallel data loading device, comprising the following methods and principles:
the size of the parallel input register array is related to the instantiated size of the parallel computing unit array and satisfies a specific design formula;
the regions of the parallel input register array are accessed in a regionalized, parallel and concurrent manner, the access being neither serial access nor random access over the full address space; for the simultaneous, concurrent multi-way access, the address calculation follows a specific rule and the conversion rule is simple. This design method simplifies the hardware circuitry inside the hardware engine module and reduces area and power consumption.
The invention has the following effects:
1. It simplifies the connection between the hardware parallel computing unit array and the input device.
2. It simplifies the spatial complexity of arranging data between the output device and main storage.
3. It simplifies the address calculation for software-configured data and for dividing data into macro blocks.
4. It improves the practical utilization efficiency of the hardware parallel computing unit array.
5. It is better suited to implementation on a low-cost embedded ASIC chip.
Drawings
FIG. 1 is a data-flow and apparatus structure diagram of an embodiment of the regional data loading method of the present invention;
FIG. 2 is a region data flow diagram for the 1st convolution calculation of the embodiment;
FIG. 3 is a region data flow diagram for the 2nd convolution calculation of the embodiment;
FIG. 4 is a region data flow diagram for the 3rd convolution calculation of the embodiment;
FIG. 5 is a structural diagram of the regional parallel data loading apparatus of the present invention.
Description of the reference numerals
1 parallel hardware computing unit array (Process Element Array, PEA)
12 parallel output register array (Output Register Array, ORA)
101 convolution computing unit (Process Element, PE)
202 parallel input register array (Input Register Array, IRA)
203 parallel input data access engine (IRA Data Access Engine, IDE)
Detailed Description
The invention is described in further detail below with reference to the figures and examples.
Fig. 1 shows the data flow and apparatus structure of the regional parallel data loading method in a deep convolutional neural network hardware accelerator according to the present invention. The apparatus comprises a parallel input register array (IRA) 202 and a parallel input data access engine (IDE) 203. A simplified connection of the apparatus to a parallel hardware computing unit array (PEA) 1 is also illustrated; the PEA 1 consists of several parallel hardware computing units (PEs) 101.
The parallel input register array (IRA) 202 is composed of a specific number of registers and provides a fast register area for data rearrangement, which simplifies input data arrangement. It can be accessed repeatedly, and when its data become invalid, new data can quickly be written in from the input cache. The register array supports random access, simultaneous parallel access and multi-way concurrent access, and the number of concurrent ways is not less than the number of parallel acceleration computing units in the deep convolutional neural network hardware parallel accelerator.
As shown in fig. 1, the register units in 202 are statically divided into regions, marked with ellipses above 202 in the figure; the number of regions represents the capability for concurrent, parallel access. Each way of the access engine in 203 does not need to access every address of the whole of 202 at random, but only the register-array region corresponding to that way, and the correspondence can be fixed, which further reduces the circuit scale.
In one embodiment, the PEA is a rectangular 2-dimensional array; the instantiated size of the PEs on the PEA is Pw wide by Ph high, so the number of instantiated PEs is P = Pw × Ph, and convolution kernels of size Kw × Kh are accelerated in parallel. The IRA is likewise regarded as a 2-dimensional array whose instantiated size is Rw wide by Rh high; the number of convolutions the PEA can perform per IRA fill is B = Bw × Bh, with the results registered in the parallel output register array (ORA) 12. The invention provides a design method for the regional parallel data loading device in a deep convolutional neural network hardware accelerator, comprising the following steps:
the total instantiated size P of the PEA, where P = Pw × Ph, is estimated from the target algorithms of the specific field and the computational requirements of commonly used networks, combined with the real-time requirements of the industrial application and the theoretical computing-power range of the parallel accelerator device; the number of concurrent, parallel data input ways is also P; as marked in fig. 1, the parallel computing acceleration requirement is satisfied when the number of ways of the input data access engine (IDE) 203 is also P;
from the design framework of the PE, the maximum output computing power and efficiency are known; combined with the chip design choices, the number of convolutions B that the PEA completes after each IRA data update is selected; as marked in fig. 1, the number of registers in the parallel output data register array (ORA) 12 is P × B;
the invention provides a method for realizing the size of a region parallel input data register array in a deep convolutional neural network hardware accelerator, which comprises the following steps: Rw-Bw-Pw + Kw-1, Rh-Bh Ph + Kh-1;
assuming the coordinates of a PE are (px, py), the top-left coordinates (x, y) of the input data region corresponding to that PE are calculated as: x = px × Bw + px % Bw + fw(Kw), y = py × Bh + py / Bh + fh(Kh);
the functions fw() and fh() are related to the maximum convolution kernel size that the embodiment chooses to support: fw(x) = max(supported_Kw)/Bw − x/Bw and fh(x) = max(supported_Kh)/Bh − x/Bh, respectively;
the size of the input data region (dw × dh) for one PE is calculated as: dw = Kw + Bw − 1, dh = Kh + Bh − 1;
the input data access engine (IDE) 203 accesses the respective regions; the access order and the positions within every region are identical, which simplifies the address-generation circuit;
when the convolution stride (Stride) is greater than 1, the supported number of convolutions per fill becomes B = (Bw/Stride) × (Bh/Stride), i.e., the number of PEA calculations that each IRA fill can support decreases by the corresponding factor; a sketch of these sizing and addressing rules is given below.
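The following Python sketch restates the sizing and addressing rules of this design method. It is only an interpretation of the formulas as reconstructed above; in particular, integer division in fw()/fh() and the exact per-PE coordinate expression are assumptions where the original wording is ambiguous.

```python
# Sketch of the IRA sizing and per-PE region addressing rules (interpretation only;
# integer division and the coordinate expression are assumptions, as noted above).

def ira_size(Pw, Ph, Bw, Bh, Kw, Kh):
    # Rw = Bw * Pw + Kw - 1, Rh = Bh * Ph + Kh - 1
    return Bw * Pw + Kw - 1, Bh * Ph + Kh - 1

def pe_region(px, py, Bw, Bh, Kw, Kh, max_Kw, max_Kh):
    # fw(x) = max(supported_Kw)/Bw - x/Bw, fh(x) = max(supported_Kh)/Bh - x/Bh
    fw = max_Kw // Bw - Kw // Bw
    fh = max_Kh // Bh - Kh // Bh
    # Top-left corner (x, y) of the input region read by PE (px, py).
    x = px * Bw + px % Bw + fw
    y = py * Bh + py // Bh + fh
    # Region size dw x dh for one PE: dw = Kw + Bw - 1, dh = Kh + Bh - 1.
    return (x, y), (Kw + Bw - 1, Kh + Bh - 1)

def convolutions_per_fill(Bw, Bh, stride=1):
    # With Stride > 1 the number of convolutions per IRA fill shrinks accordingly.
    return (Bw // stride) * (Bh // stride)

print(ira_size(Pw=4, Ph=4, Bw=2, Bh=2, Kw=7, Kh=7))                       # (14, 14)
print(pe_region(px=0, py=0, Bw=2, Bh=2, Kw=7, Kh=7, max_Kw=7, max_Kh=7))  # ((0, 0), (8, 8))
print(convolutions_per_fill(Bw=2, Bh=2, stride=1))                        # 4
```

With these example values the upper-left PE reads an 8 × 8 slice of a 14 × 14 IRA, matching the 64-element example discussed with fig. 5 below.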
The invention provides the following computation order for the regional parallel input data in the deep convolutional neural network hardware accelerator:
as shown in fig. 2, within the current region data, all IDE channels synchronously and concurrently access the Kw × Kh region at the upper left corner, supporting the corresponding PEs in the 1st convolution calculation;
as shown in fig. 3, within the current region data, the Kw × Kh regions accessed synchronously and concurrently by all IDE channels are translated to the right by Stride data, supporting the corresponding PEs in the 2nd convolution calculation; the result of the 1st calculation is stored in the ORA;
as shown in fig. 4, after the Kw × Kh regions accessed synchronously and concurrently by all IDE channels have been translated across all the horizontal data, the access window moves down a row and the (Bw+1)-th convolution calculation begins; the previous results are stored in Bw register units of the ORA;
the above steps are repeated until all Bw × Bh convolutions have been calculated; a sketch of this access order is given below.
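A minimal sketch of this computation order follows: every IDE channel replays the same sequence of local window offsets against its own region, stepping right by the stride across one row of outputs and then moving down. The stride-based stepping is an assumption consistent with the description above, not a verbatim reproduction of the circuit behaviour.

```python
# Sketch: local (within-region) window positions visited during one IRA fill.
# All IDE channels replay the same offset sequence against their own regions,
# so a single shared regional address generator can drive every channel.

def window_positions(Bw, Bh, stride=1):
    # Each position is the top-left local offset of one Kw x Kh access window.
    positions = []
    for row in range(0, Bh, stride):       # move down after each horizontal sweep
        for col in range(0, Bw, stride):   # step right by the stride
            positions.append((col, row))
    return positions

print(window_positions(Bw=2, Bh=2))   # [(0, 0), (1, 0), (0, 1), (1, 1)] -> 4 convolutions
```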
Fig. 5 is a detailed structural diagram of the regional parallel data loading apparatus of the present invention, in which the IDE device 203 comprises an address decoder and multiplexer 2031, an address generator 2032 for the IDE-to-IRA regions, and an address translator 2033.
As shown in fig. 5, when the upper-left PE with (px, py) = (0, 0) performs the first convolution with K = 7, the centre point of the convolution region is located at r33; according to the rule of the present invention described above, the other three convolutions are centred at r34, r43 and r44. The dashed boxes mark all the regions that p0 needs to access, 64 input data in total. The dashed arrow indicates that the 64 data of the region are connected to the 1-out-of-64 multiplexer 2031. The regional address generator 2032 provides the local address sequence, the address translator 2033 converts the address of each way into an address adjusted by the regional offset, and the multiplexer 2031 selects one datum by decoding and outputs it to the PE, until all 4 convolutions are completed.
As shown in fig. 5, while p0 performs the accelerated computation for its region, the other PEs also perform the array computation in parallel and synchronously. The centre point of the IRA region corresponding to each PE is marked with a cross. The addresses and the order in which all PEs access their respective regions are identical, which simplifies the control circuitry.
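The sketch below models one IDE channel of fig. 5 in Python (an illustrative model with an assumed structure, not register-transfer logic): each PE owns a fixed dw × dh slice of the IRA, the shared regional address generator supplies a local offset, the address translator adds the region's base offset, and a 1-out-of-(dw·dh) multiplexer forwards the selected register value to the PE — 64-to-1 in the K = 7, Bw = Bh = 2 example above.

```python
# Illustrative model of one IDE channel feeding one PE (structure assumed; not RTL).

class IDEChannel:
    def __init__(self, ira, region_x, region_y, dw, dh):
        self.ira = ira                     # 2D parallel input register array (IRA)
        self.base = (region_x, region_y)   # fixed region assigned to this PE
        self.dw, self.dh = dw, dh

    def read(self, local_x, local_y):
        # Address translator: shared local offset -> coordinates inside this region.
        x = self.base[0] + local_x
        y = self.base[1] + local_y
        # 1-out-of-(dw*dh) multiplexer: select one registered value for the PE.
        return self.ira[y][x]

# Example: a 14 x 14 IRA whose cells record their own (y, x) coordinates;
# PE (0, 0) owns the 8 x 8 region at (0, 0), and reading local offset (3, 3)
# returns the value under the first 7 x 7 window's centre point (r33).
ira = [[(y, x) for x in range(14)] for y in range(14)]
ch0 = IDEChannel(ira, region_x=0, region_y=0, dw=8, dh=8)
print(ch0.read(3, 3))   # (3, 3)
```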
With the regional parallel input device and the method of dividing the regions accessed by the PEs, the invention meets the required input data throughput, simplifies the circuit design, and makes the division of the input data simple and intuitive.
The present invention may be described in the general context of general and/or extended instructions executed by a central controller, for example as a software program. Software programs generally include routines, objects, components, data structures, reference models, and the like that perform particular tasks or implement particular data types.
The above embodiments are intended to further illustrate the objects, technical solutions and advantages of the present invention. It should be understood that they are merely exemplary embodiments of the invention and are not intended to limit its scope; any modifications, equivalent substitutions, improvements and the like made within the spirit and principles of the present invention fall within the scope of the present invention.
Claims (5)
1. A regional parallel data loading apparatus in a deep convolutional neural network hardware parallel accelerator, comprising:
a parallel input register array, which provides a fast register area in which the input feature map held in the input cache can be rearranged; the registered data serve as the input data for high-bandwidth computation by the parallel accelerated computation unit array and can be accessed randomly, or simultaneously in parallel and concurrently;
a parallel input data access engine, which performs regionalized parallel and concurrent access to the data in the parallel input register array, without needing to access the register array as a whole and without any data loss.
2. The parallel input register array of claim 1, wherein for the feature map stored in the input cache and output by the hidden layer preceding the deep convolutional neural network algorithm layer, the parallel input register array provides a fast register area for data rearrangement, which simplifies input data arrangement; the parallel input register array can be accessed repeatedly, and when its data become invalid, new data can quickly be written in from the input cache; the register array supports random access, simultaneous parallel access and multi-way concurrent access, and the number of concurrent ways is not less than the number of parallel acceleration computing units in the deep convolutional neural network hardware parallel accelerator.
3. A method of designing the parallel input register array of claims 1-2, characterized in that its size is related to the instantiated size of the parallel computing unit array and satisfies a specific design formula.
4. The parallel input data access engine of claim 1, wherein:
regionalized parallel and concurrent access is performed on the data in the parallel input register array, the access being neither serial access nor random access over the full address space, and the number of concurrent access ways is not less than the number of parallel acceleration computing units in the deep convolutional neural network hardware parallel accelerator;
the regional data in the parallel input register array are accessed repeatedly according to a fixed rule, so that the regional data-coupling characteristic of the convolutional layer's input feature map is exploited within each data region, and repeated data need not be written into the parallel input register array in large quantities.
5. A design method for the parallel input data access engine of claims 1 and 4, characterized in that the regions of the parallel input register array are accessed in a regionalized, parallel and concurrent manner, the access being neither serial access nor random access over the full address space; for the simultaneous, concurrent multi-way access, the address calculation follows a specific rule and the conversion rule is simple; this design method simplifies the hardware circuitry in the hardware engine module and reduces area and power consumption.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910979031.9A CN110766150A (en) | 2019-10-15 | 2019-10-15 | Regional parallel data loading device and method in deep convolutional neural network hardware accelerator |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910979031.9A CN110766150A (en) | 2019-10-15 | 2019-10-15 | Regional parallel data loading device and method in deep convolutional neural network hardware accelerator |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110766150A true CN110766150A (en) | 2020-02-07 |
Family
ID=69331196
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910979031.9A Withdrawn CN110766150A (en) | 2019-10-15 | 2019-10-15 | Regional parallel data loading device and method in deep convolutional neural network hardware accelerator |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110766150A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022021459A1 (en) * | 2020-07-29 | 2022-02-03 | 中国科学院深圳先进技术研究院 | Data pre-loading apparatus and data pre-loading method, and computer-readable storage medium |
-
2019
- 2019-10-15 CN CN201910979031.9A patent/CN110766150A/en not_active Withdrawn
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11816559B2 (en) | Dilated convolution using systolic array | |
CN108765247B (en) | Image processing method, device, storage medium and equipment | |
US11954583B2 (en) | Transposed convolution using systolic array | |
KR102443546B1 (en) | matrix multiplier | |
US11803736B1 (en) | Fine-grained sparsity computations in systolic array | |
CN110222818B (en) | A multi-bank row-column interleaving reading and writing method for data storage in convolutional neural networks | |
US12026607B1 (en) | Memory operation for systolic array | |
CN108170640B (en) | Neural network operation device and operation method using same | |
US11030005B2 (en) | Configuration of application software on multi-core image processor | |
US11625453B1 (en) | Using shared data bus to support systolic array tiling | |
CN114995782B (en) | Data processing method, apparatus, device and readable storage medium | |
CN114385972B (en) | A Parallel Computing Method for Directly Solving Structured Triangular Sparse Linear Equations | |
US20230289398A1 (en) | Efficient Matrix Multiply and Add with a Group of Warps | |
CN111028360A (en) | A method and system for reading and writing data in 3D image processing, storage medium and terminal | |
CN116775519A (en) | Method and apparatus for efficient access to multidimensional data structures and/or other large data blocks | |
CN111783933A (en) | Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation | |
WO2016024508A1 (en) | Multiprocessor device | |
CN113240074B (en) | A Reconfigurable Neural Network Processor | |
CN110766150A (en) | Regional parallel data loading device and method in deep convolutional neural network hardware accelerator | |
CN112988621A (en) | Data loading device and method for tensor data | |
Yousefzadeh et al. | Energy-efficient in-memory address calculation | |
CN113095024A (en) | Regional parallel loading device and method for tensor data | |
Guo et al. | Fused DSConv: Optimizing sparse CNN inference for execution on edge devices | |
Chen et al. | A Survey on Graph Neural Network Acceleration: A Hardware Perspective | |
Liang et al. | Design of 16-bit fixed-point CNN coprocessor based on FPGA |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20200207 |