
WO2020258529A1 - Bnrp-based configurable parallel general convolutional neural network accelerator - Google Patents

Bnrp-based configurable parallel general convolutional neural network accelerator Download PDF

Info

Publication number
WO2020258529A1
WO2020258529A1 (PCT/CN2019/105534)
Authority
WO
WIPO (PCT)
Prior art keywords
mode
pooling
data
parameters
calculation
Prior art date
Application number
PCT/CN2019/105534
Other languages
French (fr)
Chinese (zh)
Inventor
陆生礼
范雪梅
庞伟
刘昊
舒程昊
付成龙
Original Assignee
东南大学 (Southeast University)
Priority date
Filing date
Publication date
Application filed by 东南大学 (Southeast University)
Publication of WO2020258529A1 publication Critical patent/WO2020258529A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The invention discloses a configurable parallel general convolutional neural network accelerator based on BNRP, which belongs to the technical field of calculation, inference and counting.
  • DNN Deep Neural Network
  • CNN Convolutional Neural Network
  • As practical application scenarios grow more complex and demand higher accuracy, the network topology of neural networks keeps changing and, accordingly, the network scale has expanded dramatically; examples include the Baidu Brain with 100 billion neuron connections and Google's cat-recognizing system with 1 billion neuron connections. Therefore, how to realize large-scale deep learning neural network models with low power consumption and high speed through computational acceleration and advanced technology has become an important problem in the fields of machine learning and artificial intelligence.
  • Deep neural networks not only require a large amount of computation but also need to store millions or even hundreds of millions of network parameters; real-time detection and recognition based on deep neural networks is therefore currently carried out mainly on high-performance multi-core CPUs (Central Processing Unit) and GPUs (Graphic Processing Unit). However, for robots, consumer electronics, smart cars and other mobile devices with limited power, size and cost budgets, it is almost impossible to port complex and diverse convolutional neural network models onto a CPU or GPU. Building a flexibly configurable, high-performance, low-power general-purpose hardware accelerator from general-purpose devices can therefore meet the large computing and storage requirements of convolutional neural networks.
  • CPU Central Processing Unit
  • GPU Graphic Processing Unit
  • FPGA field-programmable gate array
  • Although an ASIC has the disadvantages of a long development cycle, high cost and low flexibility, it is customized and therefore outperforms GPUs and FPGAs in both performance and power consumption.
  • The performance of the TPU series ASIC AI chip released by Google in 2016 is 14 to 16 times that of a traditional GPU, and the performance of the NPU released by Vimicro is 118 times that of a GPU.
  • Applying an FPGA or ASIC to the mobile work platform, and designing a configurable general-purpose convolutional neural network hardware accelerator around a systolic convolution array and a highly parallel pipeline that achieve high computing throughput with only moderate storage and communication bandwidth, is an effective solution.
  • The purpose of the present invention is to address the deficiencies of the above background art and to provide a configurable parallel general convolutional neural network accelerator based on BNRP, which supports accelerated computation for convolutional neural network structures of various scales, offers good versatility, places low demands on on-chip storage resources and I/O bandwidth, improves computing parallelism and throughput, and solves the technical problem that the limited on-chip storage and I/O bandwidth of existing hardware accelerators cannot meet the high-throughput computing requirements of convolutional neural networks.
  • A BNRP-based configurable parallel general convolutional neural network accelerator includes: a mode configurator, a parallel computing acceleration unit (convolution calculator and BNRP calculator), a data cache unit (input and output feature map caches and weight parameter cache), a data communication unit (AXI4 bus interface and AHB bus interface), and a data compression encoder/decoder.
  • The input feature map data In_Map, the weight parameters and the BN parameters are compression-encoded by the data compression encoder/decoder after entering through the AXI4 bus interface of the data communication unit, and are then buffered in the corresponding In_Map Buffer, weight buffer and BN parameter buffer. The accelerator's calculation-mode and function configuration information is transmitted to the mode configurator through the AHB bus interface of the data communication unit; the mode configurator configures the calculation modes and functions of the parallel computing acceleration unit according to the received configuration information. After reading data from the In_Map Buffer, the weight buffer and the BN parameter buffer, the parallel computing acceleration unit performs the corresponding convolution, batch normalization, nonlinear activation or pooling operations layer by layer, row by row, column by column and channel by channel in a parallel pipelined manner according to the configuration parameters. After each network layer has extracted its features, the output feature map data is sent back to the data compression encoder/decoder for decoding and then returned to the accelerator's external data storage unit through the AXI4 bus interface.
  • The network configuration information read by the mode configurator from the AHB bus interface, such as the network layer of the data currently being processed, the network model parameters and the buffer-data read/write addresses, is cached in the data buffer area of the convolution calculator; the calculation-mode and function configuration parameters read by the mode configurator from the AHB bus interface, i.e. whether to perform the batch normalization (BN), nonlinear activation (ReLu), pooling and data compression encoding/decoding operations, together with the calculation-mode configuration parameters, are transferred to the BNRP calculator.
  • BN batch normalization
  • ReLu non-linear activation
  • Pooling
  • The BNRP calculator executes batch normalization (BN), nonlinear activation (ReLu) or four kinds of pooling operations in parallel in a pipelined manner; according to flag bits it can be configured to execute one or several of these operations, and it executes the corresponding calculation mode according to the configuration parameters. Mode 1: after the BN operation, the pooling operation is executed first and then the ReLu operation. Mode 2: after the BN operation, the ReLu operation is executed first and then the pooling operation.
  • When the input feature map size map_size > R and the configuration requires a pooling operation, the BNRP calculator, according to the network model, the number of rows R of the systolic convolution array and the configuration parameters, interleaves and caches m rows of input feature map data into 2m on-chip Block RAMs.
  • Part of the 2*R*T poolers are enabled according to the configuration information and the others are turned off; the "2*2 pooler" executes the 2*2 AP or 2*2 MP operation according to the configuration parameters, and the "3*3 pooler" executes the 3*3 AP or 3*3 MP operation; there are R*T poolers of each kind, numbered sequentially (1, 2, 3, ..., R*T), and when S = 2 the odd-numbered poolers are enabled.
  • In the convolution calculation array and the BNRP calculator, if the configuration requires the BN operation, then before the ReLu operation the feature map data map[i][j] and the BN weight parameters a[i][j] and b[i][j] are first compared against 0; if map[i][j] ≤ 0, a[i][j] ≥ 0 and b[i][j] ≤ 0, the convolution calculation array does not need to multiply map[i][j] by a[i][j] or to add b[i][j]: the corresponding output value of the BN operation in BNRP calculator mode 1 is 0, and the corresponding output values of the BN and ReLu operations in BNRP calculator mode 2 are both 0.
  • The present invention designs the BNRP calculator in a parallel pipelined manner and reduces the computation load of the neural network accelerator by dynamically configuring the parameters of the parallel calculators, in particular the execution mode of the BNRP calculator; especially for convolutional neural networks with large network structures, this greatly accelerates the computation of the accelerator while reducing repeated calculations and hence power consumption. The convolution calculation array is designed on a systolic array architecture, achieving high computing throughput with only moderate storage and I/O communication bandwidth, effectively improving the data reuse rate and further reducing data transmission time.
  • The calculation execution mode of the BNRP calculator can be dynamically configured according to the characteristics of the network structure, which makes the accelerator more versatile and no longer constrained by the network model structure and the number of layers; unnecessary intermediate-value caching is also eliminated, reducing the use of memory resources.
  • Figure 1 is a schematic structural diagram of the accelerator disclosed in the present invention.
  • Figure 2 is a schematic structural diagram of the BNRP calculator of the present invention.
  • Figure 3 is a schematic diagram of the working process of the BNRP calculator of the present invention.
  • Figure 4 is a schematic diagram of the 3*3 pooler of the present invention performing a pooling operation.
  • The configurable parallel general convolutional neural network accelerator based on BNRP disclosed in the present invention is shown in Figure 1 and includes: a parallel computing acceleration unit composed of the mode configurator, the convolution calculator and the BNRP calculator; a data cache unit composed of the input and output feature map caches and the weight parameter cache; a data communication unit composed of the AXI4 bus interface and the AHB bus interface; and the data compression encoder/decoder.
  • The working states of the accelerator include a read-configuration-parameters state, a read-data state, a calculation state and a send-data state.
  • The mode configurator reads the mode configuration parameters from outside the accelerator through the AHB bus; whether the BN, ReLu or pooling operation is to be performed, together with configuration information such as the execution mode, the number of network layers and the feature map size, is transmitted to the BNRP calculator; information such as the number of network layers, the feature map size and batch, and the convolution kernel size is transmitted to the data buffer area of the convolution calculator; configuration information such as the number of network layers, the data read/write enables and the addresses is transmitted to the data compression encoder/decoder.
  • After reading the data read-enable and address signals, the data compression encoder/decoder reads the corresponding weight parameters (convolution kernels and biases) from outside the accelerator through the AXI4 bus and transmits them to the weight parameter buffer area, and reads the corresponding input feature map data and transmits it to the In_Map Buffer.
  • After receiving the calculation enable signal, the convolution calculator reads the number of network layers, the feature map size and batch, and the convolution kernel size from the data buffer area, and reads the weight parameters and input feature map data in a systolic manner to perform the corresponding convolution calculation; after the calculation is completed, it outputs the end-flag information to the BNRP calculator and outputs the convolution results to the Out_Map Buffer.
  • After receiving the mode configuration parameters, the BNRP calculator waits for the calculation-completion flag sent by the convolution calculator; if the configuration requires the BN operation, it initiates a BN parameter read request and reads the corresponding BN parameters from the BN parameter buffer; otherwise, no BN operation is performed.
  • The BNRP calculator determines the calculation mode to be executed according to the configuration information. If execution mode 1 is configured, the pooling operation is executed first: according to the received network model parameters (pooling stride) and the feature map size, the feature map input pixel values that need to be cached are sent to the corresponding Block RAMs, the corresponding poolers are enabled, and the ReLu operation is executed after the pooling calculation is completed; if execution mode 2 is configured, the ReLu operation is executed first.
  • The maximum pooler computes OMap[c][i][j] = max{ IMap[c][S*i+p][S*j+q] : 0 ≤ p, q ≤ k }, and the average pooler replaces the maximum with the window sum divided by (k+1)^2; k = 1, 2 denotes the pooler size (a (k+1)*(k+1) window), IMap denotes an input feature map pixel value, OMap denotes an output feature map pixel value, and OMap[c][i][j] denotes the pixel value at row i, column j of the c-th output feature map.
  • The first convolution pass outputs rows 1, 2, 3, 4, 5 and 6 of the feature map to the corresponding BlockRAM1, BlockRAM2, BlockRAM3, BlockRAM4, BlockRAM5 and BlockRAM6, additionally caches row 5 to BlockRAM5B and row 6 to BlockRAM6B, and enables poolers 1C, 3 and 5. The first output value of pooler 1C is invalid; pooler 3 performs the three-row pooling calculation over R1, R2 and R3 and outputs row 1 of Out_Map; pooler 5 performs the three-row pooling calculation over R3, R4 and R5 and outputs row 2 of Out_Map.
  • The second convolution pass outputs rows 7, 8, 9, 10, 11 and 12 of the feature map to the corresponding BlockRAM1 through BlockRAM6, additionally caches row 11 to BlockRAM5B and row 12 to BlockRAM6B, and enables poolers 1B, 3 and 5. Pooler 1B performs the three-row pooling calculation over R5, R6 and R7 and outputs row 3 of Out_Map; pooler 3 performs the three-row pooling calculation over R7, R8 and R9 and outputs row 4 of Out_Map; pooler 5 performs the three-row pooling calculation over R9, R10 and R11 and outputs row 5 of Out_Map.
  • The third convolution pass outputs row 13 of the feature map plus 5 rows of random data to the corresponding BlockRAMs 1 through 6. The convolution output feature map size now satisfies map_size < R, so no extra caching is needed and pooler 1C is enabled; pooler 1C performs the three-row pooling calculation over R11, R12 and R13 and outputs row 6 of Out_Map, completing the pooling operation for this layer's input image.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A BNRP-based configurable parallel general convolutional neural network accelerator, which relates to the technical fields of calculation, inference and counting. The accelerator comprises: a mode configurator, a convolution calculator, a BNRP calculator, a data communication unit, and a data compression encoder/decoder. The convolution calculator comprises T systolic convolution arrays having a size of R*C, and each systolic convolution array is configured with corresponding input and output feature map buffer regions and configuration information data buffer regions. The BNRP calculator may execute two calculation modes, and comprises: R*T data input and output interfaces, R*T poolers, a normalization calculation module and a nonlinear activation calculation module, wherein each functional module is executed in parallel in a pipeline. According to the characteristics of various network structures, the execution mode of parallel acceleration calculation modules may be dynamically configured, and the versatility is good. For convolutional neural networks that have complex network structures and that are relatively large scale, the complexity of calculation may be greatly reduced, while power consumption is low and throughput is high.

Description

A configurable parallel general convolutional neural network accelerator based on BNRP

Technical field
The invention discloses a configurable parallel general convolutional neural network accelerator based on BNRP, which belongs to the technical field of calculation, inference and counting.
Background art
In recent years, deep learning has greatly accelerated the development of machine learning and artificial intelligence and has achieved remarkable results in various research fields and commercial applications. The most widely used deep neural networks (DNN, Deep Neural Network) and convolutional neural networks (CNN, Convolutional Neural Network) have proven especially capable at image recognition, speech recognition and other complex machine learning tasks. However, as practical application scenarios grow more complex and demand higher accuracy, the network topology of neural networks keeps changing and, accordingly, the network scale has expanded dramatically; examples include the Baidu Brain with 100 billion neuron connections and Google's cat-recognizing system with 1 billion neuron connections. Therefore, how to realize large-scale deep learning neural network models with low power consumption and high speed through computational acceleration and advanced technology has become an important problem in the fields of machine learning and artificial intelligence.
Deep neural networks not only require a large amount of computation but also need to store millions or even hundreds of millions of network parameters; real-time detection and recognition based on deep neural networks is therefore currently carried out mainly on high-performance multi-core CPUs (Central Processing Unit) and GPUs (Graphic Processing Unit). However, for robots, consumer electronics, smart cars and other mobile devices with limited power, size and cost budgets, it is almost impossible to port complex and diverse convolutional neural network models onto a CPU or GPU. Building a flexibly configurable, high-performance, low-power general-purpose hardware accelerator from general-purpose devices can therefore meet the large computing and storage requirements of convolutional neural networks.
Compared with GPU acceleration, hardware accelerators such as FPGAs and ASICs can achieve at least 50% of the performance at lower power consumption. However, both FPGAs and ASICs have relatively limited computing resources, memory and I/O bandwidth, so developing complex, large-scale DNNs on hardware accelerators is challenging. In recent years, FPGA high-level synthesis tools have brought great breakthroughs to FPGA design, greatly improving development efficiency without sacrificing performance. The FPGA is a low-cost, highly flexible programmable standard device with low power consumption and high parallelism, which makes it well suited to hardware acceleration of convolutional neural network computation. Although an ASIC has the drawbacks of a long development cycle, high cost and low flexibility, it is customized and therefore outperforms GPUs and FPGAs in both performance and power consumption: the TPU series ASIC AI chip released by Google in 2016 delivers 14 to 16 times the performance of a traditional GPU, and the NPU released by Vimicro delivers 118 times the performance of a GPU.
Therefore, applying an FPGA or ASIC to the mobile work platform, and designing a configurable general-purpose convolutional neural network hardware accelerator around a systolic convolution array and a highly parallel pipeline that achieve high computing throughput with only moderate storage and communication bandwidth, is an effective solution.
Summary of the invention
The purpose of the present invention is to address the deficiencies of the above background art and to provide a configurable parallel general convolutional neural network accelerator based on BNRP, which supports accelerated computation for convolutional neural network structures of various scales, offers good versatility, places low demands on on-chip storage resources and I/O bandwidth, and improves computing parallelism and throughput, thereby solving the technical problem that the limited on-chip storage and I/O bandwidth of existing hardware accelerators cannot meet the high-throughput computing requirements of convolutional neural networks.
To achieve the above purpose, the present invention adopts the following technical solutions:
A BNRP-based configurable parallel general convolutional neural network accelerator includes: a mode configurator, a parallel computing acceleration unit (convolution calculator and BNRP calculator), a data cache unit (input and output feature map caches and weight parameter cache), a data communication unit (AXI4 bus interface and AHB bus interface), and a data compression encoder/decoder. The input feature map data In_Map, the weight parameters and the BN parameters enter through the AXI4 bus interface of the data communication unit, are compression-encoded by the data compression encoder/decoder, and are buffered in the corresponding In_Map Buffer, weight buffer and BN parameter buffer. The accelerator's calculation-mode and function configuration information is transmitted to the mode configurator through the AHB bus interface of the data communication unit. The mode configurator configures the calculation modes and functions of the parallel computing acceleration unit according to the received configuration information; after reading data from the In_Map Buffer, the weight buffer and the BN parameter buffer, the parallel computing acceleration unit performs the corresponding convolution, batch normalization, nonlinear activation or pooling operations layer by layer, row by row, column by column and channel by channel in a parallel pipelined manner according to the configuration parameters. After each network layer has extracted its features, the output feature map data is sent back to the data compression encoder/decoder for decoding and then returned to the accelerator's external data storage unit through the AXI4 bus interface.
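For illustration, the parameters described above can be pictured as a single configuration record delivered over the AHB interface. The following minimal Python sketch only enumerates the fields involved; the patent does not publish a register map, so all field names here are assumptions:

```python
from dataclasses import dataclass

# Hypothetical configuration record; field names are illustrative assumptions,
# not the patent's actual register layout.
@dataclass
class ModeConfig:
    layer_index: int   # network layer currently being processed
    map_size: int      # input feature map height/width
    batch: int         # batch size
    kernel_size: int   # convolution kernel size
    do_bn: bool        # perform batch normalization?
    do_relu: bool      # perform nonlinear activation (ReLu)?
    do_pool: bool      # perform pooling?
    pool_size: int     # 2 for the 2*2 poolers, 3 for the 3*3 poolers
    pool_stride: int   # S = 1 or 2
    mode: int          # 1: BN -> pooling -> ReLu; 2: BN -> ReLu -> pooling
```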
As a preferred solution of the above technical solution, the parallel computing acceleration unit includes T convolution calculation arrays and the BNRP calculator. Each convolution calculation array is based on a systolic array architecture of size R*C and can convolve R rows of C feature maps at a time, with the convolution results stored in the Output Buffer. Correspondingly, the BNRP calculator contains R*T data input interfaces, R*T output interfaces, R*T "2*2 poolers" and R*T "3*3 poolers"; the mode configurator enables only a subset of the poolers at a time (R*T/S of them), where S denotes the pooling stride (S = 1, 2).
As a preferred solution of the above technical solution, the network configuration information read by the mode configurator from the AHB bus interface, such as the network layer of the data currently being processed, the network model parameters and the buffer-data read/write addresses, is cached in the data buffer area of the convolution calculator; the calculation-mode and function configuration parameters read by the mode configurator from the AHB bus interface, i.e. whether to perform the batch normalization (BN), nonlinear activation (ReLu), pooling and data compression encoding/decoding operations, together with the calculation-mode configuration parameters, are transferred to the BNRP calculator.
As a preferred solution of the above technical solution, the BNRP calculator executes batch normalization (BN), nonlinear activation (ReLu) or four kinds of pooling operations in parallel in a pipelined manner; according to flag bits it can be configured to execute one or several of these operations, and it executes the corresponding calculation mode according to the configuration parameters. Mode 1: after the BN operation, the pooling operation is executed first and then the ReLu operation. Mode 2: after the BN operation, the ReLu operation is executed first and then the pooling operation.
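A rough software model of the two modes follows. This is a behavioural sketch only, assuming per-element BN scale a and bias b as introduced later; the hardware executes these stages as a parallel pipeline rather than sequentially:

```python
import numpy as np

def bnrp(conv_out, a, b, cfg, pool):
    """Behavioural sketch of the BNRP calculator's two execution modes.

    conv_out: convolution result (2-D array); a, b: BN scale and bias
    (arrays or scalars); cfg: flags like the ModeConfig sketch above;
    pool: a pooling function. Not the hardware pipeline itself.
    """
    x = a * conv_out + b if cfg.do_bn else conv_out  # batch normalization
    if cfg.mode == 1:                                # mode 1: pool, then ReLu
        if cfg.do_pool:
            x = pool(x)
        if cfg.do_relu:
            x = np.maximum(x, 0)
    else:                                            # mode 2: ReLu, then pool
        if cfg.do_relu:
            x = np.maximum(x, 0)
        if cfg.do_pool:
            x = pool(x)
    return x
```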
As a preferred solution of the above technical solution, when the input feature map size map_size > R and the configuration requires a pooling operation, the BNRP calculator, according to the network model, the number of rows R of the systolic convolution array and the configuration parameters, interleaves and caches m rows of input feature map data into 2m on-chip Block RAMs.
As a preferred solution of the above technical solution, the "2*2 pooler" is a four-to-one comparator composed of two two-to-one comparators, Comparator2_1 and Comparator2_2; two feature map values are input to Comparator2_2 per clock, and one 2*2 pooling value is output every 2 clocks; when S = 1, the output value of Comparator2_2 is saved and used as the output value of Comparator2_1 at the next clock. The "3*3 pooler" is a nine-to-one comparator composed of three three-to-one comparators, Comparator3_1, Comparator3_2 and Comparator3_3; three feature map values are input per clock, and one 3*3 pooling value is output every 3 clocks; when S = 1, the output value of Comparator3_2 is saved as the output value of Comparator3_1 at the next clock, and the output value of Comparator3_3 is saved as the output value of Comparator3_2 at the next clock; when S = 2, the output value of Comparator3_3 is saved as the output value of Comparator3_1 at the next clock.
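The point of the comparator hand-off is that overlapping windows reuse column results instead of recomputing them. A behavioural sketch of one 3*3 pooler pooling three rows, assuming one three-input comparison per incoming column (function and variable names are illustrative, not taken from the patent):

```python
from collections import deque

def pool3x3_rows(r1, r2, r3, stride):
    """Max-pool three rows with column-maximum reuse, as a software model.

    Each step reduces one 3-pixel column (one three-input comparison per
    clock); column maxima are carried across overlapping windows, mirroring
    the Comparator3_1/3_2/3_3 hand-off described above. Sketch only.
    """
    cols, out = deque(), []
    for x1, x2, x3 in zip(r1, r2, r3):
        cols.append(max(x1, x2, x3))   # this clock's column maximum
        if len(cols) == 3:             # a full 3*3 window is ready
            out.append(max(cols))      # nine-input result
            for _ in range(stride):    # S=1 keeps two columns, S=2 keeps one
                cols.popleft()
    return out

# pool3x3_rows([1, 5, 2, 7, 3], [0, 4, 6, 1, 2], [3, 2, 8, 0, 9], stride=2)
# -> [8, 9]
```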
As a preferred solution of the above technical solution, part of the 2*R*T poolers are enabled according to the configuration information and the others are turned off. The "2*2 pooler" executes the 2*2 AP or 2*2 MP operation according to the configuration parameters, and the "3*3 pooler" executes the 3*3 AP or 3*3 MP operation according to the configuration parameters. There are R*T poolers of each kind, numbered sequentially (1, 2, 3, ..., R*T); when S = 2, the odd-numbered poolers are enabled.
As a preferred solution of the above technical solution, in the convolution calculation array and the BNRP calculator, if the configuration requires the BN operation, then before the ReLu operation three comparators judge the feature map data map[i][j] and the BN weight parameters a[i][j] and b[i][j] against 0. If map[i][j] ≤ 0, a[i][j] ≥ 0 and b[i][j] ≤ 0, the convolution calculation array does not need to multiply map[i][j] by a[i][j] and does not need to add b[i][j]; the corresponding output value of the BN operation in BNRP calculator mode 1 is 0, and the corresponding output values of both the BN and ReLu operations in BNRP calculator mode 2 are 0.
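The check is sound because a[i][j]*map[i][j] ≤ 0 whenever map[i][j] ≤ 0 and a[i][j] ≥ 0, and adding b[i][j] ≤ 0 keeps the result non-positive, so the subsequent ReLu necessarily outputs 0. A minimal sketch of the element-wise decision (illustrative only; the hardware uses three comparators in place of the multiplier and adder):

```python
def bn_relu_with_skip(m, a, b):
    """Zero-skip before ReLu: three sign comparisons can replace a
    multiply and an add when the result is guaranteed non-positive."""
    if m <= 0 and a >= 0 and b <= 0:
        return 0.0                 # a*m + b <= 0, so ReLu(a*m + b) = 0
    return max(a * m + b, 0.0)     # ordinary BN followed by ReLu
```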
By adopting the above technical solutions, the present invention has the following beneficial effects:
(1) The present invention designs the BNRP calculator in a parallel pipelined manner and reduces the computation load of the neural network accelerator by dynamically configuring the parameters of the parallel calculators, in particular the execution mode of the BNRP calculator. Especially for convolutional neural networks with large network structures, this greatly accelerates the computation of the accelerator while reducing repeated calculations and hence the accelerator's power consumption. The convolution calculation array is designed on a systolic array architecture, achieving high computing throughput with only moderate storage and I/O communication bandwidth, effectively improving the data reuse rate and further reducing data transmission time.
(2) Through the design of the mode configurator, the execution mode of the BNRP calculator can be dynamically configured according to the characteristics of the network structure. The accelerator is therefore more versatile, no longer constrained by the network model structure and the number of layers; unnecessary intermediate-value caching is also eliminated, reducing the use of memory resources.
Description of the drawings
Figure 1 is a schematic structural diagram of the accelerator disclosed in the present invention.
Figure 2 is a schematic structural diagram of the BNRP calculator of the present invention.
Figure 3 is a schematic diagram of the working process of the BNRP calculator of the present invention.
Figure 4 is a schematic diagram of the 3*3 pooler of the present invention performing a pooling operation.
Detailed description of the embodiments
The technical solution of the invention is described in detail below in conjunction with the drawings.
The configurable parallel general convolutional neural network accelerator based on BNRP disclosed in the present invention is shown in Figure 1 and includes: a parallel computing acceleration unit composed of the mode configurator, the convolution calculator and the BNRP calculator; a data cache unit composed of the input and output feature map caches and the weight parameter cache; a data communication unit composed of the AXI4 bus interface and the AHB bus interface; and the data compression encoder/decoder. The working states of the accelerator include a read-configuration-parameters state, a read-data state, a calculation state and a send-data state.
The mode configurator reads the mode configuration parameters from outside the accelerator through the AHB bus. Whether the BN, ReLu or pooling operation is to be performed, together with configuration information such as the execution mode, the number of network layers and the feature map size, is transmitted to the BNRP calculator; information such as the number of network layers, the feature map size and batch, and the convolution kernel size is transmitted to the data buffer area of the convolution calculator; configuration information such as the number of network layers, the data read/write enables and the addresses is transmitted to the data compression encoder/decoder.
After reading the data read-enable and address signals, the data compression encoder/decoder reads the corresponding weight parameters (convolution kernels and biases) from outside the accelerator through the AXI4 bus and transmits them to the weight parameter buffer area, and reads the corresponding input feature map data and transmits it to the In_Map Buffer.
After receiving the calculation enable signal, the convolution calculator reads the number of network layers, the feature map size and batch, and the convolution kernel size from the data buffer area, and reads the weight parameters and input feature map data in a systolic manner to perform the corresponding convolution calculation. After the calculation is completed, it outputs the end-flag information to the BNRP calculator and outputs the convolution results to the Out_Map Buffer.
Referring to Figure 2, after receiving the mode configuration parameters, the BNRP calculator waits for the calculation-completion flag sent by the convolution calculator. If the configuration requires the BN operation, it initiates a BN parameter read request and reads the corresponding BN parameters from the BN parameter buffer; otherwise, no BN operation is performed.
Referring to Figure 3, the BNRP calculator determines the calculation mode to be executed according to the configuration information. If execution mode 1 is configured, the pooling operation is executed first: according to the received network model parameters (pooling stride) and the feature map size, the feature map input pixel values that need to be cached are sent to the corresponding Block RAMs, the corresponding poolers are enabled, and the ReLu operation is executed after the pooling calculation is completed. If execution mode 2 is configured, the ReLu operation is executed first. The maximum pooler calculation process is as follows:
OMap[c][i][j] = max{ IMap[c][S*i+p][S*j+q] : 0 ≤ p, q ≤ k }
The calculation process of the average pooler is as follows:
OMap[c][i][j] = ( Σ IMap[c][S*i+p][S*j+q] ) / (k+1)^2, over 0 ≤ p, q ≤ k
Here k = 1, 2 denotes the pooler size (the pooling window is (k+1)*(k+1)), IMap denotes an input feature map pixel value, OMap denotes an output feature map pixel value, and OMap[c][i][j] denotes the pixel value at row i, column j of the c-th output feature map.
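A direct software rendering of these two formulas follows, as a sketch under the stated convention that the window is (k+1)*(k+1); the hardware computes the same values with comparator trees and dividers:

```python
import numpy as np

def pool_map(imap, k, S, mode="max"):
    """Compute OMap from IMap per the formulas above, for one channel.

    imap: H x W array; k = 1 or 2 selects the 2*2 or 3*3 window;
    S is the pooling stride; mode 'max' gives MP, 'avg' gives AP.
    """
    H, W = imap.shape
    oh = (H - (k + 1)) // S + 1
    ow = (W - (k + 1)) // S + 1
    omap = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            win = imap[S*i : S*i + k + 1, S*j : S*j + k + 1]
            omap[i, j] = win.max() if mode == "max" else win.mean()
    return omap

# For the example below: a 13*13 map with k = 2, S = 2 yields a 6*6 output.
```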
Referring to Figure 4, take as an example a convolution calculation array with R = 6 rows, an input feature map of size 13*13, a 3*3 pooler and a pooling stride s = 2; the output feature map size is then 6*6. Since the row and column calculations of the output feature map follow the same principle, only the row calculation is described in detail below:
The first convolution pass outputs rows 1, 2, 3, 4, 5 and 6 of the feature map to the corresponding BlockRAM1, BlockRAM2, BlockRAM3, BlockRAM4, BlockRAM5 and BlockRAM6, additionally caches row 5 to BlockRAM5B and row 6 to BlockRAM6B, and enables poolers 1C, 3 and 5. The first output value of pooler 1C is invalid; pooler 3 performs the three-row pooling calculation over R1, R2 and R3 and outputs row 1 of Out_Map; pooler 5 performs the three-row pooling calculation over R3, R4 and R5 and outputs row 2 of Out_Map.
The second convolution pass outputs rows 7, 8, 9, 10, 11 and 12 of the feature map to the corresponding BlockRAM1 through BlockRAM6, additionally caches row 11 to BlockRAM5B and row 12 to BlockRAM6B, and enables poolers 1B, 3 and 5. Pooler 1B performs the three-row pooling calculation over R5, R6 and R7 and outputs row 3 of Out_Map; pooler 3 performs the three-row pooling calculation over R7, R8 and R9 and outputs row 4 of Out_Map; pooler 5 performs the three-row pooling calculation over R9, R10 and R11 and outputs row 5 of Out_Map.
The third convolution pass outputs row 13 of the feature map plus 5 rows of random data to the corresponding BlockRAMs 1 through 6. At this point the convolution output feature map size satisfies map_size < R, so no extra caching is needed and pooler 1C is enabled. Pooler 1C performs the three-row pooling calculation over R11, R12 and R13 and outputs row 6 of Out_Map, completing the pooling operation for this layer's input image. In a practical design, poolers 1B and 1C can be combined into a single 3*3 pooler numbered 1 using multiplexers and comparators; thus, in the actual calculation, the odd-numbered poolers are enabled when the pooling stride is s = 2. This schedule is summarized in the sketch below.
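The three passes can be reproduced with a short scheduling sketch (illustrative only; row numbering follows the example above, and the "carry" rows are the ones written to the extra BlockRAM5B/6B buffers):

```python
def row_schedule(map_size=13, R=6, k=3, s=2):
    """Print which 3-row pooling windows complete in each convolution pass
    and which rows must be carried over for the next pass. Sketch only."""
    starts = list(range(1, map_size - k + 2, s))       # window start rows
    done = 0
    for p, top in enumerate(range(1, map_size + 1, R), 1):
        avail = min(top + R - 1, map_size)             # rows produced so far
        ready = [r for r in starts[done:] if r + k - 1 <= avail]
        done += len(ready)
        carry = list(range(avail - k + 2, avail + 1)) if done < len(starts) else []
        print(f"pass {p}: windows {[tuple(range(r, r + k)) for r in ready]}, "
              f"carry rows {carry}")

# row_schedule() prints:
# pass 1: windows [(1, 2, 3), (3, 4, 5)], carry rows [5, 6]
# pass 2: windows [(5, 6, 7), (7, 8, 9), (9, 10, 11)], carry rows [11, 12]
# pass 3: windows [(11, 12, 13)], carry rows []
```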
It has been verified that when mode 1 is configured, performing the pooling operation first shrinks the feature map and therefore reduces the amount of ReLu computation; with pooling stride S, the pooled map carries only about 1/S^2 of the original pixel values, the exact fraction depending on the pooler size and stride (the original formula images are not reproduced here). When mode 2 is configured, performing the ReLu operation first maps all feature map data values into the non-negative set, so the pooling operation does not need to consider the sign bit of the input pixel values, which reduces the complexity of the pooling calculation and the power consumption of the comparators.
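As a quick arithmetic check on the 13*13 example above (illustrative numbers only):

```python
# Mode 1: ReLu runs on the pooled 6*6 map instead of the full 13*13 map.
full, pooled = 13 * 13, 6 * 6
print(full, pooled, pooled / full)   # 169 36 ~0.21, roughly 1/S^2 for S = 2
```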
The embodiments merely illustrate the technical idea of the present invention and cannot be used to limit its protection scope; any modification made on the basis of the technical solution that conforms to the inventive concept of the present application falls within the protection scope of the present invention.

Claims (9)

  1. A BNRP-based configurable parallel general convolutional neural network accelerator, characterized in that it comprises:
    a mode configurator, which reads network parameters, feature map parameters, calculation-mode and function configuration parameters from the outside and, according to the read parameters, outputs instructions for switching the working state of the accelerator;
    a data compression encoder/decoder, which encodes the feature map data, weight data and BN parameters read from the outside after receiving the network parameters, data read/write enable instructions and address configuration information sent by the mode configurator, and decodes the calculation results when receiving the calculation results output by the BNRP calculator;
    a BN parameter buffer, used to store the encoded BN parameters;
    an input feature map buffer, used to store the encoded input feature map data;
    a weight parameter buffer, used to store the encoded weight data;
    a data buffer, used to store the network parameters and feature map size parameters read from the outside by the mode configurator, and to read the encoded weight data from the weight parameter buffer after the calculation state is entered;
    a convolution calculator, which, after receiving the calculation enable instruction sent by the mode configurator, reads the network parameters, feature map parameters and weight data from the data buffer, and performs the convolution calculation after reading the input feature map data and weight data from the input feature map buffer and the weight parameter buffer;
    an output feature map buffer, used to store the convolution results output by the convolution calculator; and
    a BNRP calculator, which, after receiving the calculation mode sent by the mode configurator and the convolution-calculation end flag output by the convolution calculator, executes on the convolution results output by the convolution calculator, according to the function configuration parameters sent by the mode configurator, either the calculation mode of batch normalization followed by pooling and then nonlinear activation, or the calculation mode of batch normalization followed by nonlinear activation and then pooling.
  2. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 1, characterized in that the BNRP calculator comprises:
    R*T data input interfaces, which receive the R rows of feature maps output by the T convolution arrays of the convolution calculator;
    a BN operation module, which, when the function configuration parameters sent by the mode configurator contain a batch normalization instruction, reads the BN parameters from the BN parameter buffer and performs the batch normalization operation on the data received by the data input ports;
    a ReLu operation module, which performs nonlinear activation on the pooling results when the calculation mode sent by the mode configurator is batch normalization followed by pooling and then nonlinear activation, and performs nonlinear activation on the batch-normalized data when the calculation mode sent by the mode configurator is batch normalization followed by nonlinear activation and then pooling; and
    R*T poolers, which output the pooling results of the batch-normalized data when the calculation mode sent by the mode configurator is batch normalization followed by pooling and then nonlinear activation, and output the pooling results of the nonlinearly activated batch-normalized data when the calculation mode sent by the mode configurator is batch normalization followed by nonlinear activation and then pooling.
  3. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 2, characterized in that the BNRP calculator further comprises a mode simplification module: before the nonlinear activation operation is executed, the mode selector reads the feature map data received by the data input interfaces of the BNRP calculator together with the BN weight parameters and bias parameters and, when no multiplication or bias addition needs to be performed on the feature map data, sets to zero the batch normalization instruction of the calculation mode of batch normalization followed by pooling and then nonlinear activation, or sets to zero both the batch normalization instruction and the nonlinear activation instruction of the calculation mode of batch normalization followed by nonlinear activation and then pooling.
  4. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 3, characterized in that the mode simplification module comprises three comparators that respectively judge the feature map data, the BN weight parameter and the bias parameter against 0; when the three conditions that the feature map data is less than or equal to 0, the BN weight parameter is greater than or equal to 0 and the bias parameter is less than or equal to 0 are satisfied simultaneously, it outputs configuration parameters in which the batch normalization instruction is zero for the calculation mode of batch normalization followed by pooling and then nonlinear activation, or configuration parameters in which both the batch normalization instruction and the nonlinear activation instruction are zero for the calculation mode of batch normalization followed by nonlinear activation and then pooling.
  5. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 2, characterized in that, when the function configuration parameters sent by the mode configurator contain an instruction to execute 2*2 maximum pooling, the R*T poolers are R*T 2*2 poolers; the 2*2 pooler is a four-to-one comparator composed of a first two-to-one comparator and a second two-to-one comparator, two feature map data are input per clock to the two two-to-one comparators, and the four-to-one comparator outputs one 2*2 pooling value every 2 clocks; when the pooling stride is 1, the output value of the second two-to-one comparator is saved as the output value of the first two-to-one comparator at the next clock; when the function configuration parameters sent by the mode configurator contain an instruction to execute 2*2 average pooling, the comparator of the maximum pooling mode is configured as a 1/2 divider.
  6. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 2, characterized in that, when the function configuration parameters sent by the mode configurator contain an instruction to execute 3*3 maximum pooling, the R*T poolers are R*T 3*3 poolers; the 3*3 pooler is a nine-to-one comparator composed of a first three-to-one comparator, a second three-to-one comparator and a third three-to-one comparator, three feature map data are input per clock to the inputs of the three three-to-one comparators, and the nine-to-one comparator outputs one 3*3 pooling value every 3 clocks; when the pooling stride is 1, the output value of the second three-to-one comparator is saved as the output value of the first three-to-one comparator at the next clock, and the output value of the third three-to-one comparator is saved as the output value of the second three-to-one comparator at the next clock; when the pooling stride is 2, the output value of the third three-to-one comparator is saved as the output value of the first three-to-one comparator at the next clock; when the function configuration parameters sent by the mode configurator contain an instruction to execute 3*3 average pooling, the comparator of the maximum pooling mode is configured as a 1/3 divider.
  7. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 1, characterized in that the mode configurator reads the network parameters, feature map parameters, calculation mode and function configuration parameters from the outside through the AHB bus; the network parameters include the number of network layers and the convolution kernel size; the feature map parameters include the feature map size parameter and the batch; the calculation mode is to execute, on the convolution results output by the convolution calculator, batch normalization followed by pooling and then nonlinear activation, or batch normalization followed by nonlinear activation and then pooling; and the function configuration parameters include whether to perform the batch normalization operation, whether to perform the nonlinear activation operation and whether to perform the pooling operation.
  8. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 1, characterized in that the data compression encoder/decoder reads the feature map data, weight data and BN parameters from the outside through the AXI4 bus.
  9. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 1, characterized in that, when the input feature map data is larger than the number of array rows of the convolution calculator and a pooling operation needs to be executed, m rows of input feature map data are interleaved and cached into 2m on-chip Block RAMs.
PCT/CN2019/105534 2019-06-28 2019-09-12 Bnrp-based configurable parallel general convolutional neural network accelerator WO2020258529A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910572582.3 2019-06-28
CN201910572582.3A CN110390385B (en) 2019-06-28 2019-06-28 BNRP-based configurable parallel general convolutional neural network accelerator

Publications (1)

Publication Number Publication Date
WO2020258529A1

Family

ID=68285909

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/105534 WO2020258529A1 (en) 2019-06-28 2019-09-12 Bnrp-based configurable parallel general convolutional neural network accelerator

Country Status (2)

Country Link
CN (1) CN110390385B (en)
WO (1) WO2020258529A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111158756B (en) * 2019-12-31 2021-06-29 百度在线网络技术(北京)有限公司 Method and apparatus for processing information
CN111242295B (en) * 2020-01-20 2022-11-25 清华大学 Method and circuit capable of configuring pooling operator
CN111142808B (en) * 2020-04-08 2020-08-04 浙江欣奕华智能科技有限公司 Access device and access method
CN111832717B (en) * 2020-06-24 2021-09-28 上海西井信息科技有限公司 Chip and processing device for convolution calculation
CN111736904B (en) * 2020-08-03 2020-12-08 北京灵汐科技有限公司 Multitask parallel processing method and device, computer equipment and storage medium
CN112905530B (en) * 2021-03-29 2023-05-26 上海西井信息科技有限公司 On-chip architecture, pooled computing accelerator array, unit and control method
CN113065647B (en) * 2021-03-30 2023-04-25 西安电子科技大学 Calculation-storage communication system and communication method for accelerating neural network
CN114004351B (en) * 2021-11-22 2025-04-18 浙江大学 A convolutional neural network hardware acceleration platform
CN115470164B (en) * 2022-09-30 2025-07-08 上海安路信息科技股份有限公司 A hybrid system based on FPGA+NPU architecture

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184366A (en) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division-multiplexing general neural network processor
CN105631519A (en) * 2015-12-31 2016-06-01 北京工业大学 Convolution nerve network acceleration method based on pre-deciding and system
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
US20180341495A1 (en) * 2017-05-26 2018-11-29 Purdue Research Foundation Hardware Accelerator for Convolutional Neural Networks and Method of Operation Thereof
CN109635944A (en) * 2018-12-24 2019-04-16 西安交通大学 A kind of sparse convolution neural network accelerator and implementation method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229647A (en) * 2017-08-18 2018-06-29 北京市商汤科技开发有限公司 The generation method and device of neural network structure, electronic equipment, storage medium
US11568218B2 (en) * 2017-10-17 2023-01-31 Xilinx, Inc. Neural network processing system having host controlled kernel acclerators
CN109389212B (en) * 2018-12-30 2022-03-25 南京大学 Reconfigurable activation quantization pooling system for low-bit-width convolutional neural network
CN109767002B (en) * 2019-01-17 2023-04-21 山东浪潮科学研究院有限公司 A neural network acceleration method based on multi-block FPGA co-processing
CN109934339B (en) * 2019-03-06 2023-05-16 东南大学 A Universal Convolutional Neural Network Accelerator Based on a 1D Systolic Array

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905239B (en) * 2021-02-19 2024-01-12 北京超星未来科技有限公司 Point cloud preprocessing acceleration method based on FPGA, accelerator and electronic equipment
CN112905239A (en) * 2021-02-19 2021-06-04 北京超星未来科技有限公司 Point cloud preprocessing acceleration method based on FPGA, accelerator and electronic equipment
CN113052299B (en) * 2021-03-17 2022-05-31 浙江大学 Neural network memory computing device based on lower communication bound and acceleration method
CN113052299A (en) * 2021-03-17 2021-06-29 浙江大学 Neural network memory computing device based on lower communication bound and acceleration method
CN115145839A (en) * 2021-03-31 2022-10-04 广东高云半导体科技股份有限公司 Deep convolution accelerator and method for accelerating deep convolution by using same
CN115145839B (en) * 2021-03-31 2024-05-14 广东高云半导体科技股份有限公司 Depth convolution accelerator and method for accelerating depth convolution
CN113051216A (en) * 2021-04-22 2021-06-29 南京工业大学 MobileNet-SSD target detection device and method based on FPGA acceleration
CN113051216B (en) * 2021-04-22 2023-07-11 南京工业大学 MobileNet-SSD target detection device and method based on FPGA acceleration
CN113255897A (en) * 2021-06-11 2021-08-13 西安微电子技术研究所 Pooling computing unit of convolutional neural network
CN113255897B (en) * 2021-06-11 2023-07-07 西安微电子技术研究所 Pooling calculation unit of convolutional neural network
CN113592067B (en) * 2021-07-16 2024-02-06 华中科技大学 A configurable convolution computing circuit for convolutional neural networks
CN113592067A (en) * 2021-07-16 2021-11-02 华中科技大学 Configurable convolution calculation circuit for convolution neural network
CN113516236A (en) * 2021-07-16 2021-10-19 西安电子科技大学 VGG16 network parallel acceleration processing method based on ZYNQ platform
CN113592086A (en) * 2021-07-30 2021-11-02 中科亿海微电子科技(苏州)有限公司 Method and system for obtaining optimal solution of parallelism of FPGA CNN accelerator
CN113792621B (en) * 2021-08-27 2024-04-05 杭州电子科技大学 FPGA-based target detection accelerator design method
CN113792621A (en) * 2021-08-27 2021-12-14 杭州电子科技大学 A Design Method of Target Detection Accelerator Based on FPGA
CN113743587A (en) * 2021-09-09 2021-12-03 苏州浪潮智能科技有限公司 Convolutional neural network pooling calculation method, system and storage medium
CN113743587B (en) * 2021-09-09 2024-02-13 苏州浪潮智能科技有限公司 A convolutional neural network pooling calculation method, system, and storage medium
CN114239816A (en) * 2021-12-09 2022-03-25 电子科技大学 Reconfigurable hardware acceleration architecture of convolutional neural network-graph convolutional neural network
CN114239816B (en) * 2021-12-09 2023-04-07 电子科技大学 Reconfigurable hardware acceleration architecture of convolutional neural network-graph convolutional neural network
CN114265696A (en) * 2021-12-28 2022-04-01 北京航天自动控制研究所 Pooler and Pooling Acceleration Circuit for Max Pooling Layer of Convolutional Neural Network
CN114936636A (en) * 2022-04-29 2022-08-23 西安电子科技大学广州研究院 General lightweight convolutional neural network acceleration method based on FPGA
CN114819129A (en) * 2022-05-10 2022-07-29 福州大学 Convolution neural network hardware acceleration method of parallel computing unit
CN114911628A (en) * 2022-06-15 2022-08-16 福州大学 An FPGA-based MobileNet Hardware Acceleration System
CN115204364A (en) * 2022-06-28 2022-10-18 中国电子科技集团公司第五十二研究所 A convolutional neural network hardware acceleration device with dynamic allocation of cache space
CN116309520A (en) * 2023-04-03 2023-06-23 江南大学 A strip steel surface defect detection system
CN117933345A (en) * 2024-03-22 2024-04-26 长春理工大学 A training method for medical image segmentation model
CN117933345B (en) * 2024-03-22 2024-06-11 长春理工大学 A training method for medical image segmentation model
CN118070855A (en) * 2024-04-18 2024-05-24 南京邮电大学 Convolutional neural network accelerator based on RISC-V architecture
CN118070855B (en) * 2024-04-18 2024-07-09 南京邮电大学 A convolutional neural network accelerator based on RISC-V architecture

Also Published As

Publication number Publication date
CN110390385A (en) 2019-10-29
CN110390385B (en) 2021-09-28

Similar Documents

Publication Publication Date Title
WO2020258529A1 (en) Bnrp-based configurable parallel general convolutional neural network accelerator
CN109284817B (en) Deep separable convolutional neural network processing architecture/method/system and medium
CN109934339B (en) A Universal Convolutional Neural Network Accelerator Based on a 1D Systolic Array
WO2020258841A1 (en) Deep neural network hardware accelerator based on power exponent quantisation
CN109447241B (en) A Dynamic Reconfigurable Convolutional Neural Network Accelerator Architecture for the Internet of Things
CN106991477B (en) Artificial neural network compression coding device and method
US10936941B2 (en) Efficient data access control device for neural network hardware acceleration system
US20190026626A1 (en) Neural network accelerator and operation method thereof
CN110516801A (en) A High Throughput Dynamically Reconfigurable Convolutional Neural Network Accelerator Architecture
CN107169563A (en) Processing system and method applied to two-value weight convolutional network
CN109389212B (en) Reconfigurable activation quantization pooling system for low-bit-width convolutional neural network
CN106228240A (en) Degree of depth convolutional neural networks implementation method based on FPGA
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
CN110991630A (en) Convolutional neural network processor for edge calculation
CN111626403B (en) Convolutional neural network accelerator based on CPU-FPGA memory sharing
CN115880132B (en) Graphics processor, matrix multiplication task processing method, device and storage medium
CN111507465B (en) A Configurable Convolutional Neural Network Processor Circuit
CN107844829A (en) Method and system and neural network processor for accelerans network processing unit
CN107729995A (en) Method and system and neural network processor for accelerans network processing unit
CN108345934B (en) A kind of activation device and method for neural network processor
CN111860773A (en) Processing apparatus and method for information processing
CN113392963B (en) FPGA-based CNN hardware acceleration system design method
CN115983348A (en) RISC-V Accelerator System Supporting Extended Instructions for Convolutional Neural Networks
CN117632844A (en) Reconfigurable AI algorithm hardware accelerator
CN108647780A (en) Restructural pond operation module structure towards neural network and its implementation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19935380

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19935380

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31.08.2022)
