Disclosure of Invention
The invention aims to provide a hardware acceleration method for TB-Net used in DoA estimation, improving the energy efficiency of the TB-Net implementation, reducing computation latency, and enabling real-time DoA estimation.
The invention provides a hardware acceleration implementation method for TB-Net (two-branch neural network) used in DoA (Direction of Arrival) estimation, which comprises the following specific steps:
Step 1, designing the architecture of the TB-Net accelerator for DoA estimation;
Step 2, designing the data flow and writing a script that reorders the weights and biases of TB-Net;
Step 3, designing a custom simple instruction set and writing TB-Net as an instruction program;
Step 4, quantizing the weight and bias data of TB-Net into 16-bit signed integers;
Step 5, verifying the circuit function and comparing the software and hardware results.
The architecture design of the accelerator in step 1 comprises the following specific steps:
Step 1-1, designing a data path from the host to the accelerator: weight and bias data are stored into DRAM through a PCI-E data path, and calculation results are read out of the accelerator;
Step 1-2, designing the PE array of the convolution calculation module. The PE array consists of 8 × 8 PE units; a single PE unit processes the single-channel convolution of one convolution kernel, each row of PEs corresponds to the convolutions of different channels of the same convolution kernel, and each column of PEs corresponds to the convolutions of the same channel of different convolution kernels, so the whole PE array can perform the 8-channel convolution operations of 8 convolution kernels simultaneously, and the calculation results of each row of PEs are summed to obtain an eight-channel convolution result. For a convolution layer whose kernel count and channel count are both no greater than 8, the bias is added to this result to obtain the output feature value. A convolution layer whose kernel count or channel count exceeds 8 must be processed in batches of blocks covering at most 8 kernels and 8 channels each; for example, a convolution calculation with 16 convolution kernels and 16 channels is divided into the four blocks M0-7C0-7, M0-7C8-15, M8-15C0-7, M8-15C8-15, where M0-7C0-7 denotes the weight block of channels 0 to 7 of convolution kernels 0 to 7 and the remaining notations are analogous. For the first channel batch, the bias is added to the convolution result to obtain the output feature value, which is stored in the feature value cache; for later channel batches, the convolution result is instead accumulated with the partial result of the preceding channels of the same kernels already held in the feature value cache;
Step 1-3, designing the single PE unit used in step 1-2. A single PE comprises a weight storage module with an address depth of 8, 8 multipliers, and 8 accumulators, and supports one-dimensional convolution with kernel lengths from 1 to 8. The corresponding weight data in DRAM are first loaded into the PE and held stationary; then one feature value is loaded from the feature value cache into the PE every clock cycle and multiplied simultaneously with all the weights held in the PE, the products are accumulated in a staggered manner with those of the previous clock cycle, and each output feature value of the convolution is produced at the corresponding moment;
Step 1-4, designing an on-chip cache for temporarily storing input feature values, bias data, and intermediate results. The storage for intermediate results is divided equally into two parts, feature value cache 1 and feature value cache 2. When one layer of the network is computed, feature value cache 1 stores the input feature values of that layer's convolution and feature value cache 2 stores its output feature values; for the next layer's convolution, the previous layer's outputs held in feature value cache 2 become the inputs, and feature value cache 1 now stores the outputs. If a further layer of convolution follows, the roles of feature value cache 1 and feature value cache 2 swap again in the same way. Each of the two caches is further divided into 8 equal parts that store the feature values of different channels;
and step 1-5, designing a control module, which comprises a clock generator controlling the convolution stride, a convolution padding generator, flag-signal generators for the data-transfer and calculation processes, and a series of address-pointer generators.
In step 2, the data flow is designed and a script is written that reorders the weights and biases of TB-Net so that they are stored in DRAM in the order in which the PE array consumes them during convolution; the accelerator then only needs to read data from DRAM with a sequential pointer. The convolution calculation order of the PE array is as follows:
Step 2-1, reading C×M×W×H weight data from DRAM and storing them into the PE array, where C is the number of convolution kernel channels, M the number of convolution kernels, W the kernel width, and H the kernel height; the kernel widths of the TB-Net used for DoA estimation are all 1, so this simplifies to C×M×H weight data;
Step 2-2, reading C×W input feature values from the feature value cache every clock cycle, where C is the number of input feature channels (equal to the number of kernel channels) and W is the input feature width; the input feature widths of the TB-Net used for DoA estimation are all 1, so simply C input feature values are read per clock cycle and multiplied with the C×M×H weights held in the PE array to obtain C×M×H intermediate results;
Step 2-3, in each clock cycle, accumulating the H intermediate results obtained in a single PE unit with the intermediate results of the previous cycle in a staggered manner, so that after H clock cycles the convolution result of the kernel's first sliding-window position on the feature values is obtained, and the result of the second window position follows in the next clock cycle;
Step 2-4, the PE array can process at most the 8-channel operations of 8 convolution kernels. When the number of convolution kernels or the number of channels of a TB-Net layer exceeds 8, that layer must be processed in blocks. During block processing, weight blocks with the same convolution kernels but different channels are computed first; once all weight blocks sharing the same kernels have been processed, the weight blocks of the next batch of convolution kernels are processed.
In step 3, a custom simple instruction set is designed and TB-Net is written as an instruction program; the specific flow is as follows:
The custom simple instruction set comprises four instruction types, LOAD_B/LOAD_X/CONV_W/READ_Y. An instruction decoder and an instruction memory are designed according to this instruction set; an instruction program implementing the algorithm is then written with the instruction set, so that the algorithm can run on the hardware, and the program is stored in the instruction memory;
LOAD_B, loading the bias data stored in DRAM into the on-chip bias cache;
LOAD_X, loading input feature value data into the on-chip feature value cache;
CONV_W, loading the weight data stored in DRAM into the weight buffer modules of the PE array and performing the convolution calculation; the operands comprise the data amount, the number of block calculations, the convolution kernel length, the stride, the padding, and whether ReLU is applied;
READ_Y, reading feature values out of the on-chip feature value cache; the operand contains the amount of feature value data to read.
In step 4, the weight and bias data of TB-Net are quantized into 16-bit signed integers using a symmetric fake-quantization (pseudo-quantization) scheme: fake quantization is applied to each weight before it is used and to each activation after the activation function, quantizing the floating-point value to a fixed-point value and then dequantizing it back to floating point, so that the forward pass runs on floating-point values that carry the quantization error. This fake quantization makes the distributions of the weights and activations more uniform with smaller variance, loses less precision than direct post-training quantization, and keeps the output of each layer within a controlled range, which helps with overflow handling.
Step 5, verifying the circuit function: the quantized data are used to run the forward inference of the neural network in software and in hardware respectively, and the output results are compared for consistency.
By holding the weights stationary in a systolic array, the invention increases data reuse and significantly reduces the amount of data transfer, and it adapts to one-dimensional convolutions of various scales, thereby realizing a low-power, high-speed hardware accelerator for the TB-Net used in DoA estimation.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
The invention relates to a TB-Net hardware acceleration implementation method for DoA estimation. A VCS+Verdi joint simulation platform was used. The input is 120×2 16-bit data. The specific network structure is shown in Fig. 1.
TABLE 1 network structure of TB-Net
The invention mainly comprises the following steps:
Step 1, carrying out the architecture design of the TB-Net accelerator for DoA estimation; the architecture diagram is shown in Fig. 2.
Step 1-1, designing a data path from the host to the accelerator: weight and bias data are stored into DRAM through a PCI-E data path, and calculation results are read out of the accelerator.
Step 1-2, designing the PE array of the convolution calculation module. The PE array consists of 8 × 8 PE units; a single PE unit processes the single-channel convolution of one convolution kernel, each row of PEs corresponds to the convolutions of different channels of the same convolution kernel, and each column of PEs corresponds to the convolutions of the same channel of different convolution kernels; the whole PE array can perform the 8-channel convolution of 8 convolution kernels simultaneously, and the calculation results of each row of PEs are summed to obtain an eight-channel convolution result;
For example, a convolution calculation with 16 convolution kernels and 16 channels must be divided into four blocks
M0-7C0-7, M0-7C8-15, M8-15C0-7, M8-15C8-15;
where M0-7C0-7 denotes the weight block of channels 0 to 7 of convolution kernels 0 to 7, and the remaining notations are analogous. For the first channel batch (M0-7C0-7, M8-15C0-7), the bias is added to the convolution result to obtain the output feature value, which is stored in the feature value cache; for the non-first channel batches (M0-7C8-15, M8-15C8-15), the convolution result is instead accumulated with the result of the preceding channels of the same kernels already held in the feature value cache, and the sum is stored back into the feature value cache.
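This block-wise accumulation rule can be summarized with a short numpy sketch (a software model only; the names conv_block and accumulate and the unpadded-convolution simplification are illustrative, not part of the design):

import numpy as np

def conv_block(x_blk, w_blk):
    """One 8-kernel x 8-channel weight block applied to its input channels.
    x_blk: (8, L) feature values; w_blk: (8, 8, H) as [kernel, channel, tap].
    Returns the (8, L-H+1) convolution result summed over the 8 channels."""
    K, C, H = w_blk.shape
    L = x_blk.shape[1]
    out = np.zeros((K, L - H + 1))
    for k in range(K):
        for c in range(C):
            for i in range(L - H + 1):
                out[k, i] += np.dot(w_blk[k, c], x_blk[c, i:i + H])
    return out

def accumulate(cache, blk_out, bias, first_channel_batch):
    # First channel batch (e.g. M0-7C0-7): add the bias once and store.
    # Later channel batches (e.g. M0-7C8-15): add onto the cached partial sums.
    return blk_out + bias[:, None] if first_channel_batch else cache + blk_out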
Step 1-3, designing the single PE unit used in step 1-2; its working principle is shown in Fig. 3. A single PE comprises a weight storage module with an address depth of 8, 8 multipliers, and 8 accumulators, and supports one-dimensional convolution with kernel lengths from 1 to 8; its structure is shown in Fig. 4. The corresponding weight data in DRAM are first loaded into the PE and held stationary; then one feature value is loaded from the feature value cache into the PE every clock cycle and multiplied simultaneously with all the weights held in the PE, the products are accumulated in a staggered manner with those of the previous clock cycle, and each output feature value of the convolution is produced at the corresponding moment.
For example, in Fig. 3, during clock cycles T0 to T4 the weights of a single-channel 1×5 convolution kernel are loaded into the PE unit:
[W0,W1,W2,W3,W4],
which must be convolved with the feature values of the corresponding channel:
[X0, X1, X2, ..., X59].
At clock cycle T5, the input X0 of the PE unit is multiplied with [W0, W1, W2, W3, W4] respectively, giving:
[W0X0,W1X0,W2X0,W3X0,W4X0];
At clock cycle T6, the input X1 of the PE unit is multiplied with [W0, W1, W2, W3, W4] to obtain:
[W0X1,W1X1,W2X1,W3X1,W4X1];
which is then accumulated in a staggered manner with the result of the previous clock cycle:
[W0X0,W1X0,W2X0,W3X0,W4X0];
yielding:
[W0X0+W1X1,W1X0+W2X1,W2X0+W3X1,W3X0+W4X1,W4X0];
Similarly, in each clock cycle one input feature value is multiplied with the weights stored in the PE, and the products are accumulated in a staggered manner with the products of the previous clock cycle. Because padding = 2, the first convolution result is obtained at clock cycle T7:
Y0=W2X0+W3X1+W4X2;
When the convolution stride stride = 1, the second convolution result is obtained at clock cycle T8:
Y1=W1X0+W2X1+W3X2+W4X3;
When the convolution stride stride = 2, the second convolution result is instead obtained at clock cycle T9:
Y2=W0X0+W1X1+W2X2+W3X3+W4X4;
Since the number of input feature values is 60 and padding = 2, reading of input feature values stops and the products are forced to 0 during clock cycles T65 and T66, yielding the final convolution results Y58 and Y59.
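The staggered accumulation above can be checked against a direct sliding-window convolution with a short numpy model of one weight-stationary PE (a behavioral sketch, not the RTL; pe_conv1d and its accumulator array are illustrative):

import numpy as np

def pe_conv1d(x, w, pad=2, stride=1):
    """Behavioral model of one weight-stationary PE: the H weights stay
    fixed while one feature value streams in per clock cycle, and each
    product w[j]*x[t] is staggered into the partial sum of output t-j+pad."""
    acc = np.zeros(len(x))                    # one partial sum per output
    for t, xt in enumerate(x):                # one new input per cycle
        for j in range(len(w)):
            i = t - j + pad                   # output index fed this cycle
            if 0 <= i < len(acc):
                acc[i] += w[j] * xt
    return acc[::stride]                      # stride picks every s-th window

w = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # [W0..W4]
x = np.arange(60, dtype=float)                # [X0..X59]
ref = np.correlate(np.pad(x, 2), w, mode="valid")   # direct sliding window
assert np.allclose(pe_conv1d(x, w), ref)      # Y0 = W2X0+W3X1+W4X2, etc.

The missing cycles past X59 play the role of the right-hand padding, matching the T65/T66 behavior described above.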
And step 1-4, designing an on-chip cache for temporarily storing input feature values, bias data, and intermediate results. For intermediate results, the feature value storage is divided equally into two parts, feature value cache 1 (sram_0 Feature in Fig. 2) and feature value cache 2 (sram_1 Feature in Fig. 2). For one layer of the convolutional neural network, feature value cache 1 stores the input feature values of that layer's convolution and feature value cache 2 stores its output feature values. For the next layer's convolution, the previous layer's outputs held in feature value cache 2 become the inputs, and feature value cache 1 now stores the outputs. If a further layer of convolution follows, the roles of feature value cache 1 and feature value cache 2 swap again in the same way. Each of the two caches is divided into 8 equal parts that store the feature values of different channels. The data output port of each part corresponds to the input feature value port of one row of the PE array in step 1-2, and the data input port of each part corresponds to the output feature value port of one row of the PE array in step 1-2.
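A minimal sketch of this ping-pong scheme (run_layers and apply_layer are stand-ins; in the real design the two caches are fixed-size SRAMs, and differently sized layer outputs simply occupy part of a cache):

import numpy as np

def run_layers(x, layers, apply_layer):
    """Ping-pong between two equal feature buffers: each layer reads from
    one buffer and writes to the other, then the roles swap, so no third
    buffer is needed. apply_layer stands in for one pass through the PE array."""
    buf = [x.copy(), np.zeros_like(x)]        # feature value caches 1 and 2
    src, dst = 0, 1
    for layer in layers:
        buf[dst] = apply_layer(layer, buf[src])
        src, dst = dst, src                   # roles swap for the next layer
    return buf[src]                           # output of the last layer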
And step 1-5, designing a control module, which comprises a clock generator controlling the convolution stride, a convolution padding generator, flag-signal generators for the data-transfer and calculation processes, and a series of address-pointer generators.
Step 2, designing the data flow and writing a script that reorders the weights and biases of TB-Net (a sketch of such a script follows the blocking example below).
Step 2-1, reading the C×M×W×H weight data from DRAM and storing them into the PE array, where C is the number of convolution kernel channels, M the number of convolution kernels, W the kernel width, and H the kernel height. The kernel widths of the TB-Net used for DoA estimation are all 1, so this simplifies to C×M×H weight data.
Step 2-2, reading C×W input feature values from the feature value cache every clock cycle, where C is the number of input feature channels (equal to the number of kernel channels) and W is the input feature width. The input feature widths of the TB-Net used for DoA estimation are all 1, so C input feature values are read per clock cycle and multiplied with the C×M×H weights held in the PE array, giving C×M×H intermediate results. For a single PE, one input feature value is multiplied with the H weights it holds, giving H intermediate results.
Step 2-3, in each clock cycle the H intermediate results obtained in a single PE unit are accumulated in a staggered manner with the intermediate results of the previous cycle, so that after H clock cycles the convolution result of the kernel's first sliding-window position on the feature values is obtained, and the result of the second window position follows in the next clock cycle. Likewise, if the input feature scale of one TB-Net convolution layer is C×L, where L is the feature length (the number of feature values per channel), the calculation of all window positions completes after L clock cycles.
Step 2-4, the PE array can process at most the 8-channel operations of 8 convolution kernels. When the number of convolution kernels or the number of channels of a TB-Net layer exceeds 8, that layer must be processed in blocks. During block processing, weight blocks with the same convolution kernels but different channels are computed first; once all weight blocks sharing the same kernels have been processed, the weight blocks of the next batch of convolution kernels are processed.
For example, a convolution calculation with 16 convolution kernels and 16 channels must be divided into four blocks
M0-7C0-7, M0-7C8-15, M8-15C0-7, M8-15C8-15.
The batch processing order is
M0-7C0-7 → M0-7C8-15 → M8-15C0-7 → M8-15C8-15.
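A reordering script in the spirit of step 2 might look as follows (a sketch: the kernel-major layout inside each 8×8 block and the zero padding of partial blocks are assumptions; only the batch order above is fixed by the design):

import numpy as np

TILE = 8   # the PE array processes at most 8 kernels x 8 channels at a time

def reorder_weights(w):
    """Reorder an (M, C, H) weight tensor into the DRAM order consumed by
    the PE array: all channel batches of one kernel batch before the next
    kernel batch (M0-7C0-7 -> M0-7C8-15 -> M8-15C0-7 -> M8-15C8-15),
    zero-padding blocks when M or C is not a multiple of 8."""
    M, C, H = w.shape
    stream = []
    for m0 in range(0, M, TILE):              # one batch of 8 kernels
        for c0 in range(0, C, TILE):          # its channel batches run first
            blk = np.zeros((TILE, TILE, H), dtype=w.dtype)
            ms, cs = min(TILE, M - m0), min(TILE, C - c0)
            blk[:ms, :cs] = w[m0:m0 + ms, c0:c0 + cs]
            stream.append(blk.ravel())        # each PE's H taps contiguous
    return np.concatenate(stream)             # flat stream for DRAM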
Step 3, designing the custom simple instruction set, which comprises the four instruction types LOAD_B/LOAD_X/CONV_W/READ_Y; an instruction decoder and an instruction memory are designed according to this instruction set. An instruction program is then written with the instruction set according to the algorithm, enabling the algorithm to run on hardware, and the program is stored in the instruction memory.
LOAD_B, loading the bias data stored in DRAM into the on-chip bias cache; the operand contains the amount of data to load;
LOAD_X, loading input feature value data into the on-chip feature value cache; the operand contains the amount of data to load;
CONV_W, loading the weight data stored in DRAM into the weight buffer modules of the PE array and performing the convolution calculation; the operands contain the data amount, the number of block calculations, the convolution kernel length, the stride, the padding, and whether ReLU is applied;
READ_Y, reading feature values out of the on-chip feature value cache; the operand contains the amount of feature value data to read.
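The binary encoding is left to the designer (its flexibility is noted again in the verification example below), so the following encoder is purely illustrative; every field width here is an assumption, and only the four opcodes come from the design:

OPCODES = {"LOAD_B": 0, "LOAD_X": 1, "CONV_W": 2, "READ_Y": 3}

def encode(op, amount=0, blocks=0, kernel_len=0, stride=0, padding=0, relu=0):
    """Pack one instruction as opcode(2b) | amount(16b) | blocks(8b) |
    kernel_len(4b) | stride(2b) | padding(2b) | relu(1b)."""
    word = OPCODES[op]
    for value, width in ((amount, 16), (blocks, 8), (kernel_len, 4),
                         (stride, 2), (padding, 2), (relu, 1)):
        word = (word << width) | (value & ((1 << width) - 1))
    return word

# e.g. a CONV_W for a length-5 kernel, stride 1, padding 2, with ReLU:
insn = encode("CONV_W", amount=60, blocks=1, kernel_len=5,
              stride=1, padding=2, relu=1)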
The quantization in step 4 adopts a symmetric fake-quantization scheme: fake quantization is applied to each weight before it is used and to each activation after the activation function, quantizing the floating-point value to a fixed-point value and then dequantizing it back to floating point, so that the forward pass runs on floating-point values that carry the quantization error. This makes the distributions of the weights and activations more uniform with smaller variance, loses less precision than direct post-training quantization, and keeps the output of each layer within a controlled range, which helps with overflow handling.
Pseudo quantization scheme:
Signed arithmetic is used throughout the design, so the quantization scheme is symmetric quantization, as follows:
For the weights:
s = max(abs(r)) / (2^(bit-1) - 1)
q = round(clamp(r/s, -2^(bit-1), 2^(bit-1) - 1))
where s is the quantization factor, r is the value before quantization, and q is the quantized value.
For the activation values:
the quantization range of each layer is obtained dynamically with an EMA (exponential moving average):
movingmax = movingmax * momenta + max(abs(r)) * (1 - momenta)
s = movingmax / (2^(bit-1) - 1)
q = round(clamp(r/s, -2^(bit-1), 2^(bit-1) - 1))
where s is the quantization factor, r is the value before quantization, q is the quantized value, and momenta is the EMA momentum coefficient.
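These formulas can be exercised with a small numpy sketch (fake_quant, weight_scale, and ActScale are illustrative names; bit = 16 as in step 4):

import numpy as np

BIT = 16
QMAX = 2 ** (BIT - 1) - 1                     # 32767 for 16-bit signed data

def fake_quant(r, s):
    """Symmetric fake quantization: scale, round, clamp to the signed
    range, then dequantize so the forward pass carries the quant error."""
    q = np.clip(np.round(r / s), -QMAX - 1, QMAX)
    return q * s

def weight_scale(w):
    return np.max(np.abs(w)) / QMAX           # per-tensor symmetric scale s

class ActScale:
    """Tracks movingmax = movingmax*momenta + max(abs(r))*(1 - momenta)."""
    def __init__(self, momenta=0.99):
        self.momenta, self.movingmax = momenta, 0.0
    def update(self, r):                      # call once per training batch
        self.movingmax = (self.momenta * self.movingmax
                          + (1 - self.momenta) * np.max(np.abs(r)))
        return self.movingmax / QMAX          # activation scale s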
The BN (batch normalization) layer is folded. Although batch normalization is a separate module during training, in actual inference the batch normalization parameters are "folded" into the weights and biases of the convolution or fully connected layer to improve efficiency. Folding the parameters after software training provides an acceleration on the hardware.
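A minimal sketch of the folding arithmetic, assuming the usual per-output-channel BN parameters (the function name and shapes are illustrative):

import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the preceding convolution:
    gamma*(conv(x,w) + b - mean)/sqrt(var + eps) + beta == conv(x,w') + b'.
    Shapes: w (M, C, H); b, gamma, beta, mean, var all (M,)."""
    scale = gamma / np.sqrt(var + eps)        # one factor per output channel
    return w * scale[:, None, None], (b - mean) * scale + beta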
And step 5, circuit function verification: the quantized data are used to run the forward inference of the neural network on the PyTorch platform and on the VCS+Verdi platform respectively, and the software and hardware outputs are compared for consistency.
Take the comparison of the software and hardware results of the first layer convolution layer of TB-Net as an example.
First, a software simulation is run with the quantized data, and the first-layer convolution results are recorded for later comparison with the hardware results.
A program is then written with the custom instruction set:
LOAD_B // load the biases required by TB-Net into the on-chip cache;
LOAD_X // load the input feature values of TB-Net into the on-chip cache;
CONV_W // load the first-layer weights of TB-Net into the PE array and perform the first-layer convolution;
READ_Y // read out the first-layer convolution results;
Because the design of the custom instruction set is flexible, the operands following each instruction type are not listed here. The assembled instruction program is binary machine code.
The instruction program, several groups of input feature values, the quantized weights and biases, and the software-simulated first-layer convolution results are stored into an off-chip ROM (used only for simulation) through the $readmem function; after stimulus is applied, the instruction program is loaded into the on-chip instruction memory, and the accelerator then runs the program to start the hardware simulation. During READ_Y, the first-layer convolution results are written to a text file through the $fopen/$fwrite/$fclose functions and compared with the software-simulated first-layer results to compute the error. If the error is within the specified range, the function verification passes.
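The final comparison might be scripted as follows (a sketch; the file names and the one-value-per-line format are assumptions about the $fwrite dump):

import numpy as np

def compare(sw_file="layer1_sw.txt", hw_file="layer1_hw.txt", tol=0):
    """Compare the software layer-1 result with the testbench dump,
    assuming one signed integer per line in both files."""
    sw = np.loadtxt(sw_file, dtype=np.int64)
    hw = np.loadtxt(hw_file, dtype=np.int64)
    err = np.abs(sw - hw)
    print(f"max error: {err.max()}, mismatches: {(err > tol).sum()}/{sw.size}")
    return err.max() <= tol                   # pass if within the error index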
Simulation results show that at a working frequency of 100 MHz the accelerator needs only 1.2 ms for one forward inference of TB-Net, which enables real-time DoA estimation in most DoA estimation application scenarios.