
CN115034372B - Hardware Acceleration Implementation of TB-Net for DoA Estimation - Google Patents

Hardware Acceleration Implementation of TB-Net for DoA Estimation

Info

Publication number
CN115034372B
CN115034372B (application CN202210576754.6A)
Authority
CN
China
Prior art keywords
convolution
data
characteristic value
net
calculation
Prior art date
Legal status
Active
Application number
CN202210576754.6A
Other languages
Chinese (zh)
Other versions
CN115034372A (en)
Inventor
陈赟
佘超然
林立宇
石启航
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University
Priority to CN202210576754.6A
Publication of CN115034372A
Application granted
Publication of CN115034372B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781On-chip cache; Off-chip memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/10Interfaces, programming languages or software development kits, e.g. for simulating neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Neurology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Complex Calculations (AREA)

Abstract


The present invention belongs to the field of wireless communication technology, and specifically provides a TB-Net hardware acceleration implementation method for DoA estimation. The method comprises the following steps: designing the architecture of the TB-Net accelerator for DoA estimation; designing the data flow and writing a script that can reorder the weights and biases of TB-Net; designing a custom simple instruction set and writing TB-Net as an instruction program; quantizing the weights and bias data of TB-Net into 16-bit signed integer data; performing circuit function verification and comparing the software and hardware results. The circuit structure includes an instruction decoding module, an instruction storage module, a global data cache module, a data transmission network, a PE array, and a global control module. The invention uses a weight-stationary scheme and a systolic array to increase the number of data reuses and significantly reduce the amount of data transfer, and can adapt to one-dimensional convolutions of various sizes, thereby realizing a low-power, high-speed TB-Net hardware accelerator for DoA.

Description

TB-Net hardware acceleration implementation method for DoA estimation
Technical Field
The invention relates to the field of artificial intelligence and the technical field of wireless communication, and provides a TB-Net hardware acceleration implementation method for DoA estimation.
Background
Traditional DoA estimation methods, such as multiple signal classification (MUSIC) and estimation of signal parameters via rotational invariance techniques (ESPRIT), rely on accurate signal models, and their DoA estimation accuracy degrades significantly when the model is imperfect.
In recent years, with the rapid development of deep learning, DoA estimation algorithms based on neural networks have been proposed. These methods are generally classified into two types, regression networks and classification networks. Owing to their data-driven nature, these methods are robust against model imperfections. Lin Liyu et al. of Fudan University, in the paper "TB-NET: A Two-Branch Neural Network for Direction of Arrival Estimation under Model Imperfections", combine the classification-network and regression-network structures widely used in deep-learning-based DoA estimation, run the two structures in parallel, and design a dual-branch neural network that combines classification and regression in parallel, denoted TB-Net (Two-Branch Neural Network); the classification branch is further optimized, improving the accuracy of the coarse DoA estimate produced by the classification branch. TB-Net keeps the network lightweight and the computation small while remaining more robust to model imperfections.
The TB-Net used for DoA still has considerable computational complexity; if a CPU is used for the computation, it takes a long time to complete and cannot meet the high real-time requirement of DoA estimation. Therefore, hardware such as a GPU, ASIC or FPGA is required to accelerate the computation and achieve real-time inference of the neural network. Although GPUs have strong parallel computing capability, their power consumption typically reaches about 100 W, which severely limits the application scenarios. An ASIC has the advantages of high energy efficiency and small area, but a relatively long development cycle and high cost. Although the hardware performance of an FPGA is inferior to that of an ASIC, its development cycle is shorter and its cost lower. FPGA or ASIC can be chosen according to the actual situation to realize the neural network accelerator.
Disclosure of Invention
The invention aims to provide a TB-Net hardware acceleration implementation method for DoA estimation, which improves the energy efficiency of the hardware realizing TB-Net, reduces the computation latency, and achieves real-time DoA estimation.
The invention provides a TB-Net (Two-Branch Neural Network) hardware acceleration implementation method for DoA (Direction of Arrival) estimation, which comprises the following specific steps:
Step 1, architecture design of the TB-Net accelerator for DoA estimation;
Step 2, designing a data stream, and writing a script capable of reordering the weights and the bias of the TB-Net;
Step 3, designing a custom simple instruction set, and writing the TB-Net into an instruction program;
Step 4, quantizing the weight and bias data of the TB-Net into 16-bit signed integer data;
Step 5, verifying the circuit function and comparing the software and hardware results.
The architecture design of the accelerator in step 1 comprises the following specific steps:
Step 1-1, designing a data path from a host end to an accelerator, storing weight and bias data into a DRAM through a PCI-E data path, and reading out a calculation result in the accelerator;
Step 1-2, designing the PE array of the convolution calculation module: a PE array of 8 × 8 PE units is set, wherein a single PE unit is used to process the single-channel convolution calculation of one convolution kernel, each row of PEs corresponds to the convolution calculation of different channels of the same convolution kernel, and each column of PEs corresponds to the convolution calculation of the same channel of different convolution kernels; the whole PE array can simultaneously perform the 8-channel convolution operation of 8 convolution kernels, and the calculation results of each row of PEs are added to obtain an eight-channel convolution result; for convolution layers whose number of convolution kernels and number of channels are both no greater than 8, the output feature value obtained by adding the bias to the convolution result is stored into the feature value cache; convolution layer calculations whose number of convolution kernels or number of channels exceeds 8 are processed in batches, with the number of convolution kernels and the number of channels in each batch not exceeding 8, and the batches are then processed in the same manner; for example, a convolution calculation with 16 convolution kernels and 16 channels must be divided into four blocks M0-7C0-7, M0-7C8-15, M8-15C0-7, M8-15C8-15, where M0-7C0-7 denotes the weight block of channels 0 to 7 of convolution kernels 0 to 7 and the other blocks are denoted analogously; for the first batch of channels (M0-7C0-7, M8-15C0-7), the bias is added to the convolution result to obtain the output feature value, which is stored into the feature value cache; for the non-first batches of channels (M0-7C8-15, M8-15C8-15), the convolution result is added to the result of the preceding channels of the same convolution kernel already stored in the feature value cache, and the sum is stored back into the feature value cache;
Step 1-3, designing the single PE unit in step 1-2; the single PE comprises a weight storage module with an address depth of 8, 8 multipliers and 8 accumulators, and supports one-dimensional convolution calculation with convolution kernel lengths of 1 to 8; the corresponding weight data in the DRAM is loaded into the PE and held fixed (weight-stationary); then, in each clock cycle, one feature value is loaded from the feature value cache into the PE and multiplied simultaneously by the weights fixed in the PE, the products are accumulated in a staggered manner with those of the previous clock cycle, and the output feature value of the corresponding convolution calculation is output at the corresponding moment;
Step 1-4, designing an on-chip cache for temporarily storing input feature values, bias data and intermediate results; for storing intermediate results, the feature value cache is divided into two equal parts, feature value cache 1 and feature value cache 2; for the calculation of one layer of the convolutional network, feature value cache 1 is used to store the input feature values of that layer's convolution calculation and feature value cache 2 is used to store its output feature values; for the next layer's convolution calculation, the output feature values of the previous layer stored in feature value cache 2 are the input feature values of that layer's convolution calculation, and feature value cache 1 is then used to store the output feature values; if a further layer of convolution calculation follows, the roles of feature value cache 1 and feature value cache 2 are exchanged again in the same way; feature value cache 1 and feature value cache 2 are each divided into 8 equal parts for storing the feature values of different channels;
Step 1-5, designing a control module, wherein the control module comprises a clock generator for controlling the convolution stride, a convolution Padding generator, flag signal generators for the data transfer and calculation process, and a series of address pointer generators.
In step 2, the data flow is designed and a script is written that can reorder the weights and biases of TB-Net; the weights and biases are stored in the DRAM according to the convolution calculation order of the PE array, so that during convolution calculation the accelerator only needs to read data from the DRAM following a sequential pointer; the convolution calculation order of the PE array is as follows:
Step 2-1, reading C×M×W×H weight data from the DRAM and storing them into the PE array, wherein C is the number of convolution kernel channels, M is the number of convolution kernels, W is the convolution kernel width and H is the convolution kernel height; the convolution kernel widths of the TB-Net used for DoA estimation are all 1, so the weight data can be simplified to C×M×H weight data;
Step 2-2, reading C×W input feature values from the feature value cache in each clock cycle, wherein C is the number of input feature value channels, the same as the number of convolution kernel channels, and W is the width of the input feature values; the input feature value width of the TB-Net used for DoA estimation is 1, so simply C input feature values are read in each clock cycle; the C input feature values are multiplied by the C×M×H weights stored in the PE array to obtain C×M×H intermediate results;
Step 2-3, in each clock cycle, the H intermediate results obtained in a single PE unit are accumulated in a staggered manner with the intermediate results obtained in the previous cycle; thus, after H clock cycles, the convolution result of the first convolution window of the convolution kernel mapped onto the feature values is obtained, and the convolution result of the second convolution window is obtained in the next clock cycle;
Step 2-4, the PE array can process at most the 8-channel operations of 8 convolution kernels simultaneously. When the number of convolution kernels of a TB-Net layer exceeds 8 or the number of channels exceeds 8, the layer needs to be processed in blocks. During block processing, the weight blocks with the same convolution kernels but different channels are calculated first; after all weight blocks with the same convolution kernels but different channels have been processed, the weight blocks of the next batch of convolution kernels are processed.
Step 3, designing a custom simple instruction set and writing TB-Net as an instruction program; the specific flow is as follows:
The custom simple instruction set comprises the four modes LOAD_B/LOAD_X/CONV_W/READ_Y; an instruction decoder and an instruction memory are designed according to the custom instruction set, and an instruction program is then written according to the algorithm using this instruction set, so that the algorithm can be realized on hardware; the instruction program is stored in the instruction memory;
LOAD_B: loads the bias data stored in the DRAM into the on-chip bias cache;
LOAD_X: loads the input feature value data into the on-chip feature value cache;
CONV_W: loads the weight data stored in the DRAM into the weight buffer modules in the PE array and performs the convolution calculation; the operands comprise the data amount, the number of block calculations, the convolution kernel length, the stride, the Padding, and whether ReLU is applied;
READ_Y: reads out the feature values in the on-chip feature value cache; the operand contains the data amount of the feature values to be read.
In step 4, the weights and bias data of TB-Net are quantized into 16-bit signed integer data; specifically, a symmetric fake-quantization (pseudo-quantization) scheme is adopted: fake quantization is performed before each weight input and after the output of the activation function, quantizing the floating-point number to a fixed-point number and then dequantizing it back to a floating-point number, and the forward operation is performed with the floating-point values carrying the error generated in this process; the fake-quantization operation makes the distributions of the weights and activation values more uniform with smaller variance, causes less accuracy loss than direct post-training quantization, and keeps the output of each layer within a certain range, which helps with overflow handling.
Step 5, verifying the circuit function: the quantized data are used to run the forward inference of the neural network in software and in hardware respectively, and the output results are compared for consistency.
The invention uses a weight-stationary scheme and a systolic array to increase the number of data reuses and significantly reduce the amount of data transfer, and can adapt to one-dimensional convolutions of various sizes, thereby realizing a low-power, high-speed hardware accelerator for the TB-Net used for DoA.
Drawings
Fig. 1 is a network architecture of TB-Net for DoA estimation.
FIG. 2 is the architecture of the TB-Net accelerator for DoA estimation.
Fig. 3 shows the working principle of the PE unit.
Fig. 4 is a diagram of a PE unit.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
The invention relates to a TB-Net hardware acceleration implementation method for DoA estimation. A VCS+Verdi joint simulation platform was used. The input is 120×2 data values of 16 bits each. The specific network structure is shown in Fig. 1.
TABLE 1 network structure of TB-Net
The invention mainly comprises the following steps:
Step 1, architecture design of the TB-Net accelerator for DoA estimation; the architecture diagram is shown in Fig. 2.
Step 1-1, designing a data path from a host end to an accelerator, storing weight and bias data into a DRAM through a PCI-E data path, and reading out a calculation result in the accelerator.
Step 1-2, designing the PE array of the convolution calculation module: a PE array with 8 × 8 PE units is set, wherein a single PE unit can process the single-channel convolution calculation of one convolution kernel, each row of PEs corresponds to the convolution calculation of different channels of the same convolution kernel, and each column of PEs corresponds to the convolution calculation of the same channel of different convolution kernels; the whole PE array can simultaneously perform the 8-channel convolution calculation of 8 convolution kernels, and the calculation results of each row of PEs are added to obtain the eight-channel convolution result;
for example, for a convolution calculation with 16 convolution kernels and 16 channels, it is necessary to divide the convolution calculation into four blocks
M0-7C0-7,M0-7C8-15,M8-15C0-7,M8-15C8-15;
wherein M0-7C0-7 represents the weight block of channels 0 to 7 of convolution kernels 0 to 7, and the other blocks are denoted analogously. For the first batch of channels (M0-7C0-7, M8-15C0-7), the bias is added to the obtained convolution result to produce the output feature value, which is stored into the feature value cache; for the non-first batches of channels (M0-7C8-15, M8-15C8-15), the obtained convolution result is added to the convolution result of the preceding channels of the same convolution kernel already stored in the feature value cache, and the sum is stored back into the feature value cache.
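To make the blocking and accumulation rule above concrete, the following is a minimal functional sketch (Python/NumPy, not code from the patent; the helper names and the direct convolution routine are illustrative assumptions) that computes a 16-kernel, 16-channel one-dimensional convolution as four 8 × 8 blocks in the stated batch order, adding the bias on the first channel batch and accumulating later channel batches onto the stored result:

```python
import numpy as np

def conv1d_block(x, w, padding, stride):
    """Direct 1-D convolution of x (C, L) with w (M, C, H) -> (M, L_out)."""
    M, C, H = w.shape
    xp = np.pad(x, ((0, 0), (padding, padding)))
    L_out = (xp.shape[1] - H) // stride + 1
    out = np.zeros((M, L_out))
    for m in range(M):
        for i in range(L_out):
            out[m, i] = np.sum(w[m] * xp[:, i * stride:i * stride + H])
    return out

def conv1d_batched(x, w, bias, padding=1, stride=1, tile=8):
    """Block-wise convolution: kernel blocks outer, channel blocks inner."""
    M, C, H = w.shape
    L_out = (x.shape[1] + 2 * padding - H) // stride + 1
    cache = np.zeros((M, L_out))                   # plays the role of the feature value cache
    for m0 in range(0, M, tile):                   # M0-7, then M8-15
        for k, c0 in enumerate(range(0, C, tile)): # C0-7, then C8-15
            part = conv1d_block(x[c0:c0 + tile], w[m0:m0 + tile, c0:c0 + tile], padding, stride)
            if k == 0:                             # first channel batch: add the bias
                cache[m0:m0 + tile] = part + bias[m0:m0 + tile, None]
            else:                                  # later channel batches: accumulate onto the cache
                cache[m0:m0 + tile] += part
    return cache

x = np.random.randn(16, 120)                       # 16 channels, 120 feature values
w = np.random.randn(16, 16, 3)                     # 16 kernels, 16 channels, kernel length 3
b = np.random.randn(16)
reference = conv1d_block(x, w, 1, 1) + b[:, None]
assert np.allclose(conv1d_batched(x, w, b), reference)
```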
Step 1-3, designing the single PE unit in step 1-2. The working principle of a single PE unit is shown in Fig. 3. A single PE comprises a weight storage module with an address depth of 8, 8 multipliers and 8 accumulators, and supports one-dimensional convolution calculation with convolution kernel lengths of 1 to 8; the structure of a single PE unit is shown in Fig. 4. The corresponding weight data in the DRAM is loaded into the PE and held fixed; then, in each clock cycle, one feature value is loaded from the feature value cache into the PE and multiplied simultaneously by the weights fixed in the PE, the products are accumulated in a staggered manner with those of the previous clock cycle, and the output feature value of the corresponding convolution calculation is output at the corresponding moment.
For example, in Fig. 3, during clock cycles T0-T4 the weights of a single-channel 1×5 convolution kernel are loaded into the PE unit:
[W0,W1,W2,W3,W4],
which are to be convolved with the feature values of the corresponding channel:
[X0, X1, X2, ..., X59]
In clock cycle T5, the input X0 of the PE unit is multiplied by [W0, W1, W2, W3, W4], giving:
[W0X0,W1X0,W2X0,W3X0,W4X0];
In clock cycle T6, the input X1 of the PE unit is multiplied by [W0, W1, W2, W3, W4], giving:
[W0X1,W1X1,W2X1,W3X1,W4X1];
which is accumulated, staggered by one position, with the result of the previous clock cycle:
[W0X0,W1X0,W2X0,W3X0,W4X0];
to obtain:
[W0X0+W1X1,W1X0+W2X1,W2X0+W3X1,W3X0+W4X1,W4X0];
Similarly, in each clock cycle an input feature value is multiplied by the weights stored in the PE, and the resulting products are accumulated in a staggered manner with the products of the previous clock cycle. Because padding = 2, the first convolution result is obtained in clock cycle T7:
Y0=W2X0+W3X1+W4X2;
When the convolution stride is 1 (stride = 1), the second convolution result is obtained in clock cycle T8:
Y1=W1X0+W2X1+W3X2+W4X3;
When the convolution stride is 2 (stride = 2), the second convolution result is obtained in clock cycle T9:
Y2=W0X0+W1X1+W2X2+W3X3+W4X4;
Since the number of input feature values is 60 and padding = 2, reading of the input feature values stops and the products are set to 0 in clock cycles T65 and T66, yielding the final convolution results Y58 and Y59.
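As a cross-check of the worked example, the following is a small functional model (Python/NumPy, an illustrative sketch rather than the patent's register-level design; the function name and interface are assumptions) of the weight-stationary one-dimensional convolution performed by a single PE: the kernel stays fixed, one feature value streams in per cycle, is multiplied by all H weights, and each product is scattered into the partial sum of the output window it belongs to:

```python
import numpy as np

def pe_conv1d(x, w, padding=2, stride=1):
    """Weight-stationary 1-D convolution of one channel by one PE (functional model)."""
    H = len(w)
    n_out = len(x) + 2 * padding - H + 1           # stride-1 output count (60 in the example)
    acc = np.zeros(n_out)                          # partial sums of the convolution windows
    for t, xt in enumerate(x):                     # one input feature value per "clock cycle"
        for h in range(H):                         # multiplied by all fixed weights at once
            i = t + padding - h                    # output window this product contributes to
            if 0 <= i < n_out:
                acc[i] += w[h] * xt
    return acc[::stride]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [0.1, 0.2, 0.3, 0.4, 0.5]                      # W0 .. W4
y = pe_conv1d(x, w, padding=2, stride=1)
print(y[0], w[2] * x[0] + w[3] * x[1] + w[4] * x[2])   # Y0 = W2X0 + W3X1 + W4X2
```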
Step 1-4, designing an on-chip cache for temporarily storing the input feature values, bias data and intermediate results. For storing intermediate results, the feature value cache is divided into two major parts, feature value cache 1 (SRAM_0 Feature in Fig. 2) and feature value cache 2 (SRAM_1 Feature in Fig. 2). For the calculation of one layer of the convolutional neural network, feature value cache 1 is used to store the input feature values of that layer's convolution calculation, and feature value cache 2 is used to store its output feature values. For the convolution calculation of the next layer, the output feature values of the previous layer stored in feature value cache 2 are the input feature values of that layer, and feature value cache 1 is then used to store the output feature values. If there is a further layer of convolution calculation, the roles of feature value cache 1 and feature value cache 2 are exchanged again by analogy. Feature value cache 1 and feature value cache 2 are each divided into 8 equal parts for storing the feature values of different channels. The data output end of each part of the cache corresponds to the input feature value end of one line of the PE array in step 1-2, and the data input end of each part of the cache corresponds to the output feature value end of one line of the PE array in step 1-2.
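The alternating roles of the two feature value caches can be summarized by a small sketch (illustrative only; the class and method names are assumptions, not the patent's interfaces):

```python
class PingPongFeatureCache:
    """Two buffers whose input/output roles swap after every convolution layer."""
    def __init__(self):
        self.bufs = [[], []]          # feature value cache 1 / feature value cache 2
        self.in_idx = 0               # index of the buffer holding the current layer input

    def run_layer(self, layer_fn):
        src = self.bufs[self.in_idx]
        dst = self.bufs[1 - self.in_idx]
        dst.clear()
        dst.extend(layer_fn(src))     # this layer's output features go to the other buffer
        self.in_idx = 1 - self.in_idx # the output buffer becomes the next layer's input
```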
Step 1-5, designing a control module, wherein the control module comprises a clock generator for controlling the convolution stride, a convolution Padding generator, flag signal generators for the data transfer and calculation process, and a series of address pointer generators.
Step 2, designing the data flow and writing a script that can reorder the weights and biases of TB-Net.
Step 2-1, reading the C×M×W×H weight data from the DRAM and storing them in the PE array, where C is the number of convolution kernel channels, M is the number of convolution kernels, W is the convolution kernel width, and H is the convolution kernel height. The convolution kernel widths of the TB-Net used for DoA estimation are all 1, so this can be reduced to C×M×H weight data.
Step 2-2, reading C×W input feature values from the feature value cache every clock cycle, where C is the number of channels of the input feature values, the same as the number of channels of the convolution kernels, and W is the width of the input feature values. The input feature values of the TB-Net used for DoA estimation all have width 1, so C input feature values are read each clock cycle. The C input feature values are multiplied by the C×M×H weights stored in the PE array to obtain C×M×H intermediate results. For a single PE, one input feature value is multiplied by the H weights stored in it to obtain H intermediate results.
Step 2-3, in each clock cycle, the H intermediate results obtained in a single PE unit are accumulated in a staggered manner with the intermediate results of the previous cycle, so that after H clock cycles the convolution result of the first convolution window of the kernel mapped onto the feature values is obtained, and the convolution result of the second convolution window is obtained in the next clock cycle. Similarly, if the input feature value size of one convolution layer of TB-Net is C×L, where L is the feature value length, namely the number of feature values of a single channel, the calculation of all convolution windows is completed after L clock cycles.
Step 2-4, the PE array can process at most the 8-channel operations of 8 convolution kernels simultaneously. When the number of convolution kernels of a TB-Net layer exceeds 8 or the number of channels exceeds 8, the layer needs to be processed in blocks. During block processing, the weight blocks with the same convolution kernels but different channels are calculated first; after all weight blocks with the same convolution kernels but different channels have been processed, the weight blocks of the next batch of convolution kernels are processed.
For example, for a convolution calculation with 16 convolution kernels and 16 channels, it is necessary to divide the convolution calculation into four blocks
M0-7C0-7,M0-7C8-15,M8-15C0-7,M8-15C8-15
The batch processing sequence is
M0-7C0-7→M0-7C8-15→M8-15C0-7→M8-15C8-15
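The reordering script of step 2 can be sketched as follows (illustrative only; the exact DRAM layout and the within-block ordering are not disclosed in the patent, so the flattening order and file name below are assumptions). It writes the weight blocks to a DRAM image in the batch order just listed, so the accelerator can fetch them with a purely sequential pointer:

```python
import numpy as np

def reorder_weights(w, tile=8):
    """w: (M, C, H) quantized weights -> 1-D stream of blocks in the order
    M0-7C0-7, M0-7C8-15, M8-15C0-7, M8-15C8-15 (kernel blocks outer, channel blocks inner)."""
    M, C, H = w.shape
    stream = []
    for m0 in range(0, M, tile):
        for c0 in range(0, C, tile):
            blk = w[m0:m0 + tile, c0:c0 + tile, :]   # one at-most-8x8 weight block
            stream.append(blk.reshape(-1))           # assumed (kernel, channel, tap) order
    return np.concatenate(stream)

w = (100 * np.random.randn(16, 16, 3)).astype(np.int16)  # already quantized to 16-bit
reorder_weights(w).tofile("weights.bin")                  # assumed DRAM image file
```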
Step 3, a custom simple instruction set comprising the four modes LOAD_B/LOAD_X/CONV_W/READ_Y is designed, and an instruction decoder and an instruction memory are designed according to the custom instruction set. An instruction program is then written according to the algorithm using this instruction set, enabling the algorithm to be implemented on hardware, and the instruction program is stored in the instruction memory.
LOAD_B: loads the bias data stored in the DRAM into the on-chip bias cache. The operand contains the amount of data to load;
LOAD_X: loads the input feature value data into the on-chip feature value cache. The operand contains the amount of data to load;
CONV_W: loads the weight data stored in the DRAM into the weight buffer modules in the PE array and performs the convolution calculation. The operands contain the data amount, the number of block calculations, the convolution kernel length, the stride, the Padding, and whether ReLU is applied.
READ_Y: reads out the feature values in the on-chip feature value cache. The operand contains the amount of feature value data to read.
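The patent does not disclose the binary encoding of these instructions, so the following assembler sketch is purely illustrative: the opcode values, word width and operand packing are assumptions, used only to show how a TB-Net layer could be written as an instruction program and turned into machine code words:

```python
OPCODES = {"LOAD_B": 0b00, "LOAD_X": 0b01, "CONV_W": 0b10, "READ_Y": 0b11}

def assemble(mnemonic, operand=0):
    """Pack one instruction into a hypothetical 32-bit word: [2-bit opcode | 30-bit operand]."""
    return (OPCODES[mnemonic] << 30) | (operand & ((1 << 30) - 1))

program = [
    assemble("LOAD_B", 64),      # load the bias words into the on-chip bias cache
    assemble("LOAD_X", 240),     # load the 120x2 input feature values
    assemble("CONV_W", 0),       # operand fields of the first convolution layer
    assemble("READ_Y", 120),     # read back the first-layer results
]
print([f"{word:08x}" for word in program])
```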
The quantization in step 4 adopts a symmetric fake-quantization (pseudo-quantization) scheme: fake quantization is performed before each weight input and after the output of the activation function, quantizing the floating-point number to a fixed-point number and then dequantizing it back to a floating-point number, and the forward operation is performed with the floating-point values carrying the error generated in this process. The fake-quantization operation makes the distributions of the weights and activation values more uniform with smaller variance, causes less accuracy loss than direct post-training quantization, and keeps the output of each layer within a certain range, which helps with overflow handling.
Pseudo quantization scheme:
The design uses signed-number calculation, so the quantization scheme is symmetric quantization, as follows:
For the weights:
round(x) = clamp(-2^(bit-1), 2^(bit-1) - 1, x)
s is a quantization factor, r is a value before quantization, and q is a value after quantization.
For an activation value:
The maximum activation value of each layer is obtained dynamically using an exponential moving average (EMA):
movingmax = movingmax * momenta + max(abs(r)) * (1 - momenta)
round(x) = clamp(-2^(bit-1), 2^(bit-1) - 1, x)
where s is the quantization factor, r is the value before quantization, q is the value after quantization, and momenta is the EMA momentum (decay) coefficient.
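A minimal sketch of this quantization scheme in Python/NumPy follows (illustrative, not the patent's code; in particular the formula for the quantization factor s, taken as the tracked maximum divided by 2^(bit-1) - 1, is an assumption consistent with the symmetric scheme described above):

```python
import numpy as np

BITS = 16
QMAX = 2 ** (BITS - 1) - 1                      # 32767 for 16-bit signed integers

def fake_quant(r, max_abs):
    """Quantize to the signed integer grid, clamp, then dequantize (carries the error)."""
    if max_abs == 0:
        return r
    s = max_abs / QMAX                          # assumed quantization factor
    q = np.clip(np.round(r / s), -(QMAX + 1), QMAX)
    return q * s

def fake_quant_weights(w):
    return fake_quant(w, np.max(np.abs(w)))

class ActivationQuant:
    """EMA of the activation maximum: movingmax = movingmax*momenta + max|r|*(1-momenta)."""
    def __init__(self, momenta=0.99):
        self.momenta, self.movingmax = momenta, 0.0

    def __call__(self, r):
        cur = float(np.max(np.abs(r)))
        self.movingmax = cur if self.movingmax == 0.0 else \
            self.movingmax * self.momenta + cur * (1 - self.momenta)
        return fake_quant(r, self.movingmax)
```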
The BN (batch normalization) layer is folded. Although batch normalization is a separate module during training, in the actual inference process the batch normalization parameters are "folded" into the weights and biases of the convolution or fully connected layer to improve efficiency. Folding the parameters after software training provides an acceleration on hardware.
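The folding itself is the standard batch-norm merge; a short sketch (illustrative, with assumed parameter names) is:

```python
import numpy as np

def fold_bn(conv_w, conv_b, gamma, beta, mu, var, eps=1e-5):
    """Fold BN(y) = gamma*(y - mu)/sqrt(var + eps) + beta into the preceding conv layer.
    conv_w: (M, C, H) weights, conv_b: (M,) bias, BN parameters: (M,) each."""
    scale = gamma / np.sqrt(var + eps)           # per-output-channel scale
    w_folded = conv_w * scale[:, None, None]     # folded weights
    b_folded = (conv_b - mu) * scale + beta      # folded bias
    return w_folded, b_folded
```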
Step 5, the circuit function verification step uses the quantized data to run the forward inference of the neural network on a PyTorch platform and on the VCS+Verdi platform respectively, and compares whether the software and hardware output results are consistent.
Take the comparison of the software and hardware results of the first convolution layer of TB-Net as an example.
First, a software simulation is performed with the quantized data, and the first-layer convolution result is recorded for subsequent comparison with the hardware result.
Then writing a program by using a custom instruction set:
LOAD_B // load the biases required by TB-Net into the on-chip cache;
LOAD_X // load the input feature values of TB-Net into the on-chip cache;
CONV_W // load the first-layer weights of TB-Net into the PE array and perform the first-layer convolution calculation;
READ_Y // read out the first-layer convolution calculation result;
Because of the flexibility in the design of custom instruction sets, the operands following each instruction type are not listed here. The assembled instruction program is binary machine code.
The instruction program, several groups of input feature values, the quantized weights and biases, and the software-simulated first-layer convolution result are stored into an off-chip ROM (used only for simulation) through the $readmem function. After the stimulus is applied, the instruction program is stored into the on-chip instruction memory, and the accelerator then runs the program, starting the hardware simulation. During READ_Y, the first-layer convolution result is written to a text file through the $fopen/$fwrite/$fclose functions and compared with the first-layer convolution result of the software simulation to calculate the error. If the error is within the specified range, the function verification passes.
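The final comparison can be sketched as follows (illustrative; the file names, file format and tolerance are assumptions, not specified by the patent):

```python
import numpy as np

def compare_results(sw_file="conv1_sw.txt", hw_file="conv1_hw.txt", tol=0):
    """Compare the software reference dump with the values written by $fwrite."""
    sw = np.loadtxt(sw_file, dtype=np.int64)
    hw = np.loadtxt(hw_file, dtype=np.int64)
    err = np.abs(sw - hw)
    print(f"max error = {err.max()}, mismatches = {int((err > tol).sum())} / {err.size}")
    return bool((err <= tol).all())
```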
Simulation results show that, at a working frequency of 100 MHz, the accelerator needs only 1.2 ms for one forward inference of TB-Net, which allows real-time DoA estimation in most DoA estimation application scenarios.

Claims (2)

1. A TB-Net hardware acceleration implementation method for DoA estimation is characterized by comprising the following specific steps:
Step 1, architecture design of the TB-Net accelerator for DoA estimation;
Step 2, designing a data stream, and writing a script capable of reordering the weights and the bias of the TB-Net;
Step 3, designing a custom simple instruction set, and writing the TB-Net into an instruction program;
Step 4, quantizing the weight and bias data of the TB-Net into 16-bit signed integer data;
Step 5, verifying the circuit function and comparing the software and hardware results;
The architecture design of the accelerator in step 1 comprises the following specific steps:
Step 1-1, designing a data path from a host end to an accelerator, storing weight and bias data into a DRAM through a PCI-E data path, and reading out a calculation result in the accelerator;
step 1-2, designing a convolution calculation module PE array;
Setting a PE array with 8 × 8 PE units, wherein a single PE unit is used to process the single-channel convolution calculation of one convolution kernel, each row of PEs corresponds to the convolution calculation of different channels of the same convolution kernel, and each column of PEs corresponds to the convolution calculation of the same channel of different convolution kernels; the whole PE array can simultaneously perform the 8-channel convolution calculation of 8 convolution kernels, and the calculation results of each row of PEs are added to obtain the eight-channel convolution result; for convolution layers whose number of convolution kernels and number of channels are both no greater than 8, the output feature value obtained by adding the bias to the calculated convolution result is stored into the feature value cache; convolution layer calculations whose number of convolution kernels or number of channels exceeds 8 are processed in batches, with the number of convolution kernels and the number of channels of each batch not exceeding 8, and the batches are then processed in the same manner;
Step 1-3, designing a single PE unit in step 1-2;
the single PE comprises a weight storage module with an address depth of 8, 8 multipliers and 8 accumulators, and supports one-dimensional convolution calculation with convolution kernel lengths of 1 to 8; the corresponding weight data in the DRAM is loaded into the PE and held fixed; then, in each clock cycle, one feature value is loaded from the feature value cache into the PE and multiplied simultaneously by the weights fixed in the PE, the products are accumulated in a staggered manner with those of the previous clock cycle, and the output feature value of the corresponding convolution calculation is output at the corresponding moment;
Step 1-4, designing an on-chip cache for temporarily storing input feature values, bias data and intermediate results;
For storing the intermediate results, the feature value cache is divided into two parts, namely a first feature value cache and a second feature value cache; for the calculation of one layer of the convolutional neural network, the first feature value cache is used to store the input feature values of that layer's convolution calculation, and the second feature value cache is used to store the output feature values of that layer's convolution calculation; for the convolution calculation of the next layer, the output feature values of the previous layer stored in the second feature value cache are the input feature values of that layer's convolution calculation, and the first feature value cache is then used to store the output feature values; if a further layer of convolution calculation follows, the roles of the first feature value cache and the second feature value cache are exchanged again by analogy; the first feature value cache and the second feature value cache are each divided into 8 equal parts for storing the feature values of different channels; the data output end of each part of the cache corresponds to the input feature value end of one line of the PE array in step 1-2, and the data input end of each part of the cache corresponds to the output feature value end of one line of the PE array in step 1-2;
Step 1-5, designing a control module, wherein the control module comprises a clock generator for controlling the convolution stride, a convolution Padding generator, flag signal generators for the data transmission and calculation process, and a series of address pointer generators;
In step 2, the data flow is designed and a script is written that can reorder the weights and biases of TB-Net; the weights and biases are stored in the DRAM according to the convolution calculation order of the PE array, and during convolution calculation the accelerator reads data from the DRAM following a sequential pointer; the convolution calculation order of the PE array is as follows:
Step 2-1, reading C×M×W×H weight data from the DRAM and storing them into the PE array, wherein C is the number of convolution kernel channels, M is the number of convolution kernels, W is the convolution kernel width and H is the convolution kernel height; the convolution kernel widths of the TB-Net used for DoA estimation are all 1, so the weight data are simplified into C×M×H weight data;
Step 2-2, reading C×W input feature values from the feature value cache in each clock cycle, wherein C is the number of input feature value channels, the same as the number of convolution kernel channels, and W is the width of the input feature values; the input feature value width of the TB-Net used for DoA estimation is 1, so C input feature values are simply read in each clock cycle; the C input feature values are multiplied by the C×M×H weights stored in the PE array to obtain C×M×H intermediate results;
Step 2-3, in each clock cycle, the H intermediate results obtained in a single PE unit are accumulated in a staggered manner with the intermediate results obtained in the previous cycle, so that after H clock cycles the convolution result of the first convolution window of the convolution kernel mapped onto the feature values is obtained, and the convolution result of the second convolution window is obtained in the next clock cycle; similarly, if the input feature value size of one convolution layer of TB-Net is C×L, where L is the feature value length, namely the number of feature values of a single channel, the calculation of all convolution windows is completed after L clock cycles;
Step 2-4, the PE array processes at most the 8-channel operations of 8 convolution kernels simultaneously; when the number of convolution kernels of a TB-Net layer exceeds 8 or the number of channels exceeds 8, the layer is processed in blocks; during block processing, the weight blocks with the same convolution kernels but different channels are calculated first, and after all weight blocks with the same convolution kernels but different channels have been processed, the weight blocks of the next batch of convolution kernels are processed;
step 3, designing a custom simple instruction set, and writing a TB-Net into an instruction program, wherein:
The custom simple instruction set comprises the four modes LOAD_B, LOAD_X, CONV_W, READ_Y; an instruction decoder and an instruction memory are designed according to the custom instruction set, and an instruction program is then written according to the algorithm using this instruction set, so that the algorithm can be realized on hardware, and the instruction program is stored in the instruction memory, wherein:
LOAD_B, loading the bias data stored in the DRAM into an on-chip bias cache;
LOAD_X, loading the input feature value data into an on-chip feature value cache;
CONV_W, loading the weight data stored in the DRAM into the weight buffer modules in the PE array and performing the convolution calculation, wherein the operands comprise the data amount, the number of block calculations, the convolution kernel length, the stride, the Padding and whether ReLU is applied;
READ_Y, reading out the feature values in the on-chip feature value cache;
In step 4, the weights and bias data of TB-Net are quantized into 16-bit signed integer data; specifically, a symmetric fake-quantization (pseudo-quantization) scheme is adopted, in which fake quantization is performed before each weight input and after the output of the activation function, the floating-point number is quantized to a fixed-point number and then dequantized back to a floating-point number, and the forward operation is performed with the floating-point values carrying the error generated in this process;
the symmetrical pseudo-quantization scheme is specifically as follows:
For the weights:
round(x) = clamp(-2^(bit-1), 2^(bit-1) - 1, x)
s is a quantization factor, r is a value before quantization, and q is a value after quantization;
for an activation value:
The maximum activation value of each layer is obtained dynamically using an exponential moving average (EMA):
movingmax = movingmax * momenta + max(abs(r)) * (1 - momenta)
round(x) = clamp(-2^(bit-1), 2^(bit-1) - 1, x)
where s is the quantization factor, r is the value before quantization, q is the value after quantization, and momenta is the EMA momentum (decay) coefficient;
folding the BN (batch normalization) layer: the batch normalization parameters are folded into the weights and biases of the convolution or fully connected layer to improve efficiency.
2. The method according to claim 1, wherein the circuit function verification in step 5 includes using quantized data to implement forward reasoning of the neural network with software and hardware, respectively, and comparing whether the output results are consistent.
CN202210576754.6A 2022-05-25 2022-05-25 Hardware Acceleration Implementation of TB-Net for DoA Estimation Active CN115034372B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210576754.6A CN115034372B (en) 2022-05-25 2022-05-25 Hardware Acceleration Implementation of TB-Net for DoA Estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210576754.6A CN115034372B (en) 2022-05-25 2022-05-25 Hardware Acceleration Implementation of TB-Net for DoA Estimation

Publications (2)

Publication Number Publication Date
CN115034372A CN115034372A (en) 2022-09-09
CN115034372B true CN115034372B (en) 2025-02-18

Family

ID=83120949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210576754.6A Active CN115034372B (en) 2022-05-25 2022-05-25 Hardware Acceleration Implementation of TB-Net for DoA Estimation

Country Status (1)

Country Link
CN (1) CN115034372B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116720585B (en) * 2023-08-11 2023-12-29 福建亿榕信息技术有限公司 Low-power-consumption AI model reasoning optimization method based on autonomous controllable software and hardware platform
CN119512498A (en) * 2023-08-24 2025-02-25 华为技术有限公司 Processor, floating point unit and operation method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A Universal Convolutional Neural Network Accelerator Based on One-Dimensional Systolic Array
CN110390383A (en) * 2019-06-25 2019-10-29 东南大学 A Deep Neural Network Hardware Accelerator Based on Exponential Quantization

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A Universal Convolutional Neural Network Accelerator Based on One-Dimensional Systolic Array
CN110390383A (en) * 2019-06-25 2019-10-29 东南大学 A Deep Neural Network Hardware Accelerator Based on Exponential Quantization

Also Published As

Publication number Publication date
CN115034372A (en) 2022-09-09


Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant