
CN114186679A - Convolutional neural network accelerator based on FPGA and optimization method thereof - Google Patents

Convolutional neural network accelerator based on FPGA and optimization method thereof

Info

Publication number: CN114186679A
Application number: CN202111543413.0A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: unit, convolution, neural network, image, convolutional neural
Legal status: Pending
Inventors: 李甫, 李旭超, 付博勋
Current and original assignee: Chongqing Institute Of Integrated Circuit Innovation Xi'an University Of Electronic Science And Technology
Application filed by Chongqing Institute Of Integrated Circuit Innovation Xi'an University Of Electronic Science And Technology
Priority to CN202111543413.0A
Publication of CN114186679A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/046 Forward inferencing; Production systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The application relates to the technical field of neural network computing, and particularly provides an FPGA-based convolutional neural network accelerator and an optimization method thereof. The FPGA-based convolutional neural network accelerator comprises a program instruction storage unit, a program instruction decoding unit, a data control unit, a data buffer unit, a parameter buffer unit, a parallel processing unit, an image buffer unit, an image splicing unit, an off-chip memory control unit and an off-chip storage unit. The optimization method of the FPGA-based convolutional neural network accelerator comprises the following steps: step one, obtaining the number of clock cycles used in forward inference by forward inference instructions CN_n of different operation types; step two, constructing the number of clock cycles used by the convolutional neural network for one forward inference; step three, constructing a hardware resource constraint expression; step four, constructing and solving a constrained optimization function F'; and step five, setting the numbers of 3×3 convolution units, 1×1 convolution units and pooling units according to the optimal solution of the optimization function F'.

Description

Convolutional neural network accelerator based on FPGA and optimization method thereof
Technical Field
The application relates to the technical field of neural network computing, in particular to a convolutional neural network accelerator based on an FPGA and an optimization method thereof.
Background
With the continuous development of neural networks, network complexity keeps increasing and the demand on hardware computing power grows accordingly; the CPU, as a general-purpose processor, can no longer meet the requirements of practical applications. For this reason, the deployment of neural networks on hardware platforms has become a research hotspot. Existing deep learning frameworks such as PyTorch, Caffe and TensorFlow rely on the GPU for network acceleration. Although the GPU has strong computing power, it suffers from high price, large power consumption, high noise and large volume. The FPGA, with its advantages of low power consumption, customizability, reconfigurability, small volume and high parallelism, has become a research hotspot for hardware deployment of neural networks.
The customizable and reconfigurable characteristics of the FPGA allow developers to flexibly deploy classical convolutional neural network models such as SSD, YOLOv3 and VGG16 on the FPGA according to application requirements. The acceleration of a convolutional neural network model on the FPGA mainly depends on the number of parallel computing units: the more parallel computing units, the higher the algorithm parallelism and the higher the computing speed. In practice, the parallel computing units usually include 3×3 convolution units, pooling units, 1×1 convolution units and activation units, and under limited FPGA resources, increasing the number of one type of parallel computing unit reduces the number that can be deployed of another type. In typical practical deployments the numbers of the various parallel computing units are not matched; for example, many 3×3 convolution units but few 1×1 convolution units may be deployed, so that for a specific convolutional neural network model, such as the SSD model, forward inference is limited by the number of 1×1 convolution units and the forward inference speed of the model decreases.
The patent with publication number CN109086867A, entitled "A convolutional neural network acceleration system based on FPGA", discloses an FPGA-based convolutional neural network acceleration system comprising a data preprocessing module, a data post-processing module, a data storage module, a network model configuration module and a convolutional neural network calculation module, where the convolutional neural network calculation module comprises a convolution calculation submodule, an activation function calculation submodule, a pooling calculation submodule and a fully-connected calculation submodule. The disadvantage of this acceleration system is that its design does not consider the drop in forward inference speed of a specific convolutional neural network model caused by a mismatch in the numbers of the various convolutional neural network calculation submodules.
The patent with publication number CN112766479A, entitled "A neural network accelerator supporting channel-separable convolution based on FPGA", discloses an accelerator whose functional unit module implements pooling, activation, batch normalization and other functions. The disadvantage of this accelerator is that its functional unit module implements equal numbers of parallel computing units for the various functions; it cannot flexibly configure the number of parallel computing units of each function and does not consider the drop in forward inference speed of a specific convolutional neural network model caused by a mismatch in the numbers of parallel computing units of the various functions.
Thanks to the reconfigurability of an FPGA-based convolutional neural network accelerator, the numbers of the various parallel computing units can be reconfigured for a specific convolutional neural network model, such as the SSD model, so that the model achieves a higher forward inference speed. However, existing accelerators and optimization methods do not address the decrease in forward inference speed caused by mismatched numbers of the various parallel computing units.
Disclosure of Invention
The present invention aims to provide an FPGA-based convolutional neural network accelerator and an optimization method thereof, so as to solve the problem in the prior art that the forward inference speed of a convolutional neural network model decreases because the numbers of the various parallel computing units are not matched.
In order to achieve this purpose, the technical solution adopted by the invention is as follows:
the application provides a convolutional neural network accelerator based on an FPGA and an optimization method thereof. The convolutional neural network accelerator based on the FPGA comprises a program instruction storage unit, a program instruction decoding unit, a data control unit, a data buffer unit, a parameter buffer unit, an off-chip storage unit, a parallel processing unit, an image buffer unit, an image splicing unit and an off-chip storage unit which are realized on a dynamic random access memory through the FPGA.
Further, the program instruction storage unit includes a ROM in which the forward inference instruction CN_n of convolutional neural network layer L_n is stored.
Further, the forward inference instruction CN_n is formed by sequentially concatenating the binary data corresponding to the operation type, convolution stride, number of convolution kernels, number of input feature map channels, input feature map width, input feature map start address, output feature map width, neural network parameter start address, output image block start address, convolution bias start address and padding flag of convolutional neural network layer L_n.
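For illustration, the following sketch packs example field values into a single instruction word by concatenating fixed-width binary fields in the order listed above; the field names and bit widths are assumptions made for the example, since the patent does not state them.

```python
# Illustrative sketch only: field names and bit widths are assumed, not taken
# from the patent. It shows how CN_n could be built by concatenating the
# binary data of each field in the stated order.
FIELDS = [                      # (name, assumed bit width)
    ("op_type", 2), ("stride", 2), ("num_kernels", 12),
    ("in_channels", 12), ("in_width", 12), ("in_addr", 32),
    ("out_width", 12), ("param_addr", 32), ("out_block_addr", 32),
    ("bias_addr", 32), ("pad_flag", 1),
]

def encode_instruction(values: dict) -> int:
    """Concatenate the fields of one forward inference instruction CN_n."""
    word = 0
    for name, width in FIELDS:
        v = values[name]
        assert 0 <= v < (1 << width), f"{name} does not fit in {width} bits"
        word = (word << width) | v          # append the field as binary data
    return word

cn_n = encode_instruction({
    "op_type": 0, "stride": 1, "num_kernels": 64, "in_channels": 3,
    "in_width": 300, "in_addr": 0x0000_0000, "out_width": 300,
    "param_addr": 0x0010_0000, "out_block_addr": 0x0020_0000,
    "bias_addr": 0x0030_0000, "pad_flag": 1,
})
print(hex(cn_n))
```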
Furthermore, the program instruction decoding unit is sequentially cascaded between the instruction storage unit and the data control unit.
Furthermore, the data control unit is connected with the parallel processing unit, the image buffer unit, the data buffer unit, the parameter buffer unit, the image splicing unit and the off-chip memory control unit.
Furthermore, two identical parameter buffers are arranged in the parameter cache unit.
Further, the parallel processing unit includes a 3 × 3 convolution unit, a 1 × 1 convolution unit, and a pooling unit.
Furthermore, two identical data buffers are arranged in the data buffer unit.
Further, the data buffer contains an embedded block RAM whose size is determined by x_0, W and H, where x_0 denotes the number of 3×3 convolution units and W, H denote the width and height, respectively, of the image block that the parallel processing unit can process at one time.
The method for optimizing the FPGA-based convolutional neural network accelerator comprises the following steps:
Step one: obtaining the number of clock cycles used in forward inference by forward inference instructions CN_n of different operation types;
Step two: constructing the number of clock cycles used by the convolutional neural network for one forward inference;
Step three: constructing a hardware resource constraint expression;
Step four: constructing and solving a constrained optimization function F';
Step five: setting the numbers of 3×3 convolution units, 1×1 convolution units and pooling units according to the optimal solution of the optimization function F'.
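Read together, the five steps amount to a small constrained integer optimization over the unit counts (x0, x1, x2). The sketch below illustrates that reading only; the cycle model, resource costs and brute-force search are assumptions, not the patent's formulas.

```python
# Hedged sketch of steps one to five: choose the unit counts (x0, x1, x2) that
# minimize the clock cycles of one forward inference under a hardware resource
# constraint. The cycle model, resource costs and search are assumptions made
# for illustration.
from itertools import product

# Steps one/two (assumed model): the cycles spent on each operation type shrink
# roughly in proportion to the number of matching parallel units.
WORK = {"conv3x3": 4.0e8, "conv1x1": 1.5e8, "pool": 5.0e7}  # assumed cycles at 1 unit each

def total_cycles(x0: int, x1: int, x2: int) -> float:
    """F'(x0, x1, x2): clock cycles of one forward inference (assumed model)."""
    return WORK["conv3x3"] / x0 + WORK["conv1x1"] / x1 + WORK["pool"] / x2

# Step three (assumed costs): per-unit resource usage and the FPGA budget.
DSP_PER  = {"conv3x3": 9, "conv1x1": 1, "pool": 0}
BRAM_PER = {"conv3x3": 2, "conv1x1": 1, "pool": 1}
DSP_TOTAL, BRAM_TOTAL = 900, 1000

def feasible(x0: int, x1: int, x2: int) -> bool:
    dsp  = DSP_PER["conv3x3"] * x0 + DSP_PER["conv1x1"] * x1 + DSP_PER["pool"] * x2
    bram = BRAM_PER["conv3x3"] * x0 + BRAM_PER["conv1x1"] * x1 + BRAM_PER["pool"] * x2
    return dsp <= DSP_TOTAL and bram <= BRAM_TOTAL

# Steps four/five: solve the constrained problem (here by brute force) and use
# the optimal solution as the numbers of 3x3, 1x1 and pooling units to deploy.
best = min(
    (c for c in product(range(1, 65), repeat=3) if feasible(*c)),
    key=lambda c: total_cycles(*c),
)
print("optimal (x0, x1, x2):", best, "cycles:", total_cycles(*best))
```

In a real deployment the workload terms would come from the measured clock cycles of the CN_n instructions (step one) and the resource costs from the FPGA synthesis report (step three).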
Compared with the prior art, the invention has the following beneficial effects:
The accelerator divides the feature map to be input into blocks and further splits each block into single-channel images, so as to make full use of FPGA resources and increase the running speed of the accelerator; the convolution operation is likewise first blocked and then split into single-channel convolution kernels, which makes full use of FPGA resources and increases the running speed of the accelerator.
By constructing an FPGA hardware resource constraint expression, the invention maximizes the numbers of the various parallel processing units of the convolutional neural network without exceeding the total hardware resources of the FPGA, thereby increasing the operation speed.
The optimization step yields the corresponding optimal numbers of 3×3 convolution units, 1×1 convolution units and pooling units, fully taking the matching of the numbers of the various parallel computing units into account; it thus solves the problem that the forward inference speed of a convolutional neural network model decreases when this mismatch is ignored, improving the accelerator's processing speed for the convolutional neural network.
Drawings
FIG. 1 is a schematic diagram of the FPGA-based convolutional neural network accelerator of the present invention;
FIG. 2 is a schematic diagram of the FPGA-based convolutional neural network accelerator performing the 3×3 convolution operation of step S31 according to the present invention;
FIG. 3 is a schematic diagram of the FPGA-based convolutional neural network accelerator performing the 1×1 convolution operation of step S32 according to the present invention;
FIG. 4 is a schematic diagram of the FPGA-based convolutional neural network accelerator performing the pooling operation of step S33 according to the present invention;
FIG. 5 is a schematic diagram of the storage manner of an input feature map in the off-chip storage unit.
Detailed Description
In order to make the implementation of the present invention clearer, the following detailed description is made with reference to the accompanying drawings.
The invention provides a convolutional neural network accelerator based on an FPGA and an optimization method thereof. Fig. 1 is a schematic diagram of a convolutional neural network accelerator based on an FPGA according to the present invention.
The hardware of the FPGA-based convolutional neural network accelerator comprises an FPGA and a dynamic random access memory; the dynamic random access memory can be one of DDR3, DDR4, SDRAM and the like, and here DDR3 is used. As shown in fig. 1, the FPGA-based convolutional neural network accelerator provided by the present invention includes a program instruction storage unit, a program instruction decoding unit, a data control unit, a data buffer unit, a parameter buffer unit, a parallel processing unit, an image buffer unit, an image splicing unit, an off-chip memory control unit and an off-chip storage unit. The program instruction storage unit, program instruction decoding unit, data control unit, data buffer unit, parameter buffer unit, parallel processing unit, image buffer unit and image splicing unit are implemented on the FPGA; the parallel processing unit comprises x_0 3×3 convolution units, x_1 1×1 convolution units and x_2 pooling units; the data buffer unit includes a pair of data buffers and the parameter buffer unit includes a pair of parameter buffers; the off-chip storage unit is implemented with DDR3. The program instruction storage unit, the program instruction decoding unit and the data control unit are cascaded in sequence to read, decode and transmit the forward inference instructions; the data control unit is connected with the parallel processing unit, the image buffer unit, the data buffer unit, the parameter buffer unit, the image splicing unit and the off-chip memory control unit to distribute instructions and data; the parallel processing unit is placed between the data control unit and the image buffer unit; the image buffer unit is placed between the parallel processing unit and the data control unit; the parameter buffer unit is connected between the data control unit and the off-chip memory control unit; the data buffer unit is connected between the data control unit and the off-chip memory control unit; the image splicing unit and the data control unit are interconnected; and the off-chip memory control unit is interconnected with the off-chip storage unit and connected with the parameter buffer unit, the data control unit and the data buffer unit.
The feature map to be input and the information of the convolutional neural network to be accelerated can be obtained from other ports, or can be stored in a storage device through an external interface. More specifically, the feature map to be input is stored in the off-chip storage unit (the specific storage manner and the arrangement of the off-chip storage unit are described later in this embodiment); the parameters and related information of the convolutional neural network are also stored in the off-chip storage unit. The forward inference instructions of the convolutional neural network are stored in the program instruction storage unit.
When the accelerator runs, the program instruction storage unit reads a forward inference instruction of the convolutional neural network and outputs it to the program instruction decoding unit. The program instruction decoding unit divides the feature map to be input into image blocks to be input, decodes the forward inference instruction according to the division result, and sends the decoding result to the data control unit. The data control unit divides an image block to be input into single-channel images and single-channel convolution kernels in channel order (for convolution operations), or further divides it into several small-size single-channel images in image-column order (for pooling operations), reads the single-channel image, small-size single-channel image or single-channel convolution kernel from the off-chip storage unit according to the division result, and also sends the multiply-accumulate result of the previous single-channel image into the parallel processing unit. The parallel processing unit performs a 3×3 convolution or 1×1 convolution on an input single-channel image and single-channel convolution kernel, or a pooling operation on an input small-size single-channel image, and sends the operation result to the image buffer unit; when all channels of the current image block have been processed, the next image block is processed, and the image buffer unit sends the convolution or pooling result to the image splicing unit. The image splicing unit splices the image blocks and re-divides them, and sends the blocking result to the data control unit. The data control unit stores the blocking result into the off-chip storage unit through the data buffer unit and the off-chip memory control unit.
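The key point of this dataflow, that a multi-channel convolution can be computed as single-channel convolutions whose partial results are accumulated channel by channel, can be checked numerically with the sketch below; it is an arithmetic illustration with made-up sizes, not a model of the hardware.

```python
# Numerical check (illustration only): a multi-channel 3x3 convolution equals
# the channel-by-channel accumulation of single-channel convolutions, which is
# how the data control unit feeds the parallel processing unit one channel at
# a time and passes the previous partial result along.
import numpy as np

def conv3x3_single_channel(img, ker):
    """Valid 3x3 convolution of one channel (img: H x W, ker: 3 x 3)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for r in range(h - 2):
        for c in range(w - 2):
            out[r, c] = np.sum(img[r:r + 3, c:c + 3] * ker)
    return out

rng = np.random.default_rng(0)
image  = rng.standard_normal((8, 16, 16))   # C x H x W image block (example sizes)
kernel = rng.standard_normal((8, 3, 3))     # one 3x3 kernel with C channels
bias   = 0.5

# Channel-by-channel accumulation over successive single-channel images.
partial = np.zeros((14, 14))
for m in range(image.shape[0]):
    partial += conv3x3_single_channel(image[m], kernel[m])
result = np.maximum(partial + bias, 0.0)     # add the bias on the last channel, then ReLU

# Reference: convolve all channels at once.
reference = np.zeros((14, 14))
for r in range(14):
    for c in range(14):
        reference[r, c] = np.sum(image[:, r:r + 3, c:c + 3] * kernel)
reference = np.maximum(reference + bias, 0.0)
assert np.allclose(result, reference)
```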
The convolutional neural network accelerator based on the FPGA can be used for accelerating different convolutional neural network models only by adjusting the setting of the parallel processing unit, and the embodiment of the invention is explained by taking the SSD convolutional neural network model as an example.
The program instruction storage unit includes a ROM in which the forward inference instructions of the N network layers L = {L_1, L_2, ..., L_n, ..., L_N} of the convolutional neural network are stored. A forward inference instruction CN_n is designed for each network layer L_n; CN_n is formed by sequentially concatenating the binary data corresponding to the operation type, convolution stride, number of convolution kernels, number of input feature map channels, input feature map width, input feature map start address, output feature map width, neural network parameter start address, output image block start address, convolution bias start address and padding flag of network layer L_n, forming the forward inference instruction of network layer L_n. The N network layers of the convolutional neural network have N forward inference instructions CN = {CN_1, CN_2, ..., CN_n, ..., CN_N}. The program instruction storage unit reads each forward inference instruction of the convolutional neural network from the ROM one by one and sends it to the program instruction decoding unit.
The program instruction decoding unit is cascaded in sequence between the instruction storage unit and the data control unit and is responsible for receiving the forward inference instructions sent by the instruction storage unit. According to the operation type in the forward inference instruction, an instruction corresponding to a convolution operation is successively decoded into the input start address of the image block to be input, the input start address of the convolution kernel to be input and the bias address, and an instruction corresponding to a pooling operation is successively decoded into the input start address of the image block to be input. The decoding result is sent to the data control unit, and the next decoding is performed after the data control unit completes the corresponding operation. An instruction corresponding to a convolution operation is decoded I×J times, and an instruction corresponding to a pooling operation is decoded I times (I and J are detailed in the following description).
Decoding of a convolution (both 3×3 and 1×1) operation mainly consists of blocking the feature map to be input, blocking the convolution kernels to be input, and then calculating the storage address of each blocked image block to be input, the storage address of each convolution kernel to be input and the storage address of the bias parameters of the convolution operation. This avoids the situation in which the input feature map and convolution kernel parameters are read first and then blocked, which would occupy a large amount of embedded block RAM resources in the FPGA; once the embedded block RAM resources are exhausted, a large amount of DSP resources in the FPGA remain unused, so that the computational parallelism of the SSD convolutional neural network model is low and its forward inference speed is slow. The accelerator of the present invention therefore processes convolution operations faster. Decoding of a pooling operation mainly consists of blocking the feature map to be input and then calculating the storage address of each blocked image block to be input, which avoids the same embedded block RAM problem; the accelerator of the present invention therefore also processes pooling operations faster.
The specific steps of decoding are as follows:
S11: the program instruction decoding unit divides the feature map R_n to be input to the n-th network layer into I image blocks R_n = {R_(n,1), R_(n,2), ..., R_(n,i), ..., R_(n,I)} of size C_n × W × H, and calculates the input start address IA_(n,i) of the i-th image block R_(n,i) from the start address of the input feature map. Here ⌈·⌉ denotes the rounding-up operation, W and H denote the width and height, respectively, of an image block that the parallel processing unit can process at one time, R_(n,i) denotes the i-th image block, C_n denotes the number of channels of the input feature map R_n, and W_n denotes the width of the input feature map R_n (the width and height of R_n are equal). W and H can be set reasonably according to different convolutional neural network models, to avoid the problem that, when the forward inference of the convolutional neural network model reaches a deeper network layer, the width of the input feature map R_n becomes smaller than W and part of the 3×3 convolution units cannot be used effectively. By setting W and H reasonably, the utilization of the 3×3 convolution units can be improved, so that the forward inference speed of the SSD convolutional neural network model, and hence the processing speed of the accelerator, is higher.
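As a rough illustration of S11, the sketch below tiles a square feature map into W×H blocks and assigns each block a start address; the block count and the linear layout used here are assumptions chosen to match the channel-ordered storage described later, not the patent's exact formulas.

```python
# Illustrative sketch of S11 (assumed layout): tile a square feature map of
# width Wn and Cn channels into I blocks of W x H and compute a start address
# for each block, assuming blocks are stored one after another.
import math

def tile_feature_map(Cn, Wn, W, H, base_addr):
    """Return [(block_index, start_address)] for the I image blocks of R_n."""
    blocks_x = math.ceil(Wn / W)          # ceil: the rounding-up operation
    blocks_y = math.ceil(Wn / H)
    I = blocks_x * blocks_y
    block_words = Cn * W * H              # words occupied by one Cn x W x H block
    return [(i + 1, base_addr + i * block_words) for i in range(I)]

# Example: a 300 x 300, 64-channel feature map tiled into 38 x 38 blocks.
for idx, addr in tile_feature_map(Cn=64, Wn=300, W=38, H=38, base_addr=0x1000)[:4]:
    print(f"R_(n,{idx}) starts at {hex(addr)}")
```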
S12: the program instruction decoding unit judges the operation type of the n-th network layer according to the operation type field in CN_n. When the operation type is the 3×3 convolution operation, step S13 is performed; when the operation type is the 1×1 convolution operation, step S14 is performed; when the operation type is the pooling operation, step S15 is performed. Only one of steps S13, S14 and S15 is executed under one forward inference instruction.
S13: the program instruction decoding unit divides the 3×3 convolution kernels to be input to layer L_n into J groups, which can be expressed as K_(3,n) = {K_(3,n,1), K_(3,n,2), ..., K_(3,n,j), ..., K_(3,n,J)}, calculates the input start address KA_(3,n,j) of the j-th convolution kernel group K_(3,n,j) to be input, and at the same time calculates the bias address BA_(3,n,i,j) of the 3×3 convolution operation, and sends IA_(n,i), BA_(3,n,i,j) and KA_(3,n,j) to the data control unit as the ((i-1)×J+j)-th decoding result of CN_n. Here ⌊·⌋ denotes the rounding-down operation, ⌈·⌉ denotes the rounding-up operation, x_0 denotes the number of 3×3 convolution units, C_n denotes the number of channels of the input feature map, and the convolution bias start address is taken from CN_n. The number of kernels in each group, namely the number of 3×3 convolution kernels that the parallel processing unit can accept at one time, is set reasonably according to the number of 3×3 convolution units and W; setting it in this way makes the division of the convolution kernels to be input to layer L_n more reasonable, so that the parallel processing unit can process more 3×3 convolution kernels at one time, which further accelerates the forward inference of the SSD convolutional neural network model and makes the processing speed of the accelerator higher.
S14: the program instruction decoding unit divides the 1×1 convolution kernels to be input to layer L_n into J groups, which can be expressed as K_(1,n) = {K_(1,n,1), K_(1,n,2), ..., K_(1,n,j), ..., K_(1,n,J)}, calculates the input start address KA_(1,n,j) of the j-th convolution kernel group K_(1,n,j) to be input, and at the same time calculates the bias address BA_(1,n,i,j) of the 1×1 convolution operation, and sends IA_(n,i), BA_(1,n,i,j) and KA_(1,n,j) to the data control unit as the ((i-1)×J+j)-th decoding result of CN_n. Here x_1 denotes the number of 1×1 convolution units, C_n denotes the number of channels of the input feature map, and the convolution bias start address is taken from CN_n. The number of kernels in each group, namely the number of 1×1 convolution kernels that the parallel processing unit can accept at one time, is set reasonably according to the number of 1×1 convolution units and W; setting it in this way makes the division of the convolution kernels to be input to layer L_n more reasonable, so that the parallel processing unit can process more 1×1 convolution kernels at one time, which further accelerates the forward inference of the SSD convolutional neural network model and makes the processing speed of the accelerator higher.
S15: the program instruction decoding unit sends IA_(n,i) to the data control unit as the i-th decoding result of the n-th forward inference instruction CN_n. Since the pooling operation only needs the input feature map to obtain the pooling result and requires no other parameters, the pooling operation only requires dividing the input feature map R_n.
The data control unit is connected with the program instruction decoding unit, the parallel processing unit, the image cache unit, the data buffer unit, the parameter buffer unit, the image splicing unit and the off-chip memory control unit and is used for receiving an instruction decoding result of the program instruction decoding unit and forwarding related addresses and data.
According to the instruction decoding result of the program instruction decoding unit, for a convolution operation the image block to be input is divided into single-channel images to be input in channel order, and the convolution kernels to be input are divided into single-channel convolution kernels to be input in channel order. Although the feature map R_n to be input is already blocked in the row and column dimensions by the program instruction decoding unit, if the blocked image block R_(n,i) to be input were fetched from the off-chip storage unit directly, it would still occupy a large amount of embedded block RAM resources in the FPGA; the embedded block RAM resources would be exhausted while a large amount of DSP resources remained unused, so that the computational parallelism of the SSD convolutional neural network model would be low and its forward inference speed slow. By dividing the image block to be input into single-channel images to be input, the forward inference speed of the SSD convolutional neural network model can be improved, so that the processing speed of the accelerator is higher.
The single-channel image is fetched from the off-chip storage unit through the data buffer unit and the off-chip memory control unit; the single-channel convolution kernel is fetched from the off-chip storage unit through the parameter buffer unit and the off-chip memory control unit. The multiply-accumulate result of the previous single-channel image is fetched from the image buffer unit, and the blocked single-channel image, the grouped single-channel convolution kernel and the multiply-accumulate result of the previous single-channel image are sent to the parallel processing unit. When the first single-channel image is processed, there is no multiply-accumulate result of a previous single-channel image, so only the blocked single-channel image and the grouped single-channel convolution kernel are sent to the parallel processing unit. The settings of the data buffer unit, the parameter buffer unit, the off-chip memory control unit, the image buffer unit and the parallel processing unit are described later in this embodiment and are not detailed here.
For a pooling operation, according to the instruction decoding result of the program instruction decoding unit, the single-channel image is fetched from the off-chip storage unit through the data buffer unit and the off-chip memory control unit, divided again along the column direction, and the resulting small-size single-channel images are sent to the parallel processing unit. In this way the number of pooling units in the parallel processing unit is not limited by W and can be set flexibly; the data control unit can send small-size single-channel images into the parallel processing unit within one clock cycle according to the number of pooling units, improving the parallelism of the pooling operation, so forward inference instructions whose operation type is pooling execute faster and the accelerator processes pooling operations faster.
The method comprises the following specific steps:
S21: the i-th image block R_(n,i) to be input is divided channel-wise into M single-channel images of size 1×W×H, R_(n,i) = {R_(n,i,1), R_(n,i,2), ..., R_(n,i,m), ..., R_(n,i,M)}, where M is the number of channels of the image block, and the input start address IA_(n,i,m) of R_(n,i,m) is calculated; by sending IA_(n,i,m) to the off-chip memory control unit, R_(n,i,m) is read from the off-chip storage unit, giving R'_(n,i,m). IA_(n,i,m) is expressed as: IA_(n,i,m) = IA_(n,i) + W×H×(m-1).
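The address arithmetic of S21 can be written out directly; the sketch below restates the formula above in code, with example values for the block parameters.

```python
# Sketch of S21: single-channel image addresses inside one image block, following
# IA_(n,i,m) = IA_(n,i) + W*H*(m-1). The block parameters are example values only.
def single_channel_addresses(ia_block, W, H, M):
    """Start addresses of the M single-channel 1 x W x H images of block R_(n,i)."""
    return [ia_block + W * H * (m - 1) for m in range(1, M + 1)]

addrs = single_channel_addresses(ia_block=0x2000, W=38, H=38, M=64)
print([hex(a) for a in addrs[:3]])   # addresses of R_(n,i,1), R_(n,i,2), R_(n,i,3)
```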
S22, according to different operation types, different steps are executed. When the operation type is a 3 × 3 convolution operation, performing step S23, when the operation type is a 1 × 1 convolution operation, performing step S24, and when the operation type is a pooling operation, performing step S25;
S23: the j-th group of 3×3 convolution kernels K_(3,n,j) to be input is divided in channel order into M single-channel kernel groups, which can be expressed as K_(3,n,j) = {K_(3,n,j,1), K_(3,n,j,2), ..., K_(3,n,j,m), ..., K_(3,n,j,M)}; the input start address KA_(3,n,j,m) of the m-th convolution kernel group K_(3,n,j,m) to be input is calculated, and by sending KA_(3,n,j,m) to the off-chip memory control unit, K_(3,n,j,m) is read from the off-chip storage unit, giving K'_(3,n,j,m). Meanwhile, when m = 1, the single-channel image R'_(n,i,m) and the single-channel 3×3 convolution kernel K'_(3,n,j,m) are sent to the parallel processing unit; when M > m > 1, the convolution result P_(n,i,m-1) of the (m-1)-th single-channel image R'_(n,i,m-1) stored in the image buffer unit, the single-channel image R'_(n,i,m) and the single-channel 3×3 convolution kernel K'_(3,n,j,m) are sent to the parallel processing unit; when m = M, the bias address BA_(3,n,i,j) of the 3×3 convolution operation is sent to the off-chip memory control unit, B_(3,n,i,j) is read from the off-chip storage unit, giving B'_(3,n,i,j), and the convolution result P_(n,i,m-1) of the (m-1)-th single-channel image R'_(n,i,m-1) stored in the image buffer unit, the single-channel image R'_(n,i,m), the single-channel 3×3 convolution kernel K'_(3,n,j,m) and B'_(3,n,i,j) are sent to the parallel processing unit.
S24: the j-th group of 1×1 convolution kernels K_(1,n,j) to be input is divided in channel order into M single-channel kernel groups, which can be expressed as K_(1,n,j) = {K_(1,n,j,1), K_(1,n,j,2), ..., K_(1,n,j,m), ..., K_(1,n,j,M)}; the input start address KA_(1,n,j,m) of the m-th convolution kernel group K_(1,n,j,m) to be input is calculated, and by sending KA_(1,n,j,m) to the off-chip memory control unit, K_(1,n,j,m) is read from the off-chip storage unit, giving K'_(1,n,j,m). Meanwhile, when m = 1, the single-channel image R'_(n,i,m) and the single-channel 1×1 convolution kernel K'_(1,n,j,m) are sent to the parallel processing unit; when M > m > 1, the convolution result P_(n,i,m-1) of the (m-1)-th single-channel image R'_(n,i,m-1) stored in the image buffer unit, the single-channel image R'_(n,i,m) and the single-channel 1×1 convolution kernel K'_(1,n,j,m) are sent to the parallel processing unit; when m = M, the bias address BA_(1,n,i,j) of the 1×1 convolution operation is sent to the off-chip memory control unit, B_(1,n,i,j) is read from the off-chip storage unit, giving B'_(1,n,i,j), and the convolution result P_(n,i,m-1) of the (m-1)-th single-channel image R'_(n,i,m-1) stored in the image buffer unit, the single-channel image R'_(n,i,m), the single-channel 1×1 convolution kernel K'_(1,n,j,m) and B'_(1,n,i,j) are sent to the parallel processing unit.
S25: the single-channel image R_(n,i,m) is divided in image-column order into P small-size single-channel images of size 1×W×PN, R_(n,i,m) = {R_(n,i,m,1), R_(n,i,m,2), ..., R_(n,i,m,p), ..., R_(n,i,m,P)}, and the input start address IA_(n,i,m,p) of R_(n,i,m,p) is calculated; by sending IA_(n,i,m,p) to the off-chip memory control unit, R_(n,i,m,p) is read from the off-chip storage unit, giving R'_(n,i,m,p). The input start address IA_(n,i,m,p) of R_(n,i,m,p) can be expressed as:
IA_(n,i,m,p) = IA_(n,i,m) + 1×W×PN×(p-1)
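The column-wise split of S25 can be written out in the same way; the sketch below restates the formula above, with W, PN and P chosen as example values.

```python
# Sketch of S25: splitting a single-channel image into P small-size images of
# 1 x W x PN along the column direction, following
# IA_(n,i,m,p) = IA_(n,i,m) + 1*W*PN*(p-1). Example values only.
def small_image_addresses(ia_channel, W, PN, P):
    """Start addresses of the P small-size single-channel images of R_(n,i,m)."""
    return [ia_channel + 1 * W * PN * (p - 1) for p in range(1, P + 1)]

print([hex(a) for a in small_image_addresses(ia_channel=0x2000, W=38, PN=2, P=19)])
```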
The parallel processing unit comprises x_0 3×3 convolution units, x_1 1×1 convolution units and x_2 pooling units; it loads the forward inference instruction and the image data input by the data control unit and, according to the instruction, inputs W image data into the x_0 3×3 convolution units or the x_1 1×1 convolution units, or W×PN image data into the x_2 pooling units, in one clock cycle.
The processing of the single-channel image is performed in the parallel processing unit. According to the forward inference instruction CN_n, a 3×3 convolution operation (step S31) or a 1×1 convolution operation (step S32) is performed on the single-channel image R'_(n,i,m) to obtain its processing result P_(n,i,m), or a pooling operation (step S33) is performed on the P small-size single-channel images R'_(n,i,m,p) to obtain the processing result P_(n,i,m) of the single-channel image R'_(n,i,m), and P_(n,i,m) is sent to the image buffer unit. The parallel processing unit inputs W image data into the x_0 3×3 convolution units or the x_1 1×1 convolution units in one clock cycle, and inputs W×PN image data into the x_2 pooling units in one clock cycle.
S31, the 3×3 convolution operation:
The 3×3 convolution units in the parallel processing unit perform a 3×3 convolution operation and a ReLU activation operation on the single-channel image R'_(n,i,m). Fig. 2 is a schematic diagram of the 3×3 convolution operation. In each clock cycle a single 3×3 convolution unit takes from the single-channel image R'_(n,i,m) three image data in the same column of three rows; the image data input in this clock cycle are convolved together with the image data input in the following two clock cycles. The value cnt of the channel counter is then judged: if cnt = 1, 0 is added to the convolution result and the sum is output; if cnt > 1 and cnt has not reached the last channel, the multiply-accumulate result of the previous channel is added to the convolution result and the sum is output; if cnt corresponds to the last channel, the convolution bias and the multiply-accumulate result of the previous channel are added to the convolution result, and the sum is passed through ReLU activation and output as the convolution result. Here the 3×3 convolution unit needs only one clock cycle to take in three image data of three rows and the same column of the single-channel image R'_(n,i,m), so the convolution of 9 image data can be completed in 3 clock cycles, whereas a systolic array needs 9 clock cycles for the same operation; the 3×3 convolution unit therefore computes the 3×3 convolution faster than a systolic array, and the processing speed of the accelerator is higher.
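The per-channel control just described can be mimicked as below: a hypothetical helper stands in for one 3×3 convolution unit that takes one column of three pixels per cycle, accumulates the previous channel's partial sum, and applies the bias and ReLU only on the last channel. This is an arithmetic sketch, not the RTL.

```python
# Illustration (assumed arithmetic, not RTL): a 3x3 convolution unit receives one
# column of three pixels per clock cycle and finishes the 9 multiply-accumulates
# of one output pixel after 3 cycles, instead of the 9 cycles a systolic array
# would need for the same output.
def conv3x3_unit_output(columns, kernel_cols, partial_in, bias=None):
    """columns / kernel_cols: three columns of 3 values each.
    partial_in: multiply-accumulate result of the previous channel (0 when cnt == 1).
    bias: convolution bias, supplied only on the last channel (then ReLU is applied)."""
    acc = partial_in
    for cycle in range(3):                      # one column of 3 image data per cycle
        acc += sum(p * k for p, k in zip(columns[cycle], kernel_cols[cycle]))
    if bias is not None:                        # last channel: add bias, ReLU activation
        acc = max(acc + bias, 0.0)
    return acc

# Example with 3 channels: middle channels pass their partial sums on, the last adds bias.
imgs = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],      # channel 1 (3 columns of 3 pixels)
        [[1, 0, 1], [0, 1, 0], [1, 0, 1]],      # channel 2
        [[2, 2, 2], [2, 2, 2], [2, 2, 2]]]      # channel 3
kers = [[[1, 0, 0], [0, 1, 0], [0, 0, 1]]] * 3  # one single-channel 3x3 kernel per channel
partial = 0.0
for m, (img, ker) in enumerate(zip(imgs, kers), start=1):
    partial = conv3x3_unit_output(img, ker, partial, bias=-1.0 if m == 3 else None)
print(partial)
```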
S32, the 1×1 convolution operation:
The 1×1 convolution units in the parallel processing unit perform a 1×1 convolution operation and a ReLU activation operation on the single-channel image R'_(n,i,m). Fig. 3 is a schematic diagram of the 1×1 convolution operation. In each clock cycle a single 1×1 convolution unit takes one image datum of one row from the single-channel image R'_(n,i,m). The value cnt of the channel counter is then judged: if cnt = 1, 0 is added to the convolution result and the sum is output; if cnt > 1 and cnt has not reached the last channel, the multiply-accumulate result of the previous channel is added to the convolution result and the sum is output; if cnt corresponds to the last channel, the convolution bias and the multiply-accumulate result of the previous channel are added to the convolution result, and the sum is passed through ReLU activation and output.
S33, the pooling operation:
The pooling units in the parallel processing unit perform a pooling operation on a small-size single-channel image R'_(n,i,m,p). Fig. 4 is a schematic diagram of the pooling operation. In each clock cycle a single pooling unit takes from the small-size single-channel image R'_(n,i,m,p) two image data in the same column of two rows and subtracts one from the other; the sign bit of the subtraction result is input to the data selector MUX as the selection signal, selecting the maximum max1 of the two image data. The maximum max1 computed in this clock cycle is then subtracted from the maximum max2 of the two image data input in the next clock cycle; the sign bit of the subtraction result is input to the data selector MUX as the selection signal, selecting the maximum max3 of max1 and max2, and max3 = max(max1, max2) is output, where max(·) denotes the maximum value.
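The sign-bit/MUX selection described above can be mimicked arithmetically as follows; this is an illustration of the selection logic, not the RTL.

```python
# Arithmetic illustration of the pooling unit: select the maximum of two values via
# the sign bit of their difference, then combine two such maxima, so that a 2x2
# window yields max3 = max(max1, max2) over two clock cycles.
def mux_max(a: float, b: float) -> float:
    sign_negative = (a - b) < 0          # sign bit of the subtraction result
    return b if sign_negative else a     # data selector MUX driven by the sign bit

def pool_2x2(col_t0, col_t1):
    """col_t0, col_t1: (upper, lower) pixels of two rows, this cycle and the next."""
    max1 = mux_max(*col_t0)              # maximum of the 2 image data in this cycle
    max2 = mux_max(*col_t1)              # maximum of the 2 image data in the next cycle
    return mux_max(max1, max2)           # max3 = max(max1, max2)

print(pool_2x2((0.3, 1.7), (0.9, -0.2)))   # -> 1.7
```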
The image splicing unit internally contains an embedded block RAM of size PC×PW×PH, which is used to buffer the result of image splicing.
The image blocks are spliced, and the spliced image blocks are re-blocked according to the next forward inference instruction CN_(n+1); splicing first makes it convenient to perform the zero-value padding operation on the image blocks. The data control unit takes the spliced and re-blocked image blocks out of the image splicing unit and stores them into the off-chip storage unit through the data buffer unit and the off-chip memory control unit.
According to the operation type of the forward inference instruction CN_n: if it is a 3×3 convolution operation, step S41 is performed; if it is a 1×1 convolution operation, step S42 is performed; if it is a pooling operation, step S43 is performed.
S41: splicing of the convolution results of the I input image blocks {R'_(n,1), R'_(n,2), ..., R'_(n,i), ..., R'_(n,I)} with the 3×3 convolution kernel group K'_(3,n,j). The image splicing unit receives from the data control unit the convolution result of an input image block with the 3×3 convolution kernel group K'_(3,n,j), the sequence number i of the image block and the sequence number j of the 3×3 convolution kernel group K'_(3,n,j). Each time the convolutions of the I input image blocks with the kernel group K_(3,n,j) have been computed, the image splicing unit receives a splicing start signal from the data control unit and completes the image splicing; after the image splicing is completed, it receives the convolution results of the next kernel group K_(3,n,j+1) with the I input image blocks and splices them. During splicing, the results are spliced into an image block P_(n,j) whose size is determined by the convolution stride. Whether a zero-value padding operation needs to be performed on the image block P_(n,j) is determined by the next forward inference instruction CN_(n+1): if the padding flag of CN_(n+1) is set, a zero-value padding operation is performed on P_(n,j); otherwise no padding operation is performed on P_(n,j). The image block P_(n,j) is then divided into G image blocks P_(n,j) = {P_(n,j,1), P_(n,j,2), ..., P_(n,j,g), ..., P_(n,j,G)}, and the blocks P_(n,j,g) are sent to the data control unit in sequence.
S42: splicing of the convolution results of the I image blocks {R'_(n,1), R'_(n,2), ..., R'_(n,i), ..., R'_(n,I)} with the 1×1 convolution kernel group K'_(1,n,j). The image splicing unit receives from the data control unit the convolution result of an input image block with the 1×1 convolution kernel group K'_(1,n,j), the sequence number i of the image block and the sequence number j of the 1×1 convolution kernel group K'_(1,n,j). Each time the convolutions of the I image blocks with the kernel group K_(1,n,j) have been computed, the image splicing unit receives a splicing start signal from the data control unit and splices the convolution results of the I image blocks with K_(1,n,j) into an image block P_(n,j); after the image splicing is completed, it receives the convolution results of the next kernel group K_(1,n,j+1) with the I input image blocks and splices them. Whether a zero-value padding operation needs to be performed on the image block P_(n,j) is determined by the next forward inference instruction CN_(n+1): if the padding flag of CN_(n+1) is set, a zero-value padding operation is performed on P_(n,j); otherwise no padding operation is performed. The image block P_(n,j) is then divided into G image blocks P_(n,j) = {P_(n,j,1), P_(n,j,2), ..., P_(n,j,g), ..., P_(n,j,G)}, and the blocks P_(n,j,g) are sent to the data control unit in sequence.
S43: splicing of the pooling results of the I single-channel images {R'_(n,1,m), R'_(n,2,m), ..., R'_(n,i,m), ..., R'_(n,I,m)}. The image splicing unit receives from the data control unit the pooling result of a small-size single-channel image R'_(n,i,m,p) together with its sequence numbers p and m. Each time the pooling results of m×p small-size single-channel images R'_(n,i,m,p) have been computed, the image splicing unit receives a splicing start signal from the data control unit and splices the pooling results of the m×p small-size single-channel images R_(n,i,m,p) into a single-channel image P_(n,m); after the image splicing is completed, it receives the pooling results of the next batch of m×p small-size single-channel images R_(n,i,m,p). Whether a zero-value padding operation needs to be performed on the single-channel image P_(n,m) is determined by the next forward inference instruction CN_(n+1): if the padding flag of CN_(n+1) is set, a zero-value padding operation is performed on P_(n,m); otherwise no padding operation is performed on the single-channel image P_(n,m). The single-channel image P_(n,m) is then divided into G image blocks of size 1×W×H, P_(n,m) = {P_(n,m,1), P_(n,m,2), ..., P_(n,m,g), ..., P_(n,m,G)}, and the blocks P_(n,m,g) are sent to the data control unit in sequence.
After the image splicing unit splices the images, the spliced images are re-blocked according to the next forward inference instruction CN_(n+1). The data control unit takes the spliced and re-blocked image blocks out of the image splicing unit and stores them into the off-chip storage unit through the data buffer unit and the off-chip memory control unit. According to the operation type of the forward inference instruction CN_n: for a 3×3 convolution operation, step S51 is executed; for a 1×1 convolution operation, step S52 is executed; for a pooling operation, step S53 is executed.
The method comprises the following specific steps:
S51: the data control unit takes the spliced and re-blocked image block P_(n,j,g) out of the image splicing unit, calculates the output result start address PA_(n,j,g) of P_(n,j,g), and stores P_(n,j,g) into the off-chip storage unit by sending PA_(n,j,g) to the off-chip memory control unit.
S52: the data control unit takes the spliced and re-blocked image block P_(n,j,g) out of the image splicing unit, calculates the output result start address PA_(n,j,g) of P_(n,j,g), and stores P_(n,j,g) into the off-chip storage unit by sending PA_(n,j,g) to the off-chip memory control unit.
S53: the data control unit takes the spliced and re-blocked image block P_(n,m,g) out of the image splicing unit, calculates the output result start address PA_(n,m,g) of P_(n,m,g), and stores P_(n,m,g) into the off-chip storage unit by sending PA_(n,m,g) to the off-chip memory control unit.
The results stored in steps S51, S52 and S53 form the input feature map of the next forward inference instruction; the processing result of each forward inference instruction is the input feature map of the next one, and so on until all forward inference instructions have been run, that is, all N network layers of the convolutional neural network have completed their operations and the running result of the convolutional neural network is obtained.
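This chaining amounts to the simple loop sketched below, where trivial stand-in functions replace the per-layer processing; only the chaining of stored results is the point, the rest is made up for the example.

```python
# Outline of the overall forward inference: each forward inference instruction's
# result is stored off-chip and becomes the input feature map of the next
# instruction. The "layer" functions here are trivial stand-ins, used only to
# show the chaining; they are not the accelerator's operations.
off_chip = {"R_1": [1.0, 2.0, 3.0]}                 # input feature map of layer 1

def run_instruction(op_type, feature_map):
    if op_type == "conv":                           # stand-in for S41/S51
        return [2.0 * x + 0.1 for x in feature_map]
    return [max(feature_map)] * len(feature_map)    # stand-in for pooling, S43/S53

instructions = [("conv", 1), ("pool", 2), ("conv", 3)]   # CN_1 .. CN_N
for op_type, n in instructions:
    result = run_instruction(op_type, off_chip[f"R_{n}"])
    off_chip[f"R_{n + 1}"] = result                 # stored result = next input R_(n+1)

print(off_chip[f"R_{len(instructions) + 1}"])       # running result of the network
```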
The image cache unit, the parameter cache unit, the data cache unit, the off-chip memory control unit and the off-chip memory unit are specifically arranged as follows:
the image buffer unit is internally provided with a size of
Figure RE-GDA0003476049870000134
The embedded block RAM. Storing processing results P of parallel processing units(n,i,m)And is combined with P(n,i,m)And sending the data to a data control unit.
The parameter buffer unit contains two parameter buffers of identical structure, used to implement ping-pong operation. Each parameter buffer contains an embedded block RAM. The parameter buffer unit is arranged between the data control unit and the off-chip memory control unit and is used for buffering the parameters.
The data buffer unit contains two data buffers of identical structure, used to implement ping-pong operation. Each data buffer contains an embedded block RAM. The data buffer unit is arranged between the data control unit and the off-chip memory control unit and is used for buffering the image data related to the feature map and the convolution bias.
The off-chip memory control unit is implemented with a Xilinx MIG IP core and is used to connect to the off-chip storage unit, enabling the exchange of parameters and feature-map-related image data between the FPGA and the off-chip storage unit.
The off-chip storage unit is implemented with DDR3 and stores, for each network layer L_n, the input feature map R_n, the convolution kernel parameters, the convolution bias parameters, the convolution or pooling results and other information. The convolution or pooling result of network layer L_n serves as the input feature map R_(n+1) of network layer L_(n+1).
The specific storage mode is as follows:
Storage of the input feature map R'_n: the input feature map R'_n is divided into I image blocks R'_n = {R'_(n,1), R'_(n,2), ..., R'_(n,i), ..., R'_(n,I)}; each R'_(n,i) is divided channel-wise into M single-channel images of size 1×W×H, R'_(n,i) = {R'_(n,i,1), R'_(n,i,2), ..., R'_(n,i,m), ..., R'_(n,i,M)}. Within a single-channel image R'_(n,i,m) the storage is contiguous; the M single-channel images of an image block R'_(n,i) are stored contiguously in channel order; and the I image blocks of the input feature map R'_n are stored contiguously in ascending order of the division sequence number i, as shown in fig. 5.
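The storage order just described corresponds to the address computation sketched below; the word addressing, base address and sizes are example assumptions.

```python
# Sketch of the off-chip layout of fig. 5 (example values): address of the pixel at
# (row, col) of channel m inside image block i of the input feature map R'_n, assuming
# contiguous storage within a single-channel image, then by channel, then by block.
def feature_map_address(base, W, H, M, i, m, row, col):
    block_offset   = (i - 1) * M * W * H    # I blocks stored in order of i
    channel_offset = (m - 1) * W * H        # M single-channel images stored in channel order
    pixel_offset   = row * W + col          # contiguous inside one 1 x W x H image
    return base + block_offset + channel_offset + pixel_offset

print(hex(feature_map_address(base=0x1000, W=38, H=38, M=64, i=2, m=3, row=0, col=5)))
```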
Storage of the convolution bias parameter B'(3,n) of a 3×3 convolution with the convolution step size given in Figure RE-GDA0003476049870000144: the convolution bias parameter B'(3,n), whose size is given by the expression in Figure RE-GDA0003476049870000145, is divided in channel order into J offset blocks whose size is given by the expression in Figure RE-GDA0003476049870000146, B'(3,n) = {B'(3,n,1), B'(3,n,2), ..., B'(3,n,j), ..., B'(3,n,J)}. Each B'(3,n,j) is divided into I offset blocks whose size is given by the expression in Figure RE-GDA0003476049870000147, B'(3,n,j) = {B'(3,n,1,j), B'(3,n,2,j), ..., B'(3,n,i,j), ..., B'(3,n,I,j)}. Each offset block B'(3,n,i,j) is stored contiguously, the I offset blocks of B'(3,n,j) are stored consecutively in partition order, and the J offset blocks of B'(3,n) are stored consecutively in increasing channel order.
Storage of the convolution bias parameter B'(3,n) of a 3×3 convolution with the convolution step size given in Figure RE-GDA0003476049870000148: the convolution bias parameter B'(3,n), whose size is given by the expressions in Figure RE-GDA0003476049870000149 and Figure RE-GDA00034760498700001410, is divided in channel order into J offset blocks whose size is given by the expression in Figure RE-GDA00034760498700001411, B'(3,n) = {B'(3,n,1), B'(3,n,2), ..., B'(3,n,j), ..., B'(3,n,J)}. Each B'(3,n,j) is divided into I offset blocks whose size is given by the expression in Figure RE-GDA0003476049870000141, B'(3,n,j) = {B'(3,n,1,j), B'(3,n,2,j), ..., B'(3,n,i,j), ..., B'(3,n,I,j)}. Each offset block B'(3,n,i,j) is stored contiguously, the I offset blocks of B'(3,n,j) are stored consecutively in partition order, and the J offset blocks of B'(3,n) are stored consecutively in increasing channel order.
Storage of the convolution bias parameter B'(1,n) of a 1×1 convolution: the convolution bias parameter B'(1,n), whose size is given by the expression in Figure RE-GDA00034760498700001412, is divided in channel order into J offset blocks whose size is given by the expression in Figure RE-GDA00034760498700001413, B'(1,n) = {B'(1,n,1), B'(1,n,2), ..., B'(1,n,j), ..., B'(1,n,J)}. Each B'(1,n,j) is divided into I offset blocks whose size is given by the expression in Figure RE-GDA00034760498700001414, B'(1,n,j) = {B'(1,n,1,j), B'(1,n,2,j), ..., B'(1,n,i,j), ..., B'(1,n,I,j)}. Each offset block B'(1,n,i,j) is stored contiguously, the I offset blocks of B'(1,n,j) are stored consecutively in partition order, and the J offset blocks of B'(1,n) are stored consecutively in increasing channel order.
Storage of the 3×3 convolution kernel parameter K'(3,n): the convolution kernel parameter K'(3,n), whose size is given by the expression in Figure RE-GDA00034760498700001415, is divided in kernel order into J 3×3 convolution kernels whose size is given by the expression in Figure RE-GDA00034760498700001416, K'(3,n) = {K'(3,n,1), K'(3,n,2), ..., K'(3,n,j), ..., K'(3,n,J)}. Each K'(3,n,j) is divided by channel into M single-channel 3×3 convolution kernels whose size is given by the expression in Figure RE-GDA00034760498700001417, K'(3,n,j) = {K'(3,n,j,1), K'(3,n,j,2), ..., K'(3,n,j,m), ..., K'(3,n,j,M)}. Each single-channel 3×3 convolution kernel K'(3,n,j,m) is stored contiguously, the M single-channel 3×3 convolution kernels of a kernel K'(3,n,j) are stored consecutively in channel order, and the J 3×3 convolution kernels of K'(3,n) are stored consecutively in increasing order of the division index j.
Storage of the 1×1 convolution kernel parameter K'(1,n): the convolution kernel parameter K'(1,n), whose size is given by the expression in Figure RE-GDA0003476049870000151, is divided in kernel order into J 1×1 convolution kernels whose size is given by the expression in Figure RE-GDA0003476049870000152, K'(1,n) = {K'(1,n,1), K'(1,n,2), ..., K'(1,n,j), ..., K'(1,n,J)}. Each K'(1,n,j) is divided by channel into M single-channel 1×1 convolution kernels whose size is given by the expression in Figure RE-GDA0003476049870000153, K'(1,n,j) = {K'(1,n,j,1), K'(1,n,j,2), ..., K'(1,n,j,m), ..., K'(1,n,j,M)}. Each single-channel 1×1 convolution kernel K'(1,n,j,m) is stored contiguously, the M single-channel 1×1 convolution kernels of a kernel K'(1,n,j) are stored consecutively in channel order, and the J 1×1 convolution kernels of K'(1,n) are stored consecutively in increasing order of the division index j.
In order to further improve the running speed of the convolutional neural network, the invention also provides an optimization method of the convolutional neural network accelerator based on the FPGA. The method comprises the following specific steps:
S61, obtaining, for forward inference instructions CNn of different operation types, the number of clock cycles used for forward inference;
For a 3×3 convolution operation, step S611 is performed; for a 1×1 convolution operation, step S612 is performed; and for a pooling operation, step S613 is performed.
S611, according to a forward inference instruction CNn whose operation type is a 3×3 convolution operation, obtaining, from all steps from decoding to the end of instruction execution and the number of cycles used in each step, the number of clock cycles F3(n) used for forward inference of a forward inference instruction CNn whose operation type is a 3×3 convolution operation;
The number of clock cycles used by the program instruction storage unit to read and transmit a forward inference instruction CNn whose operation type is a 3×3 convolution operation is Tr3. The program instruction decoding unit uses Ty3 clock cycles to decode such a forward inference instruction once; one forward inference instruction must be decoded I×J times, requiring I×J×Ty3 clock cycles. The data control unit forwards a single-channel image R'(n,i,m) and a single-channel 3×3 convolution kernel K'(3,n,j,m) stored in the off-chip storage unit to the parallel processing unit, via the cache units and the off-chip memory control unit, in Tcps3 clock cycles per transfer; one forward inference instruction requires M transfers, i.e. M×Tcps3 clock cycles. The data control unit takes the convolution result of an input image block R'(n,i) and a 3×3 convolution kernel K'(3,n,j) from the image cache unit and forwards it to the image splicing unit in Tscj3 clock cycles per transfer; one forward inference instruction requires I×J transfers, i.e. I×J×Tscj3 clock cycles. The data control unit takes an image block P(n,j,g) from the image splicing unit and stores it in the off-chip storage unit, via the data cache unit and the off-chip memory control unit, in Tjc3 clock cycles per transfer; one forward inference instruction requires G transfers, i.e. G×Tjc3 clock cycles. The parallel processing unit calls the 3×3 convolution unit to perform convolution and accumulation on a single-channel image R'(n,i,m) and a single-channel 3×3 convolution kernel K'(3,n,j,m) in Tp3 clock cycles; one forward inference instruction requires I×J×M such operations, i.e. I×J×M×Tp3 clock cycles. The image cache unit forwards P(n,i,m) to the data control unit in Ts3 clock cycles; one forward inference instruction requires I×J×M such transfers, i.e. I×J×M×Ts3 clock cycles. The image splicing unit fills and re-blocks the image blocks P(n,j) in Tj3 clock cycles; one forward inference instruction requires J such operations, i.e. J×Tj3 clock cycles. From the forward inference instruction CNn and the number of cycles used in each step, F3(n) is obtained:
F3(n)=Tr3+I×J×Ty3+M×Tcps3+I×J×Tscj3+G×Tjc3+I×J×M×Tp3+I×J×M×Ts3+J×Tj3
wherein the remaining quantities are defined by the expressions in Figure RE-GDA0003476049870000161 and Figure RE-GDA0003476049870000162.
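The expression for F3(n) can be evaluated directly, as in the following sketch; the per-step cycle counts Tr3 through Tj3 and the block counts I, J, M and G are inputs that would come from the figures above and from measurements of the design:

```python
# Cycle-count model for one 3x3 convolution instruction, transcribed from the
# expression for F3(n) above.
def f3(I, J, M, G, Tr3, Ty3, Tcps3, Tscj3, Tjc3, Tp3, Ts3, Tj3):
    return (Tr3                      # instruction read and transmission
            + I * J * Ty3            # instruction decoding
            + M * Tcps3              # image / kernel transfer to the parallel units
            + I * J * Tscj3          # convolution results to the image splicing unit
            + G * Tjc3               # spliced blocks back to off-chip memory
            + I * J * M * Tp3        # 3x3 convolution and accumulation
            + I * J * M * Ts3        # image cache to data control unit
            + J * Tj3)               # filling and re-blocking
```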
S612, according to a forward inference instruction CNn whose operation type is a 1×1 convolution operation, obtaining, from all steps from decoding to the end of instruction execution and the number of cycles used in each step, the number of clock cycles F1(n) used for forward inference of a forward inference instruction CNn whose operation type is a 1×1 convolution operation;
The number of clock cycles used by the program instruction storage unit to read and transmit a forward inference instruction CNn whose operation type is a 1×1 convolution operation is Tr1. The program instruction decoding unit uses Ty1 clock cycles to decode such a forward inference instruction once; one forward inference instruction must be decoded I×J times, requiring I×J×Ty1 clock cycles. The data control unit forwards a single-channel image R'(n,i,m) and a single-channel 1×1 convolution kernel K'(1,n,j,m) stored in the off-chip storage unit to the parallel processing unit, via the cache units and the off-chip memory control unit, in Tcps1 clock cycles per transfer; one forward inference instruction requires M transfers, i.e. M×Tcps1 clock cycles. The data control unit takes the convolution result of an input image block R'(n,i) and a 1×1 convolution kernel K'(1,n,j) from the image cache unit and forwards it to the image splicing unit in Tscj1 clock cycles per transfer; one forward inference instruction requires I×J transfers, i.e. I×J×Tscj1 clock cycles. The data control unit takes an image block P(n,j,g) from the image splicing unit and stores it in the off-chip storage unit, via the data cache unit and the off-chip memory control unit, in Tjc1 clock cycles per transfer; one forward inference instruction requires G transfers, i.e. G×Tjc1 clock cycles. The parallel processing unit calls the 1×1 convolution unit to perform convolution and accumulation on a single-channel image R'(n,i,m) and a single-channel 1×1 convolution kernel K'(1,n,j,m) in Tp1 clock cycles; one forward inference instruction requires I×J×M such operations, i.e. I×J×M×Tp1 clock cycles. The image cache unit forwards P(n,i,m) to the data control unit in Ts1 clock cycles; one forward inference instruction requires I×J×M such transfers, i.e. I×J×M×Ts1 clock cycles. The image splicing unit fills and re-blocks the image blocks P(n,j) in Tj1 clock cycles; one forward inference instruction requires J such operations, i.e. J×Tj1 clock cycles. From the forward inference instruction CNn and the number of cycles used in each step, F1(n) is obtained:
F1(n)=Tr1+I×J×Ty1+M×Tcps1+I×J×Tscj1+G×Tjc1+I×J×M×Tp1+I×J×M×Ts1+J×Tj1
S613, according to a forward inference instruction CNn whose operation type is a pooling operation, obtaining, from all steps from decoding to the end of instruction execution and the number of cycles used in each step, the number of clock cycles Fp(n) used for forward inference of a forward inference instruction CNn whose operation type is a pooling operation.
The number of clock cycles used by the program instruction storage unit to read and transmit a forward inference instruction CNn whose operation type is a pooling operation is Trp. The program instruction decoding unit uses Typ clock cycles to decode such a forward inference instruction once; one forward inference instruction must be decoded I times, requiring I×Typ clock cycles. The data control unit forwards a single-channel image R'(n,i,m,p) stored in the off-chip storage unit to the parallel processing unit, via the data cache unit and the off-chip memory control unit, in Tcpsp clock cycles per transfer; one forward inference instruction requires M×P transfers, i.e. M×P×Tcpsp clock cycles. The data control unit takes the pooling result of a single-channel image R'(n,i,m,p) from the image cache unit and forwards it to the image splicing unit in Tscjp clock cycles per transfer; one forward inference instruction requires I×P transfers, i.e. I×P×Tscjp clock cycles. The data control unit takes an image block P(n,j,g) from the image splicing unit and stores it in the off-chip storage unit, via the data cache unit and the off-chip memory control unit, in Tjcp clock cycles per transfer; one forward inference instruction requires G transfers, i.e. G×Tjcp clock cycles. The parallel processing unit calls the pooling unit to pool a single-channel image R'(n,i,m,p) in Tpp clock cycles; one forward inference instruction requires I×P×M such operations, i.e. I×P×M×Tpp clock cycles. The image cache unit forwards the pooling result of a single-channel image R'(n,i,m,p) to the data control unit in Tsp clock cycles; one forward inference instruction requires I×P×M such transfers, i.e. I×P×M×Tsp clock cycles. The image splicing unit fills and re-blocks the image blocks P(n,m) in Tjp clock cycles; one forward inference instruction requires M such operations, i.e. M×Tjp clock cycles. From the forward inference instruction CNn and the number of cycles used in each step, Fp(n) is obtained:
Fp(n)=Trp+I×Typ+M×P×Tcpsp+I×P×Tscjp+G×Tjcp+I×P×M×Tpp+I×P×M×Tsp+M×Tjp
wherein the remaining quantities are defined by the expression in Figure RE-GDA0003476049870000181.
S62, constructing the number of clock cycles used by the convolutional neural network for one forward inference;
According to the number of clock cycles F(n) used for forward inference of each forward inference instruction CNn, the number of clock cycles F used by the convolutional neural network for one forward inference is calculated. The calculation formulas of F(n) and F (shown in Figure RE-GDA0003476049870000182 and Figure RE-GDA0003476049870000183) are respectively:
F(n) = F3(n) when the operation type of CNn is a 3×3 convolution operation (the condition in Figure RE-GDA0003476049870000184), F(n) = F1(n) when the operation type is a 1×1 convolution operation (Figure RE-GDA0003476049870000185), and F(n) = Fp(n) when the operation type is a pooling operation (Figure RE-GDA0003476049870000186);
F = F(1) + F(2) + ... + F(N), i.e. the sum of F(n) over the N forward inference instructions.
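Step S62 therefore reduces to selecting F3, F1 or Fp per instruction and summing over the N layers, as in this sketch; the instruction representation and the per-layer parameter dictionaries are hypothetical:

```python
# Total forward-inference cycle count F: pick the per-type model according to
# the operation type of each instruction CN_n and sum over all N instructions.
def total_cycles(instructions, f3, f1, fp):
    total = 0
    for instr in instructions:
        params = instr["params"]          # I, J, M, G, P and the T* constants
        if instr["op"] == "conv3x3":
            total += f3(**params)
        elif instr["op"] == "conv1x1":
            total += f1(**params)
        elif instr["op"] == "pool":
            total += fp(**params)
    return total
```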
S63, constructing a hardware resource constraint expression;
A hardware resource constraint expression is constructed for the number of hardware resources used by the neural network acceleration system. Each unit of the convolutional neural network acceleration system is described in the hardware description language Verilog, and the circuit of each unit and the number of hardware resources it uses are then obtained through synthesis, placement and routing. Because the resource counts of different FPGA chips are not identical, only the following hardware resources are taken as examples: DSPs, embedded block RAMs, data selectors, carry chains, registers and look-up tables. The numbers of DSPs, embedded block RAMs, data selectors, carry chains, registers and look-up tables used by the program instruction storage unit are given in Figure RE-GDA0003476049870000187; those used by the program instruction decoding unit in Figure RE-GDA0003476049870000188; those used by the data control unit in Figure RE-GDA0003476049870000189 and Figure RE-GDA00034760498700001810; those used by a single 3×3 convolution unit in Figure RE-GDA00034760498700001811; those used by a single 1×1 convolution unit in Figure RE-GDA00034760498700001812; and those used by a single pooling unit in Figure RE-GDA00034760498700001813 and Figure RE-GDA0003476049870000194. The numbers of DSPs, data selectors, carry chains, registers and look-up tables used by the parameter cache unit and the data cache unit are given in Figure RE-GDA0003476049870000195, and their embedded block RAM usage in Figure RE-GDA0003476049870000191. The numbers of DSPs, embedded block RAMs, data selectors, carry chains, registers and look-up tables used by the off-chip memory control unit are given in Figure RE-GDA0003476049870000196. The numbers of DSPs, data selectors, carry chains, registers and look-up tables used by the image splicing unit are given in Figure RE-GDA0003476049870000197, and its embedded block RAM usage in Figure RE-GDA0003476049870000198. The numbers of DSPs, data selectors, carry chains, registers and look-up tables used by the image cache unit are given in Figure RE-GDA0003476049870000199, and its embedded block RAM usage in Figure RE-GDA0003476049870000192.
According to the number of resources used by each unit, the hardware resource constraint expression for the number of hardware resources used by the neural network acceleration system is constructed as shown in Figure RE-GDA0003476049870000193,
where Ndsp represents the number of digital signal processors in the FPGA, Nbram the number of embedded block RAMs, Nmux the number of data selectors, Ncarry the number of carry chains, Nreg the number of registers, and Nlut the number of look-up tables; μdsp, μbram, μmux, μcarry, μregister and μlut are hyper-parameters whose values lie in [0,1];
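The form of the constraint can be sketched as follows; the cost vectors are placeholders for the per-unit figures obtained from synthesis, placement and routing, and the structure (fixed units plus x0, x1 and x2 copies of the three parallel units) is our reading of the optimization variables described in the next step:

```python
# Hedged sketch of the hardware resource constraint: the fixed units plus
# x0 / x1 / x2 copies of the 3x3, 1x1 and pooling units must not exceed the
# fraction mu of each FPGA resource.
RESOURCES = ("dsp", "bram", "mux", "carry", "reg", "lut")

def fits(x, fixed_cost, unit_cost, capacity, mu):
    """x = (x0, x1, x2); fixed_cost[r] covers the non-replicated units;
    unit_cost[k][r] is the cost of one 3x3 / 1x1 / pooling unit; capacity[r]
    is N_dsp .. N_lut; mu[r] is the utilisation hyper-parameter in [0, 1]."""
    for r in RESOURCES:
        used = fixed_cost[r] + sum(x[k] * unit_cost[k][r] for k in range(3))
        if used > mu[r] * capacity[r]:
            return False
    return True
```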
S64, constructing and solving a constrained optimization function F';
The number of clock cycles F used for one forward inference of the convolutional neural network obtained in step S62 and the hardware resource constraint expression obtained in step S63 together constitute a constrained optimization function F', shown in Figure RE-GDA0003476049870000201 and Figure RE-GDA0003476049870000202: F is minimized subject to the hardware resource constraints.
An optimization algorithm is used to solve the constrained optimization function F' and obtain the optimal solution for x0, x1 and x2. Under limited FPGA resources, this yields the optimal number of each type of parallel unit and thereby improves the forward inference speed of the SSD convolutional neural network; the optimization method of the invention therefore effectively improves the processing speed of the accelerator.
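The patent does not name a particular optimization algorithm; an exhaustive search over the small space of unit counts is one simple stand-in, sketched below with hypothetical helper functions:

```python
# Brute-force stand-in for "solving the constrained optimization function F'":
# enumerate candidate unit counts and keep the feasible candidate with the
# lowest forward-inference cycle count.  cycles(x) and fits(x) are placeholders
# for the models of steps S62 and S63.
def solve(max_units, cycles, fits):
    best, best_f = None, float("inf")
    for x0 in range(1, max_units + 1):
        for x1 in range(1, max_units + 1):
            for x2 in range(1, max_units + 1):
                x = (x0, x1, x2)
                if fits(x):
                    f = cycles(x)
                    if f < best_f:
                        best, best_f = x, f
    return best, best_f
```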
S65, setting the numbers of 3×3 convolution units, 1×1 convolution units and pooling units according to the optimal solution of the optimization function F'.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The convolutional neural network accelerator based on the FPGA is characterized by comprising a program instruction storage unit, a program instruction decoding unit, a data control unit, a data buffer unit, a parameter buffer unit, an off-chip memory control unit, a parallel processing unit, an image buffer unit, an image splicing unit and an off-chip storage unit, wherein the program instruction storage unit, the program instruction decoding unit, the data control unit, the data buffer unit, the parameter buffer unit, the off-chip memory control unit, the parallel processing unit, the image buffer unit and the image splicing unit are realized through the FPGA, and the off-chip storage unit is realized on a dynamic random access memory.
2. The FPGA-based convolutional neural network accelerator of claim 1, wherein the program instruction storage unit comprises a ROM in which the forward inference instruction CNn of each convolutional neural network layer Ln is stored.
3. The FPGA-based convolutional neural network accelerator of claim 2, wherein the forward inference instruction CNn is formed by sequentially splicing the binary data corresponding to the operation type of convolutional neural network layer Ln, the convolution step size, the number of convolution kernels, the number of input feature map channels, the input feature map width, the input feature map start address, the output feature map width, the neural network parameter start address, the output image block start address, the convolution offset start address and the filling mark (Figure FDA0003415011260000012).
4. The FPGA-based convolutional neural network accelerator of claim 3, wherein the program instruction decoding unit is serially cascaded between the instruction storage unit and the data control unit.
5. The FPGA-based convolutional neural network accelerator of claim 4, wherein the data control unit is connected to the parallel processing unit, the image buffer unit, the data buffer unit, the parameter buffer unit, the image stitching unit, and the off-chip memory control unit.
6. The FPGA-based convolutional neural network accelerator of claim 5, wherein two identical parameter buffers are disposed in the parameter buffer unit.
7. The FPGA-based convolutional neural network accelerator of claim 6, wherein said parallel processing units comprise a 3 x3 convolution unit, a 1 x1 convolution unit, and a pooling unit.
8. The FPGA-based convolutional neural network accelerator of claim 7, wherein two identical data buffers are disposed in the data buffer unit.
9. The FPGA-based convolutional neural network accelerator of claim 8, wherein the data buffer internally comprises an embedded block RAM whose size is given by the expression in Figure FDA0003415011260000011, in which x0 represents the number of 3×3 convolution units and W and H represent the width and height, respectively, of the image block that the parallel processing unit can process at one time.
10. An optimization method of a convolutional neural network accelerator based on an FPGA is characterized by comprising the following steps:
the method comprises the following steps: step one: obtaining, for forward inference instructions CNn of different operation types, the number of clock cycles used for forward inference;
step two: constructing the number of clock cycles used by the convolutional neural network for one forward inference;
step three: constructing a hardware resource constraint expression;
step four: constructing and solving a constrained optimization function F';
step five: setting the numbers of 3×3 convolution units, 1×1 convolution units and pooling units of claim 7 according to the optimal solution of the optimization function F'.
Publications (1)

Publication Number Publication Date
CN114186679A true CN114186679A (en) 2022-03-15



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination