
CN114186679A - Convolutional neural network accelerator based on FPGA and optimization method thereof - Google Patents

Convolutional neural network accelerator based on FPGA and optimization method thereof

Info

Publication number: CN114186679A
Application number: CN202111543413.0A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: unit, convolution, neural network, image, convolutional neural
Legal status: Pending
Inventors: 李甫, 李旭超, 付博勋
Current and original assignee: Chongqing Institute Of Integrated Circuit Innovation Xi'an University Of Electronic Science And Technology
Application filed by Chongqing Institute Of Integrated Circuit Innovation Xi'an University Of Electronic Science And Technology
Priority to CN202111543413.0A
Publication of CN114186679A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/046 Forward inferencing; Production systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The application relates to the technical field of neural network computing, and particularly provides an FPGA-based convolutional neural network accelerator and an optimization method thereof. The FPGA-based convolutional neural network accelerator comprises a program instruction storage unit, a program instruction decoding unit, a data control unit, a data buffer unit, a parameter buffer unit, a parallel processing unit, an image buffer unit, an image splicing unit, an off-chip memory control unit and an off-chip storage unit. The optimization method of the FPGA-based convolutional neural network accelerator comprises the following steps: step one, obtaining the number of clock cycles used in forward inference by forward inference instructions CN_n of different operation types; step two, constructing the number of clock cycles used by the convolutional neural network for one forward inference; step three, constructing a hardware resource constraint expression; step four, constructing and solving a constrained optimization function F'; and step five, setting the numbers of 3×3 convolution units, 1×1 convolution units and pooling units according to the optimal solution of the optimization function F'.

Description

Convolutional neural network accelerator based on FPGA and optimization method thereof
Technical Field
The application relates to the technical field of neural network computing, in particular to a convolutional neural network accelerator based on an FPGA and an optimization method thereof.
Background
With the continuous development of neural networks, network complexity keeps increasing and the demand on hardware computing power grows accordingly; the CPU, as a general-purpose processor, can no longer meet the requirements of practical applications. For this reason, the deployment of neural networks on hardware platforms has become a research hotspot. Existing deep learning frameworks such as PyTorch, Caffe and TensorFlow rely on the GPU for network acceleration. Although the GPU has strong computing power, it suffers from high price, large power consumption, high noise and large volume. The FPGA, with its advantages of low power consumption, customizability, reconfigurability, small volume and high parallelism, has become a research hotspot for hardware deployment of neural networks.
The customizable and reconfigurable characteristics of the FPGA allow developers to flexibly deploy classical convolutional neural network models such as SSD, YOLOv3 and VGG16 on the FPGA according to application requirements. The acceleration of a convolutional neural network model on the FPGA mainly depends on the number of parallel computing units: the more parallel computing units, the higher the algorithm parallelism and the higher the computing speed. In practice, the parallel computing units usually include 3×3 convolution units, pooling units, 1×1 convolution units and activation units, and under limited FPGA resources, increasing the number of one type of parallel computing unit reduces the number that can be deployed of another type. In typical practical deployments the numbers of the various parallel computing units are not matched; for example, many 3×3 convolution units but few 1×1 convolution units may be deployed, so that for a specific convolutional neural network model, such as the SSD model, forward inference is limited by the number of 1×1 convolution units and the forward inference speed of the model decreases.
The patent with publication number CN109086867A, entitled "A convolutional neural network acceleration system based on FPGA", discloses an FPGA-based convolutional neural network acceleration system comprising a data preprocessing module, a data post-processing module, a data storage module, a network model configuration module and a convolutional neural network calculation module, where the convolutional neural network calculation module comprises a convolution calculation submodule, an activation function calculation submodule, a pooling calculation submodule and a fully-connected calculation submodule. The disadvantage of this acceleration system is that its design does not consider the drop in forward inference speed of a specific convolutional neural network model caused by a mismatch in the numbers of the various convolutional neural network calculation submodules.
The patent with publication number CN112766479A, entitled "A neural network accelerator supporting channel-separable convolution based on FPGA", discloses an accelerator whose functional unit module implements pooling, activation, batch normalization and other functions. The disadvantage of this accelerator is that its functional unit module implements equal numbers of parallel computing units for the various functions; it cannot flexibly configure the number of parallel computing units of each function and does not consider the drop in forward inference speed of a specific convolutional neural network model caused by a mismatch in the numbers of parallel computing units of the various functions.
Thanks to the reconfigurability of an FPGA-based convolutional neural network accelerator, the numbers of the various parallel computing units can be reconfigured for a specific convolutional neural network model, such as the SSD model, so that the model achieves a higher forward inference speed. However, existing accelerators and optimization methods do not address the decrease in forward inference speed caused by mismatched numbers of the various parallel computing units.
Disclosure of Invention
The present invention aims to provide an FPGA-based convolutional neural network accelerator and an optimization method thereof, so as to solve the problem in the prior art that the forward inference speed of a convolutional neural network model decreases because the numbers of the various parallel computing units are not matched.
In order to achieve this purpose, the technical solution adopted by the invention is as follows:
the application provides a convolutional neural network accelerator based on an FPGA and an optimization method thereof. The convolutional neural network accelerator based on the FPGA comprises a program instruction storage unit, a program instruction decoding unit, a data control unit, a data buffer unit, a parameter buffer unit, an off-chip storage unit, a parallel processing unit, an image buffer unit, an image splicing unit and an off-chip storage unit which are realized on a dynamic random access memory through the FPGA.
Further, the program instruction storage unit includes a ROM in which the forward inference instruction CN_n of convolutional neural network layer L_n is stored.
Further, the forward inference instruction CN_n is formed by sequentially concatenating the binary data corresponding to the operation type, convolution stride, number of convolution kernels, number of input feature map channels, input feature map width, input feature map start address, output feature map width, neural network parameter start address, output image block start address, convolution bias start address and padding flag of convolutional neural network layer L_n.
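For illustration, the following sketch packs example field values into a single instruction word by concatenating fixed-width binary fields in the order listed above; the field names and bit widths are assumptions made for the example, since the patent does not state them.

```python
# Illustrative sketch only: field names and bit widths are assumed, not taken
# from the patent. It shows how CN_n could be built by concatenating the
# binary data of each field in the stated order.
FIELDS = [                      # (name, assumed bit width)
    ("op_type", 2), ("stride", 2), ("num_kernels", 12),
    ("in_channels", 12), ("in_width", 12), ("in_addr", 32),
    ("out_width", 12), ("param_addr", 32), ("out_block_addr", 32),
    ("bias_addr", 32), ("pad_flag", 1),
]

def encode_instruction(values: dict) -> int:
    """Concatenate the fields of one forward inference instruction CN_n."""
    word = 0
    for name, width in FIELDS:
        v = values[name]
        assert 0 <= v < (1 << width), f"{name} does not fit in {width} bits"
        word = (word << width) | v          # append the field as binary data
    return word

cn_n = encode_instruction({
    "op_type": 0, "stride": 1, "num_kernels": 64, "in_channels": 3,
    "in_width": 300, "in_addr": 0x0000_0000, "out_width": 300,
    "param_addr": 0x0010_0000, "out_block_addr": 0x0020_0000,
    "bias_addr": 0x0030_0000, "pad_flag": 1,
})
print(hex(cn_n))
```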
Furthermore, the program instruction decoding unit is sequentially cascaded between the instruction storage unit and the data control unit.
Furthermore, the data control unit is connected with the parallel processing unit, the image buffer unit, the data buffer unit, the parameter buffer unit, the image splicing unit and the off-chip memory control unit.
Furthermore, two identical parameter buffers are arranged in the parameter cache unit.
Further, the parallel processing unit includes a 3 × 3 convolution unit, a 1 × 1 convolution unit, and a pooling unit.
Furthermore, two identical data buffers are arranged in the data buffer unit.
Further, the data buffer contains an embedded block RAM whose size is determined by x_0, W and H, where x_0 denotes the number of 3×3 convolution units and W, H denote the width and height, respectively, of the image block that the parallel processing unit can process at one time.
The method for optimizing the FPGA-based convolutional neural network accelerator comprises the following steps:
Step one: obtaining the number of clock cycles used in forward inference by forward inference instructions CN_n of different operation types;
Step two: constructing the number of clock cycles used by the convolutional neural network for one forward inference;
Step three: constructing a hardware resource constraint expression;
Step four: constructing and solving a constrained optimization function F';
Step five: setting the numbers of 3×3 convolution units, 1×1 convolution units and pooling units according to the optimal solution of the optimization function F'.
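Read together, the five steps amount to a small constrained integer optimization over the unit counts (x0, x1, x2). The sketch below illustrates that reading only; the cycle model, resource costs and brute-force search are assumptions, not the patent's formulas.

```python
# Hedged sketch of steps one to five: choose the unit counts (x0, x1, x2) that
# minimize the clock cycles of one forward inference under a hardware resource
# constraint. The cycle model, resource costs and search are assumptions made
# for illustration.
from itertools import product

# Steps one/two (assumed model): the cycles spent on each operation type shrink
# roughly in proportion to the number of matching parallel units.
WORK = {"conv3x3": 4.0e8, "conv1x1": 1.5e8, "pool": 5.0e7}  # assumed cycles at 1 unit each

def total_cycles(x0: int, x1: int, x2: int) -> float:
    """F'(x0, x1, x2): clock cycles of one forward inference (assumed model)."""
    return WORK["conv3x3"] / x0 + WORK["conv1x1"] / x1 + WORK["pool"] / x2

# Step three (assumed costs): per-unit resource usage and the FPGA budget.
DSP_PER  = {"conv3x3": 9, "conv1x1": 1, "pool": 0}
BRAM_PER = {"conv3x3": 2, "conv1x1": 1, "pool": 1}
DSP_TOTAL, BRAM_TOTAL = 900, 1000

def feasible(x0: int, x1: int, x2: int) -> bool:
    dsp  = DSP_PER["conv3x3"] * x0 + DSP_PER["conv1x1"] * x1 + DSP_PER["pool"] * x2
    bram = BRAM_PER["conv3x3"] * x0 + BRAM_PER["conv1x1"] * x1 + BRAM_PER["pool"] * x2
    return dsp <= DSP_TOTAL and bram <= BRAM_TOTAL

# Steps four/five: solve the constrained problem (here by brute force) and use
# the optimal solution as the numbers of 3x3, 1x1 and pooling units to deploy.
best = min(
    (c for c in product(range(1, 65), repeat=3) if feasible(*c)),
    key=lambda c: total_cycles(*c),
)
print("optimal (x0, x1, x2):", best, "cycles:", total_cycles(*best))
```

In a real deployment the workload terms would come from the measured clock cycles of the CN_n instructions (step one) and the resource costs from the FPGA synthesis report (step three).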
Compared with the prior art, the invention has the following beneficial effects:
The accelerator divides the feature map to be input into blocks and further splits each block into single-channel images, so as to make full use of FPGA resources and increase the running speed of the accelerator; the convolution operation is likewise first blocked and then split into single-channel convolution kernels, which makes full use of FPGA resources and increases the running speed of the accelerator.
By constructing an FPGA hardware resource constraint expression, the invention maximizes the numbers of the various parallel processing units of the convolutional neural network without exceeding the total hardware resources of the FPGA, thereby increasing the operation speed.
The optimization step yields the corresponding optimal numbers of 3×3 convolution units, 1×1 convolution units and pooling units, fully taking the matching of the numbers of the various parallel computing units into account; it thus solves the problem that the forward inference speed of a convolutional neural network model decreases when this mismatch is ignored, improving the accelerator's processing speed for the convolutional neural network.
Drawings
FIG. 1 is a schematic diagram of the FPGA-based convolutional neural network accelerator of the present invention;
FIG. 2 is a schematic diagram of the FPGA-based convolutional neural network accelerator performing the 3×3 convolution operation of step S31 according to the present invention;
FIG. 3 is a schematic diagram of the FPGA-based convolutional neural network accelerator performing the 1×1 convolution operation of step S32 according to the present invention;
FIG. 4 is a schematic diagram of the FPGA-based convolutional neural network accelerator performing the pooling operation of step S33 according to the present invention;
FIG. 5 is a schematic diagram of the storage manner of an input feature map in the off-chip storage unit.
Detailed Description
In order to make the implementation of the present invention clearer, the following detailed description is made with reference to the accompanying drawings.
The invention provides a convolutional neural network accelerator based on an FPGA and an optimization method thereof. Fig. 1 is a schematic diagram of a convolutional neural network accelerator based on an FPGA according to the present invention.
The hardware of the FPGA-based convolutional neural network accelerator comprises an FPGA and a dynamic random access memory; the dynamic random access memory can be one of DDR3, DDR4, SDRAM and the like, and here DDR3 is used. As shown in fig. 1, the FPGA-based convolutional neural network accelerator provided by the present invention includes a program instruction storage unit, a program instruction decoding unit, a data control unit, a data buffer unit, a parameter buffer unit, a parallel processing unit, an image buffer unit, an image splicing unit, an off-chip memory control unit and an off-chip storage unit. The program instruction storage unit, program instruction decoding unit, data control unit, data buffer unit, parameter buffer unit, parallel processing unit, image buffer unit and image splicing unit are implemented on the FPGA; the parallel processing unit comprises x_0 3×3 convolution units, x_1 1×1 convolution units and x_2 pooling units; the data buffer unit includes a pair of data buffers and the parameter buffer unit includes a pair of parameter buffers; the off-chip storage unit is implemented with DDR3. The program instruction storage unit, the program instruction decoding unit and the data control unit are cascaded in sequence to read, decode and transmit the forward inference instructions; the data control unit is connected with the parallel processing unit, the image buffer unit, the data buffer unit, the parameter buffer unit, the image splicing unit and the off-chip memory control unit to distribute instructions and data; the parallel processing unit is placed between the data control unit and the image buffer unit; the image buffer unit is placed between the parallel processing unit and the data control unit; the parameter buffer unit is connected between the data control unit and the off-chip memory control unit; the data buffer unit is connected between the data control unit and the off-chip memory control unit; the image splicing unit and the data control unit are interconnected; and the off-chip memory control unit is interconnected with the off-chip storage unit and connected with the parameter buffer unit, the data control unit and the data buffer unit.
The feature map to be input and the information of the convolutional neural network to be accelerated can be obtained from other ports, or can be stored in a storage device through an external interface. More specifically, the feature map to be input is stored in the off-chip storage unit (the specific storage manner and the arrangement of the off-chip storage unit are described later in this embodiment); the parameters and related information of the convolutional neural network are also stored in the off-chip storage unit. The forward inference instructions of the convolutional neural network are stored in the program instruction storage unit.
When the accelerator runs, the program instruction storage unit reads a forward inference instruction of the convolutional neural network and outputs it to the program instruction decoding unit. The program instruction decoding unit divides the feature map to be input into image blocks to be input, decodes the forward inference instruction according to the division result, and sends the decoding result to the data control unit. The data control unit divides an image block to be input into single-channel images and single-channel convolution kernels in channel order (for convolution operations), or further divides it into several small-size single-channel images in image-column order (for pooling operations), reads the single-channel image, small-size single-channel image or single-channel convolution kernel from the off-chip storage unit according to the division result, and also sends the multiply-accumulate result of the previous single-channel image into the parallel processing unit. The parallel processing unit performs a 3×3 convolution or 1×1 convolution on an input single-channel image and single-channel convolution kernel, or a pooling operation on an input small-size single-channel image, and sends the operation result to the image buffer unit; when all channels of the current image block have been processed, the next image block is processed, and the image buffer unit sends the convolution or pooling result to the image splicing unit. The image splicing unit splices the image blocks and re-divides them, and sends the blocking result to the data control unit. The data control unit stores the blocking result into the off-chip storage unit through the data buffer unit and the off-chip memory control unit.
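The key point of this dataflow, that a multi-channel convolution can be computed as single-channel convolutions whose partial results are accumulated channel by channel, can be checked numerically with the sketch below; it is an arithmetic illustration with made-up sizes, not a model of the hardware.

```python
# Numerical check (illustration only): a multi-channel 3x3 convolution equals
# the channel-by-channel accumulation of single-channel convolutions, which is
# how the data control unit feeds the parallel processing unit one channel at
# a time and passes the previous partial result along.
import numpy as np

def conv3x3_single_channel(img, ker):
    """Valid 3x3 convolution of one channel (img: H x W, ker: 3 x 3)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for r in range(h - 2):
        for c in range(w - 2):
            out[r, c] = np.sum(img[r:r + 3, c:c + 3] * ker)
    return out

rng = np.random.default_rng(0)
image  = rng.standard_normal((8, 16, 16))   # C x H x W image block (example sizes)
kernel = rng.standard_normal((8, 3, 3))     # one 3x3 kernel with C channels
bias   = 0.5

# Channel-by-channel accumulation over successive single-channel images.
partial = np.zeros((14, 14))
for m in range(image.shape[0]):
    partial += conv3x3_single_channel(image[m], kernel[m])
result = np.maximum(partial + bias, 0.0)     # add the bias on the last channel, then ReLU

# Reference: convolve all channels at once.
reference = np.zeros((14, 14))
for r in range(14):
    for c in range(14):
        reference[r, c] = np.sum(image[:, r:r + 3, c:c + 3] * kernel)
reference = np.maximum(reference + bias, 0.0)
assert np.allclose(result, reference)
```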
The convolutional neural network accelerator based on the FPGA can be used for accelerating different convolutional neural network models only by adjusting the setting of the parallel processing unit, and the embodiment of the invention is explained by taking the SSD convolutional neural network model as an example.
The program instruction storage unit includes a ROM in which the forward inference instructions of the N network layers L = {L_1, L_2, ..., L_n, ..., L_N} of the convolutional neural network are stored. A forward inference instruction CN_n is designed for each network layer L_n; CN_n is formed by sequentially concatenating the binary data corresponding to the operation type, convolution stride, number of convolution kernels, number of input feature map channels, input feature map width, input feature map start address, output feature map width, neural network parameter start address, output image block start address, convolution bias start address and padding flag of network layer L_n, forming the forward inference instruction of network layer L_n. The N network layers of the convolutional neural network have N forward inference instructions CN = {CN_1, CN_2, ..., CN_n, ..., CN_N}. The program instruction storage unit reads each forward inference instruction of the convolutional neural network from the ROM one by one and sends it to the program instruction decoding unit.
The program instruction decoding unit is cascaded in sequence between the instruction storage unit and the data control unit and is responsible for receiving the forward inference instructions sent by the instruction storage unit. According to the operation type in the forward inference instruction, an instruction corresponding to a convolution operation is successively decoded into the input start address of the image block to be input, the input start address of the convolution kernel to be input and the bias address, and an instruction corresponding to a pooling operation is successively decoded into the input start address of the image block to be input. The decoding result is sent to the data control unit, and the next decoding is performed after the data control unit completes the corresponding operation. An instruction corresponding to a convolution operation is decoded I×J times, and an instruction corresponding to a pooling operation is decoded I times (I and J are detailed in the following description).
Decoding of a convolution (both 3×3 and 1×1) operation mainly consists of blocking the feature map to be input, blocking the convolution kernels to be input, and then calculating the storage address of each blocked image block to be input, the storage address of each convolution kernel to be input and the storage address of the bias parameters of the convolution operation. This avoids the situation in which the input feature map and convolution kernel parameters are read first and then blocked, which would occupy a large amount of embedded block RAM resources in the FPGA; once the embedded block RAM resources are exhausted, a large amount of DSP resources in the FPGA remain unused, so that the computational parallelism of the SSD convolutional neural network model is low and its forward inference speed is slow. The accelerator of the present invention therefore processes convolution operations faster. Decoding of a pooling operation mainly consists of blocking the feature map to be input and then calculating the storage address of each blocked image block to be input, which avoids the same embedded block RAM problem; the accelerator of the present invention therefore also processes pooling operations faster.
The specific steps of decoding are as follows:
S11: the program instruction decoding unit divides the feature map R_n to be input to the n-th network layer into I image blocks R_n = {R_(n,1), R_(n,2), ..., R_(n,i), ..., R_(n,I)} of size C_n × W × H, and calculates the input start address IA_(n,i) of the i-th image block R_(n,i) from the start address of the input feature map. Here ⌈·⌉ denotes the rounding-up operation, W and H denote the width and height, respectively, of an image block that the parallel processing unit can process at one time, R_(n,i) denotes the i-th image block, C_n denotes the number of channels of the input feature map R_n, and W_n denotes the width of the input feature map R_n (the width and height of R_n are equal). W and H can be set reasonably according to different convolutional neural network models, to avoid the problem that, when the forward inference of the convolutional neural network model reaches a deeper network layer, the width of the input feature map R_n becomes smaller than W and part of the 3×3 convolution units cannot be used effectively. By setting W and H reasonably, the utilization of the 3×3 convolution units can be improved, so that the forward inference speed of the SSD convolutional neural network model, and hence the processing speed of the accelerator, is higher.
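As a rough illustration of S11, the sketch below tiles a square feature map into W×H blocks and assigns each block a start address; the block count and the linear layout used here are assumptions chosen to match the channel-ordered storage described later, not the patent's exact formulas.

```python
# Illustrative sketch of S11 (assumed layout): tile a square feature map of
# width Wn and Cn channels into I blocks of W x H and compute a start address
# for each block, assuming blocks are stored one after another.
import math

def tile_feature_map(Cn, Wn, W, H, base_addr):
    """Return [(block_index, start_address)] for the I image blocks of R_n."""
    blocks_x = math.ceil(Wn / W)          # ceil: the rounding-up operation
    blocks_y = math.ceil(Wn / H)
    I = blocks_x * blocks_y
    block_words = Cn * W * H              # words occupied by one Cn x W x H block
    return [(i + 1, base_addr + i * block_words) for i in range(I)]

# Example: a 300 x 300, 64-channel feature map tiled into 38 x 38 blocks.
for idx, addr in tile_feature_map(Cn=64, Wn=300, W=38, H=38, base_addr=0x1000)[:4]:
    print(f"R_(n,{idx}) starts at {hex(addr)}")
```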
S12: the program instruction decoding unit judges the operation type of the n-th network layer according to the operation type field in CN_n. When the operation type is the 3×3 convolution operation, step S13 is performed; when the operation type is the 1×1 convolution operation, step S14 is performed; when the operation type is the pooling operation, step S15 is performed. Only one of steps S13, S14 and S15 is executed under one forward inference instruction.
S13: the program instruction decoding unit divides the 3×3 convolution kernels to be input to layer L_n into J groups, which can be expressed as K_(3,n) = {K_(3,n,1), K_(3,n,2), ..., K_(3,n,j), ..., K_(3,n,J)}, calculates the input start address KA_(3,n,j) of the j-th convolution kernel group K_(3,n,j) to be input, and at the same time calculates the bias address BA_(3,n,i,j) of the 3×3 convolution operation, and sends IA_(n,i), BA_(3,n,i,j) and KA_(3,n,j) to the data control unit as the ((i-1)×J+j)-th decoding result of CN_n. Here ⌊·⌋ denotes the rounding-down operation, ⌈·⌉ denotes the rounding-up operation, x_0 denotes the number of 3×3 convolution units, C_n denotes the number of channels of the input feature map, and the convolution bias start address is taken from CN_n. The number of kernels in each group, namely the number of 3×3 convolution kernels that the parallel processing unit can accept at one time, is set reasonably according to the number of 3×3 convolution units and W; setting it in this way makes the division of the convolution kernels to be input to layer L_n more reasonable, so that the parallel processing unit can process more 3×3 convolution kernels at one time, which further accelerates the forward inference of the SSD convolutional neural network model and makes the processing speed of the accelerator higher.
S14: the program instruction decoding unit divides the 1×1 convolution kernels to be input to layer L_n into J groups, which can be expressed as K_(1,n) = {K_(1,n,1), K_(1,n,2), ..., K_(1,n,j), ..., K_(1,n,J)}, calculates the input start address KA_(1,n,j) of the j-th convolution kernel group K_(1,n,j) to be input, and at the same time calculates the bias address BA_(1,n,i,j) of the 1×1 convolution operation, and sends IA_(n,i), BA_(1,n,i,j) and KA_(1,n,j) to the data control unit as the ((i-1)×J+j)-th decoding result of CN_n. Here x_1 denotes the number of 1×1 convolution units, C_n denotes the number of channels of the input feature map, and the convolution bias start address is taken from CN_n. The number of kernels in each group, namely the number of 1×1 convolution kernels that the parallel processing unit can accept at one time, is set reasonably according to the number of 1×1 convolution units and W; setting it in this way makes the division of the convolution kernels to be input to layer L_n more reasonable, so that the parallel processing unit can process more 1×1 convolution kernels at one time, which further accelerates the forward inference of the SSD convolutional neural network model and makes the processing speed of the accelerator higher.
S15: the program instruction decoding unit sends IA_(n,i) to the data control unit as the i-th decoding result of the n-th forward inference instruction CN_n. Since the pooling operation only needs the input feature map to obtain the pooling result and requires no other parameters, the pooling operation only requires dividing the input feature map R_n.
The data control unit is connected with the program instruction decoding unit, the parallel processing unit, the image cache unit, the data buffer unit, the parameter buffer unit, the image splicing unit and the off-chip memory control unit and is used for receiving an instruction decoding result of the program instruction decoding unit and forwarding related addresses and data.
According to the instruction decoding result of the program instruction decoding unit, for a convolution operation the image block to be input is divided into single-channel images to be input in channel order, and the convolution kernels to be input are divided into single-channel convolution kernels to be input in channel order. Although the feature map R_n to be input is already blocked in the row and column dimensions by the program instruction decoding unit, if the blocked image block R_(n,i) to be input were fetched from the off-chip storage unit directly, it would still occupy a large amount of embedded block RAM resources in the FPGA; the embedded block RAM resources would be exhausted while a large amount of DSP resources remained unused, so that the computational parallelism of the SSD convolutional neural network model would be low and its forward inference speed slow. By dividing the image block to be input into single-channel images to be input, the forward inference speed of the SSD convolutional neural network model can be improved, so that the processing speed of the accelerator is higher.
The single-channel image is fetched from the off-chip storage unit through the data buffer unit and the off-chip memory control unit; the single-channel convolution kernel is fetched from the off-chip storage unit through the parameter buffer unit and the off-chip memory control unit. The multiply-accumulate result of the previous single-channel image is fetched from the image buffer unit, and the blocked single-channel image, the grouped single-channel convolution kernel and the multiply-accumulate result of the previous single-channel image are sent to the parallel processing unit. When the first single-channel image is processed, there is no multiply-accumulate result of a previous single-channel image, so only the blocked single-channel image and the grouped single-channel convolution kernel are sent to the parallel processing unit. The settings of the data buffer unit, the parameter buffer unit, the off-chip memory control unit, the image buffer unit and the parallel processing unit are described later in this embodiment and are not detailed here.
For a pooling operation, according to the instruction decoding result of the program instruction decoding unit, the single-channel image is fetched from the off-chip storage unit through the data buffer unit and the off-chip memory control unit, divided again along the column direction, and the resulting small-size single-channel images are sent to the parallel processing unit. In this way the number of pooling units in the parallel processing unit is not limited by W and can be set flexibly; the data control unit can send small-size single-channel images into the parallel processing unit within one clock cycle according to the number of pooling units, improving the parallelism of the pooling operation, so forward inference instructions whose operation type is pooling execute faster and the accelerator processes pooling operations faster.
The method comprises the following specific steps:
S21: the i-th image block R_(n,i) to be input is divided channel-wise into M single-channel images of size 1×W×H, R_(n,i) = {R_(n,i,1), R_(n,i,2), ..., R_(n,i,m), ..., R_(n,i,M)}, where M is the number of channels of the image block, and the input start address IA_(n,i,m) of R_(n,i,m) is calculated; by sending IA_(n,i,m) to the off-chip memory control unit, R_(n,i,m) is read from the off-chip storage unit, giving R'_(n,i,m). IA_(n,i,m) is expressed as: IA_(n,i,m) = IA_(n,i) + W×H×(m-1).
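The address arithmetic of S21 can be written out directly; the sketch below restates the formula above in code, with example values for the block parameters.

```python
# Sketch of S21: single-channel image addresses inside one image block, following
# IA_(n,i,m) = IA_(n,i) + W*H*(m-1). The block parameters are example values only.
def single_channel_addresses(ia_block, W, H, M):
    """Start addresses of the M single-channel 1 x W x H images of block R_(n,i)."""
    return [ia_block + W * H * (m - 1) for m in range(1, M + 1)]

addrs = single_channel_addresses(ia_block=0x2000, W=38, H=38, M=64)
print([hex(a) for a in addrs[:3]])   # addresses of R_(n,i,1), R_(n,i,2), R_(n,i,3)
```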
S22, according to different operation types, different steps are executed. When the operation type is a 3 × 3 convolution operation, performing step S23, when the operation type is a 1 × 1 convolution operation, performing step S24, and when the operation type is a pooling operation, performing step S25;
S23: the j-th group of 3×3 convolution kernels K_(3,n,j) to be input is divided in channel order into M single-channel kernel groups, which can be expressed as K_(3,n,j) = {K_(3,n,j,1), K_(3,n,j,2), ..., K_(3,n,j,m), ..., K_(3,n,j,M)}; the input start address KA_(3,n,j,m) of the m-th convolution kernel group K_(3,n,j,m) to be input is calculated, and by sending KA_(3,n,j,m) to the off-chip memory control unit, K_(3,n,j,m) is read from the off-chip storage unit, giving K'_(3,n,j,m). Meanwhile, when m = 1, the single-channel image R'_(n,i,m) and the single-channel 3×3 convolution kernel K'_(3,n,j,m) are sent to the parallel processing unit; when M > m > 1, the convolution result P_(n,i,m-1) of the (m-1)-th single-channel image R'_(n,i,m-1) stored in the image buffer unit, the single-channel image R'_(n,i,m) and the single-channel 3×3 convolution kernel K'_(3,n,j,m) are sent to the parallel processing unit; when m = M, the bias address BA_(3,n,i,j) of the 3×3 convolution operation is sent to the off-chip memory control unit, B_(3,n,i,j) is read from the off-chip storage unit, giving B'_(3,n,i,j), and the convolution result P_(n,i,m-1) of the (m-1)-th single-channel image R'_(n,i,m-1) stored in the image buffer unit, the single-channel image R'_(n,i,m), the single-channel 3×3 convolution kernel K'_(3,n,j,m) and B'_(3,n,i,j) are sent to the parallel processing unit.
S24: the j-th group of 1×1 convolution kernels K_(1,n,j) to be input is divided in channel order into M single-channel kernel groups, which can be expressed as K_(1,n,j) = {K_(1,n,j,1), K_(1,n,j,2), ..., K_(1,n,j,m), ..., K_(1,n,j,M)}; the input start address KA_(1,n,j,m) of the m-th convolution kernel group K_(1,n,j,m) to be input is calculated, and by sending KA_(1,n,j,m) to the off-chip memory control unit, K_(1,n,j,m) is read from the off-chip storage unit, giving K'_(1,n,j,m). Meanwhile, when m = 1, the single-channel image R'_(n,i,m) and the single-channel 1×1 convolution kernel K'_(1,n,j,m) are sent to the parallel processing unit; when M > m > 1, the convolution result P_(n,i,m-1) of the (m-1)-th single-channel image R'_(n,i,m-1) stored in the image buffer unit, the single-channel image R'_(n,i,m) and the single-channel 1×1 convolution kernel K'_(1,n,j,m) are sent to the parallel processing unit; when m = M, the bias address BA_(1,n,i,j) of the 1×1 convolution operation is sent to the off-chip memory control unit, B_(1,n,i,j) is read from the off-chip storage unit, giving B'_(1,n,i,j), and the convolution result P_(n,i,m-1) of the (m-1)-th single-channel image R'_(n,i,m-1) stored in the image buffer unit, the single-channel image R'_(n,i,m), the single-channel 1×1 convolution kernel K'_(1,n,j,m) and B'_(1,n,i,j) are sent to the parallel processing unit.
S25: the single-channel image R_(n,i,m) is divided in image-column order into P small-size single-channel images of size 1×W×PN, R_(n,i,m) = {R_(n,i,m,1), R_(n,i,m,2), ..., R_(n,i,m,p), ..., R_(n,i,m,P)}, and the input start address IA_(n,i,m,p) of R_(n,i,m,p) is calculated; by sending IA_(n,i,m,p) to the off-chip memory control unit, R_(n,i,m,p) is read from the off-chip storage unit, giving R'_(n,i,m,p). The input start address IA_(n,i,m,p) of R_(n,i,m,p) can be expressed as:
IA_(n,i,m,p) = IA_(n,i,m) + 1×W×PN×(p-1)
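The column-wise split of S25 can be written out in the same way; the sketch below restates the formula above, with W, PN and P chosen as example values.

```python
# Sketch of S25: splitting a single-channel image into P small-size images of
# 1 x W x PN along the column direction, following
# IA_(n,i,m,p) = IA_(n,i,m) + 1*W*PN*(p-1). Example values only.
def small_image_addresses(ia_channel, W, PN, P):
    """Start addresses of the P small-size single-channel images of R_(n,i,m)."""
    return [ia_channel + 1 * W * PN * (p - 1) for p in range(1, P + 1)]

print([hex(a) for a in small_image_addresses(ia_channel=0x2000, W=38, PN=2, P=19)])
```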
The parallel processing unit comprises x_0 3×3 convolution units, x_1 1×1 convolution units and x_2 pooling units; it loads the forward inference instruction and the image data input by the data control unit and, according to the instruction, inputs W image data into the x_0 3×3 convolution units or the x_1 1×1 convolution units, or W×PN image data into the x_2 pooling units, in one clock cycle.
The processing of the single-channel image is performed in the parallel processing unit. According to the forward inference instruction CN_n, a 3×3 convolution operation (step S31) or a 1×1 convolution operation (step S32) is performed on the single-channel image R'_(n,i,m) to obtain its processing result P_(n,i,m), or a pooling operation (step S33) is performed on the P small-size single-channel images R'_(n,i,m,p) to obtain the processing result P_(n,i,m) of the single-channel image R'_(n,i,m), and P_(n,i,m) is sent to the image buffer unit. The parallel processing unit inputs W image data into the x_0 3×3 convolution units or the x_1 1×1 convolution units in one clock cycle, and inputs W×PN image data into the x_2 pooling units in one clock cycle.
S31, the 3×3 convolution operation:
The 3×3 convolution units in the parallel processing unit perform a 3×3 convolution operation and a ReLU activation operation on the single-channel image R'_(n,i,m). Fig. 2 is a schematic diagram of the 3×3 convolution operation. In each clock cycle a single 3×3 convolution unit takes from the single-channel image R'_(n,i,m) three image data in the same column of three rows; the image data input in this clock cycle are convolved together with the image data input in the following two clock cycles. The value cnt of the channel counter is then judged: if cnt = 1, 0 is added to the convolution result and the sum is output; if cnt > 1 and cnt has not reached the last channel, the multiply-accumulate result of the previous channel is added to the convolution result and the sum is output; if cnt corresponds to the last channel, the convolution bias and the multiply-accumulate result of the previous channel are added to the convolution result, and the sum is passed through ReLU activation and output as the convolution result. Here the 3×3 convolution unit needs only one clock cycle to take in three image data of three rows and the same column of the single-channel image R'_(n,i,m), so the convolution of 9 image data can be completed in 3 clock cycles, whereas a systolic array needs 9 clock cycles for the same operation; the 3×3 convolution unit therefore computes the 3×3 convolution faster than a systolic array, and the processing speed of the accelerator is higher.
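The per-channel control just described can be mimicked as below: a hypothetical helper stands in for one 3×3 convolution unit that takes one column of three pixels per cycle, accumulates the previous channel's partial sum, and applies the bias and ReLU only on the last channel. This is an arithmetic sketch, not the RTL.

```python
# Illustration (assumed arithmetic, not RTL): a 3x3 convolution unit receives one
# column of three pixels per clock cycle and finishes the 9 multiply-accumulates
# of one output pixel after 3 cycles, instead of the 9 cycles a systolic array
# would need for the same output.
def conv3x3_unit_output(columns, kernel_cols, partial_in, bias=None):
    """columns / kernel_cols: three columns of 3 values each.
    partial_in: multiply-accumulate result of the previous channel (0 when cnt == 1).
    bias: convolution bias, supplied only on the last channel (then ReLU is applied)."""
    acc = partial_in
    for cycle in range(3):                      # one column of 3 image data per cycle
        acc += sum(p * k for p, k in zip(columns[cycle], kernel_cols[cycle]))
    if bias is not None:                        # last channel: add bias, ReLU activation
        acc = max(acc + bias, 0.0)
    return acc

# Example with 3 channels: middle channels pass their partial sums on, the last adds bias.
imgs = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],      # channel 1 (3 columns of 3 pixels)
        [[1, 0, 1], [0, 1, 0], [1, 0, 1]],      # channel 2
        [[2, 2, 2], [2, 2, 2], [2, 2, 2]]]      # channel 3
kers = [[[1, 0, 0], [0, 1, 0], [0, 0, 1]]] * 3  # one single-channel 3x3 kernel per channel
partial = 0.0
for m, (img, ker) in enumerate(zip(imgs, kers), start=1):
    partial = conv3x3_unit_output(img, ker, partial, bias=-1.0 if m == 3 else None)
print(partial)
```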
S32, the 1×1 convolution operation:
The 1×1 convolution units in the parallel processing unit perform a 1×1 convolution operation and a ReLU activation operation on the single-channel image R'_(n,i,m). Fig. 3 is a schematic diagram of the 1×1 convolution operation. In each clock cycle a single 1×1 convolution unit takes one image datum of one row from the single-channel image R'_(n,i,m). The value cnt of the channel counter is then judged: if cnt = 1, 0 is added to the convolution result and the sum is output; if cnt > 1 and cnt has not reached the last channel, the multiply-accumulate result of the previous channel is added to the convolution result and the sum is output; if cnt corresponds to the last channel, the convolution bias and the multiply-accumulate result of the previous channel are added to the convolution result, and the sum is passed through ReLU activation and output.
S33, the pooling operation:
The pooling units in the parallel processing unit perform a pooling operation on a small-size single-channel image R'_(n,i,m,p). Fig. 4 is a schematic diagram of the pooling operation. In each clock cycle a single pooling unit takes from the small-size single-channel image R'_(n,i,m,p) two image data in the same column of two rows and subtracts one from the other; the sign bit of the subtraction result is input to the data selector MUX as the selection signal, selecting the maximum max1 of the two image data. The maximum max1 computed in this clock cycle is then subtracted from the maximum max2 of the two image data input in the next clock cycle; the sign bit of the subtraction result is input to the data selector MUX as the selection signal, selecting the maximum max3 of max1 and max2, and max3 = max(max1, max2) is output, where max(·) denotes the maximum value.
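The sign-bit/MUX selection described above can be mimicked arithmetically as follows; this is an illustration of the selection logic, not the RTL.

```python
# Arithmetic illustration of the pooling unit: select the maximum of two values via
# the sign bit of their difference, then combine two such maxima, so that a 2x2
# window yields max3 = max(max1, max2) over two clock cycles.
def mux_max(a: float, b: float) -> float:
    sign_negative = (a - b) < 0          # sign bit of the subtraction result
    return b if sign_negative else a     # data selector MUX driven by the sign bit

def pool_2x2(col_t0, col_t1):
    """col_t0, col_t1: (upper, lower) pixels of two rows, this cycle and the next."""
    max1 = mux_max(*col_t0)              # maximum of the 2 image data in this cycle
    max2 = mux_max(*col_t1)              # maximum of the 2 image data in the next cycle
    return mux_max(max1, max2)           # max3 = max(max1, max2)

print(pool_2x2((0.3, 1.7), (0.9, -0.2)))   # -> 1.7
```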
The image splicing unit internally contains an embedded block RAM of size PC×PW×PH, which is used to buffer the result of image splicing.
The image blocks are spliced, and the spliced image blocks are re-blocked according to the next forward inference instruction CN_(n+1); splicing first makes it convenient to perform the zero-value padding operation on the image blocks. The data control unit takes the spliced and re-blocked image blocks out of the image splicing unit and stores them into the off-chip storage unit through the data buffer unit and the off-chip memory control unit.
According to the operation type of the forward inference instruction CN_n: if it is a 3×3 convolution operation, step S41 is performed; if it is a 1×1 convolution operation, step S42 is performed; if it is a pooling operation, step S43 is performed.
S41: splicing of the convolution results of the I input image blocks {R'_(n,1), R'_(n,2), ..., R'_(n,i), ..., R'_(n,I)} with the 3×3 convolution kernel group K'_(3,n,j). The image splicing unit receives from the data control unit the convolution result of an input image block with the 3×3 convolution kernel group K'_(3,n,j), the sequence number i of the image block and the sequence number j of the 3×3 convolution kernel group K'_(3,n,j). Each time the convolutions of the I input image blocks with the kernel group K_(3,n,j) have been computed, the image splicing unit receives a splicing start signal from the data control unit and completes the image splicing; after the image splicing is completed, it receives the convolution results of the next kernel group K_(3,n,j+1) with the I input image blocks and splices them. During splicing, the results are spliced into an image block P_(n,j) whose size is determined by the convolution stride. Whether a zero-value padding operation needs to be performed on the image block P_(n,j) is determined by the next forward inference instruction CN_(n+1): if the padding flag of CN_(n+1) is set, a zero-value padding operation is performed on P_(n,j); otherwise no padding operation is performed on P_(n,j). The image block P_(n,j) is then divided into G image blocks P_(n,j) = {P_(n,j,1), P_(n,j,2), ..., P_(n,j,g), ..., P_(n,j,G)}, and the blocks P_(n,j,g) are sent to the data control unit in sequence.
S42: splicing of the convolution results of the I image blocks {R'_(n,1), R'_(n,2), ..., R'_(n,i), ..., R'_(n,I)} with the 1×1 convolution kernel group K'_(1,n,j). The image splicing unit receives from the data control unit the convolution result of an input image block with the 1×1 convolution kernel group K'_(1,n,j), the sequence number i of the image block and the sequence number j of the 1×1 convolution kernel group K'_(1,n,j). Each time the convolutions of the I image blocks with the kernel group K_(1,n,j) have been computed, the image splicing unit receives a splicing start signal from the data control unit and splices the convolution results of the I image blocks with K_(1,n,j) into an image block P_(n,j); after the image splicing is completed, it receives the convolution results of the next kernel group K_(1,n,j+1) with the I input image blocks and splices them. Whether a zero-value padding operation needs to be performed on the image block P_(n,j) is determined by the next forward inference instruction CN_(n+1): if the padding flag of CN_(n+1) is set, a zero-value padding operation is performed on P_(n,j); otherwise no padding operation is performed. The image block P_(n,j) is then divided into G image blocks P_(n,j) = {P_(n,j,1), P_(n,j,2), ..., P_(n,j,g), ..., P_(n,j,G)}, and the blocks P_(n,j,g) are sent to the data control unit in sequence.
S43: splicing of the pooling results of the I single-channel images {R'_(n,1,m), R'_(n,2,m), ..., R'_(n,i,m), ..., R'_(n,I,m)}. The image splicing unit receives from the data control unit the pooling result of a small-size single-channel image R'_(n,i,m,p) together with its sequence numbers p and m. Each time the pooling results of m×p small-size single-channel images R'_(n,i,m,p) have been computed, the image splicing unit receives a splicing start signal from the data control unit and splices the pooling results of the m×p small-size single-channel images R_(n,i,m,p) into a single-channel image P_(n,m); after the image splicing is completed, it receives the pooling results of the next batch of m×p small-size single-channel images R_(n,i,m,p). Whether a zero-value padding operation needs to be performed on the single-channel image P_(n,m) is determined by the next forward inference instruction CN_(n+1): if the padding flag of CN_(n+1) is set, a zero-value padding operation is performed on P_(n,m); otherwise no padding operation is performed on the single-channel image P_(n,m). The single-channel image P_(n,m) is then divided into G image blocks of size 1×W×H, P_(n,m) = {P_(n,m,1), P_(n,m,2), ..., P_(n,m,g), ..., P_(n,m,G)}, and the blocks P_(n,m,g) are sent to the data control unit in sequence.
After the image splicing unit splices the images, the spliced images are re-blocked according to the next forward inference instruction CN_(n+1). The data control unit takes the spliced and re-blocked image blocks out of the image splicing unit and stores them into the off-chip storage unit through the data buffer unit and the off-chip memory control unit. According to the operation type of the forward inference instruction CN_n: for a 3×3 convolution operation, step S51 is executed; for a 1×1 convolution operation, step S52 is executed; for a pooling operation, step S53 is executed.
The method comprises the following specific steps:
S51: the data control unit takes the spliced and re-blocked image block P_(n,j,g) out of the image splicing unit, calculates the output result start address PA_(n,j,g) of P_(n,j,g), and stores P_(n,j,g) into the off-chip storage unit by sending PA_(n,j,g) to the off-chip memory control unit.
S52: the data control unit takes the spliced and re-blocked image block P_(n,j,g) out of the image splicing unit, calculates the output result start address PA_(n,j,g) of P_(n,j,g), and stores P_(n,j,g) into the off-chip storage unit by sending PA_(n,j,g) to the off-chip memory control unit.
S53: the data control unit takes the spliced and re-blocked image block P_(n,m,g) out of the image splicing unit, calculates the output result start address PA_(n,m,g) of P_(n,m,g), and stores P_(n,m,g) into the off-chip storage unit by sending PA_(n,m,g) to the off-chip memory control unit.
The results stored in steps S51, S52 and S53 form the input feature map of the next forward inference instruction; the processing result of each forward inference instruction is the input feature map of the next one, and so on until all forward inference instructions have been run, that is, all N network layers of the convolutional neural network have completed their operations and the running result of the convolutional neural network is obtained.
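This chaining amounts to the simple loop sketched below, where trivial stand-in functions replace the per-layer processing; only the chaining of stored results is the point, the rest is made up for the example.

```python
# Outline of the overall forward inference: each forward inference instruction's
# result is stored off-chip and becomes the input feature map of the next
# instruction. The "layer" functions here are trivial stand-ins, used only to
# show the chaining; they are not the accelerator's operations.
off_chip = {"R_1": [1.0, 2.0, 3.0]}                 # input feature map of layer 1

def run_instruction(op_type, feature_map):
    if op_type == "conv":                           # stand-in for S41/S51
        return [2.0 * x + 0.1 for x in feature_map]
    return [max(feature_map)] * len(feature_map)    # stand-in for pooling, S43/S53

instructions = [("conv", 1), ("pool", 2), ("conv", 3)]   # CN_1 .. CN_N
for op_type, n in instructions:
    result = run_instruction(op_type, off_chip[f"R_{n}"])
    off_chip[f"R_{n + 1}"] = result                 # stored result = next input R_(n+1)

print(off_chip[f"R_{len(instructions) + 1}"])       # running result of the network
```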
The image cache unit, the parameter cache unit, the data cache unit, the off-chip memory control unit and the off-chip memory unit are specifically arranged as follows:
the image buffer unit is internally provided with a size of
Figure RE-GDA0003476049870000134
The embedded block RAM. Storing processing results P of parallel processing units(n,i,m)And is combined with P(n,i,m)And sending the data to a data control unit.
The parameter buffer unit contains two parameter buffers of identical structure, used to implement ping-pong operation. Each parameter buffer contains an embedded block RAM. The parameter buffer unit is arranged between the data control unit and the off-chip memory control unit and is used for buffering the parameters.
The data buffer unit contains two data buffers of identical structure, used to implement ping-pong operation. Each data buffer contains an embedded block RAM. The data buffer unit is arranged between the data control unit and the off-chip memory control unit and is used for buffering the image data related to the feature map and the convolution bias.
The off-chip memory control unit is implemented with a Xilinx MIG IP core and is used to connect to the off-chip storage unit, enabling the exchange of parameters and feature-map-related image data between the FPGA and the off-chip storage unit.
The off-chip storage unit is implemented with DDR3 and stores, for each network layer L_n, the input feature map R_n, the convolution kernel parameters, the convolution bias parameters, the convolution or pooling results and other information. The convolution or pooling result of network layer L_n serves as the input feature map R_(n+1) of network layer L_(n+1).
The specific storage mode is as follows:
Storage of the input feature map R'_n: the input feature map R'_n is divided into I image blocks R'_n = {R'_(n,1), R'_(n,2), ..., R'_(n,i), ..., R'_(n,I)}; each R'_(n,i) is divided channel-wise into M single-channel images of size 1×W×H, R'_(n,i) = {R'_(n,i,1), R'_(n,i,2), ..., R'_(n,i,m), ..., R'_(n,i,M)}. Within a single-channel image R'_(n,i,m) the storage is contiguous; the M single-channel images of an image block R'_(n,i) are stored contiguously in channel order; and the I image blocks of the input feature map R'_n are stored contiguously in ascending order of the division sequence number i, as shown in fig. 5.
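The storage order just described corresponds to the address computation sketched below; the word addressing, base address and sizes are example assumptions.

```python
# Sketch of the off-chip layout of fig. 5 (example values): address of the pixel at
# (row, col) of channel m inside image block i of the input feature map R'_n, assuming
# contiguous storage within a single-channel image, then by channel, then by block.
def feature_map_address(base, W, H, M, i, m, row, col):
    block_offset   = (i - 1) * M * W * H    # I blocks stored in order of i
    channel_offset = (m - 1) * W * H        # M single-channel images stored in channel order
    pixel_offset   = row * W + col          # contiguous inside one 1 x W x H image
    return base + block_offset + channel_offset + pixel_offset

print(hex(feature_map_address(base=0x1000, W=38, H=38, M=64, i=2, m=3, row=0, col=5)))
```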
Storage of the convolution bias parameter B'(3,n) of a 3×3 convolution with the convolution step size given in Figure RE-GDA0003476049870000144: the convolution bias parameter B'(3,n), whose size is given by the expression in Figure RE-GDA0003476049870000145, is divided in channel order into J offset blocks whose size is given by the expression in Figure RE-GDA0003476049870000146, B'(3,n) = {B'(3,n,1), B'(3,n,2), ..., B'(3,n,j), ..., B'(3,n,J)}. Each B'(3,n,j) is divided into I offset blocks whose size is given by the expression in Figure RE-GDA0003476049870000147, B'(3,n,j) = {B'(3,n,1,j), B'(3,n,2,j), ..., B'(3,n,i,j), ..., B'(3,n,I,j)}. Each offset block B'(3,n,i,j) is stored contiguously, the I offset blocks of B'(3,n,j) are stored consecutively in partition order, and the J offset blocks of B'(3,n) are stored consecutively in increasing channel order.
Storage of the convolution bias parameter B'(3,n) of a 3×3 convolution with the convolution step size given in Figure RE-GDA0003476049870000148: the convolution bias parameter B'(3,n), whose size is given by the expressions in Figure RE-GDA0003476049870000149 and Figure RE-GDA00034760498700001410, is divided in channel order into J offset blocks whose size is given by the expression in Figure RE-GDA00034760498700001411, B'(3,n) = {B'(3,n,1), B'(3,n,2), ..., B'(3,n,j), ..., B'(3,n,J)}. Each B'(3,n,j) is divided into I offset blocks whose size is given by the expression in Figure RE-GDA0003476049870000141, B'(3,n,j) = {B'(3,n,1,j), B'(3,n,2,j), ..., B'(3,n,i,j), ..., B'(3,n,I,j)}. Each offset block B'(3,n,i,j) is stored contiguously, the I offset blocks of B'(3,n,j) are stored consecutively in partition order, and the J offset blocks of B'(3,n) are stored consecutively in increasing channel order.
Storage of the convolution bias parameter B'(1,n) of a 1×1 convolution: the convolution bias parameter B'(1,n), whose size is given by the expression in Figure RE-GDA00034760498700001412, is divided in channel order into J offset blocks whose size is given by the expression in Figure RE-GDA00034760498700001413, B'(1,n) = {B'(1,n,1), B'(1,n,2), ..., B'(1,n,j), ..., B'(1,n,J)}. Each B'(1,n,j) is divided into I offset blocks whose size is given by the expression in Figure RE-GDA00034760498700001414, B'(1,n,j) = {B'(1,n,1,j), B'(1,n,2,j), ..., B'(1,n,i,j), ..., B'(1,n,I,j)}. Each offset block B'(1,n,i,j) is stored contiguously, the I offset blocks of B'(1,n,j) are stored consecutively in partition order, and the J offset blocks of B'(1,n) are stored consecutively in increasing channel order.
Storage of the 3×3 convolution kernel parameter K'(3,n): the convolution kernel parameter K'(3,n), whose size is given by the expression in Figure RE-GDA00034760498700001415, is divided in kernel order into J 3×3 convolution kernels whose size is given by the expression in Figure RE-GDA00034760498700001416, K'(3,n) = {K'(3,n,1), K'(3,n,2), ..., K'(3,n,j), ..., K'(3,n,J)}. Each K'(3,n,j) is divided by channel into M single-channel 3×3 convolution kernels whose size is given by the expression in Figure RE-GDA00034760498700001417, K'(3,n,j) = {K'(3,n,j,1), K'(3,n,j,2), ..., K'(3,n,j,m), ..., K'(3,n,j,M)}. Each single-channel 3×3 convolution kernel K'(3,n,j,m) is stored contiguously, the M single-channel 3×3 convolution kernels of a kernel K'(3,n,j) are stored consecutively in channel order, and the J 3×3 convolution kernels of K'(3,n) are stored consecutively in increasing order of the division index j.
Storage of the 1×1 convolution kernel parameter K'(1,n): the convolution kernel parameter K'(1,n), whose size is given by the expression in Figure RE-GDA0003476049870000151, is divided in kernel order into J 1×1 convolution kernels whose size is given by the expression in Figure RE-GDA0003476049870000152, K'(1,n) = {K'(1,n,1), K'(1,n,2), ..., K'(1,n,j), ..., K'(1,n,J)}. Each K'(1,n,j) is divided by channel into M single-channel 1×1 convolution kernels whose size is given by the expression in Figure RE-GDA0003476049870000153, K'(1,n,j) = {K'(1,n,j,1), K'(1,n,j,2), ..., K'(1,n,j,m), ..., K'(1,n,j,M)}. Each single-channel 1×1 convolution kernel K'(1,n,j,m) is stored contiguously, the M single-channel 1×1 convolution kernels of a kernel K'(1,n,j) are stored consecutively in channel order, and the J 1×1 convolution kernels of K'(1,n) are stored consecutively in increasing order of the division index j.
In order to further improve the running speed of the convolutional neural network, the invention also provides an optimization method of the convolutional neural network accelerator based on the FPGA. The method comprises the following specific steps:
S61, obtaining, for forward inference instructions CNn of different operation types, the number of clock cycles used for forward inference;
For a 3×3 convolution operation, step S611 is performed; for a 1×1 convolution operation, step S612 is performed; and for a pooling operation, step S613 is performed.
S611, according to a forward inference instruction CNn whose operation type is a 3×3 convolution operation, obtaining, from all steps from decoding to the end of instruction execution and the number of cycles used in each step, the number of clock cycles F3(n) used for forward inference of a forward inference instruction CNn whose operation type is a 3×3 convolution operation;
The number of clock cycles used by the program instruction storage unit to read and transmit a forward inference instruction CNn whose operation type is a 3×3 convolution operation is Tr3. The program instruction decoding unit uses Ty3 clock cycles to decode such a forward inference instruction once; one forward inference instruction must be decoded I×J times, requiring I×J×Ty3 clock cycles. The data control unit forwards a single-channel image R'(n,i,m) and a single-channel 3×3 convolution kernel K'(3,n,j,m) stored in the off-chip storage unit to the parallel processing unit, via the cache units and the off-chip memory control unit, in Tcps3 clock cycles per transfer; one forward inference instruction requires M transfers, i.e. M×Tcps3 clock cycles. The data control unit takes the convolution result of an input image block R'(n,i) and a 3×3 convolution kernel K'(3,n,j) from the image cache unit and forwards it to the image splicing unit in Tscj3 clock cycles per transfer; one forward inference instruction requires I×J transfers, i.e. I×J×Tscj3 clock cycles. The data control unit takes an image block P(n,j,g) from the image splicing unit and stores it in the off-chip storage unit, via the data cache unit and the off-chip memory control unit, in Tjc3 clock cycles per transfer; one forward inference instruction requires G transfers, i.e. G×Tjc3 clock cycles. The parallel processing unit calls the 3×3 convolution unit to perform convolution and accumulation on a single-channel image R'(n,i,m) and a single-channel 3×3 convolution kernel K'(3,n,j,m) in Tp3 clock cycles; one forward inference instruction requires I×J×M such operations, i.e. I×J×M×Tp3 clock cycles. The image cache unit forwards P(n,i,m) to the data control unit in Ts3 clock cycles; one forward inference instruction requires I×J×M such transfers, i.e. I×J×M×Ts3 clock cycles. The image splicing unit fills and re-blocks the image blocks P(n,j) in Tj3 clock cycles; one forward inference instruction requires J such operations, i.e. J×Tj3 clock cycles. From the forward inference instruction CNn and the number of cycles used in each step, F3(n) is obtained:
F3(n)=Tr3+I×J×Ty3+M×Tcps3+I×J×Tscj3+G×Tjc3+I×J×M×Tp3+I×J×M×Ts3+J×Tj3
wherein the remaining quantities are defined by the expressions in Figure RE-GDA0003476049870000161 and Figure RE-GDA0003476049870000162.
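The expression for F3(n) can be evaluated directly, as in the following sketch; the per-step cycle counts Tr3 through Tj3 and the block counts I, J, M and G are inputs that would come from the figures above and from measurements of the design:

```python
# Cycle-count model for one 3x3 convolution instruction, transcribed from the
# expression for F3(n) above.
def f3(I, J, M, G, Tr3, Ty3, Tcps3, Tscj3, Tjc3, Tp3, Ts3, Tj3):
    return (Tr3                      # instruction read and transmission
            + I * J * Ty3            # instruction decoding
            + M * Tcps3              # image / kernel transfer to the parallel units
            + I * J * Tscj3          # convolution results to the image splicing unit
            + G * Tjc3               # spliced blocks back to off-chip memory
            + I * J * M * Tp3        # 3x3 convolution and accumulation
            + I * J * M * Ts3        # image cache to data control unit
            + J * Tj3)               # filling and re-blocking
```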
S612, according to a forward inference instruction CNn whose operation type is a 1×1 convolution operation, obtaining, from all steps from decoding to the end of instruction execution and the number of cycles used in each step, the number of clock cycles F1(n) used for forward inference of a forward inference instruction CNn whose operation type is a 1×1 convolution operation;
The number of clock cycles used by the program instruction storage unit to read and transmit a forward inference instruction CNn whose operation type is a 1×1 convolution operation is Tr1. The program instruction decoding unit uses Ty1 clock cycles to decode such a forward inference instruction once; one forward inference instruction must be decoded I×J times, requiring I×J×Ty1 clock cycles. The data control unit forwards a single-channel image R'(n,i,m) and a single-channel 1×1 convolution kernel K'(1,n,j,m) stored in the off-chip storage unit to the parallel processing unit, via the cache units and the off-chip memory control unit, in Tcps1 clock cycles per transfer; one forward inference instruction requires M transfers, i.e. M×Tcps1 clock cycles. The data control unit takes the convolution result of an input image block R'(n,i) and a 1×1 convolution kernel K'(1,n,j) from the image cache unit and forwards it to the image splicing unit in Tscj1 clock cycles per transfer; one forward inference instruction requires I×J transfers, i.e. I×J×Tscj1 clock cycles. The data control unit takes an image block P(n,j,g) from the image splicing unit and stores it in the off-chip storage unit, via the data cache unit and the off-chip memory control unit, in Tjc1 clock cycles per transfer; one forward inference instruction requires G transfers, i.e. G×Tjc1 clock cycles. The parallel processing unit calls the 1×1 convolution unit to perform convolution and accumulation on a single-channel image R'(n,i,m) and a single-channel 1×1 convolution kernel K'(1,n,j,m) in Tp1 clock cycles; one forward inference instruction requires I×J×M such operations, i.e. I×J×M×Tp1 clock cycles. The image cache unit forwards P(n,i,m) to the data control unit in Ts1 clock cycles; one forward inference instruction requires I×J×M such transfers, i.e. I×J×M×Ts1 clock cycles. The image splicing unit fills and re-blocks the image blocks P(n,j) in Tj1 clock cycles; one forward inference instruction requires J such operations, i.e. J×Tj1 clock cycles. From the forward inference instruction CNn and the number of cycles used in each step, F1(n) is obtained:
F1(n)=Tr1+I×J×Ty1+M×Tcps1+I×J×Tscj1+G×Tjc1+I×J×M×Tp1+I×J×M×Ts1+J×Tj1
S613, according to a forward inference instruction CNn whose operation type is a pooling operation, obtaining, from all steps from decoding to the end of instruction execution and the number of cycles used in each step, the number of clock cycles Fp(n) used for forward inference of a forward inference instruction CNn whose operation type is a pooling operation.
The number of clock cycles used by the program instruction storage unit to read and transmit a forward inference instruction CNn whose operation type is a pooling operation is Trp. The program instruction decoding unit uses Typ clock cycles to decode such a forward inference instruction once; one forward inference instruction must be decoded I times, requiring I×Typ clock cycles. The data control unit forwards a single-channel image R'(n,i,m,p) stored in the off-chip storage unit to the parallel processing unit, via the data cache unit and the off-chip memory control unit, in Tcpsp clock cycles per transfer; one forward inference instruction requires M×P transfers, i.e. M×P×Tcpsp clock cycles. The data control unit takes the pooling result of a single-channel image R'(n,i,m,p) from the image cache unit and forwards it to the image splicing unit in Tscjp clock cycles per transfer; one forward inference instruction requires I×P transfers, i.e. I×P×Tscjp clock cycles. The data control unit takes an image block P(n,j,g) from the image splicing unit and stores it in the off-chip storage unit, via the data cache unit and the off-chip memory control unit, in Tjcp clock cycles per transfer; one forward inference instruction requires G transfers, i.e. G×Tjcp clock cycles. The parallel processing unit calls the pooling unit to pool a single-channel image R'(n,i,m,p) in Tpp clock cycles; one forward inference instruction requires I×P×M such operations, i.e. I×P×M×Tpp clock cycles. The image cache unit forwards the pooling result of a single-channel image R'(n,i,m,p) to the data control unit in Tsp clock cycles; one forward inference instruction requires I×P×M such transfers, i.e. I×P×M×Tsp clock cycles. The image splicing unit fills and re-blocks the image blocks P(n,m) in Tjp clock cycles; one forward inference instruction requires M such operations, i.e. M×Tjp clock cycles. From the forward inference instruction CNn and the number of cycles used in each step, Fp(n) is obtained:
Fp(n)=Trp+I×Typ+M×P×Tcpsp+I×P×Tscjp+G×Tjcp+I×P×M×Tpp+I×P×M×Tsp+M×Tjp
wherein the remaining quantities are defined by the expression in Figure RE-GDA0003476049870000181.
S62, constructing the number of clock cycles used by the convolutional neural network for one forward inference;
According to the number of clock cycles F(n) used for forward inference of each forward inference instruction CNn, the number of clock cycles F used by the convolutional neural network for one forward inference is calculated. The calculation formulas of F(n) and F (shown in Figure RE-GDA0003476049870000182 and Figure RE-GDA0003476049870000183) are respectively:
F(n) = F3(n) when the operation type of CNn is a 3×3 convolution operation (the condition in Figure RE-GDA0003476049870000184), F(n) = F1(n) when the operation type is a 1×1 convolution operation (Figure RE-GDA0003476049870000185), and F(n) = Fp(n) when the operation type is a pooling operation (Figure RE-GDA0003476049870000186);
F = F(1) + F(2) + ... + F(N), i.e. the sum of F(n) over the N forward inference instructions.
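Step S62 therefore reduces to selecting F3, F1 or Fp per instruction and summing over the N layers, as in this sketch; the instruction representation and the per-layer parameter dictionaries are hypothetical:

```python
# Total forward-inference cycle count F: pick the per-type model according to
# the operation type of each instruction CN_n and sum over all N instructions.
def total_cycles(instructions, f3, f1, fp):
    total = 0
    for instr in instructions:
        params = instr["params"]          # I, J, M, G, P and the T* constants
        if instr["op"] == "conv3x3":
            total += f3(**params)
        elif instr["op"] == "conv1x1":
            total += f1(**params)
        elif instr["op"] == "pool":
            total += fp(**params)
    return total
```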
S63, constructing a hardware resource constraint expression;
A hardware resource constraint expression is constructed for the number of hardware resources used by the neural network acceleration system. Each unit of the convolutional neural network acceleration system is described in the hardware description language Verilog, and the circuit of each unit and the number of hardware resources it uses are then obtained through synthesis, placement and routing. Because the resource counts of different FPGA chips are not identical, only the following hardware resources are taken as examples: DSPs, embedded block RAMs, data selectors, carry chains, registers and look-up tables. The numbers of DSPs, embedded block RAMs, data selectors, carry chains, registers and look-up tables used by the program instruction storage unit are given in Figure RE-GDA0003476049870000187; those used by the program instruction decoding unit in Figure RE-GDA0003476049870000188; those used by the data control unit in Figure RE-GDA0003476049870000189 and Figure RE-GDA00034760498700001810; those used by a single 3×3 convolution unit in Figure RE-GDA00034760498700001811; those used by a single 1×1 convolution unit in Figure RE-GDA00034760498700001812; and those used by a single pooling unit in Figure RE-GDA00034760498700001813 and Figure RE-GDA0003476049870000194. The numbers of DSPs, data selectors, carry chains, registers and look-up tables used by the parameter cache unit and the data cache unit are given in Figure RE-GDA0003476049870000195, and their embedded block RAM usage in Figure RE-GDA0003476049870000191. The numbers of DSPs, embedded block RAMs, data selectors, carry chains, registers and look-up tables used by the off-chip memory control unit are given in Figure RE-GDA0003476049870000196. The numbers of DSPs, data selectors, carry chains, registers and look-up tables used by the image splicing unit are given in Figure RE-GDA0003476049870000197, and its embedded block RAM usage in Figure RE-GDA0003476049870000198. The numbers of DSPs, data selectors, carry chains, registers and look-up tables used by the image cache unit are given in Figure RE-GDA0003476049870000199, and its embedded block RAM usage in Figure RE-GDA0003476049870000192.
According to the number of resources used by each unit, the hardware resource constraint expression for the number of hardware resources used by the neural network acceleration system is constructed as shown in Figure RE-GDA0003476049870000193,
where Ndsp represents the number of digital signal processors in the FPGA, Nbram the number of embedded block RAMs, Nmux the number of data selectors, Ncarry the number of carry chains, Nreg the number of registers, and Nlut the number of look-up tables; μdsp, μbram, μmux, μcarry, μregister and μlut are hyper-parameters whose values lie in [0,1];
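The form of the constraint can be sketched as follows; the cost vectors are placeholders for the per-unit figures obtained from synthesis, placement and routing, and the structure (fixed units plus x0, x1 and x2 copies of the three parallel units) is our reading of the optimization variables described in the next step:

```python
# Hedged sketch of the hardware resource constraint: the fixed units plus
# x0 / x1 / x2 copies of the 3x3, 1x1 and pooling units must not exceed the
# fraction mu of each FPGA resource.
RESOURCES = ("dsp", "bram", "mux", "carry", "reg", "lut")

def fits(x, fixed_cost, unit_cost, capacity, mu):
    """x = (x0, x1, x2); fixed_cost[r] covers the non-replicated units;
    unit_cost[k][r] is the cost of one 3x3 / 1x1 / pooling unit; capacity[r]
    is N_dsp .. N_lut; mu[r] is the utilisation hyper-parameter in [0, 1]."""
    for r in RESOURCES:
        used = fixed_cost[r] + sum(x[k] * unit_cost[k][r] for k in range(3))
        if used > mu[r] * capacity[r]:
            return False
    return True
```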
S64, constructing and solving a constrained optimization function F';
The number of clock cycles F used for one forward inference of the convolutional neural network obtained in step S62 and the hardware resource constraint expression obtained in step S63 together constitute a constrained optimization function F', shown in Figure RE-GDA0003476049870000201 and Figure RE-GDA0003476049870000202: F is minimized subject to the hardware resource constraints.
An optimization algorithm is used to solve the constrained optimization function F' and obtain the optimal solution for x0, x1 and x2. Under limited FPGA resources, this yields the optimal number of each type of parallel unit and thereby improves the forward inference speed of the SSD convolutional neural network; the optimization method of the invention therefore effectively improves the processing speed of the accelerator.
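The patent does not name a particular optimization algorithm; an exhaustive search over the small space of unit counts is one simple stand-in, sketched below with hypothetical helper functions:

```python
# Brute-force stand-in for "solving the constrained optimization function F'":
# enumerate candidate unit counts and keep the feasible candidate with the
# lowest forward-inference cycle count.  cycles(x) and fits(x) are placeholders
# for the models of steps S62 and S63.
def solve(max_units, cycles, fits):
    best, best_f = None, float("inf")
    for x0 in range(1, max_units + 1):
        for x1 in range(1, max_units + 1):
            for x2 in range(1, max_units + 1):
                x = (x0, x1, x2)
                if fits(x):
                    f = cycles(x)
                    if f < best_f:
                        best, best_f = x, f
    return best, best_f
```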
S65, setting the numbers of 3×3 convolution units, 1×1 convolution units and pooling units according to the optimal solution of the optimization function F'.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The convolutional neural network accelerator based on the FPGA is characterized by comprising a program instruction storage unit, a program instruction decoding unit, a data control unit, a data buffer unit, a parameter buffer unit, an off-chip memory control unit, a parallel processing unit, an image buffer unit, an image splicing unit and an off-chip storage unit, wherein the program instruction storage unit, the program instruction decoding unit, the data control unit, the data buffer unit, the parameter buffer unit, the off-chip memory control unit, the parallel processing unit, the image buffer unit and the image splicing unit are realized through the FPGA, and the off-chip storage unit is realized on a dynamic random access memory.
2. The FPGA-based convolutional neural network accelerator of claim 1, wherein the program instruction storage unit comprises a ROM in which the forward inference instruction CNn of each convolutional neural network layer Ln is stored.
3. The FPGA-based convolutional neural network accelerator of claim 2, wherein the forward inference instruction CNn is formed by sequentially splicing the binary data corresponding to the operation type of convolutional neural network layer Ln, the convolution step size, the number of convolution kernels, the number of input feature map channels, the input feature map width, the input feature map start address, the output feature map width, the neural network parameter start address, the output image block start address, the convolution offset start address and the filling mark (Figure FDA0003415011260000012).
4. The FPGA-based convolutional neural network accelerator of claim 3, wherein the program instruction decoding unit is serially cascaded between the instruction storage unit and the data control unit.
5. The FPGA-based convolutional neural network accelerator of claim 4, wherein the data control unit is connected to the parallel processing unit, the image buffer unit, the data buffer unit, the parameter buffer unit, the image stitching unit, and the off-chip memory control unit.
6. The FPGA-based convolutional neural network accelerator of claim 5, wherein two identical parameter buffers are disposed in the parameter buffer unit.
7. The FPGA-based convolutional neural network accelerator of claim 6, wherein said parallel processing units comprise a 3 x3 convolution unit, a 1 x1 convolution unit, and a pooling unit.
8. The FPGA-based convolutional neural network accelerator of claim 7, wherein two identical data buffers are disposed in the data buffer unit.
9. The FPGA-based convolutional neural network accelerator of claim 8, wherein the data buffer internally comprises an embedded block RAM whose size is given by the expression in Figure FDA0003415011260000011, in which x0 represents the number of 3×3 convolution units and W and H represent the width and height, respectively, of the image block that the parallel processing unit can process at one time.
10. An optimization method of a convolutional neural network accelerator based on an FPGA is characterized by comprising the following steps:
the method comprises the following steps: step one: obtaining, for forward inference instructions CNn of different operation types, the number of clock cycles used for forward inference;
step two: constructing the number of clock cycles used by the convolutional neural network for one forward inference;
step three: constructing a hardware resource constraint expression;
step four: constructing and solving a constrained optimization function F';
step five: setting the numbers of 3×3 convolution units, 1×1 convolution units and pooling units of claim 7 according to the optimal solution of the optimization function F'.
Publications (1)

Publication Number Publication Date
CN114186679A true CN114186679A (en) 2022-03-15



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination