CN108647773A - Hardware interconnection architecture for a reconfigurable convolutional neural network - Google Patents
Hardware interconnection architecture for a reconfigurable convolutional neural network
- Publication number
- CN108647773A (application CN201810358443.6A)
- Authority
- CN
- China
- Prior art keywords
- module
- calculation
- convolutional neural
- data
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Complex Calculations (AREA)
- Image Processing (AREA)
- Stored Programmes (AREA)
- Image Analysis (AREA)
Abstract
Description
Technical Field
The invention belongs to the technical field of hardware design for image processing algorithms, and in particular relates to a hardware interconnection architecture for a reconfigurable convolutional neural network.
Background Art
With the rise of artificial intelligence, deep learning has been applied ever more widely to computer vision, speech recognition, and other big-data applications. As an important algorithmic model in deep learning, the convolutional neural network (CNN) is now used extensively in tasks such as image classification, face recognition, video detection, and speech recognition.
As image recognition accuracy improves, convolutional neural networks grow ever more complex and demand ever more computation. Traditional general-purpose processors, which carry a large amount of redundant hardware, therefore perform poorly on large convolutional neural networks and cannot meet the practical requirements of many scenarios. Accelerating convolutional neural networks on dedicated hardware platforms has consequently become a mainstream engineering approach: a parallel, pipelined hardware architecture can deliver enough acceleration to achieve real-time processing.
Although hardware platforms far outperform traditional general-purpose processors, network structures are growing increasingly complex and must handle convolution kernels of many sizes. An inflexible hardware structure therefore accelerates only certain specific networks well, while remaining inefficient for others. The need for a reconfigurable CNN hardware architecture that handles convolution kernels of arbitrary size and diverse network structures is thus becoming ever more pressing.
For reconfigurable CNN hardware, current research faces two main difficulties: first, how to configure convolution kernels of different sizes on a fixed hardware structure; second, how to keep that structure highly efficient for every kernel size. In the prior art, a hardware structure implementing the most common 3×3 convolution template is designed as the basic unit, data is passed through interconnections between these units, and multiple basic units are combined to realize other kernel types; for example, five basic units are needed to implement three 5×5 convolution templates. Under this scheme the number of basic units in hardware is fixed while the number required per kernel varies, so for some kernel sizes a portion of the basic units sits idle and 100% utilization cannot be guaranteed. Moreover, not every basic unit produces a valid output, so all outputs must first be buffered and the valid data selected afterwards, wasting a large amount of additional storage.
In the prior art, many image processing hardware accelerators design an optimal hardware structure for one particular algorithm and then support other algorithms through interconnection. Because algorithms are diverse and the hardware scale is fixed, resource utilization and computing performance drop whenever the network model or the type of computation changes. Furthermore, in image processing, and especially in neural network algorithms, bandwidth is a key factor limiting performance in addition to compute-unit utilization.
There is thus an urgent need for an architecture that improves data reuse through structural interconnection while also reducing bandwidth requirements.
Summary of the Invention
The object of the present invention is to provide a hardware interconnection architecture for a reconfigurable convolutional neural network that improves data reuse through structural interconnection while reducing bandwidth requirements.
The hardware interconnection architecture for a reconfigurable convolutional neural network provided by the present invention is applied in the field of image processing and includes:
a data-and-parameter off-chip cache module, which buffers the pixel data of the input image to be processed and the parameters supplied for the convolutional neural network computation;
a basic computing unit array module, connected to the data off-chip cache module and the parameter off-chip cache module, which performs the core computation of the convolutional neural network; the basic computing units are interconnected as a two-dimensional array: along the row direction, the units share input data and compute in parallel using different parameter data; along the column direction, results are passed row by row and serve as inputs to the next row;
an arithmetic logic unit (ALU) computing module, connected to the basic computing unit array module, which processes the array's results to implement the downsampling layer, the activation function, and partial-sum accumulation;
when computing a convolutional neural network, each column of the basic computing unit array independently computes one output feature map, and kernels of different sizes are mapped by splitting the input feature map according to convolution window position and feeding the pieces in over multiple passes;
a control module, connected to the basic computing unit array module and the ALU computing module, which realizes convolution kernels of arbitrary size and multiple computing modes according to different parameters.
Optionally, when the ALU computing module and the basic computing unit array module compute a convolutional neural network, each input feature map after splitting requires only a single parameter of the convolution kernel.
Optionally, the basic computing unit array module is internally interconnected, and by changing its internal interconnection pattern and data paths it can support computations of different kinds and different bit widths.
Optionally, the hardware interconnection architecture is implemented either on a field-programmable gate array or as an integrated circuit.
According to the specific embodiments provided herein, the invention achieves the following technical effects. A new mapping scheme maps the convolutional neural network so that the degrees of freedom of different network models are shifted from the spatial domain to the time domain, allowing a fixed hardware interconnection architecture to maintain 100% resource utilization for any convolutional neural network. The basic computing unit module makes full use of digital signal processing (DSP) units and adopts a flexible interconnection structure to support different computing modes and variable data bit widths. Pipelined parallel computation between rows and reuse of input data between columns reduce both the input-data and output bandwidth requirements.
Brief Description of the Drawings
FIG. 1 is a structural diagram of the hardware interconnection architecture of the convolutional neural network provided by the present invention.
FIG. 2 is a schematic diagram of the specific mapping under the mapping scheme provided by the present invention.
FIG. 3 is a schematic diagram of the internal structure of the basic computing module provided by the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the drawings illustrate only some embodiments of the invention, and those of ordinary skill in the art can derive further drawings from them without creative effort. Likewise, the described embodiments are only a subset of the possible embodiments. All other embodiments obtained by those of ordinary skill in the art from the embodiments described herein without creative effort fall within the scope of protection of the present invention.
The purpose of the present invention is to provide a hardware interconnection architecture for a convolutional neural network that improves data reuse through structural interconnection while reducing bandwidth requirements.
A hardware interconnection architecture for a reconfigurable convolutional neural network, applied in the field of image processing, includes:
a data-and-parameter off-chip cache module, which buffers the pixel data of the input image to be processed and the parameters supplied for the convolutional neural network computation;
a basic computing unit array module, connected to the data off-chip cache module and the parameter off-chip cache module, which performs the core image-processing computation, the convolutional neural network. The basic computing modules are interconnected as a two-dimensional array: along the row direction they share input data and compute in parallel using different parameter data; along the column direction their results are passed row by row and serve as inputs to the next row;
when computing a convolutional neural network, each column of the basic computing unit array independently computes one output feature map, and kernels of different sizes are mapped by splitting the input feature map according to convolution window position and feeding the pieces in over multiple passes;
an arithmetic logic unit computing module, connected to the basic computing module, which processes the array's results to implement the downsampling layer, the activation function, and partial-sum accumulation;
a control module, connected to the basic computing module and the ALU computing module, which realizes convolution kernels of arbitrary size and multiple computing modes according to different parameters, enabling efficient and flexible computing units and interconnection structures for the convolutional neural networks of different structures, distance computations, vector computations, and digital-signal-processing computations found in image processing methods.
Optionally, when the ALU computing module and the basic computing module compute a convolutional neural network, each input feature map after splitting requires only a single parameter of the convolution kernel.
Optionally, the basic computing module is internally interconnected, and by changing its internal interconnection pattern and data paths it can support computations of different kinds and different bit widths.
Optionally, the architecture is implemented either on a field-programmable gate array or as an integrated circuit.
As shown in FIG. 1, the data off-chip cache module and the parameter off-chip cache module 1 are configured so that externally supplied image data and parameters are first buffered here, waiting to be read out by the basic computing unit array module, and computed results are likewise buffered here, waiting to be fetched again by the ALU computing module or to be output. The data-and-parameter off-chip cache module effectively acts as a large rate-matching buffer that absorbs the difference between the processing speed of the basic computing unit array module and the external input rate; without it, sending data from outside directly to the internal array would require fairly complex handshake signals to guarantee that no data is dropped or resent when the two speeds do not match.
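The rate-matching role described above can be modeled as a simple first-in, first-out buffer. The sketch below is a hypothetical software model (the `OffChipBuffer` class and its depth are illustrative assumptions, not the patent's hardware), showing how back-pressure on the write side replaces per-word handshaking:

```python
from collections import deque

class OffChipBuffer:
    """Assumed FIFO model of the off-chip cache module (module 1)."""

    def __init__(self, depth):
        self.q = deque()
        self.depth = depth

    def write(self, word):
        """External side: returns False (back-pressure) when full."""
        if len(self.q) >= self.depth:
            return False
        self.q.append(word)
        return True

    def read(self):
        """Array side: returns None when no data is ready yet."""
        return self.q.popleft() if self.q else None

buf = OffChipBuffer(depth=2)
assert buf.write(10) and buf.write(11)
assert not buf.write(12)   # buffer full: the fast producer must stall
assert buf.read() == 10    # the array drains data in arrival order
```

The producer only ever checks the single full/empty condition, which is exactly the simplification the off-chip cache provides over a direct connection.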
The basic computing unit array module 2, connected to the data-and-parameter off-chip cache module 1, performs the core computation of the convolutional neural network; the interconnection between the basic computing units is shown in FIG. 1. The architecture consists of multiple pipeline stages with identical functions: each stage multiplies input feature-map data by parameters and adds the partial sum to the result of the previous stage. Input data enters at the first stage and flows backward stage by stage until the last stage writes its result into the off-chip cache module. When the number of input feature maps exceeds the number of array rows, these results are returned from the off-chip cache module to the ALU computing module, where accumulation continues until the final output feature map is obtained. One cycle after the first feature map starts computing in the first pipeline stage, the second feature map starts in the second stage, so the result of the first stage arrives just in time to join the addition in the second stage; when the pipeline is completely full, a high degree of parallelism and high processing performance are achieved. For the pipeline to compute efficiently, the number of array rows/columns should, as far as possible, be a factor of the number of input/output feature maps.
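The multiply-accumulate chain of one array column, as described above, can be sketched functionally as follows (a minimal model under the assumption that each row holds a single 1×1 parameter; `column_pipeline` is an illustrative name, not from the patent):

```python
def column_pipeline(pixels, weights):
    """pixels[r]: the input pixel broadcast to row r of the array;
    weights[r]: the parameter held by this column's unit in row r.
    Each stage multiplies and adds the partial sum passed down from
    the row above, mirroring the column-direction dataflow."""
    partial = 0
    for px, w in zip(pixels, weights):
        partial = partial + px * w   # one pipeline stage: MAC, pass down
    return partial

# Three input feature maps contribute to one output pixel of this column.
assert column_pipeline([2, 3, 4], [1, 5, -1]) == 13

# Row-direction sharing: two columns reuse the same input pixels with
# different weights, computing two output feature maps in parallel.
cols = [column_pipeline([2, 3, 4], w) for w in ([1, 5, -1], [0, 1, 2])]
assert cols == [13, 11]
```

The second assertion illustrates why input bandwidth drops: the pixel stream is fetched once and consumed by every column.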
The arithmetic logic unit computing module 3, connected to the basic computing unit array module 2, implements the partial-sum accumulation of the convolution, the downsampling layer, and the activation function layer. Specifically: in a convolutional layer, after the results of the array's last row are stored in the off-chip cache module, any values that are still partial sums are read back into the ALU computing module, further accumulated with the last-row results of the next group of array computations, and stored again, thereby accumulating results over multiple groups of input feature maps. In a downsampling layer, the ALU computing module reads the output feature map of the previous convolutional layer from the off-chip cache module and, depending on whether the downsampling takes the maximum or the average, configures itself as a comparator or an adder. In the activation function layer (the ReLU function in this example), the ALU computing module uses a data selector (mux) to implement the activation function.
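The group-by-group accumulation described above can be sketched as follows. This is a simplified model under stated assumptions: each input feature map's contribution is reduced to one scalar per output pixel, and `accumulate_in_groups` is an illustrative name, not the patent's terminology:

```python
def accumulate_in_groups(contributions, rows):
    """contributions[i]: the contribution of input feature map i to one
    output pixel. rows: the number of array rows R. Maps are processed
    in groups of R; each group's partial sum (buffered off-chip in the
    real design) is folded into the running total by the ALU module."""
    total = 0
    for start in range(0, len(contributions), rows):
        group = contributions[start:start + rows]
        partial = sum(group)   # one pass through the R-row array
        total += partial       # ALU: add to the buffered partial sum
    return total

# Five input maps on a 2-row array take three passes, same final result.
assert accumulate_in_groups([1, 2, 3, 4, 5], rows=2) == 15
```

The point of the scheme is that the result is independent of the grouping, so the array size can stay fixed while the number of input feature maps varies.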
Besides computing convolutional neural networks, the hardware interconnection architecture of the reconfigurable convolutional neural network of the present invention can also effectively support distance computations, linear algebra computations, and image-signal-processing computations.
The control module 4, connected to the modules above, generates the control signals for the other modules. Specifically, it accepts and decodes external instructions; generates the enable signals and addresses with which the off-chip cache module reads and writes data off-chip and with which the computing array reads and writes data to the off-chip cache module; and generates the data and function-select signals for the computing units and the ALU computing module.
The basic computing unit array 2 described above is built from multiple pipeline stages. During mapping, the image data located at the same position of the convolution template within the sliding window (i.e., sharing the same parameter) is extracted to form a new input feature map. This is equivalent to splitting the convolution template into multiple 1×1 templates; accumulating the results computed with these split templates recovers the original convolution result. Taking a 5×5 convolution template as an example, the mapping is shown in FIG. 2. In this way a convolution template of any size can be split into 1×1 templates, so the structure maintains high efficiency for convolution kernels of any size. Furthermore, the pipelined structure between rows makes efficient use of the DSP function while reducing the number of output ports from the number of computing units to the number of array columns, so the results of all computing units except the last row need no extra buffer space, lowering the demand for on-chip storage resources.
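The equivalence underlying this mapping can be checked numerically. The sketch below is a software model, not the patented hardware: a K×K convolution equals the sum of K×K separate 1×1 convolutions, each applied to the pixels extracted from one fixed position of the sliding window (function names are illustrative):

```python
def conv2d(img, ker):
    """Plain valid-mode 2-D convolution (correlation form) on lists."""
    K = len(ker)
    H, W = len(img), len(img[0])
    return [[sum(ker[i][j] * img[y + i][x + j]
                 for i in range(K) for j in range(K))
             for x in range(W - K + 1)]
            for y in range(H - K + 1)]

def conv_by_1x1_split(img, ker):
    """Same result via the patent's mapping idea: one 1x1 template
    (a single parameter ker[i][j]) per window position (i, j)."""
    K = len(ker)
    oh, ow = len(img) - K + 1, len(img[0]) - K + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(K):
        for j in range(K):
            # "new input feature map": the pixels at window offset (i, j)
            for y in range(oh):
                for x in range(ow):
                    out[y][x] += ker[i][j] * img[y + i][x + j]
    return out

img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ker = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]   # a 3x3 Sobel-like template
assert conv_by_1x1_split(img, ker) == conv2d(img, ker)
```

Because every split template is 1×1, each pass through the array needs only one parameter per input feature map, which is what keeps utilization independent of the kernel size.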
The arithmetic logic unit computing module 3 described above chains the downsampling layer and the activation function layer to the convolutional layer. Since the downsampling and activation layers require no parameters and their operations are simple, the architecture, in order to reduce external bandwidth demand and raise computing efficiency, does not map these two layers onto the computing array; instead, before the convolutional layer's results are output, they are fed into the ALU computing module for comparison, addition, or data selection. Concretely: for max pooling, the ALU computing module is configured as a comparator; for average pooling, the weights of the preceding convolutional layer are pre-divided by the pool denominator so that averaging only requires the ALU computing module to be configured as an adder; and for the ReLU activation function, the ALU computing module is configured as a data selector (mux).
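The three ALU configurations just described can be sketched as one mode-switched operation. This is an assumed behavioral model (the `alu` function and its mode names are illustrative, not the patent's RTL):

```python
def alu(mode, a, b=0):
    """Behavioral model of ALU module 3 in its three configurations."""
    if mode == "max":    # comparator for max pooling
        return a if a > b else b
    if mode == "avg":    # plain adder; the divisor was pre-folded into
        return a + b     # the previous conv layer's weights
    if mode == "relu":   # data selector (mux): pick a or 0 by sign
        return a if a > 0 else 0
    raise ValueError(mode)

# 2x2 max pool realized as repeated comparisons, then ReLU on a result.
window = [3, -1, 7, 2]
m = window[0]
for v in window[1:]:
    m = alu("max", m, v)
assert m == 7
assert alu("relu", -5) == 0
```

Folding the averaging divisor into the convolution weights is what lets a single adder, rather than a divider, cover average pooling.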
The control signals generated by the control module 4 described above for the computing array are also passed along stage by stage with the data, so that the computing timing of all pipeline stages matches. After this multi-stage delay, the number of computing units each control signal must drive drops from the total number of units to the number of columns, which reduces the fan-out of the control signals and allows a higher chip operating frequency.
The present invention realizes a reconfigurable convolutional neural network hardware architecture through a new interconnection structure and a new mapping scheme. It achieves 100% resource utilization for convolution kernels of any size, delivering high performance with low hardware resource consumption.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and for the parts they share, the embodiments may be referred to one another.
Specific examples have been used herein to explain the principles and implementations of the present invention. The above descriptions of the embodiments serve only to aid understanding of the method of the invention and its core idea; meanwhile, those of ordinary skill in the art may, following the idea of the invention, vary both the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810358443.6A CN108647773B (en) | 2018-04-20 | 2018-04-20 | A Hardware Interconnection System for Reconfigurable Convolutional Neural Networks |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108647773A true CN108647773A (en) | 2018-10-12 |
CN108647773B CN108647773B (en) | 2021-07-23 |
Family
ID=63747074
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810358443.6A Active CN108647773B (en) | 2018-04-20 | 2018-04-20 | A Hardware Interconnection System for Reconfigurable Convolutional Neural Networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108647773B (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109902821A (en) * | 2019-03-06 | 2019-06-18 | 苏州浪潮智能科技有限公司 | A kind of data processing method, device and associated component |
CN110503189A (en) * | 2019-08-02 | 2019-11-26 | 腾讯科技(深圳)有限公司 | A kind of data processing method and device |
CN111461310A (en) * | 2019-01-21 | 2020-07-28 | 三星电子株式会社 | Neural network device, neural network system and method for processing neural network model |
CN111582451A (en) * | 2020-05-08 | 2020-08-25 | 中国科学技术大学 | Image recognition interlayer parallel pipeline type binary convolution neural network array architecture |
CN111738433A (en) * | 2020-05-22 | 2020-10-02 | 华南理工大学 | A Reconfigurable Convolution Hardware Accelerator |
WO2021070006A1 (en) * | 2019-10-11 | 2021-04-15 | International Business Machines Corporation | Hybrid data-model parallelism for efficient deep learning |
CN112766453A (en) * | 2019-10-21 | 2021-05-07 | 华为技术有限公司 | Data processing device and data processing method |
CN112860320A (en) * | 2021-02-09 | 2021-05-28 | 山东英信计算机技术有限公司 | Method, system, device and medium for data processing based on RISC-V instruction set |
CN113064852A (en) * | 2021-03-24 | 2021-07-02 | 珠海市一微半导体有限公司 | Reconfigurable processor and configuration method |
CN113191491A (en) * | 2021-03-16 | 2021-07-30 | 杭州慧芯达科技有限公司 | Multi-dimensional parallel artificial intelligence processor architecture |
CN113240074A (en) * | 2021-04-15 | 2021-08-10 | 中国科学院自动化研究所 | Reconfigurable neural network processor |
CN113971261A (en) * | 2020-07-23 | 2022-01-25 | 中科亿海微电子科技(苏州)有限公司 | Convolution operation device, convolution operation method, electronic device, and medium |
CN114416182A (en) * | 2022-03-31 | 2022-04-29 | 深圳致星科技有限公司 | FPGA accelerator and chip for federal learning and privacy computation |
WO2022126630A1 (en) * | 2020-12-18 | 2022-06-23 | 清华大学 | Reconfigurable processor and method for computing multiple neural network activation functions thereon |
CN115334056A (en) * | 2022-08-09 | 2022-11-11 | 嘉兴真与科技有限责任公司 | 2D Convolutional Neural Network Architecture Based on Video Stream Processing |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103019656A (en) * | 2012-12-04 | 2013-04-03 | Institute of Semiconductors, Chinese Academy of Sciences | Dynamically reconfigurable multi-stage parallel single instruction multiple data array processing system |
CN106875011A (en) * | 2017-01-12 | 2017-06-20 | Nanjing University | Hardware architecture and computation flow of a binary-weight convolutional neural network accelerator |
US20180053084A1 (en) * | 2016-08-22 | 2018-02-22 | Kneron Inc. | Multi-layer neural network |
CN107832839A (en) * | 2017-10-31 | 2018-03-23 | Beijing Horizon Information Technology Co., Ltd. | Method and apparatus for performing computations in a convolutional neural network |
CN107832699A (en) * | 2017-11-02 | 2018-03-23 | North China University of Technology | Method and device for testing point-of-interest attention based on an array lens |
CN107836001A (en) * | 2015-06-29 | 2018-03-23 | Microsoft Technology Licensing, LLC | Convolutional neural networks on hardware accelerators |
CN107851214A (en) * | 2015-07-23 | 2018-03-27 | Mireplica Technology, LLC | Performance enhancement for two-dimensional array processors |
CN107918794A (en) * | 2017-11-15 | 2018-04-17 | Institute of Computing Technology, Chinese Academy of Sciences | Neural network processor based on a computing array |
2018-04-20: Application CN201810358443.6A filed in China; granted as CN108647773B (legal status: Active)
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111461310A (en) * | 2019-01-21 | 2020-07-28 | 三星电子株式会社 | Neural network device, neural network system and method for processing neural network model |
CN109902821A (en) * | 2019-03-06 | 2019-06-18 | Data processing method, device, and associated components |
CN110503189B (en) * | 2019-08-02 | 2021-10-08 | 腾讯科技(深圳)有限公司 | Data processing method and device |
CN110503189A (en) * | 2019-08-02 | 2019-11-26 | Data processing method and device |
GB2604060B (en) * | 2019-10-11 | 2024-10-16 | Ibm | Hybrid data-model parallelism for efficient deep learning |
WO2021070006A1 (en) * | 2019-10-11 | 2021-04-15 | International Business Machines Corporation | Hybrid data-model parallelism for efficient deep learning |
US11556450B2 (en) | 2019-10-11 | 2023-01-17 | International Business Machines Corporation | Hybrid data-model parallelism for efficient deep learning |
GB2604060A (en) * | 2019-10-11 | 2022-08-24 | Ibm | Hybrid data-model parallelism for efficient deep learning |
CN112766453B (en) * | 2019-10-21 | 2025-02-21 | 华为技术有限公司 | Data processing device and data processing method |
CN112766453A (en) * | 2019-10-21 | 2021-05-07 | 华为技术有限公司 | Data processing device and data processing method |
CN111582451B (en) * | 2020-05-08 | 2022-09-06 | Inter-layer parallel pipelined binary convolutional neural network array architecture for image recognition |
CN111582451A (en) * | 2020-05-08 | 2020-08-25 | Inter-layer parallel pipelined binary convolutional neural network array architecture for image recognition |
CN111738433A (en) * | 2020-05-22 | 2020-10-02 | 华南理工大学 | A Reconfigurable Convolution Hardware Accelerator |
CN111738433B (en) * | 2020-05-22 | 2023-09-26 | 华南理工大学 | Reconfigurable convolution hardware accelerator |
CN113971261A (en) * | 2020-07-23 | 2022-01-25 | 中科亿海微电子科技(苏州)有限公司 | Convolution operation device, convolution operation method, electronic device, and medium |
WO2022126630A1 (en) * | 2020-12-18 | 2022-06-23 | 清华大学 | Reconfigurable processor and method for computing multiple neural network activation functions thereon |
CN112860320A (en) * | 2021-02-09 | 2021-05-28 | 山东英信计算机技术有限公司 | Method, system, device and medium for data processing based on RISC-V instruction set |
CN113191491A (en) * | 2021-03-16 | 2021-07-30 | 杭州慧芯达科技有限公司 | Multi-dimensional parallel artificial intelligence processor architecture |
CN113064852B (en) * | 2021-03-24 | 2022-06-10 | 珠海一微半导体股份有限公司 | A reconfigurable processor and configuration method |
WO2022199459A1 (en) * | 2021-03-24 | 2022-09-29 | 珠海一微半导体股份有限公司 | Reconfigurable processor and configuration method |
CN113064852A (en) * | 2021-03-24 | 2021-07-02 | 珠海市一微半导体有限公司 | Reconfigurable processor and configuration method |
CN113240074A (en) * | 2021-04-15 | 2021-08-10 | 中国科学院自动化研究所 | Reconfigurable neural network processor |
CN114416182B (en) * | 2022-03-31 | 2022-06-17 | 深圳致星科技有限公司 | FPGA accelerator and chip for federal learning and privacy computation |
CN114416182A (en) * | 2022-03-31 | 2022-04-29 | 深圳致星科技有限公司 | FPGA accelerator and chip for federal learning and privacy computation |
CN115334056A (en) * | 2022-08-09 | 2022-11-11 | 嘉兴真与科技有限责任公司 | 2D Convolutional Neural Network Architecture Based on Video Stream Processing |
Also Published As
Publication number | Publication date |
---|---|
CN108647773B (en) | 2021-07-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108647773A (en) | Hardware interconnection architecture for a reconfigurable convolutional neural network | |
CN109086867B (en) | Convolutional neural network acceleration system based on FPGA | |
CN109102065B (en) | Convolutional neural network accelerator based on PSoC | |
CN108537331A (en) | Reconfigurable convolutional neural network acceleration circuit based on asynchronous logic | |
CN111176727B (en) | Computing device and computing method | |
CN105681628B (en) | Convolutional network arithmetic unit, reconfigurable convolutional neural network processor, and method for image denoising | |
CN106951395B (en) | Parallel convolution operation method and device for compressed convolutional neural networks | |
CN109948774B (en) | Neural network accelerator based on network layer binding operation and implementation method thereof | |
CN111582465B (en) | Convolutional neural network acceleration processing system and method based on FPGA and terminal | |
CN110458279A (en) | An FPGA-based binary neural network acceleration method and system | |
JP2021510219A (en) | Multicast network-on-chip convolutional neural network hardware accelerator and operation method thereof | |
CN110852428A (en) | Neural network acceleration method and accelerator based on FPGA | |
CN110175670B (en) | A method and system for implementing YOLOv2 detection network based on FPGA | |
CN107256424B (en) | Ternary-weight convolutional network processing system and method | |
CN111079923B (en) | Spark convolutional neural network system suitable for edge computing platform and circuit thereof | |
CN108629406B (en) | Arithmetic device for convolutional neural network | |
CN107153873A (en) | Binary convolutional neural network processor and method of use | |
CN107633297A (en) | Convolutional neural network hardware accelerator based on a parallel fast FIR filter algorithm | |
WO2021259098A1 (en) | Acceleration system and method based on convolutional neural network, and storage medium | |
CN117795473A (en) | Partial-sum management and reconfigurable systolic flow architecture for in-memory computation | |
CN113762480B (en) | Time sequence processing accelerator based on one-dimensional convolutional neural network | |
CN112862079B (en) | Design method for a pipelined convolution computing architecture and residual network acceleration system | |
CN112001492B (en) | Hybrid Pipeline Acceleration Architecture and Acceleration Method for Binary Weight DenseNet Model | |
CN110209627A (en) | Hardware acceleration method for SSD targeting intelligent terminals | |
CN108647780A (en) | Reconfigurable pooling operation module structure for neural networks and implementation method thereof | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||