CN108647773A - Hardware interconnection architecture for a reconfigurable convolutional neural network - Google Patents
Hardware interconnection architecture for a reconfigurable convolutional neural network
- Publication number
- CN108647773A (application CN201810358443.6A)
- Authority
- CN
- China
- Prior art keywords
- module
- calculation
- convolutional neural
- data
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Complex Calculations (AREA)
- Image Processing (AREA)
- Stored Programmes (AREA)
- Image Analysis (AREA)
Abstract
Description
Technical Field
The invention belongs to the technical field of hardware design for image processing algorithms, and in particular relates to a hardware interconnection architecture for a reconfigurable convolutional neural network.
Background Art
With the rise of artificial intelligence, deep learning has been applied ever more widely to computer vision, speech recognition, and other big-data applications. As an important algorithmic model in deep learning, the convolutional neural network (CNN) is now used extensively in tasks such as image classification, face recognition, video detection, and speech recognition.
As image recognition accuracy improves, convolutional neural networks grow ever more complex and demand ever more computation. Traditional general-purpose processors, which carry a large amount of redundant hardware, therefore perform poorly on large convolutional neural networks and cannot meet the practical requirements of many scenarios. Accelerating convolutional neural networks on dedicated hardware platforms has consequently become a mainstream engineering approach: a parallel, pipelined hardware architecture can deliver enough acceleration to achieve real-time processing.
Although hardware platforms far outperform traditional general-purpose processors, network structures are growing increasingly complex and must handle convolution kernels of many sizes. An inflexible hardware structure therefore accelerates only certain specific networks well, while remaining inefficient for others. The need for a reconfigurable CNN hardware architecture that handles convolution kernels of arbitrary size and diverse network structures is thus becoming ever more pressing.
For reconfigurable CNN hardware, current research faces two main difficulties: first, how to configure convolution kernels of different sizes on a fixed hardware structure; second, how to keep that structure highly efficient for every kernel size. In the prior art, a hardware structure implementing the most common 3×3 convolution template is designed as the basic unit, data is passed through interconnections between these units, and multiple basic units are combined to realize other kernel types; for example, five basic units are needed to implement three 5×5 convolution templates. Under this scheme the number of basic units in hardware is fixed while the number required per kernel varies, so for some kernel sizes a portion of the basic units sits idle and 100% utilization cannot be guaranteed. Moreover, not every basic unit produces a valid output, so all outputs must first be buffered and the valid data selected afterwards, wasting a large amount of additional storage.
In the prior art, many image processing hardware accelerators design an optimal hardware structure for one particular algorithm and then support other algorithms through interconnection. Because algorithms are diverse and the hardware scale is fixed, resource utilization and computing performance drop whenever the network model or the type of computation changes. Furthermore, in image processing, and especially in neural network algorithms, bandwidth is a key factor limiting performance in addition to compute-unit utilization.
There is thus an urgent need for an architecture that improves data reuse through structural interconnection while also reducing bandwidth requirements.
Summary of the Invention
The object of the present invention is to provide a hardware interconnection architecture for a reconfigurable convolutional neural network that improves data reuse through structural interconnection while reducing bandwidth requirements.
The hardware interconnection architecture for a reconfigurable convolutional neural network provided by the present invention is applied in the field of image processing and includes:
a data-and-parameter off-chip cache module, which buffers the pixel data of the input image to be processed and the parameters supplied for the convolutional neural network computation;
a basic computing unit array module, connected to the data off-chip cache module and the parameter off-chip cache module, which performs the core computation of the convolutional neural network; the basic computing units are interconnected as a two-dimensional array: along the row direction, the units share input data and compute in parallel using different parameter data; along the column direction, results are passed row by row and serve as inputs to the next row;
an arithmetic logic unit (ALU) computing module, connected to the basic computing unit array module, which processes the array's results to implement the downsampling layer, the activation function, and partial-sum accumulation;
when computing a convolutional neural network, each column of the basic computing unit array independently computes one output feature map, and kernels of different sizes are mapped by splitting the input feature map according to convolution window position and feeding the pieces in over multiple passes;
a control module, connected to the basic computing unit array module and the ALU computing module, which realizes convolution kernels of arbitrary size and multiple computing modes according to different parameters.
Optionally, when the ALU computing module and the basic computing unit array module compute a convolutional neural network, each input feature map after splitting requires only a single parameter of the convolution kernel.
Optionally, the basic computing unit array module is internally interconnected, and by changing its internal interconnection pattern and data paths it can support computations of different kinds and different bit widths.
Optionally, the hardware interconnection architecture is implemented either on a field-programmable gate array or as an integrated circuit.
According to the specific embodiments provided herein, the invention achieves the following technical effects. A new mapping scheme maps the convolutional neural network so that the degrees of freedom of different network models are shifted from the spatial domain to the time domain, allowing a fixed hardware interconnection architecture to maintain 100% resource utilization for any convolutional neural network. The basic computing unit module makes full use of digital signal processing (DSP) units and adopts a flexible interconnection structure to support different computing modes and variable data bit widths. Pipelined parallel computation between rows and reuse of input data between columns reduce both the input-data and output bandwidth requirements.
Brief Description of the Drawings
FIG. 1 is a structural diagram of the hardware interconnection architecture of the convolutional neural network provided by the present invention.
FIG. 2 is a schematic diagram of the specific mapping under the mapping scheme provided by the present invention.
FIG. 3 is a schematic diagram of the internal structure of the basic computing module provided by the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the drawings illustrate only some embodiments of the invention, and those of ordinary skill in the art can derive further drawings from them without creative effort. Likewise, the described embodiments are only a subset of the possible embodiments. All other embodiments obtained by those of ordinary skill in the art from the embodiments described herein without creative effort fall within the scope of protection of the present invention.
The purpose of the present invention is to provide a hardware interconnection architecture for a convolutional neural network that improves data reuse through structural interconnection while reducing bandwidth requirements.
A hardware interconnection architecture for a reconfigurable convolutional neural network, applied in the field of image processing, includes:
a data-and-parameter off-chip cache module, which buffers the pixel data of the input image to be processed and the parameters supplied for the convolutional neural network computation;
a basic computing unit array module, connected to the data off-chip cache module and the parameter off-chip cache module, which performs the core image-processing computation, the convolutional neural network. The basic computing modules are interconnected as a two-dimensional array: along the row direction they share input data and compute in parallel using different parameter data; along the column direction their results are passed row by row and serve as inputs to the next row;
when computing a convolutional neural network, each column of the basic computing unit array independently computes one output feature map, and kernels of different sizes are mapped by splitting the input feature map according to convolution window position and feeding the pieces in over multiple passes;
an arithmetic logic unit computing module, connected to the basic computing module, which processes the array's results to implement the downsampling layer, the activation function, and partial-sum accumulation;
a control module, connected to the basic computing module and the ALU computing module, which realizes convolution kernels of arbitrary size and multiple computing modes according to different parameters, enabling efficient and flexible computing units and interconnection structures for the convolutional neural networks of different structures, distance computations, vector computations, and digital-signal-processing computations found in image processing methods.
Optionally, when the ALU computing module and the basic computing module compute a convolutional neural network, each input feature map after splitting requires only a single parameter of the convolution kernel.
Optionally, the basic computing module is internally interconnected, and by changing its internal interconnection pattern and data paths it can support computations of different kinds and different bit widths.
Optionally, the architecture is implemented either on a field-programmable gate array or as an integrated circuit.
As shown in FIG. 1, the data off-chip cache module and the parameter off-chip cache module 1 are configured so that externally supplied image data and parameters are first buffered here, waiting to be read out by the basic computing unit array module, and computed results are likewise buffered here, waiting to be fetched again by the ALU computing module or to be output. The data-and-parameter off-chip cache module effectively acts as a large rate-matching buffer that absorbs the difference between the processing speed of the basic computing unit array module and the external input rate; without it, sending data from outside directly to the internal array would require fairly complex handshake signals to guarantee that no data is dropped or resent when the two speeds do not match.
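The rate-matching role described above can be modeled as a simple first-in, first-out buffer. The sketch below is a hypothetical software model (the `OffChipBuffer` class and its depth are illustrative assumptions, not the patent's hardware), showing how back-pressure on the write side replaces per-word handshaking:

```python
from collections import deque

class OffChipBuffer:
    """Assumed FIFO model of the off-chip cache module (module 1)."""

    def __init__(self, depth):
        self.q = deque()
        self.depth = depth

    def write(self, word):
        """External side: returns False (back-pressure) when full."""
        if len(self.q) >= self.depth:
            return False
        self.q.append(word)
        return True

    def read(self):
        """Array side: returns None when no data is ready yet."""
        return self.q.popleft() if self.q else None

buf = OffChipBuffer(depth=2)
assert buf.write(10) and buf.write(11)
assert not buf.write(12)   # buffer full: the fast producer must stall
assert buf.read() == 10    # the array drains data in arrival order
```

The producer only ever checks the single full/empty condition, which is exactly the simplification the off-chip cache provides over a direct connection.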
The basic computing unit array module 2, connected to the data-and-parameter off-chip cache module 1, performs the core computation of the convolutional neural network; the interconnection between the basic computing units is shown in FIG. 1. The architecture consists of multiple pipeline stages with identical functions: each stage multiplies input feature-map data by parameters and adds the partial sum to the result of the previous stage. Input data enters at the first stage and flows backward stage by stage until the last stage writes its result into the off-chip cache module. When the number of input feature maps exceeds the number of array rows, these results are returned from the off-chip cache module to the ALU computing module, where accumulation continues until the final output feature map is obtained. One cycle after the first feature map starts computing in the first pipeline stage, the second feature map starts in the second stage, so the result of the first stage arrives just in time to join the addition in the second stage; when the pipeline is completely full, a high degree of parallelism and high processing performance are achieved. For the pipeline to compute efficiently, the number of array rows/columns should, as far as possible, be a factor of the number of input/output feature maps.
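The multiply-accumulate chain of one array column, as described above, can be sketched functionally as follows (a minimal model under the assumption that each row holds a single 1×1 parameter; `column_pipeline` is an illustrative name, not from the patent):

```python
def column_pipeline(pixels, weights):
    """pixels[r]: the input pixel broadcast to row r of the array;
    weights[r]: the parameter held by this column's unit in row r.
    Each stage multiplies and adds the partial sum passed down from
    the row above, mirroring the column-direction dataflow."""
    partial = 0
    for px, w in zip(pixels, weights):
        partial = partial + px * w   # one pipeline stage: MAC, pass down
    return partial

# Three input feature maps contribute to one output pixel of this column.
assert column_pipeline([2, 3, 4], [1, 5, -1]) == 13

# Row-direction sharing: two columns reuse the same input pixels with
# different weights, computing two output feature maps in parallel.
cols = [column_pipeline([2, 3, 4], w) for w in ([1, 5, -1], [0, 1, 2])]
assert cols == [13, 11]
```

The second assertion illustrates why input bandwidth drops: the pixel stream is fetched once and consumed by every column.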
The arithmetic logic unit computing module 3, connected to the basic computing unit array module 2, implements the partial-sum accumulation of the convolution, the downsampling layer, and the activation function layer. Specifically: in a convolutional layer, after the results of the array's last row are stored in the off-chip cache module, any values that are still partial sums are read back into the ALU computing module, further accumulated with the last-row results of the next group of array computations, and stored again, thereby accumulating results over multiple groups of input feature maps. In a downsampling layer, the ALU computing module reads the output feature map of the previous convolutional layer from the off-chip cache module and, depending on whether the downsampling takes the maximum or the average, configures itself as a comparator or an adder. In the activation function layer (the ReLU function in this example), the ALU computing module uses a data selector (mux) to implement the activation function.
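The group-by-group accumulation described above can be sketched as follows. This is a simplified model under stated assumptions: each input feature map's contribution is reduced to one scalar per output pixel, and `accumulate_in_groups` is an illustrative name, not the patent's terminology:

```python
def accumulate_in_groups(contributions, rows):
    """contributions[i]: the contribution of input feature map i to one
    output pixel. rows: the number of array rows R. Maps are processed
    in groups of R; each group's partial sum (buffered off-chip in the
    real design) is folded into the running total by the ALU module."""
    total = 0
    for start in range(0, len(contributions), rows):
        group = contributions[start:start + rows]
        partial = sum(group)   # one pass through the R-row array
        total += partial       # ALU: add to the buffered partial sum
    return total

# Five input maps on a 2-row array take three passes, same final result.
assert accumulate_in_groups([1, 2, 3, 4, 5], rows=2) == 15
```

The point of the scheme is that the result is independent of the grouping, so the array size can stay fixed while the number of input feature maps varies.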
Besides computing convolutional neural networks, the hardware interconnection architecture of the reconfigurable convolutional neural network of the present invention can also effectively support distance computations, linear algebra computations, and image-signal-processing computations.
The control module 4, connected to the modules above, generates the control signals for the other modules. Specifically, it accepts and decodes external instructions; generates the enable signals and addresses with which the off-chip cache module reads and writes data off-chip and with which the computing array reads and writes data to the off-chip cache module; and generates the data and function-select signals for the computing units and the ALU computing module.
The basic computing unit array 2 described above is built from multiple pipeline stages. During mapping, the image data located at the same position of the convolution template within the sliding window (i.e., sharing the same parameter) is extracted to form a new input feature map. This is equivalent to splitting the convolution template into multiple 1×1 templates; accumulating the results computed with these split templates recovers the original convolution result. Taking a 5×5 convolution template as an example, the mapping is shown in FIG. 2. In this way a convolution template of any size can be split into 1×1 templates, so the structure maintains high efficiency for convolution kernels of any size. Furthermore, the pipelined structure between rows makes efficient use of the DSP function while reducing the number of output ports from the number of computing units to the number of array columns, so the results of all computing units except the last row need no extra buffer space, lowering the demand for on-chip storage resources.
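The equivalence underlying this mapping can be checked numerically. The sketch below is a software model, not the patented hardware: a K×K convolution equals the sum of K×K separate 1×1 convolutions, each applied to the pixels extracted from one fixed position of the sliding window (function names are illustrative):

```python
def conv2d(img, ker):
    """Plain valid-mode 2-D convolution (correlation form) on lists."""
    K = len(ker)
    H, W = len(img), len(img[0])
    return [[sum(ker[i][j] * img[y + i][x + j]
                 for i in range(K) for j in range(K))
             for x in range(W - K + 1)]
            for y in range(H - K + 1)]

def conv_by_1x1_split(img, ker):
    """Same result via the patent's mapping idea: one 1x1 template
    (a single parameter ker[i][j]) per window position (i, j)."""
    K = len(ker)
    oh, ow = len(img) - K + 1, len(img[0]) - K + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(K):
        for j in range(K):
            # "new input feature map": the pixels at window offset (i, j)
            for y in range(oh):
                for x in range(ow):
                    out[y][x] += ker[i][j] * img[y + i][x + j]
    return out

img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ker = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]   # a 3x3 Sobel-like template
assert conv_by_1x1_split(img, ker) == conv2d(img, ker)
```

Because every split template is 1×1, each pass through the array needs only one parameter per input feature map, which is what keeps utilization independent of the kernel size.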
The arithmetic logic unit computing module 3 described above chains the downsampling layer and the activation function layer to the convolutional layer. Since the downsampling and activation layers require no parameters and their operations are simple, the architecture, in order to reduce external bandwidth demand and raise computing efficiency, does not map these two layers onto the computing array; instead, before the convolutional layer's results are output, they are fed into the ALU computing module for comparison, addition, or data selection. Concretely: for max pooling, the ALU computing module is configured as a comparator; for average pooling, the weights of the preceding convolutional layer are pre-divided by the pool denominator so that averaging only requires the ALU computing module to be configured as an adder; and for the ReLU activation function, the ALU computing module is configured as a data selector (mux).
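The three ALU configurations just described can be sketched as one mode-switched operation. This is an assumed behavioral model (the `alu` function and its mode names are illustrative, not the patent's RTL):

```python
def alu(mode, a, b=0):
    """Behavioral model of ALU module 3 in its three configurations."""
    if mode == "max":    # comparator for max pooling
        return a if a > b else b
    if mode == "avg":    # plain adder; the divisor was pre-folded into
        return a + b     # the previous conv layer's weights
    if mode == "relu":   # data selector (mux): pick a or 0 by sign
        return a if a > 0 else 0
    raise ValueError(mode)

# 2x2 max pool realized as repeated comparisons, then ReLU on a result.
window = [3, -1, 7, 2]
m = window[0]
for v in window[1:]:
    m = alu("max", m, v)
assert m == 7
assert alu("relu", -5) == 0
```

Folding the averaging divisor into the convolution weights is what lets a single adder, rather than a divider, cover average pooling.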
The control signals generated by the control module 4 described above for the computing array are also passed along stage by stage with the data, so that the computing timing of all pipeline stages matches. After this multi-stage delay, the number of computing units each control signal must drive drops from the total number of units to the number of columns, which reduces the fan-out of the control signals and allows a higher chip operating frequency.
The present invention realizes a reconfigurable convolutional neural network hardware architecture through a new interconnection structure and a new mapping scheme. It achieves 100% resource utilization for convolution kernels of any size, delivering high performance with low hardware resource consumption.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and for the parts they share, the embodiments may be referred to one another.
Specific examples have been used herein to explain the principles and implementations of the present invention. The above descriptions of the embodiments serve only to aid understanding of the method of the invention and its core idea; meanwhile, those of ordinary skill in the art may, following the idea of the invention, vary both the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810358443.6A CN108647773B (en) | 2018-04-20 | 2018-04-20 | A Hardware Interconnection System for Reconfigurable Convolutional Neural Networks |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108647773A true CN108647773A (en) | 2018-10-12 |
CN108647773B CN108647773B (en) | 2021-07-23 |
Family
ID=63747074
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810358443.6A Active CN108647773B (en) | 2018-04-20 | 2018-04-20 | A Hardware Interconnection System for Reconfigurable Convolutional Neural Networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108647773B (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109902821A (en) * | 2019-03-06 | 2019-06-18 | 苏州浪潮智能科技有限公司 | A kind of data processing method, device and associated component |
CN110503189A (en) * | 2019-08-02 | 2019-11-26 | 腾讯科技(深圳)有限公司 | A kind of data processing method and device |
CN111461310A (en) * | 2019-01-21 | 2020-07-28 | 三星电子株式会社 | Neural network device, neural network system and method for processing neural network model |
CN111582451A (en) * | 2020-05-08 | 2020-08-25 | 中国科学技术大学 | Image recognition interlayer parallel pipeline type binary convolution neural network array architecture |
CN111738433A (en) * | 2020-05-22 | 2020-10-02 | 华南理工大学 | A Reconfigurable Convolution Hardware Accelerator |
WO2021070006A1 (en) * | 2019-10-11 | 2021-04-15 | International Business Machines Corporation | Hybrid data-model parallelism for efficient deep learning |
CN112766453A (en) * | 2019-10-21 | 2021-05-07 | 华为技术有限公司 | Data processing device and data processing method |
CN112860320A (en) * | 2021-02-09 | 2021-05-28 | 山东英信计算机技术有限公司 | Method, system, device and medium for data processing based on RISC-V instruction set |
CN113064852A (en) * | 2021-03-24 | 2021-07-02 | 珠海市一微半导体有限公司 | Reconfigurable processor and configuration method |
CN113191491A (en) * | 2021-03-16 | 2021-07-30 | 杭州慧芯达科技有限公司 | Multi-dimensional parallel artificial intelligence processor architecture |
CN113240074A (en) * | 2021-04-15 | 2021-08-10 | 中国科学院自动化研究所 | Reconfigurable neural network processor |
CN113971261A (en) * | 2020-07-23 | 2022-01-25 | 中科亿海微电子科技(苏州)有限公司 | Convolution operation device, convolution operation method, electronic device, and medium |
CN114416182A (en) * | 2022-03-31 | 2022-04-29 | 深圳致星科技有限公司 | FPGA accelerator and chip for federal learning and privacy computation |
WO2022126630A1 (en) * | 2020-12-18 | 2022-06-23 | 清华大学 | Reconfigurable processor and method for computing multiple neural network activation functions thereon |
CN115334056A (en) * | 2022-08-09 | 2022-11-11 | 嘉兴真与科技有限责任公司 | 2D Convolutional Neural Network Architecture Based on Video Stream Processing |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103019656A (en) * | 2012-12-04 | 2013-04-03 | Institute of Semiconductors, Chinese Academy of Sciences | Dynamically reconfigurable multi-stage parallel single instruction multiple data array processing system |
CN106875011A (en) * | 2017-01-12 | 2017-06-20 | Nanjing University | Hardware architecture and computation flow of a binary-weight convolutional neural network accelerator |
US20180053084A1 (en) * | 2016-08-22 | 2018-02-22 | Kneron Inc. | Multi-layer neural network |
CN107832839A (en) * | 2017-10-31 | 2018-03-23 | Beijing Horizon Information Technology Co., Ltd. | Method and apparatus for performing computations in a convolutional neural network |
CN107832699A (en) * | 2017-11-02 | 2018-03-23 | North China University of Technology | Method and device for testing point-of-interest attention based on an array lens |
CN107836001A (en) * | 2015-06-29 | 2018-03-23 | Microsoft Technology Licensing, LLC | Convolutional neural networks on hardware accelerators |
CN107851214A (en) * | 2015-07-23 | 2018-03-27 | Mireplica Technology, LLC | Performance enhancement for two-dimensional array processors |
CN107918794A (en) * | 2017-11-15 | 2018-04-17 | Institute of Computing Technology, Chinese Academy of Sciences | Neural network processor based on a computing array |
2018-04-20: Application CN201810358443.6A filed in China; granted as CN108647773B (legal status: Active)
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111461310A (en) * | 2019-01-21 | 2020-07-28 | 三星电子株式会社 | Neural network device, neural network system and method for processing neural network model |
CN109902821A (en) * | 2019-03-06 | 2019-06-18 | Data processing method, device, and associated components |
CN110503189B (en) * | 2019-08-02 | 2021-10-08 | 腾讯科技(深圳)有限公司 | Data processing method and device |
CN110503189A (en) * | 2019-08-02 | 2019-11-26 | Data processing method and device |
GB2604060B (en) * | 2019-10-11 | 2024-10-16 | Ibm | Hybrid data-model parallelism for efficient deep learning |
WO2021070006A1 (en) * | 2019-10-11 | 2021-04-15 | International Business Machines Corporation | Hybrid data-model parallelism for efficient deep learning |
US11556450B2 (en) | 2019-10-11 | 2023-01-17 | International Business Machines Corporation | Hybrid data-model parallelism for efficient deep learning |
GB2604060A (en) * | 2019-10-11 | 2022-08-24 | Ibm | Hybrid data-model parallelism for efficient deep learning |
CN112766453B (en) * | 2019-10-21 | 2025-02-21 | 华为技术有限公司 | Data processing device and data processing method |
CN112766453A (en) * | 2019-10-21 | 2021-05-07 | 华为技术有限公司 | Data processing device and data processing method |
CN111582451B (en) * | 2020-05-08 | 2022-09-06 | Inter-layer parallel pipelined binary convolutional neural network array architecture for image recognition |
CN111582451A (en) * | 2020-05-08 | 2020-08-25 | Inter-layer parallel pipelined binary convolutional neural network array architecture for image recognition |
CN111738433A (en) * | 2020-05-22 | 2020-10-02 | 华南理工大学 | A Reconfigurable Convolution Hardware Accelerator |
CN111738433B (en) * | 2020-05-22 | 2023-09-26 | 华南理工大学 | Reconfigurable convolution hardware accelerator |
CN113971261A (en) * | 2020-07-23 | 2022-01-25 | 中科亿海微电子科技(苏州)有限公司 | Convolution operation device, convolution operation method, electronic device, and medium |
WO2022126630A1 (en) * | 2020-12-18 | 2022-06-23 | 清华大学 | Reconfigurable processor and method for computing multiple neural network activation functions thereon |
CN112860320A (en) * | 2021-02-09 | 2021-05-28 | 山东英信计算机技术有限公司 | Method, system, device and medium for data processing based on RISC-V instruction set |
CN113191491A (en) * | 2021-03-16 | 2021-07-30 | 杭州慧芯达科技有限公司 | Multi-dimensional parallel artificial intelligence processor architecture |
CN113064852B (en) * | 2021-03-24 | 2022-06-10 | 珠海一微半导体股份有限公司 | A reconfigurable processor and configuration method |
WO2022199459A1 (en) * | 2021-03-24 | 2022-09-29 | 珠海一微半导体股份有限公司 | Reconfigurable processor and configuration method |
CN113064852A (en) * | 2021-03-24 | 2021-07-02 | 珠海市一微半导体有限公司 | Reconfigurable processor and configuration method |
CN113240074A (en) * | 2021-04-15 | 2021-08-10 | 中国科学院自动化研究所 | Reconfigurable neural network processor |
CN114416182B (en) * | 2022-03-31 | 2022-06-17 | 深圳致星科技有限公司 | FPGA accelerator and chip for federal learning and privacy computation |
CN114416182A (en) * | 2022-03-31 | 2022-04-29 | 深圳致星科技有限公司 | FPGA accelerator and chip for federal learning and privacy computation |
CN115334056A (en) * | 2022-08-09 | 2022-11-11 | 嘉兴真与科技有限责任公司 | 2D Convolutional Neural Network Architecture Based on Video Stream Processing |
Also Published As
Publication number | Publication date |
---|---|
CN108647773B (en) | 2021-07-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108647773A (en) | Hardware interconnection architecture for a reconfigurable convolutional neural network | |
CN109086867B (en) | Convolutional neural network acceleration system based on FPGA | |
CN109102065B (en) | Convolutional neural network accelerator based on PSoC | |
CN108537331A (en) | Reconfigurable convolutional neural network acceleration circuit based on asynchronous logic | |
CN111176727B (en) | Computing device and computing method | |
CN105681628B (en) | Convolutional network arithmetic unit, reconfigurable convolutional neural network processor, and method for image denoising | |
CN106951395B (en) | Parallel convolution operation method and device for compressed convolutional neural networks | |
CN109948774B (en) | Neural network accelerator based on network layer binding operation and implementation method thereof | |
CN111582465B (en) | Convolutional neural network acceleration processing system and method based on FPGA and terminal | |
CN110458279A (en) | An FPGA-based binary neural network acceleration method and system | |
JP2021510219A (en) | Multicast network-on-chip convolutional neural network hardware accelerator and operation method thereof | |
CN110852428A (en) | Neural network acceleration method and accelerator based on FPGA | |
CN110175670B (en) | A method and system for implementing YOLOv2 detection network based on FPGA | |
CN107256424B (en) | Ternary-weight convolutional network processing system and method | |
CN111079923B (en) | Spark convolutional neural network system suitable for edge computing platform and circuit thereof | |
CN108629406B (en) | Arithmetic device for convolutional neural network | |
CN107153873A (en) | Binary convolutional neural network processor and method of use | |
CN107633297A (en) | Convolutional neural network hardware accelerator based on a parallel fast FIR filter algorithm | |
WO2021259098A1 (en) | Acceleration system and method based on convolutional neural network, and storage medium | |
CN117795473A (en) | Partial-sum management and reconfigurable systolic flow architecture for in-memory computation | |
CN113762480B (en) | Time sequence processing accelerator based on one-dimensional convolutional neural network | |
CN112862079B (en) | Design method for a pipelined convolution computing architecture and residual network acceleration system | |
CN112001492B (en) | Hybrid Pipeline Acceleration Architecture and Acceleration Method for Binary Weight DenseNet Model | |
CN110209627A (en) | Hardware acceleration method for SSD targeting intelligent terminals | |
CN108647780A (en) | Reconfigurable pooling operation module structure for neural networks and implementation method thereof | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||