
CN108647779A - Low-bit-width convolutional neural network reconfigurable computing unit - Google Patents


Info

Publication number
CN108647779A
Authority
CN
China
Prior art keywords
register
data
shift
accumulation
current cycle
Prior art date
Legal status
Granted
Application number
CN201810318783.6A
Other languages
Chinese (zh)
Other versions
CN108647779B (en)
Inventor
曹伟
王伶俐
罗成
谢亮
范锡添
周学功
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University
Priority to CN201810318783.6A
Publication of CN108647779A
Application granted
Publication of CN108647779B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7867 Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F 15/7871 Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a low-bit-width convolutional neural network reconfigurable computing unit. The unit comprises several reconfigurable shift-accumulate modules, a multiplexer, and a quantization module. Each reconfigurable shift-accumulate module comprises a controller, a first register, a second register, a third register, and a shift accumulator. The invention exploits network sparsity: the controller checks whether the fixed-point data and the exponent weight of the current cycle are zero, and once a zero is detected, a first trigger signal issued by the first register or a second trigger signal issued by the second register causes the third register to output the current cycle's shift-accumulate data directly. The invention supports flexible 4-bit and 8-bit fixed-point multiply-accumulate operations, increases the shift-accumulate operation rate, and reduces the memory footprint and power consumption of the computation.

Description

A Low-Bit-Width Convolutional Neural Network Reconfigurable Computing Unit

Technical Field

The present invention relates to the field of reconfigurable computing, and in particular to a low-bit-width convolutional neural network reconfigurable computing unit.

Background

With the development of artificial intelligence, deep learning has achieved great success in fields such as speech recognition, computer vision, and autonomous driving, pushing these fields forward. The core technology driving deep learning research is the convolutional neural network. Object recognition based on convolutional neural networks (CNNs) defeated traditional image recognition methods at the large-scale image recognition competition ILSVRC 2012, announcing the arrival of the deep learning era. As deep learning technology has continued to develop, CNN architectures have been continuously optimized and recognition performance has steadily improved. At the large-scale image recognition competition ILSVRC 2015, convolutional neural networks surpassed human image recognition ability for the first time, a milestone that marks the great success of deep learning technology.

As CNN performance has improved, network structures have become increasingly complex, bringing correspondingly larger computation and storage requirements. To support CNN computation, the processing pipeline is usually run on servers and in data centers; exchanging the required volumes of data with a data center introduces large latencies, which hinders the deployment of CNNs in embedded devices such as smartphones and smart vehicles. To address this, academia and industry have begun studying how to deploy CNNs on accelerators in embedded hardware systems, and many effective CNN accelerators have been designed around dedicated processing elements (PEs), typically using fixed computing units across different CNN models. Because of the diversity of CNNs, a fixed computing unit may no longer be suitable when the network model changes, which increases data movement and hurts power efficiency. Moreover, such fixed convolution mapping schemes do not scale well across different convolution parameters: mismatches arise between network shapes and computing resources, reducing resource utilization and performance. How to design reconfigurable computing units for different networks has therefore become an important research topic in this field.

Existing reconfigurable computing units generally rely on dedicated DSP (Digital Signal Processing) blocks, which are designed for floating-point arithmetic. In an ordinary floating-point CNN hardware design, DSP blocks are used for multiply-and-accumulate (MAC) operations, and a single DSP can complete one MAC per clock cycle. DSP blocks are, however, ill-suited to low-bit-width MAC operations, a shortcoming that prevents them from delivering their full capability in low-bit-width hardware designs.

To address this, Xilinx introduced a special DSP mapping technique, designed for its FPGA chips, that lets each on-chip DSP block perform two 8-bit multiply-accumulate operations in parallel. This technique exploits the full computing power of the on-chip DSPs and improves the FPGA's area and power efficiency. Its scope is too narrow, however: it applies only to 8-bit-wide fixed-point multiply-accumulate operations and cannot serve the special computation requirements of exponential convolutional neural networks. Overcoming these limitations has therefore become an urgent problem in this field.

Summary of the Invention

The purpose of the present invention is to provide a low-bit-width convolutional neural network reconfigurable computing unit that meets the computation requirements of exponential convolutional neural networks: it supports flexible 4-bit and 8-bit fixed-point multiply-accumulate operations, increases the shift-accumulate operation rate, and reduces the memory footprint and power consumption of the computation.

To achieve the above purpose, the present invention provides a low-bit-width convolutional neural network reconfigurable computing unit, applied to the shift-accumulate operations of an exponential convolutional neural network, comprising several reconfigurable shift-accumulate modules, a multiplexer, and a quantization module.

The multiplexer is connected to each reconfigurable shift-accumulate module and selects the current cycle's shift-accumulate data output by the reconfigurable shift-accumulate modules. The quantization module is connected to the multiplexer and quantizes the current cycle's shift-accumulate data to obtain the quantized data. In this unit:

Each reconfigurable shift-accumulate module comprises a controller, a first register, a second register, a third register, and a shift accumulator.

The controller determines whether the exponent weight data of the current cycle is negative. If it is negative, no shift-accumulate operation is needed, and the controller waits to evaluate the next cycle's exponent weight data. If it is not negative, the controller checks whether the current cycle's exponent weight data is zero: if it is nonzero, the controller causes the first register to store it; if it is zero, the controller causes the first register to issue a first trigger signal.

The controller likewise determines whether the fixed-point data of the current cycle is negative. If it is negative, no shift-accumulate operation is needed, and the controller waits to evaluate the next cycle's fixed-point data. If it is not negative, the controller checks whether the current cycle's fixed-point data is zero: if it is nonzero, the controller causes the second register to store it; if it is zero, the controller causes the second register to issue a second trigger signal.

The third register is connected to the first register and the second register. Upon the first trigger signal issued by the first register or the second trigger signal issued by the second register, the third register outputs the current cycle's shift-accumulate data. The third register also stores the previous cycle's shift-accumulate data.

The shift accumulator is connected to the first register, the second register, and the third register. It determines the current cycle's shift-accumulate data from the previous cycle's exponent weight data stored in the first register, the previous cycle's fixed-point data stored in the second register, and the previous cycle's shift-accumulate data stored in the third register, and stores the current cycle's shift-accumulate data in the third register.

Preferably, the shift accumulator comprises:

a shifter, connected to the first register and the second register, which determines the shifted data from the exponent weight data stored in the first register and the fixed-point data stored in the second register; and

an accumulator, connected to the shifter and the third register, which determines the current cycle's shift-accumulate data from the shifted data produced by the shifter and the previous cycle's shift-accumulate data stored in the third register.
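The shifter-accumulator decomposition above can be sketched in a few lines of Python. This is an illustrative behavioural model, not the patent's hardware: the function name, the exponent encoding (a stored value of 0 is taken to mean a zero weight, following the zero-skip description), and the sample operands are all assumptions.

```python
def shift_accumulate_step(acc_prev, data, exp_weight):
    """One cycle of the shift-accumulate module (illustrative model).

    acc_prev   -- previous cycle's shift-accumulate data (third register)
    data       -- fixed-point data (second register), e.g. 8-bit
    exp_weight -- exponent weight data (first register), e.g. 4-bit
    """
    # Controller behaviour: a negative operand means no shift-accumulate
    # this cycle, and a zero operand triggers re-output of the previous
    # accumulation (the first/second trigger signals in the text).
    if data <= 0 or exp_weight <= 0:
        return acc_prev
    shifted = data << exp_weight   # shifter: multiply by 2**exp_weight
    return acc_prev + shifted      # accumulator

acc = 0
for d, w in [(3, 2), (0, 5), (5, 1), (7, 0)]:
    acc = shift_accumulate_step(acc, d, w)
print(acc)  # 12 + 10 = 22; the two zero-operand cycles are skipped
```

Because the weight is stored as an exponent, the multiply in a conventional MAC collapses into a single barrel shift, which is the source of the rate and area advantage claimed for this unit.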

Preferably, the low-bit-width reconfigurable shift-accumulate module further comprises:

an output register, connected to the third register, which stores the current cycle's shift-accumulate data output by the third register.

Preferably, the exponent weight data is 4 bits wide.

Preferably, the fixed-point data is 8 bits wide.

Preferably, the shift-accumulate data is 18 bits wide.

Preferably, the quantized data is 8 bits wide.

Compared with the prior art, the present invention has the following technical effects:

The present invention exploits network sparsity in building the controller, first register, second register, third register, and shift accumulator. The controller checks whether the fixed-point data and the exponent weight of the current cycle are zero; once a zero is detected, the first trigger signal issued by the first register or the second trigger signal issued by the second register causes the third register to output the current cycle's shift-accumulate data. The unit supports flexible 4-bit and 8-bit fixed-point multiply-accumulate operations, increases the shift-accumulate operation rate, and reduces the memory footprint and power consumption of the computation.

Brief Description of the Drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings required by the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a structural diagram of the low-bit-width convolutional neural network reconfigurable computing unit according to an embodiment of the present invention;

Fig. 2 is a structural diagram of the low-bit-width reconfigurable shift-accumulate module according to an embodiment of the present invention.

Reference numerals: 10, reconfigurable shift-accumulate module; 11, controller; 12, first register; 13, second register; 14, shifter; 15, accumulator; 16, third register; 17, output register; 20, multiplexer; 30, quantization module.

Detailed Description

The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

The purpose of the present invention is to provide a low-bit-width convolutional neural network reconfigurable computing unit that meets the computation requirements of exponential convolutional neural networks: it supports flexible 4-bit and 8-bit fixed-point multiply-accumulate operations, increases the shift-accumulate operation rate, and reduces the memory footprint and power consumption of the computation.

To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a structural diagram of the low-bit-width convolutional neural network reconfigurable computing unit of an embodiment of the present invention, and Fig. 2 is a structural diagram of its low-bit-width reconfigurable shift-accumulate module. As shown in Figs. 1-2, the present invention provides a low-bit-width convolutional neural network reconfigurable computing unit, applied to the shift-accumulate operations of an exponential convolutional neural network, comprising a plurality of reconfigurable shift-accumulate modules 10, a multiplexer 20, and a quantization module 30.

Each reconfigurable shift-accumulate module 10 comprises a controller 11, a first register 12, a second register 13, a third register 16, and a shift accumulator 15.

The controller 11 determines whether the exponent weight data of the current cycle is negative. If it is negative, no shift-accumulate operation is needed, and the controller waits to evaluate the next cycle's exponent weight data. If it is not negative, the controller checks whether the current cycle's exponent weight data is zero: if it is nonzero, the controller causes the first register 12 to store it; if it is zero, the controller causes the first register 12 to issue the first trigger signal.

The controller 11 likewise determines whether the fixed-point data of the current cycle is negative. If it is negative, no shift-accumulate operation is needed, and the controller waits to evaluate the next cycle's fixed-point data. If it is not negative, the controller checks whether the current cycle's fixed-point data is zero: if it is nonzero, the controller causes the second register 13 to store it; if it is zero, the controller causes the second register 13 to issue the second trigger signal.

The third register 16 is connected to the first register 12 and the second register 13. Upon the first trigger signal issued by the first register 12 or the second trigger signal issued by the second register 13, the third register 16 outputs the current cycle's shift-accumulate data. The third register 16 also stores the previous cycle's shift-accumulate data.

The shift accumulator 15 is connected to the first register 12, the second register 13, and the third register 16. It determines the current cycle's shift-accumulate data from the previous cycle's exponent weight data stored in the first register 12, the previous cycle's fixed-point data stored in the second register 13, and the previous cycle's shift-accumulate data stored in the third register 16, and stores the current cycle's shift-accumulate data in the third register 16.

The multiplexer 20 is connected to each low-bit-width reconfigurable shift-accumulate module 10 and selects the current cycle's shift-accumulate data output by the modules 10.

The quantization module 30 is connected to the multiplexer 20 and quantizes the current cycle's shift-accumulate data to obtain the quantized data, which is 8 bits wide.

In the present invention, the shift accumulator 15 comprises:

a shifter 14, connected to the first register 12 and the second register 13, which determines the shifted data from the exponent weight data stored in the first register 12 and the fixed-point data stored in the second register 13; and

an accumulator 15, connected to the shifter 14 and the third register 16, which determines the current cycle's shift-accumulate data from the shifted data produced by the shifter 14 and the previous cycle's shift-accumulate data stored in the third register 16.

The low-bit-width convolutional neural network reconfigurable shift-accumulate module 10 of the present invention further comprises an output register 17, connected to the third register 16, which stores the current cycle's shift-accumulate data output by the third register 16.

In the present invention, the exponent weight data is 4 bits wide.

In the present invention, the fixed-point data is 8 bits wide.

In the present invention, the shift-accumulate data is 18 bits wide.

Convolutional neural networks contain a large degree of sparsity, and fully exploiting it can greatly improve the power efficiency of a hardware design. To further improve the performance of the reconfigurable computing unit, the present invention therefore studies and exploits the sparsity of convolutional neural networks to improve their power efficiency. Studies have shown that roughly 40% to 60% of the input data in a convolutional neural network is zero-valued, and a large fraction of the small weight values can be pruned without affecting network accuracy. Multiplications and additions involving zero are meaningless because they do not affect the output. Consequently, once the present invention detects that the fixed-point data or the exponent weight of the current cycle is zero, the first trigger signal issued by the first register 12 or the second trigger signal issued by the second register 13 causes the third register 16 to output the current cycle's shift-accumulate data.
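The 40%-60% zero-input figure above translates directly into gated-off cycles. A rough Python estimate, purely illustrative (the synthetic data distribution and its ~50% zero rate are assumptions, not measurements from the patent):

```python
import random

random.seed(0)
# Synthetic activation stream: each value is zero with ~50% probability,
# roughly matching the 40%-60% zero rate cited for CNN inputs.
data = [random.choice([0, random.randint(1, 127)]) for _ in range(1000)]
skipped = sum(1 for d in data if d == 0)
print(f"{skipped / len(data):.0%} of shift-accumulate updates can be gated off")
```

Each skipped update is exact rather than approximate, since a zero operand contributes nothing to the accumulation.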

The quantization module 30 of the present invention quantizes the 18-bit shift-accumulate data to obtain 8-bit quantized data.
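The patent does not spell out the quantization rule used by module 30. A common scheme, shown here purely as an assumption (including the hypothetical `frac_shift` scaling parameter), is an arithmetic right shift followed by saturation to the signed 8-bit range:

```python
def quantize_18_to_8(acc, frac_shift=10):
    """Reduce an 18-bit shift-accumulate value to 8 bits (assumed scheme).

    frac_shift is a hypothetical scaling parameter: the low-order bits
    are dropped by an arithmetic right shift, then the result is
    saturated to the signed 8-bit range [-128, 127].
    """
    q = acc >> frac_shift          # Python's >> is an arithmetic shift
    return max(-128, min(127, q))  # saturate to int8

print(quantize_18_to_8(131071))  # largest positive 18-bit value -> 127
print(quantize_18_to_8(-2048))   # -> -2
```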

During the shift-accumulate computation, the previous cycle's shift-accumulate data and the output current cycle's shift-accumulate data are markedly wider than the fixed-point data and the exponent weight data, because the shift-accumulate computation needs a larger numeric range to avoid overflow. The 18-bit width the present invention chooses for the current cycle's shift-accumulate data is fully sufficient to hold the results of all shift-accumulate operations.
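A quick headroom check makes the 18-bit choice concrete. The operand ranges below are assumptions consistent with the widths stated above (signed 8-bit data, and a 4-bit exponent field assumed to yield shifts of at most 7):

```python
DATA_BITS = 8    # fixed-point data width
EXP_BITS = 4     # exponent weight width (assumed: one flag bit, shifts 0-7)
ACC_BITS = 18    # shift-accumulate data width stated in the text

max_data = 2 ** (DATA_BITS - 1) - 1    # 127
max_shift = 2 ** (EXP_BITS - 1) - 1    # 7
max_term = max_data << max_shift       # 16256: one worst-case shifted term
acc_max = 2 ** (ACC_BITS - 1) - 1      # 131071: largest signed 18-bit value

# Number of worst-case terms an 18-bit accumulator absorbs before overflow:
print(acc_max // max_term)  # 8
```

Real accumulations sit far below this worst case, since operands are rarely maximal and, as noted above, 40%-60% of terms are skipped as zeros, which is why the 18-bit width suffices in practice.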

Tests on an xc7z020clg400-2 experimental board show the following advantages. (1) The reconfigurable computing unit designed in the present invention increases the shift-accumulate operation rate: an ordinary neural network accelerator built from an ordinary reconfigurable multiply-accumulate unit occupies 95 LUTs with a computational power consumption of 1.658 W, whereas the same accelerator built from the reconfigurable computing unit of the present invention occupies only 46 LUTs with a computational power consumption of only 1 W, and the unit runs at nearly twice the operating frequency of the ordinary multiply-accumulate unit. (2) The unit fully exploits its reconfigurability, supports network structures with multiple bit widths and multiple configurations, and achieves flexible 4- to 8-bit bit-width configuration. (3) The unit fully exploits network sparsity to further improve hardware performance. (4) The present invention allows exponential convolutional neural networks to be mapped efficiently onto embedded systems, further reducing their area and power overhead.

Table 1. Comparison of the reconfigurable computing unit and the reconfigurable multiply-accumulate unit

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the identical or similar parts of the embodiments may be cross-referenced.

Specific examples have been used herein to illustrate the principles and implementations of the present invention; the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (7)

1. A low-bit-width convolutional neural network reconfigurable computing unit, characterized in that the reconfigurable computing unit is applied to the shift-accumulate operation of an exponential convolutional neural network and comprises several reconfigurable shift-accumulate modules, a multiplexer, and a quantization processing module; the multiplexer is connected to each reconfigurable shift-accumulate module and is configured to select the shift-accumulate data of the current cycle output by the reconfigurable shift-accumulate modules; the quantization processing module is connected to the multiplexer and is configured to perform quantization on the shift-accumulate data of the current cycle to obtain quantized data; wherein:

the reconfigurable shift-accumulate module comprises a controller, a first register, a second register, a third register, and a shift accumulator;

the controller is configured to judge whether the exponent weight data of the current cycle is negative: if the exponent weight data of the current cycle is negative, no shift-accumulate operation is required and the controller waits to judge the exponent weight data of the next cycle; if the exponent weight data of the current cycle is not negative, the controller judges whether the exponent weight data of the current cycle is 0; if the exponent weight data of the current cycle is not 0, the controller controls the first register to store the exponent weight data of the current cycle; if the exponent weight data of the current cycle is 0, the controller controls the first register to issue a first trigger signal;

the controller is further configured to judge whether the fixed-point data of the current cycle is negative: if the fixed-point data of the current cycle is negative, no shift-accumulate operation is required and the controller waits to judge the fixed-point data of the next cycle; if the fixed-point data of the current cycle is not negative, the controller judges whether the fixed-point data of the current cycle is 0; if the fixed-point data of the current cycle is not 0, the controller controls the second register to store the fixed-point data of the current cycle; if the fixed-point data of the current cycle is 0, the controller controls the second register to issue a second trigger signal;

the third register is connected to the first register and the second register respectively, and is configured to output the shift-accumulate data of the current cycle under the control of the first trigger signal issued by the first register or the second trigger signal issued by the second register; the third register is further configured to store the shift-accumulate data of the previous cycle;

the shift accumulator is connected to the first register, the second register, and the third register respectively, and is configured to determine the shift-accumulate data of the current cycle from the exponent weight data of the previous cycle stored in the first register, the fixed-point data of the previous cycle stored in the second register, and the first shift-accumulate data of the previous cycle stored in the third register, and to store the shift-accumulate data of the current cycle in the third register.

2. The low-bit-width convolutional neural network reconfigurable computing unit according to claim 1, characterized in that the shift accumulator comprises:

a shifter, connected to the first register and the second register respectively, configured to determine shift data from the exponent weight data stored in the first register and the fixed-point data stored in the second register; and

an accumulator, connected to the shifter and the third register respectively, configured to determine the shift-accumulate data of the current cycle from the shift data determined by the shifter and the first shift-accumulate data of the previous cycle stored in the third register.

3. The low-bit-width convolutional neural network reconfigurable computing unit according to claim 1, characterized in that the reconfigurable shift-accumulate module further comprises an output register, connected to the third register, configured to store the shift-accumulate data of the current cycle output by the third register.

4. The low-bit-width convolutional neural network reconfigurable computing unit according to claim 1, characterized in that the exponent weight data is 4 bits.

5. The low-bit-width convolutional neural network reconfigurable computing unit according to claim 1, characterized in that the fixed-point data is 8 bits.

6. The low-bit-width convolutional neural network reconfigurable computing unit according to claim 1, characterized in that the shift-accumulate data is 18 bits.

7. The low-bit-width convolutional neural network reconfigurable computing unit according to claim 1, characterized in that the quantized data is 8-bit data.
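The datapath recited in claims 1-7 can be summarized in a short software sketch. This is an illustrative model of my own, not the patented hardware: function and parameter names are assumptions, the zero-valued operand that triggers result output in claim 1 is simplified here to a skip/end-of-window check, and the exact quantization rule (claim 7 only fixes the 8-bit output width) is assumed to be a truncating right shift.

```python
# Illustrative software model of the shift-accumulate datapath of
# claims 1-7: 4-bit exponent weights, 8-bit fixed-point data, an
# 18-bit accumulator (the "third register"), and an 8-bit quantized
# output. Names and the quantization rule are assumptions.

EXP_BITS, DATA_BITS, ACC_BITS, OUT_BITS = 4, 8, 18, 8
ACC_MASK = (1 << ACC_BITS) - 1

def shift_accumulate(exp_weights, fixed_data):
    """Accumulate data << exponent over one window.

    Mirrors the controller behaviour: negative operands are skipped
    (no shift-accumulate operation is issued), and zero operands,
    which in the patent trigger result output, contribute nothing.
    """
    acc = 0  # third register: running shift-accumulate value
    for w, x in zip(exp_weights, fixed_data):
        if w < 0 or x < 0 or w == 0 or x == 0:
            continue  # controller skips this cycle
        # Shifter: multiply by the power-of-two weight via left shift,
        # then accumulate with 18-bit wrap-around.
        acc = (acc + (x << w)) & ACC_MASK
    return acc

def quantize(acc):
    """Quantization module: reduce the 18-bit sum to 8 bits.
    A truncating right shift with saturation is assumed here."""
    return min(acc >> (ACC_BITS - OUT_BITS), (1 << OUT_BITS) - 1)

# Example window: weights 2**1 and 2**3 applied to 8-bit data; the
# third pair has a zero exponent and is skipped by the controller.
result = shift_accumulate([1, 3, 0], [10, 5, 99])
print(result)            # 10<<1 + 5<<3 = 20 + 40 = 60
print(quantize(result))  # 60 >> 10 = 0 after truncation
```

The key structural point the sketch shows is that the inner loop contains only a shift and an add, which is what lets the hardware unit halve the LUT count relative to a multiply-accumulate unit.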
CN201810318783.6A 2018-04-11 2018-04-11 Reconfigurable computing unit of low-bit-width convolutional neural network Active CN108647779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810318783.6A CN108647779B (en) 2018-04-11 2018-04-11 Reconfigurable computing unit of low-bit-width convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810318783.6A CN108647779B (en) 2018-04-11 2018-04-11 Reconfigurable computing unit of low-bit-width convolutional neural network

Publications (2)

Publication Number Publication Date
CN108647779A true CN108647779A (en) 2018-10-12
CN108647779B CN108647779B (en) 2021-06-04

Family

ID=63745967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810318783.6A Active CN108647779B (en) 2018-04-11 2018-04-11 Reconfigurable computing unit of low-bit-width convolutional neural network

Country Status (1)

Country Link
CN (1) CN108647779B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389212A (en) * 2018-12-30 2019-02-26 南京大学 Reconfigurable activation quantization pooling system for low-bit-width convolutional neural networks
CN110084362A (en) * 2019-03-08 2019-08-02 中国科学院计算技术研究所 A kind of logarithmic quantization device and method towards neural network
CN110109646A (en) * 2019-03-28 2019-08-09 北京迈格威科技有限公司 Data processing method, device and adder and multiplier and storage medium
CN111738427A (en) * 2020-08-14 2020-10-02 电子科技大学 A neural network operation circuit
CN111767980A (en) * 2019-04-02 2020-10-13 杭州海康威视数字技术股份有限公司 Model optimization method, device and equipment
CN111860778A (en) * 2020-07-08 2020-10-30 北京灵汐科技有限公司 A fully additive convolution method and device
US10872295B1 (en) 2019-09-19 2020-12-22 Hong Kong Applied Science and Technology Institute Company, Limited Residual quantization of bit-shift weights in an artificial neural network
WO2021046986A1 (en) * 2019-09-12 2021-03-18 东南大学 Selection method for calculation bit width of multi-bit-width pe array and calculation precision control circuit
CN113610222A (en) * 2021-07-07 2021-11-05 绍兴埃瓦科技有限公司 Method, system and hardware device for calculating convolution operation of neural network
CN114692842A (en) * 2020-12-25 2022-07-01 京东方科技集团股份有限公司 Image processing unit, image processing method, and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8060730B2 (en) * 2008-05-30 2011-11-15 Freescale Semiconductor, Inc. Selective MISR data accumulation during exception processing
CN104539263A (en) * 2014-12-25 2015-04-22 电子科技大学 Reconfigurable low-power dissipation digital FIR filter
CN106775599A (en) * 2017-01-09 2017-05-31 南京工业大学 Multi-computing-unit coarse-grained reconfigurable system and method for recurrent neural network
CN107077322A (en) * 2014-11-03 2017-08-18 Arm 有限公司 Apparatus and method for performing translation operation
CN107580712A (en) * 2015-05-08 2018-01-12 高通股份有限公司 Reduced computational complexity for fixed-point neural networks
CN107797962A (en) * 2017-10-17 2018-03-13 清华大学 Neural-network-based computing array
CN107844826A (en) * 2017-10-30 2018-03-27 中国科学院计算技术研究所 Neural-network processing unit and the processing system comprising the processing unit

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8060730B2 (en) * 2008-05-30 2011-11-15 Freescale Semiconductor, Inc. Selective MISR data accumulation during exception processing
CN107077322A (en) * 2014-11-03 2017-08-18 Arm 有限公司 Apparatus and method for performing translation operation
CN104539263A (en) * 2014-12-25 2015-04-22 电子科技大学 Reconfigurable low-power dissipation digital FIR filter
CN107580712A (en) * 2015-05-08 2018-01-12 高通股份有限公司 Reduced computational complexity for fixed-point neural networks
CN106775599A (en) * 2017-01-09 2017-05-31 南京工业大学 Multi-computing-unit coarse-grained reconfigurable system and method for recurrent neural network
CN107797962A (en) * 2017-10-17 2018-03-13 清华大学 Neural-network-based computing array
CN107844826A (en) * 2017-10-30 2018-03-27 中国科学院计算技术研究所 Neural-network processing unit and the processing system comprising the processing unit

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIALIN CHEN ET AL: "A quantum-implementable neural network model", Quantum Information Processing *
陆志坚: "Research on parallel architectures of FPGA-based convolutional neural networks", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389212B (en) * 2018-12-30 2022-03-25 南京大学 Reconfigurable activation quantization pooling system for low-bit-width convolutional neural network
CN109389212A (en) * 2018-12-30 2019-02-26 南京大学 Reconfigurable activation quantization pooling system for low-bit-width convolutional neural networks
CN110084362A (en) * 2019-03-08 2019-08-02 中国科学院计算技术研究所 A kind of logarithmic quantization device and method towards neural network
CN110109646A (en) * 2019-03-28 2019-08-09 北京迈格威科技有限公司 Data processing method, device and adder and multiplier and storage medium
CN110109646B (en) * 2019-03-28 2021-08-27 北京迈格威科技有限公司 Data processing method, data processing device, multiplier-adder and storage medium
CN111767980B (en) * 2019-04-02 2024-03-05 杭州海康威视数字技术股份有限公司 Model optimization method, device and equipment
CN111767980A (en) * 2019-04-02 2020-10-13 杭州海康威视数字技术股份有限公司 Model optimization method, device and equipment
WO2021046986A1 (en) * 2019-09-12 2021-03-18 东南大学 Selection method for calculation bit width of multi-bit-width pe array and calculation precision control circuit
US10872295B1 (en) 2019-09-19 2020-12-22 Hong Kong Applied Science and Technology Institute Company, Limited Residual quantization of bit-shift weights in an artificial neural network
CN111860778A (en) * 2020-07-08 2020-10-30 北京灵汐科技有限公司 A fully additive convolution method and device
CN111738427B (en) * 2020-08-14 2020-12-29 电子科技大学 A neural network operation circuit
CN111738427A (en) * 2020-08-14 2020-10-02 电子科技大学 A neural network operation circuit
CN114692842A (en) * 2020-12-25 2022-07-01 京东方科技集团股份有限公司 Image processing unit, image processing method, and storage medium
CN113610222A (en) * 2021-07-07 2021-11-05 绍兴埃瓦科技有限公司 Method, system and hardware device for calculating convolution operation of neural network
CN113610222B (en) * 2021-07-07 2024-02-27 绍兴埃瓦科技有限公司 Method, system and hardware device for calculating convolutional operation of neural network

Also Published As

Publication number Publication date
CN108647779B (en) 2021-06-04

Similar Documents

Publication Publication Date Title
CN108647779A (en) Reconfigurable computing unit for low-bit-width convolutional neural networks
JP7233656B2 (en) Task Activation for Accelerated Deep Learning
CN107451659B (en) Neural network accelerator for bit width partition and implementation method thereof
CN109284817B (en) Deep separable convolutional neural network processing architecture/method/system and medium
WO2019218896A1 (en) Computing method and related product
TWI795519B (en) Computing apparatus, machine learning computing apparatus, combined processing device, neural network chip, electronic device, board, and method for performing machine learning calculation
CN110163357B (en) Computing device and method
CN108875926A (en) Interaction language translating method and Related product
CN108537331A (en) Reconfigurable convolutional neural network accelerating circuit based on asynchronous logic
CN105681628A (en) Convolution network arithmetic unit, reconfigurable convolution neural network processor and image de-noising method of reconfigurable convolution neural network processor
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN104899182A (en) Matrix multiplication acceleration method for supporting variable blocks
CN110147251A (en) Architecture, chip and computing method for computing neural network models
US20230128529A1 (en) Acceleration system, method and storage medium based on convolutional neural network
KR20190084705A (en) Neural network processing unit including approximate multiplier and system on chip including the same
CN107766936A (en) Artificial neural networks, artificial neuron and the control method of artificial neuron
CN110543936B (en) Multi-parallel acceleration method for CNN full-connection layer operation
KR20210084220A (en) System and method for reconfigurable systolic array with partial read/write
Struharik et al. Conna–compressed cnn hardware accelerator
CN112114942A (en) Streaming data processing method based on many-core processor and computing device
CN110874627B (en) Data processing method, data processing device and computer readable medium
CN102306141A (en) Method for describing configuration information of dynamic reconfigurable array
US20240320496A1 (en) Methods and Apparatus For Packet Reorder Flow in a Neural Network Processing System
CN110414672A (en) Convolution operation method, device and system
WO2023040389A1 (en) Data type conversion method, storage medium, device, and printed circuit board

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant