
CN109635944B - A sparse convolutional neural network accelerator and implementation method - Google Patents

A sparse convolutional neural network accelerator and implementation method

Info

Publication number
CN109635944B
CN109635944B (application CN201811582530.6A)
Authority
CN
China
Prior art keywords
neuron
buffer
weight
connection weight
unit
Prior art date
Legal status
Expired - Fee Related
Application number
CN201811582530.6A
Other languages
Chinese (zh)
Other versions
CN109635944A (en)
Inventor
刘龙军
李宝婷
孙宏滨
梁家华
任鹏举
郑南宁
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201811582530.6A
Publication of CN109635944A
Application granted
Publication of CN109635944B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Complex Calculations (AREA)

Abstract

A sparse convolutional neural network accelerator and implementation method. The connection weights of a sparse network stored in off-chip DRAM are read into the weight input buffer, decoded by the weight decoding unit, and stored in the weight on-chip global buffer; the neurons are read into the neuron input buffer, decoded by the neuron decoding unit, and stored in the neuron on-chip global buffer. The computation mode of the PE computing unit array is determined according to the configuration parameters of the current layer of the neural network, and the decoded, arranged neurons and connection weights are sent to the PE computing units, which compute the products of neurons and connection weights. In the accelerator of the present invention, all multipliers in the PE units are replaced by shifters, and all basic modules can be configured according to the network computation and the available hardware resources, so the accelerator offers high speed, low power consumption, small resource occupation, and high data utilization.

Figure 201811582530

Description

A sparse convolutional neural network accelerator and implementation method

Technical Field

The invention belongs to the technical field of accelerated computing for deep neural networks, and relates to a sparse convolutional neural network accelerator and an implementation method.

Background Art

The remarkable performance of deep neural networks (DNNs) comes from their ability to use statistical learning to obtain an effective representation of the input space from large amounts of data and to extract high-level features from raw data. However, these algorithms are computationally intensive. As the number of network layers keeps growing, DNNs demand ever more storage and computing resources, which makes them difficult to deploy on embedded devices with limited hardware resources and tight power budgets. DNN model compression reduces the memory footprint of network models, the computational complexity, and the system power consumption. On the one hand, it improves the running efficiency of the model, which benefits platforms with high speed requirements such as cloud computing; on the other hand, it makes it practical to deploy the model on embedded systems.

Compared with the von Neumann architecture, a neural network computer uses neurons as the basic computing and storage units, forming a new type of distributed storage-and-compute architecture. Field Programmable Gate Arrays (FPGAs) combine the programmability and flexibility of software with the high throughput and low latency of Application Specific Integrated Circuits (ASICs), and have become a popular platform for DNN hardware accelerator design. Most existing neural network model research focuses on improving recognition accuracy; even lightweight network models ignore the computational complexity of hardware acceleration, and floating-point data representation leads to excessive hardware resource consumption. Although there are many DNN accelerator designs, many problems still need to be solved or optimized, such as more efficient data reuse, pipelined hardware computation, and on-chip storage resource planning.

Summary of the Invention

To overcome the above deficiencies of the prior art, the purpose of the present invention is to propose an energy-efficient sparse convolutional neural network accelerator and an implementation method, so that the DNN connection weights are sparse and hardware-friendly to compute.

To achieve the above object, the present invention adopts the following technical solutions.

A sparse convolutional neural network accelerator, comprising: an off-chip DRAM, a neuron input buffer, a neuron decoding unit, a neuron encoding unit, a neuron output buffer, a neuron on-chip global buffer, a weight input buffer, a weight decoding unit, a weight on-chip global buffer, a PE computing unit array, an activation unit, and a pooling unit; wherein,

the off-chip DRAM is used to store the compressed neuron and connection weight data;

the neuron input buffer is used to buffer the compressed neuron data read from the off-chip DRAM and transmit it to the neuron decoding unit;

the neuron decoding unit is used to decode the compressed neuron data and transmit the decoded neurons to the neuron on-chip global buffer;

the neuron encoding unit is used to compress and encode the data in the neuron on-chip global buffer and transmit it to the neuron output buffer;

the neuron output buffer is used to store the received compressed neuron data into the off-chip DRAM;

the neuron on-chip global buffer is used to cache the decoded neurons and the intermediate results computed by the PE array, and exchanges data with the PE computing unit array;

the activation unit is used to activate the results computed by the PE computing unit array and transmit them to the pooling unit or to the neuron on-chip global buffer; the pooling unit is used to downsample the activated results;

the weight input buffer is used to buffer the compressed connection weight data read from the off-chip DRAM and transmit it to the weight decoding unit;

the weight decoding unit is used to decode the compressed connection weight data and transmit the decoded connection weights to the weight on-chip global buffer;

the weight on-chip global buffer is used to cache the decoded connection weights and exchanges data with the PE computing unit array;

the PE computing unit array is used to compute in either a convolution computation mode or a fully connected computation mode.

An implementation method of a sparse convolutional neural network accelerator: after preprocessing, the data are stored in DRAM. The connection weights of the sparse network stored in the off-chip DRAM are read into the weight input buffer, decoded by the weight decoding unit, and stored in the weight on-chip global buffer. The neurons stored in the off-chip DRAM are read into the neuron input buffer, decoded by the neuron decoding unit, and stored in the neuron on-chip global buffer. The computation mode of the PE computing unit array is determined according to the configuration parameters of the current layer of the neural network, and the decoded, arranged neurons and connection weights are sent to the corresponding PE computing units. After entering a PE, the neuron and connection weight data are buffered in Neuron_Buffer and Weight_Buffer respectively; once Neuron_Buffer and Weight_Buffer have been filled for the first time, the products of neurons and connection weights start to be computed. Here, Neuron_Buffer is the neuron buffer inside the PE and Weight_Buffer is the weight buffer inside the PE.

If a result computed by the PE computing unit array is an intermediate result, it is temporarily stored in the neuron on-chip global buffer and waits to be accumulated with other intermediate results computed later by the PE array. If it is the final result of a convolutional layer, it is sent to the activation unit, the activated result is sent to the pooling unit, and after downsampling it is stored back into the neuron on-chip global buffer. If it is the final result of a fully connected layer, it is sent to the activation unit and then stored directly back into the neuron on-chip global buffer. The data stored in the neuron on-chip global buffer are compressed and encoded by the neuron encoding unit according to the sparse matrix compression coding storage format, sent to the neuron output buffer, and stored back into the off-chip DRAM, to be read in again when the next layer is computed. This loop continues until the last fully connected layer outputs a one-dimensional vector whose length equals the number of classification categories.

A further improvement of the present invention is that the data preprocessing proceeds as follows:

1) Prune the connection weights so that the neural network becomes sparse;

2) 2^n connection weight clustering: first, according to the distribution of the network parameters after pruning, select cluster centers of the form 2^n within the range where the parameters are concentrated, according to the quantization bit width; then compute the Euclidean distance between each remaining non-zero parameter and the cluster centers, and cluster each non-zero parameter to the cluster center with the shortest Euclidean distance, where n is a negative integer, a positive integer, or zero;

3) For the sparse network obtained after the clustering of step 2), compress the connection weights of the sparse network and the feature map neurons using the sparse matrix compression coding storage format and store them in the off-chip DRAM.

A further improvement of the present invention is that the sparse matrix compression coding storage format is as follows. The encoding of convolutional layer feature maps and fully connected layer neurons is divided into three parts: each 5-bit field gives the number of zeros between the current non-zero neuron and the previous non-zero neuron, each 16-bit field gives the value of the non-zero element itself, and a final 1-bit flag marks whether the non-zero elements of a convolutional feature map move to a new row or the non-zero elements of a fully connected layer move to a new layer; one storage unit encodes three non-zero elements together with their relative position information. If a storage unit contains the last non-zero element of a row of a convolutional feature map, or the last non-zero element of the input neurons of a fully connected layer, the remaining 5-bit and 16-bit fields are padded with zeros and the final 1-bit flag is set to 1. The encoding of connection weights is also divided into three parts: each 4-bit field gives the number of zeros between the current non-zero connection weight and the previous one, each 5-bit field gives the index of the non-zero connection weight, and a final 1-bit flag marks whether the non-zero connection weights in a convolutional layer correspond to the same input feature map, or whether the non-zero connection weights in a fully connected layer correspond to the same output neuron; one storage unit encodes seven non-zero connection weights together with their relative position information. Likewise, if a storage unit contains the last non-zero connection weight of one input feature map in a convolutional layer, or the last non-zero connection weight of one output neuron in a fully connected layer, the remaining 4-bit and 5-bit fields are padded with zeros and the final 1-bit flag is set to 1.
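
For illustration only, a minimal Python sketch of the neuron-side packing described above (5-bit zero-run, 16-bit value, three elements plus a 1-bit end-of-row flag per 64-bit storage unit); the bit widths come from the text, while the packing order within the word and the handling of zero-runs longer than 31 are assumptions:

```python
def encode_neuron_row(values):
    """Pack one row of feature-map neurons into 64-bit storage units:
    3 x (5-bit zero-run + 16-bit value) + 1-bit end-of-row flag."""
    entries, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            assert run < 32, "runs longer than 31 zeros are not handled in this sketch"
            entries.append((run, v & 0xFFFF))
            run = 0
    words = []
    for i in range(0, len(entries), 3):
        group = entries[i:i + 3]
        word = 0
        for j in range(3):
            z, v = group[j] if j < len(group) else (0, 0)   # pad unused fields with zeros
            word = (word << 5) | z
            word = (word << 16) | v
        word = (word << 1) | (1 if i + 3 >= len(entries) else 0)  # end-of-row flag
        words.append(word)
    return words

# e.g. encode_neuron_row([0, 0, 5, 0, 7, 0, 0, 0, 9]) -> one 64-bit word with flag = 1
```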

A further improvement of the present invention is that, when the computation mode of the PE computing unit array is convolution, if the convolution window size is k×k and the window sliding stride is s, then, once filled, Neuron_Buffer and Weight_Buffer hold neurons 1 to k and connection weights 1 to k, in first-in first-out order;

if the stride s equals 1, then in the first clock cycle the product of neuron 1 and connection weight 1 is computed, neuron 1 is discarded, and connection weight 1 is moved to the back of Weight_Buffer; in the second clock cycle a new neuron k+1 is read in while the product of neuron 2 and connection weight 2 is computed, after which both neuron 2 and connection weight 2 are moved to the back of their respective buffers; in the third clock cycle the product of neuron 3 and connection weight 3 is computed, after which both are moved to the back of their respective buffers; and so on, until in the k-th clock cycle the PE unit finishes the multiply-accumulate of one row of one convolution window, and while the result is output the order of the values in the buffers is adjusted according to the stride s, ready for the next convolution window;

if the stride s is greater than 1, then in the first clock cycle the product of neuron 1 and connection weight 1 is computed, neuron 1 is discarded, and connection weight 1 is moved to the back of Weight_Buffer; in the second clock cycle a new neuron k+1 is read in, the product of neuron 2 and connection weight 2 is computed, neuron 2 is discarded, and connection weight 2 is moved to the back of Weight_Buffer; and so on, until in the s-th clock cycle a new neuron k+s-1 is read in, the product of neuron s and connection weight s is computed, neuron s is discarded, and connection weight s is moved to the back of Weight_Buffer. In the (s+1)-th clock cycle a new neuron k+s is read in while the product of neuron s+1 and connection weight s+1 is computed, after which both are moved to the back of their respective buffers; in the (s+2)-th clock cycle the product of neuron s+2 and connection weight s+2 is computed, after which both are moved to the back of their respective buffers; and so on, until in the k-th clock cycle the PE unit finishes the multiply-accumulate of one row of one convolution window, and while the result is output the order of the values in the buffers is adjusted according to the stride s, ready for the next convolution window.
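
As a behavioral illustration of this buffer dataflow, the following Python sketch models one PE computing successive window rows for a kernel row of size k and stride s; the exact cycle on which a new neuron is read in is simplified, and the final reordering step is modeled as a rotation by s (an assumption consistent with the description above):

```python
from collections import deque

def conv_row_pe(neurons, weights, s):
    """Model of one PE in convolution mode: weights are always recycled,
    the first s neurons of each window are used up and replaced by newly
    streamed-in neurons, and the remaining neurons are reused."""
    k = len(weights)
    nbuf = deque(neurons[:k])          # Neuron_Buffer after the initial fill
    wbuf = deque(weights)              # Weight_Buffer after the initial fill
    stream = iter(neurons[k:])         # neurons not yet read into the PE
    outputs = []
    while True:
        acc = 0
        for cycle in range(k):         # one window row = k shift/multiply-accumulates
            n, w = nbuf.popleft(), wbuf.popleft()
            acc += n * w
            wbuf.append(w)             # the weight always goes back to the buffer
            if cycle < s:              # this neuron will never be needed again
                nxt = next(stream, None)
                if nxt is not None:
                    nbuf.append(nxt)   # stream in a replacement neuron
            else:
                nbuf.append(n)         # this neuron is reused by the next window
        outputs.append(acc)
        nbuf.rotate(-s)                # "adjust the buffer order by the stride s"
        if len(nbuf) < k:
            return outputs

# conv_row_pe([1, 2, 3, 4, 5], [1, 1, 1], s=1) -> [6, 9, 12]
```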

A further improvement of the present invention is that, when the computation mode of the PE computing unit array is fully connected, if each PE unit computes over m input neurons, then, once filled, Neuron_Buffer and Weight_Buffer hold neurons 1 to m and connection weights 1 to m, in first-in first-out order. In the first clock cycle the product of neuron 1 and connection weight 1 is computed, connection weight 1 is discarded, and neuron 1 is moved to the back of Neuron_Buffer; in the second clock cycle a new connection weight m+1 is read in while the product of neuron 2 and connection weight 2 is computed, connection weight 2 is discarded, and neuron 2 is moved to the back of Neuron_Buffer; and so on, until in the m-th clock cycle the PE unit finishes the intermediate result of one output neuron, and while the result is output the order of the values in the buffers is adjusted, ready to compute the next group of output neurons.
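
A matching sketch for the fully connected mode, where the m resident input neurons are recycled and a fresh weight is streamed in every cycle; this is an assumption-level behavioral model, not the actual hardware:

```python
from collections import deque

def fc_pe(neurons, weight_stream):
    """Model of one PE in fully connected mode: weights are used once and
    discarded, while the m input neurons stay resident and are reused for
    every output neuron served by this PE."""
    m = len(neurons)
    nbuf = deque(neurons)              # Neuron_Buffer after the initial fill
    wbuf = deque(weight_stream[:m])    # Weight_Buffer after the initial fill
    stream = iter(weight_stream[m:])
    outputs = []
    while len(wbuf) == m:
        acc = 0
        for _ in range(m):
            n, w = nbuf.popleft(), wbuf.popleft()
            acc += n * w               # the weight is consumed here
            nbuf.append(n)             # the neuron goes back for the next output neuron
            nxt = next(stream, None)
            if nxt is not None:
                wbuf.append(nxt)       # stream in the next output neuron's weight
        outputs.append(acc)            # partial sum of one output neuron
    return outputs

# fc_pe([1, 2], [1, 1, 2, 2]) -> [3, 6]
```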

Compared with the prior art, the present invention has the following beneficial effects.

The present invention adopts a sparse matrix compression coding storage format, so that the data of different layers of the neural network share a unified encoding format. The format takes into account the respective computational characteristics of convolutional layers and fully connected layers, saving storage resources as much as possible while simplifying the encoding and decoding circuits in the hardware design. The present invention also optimizes the data flow of the neuron and connection weight buffers inside the PE unit, and proposes an in-PE data flow scheme with a high data reuse rate, which reduces storage space while making computation more efficient. The invention is flexible to operate, simple to encode and decode, and achieves good compression; each input neuron is represented by a 16-bit fixed-point number and each connection weight by 5 bits.

Further, after the data preprocessing process of the present invention, all remaining non-zero connection weights of the neural network become 2^n values or combinations of 2^n values; therefore, the multipliers in the PE computing units of the present invention are replaced by more resource-efficient shifters.

Further, the present invention comprises three main steps: pruning, 2^n clustering, and compression coding. Compared with the original network models, the compressed LeNet, AlexNet, and VGG reduce storage consumption by roughly 222×, 22×, and 68×, respectively. In the accelerator of the present invention, all multipliers in the PE units are replaced by shifters, and all basic modules can be configured according to the network computation and the available hardware resources, so the accelerator offers high speed, low power consumption, small resource occupation, and high data utilization.

Brief Description of the Drawings

Figure 1 is the overall flow chart of the neural network model preprocessing.

Figure 2 is the flow chart of 2^n connection weight clustering.

Figure 3 shows the compression coding storage format.

Figure 4 is the overall architecture diagram of the neural network hardware accelerator of the present invention.

Figure 5 is a schematic diagram of the PE array configured in the convolution computation mode.

Figure 6 is a schematic diagram of the data flow allocation when the PE array is configured in the convolution computation mode.

Figure 7 is a schematic diagram of the PE array configured in the fully connected computation mode.

Figure 8 is a schematic diagram of the data flow allocation when the PE array is configured in the fully connected computation mode.

Figure 9 is a schematic diagram of the internal structure of a PE unit and of the basic column unit of the PE array.

Figure 10a shows the data flow of neurons and connection weights inside a PE during convolution computation.

Figure 10b shows the data flow of neurons and connection weights inside a PE during fully connected computation.

Figure 11 is a diagram of the configurable design of the present invention.

Detailed Description of the Embodiments

The technical solutions of the present application are described clearly and completely below with reference to the accompanying drawings.

In the present invention, connection weights are also referred to simply as weights.

Referring to Figure 4, the neural network accelerator of the present invention comprises: an off-chip DRAM, a neuron input buffer, a neuron decoding unit, a neuron encoding unit, a neuron output buffer, a weight input buffer, a weight decoding unit, a neuron on-chip global buffer, a weight on-chip global buffer, a PE computing unit array, an activation unit, and a pooling unit.

The off-chip DRAM stores the compressed neuron and connection weight data.

The neuron input buffer buffers the compressed neuron data read from the off-chip DRAM and transmits it to the neuron decoding unit.

The neuron decoding unit decodes the compressed neuron data and transmits the decoded neurons to the neuron on-chip global buffer.

The weight input buffer buffers the compressed connection weight data read from the off-chip DRAM and transmits it to the weight decoding unit.

The weight decoding unit decodes the compressed connection weight data and transmits the decoded connection weights to the weight on-chip global buffer.

The neuron encoding unit compresses and encodes the computed data in the neuron on-chip global buffer and transmits it to the neuron output buffer.

The neuron output buffer stores the received compressed neuron data into the off-chip DRAM.

The neuron on-chip global buffer caches the decoded neurons and the intermediate results computed by the PE array, and exchanges data with the PE computing unit array.

The weight on-chip global buffer caches the decoded connection weights and exchanges data with the PE computing unit array.

激活单元,用于将PE计算单元阵列计算的结果进行激活后传输到池化单元或神经元片上全局缓冲区。常见的激活函数有Sigmoid、Tanh、ReLU等,其中,Sigmoid函数是典型卷积神经网络中最常用的一种,而ReLU函数是目前使用最多的激活函数,因为其能让训练结果快速收敛,并且输出很多的零值,除此之外,ReLu函数不用计算,只需将输入的数据与零做比较,这会比用其他激活函数省去很多的计算资源,相关硬件模块设计也比较简单,只需要比较器就可以实现。对于Sigmoid激活函数,本发明选择分段线性逼近的方式来实现Sigmoid函数的硬件模块设计。分段线性逼近的基本思想是用一系列折线段来逼近激活函数本身。具体实现时,可以将每段折线的参数k和b存入查找表,断点间的值按照公式(1)代入相应参数k和b计算得到。The activation unit is used to activate the result calculated by the PE computing unit array and transmit it to the pooling unit or the global buffer on the neuron slice. Common activation functions include Sigmoid, Tanh, ReLU, etc. Among them, the Sigmoid function is the most commonly used in typical convolutional neural networks, and the ReLU function is currently the most used activation function, because it allows the training results to converge quickly, and A lot of zero values are output. In addition, the ReLu function does not need to be calculated. It only needs to compare the input data with zero, which saves a lot of computing resources than using other activation functions. The design of the relevant hardware modules is also relatively simple. A comparator is needed to achieve this. For the sigmoid activation function, the present invention selects the method of piecewise linear approximation to realize the hardware module design of the sigmoid function. The basic idea of piecewise linear approximation is to approximate the activation function itself with a series of polyline segments. During specific implementation, the parameters k and b of each broken line can be stored in the look-up table, and the value between the breakpoints can be calculated by substituting the corresponding parameters k and b according to formula (1).

y = kx + b    (1)

Note that the Sigmoid function is centrally symmetric about the point (0, 0.5), so only the fitted segment parameters for the positive half of the x-axis need to be stored; for a negative input x, the function value is obtained as 1 - f(-x). Since f(x) is already very close to 1 when x > 8, this design only fits the interval of x from 0 to 8, and the activation output is 1 whenever x is greater than 8. The input value x is first compared with the breakpoints of the piecewise segments to obtain the interval index; the computation parameters k and b of the corresponding segment are then fetched from the lookup table according to that index, and finally the activation output is computed with a multiplier and an adder.
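
A small Python sketch of the piecewise linear Sigmoid described above; the choice of eight unit-width segments on [0, 8) and the way k and b are fitted (connecting f(i) and f(i+1)) are illustrative assumptions, and a real design would store fixed-point k and b in the lookup table:

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical lookup table: one (k, b) pair per unit interval [i, i+1) on [0, 8).
LUT = []
for i in range(8):
    k = _sigmoid(i + 1) - _sigmoid(i)   # slope of the segment
    b = _sigmoid(i) - k * i             # intercept so the segment passes through f(i)
    LUT.append((k, b))

def sigmoid_pwl(x):
    """Piecewise linear Sigmoid: only the positive half is stored, the negative
    half uses the symmetry about (0, 0.5), and x >= 8 saturates to 1."""
    if x < 0:
        return 1.0 - sigmoid_pwl(-x)
    if x >= 8:
        return 1.0
    k, b = LUT[int(x)]                  # interval index selects the segment parameters
    return k * x + b                    # formula (1): y = kx + b
```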

The pooling unit downsamples the activated results. In a convolutional neural network, after the activation function has been applied to the convolution results, the data usually pass through a pooling layer for downsampling. Common pooling methods include average pooling, max pooling, and spatial pyramid pooling. The present invention designs hardware for the two most common ones, max pooling and average pooling. For max pooling, all convolution results to be pooled are simply fed into comparators, and the maximum value obtained is the max pooling output. For average pooling, all convolution results to be pooled are first fed into an adder for accumulation, and the accumulated result is then divided by the number of input values to obtain the average pooling output. The most common pooling window size is 2, and in the hardware circuit the division in the last step of average pooling can be replaced by a shift operation: when the pooling window size is 2, the accumulated result only needs to be shifted right by two bits.
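
The two pooling operators map onto very little hardware, as the following sketch suggests (the 2x2 average-pooling window and fixed-point operands are the assumptions stated in the text):

```python
def max_pool(window):
    """Max pooling: feed all values through a comparator chain."""
    out = window[0]
    for v in window[1:]:
        out = v if v > out else out
    return out

def avg_pool_2x2(window):
    """Average pooling of a 2x2 window on fixed-point data:
    the division by 4 is just an arithmetic shift right by two bits."""
    assert len(window) == 4
    return sum(window) >> 2

# max_pool([3, 7, 1, 5]) -> 7 ; avg_pool_2x2([4, 8, 4, 8]) -> 6
```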

The PE computing unit array can be configured in a convolution computation mode or a fully connected computation mode. The PE computing unit array comprises a number of PE units, and each column of PE units is called a basic column unit. The main input ports of each PE unit carry an input neuron (16 bits) and the connection weight corresponding to that input neuron (5 bits). After entering the PE unit, these two kinds of data are buffered in Neuron_Buffer (the neuron buffer inside the PE) and Weight_Buffer (the weight buffer inside the PE) respectively; once Neuron_Buffer and Weight_Buffer have been filled for the first time, the products of neurons and connection weights start to be computed. After the data preprocessing process of the present invention, all remaining non-zero connection weights of the neural network become 2^n values or combinations of 2^n values; therefore, the multipliers in the PE computing units of the present invention are replaced by more resource-efficient shifters.

Referring to Figure 9, a neuron and a connection weight are read from Neuron_Buffer and Weight_Buffer respectively. First it is determined whether the neuron or the connection weight is 0: if either of the two values is 0 (or both are), the computation enable signal en is set to 0; otherwise it is set to 1. If the enable signal en is 0, no shift operation is performed; the shift result is directly set to 0 and sent to the adder for accumulation. Count_out counts the number of shift-accumulate operations performed by the PE unit: in convolution computation, one PE unit performs the shift-accumulate of one row of each convolution window; in fully connected computation, one PE unit performs the shift-accumulate of one output neuron connected to one group of input neurons. The output results computed by the basic column units of the PE array are stored in a Regfile; a data selector (Select) chooses the operands to be added and sends them to the adder corresponding to that column of PE units for accumulation, the corresponding bias value is added, and the final column output Column_out is obtained.
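
The following sketch illustrates one shift-accumulate step of a PE with the zero-gating enable signal described above. How the 5-bit weight index encodes the sign and shift amount is not spelled out in the text, so the weight is assumed here to arrive already decoded as a sign and a shift count:

```python
def pe_shift_mac(neuron, w_sign, w_shift, w_is_zero, acc):
    """One PE cycle: because every non-zero weight is a power of two, the
    multiplier is replaced by a shifter; en gates the operation when either
    operand is zero, in which case the product is forced to 0."""
    en = (neuron != 0) and not w_is_zero
    if not en:
        return acc                       # shifter disabled, accumulator unchanged
    if w_shift >= 0:
        prod = neuron << w_shift         # weight magnitude 2**w_shift, w_shift >= 0
    else:
        prod = neuron >> (-w_shift)      # negative exponents become right shifts (fixed point)
    return acc + prod if w_sign >= 0 else acc - prod
```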

Because the data reuse and computation parallelism of the convolution computation mode and the fully connected computation mode differ, the specific data flow inside the PE units is slightly different for the convolution computation process and the fully connected computation process.

For the convolution computation process: a convolutional neural network (CNN) is a feed-forward network built by stacking convolutional layers of similar structure; the operations of different layers are independent of one another and highly similar, so the characteristics of convolutional layer computation can be abstracted to design a general convolution computation mode for the PE computing unit array. The present invention splits the parallelism over output feature maps and combines it with parallelism over input feature maps and parallelism between convolution windows of the same input feature map, so that on-chip data reuse is maximized without putting extra pressure on the on-chip buffers.

For the fully connected computation process: after comparing the computational characteristics of fully connected layers and convolutional layers, and considering the fully connected layer structures of convolutional neural networks with different architectures, the data computation pipeline, the system clock cycle, hardware resources, and other practical constraints, the present invention designs a parameterized, configurable fully connected computation mode for the PE computing unit array.

Referring to Figure 11, different convolutional neural network models differ in the convolution kernel sizes, the numbers of input and output feature maps of their convolutional layers, and the numbers of input and output neurons of their fully connected layers. To make the computing units broadly applicable, the present invention optimizes the configurable design of the neural network hardware accelerator with respect to the convolution computation mode, the fully connected computation mode, the activation function selection, and the pooling method selection. The configurable design of key modules such as the PE computing unit array, the activation unit, and the pooling unit is achieved by parameterizing key feature sizes of the convolutional neural network.

Table 1 lists the configuration parameters of the basic units of some key modules in the code implementation and their meanings. When calculate_layer is 0, the current PE array computes the first convolutional layer; 1 means it computes an intermediate convolutional layer; 2 means it computes a pooling layer; 3 means it computes a fully connected layer. When connet_mu_or_ad_pe is 0, the PE unit performs a multiplication; 1 means it performs an addition; 2 means it performs a multiply-accumulate. When active_type is 0, the Sigmoid activation function is selected; 1 selects ReLU. When pooling_type is 0, max pooling is selected; 1 selects average pooling. All of these parameters can be configured in the top-level file of the program. By modifying the configuration file according to the input and output sizes, activation function, and pooling method of the different computation layers of a convolutional neural network, a user can build the desired network model and quickly generate a neural network mapping of the corresponding scale on the hardware; a minimal configuration sketch is given after the table below.

Table 1. Configuration parameters of the basic units in some key modules of the code implementation and their meanings

calculate_layer: 0 = first convolutional layer, 1 = intermediate convolutional layer, 2 = pooling layer, 3 = fully connected layer
connet_mu_or_ad_pe: 0 = PE performs multiplication, 1 = PE performs addition, 2 = PE performs multiply-accumulate
active_type: 0 = Sigmoid activation, 1 = ReLU activation
pooling_type: 0 = max pooling, 1 = average pooling
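
A minimal sketch of what a per-layer configuration record mirroring Table 1 could look like; only the four mode parameters described above are shown, and the dictionary layout itself is illustrative:

```python
# Per-layer configuration mirroring Table 1 (values here are just an example layer).
LAYER_CONFIG = {
    "calculate_layer": 1,      # 0: first conv layer, 1: middle conv layer, 2: pooling, 3: fully connected
    "connet_mu_or_ad_pe": 2,   # 0: PE multiplies, 1: PE adds, 2: PE multiply-accumulates
    "active_type": 1,          # 0: Sigmoid, 1: ReLU
    "pooling_type": 0,         # 0: max pooling, 1: average pooling
}
```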

An implementation method based on the above sparse convolutional neural network accelerator comprises the following steps.

The data preprocessing comprises the following steps.

Step 1): prune the connection weights so that the neural network becomes sparse.

Step 2): 2^n connection weight clustering. First, according to the distribution of the network parameters after pruning, select cluster centers of the form 2^n within the range where the parameters are concentrated, according to the quantization bit width; then compute the Euclidean distance between each remaining non-zero parameter and the cluster centers, and cluster each non-zero parameter to the cluster center with the shortest Euclidean distance, where n can be a negative integer, a positive integer, or zero.
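
For scalar weights this clustering step reduces to snapping each non-zero weight to the nearest power-of-two center (for scalars the Euclidean distance is just the absolute difference). The candidate centers below, ±2^n for n from -6 to 0, are only an example of what a 5-bit quantization might use:

```python
def cluster_to_pow2(weights, centers):
    """Replace every non-zero weight by the nearest cluster center; pruned
    (zero) weights stay zero. Retraining between clustering rounds, as in
    Figure 2, is outside the scope of this sketch."""
    out = []
    for w in weights:
        if w == 0:
            out.append(0.0)
        else:
            out.append(min(centers, key=lambda c: abs(w - c)))
    return out

# Example centers: positive and negative powers of two from 2**-6 to 2**0.
centers = [sign * 2.0 ** n for sign in (1, -1) for n in range(-6, 1)]
# cluster_to_pow2([0.0, 0.3, -0.9], centers) -> [0.0, 0.25, -1.0]
```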

Step 3): for the sparse network obtained after the clustering of step 2), compress the connection weights of the sparse network and the feature map neurons using the sparse matrix compression coding storage format and store them in the off-chip DRAM. The specific storage format is as follows. The encoding of convolutional layer feature maps and fully connected layer neurons is divided into three parts: each 5-bit field gives the number of zeros between the current non-zero neuron and the previous non-zero neuron, each 16-bit field gives the value of the non-zero element itself, and a final 1-bit flag marks whether the non-zero elements of a convolutional feature map move to a new row or the non-zero elements of a fully connected layer move to a new layer; one storage unit can encode three non-zero elements together with their relative position information. If a storage unit contains the last non-zero element of a row of a convolutional feature map, or the last non-zero element of the input neurons of a fully connected layer, the remaining 5-bit and 16-bit fields are padded with zeros and the final 1-bit flag is set to 1. The encoding of connection weights is also divided into three parts: each 4-bit field gives the number of zeros between the current non-zero connection weight and the previous one, each 5-bit field gives the index of the non-zero connection weight, and a final 1-bit flag marks whether the non-zero connection weights in a convolutional layer correspond to the same input feature map, or whether the non-zero connection weights in a fully connected layer correspond to the same output neuron; one storage unit can encode seven non-zero connection weights together with their relative position information. Likewise, if a storage unit contains the last non-zero connection weight of one input feature map in a convolutional layer, or the last non-zero connection weight of one output neuron in a fully connected layer, the remaining 4-bit and 5-bit fields are padded with zeros and the final 1-bit flag is set to 1.
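
As a counterpart to the neuron-side encoder sketched earlier, the following illustrates unpacking one 64-bit weight storage unit (seven 4-bit zero-run / 5-bit index pairs plus a 1-bit group flag). The most-significant-first packing order and the treatment of all-zero fields as padding are assumptions:

```python
def decode_weight_word(word):
    """Unpack one 64-bit weight storage unit into (zero_run, weight_index)
    pairs plus the 1-bit same-feature-map / same-output-neuron flag."""
    end_flag = word & 0x1
    word >>= 1
    fields = []
    for _ in range(7):
        idx = word & 0x1F          # 5-bit non-zero weight index
        word >>= 5
        run = word & 0xF           # 4-bit count of zeros before this weight
        word >>= 4
        fields.append((run, idx))
    fields.reverse()               # fields were assumed packed most-significant first
    entries = [(run, idx) for run, idx in fields if idx != 0]  # drop zero padding
    return entries, end_flag
```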

Step 4): read the connection weights of the sparse network stored in the off-chip DRAM into the weight input buffer, decode them with the weight decoding unit, and store them in the weight on-chip global buffer; read the neurons stored in the off-chip DRAM into the neuron input buffer, decode them with the neuron decoding unit, and store them in the neuron on-chip global buffer. Determine the computation mode of the PE computing unit array according to the configuration parameters of the current layer of the neural network (see Table 1), and send the decoded, arranged neurons and connection weights to the corresponding PE computing units. After entering a PE, these two kinds of data are buffered in Neuron_Buffer and Weight_Buffer respectively; once Neuron_Buffer and Weight_Buffer have been filled for the first time, the computation of the products of neurons and connection weights begins.

When the computation mode of the PE computing unit array is configured as convolution, if the convolution window size is k×k and the window sliding stride is s, then, once filled, Neuron_Buffer and Weight_Buffer hold neurons 1 to k and connection weights 1 to k, in first-in first-out order.

If the stride s equals 1, then in the first clock cycle the product of neuron 1 and connection weight 1 is computed, neuron 1 is discarded, and connection weight 1 is moved to the back of Weight_Buffer. In the second clock cycle a new neuron k+1 is read in while the product of neuron 2 and connection weight 2 is computed, after which both neuron 2 and connection weight 2 are moved to the back of their respective buffers. In the third clock cycle the product of neuron 3 and connection weight 3 is computed, after which both are moved to the back of their respective buffers; and so on, until in the k-th clock cycle the PE unit finishes the multiply-accumulate of one row of one convolution window. While the result is output, the order of the values in the buffers is adjusted according to the stride s, ready for the next convolution window.

If the stride s is greater than 1, then in the first clock cycle the product of neuron 1 and connection weight 1 is computed, neuron 1 is discarded, and connection weight 1 is moved to the back of Weight_Buffer. In the second clock cycle a new neuron k+1 is read in, the product of neuron 2 and connection weight 2 is computed, neuron 2 is discarded, and connection weight 2 is moved to the back of Weight_Buffer; and so on, until in the s-th clock cycle a new neuron k+s-1 is read in, the product of neuron s and connection weight s is computed, neuron s is discarded, and connection weight s is moved to the back of Weight_Buffer. In the (s+1)-th clock cycle a new neuron k+s is read in while the product of neuron s+1 and connection weight s+1 is computed, after which both are moved to the back of their respective buffers. In the (s+2)-th clock cycle the product of neuron s+2 and connection weight s+2 is computed, after which both are moved to the back of their respective buffers; and so on, until in the k-th clock cycle the PE unit finishes the multiply-accumulate of one row of one convolution window. While the result is output, the order of the values in the buffers is adjusted according to the stride s, ready for the next convolution window.

When the computation mode of the PE computing unit array is configured as fully connected, if each PE unit computes over m input neurons, then, once filled, Neuron_Buffer and Weight_Buffer hold neurons 1 to m and connection weights 1 to m, in first-in first-out order. In the first clock cycle the product of neuron 1 and connection weight 1 is computed, connection weight 1 is discarded, and neuron 1 is moved to the back of Neuron_Buffer. In the second clock cycle a new connection weight m+1 is read in while the product of neuron 2 and connection weight 2 is computed, connection weight 2 is discarded, and neuron 2 is moved to the back of Neuron_Buffer; and so on, until in the m-th clock cycle the PE unit finishes the intermediate result of one output neuron. While the result is output, the order of the values in the buffers is adjusted, ready to compute the next group of output neurons.

If a result computed by the PE computing unit array is an intermediate result, it is temporarily stored in the neuron on-chip global buffer and waits to be accumulated with other intermediate results computed later by the PE array. If it is the final result of a convolutional layer, it is sent to the activation unit, the activated result is sent to the pooling unit, and after downsampling it is stored back into the neuron on-chip global buffer. If it is the final result of a fully connected layer, it is sent to the activation unit and then stored directly back into the neuron on-chip global buffer. The data stored in the neuron on-chip global buffer are compressed and encoded by the neuron encoding unit according to the sparse matrix compression coding storage format, sent to the neuron output buffer, and stored back into the off-chip DRAM, to be read in again when the next layer is computed. This loop continues until the last fully connected layer outputs a one-dimensional vector whose length equals the number of classification categories.

The invention is flexible to operate, simple to encode and decode, and achieves good compression; each input neuron is represented by a 16-bit fixed-point number and each connection weight by 5 bits.

Figure 1 is the overall flow chart of the neural network model preprocessing. The first step, connection weight pruning, removes connections with small weights from the trained model and then retrains the network to restore its recognition accuracy to the initial level. The second step, 2^n connection weight clustering, performs clustering with fixed cluster centers on the model trained in the first step and replaces all non-zero connection weights with these cluster centers. The third step is connection weight compression coding.

Figure 2 is the detailed flow chart of the 2^n connection weight clustering implemented by the present invention. After the pruned sparse network is obtained, it is trained until the recognition accuracy no longer improves; then, according to the parameter distribution at that point, N cluster centers of the form 2^n are selected, and every weight is clustered to the center with the smallest Euclidean distance. If the network accuracy drops, training continues until the accuracy no longer improves, and so on, until there is no accuracy loss after clustering.

Figure 3 shows the sparse matrix compression coding storage format of the present invention.

Figure 4 shows the overall architecture of the neural network hardware accelerator, including the off-chip DRAM, neuron input buffer, neuron decoding unit, neuron encoding unit, neuron output buffer, weight input buffer, weight decoding unit, neuron on-chip global buffer, weight on-chip global buffer, PE computing unit array, activation unit, and pooling unit.

Figure 5 is a schematic diagram of the PE array configured in the convolution computation mode. Inside the dashed box is a PE unit array with m rows and n columns; each column of PE units is called a basic column unit and corresponds to one add-operand selector (SEL), which selects whether the results computed by that column of PE units are sent to the accumulator for accumulation. All PE units in Figure 5 correspond to the same input feature map: PE units in the same row receive the neurons of the same row of the input feature map together with the connection weights of the same kernel row of the kernels for different output feature maps, while PE units in the same column receive the neurons of different rows of the input feature map together with the connection weights of different rows of the same kernel for one output feature map. In the end, t consecutive neurons at the same position of n output feature maps are computed simultaneously, where t = m/k and k is the convolution kernel size.

Figure 6 illustrates the specific data flow allocation of the PE array in the convolution computation mode of the present invention. On the left are the convolution kernel corresponding to the first output feature map and the input feature map to be convolved; assume the kernel size is k = 5 and the PE array size is m = n = 5. The 5 connection weights in each row of the kernel for the first output feature map form one group, the 5 groups are denoted W_r1 to W_r5, and these 5 groups of connection weights are sent to the first column of the PE array. In other words, the 5 columns of PE units in the example receive the different kernels corresponding to the 5 output feature maps. The input feature map has size N×N, and its N rows of neuron data are denoted Row_1 to Row_N; the neuron data that must be fed into the PE units for each convolution computation are denoted N_k1 to N_k5, and these 5 groups of input neuron data are sent to the PE units of the corresponding rows of the PE array. In other words, the first row of PE units shares the N_k1 neuron data, the second row shares the N_k2 neuron data, and so on. Finally, the 5×5 PE array of the example simultaneously computes 1 output neuron value at the same position of 5 output feature maps.
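
The mapping of Figures 5 and 6 can be summarized programmatically as below; how the extra rows are grouped into the t parallel windows when m > k is an assumption, since the example uses m = k and t = 1:

```python
def conv_mode_mapping(m, n, k):
    """Convolution-mode assignment: each column serves one output feature map,
    each group of k rows serves one of the t = m // k parallel windows, and the
    PE at (row i, column j) holds kernel row (i % k) of output map j while
    sharing that input-feature-map row with the rest of array row i."""
    t = m // k
    mapping = {}
    for i in range(m):
        for j in range(n):
            mapping[(i, j)] = {
                "output_map": j,        # which output feature map this column computes
                "kernel_row": i % k,    # which kernel / input row this PE consumes
                "window_slot": i // k,  # which of the t parallel windows this PE feeds
            }
    return t, mapping

# t, mapping = conv_mode_mapping(m=5, n=5, k=5)  -> t = 1, one window per output map
```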

Figure 7 is a schematic diagram of the general-purpose PE array configured in fully connected computing mode. The dashed box encloses a PE unit array of m rows and n columns; each column of PE units is a basic column unit paired with an addend selector (SEL) that decides whether the results of that column are sent to the accumulator for accumulation. To maximize data reuse, the present invention adopts a parallel scheme in which both the input neurons and the output neurons are grouped before computation. Accordingly, the PE units in Figure 7 cover different parts of the complete set of input neurons: the PE units in one row receive the same group of input neurons together with the connection weights to different output neurons, while the PE units in one column receive different groups of input neurons together with the connection weights to the corresponding output neurons. The array thus computes n consecutive output neurons at the same time.

Figure 8 illustrates the concrete data flow distribution of the PE array in the fully connected computing mode of the present invention. On the left is one fully connected layer; assume it has 20 input neurons and 10 output neurons, so the layer contains 200 connection weights in total, and the PE array size is m = 4, n = 5. The input neurons are divided into four groups of five, denoted N1~N4, and each row of the PE array shares one group: the first row of PE units shares the N1 neuron data, the second row shares N2, and so on, with the m-th row sharing Nm. Since the PE array has five columns, the output neurons are divided into two groups, denoted F1 and F2, and five output neurons are computed per group. The connection weights from input groups N1~N4 to the first output neuron of group F1 are denoted W1_11~W1_14, those to the second output neuron of group F1 are denoted W1_21~W1_24, and so on up to the weights to the fifth output neuron of group F2, denoted W2_51~W2_54; there are 40 weight groups in total, five weights per group. These 40 weight groups are sent to the computing units in the PE array: the first column of PE units receives W1_11~W1_14, the second column receives W1_21~W1_24, and so on, with the fifth column receiving W1_51~W1_54. Finally, the 4×5 PE array in this example simultaneously computes the five output neuron values of group F1.
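As a cross-check of this grouping, the following sketch models the 4×5 array behaviorally; the 20/10 layer sizes, the names `inputs` and `weights`, and the use of ordinary multiplication are assumptions made only for illustration.

```python
import numpy as np

# Behavioral sketch of the fully connected mapping in Figure 8.
m, n = 4, 5                       # PE array: 4 rows (input groups) x 5 columns (outputs per group)
num_in, num_out = 20, 10          # assumed layer sizes from the example
inputs = np.random.rand(num_in)
weights = np.random.rand(num_out, num_in)

in_groups = inputs.reshape(m, n)              # N1..N4, five input neurons each
out_groups = weights.reshape(2, n, num_in)    # F1, F2: five output neurons per group

def pe_array_fc(group):
    """Compute the n output neurons of one output group (F1 or F2)."""
    outs = np.zeros(n)
    for col in range(n):                      # PE column <-> one output neuron of the group
        acc = 0.0
        for row in range(m):                  # PE row <-> one input group, shared across columns
            w = out_groups[group, col].reshape(m, n)[row]   # W1_(col+1)(row+1) in the text
            acc += float(in_groups[row] @ w)                # one PE's multiply-accumulate
        outs[col] = acc
    return outs

assert np.allclose(pe_array_fc(0), weights[:n] @ inputs)    # F1 group matches a direct matmul
```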

Figure 9 shows the internal structure of a single PE computing unit and the data flow between the m PE computing units that form one basic column unit of the PE array; the complete PE array consists of n such basic column units. As can be seen from the figure, the main input ports of each PE unit carry an input neuron (16 bits) and its corresponding connection weight (5 bits); after entering the PE unit, these two kinds of data are buffered in Neuron_Buffer and Weight_Buffer, respectively. In addition, there are a data-input valid signal, a PE computation enable signal, and other control signals, which are not marked in the figure.

Figures 10a and 10b give examples of the PE unit data flow in the convolution computing mode and the fully connected computing mode, respectively. Assume that the convolution window configured in convolution mode is 5×5 with a sliding stride of 1, that each PE unit in fully connected mode processes 5 input neurons, and that computation starts once Neuron_Buffer and Weight_Buffer are full.

The convolution-mode data flow is shown in Figure 10a. In the first clock cycle, the product of neuron 1 and connection weight 1 is computed; neuron 1 is then discarded and connection weight 1 is moved to the back of Weight_Buffer. In the second clock cycle, the new neuron 6 is read in while the product of neuron 2 and connection weight 2 is computed, after which neuron 2 and connection weight 2 are both moved to the back of their respective buffers. In the third clock cycle, the product of neuron 3 and connection weight 3 is computed, and both are moved to the back of their respective buffers. This continues until, in the fifth clock cycle, the PE unit completes the multiply-accumulate of one row of one convolution window; it outputs the result, adjusts the order of the values in the buffers, and waits to compute the next convolution window.
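This buffer behavior can be summarized as two small FIFOs with rotate/drop policies. The Python sketch below is only a functional model of the stride-1 case in Figure 10a; the deque-based buffers and the use of multiplication instead of the hardware shifter are assumptions for readability.

```python
from collections import deque

def conv_row_mac(row, weights):
    """Stride-1 MAC of one kernel row over one feature-map row, following Fig. 10a."""
    k = len(weights)
    nbuf = deque(row[:k])            # Neuron_Buffer: neurons 1..k
    wbuf = deque(weights)            # Weight_Buffer: weights 1..k
    incoming = iter(row[k:])         # neurons k+1, k+2, ... streamed in later
    results = []
    for _ in range(len(row) - k + 1):            # one iteration per convolution window
        acc = 0.0
        for cycle in range(k):
            n, w = nbuf.popleft(), wbuf.popleft()
            acc += n * w                         # a shifter replaces this multiply in hardware
            wbuf.append(w)                       # the weight always rotates to the back
            if cycle == 0:
                nxt = next(incoming, None)       # neuron 1 is dropped; a new neuron is read in
                if nxt is not None:
                    nbuf.append(nxt)
            else:
                nbuf.append(n)                   # the other neurons rotate to the back
        nbuf.rotate(-1)                          # "adjust the order" before the next window
        results.append(acc)
    return results

print(conv_row_mac([1, 2, 3, 4, 5, 6, 7], [1, 0, 2, 0, 1]))   # expected [12, 16, 20]
```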

The fully-connected-mode data flow is shown in Figure 10b. In the first clock cycle, the product of neuron 1 and connection weight 1 is computed; connection weight 1 is then discarded and neuron 1 is moved to the back of Neuron_Buffer. In the second clock cycle, the new connection weight 6 is read in while the product of neuron 2 and connection weight 2 is computed, after which connection weight 2 is discarded and neuron 2 is moved to the back of Neuron_Buffer. This continues until, in the fifth clock cycle, the PE unit completes the intermediate result of one output neuron; it outputs the result, adjusts the order of the values in the buffer, and waits to compute the next group of output neurons.
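The fully connected case mirrors the convolution case with the roles swapped: the weight is consumed once, while the neuron is kept for the next output neuron. A minimal functional sketch, again assuming deque-based buffers and plain multiplication:

```python
from collections import deque

def fc_partial_sums(neurons, weight_stream, outputs):
    """Partial sums of one PE for several output neurons, following Fig. 10b."""
    m = len(neurons)
    nbuf = deque(neurons)                    # Neuron_Buffer: this PE's input-neuron group
    wstream = iter(weight_stream)            # weights arrive one output neuron at a time
    wbuf = deque(next(wstream) for _ in range(m))
    partials = []
    for _ in range(outputs):
        acc = 0.0
        for _ in range(m):
            n, w = nbuf.popleft(), wbuf.popleft()
            acc += n * w                     # the weight is discarded after use
            nbuf.append(n)                   # the neuron rotates to the back for reuse
            nxt = next(wstream, None)        # read in the next weight, if any
            if nxt is not None:
                wbuf.append(nxt)
        partials.append(acc)                 # intermediate result of one output neuron
    return partials

neurons = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = [0.1, 0.2, 0.3, 0.4, 0.5] + [1.0, 1.0, 1.0, 1.0, 1.0]   # two output neurons
print(fc_partial_sums(neurons, weights, outputs=2))               # expected [5.5, 15.0]
```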

Figure 11 shows the configurable design of the present invention. In convolution computing mode, not only can the PE array size be configured according to the kernel size and the number of output feature maps, but the number of PE arrays computing in parallel can also be configured according to the available hardware resources and the number of input feature maps. Each PE array handles the convolution of one input feature map, and its associated adder array accumulates the output values of the PE units together with the corresponding bias value. When the number of input channels is greater than 1, the output neurons computed by each PE array are only intermediate results for the corresponding output feature map, so an additional adder array outside the PE arrays is needed to accumulate the intermediate results of all PE arrays; the output of this adder array is the final value of the corresponding output-feature-map neuron.
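The per-channel accumulation in Figure 11 amounts to: each PE array produces partial sums for its own input feature map, and the outer adder array sums these partial sums and adds the bias. A short illustrative sketch (the channel count, the `fmaps`/`kernels`/`bias` names, and the plain NumPy convolution are assumptions, not the hardware interfaces):

```python
import numpy as np

def conv_single_channel(fmap, kernel):
    """Partial sums produced by one PE array for one input feature map (valid convolution)."""
    k = kernel.shape[0]
    out = fmap.shape[0] - k + 1
    res = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            res[i, j] = float((fmap[i:i+k, j:j+k] * kernel).sum())
    return res

channels, N, k = 3, 8, 5
fmaps   = np.random.rand(channels, N, N)        # one input feature map per PE array
kernels = np.random.rand(channels, k, k)        # matching kernel slices of one output map
bias    = 0.5

# Outer adder array: accumulate the per-channel partial sums and add the bias.
partial = [conv_single_channel(fmaps[c], kernels[c]) for c in range(channels)]
output_map = sum(partial) + bias                # final neurons of this output feature map
print(output_map.shape)                         # (4, 4)
```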

Based on an effective neural network model compression algorithm and the characteristics of the compressed sparse convolutional neural network data, the present invention designs an energy-efficient, parameterized, configurable hardware accelerator. The hardware-friendly compression algorithm proposed by the present invention, applied during offline training of the model, comprises three main steps: pruning, 2^n clustering, and compression coding. Compared with the original network models, the compressed LeNet, AlexNet, and VGG reduce storage consumption by roughly 222×, 22×, and 68×, respectively. Furthermore, in the hardware accelerator designed by the present invention, all multipliers in the PE units are replaced by shifters, and every basic module can be configured according to the network computation and the hardware resources. The accelerator therefore offers high speed, low power consumption, small resource occupation, and high data utilization.
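The benefit of the 2^n clustering is that every retained weight becomes a signed power of two, so a PE can replace the multiply with a shift. The sketch below illustrates the idea with a simple nearest-power-of-two quantizer and a shift-based product; the pruning threshold, the exponent range, and the fixed-point details are illustrative assumptions, not the exact algorithm of the patent.

```python
import numpy as np

def prune_and_quantize_pow2(weights, prune_threshold=0.05, n_min=-4, n_max=0):
    """Prune small weights, then snap the rest to the nearest +/- 2^n (n_min <= n <= n_max)."""
    exps = np.arange(n_min, n_max + 1)
    centers = 2.0 ** exps                                     # candidate cluster centers 2^n
    out = np.zeros_like(weights)
    for idx, w in np.ndenumerate(weights):
        if abs(w) < prune_threshold:
            continue                                          # pruned: stays zero (sparse)
        n = exps[np.argmin(np.abs(abs(w) - centers))]         # nearest center (Euclidean distance)
        out[idx] = np.sign(w) * 2.0 ** n
    return out

def shift_multiply(neuron_fixed, weight_pow2):
    """Multiply a fixed-point neuron by a +/- 2^n weight using only a shift and a sign."""
    if weight_pow2 == 0:
        return 0
    n = int(np.log2(abs(weight_pow2)))                        # exponent carried by the weight index
    prod = neuron_fixed << n if n >= 0 else neuron_fixed >> -n
    return -prod if weight_pow2 < 0 else prod

w = prune_and_quantize_pow2(np.array([0.27, -0.02, 0.9, -0.6]))
print(w)                              # [ 0.25  0.    1.   -0.5 ]
print(shift_multiply(40, -0.5))       # -20: shift right by 1 and negate
```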

Claims (6)

1. A sparse convolutional neural network accelerator, comprising: an off-chip DRAM, a neuron input buffer, a neuron decoding unit, a neuron encoding unit, a neuron output buffer, a neuron on-chip global buffer, a weight input buffer, a weight decoding unit, a weight on-chip global buffer, a PE computing unit array, an activation unit, and a pooling unit; wherein
the off-chip DRAM is used to store the compressed neuron and connection weight data;
the neuron input buffer is used to buffer the compressed neuron data read from the off-chip DRAM and transmit them to the neuron decoding unit;
the neuron decoding unit is used to decode the compressed neuron data and transmit the decoded neurons to the neuron on-chip global buffer;
the neuron encoding unit is used to compress and encode the data in the neuron on-chip global buffer and transmit them to the neuron output buffer;
the neuron output buffer is used to store the received compressed neuron data into the off-chip DRAM;
the neuron on-chip global buffer is used to buffer the decoded neurons and the intermediate results computed by the PE array, and to exchange data with the PE computing unit array;
the activation unit is used to activate the results computed by the PE computing unit array and transmit them to the pooling unit or the neuron on-chip global buffer; the pooling unit is used to downsample the activated results;
the weight input buffer is used to buffer the compressed connection weight data read from the off-chip DRAM and transmit them to the weight decoding unit;
the weight decoding unit is used to decode the compressed connection weight data and transmit the decoded connection weights to the weight on-chip global buffer;
the weight on-chip global buffer is used to buffer the decoded connection weights and to exchange data with the PE computing unit array;
the PE computing unit array is used to perform computation in a convolution computing mode or a fully connected computing mode.
2. An implementation method based on the sparse convolutional neural network accelerator of claim 1, wherein the data are preprocessed and stored in the DRAM; the connection weights of the sparse network stored in the off-chip DRAM are read into the weight input buffer, decoded by the weight decoding unit, and stored in the weight on-chip global buffer; the neurons stored in the off-chip DRAM are read into the neuron input buffer, decoded by the neuron decoding unit, and stored in the neuron on-chip global buffer; the computing mode of the PE computing unit array is determined according to the configuration parameters of the current layer of the neural network, and the decoded and arranged neurons and connection weights are sent to the corresponding PE computing units; after the neuron and connection weight data enter a PE, they are buffered in Neuron_Buffer and Weight_Buffer respectively, and once Neuron_Buffer and Weight_Buffer have been filled for the first time, the products of the neurons and the connection weights start to be computed, where Neuron_Buffer is the neuron buffer inside the PE and Weight_Buffer is the weight buffer inside the PE;
if a result computed by the PE computing unit array is an intermediate result, it is temporarily stored in the neuron on-chip global buffer and later accumulated with other intermediate results computed by the PE computing unit array; if it is the final result of a convolutional layer, it is sent to the activation unit, the activated result is sent to the pooling unit, and after downsampling it is stored back into the neuron on-chip global buffer; if it is the final result of a fully connected layer, it is sent to the activation unit and then stored directly back into the neuron on-chip global buffer; the data stored in the neuron on-chip global buffer are compressed and encoded by the neuron encoding unit according to the sparse matrix compression coding storage format, sent to the neuron output buffer, and stored back into the off-chip DRAM to be read in again for the computation of the next layer; this loop continues until the last fully connected layer outputs a one-dimensional vector whose length is the number of classification categories.
3. The implementation method of the sparse convolutional neural network accelerator according to claim 2, wherein the specific procedure of preprocessing the data is:
1) pruning the connection weights to make the neural network sparse;
2) 2^n connection weight clustering: first, according to the distribution of the pruned network parameters, cluster centers of the form 2^n are selected, according to the quantization bit width, within the range where the parameters are concentrated; the Euclidean distances between the remaining non-zero parameters and the cluster centers are computed, and each non-zero parameter is clustered to the cluster center with the shortest Euclidean distance, where n is a negative integer, a positive integer, or zero;
3) for the sparse network clustered in step 2), the connection weights of the sparse network and the feature map neurons are compressed using the sparse matrix compression coding storage format and stored in the off-chip DRAM.
4. The implementation method of the sparse convolutional neural network accelerator according to claim 2, wherein the sparse matrix compression coding storage format is specifically: the encoding of the convolutional layer feature maps and the fully connected layer neurons is divided into three parts, in which each 5-bit field represents the number of zeros between the current non-zero neuron and the previous non-zero neuron, each 16-bit field represents the value of the non-zero element itself, and the final 1-bit field marks whether a non-zero element of a convolutional layer feature map starts a new row or a non-zero element of a fully connected layer starts a new layer; an entire storage unit encodes three non-zero elements together with their relative position information; if a storage unit contains, in its middle, the last non-zero element of a row of a convolutional layer feature map or the last non-zero element of the input neurons of a layer of the fully connected layers, the remaining 5-bit and 16-bit fields are all padded with zeros and the final 1-bit flag is set to 1; the encoding of the connection weights is also divided into three parts, in which each 4-bit field represents the number of zeros between the current non-zero connection weight and the previous non-zero connection weight, each 5-bit field represents a non-zero connection weight index, and the final 1-bit field marks whether the non-zero connection weights in a convolutional layer correspond to the same input feature map or whether the non-zero connection weights in a fully connected layer correspond to the same output neuron; an entire storage unit encodes seven non-zero connection weights together with their relative position information; likewise, if a storage unit contains, in its middle, the last non-zero connection weight corresponding to one input feature map in a convolutional layer or the last non-zero connection weight corresponding to one output neuron in a layer of the fully connected layers, the remaining 4-bit and 5-bit fields are all padded with zeros and the final 1-bit flag is set to 1.
5. The implementation method of the sparse convolutional neural network accelerator according to claim 2, wherein, when the computing mode of the PE computing unit array is convolution computation, if the convolution window size is k×k and the convolution window sliding stride is s, the filled Neuron_Buffer and Weight_Buffer store neurons 1 to k and connection weights 1 to k respectively, in first-in-first-out order;
if the convolution window sliding stride s equals 1, then in the first clock cycle the product of neuron 1 and connection weight 1 is computed, neuron 1 is discarded, and connection weight 1 is moved to the back of Weight_Buffer; in the second clock cycle the new neuron k+1 is read in while the product of neuron 2 and connection weight 2 is computed, and neuron 2 and connection weight 2 are both moved to the back of their respective buffers; in the third clock cycle the product of neuron 3 and connection weight 3 is computed, and neuron 3 and connection weight 3 are both moved to the back of their respective buffers; and so on, until in the k-th clock cycle the PE unit completes the multiply-accumulate of one row of one convolution window, outputs the result, adjusts the order of the values in the buffers according to the stride s, and waits to compute the next convolution window;
if the stride s is greater than 1, then in the first clock cycle the product of neuron 1 and connection weight 1 is computed, neuron 1 is discarded, and connection weight 1 is moved to the back of Weight_Buffer; in the second clock cycle the new neuron k+1 is read in, the product of neuron 2 and connection weight 2 is computed, neuron 2 is discarded, and connection weight 2 is moved to the back of Weight_Buffer; and so on, until in the s-th clock cycle the new neuron k+s-1 is read in, the product of neuron s and connection weight s is computed, neuron s is discarded, and connection weight s is moved to the back of Weight_Buffer; in the (s+1)-th clock cycle the new neuron k+s is read in while the product of neuron s+1 and connection weight s+1 is computed, and neuron s+1 and connection weight s+1 are both moved to the back of their respective buffers; in the (s+2)-th clock cycle the product of neuron s+2 and connection weight s+2 is computed, and neuron s+2 and connection weight s+2 are both moved to the back of their respective buffers; and so on, until in the k-th clock cycle the PE unit completes the multiply-accumulate of one row of one convolution window, outputs the result, adjusts the order of the values in the buffers according to the stride s, and waits to compute the next convolution window.
6. The implementation method of the sparse convolutional neural network accelerator according to claim 2, wherein, when the computing mode of the PE computing unit array is fully connected computation, if each PE unit processes m input neurons, then after Neuron_Buffer and Weight_Buffer are filled they store neurons 1 to m and connection weights 1 to m respectively, in first-in-first-out order; in the first clock cycle the product of neuron 1 and connection weight 1 is computed, connection weight 1 is discarded, and neuron 1 is moved to the back of Neuron_Buffer; in the second clock cycle the new connection weight m+1 is read in while the product of neuron 2 and connection weight 2 is computed, connection weight 2 is discarded, and neuron 2 is moved to the back of Neuron_Buffer; and so on, until in the m-th clock cycle the PE unit completes the intermediate result of one output neuron, outputs the result, adjusts the order of the values in the buffer, and waits to compute the next group of output neurons.
CN201811582530.6A 2018-12-24 2018-12-24 A sparse convolutional neural network accelerator and implementation method Expired - Fee Related CN109635944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811582530.6A CN109635944B (en) 2018-12-24 2018-12-24 A sparse convolutional neural network accelerator and implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811582530.6A CN109635944B (en) 2018-12-24 2018-12-24 A sparse convolutional neural network accelerator and implementation method

Publications (2)

Publication Number Publication Date
CN109635944A CN109635944A (en) 2019-04-16
CN109635944B true CN109635944B (en) 2020-10-27

Family

ID=66076874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811582530.6A Expired - Fee Related CN109635944B (en) 2018-12-24 2018-12-24 A sparse convolutional neural network accelerator and implementation method

Country Status (1)

Country Link
CN (1) CN109635944B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070178B (en) * 2019-04-25 2021-05-14 北京交通大学 Convolutional neural network computing device and method
CN110069444B (en) * 2019-06-03 2024-07-26 南京宁麒智能计算芯片研究院有限公司 Computing unit, array, module, hardware system and implementation method
CN110390385B (en) * 2019-06-28 2021-09-28 东南大学 BNRP-based configurable parallel general convolutional neural network accelerator
US10977002B2 (en) * 2019-07-15 2021-04-13 Facebook Technologies, Llc System and method for supporting alternate number format for efficient multiplication
CN110543936B (en) * 2019-08-30 2022-03-25 北京空间飞行器总体设计部 Multi-parallel acceleration method for CNN full-connection layer operation
CN111492369B (en) * 2019-09-19 2023-12-12 香港应用科技研究院有限公司 Residual quantization of shift weights in artificial neural networks
CN112561050B (en) * 2019-09-25 2023-09-05 杭州海康威视数字技术股份有限公司 Neural network model training method and device
CN110728358B (en) * 2019-09-30 2022-06-10 上海商汤智能科技有限公司 Data processing method and device based on neural network
CN110738310B (en) * 2019-10-08 2022-02-01 清华大学 Sparse neural network accelerator and implementation method thereof
CN110942145A (en) * 2019-10-23 2020-03-31 南京大学 Convolutional neural network pooling layer based on reconfigurable computing, hardware implementation method and system
CN111008698B (en) * 2019-11-23 2023-05-02 复旦大学 Sparse Matrix Multiplication Accelerator for Hybrid Compressed Recurrent Neural Networks
CN110991631A (en) * 2019-11-28 2020-04-10 福州大学 Neural network acceleration system based on FPGA
CN113159267B (en) * 2020-01-07 2024-08-27 Tcl科技集团股份有限公司 Image data processing method and device and terminal equipment
CN111275167A (en) * 2020-01-16 2020-06-12 北京中科研究院 High-energy-efficiency pulse array framework for binary convolutional neural network
CN111368988B (en) * 2020-02-28 2022-12-20 北京航空航天大学 A Deep Learning Training Hardware Accelerator Exploiting Sparsity
CN111488983B (en) * 2020-03-24 2023-04-28 哈尔滨工业大学 Lightweight CNN model calculation accelerator based on FPGA
CN111340198B (en) * 2020-03-26 2023-05-05 上海大学 Neural network accelerator for data high multiplexing based on FPGA
CN111427895B (en) * 2020-04-01 2022-10-25 西安交通大学 Neural network reasoning acceleration method based on two-segment cache
CN111667063B (en) * 2020-06-30 2021-09-10 腾讯科技(深圳)有限公司 Data processing method and device based on FPGA
CN112966729B (en) * 2021-02-26 2023-01-31 成都商汤科技有限公司 Data processing method and device, computer equipment and storage medium
CN113269316B (en) * 2021-03-26 2022-10-11 复旦大学 Sparse data selection logic module supporting sparse neural network computing accelerator
CN113138957A (en) * 2021-03-29 2021-07-20 北京智芯微电子科技有限公司 Chip for neural network inference and method for accelerating neural network inference
CN113592066B (en) * 2021-07-08 2024-01-05 深圳市易成自动驾驶技术有限公司 Hardware acceleration method, device, equipment and storage medium
CN114492753A (en) * 2022-01-26 2022-05-13 南京大学 Sparse accelerator applied to on-chip training
CN115828044B (en) * 2023-02-17 2023-05-19 绍兴埃瓦科技有限公司 Dual sparsity matrix multiplication circuit, method and device based on neural network
CN116187408B (en) * 2023-04-23 2023-07-21 成都甄识科技有限公司 Sparse acceleration unit, calculation method and sparse neural network hardware acceleration system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992329A (en) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 A kind of computational methods and Related product

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992329A (en) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 A kind of computational methods and Related product

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Exploring Resource-Aware Deep Neural Network Accelerator and Architecture Design; Baoting Li et al.; 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP); 2018-11-21; pp. 1-5 *
MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects; Hyoukjun Kwon et al.; Session 5B Neural Networks; 2018-03-28; pp. 1-15 *

Also Published As

Publication number Publication date
CN109635944A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
CN109635944B (en) A sparse convolutional neural network accelerator and implementation method
Yuan et al. High performance CNN accelerators based on hardware and algorithm co-optimization
CN111062472B (en) A Sparse Neural Network Accelerator and Acceleration Method Based on Structured Pruning
CN111445012B (en) FPGA-based packet convolution hardware accelerator and method thereof
TWI827432B (en) Computing apparatus, machine learning computing apparatus, combined processing apparatus, neural network chip, electronic device, board, and computing method
CN112329910B (en) Deep convolution neural network compression method for structure pruning combined quantization
CN109409511B (en) A Data Stream Scheduling Method for Convolution Operation for Dynamic Reconfigurable Arrays
CN110163359B (en) Computing device and method
CN110378468A (en) A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN107423816B (en) Multi-calculation-precision neural network processing method and system
CN110070178A (en) A kind of convolutional neural networks computing device and method
CN110321997B (en) High-parallelism computing platform, system and computing implementation method
CN107256424B (en) Three-value weight convolution network processing system and method
CN113344179B (en) IP Core of Binarized Convolutional Neural Network Algorithm Based on FPGA
CN109901814A (en) Custom floating point number and its calculation method and hardware structure
CN112734020B (en) Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
Chang et al. Reducing MAC operation in convolutional neural network with sign prediction
CN115018062A (en) An FPGA-based Convolutional Neural Network Accelerator
CN111401554A (en) Accelerator of convolutional neural network supporting multi-granularity sparsity and multi-mode quantization
CN115423081A (en) Neural network accelerator based on CNN _ LSTM algorithm of FPGA
Struharik et al. Conna–compressed cnn hardware accelerator
CN109389208B (en) Data quantization device and quantization method
CN111507465A (en) Configurable convolutional neural network processor circuit
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN110110852B (en) Method for transplanting deep learning network to FPAG platform

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20201027