CN115936064A

CN115936064A - Neural network acceleration array based on weight cycle data stream

Info

Publication number: CN115936064A
Application number: CN202211141844.9A
Authority: CN
Inventors: 程筱舒; 王忆文; 娄鸿飞; 李平
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-09-20
Filing date: 2022-09-20
Publication date: 2023-04-07
Anticipated expiration: 2042-09-20
Also published as: CN115936064B

Abstract

The present invention specifically relates to a neural network acceleration array based on weight cyclic data flow, which fully reuses the weight value read from memory and input feature map data, greatly reduces the access to external memory, and belongs to the hardware acceleration of neural network technology field. In the field of artificial intelligence chips, the convolution operation occupies more than 90% of the calculation amount of the entire convolutional neural network model. With , a weighted recurrent data flow is proposed. By designing a PE array based on the weighted cyclic data flow, the convolution operation is optimized to effectively reduce the power consumption and delay of the hardware acceleration structure, thereby improving the overall performance of the system.

Description

A Neural Network Acceleration Array Based on Weighted Circular Data Flow

技术领域technical field

本发明涉及神经网络的硬件加速技术领域，具体涉及一种基于权重循环数据流的神经网络加速阵列设计方法。The invention relates to the technical field of hardware acceleration of a neural network, in particular to a design method of a neural network acceleration array based on a weight cycle data flow.

背景技术Background technique

随着物联网技术的飞速发展，可穿戴智能产品将人体各个部位作为互联网的接口，真正实现具有微型化、便携化、智能化可穿戴产品特点的人机一体化产品体验，为消费者提供便携式实时信息采集和数据服务，具有更大的技术含量和市场吸引力。With the rapid development of the Internet of Things technology, wearable smart products use various parts of the human body as the interface of the Internet, truly realize the human-machine integrated product experience with the characteristics of miniaturization, portability, and intelligent wearable products, and provide consumers with portable real-time Information collection and data services have greater technical content and market appeal.

不幸的是，设计高性能的可穿戴计算设备并非易事，其实现面临许多挑战。该领域是计算机科学和工程等不同研究领域的交叉点，使用了微电子和无线通信等各种技术。微电子技术的进步导致了适用于可穿戴计算设备的小尺寸低功耗人工智能芯片。Unfortunately, designing high-performance wearable computing devices is not an easy task, and its implementation faces many challenges. The field is the intersection of different fields of study such as computer science and engineering, using various technologies such as microelectronics and wireless communication. Advances in microelectronics have led to small, low-power artificial intelligence chips suitable for wearable computing devices.

在人工智能芯片领域中，隶属于机器学习范畴的深度学习，被广泛运用于图像分类、语音识别、对象检测等方面，并取得了显著的成果。卷积神经网络、递归神经网络和深度置信网络是深度学习研究的主要聚焦点，其中最先进的当属卷积神经网络。In the field of artificial intelligence chips, deep learning, which belongs to the category of machine learning, is widely used in image classification, speech recognition, object detection, etc., and has achieved remarkable results. Convolutional Neural Networks, Recurrent Neural Networks, and Deep Belief Networks are the main focus of deep learning research, with Convolutional Neural Networks being the most advanced.

典型的卷积神经网络结构包括有：卷积层，激活层，池化层，全连接层和输入输出特征图等。其中卷积层的作用是特征提取，池化层的作用是像素压缩，全连接层的作用是分类。卷积层为计算密集型运算，而全连接层为数据密集型运算。Typical convolutional neural network structures include: convolutional layers, activation layers, pooling layers, fully connected layers, and input and output feature maps. Among them, the role of the convolutional layer is feature extraction, the role of the pooling layer is pixel compression, and the role of the fully connected layer is classification. Convolutional layers are computation-intensive operations, while fully-connected layers are data-intensive operations.

卷积神经网络在数据处理方面有三个瓶颈问题：一是数据密集型，需要处理的数据量极大。二是计算密集型，对数据处理存储需要耗费大量计算资源和大量时间。三是速度失配问题，即数据处理速度慢于数据的产生速度。因此适合人工智能架构的专属芯片亟需发展，而高速数据传输、高速计算的神经网络加速器的实现将在多个方面有着重要的意义。Convolutional neural networks have three bottlenecks in data processing: one is data-intensive, and the amount of data to be processed is extremely large. The second is computing-intensive, which consumes a lot of computing resources and a lot of time for data processing and storage. The third is the speed mismatch problem, that is, the data processing speed is slower than the data generation speed. Therefore, the development of dedicated chips suitable for artificial intelligence architecture is urgently needed, and the realization of high-speed data transmission and high-speed computing neural network accelerators will be of great significance in many aspects.

目前，由于硬件性能的提高，在加速神经网络训练和推断过程中，主要采用CPU、GPU、FPGA和ASIC的形式。大约二十年前，CPU曾是实现神经网络算法的主流，其优化领域主要集中在软件部分。CNN不断增加的计算成本使得需要硬件加速其推理过程。在GPU方面，GPU集群可以并行地加速具有10亿多个参数的超大网络。主流的GPU聚类神经网络通常使用分布式SGD算法。许多研究进一步利用了这种并行性，努力实现不同集群之间的通信。由于FPGA具有许多吸引人的特性，因此成为CNN硬件加速的良好平台。一般来说，FPGA比CPU和GPU提供更高的能源效率，比CPU具有更高的性能。与GPU相比，FPGA的吞吐量是几十千兆次，内存访问有限。此外，它本身不支持浮点计算，但有更低能耗。专用集成电路ASIC是为特定应用而设计的专用处理器，虽然ASIC的灵活性较低，开发周期长，成本高，但是具有体积小、功耗低、计算速度快和可靠性高等优点。At present, due to the improvement of hardware performance, in the process of accelerating neural network training and inference, it is mainly in the form of CPU, GPU, FPGA and ASIC. About twenty years ago, the CPU was the mainstream for implementing neural network algorithms, and its optimization areas were mainly focused on the software part. The ever-increasing computational cost of CNNs necessitates hardware acceleration of their inference process. On the GPU side, GPU clusters can accelerate very large networks with more than 1 billion parameters in parallel. The mainstream GPU clustering neural network usually uses the distributed SGD algorithm. Many studies have further exploited this parallelism in an effort to enable communication between different clusters. FPGAs are a good platform for hardware acceleration of CNNs due to their many attractive features. In general, FPGAs offer higher energy efficiency than CPUs and GPUs, and higher performance than CPUs. Compared with GPU, the throughput of FPGA is tens of gigaflops, and the memory access is limited. In addition, it does not natively support floating-point calculations, but has lower power consumption. ASIC is a special-purpose processor designed for specific applications. Although ASIC has low flexibility, long development cycle and high cost, it has the advantages of small size, low power consumption, fast calculation speed and high reliability.

神经网络硬件加速器有两种较为典型的体系架构：时域计算架构(树状结构)和空域计算架构(PE阵列结构)。树状结构基于指令流对算数计算单元和存储资源进行集中控制，每个算数逻辑单元都从集中式存储系统获取运算数据，并向其写回结果。它由一个乘法加法树，一个用于分配输入值的缓冲区和一个预取缓冲区组成。PE阵列结构，每个算数运算单元都具有本地存储器，整个架构采用数据流控制，即所有的PE单元形成处理链关系，数据直接在PE之间传递。它由全局缓冲区、FIFO和PE阵列组成。每个PE由一个或多个的乘法器和加法器组成，可实现高度并行计算。There are two typical architectures of neural network hardware accelerators: time domain computing architecture (tree structure) and space domain computing architecture (PE array structure). The tree structure centrally controls the arithmetic calculation unit and storage resources based on the instruction flow, and each arithmetic logic unit obtains the operation data from the centralized storage system and writes back the result to it. It consists of a multiply-add tree, a buffer for allocating input values, and a prefetch buffer. PE array structure, each arithmetic operation unit has a local memory, the entire architecture adopts data flow control, that is, all PE units form a processing chain relationship, and data is directly transmitted between PEs. It consists of global buffer, FIFO and PE array. Each PE consists of one or more multipliers and adders, enabling highly parallel computing.

神经网络硬件加速器有四种较为典型的数据流模式：无局部复用数据流，输入固定流，输出固定流和权重固定流。对于无局部复用数据流，为了最大化存储容量和最小化片外存储器带宽，不给PE分配本地存储，而是把所有的区域分配给全局缓冲区以增加其容量，它必须多路传送输入特征图，单路传送卷积核权重，然后通过PE阵列累加部分和。对于输入固定流，计算核心把输入特征图读入局部的输入寄存器；计算核心充分复用这些输入数据，更新输出缓存中所有相关的输出部分和；更新后的输出部分和会重新写回输出缓存。对于输出固定流，计算核心把输入特征图的各通道读入局部的输入寄存器；存储在计算核心输出寄存器中的输出部分被充分复用，以完成三维卷积通道方向上的完全累加；最终的输出特征图会在池化之后再写入输出缓存。对于权重固定流，计算核心读取输入特征图分块到局部的输入寄存器；计算核心利用这些输入数据更新分块的输出部分和；存储在权重缓存中的分块卷积核权重被充分复用，以更新存储在输出缓存中的输出部分和。The neural network hardware accelerator has four typical data flow modes: no local multiplexing data flow, input fixed flow, output fixed flow and weight fixed flow. For data streams without local multiplexing, in order to maximize storage capacity and minimize off-chip memory bandwidth, instead of allocating local storage to the PE, all regions are allocated to the global buffer to increase its capacity, which must multiplex the input Feature map, unicast the convolution kernel weights, and then accumulate the partial sums through the PE array. For the input fixed stream, the calculation core reads the input feature map into the local input register; the calculation core fully reuses these input data, and updates all relevant output partial sums in the output buffer; the updated output partial sum will be written back to the output buffer . For the output fixed flow, the calculation core reads each channel of the input feature map into the local input register; the output part stored in the output register of the calculation core is fully multiplexed to complete the full accumulation in the direction of the three-dimensional convolution channel; the final The output feature maps are written to the output cache after pooling. For the weight fixed stream, the calculation core reads the input feature map block to the local input register; the calculation core uses these input data to update the output part of the block; the weight of the block convolution kernel stored in the weight cache is fully reused , to update the output partial sum stored in the output buffer.

图1为一个简单的卷积层示意图，其中10为7×7的输入特征图，11为一个3×3的卷积核，12为5×5的输出特征图。卷积核窗口逐行呈“Z”字形在10上滑动做卷积运算，得到结果12。卷积运算占据整个卷积神经网络模型的计算量的百分之九十以上，因此通过设计一种结构化的PE阵列，对卷积操作进行优化，能有效地降低硬件加速结构的面积和功耗，从而提升系统的总体性能。Figure 1 is a schematic diagram of a simple convolutional layer, where 10 is a 7×7 input feature map, 11 is a 3×3 convolution kernel, and 12 is a 5×5 output feature map. The convolution kernel window slides on 10 line by line in a "Z" shape to perform convolution operation, and the result 12 is obtained. Convolution operations occupy more than 90% of the calculation of the entire convolutional neural network model. Therefore, by designing a structured PE array and optimizing the convolution operation, the area and power of the hardware acceleration structure can be effectively reduced. consumption, thereby improving the overall performance of the system.

发明内容Contents of the invention

针对以上背景内容，本发明提出了一种针对神经网络卷积层运算的加速阵列设计方法。可以应用于FPGA或ASIC神经网络硬件加速器设计中，作为AI加速处理器的计算部分。In view of the above background content, the present invention proposes an accelerated array design method for neural network convolution layer operations. It can be applied to the design of FPGA or ASIC neural network hardware accelerator as the computing part of AI acceleration processor.

在空域计算结构中，每个运算单元是通过数据流来进行控制的，因此关键就是解决数据流动问题。为了减少对输入数据的重复调用和移动，最大化数据复用，最好一次性输入特征图，因此提出了权重循环数据流(weight ring dataflow,WR)。In the airspace computing structure, each computing unit is controlled through data flow, so the key is to solve the problem of data flow. In order to reduce repeated calls and movements of input data and maximize data reuse, it is best to input feature maps at one time, so weight ring dataflow (WR) is proposed.

本发明基于所提出的WR数据流，构建一种新的神经网络PE阵列架构。假设某一卷积层的卷积核尺寸为K²，则相应PE阵列的尺寸也为K²。PE单元为了数据移动而横向互连，为了权重循环而纵向循环互连。The present invention builds a new neural network PE array architecture based on the proposed WR data flow. Assuming that the size of the convolution kernel of a certain convolutional layer is K ² , the size of the corresponding PE array is also K ² . PE units are horizontally interconnected for data movement and vertically cyclically interconnected for weight cycling.

本发明的优点主要包括：充分复用了从内存中读取的权重值和输入特征图数据，大大减少了对外部存储器的访问，降低了整体功耗和延迟。The advantages of the present invention mainly include: the weight value read from the memory and the input feature map data are fully reused, the access to the external memory is greatly reduced, and the overall power consumption and delay are reduced.

附图说明Description of drawings

图1为卷积层示意图；Figure 1 is a schematic diagram of a convolutional layer;

图2为权重循环数据流示意图；Fig. 2 is a schematic diagram of weight cycle data flow;

图3为基于权重循环数据流的神经网络加速阵列示意图；Fig. 3 is a schematic diagram of a neural network acceleration array based on a weighted cyclic data flow;

图4为权重循环数据流PE阵列的工作流程图Figure 4 is a workflow diagram of the PE array of the weight cycle data flow

具体实施方式Detailed ways

以下结合附图对本发明的权重循环数据流和对应的阵列硬件实现进行详细说明：The weight cycle data flow and corresponding array hardware implementation of the present invention will be described in detail below in conjunction with the accompanying drawings:

权重循环数据流具体体现如图2所示，20为尺寸为N²的输入特征图，这里假设N取7，则第一个周期的卷积核201，尺寸为K²，这里假设K取3，其滑动的行数范围为1至K行。在下一个周期的卷积核202，滑动的行数范围为2至K+1行，以此类推周期203的卷积核和204的卷积核等的情况；The specific embodiment of the weight cycle data flow is shown in Figure 2. 20 is the input feature map with a size of N ^2. Here, it is assumed that N is 7, and the convolution kernel 201 of the first cycle has a size of K ² . Here, it is assumed that K is 3 , the number of sliding rows ranges from 1 to K rows. In the convolution kernel 202 of the next cycle, the number of sliding rows ranges from 2 to K+1 rows, and so on for the convolution kernel of the cycle 203 and the convolution kernel of 204;

由图1卷积层的运算规律可知，11窗口内的数据在每次换行后，只更新一行数据。即原先的K行数据仅有一行被舍弃掉。因此若每次换行后都全部重新更新数据，会大大增加数据访存量。因此，将固有数据行保持位置的不变，仅用更新行覆盖舍弃的数据行即可。From the operation law of the convolutional layer in Figure 1, it can be seen that the data in the 11 window only updates one line of data after each new line. That is, only one row of the original K rows of data is discarded. Therefore, if all the data is updated after each line change, the amount of data access will be greatly increased. Therefore, keep the position of the inherent data row unchanged, and just cover the discarded data row with the update row.

基于权重循环数据流的神经网络加速阵列如图3所示，含K²个PE单元301的阵列是对卷积核窗口在图像上逐行滑动的模拟。首先将卷积核上的权重值预置到PE阵列30中，当K行图像行同时移入阵列做卷积的过程，原本称为权重固定流。但本研究提出的WR流稍许不同，它将每列的权重连接起来，提供了权值流动的通路，即30中的竖循环线。The neural network acceleration array based on the weighted cyclic data flow is shown in FIG. 3 , the array containing ^K2 PE units 301 is a simulation of the convolution kernel window sliding row by row on the image. First, the weight value on the convolution kernel is preset into the PE array 30. When K lines of image lines are simultaneously moved into the array for convolution, it is originally called the weight fixed flow. However, the WR flow proposed in this study is slightly different. It connects the weights of each column and provides a path for the flow of weights, that is, the vertical loop line in 30.

对于图4所示的PE阵列工作流程图，PE阵列的工作情况如下：For the PE array workflow shown in Figure 4, the PE array works as follows:

1)首先同时输入图像行的1至K行，直到输入特征图数据首次充满30开始，按步长整体右移图像行，依次做卷积运算；1) First input the 1 to K lines of the image line at the same time, until the input feature map data is filled with 30 for the first time, move the image line to the right as a whole according to the step size, and perform convolution operations in turn;

2)当1至K行的输入特征图遍历完成后，更新一行图像行。其余未更新的K-1行数据继续重新循环输入；2) After the traversal of the input feature map of rows 1 to K is completed, an image row is updated. The rest of the K-1 row data that has not been updated continue to be re-entered;

3)当31的每个周期图像行更新的时候，30的权重就按行整体循环下移，存储在每行的单个301的权重寄存器中，30的K行图像行就整体右移做卷积；3) When the image row of each cycle of 31 is updated, the weight of 30 will be moved down by the overall cycle of the row, stored in a single weight register of 301 for each row, and the image row of K rows of 30 will be shifted to the right as a whole for convolution ;

4)当图像行更新完时，30完成输入特征图最后K行的卷积后，运算结束。4) When the image rows are updated, 30 completes the convolution of the last K rows of the input feature map, and the operation ends.

单个301主要由乘法器、加法器和权重寄存器组成。预置权重首先输入到权重寄存器中存储起来，并不断地重复使用，直到更行图像行的时候，才分别循环下移到下一行的301的权重寄存器中。A single 301 is mainly composed of multipliers, adders and weight registers. The preset weights are first input into the weight registers for storage, and are continuously reused until the image row is changed, and then are moved down to the weight registers of 301 in the next row respectively.

在输入特征图数据首次充满30前，输入激励不通过乘法器，直接输出到前传激励信号，将数据向右侧的301传递。Before the input feature map data fills 30 for the first time, the input excitation does not pass through the multiplier, but is directly output to the forward transmission excitation signal, and the data is transmitted to 301 on the right.

当输入特征图数据充满30，开始做卷积运算后，输入激励与权重输入到乘法器中，产生一个部分和。该部分和与左侧来的输入部分和信号累加，产生前传部分和，输入到右侧的301中。当每一行的部分和到达30的最右边时，由K-1个边缘加法器汇总，得到最终的一个卷积结果。When the input feature map data is full of 30, after the convolution operation is started, the input excitation and weight are input into the multiplier to generate a partial sum. This partial sum is accumulated with the input partial sum signal from the left to generate the forward partial sum, which is input to 301 on the right. When the partial sum of each row reaches the far right of 30, it is summarized by K-1 edge adders to obtain the final convolution result.

加法器部分除了上述结构外，还可替换成加法树的结构。In addition to the above structure, the adder part can also be replaced with an addition tree structure.

以上所述，仅是本发明的较佳实施例而已，并非对本发明作任何形式上的限制。任何熟悉本领域的技术人员，在不脱离本发明技术方案范围情况下，都可利用上述提出的方法和技术内容对本发明技术方案做出一些可能的变动和修饰，或修改为等同变化的等效实施例。因此，凡是未脱离本发明技术方案的内容，依据本发明的技术实质对以上实施例所做的任何简单修改、等同变化及修饰，均仍属于本发明技术方案保护的范围内。The above descriptions are only preferred embodiments of the present invention, and do not limit the present invention in any form. Any person familiar with the art, without departing from the scope of the technical solution of the present invention, can use the methods and technical content proposed above to make some possible changes and modifications to the technical solution of the present invention, or modify it to be equivalent to equivalent changes Example. Therefore, any simple modifications, equivalent changes and modifications made to the above embodiments according to the technical essence of the present invention, which do not deviate from the technical solution of the present invention, still fall within the protection scope of the technical solution of the present invention.

Claims

1. A neural network acceleration array based on weight cyclic data flow, characterized in that the PE array size is the size of the convolution window, PE units are horizontally interconnected for data movement, and vertically cyclically interconnected for weight circulation.

2. The method according to claim 1, characterized in that the data flow in the array obeys the flow rules of the weighted cycle data flow.

3. The working condition of the neural network acceleration array according to claim 2, wherein, at first, input 1 to K lines of the image line at the same time, until the input feature map data fills the PE array for the first time, and move the image line to the right as a whole according to the step size , perform convolution operations in turn; when the traversal of the input feature map of rows 1 to K is completed, update a row of image rows; the rest of the unupdated K-1 rows of data continue to be re-circulated and input; when the image rows are updated in each cycle, PE The weight of the array is shifted down by row as a whole, stored in the weight register of a single PE unit in each row, and the K rows of image rows in the PE array are shifted to the right as a whole for convolution; when the image row is updated, the PE array completes the input After the convolution of the last K rows of the feature map, the operation ends.

4. The method according to claim 3, characterized in that, during the process of shifting the input feature map to the right, the position of the weight in the array remains unchanged, and the array performs a convolution once for each bit shifted to the right as a whole.

5. The method according to claim 4, characterized in that, except that K rows of image rows enter at the same time at the beginning, the update interval of the image row is a cycle after the PE array traverses the convolution K rows, instead of every Just shoot once.

6 . The method according to claim 5 , wherein the update of the image line in each period moves along with the last line of the convolution kernel, and is updated sequentially according to the gradient.

7. The method according to claim 6, characterized in that the convolution operation is not performed as the input data moves to the right, but waits until it reaches the corresponding weight before performing the convolution operation.

8. The PE unit according to claim 1, characterized in that it is composed of at least one data register, a multiplier and multiple adders, and may also have structures such as FIFO.

9. The PE unit according to claim 8, characterized in that, in addition to the data transmission signals, there are control signals controlling the flow of data.