CN110288086B - A Configurable Convolution Array Accelerator Structure Based on Winograd - Google Patents
- Publication number
- CN110288086B CN201910511987.6A CN201910511987A
- Authority
- CN
- China
- Prior art keywords
- matrix
- weight
- module
- winograd
- activation value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Neurology (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Complex Calculations (AREA)
Abstract
Description
Technical Field
The present invention relates to a configurable convolution array accelerator structure, and in particular to a Winograd-based configurable convolution array accelerator structure.
Background Art
Neural networks perform exceptionally well in many fields, especially on image-related tasks. On computer vision problems such as image classification, semantic segmentation, image retrieval, and object detection, they have begun to replace most traditional algorithms and are gradually being deployed on terminal devices.
However, neural networks are extremely computation-intensive, leading to slow processing and high power consumption. A neural network workload comprises a training phase and an inference phase. To obtain high-precision results, the weights must be derived from massive data through repeated iterative computation during training. During inference, the input data must be processed within an extremely short response time (typically milliseconds), especially when the network serves a real-time system, for example in autonomous driving. The computations involved in a neural network mainly comprise convolution, activation, and pooling operations.
Previous studies have shown that more than 90% of a neural network's computing time is spent on convolution. Traditional convolution algorithms compute each element of the output feature map separately through repeated multiply-accumulate operations. Although solutions based on this approach have achieved initial success, larger gains are possible when the algorithm itself is made more efficient. Researchers have therefore proposed the Winograd convolution algorithm, which applies specific data-domain transformations to the input feature map and the weights, completing an equivalent convolution while reducing the number of multiplications. Since most neural network processor chips in practical applications run inference on a fixed network model, the Winograd convolution output paradigm they adopt is usually also fixed; the computation flow is thus fully determined and leaves considerable room for optimization. How to design and optimize a Winograd-based neural network accelerator architecture has therefore become a research focus.
In addition, for the vast majority of neural network applications, fixed-point input data already achieves good experimental results while further improving speed and reducing power consumption. However, the convolution data bit width in existing fixed-point neural networks is fixed and cannot be flexibly configured, which limits applicability. A 16-bit data width usually satisfies the precision requirements of a neural network, while for networks and scenarios with lower precision demands an 8-bit width also suffices. Making the data bit width configurable in a neural network therefore enables better optimization.
Summary of the Invention
The technical problem addressed by the present invention is to provide a Winograd-based configurable convolution array accelerator structure capable of improving the computational efficiency of neural network convolution operations.
The technical solution adopted by the present invention is a Winograd-based configurable convolution array accelerator structure comprising: an activation value buffer module, a weight buffer module, an output buffer module, a controller, a weight preprocessing module, an activation value preprocessing module, a weight transform module, an activation value matrix transform module, a dot-product module, a result matrix transform module, an accumulation module, a pooling module, and an activation module, wherein:
the activation value buffer module stores input pixel values or input feature map values, is connected to the controller, and supplies activation data to the activation value preprocessing module;
the weight buffer module stores the trained weights, is connected to the controller, and supplies weight data to the weight preprocessing module;
the output buffer module stores the result of one convolution layer and is connected to the controller; after the activation module finishes outputting data, the data are written into the output buffer module for use by the next convolution layer;
the controller governs the transfer of the activation data, weight data, and convolution-layer data to be processed, according to the computation flow;
the weight preprocessing module receives the data to be processed from the weight buffer module and divides the convolution kernel, obtaining the time-domain weight matrix K;
the activation value preprocessing module receives the data to be processed from the activation value buffer module, fetches the activation values, and divides them, obtaining the time-domain activation value matrix I;
the weight transform module receives the data from the weight preprocessing module and converts the weight data from the time domain to the Winograd domain, obtaining the Winograd-domain weight matrix U;
the activation value matrix transform module receives the data from the activation value preprocessing module and converts the activation values from the time domain to the Winograd domain, obtaining the Winograd-domain activation value matrix V;
the dot-product module receives the data from the weight transform module and the activation value matrix transform module, and performs the element-wise (dot) product of the Winograd-domain activation value matrix and the Winograd-domain weight matrix, obtaining the Winograd-domain dot-product result matrix M;
the result matrix transform module receives the data from the dot-product module and converts the dot-product result matrix from the Winograd domain back to the time domain, obtaining the transformed time-domain dot-product result matrix F;
the accumulation module receives the data from the result matrix transform module and accumulates them to obtain the final convolution result;
the pooling module receives the data from the accumulation module and pools the final convolution result matrix;
the activation module receives the data from the pooling module, applies the ReLU activation function to the pooling result, and transfers the activated result to the output buffer module.
The weight preprocessing module:
(1) expands a 5*5 convolution kernel into a 6*6 convolution matrix by zero padding;
(2) divides the 6*6 convolution matrix into four 3*3 convolution kernels.
The division is performed as follows, where K_input denotes a 5*5 weight matrix and K1, K2, K3, K4 denote the four corresponding divided time-domain weight matrices to be processed. In computing U = G K G^T, K takes the values K1, K2, K3, K4 in turn.
The activation value preprocessing module divides the 6*6 activation value matrix into four overlapping 4*4 matrices, where I_input denotes a 6*6 activation value matrix and I1, I2, I3, I4 denote the divided 4*4 time-domain activation value matrices to be processed. In computing V = B^T I B, I takes the values I1, I2, I3, I4 in turn.
The weight transform module performs the matrix multiplications of the transform through row/column vector additions and subtractions, thereby executing the weight-matrix transform of the Winograd convolution and obtaining the Winograd-domain weight matrix U = G K G^T, where K denotes the time-domain weight matrix, G is the weight-transform auxiliary matrix, and U is the Winograd-domain weight matrix.
Specifically: the first row vector of the weight matrix K is taken as the first row of a temporary matrix C2, where C2 = G K. The factor 1/2 in the transform is implemented by an arithmetic right shift: positive values are right-shifted with 0 padded on the left, and negative values are right-shifted with 1 padded on the left, completing the division by two. The element-wise sum of the first, second, and third rows of K, right-shifted by one bit, is taken as the second row of C2; the first row minus the second row plus the third row of K, right-shifted by one bit, is taken as the third row of C2; and the third row vector of K is taken as the fourth row of C2. The first column vector of C2 is taken as the first column of the Winograd-domain weight matrix U; the sum of the first, second, and third columns of C2, right-shifted by one bit, is taken as the second column of U; the first column minus the second column plus the third column of C2, right-shifted by one bit, is taken as the third column of U; and the third column vector of C2 is taken as the fourth column of U, finally yielding the Winograd-domain weight matrix U.
The activation value matrix transform module performs the matrix multiplications of the transform through row/column vector additions and subtractions, thereby executing the time-domain activation-matrix transform of the Winograd convolution and obtaining the matrix V = B^T I B, where I is the time-domain activation value matrix, B is the activation-transform auxiliary matrix, and V is the Winograd-domain activation value matrix.
Specifically: the first row of the time-domain activation value matrix I minus its third row is taken as the first row of a temporary matrix C1, where C1 = B^T I; the sum of the second and third rows of I is taken as the second row of C1; the third row of I minus the second row is taken as the third row of C1; and the second row of I minus the fourth row is taken as the fourth row of C1. The first column of C1 minus its third column is taken as the first column of the Winograd-domain activation value matrix V; the sum of the second and third columns of C1 is taken as the second column of V; the third column of C1 minus the second column is taken as the third column of V; and the second column of C1 minus the fourth column is taken as the fourth column of V, finally yielding the Winograd-domain activation value matrix V.
The dot-product module performs the element-wise product of the Winograd-domain weight matrix U and the Winograd-domain activation value matrix V to obtain the Winograd-domain dot-product result matrix M, expressed as M = U ⊙ V, where U is the Winograd-domain weight matrix and V is the Winograd-domain activation value matrix. To realize a dot product with configurable data bit width, the dot-product module has two working modes, an 8-bit multiplier mode and a 16-bit multiplier mode, which perform operations with 8-bit and 16-bit data widths respectively, implementing 8*8-bit and 16*16-bit fixed-point multiplication.
The 8-bit multiplier comprises a first gating unit, a first inversion unit, a first shift unit, a first accumulation unit, a second gating unit, a second inversion unit, and a third gating unit connected in sequence, wherein:
the first gating unit receives the data from the weight transform module and the activation value matrix transform module, as well as the sign control signal from the weight transform module;
the first inversion unit receives the data from the first gating unit and inverts the received data;
the first shift unit receives the data from the first inversion unit and the sign-bit information from the first gating unit, and shifts the received data according to the sign information;
the first accumulation unit receives the data from the first shift unit and accumulates them;
the second gating unit receives the data from the first accumulation unit and the sign-bit information from the first gating unit, and passes them to the second inversion unit;
the second inversion unit receives the data from the second gating unit and inverts them;
the third gating unit receives the data from the second inversion unit and the first accumulation unit, and produces the output.
The 16-bit multiplier comprises a fourth gating unit, a third inversion unit, an 8-bit multiplier, a second shift unit, a second accumulation unit, a fifth gating unit, a fourth inversion unit, and a sixth gating unit connected in sequence, wherein:
the fourth gating unit receives the data from the weight transform module and the activation value matrix transform module, as well as the sign control signal from the weight transform module;
the third inversion unit receives the data from the fourth gating unit and inverts them;
the 8-bit multiplier performs 8-bit-wide operations, implementing 8*8-bit fixed-point multiplication;
the second shift unit receives the data from the 8-bit multiplier and shifts them;
the second accumulation unit receives the data from the second shift unit and accumulates them;
the fifth gating unit receives the data from the second accumulation unit and the sign-bit information from the fourth gating unit, and passes them to the fourth inversion unit;
the fourth inversion unit receives the data from the fifth gating unit and inverts them;
the sixth gating unit receives the data from the fourth inversion unit and produces the output.
The result matrix transform module executes the transform F = A^T M A on the Winograd-domain dot-product result matrix M through row/column vector shift, addition, and subtraction operations, where M is the Winograd-domain dot-product result matrix, A is its transform auxiliary matrix, and F is the time-domain dot-product result matrix.
Specifically: the element-wise sum of the first, second, and third rows of the Winograd-domain dot-product result matrix M is taken as the first row of a temporary matrix C3, where C3 = A^T M; the second row of M minus its third and fourth rows is taken as the second row of C3. The sum of the first, second, and third columns of C3 is taken as the first column of the transformed time-domain dot-product result matrix F; the second column of C3 minus its third and fourth columns is taken as the second column of F, finally yielding the transformed time-domain dot-product result matrix F.
In the Winograd-based configurable convolution array accelerator structure of the present invention, a convolution array accelerator with configurable bit width is designed around the operational characteristics of the fixed-paradigm Winograd convolution algorithm, flexibly meeting the bit-width requirements of different neural networks and different convolution layers. In addition, a dedicated multiplier unit with configurable data bit width is designed, improving the computational efficiency of neural network convolution operations and reducing computation power consumption.
Brief Description of the Drawings
Fig. 1 is the overall architecture diagram of the Winograd convolution array accelerator;
Fig. 2 is a schematic diagram of the structure of the Winograd-based configurable convolution array accelerator of the present invention;
Fig. 3 is a schematic diagram of the 8-bit multiplier in the configurable-bit-width design;
Fig. 4 is a schematic diagram of the 16-bit multiplier in the configurable-bit-width design.
Detailed Description of the Embodiments
The Winograd-based configurable convolution array accelerator structure of the present invention is described in detail below with reference to the embodiments and the drawings.
In the convolution computation of a neural network, the Winograd transform formula is
Out = A^T[(G K G^T) ⊙ (B^T I B)]A    (1)
where K denotes the time-domain weight matrix, I denotes the time-domain activation value matrix, and A, G, B denote the transform matrices corresponding to the dot-product result matrix (G K G^T) ⊙ (B^T I B), the time-domain weight matrix K, and the time-domain activation value matrix I, respectively. For the output paradigm used here these are the standard Winograd transform matrices; they are reproduced in the sketch after the three-stage description below.
The Winograd convolution output paradigm used in the present invention is F(2*2, 3*3): the first parameter, 2*2, denotes the size of the output feature map tile, and the second parameter, 3*3, denotes the size of the convolution kernel.
As shown in Fig. 1, the Winograd convolution is executed in three stages. In the first stage, the weight matrix K and the time-domain activation value matrix I read from the buffers are transformed from the time domain into the Winograd domain; the concrete operation is a matrix multiplication, and the results are denoted U and V, where U = G K G^T and V = B^T I B. In the second stage, the element-wise product ⊙ of the Winograd-domain weight matrix U and the Winograd-domain activation value matrix V is computed, giving the Winograd-domain dot-product result matrix M = U ⊙ V. In the third stage, the dot-product result is transformed from the Winograd domain back into the time domain.
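As a concrete illustration of the three stages, the following NumPy sketch uses the standard F(2*2,3*3) transform matrices A, G, and B; the matrices and all names below come from the Winograd convolution literature rather than the patent text (whose figure listing them is not reproduced here), but they are consistent with the row/column operations described in the modules that follow:

```python
# Minimal NumPy sketch of the three-stage F(2*2,3*3) Winograd convolution.
import numpy as np

A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)      # 2x4 output transform A^T
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                   # 4x3 weight transform G
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)     # 4x4 activation transform B^T

def winograd_f2x2_3x3(K, I):
    """K: 3x3 time-domain kernel, I: 4x4 time-domain tile -> 2x2 output tile."""
    U = G @ K @ G.T            # stage 1: weights into the Winograd domain
    V = B_T @ I @ B_T.T        # stage 1: activations into the Winograd domain
    M = U * V                  # stage 2: element-wise (dot) product M = U ⊙ V
    return A_T @ M @ A_T.T     # stage 3: back to the time domain, Out = A^T M A

# Sanity check against the direct 3x3 'valid' convolution of a 4x4 tile.
rng = np.random.default_rng(0)
K = rng.integers(-8, 8, (3, 3)).astype(float)
I = rng.integers(-8, 8, (4, 4)).astype(float)
direct = np.array([[np.sum(I[r:r + 3, c:c + 3] * K) for c in range(2)]
                   for r in range(2)])
assert np.allclose(winograd_f2x2_3x3(K, I), direct)
```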
As shown in Fig. 2, the Winograd-based configurable convolution array accelerator structure of the present invention comprises: an activation value buffer module 1, a weight buffer module 2, an output buffer module 3, a controller 4, a weight preprocessing module 5, an activation value preprocessing module 6, a weight transform module 7, an activation value matrix transform module 8, a dot-product module 9, a result matrix transform module 10, an accumulation module 11, a pooling module 12, and an activation module 13, wherein:
1) the activation value buffer module 1 stores input pixel values or input feature map values, is connected to the controller 4, and supplies activation data to the activation value preprocessing module 6;
2) the weight buffer module 2 stores the trained weights, is connected to the controller 4, and supplies weight data to the weight preprocessing module 5;
3) the output buffer module 3 stores the result of one convolution layer and is connected to the controller 4; after the activation module 13 finishes outputting data, the data are written into the output buffer module 3 for use by the next convolution layer;
4) the controller 4 governs the transfer of the activation data, weight data, and convolution-layer data to be processed, according to the computation flow;
5) the weight preprocessing module 5 receives the data to be processed from the weight buffer module 2 and divides the convolution kernel, obtaining the four time-domain weight matrices K1, K2, K3, K4 to be processed;
The weight preprocessing module 5: (1) expands a 5*5 convolution kernel into a 6*6 convolution matrix by zero padding; (2) divides the 6*6 convolution matrix into four 3*3 convolution kernels. In this way a 5*5 convolution can be realized with the 3*3 Winograd output paradigm, efficiently and without increasing the number of multiplications or the power consumption.
The division is performed as follows (see the sketch below), where K_input denotes a 5*5 time-domain input weight matrix that is zero-padded to a 6*6 time-domain weight matrix and divided into the four corresponding 3*3 time-domain weight matrices K1, K2, K3, K4 to be processed. In computing U = G K G^T, K takes the values K1, K2, K3, K4 in turn.
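A minimal sketch of this preprocessing step is given below. The non-overlapping quadrant layout of K1..K4 is an assumption (it is the only way to tile a 6*6 matrix exactly with four 3*3 blocks), since the patent figure showing the division is not reproduced in this text:

```python
# Sketch of the weight preprocessing: zero-pad a 5x5 kernel to 6x6 and split
# it into four 3x3 sub-kernels K1..K4 (assumed quadrant layout).
import numpy as np

def split_kernel_5x5(K_input):
    """K_input: 5x5 NumPy kernel -> list of four 3x3 sub-kernels [K1..K4]."""
    K6 = np.zeros((6, 6), dtype=K_input.dtype)
    K6[:5, :5] = K_input                              # zero padding
    return [K6[r:r + 3, c:c + 3] for r in (0, 3) for c in (0, 3)]
```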
6) the activation value preprocessing module 6 receives the data to be processed from the activation value buffer module 1, fetches the activation values, and divides them, obtaining the time-domain activation value matrices I1, I2, I3, I4 to be processed. In computing V = B^T I B, I takes the values I1, I2, I3, I4 in turn.
The activation value preprocessing module 6 reads the activation values and preprocesses them. In the Winograd algorithm the activation values must correspond to the weights, and much of the data is reused, so the division is overlapped. The module divides the 6*6 activation value matrix into four overlapping 4*4 matrices corresponding to the four 3*3 convolution kernels (see the sketch below), where I_input denotes a 6*6 time-domain input activation value matrix and I1, I2, I3, I4 denote the divided 4*4 time-domain activation value matrices to be processed. In computing V = B^T I B, I takes the values I1, I2, I3, I4 in turn.
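A corresponding sketch of the overlapped division follows; the stride-2 offsets are an assumption, chosen because they are the natural way for four overlapping 4*4 tiles to cover a 6*6 matrix exactly:

```python
# Sketch of the activation preprocessing: split a 6x6 tile into four
# overlapping 4x4 tiles I1..I4 (assumed stride-2 offsets, so neighbouring
# tiles share two rows/columns of reused data).
def split_activation_6x6(I_input):
    """I_input: 6x6 NumPy tile -> list of four 4x4 tiles [I1, I2, I3, I4]."""
    return [I_input[r:r + 4, c:c + 4] for r in (0, 2) for c in (0, 2)]
```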
7) the weight transform module 7 receives the data to be processed from the weight preprocessing module 5 and converts the weight data from the time domain to the Winograd domain, obtaining the Winograd-domain weight matrix U;
The weight transform module 7 performs the matrix multiplications of the transform through row/column vector additions and subtractions, thereby executing the weight-matrix transform of the Winograd convolution and obtaining the Winograd-domain weight matrix U = G K G^T, where K denotes the time-domain weight matrix, G is the weight-transform auxiliary matrix, and U is the Winograd-domain weight matrix.
Specifically: the first row vector of the time-domain weight matrix K is taken as the first row of a temporary matrix C2, where C2 = G K. Because the value 1/2 occurs in the transform matrix, the division by two is completed simply by an arithmetic right shift: positive values are right-shifted with 0 padded on the left, and negative values are right-shifted with 1 padded on the left. The element-wise sum of the first, second, and third rows of K, right-shifted by one bit, is taken as the second row of C2; the first row minus the second row plus the third row of K, right-shifted by one bit, is taken as the third row of C2; and the third row vector of K is taken as the fourth row of C2. The first column vector of C2 is taken as the first column of the Winograd-domain weight matrix U; the sum of the first, second, and third columns of C2, right-shifted by one bit, is taken as the second column of U; the first column minus the second column plus the third column of C2, right-shifted by one bit, is taken as the third column of U; and the third column vector of C2 is taken as the fourth column of U, finally yielding the Winograd-domain weight matrix U (a sketch follows).
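The following sketch mirrors this multiplier-free datapath, assuming integer fixed-point data so that NumPy's arithmetic right shift (0-padding for positives, 1-padding for negatives) reproduces the described division by two:

```python
# Sketch of the weight transform U = G K G^T built only from row/column
# additions, subtractions and arithmetic right shifts (no multipliers).
import numpy as np

def weight_transform(K):
    """K: 3x3 integer weight matrix -> U: 4x4 Winograd-domain weight matrix."""
    C2 = np.empty((4, 3), dtype=np.int64)             # C2 = G K
    C2[0] = K[0]                                      # row 1 of K
    C2[1] = (K[0] + K[1] + K[2]) >> 1                 # (r1 + r2 + r3) / 2
    C2[2] = (K[0] - K[1] + K[2]) >> 1                 # (r1 - r2 + r3) / 2
    C2[3] = K[2]                                      # row 3 of K
    U = np.empty((4, 4), dtype=np.int64)              # U = C2 G^T
    U[:, 0] = C2[:, 0]
    U[:, 1] = (C2[:, 0] + C2[:, 1] + C2[:, 2]) >> 1
    U[:, 2] = (C2[:, 0] - C2[:, 1] + C2[:, 2]) >> 1
    U[:, 3] = C2[:, 2]
    return U
```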
8) the activation value matrix transform module 8 receives the data to be processed from the activation value preprocessing module 6 and converts the activation values from the time domain to the Winograd domain, obtaining the Winograd-domain activation value matrix V;
The activation value matrix transform module 8 performs the matrix multiplications of the transform through row/column vector additions and subtractions, thereby executing the time-domain activation-matrix transform of the Winograd convolution and obtaining the Winograd-domain activation value matrix V = B^T I B, where I is the time-domain activation value matrix, B is the activation-transform auxiliary matrix, and V is the Winograd-domain activation value matrix.
Specifically: the first row of the time-domain activation value matrix I minus its third row is taken as the first row of a temporary matrix C1, where C1 = B^T I; the sum of the second and third rows of I is taken as the second row of C1; the third row of I minus the second row is taken as the third row of C1; and the second row of I minus the fourth row is taken as the fourth row of C1. The first column of C1 minus its third column is taken as the first column of the Winograd-domain activation value matrix V; the sum of the second and third columns of C1 is taken as the second column of V; the third column of C1 minus the second column is taken as the third column of V; and the second column of C1 minus the fourth column is taken as the fourth column of V, finally yielding the Winograd-domain activation value matrix V (a sketch follows).
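A sketch of the corresponding add/subtract datapath:

```python
# Sketch of the activation transform V = B^T I B built only from row/column
# additions and subtractions (no multiplications at all).
import numpy as np

def activation_transform(I):
    """I: 4x4 time-domain activation tile -> V: 4x4 Winograd-domain matrix."""
    C1 = np.empty((4, 4), dtype=np.int64)             # C1 = B^T I
    C1[0] = I[0] - I[2]                               # row 1 - row 3
    C1[1] = I[1] + I[2]                               # row 2 + row 3
    C1[2] = I[2] - I[1]                               # row 3 - row 2
    C1[3] = I[1] - I[3]                               # row 2 - row 4
    V = np.empty((4, 4), dtype=np.int64)              # V = C1 B
    V[:, 0] = C1[:, 0] - C1[:, 2]
    V[:, 1] = C1[:, 1] + C1[:, 2]
    V[:, 2] = C1[:, 2] - C1[:, 1]
    V[:, 3] = C1[:, 1] - C1[:, 3]
    return V
```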
9) the dot-product module 9 receives the data to be processed from the weight transform module 7 and the activation value matrix transform module 8 and performs the element-wise product of the Winograd-domain activation value matrix and the Winograd-domain weight matrix, obtaining the Winograd-domain dot-product result matrix M; this is also the module that consumes the most computation time and resources in the convolution;
The dot-product module 9 performs the element-wise product of the Winograd-domain weight matrix U and the Winograd-domain activation value matrix V to obtain the Winograd-domain dot-product result matrix M, expressed as M = U ⊙ V, where U is the Winograd-domain weight matrix and V is the Winograd-domain activation value matrix. To realize a dot product with configurable data bit width, the module has two working modes, an 8-bit multiplier mode and a 16-bit multiplier mode, which perform operations with 8-bit and 16-bit data widths respectively, implementing 8*8-bit and 16*16-bit fixed-point multiplication, wherein:
(1) As shown in Fig. 3, the 8-bit multiplier comprises a first gating unit 14, a first inversion unit 15, a first shift unit 16, a first accumulation unit 17, a second gating unit 18, a second inversion unit 19, and a third gating unit 20 connected in sequence, wherein:
the first gating unit 14 receives the data from the weight transform module 7 and the activation value matrix transform module 8, as well as the sign control signal from the weight transform module 7;
the first inversion unit 15 receives the data from the first gating unit 14 and inverts the received data;
the first shift unit 16 receives the data from the first inversion unit 15 and the sign-bit information from the first gating unit 14, and shifts the received data according to the sign information;
the first accumulation unit 17 receives the data from the first shift unit 16 and accumulates them;
the second gating unit 18 receives the data from the first accumulation unit 17 and the sign-bit information from the first gating unit 14, and passes them to the second inversion unit 19;
the second inversion unit 19 receives the data from the second gating unit 18 and inverts them;
the third gating unit 20 receives the data from the second inversion unit 19 and the first accumulation unit 17, and produces the output.
Specific operation of the 8-bit multiplier: the sign bits of the two multiplicands are XORed to give the sign bit of the result. Each multiplicand is then sign-resolved: if negative, the sign bit is removed and the remaining seven bits are inverted and incremented by one; if positive, the lower seven bits are kept unchanged. For the sign-resolved multiplicand A1, each binary bit of multiplicand B1 is examined: if the bit is 1, the corresponding partial product is the seven-bit magnitude of A1 shifted left to that bit position; if it is 0, the partial product is an 8-bit zero. After the lower seven bits of B1 have been examined, all partial products are summed to give the product H2. The result sign bit then determines whether the product must be negated: if the sign bit is 1, H2 is inverted and incremented by one; if it is 0, it is left unchanged, giving the product H3. Finally, the result sign bit is placed in the sign position of H3 to give the final result. An unsigned 8-bit multiply ignores the sign bits and obtains its result directly by shifting and adding according to the eight bits of B1 (a behavioral sketch follows).
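A behavioral sketch of this sign-magnitude shift-and-add scheme is given below; it models the arithmetic only, not the gate-level gating/inversion/shift/accumulation units, and the function name is ours:

```python
# Behavioral model of the signed 8-bit shift-and-add multiplier: XOR the sign
# bits, multiply the 7-bit magnitudes by conditional shift-and-add, then
# two's-complement negate the product when the result sign bit is 1.
def mul8(a, b):
    """Signed 8-bit multiply (-128 < a, b < 128) -> signed product."""
    sign = (a < 0) ^ (b < 0)           # XOR of the two sign bits
    mag_a, mag_b = abs(a), abs(b)      # invert-and-add-1 for negative inputs
    acc = 0
    for i in range(7):                 # scan the 7 magnitude bits of b
        if (mag_b >> i) & 1:           # bit set -> partial product a << i
            acc += mag_a << i
    return -acc if sign else acc       # invert and add 1 when the sign is 1

assert mul8(-23, 45) == -23 * 45 and mul8(-7, -9) == 63
```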
(2) As shown in Fig. 4, the 16-bit multiplier comprises a fourth gating unit 21, a third inversion unit 22, an 8-bit multiplier 23, a second shift unit 24, a second accumulation unit 25, a fifth gating unit 26, a fourth inversion unit 27, and a sixth gating unit 28 connected in sequence, wherein:
the fourth gating unit 21 receives the data from the weight transform module 7 and the activation value matrix transform module 8, as well as the sign control signal from the weight transform module 7;
the third inversion unit 22 receives the data from the fourth gating unit 21 and inverts them;
the 8-bit multiplier 23 performs 8-bit-wide operations, implementing 8*8-bit fixed-point multiplication;
the second shift unit 24 receives the data from the 8-bit multiplier 23 and shifts them;
the second accumulation unit 25 receives the data from the second shift unit 24 and accumulates them;
the fifth gating unit 26 receives the data from the second accumulation unit 25 and the sign-bit information from the fourth gating unit 21, and passes them to the fourth inversion unit 27;
the fourth inversion unit 27 receives the data from the fifth gating unit 26 and inverts them;
the sixth gating unit 28 receives the data from the fourth inversion unit 27 and produces the output.
The 16-bit multiplier is realized with four 8-bit multiplier units whose gating signal is 0, i.e., operating as unsigned multipliers. First, the sign of each 16-bit multiplicand is resolved: a positive value is kept unchanged, while a negative value is inverted and incremented by one. Next, each resolved 16-bit value is split into a high byte and a low byte, and the bytes are multiplied pairwise. The product of the two high bytes is shifted left by 16 bits; the two cross products (the high byte of one multiplicand times the low byte of the other, and vice versa) are added and shifted left by 8 bits; these are then added together with the product of the two low bytes to give the product L. Finally, the result sign bit (the XOR of the operand signs) determines whether L must be negated: if it is 1, L is inverted and incremented by one; if it is 0, L is unchanged; the sign bit is placed in the leading position of L to give the final output (a behavioral sketch follows).
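A behavioral sketch of the 16-bit mode follows, again modeling the arithmetic only, with byte names of our own choosing:

```python
# Behavioral model of the 16-bit multiply built from four unsigned 8x8
# multiplies: high*high << 16, the two cross products << 8, plus low*low,
# with the sign (XOR of the operand signs) applied at the end.
def mul16(a, b):
    """Signed 16-bit multiply (-32768 < a, b < 32768) via four 8x8 products."""
    sign = (a < 0) ^ (b < 0)
    ma, mb = abs(a), abs(b)
    ah, al = ma >> 8, ma & 0xFF        # high/low bytes of |a|
    bh, bl = mb >> 8, mb & 0xFF        # high/low bytes of |b|
    prod = (ah * bh << 16) + ((ah * bl + al * bh) << 8) + al * bl
    return -prod if sign else prod     # invert and add 1 when the sign is 1

assert mul16(-1234, 567) == -1234 * 567
```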
10) the result matrix transform module 10 receives the data to be processed from the dot-product module 9 and converts the dot-product result matrix from the Winograd domain to the time domain, obtaining the transformed time-domain dot-product result matrix F;
The result matrix transform module 10 executes the transform F = A^T M A on the Winograd-domain dot-product result matrix M through row/column vector shift, addition, and subtraction operations, where M is the Winograd-domain dot-product result matrix, A is its transform auxiliary matrix, and F is the time-domain dot-product result matrix.
Specifically: the element-wise sum of the first, second, and third rows of the Winograd-domain dot-product result matrix M is taken as the first row of a temporary matrix C3, where C3 = A^T M; the second row of M minus its third and fourth rows is taken as the second row of C3. The sum of the first, second, and third columns of C3 is taken as the first column of the transformed time-domain dot-product result matrix F; the second column of C3 minus its third and fourth columns is taken as the second column of F, finally yielding the time-domain dot-product result matrix F (a sketch follows).
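A sketch of this inverse transform:

```python
# Sketch of the result transform F = A^T M A, again using only row/column
# additions and subtractions.
import numpy as np

def result_transform(M):
    """M: 4x4 Winograd-domain dot-product matrix -> F: 2x2 time-domain tile."""
    C3 = np.empty((2, 4), dtype=M.dtype)              # C3 = A^T M
    C3[0] = M[0] + M[1] + M[2]                        # r1 + r2 + r3
    C3[1] = M[1] - M[2] - M[3]                        # r2 - r3 - r4
    F = np.empty((2, 2), dtype=M.dtype)               # F = C3 A
    F[:, 0] = C3[:, 0] + C3[:, 1] + C3[:, 2]
    F[:, 1] = C3[:, 1] - C3[:, 2] - C3[:, 3]
    return F
```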
11) the accumulation module 11 receives the data to be processed from the result matrix transform module 10 and accumulates them to obtain the final convolution result, a 2*2 result matrix;
12) the pooling module 12 receives the data to be processed from the accumulation module 11 and pools the final convolution result matrix; different pooling methods may be used, including maximum, average, and minimum pooling of the input neurons. Since the result matrix finally output by the Winograd convolution F(2*2,3*3) is of size 2*2, a 2*2 pooling operation can be applied directly, and the pooled value is obtained through three comparisons: first the two numbers in the first row of the result matrix are compared, then the two numbers in the second row, and finally the winners of the two previous comparisons are compared, giving the maximum-pooling result of the result matrix (a sketch follows);
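A minimal sketch of the three-comparison maximum pooling:

```python
# Sketch of the 2x2 max pooling: each F(2*2,3*3) output tile is already 2x2,
# so the pooled value follows from exactly three pairwise comparisons.
def maxpool_2x2(F):
    """F: 2x2 result tile (indexable as F[r][c]) -> pooled maximum."""
    top = F[0][0] if F[0][0] > F[0][1] else F[0][1]       # compare row 1
    bottom = F[1][0] if F[1][0] > F[1][1] else F[1][1]    # compare row 2
    return top if top > bottom else bottom                # compare winners
```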
13) the activation module 13 receives the data to be processed from the pooling module 12, applies the ReLU activation function to the pooling result, and transfers the activated result to the output buffer module 3.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910511987.6A CN110288086B (en) | 2019-06-13 | 2019-06-13 | A Configurable Convolution Array Accelerator Structure Based on Winograd |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910511987.6A CN110288086B (en) | 2019-06-13 | 2019-06-13 | A Configurable Convolution Array Accelerator Structure Based on Winograd |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110288086A CN110288086A (en) | 2019-09-27 |
CN110288086B true CN110288086B (en) | 2023-07-21 |
Family
ID=68004097
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910511987.6A Expired - Fee Related CN110288086B (en) | 2019-06-13 | 2019-06-13 | A Configurable Convolution Array Accelerator Structure Based on Winograd |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110288086B (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112766473B (en) * | 2019-11-01 | 2023-12-05 | 中科寒武纪科技股份有限公司 | Computing device and related product |
CN112765538B (en) * | 2019-11-01 | 2024-03-29 | 中科寒武纪科技股份有限公司 | Data processing method, device, computer equipment and storage medium |
CN111325332B (en) * | 2020-02-18 | 2023-09-08 | 百度在线网络技术(北京)有限公司 | Convolutional neural network processing method and device |
US12158923B2 (en) * | 2020-03-20 | 2024-12-03 | Samsung Electronics Co., Ltd. | Low overhead implementation of Winograd for CNN with 3x3, 1x3 and 3x1 filters on weight station dot-product based CNN accelerators |
WO2021232422A1 (en) * | 2020-05-22 | 2021-11-25 | 深圳市大疆创新科技有限公司 | Neural network arithmetic device and control method thereof |
CN112580793B (en) * | 2020-12-24 | 2022-08-12 | 清华大学 | Neural Network Accelerator and Acceleration Method Based on Time Domain In-Memory Computing |
CN112734827B (en) * | 2021-01-07 | 2024-06-18 | 京东鲲鹏(江苏)科技有限公司 | Target detection method and device, electronic equipment and storage medium |
CN112862091B (en) * | 2021-01-26 | 2022-09-27 | 合肥工业大学 | A Resource Multiplexing Neural Network Hardware Acceleration Circuit Based on Fast Convolution |
CN112949845B (en) * | 2021-03-08 | 2022-08-09 | 内蒙古大学 | Deep convolutional neural network accelerator based on FPGA |
CN115081599A (en) * | 2021-03-11 | 2022-09-20 | 安徽寒武纪信息科技有限公司 | Method for preprocessing Winograd convolution, computer readable storage medium and device |
CN113269302A (en) * | 2021-05-11 | 2021-08-17 | 中山大学 | Winograd processing method and system for 2D and 3D convolutional neural networks |
CN113407904B (en) * | 2021-06-09 | 2023-04-07 | 中山大学 | Winograd processing method, system and medium compatible with multi-dimensional convolutional neural network |
CN113283591B (en) * | 2021-07-22 | 2021-11-16 | 南京大学 | High-efficiency convolution implementation method and device based on Winograd algorithm and approximate multiplier |
CN113554163B (en) * | 2021-07-27 | 2024-03-29 | 深圳思谋信息科技有限公司 | Convolutional neural network accelerator |
CN113656751B (en) * | 2021-08-10 | 2024-02-27 | 上海新氦类脑智能科技有限公司 | Method, apparatus, device and medium for realizing signed operation by unsigned DAC |
CN114399036B (en) * | 2022-01-12 | 2023-08-22 | 电子科技大学 | An Efficient Convolution Computing Unit Based on One-Dimensional Winograd Algorithm |
CN114565090A (en) * | 2022-02-28 | 2022-05-31 | 上海阵量智能科技有限公司 | Data processing circuit, data processing method, storage medium, chip and electronic device |
CN115204373A (en) * | 2022-08-05 | 2022-10-18 | 广东工业大学 | A Design Method of Fast Convolution and Cache Mode for Convolutional Neural Networks |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103793199A (en) * | 2014-01-24 | 2014-05-14 | 天津大学 | Rapid RSA cryptography coprocessor capable of supporting dual domains |
CN107862374A (en) * | 2017-10-30 | 2018-03-30 | 中国科学院计算技术研究所 | Processing with Neural Network system and processing method based on streamline |
CN109190756A (en) * | 2018-09-10 | 2019-01-11 | 中国科学院计算技术研究所 | Arithmetic unit based on Winograd convolution and the neural network processor comprising the device |
CN109190755A (en) * | 2018-09-07 | 2019-01-11 | 中国科学院计算技术研究所 | Matrix conversion device and method towards neural network |
CN109325591A (en) * | 2018-09-26 | 2019-02-12 | 中国科学院计算技术研究所 | A neural network processor for Winograd convolution |
CN109359730A (en) * | 2018-09-26 | 2019-02-19 | 中国科学院计算技术研究所 | A neural network processor for fixed output paradigm Winograd convolution |
CN109447241A (en) * | 2018-09-29 | 2019-03-08 | 西安交通大学 | A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9805303B2 (en) * | 2015-05-21 | 2017-10-31 | Google Inc. | Rotating data for neural network computations |
CN107239824A (en) * | 2016-12-05 | 2017-10-10 | 北京深鉴智能科技有限公司 | Apparatus and method for realizing sparse convolution neutral net accelerator |
- 2019-06-13 CN CN201910511987.6A patent/CN110288086B/en not_active Expired - Fee Related
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103793199A (en) * | 2014-01-24 | 2014-05-14 | 天津大学 | Rapid RSA cryptography coprocessor capable of supporting dual domains |
CN107862374A (en) * | 2017-10-30 | 2018-03-30 | 中国科学院计算技术研究所 | Processing with Neural Network system and processing method based on streamline |
CN109190755A (en) * | 2018-09-07 | 2019-01-11 | 中国科学院计算技术研究所 | Matrix conversion device and method towards neural network |
CN109190756A (en) * | 2018-09-10 | 2019-01-11 | 中国科学院计算技术研究所 | Arithmetic unit based on Winograd convolution and the neural network processor comprising the device |
CN109325591A (en) * | 2018-09-26 | 2019-02-12 | 中国科学院计算技术研究所 | A neural network processor for Winograd convolution |
CN109359730A (en) * | 2018-09-26 | 2019-02-19 | 中国科学院计算技术研究所 | A neural network processor for fixed output paradigm Winograd convolution |
CN109447241A (en) * | 2018-09-29 | 2019-03-08 | 西安交通大学 | A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field |
Non-Patent Citations (4)
Title |
---|
A High-efficiency FPGA-based Accelerator for Convolutional Neural Networks using Winograd Algorithm; Y Huang et al.; Journal of Physics: Conference Series; 20181231; full text *
EFFICIENT WINOGRAD CONVOLUTION VIA INTEGER ARITHMETIC; Lingchuan Meng et al.; arXiv; 20190107; full text *
Fast Algorithms for Convolutional Neural Networks; Andrew Lavin et al.; 2016 IEEE Conference on Computer Vision and Pattern Recognition; 20161231; full text *
SpWA: An Efficient Sparse Winograd Convolutional Neural Networks Accelerator on FPGAs; Liqiang Lu et al.; ACM; 20181231; full text *
Also Published As
Publication number | Publication date |
---|---|
CN110288086A (en) | 2019-09-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110288086B (en) | A Configurable Convolution Array Accelerator Structure Based on Winograd | |
CN109325591B (en) | A neural network processor for Winograd convolution | |
CN107844826B (en) | Neural network processing unit and processing system comprising same | |
CN111459877B (en) | Winograd YOLOv2 target detection model method based on FPGA acceleration | |
CN107862374B (en) | Neural network processing system and processing method based on assembly line | |
CN109543830B (en) | Splitting accumulator for convolutional neural network accelerator | |
CN108427990B (en) | Neural network computing system and method | |
US20210357736A1 (en) | Deep neural network hardware accelerator based on power exponential quantization | |
CN111832719A (en) | A Fixed-Point Quantized Convolutional Neural Network Accelerator Computing Circuit | |
CN109359730B (en) | A neural network processor for fixed output paradigm Winograd convolution | |
CN107832082A (en) | A kind of apparatus and method for performing artificial neural network forward operation | |
CN111898733A (en) | Deep separable convolutional neural network accelerator architecture | |
US11561795B2 (en) | Accumulating data values and storing in first and second storage devices | |
CN109993293B (en) | A Deep Learning Accelerator for Stacked Hourglass Networks | |
US20230259743A1 (en) | Neural network accelerator with configurable pooling processing unit | |
CN112257844B (en) | A Convolutional Neural Network Accelerator Based on Mixed Precision Configuration and Its Implementation | |
CN115982528A (en) | Approximate precoding convolution operation method and system based on Booth algorithm | |
CN115018062A (en) | An FPGA-based Convolutional Neural Network Accelerator | |
CN110765413A (en) | Matrix summation structure and neural network computing platform | |
CN115879530A (en) | A method for array structure optimization of RRAM in-memory computing system | |
Duan et al. | Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights | |
CN116611488A (en) | Vector processing unit, neural network processor and depth camera | |
CN115357214A (en) | An arithmetic unit compatible with asymmetric multi-precision mixed multiply-accumulate operations | |
JP2020098469A (en) | Arithmetic processing device and method for controlling arithmetic processing device | |
CN110673802B (en) | Data storage method and device, chip, electronic equipment and board card |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20230721 |