
WO2020258529A1 - Bnrp-based configurable parallel general convolutional neural network accelerator - Google Patents

Bnrp-based configurable parallel general convolutional neural network accelerator Download PDF

Info

Publication number
WO2020258529A1
WO2020258529A1 (PCT/CN2019/105534)
Authority
WO
WIPO (PCT)
Prior art keywords
mode
pooling
data
parameters
calculation
Prior art date
Application number
PCT/CN2019/105534
Other languages
French (fr)
Chinese (zh)
Inventor
陆生礼
范雪梅
庞伟
刘昊
舒程昊
付成龙
Original Assignee
东南大学 (Southeast University)
Priority date
Filing date
Publication date
Application filed by 东南大学 (Southeast University)
Publication of WO2020258529A1 publication Critical patent/WO2020258529A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The invention discloses a configurable parallel general convolutional neural network accelerator based on BNRP, which belongs to the technical field of calculation, inference and counting.
  • DNN Deep Neural Network
  • CNN Convolutional Neural Network
  • As practical application scenarios grow more complex and demand higher accuracy, the network topology of neural networks keeps changing and, accordingly, the network scale has expanded dramatically; examples include the Baidu Brain with 100 billion neuron connections and Google's cat-recognizing system with 1 billion neuron connections. Therefore, how to realize large-scale deep learning neural network models with low power consumption and high speed through computational acceleration and advanced technology has become an important problem in the fields of machine learning and artificial intelligence.
  • Deep neural networks not only require a large amount of computation but also need to store millions or even hundreds of millions of network parameters; real-time detection and recognition based on deep neural networks is therefore currently carried out mainly on high-performance multi-core CPUs (Central Processing Unit) and GPUs (Graphic Processing Unit). However, for robots, consumer electronics, smart cars and other mobile devices with limited power, size and cost budgets, it is almost impossible to port complex and diverse convolutional neural network models onto a CPU or GPU. Building a flexibly configurable, high-performance, low-power general-purpose hardware accelerator from general-purpose devices can therefore meet the large computing and storage requirements of convolutional neural networks.
  • CPU Central Processing Unit
  • GPU Graphic Processing Unit
  • FPGA field-programmable gate array
  • Although an ASIC has the disadvantages of a long development cycle, high cost and low flexibility, it is customized and therefore outperforms GPUs and FPGAs in both performance and power consumption.
  • The performance of the TPU series ASIC AI chip released by Google in 2016 is 14 to 16 times that of a traditional GPU, and the performance of the NPU released by Vimicro is 118 times that of a GPU.
  • Applying an FPGA or ASIC to the mobile work platform, and designing a configurable general-purpose convolutional neural network hardware accelerator around a systolic convolution array and a highly parallel pipeline that achieve high computing throughput with only moderate storage and communication bandwidth, is an effective solution.
  • The purpose of the present invention is to address the deficiencies of the above background art and to provide a configurable parallel general convolutional neural network accelerator based on BNRP, which supports accelerated computation for convolutional neural network structures of various scales, offers good versatility, places low demands on on-chip storage resources and I/O bandwidth, improves computing parallelism and throughput, and solves the technical problem that the limited on-chip storage and I/O bandwidth of existing hardware accelerators cannot meet the high-throughput computing requirements of convolutional neural networks.
  • A BNRP-based configurable parallel general convolutional neural network accelerator includes: a mode configurator, a parallel computing acceleration unit (convolution calculator and BNRP calculator), a data cache unit (input and output feature map caches and weight parameter cache), a data communication unit (AXI4 bus interface and AHB bus interface), and a data compression encoder/decoder.
  • The input feature map data In_Map, the weight parameters and the BN parameters are compression-encoded by the data compression encoder/decoder after entering through the AXI4 bus interface of the data communication unit, and are then buffered in the corresponding In_Map Buffer, weight buffer and BN parameter buffer. The accelerator's calculation-mode and function configuration information is transmitted to the mode configurator through the AHB bus interface of the data communication unit; the mode configurator configures the calculation modes and functions of the parallel computing acceleration unit according to the received configuration information. After reading data from the In_Map Buffer, the weight buffer and the BN parameter buffer, the parallel computing acceleration unit performs the corresponding convolution, batch normalization, nonlinear activation or pooling operations layer by layer, row by row, column by column and channel by channel in a parallel pipelined manner according to the configuration parameters. After each network layer has extracted its features, the output feature map data is sent back to the data compression encoder/decoder for decoding and then returned to the accelerator's external data storage unit through the AXI4 bus interface.
  • The network configuration information read by the mode configurator from the AHB bus interface, such as the network layer of the data currently being processed, the network model parameters and the buffer-data read/write addresses, is cached in the data buffer area of the convolution calculator; the calculation-mode and function configuration parameters read by the mode configurator from the AHB bus interface, i.e. whether to perform the batch normalization (BN), nonlinear activation (ReLu), pooling and data compression encoding/decoding operations, together with the calculation-mode configuration parameters, are transferred to the BNRP calculator.
  • BN batch normalization
  • ReLu non-linear activation
  • Pooling
  • The BNRP calculator executes batch normalization (BN), nonlinear activation (ReLu) or four kinds of pooling operations in parallel in a pipelined manner; according to flag bits it can be configured to execute one or several of these operations, and it executes the corresponding calculation mode according to the configuration parameters. Mode 1: after the BN operation, the pooling operation is executed first and then the ReLu operation. Mode 2: after the BN operation, the ReLu operation is executed first and then the pooling operation.
  • When the input feature map size map_size > R and the configuration requires a pooling operation, the BNRP calculator, according to the network model, the number of rows R of the systolic convolution array and the configuration parameters, interleaves and caches m rows of input feature map data into 2m on-chip Block RAMs.
  • Part of the 2*R*T poolers are enabled according to the configuration information and the others are turned off; the "2*2 pooler" executes the 2*2 AP or 2*2 MP operation according to the configuration parameters, and the "3*3 pooler" executes the 3*3 AP or 3*3 MP operation; there are R*T poolers of each kind, numbered sequentially (1, 2, 3, ..., R*T), and when S = 2 the odd-numbered poolers are enabled.
  • In the convolution calculation array and the BNRP calculator, if the configuration requires the BN operation, then before the ReLu operation the feature map data map[i][j] and the BN weight parameters a[i][j] and b[i][j] are first compared against 0; if map[i][j] ≤ 0, a[i][j] ≥ 0 and b[i][j] ≤ 0, the convolution calculation array does not need to multiply map[i][j] by a[i][j] or to add b[i][j]: the corresponding output value of the BN operation in BNRP calculator mode 1 is 0, and the corresponding output values of the BN and ReLu operations in BNRP calculator mode 2 are both 0.
  • The present invention designs the BNRP calculator in a parallel pipelined manner and reduces the computation load of the neural network accelerator by dynamically configuring the parameters of the parallel calculators, in particular the execution mode of the BNRP calculator; especially for convolutional neural networks with large network structures, this greatly accelerates the computation of the accelerator while reducing repeated calculations and hence power consumption. The convolution calculation array is designed on a systolic array architecture, achieving high computing throughput with only moderate storage and I/O communication bandwidth, effectively improving the data reuse rate and further reducing data transmission time.
  • The calculation execution mode of the BNRP calculator can be dynamically configured according to the characteristics of the network structure, which makes the accelerator more versatile and no longer constrained by the network model structure and the number of layers; unnecessary intermediate-value caching is also eliminated, reducing the use of memory resources.
  • Figure 1 is a schematic structural diagram of the accelerator disclosed in the present invention.
  • Figure 2 is a schematic structural diagram of the BNRP calculator of the present invention.
  • Figure 3 is a schematic diagram of the working process of the BNRP calculator of the present invention.
  • Figure 4 is a schematic diagram of the 3*3 pooler of the present invention performing a pooling operation.
  • The configurable parallel general convolutional neural network accelerator based on BNRP disclosed in the present invention is shown in Figure 1 and includes: a parallel computing acceleration unit composed of the mode configurator, the convolution calculator and the BNRP calculator; a data cache unit composed of the input and output feature map caches and the weight parameter cache; a data communication unit composed of the AXI4 bus interface and the AHB bus interface; and the data compression encoder/decoder.
  • The working states of the accelerator include a read-configuration-parameters state, a read-data state, a calculation state and a send-data state.
  • The mode configurator reads the mode configuration parameters from outside the accelerator through the AHB bus; whether the BN, ReLu or pooling operation is to be performed, together with configuration information such as the execution mode, the number of network layers and the feature map size, is transmitted to the BNRP calculator; information such as the number of network layers, the feature map size and batch, and the convolution kernel size is transmitted to the data buffer area of the convolution calculator; configuration information such as the number of network layers, the data read/write enables and the addresses is transmitted to the data compression encoder/decoder.
  • After reading the data read-enable and address signals, the data compression encoder/decoder reads the corresponding weight parameters (convolution kernels and biases) from outside the accelerator through the AXI4 bus and transmits them to the weight parameter buffer area, and reads the corresponding input feature map data and transmits it to the In_Map Buffer.
  • After receiving the calculation enable signal, the convolution calculator reads the number of network layers, the feature map size and batch, and the convolution kernel size from the data buffer area, and reads the weight parameters and input feature map data in a systolic manner to perform the corresponding convolution calculation; after the calculation is completed, it outputs the end-flag information to the BNRP calculator and outputs the convolution results to the Out_Map Buffer.
  • After receiving the mode configuration parameters, the BNRP calculator waits for the calculation-completion flag sent by the convolution calculator; if the configuration requires the BN operation, it initiates a BN parameter read request and reads the corresponding BN parameters from the BN parameter buffer; otherwise, no BN operation is performed.
  • The BNRP calculator determines the calculation mode to be executed according to the configuration information. If execution mode 1 is configured, the pooling operation is executed first: according to the received network model parameters (pooling stride) and the feature map size, the feature map input pixel values that need to be cached are sent to the corresponding Block RAMs, the corresponding poolers are enabled, and the ReLu operation is executed after the pooling calculation is completed; if execution mode 2 is configured, the ReLu operation is executed first.
  • The maximum pooler computes OMap[c][i][j] = max{ IMap[c][S*i+p][S*j+q] : 0 ≤ p, q ≤ k }, and the average pooler replaces the maximum with the window sum divided by (k+1)^2; k = 1, 2 denotes the pooler size (a (k+1)*(k+1) window), IMap denotes an input feature map pixel value, OMap denotes an output feature map pixel value, and OMap[c][i][j] denotes the pixel value at row i, column j of the c-th output feature map.
  • The first convolution pass outputs rows 1, 2, 3, 4, 5 and 6 of the feature map to the corresponding BlockRAM1, BlockRAM2, BlockRAM3, BlockRAM4, BlockRAM5 and BlockRAM6, additionally caches row 5 to BlockRAM5B and row 6 to BlockRAM6B, and enables poolers 1C, 3 and 5. The first output value of pooler 1C is invalid; pooler 3 performs the three-row pooling calculation over R1, R2 and R3 and outputs row 1 of Out_Map; pooler 5 performs the three-row pooling calculation over R3, R4 and R5 and outputs row 2 of Out_Map.
  • The second convolution pass outputs rows 7, 8, 9, 10, 11 and 12 of the feature map to the corresponding BlockRAM1 through BlockRAM6, additionally caches row 11 to BlockRAM5B and row 12 to BlockRAM6B, and enables poolers 1B, 3 and 5. Pooler 1B performs the three-row pooling calculation over R5, R6 and R7 and outputs row 3 of Out_Map; pooler 3 performs the three-row pooling calculation over R7, R8 and R9 and outputs row 4 of Out_Map; pooler 5 performs the three-row pooling calculation over R9, R10 and R11 and outputs row 5 of Out_Map.
  • The third convolution pass outputs row 13 of the feature map plus 5 rows of random data to the corresponding BlockRAMs 1 through 6. The convolution output feature map size now satisfies map_size < R, so no extra caching is needed and pooler 1C is enabled; pooler 1C performs the three-row pooling calculation over R11, R12 and R13 and outputs row 6 of Out_Map, completing the pooling operation for this layer's input image.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A BNRP-based configurable parallel general convolutional neural network accelerator, which relates to the technical fields of calculation, inference and counting. The accelerator comprises: a mode configurator, a convolution calculator, a BNRP calculator, a data communication unit, and a data compression encoder/decoder. The convolution calculator comprises T systolic convolution arrays having a size of R*C, and each systolic convolution array is configured with corresponding input and output feature map buffer regions and configuration information data buffer regions. The BNRP calculator may execute two calculation modes, and comprises: R*T data input and output interfaces, R*T poolers, a normalization calculation module and a nonlinear activation calculation module, wherein each functional module is executed in parallel in a pipeline. According to the characteristics of various network structures, the execution mode of parallel acceleration calculation modules may be dynamically configured, and the versatility is good. For convolutional neural networks that have complex network structures and that are relatively large scale, the complexity of calculation may be greatly reduced, while power consumption is low and throughput is high.

Description

A configurable parallel general convolutional neural network accelerator based on BNRP

Technical field
The invention discloses a configurable parallel general convolutional neural network accelerator based on BNRP, which belongs to the technical field of calculation, inference and counting.
Background art
In recent years, deep learning has greatly accelerated the development of machine learning and artificial intelligence and has achieved remarkable results in various research fields and commercial applications. The most widely used deep neural networks (DNN, Deep Neural Network) and convolutional neural networks (CNN, Convolutional Neural Network) have proven especially capable at image recognition, speech recognition and other complex machine learning tasks. However, as practical application scenarios grow more complex and demand higher accuracy, the network topology of neural networks keeps changing and, accordingly, the network scale has expanded dramatically; examples include the Baidu Brain with 100 billion neuron connections and Google's cat-recognizing system with 1 billion neuron connections. Therefore, how to realize large-scale deep learning neural network models with low power consumption and high speed through computational acceleration and advanced technology has become an important problem in the fields of machine learning and artificial intelligence.
Deep neural networks not only require a large amount of computation but also need to store millions or even hundreds of millions of network parameters; real-time detection and recognition based on deep neural networks is therefore currently carried out mainly on high-performance multi-core CPUs (Central Processing Unit) and GPUs (Graphic Processing Unit). However, for robots, consumer electronics, smart cars and other mobile devices with limited power, size and cost budgets, it is almost impossible to port complex and diverse convolutional neural network models onto a CPU or GPU. Building a flexibly configurable, high-performance, low-power general-purpose hardware accelerator from general-purpose devices can therefore meet the large computing and storage requirements of convolutional neural networks.
Compared with GPU acceleration, hardware accelerators such as FPGAs and ASICs can achieve at least 50% of the performance at lower power consumption. However, both FPGAs and ASICs have relatively limited computing resources, memory and I/O bandwidth, so developing complex, large-scale DNNs on hardware accelerators is challenging. In recent years, FPGA high-level synthesis tools have brought great breakthroughs to FPGA design, greatly improving development efficiency without sacrificing performance. The FPGA is a low-cost, highly flexible programmable standard device with low power consumption and high parallelism, which makes it well suited to hardware acceleration of convolutional neural network computation. Although an ASIC has the drawbacks of a long development cycle, high cost and low flexibility, it is customized and therefore outperforms GPUs and FPGAs in both performance and power consumption: the TPU series ASIC AI chip released by Google in 2016 delivers 14 to 16 times the performance of a traditional GPU, and the NPU released by Vimicro delivers 118 times the performance of a GPU.
Therefore, applying an FPGA or ASIC to the mobile work platform, and designing a configurable general-purpose convolutional neural network hardware accelerator around a systolic convolution array and a highly parallel pipeline that achieve high computing throughput with only moderate storage and communication bandwidth, is an effective solution.
Summary of the invention
The purpose of the present invention is to address the deficiencies of the above background art and to provide a configurable parallel general convolutional neural network accelerator based on BNRP, which supports accelerated computation for convolutional neural network structures of various scales, offers good versatility, places low demands on on-chip storage resources and I/O bandwidth, and improves computing parallelism and throughput, thereby solving the technical problem that the limited on-chip storage and I/O bandwidth of existing hardware accelerators cannot meet the high-throughput computing requirements of convolutional neural networks.
To achieve the above purpose, the present invention adopts the following technical solutions:
A BNRP-based configurable parallel general convolutional neural network accelerator includes: a mode configurator, a parallel computing acceleration unit (convolution calculator and BNRP calculator), a data cache unit (input and output feature map caches and weight parameter cache), a data communication unit (AXI4 bus interface and AHB bus interface), and a data compression encoder/decoder. The input feature map data In_Map, the weight parameters and the BN parameters enter through the AXI4 bus interface of the data communication unit, are compression-encoded by the data compression encoder/decoder, and are buffered in the corresponding In_Map Buffer, weight buffer and BN parameter buffer. The accelerator's calculation-mode and function configuration information is transmitted to the mode configurator through the AHB bus interface of the data communication unit. The mode configurator configures the calculation modes and functions of the parallel computing acceleration unit according to the received configuration information; after reading data from the In_Map Buffer, the weight buffer and the BN parameter buffer, the parallel computing acceleration unit performs the corresponding convolution, batch normalization, nonlinear activation or pooling operations layer by layer, row by row, column by column and channel by channel in a parallel pipelined manner according to the configuration parameters. After each network layer has extracted its features, the output feature map data is sent back to the data compression encoder/decoder for decoding and then returned to the accelerator's external data storage unit through the AXI4 bus interface.
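For illustration, the parameters described above can be pictured as a single configuration record delivered over the AHB interface. The following minimal Python sketch only enumerates the fields involved; the patent does not publish a register map, so all field names here are assumptions:

```python
from dataclasses import dataclass

# Hypothetical configuration record; field names are illustrative assumptions,
# not the patent's actual register layout.
@dataclass
class ModeConfig:
    layer_index: int   # network layer currently being processed
    map_size: int      # input feature map height/width
    batch: int         # batch size
    kernel_size: int   # convolution kernel size
    do_bn: bool        # perform batch normalization?
    do_relu: bool      # perform nonlinear activation (ReLu)?
    do_pool: bool      # perform pooling?
    pool_size: int     # 2 for the 2*2 poolers, 3 for the 3*3 poolers
    pool_stride: int   # S = 1 or 2
    mode: int          # 1: BN -> pooling -> ReLu; 2: BN -> ReLu -> pooling
```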
As a preferred solution of the above technical solution, the parallel computing acceleration unit includes T convolution calculation arrays and the BNRP calculator. Each convolution calculation array is based on a systolic array architecture of size R*C and can convolve R rows of C feature maps at a time, with the convolution results stored in the Output Buffer. Correspondingly, the BNRP calculator contains R*T data input interfaces, R*T output interfaces, R*T "2*2 poolers" and R*T "3*3 poolers"; the mode configurator enables only a subset of the poolers at a time (R*T/S of them), where S denotes the pooling stride (S = 1, 2).
As a preferred solution of the above technical solution, the network configuration information read by the mode configurator from the AHB bus interface, such as the network layer of the data currently being processed, the network model parameters and the buffer-data read/write addresses, is cached in the data buffer area of the convolution calculator; the calculation-mode and function configuration parameters read by the mode configurator from the AHB bus interface, i.e. whether to perform the batch normalization (BN), nonlinear activation (ReLu), pooling and data compression encoding/decoding operations, together with the calculation-mode configuration parameters, are transferred to the BNRP calculator.
As a preferred solution of the above technical solution, the BNRP calculator executes batch normalization (BN), nonlinear activation (ReLu) or four kinds of pooling operations in parallel in a pipelined manner; according to flag bits it can be configured to execute one or several of these operations, and it executes the corresponding calculation mode according to the configuration parameters. Mode 1: after the BN operation, the pooling operation is executed first and then the ReLu operation. Mode 2: after the BN operation, the ReLu operation is executed first and then the pooling operation.
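A rough software model of the two modes follows. This is a behavioural sketch only, assuming per-element BN scale a and bias b as introduced later; the hardware executes these stages as a parallel pipeline rather than sequentially:

```python
import numpy as np

def bnrp(conv_out, a, b, cfg, pool):
    """Behavioural sketch of the BNRP calculator's two execution modes.

    conv_out: convolution result (2-D array); a, b: BN scale and bias
    (arrays or scalars); cfg: flags like the ModeConfig sketch above;
    pool: a pooling function. Not the hardware pipeline itself.
    """
    x = a * conv_out + b if cfg.do_bn else conv_out  # batch normalization
    if cfg.mode == 1:                                # mode 1: pool, then ReLu
        if cfg.do_pool:
            x = pool(x)
        if cfg.do_relu:
            x = np.maximum(x, 0)
    else:                                            # mode 2: ReLu, then pool
        if cfg.do_relu:
            x = np.maximum(x, 0)
        if cfg.do_pool:
            x = pool(x)
    return x
```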
As a preferred solution of the above technical solution, when the input feature map size map_size > R and the configuration requires a pooling operation, the BNRP calculator, according to the network model, the number of rows R of the systolic convolution array and the configuration parameters, interleaves and caches m rows of input feature map data into 2m on-chip Block RAMs.
As a preferred solution of the above technical solution, the "2*2 pooler" is a four-to-one comparator composed of two two-to-one comparators, Comparator2_1 and Comparator2_2; two feature map values are input to Comparator2_2 per clock, and one 2*2 pooling value is output every 2 clocks; when S = 1, the output value of Comparator2_2 is saved and used as the output value of Comparator2_1 at the next clock. The "3*3 pooler" is a nine-to-one comparator composed of three three-to-one comparators, Comparator3_1, Comparator3_2 and Comparator3_3; three feature map values are input per clock, and one 3*3 pooling value is output every 3 clocks; when S = 1, the output value of Comparator3_2 is saved as the output value of Comparator3_1 at the next clock, and the output value of Comparator3_3 is saved as the output value of Comparator3_2 at the next clock; when S = 2, the output value of Comparator3_3 is saved as the output value of Comparator3_1 at the next clock.
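The point of the comparator hand-off is that overlapping windows reuse column results instead of recomputing them. A behavioural sketch of one 3*3 pooler pooling three rows, assuming one three-input comparison per incoming column (function and variable names are illustrative, not taken from the patent):

```python
from collections import deque

def pool3x3_rows(r1, r2, r3, stride):
    """Max-pool three rows with column-maximum reuse, as a software model.

    Each step reduces one 3-pixel column (one three-input comparison per
    clock); column maxima are carried across overlapping windows, mirroring
    the Comparator3_1/3_2/3_3 hand-off described above. Sketch only.
    """
    cols, out = deque(), []
    for x1, x2, x3 in zip(r1, r2, r3):
        cols.append(max(x1, x2, x3))   # this clock's column maximum
        if len(cols) == 3:             # a full 3*3 window is ready
            out.append(max(cols))      # nine-input result
            for _ in range(stride):    # S=1 keeps two columns, S=2 keeps one
                cols.popleft()
    return out

# pool3x3_rows([1, 5, 2, 7, 3], [0, 4, 6, 1, 2], [3, 2, 8, 0, 9], stride=2)
# -> [8, 9]
```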
As a preferred solution of the above technical solution, part of the 2*R*T poolers are enabled according to the configuration information and the others are turned off. The "2*2 pooler" executes the 2*2 AP or 2*2 MP operation according to the configuration parameters, and the "3*3 pooler" executes the 3*3 AP or 3*3 MP operation according to the configuration parameters. There are R*T poolers of each kind, numbered sequentially (1, 2, 3, ..., R*T); when S = 2, the odd-numbered poolers are enabled.
As a preferred solution of the above technical solution, in the convolution calculation array and the BNRP calculator, if the configuration requires the BN operation, then before the ReLu operation three comparators judge the feature map data map[i][j] and the BN weight parameters a[i][j] and b[i][j] against 0. If map[i][j] ≤ 0, a[i][j] ≥ 0 and b[i][j] ≤ 0, the convolution calculation array does not need to multiply map[i][j] by a[i][j] and does not need to add b[i][j]; the corresponding output value of the BN operation in BNRP calculator mode 1 is 0, and the corresponding output values of both the BN and ReLu operations in BNRP calculator mode 2 are 0.
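The check is sound because a[i][j]*map[i][j] ≤ 0 whenever map[i][j] ≤ 0 and a[i][j] ≥ 0, and adding b[i][j] ≤ 0 keeps the result non-positive, so the subsequent ReLu necessarily outputs 0. A minimal sketch of the element-wise decision (illustrative only; the hardware uses three comparators in place of the multiplier and adder):

```python
def bn_relu_with_skip(m, a, b):
    """Zero-skip before ReLu: three sign comparisons can replace a
    multiply and an add when the result is guaranteed non-positive."""
    if m <= 0 and a >= 0 and b <= 0:
        return 0.0                 # a*m + b <= 0, so ReLu(a*m + b) = 0
    return max(a * m + b, 0.0)     # ordinary BN followed by ReLu
```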
By adopting the above technical solutions, the present invention has the following beneficial effects:
(1) The present invention designs the BNRP calculator in a parallel pipelined manner and reduces the computation load of the neural network accelerator by dynamically configuring the parameters of the parallel calculators, in particular the execution mode of the BNRP calculator. Especially for convolutional neural networks with large network structures, this greatly accelerates the computation of the accelerator while reducing repeated calculations and hence the accelerator's power consumption. The convolution calculation array is designed on a systolic array architecture, achieving high computing throughput with only moderate storage and I/O communication bandwidth, effectively improving the data reuse rate and further reducing data transmission time.
(2) Through the design of the mode configurator, the execution mode of the BNRP calculator can be dynamically configured according to the characteristics of the network structure. The accelerator is therefore more versatile, no longer constrained by the network model structure and the number of layers; unnecessary intermediate-value caching is also eliminated, reducing the use of memory resources.
Description of the drawings
Figure 1 is a schematic structural diagram of the accelerator disclosed in the present invention.
Figure 2 is a schematic structural diagram of the BNRP calculator of the present invention.
Figure 3 is a schematic diagram of the working process of the BNRP calculator of the present invention.
Figure 4 is a schematic diagram of the 3*3 pooler of the present invention performing a pooling operation.
Detailed description of the embodiments
The technical solution of the invention is described in detail below in conjunction with the drawings.
The configurable parallel general convolutional neural network accelerator based on BNRP disclosed in the present invention is shown in Figure 1 and includes: a parallel computing acceleration unit composed of the mode configurator, the convolution calculator and the BNRP calculator; a data cache unit composed of the input and output feature map caches and the weight parameter cache; a data communication unit composed of the AXI4 bus interface and the AHB bus interface; and the data compression encoder/decoder. The working states of the accelerator include a read-configuration-parameters state, a read-data state, a calculation state and a send-data state.
The mode configurator reads the mode configuration parameters from outside the accelerator through the AHB bus. Whether the BN, ReLu or pooling operation is to be performed, together with configuration information such as the execution mode, the number of network layers and the feature map size, is transmitted to the BNRP calculator; information such as the number of network layers, the feature map size and batch, and the convolution kernel size is transmitted to the data buffer area of the convolution calculator; configuration information such as the number of network layers, the data read/write enables and the addresses is transmitted to the data compression encoder/decoder.
After reading the data read-enable and address signals, the data compression encoder/decoder reads the corresponding weight parameters (convolution kernels and biases) from outside the accelerator through the AXI4 bus and transmits them to the weight parameter buffer area, and reads the corresponding input feature map data and transmits it to the In_Map Buffer.
After receiving the calculation enable signal, the convolution calculator reads the number of network layers, the feature map size and batch, and the convolution kernel size from the data buffer area, and reads the weight parameters and input feature map data in a systolic manner to perform the corresponding convolution calculation. After the calculation is completed, it outputs the end-flag information to the BNRP calculator and outputs the convolution results to the Out_Map Buffer.
Referring to Figure 2, after receiving the mode configuration parameters, the BNRP calculator waits for the calculation-completion flag sent by the convolution calculator. If the configuration requires the BN operation, it initiates a BN parameter read request and reads the corresponding BN parameters from the BN parameter buffer; otherwise, no BN operation is performed.
Referring to Figure 3, the BNRP calculator determines the calculation mode to be executed according to the configuration information. If execution mode 1 is configured, the pooling operation is executed first: according to the received network model parameters (pooling stride) and the feature map size, the feature map input pixel values that need to be cached are sent to the corresponding Block RAMs, the corresponding poolers are enabled, and the ReLu operation is executed after the pooling calculation is completed. If execution mode 2 is configured, the ReLu operation is executed first. The maximum pooler calculation process is as follows:
OMap[c][i][j] = max{ IMap[c][S*i+p][S*j+q] : 0 ≤ p, q ≤ k }
The calculation process of the average pooler is as follows:
OMap[c][i][j] = ( Σ IMap[c][S*i+p][S*j+q] ) / (k+1)^2, over 0 ≤ p, q ≤ k
Here k = 1, 2 denotes the pooler size (the pooling window is (k+1)*(k+1)), IMap denotes an input feature map pixel value, OMap denotes an output feature map pixel value, and OMap[c][i][j] denotes the pixel value at row i, column j of the c-th output feature map.
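A direct software rendering of these two formulas follows, as a sketch under the stated convention that the window is (k+1)*(k+1); the hardware computes the same values with comparator trees and dividers:

```python
import numpy as np

def pool_map(imap, k, S, mode="max"):
    """Compute OMap from IMap per the formulas above, for one channel.

    imap: H x W array; k = 1 or 2 selects the 2*2 or 3*3 window;
    S is the pooling stride; mode 'max' gives MP, 'avg' gives AP.
    """
    H, W = imap.shape
    oh = (H - (k + 1)) // S + 1
    ow = (W - (k + 1)) // S + 1
    omap = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            win = imap[S*i : S*i + k + 1, S*j : S*j + k + 1]
            omap[i, j] = win.max() if mode == "max" else win.mean()
    return omap

# For the example below: a 13*13 map with k = 2, S = 2 yields a 6*6 output.
```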
Referring to Figure 4, take as an example a convolution calculation array with R = 6 rows, an input feature map of size 13*13, a 3*3 pooler and a pooling stride s = 2; the output feature map size is then 6*6. Since the row and column calculations of the output feature map follow the same principle, only the row calculation is described in detail below:
The first convolution pass outputs rows 1, 2, 3, 4, 5 and 6 of the feature map to the corresponding BlockRAM1, BlockRAM2, BlockRAM3, BlockRAM4, BlockRAM5 and BlockRAM6, additionally caches row 5 to BlockRAM5B and row 6 to BlockRAM6B, and enables poolers 1C, 3 and 5. The first output value of pooler 1C is invalid; pooler 3 performs the three-row pooling calculation over R1, R2 and R3 and outputs row 1 of Out_Map; pooler 5 performs the three-row pooling calculation over R3, R4 and R5 and outputs row 2 of Out_Map.
The second convolution pass outputs rows 7, 8, 9, 10, 11 and 12 of the feature map to the corresponding BlockRAM1 through BlockRAM6, additionally caches row 11 to BlockRAM5B and row 12 to BlockRAM6B, and enables poolers 1B, 3 and 5. Pooler 1B performs the three-row pooling calculation over R5, R6 and R7 and outputs row 3 of Out_Map; pooler 3 performs the three-row pooling calculation over R7, R8 and R9 and outputs row 4 of Out_Map; pooler 5 performs the three-row pooling calculation over R9, R10 and R11 and outputs row 5 of Out_Map.
The third convolution pass outputs row 13 of the feature map plus 5 rows of random data to the corresponding BlockRAMs 1 through 6. At this point the convolution output feature map size satisfies map_size < R, so no extra caching is needed and pooler 1C is enabled. Pooler 1C performs the three-row pooling calculation over R11, R12 and R13 and outputs row 6 of Out_Map, completing the pooling operation for this layer's input image. In a practical design, poolers 1B and 1C can be combined into a single 3*3 pooler numbered 1 using multiplexers and comparators; thus, in the actual calculation, the odd-numbered poolers are enabled when the pooling stride is s = 2. This schedule is summarized in the sketch below.
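The three passes can be reproduced with a short scheduling sketch (illustrative only; row numbering follows the example above, and the "carry" rows are the ones written to the extra BlockRAM5B/6B buffers):

```python
def row_schedule(map_size=13, R=6, k=3, s=2):
    """Print which 3-row pooling windows complete in each convolution pass
    and which rows must be carried over for the next pass. Sketch only."""
    starts = list(range(1, map_size - k + 2, s))       # window start rows
    done = 0
    for p, top in enumerate(range(1, map_size + 1, R), 1):
        avail = min(top + R - 1, map_size)             # rows produced so far
        ready = [r for r in starts[done:] if r + k - 1 <= avail]
        done += len(ready)
        carry = list(range(avail - k + 2, avail + 1)) if done < len(starts) else []
        print(f"pass {p}: windows {[tuple(range(r, r + k)) for r in ready]}, "
              f"carry rows {carry}")

# row_schedule() prints:
# pass 1: windows [(1, 2, 3), (3, 4, 5)], carry rows [5, 6]
# pass 2: windows [(5, 6, 7), (7, 8, 9), (9, 10, 11)], carry rows [11, 12]
# pass 3: windows [(11, 12, 13)], carry rows []
```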
It has been verified that when mode 1 is configured, performing the pooling operation first shrinks the feature map and therefore reduces the amount of ReLu computation; with pooling stride S, the pooled map carries only about 1/S^2 of the original pixel values, the exact fraction depending on the pooler size and stride (the original formula images are not reproduced here). When mode 2 is configured, performing the ReLu operation first maps all feature map data values into the non-negative set, so the pooling operation does not need to consider the sign bit of the input pixel values, which reduces the complexity of the pooling calculation and the power consumption of the comparators.
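As a quick arithmetic check on the 13*13 example above (illustrative numbers only):

```python
# Mode 1: ReLu runs on the pooled 6*6 map instead of the full 13*13 map.
full, pooled = 13 * 13, 6 * 6
print(full, pooled, pooled / full)   # 169 36 ~0.21, roughly 1/S^2 for S = 2
```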
The embodiments merely illustrate the technical idea of the present invention and cannot be used to limit its protection scope; any modification made on the basis of the technical solution that conforms to the inventive concept of the present application falls within the protection scope of the present invention.

Claims (9)

  1. A BNRP-based configurable parallel general convolutional neural network accelerator, characterized in that it comprises:
    a mode configurator, which reads network parameters, feature map parameters, calculation-mode and function configuration parameters from the outside and, according to the read parameters, outputs instructions for switching the working state of the accelerator;
    a data compression encoder/decoder, which encodes the feature map data, weight data and BN parameters read from the outside after receiving the network parameters, data read/write enable instructions and address configuration information sent by the mode configurator, and decodes the calculation results when receiving the calculation results output by the BNRP calculator;
    a BN parameter buffer, used to store the encoded BN parameters;
    an input feature map buffer, used to store the encoded input feature map data;
    a weight parameter buffer, used to store the encoded weight data;
    a data buffer, used to store the network parameters and feature map size parameters read from the outside by the mode configurator, and to read the encoded weight data from the weight parameter buffer after the calculation state is entered;
    a convolution calculator, which, after receiving the calculation enable instruction sent by the mode configurator, reads the network parameters, feature map parameters and weight data from the data buffer, and performs the convolution calculation after reading the input feature map data and weight data from the input feature map buffer and the weight parameter buffer;
    an output feature map buffer, used to store the convolution results output by the convolution calculator; and
    a BNRP calculator, which, after receiving the calculation mode sent by the mode configurator and the convolution-calculation end flag output by the convolution calculator, executes on the convolution results output by the convolution calculator, according to the function configuration parameters sent by the mode configurator, either the calculation mode of batch normalization followed by pooling and then nonlinear activation, or the calculation mode of batch normalization followed by nonlinear activation and then pooling.
  2. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 1, characterized in that the BNRP calculator comprises:
    R*T data input interfaces, which receive the R rows of feature maps output by the T convolution arrays of the convolution calculator;
    a BN operation module, which, when the function configuration parameters sent by the mode configurator contain a batch normalization instruction, reads the BN parameters from the BN parameter buffer and performs the batch normalization operation on the data received by the data input ports;
    a ReLu operation module, which performs nonlinear activation on the pooling results when the calculation mode sent by the mode configurator is batch normalization followed by pooling and then nonlinear activation, and performs nonlinear activation on the batch-normalized data when the calculation mode sent by the mode configurator is batch normalization followed by nonlinear activation and then pooling; and
    R*T poolers, which output the pooling results of the batch-normalized data when the calculation mode sent by the mode configurator is batch normalization followed by pooling and then nonlinear activation, and output the pooling results of the nonlinearly activated batch-normalized data when the calculation mode sent by the mode configurator is batch normalization followed by nonlinear activation and then pooling.
  3. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 2, characterized in that the BNRP calculator further comprises a mode simplification module: before the nonlinear activation operation is executed, the mode selector reads the feature map data received by the data input interfaces of the BNRP calculator together with the BN weight parameters and bias parameters and, when no multiplication or bias addition needs to be performed on the feature map data, sets to zero the batch normalization instruction of the calculation mode of batch normalization followed by pooling and then nonlinear activation, or sets to zero both the batch normalization instruction and the nonlinear activation instruction of the calculation mode of batch normalization followed by nonlinear activation and then pooling.
  4. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 3, characterized in that the mode simplification module comprises three comparators that respectively judge the feature map data, the BN weight parameter and the bias parameter against 0; when the three conditions that the feature map data is less than or equal to 0, the BN weight parameter is greater than or equal to 0 and the bias parameter is less than or equal to 0 are satisfied simultaneously, it outputs configuration parameters in which the batch normalization instruction is zero for the calculation mode of batch normalization followed by pooling and then nonlinear activation, or configuration parameters in which both the batch normalization instruction and the nonlinear activation instruction are zero for the calculation mode of batch normalization followed by nonlinear activation and then pooling.
  5. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 2, characterized in that, when the function configuration parameters sent by the mode configurator contain an instruction to execute 2*2 maximum pooling, the R*T poolers are R*T 2*2 poolers; the 2*2 pooler is a four-to-one comparator composed of a first two-to-one comparator and a second two-to-one comparator, two feature map data are input per clock to the two two-to-one comparators, and the four-to-one comparator outputs one 2*2 pooling value every 2 clocks; when the pooling stride is 1, the output value of the second two-to-one comparator is saved as the output value of the first two-to-one comparator at the next clock; when the function configuration parameters sent by the mode configurator contain an instruction to execute 2*2 average pooling, the comparator of the maximum pooling mode is configured as a 1/2 divider.
  6. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 2, characterized in that, when the function configuration parameters sent by the mode configurator contain an instruction to execute 3*3 maximum pooling, the R*T poolers are R*T 3*3 poolers; the 3*3 pooler is a nine-to-one comparator composed of a first three-to-one comparator, a second three-to-one comparator and a third three-to-one comparator, three feature map data are input per clock to the inputs of the three three-to-one comparators, and the nine-to-one comparator outputs one 3*3 pooling value every 3 clocks; when the pooling stride is 1, the output value of the second three-to-one comparator is saved as the output value of the first three-to-one comparator at the next clock, and the output value of the third three-to-one comparator is saved as the output value of the second three-to-one comparator at the next clock; when the pooling stride is 2, the output value of the third three-to-one comparator is saved as the output value of the first three-to-one comparator at the next clock; when the function configuration parameters sent by the mode configurator contain an instruction to execute 3*3 average pooling, the comparator of the maximum pooling mode is configured as a 1/3 divider.
  7. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 1, characterized in that the mode configurator reads the network parameters, feature map parameters, calculation mode and function configuration parameters from the outside through the AHB bus; the network parameters include the number of network layers and the convolution kernel size; the feature map parameters include the feature map size parameter and the batch; the calculation mode is to execute, on the convolution results output by the convolution calculator, batch normalization followed by pooling and then nonlinear activation, or batch normalization followed by nonlinear activation and then pooling; and the function configuration parameters include whether to perform the batch normalization operation, whether to perform the nonlinear activation operation and whether to perform the pooling operation.
  8. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 1, characterized in that the data compression encoder/decoder reads the feature map data, weight data and BN parameters from the outside through the AXI4 bus.
  9. The BNRP-based configurable parallel general convolutional neural network accelerator according to claim 1, characterized in that, when the input feature map data is larger than the number of array rows of the convolution calculator and a pooling operation needs to be executed, m rows of input feature map data are interleaved and cached into 2m on-chip Block RAMs.
PCT/CN2019/105534 2019-06-28 2019-09-12 Bnrp-based configurable parallel general convolutional neural network accelerator WO2020258529A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910572582.3 2019-06-28
CN201910572582.3A CN110390385B (en) 2019-06-28 2019-06-28 BNRP-based configurable parallel general convolutional neural network accelerator

Publications (1)

Publication Number Publication Date
WO2020258529A1

Family

ID=68285909

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/105534 WO2020258529A1 (en) 2019-06-28 2019-09-12 Bnrp-based configurable parallel general convolutional neural network accelerator

Country Status (2)

Country Link
CN (1) CN110390385B (en)
WO (1) WO2020258529A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111158756B (en) * 2019-12-31 2021-06-29 百度在线网络技术(北京)有限公司 Method and apparatus for processing information
CN111242295B (en) * 2020-01-20 2022-11-25 清华大学 Method and circuit capable of configuring pooling operator
CN111142808B (en) * 2020-04-08 2020-08-04 浙江欣奕华智能科技有限公司 Access device and access method
CN111832717B (en) * 2020-06-24 2021-09-28 上海西井信息科技有限公司 Chip and processing device for convolution calculation
CN111736904B (en) * 2020-08-03 2020-12-08 北京灵汐科技有限公司 Multitask parallel processing method and device, computer equipment and storage medium
CN112905530B (en) * 2021-03-29 2023-05-26 上海西井信息科技有限公司 On-chip architecture, pooled computing accelerator array, unit and control method
CN113065647B (en) * 2021-03-30 2023-04-25 西安电子科技大学 Calculation-storage communication system and communication method for accelerating neural network
CN114004351B (en) * 2021-11-22 2025-04-18 浙江大学 A convolutional neural network hardware acceleration platform
CN115470164B (en) * 2022-09-30 2025-07-08 上海安路信息科技股份有限公司 A hybrid system based on FPGA+NPU architecture

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184366A (en) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division-multiplexing general neural network processor
CN105631519A (en) * 2015-12-31 2016-06-01 北京工业大学 Convolution nerve network acceleration method based on pre-deciding and system
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
US20180341495A1 (en) * 2017-05-26 2018-11-29 Purdue Research Foundation Hardware Accelerator for Convolutional Neural Networks and Method of Operation Thereof
CN109635944A (en) * 2018-12-24 2019-04-16 西安交通大学 A kind of sparse convolution neural network accelerator and implementation method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229647A (en) * 2017-08-18 2018-06-29 北京市商汤科技开发有限公司 The generation method and device of neural network structure, electronic equipment, storage medium
US11568218B2 (en) * 2017-10-17 2023-01-31 Xilinx, Inc. Neural network processing system having host controlled kernel acclerators
CN109389212B (en) * 2018-12-30 2022-03-25 南京大学 Reconfigurable activation quantization pooling system for low-bit-width convolutional neural network
CN109767002B (en) * 2019-01-17 2023-04-21 山东浪潮科学研究院有限公司 A neural network acceleration method based on multi-block FPGA co-processing
CN109934339B (en) * 2019-03-06 2023-05-16 东南大学 A Universal Convolutional Neural Network Accelerator Based on a 1D Systolic Array

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905239B (en) * 2021-02-19 2024-01-12 北京超星未来科技有限公司 Point cloud preprocessing acceleration method based on FPGA, accelerator and electronic equipment
CN112905239A (en) * 2021-02-19 2021-06-04 北京超星未来科技有限公司 Point cloud preprocessing acceleration method based on FPGA, accelerator and electronic equipment
CN113052299B (en) * 2021-03-17 2022-05-31 浙江大学 Neural network memory computing device based on lower communication bound and acceleration method
CN113052299A (en) * 2021-03-17 2021-06-29 浙江大学 Neural network memory computing device based on lower communication bound and acceleration method
CN115145839A (en) * 2021-03-31 2022-10-04 广东高云半导体科技股份有限公司 Deep convolution accelerator and method for accelerating deep convolution by using same
CN115145839B (en) * 2021-03-31 2024-05-14 广东高云半导体科技股份有限公司 Depth convolution accelerator and method for accelerating depth convolution
CN113051216A (en) * 2021-04-22 2021-06-29 南京工业大学 MobileNet-SSD target detection device and method based on FPGA acceleration
CN113051216B (en) * 2021-04-22 2023-07-11 南京工业大学 MobileNet-SSD target detection device and method based on FPGA acceleration
CN113255897A (en) * 2021-06-11 2021-08-13 西安微电子技术研究所 Pooling computing unit of convolutional neural network
CN113255897B (en) * 2021-06-11 2023-07-07 西安微电子技术研究所 Pooling calculation unit of convolutional neural network
CN113592067B (en) * 2021-07-16 2024-02-06 华中科技大学 A configurable convolution computing circuit for convolutional neural networks
CN113592067A (en) * 2021-07-16 2021-11-02 华中科技大学 Configurable convolution calculation circuit for convolution neural network
CN113516236A (en) * 2021-07-16 2021-10-19 西安电子科技大学 VGG16 network parallel acceleration processing method based on ZYNQ platform
CN113592086A (en) * 2021-07-30 2021-11-02 中科亿海微电子科技(苏州)有限公司 Method and system for obtaining optimal solution of parallelism of FPGA CNN accelerator
CN113792621B (en) * 2021-08-27 2024-04-05 杭州电子科技大学 FPGA-based target detection accelerator design method
CN113792621A (en) * 2021-08-27 2021-12-14 杭州电子科技大学 A Design Method of Target Detection Accelerator Based on FPGA
CN113743587A (en) * 2021-09-09 2021-12-03 苏州浪潮智能科技有限公司 Convolutional neural network pooling calculation method, system and storage medium
CN113743587B (en) * 2021-09-09 2024-02-13 苏州浪潮智能科技有限公司 A convolutional neural network pooling calculation method, system, and storage medium
CN114239816A (en) * 2021-12-09 2022-03-25 电子科技大学 Reconfigurable hardware acceleration architecture of convolutional neural network-graph convolutional neural network
CN114239816B (en) * 2021-12-09 2023-04-07 电子科技大学 Reconfigurable hardware acceleration architecture of convolutional neural network-graph convolutional neural network
CN114265696A (en) * 2021-12-28 2022-04-01 北京航天自动控制研究所 Pooler and Pooling Acceleration Circuit for Max Pooling Layer of Convolutional Neural Network
CN114936636A (en) * 2022-04-29 2022-08-23 西安电子科技大学广州研究院 General lightweight convolutional neural network acceleration method based on FPGA
CN114819129A (en) * 2022-05-10 2022-07-29 福州大学 Convolution neural network hardware acceleration method of parallel computing unit
CN114911628A (en) * 2022-06-15 2022-08-16 福州大学 An FPGA-based MobileNet Hardware Acceleration System
CN115204364A (en) * 2022-06-28 2022-10-18 中国电子科技集团公司第五十二研究所 A convolutional neural network hardware acceleration device with dynamic allocation of cache space
CN116309520A (en) * 2023-04-03 2023-06-23 江南大学 A strip steel surface defect detection system
CN117933345A (en) * 2024-03-22 2024-04-26 长春理工大学 A training method for medical image segmentation model
CN117933345B (en) * 2024-03-22 2024-06-11 长春理工大学 A training method for medical image segmentation model
CN118070855A (en) * 2024-04-18 2024-05-24 南京邮电大学 Convolutional neural network accelerator based on RISC-V architecture
CN118070855B (en) * 2024-04-18 2024-07-09 南京邮电大学 A convolutional neural network accelerator based on RISC-V architecture

Also Published As

Publication number Publication date
CN110390385A (en) 2019-10-29
CN110390385B (en) 2021-09-28

Similar Documents

Publication Publication Date Title
WO2020258529A1 (en) Bnrp-based configurable parallel general convolutional neural network accelerator
CN109284817B (en) Deep separable convolutional neural network processing architecture/method/system and medium
CN109934339B (en) A Universal Convolutional Neural Network Accelerator Based on a 1D Systolic Array
WO2020258841A1 (en) Deep neural network hardware accelerator based on power exponent quantisation
CN109447241B (en) A Dynamic Reconfigurable Convolutional Neural Network Accelerator Architecture for the Internet of Things
CN106991477B (en) Artificial neural network compression coding device and method
US10936941B2 (en) Efficient data access control device for neural network hardware acceleration system
US20190026626A1 (en) Neural network accelerator and operation method thereof
CN110516801A (en) A High Throughput Dynamically Reconfigurable Convolutional Neural Network Accelerator Architecture
CN107169563A (en) Processing system and method applied to two-value weight convolutional network
CN109389212B (en) Reconfigurable activation quantization pooling system for low-bit-width convolutional neural network
CN106228240A (en) Degree of depth convolutional neural networks implementation method based on FPGA
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
CN110991630A (en) Convolutional neural network processor for edge calculation
CN111626403B (en) Convolutional neural network accelerator based on CPU-FPGA memory sharing
CN115880132B (en) Graphics processor, matrix multiplication task processing method, device and storage medium
CN111507465B (en) A Configurable Convolutional Neural Network Processor Circuit
CN107844829A (en) Method and system and neural network processor for accelerans network processing unit
CN107729995A (en) Method and system and neural network processor for accelerans network processing unit
CN108345934B (en) A kind of activation device and method for neural network processor
CN111860773A (en) Processing apparatus and method for information processing
CN113392963B (en) FPGA-based CNN hardware acceleration system design method
CN115983348A (en) RISC-V Accelerator System Supporting Extended Instructions for Convolutional Neural Networks
CN117632844A (en) Reconfigurable AI algorithm hardware accelerator
CN108647780A (en) Restructural pond operation module structure towards neural network and its implementation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19935380

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19935380

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31.08.2022)
