CN111142808B - Access device and access method - Google Patents
Access device and access method
- Publication number
- CN111142808B CN202010267401.9A
- Authority
- CN
- China
- Prior art keywords
- data
- ports
- read
- control unit
- data storage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0604—Improving or facilitating administration, e.g. storage management
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0629—Configuration or reconfiguration of storage systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/0644—Management of space entities, e.g. partitions, extents, pools
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0683—Plurality of storage devices
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Neurology (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Input (AREA)
Abstract
The present application discloses an access device and an access method. In the access device, a local data storage buffer stores data of at least one data type computed by a target hardware acceleration engine. A write control unit, connected to a first number of basic write ports of the local data storage buffer, writes the data of at least one data type to be written for the target hardware acceleration engine, in parallel, into the corresponding data storage areas of the local data storage buffer. A read control unit, connected to the first number of basic read ports of the local data storage buffer, reads the data of the target data type required by the target hardware acceleration engine, in parallel, from the data storage areas of the local data storage buffer. The access device can flexibly configure the local data storage buffer according to the data volume of each data type, and the length of the accessed data is not restricted.
Description
Technical Field
The present application relates to the technical field of integrated circuits, and in particular to an access device and an access method.
Background
With the development of the Internet of Things and artificial intelligence, the demand for hardware computing power keeps rising, and large volumes of data often need to be processed by hardware acceleration modules. To support fast data access and data caching, a hardware acceleration module (also called a "hardware acceleration engine") usually includes an on-chip data storage buffer, that is, a local data storage buffer. This buffer participates directly in the core computation on the data path and is accessed by multiple modules of the hardware acceleration engine, so the local data storage buffer must support multi-port parallel access.
At present, a processor with a low main frequency (or "clock frequency") can access the data of different storage units in a multi-port shared memory in k separate accesses within a short period, that is, time-division parallel access, in order to serve the different ports. Moreover, the shared memory must use memory chips with a sufficiently high frequency to meet the requirements of time-division parallel access, and this kind of multi-port time-division access is only suitable for data transfers with low throughput requirements. In other words, a shared memory of this structure does not achieve true multi-port parallel access.
When a multi-core processor with a high main frequency accesses a multi-port shared memory, the memory controller for the k ports uses a k*n crossbar switch or a network-on-chip to connect n different storage units. Through the crossbar switch or network-on-chip, the k ports access the n different storage units simultaneously, which achieves multi-port parallel access to the memory.
However, although the crossbar switch in such a multi-port shared memory can perform crossover among multiple ports, it only supports reads and writes of fixed-length data, that is, it can only read and write a fixed-length data buffer. The data type stored in a fixed-length data buffer is also fixed. Storing data of other data types in the buffer requires configuring a data storage area for the corresponding data type in advance, which results in low flexibility.
Summary of the Invention
The embodiments of the present application provide an access device and an access method that solve the above problems of the prior art: they improve the flexibility of configuring the data types held in the local data storage buffer, and they remove the prior-art restriction that the local data storage buffer only supports reads and writes of fixed-length data.
In a first aspect, an access device is provided. The device may include:
a local data storage buffer, configured to store data of at least one data type corresponding to a target hardware acceleration engine; the local data storage buffer consists of data storage areas for different data types, formed from a first number of basic storage modules with unified address encoding; the first number is determined according to the amount of data required for computation by the target hardware acceleration engine, and each basic storage module has a basic read port and a basic write port;
a write control unit, connected to the first number of basic write ports of the local data storage buffer, configured to write the data of at least one data type to be written for the target hardware acceleration engine, in parallel, into the corresponding data storage areas of the local data storage buffer;
a read control unit, connected to the first number of basic read ports of the local data storage buffer, configured to read the data of the target data type required by the target hardware acceleration engine, in parallel, from the data storage areas of the local data storage buffer.
In an optional implementation, the write control unit is further configured to determine the number of write ports of the write control unit according to the number of types in the at least one data type;
the read control unit is further configured to determine the number of read ports of the read control unit according to the number of types of the target data type.
In an optional implementation, if the number of write ports of the write control unit is a second number N, the N write ports of the write control unit are connected to the target hardware acceleration engine, and the first number K of output ports of the write control unit are connected one-to-one to the K basic write ports;
if the number of read ports of the read control unit is a third number M, the M read ports of the read control unit are connected to the target hardware acceleration engine, and the first number K of input ports of the read control unit are connected one-to-one to the K basic read ports.
In an optional implementation, the write control unit includes the second number N of demultiplexers and the first number K of multiplexers; each demultiplexer has one input port and K output ports, and each multiplexer has N input ports and one output port;
the N write ports of the write control unit are connected one-to-one to the input terminals of the N demultiplexers;
the K output terminals of each demultiplexer are connected one-to-one to input terminals of the K multiplexers;
the output terminal of each multiplexer is connected to the basic write port to be written in the corresponding data storage area.
In an optional implementation, the read control unit includes the third number M of multiplexers and the first number K of demultiplexers; each multiplexer has K input ports and one output port, and each demultiplexer has one input port and M output ports;
the M read ports of the read control unit are connected one-to-one to the input terminals of the M multiplexers;
the K input terminals of each multiplexer are connected one-to-one to output terminals of the K demultiplexers;
the input terminal of each demultiplexer is connected to the basic read port to be read in the corresponding data storage area.
In an optional implementation, if the target hardware acceleration engine is a convolutional neural network engine, the at least one data type includes a weight type, a feature-value type, and a convolution partial-sum type.
In a second aspect, an access method is provided, applied in an access device. The method may include:
writing data of at least one data type corresponding to a target hardware acceleration engine, in parallel, into the data storage areas of the local data storage buffer in the access device;
storing the data of at least one data type corresponding to the target hardware acceleration engine;
or, reading data of the target data type required by the target hardware acceleration engine, in parallel, from the data storage areas of the local data storage buffer.
In an optional implementation, the number of write ports of the write control unit in the access device is determined according to the number of types in the at least one data type corresponding to the target hardware acceleration engine;
and the number of read ports of the read control unit in the access device is determined according to the number of types of the target data type to be read by the target hardware acceleration engine.
In a third aspect, an electronic device is provided. The electronic device includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the method steps of any implementation of the second aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the method steps of any implementation of the second aspect.
The embodiments of the present invention provide an access device in which a local data storage buffer stores data of at least one data type corresponding to a target hardware acceleration engine. The local data storage buffer consists of data storage areas for different data types, formed from a first number of basic storage modules with unified address encoding; the first number is determined according to the amount of data required for computation by the target hardware acceleration engine, and each basic storage module has a basic read port and a basic write port. A write control unit, connected to the first number of basic write ports of the local data storage buffer, writes the data of at least one data type to be written for the target hardware acceleration engine, in parallel, into the corresponding data storage areas. A read control unit, connected to the first number of basic read ports of the local data storage buffer, reads the data of the target data type required by the target hardware acceleration engine, in parallel, from the data storage areas. The access device can flexibly configure the local data storage buffer according to the data volume of each data type, and it removes the prior-art restriction that the local data storage buffer only supports reads and writes of fixed-length data.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of an access device according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a write control unit according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a read control unit according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of an access method according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application, without creative effort, fall within the scope of protection of the present application.
The access device provided by the embodiments of the present invention may be installed on a server or on a terminal. The terminal may be user equipment (UE) such as a mobile phone, a smartphone, a notebook computer, a digital broadcast receiver, a personal digital assistant (PDA), or a tablet computer (PAD); a handheld device; a vehicle-mounted device; a wearable device; a computing device or other processing device connected to a wireless modem; a mobile station (MS); a mobile terminal; and so on.
As shown in FIG. 1, the access device may include a local data storage buffer, a write control unit, and a read control unit.
The local data storage buffer consists of data storage areas for different data types, formed from a first number K of basic storage modules with unified address encoding. Each basic storage module has a basic read port and a basic write port. The first number K may be determined according to the amount of data required for computation by the target hardware acceleration engine, or it may be preset by a technician; the embodiments of the present invention do not limit this.
The local data storage buffer is configured to store data of at least one data type corresponding to the target hardware acceleration engine, where the data of at least one data type includes both the data required for computation by the target hardware acceleration engine and the data produced by that computation.
It should be understood that the target hardware acceleration engine is only one example of a hardware module that needs to access the local data storage buffer. The local data storage buffer may also store data for other hardware modules that need to perform data access operations; the embodiments of the present invention do not limit this.
Optionally, if the target hardware acceleration engine is a convolutional neural network engine, the at least one data type includes a weight type, a feature-value type, and a convolution partial-sum type. That is, the data of the at least one data type consists of weight coefficients, images or feature values, and convolution partial-sum data. The weight coefficients and images may be obtained from memory. The convolution partial-sum data is produced by each computation of the target hardware acceleration engine, and the engine accumulates the partial sums from each computation to obtain the feature values.
To improve the flexibility of configuring the local data storage buffer, the corresponding data storage areas can be configured according to the data types of the data that the target hardware acceleration engine currently requires for computation and of the data it produces. The data storage areas of different data types may occupy different or equal numbers of basic storage modules.
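As a rough illustration of configuring data storage areas by data type, the following Python sketch assigns each data type a contiguous run of basic storage modules. The function name `allocate_regions`, the dictionary keys, and the contiguous in-order assignment policy are assumptions made for illustration and are not taken from the patent. The module counts (2, 13, 12) follow the first-layer example given later in the description.

```python
from typing import Dict, Tuple


def allocate_regions(k_modules: int, demand: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Assign each data type a contiguous range of basic storage modules.

    Returns a map from data type to (first_block, last_block), using the
    B1..BK numbering of the description. Hypothetical helper, not the
    patent's own allocation procedure.
    """
    if sum(demand.values()) > k_modules:
        raise ValueError("requested modules exceed the buffer size K")
    regions: Dict[str, Tuple[int, int]] = {}
    next_block = 1  # blocks are numbered from B1
    for data_type, count in demand.items():
        if count <= 0:
            raise ValueError(f"{data_type} needs at least one module")
        regions[data_type] = (next_block, next_block + count - 1)
        next_block += count
    return regions


# Assumed per-type demand for one network layer (K = 27 modules in total).
layer1 = allocate_regions(27, {"weights": 2, "image": 13, "partial_sums": 12})
print(layer1)
```

Calling the helper again with a different demand dictionary models reconfiguring the same K modules for another layer.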
The write control unit is connected to the first number K of basic write ports of the local data storage buffer, and the read control unit is connected to the first number K of basic read ports of the local data storage buffer.
The write control unit is configured to write data of at least one data type corresponding to the target hardware acceleration engine, in parallel, into the corresponding data storage areas.
The read control unit is configured to read data of the target data type required by the target hardware acceleration engine, in parallel, from the data storage areas of the local data storage buffer.
Optionally, the write control unit is further configured to determine the number of its write ports according to the number of types in the at least one data type.
The read control unit is further configured to determine the number of its read ports according to the number of types of the target data type to be read by the target hardware acceleration engine.
If the number of types in the at least one data type is a second number N, the write control unit has N write ports. As shown in FIG. 2, when the write control unit has N write ports, those N write ports are connected to the target hardware acceleration engine, and the first number K of output ports of the write control unit are connected one-to-one to the K basic write ports.
The write control unit may include N demultiplexers and K multiplexers. Each demultiplexer has one input port and K output ports, and each multiplexer has N input ports and one output port.
The N write ports of the write control unit are connected one-to-one to the input terminals of the N demultiplexers; the K output terminals of each demultiplexer are connected one-to-one to input terminals of the K multiplexers; the output terminal of each multiplexer is connected to the basic write port to be written in the corresponding data storage area.
Each demultiplexer determines the target basic storage region to be written according to the inter-block address of the basic storage module corresponding to the data type to be written.
Each multiplexer determines, according to the target basic storage region to be written, the basic write port of the basic storage block corresponding to that region.
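The two-stage write path described above (N demultiplexers feeding K multiplexers) can be modeled behaviorally. The Python sketch below is a software illustration only, not a hardware description: the names `route_writes` and `WriteRequest`, the zero-based inter-block addresses, and the choice to raise an error when two ports target the same module are all assumptions, since the text does not specify conflict handling.

```python
from typing import List, Optional, Tuple

# (inter-block address, data word) for an active write port, or None when idle.
WriteRequest = Optional[Tuple[int, int]]


def route_writes(requests: List[WriteRequest], k_modules: int) -> List[Optional[int]]:
    """Return the data word presented to each of the K basic write ports."""
    n_ports = len(requests)

    # Stage 1: each 1-to-K demultiplexer steers its write port to one module.
    demux_out: List[List[Optional[int]]] = [[None] * k_modules for _ in range(n_ports)]
    for port, request in enumerate(requests):
        if request is None:
            continue
        block, data = request
        if not 0 <= block < k_modules:
            raise ValueError(f"inter-block address {block} out of range")
        demux_out[port][block] = data

    # Stage 2: each N-to-1 multiplexer selects the single active input.
    module_in: List[Optional[int]] = [None] * k_modules
    for block in range(k_modules):
        active = [p for p in range(n_ports) if demux_out[p][block] is not None]
        if len(active) > 1:
            # Assumed policy: simultaneous writes to one module are an error.
            raise ValueError(f"ports {active} target module {block} together")
        if active:
            module_in[block] = demux_out[active[0]][block]
    return module_in


# Three write ports (N = 3) writing to three different modules of K = 27.
writes = route_writes([(0, 0x1111), (2, 0x2222), (15, 0x3333)], k_modules=27)
```

Because each port lands on a different module in this example, all three writes proceed in the same step, which is the parallel-write behavior the text describes.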
If the number of data types to be read by the target hardware acceleration engine is a third number M, the read control unit has M read ports. As shown in FIG. 3, when the read control unit has M read ports, those M read ports are connected to the target hardware acceleration engine, and the first number K of input ports of the read control unit are connected one-to-one to the K basic read ports.
The read control unit may include M multiplexers and K demultiplexers. Each multiplexer has K input ports and one output port, and each demultiplexer has one input port and M output ports.
The M read ports of the read control unit are connected one-to-one to the input terminals of the M multiplexers; the K input terminals of each multiplexer are connected one-to-one to output terminals of the K demultiplexers; the input terminal of each demultiplexer is connected to the basic read port to be read in the corresponding data storage area.
Each multiplexer determines the target basic storage region to be read according to the basic read port of the basic storage block corresponding to the data type to be read.
Each demultiplexer determines, according to the target basic storage region to be read, the basic read port of the basic storage block corresponding to that region.
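The read path mirrors the write path: K demultiplexers on the module side feed M multiplexers on the engine side. The following Python sketch models that routing in software under stated assumptions: `route_reads` is a hypothetical name, module indices are zero-based, each module is assumed to present one word per step, and two ports reading the same module in one step is treated as an error because a 1-to-M demultiplexer drives only one output.

```python
from typing import List, Optional


def route_reads(module_out: List[int], port_select: List[Optional[int]]) -> List[Optional[int]]:
    """Return the word delivered to each of the M read ports.

    module_out[b] is the word currently presented by basic read port b;
    port_select[p] is the module index that read port p wants, or None.
    """
    k_modules = len(module_out)
    m_ports = len(port_select)

    # Stage 1: each 1-to-M demultiplexer steers its module to one read port.
    demux_out: List[List[Optional[int]]] = [[None] * m_ports for _ in range(k_modules)]
    claimed = set()
    for port, block in enumerate(port_select):
        if block is None:
            continue
        if not 0 <= block < k_modules:
            raise ValueError(f"module index {block} out of range")
        if block in claimed:
            # Assumed policy: one module serves at most one port per step.
            raise ValueError(f"module {block} requested by two read ports")
        claimed.add(block)
        demux_out[block][port] = module_out[block]

    # Stage 2: each K-to-1 multiplexer picks the module selected for its port.
    return [None if block is None else demux_out[block][port]
            for port, block in enumerate(port_select)]


# Four read ports (M = 4) reading four different modules of K = 27.
modules = [100 + b for b in range(27)]
reads = route_reads(modules, [0, 2, 15, 26])
print(reads)  # [100, 102, 115, 126]
```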
It should be noted that each demultiplexer and each multiplexer is configured independently.
Further, a convolutional neural network engine generally has multiple network layers that require convolution computation. Because the layers differ in the number of neural network nodes and in the weight coefficients of each node, the computed data of each data type requires a different amount of storage in each layer, that is, a different number of basic storage modules.
The local data storage buffer therefore supports dynamic size allocation: as the requirements change from one convolutional network layer to the next, the sizes of the functional areas of the local data buffer can be changed dynamically and in time, so that different kinds of data get storage areas of the sizes they need. This makes the configuration of the local data storage buffer flexible.
在一个例子中,以目标硬件加速引擎为卷积神经网络引擎为例,对于写入控制单元的写入端口,若待写入的数据类型为:权值系数,图像或特征值,卷积部分和数据,则写入控制单元具有三个写入端口,即N=3,分别对应三种类型的数据写入。对于读取控制单元读取端口,若待读取的数据类型为:权值系数,图像或特征值,卷积部分和数据和最终卷积累加和,则读取控制单元具有四个读取端口,即M=4,分别对应四种类型的数据读取。In an example, taking the target hardware acceleration engine as a convolutional neural network engine as an example, for the write port of the write control unit, if the type of data to be written is: weight coefficient, image or feature value, convolution part and data, the write control unit has three write ports, namely N=3, corresponding to three types of data write. For the read control unit read port, if the data type to be read is: weight coefficient, image or feature value, convolution part and data and final volume accumulation and sum, the read control unit has four read ports , that is, M=4, corresponding to four types of data read.
本地数据存储缓冲区共需要108KB,基本存储模块的大小为4KB(一般FPGA的一个基本Block RAM的大小为4KB,这样便于使用FPGA的Block RAM资源进行验证),数据位宽为16bit,这样第一数量K的取值为27。基本存储模块内地址位宽为11bit,基本存储模块间地址位宽为5bit(2^5=32>27)即配置地址从00000b到11010b。The local data storage buffer requires a total of 108KB, the size of the basic memory module is 4KB (generally, the size of a basic block RAM of an FPGA is 4KB, which is convenient to use the block RAM resources of the FPGA for verification), and the data bit width is 16bit, so that the first The value of the quantity K is 27. The address bit width in the basic storage module is 11 bits, and the address bit width between the basic storage modules is 5 bits (2^5=32>27), that is, the configuration address is from 00000b to 11010b.
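The arithmetic behind these figures can be checked with a short sketch. The function name `buffer_geometry` and its parameters are illustrative assumptions, not part of the patent; the sketch only restates the sizes given above.

```python
import math


def buffer_geometry(total_bytes, module_bytes, data_width_bits):
    """Return (K, intra-module address bits, inter-module address bits).

    Hypothetical helper: it reproduces the sizing arithmetic of the example.
    """
    k = math.ceil(total_bytes / module_bytes)            # number of basic storage modules
    words_per_module = module_bytes * 8 // data_width_bits
    intra_bits = (words_per_module - 1).bit_length()     # bits to address one word in a module
    inter_bits = (k - 1).bit_length()                    # bits to select one of K modules
    return k, intra_bits, inter_bits


k, intra_bits, inter_bits = buffer_geometry(108 * 1024, 4 * 1024, 16)
last_module = format(k - 1, f"0{inter_bits}b")           # highest inter-module address
print(k, intra_bits, inter_bits, last_module + "b")
```

With 108KB, 4KB modules and 16-bit words, each module holds 2048 words, which is why the intra-module address needs 11 bits.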
The write control unit includes 3 demultiplexers and 27 3-to-1 multiplexers, where each demultiplexer is a 1-to-27 demultiplexer, i.e. each demultiplexer has 27 output ports.
The read control unit includes 4 multiplexers and 27 1-to-4 demultiplexers, where each multiplexer is a 27-to-1 multiplexer.
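As a rough illustration, the component counts above follow directly from N, M and K. The sketch below is an assumption-laden summary, not the patent's circuit: the function names are hypothetical, and the select-width calculation is inferred from the two-bit configuration parameters used in this example.

```python
def select_bits(ways):
    """Bits needed to choose among `ways` alternatives (assumed binary encoding)."""
    return max(1, (ways - 1).bit_length())


def control_units(n_write_ports, m_read_ports, k_modules):
    """Summarise the multiplexer/demultiplexer structure of both control units."""
    return {
        "write": {
            "demultiplexers": n_write_ports,           # one 1-to-K demultiplexer per write port
            "outputs_per_demultiplexer": k_modules,
            "multiplexers": k_modules,                  # one N-to-1 multiplexer per module
            "inputs_per_multiplexer": n_write_ports,
            "select_bits": select_bits(n_write_ports),
        },
        "read": {
            "multiplexers": m_read_ports,               # one K-to-1 multiplexer per read port
            "inputs_per_multiplexer": k_modules,
            "demultiplexers": k_modules,                # one 1-to-M demultiplexer per module
            "outputs_per_demultiplexer": m_read_ports,
            "select_bits": select_bits(m_read_ports),
        },
    }


units = control_units(n_write_ports=3, m_read_ports=4, k_modules=27)
print(units["write"])
print(units["read"])
```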
Further, taking the data required for the convolution calculations of two network layers as an example:
Configuration of the local data storage buffer for the first network layer:
Configuration of data types in the local data storage buffer:
The space requirements of each input data type are as follows:
The input weight coefficients are allocated to the Block1 (hereinafter B1) and B2 basic storage module spaces;
The input image data is allocated to the B3 to B15 basic storage module spaces;
The input convolution partial-sum data is allocated to the B16 to B27 basic storage module spaces;
The weight coefficient output port can read the input weight coefficients of the first network layer only from the B1 and B2 basic storage module spaces;
The image or feature value output port can read the input image data of the first network layer only from the B3 to B15 basic storage module spaces;
The convolution partial-sum data output port can read, only from the B16 to B27 basic storage module spaces, the convolution partial-sum data generated during the convolution calculation of the first network layer, for use in that layer's subsequent convolution calculations;
The final convolution accumulated-sum output port can read, only from the B16 to B27 basic storage module spaces, the final convolution sum formed after all convolution calculations of the first network layer are complete. This port and the convolution partial-sum data output port read the B16 to B27 basic storage module spaces on a time-shared basis.
Configuration of input ports in the local data storage buffer:
The corresponding input ports are the weight coefficient input port, the image data input port, and the convolution partial-sum data input port.
(1) The inter-block address input range of the weight coefficient input port is 00000b to 00001b. Combined with the 11-bit intra-block address 000h to 7ffh, the weight coefficient port writes to local data storage buffer addresses 0000h to 0fffh. The input ports of the 1st and 2nd 3-to-1 multiplexers should be connected to the output ports of the demultiplexer that outputs the weight coefficients, and the configuration parameter is 00b;
(2) The inter-block address input range of the image data input port is 00010b to 01110b. Combined with the 11-bit intra-block address, the image port writes to local data storage buffer addresses 1000h to 77ffh. The input ports of the 3rd through 15th 3-to-1 multiplexers should be connected to the output ports of the demultiplexer that outputs the image data, and the configuration parameter is 01b;
(3) The inter-block address input range of the convolution partial-sum data input port is 01111b to 11010b. Combined with the 11-bit intra-block address, the convolution partial-sum port writes to local data storage buffer addresses 7800h to D7ffh. The input ports of the 16th through 27th 3-to-1 multiplexers should be connected to the output ports of the demultiplexer that outputs the convolution partial-sum data, and the configuration parameter is 10b;
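The address ranges and multiplexer parameters listed above can be derived mechanically from the block allocation. The following is a minimal sketch under the assumption that block Bn has the zero-based inter-block address n-1; the names `LAYER1_WRITE`, `address_range` and `mux_select_table` are hypothetical.

```python
INTRA_BITS = 11  # intra-block address width from the example (2048 words per block)

# (data type, first block, last block, multiplexer configuration parameter)
# Blocks are numbered B1..B27 as in the text; the inter-block address is zero-based.
LAYER1_WRITE = [
    ("weights", 1, 2, 0b00),
    ("image", 3, 15, 0b01),
    ("partial_sum", 16, 27, 0b10),
]


def address_range(first_block, last_block):
    """Return the (start, end) word addresses covered by blocks first..last."""
    start = (first_block - 1) << INTRA_BITS
    end = ((last_block - 1) << INTRA_BITS) | ((1 << INTRA_BITS) - 1)
    return start, end


def mux_select_table(allocation, k=27):
    """Map each block number to the parameter of its 3-to-1 write multiplexer."""
    select = {}
    for _name, first, last, param in allocation:
        for block in range(first, last + 1):
            if block in select:
                raise ValueError(f"block B{block} allocated twice")
            select[block] = param
    if len(select) != k:
        raise ValueError("allocation does not cover every block exactly once")
    return select


table = mux_select_table(LAYER1_WRITE)
for name, first, last, param in LAYER1_WRITE:
    start, end = address_range(first, last)
    print(f"{name}: {start:04x}h to {end:04x}h, parameter {param:02b}b")
```

Checking that every block is allocated exactly once is a design choice of this sketch; the text does not state how an invalid configuration would be handled.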
Configuration of output ports in the local data storage buffer:
The corresponding output ports are the weight coefficient output port, the image or feature value output port, the convolution partial-sum data output port, and the final convolution accumulated-sum output port.
(1) The inter-block address input range of the weight coefficient output port is 00000b to 00001b. Combined with the 11-bit intra-block address 000h to 7ffh, the weight coefficient output port reads weight coefficients from the storage space at local data storage buffer addresses 0000h to 0fffh. The output ports of the 1st and 2nd 1-to-4 demultiplexers are connected to the 1st and 2nd input ports, respectively, of the 27-to-1 multiplexer that outputs the weight coefficients, and the configuration parameter is 00b;
(2) The inter-block address input range of the image or feature value output port is 00010b to 01110b. Combined with the 11-bit intra-block address, the image or feature value output port reads image data from the storage space at local data storage buffer addresses 1000h to 77ffh. The output ports of the 3rd through 15th 1-to-4 demultiplexers are connected to the 3rd through 15th input ports, respectively, of the 27-to-1 multiplexer that outputs the image data, and the configuration parameter is 01b;
(3) The inter-block address input range of the convolution partial-sum data output port is 01111b to 11010b. Combined with the 11-bit intra-block address, the convolution partial-sum data output port reads convolution partial-sum data from the storage space at local data storage buffer addresses 7800h to D7ffh. The output ports of the 16th through 27th 1-to-4 demultiplexers are connected to the 16th through 27th input ports, respectively, of the 27-to-1 multiplexer that outputs the convolution partial sums, and the configuration parameter is 10b;
(4) The inter-block address input range of the final convolution accumulated-sum output port is 01111b to 11010b. Combined with the 11-bit intra-block address, the final convolution accumulated-sum output port reads the final convolution sum data from the storage space at local data storage buffer addresses 7800h to D7ffh. Because this port is time-multiplexed with the convolution partial-sum data output port, the output ports of the 16th through 27th 1-to-4 demultiplexers are at this point reconfigured to connect to the 16th through 27th input ports, respectively, of the 27-to-1 multiplexer that outputs the final convolution accumulated sum, and the configuration parameter is 11b.
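The time-sharing between the partial-sum port and the final-sum port amounts to rewriting the parameter of the demultiplexers for B16 to B27. The sketch below models that in software only; the function names are hypothetical and the model says nothing about hardware timing.

```python
PARTIAL_SUM, FINAL_SUM = 0b10, 0b11  # configuration parameters from the text


def read_select_table():
    """Demultiplexer parameter per block (B1..B27) for the first network layer."""
    table = {}
    table.update({b: 0b00 for b in range(1, 3)})            # weight coefficients
    table.update({b: 0b01 for b in range(3, 16)})           # image or feature values
    table.update({b: PARTIAL_SUM for b in range(16, 28)})   # convolution partial sums
    return table


def reconfigure(table, first, last, param):
    """Return a copy of the table with blocks first..last routed to another port."""
    updated = dict(table)
    for block in range(first, last + 1):
        updated[block] = param
    return updated


def readable_blocks(table, param):
    """Blocks whose demultiplexer currently routes data to the given read port."""
    return [b for b, p in sorted(table.items()) if p == param]


during_convolution = read_select_table()
after_convolution = reconfigure(during_convolution, 16, 27, FINAL_SUM)
print(readable_blocks(during_convolution, PARTIAL_SUM))
print(readable_blocks(after_convolution, FINAL_SUM))
```

In this model a block is routed to exactly one read port at a time, so the two ports that share B16 to B27 can never read them simultaneously.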
Configuration of the local data storage buffer for the second network layer:
Configuration of data types in the local data storage buffer:
The space requirements of each input data type are as follows:
The input weight coefficients are allocated to the B1, B2, B3 and B4 basic storage module spaces;
The input image data is allocated to the B5 to B16 basic storage module spaces;
The input convolution partial-sum data is allocated to the B17 to B27 basic storage module spaces;
The weight coefficient output port can read the input weight coefficients of the second network layer only from the B1, B2, B3 and B4 basic storage module spaces;
The image or feature value output port can read the input feature data of the second network layer only from the B5 to B16 basic storage module spaces;
The convolution partial-sum data output port can read, only from the B17 to B27 basic storage module spaces, the convolution partial-sum data generated during the convolution calculation of the second network layer, for use in that layer's subsequent convolution calculations;
The final convolution accumulated-sum output port can read, only from the B17 to B27 basic storage module spaces, the final convolution sum formed after all convolution calculations of the second network layer are complete. This port and the convolution partial-sum data output port read the B17 to B27 basic storage module spaces on a time-shared basis.
Configuration of input ports in the local data storage buffer:
The corresponding input ports are the weight coefficient input port, the image data input port, and the convolution partial-sum data input port.
(1) The inter-block address input range of the weight coefficient input port is 00000b to 00011b. Combined with the 11-bit intra-block address 000h to 7ffh, the weight coefficient port writes to local data storage buffer addresses 0000h to 1fffh. The input ports of the 1st through 4th 3-to-1 multiplexers are connected to the output ports of the demultiplexer that outputs the weight coefficients, and the configuration parameter is 00b;
(2) The inter-block address input range of the image data input port is 00100b to 01111b. Combined with the 11-bit intra-block address, the image port writes to local data storage buffer addresses 2000h to 7fffh. The input ports of the 5th through 16th 3-to-1 multiplexers should be connected to the output ports of the demultiplexer that outputs the image data, and the configuration parameter is 01b;
(3) The inter-block address input range of the convolution partial-sum data input port is 10000b to 11010b. Combined with the 11-bit intra-block address, the convolution partial-sum port writes to local data storage buffer addresses 8000h to D7ffh. The input ports of the 17th through 27th 3-to-1 multiplexers should be connected to the output ports of the demultiplexer that outputs the convolution partial-sum data, and the configuration parameter is 10b;
Configuration of output ports in the local data storage buffer:
The corresponding output ports are the weight coefficient output port, the image or feature value output port, the convolution partial-sum data output port, and the final convolution accumulated-sum output port.
(1) The inter-block address input range of the weight coefficient output port is 00000b to 00011b. Combined with the 11-bit intra-block address 000h to 7ffh, the weight coefficient output port reads weight coefficients from the storage space at local data storage buffer addresses 0000h to 1fffh. The output ports of the 1st through 4th 1-to-4 demultiplexers are connected to the 1st through 4th input ports, respectively, of the 27-to-1 multiplexer that outputs the weight coefficients, and the configuration parameter is 00b;
(2) The inter-block address input range of the image or feature value output port is 00100b to 01111b. Combined with the 11-bit intra-block address, the image or feature value output port reads image data from the storage space at local data storage buffer addresses 2000h to 7fffh. The output ports of the 5th through 16th 1-to-4 demultiplexers are connected to the 5th through 16th input ports, respectively, of the 27-to-1 multiplexer that outputs the image data, and the configuration parameter is 01b;
(3) The inter-block address input range of the convolution partial-sum data output port is 10000b to 11010b. Combined with the 11-bit intra-block address, the convolution partial-sum data output port reads convolution partial-sum data from the storage space at local data storage buffer addresses 8000h to D7ffh. The output ports of the 17th through 27th 1-to-4 demultiplexers are connected to the 17th through 27th input ports, respectively, of the 27-to-1 multiplexer that outputs the convolution partial-sum data, and the configuration parameter is 10b;
(4) The inter-block address input range of the final convolution accumulated-sum output port is 10000b to 11010b. Combined with the 11-bit intra-block address, the final convolution accumulated-sum output port reads the final convolution sum data from the storage space at local data storage buffer addresses 8000h to D7ffh. Because this port is time-multiplexed with the convolution partial-sum data output port, the output ports of the 17th through 27th 1-to-4 demultiplexers are at this point reconfigured to connect to the 17th through 27th input ports, respectively, of the 27-to-1 multiplexer that outputs the final convolution accumulated sum, and the configuration parameter is 11b.
It can be seen that the access device provided by the embodiment of the present invention can flexibly configure the local data storage buffer according to the amount of data of each data type, and can overcome the limitation in the prior art that a local data storage buffer supports read and write operations only on fixed-length data.
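To make the flexibility concrete, the two layer layouts described in this example can be compared block by block. The sketch below is illustrative only; the dictionary and function names are hypothetical, and the block ranges are the ones given for the first and second network layers.

```python
# Block ranges (first, last) per data type, numbered B1..B27 as in the example.
LAYER1 = {"weights": (1, 2), "image": (3, 15), "partial_sum": (16, 27)}
LAYER2 = {"weights": (1, 4), "image": (5, 16), "partial_sum": (17, 27)}


def owner_by_block(layout):
    """Expand a layout into a block -> data type mapping."""
    return {
        block: name
        for name, (first, last) in layout.items()
        for block in range(first, last + 1)
    }


def reassigned_blocks(old_layout, new_layout):
    """Blocks whose data type changes when moving from one layer to the next."""
    old_owner = owner_by_block(old_layout)
    new_owner = owner_by_block(new_layout)
    return {
        block: (old_owner[block], new_owner[block])
        for block in sorted(old_owner)
        if old_owner[block] != new_owner[block]
    }


changes = reassigned_blocks(LAYER1, LAYER2)
for block, (before, after) in changes.items():
    print(f"B{block}: {before} -> {after}")
```

Under these layouts only three blocks change owner between the layers, so only the multiplexers and demultiplexers attached to those blocks need a new configuration parameter.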
Corresponding to the access device described above, an embodiment of the present invention further provides an access method. As shown in FIG. 4, the method is executed by the access device and includes:
S410: Write data of at least one data type corresponding to the target hardware acceleration engine into the data storage area of the local data storage buffer in parallel.
S420: Store the data of the at least one data type corresponding to the target hardware acceleration engine.
The local data storage buffer is a data storage area for different data types, composed of a first number K of basic storage modules with unified address encoding, and is used to store data of at least one data type computed by the target hardware acceleration engine.
The first number K is determined according to the amount of data required for the calculations of the target hardware acceleration engine, and each basic storage module has a basic read port and a basic write port.
S430: Read, in parallel from the data storage area of the local data storage buffer, data of the target data type to be read by the target hardware acceleration engine.
In an optional implementation, the number of write ports of the write control unit in the access device is determined according to the number of the at least one data type corresponding to the target hardware acceleration engine;
and the number of read ports of the read control unit in the access device is determined according to the number of target data types to be read by the target hardware acceleration engine.
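The steps S410 to S430 can be pictured with a toy software model. This is a sketch under stated assumptions, not the claimed hardware: the class `LocalBuffer` and its methods are hypothetical, the loops run sequentially where the device would act in parallel, and the layouts reuse the first-layer block ranges from the example.

```python
class LocalBuffer:
    """Toy model of K basic storage modules behind per-type write and read ports."""

    def __init__(self, k_modules, words_per_module, write_layout, read_layout):
        self.words = words_per_module
        self.modules = [[0] * words_per_module for _ in range(k_modules)]
        self.write_layout = write_layout         # data type -> (first block, last block)
        self.read_layout = read_layout
        self.n_write_ports = len(write_layout)   # one write port per written data type
        self.m_read_ports = len(read_layout)     # one read port per read data type

    def _locate(self, layout, data_type, address):
        block, offset = divmod(address, self.words)
        first, last = layout[data_type]
        if not first <= block + 1 <= last:       # blocks are numbered from B1
            raise PermissionError(f"{data_type} port cannot access B{block + 1}")
        return block, offset

    def write(self, items):
        """S410/S420: accept one (data_type, address, value) per write port and store it."""
        for data_type, address, value in items:
            block, offset = self._locate(self.write_layout, data_type, address)
            self.modules[block][offset] = value

    def read(self, requests):
        """S430: serve one (data_type, address) request per read port."""
        values = []
        for data_type, address in requests:
            block, offset = self._locate(self.read_layout, data_type, address)
            values.append(self.modules[block][offset])
        return values


areas = {"weights": (1, 2), "image": (3, 15), "partial_sum": (16, 27)}
buf = LocalBuffer(27, 2048, areas, {**areas, "final_sum": (16, 27)})
buf.write([("weights", 0x0000, 7), ("image", 0x1000, 9), ("partial_sum", 0x7800, 11)])
values = buf.read([("weights", 0x0000), ("image", 0x1000), ("final_sum", 0x7800)])
print(buf.n_write_ports, buf.m_read_ports, values)
```

Rejecting an access outside a port's configured blocks is a choice made for this sketch; the text only states which blocks each port can use.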
The functions of the functional units of the access method provided by the above embodiment of the present invention can be implemented by the units described above. Therefore, the specific working process and beneficial effects of each method step in the access method provided by the embodiment of the present invention are not repeated here.
An embodiment of the present invention further provides an electronic device. As shown in FIG. 5, it includes a processor 510, a communication interface 520, a memory 530 and a communication bus 540, where the processor 510, the communication interface 520 and the memory 530 communicate with one another through the communication bus 540.
The memory 530 is used to store a computer program;
The processor 510, when executing the program stored in the memory 530, implements the following steps:
writing data of at least one data type corresponding to the target hardware acceleration engine, in parallel, into the data storage area of the local data storage buffer in the access device;
storing the data of the at least one data type corresponding to the target hardware acceleration engine;
or reading, in parallel from the data storage area of the local data storage buffer, data of the target data type to be read by the target hardware acceleration engine.
In an optional implementation, the number of write ports of the write control unit in the access device is determined according to the number of the at least one data type corresponding to the target hardware acceleration engine;
and the number of read ports of the read control unit in the access device is determined according to the number of target data types to be read by the target hardware acceleration engine.
The communication bus mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration it is drawn as a single thick line in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Since the way each component of the electronic device in the above embodiment solves the problem, and the resulting beneficial effects, can be understood by referring to the steps of the embodiment shown in FIG. 4, the specific working process and beneficial effects of the electronic device provided by the embodiment of the present invention are not repeated here.
In yet another embodiment provided by the present invention, a computer-readable storage medium is also provided. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to execute the access method described in any one of the above embodiments.
In yet another embodiment provided by the present invention, a computer program product containing instructions is also provided which, when run on a computer, causes the computer to execute the access method described in any one of the above embodiments.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.
Obviously, those skilled in the art can make various changes and variations to the embodiments of the present application without departing from their spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to include them.
Claims (7)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010267401.9A CN111142808B (en) | 2020-04-08 | 2020-04-08 | Access device and access method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111142808A CN111142808A (en) | 2020-05-12 |
| CN111142808B | 2020-08-04 |
Family
ID=70528815
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010267401.9A Expired - Fee Related CN111142808B (en) | 2020-04-08 | 2020-04-08 | Access device and access method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111142808B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114442908B (en) * | 2020-11-05 | 2023-08-11 | 珠海一微半导体股份有限公司 | Hardware acceleration system and chip for data processing |
Family Cites Families (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101330433A (en) * | 2007-06-20 | 2008-12-24 | 中兴通讯股份有限公司 | Ethernet equipment shared buffer management method and device based on transmission network |
| CN101776988B (en) * | 2010-02-01 | 2012-11-07 | 中国人民解放军国防科学技术大学 | Restructurable matrix register file with changeable block size |
| EP3035204B1 (en) * | 2014-12-19 | 2018-08-15 | Intel Corporation | Storage device and method for performing convolution operations |
| CN105808454A (en) * | 2014-12-31 | 2016-07-27 | 北京东土科技股份有限公司 | Method and device for accessing to shared cache by multiple ports |
| US10572225B1 (en) * | 2018-09-26 | 2020-02-25 | Xilinx, Inc. | Circuit arrangements and methods for performing multiply-and-accumulate operations |
| CN109740739B (en) * | 2018-12-29 | 2020-04-24 | 中科寒武纪科技股份有限公司 | Neural network computing device, neural network computing method and related products |
| CN110390385B (en) * | 2019-06-28 | 2021-09-28 | 东南大学 | BNRP-based configurable parallel general convolutional neural network accelerator |
| CN110751263B (en) * | 2019-09-09 | 2022-07-01 | 瑞芯微电子股份有限公司 | High-parallelism convolution operation access method and circuit |
| US11726950B2 (en) * | 2019-09-28 | 2023-08-15 | Intel Corporation | Compute near memory convolution accelerator |
| CN110880038B (en) * | 2019-11-29 | 2022-07-01 | 中国科学院自动化研究所 | FPGA-based system for accelerating convolution computing, convolutional neural network |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200804 |
