
CN114489809A - High-throughput many-core data stream processor and its task execution method - Google Patents

High-throughput many-core data stream processor and its task execution method

Info

Publication number
CN114489809A
Authority
CN
China
Prior art keywords
sub
simd
processing units
data
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111673269.2A
Other languages
Chinese (zh)
Inventor
李文明
安述倩
吴海彬
刘艳欢
吴萌
叶笑春
范东睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN202111673269.2A
Publication of CN114489809A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17312Routing techniques specific to parallel machines, e.g. wormhole, store and forward, shortest path problem congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356Indirect interconnection networks
    • G06F15/17362Indirect interconnection networks hierarchical topologies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356Indirect interconnection networks
    • G06F15/17368Indirect interconnection networks non hierarchical topologies
    • G06F15/17381Two dimensional, e.g. mesh, torus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7825Globally asynchronous, locally synchronous, e.g. network on chip
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
    • G06F15/8023Two dimensional arrays, e.g. mesh, torus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8046Systolic arrays
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30032Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30101Special purpose registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3889Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Multi Processors (AREA)

Abstract

The present invention provides a high-throughput many-core data stream processor, comprising: a plurality of processing units communicatively connected to one another to form the processor's network-on-chip; each processing unit includes a plurality of sub-processing units, each containing an instruction sub-memory and a data sub-memory, the sub-processing units being arranged in an array and communicatively connected to one another to form the processing unit's multi-hop network; and a configuration unit communicatively connected to each sub-processing unit. A task execution method for the high-throughput many-core data stream processor is also provided. Compared with the prior art, the invention offers better scalability and simple control logic, and is suitable for large-scale many-core structures. It also supports configurable SIMD-MIMD-Systolic modes, configurable scale, and configurable regions, making it more flexible and applicable to more general application domains.

Description

High-throughput many-core data stream processor and its task execution method

Technical Field

The present invention relates to the field of processor architecture design, and in particular to a many-core processor architecture based on a dataflow execution model.

Background Art

A single-instruction, multiple-data (SIMD) machine uses one instruction stream to process multiple data streams. Such machines are very effective in fields such as digital signal processing, image processing, and multimedia processing. The MMX, SSE (Streaming SIMD Extensions), SSE2, and SSE3 instruction set extensions implemented in Intel processors can all process multiple data elements in a single clock cycle; in this sense, the single-core computers in use today are essentially SIMD machines. A multiple-instruction, multiple-data (MIMD) machine can execute several instruction streams simultaneously, each operating on a different data stream. The latest multi-core computing platforms fall into the MIMD category; for example, Intel's and AMD's multi-core and many-core processors are MIMD machines. In computer architecture there are two paths to data parallelism: MIMD and SIMD. MIMD mainly takes the forms of multiple issue, multithreading, and multiple cores, all of which appear in contemporary performance-driven processor designs. At the same time, with the rise of multimedia, big data, artificial intelligence, and other applications, giving processors SIMD capability has become increasingly important, because these applications contain large numbers of fine-grained, homogeneous, independent data operations for which SIMD is naturally suited.

Because application algorithms evolve and iterate quickly and target different data-processing needs, in many common applications some code is well suited to SIMD structures while other code performs better under MIMD. To improve the generality of processor chips, patents and papers have proposed configurable SIMD/MIMD implementations: for example, multi-core and many-core structures that dynamically configure SIMD or MIMD execution according to the characteristics of the application algorithm, or methods that realize an MIMD computing mode by controlling an existing SIMD processing structure.

However, the prior art has the following problems. First, it does not consider a SIMD-MIMD structure design for large-scale many-core arrays with dataflow execution characteristics. Second, flexibility is insufficient and the computing arrays are relatively small, which leads to poor scalability; when the array is large, the control structure becomes complex. Third, it does not consider dynamically matching the structure configuration to the computational characteristics of dataflow-graph nodes, such as the need for SIMD of different granularities; for dataflow many-core processor architectures there is no effective SIMD-MIMD structure design. Fourth, the SIMD-MIMD structure cannot be dynamically adjusted according to the computational characteristics of dataflow-graph nodes. Finally, the structure cannot be dynamically configured into a systolic-array (Systolic) mode.

Summary of the Invention

In view of the above problems, the present invention proposes a network-on-chip architecture for a dataflow many-core processor, comprising: a plurality of processing units communicatively connected to one another to form the processor's network-on-chip; each processing unit includes a plurality of sub-processing units, each containing an instruction sub-memory and a data sub-memory, the sub-processing units being arranged in an array and communicatively connected to one another to form the processing unit's multi-hop network; and a configuration unit communicatively connected to each sub-processing unit.

In the processor of the present invention, when a single-instruction multiple-data (SIMD) task is executed, the configuration unit divides the processing units into SIMD task groups according to the task requirements and sends the instruction code to the sub-processing units of the processing units of a SIMD task group so that they execute the SIMD task. Each processing unit receives configuration information from the configuration unit and, according to that information, configures all of its sub-processing units and their sub-routers. One sub-processing unit of a processing unit is designated as the data-transfer unit, which loads or stores data for all sub-processing units of the processing unit through memory access commands. A SIMD task group designates one sub-processing unit as its master control unit; when memory requests and/or data transfers occur between two SIMD task groups, the exchanged data is transmitted through the master control unit of each group. The configuration unit includes shuffle registers: when a shuffle operation is performed on a SIMD task group, the vector data of the shuffle instruction is read into the shuffle registers and written to the target sPEs to complete the data exchange.

In the processor of the present invention, when a multiple-instruction multiple-data (MIMD) task is executed, the configuration unit configures each sub-processing unit to run independently; each sub-processing unit receives its own control commands, instructions, and operand data, and processes an independent DFG or a part of a DFG. The sub-processing units communicate with one another only through their sub-routers.

In the processor of the present invention, when a systolic-array task is executed, the configuration unit configures the sub-processing units of several processing units into a systolic array, and configures the sub-processing units in the last row or last column along the data outflow direction of the array as accumulators.

The present invention also proposes a task execution method for executing dataflow program tasks on the many-core processor described above. The method comprises: receiving and compiling a user program, dividing the user program into program blocks, and generating a dataflow graph; determining the execution mode of each program block and marking it as single-instruction multiple-data mode, multiple-instruction multiple-data mode, or systolic-array mode; configuring execution units for the program block and distributing the program block to the configured execution units; setting the configured execution units to the task execution mode corresponding to the program block; and executing the program block on the configured execution units.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the dataflow many-core processor architecture of the present invention.

FIG. 2 shows the internal sPE layout and wiring of a single PE of the present invention.

FIG. 3 is a schematic diagram of the domain division for different SIMD configurations of the present invention.

FIG. 4 is a schematic diagram of the sPE data transmission paths of the present invention.

FIG. 5 is a schematic diagram of the sPE configuration and instruction paths of the present invention.

FIG. 6 is an execution flow chart of the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

The purpose of the present invention is to overcome the shortcomings of the prior art described above, such as the poor scalability of existing SIMD-MIMD designs and their unsuitability for dataflow many-core architectures, by proposing a highly flexible and configurable SIMD-MIMD-Systolic configuration method and system.

The dataflow many-core processor architecture of the present invention is designed for dataflow many-core processor structures and can flexibly configure the SIMD-MIMD width, scale, region, and so on. It realizes a dataflow many-core architecture that is flexibly configurable by software commands: SIMD-MIMD parameters are configurable, which improves execution for different application types and thus the energy-efficiency ratio; a systolic-array mode can be configured, which executes application types such as matrix multiplication more efficiently; the design and configuration of the flexible, extensible data, instruction, and configuration networks support large-scale structural scalability with a simple control structure and low hardware overhead; and a scalable, SIMD-scale shuffle operation is supported, which gives high execution efficiency for matrix-type operations.

While researching high-throughput general-purpose dataflow many-core processors, the inventors found that the low processing energy efficiency (performance per watt) of the prior art stems from under-utilization of hardware resources, that is, from a mismatch between the computational characteristics provided by the hardware and those required by the application. Because the target is general-purpose data processing, the applications encountered vary widely in data volume, data computation patterns, and other characteristics. After studying the characteristics of general-purpose data-processing applications and architectures, the inventors found that this defect can be addressed by variable SIMD width, variable SIMD computing-unit scale, variable MIMD scale, and even systolic arrays. Some designs based on configurable SIMD-MIMD already exist; building on this idea, the present invention further explores a design that is more flexible and efficient, highly scalable, and suited to a dataflow many-core processor structure.

To support multiple execution modes, the present invention proposes a configurable NoC (Network-on-Chip) architecture. FIG. 1 is a schematic diagram of the dataflow many-core processor architecture of the present invention, which includes a DMA engine, on-chip storage, a configuration unit, and PE processing units. In the present invention, each PE contains multiple sub-PEs (sub-processing units, hereinafter sPEs); each sPE can execute tasks on its own and has complete instruction-execution functionality, including fetch, decode, issue, execute, and commit.
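
The hierarchy just described can be made concrete with a small structural sketch in C++. This is a minimal sketch for illustration only; every type and field name below is an assumption introduced here, not an identifier from the patent.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Illustrative only: names and sizes are assumptions, not taken from the patent.
    struct SubPE {                          // sPE: fetches, decodes, issues, executes, commits
        std::vector<uint32_t> instr_mem;    // private instruction sub-memory
        std::vector<uint32_t> data_mem;     // private data sub-memory
    };

    template <std::size_t N>
    struct PE {                             // processing unit: an N x N array of sPEs
        std::array<std::array<SubPE, N>, N> spe;  // mesh adjacency gives the multi-hop network
    };

    template <std::size_t N>
    struct Processor {                      // many-core processor: PEs on a network-on-chip
        std::vector<PE<N>> pes;
    };

    template <std::size_t N>
    struct ConfigUnit {                     // configuration unit, reachable from every sPE
        Processor<N>* chip;                 // pushes mode and channel settings into each sPE
    };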

· SIMD mode: as shown in FIG. 2, when configured in SIMD mode, the star-configured NoC is activated. All sPEs are controlled synchronously by the SIMD control unit, which behaves as a decoupled SIMD architecture. As shown in FIG. 2, four data paths are used to transmit data signals, feedback signals, control signals, and instruction signals, respectively. Unlike the traditional SIMD model, in the dataflow many-core processor architecture of the present invention one PE contains N sPEs (that is, the SIMD width is N), and each sPE has an independent instruction memory and data memory. When the selected SIMD mode is configured, the PE is controlled by the SIMD controller of the configuration unit. The instruction code is loaded through the router in the SIMD controller and distributed to each sPE belonging to the designated SIMD group, and the execution of every sPE is controlled by the SIMD controller simultaneously. To improve the scalability of the SIMD width, that is, the number of sPEs in one SIMD group, the present invention decouples the centralized control mechanism: memory requests and data transfers between two SIMD groups are issued by the designated master sPE of each group. In other words, within a SIMD group the data of all sPEs is treated as a whole, so that data movement cannot fall out of step. Shuffle is a common but special operation in SIMD structures that requires special consideration. A shuffle operation can be supported by exchanging data between two sPEs over the mesh network, but if the degree of shuffling is high, exchanging data between sPEs takes a long time. To accelerate shuffle, the present invention adds control logic and several registers to the SIMD controller unit. A shuffle instruction controls the data exchange by reading vector data into the registers in the SIMD controller unit and writing them to the target sPEs. To save hardware resources, the instruction channel is reused for the shuffle data.
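
The shuffle path just described can be sketched as follows. This is a standalone, hedged illustration: the register-file size and the function API are assumptions, not the disclosed hardware interface. Element i of the vector is gathered from sPE i into the SIMD controller's shuffle registers and then written back to sPE perm[i].

    #include <cstddef>
    #include <vector>

    // Standalone sketch of the shuffle operation via controller-side registers.
    struct SubPE { std::vector<int> data_mem; };

    void simd_shuffle(std::vector<SubPE>& group, std::size_t elem_addr,
                      const std::vector<std::size_t>& perm) {
        std::vector<int> shuffle_regs(group.size());          // registers in the SIMD controller
        for (std::size_t i = 0; i < group.size(); ++i)        // gather phase (reuses the instruction channel)
            shuffle_regs[i] = group[i].data_mem[elem_addr];
        for (std::size_t i = 0; i < group.size(); ++i)        // scatter phase to the target sPEs
            group[perm[i]].data_mem[elem_addr] = shuffle_regs[i];
    }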

FIG. 2 shows the internal sPE layout and wiring of a single PE of the present invention. As shown in FIG. 2, the data signal lines form the mesh network-on-chip; the feedback signal lines carry the ack feedback used to control the execution of the dataflow graph; the control signal lines carry control signals; and the instruction signal and instruction transmission lines are managed uniformly by the SIMD control unit, which configures the hardware resources and sends the instructions to be executed to each sPE, where they are received through each sPE's router sR.

MIMD mode: MIMD mode is intended for irregular data or programs that cannot exploit the SIMD computing model. Applications unsuitable for SIMD computation can therefore be configured in pure MIMD mode, i.e., a many-core mode with a typical 2D topology. In MIMD mode, all sPEs are configured as independent cores that run independently, free of SIMD constraints. As shown in FIG. 4, after configuration by the SIMD controller, the sPEs communicate only through the sub-routers sR. Each sPE has its own control commands, instructions, and data, and can process an independent DFG (dataflow graph) or a part of one.
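
As a hedged illustration of the data-driven execution an sPE performs on its fragment of a DFG, the sketch below implements only the generic dataflow firing rule (a node fires once all of its operands have arrived); the node representation and API are assumptions, not the patent's implementation, and the ack feedback path is not modeled.

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <utility>
    #include <vector>

    // Standalone illustrative dataflow-graph node.
    struct DfgNode {
        std::vector<int>  inputs;                                  // operand slots
        std::vector<bool> arrived;                                 // which operands are present
        std::vector<std::pair<DfgNode*, std::size_t>> successors;  // (next node, its input slot)
        std::function<int(const std::vector<int>&)> op;            // the node's operation

        bool ready() const {
            for (bool a : arrived) if (!a) return false;
            return true;
        }
        void fire() {                                              // data-driven execution
            int result = op(inputs);
            std::fill(arrived.begin(), arrived.end(), false);      // consume the operands
            for (auto& [next, slot] : successors) next->receive(slot, result);
        }
        void receive(std::size_t slot, int value) {                // token arrival
            inputs[slot] = value;
            arrived[slot] = true;
            if (ready()) fire();
        }
    };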

SIMD-MIMD hybrid mode: some applications suit neither pure MIMD mode nor wide SIMD mode; during execution they prefer narrow SIMD, or even a variable SIMD width, to achieve higher energy efficiency. For other applications, a larger SIMD width yields higher performance but also higher power consumption, which may not satisfy low-power scenarios. To accommodate multi-domain scenarios, a hybrid mode is implemented that offers these choices: users select different SIMD widths and different MIMD proportions according to their needs. For example, the input data may not be large enough to make full use of all SIMD computing units, or, for low power, a user may trade performance for efficiency, since a larger SIMD width may consume more power without improving performance proportionally. For larger PE arrays, this flexible configuration strategy can effectively improve the energy efficiency of a local PE or of the whole PE array.

Systolic-array mode: for some applications, such as matrix multiplication, the systolic array is undoubtedly one of the most efficient architectures, as demonstrated by the Google TPU. From another point of view, the systolic array is a special case of the dataflow model. Data flows into the computing array from the upper and left SPMs, and in this case the bottom sPEs of the bottom row of PEs can be configured as accumulators, thereby realizing the hardware structure of a systolic array.
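
For intuition, the following functional sketch computes what the described systolic configuration would produce for a matrix-vector product, with the last row acting as accumulators. Cycle timing, input skewing, and the SPM interfaces are deliberately omitted, and the function name is an assumption, not part of the disclosure.

    #include <cstddef>
    #include <vector>

    // Functional model only: computes y = W^T * x as a weight-stationary systolic
    // array would, with the "last row" of sPEs serving as accumulators.
    std::vector<long long> systolic_matvec(const std::vector<std::vector<int>>& W,
                                           const std::vector<int>& x) {
        const std::size_t rows = W.size(), cols = W[0].size();   // assumes a non-empty matrix
        std::vector<long long> accumulators(cols, 0);            // last-row accumulators
        for (std::size_t i = 0; i < rows; ++i)                   // x[i] travels along row i
            for (std::size_t j = 0; j < cols; ++j)               // partial sums flow down column j
                accumulators[j] += static_cast<long long>(W[i][j]) * x[i];
        return accumulators;                                     // y[j] = sum_i W[i][j] * x[i]
    }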

FIG. 3 is a schematic diagram of the domain division for different SIMD configurations of the present invention. As shown in FIG. 3, taking a local 4*4 PE array as an example, the NoC can be configured into different SIMD modes, such as SIMD-1 (①), SIMD-2 (②), SIMD-4 (③), SIMD-8 (④), and SIMD-16 (⑤). FIG. 4 and FIG. 5 show the physical connections between routers and channels. To explain the structure more clearly, the present invention logically divides it into five optional modes (a grouping sketch follows the list):

· SIMD-1 (MIMD) mode: the processor runs under the level-1 SIMD controller, forming a typical 4*4 2D mesh NoC. Each sub-router serves one sPE, and each sPE runs independently. Sub-routers 1 through 16 are all connected to channel a in both the X and Y directions, as shown in FIG. 4.

· SIMD-2 mode: every two sPEs act as one SIMD-2 PE at the second level. The level-2 SIMD controller controls the two designated sPEs, which execute the same instructions synchronously; configuration commands, instructions, and data are all managed by the level-2 SIMD controller. Because the present invention combines two sPEs along the x-axis, a new channel must be added in addition to the shared existing channel so that both sPEs can receive data from the x-axis at the same time. For example, sub-router 2 connects to channel a and sub-router 1 connects to channel b in the X direction; in the Y direction, both connect to channel a. Thus 1 and 2 combine into a SIMD-2 PE.

· SIMD-4 mode: one SIMD controller contains two level-2 controllers and four sPEs. The execution behavior is the same as at level 3. Note that the present invention adds corresponding data paths along the y-axis: in the X direction, 2 and 6 are configured to connect to channel a, and 1 and 5 to channel b; in the Y direction, 1 and 2 connect to channel a, and 5 and 6 to channel b.

· SIMD-8 mode: one level-4 controller manages eight sPEs synchronously. The present invention adds data paths along the y-axis to support SIMD-8 mode on the basis of SIMD-4 mode, which means that in the X direction, 1, 2, 3, and 4 connect to channels a, b, c, and d, respectively, and likewise for 5, 6, 7, and 8. In the Y direction, 1, 2, 3, and 4 connect to channel a, and 5, 6, 7, and 8 connect to channel b.

· SIMD-16 mode: the last-level SIMD controller controls all 16 sub-routers. As described above, in the X direction the four sub-routers of each row connect to four different channels, and in the Y direction the four sub-routers of each column also connect to four different channels.
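
The listing below gives one plausible reading of the FIG. 3 domain division as code: it partitions a 4*4 grid of sPEs (indexed 1 to 16, row by row) into SIMD groups of the requested width, and treats the first sPE of each group as the master that issues memory requests on behalf of the group. The exact grouping, function name, and indexing are assumptions, not the disclosed wiring.

    #include <vector>

    // Illustrative grouping for SIMD-1/2/4/8/16 on a 4x4 sPE grid.
    std::vector<std::vector<int>> simd_groups(int width) {   // width in {1, 2, 4, 8, 16}
        const int dim = 4;
        int gx = (width >= 2) ? 2 : 1;                       // group extent along x
        if (width >= 8) gx = 4;
        int gy = width / gx;                                 // group extent along y
        std::vector<std::vector<int>> groups;
        for (int y0 = 0; y0 < dim; y0 += gy)
            for (int x0 = 0; x0 < dim; x0 += gx) {
                std::vector<int> g;
                for (int dy = 0; dy < gy; ++dy)
                    for (int dx = 0; dx < gx; ++dx)
                        g.push_back((y0 + dy) * dim + (x0 + dx) + 1);  // 1-based sPE index
                groups.push_back(g);                         // g.front() acts as the master sPE
            }
        return groups;
    }

For example, simd_groups(4) yields the groups {1,2,5,6}, {3,4,7,8}, {9,10,13,14}, {11,12,15,16}, which is consistent with the SIMD-4 pairing of sub-routers 1, 2, 5, and 6 described above.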

FIG. 5 shows the configuration path of a PE. Which channel each sub-router sR connects to is set by the configuration controller. In FIG. 5, each level logically has its own SIMD controller; however, the area and power cost of a SIMD controller grows with the SIMD width. To reduce design complexity and improve scalability, the present invention does not provide separate control logic for each level of SIMD mode. Instead, a unified controller is implemented in the PE, which receives configuration information from the configuration unit and configures all sub-routers and sPEs, as shown in FIG. 5. Once a SIMD mode is set, the first sPE among all sPEs in that SIMD mode is configured as the master sub-PE, responsible for memory access and data exchange with other PEs. For example, when configured in SIMD-32 mode, the first sub-PE issues the memory access commands that load or store the data of all 32 sPEs. Because all sPEs have the same structure and the data arrives at the same time, the sPEs execute instructions in SIMD fashion.

The flow chart and configuration process by which the processor of the present invention executes a user program are shown in FIG. 6. Based on compilation and analysis by the compiler, the user program is turned into configuration files for the hardware structure, instruction code, and data segments; through the mapping of the dataflow graph, the configuration files, instructions, and so on are sent to each PE and sPE, and the hardware is configured and the program executed (a sketch of this flow follows the steps below). The steps are as follows:

Step S01: receive the user program;

Step S02: the compiler compiles the program;

Step S03: analyze the application characteristics of the program and form a coarse-grained program-block (block) dataflow graph;

Step S04: determine whether a program block is to be executed in SIMD mode;

Step S05: if the program block is to be executed in SIMD mode, mark it as SIMD mode;

Step S06: if the program block is not to be executed in SIMD mode, determine whether it is to be executed in Systolic mode;

Step S07: if the program block is not to be executed in Systolic mode, mark it as MIMD mode;

Step S08: if the program block is to be executed in Systolic mode, mark it as Systolic mode;

Step S09: the program-block controller distributes the program blocks and, according to the preceding determination steps, configures the processing units in SIMD, MIMD, or Systolic mode;

Step S11: according to the current processing-unit configuration state table and the dependencies between blocks, dynamically schedule the blocks for execution on the processing units;

Step S12: the processing units, configured in SIMD mode, MIMD mode, or Systolic mode according to the assigned blocks, execute the user program.
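
A hypothetical sketch of this flow (roughly steps S03 through S12) follows. The block classification heuristics, data structures, and scheduling loop are placeholders that only illustrate the control structure; they are not the patent's compiler or scheduler logic.

    #include <cstddef>
    #include <vector>

    // Placeholder model of block classification and dependency-driven dispatch.
    enum class Mode { SIMD, MIMD, Systolic };

    struct Block {
        int id = 0;                        // ids are assumed to be 0..n-1
        std::vector<int> deps;             // blocks that must finish before this one
        bool regular_data_parallel = false;
        bool matrix_like = false;
        Mode mode = Mode::MIMD;
    };

    Mode classify(const Block& b) {        // steps S04-S08: choose an execution mode
        if (b.regular_data_parallel) return Mode::SIMD;
        if (b.matrix_like)           return Mode::Systolic;
        return Mode::MIMD;
    }

    void run(std::vector<Block>& blocks) { // steps S09-S12 (assumes an acyclic block DFG)
        std::vector<bool> done(blocks.size(), false);
        std::size_t finished = 0;
        while (finished < blocks.size()) {
            for (Block& b : blocks) {
                if (done[b.id]) continue;
                bool ready = true;         // dependency check against finished blocks
                for (int d : b.deps) if (!done[d]) ready = false;
                if (!ready) continue;
                b.mode = classify(b);      // configure the target PEs for this mode,
                                           // dispatch config + instructions, then execute
                done[b.id] = true;
                ++finished;
            }
        }
    }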

Compared with the prior art, the invention offers better scalability and simple control logic, and is suitable for large-scale many-core structures. It also supports configurable SIMD-MIMD-Systolic modes, configurable scale, and configurable regions, making it more flexible and applicable to more general application domains.

The above embodiments are only used to illustrate the present invention and do not limit it. Those of ordinary skill in the relevant technical fields can make various changes and modifications without departing from the spirit and scope of the present invention; therefore, all equivalent technical solutions also fall within the scope of the present invention, whose patent protection scope shall be defined by the claims.

Claims (10)

1. A high-throughput many-core data stream processor, characterized by comprising: a plurality of processing units communicatively connected to one another to form the processor's network-on-chip; each processing unit including a plurality of sub-processing units, each sub-processing unit including an instruction sub-memory and a data sub-memory, the plurality of sub-processing units being arranged in an array and communicatively connected to one another to form the processing unit's multi-hop network; and a configuration unit communicatively connected to each sub-processing unit.

2. The high-throughput many-core data stream processor of claim 1, characterized in that, when the processor executes a single-instruction multiple-data (SIMD) task, the configuration unit divides the processing units into SIMD task groups according to the task requirements and sends the instruction code to the sub-processing units of the processing units of a SIMD task group so as to execute the SIMD task.

3. The high-throughput many-core data stream processor of claim 2, characterized in that the processing unit receives configuration information sent by the configuration unit and, according to the configuration information, configures all sub-processing units within the processing unit and the sub-routers of the sub-processing units.

4. The high-throughput many-core data stream processor of claim 3, characterized in that one sub-processing unit of the processing unit is designated as a data-transfer unit, and the data-transfer unit loads or stores data for all sub-processing units of the processing unit through memory access commands.

5. The high-throughput many-core data stream processor of claim 2, characterized in that the SIMD task group designates one sub-processing unit as the master control unit of the SIMD task group, and when memory requests and/or data transfers take place between two SIMD task groups, the exchanged data is transmitted through the master control unit of each SIMD task group.

6. The high-throughput many-core data stream processor of claim 2, characterized in that the configuration unit includes shuffle registers, and when a shuffle operation is performed on the SIMD task group, the data exchange is completed by reading the vector data of the shuffle instruction into the shuffle registers and writing it to the target sPEs.

7. The high-throughput many-core data stream processor of claim 1, characterized in that, when the processor executes a multiple-instruction multiple-data (MIMD) task, the configuration unit configures each sub-processing unit to run independently, and each sub-processing unit receives its own control commands, instructions, and operand data and processes an independent DFG or a part of a DFG.

8. The high-throughput many-core data stream processor of claim 7, characterized in that the sub-processing units communicate with one another only through the sub-routers of the sub-processing units.

9. The high-throughput many-core data stream processor of claim 1, characterized in that, when the processor executes a systolic-array task, the configuration unit configures the sub-processing units of a plurality of processing units into a systolic array and configures the sub-processing units of the last row or last column along the data outflow direction of the systolic array as accumulators.

10. A task execution method for executing a dataflow program task on the high-throughput many-core data stream processor of any one of claims 1-9, characterized in that the task execution method comprises: receiving and compiling a user program, dividing the user program into program blocks, and generating a dataflow graph; determining the execution mode of each program block and marking the program block as single-instruction multiple-data mode, multiple-instruction multiple-data mode, or systolic-array mode; configuring execution units for the program block and distributing the program block to the configured execution units; setting the configured execution units to the task execution mode corresponding to the program block; and executing the program block by the configured execution units.
CN202111673269.2A 2021-12-31 2021-12-31 High-throughput many-core data stream processor and its task execution method Pending CN114489809A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111673269.2A CN114489809A (en) 2021-12-31 2021-12-31 High-throughput many-core data stream processor and its task execution method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111673269.2A CN114489809A (en) 2021-12-31 2021-12-31 High-throughput many-core data stream processor and its task execution method

Publications (1)

Publication Number Publication Date
CN114489809A true CN114489809A (en) 2022-05-13

Family

ID=81508137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111673269.2A Pending CN114489809A (en) 2021-12-31 2021-12-31 High-throughput many-core data stream processor and its task execution method

Country Status (1)

Country Link
CN (1) CN114489809A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5590345A (en) * 1990-11-13 1996-12-31 International Business Machines Corporation Advanced parallel array processor(APAP)
CN108628799A (en) * 2018-04-17 2018-10-09 上海交通大学 Restructural single-instruction multiple-data systolic array architecture, processor and electric terminal


Similar Documents

Publication Publication Date Title
Nabavinejad et al. An overview of efficient interconnection networks for deep neural network accelerators
CN102831011B (en) A kind of method for scheduling task based on many core systems and device
Wang et al. FPDeep: Scalable acceleration of CNN training on deeply-pipelined FPGA clusters
CN104461467B (en) The method for improving calculating speed using MPI and OpenMP hybrid parallels for SMP group systems
CN108628799B (en) Reconfigurable single instruction multiple data systolic array structure, processor and electronic terminal
US20140137123A1 (en) Microcomputer for low power efficient baseband processing
Gao et al. Sunway supercomputer architecture towards exascale computing: analysis and practice
Arnold et al. Tomahawk: Parallelism and heterogeneity in communications signal processing MPSoCs
US20190370269A1 (en) Prometheus: processing-in-memory heterogenous architecture design from a multi-layer network theoretic strategy
Zhuang et al. High performance, low power matrix multiply design on acap: from architecture, design challenges and dse perspectives
CN114398308B (en) Near-memory computing system based on data-driven coarse-grained reconfigurable array
CN114968374B (en) A multi-layer loop process-level and thread-level collaborative automatic optimization method based on the new generation Sunway supercomputer
CN102193830A (en) Many-core environment-oriented division mapping/reduction parallel programming model
Ho et al. Improving gpu throughput through parallel execution using tensor cores and cuda cores
CN103761213A (en) On-chip Array System Based on Circulating Pipeline Computing
Yang et al. Versa-dnn: A versatile architecture enabling high-performance and energy-efficient multi-dnn acceleration
Tan et al. Dynpac: Coarse-grained, dynamic, and partially reconfigurable array for streaming applications
CN101387965B (en) Concurrent program compiling method and system
CN114489809A (en) High-throughput many-core data stream processor and its task execution method
CN109597619A (en) A kind of adaptive compiled frame towards heterogeneous polynuclear framework
Zhang et al. ZIPPER: Exploiting tile-and operator-level parallelism for general and scalable graph neural network acceleration
Huang et al. Computing en-route for near-data processing
Xu et al. Application-aware NoC management in GPUs multitasking: Z. Xu et al.
Cheng et al. AMOEBA: a coarse grained reconfigurable architecture for dynamic GPU scaling
Tan et al. A pipelining loop optimization method for dataflow architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination