CN105487838B - A task-level parallel scheduling method and system for a dynamically reconfigurable processor - Google Patents
- Publication number
- CN105487838B (application CN201510817591.6A)
- Authority
- CN
- China
- Prior art keywords
- processing unit
- reconfigurable
- reconfigurable processing
- shared memory
- task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multi Processors (AREA)
Abstract
The present invention proposes a task-level parallel scheduling method and system for a dynamically reconfigurable processor. The system includes a main controller, a plurality of reconfigurable processing units, a main memory, direct memory access, and a system bus. Each reconfigurable processing unit is composed of a co-controller, a plurality of reconfigurable processing element arrays responsible for reconfigurable computation, and a plurality of shared memories for data storage; the arrays and shared memories are arranged adjacently, and each shared memory can be read and written by the two adjacent arrays connected to it. By adjusting the scheduling method, the proposed method and system can apply a different scheduling scheme to each kind of task, so that essentially all parallel tasks obtain good parallel speedup on this reconfigurable processor.
Description
Technical Field
The present invention relates to the field of computers, and in particular to a task-level parallel scheduling method and system for a dynamically reconfigurable processor.
Background Art
Previous processor computing models generally fall into two categories. Traditional general-purpose computing based on the von Neumann architecture is extremely flexible, but its instruction-stream-driven execution model and its limited arithmetic units and memory bandwidth make its overall performance and power consumption unsatisfactory. Application-specific computing can optimize structures and circuits for a particular application without needing an instruction set, so it executes quickly with low power consumption. However, application-specific computing systems have a fatal flaw: their flexibility and scalability are poor, and continually evolving, increasingly complex applications often cannot be accommodated by simple extension. A different dedicated computing system must be designed for each application, so hardware design often cannot keep pace with application updates. In addition, the design cycle of a dedicated computing system is long, and its non-recurring engineering cost is excessive. Reconfigurable computing emerged against this background as a computing paradigm that combines the flexibility of software with the efficiency of hardware. Reconfigurable computing technology unites the advantages of general-purpose processors and ASICs, offering both hardware-level efficiency and software programmability. It strikes a better balance among key metrics such as performance, power consumption, and flexibility, filling the gap between general-purpose and application-specific computing.
At present, the mainstream processors in computers are multi-core CPUs with 2 to 8 cores and many-core GPUs. This design has made parallel processing a hot topic, and parallel algorithms and parallel programming have become material that programmers must understand and master. In June 2007, NVIDIA released CUDA, a hardware/software system that treats the GPU as a data-parallel computing device. In the CUDA programming model, the CPU serves as the host and the GPU as a coprocessor; a CUDA parallel computing function that runs on the GPU is called a kernel. A kernel function is not a complete program but rather the steps of a CUDA program that can be executed in parallel. This cleanly divides the work between CPU and GPU and achieves parallelism at multiple levels.
However, there is no unified parallel-processing specification for the general-purpose reconfigurable processor targeted by the present invention. A typical reconfigurable processor architecture contains a general-purpose processor and one or more reconfigurable processing units (RPUs). To address how multi-task applications are allocated onto such a coarse-grained reconfigurable processor, the present invention proposes a set of RPU allocation and task scheduling methods and systems.
Summary of the Invention
The present invention proposes a task-level parallel scheduling method and system for a dynamically reconfigurable processor. By adjusting the scheduling method, a different scheduling scheme can be applied to each kind of task, so that essentially all parallel tasks obtain good parallel speedup on this reconfigurable processor.
To achieve the above object, the present invention proposes a task-level parallel scheduling system for a dynamically reconfigurable processor, including a main controller, a plurality of reconfigurable processing units, a main memory, direct memory access, and a system bus,
wherein each reconfigurable processing unit is composed of a co-controller, a plurality of reconfigurable processing element arrays responsible for reconfigurable computation, and a plurality of shared memories for data storage, wherein the arrays and the shared memories are arranged adjacently, and each shared memory can be read and written by the two adjacent arrays connected to it.
Further, the main controller executes the serial code in a program that is unsuitable for processing by the reconfigurable processing units, and is responsible for scheduling, starting, and running the plurality of reconfigurable processing units.
Further, the reconfigurable processing units are responsible for the computation-intensive parallelizable code in the program.
Further, the co-controller transfers the data and configuration information needed by the multiple reconfigurable processing element arrays for computation and controls the start, operation, and termination of the arrays.
To achieve the above object, the present invention also proposes a task-level parallel scheduling method for a dynamically reconfigurable processor, including the following steps:
encapsulating the computation-intensive parallelizable code of an application as kernel functions;
compiling the serial code and the parallel code separately to generate executable code suitable for the main controller and for the reconfigurable processing units;
the main controller executing the serial code;
when execution reaches the code of a kernel function, the main controller scheduling and allocating reconfigurable processing units to process that kernel function code.
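The steps above amount to a small host-side driver loop: run serial code, then hand each kernel to an idle reconfigurable processing unit. The following is a purely illustrative sketch of that flow; the names `RPU`, `load_and_run`, and `run_application` are invented for this sketch and are not part of the patent.

```python
class RPU:
    """Toy model of one reconfigurable processing unit."""
    def __init__(self, uid):
        self.uid = uid
        self.busy = False

    def load_and_run(self, kernel, data):
        # Stand-in for loading executable code/configuration and running the PEAs.
        self.busy = True
        result = [kernel(x) for x in data]
        self.busy = False
        return result

def run_application(serial_steps, kernel, data, rpus):
    """Main controller: execute serial steps, then dispatch the kernel to a free RPU."""
    trace = []
    for step in serial_steps:
        trace.append(step())                    # serial part runs on the main controller
    free = next(r for r in rpus if not r.busy)  # find an RPU that is not running
    trace.append(free.load_and_run(kernel, data))
    return trace
```

A real implementation would load compiled configuration packages rather than call a Python function, but the control structure (serial execution interleaved with kernel dispatch) is the same.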
Further, the main controller schedules the reconfigurable processing units in two parallel modes, synchronous invocation and asynchronous invocation:
In a synchronous invocation, the main controller finds reconfigurable processing units that are not currently running and loads the executable code and configuration information; meanwhile, the main controller is suspended. In a synchronous invocation, multiple reconfigurable processing units are invoked at the same time to process different data blocks; after all of them finish, the processing results are updated through the return value of the synchronization function, and the main controller resumes executing the serial code.
In an asynchronous invocation, the main controller finds reconfigurable processing units that are not currently running and, without interrupting itself, loads the executable code and configuration information and starts the units; the main controller keeps running until it needs the data returned by a reconfigurable processing unit, at which point it suspends until the unit finishes its computation and returns the data.
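The difference between the two modes is essentially join-now versus join-later. As a hedged sketch, using Python threads as stand-ins for RPUs (all names here, including `rpu_compute`, are illustrative assumptions, not the patent's API):

```python
import threading

def rpu_compute(block):
    return sum(block)  # stand-in for the kernel computation on one data block

def synchronous_call(blocks):
    """Launch one worker per data block; the host suspends until all finish."""
    results = [None] * len(blocks)
    def work(i):
        results[i] = rpu_compute(blocks[i])
    threads = [threading.Thread(target=work, args=(i,)) for i in range(len(blocks))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()          # host suspended here, as in the synchronous mode
    return results

def asynchronous_call(block):
    """Start a worker and return a handle; the host joins only when it needs the data."""
    box = {}
    t = threading.Thread(target=lambda: box.setdefault("r", rpu_compute(block)))
    t.start()
    def wait():           # called later, when the host needs the returned data
        t.join()
        return box["r"]
    return wait
```

As the text notes, the synchronous mode suits one large-data kernel split across several RPUs, while the asynchronous mode suits independent kernels running concurrently with the host.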
Further, when a kernel function has few instructions and a single reconfigurable processing unit can complete the entire kernel computation on its own, the multiple reconfigurable processing element arrays inside that unit execute the same configuration information in parallel, each array handling the data computation of its own shared memory.
Further, when a kernel function has many instructions and all of its statements cannot be executed in one pass, the kernel function is divided into multiple subtasks of equal length, and the configuration information of the subtasks is assigned in order to the multiple reconfigurable processing element arrays. Since each array can read and write both the adjacent upper shared memory and the adjacent lower shared memory, each shared memory is divided into an A block and a B block of equal size. During pipelined execution, each array first reads data from the A block of its upper shared memory and writes results into the B block of its lower shared memory; after that pass ends, each array reads data from the B block of its upper shared memory and writes results into the A block of its lower shared memory. While these two passes run, the parts of the first and last shared memories that are not participating in the computation transfer data to and from main memory.
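The A/B split is classic double buffering: each cycle a stage reads one half of its upper memory while the other half is free to be written. A simplified simulation of this ping-pong pipeline, assuming one stage function per array and one batch moving one stage per cycle (the names and data structures are this sketch's, not the patent's):

```python
def run_pipeline(batches, stages):
    """Feed data batches through staged ping-pong buffers.

    memories[i] sits between stage i-1 and stage i; its "A" and "B" halves
    swap roles every cycle, so each stage reads one half while the other
    half is being written, mirroring the A/B scheme described above.
    """
    n = len(stages)
    memories = [{"A": None, "B": None} for _ in range(n + 1)]
    out = []
    for t in range(len(batches) + n):            # extra cycles drain the pipe
        read, write = ("A", "B") if t % 2 == 0 else ("B", "A")
        for i in range(n):                       # reads and writes touch opposite
            src = memories[i][read]              # halves, so in-cycle order is free
            if src is not None:
                memories[i + 1][write] = [stages[i](x) for x in src]
                memories[i][read] = None
        if t < len(batches):                     # head memory: DMA loads next batch
            memories[0][write] = batches[t]
        if memories[n][write] is not None:       # tail memory: DMA drains a result
            out.append(memories[n][write])
            memories[n][write] = None
    return out
```

Because the head is refilled and the tail drained in the halves not being computed on, no extra data-transfer time appears between passes, which is the property the paragraph above claims.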
Compared with the prior art, the above technical solution has the following innovations and beneficial effects:
1. The task-level parallel scheduling method of the present invention is designed and implemented for a specific three-level heterogeneous coarse-grained reconfigurable processor. The data-intensive and computation-intensive parts of an application are packaged as kernel functions; the main controller handles the serial code and allocates the reconfigurable processing units; the kernel functions are handed to the reconfigurable processing units, which have stronger parallel computing capability, and the reconfigurable arrays inside each unit are flexibly allocated to carry out the computation. In this way the parallel computing capability of the multi-level heterogeneous coarse-grained reconfigurable processor can be fully exploited, and together with a dedicated compiler, computation-intensive applications can be fully accelerated in parallel.
2. Building on the GPU parallel computing tool CUDA, the present invention ports its parallel scheduling approach to a multi-level heterogeneous coarse-grained reconfigurable processor, proposes a new pipelined scheduling mode, and extends the scheduling methods available for different kinds of tasks.
3. The present invention realizes the scheduling of multiple tasks over multiple reconfigurable processing units. By adjusting the scheduling method, a different scheduling scheme can be applied to each kind of task, avoiding the constraints a single scheduling scheme imposes on tasks, so that essentially all parallel tasks obtain good parallel speedup on this reconfigurable processor.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of a task-level parallel scheduling system of a dynamically reconfigurable processor according to a preferred embodiment of the present invention.
FIG. 2 is a flowchart of a task-level parallel scheduling method of a dynamically reconfigurable processor according to a preferred embodiment of the present invention.
FIG. 3 and FIG. 4 are schematic diagrams of the synchronous scheduling method used when a kernel function has few instructions.
FIG. 5 is a schematic diagram of the asynchronous scheduling method used when a kernel function has many instructions.
Detailed Description
Specific embodiments of the present invention are given below with reference to the accompanying drawings, but the present invention is not limited to the following embodiments. The advantages and features of the present invention will become clearer from the following description and the claims. It should be noted that the drawings are in a greatly simplified form and use imprecise ratios; they serve only to aid in describing the embodiments of the present invention conveniently and clearly.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of the task-level parallel scheduling system of a dynamically reconfigurable processor according to a preferred embodiment of the present invention; the box on the right is an enlarged view of RPU1, shown to reveal its internal functional structure. The present invention proposes a task-level parallel scheduling system for a dynamically reconfigurable processor, including a main controller ARM11, a plurality of reconfigurable processing units (Reconfigurable Processing Unit, RPU), a main memory DDR, direct memory access (DMA), and a system bus AHB. Each reconfigurable processing unit RPU is composed of a co-controller ARM7, a plurality of processing element arrays (Processing Element Array, PEA) responsible for reconfigurable computation, and a plurality of shared memories (Shared Memory, SM) for data storage; the PEAs and SMs are arranged adjacently, and each SM can be read and written by the two adjacent PEAs connected to it.
According to a preferred embodiment of the present invention, the main controller ARM11 executes the serial code in a program that is unsuitable for processing by the RPUs, and is responsible for scheduling, starting, and running the multiple RPUs. The RPUs handle the computation-intensive parallelizable code in the program. Further, the co-controller ARM7 transfers the data and configuration information needed by the multiple PEAs for computation and controls the start, operation, and termination of the PEAs.
According to a preferred embodiment of the present invention, each RPU includes four PEAs and four SMs.
Referring to FIG. 2, FIG. 2 is a flowchart of the task-level parallel scheduling method of a dynamically reconfigurable processor according to a preferred embodiment of the present invention. The present invention also proposes a task-level parallel scheduling method for a dynamically reconfigurable processor, including the following steps:
Step S100: encapsulating the computation-intensive parallelizable code of an application as kernel functions;
Step S200: compiling the serial code and the parallel code separately to generate executable code suitable for the main controller and for the reconfigurable processing units;
Step S300: the main controller executing the serial code;
Step S400: when execution reaches the code of a kernel function, the main controller scheduling and allocating reconfigurable processing units to process that kernel function code.
For ordinary reconfigurable processors, the parallel process mostly just uses the high-speed computation of reconfigurable components to handle large amounts of repetitive, computation-intensive operations. The present invention, by contrast, targets a multi-level and more complex heterogeneous reconfigurable processor containing three levels of computing modules: the main controller, the co-controllers, and the PEAs. Their memories are independent, and together they form a three-level reconfigurable heterogeneous architecture. When processing an application, the task-level parallel method covers not only the parallelism of the reconfigurable units but also the allocation and scheduling of multiple RPUs.
For the C code of an application, the computation-intensive parallelizable code is encapsulated as kernel functions; a single application may contain multiple kernel functions of differing functionality and complexity. During compilation, the serial part and the parallel part are compiled separately into executable code suitable for the main controller and for the RPUs. Kernel functions exhibit two levels of parallelism: synchronous and asynchronous parallelism across different RPUs, and code parallelism across the four PEAs inside each RPU.
When a multi-task application executes on the reconfigurable architecture, the main controller first executes the serial code; when execution reaches a kernel function, the RPU scheduler of the main controller allocates RPUs in one of two parallel modes, synchronous invocation or asynchronous invocation:
In a synchronous invocation, the main controller finds reconfigurable processing units that are not currently running and loads the executable code and configuration information; meanwhile, the main controller is suspended. In a synchronous invocation, multiple reconfigurable processing units are invoked at the same time to process different data blocks; after all of them finish, the processing results are updated through the return value of the synchronization function, and the main controller resumes executing the serial code.
In an asynchronous invocation, the main controller finds reconfigurable processing units that are not currently running and, without interrupting itself, loads the executable code and configuration information and starts the units; the main controller keeps running until it needs the data returned by a reconfigurable processing unit, at which point it suspends until the unit finishes its computation and returns the data. The synchronous mode allows a kernel function with a large amount of data to be processed by multiple RPUs computing in parallel, while the asynchronous mode allows different kernel functions with no dependencies on each other to run in parallel.
After the RPUs that will execute a kernel function have been allocated and scheduled, task parallelism also exists inside each RPU; this is the second level of parallel computation. In the hardware architecture on which the present invention relies, each RPU contains four PEAs and corresponding SMs, where each SM can be read and written by the two adjacent PEAs. For this architecture, two parallel computation methods can be applied depending on the kind of kernel function.
Referring to FIG. 3 and FIG. 4, which are schematic diagrams of the synchronous scheduling method used when a kernel function has few instructions: when the kernel function has few instructions and a single RPU can complete the entire kernel computation on its own, the multiple PEAs inside the RPU execute the same configuration information in parallel, each PEA handling the data computation of its own SM; in this way the kernel execution time can be cut to as little as one quarter. In the figures, the first row of boxes block1 to block4 shows the co-controller copying data from DDR into the left half of each SM, the circles show PEA execution, the second row of boxes block1 to block4 shows the PEAs writing the computed data into the right half of each SM, and the third row of boxes block1 to block4 shows the co-controller writing data from the right half of each SM out to DDR; together these form the loop body of the program.
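This small-kernel case is data parallelism in SPMD style: the same configuration runs on every PEA, each over its own block of data. A hedged sketch of the partitioning (the function names `split_blocks` and `run_spmd` are invented for illustration and do not appear in the patent):

```python
def split_blocks(data, n_peas=4):
    """Divide the input into one block per PEA (trailing blocks may be shorter)."""
    size = -(-len(data) // n_peas)           # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n_peas)]

def run_spmd(kernel, data, n_peas=4):
    """Every PEA applies the same kernel configuration to its own SM's block.

    The list comprehension below is sequential in Python, but on the hardware
    the four blocks would be computed simultaneously, giving up to a 4x cut
    in kernel execution time as the text states.
    """
    blocks = split_blocks(data, n_peas)
    results = [[kernel(x) for x in block] for block in blocks]
    return [y for block in results for y in block]
```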
When the kernel function has many instructions and all of its statements cannot be executed in one pass, task parallelism is achieved by pipelining. Specifically, the kernel function is divided into at most four subtasks of as equal length as possible, and the configuration information of the subtasks is distributed in order to the four PEAs. Since each PEA can read and write both the adjacent upper SM and the adjacent lower SM, each SM is divided into an A block and a B block of equal size. During pipelined execution, each PEA first reads data from the A block of its upper SM and writes results into the B block of its lower SM; after that pass ends, each PEA reads data from the B block of its upper SM and writes results into the A block of its lower SM. While these two passes run, the parts of the first and last SMs that are not participating in the computation transfer data to and from main memory. In this way there is no extra data-transfer time in the whole process, and the kernel function executes as an uninterrupted pipeline.
Referring to FIG. 5, a schematic diagram of the asynchronous scheduling method used when a kernel function has many instructions: as shown in FIG. 5, the programmer divides all the commands in the loop into several parts according to a certain method; when the compiler processes the GR-C code, it compiles these commands into different PEA configuration packages, which the four PEAs invoke in sequence, each starting one clock cycle after the previous one, achieving the pipelining effect. In the first cycle, PEA1 operates on the data in part A of SM1 and writes the result into the B block of SM2; in the second cycle, PEA1 operates on the data in part B of SM1 while PEA2 operates on the B-block data of SM2, the results going into the A blocks of SM2 and SM3 respectively; and so on, forming the pipeline. After PEA4 finishes computing its data, the data is copied directly to main memory, until all the data has been processed. Inside the RPU the complete execution flow is divided by task, where block1 to block5 denote different data flows and the rightmost box denotes data being written from SM3 to DDR.
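The one-cycle stagger described above can be summarized as: in cycle t, PEA i works on data block t - i. As an illustrative sketch (the function name and return shape are this sketch's assumptions; the real schedule is fixed by the compiler's configuration packages):

```python
def pipeline_schedule(n_blocks, n_peas=4):
    """Return {cycle: [(pea, block), ...]} for a 1-cycle-staggered pipeline.

    PEA i starts one cycle after PEA i-1, so PEA i processes block t - i in
    cycle t; the whole run takes n_blocks + n_peas - 1 cycles including drain.
    """
    schedule = {}
    for cycle in range(n_blocks + n_peas - 1):
        active = [(pea, cycle - pea) for pea in range(n_peas)
                  if 0 <= cycle - pea < n_blocks]
        schedule[cycle] = active
    return schedule
```

For example, 5 data blocks on 4 PEAs finish in 5 + 4 - 1 = 8 cycles, whereas running the 20 subtask executions with no overlap would take 20 cycles; once the pipeline fills, all four PEAs are busy every cycle.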
Although the present invention has been disclosed above through preferred embodiments, they are not intended to limit the present invention. Those of ordinary skill in the art to which the present invention belongs may make various changes and modifications without departing from the spirit and scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the claims.
Claims (3)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510817591.6A CN105487838B (en) | 2015-11-23 | 2015-11-23 | A task-level parallel scheduling method and system for a dynamically reconfigurable processor |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510817591.6A CN105487838B (en) | 2015-11-23 | 2015-11-23 | A task-level parallel scheduling method and system for a dynamically reconfigurable processor |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN105487838A CN105487838A (en) | 2016-04-13 |
| CN105487838B true CN105487838B (en) | 2018-01-26 |
Family
ID=55674840
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201510817591.6A Active CN105487838B (en) | 2015-11-23 | 2015-11-23 | A task-level parallel scheduling method and system for a dynamically reconfigurable processor |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN105487838B (en) |
Families Citing this family (20)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105867994A (en) * | 2016-04-20 | 2016-08-17 | 上海交通大学 | Instruction scheduling optimization method for coarse-grained reconfigurable architecture complier |
| CN106095552B (en) * | 2016-06-07 | 2019-06-28 | 华中科技大学 | A method and system for multitasking graph processing based on I/O deduplication |
| CN106648883B (en) * | 2016-09-14 | 2020-02-04 | 深圳鲲云信息科技有限公司 | Dynamic reconfigurable hardware acceleration method and system based on FPGA |
| JP6960479B2 (en) * | 2017-03-14 | 2021-11-05 | アズールエンジン テクノロジーズ ヂュハイ インク.Azurengine Technologies Zhuhai Inc. | Reconfigurable parallel processing |
| CN110275771B (en) * | 2018-03-15 | 2021-12-14 | 中国移动通信集团有限公司 | A business processing method, an Internet of Things billing infrastructure system, and a storage medium |
| CN109672524B (en) * | 2018-12-12 | 2021-08-20 | 东南大学 | SM3 algorithm round iteration system and iterative method based on coarse-grained reconfigurable architecture |
| CN110096474A (en) * | 2019-04-28 | 2019-08-06 | 北京超维度计算科技有限公司 | A kind of high-performance elastic computing architecture and method based on Reconfigurable Computation |
| CN110059050B (en) * | 2019-04-28 | 2023-07-25 | An AI supercomputer based on high-performance reconfigurable elastic computing |
| CN112579090A (en) * | 2019-09-27 | 2021-03-30 | 无锡江南计算技术研究所 | Asynchronous parallel I/O programming framework method under heterogeneous many-core architecture |
| CN110765046A (en) * | 2019-11-07 | 2020-02-07 | 首都师范大学 | A DMA transmission device and method for dynamically reconfigurable high-speed serial bus |
| CN111897580B (en) * | 2020-09-29 | 2021-01-12 | 北京清微智能科技有限公司 | Instruction scheduling system and method for reconfigurable array processor |
| CN111930319B (en) * | 2020-09-30 | 2021-09-03 | 北京清微智能科技有限公司 | Data storage and reading method and system for multi-library memory |
| CN112463719A (en) * | 2020-12-04 | 2021-03-09 | 上海交通大学 | In-memory computing method realized based on coarse-grained reconfigurable array |
| CN112559441A (en) * | 2020-12-11 | 2021-03-26 | 清华大学无锡应用技术研究院 | Control method of digital signal processor |
| CN112540793A (en) * | 2020-12-18 | 2021-03-23 | 清华大学 | Reconfigurable processing unit array supporting multiple access modes and control method and device |
| CN112486908B (en) * | 2020-12-18 | 2024-06-11 | 清华大学 | Hierarchical multi-RPU multi-PEA reconfigurable processor |
| CN112256632B (en) * | 2020-12-23 | 2021-06-04 | 北京清微智能科技有限公司 | Instruction distribution method and system in reconfigurable processor |
| CN113553031B (en) * | 2021-06-04 | 2023-02-24 | A software-defined variable-structure computing architecture and a left/right-brain integrated resource joint allocation method implemented with it |
| CN113568731B (en) * | 2021-09-24 | 2021-12-28 | 苏州浪潮智能科技有限公司 | Task scheduling method, chip and electronic equipment |
| CN120724028A (en) * | 2025-08-21 | 2025-09-30 | Institute of Optics and Electronics, Chinese Academy of Sciences | An adaptive optics matrix computation acceleration method based on the ARM architecture |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103984677A (en) * | 2014-05-30 | 2014-08-13 | 东南大学 | Embedded reconfigurable system based on large-scale coarseness and processing method thereof |
| CN104134210A (en) * | 2014-07-22 | 2014-11-05 | 兰州交通大学 | 2D-3D medical image parallel registration method based on combination similarity measure |
| CN105302525A (en) * | 2015-10-16 | 2016-02-03 | 上海交通大学 | Parallel processing method for reconfigurable processor with multilayer heterogeneous structure |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR20140131199A (en) * | 2013-05-03 | 2014-11-12 | 삼성전자주식회사 | Reconfigurable processor and method for operating thereof |
- 2015-11-23: Application CN201510817591.6A filed in China (CN); granted as patent CN105487838B, status Active
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103984677A (en) * | 2014-05-30 | 2014-08-13 | 东南大学 | Embedded reconfigurable system based on large-scale coarseness and processing method thereof |
| CN104134210A (en) * | 2014-07-22 | 2014-11-05 | 兰州交通大学 | 2D-3D medical image parallel registration method based on combination similarity measure |
| CN105302525A (en) * | 2015-10-16 | 2016-02-03 | 上海交通大学 | Parallel processing method for reconfigurable processor with multilayer heterogeneous structure |
Non-Patent Citations (1)
| Title |
|---|
| Design of an Automatic Task Compiler Framework for Heterogeneous Coarse-Grained Reconfigurable Processors; Lou Jiechao et al.; Microelectronics & Computer; 2015-08-31; Vol. 32, No. 8; pp. 110-114 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN105487838A (en) | 2016-04-13 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN105487838B (en) | | A task-level parallel scheduling method and system for a dynamically reconfigurable processor |
| JP6525286B2 (en) | | Processor core and processor system |
| JP4936517B2 (en) | | Control method for heterogeneous multiprocessor system and multi-grain parallelizing compiler |
| US8782645B2 (en) | | Automatic load balancing for heterogeneous cores |
| CN102576314B (en) | | Mapping processing logic having data-parallel threads across multiple processors |
| KR102432380B1 (en) | | Method for performing warp clustering |
| KR100878917B1 (en) | | Global compiler for heterogeneous multiprocessor |
| JP5152594B2 (en) | | Execution of retargeted graphics processor acceleration code by a general-purpose processor |
| CN103279445A (en) | | Computing method and super-computing system for computing task |
| US20080109795A1 (en) | | C/C++ language extensions for general-purpose graphics processing unit |
| CN104375805A (en) | | Method for simulating parallel computation process of reconfigurable processor through multi-core processor |
| Sunitha et al. | | Performance improvement of CUDA applications by reducing CPU-GPU data transfer overhead |
| US8615770B1 (en) | | System and method for dynamically spawning thread blocks within multi-threaded processing systems |
| Brown | | Porting incompressible flow matrix assembly to FPGAs for accelerating HPC engineering simulations |
| CN107329818A (en) | | A task scheduling processing method and device |
| CN108228242B (en) | | Configurable and flexible instruction scheduler |
| Inagaki et al. | | Performance evaluation of a 3D-stencil library for distributed memory array accelerators |
| Hussain et al. | | MVPA: An FPGA-based multi-vector processor architecture |
| Hussain et al. | | PMSS: A programmable memory system and scheduler for complex memory patterns |
| JP2023552789A (en) | | Software-based instruction scoreboard for arithmetic logic unit |
| Ren et al. | | Parallel Optimization of BLAS on a New-Generation Sunway Supercomputer |
| Cheramangalath et al. | | GPU Architecture and Programming Challenges |
| JP2025126345A5 (en) | | |
| JP2025126345A (en) | | Accelerator, data processing method, and compiler apparatus |
| Saremi | | Low Level GPU Performance Characteristics Using Vendor Independent Benchmarks |
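The abstract describes a layout in which shared memories are placed between adjacent reconfigurable processing unit arrays (PEAs), each shared memory being readable and writable by exactly the two PEAs that flank it. As a rough, illustrative sketch of that adjacency constraint (the class, method names, and round-robin placement below are assumptions for illustration, not the patent's actual scheduling algorithm):

```python
# Minimal sketch of the topology the abstract describes: a row of PEAs inside
# one reconfigurable processing unit, with shared memory i sitting between
# PEA i and PEA i + 1. All names here are hypothetical.

class ReconfigurableProcessingUnit:
    def __init__(self, num_peas):
        # A row of num_peas arrays has num_peas - 1 shared memories between them.
        self.num_peas = num_peas
        self.shared_mem = {i: [] for i in range(num_peas - 1)}

    def neighbors_of_memory(self, mem_id):
        """Return the ids of the two PEAs allowed to access shared memory mem_id."""
        return (mem_id, mem_id + 1)

    def schedule_pipeline(self, tasks):
        """Place a chain of dependent tasks on consecutive PEAs so each
        producer/consumer pair can exchange data through the shared memory
        between them; wrap around when tasks outnumber PEAs."""
        return {task: i % self.num_peas for i, task in enumerate(tasks)}


rpu = ReconfigurableProcessingUnit(num_peas=4)
print(rpu.neighbors_of_memory(1))                 # -> (1, 2)
print(rpu.schedule_pipeline(["t0", "t1", "t2"]))  # -> {'t0': 0, 't1': 1, 't2': 2}
```

The point of the adjacency rule is that a producer task and its consumer can share data without going through main memory, which is what makes placing dependent tasks on neighboring PEAs attractive.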
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |