CN103970580A - Data flow compilation optimization method oriented to multi-core cluster
- Publication number: CN103970580A
- Application number: CN201410185945.5A
- Authority: CN (China)
- Legal status: Granted
Abstract
The invention discloses a dataflow compilation optimization method for multi-core cluster systems, comprising: a task partitioning and scheduling step that determines the mapping between computing tasks and processing cores; a hierarchical pipeline scheduling step that constructs, from the partitioning and scheduling results, a pipeline schedule spanning both the cluster nodes and the cores within each node; and a cache-based optimization step performed according to the structural characteristics of the multi-core processors, the communication among cluster nodes, and the execution behavior of the dataflow program on the multi-core processors. The method combines dataflow-program optimizations with architecture-specific optimizations, fully exploiting the high load balance and the high parallelism of mixed synchronous/asynchronous pipelined code on a multi-core cluster. It further optimizes the program's cache accesses and communication with respect to the cluster's cache and communication mechanisms, improving execution performance and shortening execution time.
Description
Technical Field
The invention belongs to the field of computer compilation technology, and more particularly relates to a dataflow compilation optimization method oriented to multi-core clusters.
Background
With the development of semiconductor technology, multi-core processors have proven to be a viable platform for exploiting parallelism. Multi-core cluster systems, with their powerful parallel computing capability and good scalability, have become an important class of parallel computing platform. While such systems provide great processing power, they also shift more of the burden onto compilers and programmers to exploit coarse-grained parallelism across cores effectively. Dataflow programming offers a practical way to exploit the parallelism of multi-core architectures. In this model, each node represents a computing task and each edge represents the flow of data between tasks. Each computing task is an independent computing unit with its own instruction stream and address space, and data flows between tasks through first-in-first-out (FIFO) communication queues. The dataflow programming model is based on the dataflow model and is realized through dataflow programming languages. Dataflow compilation refers to the compilation techniques involved in translating a dataflow programming language into an executable program for the underlying target. Among these, compiler optimization plays a decisive role in the runtime performance of a dataflow program on the target processing cores.
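Purely as an illustration of this model (not code from the patent), a minimal C++ sketch might represent each computing task as an actor that fires once its input FIFOs hold enough tokens; all names here (FifoQueue, Actor, canFire) are hypothetical:

```cpp
#include <queue>
#include <vector>

// Minimal illustration of the dataflow model described above.
// Each actor consumes tokens from input FIFOs and produces tokens
// on output FIFOs (data-driven firing).
struct FifoQueue {
    std::queue<int> tokens;
    void push(int v) { tokens.push(v); }
    int pop() { int v = tokens.front(); tokens.pop(); return v; }
    bool hasTokens(std::size_t n) const { return tokens.size() >= n; }
};

struct Actor {
    std::vector<FifoQueue*> inputs, outputs;
    // Fires only when every input FIFO has at least one token.
    virtual bool canFire() const {
        for (auto* q : inputs)
            if (!q->hasTokens(1)) return false;
        return true;
    }
    virtual void work() = 0;  // the actor's computation step
    virtual ~Actor() = default;
};
```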
The MIT compiler laboratory has released a stream programming language, StreamIt. The language is based on Java and extends it with streaming constructs, introducing the concept of a Filter. A Filter is the most basic computing unit: a program block with a single input and a single output. The processing inside a Filter is described by a Work function, and Filters communicate in FIFO fashion through Push, Pop and Peek operations. A stream compilation optimization technique was also proposed for the next-generation high-performance Raw architecture: first, the compiler splits and fuses computing nodes, combining data splitting with fusion, to increase the ratio of computation to communication overhead; the processed computing nodes are then mapped onto the processing cores to achieve load balance, with each core executing in a pipelined fashion and explicit communication used between cores for data transfer.
StreamIt's stream compilation optimization provides one solution to the problem of scheduling the stream programming model on multi-core processors. By distributing computing tasks across the processing cores, it achieves load balance and ensures that the tasks execute in parallel. However, it has the following shortcomings: (1) the computation and communication scheduled onto a core are separated, and the pipeline allocates independent communication time slots for them, which increases communication overhead; (2) it does not consider low-level storage allocation or communication optimization on the cores; (3) the compilation optimizations are not tailored to the underlying architectural characteristics of multi-core cluster systems. In short, while a multi-core cluster system provides powerful computing capability, it also exposes its hierarchical storage structure and software communication mechanisms to the programmer. Existing stream compilation optimization methods do not take the underlying architecture into account and fail to make full use of hardware resources, such as storage, to improve program execution efficiency.
Summary of the Invention
The object of the present invention is to provide a dataflow compilation optimization method oriented to multi-core clusters, which optimizes dataflow programs for the architecture of the multi-core cluster system and substantially improves their execution performance.
The optimization method of the present invention takes as input the intermediate representation produced by the dataflow compiler front end, a synchronous dataflow graph, and applies three levels of processing in sequence: task partitioning and scheduling, hierarchical pipeline scheduling, and cache optimization, finally generating executable code. The specific steps are as follows:
(1) Task partitioning and scheduling step: determining the mapping between computing tasks, multi-core cluster computing nodes, and processing cores
The nodes of the dataflow graph represent computing tasks and the edges represent communication between tasks. First, process-level task partitioning is performed on the synchronous dataflow graph according to the number of nodes in the cluster. This sub-step uses the Group partitioning strategy, whose goal is to minimize inter-node communication overhead and maximize program performance; the partitioning must consider both load balance and communication minimization, assigning each computing task to a corresponding cluster node. Second, according to the computing tasks on each cluster node, thread-level task partitioning assigns each task to one of the node's processing cores. This sub-step uses a replication-splitting algorithm to split heavily loaded tasks, with the goal of balancing load across the cores inside each cluster node.
(2) Hierarchical pipeline scheduling step: constructing, from the partitioning and scheduling results, a pipeline schedule across cluster nodes and across the cores within each node
A synchronous pipeline uses a global synchronization clock to guarantee that the tasks in all pipeline stages complete simultaneously, while the subtasks of an asynchronous software pipeline execute in a data-driven manner. First, asynchronous pipeline scheduling is applied to the synchronous dataflow graph to determine the execution process across cluster nodes: the computing tasks of each process are randomly mapped, as a whole, onto the cluster computing nodes, completing the process-to-node mapping. Second, according to the dependencies among the computing tasks within a cluster node, each computing task (node of the graph) is assigned its stage number in the pipeline, completing the synchronous pipeline construction. Finally, using both pieces of information, a hierarchical pipeline schedule is constructed.
(3) Cache optimization step: performed according to the structural characteristics of the multi-core processors, the communication among cluster nodes, and the execution of the dataflow program on the multi-core processors
When computing tasks (nodes) execute, the processing cores on which they run may exhibit false sharing in their use of the cache, which significantly degrades program performance.
General-purpose x86 multi-core processors are analyzed, and a combination of cache-line padding and steady-state expansion is used to eliminate false sharing during program execution, optimizing cache usage.
The present invention combines dataflow scheduling with architecture-specific optimizations for multi-core cluster systems, realizing a three-level optimization process for dataflow programs: task partitioning and scheduling, hierarchical pipeline scheduling, and cache optimization. This improves the execution performance of dataflow programs on the target platform. Specifically, the invention has the following advantages:
(1) Improved program parallelism. Through a formal description of the problem, the invention abstracts the scheduling of the dataflow graph onto the processing cores of a multi-core cluster system as a greedy problem, thereby constructing a hierarchical pipeline scheduling model for the dataflow program. Tasks are mapped onto every processing core, achieving low communication overhead and load balance and improving program parallelism.
(2) Reduced overhead. The invention proposes a hierarchical pipeline scheduling model mixing synchronous and asynchronous pipelines to make full use of the system's computation and communication resources. At the same time, cache usage inside cluster nodes is optimized, improving data locality and cache utilization and enhancing runtime efficiency.
Brief Description of the Drawings
Fig. 1 is a structural framework diagram of the method of the present invention within a dataflow compilation system;
Fig. 2 is a flow chart of the replication-splitting algorithm applied to a dataflow program within a cluster node in an embodiment of the present invention;
Fig. 3 is an example diagram of asynchronous pipelined execution of a dataflow program on a cluster in an embodiment of the present invention;
Fig. 4(a) is an example diagram of task partitioning and stage assignment in synchronous software pipeline scheduling in an embodiment of the present invention;
Fig. 4(b) is an example diagram of the software pipeline execution process corresponding to Fig. 4(a);
Fig. 5(a) is a schematic diagram of task execution when steady-state expansion is applied to eliminate false sharing in an embodiment of the present invention;
Fig. 5(b) is a schematic diagram of the tasks of Fig. 5(a) before false sharing is eliminated;
Fig. 5(c) is a schematic diagram of the tasks of Fig. 5(a) after false sharing is eliminated.
Detailed Description of the Embodiments
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments described below may be combined with one another as long as they do not conflict.
Fig. 1 shows the structural framework of this embodiment within the stream compilation system. After a dataflow program is parsed by the dataflow compiler front end, an intermediate representation is generated: a synchronous dataflow graph (SDF). The graph then passes through the three-level optimization process of task partitioning and scheduling, hierarchical pipeline scheduling, and cache and communication optimization, and finally target code wrapped with the Message Passing Interface (MPI) is generated, completing compilation.
(1) Task partitioning and scheduling step: determining the mapping between computing tasks, multi-core cluster computing nodes, and processing cores
This step includes two sub-steps: process-level task partitioning and thread-level task partitioning. In a multi-core cluster system, different nodes have different network addresses and must communicate over the network, which is expensive, whereas communication within a node is machine-internal and cheap. Task partitioning for dataflow programs must therefore distinguish between inter-node and intra-node (inter-core) communication. The task partitioning at the different levels of the cluster is as follows: process-level partitioning minimizes inter-node communication overhead while keeping the load balanced across nodes and ensuring no cycles appear among the partitions; thread-level partitioning minimizes synchronization overhead while keeping the load balanced and preserving data locality as much as possible. The specific steps are as follows:
(1.1) Process-level task partitioning. Process-level partitioning determines the mapping between computing units and cluster nodes. To amortize the per-unit communication overhead of the dataflow program at run time, inter-process data communication uses a block communication mechanism: message transfer is triggered only when a buffer fills up or is forcibly flushed. To prevent deadlock during execution, process-level partitioning must avoid cycles in the data dependencies between partitions. For process-level partitioning of the synchronous dataflow graph on a multi-core cluster, a Group partitioning strategy is proposed and implemented with a greedy algorithm. It introduces the group structure, where a group is a set of one or more computing units of the synchronous dataflow graph. Initially, every computing unit of the graph is treated as its own group, and the dependencies between groups coincide with those between computing units. Group partitioning consists of four main phases:
(1.1.1) Preprocessing phase. This phase targets computing units with multiple inputs and outputs in the synchronous dataflow graph. It fuses several computing units into one group, reducing the number of communication edges between a unit inside a group and units in other groups.
(1.1.2) Group coarsening phase. This phase coarsens the preprocessed group graph, fusing multiple adjacent groups into one while avoiding the creation of cycles in the group graph. The gain produced by fusing a pair of groups is called the coarsening gain and is computed as follows:
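The original publication renders the coarsening-gain formula as an image that did not survive extraction. Based on the variable definitions below, a plausible reconstruction (an assumption, not the patent's verbatim formula) is:

$$\mathrm{gain}(srcGroup,\,snkGroup) = \frac{\mathrm{comm}(srcGroup,\,snkGroup)}{\mathrm{workload}(srcGroup) + \mathrm{workload}(snkGroup)}$$

i.e., fusions that remove a large amount of communication relative to the combined workload are preferred.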
Here, workload(srcGroup) and workload(snkGroup) denote the loads of srcGroup and snkGroup, respectively, and comm(srcGroup, snkGroup) denotes the communication overhead between them, covering both data sending and data receiving.
Coarsening follows a greedy heuristic. The coarsening gains of all adjacent group pairs are first computed and stored in a priority queue. The pair with the largest gain is selected from the queue and fused; the fusion is valid if the load of the resulting group does not exceed the theoretical post-partitioning average load and no cycle appears in the group graph after fusion. The fused groups are removed from the group graph, the new group is inserted, the dependencies between groups are updated, the gains in the priority queue are updated for the new group, and the process iterates. The algorithm terminates when no pair of groups yields a positive gain or the number of groups in the graph falls below a threshold.
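A self-contained C++ toy of this greedy loop, assuming the reconstructed gain formula above (all structures and values are hypothetical, and the cycle check and stale-pair filtering of the real algorithm are omitted):

```cpp
#include <cstdio>
#include <queue>
#include <vector>

struct Pair { int a, b; double gain; };
struct ByGain {
    bool operator()(const Pair& x, const Pair& y) const { return x.gain < y.gain; }
};

int main() {
    std::vector<double> load = {3.0, 1.0, 2.0, 4.0};   // per-group workloads
    const double avgLoad = 5.0;                        // theoretical average after partitioning
    std::priority_queue<Pair, std::vector<Pair>, ByGain> pq;
    // seed with gains of adjacent pairs; gain = comm / (load_a + load_b)
    pq.push({0, 1, 2.0 / (load[0] + load[1])});
    pq.push({1, 2, 1.0 / (load[1] + load[2])});
    pq.push({2, 3, 3.0 / (load[2] + load[3])});
    while (!pq.empty()) {
        Pair p = pq.top(); pq.pop();
        if (p.gain <= 0) break;                        // no positive gain left: stop
        if (load[p.a] + load[p.b] > avgLoad) continue; // fused group too heavy: skip
        load[p.a] += load[p.b]; load[p.b] = 0;         // fuse b into a
        std::printf("fused groups %d and %d\n", p.a, p.b);
        // the real algorithm re-inserts updated gains for the new group here
    }
    return 0;
}
```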
(1.1.3) Initial partitioning phase. This phase makes a preliminary decision on the mapping between the groups of the coarsened group graph and the cluster nodes. The initial partitioning aims to balance the load of the partitions while keeping inter-partition communication as small as possible. It adopts a deadlock-prevention strategy, avoiding cycles in the partitioning result from the outset. After coarsening, the group graph is a directed acyclic graph (DAG); topological sorting of a DAG exploits the partial order among its nodes to yield a topological sequence. During initial partitioning, the groups are examined one by one in topological order and each group's partition number is determined.
(1.1.4) Fine-grained adjustment phase. This phase further tunes the boundary computing units of the partitions, i.e., units that communicate with computing units on other cluster nodes, according to their communication patterns, reducing inter-node communication overhead. For a boundary computing unit, the partition containing it is called its source partition (srcPartition), and a partition containing a unit it depends on is called a target partition (objPartition); a unit has exactly one srcPartition but may have several objPartitions. The communication volume between the unit and the other units in its srcPartition is internalData, and the volume between the unit and the units in the i-th objPartition is externalData[i]. During fine-grained adjustment, a priority queue is maintained whose weights are externalData[i] - internalData. The unit with the largest weight is processed first; whether a unit can be moved to an objPartition depends on two factors: first, the move must not introduce a cycle into the partitioning; second, it must not unduly disturb the load balance across partitions. After a unit is adjusted, the priority queue is updated according to the result, but a unit that has already been adjusted is not considered for adjustment again.
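The weight ordering of this phase can be sketched as a small helper that picks the most beneficial target partition for one boundary unit, leaving the cycle and balance checks to the caller (an illustration only; all names are hypothetical):

```cpp
#include <vector>

// externalData[i] is the unit's traffic to the i-th target partition;
// internalData is the traffic it keeps inside its source partition.
// Returns the index of the best move, or -1 if no move has positive weight.
int bestTargetPartition(const std::vector<double>& externalData,
                        double internalData) {
    int best = -1;
    double bestWeight = 0.0;  // only moves with positive weight reduce traffic
    for (std::size_t i = 0; i < externalData.size(); ++i) {
        double w = externalData[i] - internalData;
        if (w > bestWeight) { bestWeight = w; best = static_cast<int>(i); }
    }
    return best;  // cycle and load-balance checks are applied by the caller
}
```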
(1.2) Thread-level task partitioning. Thread-level partitioning determines the mapping between the processing cores inside a cluster node and the computing units on that node. Task execution within a node uses synchronous pipeline scheduling, and thread-level partitioning uses an allocation strategy that balances load while minimizing synchronization overhead; the main considerations are load balance and locality. The thread-level partitioning steps are as follows: first, a multilevel K-way graph partitioning algorithm performs the initial partitioning of the computing units inside each cluster node; second, a replication-splitting algorithm splits heavily loaded computing units to reduce their granularity. Fig. 2 shows the flow chart of the replication-splitting algorithm for a dataflow program inside a multi-core cluster node. The algorithm proceeds as follows: taking the result of the K-way partitioning above as input, compute the computational load of each partition and sort the partitions by load; find the most heavily loaded partition MaxPartition (with load maxWeight) that contains a splittable actor (a basic computing unit), and the most lightly loaded partition MinPartition (with load minWeight); test the inequality maxWeight < minWeight * balanceFactor (where balanceFactor is a balance factor). If it holds, the algorithm terminates; otherwise, find the splittable actor with the largest workload in MaxPartition, compute its split factor repFactor with repFactor = max(repFactor, 2), split the actor horizontally into repFactor copies, place one copy in MinPartition and the remaining repFactor - 1 copies in MaxPartition, remove the original actor from MaxPartition, and return to the beginning (recomputing and re-sorting the partition loads), looping until the exit condition is met. Finally, the multilevel K-way graph partitioning algorithm is applied again to the split graph, ensuring load balance and good locality across the processing cores.
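Purely illustrative, under assumed data structures (Actor and Partition are hypothetical, parts is assumed non-empty, and the split-factor computation is simplified to 2), the loop can be sketched in C++ as follows:

```cpp
#include <algorithm>
#include <vector>

struct Actor { double weight; bool splittable; };
struct Partition {
    std::vector<Actor> actors;
    double load() const {
        double s = 0;
        for (const auto& a : actors) s += a.weight;
        return s;
    }
};

void replicationSplit(std::vector<Partition>& parts, double balanceFactor) {
    for (;;) {
        auto byLoad = [](const Partition& x, const Partition& y) {
            return x.load() < y.load();
        };
        Partition& minP = *std::min_element(parts.begin(), parts.end(), byLoad);
        Partition& maxP = *std::max_element(parts.begin(), parts.end(), byLoad);
        if (maxP.load() < minP.load() * balanceFactor) break;  // balanced: done
        // heaviest splittable actor in the most loaded partition
        auto it = std::max_element(maxP.actors.begin(), maxP.actors.end(),
            [](const Actor& a, const Actor& b) {
                if (a.splittable != b.splittable) return !a.splittable;
                return a.weight < b.weight;
            });
        if (it == maxP.actors.end() || !it->splittable) break; // nothing to split
        const int repFactor = 2;  // real algorithm: max(computed score, 2)
        Actor piece{it->weight / repFactor, it->splittable};
        maxP.actors.erase(it);                    // remove the split actor
        minP.actors.push_back(piece);             // one copy goes to MinPartition
        for (int i = 1; i < repFactor; ++i)
            maxP.actors.push_back(piece);         // the rest stay in MaxPartition
    }
}
```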
(2) Hierarchical pipeline scheduling step: constructing, from the partitioning and scheduling results, a pipeline schedule across cluster nodes and across the cores within each node
Based on the task partitioning result of step (1), this step determines the pipelined execution of the process-level and thread-level tasks so that program execution latency is as small as possible. It comprises two parts: asynchronous pipeline scheduling between cluster nodes and synchronous software pipeline scheduling among the cores within a node. The synchronous pipeline uses a global synchronization clock to guarantee that the tasks in all pipeline stages complete simultaneously, with every stage having the same execution latency. The subtasks of the asynchronous software pipeline execute in a data-driven manner: when the data produced by one subtask is sent to a dependent subtask, the latter may start executing as soon as the data arrives and its other conditions are satisfied. In the asynchronous pipeline, execution requires no global synchronization and computation is decoupled from communication. To balance computation time against data transfer time, data transfer between asynchronous pipeline subtasks typically uses a block transfer mechanism: a message transfer is triggered as soon as the communication buffer between tasks fills up, without waiting for the current stage of the subtask to finish. The specific steps are as follows:
(2.1) Asynchronous pipeline scheduling between cluster nodes
Process-level partitioning, in assigning subtasks to nodes, also determines the dependencies between subtasks. Asynchronous pipeline scheduling has no global synchronization clock; subtask execution is data-driven and follows the producer-consumer pattern. Fig. 3 illustrates the execution of a dataflow program on a cluster of three machines: the three multi-core machines correspond to the three subtasks I, II and III into which the compiler's process-level partitioning divides the program. How the actors execute within a machine depends on the machine's internal parallel architecture and scheduling; on a shared-memory multi-core platform, synchronous pipeline scheduling is used inside the node. To amortize the per-unit transfer overhead between nodes, the dataflow program uses block communication between nodes: when the producer fills a communication block, the message-passing mechanism is triggered, and the consumer begins execution after receiving the message. Taking I and II in Fig. 3 as an example, after actor C has executed for some time, the communication buffer between actor C and actor F fills up and C sends its data to F; F starts executing upon receiving the data, while C continues to produce new data. Asynchronous pipelined execution thus guarantees the execution of the dataflow program across the cluster.
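Since the generated code is MPI-wrapped, a minimal MPI sketch of this block communication might look as follows (the block size, tag, and loop counts are hypothetical choices; this is not the patent's generated code):

```cpp
#include <mpi.h>
#include <vector>

constexpr int BLOCK = 1024;  // assumed communication block size

// Producer side: accumulate tokens and send only when the block is full.
void produce(int consumerRank) {
    std::vector<int> buf;
    buf.reserve(BLOCK);
    for (int step = 0; step < 10 * BLOCK; ++step) {
        buf.push_back(step);  // one token produced by the actor's work
        if (static_cast<int>(buf.size()) == BLOCK) {
            // buffer full: trigger the message-passing mechanism
            MPI_Send(buf.data(), BLOCK, MPI_INT, consumerRank, 0, MPI_COMM_WORLD);
            buf.clear();      // producer keeps generating new data
        }
    }
}

// Consumer side: execution is driven by the arrival of each block.
void consume(int producerRank) {
    std::vector<int> buf(BLOCK);
    for (int msg = 0; msg < 10; ++msg) {
        MPI_Recv(buf.data(), BLOCK, MPI_INT, producerRank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // the consumer's work fires here on the received block
    }
}
```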
(2.2) Synchronous pipeline scheduling inside a cluster node
Thread-level synchronous pipeline scheduling comprises two steps: stage assignment and construction of the pipeline schedule. After thread-level task partitioning is completed, stage numbers are assigned and the synchronous software pipeline is built. The specific steps are as follows:
(2.2.1) Stage assignment. First, the computing nodes of the dataflow graph within a cluster node are topologically sorted to form a topological sequence. Then, the stage number of each computing node in the sequence is initialized to 0, and its predecessors are examined: if a predecessor is on the same cluster node, check whether it is on the same processing core; if it is on the same core, the node shares the predecessor's stage; if it is on a different core, the node's stage number is one greater than the predecessor's; if the predecessor is on a different cluster node, the node's stage number is independent of that predecessor. By traversing the topological sequence of computing nodes, stage numbers are assigned to all nodes.
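A sketch of this rule over a topologically sorted node list (hypothetical structures; taking the maximum over several predecessors is an assumption, since the patent text states the rule for one predecessor at a time):

```cpp
#include <algorithm>
#include <vector>

struct Node {
    int core;                // processing core this node was mapped to
    int clusterNode;         // cluster node this node was mapped to
    std::vector<int> preds;  // indices of predecessor nodes in topo order
    int stage = 0;
};

void assignStages(std::vector<Node>& topo) {
    for (auto& n : topo) {   // topo is already in topological order
        n.stage = 0;
        for (int p : n.preds) {
            const Node& pred = topo[p];
            if (pred.clusterNode != n.clusterNode) continue;  // independent
            int s = (pred.core == n.core) ? pred.stage        // same core: same stage
                                          : pred.stage + 1;   // cross-core: +1
            n.stage = std::max(n.stage, s);
        }
    }
}
```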
(2.2.2) Constructing the pipeline schedule. The results of task partitioning and stage assignment are assembled into a synchronous pipeline schedule. In Fig. 4, the horizontal axis represents resources (the processing cores) and the vertical axis represents the stage number. In Fig. 4(a), P, Q and S are assigned to core Core0, R and T to core Core1, and U and V to core Core2. P is the initial node with stage number 0; Q is on the same core as its parent P, so its stage number is also 0; R has stage number 1, S has stage number 2, T has stage number 3, and U and V have stage number 4. As shown in Fig. 4(b), the software pipeline passes through a fill phase, a full (steady) phase, and a drain phase during execution.
(3) Cache optimization step: performed according to the structural characteristics of the multi-core processors, the communication among cluster nodes, and the execution of the dataflow program on the multi-core processors
Because multiple threads share cached data and the cache is organized in units of cache lines, false sharing occurs when threads modify mutually independent variables that happen to reside on the same cache line, degrading program performance. This step optimizes against false sharing in cache accesses from two directions:
(3.1) Cache-line padding to eliminate the false sharing caused by pipeline stage synchronization. A line-padding mechanism ensures that variables belonging to different threads never share a cache line, eliminating false sharing.
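A conventional C++ illustration of cache-line padding (64-byte lines are assumed, as is typical on x86; this is not code from the patent):

```cpp
#include <cstddef>

constexpr std::size_t CACHE_LINE = 64;  // assumed line size

// Without padding, adjacent counters of different threads can land on
// the same cache line and falsely share it.
struct UnpaddedCounter { long value; };

// Padding each counter out to a full line means a write by one thread
// never invalidates the line holding another thread's counter.
struct alignas(CACHE_LINE) PaddedCounter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];  // explicit fill to a full line
};

PaddedCounter perThread[8];  // e.g., one synchronization counter per core
```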
(3.2) Steady-state expansion to eliminate the false sharing caused by data transfer between computing units. As shown in Fig. 5(a), if producer P and consumer C in a producer-consumer chain execute in parallel on different cores, false sharing also occurs when the memory P and C access lies on the same cache line, as shown in Fig. 5(b). A complex dataflow graph contains many inter-core communication edges; if cache-line padding were still used for all of them, a large amount of space would be wasted, lowering storage utilization and incurring high communication latency. To eliminate false sharing on communication buffers while keeping cache utilization as high as possible, steady-state expansion is used. Fig. 5(c) shows the cache usage after false sharing has been eliminated. The steady-state expansion algorithm follows a greedy idea: it first computes, for one steady-state execution of the dataflow program, the factor by which each relevant computing unit would have to be expanded to eliminate false sharing on all of its output edges; then, among all these factors, it chooses the largest factor under which no expanded computing unit overflows the L1 data cache during execution, and uses it as the final expansion factor. To let the cache work even more effectively, the search for the expansion factor need not guarantee that no unit ever overflows the data cache: following the "90/10 rule", 10% of the computing units are allowed to overflow the L1 data cache during execution, as long as they do not overflow the L2/L3 caches, which still achieves good performance.
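An illustrative sketch of the factor search under these assumptions (hypothetical names; how candidate factors and per-unit footprints are derived is elided):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// footprint[i] is unit i's working-set bytes for one steady-state
// execution; candidates are the per-edge expansion factors computed
// when eliminating false sharing on each unit's output edges.
int chooseExpansionFactor(const std::vector<std::size_t>& footprint,
                          std::vector<int> candidates,
                          std::size_t l1Bytes) {
    std::sort(candidates.rbegin(), candidates.rend());  // try largest first
    for (int f : candidates) {
        std::size_t overflowing = 0;
        for (std::size_t fp : footprint)
            if (fp * f > l1Bytes) ++overflowing;
        // "90/10 rule": tolerate up to 10% of units spilling out of L1
        if (overflowing * 10 <= footprint.size()) return f;
    }
    return 1;  // no expansion if every candidate overflows too many units
}
```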
Those skilled in the art will readily understand that the above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within its scope of protection.
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410185945.5A (granted as CN103970580B) | 2014-05-05 | 2014-05-05 | A kind of data flow towards multinuclear cluster compiles optimization method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN103970580A | 2014-08-06 |
| CN103970580B | 2017-09-15 |