CN110363700A

CN110363700A - A parallel enumeration method of custom instructions based on depth map segmentation

Info

Publication number: CN110363700A
Application number: CN201910627526.5A
Authority: CN
Inventors: 肖成龙; 王珊珊; 王心霖
Original assignee: Liaoning Technical University
Current assignee: Liaoning Technical University
Priority date: 2019-07-12
Filing date: 2019-07-12
Publication date: 2019-10-22

Abstract

The invention provides a parallel enumeration method of custom instructions based on depth map segmentation, and relates to the technical field of electronic design automation. The method first adopts the master-slave parallel mode, and the master node in the computing cluster receives the data flow graph generated in the intermediate representation generation stage of the automatic generation process of the dedicated custom instruction set as input; The graph segmentation method divides the original data flow graph into several subgraphs, and assigns the divided subgraphs to idle computing nodes in the computing cluster; at the same time, the running time of the divided subtasks is predicted, according to the predicted time of all subtasks and the number of computing nodes to determine whether it is necessary to continue to divide complex subtasks; the computing nodes use the convex subgraph enumeration algorithm to enumerate custom instructions from the received subgraph. The method of the invention can more effectively ensure the load balance among the computing nodes and achieve an approximate linear speedup ratio.

Description

A parallel enumeration method of custom instructions based on depth map segmentation

技术领域technical field

本发明涉及电子设计自动化技术领域，尤其涉及一种基于深度图分割的自定义指令并行枚举方法。The invention relates to the technical field of electronic design automation, in particular to a parallel enumeration method of custom instructions based on depth map segmentation.

背景技术Background technique

为了满足嵌入式应用对高性能和低功耗不断增长的需求，通过使用加速器或自定义功能单元(Custom Function Unit)的定制计算越来越多地应用到嵌入式系统当中。其中，专用处理器是实现定制运算的最重要的方案之一。To meet the ever-increasing demands of embedded applications for high performance and low power consumption, custom computing by using accelerators or custom function units (Custom Function Unit) is increasingly applied to embedded systems. Among them, the special-purpose processor is one of the most important solutions to realize the customized operation.

专用处理器是一种架构与指令集优化设计的处理器，其通过扩展指令集，使得目标应用程序的部分代码在基准处理器执行，其它计算密集的代码在自定义指令的硬件实现-自定义功能单元中执行。应用专用芯片是针对某个特定应用而设计，只能运行特定的应用，因此缺乏一定的灵活性。相对于应用专用芯片(ASIC)，专用处理器架构中的基准处理器可以保证一定的灵活性。其次，设计和生产应用专用芯片具有设计周期长和测试成本高的缺点。A dedicated processor is a processor with an optimized architecture and instruction set. By extending the instruction set, part of the code of the target application program is executed on the benchmark processor, and other computationally intensive codes are implemented in the hardware of custom instructions - custom executed in the functional unit. Application-specific chips are designed for a specific application and can only run a specific application, so they lack a certain degree of flexibility. A benchmark processor in an application-specific processor architecture can guarantee a certain flexibility relative to an application-specific chip (ASIC). Second, designing and producing application-specific chips has the disadvantages of long design cycles and high testing costs.

相对于通用处理器，专用处理器通过使用自定义指令封装一系列基本操作指令(例如，加法，减法，乘法以及逻辑操作)，使得这些基本操作指令之间根据数据依赖关系自动链化，没有数据依赖关系的基本指令并行化，从而很大程度地提高运算的速度。此外，由于多个基本指令封装到一个自定义指令中，使得取指令次数和数据在寄存器与处理器之间传输的次数减少，进而专用处理器的功耗显著低于通用处理器。Compared with general-purpose processors, special-purpose processors use custom instructions to encapsulate a series of basic operation instructions (such as addition, subtraction, multiplication, and logical operations), so that these basic operation instructions are automatically chained according to data dependencies without data. The basic instructions of dependencies are parallelized, thereby greatly improving the speed of operations. In addition, since multiple basic instructions are encapsulated into a custom instruction, the number of instruction fetches and the number of data transfers between registers and the processor are reduced, and the power consumption of special-purpose processors is significantly lower than that of general-purpose processors.

扩展指令集的自动生成是专用处理器设计实现的关键。专用处理器的应用专用扩展指令集的自动生成一般包括如图1所示四个步骤：中间表示生成，自定义指令枚举，自定义指令选择，代码生成。中间表示生成阶段将应用程序转换为合适的中间表示，比如，控制数据流图；自定义指令枚举阶段在架构约束下枚举出所有满足约束条件的子图作为候选的自定义指令；自定义指令选择阶段根据不同的设计目的从枚举出的子图(自定义指令的图形化表示)中选出一部分最佳子图作为最终的自定义指令。这些选出的自定义指令就构成了最终的扩展指令集。代码生成阶段负责自动生成自定义指令的硬件实现代码，以及将源代码转换为包含自定义指令的新代码。The automatic generation of extended instruction sets is the key to the design and implementation of special-purpose processors. The automatic generation of the application-specific extended instruction set of the special-purpose processor generally includes four steps as shown in Figure 1: intermediate representation generation, user-defined instruction enumeration, user-defined instruction selection, and code generation. The intermediate representation generation phase converts the application into a suitable intermediate representation, for example, a control data flow graph; the custom instruction enumeration phase enumerates all subgraphs that satisfy the constraints as candidate custom instructions under architectural constraints; The instruction selection stage selects a part of the best subgraphs from the enumerated subgraphs (graphical representation of custom instructions) as the final custom instruction according to different design purposes. These selected custom instructions constitute the final extended instruction set. The code generation phase is responsible for automatically generating the hardware implementation code of the custom instructions, as well as converting the source code into new code containing the custom instructions.

自定义指令枚举和自定义指令选择是扩展指令集自动生成过程中最为关键和复杂的两个阶段。在自定义指令枚举方面已有大量研究工作，这些研究工作都是采用串行的方法，然而，当问题规模较大时，串行方法可能无法在合理的时间内给出最优设计方案或者不能给出最优设计方案。自定义指令枚举问题是从应用程序对应的数据流图中枚举出所有满足一定设计约束或用户约束的凸子图作为候选自定义指令。一个给定的数据流图中的子图(AS)数目最多可能有2n个，其中n是数据流图中结点的数目。由此可以看出自定义指令枚举是算法复杂度非常高的问题。为了降低问题的复杂度，之前的研究或是引入微体系结构的约束或是人为加入一些约束条件。根据约束条件的不同，之前的研究可以如下分类和分析：Custom instruction enumeration and custom instruction selection are the two most critical and complex stages in the automatic generation of extended instruction sets. There has been a lot of research work on custom instruction enumeration, all of which use serial methods. However, when the problem is large, the serial method may not be able to give the optimal design in a reasonable time or The optimal design scheme cannot be given. The problem of custom instruction enumeration is to enumerate all convex subgraphs that satisfy certain design constraints or user constraints from the data flow graph corresponding to the application as candidate custom instructions. The number of subgraphs (AS) in a given dataflow graph may be at most 2n, where n is the number of nodes in the dataflow graph. It can be seen that the enumeration of custom instructions is a problem with very high algorithm complexity. In order to reduce the complexity of the problem, the previous research either introduced the constraints of the microarchitecture or artificially added some constraints. Depending on the constraints, previous studies can be categorized and analyzed as follows:

(1)树形子图(TS)：为了降低枚举的复杂度，早期的研究主要集中于枚举出所有树形子图。然而，只枚举树形子图作为自定义指令，只能非常有限地提升性能或降低功耗。(1) Tree subgraph (TS): In order to reduce the complexity of enumeration, early research mainly focused on enumerating all tree subgraphs. However, only enumerating tree subgraphs as custom instructions can only improve performance or reduce power very limitedly.

(2)多输入单输出子图(MISO)：此类研究关注于枚举所有具有多个输入个单个输出的子图。虽然只枚举多输入单输出子图可以很大程度上降低问题的复杂度，但是多输入单个输出子图作为自定义指令带来的性能提升或功耗降低还是非常有限。(2) Multiple Input Single Output Subgraph (MISO): This type of research focuses on enumerating all subgraphs with multiple inputs and a single output. Although only enumerating the multi-input single-output subgraph can greatly reduce the complexity of the problem, the performance improvement or power consumption reduction brought by the multi-input single-output subgraph as a custom command is still very limited.

(3)多输入多输出子图(MIMO)：由于寄存器的读写端口的数目有限(I/O)，近期的一些研究主要是关注于枚举满足I/O限制的子图。其中Pozzi等人提出了基于二元决策树的枚举算法，该算法利用数据流图当中按拓扑顺序形成子图的输出数目的单调性来削减搜索空间。为了更好利用数据流图的拓扑结构特性，Xiao等人提出了更有效的算法。Xiao等人提出了基于一次性图分割方法自定义指令并行枚举方法，当计算节点较少时，此方法能取得近似线性加速比。Chen等人首次证明了满足I/O条件的子图数目的上限为n^IN+OUT，其中n为数据流图中结点的个数，同时也提出了一个能够通过参数设置来分别枚举出连通子图和分离子图的算法。然而，此算法在枚举所有子图(包括连通子图和分离子图)上的运行时间只是与Pozzi等人提出的算法的运行时间相当。Xiao等人提出了一个能够根据用户要求灵活地枚举出连通子图或分离子图的算法。实验表明，该算法在枚举所有子图上的运行时间要比Chen等人提出的算法快一到两个数量级。(3) Multiple-input multiple-output subgraphs (MIMO): Due to the limited number of read and write ports (I/O) of registers, some recent researches mainly focus on enumerating subgraphs that satisfy the I/O constraints. Among them, Pozzi et al. proposed an enumeration algorithm based on binary decision tree, which reduces the search space by using the monotonicity of the number of outputs that form subgraphs in topological order in the data flow graph. In order to make better use of the topological structure of the data flow graph, Xiao et al. proposed a more efficient algorithm. Xiao et al. proposed a parallel enumeration method of custom instructions based on the one-shot graph segmentation method. When there are few computing nodes, this method can achieve an approximate linear speedup. Chen et al. proved for the first time that the upper limit of the number of subgraphs satisfying the I/O condition is n ^IN+OUT , where n is the number of nodes in the data flow graph, and also proposed a method that can be enumerated separately through parameter settings. Algorithms for connected and separating subgraphs. However, the running time of this algorithm on enumerating all subgraphs (including connected and separating subgraphs) is only comparable to that of the algorithm proposed by Pozzi et al. Xiao et al. proposed an algorithm that can flexibly enumerate connected subgraphs or separate subgraphs according to user requirements. Experiments show that the algorithm runs one to two orders of magnitude faster than the algorithm proposed by Chen et al. on enumerating all subgraphs.

(4)最大凸子图(MaxMIMO)：通过实验发现，放松I/O限制的情况下枚举出的自定义指令，往往能够带来更高的性能提升。因此，近年来，也有部分研究集中于枚举最大凸子图，而不考虑I/O的限制。需要注意的是，MaxMIMO子图作为自定义指令虽然能够带来更高的性能提升，但是其通常不具备良好的重用性或通用性。(4) Maximum convex subgraph (MaxMIMO): Through experiments, it is found that custom instructions enumerated in the case of relaxing I/O constraints can often bring higher performance improvements. Therefore, in recent years, some studies have also focused on enumerating maximal convex subgraphs without considering the I/O constraints. It should be noted that although MaxMIMO submaps can bring higher performance improvement as a custom command, they usually do not have good reusability or generality.

除了以上类型的自定义指令枚举的研究，近年来，也有部分学者提出相关的算法来枚举出所有的凸子图(ACS)。Gutin等人首次证明了数据流图中的凸子图的数目的上限为2ⁿ+n+1-d_n，其中n为数据流图中的结点数目，如果n为偶数d_n＝2*2^n/2，如果n为奇数d_n＝3*2^(n-1)/2。Wang等人提出了能够枚举出所有凸子图或枚举满足大小约束的凸子图的算法，该算法较Balistera等人提出的算法平均快3.29倍。Wang等人同时也提出了一种基于一次性图分割的子图并行枚举方法，实验结果表明相对于串行方法，当计算节点数目较少时，该并行方法可达到近似线性的加速比。然而随着计算节点数目增加，负载不均衡的问题逐渐显现，该方法取得加速比呈下降的趋势。Rezgui等人提出了一种通过将初始问题分解为足够多的子问题来保证计算节点间的负载均衡的并行处理方法。该作者通过大量实验发现通常每个计算节点分配的子问题数目在30至100之间时，能够较好地保证负载均衡。然而，由于每个计算节点分配合适的子问题数目是很难确定的，该方法仍然可能造成计算节点之间的负载不均衡，从而影响并行处理的效率。In addition to the research on the above types of custom instruction enumeration, in recent years, some scholars have also proposed related algorithms to enumerate all convex subgraphs (ACS). Gutin et al. proved for the first time that the upper bound on the number of convex subgraphs in a data flow graph is 2 ⁿ +n+1-d _n , where n is the number of nodes in the data flow graph, if n is an even number d _n = 2* 2 ^n/2 if _n is odd dn =3*2 ^(n-1)/2 . Wang et al. proposed an algorithm capable of enumerating all convex subgraphs or enumerating convex subgraphs satisfying size constraints, which is on average 3.29 times faster than the algorithm proposed by Balisterera et al. Wang et al. also proposed a parallel enumeration method of subgraphs based on one-shot graph segmentation. The experimental results show that compared with the serial method, when the number of computing nodes is small, the parallel method can achieve an approximate linear speedup. However, as the number of computing nodes increases, the problem of unbalanced load gradually emerges, and the speedup ratio obtained by this method tends to decrease. Rezgui et al. proposed a parallel processing method that ensures load balancing among computing nodes by decomposing the initial problem into enough sub-problems. The author found through a large number of experiments that when the number of sub-problems allocated to each computing node is usually between 30 and 100, the load balance can be better guaranteed. However, since it is difficult to determine the appropriate number of sub-problems allocated to each computing node, this method may still cause unbalanced load among computing nodes, thereby affecting the efficiency of parallel processing.

发明内容SUMMARY OF THE INVENTION

本发明要解决的技术问题是针对上述现有技术的不足，提供一种基于深度图分割的自定义指令并行枚举方法，采用主从并行模式实现自定义指令并行枚举。The technical problem to be solved by the present invention is to aim at the deficiencies of the above-mentioned prior art, and provide a parallel enumeration method of custom instructions based on depth map segmentation, which adopts a master-slave parallel mode to realize parallel enumeration of custom instructions.

为解决上述技术问题，本发明所采取的技术方案是：一种基于深度图分割的自定义指令并行枚举方法，包括以下步骤：In order to solve the above-mentioned technical problems, the technical solution adopted by the present invention is: a method for parallel enumeration of self-defined instructions based on depth map segmentation, comprising the following steps:

步骤1、采用主从并行模式，计算集群中的主节点接收专用自定义指令集自动生成过程的中间表示生成阶段产生的数据流图作为输入；Step 1. Adopt the master-slave parallel mode, and the master node in the computing cluster receives the data flow graph generated in the intermediate representation generation stage of the automatic generation process of the dedicated custom instruction set as input;

所述数据流图是一种有向无环图G＝(V，E)，其中结点集V＝{v₁，...，v_n}表示基本指令，n为数据流图结点的个数，边集E＝{e₁，...，e_m}∈V×V表示指令之间数据依赖关系，m表示数据流图边的个数；The data flow graph is a directed acyclic graph G=(V, E), wherein the node set V={v ₁ , . . . , v _n } represents basic instructions, and n is the The number of edge sets E={e ₁ ,...,e _m }∈V×V represents the data dependency between instructions, and m represents the number of edges in the data flow graph;

步骤2、采用基于非线性回归任务运行时间预测模型的深度图分割方法将原始数据流图分割为若干子图，并将分割后的子图分配给计算集群中的空闲计算节点，具体方法为：Step 2. Using the depth map segmentation method based on the non-linear regression task runtime prediction model to divide the original data flow graph into several sub-graphs, and assign the divided sub-graphs to idle computing nodes in the computing cluster. The specific method is as follows:

步骤2.1、对自定义指令枚举任务T进行一次性分割为k个子任务，如下公式所示：Step 2.1. Divide the custom instruction enumeration task T into k subtasks at one time, as shown in the following formula:

其中，G_k＝G-{v₁，v₂，...，v_k-1}为数据流图G分割后产生的第k个子图，k＝1，2，…，|v|，|v|为图G中的结点数，E(G_k，v_k)表示从第k个子图G_k中枚举所有包含结点v_k的自定义指令；Among them, G _k =G-{v ₁ , v ₂ ,...,v _k-1 } is the k-th subgraph generated after the data flow graph G is divided, k=1, 2,...,|v|,| v| is the number of nodes in the graph G, E(G _k , v _k ) means enumerating all custom instructions including the node v _k from the k-th subgraph G _k ;

步骤2.2、建立基于非线性回归的任务运行时间预测模型，对分割的子任务的运行时间进行预测；Step 2.2, establish a task running time prediction model based on nonlinear regression, and predict the running time of the divided subtasks;

当约束条件一定的时候，自定义指令枚举时间跟图中的结点数及边数有关，自定义指令枚举时间随着结点数或边数的增加而增加；对于一个给定的子图枚举任务T_k及其对应的数据流图G_k(V_k，E_k)，自定义指令枚举的最坏运行时间为其中α为常量，|V_k|为图G_k中的结点数；公式中的底数的泛化形式有多种，这里假设：When the constraints are certain, the enumeration time of the custom instruction is related to the number of nodes and edges in the graph, and the enumeration time of the custom instruction increases with the increase of the number of nodes or edges; for a given subgraph Taking task T _k and its corresponding data flow graph G _k (V _k , E _k ), the worst running time of custom instruction enumeration is where α is a constant, |V _k | is the number of nodes in the graph G _k ; there are many generalization forms of the base in the formula, here we assume:

其中，f(|V_k|，|E_k|)为结点数|V_k|和边数|E_k|的多项式函数；where f(|V _k |, |E _k |) is a polynomial function of the number of nodes |V _k | and the number of edges |E _k |;

对公式(2)进行泰勒级数展开得任务运行时间预测模型，如下公式所示：The Taylor series expansion of formula (2) is used to obtain the task running time prediction model, as shown in the following formula:

其中，参数k′用于控制模型的展开，参数a_i，j通过采用非线性回归的方法，根据实验数据拟合得出；Among them, the parameter k' is used to control the expansion of the model, and the parameters a _{i, j} are obtained by fitting the experimental data by using the nonlinear regression method;

步骤2.3、根据所有子任务的预测时间和计算节点的数目，判断是否需要继续对复杂的子任务进行分割；如果子任务的预测运行时间超过给定的时间上限，将该子任务继续分割为若干子任务，直到所有的子任务的预测运行时间小于等于给定的时间上限，否则执行步骤3；所述给定的时间上限为当前所有子任务的平均预测运行时间；Step 2.3. According to the predicted time of all subtasks and the number of computing nodes, determine whether it is necessary to continue to divide complex subtasks; if the predicted running time of the subtask exceeds the given time limit, continue to divide the subtask into several subtasks. Subtask, until the predicted running time of all subtasks is less than or equal to the given time upper limit, otherwise perform step 3; the given time upper limit is the average predicted running time of all current subtasks;

假设子任务T_k的预测运行时间大于给定的时间上限，对T_k继续进行分割，如下公式所示：Assuming that the predicted running time of the subtask _Tk is greater than the given time upper limit, continue to divide _Tk , as shown in the following formula:

其中，G_k，l＝G_k-{w₁，w₂，...，w_l-1}为子任务T_k分割后产生的第l个子图，E(G_k，l，{v_k，w_l})表示从子图G_k，l枚举包含结点v_k和结点w_l的所有自定义指令；h作为截止值，T_k，h-1的预测运行时间大于当前平均预测运行时间，T_k，h的预测运行时间小于当前平均预测运行时间；如果子任务T_k，l的预测运行时间仍然大于上限值，则继续将T_k，l分割成若干子任务；Among them, G _{k, l} =G _k -{w ₁ ,w ₂ ,...,w _l-1 } is the l-th subgraph generated after the subtask Tk is divided, E(G _k _{, l} , {v _k , w _l }) means enumerating all custom instructions including node v _k and node w _l from the subgraph G _{k, l} ; h as the cutoff value, the prediction running time of T _{k, h-1} is greater than the current average prediction Running time, the predicted running time of T _{k, h} is less than the current average predicted running time; if the predicted running time of sub-task T _{k, l} is still greater than the upper limit, continue to divide T _{k, l} into several sub-tasks;

步骤3、计算节点使用凸子图枚举算法从收到的子图中枚举自定义指令；Step 3. The computing node uses the convex subgraph enumeration algorithm to enumerate custom instructions from the received subgraph;

所述枚举自定义指令为：从给定的一个数据流图G＝(V，E)中枚举出满足以下条件的所有子图S：(1)子图S是凸子图；(2)子图S是连通图；The enumeration custom instruction is: from a given data flow graph G=(V, E), enumerate all subgraphs S that satisfy the following conditions: (1) the subgraph S is a convex subgraph; (2) ) subgraph S is a connected graph;

所述凸子图为：对于G的子图S，v∈V_s，若在G中u与v之间的任何路径都只经过S中的结点，则称S是G的凸子图；The convex subgraph is: for the subgraph S of G, v∈V _s , if any path between u and v in G only passes through the nodes in S, then S is said to be a convex subgraph of G;

所述连通图为：对于G的子图S，v∈V_s，存在至少一条路径连接u和v，则S是连通图。The connected graph is: for the subgraph S of G, v∈V _s , there is at least one path connecting u and v, then S is a connected graph.

采用上述技术方案所产生的有益效果在于：本发明提供的一种基于深度图分割的自定义指令并行枚举方法，考虑到自定义指令枚举问题的高复杂度，采用一种基于任务运行时间预测模型的深度图分割方法，将初始问题划分为若干子问题，并由计算节点分别独立地解决子问题。与已有的并行方法相比，本发明方法能够更有效地保证计算节点间的负载均衡，达到近似线性的加速比。The beneficial effect of adopting the above technical solution is that: a method for parallel enumeration of custom commands based on depth map segmentation provided by the present invention, taking into account the high complexity of the problem of custom command enumeration, a method based on task running time is adopted. The depth map segmentation method of the prediction model divides the initial problem into several sub-problems, and the sub-problems are solved independently by the computing nodes. Compared with the existing parallel methods, the method of the present invention can more effectively ensure the load balance among the computing nodes, and achieve an approximate linear speedup ratio.

附图说明Description of drawings

图1为本发明背景技术提供的应用专用扩展指令集自动生成流程图；1 is a flowchart of automatic generation of an application-specific extended instruction set provided by the background technology of the present invention;

图2为本发明实施例提供的一种基于深度图分割的自定义指令并行枚举方法的流程图；2 is a flowchart of a method for parallel enumeration of custom instructions based on depth map segmentation provided by an embodiment of the present invention;

图3为本发明实施例提供的一种基于深度图分割的自定义指令并行枚举方法的框架图；3 is a framework diagram of a method for parallel enumeration of custom instructions based on depth map segmentation according to an embodiment of the present invention;

图4为本发明实施例提供的针对四个测试基准程序三种并行方法取得的加速比对比图，其中，(a)为基准程序MP3，(b)为基准程序MESA，(c)为基准程序IIR，(d)为基准程序DES3；4 is a comparison diagram of speedup ratios obtained by three parallel methods for four test benchmark programs according to an embodiment of the present invention, wherein (a) is the benchmark program MP3, (b) is the benchmark program MESA, and (c) is the benchmark program IIR, (d) is the benchmark program DES3;

图5为本发明实施例提供的使用三种并行方法各个计算节点的总运行时间比较，其中，(a)为MGP并行方法，(b)为EPS并行方法，(c)为ODP并行方法；5 is a comparison of the total running time of each computing node using three parallel methods provided by an embodiment of the present invention, wherein (a) is an MGP parallel method, (b) is an EPS parallel method, and (c) is an ODP parallel method;

图6为本发明实施例提供的采用本发明的运行时间预测模型所预测的时间与实际运行时间的比较图。FIG. 6 is a comparison diagram of the time predicted by the running time prediction model of the present invention and the actual running time provided by an embodiment of the present invention.

具体实施方式Detailed ways

下面结合附图和实施例，对本发明的具体实施方式作进一步详细描述。以下实施例用于说明本发明，但不用来限制本发明的范围。The specific embodiments of the present invention will be described in further detail below with reference to the accompanying drawings and embodiments. The following examples are intended to illustrate the present invention, but not to limit the scope of the present invention.

本实施例中，一种基于深度图分割的自定义指令并行枚举方法，如图2和图3所示，包括以下步骤：In this embodiment, a method for parallel enumeration of custom instructions based on depth map segmentation, as shown in Figure 2 and Figure 3, includes the following steps:

步骤2.2、可以看出以上分割产生的子任务，有些任务明显较其他任务复杂，不能保证计算节点间的负载均衡，需要根据子任务的预测运行时间，选择复杂的子任务进行进一步分割；因此，建立基于非线性回归的任务运行时间预测模型，对分割的子任务的运行时间进行预测；Step 2.2. It can be seen that the sub-tasks generated by the above segmentation are obviously more complex than other tasks, and the load balance between computing nodes cannot be guaranteed. It is necessary to select complex sub-tasks for further segmentation according to the predicted running time of the sub-tasks; therefore, Build a task running time prediction model based on nonlinear regression to predict the running time of the divided subtasks;

步骤2.3、根据所有子任务的预测时间和计算节点的数目，判断是否需要继续对复杂的子任务进行分割；如果子任务的预测运行时间超过给定的时间上限，将该子任务继续分割为若干子任务，直到所有的子任务的预测运行时间小于等于给定的时间上限；所述给定的时间上限为当前所有子任务的平均预测运行时间；Step 2.3. According to the predicted time of all subtasks and the number of computing nodes, determine whether it is necessary to continue to divide complex subtasks; if the predicted running time of the subtask exceeds the given time limit, continue to divide the subtask into several subtasks. subtasks, until the predicted running time of all subtasks is less than or equal to a given time upper limit; the given time upper limit is the average predicted running time of all current subtasks;

其中，G_k，l＝G_k-{w₁，w₂，...，w_l-1}为子任务T_k分割后产生的第l个子图，E(G_k，l，{v_k，w_l})表示从子图G_k，l枚举包含结点v_k和结点w₁的所有自定义指令；h作为截止值，T_k，m-1的预测运行时间大于当前平均预测运行时间，T_k，n的预测运行时间小于当前平均预测运行时间；如果子任务T_k，l的预测运行时间仍然大于上限值，则继续将T_k，l分割成若干子任务；Among them, G _{k, l} =G _k -{w ₁ ,w ₂ ,...,w _l-1 } is the l-th subgraph generated after the subtask Tk is divided, E(G _k _{, l} , {v _k , w _l }) means enumerating all custom instructions including node v _k and node w ₁ from the subgraph G _{k, l} ; h as the cutoff value, the prediction running time of T _{k, m-1} is greater than the current average prediction Running time, the predicted running time of T _{k, n} is less than the current average predicted running time; if the predicted running time of subtask T _{k, l} is still greater than the upper limit, continue to divide T _{k, l} into several subtasks;

本实施例还给出了一种基于深度图分割的自定义指令并行枚举方法的伪代码，如算法1所示：This embodiment also provides a pseudo-code for a parallel enumeration method of custom instructions based on depth map segmentation, as shown in Algorithm 1:

算法1：自定义指令并行枚举方法Algorithm 1: Custom Instruction Parallel Enumeration Method

输入：数据流图G(V，E)Input: data flow graph G(V, E)

输出：满足约束条件的子图集合SOutput: The set of subgraphs S satisfying the constraints

本实施例使用的计算节点的环境均为i3-3240 3.4GHz处理器，4-GB主存储器。测试基准集来源于MiBench。这些基准是由通用编译平台GeCoS进行前端编译和模拟的。实施例中使用的数据流图的特征及串行算法的运行时间见表1。表中列NV、列NE和列NS分别表示数据流图中结点的数目、边的数目和枚举的子图数目。串行算法的运行时间记录在列Runtime(单位为毫秒)。The environments of the computing nodes used in this embodiment are all i3-3240 3.4GHz processors and 4-GB main memory. The test benchmark set is derived from MiBench. These benchmarks are front-end compiled and simulated by the general-purpose compilation platform GeCoS. The characteristics of the data flow graph used in the embodiment and the running time of the serial algorithm are shown in Table 1. Columns NV, NE, and NS in the table represent the number of nodes, the number of edges, and the number of enumerated subgraphs in the data flow graph, respectively. The running time of the serial algorithm is recorded in the column Runtime (in milliseconds).

表1基准程序特征及串行算法运行时间Table 1 Benchmark program characteristics and serial algorithm running time

基准程序benchmark program NVNV NENE NSNS RuntimeRuntime MP3MP3 4343 6666 181,533,673181,533,673 221,554221,554 MESAMESA 3737 6565 7,554,4997,554,499 8,0278,027 IIRIIR 4040 5656 23,195,41423,195,414 28,72528,725 DES3DES3 4545 6060 637,125,710637,125,710 649,873649,873

本实施例中，将本发明的自定义指令并行枚举方法(记为MGP)分别与Wang等人提出的并行枚举方法(记为ODP)及Rezgui等人^]提出的并行方法(记为EPS)进行了比较。所涉及的计算节点的配置均为i3-32403.4GHz处理器，4-GB主存储器，三个并行方法均使用Hadoop1.0.0实现。计算节点采用Wang等人提出子图枚举算法进行子图枚举。本实施例中所用到的测试基准程序如表1所示。对于并行方法EPS，每个计算节点分配的子任务数设为50，因此，子任务总数为50*w，其中w为计算节点的数目。In this embodiment, the parallel enumeration method for custom instructions of the present invention (denoted as MGP) is respectively combined with the parallel enumeration method proposed by Wang et al. (denoted as ODP) and the parallel method proposed by Rezgui et al. ^] (denoted as EPS) ) were compared. The configuration of the computing nodes involved are all i3-32403.4GHz processors, 4-GB main memory, and the three parallel methods are all implemented using Hadoop 1.0.0. The computing node adopts the subgraph enumeration algorithm proposed by Wang et al. to perform subgraph enumeration. The test benchmark programs used in this embodiment are shown in Table 1. For the parallel method EPS, the number of subtasks allocated to each computing node is set to 50, so the total number of subtasks is 50*w, where w is the number of computing nodes.

针对表1中的四个测试基准程序，三个并行算法在不同计算节点数目下取得加速比结果如图4所示。通过对比结果可以观察到，当计算节点小于等于10时，三个并行算法取得的加速比相似，均呈近似线性的加速比。但是，随着计算节点数目的增加，MGP方法取得的加速比明显优于其他两个并行算法，EPS方法优于ODP方法。其中，EPS方法和ODP方法所取得的加速比不能随着计算节点的增加保持线性增长。尤其对于ODP方法，当计算节点增加到18以后，加速比的增长趋于平缓，不再随着计算节点的增加而呈线性增长。对比结果也说明，MGP方法在负载均衡问题上的处理效果好于EPS方法和ODP方法。For the four benchmark programs in Table 1, the speedup results obtained by the three parallel algorithms under different numbers of computing nodes are shown in Figure 4. By comparing the results, it can be observed that when the computing nodes are less than or equal to 10, the speedup ratios obtained by the three parallel algorithms are similar, and they are all approximately linear speedup ratios. However, as the number of computing nodes increases, the speedup obtained by the MGP method is significantly better than the other two parallel algorithms, and the EPS method is better than the ODP method. Among them, the speedup ratio achieved by EPS method and ODP method cannot keep a linear growth with the increase of computing nodes. Especially for the ODP method, when the number of computing nodes increases to 18, the growth of the acceleration ratio tends to be flat, and no longer increases linearly with the increase of computing nodes. The comparison results also show that the MGP method is better than the EPS method and ODP method in dealing with the load balancing problem.

为了进一步分析以上三种并行方法取得的加速比差异，本实施例中比较和分析了三种并行方法使用相同数目计算节点枚举自定义指令时，各个计算节点的总运行时间。在此前的测试基础上，统计每个计算节点的总运行时间。针对基准测试程序DES3，在计算节点数为8的情况下，使用三种并行方法每个计算节点的总运行时间分别如图5所示。可以观察到使用MGP方法，各个计算节点之间的总运行时间相差较小(最大运行时间与最小运行时间的差别为13.1％)，而使用EPS方法或ODP方法，各个计算节点之间的总运行时间差异较明显(最大运行时间与最小运行时间的差别分别为21.3％和39.6％)。通过对比发现，EPS方法和ODP方法产生的子图大小存在较大差异，其中EPS方法主要考虑的是分割后产生的子图数量而没有考虑子图大小的问题，ODP方法采用一次性分割方法所产生的子图大小差异更加明显，而MGP方法在任务运行时间预测模型的指导下，可将较大子图进一步分割为若干较小子图，因此该方法产生的子图大小比较均衡。对比结果进一步说明本发明的MGP方法可使得分配给各个计算节点的任务具有相似的复杂度，可保证计算节点之间的负载均衡，从而取得近似线性的加速比。In order to further analyze the difference in the speedup ratios obtained by the above three parallel methods, the present embodiment compares and analyzes the total running time of each computing node when the three parallel methods use the same number of computing nodes to enumerate custom instructions. Based on the previous tests, the total running time of each computing node is counted. For the benchmark test program DES3, when the number of computing nodes is 8, the total running time of each computing node using the three parallel methods is shown in Figure 5. It can be observed that using the MGP method, the total running time difference between the various computing nodes is small (the difference between the maximum running time and the minimum running time is 13.1%), while using the EPS method or the ODP method, the total running time between the various computing nodes. The time difference is obvious (the difference between the maximum running time and the minimum running time is 21.3% and 39.6%, respectively). By comparison, it is found that there is a big difference in the size of the sub-images generated by the EPS method and the ODP method. The EPS method mainly considers the number of sub-images generated after segmentation without considering the size of the sub-images. The ODP method adopts the one-time segmentation method. The difference in the size of the generated subgraphs is more obvious, while the MGP method can further divide the larger subgraph into several smaller subgraphs under the guidance of the task running time prediction model, so the subgraphs generated by this method are of relatively balanced size. The comparison results further illustrate that the MGP method of the present invention can make the tasks assigned to each computing node have similar complexity, and can ensure the load balance among the computing nodes, thereby obtaining an approximate linear speedup ratio.

同时，本实施例还对运行时间预测模型进行评价。首先随机生成了200个数据流图，这些数据流图的结点数|V|范围为20～55，边的数目|E|＝1.5*|V|。针对每个随机生成的数据流图，在一台配置为i3-3240 3.4GHz处理器，4-GB主存储器的计算机上使用子图枚举算法枚举所有满足约束条件的凸子图，并记录相应的实际算法运行时间。然后，使用其中185份实际的数据(结点数、边数、运行时间)作为训练样本对公式(3)中的参数进行多元非线性回归拟合。为了评价任务运行时间预测模型，针对15个随机生成的数据流图，将任务运行时间预测模型所预测的时间与实际运行时间进行了比较。子图枚举任务运行时间预测模型给出的预测运行时间与实际运行时间的比较如图6所示。对比结果显示，预测运行时间与实际运行时间非常接近，误差范围为3％～12％。At the same time, this embodiment also evaluates the running time prediction model. First, 200 data flow graphs are randomly generated. The number of nodes |V| of these data flow graphs ranges from 20 to 55, and the number of edges |E|=1.5*|V|. For each randomly generated dataflow graph, use the subgraph enumeration algorithm on a computer configured with an i3-3240 3.4GHz processor and 4-GB main memory to enumerate all convex subgraphs that satisfy the constraints, and record The corresponding actual algorithm running time. Then, use the 185 actual data (number of nodes, edges, running time) as training samples to perform multivariate nonlinear regression fitting on the parameters in formula (3). To evaluate the task runtime prediction model, the time predicted by the task runtime prediction model was compared with the actual runtime for 15 randomly generated data flow graphs. The comparison between the predicted running time given by the subgraph enumeration task running time prediction model and the actual running time is shown in Figure 6. The comparison results show that the predicted running time is very close to the actual running time, with an error range of 3% to 12%.

由于进行图分割和子任务运行时间预测需要额外时间开销，本实施例详细分析了基于任务运行时间预测模型的图分割处理所需时间占并行枚举方法总运行时间的比例。针对四个不同测试基准程序，在计算节点数目为12的条件下，基于任务运行时间预测模型的图分割所需时间占总运行时间的比例如表2所示。列Benchmark、列Pre.&Par.Time、列TotalTime和列Ratio分别表示测试基准程序名称、图分割及任务预测所需时间(单位为毫秒)、总运行时间和二者之比(％)。根据结果，可以看出基于任务运行时间预测的图分割处理所需时间平均约为总运行时间的3.82％。这也表明本发明的基于任务时间预测模型的图分割方法具有较高的效率，其所需时间明显少于总运行时间，不会显著地影响并行方法总运行时间。Since extra time is required for graph segmentation and subtask running time prediction, this embodiment analyzes in detail the proportion of the time required for graph segmentation processing based on the task running time prediction model to the total running time of the parallel enumeration method. For four different test benchmark programs, under the condition that the number of computing nodes is 12, the proportion of the time required for graph segmentation based on the task running time prediction model to the total running time is shown in Table 2. Column Benchmark, column Pre.&Par.Time, column TotalTime and column Ratio respectively represent the test benchmark program name, graph segmentation and task prediction time (in milliseconds), total running time and the ratio (%). From the results, it can be seen that the time required for graph segmentation processing based on task running time prediction is about 3.82% of the total running time on average. This also shows that the graph segmentation method based on the task time prediction model of the present invention has high efficiency, and the required time is significantly less than the total running time, and the total running time of the parallel method will not be significantly affected.

表2基于任务运行时间预测模型的图分割所需时间与总运行时间比较Table 2 Comparison of the time required for graph segmentation and the total running time based on the task running time prediction model

最后应说明的是：以上实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述实施例所记载的技术方案进行修改，或者对其中部分或者全部技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明权利要求所限定的范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, but not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that it can still be The technical solutions described in the foregoing embodiments are modified, or some or all of the technical features thereof are equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope defined by the claims of the present invention.

Claims

1. a kind of custom instruction parallel enumerating method based on depth map segmentation, it is characterised in that: the following steps are included:

Step 1, using Master-slave parallel mode, the host node in computing cluster receives dedicated custom instruction collection and automatically generates process Intermediate representation generation phase generate data flow diagram as input；

The data flow diagram is a kind of directed acyclic graph G=(V, E), wherein nodal set V={ v₁..., v_nIndicate elementary instruction, N is the number of data flow diagram node, side collection E={ e₁..., e_m∈ V × V indicate instruction between data dependence relation, m indicate number According to the number on flow graph side；

Step 2, using the depth image segmentation method based on nonlinear regression Runtime prediction model by original data stream Figure is divided into several subgraphs, and the subgraph after segmentation is distributed to the idle calculate node in computing cluster, method particularly includes:

Step 2.1, custom instruction is enumerated task T progress be disposably divided into k subtask, shown in following formula:

Wherein, G_k=G- { v₁, v₂..., v_k-1It is k-th of subgraph generating after data flow diagram G segmentation, and k=1,2 ..., | v |, | V | for the nodal point number in figure G, E (G_k, v_k) indicate from k-th of subgraph G_kIn enumerate it is all comprising node v_kCustom instruction；

Step 2.2 establishes the Runtime prediction model based on nonlinear regression, to the runing time of the subtask of segmentation It is predicted；

Step 2.3, according to the predicted time of all subtasks and the number of calculate node, judge whether to need to continue to complicated Subtask is split；If the prediction runing time of subtask is more than given time upper limit, which is continued to divide For several subtasks, until the prediction runing time of all subtasks is less than or equal to given time upper limit, step is otherwise executed Rapid 3；The given time upper limit is the consensus forecast runing time of current all subtasks；

Step 3, calculate node enumerate custom instruction from the subgraph received using convex portion enumeration of graph algorithm；

It is described to enumerate custom instruction are as follows: the institute for meeting the following conditions is enumerated from a given data flow diagram G=(V, E) Subgraph S:(1) subgraph S is convex portion figure；(2) subgraph S is connected graph；

The convex portion figure are as follows: for the subgraph S of G,If in any path in G between u and v all only by S Node, then claiming S is the convex portion figure of G；

The connected graph are as follows: for the subgraph S of G,There are at least one paths to connect u and v, then S is connected graph.

2. a kind of custom instruction parallel enumerating method based on depth map segmentation according to claim 1, feature exist In: the step 2.2 method particularly includes:

When constraint condition is certain, custom instruction enumerate the time in figure nodal point number and number of edges it is related, customized finger Order is enumerated the time and is increased with the increase of nodal point number or number of edges；The subgraph given for one enumerates task T_kAnd its it is corresponding Data flow diagram G_k(V_k, E_k), the worst runing time that custom instruction is enumerated isWherein α is constant, | V_k| it is Scheme G_kIn nodal point number；There are many extensive forms of the truth of a matter in formula, it is assumed here that:

Wherein, f (| V_k|, | E_k|) be nodal point number | V_k| and number of edges | E_k| polynomial function；

Taylor series expansion is carried out to formula (2) and obtains Runtime prediction model, shown in following formula:

Wherein, parameter k ' is used for the expansion of Controlling model, parameter a_{I, j}By using the method for nonlinear regression, according to experiment number It is obtained according to fitting.

3. a kind of custom instruction parallel enumerating method based on depth map segmentation according to claim 2, feature exist In: the step 2.3 method particularly includes:

Assuming that subtask T_kPrediction runing time be greater than given time upper limit, to T_kContinue to divide, following formula institute Show:

Wherein, G_{K, l}=G_k-{w₁, w₂..., w_l-1It is subtask T_kFirst of the subgraph generated after segmentation, E (G_{K, l}, { v_k, w_l}) It indicates from subgraph G_{K, l}It enumerates comprising node v_kWith node w_lAll custom instructions；H is as cutoff value, T_{K, h-1}Prediction fortune The row time is greater than current consensus forecast runing time, T_{K, h}Prediction runing time be less than current consensus forecast runing time；If Subtask T_{K, l}Prediction runing time upper limit value is still greater than, then continue T_{K, l}It is divided into several subtasks.