CN107861606A - A heterogeneous multi-core power capping method by coordinating DVFS and task mapping

Info

Publication number: CN107861606A
Application number: CN201711163506.4A
Authority: CN
Inventors: 方娟; 汪梦萱; 马傲男; 程妍瑾; 常泽清
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2017-11-21
Filing date: 2017-11-21
Publication date: 2018-03-30

Abstract

The invention discloses a heterogeneous multi-core power capping method that coordinates DVFS and task mapping. First, for a heterogeneous system, a script is implemented that measures the computing-node power consumption, the CPU power consumption and the GPU power consumption after program execution completes, and the selected parallel benchmark programs are modified to obtain the execution time of each kernel function. Next, the application is run only on the CPU and only on the GPU at different CPU and GPU frequencies to obtain detailed running information, including the total execution time, the execution time of each kernel function, the computing-node power consumption, the CPU power consumption and the GPU power consumption. Based on this running information, a prediction model is designed, comprising an execution time model and a power consumption model. Finally, based on the prediction model, the system power consumption and execution time under different CPU frequencies, GPU frequencies and task allocation schemes are obtained and filled into a configuration table, and the optimal configuration is found with an improved greedy algorithm. The invention improves system performance while keeping system power consumption within the budget.

Description

A Power Capping Method for Heterogeneous Multi-Cores by Coordinating DVFS and Task Mapping

Technical field

The invention belongs to the field of computer architecture, and in particular relates to a heterogeneous multi-core power capping method that coordinates DVFS and task mapping.

Background art

After years of research and development, advanced architectures represented by multi-core processors have gradually replaced single-core processors as the main route to higher processor performance. Compared with homogeneous multi-core processors, heterogeneous multi-core platforms can achieve better performance. Power capping is a technique that limits the power consumption of a heterogeneous system to a predetermined level. Power consumption and heat dissipation limit further gains in heterogeneous multi-core performance. Modern processors are built to withstand only a certain level of power-induced stress, so a system capable of enforcing a processor power cap is needed. The most common power budgeting techniques today rely on the fact that hardware components consume different amounts of power at different operating frequencies; the main idea is to use dynamic voltage and frequency scaling (DVFS). However, when DVFS is used to limit the power consumption of a heterogeneous system, load imbalance arises between the CPU and the GPU. Decomposing a parallel program into tasks that can execute simultaneously and mapping each task to the most suitable processor makes full use of the system's computing power and improves performance, but such mapping schemes usually do not take system power consumption into account. The present invention proposes a scheme that combines DVFS and task mapping to improve system performance under a limited system power budget.

Summary of the invention

The present invention proposes a heterogeneous multi-core power capping method that coordinates DVFS and task mapping, improving system performance while keeping system power consumption within the budget. First, for a heterogeneous system, a script is implemented that measures the computing-node power consumption, the CPU power consumption and the GPU power consumption after program execution completes, and the selected parallel benchmark programs are modified to obtain the execution time of each kernel function. Then, with the CPU and GPU set to different frequencies, the application is run only on the CPU and only on the GPU to obtain detailed running information, including the total execution time, the execution time of each kernel function, the computing-node power consumption, the CPU power consumption and the GPU power consumption. Based on the running information, a prediction model is designed, comprising an execution time model and a power consumption model. Finally, based on the prediction model, the system power consumption and execution time under different CPU frequencies, GPU frequencies and task allocation schemes are obtained and filled into a configuration table. An improved greedy algorithm then finds the optimal configuration (CPU frequency, GPU frequency, task mapping table).

To achieve the above object, the present invention adopts the following technical solution.

A heterogeneous multi-core power capping method by coordinating DVFS and task mapping combines DVFS and task mapping to keep system power consumption within the power budget while pursuing system performance, and comprises the following steps:

Step 1, measure the total execution time of the application, the CPU power consumption and the GPU power consumption after the application finishes executing.

Step 2, modify the selected parallel benchmark programs to obtain the execution time of each kernel function in the program.

Step 3, execute the application only on the CPU or only on the GPU, set the different selectable CPU (GPU) frequencies, and obtain detailed running information, including the total execution time of the program, the execution time of each kernel function, the total system power consumption, the CPU power consumption and the GPU power consumption.

Step 4, design a prediction model that predicts the power consumption and execution time of different task mapping schemes at different CPU and GPU frequencies. The inputs of the prediction model are the CPU frequency, the GPU frequency and the task mapping scheme. Task mapping here means mapping each kernel function as a whole to either the CPU or the GPU, rather than splitting a kernel function between the CPU and the GPU in some ratio for simultaneous execution. The outputs of the prediction model are the predicted total program execution time and the predicted system power consumption. The prediction model comprises an execution time model and a power consumption model.

Step 4.1, execution time model

The total execution time of the application is obtained from the execution time of each kernel function in the program and the corresponding data transfer time. The total execution time of the program is expressed by formula ①.

Here f_cpu and f_gpu denote the CPU frequency and the GPU frequency respectively, and T_i(f_cpu, f_gpu) denotes the execution time of the i-th kernel function plus its required data transfer time, expressed by formula ②.

The first part of formula ② is the execution time and the second part is the data transfer time. The kernel function execution time has already been obtained in step 2. H2D and D2H denote the host-to-device and device-to-host data transfer costs respectively. Required equals 1 or 0 and indicates whether kernel k requires data d. The data may already reside on the device and then need not be transferred, so OnDevice indicates whether the data is on the device. size denotes the data size.
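Formulas ① and ② are not reproduced in this text, so the following Python fragment is only a minimal sketch of the idea described above, not the patent's exact formulas. The names Buffer, Kernel, kernel_time and total_time are hypothetical, and the sketch assumes that transfer time is linear in data size and that the measured kernel times belong to one fixed (f_cpu, f_gpu) pair.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    name: str
    size: int          # data size, in arbitrary units
    on_device: bool    # OnDevice: True if the data already resides on the GPU


@dataclass
class Kernel:
    name: str
    exec_time: dict    # device name -> measured execution time (from step 2)
    inputs: list       # buffers this kernel requires (Required = 1)


def kernel_time(kernel, device, h2d_cost, d2h_cost):
    """Execution time plus data transfer time of one kernel on `device`."""
    transfer = 0.0
    for buf in kernel.inputs:
        if device == "gpu" and not buf.on_device:
            transfer += h2d_cost * buf.size   # host-to-device copy needed
        elif device == "cpu" and buf.on_device:
            transfer += d2h_cost * buf.size   # device-to-host copy needed
    return kernel.exec_time[device] + transfer


def total_time(kernels, mapping, h2d_cost, d2h_cost):
    """Sum of per-kernel times for a mapping {kernel name -> 'cpu' | 'gpu'}."""
    return sum(kernel_time(k, mapping[k.name], h2d_cost, d2h_cost)
               for k in kernels)
```

A mapping is a plain dictionary from kernel name to device, which matches the whole-kernel mapping the method uses. The sketch does not track how buffer locations change after each kernel runs; a fuller model would update on_device as execution proceeds.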

Step 4.2, power consumption model

System power consumption consists of three parts: idle power consumption P_idle, CPU power consumption P_cpu and GPU power consumption P_gpu. System power consumption is expressed by formula ③.

P = P_idle(f_cpu, f_gpu) + P_cpu(f_cpu, f_gpu) + P_gpu(f_cpu, f_gpu)    ③

Here P_idle denotes the idle power consumption, which depends on the CPU frequency and the GPU frequency but not on the task mapping. It can be obtained by subtracting the CPU power consumption from the total power consumption measured when the application is executed only on the CPU, as given by formula ④.

The symbols in formula ④ denote, respectively, the total system power consumption, the CPU power consumption and the GPU power consumption obtained when the application is executed only on the CPU; they depend on the frequency settings.

During application execution, the CPU and the GPU are not busy all the time because of data dependencies between kernel functions, so the CPU and GPU power consumption changes with the given task mapping scheme. To account for this, we assume that power consumption is proportional to the ratio of a device's execution time to the total execution time, so the CPU and GPU power consumption can be expressed by formula ⑤ and formula ⑥ respectively.

Here λ_cpu and λ_gpu denote the CPU and GPU busy-time ratios respectively. Of the remaining symbols in formulas ⑤ and ⑥, one pair denotes the CPU power consumption and the GPU power consumption obtained when the application is executed only on the GPU, and the other pair denotes estimates of the maximum dynamic power consumption of the CPU and the GPU respectively.

Because λ_cpu and λ_gpu are the CPU and GPU busy-time ratios, they can be obtained from the running information measured in step 3. During program execution, the busy time of each device is defined as the sum of the actual execution times of the kernel functions on that device. λ_cpu and λ_gpu are given by formula ⑦ and formula ⑧ respectively.

t_cpu denotes the CPU busy time, t_gpu the GPU busy time, and t_total the total execution time of the application.
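As a rough illustration of the power model, the sketch below computes the busy-time ratios as busy time divided by total time, which is how the text describes λ_cpu and λ_gpu, and then sums the three components of formula ③. Formulas ⑤ and ⑥ are not reproduced in this text, so scaling a device's busy power linearly by its busy-time ratio is an assumption made here for illustration only. The function names and the parameters p_cpu_busy and p_gpu_busy are hypothetical.

```python
def busy_ratio(busy_time, total_time):
    """Fraction of the total execution time during which a device is busy."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return busy_time / total_time


def system_power(p_idle, p_cpu_busy, p_gpu_busy, t_cpu, t_gpu, t_total):
    """P = P_idle + P_cpu + P_gpu at one fixed (f_cpu, f_gpu) pair.

    ASSUMPTION: each device's power is its busy power scaled linearly by
    its busy-time ratio; the patent's exact formulas may differ.
    """
    lam_cpu = busy_ratio(t_cpu, t_total)
    lam_gpu = busy_ratio(t_gpu, t_total)
    p_cpu = lam_cpu * p_cpu_busy
    p_gpu = lam_gpu * p_gpu_busy
    return p_idle + p_cpu + p_gpu
```

For example, with 10 W idle power, a CPU that draws 20 W while busy for 2 s of an 8 s run, and a GPU that draws 40 W while busy for 4 s, this sketch predicts 10 + 5 + 20 = 35 W.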

Step 5, construct the configuration parameter table based on the prediction model.

Using the execution time model and the power consumption model, compute the execution time and power consumption for the different CPU frequencies, GPU frequencies and task mapping schemes, and fill them into the configuration parameter table.
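One way to picture the configuration parameter table is as a list of rows, one per combination of CPU frequency, GPU frequency and mapping. The sketch below is an assumption about the layout, not the patent's data structure; build_config_table, predict_time and predict_power are hypothetical names, and the two predictors stand in for the models of step 4.

```python
from itertools import product


def build_config_table(cpu_freqs, gpu_freqs, kernel_names,
                       predict_time, predict_power):
    """Enumerate (f_cpu, f_gpu, mapping) and record predicted time and power."""
    table = []
    for f_cpu, f_gpu in product(cpu_freqs, gpu_freqs):
        # Each kernel is mapped as a whole to either the CPU or the GPU.
        for devices in product(("cpu", "gpu"), repeat=len(kernel_names)):
            mapping = dict(zip(kernel_names, devices))
            table.append({
                "f_cpu": f_cpu,
                "f_gpu": f_gpu,
                "mapping": mapping,
                "time": predict_time(f_cpu, f_gpu, mapping),
                "power": predict_power(f_cpu, f_gpu, mapping),
            })
    return table
```

Exhaustive enumeration produces 2^n mappings for n kernels per frequency pair, so it is only practical for programs with few kernels. This is one reason a greedy search over mappings is attractive for larger programs.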

Step 6, search for the optimal parameter set in the configuration parameter table using the improved greedy algorithm. The search has two stages. First, for a given CPU frequency and GPU frequency, a greedy algorithm finds the task mapping scheme with the shortest execution time; from this task mapping, the CPU frequency, the GPU frequency and the power model, the predicted system power consumption under this configuration is obtained. Then the CPU frequency and GPU frequency are changed and the predicted system power consumption is computed again as in the first stage, finally yielding the optimal configuration that stays within the power budget.

Step 6.1, for a given CPU frequency and GPU frequency, search for the task mapping scheme with the shortest execution time and obtain the predicted system power consumption from the power consumption model.

Step 6.2, change the CPU frequency and GPU frequency over their selectable settings and repeat step 6.1 to obtain, for each frequency combination, the mapping scheme with the best execution time, and compute the system power consumption for that mapping scheme. Given the power budget, this yields the optimal frequency selection and mapping scheme within the power budget.
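The two-stage search can be sketched as follows. The patent's own algorithm listings are not reproduced in this text, so this is a simplified reading under stated assumptions: the inner greedy step assigns each kernel to whichever device runs it faster and ignores data transfer costs, and the outer loop keeps the fastest configuration whose predicted power is within the budget. The names greedy_mapping, search_best_config, times_at and power_at are hypothetical.

```python
def greedy_mapping(kernel_times):
    """Map each kernel to its faster device.

    kernel_times: {kernel name: {"cpu": seconds, "gpu": seconds}} at one
    fixed frequency pair. Ties go to the first device listed.
    """
    return {name: min(times, key=times.get)
            for name, times in kernel_times.items()}


def search_best_config(freq_pairs, times_at, power_at, budget):
    """Return the fastest configuration within the power budget, or None."""
    best = None
    for f_cpu, f_gpu in freq_pairs:
        kernel_times = times_at(f_cpu, f_gpu)           # step 6.1 input
        mapping = greedy_mapping(kernel_times)
        time = sum(kernel_times[k][d] for k, d in mapping.items())
        power = power_at(f_cpu, f_gpu, mapping)         # power model
        if power <= budget and (best is None or time < best["time"]):
            best = {"f_cpu": f_cpu, "f_gpu": f_gpu, "mapping": mapping,
                    "time": time, "power": power}
    return best
```

If no frequency pair satisfies the budget, the function returns None, which lets the caller decide whether to relax the budget or fall back to the lowest-power setting.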

Compared with the prior art, the present invention has the following advantages:

DVFS and task mapping are combined to keep system power consumption within the budget while preserving system performance. Most existing system power capping techniques rely on dynamic voltage and frequency scaling, because device frequency has the greatest impact on system power consumption, but they do not consider the load imbalance caused by changing the CPU and GPU frequencies. Decomposing a parallel program into tasks that can execute simultaneously and mapping each task to the most suitable processor makes full use of the system's computing power and improves performance, but such mapping schemes usually do not take system power consumption into account. The present invention therefore combines the two optimization strategies, DVFS and task mapping, to keep system power consumption within a given budget while improving system performance.

Brief description of the drawings

To make the object and solution of the present invention easier to understand, the invention is further described below with reference to the accompanying drawings.

Figure 1 is an architecture diagram of the CPU-GPU heterogeneous multi-core system. The heterogeneous system is simulated with gem5-gpu; a 4-core CPU and a GPU composed of 8 CUs are integrated on the same chip.

Figure 2 is a schematic diagram of the power capping scheme of the present invention, which is based on detailed running information.

Figure 3 is a schematic diagram of fine-grained synchronization between the CPU and the GPU under traditional task data partitioning.

Figure 4 is a schematic diagram of the CPU and GPU busy time under the whole-kernel task partitioning used in the present invention, and of the CPU and GPU idle time caused by waiting for task data.

Detailed description

The present invention is further described below with reference to the accompanying drawings.

Figure 1 shows the heterogeneous multi-core system built with the gem5-gpu simulator. It simulates a heterogeneous architecture in which a 4-core CPU and a GPU composed of 8 CUs are integrated on the same chip. In gem5-gpu the simulated architecture can be modified flexibly through the configuration file, and gem5-gpu supports DVFS.

In the heterogeneous multi-core system simulated with gem5-gpu, the present invention implements a power capping method that combines DVFS and task mapping, comprising the following specific steps:

Step 1, measure the total execution time on the computing node, the CPU power consumption and the GPU power consumption after the application finishes executing.

In gem5-gpu, after an application finishes executing, a file stat.txt containing all program execution information is generated automatically, and it includes the program execution time. The McPAT module measures CPU power consumption separately and the GPUWattch module measures GPU power consumption separately; by configuring the McPAT and GPUWattch modules in gem5-gpu, the CPU power consumption and the GPU power consumption can be obtained after program execution completes.

Step 2, modify the selected parallel benchmark programs to obtain the execution time of each kernel function in the program.

OpenCL programs can execute on different devices, including CPUs and GPUs. The benchmark programs used in the present invention are the OpenCL version of the NAS Parallel Benchmarks. Each benchmark has different characteristics: some programs contain more than 60 kernels, while others have only two. The test programs are rewritten to collect the execution time of each kernel.

Step 3, execute the application only on the CPU and only on the GPU, setting each of the selectable CPU (GPU) frequencies, and obtain detailed running information as the input to the running-information-based power capping scheme.

Figure 2 shows the flow of the running-information-based power capping strategy. CPU Profile Runs and GPU Profile Runs denote the running information obtained after executing the application only on the CPU and only on the GPU respectively. From this running information the prediction model of step 4 is built, comprising the time model and the power consumption model, shown as Time model and Power model in Figure 2. The inputs of the prediction model are the CPU frequency, the GPU frequency and the task mapping scheme; the outputs are the corresponding predicted execution time and predicted system power consumption. The different inputs and outputs form a configuration table, from which the improved greedy algorithm of step 6 searches for the best mapping scheme and frequency settings, shown as Distribute parallel tasks and set device frequencies in Figure 2.

Step 4, build the prediction model, comprising an execution time model and a power consumption model, to predict the execution time of the application and the system power consumption in a heterogeneous multi-core environment. The running information obtained in step 3 by executing the application only on the CPU and only on the GPU under different frequency settings is the basis of the prediction model. Formulas ① to ⑧ show that, from the program running information, including the execution information of each kernel, the CPU power consumption and the GPU power consumption, the execution time and power consumption of the program can be predicted for different CPU frequencies, GPU frequencies and task mapping schemes.

Task mapping in the present invention means mapping each kernel in the program as a whole to either the CPU or the GPU, which differs from the traditional approach of dividing the task data between the CPU and the GPU in a certain ratio. Under proportional data partitioning, a kernel executes the same code on the CPU and the GPU at the same time; the data the kernel needs is divided between the CPU and the GPU in proportion, and both devices process their share simultaneously. Because the CPU and the GPU must synchronize after each kernel finishes, a fixed data partitioning ratio can produce a great deal of CPU and GPU idle time during kernel execution. Figure 3 illustrates this fine-grained synchronization between the CPU and the GPU under task data partitioning. Let the fixed data partitioning ratio be α. For kernel1 the GPU is more efficient and the CPU is slower, so the GPU finishes its share of the data before the CPU and then sits idle waiting for the CPU to finish. Conversely, for kernel2 the CPU is faster, so the CPU finishes before the GPU and then sits idle waiting for the GPU to finish. Figure 3 therefore shows that the CPU-GPU synchronization required by task data partitioning causes a great deal of idle time on both devices during execution.

When a kernel is assigned as a whole directly to the CPU or the GPU, no synchronization between the CPU and the GPU is needed, but the data transfer time is somewhat longer. Figure 4 shows the CPU and GPU execution process under whole-kernel mapping: no CPU-GPU synchronization is required, but data is transferred between the CPU and the GPU more frequently.
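The idle time caused by a fixed partitioning ratio can be illustrated with a small calculation. The sketch below is an illustration under simple assumptions, not part of the patent: each device processes its share of the data at a constant rate, and both must wait for the slower one before the next kernel starts. The name partition_idle_time and the rate parameters are hypothetical.

```python
def partition_idle_time(work, alpha, cpu_rate, gpu_rate):
    """Idle time per device for one kernel split alpha : (1 - alpha).

    work:      total amount of data the kernel processes
    alpha:     fraction of the data given to the CPU
    cpu_rate:  data units the CPU processes per second for this kernel
    gpu_rate:  data units the GPU processes per second for this kernel
    Returns (cpu_idle, gpu_idle) in seconds.
    """
    t_cpu = alpha * work / cpu_rate
    t_gpu = (1 - alpha) * work / gpu_rate
    finish = max(t_cpu, t_gpu)   # synchronization point after the kernel
    return finish - t_cpu, finish - t_gpu
```

With alpha fixed at 0.5, a kernel that the GPU processes five times faster than the CPU leaves the GPU idle for most of the kernel's duration, and a kernel with the opposite profile leaves the CPU idle instead, which mirrors the kernel1 and kernel2 cases described for Figure 3.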

Step 5, construct the configuration parameter table based on the prediction model.

Using the time prediction model and the power consumption prediction model of step 4, the execution time and power consumption can be computed for all possible CPU frequencies, GPU frequencies and task mapping schemes, and are stored in the configuration parameter table.

Step 6, search for the optimal parameter set in the configuration parameter table using the improved greedy algorithm.

Step 6.1, for a given CPU frequency and GPU frequency, search for the task mapping scheme with the shortest execution time and obtain the predicted system power consumption from the power consumption model. This step is shown in Algorithm 1.

Step 6.2, change the CPU frequency and GPU frequency over their selectable settings and repeat step 6.1 to obtain, for each frequency combination, the mapping scheme with the best execution time, and compute the system power consumption for that mapping scheme. Given the power budget, this yields the optimal frequency selection and mapping scheme within the power budget. This step is shown in Algorithm 2.

Claims

1. A heterogeneous multi-core power capping method by coordinating DVFS and task mapping, characterized by comprising the following steps:

Step 1, measure the total execution time on the computing node, the CPU power consumption and the GPU power consumption after the application finishes executing;

Step 2, modify the selected parallel benchmark programs to obtain the execution time of each kernel function in the program;

Step 3, execute the application only on the CPU or only on the GPU, set the different selectable CPU or GPU frequencies, and obtain detailed running information, including the total execution time of the program, the execution time of each kernel function in the program, the total system power consumption, the CPU power consumption and the GPU power consumption;

Step 4, design a prediction model that predicts the power consumption and execution time of different task mapping schemes at different CPU and GPU frequencies; wherein the inputs of the prediction model are the CPU frequency, the GPU frequency and the task mapping scheme, and task mapping means mapping each kernel function as a whole to either the CPU or the GPU, rather than splitting a kernel function between the CPU and the GPU in some ratio for simultaneous execution; the outputs of the prediction model are the predicted total program execution time and the predicted system power consumption;

Step 5, using the execution time model and the power consumption model, compute the execution time and power consumption for the different CPU frequencies, GPU frequencies and task mapping schemes, and fill them into the configuration parameter table;

Step 6, search for the optimal parameter set in the configuration parameter table using the improved greedy algorithm, in two stages: first, for a given CPU frequency and GPU frequency, use a greedy algorithm to find the task mapping scheme with the shortest execution time, and from this task mapping, the CPU frequency, the GPU frequency and the power model obtain the predicted system power consumption under this configuration; then change the CPU frequency and GPU frequency and compute the predicted system power consumption again as in the first stage, finally obtaining the optimal configuration within the power budget.

2. The heterogeneous multi-core power capping method by coordinating DVFS and task mapping as claimed in claim 1, wherein the prediction model in step 4 comprises an execution time model and a power consumption model:

Step 4.1, execution time model

The total execution time of the application is obtained from the execution time of each kernel function in the program and the corresponding data transfer time. The total execution time of the program is expressed by formula ①.

Here f_cpu and f_gpu denote the CPU frequency and the GPU frequency respectively, and T_i(f_cpu, f_gpu) denotes the execution time of the i-th kernel function plus its required data transfer time, expressed by formula ②.

The first part of formula ② is the execution time and the second part is the data transfer time. The kernel function execution time has already been obtained in step 2. H2D and D2H denote the host-to-device and device-to-host data transfer costs respectively. Required equals 1 or 0 and indicates whether kernel k requires data d. The data may already reside on the device and then need not be transferred, so OnDevice indicates whether the data is on the device. size denotes the data size.

Step 4.2, power consumption model

System power consumption consists of three parts: idle power consumption P_idle, CPU power consumption P_cpu and GPU power consumption P_gpu. System power consumption is expressed by formula ③.

P = P_idle(f_cpu, f_gpu) + P_cpu(f_cpu, f_gpu) + P_gpu(f_cpu, f_gpu)    ③

Here P_idle denotes the idle power consumption, which depends on the CPU frequency and the GPU frequency but not on the task mapping. It can be obtained by subtracting the CPU power consumption from the total power consumption measured when the application is executed only on the CPU, as given by formula ④.

The symbols in formula ④ denote, respectively, the total system power consumption, the CPU power consumption and the GPU power consumption obtained when the application is executed only on the CPU; they depend on the frequency settings.

During application execution, the CPU and the GPU are not busy all the time because of data dependencies between kernel functions, so the CPU and GPU power consumption changes with the given task mapping scheme. To account for this, it is assumed that power consumption is proportional to the ratio of a device's execution time to the total execution time, so the CPU and GPU power consumption can be expressed by formula ⑤ and formula ⑥ respectively.

Here λ_cpu and λ_gpu denote the CPU and GPU busy-time ratios respectively. Of the remaining symbols in formulas ⑤ and ⑥, one pair denotes the CPU power consumption and the GPU power consumption obtained when the application is executed only on the GPU, and the other pair denotes estimates of the maximum dynamic power consumption of the CPU and the GPU respectively.

Because λ_cpu and λ_gpu are the CPU and GPU busy-time ratios, they can be obtained from the running information measured in step 3. During program execution, the busy time of each device is defined as the sum of the actual execution times of the kernel functions on that device. λ_cpu and λ_gpu are given by formula ⑦ and formula ⑧ respectively.

t_cpu denotes the CPU busy time, t_gpu the GPU busy time, and t_total the total execution time of the application.