CN110333933A - A Simulation Method of HPL Calculation Model - Google Patents
- Publication number
- CN110333933A
- Authority
- CN
- China
- Prior art keywords
- gpu
- cpu
- hpl
- simulation
- calculation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Prevention of errors by analysis, debugging or testing of software
- G06F11/3668—Testing of software
- G06F11/3672—Test management
- G06F11/3688—Test management for test execution, e.g. scheduling of test suites
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45504—Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
An HPL calculation model simulation method disclosed by the present invention comprises the following steps in order: obtaining user-input system configuration parameters such as CPU, GPU, and MISC parameters; determining the allocation ratio of the simulation workload between the CPU and GPU according to their computing capabilities; performing the simulation calculation of the corresponding tasks on the CPU and GPU respectively; recording the time consumed by the simulation; and calculating and analyzing the floating-point performance of the system and displaying the simulation results. The HPL calculation model simulation method of the present invention provides a visual interface, is easy to operate, and computes and analyzes the system's floating-point performance through simulation. Compared with a traditional HPL test it greatly improves work efficiency, and it is also of significant help in uncovering system performance bottlenecks.
Description
Technical Field

The present invention relates to the field of computers, and in particular to an HPL calculation model simulation method.
Background

With the rapid development of the information society, demands on information-processing capability keep rising. In recent years, driven by informatization, the HPC market has continued to expand and segment, and the types and number of HPC applications have grown accordingly. Beyond traditional HPC applications (scientific research, oil exploration, meteorology and environment, aerospace, national defense, etc.), the artificial intelligence field that has drawn particular attention in the past two years (self-driving cars, drones, face recognition, medical diagnosis, smart homes, etc.) has made HPC an important supporting platform that increasingly affects everyday life.
Linpack is the benchmark most widely used internationally to test the floating-point performance of high-performance computer systems. It evaluates floating-point performance by having the machine solve a dense system of N linear equations using Gaussian elimination.
HPL (High Performance Linpack) is the Linpack test package widely adopted for large-scale parallel systems, a test scheme designed for modern parallel computers. Users need not modify the test program itself; they run it while tuning the problem size N (the matrix order), the number of CPUs used, and various optimizations in order to obtain better test performance. HPL solves the linear system by Gaussian elimination, and its floating-point operation count is fixed: for problem size N it is 2N³/3 + 3N²/2. Therefore, given the problem size N and the measured system computing time T, the achieved performance is easily calculated as (2N³/3 + 3N²/2)/T, in floating-point operations per second (Flops).
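Given the fixed operation count, the achieved rate is a two-line computation; the sketch below (with illustrative values, not taken from the patent) makes the arithmetic concrete:

```python
def hpl_flops(n: int, t_seconds: float) -> float:
    """Achieved floating-point rate for an HPL run of problem size n
    completed in t_seconds, using the fixed operation count
    2n^3/3 + 3n^2/2 stated above."""
    ops = 2.0 * n**3 / 3.0 + 3.0 * n**2 / 2.0
    return ops / t_seconds

# Illustrative example: N = 10000 solved in 25 s, reported in GFLOPS
print(hpl_flops(10_000, 25.0) / 1e9)
```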
With continual hardware upgrades, the computing power of computers has increased by orders of magnitude. The calculation peak has become an important criterion for measuring computer performance; for example, the floating-point peak, the maximum number of floating-point operations a computer can complete per second, comes in two forms: the theoretical floating-point peak and the measured floating-point peak.
The theoretical floating-point peak is the maximum number of floating-point operations the computer can theoretically complete per second; it is determined mainly by the CPU frequency: theoretical floating-point peak = CPU frequency × floating-point operations per clock cycle × number of CPU cores in the system.
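The theoretical-peak formula translates directly into code; the frequency, FLOPs-per-cycle, and core-count values below are illustrative assumptions:

```python
def theoretical_peak_gflops(freq_ghz: float, flops_per_cycle: int, total_cores: int) -> float:
    """Theoretical floating-point peak = CPU frequency
    x floating-point operations per clock cycle x number of cores."""
    return freq_ghz * flops_per_cycle * total_cores

# Illustrative: 2.6 GHz, 32 double-precision FLOPs/cycle (two AVX-512 FMA
# units), 16 cores; result is in GFLOPS
print(theoretical_peak_gflops(2.6, 32, 16))
```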
The measured floating-point peak is the Linpack test value, i.e., the result of testing the machine's floating-point capability by running the Linpack test program. As indicators of machine performance, these two values are used to gauge the machine's processing power.
Although traditional HPL performance-testing software has been widely applied to many types of computers, and its results can to some extent correctly reflect certain aspects of a computing system's behavior, the rapid development of HPC, in particular the spread of heterogeneous computing architectures and the continual expansion of application fields, has exposed its shortcomings in many respects:
Traditional HPL test software only feeds the test result back to the user; it cannot identify the problems present or how to solve them, cannot help the user quickly locate performance bottlenecks and their causes, and cannot offer suggestions for performance optimization.
Summary of the Invention
The purpose of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing an HPL calculation model simulation method. The method overcomes the technical difficulty that existing HPL test software cannot detect system performance bottlenecks, helping users quickly find bottlenecks and their causes and offering suggestions for performance optimization.
The purpose of the present invention is achieved through the following technical solution:

An HPL calculation model simulation method comprises the following steps in order:

obtaining system configuration parameters input by the user, the system configuration parameters including CPU, GPU, and MISC parameters;

determining the allocation ratio of the simulation workload between the CPU and GPU according to their computing capabilities, performing the simulation calculation of the corresponding tasks on the CPU and GPU respectively, recording the time consumed by the simulation, calculating and analyzing the floating-point performance of the system, and displaying the simulation results.
The CPU parameters include the number of CPU cores, frequency, floating-point operations per clock cycle, and number of CPUs; the GPU parameters include GPU frequency and number of GPUs; the MISC parameters include problem size, matrix block size, bus bandwidth, bus efficiency factor, CPU parallel-execution efficiency factor, GPU parallel-execution efficiency factor, and GEMM efficiency factor.

The CPU, GPU, and MISC parameters can be configured in three ways: manual input, template input, and automatic input.

For template input, the CPU types include EPYC 7251, EPYC 7281, and Xeon Gold 6142, and the GPU types include RX Vega 56, RX Vega 64, and Vega 10; when a CPU or GPU template is selected, the corresponding configuration parameters are filled in automatically.

For automatic input, the HPL calculation model simulation method automatically obtains the configuration parameters of the local machine.
Determining the allocation ratio of the simulation workload between the CPU and GPU according to their computing capabilities specifically comprises the following steps:
(1) Compute the total task amount:

W = GFLOPS_CPU·(t_CPU + t_T) + GFLOPS_GPU·t_GPU
(2) Compute the total amount of data transferred between the CPU and GPU:

(3) Compute the communication overhead of the data transfer between the CPU and GPU:

(4) Compute the allocation ratio:

(5) Assign a task amount of R·W to the GPU and a task amount of (1-R)·W to the CPU.
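The patent's exact expression for the ratio R is not reproduced in this text, so the sketch below uses a simple capability-proportional placeholder for R and then applies step (5) as stated:

```python
def allocation_ratio(gflops_cpu: float, gflops_gpu: float) -> float:
    """Hypothetical allocation ratio R: the patent derives R from the
    CPU/GPU compute capabilities and the transfer overhead, but the
    exact formula did not survive in this text, so this sketch uses
    the simplest capability-proportional form as a placeholder."""
    return gflops_gpu / (gflops_cpu + gflops_gpu)

def split_work(total_work: float, r: float) -> tuple[float, float]:
    """Assign R*W to the GPU and (1-R)*W to the CPU, as in step (5)."""
    return r * total_work, (1.0 - r) * total_work

r = allocation_ratio(500.0, 1500.0)      # GPU three times as capable as CPU
gpu_w, cpu_w = split_work(1000.0, r)
print(r, gpu_w, cpu_w)  # → 0.75 750.0 250.0
```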
Performing the simulation calculation of the corresponding tasks on the CPU and GPU respectively specifically comprises the following steps:

(1) Initialize the augmented matrix A|b: initialize an augmented matrix of the corresponding size according to the problem scale;

(2) cyclically decompose the augmented matrix: based on LU decomposition, split the augmented matrix according to the task-allocation ratio, and transfer the portion of the matrix that needs to reside in GPU memory after the split to the GPU memory;

the CPU executes cpu_dtrsm() to compute the lower-triangular matrix of the CPU part, and the GPU executes gpu_dtrsm() to compute the lower-triangular matrix of the GPU part;

the GPU transfers the completed results back to main memory while HPL_dgemm() simultaneously splits the augmented matrix according to the task-allocation ratio and transfers the portion that needs to reside in GPU memory after the split to the GPU memory;

the CPU executes cpu_dgemm() for the CPU part of the matrix multiplication, and the GPU executes gpu_dgemm() for the GPU part;

if the decomposition has not yet finished, the GPU's transfer of data back to main memory runs in parallel with the transfer of the data needed by the next round of HPL_dtrsm(); if the decomposition has finished, the completed results are transferred back to main memory.
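One hedged reading of the split step above: divide the trailing matrix's columns between the GPU and the CPU by the allocation ratio, rounded to whole blocks as in a blocked LU factorization. The names and the rounding policy here are assumptions, not taken from the patent:

```python
def split_columns(n_cols: int, r: float, nb: int):
    """Hypothetical sketch of the per-iteration split: the trailing
    matrix's n_cols columns are divided between GPU and CPU according
    to the allocation ratio r, rounded to whole blocks of width nb.
    Returns (gpu_cols, cpu_cols)."""
    gpu_cols = int(round(n_cols * r / nb)) * nb  # GPU share, whole blocks
    gpu_cols = min(gpu_cols, n_cols)             # never exceed the matrix
    return gpu_cols, n_cols - gpu_cols

print(split_columns(1024, 0.75, 128))  # → (768, 256)
```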
Recording the time consumed by the simulation calculation is specifically as follows:

(1) Transfer the portion of the matrix that needs to reside in GPU memory after the split to the GPU memory; in the first round this takes t1_1, otherwise it runs in parallel with the tail of the previous round, taking max(t1_i, t4_{i-1});

(2) the CPU executes cpu_dtrsm(), taking time:

(3) the GPU executes gpu_dtrsm(), taking time:

(4) steps (2) and (3) together take:

(5) the GPU transfers the completed results back to main memory, taking t2_i;

(6) HPL_dgemm() is executed, and the portion of the matrix that needs to reside in GPU memory after the split is transferred to the GPU memory, taking t3_i;

(7) steps (5) and (6) together take max(t2_i, t3_i);

(8) the CPU executes cpu_dgemm(), taking time:

(9) the GPU executes gpu_dgemm(), taking time:

(10) steps (8) and (9) together take:

(11) the completed results are transferred back to main memory, taking t4_i;

(12) timing ends; the total system time is:
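The overlap structure of steps (1) through (12) can be sketched as an accumulation of per-iteration maxima. The patent's closed-form summation did not survive in this text, so the reading below is an assumption consistent with the listed overlaps:

```python
def simulated_total_time(iters):
    """Hedged sketch of the timing model in steps (1)-(12): per iteration,
    the upload overlaps the previous iteration's write-back, the CPU/GPU
    dtrsm calls run in parallel, the result write-back overlaps the dgemm
    split/upload, and the dgemm calls run in parallel.  `iters` is a list
    of dicts with keys t1..t4 and the four compute times; the key names
    are assumptions."""
    total = 0.0
    prev_t4 = 0.0
    for it in iters:
        total += max(it["t1"], prev_t4)                  # (1) upload vs. previous write-back
        total += max(it["cpu_dtrsm"], it["gpu_dtrsm"])   # (4) parallel dtrsm
        total += max(it["t2"], it["t3"])                 # (7) write-back vs. dgemm upload
        total += max(it["cpu_dgemm"], it["gpu_dgemm"])   # (10) parallel dgemm
        prev_t4 = it["t4"]
    return total + prev_t4                               # (11) final write-back

it = {"t1": 1.0, "t2": 2.0, "t3": 3.0, "t4": 8.0,
      "cpu_dtrsm": 4.0, "gpu_dtrsm": 5.0, "cpu_dgemm": 6.0, "gpu_dgemm": 7.0}
print(simulated_total_time([it, dict(it, t1=9.0)]))  # → 48.0
```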
Calculating and analyzing the floating-point performance of the system is specifically as follows:

(1) Compute the system's measured floating-point performance, theoretical floating-point performance, and simulation calculation efficiency;

measured floating-point performance:

theoretical floating-point performance:

v2 = FLOPS_CPU·n_CPU·η_CPU_PARA·η_GEMM + FLOPS_GPU·n_GPU·η_GPU_PARA·η_GEMM;

simulation calculation efficiency:

η = v1/(FLOPS_CPU·n_CPU + FLOPS_GPU·n_GPU);

(2) adjust the configuration parameters to obtain different simulation results, and then plot how the simulation calculation efficiency varies with the matrix block size under different configuration parameters.
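The two surviving formulas of step (1) can be written out directly (v1 itself is derived from the simulated total time, whose expression is not restated here):

```python
def theoretical_perf(flops_cpu, n_cpu, eta_cpu_para,
                     flops_gpu, n_gpu, eta_gpu_para, eta_gemm):
    """v2 = FLOPS_CPU*n_CPU*eta_CPU_PARA*eta_GEMM
          + FLOPS_GPU*n_GPU*eta_GPU_PARA*eta_GEMM."""
    return (flops_cpu * n_cpu * eta_cpu_para
            + flops_gpu * n_gpu * eta_gpu_para) * eta_gemm

def simulation_efficiency(v1, flops_cpu, n_cpu, flops_gpu, n_gpu):
    """eta = v1 / (FLOPS_CPU*n_CPU + FLOPS_GPU*n_GPU)."""
    return v1 / (flops_cpu * n_cpu + flops_gpu * n_gpu)

# Illustrative values, not the patent's
print(theoretical_perf(100.0, 2, 0.9, 500.0, 4, 0.95, 0.9))
print(simulation_efficiency(1100.0, 100.0, 2, 500.0, 4))
```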
The displayed simulation results include the split constant, the time consumed by the simulation, the measured floating-point performance, the theoretical floating-point performance, the simulation calculation efficiency, plots of simulation efficiency versus matrix block size under different configuration parameters, and the bottlenecks of system performance and their causes, together with suggestions for performance optimization.
Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. The present invention performs the above operations through a visual interface, which is simple, fast, and easy to operate.

2. The present invention makes full use of CPU and GPU performance, allocating tasks between the CPU and GPU in a reasonable manner, achieving load balancing and improving simulation test performance.

3. The simulation results of the present invention are close to real test results and are of significant value in judging a machine's processing capability.

4. By analyzing the simulation test data, the present invention can help users quickly locate performance bottlenecks and their causes, and offers suggestions for performance optimization.
Brief Description of the Drawings

Fig. 1 is the overall workflow diagram of an HPL calculation model simulation method according to the present invention.

Fig. 2 is the detailed workflow diagram of an HPL calculation model simulation method according to the present invention.

Fig. 3 is the configuration-parameter acquisition interface of an HPL calculation model simulation method according to the present invention.

Fig. 4 is the simulation-result display interface of an HPL calculation model simulation method according to the present invention.

Fig. 5 shows the simulation efficiency under different bus bandwidths for an HPL calculation model simulation method according to the present invention.

Fig. 6 shows the simulation efficiency under different GEMM efficiencies for an HPL calculation model simulation method according to the present invention.

Fig. 7 shows the simulation results of an HPL calculation model simulation method according to the present invention with 2 Xeon Gold 6142 CPUs and 4 Vega 10 GPUs.

Fig. 8 shows the real results of an HPL test performed on a physical machine with 2 Xeon Gold 6142 CPUs and 4 Vega 10 GPUs.
Detailed Description

The present invention is described in further detail below with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.

As shown in Figs. 1-8, an HPL calculation model simulation method comprises the following steps in order: obtaining user-input system configuration parameters such as CPU, GPU, and MISC parameters; determining the allocation ratio of the simulation workload between the CPU and GPU according to their computing capabilities; performing the simulation calculation of the corresponding tasks on the CPU and GPU respectively; recording the time consumed by the simulation; calculating and analyzing the floating-point performance of the system; and displaying the simulation results.
(I) Acquisition of the configuration parameters

As shown in Fig. 3, the HPL calculation model simulation platform provided by the present invention offers a visual operating environment; clicking the input boxes of the visual interface completes the configuration of each parameter. The configured CPU parameters include the number of CPU cores, frequency, floating-point operations per clock cycle, and number of CPUs; the configured GPU parameters include GPU frequency and number of GPUs; the MISC configuration includes problem size, matrix block size, bus bandwidth, bus efficiency factor, CPU parallel-execution efficiency factor, GPU parallel-execution efficiency factor, and GEMM efficiency factor. Besides manual input, parameters can also be configured from the provided templates: the CPU types offered by the platform include EPYC 7251, EPYC 7281, Xeon Gold 6142, etc., and the GPU types include RX Vega 56, RX Vega 64, Vega 10, etc. Clicking Auto Fill obtains the local machine's configuration information to complete automatic configuration. Once the above configuration information has been filled in, the system automatically computes the theoretical floating-point performance of the CPU and GPU and fills it into the corresponding input boxes.
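The parameter set above can be captured in a plain record; all field names below are illustrative, not taken from the patent's implementation:

```python
from dataclasses import dataclass

@dataclass
class SimConfig:
    """Sketch of the configuration parameters gathered by the interface;
    field names are assumptions for illustration."""
    # CPU parameters
    cpu_cores: int
    cpu_freq_ghz: float
    cpu_flops_per_cycle: int
    cpu_count: int
    # GPU parameters
    gpu_freq_ghz: float
    gpu_count: int
    # MISC parameters
    problem_size: int            # N
    block_size: int              # NB, matrix block size
    bus_bandwidth_gbs: float
    bus_efficiency: float
    cpu_parallel_efficiency: float
    gpu_parallel_efficiency: float
    gemm_efficiency: float

# Illustrative configuration
cfg = SimConfig(cpu_cores=16, cpu_freq_ghz=2.6, cpu_flops_per_cycle=32,
                cpu_count=2, gpu_freq_ghz=1.5, gpu_count=4,
                problem_size=40000, block_size=384, bus_bandwidth_gbs=16.0,
                bus_efficiency=0.9, cpu_parallel_efficiency=0.85,
                gpu_parallel_efficiency=0.9, gemm_efficiency=0.92)
print(cfg.problem_size, cfg.block_size)
```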
(II) The HPL calculation model simulation platform begins the simulation calculation

(1) Initialize an augmented matrix A|b of the corresponding size according to the input problem scale.

(2) Start the HPL timing system.

(3) Cyclically decompose the augmented matrix A|b based on LU decomposition. Since CPU and GPU computing capabilities differ considerably, and data transfer between the CPU and GPU incurs a certain communication overhead, the proportion of the computation assigned to each affects the overall system performance. According to the computing capabilities of the CPU and GPU, and taking into account the communication overhead of data transfer between them, the computational workload is balanced between the CPU and GPU; the allocation ratio is as follows:

Then, according to the task-allocation ratio, split the matrix on which the multiply-add operations are to be performed, and transfer the portion that needs to reside in GPU memory after the split to the GPU memory; in the first round this takes t1_1, otherwise it runs in parallel with the tail of the previous round, taking max(t1_i, t4_{i-1}).

The CPU executes cpu_dtrsm() to compute the lower-triangular matrix of the CPU part, and the GPU executes gpu_dtrsm() to compute the lower-triangular matrix of the GPU part; the two run in parallel.

The GPU transfers the completed results back to main memory, taking t2_i, while HPL_dgemm() simultaneously splits the matrix on which the multiply-add operations are to be performed according to the CPU and GPU computing capabilities and transfers the portion that needs to reside in GPU memory after the split to the GPU memory, taking t3_i; in parallel these take max(t2_i, t3_i).

The CPU executes cpu_dgemm() for the CPU part of the matrix multiplication, and the GPU executes gpu_dgemm() for the GPU part; the two run in parallel.

If the decomposition has not yet finished, the GPU's transfer of data back to main memory runs in parallel with the transfer of the data needed by the next round of HPL_dtrsm(); if the decomposition has finished, transferring the completed results back to main memory takes t4_i.

(4) The computation is complete. The main time-consuming parts of HPL have been accounted for; the remaining operations, such as freeing memory, are not counted for the time being.

(5) Timing ends; the total system time is:
(6) Compute the system's measured floating-point performance, theoretical floating-point performance, and simulation calculation efficiency.

Measured floating-point performance:

Theoretical floating-point performance:

v2 = FLOPS_CPU·n_CPU·η_CPU_PARA·η_GEMM + FLOPS_GPU·n_GPU·η_GPU_PARA·η_GEMM

Simulation calculation efficiency:

η = v1/(FLOPS_CPU·n_CPU + FLOPS_GPU·n_GPU)

(7) Adjust the configuration parameters to obtain different simulation results, and then plot how the simulation calculation efficiency varies with the matrix block size under different configuration parameters.
(III) Displaying the simulation results

As shown in Fig. 4, the fed-back simulation results include the split constant, the time consumed by the simulation, the measured floating-point performance, the theoretical floating-point performance, and the simulation calculation efficiency; all are filled into the corresponding text boxes automatically once the simulation completes.

The simulation results fed back by the platform also include plots of simulation efficiency versus matrix block size under different configuration parameters, together with the system's performance bottlenecks and their causes and suggestions for performance optimization. This part is returned to the user as a PDF document, which the user simply downloads. Figs. 5 and 6 show simulation-efficiency plots produced by the platform for several bus bandwidths and GEMM efficiencies.

(IV) Comparison of simulation results with real results

Fig. 7 shows the simulation results with 2 Xeon Gold 6142 CPUs and 4 Vega 10 GPUs, and Fig. 8 shows the real results of an HPL test on a physical machine under the same conditions. The simulation results are quite close to the real test results, which verifies the reliability of the simulation platform.

The above embodiment is a preferred implementation of the present invention, but the implementation of the present invention is not limited to it; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principles of the present invention shall be regarded as an equivalent replacement and falls within the protection scope of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910583440.7A | 2019-07-01 | 2019-07-01 | A Simulation Method of HPL Calculation Model |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910583440.7A | 2019-07-01 | 2019-07-01 | A Simulation Method of HPL Calculation Model |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN110333933A (en) | 2019-10-15 |
Family
ID=68143784
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201910583440.7A (Pending, published as CN110333933A (en)) | A Simulation Method of HPL Calculation Model | 2019-07-01 | 2019-07-01 |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN110333933A (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110851987A (en) * | 2019-11-14 | 2020-02-28 | | Method, apparatus and storage medium for predicting calculated duration based on acceleration ratio |
| CN111913747A (en) * | 2020-07-03 | 2020-11-10 | | Panel decomposition optimization method and device for HPL on complex heterogeneous systems |
| CN111930471A (en) * | 2020-08-14 | 2020-11-13 | | GPU-based parallel simulation evaluation selection method |
| CN112256623A (en) * | 2020-10-26 | 2021-01-22 | | Heterogeneous system-based processing performance optimization method and device |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101706741A (en) * | 2009-12-11 | 2010-05-12 | | Method for partitioning dynamic tasks of CPU and GPU based on load balance |
| US20120271481A1 (en) * | 2011-04-22 | 2012-10-25 | | Method and system for thermal load management in a portable computing device |
- 2019-07-01 CN CN201910583440.7A patent/CN110333933A/en active Pending
Non-Patent Citations (2)
| Title |
|---|
| WANG Feng et al.: "Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer", Journal of Computer Science and Technology * |
| LUO Yuejian: "Optimization of Magnetic Reconnection Particle Simulation Based on the High-Performance Computing Cluster of Nanchang University", China Masters' Theses Full-text Database, Information Science and Technology * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110851987A (en) * | 2019-11-14 | 2020-02-28 | SAIC-GM-Wuling Automobile Co., Ltd. | Method, apparatus and storage medium for predicting calculation duration based on speedup ratio |
CN110851987B (en) * | 2019-11-14 | 2022-09-09 | SAIC-GM-Wuling Automobile Co., Ltd. | Method, apparatus and storage medium for predicting calculation duration based on speedup ratio |
CN111913747A (en) * | 2020-07-03 | 2020-11-10 | Institute of Software, Chinese Academy of Sciences | Panel factorization optimization method and device for HPL on complex heterogeneous systems |
CN111913747B (en) * | 2020-07-03 | 2022-05-24 | Institute of Software, Chinese Academy of Sciences | Panel factorization optimization method and device for HPL on complex heterogeneous systems |
CN111930471A (en) * | 2020-08-14 | 2020-11-13 | Shanghai Advanced Research Institute, Chinese Academy of Sciences | GPU-based parallel simulation evaluation selection method |
CN111930471B (en) * | 2020-08-14 | 2023-05-26 | Shanghai Advanced Research Institute, Chinese Academy of Sciences | GPU-based parallel simulation evaluation selection method |
CN112256623A (en) * | 2020-10-26 | 2021-01-22 | Dawning Information Industry (Beijing) Co., Ltd. | Heterogeneous system-based processing performance optimization method and device |
CN112256623B (en) * | 2020-10-26 | 2024-07-19 | Zhongke Shuguang (Nanjing) Computing Technology Co., Ltd. | Heterogeneous system-based processing performance optimization method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110333933A (en) | A Simulation Method of HPL Calculation Model | |
Ousterhout et al. | Making sense of performance in data analytics frameworks | |
CN102854968B (en) | Real-time energy consumption metering method of virtual machine | |
CN104380260B (en) | Utilize the reservoir simulation of scalable grid computing | |
JP5208585B2 (en) | Method, computer program and system for identifying instructions for obtaining representative traces | |
US11200149B2 (en) | Waveform based reconstruction for emulation | |
CN103294546B (en) | The online moving method of virtual machine of multi-dimensional resource performance interference aware and system | |
Barrachina et al. | An integrated framework for power-performance analysis of parallel scientific workloads | |
CN113568821B (en) | AI chip calculation performance test method, device, equipment and medium | |
CN107657599B (en) | Parallel implementation method of remote sensing image fusion system based on mixed granularity division and dynamic load distribution | |
CN114723033A (en) | Data processing method, data processing device, AI chip, electronic device and storage medium | |
CN101986280A (en) | Automatic testing platform for virtual computing system | |
CN102222034A (en) | Virtualized platform performance evaluating method based on program contour analysis | |
CN103677960B (en) | Game resetting method for virtual machines capable of controlling energy consumption | |
CN104102690B (en) | Storage structure based telemetry data processing method | |
CN104407688A (en) | Virtualized cloud platform energy consumption measurement method and system based on tree regression | |
CN115150471B (en) | Data processing method, apparatus, device, storage medium, and program product | |
CN113361094A (en) | Multidisciplinary joint simulation method and system under distributed architecture | |
CN107656851A (en) | Method and system for calculating energy consumption of cloud server based on component energy consumption model | |
CN106469114A (en) | A kind of Parallel Computing Performance detecting system towards communication test and its method | |
CN119046122B (en) | Distributed system modeling method, apparatus, device, medium and program product | |
CN109542731B (en) | A performance monitoring method for GPU-oriented hierarchical progressive drill-down | |
CN107679133B (en) | A practical mining method for massive real-time PMU data | |
US9529688B2 (en) | Performance evaluation device and performance evaluation method | |
CN106155782B (en) | Virtual machine migration method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20191015 |