CN114253821B

CN114253821B - Method and device for analyzing GPU performance and computer storage medium

Info

Publication number: CN114253821B
Application number: CN202210192669.XA
Authority: CN
Inventors: 齐航空; 张竞丹; 李亮
Original assignee: Xi'an Xintong Semiconductor Technology Co ltd
Current assignee: Xintong Semiconductor Technology (Xiamen) Co.,Ltd.
Priority date: 2022-03-01
Filing date: 2022-03-01
Publication date: 2022-05-27
Anticipated expiration: 2042-03-01
Also published as: CN114253821A

Abstract

Embodiments of the present invention disclose a method, a device, and a computer storage medium for analyzing GPU performance. The method may include: obtaining an instruction list obtained by running a target program in a set environment, the number of threads to be started, and the execution result of each thread for each instruction in the instruction list; starting, by a simulation scheduler in the GPU performance model to be analyzed, the thread simulators in that model according to the number of threads to be started; traversing, by each thread simulator, each instruction in the instruction list, and executing the instruction according to its instruction execution control value during the traversal so as to measure the duration of executing the traversed instruction; and, when all instructions in the instruction list have been traversed, obtaining the total execution duration of all thread simulators executing all the instructions in the instruction list.

Description

A method, device and computer storage medium for analyzing GPU performance

Technical Field

Embodiments of the present invention relate to the technical field of graphics processing units (GPUs), and in particular to a method, a device, and a computer storage medium for analyzing GPU performance.

Background

In GPU performance statistics, instructions per cycle (IPC) is an important GPU performance indicator: it represents how many instructions the GPU can process in total in each clock cycle. Typically, IPC can be calculated from the thread execution time and the system clock frequency.
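As a small illustration of that relationship: the number of elapsed cycles is the execution time multiplied by the clock frequency, and IPC is the instruction count divided by that cycle count. The function name and the numbers below are illustrative assumptions, not values from this document.

```python
def instructions_per_cycle(instruction_count: int,
                           execution_time_s: float,
                           clock_hz: float) -> float:
    """IPC = instructions / cycles, where cycles = time * frequency."""
    cycles = execution_time_s * clock_hz
    if cycles <= 0:
        raise ValueError("execution time and clock frequency must be positive")
    return instruction_count / cycles


# Illustrative numbers only: 2,000,000 instructions in 1 ms at 1 GHz.
ipc = instructions_per_cycle(2_000_000, 1e-3, 1e9)
```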

When gathering GPU performance statistics, it is usually necessary to model GPU performance. Specifically, two approaches are commonly used. The first is simulation modeling, for example building a simulation model of the GPU in software and carrying out the real execution process on that model to obtain the GPU's real performance data. The second is analytical modeling, for example constructing a mapping function (also called an analytical model) that analyzes and processes the GPU's input to compute the corresponding performance results.

With simulation modeling, the hardware execution process can be faithfully simulated and real simulation data obtained. However, because the simulation model must simulate execution on a real GPU, it runs inefficiently, and if the GPU architecture needs to be adjusted, the simulation model has to be rebuilt for the adjusted architecture. Gathering GPU performance statistics by simulation modeling therefore suffers from relatively poor scalability and a long development cycle. With analytical modeling, the analytical model does not simulate the actual execution of instructions; it only needs to perform modeling and analysis computations on the input instruction information to obtain performance result data. Gathering GPU performance statistics by analytical modeling is therefore very efficient, structurally simple, and highly scalable. In practice, however, if the analytical model does not process the output instructions finely enough, the resulting GPU performance statistics will have a large error rate.

Summary of the Invention

In view of this, embodiments of the present invention aim to provide a method, a device, and a computer storage medium for analyzing GPU performance that can reduce the error rate of GPU performance statistics obtained by analytical modeling and provide more accurate GPU performance data.

The technical solutions of the embodiments of the present invention are implemented as follows:

In a first aspect, an embodiment of the present invention provides a method for analyzing GPU performance, the method comprising:

obtaining an instruction list obtained by running a target program in a set environment, the number of threads to be started, and the execution result of each thread for each instruction in the instruction list, wherein the execution result includes an instruction execution control value for each instruction;

starting, by a simulation scheduler in a GPU performance model to be analyzed, the thread simulators in the GPU performance model to be analyzed according to the number of threads to be started;

traversing, by each thread simulator, each instruction in the instruction list, and during the traversal executing the instruction according to the instruction execution control value of each instruction, so as to measure the duration of executing the traversed instruction;

when all instructions in the instruction list have been traversed, obtaining the total execution duration of all thread simulators executing all the instructions in the instruction list.

In a second aspect, an embodiment of the present invention provides a device for analyzing GPU performance, the device comprising an acquisition part, a simulation scheduler, thread simulators, and a statistics part, wherein:

the acquisition part is configured to obtain an instruction list obtained by running a target program in a set environment, the number of threads to be started, and the execution result of each thread for each instruction in the instruction list, wherein the execution result includes an instruction execution control value for each instruction;

the simulation scheduler is configured to start the thread simulators in the GPU performance model to be analyzed according to the number of threads to be started;

each thread simulator is configured to traverse each instruction in the instruction list and, during the traversal, execute the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction, so as to measure the duration of executing the currently traversed instruction;

the statistics part is configured to obtain, when all instructions in the instruction list have been traversed, the total execution duration of all thread simulators executing all the instructions in the instruction list.

In a third aspect, an embodiment of the present invention provides a computer storage medium storing a program for analyzing GPU performance; when executed by at least one processor, the program implements the steps of the method for analyzing GPU performance described in the first aspect.

Embodiments of the present invention provide a method, a device, and a computer storage medium for analyzing GPU performance. During performance analysis, the execution duration of each type of instruction in the target program's instruction list is measured for each simulated thread in the GPU performance model, and the total execution duration is tallied. This covers the processing of all current GPU instruction types, reduces the error rate of GPU performance statistics obtained by analytical modeling, and provides more accurate GPU performance data.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of multiple threads executing sequential instructions in SIMT mode according to an embodiment of the present invention.

FIG. 2 is a schematic flowchart of a method for analyzing GPU performance according to an embodiment of the present invention.

FIG. 3 is a schematic flowchart of obtaining the instruction list and execution results of a target program according to an embodiment of the present invention.

FIG. 4 is a schematic flowchart of a specific implementation, according to an embodiment of the present invention, in which each thread simulator measures the duration of executing each traversed instruction while traversing the instruction list, and the total execution duration of all thread simulators executing all the instructions in the instruction list is obtained.

FIG. 5 is a schematic structural diagram of a device for analyzing GPU performance according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments.

At present, the instructions a GPU can process are generally divided into three types: arithmetic logic instructions, memory access instructions, and branch jump instructions. Conventional schemes for gathering GPU performance statistics by analytical modeling all cover arithmetic logic instructions and memory access instructions, and can accurately compute the execution duration of each such instruction. For branch jump instructions, however, on the one hand the existing conventional schemes do not consider how they are processed, and on the other hand branch jump instructions have a very large impact on instruction execution performance. Embodiments of the present invention therefore aim to provide a scheme for analyzing GPU performance that can accurately determine how each thread executes branch jump instructions, so that the performance of a GPU running instructions can be analyzed more accurately than with conventional schemes.

It should be noted that a GPU runs instructions in single instruction, multiple threads (SIMT) mode: each time an instruction is fetched, multiple threads are scheduled to execute it simultaneously. Under SIMT, when a branch jump instruction is encountered during execution, threads enter different branch paths depending on the data each thread processes. Generally, when the GPU executes a branch jump instruction it continues to fetch instructions sequentially, and each thread controls whether its current instruction needs to be executed by setting a mask bit: if the thread's current mask bit is true, the current instruction is executed; if the mask bit is false, the current instruction does not enter the execution pipeline. This is how the processing and execution of branch jump instructions is controlled under SIMT.

Taking FIG. 1 as an example, suppose the GPU includes n+1 threads, identified as Thread0, Thread1, Thread2, ..., Threadn. The instruction sequence, shown on the left side of FIG. 1, is: ADD, SUB, IF A, ADD, SUB, ELSE, MUL, DIV, ENDIF, ADD, and SUB. The branch jump instructions are thus IF A and ELSE. When the GPU executes these instructions in SIMT mode and a branch jump instruction occurs, different threads are scheduled to execute different instruction segments. In FIG. 1, the mask bit of threads Thread1 and Thread2 is false during the instruction segment of the branch jump instruction IF A, so that segment does not enter Thread1 and Thread2 for execution, as indicated by the dashed boxes in FIG. 1. Likewise, the mask bit of threads Thread0 and Threadn is false during the instruction segment of the branch jump instruction ELSE, so that segment does not enter Thread0 and Threadn for execution, which is also indicated by dashed boxes in FIG. 1.
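The mask behavior in the FIG. 1 example can be sketched as follows. This is only an illustration under stated assumptions: the helper `masks_for_thread` and the per-thread condition table are hypothetical, the control instructions IF, ELSE, and ENDIF are themselves treated as issued with a true mask, and nested branches are not handled.

```python
# Instruction stream taken from the FIG. 1 example.
PROGRAM = ["ADD", "SUB", "IF A", "ADD", "SUB", "ELSE",
           "MUL", "DIV", "ENDIF", "ADD", "SUB"]


def masks_for_thread(program, cond_a):
    """Return one mask bit per instruction for a thread whose condition A is cond_a.

    Assumptions of this sketch: IF/ELSE/ENDIF get mask True, no nesting.
    """
    masks = []
    active = True
    for op in program:
        if op.startswith("IF"):
            masks.append(True)
            active = cond_a          # IF segment runs only where A holds
        elif op == "ELSE":
            masks.append(True)
            active = not cond_a      # ELSE segment runs only where A fails
        elif op == "ENDIF":
            masks.append(True)
            active = True            # all threads reconverge
        else:
            masks.append(active)
    return masks


# As in FIG. 1: Thread0 and Threadn run the IF segment, Thread1 and Thread2 the ELSE segment.
CONDITIONS = {"Thread0": True, "Thread1": False, "Thread2": False, "Threadn": True}

for name, cond in CONDITIONS.items():
    row = masks_for_thread(PROGRAM, cond)
    print(name, " ".join("X" if bit else "-" for bit in row))
```

Each printed row marks with `-` the instructions that a thread skips, which corresponds to the dashed boxes described for FIG. 1.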

In light of the above description of how a GPU, when executing instructions in SIMT mode, fetches and executes instructions sequentially and uses the mask bit to control whether a thread executes the current branch jump instruction, refer to FIG. 2, which shows a method for analyzing GPU performance provided by an embodiment of the present invention. The method may include:

S201: Obtain an instruction list obtained by running a target program in a set environment, the number of threads to be started, and the execution result of each thread for each instruction in the instruction list, wherein the execution result includes an instruction execution control value for each instruction;

S202: Start, by the simulation scheduler in the GPU performance model to be analyzed, the thread simulators in the GPU performance model to be analyzed according to the number of threads to be started;

S203: Each thread simulator traverses each instruction in the instruction list and, during the traversal, executes the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction, so as to measure the duration of executing the currently traversed instruction;

S204: When all instructions in the instruction list have been traversed, obtain the total execution duration of all thread simulators executing all the instructions in the instruction list.

With the technical solution shown in FIG. 2, the target program is run in advance in the set environment to obtain the instruction list and the execution result of each thread for each instruction in the list. Then, during performance analysis, the execution duration of each type of instruction in the target program's instruction list is measured for each simulated thread in the GPU performance model, and the total execution duration is tallied. This covers the processing of all current GPU instruction types, reduces the error rate of GPU performance statistics obtained by analytical modeling, and provides more accurate GPU performance data. In addition, the structure of the GPU to be analyzed can be adjusted easily, so that the performance indicators of the same instructions can be analyzed under different GPU architectures.

Based on the technical solution shown in FIG. 2, in some possible implementations, obtaining the instruction list obtained by running the target program in the set environment, the number of threads to be started, and the execution result of each thread for each instruction in the instruction list includes:

running the target program in a real environment or a simulation environment and, while it runs, obtaining the instruction list of the target program, the number of threads to be started, and the execution result of each thread for each instruction, wherein the execution result includes operand register values and an instruction execution control value.

For the above implementation, in detail, the target program may, by way of example, be an application chosen for GPU performance analysis. As shown in FIG. 3, the application can be run in an existing real runtime environment or in a simulated runtime environment; the instruction codes are captured, and the application's instruction list is obtained from the captured instruction codes. In addition, while the application runs in SIMT mode, the number of threads that the real or simulated runtime environment enables to execute the instruction list can be traced, along with the execution result of each thread for each instruction in the list. Specifically, in the embodiment of the present invention, if the instruction list includes m instructions and the number of enabled threads is n, there are m×n execution results in total. Each execution result is preferably represented as a list, namely [dest, src0, src1, mask], where dest, src0, and src1 are operand register values: dest is the destination operand register value holding the instruction's result, src0 is the first source operand register value, and src1 is the second source operand register value; mask is the instruction execution control value. As described above, the mask value controls whether the thread needs to execute the current instruction: if the thread's current mask value is true, the current instruction is executed; if it is false, the current instruction does not enter the execution pipeline. Depending on the instruction type, dest, src0, and src1 need not all be present in an execution result, but dest is always present, and so is the mask value. For example, referring to Table 1, suppose n+1 threads are enabled, identified as T₀, T₁, T₂, ..., T_n; for each instruction of the instruction list shown in the first column, each thread has a corresponding execution result, as shown in Table 1.

Table 1

As can be seen from Table 1, IF R3 and ELSE R3 in the instruction list are branch jump instructions, each followed by its corresponding instruction segment. For IF R3, Table 1 shows that the mask value of threads T₀ and T_n is true, meaning they execute the instruction segment corresponding to IF R3, which contains the instructions "ADD R4, R1, R2" and "SUB R5, R4, R1"; the mask value of threads T₁ and T₂ is false, meaning neither executes that instruction segment. Further, while threads T₀ and T_n execute the instruction segment containing "ADD R4, R1, R2" and "SUB R5, R4, R1", threads T₁ and T₂ can be regarded as executing NOP instructions.

For ELSE R3, Table 1 shows that the mask value of threads T₁ and T₂ is true, meaning they execute the instruction segment corresponding to ELSE R3, which contains the instructions "MUL R4, R1, R2" and "ADD R5, R4, R1"; the mask value of threads T₀ and T_n is false, meaning neither executes that instruction segment. Further, while threads T₁ and T₂ execute the instruction segment containing "MUL R4, R1, R2" and "ADD R5, R4, R1", threads T₀ and T_n can be regarded as executing NOP instructions.

Referring again to Table 1, for instruction types other than branch jump instructions, the mask value of every thread is true. That is, for other instruction types such as arithmetic logic instructions and memory access instructions, SIMT mode schedules multiple threads to execute them simultaneously.

From the above description of the execution results shown in Table 1, it can be understood that the execution results contain, for each thread and each instruction, the instruction execution control value mask, and that this mask value controls whether the thread executes the instruction. The instruction list can therefore be input to the performance analysis model of the GPU to be analyzed, and the execution of each thread model can be controlled according to the execution results shown in Table 1. In this way the performance data for executing branch jump instructions can be obtained more accurately, which reduces the error rate of GPU performance statistics obtained by analytical modeling and provides more accurate GPU performance data.
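The per-thread, per-instruction execution results described above can be pictured as a small data structure. The sketch below is an assumption-laden illustration: the class `ExecResult`, the helper `executing_threads`, and all register values are hypothetical and are not the contents of Table 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ExecResult:
    """One [dest, src0, src1, mask] record for one thread and one instruction."""
    dest: int                    # destination operand register value (always present)
    mask: bool                   # instruction execution control value (always present)
    src0: Optional[int] = None   # first source operand register value, if any
    src1: Optional[int] = None   # second source operand register value, if any


# Hypothetical trace: m = 2 instructions, n = 3 threads, so m * n = 6 records.
instructions = ["ADD R4, R1, R2", "SUB R5, R4, R1"]
trace: List[List[ExecResult]] = [
    # results for "ADD R4, R1, R2", one record per thread
    [ExecResult(dest=7, mask=True, src0=3, src1=4),
     ExecResult(dest=0, mask=False, src0=1, src1=2),
     ExecResult(dest=9, mask=True, src0=4, src1=5)],
    # results for "SUB R5, R4, R1", one record per thread
    [ExecResult(dest=4, mask=True, src0=7, src1=3),
     ExecResult(dest=0, mask=False, src0=0, src1=1),
     ExecResult(dest=5, mask=True, src0=9, src1=4)],
]


def executing_threads(trace, instr_index):
    """Indices of the threads whose mask is true for the given instruction."""
    return [t for t, result in enumerate(trace[instr_index]) if result.mask]
```

Indexing the trace as `trace[instruction][thread]` mirrors the layout described for Table 1, with one row per instruction and one column per thread.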

For the technical solution shown in FIG. 2, in some possible implementations, the step in which each thread simulator traverses each instruction in the instruction list and, during the traversal, executes the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction so as to measure the duration of executing the currently traversed instruction includes:

for each thread simulator, determining whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;

if the instruction execution control value indicates that the currently traversed instruction is not to be executed, taking the duration of executing a fixed NOP instruction as the duration of executing the currently traversed instruction;

if the instruction execution control value indicates that the currently traversed instruction is to be executed, determining the instruction type of the currently traversed instruction;

if the instruction type of the currently traversed instruction is a memory access instruction, measuring the duration of executing the currently traversed instruction in the manner used for executing memory access instructions;

if the instruction type of the currently traversed instruction is an arithmetic logic instruction, taking the fixed duration of executing an arithmetic logic instruction as the duration of executing the currently traversed instruction.

For the above implementation, in some examples, measuring the duration of executing the currently traversed instruction in the manner used for executing memory access instructions includes:

measuring the duration of executing the currently traversed instruction with a set Cache access analysis model, according to the memory access address corresponding to the operand register values in the execution result of the currently traversed instruction.

For the above implementation and its examples, in a specific implementation and as described above, the instruction execution control value mask can be true or false: if the thread's current mask value is true, the current instruction is executed; if the thread's current mask value is false, the current instruction does not enter the execution pipeline, that is, the currently traversed instruction is not executed.
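The duration rule described above can be sketched as a single function. Everything numeric here is a placeholder assumption: the cycle counts, the 64-byte line size, and the class `ToyCacheModel` (a stand-in that simply remembers which cache lines were touched) are not the Cache access analysis model of the embodiment, whose details this passage does not give. Instruction types other than the two named above are outside this sketch.

```python
from enum import Enum


class Kind(Enum):
    ALU = "arithmetic logic"
    MEM = "memory access"


# Hypothetical latencies in cycles; the document does not specify values.
NOP_CYCLES = 1
ALU_CYCLES = 4
CACHE_HIT_CYCLES = 20
CACHE_MISS_CYCLES = 200
LINE_BYTES = 64


class ToyCacheModel:
    """Placeholder cache model: a line hits once it has been touched before."""

    def __init__(self):
        self._lines = set()

    def access_cycles(self, address: int) -> int:
        line = address // LINE_BYTES
        if line in self._lines:
            return CACHE_HIT_CYCLES
        self._lines.add(line)
        return CACHE_MISS_CYCLES


def instruction_duration(kind: Kind, mask: bool, address: int,
                         cache: ToyCacheModel) -> int:
    """Duration charged to one thread simulator for one traversed instruction."""
    if not mask:
        return NOP_CYCLES                    # masked off: charge a fixed NOP
    if kind is Kind.MEM:
        return cache.access_cycles(address)  # depends on the accessed address
    return ALU_CYCLES                        # fixed arithmetic logic duration
```

Passing the cache model in as an argument keeps the duration rule independent of how cache behavior is estimated, so a more detailed model could be substituted without changing the rule itself.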

For the above implementation, in some examples, the method further includes:

for each thread simulator, determining during the traversal whether the currently traversed instruction is the end instruction in the instruction list:

if not, determining whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;

if so, determining that the thread simulator has finished traversing all the instructions in the instruction list.

Regarding the above example, it should be noted that because each thread simulator must execute the instructions in the instruction list in order, when traversing each instruction it should first determine whether the currently traversed instruction is the end instruction of the instruction list, which marks the end of execution of the instructions in the list. If it is the end instruction, the thread simulator is determined to have finished traversing all the instructions in the list, and the process of S204, obtaining the total execution duration of all thread simulators executing all the instructions in the instruction list, can then be performed. If it is not the end instruction, the process described in the above implementation and the preceding examples is carried out, namely executing the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction so as to measure the duration of executing the currently traversed instruction.

For the technical solution shown in FIG. 2, in some possible implementations, for each thread simulator, after the duration of executing the currently traversed instruction has been measured, the method further includes:

for each thread simulator, adding the duration of executing the currently traversed instruction to the total execution duration of the corresponding thread simulator, wherein the starting value of the total execution duration is 0 before the first instruction in the instruction list is traversed.

For the above implementation, it should be noted that, for each thread simulator, a total execution duration with a starting value of 0 can be set before the instructions in the instruction list are traversed. Then, as the thread simulator traverses each instruction in the list, the duration of each traversed instruction is added to the total execution duration. Once all the instructions in the instruction list have been traversed, the resulting total is the total execution duration of that thread simulator executing all the instructions in the instruction list.
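The accumulation just described, together with the end-instruction check, can be sketched as a short loop. The marker string `END`, the function `total_execution_time`, and the sample durations are assumptions made for illustration; the per-instruction measurement is passed in as a callback so the loop does not depend on how durations are computed.

```python
from typing import Callable, Sequence

END = "END"  # hypothetical marker for the end instruction of the list


def total_execution_time(instructions: Sequence[str],
                         measure: Callable[[int, str], int]) -> int:
    """Traverse the list in order and accumulate one thread simulator's total.

    `measure(index, instruction)` returns the duration of one traversed
    instruction for this thread simulator.
    """
    total = 0                        # starting value before the first instruction
    for index, instruction in enumerate(instructions):
        if instruction == END:       # end instruction: traversal is complete
            break
        total += measure(index, instruction)
    return total


# Hypothetical data for one thread: masked-off instructions cost 1, others 4.
program = ["ADD", "LOAD", "SUB", END]
masks = [True, False, True, True]


def sample_cost(index: int, instruction: str) -> int:
    return 4 if masks[index] else 1


total = total_execution_time(program, sample_cost)  # 4 + 1 + 4
```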

Combining the above implementations and examples, and referring to FIG. 4, this embodiment of the present invention takes one thread simulator as an example to show a specific flow for implementing S203 and S204. As shown in FIG. 4, the specific flow may include:

S401: Set the initial value of the total execution duration of the thread simulator to 0;

S402: The thread simulator traverses the instruction list;

S403: Determine whether the currently traversed instruction is the end instruction of the instruction list. If so, go to S410, determine that the thread simulator has finished traversing all instructions in the instruction list, and execute S411: obtain the total execution duration of this thread simulator executing all instructions in the instruction list. If not, go to S404.

S404: Determine whether the mask value in the execution result of the currently traversed instruction is true. If it is not true, the currently traversed instruction is a branch jump instruction or belongs to the instruction segment corresponding to a branch jump instruction, and the thread simulator is controlled not to execute that branch jump instruction or its corresponding instruction segment; therefore, go to S405. If it is true, then regardless of whether the currently traversed instruction is a branch jump instruction or belongs to the instruction segment of a branch jump instruction, the thread simulator is controlled to execute the instruction; therefore, go to S406.

S405: Take the duration of executing a fixed NOP instruction as the duration of executing the currently traversed instruction.

S406: Determine the instruction type of the currently traversed instruction.

If the instruction type is a memory access instruction, execute S407: measure the duration of executing the currently traversed instruction in the manner of executing a memory access instruction. Specifically, if the currently traversed instruction is a memory access instruction, the duration for which the thread simulator executes the current memory access instruction is calculated by the Cache access analysis model according to the corresponding memory access address in that thread's execution result.

If the instruction type is an arithmetic logic instruction, then, because the execution duration of an arithmetic logic instruction is fixed, execute S408: take the fixed duration of executing an arithmetic logic instruction as the duration of executing the currently traversed instruction.

After any of S405, S407 and S408 has been completed and the duration of executing the currently traversed instruction has been obtained, the thread simulator executes S409: add the duration of executing the currently traversed instruction to the total execution duration of the corresponding thread simulator. The flow then returns to S402 so that the thread simulator traverses the next instruction in the instruction list, until the end instruction of the instruction list is reached.
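The flow of S401 to S411 can be illustrated with a minimal Python sketch for a single thread simulator. It is not the patented implementation: the names `Instr`, `ExecResult`, `ToyCacheModel` and `simulate_thread`, the latency constants, and the very simple cache model (an unbounded set of cached lines) are all assumptions introduced here for illustration, since the source does not specify concrete durations or the internals of the Cache access analysis model.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical latencies in cycles; the source gives no concrete values.
NOP_CYCLES = 1      # assumed fixed duration of a NOP instruction
ALU_CYCLES = 4      # assumed fixed duration of an arithmetic logic instruction
HIT_CYCLES = 20     # assumed duration of a memory access that hits the cache
MISS_CYCLES = 200   # assumed duration of a memory access that misses the cache
LINE_BYTES = 64     # assumed cache line size


@dataclass
class Instr:
    kind: str  # "alu", "mem" or "end"


@dataclass
class ExecResult:
    mask: bool                     # instruction execution control value
    address: Optional[int] = None  # memory access address, used by "mem" only


class ToyCacheModel:
    """Stand-in for the Cache access analysis model (assumption):
    a line misses on first touch and hits on every later touch."""

    def __init__(self):
        self.lines = set()

    def access(self, address):
        line = address // LINE_BYTES
        if line in self.lines:
            return HIT_CYCLES
        self.lines.add(line)
        return MISS_CYCLES


def simulate_thread(instructions, results, cache):
    """Return the total execution duration of one thread simulator."""
    total = 0                                       # S401
    for instr, res in zip(instructions, results):   # S402
        if instr.kind == "end":                     # S403 -> S410
            break
        if not res.mask:                            # S404 -> S405
            cost = NOP_CYCLES
        elif instr.kind == "mem":                   # S406 -> S407
            cost = cache.access(res.address)
        else:                                       # S406 -> S408
            cost = ALU_CYCLES
        total += cost                               # S409
    return total                                    # S411


program = [Instr("alu"), Instr("mem"), Instr("mem"), Instr("alu"), Instr("end")]
trace = [ExecResult(True), ExecResult(True, 0), ExecResult(True, 8),
         ExecResult(False), ExecResult(True)]
total_cycles = simulate_thread(program, trace, ToyCacheModel())
```

In this example the masked-off fourth instruction is charged only the NOP duration, and the second memory access falls in the same assumed cache line as the first, so it is charged the hit duration.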

With the specific flow example shown in FIG. 4, by using the execution result of each thread for each instruction obtained in S201, the performance analysis does not need to actually simulate the execution of the instructions. The total execution duration of the thread simulator executing all instructions in the instruction list is finally obtained, and the IPC is then calculated from the thread execution time and the system clock frequency. In this way, performance data for executing branch jump instructions can be obtained more accurately, the error rate of GPU performance statistics based on analytical modeling is reduced, and more accurate GPU performance data is provided.
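The source does not give a formula for this last step. A minimal sketch, assuming the usual definition of IPC as the number of instructions divided by the number of elapsed clock cycles, where the cycle count is the execution time multiplied by the clock frequency, could look as follows. The function name and its parameters are hypothetical, and which instructions are counted (for example, across all threads) is left to the caller.

```python
def instructions_per_cycle(instruction_count, total_seconds, clock_hz):
    """Estimate IPC from an execution time and a clock frequency.

    Assumption: IPC = instruction_count / (total_seconds * clock_hz).
    """
    cycles = total_seconds * clock_hz
    if cycles <= 0:
        raise ValueError("execution time and clock frequency must be positive")
    return instruction_count / cycles


# Example: 1,000,000 instructions in 2 ms at a 1 GHz clock.
ipc = instructions_per_cycle(1_000_000, 0.002, 1e9)
```

Here 2 ms at 1 GHz corresponds to 2,000,000 cycles, so the estimate is an IPC of about 0.5.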

Based on the same inventive concept as the foregoing technical solution, and referring to FIG. 5, an apparatus 50 for analyzing GPU performance provided by an embodiment of the present invention is shown. The apparatus 50 includes: an acquisition part 501, a simulation scheduler 502, thread simulators 503 and a statistics part 504; wherein,

the acquisition part 501 is configured to acquire an instruction list obtained by running a target program in a set environment, the number of threads to be started, and the execution result of each thread for each instruction in the instruction list; wherein the execution result includes an instruction execution control value for each instruction;

the simulation scheduler 502 is configured to start the thread simulators 503 in the GPU performance model to be analyzed according to the number of threads to be started;

each thread simulator 503 is configured to traverse each instruction in the instruction list and, during the traversal, execute the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction, so as to measure the duration of executing the currently traversed instruction;

the statistics part 504 is configured to obtain, when all instructions in the instruction list have been traversed, the total execution duration of all thread simulators 503 executing all instructions in the instruction list.
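How the four parts hand data to one another can be sketched as below. This is only an illustration under stated assumptions: `Trace`, `run_model` and `toy_simulate` are hypothetical names, the per-instruction durations are invented, and the source does not say how the statistics part combines the per-thread durations, so the sketch reports both their sum and their maximum without claiming either is the intended figure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trace:
    """What the acquisition part 501 is described as providing."""
    instructions: List[str]     # instruction list, e.g. ["alu", "alu", "end"]
    thread_count: int           # number of threads to be started
    results: List[List[bool]]   # results[t][i]: control value of thread t for instruction i


def run_model(trace: Trace,
              simulate_thread: Callable[[List[str], List[bool]], int]) -> Dict[str, object]:
    # Simulation scheduler 502: start one thread simulator per thread.
    per_thread = [simulate_thread(trace.instructions, trace.results[t])
                  for t in range(trace.thread_count)]
    # Statistics part 504: the aggregation rule is an assumption, so both
    # the sum and the maximum of the per-thread durations are returned.
    return {"per_thread": per_thread,
            "sum": sum(per_thread),
            "max": max(per_thread, default=0)}


def toy_simulate(instructions: List[str], masks: List[bool]) -> int:
    """Toy thread simulator 503: 4 cycles if executed, 1 cycle (NOP) if masked off."""
    total = 0
    for kind, mask in zip(instructions, masks):
        if kind == "end":
            break
        total += 4 if mask else 1
    return total


trace = Trace(instructions=["alu", "alu", "end"],
              thread_count=2,
              results=[[True, True, True], [True, False, True]])
report = run_model(trace, toy_simulate)
```

Passing the thread simulator in as a function keeps the scheduling and statistics steps independent of how each instruction's duration is measured.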

In some examples, the acquisition part 501 is configured to run the target program in a real environment or a simulation environment and, during the run, acquire the instruction list of the target program, the number of threads to be started, and the execution result of each thread executing each instruction; wherein the execution result includes an operand register value and an instruction execution control value.

In some examples, each thread simulator 503 is configured to:

determine whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;

during the traversal, determine whether the currently traversed instruction is the end instruction of the instruction list:

if so, determine that the thread simulator 503 has finished traversing all instructions in the instruction list.

In some examples, each thread simulator 503 is further configured to: after measuring the duration of executing the currently traversed instruction, add the duration of executing the currently traversed instruction to the total execution duration of the corresponding thread simulator 503; wherein, before the first instruction in the instruction list is traversed, the initial value of the total execution duration is 0.

It can be understood that, in this embodiment, a "part" may be part of a circuit, part of a processor, part of a program or software, and so on; it may of course also be a unit, and it may be modular or non-modular.

In addition, the components in this embodiment may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional module.

If the integrated unit is implemented in the form of a software functional module rather than being sold or used as a standalone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this embodiment in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the method described in this embodiment. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Therefore, this embodiment provides a computer storage medium storing a program for analyzing GPU performance; when the program for analyzing GPU performance is executed by at least one processor, the steps of the method for analyzing GPU performance described in the foregoing technical solution are implemented.

It can be understood that the exemplary technical solution of the apparatus 50 for analyzing GPU performance belongs to the same concept as the technical solution of the foregoing method for analyzing GPU performance. Therefore, for details not described in the technical solution of the apparatus 50, reference may be made to the description of the technical solution of the foregoing method. They are not repeated in this embodiment of the present invention.

It should be noted that the technical solutions described in the embodiments of the present invention may be combined arbitrarily, provided that they do not conflict.

The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for analyzing GPU performance, the method comprising:

acquiring an instruction list obtained by running a target program in a set environment, the number of threads to be started, and an execution result of each thread for each instruction in the instruction list; wherein the execution result comprises an instruction execution control value for each instruction;

starting a thread simulator in the GPU performance model to be analyzed according to the number of threads to be started through a simulation scheduler in the GPU performance model to be analyzed;

traversing, by each thread simulator, each instruction in the instruction list, and executing the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction during the traversal, so as to measure the duration of executing the currently traversed instruction;

when all instructions in the instruction list have been traversed, obtaining the total execution duration of all thread simulators executing all instructions in the instruction list;

wherein the traversing, by each thread simulator, of each instruction in the instruction list, and the executing of the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction during the traversal so as to measure the duration of executing the currently traversed instruction, comprises:

for each thread simulator, determining whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;

in response to the instruction execution control value indicating that the currently traversed instruction is not to be executed, taking the duration of executing a fixed NOP instruction as the duration of executing the currently traversed instruction;

in response to the instruction execution control value indicating that the currently traversed instruction is to be executed, determining the instruction type of the currently traversed instruction;

in response to the instruction type of the currently traversed instruction being a memory access instruction, measuring the duration of executing the currently traversed instruction in the manner of executing a memory access instruction;

and in response to the instruction type of the currently traversed instruction being an arithmetic logic instruction, taking the fixed duration of executing an arithmetic logic instruction as the duration of executing the currently traversed instruction.

2. The method of claim 1, wherein the acquiring of the instruction list obtained by running the target program in the set environment, the number of threads to be started, and the execution result of each thread for each instruction in the instruction list comprises:

running the target program through a real environment or a simulation environment, and acquiring an instruction list of the target program, the number of threads needing to be started and an execution result of each instruction executed by each thread in the running process; wherein the execution result includes an operand register value and an instruction execution control value.

3. The method of claim 1, wherein the measuring of the duration of executing the currently traversed instruction in the manner of executing a memory access instruction comprises:

measuring, according to the memory access address corresponding to the operand register value in the execution result of the currently traversed instruction, the duration of executing the currently traversed instruction by means of a set Cache access analysis model.

4. The method of claim 1, further comprising:

for each thread simulator, determining during the traversal whether the currently traversed instruction is the end instruction of the instruction list:

if not, determining whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;

and if so, determining that the thread simulator has finished traversing all instructions in the instruction list.

5. The method of claim 1, wherein, for each thread simulator, after measuring the duration of executing the currently traversed instruction, the method further comprises:

for each thread simulator, adding the duration of executing the currently traversed instruction to the total execution duration of the corresponding thread simulator; wherein, before the first instruction in the instruction list is traversed, the initial value of the total execution duration is 0.

6. An apparatus for analyzing GPU performance, the apparatus comprising: an acquisition part, a simulation scheduler, thread simulators and a statistics part; wherein,

the acquisition part is configured to acquire an instruction list obtained by running a target program under a set environment, the number of threads to be started and an execution result of each thread on each instruction in the instruction list; wherein the execution result comprises an instruction execution control value for each instruction;

the simulation scheduler is configured to start a thread simulator in a GPU performance model to be analyzed according to the number of threads to be started;

each thread simulator is configured to traverse each instruction in the instruction list, and execute a currently traversed instruction according to an instruction execution control value and an instruction type of each instruction in the traversing process so as to measure the time length for executing the currently traversed instruction;

the statistics part is configured to obtain, when all instructions in the instruction list have been traversed, the total execution duration of all thread simulators executing all instructions in the instruction list;

wherein each of the thread simulators is configured to:

determine whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;

and, in response to the instruction type of the currently traversed instruction being an arithmetic logic instruction, take the fixed duration of executing an arithmetic logic instruction as the duration of executing the currently traversed instruction.

7. The apparatus according to claim 6, wherein the acquisition part is configured to run the target program in a real environment or a simulation environment and, during the run, acquire an instruction list of the target program, the number of threads to be started, and an execution result of each thread executing each instruction; wherein the execution result includes an operand register value and an instruction execution control value.

8. A computer storage medium storing a program for analyzing GPU performance, the program for analyzing GPU performance implementing the method steps of any of claims 1 to 5 when executed by at least one processor.