CN105094949B

CN105094949B - Simulation method and system based on an instruction computation model and feedback compensation

Info

Publication number: CN105094949B
Application number: CN201510476754.9A
Authority: CN
Inventors: 张为华; 王浩骏; 王欣
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2015-08-06
Filing date: 2015-08-06
Publication date: 2018-04-17
Anticipated expiration: 2035-08-06
Also published as: CN105094949A

Abstract

The invention belongs to the technical field of processor software simulation and specifically relates to a simulation method and system based on an instruction computation model and feedback compensation. The simulation process is organized around the instruction sequence: the timing information of each instruction is calculated one by one, and the simulator obtains by calculation the issue cycle and completion cycle of each instruction as it flows through the functional units. The calculation is accelerated with multi-threaded parallelism, and accesses to shared resources are handled by speculative execution. During speculative execution, each private resource module sends its access information to the shared resource module, where a timing correction algorithm computes the actual timing. If the speculative timing differs from the timing computed for the actual shared resource, the accumulated error is calculated and, after the program simulation ends, fed back as compensation to the processor core that produced it. The invention can be used for architecture simulation in hardware development and for evaluating applications and systems. It obtains simulation results for each application under the target architecture quickly and accurately, which supports fast and accurate evaluation of applications and systems.

Description

A simulation method and system based on an instruction computation model and feedback compensation

Technical field

The invention belongs to the technical field of processor software simulation, and in particular relates to a method and system for simulating the operation of a processor.

Background

Humanity has entered the era of information explosion, and information is ever more closely tied to daily life. As information technology develops, new kinds of electronic products keep appearing and the number of devices per person keeps rising. Everyday electronics have moved from simple mobile phones and MP3 players to more capable MP4 players, e-book readers, smartphones, and smart wearable devices. According to the latest IDC data at the time of filing, China's smartphone shipments in 2014 were expected to grow 13% year on year, reaching 420 million units. With the rapid development of network technology, new applications and application models have proliferated, such as cloud computing, intelligent computing, graph computing, mobile computing, and wearable computing.

As applications multiply, they place ever higher demands on the underlying processing platforms, which has led to a growing variety of processor products. This also means that processor products face increasing competitive pressure from all sides, and the time available for their hardware and software development keeps shrinking. Because of intensifying competition, the design and development cycle of a hardware chip product is typically 8 months to 1 year. The shorter the development cycle, the more competitive the product and the greater the profit margin. However, statistics from Intel's SoC design department show that a typical processor product takes about 9 months from the start of design to the completion of production, and the development cycle of the corresponding software is also 9 months. Against this background, hardware engineers are expected to complete chip design and evaluation as quickly as possible, and software designers and developers need a usable simulation and test platform early in the chip design process (for example, within 3 months).

To make architecture design and development more efficient and less costly, most current work uses simulators that model the operation of the whole architecture in software. Simulators are widely used in industry and research as tools for architecture research and system evaluation. They improve the efficiency of hardware and architecture development, shorten the development cycle, make related research more efficient, and cut unnecessary waiting time.

To keep simulation results accurate, existing simulation models usually simulate each functional unit in a clock-driven manner. This tightly coupled approach guarantees accuracy but creates a serious performance problem for the simulation model. Many studies have therefore addressed simulator acceleration. By the way they achieve speedup, simulation acceleration techniques can be roughly divided into FPGA-based acceleration, sampling, and parallel acceleration.

FPGA-based acceleration: This technique raises simulator speed by exploiting the efficiency and parallelism of FPGAs and similar hardware platforms, and its simulation speed can generally reach the order of 10 MIPS. Because it can be 10 to 100 times faster than a software simulator, it has been widely adopted by a number of well-known universities and research institutions abroad. However, adjusting the parameters of an FPGA-based simulator is complicated, since the FPGA netlist must be regenerated after every parameter change. Debugging the system during netlist generation and mapping to the FPGA is also very complex, which limits the usability of this approach. In addition, although an FPGA simulates in hardware, the clock frequency of current FPGA systems is more than an order of magnitude lower than that of mainstream general-purpose processors, so the low clock frequency also limits how much simulation speed this technique can gain.

Sampling acceleration: Sampling is another widely used simulator acceleration technique. Its principle is to infer the characteristics of the whole test program from the characteristics of a subset of its simulated instructions. The key to accurate results is ensuring that the selected subset adequately represents the test program as a whole. Sampling can raise simulation speed, but it does so at the expense of accuracy. Moreover, because parallel programs are complex, sampling yields only limited speedup for multi-core simulators: typical multi-core sampling simulators currently reach only a few MIPS.

Parallel simulation: As multi-core hardware platforms have become widespread, the host hardware offers more computing resources, and how to use as many of them as possible to speed up simulators has drawn growing attention. In existing research on simulator parallelism, one approach is to parallelize an existing multi-core simulator by hand. This can yield relatively good speedup but is very tedious. Another approach is to design a new programming model for simulators, write the simulator against that model, and then partition it for parallel execution automatically. Although this provides automatic parallelization, it depends on a specific programming model and so cannot be applied to the many widely used simulators developed on a serial model. Parallel simulation can improve performance, but it also faces a major challenge. Multi-core processors usually contain hardware resources shared between cores, such as a shared cache or an on-chip network. To keep the results accurate, the parallel simulation threads must synchronize frequently, which leads to poor parallel scalability. Some parallel methods improve performance by reducing synchronization, but at the cost of simulation accuracy.

Because none of these solutions can simulate quickly while preserving accuracy, the present invention proposes a method for building a simulation platform based on an instruction computation simulation model and speculative execution, which improves the performance and usability of processor simulation systems. The invention can run on any computer with a corresponding existing system, and it can accurately and quickly simulate the different hardware structures, systems, and applications it supports. Applying the system improves the efficiency of processor design and the corresponding software development and lowers development cost. It allows software development to start much earlier, effectively shortens the development cycle of computer hardware and software products, and is therefore significant for the computer hardware and software industry. The system can also be widely used in research to make processor design and related hardware and software research more efficient.

Summary of the invention

The purpose of the present invention is to provide a processor software simulation method and system that improve the development efficiency of processors and their corresponding software, reduce development cost, and shorten the development cycle.

The processor software simulation method provided by the present invention is based on an instruction computation model and includes the following:

First, the simulation process is organized around the instruction sequence. Instead of updating the state of functional modules every clock cycle as a traditional simulator does, the method calculates the timing information of each instruction one by one. The simulator obtains the issue cycle and completion cycle of each instruction flowing through the functional units by calculation, rather than by simulating what each processor component does in every cycle.

Second, the calculation is accelerated with multi-threaded parallelism, and accesses to shared resources are performed speculatively. Each private resource maintains its own copy of the shared resource. When the system needs to access a shared resource, it first estimates the actual access cycles from the shared resource information it maintains and uses that estimate in its own subsequent calculation, which reduces synchronization and improves performance. To keep the timing of shared resource accesses accurate during speculative execution, each private resource module sends its access information to the shared resource module, where a global timing correction algorithm calculates the correct execution timing. If the speculative timing differs from the timing calculated for the actual shared resource, the timing correction algorithm accumulates the error and, after the program simulation ends, feeds the accumulated error back as compensation to the processor core that produced it. This feedback mechanism removes the error that arises when private cores interleave their accesses to shared resources.
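The compensation step can be stated compactly. The symbols below are introduced here for illustration and do not appear in the patent: for core $c$, let $t^{\mathrm{spec}}_{c,i}$ be the latency the core assumed for its $i$-th shared access, $t^{\mathrm{act}}_{c,i}$ the latency later calculated by the shared resource module, $N_c$ the number of shared accesses, and $T^{\mathrm{spec}}_c$ the core's cycle count at the end of the speculative run.

```latex
E_c = \sum_{i=1}^{N_c} \left( t^{\mathrm{act}}_{c,i} - t^{\mathrm{spec}}_{c,i} \right),
\qquad
T_c = T^{\mathrm{spec}}_c + E_c
```

Under this reading, $E_c$ is the accumulated error for core $c$ and $T_c$ is its compensated cycle count. For example, if a core finishes with $T^{\mathrm{spec}}_c = 200$ and one access assumed to take 100 cycles actually took 10, then $E_c = -90$ and $T_c = 110$.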

The specific flow of the method is as follows. The system runs the target system image and application under simulation by binary translation and extracts the instruction stream and memory access information. From this information it calculates the timing of each instruction during execution. Where shared resources are accessed, it reduces synchronization by speculative execution against a virtual model of the shared resource, collects the shared resource access information, performs the correct shared resource timing calculation in one place, and accumulates the corrections. Feedback compensation is applied after system execution ends, and the system finally returns the timing results and microarchitecture information.

In a traditional simulator, the simulation is driven by what the processor components do in each clock cycle. For every processor core of the target architecture, the simulator must maintain the state of that core's components together with the corresponding input and output information. In each simulated clock cycle, every component on every core checks its input information block to determine which instruction it will execute, simulates the behavior of that instruction in the component, updates the component state, and stores the result in the output information block. This approach has several drawbacks. First, each core must maintain a large number of information blocks, and every component must access its blocks in every clock cycle to determine its input instruction, its state update, and the execution status of its output instruction. This generates many memory accesses, reduces consecutive reuse of data in memory, and increases the clock cycles needed per access, which hurts performance. Second, because each component decides what to do from the information of its input instruction, these checks are input dependent and cause many mispredicted branches in the simulator code, which lowers simulation efficiency. Third, because information moves between the components of a processor in every clock cycle, the components are tightly coupled, which makes it hard to use a multi-core platform to accelerate the simulation. Because of these drawbacks, the traditional approach severely limits simulator performance and cannot effectively raise execution speed.

The main advantage of the method is that it obtains the timing of the simulated instruction stream by calculating the clock cycles of each instruction. This replaces the traditional approach, in which the simulator updates the state of the processor components every clock cycle and derives the instruction stream timing from that, and it greatly improves simulation efficiency.

Once an instruction has been decoded, its execution path and the components it involves are known, so the number of clock cycles required by each stage of its execution can be determined by calculation. An ideal processor would only need to consider the component resources the instruction requires, but a real execution must account for several further factors. The first is data dependence: the simulation may encounter register dependences, memory access queues, and similar constraints. The second is resource contention, covering functional units, issue width, the instruction window, the reorder buffer, and other resources. Branch prediction outcomes also affect instruction timing. By taking these conditions into account for each instruction, the timing can be calculated, and the timing changes over the whole execution can be reproduced accurately by calculation, which greatly improves simulation efficiency.
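A minimal sketch of this kind of per-instruction calculation is shown below. It is not the patent's algorithm: the class names, the non-pipelined functional units, the fixed misprediction penalty, and the omission of the instruction window and reorder buffer are all simplifying assumptions made for illustration. It only shows how an issue cycle and a completion cycle can be computed from operand readiness, functional unit availability, issue width, and a branch misprediction, without stepping through individual clock cycles.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class Instr:
    dst: Optional[str]          # destination register, if any
    srcs: Tuple[str, ...]       # source registers
    unit: str                   # functional unit class, e.g. "alu" or "mul"
    latency: int                # execution latency in cycles
    mispredicted: bool = False  # True for a branch whose prediction failed


@dataclass
class CoreTiming:
    issue_width: int = 2          # assumed maximum instructions issued per cycle
    mispredict_penalty: int = 10  # assumed refill penalty in cycles
    reg_ready: dict = field(default_factory=dict)        # register -> cycle value is ready
    unit_free: dict = field(default_factory=dict)        # unit -> cycle it becomes free
    issued_in_cycle: dict = field(default_factory=dict)  # cycle -> instructions issued
    fetch_ready: int = 0          # earliest cycle the front end can supply instructions

    def schedule(self, ins: Instr) -> Tuple[int, int]:
        """Return (issue cycle, completion cycle) and update resource state."""
        # Earliest cycle allowed by the front end, the functional unit, and operands.
        cycle = max([self.fetch_ready, self.unit_free.get(ins.unit, 0)]
                    + [self.reg_ready.get(r, 0) for r in ins.srcs])
        # Respect the issue width: move on if this cycle is already full.
        while self.issued_in_cycle.get(cycle, 0) >= self.issue_width:
            cycle += 1
        self.issued_in_cycle[cycle] = self.issued_in_cycle.get(cycle, 0) + 1
        done = cycle + ins.latency
        self.unit_free[ins.unit] = done  # assumption: units are not pipelined
        if ins.dst is not None:
            self.reg_ready[ins.dst] = done
        if ins.mispredicted:
            # Later instructions cannot issue until the pipeline has refilled.
            self.fetch_ready = done + self.mispredict_penalty
        return cycle, done


if __name__ == "__main__":
    core = CoreTiming()
    stream = [
        Instr("r1", (), "alu", 1),
        Instr("r2", ("r1",), "mul", 3),          # waits for r1
        Instr(None, ("r2",), "alu", 1, True),    # mispredicted branch
        Instr("r4", (), "alu", 1),               # delayed by the penalty
    ]
    for ins in stream:
        print(ins.unit, core.schedule(ins))
```

Each call does a constant amount of work per instruction regardless of how many cycles the instruction waits, which is the property the text relies on for speed.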

For branch instructions, the calculation-based approach can skip the computation of wrong-path instructions, which avoids unnecessary overhead. Because the calculation follows the instruction stream, under out-of-order execution the order in which instructions are processed may differ from the actual execution order. The method therefore records the most recently calculated instructions in a timing instruction window. If the occupancy of functional units or other resources by different instructions is found to overlap in a way that violates instruction order, the calculation is redone starting from the conflicting instruction. A similar situation arises with cache replacement: the method records the clock cycle at which each cache line was accessed and, when replacing, preferentially replaces the cache line with the later clock cycle.

In a traditional simulator, shared resources and private resources are tightly coupled, so when the simulator is accelerated on a multi-core platform, a large number of synchronization operations are needed to keep the timing correct, which greatly limits the speedup. The present method also uses a multi-core platform for acceleration, but when simulating accesses to shared resources it removes the tight coupling between private and shared resources. In the private resource part, each core virtually maintains its own view of shared resource accesses, so that private resource accesses can be speculatively executed in a reasonable way. When a private resource module determines that it needs to access a shared resource, it first estimates from the information it maintains how many clock cycles the access will take and continues its subsequent calculation with that estimate. It then passes the relevant information to the shared resource module for unified calculation. The shared module receives the shared resource accesses of all simulated cores and calculates the correct access timing. If the result calculated by the shared resource module differs from the private module's estimate, the estimate was wrong. To avoid interrupting the simulation of the individual processor cores, the shared resource module only accumulates the number of clock cycles to be compensated for each core and feeds the accumulated result back as compensation when system execution ends. This improves simulation performance and also makes the overall structure more modular and easier to optimize.
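The speculate-then-compensate scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the shared resource is reduced to a set of cache lines with two hypothetical latencies (`HIT`, `MISS`), each core's local view is a plain set, and the shared module replays access records in order of their speculative issue cycle. The patent does not specify the replay order or the data structures, and all names are invented here.

```python
from collections import defaultdict

HIT, MISS = 10, 100  # hypothetical shared-cache latencies, in cycles


class PrivateCore:
    """Per-core model that guesses shared-resource latency from a local view."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.local_view = set()  # lines this core believes are in the shared cache
        self.cycles = 0          # speculative cycle count
        self.outbox = []         # access records destined for the shared module

    def access_shared(self, line):
        guess = HIT if line in self.local_view else MISS
        self.local_view.add(line)
        # Record (issue cycle, core, line, guess) and carry on without waiting.
        self.outbox.append((self.cycles, self.core_id, line, guess))
        self.cycles += guess


class SharedModule:
    """Computes actual latencies and accumulates each core's error."""

    def __init__(self):
        self.lines = set()             # real contents of the shared cache
        self.error = defaultdict(int)  # core id -> cycles to compensate

    def process(self, records):
        # Assumption: replay accesses ordered by speculative issue cycle.
        for _, core_id, line, guess in sorted(records):
            actual = HIT if line in self.lines else MISS
            self.lines.add(line)
            self.error[core_id] += actual - guess


def simulate(traces):
    """Run every core speculatively, then apply feedback compensation once."""
    cores = [PrivateCore(i) for i in range(len(traces))]
    for core, trace in zip(cores, traces):
        for line in trace:
            core.access_shared(line)
    shared = SharedModule()
    shared.process([rec for core in cores for rec in core.outbox])
    return {c.core_id: c.cycles + shared.error[c.core_id] for c in cores}


if __name__ == "__main__":
    # Core 1 assumes line "A" misses, but core 0 has already brought it in.
    print(simulate([["A", "B"], ["C", "A"]]))
```

In this run core 1 overestimates one access by 90 cycles, and the error is subtracted only once, after both cores have finished, so neither core is interrupted during its own calculation.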

The simulation system based on the above method is divided into three parts: a functional simulation subsystem, a timing instruction calculation simulation subsystem, and a timing correction subsystem. The functional simulation subsystem generates the instruction sequence: it runs the target system image and application under simulation by binary translation and extracts the instruction stream and memory access information. The timing instruction calculation simulation subsystem calculates the timing information of the instruction stream. The timing correction subsystem performs the timing correction calculation and is responsible for compensating and correcting the errors caused by speculative execution. Using the instruction stream and memory access information extracted by the functional simulation subsystem, the timing instruction calculation simulation subsystem calculates the clock cycles at which each instruction flows through the various functional units and its timing at completion. Where shared resources are involved, it estimates the required timing by speculative execution. It transmits the shared resource access information to the timing correction subsystem, which performs the correct shared resource timing calculation in one place, compares the speculative timing with the correct timing, and accumulates the timing error. Feedback compensation is applied after system execution ends, and the system finally returns the timing results and microarchitecture information.
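The division into three subsystems can be pictured as a pipeline of threads connected by queues. The sketch below is only a structural illustration: the functional stage replays a prepared list instead of performing binary translation, the latencies (1, 10, 100 cycles) are invented, and the function names are hypothetical.

```python
import queue
import threading


def functional(program, out_q):
    """Stand-in for the functional subsystem: emits (opcode, address or None)."""
    for rec in program:
        out_q.put(rec)
    out_q.put(None)  # end-of-stream marker


def timing(in_q, shared_q, result):
    """Timing subsystem: calculates cycles and guesses shared-access latency."""
    cycles = 0
    while (rec := in_q.get()) is not None:
        _, addr = rec
        if addr is None:
            cycles += 1            # assumed cost of a non-memory instruction
        else:
            guess = 10             # speculate that the shared access hits
            cycles += guess
            shared_q.put((addr, guess))
    shared_q.put(None)
    result["speculative_cycles"] = cycles


def correction(shared_q, result):
    """Timing correction subsystem: computes actual latency, accumulates error."""
    seen, error = set(), 0
    while (rec := shared_q.get()) is not None:
        addr, guess = rec
        actual = 10 if addr in seen else 100  # first touch is a miss
        seen.add(addr)
        error += actual - guess
    result["accumulated_error"] = error


def run(program):
    q1, q2, result = queue.Queue(), queue.Queue(), {}
    threads = [
        threading.Thread(target=functional, args=(program, q1)),
        threading.Thread(target=timing, args=(q1, q2, result)),
        threading.Thread(target=correction, args=(q2, result)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Feedback compensation is applied once, after all stages have finished.
    result["final_cycles"] = result["speculative_cycles"] + result["accumulated_error"]
    return result


if __name__ == "__main__":
    prog = [("add", None), ("load", 0x40), ("load", 0x40), ("add", None)]
    print(run(prog))
```

The timing stage never waits on the correction stage, so the only coupling between them is the one-way queue of access records.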

A simulation system built with this method can be installed on any computer with a corresponding existing system.

A method of invoking the instruction-centric simulation system, comprising the following steps:

(1) When a system architecture or an application needs to be tested, the user specifies an operating system image and an application, and boots the system from the corresponding target operating system image;

(2) The application to be executed is selected. The functional simulation subsystem executes the target program under the target system by binary translation and collects instruction stream information and memory access information;

(3) The timing simulation subsystem takes the instruction stream and, from the state and timing of the functional units, calculates the cycles at which the current instruction flows through the functional units and its completion cycle. It then updates the state and timing of the functional units so that subsequent instructions are calculated correctly;

(4) For the simulation of shared resources, the tight coupling between private and shared resources is removed. In the private resource part, each core virtually maintains its own view of shared resource accesses, so that private resource accesses can be speculatively executed in a reasonable way. The access instructions and the speculative results are passed to the timing correction subsystem;

(5) The shared module receives the shared resource accesses of all simulated cores and calculates the correct access timing. If the result calculated by the shared resource module differs from the private module's estimate, the estimate was wrong. To avoid interrupting the simulation of the individual processor cores, the shared resource module only accumulates the number of clock cycles to be compensated for each core and feeds the accumulated result back as compensation when system execution ends;

(6) When system execution finishes, the timing calculation results and microarchitecture information are returned.

Preferably, functional simulation and timing calculation are performed in a full-system, multi-core parallel manner, so that the results include the impact of the system on application execution.

Preferably, state compression is used to reduce the amount of information transmitted between subsystems and improve overall execution efficiency.
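The text does not say how state is compressed. One simple possibility, shown purely as an assumption, is to send each shared-access record in a fixed-width binary layout rather than as a general-purpose object. The 7-byte record format and the field choices below are hypothetical.

```python
import struct

# Hypothetical record layout, little-endian with no padding:
# core id (1 byte), cache line address (4 bytes), guessed latency (2 bytes).
RECORD = struct.Struct("<BIH")


def pack_records(records):
    """Encode a batch of (core_id, line_addr, guess) tuples as bytes."""
    return b"".join(RECORD.pack(*rec) for rec in records)


def unpack_records(blob):
    """Decode bytes produced by pack_records back into tuples."""
    return [RECORD.unpack_from(blob, off)
            for off in range(0, len(blob), RECORD.size)]


if __name__ == "__main__":
    batch = [(0, 0x1000, 10), (3, 0xFFFF0040, 100)]
    blob = pack_records(batch)
    print(len(blob), unpack_records(blob) == batch)
```

Batching records this way also means the subsystems exchange one buffer per batch instead of one message per access, which is one plausible way to lower the transfer volume the text refers to.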

The beneficial effects of the method are as follows. It makes it possible to build an instruction-centric simulation system that accelerates the simulation and evaluation of architectures, makes the testing and evaluation of architecture software and hardware more efficient, shortens the development cycle, and meets the demand for ever faster software and hardware development.

The system is highly usable and easy to adopt and promote. Specifically: (1) the simulation system can run on any computer with a corresponding system; (2) the simulation method is logically clear and easy to understand, so architecture development can begin quickly; (3) the simulation system is loosely coupled, which makes it easy to improve performance in various ways and further shorten the development cycle; (4) applications are easy to deploy, so new applications can be run quickly and evaluated accurately; (5) the development cycle of the architecture and its software and hardware can be shortened, further improving product competitiveness.

Brief description of the drawings

Fig. 1 shows the essential difference between the present invention and the traditional simulation approach.

Fig. 2 is a schematic diagram of the structure of the speculative execution model.

Fig. 3 is a structural schematic diagram of the present invention.

Detailed description

The present invention is described in further detail below with reference to the drawings and embodiments. It should first be noted that the terms and words used in this specification and the claims are not to be interpreted as limited to their ordinary or dictionary meanings. They should be interpreted as meanings and concepts consistent with the technical idea of the invention, on the principle that the inventor may define terms appropriately in order to describe the invention in the best way. Accordingly, the embodiments described in this specification and the structures shown in the drawings are only among the preferred embodiments of the invention and do not represent its technical idea in full, so it should be understood that various equivalents and modifications that could replace them may exist.

Embodiment 1: a multi-core full-system simulator based on instruction computation simulation.

In this embodiment, the simulation system is deployed on a local multi-core development machine. The full-system functional simulation subsystem can be configured with a single thread or multiple threads; it performs functional simulation and collects the instruction stream and memory access information. The timing simulation subsystem uses a number of threads corresponding to that of the functional simulator. The timing correction subsystem uses a separate thread to simulate the execution of shared resource accesses and perform timing correction. The main workflow of the system is as follows:

(1) The user boots the system with the corresponding target operating system image;

(6) After system execution ends, the timing calculation results and microarchitecture information are returned, including the impact of the system on the application, so as to guide further development.

Steps (2) and (3) above can allocate resources according to the number of physical cores of the development machine, which makes effective use of physical resources and further improves simulation efficiency.
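The allocation of threads to the subsystems can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `plan_threads`, its parameters, and the even split between functional and timing threads are hypothetical. The only constraints taken from the description are that the timing simulation uses as many threads as the functional simulation and that timing correction runs on one separate thread. Note that Python's `os.cpu_count()` reports logical CPUs, so it is used here only as a stand-in for the physical core count.

```python
import os


def plan_threads(guest_cores, host_cpus=None):
    """Split host CPUs among the three subsystems (hypothetical policy).

    guest_cores: number of simulated processor cores.
    host_cpus:   CPUs available on the development machine; defaults to
                 os.cpu_count(), which counts logical rather than physical CPUs.
    """
    if host_cpus is None:
        host_cpus = os.cpu_count() or 1
    # One thread is reserved for the timing correction subsystem.
    # The remainder is shared equally, because the timing subsystem uses
    # the same number of threads as the functional subsystem.
    per_subsystem = max(1, min(guest_cores, (host_cpus - 1) // 2))
    return {
        "functional": per_subsystem,
        "timing": per_subsystem,
        "correction": 1,
    }
```

For example, under this policy a 4-core target simulated on an 8-CPU host would get 3 functional threads, 3 timing threads, and 1 correction thread.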

Embodiment 2: a user-mode simulator based on instruction calculation simulation.

In this embodiment, the simulation system simulates only user-mode applications. The developer can specify an operating system image, or place the application executable into the operating system image provided by the simulation system. The main workflow of the system is as follows:

(6) After system execution ends, the timing calculation results and microarchitecture information are returned; from this information the developer can assess how the user-mode application executed, which guides further development.

It should be noted that although the present invention has been described and illustrated with reference to specific embodiments, this does not mean that the invention is limited to those embodiments. Those skilled in the art can derive many different variants from them, such as deploying the system in the cloud, all of which fall within the true spirit and scope of the claims of the present invention.

Claims

1. A simulation method based on an instruction calculation model and feedback compensation, characterized in that:

First, the simulation process is organized around the instruction sequence: a simulation method that calculates the timing information of instructions one by one replaces the traditional simulator's method of updating the state of the functional modules every clock cycle; the simulator obtains, by calculation, the issue cycle and completion cycle of each instruction flowing through the functional components, rather than by simulating the execution of the processor components at every cycle;

Second, multi-threaded parallelism is used to accelerate the calculation process, and the processor accesses shared resources by speculative execution; that is, the private resources maintain a backup of the shared resources, and when the system needs to access a shared resource it first guesses the actual access cycle based on that shared resource information and uses the guess for its own subsequent calculations, thereby reducing synchronization operations and improving performance; during speculative execution, to ensure the accuracy of the timing information for shared resource accesses, each private resource module sends its access information to the shared resource module, where a global timing correction algorithm calculates the correct execution timing; if the speculative execution timing is inconsistent with the timing actually calculated for the shared resource, the global timing correction algorithm calculates the accumulated error and, after the program simulation ends, feeds the accumulated error back as compensation to the processor core that produced the error;

The specific process is: the system simulates running the target system image and application through binary translation and extracts the instruction stream and memory access information; based on the extracted instruction stream and memory access information, it calculates the timing information of the instructions during execution; for accesses to shared resources, it reduces synchronization operations through speculative execution based on a virtual shared resource calculation simulation, collects the shared resource access information, uniformly performs the correct shared resource timing calculation and accumulates the correction results, and performs feedback compensation after system execution ends; when system execution ends, the timing results and microarchitecture information are returned.
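The instruction-by-instruction timing calculation recited in claim 1 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Instr` record, the `compute_timing` function, the single non-pipelined unit per unit name, and the dependency list are hypothetical and are not taken from the patent. The point shown is only that the issue and completion cycles are computed directly from the recorded availability of the functional units, with no per-cycle stepping of the processor model.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    unit: str          # functional unit the instruction flows through
    latency: int       # cycles the unit is occupied by this instruction
    deps: tuple = ()   # indices of earlier instructions whose results are needed


def compute_timing(instrs):
    """Return a list of (issue_cycle, completion_cycle), one per instruction.

    Instructions are processed one by one in program order; no loop over
    clock cycles is involved.
    """
    unit_free = {}   # unit name -> first cycle at which the unit is free
    timing = []
    for ins in instrs:
        # Operands are ready once all producer instructions have completed.
        ready = max((timing[d][1] for d in ins.deps), default=0)
        # The instruction issues when operands are ready and the unit is free.
        issue = max(ready, unit_free.get(ins.unit, 0))
        done = issue + ins.latency
        # Simplifying assumption: the unit is not pipelined, so it stays
        # busy until this instruction completes.
        unit_free[ins.unit] = done
        timing.append((issue, done))
    return timing


trace = [
    Instr("alu", 1),
    Instr("mul", 3),
    Instr("alu", 1, deps=(1,)),   # waits for the multiply
    Instr("alu", 1),              # waits for the ALU to become free
]
print(compute_timing(trace))      # [(0, 1), (0, 3), (3, 4), (4, 5)]
```

Updating `unit_free` after each instruction plays the role of updating the state and timing of the functional component so that later instructions are calculated against the correct state.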

2. A simulation system based on the simulation method of claim 1, characterized in that it comprises a functional simulation subsystem, a timing instruction calculation simulation subsystem, and a timing correction subsystem, wherein: the timing correction subsystem corresponds to the timing correction calculation and is responsible for compensating and correcting the errors caused by speculative execution during simulation; the functional simulation subsystem corresponds to the generation of the instruction sequence during simulation, simulates running the target system image and application through binary translation, and extracts the instruction stream and memory access information; the timing instruction calculation simulation subsystem corresponds to the calculation of instruction stream timing information during simulation: based on the instruction stream and memory access information extracted by the functional simulation subsystem, it calculates timing information including the clock cycle at which each instruction flows through each functional component and the cycle at which it completes, and, for calculations involving shared resources, speculatively determines the timing required for the shared resource access; the timing instruction calculation simulation subsystem passes the shared resource access information to the timing correction subsystem, which uniformly performs the correct shared resource timing calculation, compares the speculative execution timing with the correct timing and accumulates the timing error, performs feedback compensation after system execution ends, and returns the timing results and microarchitecture information when system execution ends.

3. A trigger method based on the simulation system of claim 2, characterized in that the specific steps are as follows:

(1) When a system architecture or an application needs to be tested, the user specifies an operating system image and an application;

(2) The application to be executed is selected; the functional simulation subsystem simulates execution of the target program on the target system through binary translation and collects instruction stream information and memory access information;

(3) The timing simulation subsystem obtains the instruction stream, calculates the cycle at which the current instruction flows through each functional component and its completion cycle according to the state and timing of the functional components, and updates the state and timing of the functional components so that the subsequent instruction stream is calculated correctly;

(4) The simulation involving shared resources removes the tight coupling between private resources and shared resources: in the private resource part, each core virtually maintains its accesses to shared resources, so that private resource accesses can be reasonably executed speculatively, and the access instructions and speculative execution results are passed to the timing correction subsystem;

(5) The shared module receives all simulated and checked shared resource accesses and calculates the correct shared resource access timing; when the result calculated for the shared resource is inconsistent with the result guessed by the private resource, the private resource has guessed wrongly; to avoid interrupting the simulation of each processor core, the shared resource module only accumulates the number of clock cycles to be compensated for each core and feeds the accumulated result back as compensation when system execution ends;

(6) When system execution ends, the timing calculation results and microarchitecture information are returned.
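Steps (4) to (6) can be sketched as follows, as a minimal single-threaded illustration under stated assumptions rather than the patented implementation. The names `SharedResourceModel`, `correct`, and `simulate` are hypothetical, as are the modeling choices: the shared resource has one port with a fixed service time, each core always guesses the same latency, and accesses are corrected in the order given. A real system would run the cores and the correction subsystem on separate threads.

```python
from collections import defaultdict


class SharedResourceModel:
    """Hypothetical shared module: a single port with a fixed service time."""

    def __init__(self, service_cycles):
        self.service_cycles = service_cycles
        self.busy_until = 0                    # first cycle the port is free
        self.compensation = defaultdict(int)   # core -> accumulated error

    def correct(self, core, request_cycle, guessed_latency):
        """Compute the actual latency and accumulate the guess error."""
        start = max(request_cycle, self.busy_until)
        self.busy_until = start + self.service_cycles
        actual_latency = self.busy_until - request_cycle
        # The core is not interrupted; the error is only accumulated.
        self.compensation[core] += actual_latency - guessed_latency
        return actual_latency


def simulate(accesses, service_cycles, guessed_latency):
    """accesses: list of (core, request_cycle) in the order they are corrected.

    Returns the final cycle count of each core after feedback compensation.
    """
    shared = SharedResourceModel(service_cycles)
    core_cycles = defaultdict(int)
    for core, request_cycle in accesses:
        # Speculative execution: the core continues with its guessed latency.
        core_cycles[core] = max(core_cycles[core], request_cycle + guessed_latency)
        # The access and its guess are passed on for timing correction.
        shared.correct(core, request_cycle, guessed_latency)
    # Feedback compensation is applied once, after the simulation has ended.
    return {c: core_cycles[c] + shared.compensation[c] for c in core_cycles}


# Cores 0 and 1 both access the shared resource at cycle 10; core 1 has to
# wait behind core 0, so its guess of 4 cycles is 4 cycles too short.
print(simulate([(0, 10), (1, 10), (0, 30)], service_cycles=4, guessed_latency=4))
# {0: 34, 1: 18}
```

Deferring the correction to the end keeps the per-core simulation free of synchronization, at the cost that the timing seen by a core during the run may temporarily differ from the corrected timing.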