
CN114968576A - a processor - Google Patents


Info

Publication number
CN114968576A
CN114968576A
Authority
CN
China
Prior art keywords
processor
instruction
state
cache
host
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210578451.8A
Other languages
Chinese (zh)
Inventor
王剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202210578451.8A
Publication of CN114968576A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877Cache access modes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides a processor that can run in either a host state or a virtual guest state according to application requirements, the host state and the virtual guest state corresponding to different instruction sets. The processor includes: a host level-one (L1) instruction cache for caching instructions of the host-state instruction set that can be executed by the processor running in the host state; a guest L1 instruction cache for caching instructions of the virtual-guest-state instruction set that can be executed by the processor running in the virtual guest state; and a selection circuit for connecting either the host L1 instruction cache or the guest L1 instruction cache to the corresponding components of the processor according to the processor's operating state, so that instructions are fetched from the host L1 instruction cache when the processor runs in the host state and from the guest L1 instruction cache when it runs in the virtual guest state.

Figure 202210578451

Description

A Processor

Technical Field

The present invention relates to the field of processor design, in particular to processor instruction-set architectures, and more specifically to a processor that supports execution across instruction sets, i.e., a processor with dual instruction caches.

Background Art

As the computing and control core of a computer system, the processor is the final execution unit for information processing and program execution. As shown in Fig. 1, existing processors have only one level-one instruction cache (L1 IC) and one level-one data cache (L1 DC): the processor's instruction-fetch unit reads instructions from the instruction cache and hands them to the execution unit, which reads/writes the data cache as needed while executing instructions. A processor is designed to implement one specific instruction set efficiently, such as the X86, ARM, MIPS, or RISC-V instruction set. Such a design can efficiently execute code of the designated instruction set, but if code of other instruction sets must also be executed, software binary translation is the only option. The translation itself, from one instruction set to another, is not very expensive; it essentially replaces hardware decoding with software decoding, and the Java virtual machine, for example, uses binary translation with little performance impact. However, compared with a Java virtual machine, binary translation between different instruction sets incurs a series of overheads that are difficult to eliminate in software, mainly because the translation module and the execution module share the same address space and runtime environment: dynamic relocation of indirect branch targets, context save/restore when entering and leaving the translation module, stitching together traces of translated basic blocks, address-space overlap between the translation module and the translated code, and so on. As a result, binary translation of code between different instruction sets performs relatively poorly.

Summary of the Invention

Therefore, the object of the present invention is to overcome the above defects of the prior art and to provide a processor that supports software translation while avoiding problems such as dynamic relocation of indirect branch targets, context save/restore when entering and leaving the translation module, stitching of translated basic-block traces, and address-space overlap between the translation module and the translated code.

According to a first aspect of the present invention, a processor is provided that can run in a host state or a virtual guest state according to application requirements, the host state and the virtual guest state corresponding to different instruction sets. The processor includes: a host level-one (L1) instruction cache for caching instructions of the host-state instruction set that can be executed by the processor running in the host state; a guest L1 instruction cache for caching instructions of the virtual-guest-state instruction set that can be executed by the processor running in the virtual guest state; and a selection circuit for connecting either the host L1 instruction cache or the guest L1 instruction cache to the corresponding components of the processor according to the processor's operating state, so that instructions are fetched from the host L1 instruction cache when the processor runs in the host state and from the guest L1 instruction cache when it runs in the virtual guest state.

Preferably, a binary translation program runs in the host state. The binary translation program can be executed by the processor running in the host state to translate a source program of the virtual-guest-state instruction set into code executable by the processor running in the virtual guest state and to store the translated code in the guest L1 instruction cache. The guest L1 instruction cache is configured to support reads and writes by the processor's execution unit when the processor runs in the host state, as well as read/write accesses from outside the processor.

In some embodiments of the present invention, the guest L1 instruction cache is configured as a multi-way set-associative cache structure in which each cache line includes: a tag field indicating the source program counter value, before translation, of the virtual-guest source program corresponding to the cache line; a continuation field indicating whether the translated virtual-guest code has a continuation line; a code-length field storing the source-code length covered by the cache line; and a translated instruction code field.

Preferably, when running in the virtual guest state, the processor is configured to fetch instructions from the guest L1 instruction cache as follows: the source program counter value of the virtual guest state is used to look up a cache line in the guest L1 instruction cache; the instructions of the cache line whose tag field matches the source program counter value are returned; and the source program counter value of the next fetch is computed from the source program counter value indicated by the tag field and the code-length field of the cache line that completes the fetch. In some embodiments of the present invention, the processor is further configured to determine, according to the continuation field, whether to continue looking for a continuation line in the adjacent next set of cache lines.

Preferably, the source program counter value of the next fetch is the sum of the source program counter value indicated by the tag field and the value of the code-length field of the cache line that completes the current fetch.

In some embodiments of the present invention, the processor is configured to: when a guest L1 instruction cache miss raises an interrupt, run in the host state to respond to the interrupt, and invoke the binary translation program to complete the outstanding translation of the source program of the virtual-guest-state instruction set and write the translated code into the guest L1 instruction cache.

According to a second aspect of the present invention, a method is provided for running a dual-instruction-set system based on the processor of the first aspect of the present invention. The method includes: in response to an application requirement, running the processor in the state corresponding to that requirement, where the processor can run in a host state or a virtual guest state, the host state and the virtual guest state corresponding to different instruction sets; and, based on the state corresponding to the application requirement, connecting either the host L1 instruction cache or the guest L1 instruction cache to the corresponding components of the processor, so that instructions are fetched from the host L1 instruction cache when the processor runs in the host state and from the guest L1 instruction cache when it runs in the virtual guest state.

Compared with the prior art, the present invention has the following advantage: to address the insufficient performance of existing processors when they rely on binary translation to run programs of other instruction sets, the present invention proposes a dual-instruction-cache processor design that can greatly improve the performance of binary translation for other instruction sets. As the embodiments below show, by adding a small amount of hardware support the present invention enables the processor to execute code of other instruction sets efficiently, solving the problem of efficient software compatibility across instruction sets.

Brief Description of the Drawings

Embodiments of the present invention are further described below with reference to the accompanying drawings, in which:

Fig. 1 is a simplified structural diagram of a prior-art processor;

Fig. 2 is a simplified structural diagram of a dual-instruction-cache processor according to an embodiment of the present invention;

Fig. 3 is a simplified diagram of the cache-line data structure in the guest L1 instruction cache of a dual-instruction-cache processor according to an embodiment of the present invention.

Detailed Description of the Embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below through specific embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and are not intended to limit it.

In the course of processor design research, the inventor observed that a processor with a complex instruction set (such as an X86 processor) is in fact itself a binary translation system between different instruction sets; the difference is that this translation is implemented in hardware, translating complex X86 instructions into internal RISC-like microcode instructions, while the performance of the X86 processor itself is not affected at all. The key difference between hardware translation and software translation is precisely that in hardware translation the translation part and the execution part have independent runtime environments and address spaces, so problems that software translation cannot overcome, such as dynamic relocation of indirect branch targets, do not exist in hardware translation; the cost of hardware translation, however, far exceeds that of software translation. It follows that if the processor design can be improved so that software binary translation, like hardware translation, runs with an independent runtime environment and address space, software binary translation can reach performance comparable to hardware translation, greatly reducing hardware cost while improving the performance of software translation.

The technical idea of the present invention is briefly introduced below.

Modern high-performance processors all support CPU virtualization, that is, the processor adds a guest operating state on top of its original operating state; the guest state is used to run virtual guest programs, while the original state is used to run native programs, i.e., host programs. Based on this, the inventor proposes the following processor design: call the processor's original operating state the host state, run the binary translation program in the host state, and run the translated code in the guest state, so that the binary translator and the translated code have independent runtime environments and address spaces. In practice, the hardware instruction set actually executed in the virtual guest state is the same as in the host state; through binary translation, the virtual guest state appears to be running an instruction set different from that of the host state. However, the translated code differs in size from the source code before translation, usually being larger, which is called code expansion. To keep the runtime address space of the translated code consistent with that of the source code, the translated code must be organized and stored in the address space differently from the host's other programs. An existing processor has only one L1 instruction cache, shared by host programs and the guest's translated code, which forces the guest's translated code to break the independence of its own address space in order to use this shared L1 instruction cache. The inventor therefore proposes adding a dedicated guest L1 instruction cache to a processor that supports CPU virtualization, used exclusively when the processor runs virtual guest programs. This completely solves the problems caused by sharing a single L1 instruction cache, allowing software binary translation to achieve performance close to that of hardware translation.

According to an embodiment of the present invention, as shown in Fig. 2, the present invention provides a dual-instruction-cache processor design. The processor includes two L1 instruction caches: one is the traditional L1 instruction cache (L1 IC) used in the host state, and the other is an L1 instruction cache for the virtual guest state (L1 GIC). The two instruction caches are connected to the processor's instruction-fetch unit through a two-to-one selection circuit, and a select signal sel determines which instruction cache the processor fetches from. When the processor runs in the virtual guest state, the sel signal steers the selection circuit so that the fetch unit reads from the L1 GIC, i.e., fetches from the guest L1 instruction cache; when the processor runs in the host state, the sel signal steers the selection circuit so that the fetch unit reads from the L1 IC, i.e., fetches from the host L1 instruction cache. According to an embodiment of the present invention, the L1 GIC can be read/written by the processor's execution unit and can also be read/written by accesses from outside the processor. Normally the execution unit is allowed to read and write the L1 GIC only when the processor is in the host state: because the binary translation program runs in the host state and the translated code runs in the guest state, allowing the execution unit to write the L1 GIC makes it possible to store the translated code there, so that the processor can fetch from the L1 GIC when it runs in the virtual guest state.
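
The behavior of the two-to-one selection circuit and the sel signal can be illustrated with a small software model. The following C sketch is only an illustration under assumed names (cpu_state_t, icache_t, and select_fetch_source are not taken from the patent); it shows how the operating state drives which L1 instruction cache the fetch unit reads from.

```c
#include <stdint.h>

/* Hypothetical software model of the fetch-path selection in Fig. 2. */
typedef enum { HOST_STATE = 0, GUEST_STATE = 1 } cpu_state_t;

typedef struct {
    uint32_t data[256][16];   /* minimal stand-in for an L1 instruction cache */
} icache_t;

static icache_t l1_ic;        /* host L1 instruction cache  (L1 IC)  */
static icache_t l1_gic;       /* guest L1 instruction cache (L1 GIC) */

/* sel = 1 steers the fetch unit to the L1 GIC (guest state),
 * sel = 0 steers it to the L1 IC (host state). */
static icache_t *select_fetch_source(cpu_state_t state, int *sel)
{
    *sel = (state == GUEST_STATE) ? 1 : 0;
    return *sel ? &l1_gic : &l1_ic;
}
```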

According to an embodiment of the present invention, the L1 GIC is organized in the same way as a traditional L1 IC, using a multi-way set-associative cache structure, which is therefore not described again here. However, the structure of a cache line in the L1 GIC differs from that of a traditional L1 IC in two respects. The cache-line structure is shown in Fig. 3, where PC is the program counter value of the source instruction code (the tag field), C is the continuation flag field, CLEN is the source-instruction-code length field, and Inst is the translated instruction code. Specifically, the L1 GIC differs from the L1 IC as follows:

First difference: the tag field of a cache line in the L1 GIC is the guest's source program counter value (PC) before translation, whereas the tag field in a traditional L1 IC is the physical memory address of that line of code.

Second difference: an L1 GIC cache line adds two new fields, the continuation field (C) and the source-code length field (CLEN). The continuation field indicates whether the translated code has a continuation line: 1 means a continuation line exists and fetching must continue in the next cache line, while 0 means there is no continuation line. The source-code length field (CLEN) is used to compute the pre-translation source program counter value of the following code (the next fetch).
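
As a concrete illustration of the Fig. 3 layout, a guest cache line can be modeled with the C struct below. This is a sketch only: the patent does not fix field widths or a line size, so the types, the 64-byte line size, and the valid bit are assumptions of the model.

```c
#include <stdint.h>

#define GIC_LINE_BYTES 64        /* assumed translated-code capacity per line */

/* One L1 GIC cache line as described for Fig. 3:
 *   pc   - tag field: source program counter value before translation
 *   c    - continuation field: 1 = a continuation line follows
 *   clen - source-code length covered by this line
 *   inst - the translated instruction code */
typedef struct {
    uint64_t pc;
    uint8_t  c;
    uint16_t clen;
    uint8_t  inst[GIC_LINE_BYTES];
    uint8_t  valid;              /* valid bit, added only for the software model */
} gic_line_t;
```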

In addition, the present invention designs the processor as follows: if a miss occurs in the guest L1 instruction cache (L1 GIC), a guest instruction-cache miss interrupt/exception is generated. This interrupt/exception can be delivered to the present processor for handling or sent to an external processor. If it is delivered to the present processor, the processor returns from the virtual guest state to the host state to respond to it, for example by calling the translation program in the host state to complete the translation of the subsequent guest instruction code and write it into the guest instruction cache L1 GIC.

To better understand how the processor of the present invention works, a single processor supporting CPU virtualization is taken as an example below:

The processor initially runs in the host state. When it needs to execute a guest program of another instruction set, it enters the virtual guest state by responding to an instruction that enters the guest state.

When the processor enters the guest state, the select signal sel of the two-to-one circuit is set so that the processor's fetch unit switches to fetching from the guest L1 instruction cache. For example, sel is 1 when entering the guest state and 0 when returning to the host state, and the 0/1 setting selects which instruction cache instructions are fetched from.

The processor uses the source program counter value of the guest program to look up a cache line in the guest-specific instruction cache. A hit occurs when the value of the PC field of a cache line matches the source program counter value being looked up; on a hit, the instructions of the hit cache line are returned. If the continuation field of the cache line is 1, the continuation line is located according to a preset continuation-lookup policy (for example, continuing the search in the adjacent next set of cache lines; the continuation-lookup policy is not limited to this and may be set to any other convenient policy as required). Note that when the processor runs in the guest state it executes, in order, the instructions of the code line hit in the L1 GIC; after executing one line of instructions, it must compute the source program counter value of the next fetch, i.e., of the code that should execute after the current line. If the continuation field is 0, the source program counter value of the next fetch is computed from the program counter value and the code-length field of the current cache line, i.e., next-fetch source PC = PC value of this cache line + CLEN value of this cache line. If the continuation field is 1, the source program counter value of the next fetch is computed in the same way after the continuation fetch is completed, i.e., next-fetch source PC = PC value of the cache line that completes the fetch + CLEN value of the cache line that completes the fetch, as illustrated in the sketch below.
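
The fetch procedure just described can be summarized by the following C sketch. It reuses the hypothetical gic_line_t from the earlier sketch, assumes a helper gic_lookup() that searches one set of the set-associative L1 GIC, and uses the example continuation policy of looking in the adjacent next set; none of these names come from the patent.

```c
#include <stddef.h>
#include <stdint.h>

extern gic_line_t *gic_lookup(uint64_t src_pc, unsigned set_offset); /* assumed helper        */
extern void issue_instructions(const uint8_t *inst);                 /* hand code to execution */

/* Fetch the translated code for src_pc; on a hit, compute the source PC of the
 * next fetch as PC + CLEN of the line that completes the fetch.
 * Returns 0 on a hit, -1 on a guest instruction-cache miss. */
static int gic_fetch(uint64_t src_pc, uint64_t *next_src_pc)
{
    unsigned set_offset = 0;
    gic_line_t *line = gic_lookup(src_pc, set_offset);
    if (line == NULL)
        return -1;                       /* miss: raises the L1 GIC miss exception */

    issue_instructions(line->inst);
    while (line->c == 1) {               /* continuation field set: keep fetching  */
        set_offset++;                    /* example policy: adjacent next set      */
        line = gic_lookup(src_pc, set_offset);
        if (line == NULL)
            return -1;
        issue_instructions(line->inst);
    }
    *next_src_pc = line->pc + line->clen;
    return 0;
}
```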

If the lookup in the guest L1 instruction cache misses, a guest instruction-cache miss exception is generated; the processor returns to the host state, executes the binary translation program, and updates the guest L1 instruction cache through the processor's execution unit. Once the update is complete, the processor re-enters the virtual guest state and continues executing the cross-instruction-set guest program.
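
The interplay between guest-state fetching, the miss exception, and the host-state binary translator can be sketched as a simple control loop. This is a behavioral illustration only; set_cpu_state() and fill_gic_from_translator() are hypothetical stand-ins for the hardware state switch and the binary translation program, and exit handling is omitted.

```c
extern void set_cpu_state(cpu_state_t state);          /* assumed: drives the sel signal */
extern void fill_gic_from_translator(uint64_t src_pc); /* assumed: host-state translator */

/* Run translated guest code starting at guest_pc. */
static void run_guest(uint64_t guest_pc)
{
    uint64_t pc = guest_pc;
    for (;;) {
        set_cpu_state(GUEST_STATE);          /* sel = 1: fetch from the L1 GIC        */
        uint64_t next_pc;
        if (gic_fetch(pc, &next_pc) == 0) {
            pc = next_pc;                    /* hit: stay in the guest state          */
        } else {
            set_cpu_state(HOST_STATE);       /* miss exception: return to host state  */
            fill_gic_from_translator(pc);    /* translate and update the L1 GIC       */
            /* the loop re-enters the guest state and retries the fetch */
        }
    }
}
```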

To address the insufficient performance of existing processors when they rely on binary translation to run programs of other instruction sets, the present invention proposes a dual-instruction-cache processor design that can greatly improve the performance of binary translation for other instruction sets. As the above embodiments show, by adding a small amount of hardware support the present invention enables the processor to execute code of other instruction sets efficiently, solving the problem of efficient software compatibility across instruction sets.

It should be noted that although the steps above are described in a specific order, this does not mean the steps must be executed in that order; in fact, some of the steps can be executed concurrently or even in a different order, as long as the required functionality is achieved.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement aspects of the present invention.

A computer-readable storage medium may be a tangible device that can hold and store instructions used by an instruction-execution device. The computer-readable storage medium may include, for example but without limitation, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove with instructions stored thereon, and any suitable combination of the above.

The embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical applications, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A processor that can run in a host state or a virtual guest state according to application requirements, the host state and the virtual guest state corresponding to different instruction sets, characterized in that the processor comprises:
a host level-one (L1) instruction cache for caching instructions of the host-state instruction set that can be executed by the processor running in the host state;
a guest L1 instruction cache for caching instructions of the virtual-guest-state instruction set that can be executed by the processor running in the virtual guest state; and
a selection circuit for connecting either the host L1 instruction cache or the guest L1 instruction cache to the corresponding components of the processor according to the processor's operating state, so that instructions are fetched from the host L1 instruction cache when the processor runs in the host state and from the guest L1 instruction cache when the processor runs in the virtual guest state.

2. The processor according to claim 1, characterized in that a binary translation program runs in the host state, the binary translation program being executable by the processor running in the host state to translate a source program of the virtual-guest-state instruction set into code executable by the processor running in the virtual guest state and to store the translated code in the guest L1 instruction cache.

3. The processor according to claim 2, characterized in that the guest L1 instruction cache is configured to support reads and writes by the processor's execution unit when the processor runs in the host state and to support read/write accesses from outside the processor.

4. The processor according to claim 3, characterized in that the guest L1 instruction cache is configured as a multi-way set-associative cache structure in which each cache line includes:
a tag field indicating the source program counter value, before translation, of the virtual-guest source program corresponding to the cache line;
a continuation field indicating whether the translated virtual-guest code has a continuation line;
a code-length field storing the source-code length of the cache line; and
a translated instruction code field.

5. The processor according to claim 4, characterized in that, when running in the virtual guest state, the processor is configured to fetch instructions from the guest L1 instruction cache as follows:
the source program counter value of the virtual guest state is used to look up a cache line in the guest L1 instruction cache; the instructions of the cache line whose tag field matches the source program counter value are returned; and the source program counter value of the next fetch is computed from the source program counter value indicated by the tag field and the code-length field of the cache line that completes the fetch.

6. The processor according to claim 5, characterized in that the processor is further configured to determine, according to the continuation field, whether to continue looking for a continuation line in the adjacent next set of cache lines.

7. The processor according to claim 5, characterized in that the source program counter value of the next fetch is the sum of the source program counter value indicated by the tag field and the value of the code-length field of the cache line that completes the current fetch.

8. The processor according to claim 2, characterized in that the processor is configured to: when a guest L1 instruction cache miss raises an interrupt, run in the host state to respond to the interrupt, and invoke the binary translation program to complete the outstanding translation of the source program of the virtual-guest-state instruction set and write the translated code into the guest L1 instruction cache.

9. A method for running a dual-instruction-set system based on the processor according to any one of claims 1-8, characterized in that the method comprises:
in response to an application requirement, running the processor in the state corresponding to the application requirement, wherein the processor can run in a host state or a virtual guest state, the host state and the virtual guest state corresponding to different instruction sets; and
based on the state corresponding to the application requirement, connecting either the host L1 instruction cache or the guest L1 instruction cache to the corresponding components of the processor, so that instructions are fetched from the host L1 instruction cache when the processor runs in the host state and from the guest L1 instruction cache when the processor runs in the virtual guest state.

10. An electronic device, characterized in that the device comprises:
a storage apparatus; and
one or more processors according to any one of claims 1-9.

CN202210578451.8A 2022-05-25 2022-05-25 a processor Pending CN114968576A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210578451.8A CN114968576A (en) 2022-05-25 2022-05-25 a processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210578451.8A CN114968576A (en) 2022-05-25 2022-05-25 a processor

Publications (1)

Publication Number Publication Date
CN114968576A true CN114968576A (en) 2022-08-30

Family

ID=82955819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210578451.8A Pending CN114968576A (en) 2022-05-25 2022-05-25 a processor

Country Status (1)

Country Link
CN (1) CN114968576A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1987825A (en) * 2005-12-23 2007-06-27 Institute of Computing Technology of CAS Fetching method and system for multiple line distance processor using path predicting technology
US20110071816A1 (en) * 2009-09-18 2011-03-24 International Business Machines Corporation Just In Time Compiler in Spatially Aware Emulation of a Guest Computer Instruction Set
CN104049948A (en) * 2013-03-16 2014-09-17 英特尔公司 Instruction Emulation Processors, Methods, And Systems
CN113454589A (en) * 2019-02-14 2021-09-28 国际商业机器公司 Directed interrupts for multi-level virtualization
CN117270970A (en) * 2023-07-21 2023-12-22 Institute of Computing Technology of CAS Processor and instruction fetching method thereof


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination