CN105354010B

CN105354010B - Processor and method for executing hardware data by processor

Info

Publication number: CN105354010B
Application number: CN201510683936.3A
Authority: CN
Inventors: 罗德尼·E·虎克; 艾伯特·J·娄坡; 约翰·麦可·吉尔
Original assignee: Via Technologies Inc
Current assignee: Via Technologies Inc
Priority date: 2014-10-20
Filing date: 2015-10-20
Publication date: 2018-10-30
Anticipated expiration: 2035-10-20
Also published as: CN105354010A; CN105278919A; CN105278919B

Abstract

A processor and a method for executing hardware data by the processor are provided, the processor comprises a processing core, the processing core detects whether a preset program is executed on the processor and inquires the prefetching characteristic related to the preset program, wherein the prefetching characteristic is mutual exclusion or sharing. The processor also includes a hardware data prefetcher that performs hardware prefetching for the predetermined program using the prefetch characteristic. The present invention can perform an analysis of changing prefetch characteristics at runtime, thus making it easier to determine when other memory access agents will access which memory block at compile time, as opposed to software prefetching.

Description

Processor and method for executing hardware data by processor

技术领域technical field

本发明有关于处理器的数据预取，并且主张2014年10月20日提出的美国临时申请案第62/066,131号的优先权，此案件整体引用为本发明的参考。This invention relates to data prefetching for processors and claims priority to US Provisional Application No. 62/066,131, filed October 20, 2014, which is incorporated by reference in its entirety.

背景技术Background technique

由于处理器内部对快取存储器的存取时间相对于处理器对系统存储器的存取时间持续不对等成长，凸显处理器需要更好的预取方式。举例来说， Mowry描述一个对编译器做修改以使用互斥模式预取的方式，此编译器对分割存储器执行本地分析时，参考“等效分类，其可为单一参考的参考集合”，并插入“一互斥模式预取而非共享模式预取在一给定的等效分类上，如果至少一个等效分类的成员是写入的话”，请参阅“通过软件控制数据预取的延迟容忍”，Mowry,Todd Carl,史丹佛大学1994年博士论文，第89页的叙述。As the processor's internal access time to the cache memory continues to grow unequal to the processor's access time to the system memory, it highlights the need for better prefetch methods for processors. For example, Mowry describes a way to modify a compiler to use exclusive-mode prefetching that, when performing local analysis on partitioned memory, refers to an "equivalent class, which may be a reference set of a single reference," and Insert "An exclusive-mode prefetch rather than a shared-mode prefetch on a given equivalent class if at least one member of the equivalent class is a write", see "Delay tolerance of data prefetching controlled by software ", Mowry, Todd Carl, Ph.D. Dissertation, Stanford University, 1994, p. 89.

以软件为主的预取方式的一个缺点就如Mowry所述，因为预取指令将写在程序里于是会增加微码大小，而增加微码大小可能需要在系统主要储存媒介(例如硬盘)上更多的储存空间以保留较大的程序，并且在较大的程序执行时也要在系统存储器中在维持更大的空间才行。额外的指令也消耗处理器的资源，诸如分派区域(dispatch slots)、保留站区域、以及执行单元区域等，这些都可能对处理器效能产生负面影响，而且更特别的是将减少指令区间有效的预视能力，因此对使用指令阶层的平行处理能力影响甚巨。另一个缺点是它不会对所有在处理器执行的程序提供好处，而只会对那些使用最佳化编译器所描述(profiled)与编译过的程序有所助益。One disadvantage of the software-based approach to prefetching, as Mowry states, is that the prefetch instructions will be written in the program and thus increase the microcode size, which may need to be stored on the system's primary storage medium (such as the hard disk) More storage space to keep larger programs, and also to maintain more space in system memory while larger programs are executing. Additional instructions also consume processor resources, such as dispatch slots, reservation station areas, and execution unit areas, which can negatively impact processor performance and, more specifically, reduce the effective The look-ahead capability, therefore, has a huge impact on the parallel processing capabilities using the instruction hierarchy. Another disadvantage is that it will not benefit all programs executed on the processor, but only those programs that are profiled and compiled using an optimizing compiler.

发明内容Contents of the invention

本发明提供一种处理器，该处理器包含一用以侦测一预定程序是否正在处理器上执行的核，该核还查询一个与正在处理器上执行的该预定程序相关的预取特性，其中该预取特性为互斥或共享。该处理器还包含一硬件数据预取器，其使用该预取特性为该预定程序执行硬件预取。The present invention provides a processor comprising a core for detecting whether a predetermined program is executing on the processor, the core also queries a prefetch characteristic associated with the predetermined program being executed on the processor, Wherein the prefetching feature is mutual exclusion or sharing. The processor also includes a hardware data prefetcher that uses the prefetch feature to perform hardware prefetch for the predetermined program.

本发明另提供一种由处理器执行的硬件数据预取器方法，该方法包含查询一个与正在处理器上执行的该预定程序相关的预取特性，其中该预取特性为互斥或共享。该方法还包含使用该预取特性为该预定程序执行硬件预取。The present invention further provides a hardware data prefetcher method executed by a processor, the method includes querying a prefetching characteristic related to the predetermined program being executed on the processor, wherein the prefetching characteristic is exclusive or shared. The method also includes performing hardware prefetching for the predetermined program using the prefetch feature.

本发明又提供一种处理器，该处理器包含一用以侦测一预定程序是否正在处理器上执行的核，并响应于侦测到该预定程序正在处理器上执行，由每个该处理器的一或多个范围暂存器各自加载地址范围至其中，每个该一或多个范围暂存器具有一相关的预取特性，其中该预取特性为互斥或共享。该处理器还包含一硬件数据预取器，其使用相关于加载至该范围暂存器的地址范围的该预取特性，为该预定程序执行硬件预取。The present invention also provides a processor, the processor includes a core for detecting whether a predetermined program is being executed on the processor, and in response to detecting that the predetermined program is being executed on the processor, each of the processors Each of the one or more range registers of the memory device loads an address range thereinto, each of the one or more range registers has an associated prefetch property, wherein the prefetch property is mutually exclusive or shared. The processor also includes a hardware data prefetcher that performs hardware prefetching for the predetermined program using the prefetching characteristics associated with the address range loaded into the range register.

本发明又提供一种由处理器执行的硬件数据预取器方法，该方法包含侦测一预定程序是否正在处理器上执行。该方法还包含响应于侦测到该预定程序正在处理器上执行，由每个该处理器的一或多个范围暂存器各自加载地址范围至其中，每个该一或多个范围暂存器具有一相关的预取特性，其中该预取特性为互斥或共享。该方法还使用相关于加载至该范围暂存器的地址范围的该预取特性，为该预定程序执行硬件预取。The present invention also provides a hardware data prefetcher method executed by a processor, the method includes detecting whether a predetermined program is being executed on the processor. The method also includes loading address ranges into one or more range registers of each of the processors in response to detecting that the predetermined program is being executed on the processor, each of the one or more range registers The device has an associated prefetch characteristic, wherein the prefetch characteristic is exclusive or shared. The method also performs hardware prefetching for the predetermined program using the prefetch property associated with the address range loaded into the range register.

本发明可在运行时观察其他存储器存取代理器对存储器区块的存取，即执行改变预取特性的分析，因而相对于软件预取而言，容易在编译时决定其他存储器存取代理器何时会对哪个存储器区块进行存取。The present invention can observe the accesses of memory blocks by other memory access agents during operation, that is, perform analysis to change the prefetch characteristics, so compared with software prefetch, it is easy to determine other memory access agents at compile time When to access which memory block.

附图说明Description of drawings

图1是本发明一实施例的计算机系统的方块图。FIG. 1 is a block diagram of a computer system according to an embodiment of the present invention.

图2是图1的硬件数据预取器的细部方块图。FIG. 2 is a detailed block diagram of the hardware data prefetcher of FIG. 1 .

图3是图1的系统的操作流程图。FIG. 3 is a flowchart of the operation of the system of FIG. 1 .

图4至11是图1依据多个存储器存取代理器对一存储器区块的存取分析以动态地更新该预取特性的操作流程图。4 to 11 are flowcharts of the operation of dynamically updating the prefetch characteristic according to the access analysis of a memory block by multiple memory access agents in FIG. 1 .

图12是使用离线程序分析来决定预取特性以执行硬件预取的操作流程图。12 is a flowchart of operations for performing hardware prefetching using off-line program analysis to determine prefetch characteristics.

图13描绘多个范围暂存器的方块图。Figure 13 depicts a block diagram of multiple range registers.

图14是图1依据多个存储器存取代理器对一存储器区块的存取分析以动态地更新该预取特性的操作流程图。FIG. 14 is a flowchart of the operation of dynamically updating the prefetch characteristic according to the access analysis of a memory block by multiple memory access agents in FIG. 1 .

其中，附图中符号的简单说明如下：Among them, a brief description of the symbols in the drawings is as follows:

100：计算系统100: Computing Systems

101：存储器存取代理器101: Memory access agent

102：核102: Nuclear

103：处理器103: Processor

104：绘图处理单元(GPU)104: Graphics Processing Unit (GPU)

106：直接存储器存取(DMA)装置106: Direct memory access (DMA) device

108：系统存储器108: System memory

112：总线112: bus

114：存储器区块114: memory block

122：硬件数据预取器122: Hardware data prefetcher

124：末级快取存储器(LLC)124: Last level cache memory (LLC)

132：预取特性132: Prefetch feature

202：存储器存取历史202: Memory access history

204：更新模块204: Update module

206：预取模块206: Prefetch module

212：存储器存取历史206的一部分212: Part of memory access history 206

208：预取请求208: Prefetch request

232：微码撷取232: Microcode Extraction

234：程序加载/储存234: Program load/store

236：窥探236: Spy

1302：地址范围字段1302: Address range field

1304：预取特性字段1304: Prefetch attribute field

302～312、402～406、502～506、602～608、702～712、802～812、902～912、 1002～1008、1102～1112、1202～1208、1402～1408：步骤。302～312, 402～406, 502～506, 602～608, 702～712, 802～812, 902～912, 1002～1008, 1102～1112, 1202～1208, 1402～1408: steps.

具体实施方式Detailed ways

<术语><term>

存储器存取代理器是一存取系统存储器的装置，举例来说，处理核、绘图处理单元、以及执行直接存储器存取的(DMA)周边装置都是存储器存取代理器。A memory access agent is a device that accesses system memory. For example, processing cores, graphics processing units, and peripherals that perform direct memory access (DMA) are all memory access agents.

硬件数据预取器基于存储器存取代理器未来将需要数据的预估而经由从系统存储器的数据读取，特别是，如本发明所述，硬件预取不是一软件预取，其指处理器因执行一架构预取指令而由处理器从系统存储器的数据读取。因此，处理器基于处理器在运转时的分析(亦即与硬件预取同时发生的存储器存取分析)来执行硬件预取。相反的，相关于插入在程序架构预取指令(例如在编译时间)的软件预取会在程序执行之前被执行，因此不会与软件预取同时发生。硬件预取执行所读取的数据可能是由处理器所执行的指令、或是非指令的数据，例如处理器执行指令时的数据运算元。The hardware data prefetcher reads data from system memory based on an estimate that the memory access agent will need the data in the future. In particular, as described herein, a hardware prefetch is not a software prefetch, it refers to a processor Data is read from system memory by the processor as a result of executing an architectural prefetch instruction. Thus, the processor performs hardware prefetching based on the processor's analysis at runtime (ie, analysis of memory accesses that occurs concurrently with hardware prefetching). In contrast, software prefetching related to prefetching instructions inserted in the program architecture (eg, at compile time) is performed before program execution, and thus does not occur simultaneously with software prefetching. The data read by the hardware prefetch execution may be instructions executed by the processor, or non-instruction data, such as data operands when the processor executes instructions.

存储器区块是在系统存储器中一连串的存储位置，例如存储器页。A memory block is a series of storage locations, such as memory pages, in system memory.

预取特性是用以指出数据读取者所要求的与相关快取线所有权互斥(互斥预取特性)的预取性质、或是所要求的是允许其他存储器区块可保留快取线副本(共享预取特性)的预取性质。当一预取使用互斥预取特性时，将包含一命令每个其他的存储器存取代理器无效其本地快取线副本(假使有修改则写回目前数据数值)的总线传输，其通常被当成是读取无效总线传输、具修改意图的读取总线传输、读取所有权、或是相似的技术名称；反之，当预取使用共享预取特性时，将包含一在共享状态下允许每个其他的存储器存取代理器保留其本地快取线副本的总线传输，其通常被当成是单纯的读取总线传输、或是读取共享OK总线传输、或是相似的技术名称。The prefetch feature is used to indicate the prefetch nature required by the data reader to be mutually exclusive with the ownership of the relevant cache line (mutually exclusive prefetch feature), or the requirement is to allow other memory blocks to reserve the cache line The prefetch nature of the replica (shared prefetch feature). When a prefetch uses the exclusive prefetch feature, it will consist of a bus transfer instructing each other memory access agent to invalidate its local cache line copy (writing back the current data value if modified), which is usually When considered to be a read invalid bus transfer, read intent-to-modify bus transfer, read ownership, or similar technical names; conversely, when prefetching uses the shared prefetch feature, it will include a shared state that allows each Other memory access agents keep their local cache line copies of bus transfers, which are often referred to as pure read bus transfers, or read shared OK bus transfers, or similar technical names.

图1绘示本发明一实施例的计算机系统100的方块图。计算机系统100包含存储器存取代理器101，其共享一系统存储器108并通过一总线112存取。存储器存取代理器101可包含一执行直接存储器存取(DMA)的周边装置106、一绘图处理单元(GPU)104、以及一处理器103。处理器103包含多个处理核 102、一个由核102所共享的末级快取存储器(LLC)124、以及一硬件数据预取器122，GPU 104与DMA 106亦可包含一硬件数据预取器122。虽然在图1仅显示两核102，其他数量核的实施例亦可运用本发明的技术。FIG. 1 shows a block diagram of a computer system 100 according to an embodiment of the present invention. The computer system 100 includes a memory access broker 101 that shares a system memory 108 and is accessed through a bus 112 . The memory access agent 101 may include a peripheral device 106 performing direct memory access (DMA), a graphics processing unit (GPU) 104 , and a processor 103 . The processor 103 includes a plurality of processing cores 102, a last level cache (LLC) 124 shared by the cores 102, and a hardware data prefetcher 122, and the GPU 104 and the DMA 106 may also include a hardware data prefetcher 122. Although only two cores 102 are shown in FIG. 1 , embodiments with other numbers of cores can also utilize the techniques of the present invention.

硬件数据预取器122包含一预取特性132，其系硬件数据预取器122用来从系统存储器108的一存储器区块114执行硬件预取，预取特性132具有一互斥或共享的值。硬件数据预取器122动态地且选择性地依据由存储器存取代理器101对该存储器区块114进行存取的分析来更新预取特性132。硬件数据预取器122将在下面的图2与其他图示做更进一步的描述。The hardware data prefetcher 122 includes a prefetch property 132, which is used by the hardware data prefetcher 122 to perform hardware prefetch from a memory block 114 of the system memory 108, the prefetch property 132 has a value of exclusive or shared . The hardware data prefetcher 122 dynamically and selectively updates the prefetch characteristics 132 based on the analysis of accesses to the memory block 114 by the memory access agent 101 . The hardware data prefetcher 122 will be further described in FIG. 2 and other diagrams below.

处理器103可包含一作为处理器103与总线112的接口的总线接口单元，而每个核102包含一指令快取、指令解码器、指令分派器、存储子系统(例如加载/储存单元、存储缓冲区)、其他执行单元、以及一本地数据快取(例如第一级数据快取)。The processor 103 may include a bus interface unit as an interface between the processor 103 and the bus 112, and each core 102 includes an instruction cache, instruction decoder, instruction dispatcher, memory subsystem (e.g. load/store unit, memory buffer), other execution units, and a local data cache (such as a first-level data cache).

当硬件数据预取器122对总线接口单元提出一硬件预取请求时，其将伴随预取特性132(亦即共享或互斥)。总线单元予以回应，并在总线112上执行传输而获得相关于该硬件预取请求的快取线所有权。如果该预取特性132是互斥，总线接口单元执行一指示其他存储器存取代理器101无效其本地快取线副本，并在本地快取线副本有修改时回写目前数据的数值。如果预取特性 132为共享，总线接口单元执行一允许每个其他的存储器存取代理器保留其本地快取线副本的总线传输。When the hardware data prefetcher 122 issues a hardware prefetch request to the bus interface unit, it will be accompanied by the prefetch characteristic 132 (ie, shared or exclusive). The bus unit responds and performs a transfer on the bus 112 to take ownership of the cache line associated with the hardware prefetch request. If the prefetch property 132 is exclusive, the BUI executes a command to instruct other memory access agents 101 to invalidate their local cache line copies and to write back the current data values when the local cache line copies are modified. If the prefetch property 132 is shared, the bus interface unit performs a bus transfer that allows each other memory access agent to keep its local copy of the cache line.

当一快取线对一处理器103的快取存储器做预取且快取线可被预取时，其状态为对其他核102为互斥、或是其状态为与其他共享系统存储器108的其他存储器存取代理器101所共享。举例来说，如果一快取线将被多个核102所共享，它可有效地在共享状态下使快取线被预取；然而，如果一快取线将被预取核102执行写入时，它可有效地在互斥而非共享之下预取快取线。When a cache line prefetches the cache memory of a processor 103 and the cache line can be prefetched, its state is either exclusive to other cores 102 or its state is shared with other system memory 108 shared by other memory access agents 101. For example, if a cache line is to be shared by multiple cores 102, it can effectively cause the cache line to be prefetched in the shared state; , it effectively prefetches cache lines under mutual exclusion rather than sharing.

请参阅图2，其描图1的硬件数据预取器122的方块结构图。该硬件数据预取器122包含一更新模块204，用以接收由存储器存取历史202而来的信息。存储器存取历史202包含由存储器存取代理器101对系统存储器108进行存取的信息，特别的是，该存储器存取历史202包含由每个核102从系统存储器108 所执行的微码撷取232(亦即指令撷取232)信息、由核102对系统存储器108所执行的程序加载/储存234、以及响应产生在总线112上的对系统存储器108执行存取而产生的窥探236(该系统存储器108存取系由包含该硬件数据预取器122之外的多个存储器存取代理器101之一所产生)。该信息可包含，但不限定于，存储器地址、存取类型(例如指令撷取、加载、储存)、以及每个存储器存取代理器101的识别码(其亦包含产生该存取的核102辨识码)。较佳者，该硬件数据预取器122对系统存储器108中被处理器103所存取的有源区块 114维持一分离的预取特性132与分离的存储器历史202。该更新模块204依据存储器存取历史202的分析来更新预取特性132，相关实施例将阐述如下。Please refer to FIG. 2 , which depicts a block diagram of the hardware data prefetcher 122 in FIG. 1 . The hardware data prefetcher 122 includes an update module 204 for receiving information from the memory access history 202 . The memory access history 202 includes information about accesses made to the system memory 108 by the memory access agent 101. In particular, the memory access history 202 includes the microcode fetches executed by each core 102 from the system memory 108. 232 (i.e., instruction fetch 232) information, program loads/stores 234 performed by the core 102 to the system memory 108, and snoops 236 generated in response to accesses to the system memory 108 generated on the bus 112 (the system Memory 108 accesses are generated by one of a plurality of memory access agents 101 including the hardware data prefetcher 122). This information may include, but is not limited to, the memory address, the type of access (e.g., fetch, load, store), and the identity of each memory access agent 101 (which also includes the core 102 that originated the access). identification code). Preferably, the hardware data prefetcher 122 maintains a separate prefetch profile 132 and separate memory history 202 for active blocks 114 of system memory 108 accessed by the processor 103 . The update module 204 updates the prefetch characteristic 132 according to the analysis of the memory access history 202 , related embodiments will be described as follows.

硬件数据预取器122亦包含一接收预取特性132的预取模块206。当启动预取模块206去分析核102的存储器存取历史，并依据该分析预估那些数据在将来会被核102需要时，该预取模块206也接收一与核102相关的存储器存取历史202的一部分212。预取模块206经由对总线接口单元产生包含该预取特性132的预取请求208，来执行对该预取数据的硬件预取，该预取特性可包含一预设值，亦即共享或互斥。举例来说，预设值可通过当核102在制造时选择性地烧断镕丝并以其状态做预设、或经由核102微码的常数值做预设。预取模块206可从系统存储器108预取一或多个有价值的快取线，并存入快取存储器124、及/或处理器103在快取存储器阶层中较低阶的快取存储器(例如核 102的私人快取存储器)里。The hardware data prefetcher 122 also includes a prefetch module 206 that receives the prefetch characteristics 132 . When starting the prefetch module 206 to analyze the memory access history of the core 102, and predicting those data that will be needed by the core 102 in the future according to the analysis, the prefetch module 206 also receives a memory access history related to the core 102 202 part of 212. The prefetching module 206 performs hardware prefetching of the prefetched data by generating a prefetch request 208 including the prefetch characteristic 132 to the bus interface unit. The prefetch characteristic may include a default value, that is, shared or mutual Rebuke. For example, the preset value can be preset by selectively blowing the fuse and its state when the core 102 is manufactured, or by a constant value in the core 102 microcode. The prefetch module 206 may prefetch one or more valuable cache lines from the system memory 108, and store them in the cache memory 124, and/or the cache memory of the processor 103 lower in the cache memory hierarchy ( For example, in the private cache memory of core 102).

请参阅图3，其描绘图1所示系统的操作流程图。Please refer to FIG. 3 , which depicts a flowchart of the operation of the system shown in FIG. 1 .

在步骤302中，存储器存取代理器101存取系统存储器108中的一存储器区块114，该存取可包含核102对存储器区块114 的存取，如步骤306所述者。硬件数据预取器122累积相关于每个有源存储器区块114的存储器存取历史 202中的存取信息。流程前往步骤304。In step 302 , the memory access agent 101 accesses a memory block 114 in the system memory 108 , the access may include the core 102 accessing the memory block 114 as described in step 306 . The hardware data prefetcher 122 accumulates access information in the memory access history 202 associated with each active memory bank 114. The flow goes to step 304 .

在步骤304中，更新模块204分析该存储器存取代理器101对存储器区块 114的存取，并依据该分析动态地更新相关于该存储器区块114的预取特性 132。更新模块204在步骤312中，当预取模块206持续对存储器区块114执行硬件预取时，继续分析并更新预取特性132。图3的步骤304至312已显示操作流程，而分析的实施例将随后搭配后续图示做说明。In step 304, the update module 204 analyzes the memory access agent 101's access to the memory block 114, and dynamically updates the prefetch characteristic 132 related to the memory block 114 according to the analysis. In step 312 , the update module 204 continues to analyze and update the prefetch characteristic 132 when the prefetch module 206 continues to perform hardware prefetch on the memory block 114 . Steps 304 to 312 in FIG. 3 have shown the operation flow, and the embodiment of the analysis will be described later with subsequent figures.

在步骤306中，核102执行程序，其包含从系统存储器108撷取程序指令，并响应于所撷取程序指令的执行而对系统存储器108执行加载/储存。此外，指令存取、加载与储存对系统存储器108的存储器区块114(例如存储器页)做存取。基本上，存取将对多个存储器区块114执行。硬件数据预取器122累积相关于每个有源存储器区块114的存储器存取历史202中的存取信息。流程由步骤306前往步骤308。In step 306, core 102 executes the program, which includes fetching program instructions from system memory 108 and performing loads/stores to system memory 108 in response to execution of the fetched program instructions. In addition, instructions access, load, and store access memory blocks 114 (eg, memory pages) of system memory 108 . Basically, the access will be performed to multiple memory banks 114 . The hardware data prefetcher 122 accumulates access information in the memory access history 202 associated with each active memory bank 114 . The process goes from step 306 to step 308 .

在步骤308中，预取模块206依据在步骤306所累积的核102对存储器区块 114的存储器存取历史202的一部分212，来预估存储器区块114的那些数据将是核102需要的。流程从步骤308前往步骤312。In step 308 , the prefetch module 206 predicts which data of the memory block 114 will be needed by the core 102 according to the portion 212 of the memory access history 202 of the core 102 to the memory block 114 accumulated in step 306 . The flow goes from step 308 to step 312 .

在步骤312中，预取模块206执行在步骤308所预估数据的硬件预取，此硬件预取使用在步骤304所动态更新的预取特性132。虽然步骤302至304所显示的是存储器存取代理器101来驱动预取特性的更新，应注意的是，在步骤 302由存储器存取代理器101进行的存储器存取以及步骤304所进行的预取特性132动态更新可同时发生。此外，尽管在流程的步骤306、308、至312因核 102的存储器存取来驱动预估，且此预估使用动态更新的预取特性来驱动硬件预取，应注意的是，在步骤306由核102进行的存储器存取以及步骤308所进行的预取与在步骤312的硬件预取可同时发生。如图3所示，由步骤312返回至步骤302与306，因为流程在步骤302与304、以及步骤306、308与312是同时发生的，所以在步骤312所执行的预取为硬件预取而非软件预取。In step 312 , the prefetch module 206 performs a hardware prefetch of the data estimated in step 308 using the prefetch characteristics 132 dynamically updated in step 304 . Although steps 302 to 304 show the memory access broker 101 driving the update of the prefetch feature, it should be noted that the memory access by the memory access broker 101 at step 302 and the prefetch at step 304 Fetch properties 132 dynamic updates can occur concurrently. Furthermore, although the estimation is driven by memory accesses by the core 102 at steps 306, 308, through 312 of the flow, and the estimation uses the dynamically updated prefetch feature to drive hardware prefetching, it should be noted that at step 306 The memory access by core 102 and the prefetching at step 308 and the hardware prefetching at step 312 may occur concurrently. As shown in FIG. 3 , return to steps 302 and 306 from step 312, because the process occurs simultaneously in steps 302 and 304, and steps 306, 308 and 312, so the prefetching performed in step 312 is hardware prefetching. No software prefetching.

应注意的是，虽然上述流程仅描绘关于单一存储器区块的操作情形，但硬件数据预取器122可对多个存储器区块114执行硬件数据预取，亦可同时使用动态更新的预取特性132来进行预取。较佳者，硬件数据预取器122可对其执行硬件预取的每个存储器区块144维持一个相关的动态更新预取特性132。It should be noted that although the above process only depicts the operation of a single memory block, the hardware data prefetcher 122 can perform hardware data prefetch on multiple memory blocks 114, and can also use the dynamically updated prefetch feature at the same time 132 for prefetching. Preferably, the hardware data prefetcher 122 maintains an associated dynamically updated prefetch characteristic 132 for each memory block 144 for which it performs hardware prefetch.

互斥而非共享预取快取线的一个好处，是这样做可导致一单一总线传输而非两个总线传输，亦即，并非是一请求数据的第一传输之后，跟随着一取得与所有权互斥的数据的第二传输，互斥预取是一结合两个请求并要求数据互斥的单一传输，这个做法对多芯片多核处理器且每个核具有自己的末级快取的架构特别有好处。One benefit of mutual exclusion rather than shared prefetch cache lines is that doing so results in a single bus transfer instead of two bus transfers, i.e. instead of a first transfer of requested data followed by a get and ownership A second transfer of mutually exclusive data. Mutual prefetch is a single transfer that combines two requests and requires mutual exclusion of the data. This approach is especially useful for architectures with multi-chip multi-core processors where each core has its own last-level cache. There are benefits.

此处所描述的基于预取特性以动态在共享或互斥间改变的硬件预取相对于软件预取解决方式的好处在于，硬件预取的解决方式可在运行时观察其他存储器存取代理器该对存储器区块的存取，亦即当它们发生时执行改变预取特性的分析，然而对软件预取的解决方式而言，很难在编译时去决定其他存储器存取代理器何时会对哪个存储器区块进行存取。The benefit of hardware prefetching based on the prefetching characteristics described here to dynamically change between shared or exclusive over the software prefetching solution is that the hardware prefetching solution can observe other memory access agents at runtime. accesses to memory blocks, i.e. when they occur perform analysis that changes the prefetching characteristics, however for software prefetch solutions it is difficult to determine at compile time when other memory access agents will which memory block to access.

请参阅图4，其描绘对图1中依据存储器存取代理器101对一存储器区块 114做存取的分析，来进行预取特性132的动态更新的操作流程图。流程开始于步骤402。Please refer to FIG. 4 , which depicts an operation flowchart of dynamically updating the prefetch characteristic 132 according to the analysis of the memory access agent 101 accessing a memory block 114 in FIG. 1 . The flow starts at step 402 .

在步骤402中，存储器区块114的预取特性132初始值为互斥，这是因为预设值是互斥的(如上所述)，或者是依据一个对该存储器区块114的初始存取 (例如依据图6或图10所述)而对该存储器区块114的预取特性132初始化为互斥。一般来说，如果核102读取数据极可能也会更新该数据，而且就一般而言，在存储器区块114中的数据通常具有相似的性质。因此，如前所述，互斥地预取快取线来执行单一总线传输而非多个总线传输，可以降低总线112 的运载量并减少延迟。流程前往步骤404。In step 402, the initial value of the prefetch characteristic 132 of the memory bank 114 is mutually exclusive, because the default value is mutually exclusive (as described above), or it is based on an initial access to the memory bank 114 The prefetch property 132 of the memory block 114 is initialized to mutual exclusion (eg, as described in accordance with FIG. 6 or FIG. 10 ). In general, if a core 102 reads data it will most likely also update the data, and in general, the data in the memory bank 114 will generally have similar properties. Therefore, as previously mentioned, mutually exclusive prefetching of cache lines to perform a single bus transfer rather than multiple bus transfers can reduce bus 112 traffic and reduce latency. The flow goes to step 404 .

在步骤404中，硬件数据预取器122被通知该存储器区块114中的一快取线已被其他存储器存取代理器101所窥探，并具有写入该存储器线的意图而将导致存储器历史202的更新，这也表示在该存储器区块114中其他快取线的数据将被其他存储器存取代理器所写入。在这种情形下，因为可能影响核102 与其他存储器存取代理器101之间对该快取线的预取，因此可能对这些快取线的互斥预取产生不利的影响。流程前往步骤406。In step 404, the hardware data prefetcher 122 is notified that a cache line in the memory bank 114 has been snooped by other memory access agents 101 and has the intention to write to the memory line which will result in a memory history 202, which also means that the data of other cache lines in the memory block 114 will be written by other memory access agents. In this case, since the prefetching of the cache line between the core 102 and other memory access agents 101 may be affected, the mutually exclusive prefetching of these cache lines may be adversely affected. The flow goes to step 406 .

在步骤406中，更新模块204响应于步骤404的窥探而更新预取特性132为共享。流程在步骤406结束。In step 406 , the update module 204 updates the prefetch property 132 to share in response to the snoop of step 404 . The process ends at step 406 .

请参阅图5，其描述图1中，依据存储器存取代理器对存储器区块114的存取分析而动态更新预取特性132的操作流程图。流程开始于步骤502。Please refer to FIG. 5 , which depicts an operation flowchart of dynamically updating the prefetch characteristic 132 according to the memory access agent's access analysis to the memory block 114 in FIG. 1 . The flow starts at step 502 .

在步骤502中，每个存储器区块114的预取特性132被初始设定为共享，这是因为预设值是共享的(如上所述)，或者是基于一个对存储器区块114的初始存取(例如依据图6或图10所述)而对存储器区块114的预取特性132初始化为共享。流程前往步骤504。In step 502, the prefetch characteristics 132 of each memory bank 114 are initially set to shared, either because the default value is shared (as described above), or based on an initial memory bank 114 The prefetch property 132 of the memory block 114 is initialized to shared for fetching (eg, as described with respect to FIG. 6 or FIG. 10 ). The flow goes to step 504 .

在步骤504中，硬件数据预取器122持续追踪(例如记录在存储器存取历史202中)在存储器区块114中的快取线已被核102写入的数量并侦测到该数量已超过一临界值。这可能表示在该存储器区块114中其他快取线的数据将被核102所写入，而在这种情形下对这些存储器线的互斥预取产生不利影响的理由如下。临界值可为一预先决定的数值、或是经由系统软件程序化、或是依据预取执行成效分析并由硬件数据预取器122所动态更新的数值。在一实施例中，临界值是1，亦即依据第一次对存储器区块114的写入而将预取特性 132更新为互斥。流程前往步骤506。In step 504, the hardware data prefetcher 122 keeps track (eg, recorded in the memory access history 202) of the number of cache lines in the memory bank 114 that have been written by the core 102 and detects that the number has exceeded a critical value. This may indicate that data for other cache lines in the memory bank 114 will be written by the core 102 , and the reason why exclusive prefetching of these memory lines would be adversely affected in this case is as follows. The threshold value can be a predetermined value, programmed by system software, or dynamically updated by the hardware data prefetcher 122 according to the performance analysis of prefetch execution. In one embodiment, the threshold value is 1, that is, the prefetch property 132 is updated to exclusive based on the first write to the memory block 114 . The flow goes to step 506 .

在步骤506中，更新模块204响应于在步骤504中临界值被超越而更新预取特性132为互斥。流程终止于步骤506。In step 506 , the update module 204 updates the prefetch property 132 to be exclusive in response to the threshold being exceeded in step 504 . The flow ends at step 506 .

请参阅图6，其描绘图1中，依据存储器存取代理器101对存储器区块114 做存取分析以进行预取特性132的动态更新的操作流程图。流程开始于步骤 602。Please refer to FIG. 6 , which depicts an operation flowchart of dynamically updating the prefetch characteristic 132 according to the memory access agent 101 performing access analysis on the memory block 114 in FIG. 1 . The flow starts at step 602.

在步骤602中，更新模块204测得核102对存储器区块114的初始存取。流程前进至步骤604。In step 602 , the update module 204 detects an initial access by the core 102 to the memory block 114 . The process proceeds to step 604 .

在决策步骤604中，更新模块204判断该初始存取是指令撷取或加载/储存。如果是指令撷取则流程前往步骤606，否则前往步骤608。In decision step 604, the update module 204 determines whether the initial access is an instruction fetch or a load/store. If it is command fetching, the process goes to step 606 , otherwise, goes to step 608 .

在步骤606中，更新模块204响应于在步骤604判断该初始存取为指令撷取而更新预取特性132为共享，这样做是有益处的，因为对一存储器区块114 执行指令撷取时，剩余对存储器区块114的存取也可能是指令撷取，且基本上包含指令的存储器位置一旦被载入存储器后就不会再被写入。在一实施例中，硬件数据预取器122持续使用在步骤606所动态更新的共享预取特性132 来从存储器区块114执行硬件预取，然而，如同在说明书所描述的其他实施例中，当硬件数据预取器122对存储器区块的存取做监控与分析时，初始预取特性132可从共享被更新为互斥(反之亦然)。流程终止于步骤606。In step 606, the update module 204 updates the prefetch property 132 to share in response to determining that the initial access is an instruction fetch in step 604, which is beneficial because when an instruction fetch is performed on a memory block 114 , the remaining accesses to the memory block 114 may also be instruction fetches, and essentially the memory locations containing instructions are never written to once loaded into memory. In one embodiment, the hardware data prefetcher 122 continues to perform hardware prefetching from the memory bank 114 using the dynamically updated shared prefetch feature 132 at step 606, however, as in other embodiments described herein, When the hardware data prefetcher 122 monitors and analyzes accesses to memory blocks, the initial prefetch property 132 can be updated from shared to exclusive (and vice versa). The flow ends at step 606 .

在步骤608中，更新模块204响应于在步骤604判断该初始存取为加载/储存而更新预取特性132为互斥。在一实施例中，硬件数据预取器122持续使用在步骤608所动态更新的互斥预取特性132来从存储器区块114执行硬件预取，然而，如同在说明书所描述的其他实施例中，当硬件数据预取器122对存储器区块的存取做监控与分析时，初始预取特性132可从互斥被更新为共享(反之亦然)。流程结束于步骤608。In step 608 , update module 204 updates prefetch property 132 to exclusive in response to determining in step 604 that the initial access is a load/store. In one embodiment, the hardware data prefetcher 122 continues to perform hardware prefetching from the memory bank 114 using the dynamically updated exclusive prefetch attribute 132 at step 608, however, as in other embodiments described herein , when the hardware data prefetcher 122 monitors and analyzes accesses to memory blocks, the initial prefetch property 132 can be updated from exclusive to shared (and vice versa). The process ends at step 608 .

请参阅图7，其描绘图1中，依据存储器存取代理器101对存储器区块114 做存取分析，以进行预取特性132的动态更新的操作流程图。流程开始于步骤702。Please refer to FIG. 7 , which depicts an operation flowchart of dynamically updating the prefetch characteristic 132 according to the memory access agent 101 performing access analysis on the memory block 114 in FIG. 1 . The flow starts at step 702 .

在步骤702中，更新模块204维持核102从存储器区块114所撷取的指令计数值(例如记录在存储器存取历史202中)，并以fetch_cnt来表示、以及从存储器区块114而来的程序加载/储存计数值，并以load_store_cnt来表示。流程前进至步骤704。In step 702, the update module 204 maintains the count of instructions fetched by the core 102 from the memory block 114 (for example, recorded in the memory access history 202), represented by fetch_cnt, and the instruction count value from the memory block 114 The program loads/stores the count value and is represented by load_store_cnt. The process proceeds to step 704 .

在决策步骤704中，更新模块204判断fetch_cnt是否大于load_store_cnt。如果是则流程前往步骤706，否则流程前往步骤708。In decision step 704, update module 204 determines whether fetch_cnt is greater than load_store_cnt. If yes, the flow goes to step 706 , otherwise the flow goes to step 708 .

在步骤706中，更新模块204响应于在步骤704中fetch_cnt大于 load_store_cnt的判断而更新预取特性132为共享。流程前往步骤706。In step 706, the update module 204 updates the prefetch property 132 to share in response to the determination in step 704 that fetch_cnt is greater than load_store_cnt. The flow goes to step 706 .

在步骤708中，更新模块204判断fetch_cnt是否小于load_store_cnt，如果是则流程前往步骤712，否则流程结束。In step 708, the update module 204 judges whether the fetch_cnt is smaller than load_store_cnt, if yes, the process proceeds to step 712, otherwise, the process ends.

在步骤712中，更新模块204响应于在步骤708中fetch_cnt小于 load_store_cnt的判断而更新预取特性132为互斥。流程结束于步骤712。In step 712, the update module 204 updates the prefetch property 132 to exclusive in response to the determination in step 708 that fetch_cnt is less than load_store_cnt. The flow ends at step 712 .

请参阅图8，其描绘图1中，依据存储器存取代理器101对存储器区块114 做存取分析，以进行预取特性132的动态更新的操作流程图。流程开始于步骤802。Please refer to FIG. 8 , which depicts an operation flowchart of dynamically updating the prefetch characteristic 132 according to the memory access broker 101 performing access analysis on the memory block 114 in FIG. 1 . The flow starts at step 802 .

在步骤802中，硬件数据预取器122维持一核102从存储器区块114所撷取的指令计数值(例如记录在存储器存取历史202中)，并以fetch_cnt来表示、以及从存储器区块114而来的程序加载/储存计数值，并以load_store_cnt来表示。流程前进至步骤804。In step 802, the hardware data prefetcher 122 maintains a count of instructions fetched by a core 102 from the memory block 114 (for example, recorded in the memory access history 202), represented by fetch_cnt, and from the memory block The program load/store count value from 114 is represented by load_store_cnt. The process proceeds to step 804 .

在决策步骤804中，更新模块204判断fetch_cnt与load_store_cnt的差值是否大于一临界值。如果是则流程前往步骤806，否则流程前往步骤808。临界值可为一预先决定的数值、或是经由系统软件程序化、或是依据预取执行成效分析并由硬件数据预取器122所动态更新的数值。In decision step 804, the update module 204 determines whether the difference between fetch_cnt and load_store_cnt is greater than a threshold. If yes, the flow goes to step 806 , otherwise the flow goes to step 808 . The threshold value can be a predetermined value, programmed by system software, or dynamically updated by the hardware data prefetcher 122 according to the performance analysis of prefetch execution.

在步骤806中，更新模块204响应于在步骤804中fetch_cnt与load_store_cnt 之间的差值大于该临界值的判断而更新预取特性132为共享。流程前往步骤 806。In step 806 , the update module 204 updates the prefetch property 132 to share in response to the determination in step 804 that the difference between fetch_cnt and load_store_cnt is greater than the threshold. The flow goes to step 806.

在决策步骤808中，更新模块204判断load_store_cnt与fetch_cnt的差值是否大于一临界值。如果是则流程前往步骤812，否则流程结束。在步骤808的临界值数值可与步骤804所使用的临界值相同或不同。In decision step 808, the update module 204 determines whether the difference between load_store_cnt and fetch_cnt is greater than a threshold. If yes, the flow goes to step 812, otherwise the flow ends. The threshold value at step 808 may be the same as or different from the threshold value used at step 804 .

在步骤812中，更新模块204响应于在步骤808中load_store_cnt与fetch_cnt 之间的差值大于临界值的判断而更新预取特性132为互斥。流程结束于步骤 812。In step 812 , the update module 204 updates the prefetch property 132 to exclusive in response to the determination in step 808 that the difference between load_store_cnt and fetch_cnt is greater than the threshold. The process ends at step 812.

请参阅图9，其描绘图1中，依据存储器存取代理器101对存储器区块114 做存取分析，以进行预取特性132的动态更新的操作流程图。流程开始于步骤902。Please refer to FIG. 9 , which depicts an operation flowchart of dynamically updating the prefetch characteristic 132 according to the memory access agent 101 performing access analysis on the memory block 114 in FIG. 1 . The flow starts at step 902 .

在步骤902中，硬件数据预取器122维持一核102从存储器区块114所撷取的指令计数值(例如记录在存储器存取历史202中)，并以fetch_cnt来表示、以及从存储器区块114而来的程序加载/储存计数值，并以load_store_cnt来表示。流程前进至步骤904。In step 902, the hardware data prefetcher 122 maintains a count of instructions fetched by a core 102 from the memory block 114 (for example, recorded in the memory access history 202), represented by fetch_cnt, and from the memory block The program load/store count value from 114 is represented by load_store_cnt. The process proceeds to step 904 .

在决策步骤904中，更新模块204判断fetch_cnt与load_store_cnt的差值是否大于一临界值。如果是则流程前往步骤906，否则流程前往步骤908。In decision step 904, the update module 204 determines whether the difference between fetch_cnt and load_store_cnt is greater than a threshold. If yes, the flow goes to step 906 , otherwise the flow goes to step 908 .

在步骤906中，更新模块204响应于在步骤904中fetch_cnt与load_store_cnt 之间的差值大于该临界值的判断而更新预取特性132为共享。流程结束于步骤906。In step 906 , the update module 204 updates the prefetch property 132 to share in response to the determination in step 904 that the difference between fetch_cnt and load_store_cnt is greater than the threshold. The process ends at step 906.

在决策步骤908中，更新模块204判断fetch_cnt与load_store_cn之间的差值是否小于一临界值。如果是则流程前往步骤912，否则流程结束。在步骤 908的临界值数值可与步骤904所使用的临界值相同或不同。In decision step 908, the update module 204 determines whether the difference between fetch_cnt and load_store_cn is less than a threshold. If yes, the flow goes to step 912, otherwise the flow ends. The threshold value at step 908 may be the same as or different from the threshold value used at step 904.

在步骤912中，更新模块204响应于在步骤908中fetch_cnt与load_store_cnt 之间的差值小于临界值的判断而更新预取特性132为互斥。流程结束于步骤 912。In step 912 , the update module 204 updates the prefetch property 132 to exclusive in response to the determination in step 908 that the difference between fetch_cnt and load_store_cnt is less than the threshold. The flow ends at step 912.

请参阅图10，其描绘图1中，依据存储器存取代理器101对存储器区块114 做存取分析，以进行预取特性132的动态更新的操作流程图。流程开始于步骤1002。Please refer to FIG. 10 , which depicts an operation flowchart of dynamically updating the prefetch characteristic 132 according to the memory access agent 101 performing access analysis on the memory block 114 in FIG. 1 . The process starts at step 1002.

在步骤1002中，更新模块204测得核102对存储器区块114的初始存取。流程前进至步骤1004。In step 1002 , the update module 204 detects an initial access by the core 102 to the memory block 114 . The flow goes to step 1004.

在决策步骤1004中，更新模块204判断该初始存取是加载或储存。如果是加载则流程前往步骤1006，否则前往步骤1008。在本文中，一加载存取包含一个程序指令的撷取与一个由该程序加载指令所执行的加载。In decision step 1004, the update module 204 determines whether the initial access is a load or a store. If it is loaded, the process goes to step 1006, otherwise goes to step 1008. In this context, a load access includes a fetch of a program instruction and a load performed by the program load instruction.

在步骤1006中，更新模块204响应于在步骤1004判断该初始存取为加载而更新预取特性132为共享，在一实施例中，硬件数据预取器122持续使用在步骤1006所动态更新的共享预取特性132来从存储器区块114执行硬件预取，然而，如同在说明所描述的其他实施例中，当硬件数据预取器122对存储器区块的存取做监控与分析时，初始预取特性132可从共享被更新为互斥(反之亦然)。流程前往步骤1006。In step 1006, the update module 204 updates the prefetch characteristic 132 to share in response to determining that the initial access is a load in step 1004. In one embodiment, the hardware data prefetcher 122 continues to use the dynamically updated Shared prefetch feature 132 is used to perform hardware prefetch from memory bank 114. However, as in other embodiments described in the specification, when hardware data prefetcher 122 monitors and analyzes memory bank accesses, the initial The prefetch property 132 can be updated from shared to exclusive (and vice versa). The flow goes to step 1006 .

在步骤1008中，更新模块204响应于在步骤1004判断该初始存取为储存而更新预取特性132为互斥。这样做是有益处的，因为对一个从存储器区块 114执行储存时，剩余对存储器区块114的存取也可能是储存。在一实施例中，硬件数据预取器122持续使用在步骤1008所动态更新的互斥预取特性132来从存储器区块114执行硬件预取，然而，如同在说明书所描述的其他实施例中，当硬件数据预取器122对存储器区块的存取做监控与分析时，初始预取特性132可从互斥被更新为共享(反之亦然)。流程结束于步骤1008。In step 1008 , the update module 204 updates the prefetch property 132 to be exclusive in response to determining that the initial access is a store in step 1004 . This is beneficial because when a store is performed to one slave memory bank 114, the remaining accesses to the memory bank 114 may also be stores. In one embodiment, the hardware data prefetcher 122 continues to perform hardware prefetching from the memory bank 114 using the dynamically updated exclusive prefetch property 132 at step 1008, however, as in other embodiments described herein , when the hardware data prefetcher 122 monitors and analyzes accesses to memory blocks, the initial prefetch property 132 can be updated from exclusive to shared (and vice versa). The process ends at step 1008.

请参阅图11，其描绘图1中，依据存储器存取代理器101对存储器区块114 做存取分析，以进行预取特性132的动态更新的操作流程图。流程开始于步骤1102。Please refer to FIG. 11 , which depicts an operation flowchart of dynamically updating the prefetch characteristic 132 according to the memory access agent 101 performing access analysis on the memory block 114 in FIG. 1 . The flow starts at step 1102 .

在步骤1102中，硬件数据预取器122维持一核102从存储器区块114而来的加载计数值(例如记录在存储器存取历史202)，并以load_cnt来表示、以及从存储器区块114而来的程序储存计数值，并以store_cnt来表示。流程前进至步骤1104。In step 1102, the hardware data prefetcher 122 maintains the load count value (for example, recorded in the memory access history 202) of a core 102 from the memory block 114, represented by load_cnt, and from the memory block 114 The coming program stores the count value and is represented by store_cnt. The process proceeds to step 1104 .

在决策步骤1104中，更新模块204判断load_cnt与store_cnt的比例值是否大于一临界值。如果是则流程前往步骤1106，否则流程前往步骤1108。临界值可为一预先决定的数值、或是经由系统软件程序化、或是依据预取成效分析并由硬件数据预取器122所动态更新的数值。In decision step 1104, the update module 204 determines whether the ratio of load_cnt to store_cnt is greater than a threshold. If yes, the flow goes to step 1106 , otherwise the flow goes to step 1108 . The threshold value can be a predetermined value, programmed by system software, or dynamically updated by the hardware data prefetcher 122 according to the prefetch performance analysis.

在步骤1106中，更新模块204响应于在步骤1104中load_cnt与store_cnt的比例值大于该临界值的判断而更新预取特性132为共享。流程结束于步骤 1106。In step 1106 , the update module 204 updates the prefetch property 132 to share in response to the determination in step 1104 that the ratio of load_cnt to store_cnt is greater than the threshold. The process ends at step 1106.

在决策步骤1108中，更新模块204判断store_cnt与load_cnt的比例值是否大于一临界值。如果是则流程前往步骤1112，否则流程结束。在步骤1108的临界值数值可与步骤1104所使用的临界值相同或不同。In decision step 1108, the update module 204 determines whether the ratio of store_cnt to load_cnt is greater than a threshold. If yes, the flow goes to step 1112, otherwise the flow ends. The threshold value at step 1108 may be the same as or different from the threshold value used at step 1104 .

在步骤1112中，更新模块204响应于在步骤1008判断store_cnt与load_cnt 的比例值大于该临界值而更新预取特性132为互斥。流程结束于步骤1112。In step 1112 , the update module 204 updates the prefetch property 132 to be exclusive in response to determining in step 1008 that the ratio of store_cnt to load_cnt is greater than the threshold. The process ends at step 1112.

请参阅图12，其描绘使用离线程序分析来决定预取特性以执行硬件预取的操作流程图。流程开始于步骤1202 。Please refer to FIG. 12 , which depicts an operational flowchart of using off-line program analysis to determine prefetch characteristics to perform hardware prefetch. The flow starts at step 1202.

在步骤1202中，一程序被分析以判断当处理器在共享预取特性或互斥预取特性时的执行硬件预取执行成效较佳。该分析对多个有兴趣的不同程序(例如常执行的程序、或已知在一般情形下需要较长时间来执行的程序，因此它们的执行成效相当重要而需要被最佳化)而进行。较佳者，当处理器使用共享预取特性来执行硬件预取时程序会被执行许多次，且当处理器使用互斥预取特性来执行硬件预取时程序会被执行许多次执行时，其执行成效都会加以记录，例如对每个共享或互斥的组态而言，多次执行结果的平均值将被计算出来。在另一个实施例中，该分析会使用在多个系统与伺服器通信时的共同实验值，其中伺服器提供当系统使用一组态信息并动态决定系统的改良组态时所需的组态信息与执行成效数据。这类的实施如申请日为2014年9月2 日的美国专利编号第14/474,623与14/474,699的申请案，其皆引用申请日为 2014年5月20日且编号是62/000,808的美国暂时申请案做优先权，而它们全部都合并在此作参考用。在这个实例中，动态系统组态包含对预取特性132的动态更新。流程前往步骤1204。In step 1202, a program is analyzed to determine that hardware prefetching performs better when the processors are performing shared prefetching or exclusive prefetching. The analysis is performed on a number of different programs of interest (eg, programs that are frequently executed, or programs that are known to generally take a long time to execute, so their execution performance is important and needs to be optimized). Preferably, when the processor uses the shared prefetch feature to perform hardware prefetch, the program will be executed many times, and when the processor uses the exclusive prefetch feature to perform hardware prefetch, the program will be executed many times, Its implementation results will be recorded, for example, for each shared or mutually exclusive configuration, the average value of multiple execution results will be calculated. In another embodiment, the analysis will use common experimental values when multiple systems communicate with a server, where the server provides the desired configuration when the system uses a configuration information and dynamically determines the improved configuration of the system Information and performance data. Such implementations are for example U.S. Patent Nos. 14/474,623 and 14/474,699 filed September 2, 2014, both of which refer to U.S. Patent No. 62/000,808 filed May 20, 2014. Provisional Applications claim priority, and all of them are hereby incorporated by reference. In this example, dynamic system configuration includes dynamic updates to prefetch characteristics 132 . The flow goes to step 1204 .

在步骤1204中，一个对每个程序配置一项目(entry)的表格被编译。较佳者，每个项目包含程序的识别特性以及在步骤1202中提供最佳执行成效的预取特性。该识别特性可包含一程序名称(例如已被操作系统所知的程序名称)、存储器存取态样(pattern)及/或程序所使用的不同指令类型的数量。该表格亦可包含在处理器103上最后执行的系统软件，诸如装置驱动程序。流程前往步骤1206。In step 1204, a table with an entry for each program is compiled. Preferably, each entry includes the identification properties of the program and the prefetching properties that provide the best performance in step 1202 . The identifying characteristics may include a program name (eg, known to the operating system), memory access patterns, and/or the number of different instruction types used by the program. The table may also contain system software, such as device drivers, that was last executed on the processor 103 . The flow goes to step 1206 .

在步骤1206中，在表格中的程序被侦测是否正在处理器103中执行。在一实施例中，系统软件侦测该程序正在执行中，例如操作系统可查询执行中程序的程序名称，其好比操作系统在它的执行程表格中查询每个程序的名称般。在另一个实施例中，该表格会在启动时间里被操作系统下载至处理器 103，而且处理器103会测得该程序正被执行中。举例来说，处理器103可在程序执行时收集与程序相关的识别特性(例如存储器态样及/或程序所使用的不同指令类型的数量)，以及将识别特性与步骤1204所编译表格的项目作比对并下载至处理器103。流程前往步骤1208。In step 1206 , the programs in the table are checked to see if they are being executed in the processor 103 . In one embodiment, the system software detects that the program is running. For example, the operating system can query the program name of the running program, just as the operating system queries the name of each program in its execution table. In another embodiment, the table is downloaded to the processor 103 by the operating system during boot time, and the processor 103 detects that the program is being executed. For example, the processor 103 can collect identification characteristics related to the program (such as memory configuration and/or the number of different instruction types used by the program) when the program is executed, and compare the identification characteristics with the items of the table compiled in step 1204 Compare and download to processor 103. The flow goes to step 1208 .

在步骤1208中，硬件数据预取器122对在步骤1206中所测得的程序执行硬件预取，其使用该表格项目中与该所测得程序相关的预取特性来进行。流程结束于步骤1208。In step 1208, the hardware data prefetcher 122 performs hardware prefetching on the program measured in step 1206 using the prefetching characteristics associated with the measured program in the table entry. The process ends at step 1208.

请参阅图13，其描绘多个范围暂存器1300的方块图。范围暂存器1300包含于硬件数据预取器122。在一实施例中，硬件数据预取器122包含一相关于每个核102的范围暂存器1300集合，每个范围暂存器1300包含一地址范围字段1302与一预取特性字段1304。每个地址范围字段1302可经由程序化以指示位于处理器103地址空间中的一个地址范围。预取特性1304指示一预取特性，其共享或互斥。如预取模块206对硬件预取所预测的数据地址，该预取模块 206判断该预取地址是否位于所指示范围暂存器1300的地址范围中。如果正确，预取模块206依据在相关的预取特性字段1304所指示的预取特性来产生该预取请求208；如果不正确，在一实施例中，该预取模块206以预设的预取特性来产生该预取请求208。在一实施例中，该预设的预取特性为共享，因此范围暂存器1300仅需要用来指示互斥硬件预取所需的地址范围。在另一实施例中，该预设的预取特性为互斥，所以范围暂存器1300仅需要用来指示共享硬件预取所需的地址范围。在这些实施例中，由于暗示所指示的预取特性与预设的预取特性相反，因此该预取特性字段1304可能不需要。Please refer to FIG. 13 , which depicts a block diagram of a plurality of range registers 1300 . The range register 1300 is included in the hardware data prefetcher 122 . In one embodiment, the hardware data prefetcher 122 includes a set of range registers 1300 associated with each core 102 , and each range register 1300 includes an address range field 1302 and a prefetch characteristic field 1304 . Each address range field 1302 can be programmed to indicate an address range within the address space of the processor 103 . Prefetch property 1304 indicates a prefetch property, either shared or exclusive. If the prefetch module 206 prefetches the data address predicted by the hardware, the prefetch module 206 judges whether the prefetch address is located in the address range of the indicated range register 1300. If correct, the prefetch module 206 generates the prefetch request 208 according to the prefetch characteristics indicated in the relevant prefetch characteristic field 1304; The prefetch request 208 is generated by fetching properties. In one embodiment, the default prefetch feature is shared, so the range register 1300 only needs to be used to indicate the address range required for exclusive hardware prefetch. In another embodiment, the preset prefetch feature is mutually exclusive, so the range register 1300 only needs to be used to indicate the address range required for shared hardware prefetch. In these embodiments, the prefetch characteristic field 1304 may not be required since it implies that the indicated prefetch characteristic is contrary to the preset prefetch characteristic.

请参阅图14，其描绘使用由图13的范围暂存器1300所决定的预取特性来执行硬件预取的操作流程图。流程开始于步骤1402。Please refer to FIG. 14 , which depicts an operation flowchart of performing hardware prefetching using the prefetching characteristics determined by the range register 1300 of FIG. 13 . The flow starts at step 1402 .

在步骤1402中，一程序被分析以当处理器使用共享预取特性或互斥预取特性执行硬件预取时，判断不同程序中的哪些程序在处理器103上执行时有较佳的执行成效，其类似于上述图12所述的方式，然而在步骤1402所执行的分析包含较执行于步骤1202的分析具更细的区间(finer granularity)。更特别的是，此分析包含以相关于被程序化在地址暂存器1300之内的地址范围，来评估每个程序以共享或互斥预取特性的执行成效。举例来说，包含由多个存储器存取代理器101所存取数据的地址范围，可通过共享预取特性而有效地包含于表格中；反之，包含由一单一核102所写入数据的地址范围，可经由互斥预取特性而有效地包含于表格中。流程前往步骤1404。In step 1402, a program is analyzed to determine which of the different programs have better execution performance when executed on the processor 103 when the processor uses the shared prefetch feature or the exclusive prefetch feature to perform hardware prefetch , which is similar to that described above in FIG. 12 , however the analysis performed at step 1402 includes a finer granularity than the analysis performed at step 1202 . More specifically, the analysis includes evaluating the implementation effectiveness of each program with the shared or exclusive prefetching feature relative to the address range programmed into the address register 1300 . For example, address ranges containing data accessed by multiple memory access agents 101 can be efficiently included in the table by sharing the prefetch feature; conversely, addresses containing data written by a single core 102 Ranges can be efficiently contained in tables via the exclusive prefetch feature. The flow goes to step 1404 .

在步骤1404中，一个对每个程序配置一项目的表格被编译，其类似于步骤1204所述的方式。然而在步骤1404所编译的表格包含地址范围以及将被推入范围暂存器1300的相关预取特性。流程前往步骤1406。In step 1404 , a table with one entry per program is compiled in a manner similar to that described for step 1204 . However, the table compiled at step 1404 contains address ranges and associated prefetch properties to be pushed into the range register 1300 . The flow goes to step 1406 .

在步骤1406中，在表格中的程序被侦测是否正在处理器103上执行，其类似于步骤1206所述的方式。然而，当程序被测到正在执行时，使用该表格项目中相关于该程序的信息来额外地程序化范围暂存器1300。在一实施例中，由操作系统程序化该范围暂存器1300。在另一个实施例中，处理器103 将响应于侦测到该程序的执行而自己本身程序化该范围暂存器1300，举例来说，处理器103的微码可程序化范围暂存器1300。流程前往步骤1408。In step 1406 , the programs in the table are checked to see if they are being executed on processor 103 , in a manner similar to that described in step 1206 . However, when a program is detected to be executing, the scope register 1300 is additionally programmed using information about the program in the table entry. In one embodiment, the range register 1300 is programmed by the operating system. In another embodiment, the processor 103 will itself program the range register 1300 in response to detecting execution of the program, for example, the microcode of the processor 103 can program the range register 1300 . The flow goes to step 1408 .

在步骤1408中，硬件数据预取器122使用范围暂存器1300的预取特性并结合预设的预取特性，对在步骤1406所测到的程序执行硬件预取。流程结束于步骤1408。In step 1408 , the hardware data prefetcher 122 performs hardware prefetching on the program detected in step 1406 by using the prefetch feature of the range register 1300 in combination with the preset prefetch feature. The process ends at step 1408 .

虽然上文已描述许多动态更新预取特性132的不同实施例，其他不脱离本发明精神的实施例亦被考量在本发明中，举例来说，在一实施例中，将对每个有源存储器区块114维持一饱和计数值。当存储器存取代理器101的一个对存储器区块114做存取，并意图从互斥硬件预取得(例如一储存或加载/储存) 获得好处时，更新模块204用饱和的方式增加计数值；反之，当存储器存取代理器101的一个对存储器区块114做存取，并意图从共享硬件预取(例如一加载或指令撷取)获得好处时，更新模块204用饱和的方式减少计数值。较佳者，预取特性132是饱和计数值的最重要位，举例来说，更新模块204维持一队列 (例如移位暂存器)用以储存关于对每个存储器区块114的最近N个存取的信息(例如储存、加载/储存、指令撷取)，其中N大于1。更新模块204依据储存在队列且希望从互斥或共享硬件预取获得好处的信息，来动态地更新预取特性132为互斥或共享。举例而言，若大部分的最近N个存取是储存时则更新为互斥，反之，若大部分的最近N个存取是指令撷取时则更新为共享。在另一个例子中，对预取模块206从存储器区块114执行的每个硬件预取而言，更新模块204维持所使用预取特性132的指示。对每个发生在预取快取线的存取来说，若正在做存取的存储器存取代理器101对相关于指示的快取线进行写入时，更新模块204将更新该指示为互斥，且若该快取线被窥探时则更新模块 204更新该指示为共享。在这个方式下，一个在存储器区块114中的快取线位图示(bitmap)将被维护，用以指出存储器区块114中，不同快取线可能用到的最接近最佳预取特性的信息。更新模块204在位图示中搜寻态样并判断下一个快取线的地址是否硬件预取命中样态中的任一个，并使用位图示以动态判断该预取特性132是否可用于硬件预取该快取线。最后，虽然本文已揭露包含于一多核处理器的硬件数据预取器的实施例，包含于单一核的硬件数据预取器的其他实施例也属于不脱离本发明精神的范畴。Although many different embodiments of the dynamically updated prefetch feature 132 have been described above, other embodiments are also contemplated in the present invention without departing from the spirit of the present invention. For example, in one embodiment, each active The memory block 114 maintains a saturation count. When one of the memory access brokers 101 accesses the memory block 114 and intends to benefit from exclusive hardware prefetching (such as a store or load/store), the update module 204 increments the count value in a saturated manner; Conversely, when one of the memory access agents 101 makes an access to the memory block 114 and intends to benefit from a shared hardware prefetch (such as a load or instruction fetch), the update module 204 decrements the counter value in a saturated manner . Preferably, the prefetch feature 132 is the most significant bit of the saturation count value. For example, the update module 204 maintains a queue (such as a shift register) for storing the latest N Accessed information (eg store, load/store, instruction fetch), where N is greater than 1. The update module 204 dynamically updates the prefetch property 132 to be exclusive or shared based on information stored in the queue and wishing to benefit from exclusive or shared hardware prefetching. For example, if most of the last N accesses are stores, the update is exclusive, whereas if most of the last N accesses are fetches, the update is shared. In another example, for each hardware prefetch performed by prefetch module 206 from memory bank 114 , update module 204 maintains an indication of the prefetch characteristic 132 used. For each access that occurs on a prefetch cache line, if the memory access agent 101 that is doing the access writes to the cache line associated with the indication, the update module 204 will update the indication as an interactive If the cache line is snooped, the update module 204 updates the indication as shared. In this way, a cache line bitmap (bitmap) in memory bank 114 will be maintained to indicate the closest optimal prefetch characteristics that may be used for different cache lines in memory bank 114 Information. The update module 204 searches the pattern in the bitmap and determines whether the address of the next cache line is a hardware prefetch hit pattern, and uses the bitmap to dynamically determine whether the prefetch feature 132 is available for hardware prefetch Take that cache line. Finally, although an embodiment of a hardware data prefetcher included in a multi-core processor has been disclosed herein, other embodiments of a hardware data prefetcher included in a single core are within the scope of the invention.

本发明已通过各种实施例描述于此，上述实施例应理解为本发明所呈现的范例，而不应对本发明产生任何限制。所属技术领域人员应明显的了解到，在不偏离本发明的精神和范围内，当可进行任何形式上或细节的改变或润饰。举例而言，可用软件实现，如本文所述的装置与方法的功能、制造、建模、模拟、描述及/或测试。上述可通过使用一般程序语言(例如C、C++)、包括Verilog HDL、VHDL等的硬件描述语言(HDL)或其他可用的程序而加以实现。上述软件可被设置于任何已知的计算机可用介质，例如磁带(magnetic tape)、半导体、磁盘或光盘(例如CD-ROM、DVD-ROM等)、一网络、有线或无线或其他通讯介质。本文所述的装置与方法的各种实施例可包括一半导体知识产权核(semiconductor intellectualproperty core)，例如一处理器核(例如通过HDL实现或指定)以及通过集成电路制造而转换为硬件。此外，本文所述的装置与方法可通过硬件与软件的组合而加以实现。因此，本发明的范围不应受限于本文的任何示范性实施例，而只应以本发明权利要求书的范围与其等效范围为准。应特别注意，本发明可实现于处理器装置中，而该处理器可使用于一般计算机之中。最后，所属技术领域人员应理解，基于本文所公开的概念与实施例，任何设计或修饰其他架构以具备与本发明相同目的的应用，皆已包含于本发明的范围且皆已定义于本发明权利要求书的范围中。The present invention has been described herein through various embodiments, and the above-mentioned embodiments should be understood as examples of the present invention and should not limit the present invention in any way. It should be obvious to those skilled in the art that any change or modification in form or detail can be made without departing from the spirit and scope of the present invention. For example, software may be used to implement, function, manufacture, model, simulate, describe and/or test the devices and methods as described herein. The above can be realized by using general programming languages (such as C, C++), hardware description languages (HDL) including Verilog HDL, VHDL, etc., or other available programs. The above software can be installed on any known computer-usable medium, such as magnetic tape, semiconductor, magnetic disk or optical disk (such as CD-ROM, DVD-ROM, etc.), a network, wired or wireless or other communication medium. Various embodiments of the apparatus and methods described herein may include a semiconductor intellectual property core, such as a processor core (implemented or specified, for example, by HDL) and converted to hardware by integrated circuit fabrication. In addition, the devices and methods described herein may be implemented by a combination of hardware and software. Therefore, the scope of the present invention should not be limited by any exemplary embodiments herein, but only by the scope of the claims of the present invention and their equivalents. It should be particularly noted that the present invention can be implemented in a processor device which can be used in a general computer. Finally, those skilled in the art should understand that based on the concepts and embodiments disclosed herein, any application that designs or modifies other structures to achieve the same purpose as the present invention is included in the scope of the present invention and is defined in the present invention within the scope of the claims.

Claims

1. A processor, characterized in that it comprises:

One core for:

loading a first table for indicating a prefetch characteristic for a predetermined program, wherein the prefetch characteristic is exclusive or shared;

detecting whether the predetermined program is executing on the processor from a second table for identifying the processing being executed; and

query the prefetch characteristics associated with the scheduled program being executed on the processor; and

a hardware data prefetcher performing hardware prefetching to the predetermined program using the prefetch feature,

Wherein, the hardware data prefetcher analyzes memory accesses by the core that occur simultaneously with the hardware prefetch, and dynamically updates the prefetch characteristic according to the analysis.

2. The processor of claim 1, wherein the prefetch feature is confirmed before detecting whether the predetermined program is executing on the processor.

3. The processor according to claim 2, wherein the step of confirming the prefetching characteristic before detecting whether the predetermined program is being executed on the processor comprises:

When the processor executes the predetermined program when using a shared prefetch feature to perform hardware prefetch, determine a first execution result;

determining a second execution result under the condition that the predetermined program is executed when the processor executes hardware prefetching using a mutually exclusive prefetching feature; and

When the first performance is better than the second execution performance, select the prefetching characteristic as shared, and when the second execution performance is better than the first execution performance, select the prefetching characteristic as exclusive.

4. The processor according to claim 1, wherein an operating system executing on the processor detects whether the identification code of the predetermined program exists in the executing program table of the operating system to detect Test whether the scheduled program is executing on the processor.

5. The processor of claim 4, wherein the processor receives the prefetch characteristic from the operating system.

6. The processor according to claim 1, wherein the processor is also used for:

receiving information indicative of at least one identifying characteristic of the predetermined program and associated prefetch characteristics prior to detecting whether the predetermined program is executing on the processor;

Wherein the processor compares at least one identification characteristic of the predetermined program in the received information with the identification characteristic determined when the predetermined program is executed, to detect whether the predetermined program is executing on the processor; and

Wherein the hardware data prefetcher performs hardware prefetching on the predetermined program by using the prefetching characteristic in the received information related to the compared at least one identification characteristic, to use the prefetching characteristic to perform hardware prefetching on the predetermined program. Prefetching.

7. The processor of claim 1, wherein the first table is included in a device driver.

8. A method for performing hardware data prefetching by a processor, characterized in that the method comprises:

detecting whether the predetermined program is executing on the processor from a second table for identifying the processing being executed;

query the prefetch characteristics associated with the scheduled program being executed on the processor;

performing hardware prefetching of the predetermined program using the prefetching feature; and

Memory accesses occurring concurrently with the hardware prefetch are analyzed, and the prefetch characteristics are dynamically updated based on the analysis.

9. The method for performing hardware data prefetching by a processor according to claim 8, wherein the prefetch feature is confirmed before detecting whether the predetermined program is being executed on the processor.

10. The method for performing hardware data prefetching by a processor according to claim 9, wherein the step of confirming the prefetching characteristics before detecting whether the predetermined program is being executed on the processor comprises:

11. The method for performing hardware data prefetching by a processor according to claim 8, wherein an operating system executing on the processor detects whether the identification code of the predetermined program exists in the operating system Executing program table to detect whether the predetermined program is being executed on the processor.

12. The method for performing hardware data prefetching by a processor according to claim 11, further comprising:

The step of providing, by the operating system, the prefetch feature to the processor in response to detecting that the predetermined program is executing on the processor.

13. The method for performing hardware data prefetching by a processor according to claim 8, further comprising:

receiving, by the processor, information indicative of at least one identifying characteristic of the predetermined program and an associated prefetch characteristic prior to detecting whether the predetermined program is executing on the processor;

Wherein the step of detecting whether the predetermined program is executing on the processor includes: comparing at least one identification characteristic of the predetermined program in the received information with an identification characteristic determined when the predetermined program is executed; and

The step of performing hardware prefetching on the predetermined program by using the prefetching characteristic includes: performing hardware prefetching on the predetermined program by using the prefetching characteristic in the received information related to the compared at least one identification characteristic .

14. A processor, characterized in that it comprises:

One core for:

detecting whether a predetermined program is executing on the processor; and

In response to detecting that the predetermined program is being executed on the processor, loading the respective address ranges of the one or more range registers to the processor's one or more range registers, wherein one or more of the range registers each of the address ranges has an associated prefetch characteristic, the prefetch characteristic being exclusive or shared; and

a hardware data prefetcher for performing hardware prefetching of the predetermined program using the prefetch characteristic relative to the address range loaded into the range register,

15. The processor according to claim 14, wherein the performing hardware prefetching of the predetermined program using the prefetching characteristic related to the address range loaded into the range register comprises:

Predict the data addresses that may be used by the scheduled program in the future;

judging whether the data address falls in one of the one or more address ranges;

performing a hardware prefetch of data at the data address using the prefetch characteristic associated with the one of the one or more address ranges when the data address falls within the one of the one or more address ranges take; and

When the data address does not fall in the one of the one or more address ranges, performing hardware prefetching of the data on the data address using a preset prefetch characteristic.

16. The processor of claim 15, wherein the prefetch characteristic associated with the one of the one or more address ranges implies the opposite of the preset prefetch characteristic.

17. The processor of claim 15, wherein the prefetch characteristic associated with the one of the one or more address ranges is maintained at the one of the one or more address ranges maintained address register.

18. A method for performing hardware data prefetching by a processor, characterized in that the method comprises:

detecting whether a predetermined program is executing on the processor;

A hardware prefetch is performed on the predetermined program using the prefetch property associated with the address range loaded into the range register.

19. The method for performing hardware data prefetching by a processor according to claim 18, wherein the hardware prefetching is performed on the predetermined program using the prefetch characteristic related to the address range loaded into the range register The steps taken include:

20. The method of performing hardware data prefetching by a processor according to claim 19, wherein the prefetch characteristic associated with the one of the one or more address ranges implies that the preset prefetch The characteristics are opposite.

21. The method of performing hardware data prefetching by a processor according to claim 19, wherein the prefetch characteristic associated with the one of the one or more address ranges is maintained to maintain the one or more address ranges address register in one of the address ranges.