CN118820134A - Cache consistency optimization method in automatic thread-level parallelization - Google Patents
- Publication number: CN118820134A (CN202411311602.9A)
- Authority
- CN
- China
- Prior art keywords
- instruction
- consistency
- cache
- consistent
- instructions
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0815—Cache consistency protocols
- G06F12/0831—Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Description
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a cache coherence optimization method in automatic thread-level parallelization.
Background Art
Each processor core of a modern many-core processor has a cache, which speeds up data access while keeping programming convenient. When a parallel program runs, multiple threads or processes running on multiple processor cores may read and write data in the same shared memory region. To ensure that the parallel program executes correctly, cache coherence is required between different processors in the same computing node and between different cores of the same processor, so that copies of the same data in the caches of multiple cores remain consistent. Many-core processors therefore need hardware that maintains cache coherence.
With the continuous development of computer technology, almost all applications running on supercomputers implement process-level parallelism. In the vast majority of cases, however, different processes have no shared memory regions or variables, so cache coherence operations are usually unnecessary between two processes. Some applications have both process-level and thread-level parallelism, that is, one process contains multiple threads running in the same computing node. Because all threads of a process share that process's memory space, cache coherence operations are usually required between these threads. Even so, each thread also has private variables, so threads do not share every variable; and even a single shared variable may be read and written by multiple threads in one program segment yet used privately by each thread in another. It follows that most cache coherence operations are redundant. Maintaining cache coherence not only reduces the performance of parallel programs but also increases the pressure on hardware system design, so a cache coherence optimization method in automatic thread-level parallelization is urgently needed.
The prior art introduces a scheme for switching the cache coherence state on and off, but this scheme still cannot sufficiently reduce the operation overhead.
Summary of the Invention
The present disclosure provides a cache coherence optimization method, apparatus, device and storage medium in automatic thread-level parallelization.
In a first aspect, the present disclosure provides a cache coherence optimization method in automatic thread-level parallelization, comprising:
obtaining, by a compiler in response to thread-level automatic parallelization compilation of a loop, all coherent shared variables;
according to the coherent shared variables, finding among the memory access instructions those that access the coherent shared variables, and determining them to be coherent instructions;
according to preset rules, determining the coherent instruction stream fragments corresponding to the coherent instructions and the private instruction stream fragments that contain no coherent instruction, generating for each coherent instruction stream fragment a first target instruction subsequence that maintains cache coherence, and generating for each private instruction stream fragment a second target instruction subsequence that does not maintain cache coherence;
wherein the first target instruction subsequence and the second target instruction subsequence both target a target processor; the target processor has the function of maintaining inter-core cache coherence and supports executing memory access instructions with cache coherence turned off.
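The first two steps and the fragment split can be illustrated with a minimal Python sketch. It is not the disclosed compiler: the `MemAccess` record, the function names, and the "maximal run" partition rule are assumptions made for illustration, and non-memory instructions are left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemAccess:
    op: str   # "load" or "store"
    var: str  # name of the variable accessed (hypothetical representation)

def mark_coherent(instrs, coherent_vars):
    """One boolean per instruction: True if it accesses a coherent shared variable."""
    return [ins.var in coherent_vars for ins in instrs]

def partition(instrs, coherent_vars):
    """Split the stream into maximal runs of coherent / private instructions.

    Assumption: the simplest possible 'preset rule', where every maximal run
    of coherent instructions forms one coherent fragment.
    """
    fragments = []
    for ins, is_coh in zip(instrs, mark_coherent(instrs, coherent_vars)):
        kind = "coherent" if is_coh else "private"
        if fragments and fragments[-1][0] == kind:
            fragments[-1][1].append(ins)   # extend the current fragment
        else:
            fragments.append((kind, [ins]))  # start a new fragment
    return fragments

# Loop body in which only `sum` is a coherent shared variable.
loop_body = [
    MemAccess("load", "a"),
    MemAccess("load", "sum"),
    MemAccess("store", "sum"),
    MemAccess("store", "b"),
]
frags = partition(loop_body, {"sum"})
```

Here `frags` holds a private fragment, a coherent fragment with the two accesses to `sum`, and a second private fragment.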
In one embodiment, obtaining, by the compiler in response to thread-level automatic parallelization compilation of a loop, all coherent shared variables comprises at least one of the following:
obtaining a cache coherence setting command from a directive of the automatic parallelization compilation, and obtaining all coherent shared variables from the cache coherence setting command;
obtaining attribute information of each variable, determining from the attribute information whether each variable is a coherent shared variable, and thereby determining all coherent shared variables.
In one embodiment, the method further comprises:
obtaining the current compilation options of the compiler and obtaining a cache coherence compilation setting from them; when the cache coherence compilation setting turns cache coherence optimization off, all variables are treated as coherent shared variables.
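The three sources of coherent shared variables described above can be combined as in the sketch below. The function name, the parameters, and the attribute value `"coherent_shared"` are hypothetical; the disclosure does not specify how directives or attributes are encoded.

```python
def coherent_shared_variables(all_vars, directive_vars=None, attributes=None,
                              optimize=True):
    """Return the set of variables to treat as coherent shared variables.

    all_vars       -- every variable referenced in the loop
    directive_vars -- variables named in a cache coherence setting command
    attributes     -- mapping variable -> attribute string (hypothetical format)
    optimize       -- False models the 'optimization turned off' compile option
    """
    if not optimize:
        # Conservative fallback: every variable is kept coherent.
        return set(all_vars)
    result = set()
    if directive_vars:
        result |= set(directive_vars) & set(all_vars)
    if attributes:
        result |= {v for v in all_vars if attributes.get(v) == "coherent_shared"}
    return result
```

With optimization off the function returns every variable, which reproduces the behaviour of a processor that always maintains coherence.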
In one embodiment, the target processor having the function of maintaining inter-core cache coherence and supporting execution of memory access instructions with cache coherence turned off comprises at least one of the following:
the target processor provides a cache coherence enable instruction and a cache coherence disable instruction, wherein the cache coherence enable instruction sets the state of the cache coherence protocol of the current processor core to on, and the cache coherence disable instruction sets the state of the cache coherence protocol of the current processor core to off;
the target processor provides cache-coherent memory access instructions, private memory access instructions or ordinary memory access instructions, wherein a cache-coherent memory access instruction always initiates a cache coherence operation request when executed, a private memory access instruction never initiates a cache coherence operation request when executed, and an ordinary memory access instruction initiates a cache coherence operation request when the cache coherence protocol of the current processor core is on and does not initiate one when the protocol is off; wherein, after a memory access instruction initiates a cache coherence operation request, the hardware system determines, from the coherence protocol state of the cache block corresponding to the request, whether the current processor core is to initiate a cache coherence operation.
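The rule for when each kind of memory access instruction issues a coherence request reduces to a small truth table. The sketch below encodes it; the enum and function names are illustrative, and the subsequent hardware decision based on the cache block state is not modelled.

```python
from enum import Enum

class Kind(Enum):
    COHERENT = "coherent"   # always issues a coherence request
    PRIVATE = "private"     # never issues a coherence request
    ORDINARY = "ordinary"   # follows the per-core protocol state

def initiates_request(kind, protocol_on):
    """True if executing an instruction of this kind issues a coherence request."""
    if kind is Kind.COHERENT:
        return True
    if kind is Kind.PRIVATE:
        return False
    return protocol_on
```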
In one embodiment, the method further comprises:
when the cache coherence protocol of the current processor core is off and a memory access instruction other than a cache-coherent memory access instruction is executed, no cache coherence operation is initiated toward the cache coherence hardware system;
the cache coherence disable instruction has a response-on mode and a response-off mode;
after the cache coherence disable instruction is executed in the response-on mode, the current processor core still always responds to cache coherence operations initiated by other processor cores;
after the cache coherence disable instruction is executed in the response-off mode, the current processor core ignores cache coherence operations initiated by other processor cores.
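The two modes of the disable instruction can be pictured as two independent flags per core. The following sketch is a toy state model with hypothetical names; in particular, it assumes that re-enabling coherence also restores responses to other cores, which the text does not state explicitly.

```python
from dataclasses import dataclass

@dataclass
class Core:
    protocol_on: bool = True        # does this core issue coherence requests?
    respond_to_others: bool = True  # does it answer operations from other cores?

    def cc_on(self):
        """Cache coherence enable instruction (assumed to restore responses too)."""
        self.protocol_on = True
        self.respond_to_others = True

    def cc_off(self, respond=True):
        """Cache coherence disable instruction.

        respond=True  models the response-on mode,
        respond=False models the response-off mode.
        """
        self.protocol_on = False
        self.respond_to_others = respond

    def handles_remote_operation(self):
        """True if a coherence operation from another core is serviced."""
        return self.respond_to_others
```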
In one embodiment, generating for each coherent instruction stream fragment a first target instruction subsequence that maintains cache coherence comprises at least one of the following:
each coherent instruction in each coherent instruction stream fragment corresponds to one cache-coherent memory access instruction or one ordinary memory access instruction in the first target instruction subsequence;
the first instruction of the first target instruction subsequence of a coherent instruction stream fragment is a cache coherence enable instruction, or the last instruction of the first target instruction subsequence of a coherent instruction stream fragment is a cache coherence disable instruction.
In one embodiment, generating for each private instruction stream fragment a second target instruction subsequence that does not maintain cache coherence comprises at least one of the following:
each memory access instruction in each private instruction stream fragment corresponds to one private memory access instruction or one ordinary memory access instruction in the second target instruction subsequence;
the first instruction of the second target instruction subsequence of a private instruction stream fragment is a cache coherence disable instruction, or the last instruction of the second target instruction subsequence of a private instruction stream fragment is a cache coherence enable instruction.
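The two code generation options above can be contrasted in a short sketch: one brackets ordinary memory instructions with enable/disable instructions, the other selects a dedicated instruction per access. The mnemonics `cc_on`, `cc_off`, `.coh` and `.priv` are invented for illustration, as is the assumption that coherence is off on loop entry.

```python
def emit_mode_switch(fragments):
    """Style 1: ordinary memory instructions, bracketed by on/off instructions."""
    out = []
    state_on = False  # assumption: coherence is off when the loop is entered
    for kind, instrs in fragments:
        want_on = (kind == "coherent")
        if want_on != state_on:
            # Switch the per-core protocol state only at fragment boundaries.
            out.append("cc_on" if want_on else "cc_off")
            state_on = want_on
        out.extend(f"{op} {var}" for op, var in instrs)
    return out

def emit_dedicated(fragments):
    """Style 2: coherent / private memory instructions, no mode switching."""
    out = []
    for kind, instrs in fragments:
        suffix = ".coh" if kind == "coherent" else ".priv"
        out.extend(f"{op}{suffix} {var}" for op, var in instrs)
    return out

fragments = [
    ("private", [("load", "a")]),
    ("coherent", [("load", "sum"), ("store", "sum")]),
    ("private", [("store", "b")]),
]
```

Style 1 pays for each state switch, which is why fewer, larger coherent fragments can be attractive; style 2 has no switching cost but requires the processor to offer the dedicated instruction forms.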
In one embodiment, determining the coherent instruction stream fragments corresponding to the coherent instructions and the private instruction stream fragments that contain no coherent instruction comprises:
each coherent instruction stream fragment contains one or more instructions, and when it contains only one instruction, that instruction is a coherent instruction;
every instruction stream fragment of the loop outside all the coherent instruction stream fragments is determined to be a private instruction stream fragment.
In one embodiment, the step of determining, according to preset rules, the coherent instruction stream fragments corresponding to the coherent instructions comprises:
obtaining the data dependencies between the coherent instructions;
aggregating at least two coherent instructions that have a data dependency into the same coherent instruction stream fragment.
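One way to picture this aggregation is to group coherent instructions into connected components of the dependence relation. The sketch below uses a small union-find for that; the representation (instruction indices and dependence pairs) is an assumption, and how the grouped instructions are then laid out in the stream is not shown.

```python
def group_by_dependence(coherent_ids, deps):
    """Group coherent instructions that are linked by data dependencies.

    coherent_ids -- indices of the coherent instructions
    deps         -- iterable of (i, j) pairs with a data dependency between i and j
    Returns a sorted list of sorted groups; each group is a candidate fragment.
    """
    parent = {i: i for i in coherent_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in deps:
        if a in parent and b in parent:    # ignore non-coherent instructions
            parent[find(a)] = find(b)

    groups = {}
    for i in coherent_ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())
```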
In one embodiment, the step of determining, according to preset rules, the coherent instruction stream fragments corresponding to the coherent instructions comprises:
obtaining the data dependencies between the coherent instructions;
scheduling at least two coherent instructions that have no data dependency on each other to adjacent positions;
aggregating the adjacent coherent instructions after scheduling into the same coherent instruction stream fragment.
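A simple way to bring independent coherent instructions together is a greedy list scheduler that, among the instructions whose predecessors have all been emitted, prefers one of the same kind as the instruction just emitted. This is only one possible heuristic, chosen here for illustration; the disclosure does not prescribe a scheduling algorithm, and the input format is hypothetical.

```python
def schedule_adjacent(is_coherent, deps):
    """Reorder instructions so that coherent ones tend to become adjacent.

    is_coherent -- list of booleans, one per instruction in program order
    deps        -- set of (earlier, later) index pairs whose order must be kept;
                   must be acyclic, otherwise no instruction becomes ready
    Returns the new order as a list of original indices.
    """
    n = len(is_coherent)
    preds = {i: set() for i in range(n)}
    for a, b in deps:
        preds[b].add(a)

    done, order = set(), []
    last_kind = None
    while len(order) < n:
        # Instructions whose dependencies are all satisfied.
        ready = [i for i in range(n) if i not in done and preds[i] <= done]
        # Prefer to continue the current run (coherent or private).
        same = [i for i in ready if is_coherent[i] == last_kind]
        pick = (same or ready)[0]
        order.append(pick)
        done.add(pick)
        last_kind = is_coherent[pick]
    return order
```

For four instructions alternating coherent and private, with the only dependency between the two private ones, the two coherent instructions end up next to each other and can share one fragment.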
In one embodiment, the step of determining, according to preset rules, the coherent instruction stream fragments corresponding to the coherent instructions comprises:
determining the coherent instructions that correspond to a coherent shared variable across multiple iterations of the loop;
aggregating the coherent instructions that correspond to the coherent shared variable across the multiple iterations into the same coherent instruction stream fragment.
In one embodiment, the step of determining, according to preset rules, the coherent instruction stream fragments corresponding to the coherent instructions comprises:
according to the number N of coherent shared variables, creating a temporary array with N elements;
every N iterations, reading N elements of each coherent shared variable into the temporary array at once, or assigning the N elements of the temporary array to the coherent shared variable at once; and replacing the instructions that access each coherent shared variable with array access instructions on the temporary array;
adding the memory access instructions corresponding to the one-time read and the one-time assignment to the coherent instruction stream fragment.
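The staging idea can be shown with a plain Python loop: the shared data is touched only in bursts of `n` elements, and the per-iteration work reads a private temporary buffer instead. This is a behavioural sketch under assumptions (a single shared array, a read-only reduction, `n` chosen freely); it counts bursts rather than modelling real coherence traffic.

```python
def sum_with_staging(shared, n):
    """Sum `shared` while accessing it only in bulk reads of n elements.

    Returns (total, coherent_bursts), where coherent_bursts is the number of
    bulk reads, i.e. the number of coherent fragments that would be executed.
    """
    total = 0
    coherent_bursts = 0
    for start in range(0, len(shared), n):
        tmp = shared[start:start + n]  # coherent fragment: one bulk read
        coherent_bursts += 1
        for x in tmp:                  # private fragment: only tmp is accessed
            total += x
    return total, coherent_bursts
```

Without staging, each of the ten iterations in the example `sum_with_staging(list(range(10)), 4)` would contain its own coherent access; with staging there are three coherent bursts.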
In one embodiment, the step of determining, according to preset rules, the coherent instruction stream fragments corresponding to the coherent instructions comprises:
obtaining at least two coherent instructions that lie within a preset range in execution order, together with the memory access instructions between those coherent instructions, and determining them to be instructions to be merged;
calculating the coherence operation overhead that the memory access instructions between the coherent instructions within the preset range would incur if the instructions to be merged were merged into one coherent instruction stream fragment, to obtain a first operation overhead;
calculating the operation overhead saved by switching the cache coherence protocol state on and off fewer times if the instructions to be merged were merged into one coherent instruction stream fragment, to obtain a second operation overhead;
when the first operation overhead is less than the second operation overhead, determining that the instructions to be merged are merged into one coherent instruction stream fragment.
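The merge decision is a comparison of two estimated costs. The sketch below assumes a linear cost model with hypothetical per-access and per-switch costs; a real compiler would obtain these figures from a model of the target processor.

```python
def should_merge(n_between, cost_per_coherent_access,
                 n_switches_saved, cost_per_switch):
    """Decide whether to merge coherent instructions into one fragment.

    first_overhead  -- extra coherence cost paid by the memory accesses that
                       sit between the coherent instructions once merged
    second_overhead -- cost saved by switching the protocol on/off less often
    """
    first_overhead = n_between * cost_per_coherent_access
    second_overhead = n_switches_saved * cost_per_switch
    return first_overhead < second_overhead
```

For example, with 3 intervening accesses at 10 cycles each and 2 saved switches at 40 cycles each, merging costs 30 and saves 80, so the fragments are merged.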
In one embodiment, the step of determining, according to preset rules, the coherent instruction stream fragments corresponding to the coherent instructions comprises:
obtaining the number of memory access instructions between at least two coherent instructions that lie within a preset range in execution order, and determining it to be the number of other memory access instructions;
when the number of other memory access instructions is less than or equal to a preset instruction count threshold, merging the at least two coherent instructions within the preset range in execution order, together with the memory access instructions between them, into one coherent instruction stream fragment.
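The threshold rule can be sketched as a single pass over the memory access instructions in execution order. The encoding (`"C"` for a coherent instruction, `"P"` for any other memory access) is an assumption, and the "preset range" is not modelled separately.

```python
def merge_by_threshold(kinds, threshold):
    """Merge coherent instructions separated by at most `threshold` other accesses.

    kinds -- list of "C" (coherent) or "P" (other memory access) in execution order
    Returns inclusive (start, end) index ranges of the coherent fragments.
    """
    fragments = []
    for i, k in enumerate(kinds):
        if k != "C":
            continue
        # Number of other memory accesses since the previous coherent instruction.
        if fragments and i - fragments[-1][1] - 1 <= threshold:
            fragments[-1][1] = i          # absorb the gap into the fragment
        else:
            fragments.append([i, i])      # start a new fragment
    return [tuple(f) for f in fragments]
```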
In one embodiment, the step of determining, according to preset rules, the coherent instruction stream fragments corresponding to the coherent instructions comprises:
determining a mutual-exclusion region (critical section) in the cache area, and merging all instructions of the mutual-exclusion region into one coherent instruction stream fragment.
In one embodiment, the method further comprises:
in response to execution of a data prefetch instruction, detecting whether, among the memory access instructions that triggered the data prefetch instruction, at least one is a cache-coherent memory access instruction or corresponds to the on state of the cache coherence protocol;
when, among the memory access instructions that triggered the data prefetch instruction, at least one is a cache-coherent memory access instruction or corresponds to the on state of the cache coherence protocol, determining that a cache coherence operation request is initiated when the data prefetch instruction is executed;
when all the memory access instructions that triggered the data prefetch instruction are private memory access instructions or correspond to the off state of the cache coherence protocol, determining that no cache coherence operation request is initiated when the data prefetch instruction is executed.
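The prefetch rule can be written as a predicate over the accesses that triggered the prefetch. The sketch below reads "corresponds to the on/off state of the protocol" as applying to ordinary memory access instructions; that reading, the string encoding of instruction kinds, and the function name are assumptions.

```python
def prefetch_initiates_request(triggers):
    """Decide whether a data prefetch issues a cache coherence request.

    triggers -- iterable of (kind, protocol_on) for the triggering accesses,
                where kind is "coherent", "private" or "ordinary"
    """
    return any(
        kind == "coherent" or (kind == "ordinary" and protocol_on)
        for kind, protocol_on in triggers
    )
```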
In a second aspect, the present disclosure provides a cache coherence optimization apparatus in automatic thread-level parallelization, comprising:
a shared variable acquisition module, configured to obtain coherent shared variables;
an instruction determination module, configured to find, according to the coherent shared variables, the memory access instructions that access the coherent shared variables among the memory access instructions, and to determine them to be coherent instructions;
an instruction stream fragment determination module, configured to determine, according to preset rules, the coherent instruction stream fragments corresponding to the coherent instructions;
an instruction insertion module, configured to insert a cache coherence enable instruction before each coherent instruction stream fragment in execution order and a cache coherence disable instruction after each coherent instruction stream fragment in execution order, wherein the cache coherence enable instruction sets the state of the cache coherence protocol of the current processor core to on, and the cache coherence disable instruction sets the state of the cache coherence protocol of the current processor core to off.
In a third aspect, the present disclosure provides a computer device, comprising a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the steps of the method described in the above aspects.
In a fourth aspect, the present disclosure provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the method described in the above aspects.
In a fifth aspect, the present disclosure provides a computer program product, comprising a computer program/instructions that, when executed by a processor, implement the steps of the method described in the above aspects.
The cache coherence optimization method, apparatus, device and storage medium in automatic thread-level parallelization provided by the present disclosure can keep coherent instruction stream fragments as small as possible, effectively reducing the cache coherence overhead added to private variable accesses and its impact. In addition, because cache coherence is switched on and off as few times as possible, the cache coherence enable and disable instructions are invoked less often, which avoids excessive additional overhead and further reduces the overall overhead.
Brief Description of the Drawings
FIG. 1 is a schematic flow chart of a cache coherence optimization method in automatic thread-level parallelization provided by an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a cache coherence optimization apparatus in automatic thread-level parallelization provided by an embodiment of the present disclosure;
FIG. 3A shows the code of an OpenMP parallel program;
FIG. 3B shows coherent program fragments specified in a coarse-grained manner;
FIG. 3C shows coherent program fragments specified in a fine-grained manner;
FIG. 3D shows coherent shared variables specified through directives of the automatic parallelization compilation, as provided by an embodiment of the present disclosure.
Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein serve only to explain the present application and do not limit it.
To enable those skilled in the art to better understand the technical solutions of the present disclosure, and to fully understand and implement how the present disclosure applies technical means to solve technical problems and achieve the corresponding technical effects, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings in those embodiments. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them. The embodiments of the present disclosure and the features in the embodiments may be combined with one another provided they do not conflict, and the resulting technical solutions all fall within the scope of protection of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present disclosure without creative effort shall fall within the scope of protection of the present disclosure.
It should be noted that the terms "first", "second" and the like in the specification, the claims and the above drawings of the present disclosure are used to distinguish similar objects and not necessarily to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present disclosure described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
It should be noted that the steps shown in the flow charts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flow charts, in some cases the steps shown or described may be executed in a different order.
There are currently two main kinds of cache coherence protocol: snooping-based coherence protocols (hereinafter, snooping protocols) and directory-based coherence protocols (hereinafter, directory protocols). A snooping protocol relies on a bus or bus-like network connection. Over this connection, every request issued by the private cache of one processor core is broadcast to the private caches of all other processor cores in the system, and the access requests of all processors can be ordered on the bus, which satisfies the memory access ordering requirements of the cache coherence model and the memory consistency model. A snooping protocol can also use the bus structure to handle multiple conflicting requests for the same data block, and the private caches of multiple processors can communicate directly over the bus, which reduces communication latency. However, because all requests are transmitted over the bus and bus bandwidth is limited, the bus restricts the scalability of the whole system. A directory protocol instead uses a directory structure to manage cache blocks (cache lines). In a directory protocol, a memory access request issued by the private cache of a processor core is first sent to the directory that owns the corresponding cache block. The directory records the current sharing status of the cache block, and the directory controller, according to the current state of the cache block, either responds to the request or forwards it to the private caches of the other relevant processor cores. This approach does not depend on a network with a specific topology, and its point-to-point communication reduces bandwidth consumption in the network, so it scales easily. However, because every request in a directory protocol must be processed through the directory, additional latency is introduced.
For example, commercial processors from companies such as AMD mainly use snooping protocols to achieve cache coherence. A single CPU now has nearly 100 cores, and a computing node of a supercomputer usually has two CPUs, so a snooping protocol has to support cache coherence across nearly 200 processor cores. This not only complicates the design and layout of the bus, but also makes bus bandwidth a more pronounced constraint on the performance of parallel programs. As the number of processor cores in one CPU or one computing node grows, directory protocols also run into bottlenecks, such as ever larger directories and longer access latency. The overhead of maintaining cache coherence therefore not only affects the performance of parallel programs but also increases the pressure on hardware system design.
To reduce the overhead that cache coherence imposes on applications, the prior art has proposed a method for reducing redundant cache coherence operations (China Patent No.: 202410831240.X), in which the current thread of an application issues a cache coherence off instruction and a cache coherence on instruction to turn the cache coherence protocol of the current processor core off and on. That technique requires the programmer or the compiler to insert the cache coherence off and on instructions into the program code or assembly code. To minimize the impact of cache coherence overhead, cache coherence can be off by default for each executing thread and turned on only while the thread runs instructions that access variables shared between processes or threads. Although this avoids coherence operations on accesses to private variables, it has a drawback: the cache coherence protocol is turned on and off frequently, which increases system overhead.
To this end, in the scheme of the present application, the compiler, in response to thread-level automatic parallelization of a loop, obtains a cache coherence setting command from the directives of the automatic parallelization and obtains all coherent shared variables from that command. From the complete instruction stream of the loop generated by thread-level automatic parallelization and optimization, it finds every instruction that accesses any coherent shared variable and determines all coherent instruction stream fragments (that is, sections of the instruction stream for which cache coherence must be maintained). Cache coherence operations are performed when instructions inside a coherent instruction stream fragment execute, and are not performed when instructions outside those fragments execute.
In this way, the coherent instruction stream fragments can be kept as small as possible, which effectively reduces the cache coherence overhead added by private-variable accesses. In addition, because cache coherence alternates between on and off as few times as possible, the cache coherence on and off instructions are invoked less often, which avoids excessive extra overhead and reduces the total overhead further.
Because multiple threads of the same process share the process's address space, thread-level parallelism usually produces many coherent instruction stream fragments. In a general-purpose environment with caches and a coherence protocol, thread-level parallelism is often implemented conveniently with compiler directives such as OpenMP and the compiler's automatic parallelization. For example, FIG. 3A shows a typical OpenMP parallel program with two loops: the first loop computes the shared variable A, and the second loop uses the shared variable A and computes the shared variable B. FIG. 3B shows a coarse-grained way of explicitly specifying a coherent program fragment (a source program fragment for which cache coherence must be maintained) for the thread-level parallel program in FIG. 3A: because both loops access the shared variable A, and the value of A computed by the first loop is used by the second loop, both loops are placed in the same coherent program fragment. With this coarse-grained approach, the computations on private variables in both loops also trigger cache coherence operations, so the cache coherence overhead is not effectively reduced. FIG. 3C shows a fine-grained way of explicitly specifying coherent program fragments for the program in FIG. 3A: only the small amount of computation that involves shared variables in each loop is set as a coherent program fragment, and a cache coherence on instruction and a cache coherence off instruction are inserted before and after almost every instruction that involves a shared variable. Although this fine-grained approach minimizes the cache coherence overhead, every loop body executes the cache coherence on and off instructions, which adds considerable extra overhead. The turning on and off of cache coherence therefore needs to be optimized, which means solving the problem of how to insert cache coherence on and off instructions automatically during automatic thread-level parallelization, with the goal of keeping the coherent instruction stream fragments as small as possible while executing the on and off instructions as few times as possible. The method provided by the present application is described in detail below through specific embodiments.
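The trade-off between the coarse-grained approach of FIG. 3B and the fine-grained approach of FIG. 3C can be made concrete with a simple count. The sketch below is an illustrative model, not the patent's method: it assumes a loop whose every iteration performs a fixed number of private and shared accesses, and the function and parameter names are hypothetical.

```python
def coarse_grained(iterations: int, private_per_iter: int, shared_per_iter: int):
    """Whole loop inside one coherent fragment: one on/off pair in total,
    but every access (private or shared) runs with coherence on."""
    toggles = 2
    coherent_accesses = iterations * (private_per_iter + shared_per_iter)
    return toggles, coherent_accesses


def fine_grained(iterations: int, private_per_iter: int, shared_per_iter: int,
                 regions_per_iter: int = 1):
    """Only shared accesses are wrapped: an on/off pair per region per iteration,
    and only shared accesses run with coherence on."""
    toggles = 2 * regions_per_iter * iterations
    coherent_accesses = iterations * shared_per_iter
    return toggles, coherent_accesses


if __name__ == "__main__":
    print(coarse_grained(1000, 8, 2))  # few toggles, many coherent accesses
    print(fine_grained(1000, 8, 2))    # many toggles, few coherent accesses
```

Neither extreme is ideal, which is why the method aims for small coherent fragments and few executions of the on and off instructions at the same time.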
Embodiment 1
FIG. 1 is a schematic flowchart of a cache coherence optimization method in automatic thread-level parallelization provided by an embodiment of the present disclosure. As shown in FIG. 1, the cache coherence optimization method in automatic thread-level parallelization includes:
Step 110: in response to thread-level automatic parallelization of a loop, the compiler obtains all coherent shared variables.
In this embodiment, a coherent shared variable is a variable in the cache that is accessed and processed by multiple processor cores, whereas a variable in the cache that can be accessed and processed by only one processor core is a private variable. The purpose of determining the coherent shared variables is to keep each such variable coherent across the processor cores that access and process it.
In this embodiment, the coherent shared variables are determined from among the variables in the cache. The process is as follows: in response to thread-level automatic parallelization of a loop, the compiler obtains a cache coherence setting command from the directives of the automatic parallelization and obtains all coherent shared variables from that command. This yields all coherent shared variables used by the threads of the loop.
The compiler translates the original high-level language program into assembly language or a binary program and applies compilation optimizations to make the program run faster. During optimization the compiler works on an intermediate representation, from which the variable corresponding to each memory access instruction in the instruction stream can be determined. The memory access instructions that access any coherent shared variable can therefore be found in the complete instruction stream.
Step 120: based on the coherent shared variables, find among the memory access instructions those that access a coherent shared variable, and determine them to be coherent instructions.
In this embodiment, the memory access instructions are multiple memory access instructions executed in sequence; they may also be the memory access instructions in the complete instruction stream of the loop that the compiler generates through thread-level automatic parallelization and optimization.
In this embodiment, all memory access instructions that access or process shared variables are found in a complete instruction stream and determined to be coherent instructions. In other words, a coherent instruction is a memory access instruction that accesses a shared variable.
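Step 120 can be sketched as a filter over the instruction stream. The example below is a minimal sketch, assuming that each memory access instruction already carries the variable name resolved from the compiler's intermediate representation; the `MemAccess` type and the function name are hypothetical and not part of the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemAccess:
    op: str   # "load" or "store"
    var: str  # variable name resolved from the intermediate representation


def find_coherent_instructions(stream, shared_vars):
    """Return the positions of instructions that access a coherent shared variable."""
    shared = set(shared_vars)
    return [i for i, ins in enumerate(stream) if ins.var in shared]


if __name__ == "__main__":
    stream = [
        MemAccess("load", "tmp"),
        MemAccess("load", "A"),
        MemAccess("store", "tmp"),
        MemAccess("store", "B"),
    ]
    print(find_coherent_instructions(stream, {"A", "B"}))
```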
The memory access instructions in this embodiment include both explicit memory access instructions such as load and store, and implicit memory accesses such as hardware data prefetches. It can be understood that, to keep the same data coherent across the caches of multiple processor cores, the cache coherence protocol of each processor core is on by default.
Step 130: according to a preset rule, determine the coherent instruction stream fragment corresponding to each coherent instruction and the private instruction stream fragments that contain no coherent instruction; generate, for each coherent instruction stream fragment, a first target instruction subsequence that maintains cache coherence, and generate, for each private instruction stream fragment, a second target instruction subsequence that does not maintain cache coherence. Both the first and the second target instruction subsequences correspond to a target processor; the target processor is capable of maintaining inter-core cache coherence and supports executing memory access instructions with cache coherence turned off.
In this embodiment, when the target processor has cache-coherent memory access instructions, the first target instruction subsequence of a coherent instruction stream fragment can be generated from them: for each coherent instruction in the fragment, a corresponding cache-coherent memory access instruction is generated in the first target instruction subsequence, so that cache coherence is maintained while multiple threads access the coherent shared variables in parallel during execution of that subsequence.
When the target processor has private memory access instructions, the second target instruction subsequence of a private instruction stream fragment can be generated from them: for each memory access instruction in the fragment, a corresponding private memory access instruction is generated in the second target instruction subsequence, so that no cache coherence operation request is issued during execution of that subsequence.
The first and second target instruction subsequences can also be generated using the cache coherence on and off instructions, in which case all memory access instructions in the target instruction subsequences can be ordinary memory access instructions (all memory access instructions on existing processors can be regarded as ordinary memory access instructions). In the complete instruction stream of the loop, coherent instruction stream fragments and private instruction stream fragments alternate. A cache coherence off instruction can therefore be inserted between each coherent instruction stream fragment and the private fragment that follows it, and a cache coherence on instruction between each private instruction stream fragment and the coherent fragment that follows it. In that case, the cache coherence off instruction is either the last instruction of the first target instruction subsequence or the first instruction of the second target instruction subsequence, and the cache coherence on instruction is either the first instruction of the first target instruction subsequence or the last instruction of the second target instruction subsequence.
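The insertion rule described above can be sketched as a single pass over the alternating fragments. This is a minimal illustration under assumptions: fragments are given as `(kind, instructions)` pairs, and `COH_ON` / `COH_OFF` are hypothetical mnemonics standing in for the cache coherence on and off instructions, whose real encodings the text does not specify.

```python
def insert_toggles(fragments):
    """fragments: list of (kind, instructions), kind is "coherent" or "private".
    Returns one flat instruction list with toggles at each fragment boundary."""
    out = []
    prev_kind = None
    for kind, instrs in fragments:
        if prev_kind == "coherent" and kind == "private":
            out.append("COH_OFF")  # leaving a coherent fragment
        elif prev_kind == "private" and kind == "coherent":
            out.append("COH_ON")   # entering a coherent fragment
        out.extend(instrs)
        prev_kind = kind
    return out


if __name__ == "__main__":
    frags = [("private", ["ld t"]), ("coherent", ["st A"]), ("private", ["ld u"])]
    print(insert_toggles(frags))
```

Whether a toggle is counted as part of the preceding or the following subsequence is a bookkeeping choice; the flat stream is the same either way.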
Herein, a group of memory access instructions consisting of coherent instructions and the other memory access instructions located between them is determined to be a coherent instruction stream fragment. A coherent instruction stream fragment may be the complete instruction stream of a loop.
In this embodiment, the complete instruction stream of the loop can be divided into coherent instruction stream fragments, which contain coherent instructions, and private instruction stream fragments, which contain none; that is, the complete instruction stream of the loop consists of several coherent fragments and several private fragments connected alternately. Once all coherent instruction stream fragments have been determined, all private instruction stream fragments are determined as well: every fragment of the instruction stream that lies outside the coherent fragments is a private instruction stream fragment.
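The alternating structure can be sketched by grouping consecutive instructions by whether they are coherent. The example below assumes the simplest possible rule, in which each maximal run of coherent instructions forms one coherent fragment and everything else is private; the patent's preset rule may group instructions differently, and the function name is hypothetical.

```python
from itertools import groupby


def split_fragments(is_coherent):
    """is_coherent: one boolean per instruction in program order.
    Returns alternating (kind, start, end) runs, with end exclusive."""
    fragments = []
    pos = 0
    for flag, run in groupby(is_coherent):
        length = len(list(run))
        fragments.append(("coherent" if flag else "private", pos, pos + length))
        pos += length
    return fragments


if __name__ == "__main__":
    print(split_fragments([False, True, True, False, False, True]))
```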
Because a cache-coherent memory access instruction does not change the cache coherence state of the current processor core, the extra overhead of executing it can be ignored. Therefore, when the target processor provides cache-coherent memory access instructions, a coherent instruction stream fragment may contain only one instruction, namely the coherent instruction itself.
In this embodiment, the coherent instruction stream fragments corresponding to the coherent instructions and the private instruction stream fragments that contain no coherent instruction can be delimited by inserting cache coherence on and off instructions. The size of the resulting coherent instruction stream fragments is therefore determined by how the on and off instructions are inserted. The cache coherence on instruction can be inserted before a coherent instruction stream fragment, as the first instruction of the first target instruction subsequence, or after a private instruction stream fragment, as the last instruction of the second target instruction subsequence. The cache coherence off instruction can be inserted after a coherent instruction stream fragment, as the last instruction of the first target instruction subsequence, or before a private instruction stream fragment, as the first instruction of the second target instruction subsequence. This is elaborated further in a later embodiment.
Here, the cache coherence on instruction sets the state of the cache coherence protocol of the current processor core to on, and the cache coherence off instruction sets it to off.
When a coherent instruction stream fragment is too large, the cache coherence overhead tends to have a large impact; when it is too small, the cache coherence on and off instructions may be invoked frequently, which also adds considerable extra overhead. Determining the coherent instruction stream fragments according to a preset rule makes it possible to keep the fragments as small as possible while executing the on and off instructions as few times as possible.
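One way to picture such a rule is a merge threshold: coherent instructions separated by only a few private instructions share one fragment, while distant ones get their own. The sketch below is purely an assumption for illustration; the `max_gap` parameter and the merging policy are hypothetical and are not stated in the source.

```python
def merge_regions(coherent_idx, max_gap):
    """coherent_idx: positions of coherent instructions in the stream.
    Two consecutive coherent instructions share a fragment when at most
    max_gap non-coherent instructions lie between them.
    Returns (first, last) positions per fragment, both inclusive."""
    regions = []
    for i in sorted(coherent_idx):
        if regions and i - regions[-1][1] - 1 <= max_gap:
            regions[-1][1] = i          # extend the current fragment
        else:
            regions.append([i, i])      # start a new fragment
    return [tuple(r) for r in regions]


if __name__ == "__main__":
    idx = [2, 3, 10, 12]
    for gap in (0, 1, 10):
        print(gap, merge_regions(idx, gap))
```

A small threshold yields more, smaller fragments (more on/off pairs); a large threshold yields fewer, larger fragments (more private accesses executed with coherence on).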
In one embodiment, the step of determining, according to the preset rule, the coherent instruction stream fragment corresponding to each coherent instruction includes:
inserting a cache coherence on instruction before each coherent instruction stream fragment in execution order, and inserting a cache coherence off instruction after each coherent instruction stream fragment in execution order, where the cache coherence on instruction, the cache coherence off instruction, and the instructions located between them form the first target instruction subsequence.
In this embodiment, a fragment of the instruction stream that contains no coherent instruction is a private instruction stream fragment; it lies outside the coherent instruction stream fragments, or executes interleaved with them. The instructions of a private instruction stream fragment form the second target instruction subsequence.
In this embodiment, the state of a processor core's cache coherence protocol can be changed with the cache coherence on instruction and the cache coherence off instruction. The cache coherence on instruction turns the protocol on: it leaves the core's cache coherence protocol in the on state, whether that means switching from off to on or remaining on. The cache coherence off instruction turns the protocol off: it leaves the core's cache coherence protocol in the off state, whether that means switching from on to off or remaining off. Specifically, the on and off instructions can be inserted into the program code or into the program's execution flow. In response to a cache coherence on instruction issued by the current thread, the current processor core sets the state of its cache coherence protocol to on; in response to a cache coherence off instruction issued by the current thread, it sets the state to off.
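The on and off instructions behave like idempotent setters on a per-core flag. The following is a minimal sketch of that behavior; the `Core` class and the `COH_ON` / `COH_OFF` mnemonics are hypothetical names chosen for illustration.

```python
class Core:
    """Per-core cache coherence state; on by default."""

    def __init__(self, coherence_on: bool = True):
        self.coherence_on = coherence_on

    def execute(self, instr: str) -> None:
        if instr == "COH_ON":
            self.coherence_on = True    # off -> on, or stays on
        elif instr == "COH_OFF":
            self.coherence_on = False   # on -> off, or stays off
        # Any other instruction leaves the coherence state unchanged.


if __name__ == "__main__":
    core = Core()
    for instr in ["COH_OFF", "COH_OFF", "ld t", "COH_ON"]:
        core.execute(instr)
        print(instr, core.coherence_on)
```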
In this embodiment, the cache coherence on and off instructions can be inserted into the program code by the programmer or the compiler, inserted into the program's execution flow by the operating system when it performs process-related or thread-related operations, or written into certain function libraries. For redundant cache coherence operations in a parallel program, the cache coherence off instruction can set the cache coherence protocol of the current processor core to off so that those redundant operations are avoided. This improves the performance of applications running in parallel and relieves the load on the cache coherence bus or directory, which creates an opportunity to lower their operating frequency and thus the power consumption of the CPU.
In this embodiment, a cache coherence on instruction is inserted before each coherent instruction stream fragment and a cache coherence off instruction after it. The cache coherence protocol is thus turned on and off around each instruction stream fragment that accesses coherent shared variables. While a coherent instruction stream fragment executes, the cache coherence protocol is on, and memory access instructions perform inter-core cache coherence operations; that is, memory access instructions execute as required by the inter-core cache coherence protocol. Specifically, on a processor that implements cache coherence with a snooping protocol, memory access instructions execute according to the snooping protocol; on a processor that implements it with a directory protocol, they execute according to the directory protocol. After the coherent instruction stream fragment has executed, the cache coherence protocol is off, so accesses that do not involve coherent shared variables generate no cache coherence operations.
For the other memory access instructions, which lie outside the coherent instruction stream fragments and do not access coherent shared variables, the cache coherence protocol is not turned on or off. This lowers the frequency with which the cache coherence on and off instructions are invoked, avoids excessive extra overhead, and reduces the total overhead further.
In this embodiment, the coherent instruction stream fragment corresponding to each coherent instruction is determined according to the preset rule, and cache coherence on and off instructions are inserted before and after each coherent instruction stream fragment. The coherent instruction stream fragments are thereby kept as small as possible, which effectively reduces the cache coherence overhead added by private-variable accesses. In addition, because the on and off instructions execute as few times as possible, they are invoked less often, which avoids excessive extra overhead and reduces the total overhead further.
Embodiment 2
To obtain or determine all coherent shared variables, on the basis of any of the above embodiments, in the cache coherence optimization method in automatic thread-level parallelization provided by this embodiment, the compiler obtaining all coherent shared variables in response to thread-level automatic parallelization of a loop includes at least one of the following:
obtaining a cache coherence setting command from the directives of the automatic parallelization, and obtaining all coherent shared variables from the cache coherence setting command;
obtaining attribute information of each variable, determining from the attribute information whether each variable is a coherent shared variable, and thereby determining all coherent shared variables.
In this embodiment, the cache coherence setting command can be looked up in the directives of the compiler's automatic parallelization; the variables named by that command are the coherent shared variables, so all coherent shared variables are obtained from the directives. Alternatively, attribute information can be obtained for each variable and used to decide whether that variable is a coherent shared variable, so that all coherent shared variables are identified from among all variables.
On the basis of any of the above embodiments, the cache coherence optimization method in automatic thread-level parallelization provided by this embodiment further includes: obtaining the current compilation options of the compiler, obtaining the cache coherence compilation setting from those options, and, when the cache coherence compilation setting turns cache coherence optimization off, treating all variables as coherent shared variables.
In this embodiment, the cache coherence compilation setting is obtained from the compiler's compilation options. When that setting turns cache coherence optimization off, all variables are determined to be coherent shared variables, which yields the full set of coherent shared variables.
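The ways of obtaining the coherent shared variables, including the fallback when the optimization is turned off, can be combined as in the sketch below. It is an illustration under assumptions: the directive's variable list and the per-variable attribute are passed in as plain Python values, and the names `directive_vars`, `attributes`, and `optimization_enabled` are hypothetical rather than actual compiler options.

```python
def coherent_shared_variables(all_vars, directive_vars=None, attributes=None,
                              optimization_enabled=True):
    """Return the set of variables to treat as coherent shared variables.

    all_vars: every variable used by the loop.
    directive_vars: variables named by the cache coherence setting command, if any.
    attributes: optional mapping var -> True when the variable is marked shared.
    optimization_enabled: False means cache coherence optimization is off.
    """
    if not optimization_enabled:
        return set(all_vars)  # conservative: everything is coherent shared
    shared = set(directive_vars or [])
    if attributes:
        shared |= {v for v, is_shared in attributes.items() if is_shared}
    return shared


if __name__ == "__main__":
    variables = ["A", "B", "i", "tmp"]
    print(sorted(coherent_shared_variables(variables, directive_vars=["A", "B"])))
    print(sorted(coherent_shared_variables(variables, optimization_enabled=False)))
```

Treating every variable as coherent shared when the optimization is off means every access runs with coherence maintained, which matches ordinary behavior without the optimization.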
On the basis of any of the above embodiments, in the cache coherence optimization method in automatic thread-level parallelization provided by this embodiment, the target processor being capable of maintaining inter-core cache coherence and supporting execution of memory access instructions with cache coherence turned off includes at least one of the following:
the target processor provides a cache coherence on instruction and a cache coherence off instruction, where the cache coherence on instruction sets the state of the cache coherence protocol of the current processor core to on, and the cache coherence off instruction sets it to off;
the target processor provides cache-coherent memory access instructions, private memory access instructions, or ordinary memory access instructions, where a cache-coherent memory access instruction always issues a cache coherence operation request when executed, a private memory access instruction never issues a cache coherence operation request when executed, and an ordinary memory access instruction issues a cache coherence operation request when the cache coherence protocol of the current processor core is on and does not issue one when it is off. After a memory access instruction issues a cache coherence operation request, the hardware system determines, from the coherence protocol state of the cache block that the request targets, whether the current processor core actually needs to initiate a cache coherence operation.
It should be understood that, as processor technology develops, memory access instructions that independently control whether cache coherence is maintained may appear after this application. For example: a memory access instruction that issues a cache coherence operation request regardless of the cache coherence state (on or off) of the current processor core is called a cache-coherent memory access instruction; a memory access instruction that never issues a cache coherence operation request regardless of that state is called a private memory access instruction; and a memory access instruction for which the cache coherence state of the current processor core decides whether a request is issued is called an ordinary memory access instruction. None of these three kinds of instruction changes the cache coherence state of the current processor core, so each of them can execute out of order with respect to other memory access instructions.
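The three kinds of memory access instruction differ only in when they issue a cache coherence operation request, which can be written as a small decision function. This is a sketch of the behavior described above; the enum and function names are hypothetical.

```python
from enum import Enum


class AccessKind(Enum):
    COHERENT = "cache-coherent memory access"
    PRIVATE = "private memory access"
    ORDINARY = "ordinary memory access"


def issues_coherence_request(kind: AccessKind, coherence_on: bool) -> bool:
    """Whether executing this instruction issues a cache coherence operation request."""
    if kind is AccessKind.COHERENT:
        return True           # always, regardless of the core's state
    if kind is AccessKind.PRIVATE:
        return False          # never, regardless of the core's state
    return coherence_on       # ordinary: follows the core's current state


if __name__ == "__main__":
    for kind in AccessKind:
        for state in (True, False):
            print(kind.name, state, issues_coherence_request(kind, state))
```

Note that issuing a request does not necessarily lead to a coherence operation: as stated above, the hardware still decides from the state of the targeted cache block.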
Therefore, in this embodiment, coherent instruction stream segments and private instruction stream segments can be delimited either by inserting cache coherence enable and disable instructions, or by using cache-coherent, private, or ordinary memory access instructions. For example, at least two cache-coherent memory access instructions, together with the ordinary memory access instructions located between them, are grouped into a coherent instruction stream segment, while every other single private memory access instruction or run of consecutive private memory access instructions is grouped into a private instruction stream segment; all private instruction stream segments lie outside the coherent instruction stream segments. In this way, the coherent instruction stream segments that contain coherence instructions are clearly separated from the private instruction stream segments that do not. The instructions of each coherent instruction stream segment form a first target instruction subsequence, and the instructions of each private instruction stream segment form a second target instruction subsequence.
According to Chinese patent application 202410831240.X, when the cache coherence protocol of the current processor core is disabled, one or both of the following apply while a memory access instruction executes: the current processor core does not initiate any cache coherence operation toward the cache coherence hardware system; the current processor core ignores any cache coherence operation initiated by other processor cores. In addition, when the cache coherence protocol is enabled and a memory access instruction hits a cache block marked with the private-access state, the coherence state of that cache block must be initialized according to the attributes of the current access and the cache coherence protocol, and the private-access mark must be removed. This initialization usually triggers cache coherence operations between processor cores and may therefore cause the data of the cache block to be reloaded from another core or from memory. For this reason, in the present application, a processor core whose cache coherence protocol is disabled does not initiate any cache coherence operation, but can choose whether to respond to cache coherence operations initiated by other processor cores.
In one embodiment of the present application, the cache coherence optimization method in automatic thread-level parallelization further includes: when the cache coherence protocol of the current processor core is disabled and a memory access instruction other than a cache-coherent memory access instruction is executed, no cache coherence operation is initiated toward the cache coherence hardware system; the cache coherence disable instruction has a response-on mode and a response-off mode; after the cache coherence disable instruction is executed in response-on mode, the current processor core always responds to cache coherence operations initiated by other processor cores; after it is executed in response-off mode, the current processor core ignores cache coherence operations initiated by other processor cores.
In this embodiment, when the cache coherence disable instruction is executed in response-on mode, the current processor core still responds to cache coherence operations initiated by other processor cores. The reason is that a thread using a shared variable with its cache coherence protocol enabled may need to obtain the latest value of that variable from another thread whose cache coherence protocol is disabled, so that other thread must be able to respond to cache coherence operations even while its protocol is disabled. This prevents a thread whose cache coherence protocol is disabled from keeping other threads from obtaining the latest value of a shared variable. When the cache coherence disable instruction is executed in response-off mode, cache coherence operations initiated by other processor cores are ignored, which further reduces coherence operation overhead.
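The two modes of the cache coherence disable instruction can be pictured with a small state model. The following Python sketch is illustrative only: the class `CoreCoherenceState` and its method names are hypothetical, and it assumes that a core whose protocol is enabled always responds to remote coherence operations.

```python
from dataclasses import dataclass


@dataclass
class CoreCoherenceState:
    """Hypothetical per-core state: protocol on/off plus the response mode."""
    protocol_on: bool = True
    respond_when_off: bool = True

    def enable(self) -> None:
        # Models the cache coherence enable instruction.
        self.protocol_on = True

    def disable(self, respond: bool) -> None:
        # Models the cache coherence disable instruction;
        # respond=True is response-on mode, respond=False is response-off mode.
        self.protocol_on = False
        self.respond_when_off = respond

    def responds_to_remote(self) -> bool:
        # Assumption: an enabled core always responds to other cores.
        return self.protocol_on or self.respond_when_off
```

In response-on mode a remote core can still fetch the latest value of a shared variable from this core; in response-off mode that traffic is dropped, trading visibility for lower overhead.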
To generate the first target instruction subsequence, in one embodiment, generating for each coherent instruction stream segment a first target instruction subsequence that maintains cache coherence includes at least one of the following: each coherence instruction in each coherent instruction stream segment corresponds to one cache-coherent memory access instruction or one ordinary memory access instruction in the first target instruction subsequence; the first instruction of the first target instruction subsequence of a coherent instruction stream segment is a cache coherence enable instruction, or the last instruction of the first target instruction subsequence of a coherent instruction stream segment is a cache coherence disable instruction.
In this embodiment, one way to generate the first target instruction subsequence is as follows. As described in the embodiment above, the target processor provides cache-coherent, private, and ordinary memory access instructions, and each coherence instruction in a coherent instruction stream segment is emitted as a cache-coherent memory access instruction or an ordinary memory access instruction in the first target instruction subsequence. This ensures that cache coherence is maintained when each instruction of the coherent instruction stream segment executes.
Another way is as follows. If the first instruction of the first target instruction subsequence is a cache coherence enable instruction, the subsequent memory access instructions in that subsequence all issue cache coherence operation requests, whereas the instructions before it belong to a private instruction stream segment and issue none. If the last instruction of the first target instruction subsequence is a cache coherence disable instruction, the instructions after it belong to a private instruction stream segment and issue no cache coherence operation requests, whereas the instructions before it may issue such requests because coherence has not yet been disabled.
To generate the second target instruction subsequence, in one embodiment, generating for each private instruction stream segment a second target instruction subsequence that does not maintain cache coherence includes at least one of the following: each memory access instruction in each private instruction stream segment corresponds to one private memory access instruction or one ordinary memory access instruction in the second target instruction subsequence; the first instruction of the second target instruction subsequence of a private instruction stream segment is a cache coherence disable instruction, or the last instruction of the second target instruction subsequence of a private instruction stream segment is a cache coherence enable instruction.
In this embodiment, one way to generate the second target instruction subsequence is to emit each memory access instruction in a private instruction stream segment as a private memory access instruction or an ordinary memory access instruction in the second target instruction subsequence, so that no instruction of the private instruction stream segment issues a cache coherence operation request when executed.
Another way is as follows. If the first instruction of the second target instruction subsequence is a cache coherence disable instruction, none of the subsequent instructions in that subsequence performs cache coherence operations, whereas the instructions before it belong to a coherent instruction stream segment and do perform them. If the last instruction of the second target instruction subsequence is a cache coherence enable instruction, the instructions after it belong to a coherent instruction stream segment and perform cache coherence operations, whereas the instructions before it perform none because coherence has remained disabled up to that point.
The two embodiments above show that cache coherence enable and disable instructions can serve as the first or last instruction of a first or second target instruction subsequence. The purpose is to issue cache coherence operation requests while the instructions of a coherent instruction stream segment execute, and to issue none while a private instruction stream segment executes.
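As a rough illustration of placing enable and disable instructions at segment boundaries, the Python sketch below walks an instruction list in which each entry is already tagged as belonging to a coherent or a private segment. The mnemonics `CC_ON` and `CC_OFF`, the function name, and the assumption that coherence starts disabled are all hypothetical.

```python
def insert_coherence_switches(instrs):
    """instrs: list of (text, in_coherent_segment) pairs in execution order.

    Returns the instruction texts with a hypothetical CC_ON emitted before each
    coherent segment and CC_OFF after it. Assumes coherence starts disabled.
    """
    out = []
    coherence_on = False
    for text, coherent in instrs:
        if coherent and not coherence_on:
            out.append("CC_ON")    # entering a coherent segment
            coherence_on = True
        elif not coherent and coherence_on:
            out.append("CC_OFF")   # entering a private segment
            coherence_on = False
        out.append(text)
    if coherence_on:
        out.append("CC_OFF")       # close a trailing coherent segment
    return out


stream = [("ld a", False), ("ld s", True), ("st s", True), ("add", False)]
print(insert_coherence_switches(stream))
# ['ld a', 'CC_ON', 'ld s', 'st s', 'CC_OFF', 'add']
```

Whether a switch is counted as the last instruction of one subsequence or the first of the next is a matter of bookkeeping; the emitted order is the same.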
In one embodiment, determining the coherent instruction stream segment corresponding to each coherence instruction and the private instruction stream segments that contain no coherence instruction includes: each coherent instruction stream segment contains one or more instructions, and when it contains only one instruction, that instruction is the coherence instruction; every instruction stream segment of the loop that lies outside all of the coherent instruction stream segments is determined to be a private instruction stream segment.
In this embodiment, when a coherent instruction stream segment contains multiple instructions, a cache coherence enable instruction or disable instruction can be inserted before or after the coherent instruction stream segment in the manner of the embodiments above, or a cache coherence disable instruction or enable instruction can be inserted before or after the private instruction stream segment. When a coherent instruction stream segment contains only one instruction, that single coherence instruction constitutes the segment and is used as the first target instruction subsequence. Each of the separate runs of consecutive instructions outside the coherent instruction stream segments is determined to be a private instruction stream segment, and the instructions within each private instruction stream segment form the corresponding second target instruction subsequence.
Embodiment 3
On the basis of any of the embodiments above, in the cache coherence optimization method in automatic thread-level parallelization provided by this embodiment, the step of determining, according to a preset rule, the coherent instruction stream segment corresponding to each coherence instruction includes: obtaining the data dependencies among the coherence instructions; and aggregating at least two coherence instructions that have a data dependency into the same coherent instruction stream segment.
In this embodiment, thread-level automatic parallelization analyzes the data dependencies among the instructions in the loop. If the result computed by instruction A is used as an input by instruction B, then B depends on A; if instruction C depends on B and B depends on A, then C can also be considered to depend on A, that is, data dependencies are transitive. Instruction scheduling is optimized according to these data dependencies: the access instructions to coherent shared variables that have data dependencies, together with the other access instructions located between them, are aggregated to form a coherent instruction stream segment. This prevents coherence instructions that depend on one another from being split across different coherent instruction stream segments and so prevents the data dependencies from being broken.
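The transitive nature of data dependencies can be checked with a simple graph walk. The sketch below is a minimal Python illustration, assuming the compiler exposes direct dependencies as a mapping from each instruction to the set of instructions it reads from; the function name and the data layout are hypothetical.

```python
def depends_on(direct_deps, later, earlier):
    """Return True if `later` depends on `earlier` directly or transitively.

    direct_deps maps an instruction name to the set of instructions whose
    results it uses as inputs (hypothetical representation).
    """
    stack = list(direct_deps.get(later, ()))
    seen = set()
    while stack:
        cur = stack.pop()
        if cur == earlier:
            return True
        if cur in seen:
            continue
        seen.add(cur)
        # Follow the dependency chain one more step.
        stack.extend(direct_deps.get(cur, ()))
    return False


deps = {"B": {"A"}, "C": {"B"}}
print(depends_on(deps, "C", "A"))  # True: C -> B -> A
print(depends_on(deps, "A", "C"))  # False
```

Two coherence instructions for which `depends_on` holds in either direction would be candidates for the same coherent instruction stream segment.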
It is worth noting that this optimization relies on data dependency analysis of the instructions that access coherent shared variables. In most cases the compiler performs data dependency analysis as part of its optimizations, but under some compilation options the analysis may be skipped.
Analyzing the data dependencies in a piece of program and scheduling instructions accordingly is a basic compiler function. One key problem addressed by this application is how to aggregate multiple coherent-shared-variable access instructions without breaking data dependencies, so that these instructions end up as close together as possible in the generated target instruction sequence. Accordingly, in the cache coherence optimization method in automatic thread-level parallelization provided in one embodiment of this application, the step of determining, according to a preset rule, the coherent instruction stream segment corresponding to each coherence instruction includes: obtaining the data dependencies among the coherence instructions; scheduling at least two coherence instructions that have no data dependency on one another to adjacent positions; and aggregating the adjacent coherence instructions so scheduled into the same coherent instruction stream segment.
In this embodiment, the data dependencies among all instructions in the parallel region are analyzed to determine whether two coherent-shared-variable access instructions depend on each other directly or indirectly. When they do not, the two instructions can be scheduled to adjacent positions. When they do, other instructions that have no dependency relationship with them are not scheduled between them, so that the data dependency is not broken.
In the cache coherence optimization method in automatic thread-level parallelization provided in one embodiment of this application, the step of determining, according to a preset rule, the coherent instruction stream segment corresponding to each coherence instruction includes: determining the coherence instructions that correspond to the coherent shared variable in multiple iterations of the loop; and aggregating those coherence instructions from the multiple iterations into the same coherent instruction stream segment.
It should be understood that the basic optimization method often leaves limited room for improvement. Automatic parallelization usually targets loops, and apart from mutually exclusive regions (critical sections), the different iterations of such a loop must not depend on one another; otherwise parallelization would break the dependencies and produce incorrect results. Loop unrolling can therefore be used to aggregate the access instructions that multiple iterations issue to a coherent shared variable (when that variable is an array).
In one embodiment, the step of determining, according to a preset rule, the coherent instruction stream segment corresponding to each coherence instruction includes: according to the number N of the coherent shared variables, creating a temporary array with N elements; every N iterations, reading N elements of each coherent shared variable into the temporary array in one pass, or assigning the N elements of the temporary array to the coherent shared variable in one pass; replacing the instructions that access each coherent shared variable with array access instructions on the temporary array; and adding the memory access instructions of the one-pass read and the one-pass assignment to the coherent instruction stream segment.
It should be understood that, although loop unrolling is a common compiler optimization, the unroll factor is usually limited: unrolling sharply increases the number of generated instructions and can even make the loop body too large for the processor core's instruction cache, which sharply slows the program. When the loop is not unrolled, or is unrolled only a few times, a temporary array can be used to aggregate the access instructions that multiple iterations issue to a coherent shared variable (when that variable is an array). In this embodiment, a temporary array with N elements is first allocated for each coherent shared variable. Then, every N iterations, N elements of the coherent shared variable are read into the temporary array in one pass, or the N elements of the temporary array are assigned to the coherent shared variable array in one pass. In the original loop body, accesses to elements of the coherent shared variable are replaced with accesses to the corresponding elements of the temporary array. In this way, when loop unrolling is not performed or the unroll factor is small, the access instructions that move coherent shared variable data into or out of the temporary array are aggregated into the same coherent instruction stream segment.
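The temporary-array approach can be sketched at source level as follows. This Python fragment is only an analogy for what the compiler would do at instruction level: the function name `sum_with_staging`, the summation loop body, and the counter are hypothetical, and only the read direction is shown.

```python
def sum_with_staging(shared, n):
    """Sum a coherent shared array using an n-element private staging array.

    Every n iterations, n elements are copied out of `shared` in one pass;
    these bulk copies are the only accesses that would need cache coherence.
    Returns the sum and the number of bulk reads performed.
    """
    total = 0
    bulk_reads = 0
    tmp = []
    for i in range(len(shared)):
        if i % n == 0:
            tmp = shared[i:i + n]   # one-pass read into the temporary array
            bulk_reads += 1
        total += tmp[i % n]         # loop body touches only the private copy
    return total, bulk_reads


print(sum_with_staging(list(range(10)), 4))  # (45, 3)
```

With 10 elements and N = 4, the coherent accesses are concentrated into three bursts instead of being spread over ten iterations. A write-back variant would assign `tmp` to the shared array at the end of each block of N iterations.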
After instruction scheduling optimization is complete, the full instruction stream of the loop after automatic parallelization is available, and all coherent-shared-variable access instructions in it can be marked. If two memory access instructions to coherent shared variables are adjacent in the instruction stream, they can naturally be merged into the same coherent instruction stream segment. For two coherent instruction stream segments that follow one another, merging them means that the other memory access instructions between them add coherence operation overhead (for example, the number of such memory access instructions multiplied by a baseline coherence operation overhead), whereas leaving them unmerged requires inserting additional coherence enable and disable instructions, which also costs overhead. Whether to merge these coherence instructions into one coherent instruction stream segment must therefore be decided sensibly. To this end, in one embodiment, the step of determining, according to a preset rule, the coherent instruction stream segment corresponding to each coherence instruction includes:
obtaining at least two coherence instructions that lie within a preset range in execution order, together with the memory access instructions between those coherence instructions, as the instructions to be merged;
calculating the coherence operation overhead that the memory access instructions between the coherence instructions within the preset range would incur if the instructions to be merged were merged into one coherent instruction stream segment, to obtain a first operation overhead;
calculating the operation overhead that would be saved by the reduced number of times the cache coherence protocol is enabled and disabled if the instructions to be merged were merged into one coherent instruction stream segment, to obtain a second operation overhead;
when the first operation overhead is less than the second operation overhead, determining that the instructions to be merged are merged into one coherent instruction stream segment.
In this embodiment, merging multiple memory access instructions into one coherent instruction stream segment reduces the overhead of invoking cache coherence enable and disable instructions. However, if the merged segment contains too many memory access instructions that are not coherence instructions, the coherence operation overhead becomes high. Both overheads therefore need to be estimated and compared.
In this embodiment, the preset range refers to a window over the execution sequence of the instructions; that is, the two coherence instructions to be merged may be directly adjacent or may be separated by several other instructions. The preset range is usually expressed as a number of instructions (for example, 20 instructions). Both the first and the second operation overhead are hypothetical quantities. The first operation overhead is the coherence operation overhead added by merging at least two coherence instructions within the preset range, together with the other memory access instructions between them, into one coherent instruction stream segment, compared with not merging and performing coherence operations only on the coherent shared variables involved. The second operation overhead is the coherence protocol control overhead saved by the same merge, compared with not merging and inserting a coherence enable instruction and a coherence disable instruction before and after every coherence instruction.
When the added first operation overhead is less than the saved second operation overhead, the coherence operation overhead caused by the other memory access instructions after merging is smaller than the overhead of invoking a coherence enable instruction and a coherence disable instruction for every coherence instruction. The two coherence instructions, together with the other instructions between them, are therefore merged into one coherent instruction stream segment, and the coherence enable and disable instructions are inserted before and after the segment as a whole.
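The comparison of the two overheads can be sketched as below. This is a minimal Python illustration under stated assumptions: the first overhead uses the linear model mentioned as an example in the text (number of other memory access instructions times a baseline cost), while the switch-count model for the second overhead, the function name, and all cost values are hypothetical.

```python
def should_merge(num_coherence_instrs, num_other_mem_instrs,
                 base_coherence_cost, switch_cost):
    """Decide whether to merge coherence instructions into one segment.

    first:  extra coherence overhead caused by the other memory accesses
            that end up inside the merged segment.
    second: switch overhead saved. Assumption: unmerged, every coherence
            instruction is wrapped in its own enable/disable pair (2 per
            instruction); merged, the whole segment needs a single pair.
    """
    first = num_other_mem_instrs * base_coherence_cost
    switches_unmerged = 2 * num_coherence_instrs
    switches_merged = 2
    second = (switches_unmerged - switches_merged) * switch_cost
    return first < second


print(should_merge(2, 3, 1.0, 5.0))   # True: 3.0 < 10.0
print(should_merge(2, 20, 1.0, 5.0))  # False: 20.0 >= 10.0
```

In practice the cost parameters would have to come from measurements or a model of the target processor; the values used here are placeholders.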
In a simpler implementation, two successive coherent instruction stream segments are merged whenever the number of other memory access instructions between them does not exceed a threshold. For example, in one embodiment, the step of determining, according to a preset rule, the coherent instruction stream segment corresponding to each coherence instruction includes:
obtaining the number of memory access instructions between at least two coherence instructions that lie within a preset range in execution order, as the number of other memory access instructions;
when the number of other memory access instructions is less than or equal to a preset instruction count threshold, merging the at least two coherence instructions within the preset range in execution order, together with the memory access instructions between them, into one coherent instruction stream segment.
This embodiment provides another approach to merging coherence instructions: if the number of other memory access instructions between two successive coherence instructions does not exceed the preset instruction count threshold, those other memory access instructions are merged with the two coherence instructions into one coherent instruction stream segment. It should be understood that, when the number of other memory access instructions between every pair of successive coherence instructions within the preset range does not exceed the threshold, all of these coherence instructions and the memory access instructions between them can likewise be merged into one coherent instruction stream segment.
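The threshold rule lends itself to a single pass over the instruction stream. The Python sketch below is illustrative only: the one-letter instruction tags ("C" for a coherence instruction, "M" for another memory access, "X" for a non-memory instruction) and the function name are hypothetical, and the preset range window is omitted for brevity.

```python
def merge_by_threshold(kinds, max_other_mem):
    """Group coherence instructions into segments by a gap threshold.

    kinds: sequence of tags in execution order, "C" / "M" / "X" (hypothetical).
    Two successive "C" instructions share a segment when at most
    `max_other_mem` "M" instructions lie between them.
    Returns a list of (first_index, last_index) pairs, one per segment.
    """
    segments = []
    gap_mem = 0  # other memory accesses seen since the last "C"
    for idx, kind in enumerate(kinds):
        if kind == "C":
            if segments and gap_mem <= max_other_mem:
                segments[-1][1] = idx        # extend the current segment
            else:
                segments.append([idx, idx])  # start a new segment
            gap_mem = 0
        elif kind == "M":
            gap_mem += 1
    return [tuple(s) for s in segments]


print(merge_by_threshold(list("CMCXMMMC"), 1))  # [(0, 2), (7, 7)]
```

Everything outside the returned index ranges would form the private instruction stream segments.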
Coherent shared variables can be obtained from compiler directives. Alternatively, the cache-coherence-related attributes of each variable can be declared in the program; by recording these attributes, the compiler can determine all coherent shared variables in a piece of program. The compiler records the variable associated with each memory access instruction; when that variable is a coherent shared variable, the memory access instruction is a coherence memory access instruction.
Taking OpenMP as an example, the directives for automatic parallelization include specifying the private or shared variables in a parallel region (such as a for loop in C/C++), specifying the mutually exclusive regions in a parallel region, and so on. Existing directives, however, cannot specify coherent shared variables (shared variables whose read and write accesses require cache coherence to be guaranteed). A specification of coherent shared variables is therefore added to the directives. Figure 3D gives an example, in which "CC_vars" is the keyword of the directive that specifies coherent shared variables, that is, the coherence identifier described in this embodiment.
During automatic parallelizing compilation, all private and shared variables in the parallel region must be determined. It is natural to assume that every access to any shared variable needs a cache consistency guarantee, that is, that all shared variables are consistent shared variables. A keyword value, namely a consistency identifier, can be designed to specify this case conveniently, for example "CC_vars(default shared)" in Figure 3D. However, accesses to some shared variables need no cache consistency guarantee. For the third loop in Figure 3D, for example, although shared variable A is used as input, the reads of A in that loop need not be kept cache-consistent, because the third loop is parallelized in the same way as the second loop, and the second loop has already read shared variable A. In addition, the data of shared variables can be partitioned manually according to the requirements of thread-level parallelism so that threads do not share the same cache block, in which case no consistent shared variables need to be specified. In other words, program developers can reduce the number of consistent shared variables based on their understanding or improvement of the program, thereby reducing cache consistency overhead. The compiler directive for consistent shared variables can therefore both mark all shared variables conveniently and enumerate specific shared variables flexibly.
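The two uses of the directive (mark everything, or enumerate specific variables) can be sketched as follows. Only the keyword "CC_vars" and the value "default shared" come from the text; the comma-separated enumeration syntax and the function name `resolve_cc_vars` are assumptions made for illustration, since Figure 3D is not reproduced here.

```python
def resolve_cc_vars(clause_args, shared_vars):
    """Resolve the argument of a CC_vars(...) clause to a set of variable names.

    "default shared" marks every shared variable as a consistent shared
    variable; otherwise the argument is read as a comma-separated list of
    names (an assumed syntax, not confirmed by the source).
    """
    args = clause_args.strip()
    if args == "default shared":
        return set(shared_vars)
    names = {n.strip() for n in args.split(",") if n.strip()}
    unknown = names - set(shared_vars)
    if unknown:
        raise ValueError(f"not shared variables: {sorted(unknown)}")
    return names


print(resolve_cc_vars("default shared", ["A", "B"]))
print(resolve_cc_vars("B", ["A", "B"]))
```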
In one embodiment, the step of determining, according to a preset rule, the consistency instruction stream segment corresponding to each consistency instruction includes: determining a critical section in the cache area, and merging all instructions of the critical section into one consistency instruction stream segment.
In this embodiment, a loop that is to be automatically parallelized at thread level may contain a critical section that only one thread may enter at a time. A critical section usually modifies shared variables. The entire critical section can therefore be treated as one consistency instruction stream segment, that is, a cache consistency enable instruction is inserted before the critical section and a cache consistency disable instruction is inserted after it.
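A minimal sketch of this wrapping step is shown below. It assumes an instruction stream in which a critical section is delimited by the markers `critical_begin` and `critical_end`; these markers and the mnemonics `cc_enable` / `cc_disable` are placeholder names, not instructions defined by the patent.

```python
def wrap_critical_sections(instrs):
    """Insert cc_enable before and cc_disable after every critical section."""
    out = []
    for ins in instrs:
        if ins == "critical_begin":
            out.append("cc_enable")   # the whole section becomes one consistency segment
            out.append(ins)
        elif ins == "critical_end":
            out.append(ins)
            out.append("cc_disable")
        else:
            out.append(ins)
    return out


stream = ["ld x", "critical_begin", "ld sum", "st sum", "critical_end", "st y"]
print(wrap_critical_sections(stream))
```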
The correctness of the optimization that removes redundant cache consistency operations is guaranteed by the programmer, who often needs to debug the optimized implementation. A common debugging method is to check whether the program result with all cache consistency operations retained is identical to the result after the redundant operations are removed, that is, whether the result is identical before and after the optimization. This check is easier if the optimization can be switched on and off without changing the program. A simple way to achieve this is to provide a compilation setting, among the compiler options, that enables or disables the cache consistency optimization. When the compiler performs automatic thread-level parallelization and finds that this setting disables the optimization, it treats all variables as consistent shared variables, so that no cache consistency operation is removed from the compiled target program.
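The effect of such a switch on variable classification can be sketched as follows. The parameter name `cc_optimization` is a hypothetical stand-in for the compilation setting; the patent does not name the actual option.

```python
def consistent_variables(all_vars, declared_consistent, cc_optimization=True):
    """Select the variables the compiler treats as consistent shared variables.

    With the optimization disabled, every variable is treated as a consistent
    shared variable, so no cache consistency operation is removed.
    """
    if not cc_optimization:
        return set(all_vars)
    return set(declared_consistent) & set(all_vars)


variables = ["A", "B", "i"]
print(consistent_variables(variables, ["A"]))
print(consistent_variables(variables, ["A"], cc_optimization=False))
```

Building the program once with each setting and comparing the outputs gives the before/after check described above.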
According to Chinese patent application 202410831240.X, when the cache consistency protocol of the current processor core is in the disabled state, one or both of the following apply while a memory access instruction is executed: the current processor core does not initiate any cache consistency operation toward the cache consistency hardware system; the current processor core ignores any cache consistency operation initiated by other processor cores. In addition, when the cache consistency protocol is in the enabled state and a memory access instruction hits a cache block marked with the private access state, the consistency state of that cache block must be initialized according to the attributes of the current access and the cache consistency protocol, and the private access mark must be removed. This initialization usually triggers cache consistency operations between processor cores and may therefore cause the cache block data to be reloaded from another core or from memory. For this reason, in the present application, the current processor core initiates no cache consistency operation while its cache consistency protocol is disabled, but can choose whether to respond to cache consistency operations initiated by other processor cores. In one embodiment of the present application, the cache consistency disable instruction has a response-enabled mode and a response-disabled mode: after the disable instruction is executed in the response-enabled mode, the current processor core always responds to cache consistency operations initiated by other processor cores; after the disable instruction is executed in the response-disabled mode, the current processor core ignores cache consistency operations initiated by other processor cores.
In this embodiment, when the cache consistency disable instruction is executed in the response-enabled mode, the current processor core still responds to cache consistency operations initiated by other processor cores. The reason is that a thread using a shared variable with the cache consistency protocol enabled may need to obtain the latest data of that variable from another thread whose protocol is disabled, so that other thread must be able to respond to cache consistency operations even while its protocol is disabled. This prevents a thread with a disabled protocol from blocking other threads' access to the latest data of a shared variable. When the disable instruction is executed in the response-disabled mode, cache consistency operations initiated by other processor cores are ignored, which further reduces consistency overhead.
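The two modes can be pictured with a small behavioral model of one processor core. This is only a sketch of the rules stated above; the class and method names are hypothetical and do not describe real hardware or a real instruction set.

```python
class Core:
    """Toy model of one core's cache consistency protocol state."""

    def __init__(self):
        self.protocol_on = True
        self.responds_when_off = True

    def cc_enable(self):
        self.protocol_on = True

    def cc_disable(self, respond=True):
        # respond=True  models the response-enabled mode
        # respond=False models the response-disabled mode
        self.protocol_on = False
        self.responds_when_off = respond

    def initiates_requests(self):
        # A core with the protocol disabled never initiates consistency operations.
        return self.protocol_on

    def answers_remote_requests(self):
        return self.protocol_on or self.responds_when_off


core = Core()
core.cc_disable(respond=True)
print(core.initiates_requests(), core.answers_remote_requests())
```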
Many modern processors prefetch cache data automatically; commonly used methods are adjacent-line prefetching and constant-stride prefetching. With the cache consistency protocol disabled, a cache block that will later need cache-consistent access may be fetched ahead of time. After execution enters a region in which the protocol is enabled, accessing such a prefetched block without checking whether another core has changed its value may leave the data stale. Chinese patent application 202410831240.X proposes marking cache blocks accessed while the protocol is disabled as being in the private access state; when the protocol is enabled and an access hits a block marked with the private access state, a cache consistency operation must be performed on the block, or the block must be re-read, to ensure that its data is correct. In essence, this method treats the first access to a private-access-state block under an enabled protocol as a cache miss, and therefore introduces considerable additional overhead. In the present application, the cache consistency protocol is enabled and disabled frequently during execution of the parallelized loop, so consistent shared variables are frequently prefetched while the protocol is disabled. The method of Chinese patent application 202410831240.X would then significantly reduce the benefit of prefetching consistent shared variables, and a more efficient method is needed.
Data prefetches are generally initiated on the basis of the address patterns of load and store instructions, that is, one data prefetch instruction is triggered by several load and store instructions. Based on this, in one embodiment of the cache consistency optimization method in automatic thread-level parallelization provided in the present application, the method further includes: in response to execution of a data prefetch instruction, detecting whether at least one of the memory access instructions that triggered the data prefetch instruction is a cache-consistent memory access instruction or corresponds to the enabled state of the cache consistency protocol; when at least one of the triggering memory access instructions is a cache-consistent memory access instruction or corresponds to the enabled state of the cache consistency protocol, determining that a cache consistency operation request is initiated when the data prefetch instruction is executed (the data prefetch instruction is executed with the cache consistency protocol enabled, or the corresponding data prefetch instruction among the cache-consistent memory access instructions is used); when all of the triggering memory access instructions are private memory access instructions or correspond to the disabled state of the cache consistency protocol, determining that no cache consistency operation request is initiated when the data prefetch instruction is executed (the data prefetch instruction is executed with the cache consistency protocol disabled, or the corresponding data prefetch instruction among the private memory access instructions is used).
In this embodiment, when a data prefetch instruction is executed, if at least one of the memory access instructions that triggered it is a cache-consistent memory access instruction or corresponds to the enabled state of the cache consistency protocol, it is determined that a cache consistency operation request is initiated when the prefetch is executed. This effectively avoids stale data without introducing additional overhead. If all of the memory access instructions that triggered the prefetch are private memory access instructions or correspond to the disabled state of the cache consistency protocol, it is determined that no cache consistency operation request is initiated when the prefetch is executed.
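The decision rule for prefetches can be sketched as a predicate over the triggering accesses. The representation of each trigger as a `(kind, protocol_on)` pair is an assumption, as is the reading that a private memory access instruction never counts toward consistency even when the protocol happens to be enabled.

```python
def prefetch_needs_consistency(triggers):
    """Decide whether a prefetch should initiate a cache consistency request.

    triggers: list of (kind, protocol_on) pairs for the loads and stores that
    triggered the prefetch, where kind is "consistent", "private" or
    "ordinary", and protocol_on is the protocol state when the access ran.
    """
    return any(
        kind == "consistent" or (kind == "ordinary" and protocol_on)
        for kind, protocol_on in triggers
    )


print(prefetch_needs_consistency([("ordinary", False), ("consistent", False)]))
print(prefetch_needs_consistency([("private", True), ("ordinary", False)]))
```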
Embodiment 4
FIG. 2 is a schematic structural diagram of a cache consistency optimization apparatus for automatic thread-level parallelization provided by an embodiment of the present disclosure. As shown in FIG. 2, the apparatus provided by this embodiment may include:
a shared variable acquisition module 210, configured so that the compiler, in response to thread-level automatic parallelizing compilation of a loop, acquires all consistent shared variables;
an instruction determination module 220, configured to find, among the memory access instructions and according to the consistent shared variables, the memory access instructions that access the consistent shared variables, and to determine them as consistency instructions;
an instruction stream segment determination module 230, configured to determine, according to a preset rule, the consistency instruction stream segment corresponding to each consistency instruction and the private instruction stream segments that contain no consistency instruction, to generate for each consistency instruction stream segment a first target instruction subsequence that maintains cache consistency, and to generate for each private instruction stream segment a second target instruction subsequence that does not maintain cache consistency.
The first target instruction subsequence and the second target instruction subsequence both correspond to a target processor; the target processor is capable of maintaining inter-core cache consistency and supports executing memory access instructions with cache consistency turned off.
The shared variable acquisition module is further configured to acquire all consistent shared variables by at least one of the following methods:
acquiring a cache consistency setting command from the directives of the automatic parallelizing compilation, and acquiring all consistent shared variables from the cache consistency setting command;
acquiring the attribute information of each variable, determining from the attribute information whether each variable is a consistent shared variable, and thereby determining all consistent shared variables.
In one embodiment, the apparatus further includes:
a consistent shared variable determination module, configured to acquire the current compilation options of the compiler, acquire the cache consistency compilation setting from the current compilation options, and, when the cache consistency compilation setting disables the cache consistency optimization, treat all variables as consistent shared variables.
In one embodiment, the target processor being capable of maintaining inter-core cache consistency and supporting execution of memory access instructions with cache consistency turned off includes at least one of the following:
the target processor provides a cache consistency enable instruction and a cache consistency disable instruction, where the cache consistency enable instruction sets the state of the cache consistency protocol of the current processor core to the enabled state, and the cache consistency disable instruction sets it to the disabled state;
the target processor provides cache-consistent memory access instructions, private memory access instructions, or ordinary memory access instructions, where a cache-consistent memory access instruction initiates a cache consistency operation request whenever it is executed, a private memory access instruction never initiates a cache consistency operation request, and an ordinary memory access instruction initiates a cache consistency operation request when the cache consistency protocol of the current processor core is enabled and does not initiate one when the protocol is disabled.
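The behavior of the three instruction kinds can be summarized in a short sketch. The enum and function names are hypothetical; only the three behaviors themselves come from the text above.

```python
from enum import Enum


class Kind(Enum):
    CONSISTENT = "cache-consistent memory access"
    PRIVATE = "private memory access"
    ORDINARY = "ordinary memory access"


def initiates_request(kind, protocol_on):
    """Return True when executing the access initiates a consistency request."""
    if kind is Kind.CONSISTENT:
        return True            # always, regardless of protocol state
    if kind is Kind.PRIVATE:
        return False           # never, regardless of protocol state
    return protocol_on         # ordinary: follows the core's protocol state


for kind in Kind:
    print(kind.name, initiates_request(kind, True), initiates_request(kind, False))
```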
In one embodiment, when a memory access instruction other than a cache-consistent memory access instruction is executed while the cache consistency protocol of the current processor core is disabled, no cache consistency operation is initiated toward the cache consistency hardware system;
the cache consistency disable instruction has a response-enabled mode and a response-disabled mode;
the apparatus further includes:
a response enabling module, configured so that, after the cache consistency disable instruction is executed in the response-enabled mode, the current processor core always responds to cache consistency operations initiated by other processor cores;
a response disabling module, configured so that, after the cache consistency disable instruction is executed in the response-disabled mode, the current processor core ignores cache consistency operations initiated by other processor cores.
In one embodiment, the instruction stream segment determination module is further configured such that:
each consistency instruction in each consistency instruction stream segment corresponds to one cache-consistent memory access instruction or one ordinary memory access instruction in the first target instruction subsequence;
the first instruction of the first target instruction subsequence of a consistency instruction stream segment is a cache consistency enable instruction, or the last instruction of the first target instruction subsequence of a consistency instruction stream segment is a cache consistency disable instruction.
In one embodiment, the instruction stream segment determination module is further configured such that:
each memory access instruction in each private instruction stream segment corresponds to one private memory access instruction or one ordinary memory access instruction in the second target instruction subsequence;
the first instruction of the second target instruction subsequence of a private instruction stream segment is a cache consistency disable instruction, or the last instruction of the second target instruction subsequence of a private instruction stream segment is a cache consistency enable instruction.
In one embodiment, the instruction stream segment determination module is further configured such that:
each consistency instruction stream segment contains one or more instructions, and when it contains only one instruction, that instruction is a consistency instruction;
every instruction stream segment of the loop other than the consistency instruction stream segments is determined to be a private instruction stream segment.
In one embodiment, the instruction stream segment determination module includes:
a dependency acquisition unit, configured to acquire the data dependencies between the consistency instructions;
an instruction stream segment aggregation unit, configured to aggregate at least two consistency instructions that have a data dependency into the same consistency instruction stream segment.
In one embodiment, the instruction stream segment determination module includes:
a dependency acquisition unit, configured to acquire the data dependencies between the consistency instructions;
an instruction stream segment aggregation unit, configured to schedule at least two consistency instructions that have no data dependency on each other into adjacent positions, and to aggregate the adjacent consistency instructions after scheduling into the same consistency instruction stream segment.
In one embodiment, the instruction stream segment determination module includes:
an in-iteration instruction determination unit, configured to determine the consistency instructions that correspond to a consistent shared variable across multiple iterations of the loop;
an instruction stream segment aggregation unit, configured to aggregate the consistency instructions that correspond to the consistent shared variable across the multiple iterations into the same consistency instruction stream segment.
In one embodiment, the instruction stream segment determination module includes:
a temporary array creation unit, configured to create a temporary array of N elements according to the number N of consistent shared variables;
an instruction replacement unit, configured to read the N elements of the consistent shared variables into the temporary array in one operation, or to assign the N elements of the temporary array to the consistent shared variables in one operation, and to replace the instructions that access the consistent shared variables with array access instructions on the temporary array;
an instruction adding unit, configured to add the memory access instructions corresponding to the one-time read and the one-time assignment to the consistency instruction stream segment.
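The temporary-array transformation handled by these units can be pictured with the sketch below. It models the N consistent shared values as a Python list and records where the cache consistency enable and disable instructions would sit; the function names and the trace strings are hypothetical.

```python
def staged_update(shared, body):
    """Run `body` on a private copy of `shared`, touching `shared` only in bulk."""
    n = len(shared)
    tmp = [None] * n                 # temporary array with N elements
    trace = []

    trace.append("cc_enable")
    tmp[:] = shared                  # one-time read of the N elements
    trace.append("cc_disable")

    body(tmp)                        # former shared accesses now go to tmp
    trace.append("private_body")

    trace.append("cc_enable")
    shared[:] = tmp                  # one-time assignment back
    trace.append("cc_disable")
    return trace


def double_all(values):
    for i in range(len(values)):
        values[i] *= 2


data = [1, 2, 3]
print(staged_update(data, double_all), data)
```

Only the bulk read and the bulk write fall inside consistency segments; the loop body itself runs as a private segment.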
In one embodiment, the instruction stream segment determination module includes:
a to-be-merged instruction determination unit, configured to acquire at least two consistency instructions that lie within a preset range in execution order, together with the memory access instructions between those consistency instructions, and to determine them as the instructions to be merged;
a first operation overhead calculation unit, configured to calculate, for the case in which the instructions to be merged are merged into one consistency instruction stream segment, the consistency operation overhead incurred by the memory access instructions that lie between the consistency instructions within the preset range, to obtain a first operation overhead;
a second operation overhead calculation unit, configured to calculate, for the case in which the instructions to be merged are merged into one consistency instruction stream segment, the overhead saved by reducing the number of times the cache consistency protocol is enabled and disabled, to obtain a second operation overhead;
a merging unit, configured to determine, when the first operation overhead is less than the second operation overhead, that the instructions to be merged are merged into one consistency instruction stream segment.
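The comparison made by the merging unit can be sketched with a simple cost model. The linear costs and the assumption that, without merging, each consistency instruction would need its own enable/disable pair are illustrative only; the patent does not specify how the two overheads are computed.

```python
def should_merge(n_between, n_consistency, per_access_cost, toggle_cost):
    """Return True when merging is cheaper under an assumed linear cost model.

    first:  consistency overhead added to the accesses that lie between the
            consistency instructions once they sit inside the merged segment.
    second: overhead saved by toggling the protocol fewer times, assuming
            n_consistency separate enable/disable pairs collapse into one.
    """
    first = n_between * per_access_cost
    second = (n_consistency - 1) * 2 * toggle_cost
    return first < second


print(should_merge(n_between=2, n_consistency=2, per_access_cost=5, toggle_cost=10))
print(should_merge(n_between=10, n_consistency=2, per_access_cost=5, toggle_cost=10))
```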
In one embodiment, the instruction stream segment determination module includes:
an instruction count acquisition unit, configured to acquire the number of memory access instructions that lie between at least two consistency instructions within a preset range in execution order, and to determine it as the number of other memory access instructions;
a merging unit, configured to merge, when the number of other memory access instructions is less than or equal to a preset instruction-count threshold, the at least two consistency instructions within the preset range in execution order and the memory access instructions between them into one consistency instruction stream segment.
In one embodiment, the instruction stream segment determination module is further configured to determine a critical section in the cache area and to merge all instructions of the critical section into one consistency instruction stream segment.
In one embodiment, the apparatus further includes:
a data prefetch instruction response module, configured to detect, in response to execution of a data prefetch instruction, whether at least one of the memory access instructions that triggered the data prefetch instruction is a cache-consistent memory access instruction or corresponds to the enabled state of the cache consistency protocol;
a first prefetch instruction module, configured to determine, when at least one of the triggering memory access instructions is a cache-consistent memory access instruction or corresponds to the enabled state of the cache consistency protocol, that a cache consistency operation request is initiated when the data prefetch instruction is executed;
a second prefetch instruction module, configured to determine, when all of the triggering memory access instructions are private memory access instructions or correspond to the disabled state of the cache consistency protocol, that no cache consistency operation request is initiated when the data prefetch instruction is executed.
Embodiment 5
On the basis of the above embodiments, this embodiment provides an application example.
In response to thread-level automatic parallelizing compilation of a loop, the compiler acquires a cache consistency setting command from the directives of the automatic parallelizing compilation, acquires all consistent shared variables from the cache consistency setting command, finds all instructions that access any consistent shared variable in the complete instruction stream of the loop generated by the thread-level automatic parallelizing compilation, determines all consistency instruction stream segments from the complete instruction stream, and inserts a cache consistency enable instruction before and a cache consistency disable instruction after each consistency instruction stream segment.
Here, finding all instructions that access any consistent shared variable in the complete instruction stream of the loop generated by the thread-level automatic parallelizing compilation includes:
during the thread-level automatic parallelizing compilation, analyzing the data dependencies in the loop, performing instruction scheduling optimization according to the data dependencies, and aggregating the multiple instructions that access multiple consistent shared variables.
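The last step of the application example, inserting the enable and disable instructions around each consistency instruction stream segment, can be sketched as follows. The segments are given as inclusive index pairs into the instruction stream; the mnemonics `cc_enable` and `cc_disable` and the function name are placeholders rather than names used by the patent.

```python
def insert_toggles(instrs, segments):
    """Emit the stream with cc_enable / cc_disable around each segment.

    segments: non-overlapping (start, end) index pairs, inclusive.
    """
    starts = {s for s, _ in segments}
    ends = {e for _, e in segments}
    out = []
    for i, ins in enumerate(instrs):
        if i in starts:
            out.append("cc_enable")
        out.append(ins)
        if i in ends:
            out.append("cc_disable")
    return out


stream = ["ld A", "ld x", "st A", "add"]
print(insert_toggles(stream, [(0, 0), (2, 2)]))
print(insert_toggles(stream, [(0, 2)]))
```

The second call shows the effect of aggregation: merging the two accesses to A into one segment halves the number of protocol toggles.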
Regarding the "cache consistency enable instruction and cache consistency disable instruction", it should be noted that, according to Chinese patent application 202410831240.X, the cache consistency disable instruction sets the state of the cache consistency protocol of the current processor core to the disabled state, and the cache consistency enable instruction sets it to the enabled state. When the cache consistency protocol of the current processor core is disabled, one or both of the following apply while a memory access instruction is executed: the current processor core does not initiate any cache consistency operation toward the cache consistency hardware system; the current processor core ignores any cache consistency operation initiated by other processor cores. In addition, when the cache consistency protocol is enabled and a memory access instruction hits a cache block marked with the private access state, the consistency state of that cache block must be initialized according to the attributes of the current access and the cache consistency protocol, and the private access mark must be removed. This initialization usually triggers cache consistency operations between processor cores and may therefore cause the cache block data to be reloaded from another core or from memory.
The present application further holds that a processor core whose cache consistency protocol is off does not initiate any cache consistency operation, but may choose whether to respond to cache consistency operations initiated by other processor cores. The cache consistency off instruction therefore has two modes: response-on and response-off. After the off instruction is executed in response-on mode, the current processor core always responds to cache consistency operations initiated by other processor cores; after it is executed in response-off mode, the current processor core ignores them.
In addition, in the present application every cache consistency off instruction that the compiler inserts during thread-level automatic parallelization of a loop uses the response-on mode. The reason is that a thread using a shared variable with its cache consistency protocol on may need to obtain the latest data of that variable from another thread whose protocol is off, so the other thread must still be able to respond to cache consistency operations while its protocol is off.
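The two modes of the off instruction can be pictured with a small software model. This is only an illustrative sketch: the names `Core`, `OffMode`, `cc_on`, and `cc_off` are hypothetical and do not come from the application, and the real mechanism is a hardware protocol state, not a Python object.

```python
from dataclasses import dataclass
from enum import Enum


class OffMode(Enum):
    """Hypothetical names for the two modes of the off instruction."""
    RESPONSE_ON = "response_on"    # keep answering requests from other cores
    RESPONSE_OFF = "response_off"  # ignore requests from other cores


@dataclass
class Core:
    """Toy model of one core's cache consistency protocol state."""
    consistency_on: bool = True
    off_mode: OffMode = OffMode.RESPONSE_ON

    def cc_on(self) -> None:
        # Models the cache consistency on instruction.
        self.consistency_on = True

    def cc_off(self, mode: OffMode = OffMode.RESPONSE_ON) -> None:
        # Models the cache consistency off instruction in the given mode.
        self.consistency_on = False
        self.off_mode = mode

    def initiates_consistency_ops(self) -> bool:
        # A core never initiates consistency operations while the protocol is off.
        return self.consistency_on

    def responds_to_remote_ops(self) -> bool:
        # While off, the core answers remote requests only in response-on mode.
        return self.consistency_on or self.off_mode is OffMode.RESPONSE_ON
```

In this model the compiler-inserted off instruction corresponds to calling `cc_off()` with its default argument, so the core stops initiating consistency operations but can still supply up-to-date data to other threads.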
Regarding "obtaining the cache consistency setting command from the directives (guidance instructions) of the automatic parallelization compilation, and obtaining all consistent shared variables from the cache consistency setting command", it should be noted that, taking OpenMP as an example, the directives for automatic parallelization compilation specify, among other things, the private or shared variables in a parallel region (such as a for loop in C/C++) and the mutually exclusive regions within a parallel region. Existing directives, however, cannot specify consistent shared variables, that is, shared variables whose read and write accesses require cache consistency to be guaranteed. Adding the specification of consistent shared variables to the directives is therefore an important innovation, or technical feature, of the present application. Figure 3D gives a corresponding example, in which "CC_vars" is the directive keyword that specifies consistent shared variables.
During automatic parallelization compilation, all private variables and shared variables in the parallel region must be determined. It is tempting to assume that every access to any shared variable needs a cache consistency guarantee, that is, that all shared variables are consistent shared variables. A keyword value (for example "CC_vars(default shared)") can be designed to specify this case conveniently. However, accesses to some shared variables do not need a cache consistency guarantee. For example, although the third loop in Figure 3D uses shared variable A as input, cache consistency need not be guaranteed when A is read in that loop, because the third loop will be parallelized in the same way as the second loop, and the second loop has already read A. In addition, program developers can manually partition the data of shared variables according to the requirements of thread-level parallelism so that threads do not share the same cache block, which removes the need to specify consistent shared variables. In other words, developers can reduce the number of consistent shared variables based on their understanding of, or improvements to, the program, and thereby reduce cache consistency overhead. The compilation directive for consistent shared variables should therefore support both marking all shared variables conveniently and enumerating specific shared variables flexibly.
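As a rough illustration of how a compiler front end might resolve such a clause, the sketch below turns the argument of a "CC_vars" clause into a list of consistent shared variables. Only the keyword "CC_vars" and the value "default shared" appear in the text; the comma-separated enumeration syntax and the function name `resolve_cc_vars` are assumptions made for this example.

```python
from typing import List


def resolve_cc_vars(clause_args: str, shared_vars: List[str]) -> List[str]:
    """Return the consistent shared variables selected by a CC_vars clause.

    clause_args is the text between the parentheses of CC_vars(...).
    The comma-separated list form is an assumed syntax, not taken from the source.
    """
    args = clause_args.strip()
    if args == "default shared":
        # Convenient form: every shared variable is a consistent shared variable.
        return list(shared_vars)
    # Enumerated form: only the listed variables need a consistency guarantee.
    names = [name.strip() for name in args.split(",") if name.strip()]
    unknown = [name for name in names if name not in shared_vars]
    if unknown:
        raise ValueError(f"not shared variables of this region: {unknown}")
    return names
```

For a region with shared variables A, B, and C, `resolve_cc_vars("default shared", ["A", "B", "C"])` selects all three, while `resolve_cc_vars("B, C", ["A", "B", "C"])` leaves A outside the consistency guarantee.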
Regarding "finding all access instructions to all consistent shared variables in the complete instruction stream of the loop generated by thread-level automatic parallelization compilation optimization", it should be noted that the compiler translates the original high-level language program into an assembly or binary program and performs compilation optimization along the way to make the program run faster. During this process the compiler works on an intermediate language, from which the variable corresponding to each memory access instruction in the instruction stream can be determined. The instructions that access any consistent shared variable can therefore be found in the complete instruction stream.
This application provides an optimization method for consistent shared variable access instructions that relies on data dependency analysis related to those instructions. Moving from thread-level automatic parallelization compilation in general to data dependency analysis is a specialization, or narrowing of scope: although compilers perform optimization based on data dependency analysis in the vast majority of cases, under some compilation options (for example O0) they may skip it.
Therefore, "analyzing the data dependencies in the loop, optimizing instruction scheduling according to those dependencies, and aggregating the multiple access instructions to multiple consistent shared variables" can be implemented as follows:
Analyzing the data dependencies between variables in a piece of code and scheduling instructions accordingly is a basic compiler function. A key problem for the present application is how to aggregate multiple consistent shared variable access instructions without breaking data dependencies, that is, how to make the distance between these instructions in the generated target instruction sequence as small as possible. Several compilation optimization methods are possible, differing in approach or in degree of optimization:
1) Basic optimization method. By analyzing the data dependencies among all instructions in the parallel region, determine whether two consistent shared variable access instructions depend on each other directly or indirectly. When the two instructions have no dependency, they can be scheduled to adjacent positions. When they do depend on each other, other instructions that have no dependency relationship with them should, as far as possible, not be scheduled between them.
2) Optimization based on loop unrolling. The basic method often leaves limited room for optimization. Automatic parallelization mostly targets loops, and apart from mutually exclusive regions, different iterations of such a loop must not depend on one another (the programmer is responsible for ensuring this); otherwise parallelization would break the dependency and the parallel run would produce wrong results. Loop unrolling can therefore be used to aggregate the access instructions that several iterations issue to a consistent shared variable (when it is an array).
3) Optimization based on private temporary arrays. Although loop unrolling is a common compilation optimization, the unrolling factor is usually limited, because unrolling sharply increases the number of generated instructions and can even prevent the instruction cache of the processor core from holding the whole loop body, which drastically slows the program down. When the loop is not unrolled, or is unrolled only a few times, temporary arrays can be used to aggregate the access instructions that several iterations issue to a consistent shared variable (when it is an array). First, a temporary array of N elements is allocated for each consistent shared variable. Then, once every N iterations, at most N elements of the consistent shared variable are read into the temporary array in one batch, or at most N elements of the temporary array are written back to the consistent shared variable array in one batch. In the original loop body, accesses to elements of the consistent shared variable are replaced with accesses to the corresponding elements of the temporary array.
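The temporary-array method in item 3) can be sketched as a plain simulation. The function below is a hypothetical example, not code from the application: it computes `out[i] = scale * shared_in[i]`, treats `shared_in` as the consistent shared array, and counts how many on/off instruction pairs the transformed loop would execute. The comments mark where the on and off instructions would be placed.

```python
def scaled_copy_blocked(shared_in, scale, block):
    """Compute out[i] = scale * shared_in[i], reading shared_in in batches.

    Returns the result and the number of on/off instruction pairs that the
    transformed loop would execute (one pair per batch).
    """
    n = len(shared_in)
    out = [0] * n
    pairs = 0
    tmp = [0] * block                  # private temporary array of N = block elements
    for start in range(0, n, block):
        count = min(block, n - start)  # the last batch may hold fewer than N elements
        # The cache consistency on instruction would be inserted here.
        pairs += 1
        for k in range(count):         # aggregated reads of the consistent shared array
            tmp[k] = shared_in[start + k]
        # The cache consistency off instruction (response-on mode) would be inserted here.
        for k in range(count):         # original loop body, now reading the private copy
            out[start + k] = scale * tmp[k]
    return out, pairs
```

With 10 elements and `block=4` the simulated loop executes 3 on/off pairs, whereas `block=1`, which corresponds to toggling the protocol around every single access, executes 10.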
Regarding "determining all consistent instruction stream segments from the complete instruction stream, and inserting a cache consistency on instruction before and a cache consistency off instruction after each consistent instruction stream segment", it should be noted that once instruction scheduling optimization is complete, the complete instruction stream of the loop after automatic parallelization compilation optimization is available, and all consistent shared variable access instructions in it can be marked. If two consistent shared variable access instructions are adjacent in the instruction stream, they naturally belong to the same consistent instruction stream segment. For two successive consistent instruction stream segments, merging them means that the other memory access instructions between them will incur additional consistency operation overhead, which can be estimated (for example, as the number of those memory access instructions multiplied by a baseline consistency operation cost). Merging also saves one invocation each of the cache consistency on instruction and the cache consistency off instruction, so the saved protocol control overhead must be estimated as well, for example using a baseline protocol control cost. When the added consistency operation overhead is less than the saved protocol control overhead, the two segments, together with the instructions between them, are merged into a single consistent instruction stream segment. In a simpler implementation, the two segments are merged whenever the number of other memory access instructions between them does not exceed a threshold.
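The segment construction and the cost-based merging rule described above can be sketched as follows. This is a minimal illustration under assumed inputs: instructions are represented only by two parallel boolean lists, and `op_cost` and `ctrl_cost` are hypothetical names for the baseline consistency operation cost and the baseline protocol control cost.

```python
from typing import List, Tuple


def build_segments(is_consistent: List[bool]) -> List[Tuple[int, int]]:
    """Group adjacent consistent accesses into inclusive (start, end) index ranges."""
    segments: List[Tuple[int, int]] = []
    for i, flag in enumerate(is_consistent):
        if not flag:
            continue
        if segments and segments[-1][1] == i - 1:
            segments[-1] = (segments[-1][0], i)   # extend the current segment
        else:
            segments.append((i, i))               # start a new segment
    return segments


def merge_segments(segments: List[Tuple[int, int]],
                   is_mem_access: List[bool],
                   op_cost: float,
                   ctrl_cost: float) -> List[Tuple[int, int]]:
    """Merge successive segments when the added cost is below the saved cost."""
    merged: List[Tuple[int, int]] = []
    for seg in segments:
        if merged:
            prev_start, prev_end = merged[-1]
            # Memory accesses in the gap would newly run with the protocol on.
            gap_accesses = sum(1 for i in range(prev_end + 1, seg[0])
                               if is_mem_access[i])
            added = gap_accesses * op_cost
            saved = ctrl_cost  # one on/off instruction pair is no longer needed
            if added < saved:
                merged[-1] = (prev_start, seg[1])
                continue
        merged.append(seg)
    return merged
```

The simpler threshold variant mentioned in the text corresponds to replacing the comparison `added < saved` with a check that `gap_accesses` does not exceed a fixed limit.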
Regarding the handling of cache consistency for data prefetching, it should be noted that many modern processors prefetch cache data automatically, and the common prefetching schemes are adjacent-line prefetching and constant-stride prefetching. While the cache consistency protocol is off, the processor may prefetch a cache block that will later need to be accessed with cache consistency guaranteed. After execution enters a region where the protocol is on, if the previously prefetched block is accessed without checking whether another core has changed its value, the data may not be updated in time. Chinese patent application 202410831240.X proposes marking cache blocks accessed while the protocol is off with the private access state; when an access with the protocol on hits a block marked with the private access state, a cache consistency operation must be performed on that block, or the block must be re-read, to ensure that its data is correct. In essence, this method treats the first access to a private-access-state block with the protocol on as a cache miss, and therefore introduces considerable additional overhead.
The technology of this application causes the cache consistency protocol to be turned on and off frequently while the parallelized loop executes, so consistent shared variables are frequently prefetched while the protocol is off. The method proposed in Chinese patent application 202410831240.X would then significantly reduce the benefit of prefetching consistent shared variables, so a more efficient method is needed. Data prefetch instructions are, for the most part, issued after the hardware has learned the address pattern of load and store instructions; that is, a data prefetch instruction is triggered by several load and store instructions. Based on this, the present application proposes a new method for data prefetching: when a data prefetch instruction is executed, if at least one of the memory access instructions that triggered it was executed with the cache consistency protocol on, the prefetch is executed as if the protocol were on; if all of the triggering memory access instructions were executed with the protocol off, the prefetch is executed without initiating cache consistency operations. This allows the current processor core, even while its cache consistency protocol is off, to use cache consistency operations to guarantee the correctness of prefetched data. This function is easy to implement in hardware.
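The prefetch rule reduces to a single decision per prefetch. The sketch below expresses that decision in software purely for illustration; the function name and the boolean encoding of the triggering accesses are assumptions, and the application describes this as a hardware function.

```python
def prefetch_uses_consistency(trigger_states) -> bool:
    """Decide whether a prefetch should issue cache consistency operations.

    trigger_states holds one boolean per load/store that triggered the
    prefetch: True if that access ran with the protocol on, False otherwise.
    """
    # One access with the protocol on is enough to run the prefetch as "on";
    # the prefetch skips consistency operations only if every trigger ran "off".
    return any(trigger_states)
```

For example, `prefetch_uses_consistency([False, True, False])` is true, so that prefetch fetches the block with consistency operations even if the core's protocol happens to be off at the moment the prefetch is issued.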
Regarding turning consistency on and off around mutually exclusive regions: a loop to be automatically parallelized at the thread level may contain a mutually exclusive region (critical section) that only one thread may execute at a time. Such regions usually modify shared variables. The entire mutually exclusive region can therefore be treated as one consistent instruction stream segment, with a cache consistency on instruction inserted before it and a cache consistency off instruction inserted after it.
Regarding automatic parallelization of non-loop code: the parallel region that a compilation directive marks for automatic thread-level parallelization may also be non-loop code. The technology of this application applies to such regions as well; the parallel region is simply treated as a special loop that executes once. The region may also contain small loops that access consistent shared variables, in which case the loop unrolling and temporary array optimizations described above can be applied in addition to the basic optimization method.
Embodiment 6
On the basis of the above embodiments, this embodiment provides a computer device comprising a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the steps of the method described in the above embodiments.
Some implementations of this embodiment provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the method described in the above embodiments.
Some implementations of this embodiment provide a computer program product comprising a computer program/instructions that, when executed by a processor, implement the steps of the method described in the above embodiments.
The processor may include, but is not limited to, one or more processors or microprocessors. Each processor may be implemented as an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a controller, a microcontroller, a microprocessor, or another electronic component, and is used to execute the methods in the above embodiments.
The computer-readable storage medium may be implemented by any type of volatile or non-volatile storage device or a combination thereof, and may include, but is not limited to, random access memory (RAM), read-only memory (ROM), flash memory, EPROM, EEPROM, registers, and computer storage media (for example, hard disks, floppy disks, solid-state drives, removable disks, CD-ROM, DVD-ROM, and Blu-ray discs).
The computer-readable storage medium may also store at least one computer-executable program/instruction, for example computer-readable instructions. The computer-readable storage medium includes, but is not limited to, volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. The computer-readable storage medium may also include, for example, read-only memory (ROM), a hard disk, or flash memory. For example, a non-transitory computer-readable storage medium may be connected to a computing device such as a computer, and the methods described above are performed when the computing device runs the computer-readable instructions stored on the medium.
In addition, the computer device may include, but is not limited to, a data bus, an input/output (I/O) bus, a display, and input/output devices (for example, a keyboard, a mouse, and speakers).
The processor may communicate with external devices through the I/O bus over a wired or wireless network.
In one implementation, the at least one computer-executable instruction may also be compiled into, or constitute, a software product/computer program product, in which one or more computer-executable instructions, when run by a processor, perform the functions and/or method steps of the embodiments described herein.
In the embodiments provided in the present disclosure, it should be understood that the disclosed devices and methods can also be implemented in other ways. The device embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings show the possible architecture, functions, and operations of devices, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations the functions noted in the blocks may occur in an order different from that shown in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
It should be noted that in the present disclosure the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element introduced by the phrase "includes a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
Although the embodiments of the present disclosure are as described above, they are provided only to facilitate understanding of the disclosure and are not intended to limit it. Any person skilled in the art to which the present disclosure belongs may make modifications and changes to the form and details of the implementation without departing from the spirit and scope of the disclosure, but the scope of patent protection of the present disclosure remains subject to the scope defined by the appended claims.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411311602.9A CN118820134B (en) | 2024-09-20 | 2024-09-20 | Cache consistency optimization method in automatic thread-level parallelization |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411311602.9A CN118820134B (en) | 2024-09-20 | 2024-09-20 | Cache consistency optimization method in automatic thread-level parallelization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118820134A (en) | 2024-10-22 |
CN118820134B (en) | 2025-02-07 |
Family
ID=93075055
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411311602.9A Active CN118820134B (en) | 2024-09-20 | 2024-09-20 | Cache consistency optimization method in automatic thread-level parallelization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118820134B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119883435A (en) * | 2025-03-28 | 2025-04-25 | 北京麟卓信息科技有限公司 | Instruction conversion memory conflict optimization method based on instruction stream feature recognition |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8181168B1 (en) * | 2007-02-07 | 2012-05-15 | Tilera Corporation | Memory access assignment for parallel processing architectures |
CN105117369A (en) * | 2015-08-04 | 2015-12-02 | 复旦大学 | Heterogeneous platform based multi-parallel error detection system framework |
CN114298892A (en) * | 2021-12-29 | 2022-04-08 | 长沙景嘉微电子股份有限公司 | A cache module and system for distributed processing unit |
CN118377637A (en) * | 2024-06-26 | 2024-07-23 | 北京卡普拉科技有限公司 | Method, device, equipment and storage medium for reducing redundant cache consistency operation |
CN118656236A (en) * | 2024-08-19 | 2024-09-17 | 北京卡普拉科技有限公司 | Cache consistency optimization method, device and equipment for multi-level bus |
Also Published As
Publication number | Publication date |
---|---|
CN118820134B (en) | 2025-02-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||