CN1149494C - Method and system for maintaining cache coherency
- Publication number
- CN1149494C (grant) · CNB001188542A / CN00118854A (application)
- Authority
- CN
- China
- Prior art keywords
- write
- system bus
- cache
- store
- cache memory
- Prior art date
- Legal status: Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0866—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
- G06F12/0868—Data transfer between cache memory and other subsystems, e.g. storage devices or host systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0811—Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/40—Bus structure
- G06F13/4063—Device-to-bus coupling
- G06F13/4068—Electrical coupling
- G06F13/4072—Drivers or receivers
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
Description
Technical Field
The present invention relates generally to an improved method and system for data processing, and more particularly to an improved method and system for maintaining cache coherency in a multiprocessor data processing system. Still more particularly, the present invention relates to a method and system for maintaining cache coherency during write-through store operations in a multiprocessor system.
Background Art
Most modern high-performance data processing system architectures include multiple levels of cache memory within the storage hierarchy. Caches are employed in data processing systems to provide faster access to frequently used data than the time required to access system memory allows, thereby improving overall performance. The cache levels are typically arranged in order of progressively longer access latency: smaller, faster caches are placed at levels of the hierarchy closer to the processor or processors, while larger, slower caches are placed at levels closer to system memory.
In a conventional symmetric multiprocessor (SMP) data processing system, all of the processors are generally identical: they use a common instruction set and communication protocol, have similar hardware architectures, and are generally provided with similar memory hierarchies. For example, a conventional SMP data processing system comprises a system memory; a plurality of processing units, each including a processor and one or more levels of cache memory; and a system bus coupling the processing units to each other and to the system memory. Many such systems include at least one level of cache shared between two or more processors. To obtain valid execution results in an SMP data processing system, it is important to maintain a coherent memory hierarchy, that is, to present the same view of the memory contents to all of the processors.
Although snooping is designed to maintain cache coherency, a "retry" snoop response can cause erroneous processor operation. In particular, for write-through stores, retrying the same write-through store is problematic once the write update has been performed and a subsequent load has been allowed to read the new data.
Therefore, a need exists for a method of maintaining cache coherency in a multiprocessor system, and in particular for maintaining cache coherency for write-through store operations in the presence of retries.
Summary of the Invention
It is therefore one object of the present invention to provide an improved method and system for data processing. It is another object of the present invention to provide an improved method and system for maintaining cache coherency in a multiprocessor data processing system.
It is yet another object of the present invention to provide an improved method and system for maintaining cache coherency during write-through store operations in a multiprocessor system.
The foregoing objects are achieved by the method and system described herein. The method and system of the present invention maintain cache coherency during write-through store operations in a data processing system, where the data processing system includes multiple processors coupled to a system bus through a memory hierarchy that includes multiple levels of cache. A write-through store operation is passed from a particular processor to the system bus through any caches of the multi-level hierarchy interposed between that processor and the system bus. The write-through store may be performed in any interposed cache that it hits. All caches of the multi-level hierarchy that are not interposed between the particular processor and the system bus snoop the data address of the write-through operation via an external snoop path from the system bus until the write-through operation succeeds. The coherency point of the memory hierarchy's caches is thereby placed on the system bus for write-through store operations, so that the write-through operation completes successfully before any other instruction to the same data address completes.
The above as well as additional objects, features, and advantages of the present invention will become apparent in the following detailed description.
Brief Description of the Drawings
The novel features of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use and further objects and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, in which:
FIG. 1 depicts a timing diagram of the error that occurs when a write-through store instruction is retried under a current snooping technique;
FIG. 2 illustrates a high-level block diagram of a multiprocessor data processing system in accordance with the present invention;
FIG. 3 depicts a timing diagram of the performance of a write-through store instruction with the self-snooping technique; and
FIG. 4 illustrates a high-level logic flowchart of a process for performing a write-through store operation.
Detailed Description of the Preferred Embodiment
Referring now to the drawings, and in particular to FIG. 2, there is depicted a high-level block diagram of a multiprocessor data processing system in accordance with the present invention. As illustrated, data processing system 8 includes a number of processor cores 10a-10n paired with a number of additional processor cores 11a-11n, each preferably comprising one of the PowerPC line of processors available from International Business Machines Corporation. In addition to the conventional registers, instruction flow logic, and execution units utilized to execute program instructions, each of processor cores 10a-10n and 11a-11n also includes an on-board level one (L1) cache 12a-12n and 13a-13n, respectively, which temporarily stores instructions and data that are likely to be accessed by the associated processor. Although L1 caches 12a-12n and 13a-13n are illustrated in FIG. 2 as unified caches that store both instructions and data (both referred to hereinafter simply as data), those skilled in the art will appreciate that each of L1 caches 12a-12n and 13a-13n could alternatively be implemented as bifurcated instruction and data caches.
In order to reduce latency, data processing system 8 also includes one or more additional levels of cache memory, such as level two (L2) caches 14a-14n, which are utilized to stage data to L1 caches 12a-12n and 13a-13n. In other words, L2 caches 14a-14n function as intermediate storage between system memory 18 and L1 caches 12a-12n and 13a-13n, and can typically store a much larger amount of data than L1 caches 12a-12n and 13a-13n, but at a longer access latency. For example, L2 caches 14a-14n may have a storage capacity of 256 or 512 kilobytes, while L1 caches 12a-12n and 13a-13n may have a storage capacity of 64 or 128 kilobytes. As noted above, although FIG. 2 depicts only two levels of cache, the memory hierarchy of data processing system 8 could be expanded to include additional levels (L3, L4, etc.) of serially-connected or lookaside caches.
As illustrated, data processing system 8 further includes input/output (I/O) devices 20, system memory 18, and non-volatile storage 22, each of which is coupled to interconnect 16. I/O devices 20 comprise conventional peripheral devices, such as a display device, a keyboard, and a graphical pointer, which are interfaced to interconnect 16 via conventional adapters. Non-volatile storage 22 stores an operating system and other software, which are loaded into volatile system memory 18 when data processing system 8 is powered on. Of course, those skilled in the art will appreciate that data processing system 8 can include many additional components that are not shown in FIG. 2, such as serial and parallel ports for connection to networks or attached devices, a memory controller that regulates access to system memory 18, and the like.
Interconnect 16, which comprises one or more buses including a system bus, serves as a conduit for communication among L2 caches 14a-14n, system memory 18, I/O devices 20, and non-volatile storage 22. A typical communication transaction on interconnect 16 includes a source tag indicating the source of the transaction, a destination tag specifying the intended recipient of the transaction, an address, and/or data. Each device coupled to interconnect 16 preferably snoops all communication transactions on interconnect 16 in order to determine whether the coherency state it maintains must be updated in response to the transaction. An external snoop path from each cache to the system bus of interconnect 16 is preferably provided.
A coherent memory hierarchy is maintained through the use of a selected memory coherency protocol, such as the MESI protocol. In the MESI protocol, an indication of coherency state is stored in association with each coherency granule (e.g., a cache line or sector) of at least all upper-level (cache) memories. Each coherency granule can have one of four states, modified (M), exclusive (E), shared (S), or invalid (I), which can be encoded by two bits in the cache directory. The modified state indicates that a coherency granule is valid only in the cache storing the modified coherency granule and that the value of the modified coherency granule has not been written to system memory. When a coherency granule is indicated as exclusive, the coherency granule is resident, of all caches at that level of the memory hierarchy, only in the cache holding the granule in the exclusive state; the data in the exclusive state is, however, consistent with system memory. If a coherency granule is marked as shared in the cache directory, the coherency granule is resident in the associated cache and possibly in other caches at the same level of the memory hierarchy, and all copies of the coherency granule are consistent with system memory. Finally, the invalid state indicates that neither the data nor the address tag associated with a coherency granule is resident in the cache.
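The four MESI states and their two-bit directory encoding described above can be sketched as follows. This is a minimal illustration only; the particular bit values are assumptions, since the patent does not specify an encoding:

```python
from enum import Enum

class MESI(Enum):
    """Two-bit MESI state encoding (bit values are illustrative assumptions)."""
    MODIFIED = 0b00   # valid only in this cache; not yet written to system memory
    EXCLUSIVE = 0b01  # only copy at this cache level, but consistent with memory
    SHARED = 0b10     # may also reside in peer caches; consistent with memory
    INVALID = 0b11    # neither data nor address tag resident in the cache

def consistent_with_memory(state: MESI) -> bool:
    """Only the E and S states guarantee the cached value matches system memory."""
    return state in (MESI.EXCLUSIVE, MESI.SHARED)

print(consistent_with_memory(MESI.MODIFIED))   # False
print(consistent_with_memory(MESI.SHARED))     # True
```

As the predicate shows, a modified line is the single up-to-date copy and must eventually be written back, which is why an M-state snoop hit cannot simply be invalidated.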
Each cache line (block) of data in the SMP system preferably includes an address tag field, a state bit field, an inclusivity bit field, and a value field for storing the actual instructions or data. The state bit field and inclusivity bit field are used to maintain cache coherency in the multiprocessor computer system (indicating whether the value stored in the cache is valid). The address tag is a subset of the full address of the corresponding memory block. A compare match of an incoming address with one of the tags within the address tag field, when the entry is in a valid state, indicates a cache "hit".
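The hit test just described (compare the incoming address's tag against the stored tag of a valid entry) can be sketched as follows. The 32-bit address split into tag, index, and offset widths is a hypothetical choice for illustration; the patent does not fix any particular geometry:

```python
# Minimal sketch of the tag-match "hit" test for a direct-mapped cache.
# The field widths below (64-byte lines, 256 sets) are assumptions.
OFFSET_BITS = 6    # 64-byte cache line
INDEX_BITS = 8     # 256 sets

class CacheLine:
    def __init__(self):
        self.valid = False          # state bit: entry holds valid data
        self.tag = 0                # address tag: upper bits of the block address
        self.data = bytearray(1 << OFFSET_BITS)  # value field

lines = [CacheLine() for _ in range(1 << INDEX_BITS)]

def split(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def is_hit(addr):
    tag, index, _ = split(addr)
    line = lines[index]
    return line.valid and line.tag == tag   # valid entry AND matching tag

# Install one line, then probe it and a tag-mismatching alias.
tag, index, _ = split(0x12345678)
lines[index].valid, lines[index].tag = True, tag
print(is_hit(0x12345678), is_hit(0x92345678))  # True False
</```

The second probe lands in the same set but carries a different tag, so it misses even though the entry is valid, which is exactly the comparison the snoop logic repeats against bus addresses.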
With respect to maintaining cache coherency, a write-through cache store does not allocate a cache line or gain ownership (the E or M state of the MESI protocol) before performing the store in the cache. In particular, a write-through or store-through cache operates such that, during a processor write, the write is performed both to the cache and to main memory, thereby ensuring consistency between the data in the cache and in main memory. To maintain cache coherency, a coherent write-through store must invalidate any valid copy of the cache line on processors other than the originating one, outward from the cache coherency point, in order to ensure that subsequent loads from all processors obtain the newly updated data.
Typically, a bus "snooping" technique is used to invalidate cache lines from the cache coherency point. Each cache preferably includes snoop logic for this purpose. Whenever a read or write is performed, the address of the data is propagated from the originating processor core to all other caches sharing a common bus. Each snoop logic unit snoops the address from the bus and compares it against the address tag array of its cache. On a hit, a snoop response is returned, allowing further operations to be performed to maintain cache coherency, such as invalidating the cache line that was hit. In addition, when a cache holds a modified copy that must first be pushed out of the cache, or has a store or other condition that prevents proper snooping, a "retry" snoop response is issued by the cache's bus snoop logic. Upon a retry, the processor core that originated the data address will retry the read or write operation.
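The snoop response decision described above can be sketched as follows: a snooped bus address is compared against the cache's tag array; a hit normally invalidates the line, but a modified (M) line that must first be pushed out, or a conflict with a queued operation, produces a "retry" response. The dict-based cache model and the busy-address check are illustrative assumptions:

```python
# Sketch of per-cache snoop logic. `cache` maps addresses to line state,
# standing in for the tag-array lookup; `busy_addrs` models queued
# operations that prevent proper snooping (an assumption for illustration).
def snoop(cache, addr, busy_addrs):
    if addr in busy_addrs:
        return "retry"              # queued operation blocks proper snooping
    line = cache.get(addr)          # tag-array comparison; None means miss
    if line is None:
        return "clean"              # miss: nothing to do
    if line["state"] == "M":
        return "retry"              # modified copy must be pushed out first
    line["state"] = "I"             # invalidate the hit line
    return "invalidated"

cache = {0x100: {"state": "S"}, 0x200: {"state": "M"}}
print(snoop(cache, 0x100, set()))   # invalidated
print(snoop(cache, 0x200, set()))   # retry
print(snoop(cache, 0x300, set()))   # clean
```

The retry path is the one that causes the problem analyzed next: the originating core must reissue its operation, and other traffic can slip in before it does.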
FIG. 1 depicts a timing diagram of the error that occurs when a write-through store instruction is retried according to a snooping technique that is an alternative to the preferred embodiment. In this example, an SMP architecture is assumed with a processor core 0 and a processor core 1, an L1 cache associated with each core, and an L2 cache shared by both processor cores. The point at which cache coherency is maintained for the processors in this example is set at the L2 cache. Additional processor cores and levels of cache may be present, but are not utilized for the purposes of the example in FIG. 1.
For this example, the pseudocode sequence is:

Processor Core 0        Processor Core 1
store 2 to A            loop: load A
                        if A != 2, loop
                        store 3 to A
If the store from processor core 0 is performed but is retried, and the load and store from processor core 1 are allowed to proceed before the store from processor core 0 is performed again, the resulting coherent memory state for address A is 2, which is incorrect.
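The incorrect interleaving just described can be replayed in a deliberately simplified single-threaded model of the shared L2 (the real failure involves bus arbitration timing, which this sketch flattens into sequential steps):

```python
# Simplified replay of the failing interleaving: core 0's store commits to
# the shared L2 before its snoop is retried; core 1 then loads the new
# value and stores 3; finally core 0's retried store replays its L2 write.
# The shared L2 is modeled as a plain dict keyed by symbolic address.
L2 = {"A": 0}

L2["A"] = 2          # core 0: "store 2 to A" commits to L2, then is retried
while L2["A"] != 2:  # core 1: loop: load A; if A != 2, loop
    pass
L2["A"] = 3          # core 1: "store 3 to A"
L2["A"] = 2          # core 0: retried "store 2 to A" rewrites the L2

print(L2["A"])       # 2, not the intended final value 3
```

The final state 2 demonstrates the error: the retried first store overwrites the second store that logically depended on it.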
In the first clock cycle, depicted at reference numeral 60 in the timing diagram, the bus is arbitrated for by core 0 (core 0 WTST), whereby the address and data (RA) of the write-through store operation are transferred to the L2 cache. Thereafter, at reference numeral 62, the data address of the write-through store is propagated on the system bus to all non-originating cores (core 1), whereby the non-originating cores snoop the data address. In addition, during the same cycle, at reference numeral 64, the data address is compared against the L2 tag array in order to determine whether a previous version of the data resides in the L2 cache. In the third cycle, at reference numeral 66, the snooped address is compared against the L1 tag array in the L1 cache associated with core 1. Further, an L2 hit is returned in the L2 cache, as depicted at reference numeral 68. Thereafter, the L2 data write is performed by placing the write into the pipeline for updating the L2 cache to "A=2", as depicted at reference numeral 70. Next, during the fourth clock cycle, the snoop response of core 1's L1 cache is returned as a retry, as depicted at reference numeral 72.
It should be noted, particularly with this non-preferred snooping technique, that the write-through store updates the L2 cache before the snoop response indicating a retry is returned. The retry is returned for reasons including a snoop hit on a sector in the M state or a snoop hit against a queued outstanding operation. When the retry is returned by core 1's L1 cache, core 0 is set to retry the write-through store operation. Because cache coherency is maintained at the L2 cache, the retried write-through store is again performed in the L2 cache and updates any higher-level caches before the write-through operation is sent to the bus.
While "A != 2", processor core 1 waits in a loop. Once the store operation from core 0 has been written to the L2 cache, core 1, even though the retry is set in core 0, arbitrates for the bus for the load and propagates the data address, as depicted at reference numeral 74. Next, the address is compared against the L2 tag array of the L2 cache, as depicted at reference numeral 76. Thereafter, an L2 cache hit is received, as depicted at reference numeral 78. Finally, a read of the data in the L2 cache is performed, where "A=2", as depicted at reference numeral 80. After the data is read following delay period 81, core 1 exits the loop and performs the store operation "store 3 to A".
Core 1 arbitrates for the bus to transfer its write-through store operation, whereby the data address of the write-through store is propagated, as depicted at reference numeral 82. Next, an L2 tag comparison is performed, as depicted at reference numeral 84. Thereafter, an L2 cache hit is received, as depicted at reference numeral 86. Finally, the data is committed to the L2 cache pipeline as a write of "A=3", as depicted at reference numeral 88.
Because the load and store operations from core 1 have arbitrated for the local bus, the retry of core 0's "store 2 to A" operation is delayed until the bus is next available. Core 0 then reissues the write-through store operation, which is received by the L2 cache, as depicted at reference numeral 90. The data address is broadcast locally, whereby core 1 is snooped, as depicted at reference numeral 92. Thereafter, the L1 tags are compared in core 1's L1 cache, as depicted at reference numeral 94. Next, the L2 tags are compared in the L2 cache, as depicted at reference numeral 96. A cache hit is returned by the L2, as depicted at reference numeral 98. Finally, the data is rewritten into the L2 cache, whereby "A=2", as depicted at reference numeral 100.
Thus, as described above, if a write-through store is snooped locally and the store is retried, another processor core that arbitrates for the bus can perform a load, see the updated data in the L2 cache, and perform a write-through store of its own before the original store receives arbitration again. The retried first write-through store then overwrites the data of the second store that depended on the first store.
One possible solution to the problem illustrated in FIG. 1 is to delay the L2 data and address pipeline such that the commit phase follows the snoop retry phase. To implement this solution, either L2 reads would have to be separated from L2 writes, or L2 reads would have to be delayed. In the first case, the complexity of L2 arbitration would increase significantly. In the second case, two additional cycles would be added to all L2 cache hit conditions, resulting in an undesirable performance loss.
Another solution is to flush the committed L2 update back to the previous state of the write-through operation, in a manner similar to that used by register renaming schemes, which are well known in the art. For a cache, this solution adds additional undesirable complexity and would reduce the speed of the cache.
Referring now to FIG. 3, there is depicted a timing diagram of the performance of a write-through store instruction using the self-snooping technique in accordance with a preferred embodiment of the present invention. FIG. 3 illustrates the same processor operations depicted in FIG. 1; in FIG. 3, however, self-snooping is used to eliminate the error caused by retries. Core 0 issues a write-through store operation that is received at the L2 cache, whereby L2 cache arbitration is performed, as depicted at reference numeral 110. Next, a comparison of the address against the L2 tag array is performed, as depicted at reference numeral 112. Next, a tagged cache hit in the L2 tag array is received, as depicted at reference numeral 114. Thereupon, the data write to the L2 cache is placed in the pipeline for execution, as depicted at reference numeral 116. After delay 117, while the write-through store operation arbitrates for the system bus in order to write to main memory, a self-snoop along the system bus is arbitrated, as depicted at reference numeral 118. In FIG. 1, the cache coherency point for write-through store operations is at the L2 cache; in the present embodiment, however, the cache coherency point for write-through store operations is on the system bus. With the cache coherency point on the system bus for write-through store operations, if a retry is raised during the self-snoop, the write-through operation is snooped on the system bus as many times as needed until no retry response is returned, regardless of other waiting instructions. In particular, the system bus includes bus arbitration logic that ensures the snooping device retains access to the bus until coherency for the write-through store has been achieved in all caches, so that the data can then be written to main memory.
In addition to the self-snoop, the local data address of the write-through store operation is propagated along the external snoop paths to the non-originating cores, as depicted at reference numeral 120. Thereafter, a comparison of the address against the L1 tag array is performed, as depicted at reference numeral 122. In the next cycle, the response of the L1 tag comparison is returned, as depicted at reference numeral 124. If the response is a retry, the address of the write-through store will continue to arbitrate for the system bus for self-snooping until the L1 cache returns a non-retry response.
Once a non-retry response is returned, core 1 arbitrates for the local bus to perform a load, as depicted at reference numeral 126. In an alternative embodiment, the core 1 load need not wait until the store has been committed to the system bus without retry. For example, if the load hits and is committed in the L2 cache, starting the core 1 load after the L2 data write depicted at reference numeral 116 does not break data coherency. Thereafter, a comparison of the address against the L2 tag array is performed, as depicted at reference numeral 128. Next, a tagged L2 hit in the L2 tag array is returned, as depicted at reference numeral 130. Thereafter, the data is read from the L2, as depicted at reference numeral 132. After delay 133, core 1 arbitrates for the local bus for its own write-through store, as depicted at reference numeral 134. Thereafter, a comparison of the address against the L2 tag array is performed, as depicted at reference numeral 136. Next, a tagged L2 hit in the L2 tag array is returned, as depicted at reference numeral 138. Thereafter, the L2 data write is committed, as depicted at reference numeral 140. As with core 0's write-through store, after the L2 data write depicted at reference numeral 140, core 1's write-through store operation proceeds to the system bus for the update in main memory, whereby cache coherency is maintained through self-snooping by way of the system bus.
FIG. 4 depicts a high-level logic flowchart of a process for performing a write-through store operation. The process starts at block 150 and thereafter proceeds to block 152. Block 152 depicts the processor core arbitrating for the local bus in order to send the address of the write-through store operation to the next lower level of cache. Thereafter, block 154 illustrates comparing the address against the tag array in the lower-level cache. Next, block 156 depicts a determination of whether there is a tagged hit in the lower-level cache. If there is a tagged hit in the lower-level cache, the process passes to block 158. Block 158 depicts committing the data to a write in the lower-level cache. Thereafter, the process passes to block 160. Returning to block 156, if there is no tagged hit in the lower-level cache, the process passes to block 160. Although not depicted, the process illustrated in blocks 154, 156, and 158 may be performed at multiple levels of lower-level cache.
Block 160 depicts passing the write-through store operation to the system bus. Next, block 162 illustrates arbitrating for the system bus to send the write-through store operation to memory and performing the self-snoop of the system bus. Thereafter, block 164 depicts snooping the address in the non-traversed caches through the external snoop paths. A non-traversed cache is any cache that does not lie on the path from the processor core originating the write-through store operation to the system bus. Next, block 166 illustrates comparing the snooped address against the tag arrays in the non-traversed caches. Thereafter, block 168 depicts a determination of whether the snoop returns a retry. If the snoop returns a retry, the process passes back to block 162. If the snoop does not return a retry, the process passes to block 170. Block 170 depicts committing the write-through store to main memory. Thereafter, block 172 illustrates releasing the system bus to the next operation, after which the process returns.
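The flow of FIG. 4 can be sketched end to end as follows. All interfaces here (dict-based caches, the `cache_busy` predicate) are hypothetical stand-ins for the patent's hardware structures, not its actual signals:

```python
# Sketch of the FIG. 4 flow. `lower_caches` are the interposed caches
# between the originating core and the system bus (blocks 152-158);
# `other_caches` are the non-traversed caches snooped via the external
# snoop path (blocks 164-168). Caches are dicts mapping address -> value.

def cache_busy(cache):
    # Illustrative assumption: snoops always complete on the first pass.
    # A real cache would report M-state push-outs or queued conflicts here.
    return False

def write_through_store(addr, value, lower_caches, other_caches, memory):
    # Blocks 152-158: send the store down through the interposed caches,
    # updating any level where the tag comparison hits.
    for cache in lower_caches:
        if addr in cache:
            cache[addr] = value
    # Blocks 160-168: arbitrate the system bus, self-snoop, and re-snoop
    # the non-traversed caches until no snooper answers "retry".
    while True:
        retried = False
        for cache in other_caches:
            hit = cache.pop(addr, None)     # snoop hit invalidates the line
            if hit is not None and cache_busy(cache):
                retried = True              # snoop could not complete; retry
        if not retried:
            break
    # Blocks 170-172: commit to main memory and release the bus.
    memory[addr] = value

l2 = {0x40: 0}          # interposed cache holding a stale copy
peer_l1 = {0x40: 1}     # non-traversed cache holding a stale copy
mem = {0x40: 0}
write_through_store(0x40, 7, [l2], [peer_l1], mem)
print(l2[0x40], mem[0x40], 0x40 in peer_l1)  # 7 7 False
```

After the call, the interposed L2 and main memory hold the new value and the peer's stale copy has been invalidated, matching the coherency outcome the flowchart guarantees.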
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention. For example, an alternative embodiment allows pipelining of requests to the system bus, whereby a request to the same address as a pending request may be arbitrated as a subsequent waiting request before the pending request has been committed (received a non-retry response) or completed (read or written the associated data), as long as requests are committed in the same order in which they appear on the system bus and data ordering is likewise maintained.
Claims (8)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US33651699A | 1999-06-18 | 1999-06-18 | |
US09/336516 | 1999-06-18 | ||
US09/336,516 | 1999-06-18 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1278625A CN1278625A (en) | 2001-01-03 |
CN1149494C true CN1149494C (en) | 2004-05-12 |
Family
ID=23316452
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB001188542A Expired - Fee Related CN1149494C (en) | 1999-06-18 | 2000-06-15 | Method and system for maintaining cache coherency |
Country Status (4)
Country | Link |
---|---|
JP (1) | JP2001043133A (en) |
KR (1) | KR100380674B1 (en) |
CN (1) | CN1149494C (en) |
TW (1) | TW548547B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1320464C (en) * | 2003-10-23 | 2007-06-06 | 英特尔公司 | Method and equipment for maintenance of sharing consistency of cache memory |
US7788451B2 (en) | 2004-02-05 | 2010-08-31 | Micron Technology, Inc. | Apparatus and method for data bypass for a bi-directional data bus in a hub-based memory sub-system |
US7257683B2 (en) | 2004-03-24 | 2007-08-14 | Micron Technology, Inc. | Memory arbitration system and method having an arbitration packet protocol |
US7725619B2 (en) * | 2005-09-15 | 2010-05-25 | International Business Machines Corporation | Data processing system and method that permit pipelining of I/O write operations and multiple operation scopes |
US7568073B2 (en) * | 2006-11-06 | 2009-07-28 | International Business Machines Corporation | Mechanisms and methods of cache coherence in network-based multiprocessor systems with ring-based snoop response collection |
GB0623276D0 (en) * | 2006-11-22 | 2007-01-03 | Transitive Ltd | Memory consistency protection in a multiprocessor computing system |
US20120254541A1 (en) * | 2011-04-04 | 2012-10-04 | Advanced Micro Devices, Inc. | Methods and apparatus for updating data in passive variable resistive memory |
US10970225B1 (en) * | 2019-10-03 | 2021-04-06 | Arm Limited | Apparatus and method for handling cache maintenance operations |
-
2000
- 2000-05-16 TW TW089109357A patent/TW548547B/en not_active IP Right Cessation
- 2000-06-09 KR KR10-2000-0031588A patent/KR100380674B1/en not_active IP Right Cessation
- 2000-06-15 CN CNB001188542A patent/CN1149494C/en not_active Expired - Fee Related
- 2000-06-15 JP JP2000180625A patent/JP2001043133A/en active Pending
Also Published As
Publication number | Publication date |
---|---|
KR100380674B1 (en) | 2003-04-18 |
CN1278625A (en) | 2001-01-03 |
TW548547B (en) | 2003-08-21 |
JP2001043133A (en) | 2001-02-16 |
KR20010015008A (en) | 2001-02-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8180981B2 (en) | Cache coherent support for flash in a memory hierarchy | |
US6199144B1 (en) | Method and apparatus for transferring data in a computer system | |
US6523109B1 (en) | Store queue multimatch detection | |
TWI397813B (en) | Apparatus,method and system for global overflow in a virtualized transactional memory | |
US6990559B2 (en) | Mechanism for resolving ambiguous invalidates in a computer system | |
US7549025B2 (en) | Efficient marking of shared cache lines | |
JP2022534892A (en) | Victim cache that supports draining write-miss entries | |
US7917698B2 (en) | Method and apparatus for tracking load-marks and store-marks on cache lines | |
JP3016575B2 (en) | Multiple cache memory access methods | |
US7003635B2 (en) | Generalized active inheritance consistency mechanism having linked writes | |
US20080065864A1 (en) | Post-retire scheme for tracking tentative accesses during transactional execution | |
JPH10254773A (en) | Accessing method, processor and computer system | |
JPH07271668A (en) | Calculation system for input-output address conversion mechanism | |
JP2000285021A (en) | Multiprocessing system using pending tag for keeping cache coherency | |
US20080104326A1 (en) | Facilitating store reordering through cacheline marking | |
US6898676B2 (en) | Computer system supporting both dirty-shared and non-dirty-shared data processing entities | |
US7051163B2 (en) | Directory structure permitting efficient write-backs in a shared memory computer system | |
CN1226023A (en) | Load/load detection and reorder method | |
CN1127693C (en) | DCBST with ICBI mechanism | |
CN1149494C (en) | Method and system for maintaining cache coherency | |
US7000080B2 (en) | Channel-based late race resolution mechanism for a computer system | |
CN101833517B (en) | Cache memory system and its access method | |
CN115176237B (en) | Data cache with hybrid write-back and write-through | |
US7130965B2 (en) | Apparatus and method for store address for store address prefetch and line locking | |
CN1086036C (en) | Information processing system and method of operation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C06 | Publication | ||
PB01 | Publication | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C19 | Lapse of patent right due to non-payment of the annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |