JP5116418B2

JP5116418B2 - Method for processing data in a multiprocessor data processing system, processing unit for multiprocessor data processing system, and data processing system

Info

Publication number: JP5116418B2
Application number: JP2007247897A
Authority: JP
Inventors: Guy Lynn Guthrie; William John Starke; Derek Edward Williams; Phillip G. Williams
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2006-10-09
Filing date: 2007-09-25
Publication date: 2013-01-09
Anticipated expiration: 2027-09-25
Also published as: CN101162442A; JP2008097598A; US8495308B2; US20080086602A1

Description

The present invention relates to improvements in multiprocessor data processing systems and, more particularly, to improved coherency management of a hierarchical cache system in a multiprocessor data processing system.

A conventional symmetric multiprocessor (SMP) computer system, such as a server computer system, includes a plurality of processing units all connected to a system interconnect bus, which typically comprises one or more address, data, and control buses. System memory connected to the system interconnect bus represents the lowest level of volatile memory in the multiprocessor computer system and is generally accessible for read and write access by all processing units. To reduce access latency to instructions and data residing in system memory, each processing unit is typically further supported by a respective multi-level cache hierarchy, the lower levels of which are shared by one or more processor cores.

Because multiple processor cores may request write access to the same cache line of data, and because modified cache lines are not immediately synchronized with system memory, the cache hierarchies of a multiprocessor computer system implement a cache coherency protocol to ensure at least a minimum level of coherency among the various processor cores' "views" of the contents of system memory. Specifically, cache coherency requires, at a minimum, that after a processing unit accesses a copy of a memory block and subsequently accesses an updated copy of that memory block, the processing unit cannot again access the old copy of the memory block.

A cache coherency protocol generally defines a set of cache states stored in association with the cache lines held at each level of the cache hierarchy, as well as a set of coherency messages used to communicate cache state information between cache hierarchies. In a typical implementation, the cache state information takes the form of the well-known MESI (Modified, Exclusive, Shared, Invalid) protocol or a variant thereof, and the coherency messages indicate a protocol-defined coherency state transition in the cache hierarchy of the requestor and/or the recipient of a memory access request. The MESI protocol allows a cache line of data to be tagged with one of four states: M (Modified), E (Exclusive), S (Shared), or I (Invalid). The "Modified" state indicates that a coherency granule is valid only in the cache storing the modified coherency granule and that the value of the modified coherency granule has not been written to system memory. When a coherency granule is indicated as "Exclusive", then, of all the caches in the memory hierarchy, only that cache holds the coherency granule at that time. The data in the "Exclusive" state is, however, consistent with system memory. If a coherency granule is marked as "Shared" in a cache directory, the coherency granule resides in the associated cache and possibly in one or more other caches in the memory hierarchy, and all copies of the coherency granule are consistent with system memory. Finally, the "Invalid" state indicates that the data and the address associated with a coherency granule are both invalid.
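The four MESI states described above can be summarized in a short sketch. The enum and helper names are illustrative assumptions introduced here and do not appear in the specification.

```python
from enum import Enum


class Mesi(Enum):
    """The four MESI cache-line states described in the text."""
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def is_valid(state: Mesi) -> bool:
    # Only the Invalid state marks the data and address as unusable.
    return state is not Mesi.INVALID


def consistent_with_memory(state: Mesi) -> bool:
    # Exclusive and Shared lines match system memory. A Modified line has
    # not yet been written back, and an Invalid line holds no usable data.
    return state in (Mesi.EXCLUSIVE, Mesi.SHARED)


def may_exist_in_other_caches(state: Mesi) -> bool:
    # Of the valid states, only Shared allows copies in other caches.
    return state is Mesi.SHARED
```

For example, `consistent_with_memory(Mesi.MODIFIED)` is false because the modified value has not yet reached system memory.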

The state to which each coherency granule (e.g., cache line) is set depends on both the previous state of the data within the cache line and the type of memory access request received from a requesting device (e.g., a processor). Accordingly, maintaining coherency of the memory hierarchy in the system requires that processors communicate messages throughout the system indicating their intent to read from or write to memory locations. For example, when a processor wants to write data to a memory location, it must first inform all other processing elements of its intent to write data to that memory location and must receive permission from all other processing elements to perform the write operation. The permission messages received by the requesting processor indicate that all other cached copies of the contents of the memory location have been or will be invalidated, thereby guaranteeing that the other processors will not mistakenly access their stale local data.

In some systems, the cache hierarchy includes at least two levels of cache: a level one (L1), or upper-level, cache and one or more levels of lower-level caches, such as level two (L2) and level three (L3) caches (an L2 cache being an upper-level cache relative to an L3 cache). The L1 cache is usually a private cache associated with a particular processor core in the MP system. A processor core first attempts to access data in its L1 cache. If the requested data is not found in the L1 cache, the processor core then accesses one or more lower-level caches (e.g., a level two (L2) or level three (L3) cache) for the requested data. In many cases, the lowest-level cache (e.g., the L3 cache) is shared among several processor cores.
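The lookup order described above (L1 first, then lower-level caches, then system memory) can be illustrated with a minimal sketch. The function name, the use of dictionaries as caches, and the example addresses are assumptions made for illustration only.

```python
def lookup(address, levels, system_memory):
    """Search caches from the highest level (L1) downward, then memory.

    `levels` is a list of (name, cache) pairs ordered L1 first, where each
    cache is modeled as a plain dict from address to data (hypothetical).
    Returns a (where_found, data) pair.
    """
    for name, cache in levels:
        if address in cache:
            return name, cache[address]
    # Not cached anywhere: fall back to system memory, the lowest level.
    return "memory", system_memory[address]


# Hypothetical contents, one line per level.
l1 = {0x000: "a"}
l2 = {0x080: "b"}
l3 = {0x100: "c"}
memory = {0x000: "a", 0x080: "b", 0x100: "c", 0x180: "d"}
levels = [("L1", l1), ("L2", l2), ("L3", l3)]
```

With these contents, a request for address `0x080` misses in L1 and is satisfied by L2, while a request for `0x180` misses in every cache and is satisfied by system memory.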

Typically, when a congruence class of an upper-level cache becomes full, data lines are "evicted", that is, written out to a lower-level cache or to system memory for storage. In any memory hierarchy, however, multiple copies of the same data may exist in the hierarchy at the same time. A policy of evicting lines to make more space available in the upper-level cache therefore results in updates to the lower-level cache, including updates to the coherency state information in the lower-level cache directory.
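The eviction behavior described above can be sketched as follows. This is a simplified model under stated assumptions: the class name, the way counts, and the use of installation order as the replacement order are all hypothetical, and coherency state is omitted.

```python
from collections import OrderedDict


class CongruenceClass:
    """One set of a set-associative cache (hypothetical model)."""

    def __init__(self, ways: int):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> data, oldest entry first

    def install(self, tag, data):
        """Install a line; return the evicted (tag, data) pair, or None."""
        victim = None
        if tag in self.lines:
            self.lines.pop(tag)  # refresh an existing line
        elif len(self.lines) >= self.ways:
            victim = self.lines.popitem(last=False)  # evict the oldest line
        self.lines[tag] = data
        return victim


upper = CongruenceClass(ways=2)  # assumed associativity
lower = CongruenceClass(ways=4)  # assumed associativity
for tag in ("A", "B", "C"):
    victim = upper.install(tag, f"data-{tag}")
    if victim is not None:
        # The evicted line is written out to the lower-level cache.
        lower.install(*victim)
```

After the loop, installing line C into the full upper-level class has evicted line A, which now resides in the lower-level class.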

Conventionally, cache coherency protocols have generally assumed that, to maintain cache coherency, the coherency state from the upper-level cache is copied to the lower-level cache when a cache line is evicted from the upper-level cache. The present invention recognizes that performance enhancements to a data processing system can be achieved by appropriately defining the coherency states and coherency state transitions in the cache hierarchy for the case in which a castout is performed and for other data processing scenarios.

It is an object of the present invention to provide an improved processing unit, data processing system, and method for performing coherency management in a multiprocessor data processing system.

According to one embodiment of the present invention, a data processing system includes at least a first coherency domain and a second coherency domain, and the first coherency domain includes a system memory and a cache memory. A method of processing data according to the present invention comprises buffering a cache line in a data array of the cache memory, and setting a state field in a cache directory of the cache memory to a coherency state indicating that the cache line is valid in the data array, that the cache line is held non-exclusively in the cache memory, and that another cache in the second coherency domain may hold a copy of the cache line.

All objects, features, and advantages of the present invention will become apparent in the following detailed description.

I. Exemplary Architecture Overview
Throughout the drawings, the same reference numbers refer to the same or corresponding parts. Referring to FIG. 1, there is shown a high-level block diagram of an exemplary data processing system in which the present invention may be implemented. The data processing system is depicted as a cache-coherent symmetric multiprocessor (SMP) data processing system 100. As shown, data processing system 100 includes a plurality of processing nodes 102a, 102b for processing data and instructions. The processing nodes 102 are connected to a system interconnect network 110 for communicating address, data, and control information. The system interconnect network 110 can be implemented, for example, as a bus-type interconnect, a switched interconnect, or a hybrid interconnect.

In the illustrated embodiment, each processing node 102 is implemented as a multi-chip module (MCM) containing four processing units 104a-104d. Each processing unit is preferably implemented as a respective integrated circuit. The processing units 104 in each processing node 102 are connected for communication with each other and with the system interconnect network 110 by a local interconnect network 114. Like the system interconnect network 110, the local interconnect network 114 may be implemented with, for example, one or more buses or switches.

As shown in FIG. 2, each processing unit 104 includes an integrated memory controller (IMC) 206 connected to a respective system memory 108. Data and instructions residing in the system memories 108 can generally be accessed and modified by a processor core in any processing unit 104 of any processing node 102 within data processing system 100. In alternative embodiments of the present invention, one or more memory controllers 206 (and system memories 108) may instead be connected to the system interconnect network 110 or to a local interconnect network 114.

Those skilled in the art will appreciate that the SMP data processing system 100 of FIG. 1 can include many additional components that are not illustrated, such as interconnect bridges, non-volatile storage devices, and ports for connection to networks or attached devices. Because such additional components are not necessary for an understanding of the present invention, they are not shown in FIG. 1 and are not discussed further. It should be understood, however, that the enhancements provided by the present invention are applicable to cache-coherent data processing systems of diverse architectures and are in no way limited to the generalized data processing system architecture shown in FIG. 1.

Referring to FIG. 2, there is shown a more detailed block diagram of an exemplary processing unit 104 in accordance with the present invention. In the illustrated embodiment, each processing unit 104, which may advantageously be implemented as a single integrated circuit, includes four processor cores 200a-200d for processing instructions and data independently of one another. In one preferred embodiment, each processor core 200 supports the simultaneous execution of multiple (e.g., two) hardware threads.

The operation of each processor core 200 is supported by a multi-level volatile memory subsystem having the shared system memories 108 at its lowest level and, at its upper levels, two or more levels of cache memory for caching data and instructions residing within cacheable addresses. In the illustrated embodiment, the cache memory hierarchy includes a respective store-through level one (L1) cache (not shown) within and private to each processor core 200, a respective store-in level two (L2) cache 230 private to each processor core 200, and L3 victim caches 232 for buffering L2 castouts. In the illustrated embodiment, processor cores 200a and 200d share L3 cache 232a, and processor cores 200b and 200c share L3 cache 232b. Of course, in other embodiments, each processor core 200 may have its own L3 cache 232. In at least some embodiments, including the one shown in FIG. 2, L3 caches 232a and 232b are further interconnected to permit data exchange. This data exchange includes permitting one L3 cache 232 to cast out one of its cache lines to the other L3 cache 232, so that data likely to be accessed by a processor core 200 is preserved within the cache hierarchy of processing unit 104 for as long as possible.

Each processing unit 104 includes an instance of response logic 210, which implements a portion of the distributed coherency signaling mechanism that maintains cache coherency within data processing system 100. In addition, each processing unit 104 includes an instance of interconnect logic 212 for managing communication between the processing unit 104 and the local interconnect network 114 and system interconnect network 110. Each of the L2 caches 230 and L3 caches 232 is connected to interconnect logic 212 to enable participation in data and coherency communication over the interconnect networks 110 and 114 of FIG. 1. Finally, each processing unit 104 includes an integrated I/O controller 214 supporting the attachment of one or more I/O devices, such as I/O device 216. I/O controller 214 may issue operations on the local interconnect network 114 and/or the system interconnect network 110 in response to requests by I/O device 216.

Referring now to FIG. 3, there is shown a more detailed block diagram of a processor core 200 and L2 cache 230 within the processing unit 104 of FIG. 2. As shown, processor core 200 includes an instruction sequencing unit (ISU) 300 for fetching and ordering instructions for execution, one or more execution units 302 for executing instructions, and an L1 cache 306.

The execution units 302 include a load-store unit (LSU) 304 that executes memory access instructions (e.g., load and store instructions) to cause data to be loaded from and stored to memory. Through the implementation of a coherency protocol by the memory subsystem, a coherent view of the contents of memory is maintained while such memory access operations are performed.

In accordance with the present invention, L1 cache 306, which may include separate L1 data and instruction caches, is implemented as a store-through cache, meaning that the point of cache coherency with respect to other processor cores 200 is located below L1 cache 306 and, in the illustrated embodiment, at L2 cache 230. Accordingly, L1 cache 306 does not maintain true cache coherency states for its cache lines, but only maintains valid/invalid bits.

L2 cache 230 includes a data array 310 that stores cache lines of instructions and data, and a cache directory 312 of the contents of data array 310. As in conventional set-associative caches, memory blocks in the system memories 108 are mapped to particular congruence classes within data array 310 using predetermined index bits within the system memory (real) addresses. In one embodiment, the standard memory block for the coherency system is set at a 128-byte cache line. The particular memory blocks or cache lines stored within data array 310 are recorded in cache directory 312, which contains one directory entry for each cache line in data array 310. As will be apparent to those skilled in the art, each directory entry in cache directory 312 includes at least a tag field 314, which specifies the particular cache line stored in data array 310 using a portion of the corresponding real address, a state field 316, which indicates the coherency state of the cache line, and an LRU (Least Recently Used) field 318, which indicates the replacement order of the cache line with respect to other cache lines in the same congruence class.
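The mapping of a real address to a congruence class and tag can be sketched as follows. Only the 128-byte line size comes from the text. The number of congruence classes and the function name are assumptions chosen for illustration.

```python
LINE_SIZE = 128    # bytes per cache line, as stated in the text
NUM_CLASSES = 512  # assumed number of congruence classes (hypothetical)

OFFSET_BITS = LINE_SIZE.bit_length() - 1   # 7 bits select a byte in the line
INDEX_BITS = NUM_CLASSES.bit_length() - 1  # 9 bits select the congruence class


def split_address(real_address: int):
    """Split a real address into (tag, congruence-class index, byte offset)."""
    offset = real_address & (LINE_SIZE - 1)
    index = (real_address >> OFFSET_BITS) & (NUM_CLASSES - 1)
    tag = real_address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

Under these assumptions, `split_address(0x12345)` yields tag 1, congruence class 70, and byte offset 0x45. The tag is what a directory entry's tag field would record, and the index selects the congruence class searched.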

As further shown in FIG. 3, L2 cache 230 also includes a cache controller 330 that controls the data and coherency operations of L2 cache 230. Cache controller 330 includes multiple Read-Claim (RC) machines 332 for independently and concurrently servicing load (LD) and store (ST) requests received from the associated processor core 200, and multiple snoop (SN) machines 334 for independently and concurrently servicing remote memory access requests that are issued by processor cores other than the associated processor core 200 and "snooped" from the local interconnect network 114. As will be appreciated, the servicing of memory access requests by RC machines 332 may require the replacement or invalidation of memory blocks within data array 310. Accordingly, cache controller 330 also includes multiple CO (castout) machines 336 that manage the removal and writeback of memory blocks from data array 310.

Referring now to FIG. 4, there is shown a more detailed block diagram of an exemplary embodiment of an L3 cache in accordance with the present invention. As can be seen by comparing FIGS. 3 and 4, L3 cache 232, which serves as a victim cache for buffering L2 castouts, is configured similarly to L2 cache 230 of FIG. 3. Accordingly, L3 cache 232 includes a set-associative data array 360, a cache directory 362 of the contents of data array 360, and a cache controller 380.

Each directory entry in cache directory 362 includes a tag field 364, which specifies the particular cache line stored in data array 360 using a portion of the corresponding real address, a state field 366, which indicates the coherency state of the cache line, and an LRU (Least Recently Used) field 368, which indicates the replacement order of the cache line with respect to other cache lines in the same congruence class. Cache controller 380 includes multiple snoop (SN) machines 384 and multiple castout (CO) machines 386, like those described with reference to FIG. 3. In place of RC machines, cache controller 380 includes multiple read (RD) machines 382 that service data requests from the vertically connected L2 cache 230.

II. Exemplary Operation
Referring now to FIG. 5, there is shown a time-space diagram of an exemplary operation on the local interconnect network 114 or system interconnect network 110 of data processing system 100 of FIG. 1. Although interconnect networks 110 and 114 are not necessarily bus-type interconnects, operations transmitted over one or more local interconnect networks 114 and/or the system interconnect network 110 are referred to herein as "bus operations" to distinguish them from the CPU requests transmitted between a processor core 200 and the cache memories within its own cache hierarchy.

The illustrated bus operation begins when a master (M) 400, such as an RC machine 332 of an L2 cache 230 or an I/O controller 214, issues a request 402 on the local interconnect network 114 and/or the system interconnect network 110. Request 402 preferably includes a transaction type indicating the type of access desired and a resource identifier (e.g., a real address) indicating the resource to be accessed by the request. Common types of requests preferably include those set forth in Table 1 below.

Request 402 is received by snoopers 412, such as the snoop machines 334 of L2 caches 230 and the snoop machines 384 of L3 caches 232, and by memory controllers 206 (FIG. 2). In general, with some exceptions, the snoop machines 334 in the same L2 cache 230 as the RC machine 332 that issued request 402, and the snoop machines 384 of the connected L3 cache 232, do not snoop request 402 (i.e., there is generally no self-snooping), because a request 402 is transmitted on the local interconnect network 114 and/or the system interconnect network 110 only if it cannot be serviced internally by the processing unit 104. Each snooper 412 that receives request 402 may provide a respective partial response 406 representing the response of at least that snooper to request 402. A memory controller 206 determines the partial response 406 to provide based, for example, on whether the memory controller 206 is responsible for the request address and whether it has resources available to service the request. An L2 or L3 cache may determine its partial response 406 based, for example, on the availability of the L2 cache directory, the availability of a snoop machine to handle the request, and the coherency state associated with the request address in the cache directory.

The partial responses of the snoopers 412 are logically combined, either in stages or all at once, by one or more instances of response logic 210 to determine a system-wide combined response (CR) 410 to request 402. Subject to the scope restrictions discussed below, response logic 210 provides combined response 410 to the master and the snoopers of the bus operation via the local interconnect network 114 and/or the system interconnect network 110 to indicate the system-wide response (e.g., success, failure, retry, etc.) to request 402. If CR 410 indicates success of request 402, CR 410 may also indicate, for example, the data source for the requested memory block, the cache state in which the requested memory block is to be cached, and whether "cleanup" operations invalidating the requested memory block in one or more L2 caches 230 or L3 caches 232 are required.
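The combining step described above can be illustrated with a minimal sketch. The text does not enumerate the partial-response values or the combining rule, so the `Presp` values and the priority order below are assumptions made purely for illustration.

```python
from enum import Enum


class Presp(Enum):
    """Hypothetical partial-response values (not enumerated in the text)."""
    NULL = 0   # snooper has nothing to contribute
    ACK = 1    # snooper can help service the request
    RETRY = 2  # snooper is busy; the request must be reissued


def combine(partial_responses):
    """Fold partial responses into one system-wide combined response.

    Assumed priority (not from the text): any RETRY forces "retry";
    otherwise any ACK yields "success"; otherwise "failure".
    """
    responses = list(partial_responses)
    if Presp.RETRY in responses:
        return "retry"
    if Presp.ACK in responses:
        return "success"
    return "failure"
```

For example, `combine([Presp.NULL, Presp.ACK, Presp.RETRY])` returns `"retry"` under this assumed rule, because a single busy snooper is enough to force the master to reissue the request.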

Typically, in response to receipt of combined response 410, one or more of master 400 and snoopers 412 perform one or more operations to service request 402. These operations may include supplying data to master 400, invalidating or otherwise updating the coherency state of data cached in one or more L2 or L3 caches, performing castout operations, writing data back to a system memory 108, and so on. If required by request 402, the requested or target memory block may be transmitted to or from master 400 before or after the generation of combined response 410 by response logic 210.

In the following description, the partial response of a snooper 412 to a request, and the operations performed by the snooper in response to the request and/or its combined response, are described with reference to whether that snooper is a "Highest Point of Coherency" (HPC), a "Lowest Point of Coherency" (LPC), or neither, with respect to the request address specified by the request. An LPC is defined herein as a memory device or I/O device that serves as the repository for a memory block. In the absence of an HPC for the memory block, the LPC holds the true image of the memory block and has the authority to grant or deny requests to generate additional cached copies of the memory block. For a typical request in the data processing system embodiment of FIGS. 1 and 2, the LPC will be the memory controller 206 for the system memory 108 holding the referenced memory block. An HPC is defined herein as a uniquely identified device that caches a true image of the memory block (which may or may not be consistent with the corresponding memory block at the LPC) and has the authority to grant or deny a request to modify the memory block. Descriptively, the HPC may also provide a copy of the memory block to a requestor in response to an operation that does not modify the memory block. Thus, for a typical request in the data processing system embodiment of FIGS. 1 and 2, the HPC, if any, will be an L2 cache 230. Although other indicators may be used to designate an HPC for a memory block, a preferred embodiment of the present invention designates the HPC, if any, for a memory block using selected cache coherency states within the L2 cache directory 312 of an L2 cache 230 or the L3 cache directory 362 of an L3 cache 232, as described further below with reference to Table 2.

Still referring to FIG. 5, the HPC, if any, for the memory block referenced in request 402, or in the absence of an HPC, the LPC of the memory block, preferably has the responsibility of protecting the transfer of ownership of the memory block in response to request 402 during a protection window 404a. In the exemplary scenario shown in FIG. 5, the snooper 412 that is the HPC for the memory block specified by the request address of request 402 protects the transfer of ownership of the requested memory block to master 400 during a protection window 404a that extends from the time that snooper 412 determines its partial response 406 until snooper 412 receives combined response 410. During protection window 404a, snooper 412 protects the transfer of ownership by providing, to other requests specifying the same request address, partial responses that prevent other masters from obtaining ownership until ownership has been successfully transferred to master 400. Master 400 likewise initiates a protection window 404b to protect its ownership of the memory block requested in request 402 following receipt of combined response 410.
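By way of illustration only, the behavior of an HPC snooper during protection window 404a might be sketched as follows. The class name, the method names, and the response strings "HPC_ACK" and "RETRY" are hypothetical and do not appear in the description; the sketch also ignores the change of HPC that follows a successful transfer of ownership.

```python
from dataclasses import dataclass, field


@dataclass
class HpcSnooper:
    """Toy HPC snooper that tracks open protection windows by request address."""
    protected: set = field(default_factory=set)

    def partial_response(self, address: int) -> str:
        # A competing request that hits an open window is refused, so that no
        # other master can obtain ownership while the transfer is in flight.
        if address in self.protected:
            return "RETRY"      # hypothetical response name
        # Determining the partial response opens the protection window.
        self.protected.add(address)
        return "HPC_ACK"        # hypothetical response name

    def combined_response(self, address: int) -> None:
        # Receipt of the combined response closes the protection window.
        self.protected.discard(address)
```

In this sketch a second request to the same address receives "RETRY" until `combined_response` is called for that address, while requests to other addresses are unaffected.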

III. Data Delivery Domains
Conventional broadcast-based data processing systems handle both cache coherency and data delivery through broadcast communication, which in conventional systems is transmitted on a system interconnect to at least all memory controllers and cache hierarchies in the system. As compared with systems of alternative architectures and like scale, broadcast-based systems tend to offer decreased access latency and better data handling and coherency management of shared memory blocks.

As broadcast-based systems scale in size, traffic volume on the system interconnect is multiplied, meaning that system cost rises sharply with system scale as more bandwidth is required for communication over the system interconnect. That is, a system with m processor cores, each having an average traffic volume of n transactions, has a traffic volume of m*n, meaning that traffic volume in broadcast-based systems scales multiplicatively rather than additively. Beyond the requirement for substantially greater interconnect bandwidth, an increase in system size has the secondary effect of increasing some access latencies. For example, the access latency of read data is limited, in the worst case, by the combined response latency of the furthest-away lower-level cache holding the requested memory block in a shared coherency state from which the requested data can be sourced.

In order to reduce system interconnect bandwidth requirements and access latencies while still retaining the advantages of a broadcast-based system, multiple L2 caches 230 distributed throughout data processing system 100 are permitted to hold copies of the same memory block in a "special" shared coherency state. These caches can then supply the memory block to a requesting L2 cache 230 using cache-to-cache data intervention. In order to implement multiple concurrent and distributed sources for shared memory blocks in an SMP data processing system such as data processing system 100, two issues must be addressed. First, rules governing the creation of copies of memory blocks in the "special" shared coherency state must be implemented. Second, there must be a rule governing which snooping L2 cache 230, if any, provides a shared memory block to a requesting L2 cache 230, for example in response to a bus read operation or a bus RWITM operation.

According to the present description, both of these issues are addressed through the implementation of data sourcing domains. In particular, each domain within an SMP data processing system, where a domain is defined to include one or more lower-level (e.g., L2 or L3) caches that participate in responding to data requests, is permitted to include only one cache that holds a particular memory block in the "special" shared coherency state. That cache, if present when a bus read-type (e.g., READ or RWITM) operation is initiated by a requesting cache in the same domain, is responsible for sourcing the requested memory block to the requesting cache. Although many different domain sizes may be defined, in data processing system 100 of FIG. 1 it is convenient to consider each processing node 102 (i.e., MCM) to be a data sourcing domain. Examples of such "special" shared coherency states (e.g., Sl and Slg) are described below with reference to Table 2.
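As a minimal sketch of the domain rule just described, the following function locates the single cache in a data sourcing domain that may source a given memory block. The function name and the dictionary layout (a cache identifier mapped to an address-to-state directory) are assumptions made for illustration; only the state names Sl and Slg come from the description.

```python
SPECIAL_SHARED = {"Sl", "Slg"}  # "special" shared states named in the text


def sourcing_cache(domain_caches: dict, address: int):
    """Return the id of the cache in this domain that should source `address`.

    `domain_caches` maps a cache id to a {address: coherency_state} directory
    (a hypothetical layout). Returns None if no cache in the domain holds the
    block in a special shared state.
    """
    holders = [cache_id for cache_id, directory in domain_caches.items()
               if directory.get(address) in SPECIAL_SHARED]
    # The domain rule permits at most one such holder per memory block.
    if len(holders) > 1:
        raise ValueError("more than one special-shared holder in one domain")
    return holders[0] if holders else None
```

A domain in which one cache holds the block in S and another holds it in Sl would therefore return the Sl holder as the intervening cache.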

IV. Coherency Domains
While the implementation of data delivery domains as described above improves data access latency, this enhancement does not address the m*n multiplication of traffic volume as system scale increases. In order to reduce traffic volume while still maintaining a broadcast-based coherency mechanism, preferred embodiments of the present invention additionally implement coherency domains, which, like the data delivery domains described above, can conveniently (but are not required to) be implemented with each processing node 102 forming a separate coherency domain. Data delivery domains and coherency domains can be coextensive, but need not be. For the purposes of explaining exemplary operation of data processing system 100, the coherency domains are assumed hereafter to have boundaries defined by processing nodes 102.

The implementation of coherency domains reduces system traffic by limiting inter-domain broadcast communication over system interconnect 110 in cases in which a request can be serviced with participation by fewer than all coherency domains. For example, if processing unit 104a of processing node 102a has a bus read operation to issue, processing unit 104a may elect to first broadcast the bus read operation to all participants within its own coherency domain (e.g., processing node 102a), but not to participants in other coherency domains (e.g., processing node 102b). A broadcast operation transmitted to only those participants within the same coherency domain as the master of the operation is defined herein as a "local operation". If the local bus read operation can be serviced within the coherency domain of processing unit 104a, no further broadcast of the bus read operation is performed. If, however, the partial responses and combined response to the local bus read operation indicate that the operation cannot be serviced solely within the coherency domain of processing node 102a, the scope of the broadcast may be extended to include, in addition to the local coherency domain, one or more additional coherency domains.
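The local-first broadcast described above can be sketched as follows, purely as an illustration. The function name and the `serviced_locally` predicate are hypothetical; the predicate stands in for the partial responses and combined response of the local coherency domain, which are not modeled here.

```python
def issue_bus_read(address, local_domain, all_domains, serviced_locally):
    """Return the list of broadcasts issued for one bus read.

    Each broadcast is a (scope, domains_reached) pair. `serviced_locally` is
    a hypothetical predicate (domain, address) -> bool that stands in for the
    combined response of the local operation.
    """
    # First attempt: a local operation that reaches only the master's domain.
    broadcasts = [("local", [local_domain])]
    if not serviced_locally(local_domain, address):
        # The local operation could not be serviced alone, so the scope is
        # widened to include the other coherency domains as well.
        broadcasts.append(("global", list(all_domains)))
    return broadcasts
```

When the local operation succeeds, only one broadcast confined to the master's own domain is issued, which is where the traffic saving arises.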

In a basic implementation, two broadcast scopes are employed: a "local" scope including only the local coherency domain, and a "global" scope that also includes all of the other coherency domains in the SMP data processing system. Thus, an operation that is transmitted to all coherency domains in an SMP data processing system is defined herein as a "global operation". Importantly, regardless of whether local operations or operations of more expansive scope (e.g., global operations) are employed to service operations, cache coherency is maintained across all coherency domains in the SMP data processing system. Examples of local and global operations are described in detail in U.S. patent application Ser. No. 11/055,697.

In a preferred embodiment, the scope of an operation is indicated in a bus operation by a local/global scope indicator (signal), which in one embodiment may comprise a 1-bit flag. Interconnect logic 212 within processing units 104 preferably determines whether or not to forward an operation received via local interconnect 114 onto system interconnect 110 based upon the setting of the local/global scope indicator (signal) in the operation.
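A minimal sketch of this forwarding decision is given below. The description does not fix the encoding of the 1-bit flag, so the values chosen for `GLOBAL_SCOPE` and `LOCAL_SCOPE`, as well as the `BusOperation` type and its field names, are assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed encoding of the 1-bit scope flag; the text does not specify it.
GLOBAL_SCOPE = 1
LOCAL_SCOPE = 0


@dataclass(frozen=True)
class BusOperation:
    ttype: str      # e.g. "READ" or "RWITM"
    address: int
    scope: int      # LOCAL_SCOPE or GLOBAL_SCOPE


def forward_to_system_interconnect(op: BusOperation) -> bool:
    """Decide whether an operation seen on the local interconnect is forwarded."""
    # Only operations marked global leave the local coherency domain.
    return op.scope == GLOBAL_SCOPE
```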

V. Domain Indicators
In order to limit the issuance of unneeded local operations, and thereby reduce operational latency and conserve additional bandwidth on local interconnects, the present invention preferably implements a domain indicator per memory block that indicates whether or not a copy of the associated memory block is cached outside of the local coherency domain. FIG. 6 depicts a first exemplary implementation of a domain indicator in accordance with the present invention. As shown in FIG. 6, system memory 108, which is implemented in dynamic random access memory (DRAM), stores a plurality of memory blocks 500. System memory 108 stores, in association with each memory block 500, an error correcting code (ECC) 502 utilized to correct possible errors in the memory block 500, and a domain indicator 504. Although in some embodiments of the present invention domain indicator 504 may identify a particular coherency domain (i.e., specify a coherency domain or node ID), it is hereafter assumed that domain indicator 504 is a 1-bit indicator that is set (e.g., to '1' to indicate "local") if the associated memory block 500 is cached only within the same coherency domain as the memory controller 206 serving as the LPC for the memory block 500. Otherwise, domain indicator 504 is reset (e.g., to '0' to indicate "global"). The setting of domain indicators 504 to indicate "local" may be implemented imprecisely, in that a false setting of "global" will not induce any coherency errors but may cause unneeded global broadcasts of operations.

A memory controller 206 that sources a memory block in response to an operation preferably transmits the associated domain indicator 504 in conjunction with the requested memory block.
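The storage arrangement of FIG. 6 and the sourcing of a domain indicator together with its memory block might be modeled as in the following sketch. The class and method names are hypothetical, the ECC field is a placeholder that is never computed, and `mark_global` is an assumed helper that merely shows that resetting the indicator to "global" is always a safe, if possibly imprecise, setting.

```python
from dataclasses import dataclass

LOCAL, GLOBAL = 1, 0  # example encoding given in the text ('1' local, '0' global)


@dataclass
class StoredBlock:
    data: bytes
    ecc: int = 0                    # placeholder for ECC 502; not computed here
    domain_indicator: int = LOCAL   # domain indicator 504


class MemoryController:
    """Toy LPC that keeps one StoredBlock per address (hypothetical model)."""

    def __init__(self):
        self.blocks = {}

    def source(self, address: int):
        # The domain indicator travels with the requested memory block.
        block = self.blocks[address]
        return block.data, block.domain_indicator

    def mark_global(self, address: int) -> None:
        # A false "global" costs only unneeded global broadcasts,
        # never a coherency error.
        self.blocks[address].domain_indicator = GLOBAL
```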

VI. Exemplary Coherency Protocol
In a preferred embodiment, L2 caches 230 and L3 caches 232 utilize a variant of the well-known MESI protocol. The set of coherency states, in addition to providing
(1) an indication of whether a cache is the HPC for a memory block,
also indicates the following three attributes:
(2) whether the cached copy is unique (i.e., is the only cached copy) among caches at that memory hierarchy level,
(3) whether and when the cache can provide a copy of the memory block to the master of a request, and
(4) whether the cached image of the memory block is consistent with the corresponding memory block at the LPC.
These four attributes are expressed in the coherency protocol states summarized in Table 2 below.

In the embodiment of data processing system 100 described with reference to FIG. 1, domain indicators are received by L2/L3 caches 230, 232 in conjunction with the associated memory blocks and may optionally be stored with the memory blocks in data arrays 310, 360. While this arrangement permits a simplified data flow for domain indicators, when a first L2 cache 230 responds to a bus RWITM operation of a second L2 cache 230 residing in a different coherency domain by supplying the requested memory block, no "global" indicator remains cached in the local coherency domain. Thus, the LPC must be accessed to determine whether or not the memory block is known to be cached only locally. Consequently, if an HPC for a memory block receives a bus RWITM operation (or other storage-modifying operation) from a requestor in a remote coherency domain, the system responds with a retry-push, which includes a cache castout of the requested memory block and a retry of the bus RWITM operation. As will be appreciated, it is desirable to eliminate the latency and bandwidth utilization associated with retry-push responses.

In order to reduce access latency to a domain indicator, the Ig (Invalid Global), Sg (Shared Global), and Slg (Shared Local Global) coherency states are provided. The Ig state is defined herein as a cache coherency state indicating the following three conditions:
(1) the associated memory block in the cache array is invalid,
(2) the address tag in the cache directory is valid, and
(3) a modified copy of the memory block identified by the address tag has been sourced to a cache in a remote coherency domain.
The Sg state is defined herein as a cache coherency state indicating the following four conditions:
(1) the associated memory block in the cache array is valid,
(2) the address tag in the cache directory is valid,
(3) a modified copy of the memory block identified by the address tag has been sourced to a cache in a remote coherency domain, and
(4) a copy of the memory block was held, and may possibly still be held, in another cache.
The Slg state is similarly defined herein as a cache coherency state indicating the following five conditions:
(1) the associated memory block in the cache array is valid,
(2) the address tag in the cache directory is valid,
(3) a modified copy of the memory block identified by the address tag has been sourced to a cache in a remote coherency domain,
(4) a copy of the memory block was held, and may possibly still be held, in another cache, and
(5) the cache has the authority to source a copy of the memory block to a master in its coherency domain by cache-to-cache data intervention.
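The three definitions above can be collected into a small lookup table, sketched here for illustration. The type name and field names are hypothetical; each field corresponds to one numbered condition in the lists above, and a value of False means only that the state does not assert that condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XgState:
    data_valid: bool                 # condition (1): block in the cache array is valid
    tag_valid: bool                  # condition (2): address tag in the directory is valid
    modified_copy_sent_remote: bool  # condition (3): modified copy went to a remote domain
    possibly_shared: bool            # condition (4): asserted by Sg and Slg only
    can_intervene_locally: bool      # condition (5): asserted by Slg only


XG_STATES = {
    "Ig":  XgState(False, True, True, False, False),
    "Sg":  XgState(True,  True, True, True,  False),
    "Slg": XgState(True,  True, True, True,  True),
}
```

Every entry keeps a valid address tag and records that a modified copy left the domain, which is the cached equivalent of a "global" domain indicator.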

It may be preferable to form Ig, Sg, and Slg states for a given memory block only in the coherency domain containing the LPC for that memory block. In such embodiments, some mechanism (e.g., a partial response by the LPC and the subsequent combined response) must be implemented to indicate to the cache sourcing the requested memory block that the LPC is within its local coherency domain. In other embodiments that do not support the communication of an indication that the LPC is local, an Ig, Sg, or Slg state is formed whenever a memory block is sourced to a remote coherency domain, and thus the Ig, Sg, and Slg states may be formed imprecisely.

Several rules govern the selection and replacement of Ig, Sg, and Slg (collectively, "Xg") cache entries. First, if a cache selects an Xg entry as the victim for replacement, a castout of the Xg entry is performed (unlike the case when an I or S entry is selected). Second, the castout of the Xg state is preferably performed as a local operation or, if performed as a global operation, is ignored by a remote LPC of the castout address. If an Xg entry is permitted to form in a cache that is not within the same coherency domain as the LPC for the memory block, no update to the domain indicator in the LPC is required. Third, the castout of the Xg state is preferably performed as a dataless address-only operation in which the domain indicator is written back to the LPC (if local to the cache performing the castout).
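The castout rules above might be summarized in code as follows. This is a sketch under stated assumptions: the function name, the returned dictionary, and its keys are hypothetical, and victim states other than I, S, and the Xg states are deliberately left out because the rules above do not address them.

```python
XG = {"Ig", "Sg", "Slg"}


def plan_castout(victim_state: str, lpc_is_local: bool):
    """Describe the castout, if any, required when a victim entry is replaced.

    Returns None when no castout is performed (I or S victims), or a dict
    describing the castout of an Xg entry. Other states are not covered.
    """
    if victim_state in ("I", "S"):
        return None                      # plain I or S entries are simply dropped
    if victim_state not in XG:
        raise ValueError("state not covered by the Xg castout rules")
    return {
        "scope": "local",                # preferably issued as a local operation
        "carries_data": False,           # dataless, address-only operation
        # The domain indicator is written back only if the LPC is local to
        # the cache performing the castout.
        "update_lpc_domain_indicator": lpc_is_local,
    }
```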

Because cache directory entries including an Xg state carry potentially useful information, it is desirable in at least some implementations to preferentially retain entries in an Xg state over other entries having the same base state (e.g., S or I), for example by modifying the least recently used (LRU) algorithm utilized to evaluate LRU fields 318, 368 to select a victim cache entry for replacement. Because Xg directory entries are retained in cache, such entries may become "stale" over time: a cache whose exclusive access request caused the formation of the Xg state may deallocate or write back its copy of the memory block without notifying the cache holding the address tag of the memory block in the Xg state. In such cases, the "stale" Xg state, which incorrectly indicates that a global operation should be issued instead of a local operation, will not cause any coherency errors, but will merely cause some operations that could otherwise be serviced utilizing a local operation to be issued as global operations. Occurrences of such inefficiencies will be limited in duration by the eventual replacement of the "stale" Xg cache entries.
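One possible way of biasing an LRU choice so that Xg entries outlive entries of the same base state is sketched below. This is an assumed policy, not one given in the description: the function name, the list-based representation of LRU order, and the mapping of Slg to a base state of Sl are all hypothetical.

```python
# Base state of each Xg state; mapping Slg to Sl is an assumption.
BASE_STATE = {"Ig": "I", "Sg": "S", "Slg": "Sl"}


def select_victim(entries):
    """Pick a victim tag from a non-empty list of (tag, state) pairs.

    `entries` is ordered least recently used first. If the plain LRU choice
    is an Xg entry and some other entry holds the matching base state, the
    least recently used such entry is evicted instead.
    """
    lru_tag, lru_state = entries[0]
    base = BASE_STATE.get(lru_state)
    if base is not None:
        for tag, state in entries[1:]:
            if state == base:
                return tag      # keep the Xg entry, evict the plain one
    return lru_tag
```

For example, with an Ig entry in the LRU position and an I entry elsewhere in the same congruence class, the I entry is chosen and the cached domain information survives.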

Implementation of the Xg coherency states improves communication efficiency by maintaining a cached domain indicator for a memory block in a coherency domain even when no valid copy of the memory block remains cached in that coherency domain (as in the Ig case). As a consequence, an HPC for a memory block can service an exclusive access request (e.g., a bus RWITM operation or bus DClaim operation) from a remote coherency domain without retrying the request and without performing a push of the requested memory block to the LPC.

VII. Exemplary L2/L3 Coherency State Transitions
With reference now to FIG. 7, there is illustrated a high-level logical flowchart of an exemplary method of performing a cast-in to an L3 cache in accordance with a preferred embodiment of the present invention. The process depicted in FIG. 7 involves operations by L3 cache controller 380. The process begins at block 600 and then proceeds to block 602, at which the L3 cache controller 380 of an L3 cache 232 (e.g., L3 cache 232a) receives a castout request from one of the L2 caches 230 to which it is coupled (e.g., L2 cache 230a) as a result of a cache line being evicted from the source L2 cache 230. The castout request contains a target address, the cast-in cache line, and the cache directory state of the cast-in cache line. L3 cache controller 380 is programmed with a replacement policy to determine whether the cast-in cache line will be saved in its data array 360 and, if so, the appropriate coherency state for the cache line in state field 366.

Next, at block 604, cache controller 380 reads the tag field 364 of L3 cache directory 362 to determine whether a directory entry for the target address is already present. If the target address misses in tag field 364, the process passes to block 605, at which cache controller 380 selects a victim cache line for replacement, which may be cast out to system memory depending on the coherency state of the victim cache line (e.g., Xg, M, T, or Tn). The process then proceeds to block 606, at which cache controller 380 stores the cast-in cache line received from the source L2 cache 230 in L3 data array 360 and creates a corresponding cache directory entry within cache directory 362. Cache controller 380 sets the coherency state field 366 of the directory entry in accordance with the state specified in the castout request. Thereafter, the process ends at block 608.

Returning to block 604, if cache controller 380 determines that a directory entry for the target address of the cast-in cache line is already present in L3 cache directory 362, the process proceeds to block 610, at which cache controller 380 updates data array 360 and cache directory 362 by reference to the castout request in accordance with a cast-in policy, as described further below with reference to Table 3 and FIG. 8. As implemented in a preferred embodiment of the present invention, the cast-in policy specifies the following two things:
(1) whether the cast-in cache line is stored in L3 data array 360 or discarded, and
(2) the coherency state of the corresponding entry in cache directory 362.
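The flow of FIG. 7 (blocks 604 through 610) can be sketched as a single function, shown here only as an illustration. The directory and data array are modeled as plain dictionaries keyed by address, and `select_victim` and `cast_in_policy` are hypothetical callables supplied by the caller; the castout of a victim to system memory is omitted.

```python
def handle_cast_in(directory, data_array, request, select_victim, cast_in_policy):
    """Apply one castout request from an L2 cache to a toy L3 cache.

    directory:      {address: coherency_state}
    data_array:     {address: cache_line}
    request:        dict with keys "address", "line", and "state"
    select_victim:  callable(directory) -> address to evict, or None
    cast_in_policy: callable(current_state, cast_in_state)
                    -> (resulting_state, overwrite_data)
    """
    addr = request["address"]
    if addr not in directory:                       # block 604: directory miss
        victim = select_victim(directory)           # block 605: choose a victim
        if victim is not None:
            directory.pop(victim)
            data_array.pop(victim, None)
        data_array[addr] = request["line"]          # block 606: install the line
        directory[addr] = request["state"]          # state taken from the request
        return
    # Block 610: directory hit, so the cast-in policy decides both outcomes.
    new_state, overwrite = cast_in_policy(directory[addr], request["state"])
    if overwrite:
        data_array[addr] = request["line"]
    directory[addr] = new_state
```

Passing the policy as a callable keeps the flow separate from the state table, so that a table such as Table 3 can be substituted without changing the control flow.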

In a preferred embodiment, the cast-in policy implemented by L3 cache 232 when performing a cast-in of a cache line that already has an entry in cache directory 362 is summarized in Table 3 below. Table 3 identifies the resulting coherency state in state field 366 as a function of the previous state of the cache line in L3 cache directory 362 and the coherency state specified in the castout request.

The cast-in policy further governs whether the memory block stored within L3 data array 360 is to be retained or overwritten by the cast-in cache line received from the source L2 cache 230. The decision whether or not to overwrite the cache line data is indicated in Table 3 by underlining of the resulting coherency state. If the resulting coherency state is underlined, the cast-in cache line is stored in L3 cache array 360 in place of the previous cache line data. If the resulting coherency state is not underlined, the cache line in cache array 360 is retained, and the coherency state in state field 366 is updated to the resulting coherency state identified in Table 3.

Referring specifically to the Sg and Slg rows of Table 3, if the current state of the cache line in L3 cache directory 362 is any of the In, Ig, or I coherency states and the cast-in coherency state is Sg or Slg, the cast-in coherency state is used to update state field 366. In addition, as indicated by the underlined entries in Table 3, the cache line is replaced with the cast-in data in cache array 360. If the current state of the cache line in L3 cache directory 362 is Sg and the cast-in coherency state is Slg, the cast-in coherency state is used to update state field 366 so that the ability to source the data by cache-to-cache data intervention, as signified by the "l" in Slg, is retained. If, however, the current state of the cache line in L3 cache directory 362 is Sg and the cast-in coherency state is Sg, no coherency or data update is made to L3 cache 232. Similarly, if the current state in L3 cache directory 362 is Slg and the cast-in state is Sg or Slg, no coherency update or data update is made to L3 cache 232. If the current state of the cache line in L3 cache directory 362 is S or Sl and the cast-in coherency state is Sg or Slg, cache controller 380 updates state field 366 from S to Sg or from Sl to Slg in order to retain the cached indication that domain indicator 504 should be updated. Cache controller 380 performs a similar coherency state update from the S state to the Slg state in response to receiving a cast-in Slg coherency state. As further shown in Table 3, the L2 and L3 caches cannot both hold a cache line in an Slx state, meaning that the Sl-Slg case indicates that an error has occurred. If the current L3 state is Tx or Mx, as shown in the Tx and Mx columns, this information is always retained within the L3 cache upon a cast-in from the L2.

Referring next to the Sg and Slg columns of Table 3, in the case of a cast-in in a Tx coherency state, cache controller 380 performs both a data update to data array 360 and a coherency state update from Sg or Slg to Tx. In the other cases in which the previous coherency state recorded within cache directory 362 is Slg, no data update or coherency state update is performed in response to receipt of a cast-in. For a cache line marked Sg in cache directory 362, cache controller 380 performs a coherency state update from Sg to Slg in response to a cast-in cache line in the Sl or Slg coherency state, but performs no data update or coherency state update for a cast-in cache line in the In, Ig, Sg, or S coherency state.
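
The cast-in cases spelled out in the preceding paragraphs can be sketched as a small lookup function. This is only an illustration built from the prose, not a reproduction of Table 3: the function name `cast_in_policy`, the string state names, the use of `None` for "the text does not say whether data is overwritten", and the treatment of any state name beginning with "T" as a member of the Tx family are all assumptions made for the example. Combinations that the text does not describe unambiguously raise `NotImplementedError` rather than guess.

```python
# Hypothetical sketch of the cast-in cases described in the prose above.
# It covers only combinations stated in the text; it is not Table 3 itself.

INVALID_STATES = {"I", "In", "Ig"}


def cast_in_policy(l3_state, cast_in_state):
    """Return (resulting_state, overwrite_data).

    overwrite_data is True or False where the text states it, and None
    where the text describes only the coherency state update.
    """
    # Tx cast-in onto an Sg or Slg line: data update and state update to Tx.
    # Assumption: any state name starting with "T" belongs to the Tx family.
    if l3_state in ("Sg", "Slg") and cast_in_state.startswith("T"):
        return cast_in_state, True

    # L3 holds an invalid state: adopt the cast-in state and replace the data.
    if l3_state in INVALID_STATES and cast_in_state in ("Sg", "Slg"):
        return cast_in_state, True

    # L3 already holds Slg: no coherency update and no data update.
    if l3_state == "Slg" and cast_in_state in ("Sg", "Slg"):
        return "Slg", False

    if l3_state == "Sg":
        # Keep the ability to source data by cache-to-cache intervention.
        if cast_in_state in ("Sl", "Slg"):
            return "Slg", None
        # No data update and no coherency state update in these cases.
        if cast_in_state in ("In", "Ig", "Sg", "S"):
            return "Sg", False

    # S is upgraded so the cached domain-indicator hint is not lost.
    if l3_state == "S" and cast_in_state in ("Sg", "Slg"):
        return cast_in_state, None

    # Sl is upgraded to Slg on an Sg cast-in.
    if l3_state == "Sl" and cast_in_state == "Sg":
        return "Slg", None

    raise NotImplementedError(
        f"case not described in the text: L3={l3_state}, cast-in={cast_in_state}"
    )
```

Returning a pair keeps the two decisions the policy makes (the resulting coherency state and whether the stored data is overwritten) separate, which mirrors the role of underlining in Table 3.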

Referring now to FIG. 8, there is depicted a high level logical flowchart of an exemplary method of implementing a cast-in policy in an L3 cache in response to receipt of a castout request in accordance with a preferred embodiment of the present invention. The process begins at block 700, for example, in response to an affirmative determination at block 604 of FIG. 7, and then proceeds to block 704, where L3 cache controller 380 examines the castout request to determine the specified coherency state of the victim cache line. In addition, at block 706, L3 cache controller 380 reads state field 366 of the relevant entry in cache directory 362 to determine the existing coherency state for the cast-in cache line. The process then proceeds to block 708, where L3 cache controller 380 determines the appropriate resulting coherency state in L3 cache directory 362 in accordance with the cast-in policy summarized in Table 3. This determination can be made, for example, by reference to a state table in non-volatile memory within L3 cache 232. In an alternative embodiment, L3 cache controller 380 can make the determination shown at block 708 through execution of software or through computation performed by integrated circuitry.

The process then proceeds to block 710, where cache controller 380 determines, based on the resulting coherency state determined at block 708, whether the existing coherency state for the victim cache line is to be updated. If the current state is to be updated, the process proceeds to block 712, where cache controller 380 overwrites the coherency state in cache directory 362 with the resulting coherency state determined at block 708. The process proceeds from block 712, or from block 710 if no update to the coherency state is to be made, to decision block 714, where cache controller 380 determines whether the cast-in policy indicates that the cast-in cache line received from L2 cache 230 is to be stored in L3 data array 360. If so, the process proceeds to block 716, where cache controller 380 stores the cast-in cache line in L3 data array 360, thereby overwriting the previously stored cache line for the cast-in target address. Following block 716, or following block 714 if no data update is to be made, the process ends at block 718.
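
The flow of FIG. 8 can be sketched in software as follows. This is a minimal illustration, assuming a directory entry modeled as a small Python object and a policy supplied as a callable; the names `L3Entry`, `apply_cast_in`, and `toy_policy` are hypothetical, and `toy_policy` is a deliberately simplified stand-in that does not reproduce Table 3.

```python
from dataclasses import dataclass


@dataclass
class L3Entry:
    """Hypothetical model of one L3 line: state field 366 plus its data."""
    state: str
    data: bytes


def toy_policy(existing_state, cast_in_state):
    """Simplified stand-in for Table 3 (an assumption for this example only).

    Invalid lines adopt the cast-in state and data; all other lines are kept.
    """
    if existing_state in {"I", "In", "Ig"}:
        return cast_in_state, True
    return existing_state, False


def apply_cast_in(entry, cast_in_state, cast_in_data, policy):
    """Apply one cast-in to an existing entry, following blocks 704 to 718."""
    # Blocks 704 and 706: the cast-in state comes from the castout request,
    # the existing state from the directory entry.
    existing_state = entry.state

    # Block 708: determine the resulting state and the data decision.
    resulting_state, overwrite_data = policy(existing_state, cast_in_state)

    # Blocks 710 and 712: update the directory only if the state changes.
    if resulting_state != existing_state:
        entry.state = resulting_state

    # Blocks 714 and 716: overwrite the stored line only if the policy says so.
    if overwrite_data:
        entry.data = cast_in_data

    # Block 718: done.
    return entry
```

For example, applying a cast-in with state "Sg" to `L3Entry("Ig", b"old")` under `toy_policy` leaves the entry in state "Sg" holding the cast-in data, while the same cast-in applied to an entry in state "Slg" leaves it unchanged.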

As described with reference to Table 3 and FIGS. 7 and 8, the implementation of the Sg and Slg coherency states in accordance with the present invention, in addition to affecting the manner in which L3 cast-in operations are performed, simplifies the operations performed by an upper level cache, such as L2 cache 230, in response to receipt of a CPU read request or CPU update request that hits on a cache line held by the L2 cache in an L2 coherency state. To facilitate understanding of an exemplary method of processing in L2 cache 230 under such an operating scenario (shown in FIG. 10), a conventional method of processing in an L2 cache is first described with reference to FIG. 9.

Referring to FIG. 9, there is depicted a timing diagram of the operations performed by a conventional L2 cache in response to receipt of a CPU read request that hits on a cache line held in the L2 cache in the Ig coherency state. As shown, the process begins when the conventional L2 cache receives a CPU read request from its associated processor core. In response to receipt of the CPU read request, the L2 cache allocates an RC machine to service the request, as indicated at reference numeral 802, and initiates a directory read of its cache directory, as indicated at reference numeral 804.

In response to a determination that the coherency state recorded in the cache directory is Ig, the conventional L2 cache allocates a CO machine, as indicated at reference numeral 806, to update the domain indicator in system memory to the "global" state so that coherency is maintained. Completion of the operations of the RC and CO machines is asynchronous, meaning that these operations can complete in either order. If the RC machine completes its operation at time t0 and the CO machine completes its operation at time t1, a coherency resolution window 810 is formed. By time t0, the RC machine has updated the directory (e.g., to "Shared") to reflect the state of the newly acquired cache line, but the CO machine remains actively working on the castout until time t1.

Normally, only the coherency state reflected by an active machine in the L2 cache is considered when determining the partial response to an operation snooped on the interconnect. However, this policy is inadequate for an operation that is snooped during the coherency resolution window and that targets the same cache line being processed by the CO machine. For such an operation, both the directory state and the coherency state reflected by the active castout machine must be considered in determining the partial response to be given to the snooped operation. Failure to do so could result in the formation of an incorrect coherency state at the requesting L2 cache and, in turn, a loss of coherency. Consequently, special coherency resolution logic must be implemented within the L2 cache to handle snoops to the same cache line being processed by the CO machine during coherency resolution window 800.

The design of the L2 cache, and in particular its coherency processing under the operating scenario shown in FIG. 9, is simplified by the implementation of the Sl and Slg coherency states in accordance with the present invention. Referring now to FIG. 10, there is depicted a high level logical flowchart of an exemplary method of coherency processing in an upper level cache, such as L2 cache 230, in accordance with the present invention. As shown, the process begins at block 900 and then proceeds to block 902, where L2 cache 230 receives a CPU read request or CPU update request from its associated processor core 200. In response to receipt of the CPU request, which generally includes a transaction type (TTYPE) identifying the type of request and a target address, L2 cache controller 330 of L2 cache 230 at block 904 accesses its cache directory 312 utilizing the target address to determine the coherency state with respect to the target address and dispatches an RC machine 332 to service the CPU request. As indicated at block 906, if cache controller 330 determines that the coherency state is Ig, cache controller 330 services the CPU request as shown at block 920 and following blocks. If the coherency state is other than Ig, cache controller 330 services the CPU request utilizing other processing shown at block 910.

Referring now to block 920, the dispatched RC machine 332, in response to the determination that the coherency state for the target memory block of the CPU request is Ig in L2 cache directory 312, determines whether the TTYPE indicates a CPU update request. If so, RC machine 332 issues, at block 922, a bus RWITM operation of global scope on all of the local interconnects 114 and global interconnect 110 to obtain an exclusive copy of the target memory block. RC machine 332 selects the global operation scope based on the imprecise indication, provided by the Ig coherency state, that an updated copy of the memory block resides in a remote coherency domain. Upon receipt of a copy of the target memory block, RC machine 332 places the target memory block in data array 310 and updates the coherency state of the corresponding entry in L2 cache directory 312 from Ig to M, as shown at block 924. Thereafter, RC machine 332 is deallocated, and the process ends at block 940.

Referring again to block 920, if RC machine 332 determines from the TTYPE of the CPU request that it is a CPU read request, the process proceeds to block 930, where RC machine 332 issues a bus READ operation of global scope to obtain a copy of the target memory block. RC machine 332 again selects the global operation scope based on the imprecise indication, provided by the Ig coherency state, that an updated copy of the memory block resides in a remote coherency domain. In response to receipt of the requested memory block, RC machine 332 places the memory block in data array 310 and updates state field 316 of the corresponding entry in cache directory 312 from the Ig state to one of the Slg or Me states, as shown at block 932. In particular, RC machine 332 updates the coherency state to Me if the memory block was sourced by memory controller 206 and no other cache holds a copy of the memory block, and otherwise updates the coherency state to Slg. Thereafter, RC machine 332 is deallocated, and the process ends at block 940.
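
The handling of an Ig hit in FIG. 10 (blocks 920 through 940) reduces to a small decision function, sketched below. The function name `service_ig_hit`, the TTYPE strings "READ" and "UPDATE", and the tuple it returns are assumptions chosen for the example; the sketch models only the choice of bus operation, scope, and resulting L2 coherency state, not the bus protocol itself.

```python
def service_ig_hit(ttype, sourced_by_memory_controller=False,
                   other_cache_holds_copy=True):
    """Return (bus_operation, scope, new_l2_state) for a CPU request
    that hits a line held in the Ig state.

    The two keyword arguments describe how the requested memory block was
    delivered; they matter only for read requests (block 932).
    """
    # The Ig state imprecisely indicates that an updated copy resides in a
    # remote coherency domain, so the operation is issued with global scope.
    scope = "global"

    if ttype == "UPDATE":
        # Blocks 922 and 924: obtain an exclusive copy, then Ig becomes M.
        return "RWITM", scope, "M"

    if ttype == "READ":
        # Blocks 930 and 932: Me only if memory delivered the block and no
        # other cache holds a copy; otherwise Slg.
        if sourced_by_memory_controller and not other_cache_holds_copy:
            return "READ", scope, "Me"
        return "READ", scope, "Slg"

    raise ValueError(f"unsupported TTYPE: {ttype!r}")
```

Note that neither branch allocates a castout machine: for a read that ends in Slg, the "g" in the resulting state keeps the cached indication that the domain indicator needs updating.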

Clearly, the implementation of the Sg and Slg coherency states in accordance with the present invention simplifies coherency processing in at least two respects. First, because the cached indication of the global state of the domain indicator represented by the Ig state can be retained in the cache directory by either the Sg or Slg coherency state, no CO machine 336 is allocated to cast out the Ig coherency state in the case of an Ig hit for a CPU read request. Utilization of the finite resources within L2 cache controller 330 is therefore reduced. Second, because no CO machine 336 is allocated to perform a castout in such cases, no coherency resolution window 800 is formed, and response logic 210 can determine the appropriate coherency state on which to base its partial response to a snooped request directly from cache directory 312. As a result, the logic implemented in response logic 210 is simplified.

As has been described, the present invention provides an improved method, apparatus, and system for data processing in which a coherency state, such as Sg or Slg, is utilized to indicate that a particular memory block may be held in multiple caches and that a copy of the memory block resides outside the cache's local coherency domain. Implementation of one or more such coherency states is advantageous in that it permits a shared lower level (e.g., L3) cache to retain a copy of a memory block in the case of a castout hit on an Ig copy of the memory block. In addition, implementation of one or more such coherency states simplifies the design of upper level (e.g., L2) caches and makes coherency processing more efficient.

While the invention has been particularly disclosed and described with reference to a preferred embodiment, it will be apparent to those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

FIG. 1 is a high level block diagram of an embodiment of a cache coherent symmetric multiprocessor (SMP) data processing system in accordance with the present invention.
FIG. 2 is a block diagram of an exemplary processing unit in accordance with a preferred embodiment of the present invention.
FIG. 3 is a more detailed block diagram of an embodiment of a processor core and L2 cache in accordance with a preferred embodiment of the present invention.
FIG. 4 is a more detailed block diagram of an embodiment of an L3 cache in accordance with a preferred embodiment of the present invention.
FIG. 5 is a time-space diagram of an exemplary operation on a local or system interconnect of a data processing system in accordance with a preferred embodiment of the present invention.
FIG. 6 is a schematic diagram illustrating a system memory including a domain indicator in accordance with a preferred embodiment of the present invention.
FIG. 7 is a high level logical flowchart of an exemplary method of performing a cast-in to an L3 cache memory in accordance with a preferred embodiment of the present invention.
FIG. 8 is a high level logical flowchart of an exemplary method of implementing a coherency state transition policy in an L3 cache memory in response to a cast-in in accordance with a preferred embodiment of the present invention.
FIG. 9 is a timing diagram showing a conventional operational flow in which a castout hit on an Ig coherency state creates a coherency resolution window during which the cache directory must be consulted to determine the appropriate coherency response to a snooped read-type operation.
FIG. 10 is a high level logical flowchart of an exemplary method of a coherency state transition policy implemented by an L2 cache memory in accordance with a preferred embodiment of the present invention.

Claims

A method of processing data in a multiprocessor data processing system including at least a first coherency domain and a second coherency domain, wherein the first coherency domain includes at least one processing unit, a system memory, and a cache memory, the method comprising:
buffering a cache line in a data array of the cache memory; and
setting a state field in a cache directory of the cache memory to a coherency state indicating that the cache line is valid in the data array, that the cache line is held non-exclusively in the cache memory and is unmodified with respect to a corresponding memory block in the system memory, that another cache in the second coherency domain may hold a copy of the cache line, and that a domain indicator in the first coherency domain, which has a first state indicating a first coherency communication scope for the cache line that does not include the second coherency domain and a second state indicating a second coherency communication scope for the cache line that includes the second coherency domain, must be updated to the second state.

The method of claim 1, wherein:
the cache memory is a lower level cache memory;
the data processing system includes a plurality of upper level cache memories coupled to the lower level cache memory;
the coherency state is a first coherency state; and
the setting step includes, in response to a cast-in of the cache line into the data array from one of the plurality of upper level cache memories, updating the state field from a second coherency state, which indicates that the cache line is invalid, to the first coherency state.

The method of claim 1, wherein:
the cache memory is coupled to an interconnect of the data processing system;
the coherency state is a first coherency state;
the method further comprises the cache memory issuing a request for the cache line on the interconnect; and
the setting step includes, in response to receiving the cache line in response to the request, updating the state field from a second coherency state, which indicates that the cache line is invalid and that another cache in the second coherency domain holds a copy of the cache line, to the first coherency state.

The method of claim 1, further comprising:
selecting the cache line for eviction from the data array; and
in response to selecting the cache line for eviction from the data array, the cache memory performing a dataless castout of an indication that another cache in the second coherency domain holds a copy of the cache line.

The method of claim 4, further comprising a memory controller of the system memory updating the domain indicator for the cache line in response to receiving the indication that another cache in the second coherency domain holds a copy of the cache line.

The method of claim 1, wherein the coherency state further indicates that the cache memory has authority within the first coherency domain to source a copy of the cache line by cache-to-cache data intervention.

A processing unit for a multiprocessor data processing system including at least a first coherency domain and a second coherency domain, the first coherency domain including a system memory and the processing unit, the processing unit comprising:
a processor core; and
a cache memory coupled to the processor core,
wherein the cache memory includes:
a data array that holds a cache line;
a cache directory including an entry that is associated with the cache line and that contains a state field; and
a cache controller,
wherein the cache controller sets the state field to a coherency state indicating that the cache line is valid in the data array, that the cache line is held non-exclusively in the cache memory and is unmodified with respect to a corresponding memory block in the system memory, that another cache in the second coherency domain may hold a copy of the cache line, and that a domain indicator in the first coherency domain, which has a first state indicating a first coherency communication scope for the cache line that does not include the second coherency domain and a second state indicating a second coherency communication scope for the cache line that includes the second coherency domain, must be updated to the second state.

A data processing system comprising at least a first cache coherent coherency domain and a second cache coherent coherency domain that are coupled together, wherein the first coherency domain includes a first processing unit and a domain indicator, the second coherency domain includes a second processing unit, and a system memory is located in at least one of the first coherency domain and the second coherency domain,
wherein the first processing unit includes:
a processor core; and
a cache memory coupled to the processor core,
wherein the cache memory includes:
a data array that holds a cache line;
a cache directory including an entry that is associated with the cache line and that contains a state field; and
a cache controller,
wherein the cache controller sets the state field to a coherency state indicating that the cache line is valid in the data array, that the cache line is held non-exclusively in the cache memory and is unmodified with respect to a corresponding memory block in the system memory, that another cache in the second coherency domain may hold a copy of the cache line, and that the domain indicator in the first coherency domain, which has a first state indicating a first coherency communication scope for the cache line that does not include the second coherency domain and a second state indicating a second coherency communication scope for the cache line that includes the second coherency domain, must be updated to the second state.