
CN111143244A - Memory access method of computer device and computer device - Google Patents

Memory access method of computer device and computer device

Info

Publication number
CN111143244A
Authority
CN
China
Prior art keywords
cache
node
cache line
rcls
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911394845.2A
Other languages
Chinese (zh)
Other versions
CN111143244B (en)
Inventor
蔡云龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hygon Information Technology Co Ltd
Original Assignee
Hygon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hygon Information Technology Co Ltd
Priority to CN201911394845.2A
Publication of CN111143244A
Application granted
Publication of CN111143244B
Legal status: Active (Current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893 - Caches characterised by their organisation or structure
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 - Partitioning or combining of resources
    • G06F9/5072 - Grid computing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present disclosure provides a memory access method for a computer device comprising a plurality of nodes connected via an interconnection bus, each node comprising an integrated processor core and cache, and a memory. The method comprises: a node locally storing cache lines from the caches of other nodes to form a remote cache local image; and the processor core of the node accessing a cache line from the remote cache local image. Because each node mirrors the cached data of other nodes into local storage, the number of cross-node memory accesses is reduced, and computer performance in large-memory application scenarios is improved.

Description

Memory access method of computer device and computer device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a memory access method for a computer device and a computer device.
Background
Existing servers based on the NUMA (non-uniform memory access) architecture can combine dozens or even hundreds of CPUs in one server and have good scalability. In a NUMA server, multiple nodes are connected by an interconnection network; each node has its own CPUs, caches, memory, and I/O devices, and all memory is shared across the server. In this arrangement, a CPU accesses local memory within its node faster and with lower latency than remote memory outside the node. However, in some applications that use large amounts of memory (such as database lookup), the CPU of each node inevitably needs to access remote memory frequently, which greatly reduces server performance.
Disclosure of Invention
In view of this, a memory access method for a computer device and a computer device are provided. The computer device has a plurality of nodes, and a node mirrors the cached data of other nodes into its local storage, so that the number of cross-node memory accesses is reduced and computer performance in large-memory application scenarios is improved.
According to a first aspect of the present disclosure, there is provided a memory access method for a computer device comprising a plurality of nodes connected via an interconnection bus, each node comprising an integrated processor core and cache, and a memory, the method comprising: a node locally storing cache lines from the caches of other nodes to form a remote cache local image (RCLS); and a processor core of the node accessing the cache line from the RCLS.
In one possible embodiment, the node may include a plurality of processor cores, and the cache may be an L3 cache shared by the plurality of processor cores.
In one possible embodiment, the RCLS is stored in the node's memory or L4 cache.
In one possible embodiment, the cache line of the RCLS may be updated periodically, or may be updated when the cache line in the other node's cache is infrequently read and written.
In one possible embodiment, the cache line may include write history information that records the number of times the cache line has been written in the L3 cache; when the cache line is transferred to the RCLS, the write history information may be cleared or modified so that it is consistent in the L3 cache and the RCLS.
In one possible embodiment, the method may further include: a processor core of the node broadcasting a cache line request message on the interconnect bus; a processor core of the node receiving a response message of the other node, the response message indicating that the other node has the cache line in its cache, the response message including write history information of the cache line; comparing, by a processor core of the node, write history information of a cache line in the RCLS with write history information of the response message; based on the comparison, the processor core of the node obtains the cache line from the RCLS or obtains the cache line from the other node.
In one possible embodiment, when the comparison indicates that the cache line of the other node is the same as the cache line of the RCLS, the cache line may be obtained from the RCLS; when the comparison indicates that the cache line of the other node is different from the cache line of the RCLS, the cache line may be obtained from the other node and the RCLS may be updated.
In one possible embodiment, the method may further include, after the processor core of the node acquires the cache line, modifying a flag bit of the cache line in the cache of the other node to be shared when the access is a read operation; when the access is a write operation, modifying the flag bit of the cache line in the cache of the other node to be invalid.
According to a second aspect of the present disclosure, there is provided a computer device comprising a plurality of nodes connected via an interconnection bus, each node comprising an integrated processor core and cache, and a memory, wherein a node of the plurality of nodes has stored locally on it a remote cache local image RCLS for storing a cache line in the caches of other nodes and providing the cache line to the processor core of the node.
In one possible embodiment, the node may include a plurality of processor cores, and the cache is an L3 cache shared by the plurality of processor cores.
In one possible embodiment, the RCLS may be stored in the node's memory or L4 cache.
In one possible embodiment, the cache line of the RCLS may be updated periodically, or when the cache line in the other node's cache is read and written infrequently.
In one possible embodiment, the cache line may include write history information that records the number of times the cache line has been written in the L3 cache; when the cache line is transferred to the RCLS, the write history information may be cleared or modified so that it is consistent in the L3 cache and the RCLS.
In one possible embodiment, the processor core of the node may be configured to: broadcast a cache line request message on the interconnect bus; receive a response message of the other node, the response message indicating that the other node has the cache line in its cache and including write history information of the cache line; compare the write history information of the cache line in the RCLS with the write history information of the response message; and obtain the cache line from the RCLS or from the other node according to the comparison.
In one possible embodiment, the processor core of the node may be further configured to obtain the cache line from the RCLS when the comparison indicates that the cache line of the other node is the same as the cache line of the RCLS; and, when the comparison indicates that the cache line of the other node is different from the cache line of the RCLS, to obtain the cache line from the other node and update the RCLS.
In one possible embodiment, the processor core of the node is further configured to modify the flag bit of the cache line in the other node's cache to shared when the access is a read operation after the cache line is fetched; when the access is a write operation, modifying the flag bit of the cache line in the cache of the other node to be invalid.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein like reference numerals refer to like parts, unless otherwise specified.
FIG. 1 shows a schematic diagram of a computer device of a NUMA architecture in accordance with an embodiment of the present disclosure.
FIG. 2 shows a schematic flow diagram of a memory access method of a NUMA architecture according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of a computer device of another NUMA architecture in accordance with an embodiment of the present disclosure.
FIG. 4 shows a schematic flow diagram of another memory access method for a NUMA architecture according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. The singular forms "a", "an", and "the" as used herein are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, the terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense. For example, in the present disclosure, the term "cache" has the same meaning as "cache memory", and the term "memory" has the same meaning as "main memory" or "system memory".
Fig. 1 shows a schematic diagram of a computer device 100 of a NUMA architecture in accordance with an embodiment of the present disclosure. Computer device 100 may be, for example, a server device, a home computer, a high-performance computer (HPC), or the like. As shown, computer device 100 is based on a NUMA architecture and includes two NUMA nodes 101a and 101b, which may also be referred to as sockets. For ease of presentation and understanding, reference numerals containing the letter "a" shall be understood as components or parts associated with node 101a, and reference numerals containing the letter "b" shall be understood as components or parts associated with node 101b. Although fig. 1 shows only two nodes 101a and 101b, those skilled in the art will appreciate that the computer device 100 may include more nodes, and the number is not limited thereto.
Nodes 101a and 101b in fig. 1 are similarly arranged, and the specific composition of the computer device 100 is described below with reference to node 101a only.
As shown, node 101a may include a processor device 110a, memory 120a, I/O devices, and the like (not shown for simplicity). The processor device 110a may be a single-core or multi-core processor chip; for example, in the embodiment shown in fig. 1, the processor device 110a includes a plurality of processor cores 111a and 112a, each of which includes a core and an L1 cache and an L2 cache specific to that core. The core may contain a memory access component, a bus processing component, and the like, implemented in hardware or software.
In the illustrated embodiment, the remainder of the processor device 110a includes various interconnection circuits, interfaces, and components for communicatively connecting the functional blocks on the processor device. As shown, within processor device 110a, the processor cores 111a and 112a may communicate with each other via an intra-socket interconnect 113a. For simplicity, the interconnect is depicted as an intra-socket interconnect 113a; however, it should be understood that interconnect 113a may represent one or more interconnect structures, such as buses and single- or multi-channel serial point-to-point, ring, or mesh interconnect structures. Also included within processor device 110a is an L3 cache 114a shared by the processor cores; processor cores 111a and 112a may access the shared L3 cache 114a via intra-socket interconnect 113a. Processor device 110a may also include a memory controller 115a for controlling the transfer of read and write data between processor device 110a and memory 120a. Further, processor device 110a also includes a socket-to-socket (S-S) interconnect interface 116a; S-S interface 116a may receive data on intra-socket interconnect 113a and send and receive data to and from other sockets or nodes via inter-socket interconnects, and vice versa. In addition to these illustrated blocks, the processor device 110a may include many other functional blocks that are not shown for clarity, such as a PCIe interface, a network interface controller (NIC), and so forth.
In fig. 1, each processor device may be operatively coupled to a printed circuit board, referred to as a motherboard, via a socket, or otherwise coupled to the motherboard via a direct coupling technique (e.g., flip-chip bonding). The motherboard includes electrical wiring (e.g., traces and vias) to provide electrical connections corresponding to the physical structure of the various interconnects depicted in fig. 1. These interconnects include a socket-to-socket (S-S) interconnect 102 between S-S interconnect interfaces 116a and 116b, which is used to communicatively couple the processor nodes. In one embodiment, the S-S interconnect 102 may employ or support QPI or Infinity Fabric, among other protocols and wiring structures.
As shown in fig. 1, the L1 cache, L2 cache, L3 cache 114a, and memory 120a form a storage hierarchy ordered by distance from the core, from near to far. Generally, the closer to the core, the faster the core can access it and the lower the latency. For example, the L1 and L2 caches may be relatively high-cost static random access memory (SRAM), with a core's L1 cache access latency of about 1-4 clock cycles and L2 cache access latency of about 12 clock cycles. The L3 cache may be a relatively low-cost dynamic random access memory (DRAM) or enhanced DRAM (eDRAM), etc., with an access latency of approximately 36 clock cycles. The memory 120a may typically be DRAM or the like, with an access latency of about 60-100 clock cycles.
Due to data locality, including spatial and temporal locality, processors (such as the core of fig. 1) often need to read the same data multiple times within a short time. Since the operating frequency of memory is far lower than the speed of the processor, with the above storage hierarchy the processor sequentially queries L1-L3 for the needed data, so that the processor can read the needed data from one of the cache levels instead of from memory, which is called a hit; this greatly reduces data access time.
As shown in fig. 1, the L3 cache 114a includes several cache lines (as do L1 and L2), and when all of L1-L3 miss, the needed data can be loaded from the memory 120a. Specifically, a memory line (or memory block) containing the desired data will be loaded and held in the L3 cache in the form of a cache line for subsequent re-access. Similarly, when data is loaded from the L3 cache for processor access, the data is further promoted to the L2 cache, and so on.
Referring to table 1, a schematic structure of a cache line of an embodiment of the present disclosure is shown.
Tag | Data block | Flag bits
TABLE 1
A cache line may include a tag, which holds a portion of a memory address; a data block, which is the content of the cache line, i.e., the valid data portion; and flag bits, which generally indicate states such as Modified, Exclusive, Shared, and Invalid, used for controlling cache coherency.
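As a rough illustration only, the Table 1 layout could be modeled as below; the 64-byte block size and the MESI-style state field are assumptions made for this sketch, not values fixed by the disclosure.

```cpp
#include <array>
#include <cstdint>

// Illustrative model of the Table 1 cache line layout. The field widths and
// the MESI-style state encoding are assumptions for this sketch.
enum class CoherencyState : uint8_t { Modified, Exclusive, Shared, Invalid };

struct CacheLine {
    uint32_t tag = 0;                                 // tag: part of the memory address
    std::array<uint8_t, 64> data{};                   // data block: the valid data content
    CoherencyState flags = CoherencyState::Invalid;   // flag bits for cache coherency
};
```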
The bus processing component 117a of the core may determine whether a cache hit occurs according to the memory address of the required data. Table 2 shows an exemplary structure of an effective memory address according to an embodiment of the present disclosure.
Tag | Index | Intra-block offset
TABLE 2
The effective memory address may include a tag, an index, and an intra-block offset. The index indicates which cache set a data block is loaded into; the index is combined with the tag to search for a cache line in each cache level and determine whether the cache line is hit. The intra-block offset represents the offset, within the cache line, of the data required by the processor.
For example, suppose the L1 cache size is 8 KB, each cache line is 64 bytes, and 4 cache lines make up a cache set. Then there are 8 KB / 64 bytes = 128 cache lines in total, and grouping every 4 lines into a set gives 128 / 4 = 32 cache sets. Accordingly, the intra-block offset occupies 6 bits (2^6 = 64), the index occupies 5 bits (2^5 = 32), and, for a 32-bit address, the tag occupies the remaining 32 - 5 - 6 = 21 bits. The above briefly describes the mapping between memory addresses and cache lines and the mechanism for looking up a cache line in a cache from a memory address.
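Under the geometry just described (8 KB L1 cache, 64-byte lines, 4-way sets, 32-bit addresses), the field extraction can be sketched as follows; the helper names are illustrative only and are not part of the disclosure.

```cpp
#include <cstdint>
#include <cstdio>

// Field split for a 32-bit effective address with 64-byte lines (6 offset bits)
// and 32 cache sets (5 index bits); the remaining 21 bits form the tag.
struct AddressFields {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
};

AddressFields decode(uint32_t addr) {
    return AddressFields{
        addr >> (6 + 5),        // tag: 32 - 5 - 6 = 21 bits
        (addr >> 6) & 0x1F,     // index: selects one of the 32 cache sets
        addr & 0x3F             // intra-block offset within the 64-byte line
    };
}

int main() {
    AddressFields f = decode(0x12345678u);
    std::printf("tag=%u index=%u offset=%u\n", f.tag, f.index, f.offset);
}
```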
Returning to FIG. 1, there are multiple nodes under a NUMA architecture. In this case, for example, when processor core 112a does not find a desired cache line in any of its local caches L1-L3, its bus processing component 117a may broadcast a snoop signal on the interconnect bus to query whether the L3 caches of other nodes store the desired data. The purpose is, on the one hand, to avoid the extra latency incurred by memory accesses and, more importantly, to ensure that the most up-to-date data is read, since the same memory data may have been cached and modified in the L3 of one or more other nodes. As shown in fig. 1, the snoop signal is broadcast on S-S interconnect bus 102 via intra-socket interconnect bus 113a and S-S interface 116a, and the response information of other nodes is received. This operation may be implemented by bus processing component 117a. Bus processing component 117a is shown in FIG. 1 as a separate component within node 101a; alternatively, bus processing component 117a may be a software implementation or logic configured within the core of processor core 112a, and each processor core may have a bus processing component to perform similar operations.
Fig. 1 shows the sending and receiving of a snoop signal between node 101a and node 101b; however, when there are more nodes, the snoop signal is broadcast to every node on the interconnect bus 102. Although snoop communication between bus processing components 117a and 117b is shown as a separate line in fig. 1, this is for clarity only; it should be understood that the communication is actually carried over S-S interconnect 102. For example, when node 101b receives the snoop signal and finds the requested cache line (valid) in its L3 cache, node 101b sends the most recent copy of the cache line to node 101a via interconnect bus 102. When no valid response is received after the snoop signal is issued, the requested data is accessed from memory. When data is requested from memory, the latency differs depending on whether the memory address is local or remote to the node, as described in more detail below.
Under a NUMA architecture, each processor (and processor core) is able to access the different memory resources distributed across the various nodes. In other words, all memory is shared by all processors and together forms a unified memory address space. A memory may be a local memory resource (e.g., a memory resource on the same node as a processor or core) or a remote memory resource (e.g., a memory resource on another node). Generally, for example, from the perspective of node 101a, memory 120a is a local memory resource and memory 120b is a remote memory resource. Since the local memory resources are operatively coupled to the processor of a given node, access to local memory resources is not the same as access to remote memory resources (i.e., access is non-uniform). For example, the latency for processor 110a to access local memory resource 120a is about 80 ns, while the latency for accessing remote memory resources may be 250 ns or more. Even if it is possible to obtain the data from the L3 cache of node 101b without accessing remote memory 120b, the data transfer across nodes may introduce a delay of about 100 ns. Therefore, a policy of prioritizing local memory resources needs to be adopted in actual memory allocation. Unfortunately, in some applications with large memory usage (e.g., database lookup), the processor of each node inevitably has to access remote memory frequently, which significantly reduces server performance.
FIG. 2 shows a schematic flow diagram of a method of memory access in a NUMA architecture in accordance with an embodiment of the present disclosure. This is explained with reference to fig. 1.
At step 210, the processor core issues a memory access request. For example, machine instructions obtained by processor core 112a include a read or write to a memory address, creating a data access need for the memory address.
Then, at step 220, the L1 and L2 caches associated with the processor core are searched in turn for the cache line containing the required data. Specifically, the cache line is looked up in L1 and L2 according to the memory address information. In one embodiment, each cache line may include 64 bytes of data; when the required data is contained in a cache line (and the corresponding cache line is "valid"), L1 or L2 hits. In this case, the corresponding cache line is promoted to a closer cache (e.g., from L2 to L1) for use by the processor core, and the flow ends. Otherwise, the flow proceeds to step 230.
At step 230, the local L3 cache is queried for the required data. For example, a lookup is made in L3 cache 114a to see whether the needed data is present (the corresponding cache line should be "valid"). If so, L3 hits, the corresponding cache line is promoted to a closer cache (e.g., from L3 to L2 to L1) for use by the processor core, and the flow ends. Otherwise, the flow proceeds to step 240.
In step 240, when none of L1-L3 hit, a snoop signal is broadcast on the interconnect bus. For example, processor core 112a may use bus processing component 117a to broadcast a snoop signal on S-S interconnect bus 102 via interconnect bus 113a and interface 116a, so that the bus processing components of other nodes can snoop it.
Next, at step 250, the other nodes look up the required data in their L3 caches. If it is found (the corresponding cache line should be "valid"), it is an L3 hit. In this case, a copy of the cache line of the corresponding node is sent onto the S-S bus to be received by the node requesting the data. For example, a processor core in node 101b may send its cache line in L3 cache 114b to node 101a via S-S bus 102. Accordingly, the cache line is stored in the L3 cache 114a of node 101a and is promoted to a closer cache (e.g., from L3 to L2 to L1) for use by the processor core, and the flow ends. If there is no hit at any other node, the flow proceeds to step 260.
In step 260, when none of the caches hit, the memory access request is submitted to the memory controller at the node holding the required data, the data is accessed from memory, and the flow ends. If it is a local memory access, the needed data is cached in the local L3; if it is a remote memory access, the required data is cached in the remote L3 and then transmitted to the L3 of the requesting node via the S-S bus.
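Summarizing the above steps, the decision sequence of FIG. 2 might be sketched as follows; every function here is a hypothetical placeholder standing in for hardware behavior, not part of the disclosed design.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical stand-ins for the lookups of FIG. 2; each returns whether the
// requested cache line was found (and is valid) at that level.
bool hitL1L2(uint64_t)     { return false; }   // steps 210-220: L1/L2 lookup
bool hitLocalL3(uint64_t)  { return false; }   // step 230: local L3 lookup
bool hitRemoteL3(uint64_t) { return false; }   // steps 240-250: snoop + remote L3

// Returns a label describing where the data is ultimately served from.
std::string serveRequest(uint64_t addr) {
    if (hitL1L2(addr))     return "L1/L2 hit";
    if (hitLocalL3(addr))  return "local L3 hit";
    if (hitRemoteL3(addr)) return "remote L3 hit via S-S bus";
    return "memory access (step 260), local or remote";
}

int main() {
    std::printf("%s\n", serveRequest(0x1000).c_str());
}
```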
As described above, when remote memory is accessed, a large delay may be incurred by the memory access and the cross-node data transmission, resulting in performance degradation; this is especially noticeable in large-memory application scenarios. In order to alleviate or mitigate this problem, embodiments of the present disclosure provide a remote cache local image (RCLS) technique, which improves data access efficiency by storing an image of the remote caches in a larger-capacity local backing store. In one embodiment, the L3 cache image of a remote node may be stored in a local L4 cache, or last-level cache (LLC). In another possible embodiment, the L3 cache image of the remote node may be stored in local memory if L3 is already the last-level cache (LLC).
Fig. 3 shows a schematic diagram of a computer device 300 of another NUMA architecture in accordance with an embodiment of the present disclosure. In contrast to fig. 1, in the computer device 300 of fig. 3 the cache lines in the L3 caches of remote nodes are additionally stored in each node's backing store below the L3 cache, as indicated by reference numerals 305 and 330a. It should be noted that although fig. 3 shows two nodes 301a and 301b, those skilled in the art will appreciate that computer device 300 may include many more nodes, and that each node with the remote cache local image (RCLS) enabled may store cache lines from any other node, as will be described in detail below.
According to embodiments of the present disclosure, computer device 300 may be configured to enable the remote cache local image (RCLS), in which case each node in computer device 300 may receive cache lines in the L3 caches of the other nodes via interconnect bus 302. In other words, in addition to being cached in each node's L3, the memory data is mirrored (further cached) in a lower level of storage at each node. Generally, the storage below L3 has capacity much larger than the L3 capacity, so each node has enough storage capacity to mirror the L3 caches of other nodes without impacting local use.
For example, a computer device of an exemplary NUMA architecture may include 64 nodes, each of which may have, for example, an L3 capacity of 16 MB and a local memory of 16 GB. With the RCLS enabled, even if all the cache lines of the remote L3 caches are mirrored into each node's local memory, only 64 × 16 MB / 16 GB = 6.4% of the memory capacity is occupied. In actual operation, however, the RCLS function need not be enabled on all nodes but only on a portion of them, and each RCLS may also mirror only the remote cache lines that are needed locally rather than all of them. In fact, the scale of RCLS enablement (the number of NUMA nodes on which it is enabled) can be flexibly adjusted depending on the cache capacity of each node and the capacity of the next-level storage, so as to achieve better performance of the computer device.
Referring to FIG. 3, arrow 305 indicates mirroring the L3 cache of node 301b locally to node 301a, forming a remote cache local image RCLS 330a. This RCLS 330a may be stored in any storage below the local L3 cache; for example, when node 301a has an L4 cache 318a, RCLS 330a is stored in the L4 cache, and when node 301a does not have an L4 cache, RCLS 330a may be stored in an allocated memory area. Similarly, as shown, the memory 320b and L4 cache 318b of node 301b store the RCLS of the L3 caches of the other nodes. It should be noted that although FIG. 3 only shows cache lines of the L3 cache of node 301b being stored at node 301a, it is understood that cache lines of the L3 cache from any other node may be stored on each NUMA node that has RCLS enabled.
According to the embodiment of the disclosure, the cache lines of the L3 cache of each node can be transmitted to other nodes in a synchronous or asynchronous manner; that is, the RCLS can be updated in real time or with a delay. For example, in one embodiment, when memory controller 316b of node 301b reads data from memory 320b and places it into L3 cache 314b, the data will be synchronized onto the interconnect bus and cached, via the S-S bus, into L4 cache 318a or memory 320a, thereby updating the RCLS at node 301a. In one possible embodiment, the updating of RCLS 330a at node 301a may be initiated proactively by memory controller 316b of node 301b, thereby enabling real-time updating of the RCLS. Alternatively, the updating of RCLS 330a at node 301a may be delay-triggered, for example when the processor core of node 301b is not reading or writing L3, or on a timed trigger. Alternatively, the RCLS update may also be triggered by snooping L3 cache update messages on the interconnect bus; for example, bus processing component 317a of node 301a may snoop all L3 cache updates on the interconnect bus, with node 301a itself deciding whether to update its local RCLS from the L3 cache of node 301b (and other nodes).
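One way to picture the delayed-update variant (idle or timed trigger) is the small policy sketch below; the threshold value and all names are assumptions chosen for illustration, not a prescribed implementation.

```cpp
#include <chrono>

// Minimal sketch of a delayed ("lazy") RCLS update trigger: a node packs and
// broadcasts its L3 cache lines when the L3 has been idle for some threshold,
// or on a periodic timer. Threshold and names are assumptions for illustration.
class RclsSyncPolicy {
    using Clock = std::chrono::steady_clock;

public:
    explicit RclsSyncPolicy(std::chrono::milliseconds idle_threshold)
        : idle_threshold_(idle_threshold), last_access_(Clock::now()) {}

    // Call on every read or write of the local L3 cache.
    void onL3Access() { last_access_ = Clock::now(); }

    // True when the packed L3 cache lines should be sent onto the interconnect
    // bus to update the RCLS copies held at other nodes.
    bool shouldSync() const {
        return Clock::now() - last_access_ >= idle_threshold_;
    }

private:
    std::chrono::milliseconds idle_threshold_;
    Clock::time_point last_access_;
};
```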
As described above, in addition to being cached in each node's L3, data in memory is also cached in the remote cache local image of each node, with the data being cached in the form of cache lines. In order to meet cache coherency requirements and reduce cross-node data transmission, embodiments of the present disclosure also provide a cache line structure that extends the structure shown in Table 1.
Tag | Data block | Flag bits | WH
TABLE 3
Referring to Table 3, according to an embodiment of the present disclosure, a write history flag (WH) is added to the flag bits of a cache line to record the number of times the data has been effectively altered by writes. In one embodiment, the WH flag consists of a plurality of bits; it will be understood that the more WH bits there are, the larger the number of cache line writes that can be recorded.
In a conventional cache coherency protocol, whenever a cache line is written, the write information needs to be broadcast on the system bus so that other nodes can sense it, modify the flag bits of their own copies of the cache line, and update the data block. In this case, the data traffic required to maintain cache coherency in a NUMA architecture takes up at least about 15-20% of the interconnect bus. In contrast, embodiments of the present disclosure introduce the write history flag WH into the flag bits and combine it with the remote cache local image RCLS, so that the data traffic used to maintain cache coherency can be greatly reduced and bus bandwidth resources are saved.
According to the embodiment of the present disclosure, when there are few read and write operations on the cache lines, for example when the processor core has not read or written the L3 cache for about 10 milliseconds (a relatively long time compared with the speed of the CPU), the cache lines of the L3 cache may be packed and sent onto the interconnection bus so as to be synchronized to the RCLS of other nodes; that is, the RCLS may be updated with a delay, which ensures that the processor is not blocked when data reads and writes are frequent. On the other hand, because the WH flag bits recording the number of writes are used, the processor core does not need to broadcast write-cache information on the interconnection bus at every cache write; in other words, the processor core may resynchronize to the RCLS of other nodes after accumulating modifications of multiple cached data blocks.
According to an embodiment of the present disclosure, whenever a processor or processor core writes to the L3 cache, the write history bits WH of the corresponding cache line are incremented by 1; for example, in the case where WH has 4 bits (the present invention is not limited thereto), the cache line should be synchronized to the RCLS of the respective nodes when WH reaches 15 (0xF), before WH overflows. Also, as described above, when there are few cache line read and write operations, the cache lines are packed and transmitted on the interconnect bus for synchronization to the other RCLS. After the L3 cache is synchronized to the RCLS of a node, the WH bits of all cache lines in L3 and in the RCLS are cleared or modified so that they are consistent between the L3 cache and the RCLS. In other words, the WH bits may be used to indicate the coherency of the L3 cache with the RCLS.
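A minimal sketch of the WH counter behavior just described, assuming the 4-bit example width (the disclosure notes the width is not limited to 4 bits); all names are illustrative.

```cpp
#include <cstdint>

// Write-history (WH) field behavior: increment on every write, request a
// synchronization to the RCLS before the 4-bit counter overflows, and clear
// after synchronization so that L3 and RCLS agree.
struct WriteHistory {
    uint8_t wh = 0;   // 4-bit write-history counter kept with the flag bits

    // Called on every write to the cache line in the local L3 cache.
    // Returns true when the line must be synchronized to the RCLS of other
    // nodes (here when the counter reaches 15, i.e. 0xF).
    bool recordWrite() {
        wh = (wh + 1) & 0xF;
        return wh == 0xF;
    }

    // After the L3 cache is synchronized to the remote RCLS copies, the WH
    // bits are cleared (or set to an agreed value) so that L3 and RCLS match.
    void clearAfterSync() { wh = 0; }
};
```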
A memory access method of a computer device according to an embodiment of the present disclosure will be described below with reference to fig. 4.
FIG. 4 shows a schematic flow chart diagram of another memory access method 400 for a NUMA architecture in accordance with an embodiment of the present disclosure. Here, assume that the method is performed in a computer device having 4 NUMA nodes, labeled NODE_0, NODE_1, NODE_2, and NODE_3. For example, each node includes 32 processor cores, so that the cores are denoted CPU 0 through CPU 127, and each node has a respective L3 cache, denoted L3_0, L3_1, L3_2, and L3_3. A node may have an L4 cache, in which case the remote cache local image (RCLS) described above is held in the L4 cache; if a node does not have an L4 cache, the remote cache local image is saved in the memory of the node. The RCLS of the nodes are denoted RCLS_0, RCLS_1, RCLS_2, and RCLS_3, respectively.
Taking CPU 0 as an example, the method 400 begins when a memory read/write request generated by CPU 0 misses in all of its local caches L1-L3.
At step 410, the bus processing component of CPU 0 broadcasts a data request message over the interconnect bus, requesting the cache line that includes the desired data. This data request message can be sensed by the bus processing components of the other processor cores, which then look up the data needed by CPU 0 in their L3 caches.
Then, at step 420, if there is a remote cache hit, for example the L3_3 cache of CPU 127 (at NODE_3) has the latest version of the data needed by CPU 0, CPU 127 may synchronize the corresponding cache line from its L1 and L2 caches to the L3_3 cache and write-lock the corresponding cache line in L3_3, so that the cache line is not allowed to be overwritten until this read/write operation of CPU 0 completes; the method then proceeds to step 430. If no remote cache hits, the method flow ends, and CPU 0 obtains the data from memory.
At step 430, a response message issued by CPU 127 on the interconnect bus is received, which may include the write history information WH of the cache line containing the required data. In one embodiment, to improve bus utilization efficiency, the response message also includes a portion of the data block contents of the cache line. For example, the response message may be 16 bytes; WH occupies only a small part (several bits) of it, and the remainder can be filled with part of the cache line's data, so as to reduce the amount of data to be transmitted subsequently.
Then, at step 440, CPU 0 compares the WH in the response message with the WH in RCLS_0. If they match, at step 450 CPU 0 fetches the cache line from the local RCLS_0 and loads it into its local L3_0, and at step 470 sends a data transfer complete message to NODE_3, whereupon CPU 127 modifies the flag bit of the corresponding cache line in L3_3.
If the WH in RCLS_0 and the WH in the response message from CPU 127 are not consistent, i.e., the WH provided by L3_3 at the node of CPU 127 is newer than the WH in RCLS_0, then at step 460 CPU 0 fetches the cache line from the node of CPU 127, stores it in L3_0, promotes it to L2, L1, and the registers of CPU 0, and synchronizes the cache line in RCLS_0 to the cache line provided by CPU 127. At this point, the WH is cleared or modified to be consistent. In addition, upon completion of the transfer from L3_3 to L3_0, it may also be determined whether a write to memory is required, for example depending on whether a write-back or write-through mode is used.
In step 470, CPU 0 sends a data transfer complete message to CPU 127. If the access request of CPU 0 is a read operation, CPU 127, upon receiving the transfer complete message, may modify the flag bit of the corresponding cache line of L3_3 from Exclusive to Shared, for example, and may release the write lock on the cache line in L3_3. If the access request of CPU 0 is a write operation, CPU 127 may modify the flag bit to Invalid; in fact, in the case of a write operation, according to the cache coherency protocol an invalidate signal is placed on the bus to be heard by the other nodes, so that the flag bits of the cache line copies in the other RCLS are all set invalid except for the data in L3_0, and therefore no separate update of the RCLS at the other nodes is required.
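The decisions in steps 440-470 can be summarized with the sketch below; the types and function names are hypothetical placeholders, not part of the disclosed design.

```cpp
#include <cstdint>

// Steps 440-470 of FIG. 4: choose the fetch source based on the WH comparison,
// and determine the remote node's flag after the transfer completes.
enum class FetchSource { LocalRcls, RemoteL3 };
enum class RemoteFlag { Shared, Invalid };

struct SnoopResponse {
    uint8_t wh;   // write-history value reported by the remote L3 (here L3_3)
    // ...a response may also carry part of the cache line's data block
};

// Steps 440-460: fetch from the local RCLS only when the WH values agree;
// otherwise fetch from the remote node and update the local RCLS copy.
FetchSource chooseSource(uint8_t rcls_wh, const SnoopResponse& resp) {
    return (resp.wh == rcls_wh) ? FetchSource::LocalRcls : FetchSource::RemoteL3;
}

// Step 470: after the transfer completes, the remote node sets the flag of its
// own copy to Shared for a read access, or Invalid for a write access.
RemoteFlag remoteFlagAfterAccess(bool is_write) {
    return is_write ? RemoteFlag::Invalid : RemoteFlag::Shared;
}
```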
Although example embodiments have been described, it will be apparent to those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the inventive concept. Accordingly, it should be understood that the above-described exemplary embodiments are not limiting, but illustrative.

Claims (16)

1. A memory access method for a computer device, the computer device comprising a plurality of nodes connected via an interconnection bus, each node comprising an integrated processor core and cache, and a memory, the method comprising: a node of the plurality of nodes locally storing cache lines from the caches of other nodes, forming a remote cache local image (RCLS); and the processor core of the node accessing the cache line from the RCLS.

2. The method of claim 1, wherein the node comprises a plurality of processor cores, and the cache is an L3 cache shared by the plurality of processor cores.

3. The method of claim 1, wherein the RCLS is stored in the memory or an L4 cache of the node.

4. The method of claim 1, wherein the cache lines of the RCLS are updated periodically, or are updated when the cache lines in the caches of the other nodes are infrequently read and written.

5. The method of claim 1, wherein the cache line includes write history information, the write history information records the number of times the cache line has been written in the L3 cache, and the write history information is cleared or modified to be consistent in the L3 cache and the RCLS when the cache line is transmitted to the RCLS.

6. The method of claim 1, further comprising: the processor core of the node broadcasting a cache line request message on the interconnection bus; the processor core of the node receiving a response message from the other node, the response message indicating that the other node has the cache line in its cache and including write history information of the cache line; the processor core of the node comparing the write history information of the cache line in the RCLS with the write history information in the response message; and, according to the comparison, the processor core of the node obtaining the cache line from the RCLS or from the other node.

7. The method of claim 6, wherein: when the comparison indicates that the cache line of the other node is the same as the cache line of the RCLS, the cache line is obtained from the RCLS; and when the comparison indicates that the cache line of the other node is different from the cache line of the RCLS, the cache line is obtained from the other node and the RCLS is updated.

8. The method of claim 1, further comprising, after the processor core of the node obtains the cache line: when the access is a read operation, modifying the flag bit of the cache line in the cache of the other node to shared; and when the access is a write operation, modifying the flag bit of the cache line in the cache of the other node to invalid.

9. A computer device comprising a plurality of nodes connected via an interconnection bus, each node comprising an integrated processor core and cache, and a memory, wherein a node of the plurality of nodes locally stores a remote cache local image (RCLS), the RCLS being used for storing cache lines from the caches of other nodes and for providing the cache lines to the processor core of the node.

10. The computer device of claim 9, wherein the node comprises a plurality of processor cores, and the cache is an L3 cache shared by the plurality of processor cores.

11. The computer device of claim 9, wherein the RCLS is stored in the memory or an L4 cache of the node.

12. The computer device of claim 9, wherein the cache lines of the RCLS are updated periodically, or are updated when the cache lines in the caches of the other nodes are infrequently read and written.

13. The computer device of claim 9, wherein the cache line includes write history information, the write history information records the number of times the cache line has been written in the L3 cache, and the write history information is cleared or modified to be consistent in the L3 cache and the RCLS when the cache line is transmitted to the RCLS.

14. The computer device of claim 9, wherein the processor core of the node is configured to: broadcast a cache line request message on the interconnection bus; receive a response message from the other node, the response message indicating that the other node has the cache line in its cache and including write history information of the cache line; compare the write history information of the cache line in the RCLS with the write history information in the response message; and, according to the comparison, obtain the cache line from the RCLS or from the other node.

15. The computer device of claim 14, wherein the processor core of the node is further configured to: obtain the cache line from the RCLS when the comparison indicates that the cache line of the other node is the same as the cache line of the RCLS; and obtain the cache line from the other node and update the RCLS when the comparison indicates that the cache line of the other node is different from the cache line of the RCLS.

16. The computer device of claim 9, wherein the processor core of the node is further configured to, after obtaining the cache line: when the access is a read operation, modify the flag bit of the cache line in the cache of the other node to shared; and when the access is a write operation, modify the flag bit of the cache line in the cache of the other node to invalid.
CN201911394845.2A 2019-12-30 2019-12-30 Memory access method of computer device and computer device Active CN111143244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911394845.2A CN111143244B (en) 2019-12-30 2019-12-30 Memory access method of computer device and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911394845.2A CN111143244B (en) 2019-12-30 2019-12-30 Memory access method of computer device and computer device

Publications (2)

Publication Number Publication Date
CN111143244A true CN111143244A (en) 2020-05-12
CN111143244B CN111143244B (en) 2022-11-15

Family

ID=70521927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911394845.2A Active CN111143244B (en) 2019-12-30 2019-12-30 Memory access method of computer device and computer device

Country Status (1)

Country Link
CN (1) CN111143244B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112416259A (en) * 2020-12-04 2021-02-26 海光信息技术股份有限公司 Data access method and data access device
CN112468416A (en) * 2020-10-23 2021-03-09 曙光网络科技有限公司 Network flow mirroring method and device, computer equipment and storage medium
CN112612727A (en) * 2020-12-08 2021-04-06 海光信息技术股份有限公司 Cache line replacement method and device and electronic equipment
CN113722110A (en) * 2021-11-02 2021-11-30 阿里云计算有限公司 Computer system, memory access method and device
WO2022155820A1 (en) * 2021-01-20 2022-07-28 Alibaba Group Holding Limited Core-aware caching systems and methods for multicore processors
CN116955270A (en) * 2023-09-21 2023-10-27 北京数渡信息科技有限公司 Method for realizing passive data caching of multi-die interconnection system

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030009639A1 (en) * 2001-06-21 2003-01-09 International Business Machines Corp. Non-uniform memory access (NUMA) data processing system that provides precise notification of remote deallocation of modified data
CN101477496A (en) * 2008-12-29 2009-07-08 北京航空航天大学 NUMA structure implementing method based on distributed internal memory virtualization
CN103744799A (en) * 2013-12-26 2014-04-23 华为技术有限公司 Memory data access method, device and system
US8918587B2 (en) * 2012-06-13 2014-12-23 International Business Machines Corporation Multilevel cache hierarchy for finding a cache line on a remote node
CN104838388A (en) * 2012-12-20 2015-08-12 英特尔公司 Versatile and reliable intelligent package
CN104917784A (en) * 2014-03-10 2015-09-16 华为技术有限公司 Data migration method and device, and computer system
CN106250350A (en) * 2016-07-28 2016-12-21 浪潮(北京)电子信息产业有限公司 A kind of caching of page read method based on NUMA architecture and system
CN106331148A (en) * 2016-09-14 2017-01-11 郑州云海信息技术有限公司 A cache management method and device for client data reading
CN107025130A (en) * 2016-01-29 2017-08-08 华为技术有限公司 Handle node, computer system and transactional conflict detection method
CN107623722A (en) * 2017-08-21 2018-01-23 云宏信息科技股份有限公司 A kind of remote data caching method, electronic equipment and storage medium
CN109313610A (en) * 2016-06-13 2019-02-05 超威半导体公司 Scaled set for cache replacement strategy competes
EP3486785A1 (en) * 2017-11-20 2019-05-22 Samsung Electronics Co., Ltd. Systems and methods for efficient cacheline handling based on predictions
CN109815165A (en) * 2017-11-20 2019-05-28 三星电子株式会社 System and method for storing and processing efficient compressed cache lines
CN109952565A (en) * 2016-11-16 2019-06-28 华为技术有限公司 Internal storage access technology
CN110321299A (en) * 2018-03-29 2019-10-11 英特尔公司 For detecting repeated data access and automatically loading data into system, method and apparatus in local cache

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030009639A1 (en) * 2001-06-21 2003-01-09 International Business Machines Corp. Non-uniform memory access (NUMA) data processing system that provides precise notification of remote deallocation of modified data
CN101477496A (en) * 2008-12-29 2009-07-08 北京航空航天大学 NUMA structure implementing method based on distributed internal memory virtualization
US8918587B2 (en) * 2012-06-13 2014-12-23 International Business Machines Corporation Multilevel cache hierarchy for finding a cache line on a remote node
CN104838388A (en) * 2012-12-20 2015-08-12 英特尔公司 Versatile and reliable intelligent package
CN103744799A (en) * 2013-12-26 2014-04-23 华为技术有限公司 Memory data access method, device and system
CN104917784A (en) * 2014-03-10 2015-09-16 华为技术有限公司 Data migration method and device, and computer system
CN107025130A (en) * 2016-01-29 2017-08-08 华为技术有限公司 Handle node, computer system and transactional conflict detection method
CN109313610A (en) * 2016-06-13 2019-02-05 超威半导体公司 Scaled set for cache replacement strategy competes
CN106250350A (en) * 2016-07-28 2016-12-21 浪潮(北京)电子信息产业有限公司 A kind of caching of page read method based on NUMA architecture and system
CN106331148A (en) * 2016-09-14 2017-01-11 郑州云海信息技术有限公司 A cache management method and device for client data reading
CN109952565A (en) * 2016-11-16 2019-06-28 华为技术有限公司 Internal storage access technology
CN107623722A (en) * 2017-08-21 2018-01-23 云宏信息科技股份有限公司 A kind of remote data caching method, electronic equipment and storage medium
EP3486785A1 (en) * 2017-11-20 2019-05-22 Samsung Electronics Co., Ltd. Systems and methods for efficient cacheline handling based on predictions
CN109815165A (en) * 2017-11-20 2019-05-28 三星电子株式会社 System and method for storing and processing efficient compressed cache lines
CN110321299A (en) * 2018-03-29 2019-10-11 英特尔公司 For detecting repeated data access and automatically loading data into system, method and apparatus in local cache

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PAT CONWAY et al.: "Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor", IEEE *
方君 et al.: "Research on distributed I/O caching mechanism in network storage", Journal of Huazhong University of Science and Technology (Natural Science Edition) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112468416A (en) * 2020-10-23 2021-03-09 曙光网络科技有限公司 Network flow mirroring method and device, computer equipment and storage medium
CN112468416B (en) * 2020-10-23 2022-08-30 曙光网络科技有限公司 Network flow mirroring method and device, computer equipment and storage medium
CN112416259A (en) * 2020-12-04 2021-02-26 海光信息技术股份有限公司 Data access method and data access device
CN112416259B (en) * 2020-12-04 2022-09-13 海光信息技术股份有限公司 Data access method and data access device
CN112612727A (en) * 2020-12-08 2021-04-06 海光信息技术股份有限公司 Cache line replacement method and device and electronic equipment
WO2022155820A1 (en) * 2021-01-20 2022-07-28 Alibaba Group Holding Limited Core-aware caching systems and methods for multicore processors
CN115119520A (en) * 2021-01-20 2022-09-27 阿里巴巴集团控股有限公司 Core-aware cache system and method for multi-core processors
CN113722110A (en) * 2021-11-02 2021-11-30 阿里云计算有限公司 Computer system, memory access method and device
CN113722110B (en) * 2021-11-02 2022-04-15 阿里云计算有限公司 Computer system, memory access method and device
CN116955270A (en) * 2023-09-21 2023-10-27 北京数渡信息科技有限公司 Method for realizing passive data caching of multi-die interconnection system
CN116955270B (en) * 2023-09-21 2023-12-05 北京数渡信息科技有限公司 Method for realizing passive data caching of multi-die interconnection system

Also Published As

Publication number Publication date
CN111143244B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
CN111143244B (en) Memory access method of computer device and computer device
KR101497002B1 (en) Snoop filtering mechanism
US7814286B2 (en) Method and apparatus for filtering memory write snoop activity in a distributed shared memory computer
KR100970229B1 (en) Computer system with a processor cache for storing remote cache presence information
CN101354682B (en) Apparatus and method for settling access catalog conflict of multi-processor
EP1190326A1 (en) System and method for increasing the snoop bandwidth to cache tags in a cache memory subsystem
JP2000298659A (en) Complete and concise remote(ccr) directory
US20080109624A1 (en) Multiprocessor system with private memory sections
US20020078305A1 (en) Method and apparatus for invalidating a cache line without data return in a multi-node architecture
US7149852B2 (en) System and method for blocking data responses
US7174430B1 (en) Bandwidth reduction technique using cache-to-cache transfer prediction in a snooping-based cache-coherent cluster of multiprocessing nodes
US8732410B2 (en) Method and apparatus for accelerated shared data migration
JP2000067024A (en) Partitioned sparse directory for distributed shared memory multiprocessor systems
US11599467B2 (en) Cache for storing coherent and non-coherent data
JP6343722B2 (en) Method and device for accessing a data visitor directory in a multi-core system
US11321233B2 (en) Multi-chip system and cache processing method
US6757793B1 (en) Reducing probe traffic in multiprocessor systems using a victim record table
US11954033B1 (en) Page rinsing scheme to keep a directory page in an exclusive state in a single complex
US11755483B1 (en) System and methods for reducing global coherence unit snoop filter lookup via local memories
JP7617346B2 (en) Coherent block read execution
US6636948B2 (en) Method and system for a processor to gain assured ownership of an up-to-date copy of data
CN112612726B (en) Data storage method, device, processing chip and server based on cache coherence
JPH09305489A (en) Information processing system and control method therefor
CN118964231A (en) Method and device for maintaining cache consistency, filter, electronic device and medium
US20050289302A1 (en) Multiple processor cache intervention associated with a shared memory unit

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: Industrial incubation-3-8, North 2-204, No. 18, Haitai West Road, Tianjin Huayuan Industrial Zone, Binhai New Area, Tianjin

Applicant after: Haiguang Information Technology Co.,Ltd.

Address before: Industrial incubation-3-8, North 2-204, No. 18, Haitai West Road, Tianjin Huayuan Industrial Zone, Binhai New Area, Tianjin

Applicant before: HAIGUANG INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant
GR01 Patent grant