
CN111143244A - Memory access method of computer device and computer device - Google Patents

Memory access method of computer device and computer device

Info

Publication number
CN111143244A
Authority
CN
China
Prior art keywords
cache
node
cache line
rcls
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911394845.2A
Other languages
Chinese (zh)
Other versions
CN111143244B (en)
Inventor
蔡云龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hygon Information Technology Co Ltd
Original Assignee
Hygon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hygon Information Technology Co Ltd
Priority to CN201911394845.2A
Publication of CN111143244A
Application granted
Publication of CN111143244B
Legal status: Active (Current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893 - Caches characterised by their organisation or structure
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 - Partitioning or combining of resources
    • G06F9/5072 - Grid computing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present disclosure provides a memory access method for a computer device comprising a plurality of nodes connected via an interconnection bus, each node comprising an integrated processor core and cache, and a memory. The method comprises: a node locally storing cache lines from the caches of other nodes to form a remote cache local image; and the processor core of the node accessing a cache line from the remote cache local image. Because each node mirrors the cached data of other nodes into local storage, the number of cross-node memory accesses is reduced, and computer performance in large-memory application scenarios is improved.

Description

Memory access method of computer device and computer device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a memory access method for a computer device and a computer device.
Background
Existing servers based on the NUMA (non-uniform memory access) architecture can combine dozens or even hundreds of CPUs in one server and have good scalability. In a NUMA server, multiple nodes are connected by an interconnection network; each node has its own CPUs, caches, memory, and I/O devices, and all memory is shared across the server. In this arrangement, a CPU accesses local memory within its node faster and with lower latency than remote memory outside the node. However, in some applications that use large amounts of memory (such as database lookup), the CPU of each node inevitably needs to access remote memory frequently, which greatly reduces server performance.
Disclosure of Invention
In view of this, a memory access method for a computer device and a computer device are provided. The computer device has a plurality of nodes, and a node mirrors the cached data of other nodes into its local storage, so that the number of cross-node memory accesses is reduced and computer performance in large-memory application scenarios is improved.
According to a first aspect of the present disclosure, there is provided a memory access method for a computer device comprising a plurality of nodes connected via an interconnection bus, each node comprising an integrated processor core and cache, and a memory, the method comprising: a node locally storing cache lines from the caches of other nodes to form a remote cache local image (RCLS); and a processor core of the node accessing the cache line from the RCLS.
In one possible embodiment, the node may include a plurality of processor cores, and the cache may be an L3 cache shared by the plurality of processor cores.
In one possible embodiment, the RCLS is stored in the node's memory or L4 cache.
In one possible embodiment, the cache line of the RCLS may be updated periodically, or may be updated when the cache line in the other node's cache is infrequently read and written.
In one possible embodiment, the cache line may include write history information that records the number of times the cache line has been written in the L3 cache; when the cache line is transferred to the RCLS, the write history information may be cleared or modified so that it is consistent in the L3 cache and the RCLS.
In one possible embodiment, the method may further include: a processor core of the node broadcasting a cache line request message on the interconnect bus; a processor core of the node receiving a response message of the other node, the response message indicating that the other node has the cache line in its cache, the response message including write history information of the cache line; comparing, by a processor core of the node, write history information of a cache line in the RCLS with write history information of the response message; based on the comparison, the processor core of the node obtains the cache line from the RCLS or obtains the cache line from the other node.
In one possible embodiment, when the comparison indicates that the cache line of the other node is the same as the cache line of the RCLS, the cache line may be obtained from the RCLS; when the comparison indicates that the cache line of the other node is different from the cache line of the RCLS, the cache line may be obtained from the other node and the RCLS may be updated.
In one possible embodiment, the method may further include, after the processor core of the node acquires the cache line, modifying a flag bit of the cache line in the cache of the other node to be shared when the access is a read operation; when the access is a write operation, modifying the flag bit of the cache line in the cache of the other node to be invalid.
According to a second aspect of the present disclosure, there is provided a computer device comprising a plurality of nodes connected via an interconnection bus, each node comprising an integrated processor core and cache, and a memory, wherein a node of the plurality of nodes has stored locally on it a remote cache local image RCLS for storing a cache line in the caches of other nodes and providing the cache line to the processor core of the node.
In one possible embodiment, the node may include a plurality of processor cores, and the cache is an L3 cache shared by the plurality of processor cores.
In one possible embodiment, the RCLS may be stored in the node's memory or L4 cache.
In one possible embodiment, the cache line of the RCLS may be updated periodically, or when the cache line in the other node's cache is read and written infrequently.
In one possible embodiment, the cache line may include write history information that records the number of times the cache line has been written in the L3 cache; when the cache line is transferred to the RCLS, the write history information may be cleared or modified so that it is consistent in the L3 cache and the RCLS.
In one possible embodiment, the processor core of the node may be configured to: broadcast a cache line request message on the interconnect bus; receive a response message of the other node, the response message indicating that the other node has the cache line in its cache and including write history information of the cache line; compare the write history information of the cache line in the RCLS with the write history information of the response message; and obtain the cache line from the RCLS or from the other node according to the comparison.
In one possible embodiment, the processor core of the node may be further configured to obtain the cache line from the RCLS when the comparison indicates that the cache line of the other node is the same as the cache line of the RCLS; and, when the comparison indicates that the cache line of the other node is different from the cache line of the RCLS, to obtain the cache line from the other node and update the RCLS.
In one possible embodiment, the processor core of the node is further configured to modify the flag bit of the cache line in the other node's cache to shared when the access is a read operation after the cache line is fetched; when the access is a write operation, modifying the flag bit of the cache line in the cache of the other node to be invalid.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein like reference numerals refer to like parts, unless otherwise specified.
FIG. 1 shows a schematic diagram of a computer device of a NUMA architecture in accordance with an embodiment of the present disclosure.
FIG. 2 shows a schematic flow diagram of a memory access method of a NUMA architecture according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of a computer device of another NUMA architecture in accordance with an embodiment of the present disclosure.
FIG. 4 shows a schematic flow diagram of another memory access method for a NUMA architecture according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. The singular forms "a", "an", and "the" as used herein are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, the terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense. For example, in the present disclosure, the term "cache" has the same meaning as "cache memory", and the term "memory" has the same meaning as "main memory" or "system memory".
Fig. 1 shows a schematic diagram of a computer device 100 of a NUMA architecture in accordance with an embodiment of the present disclosure. Computer device 100 may be, for example, a server device, a home computer, a high-performance computer (HPC), or the like. As shown, computer device 100 is based on a NUMA architecture and includes two NUMA nodes 101a and 101b, which may also be referred to as sockets. For ease of presentation and understanding, reference numerals containing the letter "a" shall be understood as components or parts associated with node 101a, and reference numerals containing the letter "b" shall be understood as components or parts associated with node 101b. Although fig. 1 shows only two nodes 101a and 101b, those skilled in the art will appreciate that the computer device 100 may include more nodes, and the number is not limited thereto.
Nodes 101a and 101b in fig. 1 are similarly arranged, and the specific composition of the computer device 100 is described below with reference to node 101a only.
As shown, node 101a may include a processor device 110a, memory 120a, I/O devices, and the like (not shown for simplicity). The processor device 110a may be a single-core or multi-core processor chip; for example, in the embodiment shown in fig. 1, the processor device 110a includes a plurality of processor cores 111a and 112a, each of which includes a core and an L1 cache and an L2 cache specific to that core. The core may contain a memory access component, a bus processing component, and the like, implemented in hardware or software.
In the illustrated embodiment, the remainder of the processor device 110a includes various interconnection circuits, interfaces, and components for communicatively connecting the functional blocks on the processor device. As shown, within processor device 110a, the processor cores 111a and 112a may communicate with each other via an intra-socket interconnect 113a. For simplicity, the interconnect is depicted as an intra-socket interconnect 113a; however, it should be understood that interconnect 113a may represent one or more interconnect structures, such as buses and single- or multi-channel serial point-to-point, ring, or mesh interconnect structures. Also included within processor device 110a is an L3 cache 114a shared by the processor cores; processor cores 111a and 112a may access the shared L3 cache 114a via intra-socket interconnect 113a. Processor device 110a may also include a memory controller 115a for controlling the transfer of read and write data between processor device 110a and memory 120a. Further, processor device 110a also includes a socket-to-socket (S-S) interconnect interface 116a; S-S interface 116a may receive data on intra-socket interconnect 113a and send and receive data to and from other sockets or nodes via inter-socket interconnects, and vice versa. In addition to these illustrated blocks, the processor device 110a may include many other functional blocks that are not shown for clarity, such as a PCIe interface, a network interface controller (NIC), and so forth.
In fig. 1, each processor device may be operatively coupled to a printed circuit board, referred to as a motherboard, via a socket, or otherwise coupled to the motherboard via a direct coupling technique (e.g., flip-chip bonding). The motherboard includes electrical wiring (e.g., traces and vias) to provide electrical connections corresponding to the physical structure of the various interconnects depicted in fig. 1. These interconnects include a socket-to-socket (S-S) interconnect 102 between S-S interconnect interfaces 116a and 116b, which is used to communicatively couple the processor nodes. In one embodiment, the S-S interconnect 102 may employ or support QPI or Infinity Fabric, among other protocols and wiring structures.
As shown in fig. 1, the L1 cache, L2 cache, L3 cache 114a, and memory 120a form a storage hierarchy ordered by distance from the core, from near to far. Generally, the closer to the core, the faster the core can access it and the lower the latency. For example, the L1 and L2 caches may be relatively high-cost static random access memory (SRAM), with a core's L1 cache access latency of about 1-4 clock cycles and L2 cache access latency of about 12 clock cycles. The L3 cache may be a relatively low-cost dynamic random access memory (DRAM) or enhanced DRAM (eDRAM), etc., with an access latency of approximately 36 clock cycles. The memory 120a may typically be DRAM or the like, with an access latency of about 60-100 clock cycles.
Due to data locality, including spatial and temporal locality, processors (such as the core of fig. 1) often need to read the same data multiple times within a short time. Since the operating frequency of memory is far lower than the speed of the processor, with the above storage hierarchy the processor sequentially queries L1-L3 for the needed data, so that the processor can read the needed data from one of the cache levels instead of from memory, which is called a hit; this greatly reduces data access time.
As shown in fig. 1, the L3 cache 114a includes several cache lines (as do L1 and L2), and when all of L1-L3 miss, the needed data can be loaded from the memory 120a. Specifically, a memory line (or memory block) containing the desired data will be loaded and held in the L3 cache in the form of a cache line for subsequent re-access. Similarly, when data is loaded from the L3 cache for processor access, the data is further promoted to the L2 cache, and so on.
Referring to table 1, a schematic structure of a cache line of an embodiment of the present disclosure is shown.
Tag | Data block | Flag bits
TABLE 1
A cache line may include a tag, which holds a portion of a memory address; a data block, which is the content of the cache line, i.e., the valid data portion; and flag bits, which generally indicate states such as Modified, Exclusive, Shared, and Invalid, used for controlling cache coherency.
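As a rough illustration only, the Table 1 layout could be modeled as below; the 64-byte block size and the MESI-style state field are assumptions made for this sketch, not values fixed by the disclosure.

```cpp
#include <array>
#include <cstdint>

// Illustrative model of the Table 1 cache line layout. The field widths and
// the MESI-style state encoding are assumptions for this sketch.
enum class CoherencyState : uint8_t { Modified, Exclusive, Shared, Invalid };

struct CacheLine {
    uint32_t tag = 0;                                 // tag: part of the memory address
    std::array<uint8_t, 64> data{};                   // data block: the valid data content
    CoherencyState flags = CoherencyState::Invalid;   // flag bits for cache coherency
};
```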
The bus processing component 117a of the core may determine whether a cache hit occurs according to the memory address of the required data. Table 2 shows an exemplary structure of an effective memory address according to an embodiment of the present disclosure.
Tag | Index | Intra-block offset
TABLE 2
The effective memory address may include a tag, an index, and an intra-block offset. The index indicates which cache set a data block is loaded into; the index is combined with the tag to search for a cache line in each cache level and determine whether the cache line is hit. The intra-block offset represents the offset, within the cache line, of the data required by the processor.
For example, suppose the L1 cache size is 8 KB, each cache line is 64 bytes, and 4 cache lines make up a cache set. Then there are 8 KB / 64 bytes = 128 cache lines in total, and grouping every 4 lines into a set gives 128 / 4 = 32 cache sets. Accordingly, the intra-block offset occupies 6 bits (2^6 = 64), the index occupies 5 bits (2^5 = 32), and, for a 32-bit address, the tag occupies the remaining 32 - 5 - 6 = 21 bits. The above briefly describes the mapping between memory addresses and cache lines and the mechanism for looking up a cache line in a cache from a memory address.
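Under the geometry just described (8 KB L1 cache, 64-byte lines, 4-way sets, 32-bit addresses), the field extraction can be sketched as follows; the helper names are illustrative only and are not part of the disclosure.

```cpp
#include <cstdint>
#include <cstdio>

// Field split for a 32-bit effective address with 64-byte lines (6 offset bits)
// and 32 cache sets (5 index bits); the remaining 21 bits form the tag.
struct AddressFields {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
};

AddressFields decode(uint32_t addr) {
    return AddressFields{
        addr >> (6 + 5),        // tag: 32 - 5 - 6 = 21 bits
        (addr >> 6) & 0x1F,     // index: selects one of the 32 cache sets
        addr & 0x3F             // intra-block offset within the 64-byte line
    };
}

int main() {
    AddressFields f = decode(0x12345678u);
    std::printf("tag=%u index=%u offset=%u\n", f.tag, f.index, f.offset);
}
```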
Returning to FIG. 1, there are multiple nodes under a NUMA architecture. In this case, for example, when processor core 112a does not find a desired cache line in any of its local caches L1-L3, its bus processing component 117a may broadcast a snoop signal on the interconnect bus to query whether the L3 caches of other nodes store the desired data. The purpose is, on the one hand, to avoid the extra latency incurred by memory accesses and, more importantly, to ensure that the most up-to-date data is read, since the same memory data may have been cached and modified in the L3 of one or more other nodes. As shown in fig. 1, the snoop signal is broadcast on S-S interconnect bus 102 via intra-socket interconnect bus 113a and S-S interface 116a, and the response information of other nodes is received. This operation may be implemented by bus processing component 117a. Bus processing component 117a is shown in FIG. 1 as a separate component within node 101a; alternatively, bus processing component 117a may be a software implementation or logic configured within the core of processor core 112a, and each processor core may have a bus processing component to perform similar operations.
Fig. 1 shows the sending and receiving of a snoop signal between node 101a and node 101b; however, when there are more nodes, the snoop signal is broadcast to every node on the interconnect bus 102. Although snoop communication between bus processing components 117a and 117b is shown as a separate line in fig. 1, this is for clarity only; it should be understood that the communication is actually carried over S-S interconnect 102. For example, when node 101b receives the snoop signal and finds the requested cache line (valid) in its L3 cache, node 101b sends the most recent copy of the cache line to node 101a via interconnect bus 102. When no valid response is received after the snoop signal is issued, the requested data is accessed from memory. When data is requested from memory, the latency differs depending on whether the memory address is local or remote to the node, as described in more detail below.
Under a NUMA architecture, each processor (and processor core) is able to access the different memory resources distributed across the various nodes. In other words, all memory is shared by all processors and together forms a unified memory address space. A memory may be a local memory resource (e.g., a memory resource on the same node as a processor or core) or a remote memory resource (e.g., a memory resource on another node). Generally, for example, from the perspective of node 101a, memory 120a is a local memory resource and memory 120b is a remote memory resource. Since the local memory resources are operatively coupled to the processor of a given node, access to local memory resources is not the same as access to remote memory resources (i.e., access is non-uniform). For example, the latency for processor 110a to access local memory resource 120a is about 80 ns, while the latency for accessing remote memory resources may be 250 ns or more. Even if it is possible to obtain the data from the L3 cache of node 101b without accessing remote memory 120b, the data transfer across nodes may introduce a delay of about 100 ns. Therefore, a policy of prioritizing local memory resources needs to be adopted in actual memory allocation. Unfortunately, in some applications with large memory usage (e.g., database lookup), the processor of each node inevitably has to access remote memory frequently, which significantly reduces server performance.
FIG. 2 shows a schematic flow diagram of a method of memory access in a NUMA architecture in accordance with an embodiment of the present disclosure. This is explained with reference to fig. 1.
At step 210, the processor core issues a memory access request. For example, machine instructions obtained by processor core 112a include a read or write to a memory address, creating a data access need for the memory address.
Then, at step 220, the L1 and L2 caches associated with the processor core are searched in turn for the cache line containing the required data. Specifically, the cache line is looked up in L1 and L2 according to the memory address information. In one embodiment, each cache line may include 64 bytes of data; when the required data is contained in a cache line (and the corresponding cache line is "valid"), L1 or L2 hits. In this case, the corresponding cache line is promoted to a closer cache (e.g., from L2 to L1) for use by the processor core, and the flow ends. Otherwise, the flow proceeds to step 230.
At step 230, the local L3 cache is queried for the required data. For example, a lookup is made in L3 cache 114a to see whether the needed data is present (the corresponding cache line should be "valid"). If so, L3 hits, the corresponding cache line is promoted to a closer cache (e.g., from L3 to L2 to L1) for use by the processor core, and the flow ends. Otherwise, the flow proceeds to step 240.
In step 240, when none of L1-L3 hit, a snoop signal is broadcast on the interconnect bus. For example, processor core 112a may use bus processing component 117a to broadcast a snoop signal on S-S interconnect bus 102 via interconnect bus 113a and interface 116a, so that the bus processing components of other nodes can snoop it.
Next, at step 250, the other nodes look up the required data in their L3 caches. If it is found (the corresponding cache line should be "valid"), it is an L3 hit. In this case, a copy of the cache line of the corresponding node is sent onto the S-S bus to be received by the node requesting the data. For example, a processor core in node 101b may send its cache line in L3 cache 114b to node 101a via S-S bus 102. Accordingly, the cache line is stored in the L3 cache 114a of node 101a and is promoted to a closer cache (e.g., from L3 to L2 to L1) for use by the processor core, and the flow ends. If there is no hit at any other node, the flow proceeds to step 260.
In step 260, when none of the caches hit, the memory access request is submitted to the memory controller at the node holding the required data, the data is accessed from memory, and the flow ends. If it is a local memory access, the needed data is cached in the local L3; if it is a remote memory access, the required data is cached in the remote L3 and then transmitted to the L3 of the requesting node via the S-S bus.
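Summarizing the above steps, the decision sequence of FIG. 2 might be sketched as follows; every function here is a hypothetical placeholder standing in for hardware behavior, not part of the disclosed design.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical stand-ins for the lookups of FIG. 2; each returns whether the
// requested cache line was found (and is valid) at that level.
bool hitL1L2(uint64_t)     { return false; }   // steps 210-220: L1/L2 lookup
bool hitLocalL3(uint64_t)  { return false; }   // step 230: local L3 lookup
bool hitRemoteL3(uint64_t) { return false; }   // steps 240-250: snoop + remote L3

// Returns a label describing where the data is ultimately served from.
std::string serveRequest(uint64_t addr) {
    if (hitL1L2(addr))     return "L1/L2 hit";
    if (hitLocalL3(addr))  return "local L3 hit";
    if (hitRemoteL3(addr)) return "remote L3 hit via S-S bus";
    return "memory access (step 260), local or remote";
}

int main() {
    std::printf("%s\n", serveRequest(0x1000).c_str());
}
```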
As described above, when remote memory is accessed, a large delay may be incurred by the memory access and the cross-node data transmission, resulting in performance degradation; this is especially noticeable in large-memory application scenarios. In order to alleviate or mitigate this problem, embodiments of the present disclosure provide a remote cache local image (RCLS) technique, which improves data access efficiency by storing an image of the remote caches in a larger-capacity local backing store. In one embodiment, the L3 cache image of a remote node may be stored in a local L4 cache, or last-level cache (LLC). In another possible embodiment, the L3 cache image of the remote node may be stored in local memory if L3 is already the last-level cache (LLC).
Fig. 3 shows a schematic diagram of a computer device 300 of another NUMA architecture in accordance with an embodiment of the present disclosure. In contrast to fig. 1, in the computer device 300 of fig. 3 the cache lines in the L3 caches of remote nodes are additionally stored in each node's backing store below the L3 cache, as indicated by reference numerals 305 and 330a. It should be noted that although fig. 3 shows two nodes 301a and 301b, those skilled in the art will appreciate that computer device 300 may include many more nodes, and that each node with the remote cache local image (RCLS) enabled may store cache lines from any other node, as will be described in detail below.
According to embodiments of the present disclosure, computer device 300 may be configured to enable the remote cache local image (RCLS), in which case each node in computer device 300 may receive cache lines in the L3 caches of the other nodes via interconnect bus 302. In other words, in addition to being cached in each node's L3, the memory data is mirrored (further cached) in a lower level of storage at each node. Generally, the storage below L3 has capacity much larger than the L3 capacity, so each node has enough storage capacity to mirror the L3 caches of other nodes without impacting local use.
For example, a computer device of an exemplary NUMA architecture may include 64 nodes, each of which may have, for example, an L3 capacity of 16 MB and a local memory of 16 GB. With the RCLS enabled, even if all the cache lines of the remote L3 caches are mirrored into each node's local memory, only 64 × 16 MB / 16 GB = 6.4% of the memory capacity is occupied. In actual operation, however, the RCLS function need not be enabled on all nodes but only on a portion of them, and each RCLS may also mirror only the remote cache lines that are needed locally rather than all of them. In fact, the scale of RCLS enablement (the number of NUMA nodes on which it is enabled) can be flexibly adjusted depending on the cache capacity of each node and the capacity of the next-level storage, so as to achieve better performance of the computer device.
Referring to FIG. 3, arrow 305 indicates mirroring the L3 cache of node 301b locally to node 301a, forming a remote cache local image RCLS 330a. This RCLS 330a may be stored in any storage below the local L3 cache; for example, when node 301a has an L4 cache 318a, RCLS 330a is stored in the L4 cache, and when node 301a does not have an L4 cache, RCLS 330a may be stored in an allocated memory area. Similarly, as shown, the memory 320b and L4 cache 318b of node 301b store the RCLS of the L3 caches of the other nodes. It should be noted that although FIG. 3 only shows cache lines of the L3 cache of node 301b being stored at node 301a, it is understood that cache lines of the L3 cache from any other node may be stored on each NUMA node that has RCLS enabled.
According to the embodiment of the disclosure, the cache lines of the L3 cache of each node can be transmitted to other nodes in a synchronous or asynchronous manner; that is, the RCLS can be updated in real time or with a delay. For example, in one embodiment, when memory controller 316b of node 301b reads data from memory 320b and places it into L3 cache 314b, the data will be synchronized onto the interconnect bus and cached, via the S-S bus, into L4 cache 318a or memory 320a, thereby updating the RCLS at node 301a. In one possible embodiment, the updating of RCLS 330a at node 301a may be initiated proactively by memory controller 316b of node 301b, thereby enabling real-time updating of the RCLS. Alternatively, the updating of RCLS 330a at node 301a may be delay-triggered, for example when the processor core of node 301b is not reading or writing L3, or on a timed trigger. Alternatively, the RCLS update may also be triggered by snooping L3 cache update messages on the interconnect bus; for example, bus processing component 317a of node 301a may snoop all L3 cache updates on the interconnect bus, with node 301a itself deciding whether to update its local RCLS from the L3 cache of node 301b (and other nodes).
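One way to picture the delayed-update variant (idle or timed trigger) is the small policy sketch below; the threshold value and all names are assumptions chosen for illustration, not a prescribed implementation.

```cpp
#include <chrono>

// Minimal sketch of a delayed ("lazy") RCLS update trigger: a node packs and
// broadcasts its L3 cache lines when the L3 has been idle for some threshold,
// or on a periodic timer. Threshold and names are assumptions for illustration.
class RclsSyncPolicy {
    using Clock = std::chrono::steady_clock;

public:
    explicit RclsSyncPolicy(std::chrono::milliseconds idle_threshold)
        : idle_threshold_(idle_threshold), last_access_(Clock::now()) {}

    // Call on every read or write of the local L3 cache.
    void onL3Access() { last_access_ = Clock::now(); }

    // True when the packed L3 cache lines should be sent onto the interconnect
    // bus to update the RCLS copies held at other nodes.
    bool shouldSync() const {
        return Clock::now() - last_access_ >= idle_threshold_;
    }

private:
    std::chrono::milliseconds idle_threshold_;
    Clock::time_point last_access_;
};
```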
As described above, in addition to being cached in each node's L3, data in memory is also cached in the remote cache local image of each node, with the data being cached in the form of cache lines. In order to meet cache coherency requirements and reduce cross-node data transmission, embodiments of the present disclosure also provide a cache line structure that extends the structure shown in Table 1.
Tag | Data block | Flag bits | WH
TABLE 3
Referring to Table 3, according to an embodiment of the present disclosure, a write history flag (WH) is added to the flag bits of a cache line to record the number of times the data has been effectively altered by writes. In one embodiment, the WH flag consists of a plurality of bits; it will be understood that the more WH bits there are, the larger the number of cache line writes that can be recorded.
In a conventional cache coherency protocol, whenever a cache line is written, the write information needs to be broadcast on the system bus so that other nodes can sense it, modify the flag bits of their own copies of the cache line, and update the data block. In this case, the data traffic required to maintain cache coherency in a NUMA architecture takes up at least about 15-20% of the interconnect bus. In contrast, embodiments of the present disclosure introduce the write history flag WH into the flag bits and combine it with the remote cache local image RCLS, so that the data traffic used to maintain cache coherency can be greatly reduced and bus bandwidth resources are saved.
According to the embodiment of the present disclosure, when there are few read and write operations on the cache lines, for example when the processor core has not read or written the L3 cache for about 10 milliseconds (a relatively long time compared with the speed of the CPU), the cache lines of the L3 cache may be packed and sent onto the interconnection bus so as to be synchronized to the RCLS of other nodes; that is, the RCLS may be updated with a delay, which ensures that the processor is not blocked when data reads and writes are frequent. On the other hand, because the WH flag bits recording the number of writes are used, the processor core does not need to broadcast write-cache information on the interconnection bus at every cache write; in other words, the processor core may resynchronize to the RCLS of other nodes after accumulating modifications of multiple cached data blocks.
According to an embodiment of the present disclosure, whenever a processor or processor core writes to the L3 cache, the write history bits WH of the corresponding cache line are incremented by 1; for example, in the case where WH has 4 bits (the present invention is not limited thereto), the cache line should be synchronized to the RCLS of the respective nodes when WH reaches 15 (0xF), before WH overflows. Also, as described above, when there are few cache line read and write operations, the cache lines are packed and transmitted on the interconnect bus for synchronization to the other RCLS. After the L3 cache is synchronized to the RCLS of a node, the WH bits of all cache lines in L3 and in the RCLS are cleared or modified so that they are consistent between the L3 cache and the RCLS. In other words, the WH bits may be used to indicate the coherency of the L3 cache with the RCLS.
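A minimal sketch of the WH counter behavior just described, assuming the 4-bit example width (the disclosure notes the width is not limited to 4 bits); all names are illustrative.

```cpp
#include <cstdint>

// Write-history (WH) field behavior: increment on every write, request a
// synchronization to the RCLS before the 4-bit counter overflows, and clear
// after synchronization so that L3 and RCLS agree.
struct WriteHistory {
    uint8_t wh = 0;   // 4-bit write-history counter kept with the flag bits

    // Called on every write to the cache line in the local L3 cache.
    // Returns true when the line must be synchronized to the RCLS of other
    // nodes (here when the counter reaches 15, i.e. 0xF).
    bool recordWrite() {
        wh = (wh + 1) & 0xF;
        return wh == 0xF;
    }

    // After the L3 cache is synchronized to the remote RCLS copies, the WH
    // bits are cleared (or set to an agreed value) so that L3 and RCLS match.
    void clearAfterSync() { wh = 0; }
};
```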
A memory access method of a computer device according to an embodiment of the present disclosure will be described below with reference to fig. 4.
FIG. 4 shows a schematic flow chart diagram of another memory access method 400 for a NUMA architecture in accordance with an embodiment of the present disclosure. Here, assume that the method is performed in a computer device having 4 NUMA nodes, labeled NODE_0, NODE_1, NODE_2, and NODE_3. For example, each node includes 32 processor cores, so that the cores are denoted CPU 0 through CPU 127, and each node has a respective L3 cache, denoted L3_0, L3_1, L3_2, and L3_3. A node may have an L4 cache, in which case the remote cache local image (RCLS) described above is held in the L4 cache; if a node does not have an L4 cache, the remote cache local image is saved in the memory of the node. The RCLS of the nodes are denoted RCLS_0, RCLS_1, RCLS_2, and RCLS_3, respectively.
Taking CPU 0 as an example, the method 400 begins when a memory read/write request generated by CPU 0 misses in all of its local caches L1-L3.
At step 410, the bus processing component of CPU 0 broadcasts a data request message over the interconnect bus, requesting the cache line that includes the desired data. This data request message can be sensed by the bus processing components of the other processor cores, which then look up the data needed by CPU 0 in their L3 caches.
Then, at step 420, if there is a remote cache hit, for example the L3_3 cache of CPU 127 (at NODE_3) has the latest version of the data needed by CPU 0, CPU 127 may synchronize the corresponding cache line from its L1 and L2 caches to the L3_3 cache and write-lock the corresponding cache line in L3_3, so that the cache line is not allowed to be overwritten until this read/write operation of CPU 0 completes; the method then proceeds to step 430. If no remote cache hits, the method flow ends, and CPU 0 obtains the data from memory.
At step 430, a response message issued by CPU 127 on the interconnect bus is received, which may include the write history information WH of the cache line containing the required data. In one embodiment, to improve bus utilization efficiency, the response message also includes a portion of the data block contents of the cache line. For example, the response message may be 16 bytes; WH occupies only a small part (several bits) of it, and the remainder can be filled with part of the cache line's data, so as to reduce the amount of data to be transmitted subsequently.
Then, at step 440, CPU 0 compares the WH in the response message with the WH in RCLS_0. If they match, at step 450 CPU 0 fetches the cache line from the local RCLS_0 and loads it into its local L3_0, and at step 470 sends a data transfer complete message to NODE_3, whereupon CPU 127 modifies the flag bit of the corresponding cache line in L3_3.
If the WH in RCLS_0 and the WH in the response message from CPU 127 are not consistent, i.e., the WH provided by L3_3 at the node of CPU 127 is newer than the WH in RCLS_0, then at step 460 CPU 0 fetches the cache line from the node of CPU 127, stores it in L3_0, promotes it to L2, L1, and the registers of CPU 0, and synchronizes the cache line in RCLS_0 to the cache line provided by CPU 127. At this point, the WH is cleared or modified to be consistent. In addition, upon completion of the transfer from L3_3 to L3_0, it may also be determined whether a write to memory is required, for example depending on whether a write-back or write-through mode is used.
In step 470, CPU 0 sends a data transfer complete message to CPU 127. If the access request of CPU 0 is a read operation, CPU 127, upon receiving the transfer complete message, may modify the flag bit of the corresponding cache line of L3_3 from Exclusive to Shared, for example, and may release the write lock on the cache line in L3_3. If the access request of CPU 0 is a write operation, CPU 127 may modify the flag bit to Invalid; in fact, in the case of a write operation, according to the cache coherency protocol an invalidate signal is placed on the bus to be heard by the other nodes, so that the flag bits of the cache line copies in the other RCLS are all set invalid except for the data in L3_0, and therefore no separate update of the RCLS at the other nodes is required.
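The decisions in steps 440-470 can be summarized with the sketch below; the types and function names are hypothetical placeholders, not part of the disclosed design.

```cpp
#include <cstdint>

// Steps 440-470 of FIG. 4: choose the fetch source based on the WH comparison,
// and determine the remote node's flag after the transfer completes.
enum class FetchSource { LocalRcls, RemoteL3 };
enum class RemoteFlag { Shared, Invalid };

struct SnoopResponse {
    uint8_t wh;   // write-history value reported by the remote L3 (here L3_3)
    // ...a response may also carry part of the cache line's data block
};

// Steps 440-460: fetch from the local RCLS only when the WH values agree;
// otherwise fetch from the remote node and update the local RCLS copy.
FetchSource chooseSource(uint8_t rcls_wh, const SnoopResponse& resp) {
    return (resp.wh == rcls_wh) ? FetchSource::LocalRcls : FetchSource::RemoteL3;
}

// Step 470: after the transfer completes, the remote node sets the flag of its
// own copy to Shared for a read access, or Invalid for a write access.
RemoteFlag remoteFlagAfterAccess(bool is_write) {
    return is_write ? RemoteFlag::Invalid : RemoteFlag::Shared;
}
```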
Although example embodiments have been described, it will be apparent to those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the inventive concept. Accordingly, it should be understood that the above-described exemplary embodiments are not limiting, but illustrative.

Claims (16)

1. A memory access method for a computer device, the computer device comprising a plurality of nodes connected via an interconnection bus, each node comprising an integrated processor core and cache, and a memory, the method comprising: a node of the plurality of nodes locally storing cache lines from the caches of other nodes, forming a remote cache local image (RCLS); and the processor core of the node accessing the cache line from the RCLS.

2. The method of claim 1, wherein the node comprises a plurality of processor cores, and the cache is an L3 cache shared by the plurality of processor cores.

3. The method of claim 1, wherein the RCLS is stored in the memory or an L4 cache of the node.

4. The method of claim 1, wherein the cache lines of the RCLS are updated periodically, or are updated when the cache lines in the caches of the other nodes are infrequently read and written.

5. The method of claim 1, wherein the cache line includes write history information, the write history information records the number of times the cache line has been written in the L3 cache, and the write history information is cleared or modified to be consistent in the L3 cache and the RCLS when the cache line is transmitted to the RCLS.

6. The method of claim 1, further comprising: the processor core of the node broadcasting a cache line request message on the interconnection bus; the processor core of the node receiving a response message from the other node, the response message indicating that the other node has the cache line in its cache and including write history information of the cache line; the processor core of the node comparing the write history information of the cache line in the RCLS with the write history information in the response message; and, according to the comparison, the processor core of the node obtaining the cache line from the RCLS or from the other node.

7. The method of claim 6, wherein: when the comparison indicates that the cache line of the other node is the same as the cache line of the RCLS, the cache line is obtained from the RCLS; and when the comparison indicates that the cache line of the other node is different from the cache line of the RCLS, the cache line is obtained from the other node and the RCLS is updated.

8. The method of claim 1, further comprising, after the processor core of the node obtains the cache line: when the access is a read operation, modifying the flag bit of the cache line in the cache of the other node to shared; and when the access is a write operation, modifying the flag bit of the cache line in the cache of the other node to invalid.

9. A computer device comprising a plurality of nodes connected via an interconnection bus, each node comprising an integrated processor core and cache, and a memory, wherein a node of the plurality of nodes locally stores a remote cache local image (RCLS), the RCLS being used for storing cache lines from the caches of other nodes and for providing the cache lines to the processor core of the node.

10. The computer device of claim 9, wherein the node comprises a plurality of processor cores, and the cache is an L3 cache shared by the plurality of processor cores.

11. The computer device of claim 9, wherein the RCLS is stored in the memory or an L4 cache of the node.

12. The computer device of claim 9, wherein the cache lines of the RCLS are updated periodically, or are updated when the cache lines in the caches of the other nodes are infrequently read and written.

13. The computer device of claim 9, wherein the cache line includes write history information, the write history information records the number of times the cache line has been written in the L3 cache, and the write history information is cleared or modified to be consistent in the L3 cache and the RCLS when the cache line is transmitted to the RCLS.

14. The computer device of claim 9, wherein the processor core of the node is configured to: broadcast a cache line request message on the interconnection bus; receive a response message from the other node, the response message indicating that the other node has the cache line in its cache and including write history information of the cache line; compare the write history information of the cache line in the RCLS with the write history information in the response message; and, according to the comparison, obtain the cache line from the RCLS or from the other node.

15. The computer device of claim 14, wherein the processor core of the node is further configured to: obtain the cache line from the RCLS when the comparison indicates that the cache line of the other node is the same as the cache line of the RCLS; and obtain the cache line from the other node and update the RCLS when the comparison indicates that the cache line of the other node is different from the cache line of the RCLS.

16. The computer device of claim 9, wherein the processor core of the node is further configured to, after obtaining the cache line: when the access is a read operation, modify the flag bit of the cache line in the cache of the other node to shared; and when the access is a write operation, modify the flag bit of the cache line in the cache of the other node to invalid.
CN201911394845.2A 2019-12-30 2019-12-30 Memory access method of computer device and computer device Active CN111143244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911394845.2A CN111143244B (en) 2019-12-30 2019-12-30 Memory access method of computer device and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911394845.2A CN111143244B (en) 2019-12-30 2019-12-30 Memory access method of computer device and computer device

Publications (2)

Publication Number Publication Date
CN111143244A true CN111143244A (en) 2020-05-12
CN111143244B CN111143244B (en) 2022-11-15

Family

ID=70521927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911394845.2A Active CN111143244B (en) 2019-12-30 2019-12-30 Memory access method of computer device and computer device

Country Status (1)

Country Link
CN (1) CN111143244B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112416259A (en) * 2020-12-04 2021-02-26 海光信息技术股份有限公司 Data access method and data access device
CN112468416A (en) * 2020-10-23 2021-03-09 曙光网络科技有限公司 Network flow mirroring method and device, computer equipment and storage medium
CN112612727A (en) * 2020-12-08 2021-04-06 海光信息技术股份有限公司 Cache line replacement method and device and electronic equipment
CN113722110A (en) * 2021-11-02 2021-11-30 阿里云计算有限公司 Computer system, memory access method and device
WO2022155820A1 (en) * 2021-01-20 2022-07-28 Alibaba Group Holding Limited Core-aware caching systems and methods for multicore processors
CN116955270A (en) * 2023-09-21 2023-10-27 北京数渡信息科技有限公司 Method for realizing passive data caching of multi-die interconnection system

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030009639A1 (en) * 2001-06-21 2003-01-09 International Business Machines Corp. Non-uniform memory access (NUMA) data processing system that provides precise notification of remote deallocation of modified data
CN101477496A (en) * 2008-12-29 2009-07-08 北京航空航天大学 NUMA structure implementing method based on distributed internal memory virtualization
CN103744799A (en) * 2013-12-26 2014-04-23 华为技术有限公司 Memory data access method, device and system
US8918587B2 (en) * 2012-06-13 2014-12-23 International Business Machines Corporation Multilevel cache hierarchy for finding a cache line on a remote node
CN104838388A (en) * 2012-12-20 2015-08-12 英特尔公司 Versatile and reliable intelligent package
CN104917784A (en) * 2014-03-10 2015-09-16 华为技术有限公司 Data migration method and device, and computer system
CN106250350A (en) * 2016-07-28 2016-12-21 浪潮(北京)电子信息产业有限公司 A kind of caching of page read method based on NUMA architecture and system
CN106331148A (en) * 2016-09-14 2017-01-11 郑州云海信息技术有限公司 A cache management method and device for client data reading
CN107025130A (en) * 2016-01-29 2017-08-08 华为技术有限公司 Handle node, computer system and transactional conflict detection method
CN107623722A (en) * 2017-08-21 2018-01-23 云宏信息科技股份有限公司 A kind of remote data caching method, electronic equipment and storage medium
CN109313610A (en) * 2016-06-13 2019-02-05 超威半导体公司 Scaled set for cache replacement strategy competes
EP3486785A1 (en) * 2017-11-20 2019-05-22 Samsung Electronics Co., Ltd. Systems and methods for efficient cacheline handling based on predictions
CN109815165A (en) * 2017-11-20 2019-05-28 三星电子株式会社 System and method for storing and processing efficient compressed cache lines
CN109952565A (en) * 2016-11-16 2019-06-28 华为技术有限公司 Internal storage access technology
CN110321299A (en) * 2018-03-29 2019-10-11 英特尔公司 For detecting repeated data access and automatically loading data into system, method and apparatus in local cache

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030009639A1 (en) * 2001-06-21 2003-01-09 International Business Machines Corp. Non-uniform memory access (NUMA) data processing system that provides precise notification of remote deallocation of modified data
CN101477496A (en) * 2008-12-29 2009-07-08 北京航空航天大学 NUMA structure implementing method based on distributed internal memory virtualization
US8918587B2 (en) * 2012-06-13 2014-12-23 International Business Machines Corporation Multilevel cache hierarchy for finding a cache line on a remote node
CN104838388A (en) * 2012-12-20 2015-08-12 英特尔公司 Versatile and reliable intelligent package
CN103744799A (en) * 2013-12-26 2014-04-23 华为技术有限公司 Memory data access method, device and system
CN104917784A (en) * 2014-03-10 2015-09-16 华为技术有限公司 Data migration method and device, and computer system
CN107025130A (en) * 2016-01-29 2017-08-08 华为技术有限公司 Handle node, computer system and transactional conflict detection method
CN109313610A (en) * 2016-06-13 2019-02-05 超威半导体公司 Scaled set for cache replacement strategy competes
CN106250350A (en) * 2016-07-28 2016-12-21 浪潮(北京)电子信息产业有限公司 A kind of caching of page read method based on NUMA architecture and system
CN106331148A (en) * 2016-09-14 2017-01-11 郑州云海信息技术有限公司 A cache management method and device for client data reading
CN109952565A (en) * 2016-11-16 2019-06-28 华为技术有限公司 Internal storage access technology
CN107623722A (en) * 2017-08-21 2018-01-23 云宏信息科技股份有限公司 A kind of remote data caching method, electronic equipment and storage medium
EP3486785A1 (en) * 2017-11-20 2019-05-22 Samsung Electronics Co., Ltd. Systems and methods for efficient cacheline handling based on predictions
CN109815165A (en) * 2017-11-20 2019-05-28 三星电子株式会社 System and method for storing and processing efficient compressed cache lines
CN110321299A (en) * 2018-03-29 2019-10-11 英特尔公司 For detecting repeated data access and automatically loading data into system, method and apparatus in local cache

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PAT CONWAY et al.: "Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor", IEEE *
方君 et al.: "Research on distributed I/O caching mechanism in network storage", Journal of Huazhong University of Science and Technology (Natural Science Edition) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112468416A (en) * 2020-10-23 2021-03-09 曙光网络科技有限公司 Network flow mirroring method and device, computer equipment and storage medium
CN112468416B (en) * 2020-10-23 2022-08-30 曙光网络科技有限公司 Network flow mirroring method and device, computer equipment and storage medium
CN112416259A (en) * 2020-12-04 2021-02-26 海光信息技术股份有限公司 Data access method and data access device
CN112416259B (en) * 2020-12-04 2022-09-13 海光信息技术股份有限公司 Data access method and data access device
CN112612727A (en) * 2020-12-08 2021-04-06 海光信息技术股份有限公司 Cache line replacement method and device and electronic equipment
WO2022155820A1 (en) * 2021-01-20 2022-07-28 Alibaba Group Holding Limited Core-aware caching systems and methods for multicore processors
CN115119520A (en) * 2021-01-20 2022-09-27 阿里巴巴集团控股有限公司 Core-aware cache system and method for multi-core processors
CN113722110A (en) * 2021-11-02 2021-11-30 阿里云计算有限公司 Computer system, memory access method and device
CN113722110B (en) * 2021-11-02 2022-04-15 阿里云计算有限公司 Computer system, memory access method and device
CN116955270A (en) * 2023-09-21 2023-10-27 北京数渡信息科技有限公司 Method for realizing passive data caching of multi-die interconnection system
CN116955270B (en) * 2023-09-21 2023-12-05 北京数渡信息科技有限公司 Method for realizing passive data caching of multi-die interconnection system

Also Published As

Publication number Publication date
CN111143244B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
CN111143244B (en) Memory access method of computer device and computer device
KR101497002B1 (en) Snoop filtering mechanism
US7814286B2 (en) Method and apparatus for filtering memory write snoop activity in a distributed shared memory computer
KR100970229B1 (en) Computer system with a processor cache for storing remote cache presence information
CN101354682B (en) Apparatus and method for settling access catalog conflict of multi-processor
EP1190326A1 (en) System and method for increasing the snoop bandwidth to cache tags in a cache memory subsystem
JP2000298659A (en) Complete and concise remote(ccr) directory
US20080109624A1 (en) Multiprocessor system with private memory sections
US20020078305A1 (en) Method and apparatus for invalidating a cache line without data return in a multi-node architecture
US7149852B2 (en) System and method for blocking data responses
US7174430B1 (en) Bandwidth reduction technique using cache-to-cache transfer prediction in a snooping-based cache-coherent cluster of multiprocessing nodes
US8732410B2 (en) Method and apparatus for accelerated shared data migration
JP2000067024A (en) Partitioned sparse directory for distributed shared memory multiprocessor systems
US11599467B2 (en) Cache for storing coherent and non-coherent data
JP6343722B2 (en) Method and device for accessing a data visitor directory in a multi-core system
US11321233B2 (en) Multi-chip system and cache processing method
US6757793B1 (en) Reducing probe traffic in multiprocessor systems using a victim record table
US11954033B1 (en) Page rinsing scheme to keep a directory page in an exclusive state in a single complex
US11755483B1 (en) System and methods for reducing global coherence unit snoop filter lookup via local memories
JP7617346B2 (en) Coherent block read execution
US6636948B2 (en) Method and system for a processor to gain assured ownership of an up-to-date copy of data
CN112612726B (en) Data storage method, device, processing chip and server based on cache coherence
JPH09305489A (en) Information processing system and control method therefor
CN118964231A (en) Method and device for maintaining cache consistency, filter, electronic device and medium
US20050289302A1 (en) Multiple processor cache intervention associated with a shared memory unit

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: Industrial incubation-3-8, North 2-204, No. 18, Haitai West Road, Tianjin Huayuan Industrial Zone, Binhai New Area, Tianjin

Applicant after: Haiguang Information Technology Co.,Ltd.

Address before: Industrial incubation-3-8, North 2-204, No. 18, Haitai West Road, Tianjin Huayuan Industrial Zone, Binhai New Area, Tianjin

Applicant before: HAIGUANG INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant
GR01 Patent grant