
WO2007113757A2 - System and method for supporting a hot-word-first request policy for a multi-hierarchical memory system - Google Patents

System and method for supporting a hot-word-first request policy for a multi-hierarchical memory system

Info

Publication number
WO2007113757A2
Authority
WO
WIPO (PCT)
Prior art keywords
cache
level
data
memory
spatial locality
Prior art date
Application number
PCT/IB2007/051135
Other languages
French (fr)
Other versions
WO2007113757A3 (en)
Inventor
Miland Manohar Kulkarni
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Publication of WO2007113757A2 publication Critical patent/WO2007113757A2/en
Publication of WO2007113757A3 publication Critical patent/WO2007113757A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • G06F12/0897Caches characterised by their organisation or structure with two or more cache hierarchy levels
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877Cache access modes
    • G06F12/0879Burst mode

Definitions

  • FIG. 2 is a block diagram of a system 10 according to one embodiment of the invention.
  • the system 10 includes three processors 12A-12C which may be designed to any instruction set architecture, and may execute programs written to that instruction set architecture.
  • Exemplary instruction set architectures may include the MIPS instruction set architecture (including the MIPS-3D and MIPS MDMX application specific extensions), the IA-32 or IA-64 instruction set architectures developed by Intel Corp., the PowerPC instruction set architecture, the Alpha instruction set architecture, the ARM instruction set architecture, or any other instruction set architecture.
  • Each processor 12A-12C includes an on-chip level one (L1) cache 16A-16C.
  • Each processor 12A-12C also shares a single (on/off)-chip second level (L2) cache 18 via a common bus structure 17.
  • the shared second level (L2) cache 18 is shown directly coupled to a low level memory 20.
  • the level two (L2) cache 18 comprises a cache memory 22 coupled to a configurable register 26.
  • the level one (L1) on-chip caches 16A-16C and the shared level two (L2) cache 18 may be configured to cache blocks of data from the memory system 20 in response to data requests from respective processors 12A-12C.
  • data is being used in its generic sense to refer to either data or instructions for execution.
  • the level one cache memories 16A-16C comprise a plurality of entries, where each entry is capable of storing a cache block. Additionally, each entry may store a state of the cache block, as well as any other information (e.g. an address tag for the cache block, replacement algorithm data such as least recently used state, etc.).
  • the configurable register 26 in the shared level two (L2) cache 18 is configured to implement the novel cache refill/fetch policy of the invention which dynamically determines spatial locality information associated with a currently running task/transaction and based on the spatial locality information, issues a customized data request to the low level memory 20 to reduce access time to the low level memory 20 and thereby improve the performance of a computer processor.
  • the pre-fetch controller issues a customized data request based on a value contained in the configurable register 26.
  • the customized data request may be a type-one request or a type-two request, to be described. It is noted, however, that both the type-one and type-two requests conform to a hot-word-first fetch policy that is being implemented by the respective processors 12A-12C.
  • the novel cache refill/fetch policy of the invention may not be implemented in all cases. Specifically, it is not implemented in the case where the data request is found in the processor's 12A-12C on-chip level one (L1) cache 16A-16C. In this case, a cache "hit" is said to occur and the data is read or written to the L1 cache without the need for further action.
  • otherwise, the data request is passed from the L1 cache 16A-16C to the L2 cache 18 to determine if the data can be found there.
  • the novel cache refill/fetch policy of the invention may be implemented as follows. Overall Sequence of Operations
  • a configurable register 26 associated with the L2 cache 18 dynamically determines (calculates) a coarse estimate of the spatial locality of the currently running task/transaction. The result of this determination is used to select one of two customized data request types (i.e., type-one request, type-two request) to be issued to the low level memory 20 to locate the requested data in an efficient manner.
  • the two request types are customized to take advantage of the dynamically determined spatial locality information to locate and retrieve the requested data from the low level memory 20 in the most efficient way.
  • the system, method and program product of the invention is directed to utilizing customized data requests based on spatial locality information without emphasizing the manner in which such spatial locality information is derived. That is, the spatial locality of the currently running task/transaction may be determined by the configurable register 26 using well-known methods in the art. For example, the spatial locality information may be determined by utilizing well-known runtime measurements that characterize the access patterns of the currently scheduled task/transaction on the processor 12A-12C.
  • a single numerical value is computed in some manner and stored in the configurable register 26.
  • the numerical value provides a coarse measure of the spatial locality of the currently running task/transaction.
  • a "large” or "high” computed numerical value indicates that the currently scheduled task/transaction has "poor" spatial locality.
  • a large value may be on the order of 70% of some pre-determined maximum numerical value.
  • This type-one data request first retrieves the hot-word specified in the data request, followed by a data block of L2 block size in a wraparound fashion, which is well-known in the art and described in the background above.
  • a second type (i.e., type-two) customized data request is issued from the configurable register 26 to the low level memory 20.
  • a low numerical value indicates that the currently scheduled task/transaction has "good" spatial locality.
  • a "low" value may be quantified, for example, as being on the order of 30% of some pre-determined maximum numerical value.
  • the type-two customized data request utilizes multiple read requests to the lower level memory 20.
  • Each read request is equal to an L1 block size and reads data words in a wrap around fashion, as described in the background.
  • One of the multiple data requests includes the hot-word address for the range of data words to be retrieved.
  • the type-one customized request may be issued if the spatial locality measurement falls closer to the upper threshold and the type-two customized request may be issued if the spatial locality measurement falls closer to the lower threshold.
  • different actions are contemplated by the invention.
  • the type-one and type-two customized requests may be selected on an alternating basis.
  • a single threshold value may be used to determine when it is appropriate to issue the first or second type of customized data request.
  • the single threshold value can be derived in any manner. For example, the value may be derived based on an average of the so-called "low" and "high" values, as described above.
  • the dynamically determined spatial locality information may be calculated by, and stored in, the configurable register 26 at different intervals.
  • the dynamically determined spatial locality information is calculated multiple times during the time a current task/transaction is executing on the processor 12A-12C.
  • the dynamically determined spatial locality information may be calculated during a context switch, when the task/transaction is scheduled on the processor 12A-12C.
  • the calculation may be performed for every new transaction issued from a processor 12A-12C sharing the L2 cache. It is noted that the calculation is performed in real-time and provides a coarse estimate of the spatial locality of the currently scheduled task.
  • the configurable register 26 may comprise any circuitry used to implement the refill/fetch protocol of the invention.
  • the configurable register 26 may further comprise circuitry to implement other cache functions (e.g. responding to local requests, allocating cache entries for local requests that miss in the cache, etc.).
  • the configurable register 26 may be implemented in software executed by a processor 12A-12C.
  • the configurable register 26 may be implemented in a combination of software and hardware circuitry.
  • the software may be stored on a computer accessible medium.
  • the software may comprise a plurality of instructions which, when executed, implement the cache control operations.
  • a computer accessible medium may include any media accessible by a computer during use to provide instructions and/or data to the computer.
  • a computer accessible medium may include storage media such as magnetic or optical media, e.g., disk (fixed or removable), tape drives, compact disk-ROM (CD-ROM), digital versatile disk-ROM (DVD-ROM), CD-Recordable (CD-R), CD-Rewritable (CD-RW), DVD-R, DVD-RW, volatile or non-volatile memory media such as RAM (e.g. synchronous dynamic RAM (SDRAM), Rambus DRAM (RDRAM), static RAM (SRAM), etc.), ROM, or Flash memory accessible via a peripheral interface such as the Universal Serial Bus (USB) interface, etc., as well as media accessible via transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link.
  • cache block refers to the unit of allocation and deallocation of storage in the caches. Cache blocks may be of any desired size in various embodiments. For example, cache block sizes of 32 bytes and 64 bytes are common, although larger or smaller sizes may be used.
  • the memory system 20 may comprise any type of memory, and may also include a memory controller to interface to the memory. For example, dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate (DDR) SDRAM, etc. may be used.
  • the memory system 20 may be a distributed memory system, in some embodiments.
  • Various peripheral devices e.g. storage devices such as disk drives, printers, networking devices, etc. may be included in various embodiments of a system as well.
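
The entries above leave open how the coarse spatial-locality estimate stored in the configurable register is actually computed from run-time measurements. One possible sketch (an illustrative assumption, not the patent's method) is to count, over a recent window of word addresses, how often successive accesses leave the current block-sized region, scaled so that a high score means poor locality, matching the convention used above:

```python
def coarse_locality_score(accesses, block_size=64, scale=100):
    """Coarse spatial-locality estimate over a window of word addresses.

    Words are numbered from 1. Counts how often consecutive accesses
    cross a block boundary; returns a 0..scale score where a HIGH value
    means POOR spatial locality (many out-of-block accesses).
    """
    if len(accesses) < 2:
        return 0
    out_of_block = sum(
        1 for prev, cur in zip(accesses, accesses[1:])
        if (prev - 1) // block_size != (cur - 1) // block_size
    )
    return round(scale * out_of_block / (len(accesses) - 1))

# Sequential accesses inside one 64-word block -> low score (good locality).
print(coarse_locality_score([1, 2, 3, 4, 5]))        # 0
# Scattered accesses across many blocks -> high score (poor locality).
print(coarse_locality_score([1, 200, 70, 500, 10]))  # 100
```

A hardware implementation would of course maintain such a counter incrementally rather than over a stored trace; the point is only that a single small number can summarize the access pattern.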

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Fetching data from a low level memory utilizing customized data requests based on a determination of the spatial locality of a currently running task/transaction. One of two customized requests is issued from a cache system to a low level memory (20). In the case where the spatial locality information is determined to be above a predetermined threshold value, a first customized data request (type-one) is issued to the low-level memory (20), which retrieves data in a single operation with a block size equal to the block size of a higher order cache (18) in the cache system (10). Otherwise, in the case where the spatial locality information is determined to be below the predetermined threshold value, a second customized data request (type-two) is issued from the cache system (10) to the low-level memory (20). The type-two data request comprises multiple data requests where each individual data request retrieves data for a block size equal to the block size of a level one (L1) cache (16A-16C) in the cache system (10).

Description

SYSTEM AND METHOD FOR SUPPORTING A HOT-WORD-FIRST REQUEST POLICY FOR A MULTI-HIERARCHICAL MEMORY SYSTEM
The invention relates to fetching data and instructions from a hierarchical memory.
Often, a computer relies on a relatively large, slow and inexpensive mass storage system such as a hard disk drive or other external storage device, an intermediate main storage memory that uses dynamic random access memory devices (DRAM's) or other volatile memory storage devices, and one or more high speed, limited capacity cache memories, or caches, implemented with static random access memory devices (SRAM's) or the like. Often multiple levels of cache memories are used, each with progressively faster and smaller memory devices. Also, depending upon the memory architecture used, cache memories may be shared by multiple microprocessors or dedicated to individual microprocessors, and may either be integrated onto the same integrated circuit as a microprocessor, or provided on a separate integrated circuit. As is well known, the speed performance of a cache tends to decrease data retrieval times for a processor. The cache stores specific subsets of data in high-speed memory. A few examples of data include instructions and addresses.
In a computer system, a cache is the first level of memory hierarchy encountered once an address leaves a processor toward a memory subsystem. A cache uses the principle of locality to buffer commonly used addresses. Caches improve system performance by reducing memory access latencies and by reducing bandwidth consumption on a processor front side bus. A cache operates on a processor backside bus with processors such as Intel's Pentium III Xeon processor. Many of today's processors include integrated level-one (L1) and level-two (L2) caches. Computer systems can employ a cache external to the processor in addition to these internal caches. If a data element required by the CPU is available in the cache, it is referred to as a cache "hit", while if the required data element is not available in the cache, it is said that a cache "miss" has occurred.
On such a cache miss, the processor has to access the slower main memory to get data. These misses represent the portion of processor time that limits overall system performance improvement. A drawback of using a Level 2 (L2) cache in the memory hierarchy is that it can increase latency for compulsory (i.e., unavoidable) misses. As a result, if there is a miss in both the L1 and L2 caches, many cycles are typically expended in fetching the data from the lower levels of memory. To keep latency to a minimum, the processor issues hot-word (i.e., critical) addresses to the memory so that the memory can return the most urgent data first. This is generally achieved in a wrapped transaction wherein the transfer starts from the first word requested by the processor and typically wraps around the size of the cache block. For example, assume that the cache block size is 64 words. If the hot-word address is the seventh word in the cache block, then words 7-64 are fetched first from lower memory, followed by words 1-6 to complete the wrapped transaction. One drawback of this approach is that it ensures reduced latency only to the hot-word address (e.g., address 7).
Fig. 1 is a schematic diagram illustrating a cache system for issuing hot-word requests according to two approaches utilized in the prior art. In Fig. 1 there is shown a cache system 110 comprising a level 1 (L1) cache 103 and a level 2 (L2) cache 105. The level 1 (L1) cache is a single 64 Kbyte block and the level 2 (L2) cache is a 256 Kbyte block comprised of four segments, 1-4. The cache system 110 is shown coupled to a processor 101 that issues data requests to the cache system 110 and a low level memory 107 for retrieving data requested by the cache system 110.
By way of example, assume that it is desired to fetch data word 191. This becomes the hot word in the data request issued from the processor 101. In the case of a cache miss in both the L1 and L2 caches in response to the hot-word request for hot-word address 191, a level two (L2) refill/fetch controller (not shown) issues a hot-word request to the low level memory 107 and retrieves a data block from the low level memory 107 of block size L2 (e.g., 256k bytes), including the requested hot-word address 191. The 256k data block retrieved from the low level memory 107 is loaded into the L2 cache starting with the hot-word address 191, followed by addresses 192-256, and a wrap around is performed to retrieve remaining addresses 1-190.
One advantage of requesting data in large blocks (e.g., of L2 block size) is that timing constraints imposed by the low-level memory 107 are minimized. In other words, fewer accesses are required because data is being retrieved in larger block sizes. However, an associated disadvantage of requesting data in this manner is that it is less efficient for those applications/tasks which exhibit good spatial locality. Good spatial locality may be defined as a tendency for the currently running task/transaction to access more data values inside of the cache block that is being currently accessed than outside of the same cache block. For example, for an application/task that accesses more data values inside of the cache block, each request results in a significant penalty whenever the requested hot-word address is closer to the end of the L1 block. This occurs because the L1 block is refilled from the L2 block, and for those cases where the requested hot-word address is closer to the end of the L1 block, the refill time of the L1 block is high due to the manner in which the L2 block is filled (i.e., in a wrap transaction).
According to a second type of read request utilized in the prior art, in the event a cache miss occurs in both the L1 and L2 caches, the cache controller retrieves data from the low level memory 107 using multiple requests of block size L1 (e.g., 64 Kbytes). As shown in the illustrative example, the L2 cache block size is related to the L1 cache block size according to: L2 = 2^N * L1 (where N = 0, 1, 2, 3, ...). In accordance with this second request type, four successive requests are issued from the cache controller. Each request returns 64 Kbytes of data. The first request returns addresses {191-192 & 129-190}, including hot-word address (191), from the low-level memory 107. The second through fourth requests return addresses {193-256}, {1-64} and {65-128}, respectively.
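The four requests in this example can be reproduced mechanically. Here is a hedged sketch (illustrative only; the function name is an assumption, and the 1-256 word numbering with 64-word segments follows the example above). The segment containing the hot word is requested first and is itself wrapped around the hot word; the remaining segments follow in wrap-around order:

```python
def type_two_requests(hot_word, l1_block, l2_block):
    """Split an L2 refill into multiple L1-block-sized read requests.

    The request covering the hot word is issued first (wrapped around
    the hot word); the remaining requests cover the rest of the L2
    block in wrap-around segment order.
    """
    assert l2_block % l1_block == 0
    num = l2_block // l1_block
    seg = (hot_word - 1) // l1_block          # segment holding the hot word
    requests = []
    for k in range(num):
        s = (seg + k) % num
        base = s * l1_block                   # word preceding segment s
        if k == 0:
            # wrap within the hot-word segment: hot word first
            words = [base + ((hot_word - 1 - base + i) % l1_block) + 1
                     for i in range(l1_block)]
        else:
            words = list(range(base + 1, base + l1_block + 1))
        requests.append(words)
    return requests

reqs = type_two_requests(191, 64, 256)
# reqs[0] covers {191-192, 129-190}; reqs[1..3] cover {193-256}, {1-64}, {65-128}.
```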
It is noted that this second request type, as used in the prior art, is advantageous for applications which exhibit good spatial locality. As discussed above, good spatial locality may be defined as a tendency for the currently running task/transaction to access more data values inside of the cache block that is being currently accessed than outside of the same cache block. Therefore, by filling the L1 block from segment 3 of the L2 block first, latency to data values inside the cache block is minimized. This results in better utilization of data-path connections, as they are always allocated for a fixed duration of time. However, high latency results for unaligned accesses. The second request type is disadvantageous, however, from the point of view of the low level memory performance because of the timing constraints associated with having to make multiple accesses to the low level memory (e.g., 4 in the example illustrated in Fig. 1). Thus, a need exists for a cache memory system to provide customized requests to a low level memory, which takes into account the spatial locality of an application/task to minimize latency.
According to a method, system, and program product of the invention, memory read latency is reduced and computer processor performance is thereby improved based upon a determination of the spatial locality of a currently running task/transaction. Specifically, the problems outlined above are in large part solved by a cache memory system, program product and method for providing customized requests to a low level memory to reduce access time and thereby improve the performance of a computer processor. It is noted that the method of the invention operates in accordance with the computer processor's hot-word-first fetch protocol. According to one aspect, a method of fetching data from a low level memory comprises the following acts: determining that a cache miss has occurred in a level one (L1) cache, determining that a cache miss has occurred in a higher order cache, calculating a measure of the spatial locality of a currently scheduled task/transaction, and issuing a customized data request to a low-level memory based on said measure of the spatial locality. According to another aspect, a cache system comprises a level one (L1) high speed cache memory coupled to the computer processor and a higher level high speed cache memory, a configurable register coupled to the higher level high speed cache memory, configured to dynamically determine the spatial locality of a currently running task/transaction on said computer processor, and a refill/fetch controller coupled to the higher level high speed cache memory configured to issue a customized data request to the low-level memory in dependence on the dynamically determined spatial locality information.
According to the method, system and program product of the invention, the spatial locality of a task/transaction may be dynamically determined using any number of well-known methods. For example, the spatial locality may be determined by utilizing run-time measurements that characterize the access patterns of the currently scheduled task/transaction on the processor. According to the method, system and program product of the invention, the spatial locality information may be determined at different intervals. For example, in one embodiment, the spatial locality information may be determined during a context switch, by software, when a new task/transaction is scheduled on the processor. In another embodiment, the spatial locality information may be determined by the processor based on run-time hardware measurements for each new task/transaction issued from a processor sharing the cache. The latter embodiment is advantageous in that it removes the need for additional processor cycles, since the content of the register is modified directly by the hardware based on the run-time measurements.
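One way such a run-time measurement could be realized (the application deliberately leaves the formula open; the scoring convention below, in which a high score means poor spatial locality, is an assumption chosen to match the description) is to count how often consecutive accesses leave the cache block accessed immediately before:

```python
def spatial_locality_score(addresses, block_size=64, max_score=100):
    """Coarse run-time estimate of spatial locality: the fraction of
    accesses that leave the previously accessed cache block, scaled
    to the range 0..max_score. A HIGH score indicates POOR spatial
    locality, matching the convention in the description."""
    if len(addresses) < 2:
        return 0
    block_changes = sum(
        1 for prev, cur in zip(addresses, addresses[1:])
        if prev // block_size != cur // block_size
    )
    return max_score * block_changes // (len(addresses) - 1)
```

With this convention a sequential scan scores near zero (good locality), while a large-stride scan that touches a new block on every access scores near the maximum (poor locality).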
The above summary of the present invention is not intended to represent each disclosed embodiment, or every aspect, of the present invention. Other aspects and example embodiments are provided in the figures and the detailed description that follows.
FIG. 1 is a block diagram of a system according to the prior art, and FIG. 2 is a block diagram of a system according to an embodiment of the invention.
While the invention is amenable to various modifications and alternative forms, specifics thereof have been shown by way of example in the drawings and will be described in detail. It should be understood, however, that the intention is not to limit the invention to the particular embodiments described. On the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.
FIG. 2 is a block diagram of a system 10 according to one embodiment of the invention. In the illustrated embodiment, the system 10 includes three processors 12A-12C, which may be designed to any instruction set architecture and may execute programs written to that instruction set architecture. Exemplary instruction set architectures may include the MIPS instruction set architecture (including the MIPS-3D and MIPS MDMX application specific extensions), the IA-32 or IA-64 instruction set architectures developed by Intel Corp., the PowerPC instruction set architecture, the Alpha instruction set architecture, the ARM instruction set architecture, or any other instruction set architecture. Each processor 12A-12C includes an on-chip level one (L1) cache 16A-16C. The processors 12A-12C also share a single on- or off-chip second level (L2) cache 18 via a common bus structure 17. The shared second level (L2) cache 18 is shown directly coupled to a low level memory 20. In the illustrated embodiment, the level two (L2) cache 18 comprises a cache memory 22 coupled to a configurable register 26.
The level one (L1) on-chip caches 16A-16C and the shared level two (L2) cache 18 may be configured to cache blocks of data from the memory system 20 in response to data requests from the respective processors 12A-12C. In the context of caching data from memory, the term "data" is used in its generic sense to refer to either data or instructions for execution. The level one cache memories 16A-16C comprise a plurality of entries, where each entry is capable of storing a cache block. Additionally, each entry may store a state of the cache block, as well as any other information (e.g., an address tag for the cache block, replacement algorithm data such as least-recently-used state, etc.). The configurable register 26 in the shared level two (L2) cache 18 is configured to implement the novel cache refill/fetch policy of the invention, which dynamically determines spatial locality information associated with a currently running task/transaction and, based on that information, issues a customized data request to the low level memory 20 to reduce access time to the low level memory 20 and thereby improve the performance of a computer processor.
In one embodiment, the refill/fetch controller issues a customized data request based on a value contained in the configurable register 26. The customized data request may be a type-one request or a type-two request, as described below. It is noted, however, that both the type-one and type-two requests conform to the hot-word-first fetch policy implemented by the respective processors 12A-12C.
It should be appreciated that the novel cache refill/fetch policy of the invention may not be implemented in all cases. Specifically, it is not implemented in the case where the requested data is found in a processor's (12A-12C) on-chip level one (L1) cache 16A-16C. In this case, a cache "hit" is said to occur and the data is read or written to the L1 cache without the need for further action.
In the case where the requested data is not found in the on-chip L1 cache 16A-16C, the data request is then passed from the L1 cache 16A-16C to the L2 cache 18 to determine if the data can be found there. In this case, the novel cache refill/fetch policy of the invention may be implemented as follows.

Overall Sequence of Operations
Upon transmitting a data request from the L1 cache 16A-16C and receiving the request at the L2 cache 18, a configurable register 26 associated with the L2 cache 18 dynamically determines (calculates) a coarse estimate of the spatial locality of the currently running task/transaction. The result of this determination is used to select one of two customized data request types (i.e., a type-one request or a type-two request) to be issued to the low level memory 20 to locate the requested data in an efficient manner. In other words, the two request types are customized to take advantage of the dynamically determined spatial locality information to locate and retrieve the requested data from the low level memory 20 in the most efficient way.
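The overall sequence might be sketched as follows. The callable locality register, the dictionary-based L2 lookup, and the 70 threshold are illustrative assumptions for the sketch, not prescribed by the application:

```python
def handle_l1_miss(addr, l2_cache, locality_register, low_mem,
                   high_threshold=70):
    """Sketch of the overall sequence of operations: on an L2 miss,
    the coarse spatial-locality value read from the configurable
    register selects which request type is issued to low level
    memory."""
    if addr in l2_cache:              # L2 hit: no low-level access needed
        return l2_cache[addr]
    score = locality_register()       # coarse spatial-locality estimate
    if score >= high_threshold:       # high score: poor locality
        request_type = "type-one"
    else:                             # low score: good locality
        request_type = "type-two"
    return low_mem(addr, request_type)
```

For example, a score of 80 from the register would route the miss to a type-one request, while a score of 20 would route it to a type-two request.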
It should be appreciated that the system, method and program product of the invention are directed to utilizing customized data requests based on spatial locality information, without emphasizing the manner in which such spatial locality information is derived. That is, the spatial locality of the currently running task/transaction may be determined by the configurable register 26 using methods well known in the art. For example, the spatial locality information may be determined by utilizing well-known run-time measurements that characterize the access patterns of the currently scheduled task/transaction on the processor 12A-12C.
To derive the spatial locality information, according to the present illustrative embodiment, a single numerical value is computed in some manner and stored in the configurable register 26. The numerical value provides a coarse measure of the spatial locality of the currently running task/transaction. In the presently described embodiment, a "large" or "high" computed numerical value indicates that the currently scheduled task/transaction has "poor" spatial locality. For example, a large value may be on the order of 70% of some pre-determined maximum numerical value. Whenever a high value is calculated, a type-one customized data request is issued from the configurable register 26 to the low level memory 20. This type-one data request first retrieves a hot-word specified in the data request, followed by a data block of L2 block size in a wrap-around fashion, which is well-known in the art and described in the background above. Conversely, whenever a "low" numerical value is calculated as a measure of the spatial locality of the currently running task/transaction, a second type (i.e., type-two) of customized data request is issued from the configurable register 26 to the low level memory 20. A low numerical value indicates that the currently scheduled task/transaction has "good" spatial locality. A "low" value may be quantified, for example, as being on the order of 30% of the same pre-determined maximum numerical value. The type-two customized data request utilizes multiple read requests to the low level memory 20. Each read request is equal to an L1 block size and reads data words in a wrap-around fashion, as described in the background. One of the multiple data requests includes the hot-word-first address for the range of data words to be retrieved. It is noted that in the case where the value falls above the so-called "low" value and below the so-called "high" value (i.e., in the middle ground), a different action can result.
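A sketch of the address sequences the two request types would generate, at word granularity (function names and block sizes are illustrative assumptions): type-one issues one L2-block-sized burst in hot-word-first wrap-around order, while type-two issues several L1-block-sized reads, the first of which covers the block containing the hot word:

```python
def type_one_words(hot_word, l2_block_words):
    """Type-one: a single burst of L2-block size, hot word first,
    wrapping around within the L2-aligned block."""
    base = hot_word - hot_word % l2_block_words
    off = hot_word - base
    return [base + (off + i) % l2_block_words
            for i in range(l2_block_words)]

def type_two_requests(hot_word, l1_block_words, l2_block_words):
    """Type-two: several reads, each of L1-block size, each in
    wrap-around order; the first read covers the block containing
    the hot word, and the reads themselves wrap within the L2 block."""
    l2_base = hot_word - hot_word % l2_block_words
    first_l1 = hot_word - hot_word % l1_block_words
    reads = []
    for r in range(l2_block_words // l1_block_words):
        base = l2_base + (first_l1 - l2_base
                          + r * l1_block_words) % l2_block_words
        off = hot_word - base if r == 0 else 0
        reads.append([base + (off + i) % l1_block_words
                      for i in range(l1_block_words)])
    return reads
```

With a 16-word L2 block, 4-word L1 blocks and hot word 37, type-one yields one burst 37..47 then 32..36, while type-two yields four reads: [37, 38, 39, 36], [40, 41, 42, 43], [44, 45, 46, 47], [32, 33, 34, 35].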
For example, for those values that fall in the middle ground, in one embodiment, the type-one customized request may be issued if the spatial locality measurement falls closer to the upper threshold and the type-two customized request may be issued if the spatial locality measurement falls closer to the lower threshold. Of course, in other embodiments, different actions are contemplated by the invention. As another example, the type-one and type-two customized requests may be selected on an alternating basis.
In another embodiment, a single threshold value may be used to determine when it is appropriate to issue the first or second type of customized data request. The single threshold value can be derived in any manner. For example, the value may be derived based on an average of the so-called "low" and "high" values, as described above.
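A minimal sketch combining the closer-threshold rule for middle-ground values with the averaged single threshold (the 30/70 values echo the examples above; the tie-breaking choice for an exactly central score is an assumption):

```python
def choose_request_type(score, low=30, high=70):
    """Two-threshold variant: at or above `high` issue a type-one
    request, at or below `low` issue a type-two request; middle-ground
    scores go to whichever threshold they fall closer to."""
    if score >= high:
        return "type-one"
    if score <= low:
        return "type-two"
    # middle ground: pick the nearer threshold (ties favor type-two)
    return "type-one" if high - score < score - low else "type-two"

def single_threshold(low=30, high=70):
    """Single-threshold variant derived as the average of the
    so-called low and high values."""
    return (low + high) / 2
```

For example, a score of 65 lies closer to the upper threshold and would produce a type-one request, while a score of 40 lies closer to the lower threshold and would produce a type-two request.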
According to one aspect, the dynamically determined spatial locality information may be calculated by, and stored in, the configurable register 26 at different intervals. For example, in one embodiment, the dynamically determined spatial locality information is calculated multiple times during the time a current task/transaction is executing on the processor 12A-12C. In another embodiment, the dynamically determined spatial locality information may be calculated during a context switch, when the task/transaction is scheduled on the processor 12A-12C. In other embodiments, the calculation may be performed for every new transaction issued from a processor 12A-12C sharing the L2 cache. It is noted that the calculation is performed in real-time and provides a coarse estimate of the spatial locality of the currently scheduled task. The configurable register 26 may comprise any circuitry used to implement the refill/fetch protocol of the invention. The configurable register 26 may further comprise circuitry to implement other cache functions (e.g. responding to local requests, allocating cache entries for local requests that miss in the cache, etc.). In other embodiments, the configurable register 26 may be implemented in software executed by a processor 12A-12C. In still other embodiments, the configurable register 26 may be implemented in a combination of software and hardware circuitry. The software may be stored on a computer accessible medium. The software may comprise a plurality of instructions which, when executed, implement the cache control operations. Generally speaking, a computer accessible medium may include any media accessible by a computer during use to provide instructions and/or data to the computer. 
For example, a computer accessible medium may include storage media such as magnetic or optical media, e.g., disk (fixed or removable), tape drives, compact disk-ROM (CD-ROM), or digital versatile disk-ROM (DVD-ROM), CD-Recordable (CD-R), CD-Rewritable (CD-RW), DVD-R, DVD-RW, volatile or non-volatile memory media such as RAM (e.g., synchronous dynamic RAM (SDRAM), Rambus DRAM (RDRAM), static RAM (SRAM), etc.), ROM, Flash memory, non-volatile memory (e.g., Flash memory) accessible via a peripheral interface such as the Universal Serial Bus (USB) interface, etc., as well as media accessible via transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link.
As used herein, the term "cache block" refers to the unit of allocation and deallocation of storage in the caches. Cache blocks may be of any desired size in various embodiments. For example, cache block sizes of 32 bytes and 64 bytes are common, although larger or smaller sizes may be used. The memory system 20 may comprise any type of memory, and may also include a memory controller to interface to the memory. For example, dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate (DDR) SDRAM, etc. may be used. The memory system 20 may be a distributed memory system, in some embodiments. Various peripheral devices (e.g., storage devices such as disk drives, printers, networking devices, etc.) may be included in various embodiments of a system as well.
While the present invention has been described with reference to several particular example embodiments, those skilled in the art will recognize that many changes may be made thereto without departing from the spirit and scope of the present invention, which is set forth in the following claims.


CLAIMS:
1. A method of pre-fetching data from a low level memory comprising the acts of: determining that a cache miss has occurred in a level one (L1) cache (16A-16C), determining that a cache miss has occurred in a higher order cache, calculating a measure of the spatial locality of a currently scheduled task/transaction, and issuing a customized data request to a low-level memory (20) based on said measure of the spatial locality.
2. The method according to claim 1, wherein the act of issuing a customized data request to a low-level memory (20) in accordance with said spatial locality calculation further comprises: issuing a type-one data request to said low-level memory (20) in the case where it is determined that the calculated spatial locality value exceeds a predetermined threshold value, and otherwise issuing a type-two data request to said low-level memory system (20).
3. The method according to claim 1, wherein said higher order cache is a level two (L2) cache (18).
4. The method according to claim 1, wherein the spatial locality information is calculated during a context switch of the currently scheduled task/transaction.
5. The method according to claim 1, wherein the spatial locality information is calculated during each scheduled execution of the currently scheduled task/transaction on said at least one computer processor (12A-12C).
6. The method according to claim 1, wherein the act of determining said degree of spatial locality of the currently scheduled task/transaction further comprises utilizing run-time measurements that characterize the access patterns of the currently scheduled task/transaction on said at least one computer processor (12A-12C).
7. The method according to claim 1, wherein the predetermined threshold is calculated as a percentage of a predetermined maximum numerical value.
8. The method according to claim 7, wherein the percentage is on the order of seventy percent of said predetermined maximum numerical value.
9. The method according to claim 1, wherein the step of retrieving data from said low-level memory (20) in accordance with said type-one data request further comprises retrieving a block of data words from said low level memory (20) having a block size equal to the block size of the higher order cache, starting with a hot-word address specified in said type-one data request and terminating with a data word immediately preceding said hot word.
10. The method according to claim 1, wherein the step of retrieving data from said low-level memory (20) in accordance with said type-two data request further comprises retrieving a plurality of data blocks from said low level memory (20), each data block having a block size equal to the block size of said level one (L1) cache (16A-16C), wherein the first retrieved data block includes a hot-word address specified in said type-two data request.
11. The method according to claim 1, wherein said computer processor (12A-12C) operates in accordance with a hot-word-first fetch policy.
12. A cache memory system (10), said cache memory system (10) interfacing a computer processor (12A-12C) with a low-level memory (20), said cache memory system (10) comprising: a level one (L1) high speed cache memory (16A-16C) coupled to said computer processor (12A-12C) and a higher level high speed cache memory (18), a configurable register (26) coupled to the higher level high speed cache memory (18) and configured to dynamically determine the spatial locality of a currently running task/transaction on said computer processor (12A-12C), and a refill/fetch controller coupled to the higher level high speed cache memory (18) configured to issue a customized data request to the low-level memory (20) in dependence on the dynamically determined spatial locality information.
13. The cache memory system of claim 12, wherein said computer processor (12A-12C) operates in accordance with a hot-word-first fetch policy.
14. The cache memory system of claim 12, wherein the spatial locality information is calculated during a context switch of the currently scheduled task/transaction.
15. The cache memory system of claim 12, wherein the spatial locality information is calculated at least two times during each scheduled execution of the currently scheduled task/transaction on said computer processor (12A-12C).
16. The cache memory system of claim 12, wherein the act of determining said degree of spatial locality of the currently scheduled task/transaction further comprises utilizing run-time measurements that characterize the access patterns of the currently scheduled task/transaction on a processor.
17. A program product tangibly embodying a program of computer readable instructions executable by a digital processing apparatus to control a cache memory system having a level one (L1) high speed cache memory (16A-16C), at least a level two (L2) high speed cache memory, a configurable register (26) coupled to the level two (L2) high speed cache memory (18) and a refill/fetch controller coupled to the level two (L2) high speed cache memory (18), said program product controlling the computer system to perform operations comprising: determining that a cache miss has occurred in said level one (L1) high-speed cache memory (16A-16C); determining that a cache miss has occurred in said at least one higher order cache, said cache miss occurring in response to a data request from said at least one computer processor (12A-12C) for data stored in a low level memory (20) coupled to said at least one shared higher order cache; calculating a measure of the spatial locality of a currently scheduled task/transaction associated with said data request from said at least one computer processor (12A-12C); issuing a type-one data request to said low-level memory (20) in the case where it is determined that the calculated spatial locality exceeds a predetermined threshold, otherwise issuing a type-two data request to said low-level memory system (20); and retrieving the requested data from the low level memory (20) in accordance with one of said type-one customized data request or said type-two customized data request.
18. The program product according to claim 17, wherein the predetermined threshold is calculated as a percentage of a predetermined maximum numerical value.
19. The program product according to claim 18, wherein the percentage is on the order of seventy percent of said predetermined maximum numerical value.
PCT/IB2007/051135 2006-04-04 2007-03-29 System and method for supporting a hot-word-first request policy for a multi-heirarchical memory system WO2007113757A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US78955606P 2006-04-04 2006-04-04
US60/789,556 2006-04-04

Publications (2)

Publication Number Publication Date
WO2007113757A2 true WO2007113757A2 (en) 2007-10-11
WO2007113757A3 WO2007113757A3 (en) 2007-12-13

Family

ID=38432907




