KR100302928B1 - Hardware-managed programmable unified/split caching mechanism for instructions and data

Info

Publication number: KR100302928B1
Application number: KR1019980003910A
Authority: KR
Inventors: 레이비 쿠마 아리밀리; 레오 제임스 클락; 존 스티븐 도드슨; 제리 돈 루이스
Original assignee: 포만 제프리 엘; 인터내셔널 비지네스 머신즈 코포레이션
Priority date: 1997-04-14
Filing date: 1998-02-10
Publication date: 2001-09-22
Anticipated expiration: 2018-02-10
Also published as: KR19980079707A

Abstract

A method is provided for allocating a cache used by a processor of a computer system between at least two classes of values, such as instructions and data. A logic unit is connected to the cache to monitor the relative usage of the cache by each class and to select a desired cache usage ratio for the classes from among a plurality of available ratios. Cache blocks are evicted using a cache replacement mechanism that, based on the desired cache usage ratio, restricts replacement of the evicted block to a particular one of the value classes. A multi-bit facility is provided to indicate how victim selection is to be restricted to certain cache blocks, and the logic unit selects the desired cache usage ratio by setting the multi-bit facility. The cache replacement mechanism may be a least recently used (LRU) replacement mechanism modified to restrict replacement of evicted blocks to a particular value class based on the desired cache usage ratio. The available ratios may include, for example, instruction/data cache block usage ratios of 1:1, 1:2, and 2:1.

Description

HARDWARE-MANAGED PROGRAMMABLE UNIFIED/SPLIT CACHING MECHANISM FOR INSTRUCTIONS AND DATA

The present invention relates generally to computer systems, more specifically to caches used by processors, and in particular to a method of making efficient use of an associative cache.

FIG. 1 shows the basic structure of a conventional computer system 10. Computer system 10 may have one or more processing units, two of which (12a and 12b) are shown. The processing units are connected to various peripheral devices, including input/output (I/O) devices 14 (such as a display monitor, keyboard, and permanent storage device), a memory device 16 (such as random access memory, or RAM) that the processing units use to carry out program instructions, and firmware 18, whose primary purpose is to seek out and load an operating system from one of the peripherals (usually the permanent memory device) whenever the computer is first turned on. Processing units 12a and 12b communicate with the peripheral devices by various means, including a generalized interconnect or bus 20. Computer system 10 may have many additional components that are not shown, such as serial and parallel ports for connection to modems or printers. Those skilled in the art will further appreciate that other components may be used in conjunction with those shown in the block diagram of FIG. 1; for example, a display adapter may be used to control a video display monitor, and a memory controller may be used to access memory 16. Also, instead of connecting I/O devices 14 directly to bus 20, they may be connected to a secondary (I/O) bus that is in turn connected to an I/O bridge on bus 20. The computer may have two or more processing units.

In a symmetric multiprocessor (SMP) computer, all of the processing units are generally identical. That is, they all use a common set or subset of instructions and protocols to operate, and they generally have the same architecture. A typical architecture is shown in FIG. 1. A processing unit includes a processor core 22 having a plurality of registers and execution units, which carry out program instructions in order to operate the computer. An exemplary processing unit is the PowerPC™ processor marketed by International Business Machines (IBM) Corporation. The processing unit may also have one or more caches, such as an instruction cache 24 and a data cache 26, which are implemented using high-speed memory devices. Instructions and data may be directed to the respective cache 24 or 26 by examining a signal that indicates whether the CPU is requesting an operation whose operand is an instruction or one whose operand is data. Caches are commonly used to temporarily store values that might be accessed repeatedly by the processor, in order to speed up processing by avoiding the longer step of loading the values from memory 16. These caches are referred to as "on-board" when they are packaged integrally with the processor core on a single integrated chip 28. Each cache is associated with a cache controller (not shown) that manages the transfer of data between the processor core and the cache memory.

A processing unit 12 may include additional caches, such as cache 30, which is referred to as a level 2 (L2) cache because it supports the on-board (level 1) caches 24 and 26. In other words, cache 30 acts as an intermediary between memory 16 and the on-board caches, and it can store a much larger amount of information (instructions and data) than the on-board caches can, but at a longer access penalty. For example, cache 30 may be a chip with a storage capacity of 256 or 512 kilobytes, while the processor may be an IBM PowerPC™ 604-series processor with on-board caches totaling 64 kilobytes of storage. Cache 30 is connected to bus 20, and all loading of information from memory 16 into processor core 22 must come through cache 30. Although FIG. 1 depicts only a two-level cache hierarchy, multi-level cache hierarchies can be provided in which there are many levels of serially connected caches.

A cache has many "blocks" that individually store the various instruction and data values. The blocks in any cache are divided into groups of blocks called "sets". A set is the collection of cache blocks in which a given memory block can reside. For any given memory block, there is a unique set in the cache to which the block can be mapped, according to a preset mapping function. The number of blocks in a set is referred to as the associativity of the cache; for example, 2-way set associativity means that, for any given memory block, there are two blocks in the cache to which the memory block can be mapped. However, several different blocks in main memory can be mapped to any given set. A 1-way set associative cache is direct mapped, that is, there is only one cache block that can contain a particular memory block. A cache is said to be fully associative if a memory block can occupy any cache block, that is, if there is only one set; the address tag is then the full address of the memory block.
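The set mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text says no more than that a "preset mapping function" exists, so the modulo mapping, the function names `set_index` and `candidate_blocks`, and the numbering of cache blocks are all hypothetical.

```python
def set_index(block_address: int, num_sets: int) -> int:
    """Map a memory block to its set (assumed modulo mapping)."""
    return block_address % num_sets


def candidate_blocks(block_address: int, num_sets: int, ways: int) -> list[int]:
    """Return the cache-block numbers in which the memory block may reside.

    Cache blocks are assumed to be numbered so that set s owns blocks
    s*ways .. s*ways + ways - 1.
    """
    s = set_index(block_address, num_sets)
    return [s * ways + w for w in range(ways)]


# 2-way set associative: two candidate blocks per memory block.
two_way = candidate_blocks(5, num_sets=4, ways=2)
# Direct mapped (1-way): exactly one candidate block.
direct = candidate_blocks(5, num_sets=4, ways=1)
# Fully associative: a single set, so every cache block is a candidate.
fully = candidate_blocks(5, num_sets=1, ways=8)
```

Memory blocks 5 and 13 land in the same set when there are four sets, which is the sense in which several different memory blocks map to any given set.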

An exemplary cache line (block) includes an address-tag field, a state-bit field, an inclusivity-bit field, and a value field for storing the actual instruction or data. The state-bit field and inclusivity-bit field are used to maintain cache coherency in a multiprocessor computer system. The address tag is a subset of the full address of the corresponding memory block. A compare match of an incoming effective address with one of the tags in the address-tag field indicates a cache "hit". The collection of all of the address tags in a cache (and sometimes the state-bit and inclusivity-bit fields) is referred to as a directory, and the collection of all of the value fields is the cache-entry array.

When all of the blocks in a set for a given cache are full and that cache receives a request, whether a "read" or a "write", to a memory location that maps into the full set, the cache must "evict" one of the blocks currently in the set. The cache chooses the block to be evicted by one of a number of means known to those skilled in the art (least recently used (LRU), random, pseudo-LRU, and so on). If the data in the chosen block is modified, that data is written to the next lower level in the memory hierarchy, which may be another cache (in the case of the L1 or on-board cache) or main memory (in the case of an L2 cache, as depicted in the two-level architecture of FIG. 1). By the principle of inclusion, the lower level of the hierarchy will already have a block available to hold the written modified data. If, however, the data in the chosen block is not modified, the block is simply abandoned and is not written to the next lower level of the hierarchy. This process of removing a block from one level of the hierarchy is known as an "eviction". At the end of this process, the cache no longer holds a copy of the evicted block.
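The eviction rule above (pick a victim, write it back only if it was modified) can be illustrated with a small model of a single set. This is a sketch, not the patented mechanism: it uses true LRU ordering, and the class name `CacheSet` and its `access` method are hypothetical.

```python
from collections import OrderedDict


class CacheSet:
    """One set of an associative cache with true-LRU eviction (illustrative)."""

    def __init__(self, ways: int):
        self.ways = ways
        # Maps tag -> modified flag; the least recently used tag comes first.
        self.blocks: OrderedDict[int, bool] = OrderedDict()

    def access(self, tag: int, write: bool = False):
        """Return (hit, written_back_tag).

        written_back_tag is the tag of a modified victim that must be written
        to the next lower level, or None if nothing needs to be written.
        """
        if tag in self.blocks:
            self.blocks.move_to_end(tag)          # mark as most recently used
            if write:
                self.blocks[tag] = True
            return True, None
        written_back = None
        if len(self.blocks) == self.ways:         # set is full: evict the LRU block
            victim, modified = self.blocks.popitem(last=False)
            if modified:
                written_back = victim             # modified data goes one level down
        self.blocks[tag] = write
        return False, written_back


s = CacheSet(ways=2)
r1 = s.access(1, write=True)   # miss, block 1 now modified
r2 = s.access(2)               # miss, set is now full
r3 = s.access(3)               # miss, evicts modified block 1 (written back)
r4 = s.access(4)               # miss, evicts clean block 2 (simply discarded)
r5 = s.access(3)               # hit
```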

Some procedures (programs) running on a processor have the unintended effect of repeatedly using a limited number of sets (congruence classes), which makes the cache less efficient. In other words, when a procedure causes a large number of evictions in a small number of congruence-class members while leaving a large number of other members unused, memory latency delays increase. This effect, referred to as striding, is related to the congruence mapping function and to the manner in which the particular procedure allocates memory blocks in the main memory device (RAM 16). The statistical advantage of using a particular associative cache is lost for procedures of this type.

Another statistical advantage that is sometimes lost relates to the provision of separate cache blocks (such as caches 24 and 26) for instructions and data. A typical processing unit provides an equal number of L1 cache blocks for instructions and for data, so that 50% of the available cache entries at this level can be used for instructions and 50% for data. In the L2 cache there is no such distinction: 100% of the cache at the L2 level is available for instructions and 100% is available for data. This ratio of blocks available for instructions versus data is not, however, always the most efficient use of the cache for a particular procedure. Many software applications perform better when run on a system with split I/D caching, while others perform better when run on a flat, unified cache (given the same total cache space). When the cache I/D ratio is not particularly close to the actual ratio of instruction and data cache operations, a troubling number of evictions again occurs.

Yet another statistical advantage of an associative cache that can be lost relates to the cache replacement algorithm, which determines which cache block in a given set will be evicted. For example, an 8-way associative cache might use an LRU unit that examines a 7-bit field associated with the set. Because of the particular cycling frequency of the procedure running on the processor, this 7-bit LRU algorithm may evict more cache blocks than would be evicted if the cache were 4-way or 2-way associative.
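The text does not say how the 7-bit field is organized. One common scheme that uses exactly seven bits for an 8-way set is tree-based pseudo-LRU, and the sketch below assumes that scheme purely for illustration; the class name `TreePLRU` and its methods are hypothetical.

```python
class TreePLRU:
    """Tree-based pseudo-LRU for one set; uses ways - 1 bits (7 for 8-way)."""

    def __init__(self, ways: int):
        self.ways = ways                  # assumed to be a power of two
        self.bits = [0] * (ways - 1)      # tree nodes stored in heap order

    def touch(self, way: int) -> None:
        """Record an access: make every node on the path point away from `way`."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1       # next victim search goes right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0       # next victim search goes left
                node, lo = 2 * node + 2, mid

    def victim(self) -> int:
        """Follow the node bits from the root to the way to be evicted."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo


plru = TreePLRU(8)
never_evicts_last_touched = True
for w in range(8):
    plru.touch(w)
    never_evicts_last_touched &= (plru.victim() != w)
victim_after_sweep = plru.victim()
```

Under this assumed scheme the bit counts for 8-way, 4-way, 2-way, and direct-mapped sets are 7, 3, 1, and 0.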

It is difficult to optimize an associative cache statistically, because different technical applications may present different stride conditions or different instruction/data ratios. For example, a desktop publishing program, a warehouse inventory program, an aerodynamics modeling program, and a server program may all present different stride conditions or different ratios of instruction operations to data operations. It would therefore be desirable and advantageous to design a cache that can more fully optimize its statistical advantages regardless of the type of procedure running on the processor.

It is therefore one object of the present invention to provide an improved cache for a processor of a computer system.

It is another object of the present invention to provide such a cache that optimizes the statistical advantages related to associativity.

It is yet another object of the present invention to provide such a cache that optimizes the statistical advantages related to accessing instructions versus data.

It is still another object of the present invention to provide such a cache that optimizes the statistical advantages related to the cache replacement (eviction) algorithm.

The foregoing objects are achieved in a method of allocating a cache used by a processor of a computer system between at least two classes of values (instructions and data). The method comprises the steps of providing a logic unit, connected to the cache, for monitoring the relative usage of the cache by each class; selecting a desired cache usage ratio for the classes from among a plurality of available ratios; and evicting cache blocks in the cache using a cache replacement mechanism that, based on the desired cache usage ratio, restricts replacement of the evicted block to a particular one of the value classes. A multi-bit facility may be provided to indicate how victim selection is to be restricted to certain cache blocks, and the logic unit selects the desired cache usage ratio by setting the multi-bit facility. The cache replacement mechanism may be an LRU replacement mechanism modified to restrict replacement of evicted blocks to a particular value class based on the desired cache usage ratio. The available ratios may include, for example, instruction/data cache block usage ratios of 1:1, 1:2, and 2:1.
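One way to picture ratio-restricted victim selection is to reserve a share of the ways in each set for each value class and run LRU only over the ways allowed for the class of the incoming value. The sketch below is a guess at such a policy, not the mechanism the patent claims: the way partitioning, the `RATIOS` table, and the function names `allowed_ways` and `choose_victim` are all assumptions made for illustration.

```python
# Instruction:data ratios named in the text, as (instruction parts, data parts).
RATIOS = {"1:1": (1, 1), "1:2": (1, 2), "2:1": (2, 1)}


def allowed_ways(ratio: str, value_class: str, ways: int) -> range:
    """Ways a value of the given class may replace (assumed contiguous split)."""
    i_parts, d_parts = RATIOS[ratio]
    i_ways = ways * i_parts // (i_parts + d_parts)
    if value_class == "instruction":
        return range(0, i_ways)
    return range(i_ways, ways)


def choose_victim(last_used: list[int], ratio: str, value_class: str) -> int:
    """Pick the least recently used way among those allowed for the class.

    last_used[w] is the time of the most recent access to way w.
    """
    candidates = allowed_ways(ratio, value_class, len(last_used))
    return min(candidates, key=lambda w: last_used[w])


# A 6-way set; smaller timestamps mean older accesses.
timestamps = [5, 1, 9, 0, 7, 3]
v_i_11 = choose_victim(timestamps, "1:1", "instruction")  # ways 0-2 allowed
v_d_11 = choose_victim(timestamps, "1:1", "data")         # ways 3-5 allowed
v_i_21 = choose_victim(timestamps, "2:1", "instruction")  # ways 0-3 allowed
v_d_21 = choose_victim(timestamps, "2:1", "data")         # ways 4-5 allowed
```

In this sketch the string key of `RATIOS` stands in for the multi-bit facility: changing the key changes which blocks a given class may displace without touching the LRU bookkeeping.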

The above as well as additional objects, features, and advantages of the present invention will become apparent in the following detailed description.

FIG. 1 is a block diagram of a prior-art multiprocessor computer system.

FIGS. 2A-2C depict a novel method of varying the associativity of an associative cache.

FIG. 3 depicts one method of providing programmable associativity such as that shown in FIGS. 2A-2C, using a basic congruence-class mapping that is modified by creating additional classes with bits from the address tag.

FIG. 4 depicts a novel method of providing programmable congruence classes, which allows particular addresses to be assigned arbitrarily to particular congruence classes by switching address bits.

FIG. 5 is a high-level schematic diagram of a hardware implementation that provides programmable congruence classes such as those shown in FIG. 4, using an encoding value for each bit of the full address.

FIG. 6 is a block diagram of a novel cache having a replacement control unit that allows an element of randomness to be introduced, in varying degrees, into an LRU algorithm.

<Description of Reference Numerals for Main Parts of the Drawings>

40: cache

50: 5-bit programmable field

52: 5-to-32 decoder

54: AND gate array

56: OR gate array

The present invention is directed to more efficient operation of a cache of a processing unit, and it provides several methods of improving cache efficiency. One method relates to the associativity of the cache architecture and may be understood with reference to FIGS. 2A-2C, which depict different states of a single cache 40. Cache 40, which may include a cache controller (not shown), has a plurality of cache lines arranged in sets (congruence classes) to provide associativity. In the first state of cache 40, shown in FIG. 2A, there are eight cache lines in a set, for example cache lines 1 through 8 in set 1, cache lines 9 through 16 in set 2, and so on, which means 8-way associativity. Entries in cache 40 may be of varying formats, such as having an address-tag field, a state-bit field, an inclusivity-bit field, and a value field.

The static configuration of FIG. 2A provides the benefits of a conventional 8-way associative cache, but the present invention additionally provides associative adaptability, or programmability, as shown in FIGS. 2B and 2C. In FIG. 2B, each 8-block set has been divided into two smaller sets, including sets 1a, 1b, 2a, and 2b. Each of these sets contains four blocks, so this state of cache 40 is 4-way associative. In FIG. 2C, the sets have been further subdivided to yield two blocks per set, that is, 2-way associativity. This progression may be extended to 1-way associativity. The progression may also begin with a larger number of cache blocks in the largest set, for example 16 instead of 8.

The ability to change the associativity level of cache 40 allows the cache to operate more efficiently. As noted in the description of the prior art, certain procedures may cause striding, that is, cause the cache to cycle within one or two congruence classes, due in part to the particular associativity size. For such procedures, striding can be eliminated or minimized by using a different associativity size. The associativity size can be optimized for different applications by providing one or more programmable bits that indicate which associativity level is desired. For example, Table 1 shows how a programmable 2-bit facility might be used to implement the adaptable associativity scheme of FIGS. 2A-2C.

Table 1

Associativity    Program bits    Congruence classes    Address bits    LRU bits
8-way            00              N                     A               7
4-way            01              N x 2                 A-1             3
2-way            10              N x 4                 A-2             1
Direct mapped    11              N x 8                 A-3             0

The 2-bit facility is set to "00" to indicate 8-way associativity, to "01" to indicate 4-way associativity, to "10" to indicate 2-way associativity, and to "11" to indicate 1-way associativity (that is, direct mapped). The necessary subdivision of the sets is controlled by modifying the congruence-class mapping function so that it conveniently uses one or more specific subsets of the original sets. In other words, the two sets 1a and 1b contain only cache lines that were in the original set 1, and sets 1c and 1d contain only cache lines that were in the first subdivided set 1a. For a cache 40 with a fixed number of cache lines, this means that the number of congruence classes varies between N and N x 8, where N is the minimum number of congruence classes dictated by the basic mapping function.

The manner in which the specific subsets are identified may vary. A portion of the full address of the memory block may be used to refine the congruence-class mapping. For example, a 32-bit full address might be broken down into three parts, as shown in FIG. 3: an offset field, a congruence-class field, and an address-tag field. The offset field, six bits in this example, defines the exact location of the byte within the value field corresponding to the actual instruction or data. The congruence-class field is used as the input operand to the mapping function and assigns the memory block to a primary set, that is, a set having eight blocks, such as set 1. In this example, the congruence-class field is 13 bits and the address tag is 13 bits for 8-way associativity, but the congruence-class field is effectively enlarged for the other associativity levels by using additional bits from the address tag, in which case the address-tag field shrinks. Four-way associativity is achieved by using the last bit of the original address-tag field to subdivide each 8-block set into two smaller groups of four blocks each. Similarly, 2-way and 1-way associativity are achieved by also using the second-to-last and third-to-last bits of the original address-tag field to subdivide the sets further.
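The field arithmetic above can be sketched as follows. The 6/13/13-bit split and the 2-bit program values come from the text; the placement of the offset in the low-order bits and the tag in the high-order bits, the treatment of the tag's "last" bit as its least significant bit, and the function name `split_address` are assumptions made for this illustration.

```python
OFFSET_BITS = 6    # byte offset within the block
CLASS_BITS = 13    # congruence-class field at 8-way associativity
TAG_BITS = 13      # address tag at 8-way associativity


def split_address(addr: int, program_bits: int):
    """Split a 32-bit address into (tag, set_index, offset, ways).

    program_bits is the 2-bit facility value: 0 -> 8-way, 1 -> 4-way,
    2 -> 2-way, 3 -> direct mapped. Each step moves one more low-order
    tag bit into the set index.
    """
    borrowed = program_bits
    index_bits = CLASS_BITS + borrowed
    offset = addr & ((1 << OFFSET_BITS) - 1)
    set_index = (addr >> OFFSET_BITS) & ((1 << index_bits) - 1)
    tag = addr >> (OFFSET_BITS + index_bits)
    ways = 8 >> borrowed
    return tag, set_index, offset, ways


# Example address built from tag 0b101, primary set 7, offset 3.
example = (0b101 << (OFFSET_BITS + CLASS_BITS)) | (7 << OFFSET_BITS) | 3
eight_way = split_address(example, 0b00)
four_way = split_address(example, 0b01)
direct_mapped = split_address(example, 0b11)
```

The low 13 bits of the set index are the same at every level, so each refined set is a subset of the original 8-block set, in line with the subdivision of set 1 into sets 1a and 1b.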

Programmable associativity may be provided by either hardware or software that sets the 2-bit facility. In the former (hardware) implementation, a logic unit can collect miss information and select an associativity level based on predefined criteria, such as a maximum miss rate for any single congruence class, or miss rates above one or more thresholds in more than a certain number of congruence classes. This management of associativity can take place dynamically, so that the cache responds quickly to changes in the nature of the procedures running on the processor, such as a change in the type of application running on the computer system. Alternatively, a set of connecting pins may be used for manual selection. Similarly, software (program instructions) may operate to adjust the associativity level. Application software may be provided for particular programs that are known to contain procedures likely to cause striding, and that software can set the 2-bit associativity facility to a known appropriate level to reduce excess memory latency due to striding. The application software may also adjust the associativity level intermittently, based on the different routines used by the program. Operating system software may be used to monitor address requests and to determine, in a predictive manner, how efficiently the procedures would run at different associativity levels; the operating system can then select the most efficient level. These techniques provide real-time adjustment of the associativity level, even while a program is running.

The programmable associativity described above provides one way of influencing congruence classes, for example by increasing the number of congruence classes by a multiplicative factor in the illustrated embodiment. Another way of improving cache efficiency according to the present invention concerns a different aspect of congruence classes: the mapping function that determines which memory block is assigned to which congruence class. Prior-art mapping techniques typically use a modulo-type function, but the cyclic nature of such a function can cause stride problems. The present invention addresses this problem by using a mapping function that allows the full or partial address to be encoded into a new, unique address, that is, by arbitrarily (but in a predefined manner) assigning particular addresses to particular congruence classes. As shown in the example of FIG. 4, the 10th bit of the full (original) 32-bit address is shifted to the 26th bit of the encoded 32-bit address, the 26th bit of the original address is shifted to the 18th bit of the encoded address, the 18th bit of the original address is shifted to the 22nd bit of the encoded address, and the 22nd bit of the original address is shifted to the 10th bit of the encoded address. In this example, switching address bits provides a unique and arbitrary assignment of a given address to a particular congruence class.
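The bit switching described for FIG. 4 can be sketched in software as follows. This is only an illustrative model, not the patented circuit: the function name `encode_address` is hypothetical, and the sketch assumes that bit position 0 is the least significant bit, which the text does not specify.

```python
# Bit moves from the FIG. 4 example: source bit -> destination bit.
# ASSUMPTION: bit 0 is the least significant bit (numbering is not stated in the text).
FIG4_MOVES = {10: 26, 26: 18, 18: 22, 22: 10}


def encode_address(addr: int, moves=FIG4_MOVES, width: int = 32) -> int:
    """Return the encoded address with the listed bits moved to new positions."""
    mask = (1 << width) - 1
    addr &= mask

    # Clear every bit position that takes part in the exchange.
    moved_positions = 0
    for src in moves:
        moved_positions |= 1 << src
    encoded = addr & ~moved_positions & mask

    # Copy each source bit into its destination position.
    for src, dst in moves.items():
        if (addr >> src) & 1:
            encoded |= 1 << dst
    return encoded
```

Because the four moves form a closed cycle (10, 26, 18, 22), the mapping is a permutation of bit positions, so no two original addresses encode to the same value. Applying the function four times returns the original address.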

Again, this programmability of congruence classes can be achieved through either hardware or software implementations. Application software may provide the appropriate encoding of addresses before they are passed to the cache/processor, or operating system software may monitor the allocation of memory blocks and use an interpreter to modify addresses as they are passed to the hardware. These techniques allow the congruence classes to be adjusted intermittently or in real time. FIG. 5 shows a hardware implementation. A plurality of 5-bit programmable fields 50 is provided, one for each bit in the address (full or partial) to be encoded. Each 5-bit programmable field 50 is fed into a respective 5-to-32 decoder 52, and each decoder's output (32 lines) is fed to a respective AND gate array 54 (32 AND gates per array). The outputs of the AND gate arrays 54 (32 lines each) are distributed to a plurality of OR gates 56, each of which receives one output from each AND gate array 54. The outputs of the OR gates 56 provide the shifted values of the encoded address. This hardware provides programmable congruence classes through selection of appropriate values for the 5-bit programmable fields 50, and it can also dynamically collect miss information and select an arbitrary mapping function based on predetermined criteria. In a hardware implementation, the cache must be flushed before the associativity level is changed, to ensure coherency.
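The decoder, AND-array, and OR-gate structure of FIG. 5 can be modeled behaviorally as in the sketch below. The names are hypothetical, and the sketch assumes that the 5-bit field associated with a source address bit holds the destination position of that bit, and that bit 0 is the least significant bit; the text does not state either point explicitly.

```python
WIDTH = 32  # address bits, one 5-bit programmable field per bit


def decode_5_to_32(field):
    """Model of a 5-to-32 decoder: exactly one of 32 output lines is high."""
    return [1 if line == (field & 0x1F) else 0 for line in range(WIDTH)]


def encode_with_fields(addr, fields):
    """Route each address bit through its decoder, AND array, and the OR gates.

    ASSUMPTION: fields[src] names the destination bit for source bit `src`.
    """
    # One AND gate array per source bit: the bit is ANDed with each decoder line.
    and_arrays = []
    for src in range(WIDTH):
        bit = (addr >> src) & 1
        select_lines = decode_5_to_32(fields[src])
        and_arrays.append([bit & line for line in select_lines])

    # One OR gate per destination bit, fed by one output of every AND array.
    encoded = 0
    for dst in range(WIDTH):
        or_out = 0
        for src in range(WIDTH):
            or_out |= and_arrays[src][dst]
        encoded |= or_out << dst
    return encoded


# Example field settings: identity mapping, then the FIG. 4 bit exchange.
identity_fields = list(range(WIDTH))
fig4_fields = list(range(WIDTH))
for src, dst in {10: 26, 26: 18, 18: 22, 22: 10}.items():
    fig4_fields[src] = dst
```

With `identity_fields` the address passes through unchanged; with `fig4_fields` the four bits are exchanged as in the FIG. 4 example. If two fields named the same destination, the OR gate would merge two source bits, so a usable setting must be a permutation.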

The programmable congruence classes described above are independent of the programmable associativity described earlier, although the two may be used in combination. For example, programmable associativity can be used to set the two-bit associativity facility to optimize the associativity level, and programmable congruence classes using the 5-bit encoding fields can then be used to reduce the eviction rate.

Yet another way of improving cache efficiency according to the present invention concerns the use of the cache for instructions versus data. In computer systems that implement a CPU caching architecture, the cache is usually predefined either as a unified cache, in which instructions and data are always treated alike, or as a split I/D cache, in which part of the total cache RAM space (usually one half) is dedicated to instructions and the remainder to data. Moreover, in conventional split I/D cache designs, the ratio of space dedicated to instructions versus data is fixed (usually 50%/50%).

This specification describes a novel cache allocation design in which the instruction/data split ratio is programmable to varying degrees. In one implementation, programmability is provided by a 2-bit I/D facility (referred to below as "id_ratio") that can be read and written by software. The definitions of the settings of this facility, shown in Table 2 below, are for the illustrated implementation; the invention can readily be adapted or extended to other cache ratios.

Table 2

id_ratio | Definition
00 | 100% of the cache is allocated to both instructions and data
01 | 50% of the cache is allocated to instructions only; 50% is allocated to both instructions and data
10 | 50% of the cache is allocated to data only; 50% is allocated to both instructions and data
11 | Reserved

The programmable I/D ratio is implemented by modifying the victim replacement algorithm of a set-associative cache. In the implementation below, the cache is 8-way set associative (with eight members, denoted a, b, c, d, e, f, g, and h) and a 7-bit LRU algorithm is used. In this implementation, the normal victim selection logic is described by the following Boolean equations, which represent the prior-art 7-bit LRU algorithm (in these equations, "^" is logical NOT (inversion), "&" is logical AND, and "+" is logical OR).

victim_is_member_a = ^lru_bits(0) & ^lru_bits(1) & ^lru_bits(3);

victim_is_member_b = ^lru_bits(0) & ^lru_bits(1) & lru_bits(3);

victim_is_member_c = ^lru_bits(0) & lru_bits(1) & ^lru_bits(4);

victim_is_member_d = ^lru_bits(0) & lru_bits(1) & lru_bits(4);

victim_is_member_e = lru_bits(0) & ^lru_bits(2) & ^lru_bits(5);

victim_is_member_f = lru_bits(0) & ^lru_bits(2) & lru_bits(5);

victim_is_member_g = lru_bits(0) & lru_bits(2) & ^lru_bits(6);

victim_is_member_h = lru_bits(0) & lru_bits(2) & lru_bits(6);
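The eight prior-art equations above can be transcribed directly into a short software model, which makes it easy to check that every setting of the seven LRU bits selects exactly one victim. The function name `select_victim_lru` is hypothetical; the sketch only mirrors the equations and does not model how the LRU bits are updated on a cache access.

```python
def select_victim_lru(lru_bits):
    """Return the victim member ('a'..'h') for seven LRU bits lru_bits[0..6]."""
    b = [bool(x) for x in lru_bits]
    if len(b) != 7:
        raise ValueError("expected exactly seven LRU bits")

    # Direct transcription of the victim_is_member_x equations.
    equations = {
        "a": not b[0] and not b[1] and not b[3],
        "b": not b[0] and not b[1] and b[3],
        "c": not b[0] and b[1] and not b[4],
        "d": not b[0] and b[1] and b[4],
        "e": b[0] and not b[2] and not b[5],
        "f": b[0] and not b[2] and b[5],
        "g": b[0] and b[2] and not b[6],
        "h": b[0] and b[2] and b[6],
    }
    chosen = [member for member, selected in equations.items() if selected]
    if len(chosen) != 1:
        raise AssertionError("equations must select exactly one victim")
    return chosen[0]
```

Bit 0 chooses between the halves a-d and e-h, bits 1 and 2 choose a pair within the selected half, and bits 3 to 6 choose the member within the selected pair, so the seven bits form a binary decision tree over the eight members.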

To modify the I/D ratio, the selected victim is restricted to certain congruence class members, as follows, depending on the setting of "id_ratio" and on whether the CPU is requesting an instruction read (i_read) or a data read (^i_read).

d50_mode = (id_ratio = "01");

i50_mode = (id_ratio = "10");

gate_abcd = ^((d50_mode & ^i_read) + (i50_mode & i_read));

When the "gate_abcd" signal is "1", congruence class members a, b, c, or d may be used as the victim for replacement. When "gate_abcd" is "0", one of members e, f, g, or h must be used as the victim. The victim selection equations are therefore modified as follows.

victim_is_member_a = gate_abcd & ^lru_bits(0) & ^lru_bits(1) & ^lru_bits(3);

victim_is_member_b = gate_abcd & ^lru_bits(0) & ^lru_bits(1) & lru_bits(3);

victim_is_member_c = gate_abcd & ^lru_bits(0) & lru_bits(1) & ^lru_bits(4);

victim_is_member_d = gate_abcd & ^lru_bits(0) & lru_bits(1) & lru_bits(4);

victim_is_member_e = (^gate_abcd + lru_bits(0)) & ^lru_bits(2) & ^lru_bits(5);

victim_is_member_f = (^gate_abcd + lru_bits(0)) & ^lru_bits(2) & lru_bits(5);

victim_is_member_g = (^gate_abcd + lru_bits(0)) & lru_bits(2) & ^lru_bits(6);

victim_is_member_h = (^gate_abcd + lru_bits(0)) & lru_bits(2) & lru_bits(6);
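The gating signal and the modified victim equations can be modeled together as in the following sketch. The function names are hypothetical, and "id_ratio" is passed as a two-character string purely for readability; the reserved setting "11" is not treated specially here and simply falls through to gate_abcd = "1", as the equations themselves would.

```python
def gate_abcd(id_ratio: str, i_read: bool) -> bool:
    """True when members a-d are allowed as victims for this read request."""
    d50_mode = id_ratio == "01"  # data limited to members e-h
    i50_mode = id_ratio == "10"  # instructions limited to members e-h
    return not ((d50_mode and not i_read) or (i50_mode and i_read))


def select_victim(lru_bits, id_ratio: str, i_read: bool) -> str:
    """Return the victim member ('a'..'h') under the modified equations."""
    b = [bool(x) for x in lru_bits]
    if len(b) != 7:
        raise ValueError("expected exactly seven LRU bits")
    gate = gate_abcd(id_ratio, i_read)
    upper = (not gate) or b[0]  # term shared by members e, f, g, h

    equations = {
        "a": gate and not b[0] and not b[1] and not b[3],
        "b": gate and not b[0] and not b[1] and b[3],
        "c": gate and not b[0] and b[1] and not b[4],
        "d": gate and not b[0] and b[1] and b[4],
        "e": upper and not b[2] and not b[5],
        "f": upper and not b[2] and b[5],
        "g": upper and b[2] and not b[6],
        "h": upper and b[2] and b[6],
    }
    chosen = [member for member, selected in equations.items() if selected]
    if len(chosen) != 1:
        raise AssertionError("equations must select exactly one victim")
    return chosen[0]
```

For example, `select_victim([0] * 7, "01", True)` returns `"a"` for an instruction read, while the same LRU bits with a data read (`i_read=False`) return `"e"`, because gate_abcd is "0" and the choice is confined to members e through h.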

As an example of the use of the invention described above, consider the case where id_ratio = "01". In this case, if the CPU requests an instruction read, gate_abcd = "1" and any of the eight congruence class members may be selected as the victim for replacement. If the CPU requests a data read, only members e, f, g, or h may be selected as the victim. As a result, the entire cache can be used to store instructions, but only 50% of the cache can be used to store data. In this mode, therefore, the cache is "weighted" toward instructions. The example above provides instruction/data cache block usage ratios of 2:1, 1:1, and 1:2. Other ratios, such as 3:1, 4:1, or 8:1, can be provided, for example by adjusting the usable cache amount in increments of 12.5%; a 3-bit I/D facility could be used to provide relative usage of 12.5%, 25%, 37.5%, 50%, 62.5%, 75%, 87.5%, or 100%.
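The ratios mentioned above follow from simple arithmetic on the number of ways each class may occupy: in an 8-way cache each way is 12.5% of a set. The sketch below works this out. The encoding of a 3-bit setting as "value n allows n + 1 ways" is an assumption made for illustration only, and the function names are hypothetical.

```python
from fractions import Fraction

WAYS = 8  # 8-way set-associative cache, as in the illustrated implementation


def usable_fraction(ways_allowed: int) -> Fraction:
    """Fraction of each set that a class may occupy when limited to some ways."""
    if not 1 <= ways_allowed <= WAYS:
        raise ValueError("ways_allowed must be between 1 and 8")
    return Fraction(ways_allowed, WAYS)


def usage_ratio(instr_ways: int, data_ways: int) -> Fraction:
    """Instruction/data usage ratio for the given way limits."""
    return usable_fraction(instr_ways) / usable_fraction(data_ways)


# ASSUMED 3-bit encoding: setting n (0..7) lets a class use n + 1 ways.
percentages = [float(usable_fraction(n + 1) * 100) for n in range(8)]
# percentages == [12.5, 25.0, 37.5, 50.0, 62.5, 75.0, 87.5, 100.0]
```

With id_ratio = "01", instructions may use all eight ways and data only four, so `usage_ratio(8, 4)` is 2, that is, the 2:1 ratio; limiting data to two ways or one way would give 4:1 or 8:1 under the same reasoning.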

This novel cache allocation design provides a programmable instruction/data split ratio. It allows a software application or the operating system to tune the weighting of instructions versus data in the cache in real time for optimal performance. The I/D cache ratio setting can be changed at any time, without the software first having to save the state of the CPU and cache. The technique can also be implemented in hardware, by monitoring the relative usage of instruction reads versus data reads. Apart from the LRU victim selection logic, the cache controller logic operates in the same way regardless of which I/D ratio mode is in use. This programmability can be adapted for use in all types of caches (in-line, lookaside, write-through, and so on). The implementation described above uses an 8-way set-associative cache, but the invention can be applied to any degree of associativity (2-way or greater). Likewise, the implementation described above uses a 7-bit LRU algorithm, but the invention can be applied to other LRU algorithms as well. Because the victim selection logic is used as the means by which the variable I/D weighting is achieved, the invention can be implemented with very little logic circuitry.
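One way a hardware monitor of instruction reads versus data reads might choose the "id_ratio" setting is sketched below. The text does not specify any selection policy, so everything here is an assumption for illustration: the class name `RatioMonitor`, the sampling window, and the 2x threshold are all hypothetical.

```python
class RatioMonitor:
    """Counts instruction and data reads and picks an id_ratio per window.

    ASSUMPTIONS: the window length and the bias threshold are illustrative
    values, not taken from the specification.
    """

    def __init__(self, window: int = 1024, bias: float = 2.0):
        self.window = window
        self.bias = bias
        self.i_reads = 0
        self.d_reads = 0
        self.id_ratio = "00"  # start with the cache fully shared

    def observe(self, i_read: bool) -> str:
        """Record one read request and return the current id_ratio setting."""
        if i_read:
            self.i_reads += 1
        else:
            self.d_reads += 1

        if self.i_reads + self.d_reads >= self.window:
            if self.i_reads >= self.bias * self.d_reads:
                self.id_ratio = "01"  # weight the cache toward instructions
            elif self.d_reads >= self.bias * self.i_reads:
                self.id_ratio = "10"  # weight the cache toward data
            else:
                self.id_ratio = "00"  # no strong bias: share the whole cache
            self.i_reads = 0
            self.d_reads = 0
        return self.id_ratio
```

Because the setting only steers future victim selection, a monitor of this kind could change it between windows without flushing or saving cache state, which is consistent with the statement above that the setting can be changed at any time.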

Yet another way of improving cache efficiency according to the present invention concerns the mechanism for evicting cache blocks, in a manner other than adjusting the relative cache usage of the two value classes (instructions or data). Even when the techniques described above for improving cache efficiency are employed, some level of striding may remain, particularly because of cyclical patterns that arise between the allocation of memory blocks and their respective cache blocks. In this case, a method can be provided that further modifies the cache replacement algorithm (for example, LRU) to introduce a defined element of randomness, which breaks up inefficient cyclical evictions and thereby reduces strides.

One embodiment of this aspect of the invention is shown in FIG. 6. Cache 60 includes several components: a cache entry array 62 of the various values stored in the cache, a cache directory 64 for tracking the entries, and a replacement control unit 66 that uses an LRU algorithm selectively modified by a random factor. In this embodiment there are four possible variations of the replacement control unit for introducing the element of randomness. In the first variation 68, in which no randomization is introduced, seven bits are used to select the LRU cache block within an 8-block set (that is, the cache is 8-way associative), and no additional bits are required for any randomizer.

If some randomization is desired, the second variation 70 modifies the replacement algorithm to introduce a small amount of randomness. Only three LRU bits are used, to select among four groups within a given congruence class (cache set); each group contains one quarter of the class, or two blocks in the case of an 8-way associative cache. After this 2-member group (subclass) is selected, a single random bit is used to pick one of the two blocks in the group. If more randomization is desired, the third variation 72 uses a 1-bit LRU algorithm to divide the original congruence class into two subclasses (four blocks each when the cache is 8-way associative), and two random bits are used to select one of the four members of the subclass. Finally, in the last variation 74, no LRU bits are used, and three random bits completely determine the block to evict within the eight-member class.
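The four variations can be modeled as a walk down a three-level binary tree over the eight members, where the upper levels are decided by LRU bits and the lower levels by random bits. The sketch below assumes the LRU bits are indexed as in the 7-bit victim equations given earlier (bit 0 at the root, bits 1 and 2 at the second level, bits 3 to 6 at the third); the function name and the `rand_bit` parameter are hypothetical.

```python
import random

MEMBERS = "abcdefgh"


def select_victim_with_randomness(lru_bits, random_bits,
                                  rand_bit=lambda: random.getrandbits(1)):
    """Pick a victim using LRU bits for the upper tree levels and
    `random_bits` (0, 1, 2, or 3) random bits for the remaining levels."""
    if random_bits not in (0, 1, 2, 3):
        raise ValueError("random_bits must be 0, 1, 2, or 3")
    lru_levels = 3 - random_bits

    index = 0  # position of the current node within its tree level
    first = 0  # index of the first LRU bit at this level: 0, then 1, then 3
    for level in range(3):
        if level < lru_levels:
            bit = lru_bits[first + index] & 1  # decided by LRU history
        else:
            bit = rand_bit() & 1               # decided at random
        index = index * 2 + bit
        first = first * 2 + 1
    return MEMBERS[index]
```

With `random_bits=0` the function reproduces the plain 7-bit LRU choice; with `random_bits=1` it consults LRU bits 0 to 2 and one random bit; with `random_bits=2` it consults only LRU bit 0; and with `random_bits=3` the victim is chosen entirely at random.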

Although the LRU and random blocks are shown separately in FIG. 6, they can be combined into a single 7-bit field. In other words, this field is fully used for variation 68, but only four bits of the field are used for variation 70 (three LRU bits and one random bit) and for variation 72 (two LRU bits and two random bits), and only three bits of the field are used for variation 74.

The example of FIG. 6 is for 8-way associativity, but those skilled in the art will appreciate that the invention can be applied to other set sizes. For example, a 4-way associative set can have three variations: a first variation that uses three LRU bits and no random bits, a second variation that uses one LRU bit and one random bit, and a third variation that uses no LRU bits and two random bits. A 2-way associative set can have two variations: a first variation that uses one LRU bit and no random bits, and a second variation that uses no LRU bits and one random bit. This varying randomness is another way of optimizing evictions, and it can be used together with any of the programmable associativity, programmable congruence classes, and programmable I/D ratio described above.
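The bit counts listed for the 8-way, 4-way, and 2-way cases follow one pattern: for a set of 2^k members, a variation that uses r random bits needs 2^(k-r) - 1 LRU bits. The sketch below enumerates the variations on that basis; the function name is hypothetical, and the formula is inferred from the counts given in the description of each variation rather than stated in the text.

```python
def randomness_variations(ways: int):
    """List (lru_bits, random_bits) pairs for a set with `ways` members.

    ASSUMPTION: `ways` is a power of two and the LRU bits form a binary tree,
    so r random bits leave a tree with 2**(k - r) - 1 LRU bits.
    """
    if ways < 2 or ways & (ways - 1):
        raise ValueError("ways must be a power of two, 2 or greater")
    levels = ways.bit_length() - 1  # k, where ways == 2**k
    return [((1 << (levels - r)) - 1, r) for r in range(levels + 1)]
```

For instance, `randomness_variations(4)` yields the three 4-way variations described above, as the pairs (3, 0), (1, 1), and (0, 2).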

The improved cache described here can be used as an on-board (L1) cache or as a lower-level cache (for example, L2). Although this cache construction may be used for only one or a limited number of cache levels in the cache hierarchy, those skilled in the art will appreciate that it may be preferable to use the construction at every cache level to maximize performance. The invention is generally applicable to single-processor computer systems as well as multiprocessor computer systems.

Although the invention has been described with reference to specific embodiments, this description is not meant to be construed in a limiting sense. Upon reference to the description of the invention, those skilled in the art will appreciate that various modifications of the illustrated embodiments, as well as alternative embodiments of the invention, are possible. It is therefore understood that such modifications can be made without departing from the spirit and scope of the invention as defined in the appended claims.

According to the present invention, it is possible to design a cache that more fully optimizes its statistical advantages regardless of the type of procedure executing on the processor.

Claims

1. A method of allocating a cache used by a processor of a computer system between at least two value classes, the method comprising:

providing a logic unit connected to the cache to monitor the relative usage of the cache by each class;

selecting a predetermined cache usage ratio for the classes from among a plurality of available ratios; and

evicting cache blocks in the cache using a cache replacement mechanism that restricts replacement of evicted cache blocks to a particular one of the value classes based on the predetermined cache usage ratio.

2. The method of claim 1, wherein the two value classes are instructions and data.

3. The method of claim 1, further comprising providing a multi-bit facility to indicate how a selected victim is to be restricted to given cache blocks, wherein the logic unit selects the predetermined cache usage ratio by setting the multi-bit facility.

4. The method of claim 1, wherein the cache replacement mechanism is an LRU replacement mechanism modified to restrict replacement of evicted cache blocks to a particular one of the value classes based on the predetermined cache usage ratio.

5. The method of claim 2, wherein the plurality of available ratios includes an instruction/data cache block usage ratio of 1:1.

6. The method of claim 2, wherein the plurality of available ratios includes an instruction/data cache block usage ratio of 1:2.

7. The method of claim 2, wherein the plurality of available ratios includes an instruction/data cache block usage ratio of 2:1.

8. The method of claim 2, wherein the plurality of available ratios includes instruction/data cache block usage ratios of 1:1, 1:2, and 2:1, the method further comprising providing a 2-bit facility to indicate which of the instruction/data cache block usage ratios is to be used as the predetermined cache usage ratio.

9. The method of claim 8, wherein the cache replacement mechanism is an LRU replacement mechanism modified to restrict replacement of evicted cache blocks to instruction reads or data reads based on the predetermined cache usage ratio.

10. A computer system comprising:

a processor;

a memory device;

a cache connected to the processor and the memory device, the cache having a plurality of cache blocks for storing memory blocks corresponding to addresses of the memory device;

a logic unit connected to the cache for monitoring the relative usage of the cache by at least two value classes and for selecting a predetermined cache usage ratio for the classes from among a plurality of available ratios; and

a cache replacement mechanism for evicting cache blocks in the cache, which restricts replacement of evicted cache blocks to a particular one of the value classes based on the predetermined cache usage ratio.

11. The computer system of claim 10, wherein the two value classes are instructions and data, and the logic unit monitors the relative usage of the cache by detecting instruction read requests and data read requests.

12. The computer system of claim 10, further comprising a multi-bit facility to indicate how a selected victim is to be restricted to given cache blocks.

13. The computer system of claim 10, wherein the cache replacement mechanism is an LRU replacement mechanism modified to restrict replacement of evicted cache blocks to a particular one of the value classes based on the predetermined cache usage ratio.

14. The computer system of claim 11, wherein the plurality of available ratios includes an instruction/data cache block usage ratio of 1:1.

15. The computer system of claim 11, wherein the plurality of available ratios includes an instruction/data cache block usage ratio of 1:2.

16. The computer system of claim 11, wherein the plurality of available ratios includes an instruction/data cache block usage ratio of 2:1.

17. The computer system of claim 11, wherein the plurality of available ratios includes instruction/data cache block usage ratios of 1:1, 1:2, and 2:1, the computer system further comprising a 2-bit facility for indicating which of the instruction/data cache block usage ratios is to be used as the predetermined cache usage ratio.

18. The computer system of claim 17, wherein the cache replacement mechanism is a least-recently-used replacement mechanism modified to restrict replacement of evicted cache blocks to either instruction read requests or data read requests based on the predetermined cache usage ratio.