KR20040053153A

KR20040053153A - Method and apparatus for dispatching tasks in a non-uniform memory access (numa) computer system

Info

Publication number: KR20040053153A
Application number: KR10-2004-7005100A
Authority: KR
Inventors: 맥도날드잘웬델
Original assignee: 인터내셔널 비지네스 머신즈 코포레이션
Priority date: 2001-11-07
Filing date: 2002-11-06
Publication date: 2004-06-23
Anticipated expiration: 2022-11-06
Also published as: US20030088608A1; CN1582428A; JP3965157B2; WO2003040912A1; CA2463748A1; US8122451B2; KR100754300B1; CN100530079C; US7159216B2; US20070039002A1; JP2005508545A

Abstract

논-유니폼 메모리 액세스 컴퓨터 시스템을 위한 디스패처는, 임의의 CPU(201-204)와 관련이 없는 common 준비 큐로부터 스레드를 디스패치하지만, 더 짧은 메모리 액세스 시간을 갖는 CPU에, 스레드를 디스패치하는 것을 더 선호한다. 바람직하게는 이 시스템은 복수의 이산적인 노드(101-104)를 포함하며, 각 노드는 로컬 메모리(205)와 하나 또는 그 이상의 CPU를 구비한다. 시스템 메인 메모리는 로컬 메모리의 집합을 포함하는 분산 메모리이다. 개별적이고 바람직한 CPU와 바람직한 노드는 각 스레드와 관련될 수 있다. CPU가 유효하게 된 경우, 이 디스패처는 적어도 어떤 상대적인 우선권(714-717;801-804)을, 상이한 노드에 바람직한 CPU를 구비한 스레드에 관해 유효한 CPU와 동일한 노드의 바람직한 CPU를 갖는 스레드에 부여한다. 이 선호도는 상대적이며, 디스패처가 궁핍함 또는 다른 문제를 회피하기 위해 기호도를 오버라이드하는 것을 방해하지 않는다.The dispatcher for a non-uniform memory access computer system dispatches threads from a common ready queue that is not associated with any CPU 201-204, but prefers to dispatch threads to CPUs with shorter memory access times. do. Preferably the system comprises a plurality of discrete nodes 101-104, each node having a local memory 205 and one or more CPUs. System main memory is distributed memory that includes a collection of local memory. Individual and preferred CPUs and preferred nodes can be associated with each thread. When the CPU is enabled, this dispatcher gives at least some relative priority (714-717; 801-804) to the thread having the desired CPU of the same node as the valid CPU with respect to the thread having the desired CPU at different nodes. . This preference is relative and does not prevent the dispatcher from overriding the preference to avoid lack or other problems.

Description

TECHNICAL AND APPARATUS FOR DISPATCHING TASKS IN A NON-UNIFORM MEMORY ACCESS (NUMA) COMPUTER SYSTEM}

현대의 컴퓨터 시스템은 일반적으로, 중앙 처리 장치(CPU)와, 통신 버스 및 메모리와 같이, 정보를 저장, 검색, 전송하는데 필요한 지원 하드웨어를 포함한다. 또한 컴퓨터 시스템은 입/출력 제어기, 또는 저장 제어기와 같이 외부와 통신하는데 필요한 하드웨어 및 키보드, 모니터, 테이프 드라이브, 네트워크와 결합된 통신 라인 등과 같은 제반 장치를 포함한다. CPU는 이 시스템의 중심이다. CPU는 하나의 컴퓨터 프로그램을 포함하고, 다른 시스템 요소의 조작을 지시하는 명령을 수행한다.Modern computer systems generally include a central processing unit (CPU) and supporting hardware for storing, retrieving, and transmitting information, such as communication buses and memories. The computer system also includes hardware necessary for communicating with the outside, such as an input / output controller or storage controller, and various devices such as keyboards, monitors, tape drives, communication lines coupled with a network, and the like. The CPU is the heart of this system. The CPU includes one computer program and executes instructions for instructing operation of other system elements.

컴퓨터의 하드웨어의 관점에서 보면, 대부분의 시스템은 기본적으로 동일한 방식으로 작동한다. 프로세서들은 수학적, 논리적인 비교와 하나의 장소에서 다른 장소로 데이터를 이동시키는 것과 같은 매우 단순한 조작의 제한적인 집합을 실행할 수 있다. 그러나, 각 조작들은 매우 빠르게 실행된다. 컴퓨터로 하여금 이러한 다수의 단순한 조작들을 실행하도록 하는 프로그램들은 컴퓨터가 정교한 작업을 수행할 것이라는 환상을 준다. 사용자가 컴퓨터의 새롭고, 개량된 성능으로 인식하도록 하는 것은, 본질적으로 매우 단순한 조작의 동일한 집합의 실행이지만, 매우 빠르게 실행함으로써 가능해진다. 그러므로, 컴퓨터 시스템의 연속적인 개량은 이러한 시스템이 더욱 빨라지게 되는 것을 필요로 한다.From the computer's hardware point of view, most systems basically work the same way. Processors can perform a limited set of very simple operations, such as mathematical and logical comparisons and moving data from one place to another. However, each operation is executed very quickly. Programs that cause a computer to perform many of these simple operations give the illusion that the computer will perform sophisticated tasks. It is possible to allow the user to perceive the new, improved performance of the computer, in essence, by performing the same set of very simple operations, but by performing very quickly. Therefore, continuous improvement of computer systems requires these systems to be faster.

컴퓨터 시스템의 전체 속도[또한 작업처리량(throughput)이라 호칭됨]는 있는 그대로 단위 시간 당 실행되는 조작들의 수로써 측정된다. 개념적으로, 시스템 속도의 모든 가능한 개량 중 가장 간단한 것은, 다양한 구성 요소의 클락 속도를 증가시키는 것이고, 특히 프로세서의 클락 속도를 증가시키는 것이다. 예컨대, 만약 모든 것이 두 번씩 고속으로, 그러나 동일한 방법으로 정확하게 실행된다면, 요소 크기를 줄이고, 요소 수를 감소시키고, 궁극적으로는 단일 칩 상의 집적 회로로써 전체 프로세서를 패키징함에 의해 상당한 속도의 개량이 가능할 것이다. 크기의 감소로 인해, 프로세서의 클락 속도를 증가시키고, 따라서 시스템 속도를 증가시킬 수 있게 된다.The overall speed (also called throughput) of a computer system is measured as is the number of operations performed per unit time as is. Conceptually, the simplest of all possible improvements in system speed is to increase the clock speed of various components, in particular to increase the clock speed of the processor. For example, if everything is executed correctly at high speed, but in the same way twice, significant speed improvements will be possible by reducing the element size, reducing the number of elements, and ultimately packaging the entire processor as an integrated circuit on a single chip. will be. Due to the reduction in size, it is possible to increase the clock speed of the processor and thus increase the system speed.

집적 회로로부터 얻어진 현저한 개량에도 불구하고, 매우 빠른 컴퓨터 시스템을 위한 요구는 계속되고 있다. 하드웨어 개발자들은 더 진보된 집적화(환언하면, 단일 칩 상에 패키지화되는 회로의 수를 증가시킴), 회로의 크기의 감소 및 다른 다양한 기술로써 속도 면에서 상당한 개량을 얻을 수 있었다. 그러나, 개발자들은 물리적인 크기의 감소가 막연히 계속될 수는 없다는 것과, 프로세서의 클락 속도를 증가시킬 수 있는 능력에 한계가 있다는 것을 알고 있다. 그러므로, 컴퓨터 시스템의 작업 처리량에 있어서의 현저한 개량을 위해서는 다른 접근이 필요하다는 것에 관심이 모아지고 있다.Despite the significant improvements obtained from integrated circuits, there is a continuing need for very fast computer systems. Hardware developers have gained significant improvements in speed with more advanced integration (in other words, increasing the number of circuits packaged on a single chip), reducing the size of the circuit, and various other technologies. However, developers know that the reduction in physical size cannot continue vaguely, and there is a limit to the ability to increase the clock speed of the processor. Therefore, attention is drawn to the need for a different approach for significant improvements in the throughput of computer systems.

클락 속도의 변동 없이, 복수의 프로세서를 사용함으로써 시스템의 작업 처리량을 개량하는 것이 가능하다. 적당한 비용으로 집적 회로 칩상에 각 프로세서들을 패키지화하는 것이 이러한 실제적인 접근이 되어 왔다. 그러나, 프로세서를 하나에서 둘로 증가시킴으로써 단순히 시스템의 작업 처리량이 배가 되지는 않는다. 시스템으로 복수의 프로세서를 도입함으로 인해 수많은 아키텍쳐상의 문제가 생성되었다. 예컨대, 복수의 프로세서는 일반적으로 동일한 메인 메모리를 공유한다(각 프로세서가 자신의 공유의 캐시를 구비하고 있음에도 불구하고). 그러므로, 메모리 액세스 충돌을 방지하고 캐시 내의 데이터의 여분 카피들이 일관된 형식으로 트랙되어 있다는 것을 보증하는 매카니즘을 고안할 필요가 있다. 더 나아가, 각 프로세서는 저장 장치, I/O, 메모리와, 특히 다양한 요소들을 연결하는 통신 버스와 같은 시스템의 다른 요소에 부가적인 요구를 더한다. 더 많은 프로세서들이 도입되고 있기 때문에, 이러한 아키텍처상의 문제는 점점 복잡해지고 있으며, 범용성은 점점 곤란해지며, 프로세서들이 또다른 프로세서에 의해 사용되는 임의의 자원을 대기하는데 상당한 시간을 보낼 가능성이 더 높다. 시스템 개발자들은 이러한 모든 문제들을 알고 있으며, 이 문제들이 하나 또는 그 밖의 형식으로 제기되어 왔다. 완벽한 해결책은 없지만, 이 분야의 개량이 계속되어 왔다.It is possible to improve the throughput of the system by using a plurality of processors without changing the clock speed. Packaging each processor on an integrated circuit chip at a reasonable cost has been this practical approach. However, increasing the processor from one to two does not simply double the system's throughput. The introduction of multiple processors into the system has created a number of architectural problems. For example, multiple processors typically share the same main memory (although each processor has its own cache). Therefore, there is a need to devise a mechanism to prevent memory access conflicts and to ensure that extra copies of data in the cache are tracked in a consistent format. Furthermore, each processor places additional demands on other elements of the system, such as storage, I / O, memory, and especially the communication bus that connects the various elements. As more processors are being introduced, this architectural problem is becoming increasingly complex, versatility becomes more difficult, and it is more likely that processors will spend significant time waiting for any resources used by another processor. System developers are aware of all these issues, and these issues have been raised in one or another form. There is no perfect solution, but improvements in this area have continued.

최근 부각되어온 하나의 아키텍쳐상의 접근은, 분산 공유 메모리 컴퓨터 시스템 또는 논-유니폼 메모리 액세스(NUMA)로도 알려진 프로세서와 관련된 메모리의 이산적 노드들(discrete nodes)을 구비하는 컴퓨터 시스템의 설계이다. 종래의 대칭 멀티-프로세서 시스템(symmetrical multi-processor system)에 있어서, 메인 메모리는 단일한 대형 데이터 저장 본체로써 설계되는데, 이 저장 본체란 시스템 내부의 모든 CPU들에 동일하게 접근될 수 있는 것이다. NUMA 시스템은 메인 메모리를 이산적 서브세트(subset)으로 분할함으로써 이 문제에 접근하고 있으며, 각 서브세트는 각 CPU 또는 일반적으로 CPU들의 각 집합과 물리적으로 관련되어 있다. 메모리의 서브세트들과, 관련된 CPU들과 다른 하드웨어는 종종 "노드(node)"라고 호칭된다. 하나의 노드는 일반적으로 CPU로부터 노드 내부의 로컬 메모리로 직접적인 접근을 제공하는 내부 메모리 버스를 구비한다. 더 느린, 간접적 메카니즘이 노드 경계를 넘어 메모리에 액세스하도록 존재한다. 그러므로, 임의의 CPU가 여전히 어떤 임의의 메모리 장소에 액세스하는 경우, CPU는 그 노드 외부의 어드레스에 액세스할 수 있는 것보다 더 빨리 그 자신의 노드의 어드레스에 액세스할 수 있다(그래서, 논-유니폼 메모리 액세스임). 노드의 내부 메모리 버스 상의 장치 수를 제한함으로써, 버스 중재 메카니즘(bus arbitration mechanism)과 버스 트래픽이 다수의 CPU를 구비하는 시스템 내에서조차 관리 가능한 수준으로 유지될 수 있는데, 이는 대부분의 이러한 CPU가 상이한 노드 내에 있기 때문이다. 하드웨어 관점에서 보면, 이것은 NUMA 시스템 아키텍처가 증가된 범용성의 잠재적 이점을 갖고 있다는 것을 의미한다.One architectural approach that has recently emerged is the design of a computer system with discrete nodes of memory associated with a processor, also known as a distributed shared memory computer system or non-uniform memory access (NUMA). In a conventional symmetrical multi-processor system, the main memory is designed as a single large data storage body, which is equally accessible to all CPUs in the system. NUMA systems approach this problem by dividing the main memory into discrete subsets, each subset being physically associated with each CPU or each set of CPUs in general. Subsets of memory, associated CPUs and other hardware are often referred to as "nodes." One node typically has an internal memory bus that provides direct access from the CPU to local memory inside the node. Slower, indirect mechanisms exist to access memory across node boundaries. Therefore, if any CPU still accesses any arbitrary memory location, the CPU can access its own node's address faster than it can access an address outside that node (so, non-uniform). Memory access). By limiting the number of devices on a node's internal memory bus, bus arbitration mechanisms and bus traffic can be maintained at manageable levels even in systems with multiple CPUs, where most of these CPUs are different nodes. Because it is within. From a hardware point of view, this means that the NUMA system architecture has the potential benefit of increased versatility.

NUMA 시스템은 인터-노드(inter-node) 액세스를 제공하며, 이는 단일 논리메인 메모리를 갖고 있고, 각 메인 메모리의 장소는 고유한 어드레스를 갖는다. NUMA 시스템이 효율적으로 동작하기 위해서, CPU가 필요로 하는 데이터는 일반적으로 동일한 노드의 리얼 메모리 내에 저장되어야 한다. 이것이 항상 과도하게 엄격한 제한을 강요함이 없는 경우가 됨을 보장한다는 것은 비현실적이다. 인터-노드 메모리 액세스에 대한 필요를 감소시키는 메모리 할당 메카니즘이 바람직하다.NUMA systems provide inter-node access, which has a single logical main memory, and each main memory location has a unique address. In order for a NUMA system to work efficiently, the data needed by the CPU should generally be stored in real memory on the same node. It is impractical to ensure that this is always the case without imposing excessively strict restrictions. A memory allocation mechanism that reduces the need for inter-node memory access is desirable.

멀티 태스킹 시스템 컴퓨터 시스템에 있어서, 운영 시스템은 일반적으로 임의의 시스템 자원의 할당을 관리하며, 특히, 작업들을(또는 스레드들)을 CPU에 디스패칭하고 메모리를 할당한다. 그러한 시스템에 있어서, 복수의 스레드들이 동시에 활성화된다. 보통, 활성 스레드의 수는 시스템 내부의 CPU의 수를 초과한다. 주어진 스레드는 일반적으로 일정한 수의 사이클동안 CPU 내에서 실행되고, 그 뒤 완결되지 않았음에도 불구하고, 임시적으로 중단되어, 이후에 실행을 계속하기 위해 큐 내에 위치된다. 하나의 스레드가 시간 제한에 도달했거나, 상위의 우선 스레드에 의해 사전에 비워졌거나, 저장 액세스 또는 락 릴리스와 같은 어떤 대기시간 동안 대기해야만 하거나, 또는 다른 이유 때문에, 스레드가 정지할 수도 있다. 제1 스레드가 대기하는 동안 또다른 스레드가 실행될 수 있도록 함으로써, CPU 자원이 더 충분히 이용된다. CPU가 이러한 또는 다른 이유로 스레드를 실행할 수 있도록 되는 경우, 운영 시스템 내의 디스패처(dispatcher)는 일반적으로, 복수의 대기 스레드 중 어느 것이, 실행을 위해 유효한 CPU에 디스패치될 것인가를 결정한다.Multitasking System In a computer system, the operating system generally manages the allocation of any system resources, and in particular dispatches tasks (or threads) to the CPU and allocates memory. In such a system, a plurality of threads are activated at the same time. Normally, the number of active threads exceeds the number of CPUs inside the system. A given thread generally runs within the CPU for a certain number of cycles, and then, although not completed, is temporarily suspended and placed in a queue to continue execution later. A thread may hang because a thread has reached a time limit, has been emptied by a higher priority thread, has to wait for some wait time, such as a store access or lock release, or for some other reason. By allowing another thread to run while the first thread is waiting, CPU resources are more fully utilized. If the CPU is to be able to run a thread for this or other reason, the dispatcher in the operating system generally determines which of the plurality of waiting threads will be dispatched to the CPU valid for execution.

종래의 디스패처들은 보통 대칭 멀티프로세서 컴퓨터 시스템을 위해 설계되며, 이 시스템 내의 메모리는 동등하게 모든 CPU에 액세스할 수 있으나, 작업 디스패칭 상의 논-유니폼 메모리 액세스의 효과를 최적으로 고려할 수는 없다. 예컨대, 마이크로소프트 윈도우즈 2000^TM운영 체제에서 사용하는 디스패처 내에서, 스레드는 다양한 고려 대상에 따른 디스패치를 위해 선택되는데. 이 고려사항은 선-할당 우선권(pre-assinged priority), 큐 시간의 길이, 스레드가 동일한 CPU 상에서 지속 실행되는지 여부, CPU가 스레드를 위해 지시된 바람직한 프로세서인지 여부 등을 포함한다. 이러한 인자들은 정상적으로 목표된 CPU의 이용을 최적화하도록 의도된 것이다. 그러나, 디스패처는 CPU의 노드 위치를 고려하지 않으며, CPU들이 높은 정도로 사용될 수 있음에도 불구하고, 불필요한 다수의 인터-노드의 메모리 액세스의 결과로써, 시스템 작업 처리량이 저하된다. 일부 디스패처는 스레드 또는 작업의 CPU로의 할당에 엄격한 제한을 가할 수 있어서, 특정한 스레드는 항상 동일한 CPU 또는 동일한 노드 내에서 실행된다. 자원들이 이산 서브세트로 분할되고, 프로세스들이 각 서브세트로 할당되는 컴퓨터 시스템의 논리적 파티션이, 유사한 효과를 달성할 수 있다. 몇몇의 예에서, 이러한 효과들이 의도된다(예컨대, 프로세스들의 하나의 그룹이 다른 프로세스로부터의 간섭 없이 일정한 양의 자원들을 보장함). 그러나 이로 인해, 일부 CPU를 미흡하게 사용하거나, CPU를 과도하게 사용한 병목 현상이 나타날 수 있다.Conventional dispatchers are usually designed for symmetric multiprocessor computer systems, where the memory in the system can equally access all CPUs, but cannot optimally take into account the effects of non-uniform memory access on task dispatching. For example, within the dispatcher used by the Microsoft Windows 2000 ^™ operating system, threads are selected for dispatch according to various considerations. These considerations include pre-assinged priority, length of queue time, whether the thread is running continuously on the same CPU, whether the CPU is the preferred processor indicated for the thread, and so on. These factors are intended to optimize the utilization of the normally targeted CPU. However, the dispatcher does not consider the node location of the CPU, and although the CPUs can be used to a high degree, the system workload is degraded as a result of unnecessary memory access of many inter-nodes. Some dispatchers may impose strict limits on the allocation of threads or jobs to CPUs so that certain threads always run within the same CPU or the same node. A logical partition of a computer system in which resources are divided into discrete subsets and processes are allocated to each subset can achieve a similar effect. In some examples, these effects are intended (eg, one group of processes guarantees a certain amount of resources without interference from another process). However, this can lead to bottlenecks that are under-using some CPUs or over-using the CPUs.

NUMA 플랫폼을 위해 설계된 하나의 공지된 운영 시스템은 시퀀트 컴퓨터(SeQuent Computer, 현재 IBM의 부서)의 PTX 운영 체제이다. PTX는 복수의 구동 큐(run queue)- 각 CPU당 하나 -를 제공하며, 사용자에게 임의의 그룹의 CPU들에 대한 부가적인 구동 큐를 정의할 수 있도록 한다. 프로세스가 초기화되면, 이 프로세스는 구동 큐 중 어느 하나에 할당되며, 프로세스에 의해 생성된 모든 스레드들은, 실행을 대기하는 동안, 그 구동 큐에 위치된다. 운영 시스템은 그후 우선적으로, 이 프로세스의 스레드들은 그 할당된 구동 큐의 CPU 또는 CPU들에 디스패치하며, 다소 더 낮은 우선 레벨로는 그 할당된 구동 큐의 CPU(또는 CPU들)로써 동일한 시스템 노드 내의 CPU들에게 디스패치한다. 운영 시스템은 더 나아가 각 CPU에 대한 CPU 사용 및 on-going 기반의 각 노드에 대한 메모리 사용을 모니터하는 능력을 포함한다. 만약 특정한 노드에서의 CPU 사용 및/또는 메모리 사용이 충분히 높다면, 운영 시스템은 하나의 스레드를 우선된 CPU 또는 CPU들을 포함한 노드를 제외한 노드에 디스패치한다. 이러한 방법으로, PTX는 NUMA 아키텍처를 이용하지만, 자원 사용에 있어서 큰 차이를 나타낼 수 있는 스레드 디스패칭에 엄격한 제한을 회피한다.One known operating system designed for the NUMA platform is the PTX operating system of SeQuent Computer (currently a division of IBM). PTX provides a plurality of run queues, one for each CPU, and allows the user to define additional run queues for any group of CPUs. When a process is initialized, it is assigned to any of the drive queues, and all threads created by the process are placed in that drive queue while waiting for execution. The operating system then preferentially dispatches the threads of this process to the CPU or CPUs of the assigned drive queue, and at a somewhat lower priority level, within the same system node as the CPU (or CPUs) of the allocated drive queue. Dispatch to the CPUs. The operating system further includes the ability to monitor CPU usage for each CPU and memory usage for each node on-going. If the CPU usage and / or memory usage at a particular node is high enough, the operating system dispatches one thread to nodes other than the node containing the preferred CPU or CPUs. In this way, PTX uses the NUMA architecture, but avoids strict limitations on thread dispatching that can make a big difference in resource usage.

반드시 인식해야 하는 것은 아니지만, NUMA 시스템에 대한 개선된 디스패처에 대한 필요가 존재하며, 이 NUMA 시스템은 스레드를 디스패치하는 경우 다양한 CPU의 노드 장소를 고려하는 PTX의 중요한 장점을 갖고 있으며, 그러므로 인터-노드 메모리 액세스의 주파수를 감소시키지만, 더 단순한 운영 시스템, 특히 복수의 구동 큐 및 CPU/메모리 사용 모니터링을 지원하지 않는 운영 시스템에 적용될 수 있다.While not required, there is a need for an improved dispatcher for NUMA systems, which has the significant advantage of PTX considering node location of various CPUs when dispatching threads, and therefore inter-nodes. While reducing the frequency of memory accesses, it can be applied to simpler operating systems, especially those that do not support monitoring of multiple drive queues and CPU / memory usage.

본 발명은 멀티-태스킹 컴퓨터 시스템에 관한 것으로, 특히 복수의 중앙 처리 장치 및 논-유니폼 메모리 액세스(non-uniform memory access)를 구비하는 시스템 내에서 디스패칭되는 작업 또는 스레드에 관한 것이다.TECHNICAL FIELD The present invention relates to a multi-tasking computer system, and more particularly, to a task or thread that is dispatched within a system having a plurality of central processing units and non-uniform memory access.

도 1은 본 발명의 바람직한 실시예에 따른, 멀티-노드, 멀티프로세서 컴퓨터 시스템의 주요 요소의 상위-레벨 블럭 다이어그램이다.1 is a high-level block diagram of the main elements of a multi-node, multiprocessor computer system, in accordance with a preferred embodiment of the present invention.

도 2는 바람직한 실시예에 따른, 멀티-노드 컴퓨터 시스템의 전형적인 노드의 주요 하드웨어 요소의 블럭 다이어그램이다.2 is a block diagram of the major hardware elements of a typical node of a multi-node computer system, according to a preferred embodiment.

도 3은 바람직한 실시예(100)에 따른 멀티-노드 컴퓨터 시스템 내에서 상이한 추상화 레벨에서의 하드웨어 및 소프트웨어 함수의 분할을 보여주는 개념적인 설명도이다.3 is a conceptual explanatory diagram showing the division of hardware and software functions at different levels of abstraction within a multi-node computer system according to a preferred embodiment 100.

도 4는 바람직한 실시예에 따른, 디스패터에 의해 사용되는 유효한 프로세서를 대기하는 스레드의 준비 큐 구조를 묘사한다.4 depicts a ready queue structure of a thread waiting for a valid processor to be used by the dispatcher, according to a preferred embodiment.

도 5는 바람직한 실시예에 따른, 디스패처에 의해 사용되는 준비 큐로부터의 임의의 스레드-특정 정보를 설명한다.5 illustrates any thread-specific information from the ready queue used by the dispatcher, according to a preferred embodiment.

도 6은 바람직한 실시예에 따른, 임의의 스레드 제어 값의 초기화를 보여주는 상위-레벨 흐름도이다.6 is a high-level flow chart showing the initialization of any thread control value, according to a preferred embodiment.

도 7A 및 7B는 바람직한 실시예에 따른, 실행될 스레드의 선택을 보여주는 택일적인 흐름도이다.7A and 7B are alternative flow diagrams illustrating the selection of threads to be executed, according to a preferred embodiment.

도 8은 바람직한 실시예에 따른, 새로운 준비 스레드를 실행하기 위한 CPU 선택을 보여주는 흐름도이다.8 is a flow chart showing CPU selection for executing a new prepare thread, according to a preferred embodiment.

본 발명에 따르면, 논-유니폼 메모리 액세스 컴퓨터 시스템을 위한 디스패처는 모든 스레드들을 단일 공동 준비 큐[common ready queue, 또한 구동 큐(run queue)로 알려짐]로부터 디스패치하며, 이 큐는 어떠한 CPU나 CPU 그룹과도 우선적으로 관련되지 않는다. 디스패처가, 스레드가 필요로 하는 상대적으로 대량의 데이터를 포함할 가능성이 높은 메모리 서브세트로 액세스하는데 더 짧은 메모리 액세스 시간을 구비하는 CPU에게 스레드를 우선적으로 디스패치하는 경우에, 디스패처는 CPU의 물리적인 위치를 고려한다.According to the present invention, a dispatcher for a non-uniform memory access computer system dispatches all threads from a single common ready queue (also known as a run queue), which queue is any CPU or CPU group. Transition is not primarily related. When the dispatcher preferentially dispatches a thread to a CPU that has a shorter memory access time to access a subset of memory that is more likely to contain a relatively large amount of data that the thread needs, the dispatcher is the physical Consider the location.

바람직한 일 실시예에 있어서, NUMA 시스템은, 각각이 로컬 메모리를 갖는 복수의 이산 노드(discrete node), 하나 또는 이상의 CPU, 내부 노드 버스 및 노드 상호간의 통신을 위한 인터페이스의 시스템으로써 설계된다. 시스템 메인 메모리는 각 노드에 로컬 메모리들의 연합을 포함하는 분산 메모리이다. 프로세서 노드 내의 위치로의 메모리 액세스는 노드 경계를 가로지르는 메모리 액세스보다 더 빠르다.In a preferred embodiment, a NUMA system is designed as a system of a plurality of discrete nodes, each having local memory, one or more CPUs, an internal node bus, and an interface for inter-node communication. System main memory is distributed memory that includes an association of local memories to each node. Memory access to locations within a processor node is faster than memory access across node boundaries.

바람직한 일 실시예에 있어서, 각 바람직한 CPU는 각 스레드들과 관계된다. CPU가 유효하게 되는 경우, 디스패처는 상이한 노드의 바람직한 CPU를 갖고 있는 스레드에 관해 유효한 CPU와 동일한 노드 내의 바람직한 CPU를 갖는 하나의 스레드에게 적어도 일정한 상대적 우선권을 부여한다. 이것은 상대적인 우선권으로서, 절대적인 제한이 아니다. 여전히 스레드의 바람직한 CPU와 동일한 노드에 있지 않은 CPU로의 디스패치를 위한 스레드를 선택하는 것이 가능하며, 그러므로 이 스레드 디스패칭 선택을 엄격하게 너무 엄격하게 제한함으로써 발생할 수 있는 궁핍(starvation)이나 다른 문제를 피할 수 있다. 바람직한 일 실시예에 있어서, "이상적인 노드(ideal node)"라 불리는, 바람직한 노드가 사용자 프로세스에 일반적으로 할당된다. 프로세스가 스레드를 생성하면, 이 스레드는 프로세스의 이 이상적인 노드를 승계한다. 부가적으로, 이상적인 노드 내의 CPU는 스레드에 대한 "이상적인 프로세서"로 선택된다. 단일 프로세스에 의해 생성된 스레드에 대한 이상적인 프로세서의 선택은 일반적으로 라운드-로빈(round-robin) 기반에서 회전한다. 다른 선택 기준이 동일하다면, 스레드는 우선적으로 이상적인 프로세서에 첫번째로 디스패치되며, 두번째로는 이상적인 노드에 디스패치된다. 일정한 환경 하에서는, 디스패처가 상이한 이상적인 노드를 갖는 스레드를 디스패치하기보다는 아이들 프로세서를 선택할 수도 있지만, 다른 환경에서는 그러한 스레드를 디스패치할 수도 있다.In one preferred embodiment, each preferred CPU is associated with each thread. When the CPU is enabled, the dispatcher gives at least a certain relative priority to one thread having the preferred CPU in the same node as the valid CPU with respect to the thread having the preferred CPU of the different node. This is a relative priority, not an absolute limitation. It is still possible to select a thread for dispatch to a CPU that is not on the same node as the thread's desired CPU, thus avoiding starvation or other problems that can arise by restricting this thread dispatching choice too strictly. Can be. In one preferred embodiment, a preferred node, referred to as an "ideal node," is generally assigned to a user process. When a process creates a thread, that thread inherits this ideal node of the process. In addition, the CPU in the ideal node is chosen as the "ideal processor" for the thread. The selection of the ideal processor for a thread created by a single process generally rotates on a round-robin basis. If the other selection criteria are the same, the thread is dispatched first to the ideal processor and second to the ideal node. Under certain circumstances, the dispatcher may choose an idle processor rather than dispatch threads with different ideal nodes, but in other environments it may dispatch such threads.

논-유니폼 메모리 액세스를 설명하는 다양한 선택적인 디스패칭 기술이 가능하다. 선택적으로, CPU가 유효한 경우, 디스패처는, 상이한 노드에서 최근에 실행된 스레드에 대한 유효한 CPU와 동일한 노드의 CPU 상에서 최근에 실행된 스레드에 적어도 일부의 상대적 우선권을 준다.Various alternative dispatching techniques are possible that account for non-uniform memory access. Optionally, if the CPU is available, the dispatcher gives at least some relative priority to the recently executed thread on the CPU of the same node as the valid CPU for the recently executed thread on a different node.

모든 스레드에 대한 단일 공동 준비 큐(single common ready queue)의 사용은, 어떤 특정한 CPU 또는 CPU의 그룹과 관계되지 않고, 다양한 비-NUMA 운영 시스템에 일관된다. 여기서 설명된 본 발명의 실시예에 따른 동일한 노드에 스레드를 디스패치하는데 대한 상대적인 우선권을 유지함으로써, 스레드는 동일한 노드에서 실행되기 쉬우며, 노드의 이상적인 메모리는 이 스레드가 필요한 데이터에 비례하는 대량의 공유를 축적하기가 쉽다. 결과적으로, 인터-노드 메모리 액세스의 주파수는 스레드를 디스패치하는 경우 노드의 위치를 고려하지 않는 시스템의 주파수보다 감소된다. 동시에, 엄격한 노드 제한은 회피되고, 전 시스템의 사용이 가능하고 궁핍과 다른 문제들도 피할 수 있게 된다.The use of a single common ready queue for all threads is consistent with various non-NUMA operating systems, regardless of any particular CPU or group of CPUs. By maintaining the relative priority of dispatching a thread to the same node according to the embodiment of the present invention described herein, the thread is likely to run on the same node, and the ideal memory of the node has a large amount of sharing proportional to the data needed by this thread. It is easy to accumulate. As a result, the frequency of the inter-node memory access is reduced than the frequency of the system, which does not consider the position of the node when dispatching threads. At the same time, strict node limitations are avoided, the entire system can be used, and deprivation and other problems can be avoided.

본 발명의 다른 형태와 장점들은, 도면과 함께 본 발명이 바람직한 실시예의 상세한 설명에서 더 명백해질 것이다.Other aspects and advantages of the invention will become more apparent from the following detailed description of the preferred embodiment of the invention, in conjunction with the drawings.

<개요><Overview>

여기서 설명하는 바와 같이, 멀티프로세서, 논-유니폼 메모리 액세스(NUMA) 컴퓨터 시스템을 위한 디스패처는 단일한, 광범위 준비 큐를 위한 스레드들을 디스패치하며, 다양한 프로세서로의 디스패치를 위한 스레드 또는 작업들을 선택하는데 있어 노드의 유사성을 고려하여, 각 스레드는 일관된 노드에서 실행되기 쉬우며, 스레드가 필요로 하는 메모리 페이지는 그 노드의 로컬 리얼 메모리(local real memory) 내에 축적되기 쉽다. 일관성을 위해, "스레드(thread)"라는 용어는 여기서 고유한 상태를 갖는 컴퓨터 실행가능 명령 순서의 예로써 기술되기 위해 사용되며, 이는 디스패처에 의해 디스패치되는 실체이다. 일부의 환경에 있어서, 이는 "작업(task)"으로써 참조되며, 여기서 "스레드"와 "작업"은 구별되지 않는다. "스레드"라는 용어는 프로세스가 단일 프로그램으로부터 현재 실행가능한 복수의 스레드를 생성한다는 것을 의미하기 위해 사용된다; 그러나, 여기서는, 그렇게 제한적으로 사용되지 않고, 프로세스는 오직 단일 실행 스레드만을 발생하거나, 복수의 스레드를 발생한다.As described herein, the dispatcher for a multiprocessor, non-uniform memory access (NUMA) computer system dispatches threads for a single, broadly ready queue, and in selecting threads or tasks for dispatch to various processors. Given the similarity of nodes, each thread is likely to run on a consistent node, and the memory pages that a thread needs are likely to accumulate in that node's local real memory. For consistency, the term "thread" is used herein to be described as an example of a computer executable instruction sequence having a unique state, which is the entity dispatched by the dispatcher. In some circumstances, this is referred to as a "task", where "thread" and "task" are not distinguished. The term "thread" is used to mean that a process creates a plurality of threads that are currently executable from a single program; However, here, not so limitedly used, a process may generate only a single thread of execution or a plurality of threads.

도 1은 본 발명의 바람직한 실시예에 따른 멀티-노드, 멀티프로세서 컴퓨터 시스템(100)의 주요 하드웨어 요소의 상위-레벨 블럭 다이어그램이다. 컴퓨터 시스템(100)은 분산 공유 메모리(Distributed Shared Memory, DSM)를 기반한 컴퓨터 아키텍처를 사용하며, 이것이 NUMA 시스템이다. 컴퓨터 시스템(100)은 도 1에 예시적으로(4개) 설명된 복수의 노드(101-104)를 포함하며, 노드의 수는 가변적이다. 이 노드는 다른 노드들과 통신하도록 하는 인터-노드 통신 네트워크(105)에 의해 연결된다. 인터-노드 통신 네트워크의 목적은 장치들이 노드의 경계를 넘어 통신할 수 있도록 하며, 특히 임의의 노드 내의 프로세서가 다른 노드에 상주하는 메모리에 액세스할 수 있게 한다. 바람직한 실시예에 있어서, 인터-노드 네트워크(105)는 스위치 기반 네트워크이다. IEEE 1596-1992 표준을 따르는 Scalable Coherent Interface(SCI) 인터커넥션 매카니즘을 사용하는 스위치 기반 네트워크이다. SCI 는 각 개별적인 포인트-투-포인트 인터커넥트 상에서 패킷을 전송하고 그 시스템을 통해 캐시 일관성을 위해 제공하는 펌프 버스에 의해 구현되는 고 대역폭 인터커넥션 네트워크이다. SCI에 관한 더 많은 정보는 여기서 참조되는 IEEE 1596(Aug 3,1993)에서 찾을 수 있다.1 is a high-level block diagram of the major hardware elements of a multi-node, multiprocessor computer system 100 in accordance with a preferred embodiment of the present invention. Computer system 100 uses a computer architecture based on Distributed Shared Memory (DSM), which is a NUMA system. Computer system 100 includes a plurality of nodes 101-104 described illustratively (four) in FIG. 1, the number of nodes being variable. This node is connected by an inter-node communication network 105 that allows communication with other nodes. The purpose of an inter-node communication network is to allow devices to communicate across node boundaries, and in particular to allow processors within any node to access memory residing on another node. In a preferred embodiment, inter-node network 105 is a switch based network. A switch-based network that uses the Scalable Coherent Interface (SCI) interconnect mechanism that conforms to the IEEE 1596-1992 standard. SCI is a high bandwidth interconnect network implemented by a pump bus that transmits packets on each individual point-to-point interconnect and provides for cache coherency through the system. More information about SCI can be found in IEEE 1596 (Aug 3,1993) referenced herein.

인터-노드 네트워크(105), 바람직하게는 SCI 호환 통신 매체, 종래에 존재했거나 그후에 개발된 다양한 임의의 대체물들이 사용될 수 있다. 인터-노드 통신 매체는 바람직하게는 고 대역폭과 낮은 대기시간을 제공하며, 더 많은 노드의 추가를 고려하기 위해 가변적이다. 적합한 매체는 고 데이터 작업 처리량을 갖는 포인트-투-포인트 인터커넥션 링크를 포함한다(예컨대, 초당 1 기가 바이트 또는 이상). 이 링크는, 링 토폴로지(ring topology), 스위치를 통한 임의의 토폴로지 또는 양자의 조합과 같은 적합하고 다양한 방법으로 구성될 수 있다. 이 링크는 시스템의수행 요구에 따라 유선 또는 무선이 될 수 있다(선택적인, RF 등). 부가적인 토폴로지의 예는 "Interconnect Topologies with Point-To-Point Rings"(Ross E.Johnson and James E. Goodman December 1991, Computer Science Technical Report #1058, University of Wisconsin-Madison)에 기술되어 있으며, 그곳에 설명된 모든 예들이 적합한 네트워크의 모든 종류로 망라될 필요는 없다.Inter-node network 105, preferably an SCI compatible communication medium, may be used in any of a variety of alternatives that have existed or since developed. Inter-node communication media preferably provide high bandwidth and low latency and are variable to allow for the addition of more nodes. Suitable media include point-to-point interconnect links with high data throughput (eg, 1 gigabyte or more per second). This link can be configured in any suitable and various ways such as a ring topology, any topology through a switch or a combination of both. This link can be wired or wireless (optional, RF, etc.) depending on the performance requirements of the system. Examples of additional topologies are described in "Interconnect Topologies with Point-To-Point Rings" (Ross E. Johnson and James E. Goodman December 1991, Computer Science Technical Report # 1058, University of Wisconsin-Madison). All examples do not need to be covered by all kinds of suitable networks.

도 2는 바람직한 실시예에 따른, 컴퓨터 시스템(100)의 일반적인 노드(101)의 주요 하드웨어 요소의 블럭 다이어그램이다. 여기에 포함된 설명과 일관되게, 노드는 포괄적으로 참조 번호(101)로써 지시되며, 이것은 임의의 노드(101-104)를 지시하는 것과 같다. 노드(101)는 명령들과 분산된 메인 메모리로부터의 다른 데이터 상의 장치 처리 함수(machine processing function)를 수행하는 복수의 중앙 처리 장치(CPU, 201-204)를 포함한다. 각 CPU(201-204)는 데이터 및 명령들의 임시적인 저장을 위해 각각의 캐시(205-208)를 포함하거나 제어한다. 대형의 멀티프로세서 컴퓨터 시스템에 대해, 캐시는 일반적으로, 복수의 레벨과 복수의 구조로써 존재한다. 예컨대, CPU는 CPU(L1 명령 캐시) 상에서 실행하는 명령들의 저장에 공헌하는 레벨 1 캐시와, CPU(L1 데이터 캐시)에 의해 조종되는 명령을 제외한 데이터의 저장에 공헌하는 물리적으로 분리된 레벨 1 캐시와, 명령 및 다른 데이터 양자를 저장하는 레벨 2 캐시(L2 캐시)를 포함할 수 있으며, L2 캐시는 L1 명령 캐시 및 L1 데이터 캐시를 피드(feed)하는데 사용된다. 캐시 구조 또는 구조들은 도 2에서 각 개별적인 프로세서에 대한 단일 블럭(205-208)로써 단순화된 형태로 나타난다. 본 발명을 위해서, 각 프로세서 상의 캐시를 정확하게 구체적으로 구현하는 것은 중요하지 않다. 다수의 다른 변화가 가능하며, 본 발명은 어떤 특정한 캐시 설계에 제한을 두지 않으며, 캐시의 사용을 반드시 요구하는 것도 아니다.2 is a block diagram of the major hardware elements of a general node 101 of a computer system 100, according to a preferred embodiment. Consistent with the description contained herein, a node is indicated generically by the reference numeral 101, which is equivalent to referring to any node 101-104. Node 101 includes a plurality of central processing units (CPUs) 201-204 that perform instructions and machine processing functions on other data from distributed main memory. Each CPU 201-204 includes or controls a respective cache 205-208 for temporary storage of data and instructions. For large multiprocessor computer systems, caches generally exist in multiple levels and in multiple structures. For example, the CPU may have a level 1 cache that contributes to the storage of instructions executing on the CPU (L1 instruction cache) and a physically separate level 1 cache that contributes to the storage of data except instructions manipulated by the CPU (L1 data cache). And a level 2 cache (L2 cache) that stores both instructions and other data, the L2 cache being used to feed the L1 instruction cache and the L1 data cache. The cache structure or structures are shown in simplified form in FIG. 2 as a single block 205-208 for each individual processor. For the purposes of the present invention, it is not important to exactly and specifically implement the cache on each processor. Many other variations are possible, and the present invention is not limited to any particular cache design and does not necessarily require the use of a cache.

컴퓨터 시스템(100)은 각 개별적인 노드(101) 내에서 분리된 로컬 메모리(210)를 포함하는, 분산 메인 메모리(distributed main memory)를 사용한다. 시스템(100) 내의 어드레스가 가능한 전체 메인 메모리는, 각 개별적인 노드 내의 어드레스가 가능한 로컬 메모리(210)의 합이다. 그러므로, 메일 메모리의 리얼 어드레스 공간은 전 시스템을 통해 일정하며, 로컬 메모리(210) 내의 임의의 메모리 장소는 모든 프로세서 및 모든 노드에 대해 동일한, 고유한 리얼 어드레스를 갖는다.Computer system 100 uses distributed main memory, including local memory 210 separated within each individual node 101. The total addressable main memory in system 100 is the sum of the addressable local memories 210 in each individual node. Therefore, the real address space of the mail memory is constant throughout the system, and any memory location in local memory 210 has the same, unique real address for all processors and all nodes.

인터-노드 인터페이스 장치(215)는 노드(101)를 인터-노드 네트워크(105)에 연결시키고, 그곳에서 노드(101)는 시스템(100) 내의 다른 노드들과 통신할 수 있게 된다. 인터페이스 유닛(215)은 일반적으로 노드 사이에 전달되는 데이터의 일시적인 저장을 위한 캐시 또는 버퍼를 포함한다.Inter-node interface device 215 connects node 101 to inter-node network 105, where node 101 is able to communicate with other nodes in system 100. Interface unit 215 generally includes a cache or buffer for temporary storage of data passed between nodes.

I/O 버스 인터페이스 유닛(220)은 하나 또는 이상의 I/O 버스(221-222)를 통한 하나 또는 이상의 I/O 장치로의 통신의 제공한다. I/O 버스(221-222)는, 직접 액세스 저장 장치(Direct Access Storage Device, DASD, 224), 테이프 드라이브, 워크스테이션(225), 프린터 및 원격 장치, 또는 통신용 라인 또는 네트워크를 통해 다른 컴퓨터 시스템과 통신하기 위한 원격 통신 어댑터와 같은 종래의 I/O 장치와 통신하는데 적합한 임의의 종류이다. 예컨대, I/O 버스(221)는 산업 표준 PCI 버스이다. 2개의 I/O 버스 및 2개의 I/O 장치가 도 2에 나타나 있지만, 그러한 버스와 장치의 수는 가변적이며, 더나아가 모든 노드(101)가 I/O 인터페이스 유닛(220) 또는 부착된 I/O 장치를 포함할 필요도 없다.I / O bus interface unit 220 provides for communication to one or more I / O devices via one or more I / O buses 221-222. I / O buses 221-222 are direct access storage devices (DASDs, 224), tape drives, workstations 225, printers and remote devices, or other computer systems via a line or network for communication. And any kind suitable for communicating with conventional I / O devices, such as telecommunication adapters for communicating with the device. For example, I / O bus 221 is an industry standard PCI bus. Although two I / O buses and two I / O devices are shown in FIG. 2, the number of such buses and devices is variable, and furthermore, all nodes 101 may have an I / O interface unit 220 or attached I. There is no need to include / O devices.

내부 노드 버스(internal node bus, 212)는 노드(101)의 다양한 요소 사이의 통신을 제공한다. 특히, 버스(212)는 CPU 에 의해 내려진 메모리 액세스에 응답하여, 개별적인 CPU(201-204)의 로컬 메모리(210)와 캐시(205-208) 사이의 데이터를 전달한다. 로컬 메모리(210), 인터-노드 인터페이스(215) 및/또는 버스(212)에서의 논리를 모니터함으로써, 메모리 액세스 상에서 요청되는 특정한 리얼 어드레스가 노드(101)의 로컬 메모리(210) 내 또는 다른(원격) 노드의 로컬 메모리 내에 포함되는지 여부를 결정하고, 로컬 메모리(210), 또는 원격 노드와 통신하기 위한 인터-노드 인터페이스(215)로의 메모리 액세스를 지시한다. 데이터가 상주하는 응답 노드의 로컬 메모리에 도달하기 위해서, 요청 노드의 노드 버스(212), 요청 노드의 인터-노드 인터페이스(215), 인터-노드 네트워크(215), 응답 노드의 상응하는 인터-노드 인터페이스와 응답 노드의 상응하는 노드 버스를 교차하는 동안, 로컬 메모리(210) 내부의 리얼 어드레스로의 메모리 액세스는 버스(212)를 통해 장치 사이클의 상대적으로 적은 수로 되돌아오는 것을 관찰할 수 있을 것이다(만약 상기 요청 데이터가 인터페이스 캐시 중 어느 하나에 있다면, 이 조작은 일부 경우에 있어서, 단축됨). 결과적으로, 원격 노드로의 메모리 액세스는 상대적으로 더 큰 수의 사이클을 일반적으로 필요로 한다.Internal node bus 212 provides communication between the various elements of node 101. In particular, the bus 212 transfers data between the local memory 210 and the caches 205-208 of the individual CPUs 201-204 in response to the memory accesses made by the CPU. By monitoring the logic on local memory 210, inter-node interface 215 and / or bus 212, the specific real address requested on the memory access may be in or out of the local memory 210 of node 101. Determine whether it is included in the local memory of the remote node, and direct memory access to the local memory 210, or the inter-node interface 215 for communicating with the remote node. In order to reach the local memory of the responding node where the data resides, the node bus 212 of the requesting node, the inter-node interface 215 of the requesting node, the inter-node network 215, the corresponding inter-node of the responding node While crossing the corresponding node buses of the interface and response nodes, one may observe that memory accesses to real addresses inside local memory 210 return back to a relatively small number of device cycles via bus 212 ( If the request data is in any of the interface caches, this operation is shortened in some cases). As a result, memory accesses to remote nodes generally require a relatively larger number of cycles.

4개의 노드를 갖는 시스템이 도 1에 보여지고, 4개의 CPU와 다른 다양한 장치를 갖는 일반적인 노드가 도 1에 보여지지만, 도 1과 2는 설명의 목적으로 NUMA의 하나의 가능한 구성의 단순화된 예로써만 의도되는 것이며, 그러한 구성에 있어서 가능한 장치의 개수와 종류는 가변적이고, 시스템은 도면에 나타나 있지 않은 부가적인 장치를 포함할 수 있음이 이해되어야 한다. 또한 모든 노드가 동일할 필요는 없으며, 모든 노드가 동일한 수의 CPU 또는 동일한 수의 어드레스가 가능한 로컬 메모리를 구비해야 할 필요도 없다.Although a system with four nodes is shown in FIG. 1, and a typical node with four CPUs and various other devices is shown in FIG. 1, FIGS. 1 and 2 are simplified examples of one possible configuration of NUMA for illustrative purposes. It is to be understood that the number and type of devices possible in such a configuration is variable, and that the system may include additional devices not shown in the figures. Also, not all nodes need to be the same, nor do all nodes need to have the same number of CPUs or local memory capable of the same number of addresses.

<운영 시스템 관점><Operating System Perspective>

도 3은 바람직한 실시예(100)에 따른 멀티-노드 컴퓨터 시스템 내에서 상이한 추상화 레벨의 하드웨어 및 소프트웨어 함수의 분할을 보여주는 개념적인 설명도이다. 공지된 바와 같이, 컴퓨터 시스템은 프로세스를 수행하는 순차적인 상태 장치이다. 이러한 프로세스들은 추상화 레벨을 변화시키는 것으로 나타난다. 추상화의 상위 레벨에서, 사용자는 하나의 프로세스와 입력을 열거하고, 출력을 수신한다. 하위 레벨로 진행됨에 따라, 이러한 프로세스들이 일부 프로그래밍 언어에서의 명령의 순차라는 것을 알게되고, 더 하위로 진행됨에 따라 하위 레벨 명령들이 번역되어 운영 시스템을 통해, 궁극적으로 임의의 동작을 강요하는 장치 레지스터 내의 데이터 비트로 보내진다. 하위 레벨에서, 전기적인 전위를 변화시키는 것은 다양한 트랜지스터가 턴 온 및 턴 오프되는 것을 가능케 한다. 도 3에 있어서, 추상화의 "더 높은" 레벨이란 형태의 정상(top)을 향함을 의미하고, 반면에 더 낮은 레벨은 바닥(bottom)으로 향함을 의미한다.3 is a conceptual explanatory diagram showing the division of hardware and software functions of different levels of abstraction within a multi-node computer system according to a preferred embodiment 100. As is known, a computer system is a sequential state machine that performs a process. These processes appear to change the level of abstraction. At a higher level of abstraction, the user enumerates one process and input, and receives output. As it progresses down the level, it becomes known that these processes are a sequence of instructions in some programming languages, and as it progresses further down the level registers are translated into device registers that ultimately force arbitrary operation through the operating system. Is sent in the data bits. At a lower level, changing the electrical potential enables various transistors to be turned on and off. In Figure 3, the "higher" level of abstraction means towards the top of the form, while the lower level means towards the bottom.

도 3에서 보여지는 하드웨어 레벨(301)은 명령이 실행되도록 하는 물리적인 프로세서, 메모리 버스 및 다른 구성 요소를 의미한다. 여기서 사용된 것처럼, 하드웨어 레벨(301)은 도 1 및 2에서 보여지는 물리적인 장치(장치 내에 저장된 데이터에 반대되는 의미의)의 모음(collection)을 의미하며, 이는 도 1 및 2에서 나타나지 않는 다른 하드웨어를 포함한다.The hardware level 301 shown in FIG. 3 refers to the physical processor, memory bus, and other components that cause instructions to be executed. As used herein, hardware level 301 refers to a collection of physical devices (as opposed to data stored in the device) shown in FIGS. 1 and 2, which are not shown in FIGS. 1 and 2. Include hardware.

하드웨어 바로 위에는 저-레벨 운영 시스템 레벨(low-level operating system level, 302)이 있으며, 일부 운영 시스템에서는 "커널(kernel)"이라 불린다. 물리적인 감각에서, 운영 시스템은 코드, 예컨대, 요청된 함수를 수행하도록 하나 또는 이상의 프로세서 상에서 실행되거나 다양한 메모리 장소에 저장되는 명령의 형태인 데이터이다. 저-레벨 운영 시스템은, 시스템 자원을 공유하고, 메모리를 할당하고, 보안을 강화하는 등에 필요한 임의의 기본 운영 시스템 함수를 제공한다. 저-레벨 운영 시스템(302)에 의해 제공되는 함수 중에서는 페이징 함수(303) 및 디스패칭 함수(304)가 있다. 페이저(303)는, 실행 스레드가 시스템의 분배 메인 메모리 내에 현재 있지 않는 데이터에 액세스하는 시도를 하는 경우, 예컨대 그 데이터가 다양한 노드의 임의의 로컬 메모리(210) 내에 있지 않는 경우에 호출된다. 이 경우에, 페이저(303)는 요청 데이터가 저장 장치(회전 자기 디스크 드라이브 저장 장치와 같은)로부터 나와, 로컬 메모리(210) 중 어느 하나에 위치되도록 한다. 여기에서 더 상세히 설명되는 것처럼, 디스패처(304)는 실행용 프로세서로 실행되기 위해 대기하는 스레드를 치한다. 디스패치 준비 큐 구조(305)는 디스패처(304)에 의한 디스패치를 대기하는 스레드를 포함한다.Just above the hardware is a low-level operating system level (302), which is called a "kernel" in some operating systems. In a physical sense, an operating system is code, such as data in the form of instructions that are executed on one or more processors or stored in various memory locations to perform a requested function. The low-level operating system provides any basic operating system functions necessary for sharing system resources, allocating memory, enhancing security, and the like. Among the functions provided by the low-level operating system 302 are the paging function 303 and the dispatching function 304. Phaser 303 is called when an execution thread attempts to access data that is not currently in the system's distributed main memory, for example when the data is not in any local memory 210 of the various nodes. In this case, pager 303 causes the request data to come out of the storage device (such as a rotating magnetic disk drive storage device) and be located in any of local memory 210. As described in more detail herein, dispatcher 304 hits a thread waiting to be executed by a processor for execution. Dispatch ready queue structure 305 includes a thread waiting for dispatch by dispatcher 304.

저-레벨 운영 시스템(302)의 레벨 위에는 부가적인 더 상위-레벨 운영 시스템 함수(308)뿐만 아니라, 다양한 사용자 프로세스(310-312, 예컨대 사용자 어플리케이션 코드 및 데이터)가 존재한다. 일반적으로, 더 상위-레벨 운영 시스템 함수(308)는 액세스를 원하는 사용자에게 부가적인 수단과 함수들을 제공하지만, 사용자 프로세스는 직접적으로 실행을 위한 저-레벨 운영 시스템(302)에 액세스할 수도 있다.Above the level of the low-level operating system 302 is an additional higher-level operating system function 308, as well as various user processes 310-312, such as user application code and data. In general, the higher-level operating system function 308 provides additional means and functions to the user who wants access, but the user process may directly access the low-level operating system 302 for execution.

바람직한 실시예에 있어서, 운영 시스템은 마이크로소프트 윈도우즈 2000^TM운영 시스템이며, 여기서 작업 디스패처 및 페이저는 CPU와 메모리의 노드 위치를 고려하여 설명되는 것처럼 개조되었다. 그러나, 작업이 디스패치되는 단일의, 공동 준비 큐를 구비한 가상의 임의의 멀티-태스킹 운영 시스템은, 이후에 발전된 운영 시스템을 포함하는 UNIX^TM기반 운영 시스템, IBM AS/400^TM운영 시스템 등과 같이 여기에서 기술되는 함수들에 적용될 수 있다.In a preferred embodiment, the operating system is a Microsoft Windows 2000 ^™ operating system, where the task dispatcher and pager have been modified as described with regard to the node location of the CPU and memory. However, any virtual multi-tasking operating system with a single, co-ready queue to which jobs are dispatched may be used here, such as a UNIX ^TM based operating system, including an advanced operating system, an IBM AS / 400 ^TM operating system, and the like. It can be applied to the functions described in.

전형적인 컴퓨터 시스템 설계에 있어서, 더 상위 레벨의 실체들은 하위 레벨의 실체의 구체적인 구현을 하는 것으로부터 보호되는 것이 바람직하다. NUMA 시스템의 하드웨어 설계의 경우에 있어서, 이것은 운영 시스템 및 상위 레벨 소프트웨어가 바람직하게는 NUMA 특성을 인지할 것을 요구하지 않는다는 것을 의미한다. 그러므로, 일반적으로는 NUMA 시스템은, 운영 시스템이 그 분배 메인 메모리를, 단일 모놀리식 실체(single monolithic entity) - 이 단일 모놀리식 실체는, 요청 데이터가 존재하는 경우에는 요청 데이터를 반환하고, 존재하지 않는 경우에는 페이지 장애를 발생시킴으로써 데이터 요청에 응답함-로 간주하도록 설계된다. 유사하게는 운영 시스템은 노드와 프로세서의 모음을 단순하게 프로세서의 하나의 대형 풀로서간주하는데, 모든 프로세서란 임의의 프로세스를 실행 가능하게 하는 것을 말한다. 이것은 모든 프로세서들이 주어진 작업을 동시에 수행한다거나, 모든 메모리 액세스가 동시에 달성된다는 것을 의미하지는 않는다; 그 이유는, 전에 언급한 바와 같다. 그러나, 그것은 요청 메모리가 동일한 노드에 있는지 여부에 관계없이, 메모리 액세스 에러 없이 달성된다는 것을 의미하지 않는다. 그러므로, 운영 시스템은 프로세서가 위치하는 노드에 관계없이 스레드를 디스패치할 수 있다.In a typical computer system design, it is desirable that higher level entities are protected from doing specific implementations of lower level entities. In the case of a hardware design of a NUMA system, this means that the operating system and higher level software preferably do not require knowledge of the NUMA characteristics. Thus, in general, a NUMA system requires that the operating system share its distributed main memory, a single monolithic entity-this single monolithic entity returns the request data if it exists. If not present, it is designed to be considered as responding to a data request by generating a page fault. Similarly, an operating system simply considers a collection of nodes and processors as a large pool of processors, which means that any processor can execute any process. This does not mean that all processors perform a given task at the same time, or that all memory accesses are achieved at the same time; The reason is as mentioned previously. However, that does not mean that the request memory is achieved without a memory access error, whether or not it is in the same node. Therefore, the operating system can dispatch a thread regardless of the node where the processor is located.

NUMA 시스템 하드웨어가 표준 저-레벨 운영 시스템 함수와 함께 함수화되도록 설계되어도, 만약 저-레벨 운영 시스템이 하드웨어 설계를 더 잘 인식하고 있다면 - 특히 프로세서의 노드 위치가 스레드를 디스패칭할 때 고려된다면 -, 컴퓨터 시스템은 더 효율적으로 동작할 것이다.Although NUMA system hardware is designed to be functionalized with standard low-level operating system functions, if the low-level operating system is better aware of the hardware design, especially if the node location of the processor is taken into account when dispatching threads, Computer systems will operate more efficiently.

<디스패칭 및 페이징 함수(paging function)><Dispatching and Paging Functions>

여기서 설명되는 스레드 디스패처는, 만약 스레드가 일관된 노드들 상에서 실행되며, 스레드가 필요한 데이터는 일반적인 실행 노드 내에 축적되는 경향이 있을 것이고, 따라서 인터-노드 메모리 액세스의 주파수는 감소될 것이라는 원리에서 조작된다. 이것은 만약 메모리 페이징 메카니즘이 일부 위치의 국지성(locality of placement)을 드러낸다면, 환언하면, 만약 페이저(303)가 요청 페이지를 일부 특정한 로컬 노드에 위치시키려는 경향이 매우 일정하지 않게 존재한다면, 오직 참이 된다.The thread dispatcher described herein is manipulated on the principle that if a thread runs on consistent nodes, the data needed by the thread will tend to accumulate in a general execution node, and thus the frequency of inter-node memory access will be reduced. This is true if the memory paging mechanism reveals the locality of placement, in other words, if the pager 303 tends to be very inconsistent to place the request page at some particular local node. do.

위치의 국지성을 나타내는 페이저를 구현하는 단순하고도 직접적인 방법은 요청하는 프로세서의 노드로 페이지 위치를 제한하는 것이고, 이것은 바람직한 실시예에서 사용되는 방법이다. 환언하면, 페이지 장애의 경우에 있어서, 페이저(pager, 303)는 항상 새로운 페이지를 프로세서를 포함하는 노드의 로컬 메모리에 위치시키며, 상기 프로세서란 페이지 장애(page fault)를 유발하는 메모리 액세스 요청을 하는 것을 말한다. 페이저(303)는 노드의 로컬 메모리의 유효한 페이지로부터 페이지 아웃(paged out)되는 최상의 후보를 선택한다. 페이저를 구현하는 선택적인 방법은 요청 프로세스의 이상적인 노드로 페이지 위치를 제한하는 것이다. 환언하면, 이상적인 노드는 각 프로세스(아래에서 더 전체적으로 설명될 것임)에 관련되어 있고, 메모리 액세스를 내는 프로세서와 동일하지 않은 노드인 경우조차, 하나의 페이지는 항상 페이지 장애를 유발한 프로세스와 관계되는 이상적인 노드의 로컬 메모리에 위치된다. 이 선택적인 방법의 원인은, 프로세스에 의해 발생된 스레드가 종종 다른 노드에서 실행되는 경우에도, 페이지가 일관된 로컬 노드에 위치하고 있다는 것이다. 그러나, 이러한 2개의 선택성은 페이저에 의해 채택될 수 있는 유일하게 가능한 기술은 아니며, 다양한 선택적인 표준 또는 표준의 조합이 어느 정도의 페이지 위치의 국지성을 달성하는데 사용될 수 있다.A simple and straightforward way to implement a pager indicating locality of location is to limit the page location to the node of the requesting processor, which is the method used in the preferred embodiment. In other words, in the case of a page fault, the pager 303 always places a new page in the local memory of the node containing the processor, the processor making a memory access request causing a page fault. Say that. Phaser 303 selects the best candidate to be paged out from a valid page in the node's local memory. An alternative way to implement the pager is to limit the page position to the ideal node of the request process. In other words, an ideal node is associated with each process (which will be described more fully below), and even if the node is not the same as the processor giving the memory access, one page is always associated with the process that caused the page fault. It is located in the local memory of the ideal node. The reason for this optional method is that the page is located on a consistent local node, even if the thread created by the process is often run on another node. However, these two selectivities are not the only possible techniques that can be adopted by the pager, and various optional standards or combinations of standards can be used to achieve some degree of localization of the page position.

스레드 디스패칭은 스레드의 상태 및 우선권에 의존한다. 어느 순간적인 시간에 있어, 스레드는 여러 상태들 중 하나에 존재한다. 예컨대, 스레드는 프로세서 상에서 실행되는 실행 상태에 있을 수도 있으며, 어떤 외부 이벤트가 발생될 때까지 실행될 수 없어서, 이벤트가 발생될 때까지 대기하는 이벤트 대기 상태에 있을 수도 있으며, 또는 유효한 프로세서에 대해서만 대기하고 실행을 준비하는 준비 상태에 있을 수도 있다. 운영 시스템에 따라, 부가적인 상태 또는 상기 상태의 진보된 상태가 정의될 수도 있다. 부가적으로, 실행의 우선권은 각 스레드에 관계된다. 종래 기술에 있어서 공지된 다양한 우선권 할당 기술(priority assignment scheme)이 사용될 수도 있다. 우선권은 일반적으로 사용자, 시스템 관리자 도는 운영 시스템 자신에 의해 할당된다. 예컨대, 사용자 어플리케이션 프로세스의 우선권은 종종 사용자에 의한 오버라이드(override)에 지배되어, 운영 시스템에 의해 열거되는 사용자 프로세스에 대한 디폴트 우선권(default priority)이 된다. 우선권은 스레드가 존재하는 지속시간 동안 고정될 수 있고, 또는 스레드가 준비 큐 상에서 대기하고 있는 시간의 길이와 같이 다양한 인자에 의존하여 조정된다. 종래에 의하면, 우선권이 선택적으로 역순서로 있을 수 있음에도 불구하고, 더 높은 수가 더 큰 우선권을 지시한다.Thread dispatching depends on the state and priority of the thread. At any moment in time, a thread is in one of several states. For example, a thread may be in a running state running on a processor, may not be executed until some external event occurs, and may be in an event waiting state waiting for an event to occur, or only waiting for a valid processor You may be ready to run. Depending on the operating system, additional states or advanced states of those states may be defined. In addition, the priority of execution relates to each thread. Various priority assignment schemes known in the art may be used. Priorities are usually assigned by the user, the system administrator, or the operating system itself. For example, the priority of a user application process is often governed by an override by the user, resulting in a default priority for the user process enumerated by the operating system. The priority can be fixed for the duration that the thread is present, or adjusted depending on various factors, such as the length of time the thread is waiting on the ready queue. Conventionally, although the priority may optionally be in reverse order, a higher number indicates a higher priority.

디스패처(304)는 스레드 준비 큐 구조(305)로부터의 디스패치를 위한 스레드를 선택한다. 준비 큐 구조(305)는 도 4에서 더 자세히 설명된다. 도 4에서 보여지는 것처럼, 준비 큐 구조는 제어 블럭(410-412)의 복수의 리스트(401-403)를 포함하며, 도 4에서 설명을 위해 도시된 3개의 리스의 수는 가변적이다. 각 제어 블럭 리스트(401-403)는 FIFO 순으로 정렬된다. 주어진 리스트의 제어 블럭(410-412)은 목표된 우선권과 관계된 스레드를 의미하며, 이 스레드는 실행을 준비하고 대기한다. 환언하면, 제어 블럭 리스트(401-403)는 준비 상태에 있는 스레드를 포함한다. 하나의 스레드가 준비 상태에 들어가면, 그 제어 블럭은 스레드와 관계된 우선권을 갖는 리스트의 말단에 위치한다. 디스패처(304)가 실행용 CPU에 그것을 디스패치한 경우에, 제어 블럭은 정상적으로 리스트에서 제거된다.Dispatcher 304 selects a thread for dispatch from thread ready queue structure 305. The ready queue structure 305 is described in more detail in FIG. As shown in FIG. 4, the ready queue structure includes a plurality of lists 401-403 of control blocks 410-412, and the number of three leases shown for explanation in FIG. 4 is variable. Each control block list 401-403 is sorted in FIFO order. Control blocks 410-412 of a given list represent the threads associated with the desired priority, which are ready to run and wait. In other words, the control block lists 401-403 include threads that are in a ready state. When a thread enters the ready state, the control block is placed at the end of the list with priority associated with the thread. When the dispatcher 304 dispatches it to the execution CPU, the control block is normally removed from the list.

바람직한 실시예에 있어서, 시스템(100)에 대한 오직 하나의 준비 큐 구조(305)가 존재하고, 실행을 준비하는 모든 스레드는 그 스레드 우선권에 상응하는 준비 큐 구조(305)의 리스트(401-403)에 위치한다. 준비 큐는, 어떠한 CPU 또는 CPU 그룹(노드와 같은)도 그 준비 큐로부터 작업의 우선적인 디스패치를 수신하지 않는다는 것을 의미하는 임의의 CPU 또는 CPU의 그룹과 관계되어 있지 않다. 준비 큐가 메모리 구조일 경우, 일반적으로 그것은 노드 중 하나에 저장될 것이고, 디스패처는 일반적으로 그 노드의 CPU에서 실행될 것이지만, 이것이 "CPU 또는 CPU 그룹과 관계되어 있는" 것을 의미하지는 않는다.In the preferred embodiment, there is only one ready queue structure 305 for the system 100, and every thread preparing to run is a list 401-403 of the ready queue structure 305 corresponding to its thread priority. ) The ready queue is not associated with any CPU or group of CPUs, meaning that no CPU or group of CPUs (such as nodes) receive preferential dispatch of jobs from the ready queue. If the ready queue is a memory structure, it will generally be stored on one of the nodes, and the dispatcher will generally run on that node's CPU, but this does not mean that it is "associated with the CPU or CPU group."

각 제어 블럭(410-412)은 활성 스레드에 관한 임의의 상태 정보를 포함하며, 제어 블럭의 일부는 디스패치를 위한 스레드를 선택하는 디스패처(304)에 의해 사용된다. 도 5는 디스패처에 의해 사용되는 전형적인 제어 블럭(410)으로부터의 임의의 스레드-특정 정보를 설명한다. 도 5에 도시된 바와 같이, 제어 블럭은 우선권(501), 유사성(affinity) 마스크(502), 이상적인 노드 마스크(503), 최근 실행 프로세서(505)와 큐 시간(506)을 포함한다. 우선권 영역(501)은 스레드의 목표한 수적 우선권을 포함한다. 유사성 마스크(502)는 개별적인 CPU 들에 상응하는 마스크 비트의 시리즈이며, 그것에 의해 사용자 또는 시스템 관리자는, 프로세스가 시스템 상에서 유효한 CPU의 서브세트 상에서만 실행될 수도 있다는 것을 요청할 수도 있으며, 여기서 서브세트란 유사성 마스크에 의해 열거되는 것을 말한다; 대부분의 캐시에서, 사용자는 실행을 제한하지 않으며, 유사성 마스크는 모든 CPU가 실행 가능하도록 설정된다. 이상적인 노드 마스크(503)는 개별적인 노드에 상응하는 마스크 비트의 세트이며, 이로써 실행을 위한 하나 또는 이상의 바람직한 노드가 여기서 설명된 바와 같이 지명될 수 있다. 이상적인 프로세서 영역(504)는 스레드의 실행을 위한 단일의 바람직한 CPU의 수적인 지정이다. 최근 실행된 프로세서 영역(505)은 어느 스레드가 가장 최근에 실행되었는지에 관한 CPU의 수적인 지정이다. 큐 시간 영역(506)은 준비 큐에 스레드가 있는 시간의 길이를 지시하는 값을 포함한다. 예컨대, 이 값은 어떤 이벤트가 발생한 경우 증가하는 카운터일 수도 있고 또다른 값일 수도 있음에도 불구하고, 스레드가 큐에 들어간 때를 기록하는 시간 스탬프(time-stamp)가 될 수도 있다.Each control block 410-412 includes any state information about an active thread, and a portion of the control block is used by the dispatcher 304 to select a thread for dispatch. 5 illustrates any thread-specific information from a typical control block 410 used by the dispatcher. As shown in FIG. 5, the control block includes a priority 501, an affinity mask 502, an ideal node mask 503, a recently executed processor 505 and a queue time 506. Priority area 501 includes the target numerical priority of the thread. Similarity mask 502 is a series of mask bits corresponding to individual CPUs, whereby a user or system administrator may request that a process may only run on a subset of CPUs that are valid on the system, where the subset is similarity. To enumerate by a mask; In most caches, the user does not restrict execution, and the similarity mask is set to make all CPUs executable. The ideal node mask 503 is a set of mask bits corresponding to individual nodes, whereby one or more preferred nodes for execution may be named as described herein. The ideal processor area 504 is a numerical designation of a single preferred CPU for the execution of the thread. Recently executed processor region 505 is a numerical designation of the CPU as to which thread was most recently executed. Queue time area 506 contains a value indicating the length of time that a thread is in the ready queue. For example, this value may be a time-stamp that records when a thread enters the queue, although it may be a counter that increments when some event occurs or another value.

도 4 및 5는 설명을 위해 단순화된 형태로 준비 큐와 제어 블럭을 도시하고 있으며, 디스패처에 의해 사용되는 데이터 구조 형태의 정확한 청사진을 제공할 목적은 아니다. 준비 큐는 링크된 리스트 정렬의 복수의 제어 블럭(410-412)을 포함하고 있음을 보여주며, 각 블럭(410-412)은 단일의 개별적인 스레드에 상응하고, 모든 필요한 상태 정보를 포함하고 있다. 그러나 큐 데이터 구조의 정확한 구조적 상세는 가변적이며, 이것은 데이터 구조의 하나의 어레이 또는 다른 형태로써 구현될 수 있다. 더 나아가, 제어 블럭(410-412)이 각 개별적인 스레드에 대한 완전한 상태 정보를 포함하고 있음을 보여줌과 동시에, 큐 내부의 기록이 디스패처가 필요한 부분적인 정보만을 포함할 수도 있으며, 필요한 데이터가 발견될 수 있는 장소에 대한 더 많은 포인터 또는 다른 인덱스를 단순히 포함하고 있을 수도 있다. 제어 블럭은 디스패처 또는 다른 함수들에 의해 사용되는 다른 그리고 부가적인 상태 정보를 포함할 수도 있다.4 and 5 illustrate ready queues and control blocks in a simplified form for illustrative purposes, and are not intended to provide an accurate blueprint of the data structure type used by the dispatcher. The ready queue shows that it contains a plurality of control blocks 410-412 of linked list sorts, each block 410-412 corresponding to a single individual thread and containing all the necessary state information. However, the exact structural details of the queue data structure are variable, which can be implemented as one array or other form of data structure. Furthermore, while showing that the control blocks 410-412 contain complete state information for each individual thread, the writes in the queue may contain only partial information requiring dispatchers, and the necessary data may be found. It may simply contain more pointers or other indices to places where it can be. The control block may include other and additional state information used by the dispatcher or other functions.

디스패칭 선택을 제어하는 제어 블럭의 임의의 값은, 프로세서가 스레드를 생성하고, 프로세스 초기화에서 발생되는 값으로부터 상속될 수도 있는 경우에, 초기화된다. 도 6은 스레드 제어값을 초기화하는 운영 시스템에 의해 택해지는 고 레벨의 임의의 단계를 보여주는 흐름도이다. 도 6에서 보여지는 바와 같이, 프로세스는 종래의 방법에 의해 초기화되며, 임의의 데이터 구조를 발생케하고, 초기화시키며, 특히 도 5에 묘사된 값을 유지하는 제어 블럭 또는 유사한 구조가 생성되도록 한다. 부가적으로, 우선권과 프로세서 유사성은 프로세스에 할당된다. 이러한 단계는 단계(601)로써 고 레벨로 선택적으로 나타내진다.Any value of the control block that controls the dispatching selection is initialized if the processor creates a thread and may inherit from a value generated in process initialization. 6 is a flow chart showing any high level step taken by the operating system to initialize thread control values. As shown in FIG. 6, the process is initiated by conventional methods, allowing any data structure to be generated, initialized, and in particular a control block or similar structure that maintains the values depicted in FIG. 5. In addition, priority and processor similarity are assigned to the process. This step is optionally represented at high level as step 601.

하나의 이상적인 노드는 다음과 같이 프로세스와 관련되어 있다. 만약 프로세스가 시스템 프로세스이거나 또는 그것에 할당된 특정한 프로세서 유사성을 가지고 있다면, 단계(602)로부터 "Y" 가지가 선택되며, 이상적인 노드는 모두로 설정된다(단계 603). 환언하면, 프로세스와 관련된 이상적인 노드 마스크는 모든 노드와 함께 "on"으로 설정되고, 이는 어떠한 이상적인 노드 선택도 존재하지 않음을 효과적으로 의미한다. 시스템 프로세스는 모든 노드에서 구동될 것으로 의도된 다양한 목표 운영 시스템 중의 하나이다. 프로세서 유사성은 사용자 또는 시스템 관리자에 의해 지시되며, 프로세스가 유효한 CPU의 특정한 서브세트에서 실행되도록 제한한다. 프로세서 유사성이 드물게 지시됨에도 불구하고, 그러한 유사성이 지시된 경우에, 시스템에 할당된 "이상적인 노드"를 오버라이드하게 되므로, 이상적인 노드 할당은 이 경우에서는 사용되지 않는다.One ideal node is involved in the process as follows. If the process is a system process or has a particular processor similarity assigned to it, then a "Y" branch is selected from step 602 and the ideal node is set to all (step 603). In other words, the ideal node mask associated with the process is set to "on" with all nodes, which effectively means that no ideal node selection exists. The system process is one of various target operating systems intended to run on every node. Processor similarity is dictated by a user or system administrator and limits the process to running on a specific subset of valid CPUs. Although processor similarity is rarely indicated, ideal node allocation is not used in this case, if such similarity is indicated, as it will override the "ideal node" assigned to the system.

만약 프로세스가 시스템 프로세스도 아니고, 특정한 프로세서 유사성을 갖고있지도 않다면(환언하면, 임의의 프로세서 상에서 구동될 수 있음), 단계(602)에서 "N" 가지가 선택된다. 이러한 경우에 운영 시스템은 라운드 로빈(round robin) 알고리즘을 사용하여 이상적인 노드를 할당한다. 환언하면, 가장 최근에 프로세스에 할당된 노드의 수는 증가하고(단계 604), 이 노드는 프로세스의 이상적인 노드로써 할당된다(단계 605). 할당 조작은 선택된 이상적인 노드에 상응하는 이상적인 노드 마스크의 비트를 설정함으로써 수행된다.If the process is not a system process and does not have a particular processor similarity (in other words, it can run on any processor), then an "N" branch is selected at step 602. In this case, the operating system allocates the ideal node using a round robin algorithm. In other words, the number of nodes most recently assigned to the process increases (step 604), and this node is allocated as the ideal node of the process (step 605). The assignment operation is performed by setting the bits of the ideal node mask corresponding to the selected ideal node.

높은 적응성을 위해, 노드 마스크가 사용되어, 단일 노드는 지시된 이상적인 노드가 될 수도 있고, 모든 노드가 지시될 수도 있으며 또는 노드의 임의의 서브세트가 지시될 수도 있다. 디폴트로써, 운영 시스템은 아래와 같이 대부분의 사용자 프로세스에 대한 단일 노드를 선택한다. 그러나, 사용자가 특별한 함수 호출을 통해 이 선택을 오버라이드하는 것이 가능하다. 이 기능은 운영 시스템에 의해 수행되는 자원의 평형에 간섭을 가하는 경향이 있으나, 그것을 정당화하는 특별한 상황이 있을 수도 있기 때문에, 이 기능은 거의 사용되지 않는다고 생각된다.For high adaptability, a node mask may be used such that a single node may be the indicated ideal node, all nodes may be indicated, or any subset of nodes may be indicated. By default, the operating system chooses a single node for most user processes, as shown below. However, it is possible for the user to override this selection through a special function call. This function tends to interfere with the equilibrium of resources performed by the operating system, but it is thought that this function is rarely used because there may be special circumstances justifying it.

간단한 라운드-로빈 알고리즘은 동등한 기반 하에서 유효한 노드 사이에 프로세스를 분배하고 자원 사용에 평형을 이루는 디폴트로써 사용된다. 그러나, 선호되는 노드를 할당하기 위한 임의의 수의 택일적인 방법이 운영 시스템에 의해 사용될 수 있다. 예컨대, 만약 각 노드의 프로세서의 수가 동일하지 않다면, 할당에 무게를 싣는 것이 바람직하다. 선택적으로, 최근 CPU 사용에 관한 통계를 내세울 수도 있으며, 가장 낮은 최근 CPU 사용도를 가진 노드에 프로세스를 할당할 수도 있다.A simple round-robin algorithm is used as a default to distribute processes and balance resource usage between valid nodes on an equal basis. However, any number of alternative methods for assigning preferred nodes can be used by the operating system. For example, if the number of processors in each node is not the same, it is desirable to weigh the allocation. Optionally, statistics about recent CPU usage can be provided, and processes can be assigned to nodes with the lowest recent CPU usage.

어떤 점에서, 단계(610)에 나타난 바와 같이, 프로세스는 스레드를 생성한다. 프로세스는 단일 스레드를 생성하거나, 또는 복수의 스레드를 생성할 수도 있지만, 오직 하나만이 설명을 위해 도 6에 도시되어 있다. 다른 것 중에서, 스레드를 생성하는 것은, 상태 기록 또는 기록들[예컨대, 제어 블럭(410)]이 스레드를 위해 생성되어 임의의 값으로 초기화되었다는 것을 의미한다. 프로세스를 초기화하는 것처럼, 스레드를 생성하는 것은 종래 기술에서 공지된 바와 같이 다수의 단계를 포함할 수 있으며, 여기서는 설명되지 않고, 오직 단계(610)의 고 레벨로서만 설명된다. 스레드 우선권 값(501), 유사성 마스크(502)와 이상적인 노드 마스크(503)는 스레드를 생성했던 프로세스에 대한 유사한 값으로부터 상속된다(단계 611); 이것은 프로세스 값이 스레드 제어 블럭으로 복사된다는 것 또는 스레드 제어 블럭이 단순히 프로세스값을 참조한다는 것을 의미할 수도 있다.At some point, as shown in step 610, the process creates a thread. The process may create a single thread or may create multiple threads, but only one is shown in FIG. 6 for illustration. Among other things, creating a thread means that a status record or records (eg, control block 410) has been created for the thread and initialized to an arbitrary value. As initiating a process, creating a thread may include a number of steps, as known in the art, and is not described herein, but only as a high level of step 610. Thread priority value 501, similarity mask 502 and ideal node mask 503 are inherited from similar values for the process that created the thread (step 611); This may mean that the process value is copied to the thread control block or that the thread control block simply references the process value.

실행을 위한 바람직한 CPU("이상적인 CPU"라고 불리는)는, 이상적인 노드에서의 랜덤 CPU로 시작하고, 라운드-로빈 기반의 이상적인 CPU 할당을 회전함으로써 각 스레드에 할당된다. 환언하면, 만약 생성된 스레드가 프로세스에 의해 생성된 제1 스레드라면, "Y" 가지가 단계(612)에서 선택되고, 이상적인 노드 또는 노드들(스레드의 이상적인 노드 마스크에 의해 지시되는) 내의 CPU가 랜덤하게 선택된다(단계 613). 만약 생성된 스레드가 제1 스레드가 아니라면, "N" 가지가 단계(612)에서 선택되며, 운영 시스템은 이상적인 노드 또는 노드들(단계 614) 내의 CPU의 수를 증가시키고, 이 다음 CPU를 새로 생성된 스레드에 할당한다(단계 615).Preferred CPUs for execution (called "ideal CPUs") are assigned to each thread by starting with a random CPU at the ideal node and rotating a round-robin based ideal CPU allocation. In other words, if the created thread is the first thread created by the process, then the "Y" branch is selected in step 612 and the CPU in the ideal node or nodes (indicated by the thread's ideal node mask) is Randomly selected (step 613). If the created thread is not the first thread, an "N" branch is selected in step 612, and the operating system increases the number of CPUs in the ideal node or nodes (step 614), and then creates a new CPU. The allocated thread (step 615).

도 6은 디스패처에 의해 사용되는 임의의 변수의 초기화를 설명하기 위한 프로세스 및 스레드 초기화의 매우 단순화된 흐름도이고, 프로세스를 초기화하거나 스레드를 생성하는 단계를 완벽하게 설명하려는 것은 아니다.6 is a very simplified flow chart of process and thread initialization to illustrate the initialization of any variable used by the dispatcher and is not intended to fully describe the steps of initializing a process or creating a thread.

제어 블럭(410) 내에 포함되는 준비 큐(305) 및 정보와 결합된 스레드 디스패처의 조작이 설명될 것이다. 일반적으로, 디스패처는, 새로운 스레드가 디스패치되어야 함 또는 디스패치 될 수도 있음을 지시하는 외부 이벤트에 응답하고, 스레드가 디스패치되도록 결정하거나, CPU가 스레드를 실행하도록 결정한다. 제1 모드에서(도 7에서 도시됨), 디스패처는, CPU가 스레드를 실행하기 유효하게 된 경우에, 준비 큐(305)로부터 유효한 스레드를 선택하도록 한다. 이것은 예컨대, CPU 상에서 이전에 실행된 스레드가 긴 대기시간 이벤트(long latency event)와 만나거나, 또는 이전에 스레드를 실행하는 것이 종료되거나, 또는 이전에 스레드를 실행하는 것이 인터럽트되었거나, 또는 실행을 종료하였기 때문에, 발생할 수도 있다. 제2 모드에 있어서(도 8에서 도시된), 디스패처는 스레드가 실행을 준비하게 되기 때문에, 유효한 프로세서를 선택하도록 불려온다(예컨대, 새로운 스레드가 생성되었거나, 또는 스레드가 대기하고 있었던 외부 이벤트가 발생하였거나 또는 어떤 다른 이벤트가 스레드를 준비하게 만들도록 야기되었음). 운영 시스템의 설계에 따라, 디스패처는 또한 다른 이유로 불려올 수 있다.The operation of the ready dispatcher 305 included in the control block 410 and the thread dispatcher associated with the information will be described. In general, the dispatcher responds to external events indicating that a new thread should be dispatched or may be dispatched, and determines that the thread is to be dispatched, or the CPU decides to execute the thread. In the first mode (shown in FIG. 7), the dispatcher causes the CPU to select a valid thread from the ready queue 305 when it becomes available to execute the thread. This may be the case, for example, if a previously executed thread on the CPU encounters a long latency event, or the execution of the thread previously terminated, or the execution of a previously interrupted thread, or terminates execution. May occur. In the second mode (shown in Figure 8), the dispatcher is called to select a valid processor because the thread is ready to run (e.g., a new thread has been created or an external event that the thread was waiting on has occurred). Or some other event caused the thread to be ready). Depending on the design of the operating system, the dispatcher may also be called for other reasons.

디스패처의 중심은 스레드 선택 매카니즘이고, 이 매카니즘은 디스패치를 위해 스레드를 선택한다. 하나의 스레드는 유효한 CPU에 대한 최상의 매치로써, 선택되므로, 이 스레드 선택 함수가 호출되면, 스레드의 디스패칭을 위한 목표 CPU가 고려된다. 바람직한 실시예에 있어서, 이 목표 CPU는 일반적으로, 막 유효하게 되었고, 디스패치 함수를 유발시켰던 CPU 이다.The center of the dispatcher is the thread selection mechanism, which selects a thread for dispatch. Since one thread is selected as the best match for a valid CPU, when this thread selection function is called, the target CPU for dispatching the thread is considered. In a preferred embodiment, this target CPU is generally a CPU that has just become available and has triggered a dispatch function.

도 7A 및 7B(도 7로써 선택적으로 참조됨)는 디스패처(304) 내의 스레드 선택 함수의 조작을 보여주는 흐름도이다. 스레드 선택 함수는 목표 CPU(P로 지시됨)에 대한 스레드를 선택하도록 호출되며, 이 목표 CPU란 일반적으로, 상기에서 설명된 막 유효화된 CPU이다. 이 스레드 선택 함수는 적합한 스레드가 발견될 때가지 가장 높은 우선권으로부터 가장 낮은 우선권으로 준비 큐 내의 다양한 제어 블럭 리스트(401-403)를 검토한다. 도 7에 도시된 바와 같이, 스레드 선택 함수는 첫째로 검토된 리스트를 선택한다(단계 701). 초기에, 선택된 제어 블럭 리스트는 가장 높은 우선권 리스트이며, 메인 루프의 후속하는 반복과 함께, 단계(701)는 아직 시험되지 않은 리스트의 가장 높은 우선권을 갖는 리스트를 선택한다. 변수 ideal_node_hit 및 ideal_CPU_hit 는 널 값으로 초기화된다(단계 702). 부가적으로, 디스패처는 선택 제어 블럭 리스트의 스레드에 대한 최대 대기 시간(maximum waiting time, wmax)을 결정한다(단계 702도 또한). 최대 대기 시간은 각 리스트에 대해 변하는데, 더 높은 우선권에 대해서는 더 적고, 더 낮은 우선권 리스트에 대해서는 더 크다; 그러므로 시험되는 각 선택 리스트에 대한 wmax 값을 리셋시킬 필요가 있다.7A and 7B (optionally referred to as FIG. 7) are flow diagrams illustrating the operation of the thread selection function in dispatcher 304. The thread selection function is called to select a thread for the target CPU (indicated by P), which is generally the just-validated CPU described above. This thread selection function examines the various control block lists 401-403 in the ready queue from the highest priority to the lowest priority until a suitable thread is found. As shown in Fig. 7, the thread selection function first selects the list examined (step 701). Initially, the selected control block list is the highest priority list, and with subsequent iterations of the main loop, step 701 selects the highest priority list of the list that has not yet been tested. The variables ideal_node_hit and ideal_CPU_hit are initialized with null values (step 702). In addition, the dispatcher determines a maximum waiting time (wmax) for a thread of the selection control block list (also step 702). The maximum wait time varies for each list, less for higher priority and greater for lower priority list; Therefore, it is necessary to reset the wmax value for each selection list tested.

스레드 선택 함수는 단계(710-718)을 포함하는 루프로써 도 7에 도시된, 매치가 발견될 때까지 교대로 선택된 제어 블럭 리스트내 또는 도달된 리스트의 끝에서 각 스레드를 시험한다. 리스트로부터의 스레드(t)가 선택된다(단계 710). 이 스레드는 초기에는 리스트에 있는 제1 스레드가 되고, 이 스레드의 제어 블럭은 가장길게 리스트 상에 있어 왔고, 후속적으로는, 아직 선택되지 않았던 것들 중에서 가장 기게 리스트 상에 있었던 스레드이다. 스레드 선택 함수는 그 뒤 P 가 스레드 t의 프로세서 유사성에 있는 CPU 중 하나인지 아닌지를 결정하는데, 예컨대, 프로세서 P 에 대응하는 비트는 스레드 t의 프로세서 유사성 마스크(502)에서 설정되는지 여부를 말한다. 만약 그렇지 않다면, 스레드 t는 프로세서 P상에서 실행되지 않도록 배제되며, 스레드 선택 함수는 다음 스레드를 시험하기 위해, 단계(718)로 진행한다.The thread selection function loops through steps 710-718 to test each thread in the alternately selected control block list or at the end of the reached list, as shown in FIG. 7, until a match is found. Thread t from the list is selected (step 710). This thread initially becomes the first thread in the list, and its control block has been on the list for the longest, and subsequently the thread on the list that has not been selected yet. The thread selection function then determines whether P is one of the CPUs in thread t's processor similarity, eg, whether the bit corresponding to processor P is set in thread t's processor similarity mask 502. If not, thread t is excluded from running on processor P, and the thread selection function proceeds to step 718 to test the next thread.

만약 P가 t의 프로세서 유사성에 있다면(단계 711로부터의 "Y"가지), 스레드 선택 함수는 t가 즉각적인 선택에 대한 표준을 만족하는지 여부를 결정한다(단계 712). 단계(712)에서 수행된 테스트가 논리적으로 다음과 같이 표현된다 :If P is in the processor similarity of t (" Y " from step 711), then the thread selection function determines whether t meets the criteria for immediate selection (step 712). The test performed at step 712 is logically represented as:

(t는 리얼-타임 우선권 리스트) OR (1)(t is the real-time priority list) OR (1)

(t는 wmax 보다 더 길게 대기하였음) OR (2)(t waited longer than wmax) OR (2)

((P=last_CPU) AND (P=ideal_CPU)) OR (3)((P = last_CPU) AND (P = ideal_CPU)) OR (3)

((P=last_CPU) AND (P는 t의 이상적인 노드 내에 있음)) OR (4)((P = last_CPU) AND (P is within the ideal node of t)) OR (4)

((P=last_CPU) AND (ideal_CPUt의 유사)) OR (5)((P = last_CPU) AND (ideal_CPU similar to t)) OR (5)

((이상적인 노드가 존재하지 않음) AND (P=ideal_CPU)) (6)((No ideal node exists) AND (P = ideal_CPU)) (6)

조건 (1) 및 (2)는, 스레드 t를 긴급하게 디스패칭할 때, 고려사항과 매칭되는 정상 노드(normal node)를 오버라이드한다. 리얼-타임 우선권 리스트는 특별한 고-우선권 제어 블럭 리스트이며, 이는 효과적으로 0인 wmax를 갖고, 이 리스트에 상에서 대기하는 어떠한 스레드 제어 블럭도 그 최대 대기 기간을 초과하였다. 모든 다른 제어 블럭 리스트에서, 만약 스레드 t가 이미, t가 대기하고 있는 리스트에 대한 소정의 기간 wmax보다 더 오래 대기하고 있었다면, t가 즉시 디스패치를 위해 선택된다. 만약 스레드가 마지막으로 P에서 실행되었다면(last_CPU 영역 505에 열거된 것처럼) 조건 (3)은 스레드를 선택하고, P는 ideal_CPU 영역(504)에 의해 열거되는 스레드의 이상적인 CPU이다. 조건 (4)는 (3)과 유사하지만, 이상적인의 개념을 t의 이상적인 노드 - 환언하면, t의 이상적인 노드 마스크(503)에 의해 열거되는 노드 - 의 임의의 프로세서로 확장한다. 조건 (5)은 스레드 t의 유사성 내의 이상적인 CPU 가 존재하지 않는 특별한 경우를 처리한다; 이것은 API 호출에 의한 것처럼, 몇 명의 임의의 값(유사성과 같은)이 프로세스 초기화 후에 변경된 곳에서만 발생한다. 조건 (6)은 마스크(503)에서 열거된 이상적인 노드가 없는 특별한 경우를 처리하는데, 이 경우 이상적인 CPU가 바람직하다.Conditions (1) and (2) override the normal node that matches the consideration when dispatching thread t urgently. The real-time priority list is a special high-priority control block list, which effectively has a wmax of zero, and any thread control block waiting on this list has exceeded its maximum waiting period. In all other control block lists, if thread t has already been waiting longer than the predetermined time period wmax for the list on which t is waiting, t is immediately selected for dispatch. If the thread was last executed at P (as listed in last_CPU region 505), condition (3) selects the thread, and P is the ideal CPU of the thread listed by ideal_CPU region 504. Condition (4) is similar to (3), but extends the concept of ideal to any processor of the ideal node of t, in other words, the node listed by the ideal node mask 503 of t. Condition (5) deals with the special case where there is no ideal CPU within the similarity of thread t; This only happens where some arbitrary value (such as similarity) has changed after process initialization, such as by an API call. Condition (6) deals with the special case where there are no ideal nodes listed in mask 503, in which case an ideal CPU is preferred.

만약 즉각적인 선택에 대해 위에서 표현된 기준이 만족된다면, "Y" 가지가 단계(712)에서 선택되고, 스레드 t는 지시된 선택 스레드가 되며(단계 713), 이 스레드 선택 함수는 잔여 스레드를 더 시험하지 않고 반환된다. 만약 그렇지 않다면, "N" 가지가 선택된다. 그 경우에, 만약 P가 ideal_CPU 영역(504)에 의해 열거된 이상적인 CPU이고, 이것이 첫번째로 만나는 스레드라면(환언하면, ideal_CPU_hit=null), 그 뒤 "Y" 가지가 단계(714)에서 선택되고, ideal_CPU_hit 는 t로 설정된다(단계 715). 만약 "N" 가지가 단계(714)에서 선택되고, 그뒤 만약 P가 이상적인 노드 마스크(503)에 의해 열거된 이상적인 노드에 있고, 이것이 첫번째로 만나는 스레드라면(예컨대, ideal_node_hit=null), "Y" 가지가 단계(716)으로부터 선택되고, ideal_node_hit 는 t로 설정된다(단계 717). 만약 더 많은 스레드가 선택 제어 블럭 리스트에 남아있다면(단계 718), 이 스레드 선택 함수는 리스 상의 다음 스레드를 선택하고 시험하기 위해 반환된다. 선택 리스트에 있는 모든 스레드가 시험된 경우에는, "N" 가지가 단계(718)로부터 선택된다. 스레드 선택 함수는 단계(720)로 계속된다.If the criterion expressed above for the immediate selection is satisfied, the "Y" branch is selected in step 712, and thread t becomes the indicated selection thread (step 713), and this thread selection function further tests the remaining threads. Is returned. If not, the "N" branch is selected. In that case, if P is the ideal CPU listed by ideal_CPU region 504, and this is the first thread to meet (in other words, ideal_CPU_hit = null), then the "Y" branch is selected at step 714, ideal_CPU_hit is set to t (step 715). If the "N" branch is selected at step 714, then if P is at the ideal node listed by the ideal node mask 503 and this is the first thread to meet (e.g., ideal_node_hit = null), then "Y" The branch is selected from step 716 and ideal_node_hit is set to t (step 717). If more threads remain in the selection control block list (step 718), this thread selection function is returned to select and test the next thread on the lease. If all threads in the select list have been tested, an "N" branch is selected from step 718. The thread selection function continues to step 720.

전체 제어 블럭 리스트를 검토하여, 만약 ideal_CPU_hit가 널이 아니면, "N" 가지가 단계(720)에서 선택되고, ideal_CPU_hit에 의해 열거된 스레드는 지시된 선택 스레드가 되며(단계 721), 함수는 부가적인 리스트를 시험함이 없이 반환된다. 만약 ideal_node_hit가 널이면, "Y" 가지가 단계(720)에서 선택된다. 이 경우에 있어서, 만약 ideal_node_hit이 널이 아니면, "N" 가지가 단계(722)에서 선택되고, ideal_node_hit에 의해 열거된 스레드는 지시된 선택 스레드이며, 함수는 부가적인 리스트의 시험없이 반환된다. 만약 ideal_CPU_hit 및 ideal_node_hit 양쪽 모두 널이면, "Y" 가지가 취해지고, 현 리스트 바로 아래의 우선권을 갖는 리스트가 시험을 위해 선택된다. 만약 모든 리스트가 시험되었다면, "N" 가지가 단계(724)에서 선택되어, 널 값은 선택 스레드로써 지시되며(단계 725), 이 스레드 선택 함수는 반환된다. 스레드가 실행을 위해 유효하게 된 경우, 디스패처는, 만약 가능하다면, 스레드를 실행하기에 적합한 CPU를 선택하도록 불러진다. 도 7의 복수의 잠재적인 CPU 후보들로부터 CPU를 선택하기 위해 CPU 선택 프로세스가 호출된다. 이 프로세스는 도 8에서 설명된다.If the ideal_CPU_hit is not null, then the "N" branch is selected in step 720, and the thread listed by ideal_CPU_hit becomes the indicated selection thread (step 721), and the function is additional The list is returned without testing. If ideal_node_hit is null, a "Y" branch is selected at step 720. In this case, if ideal_node_hit is not null, an "N" branch is selected in step 722, and the thread listed by ideal_node_hit is the indicated selection thread, and the function returns without testing the additional list. If both ideal_CPU_hit and ideal_node_hit are null, a "Y" branch is taken, and the list with the priority immediately below the current list is selected for testing. If all lists have been examined, an "N" branch is selected in step 724, where a null value is indicated as the selection thread (step 725), and this thread selection function is returned. If the thread has been enabled for execution, the dispatcher is called to select the appropriate CPU to run the thread, if possible. The CPU selection process is invoked to select a CPU from the plurality of potential CPU candidates of FIG. This process is described in FIG.

CPU 선택 함수는 첫째로 이상적인 CPU가 스레드 t의 유사성 내에 존재하는지여부와 현재 아이들 상태인지 여부를 결정한다(단계 801). 만약 그렇다면, "Y" 가지가 단계(801)에서 선택되며, 이 이상적인 CPU가 선택되며(단계 802), 이 스레드 선택 함수는 반환된다. 만약 그렇지 않다면, "N" 가지가 단계(801)에서 선택된다.The CPU selection function first determines whether an ideal CPU is within the similarity of thread t and whether it is currently idle (step 801). If so, a "Y" branch is selected in step 801, this ideal CPU is selected (step 802), and this thread selection function is returned. If not, an "N" branch is selected at step 801.

만약 적어도 하나의 아이들 CPU가 스레드 t의 유사성 내에 있고, 이상적 노드가 존재하는 경우, 이 CPU가 스레드 t에 대한 이상적인 노드 내에 있다면(환언하면, 어떠한 이상적인 노드도 존재하지 않는 경우에, 테스트는 적어도 하나의 아이들 CPU 가 스레드 t의 유사성 내에 있는지 여부임), 그후 "Y" 가지는 단계(803)로부터 취해진다. 이러한 경우에는, 하나의 그러한 CPU 가 선택된다(단계 804). 단계(803)의 기준을 만족하는 하나 이상의 아이들 CPU가 존재하는 곳에서, 만약 그것이 그 기준을 만족하는 CPU들 하나라면, CPU 선택 함수는 스레드 t에 의해 최종적으로 사용된 CPU 를 선택하고, 만약 그렇지 않다면, 디폴트 선택 논리의 기반 하의 CPU들 중 어느 하나를 선택한다. 스레드 선택 함수는 그 뒤 반환된다. 만약 상기 기준을 만족하는 CPU가 단계(803)에서 발견되지 않는다면, "N" 가지는 아이들이 아닌 임의의 프로세서를 고려하여 선택된다.If at least one idle CPU is within the similarity of thread t, and an ideal node exists, if this CPU is within the ideal node for thread t (in other words, if no ideal node exists, the test is at least one) Whether the idle CPU of is within the similarity of thread t), and then the "Y" branch is taken from step 803. In this case, one such CPU is selected (step 804). Where there is one or more idle CPUs that meet the criteria of step 803, if it is one of the CPUs that satisfy the criteria, the CPU selection function selects the CPU finally used by thread t, If not, select one of the CPUs under the base of the default selection logic. The thread selection function is then returned. If no CPU meeting the criteria is found in step 803, the " N " branch is selected taking into account any processor that is not idle.

만약 아이들 CPU가 스레드 t의 유사성 내에 존재한다면, "Y" 가지는 단계(805)에서 선택되고, 이 이상적인 CPU는 후보 CPU로써 임시적으로 선택된다(단계 806). 이러한 경우에 있어서, CPU는 반드시 busy 상태이거나, 단계(801)에서 선택되었을 것이다. 만약 "N" 가지가 단계(805)에서 취해지고, 스레드 t의 유사성 내의 CPU가 존재하고, 또한 만약 아이들 노드가 존재한다면, 또한 스레드 t의 아이들 노드 내에 있는 것이라면(환언하면, 아이들 노드가 없는 곳에서, 테스트 대상은 스레드 t의 유사성 내의 CPU가 존재하는지 여부임), "Y" 가지는 단계(807)에서 선택되며, 하나의 그러한 CPU가 임시적으로 선택된다(단계 808). 하나 이상의 그러한 CPU가 존재하는 곳에서는, 만약 그것이 기준을 만족하는 CPU 중 하나라면, CPU 선택 함수는 임시적으로 스레드 t에 의해 최종적으로 사용되는 CPU를 선택하고, 만약 그렇지 않다면, 임시적으로 디폴트 선택 논리 기반의 CPU 중 하나를 선택한다. 만약 "N" 가지가 단계(807)에서 취해졌다면, CPU 선택 함수는 임시적으로 디폴트 선택 논리를 사용하여, 스레드 t의 유사성 내의 CPU를 선택한다(단계 809).If the idle CPU is within the similarity of thread t, the "Y" branch is selected in step 805, and this ideal CPU is temporarily selected as a candidate CPU (step 806). In this case, the CPU must be busy or selected in step 801. If an "N" branch is taken at step 805, there is a CPU in the similarity of thread t, and if there is an idle node, it is also within the idle node of thread t (in other words, where there is no idle node) In this case, the test target is whether or not there is a CPU in the similarity of the thread t), the "Y" branch is selected in step 807, and one such CPU is temporarily selected (step 808). Where there is more than one such CPU, if it is one of those that meets the criteria, the CPU selection function temporarily selects the CPU finally used by thread t, and if not, temporarily based on the default selection logic. Choose one of the CPUs. If the "N" branch was taken in step 807, the CPU selection function temporarily selects the CPU within the similarity of thread t using the default selection logic (step 809).

만약 후보 CPU가 임시적으로 단계(806, 808 또는 809)에서 선택되었다면, 후보 프로세서 내의 임의의 현재 구동되는 스레드의 우선권은 스레드 t의 우선권과 비교된다(단계 810). 만약 t의 우선권이 더 크면, "N" 가지가 단계(810)로부터 취해지고, 후보 CPU 가 선택 CPU로써 확인된다. 이 경우에 있어서, 아이들이 아닌 CPU의 선택은 현재 실행중인 스레드를 사전에 비워둘 것이다. 만약 t의 우선권이 실행 중인 스레드의 우선권보다 더 크지 않다면, "Y" 가지가 단계(810)에서 취해지고, 선택 CPU는 널로 설정된다. 어느 경우에 있어서나, CPU 선택 함수는 반환된다. CPU 선택 함수가 널 선택으로 반환되는 곳에서는, 스레드 t의 즉각적인 디스패치를 위한 적당한 CPU 를 찾을 수 없었으며, 그러므로 스레드 t는, 전에 설명한 스레드 선택 함수에 의해 선택된 경우에 큐로부터 궁극적인 디스패치를 대기하는 준비 큐에 위치된다. 그러므로, 스레스 선택 함수에 유사하다면, 어떠한 CPU도 선택되지 않고, 스레드 t가 궁극적으로 준비 큐에 위치되는 지점에서조차, CPU 선택 함수는 비-이상적인 노드의 아이들 CPU 를 선택하는 것을 거절할 수도 있다.If the candidate CPU has been temporarily selected in step 806, 808 or 809, the priority of any currently running thread in the candidate processor is compared with that of thread t (step 810). If the priority of t is greater, an "N" branch is taken from step 810, and the candidate CPU is identified as the selection CPU. In this case, the selection of a non-idle CPU will free up the currently running thread in advance. If the priority of t is not greater than the priority of the running thread, then a "Y" branch is taken at step 810, and the selection CPU is set to null. In either case, the CPU selection function is returned. Where the CPU selection function returned null selection, no suitable CPU could be found for immediate dispatch of thread t, so thread t waits for the ultimate dispatch from the queue if selected by the thread selection function described previously. Located in the ready queue. Therefore, if similar to the thread selection function, no CPU is selected, and even at the point where thread t is ultimately placed in the ready queue, the CPU selection function may refuse to select the idle CPU of the non-ideal node.

일반적으로, 본 발명의 설명된 실시예를 구현하기 위해 실행되는 루틴, 즉 운영 시스템 또는 특정한 어플리케이션, 프로그램, 객체, 모듈 또는 순차적인 명령의 한 부분으로써 구현되는지 여부는, 여기의 "컴퓨터 프로그램" 또는 단순히 "프로그램"이라고 언급된다. 컴퓨터 프로그램은 전형적으로, 본 발명과 일관된 컴퓨터 시스템의 장치 또는 시스템 내의 하나 또는 이상의 프로세서에 의해 준비 또는 실행되는 경우에, 장치 또는 시스템이 단계를 실행하거나, 본 발명의 다양한 면을 실시하는 요소를 발생하는데 필요한 단계를 수행하도록 하는 명령들을 포함한다. 게다가, 발명이 최대한으로 기능하는 컴퓨터 시스템을 설명하고 있는 반면에, 본 발명의 다양한 실시예는 다양한 형태의 프로그램 제품으로써 배포될 수 있으며, 본 발명은 배포를 실제로 완수하는데 사용되는 특정한 종류의 신호 유지 매체에도 불구하고, 동일하게 적용된다. 신호 유지 매체의 예는, 한정되지는 않지만, 휘발성 및 비휘발성 메모리 장치와 같은 기록 가능형 매체, 플로피 디스크, 하드-디스크 드라이브, CD-ROM, DVD, 자기 테이프와 무선 통신 링크를 포함하는 디지털 및 아날로그 통신 링크와 같은 전송형 매체를 포함한다. 신호 유지 매체의 예는, 메모리(210)와 저장 장치(224)로써 도 2에 설명된다.In general, the routines executed to implement the described embodiments of the invention, i.e. whether implemented as an operating system or as part of a particular application, program, object, module or sequential instruction, are referred to herein as a "computer program" or It is simply referred to as a "program." A computer program is typically prepared or executed by one or more processors in a device or system of a computer system consistent with the present invention, where the device or system generates elements that perform steps or implement various aspects of the present invention. Instructions to perform the steps necessary to perform. In addition, while the invention describes a computer system that functions to the maximum, various embodiments of the invention may be distributed as various types of program products, the present invention being capable of maintaining a particular kind of signal used to actually accomplish the distribution. Notwithstanding the medium, the same applies. Examples of signal bearing media include, but are not limited to, recordable media such as volatile and nonvolatile memory devices, digital media including floppy disks, hard-disk drives, CD-ROMs, DVDs, magnetic tapes and wireless communication links; Transmission type media such as analog communication links. Examples of signal holding media are described in FIG. 2 with memory 210 and storage 224.

<부가적인 선택적 실시예>Additional Optional Embodiments

NUMA 시스템의 스레드를 디스패칭하기 위한 특정한 알고리즘이 지금까지 바람직한 실시예로서 설명되어 왔다. 그러나, 상기의 알고리즘의 다양한 변화가 가능하다는 것을 이해해야 한다. 선택된 정확한 알고리즘은 종종 설계된 컴퓨터에 특정한 다양한 하드웨어 및 소프트웨어 고려에 의존한다. 일반적으로, 분산 메모리 시스템의 서브세트에 관한 프로세서의 물리적인 위치를 고려하는 공동 준비 큐로부터의 임의의 스레드를 디스패치하는 알고리즘이, 긴 대기시간 리얼 메모리 액세스(long latency real memory access)를 감소시키기 위해 사용될 수 있다. 대부분의 확실한 대체는 바람직한 실시예에 관하여 여기서 설명된 하나 또는 이상의 조건을 제거하는 것이며, 또는 상기의 알고리즘에 일정한 조건을 부가하는 것이다. 부가적인 변화의 몇몇의 특정한 예가 다음에 기술되며, 이것들은 오직 예로써 언급되는 것이고, 가능한 대체물에 대한 완전한 리스트로 간주되어서는 안 된다. 하나의 대체에 있어서, 이상적인 프로세서 도는 이상적인 노드를 지시할 필요는 없다. 프로세스 또는 스레드가 첫째로 실행되는 프로세서는 랜덤하게 결정된다. 스레드 선택 알고리즘은 목표 프로세서 또는 동일한 노드 상의 또다른 프로세서 상에서 마지막으로 실행되는 스레드를 선호한다. 동일한 노드 상의 프로세서가 더 바람직하기 때문에, 스레드는 이상적인 지시가 부족함에도 불구하고, 일관된 노드 내에서 실행되는 경향이 있다.Specific algorithms for dispatching threads of NUMA systems have been described as preferred embodiments so far. However, it should be understood that various variations of the above algorithms are possible. The exact algorithm chosen often depends on various hardware and software considerations specific to the computer on which it is designed. In general, an algorithm for dispatching any thread from a common ready queue that takes into account the physical location of a processor relative to a subset of a distributed memory system may be used to reduce long latency real memory access. Can be used. Most obvious alternatives are to remove one or more of the conditions described herein with respect to the preferred embodiment, or to add certain conditions to the above algorithm. Some specific examples of additional changes are described below, which are mentioned as examples only and should not be considered as a complete list of possible alternatives. In one alternative, there is no need to indicate an ideal processor or ideal node. The processor on which the process or thread runs first is randomly determined. The thread selection algorithm prefers the last thread running on the target processor or another processor on the same node. Since processors on the same node are more desirable, threads tend to run within a consistent node, despite the lack of ideal instructions.

또다른 선택에 있어서, 스레드는 상기의 논리에 반대로써의 수학적 측정 함수에 의해 선택될 수도 있다. 측정 함수는 큐에 있는 각 스레드에 대해 특정 스코어를 제공하며, 이 스레드는 디스패치를 위해 선택된 최적의 스코어를 갖는 것이다. 예컨대, 그러한 측정 함수의 형태는 다음과 같다:In another option, the thread may be selected by a mathematical measurement function as opposed to the above logic. The measure function provides a specific score for each thread in the queue, which has the optimal score selected for dispatch. For example, the form of such a measurement function is as follows:

F₁(wait_time)+F₂(node)+F₃(CPU)+F₄(Priority)+…F ₁ (wait_time) + F ₂ (node) + F ₃ (CPU) + F ₄ (Priority) +.

여기서, F_N는 개별적인 독립 변수의 뉴머리컬 함수이다. 물론, 그러한 측정함수는 더 복잡하다.Where F _N is the pneumatic function of the individual independent variables. Of course, such a measurement function is more complicated.

상기의 바람직한 실시예에 있어서, NUMA 컴퓨터 시스템은 아키텍처상으로, 반-독립 노드의 모음으로써 설계되며, 각 노드는 내부 버스, 프로세서, 로컬 메모리, 등을 구비하고, 인터-노드 통신 매체로써 서로 결합되어 있다. 부가적으로, 그러한 NUMA 시스템 아키텍처의 몇몇의 예들이 구성되어 왔으며, 공중이 사용가능하며, 임의의 실제적인 경험이 이미 이러한 접근으로 얻어져 왔다. 그러나, 본 발명에 따른 NUMA 시스템은 반드시 그러한 노드 모델로 설계될 필요는 없다. 시스템이 상기의 노드 시스템을 제외한 일정한 설계 모델을 기반으로 하고 있음에도 불구하고, 스레드 또는 작업을 실행하는 프로세서를 선택함에 있어, 논-유니폼 메모리 액세스를 설명하는 디스패처는 잠재적으로 임의의 NUMA 시스템 내의 값화 되어 있다. 그러므로, 공지된 또는 그후에 개발된 어떤 선택적인 시스템 아키텍처(논-유니폼 메모리 액세스의 특성을 나타내는)가 채택될 수도 있다. 그러한 선택적인 아키텍처의 예로써, NUMA 시스템은 복잡한 메모리 버스 구조를 가지고 있는 시스템이며, 이 메모리 버스는 인터페이스에 의해 링크된 버스 세그먼트(bus segment)의 웹을 포함하고, 일부 메모리 액세스는 오직 단일한 버스 세그먼트를 포함하며, 반면에 다른 메모리 액세스는 복수의 세그먼트를 검토하여 더 많은 시간을 필요로 한다. 다른 대체가 가능하다. 더 나아가, 바람직한 실시예에서, 메모리 액세스가 2개의 클래스로 분할됨에도 불구하고(인트라-노드 및 인터-노드), 하나의 시스템은 2개 이상의 메모리 액세스 클래스를 구비할 수 있으며, 각 클래스는 다른 개별적인 액세스 시간을 필요로 한다.In the above preferred embodiment, the NUMA computer system is architecturally designed as a collection of semi-independent nodes, each node having an internal bus, a processor, a local memory, and the like, coupled to each other as an inter-node communication medium. It is. In addition, some examples of such NUMA system architectures have been constructed, publicly available, and any practical experience has already been gained with this approach. However, the NUMA system according to the present invention does not necessarily need to be designed with such a node model. Although the system is based on a certain design model except for the node system described above, in selecting a processor that executes a thread or task, the dispatcher describing non-uniform memory access is potentially valued in any NUMA system. have. Therefore, any optional system architecture known or later developed (indicating the nature of non-uniform memory access) may be employed. As an example of such an optional architecture, a NUMA system is a system with a complex memory bus structure, which includes a web of bus segments linked by an interface, and some memory accesses only a single bus. Segments, while other memory accesses require more time to review multiple segments. Other alternatives are possible. Furthermore, in a preferred embodiment, even though memory access is divided into two classes (intra-node and inter-node), one system may have two or more memory access classes, each class being a separate individual Requires access time.

본 발명의 특정한 실시예가 설명을 위해 기술될 수 있음에도 불구하고, 다양한 개조가 본 발명의 정신과 범위를 벗어나지 않고 이루어질 수 있음을 이해해야 한다. 따라서, 본 발명의 보호 범위는 다음의 청구범위와 그 균등물에 의해 제한되지 않는다.Although specific embodiments of the invention may be described for illustration, it should be understood that various modifications may be made without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention shall not be limited by the following claims and their equivalents.

상기 스레드 선택 알고리즘의 동작에 관해 다양한 관찰이 이루어질 수 있다. 상기의 디스패처는 임의의 CPU 또는 CPU 그룹과 관계되지 않은 단일 준비 큐가 존재함에도 불구하고, NUMA-지각 동작을 달성할 수 있다. 대부분의 스레드에 대해(지시된 단일 이상적인 노드와 지시된 단일 이상적인 CPU를 구비함), 만약 프로세서 P가 적어도 스레드의 이상적인 노드에 있지 않다면, 스레드는 일반적으로 디스패치를 위해 선택되지 않는다. 이상적인 노드 내에서, 그 이상적인 노드에 있고, 그 이상적인 프로세서에 있지 않은 P를 구비한 스레드에 대한 그 이상적 프로세서로써 P를 구비한 스레드에게, 약한 우선이 주어진다. 만약 P가 스레드가 실행하고 임의의 다른 조건이 만족되는 마지막 프로세서라면[단계(712)의 (3)-(5)], 큐를 더 시험하지 않고, 이상적인 후보로써 하나의 스레드가 즉시 선택된다; 노드의 로컬 메모리에 부가하여, 프로세서의 캐시에 유용한 데이터가 있을 수 있기 때문에, 마지막 프로세서가 중요하다. 만약 프로세서 P가 스레드가 실행된 마지막 프로세서가 아니고, 이상적인 노드에 있거나, 이상적인 프로세서라면, 선택은 ideal_CPU_hit 및 ideal_node_hit 변수에 의해서 임시적이고, 만약 큐를 검토하는 동안 더 좋은 후보가 발견되면, 오버라이드된다. 스레드 선택 알고리즘은 스레드가 큐에서 대기하고있음에도 불구하고, 반드시 어떠한 스레드를 선택하지는 않는다. 최종적으로, 큐 상에서 대기하는 스레드는 궁극적으로, P가 이상적인 노드 또는 스레드의 이상적인 프로세서와 매치되지 않음에도 불구하고, 디스패치를 위해 선택된다.Various observations may be made regarding the operation of the thread selection algorithm. The dispatcher can achieve NUMA-perceived operation despite the presence of a single ready queue that is not associated with any CPU or group of CPUs. For most threads (with a single ideal node indicated and a single ideal CPU indicated), if processor P is not at least at the ideal node of the thread, the thread is generally not selected for dispatch. Within an ideal node, a weak priority is given to a thread with P as its ideal processor for a thread with P that is at that ideal node and not to that ideal processor. If P is the last processor that the thread executes and any other condition is satisfied ((3)-(5) of steps 712), then no more tests for the queue, and one thread is immediately selected as an ideal candidate; The last processor is important because in addition to the node's local memory, there may be useful data in the processor's cache. If processor P is not the last processor the thread ran on, but is at the ideal node or is the ideal processor, the selection is temporary by the ideal_CPU_hit and ideal_node_hit variables, and if a better candidate is found while examining the queue, it is overridden. The thread selection algorithm does not necessarily select any thread even though the thread is waiting in the queue. Finally, the thread waiting on the queue is ultimately chosen for dispatch, even though P does not match the ideal node or the ideal processor of the thread.

이상적인 노드를 지시한 결과로써, 이상적인 프로세서가 유효하지 않더라도, 스레드는 동일한 노드에서 실행되는 경향이 있다. 페이저는 페이지 데이터를 페이지 요청을 발생하는 프로세서의 노드의 로컬 메모리에 로드하므로, 스레드가 필요한 데이터는 지시된 이상적인 노드의 로컬 리얼 메모리에 축적되는 경향이 있다. 결과적으로, CPU 로부터 시스템 분배 리얼 메모리로의 메모리 액세스는, 노드의 장소를 고려하지 않는 유사한 작업 디스패처의 경우보다 더 큰 비율로 CPU 노드의 로컬 리얼 메모리에(노드 경계 너머의 메모리보다는) 액세스하게 된다. 인트라-노드 메모리 액세스에 비례한 증가가, 인터-노드 통신 매체에서의 트래픽을 감소시키고, 메모리 액세스 시간을 감축시킴으로써, 시스템 작업처리량을 향상시킨다.As a result of pointing to an ideal node, threads tend to run on the same node even if the ideal processor is not valid. The pager loads the page data into the local memory of the node of the processor making the page request, so the data needed by the thread tends to accumulate in the local real memory of the indicated ideal node. As a result, memory access from the CPU to the system distributed real memory will access the CPU node's local real memory (rather than memory beyond the node boundary) at a greater rate than for similar task dispatchers that do not consider the location of the node. . Increasing proportional to intra-node memory access improves system throughput by reducing traffic in inter-node communication media and reducing memory access time.

Claims

A method of dispatching threads to central processing units (CPUs) of a computer system, the computer system comprising a plurality of computers 201-204 and a memory 210 that can be divided into a plurality of discrete subsets,

Identifying a target CPU;

Identifying one set of threads suitable for execution on the target CPU (710), wherein the set of threads waits on a common ready queue 305 that is not associated with any CPU or group of CPUs. Wow;

Identifying at least one target subset of the plurality of discrete memory subsets for each individual thread of the thread set, wherein the target subset is memory to a location in the target subset by the target CPU. An identification step having an individual latency period for access;

Selecting one thread from the set of threads for execution on the target CPU, the selection step being based at least in part on the respective latency of each target subset;

Thread dispatch method comprising a.

2. The system of claim 1, wherein each of the plurality of CPUs is associated with an individual subset of one of the plurality of discrete memory subsets and identifies at least one target subset of the plurality of discrete memory subsets. Said step of instructing an individual CPU desired to execute each thread of said thread set, wherein said target subset of said plurality of discrete memory subsets is a memory subset associated with said preferred CPU. Thread dispatch method.

3. The method of claim 1 or 2, wherein the plurality of CPUs are associated with each individual subset of the plurality of discrete memory subsets.

The method of claim 1, wherein the step of selecting one thread from the set of threads for execution comprises:

Assigning a first relative priority to a thread for which the indicated preferred CPU is the target CPU (714-715),

Assigning (716-717) a second relative priority to a thread in which the indicated preferred CPU is associated with the same subset of memory as the target CPU;

Assigning a third relative priority to a thread for which the indicated preferred CPU is not associated with the same subset of memory as the target CPU;

The first relative priority is greater than the second relative priority, and the second relative priority is greater than the third relative priority

Thread dispatch method.

5. The computer system of claim 1, wherein the computer system comprises a plurality of discrete nodes 101-104, wherein each node is a separate sub-set of at least one CPU and one of the plurality of discrete memory subsets. 6. Identifying a target subset of at least one of the plurality of discrete memory subsets for each individual thread, identifying at least one target node for each individual thread (605). Thread dispatch method comprising a.

6. The method of claim 1, wherein the step of identifying at least one target node for each individual thread includes the step 615 of indicating an individual CPU desired to run each thread of the thread set. And wherein the target node is the node on which the preferred CPU is located.

7. The method of claim 2, wherein the step of instructing a desired individual CPU to execute each thread of the thread set is performed when the thread is created (610).

Instructions for performing the method of any one of claims 1 to 7;

Signal bearing media in which the command is recorded

Computer program product comprising a.

Each node includes one or more central processing units (CPUs) 201-204 and local memory 210, wherein the set of all local memory in a plurality of discrete nodes includes distributed main memory of the computer system. A plurality of discrete nodes;

An interface network (105) for providing data communication between the plurality of nodes;

A common ready queue for holding a plurality of threads ready to run on said computer system, said queue not being associated with any particular CPU or group of CPUs;

A thread that dispatches a thread from the common ready queue to run on the CPU, wherein the thread dispatcher considers the node location of the CPU when dispatching the thread, and the thread dispatcher is the data requested by the thread in the local memory of the node. Thread dispatcher 304 to preferentially dispatch threads to CPUs within nodes that are likely to contain relatively large shares of

A non-uniform memory access computer system comprising a.

10. The system of claim 9, wherein one location is selected in the distributed main memory for storing paged-in data, and the pager stores at least page-in data that has generated a page fault. A pager 303 for storing in the local memory of the node determined by one thread;

Containing the CPU that was running the thread that caused the page fault

Non-Uniform Memory Access Computer System.

The non-uniform memory access computer system of claim 9, wherein each node comprises a plurality of CPUs.

12. The non-uniform memory access computer system of claim 9, wherein the dispatcher selects one thread from a plurality of ready threads running on a target CPU.

12. The non-uniform memory access computer system of claim 9, wherein the dispatcher selects one CPU from among a plurality of CPUs suitable for execution on a preparation thread.

14. The method of claim 9, wherein the preferred individual node is associated with at least some threads when a thread is created, wherein the thread dispatcher causes the thread to be preferentially dispatched to the CPU at the preferred node of the thread. Non-Uniform Memory Access Computer System.

In a thread dispatching mechanism for a non-uniform memory access computer system, the non-uniform memory access computer system includes a plurality of central processing units (CPUs) 201-204,

For each of a plurality of threads in a common ready queue that is not associated with any CPU or set of CPUs, for determining one CPU set that the thread is likely to have shorter memory access time to the requested data. Means for each individual CPU set having at least one CPU;

Means for selecting a thread and a CPU for executing the threads 610-615, wherein the selecting means preferentially selects a thread and a CPU for executing the thread, the CPU for executing the thread Thread dispatching mechanism, wherein the thread is a member of one of said CPU sets that is more likely to have shorter memory access times to the requested data.

16. The non-uniform memory access computer system of claim 15, wherein the non-uniform memory access computer system comprises a plurality of discrete nodes 101-104, wherein each node is an individual of at least one CPU and one of the plurality of discrete memory subsets. Means for selecting one CPU set that includes a subset 210 and, for each thread, is more likely to have a shorter memory access time for the data requested by the thread, for each thread, the individual thread. And means for identifying a set of CPUs in the node that are likely to contain the data requested by the thread dispatching mechanism.

17. The thread dispatching mechanism of claim 15 wherein said selecting means preferentially selects said thread when a thread is created.