
CN100489784C - Multithreading microprocessor and its novel threading establishment method and multithreading processing system - Google Patents

Multithreading microprocessor and its novel threading establishment method and multithreading processing system

Info

Publication number
CN100489784C
CN100489784C, CNB2004800247988A, CN200480024798A
Authority
CN
China
Prior art keywords
thread
instruction
microprocessor
operand
register
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2004800247988A
Other languages
Chinese (zh)
Other versions
CN1842769A (en)
Inventor
Kevin Kissell
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Imagination Technologies Ltd
MIPS Tech LLC
Original Assignee
MIPS Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by MIPS Technologies Inc filed Critical MIPS Technologies Inc
Publication of CN1842769A
Application granted
Publication of CN100489784C

Landscapes

  • Executing Machine-Instructions (AREA)
  • Multi Processors (AREA)
  • Advance Control (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A fork instruction is disclosed that executes on a multithreaded microprocessor and occupies a single instruction issue slot. Executed within a parent thread, the fork instruction includes a first operand and a second operand; the first operand specifies the initial instruction fetch address of a new thread. The microprocessor executes the fork instruction by allocating a context for the new thread, copying the first operand into the program counter of the new thread context, copying the second operand into a register of the new thread context, and scheduling the new thread for execution. If no free thread context is available for allocation, the microprocessor raises an exception on the fork instruction. The fork instruction is efficient because it does not copy the parent thread's entire set of general-purpose registers to the new thread. The second operand can typically be used as a pointer to a data structure in memory that contains the new thread's starting general-purpose register values.

Description

Multithreaded microprocessor, new thread creation method therefor, and multithreaded processing system

Cross-Reference to Related Applications

This application is a continuation-in-part (CIP) of the following pending US non-provisional patent applications, each of which is hereby incorporated by reference in its entirety for all purposes.

  Application No. (Docket No.)    Filing Date    Title
  10/684350 (MIPS.0188-01-US)     10/10/03       MECHANISMS FOR ASSURING QUALITY OF SERVICE FOR PROGRAMS EXECUTING ON A MULTITHREADED PROCESSOR
  10/684348 (MIPS.0189-00-US)     10/10/03       INTEGRATED MECHANISM FOR SUSPENSION AND DEALLOCATION OF COMPUTATIONAL THREADS OF EXECUTION IN A PROCESSOR

The aforementioned pending US non-provisional patent applications claim the benefit of the following US provisional applications, each of which is hereby incorporated by reference in its entirety for all purposes.

  Application No. (Docket No.)    Filing Date    Title
  60/499180 (MIPS.0188-00-US)     8/28/03        MULTITHREADING APPLICATION SPECIFIC EXTENSION
  60/502358 (MIPS.0188-02-US)     9/12/03        MULTITHREADING APPLICATION SPECIFIC EXTENSION TO A PROCESSOR ARCHITECTURE
  60/502359 (MIPS.0188-03-US)     9/12/03        MULTITHREADING APPLICATION SPECIFIC EXTENSION TO A PROCESSOR ARCHITECTURE

This application is related to the following concurrently filed US non-provisional patent applications, each of which is hereby incorporated by reference in its entirety for all purposes.

  Application No. (Docket No.)    Filing Date    Title
  (MIPS.0189-01-US)               8/27/04        INTEGRATED MECHANISM FOR SUSPENSION AND DEALLOCATION OF COMPUTATIONAL THREADS OF EXECUTION IN A PROCESSOR
  (MIPS.0193-00-US)               8/27/04        MECHANISMS FOR DYNAMIC CONFIGURATION OF VIRTUAL PROCESSOR RESOURCES
  (MIPS.0194-00-US)               8/27/04        APPARATUS, METHOD, AND INSTRUCTION FOR SOFTWARE MANAGEMENT OF MULTIPLE COMPUTATIONAL CONTEXTS IN A MULTITHREADED MICROPROCESSOR

Technical Field

The present invention relates generally to the field of multithreaded processors, and in particular to an instruction for creating a new thread of execution in a multithreaded processor.

Background

Microprocessor designers use many techniques to improve microprocessor performance. Most microprocessors operate using a clock signal running at a fixed frequency, and each clock cycle the circuits of the microprocessor perform their respective functions. According to Hennessy and Patterson, the true measure of a microprocessor's performance is the time required to execute a program or collection of programs. From this viewpoint, the performance of a microprocessor is a function of its clock frequency, the average number of clock cycles required to execute an instruction (or, stated conversely, the average number of instructions executed per clock cycle), and the number of instructions executed in the program or programs. Semiconductor scientists and engineers continually enable microprocessors to run at faster clock frequencies, chiefly by reducing transistor size, which results in faster switching times. The number of instructions executed is largely fixed by the task to be performed by the program, although it is also affected by the instruction set architecture of the microprocessor. Large performance increases have been realized through architectural and organizational notions that improve the number of instructions executed per clock cycle, in particular through notions of parallelism.
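For reference, the relationship described in this paragraph is conventionally written as the processor performance equation attributed to Hennessy and Patterson. The symbols below (IC for instruction count, CPI for average clocks per instruction, t_cycle for clock period, f_clock for clock frequency) are the textbook ones and are not notation used elsewhere in this document:

    T_{exec} = IC \times CPI \times t_{cycle} = \frac{IC \times CPI}{f_{clock}}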

One notion of parallelism that improves both the clock frequency and the number of instructions executed per clock cycle is pipelining, which overlaps the execution of multiple instructions within the pipeline stages of the microprocessor. In an ideal situation, each clock cycle one instruction moves down the pipeline to a new stage, which performs a different function on the instruction. Thus, although each individual instruction takes multiple clock cycles to complete, the multiple cycles of the individual instructions overlap, so the average number of clock cycles per instruction is reduced. The performance improvement of pipelining may be realized to the extent that the instructions in the program permit it, namely to the extent that an instruction does not depend upon its predecessors in order to execute and can therefore execute in parallel with its predecessors; this is commonly referred to as instruction-level parallelism. Another way in which contemporary microprocessors exploit instruction-level parallelism is to issue multiple instructions for execution per clock cycle; such microprocessors are commonly referred to as superscalar microprocessors.

The parallelism discussed above pertains to parallelism at the level of individual instructions. However, the performance improvement that may be achieved through exploitation of instruction-level parallelism is limited. The various constraints imposed by limited instruction-level parallelism, together with other performance-constraining issues, have recently generated renewed interest in exploiting parallelism at the level of blocks, or sequences, or streams of instructions, commonly referred to as thread-level parallelism. A thread is simply a sequence, or stream, of program instructions. A multithreaded microprocessor concurrently executes multiple threads according to a scheduling policy that dictates the fetching and issuing of instructions of the various threads, such as interleaved, blocked, or simultaneous multithreading. A multithreaded microprocessor typically allows the multiple threads to share the functional units of the microprocessor (e.g., instruction fetch and decode units, caches, branch prediction units, and execution units such as load/store, integer, floating-point, and SIMD units) in a concurrent fashion. However, multithreaded microprocessors also include multiple sets of resources, or contexts, for storing the unique state of each thread, such as multiple program counters and general-purpose register sets, to facilitate the ability to quickly switch between threads to fetch and issue instructions.

One example of a performance-constraining issue addressed by multithreaded microprocessors is the relatively long latency typically incurred when memory outside the microprocessor must be accessed because of a cache miss. It is quite common in contemporary microprocessor-based computer systems for the access time of a memory access that misses the cache to be one to two orders of magnitude greater than the access time of a cache hit. Consequently, while the pipeline is stalled waiting for the data from memory, some or all of the pipeline stages of a single-threaded microprocessor may sit idle, performing no useful work, for many clock cycles. A multithreaded microprocessor may solve this problem by issuing instructions from other threads during the memory fetch latency, thereby enabling the pipeline stages to make forward progress performing useful work, somewhat analogously to, but at a much finer level of granularity than, an operating system performing a task switch on a page fault. Other examples are the pipeline stalls, and their accompanying idle clock cycles, caused by a branch misprediction and the concomitant pipeline flush, by a data dependence, or by a long-latency instruction such as a divide instruction. Again, the ability of a multithreaded microprocessor to issue instructions from other threads to pipeline stages that would otherwise be idle may significantly reduce the time required to execute the program or collection of programs comprising the threads. Another problem, particularly in embedded systems, is the wasted overhead associated with servicing interrupts. Typically, when an input/output device signals an interrupt event to the microprocessor, the microprocessor transfers control to an interrupt service routine, which requires saving the current program state, servicing the interrupt, and restoring the current program state after the interrupt has been serviced. A multithreaded microprocessor provides the ability for event service code to have its own thread with its own context. Consequently, in response to the input/output device signaling an event, the microprocessor can quickly, perhaps in a single clock cycle, switch to the event service thread, thereby avoiding the conventional interrupt service routine overhead.

Just as the degree of instruction-level parallelism dictates the extent to which a microprocessor may benefit from pipelining and superscalar instruction issue, the degree of thread-level parallelism dictates the extent to which a microprocessor may benefit from multithreaded execution. An important characteristic of a thread is its independence from the other threads executing on the multithreaded microprocessor. A thread is independent of another thread to the extent that its instructions do not depend upon instructions in the other thread. The independent nature of threads enables the microprocessor to execute the instructions of the various threads concurrently. That is, the microprocessor may issue instructions of one thread to the execution units without regard for the instructions of other threads being issued. To the extent that the threads access common data, the threads themselves must be programmed to synchronize their data accesses with one another to ensure proper operation, so that the microprocessor instruction issue stage does not need to be concerned with the dependences.

As may be observed from the foregoing, a processor that concurrently executes multiple threads may reduce the time required to execute the program or collection of programs comprising the threads. However, there is overhead associated with creating a new thread and dispatching it for execution. That is, the microprocessor must spend time, which could otherwise be used performing useful work, carrying out the functions necessary to create a new thread — typically allocating a context for the new thread and copying the parent thread's context to the new thread's context — and scheduling execution of the new thread, i.e., determining when the microprocessor will begin fetching and issuing instructions of the new thread. This overhead time is analogous to the task-switch overhead of a multitasking operating system and does not contribute to performing the actual task the program or programs are to accomplish, such as multiplying matrices, processing a packet received from a network, or rendering an image. Consequently, although concurrently executing multiple threads may in theory improve microprocessor performance, the extent of the improvement is limited by the overhead required to create a new thread. Stated alternatively, the greater the overhead required to create a new thread, the greater the amount of useful work the new thread must perform in order to offset the cost of creating it. For threads with a relatively long execution time, the thread creation overhead may be essentially irrelevant to performance. However, some applications might benefit from threads that are created relatively frequently and have relatively short execution times, in which case the thread creation overhead must be correspondingly short in order to realize appreciably higher performance from multithreading. Therefore, what is needed is a multithreaded microprocessor that includes a lightweight thread-creation instruction in its instruction set.

Summary of the Invention

The present invention provides a single instruction in the instruction set of a multithreaded microprocessor that, when executed, allocates a thread context for a new thread and schedules execution of the new thread. In one embodiment, the instruction occupies a single instruction issue slot in the microprocessor, in a RISC-like (reduced instruction set computer) fashion. The instruction has very low overhead because it forgoes copying the entire parent thread context to the new thread; such a wholesale copy would take a relatively long time if the thread context were copied sequentially, or would require a very large data path and substantial logic if copied in parallel. Instead, the instruction includes a first operand and a second operand: the first operand is an initial instruction fetch address that is stored into the program counter of the new thread context, and the second operand is stored into a register (such as a general-purpose register) of the new thread context's register set. The second operand may be used by the new thread as a pointer to a data structure in memory that contains data needed by the new thread, such as its initial general-purpose register values. The second operand thus enables the new thread to populate only the registers it needs, by reading them from the data structure. This is advantageous because the inventor has observed that new threads commonly need only one to five registers to be populated. Many contemporary microprocessors include, for example, 32 general-purpose registers; in the common case, the microprocessor of the present invention therefore avoids the wasted effort of copying the entire parent thread register set to the new thread register set.
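As a purely illustrative, software-level sketch of the convention described above, a new thread might populate only the few values it needs from the memory data structure whose address arrives via the second operand. The structure layout, field names, and function name below are hypothetical and are not defined by this patent; the sketch only shows the child reading a handful of words from memory rather than receiving a copy of the parent's entire register set.

    #include <stdint.h>

    /* Hypothetical layout of the in-memory data structure whose address the
     * parent passes as the FORK second operand. Only the handful of values
     * the new thread actually needs are stored, instead of all 32 GPRs. */
    struct thread_start_block {
        uint32_t stack_pointer;   /* e.g., value the child loads into its stack register  */
        uint32_t global_pointer;  /* e.g., value the child loads into its global register */
        uint32_t argument;        /* a single argument word for the child                 */
    };

    /* Child-side entry point: `start` arrives in the register named by the
     * FORK destination operand; the child reads just the fields it needs. */
    uint32_t child_entry(const struct thread_start_block *start)
    {
        uint32_t arg = start->argument;
        /* ... set up stack/global pointers from start->stack_pointer and
         * start->global_pointer, then perform the thread's work with arg ... */
        return arg;
    }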

In one embodiment, the instruction includes a third operand that specifies which register of the new thread context is to receive the second operand. In one embodiment, the instruction is executable by user-mode code, advantageously avoiding the need for operating system involvement in the common case of creating a new thread. Another benefit of a single instruction that performs both new thread context allocation and new thread scheduling is that it saves precious opcode space in the instruction set relative to an implementation that requires multiple instructions to create and schedule a new thread. The instruction is able to perform the two functions within a single instruction because, if no free thread context is available for allocation when the instruction is executed, the microprocessor raises an exception on the instruction.

In one aspect, the present invention provides an instruction for execution on a microprocessor configured to concurrently execute threads of a program. The instruction includes an opcode that instructs the microprocessor to allocate resources for a new thread and to schedule execution of the new thread on the microprocessor, the resources comprising a program counter and a register set. The instruction also includes a first operand that specifies an initial instruction fetch address to be stored in the program counter allocated for the new thread. The instruction also includes a second operand to be stored in a register of the register set allocated for the new thread.

In another aspect, the present invention provides a multithreaded microprocessor. The microprocessor includes a plurality of thread contexts, each configured to store the state of a thread and to indicate whether the thread context is free for allocation. The microprocessor also includes a scheduler, coupled to the plurality of thread contexts, that allocates one of the plurality of thread contexts to a new thread and schedules execution of the new thread in response to a single instruction in a currently executing thread. If none of the plurality of thread contexts is free for allocation, the microprocessor raises an exception on the instruction.

In another aspect, the present invention provides a multithreaded microprocessor. The microprocessor includes a first program counter that stores a fetch address of an instruction in a first program thread. The microprocessor also includes a first register set including first and second registers, specified by the instruction, that store first and second operands, respectively, the first operand specifying a fetch address of a second program thread. The microprocessor also includes a second program counter, coupled to the first register set, that receives the first operand from the first register in response to the instruction. The microprocessor also includes a second register set, coupled to the first register set, including a third register that receives the second operand from the second register in response to the instruction. The microprocessor also includes a scheduler, coupled to the first and second register sets, that causes the microprocessor, in response to the instruction, to fetch instructions from the second thread's initial fetch address stored in the second program counter and to execute the fetched instructions.

In another aspect, the present invention provides a method for creating a new thread of execution on a multithreaded microprocessor. The method includes decoding a single instruction executing in a first program thread. The method also includes allocating a program counter and a register set of the microprocessor to a second program thread in response to decoding the instruction. The method also includes storing a first operand of the instruction in a register of the register set, in response to allocating the program counter and register set to the second program thread. The method also includes storing a second operand of the instruction in the program counter, in response to allocating the program counter and register set to the second program thread. The method also includes scheduling execution of the second program thread on the microprocessor after storing the first and second operands.

In another aspect, the present invention provides a multithreaded processing system. The system includes a memory configured to store a fork instruction of a first thread and a data structure; the fork instruction specifies registers that store a memory address of the data structure and an initial instruction address of a second thread. The data structure contains initial general-purpose register values of the second thread. The system also includes a microprocessor coupled to the memory. In response to the fork instruction, the microprocessor allocates a free thread context for the second thread, stores the second thread's initial instruction address in a program counter of the thread context, stores the data structure memory address in a register of the thread context, and schedules execution of the second thread.

In another aspect, the present invention provides a computer program product for use with a computing device. The computer program product includes a computer-usable medium having computer-readable program code embodied therein for providing a multithreaded microprocessor. The computer-readable program code includes first program code for providing a first program counter that stores a fetch address of an instruction in a first program thread. The computer-readable program code also includes second program code for providing a first register set including first and second registers, specified by the instruction, that store first and second operands, respectively, the first operand specifying a fetch address of a second program thread. The computer-readable program code also includes third program code for providing a second program counter, coupled to the first register set, that receives the first operand from the first register in response to the instruction. The computer-readable program code also includes fourth program code for providing a second register set, coupled to the first register set, including a third register that receives the second operand from the second register in response to the instruction. The computer-readable program code also includes fifth program code for providing a scheduler, coupled to the first and second register sets, that causes the microprocessor, in response to the instruction, to fetch instructions from the second program thread's initial fetch address stored in the second program counter and to execute the fetched instructions.

In another aspect, the present invention provides a computer data signal embodied in a transmission medium, comprising computer-readable program code for providing a multithreaded microprocessor that executes a fork instruction. The program code includes first program code for providing an opcode that instructs the microprocessor to allocate resources for a new thread and to schedule execution of the new thread on the microprocessor, the resources comprising a program counter and a register set. The program code also includes second program code for providing a first operand that specifies an initial instruction fetch address to be stored into the program counter allocated for the new thread. The program code also includes third program code for providing a second operand to be stored in a register of the register set allocated for the new thread.

Brief Description of the Drawings

FIG. 1 is a block diagram illustrating a computer system according to the present invention;

FIG. 2 is a block diagram illustrating the multithreaded microprocessor of the computer system of FIG. 1 according to the present invention;

FIG. 3 is a block diagram illustrating a fork instruction executed by the microprocessor of FIG. 2 according to the present invention;

FIG. 4 is a block diagram illustrating the TCStatus register, one of the per-thread control registers of FIG. 2, according to the present invention;

FIG. 5 is a flowchart illustrating operation of the microprocessor of FIG. 2 to execute the fork instruction of FIG. 3 according to the present invention.

Detailed Description of the Invention

Referring now to FIG. 1, a block diagram of a computer system 100 according to the present invention is shown. The computer system 100 includes a multithreaded microprocessor 102 coupled to a system interface controller 104. The system interface controller 104 is coupled to a system memory 108 and to a plurality of input/output (I/O) devices 106. Each I/O device 106 provides an interrupt request line 112 to the microprocessor 102. The computer system 100 may be, but is not limited to, a general-purpose programmable computer system, a server computer, a workstation computer, a personal computer, a notebook computer, a personal digital assistant (PDA), or an embedded system such as, but not limited to, a network router or switch, a printer, a mass storage controller, a camera, a scanner, an automobile control system, and the like.

The system memory 108 includes memory, such as random access memory (RAM) and read-only memory (ROM), for storing program instructions to be executed by the microprocessor 102 and for storing data to be processed by the microprocessor 102 according to the program instructions. The program instructions may comprise multiple program threads that the microprocessor 102 executes concurrently. A program thread, or thread, comprises a sequence, or stream, of executable program instructions and the associated sequence of state changes within the microprocessor 102 that accompanies execution of those instructions. The sequence of instructions typically, but not necessarily, includes one or more program control instructions, such as branch instructions; consequently, the instructions may or may not have contiguous memory addresses. The sequence of instructions comprising a thread is from a single program. In particular, the microprocessor 102 is configured to execute a fork instruction for creating a new program thread, i.e., for allocating the microprocessor 102 resources needed to execute the thread and for scheduling execution of the thread on the microprocessor 102, as described in detail below.

The system interface controller 104 interfaces with the microprocessor 102 via a processor bus coupling the microprocessor 102 to the system interface controller 104. In one embodiment, the system interface controller 104 includes a memory controller for controlling the system memory 108. In one embodiment, the system interface controller 104 includes a local bus interface controller for providing a bus, such as a PCI bus, to which the I/O devices 106 are coupled.

The I/O devices 106 may include, but are not limited to, user input devices such as keyboards, mice, and scanners; display devices such as monitors and printers; storage devices such as disk drives, tape drives, and optical drives; system peripheral devices such as direct memory access controllers (DMACs), timers, and input/output ports; network devices such as media access controllers (MACs) for Ethernet, Fibre Channel, Infiniband, or other high-speed network interfaces; and data conversion devices such as analog-to-digital (A/D) converters and digital-to-analog (D/A) converters. The I/O devices 106 generate interrupt signals 112 to the microprocessor 102 to request service. Advantageously, the microprocessor 102 is capable of concurrently executing multiple program threads for processing the events signaled on the interrupt request lines 112 without incurring the conventional overhead associated with saving the current state of the microprocessor 102, transferring control to an interrupt service routine, and restoring the state after the interrupt service routine completes.

In one embodiment, the computer system 100 comprises a multiprocessing system including a plurality of the multithreaded microprocessors 102. In one embodiment, each microprocessor 102 provides two distinct, but not mutually exclusive, multithreading capabilities. First, each microprocessor 102 includes a plurality of logical processing contexts, each of which appears to the operating system as an independent processing element by sharing the resources of the microprocessor 102, and each of which is referred to herein as a virtual processing element (VPE). To the operating system, a microprocessor 102 with N VPEs appears much like an N-way symmetric multiprocessor (SMP), which allows existing SMP-capable operating systems to manage the plurality of VPEs. Second, each VPE may include a plurality of thread contexts for concurrently executing multiple threads. Consequently, the microprocessor 102 also provides a multithreaded programming model in which threads can typically be created and destroyed without operating system intervention, and in which system service threads can be scheduled with zero waiting time in response to external conditions such as input/output service event signals.

Referring now to FIG. 2, a block diagram of the multithreaded microprocessor 102 of the computer system 100 of FIG. 1 according to the present invention is shown. The microprocessor 102 is a pipelined microprocessor comprising a plurality of pipeline stages. The microprocessor 102 includes a plurality of thread contexts 228 for storing state associated with a plurality of threads. A thread context 228 comprises a collection of registers of the microprocessor 102, and/or bits in registers, that describe the state of execution of a thread. In one embodiment, a thread context 228 includes a register set 224 (such as a general-purpose register (GPR) set), a program counter (PC) 222, and per-thread control registers 226. The relevant contents of the per-thread control registers 226 are described in detail below. The embodiment of FIG. 2 shows four thread contexts 228, each comprising a program counter 222, a register set 224, and per-thread control registers 226. In one embodiment, a thread context 228 also includes multiply result registers. In another embodiment, each of the register sets 224 includes two read ports and one write port, to support reading from each of two registers and writing to one register in a single clock cycle. As described below, the FORK instruction 300 includes two source operands and one destination operand; consequently, the microprocessor 102 can execute the FORK instruction 300 in a single clock cycle.

In contrast to the thread contexts 228, the microprocessor 102 also maintains a processor context, which is a larger collection of state of the microprocessor 102. In the embodiment of FIG. 2, the processor context is stored in per-processor control registers 218. Each virtual processing element (VPE) includes its own set of per-processor control registers 218. In one embodiment, one of the per-processor control registers 218 comprises a status register that includes a field for specifying the cause of the most recently dispatched thread exception raised via the exception signal 234. In particular, if a VPE issues a fork instruction 300 of the current thread but no free thread context 228 is currently available to allocate to the new thread, the exception field indicates a thread overflow condition. In one embodiment, the microprocessor 102 substantially conforms to the MIPS32 or MIPS64 instruction set architecture (ISA), and the per-processor control registers 218 substantially conform to the registers for storing the processor context of the MIPS Privileged Resource Architecture (PRA), i.e., the mechanisms necessary for an operating system to manage the resources of the microprocessor 102, such as virtual memory, caches, exceptions, and user contexts.

The microprocessor 102 includes a scheduler 216 for scheduling execution of the various threads being concurrently executed by the microprocessor 102. The scheduler 216 is coupled to the per-thread control registers 226 and to the per-processor control registers 218. In particular, the scheduler 216 is responsible for scheduling the fetching of instructions from the program counters 222 of the various threads and for scheduling the issuing of the fetched instructions to the execution units of the microprocessor 102, as described below. The scheduler 216 schedules execution of the threads based on a scheduling policy of the microprocessor 102. The scheduling policy may include, but is not limited to, any of the following. In one embodiment, the scheduler 216 employs a round-robin, or time-division-multiplexed, or interleaved scheduling policy that allocates a predetermined number of clock cycles or instruction issue slots to each ready thread in rotating order. The round-robin policy is useful in applications in which fairness is important and a minimum quality of service is required for certain threads, such as real-time application program threads. In one embodiment, the scheduler 216 employs a blocking scheduling policy in which the scheduler 216 continues to schedule fetching and issuing of the currently running thread until an event occurs that blocks further execution of the thread, such as a cache miss, a branch misprediction, a data dependence, or a long-latency instruction. In one embodiment, the microprocessor 102 comprises a superscalar pipelined microprocessor, and the scheduler 216 schedules the issue of multiple instructions per clock cycle, and in particular the issue of instructions of multiple threads per clock cycle, commonly referred to as simultaneous multithreading.
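As a rough software model of the round-robin policy mentioned above (this is not the patent's hardware implementation; the function name and the ready-array abstraction are assumptions for illustration only), a next-thread selection could look like the following:

    #include <stdbool.h>

    /* Hypothetical round-robin selection among ready thread contexts.
     * Returns the index of the thread context to fetch/issue from next,
     * or -1 if no thread is ready. ready[i] models whether thread i can
     * be scheduled this cycle (allocated, activated, and not blocked). */
    static int last_issued = -1;

    int round_robin_pick(const bool ready[], int num_ctxs)
    {
        for (int offset = 1; offset <= num_ctxs; offset++) {
            int candidate = (last_issued + offset) % num_ctxs;
            if (ready[candidate]) {
                last_issued = candidate;   /* remember, so the next pick rotates onward */
                return candidate;
            }
        }
        return -1; /* no ready thread this cycle */
    }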

The microprocessor 102 includes an instruction cache 202 for caching program instructions, such as the fork instruction 300 of FIG. 3, fetched from the system memory 108 of FIG. 1. In one embodiment, the microprocessor 102 provides virtual memory capability, and the fetch unit 204 includes a translation lookaside buffer for caching virtual-to-physical memory page translations. In one embodiment, each program, or task, executing on the microprocessor 102 is assigned a unique task ID, or address space ID (ASID), which is used to perform memory accesses and in particular memory address translations, and a thread context 228 also includes storage for the ASID associated with the thread. In one embodiment, when a parent thread executes a fork instruction 300 to create a new thread, the new thread inherits the ASID, and hence the address space, of the parent thread. In one embodiment, the various threads executing on the microprocessor 102 share the instruction cache 202 and the translation lookaside buffer. In another embodiment, each thread includes its own translation lookaside buffer.

The microprocessor 102 also includes a fetch unit 204, coupled to the instruction cache 202, for fetching program instructions, such as the fork instruction 300, from the instruction cache 202 and the system memory 108. The fetch unit 204 fetches the instruction at an instruction fetch address provided by a multiplexer 244. The multiplexer 244 receives a plurality of instruction fetch addresses from a corresponding plurality of program counters 222. Each of the program counters 222 stores a current instruction fetch address for a different program thread. The embodiment shown in FIG. 2 illustrates four program counters 222 associated with four different threads. The multiplexer 244 selects one of the four program counters 222 based on a selection input provided by the scheduler 216. In one embodiment, the various threads executing on the microprocessor 102 share the fetch unit 204.

The microprocessor 102 also includes a decode unit 206, coupled to the fetch unit 204, for decoding program instructions fetched by the fetch unit 204, such as the fork instruction 300. The decode unit 206 decodes the opcode, operand, and other fields of the instructions. In one embodiment, the various threads executing on the microprocessor 102 share the decode unit 206.

The microprocessor 102 also includes execution units 212 for executing instructions. The execution units 212 may include, but are not limited to, one or more integer units for performing integer arithmetic, Boolean operations, shift operations, rotate operations, and the like; floating-point units for performing floating-point operations; load/store units for performing memory accesses, and in particular accesses to a data cache 242 coupled to the execution units 212; and a branch resolution unit for resolving the outcome and target address of branch instructions. In one embodiment, the data cache 242 includes a translation lookaside buffer for caching virtual-to-physical memory page translations. In addition to the operands received from the data cache 242, the execution units 212 also receive operands from registers of the register sets 224. In particular, an execution unit 212 receives operands from the register set 224 of the thread context 228 allocated to the thread to which the instruction belongs. A multiplexer 248 selects operands from the appropriate register set 224, based on the thread context 228 of the instruction being executed by the execution unit 212, for provision to the execution unit 212. In one embodiment, the various execution units 212 may concurrently execute instructions from multiple concurrent threads.

One of the execution units 212 is responsible for executing the fork instruction 300 and, in response to the fork instruction 300 being issued, asserts a new_thread_request signal 232, which is provided to the scheduler 216. The new_thread_request signal 232 requests the scheduler 216 to allocate a new thread context 228 and to schedule execution of the new thread associated with that thread context 228. As described in more detail below, if a new thread context 228 is requested for allocation but no free allocatable thread context is available, the scheduler 216 asserts an exception signal 234 to raise an exception on the fork instruction 300. In one embodiment, the scheduler 216 maintains a count of the number of free allocatable thread contexts 228; if the count is zero when the new_thread_request signal 232 is asserted, the scheduler 216 raises an exception 234 on the fork instruction 300. In another embodiment, when the new_thread_request signal 232 is asserted, the scheduler 216 examines status bits in the per-thread control registers 226 to determine whether a free allocatable thread context 228 is available.
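The allocate-or-raise-exception behavior described in this paragraph can be summarized by the following C sketch. It is only a behavioral model under stated assumptions; the type and function names are invented for illustration, and where the hardware asserts the exception signal 234 the sketch simply returns an error code instead.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical model of servicing a new-thread request raised by a FORK
     * instruction: find a free thread context, initialize its PC and one GPR,
     * and mark it runnable; otherwise report a thread-overflow exception. */
    typedef struct {
        uint32_t gpr[32];
        uint32_t pc;
        bool     free;
        bool     runnable;
    } thread_context_t;

    enum { THREAD_OVERFLOW_EXCEPTION = 1 };

    int handle_new_thread_request(thread_context_t ctx[], int num_ctxs,
                                  uint32_t fetch_addr,   /* rs value: new thread PC      */
                                  uint32_t data_ptr,     /* rt value: data structure ptr */
                                  int      rd)           /* destination GPR number       */
    {
        for (int i = 0; i < num_ctxs; i++) {
            if (ctx[i].free) {
                ctx[i].free     = false;        /* allocate the context            */
                ctx[i].pc       = fetch_addr;   /* initial instruction fetch addr  */
                ctx[i].gpr[rd]  = data_ptr;     /* second operand into register rd */
                ctx[i].runnable = true;         /* schedule the new thread         */
                return 0;
            }
        }
        return THREAD_OVERFLOW_EXCEPTION;       /* no free context: raise exception */
    }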

The microprocessor 102 also includes an instruction issue unit 208, coupled to the scheduler 216 and coupled between the decode unit 206 and the execution units 212, for issuing instructions to the execution units 212 as directed by the scheduler 216 and in response to the instruction information decoded by the decode unit 206. In particular, the instruction issue unit 208 ensures that an instruction is not issued to the execution units 212 if it has a data dependence on another instruction previously issued to the execution units 212. In one embodiment, an instruction queue is interposed between the decode unit 206 and the instruction issue unit 208 for buffering instructions awaiting issue to the execution units 212, in order to reduce the likelihood of starvation of the execution units 212. In one embodiment, the various threads executing on the microprocessor 102 share the instruction issue unit 208.

The microprocessor 102 also includes a write-back unit 214, coupled to the execution units 212, for writing back the results of completed instructions to the register sets 224. A demultiplexer 246 receives an instruction result from the write-back unit 214 and stores it into the appropriate register set 224 associated with the thread of the completed instruction.

Referring now to FIG. 3, a block diagram illustrating the fork instruction 300 executed by the microprocessor 102 of FIG. 2 according to the present invention is shown. As shown, the FORK instruction 300 is written as fork rd, rs, rt, where rd, rs, and rt are the three operands of the FORK instruction 300. FIG. 3 shows the various fields of the FORK instruction: bits 26 through 31 are the opcode field 302, and bits 0 through 5 are the function field 314. In one embodiment, the opcode field 302 indicates that the instruction is a SPECIAL3-type instruction of the MIPS instruction set, and the function field 314 indicates that the function is the FORK instruction. Thus, the decode unit 206 of FIG. 2 examines the opcode field 302 and the function field 314 to determine whether an instruction is a FORK instruction 300. Bits 6 through 10 are reserved as zero.

Bits 21 through 25, 16 through 20, and 11 through 15 are the rs field 304, the rt field 306, and the rd field 308, respectively, which specify the rs register 324, the rt register 326, and the rd register 328, respectively, within one of the register sets 224 of FIG. 2. In one embodiment, each of the rs register 324, the rt register 326, and the rd register 328 is one of the 32 general-purpose registers of the MIPS ISA. The rs register 324 and the rt register 326 are each one of the registers in the register set 224 allocated to the thread containing the FORK instruction 300, which is referred to as the parent thread, or forking thread, or current thread. The rd register 328 is one of the registers in the register set 224 allocated to the thread created by the FORK instruction 300, which is referred to as the new thread, or child thread.
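Assuming the MIPS-style 32-bit encoding laid out above (opcode in bits 31:26, rs in 25:21, rt in 20:16, rd in 15:11, the reserved field in 10:6, and function in 5:0), a decoder could extract the FORK fields as in the sketch below. The numeric values given for OPCODE_SPECIAL3 and FUNC_FORK are assumptions, since this text does not state the actual encodings.

    #include <stdint.h>
    #include <stdbool.h>

    #define OPCODE_SPECIAL3  0x1F   /* assumed SPECIAL3 major opcode */
    #define FUNC_FORK        0x08   /* assumed FORK function code    */

    /* Extract bits hi..lo (inclusive) of a 32-bit instruction word. */
    static inline uint32_t bits(uint32_t word, int hi, int lo)
    {
        return (word >> lo) & ((1u << (hi - lo + 1)) - 1);
    }

    typedef struct { unsigned rs, rt, rd; } fork_fields_t;

    bool decode_fork(uint32_t insn, fork_fields_t *out)
    {
        if (bits(insn, 31, 26) != OPCODE_SPECIAL3) return false;  /* opcode field 302    */
        if (bits(insn, 5, 0)   != FUNC_FORK)       return false;  /* function field 314  */
        if (bits(insn, 10, 6)  != 0)               return false;  /* bits 6..10 reserved */
        out->rs = bits(insn, 25, 21);  /* rs field 304 */
        out->rt = bits(insn, 20, 16);  /* rt field 306 */
        out->rd = bits(insn, 15, 11);  /* rd field 308 */
        return true;
    }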

As shown in FIG. 3, the FORK instruction 300 instructs the microprocessor 102 to copy the value of the parent thread's rs register 324 into the new thread's program counter 222. The new thread's program counter 222 value is used as the initial instruction fetch address of the new thread.

Additionally, the FORK instruction 300 instructs the microprocessor 102 to copy the value of the parent thread's rt register 326 into the new thread's rd register 328. In a typical program operation, the program uses the rd register 328 value as the memory address of a data structure for the new thread. This enables the FORK instruction 300 to forgo copying the contents of the entire parent thread register set 224 to the new thread register set 224, thereby advantageously making the FORK instruction 300 compact and efficient and executable in a single clock cycle. Instead, the new thread includes instructions to populate only the registers the new thread needs, by reading the register values from the data structure, which has a high likelihood of being present in the data cache 242. This is advantageous because the inventor has determined that new threads commonly need only one to five registers to be populated, rather than the large number of registers typically found in many contemporary microprocessors (such as the 32 general-purpose registers of the MIPS instruction set). Copying an entire register set 224 in a single clock cycle would require an impractically wide data path between the various thread contexts 228 of the microprocessor 102, whereas copying the entire register set 224 sequentially (one or two registers per clock cycle) would take more time and require greater complexity in the microprocessor 102. The FORK instruction 300, in contrast, may advantageously be executed in a RISC-like fashion in a single clock cycle.

Advantageously, not only can an operating system executing on the microprocessor 102 use the FORK instruction 300 to allocate resources for a new thread and schedule its execution, but user-level threads can do so as well. This is particularly beneficial for programs that create and terminate relatively short threads relatively frequently. For example, a program containing many loops with short loop bodies and no data dependencies between iterations can benefit from the small thread creation overhead of the FORK instruction 300. Consider the following code loop:

for (i = 0; i < N; i++) {
    result[i] = FUNCTION(x[i], y[i]);
}

The smaller the overhead of thread creation and destruction, the shorter the FUNCTION instruction sequence can be while still being usefully parallelized across multiple threads. If the overhead associated with creating and destroying a new thread is on the order of 100 instructions, as is the case with conventional thread creation mechanisms, then FUNCTION must be many instructions long before parallelizing the loop across multiple threads yields any benefit, if it yields one at all. However, the fact that the overhead of the FORK instruction 300 is so small (only one clock cycle in one embodiment) advantageously implies that even very small sections of code can benefit from being parallelized across multiple threads.
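The following C sketch, for illustration only, shows one way such a loop might be partitioned across a handful of forked threads. The helpers lightweight_fork() and lightweight_join() are hypothetical stand-ins for code paths that would ultimately execute the FORK and YIELD instructions; they are not part of the instruction set or of any real library.

extern int  FUNCTION(int x, int y);
extern void lightweight_fork(void (*entry)(void *), void *arg);  /* hypothetical */
extern void lightweight_join(void);                              /* hypothetical */

#define NUM_THREADS 4

struct work { int lo, hi; const int *x, *y; int *result; };

static void worker(void *arg)
{
    struct work *w = (struct work *)arg;
    for (int i = w->lo; i < w->hi; i++)
        w->result[i] = FUNCTION(w->x[i], w->y[i]);
    lightweight_join();                     /* child gives its context back (YIELD) */
}

static void parallel_loop(const int *x, const int *y, int *result, int n)
{
    static struct work w[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++) {
        w[t].lo = t * n / NUM_THREADS;      /* split the iteration space evenly */
        w[t].hi = (t + 1) * n / NUM_THREADS;
        w[t].x = x;  w[t].y = y;  w[t].result = result;
        lightweight_fork(worker, &w[t]);    /* each call maps onto a single FORK */
    }
    /* waiting for the workers to finish is omitted for brevity */
}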

Although FIG. 3 shows only the values of the rt register 326 and rs register 324 being copied from the parent thread context 228 into the new thread context 228, other state, or content, may also be copied in response to the FORK instruction 300, as described below with reference to FIG. 4.

Referring now to FIG. 4, a block diagram of one of the per-thread control registers 226 of FIG. 2, the TCStatus register 400, is shown according to the present invention. That is, each thread context 228 includes a TCStatus register 400. The various fields of the TCStatus register 400 are described in the table of FIG. 4; however, certain fields particularly relevant to the FORK instruction 300 are described here in more detail.

The TCStatus register 400 includes a TCU field 402. In one embodiment, the microprocessor 102 includes a single microprocessor core and one or more coprocessors, per the MIPS instruction set and Privileged Resource Architecture (PRA). The TCU field 402 controls whether a thread may access, or is restricted from, a particular coprocessor. In the embodiment of FIG. 4, the TCU field 402 allows control of up to four coprocessors. In one embodiment, the FORK instruction 300 instructs the microprocessor 102 to copy the value of the parent thread's TCU field 402 into the TCU field 402 of the new thread created by the FORK instruction 300.

The TCStatus register 400 also includes a DT bit 406, which indicates whether the thread context 228 is "dirty". The DT bit 406 may be used by an operating system to provide security between different programs. For example, if the FORK instruction 300 is used to dynamically allocate thread contexts 228 and the YIELD instruction of the microprocessor 102 is used to deallocate them across different security domains, e.g., by multiple applications, or by both the operating system and applications, there is a risk of information leakage in the form of register values inherited by an application, which must be managed by a secure operating system. The DT bit 406 associated with a thread context 228 is set by the microprocessor 102 whenever the thread context 228 is modified, and may be cleared by software. The operating system may initialize all thread contexts 228 to a known clean state and clear all of their DT bits 406 before scheduling tasks. When a task switch occurs, thread contexts 228 whose DT bit 406 is set must be restored to a clean state before other jobs are allowed to allocate and use them. If a secure operating system wishes to dynamically create threads and allocate privileged service threads, the associated thread contexts 228 must be scrubbed before being handed over for potential use by applications. The reader is referred to the concurrently filed, co-pending U.S. patent application entitled "Integrated mechanism for suspension and deallocation of computational threads of execution in a processor", docket number MIPS.0189-01US, mentioned at the beginning of this application, in which the YIELD instruction is described in detail.
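A minimal C sketch of the scrubbing policy just described follows, assuming hypothetical helpers for reading and clearing the DT bits; none of these names exist in the actual architecture or in any real operating system interface.

extern int  num_contexts;                    /* hypothetical */
extern int  read_dt_bit(int ctx);            /* hypothetical */
extern void clear_dt_bit(int ctx);           /* hypothetical */
extern void zero_context_registers(int ctx); /* hypothetical */

void scrub_dirty_contexts(void)
{
    for (int ctx = 0; ctx < num_contexts; ctx++) {
        if (read_dt_bit(ctx)) {           /* context was written since it was last scrubbed   */
            zero_context_registers(ctx);  /* remove register values another task could inherit */
            clear_dt_bit(ctx);            /* mark the context clean again                      */
        }
    }
}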

The TCStatus register 400 also includes a DA status bit 412, which indicates whether the thread context 228 is dynamically allocated and scheduled by the FORK instruction 300 and dynamically deallocated by the YIELD instruction. In one embodiment, some of the thread contexts 228 are dynamically allocatable by the FORK instruction 300, while others are not dynamically allocated by the FORK instruction 300 but are instead statically assigned to permanent threads of a program. For example, one or more thread contexts 228 may be statically assigned to part of the operating system rather than dynamically allocated by the FORK instruction 300. In another example, in an embedded system, one or more thread contexts 228 may be statically assigned to privileged service threads whose function is similar to that of the interrupt service routines used to service interrupt sources in conventional processors, which are well understood to be an essential part of such applications. For example, in a network router, one or more thread contexts 228 may be statically assigned to threads that handle events generated by a set of input/output ports. The microprocessor 102 described herein can handle a large number of such events efficiently with its single-cycle thread switch, which is an advantage over other microprocessors that must incur the overhead associated with taking a large number of interrupts, saving the associated state, and transferring control to an interrupt service routine.

In one embodiment, the DA bit 412 may be used by an operating system to manage sharing of the thread contexts 228 among applications. For example, a FORK instruction 300 may attempt to allocate a thread context 228 when no free thread context 228 is available for allocation, in which case the microprocessor 102 issues a thread overflow exception 234 to the FORK instruction 300. In response, the operating system saves a copy of the current DA bit 412 values and then clears the DA bits 412 of all thread contexts 228. The next time a thread context is deallocated by an application, a thread underflow exception 234 is issued; in response, the operating system restores the DA bits 412 saved in response to the thread overflow exception 234 and schedules a replay of the FORK instruction 300 that caused the original thread overflow exception 234.
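For illustration, the following C pseudo-code sketches the overflow/underflow handler policy described above; saved_da_bits and the helper functions are invented names, not real operating-system interfaces, and replaying the faulting FORK is shown only schematically.

static unsigned saved_da_bits;

extern unsigned read_all_da_bits(void);          /* hypothetical */
extern void     clear_all_da_bits(void);         /* hypothetical */
extern void     restore_all_da_bits(unsigned v); /* hypothetical */
extern void     replay_faulting_fork(void);      /* hypothetical */

void thread_overflow_handler(void)
{
    saved_da_bits = read_all_da_bits();   /* snapshot the current DA bits          */
    clear_all_da_bits();                  /* so the next YIELD raises an underflow */
}

void thread_underflow_handler(void)
{
    restore_all_da_bits(saved_da_bits);   /* restore the saved DA policy       */
    replay_faulting_fork();               /* re-issue the FORK that overflowed */
}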

The TCStatus register 400 also includes an A bit 414, which indicates whether the thread associated with the thread context 228 is activated. When the thread is activated, the scheduler 216 schedules the fetching of instructions from its program counter 222 and the issuing of those instructions, subject to the scheduling policy of the scheduler 216. The scheduler 216 automatically sets the A bit 414 when a FORK instruction 300 dynamically allocates the thread context 228, and automatically clears the A bit 414 when a YIELD instruction dynamically deallocates the thread context 228. In one embodiment, when the microprocessor 102 is reset, one of the thread contexts 228 is designated as the reset thread context 228 for executing the initial thread of the microprocessor 102. The A bit 414 of the reset thread context 228 is automatically set in response to a reset of the microprocessor 102.

The TCStatus register 400 also includes a TKSU field 416, which indicates the privilege state, or level, of the thread context 228. In one embodiment, the privilege may be one of three levels: kernel, supervisor, or user. In one embodiment, the FORK instruction 300 instructs the microprocessor 102 to copy the value of the parent thread's TKSU field 416 into the TKSU field 416 of the new thread created by the FORK instruction 300.

The TCStatus register 400 also includes a TASID field 422, which specifies the address space ID (ASID), or unique task ID, of the thread context 228. In one embodiment, the FORK instruction 300 instructs the microprocessor 102 to copy the value of the parent thread's TASID field 422 into the TASID field 422 of the new thread created by the FORK instruction 300, so that the parent thread and the new thread share the same address space.

In one embodiment, the per-thread control registers 226 also include a register storing a halted bit; by setting the halted bit, software can halt a thread, i.e., place the thread context 228 in a halted state.

Referring now to FIG. 5, a flowchart illustrating execution of the FORK instruction 300 of FIG. 3 by the microprocessor 102 of FIG. 2 according to the present invention is shown. Flow begins at block 502.

At block 502, the fetch unit 204 fetches the FORK instruction 300 using the current thread's program counter 222, the decode unit 206 decodes the FORK instruction 300, and the instruction issue unit 208 issues the FORK instruction 300 to the execution units 212 of FIG. 2. Flow proceeds to block 504.

At block 504, the execution units 212 indicate via the new_thread_request signal 232 that the FORK instruction 300 is requesting a new thread context 228 to be allocated and scheduled. Flow proceeds to block 506.

At block 506, the scheduler 216 determines whether a free thread context 228 is available for allocation. In one embodiment, the scheduler 216 maintains a counter indicating the number of freely allocatable thread contexts 228; the count is incremented each time a YIELD instruction deallocates a thread context 228 and decremented each time a FORK instruction 300 allocates one, and the scheduler 216 determines whether a freely allocatable thread context 228 exists by determining whether the counter value is greater than zero. In another embodiment, the scheduler 216 examines status bits in the per-thread control registers 226, such as the DA bit 412 and A bit 414 of the TCStatus register 400 of FIG. 4 and the halted bit, to determine whether a freely allocatable thread context 228 exists. A thread context 228 is freely allocatable when it is neither activated nor halted and is not a statically allocated thread context 228. If a thread context 228 is available for allocation, flow proceeds to block 508; otherwise, flow proceeds to block 522.
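The following C sketch illustrates the two allocation checks described above; the structure, field, and function names are hypothetical and serve only to make the two embodiments concrete.

typedef struct {
    int da;        /* dynamically allocatable (DA bit) */
    int active;    /* activated (A bit)                */
    int halted;    /* halted bit                       */
} tc_status;

/* Embodiment 1: a simple free-context counter maintained by the scheduler. */
int free_context_available_counter(int free_count)
{
    return free_count > 0;
}

/* Embodiment 2: scan the per-thread status bits. */
int free_context_available_scan(const tc_status *tc, int num_contexts)
{
    for (int i = 0; i < num_contexts; i++) {
        /* freely allocatable: dynamically allocatable, neither activated nor halted */
        if (tc[i].da && !tc[i].active && !tc[i].halted)
            return 1;
    }
    return 0;
}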

At block 508, the scheduler 216 allocates a freely allocatable thread context 228 for the new thread in response to the FORK instruction 300. Flow proceeds to block 512.

At block 512, the value of the parent thread context 228's rs register 324 is copied into the program counter 222 of the new thread context 228, and the value of the parent thread context 228's rt register 326 is copied into the rd register 328 of the new thread context 228, as shown in FIG. 3; other context relevant to the FORK instruction 300, as described with respect to FIG. 4, is also copied from the parent thread context 228 into the new thread context 228. Flow proceeds to block 514.

At block 514, the scheduler 216 schedules the new thread context 228 for execution. That is, the scheduler 216 adds the thread context 228 to the list of thread contexts 228 currently ready to execute, so that the fetch unit 204 begins fetching and issuing instructions from the program counter 222 of the thread context 228 subject to the constraints of the scheduling policy. Flow proceeds to block 516.

At block 516, the fetch unit 204 begins fetching instructions at the program counter 222 of the new thread context 228. Flow proceeds to block 518.

At block 518, the instructions of the new thread populate the registers of the new thread context 228's register set 224 as needed. As discussed above, the program instructions of the new thread typically populate the register set 224 from a data structure in memory specified by the rd register 328 value. Flow ends at block 518.
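As a hedged illustration of this step, the following C sketch shows what a new thread's entry code might look like: the child receives, in its rd register, a pointer to a small parameter block prepared by the parent and loads only the few values it actually needs. The fork_args layout is invented for this example; it would be defined by the program, not by the ISA.

struct fork_args {        /* hypothetical layout agreed on by parent and child */
    int  a, b;
    int *out;
};

void child_entry(struct fork_args *args)   /* pointer arrives via the child's rd register */
{
    /* Only the handful of values actually needed are read from memory
     * (likely hitting the data cache 242), instead of copying all 32
     * parent registers at FORK time. */
    *args->out = args->a + args->b;
}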

At block 522, the scheduler 216 issues a thread overflow exception 234 to the FORK instruction 300 to indicate that no free thread context 228 was available for allocation when the FORK instruction 300 executed. Flow proceeds to block 524.

At block 524, an exception handler in the operating system creates a condition in which an allocated thread context 228 becomes free for use by the FORK instruction 300, as described above with respect to the DA bit 412 of FIG. 4. Flow proceeds to block 526.

At block 526, the operating system re-issues the FORK instruction 300 that caused the exception 234 at block 522, which can now execute successfully because a freely allocatable thread context 228 is available, as described above with respect to the DA bit 412 of FIG. 4. Flow ends at block 526.

Although the present invention and its objects, features, and advantages have been described in detail, the invention encompasses other embodiments. For example, although an embodiment has been described in which the new thread context 228 is allocated on the same VPE as the parent thread context, in another embodiment, if the parent VPE detects that no freely allocatable thread context exists on the VPE, the VPE attempts a remote FORK instruction 300 on another VPE. In particular, the VPE determines whether another VPE has a freely allocatable thread context and the same address space as the parent thread context; if so, it sends the FORK instruction information in a packet to the other VPE so that the other VPE can allocate and schedule the free thread context. In addition, the FORK instruction described herein is not limited to use on a microprocessor that executes multiple threads concurrently to hide particular latency events; it may also be employed on a microprocessor that multithreads in response to cache misses, mispredicted branches, long-latency instructions, and the like. Furthermore, the FORK instruction described herein may be executed in scalar or superscalar microprocessors, and in microprocessors with different scheduling policies. In addition, although an embodiment of the FORK instruction has been described in which the rt register value is copied into a register of the new thread context, other embodiments are contemplated in which the rt register value is provided to the new thread context by other means, such as through memory. Finally, although an embodiment has been described in which the operands of the FORK instruction are stored in general purpose registers, in other embodiments the operands may be stored by other means, such as in memory or in non-general-purpose registers. For example, although an embodiment has been described in which the microprocessor is a register-based microprocessor, other embodiments are contemplated in which the microprocessor is a stack-based microprocessor, such as a processor configured to efficiently execute Java virtual machine program code. In such an embodiment, the operands of the FORK instruction are specified on an operand stack in memory rather than in registers. For example, each thread context may include a stack pointer register, and fields of the FORK instruction may indicate the offsets of the FORK instruction operands in stack memory relative to the stack pointer register, rather than specifying registers in the microprocessor's register space.

In addition to implementations of the invention in hardware, the invention may also be embodied in software (e.g., computer readable code, program code, instructions, and/or data) disposed in a computer usable (e.g., readable) medium. Such software enables the function, fabrication, modeling, simulation, description, and/or testing of the apparatus and methods described herein. For example, this can be accomplished through the use of general programming languages (e.g., C, C++, JAVA, etc.), GDSII databases, hardware description languages (HDL) including Verilog HDL, VHDL, and so on, or other available programs, databases, and/or circuit (i.e., schematic) capture tools. Such software can be disposed in any computer usable (e.g., readable) medium, including semiconductor memory, magnetic disk, optical disc (e.g., CD-ROM, DVD-ROM, etc.), and may be embodied as a computer data signal in a computer usable (readable) transmission medium (e.g., a carrier wave or any other medium, including digital, optical, or analog-based media). As such, the software can be transmitted over communication networks including the Internet and intranets. It is understood that the invention may be embodied in software, for example in HDL, and transformed into hardware as part of an integrated circuit product, as part of a semiconductor intellectual property core (such as a microprocessor core), or as a system-level design such as a system on chip (SOC). Likewise, the invention may be embodied as a combination of hardware and software.

Finally, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention, which is to be defined by the appended claims.

Claims (34)

1. A multithreaded microprocessor, comprising:
a plurality of thread contexts, each configured to store the state of a thread and to indicate whether the thread context is available for allocation; and
a scheduler, coupled to the plurality of thread contexts, configured to allocate one of the plurality of thread contexts to a new thread and to schedule execution of the new thread in response to a single instruction of a currently executing thread;
wherein, if none of the plurality of thread contexts is available for allocation, the microprocessor issues an exception to the single instruction;
wherein the single instruction comprises an opcode, a first operand, and a second operand.
2. The microprocessor of claim 1, wherein each of the plurality of thread contexts comprises a program counter.
3. The microprocessor of claim 2, wherein the single instruction instructs the microprocessor to store the first operand of the instruction into the program counter of the thread context allocated to the new thread.
4. The microprocessor of claim 3, wherein the single instruction instructs the microprocessor to store the second operand of the instruction into a storage location accessible by the new thread.
5. The microprocessor of claim 4, wherein each of the plurality of thread contexts comprises a plurality of general purpose registers, and wherein the single instruction instructs the microprocessor to store the second operand into one of the plurality of general purpose registers of the thread context allocated to the new thread.
6. The microprocessor of claim 5, wherein the one of the plurality of general purpose registers is specified by a third operand of the instruction.
7. The microprocessor of claim 4, wherein each of the plurality of thread contexts comprises a stack pointer specifying a stack memory, and wherein the single instruction instructs the microprocessor to store the second operand into a location in the stack memory.
8. The microprocessor of claim 7, wherein the location in the stack memory is specified by a third operand of the instruction.
9. The microprocessor of claim 1, wherein the microprocessor permits the instruction to allocate one of the plurality of thread contexts to the new thread and to schedule execution of the new thread even when the currently executing thread is executing at a user privilege level.
10. The microprocessor of claim 1, wherein the instruction occupies a single instruction issue slot in the microprocessor.
11. The microprocessor of claim 1, wherein each of the register sets comprises two read ports and one write port.
12. The microprocessor of claim 1, wherein the fork instruction specifies at most two source register operands and one destination register operand.
13. A multithreaded microprocessor, comprising:
a first program counter, for storing the fetch address of an instruction in a first program thread, the instruction comprising an opcode, a first operand, and a second operand;
a first register set, comprising first and second registers specified by the instruction, for storing the first and second operands respectively, the first operand specifying a fetch address of a second program thread;
a second program counter, coupled to the first register set, for receiving the first operand from the first register set in response to the instruction;
a second register set, coupled to the first register set, comprising a third register for receiving the second operand from the first register set in response to the instruction; and
a scheduler, coupled to the first and second register sets, for causing the microprocessor, in response to the instruction, to fetch instructions from the second program thread fetch address stored in the second program counter and to execute the fetched instructions.
14. The microprocessor of claim 13, further comprising:
an exception indicator, coupled to the scheduler, for causing the microprocessor to issue an exception to the instruction if, in response to the instruction, the second program counter and register set are not available to receive the first and second operands.
15. The microprocessor of claim 13, further comprising:
an exception indicator, coupled to the scheduler, for causing the microprocessor to issue an exception to the instruction if, in response to the instruction, the second program counter and register set are already in use by another thread.
16. The microprocessor of claim 13, wherein the third register is specified by the instruction.
17. The microprocessor of claim 13, wherein the first and second register sets comprise general purpose register sets, and wherein, in response to the instruction, the second general purpose register set receives only the second operand from the first general purpose register set.
18. A method for creating a new thread of execution in a multithreaded microprocessor, the method comprising:
decoding a single instruction executed in a first program thread, the single instruction comprising an opcode, a first operand, and a second operand;
in response to the decoding, allocating a program counter and a register set of the microprocessor for a second program thread;
in response to the allocating, storing the first operand of the instruction into a register of the register set;
in response to the allocating, storing the second operand of the instruction into the program counter; and
after storing the first and second operands, scheduling execution of the second program thread on the microprocessor.
19. The method of claim 18, further comprising:
in response to the decoding, determining whether a program counter and a register set are available for allocation.
20. The method of claim 19, further comprising:
issuing an exception to the instruction if no program counter and register set are available for allocation.
21. The method of claim 18, wherein the allocating, the storing of the first and second operands, and the scheduling are all performed within a single clock cycle of the microprocessor.
22. A method for creating a new thread of execution on a multithreaded microprocessor, the method comprising:
decoding a single instruction executed in a first program thread, the single instruction comprising an opcode, a first operand, and a second operand;
in response to the decoding, allocating a program counter for a second program thread;
determining whether the allocation is successful;
if the allocation is successful, storing an operand of the instruction into the program counter and scheduling execution of the second program thread on the microprocessor; and
if the allocation is unsuccessful, issuing an exception to the instruction.
23. The method of claim 22, further comprising:
if the allocation is successful, providing the second operand of the instruction to the second thread.
24. The method of claim 23, further comprising:
in response to the decoding, allocating a register set for the second program thread;
wherein providing the second operand of the instruction to the second thread comprises storing the second operand into a register of the register set allocated for the second program thread.
25. The method of claim 23, further comprising:
in response to the decoding, allocating a stack pointer for the second program thread, the stack pointer specifying a stack memory associated with the second thread;
wherein providing the second operand of the instruction to the second thread comprises storing the second operand into the stack memory.
26. A multithreaded processing system, comprising:
a memory, configured to store a fork instruction of a first thread and a data structure, the fork instruction comprising an opcode, a first operand, and a second operand and specifying registers that store the memory address of the data structure and the initial instruction address of a second thread, the data structure comprising initial general purpose register values of the second thread; and
a microprocessor, coupled to the memory, configured, in response to the fork instruction, to (1) allocate a free thread context for the second thread, (2) store the second thread initial instruction address into a program counter of the thread context, (3) store the memory address of the data structure into a register of the thread context, and (4) schedule execution of the second thread.
27. The processing system of claim 26, wherein the number of initial register values of the second thread in the data structure is less than the number of general purpose register values of the thread context.
28. The processing system of claim 26, wherein the thread context allocated to the second thread is distinct from the thread context of the first thread.
29. The processing system of claim 28, wherein the memory is further configured to store program instructions of the second thread for copying the initial register values from the data structure in the memory to general purpose registers of the thread context, thereby enabling the microprocessor, in response to the fork instruction, to forgo copying the entire thread context of the first thread into the thread context of the second thread.
30. The processing system of claim 26, wherein the microprocessor is further configured to issue an exception to the fork instruction if no free thread context is available for allocation to the second thread.
31. A method of executing a fork instruction on a multithreaded microprocessor, the fork instruction comprising an opcode, a first operand, and a second operand, the method comprising:
providing the opcode to instruct the microprocessor to allocate resources in the microprocessor for a new thread and to schedule execution of the new thread, the resources comprising a program counter and a register set;
providing the first operand to specify an initial instruction fetch address to be stored into the program counter allocated for the new thread; and
providing the second operand to be stored into a register of the register set allocated for the new thread.
32. The method of claim 31, further comprising:
providing a third operand to specify into which register of the register set the second operand is to be stored.
33. The method of claim 31, further comprising:
providing a status register associated with the register set allocated for the new thread, the status register comprising an indicator of whether the register set has been written since being allocated for the new thread.
34. The method of claim 31, further comprising:
providing an exception indicator to issue an exception to the fork instruction if no free program counter and register set can be allocated to the new thread.
CNB2004800247988A 2003-08-28 2004-08-27 Multithreading microprocessor and its novel threading establishment method and multithreading processing system Expired - Fee Related CN100489784C (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US49918003P 2003-08-28 2003-08-28
US60/499,180 2003-08-28
US60/502,358 2003-09-12
US60/502,359 2003-09-12
US10/684,350 2003-10-10
US10/684,348 2003-10-10

Publications (2)

Publication Number Publication Date
CN1842769A CN1842769A (en) 2006-10-04
CN100489784C true CN100489784C (en) 2009-05-20

Family

ID=37031160

Family Applications (4)

Application Number Title Priority Date Filing Date
CN 200480024800 Pending CN1842770A (en) 2003-08-28 2004-08-26 Integrated mechanism for suspension and deallocation of computational threads of execution in a processor
CNB2004800247988A Expired - Fee Related CN100489784C (en) 2003-08-28 2004-08-27 Multithreading microprocessor and its novel threading establishment method and multithreading processing system
CNB2004800248016A Expired - Fee Related CN100538640C (en) 2003-08-28 2004-08-27 The device of dynamic-configuration virtual processor resources
CN2004800248529A Expired - Fee Related CN1846194B (en) 2003-08-28 2004-08-27 Method and device for executing Parallel programs thread

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN 200480024800 Pending CN1842770A (en) 2003-08-28 2004-08-26 Integrated mechanism for suspension and deallocation of computational threads of execution in a processor

Family Applications After (2)

Application Number Title Priority Date Filing Date
CNB2004800248016A Expired - Fee Related CN100538640C (en) 2003-08-28 2004-08-27 The device of dynamic-configuration virtual processor resources
CN2004800248529A Expired - Fee Related CN1846194B (en) 2003-08-28 2004-08-27 Method and device for executing Parallel programs thread

Country Status (1)

Country Link
CN (4) CN1842770A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038019A (en) * 2015-10-02 2017-08-11 联发科技股份有限公司 Method for processing instruction in single instruction multiple data computing system and computing system
CN114691565A (en) * 2020-12-29 2022-07-01 新唐科技股份有限公司 Direct memory access device and electronic equipment using same

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9417914B2 (en) * 2008-06-02 2016-08-16 Microsoft Technology Licensing, Llc Regaining control of a processing resource that executes an external execution context
WO2010095182A1 (en) * 2009-02-17 2010-08-26 パナソニック株式会社 Multithreaded processor and digital television system
GB2474521B (en) * 2009-10-19 2014-10-15 Ublox Ag Program flow control
US8561070B2 (en) 2010-12-02 2013-10-15 International Business Machines Corporation Creating a thread of execution in a computer processor without operating system intervention
CN102183922A (en) * 2011-03-21 2011-09-14 浙江机电职业技术学院 Method for realization of real-time pause of affiliated computer services (ACS) motion controller
WO2011127862A2 (en) * 2011-05-20 2011-10-20 华为技术有限公司 Method and device for multithread to access multiple copies
CN102831053B (en) * 2011-06-17 2015-05-13 阿里巴巴集团控股有限公司 Scheduling method and device for test execution
US9507638B2 (en) * 2011-11-08 2016-11-29 Nvidia Corporation Compute work distribution reference counters
CN102750132B (en) * 2012-06-13 2015-02-11 深圳中微电科技有限公司 Thread control and call method for multithreading virtual assembly line processor, and processor
CN103973600B (en) * 2013-02-01 2018-10-09 德克萨斯仪器股份有限公司 Merge and deposit the method and device of field instruction for packet transaction rotation mask
JP6122749B2 (en) * 2013-09-30 2017-04-26 ルネサスエレクトロニクス株式会社 Computer system
CN108228321B (en) * 2014-12-16 2021-08-10 北京奇虎科技有限公司 Android system application closing method and device
US9747108B2 (en) * 2015-03-27 2017-08-29 Intel Corporation User-level fork and join processors, methods, systems, and instructions
US10346168B2 (en) 2015-06-26 2019-07-09 Microsoft Technology Licensing, Llc Decoupled processor instruction window and operand buffer
US9720693B2 (en) * 2015-06-26 2017-08-01 Microsoft Technology Licensing, Llc Bulk allocation of instruction blocks to a processor instruction window
US10169105B2 (en) * 2015-07-30 2019-01-01 Qualcomm Incorporated Method for simplified task-based runtime for efficient parallel computing
GB2544994A (en) * 2015-12-02 2017-06-07 Swarm64 As Data processing
CN105700913B (en) * 2015-12-30 2018-10-12 广东工业大学 A kind of parallel operation method of lightweight bare die code
US10761849B2 (en) * 2016-09-22 2020-09-01 Intel Corporation Processors, methods, systems, and instruction conversion modules for instructions with compact instruction encodings due to use of context of a prior instruction
GB2569275B (en) * 2017-10-20 2020-06-03 Graphcore Ltd Time deterministic exchange
GB2569098B (en) * 2017-10-20 2020-01-08 Graphcore Ltd Combining states of multiple threads in a multi-threaded processor
GB201717303D0 (en) * 2017-10-20 2017-12-06 Graphcore Ltd Scheduling tasks in a multi-threaded processor
CN109697084B (en) * 2017-10-22 2021-04-09 刘欣 Fast access memory architecture for time division multiplexed pipelined processor
CN108536613B (en) * 2018-03-08 2022-09-16 创新先进技术有限公司 Data cleaning method and device and server
CN110768807B (en) * 2018-07-25 2023-04-18 中兴通讯股份有限公司 Virtual resource method and device, virtual resource processing network element and storage medium
CN110955503B (en) * 2018-09-27 2023-06-27 深圳市创客工场科技有限公司 Task scheduling method and device
GB2580327B (en) * 2018-12-31 2021-04-28 Graphcore Ltd Register files in a multi-threaded processor
CN111414196B (en) * 2020-04-03 2022-07-19 中国人民解放军国防科技大学 A method and device for implementing a zero-value register
CN112395095A (en) * 2020-11-09 2021-02-23 王志平 Process synchronization method based on CPOC
CN112579278B (en) * 2020-12-24 2023-01-20 海光信息技术股份有限公司 Central processing unit, method, device and storage medium for simultaneous multithreading
CN115129369B (en) * 2021-03-26 2025-03-28 上海阵量智能科技有限公司 Command distribution method, command distributor, chip and electronic device
CN113946445B (en) * 2021-10-15 2025-02-25 杭州国芯微电子股份有限公司 A multi-thread module and multi-thread control method based on ASIC
CN116701085B (en) * 2023-06-02 2024-03-19 中国科学院软件研究所 Form verification method and device for consistency of instruction set design of RISC-V processor Chisel
CN116954950B (en) * 2023-09-04 2024-03-12 北京凯芯微科技有限公司 Inter-core communication method and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5812811A (en) * 1995-02-03 1998-09-22 International Business Machines Corporation Executing speculative parallel instructions threads with forking and inter-thread communication

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design, Implementierung und Evaluierung einer virtuellen Maschine für Oz. Ralf Scheidhauer. 1998 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038019A (en) * 2015-10-02 2017-08-11 联发科技股份有限公司 Method for processing instruction in single instruction multiple data computing system and computing system
CN114691565A (en) * 2020-12-29 2022-07-01 新唐科技股份有限公司 Direct memory access device and electronic equipment using same
TWI775259B (en) * 2020-12-29 2022-08-21 新唐科技股份有限公司 Direct memory access apparatus and electronic device using the same
CN114691565B (en) * 2020-12-29 2023-07-04 新唐科技股份有限公司 Direct memory access device and electronic equipment using same

Also Published As

Publication number Publication date
CN1842769A (en) 2006-10-04
CN100538640C (en) 2009-09-09
CN1842771A (en) 2006-10-04
CN1846194A (en) 2006-10-11
CN1842770A (en) 2006-10-04
CN1846194B (en) 2010-12-15

Similar Documents

Publication Publication Date Title
CN100489784C (en) Multithreading microprocessor and its novel threading establishment method and multithreading processing system
JP4818918B2 (en) An instruction that starts a concurrent instruction stream on a multithreaded microprocessor
US7418585B2 (en) Symmetric multiprocessor operating system for execution on non-independent lightweight thread contexts
US7836450B2 (en) Symmetric multiprocessor operating system for execution on non-independent lightweight thread contexts
US7870553B2 (en) Symmetric multiprocessor operating system for execution on non-independent lightweight thread contexts
US9032404B2 (en) Preemptive multitasking employing software emulation of directed exceptions in a multithreading processor
US7849297B2 (en) Software emulation of directed exceptions in a multithreading processor
US20140115594A1 (en) Mechanism to schedule threads on os-sequestered sequencers without operating system intervention
US20050050305A1 (en) Integrated mechanism for suspension and deallocation of computational threads of execution in a processor
WO2005022384A1 (en) Apparatus, method, and instruction for initiation of concurrent instruction streams in a multithreading microprocessor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: California, USA

Patentee after: Imagination Technologies Ltd.

Address before: California, USA

Patentee before: Imagination Technology Co.,Ltd.

Address after: California, USA

Patentee after: Imagination Technology Co.,Ltd.

Address before: California, USA

Patentee before: Mips Technologies, Inc.

CP01 Change in the name or title of a patent holder
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090520

Termination date: 20200827

CF01 Termination of patent right due to non-payment of annual fee