WO2006079940A2 - Multi-threaded processor

Info

Publication number: WO2006079940A2
Application number: PCT/IB2006/050167
Authority: WO
Inventors: Jan Hoogerbrugge
Original assignee: Nxp B.V.
Priority date: 2005-01-25
Filing date: 2006-01-17
Publication date: 2006-08-03
Also published as: US20080195851A1; JP2008529119A; CN100520714C; US8539211B2; WO2006079940A3; EP1844393A2; CN101151590A

Abstract

A multi-threaded processor comprises a processing unit (PU) for concurrently processing multiple threads. A register file means (RF) is provided having a plurality of registers, wherein a first register (LI) is used for storing loop invariant values and N second registers (LV1-LVN) are each used for storing loop variant values. Furthermore, N program counters (PC1-PCN) are provided, each being associated with one of the multiple threads, where N is the number of threads being processed.

Description

Multi-threaded processor

The present invention relates to a multi-threaded processor having a processing unit for concurrently processing multiple tasks as well as a method for compiling parallel loops.

A processor typically executes instructions from a thread and comprises a register file for the data referenced by the instructions as well as a program counter, i.e. an instruction address register, holding the address of the currently executed instruction. In order to reduce the time during which the processor executes no instructions because it is waiting for data or further instructions from memory, multiple threads are executed concurrently on the processor. If the execution of one thread is stalled, the processor switches to the next thread. Such switching is also referred to as context switching. To enable efficient and fast switching, the contexts of the multiple threads must be kept in the processor. Therefore, a multi-threaded processor must contain a register file and a program counter for each thread. Enhancing the performance of a multi-threaded processor thus requires additional resources to store the multiple contexts. However, these additional resources increase the cost in terms of the required area on the die and in terms of higher design complexity.

US 6,351,808 B1 relates to a multi-threaded processor with a replicated register file structure.

US 6,092,175 discloses a multi-threaded processor supporting multiple contexts or threads. In order to reduce the number of registers, some of the register files are shared between the threads and allocated to a thread if required.

It is therefore an object of the invention to provide a multi-threaded processor with reduced hardware costs as well as a method for compiling parallel loops for a multi-threaded processor. This object is solved by a multi-threaded processor according to claim 1 and by a method for compiling parallel loops according to claim 4.

Therefore, a multi-threaded processor comprises a processing unit for concurrently processing multiple threads. A register file means is provided having a plurality of registers, wherein at least one first register is used for storing loop invariant values and N second registers or N further sets of registers are each used for storing loop variant values. Furthermore, N program counters are provided, each being associated with one of the multiple threads, where N is the number of threads being processed.

Hence, as merely the program counter and only part of the register file are duplicated, the hardware complexity and the hardware cost can be reduced for multi-threaded processors.

According to an aspect of the invention each of the N second registers is associated to one of the multiple threads, and the at least one first register is shared between the multiple threads. Therefore, by sharing the first register between the multiple threads there is no need to provide such a first register for each thread. This will lead to an improved utilization of the registers in the register file. Instead of allocating a first register for storing loop invariant values for each thread, only one first register is allocated for all threads, i.e. shared between the threads.

According to a further aspect of the invention the partitioning or allocating of the plurality of registers in the register file means into first and second registers is performed per loop. As the requirements may vary for each loop, the partitioning or allocating of registers for loop invariant and loop variant values can be performed for each loop.

The invention also relates to a method for compiling parallel loops within a set of instructions. Loop invariant and loop variant values in loops within a set of instructions for a multi-threaded processor having a register file are detected. A plurality of registers of the register file are partitioned or allocated into at least one first register for storing the loop invariant values and N second registers or N sets of registers each for storing loop variant values, where N is the number of threads being processed.

The invention is based on the idea of compiling loops into multiple threads. Merely the program counter is duplicated, without duplicating the whole register file accordingly. While according to the prior art the register files require N sets of loop invariant registers and N sets of loop variant registers, the present invention only requires one register (or set of registers) for loop invariant values, shared between the threads, and N registers (or sets of registers) for loop variant values, each associated with or dedicated to one of the N threads. The existing registers in a register file are partitioned or allocated into registers for loop variant values and registers for loop invariant values, i.e. the registers are partitioned into disjoint subsets. Multi-threading is applied to parallel loops by translating a parallel loop into two or more loops.

These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter.

Fig. 1 shows the basic architecture of a multi-threaded processor according to a first embodiment.

Fig. 1 shows a basic architecture of a multi-threaded processor according to a first embodiment. The processor comprises a processing unit PU that is capable of processing N threads concurrently. According to the embodiment of Fig. 1 three threads are processed, i.e. N=3. Therefore, a program counter PC is implemented which comprises a program counter unit for every thread, i.e. a first, second and third program counter unit PC1, PC2, PC3. Furthermore, a register file RF is provided with a plurality of registers. Here, merely 8 registers are depicted. From the plurality of registers of the register file RF a number of register sets LV1 - LV3 are partitioned or allocated for loop variant values for each thread, i.e. the registers LV1 - LV3 for loop variant values are duplicated according to the number of threads such that one register set LV1 - LV3 for loop variant values is associated with each of the threads. In other words, the partitioning of the registers in the register file is performed to allocate N registers (one for each thread) for the loop variant values. The register file RF further comprises a register set LI for loop invariant values (one register is allocated for the loop invariant values and shared between the threads) such that the register file RF comprises one register LI for loop invariant values and 3 (N) register sets LV for loop variant values. 4 of the registers are not used. It should be noted that the value of N merely illustrates and does not limit the embodiment.

As an example, the register file RF may comprise 32 registers. For one specific loop, 10 of the 32 registers are allocated for loop invariant values and 3x5 registers (3 sets of 5 registers) are allocated for loop variant values while seven registers are not used. In a further loop, four registers may be allocated for loop invariant values, 3x8 registers are allocated for loop variant values while four registers are not used. In other words, the existing registers can be allocated for loop invariant values and loop variant values for every loop to be performed. In some cases not all of the 32 registers in the register file RF will be used. It should be noted that the registers allocated to store loop invariant values are shared between the multiple threads, while those registers allocated to store loop variant values are associated exclusively or dedicated to one of the multiple threads at least for the duration of a loop. It should be furthermore noted that the register file may comprise a different number of registers.
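The per-loop partitioning arithmetic described above can be sketched as follows. This is a minimal illustration, not the allocation scheme of the described processor: the function name `partition_register_file` and the contiguous numbering of the register sets are assumptions made for the sketch, since the text does not state which physical registers end up in which set.

```python
def partition_register_file(total, n_invariant, n_threads, n_variant_per_thread):
    """Split register indices 0..total-1 into one shared set for loop
    invariant values, one private set per thread for loop variant
    values, and the remaining unused registers.

    Contiguous numbering is an assumption of this sketch.
    """
    needed = n_invariant + n_threads * n_variant_per_thread
    if needed > total:
        raise ValueError(f"loop needs {needed} registers, file has {total}")

    invariant = list(range(n_invariant))  # shared between all threads

    variant_sets = []  # one disjoint set per thread
    start = n_invariant
    for _ in range(n_threads):
        variant_sets.append(list(range(start, start + n_variant_per_thread)))
        start += n_variant_per_thread

    unused = list(range(start, total))
    return invariant, variant_sets, unused


# First loop of the example: 10 invariant, 3 x 5 variant registers.
LI, LV, UNUSED = partition_register_file(32, 10, 3, 5)
```

Calling the function again with `(32, 4, 3, 8)` gives the partition for the second loop of the example; because the function is evaluated once per loop, the split can differ from loop to loop.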

As only the registers LV1-LV3 for the loop variant values are duplicated, the applications which may be processed based on the multiple threads are limited. As an example, the multi-threading is applied to the processing of parallel loops, as their requirements on the register file RF are less strict. According to this embodiment a parallel loop is translated into two loops.

A parallel loop may be implemented by the following code:

    for (i = 0; i < n; i++)
        S;

and is translated into the following code:

       fork L                       /* create a second thread, initial pc = L */
       for (i = 0; i < n/2; i++)
           S;
       wait                         /* wait on termination of second thread */

    L: for (i = n/2; i < n; i++)
           S;
       exit                         /* stop thread */
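The fork/wait translation above can be mimicked in software as a rough sketch. The names `split_range` and `run_parallel_loop` are hypothetical, and Python's `threading` module merely stands in for the hardware threads of the described processor; the sketch shows how the iteration space is divided and joined, not how the hardware schedules threads.

```python
import threading


def split_range(n, parts):
    """Return `parts` half-open (start, stop) ranges that together cover 0..n."""
    bounds = [n * k // parts for k in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def run_parallel_loop(n, body, parts=2):
    """Run body(i) for every i in 0..n-1, one sub-loop per thread.

    body must be safe to call for different i in any order,
    i.e. the loop must really be a parallel loop.
    """
    ranges = split_range(n, parts)

    def worker(start, stop):
        for i in range(start, stop):
            body(i)

    # "fork": one additional thread per sub-loop after the first
    others = [threading.Thread(target=worker, args=r) for r in ranges[1:]]
    for t in others:
        t.start()

    worker(*ranges[0])  # the original thread runs the first sub-loop

    # "wait": block until the forked threads have terminated
    for t in others:
        t.join()


squares = [0] * 10
run_parallel_loop(10, lambda i: squares.__setitem__(i, i * i))
```

With `parts=2` the ranges are `(0, n//2)` and `(n//2, n)`, matching the two loops of the translated code.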

The parallel loop is translated into two loops, namely one loop from 0 to n/2-1 and one loop from n/2 to n-1. These two loops are then processed concurrently as two threads. The data values in the loops may be loop variant, i.e. changing during the processing of the loop, or loop invariant, i.e. not changing during the loop. An example of a loop invariant value is the base pointer of an array. The register LI or the set of registers LI is used to hold all loop invariant values and the register LV or the set of registers LV is used to hold all loop variant values. Therefore, to implement the above code one register LI and two registers LV are required, as two threads are present.

According to a second embodiment a SAXPY loop is considered, i.e. a loop which multiplies a vector X by a scalar A and adds the result to a vector Y (Scalar A times X Plus Y). The following loop is considered:

    for (i = 0; i < 1000; i++)
        a[i] = b[i] + s*c[i];

A corresponding assembly code could be implemented as follows:

        loadi #a -> r10
        loadi #b -> r11
        loadi #c -> r12
        loadi #s -> r13
        loadi #0 -> r14
    L1: loadi r11, r14 -> r15      /* load b[i] */
        loadi r12, r14 -> r16      /* load c[i] */
        mult r16, r13 -> r16       /* compute s*c[i] */
        add r16, r15 -> r16        /* add b[i] */
        store r10, r14 <- r16      /* store result in a[i] */
        add #1 r14 -> r14
        bless r14, #1000, L1

In the above loop the registers LI = {r10, r11, r12, r13} hold loop invariant values: r10 to r12 contain the base addresses of arrays a, b and c, and r13 is used to store the scalar value s. As all these values are constant during the execution of the loop they constitute loop invariant values. The registers r14 to r16 contain loop variant values as their values change during the execution of the loop.

The above loop can be translated according to the second embodiment into the following code, where the first thread utilizes registers LV1 = {r14-r16} for variant values and the second thread utilizes the registers LV2 = {r24-r26} for variant values:
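The classification of the SAXPY registers can be reproduced mechanically. The sketch below assumes a simple rule that is sufficient for this example: a register written inside the loop body is loop variant, and a register that is only read is loop invariant. The tuple encoding of the instructions and the name `classify_registers` are assumptions of the sketch; the text does not prescribe how the detection step is implemented.

```python
# Loop body of the SAXPY example as
# (mnemonic, source registers, destination register or None).
# Immediate operands such as #1 and #1000 are left out because
# they are not registers.
SAXPY_BODY = [
    ("loadi", ["r11", "r14"], "r15"),        # load b[i]
    ("loadi", ["r12", "r14"], "r16"),        # load c[i]
    ("mult",  ["r16", "r13"], "r16"),        # compute s*c[i]
    ("add",   ["r16", "r15"], "r16"),        # add b[i]
    ("store", ["r10", "r14", "r16"], None),  # store result in a[i]
    ("add",   ["r14"], "r14"),               # i = i + 1
    ("bless", ["r14"], None),                # branch if i < bound
]


def classify_registers(body):
    """Return (invariant, variant) register sets for a loop body."""
    written = {dst for _, _, dst in body if dst is not None}
    read = {src for _, srcs, _ in body for src in srcs}
    invariant = read - written  # read in the loop but never modified
    variant = written           # modified at least once per iteration
    return invariant, variant


LI, LV = classify_registers(SAXPY_BODY)
```

For this body the rule yields r10 to r13 as invariant and r14 to r16 as variant, in agreement with the sets named in the text.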

        loadi #a -> r10
        loadi #b -> r11
        loadi #c -> r12
        loadi #s -> r13
        loadi #0 -> r14
        fork L2                    /* start second thread */

    L1: loadi r11, r14 -> r15      /* load b[i] */
        loadi r12, r14 -> r16      /* load c[i] */
        mult r16, r13 -> r16       /* compute s*c[i] */
        add r16, r15 -> r16        /* add b[i] */
        store r10, r14 <- r16      /* store result in a[i] */
        add #1 r14 -> r14
        bless r14, #500, L1        /* loop until 500 */
        wait                       /* wait on termination of second thread */

    L2: loadi #500 -> r24          /* start at i=500 */

    L3: loadi r11, r24 -> r25      /* load b[i] */
        loadi r12, r24 -> r26      /* load c[i] */
        mult r26, r13 -> r26       /* compute s*c[i] */
        add r26, r25 -> r26        /* add b[i] */
        store r10, r24 <- r26      /* store result in a[i] */
        add #1 r24 -> r24
        bless r24, #1000, L3       /* loop until 1000 */
        exit                       /* stop second thread */
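The second thread's loop body differs from the first only in its variant registers, so it can be derived by renaming. The following sketch assumes the same tuple encoding of instructions as a plain Python list; the helper `rename_for_thread` is hypothetical, and the offset of 10 is simply the distance between r14-r16 and r24-r26 in the listing.

```python
def rename_for_thread(body, variant, offset):
    """Copy a loop body, shifting the number of every loop variant
    register by `offset` and leaving loop invariant registers untouched."""

    def rename(reg):
        if reg in variant:
            return f"r{int(reg[1:]) + offset}"
        return reg

    return [
        (op, [rename(s) for s in srcs], rename(dst) if dst is not None else None)
        for op, srcs, dst in body
    ]


# Body of the first thread as (mnemonic, sources, destination or None);
# immediates and branch targets are omitted for brevity.
FIRST_THREAD_BODY = [
    ("loadi", ["r11", "r14"], "r15"),
    ("loadi", ["r12", "r14"], "r16"),
    ("mult",  ["r16", "r13"], "r16"),
    ("add",   ["r16", "r15"], "r16"),
    ("store", ["r10", "r14", "r16"], None),
    ("add",   ["r14"], "r14"),
]
VARIANT = {"r14", "r15", "r16"}

SECOND_THREAD_BODY = rename_for_thread(FIRST_THREAD_BODY, VARIANT, 10)
```

The shared registers r10 to r13 appear unchanged in both bodies, which is what allows a single copy of the loop invariant values to serve both threads.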

As the two threads implementing the first and second loop (loop from 0 to 499, and loop from 500 to 999) are executed concurrently by the multi-threaded processor, one of the threads can still be processed if the other is stalled. Thereby the execution time is reduced and the utilization of the processor is improved.
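The effect of switching threads on a stall can be illustrated with a toy cycle model. Every number and rule here is an assumption chosen for illustration, not a property of the described processor: each loop iteration is collapsed into one issue cycle followed by a fixed memory stall, the processor issues at most one iteration per cycle, and it always picks the lowest-numbered ready thread.

```python
def simulate(iterations_per_thread, stall):
    """Return the cycle at which all threads have finished.

    iterations_per_thread: iterations each thread must execute.
    stall: cycles a thread is blocked after issuing one iteration.
    """
    remaining = list(iterations_per_thread)
    ready_at = [0] * len(remaining)  # earliest cycle each thread may issue
    cycle = 0
    while any(r > 0 for r in remaining):
        for t in range(len(remaining)):
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + 1 + stall  # issue cycle plus stall
                break  # at most one issue per cycle
        cycle += 1
    return max(ready_at)


single = simulate([1000], stall=3)      # one thread runs all iterations
dual = simulate([500, 500], stall=3)    # two threads of 500 iterations each
```

In this model the two-thread run finishes in roughly half the cycles of the single-thread run, because the second thread issues during cycles in which the first is stalled. With only two threads and a stall of three cycles some idle cycles remain.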

Multi-threading is applied to parallel loops by translating a parallel loop into two or more loops. Loop variant and loop invariant values are determined. The parallel loop is divided or translated into a plurality of loops. From the registers in the register file a number of registers corresponding to the number of threads is allocated or partitioned for loop variant values such that each thread is associated with one register to store its variant values. Furthermore, from the registers in the register file at least one is allocated for storing the loop invariant values and is shared between the threads.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. In the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Furthermore, any reference signs in the claims shall not be construed as limiting the scope of the claims.

Claims

1. Multi-threaded processor, comprising: a processing unit (PU) for concurrently processing multiple threads; register file means (RF) comprising a plurality of registers (LI, LV), wherein at least one first register (LI) is used for storing loop invariant values and N second registers (LV1-LVN) are each used for storing loop variant values; and

N program counters (PC1-PCN) each associated to one of the multiple threads; wherein N being the number of threads being processed.

2. Multi-threaded processor according to claim 1, wherein each of the N second registers (LV1-LVN) is associated to one of the multiple threads, and wherein the at least one first register (LI) is shared between the multiple threads.

3. Multi-threaded processor according to claim 1 or 2, wherein the partition of the plurality of registers in the register file means (RF) into first and second registers (LI; LV1-LV3) is performed per loop.

4. Method for compiling parallel loops within a set of instructions, comprising the steps of: detecting loop invariant and loop variant values in loops within a set of instructions for a multi-threaded processor (PU) having a register file (RF); and partitioning a plurality of registers (LI, LV1-LVN) of the register file (RF) into at least one first register (LI) for storing the loop invariant values and N second registers (LV1-LVN) each for storing loop variant values; and wherein N being the number of threads being processed.

5. Method according to claim 4, wherein each of the N second registers (LV1-LVN) is associated to one of the multiple threads, and wherein the at least one first register (LI) is shared between multiple threads.

6. Method according to claim 4 or 5, comprising the step of partitioning the plurality of registers in the register file means (RF) into first and second registers (LI; LV1-LV3) for at least one loop.