CN101036119A - Method and apparatus to provide a source operand for an instruction in a processor - Google Patents
Method and apparatus to provide a source operand for an instruction in a processor
- Publication number
- CN101036119A (also indexed as CNA2005800334355, CN200580033435)
- Authority
- CN
- China
- Prior art keywords
- register
- instruction
- scheduler
- processor
- read
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/30138—Extension of register space, e.g. register cache
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/3826—Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3838—Dependency mechanisms, e.g. register scoreboarding
- G06F9/384—Register renaming
- G06F9/3854—Instruction completion, e.g. retiring, committing or graduating
- G06F9/3858—Result writeback, i.e. updating the architectural state or memory
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Advance Control (AREA)
- Executing Machine-Instructions (AREA)
Abstract
A method and apparatus for providing a source operand for an instruction to be executed in a processor. Some embodiments may include a register file unit that has registers and a scheduler to schedule instructions. In some embodiments, the scheduler is to asynchronously receive an instruction and a source operand for that instruction, the source operand being received from the register file unit.
Description
Field of the invention
Embodiments of the invention relate generally to instruction pipelining in computer processors.
Background
A processor in a computer system generally executes an instruction in a series of stages, which may be called a pipeline. Each of these stages may be performed by a different part of the processor. For example, an instruction may be decoded by a decoder, and the decoded instruction may then be executed by a functional unit. In an "out-of-order" architecture, instructions may be executed by the execution units in an order different from the order specified by the program from which the instructions were obtained. In such a case, instructions may be dispatched by a dispatcher to a scheduler, and the scheduler may determine the order in which instructions are issued to the functional units that execute them.
Instructions executed by a processor generally use registers to store data. An instruction may have one or more source operands, which may be stored in registers, and may produce a result, which may also be stored in a register. An instruction is said to use a register if a source operand of the instruction is stored in that register (that is, read from the register) or if a result is stored in that register (that is, written to the register). For example, for a given instruction, a processor may read one data operand from register R0, read another from register R3, add the two operands, and store the result in register R4. Some prior-art architectures have a register cache; in such architectures, a source operand may be obtained from the register cache or, on a cache miss, from the register file unit. In existing architectures, every instruction in the pipeline must flow through the register file unit (or the associated register cache). For example, some prior-art out-of-order architectures place the register file unit after the scheduler in the main pipeline (in which case the register file unit is accessed when the instruction is scheduled and issued to a functional unit), while others place it before the scheduler in the main pipeline (in which case the register file unit may be accessed as the instruction enters the scheduler).
Brief description of the drawings
Fig. 1 is a simplified block diagram of a processor that provides source operands to a scheduler according to an embodiment of the invention;
Fig. 2 is a simplified flow chart of a method of dispatching an instruction into a scheduler and requesting a register read according to an embodiment of the invention;
Fig. 3 is a simplified block diagram showing details of a processor having a register file unit and functional units coupled to a scheduler according to another embodiment of the invention;
Fig. 4 is a simplified block diagram showing details of a processor having a register file unit and functional units coupled to a scheduler according to another embodiment of the invention.
Detailed description
Embodiments of the invention relate to methods and apparatus for providing source operands to instructions in a processor pipeline. In some embodiments, the processor has a register file unit implemented in parallel with the functional units; it provides data operands to instructions in the scheduler and may be viewed as supplying data operands just as a functional unit does. In some embodiments, an instruction with a data operand to be read from a register may be dispatched without sending a read request to that register if a producer instruction for the source operand is in flight; and, even if no such producer is in flight, the instruction may be dispatched before the result of a read request to that register is available. According to some embodiments, multiple read requests from multiple instructions to the same physical register may be compressed into a single register file read, so that the single read and its data can be fanned out to 1 to n instructions waiting in the scheduler. In some embodiments, the processor may be designed to tolerate register file accesses with varying latencies. It should be understood that various modifications and variations of the examples described herein are covered by the teachings provided below and fall within the scope of the appended claims.
Fig. 1 is a simplified block diagram of a processor that provides source operands to a scheduler according to an embodiment of the invention. The processor 100 shown in Fig. 1 includes a retirement order buffer 110, a dispatcher 120, an in-flight memory 125, a scheduler 130, a read queue 140, a register file unit 150, and functional units 160. Processor 100 may be any type of processor used in a computer system, such as a Pentium-class microprocessor from Intel Corporation of Santa Clara, California, USA. Each unit shown in Fig. 1 may be implemented as, for example, hardware, firmware, or some combination of hardware and firmware. Although Fig. 1 shows one instruction pipeline of processor 100, in other embodiments the instruction pipeline may contain more units, different units, and/or additional units.
As shown in Fig. 1, retirement order buffer 110 stores a number of instructions, such as instructions 15-17. These instructions may be, for example, decoded micro-instructions for a set of program instructions to be executed by processor 100, for example according to known instruction processing techniques. They may also be macro-instructions, some combination of micro-instructions and macro-instructions, and so on. For example, instruction 15 may be an instruction to perform "ADD R0=R3, R4", which calls for adding the data operand stored in register 0 to the data operand stored in register 3 and storing the result in register 4. Of course, retirement order buffer 110 may store more than three instructions to be executed. In the illustrated embodiment, the data operands that an instruction will use are not stored with the instruction in retirement order buffer 110; the data is instead provided by units further down the pipeline.
Retirement order buffer 110 is coupled to dispatcher 120 and may provide instructions to dispatcher 120. Two objects are referred to herein as "coupled" if they are connected directly or indirectly. In some embodiments, as shown in Fig. 1, dispatcher 120 is coupled to read queue 140, which is in turn coupled to register file unit 150. In the embodiment of Fig. 1, register file unit 150 contains a first register bank 151 and a second register bank 152, and read queue 140 correspondingly contains a first bank 141 and a second bank 142. For example, the queue may be banked into even and odd banks. In other embodiments, register file unit 150 and read queue 140 may be unbanked or may contain a different number of banks. In some embodiments, the processor may include a register file unit with 2 banks, each having two read ports and implemented with simple SRAM cells, for a total of up to 4 read results per cycle. In the illustrated embodiment, dispatcher 120 may send read queue 140 a request to read a register in register file unit 150 to supply a data operand to be used by an instruction; read queue 140 may buffer such requests and deliver them to the appropriate register in register file unit 150. For example, if dispatcher 120 is dispatching an instruction that will read a data operand from register R3, dispatcher 120 may send read queue 140 a request 121 to read register R3. In some embodiments, the read queue may be bypassed if no requests are waiting. Read queue 140 may be any type of memory device that implements a queue. The read queue may be designed, for example, to absorb the worst-case read request rate and to drain at the rate supported by the porting of the register file unit (for example, by the register read ports available in the corresponding bank). In some embodiments, the read queue may smooth the flow of register read requests to a steady-state level.
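The banked read queue 140 described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware: the class name `BankedReadQueue`, its methods, and the choice to select a bank with `reg % num_banks` (an even/odd split when there are two banks) are hypothetical. It also folds in the merging of duplicate reads of the same physical register mentioned earlier in the description.

```python
from collections import deque


class BankedReadQueue:
    """Hypothetical model: one FIFO per register bank, drained at a fixed port rate."""

    def __init__(self, num_banks=2, ports_per_bank=2):
        self.num_banks = num_banks
        self.ports_per_bank = ports_per_bank
        self.banks = [deque() for _ in range(num_banks)]

    def request(self, reg):
        # Assumed banking rule: with two banks this is an even/odd split.
        bank = self.banks[reg % self.num_banks]
        # Merge a duplicate pending read of the same physical register.
        if reg not in bank:
            bank.append(reg)

    def drain_cycle(self):
        # Each bank services at most ports_per_bank reads per cycle.
        issued = []
        for bank in self.banks:
            for _ in range(min(self.ports_per_bank, len(bank))):
                issued.append(bank.popleft())
        return issued


queue = BankedReadQueue()
for reg in [0, 2, 4, 6, 1, 2]:   # a burst of requests; register 2 is requested twice
    queue.request(reg)
first_cycle = queue.drain_cycle()
second_cycle = queue.drain_cycle()
```

With two banks of two ports each, the burst is absorbed by the queue and drained over two cycles, which is the smoothing behaviour the paragraph describes.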
Register file unit 150 contains a set of registers R0 to Rn which, as is well known, may be used to store the data operands used by instructions executed by processor 100. Register file unit 150 may be any type of memory unit, such as a set of static random access memory (SRAM) cells, a set of dynamic random access memory (DRAM) cells, or conventional register cells. Register file unit 150 may contain any number of registers, for example 512 82-bit registers. In some embodiments, a very large number of physical registers may be implemented (for example, in a data cache) using, for example, high-density, low-power SRAM cells. Architectural and speculative register state may be combined in the same physical register file. In such embodiments, architectural and speculative register renaming may rename pointers as instructions retire, so as to maintain correct architectural state. In some embodiments, the registers in the register file unit have a small number of ports. For example, in one embodiment, on a 4-wide dispatch machine, the registers may be implemented with SRAM cache cells having 2 read ports and 1 write port, as a register file unit divided into two banks with a total of 4 read ports and 2 write ports. For example, the register file unit may have 512 registers and 4 output ports.
Dispatcher 120 is coupled to scheduler 130, which is in turn coupled to functional units 160. In other embodiments, the processor may have multiple schedulers, for example one scheduler per functional unit or one scheduler per cluster of functional units. Dispatcher 120 may dispatch an instruction, such as instruction 13, into scheduler 130. Scheduler 130 may store a number of instructions, such as instructions 11-12, that are waiting to be scheduled for execution by one of the functional units. Scheduler 130 may store an instruction without storing the operands to be used by that instruction (as indicated by the empty columns shown in scheduler 130). In some embodiments, an instruction takes on an "in-flight" status once it is dispatched into scheduler 130, and may remain in flight until its data result has been delivered to the register (for example, by the register file unit or a functional unit, as discussed below) and is no longer available (for example, from a functional unit or the bypass network). Functional units 160 may be one or more units that execute instructions, such as an arithmetic logic unit (ALU), a floating-point execution unit, an integer execution unit, a branch execution unit, and so on. When an instruction is to be executed, it is passed to the appropriate functional unit among functional units 160, which executes it. Scheduler 130 may be an out-of-order scheduler, in that instructions may be executed in an order different from the order in which they appear. For example, instruction 12 may be executed before instruction 11, even though instruction 11 was dispatched before instruction 12. Scheduler 130 may schedule instructions using any scheduling algorithm.
As shown in Fig. 1, an output port of register file unit 150 is coupled (here, through a multiplexer) to an input port of scheduler 130. In addition, functional units 160 have an output port coupled to an input port of scheduler 130. According to some embodiments, an instruction that specifies a data operand to be read from a register may be dispatched to the scheduler without its source operand, before the operand is available. In some embodiments, scheduler 130 stores an instruction whose source operand specifies a register, and schedules the instruction based on the arrival of that source operand at the scheduler. In some embodiments, source operands are provided to the scheduler asynchronously with the dispatch of new instructions to the scheduler, meaning that there is no correlation between the time an instruction is dispatched to the scheduler and the time its source operand arrives there. The arrival of an instruction at the scheduler and the arrival of that instruction's operands may therefore be decoupled. As described in more detail below with reference to Fig. 2, a source operand to be used by an instruction waiting in the scheduler may be provided to the scheduler from register file unit 150 or from functional units 160. For example, suppose instruction 12 uses a source operand 153 to be read from some register. Instruction 12 may be dispatched to scheduler 130 without that source operand, and source operand 153 may later be provided to scheduler 130, either from register file unit 150 or from a functional unit, for use by the waiting instruction 12.
Fig. 1 also shows an in-flight memory 125 coupled to dispatcher 120. As shown, in-flight memory 125 contains a number of 1-bit memory locations which, for purposes of illustration, are labeled as entries 1 to n in Fig. 1. In the illustrated embodiment, in-flight memory 125 stores an array or table indicating whether a register of the register file unit will be used by an in-flight instruction. In other embodiments, other elements or mechanisms may be used to indicate this, such as a content-addressable memory, various comparators, and so on. In the example of Fig. 1, entry 0 of in-flight memory 125 contains the value "1", which may indicate that register 0 will be used by an in-flight instruction, and entry 1 contains the value "0", which may indicate that no in-flight instruction will use register 1. In some embodiments, such a table may be updated to reflect all producer instructions dispatched to the scheduler. In some embodiments, in-flight status may also be set for register read requests, because the register file unit is then the producer of the physical register. As discussed below, dispatcher 120 may use the information stored in in-flight memory 125 to determine whether to generate a request to read queue 140 for an instruction being dispatched (that is, a request to read data from the register designated as the location of the instruction's source operand). If such a request is generated, dispatcher 120 may dispatch the instruction to scheduler 130 before the read request completes. In some embodiments, each cell in in-flight memory 125 may be fully ported or partially ported.
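The one-bit-per-register table held in memory 125 can be modelled in a few lines. The sketch below is illustrative only; the class name `InFlightTable` and its method names are assumptions, and a real implementation would be a ported hardware array rather than a Python list.

```python
class InFlightTable:
    """Hypothetical model of memory 125: one bit per physical register."""

    def __init__(self, num_regs):
        self.bits = [0] * num_regs

    def mark(self, reg):
        # Set when an instruction (or a pending read) that will use `reg` is dispatched.
        self.bits[reg] = 1

    def clear(self, reg):
        # Cleared once the value is no longer being produced or forwarded.
        self.bits[reg] = 0

    def needs_read_request(self, reg):
        # A read request is only needed when nothing in flight uses `reg`.
        return self.bits[reg] == 0


table = InFlightTable(num_regs=8)
table.mark(0)   # mirrors the Fig. 1 example: entry 0 holds "1", entry 1 holds "0"
```

In this state, a new instruction reading register 0 would not trigger a register file read, while one reading register 1 would.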
Fig. 2 is a simplified flow chart of a method of dispatching an instruction into a scheduler and requesting a register read according to an embodiment of the invention. This method is discussed with reference to the apparatus shown in Fig. 1, but may be performed by any other suitable apparatus. Instructions may flow through the pipeline of a processor such as processor 100 and may, for example, be stored in retirement order buffer 110. A new instruction may be examined to determine whether it has a source operand to be read from a register (201). For example, dispatcher 120 may obtain an instruction from the retirement order buffer and determine that it is an "ADD R0=R3, R4" instruction, in which case source operands must be read from register 0 and register 3. As another example, if the new instruction has no source operand to be read from a register, the new instruction may be sent to the scheduler for scheduling, as those skilled in the art will appreciate.
If the new instruction does have a source operand to be read from a register then, according to the illustrated embodiment, it is determined whether any in-flight instruction uses the same register that the new instruction will read (202). In some embodiments, an in-flight instruction is considered to use the same register read by the new instruction if it reads from or writes to that register. In some embodiments, this determination includes checking an array in a memory to determine whether any in-flight instruction will read from or write to the register that the new instruction reads. For example, if instruction 14 received by dispatcher 120 is an "ADD R0=R3, R4" instruction that will read register 0 and register 3, dispatcher 120 may check in-flight memory 125 to see whether any in-flight instruction will use register 0 and register 3. In the example of Fig. 1, entry 0 (which corresponds to register 0) of in-flight memory 125 contains the value "1", which may indicate that an in-flight instruction (for example, instruction 11) will use register 0.
If an in-flight instruction uses the same register that the new instruction will read, the new instruction may be dispatched into the scheduler without a request to read the source operand being sent to that register (203). In this case, the result operand from the in-flight instruction may be provided to the scheduler for use by the new instruction as a source operand (204). In the example discussed above, if instruction 14 will use a source operand stored in register 0, but instruction 11 is in flight and will read from or write to that register, instruction 14 may be dispatched to scheduler 130 without a read request for register 0 being sent to register file unit 150. If instruction 11 is reading register 0, the operand read from the register (such as operand 153 in Fig. 1) may be provided from the output port of the register to the input of scheduler 130, where it may be stored for use by instruction 14 when that instruction is later executed. If instruction 11 changes the value of register 0 (for example, by writing a result to register 0), functional units 160 may provide the result produced by executing instruction 11 (such as operand 153 in Fig. 1) from the output port of the functional unit to the input of scheduler 130, for later use by instruction 14 as just discussed. Of course, this assumes that the new instruction cannot be scheduled for execution before the in-flight instruction on which it depends. Thus, according to some embodiments, a read request is not generated if the producer, which may be a producing instruction handled by a functional unit or by the register file unit, is in flight.
If no in-flight instruction uses the same register that the new instruction will read, a request may be generated to read from the register containing the source operand for the new instruction (205). In the example above, if in-flight memory 125 indicates that no in-flight instruction uses register 0, dispatcher 120 may generate a request 121 to read register 0. In some embodiments, generating the request includes sending, to a read queue for the register (for example, read queue 140), a request to read the source operand from the register. Thus, according to some embodiments, the dispatcher can continue dispatching instructions into the scheduler even if the register file does not have enough ports available to accept the read requests. In the embodiment shown in Fig. 2, the new instruction is dispatched into the scheduler before the result of the generated read request is received (206). In the example above, instruction 14 may be dispatched into scheduler 130 even though register 0 has not yet been read in response to read request 121. As shown in Fig. 2, when the read request completes, the source operand for the new instruction may be provided from the register to the scheduler for use by the new instruction (207). For example, instruction 12 may wait in scheduler 130 for operand 153 to be provided to scheduler 130 from register 0 in response to read request 121.
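The dispatch decision of Fig. 2 (steps 201 to 207) can be summarised as a short routine. This is a minimal sketch, not the claimed method itself: the function `dispatch` and the use of a set, a list, and a dict to stand in for memory 125, read queue 140, and scheduler 130 are assumptions made for illustration. Marking a register as in flight when a read is queued follows the embodiment in which register read requests are also given in-flight status.

```python
def dispatch(name, source_regs, in_flight, read_queue, scheduler):
    """Dispatch one instruction; all container arguments are hypothetical stand-ins."""
    for reg in source_regs:                  # step 201: sources read from registers
        if reg in in_flight:                 # step 202: is a user of reg in flight?
            continue                         # steps 203-204: no read request; value will be forwarded
        read_queue.append(reg)               # step 205: request a register file read
        in_flight.add(reg)                   # the pending read now acts as the producer
    # Step 206: dispatch immediately, with operand slots still empty.
    scheduler[name] = {reg: None for reg in source_regs}


in_flight = {0}          # register 0 is already used by an in-flight instruction
read_queue = []
scheduler = {}

dispatch("i14", [0, 3], in_flight, read_queue, scheduler)
dispatch("i15", [3], in_flight, read_queue, scheduler)
```

Only one read of register 3 is queued even though two instructions need it, and neither instruction waits for the read to complete before entering the scheduler (step 207 fills the empty slots later).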
In some embodiments, register file results may be loaded into the scheduler using a content-addressable memory (CAM), so that they can be used by the processor's functional units. In some embodiments, the source operand for the new instruction may be provided to a port of the scheduler either from an output port of the register or from an output port of a functional unit, with the register file unit and the functional unit sharing an input port of the scheduler. In some embodiments, instructions waiting in the scheduler are insensitive to when register file data values arrive. The scheduler can therefore capture source operand data as the functional units and the register file unit produce their results. Once the scheduler has all the source operand data required for a particular instruction, it can issue the instruction to the correct functional unit. In some embodiments, subsequent instructions that do not need register read values may enter the scheduler immediately, so that they can be scheduled around the instructions that do need register read values. Thus, source operands may be provided to the scheduler asynchronously with the dispatch of new instructions, and the scheduler may wait for a source operand to arrive before scheduling the new instruction for execution.
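The asynchronous operand capture described above can be sketched as a scheduler that matches each arriving result against the source registers of its waiting instructions, loosely imitating a CAM lookup. The class `Scheduler` and its methods are hypothetical names chosen for this illustration; the sketch does not distinguish whether a value arrives from the register file unit or from a functional unit, which is the point of the shared input path.

```python
class Scheduler:
    """Hypothetical model: waiting instructions capture operands as results are broadcast."""

    def __init__(self):
        self.waiting = {}   # instruction name -> {source register: value or None}

    def dispatch(self, name, source_regs):
        # Instructions enter without operands; slots start empty.
        self.waiting[name] = dict.fromkeys(source_regs)

    def broadcast(self, reg, value):
        # CAM-like match: every waiting instruction that needs `reg` captures the value.
        for operands in self.waiting.values():
            if reg in operands and operands[reg] is None:
                operands[reg] = value

    def ready(self):
        # An instruction can issue once every source slot has been filled.
        return [name for name, ops in self.waiting.items()
                if all(v is not None for v in ops.values())]


sched = Scheduler()
sched.dispatch("i12", [0, 3])   # needs two register values
sched.dispatch("i13", [])       # needs no register reads
ready_before = sched.ready()
sched.broadcast(0, 7)           # e.g. a register file read result
sched.broadcast(3, 5)           # e.g. a functional unit result
ready_after = sched.ready()
```

The instruction with no register sources is ready at once and can be scheduled around the one still waiting, matching the behaviour described for subsequent instructions.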
Fig. 3 is a simplified block diagram showing details of a processor having a register file unit and functional units coupled to a scheduler according to another embodiment of the invention. The processor 100 shown in Fig. 3 includes some of the components shown in Fig. 1; in particular, Fig. 3 shows scheduler 130, register file unit 150, and functional units 160 of Fig. 1. In the embodiment of Fig. 3, the processor also has a bypass network 310 coupled to scheduler 130, register file unit 150, and functional units 160. In particular, an output port of bypass network 310 is coupled to an input port of scheduler 130 and (here, through a multiplexer) to an input port of functional units 160. In addition, the output ports of register file unit 150 and functional units 160 are coupled to input ports of bypass network 310. In some embodiments, output data from register file unit 150 and functional units 160 may be forwarded to scheduler 130 or functional units 160 for use by later instructions. In some embodiments, bypass network 310 may contain buffers for temporarily storing data operands. Of course, the output port of bypass network 310 may be coupled (through register file ports) to each register in register file unit 150 and to a group of functional units among functional units 160. In some embodiments, the bypass network may contain queues or buffers that temporarily store a result after it has been produced; in some such embodiments, an instruction may be considered to be in flight for as long as the result it produced remains available in the bypass network. In some embodiments, the use of such buffers can reduce traffic from the register file unit.
In the embodiment shown in Fig. 3, processor 100 also has a write queue 320 that is coupled to register file unit 150, functional unit 160, and bypass network 310. In particular, write queue 320 may have an output port coupled to an input port of register file unit 150, an output port coupled to bypass network 310, and an input port coupled through bypass network 315 to the output ports of register file unit 150 and functional unit 160. According to some embodiments, register writes may be buffered in write queue 320 and written into the register file unit in the background. In some embodiments, if an instruction is to read a register and the write to that register is still waiting in write queue 320, the data may be provided to scheduler 130 from the write queue through bypass network 315. Like register file unit 150, write queue 320 may in some embodiments have multiple banks. Of course, the output port of write queue 320 may be coupled to each of the registers in register file unit 150. In some embodiments, if there is a register file write-read conflict, a register value that has not yet been written may be switched from the write queue onto the register read data path.
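The write queue behavior described above (buffered writes that drain in the background, with forwarding on a write-read conflict) can be sketched in software. This is a minimal model, not the patented circuit: the class name `RegisterFileWithWriteQueue` and its methods are hypothetical, and banking and the bypass network are left out for brevity.

```python
from collections import deque


class RegisterFileWithWriteQueue:
    def __init__(self, num_regs):
        self.regs = [0] * num_regs
        self.write_queue = deque()   # pending (register, value) writes, oldest first

    def write(self, reg, value):
        # Writes are buffered rather than applied immediately.
        self.write_queue.append((reg, value))

    def drain_one(self):
        # Background step: retire the oldest pending write into the register file.
        if self.write_queue:
            reg, value = self.write_queue.popleft()
            self.regs[reg] = value

    def read(self, reg):
        # On a write-read conflict, forward the youngest pending value for
        # this register instead of the stale register file contents.
        for pending_reg, value in reversed(self.write_queue):
            if pending_reg == reg:
                return value
        return self.regs[reg]


rf = RegisterFileWithWriteQueue(4)
rf.write(2, 7)
forwarded = rf.read(2)   # served from the write queue, not from rf.regs
rf.drain_one()
settled = rf.read(2)     # now served from the register file itself
```

Scanning the queue from youngest to oldest ensures that a read observes the most recent pending write when several writes to the same register are queued.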
Fig. 4 is a simplified block diagram showing details of a processor having a register file unit and functional units coupled to a scheduler, in accordance with another embodiment of the present invention. In some embodiments, buses and bypass multiplexers may be used to deliver register data to the functional units, scheduler, and bypass network. According to the processor embodiments described above, extra CAM ports on the execution unit scheduler may be used to facilitate providing data from the register file unit to the scheduler. In some embodiments, and as shown in Fig. 4, the functional unit result buses and the CAM ports of a selected number of execution units may be overloaded, that is, shared with the register file read result buses.
Fig. 4 shows processor 100 with some of the components shown in Fig. 1. In particular, Fig. 4 shows scheduler 130, register file unit 150, and functional unit 160 of Fig. 1. In the embodiment shown in Fig. 4, register file unit 150 includes four read ports, RP0 to RP3, and functional unit 160 includes two memory functional units (M0 and M1) and two integer functional units (I0 and I1). Of course, in other embodiments the register file unit may contain more or fewer read ports, and there may be more, fewer, and/or different functional units. As shown in Fig. 4, scheduler 130 has a plurality of input ports. As shown, register file unit read port RP0 may be coupled to one input port of scheduler 130 by bus RB0. In addition, register file unit read port RP1 and the output port of functional unit M0 may both be coupled to a second (shared) input port of scheduler 130 by a shared bus (labeled A in Fig. 4). Register file unit read port RP2 may be coupled to a third input port of scheduler 130 by bus RB2. Register file unit read port RP3 and the output port of functional unit M1 may both be coupled to a fourth (shared) input port of scheduler 130 by a shared bus (labeled B in Fig. 4). Finally, the output port of integer functional unit 0 may be coupled to a fifth input port of scheduler 130 by bus C, and the output port of integer functional unit 1 may be coupled to a sixth input port of scheduler 130 by bus D.
Thus, in some embodiments, a bus coupled to the scheduler may be shared by a functional unit and a register, and/or a CAM input port of the scheduler may be shared by a functional unit and a register. In some embodiments, overloading the result buses between the functional units and the register file unit may minimize the impact of the register file unit on the bypass network and the scheduler. As shown in Fig. 4, each of the buses from the register file unit and the functional units may provide an operand, such as operand 153, to scheduler 130, for example as described above with reference to Figs. 1-3.
In some embodiments, the peak load times of functional unit results and register file results may be orthogonal. For example, during steady-state execution, the functional units may be delivering new results, and most instructions may obtain the source operands they need from in-flight instructions (for example, through the bypass network). In this case, required register file reads may be infrequent. Conversely, after a restart, the functional units may go idle and place no data on the result buses, while register reads may be at their peak in order to service the new instructions entering the machine. This orthogonality may be taken into account when sharing result buses and CAM ports.
The following table shows the result buses of all execution units in an example processor similar to the processor shown in Fig. 4. Whereas the processor in Fig. 4 has two memory execution units (M0 and M1) and two integer execution units (I0 and I1), the processor in the table also has a third integer execution unit (I2), a floating-point execution unit (F), and a branch execution unit (Br). In the table, these units are listed as producers in the leftmost column, and the consumers to which the results are sent are listed in the top row. The result buses in the table, from memory port 0 on result bus 'A' to the floating-point port on result bus 'F', are named arbitrarily, similarly to what is shown in Fig. 4. In this embodiment, some decomposition of macro-instructions may be assumed for this configuration. Thus, in this example, results produced by memory unit 0 and by read port 1 may be provided to each of the functional units (and the scheduler) over bus A. Similarly, results produced by memory unit 1 and by read port 3 may be provided to each of the functional units (and the scheduler) over bus B. Note that some functional units may not consume data from any producer (such as the branch unit), or may consume data only from a subset of the producers.
Producer \ Consumer | M0 | M1 | I0 | I1 | I2 | F | Br |
---|---|---|---|---|---|---|---|
M0 | A | A | A | A | A | A | A |
M1 | B | B | B | B | B | B | B |
I0 | C | C | C | C | C | C | |
I1 | D | D | D | D | D | D | |
I2 | E | E | E | E | E | E | |
F | F | F | F | | | | |
Br | | | | | | | |
Read port 0 | RB0 | RB0 | RB0 | RB0 | RB0 | RB0 | RB0 |
Read port 1 | A | A | A | A | A | A | A |
Read port 2 | RB1 | RB1 | RB1 | RB1 | RB1 | RB1 | RB1 |
Read port 3 | B | B | B | B | B | B | B |
The table above lists read port 0 through read port 3. Read ports 0 and 1 may share bank 0, and read ports 2 and 3 may share bank 1. In one embodiment, underlined entries indicate the CAM ports added to support this register file configuration. This shows that, in the processor example described, the register file read ports are supported by adding two full CAM ports, and that the performance impact may be minimized by sharing the remaining two register file read ports with the memory result buses. This is indicated in the table, where read port 1 shares result bus A (memory port 0) and read port 3 shares result bus B (memory port 1). Read ports 1 and 3 may be chosen in this way so that, in the less common case in which two reads are active on the same register bank at the same time, only one execution result bus is needed. In other embodiments, and depending on performance requirements, any, all, or none of the buses may be overloaded.
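The sharing arrangement in the table can be checked with a short script. The sketch below encodes only the producer-to-bus assignments given in the table and the read-port-to-bank (group) pairing given in the text; the dictionary and function names are hypothetical, and the labels RP0 to RP3 stand for read ports 0 to 3.

```python
# Result bus driven by each producer, taken from the table above.
BUS_OF = {
    "M0": "A", "M1": "B", "I0": "C", "I1": "D", "I2": "E", "F": "F",
    "RP0": "RB0", "RP1": "A", "RP2": "RB1", "RP3": "B",
}
# Register bank (group) served by each read port, as stated in the text.
BANK_OF = {"RP0": 0, "RP1": 0, "RP2": 1, "RP3": 1}


def shared_buses(bus_of):
    """Return {bus: sorted producers} for every bus with more than one producer."""
    producers = {}
    for producer, bus in bus_of.items():
        producers.setdefault(bus, []).append(producer)
    return {bus: sorted(p) for bus, p in producers.items() if len(p) > 1}


def borrowed_buses_per_bank(bus_of, bank_of):
    """Count, per bank, how many execution-unit result buses its read ports borrow."""
    shared = shared_buses(bus_of)
    counts = {}
    for port, bank in bank_of.items():
        counts[bank] = counts.get(bank, 0) + (bus_of[port] in shared)
    return counts


overloaded = shared_buses(BUS_OF)
per_bank = borrowed_buses_per_bank(BUS_OF, BANK_OF)
```

With these assignments, only buses A and B have two producers, and each bank borrows exactly one execution result bus, which matches the stated reason for choosing read ports 1 and 3 as the overloaded ports.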
In the example shown in the table above, it is the memory ports rather than the integer ports that are overloaded, because integer instructions may be more common and because the memory ports have a longer latency. In some embodiments, an execution unit that shares its result bus with the register file signals the register file when a valid instruction is about to produce a result. In such embodiments, the execution unit should have enough latency that this notification can be sent early enough to stop a register read from being issued to the register file lookup and so prevent a conflict with the execution result. If a register read is delayed, the operand may wait in the register file unit's read queue to be issued until, for example, the next clock cycle. If the latency is too short, the read request may already have been taken from the read queue and inserted into the register file lookup pipeline, which may lead to a result bus conflict. Because the processor embodiments discussed above can tolerate varying latencies, the issue of register file reads can be delayed without stalling instructions that are being dispatched into the scheduler or issued to the functional units.
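The early-notification scheme described above can be sketched as a cycle-by-cycle simulation of one shared result bus. This is an illustrative model only: the function name `arbitrate_shared_bus`, the `read_latency` parameter, and the in-order read queue policy are assumptions made for the example, not details taken from the patent.

```python
from collections import deque


def arbitrate_shared_bus(exec_result_cycles, read_requests, read_latency=1):
    """Simulate one result bus shared by an execution unit and a read port.

    exec_result_cycles: cycles in which the execution unit will drive the bus
        (assumed to be announced early enough for the register file to react).
    read_requests: list of (arrival_cycle, register) read requests.
    read_latency: cycles between issuing a read and its data reaching the bus.
    Returns a dict mapping bus cycle -> who drives the bus in that cycle.
    """
    busy = set(exec_result_cycles)
    pending = deque(sorted(read_requests))   # read queue, oldest request first
    bus = {c: "exec" for c in busy}
    cycle = 0
    while pending:
        arrival, reg = pending[0]
        data_cycle = cycle + read_latency
        if arrival <= cycle and data_cycle not in busy:
            # Safe to issue: the read data will land on a free bus cycle.
            pending.popleft()
            bus[data_cycle] = f"read {reg}"
            busy.add(data_cycle)
        # Otherwise the request simply waits in the read queue.
        cycle += 1
    return bus


schedule = arbitrate_shared_bus([1, 2], [(0, "r5"), (0, "r6")])
```

In this run the execution unit owns the bus in cycles 1 and 2, so both reads are held in the read queue and their data reaches the bus in cycles 3 and 4. Nothing else in the model stalls, which mirrors the point that delayed register reads need not block dispatch or issue.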
In some embodiments, if the floating-point port does not reach enough ports (in contrast to the read ports, which may reach all execution ports), and if there may be floating-point benchmark performance concerns, the floating-point port may be ignored when selecting the ports to be overloaded, as in the example shown in the table above. In some embodiments, the memory ports may be extended to reach the branch port in order to support sharing with the register read ports.
Several embodiments have been discussed in detail above. Of course, the scope of the claims covers embodiments other than those described above, and their equivalents. For example, the instruction formats and register names discussed are only illustrative, and any instruction format and/or registers may be used in other cases. Similarly, as another example, a processor might not use a read queue.
Claims (40)
1. A processor, comprising:
a register file unit comprising a plurality of registers; and
a scheduler to schedule instructions, wherein the scheduler is to receive an instruction to be scheduled, wherein the scheduler is to receive from the register file unit a source operand for the received instruction to be scheduled, and wherein the scheduler is to receive the instruction and the source operand asynchronously.
2. The processor of claim 1, further comprising a unit to indicate whether a register in the register file unit is to be used by an in-flight instruction.
3. The processor of claim 1, wherein the unit further comprises one of a memory or a plurality of comparators.
4. The processor of claim 2, wherein the processor further comprises a dispatcher to dispatch instructions to the scheduler, wherein the dispatcher is to determine, based on the unit, whether to generate a request to read data from a register specified as the location of a source operand of an instruction to be dispatched and, if such a request is generated, to dispatch the instruction before the read request completes.
5. The processor of claim 1, further comprising a read queue to buffer read requests that are to read data from registers of the register file unit.
6. The processor of claim 5, wherein the read queue comprises a plurality of banks of memory cells.
7. The processor of claim 6, wherein the register file unit comprises a plurality of static random access memory cells.
8. The processor of claim 1, wherein the plurality of registers in the register file unit are arranged in a plurality of banks.
9. The processor of claim 8, wherein the processor further comprises a write queue coupled to the register file unit to queue writes to the register file unit, and wherein the write queue comprises a plurality of banks.
10. The processor of claim 1, wherein the register file unit comprises four ports.
11. The processor of claim 1, wherein the processor further comprises a functional unit and a bus, wherein the bus is coupled to an input port of the scheduler, an output port of the functional unit, and an output port of a first register of the plurality of registers, and wherein the bus is shared by the functional unit and the first register.
12. The processor of claim 11, wherein the input port of the scheduler is shared by the functional unit and the first register.
13. The processor of claim 1, wherein the processor further comprises a functional unit and a bypass network having an input port coupled to the register file unit and an output port coupled to the functional unit, wherein an instruction has an in-flight status, and wherein the source operand of that instruction is available in the bypass network.
14. A processor, comprising:
a scheduler to schedule instructions;
a register file unit comprising a plurality of registers; and
a dispatcher to dispatch to the scheduler a new instruction that specifies a source operand to be read from one of the plurality of registers, wherein, if the register to be read by the new instruction is also to be used by a prior in-flight instruction, the dispatcher is to dispatch the new instruction without generating a request to read the source operand from the register for the new instruction.
15. The processor of claim 14, wherein, if the register to be read by the new instruction is also to be used by a prior in-flight instruction, the scheduler is to receive the source operand for the new instruction as a result of the prior instruction.
16. The processor of claim 15, wherein the source operand of the new instruction is provided to the scheduler from a bypass network.
17. The processor of claim 14, wherein the processor further comprises a memory to store an array, and wherein an entry in the array indicates, for a register in the register file unit, whether an in-flight instruction is to use that register.
18. The processor of claim 17, further comprising a read queue to buffer read requests that are to read data from registers of the register file unit.
19. The processor of claim 18, wherein the read queue comprises a plurality of banks of memory cells.
20. The processor of claim 14, wherein the plurality of registers in the register file unit are arranged in a plurality of banks.
21. The processor of claim 14, wherein the processor further comprises a write queue coupled to the register file unit to queue writes to the register file unit, and wherein the write queue comprises a plurality of banks.
22. The processor of claim 14, wherein the processor further comprises a functional unit having an output port coupled to an input port of the scheduler, and wherein an output port of a first register in the register file unit is coupled to the same input port of the scheduler as the functional unit.
23. The processor of claim 22, wherein the processor further comprises a shared bus to couple the output port of the functional unit to the input port of the scheduler, and wherein the shared bus also couples the output port of the first register to the input port of the scheduler.
24. A system, comprising:
a register file unit comprising a plurality of registers;
a scheduler to schedule instructions;
a dispatcher to dispatch to the scheduler an instruction having a source operand to be read from one of the plurality of registers; and
a unit to indicate to the dispatcher whether the source operand can be obtained from a prior instruction.
25. The system of claim 24, wherein the register file unit is coupled to the scheduler to provide to the scheduler source operands for instructions waiting in the scheduler.
26. The system of claim 24, wherein the processor further comprises a functional unit and a bus, wherein the bus is coupled to an input port of the scheduler, an output port of the functional unit, and an output port of a first register of the register file unit, and wherein the bus is shared by the functional unit and the first register.
27. The system of claim 26, wherein the input port of the scheduler is shared by the functional unit and the first register.
28. The system of claim 24, further comprising a read queue to buffer read requests that are to read data from registers of the register file unit.
29. The system of claim 28, wherein the read queue comprises a plurality of banks of SRAM cells.
30. A method, comprising:
determining that a new instruction has a source operand to be read from a register;
determining whether an in-flight instruction is to use the same register that the new instruction is to read; and
if an in-flight instruction is to use the same register that the new instruction is to read, dispatching the new instruction into a scheduler without sending to the register a request to read the source operand.
31. The method of claim 30, wherein the method further comprises using a result read from the register by a prior instruction to provide the source operand to the scheduler for use by the new instruction.
32. The method of claim 30, wherein an in-flight instruction is considered to use the same register that the new instruction is to read if the in-flight instruction reads from or writes to that same register.
33. The method of claim 30, wherein determining whether an in-flight instruction is to use the same register that the new instruction is to read comprises checking an array in a memory to determine whether any in-flight instruction is to read from or write to the same register as the new instruction.
34. The method of claim 30, wherein the method further comprises:
determining that no in-flight instruction is to use the same register that the new instruction is to read;
generating a request to read that register; and
dispatching the new instruction into the scheduler before receiving the result of the generated request to read that register.
35. The method of claim 34, wherein generating the request to read the register comprises sending, to a read queue for the register, a request to read the source operand from the register.
36. The method of claim 30, wherein the method further comprises providing the source operand for the new instruction to an input port of the scheduler either from an output port of the register or from an output port of a functional unit, and wherein the register and the functional unit share the scheduler input port.
37. A method, comprising:
determining that a new instruction has a source operand to be read from a register;
dispatching the new instruction to a scheduler; and
providing the source operand to the scheduler asynchronously from the dispatching of the new instruction to the scheduler.
38. The method of claim 37, wherein the method further comprises scheduling the new instruction for execution, and wherein the scheduler is to wait for the source operand to be provided to the scheduler before scheduling the new instruction for execution.
39. The method of claim 37, wherein the method further comprises using a result read from the register by a prior instruction to provide the source operand to the scheduler for use by the new instruction.
40. The method of claim 37, wherein the method further comprises providing the source operand for the new instruction to an input port of the scheduler either from an output port of the register or from an output port of a functional unit, and wherein the register and the functional unit share the scheduler input port.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/953,760 | 2004-09-30 | ||
US10/953,760 US7395415B2 (en) | 2004-09-30 | 2004-09-30 | Method and apparatus to provide a source operand for an instruction in a processor |
PCT/US2005/035406 WO2006039613A1 (en) | 2004-09-30 | 2005-09-30 | Method and apparatus to provide a source operand for an instruction in a processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101036119A (en) | 2007-09-12 |
CN101036119B (en) | 2011-10-05 |
Family
ID=35677609
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2005800334355A Expired - Fee Related CN101036119B (en) | 2004-09-30 | 2005-09-30 | Method and apparatus to provide a source operand for an instruction in a processor |
Country Status (6)
Country | Link |
---|---|
US (1) | US7395415B2 (en) |
JP (1) | JP4699468B2 (en) |
CN (1) | CN101036119B (en) |
DE (1) | DE112005002432B4 (en) |
TW (1) | TWI334099B (en) |
WO (1) | WO2006039613A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019136983A1 (en) * | 2018-01-12 | 2019-07-18 | 江苏华存电子科技有限公司 | Low-delay instruction scheduler |
CN115098167A (en) * | 2022-07-05 | 2022-09-23 | 飞腾信息技术有限公司 | Instruction execution method and device |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7650483B2 (en) * | 2006-11-03 | 2010-01-19 | Arm Limited | Execution of instructions within a data processing apparatus having a plurality of processing units |
JP5111020B2 (en) * | 2007-08-29 | 2012-12-26 | キヤノン株式会社 | Image processing apparatus and control method thereof |
US8130501B2 (en) | 2009-06-30 | 2012-03-06 | Teco-Westinghouse Motor Company | Pluggable power cell for an inverter |
WO2013077845A1 (en) * | 2011-11-21 | 2013-05-30 | Intel Corporation | Reducing power consumption in a fused multiply-add (fma) unit of a processor |
US9330432B2 (en) * | 2013-08-19 | 2016-05-03 | Apple Inc. | Queuing system for register file access |
US9632783B2 (en) * | 2014-10-03 | 2017-04-25 | Qualcomm Incorporated | Operand conflict resolution for reduced port general purpose register |
US11614942B2 (en) * | 2020-10-20 | 2023-03-28 | Micron Technology, Inc. | Reuse in-flight register data in a processor |
CN112463217B (en) * | 2020-11-18 | 2022-07-12 | 海光信息技术股份有限公司 | Systems, methods and media for register file shared read ports in superscalar processors |
CN113703841B (en) * | 2021-09-10 | 2023-09-26 | 中国人民解放军国防科技大学 | An optimized method, device and medium for register data reading |
CN115640047B (en) * | 2022-09-08 | 2024-01-19 | 海光信息技术股份有限公司 | Instruction operation method and device, electronic device and storage medium |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0651321B1 (en) * | 1993-10-29 | 2001-11-14 | Advanced Micro Devices, Inc. | Superscalar microprocessors |
US5761476A (en) * | 1993-12-30 | 1998-06-02 | Intel Corporation | Non-clocked early read for back-to-back scheduling of instructions |
US5555432A (en) * | 1994-08-19 | 1996-09-10 | Intel Corporation | Circuit and method for scheduling instructions by predicting future availability of resources required for execution |
US6826704B1 (en) * | 2001-03-08 | 2004-11-30 | Advanced Micro Devices, Inc. | Microprocessor employing a performance throttling mechanism for power management |
JP3576148B2 (en) * | 2002-04-19 | 2004-10-13 | 株式会社半導体理工学研究センター | Parallel processor |
2004
- 2004-09-30 US US10/953,760 patent/US7395415B2/en not_active Expired - Fee Related
2005
- 2005-09-29 TW TW094134004A patent/TWI334099B/en not_active IP Right Cessation
- 2005-09-30 DE DE112005002432T patent/DE112005002432B4/en not_active Expired - Fee Related
- 2005-09-30 JP JP2007534840A patent/JP4699468B2/en not_active Expired - Fee Related
- 2005-09-30 CN CN2005800334355A patent/CN101036119B/en not_active Expired - Fee Related
- 2005-09-30 WO PCT/US2005/035406 patent/WO2006039613A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
DE112005002432T5 (en) | 2007-08-16 |
DE112005002432B4 (en) | 2009-05-14 |
TWI334099B (en) | 2010-12-01 |
CN101036119B (en) | 2011-10-05 |
JP4699468B2 (en) | 2011-06-08 |
TW200622877A (en) | 2006-07-01 |
US20060095728A1 (en) | 2006-05-04 |
WO2006039613A1 (en) | 2006-04-13 |
US7395415B2 (en) | 2008-07-01 |
JP2008515117A (en) | 2008-05-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8516280B2 (en) | Parallel processing computer systems with reduced power consumption and methods for providing the same | |
US8560795B2 (en) | Memory arrangement for multi-processor systems including a memory queue | |
CN100557570C (en) | Multicomputer system | |
US5574939A (en) | Multiprocessor coupling system with integrated compile and run time scheduling for parallelism | |
KR100990902B1 (en) | Memory Arrays for Multi-Processor Systems | |
US5185868A (en) | Apparatus having hierarchically arranged decoders concurrently decoding instructions and shifting instructions not ready for execution to vacant decoders higher in the hierarchy | |
JP4553936B2 (en) | Techniques for setting command order in an out-of-order DMA command queue | |
EP1023659B1 (en) | Efficient processing of clustered branch instructions | |
CN101036119B (en) | Method and apparatus to provide a source operand for an instruction in a processor | |
CN1092188A (en) | Method and system for enhancing instruction scheduling in superscalar processor system using independent access intermediate memory | |
US9886278B2 (en) | Computing architecture and method for processing data | |
Baugh et al. | Decomposing the load-store queue by function for power reduction and scalability | |
US20080320240A1 (en) | Method and arrangements for memory access | |
EP1039377B1 (en) | System and method supporting multiple outstanding requests to multiple targets of a memory hierarchy | |
US6324640B1 (en) | System and method for dispatching groups of instructions using pipelined register renaming | |
EP2221718B1 (en) | Distributed dispatch with concurrent, out-of-order dispatch | |
US20020004895A1 (en) | Method and apparatus for efficiently routing dependent instructions to clustered execution units | |
CN115328850A (en) | A hardware accelerator for hypergraph processing and its operation method | |
JPS63191253A (en) | Preference assigner for cache memory | |
US7370158B2 (en) | SIMD process with multi-port memory unit comprising single-port memories | |
US20040064679A1 (en) | Hierarchical scheduling windows | |
US6351803B2 (en) | Mechanism for power efficient processing in a pipeline processor | |
US20040034858A1 (en) | Programming a multi-threaded processor | |
US20040064678A1 (en) | Hierarchical scheduling windows | |
Saito et al. | Design of superscalar processor with multi-bank register file |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20111005 Termination date: 20180930 |