CN102779026B - Multi-emission method of instructions in high-performance DSP (digital signal processor) - Google Patents
- Publication number
- CN102779026B CN102779026B CN201210222667.7A CN201210222667A CN102779026B CN 102779026 B CN102779026 B CN 102779026B CN 201210222667 A CN201210222667 A CN 201210222667A CN 102779026 B CN102779026 B CN 102779026B
- Authority
- CN
- China
- Prior art keywords
- instruction
- buffer
- fetch
- issue
- issue register
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Advance Control (AREA)
Abstract
The invention discloses a multi-issue method for instructions in a high-performance DSP (digital signal processor). The multi-issue mechanism comprises an instruction Cache organization that supports fetching several instructions at once; instruction alignment and prefetching, in which the required instruction is located quickly according to the fetch address sent by the CPU (central processing unit); and an instruction multi-issue unit that dispatches instructions to the corresponding pipelines according to their types. The method improves the efficiency of instruction fetch and issue, thereby improving the overall performance of the DSP.
Description
Technical field
The present invention relates to the design of DSP processor cores, and in particular to an instruction multi-issue method in a high-performance DSP processor.
Background art
A digital signal processor (DSP) is a microprocessor designed specifically for digital signal processing; it can execute various digital-signal algorithms in real time. Because a DSP responds quickly and computes at high speed, it is widely used in fields such as consumer electronics, communications, aerospace, and national defense. The rapid development of DSP microprocessors has had a profound influence on both national security and daily life.
The research and development of high-performance DSP processors is an arena in which nations compete in scientific and technological strength, and China now attaches great importance to microprocessor development. Many universities and research institutes have launched research and design efforts on microprocessors and have achieved notable results, with processors such as "Godson" (Loongson), "Ark", and "YinHe FeiTeng" released one after another. However, because work in this field started late in China, domestically developed processors still cannot compete with foreign chips, whether in performance or in practical applications, and some core technologies remain controlled by others. Developing high-performance processors with independent intellectual property rights is therefore significant for China's economic development and national security.
Advanced semiconductor processes make it possible to integrate ever more transistors on silicon of the same area, providing a strong guarantee for realizing more complex processor chips. Meanwhile, advanced design techniques have further driven the development of high-performance processors. For example, instruction multi-issue technology lets a microprocessor issue several instructions in the same clock cycle, raising the instruction-level parallelism of the processor and reducing the time each pipeline stage needs on each path. Instruction multi-issue has contributed greatly to processor performance.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing an instruction multi-issue method for a high-performance DSP processor that improves the performance of the whole DSP.
According to the technical scheme provided by the invention, an instruction multi-issue method in a high-performance DSP processor comprises the following steps:
A. The CPU fetch unit sends a fetch address and takes the required instructions from the instruction Cache; on an instruction Cache miss, the instructions are fetched from the external memory unit and the corresponding cache line is updated.
B. The fetched instructions are aligned and placed into the prefetch buffer. The prefetch buffer aligns instructions according to the offset bits of the CPU fetch address: the instructions of the fetch unit's first fetch are aligned into the low four slots of the prefetch buffer, and the instructions of its second fetch into the high four slots. Finally, the prefetch buffer moves the aligned instructions into the fetch buffer, where they wait to be issued into the corresponding pipelines.
C. The fetch buffer is a stack structure. After the instructions in the low four slots of the fetch buffer have been issued through the issue registers, the instructions in the high four slots are pushed down into the low four slots, ready for the next issue, and the vacated high slots receive instructions from the prefetch buffer to wait for issue.
D. Instruction issue is performed by two issue registers, the 0th issue register and the 1st issue register, which realize the multi-issue of instructions. The 0th issue register always takes the instruction in the low slots of the fetch buffer; the 1st issue register takes the following instruction; and according to the type and size of the instructions taken, each instruction is dispatched into the appropriate pipeline.
The instruction Cache is organized as a two-way set-associative structure with a 256-bit line size and supports fetching several instructions at once. For non-cacheable addresses, a 256-bit line buffer is added to serve as a single cache line.
After instructions are taken from the instruction Cache into the prefetch buffer, they are aligned according to the offset bits of the CPU fetch address, and the aligned instructions are placed into the fetch buffer to wait for issue into the corresponding pipelines. The fetch buffer is a stack structure: each fetch takes the instructions in its low slots; after those instructions are sent out, the instructions in the high slots are pushed down to the low slots, and the vacated high slots receive instructions from the instruction Cache.
The 0th issue register and the 1st issue register cooperate to fetch instructions from the fetch buffer and, according to instruction size and type, dispatch them into different pipelines. The 0th issue register dispatches its instruction into the matching pipeline according to its type. The 1st issue register selects the next instruction from the fetch buffer according to the size of the instruction taken by the 0th issue register and determines its type: if the type is the same as that of the 0th issue register's instruction, the instruction is handed to the 0th issue register to wait for issue, avoiding a pipeline stall; if the type differs, the instruction is sent directly into the other pipeline. As soon as an issue register sends out its instruction, it is immediately refilled with an instruction from the fetch buffer.
The advantages of the invention are: 1. The external-memory address space is divided into a cacheable part and a non-cacheable part; for the non-cacheable part, a Line Buffer is added as a single-line Cache, greatly improving fetch efficiency for that address range. 2. Two issue registers are specially designed to judge instruction type and size. In particular, the 1st issue register determines, from the type and size of the instruction taken by the 0th issue register, which instruction to take next and, from the type of the instruction it takes, which pipeline to send it to. Issuing instructions through two issue registers greatly improves issue efficiency and thus the performance of the DSP processor.
Brief description of the drawings
Fig. 1 is the organization diagram of the instruction Cache.
Fig. 2 is the structural diagram of the line buffer (Line Buffer).
Fig. 3 is a schematic diagram of the prefetch buffer contents being deposited into the fetch buffer after the alignment operation.
Fig. 4 is a schematic diagram of instruction multi-issue.
Embodiment
The invention is described further below in conjunction with the drawings and embodiments.
The instruction multi-issue method in a high-performance DSP processor of the present invention proceeds as follows:
A. The CPU fetch unit sends a fetch address and takes the required instructions from the instruction Cache; on an instruction Cache miss, the instructions are fetched from the external memory unit and the corresponding cache line is updated. The CPU fetch unit fetches 64 bits of instructions each time.
B. The fetched instructions are aligned and placed into the prefetch buffer. The prefetch buffer is 128 bits in total and consists of eight 16-bit prefetch buffers: prefetch buffer 0, prefetch buffer 1, ..., prefetch buffer 7. The prefetch buffer aligns instructions according to the offset bits of the CPU fetch address: the instructions of the fetch unit's first fetch are aligned into the low four slots (prefetch buffer 0 through prefetch buffer 3), and the instructions of its second fetch into the high four slots (prefetch buffer 4 through prefetch buffer 7). Finally, the prefetch buffer delivers the instructions into the fetch buffer, where they wait to be issued into the corresponding pipelines.
C. The fetch buffer is a 128-bit stack structure consisting of eight 16-bit fetch buffers: fetch buffer 0, fetch buffer 1, ..., fetch buffer 7. After the instructions in the low four slots (fetch buffer 0 through fetch buffer 3) have first been issued through the issue registers, the instructions in the high four slots (fetch buffer 4 through fetch buffer 7) are pushed down into fetch buffers 0-3, ready for the next issue, and the vacated space receives instructions from the prefetch buffer to wait for issue.
D. Instruction issue is performed by two issue registers, the 0th issue register and the 1st issue register, which realize the multi-issue of instructions. The 0th issue register always takes the instruction in the low slots of the fetch buffer; the 1st issue register takes the following instruction; and according to the type and size of the instructions taken, each instruction is dispatched into the appropriate pipeline.
The instruction Cache of the DSP processor is organized, through its line size and set associativity, to support fetching several instructions at once. It adopts a two-way set-associative structure with a 256-bit line size. For non-cacheable addresses, a 256-bit line buffer (Line Buffer) is added specially to serve as a single cache line (Cache Line). The present invention adds a 256-bit Line Buffer between the instruction Cache and the CPU fetch unit. For cacheable addresses, the Line Buffer acts simply as an instruction-stream buffer; for non-cacheable addresses, it acts as a single-line Cache. Most application programs contain a considerable number of loops and subroutine segments, so the CPU frequently accesses the same small address range and rarely accesses addresses outside it. Adding the Line Buffer for the non-cacheable address range greatly improves fetch efficiency for those addresses and effectively reduces the number of transfers between the CPU and the off-chip memory unit.
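The single-line-cache behavior of the Line Buffer described above can be sketched as follows. This is an illustrative model, not the patented hardware: the class name, the byte-addressed interface, and the `fetch_line` callback (standing in for the off-chip fetch) are assumptions; only the 256-bit (32-byte) line size comes from the text.

```python
LINE_BYTES = 32  # 256-bit line, matching the cache-line size in the text

class LineBuffer:
    """Illustrative sketch of a Line Buffer used as a single-line cache
    for the non-cacheable address region."""

    def __init__(self, fetch_line):
        self.fetch_line = fetch_line  # callback: line address -> 32 bytes
        self.line_addr = None         # address of the currently buffered line
        self.data = None

    def read(self, addr: int) -> int:
        """Return the byte at `addr`, refilling from memory only when the
        requested line differs from the one already buffered."""
        line_addr = addr & ~(LINE_BYTES - 1)
        if line_addr != self.line_addr:      # single-line "miss"
            self.data = self.fetch_line(line_addr)
            self.line_addr = line_addr
        return self.data[addr & (LINE_BYTES - 1)]
```

Repeated fetches that fall inside the buffered line never touch external memory again, which is the efficiency gain the paragraph above claims for loop-heavy code.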
Instruction prefetch and alignment: instructions taken from the instruction Cache are placed into the prefetch buffer and aligned according to the CPU fetch address. In the instruction Cache, each cache-line (Cache Line) bank consists of four 32-bit storage units pmem0, pmem1, pmem2, pmem3. The instructions the fetch unit takes from the instruction Cache each time are, from high to low, those in pmem3, pmem2, pmem1, pmem0. After the instructions are placed into the prefetch buffer, they are further aligned according to the offset bits of the CPU fetch address, and finally the aligned instructions are put into the fetch buffer to wait for issue into the corresponding pipelines.
The DSP processor has four pipelines in total: the IP pipeline, the LS pipeline, the MAC pipeline, and the SIMD pipeline. In the absence of data hazards, four instructions can run in each clock cycle. Instruction issue goes through two issue registers: the 0th issue register issue_stream_inst0 and the 1st issue register issue_stream_inst1. issue_stream_inst0 always takes the instruction in fetch buffer 0 (for a 16-bit instruction) or in fetch buffer 0 and fetch buffer 1 (for a 32-bit instruction). The instruction taken by issue_stream_inst1 is not necessarily of a different type from that taken by issue_stream_inst0; it is determined by conditions such as the type and size of the instruction in issue_stream_inst0, and the instructions taken are distributed to the pipelines accordingly. The fetch buffer is a stack structure: each fetch takes the instructions in its low four slots; after the instructions are sent out, the instructions in the high four slots are pushed down to the low slots, and the vacated high slots receive instructions from the instruction Cache.
As shown in Figure 1, the instruction Cache adopts a two-way set-associative structure. The fetch address is divided into tag bits (tag), index bits (index), and offset bits (byte), which are used for fast instruction location and hit determination. The instruction Cache is divided into a tag part and a data part, and each cache line contains a valid bit indicating whether the data in that line are valid. When the tag bits of the fetch address match the tag bits in the instruction Cache and the valid bit is set, the instruction Cache hits; when the tag bits do not match or the data in the current cache line are invalid, the instruction Cache misses.
When the instruction Cache hits, its output comes from the hitting way. Each cache-line (Cache Line) bank consists of four 32-bit storage units pmem0, pmem1, pmem2, pmem3; the high 16 bits of these four units form way 1 of the instruction Cache, and the low 16 bits form way 0. The location of the 64 bits of instructions in the Cache bank executes in parallel with the two-way tag comparison within the same clock cycle; once the tag comparison completes, the instruction Cache sends out the instructions of the hitting way. For example, if the hit logic's tag comparison indicates a hit in way 1, the 64 bits of instructions taken by the CPU fetch unit are the high 16 bits of pmem3, the high 16 bits of pmem2, the high 16 bits of pmem1, and the high 16 bits of pmem0.
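The tag/index/offset decomposition and two-way hit check described above can be sketched as follows. The field widths are illustrative assumptions only: a 256-bit line is 32 bytes (5 offset bits), but the patent does not state the number of sets, so `INDEX_BITS = 6` is hypothetical, as are the function names.

```python
OFFSET_BITS = 5   # 256-bit line = 32 bytes (from the text)
INDEX_BITS = 6    # assumed: 64 sets; the patent does not specify this

def split_address(addr: int):
    """Split a fetch address into (tag, index, byte offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def cache_hit(ways, addr):
    """Return the hitting way (0 or 1), or None on a miss.
    `ways` maps way number -> {index: (tag, valid)} for the tag array."""
    tag, index, _ = split_address(addr)
    for way in (0, 1):
        entry = ways[way].get(index)
        # Hit only when the tag matches AND the valid bit is set.
        if entry is not None and entry[1] and entry[0] == tag:
            return way
    return None
```

Both ways are probed with the same index, mirroring the parallel tag comparison the paragraph describes.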
The instruction Cache locates the first 16-bit instruction in its bank according to the fetch address and then, using that as the base, takes the next three 16-bit instructions. A special case exists in this design: when the CPU fetch request falls at the end of a line, the offset within the four units of the Cache bank has reached the last column, so the CPU cannot take 64 bits of instructions at once and must perform a cross-line fetch in two steps.
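The end-of-line special case reduces to a simple predicate: a 64-bit fetch crosses a line exactly when its start offset leaves fewer than 8 bytes in the current 256-bit line. The function name is ours; the sizes come from the text.

```python
LINE_BYTES = 32   # 256-bit cache line
FETCH_BYTES = 8   # 64-bit fetch

def crosses_line(addr: int) -> bool:
    """True when a 64-bit fetch starting at `addr` spills into the next
    256-bit line, forcing the two-step cross-line fetch described above."""
    return (addr % LINE_BYTES) + FETCH_BYTES > LINE_BYTES
```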
As shown in Figure 2, in the design of the line buffer (Line Buffer) bank, the size is set to 256 bits, identical to the cache-line size, so it can hold eight 32-bit instructions, and it supports the critical-word-first technique. So that the CPU can quickly locate the required instruction during a fetch, the line buffer is divided into four 64-bit banks, each of which is further divided into four 16-bit rows; each CPU fetch takes one row from each of the four banks. As shown in Figure 2, the four banks are organized according to the encoding of bits 4 down to 1 of the CPU fetch address: bits 3 and 4 determine which bank is selected as the line offset, and bits 1 and 2 determine which row is selected as the intra-line offset.
After the instructions are taken out of the instruction Cache, the next operation is alignment according to the offset bits of the CPU fetch address. As shown in Figure 3, among the four storage units of the Cache bank, the highest 16 bits of instruction are those in pmem3 and the lowest 16 bits those in pmem0. Since there are four storage units in total, two offset bits in the CPU fetch address suffice to realize the alignment. The alignment proceeds as follows. When the offset bits of the fetch address are "00": prefetch buffer3 = pmem3, prefetch buffer2 = pmem2, prefetch buffer1 = pmem1, prefetch buffer0 = pmem0. When they are "01": prefetch buffer3 = pmem0, prefetch buffer2 = pmem3, prefetch buffer1 = pmem2, prefetch buffer0 = pmem1. When they are "10": prefetch buffer3 = pmem1, prefetch buffer2 = pmem0, prefetch buffer1 = pmem3, prefetch buffer0 = pmem2. When they are "11": prefetch buffer3 = pmem2, prefetch buffer2 = pmem1, prefetch buffer1 = pmem0, prefetch buffer0 = pmem3. The equals sign means that the prefetch buffer on the left takes the instruction of the corresponding storage unit on the right. The high four slots of the prefetch buffer (prefetch buffer7, prefetch buffer6, prefetch buffer5, prefetch buffer4) take the next 64 bits of instructions in the same alignment manner as the low four slots described above.
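The four offset cases above reduce to a single rotation: prefetch buffer i receives pmem[(i + offset) mod 4]. A minimal sketch, treating each pmem unit as one 16-bit value after way selection (the function name is ours):

```python
def align_to_prefetch(pmem, offset):
    """Rotate the four fetched units pmem0..pmem3 by the 2-bit
    fetch-address offset so that the requested instruction lands in
    prefetch buffer 0. Returns [pb0, pb1, pb2, pb3]."""
    assert 0 <= offset <= 3
    return [pmem[(i + offset) % 4] for i in range(4)]
```

Checking the "01" case from the text: pb0 = pmem1, pb1 = pmem2, pb2 = pmem3, pb3 = pmem0, exactly as specified.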
After alignment completes, the aligned instructions are placed into the fetch buffer (fetch buffer). During issue, instructions are launched in order from low to high: the instructions in the low four slots (fetch buffer0, fetch buffer1, fetch buffer2, fetch buffer3) are taken first. The fetch buffer is in fact a stack structure: after the instructions are taken out, the instructions in the high four slots (fetch buffer4, fetch buffer5, fetch buffer6, fetch buffer7) are pushed down, ready for the next issue, and the vacated high slots receive instructions from the prefetch buffer to wait for issue.
The 0th issue register issue_stream_inst0 and the 1st issue register issue_stream_inst1 are the two issue registers of the instruction issue stage and are responsible for fetching from the fetch buffer, as shown in Figure 4. issue_stream_inst0 always takes the instruction in fetch buffer0 (for a 16-bit instruction) or in fetch buffer0 and fetch buffer1 (for a 32-bit instruction). The instruction taken by issue_stream_inst1 is not necessarily of the same type as that taken by issue_stream_inst0, but is determined by conditions such as the type and size of the instruction in issue_stream_inst0. There are three cases in total:
1. When the first instruction is a 16-bit integer-arithmetic (IP) instruction, issue_stream_inst1 takes the instruction in fetch buffer1 and fetch buffer2.
2. When the first instruction is a 32-bit IP instruction, issue_stream_inst1 takes the instruction in fetch buffer2 and fetch buffer3.
3. When the first instruction is an LS instruction, issue_stream_inst1 takes the instruction in fetch buffer0 and fetch buffer1.
If the instruction taken by issue_stream_inst1 is a load/store (LS) instruction, it is sent directly to the LS pipeline; if it is a single-instruction-multiple-data (SIMD) instruction, it is sent directly to the SIMD pipeline; and if it is a multiply-accumulate (MAC) instruction, it is sent directly to the MAC pipeline. A more common case is that the previously issued instruction and the currently taken instruction are of the same type, i.e., the two instructions would be dispatched into the same pipeline. Because each instruction type has only one group of execution units, to avoid a pipeline stall issue_stream_inst1 hands the taken instruction to issue_stream_inst0 to wait for issue, and meanwhile issue_stream_inst1 takes the following instruction and issues it into a pipeline.
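The same-type stall-avoidance rule above can be sketched as a small dispatch function. This is an illustrative model under stated assumptions, not the patented circuit: the function name, the return shape, and the reduction of each instruction to its pipeline type are ours; the four pipeline names and the hold-on-type-match rule come from the text.

```python
PIPELINES = ("IP", "LS", "MAC", "SIMD")

def dispatch(inst0_type, inst1_type):
    """One issue cycle for the two issue registers. If the types differ,
    both instructions issue to their pipelines this cycle; if they match
    (only one execution-unit group per type), the second instruction is
    held for the 0th issue register next cycle to avoid a stall."""
    assert inst0_type in PIPELINES and inst1_type in PIPELINES
    if inst1_type != inst0_type:
        # (pipeline, issue-register index) pairs issued this cycle
        return {"issued": [(inst0_type, 0), (inst1_type, 1)], "held": None}
    return {"issued": [(inst0_type, 0)], "held": inst1_type}
```

For example, an IP instruction followed by an LS instruction issues both in one cycle, while two consecutive MAC instructions issue one per cycle.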
Claims (4)
1. An instruction multi-issue method in a high-performance DSP processor, characterized in that:
A. The CPU fetch unit sends a fetch address and takes the required instructions from the instruction Cache; on an instruction Cache miss, the instructions are fetched from the external memory unit and the corresponding cache line is updated.
B. The fetched instructions are placed into the prefetch buffer, which aligns them according to the offset bits of the CPU fetch address: the instructions of the fetch unit's first fetch are aligned into the low four slots of the prefetch buffer, and the instructions of its second fetch into the high four slots; finally, the prefetch buffer moves the aligned instructions into the fetch buffer, where they wait to be issued into the corresponding pipelines.
C. The fetch buffer is a stack structure: after the instructions in the low four slots of the fetch buffer have been issued through the issue registers, the instructions in the high four slots are pushed down into the low four slots, ready for the next issue, and the vacated high slots receive instructions from the prefetch buffer to wait for issue.
D. Instruction issue is performed by two issue registers, the 0th issue register and the 1st issue register, which realize the multi-issue of instructions: the 0th issue register always takes the instruction in the low slots of the fetch buffer, the 1st issue register takes the following instruction, and according to the type and size of the instructions taken, each instruction is dispatched into the appropriate pipeline.
2. The instruction multi-issue method in a high-performance DSP processor as claimed in claim 1, characterized in that the instruction Cache is organized as a two-way set-associative structure with a 256-bit line size and supports fetching several instructions at once; for non-cacheable addresses, a 256-bit line buffer is added specially to serve as a single cache line.
3. The instruction multi-issue method in a high-performance DSP processor as claimed in claim 1, characterized in that, after instructions are taken from the instruction Cache into the prefetch buffer, they are aligned according to the offset bits of the CPU fetch address, and the aligned instructions are placed into the fetch buffer to wait for issue into the corresponding pipelines; the fetch buffer is a stack structure: each fetch takes the instructions in its low slots, and after the instructions in the low slots are sent out, the instructions in the high slots are pushed down to the low slots, and the vacated high slots receive instructions from the instruction Cache.
4. The instruction multi-issue method in a high-performance DSP processor as claimed in claim 1, characterized in that the 0th issue register and the 1st issue register cooperate, fetch from the fetch buffer, and dispatch instructions into different pipelines according to their size and type; the 0th issue register dispatches its instruction into the corresponding pipeline according to its type; the 1st issue register takes the next instruction in the fetch buffer according to the size of the instruction taken by the 0th issue register and determines its type: if the type is the same as that of the 0th issue register's instruction, the instruction is handed to the 0th issue register to wait for issue, avoiding a pipeline stall; if the type differs, the instruction is sent directly into the other pipeline; as soon as an issue register sends out its instruction, it is immediately refilled with an instruction from the fetch buffer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210222667.7A CN102779026B (en) | 2012-06-29 | 2012-06-29 | Multi-emission method of instructions in high-performance DSP (digital signal processor) |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102779026A CN102779026A (en) | 2012-11-14 |
CN102779026B true CN102779026B (en) | 2014-08-27 |
Family
ID=47123948
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210222667.7A Active CN102779026B (en) | 2012-06-29 | 2012-06-29 | Multi-emission method of instructions in high-performance DSP (digital signal processor) |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102779026B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105988774A (en) * | 2015-02-20 | 2016-10-05 | 上海芯豪微电子有限公司 | Multi-issue processor system and method |
CN105242904B (en) * | 2015-09-21 | 2018-05-18 | 中国科学院自动化研究所 | For processor instruction buffering and the device and its operating method of circular buffering |
CN105094752B (en) * | 2015-09-21 | 2018-09-11 | 中国科学院自动化研究所 | Instruction buffer be aligned buffer unit and its operating method |
CN111694767B (en) * | 2019-05-16 | 2021-03-19 | 时擎智能科技(上海)有限公司 | Accumulation buffer memory device |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6157988A (en) * | 1997-08-01 | 2000-12-05 | Micron Technology, Inc. | Method and apparatus for high performance branching in pipelined microsystems |
Non-Patent Citations (1)
Title |
---|
Wang Xiaoyong, Zhang Shengbing, Huang Songren. Data-dependency control for a multi-issue DSP. Microcomputer Applications, 2011, Vol. 27, No. 11. *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8904153B2 (en) | Vector loads with multiple vector elements from a same cache line in a scattered load operation | |
US9104532B2 (en) | Sequential location accesses in an active memory device | |
TWI742132B (en) | Processors, methods, systems, and instructions to load multiple data elements to destination storage locations other than packed data registers | |
US10275247B2 (en) | Apparatuses and methods to accelerate vector multiplication of vector elements having matching indices | |
KR102733973B1 (en) | Spatial and temporal merging of remote atomic operations | |
US11989555B2 (en) | Instructions for remote atomic operations | |
CN102750133B (en) | 32-Bit triple-emission digital signal processor supporting SIMD | |
US20120060016A1 (en) | Vector Loads from Scattered Memory Locations | |
WO2019152069A1 (en) | Instruction architecture for a vector computational unit | |
CN101593096B (en) | Method for implementing elimination of dependencies in shared register | |
JP7097361B2 (en) | Operation cache | |
CN103150146A (en) | ASIP (application-specific instruction-set processor) based on extensible processor architecture and realizing method thereof | |
CN105373367B (en) | The vectorial SIMD operating structures for supporting mark vector to cooperate | |
EP4020189A1 (en) | Methods, systems, and apparatuses for a scalable reservation station implementing a single unified speculation state propagation and execution wakeup matrix circuit in a processor | |
CN102662634A (en) | Memory access and execution device for non-blocking transmission and execution | |
CN107851017B (en) | Apparatus and method for transmitting multiple data structures | |
CN102779026B (en) | Multi-emission method of instructions in high-performance DSP (digital signal processor) | |
CN112579175B (en) | Branch prediction method, branch prediction device and processor core | |
CN111538679A (en) | Processor data prefetching design based on embedded DMA | |
JP7626527B2 (en) | System and method for ISA support for indirect loads and stores for efficient access to compressed lists in graph applications - Patents.com | |
US11915000B2 (en) | Apparatuses, methods, and systems to precisely monitor memory store accesses | |
GB2515148A (en) | Converting conditional short forward branches to computationally equivalent predicated instructions | |
US20080091924A1 (en) | Vector processor and system for vector processing | |
CN106445472B (en) | A kind of character manipulation accelerated method, device, chip, processor | |
US10275392B2 (en) | Data processing device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |