CN101021778A - Computing group structure fusing very long instruction word and single instruction stream, multiple data streams - Google Patents
- Publication number: CN101021778A (application CN200710034567A)
- Authority
- CN
- China
- Prior art keywords
- data
- instruction
- simd
- calculation
- group
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Advance Control (AREA)
Abstract
The invention discloses a computing group structure that fuses very long instruction word (VLIW) execution with single-instruction-stream, multiple-data-stream (SIMD) execution. It comprises a data buffer and a micro-controller, both connected to a main controller, and a computing group connecting the two. The main controller is responsible for moving instructions and data: instructions are loaded into the micro-controller, data is loaded into the data buffer, and the buffer's output is received back. The data buffer receives data from the main controller, supplies operands to the computing group, collects its results, and finally returns them to the main controller. The micro-controller receives the VLIW sequence from the main controller, decodes it, and broadcasts it to every processing element of the computing group for parallel execution. The computing group consists of multiple identical processing elements, each containing several kinds of processing units; all elements execute the same instruction sequence, but each takes its data from a different location in the data buffer. The invention supports the simultaneous exploitation of data-level and instruction-level parallelism, combines VLIW, SIMD and software pipelining techniques, and improves the performance of compute-intensive applications in several ways.
Description
Technical field
The invention relates to instruction-processing techniques in microprocessor design, in particular to processors aimed at compute-intensive workloads, and specifically to a computing group structure that fuses very long instruction word (VLIW) and single-instruction-stream, multiple-data-stream (SIMD) execution.
Background
Instruction-level parallelism (ILP) is the main avenue for exploiting parallelism in microprocessor design: independent instructions can execute in parallel, raising execution efficiency, as in superscalar and very long instruction word (VLIW) techniques. A VLIW machine provides multiple functional units that can execute instructions in parallel. The compiler organizes the program into a sequence of very long instruction words, each containing several original instructions that can execute concurrently; at run time these map directly onto the functional units, exploiting ILP without complex hardware dependence-detection and issue logic. Compute-intensive applications such as multimedia and scientific computing also contain abundant data-level parallelism (DLP): data of the same type or structure often undergo the same operation or sequence of operations. Special instruction-processing techniques can exploit this data-level parallelism effectively and thereby raise processor performance.
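As an illustration only (not part of the patent), a very long instruction word can be modeled as a bundle of independent operation slots that a machine with several functional units retires in one cycle. The slot layout, opcode names, and `execute_bundle` function below are invented for this sketch.

```c
/* Hypothetical 3-slot VLIW bundle. The compiler guarantees the slots are
 * independent, so hardware can issue all of them in the same cycle without
 * any dependence checks. */
typedef enum { OP_NOP, OP_ADD, OP_MUL } opcode_t;

typedef struct {
    opcode_t op;
    int src1, src2, dst;          /* register-file indices */
} slot_t;

typedef struct { slot_t slot[3]; } vliw_bundle_t;

/* Issue every slot of the bundle "in parallel": all operand reads happen
 * before any result write, mimicking simultaneous execution on separate
 * functional units. */
static void execute_bundle(const vliw_bundle_t *b, int regs[8]) {
    int results[3];
    for (int i = 0; i < 3; i++) {
        const slot_t *s = &b->slot[i];
        switch (s->op) {
        case OP_ADD: results[i] = regs[s->src1] + regs[s->src2]; break;
        case OP_MUL: results[i] = regs[s->src1] * regs[s->src2]; break;
        default:     results[i] = 0; break;
        }
    }
    for (int i = 0; i < 3; i++)
        if (b->slot[i].op != OP_NOP)
            regs[b->slot[i].dst] = results[i];
}
```

A bundle `{ADD r0,r1->r4; MUL r2,r3->r5; NOP}` then updates two registers in one notional cycle, which is the effect the paragraph above describes.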
First, exploiting data-level parallelism can be converted into exploiting instruction-level parallelism. Software pipelining is an effective compiler scheduling method for ILP: through loop unrolling, the compiler reschedules independent instructions drawn from different loop iterations into a new loop body, increasing the number of instructions the processor can execute in parallel and resolving data dependences. In programs with abundant data-level parallelism, the dependences between instructions are mainly between memory accesses and computation, and software pipelining resolves this kind of dependence fairly easily. Software pipelining can be combined with superscalar or VLIW techniques to exploit ILP.
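The transformation can be sketched in plain C (an illustrative example, not from the patent): a loop whose body is load, compute, store is rewritten with a prologue, a steady-state kernel mixing stages of three different iterations, and an epilogue, so the kernel's operations are mutually independent.

```c
#include <stddef.h>

/* Original loop per iteration i: t = a[i]; t = t*t; b[i] = t;
 * The load of iteration i+1 does not depend on the store of iteration i,
 * so a software-pipelined schedule overlaps them: the kernel stores
 * iteration i-2 while it squares iteration i-1 and loads iteration i. */
static void square_pipelined(const int *a, int *b, size_t n) {
    if (n == 0) return;
    if (n == 1) { b[0] = a[0] * a[0]; return; }
    int loaded = a[0];               /* prologue: load stage, iteration 0 */
    int computed = loaded * loaded;  /* prologue: compute stage, iteration 0 */
    loaded = a[1];                   /* prologue: load stage, iteration 1 */
    for (size_t i = 2; i < n; i++) {
        /* kernel: three independent stages from three different iterations */
        b[i - 2] = computed;         /* store,   iteration i-2 */
        computed = loaded * loaded;  /* compute, iteration i-1 */
        loaded = a[i];               /* load,    iteration i   */
    }
    b[n - 2] = computed;             /* epilogue drains the pipeline */
    b[n - 1] = loaded * loaded;
}
```

On a VLIW target, the three kernel statements could occupy one instruction word, which is exactly how the technique converts data-level parallelism into instruction-level parallelism.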
Vector processing is an effective way to raise a processor's throughput on batch data. Using the idea of overlapping in time, it converts data parallelism into instruction parallelism: vectorizing loops that apply the same operation not only shrinks the code, it also hides the control dependences between loop iterations inside the vector instructions, improving hardware efficiency. Vector chaining further reduces the storage needed for intermediate results on top of vector processing, relieving the allocation pressure on vector registers.
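A minimal sketch of both ideas, with an invented 4-element vector type: one "vector instruction" replaces four scalar iterations, and chaining is shown by feeding the product vector straight into the add without spilling it to memory.

```c
#define VLEN 4                     /* hypothetical vector length */

typedef struct { int e[VLEN]; } vec_t;   /* models one vector register */

/* One vector instruction stands in for VLEN iterations of a scalar loop. */
static vec_t vadd(vec_t x, vec_t y) {
    vec_t r;
    for (int i = 0; i < VLEN; i++) r.e[i] = x.e[i] + y.e[i];
    return r;
}

static vec_t vmul(vec_t x, vec_t y) {
    vec_t r;
    for (int i = 0; i < VLEN; i++) r.e[i] = x.e[i] * y.e[i];
    return r;
}

/* Chaining: the product vector is consumed by vadd directly, so the
 * intermediate result never makes its own trip through memory. */
static vec_t fused_mul_add(vec_t a, vec_t b, vec_t c) {
    return vadd(vmul(a, b), c);
}
```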
Single instruction, multiple data (SIMD) is a resource-replication technique that exploits data-level parallelism by providing multiple parallel processing units, or by splitting one processing unit into several data paths. The control signals of a single instruction drive multiple arithmetic units simultaneously, while the data being processed come from multiple data streams. Intel's IA-32 and IA-64 instruction sets both include SIMD instruction extensions that improve the performance of numerical applications.
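The defining property, one set of control signals driving every lane while each lane reads its own data stream, can be sketched as follows (lane count and names are invented for illustration; the function pointer stands in for the broadcast control signal):

```c
#define LANES 4   /* hypothetical number of parallel processing units */

/* A single decoded control signal (modeled by the function pointer) is
 * broadcast to every lane; only the data differs, one stream per lane. */
typedef int (*alu_op_t)(int, int);

static int op_add(int x, int y) { return x + y; }

static void simd_execute(alu_op_t op,
                         const int src1[LANES], const int src2[LANES],
                         int dst[LANES]) {
    for (int lane = 0; lane < LANES; lane++)
        dst[lane] = op(src1[lane], src2[lane]);  /* same op, different data */
}
```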
In summary, parallelism in compute-intensive applications such as multimedia and scientific computing is currently exploited mainly through instruction-level or data-level parallelism in isolation. As application scale and complexity keep growing, parallelizing with these methods becomes harder and the performance gains keep shrinking. Supporting the simultaneous exploitation of data-level and instruction-level parallelism in the hardware structure is a new way to attack this problem.
Summary of the invention
The technical problem addressed by the invention is as follows: in view of the shortcomings of the prior art, the invention provides a computing group structure fusing VLIW and SIMD that supports the simultaneous exploitation of data-level and instruction-level parallelism, combines VLIW, SIMD and software pipelining techniques, and further improves the execution performance of compute-intensive applications on the processor in several ways.
To solve the above technical problem, the invention proposes the following solution: a computing group structure fusing VLIW and SIMD, characterized in that it comprises a main controller, a data buffer, a SIMD computing group and a micro-controller. The data buffer and the micro-controller are each connected to the main controller, and are connected to each other through the SIMD computing group. The main controller prepares instructions and data: it loads the instructions to be executed by the SIMD computing group into the storage unit of the micro-controller, controls the starting and pausing of the SIMD computing group, loads the required source operand data into the data buffer, and receives the final results from the data buffer. The data buffer receives data from the main controller, stores it in a specific organization, supplies the source operands needed by the SIMD computing group, collects the group's outputs after computation, and returns the final results to the main controller. The micro-controller receives the very long instruction word sequence from the main controller, decodes it, and broadcasts each operation, dispatching and mapping it onto the corresponding processing unit F of every processing element of the SIMD computing group for parallel execution. The SIMD computing group is a set of parallel processing elements organized in SIMD fashion; the elements are identical in structure and each is configured with multiple processing units. The instruction sequences they execute all come from the micro-controller, but the data they compute on come from different locations in the data buffer.
The SIMD computing group is a set of processing elements (PEs) of identical structure; all PEs simultaneously execute, in SIMD fashion, the same instruction or instruction sequence from the micro-controller. Each PE contains multiple arithmetic/logic processing units Fn and local register files. The processing units Fn support VLIW execution and can process several operations of different types in parallel at the same time; each unit has its own local register file, which directly supplies its operands and stores its results.
Compared with the prior art: compute-intensive applications such as multimedia and scientific computing involve large volumes of data, and data of the same type or structure often undergo the same operation or sequence of operations. In microprocessors aimed at such applications, the hardware structure proposed by the invention therefore offers the following advantages:
(1) By placing local registers inside each PE and a data buffer in the system to hold the intermediate results of program computation, the proposed computing group structure fusing VLIW and SIMD avoids wasting storage bandwidth; by combining resource replication with compiler scheduling it exploits data-level and instruction-level parallelism at the same time, raising the execution efficiency of applications on the processor.
(2) DLP and ILP are exploited together. Because VLIW programs execute in SIMD fashion, instruction-level and data-level parallelism in the program are exploited cooperatively, greatly increasing the processor's throughput on such applications.
(3) High hardware efficiency. Multiple PEs share one set of instruction-control logic (fetch, decode, dispatch, mapping) while the operands come from different data streams. This SIMD execution style achieves, through the control path of a single instruction, a data throughput that would otherwise require a multiprocessor system.
(4) Low hardware implementation complexity. Since the compiler can determine the latency of every operation, parallelism is exploited entirely by the compiler, avoiding complex hardware dependence-detection and pipeline-interlock logic and reducing the complexity of the hardware implementation.
(5) The storage-bandwidth bottleneck is relieved. Because local registers inside the PEs are used, the intermediate results of instruction operations need not occupy the external data buffer, which relieves external storage bandwidth pressure and speeds up operand access.
In summary, the hardware structure proposed by the invention combines the strengths of VLIW and SIMD in exploiting program parallelism. It is suited to processors for compute-intensive applications, but is not limited to them; other processors that need to exploit several kinds of parallelism at once can adopt it as well.
Brief description of the drawings
Figure 1 is a schematic diagram of the overall architecture of the invention;
Figure 2 is a schematic diagram of the computing group structure fusing VLIW and SIMD;
Figure 3 is a flow chart of instruction processing in the invention.
Detailed description
The invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Figure 1, the computing group structure fusing VLIW and SIMD comprises a main controller, a data buffer, a SIMD computing group and a micro-controller. The main controller prepares instructions and data: it loads the instructions to be executed by the SIMD computing group into the storage unit of the micro-controller, controls the starting and pausing of the SIMD computing group, loads the required source operand data into the data buffer, and receives the final results from the data buffer. The data buffer receives data from the main controller, stores it in a specific organization, supplies the source operands needed by the SIMD computing group, collects the group's outputs after computation, and returns the final results to the main controller. The micro-controller receives the very long instruction word sequence from the main controller, decodes it, and broadcasts each operation, dispatching and mapping it onto the corresponding processing unit F of every processing element (PE) for parallel execution. The SIMD computing group is a set of parallel processing elements organized in SIMD fashion; the elements are identical in structure and each is configured with multiple processing units. The instruction sequences they execute all come from the micro-controller, but the data they compute on come from different locations in the data buffer.
Figure 2 shows the computing group structure fusing VLIW and SIMD. The SIMD computing group is a set of identically structured processing elements PE0, PE1, ..., PEN. All PEs execute the same instruction or instruction sequence from the micro-controller simultaneously, in SIMD fashion. Each PE contains multiple arithmetic/logic processing units F1, F2, ..., Fn and local register files. The processing units F1, F2, ..., Fn support VLIW execution and can process several operations of different types (addition, multiplication, multiply-add, logic operations, and so on) in parallel. Each processing unit has its own local register file, which directly supplies its operands and stores its results. Operands are first read from the data buffer into the local registers; because the main controller has already moved the operands into the data buffer before starting the micro-controller's decode, dispatch and mapping, the latency of moving an operand from the data buffer into a PE's local register is fixed. The local register files inside a processing element are interconnected by a network and can exchange the intermediate temporary data produced by computation. Every processing unit reads its operands directly from its own local registers, so a unit never stalls unpredictably on data that is not ready. Consequently, the latency of any operation of a very long instruction word is identical and known on all PEs: the compiler can fully exploit instruction-level parallelism, generate VLIW instruction sequences, and use software pipelining to restructure the program's loops, resolving the data dependences between very long instruction words without hardware intervention. An interconnection network also links the PEs for any necessary synchronization and data exchange.
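The organization just described can be sketched as a small software model (an illustration only; the sizes `N_PE`, `N_FU`, `N_REG` and all function names are invented): each functional unit owns a private local register file, the micro-controller broadcasts one micro-op to all PEs, and each PE loads its operands from a different row of the data buffer.

```c
#define N_PE  4    /* processing elements (hypothetical) */
#define N_FU  2    /* functional units per PE */
#define N_REG 4    /* local registers per functional unit */

/* Each functional unit owns a private local register file, so intermediate
 * results stay inside the PE instead of occupying the shared data buffer. */
typedef struct {
    int lrf[N_FU][N_REG];
} pe_t;

enum { U_ADD, U_MUL };

typedef struct { int op, fu, src1, src2, dst; } uop_t;  /* one decoded micro-op */

/* The micro-controller broadcasts the same micro-op to every PE (SIMD);
 * micro-ops of one long instruction word target different units (VLIW). */
static void broadcast(pe_t pe[N_PE], uop_t u) {
    for (int p = 0; p < N_PE; p++) {
        int *r = pe[p].lrf[u.fu];
        r[u.dst] = (u.op == U_ADD) ? r[u.src1] + r[u.src2]
                                   : r[u.src1] * r[u.src2];
    }
}

/* Loading from the data buffer: PE p reads entry p, so the lanes execute
 * the same instruction on different data streams. */
static void load_from_buffer(pe_t pe[N_PE], const int buf[N_PE],
                             int fu, int dst) {
    for (int p = 0; p < N_PE; p++)
        pe[p].lrf[fu][dst] = buf[p];
}
```

Because every operation in the model completes in one step, the "fixed, knowable latency" property the paragraph relies on holds trivially here; in hardware it is what lets the compiler schedule without interlocks.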
Take the following pseudocode program as an example:

// data preparation
load(data1[m], mem1);
load(data2[m], mem2);
// data processing
for (i from 1 to m) {
    func(data1[i], data2[i], data3[i], data4[i]);
    func(data3[i], data4[i], data5[i], data6[i]);
}
// write data back to external memory or the network interface
send(data5[m], mem3);
send(data6[m], chan0);
// data-processing function: inputs in_a and in_b, outputs out_c and out_d
func(in_a, in_b, out_c, out_d) {
    // first very long instruction word I1
    OP11    // executed on unit F1
    OP12    // executed on unit F2
    ...
    OP1n;;  // executed on unit Fn; ";;" marks the instruction-word boundary
    // second very long instruction word I2
    OP21
    OP22
    ...
    OP2n;;
    ...
}
The program describes a process in which a stream of identically structured data (data1, data2) undergoes a series of operations (the two func calls) in sequence to produce new data streams (data5, data6); its body is a for loop. There is a read-after-write data dependence between the second func call and the first. Under conventional instruction-processing techniques, execution efficiency, i.e. parallelism, is limited by this dependence; moreover, the two calls generate a large amount of temporary data (data3, data4) which, if register resources are insufficient, must be spilled to memory and read back when needed, wasting storage bandwidth. The hardware structure proposed here avoids these bottlenecks. The program executes on the invention as follows:
1. The load instructions execute: the main controller steers the data flow and starts the data buffer, which reads the data sequences data1 and data2 needed by the program from external memory and stores them in the buffer at fixed index addresses.
2. The main controller issues the program's very long instruction word sequence (the first func) to the instruction cache inside the micro-controller.
3. The micro-controller decodes the VLIW instructions I1, I2, and so on, producing n micro-operations (OPi1, OPi2, ..., OPin) per cycle and dispatching them in parallel to the n processing units of each of the N PEs for execution:
(a) data-load operations move data (in_a, in_b) from the data buffer into the PEs' local registers;
(b) data-write-back operations move data (out_c, out_d) from the PEs' local registers back to the data buffer;
(c) arithmetic/logic operations take their source operands from, and write their results back to, their own local registers;
(d) data can move between local registers over the intra-PE interconnection network, or across PEs over the inter-PE network.
4. The second func executes, repeating steps 2 and 3.
5. The send instructions execute: the main controller steers the data flow and starts the data buffer, which writes the data sequences data5 and data6 back to external memory or to the designated network port.
Every micro-operation has a fixed, predictable execution latency; all PEs begin and finish execution together. Execution is SIMD across PEs and VLIW within each PE.
Figure 3 shows the instruction-processing flow of the invention. After the main controller decodes an instruction, it determines whether it is a data-preparation or a data-processing instruction. A data-preparation instruction is issued to the data buffer, which determines whether it is a write or a read. A read instruction reads a whole block of data from external memory, using the data address and fetch length given in the instruction, and stores it in the buffer at fixed index addresses; a write instruction outputs the specified block from the buffer to the specified destination (external memory or a network port). For a data-processing instruction, the corresponding very long instruction word sequence is fetched from external memory into the micro-controller, which decodes, dispatches and executes it instruction by instruction. The micro-operations decoded from each very long instruction word are dispatched simultaneously to the N PEs and mapped onto the corresponding processing units F; the required data come from the data buffer, and the PEs execute in SIMD fashion. When one very long instruction word has been processed, the micro-controller starts decoding and dispatching the next; when a whole instruction sequence has finished, the main controller starts reading in and processing the next sequence.
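The top-level decision in this flow can be sketched as a simple classifier (enum and function names are invented for illustration, mirroring the branches of Figure 3):

```c
/* Hypothetical instruction classes mirroring Figure 3's decision points. */
typedef enum { I_LOAD, I_SEND, I_PROCESS } iclass_t;
typedef enum {
    ROUTE_BUFFER_READ,      /* external memory -> data buffer       */
    ROUTE_BUFFER_WRITE,     /* data buffer -> memory / network port */
    ROUTE_MICROCONTROLLER   /* fetch VLIW sequence, decode, broadcast to PEs */
} route_t;

/* Main-controller routing: data-preparation instructions go to the data
 * buffer as a block read or block write; data-processing instructions
 * cause a VLIW sequence to be fetched into the micro-controller. */
static route_t dispatch(iclass_t c) {
    switch (c) {
    case I_LOAD: return ROUTE_BUFFER_READ;
    case I_SEND: return ROUTE_BUFFER_WRITE;
    default:     return ROUTE_MICROCONTROLLER;
    }
}
```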
Claims (2)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2007100345670A CN100456230C (en) | 2007-03-19 | 2007-03-19 | Computational group unit combining ultra-long instruction word and single instruction stream multiple data streams |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101021778A true CN101021778A (en) | 2007-08-22 |
CN100456230C CN100456230C (en) | 2009-01-28 |
Family
ID=38709553
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB2007100345670A Expired - Fee Related CN100456230C (en) | 2007-03-19 | 2007-03-19 | Computational group unit combining ultra-long instruction word and single instruction stream multiple data streams |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100456230C (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101452394B (en) * | 2007-11-28 | 2012-05-23 | 无锡江南计算技术研究所 | Compiling method and compiler |
CN102970049A (en) * | 2012-10-26 | 2013-03-13 | 北京邮电大学 | Parallel circuit based on chien search algorithm and forney algorithm and RS decoding circuit |
WO2015173674A1 (en) * | 2014-05-12 | 2015-11-19 | International Business Machines Corporation | Parallel slice processor with dynamic instruction stream mapping |
US9672043B2 (en) | 2014-05-12 | 2017-06-06 | International Business Machines Corporation | Processing of multiple instruction streams in a parallel slice processor |
US9720696B2 (en) | 2014-09-30 | 2017-08-01 | International Business Machines Corporation | Independent mapping of threads |
US9740486B2 (en) | 2014-09-09 | 2017-08-22 | International Business Machines Corporation | Register files for storing data operated on by instructions of multiple widths |
US9934033B2 (en) | 2016-06-13 | 2018-04-03 | International Business Machines Corporation | Operation of a multi-slice processor implementing simultaneous two-target loads and stores |
US9971602B2 (en) | 2015-01-12 | 2018-05-15 | International Business Machines Corporation | Reconfigurable processing method with modes controlling the partitioning of clusters and cache slices |
US9983875B2 (en) | 2016-03-04 | 2018-05-29 | International Business Machines Corporation | Operation of a multi-slice processor preventing early dependent instruction wakeup |
US10037211B2 (en) | 2016-03-22 | 2018-07-31 | International Business Machines Corporation | Operation of a multi-slice processor with an expanded merge fetching queue |
US10037229B2 (en) | 2016-05-11 | 2018-07-31 | International Business Machines Corporation | Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions |
US10042647B2 (en) | 2016-06-27 | 2018-08-07 | International Business Machines Corporation | Managing a divided load reorder queue |
US10133581B2 (en) | 2015-01-13 | 2018-11-20 | International Business Machines Corporation | Linkable issue queue parallel execution slice for a processor |
US10133576B2 (en) | 2015-01-13 | 2018-11-20 | International Business Machines Corporation | Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries |
WO2019104638A1 (en) * | 2017-11-30 | 2019-06-06 | 深圳市大疆创新科技有限公司 | Neural network processing method and apparatus, accelerator, system, and mobile device |
US10318419B2 (en) | 2016-08-08 | 2019-06-11 | International Business Machines Corporation | Flush avoidance in a load store unit |
US10346174B2 (en) | 2016-03-24 | 2019-07-09 | International Business Machines Corporation | Operation of a multi-slice processor with dynamic canceling of partial loads |
US10761854B2 (en) | 2016-04-19 | 2020-09-01 | International Business Machines Corporation | Preventing hazard flushes in an instruction sequencing unit of a multi-slice processor |
CN113032013A (en) * | 2021-01-29 | 2021-06-25 | 成都商汤科技有限公司 | Data transmission method, chip, equipment and storage medium |
WO2024222351A1 (en) * | 2023-04-27 | 2024-10-31 | 上海曦智科技有限公司 | Optic-electric computing system and data processing method |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6718457B2 (en) * | 1998-12-03 | 2004-04-06 | Sun Microsystems, Inc. | Multiple-thread processor for threaded software applications |
JP2001175618A (en) * | 1999-12-17 | 2001-06-29 | Nec Eng Ltd | Parallel computer system |
WO2005036384A2 (en) * | 2003-10-14 | 2005-04-21 | Koninklijke Philips Electronics N.V. | Instruction encoding for vliw processors |
US9047094B2 (en) * | 2004-03-31 | 2015-06-02 | Icera Inc. | Apparatus and method for separate asymmetric control processing and data path processing in a dual path processor |
US7949856B2 (en) * | 2004-03-31 | 2011-05-24 | Icera Inc. | Method and apparatus for separate control processing and data path processing in a dual path processor with a shared load/store unit |
CN100357932C (en) * | 2006-06-05 | 2007-12-26 | 中国人民解放军国防科学技术大学 | Method for decreasing data access delay in stream processor |
- 2007-03-19 CN CNB2007100345670A patent/CN100456230C/en not_active Expired - Fee Related
Cited By (42)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101452394B (en) * | 2007-11-28 | 2012-05-23 | 无锡江南计算技术研究所 | Compiling method and compiler |
CN102970049B (en) * | 2012-10-26 | 2016-01-20 | 北京邮电大学 | Based on parallel circuit and the RS decoding circuit of money searching algorithm and Fu Ni algorithm |
CN102970049A (en) * | 2012-10-26 | 2013-03-13 | 北京邮电大学 | Parallel circuit based on chien search algorithm and forney algorithm and RS decoding circuit |
US9690585B2 (en) | 2014-05-12 | 2017-06-27 | International Business Machines Corporation | Parallel slice processor with dynamic instruction stream mapping |
US9665372B2 (en) | 2014-05-12 | 2017-05-30 | International Business Machines Corporation | Parallel slice processor with dynamic instruction stream mapping |
US9672043B2 (en) | 2014-05-12 | 2017-06-06 | International Business Machines Corporation | Processing of multiple instruction streams in a parallel slice processor |
US9690586B2 (en) | 2014-05-12 | 2017-06-27 | International Business Machines Corporation | Processing of multiple instruction streams in a parallel slice processor |
US10157064B2 (en) | 2014-05-12 | 2018-12-18 | International Business Machines Corporation | Processing of multiple instruction streams in a parallel slice processor |
WO2015173674A1 (en) * | 2014-05-12 | 2015-11-19 | International Business Machines Corporation | Parallel slice processor with dynamic instruction stream mapping |
US9740486B2 (en) | 2014-09-09 | 2017-08-22 | International Business Machines Corporation | Register files for storing data operated on by instructions of multiple widths |
US9760375B2 (en) | 2014-09-09 | 2017-09-12 | International Business Machines Corporation | Register files for storing data operated on by instructions of multiple widths |
US11144323B2 (en) | 2014-09-30 | 2021-10-12 | International Business Machines Corporation | Independent mapping of threads |
US9720696B2 (en) | 2014-09-30 | 2017-08-01 | International Business Machines Corporation | Independent mapping of threads |
US9870229B2 (en) | 2014-09-30 | 2018-01-16 | International Business Machines Corporation | Independent mapping of threads |
US10545762B2 (en) | 2014-09-30 | 2020-01-28 | International Business Machines Corporation | Independent mapping of threads |
US9971602B2 (en) | 2015-01-12 | 2018-05-15 | International Business Machines Corporation | Reconfigurable processing method with modes controlling the partitioning of clusters and cache slices |
US9977678B2 (en) | 2015-01-12 | 2018-05-22 | International Business Machines Corporation | Reconfigurable parallel execution and load-store slice processor |
US10983800B2 (en) | 2015-01-12 | 2021-04-20 | International Business Machines Corporation | Reconfigurable processor with load-store slices supporting reorder and controlling access to cache slices |
US10083039B2 (en) | 2015-01-12 | 2018-09-25 | International Business Machines Corporation | Reconfigurable processor with load-store slices supporting reorder and controlling access to cache slices |
US11150907B2 (en) | 2015-01-13 | 2021-10-19 | International Business Machines Corporation | Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries |
US10133581B2 (en) | 2015-01-13 | 2018-11-20 | International Business Machines Corporation | Linkable issue queue parallel execution slice for a processor |
US10133576B2 (en) | 2015-01-13 | 2018-11-20 | International Business Machines Corporation | Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries |
US11734010B2 (en) | 2015-01-13 | 2023-08-22 | International Business Machines Corporation | Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries |
US10223125B2 (en) | 2015-01-13 | 2019-03-05 | International Business Machines Corporation | Linkable issue queue parallel execution slice processing method |
US12061909B2 (en) | 2015-01-13 | 2024-08-13 | International Business Machines Corporation | Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries |
US9983875B2 (en) | 2016-03-04 | 2018-05-29 | International Business Machines Corporation | Operation of a multi-slice processor preventing early dependent instruction wakeup |
US10037211B2 (en) | 2016-03-22 | 2018-07-31 | International Business Machines Corporation | Operation of a multi-slice processor with an expanded merge fetching queue |
US10564978B2 (en) | 2016-03-22 | 2020-02-18 | International Business Machines Corporation | Operation of a multi-slice processor with an expanded merge fetching queue |
US10346174B2 (en) | 2016-03-24 | 2019-07-09 | International Business Machines Corporation | Operation of a multi-slice processor with dynamic canceling of partial loads |
US10761854B2 (en) | 2016-04-19 | 2020-09-01 | International Business Machines Corporation | Preventing hazard flushes in an instruction sequencing unit of a multi-slice processor |
US10268518B2 (en) | 2016-05-11 | 2019-04-23 | International Business Machines Corporation | Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions |
US10255107B2 (en) | 2016-05-11 | 2019-04-09 | International Business Machines Corporation | Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions |
US10042770B2 (en) | 2016-05-11 | 2018-08-07 | International Business Machines Corporation | Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions |
US10037229B2 (en) | 2016-05-11 | 2018-07-31 | International Business Machines Corporation | Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions |
US9940133B2 (en) | 2016-06-13 | 2018-04-10 | International Business Machines Corporation | Operation of a multi-slice processor implementing simultaneous two-target loads and stores |
US9934033B2 (en) | 2016-06-13 | 2018-04-03 | International Business Machines Corporation | Operation of a multi-slice processor implementing simultaneous two-target loads and stores |
US10042647B2 (en) | 2016-06-27 | 2018-08-07 | International Business Machines Corporation | Managing a divided load reorder queue |
US10318419B2 (en) | 2016-08-08 | 2019-06-11 | International Business Machines Corporation | Flush avoidance in a load store unit |
WO2019104638A1 (en) * | 2017-11-30 | 2019-06-06 | 深圳市大疆创新科技有限公司 | Neural network processing method and apparatus, accelerator, system, and mobile device |
CN113032013A (en) * | 2021-01-29 | 2021-06-25 | 成都商汤科技有限公司 | Data transmission method, chip, equipment and storage medium |
CN113032013B (en) * | 2021-01-29 | 2023-03-28 | 成都商汤科技有限公司 | Data transmission method, chip, equipment and storage medium |
WO2024222351A1 (en) * | 2023-04-27 | 2024-10-31 | 上海曦智科技有限公司 | Optic-electric computing system and data processing method |
Also Published As
Publication number | Publication date |
---|---|
CN100456230C (en) | 2009-01-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101021778A (en) | Computing group structure for superlong instruction word and instruction flow multidata stream fusion | |
US20230106990A1 (en) | Executing multiple programs simultaneously on a processor core | |
US10409606B2 (en) | Verifying branch targets | |
US10452399B2 (en) | Broadcast channel architectures for block-based processors | |
US11681531B2 (en) | Generation and use of memory access instruction order encodings | |
US20170083319A1 (en) | Generation and use of block branch metadata | |
US20170083320A1 (en) | Predicated read instructions | |
US10445097B2 (en) | Multimodal targets in a block-based processor | |
US20160378491A1 (en) | Determination of target location for transfer of processor control | |
US10824429B2 (en) | Commit logic and precise exceptions in explicit dataflow graph execution architectures | |
US20180032335A1 (en) | Transactional register file for a processor | |
WO2017048648A1 (en) | Segmented instruction block | |
KR20180021812A (en) | Block-based architecture that executes contiguous blocks in parallel | |
US20180267807A1 (en) | Precise exceptions for edge processors | |
US11726912B2 (en) | Coupling wide memory interface to wide write back paths | |
CN100489830C (en) | 64 bit stream processor chip system structure oriented to scientific computing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C17 | Cessation of patent right | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 2009-01-28 | Termination date: 2011-03-19