
CN101021832A - 64 bit floating-point integer amalgamated arithmetic group capable of supporting local register and conditional execution - Google Patents


Info

Publication number
CN101021832A
Authority
CN
China
Prior art keywords
arithmetic
group
data
condition
parts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200710034572
Other languages
Chinese (zh)
Inventor
邢座程
蒋江
杨学军
张民选
张明
穆长富
阳柳
曾献君
马驰远
李勇
陈海燕
高军
李晋文
衣晓飞
倪晓强
唐遇星
张承义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN 200710034572 priority Critical patent/CN101021832A/en
Publication of CN101021832A publication Critical patent/CN101021832A/en
Pending legal-status Critical Current

Landscapes

  • Advance Control (AREA)

Abstract

The invention discloses a 64-bit floating-point/integer fused arithmetic group supporting local registers and conditional execution. It comprises one or more 64-bit fused floating-point/integer arithmetic units, a conditional execution unit, a component interconnect structure, and a controller. The arithmetic units and the conditional execution unit are connected to each other through the component interconnect structure, and all data transfers between components pass through it. The storage address space of the whole arithmetic group consists of several completely independent address spaces, and the local arithmetic group can access each address space only by indirect addressing through the unified controller. The controller contains a dedicated instruction memory that stores the compiled instructions. During program execution, the controller sends the compiled instructions simultaneously, according to timing requirements, to the functional units of every arithmetic group; each arithmetic group receives and executes the instructions in SIMD fashion, and the results are either stored in the same LRF in each arithmetic group or output.

Description

64-bit floating-point/integer fused arithmetic group supporting local registers and conditional execution
Technical field
The present invention relates mainly to the design of 64-bit floating-point and integer arithmetic groups in chip design, and in particular to a 64-bit floating-point/integer fused arithmetic group supporting local registers and conditional execution.
Background technology
With the emergence of large numbers of compute-intensive applications, how to exploit effectively the abundant data parallelism in programs has become a focus of the computing community. Among compute-intensive applications, the data parallelism of media streams is particularly prominent: multimedia, graphics and image processing, and signal processing all contain large amounts of data parallelism. To exploit it, many architectures that can extract this parallelism effectively have been developed, such as vector processors and SIMD processors, and these structures can all achieve very high performance. The SIMD stream processor architecture that Stanford University proposed according to the characteristics of stream programs is a typical representative of these data-parallel architectures. The stream programming model is divided into a stream level and a kernel level: the stream level is responsible for scheduling streams between main memory and the stream register file, and the kernel level is responsible for computing on the streams. The whole model is one in which the stream level schedules kernels to process a series of serial stream data. In the stream programming model, an application consists of a series of kernels; each kernel consumes the stream data produced by the previous kernel and produces a new data stream for the kernel that follows. A stream architecture exploits the parallelism of performing the same operation on large amounts of different data.
The 32-bit Imagine stream processor chip, developed and taped out by Stanford University, contains 8 arithmetic groups, each comprising 3 adders, 2 multipliers and 1 reciprocal/inverse-square-root unit. Existing studies show that Imagine has clear advantages over other processor architectures in processing media stream programs. Designing a 64-bit arithmetic group, however, poses many difficulties. The present invention proposes a design method for a 64-bit floating-point/integer fused arithmetic group with local register files and support for conditional execution.
Summary of the invention
The technical problem to be solved by the present invention is as follows: in view of the technical problems of the prior art, the invention provides a 64-bit floating-point/integer fused arithmetic group supporting local registers and conditional execution that can effectively support media-stream-like computation, so as to meet the ever-growing demands of current high-performance scientific computing, and that solves, in a simple and effective way, the sharp drop in computing performance caused by control dependences in massively parallel computation.
To solve the above technical problem, the solution proposed by the present invention is a 64-bit floating-point/integer fused arithmetic group supporting local registers and conditional execution, characterized in that it comprises one or more 64-bit fused floating-point/integer arithmetic units, a conditional execution unit for resolving control dependences, a component interconnect structure for cross-sharing of data, and a controller. The whole LRF address space consists of several completely independent address spaces, with each input of each arithmetic unit corresponding to one address space. No arithmetic group can address these spaces directly; the local arithmetic group can access each address space only by indirect addressing through the unified controller. The controller contains a dedicated memory that stores the compiled instructions. During program execution, the controller sends the compiled instructions simultaneously, in accordance with timing requirements, to the functional units of every arithmetic group; each arithmetic group receives and executes the instructions in SIMD fashion, and the results are stored in the same LRF in each arithmetic group or are output.
The conditional execution unit comprises a control unit, a communication unit and a buffer unit. The control unit saves and updates the status bits used during conditional execution and coordinates the other units to complete the conditional execution operation jointly. The control unit passes the updated status bits to the communication unit and the buffer unit; the communication unit fetches data from other arithmetic groups according to the signals it receives; and the buffer unit buffers data between the stream buffer and the arithmetic group, providing the buffer area for conditional operations.
The arithmetic unit comprises a floating-point multiply-add unit, a floating-point miscellaneous unit and an integer arithmetic logic unit.
The arithmetic unit further comprises a DSQ unit for computing floating-point reciprocals and inverse square roots. The DSQ unit obtains the reciprocal or inverse square root of a datum by table lookup and, together with software iteration as required, can implement high-precision division or square-root operations.
Compared with the prior art, the advantages of the present invention are:
1. Wider data width. The computing structure of the traditional 32-bit stream processor is extended to 64 bits and the number of input/output streams is adjusted appropriately, which overcomes the latency drawback caused by high data bandwidth. Without affecting computing speed, the design remains compatible with the processing modes of 32-bit stream processors while effectively supporting 64-bit scientific computing and wider data.
2. 64-bit floating-point multiply-add. The 2-input arithmetic unit structure of the traditional 32-bit stream processor is replaced by a 3-input 64-bit floating-point multiply-add structure, which fuses one multiplication and one addition into a single multiply-add operation to raise computing speed. Within the structural framework of the stream processor, several 64-bit floating-point multiply-add units can be combined efficiently to provide the peak computing speed that the design requires.
3. Integrated multimedia computing. Multimedia instructions are implemented inside the floating-point multiply-add unit without affecting the existing structure, so that the invention can effectively process 8-, 16-, 32- and 64-bit multimedia instructions. While performing scientific computing on wide data, it can also meet the requirements of narrow-width multimedia programs.
4. Cross-group conditional processing. Building on the traditional stream processor structure, an effective conditional execution unit is designed that supports conditional processing in a simple and effective way and remedies an inherent shortcoming of the SIMD structure. A processor using this invention can handle data crossing between arithmetic groups more effectively, avoiding the performance loss that conditional execution causes in conventional stream processors.
Description of drawings
Fig. 1 is a schematic diagram of the overall structure of an arithmetic group comprising four arithmetic units;
Fig. 2 is a schematic diagram of the data path when the arithmetic group performs one multiply-add operation;
Fig. 3 is a detailed data path diagram of an arithmetic group comprising four arithmetic units;
Fig. 4 is a schematic diagram of the interconnection between 8 arithmetic groups and the controller;
Fig. 5 is a schematic diagram of the connections among the subcomponents inside the FMAC arithmetic unit;
Fig. 6 is a schematic diagram of a processing example for the conditional execution unit;
Fig. 7 is a flow diagram of conditional execution processing.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
The 64-bit floating-point/integer fused arithmetic group supporting local registers and conditional execution of the present invention comprises one or more 64-bit fused floating-point/integer arithmetic units, a conditional execution unit for resolving control dependences, a component interconnect structure for cross-sharing of data, and a controller. The whole LRF address space consists of several completely independent address spaces, with each input of each arithmetic unit corresponding to one address space. No arithmetic group can address these spaces directly; the local arithmetic group can access each address space only by indirect addressing through the unified controller. The controller contains a dedicated memory that stores the compiled instructions. During program execution, the controller sends the compiled instructions simultaneously, in accordance with timing requirements, to the functional units of every arithmetic group; each arithmetic group receives and executes the instructions in SIMD fashion, and the results are stored in the same LRF in each arithmetic group or are output. The overall logical structure of the arithmetic group thus comprises the one or more 64-bit fused floating-point/integer arithmetic units, the conditional execution unit that resolves control dependences, and the component interconnect structure for cross-sharing of data.
In addition to double-precision floating-point fused multiply-add, the arithmetic unit must also perform arithmetic and logical operations on fixed-point numbers. Logically, the arithmetic unit can be divided into three parts: the floating-point multiply-add unit (FMAC), the floating-point miscellaneous unit (FMISC) and the integer arithmetic logic unit (ALU). Floating-point instructions follow the IEEE 754 standard, and floating-point data use the IEEE double-precision format (an 11-bit exponent and a 53-bit significand that includes the implicit leading 1). So that denormalized numbers can be processed inside the arithmetic group, the exponent is extended by one bit, to 12 bits.
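As an illustration of the number format described above, the following Python sketch unpacks a double into its sign, unbiased exponent and 53-bit significand with the leading 1 made explicit. It is only a software model: the helper name `unpack_extended` is hypothetical, and the patent does not say how the 12-bit exponent is biased, so the sketch merely shows that a normalized denormal needs an exponent below what the 11-bit field can encode.

```python
import math

def unpack_extended(x):
    """Split a finite, nonzero double into (sign, exponent, significand).

    The significand is a 53-bit integer whose top bit (the normally implicit
    leading 1) is always set, even for IEEE denormals. The exponent is
    unbiased, so x == (-1)**sign * significand * 2**(exponent - 52).
    Hypothetical helper, for illustration only.
    """
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        raise ValueError("only finite, nonzero values are handled")
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))          # abs(x) == m * 2**e with 0.5 <= m < 1
    significand = int(m * (1 << 53))   # exact: scaling by a power of two
    return sign, e - 1, significand

# The smallest IEEE denormal normalizes to exponent -1074, which lies below
# the minimum normal exponent (-1022) of the 11-bit exponent field.
print(unpack_extended(1.0))
print(unpack_extended(5e-324))
```

With one extra exponent bit, a value such as the second one can be carried through the data path in normalized form rather than as a special case.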
The essence of floating-point fused multiply-add is to treat the addend as one more partial product: it is aligned to the exponent of the product and accumulated together with the partial products in a single summation, instead of multiplying and adding as two separate summations with two roundings. This shortens the critical path. The instructions implemented by the floating-point multiply-add unit are mainly: floating-point multiply-add instructions, float-to-integer conversion instructions, floating-point normalization instructions, and the fused 8-, 16-, 32- and 64-bit integer multiply instructions. The instructions implemented by the floating-point miscellaneous unit are mainly floating-point logical instructions and floating-point test instructions. Besides the usual SIMD-style 8-, 16-, 32- and 64-bit multimedia instructions, the ALU also implements integer test instructions, saturating arithmetic instructions, and data movement instructions such as byte shuffle and conditional routing operations.
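The numerical effect of rounding once instead of twice can be sketched in Python. This is a model of the arithmetic result only, not of the hardware data path: the function names are hypothetical, and exact rational arithmetic stands in for the single wide summation.

```python
from fractions import Fraction

def fused_multiply_add(a, b, c):
    """a*b + c with a single rounding: compute exactly, then round once to double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def separate_multiply_add(a, b, c):
    """a*b + c with two roundings: the product is rounded before the addition."""
    return a * b + c

# a*b is exactly 1 - 2**-54, which is not representable as a double.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

print(separate_multiply_add(a, b, c))  # the product rounds to 1.0 first, so the small term is lost
print(fused_multiply_add(a, b, c))     # the small term survives the single rounding
```

For these inputs the separately rounded version returns zero, while the fused version keeps the residual term of magnitude 2**-54.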
The floating-point multiply-add unit adopts a low-latency structure in which the normalization shift is performed early and summation is merged with rounding. On top of double-precision floating-point multiply-add, widening the data path and adding the related units implements four further classes of instructions: float-to-integer conversion, integer-to-float conversion, floating-point normalization, and 8-, 16-, 32- and 64-bit fixed-point multiplication. The whole floating-point multiply-add unit can be divided into several clock cycles, and both floating-point and fixed-point numbers support the round-to-nearest-even mode.
In addition, in order effectively to support division and subduplicate computing, the present invention also to design special DSQ parts of asking floating-point inverse and inverse square root.The DSQ parts are obtained the inverse or the inverse square root of data in the mode of tabling look-up, cooperate software in addition iteration can realize high-precision division or sqrt computing as requested.If the value of double-precision floating points is less than the denotable minimum value of normalized floating point number, then according to the situation or it is expressed as minimum normalized floating point number or directly this numerical table is shown normalized+0.0 or 1 of rounding off of mantissa, concrete operations are the sign bits that keep the result, and its index, magnitude portion are changed to complete 0, with the Times floating-point underflow.
The conditional execution unit converts control dependences into data routing. This resolves the sharp performance drop caused by the small amount of control dependence contained in massively data-parallel code, widens the range of data-parallel programs, and lets the parallel arithmetic groups execute applications with control dependences more effectively.
A conditional operation can take place before the stream data are computed, in which case it is called a conditional input operation, or after the stream data are computed, in which case it is called a conditional output operation. A conditional output operation filters a stream, distributing different data into different streams so that each stream contains data of only one type. Although the resulting streams may need different kernels, the data within each stream are all processed identically. A conditional input operation merges several streams, combining different data streams into a single stream that contains data of various types. The conditional input operation is the complement of the conditional output operation.
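The two operations and their complementary relationship can be illustrated with plain Python lists standing in for streams. This is a sketch of the data movement only; the function names and the restriction to two output streams are assumptions made for the example.

```python
def conditional_output(stream, condition_codes):
    """Filter one stream into two, so that each output holds a single kind of data."""
    taken, not_taken = [], []
    for element, cc in zip(stream, condition_codes):
        (taken if cc else not_taken).append(element)
    return taken, not_taken

def conditional_input(condition_codes, stream_true, stream_false):
    """Merge two streams: each condition code picks which stream supplies the next element."""
    it_true, it_false = iter(stream_true), iter(stream_false)
    return [next(it_true) if cc else next(it_false) for cc in condition_codes]

stream = [10, 11, 12, 13, 14]
codes = [1, 0, 0, 1, 1]

taken, not_taken = conditional_output(stream, codes)
merged = conditional_input(codes, taken, not_taken)
print(taken, not_taken, merged)
```

Merging with the same condition codes restores the original order, which is the sense in which the input operation complements the output operation.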
The conditional execution unit is divided into a control unit, a communication unit and a buffer unit. The control unit saves and updates the status bits used during conditional execution and coordinates the other units to complete the conditional execution operation jointly. The control unit passes the updated status bits to the communication unit and the buffer unit; the communication unit fetches data from other arithmetic groups according to the signals it receives; and the buffer unit buffers data between the stream buffer and the arithmetic group, providing the buffer area for conditional operations. During a conditional input operation, data are first read from the stream buffer into the buffer unit; the control unit updates the state of this conditional input according to the condition codes of all arithmetic groups and the condition state of the previous loop iteration, and sends it to the other conditional operation units; the communication unit then reads data from the corresponding entry of the buffer unit according to the control signals. During a conditional output operation, the control unit first produces the condition state of the current iteration from the condition codes and the condition state of the previous iteration, and sends it out; the communication unit stores the local valid data in the corresponding entry of the buffer unit according to the control signals; a data transfer operation then writes the data into the stream buffer.
The whole LRF address space consists of multiple completely independent address spaces, with each input of each arithmetic unit corresponding to one address space. These spaces are logically independent: no arithmetic group can address them directly, and the local arithmetic group can access each address space only by indirect addressing through the unified controller. The local arithmetic group has no access instructions of its own for these address spaces; all read and write accesses are controlled uniformly by the controller. For a register write, the controller selects the data from the intra-group buses of the arithmetic group and the data are written to the address that the controller specifies. For a register read, the controller sends the read address directly to the register, and the data read out go directly to the arithmetic unit. The LRFs are distributed at the input ports of the arithmetic units as independent physical and logical spaces and buffer the input data of the arithmetic units. This increases the processing bandwidth of kernel-level computation, making the design better suited to large data volumes, and speeds up the arithmetic units so that they are not left idle for lack of available data.
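A toy Python model may help make the addressing scheme concrete. Everything here is a hypothetical illustration: the class names, the 16-entry size, the unit labels and the three-port layout are assumptions, and the model captures only the idea that every access goes through the controller with a controller-supplied address.

```python
class LocalRegisterFile:
    """One independent address space, attached to a single input port of one unit."""
    def __init__(self, entries=16):            # entry count is an assumption
        self.regs = [0] * entries

class Controller:
    """Sole owner of LRF addresses; functional units never address an LRF themselves."""
    def __init__(self, units, ports_per_unit=3):
        self.lrfs = {(u, p): LocalRegisterFile()
                     for u in units for p in range(ports_per_unit)}

    def write(self, unit, port, address, bus_values, bus_select):
        # The controller picks one intra-group bus and supplies the write address.
        self.lrfs[(unit, port)].regs[address] = bus_values[bus_select]

    def read(self, unit, port, address):
        # The controller supplies the read address; data go straight to the unit input.
        return self.lrfs[(unit, port)].regs[address]

ctrl = Controller(units=["FMAC0", "FMAC1", "FMAC2", "FMAC3"])
buses = [7, 8, 9, 10]                           # values currently on four result buses
ctrl.write("FMAC0", port=0, address=5, bus_values=buses, bus_select=2)
print(ctrl.read("FMAC0", 0, 5), ctrl.read("FMAC1", 0, 5))
```

The same address in two different LRFs refers to two unrelated registers, which is what independence of the address spaces means in this model.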
All functional units inside the arithmetic group are connected by a fully connected crossbar switch. The inputs of all LRFs are obtained from the intra-group switch, and the outputs of all functional units are sent directly to the crossbar switch (except for special functional units, whose outputs only need to be stored in their own dedicated LRFs).
The whole arithmetic group is fully pipelined, and dependences between instructions are handled by the compiler and a unified controller. The controller contains a dedicated memory that stores the compiled instructions. During program execution, the controller sends the compiled instructions simultaneously, in accordance with timing requirements, to the functional units of every arithmetic group; each arithmetic group receives and executes the instructions in SIMD fashion, and the results are stored in the same LRF in each arithmetic group or are output. The arithmetic groups are connected to one another by a fully crossed inter-group interconnect bus.
In this embodiment, Fig. 1 shows the overall structure of an arithmetic group comprising 4 arithmetic units. The arithmetic group consists of two major parts: arithmetic units and non-arithmetic units. The arithmetic units comprise four floating-point multiply-add units; in an actual implementation with very high performance requirements, the number of floating-point multiply-add units can be increased, with a corresponding increase in the number of intra-group buses. The non-arithmetic units comprise the SP, JB, VAL and COMM units, which work together to process conditional streams. In addition, the COMM communication unit can on its own carry out communication between arithmetic groups, which is very important for remedying the shortcomings of SIMD processors. The output of each functional unit has a dedicated internal bus. The inputs of each functional unit's input LRFs are provided by an internal full crossbar switch: the crossbar receives the outputs of all functional units, selects one bus result for each LRF according to the controller's signals, and writes it into the LRF entry at the address sent by the controller. The inputs of the JB/VAL LRFs in the figure come from those units themselves; they are special LRFs, and their input sources are not drawn. In the YHFT64-2 stream processor, to reduce the load on the internal buses, the number of input streams is reduced to 4, each occupying one internal bus, and the number of output streams is 8, each requiring a dedicated full crossbar switch. This connection structure can be adjusted according to actual performance requirements. For example, adding one arithmetic unit and one input stream adds 2 internal buses, which are connected to the input of every crossbar switch, and the compiler configuration and control signal connections are modified accordingly.
Fig. 2 shows the data path when the arithmetic group performs one multiply-add operation. A multiply-add needs 3 operands; in this example the three operands come from 3 separate data streams, each connected to one of 3 intra-group buses, as shown in the figure. The compiler determines which multiply-add unit is currently idle and assigns the multiply-add operation to it. In this example multiply-add unit 0 is idle. Under controller control, it receives the three externally supplied data items from the intra-group buses and buffers them in the LRFs of its three inputs; after buffering in the LRFs, the data are sent into the multiply-add unit for computation, and the result is placed on an intra-group bus. The output unit selects, through the intra-group crossbar switch, the result that multiply-add unit 0 placed on the bus and outputs it.
Fig. 3 shows the detailed data path of an arithmetic group comprising four arithmetic units. The arithmetic group consists of multiple functional units and contains multiple local register files (LRFs); the functional units and LRFs are connected to each other by a set of intra-group buses. Arithmetic groups are connected to the inter-group bus through the communication unit, and to the stream buffers through the input/output unit. The functional units perform the various arithmetic and other operations; the LRFs are the data source and temporary storage for the functional units; and the condition code register file (CCRF) stores the results of comparison instructions, which are used for data path selection and conditional stream operations. Data exchange between functional units, and data transfer between the stream buffers and the functional units, are carried out by the intra-group crossbar interconnect switch. Every functional unit places its output on a result bus, and the input of every LRF can receive from all result buses, so the crossbar switch forms a fully connected structure between all functional units and LRFs.
The functional units of the arithmetic group fall roughly into two classes: arithmetic units and non-arithmetic units. The arithmetic units execute integer and floating-point instructions and comprise 4 multiply-add units (FMAC0, FMAC1, FMAC2 and FMAC3) and one reciprocal/inverse-square-root unit, DSQ. The non-arithmetic units support conditional streams and data movement and comprise the I/O unit, the inter-group communication unit COMM, the scratchpad register unit SP, and the conditional execution control units JB and VAL. The result of each unit is sent directly to an intra-group bus, and each input of a unit is selected directly by the intra-group crossbar switch.
Fig. 4 shows the interconnection between the arithmetic groups and the controller. The YHFT64-2 stream processor contains 4 arithmetic groups. The actual number of arithmetic groups can be increased or reduced according to performance requirements (it is generally an integer power of 2); only the number of buses between arithmetic groups needs to change, because each arithmetic group has its own dedicated bus. The figure shows the connections between 8 arithmetic groups and the controller. The controller and the arithmetic groups are connected by a set of fully crossed interconnect buses: each arithmetic group and the controller place their local data on their own bus and can receive data from any other arithmetic group, and the communication unit inside each arithmetic group handles communication between that arithmetic group, the other arithmetic groups and the controller. The controller is also responsible for sending control instructions to all arithmetic groups, which all operate on their data in SIMD fashion.
Fig. 5 shows the connections among the subcomponents inside the FMAC arithmetic unit. The FMAC unit fuses floating-point and integer execution: the number of partial-product bits of the floating-point multiplier is extended accordingly to implement 8-, 16-, 32- and 64-bit integer multiplication. The ALU subcomponent performs the other integer operations, and the FMISC subcomponent performs floating-point test and logical operations. As shown in the figure, the opcode is sent to all three subcomponents simultaneously and each decodes it independently; a subcomponent performs the operation if the instruction belongs to it and otherwise sets its output to 0. FPMA performs the multiply-add operation and therefore needs three operands; ALU and FMISC need only 2 operands, and in YHFT64-2 the first 2 operands are designated as theirs. Finally, the results of the three subcomponents are ORed together to give the final result, which is correct because the instruction decode signals force the results of the non-selected subcomponents to zero.
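The decode-then-OR scheme can be modeled in a few lines of Python. The sketch below is illustrative only: the opcode names are hypothetical, 64-bit integers stand in for all operand types, and the three functions merely mimic the rule that a subcomponent returns zero for instructions it does not own.

```python
MASK64 = (1 << 64) - 1

def fpma(opcode, a, b, c):
    """Multiply-add subcomponent: uses all three operands (integer stand-in)."""
    return (a * b + c) & MASK64 if opcode == "mul_add" else 0

def alu(opcode, a, b, c):
    """Integer subcomponent: uses only the first two operands."""
    if opcode == "add":
        return (a + b) & MASK64
    if opcode == "and":
        return a & b
    return 0

def fmisc(opcode, a, b, c):
    """Test/logic subcomponent: uses only the first two operands."""
    return int(a == b) if opcode == "test_eq" else 0

def fmac_unit(opcode, a, b, c=0):
    # Every subcomponent sees the opcode; at most one produces a nonzero result,
    # so a bitwise OR selects the right output without a multiplexer.
    return fpma(opcode, a, b, c) | alu(opcode, a, b, c) | fmisc(opcode, a, b, c)

print(fmac_unit("mul_add", 3, 4, 5), fmac_unit("add", 3, 4), fmac_unit("test_eq", 9, 9))
```

The OR merge is only valid because the opcode sets are disjoint; if two subcomponents claimed the same opcode, their results would be mixed bitwise.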
Fig. 6 shows a processing example for the conditional execution unit with four arithmetic groups. In the kernel program of figure (a), Compter_1() is first computed on every element of an input stream, and computer_circle() is then computed only on the circle elements of the stream. Figure (b) shows how this program executes on a stream processor with only four arithmetic groups and no conditional processing mechanism. Without any conditional processing mechanism, the workloads of the four arithmetic groups become unbalanced during the computer_circle() computation: arithmetic group 1 has to process four data elements, while arithmetic group 0 and arithmetic group 3 each have to process only 1. Because of the SIMD execution model, arithmetic group 0 and arithmetic group 3 must wait for arithmetic group 1 after finishing their own data, and only when arithmetic group 1 has finished can the four arithmetic groups proceed together to the next step. Under SIMD control, this load imbalance caused by control dependences wastes a great deal of resources in the stream processor. The arithmetic groups that finish first must idle and wait for the others, and the next computing task can start only after all arithmetic groups have finished. The data entering subsequent operations remain unbalanced across the arithmetic groups, so the imbalance persists throughout the execution of the kernel, and the more computing steps the kernel has, the more serious the resulting waste of resources.
One design principle of the YHFT64-2 stream processor is that the control mechanism should be as simple as possible. The YHFT64-2 stream processor supports no branch instructions other than loops; it uses conditional execution to partly solve the load imbalance caused by control dependences. Normal data stream access by an arithmetic group is unconditional: every arithmetic group always reads and writes the data streams at the same time, and arithmetic groups and stream buffers correspond one to one. For example, arithmetic group 0 can access only the data in stream buffer 0 and has no way of accessing data in the other stream buffers. In conditional processing, by contrast, an arithmetic group accesses the data stream selectively according to its condition code, so that the data stream is expanded or compressed.
Figure (c) shows how the program of figure (a) executes under YHFT64-2 conditional execution. The conditional mechanism first filters the data in the stream that would cause load imbalance, forming a stream that contains only one type of data. This stream is then divided evenly among the arithmetic groups for computation, so that all arithmetic groups process data elements of the same type at the same time. No arithmetic group then has to sit idle after finishing its own work while waiting for the others, and the load imbalance is avoided.
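The benefit can be estimated with a small Python model that counts SIMD rounds. The workload below follows the example in the text for arithmetic groups 0, 1 and 3 (1, 4 and 1 elements); the count for arithmetic group 2 is not given in the text, so the value 2 is a hypothetical placeholder, and the function names are likewise illustrative.

```python
import math

def rounds_without_conditions(work_per_group):
    """Lock-step SIMD: every group waits for the most heavily loaded one."""
    return max(work_per_group)

def rounds_with_conditional_streams(work_per_group):
    """After filtering into one stream, elements are spread evenly over the groups."""
    return math.ceil(sum(work_per_group) / len(work_per_group))

def idle_slots(work_per_group, rounds):
    """Group-rounds in which a group has nothing useful to do."""
    return rounds * len(work_per_group) - sum(work_per_group)

work = [1, 4, 2, 1]  # circle elements per arithmetic group; entry 2 is assumed

before = rounds_without_conditions(work)
after = rounds_with_conditional_streams(work)
print(before, idle_slots(work, before))
print(after, idle_slots(work, after))
```

Under these assumed numbers the unbalanced run needs four rounds with half of the group-rounds idle, while the compacted stream finishes in two rounds with none idle.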
Fig. 7 shows the processing flow of conditional execution. For a conditional input operation, the condition state (cond) is initialized before the loop is entered, and a batch of data is input to fill the data buffer. A condition code (CC) of 1 indicates that the current arithmetic group requests valid data. In each iteration, the condition state is updated from that iteration's condition code and the condition state of the previous iteration, and is sent to the data buffer SP, the communication unit and the controller. The communication unit is controlled to select data from the corresponding SP entry for the local arithmetic group, and the controller, according to the ready signal generated, controls the INOUT unit to read data from the stream buffer. For a conditional output operation, the conditional loop likewise starts after the condition state has been initialized. Unlike in the conditional input operation, a CC of 1 indicates that the data of the local arithmetic group is valid; the communication unit sends the local data to the data buffer, and once the data buffer is full, the controller directs the INOUT unit to perform one output operation. At the end of a conditional output operation, a FLUSH operation can be performed on the stream as the programmer requires, to pad the stream.
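The two flows can be approximated by a short behavioural model. It is only a sketch: the condition-state update, the ready signal and the SP entry selection are not modelled, the buffer is an ordinary queue, and the function names and the padding value are hypothetical.

```python
from collections import deque


def conditional_input(data_buffer, cc):
    """One loop iteration of conditional input.

    Groups whose CC is 1 receive the next buffered elements in order;
    groups whose CC is 0 receive nothing (None) in this iteration.
    """
    received = []
    for code in cc:
        if code == 1 and data_buffer:
            received.append(data_buffer.popleft())
        else:
            received.append(None)
    return received


def conditional_output(data_buffer, cc, local_data, capacity, stream_out):
    """One loop iteration of conditional output.

    Groups whose CC is 1 send their local value to the buffer; each time
    the buffer fills, one output operation writes it to the stream.
    """
    for code, value in zip(cc, local_data):
        if code == 1:
            data_buffer.append(value)
            if len(data_buffer) == capacity:
                stream_out.append(list(data_buffer))
                data_buffer.clear()


def flush(data_buffer, capacity, stream_out, pad=None):
    """Pad a partly filled buffer and write it out (assumed FLUSH behaviour)."""
    if data_buffer:
        padding = [pad] * (capacity - len(data_buffer))
        stream_out.append(list(data_buffer) + padding)
        data_buffer.clear()


in_buf = deque(range(8))
print(conditional_input(in_buf, [1, 0, 1, 1]))   # [0, None, 1, 2]
```

The input side expands a dense stream onto the groups that request data, and the output side compresses the valid results of the groups back into a dense stream.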

Claims (4)

1. A 64-bit floating-point/integer fused arithmetic group supporting local registers and conditional execution, characterized in that: it comprises one or more 64-bit floating-point/integer fused arithmetic units, a conditional execution unit for resolving control dependences, a component interconnect structure for cross-sharing data, and a controller component that exercises overall control over the above three kinds of components. Inside the arithmetic group, the arithmetic units and the conditional execution unit are connected to each other through the component interconnect structure, and data transfer between components is carried out through it; that is, all arithmetic units and the conditional execution unit obtain their inputs from the interconnect structure, and their results are likewise delivered to the interconnect structure, which handles them accordingly. The controller sends instructions to the whole arithmetic group through the interconnect structure and controls its functions accordingly. The storage address space of the whole arithmetic group consists of several completely independent address spaces: each input of each functional unit inside each arithmetic group corresponds to an independent address space, and this address space cannot be accessed by direct addressing; a local arithmetic group can access each address space only by indirect addressing through the unified controller. The controller has a dedicated instruction memory that stores the compiled instructions. During program execution, the controller sends the compiled instructions simultaneously to the functional units of every arithmetic group as the timing requires; the arithmetic groups receive and execute the instructions in SIMD fashion, and the results are stored in the same LRF in each arithmetic group or are output.
2. The 64-bit floating-point/integer fused arithmetic group supporting local registers and conditional execution according to claim 1, characterized in that: the conditional execution unit comprises a control component, a communication component and a buffer component. The control component saves and updates the status bits used during conditional execution and coordinates the other components so that they complete the conditional execution operation together; it passes the updated status bits to the communication component and the buffer component. The communication component obtains data from other arithmetic groups according to the signals it receives. The buffer component buffers data between the stream buffer and the arithmetic group, providing a buffer for conditional operations.
3. The 64-bit floating-point/integer fused arithmetic group supporting local registers and conditional execution according to claim 1 or 2, characterized in that: the arithmetic units comprise a floating-point multiply-add unit, a floating-point miscellaneous unit and an integer arithmetic logic unit.
4. The 64-bit floating-point/integer fused arithmetic group supporting local registers and conditional execution according to claim 3, characterized in that: the arithmetic units comprise a DSQ unit for computing floating-point reciprocals and reciprocal square roots; the DSQ unit obtains the reciprocal or reciprocal square root of the data by table lookup and, combined with software iteration, can realize high-precision division or square-root operations as required.
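The combination of table lookup and software iteration named in claim 4 can be illustrated as follows. The claim does not specify the iteration or the table size; this sketch assumes Newton-Raphson refinement, a 64-entry table, and positive finite inputs, and all names in it are hypothetical.

```python
import math

N = 64  # number of table entries (an assumed size)

# Seed tables indexed by the leading mantissa bits; entries use interval midpoints.
RECIP_TABLE = [1.0 / (1.0 + (i + 0.5) / N) for i in range(N)]                  # m in [1, 2)
RSQRT_TABLE = [1.0 / math.sqrt(1.0 + 3.0 * (i + 0.5) / N) for i in range(N)]   # m in [1, 4)


def _split(d):
    """Return (m, e) with d == m * 2**e and 1 <= m < 2, for positive finite d."""
    m, e = math.frexp(d)          # frexp gives 0.5 <= m < 1
    return m * 2.0, e - 1


def reciprocal(d, iterations=3):
    """Coarse 1/d from the table, refined by Newton-Raphson steps in software."""
    m, e = _split(d)
    idx = min(int((m - 1.0) * N), N - 1)
    x = math.ldexp(RECIP_TABLE[idx], -e)
    for _ in range(iterations):
        x = x * (2.0 - d * x)     # each step roughly squares the relative error
    return x


def rsqrt(d, iterations=4):
    """Coarse 1/sqrt(d) from the table, refined by Newton-Raphson steps."""
    m, e = _split(d)
    if e % 2:                     # make the exponent even so it halves exactly
        m, e = m * 2.0, e - 1
    idx = min(int((m - 1.0) / 3.0 * N), N - 1)
    y = math.ldexp(RSQRT_TABLE[idx], -(e // 2))
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * d * y * y)
    return y


def divide(a, b):
    """Division as multiplication by the refined reciprocal."""
    return a * reciprocal(b)


def sqrt(d):
    """Square root as d * (1 / sqrt(d))."""
    return d * rsqrt(d)
```

A small table keeps the lookup cheap, and the number of software iterations trades latency for precision, which matches the "as required" wording of the claim.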
Application CN 200710034572 (priority date 2007-03-19, filing date 2007-03-19): 64 bit floating-point integer amalgamated arithmetic group capable of supporting local register and conditional execution. Status: Pending. Publication: CN101021832A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200710034572 CN101021832A (en) 2007-03-19 2007-03-19 64 bit floating-point integer amalgamated arithmetic group capable of supporting local register and conditional execution

Publications (1)

Publication Number Publication Date
CN101021832A true CN101021832A (en) 2007-08-22

Family

ID=38709604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200710034572 Pending CN101021832A (en) 2007-03-19 2007-03-19 64 bit floating-point integer amalgamated arithmetic group capable of supporting local register and conditional execution

Country Status (1)

Country Link
CN (1) CN101021832A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102053816A (en) * 2010-11-25 2011-05-11 中国人民解放军国防科学技术大学 Data shuffling unit with switch matrix memory and shuffling method thereof
CN103221935A (en) * 2010-11-18 2013-07-24 德克萨斯仪器股份有限公司 Method and apparatus for moving data from a SIMD register file to general purpose register file
CN105335127A (en) * 2015-10-29 2016-02-17 中国人民解放军国防科学技术大学 Scalar operation unit structure supporting floating-point division method in GPDSP
CN105814538A (en) * 2013-10-23 2016-07-27 芬兰国家技术研究中心股份公司 Floating-point supportive pipeline for emulated shared memory architectures
CN106030517A (en) * 2013-12-19 2016-10-12 芬兰国家技术研究中心股份公司 Architecture for simulating long-latency operations in shared memory architectures
CN106528044A (en) * 2010-09-24 2017-03-22 英特尔公司 Processor, instruction execution method, and calculating system
CN107873091A (en) * 2015-07-20 2018-04-03 高通股份有限公司 SIMD sliding window computings
CN108958705A (en) * 2018-06-26 2018-12-07 天津飞腾信息技术有限公司 A kind of floating-point fusion adder and multiplier and its application method for supporting mixed data type
CN110688159A (en) * 2017-07-20 2020-01-14 上海寒武纪信息科技有限公司 Neural network task processing system
CN112579519A (en) * 2021-03-01 2021-03-30 湖北芯擎科技有限公司 Data arithmetic circuit and processing chip
CN112711443A (en) * 2018-11-09 2021-04-27 英特尔公司 System and method for executing 16-bit floating-point vector dot-product instructions

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528044A (en) * 2010-09-24 2017-03-22 英特尔公司 Processor, instruction execution method, and calculating system
CN103221935A (en) * 2010-11-18 2013-07-24 德克萨斯仪器股份有限公司 Method and apparatus for moving data from a SIMD register file to general purpose register file
CN103221935B (en) * 2010-11-18 2016-08-10 德克萨斯仪器股份有限公司 The method and apparatus moving data to general-purpose register file from simd register file
CN102053816A (en) * 2010-11-25 2011-05-11 中国人民解放军国防科学技术大学 Data shuffling unit with switch matrix memory and shuffling method thereof
CN105814538A (en) * 2013-10-23 2016-07-27 芬兰国家技术研究中心股份公司 Floating-point supportive pipeline for emulated shared memory architectures
CN105814538B (en) * 2013-10-23 2020-04-14 芬兰国家技术研究中心股份公司 Floating-point-enabled pipeline for emulating shared memory architectures
CN106030517A (en) * 2013-12-19 2016-10-12 芬兰国家技术研究中心股份公司 Architecture for simulating long-latency operations in shared memory architectures
CN107873091A (en) * 2015-07-20 2018-04-03 高通股份有限公司 SIMD sliding window computings
CN105335127A (en) * 2015-10-29 2016-02-17 中国人民解放军国防科学技术大学 Scalar operation unit structure supporting floating-point division method in GPDSP
CN110688159B (en) * 2017-07-20 2021-12-14 上海寒武纪信息科技有限公司 Neural network task processing system
CN110688159A (en) * 2017-07-20 2020-01-14 上海寒武纪信息科技有限公司 Neural network task processing system
CN108958705A (en) * 2018-06-26 2018-12-07 天津飞腾信息技术有限公司 A kind of floating-point fusion adder and multiplier and its application method for supporting mixed data type
CN108958705B (en) * 2018-06-26 2021-11-12 飞腾信息技术有限公司 Floating point fusion multiply-add device supporting mixed data types and application method thereof
CN112711443A (en) * 2018-11-09 2021-04-27 英特尔公司 System and method for executing 16-bit floating-point vector dot-product instructions
US12008367B2 (en) 2018-11-09 2024-06-11 Intel Corporation Systems and methods for performing 16-bit floating-point vector dot product instructions
CN112579519B (en) * 2021-03-01 2021-05-25 湖北芯擎科技有限公司 Data arithmetic circuit and processing chip
CN112579519A (en) * 2021-03-01 2021-03-30 湖北芯擎科技有限公司 Data arithmetic circuit and processing chip

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication