Summary of the invention
The technical problem to be solved by the present invention is as follows. In view of the technical problems of the prior art, the invention provides a 64-bit floating-point/integer merged arithmetic group that supports local registers and conditional execution. It is a 64-bit arithmetic group unit that effectively supports media-stream-like computation, meets the ever-growing demands of current high-performance scientific computing, and resolves, in a simple and effective way, the sharp performance drop that control dependences cause in massively parallel computation.
To solve the above technical problem, the present invention proposes the following solution: a 64-bit floating-point/integer merged arithmetic group supporting local registers and conditional execution, characterized in that it comprises one or more 64-bit merged floating-point/integer arithmetic units, a conditional execution unit for resolving control dependences, and an interconnect structure and a controller for cross-sharing data among units. The entire LRF address space consists of several fully independent address spaces, and each input of each arithmetic unit corresponds to one address space. No arithmetic group can address these spaces directly; the local arithmetic group can access each address space only by indirect addressing through the unified controller. The controller contains storage dedicated to compiled instructions. During program execution, the controller issues the compiled instructions simultaneously to the functional units of every arithmetic group as the timing requires; each arithmetic group receives and executes the instructions in SIMD fashion, and each result is either written to the same LRF in every arithmetic group or output.
The conditional execution unit comprises a control unit, a communication unit and a buffer unit. The control unit stores and updates the status bits used during conditional execution and coordinates the other units so that they complete the conditional operation together. The control unit passes the updated status bits to the communication unit and the buffer unit; the communication unit fetches data from other arithmetic groups according to the signals it receives; and the buffer unit buffers data between the stream buffer and the arithmetic group, providing buffer space for conditional operations.
The arithmetic unit comprises a floating-point multiply-add unit, a floating-point miscellaneous unit and an integer arithmetic logic unit.
The arithmetic unit comprises a DSQ unit for computing the floating-point reciprocal and reciprocal square root. The DSQ unit obtains the reciprocal or reciprocal square root of a datum by table lookup and, combined with software iteration as required, can implement high-precision division or square root operations.
Compared with the prior art, the advantages of the present invention are as follows:
1. Extended data width. The datapath of a traditional 32-bit stream processor is extended to 64 bits, and the number of input/output streams is adjusted appropriately to overcome the latency penalty that high data bandwidth would otherwise cause. Without reducing computing speed, the design remains compatible with the processing mode of 32-bit stream processors while effectively supporting 64-bit scientific computing and wider data.
2. 64-bit floating-point multiply-add design. The 2-input arithmetic unit structure of the traditional 32-bit stream processor is improved by adopting a 3-input 64-bit floating-point multiply-add structure, which fuses one multiplication and one addition into a single multiply-add operation to increase arithmetic speed. Multiple 64-bit floating-point multiply-add units can be integrated efficiently within the stream processor framework to deliver the peak computing speed that the design requires.
3. Merged multimedia computation. Multimedia instructions are implemented inside the floating-point multiply-add unit without disturbing its existing structure, so the invention can effectively process 8-, 16-, 32- and 64-bit multimedia instructions. While performing wide-data scientific computing, it can also meet the needs of narrow-data multimedia programs.
4. Conditional cross processing. Building on the traditional stream processor structure, an effective conditional execution unit is designed that supports conditional processing in a simple and effective way and compensates for an inherent weakness of the SIMD structure. A processor using the invention can therefore handle data crossing between arithmetic groups more efficiently, avoiding the performance drop that conditional execution causes in conventional stream processors.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
The 64-bit floating-point/integer merged arithmetic group of the present invention, which supports local registers and conditional execution, comprises one or more 64-bit merged floating-point/integer arithmetic units, a conditional execution unit for resolving control dependences, and an interconnect structure and a controller for cross-sharing data among units. The entire LRF address space consists of several fully independent address spaces, and each input of each arithmetic unit corresponds to one address space. No arithmetic group can address these spaces directly; the local arithmetic group can access each address space only by indirect addressing through the unified controller. The controller contains storage dedicated to compiled instructions. During program execution, the controller issues the compiled instructions simultaneously to the functional units of every arithmetic group as the timing requires; each arithmetic group receives and executes the instructions in SIMD fashion, and each result is either written to the same LRF in every arithmetic group or output. The overall logical structure of the arithmetic group thus comprises the one or more 64-bit merged floating-point/integer arithmetic units, the conditional execution unit that resolves control dependences, and the interconnect structure for cross-sharing data.
Besides double-precision floating-point fused multiply-add, the arithmetic unit must also perform arithmetic and logical operations on fixed-point numbers. Logically, the arithmetic unit can be divided into three parts: the floating-point multiply-add unit (FMAC), the floating-point miscellaneous unit (FMISC) and the integer arithmetic logic unit (ALU). Floating-point instructions follow the IEEE-754 international standard, and floating-point data are represented in the IEEE double-precision format (an 11-bit exponent and a 53-bit mantissa that includes the implicit leading 1). So that denormalized data can be processed inside the arithmetic group, the exponent is extended by one bit, to 12 bits.
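The double-precision layout and the reason a wider internal exponent helps can be illustrated with a short Python sketch. The function names and the particular internal representation (unbiased exponent plus a 53-bit significand with its leading bit set) are assumptions made for illustration; the source does not describe the internal encoding beyond the 12-bit exponent.

```python
import struct

def decode_double(x):
    """Split a Python float into its IEEE-754 double fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF        # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)      # 52 stored fraction bits
    return sign, exponent, fraction

def to_internal(x):
    """Hypothetical internal form: (sign, unbiased exponent, 53-bit significand).

    Denormals are renormalized so that bit 52 of the significand is set,
    which pushes the exponent below -1022 and therefore needs more range
    than the 11-bit external field provides.
    """
    sign, exponent, fraction = decode_double(x)
    if exponent == 0x7FF:
        raise ValueError("infinities and NaNs are not handled in this sketch")
    if exponent == 0 and fraction == 0:
        return sign, 0, 0                  # signed zero
    if exponent == 0:                      # denormal: no implicit leading 1
        significand, unbiased = fraction, -1022
        while significand < (1 << 52):
            significand <<= 1
            unbiased -= 1
    else:                                  # normal: restore the implicit 1
        significand, unbiased = (1 << 52) | fraction, exponent - 1023
    return sign, unbiased, significand
```

In this sketch the smallest denormal renormalizes to an unbiased exponent of -1074, so the full exponent range runs from -1074 to 1023, which fits in a 12-bit signed field but not in 11 bits.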
The essence of floating-point multiply-add is to treat the addend as one more partial product: it is aligned to the exponent of the product and accumulated with the partial products in a single summation, instead of summing and rounding twice in separate multiply and add steps, which shortens the critical path. The instructions implemented by the floating-point multiply-add unit are mainly: floating-point multiply-add instructions, float-to-integer conversion instructions, floating-point normalization instructions, and the merged 8-, 16-, 32- and 64-bit integer multiply instructions. The instructions implemented by the floating-point miscellaneous unit are mainly floating-point logical instructions and floating-point test instructions. In addition to the usual SIMD-style 8-, 16-, 32- and 64-bit multimedia instructions, the ALU implements integer test instructions, saturating arithmetic instructions, and data movement instructions such as byte shuffle and conditional routing operations.
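The numerical effect of rounding once rather than twice can be shown with a small Python sketch. It models the single rounding with exact rational arithmetic; this is only an illustration of the arithmetic property, not of the hardware datapath, and the function name is hypothetical.

```python
from fractions import Fraction

def fma_single_rounding(a, b, c):
    """Compute a*b + c exactly, then round to double once (models a fused op)."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # int/int true division is correctly rounded

# Operands chosen so that the product a*b is not representable as a double.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

separate = a * b + c                   # product rounded, then sum rounded
fused = fma_single_rounding(a, b, c)   # one rounding at the end
```

Here the exact product is 1 - 2^-54, which rounds to 1.0 when the multiplication is rounded on its own, so the separate computation returns 0.0, while the single-rounding result is -2^-54.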
The floating-point multiply-add unit adopts a low-latency structure in which the normalization shift is performed early and summation and rounding are merged. On top of double-precision floating-point multiply-add, the datapath is widened and related units are added to implement four classes of instructions: float-to-integer conversion, integer-to-float conversion, floating-point normalization, and 8-, 16-, 32- and 64-bit fixed-point multiplication. The whole floating-point multiply-add unit can be divided into multiple clock cycles, and both floating-point and fixed-point numbers support the round-to-nearest-even mode.
In addition, to support division and square root effectively, the present invention provides a dedicated DSQ unit for computing the floating-point reciprocal and reciprocal square root. The DSQ unit obtains the reciprocal or reciprocal square root of a datum by table lookup and, combined with software iteration as required, can implement high-precision division or square root operations. If the value of a double-precision floating-point number is smaller than the smallest representable normalized floating-point number, then, depending on how the mantissa rounds, it is represented either as the smallest normalized floating-point number or directly as a signed zero. In the latter case the sign bit of the result is kept, its exponent and mantissa are set to all zeros, and a floating-point underflow is reported.
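The combination of a lookup-table seed with software iteration can be sketched in Python as follows. The table size (128 entries), the use of Newton-Raphson refinement, the iteration count and all names are assumptions chosen for illustration; the source states only that a table lookup is combined with software iteration. The sketch covers the reciprocal of finite nonzero inputs; a reciprocal square root would follow the same seed-and-refine pattern.

```python
import math

TABLE_BITS = 7  # assumed table size: 2**7 = 128 entries
# Each entry holds the reciprocal of the midpoint of one mantissa interval in [1, 2).
RECIP_TABLE = [1.0 / (1.0 + (i + 0.5) / (1 << TABLE_BITS))
               for i in range(1 << TABLE_BITS)]

def recip_seed(d):
    """Low-precision reciprocal of a finite nonzero d from the table."""
    sign = math.copysign(1.0, d)
    m, e = math.frexp(abs(d))      # abs(d) = m * 2**e with 0.5 <= m < 1
    m, e = m * 2.0, e - 1          # rescale so that 1 <= m < 2
    index = int((m - 1.0) * (1 << TABLE_BITS))
    return sign * math.ldexp(RECIP_TABLE[index], -e)

def reciprocal(d, iterations=3):
    """Refine the seed with Newton-Raphson steps: x <- x * (2 - d * x)."""
    x = recip_seed(d)
    for _ in range(iterations):
        x = x * (2.0 - d * x)
    return x

def divide(n, d):
    """Division expressed as multiplication by the refined reciprocal."""
    return n * reciprocal(d)
```

Each Newton-Raphson step roughly doubles the number of correct bits, so a seed accurate to about 8 bits reaches the limit of double precision after three steps in this sketch.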
The conditional execution unit converts control dependences into data routing. This resolves the sharp performance drop caused by the small number of control dependences found in massively data-parallel code, widens the range of data-parallel programs, and lets the parallel arithmetic groups execute applications that contain control dependences more efficiently.
A conditional operation can take place before the stream data are computed, in which case it is called a conditional input operation, or after they are computed, in which case it is called a conditional output operation. A conditional output operation filters a stream, distributing different data to different streams so that each stream contains only data of one type. Although the streams produced may need different kernels, the data within each stream are all processed identically. A conditional input operation merges several streams, combining different data streams into one stream that then contains data of various types. The conditional input operation is the complement of the conditional output operation.
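At the level of whole streams, the two operations behave like a split and a merge driven by per-element condition codes. The following Python sketch models only that behavior, with streams as lists; the function names and the two-way split are illustrative assumptions and say nothing about how the hardware routes data.

```python
def conditional_output(stream, condition_codes):
    """Split one stream into two, keeping element order within each output."""
    taken = [x for x, cc in zip(stream, condition_codes) if cc]
    not_taken = [x for x, cc in zip(stream, condition_codes) if not cc]
    return taken, not_taken

def conditional_input(stream_a, stream_b, condition_codes):
    """Merge two streams: take the next element of stream_a where the code is
    set, otherwise the next element of stream_b."""
    a, b = iter(stream_a), iter(stream_b)
    return [next(a) if cc else next(b) for cc in condition_codes]

# Hypothetical example: mark which elements are "circles".
elements = ["c0", "s1", "c2", "c3", "s4"]
codes = [1, 0, 1, 1, 0]
circles, others = conditional_output(elements, codes)
restored = conditional_input(circles, others, codes)
```

Merging the two filtered streams with the same condition codes reproduces the original stream, which is the sense in which the two operations are complementary.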
The conditional execution unit is divided into a control unit, a communication unit and a buffer unit. The control unit stores and updates the status bits used during conditional execution and coordinates the other units so that they complete the conditional operation together. The control unit passes the updated status bits to the communication unit and the buffer unit; the communication unit fetches data from other arithmetic groups according to the signals it receives; and the buffer unit buffers data between the stream buffer and the arithmetic group, providing buffer space for conditional operations. In a conditional input operation, data are first read from the stream buffer into the buffer unit. The control unit updates the state of this conditional input from the condition codes of all arithmetic groups and the condition state of the previous loop iteration, and sends it to the other conditional operation units; the communication unit then reads data from the corresponding entry of the buffer unit according to the control signals. In a conditional output operation, the control unit first generates the condition state of the current loop iteration from the condition codes and the condition state of the previous iteration and sends it out; the communication unit stores the local valid data in the corresponding entry of the buffer unit according to the control signals; and a data transfer operation then writes the data to the stream buffer.
The entire LRF address space consists of several fully independent address spaces, and each input of each arithmetic unit corresponds to one address space. These spaces are logically independent: no arithmetic group can address them directly, and the local arithmetic group can access each address space only by indirect addressing through the unified controller. The local arithmetic group has no access instructions for these address spaces; all read and write accesses are controlled uniformly by the controller. A register write is performed by the controller selecting data from the intra-group bus of the arithmetic group and writing them to the address that the controller specifies. A register read is performed by the controller sending the read address directly to the register, and the data read out go directly to the arithmetic unit. The LRFs are distributed at the input ports of the arithmetic units as physically and logically independent spaces and buffer input data for the arithmetic units. This increases the processing bandwidth of kernel-level computation, making the design better suited to large data volumes, and speeds up the arithmetic units so that they are not left idle for lack of available data.
All functional units inside the arithmetic group are connected through one fully interconnected crossbar switch. The inputs of all LRFs are obtained from the intra-group switch, and the outputs of all functional units are driven directly onto the crossbar switch (except for special functional units, whose outputs need only be stored in their dedicated LRFs).
The whole arithmetic group is fully pipelined, and instruction dependences are handled by the compiler and a unified controller. The controller contains storage dedicated to compiled instructions. During program execution, the controller issues the compiled instructions simultaneously to the functional units of every arithmetic group as the timing requires; each arithmetic group receives and executes the instructions in SIMD fashion, and each result is either written to the same LRF in every arithmetic group or output. The arithmetic groups are connected to one another by a fully crossed inter-group interconnect bus.
In this embodiment, Figure 1 shows the overall structure of an arithmetic group containing 4 arithmetic units. The arithmetic group consists of two main parts: arithmetic units and non-arithmetic units. The arithmetic units comprise four floating-point multiply-add units; in an actual implementation with very high performance requirements, the number of floating-point multiply-add units can be increased, with a corresponding increase in the number of intra-group buses. The non-arithmetic units comprise the SP, JB, VAL and COMM units, which cooperate to process conditional streams. In addition, the COMM communication unit can by itself carry out communication between the arithmetic groups, which is important for compensating for the weaknesses of a SIMD processor. The output of each functional unit has a dedicated internal bus. The input of each functional unit's input LRF is supplied by an internal full crossbar switch, which receives the outputs of all functional units, selects one bus result for each LRF according to the controller's signals, and writes it to the LRF entry at the address sent by the controller. In the figure, the inputs of the JB/VAL LRFs come from those units themselves; these are special LRFs, and their input sources are not drawn. In the YHFT64-2 stream processor, to reduce the load on the internal buses, the number of input streams is reduced to 4, each occupying one internal bus, and the number of output streams is 8, each requiring a dedicated full crossbar switch. This connection structure can be adjusted to the actual performance requirements. For example, adding one arithmetic unit and one input stream adds 2 internal buses, which are connected to the inputs of every crossbar switch, and the compiler configuration and the control signal connections are modified accordingly.
Fig. 2 shows the data path of an arithmetic group performing one multiply-add operation. A multiply-add needs 3 operands; in this example the three operands come from 3 data streams and arrive on 3 intra-group buses, as shown in the figure. The compiler identifies a multiply-add unit that is currently idle and assigns the multiply-add operation to it. In this example multiply-add unit 0 is idle, so the controller directs it to receive the three external data items from the intra-group buses and buffer them in the LRFs of its three inputs. After buffering in the LRFs, the data enter the multiply-add unit for computation, and the result is driven onto an intra-group bus. The output unit selects, through the intra-group crossbar switch, the result that multiply-add unit 0 placed on the bus and outputs it.
Fig. 3 shows the detailed data path of an arithmetic group containing four arithmetic units. The arithmetic group is made up of several functional units and several local register files (LRFs), and the functional units and LRFs are connected to one another by a set of intra-group buses. Arithmetic groups are connected to one another through their communication units and the inter-group bus, and each arithmetic group is connected to the stream buffer through its input/output unit. The functional units perform the various arithmetic and other operations; the local register files (LRFs) serve as the data source and temporary storage of the functional units; and the condition code register file (CCRF) stores the results of compare instructions, which are used for data path selection and conditional stream operations. Data exchange between functional units, and data transfer between the stream buffer and the functional units, go through the cross-interconnect switching network inside the arithmetic group. Every functional unit drives its output onto a result bus, and the input of every LRF can receive all result buses, so the cross-interconnect switch forms a fully interconnected structure between all functional units and LRFs.
The functional units of the arithmetic group fall roughly into two classes: arithmetic units and non-arithmetic units. The arithmetic units execute integer and floating-point instructions and comprise 4 multiply-add units (FMAC0, FMAC1, FMAC2 and FMAC3) and one reciprocal/reciprocal square root unit, DSQ. The non-arithmetic units support conditional streams and data movement and comprise the I/O unit, the inter-group communication unit COMM, the scratchpad register unit SP, and the conditional execution control units JB and VAL. The result of each unit is driven directly onto an intra-group bus, and each input of a unit is selected directly by the intra-group crossbar switch.
Fig. 4 shows the interconnection between the arithmetic groups and the controller. The YHFT64-2 stream processor contains 4 arithmetic groups. The actual number of arithmetic groups can be increased or decreased according to performance requirements (it is generally an integer power of 2); only the number of buses between arithmetic groups needs to change, because each arithmetic group has its own dedicated bus. The figure shows the connections between 8 arithmetic groups and the controller. The controller and the arithmetic groups are connected by a set of fully crossed interconnect buses. Each arithmetic group, and the controller, places its local data on its own bus and can receive data from any other arithmetic group; the communication unit inside each arithmetic group handles communication among the arithmetic groups and with the controller. In addition, the controller sends control instructions to all arithmetic groups, and all arithmetic groups operate on data in SIMD fashion.
Fig. 5 shows the connections among the sub-units inside the FMAC arithmetic unit. The FMAC unit merges floating-point and integer execution into one unit: the width of the partial products of floating-point multiplication is extended accordingly to implement multiplication of 8-, 16-, 32- and 64-bit integers. The ALU performs the other integer operations, and the FMISC unit performs floating-point test and logical operations. As shown in the figure, the opcode is sent to all three sub-units at once, and each sub-unit decodes the instruction itself: if the instruction belongs to that sub-unit it performs the operation, otherwise it sets its output to 0. The FMAC sub-unit performs multiply-add and therefore needs three operands; the ALU and FMISC need only 2 operands, which in YHFT64-2 are specified to be the first 2 operands. The results of the three sub-units are finally combined with one OR operation to give the final result, which is correct because the instruction decode signals force the results of the sub-units not selected to zero.
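The zero-and-OR result selection can be modeled in a few lines of Python. The opcode names and the integer-only operations are hypothetical placeholders used to keep the sketch short; only the selection scheme (every sub-unit decodes the opcode, the sub-units not selected output 0, and the outputs are ORed) follows the description above.

```python
MASK64 = (1 << 64) - 1  # results are truncated to 64 bits in this sketch

def fmac(opcode, a, b, c):
    """Multiply-add sub-unit: uses all three operands."""
    return (a * b + c) & MASK64 if opcode == "MULADD" else 0

def alu(opcode, a, b, c):
    """Integer sub-unit: uses only the first two operands."""
    return (a + b) & MASK64 if opcode == "ADD" else 0

def fmisc(opcode, a, b, c):
    """Test sub-unit: uses only the first two operands."""
    return int(a == b) if opcode == "TESTEQ" else 0

def arithmetic_unit(opcode, a, b, c=0):
    """Broadcast the opcode to all sub-units and OR their outputs together."""
    return fmac(opcode, a, b, c) | alu(opcode, a, b, c) | fmisc(opcode, a, b, c)
```

Because at most one sub-unit produces a nonzero value for any opcode, the OR acts as a multiplexer without needing a separate select signal at the output.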
Fig. 6 is a processing example for conditional execution with four arithmetic groups. In the kernel program shown in figure (a), Compter_1 () is first applied to every element of an input stream, and computer_circle () is then applied only to the circle elements of the stream. Figure (b) shows how this program executes on a stream processor that has only four arithmetic groups and no conditional processing mechanism. Without any conditional processing mechanism, the workload of the four arithmetic groups becomes unbalanced during the computer_circle () computation: arithmetic group 1 must process four data elements, whereas arithmetic group 0 and arithmetic group 3 each process only 1. Because of the SIMD model, arithmetic group 0 and arithmetic group 3 must wait for arithmetic group 1 after processing their own data, and only when arithmetic group 1 has finished can the four arithmetic groups proceed to the next computation together. Under SIMD control, the load imbalance caused by such control dependences wastes a large amount of resources in the stream processor: the arithmetic groups that finish first must idle until all arithmetic groups have completed the task before the next task can start, and the data entering the subsequent computations remain unevenly distributed among the arithmetic groups. The load imbalance therefore persists throughout the execution of the kernel, and the more computation steps the kernel has, the more serious the resulting waste of resources.
One design principle of the YHFT64-2 stream processor is that the control mechanism should be as simple as possible. The YHFT64-2 stream processor supports no branch instructions other than loops; it uses conditional execution to partly solve the load imbalance caused by control dependences. Ordinary stream accesses by the arithmetic groups are unconditional: every arithmetic group always reads or writes the data stream at the same time, and arithmetic groups and stream buffers correspond one to one, so that, for example, arithmetic group 0 can access only the data in stream buffer 0 and has no way of reaching the data in other stream buffers. In conditional processing, by contrast, an arithmetic group accesses the data stream selectively according to its condition code (CC), so that the data stream is expanded or compressed.
Figure (c) shows how the program of figure (a) executes under the YHFT64-2 conditional mechanism. The conditional mechanism first filters the data in the stream that causes the load imbalance, forming a stream that contains only one data type, and then distributes this stream evenly among the arithmetic groups. Every arithmetic group thus processes data elements of the same type at the same time, no arithmetic group has to sit idle after finishing while it waits for the others, and the load imbalance is avoided.
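The saving can be estimated with a simple step count, sketched below in Python. The per-group element counts are partly hypothetical: the text gives 4 elements for arithmetic group 1 and 1 element each for arithmetic groups 0 and 3, while the count of 2 for arithmetic group 2 is assumed here purely for illustration. The model also assumes that every element costs one SIMD step.

```python
def simd_steps_without_conditions(per_group_counts):
    """All groups advance in lockstep, so the busiest group sets the step count."""
    return max(per_group_counts)

def simd_steps_with_conditions(per_group_counts):
    """After filtering, the elements are spread evenly over the groups."""
    total = sum(per_group_counts)
    groups = len(per_group_counts)
    return -(-total // groups)  # ceiling division

# Circle elements held by arithmetic groups 0..3 (the value for group 2 is assumed).
circle_counts = [1, 4, 2, 1]
steps_before = simd_steps_without_conditions(circle_counts)
steps_after = simd_steps_with_conditions(circle_counts)
```

With these counts the lockstep version needs 4 steps and the balanced version needs 2, ignoring the cost of the conditional stream operations themselves.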
Fig. 7 shows the processing flow of conditional execution. For a conditional input operation, the condition state is initialized before the loop is entered, and a batch of data is read in to fill the data buffer. A condition code CC of 1 indicates that the current arithmetic group is requesting valid data. In each iteration, the condition state of the iteration is updated from the condition code of that iteration and the condition state of the previous iteration and is sent to the data buffer SP, the communication unit and the controller. The communication unit is directed to select data from the corresponding SP entry for the local arithmetic group, and the controller, according to the ready signal produced, directs the INOUT unit to read data from the stream buffer. For a conditional output operation, the conditional loop is entered after the condition state has been initialized. Unlike the conditional input operation, a CC of 1 here indicates that the data of the local arithmetic group are valid; the communication unit sends the local data to the data buffer, and once the data buffer is full the controller directs the INOUT unit to perform one output operation. At the end of a conditional output operation, a FLUSH operation can be applied to the stream at the programmer's request to pad the stream.