Summary of the invention
The main technical problem to be solved in the present invention is, a kind of lower hardware mapping method of integrated circuit and device are provided, and can improve the design rate of integrated circuit.
The present invention also provides a kind of Data Control stream generating method and device, and comprising data dependence, data can concurrency and the analysis of corresponding control information, can ancillary hardware circuit designer carry out circuit design.
For solving the problems of the technologies described above, the technical solution used in the present invention is as follows:
A lower hardware mapping method of integrated circuit, comprises step:
Process analysis step, for reading the computer language procedure of describing integrated circuit algorithm, and therefrom identifies mapped execution object and parameter object;
Data Control flow graph generates step, for the execution object identifying and parameter object being mapped to the respective nodes of the Data Control flow graph of describing integrated circuit algorithm;
Operator spacetime diagram generates step, for the function treatment of carrying out according to each node of Data Control flow graph, from the operator cell library of setting up in advance, obtain at least one operator unit of corresponding function, Data Control flow graph is converted to the operator spacetime diagram being formed by operator unit;
Temporal constraint step, for determining total temporal constraint according to the requirement of user specification requirement and target integrated circuit technology, each the operator unit label time in operator spacetime diagram, carries out temporal constraint to each level of operator spacetime diagram;
Spacetime diagram compression step, for according to time-labeling, operator spacetime diagram being carried out to the cluster compression on space, makes overall algorithm execution time close to total temporal constraint;
Lower hardware mapping step, for generating integrated circuit lower hardware logical description according to the operator spacetime diagram after cluster compression.
Method based on above-mentioned, the present invention also provides a kind of integrated circuit lower hardware mapping device, comprising:
Process analysis module, for reading the computerese of describing integrated circuit algorithm, and therefrom identifies mapped execution object and parameter object;
Data Control flow graph generation module, for being mapped to the execution object identifying and parameter object the respective nodes of the Data Control flow graph of describing integrated circuit algorithm;
Operator spacetime diagram generation module, for the function treatment of carrying out according to each node of Data Control flow graph, from the operator cell library of setting up in advance, take out at least one operator unit of corresponding function, Data Control flow graph is converted to the operator spacetime diagram being formed by operator unit;
Temporal constraint module, for determining total temporal constraint according to the requirement of user specification demand and target integrated circuit technology, each the operator unit label time in operator spacetime diagram, carries out temporal constraint to each level of operator spacetime diagram;
Spacetime diagram compression module, for according to time-labeling, spacetime diagram being carried out to the cluster compression on space, and makes it overall algorithm execution time close to total temporal constraint;
Lower hardware mapping block, generates integrated circuit lower hardware logical description according to the spacetime diagram after cluster compression.
The present invention also provides a kind of Data Control stream generating method, comprises step:
Process analysis step, for reading the computer language procedure of describing integrated circuit algorithm, according to the rule of this kind of computerese, from described computer language procedure, identify mapped execution object and parameter object, described execution object comprises operational order and/or steering order, and described parameter object comprises at least one in input data, output data, intermediate data;
Data Control flow graph generates step, for the execution object identifying and parameter object being mapped to the respective nodes of the Data Control flow graph of describing integrated circuit algorithm.
Further, at described Data Control flow graph, generate in step, described operation table command mappings, for processing block diagram, is mapped as to the control stream for identification-state, state transitions condition and state control signal by described steering order, described parameter object is mapped as to the memory node in data stream.
Wherein, described process analysis step comprises: according to the rule of this kind of computerese, from described computer language procedure, find out function, computing in the input and output of described function and function is resolved, identify and carry out object and parameter object, in resolving, if current resolved function is the upper layer functions that comprises lower layer functions, continue lower layer functions to resolve.
Further, described steering order is the recursion instruction in loop statement, and described loop statement comprises quiet cycle statement and dynamic circulation statement, and described Data Control flow graph generation step comprises:
When loop statement is quiet cycle statement, according to cycle index, loop body is launched, after loop body launches, bring parameter object into and obtain new operation expression, operational order in operation expression is mapped as to processing block diagram, the parameter object in operation expression is mapped as to the memory node in data stream;
When loop statement is while can be changed into the dynamic circulation statement of quiet cycle statement, according to the different occasion change state loop statements that call, it is quiet cycle statement, its expansion is obtained to new operation expression, recursion instruction in operation expression is mapped as to processing block diagram, the parameter object in operation expression is mapped as to the memory node in data stream;
When loop statement is individual layer dynamic circulation statement, the content map of loop statement, for processing block diagram, is mapped as to state machine by recursion instruction;
When loop statement is multilayer dynamic circulation statement, by the content map of outer loop statement and interior loop statement, be respectively that the first processing block diagram and second is processed block diagram, outer recursion instruction is mapped as to the first state machine, by interior loop command mappings, be the second state machine, and described second process block diagram and the second state machine and is mapped in described first and processes in block diagram; Or by the content map of outer loop statement and interior loop statement, be respectively that the first processing block diagram and second is processed block diagram, recursion instruction is mapped as to state machine, and described second processes block diagram is mapped in described first and processes in block diagram, and the cycle index that the status number of described state machine equals described circulation adds one.
Further, described steering order is the steering order in branch's control statement, described Data Control flow graph generates step and comprises: described branch steering order is mapped as to MUX, the content map of branch's control statement is that the input end of MUX is processed block diagram, and the controlled condition of branch's control statement is mapped as the control end of MUX and processes block diagram.
Further, described branch control statement is nested branch control statement, Ze Jiang upper strata branch steering order is mapped as the first MUX, the content map of described upper strata branch control statement is that the input end of described the first MUX is processed block diagram, and the controlled condition of described upper strata branch control statement is mapped as the control end of described the first MUX and processes block diagram; Branch of lower floor steering order is mapped as to the second MUX, the content map of branch of described lower floor control statement is that the input end of described the second MUX is processed block diagram, the controlled condition of branch of described lower floor control statement is mapped as the control end of described the second MUX and processes block diagram, and the output of described the first MUX is as the input of described the second MUX.
Method based on above-mentioned, the present invention also provides a kind of Data Control stream generating apparatus, comprising:
Process analysis module, for reading the computer language procedure of describing integrated circuit algorithm, according to the rule of this kind of computerese, from described computer language procedure, identify mapped execution object and parameter object, described execution object comprises operational order and/or steering order, and described parameter object comprises at least one in input data, output data, intermediate data;
Data Control flow graph generation module, for being mapped to the execution object identifying and parameter object the respective nodes of the Data Control flow graph of describing integrated circuit algorithm.
Further, described Data Control flow graph generation module comprises: carry out object map module, for the operational order of described execution object is mapped as to processing block diagram, and for the steering order of described execution object being mapped as to the control stream of identification-state, state transitions condition and state control signal;
Parameter object mapping block, for being mapped as parameter object the memory node in data stream.
The invention has the beneficial effects as follows: by the analysis to former c program or MATLAB program, identify execution object and the parameter object of mapping; And the execution object identifying and parameter object are become to Data Control flow graph, this Data Control flow graph can represent the algorithm of integrated circuit; Then each node in Data Control flow graph is substituted to generating operator spacetime diagram with operator; The operator spacetime diagram generating, through cluster compression, makes the overall execution time of the spacetime diagram after compression close to total temporal constraint; Spacetime diagram after compression is generated to the lower hardware circuit of integrated circuit.Thereby created a kind of mapping tool from computerese to integrated circuit lower hardware circuit, realized to standardization the process of integrated circuit from language generation lower hardwares such as C or MATLAB, implemented convenient and swift.
The present invention also can concurrency and corresponding control information etc. by computer language procedure analysis being obtained to its corresponding data dependence, data, thereby according to the corresponding Data Control flow graph of these Information generations, and wherein controlling expression formula conversion method is the core of all linguistic expression's conversions, and its conversion efficiency directly affects the data volume of generating algorithm Data Control flow graph.The inventive method is taken concurrency factor, hardware configuration into account in advance such as state machine, MUX etc., can help to a greater extent Hardware Engineer to carry out hardware design.
Embodiment
Below by embodiment, by reference to the accompanying drawings the present invention is described in further detail.
Look back the development course that method of designing integrated circuit is learned, can see: when integrated circuit fabrication process enters epoch of 1um, occurred classifying as with gate array the method for designing of elementary cell; When integrated circuit fabrication process enters epoch of 0.5um, there is take the method for designing that standard block is elementary cell; When integrated circuit fabrication process enters epoch of 0.18um, there is take the method for designing that IP kernel is elementary cell.This shows: the design methodology of integrated circuit is along with the development of integrated circuit fabrication process on the one hand, and the unit granularity of the elementary cell (door, standard block, IP kernel) that method of designing integrated circuit is used in learning on the other hand constantly increases.Meanwhile, the appearance of each new elementary cell, all indicates the revolutionary progress of method of designing integrated circuit.Therefore, what can rationally predict is, along with the progress at full speed of integrated circuit fabrication process nearly ten years, especially integrated circuit fabrication process enters after nanoscale, to there is and open the new situation of integrated circuit (IC) design in the elementary cell of coarsegrain more, to adapt to the develop rapidly of integrated circuit fabrication process.
Operator is as the elementary cell in integrated circuit building block, its granularity is greater than the granularity of standard block, the computer programming language that is conducive to describe integrated circuit algorithm shines upon to lower hardware, therefore the present invention adopts the method for designing integrated circuit based on operator, made to accelerate the design rate of integrated circuit, to adapt to the progress of integrated circuit fabrication process.
In the present invention, conventional operator has five classes, is respectively computing class operator, storage class operator, class of paths operator, controls class operator and clock class operator.
1, computing class operator.
Computing operator (AU) is for realizing the elementary cell of logical operation or arithmetical operation or the hybrid operation of logical and arithmetic.It comprises arithmetic logical unit and computing configuration register, computing configuration register is for receiving and storage computing configuration-direct, the arithmetical logic operation that different computing configuration-directs is corresponding different, that is to say, by computing configuration-direct, can make same computing operator realize multiple different function.Below, the ADDS operator of take describes computing operator as example.
Fig. 2 is the structural representation of ADDS operator, it comprises for realizing and adds the ADD unit of reducing and for realizing < </> > unit of shifting function, by the parameter value of control bit X is set, can make ADDS operator realize multiple different function, such as, the form of Fig. 3 shows the corresponding relation of different control bit X values and different operating in a kind of embodiment.As this operator that can realize multiple difference in functionality by control bit X of ADDS, be called restructural operator, restructural operator, because abundant application function can be used in different scenes, has reduced the operator number storing in operator cell library.And restructural operator can also be realized dynamic reconstruct by changing the mode of control bit in its implementation.
2, storage class operator.
Be illustrated in figure 4 the basic structure schematic diagram (in figure, CU represents to control operator) of storage class operator (MU).Storage operators comprises stored configuration register (MU configuration register) and storage unit, and storage unit comprises address-generation unit, data-carrier store, data generation unit and data output control unit.Stored configuration register can be by data output control unit configuration store operator (MU) memory bank (various storage mediums: writing and/or the playback mode MEM such as register, RAM), the working method of can also config memory corresponding address-generation unit.Into precalculated position is directly stored input data in the address generating according to address-generation unit, and the data of needs are exported from deposit position.
3, class of paths operator.
As shown in Figure 5, be the universal architecture schematic diagram of class of paths operator (LU).Class of paths operator LU comprises routing configuration register (LU configuration register) and forms alteration switch and the data register (REG) of Route Selection unit, wherein, routing configuration register is controlled the control of operator CU, controls alteration switch and realize the connection between nonidentity operation operator AU according to the mode of expectation under the control action of controlling operator CU.Data register is for the inputoutput data of temporary computing class operator LU and storage class operator M U.
4, control class operator.
As shown in Figure 6, for controlling the universal architecture schematic diagram of class operator (CU).The effect of controlling class operator is mainly that configuration information is sent to corresponding configuration register, and configuration computing operator AU, storage operators MU and path operator LU realize predetermined function.The form of controlling operator CU comprises three kinds of counter, state machine and micro-orders.Wherein micro-order structure comprises code translator, programmable counter, command memory and Pipeline control module etc.Control operator CU and to each functional unit, send configuration information by carrying out simple configuration-direct, seldom, so order register capacity is little in the instruction of supporting due to CU, and code translator is very simple.
5, clock class operator.
Clock operator is for the clock control signal of computing class operator, storage class operator, class of paths operator and control class operator, and clock signal comprises the signal of controlling clock start-stop and controlling clock frequency, and clock signal can configure according to the mode of expectation.
Above five class operators are the bases of realizing following embodiment, be understandable that, above-mentioned to according to function, the operator for integrated circuit (IC) design being divided into five large classes not exclusive dividing mode, can also carry out targetedly according to actual conditions the division of wide region more or thinner scope.
In an embodiment of the present invention, a kind of mapped system from computerese to integrated circuit lower hardware circuit is provided, be the lower hardware mapping method of integrated circuit of this system as shown in Figure 7a, comprise the following steps: (in the present embodiment, " spacetime diagram " is equal to " operator spacetime diagram ", and the former is only for being called for short.)
Step S1, analyzes program, reads the computer language procedure of describing integrated circuit algorithm, identifies mapped execution object and parameter object according to the rule of this computerese from described computer language procedure.Special IC, for realizing specific agreement or function, and first these functions and agreement are described by computer language procedure conventionally, and computerese wherein adopts C language or MATLAB language etc. conventionally.The computer language procedure of writing is input in mapped system of the present invention, this mapped system identifies mapped execution object and parameter object from described computer language procedure according to the rule of coding computerese used again.
In present embodiment, this execution object comprises operational order and/or steering order, and this parameter object comprises at least one in input data, output data, intermediate data.Operational order in the present embodiment comprises and adding, subtracts, takes advantage of and the computing such as displacement.
Step S2, generated data is controlled flow graph, and the execution object identifying and parameter object are mapped to the respective nodes in the Data Control flow graph of describing integrated circuit algorithm.Described operational order is mapped as to processing block diagram, described steering order is mapped as to the control stream for identification-state, state transitions condition and state control signal, described input data, output data and intermediate data are mapped as to the memory node in data stream.
Step S3, operator spacetime diagram generates step, for the processing capacity of carrying out according to each node of Data Control flow graph, from the operator cell library of setting up in advance, take out at least one operator unit of corresponding function, Data Control flow graph is converted to the operator spacetime diagram being formed by operator unit.First Data Control flow graph is launched according to its data flow dependency, then convert each node after launching to can complete this nodal function operator unit.With the combination of one or more operators unit, replace each node in Data Control flow graph, the combination of one or more operators unit can complete the function identical with each node.For how Data Control flow graph being launched, include but not limited to following several mode: if the data stream in Data Control flow graph is for order related data flow structure, adopt the mode of streamline to launch order related data flow; If there is feedback in the data stream in Data Control flow graph, this data stream is a circulation, and this data stream exists data dependence, so this data stream can not be converted into flowing structure; If but while there is not data dependence between this data stream internal data, described internal data is not existed each data stream of data dependence to adopt the mode of local flow's waterline to launch; If there is not data dependence between the data stream in Data Control flow graph, adopt parallel mode to launch this data stream, and convert the operator spacetime diagram being formed by operator unit to.
Step S4, temporal constraint step, for determining total temporal constraint according to the requirement of user specification requirement and target integrated circuit technology, to each the operator unit label time in operator spacetime diagram.On the other hand, from operator cell library, can extract operator time sequence information, operator spacetime diagram is done to sequential mark, form the object of temporal constraint.Thereby according to data flow characteristic can be by temporal constraint each level specific to operator spacetime diagram, realize each level of operator spacetime diagram carried out to temporal constraint.Because operator can form different operator function pieces, and then form different operator function groups, each operator function group is an operator level.
If described data flow architecture is parallel data stream, total temporal constraint is divided equally to each the operator level being given in corresponding spacetime diagram, and divided the temporal constraint of each operator level equally each operator unit in this operator level.Using the basic sequential unit of the total operator of corresponding each operator level of each node of serial in Data Control stream as overall temporal constraint, the ratio that accounts for the sequential summation that operator unit that in each operator level, the longest arithmetic path shines upon is corresponding according to the sequential of the computing operator that in each operator level, the longest arithmetic path shines upon is distributed the sequential of each operator level.
Step S5, spacetime diagram compression step, for according to time-labeling, spacetime diagram being carried out the cluster compression in space (being on hardware resource or area), and makes it overall algorithm execution time close to total temporal constraint.
In one embodiment, spacetime diagram is compressed and comprised the following steps: in operator spacetime diagram, find out computing class operator and/or the identical storage class operator of memory attribute that attribute is identical; Then according to time-labeling, the identical computing class operator of operational attribute is spatially merged compression and/or the identical storage class operator of memory attribute is spatially merged to compression; Then introduce and control class operator, the computing class operator after compression and/or storage class computing operator are generated to corresponding configuration instruction, realize the multiplexing of computing class operator and/or storage class operator.
The step of cluster compression step and generation restructural operator function piece, all can produce not only a kind of result.The same subfunction of different function calls, because confinement time is different, the cluster result producing is also different.Therefore need to be optimized according to parameters such as time, area, power consumptions, by performance (execution time) discharge order, just the cluster result that meets time-constrain represents that its hardware realizes Least-cost, therefore selects overall algorithm execution time close to completing the optimum results of the spacetime diagram of the needed total temporal constraint of integrated circuit algorithm as cluster compression.
Step S6, lower hardware mapping step, generates integrated circuit lower hardware logical description according to the spacetime diagram after cluster compression.
Lower hardware mapping method of integrated circuit based on above-mentioned, the invention also discloses a kind of integrated circuit lower hardware mapping device, please refer to Fig. 7 b, and the integrated circuit lower hardware mapping device of the present embodiment comprises:
Process analysis module 1 for reading the computer language procedure of describing integrated circuit algorithm, identifies mapped execution object and parameter object from described computer language procedure according to the rule of this computerese; Data Control flow graph generation module 2, for being mapped to by the execution object identifying and parameter object the Data Control flow graph of describing integrated circuit algorithm; Operator spacetime diagram generation module 3, for the processing capacity of carrying out according to each node of Data Control flow graph, from the operator cell library of setting up in advance, take out at least one operator unit of corresponding function, thereby Data Control flow graph is converted to the operator spacetime diagram being formed by operator unit; Temporal constraint module 4, for determining total temporal constraint according to the requirement of user specification demand and target integrated circuit technology, each the operator unit label time in operator spacetime diagram, carries out temporal constraint to each level of operator spacetime diagram; Spacetime diagram compression module 5, for carry out the cluster compression on space during to spacetime diagram according to time-labeling, and makes it overall algorithm execution time close to total temporal constraint; Lower hardware mapping block 6, generates integrated circuit lower hardware circuit according to the spacetime diagram after cluster compression.
Below in conjunction with specific embodiment, lower hardware mapping method of integrated circuit of the present invention and device are described.
H.264 be the common digital video coding standard of formulating of the common joint video team (JVT) of setting up of International Telecommunication Association (ITU-T) and ISO (International Standards Organization) (ISO).The X264_me_search function of the H.264 C language description of standard of take in the present embodiment is example, and method of the present invention is described in more detail.
As shown in Figure 8, for function X264me search is analyzed and comprises step:
S11, read computer language procedure, and search function in this computer language procedure.In the present embodiment, first the fetch program, go forward side by side lang method and lexical analysis, obtain function X264_me_search.
S12, this function is resolved, obtain function calling relationship and the parameter object of this function, this parameter object comprises input data, output data, the input constant of this function, and the intermediate data of layer functions on this, and each parameter object of this function is carried out corresponding data dependence, shares the mark of the information such as storage, distributed store.In the present embodiment, function x264_me_search is analyzed and obtains its input variable, output variable, input constant and output constant is as shown in table 1:
Table 1:
Signal name |
Data type |
Direction |
Explanation |
i_pixel |
Int |
IN |
∥PIXEL WxH |
Continued 1:
In the present embodiment, to this function x264me search, internal analysis obtains its built-in variable and constant, as shown in table 2:
Table 2:
Signal name |
Data type |
Explanation |
i_pixel |
1nt |
∥ pixel WxH |
bcost |
1nt |
The interim optimum estimate point of ∥ |
bmx |
1nt |
The interim motion vector x of ∥ component |
bmy |
1nt |
The interim motion vector y of ∥ component |
p_ffef |
uint8t
* |
∥ reference frame address |
i_iter |
1nt |
∥ loop variable |
The current resolved function of function calling relationship judgement of the current resolved function of S13, basis comprises lower layer functions, and search in this way lower layer functions, and perform step S14, otherwise end operation.In the present embodiment, if do not comprise lower layer functions in current resolved function, this function is also bottom function.This function is being carried out in resolving, function pixel_sad_16 * 16 that obtained this function call, function x264_me_search is upper layer functions, function pixel_sad_16 * 16 are lower layer functions.In the present embodiment, if do not comprise lower layer functions in current resolved function, this function is also bottom function.
S14, lower layer functions is analyzed, obtained the parameter object of this lower layer functions, comprise input data and output data, and each parameter object of this lower layer functions is carried out data dependence, shares the mark of the information such as storage, distributed store.In the present embodiment, lower layer functions pixel_sad_16 * 16 are analyzed, obtained the input and output of these lower layer functions pixel_sad_16 * 16, as shown in table 3:
Table 3:
Signal name |
Data type |
Direction |
Explanation |
pix1 |
uint8t
* |
IN |
∥ input pixel value 1 |
i_stride_pix1 |
1nt |
IN |
∥ pixel value 1 storage line width |
pix2 |
uint8t
* |
IN |
∥ input pixel value 2 |
i_stride_pix2 |
1nt |
IN |
∥ pixel value 1 storage line width |
i_sum |
1nt |
OUT |
∥ SAD result of calculation |
In present embodiment, data dependence refers to analyzes the relation being associated between the variable draw and/or constant, comprises that computing is relevant relevant with storage.Wherein, computing is relevant is that input and output are relevant, and output variable or the output signal of process computing operator are relevant to its input variable or input signal; Storage is relevant comprises that write-read is relevant, read-write is relevant and write relevant, wherein, write-read is relevant refers to that variable proper operation for same memory address is sequentially for first writing and read, and reads variable and is relevant to and writes variable, if traffic error occurs write-after-read in operation; Read-write is relevant refers to that variable proper operation for same memory address is sequentially for first reading and write, and writes variable and is relevant to and reads variable, if write-then-read is that read data is capped and makes a mistake in operation; Writing and relevant referring to that the variable for same memory address will write according to proper operation order, writing and between variable, have correlativity.
In present embodiment, shared storage refers to be connected between the modules of same storage and can mutually access, and the data between them can exchange by sharing storage mode.And distributed store refers to that each module has the storage of its independent allocation, the storage of each other module of module inaccessible, the data communication between them can only be mutual by communication port.In hardware design, the shared storage distributed store that compares has increased extra interconnect resources, therefore when generating algorithm Data Control flow graph, will separate and share storage and distributed store according to its content.
As shown in Figure 9, the execution object and the parameter object that according to analysis, obtain, generated data is controlled flow graph and is comprised step:
S21, the parameter object that identification is obtained are mapped as the memory node in data stream, and distinguish sharing storage and distributed store according to the storage information of mark.
S22, according to mark data dependence, the operational order in function is mapped as to the processing block diagram in Data Control flow graph.The computings such as the operational order in the present embodiment comprises and adding, subtracts, displacement.
S23, according to the data dependence of mark, the steering order in function is mapped as in Data Control flow graph to the control stream for identification-state, state transitions condition and state control signal.
In the present embodiment, steering order comprises call relation between function, recursion instruction etc.Loop statement comprises quiet cycle and dynamic circulation statement, and wherein dynamic circulation statement comprises again dynamic circulation, individual layer dynamic circulation and the multilayer dynamic circulation that can be changed into quiet cycle.Quiet cycle refers to that loop variable is constant; The dynamic circulation that can be changed into quiet cycle refers to that cycle index is variable, once but when the occasion of its application, determine, its loop variable is also just determined, in this circulation, the occasion of application is determined, its cycle index also just becomes constant, thereby becomes quiet cycle from dynamic circulation; Individual layer dynamic circulation refers to that cycle index is variable, and there is no nested other circulations; Multilayer dynamic circulation refers to that cycle index is variable, and is nested with interior loop.
In the present embodiment, when loop statement is quiet cycle, by this quiet cycle, is mapped as control stream and comprises step:
S2411, according to cycle index, loop body is launched, obtain the new loop body with cycle index equivalent number.Each loop body comprises operation expression, and has common parameter object between each operation expression.
In the present embodiment, in function static void predict_16x16_dc, include three loop statements:
Thus, known first loop unrolling is obtained to new operation expression, is respectively:
dc=dc+src[-1];dc=dc+src[-i_stride];dc=dc+src[-1+i_stride];dc=dc+src[1-i_stride];......dc=dc+src[-1+15*i_stride];dc=dc+src[15-i_stride];dc=(dc+16)>>5。
Second and the 3rd loop unrolling obtain respectively operation expression and are:
src[0]=dc;...src[15]=dc;
src[0+i_stride]=dc;...src[15+i_stride]=dc;...src[0+15*i_stride]=dc;...src[15+15*i_stride]=dc;
S2412, new expression formula expansion being obtained according to parameter object are carried out iteration, thereby are obtained a new operation expression.In the present embodiment, the new operation expression being obtained from first above-mentioned loop unrolling, carries out iteration according to index dc by these operation expressions, obtains a new expression formula about dc:
dc=(0+src[-1]+src[-i_stride]+src[-1+i_stride]+src[1-i_stride]+...+src[-1+15*i_stride]+src[15-i_stride]+16)>>5。
S2413, the operational order in this new operation expression is mapped as to processing block diagram, the parameter object in operation expression is mapped as to the memory node in data stream.In the present embodiment, when loop statement is while can be changed into the dynamic circulation statement of quiet cycle, once because the environment of its application is determined, its loop variable just becomes constant, and its corresponding mapping step is identical with the mapping step of quiet cycle statement.
In the present embodiment, when loop statement is the mapping of individual layer dynamic circulation, the generation step of controlling stream comprises:
S2421a, circulating content is mapped as to processing block diagram, recursion instruction is mapped as to state machine.The present embodiment, by respectively circulating content being mapped as to processing block diagram, being mapped as recursion instruction the mode of state machine, thereby data stream is divided into two states: order status (sequence) A and recurrent state (loop) A, as shown in figure 10.
In the present embodiment, when circulation is multilayer dynamic circulation, the generation method of controlling stream comprises:
S2421b, respectively by the content map of interior loop statement and outer loop statement for processing block diagram A and processing block diagram B, by outer cyclic mapping, be state machine B respectively, interior loop is mapped as to state machine A, and this processing block diagram A and state machine A are mapped in and process in block diagram B, as shown in figure 11.That is to say in the present embodiment, adopt each circulation in multilayer dynamic circulation to be mapped as state machine, and the control that the nested mode of carrying out state machine generates corresponding Data Control stream is flowed.
Certainly in the present embodiment, it is the mode of a unified state machine that this multilayer dynamic circulation also can adopt interior loop and outer cyclic mapping, as shown in figure 12, if but nested circulation is the circulation of N layer, circulating content is mapped as to processing block diagram, by cyclic mapping, be a unified state machine, and the status number of state machine is N+1.That is to say in the present embodiment, when multilayer dynamic circulation adopts unified state machine, the status number of state machine equals cycle index and adds one.
In the present embodiment, according to the control statement in computer language procedure, comprise branch's control statement, and branch's control statement comprises single branch control statement and nested branch control statement.
In the present embodiment, during control statement Wei Dan branch of Dang Gai branch control statement:
S2421b, branch's steering order is mapped as to MUX.
The input end that S2422b is MUX by control statement content map is processed block diagram.
S2423b, the control end that controlled condition is mapped as to MUX are processed block diagram, finally obtain the structural drawing of this branch's control statement, as shown in figure 13.In the present embodiment, if this control statement content is or/and controlled condition is variable, be mapped as memory node.
In the present embodiment, when Dang Gai branch control statement is nested branch control statement, owing to function being analyzed before, therefore obtain upper strata control statement and the bottom control statement of this nested branch control statement:
S2421c, upper strata branch steering order is mapped as to MUX 1, the input end that is MUX 1 by the content map of upper strata branch control statement is processed block diagram, and the control end that the controlled condition in upper strata branch control statement is mapped as to this MUX 1 is processed block diagram.
S2422c, branch of lower floor steering order is mapped as to MUX 2, the content map of branch of lower floor control statement is that the input end of MUX 2 is processed block diagram, and the control end that the controlled condition in branch of lower floor control statement is mapped as to MUX 2 is processed block diagram.
S2423c, selects 1 output as the input of MUX 2 multichannel, thereby obtains the structural drawing of this nested branch control statement, as shown in figure 14.
Certainly in the present embodiment, if this control statement content or/and controlled condition is variable, is mapped as corresponding memory node in data stream.
Method based on above-mentioned, thus obtain respectively corresponding to function X264_me_search and function pixel_sad_16 * 16, Data Control flow graph, as shown in Figure 15 and Figure 16.
Control expression formula conversion method in the present embodiment is the core of all linguistic expression's conversions, and its conversion efficiency directly affects the data volume of generating algorithm Data Control flow graph.The conversion method target that the present embodiment proposes is hardware realization, therefore in advance concurrency factor, hardware configuration is taken into account as state machine, MUX etc., can help to a greater extent Hardware Engineer to carry out hardware design.
Please refer to Figure 17, Data Control stream map generalization method based on above-mentioned, the present invention also provides a kind of Data Control flow graph generating apparatus, comprise: process analysis module 1, for reading the computer language procedure of describing integrated circuit algorithm, according to the rule of this kind of computerese, from computer language procedure, identify mapped execution object and parameter object, wherein carry out object and comprise operational order and/or steering order, parameter object comprises at least one in input data, output data and intermediate data; Data Control flow graph generation module 2, for being mapped to by the execution object identifying and parameter object the Data Control flow graph of describing integrated circuit algorithm.
Please refer to Figure 18, in the present embodiment, Data Control flow graph generation module 2 comprises: carry out object map submodule 21, for the operational order of carrying out object is mapped as to processing block diagram, and for the steering order of carrying out object being mapped as to the control stream of identification-state, state transitions condition and state control signal; Parameter object mapping submodule 22, for being mapped as parameter object the memory node in data stream.
The Data Control flow graph generating based on said method, is described in more detail generating operator space-time drawing generating method below in conjunction with specific embodiment.
Transfer principle by Data Control flow graph generating operator spacetime diagram is: computing comprises 1 with storage implementation structure, and same storer is shared in input and output, arithmetic section restructural, i.e. the data stream form based on processor; 2 pipeline data stream forms; 3 parallel data stream forms.Wherein first kind form is general form, and its data stream form is serial in time, and will be according to sequential demand by its parallelization in from higher level lanquage to hardware conversion.
Therefore the principle of, in the present embodiment, data stream being launched is:
I, order related data flow structure: utilize operator unit to launch, this expansion can be implemented streamline, and flow beat was calculated according to the longest processing time, by storer, as data between streamline, is stored, and adjusts streamline beat.For example, the execution time of FunA, Func B and Func C is respectively 3,4 and 5 unit interval; The input bandwidth of Func A is 3data/cycle, output bandwidth is 2data/cycle, the input bandwidth of FuncB is 5data/cycle, output bandwidth is 4data/cycle, the input bandwidth of Func C is 7data/cycle, output bandwidth is 6data/cycle, and raw data A_OUT, B_IN, B_OUT, C_IN, C_OUT and A_IN are shared to same storage, and arithmetic section is reconstruct.Data stream is closed and to be: A_OUT=B_IN, and B_OUT=C_IN, and during C_OUT ≠ A_IN, be that the output data of Func A are as the input of Func B, the output data of Func B are as the input of Func C, and do not have the feedback of data, and this data stream can be converted into pipeline organization.Wherein, flow beat was calculated according to the longest processing time, by storage operators, as interrupting, adjusted streamline beat, as shown in figure 19.
The data stream of II, existence feedback: when data stream is a circulation, for there being the data stream of data dependence can not be converted into pipeline organization.If there is data dependence but be stored between each batch data of same shared storage, and each batch data inside is not while there is no data dependence, can realize the inside flowing water of processing each batch data, can reduce like this bandwidth of storage.For example, data A_IN is by A_IN_0, A_IN_1 and A_IN_2 form, although A_IN integral body depends on the output of C_OUT, A_IN_0, between A_IN_1 and A_IN_2, there is no data dependence, therefore can utilize inner flowing water by A_IN_0, A_IN_1 and A_IN_2 do in batches, obtain being Func A after complete C_OUT again, thereby obtain local flow's waterline formal transformation structure, as shown in figure 20.
III, parallel data flow structure: owing to there is no inputoutput data correlativity, the hardware independent expansion that can walk abreast.For example, as A_IN ≠ A_OUT ≠ B_OUT ≠ C_OUT, B_IN ≠ A_OUT ≠ B_OUT ≠ C_OUT, during C_IN ≠ A_OUT ≠ B_OUT ≠ C_OUT, be Func A, Func B, any input of Func C is all uncorrelated with their output, thereby data stream is expanded into parallel form, as shown in figure 21.
Principle based on above-mentioned, is converted to corresponding operator spacetime diagram by Data Control flow graph and specifically comprises step:
S31, set up operator cell library in advance.
S32, according to mark shared storage information and distributed store information, inputoutput data and intermediate data are stored.
S33, in Data Control flow graph, existence order related data flow, expands into pipeline organization by this data stream.
S34, in operator cell library, take out with launch after the operator unit of each node corresponding function.When processing capacity that the node in Data Control flow graph wherein carries out is simple, only need an operator or two operators in corresponding operator cell library, when yet the processing of carrying out when its functional module is more complicated, need to be corresponding to a plurality of operators in operator cell library, and replace corresponding node in Data Control flow graph with the combination of the plurality of operator.In the present embodiment, arithmetic logic L0 and the L1 of function x264me search are respectively: L0 logic: bmx=x264_clip3 ((m-> mvp[0]+2) > > 2,-m-> i_mv_range, m-> i_mv_range); L1 logic: bmy=x264_clip3 ((m-> mvp[1]+2) > > 2,-m-> i_mv_range, m-> i_mv_range).
In the present embodiment, logic L0 and logic L1 are order related data flow structure, are therefore expanded into operator spacetime diagram as shown in figure 22, wherein, the configuration signal that X0 and X1 produce for controlling operator, it is controlled stream and dots.Rectangle in figure is processed computing operator or the storage operators of block diagram representative mapping, and their interconnection completes by linking operator.
In the present embodiment, between data stream, not only there is alphabetic data dependency structure, also may there is the feedback of data stream, perform step: S35, according to storage information, judgement is stored between the data of each batch data inside of same shared storage whether have data dependence, if there is no data dependence, can realize the inside flowing water of processing this batch data, thereby reduce the bandwidth of storage.
In the present embodiment, the arithmetic logic L3 of function x264me search comprises two parts, is respectively: first:
bcost=h->pixf.sad[i_pixel](m->p_fenc,m->i_stride,p_fref,(m->i_stride)*5);
for(i_iter=0;i_iter<16;i_iter++)
{int best=0;int cost[4];
(cost[0])=h->pixf.sad[i_pixel](m->p_fenc,
m->i_stride,&p_fref[(-1)*m->i_stride*5+(0)],m->i_stride*5)+m->lm*(bs_size_se(((bmx+(0))<<2)-m->mvp[0])+bs_size_se(((bmy+(-1))<<2)-m->mvp[1]));
(cost[1])=h->pixf.sad[i_pixel](m->p_fenc,
m->i_stride,&p_fref[(1)*m->i_stride*5+(0)],m->i_stride*5)+m->lm*(bs_size_se(((bmx+(0))<<2)-m->mvp[0])+bs_size_se(((bmy+(1))<<2)-m->mvp[1]));
(cost[2])=h->pixf.sad[i_pixel](m->p_fenc,
m->i_stride,&p_fref[(0)*m->i_stride*5+(-1)],m->i_stride*5)+m->lm*(bs_size_se(((bmx+(-1))<<2)-m->mvp[0])+bs_size_se(((bmy+(0))<<2)-m->mvp[1]));
(cost[3])=h->pixf.sad[i_pixel](m->p_fenc,
m->i_stride,&p_fref[(0)*m->i_stride*5+(1)],m->i_stride*5)+m->lm*(bs_size_se(((bmx+(1))<<2)-m->mvp[0])+bs_size_se(((bmy+(0))<<2)-m->mvp[1]));
Second portion:
if(cost[1]<cost[0])best=1;
if(cost[2]<cost[best])best=2;
if(cost[3]<cost[best])best=3;
if(bcost<=cost[best])
break;
bcost=cost[best];
Wherein, first is order related data flow, existence order related data flow in each batch data in second portion, and there is feedback, first generate inner streamline, feeding back conversion, two parts are converted to the spacetime diagram being comprised of operator unit, respectively as shown in Figure 23 a and Figure 23 b.
In the present embodiment, be stored between each batch data stream in same shared storage and also may do not have data dependence, separate between each batch data stream, by this data stream parallel expansion.
S36, carries out parallel pipeline by data stream.In the present embodiment, the arithmetic logic L5 of function x264_me_search is: m-> mv[0]=bmx < < 2; M-> mv[1]=bmy < < 2; This logic L5 is parallel pipeline, and its generating operator spacetime diagram as shown in figure 24.
Through mentioned above principle, the operator spacetime diagram of being changed by the Data Control flow graph of function x264_me_search and function pixel_sad_16 * 16, respectively as shown in Figure 25 a and Figure 25 b.Wherein, function pixel_sad_16 * 16 in original program due to function X264_me_search, have been called, so in Figure 25 a, clearly do not indicate the concrete operator spacetime diagram of function pixel_sad_16 * 16, the operator spacetime diagram of function pixel_sad_16 * 16 is as shown in Figure 25 b.
The Data Control flow graph that method based on above-mentioned generates, the method that this Data Control flow graph is applied to temporal constraint below in conjunction with specific embodiment describes in detail.
Data flow diagram is applied to the method for temporal constraint, is divided into two stages:
I, the requirement specification definition of determining algorithm and target integrated circuit technology, and adopt the basic sequential unit of different operators according to different process.II, take function as unit carries out the mark of temporal constraint to data flow diagram, according to its data dependence and variable storage information, can obtain from the temporal constraint of the downward layer functions of top layer.
Based on mentioned above principle, in the present embodiment, the method that Data Control flow graph is applied to temporal constraint comprises step:
S41, according to the requirement of user specification demand and target integrated circuit technology, determine total temporal constraint.In the present embodiment, the c program of algorithm is H.264 described to X264 code, if algorithm requirements specification is defined as
(video resolution is 1280 * 720 pixels, processes 60 frame video datas p.s.).If target integrated circuit technology is 130nm technique, call operator technology library, can obtain standard unit's sequential is 5ns, utilizing operator technology library can generative circuit dominant frequency be 200MHz.If target integrated circuit technology is 65nm technique, call operator technology library, can obtain standard unit's sequential is 2.5ns, utilizing operator technology library can generative circuit dominant frequency be 400MHz.
In the present embodiment, according to step S41, obtain after total temporal constraint, data flow architectures different in Data Control flow graph is carried out to temporal constraint and comprise two kinds of methods, be respectively according to parallel data stream and carry out temporal constraint; According to serial data stream, carry out temporal constraint.
When the data stream in Data Control flow graph is parallel data stream, step S42 comprises step:
S421a, divide total temporal constraint equally each operator level in corresponding spacetime diagram, and divide the temporal constraint of each operator level equally each operator unit in this operator level.The H.264 algorithm of take in the present embodiment is example, and in frame level is processed, its constraint comes from algorithm requirements specification and is defined as 720p@60fps, i.e. processing 60 frames per second.If selecting technique is 130nm technique, utilizing operator technology library can generative circuit dominant frequency be 200MHz, and the temporal constraint that obtains this level is 200MHz/60 frame=3.33M cycle/ frame.At macroblock level Data processing, because each macro block processing sequence is serial processing, therefore Timing Constraints distributes according to serial, be that every frame is comprised of 80 * 45=3600 macro block, the processing power that requires each macro block is 1280 * 720 * 60/256=216000MB/s, when each macro block is processed required operator unit, ordinal number is 200MHz/216000MB/s=926cycle/MB, and the sequential that is about to the computing of frame level is divided equally to each macro block: 3.33M cycle/ frame/3600=926cycle/MB.Because the macro block in video encoder is divided into inter prediction, infra-frame prediction, transform and quantization, entropy coding, block elimination filtering etc., these macro blocks are processed a kind of parallel form that is operating as, therefore the temporal constraint of each macro block is also 926cycle/MB, thereby has generated successively the temporal constraint of bottom layer treatment module downwards.
In the present embodiment, when serial data is carried out to temporal constraint, step S42 comprises step:
S421b, using the basic sequential unit of the total operator of corresponding each operator level of each node of serial in Data Control stream as overall temporal constraint, the ratio that accounts for the sequential summation that operator unit that in each operator level, the longest arithmetic path shines upon is corresponding according to the sequential of the computing operator that in each operator level, the longest arithmetic path shines upon is distributed the sequential of each operator level.For example, when two operator level A and the pass of B in Data Control flow graph are serial, and their overall temporal constraint be them N the basic sequential of operator with.Sequential corresponding to computing operator that the longest arithmetic path was shone upon in operator level A and B of take is benchmark, in this benchmark, show that ratio distributes N the basic sequential unit of operator.The computing operator number that for example in operator level A, the longest arithmetic path shines upon is Ma operator, the basic sequential unit of a corresponding Na operator; The computing operator number that in operator level B, the longest arithmetic path shines upon is Mb operator, the basic sequential unit of a corresponding Nb operator, and the temporal constraint being assigned to so on operator level A is:
the temporal constraint being assigned on operator level B is:
In the present embodiment, computing operator can be that many basic sequential units carry out, so operator number is not necessarily consistent with the basic sequential unit of operator.
S422b, the sign that serial data stream contains parallel data in Data Control flow graph, revises temporal constraint in step S421b according to the sign of this parallel data.For example, modules A can obtain information p-x-n-m, and wherein p is concurrency sign, and x is concurrency kind, comprises the types such as pipeline parallel method, local pipeline parallel method, data parallel; N is parallel position, walks abreast n processing stage; M is and line number.
According to above information, the temporal constraint of modules A can be adjusted into
β wherein
nfor the empirical parameter in stage n parallelization, N
pnfor the basic sequential unit of processing operator in stage n.
Method based on above-mentioned, the structure during to structure shown in Figure 25 a and Figure 25 b after row sequential mark is as shown in Figure 26 a and Figure 26 b.
Because structure shown in Figure 25 a has been called the Operator structure of function pixel_sad_16 * 16, so first structure shown in Figure 25 b is compressed, the result after compression as shown in figure 27; And then the row compression during to structure shown in Figure 25 a according to the method that the spacetime diagram shown in Figure 27 is compressed, thereby the spacetime diagram of the function X264_me_search after being compressed.
When structure shown in 25a and Figure 25 b being carried out to spacetime diagram compression, mainly follow following principle:
1, the computing class operator that in operator spacetime diagram, operational attribute is identical is carried out to cluster compression.Such as, in spacetime diagram, two parallel add operation operators can be compressed into an add operation operator, the mode of simultaneously controlling operator by introducing realizes the multiplexing of addition operator after compression, complete with compress before two functions that addition operator is identical.As can be seen here, after operator spacetime diagram is compressed, the number of operator can significantly reduce, thereby has saved the area of integrated circuit, and correspondingly, the operator after compression is realized multiplexing by control operator, has increased the execution time of integrated circuit overall algorithm.Be understandable that, to the cluster compression of computing class operator, must cause that storage class operator, control class operator, class of paths operator and clock class operator also correspondingly change, so can also do corresponding cluster compression further to save integrated circuit area according to actual conditions to above-mentioned operator, storage class operator especially wherein.
2, when introducing control operator, generate corresponding configuration-direct, described configuration-direct is worked according to predetermined mode for controlling the operator of generation, thus the identical function of realization and compression pre-operator.
3, for the possible cluster compression result of same operator spacetime diagram, have multiple.Therefore, in compression process, the spacetime diagram that after selection compression, the spacetime diagram overall algorithm execution time approaches confinement time is most as final compression result, the spacetime diagram that the selection overall algorithm execution time approaches confinement time is most as compression result, can, in the situation that guaranteeing to meet sequential condition, save the area of integrated circuit the largelyst.Be the integrated circuit maximum execution time calculating according to the performance index of user's proposition confinement time.
Above-mentioned to after the compression of spacetime diagram cluster, can reduce area and the power consumption of integrated circuit.And the operator generating after cluster compression has certain regularity.
Above-mentioned spacetime diagram is carried out after cluster compression, in the time of can also be to some operator wherein, row be optimized, and a kind of mode of optimization is that row solidifies customization during to some operator.Such as, in Figure 28, the left side is the computing class operator after a kind of compression, due to not use of logic unit wherein, so obtain the Operator structure shown in the right in Figure 28 after the logic unit of this operator can being removed, dwindle further the area of operator.Again such as, for the ADDS operator shown in Fig. 2, owing to passing through to change the value of control bit X, can make this operator realize different functions, and in a certain concrete integrated circuit, in fact only use the addition shifting function of this operator, the value of the control bit X of this operator can be fixed as to 000, thereby met the power consumption of having saved integrated circuit under the prerequisite of functional requirement.By solidifying the method for customization, can in circuit, only occur realizing the minimal hardware that identical function operates, then these minimal hardware be carried out to full Custom Design, make it there is no other expanded function.Like this, both can guarantee the correct execution of algorithm, can optimize again area and the power consumption of integrated circuit.
After spacetime diagram before and after the compression of X264_me_search function cluster is analyzed, can obtain the form shown in Figure 29.From this form, can find out, after cluster compression, operator number used reduces to 139 by 1724, although 500 cycles that correspondingly sequential was increased to from 292 cycles, 500 cycles are still less than the constraint cycle 600 of X264_me_search.As can be seen here, after cluster compression, in the execution time that guarantees overall algorithm, meet under the condition of temporal constraint, can also reduce significantly operator number used, thereby save area and the power consumption of integrated circuit.
Method based on above-mentioned, by the X264_me_search function to the H.264 C language description of standard, carry out process analysis, and after shining upon, obtain the Data Control flow graph of function X264_me_search as shown in figure 15, by this Data Control flow graph, be converted to the operator spacetime diagram as shown in Figure 25 a and Figure 25 b, according to information such as data dependences, Data Control flow graph is carried out to temporal constraint and obtain the temporal constraint figure as shown in Figure 26 a and Figure 26 b, then according to time-labeling, spacetime diagram is carried out to cluster and compress the spacetime diagram after the cluster compression that obtains function pixel_sad_16 * 16 as shown in figure 27, last again by the spacetime diagram generation integrated circuit lower hardware circuit after this compression.Due in to computer language procedure analytic process, obtaining data dependence, data can concurrency and corresponding control information, and this program is mapped as to Data Control flow graph, can concurrency and corresponding control information thereby the Data Control flow graph that makes to obtain comprises data dependence, data, thus ancillary hardware circuit designer is carried out circuit design effectively.And it is Data Control flow graph that the device of the present embodiment can be realized computer language procedure automatic mapping, thereby be converted to again operator spacetime diagram, greatly promote integrated circuit Design of Hardware Architecture efficiency.
Method from computer language procedure generated data control flow graph provided by the invention is not only adapted to above-described embodiment, also can be used for other and by computer language procedure, is generated the mapping process of integrated circuit lower hardware logical description.
Above content is in conjunction with concrete embodiment further description made for the present invention, can not assert that specific embodiment of the invention is confined to these explanations.For general technical staff of the technical field of the invention, without departing from the inventive concept of the premise, can also make some simple deduction or replace, all should be considered as belonging to protection scope of the present invention.