CEAI, Vol.11, No.4, pp. 53-62, 2009
Printed in Romania

Optimizing Depth, Area and Dynamic Power of k-LUT Based FPGA Circuits

Ion I. Bucur*, Nicolae Cupcea*, Costin Stefanescu*, Adrian Surpateanu*

* University Politehnica Bucharest, Faculty of Control and Computers, Computer Science and Engineering Department
E-mail: (ion.bucur, nicolae.cupcea, costin.stefanescu, adrian.surpateanu)@cs.pub.ro

Abstract: This paper proposes an efficient algorithm for optimizing K-LUT based FPGA mapped circuits under complex criteria. The algorithm considers an important design factor, dynamic power consumption, in addition to the design factors that are traditionally considered. To increase performance, a flexible mapping tool was used, based on exhaustive generation of all K-bounded subcircuits rooted in each node of the circuit. To obtain information about dissipated functional power, an efficient simulator was implemented. Area is of paramount importance in FPGA circuit mapping. Lastly, to lower power consumption, we devised several effective techniques for reducing area.

Keywords: power-aware mapping, optimal area, K-LUT based FPGA, logic activity simulator, functional power.

1. INTRODUCTION

Power consumption is becoming one of the most important considerations in VLSI design. The increase in both complexity and size of these circuits makes power dissipation ever more important. In many application domains, the demand for longer battery life in the growing class of portable computing and wireless communication devices requires low-power circuits. Moreover, the spectacular decrease in chip size and the increase in both transistor count and clock rate point to the high importance of circuits with low power dissipation. Chips that dissipate little power have low packaging and cooling costs. High-power chips often run hot, and high temperature tends to exacerbate several silicon failure mechanisms.
It’s known that every 10 °C increase in operating temperature roughly doubles a component’s failure rate (Pedram, 1996). Therefore, in addition to performance and area optimization, a great deal of research has been directed towards low-power circuit design (Jones et al. 2007).

Time-to-market pressure currently makes it imperative that design, development, production and testing time be reduced as much as possible. In this context, field-programmable gate arrays (FPGAs) are very attractive choices for digital circuit implementation. FPGAs have emerged as a popular technology due to their short turnaround time and low manufacturing costs. However, they are less power efficient than custom ASICs (Chen, C.-S. et al. 1997), (Chen, D. et al. 2006).

Our focus in this paper is on finding optimal solutions primarily for depth and power at the gate level. Our secondary target is depth, area and dynamic power at the same gate level. A new technology mapping algorithm is presented, based on previous studies and results (Bucur, 1999), (Bucur 2007), (Bucur et al. 2008), (Bucur et al. 2009). The main idea was to build an application that works at the logic level and is able to find power-aware mapping solutions for both depth and area criteria. The second target is much more complex because it involves two NP-hard problems; the first target was more fruitful and produced more convincing results.

The paper is organized as follows. Section 2 presents the main power issues in FPGA technology. Relevant previously published work on power-aware technology mapping targeting FPGAs is outlined in Section 3. Definitions and the main model are briefly presented in Section 4. The model used for power estimation at the gate level is presented in Section 5. Section 6 presents our approach and introduces the main features of the PwAwMap algorithm.
Finally, Section 7 presents experimental results and conclusions, and outlines future work.

2. POWER ISSUES IN FPGA CIRCUITS

FPGA circuits are developed in CMOS technology. Power dissipation in CMOS circuits comprises both static (leakage) power and dynamic power. Static power is consumed when a circuit is in a quiescent, idle state. Static power results from leakage current in off transistors, primarily sub-threshold and gate-oxide leakage. An off MOS transistor is an imperfect insulator, allowing sub-threshold leakage current to flow between its drain and source terminals. Gate-oxide leakage is caused by tunneling current through the gate terminal of a transistor to its body, drain, and source (Gupta and Anderson, 2007).

Scaling down the technology means lower supply voltages and smaller transistor dimensions. This leads to shorter wires, lower capacitance, and an overall reduction in dynamic power. Smaller process geometries also mean shorter transistor channel lengths and thinner gate oxides, producing an increase in static power as technology scales. At the transistor level, Virtex-4® and Virtex-5® FPGAs, for example, employ triple-oxide process technology for leakage mitigation. In Xilinx technology, for example, there are three possible oxide thicknesses for each transistor. They are used depending on its speed, power and reliability requirements.

54 CONTROL ENGINEERING AND APPLIED INFORMATICS

Dynamic power, on the other hand, is caused by signal transitions in a circuit and is governed by the equation:

\[ P_d = 0.5 \sum_i C_i \cdot V_{dd}^2 \cdot d(i) \qquad (1) \]

where:
- C_i represents the capacitance of signal i;
- d(i) is referred to as the “transition activity” of logic signal i and represents the rate of transitions on signal i (i.e. the number of times that signal i changes its value in unit time);
- V_dd is the supply voltage.
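As an illustration of equation (1), the following minimal Python sketch sums the per-signal contributions. The function name `dynamic_power` and the numeric values (two nets, a 1.2 V supply) are hypothetical and chosen only for the example; they do not come from the paper's experiments.

```python
def dynamic_power(signals, vdd):
    """Evaluate equation (1): P_d = 0.5 * sum_i C_i * Vdd^2 * d(i).

    signals: iterable of (C_i, d_i) pairs, with C_i in farads and
             d_i in transitions per second.
    vdd:     supply voltage in volts.
    Returns the dynamic power in watts.
    """
    return 0.5 * vdd ** 2 * sum(c * d for c, d in signals)


# Hypothetical example: a 10 fF net toggling 50e6 times per second
# and a 20 fF net toggling 10e6 times per second, at Vdd = 1.2 V.
nets = [(10e-15, 50e6), (20e-15, 10e6)]
p_d = dynamic_power(nets, vdd=1.2)
print(f"P_d = {p_d:.3e} W")
```

In this example the first net has half the capacitance of the second but contributes more power, because its transition activity is five times higher. This is why equation (1) rewards keeping highly active signals on low-capacitance nets.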
The programmability and flexibility of FPGAs make them less power-efficient than custom ASICs when implementing a given logic circuit. FPGAs consist of three kinds of programmable elements (Fig.1): logic cells named configurable logic blocks (CLBs), programmable I/O cells named input–output blocks (I/O blocks), and programmable interconnect (routing resources). Each configurable logic block contains combinational components such as multiplexers (MUXs), simple gates (e.g., OR and AND) and programmable lookup tables (LUTs), and sequential components such as flip-flops. Configurable logic blocks have 4, 5 or 6 input lines. This number of logic input lines is symbolically referred to as K. Each CLB contains one programmable combinational logic component and two flip-flops. Packing LUTs and flip-flops into CLBs is, in general, a critical step in the cluster-based FPGA design flow, since it has a great impact on both timing and routability. While placement and routing are strongly connected with the detailed architecture of the chip and are mostly managed by the commercial FPGA software, optimization and mapping can be influenced more by the user.

[Fig. 1 labels: Programmable interconnect, Programmable I/O cells, Logic Cell.]

Place and route optimization aims to lower power in the interconnect fabric. Most of the recent advanced algorithms or heuristics for FPGA mapping address area minimization under delay constraints. Even if delay constraints are not specified, an optimum delay for the considered network is determined and, after that step, the area is minimized without modifying the delay. During the circuit logic optimization stage it is possible to optimize dynamic power by modifying the way the circuit is K-LUT mapped. By hiding lines with high switching activity inside K-LUTs, the dynamic power dissipated by the circuit is lowered.
In practice, K is usually small (for example, 4-LUTs are widely used in commercial FPGAs), as the area of a K-LUT grows exponentially with K. It was shown that a 4-input, single-output LUT cell yields the smallest FPGA area of any K-LUT cell for a wide range of programming technologies and routing pitches (Rose et al. 1990). The I/O blocks can be programmed to become the primary inputs (PIs) or primary outputs (POs) of the circuits on FPGAs. Routing resources include segmented interconnects and switching blocks. The segmented interconnects connect to the inputs and outputs of logic blocks, while the switching blocks link the segments to form long routing tracks that implement the routing topology.

In a typical FPGA design flow, a circuit is first synthesized and mapped into a netlist of K-LUTs and flip-flops. It then goes through three steps: clustering, placement and routing. The clustering step arranges K-LUTs and flip-flops into CLBs according to the timing and the connectivity of the mapped netlist; the placement step places the clustered netlist onto the array of on-chip CLBs; the routing step routes all the wires in the netlist with the available routing resources on the device. Routing tools account for an important portion of the global design effort (Gupta and Anderson, 2007), (Hansen and Thomas, 2005), and (Yang 2005).

The FPGA configuration memory and configuration circuitry consume silicon area, producing longer wire lengths and higher interconnect capacitance. Programmable routing switches, on the other hand, attach to the pre-fabricated metal wire segments in the FPGA interconnect and add to the capacitive load incurred by signals. Most dynamic power in an FPGA is consumed in the programmable routing fabric. About 50% - 70% of total power is dissipated in the interconnection network (Anderson and Najm, 2004) and (Poon et al. 2005). Interconnect comprises a considerable fraction of the FPGA’s transistors and therefore dominates leakage.
Signal capacitance is known after the placement and routing processes. During placement this capacitance is only estimated. The estimation uses empirically derived capacitance models. Such models are derived using least-squares regression analysis (Gupta and Anderson, 2007), (Poon et al. 2005).

Fig.1. Generic FPGA Architecture.

3. RELATED WORKS

Several works have addressed decreasing the power consumption of circuits mapped onto FPGAs. Farrahi and Sarrafzadeh studied the technology mapping problem for lookup table-based FPGAs (Farrahi and Sarrafzadeh, 1994b). The problem is formulated as assigning LUTs to nodes of a circuit so as to minimize the total area. They showed that the decision version of this problem is NP-complete, even for simple classes of inputs such as 3-level circuits. This result is extremely important because it points to the difficulty of the problem. The same proof is extended to conclude that general library-based technology mapping for power minimization is NP-complete (Farrahi and Sarrafzadeh, 1994a). Their paper also presents a heuristic algorithm that maps the network onto K-input LUTs in polynomial time, aimed at minimizing the power consumption.

The utility of some low-power design methods based on architectural and implementation modifications for FPGA LUT-based systems is presented in the paper of Sutter and Boemo (Sutter and Boemo 2007). The contribution of spurious transitions to the overall consumption is evidenced and the main strategies for its reduction are analyzed. Empirical results are presented in order to show the effectiveness of pipelining and sequentialization as low-power design methodologies. The possibilities of power management techniques are quantified.

Li et al. present in their paper an efficient heuristic to compute both low-power and reduced-area mapping solutions (Li et al 2001).
The major distinction of their work from previous ones is that, while generating a LUT, it looks ahead at the impact of the mapping solutions of this LUT on the power consumption of the remaining network. Their approach computes min-height K-feasible cuts for non-critical nodes to minimize the power consumption of the mapping solution. Li et al. implemented their application, named PowerMap, in C and tested it on a number of MCNC benchmark circuits. Compared to FlowMap, an early and very well-known delay-optimal mapping application, their algorithm reduces the power consumption by 17.8% and uses 9.4% fewer LUTs without any depth penalty.

Gupta and Anderson present several very interesting results on power-aware placement and routing optimization (Gupta and Anderson 2007). A set of industrial designs was placed and routed using both the traditional place and route flow and the power flow in ISE 9.2i Design Tools. The designs were augmented with built-in automatic input vector generation by attaching a linear feedback shift register-based (LFSR-based) pseudo-random vector generator to the primary inputs. This feature made it possible to perform board-level measurements of dynamic power without requiring a large number of externally applied waveforms. The industrial designs were mapped into Spartan-3, Virtex-4, and Virtex-5 devices. Results showed that dynamic power was reduced by as much as 14% for Spartan-3 FPGAs, 11% for Virtex-4 FPGAs, and 12% for Virtex-5 FPGAs. On average, across all designs, dynamic power was reduced by 12% for Spartan-3 FPGAs, 5% for Virtex-4 FPGAs, and 7% for Virtex-5 FPGAs. For all families, on average, the speed performance hit was between 3% and 4%, which was considered small and acceptable in power-conscious designs.

An efficient heuristic algorithm for low-power design with FPGAs is introduced in (Wang et al. 2001).
The main idea in this paper was to exploit the cut enumeration technique to generate possible mapping solutions for the subcircuit rooted at each node. However, in consideration of both run time and memory space, only a fixed number of solutions were selected and stored by the heuristic. To facilitate the selection process, a method that correctly calculates the estimated power consumption for each mapped subcircuit was developed. The experimental results show that their algorithm reduces the average power consumption by up to 14.18% and the average number of LUTs by up to 6.99% over an existing method.

Singh and Marek-Sadowska presented a routability-driven bottom-up clustering technique for both area and power reduction in clustered FPGAs (Singh and Marek-Sadowska, 2002). This technique uses a cell connectivity metric to identify seeds for efficient clustering. It leads to better device utilization, savings in area, and reduction in power consumption. The authors report a routing area reduction of 35% over previously published results. Power dissipation simulations using a buffered pass-transistor-based FPGA interconnect model are introduced. Singh and Marek-Sadowska showed that the presented clustering technique can reduce the overall device power dissipation by an average of 13%.

Anderson and Najm proposed a power-aware technology mapping technique for LUT-based FPGAs which aims to keep nets with high switching activity out of the FPGA routing network and takes an activity-conscious approach to logic replication (Anderson and Najm, 2002). Logic replication is known to be crucial for optimizing depth in technology mapping; an important contribution of their work was to recognize the effect of logic replication on circuit structure and to show its consequences on power.

The work of Hsieh et al. discusses optimizing the interconnect power of designs implemented in FPGA platforms (Hsieh et al. 2008).
In particular, it reduces the glitch power on interconnects associated with the output of functional units in a design. The idea is to activate unused flip-flops to block the propagation of glitches, which takes advantage of the abundant flip-flops in modern FPGA structures.

Jiang et al. presented a mathematical programming formulation of the integer time budgeting problem for directed acyclic graphs (Jiang et al. 2008). In particular, it is formally proved that their constraint matrix has a special property that enables a polynomial-time algorithm to solve the problem optimally with a guaranteed integral solution. Their theory can be directly applied to solving a scheduling problem in behavioral synthesis with the objective of minimizing the system power consumption. Given a set of scheduling constraints and a collection of convex power-delay tradeoff curves for each type of operation, their scheduler can intelligently schedule the operations to appropriate clock cycles and simultaneously select the module implementations that lead to low-power solutions. Experiments demonstrate that their proposed technique can produce near-optimal results (within 6% of the optimum by the ILP formulation), with a 40x+ speedup.

The work of Ho et al. describes an approach to estimate the power consumption of a set of hybrid FPGA architectures, with island-style fine-grained units and domain-specific coarse-grained units (Ho et al. 2008). They reported results over a set of floating point benchmark circuits. The power performance of a hybrid FPGA is compared to the Xilinx Virtex II 3000, which has the same architecture as the hybrid FPGA but no coarse-grained unit. On average, floating point applications implemented on the hybrid FPGA can reduce dynamic energy consumption by a factor of 14 compared to the Virtex II FPGA.

Mashayekhi et al. present in their paper a method that attempts to reduce the switching activity among LUT blocks (Mashayekhi et al. 2008). To achieve this, they introduced the fake register insertion method and combined it with the retiming method. Fake registers are inserted on low-transition wires around high-transition wires to force the synthesis tool to direct those low-transition wires onto the LUTs' outputs. Retiming is used to move registers that have been located on high-transition wires, in order to prevent the synthesis tool from placing them on the LUTs' outputs. For benchmarking, two circuits of the ISCAS89 benchmark library were employed. Experimental results have shown that this scheme decreases the off-block switching of the implemented circuits by up to 25%.

The paper of Jang et al. describes a powerful simulator and several complementary algorithms for power-aware logic optimization (Jang et al. 2009b). The proposed simulator draws on new ideas in logic representation and is geared for speed, e.g. it can simulate a 1 M-node sequential design using 1 000 bit patterns for 100 cycles in about 10 seconds on a typical one-core CPU. When applied to large industrial designs in a highly-optimized industrial flow, the techniques described in this paper led to a 19.6% reduction in switching activity without a substantial increase in runtime or degradation of other metrics.

4. PROBLEM FORMULATION

The combinational part of a general logic network N can be represented as a directed acyclic graph (DAG) denoted N(V, E), where V is the set of nodes and E is the set of directed edges. Each node in N(V, E) represents a logic gate (possibly complex), and a directed edge (u, v) exists if the output of gate u is an input of gate v. The set of direct predecessors of gate v is expressed as input(v), and the set of direct predecessors of a graph H ⊆ N(V, E) is similarly expressed as input(H). The set of direct predecessors of G is the set of all primary inputs of the network.
Primary input (PI) nodes are nodes that have no incoming edge, and primary output (PO) nodes are nodes that have no outgoing edge. Flip-flop outputs are considered pseudo-PIs and flip-flop inputs are considered pseudo-POs, and no distinction is made in terms of notation. Let u be a gate in N; we are interested in computing a generic K-feasible cone of node u (denoted Cu) in the directed acyclic graph N(V, E). Cone Cu is rooted in u, is included in the predecessor’s maximum transitive cone of the node u (denoted PMTCu), and has no more than K direct predecessors (| input(Cu) | ≤ K). Given a cone Cu and a node v ∈ Cu, any path connecting the node v and the node u lies entirely in Cu. A logic network N, modelled by N(V, E), such that for each node u in N(V, E): | input(u) | ≤ K, is K-bounded. The level of a node u is computed, in general, using the expression:

level(u) = 1 + maximum_level{ input(u) } (2)

The level of a PI node is zero, and the level (depth) of a network is the largest node level in the network.

5. DYNAMIC POWER EVALUATION

Dynamic power has two main components: switching power and short-circuit power. Both components can only occur when a signal transition happens at the CMOS gate output. In this work the focus is on switching power. Switching power evaluation is essentially concerned with switching activity estimation and load capacitance estimation. Gates (including buffers) and wires contribute capacitance in the FPGA’s circuits. It is known that, in a combinational circuit, switching at a node is correlated with its own past values and with its neighbors in the circuit. Temporal correlation is due to the fact that switching in a node depends on its last value and its function. Spatial correlation among nodes arises from the basic logical connections.
Various approaches to computing switching activity have been proposed in the literature. They can generally be classified as:
o Probabilistic models, or
o Characterization through board measurement, or
o Simulation-based approaches.

One approach to switching activity estimation is based on probabilistic models. Such strategies are also known in the literature as non-simulative switching activity techniques (Bucur et al. 2009). The probabilistic techniques use knowledge about input statistics to estimate the switching activity of internal nodes (Li et al. 2001). Probabilistic models supporting signal probability calculus were introduced several years ago, when random testability was developed. Net signal probability was studied and various methods were established in order to compute an exact value or an estimate of it (Chen, D. et al. 2006). Najm introduced the concept of transition density, which is propagated throughout the circuit using a Boolean difference algorithm (Najm 1993). Basically, two quantities are used for each net w in the circuit: the equilibrium probability p(w) (the stationary probability that the signal has the value 1) and the transition density d(w), i.e. the number of times that the signal changes its value per time unit. This approach uses the Boolean difference of each net w:

\[ \frac{\partial w}{\partial x_i} = w(x_0, x_1, \ldots, x_i = 0, x_{i+1}, \ldots, x_n) \oplus w(x_0, x_1, \ldots, x_i = 1, x_{i+1}, \ldots, x_n) \qquad (3) \]

The probability of the Boolean difference, p(∂w/∂x_i), is the probability that a transition at input variable x_i causes a transition at the output w. The transition density at the output w of a gate can be calculated as in (4).

\[ d(w) = \sum_{i=1}^{m} p\!\left(\frac{\partial w}{\partial x_i}\right) d(x_i) \qquad (4) \]

This theorem makes it possible to compute the transition density of any node in the logic network N(V, E) when the transition densities at the primary inputs are known. A proof of this theorem can be found in (Najm 1993).
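Equations (3) and (4) can be illustrated for a single gate with a short Python sketch. It assumes that the gate inputs are statistically independent, so that the probability of the Boolean difference can be obtained by enumerating the other inputs. The function names and the gate representation (a Python callable over a list of bits) are assumptions made for this example, not the paper's implementation.

```python
from itertools import product


def boolean_difference_probability(func, i, p):
    """p(dw/dx_i): probability that flipping input i flips the output.

    func: callable taking a list of 0/1 input values and returning 0/1.
    i:    index of the input being differentiated.
    p:    list of equilibrium probabilities P(x_j = 1), inputs assumed
          independent (an assumption of this sketch).
    """
    n = len(p)
    others = [j for j in range(n) if j != i]
    total = 0.0
    for bits in product((0, 1), repeat=len(others)):
        x = [0] * n
        weight = 1.0
        for j, b in zip(others, bits):
            x[j] = b
            weight *= p[j] if b else 1.0 - p[j]
        x[i] = 0
        w0 = func(x)
        x[i] = 1
        w1 = func(x)
        if w0 != w1:          # Boolean difference (3) equals 1 here
            total += weight
    return total


def transition_density(func, p, d):
    """Equation (4): d(w) = sum_i p(dw/dx_i) * d(x_i)."""
    return sum(boolean_difference_probability(func, i, p) * d[i]
               for i in range(len(p)))


def and2(x):
    return x[0] & x[1]


def xor2(x):
    return x[0] ^ x[1]


# Two inputs with p = 0.5 and one transition per time unit each.
d_and = transition_density(and2, [0.5, 0.5], [1.0, 1.0])
d_xor = transition_density(xor2, [0.5, 0.5], [1.0, 1.0])
print(d_and, d_xor)
```

For the AND gate a transition on one input reaches the output only when the other input is 1, so each input contributes half of its density. For the XOR gate every input transition propagates, so the densities simply add.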
Once the transition densities of every node (gate) in N(V, E) have been determined, the power consumption of a given circuit can be calculated as in (1). It has to be remarked that this approach computes an upper bound of the stationary probability in networks with re-convergent fanouts (Anderson and Najm, 2004). This model also assumes that the switching behaviors of the input signals are not interrelated, which is usually not true. It is also hard to capture spurious power activity using this model. The model is oriented towards the signal transitions necessary to perform the required logic functions that arise between two consecutive clock ticks. Such transitions are named functional transitions in the literature (Chen D. et al. 2006).

The characterization through board measurement approach is very well described in (Xu and Kurdahi 1997). Using an emulation board embedded with a Xilinx Virtex® FPGA for power measurement, the authors calculated the average switching activity for logic elements using a power estimation formula published by Xilinx:

\[ P_{int} = V_{core} \cdot K_p \cdot f_{Max} \cdot N_{LC} \cdot Tog_{LC} \qquad (5) \]

Symbols in (5) are defined as follows:
o P_int is the internal power consumption caused by charging and discharging the capacitance of each switched element;
o V_core is the core voltage;
o K_p is a technology-dependent constant;
o f_Max is the maximum clock speed;
o N_LC is the number of used logic elements;
o Tog_LC is the average switching activity of all logic elements.

Switching activity determination based on simulation uses several sequences of randomly generated input vectors applied to the primary inputs, so that cycle-accurate gate-level simulation can be carried out for the whole circuit. Combined with back-annotated delay information obtainable after placement and routing, this kind of estimation is the most precise for switching activity computation because it can also reflect behaviour due to glitches (D. Chen et al. 2006).
Many works use this model (Gupta and Anderson, 2007), (Sutter and Boemo, 2007). The main difficulty of this approach is considered to be its prohibitive runtime (Chen, D. et al. 2006).

In this work, a simulation-based approach was used to determine dynamic switching activity in 15 combinational circuits of the MCNC benchmark. These benchmark circuits are first optimized using the rugged script in the SIS application, an interactive system for the synthesis of sequential circuits (SIS-1.2 1994). A modified fault simulator from the ATPG section of SIS-1.2 was used for simulation-based logic activity analysis. This simulator has added capabilities for capturing the number of logic transitions on each net during simulation, as well as the proportion of time each net spends in the high and low logic states. Simulation with zero logic delays can be done for combinational or synchronous sequential circuits. However, our research was restricted to combinational circuits; we intend to extend it in the near future. Circuits are simulated using almost 10 000 randomly generated input vectors. Each time an internal line increments its logic activity (the number of changes from 1 to 0 or from 0 to 1), the following expression is evaluated:

\[ \frac{n}{N} - \frac{n+1}{N+m} \le \varepsilon \qquad (6) \]

In (6), N denotes the number of vectors simulated at the moment when the number of logic changes on an arbitrary line w became n, and N + m denotes the number of vectors simulated when the number of logic changes on line w is incremented to n + 1. The parameter ε is set at the beginning of the simulation and represents the precision of the simulation-based logic activity determination. The simulator checks when all internal lines and primary output lines of the circuit fulfill condition (6). The computed power results are based on identical switching activities (0.5) for each primary input.
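The stopping rule (6) can be sketched with a small zero-delay simulator in Python. This is a simplified illustration, not the modified SIS-1.2 simulator: the netlist format, the function name `simulate_activity` and the `max_vectors` cap are assumptions, and the sketch compares the absolute value of the difference in (6) with ε, which is also an assumption. Each primary input bit is drawn with probability 0.5.

```python
import random


def simulate_activity(gates, inputs, eps, max_vectors=10_000, seed=0):
    """Zero-delay random simulation with toggle counting.

    gates:  list of (net_name, func, fanin_names) in topological order;
            func takes the fanin values as positional 0/1 arguments.
    inputs: names of the primary inputs.
    eps:    precision parameter of condition (6).
    Returns (toggles per simulated vector for each net, vectors used).
    """
    rng = random.Random(seed)
    nets = [name for name, _, _ in gates]
    toggles = {w: 0 for w in nets}        # n in condition (6)
    last_count = {w: 0 for w in nets}     # N: vectors at previous toggle
    converged = {w: False for w in nets}
    prev = None
    count = 0
    for count in range(1, max_vectors + 1):
        values = {x: rng.getrandbits(1) for x in inputs}
        for name, func, fanins in gates:
            values[name] = func(*(values[f] for f in fanins))
        if prev is not None:
            for w in nets:
                if values[w] != prev[w]:
                    n, big_n = toggles[w], last_count[w]
                    if big_n > 0:
                        # Condition (6) with N + m = count (absolute
                        # value assumed in this sketch).
                        diff = abs(n / big_n - (n + 1) / count)
                        converged[w] = diff <= eps
                    toggles[w] = n + 1
                    last_count[w] = count
            if all(converged.values()):
                break
        prev = values
    return {w: toggles[w] / count for w in nets}, count


# Hypothetical two-gate netlist: w = a AND b, z = w XOR c.
gates = [
    ("w", lambda a, b: a & b, ("a", "b")),
    ("z", lambda w, c: w ^ c, ("w", "c")),
]
activity, used = simulate_activity(gates, ["a", "b", "c"], eps=1e-3)
print(activity, used)
```

A net that never toggles never satisfies the condition in this sketch, so the `max_vectors` cap is what ends the run for constant nets.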
However, the simulator is able to determine the switching activities of a network's internal lines for arbitrary switching activities of the primary inputs of any network.

As technology scales down, power consumption in interconnects becomes the dominant source in sub-micron FPGAs. In K-LUT based FPGA circuits the essential dynamic power consumption is caused by transitions that take place at the inputs and outputs of LUTs. It follows that the sum of the dynamic power should be minimized over all the LUTs in the mapped network. As a consequence, power estimation for FPGAs has to consider the equivalent capacitance of the routing interconnect. Interconnect estimation becomes increasingly accurate as the design enters lower design levels. Nets having the greatest transition density have to be, if possible, hidden in LUTs.

6. ALGORITHM DESCRIPTION

Our algorithm has been implemented in the C language atop the Berkeley SIS framework (SIS-1.2, 1994). The approach uses structures and routines of SIS-1.2 in order to build the application and tune appropriate cost functions for power-aware minimal depth mapping. The algorithm is based on exhaustive generation of the K-feasible cones of each node u in the mapped network. The implemented technology mapping procedure operates in three steps:

1. In the first step, the set of all K-feasible cones is generated for each node in the network. The K-feasible cones are generated during a traversal of the network from primary inputs to primary outputs, and the edge delay is computed for each feasible cone (Bucur 2007).
2. In the second step, specific cost functions are computed for each K-feasible cone of each node in the network (Bucur 2009).
3. In the third step, using the set of cost function values, the power-aware minimum depth mapped network is determined.

6.1 Generating K-feasible Cones

In order to generate a power-aware minimal depth K-LUT mapped network it is necessary, in general, to know an appropriate minimal height K-feasible cone for each internal node u in the initial network. It is useful to note that the nodes that are not on a critical path do not need a minimal height K-LUT implementation. The generation of all K-feasible cones rooted in every node of a network has to be considered in the context of the network model. Let N be a K-bounded network, and u an arbitrary node of N. Then a K-feasible cone of the node u, denoted C(u), can be identified by the set:

\[ input(C(u)) = \{ v_1, v_2, \ldots, v_m \}, \quad m \le K \qquad (7) \]

Such a set can be represented as the product (conjunction) of the elements (literals) of the set in (7):

\[ p = v_1 v_2 \ldots v_m \qquad (8) \]

The set of all feasible cones of node u, denoted cones(u), can be represented as the sum (union) of the products (cubes) representing the respective cones:

\[ cones(u) = \bigcup_i input(C_i(u)) \qquad (9) \]

Representing each K-feasible cone of the node u as a conjunction, the above relation becomes:

\[ cones(u) = \bigcup_i v_{i1} \cdot v_{i2} \cdot \ldots \cdot v_{im}, \quad m \le K \qquad (10) \]

Then the following Lemma holds:

Lemma 1. Given a node u having as immediate predecessors input(u) = {v, w, …, z}, each predecessor having its set of all K-feasible cones already computed, respectively cones(v), cones(w), …, cones(z), then the set cones(u) of all the K-feasible cones of node u satisfies:

\[ cones(u) \subseteq \{ (v \cup cones(v)) \cap (w \cup cones(w)) \cap \ldots \cap (z \cup cones(z)) \} \qquad (11) \]

Applications of Lemma 1 are presented in [6]. It was established that this algorithm computes all possible mapping solutions for each node (Bucur 1999).
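The construction in Lemma 1 can be sketched in Python by treating each cone as the set of its input nodes and each conjunction in (11) as a set union, discarding any product with more than K literals. This is a minimal illustration under assumed data structures (a dictionary `fanins` and a topological `order`, both hypothetical names); it is not the SIS-based implementation, and it computes no delay or cost information.

```python
def enumerate_cones(fanins, order, K):
    """Enumerate K-feasible cones bottom-up, following expression (11).

    fanins: dict mapping each node to the list of its direct
            predecessors (primary inputs map to an empty list).
    order:  all nodes in topological order (inputs first).
    K:      maximum number of cone inputs.
    Returns a dict: node -> set of frozensets, each frozenset being
    input(C(u)) for one K-feasible cone C(u).
    """
    cones = {}
    for u in order:
        if not fanins[u]:
            cones[u] = set()          # a primary input roots no cone
            continue
        partial = {frozenset()}       # empty product
        for v in fanins[u]:
            # Factor (v U cones(v)): either stop at v or continue
            # through one of the cones already computed for v.
            choices = {frozenset([v])} | cones[v]
            partial = {p | c
                       for p in partial
                       for c in choices
                       if len(p | c) <= K}   # prune as early as possible
        cones[u] = partial
    return cones


# Hypothetical network: s = f(a, b), u = g(s, c).
fanins = {"a": [], "b": [], "c": [], "s": ["a", "b"], "u": ["s", "c"]}
order = ["a", "b", "c", "s", "u"]
cones3 = enumerate_cones(fanins, order, K=3)
cones2 = enumerate_cones(fanins, order, K=2)
print(sorted(sorted(c) for c in cones3["u"]))
```

Because set union is idempotent, a node reached along two re-convergent paths is counted once, which mirrors the simplification v · v = v in the product form. With K = 3 the node u in this example has two cones, with inputs {s, c} and {a, b, c}; with K = 2 only {s, c} remains.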
Computing the sum-of-products (SOP) form of the expression (11), and eliminating (as soon as possible) all the products having more than K literals, one can determine cones(u), the set of all K-feasible cones of the node u. It is not difficult to see that there is only a polynomial number of K-feasible cones in the predecessor’s maximum transitive cone of each node u (denoted PMTCu), since the total number of possible combinations of K or fewer nodes is O(n^K), where n is the number of nodes in PMTCu. In practice, however, most of these combinations do not form cones, since the network connections determine the cones.

Table 1. Estimating computing effort in K-feasible cone generation

Circuit   Node Count   Depth   Time (seconds)   Max Cone Count   Max Node Count
C432         179         24         0.04              28                7
C499         206         13         0.08              45               15
C880         354         25         0.12             114               11
C1355        518         25         3.50             285               18
C1908        617         27         0.45             147               11
C2670        901         26         0.85             176               17
C3540       1270         41         1.05             327               15
C5315       2120         37         2.22             177               32
C6288       2353        120        10.44             284               21
C7552       2648         30         4.67             672               27

Results of the generation of all K-feasible cones rooted in every node of a DAG for 10 circuits from the MCNC 91 ATPG benchmark are shown in Table 1. These circuits were chosen because they contain mostly two-input gates and are among the largest of the benchmark. Each circuit’s internal node count and depth were computed after removing (sweeping) inverters, buffers, and constant nodes. This was followed by a simple two-input AND-OR decomposition of those gates having more than two inputs. The deepest circuit in Table 1, namely C6288, having 2353 internal nodes and depth 120, requires less than 11 seconds for an exhaustive 5-feasible cone generation rooted in each internal node of this circuit. The generation times listed in Table 1 were measured on an Intel Core 2 Duo T9300. The exhaustive generation of all K-feasible cones of each internal node of a network leads to a simpler and smoother approach to power-aware, minimum depth K-LUT network mapping.
A procedure able to compute efficiently all the K-feasible cones of all nodes in a network builds up, in general, the complete solution space of K-LUT mapping. One can then use any optimization criterion and any delay model associated with the edges of N(V, E), the DAG of the gate network N. In fact, the implemented algorithm is able to K-LUT map any K-bounded Boolean network.

Multiple-level circuits, which are interconnections of single-output combinational complex gates, are considered in this work. Multiple-level logic optimization is usually divided into two stages (Jang et al. 2009a). In the first stage, the logic is optimized while neglecting the implementation constraints on the logic gates and assuming loose models for their area and performance. The second stage is related to the technology or gate library used. K-LUT based FPGA implementation of multi-level networks can be viewed as a limitation of the fan-in of the gates to K. Decomposition is the main approach to obtaining a K-bounded network. Various methods have been used, including Roth-Karp decomposition (Boolean) and AND-OR decomposition (algebraic). Making a network K-bounded is generally considered a pre-processing step in the K-LUT mapping of FPGAs.

Table 2 lists part of the experiments we made to show that decomposition granularity influences the performance of the mapping process. After technology-independent optimization, we used balanced AND-OR decomposition of the circuits, with a parameterized number D of input lines in each type of gate. Technology-independent optimization was performed using an approach similar to the optimization made in SIS-1.2 with script.rugged (SIS-1.2, 1994).

Table 2. Decomposition factor and its influence on the minimum mapped depth

                            Minimum mapped depth for decomposition parameter D
Circuit   Initial depth     D=2    D=3    D=4    D=5
5xp1            2            2      2      2      2
9sym            3            3      3      3      3
C499           10            4      4      4      5
C5315          34            8      8      8      8
C880           19            7      8      9      9
alu2            6            5      6      6      5
alu4            7            5      6      7      7
apex2          11            5      6      6      6
apex4           9            5      6      7      6
apex6           5            4      5      5      5
apex7           5            4      5      5      5
b9              4            3      4      4      4
bw              5            1      1      1      1
clip            4            3      4      4      4
count           4            3      4      4      4
des             8            5      5      6      6
duke2          10            4      5      4      5
e64             9            3      3      3      3
f51m            4            3      4      4      4
misex1          2            2      2      2      2
misex2          5            2      2      2      2
rd73            2            2      2      2      2
rd84            3            3      3      3      3
rot            14            6      7      7      8
sao2            5            4      4      4      4
vg2             4            3      3      4      4

The results in Table 2 show that decomposition with D = 2 gives the best results: for no circuit does a larger value of D lead to a smaller mapped depth.

[Fig. 2. Initial circuit for minimum depth K-LUT mapping (K=3). Primary inputs are annotated as name;delay: a;1, b;3, c;1, d;2, e;1, f;1.]

The main advantage of the presented approach is pointed out for 3-LUTs using the simple circuit in Fig. 2. In Fig. 2, the primary input nodes (a, b, c, d, e, and f) carry additional information giving the individual delay of each one.

[Fig. 3. Initial circuit for minimum depth K-LUT mapping (K=3), with the regions C1 and C2 marked.]

There are two critical paths in this circuit: (b, s, u, w) and (d, n, s, u, w). Both critical paths contain the node u.
The sets of all K-feasible cones of the circuit's internal nodes are:

cones(x) = {(e, f; 2)},
cones(n) = {(c, d; 3)},
cones(s) = {(b, n; 4), (b, c, d; 4)},
cones(v) = {(n, x; 4), (c, d, x; 3), (n, e, f; 4)},
cones(u) = {(a, s; 5), (a, b, n; 4)},
cones(w) = {(u, v; 5), (a, s, v; 5), (u, n, x; 5)}.

In these cone sets, each cone carries a number indicating its minimal height (delay), reached when the inputs of the cone have attained their own minimal heights. Node w has three cones, all of the same height. Of these, the first one, (u, v; 5), is selected because it has the fewest literals. It follows that the heights of the implementations of node u and node v must not exceed 4. Node u has only one such implementation (node u belongs to both critical paths), namely (a, b, n; 4), while node v has three: (n, x; 4), (c, d, x; 3), and (n, e, f; 4). Note also that node v does not belong to any critical path. Since node n is already required by the implementation of node u, and nodes e and f are primary inputs, only the third implementation of node v, namely (n, e, f; 4), gives the optimal-area minimum-depth 3-LUT mapping of the circuit. The two elliptical regions in Fig. 3 mark how the nodes are chosen to collapse. This optimal-area minimum-depth implementation uses only four 3-LUTs. By contrast, choosing the minimum-height cone of node v, (c, d, x; 3), leads to a mapping with the same depth but with five 3-LUTs. Fig. 4 presents the optimal-area, minimal-depth 3-LUT mapping of the considered circuit. It is worth noting that the implementation chosen for node v is not the cone with the minimum possible height for this node (the minimum height of node v is 3).
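The heights and LUT counts quoted in this example can be re-derived with a short Python sketch. It assumes, as the annotated cone sets suggest, that the height of a cone is one more than the largest minimal height among its inputs, and that a primary input starts at its own delay. The helper names `min_heights` and `count_luts` are hypothetical and are not part of the authors' tool.

```python
# Cone sets and primary-input delays copied from the worked example.
cones = {
    "x": [("e", "f")],
    "n": [("c", "d")],
    "s": [("b", "n"), ("b", "c", "d")],
    "v": [("n", "x"), ("c", "d", "x"), ("n", "e", "f")],
    "u": [("a", "s"), ("a", "b", "n")],
    "w": [("u", "v"), ("a", "s", "v"), ("u", "n", "x")],
}
delay = {"a": 1, "b": 3, "c": 1, "d": 2, "e": 1, "f": 1}


def min_heights(cones, order, delay):
    """Minimal height of every node; a cone is one level above its tallest input."""
    height = dict(delay)
    for u in order:
        height[u] = min(1 + max(height[v] for v in cone) for cone in cones[u])
    return height


def count_luts(choice, outputs):
    """Number of LUTs used when every required node is implemented by choice[node]."""
    needed, stack = set(), list(outputs)
    while stack:
        u = stack.pop()
        if u in choice and u not in needed:   # primary inputs are not in `choice`
            needed.add(u)
            stack.extend(choice[u])
    return len(needed)


height = min_heights(cones, ["x", "n", "s", "v", "u", "w"], delay)

# Cover that reuses node n for both u and v.
shared = {"w": ("u", "v"), "u": ("a", "b", "n"), "v": ("n", "e", "f"), "n": ("c", "d")}
# Cover that gives node v its minimum-height cone instead.
fastest_v = {"w": ("u", "v"), "u": ("a", "b", "n"), "v": ("c", "d", "x"),
             "x": ("e", "f"), "n": ("c", "d")}
print(height["w"], count_luts(shared, ["w"]), count_luts(fastest_v, ["w"]))
```

Both covers keep node w at the same height, so the difference between them is purely one of area.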
[Fig. 4. Mapped circuit for minimum depth K-LUT mapping (K=3).]

6.2 Cost functions

Once the generation of K-feasible cones is complete, the same network is traversed from the primary outputs to the primary inputs, starting with the primary outputs having the largest mapped delay, as evaluated during K-feasible cone generation. The selection among the K-feasible cones of each node is guided by the critical path and by several appropriate cost functions. The main difficulty lies in the approach used to select a subset of all K-feasible cones that covers the whole circuit. The problem of mapping an arbitrary network for depth can be solved optimally in polynomial time using a dynamic programming procedure. In implementing our heuristics for dissipated power minimization, we used metrics similar to those of previous work (Anderson and Najm 2004; Jang et al. 2009b). However, our heuristics implement different metrics, because we attached several specific data items to each feasible cone during K-feasible cone generation. These data are related to the number of internal nodes of the cone (as an efficiency measure of the respective cone), the count of internal nodes having fan-outs that spread into other feasible cones (marking possibly duplicated nodes), etc.

Logic replication, or duplication, is performed implicitly when a K-LUT is used to implement a K-feasible cone. When a node in a circuit is replicated for depth minimization, a connection from the node to one of its successors is hidden within a K-LUT (Anderson and Najm 2002). Such hidden connections are no longer routed through the FPGA interconnection network and therefore no longer contribute to the interconnect power dissipation.

The way a feasible cone is selected from the set of K-feasible cones of an arbitrary node u is different when the nodes being mapped belong to two or more transitive cones (determined from the primary outputs). This helps avoid unnecessary node duplication when dynamic dissipated power is not an issue.

The depth metric of an arbitrary node u is computed over the best-depth K-feasible cone of u:

$$DepthMetric(cones(u)) = 1 + \min\bigl(DepthMetric(v \mid v \in cones(u))\bigr) \tag{12}$$

This metric is used mainly to quantify the depth criterion. EstimPowerCost is introduced in order to quantify the locally dissipated power. Its main goal is to attract as many high-activity lines as possible inside the LUTs. This cost is computed using the following relation:

$$EstimPowerCost(cones(u)) = \min_{C(u) \in cones(u)} \Bigl\{ \sum_{v \in Input(C(u))} \bigl( d(u) \cdot fanout(u) + EstimPowerCost(cones(v)) \bigr) \Bigr\} \tag{13}$$

EstimAreaCost of an arbitrary node u is used to estimate the area involved for each cone C(u) belonging to cones(u), the set of K-feasible cones of u:

$$EstimAreaCost(u) = \min_{C(u) \in cones(u)} \Bigl( Area(C(u)) + \sum_{v \in Input(C(u))} \frac{Area(v)}{fanout(v)} \Bigr) \tag{14}$$

All three metrics are, in fact, partially computed during the K-feasible cone generation step. Globally, the algorithm uses the following parameterized cost:

$$GlobalCost(u) = w_1 \cdot DepthMetric(u) + w_2 \cdot EstimPowerCost(u) + w_3 \cdot EstimAreaCost(u) \tag{15}$$

The parameters w1, w2 and w3 in (15) were determined experimentally. By choosing appropriate values of w1, w2 and w3, one can obtain a balanced mapping process.

7. EXPERIMENTAL RESULTS AND CONCLUSION

The basis of our approach is the exhaustive generation of all K-feasible cones rooted in every node of the network. The speed of this generation leaves enough time margin to search among all possible solutions for the most appropriate one.
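The cost functions of Section 6.2 lend themselves to a bottom-up evaluation, sketched below in Python for a small six-gate circuit. Several points are assumptions rather than statements from the paper: the depth metric of a cone is taken as one plus the largest metric among its inputs, primary inputs contribute zero to all three metrics, Area(C(u)) is one LUT, Area(v) is the area cost already estimated for v, and the transition densities are placeholder values. The function name `global_costs` is hypothetical.

```python
def global_costs(cones, order, fanout, density, weights):
    """Evaluate relations (12)-(15) bottom-up; returns GlobalCost for each node."""
    depth, power, area = {}, {}, {}
    for u in order:
        # (12): best depth over all cones of u (primary inputs count as 0).
        depth[u] = min(1 + max(depth.get(v, 0) for v in c) for c in cones[u])
        # (13): local activity term plus the power cost of every cone input.
        power[u] = min(sum(density[u] * fanout[u] + power.get(v, 0.0) for v in c)
                       for c in cones[u])
        # (14): one LUT for the cone plus the fanout-shared area of its inputs.
        area[u] = min(1.0 + sum(area.get(v, 0.0) / fanout[v] for v in c)
                      for c in cones[u])
    w1, w2, w3 = weights
    # (15): weighted combination of the three metrics.
    return {u: w1 * depth[u] + w2 * power[u] + w3 * area[u] for u in order}


# Small illustrative circuit with its 3-feasible cones.
cones = {
    "x": [("e", "f")],
    "n": [("c", "d")],
    "s": [("b", "n"), ("b", "c", "d")],
    "v": [("n", "x"), ("c", "d", "x"), ("n", "e", "f")],
    "u": [("a", "s"), ("a", "b", "n")],
    "w": [("u", "v"), ("a", "s", "v"), ("u", "n", "x")],
}
order = ["x", "n", "s", "v", "u", "w"]
fanout = {name: 1 for name in "abcdefxsuvw"}
fanout["n"] = 2                                   # n feeds both s and v
density = {name: 0.5 for name in fanout}          # placeholder transition densities

depth_only = global_costs(cones, order, fanout, density, (1, 0, 0))
area_only = global_costs(cones, order, fanout, density, (0, 0, 1))
balanced = global_costs(cones, order, fanout, density, (1.0, 0.5, 0.5))
print(depth_only["w"], area_only["w"], balanced["w"])
```

Setting two of the weights to zero isolates a single criterion, which is a convenient way to check each metric separately before tuning w1, w2 and w3 together.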
It was assumed that all primary inputs have a switching activity of 0.5 and that all the capacitances involved have the same value. In fact, the switching activities of the input lines are input parameters of our simulator. Our implemented algorithm was run to map several benchmark circuits into 5-LUT FPGAs. The results obtained are presented in Table 3. Mappers targeting LUT-based FPGAs have evolved from the early ones (Bucur 1999) to those needing reduced runtime and delivering enhanced quality of results. Our main idea was to hide inside LUTs the nodes having high estimated dynamic power consumption. The main support for this approach is the fact that the capacitance inside a LUT is very small, so the power consumption will be reduced.

Table 3. Experimental results of the PwAwMap mapping tool for K-LUT FPGAs.

                         Estimated Dynamic Power
Circuit   Depth   Optimum Depth   Optimal Depth   Optimal Depth & Area
5xp1        3          2.92            2.23               2.36
9symml      5          3.89            3.17               3.55
C499        5         10.76            8.82               9.16
C880        8         16.69           15.17              15.82
alu2        8         14.05           13.04              13.23
apex6       4         22.62           20.58              20.86
apex7       4         10.45            9.87               9.91
count       3          3.52            3.04               3.38
des         5         99.67           97.02              97.27
duke2       4          9.21            9.23               9.29
misex1      2          2.39            2.23               2.29
rd84        4          4.98            4.27               4.42
rot         6         37.68           36.54              36.66
vg2         4          4.32            3.21               3.29
z4ml        3          1.49            1.38               1.42
Total                244.64          229.80             232.91

To estimate power consumption using (1), it is necessary to know the capacitance of each net, or an estimate of it. At this stage of designing circuits targeting FPGA mapping, the capacitance of a net is not known, because placement and layout are not yet complete. Most power-conscious mappers (Anderson and Najm 2002) and power-aware placement and routing applications (Lamoureux and Wilton 2003) use the switching activity information of the netlist in order to estimate dynamic power dissipation. Other works do not use net capacitances or estimates of them, but concentrate only on globally reducing the switching activity in the considered network (Jang et al. 2009b).
In our current algorithm implementation, structural properties of the circuit were used to obtain an estimate of the interconnect capacitance. Assuming that most connections have, on average, the same length, the fan-out can be chosen as the main feature that differentiates the various connections. Since our aim was to build a tool able to evaluate, at medium grain, different network mapping choices during logic design, the estimated dynamic power of each node u was computed simply as the product of the transition density d(u) of the node and its fan-out:

$$EstimatedDynamicPower(u) = d(u) \cdot fanout(u) \tag{16}$$

PwAwMap is an efficient algorithm able to compute several low-power optimal options, as can be seen in Table 3. The first option keeps the optimum depth and searches among power-aware equivalent solutions. The second option searches, at the user's explicit request, for one of the solutions with optimal depth but with improved power consumption. The optimal depth is defined as an incremented optimum depth. For nodes situated on the critical paths of the respective networks, the optimal depth was computed using this relation:

$$optimalDepth(u)_{u \in CriticalPath} = optimumDepth + \lambda \tag{17}$$

The values listed in the second column of Table 3 were computed using λ = 1 for nodes belonging to the critical path, while for the other nodes the optimal depth values were at most equal to the optimal depth of the circuit. Area minimization is extremely important for FPGA synthesis. Since area-optimal technology mapping for K-LUT-based FPGAs is NP-hard (Farrahi and Sarrafzadeh, 1994b), we developed several heuristic methods. While maintaining the optimum depth of the network, the algorithm searches among the power-aware solutions for those having the minimal area (number of LUTs used).
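Relation (16) can be illustrated by summing d(u) · fanout(u) over the nets that remain visible after mapping, since nets hidden inside a LUT are not routed. The Python sketch below compares two 3-LUT covers of a small six-gate circuit. The summation over visible nets, the counting of a primary output as one load, the helper name `estimated_dynamic_power`, and the uniform density of 0.5 on every net are assumptions made for illustration. In the paper, only primary inputs are stated to have activity 0.5, and internal densities come from the simulator.

```python
from collections import Counter


def estimated_dynamic_power(choice, outputs, density):
    """Sum d(v) * fanout(v) over the nets that stay visible in the mapped network."""
    fanout = Counter(outputs)               # each primary output counts as one load
    needed, stack = set(), list(outputs)
    while stack:
        u = stack.pop()
        if u in choice and u not in needed:  # u is the root of a LUT in this cover
            needed.add(u)
            for v in choice[u]:
                fanout[v] += 1               # v drives one input pin of this LUT
                stack.append(v)
    return sum(density[net] * loads for net, loads in fanout.items())


density = {net: 0.5 for net in "abcdefnxsuvw"}   # placeholder transition densities

# Cover with four LUTs: node x is hidden inside the LUT rooted at v.
four_luts = {"w": ("u", "v"), "u": ("a", "b", "n"),
             "v": ("n", "e", "f"), "n": ("c", "d")}
# Cover with five LUTs: x becomes a visible net and c, d are loaded twice.
five_luts = {"w": ("u", "v"), "u": ("a", "b", "n"), "v": ("c", "d", "x"),
             "x": ("e", "f"), "n": ("c", "d")}

power_four = estimated_dynamic_power(four_luts, ["w"], density)
power_five = estimated_dynamic_power(five_luts, ["w"], density)
print(power_four, power_five)
```

With real transition densities the ranking of two covers may differ from their ranking by area, which is why the power and area criteria are weighted separately.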
The third solution targets optimal area and depth while keeping the dissipated power within a low margin (illustrated in the third column of Table 3). On average, the detailed experimental results in Table 3 show that, for power-aware mapping with optimal depth, the estimated dissipated power is 7.07% lower than for mapping with optimum depth (the column totals, 229.80 against 244.64, correspond to a reduction of about 6.07%). Relaxing the mapping conditions on circuit depth thus leads to lower dissipated power. However, introducing the minimal-area constraint makes mapping for both optimal depth and area only 4.80% more efficient (with respect to dissipated power) than mapping for optimum depth. Power-aware mapping for both optimal depth and optimal area appears to be more complex, and the heuristics currently used have to be upgraded, because only a limited part of the mapping solution space was searched. In the future development of our research, we intend to use dynamic programming together with refined heuristics in the PwAwMap algorithm.

REFERENCES

Anderson, J.H. and Najm, F.N. (2002). Power-aware technology mapping for LUT-based FPGAs. IEEE International Conference on Field-Programmable Technology, pp. 211-218, Hong Kong.
Anderson, J.H. and Najm, F.N. (2004). Power Estimation Techniques for FPGAs. IEEE Transactions on VLSI, Vol. 12, No. 10, pp. 1015-1027.
Bucur, I. (1999). An Optimal Mapping for delay Optimization of Lookup Table-Based FPGAs. Proc. of the 12th International Conference on Control Systems and Computer Science, pp. 127-132.
Bucur, I. (2007). Performance mapping of k-LUT based FPGAs. Univ. Politehnica of Bucharest, Scientific Bulletin, Series C, Vol. 69, No. 2, pp. 49-60.
Bucur, I., Fagarasan, I., Popescu, C., Boiangiu, C.-A., and Culea, G. (2008). On K-LUT Based FPGA Optimum Delay and Optimal Area Mapping. Proc. of WSEAS International Conference on Math. and Comput. Methods in Science and Engineering, pp. 137-142.
Bucur, I., Stefanescu, C., Surpateanu, A., and Cupcea, N. (2009). Power-Aware and Optimal Depth Mapping of LUT Based FPGA Circuits. Proceedings of the 17th ICCSCS-17, May'09, Bucharest, Romania, pp. 117-124.
Chen, C.-S., Hwang, T.T., and Liu, C.L. (1997). Low Power FPGA Design: A Re-engineering Approach. Proc. of the 35th DAC, pp. 656-661.
Chen, D., Cong, J., and Pan, P. (2006). FPGA Design Automation: A Survey. Foundations and Trends in Electronic Design Automation, Vol. 1, No. 3, pp. 195-330.
Farrahi, A.H. and Sarrafzadeh, M. (1994a). FPGA technology mapping for power minimization. In R.W. Hartenstein and M.Z. Servit (Editors), Field-Programmable Logic Architectures, Synthesis and Applications, Springer, Lect. Notes in Comp. Science, Germany, pp. 66-77.
Farrahi, A.H. and Sarrafzadeh, M. (1994b). Complexity of the look-up table minimization problem for FPGA technology mapping. IEEE Trans. on CAD of IC and Systems, Vol. 13, No. 11, pp. 1319-1332.
Gupta, S. and Anderson, J. (2007). Optimizing FPGA Power with ISE Design Tools. Xcell Journal, Second Quarter.
Hansen, L. and Thomas, T. (2005). Complete FPGA and CPLD Power Analysis. Xcell Journal, Second Quarter.
Ho, C.H., Leong, P.H.W., Luk, W., and Wilton, S.J.E. (2008). Rapid Estimation of Power Consumption for Hybrid FPGAs. Intnl. Conf. on FPGA and Apps., pp. 227-232.
Hsieh, C.-T., Cong, J., Zhang, Z., and Chang, S.-C. (2008). Behavioral Synthesis with Activating Unused Flip-Flops for Reducing Glitch Power in FPGA. Proc. ASP-DAC, pp. 10-15.
Jang, S., Chung, B., Chan, K., and Mishchenko, A. (2009a). WireMap: FPGA Technology Mapping for Improved Routability and Enhanced LUT Merging. ACM Transactions on Reconfigurable Technology and Systems, Vol. 2, No. 2, Article 14, June'09.
Jang, S., Chung, B., Chan, K., Mishchenko, A., and Brayton, R. (2009b). A power Optimization Toolbox for Large Synthesis and Mapping. Proc. IWLS'09, pp. 1-8.
Jiang, W., Zhang, Z., Potkonjak, M., and Cong, J. (2008). Scheduling with Integer Time Budgeting for Low-Power Optimization. Proc. ASP-DAC'08, pp. 22-27.
Jones, P.H., Cho, Y.C., and Lockwood, J.W. (2007). Dynamically Optimizing FPGA Applications by Monitoring Temperature and Workloads. Proc. 20th Intnl. Conf. on VLSI Design and 6th Intnl. Conf. Embedded Systems, Bangalore, India, Jan 6-10, pp. 391-400.
Lamoureux, J. and Wilton, S. (2003). On the Interaction Between Power-Aware FPGA CAD and Algorithms. Proc. IEEE/ACM ICCAD'03, pp. 701-708.
Li, H., Mak, W.-K., and Katkoori, S. (2001). LUT-Based FPGA Technology Mapping for Power Minimization with Optimal Depth. Proc. IEEE CS Workshop on VLSI, Orlando, FL, April 19-20, pp. 123-128.
Mashayekhi, M., Jeddi, Z., and Amini, E. (2008). Power Optimization of LUT based FPGA circuits. Proc. 11th IEEE Intnl Conf. On Optimization of Electrical and Electronic Equipment, Optim Brasov, Romania, pp. 37-40.
Najm, F.N. (1993). Transition Density: A New Measure of Activity in Digital Circuits. IEEE Trans. on CAD of IC and Systems, Vol. 12, No. 2, pp. 310-323.
Pedram, M. (1996). Power minimization in IC design: Principles and Applications. ACM TODAES, Vol. 1, No. 1, pp. 3-56.
Poon, K., Wilton, S., and Yan, A. (2005). A Detailed Power Model for Field-Programmable Gate Arrays. ACM TODAES, Vol. 10, Issue 2, pp. 279-302.
Rose, J., Francis, R., Lewis, D., and Chow, P. (1990). Architecture of Field Programmable Gate Arrays: The effect of Logic Block Functionality on Area Efficiency. IEEE J. of Solid State Circuits, Vol. 25, No. 5, pp. 1217-1225.
Singh, A. and Marek-Sadowska, M. (2002). Efficient Circuit Clustering for Area and Power Reduction in FPGAs. International Symposium on FPGAs, pp. 59-66.
SIS-1.2. (1994). http://embedded.eecs.berkeley.edu/pubs/downloads/sis/index.htm.
Sutter, G. and Boemo, E. (2007). Experiments in Low Power FPGA Design. Latin American Applied Research, Vol. 37, No. 1, pp. 99-104.
Sasao, T. and Mishchenko, A. (2009). LUTMIN: FPGA logic synthesis with MUX-based and cascade realizations. Proc. IWLS'09, pp. 310-316.
Wang, Z.-H., Liu, E.-C., Lai, J., and Wang, T.-C. (2001). Power minimization in LUT-based FPGA technology mapping. Proc. ASP-DAC'01, pp. 635-640.
Xu, M. and Kurdahi, F.J. (1997). ChipEst-FPGA: a tool for chip level area and timing estimation of lookup table based FPGAs for high level applications. Proc. ASP-DAC'97, pp. 435-440.
Yang, A. (2005). Design Techniques to Reduce Power Consumption. Xcell Journal, Third Quarter.