[go: up one dir, main page]

Academia.eduAcademia.edu
Annals of DAAAM for 2009 & Proceedings of 20th DAAAM International Symposium Publishing of research/scientific report as paper in ISI Proceedings without presentation at the conference SIGNAL DRIVEN PARTITIONING LARGE CIRCUITS BUCUR , I[on]; CUPCEA N[icolae]; STEFANESCU, C[ostin]; POPESCU C[ornel]; SURPATEANU, A[drian] & BOIANGIU, C[ostin]-A[nton] Abstract: Partitioning is a central problem in VLSI design automation, addressing circuit’s manufacturability. Circuit partitioning has multiple applications in VLSI design. One of the most common is that of dividing combinational circuits (usually large ones) that will not fit on a single package among a number of packages. Partitioning is of practical importance for k-LUT based FPGA circuit implementation. In this work is presented new multilevel and multi-resource partitioning algorithm targeting large combinational circuits in order to efficiently use existing FPGAs circuits. Key words: combinational networks, multilevel partitioning, signal driven, critical path delay, transitive cluster. 1. INTRODUCTION Field Programmable Gate Arrays (FPGAs), providing both large-scale integration and user-programmability, are essential architectures. FPGA packages, as a general feature, have maximum size Combinational Logic Blocks (CLBs) constraints much larger than the number of input-output pins IOBs. Therefore, realization of a large logic network into working FPGA involves network division into a near balanced packing of CLBs and Input-Output Blocks (IOBs). Partitioning is a modus operandi of separating a circuit or system into a set of smaller sub-circuits with almost equal sizes targeting to lessen the amount of links between the sub-circuits. Market trend pressure yields larger systems, both to make manufacture cheaper and to take advantage of the optimization probable of the entire structure. In this work is presented multilevel multiresource partitioning algorithm for partitioning large combinational circuits in order to efficiently use existing and commercially available FPGAs packages. 2. SPECIFIC ISSUES FPGA circuit realization has two main phases. Placement phase, the first one, is dedicated to assign useful locations within the FPGA structure, to the obtained blocks. This could be, however, an iterative process. Routing phase, the last one, provides the links between these blocks. Circuit partitioning is used, however, twice in FPGA implementation. First usage concerns too large designs to fit available FPGA packages. A less obvious usage of network partitioning is used in the blocks’ placement phase. Placement procedures based on circuit partitioning yields amazing results. 3. PREVIOUS WORK Typical partitioning objectives such as minimum-width bisection and minimum ratio cut are NP-complete and require such heuristics as simulated annealing, greedy k-opt interchange or quadratic optimization - via relaxation or spectral methods (Bucur, 2006). In (Brasen & Saucier 1998), two new partitioning algorithms are presented that use cone structures to partition large hierarchical blocks into FPGA's. Cone structures are minimum cut partitioning structures for net lists with low fanout, and clustering structures for partitioning net lists with high fan-out. Cone structures also allow for full containment of critical paths. The authors of the paper (Kwon & Kyung, 2004) proposed a circuit partitioning algorithm called SCheduling driven Algorithm for TOMi (SCATOMi) for multi-FPGA systems with interconnection architecture called Time-multiplexed, Offchip, Multi-casting interconnection (TOMi). SCATOMi improves the performance of the TOMi architecture by limiting the number of inter-FPGA signal transfers on the critical path and considering the scheduling of inter-FPGA signal transfer. The paper (Muthukumar & Selvaraj, 2003) deals with the problem of determining the set of best free and bound variables (variable partitioning problem) for disjoint (disjoint serial) decomposition; such that the decomposed circuits are smaller in size and its truth table representation have maximal don’t cares. The proposed approach has been successfully implemented and tested with MCNC and Espresso benchmarks. 4. PROBLEM FORMULATION In this paper, it is introduced new multilevel partitioning algorithm for combinational Boolean networks. A combinational Boolean network C can be represented traditionally as a directed acyclic graph G = (V, E) where each node n (nV) represents a logic gate and a directed edge (i, j), ((i, j)  E) exists if the output of gate i is an input of gate j. A primary input (PI) node has no incoming edge and a primary output (PO) node has no outgoing edge (De Micheli, 1994). A disjoint Q-way partitioning solution S  (A1, A2, ... AK) satisfies the following conditions: and contains all the gates in the network, than A1, A2, ... AK are known as clusters of G(C). Each node in C has only one output line and limited number of input lines. It is used input(v) to denote the set of fanins of gate. Given a subgraph H of the Boolean network, let input(H) denotes the set of distinct nodes outside H, which supply inputs to the nodes in H (fanins of H). For a node n in the network, a w-feasible cone at n, denoted Kn, is a subgraph consisting of node n and its predecessors (u is a predecessor of n if there is a directed path from u to n), such that |input(Kn)|  w and any path connecting a node in Kn and n lies entirely in Kn. The level of a node is the length of the longest path from any PI node to n. The level of a PI node is zero. The depth of a network is the largest node level in the network (Hachtel & Somenzi, 1996). A Boolean network is p-bounded if |input(n)|  p for each node n in the network. 5. SIGNAL DRIVEN PARTITIONING Signal driven partitioning procedure was implemented atop of SIS-1.2 (***, 1994). Implemented algorithm is splitting-up circuit C, before mapping it. Combinational circuits could be very large and cluster partitioning helps obtaining more technological compliant mapping over the initial circuit. Each internal node structure has additional array denoted po_label, mapping all POs nodes of the circuit. First traversal, depth first search from outputs, establish nodes affiliation with respect to the PO nodes. An internal node having more than one element not zero in its po_label array belongs to more than one PO transitive cluster, and it’s said to be multiple dominated. If node n belongs to the transitive clusters of PO1, PO2 and PO3, as an example, than po_label(1) = po_label(2) = po_label(3) = 1. Such nodes are defining sub-cluster(1,2,3) as the intersection of the three mentioned clusters. Mapping for k-LUT is made over homogenous dominated clusters. It means that nodes dominated only by PO z, or by PO z and PO t, as an example, will be mapped in separate mapping sessions. This strategy separates nodes having fan-outs in more than one single output cluster avoiding interactions during mappings in different primary output clusters. Additionally, clusters with multiple domination identification are making simpler the task of mapping for critical performance. 6. EXPERIMENTAL RESULTS Implemented cluster partitioning algorithm working with minLevelMapIVv3 (cluster mapping tool) was tested against minDepth, our reference mapping tool, used without cluster partitioning. Results are presented in Tab. 1. Circuits, listed in Tab. 1, are taken from MCNC91 multilevel examples benchmark; are selected the most representative ones (used in mostly similar works). Cluster partitioning procedure was designed first to minimize the critical path delay. This was implemented by merging those clusters containing nodes belonging to the critical path but having enough slack delay in order to avoid introducing other costs to the partitioning objective. Actual algorithm is computing all clusters Clusters(n) rooted on each internal node n of the network and having less inputs than k (k = number of input lines of each LUT in FPGA) practically. Comparing results, in Tab. 1, for minDepth and minLevelMapIVv3 it is obvious that almost all results are a little bit less competitive for minLevelMapIVv3 with cluster partitioning. However, these results are better than those previously reported in (Bucur 2006), because were introduced several new and powerful heuristics. These heuristics are working progressively and are searching best clusters based on distance among clusters having the same slack delay. Circuit Depth minDepth minLevelMapIVv3 LUT count LUT count 9sym 5 7 8 C499 4 67 68 C5315 8 500 501 C880 7 130 133 alu2 5 129 130 alu4 5 549 550 apex2 5 150 152 apex4 5 875 881 apex6 4 222 222 apex7 4 67 71 b9 3 37 39 clip 3 44 45 count 3 74 74 des 5 1014 1020 duke2 4 151 152 e64 3 338 339 f51m 3 51 51 rd84 3 13 13 rot 6 204 206 sao2 4 57 57 vg2 3 35 35 4714 4747 Tab. 1. Comparative experimental results. Results obtained, using actual algorithm minLevelMapIVv3 with cluster partitioning, are on average only 0.7% larger than those computed with minDepth (no cluster partitioning). Close examination of circuits are showing that several nodes are still avoidable duplicated. LUT count values are pointing out that, keeping the same depth of the homologues circuits, the area is mostly the same in the partitioned mapped circuit and in the not-partitioned mapped one. However, actual model is using the simplest model of the depth (level) and no references are made to the real interconnections and to their delay (Bucur et al. 2008). Implemented algorithms are mainly oriented to find best performance solution, using the depth model. 7. CONCLUSION AND FUTURE WORK Actual heuristic guided cluster-based partitioning approach is showing improvements, both in terms of the cut size and in the run time, compared with initial approach (partitioning the given circuit before mapping). Since wholly automatic partitioning is essential for fast iterations in the design and optimization steps, substantial study is made in university circles as well as in industry to make possible and get better the not so easy decisions on functional stage. Both mapping algorithms are, actually, under research and progress in order to be able to allow a variety of delay models simultaneously with new mapping routines in order to gain also enhanced area results. Cluster partitioning algorithm will be enhanced with better fast cost estimators (comparing with the actual distance). This will improve efficiency mainly on non-critical path cones process. 8. REFERENCES Brasen, D. R. & Saucier, G. (1998). Using cone structures for circuit partitioning into FPGA packages. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 17, No7, pp. 592-600, ISSN: 0278-0070. Bucur, I. (2006). Partitioning combinational circuits for k-LUT FPGA mapping. University Politehnica of Bucharest Scientific Bulletin, Series C: Electrical Engineering, Vol. 68, No. 2, (2006),pp. 91-100, ISSN: 1454-234x. Bucur, I.; Fagarasan, I.; Popescu, C.; Boiangiu, C.-A. & Culea, G. (2008). On K-LUT Based FPGA Optimum Delay and Optimal Area Mapping. Proc. of the 10th WSEAS Intnl. Conf. on Mathematical and Computational Methods in Science and Engineering, pts I and II, Iliescu M et al. (Eds), pp.137-142,ISBN:978-960-474-019-2,Bucharest,Nov.2008, World Scientific and Engineering ACAD and SOC, Athens. De Micheli, G. (1994). Synthesis and Optimization of Digital Circuits, McGraw-Hill, Inc., ISBN 0-07-016333-2, USA. Hachtel, G. D. & Somenzi, F. (1996). Logic Synthesis and Verification Algorithms, Springer Science+Business Media, Inc., ISBN 978—0-387-31004-6, USA. Kwon, Y.-S. & Kyung, C.-M. (2004). Scheduling driven circuit partitioning algorithm for multiple FPGAs using timemultiplexed, off-chip, multi-casting interconnection architecture. Microprocessors and Microsystems Vol. 28, Issues 5-6, 2 August 2004, pp. 341-350, ISSN: 0141-9331. Muthukumar, V. & Selvaraj, H. (2003). Comparison of Heuristic Algorithms for Variable Partitioning in Circuit Implementation. Proceedings of 16th International Conference on VLSI Design, pp. 51-57, New Delhi, India, January 04-08 2003, IEEE Computer Science, USA NJ. *** (1994). SIS-1.2, Synthesis of both synchronous and asynchronous sequential circuits, Available from: http://embedded.eecs.berkeley.edu/pubs/downloads/sis/inde x.htm, Accessed on: 2009-07-12.