Guojie Luo

    Clock power contributes a significant portion of chip power in modern IC design. Applying multi-bit flip-flops can effectively reduce clock power. State-of-the-art work performs multi-bit flip-flop clustering at the post-placement stage. However, the solution quality may be limited because the combinational gates are immovable during the clustering process. To overcome this deficiency, we propose multi-bit flip-flop bonding at placement. Inspired by ionic bonding in chemistry, we direct flip-flops to merging-friendly locations, thus facilitating flip-flop merging. Experimental results show that our algorithm, called FF-Bond, saves 27% of clock power on average. Compared with state-of-the-art post-placement multi-bit flip-flop clustering, FF-Bond reduces clock power by a further 14%.
    Transposed convolution is a learnable up-sampling operator widely used in deep neural networks. It up-samples the input activations to generate useful information in applications such as style transfer and super-resolution. Demand for accelerating transposed convolution layers is rising, since they account for a large portion of the computation in GAN-like networks.
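    To make the up-sampling behavior concrete, the following is a minimal one-dimensional sketch of a transposed convolution. The function name, the `stride` parameter, and the example kernel are illustrative assumptions, not taken from the paper, and the sketch says nothing about how the paper accelerates the operator.

```python
def transposed_conv1d(x, kernel, stride=2):
    """Up-sample x by scattering a scaled copy of kernel for each input element.

    Illustrative only: a plain 1D, single-channel, no-padding formulation.
    """
    out_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, xi in enumerate(x):
        for k, wk in enumerate(kernel):
            # Each input element contributes to a window that starts at i * stride.
            out[i * stride + k] += xi * wk
    return out


# Three inputs become six outputs with stride 2 and a length-2 kernel.
upsampled = transposed_conv1d([1.0, 2.0, 3.0], [1.0, 0.5], stride=2)
print(upsampled)
```

    When the kernel is longer than the stride, neighboring windows overlap and their contributions are summed.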
    Circuit clustering is usually done through discrete optimizations to enable circuit size reduction or design-specific cluster formation. In this article, we are interested in the register-clustering technique for clock-power reduction by leveraging new opportunities introduced by multibit flip-flop (MBFF). Currently, INTEGRA is the only existing postplacement MBFF clustering optimizer with a subquadratic time complexity. However, it severely degrades the wirelength, especially for realistic designs, which may nullify the benefits of MBFF clustering. In contrast, we formulate an analytical clustering score with a nonlinear programming framework, in which the wirelength objective can be seamlessly integrated and the solver has empirical subquadratic time complexity. With the MBFF library, the application of our analytical clustering method achieves comparable clock power to the state-of-the-art techniques, but further reduces the wirelength by about 25%. Even without the MBFF library,...
    The high performance and energy requirements of convolutional neural networks (CNNs) can be a limiting factor for their application in many areas. Recently, FPGA-based CNN accelerators have been demonstrated to have superior energy efficiency compared to high-performance devices such as GPGPUs. However, due to constrained on-chip resources and many other factors, single-board FPGA designs may have difficulty achieving optimal energy efficiency. In this paper, we present a deeply pipelined multi-FPGA architecture that expands the design space for optimal performance and energy efficiency. A dynamic programming algorithm is proposed to map the CNN computing layers efficiently to different FPGA boards. To demonstrate the potential of the architecture, we built a prototype system with seven FPGA boards connected by high-speed serial links. The experimental results on AlexNet and VGG-16 show that the prototype can achieve up to 21× and 2× the energy efficiency of optimized multi-core CPU and GPU implementations, respectively.
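    The abstract does not give the cost model behind the dynamic programming mapper. As a rough sketch only, the code below assumes that consecutive layers are assigned to consecutive boards and that the goal is to minimize the slowest pipeline stage, where a stage costs the sum of its per-layer latencies. The function name, the latency figures, and the objective are all hypothetical.

```python
from functools import lru_cache


def map_layers_to_boards(layer_costs, num_boards):
    """Split layers into at most num_boards contiguous stages.

    Returns (bottleneck, cuts), where cuts holds the exclusive end index of
    each stage. Assumed objective: minimize the slowest stage's total cost.
    """
    n = len(layer_costs)
    prefix = [0]
    for cost in layer_costs:
        prefix.append(prefix[-1] + cost)

    @lru_cache(maxsize=None)
    def best(i, b):
        # Best mapping of layers i..n-1 onto at most b boards.
        if i == n:
            return (0, ())
        if b == 1:
            return (prefix[n] - prefix[i], (n,))
        result = None
        for j in range(i + 1, n + 1):
            stage = prefix[j] - prefix[i]   # cost of layers i..j-1 on one board
            rest, cuts = best(j, b - 1)     # remaining layers on remaining boards
            candidate = (max(stage, rest), (j,) + cuts)
            if result is None or candidate[0] < result[0]:
                result = candidate
        return result

    return best(0, num_boards)


# Hypothetical per-layer latencies mapped onto three boards.
bottleneck, cuts = map_layers_to_boards([4, 3, 2, 6, 1], 3)
print(bottleneck, cuts)
```

    A realistic mapper would also need to account for per-board resource limits and inter-board link bandwidth, which this sketch ignores.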
    In this paper, we present our quantitative studies of the impact of 3D IC design on repeater usage. The repeater usage is estimated by the interconnect optimizer IPEM in the post-placement/pre-routing stage, where the 2D and 3D placements are generated by the state-of-the-art mixed-size placers mPL6 and mPL-3D. Experiments on a set of real industrial designs show that, through 3D placement, the total number of repeaters used in the on-chip interconnections can be reduced by 19.74% and 51.41% on average with 3-layer and 4-layer 3D IC designs, respectively.
    X-ray computed tomography is an important technique for clinical diagnosis and nondestructive testing. In many applications, a number of image processing steps are needed before the image information can be used. Image segmentation is one such step. The conventional approach is to first reconstruct the image and then segment it with separate image processing methods. An emerging technique is to obtain the tomographic image and its segmentation simultaneously. An iterative algorithm for simultaneous reconstruction and segmentation using the Mumford-Shah model has been proposed, which not only regularizes the ill-posed tomographic reconstruction problem but also provides the image segmentation. The Mumford-Shah model is both mathematically and computationally difficult. In this paper, we accelerate this simultaneous reconstruction and segmentation algorithm using FPGA devices. The algorithm is hand-optimized with both algorithmic domain knowledge and platform-specific information before being translated into an FPGA implementation using high-level synthesis and other electronic system-level design tools. A high-level performance model is used to guide the design and optimization process at early stages. The computational kernel, the frequently invoked Radon transformation, is parallelized by tiling the entire image into sub-images. Other optimization techniques, including loop pipelining, loop merging, data streaming, and computation sharing across computation modules, are used to improve performance. Intensive optimizations are also adopted to maximize the use of FPGA on-chip block RAMs rather than off-chip DRAMs, increasing memory bandwidth. Experimental results show that the FPGA accelerator achieves a 9.24X speedup over the CPU implementation for this computation- and data-intensive application.
    The physical design process for 3D ICs is similar to traditional 2D physical design, in the sense that it transforms the circuit representation from a netlist into a geometric representation through the steps of floorplanning, placement, and routing. Although the multiple metal layers used for interconnects already give traditional ICs a 3D structure, 3D IC technologies allow multiple layers of logic devices to be integrated in the third dimension by bonding stacks of multiple “tiers” to form 3D chips. Each tier, which is similar to a ...
    Achieving optimal throughput by extracting parallelism in behavioral synthesis often exacerbates memory bottleneck issues. Data partitioning is an important technique for increasing memory bandwidth by scheduling multiple simultaneous memory accesses to different memory banks. In this paper, we present a vertical memory partitioning and scheduling algorithm that can generate a valid partition scheme for arbitrary affine memory inputs.
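    The idea of steering simultaneous accesses to different banks can be illustrated with plain cyclic partitioning, which is a simpler, swapped-in technique and not the vertical partitioning algorithm of the paper. The sketch assumes a stride-1 loop whose body reads `a[i + o]` for each offset `o`, with the bank index chosen as `address % num_banks`. The function name and the `max_banks` limit are hypothetical.

```python
def min_conflict_free_banks(offsets, max_banks=64):
    """Smallest bank count for which a[i + o] never collide, for all i.

    Assumes cyclic partitioning (bank = address % n). Two accesses i + o1 and
    i + o2 hit the same bank exactly when o1 and o2 are congruent modulo n,
    so the check does not depend on the loop index i.
    """
    for n in range(len(offsets), max_banks + 1):
        if len({o % n for o in offsets}) == len(offsets):
            return n
    return None  # no conflict-free cyclic scheme within max_banks


# Accesses a[i], a[i + 3], a[i + 6] all collide with 3 banks but not with 4.
print(min_conflict_free_banks([0, 3, 6]))
```

    With the returned bank count, every access in one loop iteration can be scheduled in the same cycle, provided each bank offers a single port.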
    Early chip planning is becoming more critical as server system designers strive to explore a large design space with multiple cores and accelerators in an advanced silicon technology that includes 3D chip stacking. During early chip planning, designers search for the high-level design and layout that best satisfies a myriad of constraints and targets. In this paper, we discuss our experience in applying traditional floorplanning tools at this early stage and suggest how they might be adapted for early floorplanning.
    Most existing 3D designs restrict each functional module in the logical hierarchy to a single die, which may not yield the best 3D physical hierarchy. However, a flat 3D implementation greatly increases the design complexity. Therefore, it is worthwhile to apply virtual 3D physical design methods for design planning at the early design stage, instead of only performing floorplanning with existing 2D modules. In general, we are motivated to use a 3D placer to explore the benefits of removing the logical hierarchy restrictions at the early design stage. We perform experiments on the design planning of the LEON3 processor. Compared to a flat 3D design, planning the entire processor core on a single die results in 10% longer wirelength, and planning the entire register file on a single die results in 20% longer wirelength. These results support quantitative analysis of the tradeoff between design complexity and wirelength cost.
    There are two prominent problems with technology scaling: increasing design complexity and greater challenges in interconnect design, including routability. High-level synthesis has been proposed to solve the complexity problem by raising the abstraction level. In this paper, we share our vision that high-level synthesis can potentially help with the routability problem as well.
    A unified optimization framework is presented for simultaneous gate sizing and placement. These processes are unified using Lagrangian multipliers, which synchronize the efforts of the gate sizing and placement subproblems. As far as we know, this is the first work that formulates and solves simultaneous gate sizing and placement under area density constraints, which are handled by the quadratic penalty method.
    Existing thermal-aware 3D placement methods assume that the temperature of 3D ICs can be optimized by properly distributing the power dissipation, and they ignore the heat conductivity of through-silicon vias (TSVs). However, our study indicates that this is not exactly correct.