A Reconfigurable System Featuring Dynamically Extensible Embedded Microprocessor, and Customisable
STMicroelectronics
Innovative Systems Design, NVM-DP, Central R&D
Agrate Brianza (MI), ITALY
Abstract
A system-chip targeting image and voice processing and recognition application domains is implemented as a representative of the potential of using programmable logic in system design. It features an embedded reconfigurable processor built by joining a configurable and extensible processor core and an SRAM-based embedded FPGA. Application-specific bus-mapped coprocessors and flexible I/O peripherals and interfaces can also be added and dynamically modified by reconfiguring the embedded FPGA. The architecture of the system is discussed, as well as the design flows for pre- and post-silicon design and customisation. The silicon area required by the system is 20 mm² in a 0.18 µm CMOS technology. The embedded FPGA accounts for about 40% of the system area.

Introduction
These days we are witnessing two conflicting trends in the electronics industry. On one side, the economics of system integration pushes logic suppliers towards ever more complex system-chip devices. On the other side, the increasing complexity of design and its associated risks, the increase of non-recurrent engineering expenses, and shorter time-to-market and product life are causing OEMs to look for faster-turnaround and lower-risk design solutions and technology.
The recent introduction of embedded programmable logic allows ASIC and ASSP vendors to broaden the appeal of their products. Also, hardware programmability can be exploited by system integrators for product customisation.
In this paper we present a pragmatic approach to introduce flexibility in system-chip design and exploit embedded programmable silicon fabrics to enhance system performance. In particular, enabling application-specific configurations to adapt the underlying hardware architecture to time-varying application demands can improve execution speed and reduce power consumption compared to a general-purpose programmable solution. In the proposed system the embedded programmable logic allows static or dynamic configuration of the instruction set of an embedded microprocessor, the creation of bus-mapped application-specific hardware coprocessors and accelerators, and the customisation of the system I/O. The latter feature allows the device to potentially connect to any external unit/sensor, given that its communication protocol can be mapped to the on-chip programmable logic. Also, some computations can be performed on-the-fly when data is captured.
The proposed system has been built using a set of state-of-the-art IP cores and system design methodology. In particular, a configurable and extensible processor (1) with associated tools, and an embedded FPGA (2) were used. The resulting system has been developed to target image and voice processing and recognition application domains. Design flows for system exploration and implementation are also introduced.

System Architecture
One of the main goals of this work was to build a flexible architecture, working at a reasonably high clock frequency, built around an embedded FPGA and an extensible 32-bit microprocessor.
The base processor is a specific customisation of that described in (1). It comes with a complete set of tools for configuration and performance analysis. The main features of the processor core used in our system are: a 5-stage pipeline, 8+8kB direct-mapped data/instruction caches, a 24 or 16 bit instruction format for improved code density, a 64 bit processor interface (PIF) with burst transfers for cache-page refill, and 13 interrupt lines organized in 4 priority interrupt levels.
The system architecture is illustrated in Fig.1. The PIF/AHB Bridge translates processor cycles to the AMBA AHB bus (3) with support for fast burst and locked transfers. An external memory interface (EMI) exploits the available peak throughput of the fastest commercial external non-volatile flash memories. It allows a wide range of burst mode and page mode configurations under software control and supports low-voltage, low-swing operations. If required, an external RAM port allows the extension of the on-chip 48kB SRAM.
The heart of the system is an embedded FPGA and its multiple interfaces to the main system units. In particular, the functional purposes of the e-FPGA programmable logic are:
• extension of the processor datapath supporting a set of additional special-purpose instructions (TIE). This is done by connecting the processor datapath through a wide bus and a specific interface (TIE bus/interface in Fig. 1);
• bus-mapped coprocessor. Hardware units mapped into the e-FPGA can be interfaced to the system bus through an AHB bus master/slave;
0-7803-7250-6/02/$10.00 © 2002 IEEE. IEEE 2002 Custom Integrated Circuits Conference.
• flexible I/O. The programmable general-purpose I/O pads interface is used to connect external units or sensors with their application-specific communication protocol.
All these possibilities may be mixed in a single configuration for the FPGA, and this results in a highly configurable device. To accelerate communications between the configurable hardware and software tasks running on the processor, 4 interrupt channels can be driven by logic mapped into the e-FPGA. A two-way HW/SW communication can be implemented by the joint usage of these interrupt channels and dedicated AMBA APB registers.

runtime re-configuration of the instruction set. This implies that the number of user-defined instructions available at a given time is limited by the e-FPGA logic capacity and instruction logic complexity. However, a set of additional instructions can be defined to target specific application needs. If the logic size of the set of additional instructions exceeds the logic capacity of the e-FPGA, it might be split into a number of contexts fitting the size constraints of the e-FPGA. These contexts might be used to dynamically re-program the FPGA to support application needs.
The flexibility advantage of this architecture implies a speed penalty for the part of the logic mapped inside the e-FPGA. In particular, specific processor instructions mapped in the reconfigurable fabric may be 1x to 10x slower than their equivalent implementation in standard cells.
Fig.2 details the processor-FPGA interface: the focus is on how Instruction Extensions are mapped inside the FPGA and how synchronisation between the microprocessor and the e-FPGA is guaranteed.
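The context-splitting scheme described above (a set of additional instructions too large for the e-FPGA is divided into contexts that each fit) can be illustrated with a small sketch. The paper does not say how contexts are formed: the first-fit-decreasing heuristic, the function name split_into_contexts, and the per-instruction gate counts below are assumptions made purely for illustration. Only the 15k-gate capacity is taken from Table II.

```python
from typing import Dict, List

# Usable equivalent ASIC gates of the e-FPGA, as quoted in Table II.
EFPGA_CAPACITY_GATES = 15_000


def split_into_contexts(instr_gates: Dict[str, int],
                        capacity: int = EFPGA_CAPACITY_GATES) -> List[List[str]]:
    """Group instructions into contexts whose total logic fits the capacity.

    First-fit decreasing is a simple heuristic (an assumption, not the
    authors' method); it does not guarantee the minimum number of contexts.
    """
    contexts: List[List[str]] = []
    free: List[int] = []  # remaining capacity of each context
    # Largest instructions first; ties broken by name for determinism.
    for name, gates in sorted(instr_gates.items(), key=lambda kv: (-kv[1], kv[0])):
        if gates > capacity:
            raise ValueError(f"{name} needs {gates} gates, more than one context holds")
        for i, room in enumerate(free):
            if gates <= room:
                contexts[i].append(name)
                free[i] -= gates
                break
        else:
            # No existing context has room: open a new one.
            contexts.append([name])
            free.append(capacity - gates)
    return contexts


# Hypothetical instruction names and gate counts, for illustration only.
example = {"mac16": 6000, "sad8": 5000, "dct_step": 9000, "clip": 2000, "fir_tap": 4000}
contexts = split_into_contexts(example)
for index, members in enumerate(contexts):
    print(f"context {index}: {members}")
```

With these made-up numbers the 26,000 gates of extension logic end up in two contexts, each of which would be loaded by a separate reconfiguration of the e-FPGA.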
execution of the pipeline between the FPGA and the base processor. Thus, the system allows executing a set of TIEs among a panel of 4 user-defined speed penalties for any FPGA configuration. In this way, the processor CPU is tied to the FPGA speed only for the strictly required number of cycles. The set of user instructions can be defined after tape-out thanks to the FPGA. Moreover, the system allows its execution time to be parametrised, so as to exploit the performance of both hard-wired and programmable logic.

level. The microprocessor core is abstracted in the co-verification with its Instruction Set Simulator integrated into the simulation engine. Extensive simulations of the system with the usage of the profiler (memory accesses, CPU load, exceptions) help in finding the computational kernels of the software running on the core (performance analysis).
[Figure labels recovered from the page: Functional model (untimed simulation); Partitioning / Interface Synthesis / Refinement; Global network]
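The idea of tying the processor to the FPGA speed only for the strictly required number of cycles can be sketched as follows. The paper states that four user-defined speed penalties are available for any FPGA configuration, but it gives neither their values nor the selection rule. The panel (1, 2, 4, 8), the function name pick_penalty, and the path delays used here are therefore hypothetical.

```python
from typing import Sequence


def pick_penalty(path_delay_ps: int, clock_period_ps: int,
                 panel: Sequence[int] = (1, 2, 4, 8)) -> int:
    """Return the smallest panel entry (in CPU cycles) that covers the FPGA path delay.

    `panel` stands in for the four user-defined speed penalties; its values
    are an assumption, not taken from the paper.
    """
    if path_delay_ps <= 0 or clock_period_ps <= 0:
        raise ValueError("delays must be positive")
    needed = -(-path_delay_ps // clock_period_ps)  # ceiling division on integers
    for cycles in sorted(panel):
        if cycles >= needed:
            return cycles
    raise ValueError(f"path needs {needed} cycles, beyond the largest panel entry")


# 100 MHz CPU clock gives a 10,000 ps period; the path delays are made-up numbers.
CLOCK_PS = 10_000
fast = pick_penalty(9_500, CLOCK_PS)    # fits in a single cycle
slow = pick_penalty(27_000, CLOCK_PS)   # needs 3 cycles, rounded up to the 4-cycle setting
print(fast, slow)
```

Under these assumptions, a short extension instruction costs no extra cycles, while a long one stalls the pipeline only for the setting that just covers its delay.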
route stage, the final database is statically and dynamically verified against the RTL simulations in order to provide verification at all levels of abstraction.
[Fig. 5 RTL to Layout. Recoverable labels: with FPGA black box; Dynamic; Static Timing Analysis]
The timed database used for the verification, built after a parasitic extraction and a delay calculation process, gives the effective delays at the boundary of the e-FPGA hard macro (all e-FPGA I/O pins are characterized with the static timing analyzer in the worst case condition). This information is exported to the e-FPGA flow as a constraint file and used during synthesis/mapping of the soft hardware by specific e-FPGA tools. This is done to correctly constrain the logic mapped on the e-FPGA with the real timing budget. Finally, the generation of the bitstream and a timed view of the macro can be used for the final sign-off. Static timing analysis of the e-FPGA results in both a backannotated netlist and a timing view for full chip static timing analysis.

System Implementation and Test
The full chip has been implemented in a standard CMOS 1.8V/3.3V, 0.18 µm technology featuring 6 metal layers. The layout of the system has been integrated using commercial place and route tools for digital ASICs. To avoid multiple external power supplies, an internal DC (3V to 1.8V) voltage regulator has been integrated. The chip is being tested and is fully functional at the clock rate of 175MHz. The processor system is able to reconfigure the e-FPGA at full speed. Reconfiguration takes about 500 µs at a clock rate of 100MHz. During reconfiguration the average throughput sustained by external memories, EMI and programming interface is 50MB/sec. Device performances and power consumption are summarized in Table I. Technology and device characteristics are summarized in Table II, and a chip micrograph is shown in Fig.6 with a floorplan view of system components. The system is being tested using both a face recognition application and a speech recognition application. During architecture development we reported speedups of 4x to 8x using instruction extensions to accelerate face-recognition computing kernels. Additional 1.5x to 2x performance improvements are reported on specific I/O-intensive tasks that interface an external CMOS camera and do some image processing computations on-the-fly using the e-FPGA.

TABLE I
DEVICE PERFORMANCES AND POWER CONSUMPTION
Chip average power consumption    ~300mW @ 100MHz, 1.8V

TABLE II
TECHNOLOGY AND DEVICE CHARACTERISTICS
Technology          0.18 µm CMOS 6-ML
SRAM Memory         Main: 48kB (64-bit wide)
                    I$: 8kB (64-bit wide)
                    D$: 8kB (64-bit wide)
                    Buffers: 4x256B (8-bit wide)
Chip size           5.5x5.5 mm² (pad limited)
Core size           20 mm²
e-FPGA size         8.2 mm² (15k useable equivalent ASIC gates)
Customisable I/O    24 general-purpose inputs
                    24 general-purpose outputs (tristate)
                    8 general-purpose bidirs
Power supply        2.7-3.6V (external), 1.8V (core, internally generated / regulated)

Fig.6 Chip Micrograph

Acknowledgements:
The authors thank Sara Bocchio, G. Repetto, C. Gazzina and L. Fumagalli for their valuable help and support. They also thank O. Lepape, J. Barbier and F. Reblewsky at M2000, J. Massingham and B. Campbell at Tensilica, and K. Ahluwalia, D. Tilley, M. Woodward and P. Bingham at CoWare. Special thanks to Dr. A. Kramer for his support and encouragement.

References
(1) R.E. Gonzalez, "Xtensa: A Configurable and Extensible Processor", IEEE Micro, March-April 2000, pp. 60-70.
(2) M2000, "Flexeos family technical manual", www.m2000.fr
(3) ARM Ltd., "AMBA™ Specification", Rev 2.0
(4) I. Bolsens, H. De Man, B. Lin, C. Van Rompaey, S. Vercauteren and D. Verkest, "Hardware/Software Co-Design of Digital Telecommunication Systems", Proceedings of the IEEE, Vol. 85, No. 3, March 1997, pp. 391-418.