A Reconfigurable System featuring Dynamically Extensible Embedded Microprocessor, FPGA and Customisable I/O


Michele Borgatti, Francesco Lertora, Benoit Foret and Lorenzo Cali

STMicroelectronics
Innovative Systems Design, NVM-DP, Central R&D
Agrate Brianza (MI), ITALY

0-7803-7250-6/02/$10.00 © 2002 IEEE. IEEE 2002 Custom Integrated Circuits Conference.

Abstract
A system-chip targeting image and voice processing and recognition application domains is implemented as a representative of the potential of using programmable logic in system design. It features an embedded reconfigurable processor built by joining a configurable and extensible processor core and an SRAM-based embedded FPGA. Application-specific bus-mapped coprocessors and flexible I/O peripherals and interfaces can also be added and dynamically modified by reconfiguring the embedded FPGA. The architecture of the system is discussed as well as the design flows for pre- and post-silicon design and customisation. The silicon area required by the system is 20mm2 in a 0.18um CMOS technology. The embedded FPGA accounts for about 40% of the system area.

Introduction
These days we are witnessing two conflicting trends in the electronics industry. On one side, the economics of system integration pushes logic suppliers towards ever more complex system-chip devices. On the other side, increasing design complexity and associated risks, rising non-recurrent engineering expenses, and shorter time-to-market and product life are causing OEMs to look for faster-turnaround and lower-risk design solutions and technology.

The recent introduction of embedded programmable logic allows ASIC and ASSP vendors to broaden the appeal of their products. Also, hardware programmability can be exploited by system integrators for product customisation.

In this paper we present a pragmatic approach to introduce flexibility in system-chip design and exploit embedded programmable silicon fabrics to enhance system performance. In particular, enabling application-specific configurations to adapt the underlying hardware architecture to time-varying application demands can improve execution speed and reduce power consumption compared to a general-purpose programmable solution. In the proposed system the embedded programmable logic allows static or dynamic configuration of the instruction set of an embedded microprocessor, the creation of bus-mapped application-specific hardware coprocessors and accelerators, and the customisation of the system I/O. The latter feature allows the device to potentially connect to any external unit or sensor, provided that its communication protocol can be mapped to the on-chip programmable logic. Also, some computations can be performed on-the-fly when data is captured.

The proposed system has been built using a set of state-of-the-art IP cores and system design methodology. In particular, a configurable and extensible processor (1) with associated tools, and an embedded FPGA (2) were used. The resulting system has been developed to target image and voice processing and recognition application domains. Design flows for system exploration and implementation are also introduced.

System Architecture
One of the main goals of this work was to build a flexible architecture, working at a reasonably high clock frequency, built around an embedded FPGA and an extensible 32-bit microprocessor.

The base processor is a specific customisation of that described in (1). It comes with a complete set of tools for configuration and performance analysis. The main features of the processor core used in our system are: a 5-stage pipeline, 8+8kB direct-mapped data/instruction caches, a 24 or 16 bit instruction format for improved code density, a 64 bit processor interface (PIF) with burst transfers for cache-page refill, and 13 interrupt lines organized in 4 priority interrupt levels.

The system architecture is illustrated in Fig.1. The PIF/AHB Bridge translates processor cycles to the AMBA AHB bus (3) with support for fast burst and locked transfers. An external memory interface (EMI) exploits the available peak throughput of the fastest commercial external non-volatile flash memories. It allows a wide range of burst mode and page mode configurations under software control and supports low-voltage, low-swing operations. If required, an external RAM port allows the extension of the on-chip 48kB SRAM.

The heart of the system is an embedded FPGA and its multiple interfaces to the main system units. In particular, the functional purposes of the e-FPGA programmable logic are:
• extension of the processor datapath supporting a set of additional special-purpose instructions (TIE). This is done by connecting the processor datapath through a wide bus and a specific interface (TIE bus/interface in Fig. 1);
• bus-mapped coprocessor. Hardware units mapped into the e-FPGA can be interfaced to the system bus through an AHB bus master/slave;
• flexible I/O. The programmable general-purpose I/O pads interface is used to connect external units or sensors with their application-specific communication protocol.

All these possibilities may be mixed in a single configuration of the FPGA, which results in a highly configurable device. To accelerate communication between the configurable hardware and software tasks running on the processor, 4 interrupt channels can be driven by logic mapped into the e-FPGA. Two-way HW/SW communication can be implemented by the joint usage of these interrupt channels and dedicated AMBA APB registers.

Fig. 1: System Architecture Block diagram

Download of the FPGA bitstream is performed by a flexible programming interface. To allow validation of the FPGA configuration, the bitstream may be read back with hardware support.

Most audio or video applications require storage buffers to interface fast decoding hardware and slower software running on the processor. With this concept in mind, a 1kByte dual port buffer has been added and organised as 4x256-byte rows. One port of this buffer is connected to the AHB bus while the second port is directly accessed by the FPGA dual port buffer interface.

The AMBA APB bus connects all the configuration/general-purpose registers to the system. On the same bus, an I2C master interface has been added to connect external devices or sensors such as an LCD display, a CMOS camera, etc.

A programmable general-purpose I/O module features mono-directional input/output and bi-directional pads under the control of both the e-FPGA and the microprocessor.

A. The Microprocessor-FPGA interface
The configurable processor allows adding user-defined instructions. In the proposed architecture, this capability was mapped exclusively into the e-FPGA, allowing runtime re-configuration of the instruction set. This implies that the number of user-defined instructions available at a given time is limited by the e-FPGA logic capacity and instruction logic complexity. However, a set of additional instructions can be defined to target specific application needs. If the logic size of the set of additional instructions exceeds the logic capacity of the e-FPGA, it might be split into a number of contexts fitting the size constraints of the e-FPGA. These contexts might be used to dynamically re-program the FPGA to support application needs.

The flexibility advantage of this architecture implies a speed penalty for the part of the logic mapped inside the e-FPGA. In particular, specific processor instructions mapped in the reconfigurable fabric may be 1x to 10x slower than their equivalent implementation in standard cells.

Fig.2 details the processor-FPGA interface, focusing on how Instruction Extensions are mapped inside the FPGA and how synchronisation between the microprocessor and the e-FPGA is guaranteed.

Fig. 2: Embedded FPGA - Microprocessor Interface

As the additional instruction set is part of the processor pipeline (1), slowing down this logic results in a drastic reduction of the processor maximum speed, hence affecting processor performance when using the baseline general-purpose instruction set.

A mechanism is introduced to allow the processor to be clocked at its maximum speed while executing standard instructions, whereas it is slowed down by a programmable, instruction-dependent number of cycles (1-16) when executing processor instructions mapped into the FPGA.

A clock control system allows the processor to be synchronised with the e-FPGA for the number of cycles the instruction is executed. A dedicated module is able to identify instructions whose performance is not aligned with the processor. As each of these instructions needs to be associated with its execution time, the set was partitioned. A pre-defined map-table divides into 4 the whole set of opcodes reserved for user-defined instructions.

For each set that belongs to a configuration, a number, mapped as a constant output of the FPGA, defines the number of times the clock needs to be stretched to synchronise properly the
execution of the pipeline between the FPGA and the base processor. Thus, the system allows executing a set of TIEs among a panel of 4 user-defined speed penalties for any FPGA configuration. In this way, the processor CPU is tied to the FPGA speed for the strictly required number of cycles. The set of user instructions can be defined after tape-out thanks to the FPGA. Moreover, the system allows their execution time to be parametrised, to exploit the performance of both hard-wired and programmable logic.

B. Block Description of the e-FPGA
The architecture of the e-FPGA (2) is organised as a hierarchical multi-level interconnect network (see Fig.3).

Fig. 3: Block diagram of the e-FPGA (legible labels: Global network; 384 Inputs)

An array of logic elements called Multi Function logic Cells (MFC) allows implementation of digital logic. The MFC is a 4 input / 1 output programmable structure associating a 4 input Look-Up Table and a storage element (dff, latch). There are 3k MFC shared among 24 clusters. The Global Interconnect Network links the clusters together and to the IPads & OPads peripheral cells. At a lower level, a Local Interconnect Network links MFC together and to the global network. The architecture allows defining up to 1 clock signal per cluster. The MFC clock is one of 3 global signals defined to be connected to any input of the cluster. This ensures a low skew between cluster clocks and full IO assignment flexibility. The input (respectively output) pin set counts 384 independent and fully equivalent inputs (respectively outputs).

Design Flow and System Integration

A. The System-to-RTL design flow
In Fig.4 the design flow used for system architecture exploration and integration is described. The starting point is an untimed model of the system written in C/C++ code describing the desired functionality; at this stage the verification is done with simulations in the CoWare N2C environment (4). This methodology allows designers to validate the system specifications and then, through a progressive refinement of the functional blocks into hardware and software (partitioning process) and the generation of the HW/SW interface (interface synthesis), to verify the system at a cycle-accurate abstraction level. The microprocessor core is abstracted in the co-verification with its Instruction Set Simulator integrated into the simulation engine. Extensive simulations of the system with the usage of the profiler (memory accesses, CPU load, exceptions) help in finding the computational kernels of the software running on the core (performance analysis).

Fig. 4: System to RTL (legible labels: Functional model (untimed simulation); Partitioning / Interface Synthesis / Refinement)

At this point it is possible to group segments of code that prove time-consuming as new instructions of the extensible processor. Those extensions of the instruction set can be easily mapped on the e-FPGA, as can the VHDL code that results from the refinement process done during the partitioning phase.

The system integration flow ends producing:
• Soft hardware to be mapped on the e-FPGA: HDL RTL code of instruction extensions, bus-mapped coprocessors and special-purpose I/O peripherals.
• Conventional fixed hardware: microprocessor RTL code, AHB/APB bus and peripherals.
• Embedded software (C code): application software and low-level drivers for the hardware platform.

The C code generated by the flow described above becomes the final application, while the RTL of the system with the e-FPGA hard macro goes into the system integration flow.

B. The RTL-to-Layout design flow
In Fig.5 both the silicon implementation flow and the e-FPGA configuration flow are shown. These flows are run at different times. Once the silicon implementation flow has produced the routed database, it is possible to run the e-FPGA flow, which can be repeated for each different function built as a soft macro.

The RTL code of the CPU core, IP blocks and interface modules (system bus) is synthesized and integrated with RAM blocks and the FPGA hard macro in the floorplanning environment. To meet timing requirements at the boundary of the e-FPGA, special care was taken during synthesis for the logic cells that interface the e-FPGA with the rest of the system. A particular set of constraints was specified to reach minimum delay of the hardwired logic. After the place and
route stage, the final database is statically and dynamically verified against the RTL simulations, so that verification is carried out at all levels of abstraction.

Fig. 5: RTL to Layout (legible labels: with FPGA black box; Final verification with FPGA timing model; Static Timing Analysis)

The timed database used for the verification, built after parasitic extraction and delay calculation, gives the effective delays at the boundary of the e-FPGA hard macro (all e-FPGA I/O pins are characterized with the static timing analyzer in the worst-case condition). This information is exported to the e-FPGA flow as a constraint file and used during synthesis/mapping of the soft hardware by specific e-FPGA tools. This is done to correctly constrain the logic mapped on the e-FPGA with the real timing budget. Finally, the generated bitstream and a timed view of the macro can be used for the final sign-off. Static timing analysis of the e-FPGA results in both a back-annotated netlist and a timing view for full-chip static timing analysis.

System Implementation and Test
The full chip has been implemented in a standard CMOS 1.8V/3.3V, 0.18um technology featuring 6 metal layers. The layout of the system has been integrated using commercial place and route tools for digital ASIC. To avoid multiple external power supplies, an internal DC (3V to 1.8V) voltage regulator has been integrated. The chip is being tested and is fully functional at the clock rate of 175MHz. The processor system is able to reconfigure the e-FPGA at full speed. Reconfiguration takes about 500us at a clock rate of 100MHz. During reconfiguration the average throughput sustained by external memories, EMI and programming interface is SOMB/sec. Device performance and power consumption are summarized in Table I. Technology and device characteristics are summarized in Table II and a chip micrograph is shown in Fig.6 with a floorplan view of system components. The system is being tested using both a face recognition application and a speech recognition application. During architecture development we reported speedups of 4x to 8x using instruction extensions to accelerate face-recognition computing kernels. Additional 1.5x to 2x performance improvements are reported on specific I/O-intensive tasks that interface an external CMOS camera and perform some image processing computations on-the-fly using the e-FPGA.

Fig. 6: Chip Micrograph

TABLE I
DEVICE PERFORMANCES AND POWER CONSUMPTION
Processor maximum speed          125MHz (WCMIL), 175MHz (TYP)
Reconfiguration speed            ~500us @ 100MHz clock
Chip average power consumption   ~300mW @ 100MHz, 1.8V

TABLE II
TECHNOLOGY AND DEVICE CHARACTERISTICS
Technology         0.18um CMOS 6-ML
SRAM Memory        Main: 48kB (64-bit wide)
                   I$: 8kB (64-bit wide)
                   D$: 8kB (64-bit wide)
                   Buffers: 4x256B (8-bit wide)
Chip size          5.5x5.5 mm2 (pad limited)
Core size          20 mm2
e-FPGA size        8.2 mm2 (15k useable equivalent ASIC gates)
Customisable I/O   24 general-purpose inputs
                   24 general-purpose outputs (tristate)
                   8 general-purpose bidirs
Power supply       2.7-3.6V (external), 1.8V (core, internally generated / regulated)

Acknowledgements:
The authors thank Sara Bocchio, G. Repetto, C. Gazzina and L. Fumagalli for their valuable help and support. They also thank O. Lepape, J. Barbier and F. Reblewsky at M2000, J. Massingham and B. Campbell at Tensilica, and K. Ahluwalia, D. Tilley, M. Woodward and P. Bingham at CoWare. Special thanks to Dr. A. Kramer for his support and encouragement.

References
(1) R. E. Gonzalez, "Xtensa: A Configurable and Extensible Processor", IEEE Micro, March-April 2000, pp. 60-70.
(2) M2000, "Flexeos family technical manual", www.m2000.fr
(3) ARM Ltd., "AMBA Specification", Rev 2.0.
(4) I. Bolsens, H. De Man, B. Lin, C. Van Rompaey, S. Vercauteren and D. Verkest, "Hardware/Software Co-Design of Digital Telecommunication Systems", Proceedings of the IEEE, Vol. 85, No. 3, March 1997, pp. 391-418.
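The section on the microprocessor-FPGA interface notes that a set of additional instructions too large for the e-FPGA might be split into contexts that each fit the fabric. The paper does not say how contexts are chosen, so the following is only a minimal sketch using first-fit decreasing packing, a textbook heuristic swapped in for illustration. The function name, the instruction names and their sizes in MFCs are hypothetical, and the 3000-MFC capacity assumes that "3k MFC" means 3000 cells, all of them available to instruction logic.

```python
from typing import Dict, List

# ASSUMPTION: "3k MFC" read literally as 3000 cells, all usable for instructions.
EFPGA_CAPACITY_MFC = 3000


def split_into_contexts(sizes: Dict[str, int],
                        capacity: int = EFPGA_CAPACITY_MFC) -> List[List[str]]:
    """Group instructions into contexts whose total size fits the e-FPGA.

    Uses first-fit decreasing: largest instructions are placed first, each
    into the first context that still has room.
    """
    contexts: List[List[str]] = []   # instruction names per context
    free: List[int] = []             # remaining MFCs per context
    # Sort by decreasing size; the name breaks ties so the result is stable.
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        if size > capacity:
            raise ValueError(f"{name} alone exceeds the e-FPGA capacity")
        for i, room in enumerate(free):
            if size <= room:
                contexts[i].append(name)
                free[i] -= size
                break
        else:
            # No existing context has room: open a new one.
            contexts.append([name])
            free.append(capacity - size)
    return contexts
```

With the hypothetical sizes `{"mac16": 1800, "sad8": 1500, "clip": 900, "absdiff": 600}` the sketch yields two contexts, so the application would need one reconfiguration to switch between them. A real flow would also weigh which instructions are used together, since each context switch costs a reconfiguration.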

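The clock-stretching mechanism of the microprocessor-FPGA interface can be modelled in a few lines. This is a sketch under stated assumptions, not the authors' hardware: the paper only says that a pre-defined map-table divides the user opcodes into 4 sets and that each set has a stretch count of 1 to 16 cycles supplied as a constant output of the FPGA. The 256-entry opcode space, the split into four equal ranges, and all names (`StretchConfig`, `opcode_set`, `stretch_cycles`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NUM_OPCODE_SETS = 4                # from the paper: map-table splits user opcodes in 4
MIN_STRETCH, MAX_STRETCH = 1, 16   # from the paper: programmable 1-16 cycles
USER_OPCODE_COUNT = 256            # ASSUMPTION: size of the user opcode space


@dataclass(frozen=True)
class StretchConfig:
    """Stretch counts for one e-FPGA configuration, one per opcode set."""
    cycles_per_set: Tuple[int, ...]

    def __post_init__(self) -> None:
        if len(self.cycles_per_set) != NUM_OPCODE_SETS:
            raise ValueError("expected one stretch count per opcode set")
        if any(not MIN_STRETCH <= c <= MAX_STRETCH for c in self.cycles_per_set):
            raise ValueError("stretch counts must lie in 1..16")


def opcode_set(user_opcode: int) -> int:
    """ASSUMPTION: the map-table cuts the opcode space into four equal ranges."""
    if not 0 <= user_opcode < USER_OPCODE_COUNT:
        raise ValueError("not a user-defined opcode")
    return user_opcode * NUM_OPCODE_SETS // USER_OPCODE_COUNT


def stretch_cycles(config: StretchConfig, user_opcode: Optional[int]) -> int:
    """Cycles the clock is stretched; None models a baseline instruction."""
    if user_opcode is None:
        return 0  # standard instructions run at full processor speed
    return config.cycles_per_set[opcode_set(user_opcode)]
```

For example, with `StretchConfig((1, 2, 4, 16))` an instruction in the last opcode range is stretched by 16 cycles while a baseline instruction is not stretched at all, which is the property the paper relies on to keep the standard instruction set at full speed.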

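The Multi Function logic Cell described in the e-FPGA block description (a 4-input Look-Up Table associated with a storage element) can be illustrated with a small behavioural model. This is a generic LUT4-plus-flip-flop sketch, not the vendor's cell: the class and method names, the truth-table bit ordering, and the way the optional register is selected are all assumptions.

```python
class MFC:
    """Behavioural sketch of a 4-input / 1-output logic cell."""

    def __init__(self, truth_table: int, registered: bool = False) -> None:
        if not 0 <= truth_table < (1 << 16):
            raise ValueError("a 4-input LUT needs a 16-bit truth table")
        self.truth_table = truth_table
        self.registered = registered   # True: output comes from the flip-flop
        self.q = 0                     # flip-flop state, reset to 0

    def lut(self, a: int, b: int, c: int, d: int) -> int:
        """Combinational LUT output; ASSUMPTION: input a is the index LSB."""
        index = (a & 1) | ((b & 1) << 1) | ((c & 1) << 2) | ((d & 1) << 3)
        return (self.truth_table >> index) & 1

    def clock_edge(self, a: int, b: int, c: int, d: int) -> None:
        """Capture the LUT output into the flip-flop on a clock edge."""
        self.q = self.lut(a, b, c, d)

    def output(self, a: int, b: int, c: int, d: int) -> int:
        """Cell output: stored value if registered, LUT value otherwise."""
        return self.q if self.registered else self.lut(a, b, c, d)
```

A truth table of `0x8000` configures the cell as a 4-input AND and `0x6666` as an XOR of the first two inputs; any wider function has to be built from several cells joined through the local and global interconnect.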

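Several figures quoted in the abstract, the implementation section and the two tables can be cross-checked against each other with simple arithmetic. The sketch below only restates numbers that appear in the paper (20 mm2 core, 8.2 mm2 e-FPGA, about 500us reconfiguration at 100MHz, a 4x256-byte buffer); the constant and function names are hypothetical.

```python
# Figures quoted in the paper (abstract, implementation section, Tables I and II).
CORE_AREA_MM2 = 20.0
EFPGA_AREA_MM2 = 8.2
RECONFIG_TIME_US = 500     # "about 500us", so derived values are approximate
CLOCK_MHZ = 100
BUFFER_ROWS = 4
BUFFER_ROW_BYTES = 256


def efpga_area_percent() -> float:
    """Share of the core area taken by the e-FPGA, in percent."""
    return 100.0 * EFPGA_AREA_MM2 / CORE_AREA_MM2


def reconfig_cycles() -> int:
    """Clock cycles spent on one full reconfiguration.

    MHz is cycles per microsecond, so the product is a cycle count.
    """
    return RECONFIG_TIME_US * CLOCK_MHZ


def buffer_bytes() -> int:
    """Total size of the dual port buffer in bytes."""
    return BUFFER_ROWS * BUFFER_ROW_BYTES
```

The area share comes out at roughly 41%, in line with the "about 40%" of the abstract; a reconfiguration costs on the order of 50,000 cycles at 100MHz; and the buffer holds 1024 bytes, matching the 1kByte stated in the architecture description.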