Fast Exploration of Bus-based On-chip Communication Architectures

Sudeep Pasricha†, Nikil Dutt†, Mohamed Ben-Romdhane‡
†Center for Embedded Computer Systems, University of California, Irvine, CA, {sudeep, dutt}@cecs.uci.edu
‡Conexant Systems Inc., Newport Beach, CA, m.benromdhane@conexant.com
ABSTRACT
As a result of improvements in process technology, more and more components are being integrated into a single System-on-Chip (SoC) design. Communication between these components is increasingly dominating critical system paths and frequently becomes the source of performance bottlenecks. It therefore becomes extremely important for designers to explore the communication space early in the design flow. Traditionally, pin-accurate Bus Cycle Accurate (PA-BCA) models were used for exploring the communication space. To speed up simulation, transaction based Bus Cycle Accurate (T-BCA) models have been proposed, which borrow concepts found in the Transaction Level Modeling (TLM) domain. More recently, the Cycle Count Accurate at Transaction Boundaries (CCATB) modeling abstraction was introduced for fast communication space exploration. In this paper, we describe the mechanisms that produce the speedup in CCATB models and demonstrate the effectiveness of the CCATB exploration approach with the aid of a case study involving an AMBA 2.0 based SoC subsystem used in the multimedia application domain. We also analyze how the achieved simulation speedup scales with design complexity and show that SoC designs modeled at the CCATB level simulate 120% faster than PA-BCA and 67% faster than T-BCA models on average.

Categories and Subject Descriptors: I.6.5 [Simulation and Modeling]: Model Development; I.6.7 [Simulation and Modeling]: Simulation Support Systems.

General Terms: Performance, Design

Keywords: Fast Communication Architecture Exploration, Transaction Level Modeling, Bus Cycle Accurate Modeling, Shared Bus Architectures, AMBA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CODES+ISSS’04, September 8–10, 2004, Stockholm, Sweden.
Copyright 2004 ACM 1-58113-937-3/04/0009...$5.00.

1. INTRODUCTION
Over the years, System-on-Chip (SoC) designs have evolved from fairly simple uni-processor, single-memory designs to massively complex multiprocessor systems with several on-chip memories, standard peripherals and ASIC blocks. As more and more components are integrated into these designs to share the ever increasing processing load, there is a corresponding increase in the communication between these components. Inter-component communication is often in the critical path of a SoC design and is a very common source of performance bottlenecks. It thus becomes imperative for system designers to focus on exploring the communication design space.

Shared-bus based communication architectures such as AMBA [1], CoreConnect [2], WishBone [3] and OCP [4] are popular choices for on-chip communication between components in current SoC designs. These bus architectures can be configured in several different ways, resulting in a vast exploration space that is prohibitive to explore at the RTL level. Not only is the RTL simulation speed too slow to allow adequate coverage of the large design space, but making small changes in the design can require considerable re-engineering effort due to the highly complex nature of these systems. To overcome these problems, designers have raised the modeling abstraction level above the RTL level. Figure 1 shows the frequently used modeling abstraction levels for communication space exploration, usually captured with high level languages such as C/C++ [5]. In Cycle Accurate (CA) models [6][18], system components (both masters and slaves) and the bus architecture are captured at a cycle and signal accurate level. While these models are extremely accurate, they are too time-consuming to model and only provide a moderate speedup over RTL models. Bus Cycle Accurate (BCA) models [7] capture the system at a higher abstraction level than CA models. Components are modeled at a less detailed behavioral level, which allows rapid system prototyping and considerable simulation speed over RTL. The component interface and the bus however are still modeled at a cycle and signal accurate level, which enables accurate communication space exploration. However, with the increasing role of embedded software and rising design complexity, even the simulation speedup gained with BCA models is not enough.

[Figure 1: master, bus/arbiter and slave code fragments at three abstraction levels. Cycle Accurate (CA): signal interface, with a wait(1) after each cycle of work. Pin Accurate Bus Cycle Accurate (PA-BCA): signal interface, with the waits merged into wait(2) and wait(3). Transaction based Bus Cycle Accurate (T-BCA): signal and transaction interface, using write(addr, v1) and bus_resp(OK).]
Figure 1. Modeling Abstractions for Exploration

Recent research efforts [11-14] have focused on using concepts found in the Transaction Level Modeling (TLM) [8-10] domain to speed up BCA model simulation. Transaction Level Models are very high level bit-accurate models of a system with specifics of the bus protocol replaced by a generic bus (or channel), and where
communication takes place when components call read() and write() methods provided by the channel interface. Since detailed timing and signal-accuracy is omitted, these models are fast to simulate and are useful for early embedded software development and functional validation of the system [8]. Transaction based BCA (T-BCA) models [11-14] make use of the read/write function call interface, optionally with a few signals to maintain bus cycle accuracy. The simpler interface reduces modeling effort and the function call semantics result in faster simulation speeds.

More recently, we introduced the Cycle Count Accurate at Transaction Boundaries (CCATB) modeling abstraction [20] for fast exploration of communication architectures. CCATB extends the TLM modeling abstraction to speed up system prototyping and more importantly simulation performance, while maintaining cycle count accuracy during communication space exploration.

In this paper we will describe the mechanisms behind the speedup obtained in CCATB models. We will present a simulation implementation of the CCATB modeling abstraction, for high performance shared bus architectures. To underline the effectiveness of our exploration approach, we will describe a case study involving an AMBA 2.0 based SoC subsystem used in the multimedia application domain. We will also compare simulation performance for CCATB, PA-BCA and T-BCA models and analyze the scalability of these approaches with design complexity.

The paper is organized as follows. Section 2 briefly discusses requirements for a communication design space exploration effort. Section 3 gives an overview of the CCATB modeling abstraction level for communication architecture exploration. Section 4 presents an implementation of the CCATB simulation model. Section 5 describes a case study which uses CCATB models to explore the communication space of a multimedia SoC subsystem. Section 6 compares modeling effort and simulation speeds for the CCATB and BCA models, and shows how the speeds scale with increasing system complexity. Finally, Section 7 concludes the paper and gives directions for future research.

2. COMMUNICATION DESIGN SPACE EXPLORATION REQUIREMENTS
After system designers have performed hardware/software partitioning and architecture mapping in a typical design flow [10], they need to select a communication architecture for the design. The selection is complicated by the plethora of choices [1-4] that a designer is confronted with. Factors such as application domain specific communication requirements and reuse of the existing design IP library play a major role in this selection process. Once a choice of communication architecture is made, the next challenge is to configure the architecture to meet design performance requirements. Bus-based communication architectures such as AMBA [1] have several parameters which can be configured to improve performance: bus topology, data bus width, arbitration protocols, DMA burst lengths and buffer sizes have significant impact on system performance and must be considered by designers during exploration. In the exploration study presented in this paper, we use our approach to configure a communication architecture once the selection process is completed. Exploration studies focusing on the selection of appropriate communication architectures using our approach can be found in [20].

Any meaningful exploration effort must be able to comprehensively capture the communication architecture and be able to simulate the effects of changing configurable parameters at a system level [19]. This implies that we need to model the entire system and not just a portion of it. Fast simulation speed is also very essential when exploring large designs and the vast design space, in a timely manner. System components such as CPUs, memories and peripherals need to be appropriately parameterized [16], annotated with timing details and modeled at a granularity which would capture their precise functionality, yet not weigh down simulation speed due to unnecessary detail. Performance numbers would then be obtained by simulating the working of the entire system – including running embedded software on the CPU architecture model. Ultimately, the exploration models need to be fast, accurate and flexible – providing good simulation speed, overall cycle accuracy for reliable performance estimation and the flexibility to seamlessly plug-and-run different bus architectures and reuse components such as processors, memories and peripherals.

3. CCATB OVERVIEW
To enable fast exploration of the communication design space, we previously introduced a novel modeling abstraction level called Cycle Count Accurate at Transaction Boundaries (CCATB) [20]. A transaction in this context refers to a read or write operation issued by a master to a slave, that can either be a single data word or a multiple data burst transfer. Transactions at the CCATB level are similar to transactions at the TLM level [8] except that we additionally pass bus protocol specific control and timing information. Unlike BCA models, we do not maintain accuracy at every cycle boundary. Instead, we raise the modeling abstraction and maintain cycle count accuracy at transaction boundaries i.e. the number of bus cycles that elapse at the end of a transaction is the same when compared to cycles elapsed in a detailed cycle/pin accurate system model. A similar concept can be found in [15] where Observable Time Windows were defined and used for verifying results of high level synthesis. We maintain overall cycle count accuracy needed to gather statistics for accurate communication space exploration, while optimizing the models for faster simulation. Intra-transaction events such as interrupts and transaction aborts that have an impact on cycle count accuracy are also handled in our framework. More details can be found in [21]. Our approach essentially trades off intra-transaction visibility to gain simulation speedup.

[Figure 2: master1 (ISS + eSW), master2 and master3 communicating over the system bus (arbiter + decoder) with slave1 (SDRAM), slave2 and slave3]
Figure 2. CCATB Transaction Example

We chose SystemC 2.0 [8-9] to capture designs at the CCATB abstraction level, as it provides a rich set of primitives for system modeling. Busses in CCATB are modeled by extending the generic TLM channel [8] to include bus architecture specific timing and protocol details. Arbiter and decoder modules are integrated with this channel model. Computation blocks (masters and slaves) are modeled at the behavioral abstraction level, just like TLM models in [8]. Masters are active blocks with (possibly) several computation threads and ports to interface with busses. Figure 2 shows the interface used by the master to communicate with a slave. In the figure, port specifies the port to send the read/write request on (since a master may be connected to multiple busses). addr is the address of the slave to send the transaction to. token is a structure that contains pointers to data and control information.
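The port/addr/token interface described above can be illustrated with a small functional sketch. It is written in Python rather than SystemC for brevity, it omits arbitration and all timing, and the class names (Token, MemorySlave, Bus, Master) and the token fields are hypothetical; they are not taken from the CCATB implementation or from the listing in Figure 2.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Hypothetical transaction token: payload plus control information."""
    data: list = field(default_factory=list)  # one word or a burst of words
    burst_len: int = 1                        # control: words to transfer
    status: str = "PENDING"                   # filled in by the slave or bus


class MemorySlave:
    """Passive slave with a memory map; it only acts when the bus calls it."""

    def __init__(self, base, size):
        self.base, self.size = base, size
        self.mem = [0] * size

    def contains(self, addr, burst_len):
        return self.base <= addr and addr + burst_len <= self.base + self.size

    def read(self, addr, token):
        off = addr - self.base
        token.data = self.mem[off:off + token.burst_len]
        token.status = "OK"

    def write(self, addr, token):
        off = addr - self.base
        for i, word in enumerate(token.data[:token.burst_len]):
            self.mem[off + i] = word
        token.status = "OK"


class Bus:
    """Channel with an integrated address decoder (arbiter left out)."""

    def __init__(self, slaves):
        self.slaves = slaves

    def _decode(self, addr, token):
        for slave in self.slaves:
            if slave.contains(addr, token.burst_len):
                return slave
        return None

    def read(self, addr, token):
        slave = self._decode(addr, token)
        if slave is None:
            token.status = "ERROR"
        else:
            slave.read(addr, token)
        return token.status

    def write(self, addr, token):
        slave = self._decode(addr, token)
        if slave is None:
            token.status = "ERROR"
        else:
            slave.write(addr, token)
        return token.status


class Master:
    """Active block; 'port' selects which bus a request is sent on."""

    def __init__(self, ports):
        self.ports = ports  # port name -> Bus

    def read(self, port, addr, token):
        return self.ports[port].read(addr, token)

    def write(self, port, addr, token):
        return self.ports[port].write(addr, token)
```

A master connected to one bus would write a three-word burst with `cpu.write("ahb", 0x1004, Token(data=[7, 8, 9], burst_len=3))` and read it back with a fresh token of the same burst length. An address that no slave maps returns the status "ERROR".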
Slaves are passive entities, activated only when triggered by the arbiter on a request from the master, and have a register/memory map to handle read/write requests. The arbiter calls read() and write() functions implemented in the slave, as shown for the SDRAM controller in the figure.

4. SIMULATION SPEEDUP
We now describe an implementation of the CCATB simulation model to explain how we obtain simulation speedup. We consider a design with several bus subsystems each with its own separate arbiter and decoder, and connected to the other subsystems via bridges. The bus subsystem supports pipelining, burst mode transfers and out-of-order (OO) transaction completion which are all features found in high performance bus architectures such as [17]. OO transaction completion allows slaves to relinquish control of the bus, complete received transactions in any order and then request for re-arbitration so a response can be sent back to the master for the completed transaction. OO latency period refers to the number of cycles that elapse after the slave releases control of the bus and before it requests for re-arbitration.

We begin with a few definitions. Each bus subsystem is characterized by a tuple set X, where X = {Rpend, Ract, Roo}. Rpend is a set of read/write requests pending in the bus subsystem, waiting for selection by the arbiter. Ract is a set of read/write requests actively executing in the subsystem. Roo is a set of out-of-order read/write requests in a subsystem that are waiting to enter into the pending request set (Rpend) after the expiration of their OO latency period. Let A be a superset of the sets X for all p bus subsystems in the entire system.

$A = \bigcup_{i=1}^{p} X_i$

Next we define τ to be a transaction request structure, which includes the following subfields:
• wait_cyc specifies the number of wait cycles before the bus can signal transaction completion to the master.
• oo_cyc specifies the number of wait cycles before the request can apply for re-arbitration at the bus arbiter.
• ooflag indicates if the request is an out-of-order transaction

status is defined to be a transaction response structure returned by the slave. It contains a field (stat) that indicates the status of the transaction (OK, ERROR etc.) as well as fields for the various delays encountered such as those for the slave interface (slave_int_delay), slave computation (slave_comp_delay) and bridges (bridge_delay). Finally, let M be a set of all masters in the system. Each master is represented by a value in this set which corresponds to the sum of (i) the number of cycles before the next read/write request is issued by the master and (ii) the master interface delay cycles. These values are maintained in a global table with an entry for each master and do not need to be specified manually by a designer – a preprocessing stage can automatically insert directives in the code to update the table at the point when a master issues a request to a bus.

Our approach speeds up simulation by preventing unnecessary invocation of simulation components and efficiently handling idle time during simulation. We now describe the implementation for our simulation model to show how this is accomplished.

On a positive clock edge, master computation threads are triggered and possibly issue read/write transactions, which in turn trigger the GatherRequests procedure (Figure 3) in the bus module. GatherRequests simply adds the transaction request to the set of pending requests Rpend for the subsystem. On the negative clock edge, the HandleBusRequests procedure (Figure 4) in the bus module is triggered to handle the communication requests in the system. This procedure first calls the HandleCompletedRequests procedure (Figure 5) for every subsystem to check if any executing requests in Ract have completed, in which case the master is notified and the transaction completed. HandleCompletedRequests also removes an out-of-order request from the set of out of order requests Roo and adds it to the pending request set Rpend if it has completed waiting for its specified OO period.

procedure GatherRequests()
begin
  if request then
    τ ⇐ request
    τ.wait_cyc ⇐ 0
    τ.oo_cyc ⇐ 0
    τ.ooflag ⇐ FALSE
    Rpend ⇐ Rpend ∪ τ
end
Figure 3. GatherRequests procedure

procedure HandleBusRequests()
begin
  for each set X ∈ A do
    HandleCompletedRequests(Rpend, Ract, Roo)
    T ⇐ ArbitrateRequest(Rpend, Ract)
    for each request τ ∈ T do
      if (τ.ooflag == TRUE) then
        Ract ⇐ Ract ∪ τ
      else
        status ⇐ issue(τ.port, τ.addr, τ)
        UpdateDelaysAndSets(status, τ, Ract, Roo)
  ψ ⇐ DetermineIncrementPeriod(A)
  for each set X ∈ A do
    for each request τ ∈ Roo do
      τ.oo_cyc ⇐ τ.oo_cyc - ψ
    for each request τ ∈ Ract do
      τ.wait_cyc ⇐ τ.wait_cyc - ψ
  for each value λ ∈ M do
    λ ⇐ λ - ψ
  simulation_time ⇐ simulation_time + ψ
end
Figure 4. HandleBusRequests procedure

procedure HandleCompletedRequests(Rpend, Ract, Roo)
begin
  Spend ⇐ null; Sact ⇐ null; Soo ⇐ null
  for each request τ ∈ Ract do
    if (τ.wait_cyc == 0) then
      notify(τ.master, τ.status)
    else
      Sact ⇐ Sact ∪ τ
  for each request τ ∈ Roo do
    if (τ.oo_cyc == 0) then
      Spend ⇐ Spend ∪ τ
    else
      Soo ⇐ Soo ∪ τ
  Rpend ⇐ Spend; Ract ⇐ Sact; Roo ⇐ Soo
end
Figure 5. HandleCompletedRequests procedure

Next, we arbitrate to select requests from the pending request set Rpend which will be granted access to the bus. The function ArbitrateRequest (Figure 6) performs the selection based on the arbitration policy selected for every bus. We assume that a call to the ArbitrateOnPolicy function applies the appropriate arbitration policy and returns the selected requests for the bus. After the selection we update the set of pending requests Rpend by removing the requests selected for execution (and hence not ‘pending’ anymore). Since a bus subsystem can have independent read and write channels [17], there can be more than one active request executing in the subsystem, which is why ArbitrateRequest returns a set of requests and not just a single request for every subsystem.
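The set bookkeeping described above can be sketched for a single bus subsystem as follows. This is a minimal Python illustration under stated assumptions, not the SystemC implementation: the Request fields master and priority, the completed list that stands in for notify(), and the static priority policy with one channel are all hypothetical. The sketch also appends expired OO requests to the existing pending list, following the prose description of HandleCompletedRequests.

```python
from dataclasses import dataclass


@dataclass
class Request:
    master: str
    priority: int          # hypothetical: lower value wins arbitration
    wait_cyc: int = 0      # cycles before completion is signalled
    oo_cyc: int = 0        # cycles before re-arbitration may be requested
    ooflag: bool = False   # True once the request is out-of-order


class Subsystem:
    """One tuple set X = {Rpend, Ract, Roo} for a single bus subsystem."""

    def __init__(self):
        self.pend, self.act, self.oo = [], [], []
        self.completed = []  # stands in for notify(master, status)

    def gather(self, req):
        """GatherRequests: reset the counters and queue the request."""
        req.wait_cyc, req.oo_cyc, req.ooflag = 0, 0, False
        self.pend.append(req)

    def handle_completed(self):
        """Retire finished requests; move expired OO requests to pending."""
        still_active = []
        for req in self.act:
            if req.wait_cyc == 0:
                self.completed.append(req.master)
            else:
                still_active.append(req)
        still_oo = []
        for req in self.oo:
            if req.oo_cyc == 0:
                self.pend.append(req)
            else:
                still_oo.append(req)
        self.act, self.oo = still_active, still_oo

    def arbitrate(self):
        """ArbitrateRequest for one channel under a static priority policy."""
        if not self.pend:
            return []
        winner = min(self.pend, key=lambda r: r.priority)
        self.pend.remove(winner)  # the winner is no longer pending
        return [winner]
```

With two gathered requests of priority 2 (DMA) and 1 (USB), `arbitrate()` returns the USB request and leaves the DMA request pending. A subsystem with independent read and write channels would call the policy once per channel and return the union of the winners.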
function ArbitrateRequest(Rpend, Ract)
begin
  T ⇐ null
  for each independent channel c ∈ subsystem Rpend do
    T ⇐ T ∪ ArbitrateOnPolicy(c, Rpend)
  Rpend ⇐ Rpend \ T
  return T
end
Figure 6. ArbitrateRequest function

After the call to ArbitrateRequest, if the ooflag field of the selected request is TRUE, it implies that this request has already been issued to the slave and now needs to wait for τ.wait_cyc cycles before returning a response to the master. Therefore we simply add it to the executing requests set Ract. Otherwise we issue the request to the slave which completes the transaction in zero-time and returns a status to the bus module. We use the returned status structure to update the transaction status by calling the UpdateDelaysAndSets procedure (Figure 7). In this procedure we first check for the returned error status. If there is no error, then depending on whether the request is an out-of-order type or not, we update τ.oo_cyc with the number of cycles to wait before applying for re-arbitration, and τ.wait_cyc with the number of cycles before returning a response to the master. We also update the set Ract with the actively executing requests and Roo with the OO requests. If an error occurs, then the actual slave computation delay can differ and is given by the field error_delay. The values for other delays such as burst length and busy cycle delays are also adjusted to reflect the truncation of the request due to the error.

procedure UpdateDelaysAndSets(status, τ, Ract, Roo)
begin
  if (status.stat == OK) then
    τ.status = OK
    if (status.oo == TRUE) then
      τ.ooflag ⇐ TRUE
      τ.oo_cyc ⇐ status.(oo_delay + slave_int_delay + slave_comp_delay + bridge_delay)
                 + τ.arb_delay
      τ.wait_cyc ⇐ τ.(busy_delay + burst_length_delay + ppl_delay + bridge_delay + arb_delay)
      Roo ⇐ Roo ∪ τ
    else
      τ.wait_cyc ⇐ status.(slave_int_delay + slave_comp_delay + bridge_delay)
                   + τ.(busy_delay + burst_length_delay + ppl_delay + arb_delay)
      Ract ⇐ Ract ∪ τ
  else
    τ.status = ERROR
    τ.wait_cyc ⇐ status.(slave_int_delay + bridge_delay + error_delay)
                 + τ.(busy_delay + burst_length_delay + ppl_delay + arb_delay)
end
Figure 7. UpdateDelaysAndSets procedure

After returning from the UpdateDelaysAndSets procedure, we find the minimum number of cycles (ψ) before we need to invoke the HandleBusRequests procedure again, by calling the DetermineIncrementPeriod function (Figure 8). This function returns the minimum value out of the wait cycles for every executing request (τ.wait_cyc), out-of-order request cycles for all waiting OO requests (τ.oo_cyc) and the next request latency cycles for every master (λ). If there is a pending request which needs to be serviced in the next cycle, the function returns 1, which is the worst case return value. By default, the HandleBusRequests procedure is invoked at the negative edge of every simulation cycle, but if we find a value of ψ which is greater than 1, we can safely increment system simulation time by that value, preventing unnecessary invocation of procedures and thus speeding up simulation.

function DetermineIncrementPeriod(A)
begin
  ψ ⇐ inf
  for each set X ∈ A do
    for each set Rpend ∈ X do
      if Rpend ≠ NULL then
        ψ ⇐ 1
        return ψ
    for each set Ract ∈ X do
      for each request τ' ∈ Ract do
        ψ ⇐ min{ψ, τ'.wait_cyc}
    for each set Roo ∈ X do
      for each request τ'' ∈ Roo do
        ψ ⇐ min{ψ, τ''.oo_cyc}
  for each value λ ∈ M do
    ψ ⇐ min{ψ, λ}
  return ψ
end
Figure 8. DetermineIncrementPeriod function

It should be noted that for some very high performance designs it is possible that there is very little scope for this kind of speedup. Although this might appear to be a limitation, there is still substantial speedup achieved over BCA models because we handle all the delays in a transaction in one place – in the bus module, without repeatedly invoking other parts of the system on every cycle (master and slave threads and processes) which would otherwise contribute to simulation overhead.

5. EXPLORATION CASE STUDY
To validate our modeling approach with the CCATB abstraction, we performed an exploration study with a consumer multimedia SoC subsystem which performs audio and video encoding for popular codecs such as MPEG. Figure 9 shows this platform, which is built around the AMBA 2.0 communication architecture [1], with a high performance bus (AHB or Advanced high performance bus) and a peripheral bus (APB or Advanced peripheral bus) for high latency, low bandwidth peripheral devices. The system has an ARM926EJ-S processor to supervise flow control and perform encryption, a fast USB interface, on-chip memory modules, a DMA controller, an SDRAM controller to interface with external memory components and standard peripherals such as a timer, UART, interrupt controller, general purpose I/O and a Compact Flash card interface.

[Figure 9: block diagram with ARM926EJ-S, DMA, A/V Encoder, MEM1 to MEM5, USB 2.0, SDRAM controller, AHB/APB bridge, ITC, Timer, GPIO, UART and Flash interface on an AHB system bus and an APB peripheral bus]
Figure 9. SoC Multimedia Subsystem

Consider a scenario where the designer wishes to extend the functionality of the encoder system to add support for audio/video decoding and an additional AVLink interface for streaming data. The final architecture must also meet peak bandwidth constraints for the USB component (480Mbps) and the AVLink controller interface (768Mbps). Figure 10(a) shows the system with the additional components added to the AHB bus. To explore the effects of changing communication architecture topology and arbitration protocols on system performance, we modeled the SoC platform at the CCATB level and simulated a test program for several interesting combinations of topology and arbitration strategies. For each configuration, we determined if bandwidth constraints were being met and iteratively modified the architecture till all the constraints were satisfied.

Table 1 shows the system performance (total cycle count for test
program execution) for some of the architectures we considered, shown in Figure 10 (a), (b) and (c). In the columns for arbitration strategies, RR stands for a round robin scheme where bus bandwidth is equally distributed among all the masters. TDMA1 refers to a TDMA strategy where in every frame 4 slots are allotted to the AVLink controller, 2 slots to the USB, and 1 slot for the remaining masters. In TDMA2, 2 slots are allotted to the AVLink and USB, and 1 slot for the remaining masters. In both the TDMA schemes, if a slot is not used by a master then a secondary RR scheme is used to grant the slot to a master with a pending request. SP1 is a static priority scheme with the AVLink controller having a maximum priority followed by the USB, ARM926, DMA, A/V Encoder and the A/V Decoder. The priorities for the AVLink controller and USB are interchanged in SP2, with the other priorities remaining the same as in SP1.

        Arbitration Scheme
Arch    RR      TDMA1   TDMA2   SP1     SP2
Arch1   27.24   24.65   25.06   25.72   26.49
Arch2   24.98   23.86   23.03   23.52   23.44
Arch3   22.02   21.79   21.65   21.18   21.26
Table 1. Execution cycle counts (in millions of cycles)

For architecture Arch1, performance suffers due to frequent arbitration conflicts in the shared AHB bus. The shaded cells indicate scenarios where the bandwidth constraints for the USB and/or AVLink controller were not met. From Table 1 we can see that none of the arbitration policies in Arch1 satisfy the constraints.

[Figure 10: three topologies. (a) Arch1: AVLink controller and A/V Decoder added to the shared AHB system bus. (b) Arch2: AVLink controller, A/V Decoder and MEM6 on a second AHB system bus behind an AHB/AHB bridge. (c) Arch3: USB 2.0 and AVLink controller on separate AHB system busses, A/V Decoder on the main bus.]
Figure 10. SoC Communication Architecture Topologies

To decrease arbitration conflicts, we shift the new components to a dedicated AHB bus as shown in Figure 10(b). An AHB/AHB bridge is used to interface with the main bus. We split MEM5 and attach one of the memories (MEM6) to the dedicated bus and also add an interface to the SDRAM controller ports from the new bus, so that data traffic from the new components does not load the main bus as frequently. Table 1 shows a performance improvement for Arch2 as arbitration conflicts are reduced. With the exception of the RR scheme, bandwidth constraints are met with all the other arbitration policies. The TDMA2 scheme outperforms TDMA1 because of the reduced load on the main bus from the AVLink component which results in inefficient RR distribution of its 4 slots in TDMA1. TDMA2 also outperforms the SP schemes because SP schemes result in much more arbitration delay for the low priority masters (ARM CPU, DMA), whereas TDMA2 guarantees certain bandwidth even to these low priority masters in every frame.

Statistics gathered during simulation indicate that the A/V decoder frequently communicates with the ARM CPU and the DMA. Therefore with the intention of improving performance even further we allocate the high bandwidth USB and AVLink controller components to separate AHB busses, and bring the A/V decoder to the main bus. Figure 10(c) shows the modified architecture. Performance figures from the table indicate that the SP1 scheme performs better than the rest of the schemes. This is because the SP scheme works well when requests from the high bandwidth components are infrequent (since they have been allocated on separate busses). The TDMA schemes suffer because of several wasted slots for the USB and AVLink controller, which are inefficiently allocated by the secondary RR scheme.

We thus arrive at the Arch3 topology together with the SP1 arbitration scheme as the best choice for the new version of the SoC design. We arrived at this choice after evaluating several other combinations of topology/arbitration schemes not shown here due to lack of space. It took us less than a day to evaluate these different communication design space points with our CCATB models and our results were verified by simulating the system with a more detailed pin accurate BCA model. It would have taken much longer to model and simulate the system with other approaches. The next section quantifies the gains in simulation speed and modeling effort for the CCATB modeling abstraction, when compared with other models.

6. SIMULATION AND MODELING EFFORT COMPARISON
We now present a comparison of the modeling effort and simulation performance for pin accurate BCA (PA-BCA), transaction based BCA (T-BCA) and our CCATB models. For the purpose of this study we chose the SoC platform shown in Figure 11. This platform is similar to the one we used for exploration in the previous section but is more generic and is not restricted to the multimedia domain. It is built around the AMBA 2.0 communication architecture and has an ARM926 processor ISS model with a test program running on it which initializes different components and then regulates data flow to and from the external interfaces such as USB, switch, external memory controller (EMC) and the SDRAM controller.

[Figure 11: block diagram with ARM926EJ-S, ITC, Timer, SDRAM controller, ROM, DMA, RAM, USB, UART, EMC, Switch and traffic generators 1 to 3 on AHB system busses 1 to 3 (each with an arbiter + decoder) and an APB peripheral bus, joined by AHB/AHB and AHB/APB bridges]
Figure 11. SoC platform

For the T-BCA model we chose the approach from [14]. Our goal was to compare not only the simulation speeds but also to ascertain how the speed changed with system complexity. We first compared speedup for a ‘lightweight’ system comprising of just 2 traffic
generator masters along with peripherals used by these masters, such as the RAM and the EMC. We gradually increased system complexity by adding more masters and their slave peripherals. Figure 12 shows the simulation speed comparison with increasing design complexity.

[Plot not reproduced: simulation speed in Kcycles/sec (0 to 400) versus number of masters (2 to 7) for the CCATB, PA-BCA and T-BCA models]
Figure 12. Simulation Speed Comparison

Note the steep drop in simulation speed when the third master was added; this is due to the detailed non-native SystemC model of the ARM926 processor, which considerably slowed down simulation. In contrast, the simulation speed was not affected as much when the DMA controller was added as the fourth master. This was because the DMA controller transferred data in multiple word bursts, which can be handled very efficiently by the transaction based T-BCA and CCATB models. The CCATB model handles burst mode simulation particularly effectively and consequently shows the least degradation in performance of the three models. Subsequent steps added the USB switch and another traffic generator, which put considerable communication traffic and computation load on the system, resulting in a reduction in simulation speed. Overall, the CCATB abstraction level outperforms the other two models. Table 2 gives the average speedup of the CCATB over the PA-BCA and T-BCA models. We note that on average, CCATB is faster than T-BCA by 67% and faster than PA-BCA models by 120%.

Model Abstraction    Average CCATB speedup (x times)    Modeling Effort
CCATB                1                                  ~3 days
T-BCA                1.67                               ~4 days
PA-BCA               2.2                                ~1.5 wks

Table 2. Comparison of speed and modeling effort

Table 2 also shows the time taken to model the communication architecture at the three different abstraction levels by a designer familiar with AMBA 2.0. While capturing the communication architecture and modeling the interfaces took just 3 days for the CCATB model, it took a day more for the transaction based BCA, primarily due to the additional modeling effort needed to maintain accuracy at cycle boundaries for the bus system. It took almost 1.5 weeks to capture the PA-BCA model. Synchronizing and handling the numerous signals and design verification were the major contributors to the additional design effort in these models. In summary, CCATB models are faster to simulate and need less modeling effort compared to T-BCA and PA-BCA models.

7. CONCLUSION
Early exploration of System-on-chip communication architectures is extremely important to ensure efficient implementation and for meeting performance constraints. We described the mechanisms responsible for speedup in our recently proposed CCATB modeling abstraction, which enable fast and efficient exploration of the communication design space, early in the design flow. We demonstrated the usefulness of our approach in a case study involving exploration of a multimedia SoC subsystem. Using models at the CCATB abstraction, we were able to quickly explore the impact of changes in the system and arrive at an architecture which met component bandwidth constraints and outperformed other choices. We also showed that the CCATB models are faster to simulate than pin-accurate BCA (PA-BCA) models by 120% on average and are also faster than transaction based BCA (T-BCA) models by 67% on average. In addition, the CCATB models take less time to model than T-BCA and PA-BCA models. Our future work will focus on automatic refinement of CCATB models from high level TLM models and interface refinement from CCATB down to the pin accurate BCA abstraction level for RTL co-simulation purposes.

8. ACKNOWLEDGEMENTS
This research was partially supported by grants from Conexant Systems Inc., UC Micro (03-029) and NSF grants CCR 0203813 and CCR 0205712.

9. REFERENCES
[1] Flynn. “AMBA: enabling reusable on-chip designs”. IEEE Micro, 1997.
[2] IBM CoreConnect http://www.chips.ibm.com/products/powerpc/cores
[3] Wishbone Specification http://www.silicore.net/wishbone.htm
[4] Open Core Protocol International Partnership (OCP-IP). OCP datasheet, http://www.ocpip.org
[5] “System-on-Chip Specification and Modeling Using C++: Challenges and Opportunities”, IEEE D&T May/June 2001
[6] Joon-Seo Yim et al. “A C-Based RTL Design Verification Methodology for Complex Microprocessor”, DAC, 1997
[7] Luc Séméria, Abhijit Ghosh, "Methodology for Hardware/Software Co-verification in C/C++", ASP-DAC, 2000
[8] Sudeep Pasricha, “Transaction Level Modeling of SoC with SystemC 2.0”, SNUG, Bangalore, 2002
[9] T. Grötker, S. Liao, G. Martin, S. Swan. “System Design with SystemC”. Kluwer Academic Publishers, 2002.
[10] D. Gajski et al., “SpecC: Specification Language and Methodology”, Kluwer Academic Publishers, January 2000
[11] Xinping Zhu, S. Malik, “A hierarchical modeling framework for on-chip communication architectures”, ICCAD, 2002
[12] M. Caldari et al. “Transaction-Level Models for AMBA Bus Architecture Using SystemC 2.0”, DATE, 2003
[13] O. Ogawa et al. “A Practical Approach for Bus Architecture Optimization at Transaction Level”, DATE, 2003
[14] AHB CLI Specification http://www.arm.com/armtech/ahbcli
[15] R. A. Bergamaschi and S. Raje, “Observable Time Windows: Verifying the Results of High-Level Synthesis”, ECDT, 1996
[16] M. Ben-Romdhane et al. “Quick-Turnaround ASIC Design in VHDL: Core-Based Behavioral Synthesis” Kluwer, 1996
[17] AMBA AXI Specification http://www.arm.com/armtech/AXI
[18] H. Jang et al., “High-Level System Modeling and Architecture Exploration with SystemC on a Network SoC: S3C2510 Case Study”, DATE, 2004
[19] M. Loghi et al. “Analyzing On-Chip Communication in a MPSoC Environment”, DATE, 2004
[20] Sudeep Pasricha, Nikil Dutt, Mohamed Ben-Romdhane, "Extending the Transaction Level Modeling Approach for Fast Communication Architecture Exploration", DAC, 2004
[21] Sudeep Pasricha et al. "Rapid Exploration of Bus-based Communication Architectures at the CCATB Abstraction", CECS Technical Report 04-11, May 2004

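As a quick cross-check of the figures in Section 6, the percentages quoted in the text (67% and 120%) follow from the "x times" factors in Table 2, under the usual reading that a speedup factor of s means "(s - 1) x 100% faster". The sketch below is illustrative only: the names `CCATB_SPEEDUP` and `percent_faster` are hypothetical and do not come from the paper, and the only data used are the three factors reported in Table 2.

```python
# Average speedup factors from Table 2: how many times faster the CCATB
# model simulates than each model (CCATB itself is the 1x baseline).
CCATB_SPEEDUP = {"CCATB": 1.0, "T-BCA": 1.67, "PA-BCA": 2.2}


def percent_faster(speedup_factor: float) -> int:
    """Convert an 'x times' speedup factor into a 'faster by N%' figure.

    Rounds to the nearest integer to absorb floating-point noise.
    """
    return round((speedup_factor - 1.0) * 100)


if __name__ == "__main__":
    for model, factor in CCATB_SPEEDUP.items():
        print(f"CCATB vs {model}: {factor}x, i.e. {percent_faster(factor)}% faster")
```

Read this way, the table and the prose agree: a factor of 1.67 corresponds to the 67% quoted for T-BCA, and a factor of 2.2 corresponds to the 120% quoted for PA-BCA.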