Data Flow Implementation Strategies

The document discusses the implementation of data flow graphs (DFGs) in both software and hardware, focusing on the mapping of DFGs to software through sequential and parallel schedules, as well as the implementation of actors and queues. It highlights the differences between static and dynamic scheduling, including the challenges of load balancing and minimizing inter-processor communication in parallel systems. Additionally, it covers hardware implementations of DFGs, emphasizing the suitability of single-rate SDF graphs for direct mapping into hardware and the use of pipelining to optimize circuit performance.

Data Flow Implementation in

Software and Hardware


Software Implementation of Data Flow
Converting Queues and Actors into Software
There are a wide variety of approaches for mapping DFGs to software

• Sequential implementations can make use of static or dynamic schedules


• Parallel, multi‐processor mappings require more effort due to:
• Load balancing: Mapping actors such that the activity on each processor is about the same
• Minimizing inter‐processor communication: Mapping actors such that communication overhead is minimized
Software Implementation of Data Flow
Mapping DFGs to Software
• We will focus first on single‐processor systems, and in particular, on finding efficient versions of
sequential schedules
• As noted on the previous slide, there are two options for implementing the schedule:
Dynamic schedule
• Here, software decides the order in which actors execute at runtime by testing firing
rules to determine which actor can run
• Dynamic scheduling can be done in a single‐threaded or multi‐threaded execution
environment
Static schedule
• In this case, the firing order is determined at design time and fixed in the
implementation
• The fixed order allows for a design time optimization in which the firing of multiple
actors can be treated as a single firing
• This in turn allows for ‘inlined’ implementations
Software Implementation of Data Flow
DFG Elements

• Before discussing these, let’s first look at C implementations of actors and queues
FIFO Queues:
• Although DFGs theoretically have infinite length queues, in practice, queues are limited in size
• We discussed earlier that constructing a PASS allows the maximum queue size to be determined by
analyzing actor firing sequences

• A typical software interface of a FIFO queue has two parameters and three methods
• The # of elements N that can be stored in the queue and the data type of the elements.
• Methods that put elements into the queue, get elements from the queue, and query the current size of
the queue.
Software Implementation of Data Flow
DFG Elements
• Queues are well defined (standardized) data structures
• A circular queue consists of an array, a write‐pointer and a read‐pointer
• They use modulo addressing, e.g., the Ith element is at position (rptr + I) mod N, where N is the queue size
• Note that the queue size here is fixed at compile time; alternatively, the queue storage can be allocated dynamically at runtime using malloc()
Software Implementation of Data Flow

Software implementation of the dataflow actor


Software Implementation of Data Flow
ACTOR Implementation
• A data flow actor can be captured as a function, with some additional support to interface
with the FIFO queues.
• Designers will often differentiate between the internal activities of an actor and the
input–output behavior.
• The behavior corresponding to actor firing can be implemented as a simple C function.
• The actor function incorporates a finite state machine (FSM), which checks the firing rules
to determine whether to execute the actor code.
• The local controller (FSM) of an actor has two states.
• wait state: start state which checks the firing rules immediately after being invoked by a
scheduler.
• work state: wait transitions to work when firing rules are satisfied, the actor then reads
tokens, performs calculation and writes output tokens.
Software Implementation of Data Flow
ACTOR Implementation

Finally, the actor_io and queue objects can be instantiated in the main program, and the actor
functions can be called using a system scheduler.
Sequential Targets with Dynamic Schedule
• A software implementation of SDF is obtained by combining several different actor
descriptions, by interconnecting those actors using FIFO queues, and by executing the
actors through a system schedule.
• In a dynamic system schedule, the firing rules of the actors will be tested at runtime; the
system scheduling code consists of the firing rules, as well as the order in which the firing
rules are tested.
Mapping DFGs to Single Processors: Dynamic Schedule
Single‐Thread Dynamic Schedules
• Implementation of a system schedule as a function that instantiates all actors and
queues, and then calls the actors in round-robin fashion.
• First, note that it is impossible to call the actors in the 'wrong' order, because each of
them still has a firing rule that protects it from running when there is no data
available.

• The schedule on the right shows that snk in (a) is called as often as src; however, snk will
only fire on even-numbered invocations.
• Even though snk is called as often as src, the firing rule of snk will only allow
that actor to run when there is sufficient data available.
• (b) shows a problem that is not handled by this scheduler: round-robin scheduling in
this case will eventually lead to queue overflow
Mapping DFGs to Single Processors: Dynamic Schedule
• The underlying problem with (b) is that the implemented firing sequence differs from the
firing sequence for a PASS, which is given as (src, snk, snk)
• There are two solutions to this issue:
• Solution 1: We could adjust the system schedule to reflect the firing rate predicted by
the PASS.

• This solution is not very elegant, because it destroys the idea of having a dynamic
scheduler.
• If we have to obtain the PASS firing rate first, we may as well forget about using a
dynamic schedule.
Mapping DFGs to Single Processors: Dynamic Schedule
• Solution 2: We could adjust the code for the snk actor to continue execution as long as
there are tokens present. Thus, the code for the snk actor becomes:

• This is a better solution than the previous one, because it keeps the advantages of a
dynamic system schedule.
Mapping DFGs to Single Processors: Example Dynamic Schedule

• Let’s implement the 4‐point Fast Fourier Transform (FFT) shown above using a dynamic
schedule
• The array t stores 4 (time domain) samples
• The array f will be used to store the frequency domain representation of t
• The FFT is built from butterfly operations, as defined on the right side of
the figure
• The twiddle factor W(k, N) is a complex number defined as e^(−j2πk/N), with W(0, 4) = 1 and
W(1, 4) = −j
Mapping DFGs to Single Processors: Example Dynamic Schedule

• reorder: Reads 4 tokens and shuffles them to match the flow diagram
• The t[0] and t[2] are processed by the top butterfly and t[1] and t[3] are processed by
the bottom butterfly
• fft2: Calculates the butterflies for the left half of the flow diagram
• fft4mag calculates the butterflies for the right half and produces the magnitude
component of the frequency domain representation
Mapping DFGs to Single Processors: Example Dynamic Schedule
The implementation first requires a valid schedule to be computed
• The firing rate is easily determined to be [qreorder, qfft2, qfft4mag] = [1, 2, 1]
void reorder(actorio_t *g)
{
    int v0, v1, v2, v3;
    while (fifo_size(g->in[0]) >= 4) {
        v0 = get_fifo(g->in[0]);
        v1 = get_fifo(g->in[0]);
        v2 = get_fifo(g->in[0]);
        v3 = get_fifo(g->in[0]);
        put_fifo(g->out[0], v0);
        put_fifo(g->out[0], v2);
        put_fifo(g->out[0], v1);
        put_fifo(g->out[0], v3);
    }
}
Mapping DFGs to Single Processors: Example Dynamic Schedule

The deterministic property of SDFs and the while loops inside the actors allow the
call order shown above to be re-arranged while preserving the functional behavior
Mapping DFGs to Single Processors: Multi‐Thread Dynamic Schedule
• In multi‐threaded programming, each actor (implemented as a function) lives in a
separate thread
• The threads are time‐interleaved by a scheduler in single processor environments
• Systems in which threads voluntarily relinquish control back to the scheduler are referred
to as cooperative multithreading

Such a system can be implemented using two functions, create() and yield(), as shown in the figure above
The scheduler can apply different strategies to schedule threads; the simplest one, shown above, is a round-
robin schedule
Mapping DFGs to Single Processors: Multi‐Thread Dynamic Schedule
• Quickthreads is a cooperative multithreading library
• The quickthreads (qt) API (Application Programmer's Interface) consists of 4 functions
• stp_init(): initializes the threading system
• stp_create(stp_userf_t *F, void *G): creates a thread that will start execution with user function F, and
will be passed a single argument G
• stp_yield(): releases control over the thread to the scheduler
• stp_abort(): terminates a thread (prevents it from being scheduled)
• Here’s an example
#include "../qt/stp.h"
#include <stdio.h>
Mapping DFGs to Single Processors: Multi‐Thread Dynamic Schedule
Mapping DFGs to Single Processors: Multi‐Thread Dynamic Schedule
• Here is the example of a sort actor, implemented using the cooperative threading model.
Mapping DFGs to Single Processors: Multi‐Thread Dynamic Schedule
• The system scheduler now will call threads rather than actors:
Mapping DFGs to Single Processors: Static Schedule
• From the PASS analysis of an SDF graph, we know at least one
solution for a feasible sequential schedule
• This solution can be used to optimize the implementation in several
ways
• We can remove the firing rules since we know the exact sequential schedule
• This yields only a small performance benefit
• We can also determine an optimal interleaving of the actors to minimize the
storage requirements for the queues
• Finally, we can create a fully inlined version of the SDF graph which eliminates
the queues altogether
Mapping DFGs to Single Processors: Static Schedule

• From the above SDF topology, we know that the relative firing rates of A, B, and C must be 4, 2, and 1 to
yield a PASS.
• The right side of the code shows an example implementation of this PASS.
• The A, B, C actors are called in accordance with their PASS rate.
• Due to this particular interleaving, it is easy to see that in a steady state condition, the queue AB will carry
four tokens maximum, while the queue BC will contain two tokens maximum.
• This is not the optimal interleaving, however.
• By calling the actors in the sequence (A,A,B,A,A,B,C), the maximum number of tokens on any queue is
reduced to two.
Mapping DFGs to Single Processors: Static Schedule
• Implementing a truly static schedule means that we will no longer
test firing rules when calling actors.
• In fact, when we call an actor, we will have to guarantee that the
required input tokens are available.
• In a system with a static schedule, all SDF-related operations get a
fixed execution order: the actor firings and the sequences of put and
get operations on the FIFO queues.
• This provides the opportunity to optimize the resulting SDF system.
• We will discuss optimization of single‐thread SDF systems with a
static schedule using an example we discussed before – the GCD.
• From our earlier analysis, we know that a valid PASS fires each node
a single time. Listing 2.4 is a description for each of the actors (sort,
diff).
sort: out1 = (a > b) ? a : b;
      out2 = (a > b) ? b : a;

diff: out1 = (a != b) ? a - b : a;
      out2 = b;
Mapping DFGs to Single Processors: Static Schedule

sort: out1 = (a > b) ? a : b;
      out2 = (a > b) ? b : a;

diff: out1 = (a != b) ? a - b : a;
      out2 = b;
Mapping DFGs to Single Processors: Static Schedule

// initial tokens
put_fifo(&F1, 16);
put_fifo(&F2, 12);
Hardware Implementation of Data Flow
• SDF graphs are particularly well‐suited for mapping directly into hardware
• For SDFs with single‐rate schedules, all actors can execute in parallel using
the same clock
• The rules for mapping single‐rate SDFs to hardware are straightforward
• All actors map to a single combinational hardware circuit
• All FIFO queues map to wires (without storage)
• Each initial token in a queue is replaced with a register
• This means that two actors with no token on the queue between them will
be effectively represented as two back‐to‐back combinational circuits
• Timing requirements regarding the delay of the combinational logic must
be met,
• i.e., all computation within the combinational logic must complete in 1 clock cycle
• Therefore, in practice, the length of the actor sequence will be limited
Hardware Implementation of Data Flow
Hardware implementation of Euclid’s algorithm

sort: out1 = (a > b) ? a : b;
      out2 = (a > b) ? b : a;

diff: out1 = (a != b) ? a - b : a;
      out2 = b;

 The sort and diff actors are implemented using a comparator, a subtractor, plus a few multiplexers
 Note that the combined delay through the circuits of both actors defines the maximum achievable clock frequency
 If the critical path delays are 5 ns and 15 ns, the total path delay is 20 ns, so the maximum operating frequency is 50 MHz
Hardware Implementation of Data Flow
• Does this circuit work?
• Yes, it does. The circuit evaluates a single iteration through a PASS per clock cycle.
• Is this translation procedure general so that it would work for any SDF graph?
• No, it is not. The translation procedure is restricted to the following SDF graphs.
• We need a single‐rate SDF graph, which has a PASS firing vector with all ones in
it.
• All actors need to be implemented using combinational logic.
• In addition, the above method may result in circuits with a very long
combinational path.
• In the circuit above for example, the maximal operating clock frequency is
determined by the combined delay of the sort circuit and the diff circuit.
Hardware Implementation of Data Flow
Pipelining
• Pipelining of SDF graphs helps to break long combinational paths that may exist in circuits.
• Consider the example shown below.
• This is a data flow specification of a digital filter. It evaluates a weighted sum of samples of an
input stream; for the three-tap filter shown, the sum is

out(n) = c0·in(n) + c1·in(n−1) + c2·in(n−2)
The critical path of this graph runs through one of the multiplier actors, c0 or c1, in series with
two adder actors
Hardware Implementation of Data Flow

• Pipelining allows additional tokens to be added (at the expense of increased latency). Here, one is
introduced by the in actor
• Retiming is then used to redistribute the tokens, as shown by the above marking, which is
produced after firing c0, c1 and c2
• Retiming creates additional registers and an additional pipeline stage
• The critical path is now reduced to two back-to-back adder actors
Hardware Implementation of Data Flow
• Note that it is not possible to introduce arbitrary initial tokens in a graph without
following the actor’s firing rules
• Doing so would likely change the behavior of the system
