HD Video Encoding with DSP and FPGA Partitioning

By Cheng Peng, Ph.D., Video Applications Engineer, and Thanh Tran, Ph.D., Embedded Systems Manager, Texas Instruments
Overview
Digital signal processors (DSPs) handle the vast majority of video encoding applications unaided. Implementing a scalable architecture that adds an FPGA as a co-processor to offload certain tasks satisfies even the most demanding video applications.
As video and imaging applications evolve toward high-definition (HD) video compression standards, co-processing architectures that include both DSPs and FPGAs are becoming a more popular option. However, partitioned systems are not the only option, because new advances in DSP architectures, performance, peripheral mix, video hardware acceleration and implementation techniques have significantly broadened the range of applications in which DSPs can provide a complete solution.
DSPs have an inherent advantage because they are programmable, and their versatility allows designers to execute almost any algorithm. But as the computational load grows dramatically, as is the case in HD video, FPGAs can sometimes be employed to offload from the DSP certain compute-intensive tasks that can be hardwired into the FPGA.
In video encoding, as in virtually all engineering designs, there are no one-size-fits-all solutions. Even when the same codec is employed, the application plays a critical role in determining the level of computing power and memory bandwidth required. This, in turn, can play a dominant role in both hardware and software implementation strategies.
When dealing with compressed video, utilizing a standard compression algorithm is the most likely choice for experienced design teams. Once a codec is selected, however, another critical step is to assess the requirements of motion estimation (ME) and motion compensation (MC), because they can be two of the most demanding functions in video compression.
Not surprisingly, the computational and memory bandwidth demands that roll out of the ME and MC engines depend on the amount of motion in the scene. The H.264/AVC (advanced video coding) codec, for example, can be used in applications such as video surveillance, where most often very little action occurs over hours of operation. At the other end of the spectrum, encoding HD video for a broadcast application can require a memory bandwidth of 20 GBytes/s or higher. Somewhere in between is HD video conferencing, which might require a memory bandwidth of 1.5 GBytes/s.
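To see where numbers of this magnitude come from, consider a rough model (ours, not from the paper) of the reference-frame traffic generated by a brute-force full-search ME engine with no data reuse. The resolution, frame rate and search range below are illustrative assumptions:

    #include <stdio.h>

    /* Rough, illustrative model of motion-estimation memory traffic:
     * brute-force full search with NO reference-data reuse.  All
     * parameters are assumptions chosen for illustration only. */
    int main(void)
    {
        const double width = 1920.0, height = 1080.0; /* HD frame, assumed */
        const double fps = 30.0;                      /* frames per second */
        const int    range = 16;                      /* +/-16-pel search  */

        double mbs_per_frame = (width / 16.0) * (height / 16.0);
        double candidates    = (2.0 * range + 1) * (2.0 * range + 1);
        double bytes_per_sad = 16.0 * 16.0;  /* one 8-bit reference block read */

        double bytes_per_sec = fps * mbs_per_frame * candidates * bytes_per_sad;
        printf("Naive full-search reference traffic: %.1f GBytes/s\n",
               bytes_per_sec / 1e9);         /* prints roughly 67.7 */
        return 0;
    }

Practical ME engines cache and reuse the search window, which is why real designs land in the 1.5 to 20 GBytes/s range cited above rather than at this naive worst case.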
Encoding HD Video

Although profiles go a long way towards a packaged solution for design engineers, the specific application has at least an equal impact on the implementation's hardware architecture.
An HD teleconferencing application, for example, can be expected to have relatively
little motion from frame to frame, but a broadcast TV application must deal with the more
intense video of sporting events, action movies and other content in which a substantial
amount of motion can be expected.
As previously mentioned, the ME and MC engines are key elements in a hardware partitioning strategy, particularly for encoding video. The design team must consider whether the ME engine alone should be implemented on the FPGA or whether the computational load is heavy enough to require hardware acceleration of both the ME and MC engines.
The memory bandwidth, which can be 20 GBytes/s or higher, is just as important as the computational loading. FPGA hardware architects have the flexibility to scale the memory bandwidth as high as necessary.
The H.264/AVC high profile is the obvious choice for HD encoding of broadcast transmissions. An illustration of the basic H.264/AVC encoder architecture is shown in Figure 1.
For ME calculations, the current frame and each of the frames it will be referenced against are divided into macroblocks (MBs), which are typically 16-by-16 pixels but can be as small as 4-by-4 pixels. In a process called matching, a search attempts to locate the MB in the reference frame that best satisfies a pre-determined minimum error criterion with respect to the MB in the current frame.
[Figure 1. The motion prediction block of the encoder is critical to hardware partitioning. Block diagram of the basic H.264/AVC encoder: pre-processing with chroma conversion and PicAFF/MBAFF, ME/MC and intra prediction with mode decision and MV prediction, transform and scaling/quantization, inverse scaling/quantization and inverse transform with macroblock reconstruction and de-blocking filter, frame store, and entropy encoder (CAVLC/CABAC) with NAL packetization, all tied together by a multi-port memory controller and a coding control/rate control unit.]
A common error criterion used by the ME engine is the Sum of Absolute Differences (SAD), which is defined as

SAD = \sum_{i=0}^{15} \sum_{j=0}^{15} \left| x_{ij} - y_{ij} \right|

where x is the current-frame macroblock, y is the reference-frame macroblock and the subscript ij denotes row i, column j within the macroblock.
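As a concrete illustration of the formula, here is a minimal, unoptimized C sketch of the 16-by-16 SAD; it is an illustration, not TI's production kernel, and stride is the line pitch of the frame buffers:

    #include <stdlib.h>

    /* Sum of Absolute Differences between the current 16x16 macroblock
     * and one candidate macroblock in the reference frame.  cur and ref
     * point at the top-left pixel of each block; stride is the line
     * pitch of the frame buffers. */
    unsigned sad_16x16(const unsigned char *cur, const unsigned char *ref,
                       int stride)
    {
        unsigned sad = 0;
        for (int i = 0; i < 16; i++) {      /* row i of the macroblock */
            for (int j = 0; j < 16; j++)    /* column j                */
                sad += abs(cur[j] - ref[j]);
            cur += stride;
            ref += stride;
        }
        return sad;
    }

A full-search ME engine evaluates this kernel at every candidate offset in the search window and keeps the minimum; it is exactly this inner loop that an FPGA can hardwire into parallel SAD arrays.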
In some applications, the ME engine may have to perform only about 64 SAD calculations per cycle, while in others it may have to execute thousands of SADs per cycle. The difference is quite significant, and at the high end it can lead to architectures that feature multiple DSPs or call for partitioning some of the calculations into a hardwired accelerator on an FPGA.
Regardless of whether the FPGA is needed to calculate SADs or for its memory bandwidth, the FPGA must have tightly coupled communication with the DSP in order to be effective. Figure 2 illustrates a technique called macroblock-based pipeline processing, which addresses this design challenge.
The P-frame (middle) is a video frame encoded relative to the past reference frame. A
reference frame is a P- or I-frame (top). The past reference frame is the closest preceding
reference frame.
More important from a hardware architecture perspective is that in the DSP and FPGA
solution, internal buffers are reserved for multiple macroblocks. While one macroblock is
being processed and written to an internal buffer, the macroblock data in the other buffers
(that has already been processed) is transmitted to the subsequent processing unit.
[Figure 2. Macroblock-based pipeline processing. Diagram: the combined DSP + FPGA system exchanges the current P-frame, a search window drawn from the reconstructed I/P reference frame, and the outgoing video stream through external video frame buffers.]
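A minimal C sketch of that ping-pong buffering follows. The buffer sizing and the process_mb()/transmit_mb() hooks are hypothetical placeholders; in a real design the transmit step would start a DMA that overlaps the next macroblock's processing:

    #define MB_BYTES (16 * 16 + 2 * 8 * 8)  /* one 4:2:0 macroblock (assumed) */

    static unsigned char mb_buf[2][MB_BYTES];   /* on-chip ping-pong pair */

    /* Hypothetical hooks: encode one macroblock into a buffer, and stream
     * a finished buffer to the next processing unit (e.g., over the
     * DSP-FPGA link, ideally as a DMA that overlaps processing). */
    extern void process_mb(int mb_index, unsigned char *dst);
    extern void transmit_mb(const unsigned char *src);

    void encode_frame(int mbs_in_frame)
    {
        if (mbs_in_frame <= 0)
            return;
        for (int mb = 0; mb < mbs_in_frame; mb++) {
            process_mb(mb, mb_buf[mb & 1]);        /* fill one buffer...      */
            if (mb > 0)                            /* ...while the previously */
                transmit_mb(mb_buf[(mb + 1) & 1]); /* filled buffer drains    */
        }
        transmit_mb(mb_buf[(mbs_in_frame - 1) & 1]); /* flush the last MB */
    }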
In a synchronous design, it is very important for the DSP and FPGA to access memory in a particular order and granularity while minimizing the clock cycles lost to latency, bus contention, alignment, DMA transfer rates and the types of memory involved.
Inter-chip communication is critical in the implementation model. Figures 3a and 3b show two alternative architectures. Architecture A implements only the ME engine on the FPGA, while architecture B implements both the ME and MC engines there. Additional complexity arises because the ME and MC engines must continuously interact with each other.
[Figure 3a. Architecture A needs straightforward DSP-FPGA communication protocols. Only motion estimation and its working memory sit on the FPGA; the C64x DSP retains rate-distortion control, intra/inter prediction, transform and quantization, the inverse path (IQ/iDCT), motion compensation, the de-blocking filter and CABAC/CAVLC, with reference frames held in external memory.]

[Figure 3b. Architecture B significantly increases DSP/FPGA/memory interactions. The FPGA now also hosts motion compensation, the memory buffer, the de-blocking filter and the CABAC/CAVLC block.]
Architecture B moves more than just the ME engine to the FPGA: the memory buffer, de-blocking filter and CABAC/CAVLC block go along with it. Context-adaptive binary arithmetic coding (CABAC) is a technique for compressing the syntax elements of the video stream. Context-adaptive variable-length coding (CAVLC) is a lower-complexity alternative to CABAC for coding quantized transform coefficient values.
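For a flavor of the variable-length side of the entropy stage, the sketch below implements the unsigned Exp-Golomb code that H.264 uses for many of its syntax elements. It is not CAVLC itself, whose context-adaptive tables are considerably more involved, and the put_bits() bit-writer is a hypothetical stand-in:

    #include <stdint.h>

    /* Hypothetical bit-writer: appends the low n bits of val, MSB first
     * (a no-op when n == 0). */
    extern void put_bits(uint32_t val, int n);

    /* Unsigned Exp-Golomb coding, ue(v), used by H.264 for many syntax
     * elements: codeNum k is written as m zeros followed by the
     * (m+1)-bit value k+1, where m = floor(log2(k + 1)). */
    void write_ue(uint32_t k)
    {
        uint32_t code = k + 1;
        int m = 0;
        while ((code >> (m + 1)) != 0)  /* m = floor(log2(k + 1)) */
            m++;
        put_bits(0, m);                 /* m-bit zero prefix      */
        put_bits(code, m + 1);          /* leading 1 plus m bits  */
    }

For example, write_ue(0) emits 1, write_ue(1) emits 010 and write_ue(4) emits 00101.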
While architecture B keeps a good balance of functionality between the DSP and FPGA and enables both high performance and improved flexibility for H.264/AVC encoding, it should be avoided when possible because the memory data transfers and communication protocol between the DSP and FPGA can become very complicated. Architecture A is the optimal choice since it keeps the memory data transfers and the communication protocol between the DSP and FPGA simple.
H.264/AVC Encoding of Broadcast Video

Broadcast video encoding requires a different peripheral mix than encoders in consumer devices and relatively undemanding applications such as video conferencing. High-end encoders need high channel densities and throughput along with low cost per channel. The right peripherals and memory go far towards reaching these goals.
Peripherals such as Serial RapidIO (SRIO), a gigabit Ethernet interface, DDR2 and larger L2 memory, all available on TI's TMS320C6455 DSP, are highly relevant to DSP and FPGA partitioning decisions and allow designers to create high-performance applications by integrating multiple DSPs on the same board.
An SRIO bus decreases overall system cost by reducing the need for additional devices
used for switching and processor aggregation. SRIO interconnect also enables high-speed,
packet-switched, peer-to-peer connectivity, providing a performance breakthrough for
multi-channel implementations on multiple processors.
A 1× SRIO link is fast enough to send HD 1080i raw video between devices, and a 4× SRIO link can easily send HD 1080p raw video between devices with bandwidth to spare. The use of SRIO in infrastructure applications with DSP farms can significantly cut system cost (device count, board size and/or device cost).
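A quick sanity check of these claims with our own arithmetic, assuming 8-bit 4:2:0 video and the common 2.5-Gbaud SRIO lane rate (2.0 Gbits/s of payload per lane after 8b/10b encoding):

    #include <stdio.h>

    /* Raw bit rate of 8-bit 4:2:0 video (1.5 bytes/pixel, assumed). */
    static double video_gbps(double w, double h, double fps)
    {
        return w * h * 1.5 * 8.0 * fps / 1e9;
    }

    int main(void)
    {
        const double lane_gbps = 2.0;  /* 2.5 Gbaud less 8b/10b overhead */

        printf("1080i30 raw: %.2f Gbits/s vs 1x SRIO %.1f Gbits/s\n",
               video_gbps(1920, 1080, 30), 1 * lane_gbps); /* 0.75 vs 2.0 */
        printf("1080p60 raw: %.2f Gbits/s vs 4x SRIO %.1f Gbits/s\n",
               video_gbps(1920, 1080, 60), 4 * lane_gbps); /* 1.49 vs 8.0 */
        return 0;
    }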
In addition to SRIO, the DSP should integrate high-bandwidth peripherals such as a gigabit EMAC, DDR2 and 2 MB of L2 memory. A gigabit EMAC has ten times the Ethernet bandwidth of earlier-generation devices.
A 500-MHz DDR2 external memory interface provides twice the throughput, allowing system designers to transfer data at a faster rate. Finally, the 2 MB of L2 memory delivers extra performance, further reducing the price per channel in infrastructure applications.
Integrating high-bandwidth I/O blocks into the DSP has the expected result of adding
another design option. With the availability of DSPs capable of satisfying the memory
bandwidth of real-time HD encoding of broadcast video, the designer should consider
using multiple DSPs in even more scenarios instead of an inherently complex DSP and
FPGA combination.
The primary reason for using two DSPs for HD encoding is that the inter-chip communications have been largely solved by the chip designers. Another reason is scalability.
Scalable Systems

Since the evolution to HD has really only just begun, in many instances designers will find it useful to provide a standard-definition (SD) solution that can be scaled to an HD solution with little additional effort. Employing DSPs with high-performance I/O such as SRIO offers an easy migration path.
The starting point in this scalability strategy is encoding SD video. A single 1-GHz DSP with the peripherals mentioned above is capable of encoding H.264/AVC's baseline profile at SD resolution, 720×480 pixels at 30 frames per second (fps). Motion compensation is executed on chip.
When the encoding requirement moves to HD, two 1-GHz DSPs can be employed, with SRIO utilized for interprocessor communication. ME and MC are moved off the chip that originally handled the SD encoding alone. Figure 4a shows the SD encoding architecture, and Figure 4b shows its evolution to HD encoding at 1280×720 resolution, 30 fps.
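The jump in workload that motivates the second DSP is easy to quantify from the resolutions and frame rate quoted above:

    #include <stdio.h>

    /* Macroblocks per second at a given resolution and frame rate. */
    static double mb_rate(double w, double h, double fps)
    {
        return (w / 16.0) * (h / 16.0) * fps;
    }

    int main(void)
    {
        double sd = mb_rate(720, 480, 30);    /*  40,500 macroblocks/s */
        double hd = mb_rate(1280, 720, 30);   /* 108,000 macroblocks/s */

        printf("SD 720x480 at 30 fps:  %.0f macroblocks/s\n", sd);
        printf("HD 1280x720 at 30 fps: %.0f macroblocks/s (%.1fx SD)\n",
               hd, hd / sd);                  /* roughly 2.7x */
        return 0;
    }

Nearly triple the macroblock throughput, plus larger search windows, is what pushes the design from one DSP to two.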
[Figure 4a. SD encoding with a single 1-GHz C6455 DSP: intra prediction, transform and quantization, the inverse path (IQ/iDCT), motion estimation and compensation, the de-blocking filter and CAVLC entropy coding all run on chip with DDR2 memory, and compressed data leaves over PCI.]

[Figure 4b. HD encoding with two 1-GHz C6455 DSPs: motion estimation and compensation run on the second DSP, with motion vectors and data exchanged over SRIO. Note that in these two designs, FPGA assist is not required.]
When the design scenario moves to an application where encode and decode are both required in the same system, DSPs can still be utilized to do the majority of the work. For SD decode and encode, an FPGA is used as a buffer for the video from the camera, and a less-expensive media processor such as TI's TMS320DM642 digital media processor running at 720 MHz can be used for decoding.
HD encode and decode employs fundamentally the same architecture as HD encoding (Figure 4b), but with the addition of the low-cost, high-performance media processor and an FPGA. Figures 5a and 5b illustrate these evolutionary steps. The HD system can perform simultaneous H.264/AVC baseline-profile HD encoding and decoding at 1280×720, 30 fps.
[Figure 5a. SD encode and decode system: an FPGA buffers the camera video for a 1-GHz C6455 DSP acting as the H.264 SD encoder and a 720-MHz DM642 processor acting as the H.264 SD decoder, which drives a THS8200 video encoder through its video port; the processors share a PCI bus.]

[Figure 5b. HD encode and decode system: the same arrangement with a second 1-GHz C6455 DSP added for HD encoding, the DSPs linked by SRIO and fed from an HD camera.]
One of the first DSPs to reach the level of integration needed for HD encoding is TI's TMS320C6455 DSP, whose SRIO bus provides chip-level interconnect and processor-to-processor communication at up to 10 Gbits per second, full duplex.
SRIO makes multiprocessing architectures easier to implement, and using two or more C6455 DSPs on the same board assures that there are no computing bottlenecks. A board with ten C6455 DSPs clocked at 1 GHz and working in parallel achieves an aggregate 10 GHz of processing performance. The board can also be designed to support multiple I/O modules such as SRIO, HD-SDI and Camera Link.
As high-end video applications continue to demand higher channel densities, greater throughput and lower cost per channel, design teams must evaluate more diverse architectural options. Developers no longer need to team a DSP with an FPGA to obtain the memory bandwidth and hardware acceleration of an FPGA. Instead, they can employ a new generation of DSPs such as the C6455 DSP, which integrates several high-speed peripherals, the most important of which is SRIO. High-performance video encoder requirements can usually be met by deploying several DSPs on the same board and, since the chips all run the same embedded operating system and have been designed to work together, the chip-to-chip communications challenge is significantly reduced.
SPRY103 © 2007 Texas Instruments Incorporated

Important Notice: The products and services of Texas Instruments Incorporated and its subsidiaries described herein are sold subject to TI's standard terms and conditions of sale. Customers are advised to obtain the most current and complete information about TI products and services before placing orders. TI assumes no liability for applications assistance, customers' applications or product designs, software performance, or infringement of patents. The publication of information regarding any other company's products or services does not constitute TI's approval, warranty or endorsement thereof.

RapidIO is a registered trademark of the RapidIO Trade Association. All other trademarks are the property of their respective owners.