JPEG Decoding Accelerator
Matthew Button Kyle Park Velu Manohar Muhammad Khan Sahil Vemuri
mbutton kylepark velu mamankh sahilvnv
1. Abstract
JPEG is a widely used image compression standard, but the decompression process is computationally intensive, and its widespread usage motivates a dedicated module in a System-on-Chip (SoC) platform. This project introduces a hardware accelerator designed to efficiently perform JPEG decompression on a mobile System-on-Chip. The design integrates key stages of the decoding pipeline—including entropy decoding, coefficient reconstruction, inverse transforms, chroma upsampling, and color space conversion—into a pipelined, almost multiplier-free architecture. The accelerator was evaluated on a variety of JPEG images and demonstrated significant performance gains compared to software-based decoding, making it well-suited for embedded and mobile applications.

2. Introduction

JPEG is one of the most widely adopted standards for lossy image compression due to its ability to significantly reduce file sizes while maintaining acceptable visual quality. It is used extensively across mobile devices, digital cameras, web platforms, and embedded systems, making it a critical part of the global image processing pipeline [1]. While the compression process is usually performed offline or on high-performance servers, decompression must often happen in real time, especially on power-constrained devices such as smartphones, tablets, and Internet-of-Things (IoT) platforms.

JPEG decompression involves multiple computation-heavy stages, including entropy decoding, dequantization, inverse discrete cosine transform (IDCT), chroma upsampling, and color space conversion. These steps place a considerable load on general-purpose processors, especially in embedded contexts where performance, energy efficiency, and thermal budgets are tightly constrained.

To address these challenges, this work presents a hardware accelerator for JPEG decompression designed specifically for integration into mobile SoCs. The design implements a pipelined JPEG decoding architecture while maintaining low area and power overhead. Key components include:

• A high-throughput Huffman decoder that uses parallel bitmask matching and supports variable-length codewords,

• A multiplier-free 2D IDCT using Canonical Signed Digit (CSD) approximations and shift-add logic for efficient computation,

• A chroma supersampling module that outputs four upsampled blocks per cycle to match the resolution of luminance data, and

• A color conversion unit using CSD-based fixed-point arithmetic for real-time YCbCr to RGB transformation.

By implementing the full decoding pipeline in hardware, this accelerator enables faster image rendering and lower CPU usage, making it suitable for real-time video, camera preview, and image-intensive mobile applications. Experimental results demonstrate that the design achieves considerable speedup compared to software-based decoding such as MATLAB's imread() function. Figure 1 shows a high-level diagram of the JPEG encoding process, which the decoder performs the inverse of.

Figure 1. JPEG Encoding Process

3. Survey of Previous Related Work

3.1. GPU-Based JPEG Decoding Using CUDA

Tade and Ansari [1] present a CUDA-accelerated JPEG decoder aimed at improving the performance of the decompression pipeline on general-purpose GPUs. Their approach focuses primarily on offloading the inverse discrete
cosine transform (IDCT), one of the most computationally demanding stages in JPEG decoding. By leveraging CUDA's thread-level parallelism, they implement an 8×8 IDCT kernel that processes DCT blocks concurrently across hundreds of GPU threads. Their implementation uses floating-point arithmetic and applies a standard separable 2D IDCT method, executing row-wise and column-wise transforms sequentially. The authors report substantial speedups when decoding large images, especially when compared to software-based decoding on CPUs.

While their results demonstrate the effectiveness of using GPUs for accelerating JPEG decoding, their approach targets desktop-class computing environments with relatively abundant power and thermal budgets. This makes the solution less suitable for resource-constrained embedded or mobile platforms, where power efficiency and predictable latency are critical. Moreover, the use of floating-point operations and reliance on GPU memory hierarchies introduces complexity and energy overhead.

In contrast, our work implements a hardware JPEG decoding accelerator in Verilog, optimized for integration into mobile SoC architectures. Rather than relying on floating-point units or massive thread parallelism, our design uses fixed-point arithmetic and shift-add-based logic to approximate multiplications via CSD representations. Specifically, our 2D IDCT pipeline is based on a modified version of Loeffler's algorithm, which eliminates multipliers entirely in favor of hardware-friendly additions and shifts, reducing both area and power. Unlike the CUDA approach that treats each DCT block independently on a massively parallel GPU, our design is deeply pipelined—capable of accepting a new block every cycle after initial latency, making it more suitable for real-time processing in streaming multimedia systems.

3.2. Accelerating JPEG Decompression on GPUs

Weißenberger and Schmidt [2] presented a GPU-based JPEG decompression architecture that exploits fine-grained parallelism inherent in block-based image processing. Their work demonstrates the feasibility of high-throughput decompression by leveraging the massively parallel processing capabilities of modern GPUs. The resulting implementation significantly outperforms baseline CPU decoders and even specialized libraries like NVIDIA's nvJPEG, especially for high-resolution images.

While GPU acceleration provides impressive throughput, it is not always ideal in embedded or resource-constrained systems due to power and thermal limitations. As such, hardware-based acceleration using FPGAs or ASICs remains a compelling alternative when the need for multimedia processing is high, offering predictable latency and lower power consumption. This project aims to explore such an alternative by designing a JPEG decoding accelerator in Verilog, focusing on low-level parallelization of the IDCT and Huffman decoding stages.

3.3. An FPGA-based JPEG Preprocessing Accelerator for Image Classification

In contrast, FPGA-based accelerators offer a promising alternative for efficient JPEG decoding in resource-constrained environments. Li et al. [3] proposed an FPGA-based JPEG preprocessing accelerator aimed at improving the throughput and energy efficiency of image classification tasks. Their design focuses on accelerating non-convolutional operations, including JPEG decoding, image block splicing, and scaling, which are often bottlenecks in end-to-end image classification pipelines. By implementing these preprocessing steps on an FPGA, they achieved a throughput of 875.67 frames per second and an energy efficiency of 0.014 J/frame on a Xilinx XCZU7EV FPGA. When integrated with an Inception V3 accelerator, the end-to-end system demonstrated a 28.27× speedup over CPU-based implementations and a 2.32× improvement in energy efficiency compared to GPU-based systems.

These studies highlight the potential of hardware accelerators in enhancing JPEG decoding performance. As a result, our project aims to develop a Verilog-based JPEG decoding accelerator suited for mobile SoC platforms. By focusing on hardware-level optimizations, we aim to achieve real-time JPEG decoding with minimal power and area overhead, making it suitable for embedded and mobile applications.

3.4. Improved Loeffler-Based 2D DCT/IDCT Hardware Acceleration

Zhou and Pan [4] present a hardware accelerator for 2D 8×8 DCT/IDCT operations, utilizing an enhanced Loeffler architecture. Their design features an 8-stage pipeline that optimizes the data stream of the Loeffler 8-point 1D DCT/IDCT, tailored for image and video processing applications. By employing fixed-point arithmetic and Canonical Signed Digit (CSD) encoding, the architecture achieves a multiplication-free approximation of DCT coefficients using only adders and shifters. A notable innovation is their fast parallel transposed matrix architecture, which efficiently handles row-column coefficient conversions with reduced circuit complexity. Implemented on a Virtex-7 XC7VX330T FPGA, the accelerator operates at 288 MHz, achieving a throughput of 558 million pixels per second and processing Full HD frames at up to 269 frames per second. The design completes 2D DCT/IDCT operations on 8×8 blocks in just 33 clock cycles.
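The CSD recoding at the heart of such multiplier-free designs can be illustrated in software. CSD (equivalently, non-adjacent form) expresses a constant with digits {−1, 0, +1} such that no two adjacent digits are non-zero, minimizing the number of shift-add terms: for instance, round(256 · cos(π/8)) = 237 recodes as 256 − 16 − 4 + 1, four terms instead of an 8-bit multiply. The sketch below is our own illustration (function names are ours), not code from [4]:

```python
def csd_digits(n):
    """Recode a positive integer into CSD / non-adjacent form.

    Returns digits in {-1, 0, +1}, least-significant first, with
    no two adjacent non-zero digits.
    """
    digits = []
    while n != 0:
        if n % 2:
            d = 2 - (n % 4)  # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits

def shift_add_multiply(x, digits):
    """Multiply x by the recoded constant using only shifts and
    additions/subtractions, as an adder/shifter network would."""
    return sum(d * (x << i) for i, d in enumerate(digits))
```

For 237 the recoding yields non-zero digits at weights 2^0, −2^2, −2^4, and 2^8, so a hardware constant multiplier reduces to three adders.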
In our project, we adapt this multiplier-free approach for the 2D IDCT, leveraging CSD-based approximations and shift-add logic to eliminate the need for multipliers. However, our design diverges in several key aspects. While Zhou and Pan focus on a high-throughput solution suitable for high-resolution video processing, our implementation targets integration for low power consumption and minimal area overhead. Additionally, our architecture integrates the entire JPEG decoding pipeline—including entropy decoding, dequantization, chroma upsampling, and color space conversion—into a cohesive, low-latency system, while they only create an accelerator for the IDCT.

4. Description of Design

Each of the modules described in Figure 2 was implemented in Verilog. Below are the descriptions of the core modules:

4.1. Header Extraction

The encoded information of a JPEG is organized into sections that constitute its header. Two-byte markers indicate the start of a specific segment of data. These segments contain key information such as the image size, the subsampling method, the quantization coefficient tables, and the Huffman symbols and lengths. From the onset we designed our accelerator to be passed a pure bit-stream over an AXI (Advanced eXtensible Interface) bus. We selected AXI in particular for its ubiquity, particularly on FPGAs [5]. Hardware platforms with configurable FPGA modules could benefit from on-the-fly JPEG acceleration. Much of the header decoding is a serial operation, and the structure of the header itself does not lend itself easily to hardware processing. As a preprocessing step we utilize a Python script that converts a JPEG image into a SystemVerilog (.svh) array of 32-bit lines. Our system simulates the passing of the JPEG bit-stream in 32-bit (AXI-compatible) lines by walking this preprocessed array. True implementations would perform this with DMA transactions coordinated by the CPU.

Reading the segments presents some difficulty because the JPEG protocol guarantees only byte alignment, and there is a weak ordering of segments prior to the Start-of-Scan demarcation. Two-byte markers can appear in four possible slots of the input lines, or even cross the divide between two lines, creating offsets in the data processing that propagate as we read in these tables and parameters. These marked segments are also variable length. For example, after witnessing a 0xFFC4 marker, there could be one, two, or as many as four Huffman tables that follow. Two distinct images that contain four Huffman tables might use a single marker or up to four separate markers, requiring flexibility in our hardware implementation. We also attempt to maximize efficiency and push a full 32 bits of our eventual scan stream into the accelerator. However, we are slightly inhibited by scattered instances of 'bit stuffing' markers that require delaying until we can pass a full line into the subsequent FIFO block.

Because we selected a very specific baseline JPEG protocol, ITU-T T.81 (1992) / ISO/IEC 10918-1 [6], we were able to simplify the state machine significantly. Guarantees of note include:

• 8-bit color precision

• Sequential (one-pass) encoding

• Huffman-only codes (no arithmetic coding)

• A maximum of 2 AC and 2 DC tables

• A single SOS without restart markers

• 4:2:0 chroma sub-sampling

Our header decoder allows for multiple images to be passed continuously through the decoder. Tables are updated before a subsequent Start-of-Scan stream is pushed through the remaining modules. This presents an advantage for near-contiguous JPEG workloads, for example in streaming or computer vision applications.

Figure 2. JPEG Decoding System Block Diagram

4.2. Huffman Decoding

After the header is read, the symbols and lengths are passed through a Huffman module to generate codes. This operation involves bit shifts and adds and is very quick, as codes are constrained to at most 16 bits and there are 256 or fewer symbols. As the Start-of-Scan stream comes in from the FIFO, we examine 16 bits at a time using parallel look-ups against all Huffman codes loaded from the header. Each Huffman code has a corresponding length, and the decoder uses bit-masks to search for matches of different lengths against the current bit-stream prefix. Once a matching code is found, the decoder outputs the corresponding symbol from the Huffman table.
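The code-generation step described above is the canonical construction from Annex C of the JPEG standard [6]: the header supplies a count of codes per bit-length plus the symbol list, codes of each length are assigned consecutively, and the running code value doubles when moving to the next length. A behavioral sketch of table construction and prefix matching (function names are our own; the Verilog unrolls the length loop into parallel masked compares):

```python
def build_huffman_table(counts, symbols):
    """Canonical JPEG Huffman table (ITU-T T.81, Annex C).

    counts[i] : number of codes of bit-length i + 1, for lengths 1..16
    symbols   : symbols listed in order of increasing code length
    Returns a map {(code_value, code_length): symbol}.
    """
    table = {}
    code = 0
    k = 0
    for length in range(1, 17):
        for _ in range(counts[length - 1]):
            table[(code, length)] = symbols[k]
            k += 1
            code += 1
        code <<= 1  # codes of the next length are one bit longer
    return table

def decode_one(bits, table):
    """Match the shortest prefix of a bit string against the table,
    mirroring the decoder's compare over all candidate lengths."""
    for length in range(1, min(16, len(bits)) + 1):
        prefix = int(bits[:length], 2)
        if (prefix, length) in table:
            return table[(prefix, length)], length
    raise ValueError("no matching Huffman code")
```

Because canonical codes of a given length occupy a contiguous numeric range, the per-length comparison reduces in hardware to a bounds check on the masked 16-bit window.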
Every 8×8 pixel block (canonically deemed a minimum coded unit (MCU) in JPEG) starts with a DC term (intuited as the brightness of that MCU). This first term is delta-encoded relative to the preceding block's DC term and is handled with simple subtraction. AC terms (the subsequent block entries) use variable-length encoding (intuited as the spatial details of the JPEG). These AC terms each contain a run length (how many zeros precede the next non-zero value in the zig-zag scan) and a Variable Length Integer (VLI) size, which tells how many bits should be read next to form the actual value (amplitude) of the non-zero coefficient. The decoder uses this VLI size to fetch the correct number of bits from the input FIFO for the VLI decoder, which reconstructs the original quantized DCT coefficient. These coefficients are then stored into a 64-element buffer representing an 8x8 MCU.

4.3. 8x8 Block Buffer

The 8x8 block buffer functions as a circular FIFO that reconstructs a complete 64-coefficient block from run-length encoded JPEG data. First, it receives input from the Huffman decoder and VLI decoder, which provide a run-length and the corresponding coefficient value. Using a tail pointer, the buffer skips ahead by the run-length, effectively inserting that number of zeros into the output block. It then writes the decoded coefficient at the updated position. This process continues until either the buffer fills all 64 positions or an End of Block (EOB) symbol is received, which indicates that the remaining positions should be padded with zeros. Once either condition is met, the buffer outputs the full 8x8 coefficient block for dequantization.

4.4. Inverse Zig Zag

In JPEG encoding, the 64 DCT coefficients of an 8×8 block are arranged in a zig-zag order before compression. This ordering groups the low-frequency coefficients first (which carry most of the image's visual information) and places the high-frequency coefficients later, which are often zero after quantization. This pattern increases the effectiveness of run-length encoding (RLE) by clustering long runs of zeros together toward the end of the sequence. Consequently, during decoding, the 8x8 block needs to be "inverse zig-zagged" to reverse the ordering, restoring the coefficients to their original 8×8 spatial positions. A hardware module implements this using a lookup table where each address corresponds to a position in the 1D zig-zag input and outputs the correct 2D (row, column) index in the 8×8 block.

4.5. De-quantization

The dequantization stage restores the scale of the DCT coefficients that were previously compressed during JPEG encoding. Each coefficient in the reordered 8×8 block is multiplied by a corresponding quantization factor retrieved from the quantization table. These quantization values vary by frequency component, with lower-frequency coefficients typically receiving smaller weights to preserve more detail.

To maintain hardware efficiency, the dequantization module is implemented using fixed-point arithmetic, with all operations designed to avoid multipliers where possible. This is achieved by encoding quantization table values using Canonical Signed Digit (CSD) representations, lowering power consumption and circuit complexity.

The module processes all 64 coefficients in parallel over multiple cycles, feeding the scaled output into the subsequent IDCT stage. Special care is taken to ensure that the bit width of the dequantized values accommodates potential overflow while maintaining sufficient dynamic range to preserve image fidelity.

4.6. 2D Inverse Discrete Cosine Transform (IDCT)

To perform the 2D IDCT, an improved version of Loeffler's algorithm was used [4]. Loeffler's algorithm uses 29 additions and 11 multiplications. The improved version increases the number of pipelined stages from 4 to 8. Figure 3 shows the pipeline for the improved Loeffler's algorithm. In addition, the multipliers are replaced with Canonical Signed Digit approximations of constant terms such as cos(π/8), allowing these computations to be done combinationally using only adds and shifts. From the output of the 8x8 block in the dequantization stage, each row of 8 elements is fed into a 1D IDCT module using the improved Loeffler's algorithm, which requires 8 cycles to compute the output of the row. The row outputs are gathered in another 8x8 buffer arranged such that the 8 rows are transposed, and each row of that buffer is then fed into a second 1D IDCT, which computes the IDCT of each column. In total, an 8x8 input requires 33 clock cycles to compute. See Figure 4 for the 2D IDCT module pipeline.

Figure 3. 1D IDCT Pipeline using improved Loeffler's Algorithm
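As a behavioral reference for the row-column scheme above, the separable 2D IDCT can be modeled in a few lines of floating-point Python; the hardware replaces each cosine-constant multiply with a CSD shift-add network and pipelines the row and column passes (function names are ours, for illustration only):

```python
import math

def idct_1d(coeffs):
    """8-point 1D inverse DCT (DCT-III with orthonormal scaling)."""
    n = len(coeffs)
    out = []
    for x in range(n):
        s = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            s += scale * coeffs[k] * math.cos((2 * x + 1) * k * math.pi / (2 * n))
        out.append(s)
    return out

def idct_2d(block):
    """Separable 2D IDCT: 1D pass over rows, transpose, 1D pass over columns."""
    rows = [idct_1d(row) for row in block]
    cols = [idct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back
```

A block whose only non-zero coefficient is the DC term decodes to a flat block, which makes a convenient sanity check for the pipeline.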
Figure 4. 2D IDCT Pipeline

4.7. Chroma Supersampling

During the JPEG encoding process, the chroma components (Cb and Cr) are stored at half the resolution of the luminance component in both horizontal and vertical dimensions (see Figure 5). While decoding, the Cb and Cr need to be brought back to full resolution so they can be aligned pixel-by-pixel with the Y data for proper color reconstruction.

The module upsamples each 8×8 chroma block into four 8×8 blocks. The supersampled chroma data is output as four channels per cycle—one for each of the upsampled blocks. These outputs are collected in a buffer along with the corresponding Y blocks to form full-resolution YCbCr data for downstream color conversion.

Figure 5. 4:2:0 Chroma Subsampling Example

To improve the visual quality of the upsampled chroma components, we implemented a bilinear interpolation module that performs full-resolution interpolation across the entire 8×8 output grid. Unlike the nearest-neighbor approach, which simply replicates chroma values, this module calculates each output pixel by blending the four surrounding input pixels using bilinear weights derived from their relative positions. The implementation avoids costly multipliers by leveraging simple shift-and-add operations, ensuring it remains hardware-efficient while producing smoother, more natural color transitions in the final image.

4.8. Color Space Conversion

Once full-resolution YCbCr blocks are available, they are converted to the RGB color space using integer approximation formulas and CSD for final image reconstruction. Multiplications are implemented using shift-and-add operations, reducing the need for complex arithmetic units and maintaining hardware efficiency. This conversion enables the final RGB bitmap to be assembled and displayed.

5. Experimentation and Methodology

We tested our design using multiple JPEG images of different resolutions and compared the time to run MATLAB's imread() function on each image against the accelerator's simulated runtime at the post-synthesis clock period. Figure 6 shows a comparison of the decoded image using the accelerator and MATLAB.

Image       Dim.       Cycles   Hardware   MATLAB     Speed-   PSNR
                                Time (s)   Time (s)   up       (dB)
spider-man  256x256    19243    0.000173   0.01762    101.74   28.21
tiger       900x599    366535   0.00330    0.016998   5.15     26.59
cat         1200x734   249763   0.00225    0.026119   11.62    24.56
nebraska    1280x800   339360   0.00305    0.022645   7.41     28.73

Table 1. Runtime and PSNR comparison between hardware decoder and MATLAB baseline. Hardware time calculated using a 9 ns clock period.

Figure 6. Comparison between decoded image using accelerator (left) vs MATLAB imread() (right)

Metric            Value
Area              11,834.7 µm²
Total Power       623.9 µW
Clock Frequency   111.11 MHz

Table 2. Post-synthesis area, power, and clock frequency

6. Analysis of Results

Based on Table 1, the accelerator demonstrates significant speedup across all tested images. For smaller images, the speedup reaches over 100×, while for larger images the speedup remains substantial at approximately 7.5×. In terms of output quality, the accelerator delivers acceptable results, with PSNR values between roughly 24.5 dB and 28.7 dB, around the 28 dB level noted in [3] as sufficient for deep learning applications. This slight degradation in PSNR is expected, as our design minimizes the use of multipliers, relying instead on addition and shift operations throughout most of the pipeline, except during the dequantization stage.
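The PSNR figures in Table 1 follow the standard peak-signal-to-noise definition over 8-bit samples, computed against the MATLAB-decoded reference. A minimal sketch of the metric as commonly defined (our illustration, not the exact MATLAB script used):

```python
import math

def psnr(ref, test, peak=255):
    """PSNR in dB between two equally sized 8-bit images (nested lists)."""
    ref_px = [p for row in ref for p in row]
    test_px = [p for row in test for p in row]
    # Mean squared error over all pixels
    mse = sum((a - b) ** 2 for a, b in zip(ref_px, test_px)) / len(ref_px)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak * peak / mse)
```

Higher values indicate closer agreement with the reference; a uniform one-level error over every pixel corresponds to about 48 dB at 8-bit depth.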
Unlike prior JPEG accelerators such as [1], [2], and [3], which report performance in frames per second (FPS), we were unable to conduct such measurements due to timing constraints. However, our synthesis results in Table 2 provide insight into the accelerator's efficiency. Notably, when compared to [1], our design achieves faster decoding on a larger image. While they report a decode time of 11.72 ms for a 600×522 image, our accelerator processes a 900×599 image (Tiger) in just 3.3 ms.

7. Conclusion

We have presented a complete hardware JPEG decoding accelerator targeted at mobile System-on-Chip (SoC) platforms, where power, area, and latency constraints are particularly critical. The design integrates all major stages of the JPEG decompression pipeline—including entropy decoding, dequantization, 2D IDCT, chroma upsampling, and color space conversion—into a streamlined, pipelined architecture that avoids the use of multipliers where possible.

The evaluation demonstrates substantial performance gains over software-based decoding, with the accelerator achieving up to 100× speedup (Table 1) for small images and consistent improvements across a range of resolutions. Despite the use of approximate arithmetic for power and area efficiency, the design maintains image quality within acceptable bounds, with PSNR values suitable for visual applications and machine learning pipelines.

With a modest silicon footprint and low power consumption (Table 2), our implementation is well-suited for real-time image processing in embedded and mobile systems. Future work will focus on extending the architecture to support streaming video, optimizing memory bandwidth, and validating the system on FPGA and ASIC platforms.

8. Contributions

See Table 3 for each team member's contributions.

Name             Work Done                                                   %
Matthew Button   Huffman decoding, Table Extraction, VLI Decoding            20%
Kyle Park        IDCT pipeline support, Chroma Upsampling, RGB Conversion    20%
Velu Manohar     2D IDCT design, Testbench, 1D IDCT                          20%
Muhammad Khan    2D IDCT design, Testbench, PSNR analysis                    20%
Sahil Vemuri     Inverse Zig-Zag, De-quantization, MATLAB Decoder            20%

Table 3. Team Member Contributions and Percentage Split

References

[1] R. Tade and S. Ansari, "Acceleration of JPEG decoding process using CUDA," International Journal of Computer Applications, vol. 120, no. 9, pp. 1–5, 2015.

[2] A. Weißenberger and B. Schmidt, "Accelerating JPEG decompression on GPUs," pp. 121–130, 2021.

[3] T.-Y. Li, F. Zhang, W. Guo, J.-L. Shen, and M.-Q. Sun, "An FPGA-based JPEG preprocessing accelerator for image classification," The Journal of Engineering, vol. 2022, no. 9, pp. 919–927, 2022. [Online]. Available: https://ietresearch.onlinelibrary.wiley.com/doi/abs/10.1049/tje2.12174

[4] Z. Zhou and Z. Pan, "Effective hardware accelerator for 2D DCT/IDCT using improved Loeffler architecture," IEEE Access, vol. 10, pp. 101101–101111, 2022.

[5] R. Bhaktavatchalu, B. S. Rekha, G. A. Divya, and V. U. S. Jyothi, "Design of AXI bus interface modules on FPGA," in 2016 International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), 2016, pp. 141–146.

[6] International Telecommunication Union, "Digital compression and coding of continuous-tone still images: Requirements and guidelines," Tech. Rep. T.81, September 1992, also published as ISO/IEC 10918-1:1994. [Online]. Available: https://www.w3.org/Graphics/JPEG/itu-t81.pdf