Nanyang Technological University
Final Year Project
SCE10-0082
Implementation of Frequency Analysis Modules
for Audio Coding on GPU
Subrat M
Supervisor
Dr. Ian McLoughlin
Examiner
Dr. Bertil S
Submitted in Partial Fulfillment of the Requirements
for the Degree of Bachelor of Computer Science
of the Nanyang Technological University
April 2011
Abstract
Advanced Audio Coding (AAC) is the current MPEG standard for lossy compression of digital audio,
which provides better sound quality than MP3. It involves several computation-intensive stages, such as the filter bank and the psychoacoustic model. The discrete Fourier transform (DFT) is a commonly used tool for performing the frequency analysis in the psychoacoustic model, while the modified discrete cosine transform (MDCT) is used to realize the filter bank. Since the DFT and MDCT are computation intensive, it is important to implement them efficiently to achieve higher speed performance and lower power consumption.
The tremendous evolution of the GPU has given rise to a new computing paradigm called general purpose GPU computing (GPGPU) or simply GPU computing. GPGPU leverages the enormous
computing resources of GPU for data-parallel scientific and engineering applications. Under the GPU
computing model, the sequential component of a given application is executed by the CPU, while
the computation-intensive data-parallel component is executed concurrently by the multiple cores in
GPU. In order to facilitate the use of GPU as a parallel processor to run high-level programs without
using graphics oriented APIs (like GLSL, HLSL and Cg), GPGPU APIs (like CUDA, OpenCL and
DirectCompute) were developed. Such APIs have been used to successfully accelerate several scientific
computation and engineering applications with high computational requirement.
In this report we discuss the viability of implementing the DFT and MDCT algorithms on GPUs, since these are two of the relatively computation intensive functions in AAC. We have implemented the 2048-point FFT and MDCT on a GTX 460 GPU (GF104) with 1 GB of memory using the vendor-neutral OpenCL API. Parallelization was maximized by using a radix-2 DIT FFT and implementing the MDCT through the FFT. We achieved a 2 times speedup over FFTW for the FFT. In addition, we achieved a speedup of 32 times over the Xiph Vorbis implementation of the MDCT.
Since the IO time over PCI Express incurs a large cost, data transfers between the GPU and the host system have been minimized. Moreover, the accuracy of the implementation is found to be acceptable. The results of implementing the DFT and MDCT indicate that speed performance can be significantly enhanced by implementing these frequency transforms on GPUs.
Acknowledgements
I would like to thank my supervisor, Associate Professor Ian McLoughlin, for his valuable guidance and
suggestions from the inception to completion of this project. His guidance has been instrumental in
setting the direction of the work. I am thankful to him for helping me stay on the right track and for
keeping me focused on the project's objective. He has always been very approachable and his technical insights helped me whenever I had any doubts. His support in editing this thesis has significantly
improved its quality and understandability.
I would also like to thank my father for taking an interest in my work throughout the project. I am
grateful to him for patiently explaining to me some key concepts related to this work. I also wish to thank
my brother and mother for their unending moral support and encouragement.
Contents
1 Introduction ..................................... 1
   1.1 Background .................................. 1
   1.2 Objectives .................................. 2
   1.3 Overview of the Report ...................... 2
2 Perceptual Audio Compression ..................... 3
   2.1 Filter Bank ................................. 4
   2.2 Psychoacoustic Model ........................ 4
   2.3 Quantization ................................ 6
   2.4 Lossless Coding ............................. 6
3 GPU Programming using OpenCL ..................... 7
   3.1 GPU Architecture ............................ 7
   3.2 GPGPU APIs .................................. 8
   3.3 OpenCL Architecture ......................... 9
   3.4 OpenCL Execution Model ...................... 10
   3.5 Performance Maximization .................... 12
4 Fast Fourier Transform ........................... 14
   4.1 Theory ...................................... 14
   4.2 Implementation Details ...................... 15
   4.3 Accuracy .................................... 17
   4.4 Performance ................................. 21
5 Modified Discrete Cosine Transform ............... 24
   5.1 Theory ...................................... 25
   5.2 Implementation Details ...................... 30
   5.3 Accuracy .................................... 31
   5.4 Performance ................................. 35
6 Conclusion and Future Work ....................... 38
A FFT Output Plots ................................. 39
B MDCT Output Plots ................................ 46
List of Figures
2.1 Pipeline of a general perceptual audio encoder .......................... 3
2.2 Absolute Threshold of Hearing (ATH) in quiet of a human ear ............. 5
2.3 Simultaneous Frequency Masking .......................................... 6
3.1 GPU architecture ........................................................ 8
3.2 High Level OpenCL Architecture .......................................... 9
3.3 OpenCL memory model ..................................................... 10
3.4 OpenCL Work Distribution Model .......................................... 11
3.5 Bandwidths of OpenCL Memory Types ....................................... 12
4.1 Thread operations for N = 8 FFT ......................................... 15
4.2 Scaled error in FFT computed by GPU for a stream of zeroes .............. 18
4.3 Scaled error in FFT computed by GPU for a sequence of numbers ........... 18
4.4 Scaled error in FFT computed by GPU for a set of random numbers ......... 19
4.5 Scaled error in FFT computed by GPU for a stream of alternating bits .... 19
4.6 Scaled error in FFT computed by GPU for a sine wave ..................... 20
4.7 Scaled error in FFT computed by GPU for a white noise signal ............ 20
4.8 Execution time of FFT implementations ................................... 21
4.9 GPU IO times in FFT ..................................................... 22
4.10 Total timing information in FFT ........................................ 23
5.1 Overlap in adjacent blocks in MDCT computation .......................... 24
5.2 Scaled error in MDCT computed in GPU for a stream of zeroes ............. 32
5.3 Scaled error in MDCT computed in GPU for a sequence of numbers .......... 32
5.4 Scaled error in MDCT computed in GPU for a set of random numbers ........ 33
5.5 Scaled error in MDCT computed in GPU for an input of alternating bits ... 33
5.6 Scaled error in MDCT computed in GPU for a sine wave .................... 34
5.7 Scaled error in MDCT computed in GPU for white noise .................... 34
5.8 Execution time of OpenCL MDCT compared to that of Vorbis' MDCT .......... 35
5.9 Breakdown of time spent in performing MDCT on GPU ....................... 36
5.10 Total execution time of OpenCL MDCT compared to Vorbis' MDCT ........... 37
A.1 FFT of a stream of zeroes ............................................... 40
A.2 FFT of sequence of numbers 0 to 2047 .................................... 41
A.3 FFT of random numbers ................................................... 42
A.4 FFT of a stream of alternating 1's and 0's .............................. 43
A.5 FFT of 500 Hz sine wave sampled at 4 kHz ................................ 44
A.6 FFT of white noise sampled at 2 kHz ..................................... 45
B.1 MDCT of 2048 zero samples ............................................... 47
B.2 MDCT of sequence of numbers 0 to 2047 ................................... 48
B.3 MDCT of random numbers .................................................. 49
B.4 MDCT of alternating 1 and 0 bits ........................................ 50
B.5 MDCT of 0.5 kHz sine wave sampled at 4 kHz .............................. 51
B.6 MDCT of white noise sampled at 2 kHz .................................... 52
Chapter 1
Introduction
Audio compression refers to the process of representing digital audio data with fewer bits than its original representation. It can be done in a lossy or lossless manner. Lossless compression allows recovery of
the exact original signal from the compressed signal. Lossy compression on the other hand selectively
discards some frequency components and scales down other frequency components of audio data without perceivable loss of quality. Lossy compression is widely used for music players with limited storage
as it can achieve significantly higher compression ratio than its lossless counterpart. Advanced Audio
Coding (AAC) is the current standard for lossy compression of digital audio, which provides better
sound quality than MP3[4].
1.1 Background
Lossy audio compression algorithms determine the frequency components of audio data that are masked
and could be allocated fewer bits during quantization. Such algorithms are based on psychoacoustic
models (PAM) designed to take advantage of typical features and limitations of the human auditory
system. PAMs calculate the energy in various frequency bands to determine which of the frequency components are masked and can be quantized using fewer bits [20][19]. Therefore, time-to-frequency domain transformation of data is a vital function in lossy audio compression systems. Frequency analysis is the process of converting time domain data into the frequency domain. Apart from being used in audio processing [4], it is used in a diverse range of fields such as image processing, computer vision, seismography [6] and audio visualization [9]. The discrete Fourier transform (DFT) is a commonly used tool for frequency analysis and for determination of power spectral density, while the discrete cosine transform (DCT) is used to transform a time domain signal to the frequency domain for signal and image compression due to its high energy compaction property [21]. The modified discrete cosine transform (MDCT) is a lapped form of the DCT popularly used for the filter bank in audio codecs [22]. Since the DFT and MDCT are computation intensive algorithms, they are important targets for speed optimization.
A Graphics Processing Unit (GPU) consists of clusters of parallel processors. Today's GPUs contain between 48 and 512 processing cores. In recent years the enormous computing resources of the GPU have been utilized for high-performance scientific computing and several engineering applications, including image and video processing [23][13], where the sequential component of the algorithm is executed by the CPU, while the parallel components are processed concurrently by the GPU cores. GPU implementations of such general purpose applications have achieved 25 to 50 times speedup [1].
1.2 Objectives
This final year project is the first part of a bigger project to implement the computation intensive blocks of AAC on the GPU. In this report we discuss the viability of using GPUs for implementing the frequency transforms used in AAC. Specifically, we aim at efficient implementation of the DFT and MDCT algorithms, since these are relatively computation intensive functions in AAC and are fundamental steps in AAC encoding. To achieve this we have maximized the unfolding of the computation and accordingly maximized the number of threads in which the computation can be performed with the resources available in the GPU. We have considered the GPU implementation of the FFT and MDCT algorithms for a block length of N = 2048, since this block length is used in AAC. Apart from that, we investigate the speed performance and the effect of single precision floating point arithmetic on the accuracy of the implementation.
1.3 Overview of the Report
In the next chapter, we discuss the functions of the significant blocks of a general lossy audio encoder and emphasize the role of the FFT and MDCT in realizing the encoder.
In Chapter 3, we briefly describe the general organization of a GPU. Based on that, we discuss a general method for parallel computation on the GPU. Additionally, we discuss some design considerations for maximizing the performance of applications on GPUs.
Chapter 4 is devoted to the GPU implementation of the FFT. In this chapter, we compare our GPU implementation of the FFT with the current industry standard FFTW library.
Chapter 5 deals with the implementation of the MDCT. We show that the MDCT can be computed through the FFT, which enables reuse of the FFT program and thereby simplifies the MDCT implementation. We provide the necessary mathematical basis for implementing the MDCT through the FFT. Moreover, we discuss the accuracy and speed of our GPU implementation of the MDCT and compare them against those of the Xiph Vorbis implementation of the MDCT.
In Chapter 6, we present the conclusions along with the implications of the results obtained from
simulations. We also discuss the scope for future work.
Chapter 2
Perceptual Audio Compression
Perceptual audio compression is a method of compressing audio data which takes advantage of the limitations of human hearing. It is well known that the human auditory system can hear the sounds in the
range of 20 Hz to 20 kHz, with the upper limit falling to 16 kHz in most adults. Additionally, loud
tones of a certain frequency prevent us from hearing soft tones at nearby frequencies [20]. Perceptual coding systems exploit such limitations of the human auditory system by scaling down the tones that have less impact on the sound perceived by the human ear, thereby creating quantization noise in frequency bands where the noise is not audible. The block diagram of the major components of a general perceptual encoder is shown in Fig. 2.1 [18][3].
Figure 2.1: Pipeline of a general perceptual audio encoder
The samples of digital audio data are first decomposed into different frequency bands by a filter bank. The output of the filter bank is then quantized using a strategy that hides the quantization noise in frequency bins that are masked due to the nature of the human hearing apparatus. Psychoacoustic analysis is used to determine the level of noise in each frequency bin that can be masked. The quantized coefficients are then losslessly coded using Huffman encoding. In the following we briefly describe the function of each
of these operational blocks.
2.1 Filter Bank
Filter banks are widely used as a digital signal processing tool. The primary function of a filter bank is to decompose the input signal into a set of frequency bins. A filter bank may operate solely in the time domain, splitting a time domain signal into several time domain components corresponding to different frequency bands. Such a filter bank could be viewed as a set of band-pass filters with different passbands that separate out the desired set of sub-bands from the input signal. Another kind of filter bank converts a time domain signal to the frequency domain, thereby decomposing the signal into frequency components.
Frequency transforms such as Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT)
and Modified Discrete Cosine Transform (MDCT) are commonly used as filter banks for conversion
of time domain signal to frequency domain.
The MDCT is commonly used to implement the frequency domain filter bank. It is a DCT based lapped sinusoidal transform. Due to its lapped property, the MDCT is performed on overlapped blocks of input samples, where the first half of each block is overlapped by the last half of the preceding block. The overlapping of consecutive blocks, along with the energy compaction properties of the DCT, means that the MDCT can help in compression as well as in minimizing blocking artifacts. Due to these properties the MDCT filter bank is preferred for audio compression.
AAC and Vorbis are two audio codecs that use the MDCT as a filter bank. The other advantage of using the MDCT as a filter bank is that an N-point MDCT can be implemented using an N/4-length DFT together with some pre-processing and post-processing, and many optimized algorithms for low complexity implementation of the DFT are available [5][8].
2.2 Psychoacoustic Model
A psychoacoustic model is used to determine how the quantization noise should be distributed across
the frequency spectrum of the audio data. It distributes the quantization noise in such a way that the
noise is masked by the signal being encoded [3].
2.2.1 Absolute Threshold of Hearing and Masking
Human ears can perceive sound in the frequency range 20 Hz to 20 kHz. Moreover, the energy of
the sound should be above a minimum amount in order to be heard. The sound pressure level (SPL) provides a measure of the energy of a sound. The minimum SPLs at various frequencies for a sound to be audible are given by the Absolute Threshold of Hearing (ATH) curve (Fig. 2.2). Sound frequencies that fall below the curve are not audible to the human ear. Masking can be of two types: temporal masking and
frequency masking.
Figure 2.2: Absolute Threshold of Hearing (ATH) in quiet of a human ear [18].
Frequency Masking In frequency masking, a loud sound in a certain frequency band raises the hearing threshold in adjacent frequency bands. Thus sound components adjacent to the loud sound that fall below the new hearing threshold are not audible. Fig. 2.3 demonstrates the effect of frequency masking.
Temporal Masking In temporal masking, a soft tone immediately following a loud tone in the time domain is masked by the loud sound. This occurs because the loud sound increases the threshold of hearing for its constituent frequencies, and the increased thresholds subside over time instead of immediately upon the end of the loud sound.
These masking effects are exploited to hide the quantization noise in the masked frequency bands, thereby minimizing the effect of quantization noise on the perceived sound. As can be seen, a filter bank is essential to the operation of the psychoacoustic model. Due to its importance in a lossy encoder and its high computational complexity, the filter bank is a good target for optimization.
Figure 2.3: A masking tone alters the ATH curve, which masks a tone whose energy is less than the new threshold.
2.3 Quantization
Quantization is the process of scaling down certain frequency components which do not impact the quality of the reproduced sound. During quantization, the outputs of the filter bank are divided by a predetermined integer so as to represent them with a smaller number of bits, to satisfy the target bitrate/quality level. Since the number of bits available to encode the signal is finite and often less than the number of bits used to originally sample the signal, this step causes a loss of information. Consequently, the signal is distorted. Information from the perceptual model is used to determine the frequency bands where the distortion will be masked out or hidden by the signal. This is the only step that causes data loss in AAC.
2.4 Lossless Coding
Lossless coding is used to finally encode the quantized coefficients together with side information. It
is performed to represent the quantized audio data with a smaller number of bits. There are several methods for lossless coding. AAC uses Huffman coding, with codebooks that are defined by the specification of the standard, to encode the information.
Chapter 3
GPU Programming using OpenCL
A GPU typically contains hundreds of processing cores grouped under several clusters together with
their local memories and a common global memory [15]. It is a massively parallel, multithreaded and
programmable computing platform that is designed and optimized to accomplish fast processing and
display of 2-dimensional and 3-dimensional graphics, images and video data. Over the last two decades it has evolved from a single-processor graphics accelerator into a high-performance computing structure capable of teraflops of floating point operations, and has been popularly used to achieve real-time display and
visual interactions in desktop computers and multimedia/graphic terminals.
The tremendous evolution of the GPU has given rise to a new computing paradigm called general purpose GPU computing (GPGPU) or simply GPU computing. GPGPU leverages the enormous
computing resources of GPU for data-parallel scientific and engineering applications. Under the GPU
computing model, the sequential component of a given application is executed by the CPU (which
has good performance on branching code for control-intensive computation), while the computation-intensive data-parallel component is executed concurrently by the multiple cores in the
facilitate the use of GPU as a parallel processor to run high-level programs without using graphics oriented APIs (like GLSL, HLSL and Cg), GPGPU APIs (like CUDA, OpenCL and DirectCompute)
were developed. Such APIs have been used to successfully accelerate several applications with high
computation demands as detailed in [17].
3.1 GPU Architecture
A broad view of the first generation Fermi (GF100) architecture is shown in Fig. 3.1. It consists of 512
CUDA cores grouped into 16 streaming multiprocessors (SM). Each CUDA core can execute either a
floating point or an integer instruction in one clock period [15]. Each SM can therefore execute a group of 32 threads (called a warp in Nvidia literature) in parallel. Each SM has an instruction cache, a pair of warp schedulers and its own L1 cache, while the L2 cache is shared by all the SMs. Part of the 64 KB L1 cache is also used as the local memory of the SM. The main memory of the device has six
Figure 3.1: Architecture of a first generation Fermi GPU [15]. It is designed as a cluster of simple
execution units.
partitions, each supporting up to 1 GB of DRAM, allowing the device to support up to 6 GB of DRAM.
The GPU is connected to the host through the host interface over the PCI Express bus. The thread block
scheduler distributes thread blocks to SMs, while warp schedulers within each SM schedule warps for
execution.
3.2 GPGPU APIs
GPGPU can be accomplished using several APIs including CUDA, OpenCL and DirectCompute.
The CUDA API was developed by Nvidia for GPGPU programming solely on Nvidia cards. It was released to the public in February 2007 and can be programmed using several languages including Python,
Perl, Fortran, Java, Ruby and Lua. It is cross platform and is available on Linux, MS Windows and Mac
OS X.
DirectCompute is an API developed by Microsoft for GPGPU on the MS Windows Vista and Windows 7 operating systems. It can be used to program GPUs that support DirectX 10.
OpenCL is a vendor neutral API for general purpose computation on the GPU. The specification of the standard is decided by the Khronos consortium, which consists of members from several companies. Development of OpenCL was initiated by Apple and completed in collaboration with Nvidia, AMD, Intel and IBM. It is supported by GPUs from AMD, Nvidia and S3, by Intel and AMD CPUs, and by IBM Cell processors. Work is underway to automatically translate OpenCL programs into application specific FPGA designs [?]. It can be programmed using C, Python and Java. A WebCL working group within Khronos was formed in March 2011 to define a JavaScript binding to OpenCL, thus enabling web applications running in a browser to leverage the GPU's computational resources.
OpenCL was chosen for the GPU implementation of FFT and MDCT for this project because it
is not limited to any single GPU vendor and is not bound to any operating system.
    API             Operating System               GPU      CPU/FPGA   Specifying Body
                    Linux   MacOS X   Windows
    CUDA            Y       Y         Y            Nvidia   N          Nvidia
    OpenCL          Y       Y         Y            Any      Y          Industry consortium
    DirectCompute   N       N         Y            DX10     N          Microsoft

Table 3.1: Comparison of various GPGPU APIs
3.3 OpenCL Architecture
Figure 3.2: Communication between OpenCL devices and OpenCL hosts.
OpenCL defines a host system as a device that is connected to a number of OpenCL devices [14] (GPUs in the context of this project), as shown in Fig. 3.2. The host system in this project is a CPU together with system RAM. The host is responsible for acquiring the data to be processed, uploading the data to the device, submitting commands to the device to execute programs, and for downloading the results of the computation from the device.
Figure 3.3: The memory model of OpenCL defines private and local memories for each compute unit and cache-backed global and constant memories shared by all compute units. PE stands for processing element.
An OpenCL device consists of compute units which are composed of a certain number of processing
elements. Each compute unit has its own local memory accessible only to the processing elements in
that compute unit. Additionally, all the processing elements in a device have access to a common global
memory with high bandwidth but very high latency. Local memory on the other hand provides high
bandwidth and involves low latency. For example, on the first generation Fermi GPUs (GF100), local
memory bandwidth of a single compute unit is 73.6 GB/s and the latency is only a few cycles. This totals
up to 1.03 TB/s when the bandwidths of all the compute units are combined, while the global memory
bandwidth is 144 GB/s with a latency of 400-800 cycles.
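Since these capacities differ between devices, the host can query them at run time and size its buffers and work groups accordingly. The following is a small illustrative sketch using standard OpenCL queries (error handling omitted); it is not part of the project's code:

    #include <CL/cl.h>
    #include <stdio.h>

    /* Query the memory capacities and compute unit count of an OpenCL device. */
    void print_device_memory_info(cl_device_id dev)
    {
        cl_ulong local_mem = 0, global_mem = 0;
        cl_uint compute_units = 0;

        clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compute_units), &compute_units, NULL);

        printf("compute units: %u, local memory: %llu KB, global memory: %llu MB\n",
               compute_units,
               (unsigned long long)local_mem / 1024,
               (unsigned long long)global_mem / (1024 * 1024));
    }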
3.4 OpenCL Execution Model
The OpenCL execution model requires that the computation be expressed in terms of threads of execution. Each thread of execution is represented as a work item in OpenCL. Work items are grouped
into work groups with each work group containing the same number of work items. Each work group
is assigned to a compute unit. Work items in a work group are able to access the global memory and
only the local memory of the compute unit assigned to their work group. The set of instructions to be
Figure 3.4: The global work space is composed of work groups, which are collections of work items.
Here the global work size is 15 × 15 in the x and y dimensions. The work group size is 5 × 5.
performed by each work item is specified in an OpenCL kernel which is written using a C like language.
Performing a computation in OpenCL follows a combination of the following steps:
1. Host specifies the file containing the kernel source code.
2. The OpenCL runtime compiles the kernel source into machine code for the GPU.
3. Host specifies the total number of work items and the number of work items in each work group.
4. Host creates buffers on the GPU global memory for storing the input and output data of the
computation. The host also uploads the source data (to be used for the computation) onto the GPU
buffers from system memory.
5. Host instructs the GPU to begin computation.
6. At this point, threads are launched on the GPU that execute instructions specified in the kernel.
7. Once the computation is complete, the host downloads the computed results from the GPU
into system memory.
The GPU can perform the computation asynchronously, allowing the CPU to work on other tasks
while the GPU is executing the work items.
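As an illustrative sketch of how these steps map onto the OpenCL host API (the kernel name, buffer sizes and work group size below are assumptions, and error handling is omitted; the project's actual host code may differ):

    #include <CL/cl.h>

    /* Minimal OpenCL host-side flow: build the kernel, create buffers,
     * upload, launch, and download the results. */
    void run_kernel(cl_context ctx, cl_command_queue q, cl_device_id dev,
                    const char *src, const float *in, float *out, size_t n)
    {
        /* Steps 1-2: compile the kernel source for the device. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "my_kernel", NULL);   /* assumed name */

        /* Step 4: create device buffers and upload the input data. */
        cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
        cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);
        clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, n * sizeof(float), in, 0, NULL, NULL);

        /* Steps 3 and 5-6: set arguments and launch the work items.
         * Assumes n is a multiple of the work group size. */
        clSetKernelArg(k, 0, sizeof(cl_mem), &d_in);
        clSetKernelArg(k, 1, sizeof(cl_mem), &d_out);
        size_t global = n, local = 256;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);

        /* Step 7: download the results back into system memory. */
        clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, n * sizeof(float), out, 0, NULL, NULL);

        clReleaseMemObject(d_in);  clReleaseMemObject(d_out);
        clReleaseKernel(k);        clReleaseProgram(prog);
    }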
Figure 3.5: Bandwidths of the various memory types involved in an OpenCL execution run. The link between the GPU and the host is the slowest, while the fastest link is between the compute unit and local memory.
3.5 Performance Maximization
Memory Bandwidths: As discussed in Section 3.4, computation on the GPU requires data transfers
on different kinds of channels. These include the PCI Express (PCIe) bus for transfers between device and host, transfers between global memory and compute units (the global memory channel), and transfers between local memory and processing elements (the local memory channel). There is a large difference
between the bandwidths of these memory channels. As shown in Fig. 3.5, the PCIe x16 connection
between the GPU and the host is the slowest at 4 GB/s. Cumulative local memory bandwidth of all
compute units is the highest at 1.03 TB/s, with the global memory bandwidth (144 GB/s) about one-tenth of the local memory bandwidth.
Therefore, whenever possible, data transfers between GPU and host should be performed asynchronously so that the GPU can perform computations while the data transfers are ongoing. Local
memory should be used to cache a working copy of data due to the high speed and limited size. Bandwidth of host RAM is higher than that of PCIe x16 and therefore execution of the computation on the
CPU should be considered if the time spent in IO is significant compared to time spent in processing
data.
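One way to express the asynchronous transfers recommended above is sketched below, using a separate transfer queue and a non-blocking write; the kernel, buffers and sizes are placeholders and error checking is omitted:

    #include <CL/cl.h>

    /* Sketch: overlap PCIe transfers with kernel execution by issuing a
     * non-blocking write on an IO queue and making the kernel launch on a
     * separate compute queue wait on the transfer event. */
    void pipeline_block(cl_command_queue io_q, cl_command_queue compute_q,
                        cl_kernel k, cl_mem d_buf, const float *h_buf,
                        size_t bytes, size_t global_size)
    {
        cl_event uploaded;

        /* CL_FALSE: the call returns immediately; the transfer proceeds in
         * the background while kernels for earlier blocks keep running. */
        clEnqueueWriteBuffer(io_q, d_buf, CL_FALSE, 0, bytes, h_buf,
                             0, NULL, &uploaded);

        /* The kernel for this block starts only after its upload completes. */
        clSetKernelArg(k, 0, sizeof(cl_mem), &d_buf);
        clEnqueueNDRangeKernel(compute_q, k, 1, NULL, &global_size, NULL,
                               1, &uploaded, NULL);

        clReleaseEvent(uploaded);
    }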
Global Memory Access Latency: While global memory has a much higher bandwidth than the PCIe bus, each access to global memory involves a high latency (400-800 cycles on GF100). In order
to minimize the penalty incurred due to this latency, it is advisable to create a large number of work
items so that while a group of threads are waiting for data to arrive, other work items' instructions can
be processed.
Global Memory Coalescing: Coalescing is one of the best ways to minimize the latency incurred in
global memory transactions. It refers to the process of combining global memory access transactions
issued by threads of a half-warp¹ into one transaction, or two transactions in the case of 128-bit words.
This avoids the latency of multiple global memory transactions. It can be achieved by accessing
memory in regular or sequential patterns. A detailed discussion of this is available in Nvidia's whitepapers
[16] [15].
Local Memory Bank Conflicts: To allow parallel access by multiple threads, local memory is divided
into several memory banks that can be accessed in parallel. This way, the memory bandwidth is effectively multiplied by the number of banks available, as long as there are enough threads to access the
local memory through all the banks. Bank conflicts occur when more than one thread accesses memory
locations that map to the same bank. In such a situation, the memory accesses are serialized, negating
the benefit of having multiple banks. As an exception, when all threads in a half-warp access the same
local memory location, the bank operates in a broadcast configuration which requires no serialization
of requests.
Math Precision Tradeoffs: When applications do not require highly accurate evaluation of mathematical functions, faster evaluation of such functions in dedicated hardware can be used to speed up the
calculation. Examples of such functions are native_sin and native_cos. Additionally, OpenCL
also specifies low precision implementations of mathematical functions when precision can be traded
for speed.
¹ A warp is Nvidia terminology for a group of threads that are scheduled together for execution on a compute unit. It is called a "wavefront" by AMD. A warp consists of 32 threads on Fermi GPUs and a half-warp contains 16 threads.
Chapter 4
Fast Fourier Transform
4.1 Theory
The discrete Fourier transform (DFT) of an N-point sequence \{x_0, x_1, \ldots, x_{N-1}\} is given by

    X_k = \sum_{n=0}^{N-1} x_n \, e^{-i\frac{2\pi}{N}kn}, \qquad k = 0, 1, \ldots, N-1        (4.1)
The direct computation of an N-point DFT requires N^2 multiplications. To reduce the computational complexity, in 1965 Cooley and Tukey proposed the fast Fourier transform (FFT), which computes the DFT using N \log_2 N multiplications [7]. Several variations of the FFT have been proposed since then, but radix-2 algorithms are still the most popular due to their simplicity.
There are two basic variations of FFT algorithms: decimation in time (DIT) and decimation in frequency (DIF). The DIF algorithm recursively splits the DFT computation into two halves of even and odd indexed frequency components. The DIT radix-2 FFT, on the other hand, recursively splits the N-point DFT into two N/2-point DFTs of even and odd indexed time domain samples. We have implemented the DIT radix-2 FFT due to the simplicity of reordering through bit reversal.
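For reference, the structure just described (bit-reversal reordering followed by butterfly stages) can be written compactly on the CPU. The following is an illustrative single-threaded sketch for checking purposes only, not the GPU implementation discussed in the next section:

    #include <complex.h>
    #include <math.h>

    /* In-place iterative radix-2 DIT FFT of x[0..n-1]; n must be a power of two. */
    void fft_dit_radix2(float complex *x, unsigned n)
    {
        /* Reorder the input into bit-reversed index order. */
        for (unsigned i = 1, j = 0; i < n; i++) {
            unsigned bit = n >> 1;
            for (; j & bit; bit >>= 1)
                j ^= bit;
            j ^= bit;
            if (i < j) {
                float complex t = x[i];
                x[i] = x[j];
                x[j] = t;
            }
        }
        /* Butterfly stages with widths w = 2, 4, ..., n. */
        for (unsigned w = 2; w <= n; w <<= 1) {
            float complex wm = cexpf(-2.0f * 3.14159265358979f * I / (float)w);
            for (unsigned base = 0; base < n; base += w) {
                float complex tw = 1.0f;
                for (unsigned j = 0; j < w / 2; j++) {
                    float complex u = x[base + j];
                    float complex t = tw * x[base + j + w / 2];
                    x[base + j]         = u + t;
                    x[base + j + w / 2] = u - t;
                    tw *= wm;
                }
            }
        }
    }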
4.2 Implementation Details
Figure 4.1: Thread operations for N = 8 FFT. Each line style indicates work done by a single thread.
The computation of a radix-2 DIT FFT can be considered as a sequence of stages that are performed sequentially, with each stage computing FFT butterflies with varying indices for source and destination data. Our implementation of the FFT encodes the indices of source and destination data for the FFT butterflies as a function of the thread ID¹. The function for mapping thread IDs to butterfly data indices is as follows:

    i = \left\lfloor \frac{2t}{w} \right\rfloor \times w + \left( t \bmod \frac{w}{2} \right)        (4.2)

where i is the first index of the FFT butterfly, t is the thread ID and w is the butterfly width. The two indices of the FFT butterfly are given by i and i + \frac{w}{2}. The butterfly width w is a function of the stage number s and is given by w = 2^s.

Each FFT butterfly computes values at two indices, and the number of FFT butterflies remains the same in each stage. This allows us to launch N/2 threads for computing an FFT of length N. Under this scheme, each thread performs the calculations for one butterfly operation in each stage.

¹ Thread IDs are numeric identifiers for threads within a work group and range from 0 to thread_count − 1.
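As an illustration of this mapping (and not the project's actual kernel, which keeps each 2048-point FFT in local memory inside a single kernel and synchronizes stages with barriers), one stage could be expressed in OpenCL C roughly as follows; the buffer layout and host-side staging are assumptions:

    /* Illustrative OpenCL kernel: each work item computes one radix-2 DIT
     * butterfly of the given stage, using the index mapping of Eq. 4.2.
     * Data is stored as interleaved (re, im) float2 pairs; one NDRange launch
     * of N/2 work items per stage. */
    __kernel void fft_stage(__global float2 *data, const uint stage)
    {
        uint t = (uint)get_global_id(0);                 /* thread ID              */
        uint w = 1u << stage;                            /* butterfly width, 2^s   */
        uint i = ((2u * t) / w) * w + (t % (w / 2u));    /* Eq. 4.2                */
        uint j = i + w / 2u;

        /* Twiddle factor; native_cos/native_sin trade accuracy for speed. */
        float ang = -2.0f * M_PI_F * (float)(t % (w / 2u)) / (float)w;
        float2 tw = (float2)(native_cos(ang), native_sin(ang));

        float2 a = data[i];
        float2 b = data[j];
        float2 bt = (float2)(b.x * tw.x - b.y * tw.y,    /* complex multiply */
                             b.x * tw.y + b.y * tw.x);

        data[i] = a + bt;
        data[j] = a - bt;
    }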
Since the AAC standard requires that each frame is composed of 2048 samples of the source waveform with 50% overlap, an FFT for N = 2048 was implemented. To obtain the 50% overlap, 3072 samples of source audio data are used to form two blocks of length 2048 each. The first block contains
samples from indices 0, 1, . . . , 2047 and the second block contains samples 1024, 1025, . . . , 3071.
Each FFT of N = 2048 is calculated by a compute unit entirely in local memory. Before the
butterfly stages are executed, the samples are sorted according to positions calculated by a bit reversal
algorithm.
The trigonometric calculations for the twiddle factors were done using the hardware implementations of native_sin and native_cos.
4.3 Accuracy
The GPU implementation of the FFT was tested for accuracy by comparing the coefficients computed on
GPU with the output of FFTW for the following six bitstreams:
1. 500Hz sine wave sampled at 4000Hz. Amplitude: 2.
2. Zero signal. Amplitude: 0.
3. Random numbers. Amplitude: 2 × 109 .
4. Alternating 1 and 0 bits. Amplitude: 2 × 109 .
5. White noise sampled at 2000Hz. Amplitude: 2.
6. Arithmetic progression of numbers. Amplitude: 2047.
The following table summarizes the errors in the GPU calculation of the FFT compared to the reference implementation:

    Signal            (1/A) Σ (x_i − y_i)    (1/A) Σ |x_i − y_i|    (1/A) Σ (x_i − y_i)^2
    Zero              0.0                    0.0                    0.0
    Seq               7.484 × 10^−5          4.898 × 10^−3          1.474 × 10^−4
    Random            2.240 × 10^−4          2.409 × 10^−2          2.810 × 10^2
    Sine              −6.1 × 10^−5           1.53 × 10^−4           6.3967 × 10^−9
    White Noise       1.325 × 10^−4          1.985 × 10^−2          1.6311 × 10^−7
    Alternating 1/0   0.0                    0.0                    0.0

Table 4.1: Summary of errors for the FFT for various input signals. x_i and y_i are coefficients calculated by the GPU implementation and FFTW respectively. A is the average amplitude of the signal.
The ratio of error to the correct value ranges from 10^−2 to 10^−5, with the exception of the random input. The random input results in high error because the input is not normalized. Non-normalized input can cause calculations to go out of the representable range of single precision floats, which results in errors.
The following plots show the ratio of error to the correct value for the different test streams.
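For clarity, the three scaled error measures used in Table 4.1 (and later in Table 5.1) can be computed as in the following sketch, where x and y are the GPU and reference coefficient arrays and A is the average signal amplitude:

    #include <stddef.h>
    #include <math.h>

    /* Error measures of Tables 4.1 and 5.1, each scaled by the average signal
     * amplitude A: (1/A)·sum(d), (1/A)·sum(|d|) and (1/A)·sum(d^2), d = x[i] - y[i]. */
    void scaled_error_sums(const float *x, const float *y, size_t n, double A,
                           double *err, double *abs_err, double *sq_err)
    {
        double s = 0.0, sa = 0.0, sq = 0.0;
        for (size_t i = 0; i < n; i++) {
            double d = (double)x[i] - (double)y[i];
            s  += d;
            sa += fabs(d);
            sq += d * d;
        }
        *err = s / A;  *abs_err = sa / A;  *sq_err = sq / A;
    }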
Figure 4.2: Scaled error in FFT computed by GPU for a stream of zeroes, showing no errors.

Figure 4.3: Scaled error in FFT computed by GPU for a sequence of numbers 0..2047, showing no errors in the real component and a small error in the complex component.

Figure 4.4: Scaled error in FFT computed by GPU for a set of random numbers, showing that the errors are distributed randomly. The test vector is not normalized, to show differences between the GPU and CPU on non-normalized input.

Figure 4.5: Scaled error in FFT computed by GPU for a stream of alternating 1 and 0 bits, showing no errors.

Figure 4.6: Scaled error in FFT computed by GPU for a 0.5 kHz sine wave sampled at 4 kHz, showing no errors.

Figure 4.7: Scaled error in FFT computed by GPU for a white noise signal, showing that the errors are randomly distributed across the spectrum.
4.4 Performance
The speed of the OpenCL implementation of the FFT was tested on an Nvidia GTX 460 with 1 GB of global memory and 64 KB of memory shared between local memory and cache. The reference implementation was FFTW running on a 3.4 GHz Intel Core 2 X9650 with 6 GB of RAM.
We discuss the speed performance of the two implementations and then look at the cost incurred in
transferring data to and from the GPU.
4.4.1 Execution Time
Figure 4.8: Execution time of FFT implementations.
The execution time of our GPU implementation is nearly one-third that of an industry standard CPU FFT implementation. The FLOPS performance, calculated according to the FFTW benchmark methodology², is 110 GFLOPS. This is significantly better than the performance of FFTW on a 3 GHz Core Duo (14 GFLOPS) or of the Cell processor on an IBM QS20 Cell Blade (22.5 GFLOPS) [10].
² http://www.fftw.org/speed/method.html
4.4.2 Effect of IO
This section discusses the cost of transferring data between the device global memory and the host system
RAM.
Figure 4.9: Time spent in uploading data to the GPU and downloading results from the GPU, compared to the execution time.
The above graph shows that the time spent in executing the FFT kernel is about one-fifth of the time spent in transferring data between the GPU and system RAM. When factoring the IO time into a comparison with FFTW timing, FFTW's time is half that of the GPU execution plus IO time.
Figure 4.10: Total GPU time (FFT execution plus data transfers between GPU and system RAM) compared to total FFTW time.
Chapter 5
Modified Discrete Cosine Transform
The MDCT filter bank is used in AAC due to its energy compaction property, similar to that of the DCT, and its potential for reducing blocking artifacts due to the overlapping of input blocks [24]. The lapped computation of the MDCT is shown in Fig. 5.1.
Figure 5.1: Samples are processed by the MDCT module in blocks of length 2048 with 50% overlap.
As shown in the figure, for each block of N audio samples an (N/2)-point MDCT is computed such that adjacent pairs of input blocks overlap each other by N/2 samples. There are several algorithms in the literature for the implementation of the MDCT. We chose to implement it through the FFT for two reasons. Firstly, it allows us to exploit the efficient parallel implementation of the FFT on the GPU. Secondly, since the FFT is also used in the psychoacoustic model for the estimation of energy in different frequencies, the basic FFT program modules could be reused for the MDCT computation.
5.1 Theory
In this section we show that the MDCT can be implemented by FFT with some pre-processing and
post-processing, as discussed in [11]. The MDCT X_k of a sequence x_n is given by:

    X_k = \sum_{n=0}^{N-1} x_n \cos\left( \frac{\pi}{2N} \left( 2n + 1 + \frac{N}{2} \right)(2k+1) \right), \qquad 0 \le k \le \frac{N}{2} - 1        (5.1)
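For reference, Eq. 5.1 can also be evaluated directly; a brute force implementation of this kind is what the GPU results are checked against in Section 5.3. A minimal O(N^2) sketch, for checking only:

    #include <math.h>

    /* Direct evaluation of Eq. 5.1: N-point MDCT producing N/2 coefficients. */
    void mdct_direct(const float *x, float *X, unsigned N)
    {
        const double pi = 3.14159265358979323846;
        for (unsigned k = 0; k < N / 2; k++) {
            double acc = 0.0;
            for (unsigned n = 0; n < N; n++)
                acc += x[n] * cos(pi / (2.0 * N) * (2.0 * n + 1.0 + N / 2.0) * (2.0 * k + 1.0));
            X[k] = (float)acc;
        }
    }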
The MDCT can be implemented using an N/4-length FFT, as discussed in [8]. The following is an explanation of how the MDCT computation is converted into an FFT.
5.1.1 Rotation
If the samples x_n are rotated by -\frac{N}{4} places, we can replace n with n - \frac{N}{4}. Eq. 5.1 then becomes:

    X_k = \sum_{n=N/4}^{5N/4-1} x_{n-\frac{N}{4}} \cos\left( \frac{\pi}{2N} \left( 2\left(n - \frac{N}{4}\right) + 1 + \frac{N}{2} \right)(2k+1) \right)
        = \sum_{n=N/4}^{5N/4-1} x_{n-\frac{N}{4}} \cos\left( \frac{\pi}{2N} (2n+1)(2k+1) \right)
        = \sum_{n=N/4}^{N-1} x_{n-\frac{N}{4}} \cos\left( \frac{\pi}{2N} (2n+1)(2k+1) \right) + \sum_{n=N}^{5N/4-1} x_{n-\frac{N}{4}} \cos\left( \frac{\pi}{2N} (2n+1)(2k+1) \right)
        = \sum_{n=N/4}^{N-1} x_{n-\frac{N}{4}} \cos\left( \frac{\pi}{2N} (2n+1)(2k+1) \right) + \sum_{n=0}^{N/4-1} x_{n+\frac{3N}{4}} \cos\left( \frac{\pi}{2N} (2(n+N)+1)(2k+1) \right)
        = \sum_{n=N/4}^{N-1} x_{n-\frac{N}{4}} \cos\left( \frac{\pi}{2N} (2n+1)(2k+1) \right) - \sum_{n=0}^{N/4-1} x_{n+\frac{3N}{4}} \cos\left( \frac{\pi}{2N} (2n+1)(2k+1) \right)        (5.2)

Defining x'_n as:

    x'_n = \begin{cases} -x_{n+\frac{3N}{4}} & \text{if } 0 \le n \le \frac{N}{4} - 1 \\ \; x_{n-\frac{N}{4}} & \text{if } \frac{N}{4} \le n \le N - 1 \end{cases}

Eq. 5.2 can be rewritten as:

    X_k = \sum_{n=0}^{N-1} x'_n \cos\left( \frac{\pi}{2N} (2n+1)(2k+1) \right)        (5.3)
5.1.2 Even-Odd Redundancy
Assume k' is defined as:

    k' = N - k - 1        (5.4)

Using Eq. 5.4, Eq. 5.3 can be expressed as:

    X_k = \sum_{n=0}^{N-1} x'_n \cos\left( \frac{\pi}{2N}(2n+1)\left(2(N-k'-1)+1\right) \right)
        = \sum_{n=0}^{N-1} x'_n \cos\left( \frac{\pi}{2N}(2n+1)(2N-2k'-1) \right)
        = \sum_{n=0}^{N-1} x'_n \cos\left( \frac{\pi}{2N}(2n+1)(-2k'-1) + \pi(2n+1) \right)
        = -\sum_{n=0}^{N-1} x'_n \cos\left( \frac{\pi}{2N}(2n+1)(2k'+1) \right) = -X_{k'}        (5.5)

This shows that X_k is equal to -X_{k'} for even k' and odd k. Thus:

    X_{2k+1} = -X_{N-2k-2}, \qquad 0 \le k \le \frac{N}{2} - 1        (5.6)

So we need to compute only the even indexed coefficients, i.e. X_{2k} for k = 0, 1, \ldots, \frac{N}{2} - 1.
5.1.3 Reductions using cos redundancy
Since the even indexed components can be used to derive the odd ones, replace X_k with X_{2k} in Eq. 5.3:

    X_{2k} = \sum_{n=0}^{N-1} x'_n \cos\left( \frac{\pi}{2N}(2n+1)(4k+1) \right)
           = \sum_{n=0}^{N/2-1} \left\{ x'_n \cos\left( \frac{\pi}{2N}(2n+1)(4k+1) \right) + x'_{N-n-1} \cos\left( \frac{\pi}{2N}(2N-2n-1)(4k+1) \right) \right\}
           = \sum_{n=0}^{N/2-1} \left\{ x'_n \cos\left( \frac{\pi}{2N}(2n+1)(4k+1) \right) + x'_{N-n-1} \cos\left( \frac{\pi}{2N}(-2n-1)(4k+1) + \pi(4k+1) \right) \right\}
           = \sum_{n=0}^{N/2-1} \left\{ x'_n \cos\left( \frac{\pi}{2N}(2n+1)(4k+1) \right) - x'_{N-n-1} \cos\left( \frac{\pi}{2N}(2n+1)(4k+1) \right) \right\}        (5.7)
           = \sum_{n=0}^{N/2-1} \left( x'_n - x'_{N-n-1} \right) \cos\left( \frac{\pi}{2N}(2n+1)(4k+1) \right)        (5.8)

This reduction is possible due to the redundancy of cos over \pi in Eq. 5.7. A similar reduction is possible due to the relation \cos(\frac{\pi}{2} - \theta) = \sin(\theta), as shown below:

    X_{2k} = \sum_{n=0}^{N/4-1} \left\{ \left( x'_{2n} - x'_{N-2n-1} \right) \cos\left( \frac{\pi}{2N}(4n+1)(4k+1) \right) + \left( x'_{\frac{N}{2}-2n-1} - x'_{N-(\frac{N}{2}-2n-1)-1} \right) \cos\left( \frac{\pi}{2N}\left(2\left(\tfrac{N}{2}-2n-1\right)+1\right)(4k+1) \right) \right\}
           = \sum_{n=0}^{N/4-1} \left\{ \left( x'_{2n} - x'_{N-2n-1} \right) \cos\left( \frac{\pi}{2N}(4n+1)(4k+1) \right) + \left( x'_{\frac{N}{2}-2n-1} - x'_{\frac{N}{2}+2n} \right) \cos\left( \frac{\pi}{2N}(N-4n-1)(4k+1) \right) \right\}
           = \sum_{n=0}^{N/4-1} \left\{ \left( x'_{2n} - x'_{N-2n-1} \right) \cos\left( \frac{\pi}{2N}(4n+1)(4k+1) \right) + \left( x'_{\frac{N}{2}-2n-1} - x'_{\frac{N}{2}+2n} \right) \cos\left( \frac{\pi}{2}(4k+1) - \frac{\pi}{2N}(4n+1)(4k+1) \right) \right\}
           = \sum_{n=0}^{N/4-1} \left\{ \left( x'_{2n} - x'_{N-2n-1} \right) \cos\left( \frac{\pi}{2N}(4n+1)(4k+1) \right) - \left( x'_{\frac{N}{2}+2n} - x'_{\frac{N}{2}-2n-1} \right) \sin\left( \frac{\pi}{2N}(4n+1)(4k+1) \right) \right\}        (5.9)

    X_{2k} = \Re\left[ \sum_{n=0}^{N/4-1} \left\{ \left( x'_{2n} - x'_{N-2n-1} \right) - i\left( x'_{\frac{N}{2}+2n} - x'_{\frac{N}{2}-2n-1} \right) \right\} e^{-i\frac{\pi}{2N}(4n+1)(4k+1)} \right]        (5.10)

To prove that Eq. 5.9 is equivalent to Eq. 5.10, consider the following:

    \Re\left[ (x - iy)(\cos\theta - i\sin\theta) \right] = x\cos\theta - y\sin\theta

Here the LHS has the form of Eq. 5.10 and the RHS has the form of Eq. 5.9.
5.1.4 Redundancy Elimination in Even Frequency Components
From Eq. 5.6, we know that odd frequency components can be calculated from the even ones. We now
establish a relation between the first and second halves of X2k . We observe the following:
    \exp\left( i\frac{\pi}{2N}(4n+1)\left(4\left(k+\tfrac{N}{4}\right)+1\right) \right)
      = \exp\left( i\frac{\pi}{2N}(4n+1)(4k+N+1) \right)
      = \exp\left( i\frac{\pi}{2}(4n+1) + i\frac{\pi}{2N}(4n+1)(4k+1) \right)
      = \exp\left( i\frac{\pi}{2}(4n+1) \right) \cdot \exp\left( i\frac{\pi}{2N}(4n+1)(4k+1) \right)
      = \exp\left( i\frac{\pi}{2} \right) \cdot \exp\left( i\frac{\pi}{2N}(4n+1)(4k+1) \right)
      = i\,\exp\left( i\frac{\pi}{2N}(4n+1)(4k+1) \right)        (5.11)

Using Eq. 5.11 and Eq. 5.10, we can deduce:

    X_{2k+\frac{N}{2}} = \Re\left[ \sum_{n=0}^{N/4-1} \left\{ \left( x'_{2n} - x'_{N-2n-1} \right) - i\left( x'_{\frac{N}{2}+2n} - x'_{\frac{N}{2}-2n-1} \right) \right\} \, i \, e^{-i\frac{\pi}{2N}(4n+1)(4k+1)} \right]        (5.12)

This is of the form \Re\left[ (a - ib)\, i\, e^{-i\theta} \right], which can be transformed as follows:

    \Re\left[ (a - ib)\, i\, e^{-i\theta} \right] = \Re\left[ (ia + b)\, e^{-i\theta} \right]
      = \Re\left[ (ia + b)(\cos\theta - i\sin\theta) \right]
      = b\cos\theta + a\sin\theta        (5.13)

We now consider \Im\left[ (a - ib)\, e^{-i\theta} \right]:

    \Im\left[ (a - ib)\, e^{-i\theta} \right] = \Im\left[ (a - ib)(\cos\theta - i\sin\theta) \right]
      = -a\sin\theta - b\cos\theta
      = -\Re\left[ (a - ib)\, i\, e^{-i\theta} \right]        (5.14)

Using Eq. 5.14 and Eq. 5.11, we can conclude that:

    X_{2k+\frac{N}{2}} = \Re\left[ \sum_{n=0}^{N/4-1} \left\{ \left( x'_{2n} - x'_{N-2n-1} \right) - i\left( x'_{\frac{N}{2}+2n} - x'_{\frac{N}{2}-2n-1} \right) \right\} \, i \, e^{-i\frac{\pi}{2N}(4n+1)(4k+1)} \right]
                       = -\Im\left[ \sum_{n=0}^{N/4-1} \left\{ \left( x'_{2n} - x'_{N-2n-1} \right) - i\left( x'_{\frac{N}{2}+2n} - x'_{\frac{N}{2}-2n-1} \right) \right\} e^{-i\frac{\pi}{2N}(4n+1)(4k+1)} \right]        (5.15)
5.1.5 Representation of Computation using DFT
We can now express the MDCT as a DFT of length N/4. For simplicity of explanation, we rewrite the complex quantity in Eq. 5.15 as follows:

    \hat{X}_k = \sum_{n=0}^{N/4-1} \hat{x}_n \, e^{-i\frac{\pi}{2N}(4n+1)(4k+1)}        (5.16)

where

    \hat{x}_n = \left( x'_{2n} - x'_{N-2n-1} \right) - i\left( x'_{\frac{N}{2}+2n} - x'_{\frac{N}{2}-2n-1} \right)

Eq. 5.16 can be simplified and represented by a DFT as follows:

    \hat{X}_k = \sum_{n=0}^{N/4-1} \hat{x}_n \, e^{-i\frac{\pi}{2N}(4n+1)(4k+1)}
              = \sum_{n=0}^{N/4-1} \hat{x}_n \, e^{-i\frac{8\pi}{N}\left( kn + \frac{1}{16} + \frac{k}{4} + \frac{n}{4} \right)}
              = \sum_{n=0}^{N/4-1} \hat{x}_n \, e^{-i\frac{8\pi}{N}\left( kn + \frac{1}{32} + \frac{k}{4} + \frac{1}{32} + \frac{n}{4} \right)}
              = \sum_{n=0}^{N/4-1} \hat{x}_n \, e^{-i\frac{8\pi kn}{N}} \cdot e^{-i\frac{8\pi}{N}\left( \frac{n}{4} + \frac{1}{32} \right)} \cdot e^{-i\frac{8\pi}{N}\left( \frac{k}{4} + \frac{1}{32} \right)}
              = e^{-i\frac{2\pi}{N}\left( k + \frac{1}{8} \right)} \sum_{n=0}^{N/4-1} \left[ \hat{x}_n \, e^{-i\frac{2\pi}{N}\left( n + \frac{1}{8} \right)} \right] e^{-i\frac{2\pi}{N/4} kn}

This is a DFT of \hat{x}_n e^{-i\frac{2\pi}{N}(n+\frac{1}{8})} that is multiplied by e^{-i\frac{2\pi}{N}(k+\frac{1}{8})}. Using Eq. 5.10 and Eq. 5.15, we can conclude that:

    \Re\left( \hat{X}_k \right) = X_{2k}, \qquad 0 \le k \le \frac{N}{4} - 1        (5.17)

    \Im\left( \hat{X}_k \right) = -X_{2k+\frac{N}{2}}, \qquad 0 \le k \le \frac{N}{4} - 1        (5.18)

Using the even-odd redundancy derived in Eq. 5.6 and the relation between the two halves of the even frequency components in Eq. 5.18, we can show that the imaginary components of the DFT give the odd indexed MDCT components. Starting from

    X_{2k+1} = -X_{N-2k-2}, \qquad k = 0, 1, 2, \ldots, \frac{N}{4} - 1

and substituting k = \frac{N}{4} - k' - 1:

    X_{2(\frac{N}{4}-k'-1)+1} = -X_{N-2(\frac{N}{4}-k'-1)-2}
    X_{\frac{N}{2}-2k'-1} = -X_{\frac{N}{2}+2k'}
    X_{\frac{N}{2}-2k-1} = -X_{\frac{N}{2}+2k}        (5.19)

This shows that the odd components of the MDCT can be derived from the imaginary part of the FFT and the even components from the real part.
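Putting the rotation, packing, twiddles and N/4-point FFT together, the derivation can be exercised on the CPU with the sketch below. It reuses the reference FFT sketch from Chapter 4 and is only a check of the mathematics, not the GPU kernel. Because sign conventions are easy to flip in this derivation, the sign of the odd-indexed outputs below was matched against a direct evaluation of Eq. 5.1 for small N, and should be re-verified against the brute force reference in any real use:

    #include <complex.h>
    #include <math.h>

    /* Reference FFT sketched in Section 4.1. */
    void fft_dit_radix2(float complex *x, unsigned n);

    /* x'_m as defined after Eq. 5.2. */
    static float xprime(const float *x, unsigned N, unsigned m)
    {
        return (m < N / 4) ? -x[m + 3 * N / 4] : x[m - N / 4];
    }

    /* N-point MDCT (N/2 outputs) of x[0..N-1] into X[0..N/2-1],
     * computed through an N/4-point FFT. N is a power of two, N <= 2048. */
    void mdct_via_fft(const float *x, float *X, unsigned N)
    {
        const unsigned Q = N / 4;
        const float pi = 3.14159265358979f;
        float complex buf[512];                 /* Q <= 512 for N <= 2048 */

        /* Pre-processing: pack x^_n (Eq. 5.16) and apply e^{-i 2*pi/N (n + 1/8)}. */
        for (unsigned n = 0; n < Q; n++) {
            float re = xprime(x, N, 2 * n)         - xprime(x, N, N - 2 * n - 1);
            float im = xprime(x, N, N / 2 + 2 * n) - xprime(x, N, N / 2 - 2 * n - 1);
            float complex pre = cexpf(-I * 2.0f * pi / (float)N * ((float)n + 0.125f));
            buf[n] = (re - I * im) * pre;
        }

        fft_dit_radix2(buf, Q);                 /* N/4-point FFT */

        /* Post-processing: apply e^{-i 2*pi/N (k + 1/8)}, then assemble the
         * even and odd MDCT coefficients (Eqs. 5.17-5.19). The sign of the
         * odd outputs was checked against mdct_direct() for small N. */
        for (unsigned k = 0; k < Q; k++) {
            float complex v = buf[k] * cexpf(-I * 2.0f * pi / (float)N * ((float)k + 0.125f));
            X[2 * k]             =  crealf(v);
            X[N / 2 - 1 - 2 * k] = -cimagf(v);
        }
    }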
5.2 Implementation Details
The MDCT was implemented using the FFT as discussed in Section 5.1. Since the AAC standard requires a 2048-point MDCT, we need to compute an N/4-point (i.e. 512-point) FFT. Since the calculation of an FFT requires N/2 threads, we use 256 threads for the computation of the 2048-point MDCT.
Local memory buffers were reused as much as possible to allow implementation on devices with less local memory. The GPU implementation uses 12288 bytes of local memory.
5.3 Accuracy
Accuracy of the MDCT transform implemented on GPU was tested by running the transform on six
different test streams and comparing the MDCT coefficients computed by GPU with those of a simple
brute force implementation. The following test streams were used:
1. 500Hz sine wave sampled at 4000Hz. Amplitude: 2.
2. Zero signal. Amplitude: 0.
3. Random numbers. Amplitude: 2 × 109 .
4. Alternating 1 and 0 bits. Amplitude: 2 × 109 .
5. White noise sampled at 2000Hz. Amplitude: 2.
6. Arithmetic progression of numbers. Amplitude: 2047.
The following table summarizes the errors in the GPU calculation of the MDCT compared to the reference implementation:

    Signal            (1/A) Σ (x_i − y_i)    (1/A) Σ |x_i − y_i|    (1/A) Σ (x_i − y_i)^2
    Zero              0.0                    0.0                    0.0
    Sequence          1.988 × 10^−4          4.308 × 10^−3          1.556 × 10^−4
    Random            −4.688 × 10^−3         5.133 × 10^−2          5.656 × 10^3
    Sine              2.2718 × 10^−3         7.7763 × 10^−3         1.8326 × 10^−6
    White Noise       3.0922 × 10^−3         4.8524 × 10^−2         4.408 × 10^−6
    Alternating 1/0   −1.2285 × 10^6         1.9754 × 10^7          6.7068 × 10^12

Table 5.1: Summary of MDCT calculation errors for various signal types. x_i and y_i are coefficients calculated by the GPU implementation and the brute force implementation respectively. A is the average amplitude of the signal.
The following plots show the ratio of error to the correct value for the different test streams.
Figure 5.2: Scaled error in MDCT computed on GPU for a stream of zeroes, showing no errors.

Figure 5.3: Scaled error in MDCT computed on GPU for a sequence of numbers 0..2047, indicating that the errors increase as the frequency increases.

Figure 5.4: Scaled error in MDCT computed on GPU for a set of random numbers, showing a random distribution of errors.

Figure 5.5: Scaled error in MDCT computed on GPU for an input of alternating 1 and 0 bits, showing an error distribution similar to Fig. 5.3 that increases as the frequency increases.

Figure 5.6: Scaled error in MDCT computed on GPU for a 0.5 kHz sine wave sampled at 4 kHz, showing the error concentrated around the centre frequency.

Figure 5.7: Scaled error in MDCT computed on GPU for white noise sampled at 2000 Hz, showing a random distribution of errors.
5.4 Performance
The speed of the OpenCL implementation of the MDCT was compared against the MDCT implementation in the Xiph.org Vorbis codec. The Vorbis implementation was used as a reference because it is an industry standard codec used in several applications. The OpenCL device used was an Nvidia GTX 460 with 1 GB of global memory and 64 KB of memory shared between local memory and GPU cache. The Vorbis implementation was run on a 3.4 GHz Intel QX9650 with 6 GB of RAM.
Figure 5.8: Execution time of OpenCL MDCT compared to that of Vorbis' MDCT.
The OpenCL implementation is 32 times faster than the Vorbis implementation.
5.4.1 Effect of IO
If the time spent in transferring the data to and from the GPU is factored in, less than a tenth of the total time is spent on computation, with the remainder spent on IO. Despite this, the total time required by the GPU implementation is one-third of the time taken by the Vorbis implementation.
Figure 5.9: Breakdown of time spent in performing MDCT on GPU. The time is largely dominated by transfers over the PCIe bus.
Figure 5.10: Total execution time of OpenCL MDCT compared to that of Vorbis' MDCT when GPU IO times are factored in.
Chapter 6
Conclusion and Future Work
We have implemented a single precision floating point version of the 2048-point FFT and MDCT on a GTX 460 GPU (GF104) with 1 GB of memory using the vendor-neutral OpenCL API. Parallelization was maximized by using a radix-2 DIT FFT and implementing the MDCT through the FFT. We achieved a 2 times speedup over the current industry standard FFTW for the FFT. A speedup of 32 times has been achieved over the Xiph Vorbis implementation of the MDCT. Since IO over PCI Express incurs a large time cost, data transfers between the GPU and the host system have been minimized.
The scaled error of our implementation relative to the reference implementations (FFTW and Xiph Vorbis) is found to be of the order of 10^−2 to 10^−5, which is very low. The accuracy can be further improved by using double precision instead of single precision. Double precision is an optional feature in OpenCL 1.0. It is available on Nvidia GPUs since the GTX 200 and on high end AMD GPUs since the 3850. Accuracy can possibly be further improved by avoiding the fixed-point hardware implementations of the sine and cosine functions.
Considering the speed advantage and accuracy of the GPU implementation, it is advisable to implement AAC on the GPU. The high computational throughput per cycle of the GPU implementation could be utilised for a low-power implementation by using a slower clock and a lower operating voltage.
Since transfers between the GPU and the host involve considerable time, it is desirable to minimize the number of data transfers between the host and the GPU. The data transfers could be performed asynchronously with the computation to minimize the impact of data transfer time on the total computation time.
As further work on this project, other computation intensive tasks, e.g. the psychoacoustic model and quantization, should be implemented on the GPU for a complete GPU implementation of perceptual
audio coding.
Appendix A
FFT Output Plots
39
0.06
real component
0.04
0.02
0.00
−0.02
−0.04
−0.06
0
200
400
600
800 1000 1200 1400 1600 1800 2000 2200 2400
0
200
400
600
800 1000 1200 1400 1600 1800 2000 2200 2400
coefficient index
complex component
0.06
0.04
0.02
0.00
−0.02
−0.04
−0.06
(a) FFT coefficients calculated on GPU for an input stream of zeroes
0.06
real component
0.04
0.02
0.00
−0.02
−0.04
−0.06
0
200
400
600
800 1000 1200 1400 1600 1800 2000 2200 2400
0
200
400
600
800 1000 1200 1400 1600 1800 2000 2200 2400
coefficient index
complex component
0.06
0.04
0.02
0.00
−0.02
−0.04
−0.06
(b) FFT coefficients calculated by FFTW for an input stream of zeroes
Figure A.1: FFT of a stream of zeroes
40
2500000
real component
2000000
1500000
1000000
500000
0
complex component
−500000
800000
600000
400000
200000
0
−200000
−400000
−600000
−800000
0
200
400
600
800 1000 1200 1400 1600 1800 2000 2200 2400
0
200
400
600
800 1000 1200 1400 1600 1800 2000 2200 2400
coefficient index
(a) FFT coefficients calculated on GPU for the sequence of numbers 0..2047
2500000
real component
2000000
1500000
1000000
500000
0
complex component
−500000
800000
600000
400000
200000
0
−200000
−400000
−600000
−800000
0
200
400
600
800 1000 1200 1400 1600 1800 2000 2200 2400
0
200
400
600
800 1000 1200 1400 1600 1800 2000 2200 2400
coefficient index
(b) FFT coefficients calculated by FFTW for the sequence of numbers 0..2047
Figure A.2: FFT of Sequence of numbers 0 to 2047
41
1.5
1e12
real component
1.0
1.5
1.0
0.5
0.0
complex component
−0.5
8
6
4
2
0
−2
−4
−6
−8
0
100
1e10
400
600
800 1000 1100 1400 1600 1800 1000 1100 1400
0
400
600
800 1000 1200 1400 1600 1800 2000 2200 2400
coefficient index
200
(a) FFT coefficients calculated on GPU for random numbers
1.5
1e12
real component
1.0
1.5
1.0
0.5
0.0
complex component
−0.5
8
6
4
2
0
−2
−4
−6
−8
0
100
1e10
400
600
800 1000 1100 1400 1600 1800 1000 1100 1400
0
400
600
800 1000 1200 1400 1600 1800 2000 2200 2400
coefficient index
200
(b) FFT coefficients calculated by FFTW for random numbers
Figure A.3: FFT of random numbers
42
Figure A.4: FFT of a stream of alternating 1's and 0's. (a) FFT coefficients calculated on GPU; (b) FFT coefficients calculated with FFTW. Both panels plot the real and complex components against coefficient index.
Figure A.5: FFT of a 0.5 kHz sine wave sampled at 4 kHz. (a) FFT coefficients calculated on GPU; (b) FFT coefficients calculated with FFTW. Both panels plot the real and complex components against coefficient index.
Figure A.6: FFT of white noise sampled at 2 kHz. (a) FFT coefficients calculated on GPU; (b) FFT coefficients calculated with FFTW. Both panels plot the real and complex components against coefficient index.
Appendix B
MDCT Output Plots
Figure B.1: MDCT of 2048 zero samples. (a) calculated by the GPU implementation; (b) calculated by the Vorbis implementation on CPU. Both panels plot coefficient value against coefficient index.
Figure B.2: MDCT of the sequence of numbers 0 to 2047 (an arithmetic progression). (a) calculated by the GPU implementation; (b) calculated by the Vorbis implementation on CPU. Both panels plot coefficient value against coefficient index.
Figure B.3: MDCT of random numbers. (a) calculated by the GPU implementation; (b) calculated by the Vorbis implementation on CPU. Both panels plot coefficient value against coefficient index.
Figure B.4: MDCT of an alternating stream of 1 and 0 bits. (a) calculated by the GPU implementation; (b) calculated by the CPU implementation in Vorbis. Both panels plot coefficient value against coefficient index.
Figure B.5: MDCT of a 0.5 kHz sine wave sampled at 4 kHz. (a) calculated on GPU; (b) calculated on CPU using Vorbis' implementation. Both panels plot coefficient value against coefficient index.
Figure B.6: MDCT of white noise sampled at 2 kHz. (a) calculated on GPU; (b) calculated on CPU using Vorbis' implementation. Both panels plot coefficient value against coefficient index.
Bibliography
[1] Asano, S., Maruyama, T., and Yamaguchi, Y. Performance comparison of FPGA, GPU and CPU in image processing. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on (2009), IEEE, pp. 126--131.
[2] Balevic, A. Parallel variable-length encoding on GPGPUs. In Euro-Par 2009--Parallel Processing Workshops (2010), Springer, pp. 26--35.
[3] Bosi, M., Goldberg, R., and Mitchell, J. Introduction to digital audio coding and standards. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2003.
[4] Brandenburg, K. MP3 and AAC explained. In AES 17th International Conference on High-Quality Audio Coding (1999), Citeseer.
[5] Burrus, C. Fast Fourier Transforms. Connexions, http://cnx.org/content/col10550/1.21/, Rice University, Houston, Texas, 2010.
[6] Cooley, J., Lewis, P., and Welch, P. The fast Fourier transform and its applications. Education, IEEE Transactions on 12, 1 (1969), 27--34.
[7] Cooley, J., and Tukey, J. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation 19, 90 (1965), 297--301.
[8] Duhamel, P., Mahieux, Y., and Petit, J. A fast algorithm for the implementation of filter banks based on time domain aliasing cancellation. In Acoustics, Speech, and Signal Processing, 1991. ICASSP-91., 1991 International Conference on (2002), IEEE, pp. 2209--2212.
[9] Foote, J. Visualizing music and audio using self-similarity. In ACM Multimedia (1999), vol. 1, Citeseer, pp. 77--80.
[10] Frigo, M., and Johnson, S. G. BenchFFT FFTW benchmarks.
[11] Gluth, R. Regular FFT-related transform kernels for DCT/DST-based polyphase filter banks. In Acoustics, Speech, and Signal Processing, 1991. ICASSP-91., 1991 International Conference on (2002), IEEE, pp. 2205--2208.
[12] Jääskeläinen, P., de La Lama, C., Huerta, P., and Takala, J. OpenCL-based design methodology for application-specific processors. In Embedded Computer Systems (SAMOS), 2010 International Conference on, IEEE, pp. 223--230.
[13] Lozano, O., and Otsuka, K. Simultaneous and fast 3D tracking of multiple faces in video by GPU-based stream processing. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on (2008), IEEE, pp. 713--716.
[14] Munshi, A., et al. The OpenCL specification version 1.0. Khronos OpenCL Working Group (2009).
[15] NVIDIA. NVIDIA's Next Generation CUDA Compute Architecture: Fermi. NVIDIA Corporation, 2701 San Tomas Expressway, Santa Clara, CA 95050, 2009.
[16] NVIDIA. OpenCL Best Practices Guide. NVIDIA Corporation, 2701 San Tomas Expressway, Santa Clara, CA 95050, August 2009.
[17] NVIDIA. CUDA-accelerated applications. http://www.nvidia.com/object/cuda_app_tesla.html, April 2011.
[18] Painter, T., and Spanias, A. Perceptual coding of digital audio. Proceedings of the IEEE 88, 4 (2000), 451--515.
[19] Pan, D. Digital audio compression. Digital Technical Journal 5, 2 (1993), 28--40.
[20] Pan, D. A tutorial on MPEG/audio compression. Multimedia, IEEE 2, 2 (1995), 60--74.
[21] Proakis, J., and Manolakis, D. Introduction to digital signal processing. Prentice Hall Professional Technical Reference, 1988.
[22] Shlien, S. The modulated lapped transform, its time-varying forms, and its applications to audio coding standards. Speech and Audio Processing, IEEE Transactions on 5, 4 (1997), 359--366.
[23] Sinha, S., Frahm, J., Pollefeys, M., and Genc, Y. GPU-based video feature tracking and matching. In EDGE, Workshop on Edge Computing Using New Commodity Architectures (2006), vol. 278, Citeseer.
[24] Wang, Y., and Vilermo, M. Modified discrete cosine transform: its implications for audio coding and error concealment. Journal of the Audio Engineering Society 51, 1/2 (2003), 52--61.
[25] Wang, Y., Yaroslavsky, L., and Vilermo, M. On the relationship between MDCT, SDFT and DFT. In Signal Processing Proceedings, 2000. WCCC-ICSP 2000. 5th International Conference on (2002), vol. 1, IEEE, pp. 44--47.
[26] Wang, Y., Yaroslavsky, L., Vilermo, M., and Väänänen, M. Some peculiar properties of the MDCT. In Proceedings of the 5th International Conference on Signal Processing (ICSP 2000), IEEE (2000).