
NANYANG TECHNOLOGICAL UNIVERSITY

Final Year Project SCE10-0082

Implementation of Frequency Analysis Modules for Audio Coding on GPU

Subrat M

Supervisor: Dr. Ian McLoughlin
Examiner: Dr. Bertil S

Submitted in Partial Fulfillment of the Requirements for the Degree of Bachelor of Computer Science of the Nanyang Technological University

April 2011

Abstract

Advanced Audio Coding (AAC) is the current MPEG standard for lossy compression of digital audio, and it provides better sound quality than MP3. It involves several computation-intensive stages, e.g. the filter bank and the psychoacoustic algorithm. The discrete Fourier transform (DFT) is a commonly used tool for the frequency analysis required by the psychoacoustic model, while the modified discrete cosine transform (MDCT) is used to realize the filter bank. Since the DFT and MDCT are computation intensive, it is important to implement them efficiently in order to achieve higher speed and lower power consumption. The tremendous evolution of the GPU has given rise to a new computing paradigm called general purpose GPU computing (GPGPU), or simply GPU computing. GPGPU leverages the enormous computing resources of the GPU for data-parallel scientific and engineering applications. Under the GPU computing model, the sequential component of a given application is executed by the CPU, while the computation-intensive data-parallel component is executed concurrently by the multiple cores of the GPU. In order to facilitate the use of the GPU as a parallel processor running high-level programs without graphics-oriented APIs (such as GLSL, HLSL and Cg), GPGPU APIs (such as CUDA, OpenCL and DirectCompute) were developed. Such APIs have been used to successfully accelerate several scientific and engineering applications with high computational requirements.

In this report we discuss the viability of implementing the DFT and MDCT algorithms on the GPU, since these are two of the more computation-intensive functions in AAC. We have implemented the 2048-point FFT and MDCT on a GTX 460 GPU (GF104) with 1 GB of memory using the vendor-neutral OpenCL API. Parallelization was maximized by using the radix-2 DIT FFT and by implementing the MDCT through the FFT. We achieved a 2 times speedup over FFTW for the FFT and a 32 times speedup over the Xiph Vorbis implementation of the MDCT. Since IO over PCI Express incurs a large time cost, data transfers between the GPU and the host system have been minimized. The accuracy of the implementation is found to be acceptable. The results of the DFT and MDCT implementations indicate that speed performance can be significantly enhanced by implementing these frequency transforms on GPUs.

Acknowledgements

I would like to thank my supervisor, Associate Professor Ian McLoughlin, for his valuable guidance and suggestions from the inception to the completion of this project. His guidance has been instrumental in setting the direction of the work. I am thankful to him for helping me stay on the right track and for keeping me focused on the project's objective. He has always been very approachable, and his technical insights helped me whenever I had any doubts. His support in editing this thesis has significantly improved its quality and readability.

I would also like to thank my father for taking an interest in my work throughout the project. I am grateful to him for patiently explaining some key concepts related to this work to me. I also wish to thank my brother and mother for their unending moral support and encouragement.
Contents

1 Introduction
  1.1 Background
  1.2 Objectives
  1.3 Overview of the Report
2 Perceptual Audio Compression
  2.1 Filter Bank
  2.2 Psychoacoustic Model
  2.3 Quantization
  2.4 Lossless Coding
3 GPU Programming using OpenCL
  3.1 GPU Architecture
  3.2 GPGPU APIs
  3.3 OpenCL Architecture
  3.4 OpenCL Execution Model
  3.5 Performance Maximization
4 Fast Fourier Transform
  4.1 Theory
  4.2 Implementation Details
  4.3 Accuracy
  4.4 Performance
5 Modified Discrete Cosine Transform
  5.1 Theory
  5.2 Implementation Details
  5.3 Accuracy
  5.4 Performance
6 Conclusion and Future Work
A FFT Output Plots
B MDCT Output Plots

List of Figures

2.1 Pipeline of a general perceptual audio encoder
2.2 Absolute Threshold of Hearing (ATH) in quiet of a human ear
2.3 Simultaneous Frequency Masking
3.1 GPU architecture
3.2 High level OpenCL architecture
3.3 OpenCL memory model
3.4 OpenCL work distribution model
3.5 Bandwidths of OpenCL memory types
4.1 Thread operations for N = 8 FFT
4.2 Scaled error in FFT computed by GPU for a stream of zeroes
4.3 Scaled error in FFT computed by GPU for a sequence of numbers
4.4 Scaled error in FFT computed by GPU for a set of random numbers
4.5 Scaled error in FFT computed by GPU for a stream of alternating bits
4.6 Scaled error in FFT computed by GPU for a sine wave
4.7 Scaled error in FFT computed by GPU for a white noise signal
4.8 Execution time of FFT implementations
4.9 GPU IO times in FFT
4.10 Total timing information in FFT
5.1 Overlap in adjacent blocks in MDCT computation
5.2 Scaled error in MDCT computed in GPU for a stream of zeroes
5.3 Scaled error in MDCT computed in GPU for a sequence of numbers
5.4 Scaled error in MDCT computed in GPU for a set of random numbers
5.5 Scaled error in MDCT computed in GPU for an input of alternating bits
5.6 Scaled error in MDCT computed in GPU for a sine wave
5.7 Scaled error in MDCT computed in GPU for white noise
5.8 Execution time of OpenCL MDCT compared to that of Vorbis' MDCT
5.9 Breakdown of time spent in performing MDCT on GPU
5.10 Total execution time of OpenCL MDCT compared to Vorbis' MDCT
A.1 FFT of a stream of zeroes
A.2 FFT of the sequence of numbers 0 to 2047
A.3 FFT of random numbers
A.4 FFT of a stream of alternating 1's and 0's
A.5 FFT of a 500 Hz sine wave sampled at 4 kHz
A.6 FFT of white noise sampled at 2 kHz
B.1 MDCT of 2048 zero samples
B.2 MDCT of the sequence of numbers 0 to 2047
B.3 MDCT of random numbers
B.4 MDCT of alternating 1 and 0 bits
B.5 MDCT of a 0.5 kHz sine wave sampled at 4 kHz
B.6 MDCT of white noise sampled at 2 kHz

Chapter 1
Introduction

Audio compression refers to the process of representing digital audio data with fewer bits than its original representation. It can be done in a lossy or lossless manner. Lossless compression allows recovery of the exact original signal from the compressed signal. Lossy compression, on the other hand, selectively discards some frequency components and scales down other frequency components of the audio data without perceivable loss of quality. Lossy compression is widely used for music players with limited storage as it can achieve a significantly higher compression ratio than its lossless counterpart. Advanced Audio Coding (AAC) is the current standard for lossy compression of digital audio, and it provides better sound quality than MP3 [4].

1.1 Background

Lossy audio compression algorithms determine the frequency components of the audio data that are masked and can therefore be allocated fewer bits during quantization. Such algorithms are based on psychoacoustic models (PAM) designed to take advantage of typical features and limitations of the human auditory system. PAMs calculate the energy in various frequency bands to determine which of the frequency components are masked and can be quantized using fewer bits [20][19].
Therefore, time-to-frequency-domain transformation of the data is a vital function in lossy audio compression systems. Frequency analysis is the process of converting time domain data into the frequency domain. Apart from its use in audio processing [4], it is used in a diverse range of fields such as image processing, computer vision, seismography [6] and audio visualization [9]. The discrete Fourier transform (DFT) is a commonly used tool for frequency analysis and the determination of power spectral density, while the discrete cosine transform (DCT) is used for the transformation of time domain signals to the frequency domain in signal and image compression, due to its higher energy compaction property [21]. The modified discrete cosine transform (MDCT) is a lapped form of the DCT popularly used as the filter bank in audio codecs [22]. Since the DFT and MDCT are computation-intensive algorithms, they are important targets for speed optimization.

A Graphics Processing Unit (GPU) consists of clusters of parallel processors. Today's GPUs contain between 48 and 512 processing cores. In recent years, the enormous computing resources of the GPU have been used for high-performance scientific computing and several engineering applications, including image and video processing [23][13], where the sequential component of the algorithm is executed by the CPU while the parallel components are processed concurrently by the GPU cores. GPU implementations of such general purpose applications have achieved 25 to 50 times speedup [1].

1.2 Objectives

This final year project is the first part of a larger project to implement the computation-intensive blocks of AAC on the GPU. In this report we discuss the viability of using GPUs for implementing the frequency transforms used in AAC. Specifically, we aim at an efficient implementation of the DFT and MDCT algorithms, since these are relatively computation-intensive functions in AAC and are fundamental steps of AAC encoding. To achieve this we have maximized the unfolding of the computation and, accordingly, maximized the number of threads in which the computation can be performed with the resources available on the GPU. We have considered the GPU implementation of the FFT and MDCT algorithms for a block length of N = 2048, since this length is popularly used in AAC. In addition, we investigate the speed performance and the effect of single precision floating point arithmetic on the accuracy of the implementation.

1.3 Overview of the Report

In the next chapter, we discuss the functions of the significant blocks of a general lossy audio encoder, and we emphasize the role of the FFT and MDCT in realizing the encoder. In Chapter 3, we briefly describe the general organization of a GPU and, based on that, discuss a general method for parallel computation on the GPU. Additionally, we discuss some design considerations for maximizing the performance of applications on GPUs. Chapter 4 is devoted to the GPU implementation of the FFT; in that chapter we compare our GPU implementation with the current industry standard FFTW library. Chapter 5 deals with the implementation of the MDCT. We show that the MDCT can be computed through the FFT, which enables reuse of the FFT program and thereby simplifies the MDCT implementation, and we provide the necessary mathematical basis for implementing the MDCT through the FFT. Moreover, we discuss the accuracy and speed of our GPU implementation of the MDCT and compare them against those of the Xiph Vorbis implementation of the MDCT.
In Chapter 6, we present the conclusions along with the implications of the results obtained from the experiments. We also discuss the scope for future work.

Chapter 2
Perceptual Audio Compression

Perceptual audio compression is a method of compressing audio data which takes advantage of the limitations of human hearing. It is well known that the human auditory system can hear sounds in the range of 20 Hz to 20 kHz, with the upper limit falling to 16 kHz in most adults. Additionally, loud tones at a certain frequency prevent us from hearing soft tones at nearby frequencies [20]. Perceptual coding systems exploit such limitations of the human auditory system by scaling down the tones that have less impact on the sound perceived by the human ear, thereby creating quantization noise in frequency bands where the noise is not audible. A block diagram of the major components of a general perceptual encoder is shown in Fig. 2.1 [18][3].

Figure 2.1: Pipeline of a general perceptual audio encoder.

The samples of digital audio data are first decomposed into different frequency bands by a filter bank. The output of the filter bank is then quantized using a strategy that hides the quantization noise in frequency bins that are masked due to the nature of the human hearing apparatus. Psychoacoustic analysis is used to determine the level of noise in each frequency bin that can be masked. The quantized coefficients are then losslessly coded using Huffman encoding. In the following we briefly describe the function of each of these operational blocks.

2.1 Filter Bank

Filter banks are widely used as a digital signal processing tool. The primary function of a filter bank is to decompose the input signal into a set of frequency bins. A filter bank may operate solely in the time domain, splitting a time domain signal into several time domain components corresponding to different frequency bands. Such a filter bank can be viewed as a set of band-pass filters with different passbands that separate out the desired set of sub-bands from the input signal. Another kind of filter bank converts a time domain signal into the frequency domain, thereby decomposing the signal into frequency components. Frequency transforms such as the discrete Fourier transform (DFT), the discrete cosine transform (DCT) and the modified discrete cosine transform (MDCT) are commonly used as filter banks for the conversion of a time domain signal into the frequency domain.

The MDCT is commonly used to implement the frequency domain filter bank. It is a DCT-based lapped sinusoidal transform. Due to its lapped property, the MDCT is performed on overlapped blocks of input samples, where the first half of each block is overlapped by the last half of the preceding block. Due to the overlapping of consecutive blocks, together with the energy compaction properties of the DCT, the MDCT helps in compression as well as in minimizing blocking artifacts. Because of these properties the MDCT filter bank is preferred for audio compression; AAC and Vorbis are two audio codecs that use the MDCT as a filter bank. Another advantage of using the MDCT as a filter bank is that an N-point MDCT can be implemented using an N/4-length DFT together with some pre-processing and post-processing, and many optimized algorithms for the low complexity implementation of the DFT are available [5][8].

2.2 Psychoacoustic Model

A psychoacoustic model is used to determine how the quantization noise should be distributed across the frequency spectrum of the audio data.
It distributes the quantization noise in such a way that the noise is masked by the signal being encoded [3].

2.2.1 Absolute Threshold of Hearing and Masking

Human ears can perceive sound in the frequency range of 20 Hz to 20 kHz. Moreover, the energy of the sound must be above a minimum amount in order to be heard. The sound pressure level (SPL) provides a measure of the energy of a sound. The SPLs at which sounds of various frequencies become audible are given by the Absolute Threshold of Hearing (ATH) curve (Fig. 2.2). Sounds that fall below the curve are not audible to the human ear. Masking can be of two types: temporal masking and frequency masking.

Figure 2.2: Absolute Threshold of Hearing (ATH) in quiet of a human ear [18].

Frequency Masking

In frequency masking, a loud sound in a certain frequency band raises the hearing threshold in adjacent frequency bands. Thus sound frequencies adjacent to the loud sound that are below the new hearing threshold are not audible. Fig. 2.3 demonstrates the effect of frequency masking.

Temporal Masking

In temporal masking, a soft tone immediately following a loud tone in the time domain is masked by the loud sound. This occurs because the loud sound increases the threshold of hearing for its constituent frequencies, and the increased thresholds subside over time rather than immediately upon the end of the loud sound.

These masking effects are exploited to hide the quantization noise in the masked frequency bands, thereby minimizing the effect of the quantization noise on the perceived sound. As can be seen, a filter bank is essential to the operation of the psychoacoustic model. Due to its importance in a lossy encoder and its high computational complexity, the filter bank is a good target for optimization.

Figure 2.3: A masking tone alters the ATH curve, which masks a tone whose energy is less than the new threshold.

2.3 Quantization

Quantization is the process of scaling down certain frequency components which do not impact the quality of the reproduced sound. During quantization, the outputs of the filter bank are divided by a predetermined integer so as to represent them with a smaller number of bits and thereby satisfy the target bitrate or quality level. Since the number of bits available to encode the signal is finite, and often less than the number of bits used to originally sample the signal, this step causes a loss of information. Consequently, the signal is distorted. Information from the perceptual model is used to determine the frequency bands in which the distortion will be masked, or hidden, by the signal. This is the only step that causes data loss in AAC.

2.4 Lossless Coding

Lossless coding is used to finally encode the quantized coefficients together with side information. It is performed to represent the quantized audio data with a smaller number of bits. There are several methods for lossless coding. AAC uses Huffman coding, with codebooks defined by the specification of the standard, to encode the information.

Chapter 3
GPU Programming using OpenCL

A GPU typically contains hundreds of processing cores grouped into several clusters together with their local memories and a common global memory [15].
It is a massively parallel, multithreaded and programmable computing platform that is designed and optimized to accomplish fast processing and display of 2-dimensional and 3-dimensional graphics, images and video data. Over the last two decades it has evolved from a single-processor graphics accelerator into a high-performance computing structure capable of teraflops of floating point operations, and it has been popularly used to achieve real-time display and visual interaction in desktop computers and multimedia/graphics terminals. The tremendous evolution of the GPU has given rise to a new computing paradigm called general purpose GPU computing (GPGPU), or simply GPU computing. GPGPU leverages the enormous computing resources of the GPU for data-parallel scientific and engineering applications. Under the GPU computing model, the sequential component of a given application is executed by the CPU (which has good performance on branching code for control-intensive computation), while the computation-intensive data-parallel component is executed concurrently by the multiple cores of the GPU. In order to facilitate the use of the GPU as a parallel processor running high-level programs without graphics-oriented APIs (such as GLSL, HLSL and Cg), GPGPU APIs (such as CUDA, OpenCL and DirectCompute) were developed. Such APIs have been used to successfully accelerate several applications with high computational demands, as detailed in [17].

3.1 GPU Architecture

A broad view of the first generation Fermi (GF100) architecture is shown in Fig. 3.1. It consists of 512 CUDA cores grouped into 16 streaming multiprocessors (SMs). Each CUDA core can execute either a floating point or an integer instruction in one clock period [15]. Each SM can therefore execute a group of 32 threads (called a warp in Nvidia literature) in parallel. Each SM has an instruction cache, a pair of warp schedulers and its own L1 cache, while the L2 cache is shared by all the SMs. Part of the 64 KB L1 cache is also used as the local memory of the SM. The main memory of the device has six partitions, each supporting up to 1 GB of DRAM, allowing the device to support up to 6 GB of DRAM. The GPU is connected to the host through the host interface over the PCI Express bus. The thread block scheduler distributes thread blocks to SMs, while the warp schedulers within each SM schedule warps for execution.

Figure 3.1: Architecture of a first generation Fermi GPU [15]. It is designed as a cluster of simple execution units.

3.2 GPGPU APIs

GPGPU can be accomplished using several APIs, including CUDA, OpenCL and DirectCompute. The CUDA API was developed by Nvidia for GPGPU programming solely on Nvidia cards. It was released to the public in February 2007 and can be programmed using several languages including Python, Perl, Fortran, Java, Ruby and Lua. It is cross-platform and is available on Linux, MS Windows and Mac OS X. DirectCompute is an API developed by Microsoft for GPGPU on the MS Windows Vista and Windows 7 operating systems. It can be used to program GPUs that support DirectX 10. OpenCL is a vendor-neutral API for general purpose computation on the GPU. The specification of the standard is decided by the Khronos consortium, whose members come from several companies. Development of OpenCL was initiated by Apple and completed in collaboration with Nvidia, AMD, Intel and IBM. It is supported by GPUs from AMD, Nvidia and S3, by Intel and AMD CPUs, and by IBM Cell processors.
Work is underway to automatically translate OpenCL programs into application-specific FPGA designs [?]. It can be programmed using C, Python and Java. A WebCL working group within Khronos was formed in March 2011 to define a JavaScript binding to OpenCL, enabling web applications running in a browser to leverage the GPU's computational resources. OpenCL was chosen for the GPU implementation of the FFT and MDCT in this project because it is not limited to any single GPU vendor and is not bound to any operating system.

API             Linux   Mac OS X   Windows   GPU      CPU/FPGA   Specifying Body
CUDA            Y       Y          Y         Nvidia   N          Nvidia
OpenCL          Y       Y          Y         Any      Y          Industry consortium
DirectCompute   N       N          Y         DX10     N          Microsoft

Table 3.1: Comparison of various GPGPU APIs.

3.3 OpenCL Architecture

Figure 3.2: Communication between OpenCL devices and OpenCL hosts.

OpenCL defines a host system that is connected to a number of OpenCL devices [14] (GPUs in the context of this project), as shown in Fig. 3.2. The host system in this project is a CPU together with system RAM. The host is responsible for acquiring the data to be processed, uploading the data to the device, submitting commands to the device to execute programs, and downloading the results of the computation from the device.

Figure 3.3: The memory model of OpenCL defines private and local memories for each compute unit, and cache-backed global and constant memories shared by all compute units. PE stands for processing element.

An OpenCL device consists of compute units, each of which is composed of a number of processing elements. Each compute unit has its own local memory, accessible only to the processing elements in that compute unit. Additionally, all the processing elements in a device have access to a common global memory with high bandwidth but very high latency. Local memory, on the other hand, provides high bandwidth with low latency. For example, on the first generation Fermi GPUs (GF100), the local memory bandwidth of a single compute unit is 73.6 GB/s and the latency is only a few cycles. This totals up to 1.03 TB/s when the bandwidths of all the compute units are combined, while the global memory bandwidth is 144 GB/s with a latency of 400-800 cycles.

3.4 OpenCL Execution Model

The OpenCL execution model requires that the computation be expressed in terms of threads of execution. Each thread of execution is represented as a work item in OpenCL. Work items are grouped into work groups, with each work group containing the same number of work items. Each work group is assigned to a compute unit. Work items in a work group are able to access the global memory, but only the local memory of the compute unit assigned to their work group. The set of instructions to be performed by each work item is specified in an OpenCL kernel, which is written in a C-like language.

Figure 3.4: The global work space is composed of work groups, which are collections of work items. Here the global work size is 15 x 15 in the x and y dimensions and the work group size is 5 x 5.

Performing a computation in OpenCL typically involves the following steps:

1. The host specifies the file containing the kernel source code.
2. The OpenCL runtime compiles the kernel source into machine code for the GPU.
3. The host specifies the total number of work items and the number of work items in each work group.
4. The host creates buffers in the GPU global memory for storing the input and output data of the computation. The host also uploads the source data (to be used for the computation) into the GPU buffers from system memory.
5. The host instructs the GPU to begin computation.
6. At this point, threads are launched on the GPU that execute the instructions specified in the kernel.
7. Once the computation is complete, the host downloads the computed results from the GPU into system memory.

The GPU can perform the computation asynchronously, allowing the CPU to work on other tasks while the GPU is executing the work items. A minimal host-side sketch of this sequence is given below.
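The steps above map directly onto the OpenCL 1.x host API. The following sketch shows one possible host-side flow in C, with error checking omitted for brevity; the kernel name my_kernel, the argument layout and the work-group size of 256 are illustrative placeholders rather than the project's actual host code.

    #include <CL/cl.h>

    /* Run an already-loaded kernel source string over n_floats input values.
       Illustrative sketch only: no error checking, one input and one output buffer. */
    void run_kernel(const float *in, float *out, size_t n_floats, const char *src)
    {
        cl_platform_id plat;
        cl_device_id   dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

        cl_context       ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q   = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* Steps 1-2: compile the kernel source for the device. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel  k    = clCreateKernel(prog, "my_kernel", NULL);

        /* Step 4: buffers in device global memory, plus upload of the input. */
        size_t bytes = n_floats * sizeof(float);
        cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
        cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);
        clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, bytes, in, 0, NULL, NULL);

        /* Steps 3 and 5-6: set arguments, choose work sizes and launch. */
        clSetKernelArg(k, 0, sizeof(cl_mem), &d_in);
        clSetKernelArg(k, 1, sizeof(cl_mem), &d_out);
        size_t global = n_floats / 2;   /* total number of work items (step 3) */
        size_t local  = 256;            /* work items per work group (step 3)  */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);

        /* Step 7: blocking download of the result. */
        clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, bytes, out, 0, NULL, NULL);

        clReleaseMemObject(d_in);  clReleaseMemObject(d_out);
        clReleaseKernel(k);        clReleaseProgram(prog);
        clReleaseCommandQueue(q);  clReleaseContext(ctx);
    }

In practice the program would be built once and the buffers reused across blocks, so that only the upload, launch and download steps are repeated per frame.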
Figure 3.5: Bandwidths of the various memory types involved in an OpenCL execution run. The link between the GPU and the host is the slowest, while the fastest link is between the compute units and local memory.

3.5 Performance Maximization

Memory Bandwidths: As discussed in Section 3.4, computation on the GPU requires data transfers over different kinds of channels: the PCI Express (PCIe) bus for transfers between the device and the host, the global memory channel between global memory and the compute units, and the local memory channel between local memory and the processing elements. There is a large difference between the bandwidths of these channels. As shown in Fig. 3.5, the PCIe x16 connection between the GPU and the host is the slowest at 4 GB/s. The cumulative local memory bandwidth of all compute units is the highest at 1.03 TB/s, with the global memory bandwidth (144 GB/s) about one-tenth of the local memory bandwidth. Therefore, whenever possible, data transfers between the GPU and the host should be performed asynchronously so that the GPU can perform computations while the data transfers are ongoing. Local memory, due to its high speed and limited size, should be used to cache a working copy of the data. The bandwidth of host RAM is higher than that of PCIe x16, so execution of the computation on the CPU should be considered if the time spent on IO is significant compared to the time spent processing the data.

Global Memory Access Latency: While global memory has much higher bandwidth than the PCIe bus, each access to global memory involves a high latency (400-800 cycles on GF100). In order to minimize the penalty incurred by this latency, it is advisable to create a large number of work items so that, while one group of threads is waiting for data to arrive, instructions from other work items can be processed.

Global Memory Coalescing: Coalescing is one of the best ways to minimize the latency incurred in global memory transactions. It refers to the process of combining the global memory access transactions issued by the threads of a half-warp into one transaction, or two transactions in the case of 128-bit words. This avoids the latency of multiple global memory transactions. It can be achieved by accessing memory in regular or sequential patterns, as illustrated by the sketch below; a detailed discussion is available in Nvidia's whitepapers [16][15]. Here a warp is Nvidia terminology for a group of threads that are scheduled together for execution on a compute unit (AMD calls it a wavefront); a warp consists of 32 threads on Fermi GPUs and a half-warp contains 16 threads.
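The difference between a coalesced and an uncoalesced access pattern can be seen in the following pair of illustrative OpenCL kernels; the kernel names are hypothetical and only the addressing is of interest.

    /* Consecutive work items read consecutive addresses: the half-warp's
       requests fall in one memory segment and can be coalesced. */
    __kernel void copy_coalesced(__global const float *src, __global float *dst)
    {
        int gid = get_global_id(0);
        dst[gid] = src[gid];
    }

    /* A stride greater than one scatters the half-warp's addresses across
       memory segments, so the hardware may issue many separate transactions. */
    __kernel void copy_strided(__global const float *src, __global float *dst,
                               const int stride)
    {
        int gid = get_global_id(0);
        dst[gid] = src[gid * stride];
    }

On GF100-class hardware the first form allows the sixteen 32-bit reads of a half-warp to be served by one or two transactions, while the strided form may require a transaction per work item.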
Local Memory Bank Conflicts: To allow parallel access by multiple threads, local memory is divided into several memory banks that can be accessed in parallel. In this way, the memory bandwidth is effectively multiplied by the number of banks, as long as there are enough threads to access the local memory through all the banks. Bank conflicts occur when more than one thread accesses memory locations that map to the same bank. In such a situation the memory accesses are serialized, negating the benefit of having multiple banks. As an exception, when all threads in a half-warp access the same local memory location, the bank operates in a broadcast configuration which requires no serialization of requests.

Math Precision Tradeoffs: When an application does not require highly accurate evaluation of mathematical functions, faster evaluation of such functions in dedicated hardware can be used to speed up the calculation. Examples of such functions are native_sin and native_cos. Additionally, OpenCL specifies low precision implementations of mathematical functions for cases where precision can be traded for speed.

Chapter 4
Fast Fourier Transform

4.1 Theory

The discrete Fourier transform (DFT) of an N-point sequence {x_0, x_1, ..., x_{N-1}} is given by

    X_k = \sum_{n=0}^{N-1} x_n e^{-i\frac{2\pi}{N}kn}, \qquad k = 0, 1, \ldots, N-1        (4.1)

The direct computation of an N-point DFT requires N^2 multiplications. To reduce this computational complexity, Cooley and Tukey proposed the fast Fourier transform (FFT) in 1965, which computes the DFT using N log2 N multiplications [7]. Several variations of the FFT have been proposed since then, but radix-2 algorithms remain the most popular due to their simplicity. There are two basic variations of the FFT: the decimation in time (DIT) FFT and the decimation in frequency (DIF) FFT. The DIF algorithm splits the DFT computation recursively into two halves of even- and odd-indexed frequency components. The radix-2 DIT FFT, on the other hand, recursively splits the N-point DFT into two N/2-point DFTs of even- and odd-indexed time domain samples. We have implemented the radix-2 DIT FFT due to the simplicity of reordering through bit reversal.

4.2 Implementation Details

Figure 4.1: Thread operations for an N = 8 FFT. Each line style indicates work done by a single thread.

The computation of a radix-2 DIT FFT can be considered as a sequence of stages that are performed one after another, with each stage computing FFT butterflies with varying indices for the source and destination data. Our implementation encodes the indices of the source and destination data of the FFT butterflies as a function of the thread ID (thread IDs are numeric identifiers for threads within a work group and range from 0 to thread_count - 1). The function mapping thread IDs to butterfly data indices is:

    i = \left\lfloor \frac{2t}{w} \right\rfloor w + \left( t \bmod \frac{w}{2} \right)        (4.2)

where i is the first index of the FFT butterfly, t is the thread ID and w is the butterfly width. The two indices of the FFT butterfly are given by i and i + w/2. The butterfly width w is a function of the stage number s and is given by w = 2^s. Each FFT butterfly computes values at two indices, and the number of FFT butterflies remains the same in each stage. This allows us to launch N/2 threads for computing an FFT of length N. Under this scheme, each thread performs the calculations for one butterfly operation in each stage.

Since the AAC standard requires that each frame is composed of 2048 samples of the source waveform with 50% overlap, an FFT for N = 2048 was implemented. To obtain 50% overlap, 3072 samples of source audio data are used to form two blocks of length 2048 each. The first block contains the samples at indices 0, 1, ..., 2047 and the second block contains samples 1024, 1025, ..., 3071. A sketch of a single butterfly stage using the index mapping of Eq. 4.2 is shown below.
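As an illustration of Eq. 4.2, the following OpenCL kernel sketches one butterfly stage with one work item per butterfly and one kernel launch per stage, operating on global memory. This is a simplified stand-in, not the thesis implementation, which runs all stages of a 2048-point FFT inside a single work group using local memory and barriers; the kernel and argument names are assumptions.

    /* One radix-2 DIT butterfly stage. data holds N complex values (already in
       bit-reversed order); w is the butterfly width 2^s for the current stage s.
       Launch with N/2 work items. */
    __kernel void fft_stage(__global float2 *data, const int w)
    {
        int t = get_global_id(0);                   /* thread ID, 0 .. N/2 - 1     */
        int i = (2 * t / w) * w + (t % (w / 2));    /* first butterfly index, Eq. 4.2 */
        int j = i + w / 2;                          /* second butterfly index       */

        /* Twiddle factor exp(-2*pi*i * (t mod w/2) / w), using the fast
           hardware trigonometric functions mentioned in Section 3.5. */
        float  ang = -2.0f * M_PI_F * (float)(t % (w / 2)) / (float)w;
        float2 tw  = (float2)(native_cos(ang), native_sin(ang));

        float2 a = data[i];
        float2 b = data[j];
        float2 bt = (float2)(b.x * tw.x - b.y * tw.y,   /* complex multiply b * tw */
                             b.x * tw.y + b.y * tw.x);

        data[i] = a + bt;
        data[j] = a - bt;
    }

Each stage would be launched with N/2 work items, with w doubling from 2 up to N, matching the pattern of Fig. 4.1.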
Each FFT of N = 2048 is calculated by a compute unit entirely in local memory. Before the butterfly stages are executed, the samples are sorted into the positions calculated by a bit-reversal algorithm, as sketched below. Trigonometric calculations for the twiddle factors were performed using the hardware implementations native_sin and native_cos.
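The bit-reversal reordering referred to above can be sketched as follows for a power-of-two block length (11 bits for N = 2048). This is generic reference code in C, not the project's kernel; the function names are illustrative.

    /* Reverse the lowest 'bits' bits of x. */
    static unsigned bit_reverse(unsigned x, unsigned bits)
    {
        unsigned r = 0;
        for (unsigned i = 0; i < bits; i++) {
            r = (r << 1) | (x & 1);
            x >>= 1;
        }
        return r;
    }

    /* Permute a complex array (separate real/imaginary parts) in place so that
       element i moves to position bit_reverse(i, bits). */
    void bit_reverse_permute(float *re, float *im, unsigned n, unsigned bits)
    {
        for (unsigned i = 0; i < n; i++) {
            unsigned j = bit_reverse(i, bits);
            if (j > i) {   /* swap each pair only once */
                float tr = re[i]; re[i] = re[j]; re[j] = tr;
                float ti = im[i]; im[i] = im[j]; im[j] = ti;
            }
        }
    }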
4.3 Accuracy

The GPU implementation of the FFT was tested for accuracy by comparing the coefficients computed on the GPU with the output of FFTW for the following six test streams:

1. 500 Hz sine wave sampled at 4000 Hz. Amplitude: 2.
2. Zero signal. Amplitude: 0.
3. Random numbers. Amplitude: 2 x 10^9.
4. Alternating 1 and 0 bits. Amplitude: 2 x 10^9.
5. White noise sampled at 2000 Hz. Amplitude: 2.
6. Arithmetic progression of numbers. Amplitude: 2047.

The following table summarizes the errors in the GPU calculation of the FFT compared to the reference implementation:

Signal            (1/A) Sum(x_i - y_i)   (1/A) Sum|x_i - y_i|   (1/A) Sum(x_i - y_i)^2
Zero              0.0                    0.0                    0.0
Sequence          7.484 x 10^-5          4.898 x 10^-3          1.474 x 10^-4
Random            2.240 x 10^-4          2.409 x 10^-2          2.810 x 10^2
Sine              -6.1 x 10^-5           1.53 x 10^-4           6.3967 x 10^-9
White noise       1.325 x 10^-4          1.985 x 10^-2          1.6311 x 10^-7
Alternating 1/0   0.0                    0.0                    0.0

Table 4.1: Summary of FFT errors for the various input signals. x_i and y_i are the coefficients calculated by the GPU implementation and by FFTW respectively; A is the average amplitude of the signal.

The ratio of the error to the correct value ranges from 10^-2 to 10^-5, with the exception of the random input. The random input results in a high error because the input is not normalized; non-normalized input can cause calculations to go outside the representable range of single precision floats, which results in errors. The following plots show the ratio of error to the correct value for the different test streams.

Figure 4.2: Scaled error in the FFT computed by the GPU for a stream of zeroes, showing no errors.
Figure 4.3: Scaled error in the FFT computed by the GPU for the sequence of numbers 0..2047, showing no errors in the real component and a small error in the complex component.
Figure 4.4: Scaled error in the FFT computed by the GPU for a set of random numbers, showing that the errors are distributed randomly. The test vector is not normalized, to show the differences between the GPU and the CPU on non-normalized input.
Figure 4.5: Scaled error in the FFT computed by the GPU for a stream of alternating 1 and 0 bits, showing no errors.
Figure 4.6: Scaled error in the FFT computed by the GPU for a 0.5 kHz sine wave sampled at 4 kHz, showing no errors.
Figure 4.7: Scaled error in the FFT computed by the GPU for a white noise signal, showing errors randomly distributed across the spectrum.

4.4 Performance

The speed of the OpenCL implementation of the FFT was tested on an Nvidia GTX 460 with 1 GB of global memory and 64 KB of memory shared between local memory and cache. The reference implementation was FFTW running on a 3.4 GHz Intel Core 2 X9650 with 6 GB of RAM. We first discuss the speed performance of the two implementations and then look at the cost incurred in transferring data to and from the GPU.

4.4.1 Execution Time

Figure 4.8: Execution time of the FFT implementations. The execution time of our GPU implementation is nearly one-third that of an industry standard CPU FFT implementation.

The FLOPS performance, calculated according to the FFTW benchmark methodology (http://www.fftw.org/speed/method.html), is 110 GFLOPS. This is significantly better than the performance of FFTW on a 3 GHz Core Duo (14 GFLOPS) or on a Cell processor in an IBM QS20 Cell Blade (22.5 GFLOPS) [10].

4.4.2 Effect of IO

This section discusses the cost of transferring data between the device global memory and the host system RAM.

Figure 4.9: Time spent uploading data to the GPU and downloading results from the GPU, compared to the execution time.

The measurements show that the time spent executing the FFT kernel is about one-fifth of the time spent transferring data between the GPU and system RAM. When the IO time is factored into a comparison with FFTW, FFTW's time is half that of the GPU execution plus IO time.

Figure 4.10: Total GPU time, including transfers between the GPU and system RAM, compared to the total FFTW time.

Chapter 5
Modified Discrete Cosine Transform

The MDCT filter bank is used in AAC due to its energy compaction property, similar to that of the DCT, and its potential for reducing blocking artifacts thanks to the overlapping of input blocks [24]. The lapped computation of the MDCT is shown in Fig. 5.1.

Figure 5.1: Samples are processed by the MDCT module in blocks of length 2048 with 50% overlap.

As shown in the figure, for each block of N audio samples an (N/2)-point MDCT is computed such that adjacent pairs of input blocks overlap each other by N/2 samples. There are several algorithms in the literature for the implementation of the MDCT. We chose to implement it through the FFT for two reasons. Firstly, it allows us to exploit the efficient parallel implementation of the FFT on the GPU.
Secondly, since the FFT is also used in the psychoacoustic model for the estimation of energy at different frequencies, the basic FFT program modules can be reused for the MDCT computation.

5.1 Theory

In this section we show that the MDCT can be implemented by an FFT with some pre-processing and post-processing, as discussed in [11]. The MDCT X_k of a sequence x_n is given by:

    X_k = \sum_{n=0}^{N-1} x_n \cos\left(\frac{\pi}{2N}\left(2n + 1 + \frac{N}{2}\right)(2k+1)\right), \qquad 0 \le k \le \frac{N}{2} - 1        (5.1)

The MDCT can be implemented using an N/4-length FFT as discussed in [8]. The following is an explanation of how the MDCT is converted into an FFT.

5.1.1 Rotation

If the samples x_n are rotated by -N/4 places, we can replace n with n - N/4. Eq. 5.1 then becomes:

    X_k = \sum_{n=N/4}^{5N/4-1} x_{n-N/4} \cos\left(\frac{\pi}{2N}\left(2\left(n - \frac{N}{4}\right) + 1 + \frac{N}{2}\right)(2k+1)\right)
        = \sum_{n=N/4}^{5N/4-1} x_{n-N/4} \cos\left(\frac{\pi}{2N}(2n+1)(2k+1)\right)
        = \sum_{n=N/4}^{N-1} x_{n-N/4} \cos\left(\frac{\pi}{2N}(2n+1)(2k+1)\right) + \sum_{n=0}^{N/4-1} x_{n+3N/4} \cos\left(\frac{\pi}{2N}(2(n+N)+1)(2k+1)\right)
        = \sum_{n=N/4}^{N-1} x_{n-N/4} \cos\left(\frac{\pi}{2N}(2n+1)(2k+1)\right) - \sum_{n=0}^{N/4-1} x_{n+3N/4} \cos\left(\frac{\pi}{2N}(2n+1)(2k+1)\right)        (5.2)

where the last step uses \cos(\theta + \pi(2k+1)) = -\cos\theta. Defining x'_n as

    x'_n = -x_{n+3N/4}   if 0 \le n \le N/4 - 1
    x'_n =  x_{n-N/4}    if N/4 \le n \le N - 1

Eq. 5.2 can be rewritten as:

    X_k = \sum_{n=0}^{N-1} x'_n \cos\left(\frac{\pi}{2N}(2n+1)(2k+1)\right)        (5.3)

5.1.2 Even-Odd Redundancy

Let

    k' = N - k - 1        (5.4)

Substituting k = N - k' - 1 into Eq. 5.3:

    X_k = \sum_{n=0}^{N-1} x'_n \cos\left(\frac{\pi}{2N}(2n+1)(2(N - k' - 1)+1)\right)
        = \sum_{n=0}^{N-1} x'_n \cos\left(\frac{\pi}{2N}(2n+1)(2N - 2k' - 1)\right)
        = \sum_{n=0}^{N-1} x'_n \cos\left(\pi(2n+1) - \frac{\pi}{2N}(2n+1)(2k'+1)\right)
        = -\sum_{n=0}^{N-1} x'_n \cos\left(\frac{\pi}{2N}(2n+1)(2k'+1)\right) = -X_{k'}        (5.5)

This shows that X_k is equal to -X_{k'} whenever k + k' = N - 1, i.e. for one even and one odd index. Thus:

    X_{2k+1} = -X_{N-2k-2}, \qquad 0 \le k \le \frac{N}{2} - 1        (5.6)

So we only need to compute the even-indexed coefficients, i.e. X_{2k} for k = 0, 1, \ldots, N/2 - 1.

5.1.3 Reduction using cos Redundancy

Since the even components can be used to derive the odd components, replace X_k with X_{2k} in Eq. 5.3:

    X_{2k} = \sum_{n=0}^{N-1} x'_n \cos\left(\frac{\pi}{2N}(2n+1)(4k+1)\right)
           = \sum_{n=0}^{N/2-1} \left\{ x'_n \cos\left(\frac{\pi}{2N}(2n+1)(4k+1)\right) + x'_{N-n-1} \cos\left(\pi(4k+1) - \frac{\pi}{2N}(2n+1)(4k+1)\right) \right\}        (5.7)
           = \sum_{n=0}^{N/2-1} \left(x'_n - x'_{N-n-1}\right) \cos\left(\frac{\pi}{2N}(2n+1)(4k+1)\right)        (5.8)

This reduction is possible due to the redundancy of cos over \pi in Eq. 5.7. A similar reduction is possible using the relation \cos(\pi/2 - \theta) = \sin\theta. Splitting the sum in Eq. 5.8 into the terms with index 2n and index N/2 - 2n - 1, for n = 0, \ldots, N/4 - 1:

    X_{2k} = \sum_{n=0}^{N/4-1} \left\{ \left(x'_{2n} - x'_{N-2n-1}\right) \cos\left(\frac{\pi}{2N}(4n+1)(4k+1)\right) + \left(x'_{N/2-2n-1} - x'_{N/2+2n}\right) \cos\left(\frac{\pi}{2}(4k+1) - \frac{\pi}{2N}(4n+1)(4k+1)\right) \right\}
           = \sum_{n=0}^{N/4-1} \left(x'_{2n} - x'_{N-2n-1}\right) \cos\left(\frac{\pi}{2N}(4n+1)(4k+1)\right) - \sum_{n=0}^{N/4-1} \left(x'_{N/2+2n} - x'_{N/2-2n-1}\right) \sin\left(\frac{\pi}{2N}(4n+1)(4k+1)\right)        (5.9)

           = \Re\left[ \sum_{n=0}^{N/4-1} \left\{ \left(x'_{2n} - x'_{N-2n-1}\right) - i\left(x'_{N/2+2n} - x'_{N/2-2n-1}\right) \right\} e^{-i\frac{\pi}{2N}(4n+1)(4k+1)} \right]        (5.10)

To see that Eq. 5.9 is equivalent to Eq. 5.10, consider the identity

    \Re\left[(x - iy)(\cos\theta - i\sin\theta)\right] = x\cos\theta - y\sin\theta.

Here the LHS is of the form of Eq. 5.10 and the RHS is of the form of Eq. 5.9.
5.1.4 Redundancy Elimination in the Even Frequency Components

From Eq. 5.6 we know that the odd frequency components can be calculated from the even ones. We now establish a relation between the first and second halves of X_{2k}. We observe the following:

    \exp\left(i\frac{\pi}{2N}(4n+1)\left(4\left(k + \frac{N}{4}\right)+1\right)\right)
      = \exp\left(i\frac{\pi}{2N}(4n+1)(4k + N + 1)\right)
      = \exp\left(i\frac{\pi}{2}(4n+1) + i\frac{\pi}{2N}(4n+1)(4k+1)\right)
      = \exp\left(i\frac{\pi}{2}(4n+1)\right) \cdot \exp\left(i\frac{\pi}{2N}(4n+1)(4k+1)\right)
      = i\,\exp\left(i\frac{\pi}{2N}(4n+1)(4k+1)\right)        (5.11)

and hence, taking the complex conjugate, e^{-i\frac{\pi}{2N}(4n+1)(4(k+N/4)+1)} = -i\,e^{-i\frac{\pi}{2N}(4n+1)(4k+1)}. Using this together with Eq. 5.10 we can deduce:

    X_{2k+\frac{N}{2}} = -\Re\left[ \sum_{n=0}^{N/4-1} \left\{ \left(x'_{2n} - x'_{N-2n-1}\right) - i\left(x'_{N/2+2n} - x'_{N/2-2n-1}\right) \right\} i\,e^{-i\frac{\pi}{2N}(4n+1)(4k+1)} \right]        (5.12)

The terms are of the form \Re[(a - ib)\,i\,e^{-i\theta}], which can be transformed as follows:

    \Re\left[(a - ib)\,i\,e^{-i\theta}\right] = \Re\left[(ia + b)e^{-i\theta}\right] = \Re\left[(ia + b)(\cos\theta - i\sin\theta)\right] = b\cos\theta + a\sin\theta        (5.13)

We now consider \Im[(a - ib)e^{-i\theta}]:

    \Im\left[(a - ib)e^{-i\theta}\right] = \Im\left[(a - ib)(\cos\theta - i\sin\theta)\right] = -a\sin\theta - b\cos\theta = -\Re\left[(a - ib)\,i\,e^{-i\theta}\right]        (5.14)

Using Eqs. 5.14 and 5.11 we can conclude that:

    X_{2k+\frac{N}{2}} = -\Re\left[ \sum_{n=0}^{N/4-1} \left\{ \left(x'_{2n} - x'_{N-2n-1}\right) - i\left(x'_{N/2+2n} - x'_{N/2-2n-1}\right) \right\} i\,e^{-i\frac{\pi}{2N}(4n+1)(4k+1)} \right]
                       = \Im\left[ \sum_{n=0}^{N/4-1} \left\{ \left(x'_{2n} - x'_{N-2n-1}\right) - i\left(x'_{N/2+2n} - x'_{N/2-2n-1}\right) \right\} e^{-i\frac{\pi}{2N}(4n+1)(4k+1)} \right]        (5.15)

5.1.5 Representation of the Computation using a DFT

We can now express the MDCT as a DFT of length N/4. For simplicity of explanation, we rewrite the complex sum appearing in Eqs. 5.10 and 5.15 as:

    \hat{X}_k = \sum_{n=0}^{N/4-1} \hat{x}_n\, e^{-i\frac{\pi}{2N}(4n+1)(4k+1)}        (5.16)

where \hat{x}_n = (x'_{2n} - x'_{N-2n-1}) - i\,(x'_{N/2+2n} - x'_{N/2-2n-1}). Eq. 5.16 can be simplified and represented by a DFT as follows:

    \hat{X}_k = \sum_{n=0}^{N/4-1} \hat{x}_n\, e^{-i\frac{8\pi}{N}\left(kn + \frac{1}{16} + \frac{k}{4} + \frac{n}{4}\right)}
              = \sum_{n=0}^{N/4-1} \left[\hat{x}_n\, e^{-i\frac{8\pi}{N}kn}\right] \cdot e^{-i\frac{8\pi}{N}\left(\frac{n}{4} + \frac{1}{32}\right)} \cdot e^{-i\frac{8\pi}{N}\left(\frac{k}{4} + \frac{1}{32}\right)}
              = e^{-i\frac{2\pi}{N}\left(k + \frac{1}{8}\right)} \sum_{n=0}^{N/4-1} \left[\hat{x}_n\, e^{-i\frac{2\pi}{N}\left(n + \frac{1}{8}\right)}\right] e^{-i\frac{2\pi}{N/4}kn}

This is a DFT of \hat{x}_n e^{-i\frac{2\pi}{N}(n+\frac{1}{8})}, multiplied by the post-twiddle factor e^{-i\frac{2\pi}{N}(k+\frac{1}{8})}. Using Eqs. 5.10 and 5.15 we can conclude that:

    \Re(\hat{X}_k) = X_{2k},               \qquad 0 \le k \le \frac{N}{4} - 1        (5.17)
    \Im(\hat{X}_k) = X_{2k+\frac{N}{2}},   \qquad 0 \le k \le \frac{N}{4} - 1        (5.18)

Using the even-odd redundancy derived in Eq. 5.6 and the relation between the two halves of the even frequency components in Eq. 5.18, we can obtain the odd-indexed MDCT components from the imaginary part of the DFT. From Eq. 5.6,

    X_{2k+1} = -X_{N-2k-2}, \qquad k = 0, 1, 2, \ldots

Substituting k = N/4 - k' - 1:

    X_{2(N/4 - k' - 1)+1} = -X_{N - 2(N/4 - k' - 1) - 2}
    X_{N/2 - 2k' - 1} = -X_{N/2 + 2k'}

i.e.

    X_{\frac{N}{2}-2k-1} = -X_{\frac{N}{2}+2k} = -\Im(\hat{X}_k)        (5.19)

This shows that the odd-indexed components of the MDCT are obtained (with a sign change) from the imaginary part of the FFT output, while the even-indexed components come from the real part.

5.2 Implementation Details

The MDCT was implemented through the FFT as discussed in Section 5.1. Since the AAC standard requires a 2048-point MDCT, we need to compute an N/4-point (i.e. 512-point) FFT. Since the calculation of an FFT of length M requires M/2 threads, we use 256 threads for the computation of the 2048-point MDCT. Local memory buffers were reused as much as possible to allow implementation on devices with less local memory. The GPU implementation uses 12288 bytes of local memory. The overall pre-processing, FFT and post-processing chain is sketched below.
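Putting Eqs. 5.2, 5.16 and 5.17-5.19 together, the MDCT-via-FFT chain can be sketched as a short CPU reference routine in C99. The function fft_q() stands in for any N/4-point forward complex DFT with the usual e^{-i 2 pi n k / (N/4)} convention (the GPU kernel, FFTW or a direct evaluation); it and the other names are assumptions for illustration, not the thesis's actual interfaces.

    #include <complex.h>
    #include <stdlib.h>

    void fft_q(float complex *buf, int q);   /* assumed N/4-point forward DFT */

    /* Compute the N/2 MDCT coefficients of one block of N samples (N divisible by 4). */
    void mdct_via_fft(const float *x, float *X, int N)
    {
        const float PI = 3.14159265358979f;
        int Q = N / 4;
        float         *xr  = malloc(N * sizeof *xr);
        float complex *buf = malloc(Q * sizeof *buf);

        /* Rotation, Eqs. 5.2-5.3: build x'_n. */
        for (int n = 0; n < Q; n++)  xr[n] = -x[n + 3 * N / 4];
        for (int n = Q; n < N; n++)  xr[n] =  x[n - N / 4];

        /* Pre-twiddle, Eq. 5.16: xhat_n scaled by exp(-i*2*pi*(n + 1/8)/N). */
        for (int n = 0; n < Q; n++) {
            float a = xr[2 * n]         - xr[N - 2 * n - 1];
            float b = xr[N / 2 + 2 * n] - xr[N / 2 - 2 * n - 1];
            buf[n] = (a - I * b) * cexpf(-I * 2.0f * PI * (n + 0.125f) / N);
        }

        fft_q(buf, Q);   /* N/4-point complex DFT */

        /* Post-twiddle and reordering, Eqs. 5.17-5.19. */
        for (int k = 0; k < Q; k++) {
            float complex Xk = buf[k] * cexpf(-I * 2.0f * PI * (k + 0.125f) / N);
            X[2 * k]             =  crealf(Xk);   /* even-indexed coefficients */
            X[N / 2 - 2 * k - 1] = -cimagf(Xk);   /* odd-indexed coefficients  */
        }

        free(xr);
        free(buf);
    }

On the GPU the same pre-twiddle, 512-point FFT and post-twiddle steps are performed inside one work group, which is consistent with the 256 threads (half of the 512-point FFT length) mentioned above.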
5.3 Accuracy

The accuracy of the MDCT implemented on the GPU was tested by running the transform on six different test streams and comparing the MDCT coefficients computed by the GPU with those of a simple brute-force implementation. The following test streams were used:

1. 500 Hz sine wave sampled at 4000 Hz. Amplitude: 2.
2. Zero signal. Amplitude: 0.
3. Random numbers. Amplitude: 2 x 10^9.
4. Alternating 1 and 0 bits. Amplitude: 2 x 10^9.
5. White noise sampled at 2000 Hz. Amplitude: 2.
6. Arithmetic progression of numbers. Amplitude: 2047.

The following table summarizes the errors in the GPU calculation of the MDCT compared to the reference implementation:

Signal            (1/A) Sum(x_i - y_i)   (1/A) Sum|x_i - y_i|   (1/A) Sum(x_i - y_i)^2
Zero              0.0                    0.0                    0.0
Sequence          1.988 x 10^-4          4.308 x 10^-3          1.556 x 10^-4
Random            -4.688 x 10^-3         5.133 x 10^-2          5.656 x 10^3
Sine              2.2718 x 10^-3         7.7763 x 10^-3         1.8326 x 10^-6
White noise       3.0922 x 10^-3         4.8524 x 10^-2         4.408 x 10^-6
Alternating 1/0   -1.2285 x 10^6         1.9754 x 10^7          6.7068 x 10^12

Table 5.1: Summary of MDCT calculation errors for the various signal types. x_i and y_i are the coefficients calculated by the GPU implementation and by the brute-force implementation respectively; A is the average amplitude of the signal.

The following plots show the ratio of error to the correct value for the different test streams.

Figure 5.2: Scaled error in the MDCT computed on the GPU for a stream of zeroes, showing no errors.
Figure 5.3: Scaled error in the MDCT computed on the GPU for the sequence of numbers 0..2047, indicating that the errors increase as the frequency increases.
Figure 5.4: Scaled error in the MDCT computed on the GPU for a set of random numbers, showing a random distribution of errors.
Figure 5.5: Scaled error in the MDCT computed on the GPU for an input of alternating 1 and 0 bits, showing an error distribution similar to Fig. 5.3 that increases as the frequency increases.
Figure 5.6: Scaled error in the MDCT computed on the GPU for a 0.5 kHz sine wave sampled at 4 kHz, showing that the error is concentrated around the centre frequency.
Figure 5.7: Scaled error in the MDCT computed on the GPU for white noise sampled at 2000 Hz, showing a random distribution of errors.
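For reference, the scaled error figures reported in Tables 4.1 and 5.1 are of the form suggested by the column headings: the signed, absolute and squared differences between the GPU and reference coefficients, each summed over the block and divided by the average amplitude A. A minimal sketch, with illustrative names:

    #include <math.h>

    /* Compute the three scaled error metrics for one block of n coefficients. */
    void scaled_errors(const float *gpu, const float *ref, int n, double avg_amp,
                       double *e_signed, double *e_abs, double *e_sq)
    {
        double s = 0.0, a = 0.0, q = 0.0;
        for (int i = 0; i < n; i++) {
            double d = (double)gpu[i] - (double)ref[i];
            s += d;           /* signed error   */
            a += fabs(d);     /* absolute error */
            q += d * d;       /* squared error  */
        }
        *e_signed = s / avg_amp;
        *e_abs    = a / avg_amp;
        *e_sq     = q / avg_amp;
    }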
5.4 Performance

The speed of the OpenCL implementation of the MDCT was compared against the MDCT implementation in the Xiph.org Vorbis codec. The Vorbis implementation was used as a reference because it is an industry standard codec used in several applications. The OpenCL device used was an Nvidia GTX 460 with 1 GB of global memory and 64 KB of memory shared between local memory and GPU cache. The Vorbis implementation was run on a 3.4 GHz Intel QX9650 with 6 GB of RAM.

Figure 5.8: Execution time of the OpenCL MDCT compared to that of Vorbis' MDCT. The OpenCL implementation is 32 times faster than the Vorbis implementation.

5.4.1 Effect of IO

If the time spent transferring data to and from the GPU is factored in, less than a tenth of the total time is spent on computation, with the remainder going to IO. Despite this, the total time required by the GPU implementation is one-third of the time taken by the Vorbis implementation.

Figure 5.9: Breakdown of the time spent performing the MDCT on the GPU. The time is largely dominated by transfers over the PCIe bus.

Figure 5.10: Total execution time of the OpenCL MDCT compared to that of Vorbis' MDCT when the GPU IO times are factored in.

Chapter 6
Conclusion and Future Work

We have implemented a single precision floating point version of the 2048-point FFT and MDCT on a GTX 460 GPU (GF104) with 1 GB of memory using the vendor-neutral OpenCL API. Parallelization was maximized by using the radix-2 DIT FFT and by implementing the MDCT through the FFT. We achieved a 2 times speedup over the current industry standard FFTW for the FFT, and a 32 times speedup over the Xiph Vorbis implementation of the MDCT. Since IO over PCI Express incurs a large time cost, data transfers between the GPU and the host system have been minimized. The scaled error of our implementation relative to the reference implementations (FFTW and Xiph Vorbis) is found to be of the order of 10^-2 to 10^-5, which is acceptably low. The accuracy can be further improved by using double precision instead of single precision. Double precision is an optional feature in OpenCL 1.0; it is available on Nvidia GPUs since the GTX 200 and on high-end AMD GPUs since the 3850. Accuracy could possibly be improved further by avoiding the fixed-point hardware implementations of the sine and cosine functions.

Considering the speed advantage and accuracy of the GPU implementation, it is advisable to implement AAC on the GPU. The high computational throughput per cycle of the GPU implementation could also be exploited for a low-power implementation by using a slower clock and a lower operating voltage. Since transfers between the GPU and the host involve considerable time, it is desirable to minimize the number of data transfers between the host and the GPU. The data transfers could be performed asynchronously with the computation to minimize the impact of the data transfer time on the total computation time. As further work, the other computation-intensive tasks, e.g. the psychoacoustic model and quantization, should be implemented on the GPU for a complete GPU implementation of perceptual audio coding.
Appendix A
FFT Output Plots

Figure A.1: FFT of a stream of zeroes. (a) FFT coefficients calculated on the GPU for an input stream of zeroes; (b) FFT coefficients calculated by FFTW for an input stream of zeroes.
Figure A.2: FFT of the sequence of numbers 0 to 2047. (a) calculated on the GPU; (b) calculated by FFTW.
Figure A.3: FFT of random numbers. (a) calculated on the GPU; (b) calculated by FFTW.
Figure A.4: FFT of a stream of alternating 1's and 0's. (a) calculated on the GPU; (b) calculated with FFTW.
Figure A.5: FFT of a 500 Hz sine wave sampled at 4 kHz. (a) calculated on the GPU; (b) calculated with FFTW.
Figure A.6: FFT of white noise sampled at 2 kHz. (a) FFT coefficients calculated on GPU; (b) FFT coefficients calculated with FFTW (real and complex components against coefficient index).

Appendix B MDCT Output Plots

Figure B.1: MDCT of 2048 zero samples. (a) calculated by the GPU implementation; (b) calculated by the Vorbis implementation on CPU (coefficient value against coefficient index).

Figure B.2: MDCT of the sequence of numbers 0 to 2047. (a) GPU implementation; (b) Vorbis implementation on CPU.

Figure B.3: MDCT of random numbers. (a) GPU implementation; (b) Vorbis implementation on CPU.

Figure B.4: MDCT of an alternating stream of 1 and 0 bits. (a) GPU implementation; (b) Vorbis implementation on CPU.

Figure B.5: MDCT of a 0.5 kHz sine wave sampled at 4 kHz. (a) GPU implementation; (b) Vorbis implementation on CPU.

Figure B.6: MDCT of white noise sampled at 2 kHz. (a) GPU implementation; (b) Vorbis implementation on CPU.