
CN104125466B - GPU-based HEVC parallel decoding method - Google Patents

GPU-based HEVC parallel decoding method

Info

Publication number
CN104125466B
CN104125466B (application CN201410328646.2A)
Authority
CN
China
Prior art keywords
gpu
thread
pixel
image
hevc
Prior art date
Legal status
Expired - Fee Related
Application number
CN201410328646.2A
Other languages
Chinese (zh)
Other versions
CN104125466A (en)
Inventor
梁凡
罗林
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201410328646.2A
Publication of CN104125466A
Application granted
Publication of CN104125466B
Expired - Fee Related
Anticipated expiration


Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a GPU-based HEVC parallel decoding method, comprising: the GPU performs entropy decoding, reordering and inverse quantization on the read bitstream file to obtain transform coefficient matrices, and at the same time parses the acquired bitstream file to obtain motion vectors and reference frames; the GPU processes the transform coefficient matrices with an HEVC parallel inverse transform algorithm to obtain the residual data of the image, and at the same time uses an HEVC parallel motion compensation algorithm to obtain the predicted pixel values of the image from the reference frame positions pointed to by the motion vectors; the GPU sums the residual data and the predicted pixel values of the image, then performs deblocking filtering and sample adaptive offset (SAO) processing to obtain the reconstructed image, and copies the pixel values of the reconstructed image to CPU memory. The invention effectively improves decoding speed and decoding efficiency, and can be widely applied in the field of video coding and decoding.

Description

GPU-based HEVC parallel decoding method

Technical Field

The present invention relates to the field of video coding and decoding, and in particular to a GPU-based HEVC parallel decoding method.

Background Art

With the rapid development of the Internet and mobile communication technology, digital video is moving towards higher definition, higher frame rates and higher compression ratios. Video formats have evolved from 720P to 1080P, and in some cases even 4Kx2K and 8Kx4K ultra-high-definition digital video has appeared. In video applications, transmission bandwidth and storage space are undoubtedly the core resources; how to store high-definition video in limited space and transmit it well over networks with bandwidth bottlenecks is a major problem. High-definition video can bring a higher quality of life, but it inevitably comes with a huge amount of data. For example, for 1080P high-definition video with 1920x1080 pixels in 4:2:0 format, one frame of image amounts to 24.88 Mbit. Such high-definition video creates a difficulty, namely a sharply increased video bit rate. Video coding aims to represent video information with as few bits as possible, but the compression efficiency of the widely used H.264 coding standard still cannot fully meet the application requirements of ultra-high-definition video.

HEVC (High Efficiency Video Coding) is the next-generation video compression coding scheme jointly developed by ISO's MPEG and ITU-T's VCEG. The HEVC standard inherits the coding concepts of the existing H.264 video coding scheme, retains some of its coding techniques, and improves on related techniques, with larger coding unit sizes, more diverse block-based inter/intra prediction selection, more sophisticated interpolation filters, and so on. Compared with previous video coding technology, HEVC offers higher compression efficiency, better video quality, better robustness, stronger error resilience, and is better suited to transmission over IP networks. Compared with the H.264/AVC coding standard, HEVC doubles the compression efficiency when coding high-definition and high-fidelity video images, so that for the same reconstructed image quality after decoding, the bit rate of the video stream is reduced by 50%.

However, the reduction in bit rate comes at the cost of increased codec software complexity. With the adoption of more complex and flexible coding techniques, the complexity of HEVC codec software has increased greatly, and the time spent compressing and decompressing high-definition video increases accordingly, which cannot meet the demanding real-time decoding and playback requirements of applications such as video conferencing and video telephony.

With high-definition video becoming mainstream, relying solely on the CPU clearly cannot achieve real-time decoding of high-definition video well. GPUs have excellent floating-point computing capability and powerful parallel computing capability; if the computationally intensive and relatively complex modules of the decoding algorithm are moved onto the GPU, the problem of real-time decoding can be effectively solved. However, no GPU-based HEVC video codec solution has yet appeared in the industry.

Summary of the Invention

In order to solve the above technical problems, the object of the present invention is to provide a GPU-based HEVC parallel decoding method with high decoding speed and high efficiency.

The technical solution adopted by the present invention to solve its technical problem is: a GPU-based HEVC parallel decoding method, comprising:

A. The GPU performs entropy decoding, reordering and inverse quantization on the read bitstream file to obtain transform coefficient matrices; at the same time, the GPU parses the acquired bitstream file to obtain motion vectors and reference frames;

B. The GPU processes the transform coefficient matrices with an HEVC parallel inverse transform algorithm to obtain the residual data of the image; at the same time, the GPU uses an HEVC parallel motion compensation algorithm to obtain the predicted pixel values of the image from the reference frame positions pointed to by the motion vectors;

C. The GPU sums the residual data of the image and the predicted pixel values of the image, then performs deblocking filtering and sample adaptive offset (SAO) processing to obtain the reconstructed image, and copies the pixel values of the reconstructed image to CPU memory.

Further, the step in step B in which the GPU processes the transform coefficient matrices with the HEVC parallel inverse transform algorithm comprises:

B11. Initialize the GPU and allocate device-side global memory on the GPU for storing the transform coefficient matrices and the residual data;

B12. Set the thread grid size and thread block size, and assign to each transform unit a corresponding number of threads and corresponding thread IDs according to the size of the transform unit;

B13. Read the transform coefficient matrix corresponding to each transform unit from the device-side global memory, then perform on each transform coefficient matrix, according to the thread IDs, a column-parallel one-dimensional inverse DCT (IDCT) followed by a row-parallel one-dimensional IDCT, so as to obtain the residual data of the whole image block;

B14. Copy the computed residual data of each image block back to CPU memory to obtain the residual data of the whole image, then release the device-side global memory space.

Further, step B13 comprises:

B131. Read the transform coefficient matrix corresponding to each transform unit from the device-side global memory;

B132. Perform, according to the thread IDs, a one-dimensional IDCT on all columns of each transform coefficient matrix simultaneously, obtaining the transformed coefficient matrix, and temporarily store the result in the shared memory of the thread block;

B133. Perform, according to the thread IDs, a one-dimensional IDCT on all rows of the transformed coefficient matrix in shared memory simultaneously, obtaining the residual data matrix, and compute the residual data of the whole image block from the residual data matrix.

Further, the step in step B in which the GPU uses the HEVC parallel motion compensation algorithm to obtain the predicted pixel values of the image from the reference frame positions pointed to by the motion vectors comprises:

S1. Initialize the GPU and allocate in the GPU storage space for the motion vector, reference frame and predicted pixel value corresponding to each pixel in inter prediction mode;

S2. Copy the motion vectors and the corresponding reference frame images to the device side, and bind the reference frames to texture memory;

S3. Perform thread configuration, assign a thread ID to the processing of each predicted pixel value, and allocate global memory space on the device side for storing the predicted pixel values;

S4. Each thread, according to its own thread ID and the position in the reference frame pointed to by its motion vector, simultaneously performs direct texture reading or interpolation filtering, thereby obtaining its predicted pixel value;

S5. Copy the predicted pixel values of all threads back to CPU memory, then release the global memory space on the device side.

Further, step S4 is specifically:

Each thread, according to its own thread ID and the position in the reference frame pointed to by its motion vector, simultaneously performs direct reading or interpolation filtering: if the motion vector of the thread points to an integer-pixel position, the pixel value at the reference frame position pointed to by the motion vector is read directly from texture memory and used as the predicted pixel value of the thread; if the motion vector of the thread points to a sub-pixel position, the corresponding luma or chroma sub-pixel interpolation filter formula is selected according to the sub-pixel position for calculation, thereby obtaining the predicted pixel value of the thread.

Further, the luma sub-pixel interpolation filter formula is an 8-tap interpolation filter formula, and the chroma sub-pixel interpolation filter formula is a 4-tap interpolation filter formula.

The beneficial effects of the present invention are: a decoding architecture composed of a CPU and a GPU is constructed, the inverse transform processing and motion compensation processing, which have relatively high decoding complexity, are moved onto the GPU, and a GPU-based HEVC parallel inverse transform algorithm and an HEVC parallel motion compensation algorithm are designed, effectively improving decoding speed and decoding efficiency.

Brief Description of the Drawings

The present invention is further described below with reference to the drawings and embodiments.

Fig. 1 is a flow chart of the steps of the GPU-based HEVC parallel decoding method of the present invention;

Fig. 2 is a flow chart of the processing of the transform coefficient matrices by the GPU using the HEVC parallel inverse transform algorithm in step B of the present invention;

Fig. 3 is a flow chart of step B13 of the present invention;

Fig. 4 is a flow chart of obtaining the predicted pixel values of the image from the reference frame positions pointed to by the motion vectors, using the HEVC parallel motion compensation algorithm on the GPU, in step B of the present invention;

Fig. 5 is a diagram of the HEVC decoding framework of Embodiment 1 of the present invention;

Fig. 6 is a schematic diagram of luma sub-pixel interpolation in the present invention.

Detailed Description

Referring to Fig. 1, a GPU-based HEVC parallel decoding method comprises:

A. The GPU performs entropy decoding, reordering and inverse quantization on the read bitstream file to obtain transform coefficient matrices; at the same time, the GPU parses the acquired bitstream file to obtain motion vectors and reference frames;

B. The GPU processes the transform coefficient matrices with an HEVC parallel inverse transform algorithm to obtain the residual data of the image; at the same time, the GPU uses an HEVC parallel motion compensation algorithm to obtain the predicted pixel values of the image from the reference frame positions pointed to by the motion vectors;

C. The GPU sums the residual data of the image and the predicted pixel values of the image, then performs deblocking filtering and sample adaptive offset (SAO) processing to obtain the reconstructed image, and copies the pixel values of the reconstructed image to CPU memory.
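To illustrate this CPU/GPU split, the following host-side sketch shows how steps B and C might be orchestrated in CUDA C++. It is a minimal sketch under assumed data layouts, not the patent's implementation: the kernels are one-line stubs standing in for the real algorithms of Embodiments 2 and 3, all names are hypothetical, and error checking is omitted.

```cuda
// Host-side sketch of steps A-C; stubs and names are illustrative only.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void idctKernel(const int16_t* coeff, int16_t* resid, int n)
{   // stub for the parallel inverse transform (residual data), see Embodiment 2
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) resid[i] = coeff[i];
}
__global__ void motionCompKernel(const uint8_t* ref, const int16_t* mv,
                                 uint8_t* pred, int n)
{   // stub for the parallel motion compensation (predicted pixels), see Embodiment 3
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pred[i] = ref[i];
}
__global__ void reconstructKernel(const int16_t* resid, const uint8_t* pred,
                                  uint8_t* recon, int n)
{   // step C: sum and clip; deblocking and SAO would follow here
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) recon[i] = (uint8_t)min(max(pred[i] + resid[i], 0), 255);
}

// hCoeff, hMv and hRef are assumed outputs of step A (entropy decoding,
// reordering, dequantization and bitstream parsing); hRecon receives the frame.
void decodeFrameOnGpu(const int16_t* hCoeff, const int16_t* hMv,
                      const uint8_t* hRef, uint8_t* hRecon, int nPix)
{
    int16_t *dCoeff, *dResid, *dMv;
    uint8_t *dRef, *dPred, *dRecon;
    cudaMalloc(&dCoeff, nPix * sizeof(int16_t));
    cudaMalloc(&dResid, nPix * sizeof(int16_t));
    cudaMalloc(&dMv,    nPix * 2 * sizeof(int16_t));
    cudaMalloc(&dRef,   nPix);
    cudaMalloc(&dPred,  nPix);
    cudaMalloc(&dRecon, nPix);
    cudaMemcpy(dCoeff, hCoeff, nPix * sizeof(int16_t), cudaMemcpyHostToDevice);
    cudaMemcpy(dMv,    hMv,    nPix * 2 * sizeof(int16_t), cudaMemcpyHostToDevice);
    cudaMemcpy(dRef,   hRef,   nPix, cudaMemcpyHostToDevice);

    // Step B: the two independent tasks run concurrently in separate streams.
    cudaStream_t sIdct, sMc;
    cudaStreamCreate(&sIdct);
    cudaStreamCreate(&sMc);
    int threads = 256, blocks = (nPix + threads - 1) / threads;
    idctKernel<<<blocks, threads, 0, sIdct>>>(dCoeff, dResid, nPix);
    motionCompKernel<<<blocks, threads, 0, sMc>>>(dRef, dMv, dPred, nPix);
    cudaDeviceSynchronize();

    // Step C: reconstruct, then copy the frame back to CPU memory.
    reconstructKernel<<<blocks, threads>>>(dResid, dPred, dRecon, nPix);
    cudaMemcpy(hRecon, dRecon, nPix, cudaMemcpyDeviceToHost);

    cudaStreamDestroy(sIdct); cudaStreamDestroy(sMc);
    cudaFree(dCoeff); cudaFree(dResid); cudaFree(dMv);
    cudaFree(dRef);   cudaFree(dPred);  cudaFree(dRecon);
}
```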

Referring to Fig. 2, further as a preferred embodiment, the step in step B in which the GPU processes the transform coefficient matrices with the HEVC parallel inverse transform algorithm comprises:

B11. Initialize the GPU and allocate device-side global memory on the GPU for storing the transform coefficient matrices and the residual data;

B12. Set the thread grid size and thread block size, and assign to each transform unit a corresponding number of threads and corresponding thread IDs according to the size of the transform unit;

B13. Read the transform coefficient matrix corresponding to each transform unit from the device-side global memory, then perform on each transform coefficient matrix, according to the thread IDs, a column-parallel one-dimensional IDCT followed by a row-parallel one-dimensional IDCT, so as to obtain the residual data of the whole image block;

B14. Copy the computed residual data of each image block back to CPU memory to obtain the residual data of the whole image, then release the device-side global memory space.

Here, the thread grid size is set to Grid(4,4,1) and the thread block size to Block(16,16,1), i.e. one grid is allocated 16 blocks and each block is allocated 256 threads. The number of threads is then allocated according to the size of the transform unit. An image block consists of at least one transform unit, and an image consists of at least one image block.
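In CUDA terms this configuration might look like the following minimal sketch; the kernel name, its argument and the thread-ID flattening are illustrative assumptions, not part of the patent.

```cuda
// Sketch: Grid(4,4,1) x Block(16,16,1) = 16 blocks of 256 threads (4096 total).
#include <cuda_runtime.h>

__global__ void inverseTransformKernel(int numThreadsNeeded)
{
    // Flatten the 2-D block and grid indices into one global thread ID.
    int tidInBlock = threadIdx.y * blockDim.x + threadIdx.x;         // 0..255
    int blockId    = blockIdx.y * gridDim.x + blockIdx.x;            // 0..15
    int tid        = blockId * blockDim.x * blockDim.y + tidInBlock; // 0..4095
    if (tid >= numThreadsNeeded) return;
    // ... per-transform-unit work would be indexed by tid here ...
}

int main()
{
    dim3 grid(4, 4, 1);    // 16 thread blocks
    dim3 block(16, 16, 1); // 256 threads per block
    inverseTransformKernel<<<grid, block>>>(4096);
    cudaDeviceSynchronize();
    return 0;
}
```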

Referring to Fig. 3, further as a preferred embodiment, step B13 comprises:

B131. Read the transform coefficient matrix corresponding to each transform unit from the device-side global memory;

B132. Perform, according to the thread IDs, a one-dimensional IDCT on all columns of each transform coefficient matrix simultaneously, obtaining the transformed coefficient matrix, and temporarily store the result in the shared memory of the thread block;

B133. Perform, according to the thread IDs, a one-dimensional IDCT on all rows of the transformed coefficient matrix in shared memory simultaneously, obtaining the residual data matrix, and compute the residual data of the whole image block from the residual data matrix.

Referring to Fig. 4, further as a preferred embodiment, the step in step B in which the GPU uses the HEVC parallel motion compensation algorithm to obtain the predicted pixel values of the image from the reference frame positions pointed to by the motion vectors comprises:

S1. Initialize the GPU and allocate in the GPU storage space for the motion vector, reference frame and predicted pixel value corresponding to each pixel in inter prediction mode;

S2. Copy the motion vectors and the corresponding reference frame images to the device side, and bind the reference frames to texture memory;

S3. Perform thread configuration, assign a thread ID to the processing of each predicted pixel value, and allocate global memory space on the device side for storing the predicted pixel values;

S4. Each thread, according to its own thread ID and the position in the reference frame pointed to by its motion vector, simultaneously performs direct texture reading or interpolation filtering, thereby obtaining its predicted pixel value;

S5. Copy the predicted pixel values of all threads back to CPU memory, then release the global memory space on the device side.

Further as a preferred embodiment, step S4 is specifically:

Each thread, according to its own thread ID and the position in the reference frame pointed to by its motion vector, simultaneously performs direct reading or interpolation filtering: if the motion vector of the thread points to an integer-pixel position, the pixel value at the reference frame position pointed to by the motion vector is read directly from texture memory and used as the predicted pixel value of the thread; if the motion vector of the thread points to a sub-pixel position, the corresponding luma or chroma sub-pixel interpolation filter formula is selected according to the sub-pixel position for calculation, thereby obtaining the predicted pixel value of the thread.

Here, reading the pixel value at the reference frame position pointed to by the motion vector from texture memory is implemented by calling the texture fetch function tex2D().
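A minimal device-side sketch of such an integer-pel fetch is given below. It uses the CUDA texture object API rather than the legacy texture reference binding mentioned elsewhere in the text, and the kernel name, motion-vector layout and sample type are assumptions made for illustration.

```cuda
// Sketch: each thread fetches one integer-pel prediction sample from the
// reference frame bound as a 2-D texture of unsigned char samples.
__global__ void integerPelFetch(cudaTextureObject_t refTex, const short2* mv,
                                unsigned char* pred, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    short2 v = mv[y * width + x];   // motion vector assumed in integer-pel units
    // tex2D() with unnormalized coordinates samples the texel at (x + 0.5, y + 0.5).
    pred[y * width + x] =
        tex2D<unsigned char>(refTex, x + v.x + 0.5f, y + v.y + 0.5f);
}
```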

Further as a preferred embodiment, the luma sub-pixel interpolation filter formula is an 8-tap interpolation filter formula, and the chroma sub-pixel interpolation filter formula is a 4-tap interpolation filter formula.

The present invention is described in further detail below in conjunction with specific embodiments.

Embodiment 1

Referring to Fig. 5, the first embodiment of the present invention:

The HEVC decoding framework is shown in Fig. 5. The HEVC decoding process is the reverse of the encoding process: the decoder reads the bitstream file and obtains the bitstream from the NAL (network abstraction layer). Decoding proceeds frame by frame. A frame of image is divided into a number of largest coding units (LCUs), and entropy decoding is performed in raster scan order with the LCU as the basic unit, followed by reordering, to obtain the residual coefficients of the corresponding coding units; the residual coefficients are then inverse quantized and inverse transformed to obtain the image residual data. At the same time, the decoder generates prediction blocks from the header information decoded from the bitstream: in inter prediction mode, a corresponding prediction block is generated from the motion vector and the reference frame; in intra prediction mode, a prediction block is generated from neighbouring prediction units. Next, the prediction block data and the residual block data are summed to obtain the reconstructed image block data; finally, the image block data is processed by deblocking filtering and sample adaptive offset to obtain the reconstructed image, which is output.

Motion compensation describes the difference between neighbouring frames in coding order, i.e. how a block of a previous reference frame moves to a certain position in the current frame; the reconstructed value is obtained by adding the residual value to the value of the equally sized prediction block in the reference frame pointed to by the motion vector. This method is frequently used by video codecs to reduce temporal redundancy in video sequences. Motion compensation is used for image reconstruction and is an indispensable key module in video coding and decoding.

In motion compensation, a frame of image is divided into coding units of different sizes according to the image texture, and prediction units are divided on the basis of the coding units; a prediction unit comprises one luma block and two chroma blocks, and each inter-coded block is predicted from a block of the same size in the reference picture. The accuracy of motion compensation is determined by the accuracy of the motion vectors, which directly affects the quality of the reconstructed image and the size of the bitstream. The motion vector is the amount of translation in the prediction process, obtained by motion estimation at the encoder; the higher the accuracy of the motion vectors, the higher the accuracy of motion compensation. Interpolation filtering is a key technique in motion compensation: the H.264 standard uses a six-tap Wiener filter with a motion compensation accuracy of 1/4 pixel, whereas HEVC adopts a more advanced and efficient interpolation filter based on the discrete cosine transform. By comparison, sub-pixel generation in the HEVC standard is more concise and efficient: only one filter formula and one filtering pass are needed. The luma signal uses an 8-tap DCT-based interpolation filter and the chroma signal a 4-tap DCT-based interpolation filter to interpolate the pixels. However, the large amount of interpolation computation raises the complexity accordingly, and efficiency is reduced. The luma and chroma samples at sub-pixel positions in the reference frame do not actually exist, so their values must be obtained by pixel interpolation with an interpolation filtering algorithm; this kind of motion compensation is motion compensation with sub-pixel accuracy.

Embodiment 2

This embodiment describes the HEVC parallel inverse transform algorithm of the present invention.

The inverse transform module is the process of converting the transform coefficient matrix of the current block into a residual sample matrix, in preparation for the subsequent reconstruction. The inverse transform is performed after inverse quantization and, like it, is processed with the TU (transform unit) as the basic unit; the source data it uses is the result of inverse quantization. When the HEVC decoder of the present invention performs the two-dimensional IDCT, it first performs a one-dimensional inverse discrete cosine transform in the horizontal direction, then a one-dimensional inverse discrete cosine transform in the vertical direction, and finally, through matrix multiplication, converts the transform coefficient matrix into a residual data matrix of the same size, completing the transformation from the frequency domain to the time domain.

The IDCT operations of different transform blocks in a frame are independent of each other, and when the transform coefficient matrix of the same transform unit undergoes the one-dimensional IDCT in the horizontal direction, the columns are independent of each other, so the columns can be computed in parallel. Likewise, when the one-dimensional IDCT in the vertical direction is performed, there is no data dependency between the rows, so parallel computation is possible. The present invention allocates a corresponding number of threads according to the size of the transform coefficient matrix: one thread is allocated per column and all columns undergo the one-dimensional IDCT simultaneously; when this is finished, one thread is allocated per row and the one-dimensional IDCT is computed for all rows simultaneously, which completes the parallel two-dimensional IDCT of the transform coefficient matrix.

Since the size of an HEVC transform unit is 4x4, 8x8, 16x16 or 32x32, the larger the transform unit, the higher the degree of parallelism and the more pronounced the speed-up. For example, for the transform coefficient matrix of a 32x32 transform unit, the one-dimensional IDCT of all 32 columns can first be performed simultaneously, the __syncthreads() function is called for synchronization on completion, and then the one-dimensional IDCT of all 32 rows is performed simultaneously. In addition, the IDCT of the transform coefficient matrices of all transform units can be performed at the same time. To obtain a better speed-up, the present invention directly uses the transform coefficients in global memory obtained after the parallel inverse quantization. A transform unit comprises one luma transform block and two chroma transform blocks, so the inverse transforms of luma and chroma are performed separately; the steps for both are the same.

The GPU-based inverse transform algorithm of the present invention comprises:

(1) At the start of decoding, initialize the GPU, allocate on the GPU the global memory space used to store the residual data obtained after the inverse transform, and at the same time read the transform coefficients obtained from the inverse quantization on the GPU directly from global memory.

(2) Configure the number of threads: set the thread grid size to Grid(4,4,1) and the thread block size to Block(16,16,1), i.e. one grid is allocated 16 blocks and each block 256 threads; then allocate the number of threads according to the size of the transform unit.

(3) Assign the transform coefficient matrix of one transform unit to one thread block for inverse transform processing: first, each thread performs, according to its own thread ID, the one-dimensional IDCT of one column of the transform coefficient matrix, all columns being processed simultaneously, and the __syncthreads() function is called for synchronization; the results are temporarily stored in the shared memory of the thread block. Then the one-dimensional IDCT is performed simultaneously on every row of the coefficient matrix in shared memory, one thread being allocated per row, which completes the two-dimensional IDCT of the transform coefficients and yields the residual data matrix. Performing the inverse transform on the transform coefficient matrices of all transform units at the same time yields the residual data of the whole image block. (A kernel sketch of this step is given after this list.)

(4) Copy the computed residual data of each image block from GPU global memory to CPU memory, thereby obtaining the residual data of the whole image.

(5) Release the global memory space allocated during decoding.
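The following CUDA sketch illustrates step (3) for the simplest case of 4x4 transform units; it is not the patent's implementation. The kernel name, the one-TU-per-block and one-thread-per-column mapping, and the 8-bit shift values (7 after the column pass, 12 after the row pass) are assumptions made for the example.

```cuda
// Sketch: column-parallel then row-parallel 1-D inverse transform of a 4x4 TU,
// one thread block per TU, four worker threads per block, shared memory between passes.
#include <cstdint>

__global__ void idct4x4Kernel(const int16_t* coeff, int16_t* resid, int numTUs)
{
    // HEVC 4x4 core transform matrix; the inverse transform uses its transpose.
    const int T[4][4] = { { 64,  64,  64,  64 },
                          { 83,  36, -36, -83 },
                          { 64, -64, -64,  64 },
                          { 36, -83,  83, -36 } };
    __shared__ int tmp[4][4];

    int tu   = blockIdx.x;                       // one transform unit per block
    int lane = threadIdx.x;                      // 0..3: first one column, then one row
    bool active = (tu < numTUs) && (lane < 4);

    // Pass 1: 1-D inverse transform along column `lane` (shift = 7 for 8-bit video).
    if (active) {
        const int16_t* src = coeff + tu * 16;
        for (int row = 0; row < 4; ++row) {
            int acc = 0;
            for (int k = 0; k < 4; ++k)
                acc += T[k][row] * src[k * 4 + lane];
            tmp[row][lane] = (acc + 64) >> 7;
        }
    }
    __syncthreads();                             // all columns done before the row pass

    // Pass 2: 1-D inverse transform along row `lane` (shift = 12 for 8-bit video).
    if (active) {
        int16_t* dst = resid + tu * 16;
        for (int col = 0; col < 4; ++col) {
            int acc = 0;
            for (int k = 0; k < 4; ++k)
                acc += T[k][col] * tmp[lane][k];
            dst[lane * 4 + col] = (int16_t)((acc + 2048) >> 12);
        }
    }
}
```

With a launch such as `idct4x4Kernel<<<numTUs, 4>>>(dCoeff, dResid, numTUs)`, each 4x4 TU maps to one thread block; the larger TU sizes (8x8 to 32x32) would follow the same column-then-row pattern with more threads per block.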

Embodiment 3

This embodiment describes the HEVC parallel motion compensation algorithm of the present invention.

The principle of inter motion compensation is, simply put, that the motion vector obtained by parsing the bitstream is used to obtain the predicted value from the position it points to in the reference frame: if it points to an integer-pixel position of the reference frame, the value is read directly; if it points to a sub-pixel position, the sub-pixel predicted value must be obtained by pixel interpolation. The predicted value is then added to the image residual value obtained by inverse quantization and inverse transform to obtain the reconstructed image value. In the motion compensation module, the computation of pixel interpolation filtering accounts for roughly 70% of the workload, so the GPU implementation of motion compensation in the present invention mainly concerns pixel interpolation. Motion vectors are inherently continuous, but when motion compensation for inter prediction is performed, in order to improve the accuracy of inter-frame prediction in the encoding process, the motion vectors used when searching for matching blocks have sub-pixel accuracy: luma motion vectors have 1/4-pixel accuracy and chroma motion vectors 1/8-pixel accuracy. Therefore, when a motion vector points to a sub-pixel position of the reference frame, the pixel value at the corresponding position must be obtained by interpolating from the surrounding pixel values.

The coefficients used for the interpolation filtering of the luma sub-pixels are shown in Table 1.

Table 1  Coefficients used for luma sub-pixel interpolation filtering

Sub-pixel position | Interpolation filter coefficients
1/4 pixel          | {-1, 4, -10, 58, 17, -5, 1, 0}
2/4 pixel          | {-1, 4, -11, 40, 40, -11, 4, -1}
3/4 pixel          | {0, 1, -5, 17, 58, -10, 4, -1}

Fig. 6 shows the luma integer-pixel positions and the sub-pixel positions obtained by interpolation; positions denoted by upper-case letters are integer pixels and those denoted by lower-case letters are sub-pixels.

In the HEVC reference software, the position of a pixel is determined by xFracL and yFracL in the parameter table: xFracL represents the fractional part of the horizontal component of the motion vector, and yFracL the fractional part of its vertical component; together they represent the position of the pixel in the HEVC standard. When xFracL and yFracL are both 0, the position pointed to is an integer-pixel position; otherwise it is a sub-pixel position. Once the position is determined, the corresponding interpolation coefficients and the neighbouring pixels in the reference frame are selected and interpolated to obtain the pixel value at that position. The correspondence between xFracL, yFracL and the pixel positions of Fig. 6 is shown in Table 2.
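Assuming the luma motion vector is stored in quarter-pel units, its integer part and the fractional parts xFracL/yFracL can be separated as in the short sketch below; the function itself and its name are illustrative only.

```cuda
// Minimal sketch: splitting a quarter-pel luma motion vector into its
// integer-pel part and the fractional part (xFracL / yFracL of the text).
__host__ __device__ void splitLumaMv(int mvx, int mvy,
                                     int& xIntL, int& xFracL,
                                     int& yIntL, int& yFracL)
{
    xIntL  = mvx >> 2;   // integer-pel horizontal offset
    xFracL = mvx & 3;    // 0 -> integer position, 1..3 -> quarter-pel positions
    yIntL  = mvy >> 2;
    yFracL = mvy & 3;
}
```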

Table 2  Mapping of luma pixel positions

Luma pixel interpolation selects the corresponding interpolation coefficients according to the value of the sub-pixel position. The samples a(0,0), b(0,0), c(0,0), which lie in the same horizontal row as the integer pixels, correspond to the 1/4, 2/4 and 3/4 pixel positions and are computed from the coefficients of Table 1 and the integer pixels A(-3,0), A(-2,0), A(-1,0), A(0,0), A(1,0), A(2,0), A(3,0), A(4,0). Here the variable shift1 equals (BitDepthY - 8), shift2 is set to 6 and shift3 is set to (14 - BitDepthY). The specific calculation formulas are:

a(0,0) = ( -A(-3,0) + 4*A(-2,0) - 10*A(-1,0) + 58*A(0,0) + 17*A(1,0) - 5*A(2,0) + A(3,0) ) >> shift1

b(0,0) = ( -A(-3,0) + 4*A(-2,0) - 11*A(-1,0) + 40*A(0,0) + 40*A(1,0) - 11*A(2,0) + 4*A(3,0) - A(4,0) ) >> shift1

c(0,0) = ( A(-2,0) - 5*A(-1,0) + 17*A(0,0) + 58*A(1,0) - 10*A(2,0) + 4*A(3,0) - A(4,0) ) >> shift1

The samples d(0,0), h(0,0) and n(0,0) correspond to the 1/4, 2/4 and 3/4 positions in the vertical direction; their interpolation also requires the integer pixels in the same vertical column, and is computed as:

d(0,0) = ( -A(0,-3) + 4*A(0,-2) - 10*A(0,-1) + 58*A(0,0) + 17*A(0,1) - 5*A(0,2) + A(0,3) ) >> shift1

h(0,0) = ( -A(0,-3) + 4*A(0,-2) - 11*A(0,-1) + 40*A(0,0) + 40*A(0,1) - 11*A(0,2) + 4*A(0,3) - A(0,4) ) >> shift1

n(0,0) = ( A(0,-2) - 5*A(0,-1) + 17*A(0,0) + 58*A(0,1) - 10*A(0,2) + 4*A(0,3) - A(0,4) ) >> shift1

The sample values a(0,0), b(0,0), c(0,0), d(0,0), h(0,0) and n(0,0) can be derived directly from the integer pixels and the interpolation filter coefficients in a single step; when a motion vector points to one of these positions, the corresponding pixel value can be obtained with the above formulas.
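As an illustration, a device function applying the Table 1 coefficients horizontally might look like the sketch below; the function name, argument layout and the restriction to 8-bit samples (so that shift1 = BitDepthY - 8 = 0) are assumptions, not part of the patent.

```cuda
// Sketch: horizontal 8-tap luma interpolation for the quarter-pel positions
// a, b, c of Table 1 (frac = 1, 2, 3), 8-bit reference samples.
__device__ int lumaInterpH(const unsigned char* ref, int stride,
                           int x, int y, int frac)
{
    // Table 1: 8-tap luma filter coefficients for the 1/4, 2/4, 3/4 positions.
    const int coef[3][8] = { { -1, 4, -10, 58, 17,  -5, 1,  0 },
                             { -1, 4, -11, 40, 40, -11, 4, -1 },
                             {  0, 1,  -5, 17, 58, -10, 4, -1 } };
    int acc = 0;
    for (int k = 0; k < 8; ++k)                 // taps over A(x-3..x+4, y)
        acc += coef[frac - 1][k] * ref[y * stride + (x - 3 + k)];
    const int shift1 = 0;                       // BitDepthY - 8 = 0 for 8-bit video
    return acc >> shift1;
}
```

The vertical positions d, h and n would use the same coefficients with the taps running over A(x, y-3..y+4).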

The values of the sub-pixels at the other positions must be obtained in two steps.

Calculation of the pixel values of e(0,0), i(0,0), p(0,0): first the values a(0,-3), a(0,-2), a(0,-1), a(0,0), a(0,1), a(0,2), a(0,3), a(0,4) are obtained as described above for the sub-pixel positions in the same horizontal or vertical line as the integer pixels; the following calculations then give:

e(0,0) = ( -a(0,-3) + 4*a(0,-2) - 10*a(0,-1) + 58*a(0,0) + 17*a(0,1) - 5*a(0,2) + a(0,3) ) >> shift2

i(0,0) = ( -a(0,-3) + 4*a(0,-2) - 11*a(0,-1) + 40*a(0,0) + 40*a(0,1) - 11*a(0,2) + 4*a(0,3) - a(0,4) ) >> shift2

p(0,0) = ( a(0,-2) - 5*a(0,-1) + 17*a(0,0) + 58*a(0,1) - 10*a(0,2) + 4*a(0,3) - a(0,4) ) >> shift2

The calculation of the pixel values f(0,0), j(0,0), q(0,0) likewise first requires the values b(0,-3), b(0,-2), b(0,-1), b(0,0), b(0,1), b(0,2), b(0,3), b(0,4), which are then processed with the interpolation filter parameters as follows:

f(0,0) = ( -b(0,-3) + 4*b(0,-2) - 10*b(0,-1) + 58*b(0,0) + 17*b(0,1) - 5*b(0,2) + b(0,3) ) >> shift2

j(0,0) = ( -b(0,-3) + 4*b(0,-2) - 11*b(0,-1) + 40*b(0,0) + 40*b(0,1) - 11*b(0,2) + 4*b(0,3) - b(0,4) ) >> shift2

q(0,0) = ( b(0,-2) - 5*b(0,-1) + 17*b(0,0) + 58*b(0,1) - 10*b(0,2) + 4*b(0,3) - b(0,4) ) >> shift2

The calculation of g(0,0), k(0,0), r(0,0) first requires the values c(0,-3), c(0,-2), c(0,-1), c(0,0), c(0,1), c(0,2), c(0,3), c(0,4), after which:

g(0,0) = ( -c(0,-3) + 4*c(0,-2) - 10*c(0,-1) + 58*c(0,0) + 17*c(0,1) - 5*c(0,2) + c(0,3) ) >> shift2

k(0,0) = ( -c(0,-3) + 4*c(0,-2) - 11*c(0,-1) + 40*c(0,0) + 40*c(0,1) - 11*c(0,2) + 4*c(0,3) - c(0,4) ) >> shift2

r(0,0) = ( c(0,-2) - 5*c(0,-1) + 17*c(0,0) + 58*c(0,1) - 10*c(0,2) + 4*c(0,3) - c(0,4) ) >> shift2

When the position pointed to by the motion vector MV is exactly an integer pixel of the reference frame, the value of that pixel is the predicted pixel value. In this way, the predicted pixel values of the whole inter prediction block can be computed on the GPU, one per thread, from the corresponding MV and the reference frame.

The sub-pixel interpolation of chroma follows the same principle as luma, except that chroma uses a 4-tap interpolation filter; it is therefore not described in detail here. The coefficients used for its interpolation are shown in Table 3.

Table 3  Coefficients used for chroma sub-pixel interpolation filtering

Sub-pixel position | Interpolation filter coefficients
1/8 pixel          | {-2, 58, 10, -2}
2/8 pixel          | {-4, 54, 16, -2}
3/8 pixel          | {-6, 46, 28, -4}
4/8 pixel          | {-4, 36, 36, -4}
5/8 pixel          | {-4, 28, 46, -6}
6/8 pixel          | {-2, 16, 54, -4}
7/8 pixel          | {-2, 10, 58, -2}
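For symmetry with the luma example above, a horizontal 4-tap chroma filter could be sketched as follows; again the function name and the 8-bit assumption are illustrative only, not the patent's implementation.

```cuda
// Sketch: horizontal 4-tap chroma interpolation for the 1/8..7/8 positions
// of Table 3 (frac = 1..7), 8-bit reference samples.
__device__ int chromaInterpH(const unsigned char* ref, int stride,
                             int x, int y, int frac)
{
    // Table 3: 4-tap chroma filter coefficients for the 1/8..7/8 positions.
    const int coef[7][4] = { { -2, 58, 10, -2 }, { -4, 54, 16, -2 },
                             { -6, 46, 28, -4 }, { -4, 36, 36, -4 },
                             { -4, 28, 46, -6 }, { -2, 16, 54, -4 },
                             { -2, 10, 58, -2 } };
    int acc = 0;
    for (int k = 0; k < 4; ++k)                 // taps over (x-1..x+2, y)
        acc += coef[frac - 1][k] * ref[y * stride + (x - 1 + k)];
    return acc;                                 // >> (BitDepthC - 8) = 0 for 8-bit video
}
```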

In the HEVC coding standard, motion compensation is performed in units of image blocks, but the actual basic unit of processing is the pixel; there is no interdependence between pixels during motion compensation, and it suffices to compute the corresponding predicted pixel value from the position in the reference frame pointed to by the MV and add it to the residual pixel value to obtain the reconstructed pixel value. The basic unit handled by each thread is a pixel derived from the PU prediction block, and each thread computes one predicted pixel value; in this way the inter-predicted pixel values of the whole image block can be computed simultaneously.

The HEVC parallel motion compensation algorithm of the present invention comprises:

(1) First, at the start of decoding, initialize the GPU and allocate in the GPU the storage space used to store the motion vector, reference frame and generated predicted pixel value corresponding to each pixel in inter prediction mode.

(2) Then copy the motion vectors and the corresponding reference frame images to the device side with the cudaMemcpy function, and call the cudaBindTexture function to bind the reference frames to texture memory. Reading data from texture memory is fast, with negligible overhead, which further improves the efficiency of the operation.

(3) Perform the thread configuration: assign a thread ID to the processing of each predicted pixel value, and allocate on the device side the global memory space used to store the generated predicted pixel values. Set the thread grid size to Grid(4,4,1) and the thread block size to Block(16,16,1), i.e. one grid is allocated 16 blocks and each block 256 threads.

(4) Obtain the predicted pixel value from the position in the reference frame pointed to by the motion vector: if it points to an integer-pixel position, the value read directly at the reference frame position pointed to by the motion vector is the predicted pixel value; if it points to a sub-pixel position, the corresponding sub-pixel interpolation filter formula is selected according to the pixel position and evaluated, and the resulting sub-pixel value is the predicted pixel value. Each thread performs the same steps according to its own thread ID to obtain its predicted pixel value.

(5) Add the obtained predicted pixel value to the corresponding residual pixel value to obtain the reconstructed pixel value, and perform the safety check of data clipping.

(6) Copy the result back to host memory and release the storage space on the device side. (A kernel sketch of steps (3)-(5) is given after this list.)
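The following CUDA kernel sketch illustrates steps (3)-(5) with one thread per pixel. It is a simplified illustration under assumed data layouts: the kernel name, the per-pixel quarter-pel motion-vector array and the stand-in integer fetch for the sub-pel branch (the real 8-tap filtering is shown in the luma filter sketch above) are not part of the patent.

```cuda
// Sketch of steps (3)-(5): one thread per pixel, integer-pel prediction via a
// texture fetch, then reconstruction with clipping to the 8-bit range.
#include <cstdint>

__global__ void motionCompKernel(cudaTextureObject_t refTex, const short2* mvQpel,
                                 const int16_t* resid, unsigned char* recon,
                                 int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int idx = y * width + x;

    short2 mv = mvQpel[idx];                    // luma MV assumed in quarter-pel units
    int intX = x + (mv.x >> 2), fracX = mv.x & 3;
    int intY = y + (mv.y >> 2), fracY = mv.y & 3;

    int pred;
    if (fracX == 0 && fracY == 0) {
        // Integer-pel position: read the reference sample directly from texture memory.
        pred = tex2D<unsigned char>(refTex, intX + 0.5f, intY + 0.5f);
    } else {
        // Sub-pel position: the 8-tap interpolation of the luma filter sketch
        // above would be applied here; a nearest-integer fetch stands in for it.
        pred = tex2D<unsigned char>(refTex, intX + 0.5f, intY + 0.5f);
    }

    // Step (5): reconstruction = prediction + residual, clipped to [0, 255].
    int rec = pred + resid[idx];
    recon[idx] = (unsigned char)min(max(rec, 0), 255);
}
```

A launch matching the Grid(4,4,1) x Block(16,16,1) configuration of step (3) covers a 64x64 region (e.g. one LCU); larger frames would use additional grid dimensions or repeated launches.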

Compared with the prior art, the present invention constructs a decoding architecture composed of a CPU and a GPU, moves the inverse transform processing and motion compensation processing, which have relatively high decoding complexity, onto the GPU, and designs a GPU-based HEVC parallel inverse transform algorithm and an HEVC parallel motion compensation algorithm, effectively improving decoding speed and decoding efficiency.

The preferred implementation of the present invention has been described in detail above, but the invention is not limited to the described embodiments; those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the present invention, and these equivalent variations or substitutions are all included within the scope defined by the claims of the present application.

Claims (5)

1. A GPU-based HEVC parallel decoding method, characterized by comprising:
A. the GPU performs entropy decoding, reordering and inverse quantization on the read bitstream file to obtain transform coefficient matrices, and at the same time parses the acquired bitstream file to obtain motion vectors and reference frames;
B. the GPU processes the transform coefficient matrices with an HEVC parallel inverse transform algorithm to obtain the residual data of the image, and at the same time uses an HEVC parallel motion compensation algorithm to obtain the predicted pixel values of the image from the reference frame positions pointed to by the motion vectors;
C. the GPU sums the residual data of the image and the predicted pixel values of the image, then performs deblocking filtering and sample adaptive offset processing to obtain the reconstructed image, and copies the pixel values of the reconstructed image to CPU memory;
wherein, in step B, the step in which the GPU uses the HEVC parallel motion compensation algorithm to obtain the predicted pixel values of the image from the reference frame positions pointed to by the motion vectors comprises:
S1. initializing the GPU, and allocating in the GPU storage space for the motion vector, reference frame and predicted pixel value corresponding to each pixel in inter prediction mode;
S2. copying the motion vectors and the corresponding reference frame images to the device side, and binding the reference frames to texture memory;
S3. performing thread configuration, assigning a thread ID to the processing of each predicted pixel value, and allocating global memory space on the device side for storing the predicted pixel values;
S4. each thread, according to its own thread ID and the position in the reference frame pointed to by its motion vector, simultaneously performing direct texture reading or interpolation filtering, thereby obtaining its predicted pixel value;
S5. copying the predicted pixel values of all threads back to CPU memory, and then releasing the global memory space on the device side.

2. The GPU-based HEVC parallel decoding method according to claim 1, characterized in that, in step B, the step in which the GPU processes the transform coefficient matrices with the HEVC parallel inverse transform algorithm comprises:
B11. initializing the GPU, and allocating device-side global memory on the GPU for storing the transform coefficient matrices and the residual data;
B12. setting the thread grid size and thread block size, and assigning to each transform unit a corresponding number of threads and corresponding thread IDs according to the size of the transform unit;
B13. reading the transform coefficient matrix corresponding to each transform unit from the device-side global memory, and then performing on each transform coefficient matrix, according to the thread IDs, a column-parallel one-dimensional IDCT followed by a row-parallel one-dimensional IDCT, so as to obtain the residual data of the whole image block;
B14. copying the computed residual data of each image block back to CPU memory to obtain the residual data of the whole image, and then releasing the device-side global memory space.

3. The GPU-based HEVC parallel decoding method according to claim 2, characterized in that step B13 comprises:
B131. reading the transform coefficient matrix corresponding to each transform unit from the device-side global memory;
B132. performing, according to the thread IDs, a one-dimensional IDCT on all columns of each transform coefficient matrix simultaneously, obtaining the transformed coefficient matrix and temporarily storing the result in the shared memory of the thread block;
B133. performing, according to the thread IDs, a one-dimensional IDCT on all rows of the transformed coefficient matrix in shared memory simultaneously, obtaining the residual data matrix, and computing the residual data of the whole image block from the residual data matrix.

4. The GPU-based HEVC parallel decoding method according to claim 1, characterized in that step S4 is specifically: each thread, according to its own thread ID and the position in the reference frame pointed to by its motion vector, simultaneously performs direct reading or interpolation filtering: if the motion vector of the thread points to an integer-pixel position, the pixel value at that reference frame position is read directly from texture memory and used as the predicted pixel value of the thread; if the motion vector of the thread points to a sub-pixel position, the corresponding luma or chroma sub-pixel interpolation filter formula is selected according to the sub-pixel position for calculation, so as to obtain the predicted pixel value of the thread.

5. The GPU-based HEVC parallel decoding method according to claim 4, characterized in that the luma sub-pixel interpolation filter formula is an 8-tap interpolation filter formula, and the chroma sub-pixel interpolation filter formula is a 4-tap interpolation filter formula.
CN201410328646.2A 2014-07-10 2014-07-10 A kind of HEVC parallel decoding methods based on GPU Expired - Fee Related CN104125466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410328646.2A CN104125466B (en) 2014-07-10 2014-07-10 A kind of HEVC parallel decoding methods based on GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410328646.2A CN104125466B (en) 2014-07-10 2014-07-10 A kind of HEVC parallel decoding methods based on GPU

Publications (2)

Publication Number Publication Date
CN104125466A CN104125466A (en) 2014-10-29
CN104125466B true CN104125466B (en) 2017-10-10

Family

ID=51770711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410328646.2A Expired - Fee Related CN104125466B (en) 2014-07-10 2014-07-10 A kind of HEVC parallel decoding methods based on GPU

Country Status (1)

Country Link
CN (1) CN104125466B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104506867B (en) * 2014-12-01 2017-07-21 北京大学 Sample point self-adapted offset parameter method of estimation and device
CN104469488B (en) * 2014-12-29 2018-02-09 北京奇艺世纪科技有限公司 Video encoding/decoding method and system
CN104780377B (en) * 2015-03-18 2017-12-15 同济大学 A kind of parallel HEVC coded systems and method based on Distributed Computer System
CN105245896A (en) * 2015-10-09 2016-01-13 传线网络科技(上海)有限公司 HEVC (High Efficiency Video Coding) parallel motion compensation method and device
CN106658012B (en) * 2017-01-06 2020-06-19 华南理工大学 Parallel pipeline task division method for VP9 decoder
WO2019076138A1 (en) 2017-10-16 2019-04-25 Huawei Technologies Co., Ltd. METHOD AND APPARATUS FOR ENCODING
SG11202008036TA (en) 2018-02-23 2020-09-29 Huawei Technologies Co Ltd Position dependent spatial varying transform for video coding
PH12020551872A1 (en) 2018-05-31 2023-03-13 Huawei Tech Co Ltd Spatially varying transform with adaptive transform type
CN109451322B (en) * 2018-09-14 2021-02-02 北京航天控制仪器研究所 Accelerated Implementation Method of DCT Algorithm and DWT Algorithm Based on CUDA Architecture for Image Compression
CN111541901B (en) * 2020-05-11 2022-05-27 网易(杭州)网络有限公司 Picture decoding method and device
CN113068049A (en) * 2021-03-16 2021-07-02 上海富瀚微电子股份有限公司 Fractional pixel motion estimation apparatus
CN113965761B (en) * 2021-10-22 2023-06-02 长安大学 HTJ2K image compression method, device and equipment based on GPU
CN118037870B (en) * 2024-04-08 2024-06-07 中国矿业大学 Zdepth-compatible parallelization depth image compression algorithm, zdepth-compatible parallelization depth image compression device and terminal equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9648325B2 (en) * 2007-06-30 2017-05-09 Microsoft Technology Licensing, Llc Video decoding implementations for a graphics processing unit
KR101710001B1 (en) * 2012-08-10 2017-02-27 한국전자통신연구원 Apparatus and Method for JPEG2000 Encoding/Decoding based on GPU

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102932003A (en) * 2012-09-07 2013-02-13 上海交通大学 Accelerated QC-LDPC (Quasi-Cyclic Low-Density Parity-Check Code) decoding method based on GPU (Graphics Processing Unit) framework
CN102984522A (en) * 2012-12-14 2013-03-20 深圳百科信息技术有限公司 Brightness transformation domain intra-frame prediction coding and decoding method and system
CN103297777A (en) * 2013-05-23 2013-09-11 广州高清视信数码科技股份有限公司 Method and device for increasing video encoding speed
CN103888771A (en) * 2013-12-30 2014-06-25 中山大学深圳研究院 Parallel video image processing method based on GPGPU technology
CN103763569A (en) * 2014-01-06 2014-04-30 上海交通大学 HEVC fine grit parallel prediction method based on first input first output queues
CN103747262A (en) * 2014-01-08 2014-04-23 中山大学 Motion estimation method based on GPU (Graphic Processing Unit)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Fast Block Partition Algorithm for HEVC; Jiajia Lu, Fan Liang; Information, Communications and Signal Processing (ICICS); 2013-12-13; entire document *
Cooperative CPU+GPU Deblocking Filter Parallelization for High Performance HEVC Video Codecs; Diego F. de Souza; Acoustics, Speech and Signal Processing (ICASSP); 2014-05-09; pp. 4993-4997 *
A CPU+GPU-based AVS parallel video encoding method; Zou Binbin, Liang Fan; Journal of Shanghai University; 2013-06-30; Vol. 19, No. 3; pp. 235-239 *

Also Published As

Publication number Publication date
CN104125466A (en) 2014-10-29

Similar Documents

Publication Publication Date Title
CN104125466B (en) A kind of HEVC parallel decoding methods based on GPU
CN111819852B (en) Method and device for residual sign prediction in transform domain
CN114145016B (en) Matrix Weighted Intra Prediction of Video Signals
TW201943278A (en) Multiple transforms adjustment stages for video coding
KR20200125732A (en) Method, apparatus and medium for decoding or encoding
JP2009531980A (en) Method for reducing the computation of the internal prediction and mode determination process of a digital video encoder
JP2025061302A (en) Method and apparatus for color conversion in VVC
JP7518065B2 (en) Video image component prediction method and apparatus - Computer storage medium
JP2017537491A (en) Scalable conversion hardware architecture with improved transposition buffer
JP7631476B2 (en) Information processing method, device, equipment, and storage medium
US20220060702A1 (en) Systems and methods for intra prediction smoothing filter
CN115836525B (en) Video encoding, decoding method and apparatus for prediction from multiple cross components
CN115280770B (en) Method and apparatus for encoding or decoding video
WO2023236965A1 (en) Cross component prediction of chroma samples
CN116114246B (en) Intra-frame prediction smoothing filter system and method
CN116723328A (en) Video coding method, device, equipment and storage medium
US12143620B2 (en) Filtering methods for angular intra prediction
WO2023048646A9 (en) Methods and systems for performing combined inter and intra prediction
JP2022548555A (en) Motion compensation method for video coding
CN115443655A (en) Methods for handling adaptive color transforms and low-frequency non-separable transforms in video coding
CN111479111A (en) Method, device and video codec device for determining image display order

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171010

Termination date: 20210710