CN104268021A

CN104268021A - Graphic processor based RS (Reed-Solomon) decoding method

Info

Publication number: CN104268021A
Application number: CN201410468570.3A
Authority: CN
Inventors: 吴宪云; 曹金彦; 李云松; 雷杰; 王珂俨; 刘凯
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2014-09-15
Filing date: 2014-09-15
Publication date: 2015-01-07

Abstract

The invention discloses a graphic processor based RS (Reed-Solomon) decoding method. The defects that the CPU (Central Processing Unit) is low in processing speed and is not suitable for real-time transmission of large-scale data in the prior art are overcome. The implementation steps comprise step 1, initializing a computer; step 2, inputting image code stream data; step 3, reading the image code stream data; step 4, obtaining output image code stream data; step 5, performing parallel decoding; step 6, obtaining decoding image code stream data; step 7, outputting the decoding image code stream data. According to the graphic processor based RS decoding method, the correctness of the data transmission can be ensured, the decoding speed is high, and the graphic processor based RS decoding method is suitable for high-speed error correction and decoding in the process of the real-time transmission of the large-scale data.

Description

RS Decoding Method Based on Graphics Processor

技术领域technical field

本发明属于图像处理技术领域，更进一步涉及信道纠错解码技术领域中的一种基于图形处理器的里德-索罗蒙码(Reed-Solomon，RS)解码方法。本发明可用于大规模数据在信道实时传输过程中快速进行纠错解码。The invention belongs to the technical field of image processing, and further relates to a graphics processor-based Reed-Solomon (Reed-Solomon, RS) decoding method in the technical field of channel error correction decoding. The invention can be used for rapid error correction and decoding in the process of real-time channel transmission of large-scale data.

背景技术Background technique

在通信系统中，任何的信号都需要经过信道来进行传输。数字信号在信道的传输过程中,往往会受到各种干扰使得数据流经过信道传输后，到达接收端会产生误码。因此,为了保证数据传输的正确性与可靠性，通常采用差错控制的方法来对信息中的传输错误进行处理。RS码就是这样一种码，由于它不仅可以纠正随机错误，而且还可以对突发错误的进行纠错，所以被广泛用于通信系统、广播系统以及数据存储系统当中。In a communication system, any signal needs to be transmitted through a channel. During the transmission process of the digital signal in the channel, it is often subject to various interferences, so that after the data stream is transmitted through the channel, it will generate bit errors when it arrives at the receiving end. Therefore, in order to ensure the correctness and reliability of data transmission, error control methods are usually used to deal with transmission errors in information. RS code is such a code, because it can not only correct random errors, but also correct burst errors, so it is widely used in communication systems, broadcasting systems and data storage systems.

但是随着信息化时代的发展、大数据时代的到来，传统的对数据进行处理的方式已经不能满足时代的需求，如何对大量数据进行高效、快速的处理已经成为各类研究机构致力解决的问题。在当今的信息化社会中，不仅对数据传输的可靠性的要求在不断提高，如何快速地传递正确的数据也渐渐地被人们所关注。所以在通过纠错码技术提高信息传输可靠性的基础上，将精力集中放在如何提高纠错码的纠错速度上，从而诞生了使用GPU的高性能计算提高RS编解码速度的想法。GPU高性能计算是当前一个热门的研究领域，通过GPU的并行架构与CPU相结合，有效地减少计算时间、提高运算效率的一种运算方案。However, with the development of the information age and the advent of the big data era, the traditional way of processing data can no longer meet the needs of the times. How to efficiently and quickly process large amounts of data has become a problem that various research institutions are committed to solving. . In today's information society, not only the requirements for the reliability of data transmission are constantly increasing, but also how to quickly transmit correct data has gradually attracted people's attention. Therefore, on the basis of improving the reliability of information transmission through error-correcting code technology, we focus on how to improve the error-correcting speed of error-correcting codes, thus the idea of using GPU high-performance computing to improve the speed of RS encoding and decoding was born. GPU high-performance computing is currently a hot research field. Through the combination of GPU parallel architecture and CPU, it is a computing solution that can effectively reduce computing time and improve computing efficiency.

北京邮电大学的专利申请“一种用于10GEPON的RS解码装置及方法”(公开号：CN102546109A，申请号：2011104487713，申请日：2011年12月28日)中公开了一种用于10GEPON的RS解码的方法。该方法主要分为校正子计算、关键方程求解、错误位置搜索和矫正错误等环节。首先对接收数据进行校正子计算，然后对关键方程求解，得到错误位置多项式σ(x)和错误值多项式ω(x)，然后利用错误位置多项式进行钱搜索确定错误位置多项式和利用错误值多项式进行错误值计算并纠错。其中关键方程求解是难点，通常所采用的算法有伯利坎普-梅西算法(Berlekamp-Massey Algorithm，BM)、修正的欧几里得算法(ModifiedEuclidean Algorithm，ME)及频率译码算法等。BM算法与ME算法均为时域译码算法，实时性优于频域译码算法。同BM算法相比，ME算法具有运算结构规整、关键路径时延较小、易于脉动结构实现等优点，被认为是高速数据通信中RS译码的理想算法。该方法可以实现RS解码与数据速率相协调，并实现多字节的并行处理。但是，该方法存在的不足是，其一，分组码要求精确的帧同步，由于该方法分组码按帧(组)进行传送，只有整组码字都接收完成后，才能开始译码，这样会给系统带来一定的延迟，如果是大的分组码，则延迟会比较大；其二，在对数据进行访问时，采用传统的数据访问方法，在对大规模数据访问时，速度很慢；其三，在实现RS解码时，采用串行处理，在该串行解码的设计中，由于大量输入的码字是按顺序进行解码以及部分重复的运算，没有充分利用中央处理器CPU，造成了资源的浪费，效率偏低，输入数据较慢，计算时间过长，运算效率不高，尤其是对大规模数据不能进行高效快速的处理。The patent application of Beijing University of Posts and Telecommunications "An RS decoding device and method for 10GEPON" (publication number: CN102546109A, application number: 2011104487713, application date: December 28, 2011) discloses a RS for 10GEPON The method of decoding. The method is mainly divided into syndrome calculation, key equation solving, error location search and error correction. First, the syndrome calculation is performed on the received data, and then the key equation is solved to obtain the error position polynomial σ(x) and the error value polynomial ω(x), and then the error position polynomial is used to search for money to determine the error position polynomial and the error value polynomial is used to perform Error values are calculated and corrected. Among them, solving the key equation is difficult, and the commonly used algorithms include Berlekamp-Massey Algorithm (BM), Modified Euclidean Algorithm (ME) and frequency decoding algorithm. Both the BM algorithm and the ME algorithm are time-domain decoding algorithms, and their real-time performance is better than that of frequency-domain decoding algorithms. Compared with the BM algorithm, the ME algorithm has the advantages of regular operation structure, small critical path delay, and easy realization of pulsating structure, etc. It is considered to be an ideal algorithm for RS decoding in high-speed data communication. The method can coordinate RS decoding and data rate, and realize multi-byte parallel processing. But, the deficiency that this method exists is, one, block code requires accurate frame synchronization, because this method block code transmits by frame (group), only after whole group of codewords all receive and finish, just can start to decode, will like this It brings a certain delay to the system. If it is a large block code, the delay will be relatively large; second, when accessing data, the traditional data access method is used, and the speed is very slow when accessing large-scale data; Its three, when realizing RS decoding, adopt serial processing, in the design of this serial decoding, because a large number of input codewords are decoded in order and partly repeated operation, do not make full use of central processing unit CPU, caused Waste of resources, low efficiency, slow input data, long calculation time, low calculation efficiency, especially large-scale data cannot be processed efficiently and quickly.

航天恒星科技有限公司的专利申请“一种用于空间通信的高速并行RS译码方法”(公开号：CN101969358B,申请号：2010102978843，申请日：2010年9月29日)公开了一种用于空间通信的高速并行RS译码方法。该方法通过在解交织与交织过程中采用乒乓操作完成输入数据在FIFO中的填充和获取，而在译码过程中采用多路并行流水的方式，通过这种复合结构保证了系统性能的最大化和实现资源的最小化，并且可以适应1至8的任意交织深度；通过采取多路并行RS译码、优化有限域乘法器的实现逻辑、等措施，大幅度提高了译码速率，可以直接应用于高码速率遥感卫星地面接收系统，通过采用模块化设计，在需要时可以进一步增加并行度提高性能。但是，该方法存在的不足是，其一，不同内核函数可以在GPU上不能够同时运行，并行计算性能不高；其二，GPU不能没有CPU进行调度来建立新的工作；其三，在大型集群不能进行内存错误检查和修复ECC；其四，图形处理器GPU资源利用率不高。The patent application "A high-speed parallel RS decoding method for space communication" (publication number: CN101969358B, application number: 2010102978843, application date: September 29, 2010) of Aerospace Star Technology Co., Ltd. discloses a method for High-speed parallel RS decoding method for space communication. This method uses ping-pong operation in the process of de-interleaving and interleaving to complete the filling and acquisition of input data in FIFO, and uses multi-channel parallel pipeline in the decoding process. This composite structure ensures the maximum performance of the system. And realize the minimization of resources, and can adapt to any interleaving depth from 1 to 8; by taking measures such as multi-channel parallel RS decoding, optimizing the realization logic of the finite field multiplier, etc., the decoding rate is greatly improved, and can be directly applied Based on the high code rate remote sensing satellite ground receiving system, by adopting a modular design, the parallelism can be further increased to improve performance when needed. However, the disadvantages of this method are: firstly, different kernel functions cannot run simultaneously on the GPU, and the performance of parallel computing is not high; secondly, the GPU cannot be scheduled without the CPU to create new jobs; thirdly, in large-scale The cluster cannot perform memory error checking and repair ECC; Fourth, the resource utilization rate of the graphics processor GPU is not high.

发明内容Contents of the invention

本发明的目的在于针对克服上述现有技术的不足，提供一种基于图形处理器GPU的RS解码方法。该方法可以保证信道传输中数据的稳定性和可靠性，又提高了大规模数据在实时传输过程中的效率。The object of the present invention is to provide a RS decoding method based on a graphics processor GPU to overcome the above-mentioned deficiencies in the prior art. The method can ensure the stability and reliability of data in channel transmission, and improves the efficiency of large-scale data in real-time transmission.

实现本发明目的的具体思路是，首先对输入码流数据进行翻转、奇偶分离、转置，然后对输入码流数据进行并行解码，再对解码后的码流进行逆转置、奇偶合并、翻转，得到最终的解码码流。The specific idea of realizing the purpose of the present invention is to first flip, parity-separate, and transpose the input code stream data, and then perform parallel decoding on the input code stream data, and then perform inverse transposition, parity-merging, and flipping on the decoded code stream, Get the final decoded code stream.

本发明的实现的具体步骤如下：The concrete steps of the realization of the present invention are as follows:

(1)初始化计算机：(1) Initialize the computer:

(1a)将多个图形处理器GPU与计算机连接起来；(1a) connecting a plurality of graphics processing units (GPUs) to a computer;

(1b)计算机分配图形处理器GPU中的全局存储器；(1b) the computer allocates the global memory in the graphics processing unit GPU;

(2)输入图像码流数据：(2) Input image stream data:

将待解码的里德-索罗蒙码RS图像码流数据输入到计算机内存中；Inputting the Reed-Solomon code RS image stream data to be decoded into the computer memory;

(3)读取图像码流数据：(3) Read image stream data:

(3a)采用异步传输的方式，从计算机内存中读取待解码的里德-索罗蒙码RS图像码流数据；(3a) Using asynchronous transmission mode, read the Reed-Solomon code RS image code stream data to be decoded from the computer memory;

(3b)采用异步传输的方式，将读取的待解码的里德-索罗蒙码RS图像码流数据存储在全局存储器中；(3b) adopting asynchronous transmission mode, storing the read Reed-Solomon code RS image code stream data to be decoded in the global memory;

(4)获得输出图像码流数据：(4) Obtain the output image stream data:

(4a)采用合并访存的方式，访问全局存储器，将全局存储器中的里德-索罗蒙码RS图像码流数据，按每255比特为一组分为多组，每两组图像码流数据按照并行处理的方式翻转一次，得到翻转后的图像码流数据；(4a) Adopt the method of merging memory access to access the global memory, and divide the Reed-Solomon code RS image code stream data in the global memory into multiple groups according to each 255-bit group, and each two groups of image codes The stream data is reversed once in parallel processing to obtain the reversed image code stream data;

(4b)对翻转后的图像码流数据的奇数位和偶数位进行分离，得到奇偶分离的图像码流数据；(4b) Separating odd and even bits of the flipped image code stream data to obtain parity-separated image code stream data;

(4c)将奇偶分离的图像码流数据，按照偶数位在前、奇数位在后的顺序，组成输出图像码流数据；(4c) the image code stream data separated by parity, according to the order of the even number bit in front and the odd number bit in the back, to form the output image code stream data;

(5)并行解码：(5) Parallel decoding:

(5a)将输出图像码流数据进行转置，得到转置后的里德-索罗蒙码RS图像码流数据，采用合并访存的方式，将转置后的里德-索罗蒙码RS图像码流数据，采用异步传输的方式，存储在共享存储器中；(5a) Transpose the output image code stream data to obtain the transposed Reed-Solomon code RS image code stream data, and use the combined memory access method to convert the transposed Reed-Solomon code The RS image code stream data is stored in the shared memory by means of asynchronous transmission;

(5b)将转置后的里德-索罗蒙码RS图像码流数据，采用异步传输的方式，从共享存储器中传输到全局存储器，启动图形处理器GPU解码程序进行解码；(5b) Transmit the transposed Reed-Solomon code RS image code stream data from the shared memory to the global memory by means of asynchronous transmission, and start the graphics processor GPU decoding program to decode;

(5c)采用一个线程块处理一幅图像的码流、多个线程块处理多幅图像的码流的方法，对转置后的里德-索罗蒙码RS图像码流数据头部数据解析，得到解析图像码流数据头部数据；(5c) Adopt the method that one thread block processes the code stream of an image, and multiple thread blocks process the code stream of multiple images, analyze the header data of the transposed Reed-Solomon code RS image code stream data , to obtain the header data of the parsed image stream data;

(5d)采用一个线程块处理一幅图像、多个线程块处理多幅图像的方法，对转置后的里德-索罗蒙码RS图像码流数据包头解析，得到解析图像码流数据包头数据；(5d) Using a thread block to process an image and multiple thread blocks to process multiple images, analyze the transposed Reed-Solomon code RS image code stream data packet header, and obtain the parsed image code stream data packet header data;

(5e)在线程块内部启动多个线程，每个线程将同一层解码后的图像码块在显存中按照先后顺序重新拼接,得到解码后的图像码流数据；(5e) multiple threads are started inside the thread block, and each thread reassembles the decoded image code blocks of the same layer in the video memory according to the sequence to obtain the decoded image code stream data;

(6)获得解码图像码流数据：(6) Obtain the decoded image stream data:

(6a)将解码后的图像码流数据，进行逆转置，得到逆转置解码图像码流数据，将逆转置解码图像码流数据，采用异步传输的方式，存储在共享存储器中；(6a) Perform reverse transposition on the decoded image code stream data to obtain reverse transpose decoded image code stream data, and store the reverse transpose decoded image code stream data in a shared memory in an asynchronous transmission manner;

(6b)对逆转置后的解码图像码流数据进行奇偶合并，将得到的倒序存储的解码图像码流数据，采用异步传输的方式，存储在全局存储器中；(6b) Parity-combining the decoded image code stream data after reverse transposition, and storing the obtained decoded image code stream data stored in reverse order in the global memory by means of asynchronous transmission;

(6c)采用合并访存的方式，访问全局存储器，将全局存储器中奇偶合并的倒序存储的解码图像码流数据，按每255比特为一组分为多组，每两组图像码流数据按照并行处理的方式翻转一次，得到翻转后的解码图像码流数据；(6c) The global memory is accessed by means of combined memory access, and the decoded image code stream data stored in the reverse order of parity and even in the global memory is divided into multiple groups according to each 255-bit group, and each two groups of image code stream data Flip once according to the parallel processing method to obtain the flipped decoded image code stream data;

(6d)将解码结果存储在图形处理器GPU内部的全局存储器中；(6d) store the decoding result in the internal global memory of the graphics processing unit GPU;

(7)输出解码图像码流数据：(7) output decoded image stream data:

采用合并访问的方式，从图形处理器GPU内部的全局存储器中读出解码后的图像码流数据，输出到计算机内存中。The decoded image code stream data is read out from the internal global memory of the graphics processing unit GPU and output to the computer memory by means of combined access.

与现有技术相比，本发明具有以下优点：Compared with the prior art, the present invention has the following advantages:

第一，本发明提出的将多个图形处理器GPU与计算机连接起来，将连接有图形处理器GPU的计算机作为硬件实现平台，克服了现有技术中采用计算机中央处理器CPU处理速度慢的缺点，将统一计算设备架构CUDA应用到图形处理器GPU中，应用Fermi以及Kepler两种最新的GPU硬件架构，克服了现有技术中并行计算性能不高、图形处理器GPU不能没有中央处理器CPU进行调度来建立新的工作、不能进行内存错误检查和修复ECC以及图形处理器GPU资源利用率不高的缺点。使得本发明对大规模数据进行并行解码解码时效率更高，在动态并行作用下，能够使图形处理器GPU在没有中央处理器CPU进行调度的情况下，使用特有的加速硬件路径来建立新的工作，并且对工作进行调度以及结果进行同步，更进一步还可以在大型集群中进行内存错误检查和修复ECC，另外，同一程序中的不同内核函数可以在图形处理器GPU上同时运行，这样可以使图形处理器GPU的资源得到更加有效的利用。First, the present invention proposes to connect a plurality of graphics processors GPUs with computers, and the computer connected with graphics processors GPUs is used as a hardware implementation platform, which overcomes the slow processing speed of the computer central processing unit CPU in the prior art , the unified computing device architecture CUDA is applied to the graphics processor GPU, and the two latest GPU hardware architectures of Fermi and Kepler are used to overcome the low performance of parallel computing in the prior art, and the graphics processor GPU cannot be performed without the central processing unit CPU. Disadvantages of scheduling to create new jobs, inability to perform memory error checking and repairing ECC, and low utilization of GPU resources. This makes the present invention more efficient when performing parallel decoding on large-scale data, and under the action of dynamic parallelism, the graphics processor GPU can use a unique acceleration hardware path to establish a new Work, and schedule the work and synchronize the results. Further, it can also perform memory error checking and repair ECC in a large cluster. In addition, different kernel functions in the same program can run on the graphics processor GPU at the same time, which can make The resources of the graphics processing unit GPU are used more effectively.

第二，本发明引入合并访存访问存储器的方法，克服了现有技术中对存储器中数据访问速度慢，效率不高的缺点，使得本发明在大规模的图像码流数据进行解码时，显存的性能得到优化，数据访存速度更快。Second, the present invention introduces the method of combined memory access and memory access, which overcomes the shortcomings of slow and inefficient data access in the memory in the prior art, so that when the present invention decodes large-scale image code stream data, the video memory The performance is optimized, and the data access speed is faster.

第三，本发明引入异步传输的数据传递方式，克服了现有技术中分组码要求精确的帧同步处理而引起的数据传输方式缓慢的缺点，使得本发明在数据传输时提高传输带宽，进而并行效率得到提高。Third, the present invention introduces the data transmission mode of asynchronous transmission, which overcomes the shortcoming of the slow data transmission mode caused by the accurate frame synchronization processing of the block code in the prior art, so that the present invention improves the transmission bandwidth during data transmission, and then parallel Efficiency is improved.

第四，本发明引入并行解码处理的计算方法，克服了现有技术中串行处理消耗大量的时间、计算量庞大的缺点，使得本发明对大量数据传输时，解决了在传统各类工程应用中低效的问题，提高了解码的效率以及资源的利用率。Fourth, the present invention introduces a calculation method for parallel decoding processing, which overcomes the shortcomings of serial processing that consumes a lot of time and a large amount of calculation in the prior art, so that when the present invention transmits a large amount of data, it solves the problems in various traditional engineering applications. The problem of medium and low efficiency improves the efficiency of decoding and the utilization of resources.

附图说明Description of drawings

图1是本发明的流程图。Fig. 1 is a flow chart of the present invention.

具体实施方式detailed description

下面结合附图1对本发明做进一步描述。The present invention will be further described below in conjunction with accompanying drawing 1.

步骤1，初始化计算机。Step 1, initialize the computer.

将多个图形处理器GPU与计算机连接起来，图形处理器GPU的个数在[4，10]范围内任意选取，图形处理器GPU是指应用Fermi以及Kepler两种最新的图形处理器GPU硬件架构。Connect multiple graphics processors GPUs to the computer, and the number of graphics processors GPUs is arbitrarily selected within the range of [4, 10]. Graphics processor GPU refers to the application of two latest graphics processor GPU hardware architectures, Fermi and Kepler. .

计算机为图形处理器GPU分配共享存储器和全局存储器两个空间，来存放用于图形处理器GPU进行并行计算的数据，包括对输入码流、输出码流以及各个内核函数输出的伴随式矩阵和错误位置多项式矩阵的存储器空间进行分配。The computer allocates shared memory and global memory space for the graphics processor GPU to store data for parallel computing of the graphics processor GPU, including the adjoint matrix and error output of the input code stream, output code stream, and each kernel function The memory space for the positional polynomial matrix is allocated.

步骤2，输入图像码流数据。Step 2, input image code stream data.

将待解码的里德-索罗蒙码RS图像码流数据输入到计算机内存中。Input the Reed-Solomon code RS image code stream data to be decoded into the computer memory.

步骤3，读取图像码流数据。Step 3, read image code stream data.

采用异步传输的方式，从计算机内存中读取待解码的里德-索罗蒙码RS图像码流数据，采用异步传输的方式，将读取的待解码的里德-索罗蒙码RS图像码流数据存储在全局存储器中。Using asynchronous transmission, read the code stream data of the Reed-Solomon code RS image to be decoded from the computer memory, and use the asynchronous transmission method to read the read Reed-Solomon code RS image to be decoded Codestream data is stored in global memory.

步骤4，获得输出图像码流数据。Step 4, obtaining output image code stream data.

采用合并访存的方式，访问全局存储器，将图像码流数据从全局存储器中复制到共享存储器中，这里之所以使用共享存储器是因为在图形处理器GPU架构中，每个线程可以使用的寄存器的数目是有限的，不能提供足够的空间存放数组，所以我们将常量数据放入共享存储器中，共享存储器以及寄存的速度比全局存储器快得多；合并访存的方式是指，在图形处理器GPU中，同一个half-warp中的线程每次访问共享存储器中每组图像码流数据的同一位置上的数据。再将图像码流数据从共享存储器中复制到全局存储器中，将全局存储器中的里德-索罗蒙码RS图像码流数据，按每255比特为一组分为多组，每两组图像码流数据按照并行处理翻转一次，得到翻转后的，即得到倒序存储的图像码流数据，按照每个线程块block执行一幅图，每个线程thread执行一个整型数据，实现并行处理。The method of combined memory access is used to access the global memory, and the image code stream data is copied from the global memory to the shared memory. The reason why the shared memory is used here is because in the GPU architecture of the graphics processor, the number of registers that can be used by each thread The number is limited and cannot provide enough space to store the array, so we put the constant data into the shared memory. The speed of shared memory and registration is much faster than that of the global memory; the method of combined memory access means that in the graphics processor GPU In the same half-warp, threads in the same half-warp access the data at the same position of each group of image code stream data in the shared memory each time. Then copy the image code stream data from the shared memory to the global memory, and divide the Reed-Solomon code RS image code stream data in the global memory into multiple groups according to each 255-bit group. The image code stream data is flipped once according to parallel processing, and after the flip, the image code stream data stored in reverse order is obtained, and one image is executed according to each thread block, and each thread executes an integer data to realize parallel processing.

对翻转后的图像码流数据的奇数位和偶数位进行分离，得到奇偶分离的图像码流数据据。The odd bits and even bits of the flipped image code stream data are separated to obtain image code stream data with odd-even separation.

将分离后的图像码流数据，按偶数位在前、奇数位在后的顺序，组成输出码流。The separated image code stream data is composed of an output code stream in the order of even bits first and odd bits later.

步骤5，并行解码。Step 5, parallel decoding.

并行算法设计中采取了粗粒度以及细粒度同时进行并行的双层并行模型方法，在对细粒度进行并行访问过程中，将输出图像码流数据进行转置，得到转置后的里德-索罗蒙码RS图像码流数据，转置后的图像码流数据每组码字同一位置的数据是连续的，这样就能够满足合并访存的条件，采用合并访存的方式，将转置后的里德-索罗蒙码RS图像码流数据存储在共享存储器中，输出图像码流数据进行转置是指，按照每个线程块block执行一个大小是32*32的子数据块的转置，每个线程thread执行一个整型数据的运算，多个blcok并行进行处理。一个thread处理一个数据，多个线程并行进行处理。In the design of the parallel algorithm, a two-layer parallel model method with coarse-grained and fine-grained parallelism is adopted. In the process of parallel access to the fine-grained, the output image code stream data is transposed to obtain the transposed Reed-Sort Romon code RS image code stream data, the data in the same position of each group of code words of the transposed image code stream data is continuous, so that the conditions for combined memory access can be met, and the transposed The Reed-Solomon code RS image code stream data is stored in the shared memory, and the transposition of the output image code stream data refers to performing the transposition of a sub-data block with a size of 32*32 according to each thread block block , each thread executes an integer data operation, and multiple blcoks process in parallel. One thread processes one data, and multiple threads process it in parallel.

将转置后的里德-索罗蒙码RS图像码流数据，从共享存储器中复制到全局存储器，启动图形处理器GPU解码程序进行解码；启动图形处理器GPU解码程序进行解码的具体步骤如下：Copy the transposed Reed-Solomon code RS image code stream data from the shared memory to the global memory, and start the graphics processor GPU decoding program for decoding; the specific steps for starting the graphics processor GPU decoding program for decoding are as follows :

第一步，各图形处理器GPU从全局存储器中读取待解码的图像码流数据码字段，对于全局存储器的访问采用合并访问的方式，实现高效的访问，用每个线程处理一组里德-索罗蒙码RS图像码流数据，各线程采用查找表将码字从其所使用的有限域基变换至自然基，再利用查找表将数据变换至有限域；在有限域中将码字分别乘以相应的系数后，利用归约求和方法将各个块内的线程计算结果相加，得到多个码字的伴随式，生成的伴随式也是按转置的方式排列的，若伴随式全是0则结束解码，否则继续进行以下步骤。In the first step, each graphics processor GPU reads the code field of the image code stream data to be decoded from the global memory, and adopts a combined access method for global memory access to achieve efficient access, and each thread processes a set of Reed - Solomon code RS image code stream data, each thread uses a lookup table to transform the codeword from the finite field basis it uses to a natural basis, and then uses the lookup table to transform the data into a finite field; in the finite field, the codeword After multiplying the corresponding coefficients, use the reduction summation method to add the calculation results of the threads in each block to obtain the syndrome of multiple codewords. The generated syndrome is also arranged in a transposed manner. If the syndrome If it is all 0, the decoding ends, otherwise proceed to the following steps.

第二步，图形处理器GPU使用多个块对多个码流进行关键方程的计算，计算方法是无需求逆的B-M迭代算法，图形处理器GPU利用多个线程进行计算，每个线程均计算一个迭代参数，计算后得到的关键方程系数存入全局存储器。In the second step, the graphics processor GPU uses multiple blocks to calculate the key equations of multiple code streams. The calculation method is the B-M iterative algorithm that does not require inversion. The graphics processor GPU uses multiple threads to perform calculations, and each thread calculates An iteration parameter, the calculated key equation coefficients are stored in the global memory.

第三步，图形处理器GPU中使用多个块对误码进行纠正，纠正误码的方法是钱搜索和福尼算法；每个块通过多个线程读入关键方程系数，备线程分别对所读入的关键方程系数进行乘积运算，各线程的运算结果通过归约求和的方法进行求和，根据和的值判断当前块所对应的符号是否为错误符号；若当前块所对应符号存在错误，则采用福尼算法计算错误值，并利用块中某一线程执行纠正错误值的程序，得到解码结果，解码结果存储在全局存储器中。In the third step, multiple blocks are used in the graphics processing unit GPU to correct bit errors. The method of correcting bit errors is money search and Forney algorithm; each block reads in the key equation coefficients through multiple threads, and the standby threads respectively The key equation coefficients read in are multiplied, and the calculation results of each thread are summed by reduction and summation. According to the value of the sum, it is judged whether the symbol corresponding to the current block is an error symbol; if there is an error in the symbol corresponding to the current block , the Forney algorithm is used to calculate the error value, and a thread in the block is used to execute the program to correct the error value to obtain the decoding result, which is stored in the global memory.

并行解码过程中，采用一个线程块处理一幅图像的码流、多个线程块处理多幅图像的码流的方法，对转置后的里德-索罗蒙码RS码流头部数据解析，得到解析码流头部数据；这里的线程块是指，将线程网格维度设置为二维，线程网格的大小为8×m，其中m表示由用户设定的需要并行解压缩的图像数量，将每个线程网格拥有的线程块设置为28，将每个线程块拥有的线程数设置为256，得到设置好的线程网格和线程块。In the parallel decoding process, one thread block is used to process the code stream of one image, and multiple thread blocks are used to process the code stream of multiple images, and the header data of the transposed Reed-Solomon code RS code stream is analyzed , to get the header data of the parsed code stream; the thread block here refers to setting the dimension of the thread grid as two-dimensional, and the size of the thread grid is 8×m, where m represents the image that needs to be decompressed in parallel set by the user Quantity, set the number of thread blocks owned by each thread grid to 28, and set the number of threads owned by each thread block to 256 to obtain the set thread grid and thread blocks.

采用一个线程块处理一幅图像、多个线程块处理多幅图像的方法，对转置后的里德-索罗蒙码RS图像码流数据包头解析，得到解析图像码流数据包头数据。The method of processing one image by one thread block and processing multiple images by multiple thread blocks is used to analyze the packet header of the transposed Reed-Solomon code RS image code stream to obtain the header data of the analyzed image code stream.

在线程块内部启动多个线程，每个线程将同一层解码后的图像码块在显存中按照先后顺序重新拼接,得到解码后的图像码流数据。Multiple threads are started inside the thread block, and each thread reassembles the decoded image code blocks of the same layer in the video memory in order to obtain the decoded image code stream data.

步骤6，获得解码图像码流数据。Step 6, obtaining decoded image code stream data.

将解码后的图像码流数据，进行逆转置，按照每个线程块block执行一个大小是32*32的子数据块的逆转置，每个线程thread执行一个整型数据的运算，得到逆转置解码图像码流数据。Perform inverse transposition on the decoded image code stream data, perform inverse transposition of a sub-data block with a size of 32*32 according to each thread block block, and perform an operation on an integer data in each thread thread to obtain inverse transposition decoding Image stream data.

对逆转置后的解码图像码流数据进行奇偶合并，得到倒序存储的解码图像码流数据。Parity combining is performed on the reverse-transposed decoded image code stream data to obtain decoded image code stream data stored in reverse order.

采用合并访存的方式，访问全局存储器，将全局存储器中奇偶合并的倒序存储的解码图像码流数据，按每255比特分为一组，每两组码流翻转一次，得到正序存储的解码图像码流数据。The global memory is accessed by means of combined memory access, and the decoded image code stream data stored in the reverse order of the odd-even combination in the global memory is divided into groups of 255 bits, and every two groups of code streams are flipped once to obtain the decoded image stored in the normal order. Image stream data.

将解码结果存储在图形处理器GPU内部的全局存储器中。Store the decoding result in the global memory inside the graphics processing unit GPU.

步骤7，输出解码图像码流数据。Step 7, outputting the decoded image code stream data.

下面结合对基于图形处理器GPU的里德-索罗蒙码RS解码时间、解码时间比以及解码正确率的测试结果，对本发明的效果做进一步描述。The effect of the present invention will be further described below in combination with the test results of the decoding time, decoding time ratio and decoding accuracy rate of the Reed-Solomon code RS based on the graphics processing unit GPU.

每一次里德-索罗蒙码RS解码算法的对象是长为255的码字，因此里德-索罗蒙码RS解码算法的时间仅与每组码字的解码时间以及总共具有多少组码字有关。The object of each Reed-Solomon code RS decoding algorithm is a codeword with a length of 255, so the time of the Reed-Solomon code RS decoding algorithm is only related to the decoding time of each group of codewords and how many groups of codes there are in total word related.

对C2075架构下图形处理器GPU解码时间与中央处理器CPU解码时间进行比较，获得的测试结果如下表所示：Comparing the graphics processor GPU decoding time and the central processing unit CPU decoding time under the C2075 architecture, the test results obtained are shown in the following table:

上表中的输入图像码流数据大小是1286KB，在无错解码情况下及每组码字随机加错1个、8个、16个情况下程序均能正确解码。在没有经过优化的中央处理器CPU上执行的c程序中，无错解码条件下，所用时间是354ms；纠错解码的条件下，错误数为1的情况下，所用时间是344ms；错误数为8的情况下，所用时间是344ms；错误数为16的情况下，所用时间是344ms。纠错解码跟无错解码都需要经过相同的步骤进行计算，在无错跟纠错条件下，每组码字的时间复杂度都是一样的，因此，在同样的图像下，中央处理器CPU纠错解码跟无错解码时间相差不大。在图形处理器GPU解码过程中，无错解码条件下，所用时间是18ms；纠错解码的条件下，错误数为1的情况下，所用时间是21ms；，错误数为8的情况下，所用时间是30ms；，错误数为16的情况下，所用时间是40ms；无错解码条件下，中央处理器CPU与图形处理器GPU的解码时间比是19.6；纠错解码的条件下，错误数为1的情况下，中央处理器CPU与图形处理器GPU的解码时间比是16.4；错误数为8的情况下，中央处理器CPU与图形处理器GPU的解码时间比是11.4；错误数为16的情况下，中央处理器CPU与图形处理器GPU的解码时间比是8.6。经过优化后的算法在是否进行纠错解码的过程中进行了条件的判定，在求出伴随式以后，根据伴随式是否为零判断是否进行纠错解码，因此，无错解码是复杂度大大低于纠错解码的复杂度，因此，无错解码耗时最短，纠错解码复杂度随着需要纠正的错误数的增加而增加，纠错解码耗时也随着错误码字数的增加而增加。The data size of the input image code stream in the above table is 1286KB, and the program can decode correctly in the case of error-free decoding and in the case of randomly adding 1, 8, and 16 errors to each group of code words. In the c program executed on the central processing unit CPU that has not been optimized, under the condition of error-free decoding, the time used is 354ms; under the condition of error-correcting decoding, when the number of errors is 1, the time used is 344ms; the number of errors is In the case of 8, the time used is 344ms; when the number of errors is 16, the time used is 344ms. Both error-correcting decoding and error-free decoding need to be calculated through the same steps. Under the conditions of error-free and error-correcting, the time complexity of each group of codewords is the same. Therefore, under the same image, the central processing unit CPU The error-correcting decoding time is not much different from the error-free decoding time. In the GPU decoding process of the graphics processor, under the condition of error-free decoding, the time used is 18ms; under the condition of error-correcting decoding, when the number of errors is 1, the time used is 21ms; when the number of errors is 8, the time used is The time is 30ms; when the number of errors is 16, the time used is 40ms; under the condition of error-free decoding, the decoding time ratio of the central processing unit CPU and the graphics processing unit GPU is 19.6; under the condition of error-correcting decoding, the number of errors is In the case of 1, the decoding time ratio of the central processing unit CPU to the graphics processing unit GPU is 16.4; when the number of errors is 8, the decoding time ratio of the central processing unit CPU to the graphics processing unit GPU is 11.4; In this case, the decoding time ratio of the central processing unit CPU to the graphics processing unit GPU is 8.6. The optimized algorithm makes a conditional judgment in the process of whether to perform error-correcting decoding. After finding the syndrome, judge whether to perform error-correcting decoding according to whether the syndrome is zero. Therefore, the complexity of error-free decoding is much lower. Because of the complexity of error-correcting decoding, error-free decoding takes the shortest time. The complexity of error-correcting decoding increases with the number of errors to be corrected. The time-consuming error-correcting decoding also increases with the number of error code words.

为了验证纠错解码的性能，上表中所测的纠错解码时间为输入码流中每组码字都加错的情况下，即同一幅图中每组码字都需要进行纠错解码运算，而在实际的通信系统中，误码率一般都在e^-6到e^-4之间，即一幅2M大小的图错误码字个数在几个到几百个之间，需要进行纠错解码的码组数非常少，因此，在实际情况下图形处理器GPU解码时间相较于中央处理器CPU来说会有显著的减少。In order to verify the performance of error correction decoding, the error correction decoding time measured in the above table is the case where each group of codewords in the input code stream has errors, that is, each group of codewords in the same picture needs to perform error correction decoding operations , but in the actual communication system, the bit error rate is generally between e ^-6 and e ^-4 , that is, the number of error code words in a 2M picture is between a few to hundreds, and it needs to be corrected The number of wrongly decoded code groups is very small, so in actual situations, the decoding time of the graphics processing unit GPU will be significantly reduced compared with the central processing unit CPU.

基于图形处理器GPU的解码，我们对全局存储器的访问采取了合并访存的方式，增加了转置函数，并且对纠错解码函数中共享内存以及寄存器的使用进行了重新分配，因此，RS解码算法的性能有了显著的提高。Based on the decoding of the graphics processor GPU, we have adopted a method of merging access to the global memory, added a transpose function, and redistributed the use of shared memory and registers in the error correction decoding function. Therefore, RS decoding The performance of the algorithm has been significantly improved.

因此，由两次的优化结果可以看出，对全局存储器中的数据是否进行合并访存对CUDA程序的并行性能起到了较大的影响，同时，共享存储器以及寄存器的分配，以及并行结构的设计也是影响程序并行的主要因素。Therefore, it can be seen from the two optimization results that whether the data in the global memory is merged or not has a great impact on the parallel performance of the CUDA program. At the same time, the allocation of shared memory and registers, as well as the design of the parallel structure It is also the main factor affecting program parallelism.

Claims

1. a kind of RS decoding method based on graphic processor, comprises the steps:

(1) Initialize the computer:

(1a) connecting a plurality of graphics processing units (GPUs) to a computer;

(1b) the computer allocates the global memory in the graphics processing unit GPU;

(2) Input image stream data:

Inputting the Reed-Solomon code RS image stream data to be decoded into the computer memory;

(3) Read image stream data:

(3a) Using asynchronous transmission mode, read the Reed-Solomon code RS image code stream data to be decoded from the computer memory;

(3b) adopting asynchronous transmission mode, storing the read Reed-Solomon code RS image code stream data to be decoded in the global memory;

(4) Obtain the output image stream data:

(4a) Adopt the method of merging memory access to access the global memory, and divide the Reed-Solomon code RS image code stream data in the global memory into multiple groups according to each 255-bit group, and each two groups of image codes The stream data is reversed once in parallel processing to obtain the reversed image code stream data;

(4b) Separating odd and even bits of the flipped image code stream data to obtain parity-separated image code stream data;

(4c) the image code stream data separated by parity, according to the order of the even number bit in front and the odd number bit in the back, to form the output image code stream data;

(5) Parallel decoding:

(5a) Transpose the output image code stream data to obtain the transposed Reed-Solomon code RS image code stream data, and use the combined memory access method to convert the transposed Reed-Solomon code The RS image code stream data is stored in the shared memory by means of asynchronous transmission;

(5b) Transmit the transposed Reed-Solomon code RS image code stream data from the shared memory to the global memory by means of asynchronous transmission, and start the graphics processor GPU decoding program to decode;

(5c) Adopt the method that one thread block processes the code stream of an image, and multiple thread blocks process the code stream of multiple images, analyze the header data of the transposed Reed-Solomon code RS image code stream data , to obtain the header data of the parsed image stream data;

(5d) Using a thread block to process an image and multiple thread blocks to process multiple images, analyze the transposed Reed-Solomon code RS image code stream data packet header, and obtain the analyzed image code stream data packet header data;

(5e) multiple threads are started inside the thread block, and each thread reassembles the decoded image code blocks of the same layer in the video memory according to the sequence to obtain the decoded image code stream data;

(6) Obtain the decoded image stream data:

(6a) Perform reverse transposition on the decoded image code stream data to obtain reverse transpose decoded image code stream data, and store the reverse transpose decoded image code stream data in a shared memory in an asynchronous transmission manner;

(6b) Parity-combining the decoded image code stream data after reverse transposition, and storing the obtained decoded image code stream data stored in reverse order in the global memory by means of asynchronous transmission;

(6c) The global memory is accessed by means of combined memory access, and the decoded image code stream data stored in the reverse order of parity and even in the global memory is divided into multiple groups according to each 255-bit group, and each two groups of image code stream data Flip once according to the parallel processing method to obtain the flipped decoded image code stream data;

(6d) store the decoding result in the internal global memory of the graphics processing unit GPU;

(7) output decoded image stream data:

The decoded image code stream data is read out from the internal global memory of the graphics processing unit GPU and output to the computer memory by means of combined access.

2. The RS decoding method based on a graphics processor according to claim 1, characterized in that: the graphics processor GPU described in the step (1a) is selected in a range of numbers [4, 10].

3. the RS decoding method based on graphic processor according to claim 1, is characterized in that: step (3a), step (3b), step (5a), step (5b), step (6a), step (6b The asynchronous transmission mode described in ) means that the image data code streams to be transmitted in each step are divided into groups of 255 bits, and the image data code streams are transmitted in groups. When transmitting each bit group, the bit group A start bit is added in front of the bit group, and a stop bit is added after the bit group.

4. The RS decoding method based on a graphics processor according to claim 1, characterized in that: the reversal described in step (4a), step (6c) refers to that every two groups of image stream data are reversed storage.

5. the RS decoding method based on graphics processor according to claim 1, is characterized in that: the parallel processing described in step (4a), step (6c) refers to, image code stream data is according to each thread block The block executes a graph, and each thread executes an integer data for corresponding processing.

6. the RS decoding method based on graphic processor according to claim 1, is characterized in that: the mode of merging memory access described in step (4a), step (5a), step (6c), step (7) It means that in the graphics processing unit GPU, threads in the same half-warp access the data at the same position of each code stream in the shared memory each time.

7. The RS decoding method based on a graphics processor according to claim 1, characterized in that: transposing the output image code stream data described in the step (5a) refers to executing a block according to each thread block For a sub-data block whose size is 32*32, convert the rows and columns of the sub-data block to realize the transposition of the sub-data block, and each thread executes an operation on integer data.

8. the RS decoding method based on graphics processor according to claim 1, is characterized in that: the specific steps of starting graphics processor GPU decoding program described in the step (5b) to decode are as follows:

In the first step, each graphics processing unit (GPU) reads the code field to be decoded from the global memory. The access to the global memory is combined to achieve efficient access. Each thread processes a set of Reed-Solomon Code RS image code stream data, each thread first uses a lookup table to transform the code word from the finite field base it uses to a natural base, and then uses the lookup table to transform the image code stream data into a finite field; in the finite field, the code word Words are multiplied by the corresponding coefficients, and the calculation results of the threads in each block are added using the reduction summation method to obtain adjoint expressions of multiple codewords. The generated adjoint expressions are also arranged in a transposed manner. If the formula is all 0, the decoding ends, otherwise, proceed to the following steps;

In the second step, the graphics processor GPU uses multiple blocks to calculate the key equations of multiple image stream data. The calculation method is the B-M iterative algorithm that does not require inversion. The graphics processor GPU uses multiple threads to perform calculations. Each thread Both calculate an iteration parameter, and the key equation coefficients obtained after calculation are stored in the global memory;

In the third step, multiple blocks are used in the graphics processing unit GPU to correct bit errors. The method of correcting bit errors is money search and Forney algorithm; each block reads in the key equation coefficients through multiple threads, and then the standby threads respectively The read-in key equation coefficients are multiplied, and then the calculation results of each thread are summed by reduction and summation. According to the value of the sum, it is judged whether the symbol corresponding to the current block is an error symbol; if the symbol corresponding to the current block If there is an error, the Forney algorithm is used to calculate the error value, and a thread in the block is used to execute the program to correct the error value to obtain the decoding result, which is stored in the global memory.

9. the RS decoding method based on graphics processor according to claim 1, is characterized in that: the thread block described in step (5c) refers to, thread grid dimension is set to two-dimensional, the size of thread grid is 8×m, where m represents the number of images that need to be decompressed in parallel set by the user, and the thread blocks owned by each thread grid are set to 28, and the number of threads owned by each thread block is set to 256, and the set Nice thread grid and thread blocks.

10. The RS decoding method based on a graphics processor according to claim 1, characterized in that: the inverse transposition described in step (6a) refers to executing a subclass whose size is 32*32 according to each thread block block. Data block, convert the rows and columns of the sub-data block to realize the inverse transposition of the sub-data block, and each thread executes an operation on integer data.