CN103427844A - High-speed lossless data compression method based on GPU-CPU hybrid platform

High-speed lossless data compression method based on GPU-CPU hybrid platform

Info

Publication number
CN103427844A
CN103427844A CN2013103210717A CN201310321071A
Authority
CN
China
Prior art keywords
thread
compressed
data
window
length
Prior art date
Legal status
Granted
Application number
CN2013103210717A
Other languages
Chinese (zh)
Other versions
CN103427844B (en)
Inventor
金海
郑然
周斌
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201310321071.7A priority Critical patent/CN103427844B/en
Publication of CN103427844A publication Critical patent/CN103427844A/en
Application granted granted Critical
Publication of CN103427844B publication Critical patent/CN103427844B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a high-speed lossless data compression method based on a hybrid GPU-CPU platform, comprising: the CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU; a thread block group bk[a] is set up on the GPU with b threads per thread block; the length of the compression dictionary window is set to c and a head pointer p_dic_h is set to point to the first compression dictionary window; the read-ahead window size is set to d, and the pointer p_pre_r to the first read-ahead window is initialized to p_dic_h-c; the worker thread group threads[a*b] is initialized together with (a*b/2)/c gMatrix matrices of size c*d; (a*b/2) threads of the worker thread group threads[a*b] are invoked to process q=(a*b/2)/c segments of length c+d in the data file to be compressed; in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s is found and the ternary result array locations[p] of each result matrix is determined. The invention greatly increases the compression speed for massive data.

Description

A High-Speed Lossless Data Compression Method Based on a Hybrid GPU-CPU Platform

Technical Field

The invention belongs to the field of computer data compression, and more specifically relates to a high-speed lossless data compression method based on a hybrid GPU-CPU platform.

Background Art

Research by the International Data Corporation (IDC) shows that over the past decade the total amount of information worldwide has doubled roughly every two years. In 2011 the total amount of data created and copied globally was 1.8 ZB (1800 EB); by 2015 it will reach 8 ZB (8000 EB), and within the following decade, around 2020, the global data volume will be roughly 50 times what it is today. Meanwhile, although network bandwidth and new storage technologies have also developed rapidly, they still fall far short of the performance required for transmitting and storing today's massive data volumes. One of the key technologies for meeting this challenge is data compression. Data compression effectively reduces the volume of data that must be transmitted and stored, thereby controlling the cost of data transmission and storage and enabling low-cost, efficient data management.

Traditional compression theory and compression algorithms for the CPU platform focus on improving the compression ratio of the data, whereas the era of massive data places much higher demands on compression speed. To increase compression speed, parallel data compression has become a new direction of development. The prerequisite for parallel compression is data-level parallelism; partitioning the data into blocks and compressing the blocks simultaneously is the natural parallel approach. Existing schemes such as those of J. Gilchrist, GZIP, and S. Pradhan use multi-threading, multi-core, and cluster techniques, respectively, to realize parallel data compression on the CPU platform. In practical applications, however, as the volume of data to be processed grows, the large amount of communication generated between tasks, the large memory footprint of intermediate results, and the fact that the CPU's hardware architecture is inherently unsuited to large-scale parallel computation all prevent parallel compression on the CPU platform from reaching the expected speed.

Some researchers have also adapted classic lossless data compression algorithms from the CPU platform, such as run-length encoding and BZIP2, and ported them to the GPU. To overcome the shortcomings of CPU-based compression described above, these algorithms concentrate on exploiting the GPU's shared memory and global memory so as to minimize communication between modules and reduce memory usage, and thereby raise the compression speed. But because these algorithms were originally designed for the CPU platform, they are fundamentally ill-suited to the very different hardware architecture of the GPU, so in practical applications the achievable speed-up remains limited.

Summary of the Invention

In view of the above defects of the prior art and the need for improvement, the present invention provides a high-speed lossless data compression method based on a hybrid GPU-CPU platform. Its purpose is to decompose the data compression process into a parallel part and a serial part: the parallel part divides the data into multiple compression dictionary windows and read-ahead windows, organizes them into multiple matrices, and hands the data file to be compressed to multiple thread blocks on the GPU for parallel processing, while the serial part, the generation and output of the compressed encoding, is handled by the CPU. This working mode combines the respective strengths of the GPU and the CPU, and thus greatly increases the compression speed for massive data without lowering the compression ratio.

To achieve the above object, according to one aspect of the present invention, a high-speed lossless data compression method based on a hybrid GPU-CPU platform is provided, comprising the following steps:

(1) The CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU.

(2) Set up the thread block group bk[a] on the GPU and the number of threads b in each thread block, where a is the total number of thread blocks.

(3) Set the length of the compression dictionary window to c, and set the head pointer to the first compression dictionary window to p_dic_h.

(4) Set the read-ahead window size to d and the pointer p_pre_r to the first read-ahead window; the initial value of this pointer is set to p_dic_h-c.

(5) Initialize the worker thread group threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d.

(6) Invoke (a*b/2) threads of the worker thread group threads[a*b] to process q=(a*b/2)/c segments of length c+d in the data file to be compressed.

(7) In each of the q result matrices gMatrix, find the diagonal segment with the most consecutive 1s and determine the ternary result array locations[p] of that result matrix. Each element of the array stores a triple (x, y, length), where p is the number of diagonal segments in the result matrix and equals c+d-1, x is the offset of the diagonal segment within the compression dictionary corresponding to that result matrix, y is its offset within the read-ahead window corresponding to that result matrix, and length is the length of the diagonal segment.

(8) Find the element with the largest length value in the locations[p] array of each gMatrix: a thread T3 with thread number th3, one of threads 0 through (q-1) of the thread group threads[a*b], is responsible for finding the element with the largest length value in the ternary result array locations[p] of each gMatrix matrix and storing its parameters x, y, and length in the global match result array match[q], each element of which also stores a triple (x, y, length).

(9) Compress the data file to be compressed according to the match result array match[q].

(10) Determine whether the pointer p_pre_r has reached the end of the data file to be compressed. If so, the process ends; otherwise slide the dictionary windows and read-ahead windows forward, i.e. set p_pre_r=p_pre_r+q*d and p_dic_h=p_dic_h+q*d, and return to step (6). A minimal host-side sketch of this loop is given below.
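
Read as host-side pseudocode, steps (1) through (10) amount to the following loop. This is a minimal CUDA-style sketch, not the patent's implementation: the kernel names matchKernel, diagonalKernel, and reduceKernel, the cpuEncode() helper, the Match struct holding the (x, y, length) triple, and the pointer layout p_pre_r = p_dic_h + c (taken from the worked example later in this description) are all illustrative assumptions.

// Host-side sketch of steps (1)-(10), assuming CUDA. Kernel names and
// signatures are hypothetical; the kernels themselves are sketched further
// below, after the corresponding steps of the embodiment.
#include <cuda_runtime.h>
#include <vector>

struct Match { int x, y, length; };                 // the (x, y, length) triple

__global__ void matchKernel(const unsigned char*, unsigned char*, size_t, size_t, int, int, int);
__global__ void diagonalKernel(const unsigned char*, Match*, int, int, int);
__global__ void reduceKernel(const Match*, Match*, int, int, int);
void cpuEncode(const unsigned char*, const std::vector<Match>&, size_t, size_t, int, int);

void compressFile(const unsigned char* data, size_t fileLen,
                  int a, int b, int c, int d) {
    int q = (a * b / 2) / c;                        // window pairs per pass
    int p = c + d - 1;                              // diagonals per gMatrix
    unsigned char *dData, *dMatrix;  Match *dLoc, *dMatch;
    cudaMalloc(&dData, fileLen);
    cudaMemcpy(dData, data, fileLen, cudaMemcpyHostToDevice);     // step (1)
    cudaMalloc(&dMatrix, (size_t)q * c * d);        // q gMatrix buffers of size c*d
    cudaMalloc(&dLoc, (size_t)q * p * sizeof(Match));
    cudaMalloc(&dMatch, q * sizeof(Match));

    // Pointer layout follows the worked example (p_dic_h = 0, p_pre_r = c).
    size_t p_dic_h = 0, p_pre_r = c;
    std::vector<Match> match(q);
    while (p_pre_r + (size_t)q * d <= fileLen) {    // step (10) loop; tail handling omitted
        matchKernel   <<<a, b / 2>>>(dData, dMatrix, p_dic_h, p_pre_r, c, d, q);  // step (6)
        diagonalKernel<<<a, b    >>>(dMatrix, dLoc, c, d, q);                      // step (7)
        reduceKernel  <<<(q + 255) / 256, 256>>>(dLoc, dMatch, c, d, q);            // step (8)
        cudaMemcpy(match.data(), dMatch, q * sizeof(Match),
                   cudaMemcpyDeviceToHost);                                         // step (9-1)
        cpuEncode(data, match, p_dic_h, p_pre_r, c, d);                             // steps (9-2)(9-3)
        p_pre_r += (size_t)q * d;                   // slide both windows forward
        p_dic_h += (size_t)q * d;
    }
    cudaFree(dData); cudaFree(dMatrix); cudaFree(dLoc); cudaFree(dMatch);
}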

Preferably, the value of d equals 16*n, where n ranges from 1 to 8.

Preferably, in step (6), p_dic_h points to the head of the first compression dictionary window of the data file to be compressed and p_pre_r points to the head of its first read-ahead window, while (p_dic_h-d) points to the head of the second compression dictionary window and (p_pre_r-d) points to the head of the second read-ahead window. Partitioned in this way, one loop iteration can process q pairs of compression dictionary windows and read-ahead data windows.

Preferably, step (6) consists of performing the following sub-steps for the data of each compression dictionary window to be processed and its corresponding read-ahead data window:

(6-1) Set a counter i=0.

(6-2) Set up a thread T1 with thread number th1, one of threads (c*k) through (c*(k+1)-1) of the thread group threads[a*b/2], which determines whether byte (th1 mod c) of the k-th compression dictionary window matches bytes i*16 through (i+1)*16-1 of the k-th read-ahead data window, where 0<=k<q. When two bytes match the value 1 is returned, otherwise 0, and the match results are written back to positions ((th1 mod c)*d+i*16) through ((th1 mod c)*d+i*16+16) of the k-th gMatrix matrix in global memory.

(6-3) Set i=i+1 and test whether i<n; if so, go to step (6-2), otherwise go to step (7).

Preferably, when length is less than 3, no match has been found; in that case x and y are simply assigned the value -1.

Preferably, step (7) specifically comprises the following sub-steps:

(7-1) Set up a thread T2 with thread number th2, one of threads (c*k) through (c*(k+1)+d-1) of the thread group threads[a*b]. T2 is responsible for finding the sub-segment with the most consecutive 1s within one diagonal segment and recording the parameters x, y, and length of that sub-segment.

(7-2) Thread T2 stores the data x, y, and length it has obtained in element (th2 mod p) of the ternary result array locations.

Preferably, step (9) specifically comprises the following sub-steps:

(9-1) Transfer the match result array match[q] from the GPU to main memory on the CPU side.

(9-2) The CPU converts the data stored in the match result array match[q] into the offset and length of the longest match of the read-ahead window's substring within the compression dictionary window, and outputs the compressed-encoding triple array compress[q]. Each element of the compressed-encoding triple array stores a triple (flag, offset, length), where flag is a flag byte indicating whether the output is compressed data, and offset and length are the offset and length of the longest match of the read-ahead window's substring within the corresponding compression dictionary window.

(9-3) Perform data compression on the data file to be compressed according to the flag bytes.

Preferably, in step (9-3), the flag bytes obtained are of two types, original flag bytes and mixed flag bytes. In an original flag byte the first bit is 0 and the remaining 7 bits give the length of the raw data output that follows, so that up to 128 consecutive raw data bytes can follow. In a mixed flag byte the first bit is 1 and the remaining 7 bits describe 7 following items of mixed raw and compressed data: a 0 bit means the corresponding item is raw data and a 1 bit means it is a compressed code. Data compression is achieved in this way; if no match is found, the data is output unchanged.

According to another aspect of the present invention, a high-speed lossless data compression system based on a hybrid GPU-CPU platform is provided, comprising:

a first module for reading the data file to be compressed and copying it from main memory into the global memory of the GPU;

a second module for setting up the thread block group bk[a] on the GPU and the number of threads b in each thread block, where a is the total number of thread blocks;

a third module for setting the length of the compression dictionary window to c and setting the head pointer to the first compression dictionary window to p_dic_h;

a fourth module for setting the read-ahead window size to d and the pointer p_pre_r to the first read-ahead window, the initial value of this pointer being set to p_dic_h-c;

a fifth module for initializing the worker thread group threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d;

a sixth module for invoking (a*b/2) threads of the worker thread group threads[a*b] to process q=(a*b/2)/c segments of length c+d in the data file to be compressed;

a seventh module for finding, in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s and determining the ternary result array locations[p] of each result matrix, each element of the array storing a triple (x, y, length), where p is the number of diagonal segments in the result matrix and equals c+d-1, x is the offset of the diagonal segment within the compression dictionary corresponding to that result matrix, y is its offset within the read-ahead window corresponding to that result matrix, and length is the length of the diagonal segment;

an eighth module for finding the element with the largest length value in the locations[p] array of each gMatrix: a thread T3 with thread number th3, one of threads 0 through (q-1) of the thread group threads[a*b], is responsible for finding the element with the largest length value in the ternary result array locations[p] of each gMatrix matrix and storing its parameters x, y, and length in the global match result array match[q], each element of which also stores a triple (x, y, length);

a ninth module for compressing the data file to be compressed according to the match result array match[q];

a tenth module for determining whether the pointer p_pre_r has reached the end of the data file to be compressed, in which case the process ends; otherwise the dictionary windows and read-ahead windows are slid forward, i.e. p_pre_r=p_pre_r+q*d and p_dic_h=p_dic_h+q*d are set, and control returns to the sixth module.

In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:

(1) The present invention speeds up data matching through parallel matrix matching: in the parallel matrix matching mode proposed here, q matrix matches of size c*d are processed in parallel within one loop iteration, so the step size of the matching loop is increased from d bytes to q*d bytes, which accelerates matching.

(2) The present invention realizes asynchronous processing, overlapping the time spent discovering redundant data with the time spent outputting the compressed encoding, so that the operations on the CPU and on the GPU can be executed in a pipeline-like fashion. This also reduces the total compression time to some extent and raises the compression speed. A sketch of such an overlap is given after this list.
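
The overlap described in (2) could be realized, for instance, with CUDA streams and double-buffered result arrays, so that the CPU encodes batch i while the GPU is still matching batch i+1. The sketch below is a hedged illustration of that idea only; findMatchesOnGpu() and cpuEncodeBatch() are hypothetical stand-ins for the matching kernels of steps (6)-(8) and the encoding of step (9), and nothing in it is taken from the patent text.

// Pipeline-style overlap of GPU matching and CPU encoding using two CUDA
// streams and pinned host buffers (a sketch under the stated assumptions).
#include <cuda_runtime.h>

struct Match { int x, y, length; };

void findMatchesOnGpu(Match* dMatch, int batch, cudaStream_t s); // enqueues the step (6)-(8) kernels
void cpuEncodeBatch(const Match* hMatch, int q, int batch);      // steps (9-2)(9-3) on the CPU

void pipelinedCompress(int q, int numBatches) {
    cudaStream_t stream[2];
    Match* dMatch[2];
    Match* hMatch[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&dMatch[i], q * sizeof(Match));
        cudaMallocHost(&hMatch[i], q * sizeof(Match));   // pinned memory for async copies
    }
    for (int batch = 0; batch < numBatches; ++batch) {
        int cur = batch & 1;
        // Enqueue GPU matching and the device-to-host copy for this batch.
        findMatchesOnGpu(dMatch[cur], batch, stream[cur]);
        cudaMemcpyAsync(hMatch[cur], dMatch[cur], q * sizeof(Match),
                        cudaMemcpyDeviceToHost, stream[cur]);
        if (batch > 0) {
            // While the GPU works on this batch, the CPU encodes the previous one.
            int prev = (batch - 1) & 1;
            cudaStreamSynchronize(stream[prev]);
            cpuEncodeBatch(hMatch[prev], q, batch - 1);
        }
    }
    if (numBatches > 0) {
        int last = (numBatches - 1) & 1;
        cudaStreamSynchronize(stream[last]);
        cpuEncodeBatch(hMatch[last], q, numBatches - 1);
    }
    for (int i = 0; i < 2; ++i) {
        cudaStreamDestroy(stream[i]);
        cudaFree(dMatch[i]); cudaFreeHost(hMatch[i]);
    }
}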

Brief Description of the Drawings

Fig. 1 is a flow chart of the high-speed lossless data compression method based on a hybrid GPU-CPU platform according to the present invention.

Fig. 2 is a schematic diagram of the division of the data file to be compressed into compression dictionary windows and their corresponding read-ahead windows.

Fig. 3 to Fig. 6 are schematic diagrams of an application example of the present invention.

Detailed Description of the Embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the present invention described below may be combined with one another as long as they do not conflict.

As shown in Fig. 1, the high-speed lossless data compression method based on a hybrid GPU-CPU platform of the present invention comprises the following steps:

(1) The CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU.

(2) Set up the thread block group bk[a] on the GPU and the number of threads b in each thread block, where a is the total number of thread blocks and is a positive integer, and b ranges from 256 to 1024.

(3) Set the length of the compression dictionary window to c, and set the head pointer to the first compression dictionary window to p_dic_h. Specifically, c ranges from 2 KB to 8 KB, with 4 KB preferred, and the initial value of p_dic_h points to the beginning of the data file to be compressed.

(4) Set the read-ahead window size to d and the pointer p_pre_r to the first read-ahead window; the initial value of this pointer is set to p_dic_h-c. The value of d equals 16*n, where n ranges from 1 to 8 with 4 preferred, i.e. the preferred size of each read-ahead window is 64 B.

(5) Initialize the worker thread group threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d.

(6) Invoke (a*b/2) threads of the worker thread group threads[a*b] to process q=(a*b/2)/c segments of length c+d in the data file to be compressed. The division of the data file into compression dictionary windows and their corresponding read-ahead windows is shown in Fig. 2: p_dic_h points to the head of the first compression dictionary window and p_pre_r points to the head of the first read-ahead window, while (p_dic_h-d) points to the head of the second compression dictionary window and (p_pre_r-d) points to the head of the second read-ahead window. Partitioned in this way, one loop iteration can process q pairs of compression dictionary windows and read-ahead data windows. Specifically, for the data of each compression dictionary window to be processed and its corresponding read-ahead data window, the following sub-steps are performed (a kernel sketch follows sub-step (6-3)):

(6-1) Set a counter i=0.

(6-2) Set up a thread T1 with thread number th1, one of threads (c*k) through (c*(k+1)-1) of the thread group threads[a*b/2], which determines whether byte (th1 mod c) of the k-th compression dictionary window matches (i.e. equals) bytes i*16 through (i+1)*16-1 of the k-th read-ahead data window, where 0<=k<q. When two bytes match (i.e. are equal) the value 1 is returned, otherwise 0, and the match results are written back to positions ((th1 mod c)*d+i*16) through ((th1 mod c)*d+i*16+16) of the k-th gMatrix matrix in global memory.

(6-3) Set i=i+1 and test whether i<n; if so, go to step (6-2), otherwise go to step (7).
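
A possible CUDA kernel for sub-steps (6-1) to (6-3) is sketched below. It assigns one thread per dictionary byte of each window pair and compares that byte against all d bytes of the corresponding read-ahead window, writing 0/1 flags into that pair's row of gMatrix. The kernel name, the flattened thread indexing, and the assumption that consecutive window pairs are offset by d bytes (consistent with the forward slide of q*d bytes per pass) are illustrative, not taken verbatim from the patent.

// Each thread handles one dictionary byte of one window pair k (step (6-2)).
// The patent's inner i-loop over 16-byte groups is collapsed into one loop over d.
__global__ void matchKernel(const unsigned char* data, unsigned char* gMatrix,
                            size_t p_dic_h, size_t p_pre_r,
                            int c, int d, int q) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // th1 in threads[a*b/2]
    if (tid >= q * c) return;
    int k   = tid / c;                 // which window pair this thread serves
    int row = tid % c;                 // (th1 mod c): byte position in the dictionary window
    unsigned char dictByte = data[p_dic_h + (size_t)k * d + row];
    const unsigned char* pre = data + p_pre_r + (size_t)k * d;
    unsigned char* out = gMatrix + ((size_t)k * c + row) * d;   // row of the k-th gMatrix
    for (int j = 0; j < d; ++j)        // compare against every byte of the read-ahead window
        out[j] = (dictByte == pre[j]) ? 1 : 0;
}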

(7) In each of the q result matrices gMatrix, find the diagonal segment with the most consecutive 1s and determine the ternary result array locations[p] of that result matrix. Each element of the array stores a triple (x, y, length), where p is the number of diagonal segments in the result matrix and equals c+d-1, x is the offset of the diagonal segment within the compression dictionary corresponding to that result matrix, y is its offset within the read-ahead window corresponding to that result matrix, and length is the length of the diagonal segment (i.e. the number of 1s it contains). When length is less than 3, no match has been found (because when the matched substring is no longer than two bytes, the compressed code is longer than the uncompressed data and therefore pointless); in that case x and y are meaningless and can simply be assigned the value -1.

This step specifically comprises the following sub-steps (a kernel sketch follows sub-step (7-2)):

(7-1) Set up a thread T2 with thread number th2, one of threads (c*k) through (c*(k+1)+d-1) of the thread group threads[a*b]. T2 is responsible for finding the sub-segment with the most consecutive 1s within one diagonal segment and recording the parameters x, y, and length of that sub-segment.

(7-2) Thread T2 stores the data x, y, and length it has obtained in element (th2 mod p) of the ternary result array locations.
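
A possible kernel for sub-steps (7-1) and (7-2) is sketched below: one thread per diagonal of each gMatrix walks its diagonal, tracks the longest run of consecutive 1s, and records (x, y, length), discarding runs shorter than 3 as described above. The diagonal enumeration and the layout of the locations array are assumptions made for illustration.

// One thread per diagonal of one gMatrix; diagonals satisfy x - y = const.
struct Match { int x, y, length; };

__global__ void diagonalKernel(const unsigned char* gMatrix, Match* locations,
                               int c, int d, int q) {
    int p   = c + d - 1;                       // diagonals per gMatrix
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= q * p) return;
    int k    = tid / p;                        // which gMatrix
    int diag = tid % p;                        // diagonal index: x - y = diag - (d - 1)
    int x = (diag >= d - 1) ? diag - (d - 1) : 0;          // starting cell of the diagonal
    int y = (diag >= d - 1) ? 0 : (d - 1) - diag;
    const unsigned char* m = gMatrix + (size_t)k * c * d;

    Match best = { -1, -1, 0 };
    int runLen = 0, runX = x, runY = y;
    while (x < c && y < d) {                   // walk down the diagonal
        if (m[(size_t)x * d + y]) {
            if (runLen == 0) { runX = x; runY = y; }
            if (++runLen > best.length) { best.x = runX; best.y = runY; best.length = runLen; }
        } else {
            runLen = 0;
        }
        ++x; ++y;
    }
    if (best.length < 3) { best.x = -1; best.y = -1; }   // matches shorter than 3 are not worth encoding
    locations[(size_t)k * p + diag] = best;
}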

(8) Find the element with the largest length value in the locations[p] array of each gMatrix: a thread T3 with thread number th3, one of threads 0 through (q-1) of the thread group threads[a*b], is responsible for finding the element with the largest length value in the ternary result array locations[p] of each gMatrix matrix and storing its parameters x, y, and length in the global match result array match[q], each element of which also stores a triple (x, y, length). A reduction sketch for this step follows.
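
Step (8) is a simple per-window reduction; the sketch below uses one thread per window pair and a sequential scan over the p = c+d-1 candidates, which is sufficient for small p and q. The kernel name and launch mapping are illustrative.

// One thread per gMatrix keeps the longest of its p candidate matches.
struct Match { int x, y, length; };

__global__ void reduceKernel(const Match* locations, Match* match,
                             int c, int d, int q) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // th3: one thread per window pair
    if (k >= q) return;
    int p = c + d - 1;
    Match best = { -1, -1, 0 };
    for (int i = 0; i < p; ++i) {
        Match cand = locations[(size_t)k * p + i];
        if (cand.length > best.length) best = cand;
    }
    match[k] = best;                                 // global match result array
}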

(9) Compress the data file to be compressed according to the match result array match[q]. This step specifically comprises the following sub-steps (an encoding sketch follows sub-step (9-3)):

(9-1) Transfer the match result array match[q] from the GPU to main memory on the CPU side.

(9-2) The CPU converts the data stored in the match result array match[q] into the offset and length of the longest match of the read-ahead window's substring within the compression dictionary window, and outputs the compressed-encoding triple array compress[q]. Each element of the compressed-encoding triple array stores a triple (flag, offset, length), where flag is a flag byte indicating whether the output is compressed data, and offset and length are the offset and length of the longest match of the read-ahead window's substring within the corresponding compression dictionary window.

(9-3) Perform data compression on the data file to be compressed according to the flag bytes. Specifically, the flag bytes obtained are of two types, original flag bytes and mixed flag bytes. In an original flag byte the first bit is 0 and the remaining 7 bits give the length of the raw data output that follows, so that up to 128 consecutive raw data bytes can follow; in a mixed flag byte the first bit is 1 and the remaining 7 bits describe 7 following items of mixed raw and compressed data, a 0 bit meaning the corresponding item is raw data and a 1 bit meaning it is a compressed code. Data compression is achieved in this way; if no match is found, the data is output unchanged.
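
The flag-byte format of step (9-3) can be illustrated by the following CPU-side C++ sketch, which serializes a sequence of items that are either raw bytes or (offset, length) references. The on-disk widths of offset and length (2 bytes and 1 byte here) and the bit order within the mixed flag byte are assumptions for illustration; the patent text does not fix them.

// Serialize a mixed stream of literal bytes and (offset, length) references
// using original flag bytes (top bit 0) and mixed flag bytes (top bit 1).
#include <algorithm>
#include <cstdint>
#include <vector>

struct Item {
    bool compressed;        // false: raw byte, true: (offset, length) reference
    uint8_t raw;            // valid when compressed == false
    uint16_t offset;        // valid when compressed == true (assumed 2-byte field)
    uint8_t length;         // valid when compressed == true (assumed 1-byte field)
};

void emitItems(const std::vector<Item>& items, std::vector<uint8_t>& out) {
    size_t i = 0;
    while (i < items.size()) {
        // Count how many consecutive literals start here (at most 127 per flag).
        size_t lit = 0;
        while (i + lit < items.size() && !items[i + lit].compressed && lit < 127) ++lit;
        if (lit > 0 && (i + lit == items.size() || lit == 127)) {
            // Original flag byte: top bit 0, low 7 bits = number of raw bytes that follow.
            out.push_back(static_cast<uint8_t>(lit));
            for (size_t j = 0; j < lit; ++j) out.push_back(items[i + j].raw);
            i += lit;
        } else {
            // Mixed flag byte: top bit 1, low 7 bits describe the next 7 items,
            // 0 = raw byte, 1 = compressed reference (unused trailing bits stay 0).
            size_t n = std::min<size_t>(7, items.size() - i);
            uint8_t flag = 0x80;
            for (size_t j = 0; j < n; ++j)
                if (items[i + j].compressed) flag |= static_cast<uint8_t>(1u << (6 - j));
            out.push_back(flag);
            for (size_t j = 0; j < n; ++j) {
                const Item& it = items[i + j];
                if (it.compressed) {
                    out.push_back(static_cast<uint8_t>(it.offset >> 8));
                    out.push_back(static_cast<uint8_t>(it.offset & 0xFF));
                    out.push_back(it.length);
                } else {
                    out.push_back(it.raw);
                }
            }
            i += n;
        }
    }
}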

(10) Determine whether the pointer p_pre_r has reached the end of the data file to be compressed. If so, the process ends; otherwise slide the dictionary windows and read-ahead windows forward, i.e. set p_pre_r=p_pre_r+q*d and p_dic_h=p_dic_h+q*d, and return to step (6).

The high-speed lossless data compression system based on a hybrid GPU-CPU platform of the present invention comprises:

a first module for reading the data file to be compressed and copying it from main memory into the global memory of the GPU;

a second module for setting up the thread block group bk[a] on the GPU and the number of threads b in each thread block, where a is the total number of thread blocks;

a third module for setting the length of the compression dictionary window to c and setting the head pointer to the first compression dictionary window to p_dic_h;

a fourth module for setting the read-ahead window size to d and the pointer p_pre_r to the first read-ahead window, the initial value of this pointer being set to p_dic_h-c;

a fifth module for initializing the worker thread group threads[a*b] and (a*b/2)/c gMatrix matrices, each of size c*d;

a sixth module for invoking (a*b/2) threads of the worker thread group threads[a*b] to process q=(a*b/2)/c segments of length c+d in the data file to be compressed;

a seventh module for finding, in each of the q result matrices gMatrix, the diagonal segment with the most consecutive 1s and determining the ternary result array locations[p] of each result matrix, each element of the array storing a triple (x, y, length), where p is the number of diagonal segments in the result matrix and equals c+d-1, x is the offset of the diagonal segment within the compression dictionary corresponding to that result matrix, y is its offset within the read-ahead window corresponding to that result matrix, and length is the length of the diagonal segment;

an eighth module for finding the element with the largest length value in the locations[p] array of each gMatrix: a thread T3 with thread number th3, one of threads 0 through (q-1) of the thread group threads[a*b], is responsible for finding the element with the largest length value in the ternary result array locations[p] of each gMatrix matrix and storing its parameters x, y, and length in the global match result array match[q], each element of which also stores a triple (x, y, length);

a ninth module for compressing the data file to be compressed according to the match result array match[q];

a tenth module for determining whether the pointer p_pre_r has reached the end of the data file to be compressed, in which case the process ends; otherwise the dictionary windows and read-ahead windows are slid forward, i.e. p_pre_r=p_pre_r+q*d and p_dic_h=p_dic_h+q*d are set, and control returns to the sixth module.

Example

To illustrate the principles of the present invention clearly, the following example describes the implementation process of the invention.

(1) The CPU reads the data file to be compressed and copies it from main memory into the global memory of the GPU.

(2) Set up the thread block group bk[1024] on the GPU, with 512 threads in each thread block.

(3) Set the length of the compression dictionary window to 4096 bytes; the head pointer to the first compression dictionary window is p_dic_h=0.

(4) Set the read-ahead window length to 64 bytes; the pointer to the first read-ahead window is p_pre_r=4096.

(5) Initialize the worker thread group threads[1024*512] and 64 gMatrix matrices of size 4096*64.

(6) Invoke 1024*256 threads of the worker thread group threads[1024*512] to process 64 segments of length (4096+64) B in the data file to be compressed. The division of the data file into compression dictionary windows and their corresponding read-ahead windows is shown in Fig. 3:

p_dic_h points to the head of the first compression dictionary window of the data file and p_pre_r points to the head of its first read-ahead window, while (p_dic_h-64) points to the head of the second compression dictionary window and (p_pre_r-64) points to the head of the second read-ahead window. Partitioned in this way, one loop iteration can process 64 pairs of compression dictionary windows and read-ahead data windows. Specifically, for the data of each compression dictionary window to be processed and its corresponding read-ahead data window, the following sub-steps are performed (here, the processing of the first pair of compression dictionary window and read-ahead data window is described):

To describe the compression process simply and clearly, assume the data in the first compression dictionary window of the data file to be compressed is "hellothisaeybc…isa", 4096 bytes long, and the first 16 B of the data in the first read-ahead window is "thisisaexampletosh". Threads T0 through T4095 of the thread group, 4096 threads in all, have thread numbers 0 through 4095. Tm1 (m1∈[0,4095]) is any one of these threads and m1 is its thread number; this thread determines whether byte m1 of the compression dictionary window matches (i.e. equals) bytes 0 through 15 of the read-ahead data window. When two bytes match (i.e. are equal) the value 1 is returned, otherwise 0, and the match results are written back to positions m1*64 through m1*64+16 of the corresponding gMatrix matrix in global memory. Since only 16 B of data are selected for compression, only the single iteration with i=0 is executed, producing 1/4 of the 4096*64 result matrix gMatrix, i.e. a result of size 4096*16, as shown in Fig. 4.

(7) The following describes only how the diagonal segment with the most consecutive 1s is found in the first of the 64 result matrices gMatrix.

(7-1) Threads T0 through T4110 of the thread group, (4096+15) threads in all, have thread numbers 0 through 4110. Tm2 (m2∈[0,4110]) is any one of these threads and m2 is its thread number; Tm2 is responsible for finding the sub-segment with the most consecutive 1s within one diagonal segment, as shown in Fig. 5.

(7-2) Thread Tm2 obtains the parameters x, y, and length and stores them in element m2 of the ternary result array locations.

In this example, threads T5 and T10 find sub-segments of consecutive 1s and assign values to the corresponding elements of the locations array: locations(5)={5,0,6} and locations(10)={10,2,3}; the other elements of the locations array have the value {-1,-1,0}, as shown in Fig. 6.

(8) Find the element with the largest length value in the locations[p] array of each gMatrix: threads T0 through T63 of the thread group, 64 threads in all, have thread numbers 0 through 63. Tm3 (m3∈[0,63]) is any one of these threads and m3 is its thread number; Tm3 is responsible for finding the element with the largest length value in the locations array of each gMatrix matrix and storing its parameters x, y, and length in the global match result array match[64], each element of which also stores a triple (x, y, length). In this example, thread T0 finds that the locations element with the largest length value in the corresponding 0th result matrix gMatrix is element 5, and writes its parameters into element 0 of the array match, i.e. match(0)={5,0,6}.

(9) Compress the data file to be compressed according to the match result array match[64]. This specifically comprises the following sub-steps:

(9-1) Transfer the match result array match[64] from the GPU to main memory on the CPU side.

(9-2) The CPU converts the data stored in the match result array match[64] into the offset and length of the longest match of the read-ahead window's substring within the compression dictionary window and outputs the compressed-encoding triple array compress[64]. In this example, the data of match(0) are converted to an offset of 5 and a length of 6 for the longest match of the 0th read-ahead window within the 0th compression dictionary window, and a mixed flag byte with the value 224 (11100000) is output, i.e. compress(0)={224,5,6}. The first (4096+16) bytes of the data file to be compressed are then output after compression as: hellothisaeybc…isa22456amplet3osh. Here the 4096 B of data in the first compression dictionary window cannot be compressed and are output unchanged; the 22456 that follows indicates that a 6-byte substring of the read-ahead data window has been compressed, its original text starting at offset 5 of the compression dictionary window and being 6 bytes long. Next a run of uncompressed raw data, amplet, 6 bytes long, is output; then an original flag byte with the value 3 (00000011) is output, followed by the 3 bytes of uncompressed raw data osh.

(10) Determine whether the pointer p_pre_r has reached the end of the data file to be compressed. If it points to the end of the file, the process ends; otherwise slide the dictionary windows and read-ahead windows forward, i.e. set p_pre_r=p_pre_r+64*64 and p_dic_h=p_dic_h+64*64, and return to step (6).

Those skilled in the art will readily understand that the above descriptions are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (9)

1.一种基于GPU和CPU混合平台的高速无损数据压缩方法,其特征在于,包括以下步骤:1. A high-speed lossless data compression method based on GPU and CPU mixed platform, is characterized in that, comprises the following steps: (1)CPU读取待压缩数据文件,将该待压缩数据文件从内存拷贝到GPU的全局存储器中;(1) The CPU reads the data file to be compressed, and copies the data file to be compressed from the memory to the global memory of the GPU; (2)设置GPU上的线程块组bk[a],每个线程块中的线程个数b,其中a为线程块的总数;(2) Set the thread block group bk[a] on the GPU, the number of threads b in each thread block, where a is the total number of thread blocks; (3)设置压缩字典窗口的长度为c,并设置指向第一个压缩字典窗口的头部指针为p_dic_h;(3) Set the length of the compressed dictionary window as c, and set the head pointer pointing to the first compressed dictionary window as p_dic_h; (4)设置预读窗口大小为d,指向第一个预读窗口的指针p_pre_r,该指针的初始值设置为p_dic_h-c;(4) Set the size of the pre-read window to d, point to the pointer p_pre_r of the first pre-read window, and set the initial value of the pointer to p_dic_h-c; (5)初始化工作线程组threads[a*b],以及(a*b/2)/c个gMatrix矩阵,其大小为c*d;(5) Initialize the worker thread group threads[a*b], and (a*b/2)/c gMatrix matrices, whose size is c*d; (6)调用工作线程组threads[a*b]中的(a*b/2)个线程处理待压缩数据文件中q=(a*b/2)/c个长度为c+d的数据;(6) Call (a*b/2) threads in the working thread group threads[a*b] to process q=(a*b/2)/c data whose length is c+d in the data file to be compressed; (7)在q个结果矩阵gMatrix的每一个中寻找具有最多连续1的斜线段,确定每个结果矩阵的三元结果数组locations[p],数组中的每个元素存储有三元结果(x,y,length),其中p为该结果矩阵中斜线段的数量,且等于c+d–1,x表示斜线段相对于其所在结果矩阵对应的压缩字典的偏移量,y表示该斜线段相对于其所在结果矩阵对应的预读窗口的偏移量,length表示该斜线段的长度;(7) Find the oblique segment with the most consecutive 1s in each of the q result matrices gMatrix, and determine the locations[p] of the ternary result array of each result matrix. Each element in the array stores a ternary result (x, y, length), where p is the number of slash segments in the result matrix, and is equal to c+d–1, x represents the offset of the slash segment relative to the compression dictionary corresponding to the result matrix where it is located, and y represents the relative The offset of the read-ahead window corresponding to the result matrix where it is located, and length indicates the length of the oblique segment; (8)寻找每个gMatrix对应的locations[p]数组中具有最大length值的元素:设置线程T3,其线程编号为th3,T3是线程组threads[a*b]中第0个线程至第(q-1)个线程中的一个,T3线程负责寻找每个gMatrix矩阵对应的三元结果数组locations[p]中具有最大length值的元素,并将其对应的参数x、y与length存入全局的匹配结果数组match[q],该数组中的每个元素也存储有三元结果(x,y,length);(8) Find the element with the largest length value in the locations[p] array corresponding to each gMatrix: set thread T3, whose thread number is th3, and T3 is the 0th thread in the thread group threads[a*b] to the ( One of the q-1) threads, the T3 thread is responsible for finding the element with the largest length value in the ternary result array locations[p] corresponding to each gMatrix matrix, and storing its corresponding parameters x, y and length in the global The matching result array match[q], each element in the array also stores a ternary result (x, y, length); (9)根据匹配结果数组match[q]对待压缩数据文件进行压缩;(9) Compress the data file to be compressed according to the matching result array match[q]; (10)判断指针p_pre_r是否已经到待压缩数据文件的尾部,如果是,则过程结束;否则,向前滑动字典窗口和预读窗口,即设置p_pre_r=p_pre_r+q*d,p_dic_h=p_dic_h+q*d,然后返回步骤(6)。(10) Determine whether the pointer p_pre_r has reached the end of the data file to be compressed, if so, the process ends; otherwise, slide the dictionary window and the pre-reading window forward, that is, set p_pre_r=p_pre_r+q*d, p_dic_h=p_dic_h+q *d, then return to step (6). 2.根据权利要求1所述的高速无损数据压缩方法,其特征在于,d的值等于16*n,n的取值范围为1-8。2. 
The high-speed lossless data compression method according to claim 1, wherein the value of d is equal to 16*n, and the value range of n is 1-8. 3.根据权利要求1所述的高速无损数据压缩方法,其特征在于,步骤(6)中,p_dic_h指向待压缩数据文件的第一个压缩字典窗口首部,p_pre_r指向待压缩数据文件的第一个预读窗口首部,而(p_dic_h-d)指向压缩数据文件第二个压缩字典窗口首部,(p_pre_r-d)指向待压缩数据文件的第二预读窗口首部,如此划分,一次循环,可以处理q对压缩字典窗口和预读数据窗口。3. The high-speed lossless data compression method according to claim 1, characterized in that in step (6), p_dic_h points to the head of the first compression dictionary window of the data file to be compressed, and p_pre_r points to the first The header of the pre-read window, and (p_dic_h-d) points to the header of the second compressed dictionary window of the compressed data file, and (p_pre_r-d) points to the header of the second pre-read window of the data file to be compressed. In this way, one cycle can process q For compressed dictionary windows and read-ahead data windows. 4.根据权利要求1所述的高速无损数据压缩方法,其特征在于,步骤(6)具体为,对于每一个待处理的压缩字典窗口和与其对应的预读数据窗口的数据,分别执行以下步骤:4. The high-speed lossless data compression method according to claim 1, characterized in that the step (6) is specifically, for each data of the compression dictionary window to be processed and the corresponding pre-reading data window, respectively perform the following steps : (6-1)设置计数器i=0;(6-1) Set counter i=0; (6-2)设置线程T1,其线程编号为th1,T1是线程组threads[a*b/2]中第(c*k)个线程至第(c*(k+1)-1)个线程中的一个,其判断第k个压缩字典窗口中第(th1 mod c)位字节与第k个预读数据窗口中第i*16至(i+1)*16-1位字节是否匹配,其中0<=k<q,当两个字节匹配时,返回值1,否则返回值0,并把该匹配结果写回全局存储器的第k个gMatrix矩阵中的位置((th1 modc)*d+i*16)到位置((th1 mod c)*d+i*16+16)中;(6-2) Set thread T1, whose thread number is th1, T1 is the (c*k)th thread to (c*(k+1)-1)th thread in the thread group threads[a*b/2] One of the threads, which judges whether the (th1 mod c)th byte in the kth compression dictionary window and the i*16 to (i+1)*16-1th byte in the kth pre-reading data window are Match, where 0<=k<q, when the two bytes match, return value 1, otherwise return value 0, and write the matching result back to the position in the kth gMatrix matrix of the global memory ((th1 modc) *d+i*16) into position ((th1 mod c)*d+i*16+16); (6-3)设置i=i+1,并判断是否有i<n,如果是则转入步骤(6-2),否则转入步骤(7)。(6-3) Set i=i+1, and judge whether i<n, if yes, go to step (6-2), otherwise go to step (7). 5.根据权利要求1所述的高速无损数据压缩方法,其特征在于,当length为小于3时,表明没有找到匹配,此时,x与y直接赋值为-1。5. The high-speed lossless data compression method according to claim 1, wherein when the length is less than 3, it indicates that no match is found, and at this time, x and y are directly assigned a value of -1. 6.根据权利要求1所述的高速无损数据压缩方法,其特征在于,步骤(7)具体包括以下子步骤:6. The high-speed lossless data compression method according to claim 1, wherein step (7) specifically includes the following sub-steps: (7-1)设置线程T2,其线程编号为th2,T2是线程组threads[a*b]中第(c*k)个线程至第(c*(k+1)+d-1)个线程中的一个,T2负责寻找一个斜线段中具有连续1最多的子段,并记录下该子段对应的参数x、y与length;(7-1) Set thread T2, whose thread number is th2, T2 is the (c*k)th thread to (c*(k+1)+d-1)th thread in the thread group threads[a*b] One of the threads, T2 is responsible for finding the sub-segment with the most consecutive 1s in a slash segment, and recording the parameters x, y and length corresponding to the sub-segment; (7-2)线程T2将其获取的对应数据x、y与length存储在三元结果数组locations的第(th2 mod p)个元素中。(7-2) Thread T2 stores the obtained corresponding data x, y and length in the (th2 mod p)th element of the ternary result array locations. 7.根据权利要求1所述的高速无损数据压缩方法,其特征在于,步骤(9)具体包括以下子步骤:7. 
The high-speed lossless data compression method according to claim 1, wherein step (9) specifically includes the following sub-steps: (9-1)将匹配结果数组match[q]从GPU传输到CPU中的主存储器;(9-1) Transfer the matching result array match[q] from the GPU to the main memory in the CPU; (9-2)CPU将存储在匹配结果数组match[q]中的数据换算成预读窗口的子串在压缩字典窗口中最长匹配的偏移量和长度,并输出压缩编码三元组compress[q],压缩编码三元数组中的每个元素存储有三元结果(flag,offset,length)其中,flag为标志字节,表明输出的是否是压缩数据,offset和length分别表示预读窗口的子串在对应的压缩字典窗口中最长匹配的偏移量和长度;(9-2) The CPU converts the data stored in the matching result array match[q] into the longest matching offset and length of the substring in the pre-reading window in the compression dictionary window, and outputs the compressed code triplet compress [q], each element in the compressed encoding ternary array stores a ternary result (flag, offset, length). Among them, flag is a flag byte, indicating whether the output is compressed data, and offset and length represent the read-ahead window respectively. The offset and length of the longest match of the substring in the corresponding compression dictionary window; (9-3)根据标志字节对待压缩数据文件进行数据压缩。(9-3) Perform data compression on the data file to be compressed according to the flag byte. 8.根据权利要求7所述的高速无损数据压缩方法,其特征在于,步骤(9-3)中,得到的标志字节类型包括原始标志字节和混合标志字节,原始标志字节第一位为0,后7位表示输出的原始数据长度,后面最多可以输出连续128个原始数据字节,混合标志字节第一位为1,后7位表示7个原始和压缩的混合数据,对应位为0时表示输出的是原始数据,为1时表示输出的是压缩编码,由此实现数据的压缩,如果没有发现匹配,则原样输出数据。8. The high-speed lossless data compression method according to claim 7, characterized in that, in step (9-3), the obtained flag byte type includes original flag byte and mixed flag byte, and the original flag byte is first The bit is 0, the last 7 bits indicate the length of the original data output, and a maximum of 128 consecutive original data bytes can be output later, the first bit of the mixed flag byte is 1, and the last 7 bits indicate 7 original and compressed mixed data, corresponding to When the bit is 0, it means that the original data is output, and when it is 1, it means that the output is compression coding, thereby realizing data compression. If no match is found, the data is output as it is. 9.一种基于GPU和CPU混合平台的高速无损数据压缩系统,其特征在于,包括:9. 
A high-speed lossless data compression system based on a GPU and CPU hybrid platform, characterized in that it comprises: 第一模块,用于读取待压缩数据文件,将该待压缩数据文件从内存拷贝到GPU的全局存储器中;The first module is used to read the data file to be compressed, and copy the data file to be compressed from the memory to the global memory of the GPU; 第二模块,用于设置GPU上的线程块组bk[a],每个线程块中的线程个数b,其中a为线程块的总数;The second module is used to set the thread block group bk[a] on the GPU, the number of threads b in each thread block, where a is the total number of thread blocks; 第三模块,用于设置压缩字典窗口的长度为c,并设置指向第一个压缩字典窗口的头部指针为p_dic_h;The third module is used to set the length of the compressed dictionary window as c, and set the head pointer pointing to the first compressed dictionary window as p_dic_h; 第四模块,用于设置预读窗口大小为d,指向第一个预读窗口的指针p_pre_r,该指针的初始值设置为p_dic_h-c;The fourth module is used to set the size of the pre-reading window as d, pointing to the pointer p_pre_r of the first pre-reading window, and the initial value of the pointer is set to p_dic_h-c; 第五模块,用于初始化工作线程组threads[a*b],以及(a*b/2)/c个gMatrix矩阵,其大小为c*d;The fifth module is used to initialize the worker thread group threads[a*b], and (a*b/2)/c gMatrix matrices, whose size is c*d; 第六模块,用于调用工作线程组threads[a*b]中的(a*b/2)个线程处理待压缩数据文件中q=(a*b/2)/c个长度为c+d的数据;The sixth module is used to call (a*b/2) threads in the working thread group threads[a*b] to process q=(a*b/2)/c in the data file to be compressed, the length of which is c+d The data; 第七模块,用于在q个结果矩阵gMatrix的每一个中寻找具有最多连续1的斜线段,确定每个结果矩阵的三元结果数组locations[p],数组中的每个元素存储有三元结果(x,y,length),其中p为该结果矩阵中斜线段的数量,且等于c+d–1,x表示斜线段相对于其所在结果矩阵对应的压缩字典的偏移量,y表示该斜线段相对于其所在结果矩阵对应的预读窗口的偏移量,length表示该斜线段的长度;The seventh module is used to find the oblique segment with the most consecutive 1s in each of the q result matrices gMatrix, and determine the locations[p] of the ternary result array of each result matrix, and each element in the array stores a ternary result (x, y, length), where p is the number of slash segments in the result matrix, and is equal to c+d–1, x represents the offset of the slash segment relative to the compression dictionary corresponding to the result matrix where it is located, and y represents the The offset of the slash segment relative to the read-ahead window corresponding to the result matrix where it is located, length indicates the length of the slash segment; 第八模块,用于寻找每个gMatrix对应的locations[p]数组中具有最大length值的元素:设置线程T3,其线程编号为th3,T3是线程组threads[a*b]中第0个线程至第(q-1)个线程中的一个,T3线程负责寻找每个gMatrix矩阵对应的三元结果数组locations[p]中具有最大length值的元素,并将其对应的参数x、y与length存入全局的匹配结果数组match[q],该数组中的每个元素也存储有三元结果(x,y,length);The eighth module is used to find the element with the maximum length value in the locations[p] array corresponding to each gMatrix: set thread T3, whose thread number is th3, and T3 is the 0th thread in the thread group threads[a*b] To one of the (q-1)th threads, the T3 thread is responsible for finding the element with the largest length value in the ternary result array locations[p] corresponding to each gMatrix matrix, and its corresponding parameters x, y and length Stored in the global matching result array match[q], each element in the array also stores a ternary result (x, y, length); 第九模块,用于根据匹配结果数组match[q]对待压缩数据文件进行压缩;The ninth module is used to compress the data file to be compressed according to the matching result array match[q]; 第十模块,用于判断指针p_pre_r是否已经到待压缩数据文件的尾部,如果是,则过程结束;否则,向前滑动字典窗口和预读窗口,即设置p_pre_r=p_pre_r+q*d,p_dic_h=p_dic_h+q*d,然后返回第六模块。The tenth module is used to judge whether the pointer p_pre_r has reached the end of the data file to be 
a tenth module for judging whether the pointer p_pre_r has reached the end of the data file to be compressed; if so, the process ends; otherwise, the dictionary window and the read-ahead window are slid forward, that is, p_pre_r=p_pre_r+q*d and p_dic_h=p_dic_h+q*d are set, and control returns to the sixth module.
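For illustration only (not forming part of the claims), the sixth to eighth modules can be pictured with the following simplified CUDA sketch. It collapses the claimed work distribution: one thread here handles a whole dictionary-window/read-ahead-window pair and scans all c+d-1 diagonals of the conceptual c x d match matrix gMatrix itself, whereas the claimed system spreads each window over c threads, materializes gMatrix and the per-diagonal array locations[p], and reduces them into match[q] with a dedicated thread T3. The identifiers findLongestMatch and MatchResult, and the assumption that consecutive windows are spaced d bytes apart, are illustrative.

```cuda
// Simplified sketch (assumption, not the patented kernel layout): one thread
// per window pair. gMatrix[i][j] = 1 when dict[i] == ahead[j]; the longest run
// of consecutive 1s along any diagonal gives the (x, y, length) triple that
// the claims store in match[q].
#include <cuda_runtime.h>

struct MatchResult { int x; int y; int length; };     // as in match[q]

__global__ void findLongestMatch(const unsigned char *data,   // file in GPU global memory
                                 long p_dic_h,                 // offset of the first dictionary window
                                 int c, int d, int q,          // window sizes and window count
                                 MatchResult *match)           // one result per window
{
    int w = blockIdx.x * blockDim.x + threadIdx.x;             // window index
    if (w >= q) return;

    const unsigned char *dict  = data + p_dic_h + (long)w * d; // dictionary window (length c)
    const unsigned char *ahead = dict + c;                     // read-ahead window (length d)
    MatchResult best = {0, 0, 0};

    // Each diagonal of gMatrix is one alignment of the read-ahead window
    // against the dictionary window; there are c + d - 1 of them.
    for (int diag = -(d - 1); diag <= c - 1; ++diag) {
        int i = diag > 0 ? diag : 0;        // start row    (dictionary offset)
        int j = diag > 0 ? 0 : -diag;       // start column (read-ahead offset)
        int run = 0;
        while (i < c && j < d) {
            if (dict[i] == ahead[j]) {      // a 1 in gMatrix: extend the run
                ++run;
                if (run > best.length) {
                    best.length = run;
                    best.x = i - run + 1;   // run start within the dictionary window
                    best.y = j - run + 1;   // run start within the read-ahead window
                }
            } else {
                run = 0;                    // a 0 breaks the run of consecutive 1s
            }
            ++i; ++j;
        }
    }
    match[w] = best;
}
```

A launch such as findLongestMatch<<<(q + 255) / 256, 256>>>(d_data, p_dic_h, c, d, q, d_match) would cover all q windows in one pass; the claimed design reaches finer parallelism by assigning c threads to each window instead of one.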
CN201310321071.7A 2013-07-26 2013-07-26 A kind of high-speed lossless data compression method based on GPU and CPU mixing platform Expired - Fee Related CN103427844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310321071.7A CN103427844B (en) 2013-07-26 2013-07-26 A kind of high-speed lossless data compression method based on GPU and CPU mixing platform

Publications (2)

Publication Number Publication Date
CN103427844A true CN103427844A (en) 2013-12-04
CN103427844B CN103427844B (en) 2016-03-02

Family

ID=49652097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310321071.7A Expired - Fee Related CN103427844B (en) 2013-07-26 2013-07-26 A kind of high-speed lossless data compression method based on GPU and CPU mixing platform

Country Status (1)

Country Link
CN (1) CN103427844B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101123723A (en) * 2006-08-11 2008-02-13 北京大学 Digital Video Decoding Method Based on Graphics Processor
US8374242B1 (en) * 2008-12-23 2013-02-12 Elemental Technologies Inc. Video encoder using GPU
CN101937082A (en) * 2009-07-02 2011-01-05 北京理工大学 Synthetic Aperture Radar Parallel Imaging Method Based on GPU Many-Core Platform
CN101937555A (en) * 2009-07-02 2011-01-05 北京理工大学 Parallel Generation Method of Pulse Compression Reference Matrix Based on GPU Many-Core Platform
US20110157192A1 (en) * 2009-12-29 2011-06-30 Microsoft Corporation Parallel Block Compression With a GPU
CN102436438A (en) * 2011-12-13 2012-05-02 华中科技大学 Sparse matrix data storage method based on GPU
CN103177414A (en) * 2013-03-27 2013-06-26 天津大学 Structure-based dependency graph node similarity concurrent computation method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FARACH M et al.: "Optimal parallel dictionary matching and compression", Proceedings of the 7th ACM Symposium on Parallel Algorithms and Architectures (SPAA 95), 26 April 1995 (1995-04-26), pages 244-253 *
SHEN KE et al.: "Overview of parallel processing approaches to image and video compression", SPIE 2186: Proceedings of the Conference on Image and Video Compression, 1 May 1994 (1994-05-01), pages 197-208 *
CUI CHEN: "Parallel Algorithm Design and Implementation of Key Modules of a GPU-Based H.264 Encoder", China Master's Theses Full-text Database, Information Science and Technology, no. 10, 15 October 2012 (2012-10-15), pages 1-64 *
HU XIAOLING: "Parallel Implementation of H.264/AVC Video Compression Coding on the CUDA Platform", China Master's Theses Full-text Database, Information Science and Technology, no. 09, 15 September 2011 (2011-09-15), pages 1-68 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104091305B (en) * 2014-08-11 2016-05-18 詹曙 Quick image segmentation method for computer graphics and image processing based on GPU platform and morphological component analysis
CN104091305A (en) * 2014-08-11 2014-10-08 詹曙 Quick image segmentation method used for computer graph and image processing and based on GPU platform and morphological component analysis
CN105630529A (en) * 2014-11-05 2016-06-01 京微雅格(北京)科技有限公司 Loading method of FPGA (Field Programmable Gate Array) configuration file, and decoder
CN104965761A (en) * 2015-07-21 2015-10-07 华中科技大学 Flow program multi-granularity division and scheduling method based on GPU/CPU hybrid architecture
CN104965761B (en) * 2015-07-21 2018-11-02 华中科技大学 Multi-granularity division and scheduling method for streaming programs based on GPU/CPU hybrid architecture
CN106019858A (en) * 2016-07-22 2016-10-12 合肥芯碁微电子装备有限公司 Direct writing type photoetching machine image data bit-by-bit compression method based on CUDA technology
CN106019858B (en) * 2016-07-22 2018-05-22 合肥芯碁微电子装备有限公司 A kind of direct-write type lithography machine image data bitwise compression method based on CUDA technologies
CN107508602A (en) * 2017-09-01 2017-12-22 郑州云海信息技术有限公司 A kind of data compression method, system and its CPU processor
CN110308982B (en) * 2018-03-20 2021-11-19 华为技术有限公司 Shared memory multiplexing method and device
CN110308982A (en) * 2018-03-20 2019-10-08 华为技术有限公司 A shared memory multiplexing method and device
CN110007855A (en) * 2019-02-28 2019-07-12 华中科技大学 Hardware-supported 3D stacked NVM memory data compression method and system
CN110007855B (en) * 2019-02-28 2020-04-28 华中科技大学 A hardware-supported 3D stacked NVM memory data compression method and system
CN111628779A (en) * 2020-05-29 2020-09-04 深圳华大生命科学研究院 A kind of parallel compression and decompression method and system of FASTQ file
CN111628779B (en) * 2020-05-29 2023-10-20 深圳华大生命科学研究院 Parallel compression and decompression method and system for FASTQ file
CN112463388A (en) * 2020-12-09 2021-03-09 广州科莱瑞迪医疗器材股份有限公司 SGRT data processing method and device based on multithreading
CN118227561A (en) * 2024-05-23 2024-06-21 杭州芯晓电子科技有限公司 Method for asynchronous parallel reading of timing Lib file based on GPU
CN118227561B (en) * 2024-05-23 2024-08-13 杭州芯晓电子科技有限公司 Method for asynchronous parallel reading of timing Lib file based on GPU
CN119359525A (en) * 2024-12-20 2025-01-24 上海创景信息科技股份有限公司 GPU-based binary large file parsing method, system, medium and terminal
CN119359525B (en) * 2024-12-20 2025-03-18 上海创景信息科技股份有限公司 GPU-based binary large file parsing method, system, medium and terminal

Also Published As

Publication number Publication date
CN103427844B (en) 2016-03-02

Similar Documents

Publication Publication Date Title
CN103427844B (en) A kind of high-speed lossless data compression method based on GPU and CPU mixing platform
US11431351B2 (en) Selection of data compression technique based on input characteristics
CN105207678B (en) A kind of system for implementing hardware of modified LZ4 compression algorithms
CN103189867B (en) Repeating data search method and equipment
JP7321208B2 (en) Polar code rate matching method and apparatus
WO2019228098A1 (en) Data compression method and device
CN103236847A (en) Multilayer Hash structure and run coding-based lossless compression method for data
WO2023279964A1 (en) Data compression method and apparatus, and computing device and storage medium
CN104753540A (en) Data compression method, data decompression method and device
CN114900193A (en) Adaptive Huffman Coding System and Method
US9035809B2 (en) Optimizing compression engine throughput via run pre-processing
CN114157305B (en) Method for rapidly realizing GZIP compression based on hardware and application thereof
CN105426413A (en) An encoding method and device
WO2021109696A1 (en) Data compression and decompression methods and devices, and data compression and decompression-based processing method and device
CN115599757A (en) Data compression method, device, computing device and storage system
CN102970043A (en) GZIP (GNUzip)-based hardware compressing system and accelerating method thereof
CN106712928A (en) Big data rainbow table based decryption method and device
CN102521299B (en) Method for processing data of resource description framework
WO2012009873A1 (en) Search processing device and network system thereof
CN116414828A (en) A data management method and related device
CN101478311A (en) Hardware accelerated implementation process for bzip2 compression algorithm
CN112162975A (en) Method for realizing repeated data deletion technology based on single-hash equal-distribution bloom filter
CN115395961A (en) Data Lossless Compression and Encrypted Transmission Method Based on Joint Middleware
CN103618554B (en) Memory pages compression method based on dictionary
CN105933120A (en) Spark platform-based password hash value recovery method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160302