CN104868922A - Data compression method and device - Google Patents
Publication number: CN104868922A
Application number: CN201410062498.4A
Authority: CN (China)
Legal status: Granted (listed by Google Patents as an assumption, not a legal conclusion)
Classification: Compression, Expansion, Code Conversion, And Decoders
Abstract
Embodiments of the present invention provide a data compression method and device. The device includes: a splitting module, configured to split data to be compressed according to a preset splitting rule, obtain the split data and send it to a compression module, so that a compression algorithm is suitable for the split data, where the preset splitting rule is determined according to the characteristics of the data to be compressed; the compression module, configured to receive the split data sent by the splitting module, compress the split data with the compression algorithm, output compressed data and send it to a storage module; and the storage module, configured to receive the compressed data sent by the compression module and store it. The embodiments split the data to be compressed according to its characteristics, so that data that is not suited to a dedicated compression algorithm can, after splitting, be processed by a dedicated compression algorithm at a high compression ratio, which saves storage space and reduces cost.
Description
Technical Field
Embodiments of the present invention relate to data processing technologies, and in particular to a data compression method and device.
Background
With the rapid development of science and technology, the amount of information is growing quickly. The storage and mining of information has therefore attracted wide attention, for example the big data pursued by information technology (IT) vendors. Storing information without compression incurs a high storage cost, so compression methods have received much attention.
Classified by whether information is lost, compression methods are divided into lossy and lossless methods. With a lossless method, the information can be restored in full after decompression; data compressed with a lossy method, such as compressed audio or video, cannot be restored in full. Common lossless compression methods include dedicated compression algorithms, such as dictionary compression, equal-value distance compression, difference compression and general bit compression algorithms.
A dedicated compression algorithm usually targets one class of data; when applied to other data, it may increase the storage space required.
Summary of the Invention
Embodiments of the present invention provide a data compression method and device, to solve the problem that compressing data with a dedicated compression algorithm may increase the storage space required.
In a first aspect, an embodiment of the present invention provides a data compression device, including:
a splitting module, configured to split data to be compressed according to a preset splitting rule, obtain the split data and send it to a compression module, so that a compression algorithm is suitable for the split data, where the preset splitting rule is determined according to the characteristics of the data to be compressed;
the compression module, configured to receive the split data sent by the splitting module, compress the split data with the compression algorithm, output compressed data and send it to a storage module; and
the storage module, configured to receive the compressed data sent by the compression module and store the compressed data.
With reference to the first aspect, in a first possible implementation of the first aspect, the splitting module is specifically configured to:
split the data to be compressed by integer division by a preset value, where the number of segments is determined by a sampling test on the data to be compressed.
With reference to the first aspect or its first possible implementation, in a second possible implementation of the first aspect, the compression module is specifically configured to:
receive the split data sent by the splitting module, compress the split data with the compression algorithm at least once, and output the compressed data, where the compression algorithms used in the individual compression passes are the same or different.
With reference to the first aspect or its first or second possible implementation, in a third possible implementation of the first aspect, the storage module is specifically configured to:
store the compressed data using self-describing compression coding to obtain coded data. The self-describing compression coding includes a coding-length identification part and a data-value coding part. The coding-length identification part indicates the number of bytes used to encode the compressed data, and the data-value coding part carries the encoded value of the compressed data. The coding length of the compressed data is identified by its own first N bits, where N equals the number of non-zero bits that precede the first zero bit, counting from the most significant bit of the binary representation of the coded data, plus 1.
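By way of illustration only, the length-prefix rule above may be sketched in Python as follows. The sketch assumes that an N-bit prefix (N-1 one bits followed by a zero bit) denotes a code of N bytes and that the remaining 7N bits hold the value; the exact layout is given by FIG. 3, which is not reproduced here, so this mapping and the function names are hypothetical.

```python
def encode_self_describing(value: int) -> bytes:
    """Encode a non-negative integer with a unary length prefix (assumed layout).

    An N-byte code starts with N-1 one bits and a single zero bit, so the
    first N bits identify the length; the remaining 7*N bits hold the value.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    for n_bytes in range(1, 9):
        payload_bits = 7 * n_bytes
        if value < (1 << payload_bits):
            ones = (1 << (n_bytes - 1)) - 1      # N-1 leading one bits
            prefix = ones << (payload_bits + 1)  # followed by one zero bit
            return (prefix | value).to_bytes(n_bytes, "big")
    raise ValueError("value too large for this sketch")


def prefix_bits(first_byte: int) -> int:
    """Return N: the one bits before the first zero bit, plus 1."""
    n, mask = 1, 0x80
    while mask and first_byte & mask:
        n += 1
        mask >>= 1
    return n


def decode_self_describing(buf: bytes) -> tuple:
    """Return (value, bytes consumed) for the code at the start of buf."""
    n_bytes = prefix_bits(buf[0])
    raw = int.from_bytes(buf[:n_bytes], "big")
    return raw & ((1 << (7 * n_bytes)) - 1), n_bytes
```

Under these assumptions the value 300 is stored in two bytes, 0x81 0x2C, and a decoder can find the end of each code from its first byte alone, so no separate length field is needed.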
With reference to the first aspect or its first or second possible implementation, in a fourth possible implementation of the first aspect, the device further includes:
a determination module, configured to trigger the compression module to compress pre-compressed data in the compression processing if it determines that the size of the pre-compressed data suits a variant bit compression algorithm;
the compression module is then specifically configured to compress the pre-compressed data with the variant bit compression algorithm to obtain the compressed data, where the variant bit compression algorithm compresses the individual data items using different numbers of bits.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, the determination module is specifically configured to:
obtain the minimum number of bytes needed to encode the largest data item in the pre-compressed data;
determine, from the size of the pre-compressed data and the minimum number of bytes, a first number of bytes required to byte-encode the pre-compressed data;
determine a second number of bytes required to bit-encode the pre-compressed data when a preset number of bits is used for the bit encoding; and
if the second number of bytes is smaller than the first number of bytes, determine that the size of the pre-compressed data suits the variant bit compression algorithm, and trigger the compression module to compress the pre-compressed data.
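The comparison performed by the determination module may be sketched as follows. This is a minimal sketch that assumes the data items are non-negative integers, that "size" means the number of items, and that the bit-encoded total is rounded up to whole bytes; the function names and the rounding are assumptions, not taken from the text.

```python
def min_bytes(value: int) -> int:
    """Fewest whole bytes that can hold a non-negative integer."""
    return max(1, (value.bit_length() + 7) // 8)


def variant_bit_compression_applies(values, bits_per_value: int) -> bool:
    """Compare byte encoding with bit encoding for one group of values."""
    # First number of bytes: every item stored at the width of the largest.
    first = len(values) * min_bytes(max(values))
    # Second number of bytes: every item stored in the preset number of
    # bits, with the total rounded up to whole bytes (assumption).
    second = (len(values) * bits_per_value + 7) // 8
    return second < first
```

For example, eight values in the range 0 to 7 need 8 bytes when byte-encoded but only 3 bytes at 3 bits each, so the check succeeds; two values of 200 and 255 at 8 bits each give 2 bytes either way, so it fails.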
With reference to the fourth or fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, the storage module is configured to:
store the compressed data using positive-number encoding, negative-number encoding, or a combination of the two, to obtain coded data, where positive-number encoding encodes the part of the compressed data processed by byte compression, and negative-number encoding encodes the part of the compressed data processed by bit compression.
With reference to the sixth possible implementation of the first aspect, in a seventh possible implementation of the first aspect, the coded data includes first coded data, and when positive-number encoding is used to store the compressed data, the storage module is specifically configured to:
obtain, from the largest data item in the compressed data, the minimum number of bytes needed to encode that item;
if the first bit of the first byte of the largest data item is zero, encode the part of the compressed data processed by byte compression using the minimum number of bytes, to obtain the first coded data; and
if the first bit of the first byte of the largest data item is non-zero, encode the part of the compressed data processed by byte compression using the minimum number of bytes plus 1, to obtain the first coded data.
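The width rule for positive-number encoding may be sketched as follows, assuming non-negative integer data in big-endian byte order; the function name is hypothetical. Adding a byte when the top bit is set keeps the leading bit of every stored value at zero, which is presumably what lets a reader tell a positive code from a negative one.

```python
def positive_encoding_width(max_value: int) -> int:
    """Bytes per item so that the leading bit of the largest item is zero."""
    n = max(1, (max_value.bit_length() + 7) // 8)  # minimum number of bytes
    first_byte = max_value >> (8 * (n - 1))        # most significant byte
    return n if first_byte & 0x80 == 0 else n + 1


# The resulting width is what Python needs for a signed big-endian integer.
width = positive_encoding_width(128)
encoded = (128).to_bytes(width, "big", signed=True)
```

With this rule 127 fits in one byte, while 128 and 255 take two bytes because their single-byte forms would start with a one bit.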
With reference to the sixth or seventh possible implementation of the first aspect, in an eighth possible implementation of the first aspect, the coded data includes second coded data, and the storage module is specifically configured to:
negative-number encode the part of the compressed data processed by bit compression according to a negative-number encoding rule, to obtain the second coded data. The negative-number encoding rule includes a coding-length identification part and a data coding part. The coding-length identification part indicates the number of encoded data items, and the data coding part carries the values obtained by encoding the data items with bits of a preset length. The length of the coding-length identifier is indicated by the first M bits of each data item in the second coded data, where M equals the number of non-zero bits that precede the first zero bit, counting from the most significant bit of the binary representation of each data item in the second coded data.
In a second aspect, an embodiment of the present invention provides a data compression method, including:
splitting data to be compressed according to a preset splitting rule to obtain split data, so that a compression algorithm is suitable for the split data, where the preset splitting rule is determined according to the characteristics of the data to be compressed;
compressing the split data with the compression algorithm and outputting compressed data; and
storing the compressed data.
With reference to the second aspect, in a first possible implementation of the second aspect, splitting the data to be compressed according to the preset splitting rule to obtain the split data includes:
splitting the data to be compressed by integer division by a preset value, where the number of segments is determined by a sampling test on the data to be compressed.
With reference to the second aspect or its first possible implementation, in a second possible implementation of the second aspect, compressing the split data with the compression algorithm and outputting the compressed data includes:
compressing the split data with the compression algorithm at least once and outputting the compressed data, where the compression algorithms used in the individual compression passes are the same or different.
With reference to the second aspect or its first or second possible implementation, in a third possible implementation of the second aspect, storing the compressed data includes:
storing the compressed data using self-describing compression coding to obtain coded data. The self-describing compression coding includes a coding-length identification part and a data-value coding part. The coding-length identification part indicates the number of bytes used to encode the compressed data, and the data-value coding part carries the encoded value of the compressed data. The coding length of the compressed data is identified by its own first N bits, where N equals the number of non-zero bits that precede the first zero bit, counting from the most significant bit of the binary representation of the coded data, plus 1.
With reference to the second aspect or its first or second possible implementation, in a fourth possible implementation of the second aspect, before the split data is compressed with the compression algorithm, the method further includes:
if it is determined that the size of pre-compressed data in the compression processing suits a variant bit compression algorithm, compressing the pre-compressed data with the variant bit compression algorithm to obtain the compressed data, where the variant bit compression algorithm compresses the individual data items using different numbers of bits.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation of the second aspect, determining that the size of the pre-compressed data in the compression processing suits the variant bit compression algorithm includes:
obtaining the minimum number of bytes needed to encode the largest data item in the pre-compressed data;
determining, from the size of the pre-compressed data and the minimum number of bytes, a first number of bytes required to byte-encode the pre-compressed data;
determining a second number of bytes required to bit-encode the pre-compressed data when a preset number of bits is used for the bit encoding; and
if the second number of bytes is smaller than the first number of bytes, determining that the size of the pre-compressed data suits the variant bit compression algorithm.
With reference to the fourth or fifth possible implementation of the second aspect, in a sixth possible implementation of the second aspect, storing the compressed data includes:
storing the compressed data using positive-number encoding, negative-number encoding, or a combination of the two, to obtain coded data, where positive-number encoding encodes the part of the compressed data processed by byte compression, and negative-number encoding encodes the part of the compressed data processed by bit compression.
With reference to the sixth possible implementation of the second aspect, in a seventh possible implementation of the second aspect, the coded data includes first coded data, and when positive-number encoding is used to store the compressed data, storing the compressed data using positive-number encoding, negative-number encoding, or a combination of the two to obtain the coded data includes:
obtaining, from the largest data item in the compressed data, the minimum number of bytes needed to encode that item;
if the first bit of the first byte of the largest data item is zero, encoding the part of the compressed data processed by byte compression using the minimum number of bytes, to obtain the first coded data; and
if the first bit of the first byte of the largest data item is non-zero, encoding the part of the compressed data processed by byte compression using the minimum number of bytes plus 1, to obtain the first coded data.
With reference to the sixth or seventh possible implementation of the second aspect, in an eighth possible implementation of the second aspect, the coded data includes second coded data, and when negative-number encoding is used to store the compressed data, storing the compressed data using positive-number encoding, negative-number encoding, or a combination of the two to obtain the coded data includes:
negative-number encoding the part of the compressed data processed by bit compression according to a negative-number encoding rule, to obtain the second coded data. The negative-number encoding rule includes a coding-length identification part and a data coding part. The coding-length identification part indicates the number of encoded data items, and the data coding part carries the values obtained by encoding the data items with bits of a preset length. The length of the coding-length identifier is indicated by the first M bits of each data item in the second coded data, where M equals the number of non-zero bits that precede the first zero bit, counting from the most significant bit of the binary representation of each data item in the second coded data.
The embodiments of the present invention split the data to be compressed according to its characteristics, so that data that is not suited to a dedicated compression algorithm can, after splitting, be processed by a dedicated compression algorithm at a high compression ratio, which saves storage space and reduces cost.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of Embodiment 1 of the data compression device of the present invention;
FIG. 2 is an example of a group of data according to the present invention;
FIG. 3 is a schematic diagram of the format of the self-describing compression coding of the present invention;
FIG. 4 is an example of a result of applying the self-describing compression coding to data according to the present invention;
FIG. 5 is another example of a result of applying the self-describing compression coding to data according to the present invention;
FIG. 6 is an example of the code stream data obtained after the second group of data is processed with the general bit compression algorithm according to the present invention;
FIG. 7 is an example of another group of data according to the present invention;
FIG. 8 is a schematic diagram of the second group of data before and after encoding according to the present invention;
FIG. 9 is a schematic structural diagram of Embodiment 2 of the data compression device of the present invention;
FIG. 10 is a flowchart of Embodiment 1 of the data compression method of the present invention.
Detailed Description
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
FIG. 1 is a schematic structural diagram of Embodiment 1 of the data compression device of the present invention. An embodiment of the present invention provides a data compression device that can be integrated into a communication device, where the communication device can be any terminal device such as a personal computer (PC), a notebook computer or a server. As shown in FIG. 1, the device 10 includes a splitting module 11, a compression module 12 and a storage module 13.
The splitting module 11 is configured to split data to be compressed according to a preset splitting rule, obtain the split data and send it to the compression module 12, so that a compression algorithm is suitable for the split data, where the preset splitting rule is determined according to the characteristics of the data to be compressed. The compression module 12 is configured to receive the split data sent by the splitting module 11, compress the split data with the compression algorithm, output compressed data and send it to the storage module 13. The storage module 13 is configured to receive the compressed data sent by the compression module 12 and store it.
In practice, compressing some data with existing compression methods increases the storage space instead of reducing it. In the embodiments of the present invention, the data to be compressed is therefore first preprocessed according to its characteristics: a preset splitting rule is defined in advance from those characteristics, and the data is then split according to that rule, so that the split data achieves a high compression ratio when compressed with the compression methods provided by the communication device.
In the above embodiment, the splitting module 11 may be specifically configured to split the data to be compressed by integer division by a preset value, where the number of segments is determined by a sampling test on the data to be compressed. The preset value may be, for example, 10^n or 256^n, and is not limited here.
Specifically, after the data compression device 10 reads the original data, its built-in splitting module 11 splits the original data into multiple groups according to the preset splitting rule and then triggers the compression module 12 to compress the split data. If no preset splitting rule has been set, the compression module 12 compresses the original data directly.
The preset splitting rule can be of several types, which are described below.
In the field of big data, some data is coded according to fixed rules, for example mobile phone numbers, International Mobile Subscriber Identification Number (IMSI) numbers, Internet Protocol (IP) addresses and domain names. Take mobile phone numbers in mainland China as an example: a number has 11 digits in three parts, where digits 1 to 3 are the network identification number, digits 4 to 7 are the area code and digits 8 to 11 are the subscriber number. Mobile phone numbers can therefore be split according to this rule before they are compressed.
For example, Table 1 shows a fragment of some field data obtained after a set of call detail records was sorted by certain fields to achieve a high compression ratio. SGSN_USR_IP is the IP address assigned to the user on the Serving GPRS Support Node (SGSN), and GGSN_USR_IP is the IP address assigned to the user on the Gateway GPRS Support Node (GGSN).
Table 1: Fragment of call detail record data after sorting
The data in Table 1 is analyzed as follows:
1. The calling number and the called number are not suited to compression with a dedicated compression algorithm. If both are split into three segments (or two segments) according to their coding rule, however, the first two segments (or the first segment, when splitting into two) can be compressed with the equal-value distance compression algorithm at a high compression ratio. For example, the calling number 1345439747 can be split into three segments, 134, 5439 and 747, or into two segments, either 134 and 5439747 or 1345439 and 747. Because storing a mobile phone number as an integer takes less space than storing it as a character string, the preset splitting rule can use integer division by 10^n, where n is 0 or a positive integer. Optionally, the preset splitting rules for the calling number and the called number may be the same or different, which is not limited here.
2. The two IP address fields, SGSN_USR_IP and GGSN_USR_IP, each consist of four dot-separated numbers, and each number ranges from 0 to 255. These fields can be compressed with the dictionary compression algorithm, but because the value range of IP addresses is large, the dictionary compression algorithm needs a 32-bit integer to complete the encoding. In practice, an IPv4 address is usually represented by 4 bytes, that is, a 32-bit integer in which each byte is one dot-separated number, which takes less space than representing the address directly as a character string. The two IP address fields can therefore be split using integer division by 256^n.
For example, an IPv4 address can be split into two segments as follows: first segment (IPV4&0xFFFF0000)>>16, second segment IPV4&0x0000FFFF; or first segment (IPV4&0xFF000000)>>24, second segment IPV4&0x00FFFFFF. It can be split into three segments as follows: first segment (IPV4&0xFF000000)>>24, second segment (IPV4&0x00FF0000)>>16, third segment IPV4&0x0000FFFF. Other splitting rules are also possible.
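The two splitting rules discussed above, integer division by 10^n for phone numbers and by 256^n for IPv4 addresses, may be sketched with one helper. The helper names and the sample address 192.168.1.100 are illustrative assumptions; the phone number and the mask expressions are those given in the text.

```python
import ipaddress


def split_by_powers(value, exponents, base=10):
    """Split an integer at the given powers of base (descending exponents)."""
    parts, rest = [], value
    for n in exponents:
        head, rest = divmod(rest, base ** n)  # integer division by base^n
        parts.append(head)
    parts.append(rest)
    return parts


def join_parts(parts, exponents, base=10):
    """Inverse of split_by_powers; the split itself loses no information."""
    value = parts[-1]
    for head, n in zip(parts[:-1], exponents):
        value += head * base ** n
    return value


# Calling number from the text: three segments, then the two 2-segment variants.
three = split_by_powers(1345439747, (7, 3))   # [134, 5439, 747]
two_a = split_by_powers(1345439747, (7,))     # [134, 5439747]
two_b = split_by_powers(1345439747, (3,))     # [1345439, 747]

# IPv4 address as a 32-bit integer, split with base 256.
ip = int(ipaddress.IPv4Address("192.168.1.100"))
ip_three = split_by_powers(ip, (3, 2), base=256)  # [192, 168, 356]
```

Dividing by 256^n gives the same segments as the mask-and-shift expressions: for the three-segment rule, the first, second and third values equal (IPV4&0xFF000000)>>24, (IPV4&0x00FF0000)>>16 and IPV4&0x0000FFFF respectively.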
In addition, the splitting rule may be formulated according to the characteristics of the actual data. As another example, the domain name of a Uniform Resource Locator (URL) may be split at the "/" or "." characters contained in the URL.
It should also be noted that, in any embodiment of the present invention, the number of segments may be determined by a sampling test or specified by the user. When it is determined by a sampling test, all or part of the data to be compressed is tested and compared, and the splitting rule that yields a high compression ratio and/or a short compression time after splitting is taken as the preset splitting rule.
Specifically, the compression module 12 compresses the split data with a compression algorithm provided by the communication device. The compression algorithm may be obtained in several ways: it may be specified, obtained heuristically, or obtained by full probing. Specifying the compression algorithm means designating it through command-line parameters, a configuration file or the like. Heuristic acquisition means informing the program, through command-line parameters, a configuration file or the like, of the type and characteristics of the split data or of a suggested compression algorithm; the program then samples the split data, estimates the compression achieved by the preset compression algorithms, and selects a suitable one. Full probing means sampling the split data to obtain sample data, estimating the compression achieved on the sample data by each available compression algorithm, and selecting a suitable algorithm from the estimates; the user does not need to specify the compression algorithm through parameters or configuration.
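A minimal sketch of the full-probing approach is given below. The candidate set (`zlib`, `bz2` and `lzma` from the Python standard library) and the name `pick_algorithm` are assumptions standing in for whatever algorithms the communication device actually provides; the description does not name specific algorithms here.

```python
import bz2
import lzma
import zlib

# Hypothetical candidate set; stand-ins for the device's own algorithms.
CANDIDATES = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def probe_sizes(data, sample_size=4096):
    """Compress a sample of the data with every candidate and report the sizes."""
    sample = data[:sample_size]
    return {name: len(compress(sample)) for name, compress in CANDIDATES.items()}


def pick_algorithm(data, sample_size=4096):
    """Return the name of the candidate with the smallest estimated output."""
    sizes = probe_sizes(data, sample_size)
    return min(sizes, key=sizes.get)


records = b"134,5439,747\n134,5439,748\n" * 500
print(probe_sizes(records), pick_algorithm(records))
```

Compression time could be measured in the same loop if the selection should balance ratio against speed.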
Further, the compression module 12 may be specifically configured to: receive the split data sent by the splitting module 11, compress the split data at least once with a compression algorithm, and output the compressed data, where the compression algorithms used in the individual compression passes may be the same or different.
It should be noted that, after compressing the split data once, the compression module 12 may compress the compressed data again, that is, perform multi-level compression. Generally, as the number of compression levels grows, the gain in compression ratio diminishes while the compression time increases, so the number of levels should usually not be set too high. The setting must balance compression time against compression efficiency; when there is no requirement on compression time, the number of levels may be increased moderately to obtain as high a compression ratio as possible.
The embodiment of the present invention splits the data to be compressed according to its characteristics, so that data that is not suitable for a special-purpose compression algorithm can, after splitting, be processed by such an algorithm with a high compression ratio, thereby saving data storage space and reducing cost.
In addition, the compressed data itself has to be described with data of a certain size. For example, the equal-value distance compression algorithm describes a group of repeated data such as that shown in FIG. 2 with <repeated data, repetition count> two-tuples, thereby compressing the group. The compressed data is the five integers "2, 10000, 100, 20000, 90", where the first number "2" indicates that there are two two-tuples and the following "10000, 100, 20000, 90" are the two two-tuples, representing the repeated data 10000 and 20000 with repetition counts 100 and 90 respectively. If the size used to describe each of these values is too large, for example 4 bytes, storage space is wasted; if it is too small, for example 1 byte, the values do not fit. Therefore, the embodiment of the present invention improves the method of storing compressed data.
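The two-tuple description above can be illustrated with a short Python sketch. The function names are assumptions, and the input is reconstructed from the description (100 copies of 10000 followed by 90 copies of 20000) rather than taken from FIG. 2 itself.

```python
from itertools import groupby


def equal_value_compress(values):
    """Return [pair_count, value1, count1, value2, count2, ...]."""
    pairs = [(value, sum(1 for _ in run)) for value, run in groupby(values)]
    flat = [len(pairs)]
    for value, count in pairs:
        flat.extend((value, count))
    return flat


def equal_value_decompress(flat):
    """Inverse of equal_value_compress."""
    pair_count, body = flat[0], flat[1:]
    out = []
    for i in range(pair_count):
        out.extend([body[2 * i]] * body[2 * i + 1])
    return out


data = [10000] * 100 + [20000] * 90
packed = equal_value_compress(data)
print(packed)
```

The leading pair count lets a decoder know where the tuple list ends when several compressed groups are stored back to back.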
In one implementation, the storage module 13 may be specifically configured to store the compressed data using self-describing compression coding to obtain coded data. FIG. 3 is a schematic diagram of the format of the self-describing compression coding of the present invention. As shown in FIG. 3, the self-describing compression coding may include a coding length identification part 31 and a data value coding part 32. The coding length identification part 31 indicates the number of bytes used to encode the compressed data, and the data value coding part 32 carries the encoded value of the compressed data. The coding length of the compressed data is identified by its own first N bits, where N equals the number of bits valued "1" that precede the first bit valued "0", counting from the most significant bit of the binary representation of the coded data, plus 1. N is the data coding length; see Table 2 for details.
Specifically, the coding length identification part 31 and the data value coding part 32 are separated by a "0" bit. Take the binary number 1011111111111111 as an example. The coding length identification part 31 is determined by the number of "1" bits before the first "0" bit, counting from the most significant bit of the coded data. Here that number is 1, so N=1+1=2. The data value coding part 32 is the remainder of the coded data, "11111111111111", and the two parts are separated by the "0" bit of the coding length identification part 31.
Table 2. Example encoding rules for self-describing compression coding
Applying the encoding rules of Table 2 to the compressed data "2, 10000, 100, 20000, 90" gives the hexadecimal result shown in FIG. 4, which occupies only 8 bytes. Compared with storing the compressed data as unsigned integers, there is no need to decide in advance how many bytes each integer should use, which avoids the waste caused by a predetermined byte count that is too large and allows storage space to be allocated and used reasonably. The ellipsis "……" in Table 2 indicates that the pattern shown continues in the same way, and the remaining rows are not listed one by one.
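The following Python sketch illustrates one possible reading of the self-describing coding. Table 2 is not reproduced in this text, so the layout used here is an assumption: an N-byte code starts with N-1 one bits and a zero bit, followed by 7N data bits. Under that assumption the five values above occupy 8 bytes, consistent with the size stated above; the function names and the 8-byte limit are likewise choices made for the sketch.

```python
def encode_value(value):
    """Encode a non-negative integer as N bytes: N-1 one bits, a zero bit, 7N data bits."""
    n = 1
    while value >= 1 << (7 * n):  # N bytes leave 8N - N = 7N bits for data
        n += 1
    if n > 8:
        raise ValueError("sketch limit: the length prefix must fit in the first byte")
    prefix = ((1 << (n - 1)) - 1) << 1  # N-1 ones followed by the separating zero
    word = (prefix << (7 * n)) | value
    return word.to_bytes(n, "big")


def decode_stream(data):
    """Decode a concatenation of codes produced by encode_value."""
    values, pos = [], 0
    while pos < len(data):
        n, mask = 1, 0x80
        while data[pos] & mask:  # count leading one bits of the first byte
            n += 1
            mask >>= 1
        word = int.from_bytes(data[pos:pos + n], "big")
        values.append(word & ((1 << (7 * n)) - 1))  # keep only the data bits
        pos += n
    return values


compressed = [2, 10000, 100, 20000, 90]
stream = b"".join(encode_value(v) for v in compressed)
print(stream.hex(), len(stream))
```

With this layout the example 1011111111111111 decodes as a 2-byte code whose 14 data bits are all ones.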
In another implementation, the data compression device 10 may further include a determination module (not shown), configured to trigger the compression module 12 to compress the pre-compressed data if it determines that the scale of the pre-compressed data in the compression process is suitable for the variant bit compression algorithm. In this case, the compression module 12 is specifically configured to compress the pre-compressed data with the variant bit compression algorithm to obtain the compressed data, where the variant bit compression algorithm compresses the individual data items with different numbers of bits.
Optionally, the determination module may be specifically configured to: obtain the minimum number of bytes needed to encode the largest data item in the pre-compressed data; determine, from the scale of the pre-compressed data and that minimum number of bytes, a first number of bytes needed to byte-encode the pre-compressed data; determine a second number of bytes needed to bit-encode the pre-compressed data when a preset number of bits is used; if the second number of bytes is smaller than the first number of bytes, determine that the scale of the pre-compressed data is suitable for the variant bit compression algorithm and trigger the compression module 12 to compress the pre-compressed data; otherwise, determine that the scale of the pre-compressed data is not suitable for the variant bit compression algorithm, and either compress the pre-compressed data with another compression algorithm or end its compression.
The variant bit compression algorithm is defined relative to the general bit compression algorithm. The general bit compression algorithm finds the maximum value in the data to be compressed, determines the minimum number of bits needed to encode that maximum value, and then encodes all of the data with that number of bits. The variant bit compression algorithm encodes the individual data items in the data to be compressed with different numbers of bits. The application of the variant bit compression algorithm is illustrated below by example.
For example, consider compressing a run of consecutive group identification code (GROUP_ID) data from a call detail record. The GROUP_ID data (70 values in total) is: 153 153 153 153 153 153 153 153 153 153 153 153 153 153 153 153 151 151 151 151 151 151 151 151 151 151 153 153 153 153 153 153 152 152 151 153 153 153 152 152 153 151 151 151 153 151 153 153 153 153 153 153 153 153 153 152 153 153 152 152 152 151 151 151 153 153 151 152 152 152.
First, the GROUP_ID data is compressed with the equal-value compression algorithm to obtain the first compressed data, namely 19 two-tuples <repeated data, repetition count>: <153,16><151,10><153,6><152,2><151,1><153,3><152,2><153,1><151,3><153,1><151,1><153,9><152,1><153,2><152,3><151,3><153,2><151,1><152,3>.
Then, the repeated data and the repetition counts are encoded and stored separately, which gives two groups of data. The first group is the set of repeated data: [153,151,153,152,151,153,152,153,151,153,151,153,152,153,152,151,153,151,152]. The second group is the set of repetition counts: [16,10,6,2,1,3,2,1,3,1,1,9,1,2,3,3,2,1,3], where each repetition count corresponds one-to-one to the repeated data item at the same position in the first group.
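The worked example can be checked with a short Python sketch that applies equal-value compression to the 70 GROUP_ID values and separates the result into the two groups. The variable names are assumptions chosen for this sketch.

```python
from itertools import groupby

# The 70 GROUP_ID values from the example above.
GROUP_ID = [int(token) for token in """
153 153 153 153 153 153 153 153 153 153 153 153 153 153 153 153
151 151 151 151 151 151 151 151 151 151
153 153 153 153 153 153 152 152 151 153 153 153 152 152 153
151 151 151 153 151 153 153 153 153 153 153 153 153 153
152 153 153 152 152 152 151 151 151 153 153 151 152 152 152
""".split()]

# Equal-value compression: one <repeated data, repetition count> pair per run.
pairs = [(value, sum(1 for _ in run)) for value, run in groupby(GROUP_ID)]

# Store the repeated data and the repetition counts as two separate groups.
repeated_data = [value for value, _ in pairs]
repetition_counts = [count for _, count in pairs]

print(len(pairs), repeated_data, repetition_counts)
```

Separating the two groups lets each be compressed a second time with the algorithm that suits its value distribution.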
The data in the first group is compressed a second time with the difference compression algorithm. Taking 151 as the base, the compressed data is: (151),2,0,2,1,0,2,1,2,0,2,0,2,1,2,1,0,2,0,1. This compressed data is then encoded: the base 151 uses self-describing compression coding, and all of the following values use 2-bit coding. The encoded result is shown in FIG. 5.
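A minimal sketch of the difference step and the 2-bit packing follows. The bit order (most significant bit first, zero padding at the end) and the function names are assumptions; FIG. 5 is not reproduced here, so the byte layout below is not claimed to match it.

```python
def pack_bits(values, width):
    """Pack values MSB-first using `width` bits each, zero-padded to whole bytes."""
    acc = 0
    for value in values:
        if value >> width:
            raise ValueError(f"{value} does not fit in {width} bits")
        acc = (acc << width) | value
    total_bits = width * len(values)
    pad = -total_bits % 8
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")


def unpack_bits(data, width, count):
    """Inverse of pack_bits for a known number of values."""
    acc = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    mask = (1 << width) - 1
    return [(acc >> (total_bits - width * (i + 1))) & mask for i in range(count)]


first_group = [153, 151, 153, 152, 151, 153, 152, 153, 151, 153,
               151, 153, 152, 153, 152, 151, 153, 151, 152]
base = min(first_group)                      # 151
diffs = [value - base for value in first_group]
packed = pack_bits(diffs, 2)
print(diffs, packed.hex(), len(packed))
```

Nineteen 2-bit differences occupy 38 bits, that is 5 bytes after padding, in addition to the bytes used for the base value.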
The data in the second group is compressed a second time with the variant bit compression algorithm. If the general bit compression algorithm were used instead, then, because the maximum value in the second group is 16, which is 10000 in binary and therefore needs at least 5 bits, every value in the second group would have to be encoded with 5 bits. Ignoring other compression coding information, encoding the second group alone would then need round((5*19+7)/8)=12 bytes, where round() denotes truncation to an integer. Since 5*19/8 has a fractional part, computing round(5*19/8) would drop that part and the stored data would be incomplete. Adding 7 to the numerator inside the parentheses, that is, computing (5*19+7)/8, rounds the result up instead, which avoids this loss and keeps the stored result accurate. The code stream (binary) obtained by processing the second group with the general bit compression algorithm is shown in FIG. 6.
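The byte count for the general bit compression algorithm can be reproduced with integer arithmetic, as in the sketch below. The function name `fixed_width_size` is an assumption; `//` in Python is the truncating division that round() denotes in the text.

```python
def fixed_width_size(values):
    """Return (bits per value, bytes needed) when every value uses the same width."""
    bits = max(1, max(values).bit_length())   # width needed by the largest value
    n_bytes = (bits * len(values) + 7) // 8    # adding 7 before dividing rounds up
    return bits, n_bytes


second_group = [16, 10, 6, 2, 1, 3, 2, 1, 3, 1, 1, 9, 1, 2, 3, 3, 2, 1, 3]
print(fixed_width_size(second_group))
print((5 * 19) // 8)  # truncating without the +7 would lose the last partial byte
```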
Inspection of the second group shows that only a few values are large and most are small. The variant bit compression algorithm is therefore used to compress the second group and obtain a smaller code stream. For example, 16, 10 and 9 use byte compression, and the other values use bit compression with fewer than 5 bits. Different encoding rules, described below, distinguish whether a piece of compressed data uses bit compression.
On this basis, the storage module 13 may be configured to store the compressed data using positive coding, negative coding or a combination of the two, to obtain coded data, where positive coding encodes the part of the compressed data that is byte-compressed and negative coding encodes the part of the compressed data that is bit-compressed.
Further, the coded data includes first coded data. When positive coding is used to store the compressed data, the storage module 13 may be specifically configured to: obtain, from the largest data item in the compressed data, the minimum number of bytes needed to encode that item; if the first bit of the first byte of the largest data item is zero, encode the byte-compressed part of the compressed data with the minimum number of bytes to obtain the first coded data; if the first bit of the first byte of the largest data item is non-zero, encode the byte-compressed part of the compressed data with the minimum number of bytes plus 1 to obtain the first coded data.
On this basis, the coded data includes second coded data. When negative coding is used to store the compressed data, the storage module 13 may be specifically configured to perform negative coding, according to a negative coding rule, on the bit-compressed part of the compressed data to obtain the second coded data. The negative coding rule includes a coding length identification part and a data coding part. The coding length identification part indicates the number of encoded data items, and the data coding part carries the values obtained by encoding those items with a preset number of bits. The length of the coding length identification part is identified by the first M bits of each data item in the second coded data, where M equals the number of non-zero bits that precede the first zero-valued bit, counting from the most significant bit of the binary representation of that data item.
Specifically, the largest number in the compressed data is found first, and the minimum number of bytes needed to encode it is obtained. Let this minimum number of bytes be m, where m is a positive integer. If the first bit of the first byte of the largest number is 0, every byte-compressed item in the compressed data is positively coded with m bytes; otherwise m+1 bytes are used. The number of bytes used for positive coding must be saved in the compressed code stream so that it can be identified and processed during decompression.
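The choice between m and m+1 bytes can be sketched as follows. The function names and the big-endian byte order are assumptions; the rule itself (keep the first bit at 0 so that a positive code is never mistaken for a negative one) follows the paragraph above.

```python
def positive_code_width(max_value):
    """Bytes per value for positive coding: m, or m+1 if the top bit would be set."""
    m = max(1, (max_value.bit_length() + 7) // 8)   # minimum bytes for the largest value
    top_bit_set = (max_value >> (8 * m - 1)) & 1
    return m + 1 if top_bit_set else m


def positive_encode(values):
    """Encode every value with the common width so that each code starts with a 0 bit."""
    width = positive_code_width(max(values))
    return width, b"".join(value.to_bytes(width, "big") for value in values)


print(positive_code_width(16))     # fits in one byte with the first bit clear
print(positive_code_width(200))    # one byte would start with a 1 bit, so two are used
print(positive_encode([16, 10, 9]))
```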
Data processed by bit compression uses negative coding so that it can be distinguished from the byte compression results described above. This is done by setting the first bit of the bit compression result to "1". The name "negative coding" comes from this: when the first bit is 1, a signed number is negative. It also pairs naturally with the positive coding described above.
A bit compression result has the following layout: it starts with the encoded count of compressed data items (using the negative coding rule shown in Table 3), followed by the bit-coded value of each compressed data item in turn, each using the specified number of bits.
It should also be noted that, because the number of data items that use negative coding can never be 0, the count is decremented by 1 before encoding and, correspondingly, incremented by 1 after decoding. For example, when encoding, 1 is encoded as 0, 2 as 1, 3 as 2, and so on; when decoding, the inverse operation is applied and 1 is added to the decoded result, so that 0 decodes to 1, 1 to 2, 2 to 3, and so on.
Table 3. Negative coding rules
Because bit coding has to carry the number of bit-compressed data items as overhead, the more data items can be bit-coded together, the greater the benefit for compression. This is explained by the two derivations below.
First, derive how large the data must be before bit coding becomes suitable.
Assume that the data consists of n items and that the largest item needs at least m bytes to encode. Without bit coding, m*n bytes are needed for encoding.
Assume that this group of numbers can be bit-coded with b bits each, so that round((n*b+7)/8) bytes are needed in total. In addition, at least 1 byte is needed to identify the number of encoded data items. Bit coding is therefore suitable when m*n>1+round((n*b+7)/8), that is, n>round(15/(8m-b)), where round is the round-down (floor) function. It follows that if fewer than this many consecutive data items can be compressed with b bits, those items are not suitable for bit coding even though each of them individually fits in b bits.
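The step from the first inequality to the second can be written out as follows. This is a sketch that treats the rounding as exact division, which is an approximation; the symbols m, n and b are those defined above, and 8m > b is assumed.

```latex
\begin{aligned}
m\,n &> 1 + \frac{n\,b + 7}{8} \\
8\,m\,n &> 8 + n\,b + 7 \\
n\,(8m - b) &> 15 \\
n &> \frac{15}{8m - b} \qquad (8m > b)
\end{aligned}
```

For instance, with m = 1 and b = 3 the bound is 15/5 = 3, so n must exceed 3.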
FIG. 7 shows another example data set of the present invention. The following describes which part of the data set in FIG. 7 is suitable for bit compression. The largest number in this data set is 16, which needs at least 1 byte to encode. If 3-bit compression coding is used, then n>round(15/(8*1-3))=3, that is, bit coding is suitable for this data set only where more than 3 consecutive data items can be encoded with 3 bits. This gives the result shown in FIG. 7.
Second, obtain the optimal coding bit value for bit coding.
For a given group of data to be compressed, first determine the bit value and the number of bytes needed to encode the largest data item in the group, the number of bytes needed to bit-code the group with that bit value, and the number of data items in the group that can be encoded consecutively with that bit value. Then, starting from these values, decrement the bit value by 1 at a time and, for each bit value, determine the number of bytes needed to encode the group. In each iteration, compare the number of bytes obtained in that iteration with the number obtained in the previous iteration and keep the smaller of the two as the result of the iteration. After the last iteration, that is, after the bit value "1" has been evaluated, the optimal number of bytes and the bit value that produced it, namely the optimal coding bit value, are determined. The optimal coding bit value may be obtained with different algorithms, which is not limited in the present invention.
For the second group of data above, 2-bit coding turns out to be the best choice; the data before and after coding is shown in FIG. 8. The 2-bit variant bit compression algorithm produces a 10-byte result, which is 2 bytes less than the 12 bytes produced by the general 5-bit compression algorithm above and 9 bytes less than encoding each value directly with 1 byte (19 bytes in total).
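The search for the optimal coding bit value can be sketched in Python as below. This is one possible algorithm, not the one prescribed by the description, and it rests on stated assumptions: each bit-coded run costs 1 count byte plus its packed bits, a run is bit-coded only when it is longer than 15 // (8m - b) items, a value fits in b bits when it is below 2^b, and each byte-coded value takes m = 1 byte. The function names are hypothetical.

```python
def variant_bit_size(values, bits, bytes_per_value=1):
    """Estimated bytes when runs of values below 2**bits are bit-packed."""
    if bits >= 8 * bytes_per_value:
        return len(values) * bytes_per_value   # bit coding cannot save anything
    min_run = 15 // (8 * bytes_per_value - bits)  # a run must be strictly longer

    def run_cost(run_length):
        if run_length > min_run:
            return 1 + (run_length * bits + 7) // 8   # count byte + packed bits
        return run_length * bytes_per_value            # too short: byte-code it

    total, run = 0, 0
    for value in values:
        if value < (1 << bits):
            run += 1
        else:
            total += run_cost(run) + bytes_per_value  # close the run, byte-code value
            run = 0
    return total + run_cost(run)


def best_bit_width(values, bytes_per_value=1):
    """Try every width from the largest value's bit length down to 1."""
    max_bits = max(1, max(values).bit_length())
    sizes = {b: variant_bit_size(values, b, bytes_per_value)
             for b in range(max_bits, 0, -1)}
    best = min(sizes, key=sizes.get)
    return best, sizes[best]


second_group = [16, 10, 6, 2, 1, 3, 2, 1, 3, 1, 1, 9, 1, 2, 3, 3, 2, 1, 3]
print(best_bit_width(second_group))
```

Under these assumptions the search selects 2 bits with an estimated size of 10 bytes for the second group, in line with the figures stated above. The 5-bit estimate here is 13 bytes rather than 12 because it includes the count byte.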
Therefore, whereas the general bit compression algorithm encodes everything with the number of bits needed by the largest data item, this implementation encodes different data items in the data to be compressed with different numbers of bits. This effectively avoids the waste of storage resources that arises when only a very few data items are large, the rest are small, and all of them are encoded with the bit width of the largest item.
FIG. 9 is a schematic structural diagram of Embodiment 2 of the data compression device of the present invention. As shown in FIG. 9, the data compression device 20 includes a processor 91 and a memory 92. The memory 92 stores execution instructions. When the data compression device 20 is running, the processor 91 communicates with the memory 92 and invokes the execution instructions in the memory 92 to perform the following operations:
splitting the data to be compressed according to a preset splitting rule to obtain split data, so that a compression algorithm is applicable to the split data, where the preset splitting rule is determined according to the characteristics of the data to be compressed;
compressing the split data with the compression algorithm, outputting the compressed data, and sending it to the memory 92.
The memory 92 is configured to receive the compressed data sent by the processor 91 and to store the compressed data.
In the above embodiment, splitting the data to be compressed according to the preset splitting rule to obtain the split data may include: splitting the data to be compressed by integer division by a preset value, where the number of segments is determined by performing a sampling test on the data to be compressed.
In the above embodiment, compressing the split data with the compression algorithm and outputting the compressed data may include: compressing the split data at least once with the compression algorithm and outputting the compressed data, where the compression algorithms used in the individual compression passes may be the same or different.
Optionally, in one implementation, storing the compressed data may include: storing the compressed data using self-describing compression coding to obtain coded data. The self-describing compression coding includes a coding length identification part and a data value coding part. The coding length identification part indicates the number of bytes used to encode the compressed data, and the data value coding part carries the encoded value of the compressed data. The coding length of the compressed data is identified by its own first N bits, where N equals the number of non-zero bits that precede the first zero-valued bit, counting from the most significant bit of the binary representation of the coded data, plus 1.
In another implementation, if the compression algorithm used in the compression process is the variant bit compression algorithm, the processor 91 may be further configured to: before compressing the split data with the compression algorithm, if it determines that the scale of the pre-compressed data in the compression process is suitable for the variant bit compression algorithm, compress the pre-compressed data with the variant bit compression algorithm to obtain the compressed data, where the variant bit compression algorithm compresses the individual data items with different numbers of bits.
Further, determining that the scale of the pre-compressed data in the compression process is suitable for the variant bit compression algorithm may include: obtaining the minimum number of bytes needed to encode the largest data item in the pre-compressed data; determining, from the scale of the pre-compressed data and the minimum number of bytes, a first number of bytes needed to byte-encode the pre-compressed data; determining a second number of bytes needed to bit-encode the pre-compressed data when a preset number of bits is used; and, if the second number of bytes is smaller than the first number of bytes, determining that the scale of the pre-compressed data is suitable for the variant bit compression algorithm.
On this basis, storing the compressed data may include: storing the compressed data using positive coding, negative coding or a combination of the two, to obtain coded data, where the positive coding encodes the part of the compressed data that is byte-compressed and the negative coding encodes the part of the compressed data that is bit-compressed.
Optionally, the coded data includes first coded data. When positive coding is used to store the compressed data, storing the compressed data using positive coding, negative coding or a combination of the two to obtain coded data includes: obtaining, from the largest data item in the compressed data, the minimum number of bytes needed to encode that item; if the first bit of the first byte of the largest data item is zero, encoding the byte-compressed part of the compressed data with the minimum number of bytes to obtain the first coded data; if the first bit of the first byte of the largest data item is non-zero, encoding the byte-compressed part of the compressed data with the minimum number of bytes plus 1 to obtain the first coded data.
进一步地,所述编码数据包括第二编码数据,当采用负数编码存储所述压缩数据时,所述采用正数编码和负数编码中其一或其组合存储所述压缩数据,获得编码数据,可包括:根据负数编码规则对所述压缩数据中采用比特压缩处理的部分进行负数编码,获得所述第二编码数据,所述负数编码规则包括编码长度标识部分和数据编码部分,所述编码长度标识部分用于表征被编码数据的个数,所述数据编码部分用于表征采用预设长度的比特位对被编码数据进行编码后的编码值,其中,所述编码长度标识的长度由所述第二编码数据中各数据的首M个比特进行标识,其中,M的大小等于所述第二编码中各数据的二进制表示中自最高位起第一个取值为零的位与最高位之间取值为非零的位的个数。Further, the encoded data includes the second encoded data. When the compressed data is stored with negative encoding, one or a combination of positive encoding and negative encoding is used to store the compressed data, and the encoded data can be obtained. It includes: performing negative coding on the part of the compressed data that adopts bit compression processing according to a negative coding rule to obtain the second coded data, the negative coding rule includes a coding length identification part and a data coding part, and the coding length identification The part is used to represent the number of coded data, and the data coding part is used to represent the coded value after coding the coded data with bits of a preset length, wherein the length identified by the coded length is determined by the first The first M bits of each data in the second coded data are identified, wherein the size of M is equal to the distance between the first zero value from the highest bit and the highest bit in the binary representation of each data in the second code The number of non-zero bits.
In the data compression device of this embodiment, the implementation principles and technical effects of the components are similar to those of the corresponding modules in the embodiment shown in FIG. 1 and are not repeated here.
FIG. 10 is a flowchart of Embodiment 1 of the data compression method of the present invention. This embodiment provides a data compression method that can be executed by the data compression device described above. The device can be integrated into a communication device, which may be any terminal device such as a PC, a notebook computer, or a server. As shown in FIG. 10, the method includes:
S101. Split the data to be compressed according to a preset splitting rule to obtain split data, so that the compression algorithm is applicable to the split data, where the preset splitting rule is determined according to the characteristics of the data to be compressed.
Specifically, after reading the original data, the data compression device splits the original data into multiple groups according to the preset splitting rule and then performs S102. If no splitting rule has been preset, the original data is compressed directly, that is, S102 is performed on the original data as is.
S102. Compress the split data using a compression algorithm and output the compressed data.
S103. Store the compressed data.
The method of this embodiment can be executed by the device shown in FIG. 1 or FIG. 9. Its implementation principles and technical effects are similar and are not repeated here.
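The flow of S101 to S103 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: `compress_and_store`, `split_even_odd`, and the dictionary used as storage are hypothetical names, and `zlib` merely stands in for whatever compression algorithm an embodiment uses.

```python
import zlib
from typing import Callable, Dict, List, Optional

SplitRule = Callable[[bytes], List[bytes]]


def compress_and_store(raw: bytes,
                       split_rule: Optional[SplitRule],
                       store: Dict[int, bytes]) -> None:
    """Split (S101), compress each group (S102), and store the result (S103)."""
    # S101: apply the preset splitting rule; with no rule, compress the data as is.
    groups = split_rule(raw) if split_rule is not None else [raw]
    for index, group in enumerate(groups):
        # S102: zlib is only a stand-in compression algorithm (assumption).
        compressed = zlib.compress(group)
        # S103: a dict plays the role of the storage module (assumption).
        store[index] = compressed


def split_even_odd(raw: bytes) -> List[bytes]:
    """Hypothetical splitting rule: separate even-indexed and odd-indexed bytes."""
    return [raw[0::2], raw[1::2]]
```

Passing `None` as the rule models the case where no splitting rule has been preset and the original data is compressed directly.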
In the above embodiment, S101 may include: splitting the data to be compressed by integer division by a preset value, where the number of segments produced by the split is determined by running a sampling test on the data to be compressed.
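One possible reading of the integer-division split is sketched below in Python: each value is divided by a preset divisor, and the quotients and remainders are kept as separate groups that can be compressed independently. The function names and the divisor used in the example are hypothetical, the two-group layout is an assumption, and the sampling test that selects the number of segments is not modeled.

```python
from typing import List, Tuple


def split_by_division(values: List[int], divisor: int) -> Tuple[List[int], List[int]]:
    """Split non-negative integers into a quotient group and a remainder group."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    quotients = [v // divisor for v in values]
    remainders = [v % divisor for v in values]
    return quotients, remainders


def merge_by_division(quotients: List[int], remainders: List[int],
                      divisor: int) -> List[int]:
    """Rebuild the original values from the two groups."""
    return [q * divisor + r for q, r in zip(quotients, remainders)]
```

With a divisor of 1000, the values 1001, 1002, and 2005 split into quotients 1, 1, 2 and remainders 1, 2, 5. Both groups have a much smaller range than the original values, which is the kind of property a specialized compression algorithm can exploit.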
Optionally, S102 may include: compressing the split data at least once using a compression algorithm and outputting the compressed data, where the compression algorithms used in the individual compression passes may be the same or different.
On the above basis, in one implementation, S103 may specifically include: storing the compressed data using a self-describing compression encoding to obtain encoded data. The self-describing compression encoding includes an encoding length identifier part and a data value encoding part: the encoding length identifier part indicates the number of bytes used to encode the compressed data, and the data value encoding part holds the encoded value of the compressed data. The encoding length of the compressed data is indicated by its own first N bits, where N equals the number of non-zero bits between the most significant bit and the first zero-valued bit, counted from the most significant bit, in the binary representation of the encoded data, plus 1.
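A self-describing encoding of this kind can be sketched in Python as below. The sketch assumes that the N-bit identifier consists of N-1 one bits followed by a zero bit and that an identifier of N bits means the value occupies N bytes in total; the text does not spell out this mapping, so it is an assumption, as are the function names and the 8-byte limit.

```python
from typing import Tuple


def encode_self_describing(value: int) -> bytes:
    """Encode a non-negative integer in 1 to 8 bytes with a leading length prefix."""
    if value < 0:
        raise ValueError("value must be non-negative")
    for n in range(1, 9):
        # n bytes carry n prefix bits and 7 * n payload bits.
        if value < (1 << (7 * n)):
            prefix = ((1 << (n - 1)) - 1) << 1  # n-1 ones followed by one zero
            word = (prefix << (7 * n)) | value
            return word.to_bytes(n, "big")
    raise ValueError("value does not fit in 8 bytes")


def decode_self_describing(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one value starting at pos; return (value, next position)."""
    first = buf[pos]
    n = 1
    mask = 0x80
    # Count leading one bits; the first zero bit ends the identifier.
    while mask and (first & mask):
        n += 1
        mask >>= 1
    if n > 8:
        raise ValueError("invalid length prefix")
    word = int.from_bytes(buf[pos:pos + n], "big")
    return word & ((1 << (7 * n)) - 1), pos + n
```

Because each value announces its own length, values of different sizes can be concatenated and read back without any external length table.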
Further, before S102, the data compression method may also include: if it is determined that the scale of the pre-compressed data in the compression processing is suitable for the deformed bit compression algorithm, compressing the pre-compressed data using the deformed bit compression algorithm to obtain the compressed data, where the deformed bit compression algorithm denotes compressing each datum using a different number of bits.
Determining that the scale of the pre-compressed data in the compression processing is suitable for the deformed bit compression algorithm may include: obtaining the minimum number of bytes needed to encode the largest datum in the pre-compressed data; determining, from the scale of the pre-compressed data and that minimum number of bytes, a first byte count required to byte-encode the pre-compressed data; determining a second byte count required to bit-encode the pre-compressed data using a preset number of bits; and, if the second byte count is smaller than the first byte count, determining that the scale of the pre-compressed data is suitable for the deformed bit compression algorithm.
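The comparison between the two byte counts can be sketched in Python as follows. The sketch assumes that "scale" means the number of values, that all values are non-negative integers, and that neither encoding carries header overhead. These simplifications and the function names are assumptions, not details confirmed by the text.

```python
from typing import Sequence


def min_bytes(value: int) -> int:
    """Minimum number of whole bytes needed to hold a non-negative integer."""
    return max(1, (value.bit_length() + 7) // 8)


def bit_compression_applies(values: Sequence[int], preset_bits: int) -> bool:
    """Return True when bit encoding needs fewer bytes than byte encoding."""
    if not values:
        return False
    # First byte count: every value is stored with the width of the largest one.
    first_bytes = len(values) * min_bytes(max(values))
    # Second byte count: every value takes preset_bits bits, rounded up to bytes.
    second_bytes = (len(values) * preset_bits + 7) // 8
    return second_bytes < first_bytes
```

For the four values 1, 2, 3, and 300 with 9 preset bits, byte encoding needs 4 × 2 = 8 bytes while bit encoding needs 36 bits, that is 5 bytes, so bit compression applies.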
On the basis of the above embodiments, S103 may include: storing the compressed data using positive encoding, negative encoding, or a combination of the two to obtain encoded data. Positive encoding denotes encoding the part of the compressed data processed by byte compression, and negative encoding denotes encoding the part of the compressed data processed by bit compression.
Optionally, the encoded data includes first encoded data. When the compressed data is stored using positive encoding, storing the compressed data using positive encoding, negative encoding, or a combination of the two to obtain encoded data may include: obtaining, from the largest datum in the compressed data, the minimum number of bytes needed to encode that largest datum; if the first bit of the first byte of the largest datum is zero, encoding the byte-compressed part of the compressed data using the minimum number of bytes to obtain the first encoded data; and if the first bit of the first byte of the largest datum is non-zero, encoding the byte-compressed part of the compressed data using the minimum number of bytes plus 1 to obtain the first encoded data.
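The width selection for positive encoding can be sketched in Python as follows. The extra byte presumably keeps the leading bit of every stored value at zero so that it can be told apart from negative encoding; that motivation is an inference rather than something this passage states. The function names are hypothetical.

```python
from typing import Sequence


def positive_encoding_width(max_value: int) -> int:
    """Byte width used for positive encoding, given the largest value."""
    if max_value < 0:
        raise ValueError("max_value must be non-negative")
    min_bytes = max(1, (max_value.bit_length() + 7) // 8)
    first_byte = max_value >> (8 * (min_bytes - 1))
    # If the first bit of the first byte is set, one more byte is used.
    if first_byte & 0x80:
        return min_bytes + 1
    return min_bytes


def encode_positive(values: Sequence[int]) -> bytes:
    """Encode all values big-endian with the shared width (illustrative only)."""
    width = positive_encoding_width(max(values))
    return b"".join(v.to_bytes(width, "big") for v in values)
```

A largest value of 127 fits in one byte with a zero leading bit, so one byte is used; a largest value of 128 has its leading bit set, so two bytes are used.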
Further, the encoded data includes second encoded data. When the compressed data is stored using negative encoding, storing the compressed data using positive encoding, negative encoding, or a combination of the two to obtain encoded data may include: performing negative encoding on the bit-compressed part of the compressed data according to a negative encoding rule to obtain the second encoded data. The negative encoding rule includes an encoding length identifier part and a data encoding part: the encoding length identifier part indicates the number of encoded data items, and the data encoding part holds the values obtained by encoding those data items with a preset number of bits. The length of the encoding length identifier is indicated by the first M bits of each datum in the second encoded data, where M equals the number of non-zero bits between the most significant bit and the first zero-valued bit, counted from the most significant bit, in the binary representation of each datum in the second encoded data.
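The data encoding part, in which each item is encoded with a preset number of bits, can be illustrated with plain fixed-width bit packing. The Python sketch below covers only that part: the item count is passed to the decoder as an argument instead of being stored in the prefix-coded length identifier, which is not modeled here. The function names and the zero padding of the final byte are assumptions.

```python
from typing import List, Sequence


def pack_bits(values: Sequence[int], width: int) -> bytes:
    """Pack non-negative integers using `width` bits each, most significant first."""
    if width <= 0:
        raise ValueError("width must be positive")
    acc = 0
    for v in values:
        if v < 0 or v >= (1 << width):
            raise ValueError("value does not fit in the preset bit width")
        acc = (acc << width) | v
    total_bits = len(values) * width
    pad = (-total_bits) % 8  # zero bits appended to reach a byte boundary
    acc <<= pad
    return acc.to_bytes((total_bits + pad) // 8, "big")


def unpack_bits(data: bytes, count: int, width: int) -> List[int]:
    """Recover `count` values of `width` bits each from packed bytes."""
    acc = int.from_bytes(data, "big")
    pad = len(data) * 8 - count * width
    acc >>= pad
    mask = (1 << width) - 1
    return [(acc >> (width * (count - 1 - i))) & mask for i in range(count)]
```

Packing the values 1, 2, and 3 with 3 bits each yields the 9 bits `001 010 011`, which are padded to two bytes.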
The embodiments of the present invention split the data to be compressed according to its characteristics, so that data for which a specialized compression algorithm would otherwise be unsuitable can, after splitting, be processed by a specialized compression algorithm to achieve a high compression ratio, which saves storage space and reduces cost.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410062498.4A CN104868922B (en) | 2014-02-24 | 2014-02-24 | Data compression method and apparatus |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN104868922A (en) | 2015-08-26 |
| CN104868922B (en) | 2018-05-29 |
Family
ID=53914480
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201410062498.4A Active CN104868922B (en) | 2014-02-24 | 2014-02-24 | Data compression method and apparatus |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN104868922B (en) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| EXSB | Decision made by sipo to initiate substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||