CN106445412A - Data volume compression ratio assessment method and system - Google Patents
Publication number: CN106445412A
Application number: CN201610824302.XA
Authority: CN (China)
Prior art keywords: data, sampling, target volume, compression, data block
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0608—Saving storage space on storage systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/064—Management of blocks
Abstract
The invention discloses a method and system for evaluating the compression ratio of a data volume. The method comprises: presetting n data sampling threads, where n is an integer not less than 2; dividing the target volume into n different target volume regions, with the data sampling threads and the target volume regions in one-to-one correspondence; randomly sampling each target volume region with its corresponding data sampling thread; performing a compression calculation on each sampled data block with a preset compression algorithm; and aggregating the compression calculation results of the sampled data blocks to calculate the compression ratio evaluation result of the target volume. Before the data volume is compressed, statistical random sampling is used to analyze and evaluate the compression ratio of the target volume. The survey is simple and inexpensive, the evaluation is accurate and reliable, and few system resources are occupied. The space saved by compressing the target volume can be estimated without a lengthy data compression process, which gives the user data to support the decision of whether to compress.
Description
Technical Field
The invention relates to the technical field of storage systems, and in particular to a method and system for evaluating the compression ratio of a data volume.
Background
With the development of science and technology, society has entered the information age, in which data processing is the key technology.
Data processing usually relies on storage devices to hold the data. With the arrival of the big data era, data volumes keep growing, which demands more efficient use of storage systems. Common storage systems now provide a data compression function, so that data blocks occupying a large amount of storage space are stored as relatively smaller ones. However, data compression often takes a long time and consumes considerable CPU resources, and compression efficiency differs widely between data types. Compressing data blocks that compress poorly saves little storage space while consuming a lot of time and CPU resources, so the cost often outweighs the benefit.
Therefore, how to learn in advance how much space compressing a data block would save, in order to judge whether that data block is worth compressing, is a technical problem that those skilled in the art currently need to solve.
Summary of the Invention
The object of the invention is to provide a method and system for evaluating the compression ratio of a data volume that can determine, before the data volume is compressed, how much space compression would save, so as to judge whether the data is worth compressing and to give the user a basis for the decision.
To solve the above technical problem, the invention provides the following technical solutions:
A method for evaluating the compression ratio of a data volume, comprising:
presetting n data sampling threads, where n is an integer not less than 2;
dividing the target volume into n different target volume regions, with the data sampling threads and the target volume regions in one-to-one correspondence;
randomly sampling each target volume region with its corresponding data sampling thread;
performing a compression calculation on each sampled data block with a preset compression algorithm;
aggregating the compression calculation results of the sampled data blocks and calculating the compression ratio evaluation result of the target volume.
Preferably, randomly sampling each target volume region with its corresponding data sampling thread comprises:
determining in advance the size of the data block used as each sample;
having each sampling thread repeatedly read data blocks of the determined size from its corresponding target volume region.
Preferably, performing a compression calculation on each sampled data block with a preset compression algorithm is:
performing the compression calculation on each sampled data block with the VIFO compression algorithm.
Preferably, aggregating the compression calculation results of the sampled data blocks comprises:
judging whether the total number of samples has reached a preset lower-limit threshold;
if not, saving the input data size, output data size and compression ratio of the VIFO compression algorithm for each sample;
if so, aggregating all the saved data.
Preferably, the method further comprises:
judging whether the number of IO read failures of the data sampling threads on their corresponding target volume regions exceeds a preset read-failure threshold;
if so, stopping the data sampling threads from sampling their corresponding target volume regions.
Preferably, the method further comprises:
judging whether the number of empty data blocks among the samples taken by the data sampling threads from the target volume has reached a preset upper-limit threshold;
if so, stopping the data sampling threads from sampling their corresponding target volume regions.
A system for evaluating the compression ratio of a data volume, comprising: a data sampling thread module, a region division module and a compression ratio evaluation module;
wherein the data sampling thread module comprises n data sampling thread units, n being an integer not less than 2; the region division module is used to divide the target volume into n different target volume regions according to the number of data sampling thread units, with the data sampling thread units and the target volume regions in one-to-one correspondence;
each data sampling thread unit is used to randomly sample its corresponding target volume region and to perform a compression calculation on each sampled data block according to a preset compression algorithm;
the compression ratio evaluation module is used to aggregate the compression calculation results of the sampled data blocks and to calculate the compression ratio evaluation result of the target volume.
Preferably, the data sampling thread unit comprises:
a sample block size setting subunit, used to determine the size of the data block used as each sample;
a sampling function subunit, used to repeatedly read data blocks of the determined size, perform random sampling, and perform a compression calculation on each sampled data block according to the preset compression algorithm.
Preferably, the system further comprises:
a first judging module, used to judge whether the total number of samples has reached the preset lower-limit threshold and, when it has, to control the compression ratio evaluation module to perform the aggregation.
Preferably, the system further comprises:
a second judging module, used to judge whether the number of IO read failures of the data sampling thread module on the target volume exceeds the preset read-failure threshold and, if so, to control the data sampling thread module to stop sampling;
a third judging module, used to judge whether the number of empty data blocks among the samples taken by the data sampling thread module from the target volume has reached the preset upper-limit threshold and, if so, to control the data sampling thread module to stop sampling.
Compared with the prior art, the above technical solution has the following advantages:
The method for evaluating the compression ratio of a data volume provided by the embodiments of the invention comprises: presetting n data sampling threads, where n is an integer not less than 2; dividing the target volume into n different target volume regions, with the data sampling threads and the target volume regions in one-to-one correspondence; randomly sampling each target volume region with its corresponding data sampling thread; performing a compression calculation on each sampled data block with a preset compression algorithm; and aggregating the compression calculation results of the sampled data blocks to calculate the compression ratio evaluation result of the target volume. Before the data volume is compressed, its compression ratio is evaluated with this solution, which applies mathematical and statistical methods to a random sample of the target volume. The survey is simple and inexpensive, the evaluation is accurate and reliable, and few system resources are occupied. The space saved by compressing the target volume can be estimated without a lengthy data compression process, which gives the user data to support the decision of whether to compress.
Brief Description of the Drawings
To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for that description are briefly introduced below. The drawings described below show some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for evaluating the compression ratio of a data volume provided by a specific embodiment of the invention;
Fig. 2 is a schematic structural diagram of a system for evaluating the compression ratio of a data volume provided by a specific embodiment of the invention.
Detailed Description
The core of the invention is to provide a method and system for evaluating the compression ratio of a data volume that can determine, before the data volume is compressed, how much space compression would save, so as to judge whether the data is worth compressing and to give the user a basis for the decision.
To make the above objects, features and advantages of the invention easier to understand, specific embodiments of the invention are described in detail below with reference to the drawings.
Specific details are set forth in the following description to provide a thorough understanding of the invention. The invention can, however, be implemented in many ways other than those described here, and those skilled in the art can generalize it without departing from its substance. The invention is therefore not limited to the specific embodiments disclosed below.
Please refer to Fig. 1, a flowchart of a method for evaluating the compression ratio of a data volume provided by a specific embodiment of the invention.
A specific embodiment of the invention provides a method for evaluating the compression ratio of a data volume, comprising:
S11: presetting n data sampling threads, where n is an integer not less than 2;
S12: dividing the target volume into n different target volume regions, with the data sampling threads and the target volume regions in one-to-one correspondence;
S13: randomly sampling each target volume region with its corresponding data sampling thread;
S14: performing a compression calculation on each sampled data block with a preset compression algorithm;
S15: aggregating the compression calculation results of the sampled data blocks and calculating the compression ratio evaluation result of the target volume.
In this embodiment, a multi-threaded architecture is used to improve both the efficiency of the evaluation algorithm and the accuracy of the random sampling: n data sampling threads are preset, the target volume is divided into as many regions as there are threads, and each thread independently samples one region at random. Each target volume region corresponds to one cluster in statistical cluster sampling, so the solution is convenient to carry out and inexpensive. A preset compression algorithm performs a compression calculation on each sampled data block, which yields the compression ratio of that sample. Aggregating the results of many samples then gives the evaluation result for the compression ratio of the target volume.
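Steps S11 to S15 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the target volume is treated as an ordinary file, `zlib` stands in for the preset compression algorithm, the ratio is taken as compressed bytes divided by sampled bytes, and the names `split_regions`, `sample_region` and `estimate_ratio` are hypothetical.

```python
import os
import random
import threading
import zlib

BLOCK_SIZE = 64 * 1024  # sample size preferred in the embodiment


def split_regions(volume_size, n):
    """S12: split [0, volume_size) into n contiguous, non-overlapping regions."""
    step = volume_size // n
    return [(i * step, volume_size if i == n - 1 else (i + 1) * step)
            for i in range(n)]


def sample_region(path, start, end, count, results, lock, seed):
    """S13/S14: one thread reads `count` random blocks from its own region."""
    rng = random.Random(seed)
    last = max(start, end - BLOCK_SIZE)  # latest offset that still fits a block
    with open(path, "rb") as f:
        for _ in range(count):
            f.seek(rng.randint(start, last))
            block = f.read(BLOCK_SIZE)
            if not block:
                continue
            out_size = len(zlib.compress(block))  # stand-in compressor (assumption)
            with lock:
                results.append((len(block), out_size))


def estimate_ratio(path, n_threads=4, samples_per_thread=8):
    """S11-S15: return compressed bytes / sampled bytes, or None if no data."""
    if n_threads < 2:
        raise ValueError("n must be an integer not less than 2")
    results, lock = [], threading.Lock()
    regions = split_regions(os.path.getsize(path), n_threads)
    threads = [
        threading.Thread(target=sample_region,
                         args=(path, s, e, samples_per_thread, results, lock, i))
        for i, (s, e) in enumerate(regions)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total_in = sum(size_in for size_in, _ in results)
    total_out = sum(size_out for _, size_out in results)
    return total_out / total_in if total_in else None
```

Each thread only ever seeks inside its own region, which mirrors the one-to-one correspondence between threads and regions; the shared list is the only point of contact and is protected by a lock.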
Before the data volume is compressed, its compression ratio is evaluated with the above solution, which applies mathematical and statistical methods to a random sample of the target volume. The survey is simple and inexpensive, the evaluation is accurate and reliable, and few system resources are occupied. The space saved by compressing the target volume can be estimated without a lengthy data compression process, which gives the user data to support the decision of whether to compress. If the evaluation shows that compressing the current target volume would save a large amount of space, the user can compress it; otherwise the target volume need not be compressed, which avoids occupying CPU resources and saves time.
On the basis of the above embodiment, in one embodiment of the invention, randomly sampling each target volume region with its corresponding data sampling thread comprises:
determining in advance the size of the data block used as each sample;
having each sampling thread repeatedly read data blocks of the determined size from its corresponding target volume region.
In this embodiment, each sample is preferably a 64 KB data block, a size that is neither too large nor too small and therefore keeps sampling efficient.
Note that data blocks of other sizes may also be used as samples; this embodiment does not limit the size, which depends on the specific situation.
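One way to pick the sample positions is to draw block-aligned offsets inside a region without repetition. The sketch below assumes that alignment to the 64 KB sample size is desired, which the source does not state; `pick_offsets` is a hypothetical helper name.

```python
import random

BLOCK_SIZE = 64 * 1024  # preferred sample size; other sizes are allowed


def pick_offsets(region_start, region_end, count, rng=None):
    """Return up to `count` distinct block-aligned offsets inside the region."""
    rng = rng or random.Random()
    first = -(-region_start // BLOCK_SIZE)  # first whole block (ceiling division)
    last = region_end // BLOCK_SIZE         # one past the last whole block
    n_blocks = max(0, last - first)
    chosen = rng.sample(range(first, last), min(count, n_blocks))
    return sorted(index * BLOCK_SIZE for index in chosen)
```

Sorting the offsets is a design choice that keeps the reads moving forward through the region; it does not change which blocks are sampled.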
Further, performing a compression calculation on each sampled data block with a preset compression algorithm is: performing the compression calculation on each sampled data block with the VIFO compression algorithm.
This embodiment uses the VIFO (Variable Input Fixed Output) compression algorithm, in which the input buffer has variable length and the output buffer has fixed length. The algorithm compresses as much data as possible until either the input buffer is empty or the output buffer is full.
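The variable-input, fixed-output idea can be illustrated as follows. The source does not specify the underlying codec, so this sketch uses DEFLATE via Python's `zlib` purely as a stand-in; `vifo_compress` and its `step` parameter are hypothetical. Input is fed in small chunks and a sync flush after each chunk reveals how many output bytes it cost, which makes the measured output slightly larger than a single-pass compression would produce.

```python
import zlib


def vifo_compress(data, out_capacity, step=4096):
    """Consume as much of `data` as fits into `out_capacity` compressed bytes.

    Returns (bytes_consumed, bytes_produced). Stops when the input is
    exhausted or when the next chunk would overflow the output buffer.
    """
    comp = zlib.compressobj()
    consumed = produced = 0
    while consumed < len(data):
        chunk = data[consumed:consumed + step]
        # Z_SYNC_FLUSH forces out all pending bytes so the cost is measurable.
        out = comp.compress(chunk) + comp.flush(zlib.Z_SYNC_FLUSH)
        if produced + len(out) > out_capacity:
            break  # output buffer full
        consumed += len(chunk)
        produced += len(out)
    return consumed, produced
```

For highly compressible input the whole sample is consumed and the output stays far below the capacity; for incompressible input the loop stops early, so the pair of sizes still yields a usable per-sample ratio.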
Still further, aggregating the compression calculation results of the sampled data blocks comprises:
judging whether the total number of samples has reached a preset lower-limit threshold;
if not, saving the input data size, output data size and compression ratio of the VIFO compression algorithm for each sample;
if so, aggregating all the saved data.
In this embodiment a lower-limit threshold is set for the total number of samples, preferably 3400. The threshold must satisfy the sample size required for statistical sampling while keeping the algorithm fast and efficient: the more samples are taken, the more accurate the evaluation. To keep the error rate of the evaluation small, specifically within 5%, the minimum total number of samples is 3400. If a smaller error rate is required in practice, a correspondingly higher lower-limit threshold can be set. Until the total number of samples reaches the lower-limit threshold, no aggregation is performed, which safeguards the accuracy of the compression ratio evaluation result.
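The threshold check and the aggregation can be sketched as below. The source does not say how the figure of 3400 was derived, so `worst_case_margin` only illustrates one common way to reason about sample size: a normal approximation at 95% confidence (z = 1.96) that assumes per-sample ratios lie between 0 and 1, giving a worst-case standard deviation of 0.5. Defining the ratio as output size divided by input size is also an assumption, and both function names are hypothetical.

```python
import math

MIN_SAMPLES = 3400  # preferred lower-limit threshold from the embodiment


def worst_case_margin(n, z=1.96):
    """Half-width of the confidence interval for a mean of values in [0, 1]."""
    return z * 0.5 / math.sqrt(n)


def summarize(records, min_samples=MIN_SAMPLES):
    """Aggregate (input_size, output_size) records once enough were collected.

    Returns None while the lower-limit threshold has not been reached.
    """
    if len(records) < min_samples:
        return None
    total_in = sum(size_in for size_in, _ in records)
    total_out = sum(size_out for _, size_out in records)
    return total_out / total_in
```

Under these assumptions, 3400 samples give a half-width of about 0.017, which is comfortably inside the 5% figure quoted above.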
In one embodiment of the invention, the method further comprises:
judging whether the number of IO read failures of the data sampling threads on their corresponding target volume regions exceeds a preset read-failure threshold;
if so, stopping the data sampling threads from sampling their corresponding target volume regions.
When the target volume is in use by a running workload, IO reads issued by the data sampling threads often fail, and forcing the sampling evaluation to continue would interfere with that workload. The sampling operation is therefore terminated automatically once the number of IO read failures on the target volume regions exceeds the preset read-failure threshold. Preferably, the operation terminates when the proportion of failed reads reaches 1% (the default is 1% and the maximum is 5%), where the proportion is the number of failures divided by the number of read IOs corresponding to the lower-limit threshold of the total number of samples.
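A shared failure counter is one way to implement this stop condition. In the sketch below the class name `ReadFailureGuard` and its parameters are hypothetical, and the stop fires when the failure count reaches the limit, following the "reaches 1%" wording of the preferred setting.

```python
import threading


class ReadFailureGuard:
    """Counts failed reads across all sampling threads and signals when to stop."""

    def __init__(self, planned_reads=3400, max_failure_ratio=0.01):
        # The source gives 1% as the default and 5% as the maximum ratio.
        if not 0 < max_failure_ratio <= 0.05:
            raise ValueError("failure ratio must be in (0, 0.05]")
        self.limit = max(1, round(planned_reads * max_failure_ratio))
        self.failures = 0
        self._lock = threading.Lock()

    def record_failure(self):
        """Register one failed read; return True once sampling should stop."""
        with self._lock:
            self.failures += 1
            return self.failures >= self.limit
```

With the default values the limit works out to 34 failed reads, after which every sampling thread that consults the guard would stop.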
Further, the method also comprises:
judging whether the number of empty data blocks among the samples taken by the data sampling threads from the target volume has reached a preset upper-limit threshold;
if so, stopping the data sampling threads from sampling their corresponding target volume regions.
During sampling, if most of the sampled data blocks are empty, for example when the proportion of empty blocks among the samples exceeds 95%, the operation is terminated automatically. The evaluation result computed from the samples must stay within a guaranteed accuracy range, that is, the sampling error must stay within bounds, and too many empty data blocks would distort the sampling result. In addition, a large number of empty data blocks indicates that the disk is mostly empty and has enough free space, so neither compression nor compression evaluation is needed. Therefore, by combining the evaluation result with its accuracy range, this embodiment can evaluate the compression ratio of the target volume more efficiently and accurately.
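The empty-block check can be sketched as follows. The source does not define what counts as an empty data block, so the sketch assumes a block consisting entirely of zero bytes; the `min_observed` warm-up parameter is also an assumption, added so that a handful of early empty samples does not stop the run. Both function names are hypothetical.

```python
def is_empty_block(block):
    """Treat a block as empty when every byte is zero (assumption)."""
    return block.count(0) == len(block)


def too_many_empty(empty_count, sampled_count,
                   max_empty_ratio=0.95, min_observed=100):
    """Return True when the share of empty samples exceeds the upper limit."""
    if sampled_count < min_observed:
        return False  # too few samples to judge
    return empty_count / sampled_count > max_empty_ratio
```

A sampling thread would call `is_empty_block` on each block it reads, update the shared counters, and stop as soon as `too_many_empty` returns True.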
Please refer to Fig. 2, a schematic structural diagram of a system for evaluating the compression ratio of a data volume provided by a specific embodiment of the invention.
Correspondingly, an embodiment of the invention also provides a system for evaluating the compression ratio of a data volume, comprising: a data sampling thread module 1, a region division module 2 and a compression ratio evaluation module 3. The data sampling thread module 1 comprises n data sampling thread units, n being an integer not less than 2. The region division module 2 is used to divide the target volume into n different target volume regions according to the number of data sampling thread units, with the data sampling thread units and the target volume regions in one-to-one correspondence. Each data sampling thread unit is used to randomly sample its corresponding target volume region and to perform a compression calculation on each sampled data block according to a preset compression algorithm. The compression ratio evaluation module 3 is used to aggregate the compression calculation results of the sampled data blocks and to calculate the compression ratio evaluation result of the target volume.
In this embodiment, the compression ratio evaluation module and the region division module may be integrated into the data sampling thread module. Before the data volume is compressed, its compression ratio is evaluated with the above solution, which applies mathematical and statistical methods to a random sample of the target volume. The survey is simple and inexpensive, the evaluation is accurate and reliable, and few system resources are occupied. The space saved by compressing the target volume can be estimated without a lengthy data compression process, which gives the user data to support the decision of whether to compress. If the evaluation shows that compressing the current target volume would save a large amount of space, the user can compress it; otherwise the target volume need not be compressed, which avoids occupying CPU resources and saves time.
In one embodiment of the invention, the data sampling thread unit comprises: a sample block size setting subunit, used to determine the size of the data block used as each sample; and a sampling function subunit, used to repeatedly read data blocks of the determined size, perform random sampling, and perform a compression calculation on each sampled data block according to the preset compression algorithm.
Further, the system also comprises: a first judging module, used to judge whether the total number of samples has reached the preset lower-limit threshold and, when it has, to control the compression ratio evaluation module to perform the aggregation.
In this embodiment, both the size of the data block used as each sample and the lower-limit threshold of the total number of samples are set. Each sample is preferably a 64 KB data block, a size that is neither too large nor too small and therefore keeps sampling efficient. The lower-limit threshold is preferably 3400. The threshold must satisfy the sample size required for statistical sampling while keeping the algorithm fast and efficient: the more samples are taken, the more accurate the evaluation. To keep the error rate of the evaluation small, specifically within 5%, the minimum total number of samples is 3400. If a smaller error rate is required in practice, a correspondingly higher lower-limit threshold can be set. Until the total number of samples reaches the lower-limit threshold, no aggregation is performed, which safeguards the accuracy of the compression ratio evaluation result.
In one embodiment of the present invention, the system also includes:
a second judging module, which judges whether the number of IO read failures of the data sampling thread module on the target volume exceeds a preset read-failure threshold and, if so, controls the data sampling thread module to stop sampling;
a third judging module, which judges whether the number of empty data blocks among the samples taken while the data sampling thread module randomly samples the target volume has reached a preset upper-limit threshold and, if so, controls the data sampling thread module to stop sampling.
When the target volume is in use by a running service, the data sampling threads' IO reads on the target volume will often fail, and forcing the sampling evaluation to continue at that point would interfere with the service. The second judging module is therefore used: when the number of IO read failures of the data sampling threads on their respective target volume regions exceeds the preset read-failure threshold, the sampling operation is terminated automatically. Preferably, the operation is terminated automatically when the proportion of failed IO reads reaches 1% (the default is 1%, the maximum is 5%), where the proportion is the ratio of the number of failures to the number of IO reads corresponding to the lower-limit threshold of the total sampling amount. During sampling, if most of the sampled data blocks are empty, for example if empty data blocks make up more than 95% of the samples, the operation is likewise terminated automatically. The reason is that the evaluation result computed from sampling the target volume must stay within a guaranteed precision range, that is, the sampling error must stay within a certain bound, and too many empty data blocks would degrade the accuracy of the sampling result. In addition, too many empty data blocks indicate that the disk is fairly empty and has enough free space, so that neither compression nor a compression evaluation is needed. Therefore, in this embodiment, using the evaluation result together with the precision range of that result as two indicators allows the compression ratio of the target volume to be evaluated more efficiently and accurately.
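The two stop conditions can be sketched as a single check. The function name and return values are hypothetical. The IO failure ratio is taken relative to the lower-limit threshold, as the text states; measuring the empty-block ratio against the same base is an assumption of this sketch, since the text gives only the 95% figure.

```python
def should_abort(io_failures, empty_blocks, min_samples=3400,
                 fail_percent=1, empty_percent=95):
    """Return a reason string if sampling should stop, otherwise None.

    Integer percent arithmetic avoids floating-point edge cases at the
    exact threshold values.
    """
    if not 0 < fail_percent <= 5:
        raise ValueError("fail_percent must be in (0, 5]; the text caps it at 5%")
    # Stop once failures *reach* the configured share of the planned reads.
    if io_failures * 100 >= fail_percent * min_samples:
        return "io_failures"
    # Stop once empty blocks *exceed* the configured share (assumed base:
    # the lower-limit threshold, mirroring the IO failure ratio).
    if empty_blocks * 100 > empty_percent * min_samples:
        return "mostly_empty"
    return None
```

With the defaults, sampling stops at the 34th failed read or at the 3231st empty block, whichever comes first.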
In summary, the method and system for evaluating the compression ratio of a data volume provided by the present invention make it possible to learn, before a data volume is compressed, how much space would be saved by compressing it. Mathematical and statistical methods are used to analyze the compression ratio of the target volume by random sampling. The investigation is simple and inexpensive, the evaluation is accurate and reliable, and the use of system resources is reduced. The space saved by compressing the target volume can be estimated without a lengthy data compression process, which gives the user data to support the decision on whether to compress. If the evaluation shows that compressing the current target volume would save a substantial amount of space, the user can compress it; otherwise the target volume need not be compressed, which avoids occupying CPU resources and saves time. In addition, using the evaluation result together with the precision range of that result as two indicators allows the compression ratio of the target volume to be evaluated more efficiently and accurately.
The method and system for evaluating the compression ratio of a data volume provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the present invention and its core idea. It should be noted that those of ordinary skill in the art can make a number of improvements and modifications to the present invention without departing from its principles, and such improvements and modifications also fall within the protection scope of the claims of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610824302.XA CN106445412A (en) | 2016-09-14 | 2016-09-14 | Data volume compression ratio assessment method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106445412A true CN106445412A (en) | 2017-02-22 |
Family
ID=58169022
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201610824302.XA CN106445412A (en) | Pending | 2016-09-14 | 2016-09-14 | Data volume compression ratio assessment method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106445412A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1127558A (en) * | 1994-04-22 | 1996-07-24 | 索尼公司 | Device and method for transmitting data, and device and method for recording data |
US20130111164A1 (en) * | 2011-10-31 | 2013-05-02 | Jichuan Chang | Hardware compression using common portions of data |
CN104462334A (en) * | 2014-12-03 | 2015-03-25 | 天津南大通用数据技术股份有限公司 | Data compression method and device for packing database |
CN105051724A (en) * | 2013-08-19 | 2015-11-11 | 华为技术有限公司 | Data object processing method and device |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108984152A (en) * | 2018-08-21 | 2018-12-11 | 北京睦合达信息技术股份有限公司 | A kind of data processing method, system and computer readable storage medium |
CN108984152B (en) * | 2018-08-21 | 2021-01-29 | 北京睦合达信息技术股份有限公司 | Data processing method, system and computer readable storage medium |
WO2022021865A1 (en) * | 2020-07-29 | 2022-02-03 | 苏州浪潮智能科技有限公司 | Data structure tree verification method, apparatus, and device, and storage medium |
CN117786169A (en) * | 2024-02-26 | 2024-03-29 | 北京数变科技有限公司 | Data self-adaptive storage method and device, electronic equipment and storage medium |
CN117786169B (en) * | 2024-02-26 | 2024-08-30 | 北京数变科技有限公司 | Data self-adaptive storage method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170222 |