CN111367871A - An incremental synchronization method between files based on SAPCI variable-length blocks - Google Patents
- Publication number
- CN111367871A (application CN202010132871.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- block
- backup
- file
- source
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/178—Techniques for file synchronisation in file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
Abstract
The invention discloses a method for incremental synchronization between files based on SAPCI variable-length chunking, comprising the following steps: S1, the backup server partitions the backup file into chunks; S2, the backup server computes an MD5 digest for each chunk of the backup file, assembles the digests, and sends the assembled result to the source server; S3, the source server partitions the changed source file with the same SAPCI algorithm and likewise computes an MD5 digest for each of its chunks; S4, the source server compares the MD5 data received from the backup server with the MD5 data it computed for its own chunks, thereby obtaining the incremental and non-incremental data chunks; S5, the source server assembles the incremental data and matching data and sends them to the backup server; S6, the backup server reassembles the backup file.
Description
Technical Field
The invention belongs to the field of incremental synchronization between computer files, and in particular relates to a method for incremental synchronization between files based on SAPCI variable-length chunking.
Background Art
With the rapid development of information technology, the data produced by digital systems is growing explosively. At the same time, to ensure fault tolerance, systems often add multiple backup servers and maintain multiple backup copies of files. As data volumes grow rapidly, synchronization between source files and their backups has become an important problem that cannot be ignored.
At present, incremental synchronization between files is traditionally implemented with the Rsync algorithm. Rsync is based on fixed-length chunking plus a rolling checksum, and can locate the incremental data between a source file and a backup file fairly precisely. However, owing to a design weakness of fixed-length chunking, the algorithm has very poor resistance to byte offsets: a slight change in a file causes a collective shift of all subsequent chunk boundaries (Deng Xuefeng, Sun Ruizhi, Zhang Yonghan, et al. A sliding chunking algorithm based on data bitmaps [J]. Journal of Computer Research and Development, 2014, 51(Suppl.): 30-38). Rsync therefore has to compute a weak checksum in addition to the strong checksum, and to perform a rolling check on the source server, which greatly increases the source server's CPU load and the verification time (A. Tridgell, "Efficient algorithms for sorting and synchronization," https://www.samba.org/~tridge/phd_thesis.pdf, accessed February 1999). Content-defined chunking (CDC) with variable-length chunks arose to address this inherent defect of Rsync: such algorithms place a chunk boundary only where the data satisfies a given condition, so a slight change to a file does not directly cause a collective shift of a large number of subsequent chunk boundaries. This makes them well suited to incremental synchronization between files.
Summary of the Invention
The purpose of the present invention is to propose a method for incremental synchronization between files based on SAPCI (Self-Adaptive Parity Check of Interval) variable-length chunking.
The object of the present invention is achieved by at least one of the following technical solutions.
A method for incremental synchronization between files based on SAPCI variable-length chunking comprises the following steps:
S1. The backup server that needs incremental synchronization partitions the backup file into chunks with the SAPCI algorithm;
S2. The backup server computes an MD5 (Message-Digest Algorithm 5) digest for each chunk of the backup file and assembles the digests into an MD5 list List_backup-MD5, then sends this assembled result to the source server;
S3. The source server partitions the changed source file with the same SAPCI algorithm as in step S1 and likewise computes an MD5 digest for each chunk; the digest of chunk i is MD5_src-i, where src indicates a source-file chunk and i = 0, 1, 2, 3, ... is the chunk's sequence number. After hashing, the digests MD5_src-i are likewise stored in the source server's memory in sequence-number order to form the MD5 list List_src-MD5. This step runs concurrently with steps S1 and S2;
S4. The source server obtains the incremental and non-incremental data chunks;
S5. The source server assembles the incremental data and matching data according to the rules below and sends them to the backup server; this step runs concurrently with S4;
S6. After parsing the received incremental data, the backup server reassembles the backup file, finally completing the synchronization.
Further, the SAPCI algorithm of step S1 is an improvement on the interval parity-check chunking algorithm (Parity Check of Interval, PCI); the improved algorithm is called the self-adaptive interval parity-check chunking algorithm (Self-Adaptive Parity Check of Interval) and mainly comprises the following steps:
S101. Read data starting from the first byte of the file;
S102. Append the data read to the tail of the check window, where the check window is defined as a window of length len;
S103. Determine whether the amount of data in the check window has reached len; if not, return to step S102; if so, proceed to step S104;
S104. Count the total number n of 1-bits across the byte data in the check window, and determine whether n is greater than or equal to the set threshold; if not, proceed to step S105; if so, the tail of the check window is a chunk boundary;
S105. Slide the check window one byte forward, recount the total number n of 1-bits in the window, and determine whether n is greater than the threshold; if so, the tail of the check window is a chunk boundary; if not, proceed to step S106;
S106. Determine whether the distance between the tail of the current check window and the right boundary of the previous chunk is greater than or equal to the threshold-adjustment interval d; if not, return to step S104; if so, decrement the current threshold by 1 and then return to step S104, repeating until a chunk boundary is found;
S107. Once a chunk boundary has been found, reset the current threshold to its initial value;
S108. Repeat steps S102 to S107 until the tail of the check window reaches the last byte of the file;
S109. Treat the tail data at the end of the file that does not satisfy the chunking condition as a single chunk, where "does not satisfy the chunking condition" means that the tail data block contains no window of length len in which the total number n of 1-bits is greater than or equal to the current threshold;
S110. The file chunking process is complete.
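As a minimal sketch of steps S101 to S110, the following Java method emits chunk boundaries by counting 1-bits in a sliding check window. The class and method names (SapciChunker, sapciBoundaries) and the exact parameter handling are our own assumptions; the patent specifies only the procedure, not an API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the SAPCI chunker of steps S101-S110. A chunk boundary is
// declared where a window of lenOfWindow bytes contains at least
// `threshold` 1-bits; after d fruitless bytes the threshold is relaxed
// by 1 (the self-adaptive part), and it is reset at every boundary.
public class SapciChunker {

    /** Returns the (exclusive) end index of each chunk of {@code data}. */
    public static List<Integer> sapciBoundaries(byte[] data, int lenOfWindow,
                                                int initialThreshold, int d) {
        List<Integer> boundaries = new ArrayList<>();
        int lastBoundary = 0;              // right boundary of previous chunk
        int threshold = initialThreshold;
        for (int tail = lenOfWindow - 1; tail < data.length; tail++) {
            int n = 0;                     // S104: 1-bits in the window
            for (int j = tail - lenOfWindow + 1; j <= tail; j++) {
                n += Integer.bitCount(data[j] & 0xFF);
            }
            // S106: once the window tail is d or more bytes past the
            // previous boundary without success, relax the threshold.
            if (n < threshold && tail + 1 - lastBoundary >= d) {
                threshold--;
            }
            if (n >= threshold) {          // S104/S105: boundary found
                boundaries.add(tail + 1);
                lastBoundary = tail + 1;
                threshold = initialThreshold;  // S107: reset threshold
                tail += lenOfWindow - 1;       // refill a fresh window
            }
        }
        if (lastBoundary < data.length) {  // S109: leftover tail chunk
            boundaries.add(data.length);
        }
        return boundaries;
    }
}
```

Because the threshold relaxes only after d bytes of failed matches and is reset at every boundary, the relaxation stays local to one chunk, which is what bounds the oversized chunks the patent attributes to plain PCI.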
Further, the hashing and assembly of step S2 proceed as follows: compute an MD5 digest for each chunk of the backup file, denoting the digest of chunk i by MD5_backup-i, where backup indicates a backup-file chunk and i = 0, 1, 2, 3, ... is the chunk's sequence number; after hashing, simply store the digests MD5_backup-i in sequence-number order in the backup server's communication send buffer to form the MD5 list List_backup-MD5, which completes the assembly.
Further, the format of the data that the backup server sends to the source server in step S2 is:
(chunkIndex, MD5(chunk bytes))
where chunkIndex is the sequence number (starting from 0) of a chunk produced by the backup server's SAPCI chunking of the backup file, chunk bytes is the variable-length byte array of that chunk, and MD5(chunk bytes) is the checksum obtained by computing the MD5 value of the chunk's byte array. The MD5 checksum can be computed in many ways; in the Java programming language, for example, it can be computed with the digest method of the MessageDigest class that ships with the JDK (Java Development Kit), simply by passing the chunk's byte array as the argument to digest.
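As a concrete illustration of the MessageDigest usage described above, a chunk's MD5 checksum might be computed as follows; the wrapper class and the hex rendering are illustrative additions, while MessageDigest.getInstance("MD5") and digest are standard JDK API.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChunkDigest {

    /** MD5 of one chunk's byte array, rendered as a lowercase hex string. */
    public static String md5Hex(byte[] chunkBytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(chunkBytes); // 16-byte MD5 checksum
            StringBuilder hex = new StringBuilder(32);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every conforming JDK must provide MD5.
            throw new AssertionError("MD5 is guaranteed by the JDK", e);
        }
    }
}
```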
Further, step S4 obtains the incremental and non-incremental data chunks as follows:
S401. Compare the MD5 data in the two lists List_backup-MD5 and List_src-MD5. An MD5 value that is present in List_src-MD5 and also present in List_backup-MD5 identifies its chunk as a non-incremental data chunk; proceed to step S402. Conversely, an MD5 value that is present in List_src-MD5 but absent from List_backup-MD5 identifies its chunk as an incremental data chunk; proceed to step S403;
S402. For a non-incremental data chunk, generate a data structure in the following format:
(source chunk index, backup chunk index)
where source chunk index is the sequence number (starting from 0) of the source-file chunk currently being examined, and backup chunk index is the sequence number of the backup-file chunk whose MD5 it matched, obtained from the data sent by the backup server;
S403. For an incremental data chunk, generate a data structure in the following format:
(source chunk index, diff chunk bytes)
where source chunk index is the sequence number of the source-file chunk currently being examined, and diff chunk bytes is the whole source-file chunk that matched no MD5 of the backup file;
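Steps S401 to S403 amount to looking each source-chunk digest up in an index built from the backup server's MD5 list. The sketch below uses illustrative names (DeltaFinder, Delta, classify) that are not from the patent.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of steps S401-S403: classify each source chunk as matching
// (non-incremental) or incremental by looking its MD5 up in the backup list.
public class DeltaFinder {

    /** One result entry: exactly one of the two trailing fields is set. */
    public static final class Delta {
        public final int sourceChunkIndex;
        public final Integer backupChunkIndex; // non-null => matching data (S402)
        public final byte[] diffChunkBytes;    // non-null => incremental data (S403)

        Delta(int src, Integer backup, byte[] diff) {
            this.sourceChunkIndex = src;
            this.backupChunkIndex = backup;
            this.diffChunkBytes = diff;
        }
    }

    public static List<Delta> classify(List<String> backupMd5List,
                                       List<String> sourceMd5List,
                                       List<byte[]> sourceChunks) {
        // Index the backup digests once: MD5 -> backup chunk index.
        Map<String, Integer> backupIndex = new HashMap<>();
        for (int i = 0; i < backupMd5List.size(); i++) {
            backupIndex.putIfAbsent(backupMd5List.get(i), i);
        }
        List<Delta> result = new ArrayList<>();
        for (int i = 0; i < sourceMd5List.size(); i++) {
            Integer hit = backupIndex.get(sourceMd5List.get(i));
            if (hit != null) {
                // S402: (source chunk index, backup chunk index)
                result.add(new Delta(i, hit, null));
            } else {
                // S403: (source chunk index, diff chunk bytes)
                result.add(new Delta(i, null, sourceChunks.get(i)));
            }
        }
        return result;
    }
}
```

Building the hash map once makes the comparison linear in the number of chunks, rather than quadratic as a naive list-against-list scan would be.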
It can be seen that, compared with Rsync, the increments found by the SAPCI-based incremental synchronization method are not byte-precise, but the method eliminates the backup server's computation and transmission of weak checksums and the source server's rolling weak-checksum computation. The SAPCI-based method is therefore one that trades a larger amount of discovered incremental data for a large reduction in CPU consumption on both the backup server and the source server.
The extra cost of the incremental chunks depends mainly on how the incremental data is dispersed over the backup file's chunks. Suppose the source file differs from the backup file by 100 B of incremental data. Rsync can locate exactly these 100 B, but at considerable extra computational cost for weak checksums and rolling verification. With the SAPCI-based method, the amount of increment found depends on how widely those 100 B are dispersed over the file's chunks.
The incremental dispersion is defined as follows:
L = C / I (1)
where L is the incremental dispersion, C is the number of source-file chunks over which the incremental data is distributed, and I is the size of the increment in bytes. The incremental dispersion satisfies the following relation:
1/I ≤ L ≤ 1 (2)
When L = 1/I, all of the incremental data of the source file relative to the backup file falls within a single chunk of the backup file, and the size of the transmitted incremental data is approximately:
IC ≈ I + AC (3)
where AC is the average chunk length;
When L = 1, each byte of the incremental data falls independently in a different chunk of the backup file, and the size of the transmitted incremental data is approximately:
IC ≈ I + AC·I. (4)
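A small worked check of the bounds in equations (3) and (4): for an I-byte increment and average chunk length AC, the transmitted increment IC ranges from about I + AC (all changes in one chunk) to about I + AC·I (every changed byte in its own chunk). The numbers used below (I = 100 B, AC = 8192 B) are illustrative only.

```java
// Worked check of equations (3) and (4): bounds on the transmitted
// increment IC for an I-byte change, given average chunk length AC.
public class DispersionBound {

    /** IC in the best case L = 1/I: the change sits inside one chunk (eq. 3). */
    public static long bestCase(long incrementBytes, long avgChunkLen) {
        return incrementBytes + avgChunkLen;
    }

    /** IC in the worst case L = 1: every changed byte hits its own chunk (eq. 4). */
    public static long worstCase(long incrementBytes, long avgChunkLen) {
        return incrementBytes + avgChunkLen * incrementBytes;
    }

    public static void main(String[] args) {
        long i = 100, ac = 8192;               // 100 B change, 8 KiB average chunk
        System.out.println(bestCase(i, ac));   // 8292
        System.out.println(worstCase(i, ac));  // 819300
    }
}
```

The spread between the two bounds (roughly a factor of I for widely dispersed edits) is the redundancy the method accepts in exchange for dropping the weak-checksum machinery.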
Further, the rule of step S5 is as follows. When a chunk matches in step S4, i.e. the current chunk is a non-incremental chunk whose per-chunk MD5 appears in both List_src-MD5 and List_backup-MD5, matching data is generated in the format (source chunk index, backup chunk index), where source chunk index is the sequence number (starting from 0) of the source-file chunk currently being examined and backup chunk index is the sequence number of the backup-file chunk it matched.
When a chunk fails to match in step S4, i.e. the current chunk is an incremental chunk whose per-chunk MD5 appears in List_src-MD5 but not in List_backup-MD5, incremental data is generated in the format (source chunk index, diff chunk bytes), where source chunk index is again the sequence number of the source-file chunk being examined and diff chunk bytes is the byte array of the source chunk that matched no backup-file chunk, i.e. the incremental data chunk.
After the incremental data and matching data have been generated, storing the results in the source server's communication send buffer in source-chunk-number order completes the assembly; the assembled results are then sent to the backup server in source-chunk-number order.
Further, the data formats sent in step S5 are as follows:
1) If the current chunk matches, matching data is transmitted in the following format:
(source chunk index, backup chunk index)
2) If the current chunk fails to match, incremental data is transmitted in the following format:
(source chunk index, diff chunk bytes).
Further, the total time consumed by steps S1 to S6 is defined as follows:
T_CDC = max{T_SB + T_ST + T_CB, T_CS + T_SS} + T_ICT + T_M (5)
where T_CDC is the total elapsed time, T_SB is the time for the backup server to compute the strong checksums (MD5), T_ST is the time for the backup server to transmit the strong checksums to the source server, T_CB is the time for the backup server to chunk the file, T_CS is the time for the source server to chunk the file, T_SS is the time for the source server to compute the strong checksums (MD5), T_ICT is the time for the source server to compare the source-file chunk MD5s with the backup-file chunk MD5s while concurrently transmitting the incremental and matching data to the backup server, and T_M is the time for the backup server to complete the incremental synchronization by merging the incremental and matching data.
Further, in step S6 the backup server reassembles the backup file from the received incremental and matching data as follows:
S601. Create a temporary file and perform the synchronization against it;
S602. Examine each piece of received data. If its format is:
(source chunk index, backup chunk index)
then the source file contains a chunk identical to one in the backup file, and the backup-file chunk indicated by backup chunk index is simply appended to the end of the reassembled file; writes must follow the index order indicated by source chunk index;
S603. If the format is:
(source chunk index, diff chunk bytes)
then an incremental data chunk has been received, and diff chunk bytes is appended directly to the end of the reassembled file; writes must again follow the index order indicated by source chunk index;
S604. Repeat steps S602 and S603 until all data returned by the source server has been read. The incremental file synchronization is then complete: delete the original backup file and rename the temporary file to the original backup file's name, which finishes the reassembly and incremental synchronization.
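Steps S601 to S604 can be sketched as a single ordered pass over the received entries, appending either a referenced backup chunk or the raw incremental bytes; an in-memory buffer stands in for the temporary file, and all names (Reassembler, Entry, reassemble) are our own.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Sketch of steps S601-S604: rebuild the new backup from the delta stream.
// Entries arrive in source-chunk-index order; each either references a
// backup chunk (matching data) or carries raw bytes (incremental data).
public class Reassembler {

    /** One received entry: exactly one of the two fields is set. */
    public static final class Entry {
        public final Integer backupChunkIndex; // (source chunk index, backup chunk index)
        public final byte[] diffChunkBytes;    // (source chunk index, diff chunk bytes)
        public Entry(Integer idx, byte[] diff) {
            this.backupChunkIndex = idx;
            this.diffChunkBytes = diff;
        }
    }

    /** In-memory stand-in for the temporary file of step S601. */
    public static byte[] reassemble(List<byte[]> backupChunks, List<Entry> delta) {
        ByteArrayOutputStream temp = new ByteArrayOutputStream();
        for (Entry e : delta) {                // entries are already index-ordered
            if (e.backupChunkIndex != null) {  // S602: copy the matching chunk
                temp.writeBytes(backupChunks.get(e.backupChunkIndex));
            } else {                           // S603: append the received bytes
                temp.writeBytes(e.diffChunkBytes);
            }
        }
        return temp.toByteArray();             // S604: replaces the old backup
    }
}
```

Writing into a temporary buffer (file) and swapping it in only at the end, as S601 and S604 prescribe, keeps the old backup intact if the transfer fails partway.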
Compared with the prior art, the present invention has the following advantages and effects:
1) The SAPCI variable-length chunking used in the present invention is an improvement on the PCI algorithm. Compared with PCI, SAPCI greatly reduces the number of oversized chunks (those 10 times the average chunk length or more), thereby reducing subsequent redundant data transmission. Because the self-adaptive threshold adjustment occurs only within local data chunks, it has little effect on the overall chunked data volume, i.e. it introduces no significant extra computation.
2) Compared with the traditional Rsync algorithm for incremental file synchronization, the SAPCI-based incremental synchronization proposed by the present invention transmits increments mainly in units of whole chunks, which carries some data redundancy, but greatly reduces the computation spent by the backup server on computing and sending weak checksums and by the source server on rolling weak-checksum computation; this is significant in practical applications.
3) In the SAPCI-based incremental synchronization proposed by the present invention, the source server neither needs nor is able to locate incremental data by rolling verification, so its chunking of the file and computation of the per-chunk MD5s can proceed concurrently with the backup server's chunking and per-chunk MD5 computation, reducing the overall running time of this phase.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of a method for incremental synchronization between files based on SAPCI variable-length chunking in an embodiment of the present invention;
Fig. 2 is a pseudocode listing of the SAPCI algorithm in an embodiment of the present invention.
Detailed Description of the Embodiments
The specific implementation of the present invention is further described below with reference to the accompanying drawings and embodiments.
Embodiment:
As shown in Fig. 1, a method for incremental synchronization between files based on SAPCI variable-length chunking comprises the following steps:
1) The backup server that needs incremental synchronization partitions the backup file into chunks with the SAPCI algorithm. The workflow pseudocode of SAPCI variable-length chunking is shown in Fig. 2; its inputs are the file, the check-window size lenOfWindow, the threshold, and the threshold-adjustment interval d, and its output is the list of chunk-boundary file indices.
本实施例的SAPCI算法是基于区间奇偶校验分块算法(Parity Check OfInterval,PCI算法)进行改进的,改进后的SAPCI算法称为自适应的区间奇偶校验分块算法(Self-Adaptive Parity Check Of Interval),主要包括以下步骤:The SAPCI algorithm in this embodiment is improved based on an interval parity check block algorithm (Parity Check Of Interval, PCI algorithm). The improved SAPCI algorithm is called an adaptive interval parity check block algorithm (Self-Adaptive Parity Check algorithm). Of Interval), which mainly includes the following steps:
S101. Read data starting from the first byte of the file;
S102. Append the data read to the tail of the check window, where the check window is defined as a window of length len (the input lenOfWindow); there is no special rule for choosing len, and it is typically set to 10;
S103. Determine whether the amount of data in the check window has reached len; if not, return to step S102; if so, proceed to step S104;
S104. Count the total number n of 1-bits across the bytes in the check window, and determine whether n is greater than or equal to the set threshold. If not, proceed to step S105; if so, the tail of the check window is taken as a chunk boundary. Here threshold is a configurable parameter with no special rule for its value: the smaller threshold is, the more likely n is to reach it, so chunk boundaries are found more often, the number of chunks grows, and the average chunk length shrinks. The threshold can therefore be set flexibly according to actual needs;
S105. Slide the check window backward by one byte, recount the total number n of 1-bits across the bytes in the window, and determine whether n is greater than or equal to threshold. If so, the tail of the check window is a chunk boundary; if not, proceed to step S106.
S106. Determine whether the distance between the tail of the current check window and the right boundary of the previous chunk is greater than or equal to the threshold-adjustment interval d. If not, return to step S104; if so, decrement the current threshold by 1 and then return to step S104, until a chunk boundary is found;
S107. Once a chunk boundary has been found, reset the current threshold to its initial value;
S108. Repeat steps S102 to S107 until the tail of the check window reaches the last byte of the file;
S109. Treat the tail data at the end of the file that does not satisfy the chunking condition as a single chunk, where "does not satisfy the chunking condition" means the tail data contains no window of length len in which the total number n of 1-bits is greater than or equal to the current threshold;
S110. The file-chunking process is complete.
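The steps S101 to S110 above can be sketched as follows. This is an illustrative rendering in Python; the default parameter values (window length 10, threshold 40, adjustment interval 64) are assumptions for demonstration, not values fixed by the method.

```python
def sapci_chunk(data: bytes, len_of_window: int = 10,
                threshold: int = 40, d: int = 64) -> list[int]:
    """Return chunk boundaries (exclusive end offsets) for `data`.

    A sketch of SAPCI steps S101-S110; parameter defaults are
    illustrative assumptions.
    """
    boundaries = []
    prev_boundary = 0          # right boundary of the previous chunk
    cur_threshold = threshold
    i = 0                      # start offset of the sliding check window
    while i + len_of_window <= len(data):
        window = data[i:i + len_of_window]
        # S104/S105: count 1-bits over every byte in the window
        n = sum(bin(b).count("1") for b in window)
        if n >= cur_threshold:
            tail = i + len_of_window       # window tail = chunk boundary
            boundaries.append(tail)
            prev_boundary = tail
            cur_threshold = threshold      # S107: reset the threshold
            i = tail                       # S102: refill window after the boundary
        else:
            # S106: lower the threshold once the window tail is d bytes
            # or more past the previous chunk boundary
            if (i + len_of_window) - prev_boundary >= d:
                cur_threshold -= 1
            i += 1                         # S105: slide one byte backward
    if prev_boundary < len(data):
        boundaries.append(len(data))       # S109: tail remainder as one chunk
    return boundaries
```

For example, twenty 0xFF bytes (80 one-bits per 10-byte window) split into two 10-byte chunks, while all-zero data that never reaches the threshold falls through to the single tail chunk of step S109.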
2) The backup server uses MD5 (Message-Digest Algorithm, hereinafter MD5) to hash and integrate the chunks of the backup file. The integration works as follows: each chunk of the backup file is hashed with MD5, and the hash of each chunk is denoted MD5backup-i (where backup indicates that the chunks of the backup file are being hashed, and i = 0, 1, 2, 3, 4, 5, ... is the chunk's sequence number). After hashing, the results MD5backup-i are stored in sequence-number order in the backup server's communication send buffer to form an MD5 list (denoted Listbackup-MD5), which completes the integration. Each chunk is numbered, and for each chunk a record is generated in the format:
(chunkIndex, MD5(chunk bytes))
where chunkIndex is the sequence number of the chunk after the backup server chunks the backup file with the SAPCI algorithm, numbered from 0; chunk bytes is the variable-length byte array of that chunk; and MD5(chunk bytes) is the checksum obtained by computing the MD5 value of the chunk's byte array. (There are many ways to compute an MD5 checksum; taking the Java programming language as an example, the digest method of the MessageDigest class shipped with the JDK (Java Development Kit) can be used, simply passing the chunk's byte array to digest.) Because each chunk is hashed with MD5 after chunking, no weak checksum needs to be computed for the chunks.
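As a sketch of step 2, assuming the chunk boundaries from the SAPCI step are available as a list of exclusive end offsets, the (chunkIndex, MD5(chunk bytes)) records can be generated as follows (Python's hashlib is used here for brevity in place of the Java MessageDigest example):

```python
import hashlib

def chunk_md5_list(data: bytes, boundaries: list[int]) -> list[tuple[int, str]]:
    """Build (chunkIndex, MD5(chunk bytes)) records for a chunked file.

    `boundaries` are exclusive end offsets as produced by the SAPCI
    chunking step; the record layout follows the format in the text.
    """
    records = []
    start = 0
    for index, end in enumerate(boundaries):
        digest = hashlib.md5(data[start:end]).hexdigest()
        records.append((index, digest))
        start = end
    return records
```

Stored in index order, the resulting list plays the role of Listbackup-MD5 on the backup server (or Listsrc-MD5 on the source server in step 5).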
3) The backup server sends the generated Listbackup-MD5 to the source server (Source Server).
4) On the source server, the changed source file (Source File) is likewise chunked with the same SAPCI algorithm, keeping the SAPCI parameters used on the source server identical to those on the backup server. This step can run in parallel with steps 1-3.
5) On the source server, MD5 is likewise used to hash each chunk of the source file. As before, the hash of each chunk is denoted MD5src-i (where src indicates that the chunks of the source file are being hashed, and i = 0, 1, 2, 3, 4, 5, ... is the chunk's sequence number). After hashing, the results MD5src-i are likewise stored in sequence-number order in the source server's memory to form an MD5 list (denoted Listsrc-MD5). This step can also run in parallel with steps 1-3.
6) After the source server has received the Listbackup-MD5 sent by the backup server in step 3 and has completed step 5 to obtain Listsrc-MD5, it compares the MD5 entries of the two lists. For an MD5 value present in Listsrc-MD5 but absent from Listbackup-MD5, the corresponding data block is an incremental block; conversely, for an MD5 value present in both Listsrc-MD5 and Listbackup-MD5, the corresponding data block is a non-incremental block. The specific steps are as follows:
S401. After receiving Listbackup-MD5 from the backup server (step 3) and obtaining Listsrc-MD5 (step 5), the source server compares the MD5 entries of the two lists. If an MD5 value is present in Listsrc-MD5 and also present in Listbackup-MD5, the corresponding data block is a non-incremental block; proceed to step S402. Conversely, if an MD5 value is present in Listsrc-MD5 but absent from Listbackup-MD5, the corresponding data block is an incremental block; proceed to step S403.
S402. A non-incremental block has been found; generate a record in the following format:
(source chunk index, backup chunk index)
where source chunk index is the sequence number (numbered from 0) of the source chunk currently being examined, and backup chunk index is the sequence number of the backup-file chunk whose MD5 matches that source chunk, obtained from the data sent by the backup server;
S403. An incremental block has been found; generate a record in the following format:
(source chunk index, diff chunk bytes)
where source chunk index is the sequence number of the source chunk currently being examined, and diff chunk bytes is the entire source chunk whose MD5 could not be matched against the backup file;
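The comparison in steps S401 to S403 can be sketched as follows; the function and variable names are illustrative assumptions:

```python
import hashlib

def diff_records(src_chunks, backup_md5_list):
    """Classify source chunks as incremental or non-incremental (S401-S403).

    src_chunks: list of chunk byte strings from the source file.
    backup_md5_list: (chunkIndex, md5hex) records received from the
    backup server. A sketch of the comparison in step 6.
    """
    # Map each backup MD5 to its chunk index for O(1) lookup.
    backup_index = {md5: idx for idx, md5 in backup_md5_list}
    records = []
    for src_idx, chunk in enumerate(src_chunks):
        md5 = hashlib.md5(chunk).hexdigest()
        if md5 in backup_index:
            # S402: non-incremental block -> (source idx, backup idx)
            records.append((src_idx, backup_index[md5]))
        else:
            # S403: incremental block -> (source idx, raw chunk bytes)
            records.append((src_idx, chunk))
    return records
```

Building a hash map over Listbackup-MD5 keeps the whole comparison linear in the number of chunks, rather than quadratic.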
The extra cost incurred by incremental blocks depends mainly on how the incremental data is scattered across the data chunks of the backup file. The incremental dispersion is defined as:
L = C / I (1)
where L is the incremental dispersion, C is the number of source-file chunks across which the incremental data is distributed, and I is the size of the incremental data in bytes. The incremental dispersion satisfies:
1/I ≤ L ≤ 1 (2)
When L = 1/I, the incremental data of the source file relative to the backup file falls within a single chunk of the backup file, and the incremental block size is approximately:
IC ≈ I + AC (3)
where AC is the average chunk length;
When L = 1, every byte of the incremental data of the source file relative to the backup file occurs independently in a different chunk of the backup file, and the incremental block size is approximately:
IC ≈ I + AC × I (4)
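As a worked check of equations (1) to (4), the following sketch computes the dispersion L and an approximate incremental-block size. The general form IC ≈ I + AC × C is an interpolation assumed here: it reduces to equation (3) when C = 1 (i.e. L = 1/I) and to equation (4) when C = I (i.e. L = 1). The numeric inputs are illustrative assumptions.

```python
def incremental_cost(i_bytes: int, c_chunks: int, avg_chunk: int):
    """Return (dispersion L, approximate incremental-block size IC).

    Implements equations (1)-(2) exactly; IC uses the assumed general
    form I + AC * C, which matches (3) at C = 1 and (4) at C = I.
    """
    dispersion = c_chunks / i_bytes            # equation (1): L = C / I
    assert 1 / i_bytes <= dispersion <= 1      # equation (2): 1/I <= L <= 1
    ic = i_bytes + avg_chunk * c_chunks        # assumed form IC = I + AC * C
    return dispersion, ic
```

For a 1000-byte increment and an average chunk length of 8192 bytes, a fully concentrated increment (C = 1) costs about 9192 bytes, while a fully dispersed one (C = 1000) costs about 8 193 000 bytes, illustrating why dispersion dominates the overhead.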
7) Depending on whether each source chunk's MD5 value matched, the source server sends data in different formats to the backup server. If the MD5 value matched, the corresponding block is a non-incremental block, and the data sent has the format:
(source chunk index, backup chunk index)
where source chunk index is the sequence number (numbered from 0) of the source chunk currently being examined, and backup chunk index is the sequence number of the backup-file chunk whose MD5 matches that source chunk, obtainable from the data sent by the backup server to the source server as described above.
If the MD5 value did not match, the corresponding block is an incremental block, and the data sent has the format:
(source chunk index, diff chunk bytes)
where source chunk index is the sequence number of the source chunk currently being examined, and diff chunk bytes is the entire source chunk whose MD5 could not be matched against the backup file.
Note also that, to improve CPU and I/O utilization on both the source and backup servers, these partial results can be transmitted while the source server is still performing the MD5 comparison.
8) After parsing the received incremental data, the backup server reassembles the backup file. The integration process is as follows:
S801. Create a temporary file, and perform the synchronization against it;
S802. Examine each received record. If the current record has the format:
(source chunk index, backup chunk index)
then the data source contains a block identical to one in the backup file; it suffices to append the backup-file chunk indicated by backup chunk index to the end of the reassembled file, writing in the index order indicated by source chunk index;
S803. If the current record has the format:
(source chunk index, diff chunk bytes)
then the record carries an incremental block; it suffices to append diff chunk bytes directly to the end of the reassembled file, again writing in the index order indicated by source chunk index. The ordering of writes into the reassembled file must be preserved, otherwise the data blocks may end up out of order.
S804. Repeat steps S802 and S803 until all data returned by the source server has been read. At that point the incremental file synchronization is complete: delete the original backup file and rename the temporary file to the original backup file's name.
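The reassembly steps S801 to S804 can be sketched as follows; the file-handling details (temporary-file naming, boundary bookkeeping) are illustrative assumptions:

```python
import os

def reassemble(backup_path: str, records, backup_boundaries) -> None:
    """Rebuild the backup file from the step-7 records (S801-S804).

    records: (source chunk index, payload) tuples, where the payload is
    either a backup chunk index (int) or raw diff chunk bytes.
    backup_boundaries: exclusive end offsets of the old backup file's
    chunks, as produced by the SAPCI chunking step.
    """
    with open(backup_path, "rb") as f:
        old = f.read()
    starts = [0] + backup_boundaries[:-1]      # start offset of each old chunk
    tmp_path = backup_path + ".tmp"            # S801: temporary file
    with open(tmp_path, "wb") as out:
        # S802/S803: writes must follow source-chunk-index order
        for _, payload in sorted(records, key=lambda r: r[0]):
            if isinstance(payload, int):       # non-incremental: copy old chunk
                out.write(old[starts[payload]:backup_boundaries[payload]])
            else:                              # incremental: raw new bytes
                out.write(payload)
    os.replace(tmp_path, backup_path)          # S804: swap temp into place
```

Sorting by source chunk index before writing enforces the ordering requirement of steps S802 and S803 even if records arrive out of order.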
The present invention uses SAPCI variable-length chunking to discover incremental data between files and, on that basis, realizes inter-file incremental synchronization. Compared with the traditional Rsync algorithm, applying SAPCI variable-length chunking to inter-file incremental synchronization greatly reduces the computation time needed to discover increments, at the cost of reporting more incremental data, i.e. lower precision of incremental-data discovery, so that inter-file incremental synchronization can be completed in a shorter time. The method is both feasible and practical.
The above embodiment is described in relatively specific detail, but it expresses only one feasible implementation of the present invention and does not limit the scope of the patent. It should be noted that researchers and engineers in the field may, within the framework of the present invention, add a number of variations or improvements on the basis of this preferred embodiment, all of which fall within the protection scope of this patent. The protection scope of this patent shall be governed by the appended claims.