CN102467572B - Data block query method supporting a deduplication program

Info

Publication number: CN102467572B
Application number: CN 201010576146
Authority: CN
Inventors: 刘威; 王云松; 陈志丰
Original assignee: Inventec Corp
Current assignee: Shenzhen Excellent Clothing Co Ltd
Priority date: 2010-11-17
Filing date: 2010-11-17
Publication date: 2013-10-02
Anticipated expiration: 2030-11-17
Also published as: CN102467572A

Abstract

A data block query method supporting a deduplication program increases the speed at which the deduplication program queries data blocks. The query method comprises the following steps: storing a hash index list in a server; generating, in a client, data blocks and their hash values from an input file; sending a query request from the client to the server, the query request recording the hash value of the corresponding data block; when the hash value is not stored in the server, sending a storage request from the server to the client and adding the received hash value to the hash index list; establishing a corresponding associated data index list for the hash index list, and recording in the associated data index list information on the data blocks related to the hash value; and when the hash value is stored in the server, returning the hash values in the corresponding associated data index list to the client according to the hash value.

Description

Data block query method supporting a deduplication program

Technical field

The present invention relates to a data block query method, and in particular to a data block query method supporting a deduplication program.

Background art

Data deduplication is a data reduction technique, commonly used in disk-based backup systems, whose main purpose is to reduce the storage capacity used in a storage system. It works by finding variable-size duplicate data blocks at different locations in different files within a given time period. Duplicate data blocks are replaced with indicators. Storage systems are typically filled with a large amount of redundant data, so deduplication has naturally become a focus of attention as a way to solve this problem and save space. Deduplication can reduce stored data to 1/20 of its original size, freeing more backup space. This not only lets backup data be kept on the storage system for longer, but also saves a large amount of the bandwidth required for offline storage.

To preserve the data completely, the input file is split during deduplication. Splitting the input file produces multiple data blocks. To manage the data blocks effectively, an index file is used during splitting to record the storage information of every data block.

After the client splits the entire input file (into fixed-length or variable-length blocks), it generates the hash value of each data block. The client then sends a query request to the server, using the hash value to ask whether the same hash value already exists. The server searches the hash index table for each query request and returns the result over the network. FIG. 1 is a schematic diagram of querying data blocks in the prior art.

When the amount of data queried by the client 110 is very large, the hash index table grows sharply, and the memory of the server 120 may become insufficient to hold it. The hash index table must then be queried from a storage device with slower access, which greatly slows down the whole system.

Summary of the invention

In view of the above problem, the technical problem to be solved by the present invention is to provide a data block query method supporting a deduplication program. The method is applied to the multiple data blocks generated by the deduplication program and performs query processing on those data blocks, thereby increasing the query speed of the data blocks.

To achieve the above object, the data block query method supporting a deduplication program disclosed in the present invention comprises the following steps: storing a hash index list in a server, and recording multiple sets of hash values in the hash index list; loading an input file in a client, and generating the data blocks of the input file and the hash value of each data block; sending a query request from the client to the server, the query request recording the hash value of the corresponding data block, to ask the server whether the same hash value exists; when the hash value is not stored in the hash index list of the server, sending a storage request from the server to the client so that the data block corresponding to the hash value is transmitted to the server for storage, the server adding the received hash values to the hash index list in order; establishing a corresponding associated data index list for the hash values in the hash index list, and recording in the associated data index list the other hash values related to the hash value; when the hash value is stored in the server, returning the hash values in the corresponding associated data index list from the server to the client according to the hash value; the next time the client queries the hash value of a data block, checking in the received associated data index list whether the hash value already exists; when the hash value exists in the associated data index list received by the client, obtaining from the associated data index list the hash value information or descriptive information on the data block related to the hash value (for example, the number of times the data block has been referenced, which can be incremented as references require); and when the hash value does not exist in the associated data index list received by the client, querying the server for the hash value.

The associated data index list indicates how data blocks are associated (by their sequential order), and during use the server can continually adjust the associated data index list according to statistics. The hit rate of the client's queries in local memory can therefore be maintained to a certain extent. The server obtains a large number of related records at the cost of a single access to the slow storage device, which greatly reduces the problem of repeated client query requests forcing the server to read from the slow storage device again and again. Sending the data index set over the network in one transfer also reduces the time spent on request/acknowledgment round trips.

The present invention is described in detail below with reference to the accompanying drawings and specific embodiments, which do not limit the present invention.

Brief description of the drawings

FIG. 1 is a schematic diagram of querying data blocks in the prior art;

FIG. 2 is a schematic diagram of the architecture of the present invention;

FIG. 3 is a schematic flowchart of the operation of the present invention;

FIG. 4 is a schematic diagram of recording the associated data index set in the present invention.

Reference numerals:

110 Client

120 Server

210 Server

211 Hash index list

212 Associated data index list

220 Client

Detailed description

The structural principle and working principle of the present invention are described in detail below with reference to the accompanying drawings.

FIG. 2 is a schematic diagram of the architecture of the present invention. The present invention includes a server 210 and a client 220. The client 220 may connect to the server 210 over the Internet or an intranet, or the client 220 and the server 210 may run on the same computer. The server 210 further includes a hash index list 211, which records multiple sets of hash values. When the client 220 sends the server 210 a query request for the hash value of a data block in an input file, the server 210 performs the query according to the content of the hash index list 211 in the manner described below. FIG. 3 is a schematic flowchart of the operation of the present invention.

Step S310: a hash index list is stored in the server, and multiple sets of hash values are recorded in the hash index list;

Step S320: the client loads an input file and generates the data blocks of the input file and the hash value of each data block;

Step S330: the client sends a query request to the server, recording the hash value of the corresponding data block in the query request, to ask the server whether the same hash value exists;

Step S340: when the hash value is not stored in the hash index list of the server, the server sends a storage request to the client so that the data block corresponding to the hash value is transmitted to the server for storage, and the server adds the received hash values to the hash index list in order;

Step S350: a corresponding associated data index list is established for the hash values in the hash index list, and the other hash values related to the hash value are recorded in the associated data index list; and

Step S360: when the hash value is stored in the server, the server returns the hash values in the corresponding associated data index list to the client according to the hash value.

The client 220 loads the input file, splits it, and generates the data blocks of the input file and the hash value of each data block. The hash algorithm may be, but is not limited to, SHA-1 or MD5. The data blocks are generated by fixed-size partitioning or by content-defined chunking (CDC). The fixed-size algorithm splits the input file using a predefined block size; its advantages are simplicity and high performance. Content-defined chunking is a variable-length algorithm that uses fingerprint data (such as a Rabin fingerprint) to split the file into blocks of differing lengths. Unlike the fixed-size algorithm, it places block boundaries according to the file content, so the block size varies.
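
The two splitting strategies and the per-block hashing can be illustrated with a minimal Python sketch. The function names, the block sizes, and the rolling-sum boundary test are assumptions made for illustration only; in particular, the rolling sum is a simple stand-in for a Rabin fingerprint, not an implementation of one.

```python
import hashlib


def fixed_size_chunks(data: bytes, size: int = 4096) -> list:
    """Split data into blocks of a predefined size (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
                           min_size: int = 32, max_size: int = 1024) -> list:
    """Toy content-defined chunking (hypothetical parameters).

    A boundary is placed where the sum of the last `window` bytes has its
    low bits equal to zero, so boundaries depend on content, not offsets.
    """
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        length = i - start + 1
        if (length >= min_size and (rolling & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing block without a boundary
    return chunks


def block_hashes(chunks: list) -> list:
    """Compute the SHA-1 hash value of every data block."""
    return [hashlib.sha1(chunk).hexdigest() for chunk in chunks]
```

Either splitter returns blocks that concatenate back to the original input, so the choice affects only where the boundaries fall and therefore how well duplicates line up after insertions or deletions in a file.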

Next, the client 220 sends a query request to the server 210, recording the hash value of the corresponding data block in the query request, to ask the server 210 whether the same hash value exists. When the hash value is not stored in the hash index list 211 of the server 210, the server 210 sends a storage request to the client 220 so that the data block corresponding to the hash value is transmitted to the server 210 for storage, and the server 210 adds the received hash values to the hash index list 211 in order. A corresponding associated data index list 212 is established for the hash values in the hash index list 211, and information on the data blocks related to the hash value is recorded in the associated data index list 212. For example, the associated data index list 212 may store the hash value of a data block, the number of the data block, or index information on the storage location of the data block.
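
The query and storage exchange described here might look like the following Python sketch. The class name `Server`, the reply strings, and the use of a list position as the storage location are all hypothetical choices for illustration; the patent does not prescribe a message format or data layout.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical in-memory server holding the hash index list."""
    hash_index: dict = field(default_factory=dict)  # hash value -> storage location
    blocks: list = field(default_factory=list)      # stored data blocks

    def query(self, h: str) -> str:
        # Reply with a storage request when the hash value is unknown.
        return "FOUND" if h in self.hash_index else "STORE_REQUEST"

    def store(self, h: str, block: bytes) -> int:
        # Keep the block and add its hash value to the hash index list.
        self.blocks.append(block)
        self.hash_index[h] = len(self.blocks) - 1
        return self.hash_index[h]


def upload(server: Server, block: bytes) -> str:
    """Client side: query by hash, and send the block only when asked to."""
    h = hashlib.sha1(block).hexdigest()
    reply = server.query(h)
    if reply == "STORE_REQUEST":
        server.store(h, block)
    return reply
```

Uploading the same block twice stores it only once: the first call receives the storage request and the second finds the hash value already indexed.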

The following walks through the query process starting from the first data block of the input file, assuming that the server 210 has not recorded any data block of the input file. The client 220 first converts the first data block of the input file into a first hash value hash1 and sends the server 210 a query request for hash1. Since the server 210 holds no hash value of any data block of the input file, the received first hash value hash1 (together with the first data block) is written to the server 210. Likewise, the second data block (corresponding to the second hash value hash2) is written to the server 210 by the same process. From the sequential relationship of the two data blocks, the server 210 determines that hash1 and hash2 are associated, and puts hash2 into the associated data index list 212 of hash1. FIG. 4 is a schematic diagram of recording the associated data index set in the present invention.

The hash values of the other data blocks are likewise written, in order, to the associated data index list 212 of the first hash value hash1. In the present invention, the capacity of the associated data index list 212 is limited. When the number of hash values in the associated data index list 212 reaches a threshold, the server 210 may continue storing hash values in a next associated data index list 212. Alternatively, it may delete from the associated data index list 212 the hash value for which the longest time has elapsed since it was last queried, and record the most recently queried hash value in the associated data index list 212.
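
The second option, replacing the hash value that was queried longest ago, behaves like a least-recently-used policy. A minimal sketch, assuming a hypothetical `AssociatedList` class built on Python's `collections.OrderedDict`:

```python
from collections import OrderedDict


class AssociatedList:
    """Hypothetical bounded associated data index list with LRU replacement."""

    def __init__(self, capacity: int = 10):
        self.capacity = capacity
        self.entries = OrderedDict()  # hash value -> None, oldest query first

    def touch(self, h: str) -> None:
        """Record that hash value h was just queried."""
        if h in self.entries:
            self.entries.move_to_end(h)       # now the most recently queried
            return
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # drop the least recently queried
        self.entries[h] = None

    def hashes(self) -> list:
        return list(self.entries)
```

With a capacity of 3, querying a, b, c, then a again, then d leaves c, a, d in the list, because b is the entry whose last query is oldest.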

For example, if the associated data index list 212 can record at most 10 hash values, the associated index records of the first hash value hash1 are the second hash value hash2 through the eleventh hash value hash11 (in other words, the ten consecutive data blocks following the first data block).

After the twelfth hash value hash12 is generated, the server 210 stores hash12 in the associated data index list 212 of the eleventh hash value hash11. In addition, if a hash value is associated with several other hash values at once, the server may choose, according to the characteristics of the association, which hash value's associated data index list 212 to store it in, or it may keep a copy in every associated data index list 212 concerned.
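
The hash1 to hash12 example can be reproduced with a short sketch. The function name `build_associations` and the dictionary layout are assumptions; the chaining rule (once a list is full, the last hash value in it becomes the owner of the next list) is taken from the example, and a capacity of at least 1 is assumed.

```python
def build_associations(hashes: list, capacity: int = 10) -> dict:
    """Map an owner hash value to the hash values that follow it in sequence.

    When the current list is full, its last entry becomes the owner of the
    next associated data index list (assumes capacity >= 1).
    """
    associated = {}
    owner = None
    for h in hashes:
        if owner is None:
            owner = h                  # the first hash value owns the first list
            associated[owner] = []
            continue
        if len(associated[owner]) >= capacity:
            owner = associated[owner][-1]
            associated.setdefault(owner, [])
        associated[owner].append(h)
    return associated
```

Calling `build_associations([f"hash{i}" for i in range(1, 13)])` yields two lists: hash1 owns hash2 through hash11, and hash11 owns hash12.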

The situation described above is the case in which the queried hash value is not stored in the server 210. When the hash value is stored in the server 210, the server 210 returns the hash values in the corresponding associated data index list 212 to the client 220 according to the hash value. Continuing the example above, suppose the client 220 wants to query the fifth data block (that is, the fifth hash value hash5). In the server 210, hash5 belongs to the associated data index list 212 of the first hash value hash1. Therefore, in addition to returning the queried hash5 to the client 220, the server 210 also sends the client 220 the associated data index list 212 of hash1.

After receiving the associated data index list, the client 220 stores it in memory. The next time the client 220 queries the hash value of a data block, it first checks the received associated data index list 212 for the hash value it wants to query. When the hash value exists in the received associated data index list 212, the client 220 obtains it from that list. Because the queried data blocks are likely to be consecutive, the associated data index list 212 effectively reduces the access time between the client 220 and the server 210 and thus improves access efficiency. Conversely, when the hash value does not exist in the received associated data index list 212, the client 220 queries the server 210 again by repeating steps S330 to S360.
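
The client-side behavior can be sketched as a local cache that is consulted before the server. `StubServer`, `Client`, and the request counter are hypothetical names used only to make the saved round trips visible; the real exchange format is not specified in the text.

```python
class StubServer:
    """Hypothetical server: returns the whole associated list on a hit."""

    def __init__(self, associated: dict):
        self.associated = associated  # owner hash value -> related hash values
        self.requests = 0             # counts round trips, for illustration

    def query(self, h: str):
        self.requests += 1
        for owner, related in self.associated.items():
            if h == owner or h in related:
                return True, list(related)
        return False, []


class Client:
    def __init__(self, server: StubServer):
        self.server = server
        self.cache = set()            # associated hash values kept in memory

    def exists(self, h: str) -> bool:
        if h in self.cache:           # local hit: no network round trip
            return True
        found, related = self.server.query(h)
        if found:
            self.cache.update(related)
        return found
```

After one query for hash5, lookups for hash6 or hash7 are answered from the client's memory, and only a hash value outside the cached list causes another request to the server.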

The associated data index list 212 indicates how data blocks are associated (that is, by their sequential order), and during use the server 210 can continually adjust the associated data index list 212 according to statistics. The hit rate of the queries of the client 220 in local memory can therefore be maintained to a certain extent. The server 210 obtains a large number of related records at the cost of a single access to the slow storage device, which greatly reduces the problem of repeated query requests from the client 220 forcing the server 210 to read from the slow storage device again and again. Sending the data index set over the network in one transfer also reduces the time spent on request/acknowledgment round trips.

The present invention may of course have various other embodiments. Those skilled in the art can make corresponding changes and modifications according to the present invention without departing from its spirit and essence, and all such changes and modifications shall fall within the scope of protection of the appended claims.

Claims

1. A data block query method supporting a deduplication program, applied to multiple data blocks generated by a deduplication program to perform query processing on the data blocks, characterized in that the method comprises the following steps:

storing a hash index list in a server, and recording multiple sets of hash values in the hash index list;

loading an input file in a client, and generating the data blocks of the input file and the hash value of each of the data blocks;

sending a query request from the client to the server, the query request recording the hash value of the corresponding data block, to ask the server whether the same hash value exists;

when the hash value is not stored in the hash index list of the server, sending a storage request from the server to the client so that the data block corresponding to the hash value is transmitted to the server for storage, the server adding the received hash values to the hash index list in order;

establishing a corresponding associated data index list for the hash value added to the hash index list, and recording other hash values related to the hash value in the associated data index list;

when the hash value is stored in the server, returning the hash values in the corresponding associated data index list from the server to the client according to the hash value;

the next time the client queries the hash value of a data block, checking in the received associated data index list whether the hash value already exists;

when the hash value exists in the associated data index list received by the client, obtaining the hash value from the associated data index list; and

when the hash value does not exist in the associated data index list received by the client, querying the server for the hash value.

2. The data block query method supporting a deduplication program according to claim 1, characterized in that the data blocks are generated by a fixed-length method or a content-based variable-length splitting method.

3. The data block query method supporting a deduplication program according to claim 1, characterized in that, when the number of hash values in the associated data index list reaches a threshold, the hash value for which the longest time has elapsed since it was queried is deleted from the associated data index list, and the most recently queried hash value is recorded in the associated data index list.

4. The data block query method supporting a deduplication program according to claim 1, characterized in that, when the number of hash values in the associated data index list reaches a threshold, the server continues storing hash values in a next associated data index list.