CN113590535B - An efficient data migration method and device for deduplication storage system

Info

Publication number: CN113590535B
Application number: CN202111156367.9A
Authority: CN
Inventors: 郭得科; 罗来龙; 程葛瑶; 夏俊旭; 张千桢
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2021-09-30
Filing date: 2021-09-30
Publication date: 2021-12-17
Anticipated expiration: 2041-09-30
Also published as: CN113590535A

Abstract

The present application provides an efficient data migration method and device for a deduplication storage system. The method includes: receiving a data migration request, determining the corresponding source server, and determining the files to be migrated on the source server; determining a bit vector for each available server based on the files to be migrated and a preset number of independent hash functions, and determining the candidate servers based on the bit vectors; determining a space-saving index for each candidate server based on the file space overlap data and the amount of data in shared blocks; and, based on the space-saving index, determining a target migration file and a target server among the candidate servers, migrating the target migration file to the target server, and updating the bit vectors and space-saving indexes. Additional space cost during migration is thereby reduced, giving efficient and highly adaptable data migration.

Description

An efficient data migration method and device for deduplication storage system

Technical Field

The present application relates to the technical field of data migration, and in particular to an efficient data migration method and apparatus for a deduplication storage system.

Background

With the introduction of deduplication technology, traditional migration methods face serious challenges. First, deduplication creates data-sharing dependencies among stored files; breaking these dependencies during migration incurs additional space overhead. Second, eliminating redundancy increases the risk of data becoming unavailable if a server crashes. Existing data migration methods cannot solve both problems at once.

Summary of the Invention

In view of this, the purpose of the present application is to propose an efficient data migration method and apparatus for a deduplication storage system to solve the above technical problems.

To this end, the present application provides:

An efficient data migration method for a deduplication storage system, comprising:

receiving a data migration request, determining the source server corresponding to the data migration request, and then determining the files to be migrated on the source server;

determining the available servers, then determining a bit vector for each available server based on the files to be migrated and a preset number of independent hash functions, and determining the candidate servers corresponding to the data migration request based on the bit vectors;

determining, based on the files to be migrated, the file space overlap data and the amount of data in blocks shared with each candidate server, and then determining the space-saving index of each candidate server based on the file space overlap data and the data amount;

based on the space-saving index, determining a target migration file among the files to be migrated and a target server among the candidate servers, migrating the target migration file to the target server, and updating the bit vectors and space-saving indexes of the target server and the source server.

Optionally, determining the target migration file among the files to be migrated includes:

determining the occupied space of each file to be migrated based on the space-saving index;

determining the average data access frequency borne by each file to be migrated;

determining the target migration file according to the average data access frequency and the occupied space.

Optionally, determining the target migration file according to the average data access frequency and the occupied space includes:

determining the ratio of the average data access frequency to the occupied space;

in response to determining that the ratio is greater than a preset unit heat value, determining that the file to be migrated corresponding to the ratio is the target migration file.

Optionally, determining the target server among the candidate servers includes:

determining the similarity between the target migration file and each candidate server;

determining a candidate server whose similarity is greater than a preset similarity threshold as the target server.

determining the target occupied space of the target migration file;

determining the target data amount of the blocks shared between the target migration file and each candidate server;

determining the average data access frequency borne by each copy of the target migration file;

determining the difference between the target occupied space and each target data amount, and then determining the ratio of the average data access frequency to each difference;

in response to determining that the ratio is greater than a preset value, determining that the candidate server corresponding to the ratio is the target server.

An efficient data migration device for a deduplication storage system, comprising:

a receiving unit, configured to receive a data migration request, determine the source server corresponding to the data migration request, and then determine the files to be migrated on the source server;

a candidate server determination unit, configured to determine the available servers, then determine a bit vector for each available server based on the files to be migrated and a preset number of independent hash functions, and determine the candidate servers corresponding to the data migration request based on the bit vectors;

an index determination unit, configured to determine, based on the files to be migrated, the file space overlap data and the amount of data in blocks shared with each candidate server, and then determine the space-saving index of each candidate server based on the file space overlap data and the data amount;

a migration unit, configured to determine, based on the space-saving index, a target migration file among the files to be migrated and a target server among the candidate servers, migrate the target migration file to the target server, and update the bit vectors and space-saving indexes of the target server and the source server.

Optionally, the migration unit is further configured to:

Optionally, the migration unit is further configured to:

An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the efficient data migration method for a deduplication storage system described above.

A non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to execute the efficient data migration method for a deduplication storage system described above.

As can be seen from the above, the efficient data migration method and device for a deduplication storage system provided by the present application receive a data migration request, determine the source server corresponding to the request, and then determine the files to be migrated on the source server; determine the available servers, determine a bit vector for each available server based on the files to be migrated and a preset number of independent hash functions, and determine the candidate servers corresponding to the request based on the bit vectors; determine, based on the files to be migrated, the file space overlap data and the amount of data in blocks shared with each candidate server, and from these determine the space-saving index of each candidate server; and, based on the space-saving index, determine a target migration file among the files to be migrated and a target server among the candidate servers, migrate the target migration file to the target server, and update the bit vectors and space-saving indexes of the target server and the source server. Additional space cost during migration is thereby reduced, and data is migrated with an efficient and adaptable strategy.

Description of the Drawings

To describe the technical solutions of the present application or of the prior art more clearly, the drawings needed for the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show only one or more embodiments of this specification; those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of an efficient data migration method for a deduplication storage system according to an embodiment of this specification;

FIG. 2 is a schematic flowchart of an efficient data migration method for a deduplication storage system according to another embodiment of this specification;

FIG. 3 is a schematic diagram of a bit vector in an efficient data migration method for a deduplication storage system according to another embodiment of this specification;

FIG. 4 is a structural block diagram of an efficient data migration device for a deduplication storage system according to another embodiment of this specification;

FIG. 5 is a schematic diagram of the hardware structure of an electronic device for an efficient data migration method for a deduplication storage system according to an embodiment of this specification;

FIG. 6 is a schematic overview of Jingwei, the efficient data migration method for a deduplication storage system according to an embodiment of this specification;

FIG. 7 is an example of data migration from one server to another server under the efficient data migration method for a deduplication storage system of this specification.

Detailed Description

To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to specific embodiments and the accompanying drawings.

It should be noted that, unless otherwise defined, the technical or scientific terms used in one or more embodiments of this specification have the ordinary meanings understood by persons of ordinary skill in the field to which this application belongs. "First", "second", and similar words used in one or more embodiments of this specification do not denote any order, quantity, or importance, but are used only to distinguish different components. "Comprising", "including", and similar words mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. "Connected", "coupled", and similar words are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "down", "left", "right", and the like indicate only relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may change accordingly.

FIG. 1 shows a flow 100 of an embodiment of the efficient data migration method for a deduplication storage system according to the present application. The method of this embodiment includes the following steps:

Step S101: receive a data migration request, determine the source server corresponding to the data migration request, and then determine the files to be migrated on the source server.

The execution body of this embodiment may be, for example, a server, and may receive the data migration request over a wired or wireless connection.

Specifically, data migration in this embodiment may include moving (cutting) data from one server to one or more other servers, and may also include copying data from one server to one or more other servers. A migration may move or copy an entire file to one or more other servers, or only one or several data blocks of a file.

After receiving the data migration request, the execution body can obtain the source server identifier carried in the request, for example the IP address of the source server, and use it to locate the source server corresponding to the request. The execution body can then obtain the identifier of the file to be migrated carried in the request, for example the file name or the storage address of the file; this embodiment does not restrict the specific content of the identifier. The execution body can then locate the file to be migrated on the source server according to this identifier.

Step S102: determine the available servers, then determine a bit vector for each available server based on the files to be migrated and a preset number of independent hash functions, and determine the candidate servers corresponding to the data migration request based on the bit vectors.

In this embodiment, the execution body may determine multiple non-empty servers as available servers, which can receive files migrated from an overloaded server (for example, the source server). To find the candidate servers for each file to be migrated, one option is to compare the fingerprints (for example, MD5 or SHA-1) of the blocks contained in the file with the fingerprints of the blocks stored on each available server. For a file containing n blocks, the time needed to determine whether a server contains those blocks then depends on the total number of blocks on that server. Instead, a Bloom filter (BF) is used to represent the blocks on each available server. The BF captures the data characteristics and turns similarity detection from pairwise fingerprint checks into membership queries on a bit vector (also called a data sketch). The time complexity of determining whether a server contains the n blocks of a file is thereby reduced to the order of n times the number of hash functions used.

For example, as shown in FIG. 3, suppose the block set on an available server consists of three blocks and is represented by a BF of length d = 19. All d bits of the bit vector of this available server are initially set to 0. Two independent hash functions map each block on the server to two positions in the bit vector, and the bits at those positions are set to 1. The resulting binary string is the bit vector, that is, the data sketch. Every available server maintains one bit vector, with the same hash functions and the same vector length, to record membership information at block level. Given a bit vector and the hash functions used, a membership query can be answered for any block of any file.

Specifically, when a file on the source server selects its best target server from the available servers, it first needs the BF bit vector of each available server; the candidate servers corresponding to the data migration request are then determined from these bit vectors. For any block of a file on the source server, if any bit at the block's hash positions in the Bloom filter is 0, the BF concludes that the block is not stored on the available server; otherwise, the BF considers the queried block to be stored on the available server.
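The membership query described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the class name `BloomSketch`, the block fingerprints `b1` to `b3`, and the use of salted SHA-1 digests as a stand-in for independent hash functions are all assumptions. Only the vector length d = 19 and the two hash functions come from the FIG. 3 example.

```python
import hashlib


class BloomSketch:
    """Bit-vector sketch of the block fingerprints stored on one server."""

    def __init__(self, num_bits: int = 19, num_hashes: int = 2) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # every position starts at 0

    def _positions(self, fingerprint: str) -> list[int]:
        # Assumption: salting SHA-1 with the hash index stands in for
        # num_hashes independent hash functions.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{fingerprint}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, fingerprint: str) -> None:
        # Set every hit position to 1.
        for pos in self._positions(fingerprint):
            self.bits[pos] = 1

    def might_contain(self, fingerprint: str) -> bool:
        # A single 0 bit proves absence; all 1 bits means "probably present".
        return all(self.bits[pos] for pos in self._positions(fingerprint))


def shared_blocks(file_blocks: list[str], sketch: BloomSketch) -> list[str]:
    """Blocks of a file that the sketch reports as already on the server."""
    return [b for b in file_blocks if sketch.might_contain(b)]


# Hypothetical server holding three blocks, as in the FIG. 3 example.
server = BloomSketch()
for block in ("b1", "b2", "b3"):
    server.add(block)
```

A Bloom filter never reports a stored block as absent, but it can report an absent block as present, so the shared amount derived from such queries is an upper estimate whose accuracy depends on the vector length and the number of hash functions.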

Step S103: determine, based on the files to be migrated, the file space overlap data and the amount of data in blocks shared with each candidate server, and then determine the space-saving index of each candidate server based on the file space overlap data and the data amount.

The space-saving index is defined as the amount of storage resources saved when a file is migrated to a target server. It is the deviation between the amount of data released on the source server and the space added on the migration target. Space-saving data migration decides which files to migrate and which target servers to send them to, so that the additional space cost is reduced.

Specifically, the space-saving index is determined by Algorithm 1.

Algorithm 1 can be explained as follows. Its input comprises the BF-based data sketches and the file sets of the source server and of all target candidates. To derive the files to migrate and their corresponding migration targets, a space-efficient index (the space-saving index) is designed to drive the migration sequence; the function that computes it is shown in lines 9-13 of Algorithm 1. For a file on the source server and a candidate server, the amount of data in the blocks they share can be derived from BF-based membership queries of the file's blocks against the candidate server's data sketch. The function then returns the index from the deviation between two such overlap quantities; the index reflects the space saved by migrating the file from the source server to the candidate server. The source-side value is computed from a sketch that does not contain the file, and reflects the space overlap on the source server between the file and the other files.
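Since the listing of Algorithm 1 is not reproduced in this text, the following Python sketch only illustrates the definition given above: the index is the space released on the source server minus the space added on the target. The function name and parameters are hypothetical, and exact Python sets are used where the method would issue Bloom-filter membership queries.

```python
def space_saving_index(file_blocks, other_source_blocks, target_blocks, size):
    """Space released on the source minus space added on the target.

    file_blocks: blocks of the file being considered for migration
    other_source_blocks: blocks referenced by the other files on the source
    target_blocks: blocks already stored on the candidate target server
    size: mapping from block identifier to block size
    """
    blocks = set(file_blocks)
    # Only blocks that no other source file references can be freed.
    released = sum(size[b] for b in blocks if b not in other_source_blocks)
    # Only blocks the target does not already hold consume new space.
    added = sum(size[b] for b in blocks if b not in target_blocks)
    return released - added


# Hypothetical example: three 4-unit blocks, one shared with another source
# file and two already present on the target.
size = {"a": 4, "b": 4, "c": 4}
index = space_saving_index({"a", "b", "c"}, {"a"}, {"b", "c"}, size)
```

Here two blocks (8 units) are released and one block (4 units) is added, so the index is 4. Because the file's own size cancels out, the same value equals the amount shared with the target (8) minus the overlap with the other source files (4).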

Step S104: based on the space-saving index, determine a target migration file among the files to be migrated and a target server among the candidate servers, migrate the target migration file to the target server, and update the bit vectors and space-saving indexes of the target server and the source server.

In this embodiment, using the space-saving index of every file-server pair, the target migration file and the target server are determined by iteratively finding the maximum index value: the file variable i of the maximum index identifies the target migration file, and the server variable k of the maximum index identifies the target server among the candidate servers. The iteration continues until M percent of the data on the source server has been migrated (lines 3-5 of Algorithm 1). Each migration changes the storage state of the source and target servers, so the data sketch (the bit vector) is updated locally on each server where files are added or released. The space-saving indexes are also updated accordingly on the affected servers (lines 6-8 of Algorithm 1).
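The iteration described above can be sketched as a greedy loop. This is an illustrative reading, not the patent's Algorithm 1: the function names are hypothetical, exact sets replace the Bloom-filter sketches, at least one target server is assumed, and "M percent migrated" is interpreted here as the fraction of the source server's stored bytes that has been released.

```python
def block_bytes(blocks, size):
    return sum(size[b] for b in blocks)


def saving(file_id, target_id, source_files, targets, size):
    """Space released on the source minus space added on the target."""
    blocks = source_files[file_id]
    others = set().union(*(b for f, b in source_files.items() if f != file_id))
    released = block_bytes(blocks - others, size)
    added = block_bytes(blocks - targets[target_id], size)
    return released - added


def plan_migration(source_files, targets, size, fraction):
    """Repeatedly migrate the (file, target) pair with the largest index."""
    # Work on copies so the caller's state is left untouched.
    source_files = {f: set(b) for f, b in source_files.items()}
    targets = {t: set(b) for t, b in targets.items()}
    goal = fraction * block_bytes(set().union(*source_files.values()), size)
    plan, released_total = [], 0
    while source_files and released_total < goal:
        file_id, target_id = max(
            ((f, t) for f in sorted(source_files) for t in sorted(targets)),
            key=lambda ft: saving(ft[0], ft[1], source_files, targets, size),
        )
        blocks = source_files.pop(file_id)
        remaining = set().union(*source_files.values())
        released_total += block_bytes(blocks - remaining, size)
        targets[target_id] |= blocks  # the target's state changes, too
        plan.append((file_id, target_id))
    return plan


# Hypothetical instance: five 4-unit blocks on the source, two targets.
size = {b: 4 for b in "abcde"}
source_files = {"f1": {"a", "b"}, "f2": {"b", "c"}, "f3": {"d", "e"}}
targets = {"t1": {"d", "e"}, "t2": {"a"}}
plan = plan_migration(source_files, targets, size, fraction=0.5)
```

In this instance `f3` goes to `t1` first, because `t1` already holds both of its blocks, and `f1` then goes to `t2`. Recomputing the index inside the loop plays the role of the sketch and index updates that follow each migration.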

As another implementation of the embodiments of the present application, the efficient data migration method for a deduplication storage system further includes migrating the files to be migrated to a single empty server.

For example, an overloaded source server stores a set of files, each with an associated heat. The block set of the server is the set of unique blocks into which these files are divided. Given the size of each block, the storage cost of the server is the total size of the blocks stored on it, that is, the sum of the sizes of its unique blocks. Note that the size function yields a constant value under fixed-size chunking and a non-constant value under variable-size chunking algorithms. A containment relation indicates whether a given block is contained in a given file. The case where a file is copied multiple times on the same server is not considered, because it has no effect on access offloading and only affects data redundancy.
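The storage cost just described can be illustrated with a short Python sketch. The file names, block names, and sizes are hypothetical; the point is that a block shared by several files is counted once.

```python
def unique_blocks(files):
    """Set of distinct blocks referenced by any file on the server."""
    return set().union(*files.values())


def storage_cost(files, size):
    """Total size of the unique blocks stored on a deduplicated server."""
    return sum(size[b] for b in unique_blocks(files))


def logical_size(files, size):
    """Total size the files would occupy without deduplication."""
    return sum(size[b] for blocks in files.values() for b in blocks)


# Hypothetical server: two files that share block "b"; variable-size blocks.
files = {"f1": ["a", "b"], "f2": ["b", "c"]}
size = {"a": 8, "b": 8, "c": 4}
cost = storage_cost(files, size)
raw = logical_size(files, size)
```

Here `cost` is 20 while `raw` is 28; the 8-unit difference is the space that deduplication saves, and it is this sharing that a migration can break.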

The initial state of the source server before data migration can be defined as a five-tuple, in which two elements denote the space constraint and the service constraint of the source server and another denotes the set of files stored on the source server. Starting from this initial state, after data migration each candidate migration file on the source server will be in one of the following three states:

Migrated: the file is moved to the target server, and the space it occupied on the source server is released. A Boolean state variable is introduced to represent this state; it is 1 when the file is migrated and 0 otherwise.

Replicated: the source server sends a copy of the file to the target server. This typically happens for popular files, whose data access demand may exceed the capacity of the source server. A second Boolean state variable is introduced to represent this state; it is 1 when the file is replicated and 0 otherwise.

Unchanged: the file is neither migrated nor replicated and stays on the source server, which means that both state variables are 0.

Based on the file states indicated by the above Boolean variables, the closely related states of the files' blocks can also be expressed mathematically. Like a file, a block can be migrated, replicated, or unchanged, and these states are likewise represented by Boolean variables. To represent the migrated state of a block, a Boolean migration variable is defined. When this variable is 1, the block has been moved from the source server to the target server. This state can result from the migration of the files that contain the block.

A Boolean replication variable is further defined. When this variable is 1, the block is present on both the source server and the target server. If any file containing the block is replicated during migration, the state of the block is marked as replicated. In addition, breaking the data-sharing dependency between two files (one migrated, the other unchanged) also causes the blocks in the shared part to be replicated. Note that the replication variable is also related to the extra space cost caused by file movement, which can be expressed in terms of this variable. If both the migration variable and the replication variable of a block are 0, the block remains unchanged.

The state relationship between the files on the source server and their blocks can be analyzed and expressed as one of the following three cases, which are illustrated with simple examples.

A file remains on the source server, that is, neither its migration variable nor its replication variable is set, such as a file in the initial storage state in FIG. 7. In the notation of the figure, the number 5 attached to a file denotes the heat of that file, and the listed elements are the blocks of the file; the other files are denoted in the same way and are not described again. In this case, every block contained in the file is either unchanged or replicated, depending on the block-sharing dependencies between unchanged files and migrated or replicated files.

A file is migrated to the target server, such as a file in the initial storage state in FIG. 7. In this case, every block contained in the file is either migrated or replicated.

A file is replicated, so that both the source server and the target server hold a copy, such as a file in the initial storage state in FIG. 7. In this case, all blocks of the file are replicated. In addition, the access demand of the file is spread over the copies according to a distribution parameter. This parameter is adjusted by the load balancers widely used in networks, which balance the load in the storage system according to the actual load of the servers. The example files are those shown in the initial storage state container in FIG. 7.
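The three cases above imply that block states follow from file states. The Python sketch below makes that derivation explicit. It is an illustration under stated assumptions: the `State` enum, the function names, and the example files are hypothetical, and the extra space cost is taken to be the total size of the replicated blocks, since each of them occupies space on both servers.

```python
from enum import Enum


class State(Enum):
    UNCHANGED = "unchanged"
    MIGRATED = "migrated"
    REPLICATED = "replicated"


def block_states(file_blocks, file_state):
    """Derive the state of every block from the states of the files."""
    on_source, on_target = set(), set()
    for file_id, blocks in file_blocks.items():
        state = file_state[file_id]
        if state in (State.UNCHANGED, State.REPLICATED):
            on_source.update(blocks)  # the file still needs its blocks here
        if state in (State.MIGRATED, State.REPLICATED):
            on_target.update(blocks)  # the file's blocks must exist there
    result = {}
    for block in on_source | on_target:
        if block in on_source and block in on_target:
            result[block] = State.REPLICATED
        elif block in on_target:
            result[block] = State.MIGRATED
        else:
            result[block] = State.UNCHANGED
    return result


def extra_space(states, size):
    """Assumed extra cost: one additional copy of each replicated block."""
    return sum(size[b] for b, s in states.items() if s is State.REPLICATED)


# Hypothetical example: f1 stays, f2 is migrated, f3 is replicated.
file_blocks = {"f1": {"a", "b"}, "f2": {"b", "c"}, "f3": {"d"}}
file_state = {"f1": State.UNCHANGED, "f2": State.MIGRATED, "f3": State.REPLICATED}
states = block_states(file_blocks, file_state)
cost = extra_space(states, {"a": 4, "b": 4, "c": 4, "d": 4})
```

Block `b` ends up replicated even though neither of its files is replicated: it is shared by an unchanged file and a migrated file, which is exactly the broken data-sharing dependency described above.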

Using the Boolean variables for files and blocks introduced above, the migration problem for an empty target server can be formulated with the following constraints.

(1) When a block is migrated, all files containing that block are migrated to the target server.

(2) When a file is migrated, all blocks it contains are migrated or replicated.

(3) When a file is replicated on both the source server and the target server, all blocks it involves are also replicated.

(4) The states of a file (migrated, replicated) are mutually exclusive, as are the states of a block.

(5) The amount of migrated data should meet the predefined migration percentage M.

(6)-(9) The final space overhead and service overhead of the source server and of the target server should not exceed the corresponding capacities.

Here, the space constraint and the service constraint of the source server are those of its initial state, and the target server has its own space constraint and service constraint. All state variables are Boolean.

The goal of the Jingwei scheme (that is, the data migration strategies of the efficient data migration method for a deduplication storage system in the embodiments of the present application) is formalized as an elegant trade-off between space efficiency (minimizing the extra space cost) and service adaptability (maximizing the shared data demand). This objective is Equation (10). In it, a parameter can be tuned to shift the optimization between the two criteria, and each file is weighted by its heat.

With Equation 10 as the migration objective and Equations 1 to 9 as constraints, the problem can be expressed as an integer linear programming (ILP) problem. ILP is known to be NP-hard, and no polynomial-time algorithm for it is known. In particular, when the variables are restricted to Boolean assignments (0 or 1), merely deciding whether the problem has a feasible solution is already NP-complete. Nevertheless, mature optimizers such as CPLEX, lp_solve, and Gurobi can solve such problems efficiently for instances with hundreds of thousands of variables. These highly optimized solvers are therefore used to search directly for the optimal migration plan. This scenario may not suit large storage systems, however, because it is uncommon there for a server to join in an empty state. In addition, the restriction that all files are migrated to a single server may limit the achievable performance gain.
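To make the single-empty-target formulation more concrete, the following is a minimal sketch that enumerates every 0/1 file assignment for a toy instance instead of calling an ILP solver. It is not the formulation of Equations 1 to 10: the objective is simplified to the replicated bytes only (heat and service constraints are ignored), and the names `plan_migration`, `files`, `block_size`, and `min_fraction` are hypothetical.

```python
from itertools import product


def plan_migration(files, block_size, min_fraction):
    """Brute-force a toy migration plan for one empty target server.

    files:        dict mapping file name -> set of block ids it references
    block_size:   dict mapping block id -> size in bytes
    min_fraction: fraction of the stored bytes that must leave the source
    Returns (extra_bytes, migrated_file_names), or None if infeasible.
    """
    names = sorted(files)
    total = sum(block_size.values())
    best = None
    for choice in product((0, 1), repeat=len(names)):
        moved = {f for f, x in zip(names, choice) if x}
        stay = set(names) - moved
        moved_blocks = set().union(*(files[f] for f in moved))
        stay_blocks = set().union(*(files[f] for f in stay))
        # A block needed on both sides has to be replicated (extra space).
        replicated = moved_blocks & stay_blocks
        # Only blocks no longer needed on the source are truly migrated.
        migrated_bytes = sum(block_size[b] for b in moved_blocks - stay_blocks)
        if migrated_bytes < min_fraction * total:
            continue  # violates the migration-percentage constraint
        extra = sum(block_size[b] for b in replicated)
        if best is None or extra < best[0]:
            best = (extra, moved)
    return best
```

Enumeration is exponential in the number of files, which is why a real deployment would hand the same Boolean variables and constraints to an ILP solver.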

In the embodiments of the present application, a data migration request is received, the source server corresponding to the request is determined, and the files to be migrated on the source server are determined. The available servers are then determined; based on the files to be migrated and a preset number of independent hash functions, the bit vector corresponding to each available server is determined, and the candidate servers for the data migration request are determined based on the bit vectors. Based on the files to be migrated, the file-space overlap data and the data volume of the shared partition blocks with each candidate server are determined, and from these the space-saving index corresponding to each candidate server is determined. Based on the space-saving index, the target migration file among the files to be migrated and the target server among the candidate servers are determined, so that the target migration file is migrated to the target server and the bit vectors and space-saving indexes corresponding to the target server and the source server are updated. This reduces the extra space cost during migration and carries out the migration with an efficient and adaptable strategy.

Continuing with FIG. 2, it shows a flow 200 of another embodiment of the efficient data migration method for a deduplication storage system according to the present application. As shown in FIG. 2, the method of this embodiment may include the following steps:

Step S201: Receive a data migration request, determine the source server corresponding to the data migration request, and then determine the files to be migrated on the source server.

Step S202: Determine the available servers; based on the files to be migrated and a preset number of independent hash functions, determine the bit vector corresponding to each available server; and determine the candidate servers for the data migration request based on the bit vectors.

The bit vector is the data sketch of the embodiments of the present application. The time complexity of the BF-based (Bloom filter based) data sketch is expressed in terms of the number of hash functions used and the maximum number of blocks on any candidate server. Note that sampling reduces this time complexity by a factor equal to the sampling ratio. The space complexity of the BF-based data sketch is expressed in terms of the BF length.
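A Bloom-filter data sketch of the kind described here can be illustrated as follows. This is a minimal sketch under stated assumptions: the class name `BloomSketch`, the helper `build_sketch`, the default length and hash count, and the use of salted SHA-256 to stand in for the independent hash functions are all illustrative choices, not details taken from the embodiment.

```python
import hashlib


class BloomSketch:
    """Bit vector summarizing the block fingerprints stored on one server."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits          # BF length
        self.k = k_hashes        # number of hash functions
        self.bits = [0] * m_bits

    def _positions(self, fingerprint):
        # Assumption: k hash functions are emulated by salting SHA-256
        # with the hash index; the source only requires independence.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, fingerprint):
        for pos in self._positions(fingerprint):
            self.bits[pos] = 1

    def might_contain(self, fingerprint):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(fingerprint))


def build_sketch(block_fingerprints, m_bits=1024, k_hashes=3):
    """Insert every block fingerprint of a server into a fresh sketch."""
    sketch = BloomSketch(m_bits, k_hashes)
    for fingerprint in block_fingerprints:
        sketch.add(fingerprint)
    return sketch
```

Building the sketch touches each block once per hash function, and the sketch itself occupies one bit per BF position, which matches the parameters named in the complexity discussion above.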

Step S203: Based on the files to be migrated, determine the file-space overlap data and the data volume of the shared partition blocks with each candidate server, and then determine the space-saving index corresponding to each candidate server based on the file-space overlap data and the data volume.

The time complexity of space-saving data migration is expressed in terms of, among other quantities, the maximum number of blocks in any file. Note that after each file migration the storage states of the source server and the target server are partially updated; this does not increase the overall time complexity of the algorithm. The space complexity comprises two terms: one for recording the server sketches and one for recording the space-saving indexes.
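The role of the space-saving index can be illustrated with a short sketch. It assumes, for illustration only, that the index of a file for a server equals the bytes of the file's blocks already present on that server; the embodiment derives the index from the overlap data and the shared data volume, and that derivation is not reproduced here. The names `space_saving_index` and `rank_candidates` are hypothetical, and the membership predicate could be backed by a Bloom-filter query.

```python
def space_saving_index(file_blocks, server_has_block):
    """Bytes that need not be stored again if the file moves to the server.

    file_blocks:      dict mapping block fingerprint -> size in bytes
    server_has_block: callable returning True if the server holds the block
    """
    return sum(size for fp, size in file_blocks.items() if server_has_block(fp))


def rank_candidates(file_blocks, servers):
    """Rank candidate servers by space-saving index, largest first.

    servers: dict mapping server name -> membership predicate
    """
    scored = [
        (name, space_saving_index(file_blocks, has_block))
        for name, has_block in servers.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For example, `rank_candidates({"a": 4, "b": 8}, {"s1": {"a"}.__contains__, "s2": {"b"}.__contains__})` ranks `s2` first, because moving the file there reuses more bytes.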

The principles of steps S201 to S203 are similar to those of steps S101 to S103 and are not repeated here.

Step S204: Based on the space-saving index, determine the occupied space corresponding to each file to be migrated.

Step S205: Determine the average data access frequency SDA(i) borne by each file to be migrated. SDA(i) is defined as the average data access frequency borne by each replica of the i-th file.

Step S206: Determine the target migration file according to the average data access frequency SDA(i) and the occupied space.

Specifically, determining the target migration file according to the average data access frequency and the occupied space includes: determining the ratio of the average data access frequency to the occupied space; and, in response to determining that the ratio is greater than a preset unit heat value, determining the file to be migrated that corresponds to the ratio as the target migration file. The preset unit heat value is a threshold on the quotient (that is, the ratio) between the split data access (SDA) amount and the extra space required for a replica of the file to be migrated.
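The ratio test of step S206 can be sketched in a few lines. The sketch assumes that SDA(i) is obtained by dividing a file's total access frequency by its current number of replicas, which is one reading of the definition above; the function name `select_target_files` and the tuple layout are hypothetical.

```python
def select_target_files(files, unit_heat):
    """Return the files whose SDA-to-space ratio exceeds the unit heat value.

    files:     dict mapping file name ->
               (access_frequency, replica_count, occupied_space)
    unit_heat: preset unit heat value used as the threshold
    """
    selected = []
    for name, (frequency, replicas, space) in files.items():
        sda = frequency / replicas  # assumed definition of SDA(i)
        if space > 0 and sda / space > unit_heat:
            selected.append(name)
    return selected
```

A file that is accessed often but already has several replicas yields a lower SDA and may therefore fall below the threshold.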

In addition, as another implementation of the embodiments of the present application, the execution body may compute a space cost from the occupied space. Specifically, the execution body may determine the data volume of the partition blocks shared between the file to be migrated and each candidate server, and then determine the space cost from the occupied space and the data volume of the shared partition blocks. The execution body may then decide whether the file to be migrated is a target migration file by comparing the quotient (that is, the ratio) of the average data access frequency SDA(i) and the space cost with a preset value. If this quotient is greater than the preset value, the file to be migrated is determined as a target migration file.

The execution body may migrate the target migration file using a heat-aware data replication algorithm.

As an example, the heat-aware data replication algorithm is given as Algorithm 2.

Algorithm 2 is explained as follows:

For any file stored on the source server, a value is obtained for each candidate server from the index function. These values are then sorted in descending order to construct the server queue (lines 3-4). Servers ranked higher in the queue contain more content similar to the file. For each server in the queue, SDA(i) and the extra space cost are computed. If the quotient of these two parameters is greater than the unit heat value, the file is replicated to the k-th server in the queue (lines 5-9). After the replication locations of each file have been determined, the function Heat Allocation (lines 12-17) is used to adjust the amount of data access assigned to each of the servers involved. This function takes over the role of the load scheduler, which balances the service load according to the available capabilities of the servers involved.
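Because the listing of Algorithm 2 is not reproduced in the text, the following is only a sketch of the behavior described in the explanation above, not the algorithm itself. It assumes that the extra space cost of a replica is the file size minus the bytes already shared with the server, that SDA is the access frequency divided by the current number of copies, and that Heat Allocation splits the access volume in proportion to server capacity. All function and parameter names are hypothetical.

```python
def replicate_hot_file(file_size, access_freq, shared_bytes, unit_heat):
    """Choose replica locations for one file.

    shared_bytes: dict mapping server name -> bytes of the file it already holds
    Returns the servers chosen for replication, in queue order.
    """
    # Servers sharing more data with the file come first in the queue.
    queue = sorted(shared_bytes, key=shared_bytes.get, reverse=True)
    copies = 1  # the copy that already exists
    chosen = []
    for server in queue:
        sda = access_freq / copies                    # assumed SDA definition
        extra_cost = file_size - shared_bytes[server]  # assumed cost model
        # A zero-cost replica is always accepted; otherwise apply the test.
        if extra_cost <= 0 or sda / extra_cost > unit_heat:
            chosen.append(server)
            copies += 1
    return chosen


def allocate_heat(access_freq, capacity):
    """Split the access volume across replicas in proportion to capacity."""
    total = sum(capacity.values())
    return {server: access_freq * cap / total for server, cap in capacity.items()}
```

Each accepted replica lowers the SDA seen by the next server in the queue, so the loop stops adding copies once the per-replica load no longer justifies the extra space.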

Step S207: Determine the target server among the candidate servers, so as to migrate the target migration file to the target server, and update the bit vectors and space-saving indexes corresponding to the target server and the source server.

After the target migration file and the target server have been determined, the next step is to use file heat to adjust the migration plan. Hot files should have multiple replicas in the system, for fault tolerance and a better user experience. The embodiments of the present application provide a unit heat value to indicate whether replicating a file is warranted. This metric reflects the quotient between the split data access (SDA) amount and the extra space required by a replica. SDA(i) is defined as the average data access frequency borne by each replica of the i-th file. A replication is performed only if this quotient is greater than the unit heat value, which ensures that replicas of hot files are added at a limited extra space cost.

The time complexity of the heat-aware data replication algorithm has three terms. One term results from sorting the target servers by the amount of data they share with a given file. The other two terms result from the maximum migration time and the maximum replication time of each file. A separate expression gives the space complexity.

Specifically, as another implementation, determining the target server among the candidate servers includes:

Determining the similarity between the target migration file and each candidate server. Specifically, the execution body may determine, for each block of the target migration file, its positions in the bit vector of the source server to which it belongs and the values at those positions, and then compare them with the values at the same positions in the bit vector of each candidate server. For example, the values at the positions of the blocks of the target migration file may be converted into a vector, and the cosine similarity between this vector and the vector of values at the corresponding positions of a candidate server may be computed, which yields the similarity between the blocks of the target migration file and each candidate server. A candidate server whose similarity is greater than a preset similarity threshold is determined as a target server.
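The cosine-similarity comparison described above can be sketched as follows. The function names `cosine_similarity` and `pick_targets` and the data layout (bit vectors as lists of 0/1 values, block positions as a list of indexes) are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_targets(positions, source_bits, candidate_bits, threshold):
    """Return candidates whose bit values at the file's positions are similar.

    positions:      bit-vector positions hit by the file's blocks
    source_bits:    bit vector of the source server
    candidate_bits: dict mapping server name -> bit vector
    threshold:      preset similarity threshold
    """
    source_vec = [source_bits[p] for p in positions]
    targets = []
    for name, bits in candidate_bits.items():
        candidate_vec = [bits[p] for p in positions]
        if cosine_similarity(source_vec, candidate_vec) > threshold:
            targets.append(name)
    return targets
```

Because every position hit by the file's blocks is set on the source server, the similarity grows with the share of those positions that are also set on the candidate.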

Determining the target occupied space of the target migration file; determining the target data volume of the partition blocks shared between the target migration file and each candidate server; determining the average data access frequency borne by each replica of the target migration file; determining the difference between the target occupied space and each target data volume, and then the ratio of the average data access frequency to each difference; and, in response to determining that a ratio is greater than a preset value, determining the candidate server corresponding to that ratio as a target server. There may be multiple target servers. Migrating the target migration file to multiple servers allows a reasonable general migration strategy to make maximum use of the shared partition blocks on each target server when reconstructing the target migration file, which potentially reduces the total space cost.

The embodiments of the present application use a space-saving migration algorithm to give priority to migration files and migration targets that incur a lower extra space cost. On this basis, a heat-aware data replication algorithm is proposed that replicates hot files with limited extra space overhead and thereby achieves service adaptability. With similarity-aware file allocation, the space saved increases gradually as the migration percentage grows. When 45% of the data is migrated, about 41% of the occupied space can be released from the source server. The space reduction is achieved by detecting the similarity between the target migration file and the candidate servers.

Jingwei (shown in FIG. 6) is an efficient and adaptable data migration strategy, namely the efficient data migration method for a deduplication storage system described in the embodiments of the present application, which migrates and replicates files to suitable servers. This helps improve the space efficiency and service adaptability of a deduplication storage system. First, for the case in which only one empty migration target is allowed, a migration strategy based on ILP is designed. The problem is then extended to a more general scenario in which multiple non-empty servers are available for migration, and the general migration problem is solved with an efficient heuristic based on Bloom filters. Trace-driven experiments in different scenarios show that the embodiments of the present application can significantly reduce the extra space cost of migration while increasing the number of replicas of hot files. In summary, Jingwei realizes an efficient and adaptable migration strategy that builds a large number of RH with limited additional storage space. Specifically, Jingwei generates replicas for four hot files (66.7%) while increasing space utilization by only 5.7% compared with Goseed. In addition, the RS-ratio that Jingwei achieves saves about 3.5 times the storage space compared with the SARA method.

Continuing with FIG. 4, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an efficient data migration apparatus for a deduplication storage system. This apparatus embodiment corresponds to the method embodiment shown in FIG. 1, and the apparatus can be applied in various electronic devices.

As shown in FIG. 4, the efficient data migration apparatus 400 for a deduplication storage system of this embodiment includes a receiving unit 401, a candidate server determination unit 402, an index determination unit 403, and a migration unit 404.

The receiving unit 401 is configured to receive a data migration request, determine the source server corresponding to the data migration request, and then determine the files to be migrated on the source server.

The candidate server determination unit 402 is configured to determine the available servers, determine the bit vector corresponding to each available server based on the files to be migrated and a preset number of independent hash functions, and determine the candidate servers for the data migration request based on the bit vectors.

The index determination unit 403 is configured to determine, based on the files to be migrated, the file-space overlap data and the data volume of the shared partition blocks with each candidate server, and then determine the space-saving index corresponding to each candidate server based on the file-space overlap data and the data volume.

The migration unit 404 is configured to determine, based on the space-saving index, the target migration file among the files to be migrated and the target server among the candidate servers, so as to migrate the target migration file to the target server, and to update the bit vectors and space-saving indexes corresponding to the target server and the source server.

In some embodiments, the migration unit 404 is further configured to: determine the occupied space corresponding to each file to be migrated based on the space-saving index; determine the average data access frequency borne by each file to be migrated; and determine the target migration file according to the average data access frequency and the occupied space.

In some embodiments, the migration unit 404 is further configured to: determine the ratio of the average data access frequency to the occupied space; and, in response to determining that the ratio is greater than a preset unit heat value, determine the file to be migrated that corresponds to the ratio as the target migration file.

In some embodiments, the migration unit 404 is further configured to: determine the similarity between the target migration file and each candidate server; and determine a candidate server whose similarity is greater than a preset similarity threshold as a target server.

In some embodiments, the migration unit 404 is further configured to: determine the target occupied space of the target migration file; determine the target data volume of the partition blocks shared between the target migration file and each candidate server; determine the average data access frequency borne by each replica of the target migration file; determine the difference between the target occupied space and each target data volume, and then the ratio of the average data access frequency to each difference; and, in response to determining that a ratio is greater than a preset value, determine the candidate server corresponding to that ratio as a target server.

The technical carriers involved in the payment described in the embodiments of this specification may include, for example, Near Field Communication (NFC), WIFI, 3G/4G/5G, POS card-swiping technology, QR code scanning technology, barcode scanning technology, Bluetooth, infrared, Short Message Service (SMS), and Multimedia Message Service (MMS).

It should be noted that the method of one or more embodiments of this specification may be executed by a single device, such as a computer or a server. The method of this embodiment may also be applied in a distributed scenario and completed by multiple devices cooperating with one another. In such a distributed scenario, one of the devices may execute only one or more steps of the method of one or more embodiments of this specification, and the devices interact with one another to complete the method.

Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the figures do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

For convenience of description, the above apparatus is described in terms of modules divided by function. When one or more embodiments of this specification are implemented, the functions of the modules may of course be implemented in one or more pieces of software and/or hardware.

The apparatus of the above embodiments is used to implement the corresponding method of the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here.

FIG. 5 shows a more specific schematic diagram of the hardware structure of an electronic device provided in this embodiment. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 are communicatively connected to one another inside the device through the bus 1050.

The processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and executes the relevant programs to implement the technical solutions provided by the embodiments of this specification.

The memory 1020 may be implemented as ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 1020 and is invoked and executed by the processor 1010.

The input/output interface 1030 is used to connect an input/output module for information input and output. The input/output module may be configured in the device as a component (not shown in the figure) or may be external to the device to provide the corresponding functions. Input devices may include a keyboard, a mouse, a touch screen, a microphone, and various sensors; output devices may include a display, a speaker, a vibrator, and indicator lights.

The communication interface 1040 is used to connect a communication module (not shown in the figure) for communication between this device and other devices. The communication module may communicate in a wired manner (for example USB or a network cable) or in a wireless manner (for example a mobile network, WIFI, or Bluetooth).

The bus 1050 includes a path that carries information between the components of the device (for example the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040).

It should be noted that although only the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040, and the bus 1050 are shown for the above device, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the device may include only the components necessary to implement the solutions of the embodiments of this specification and need not include all the components shown in the figure.

The computer-readable media of this embodiment include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device.

Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is exemplary only and is not intended to imply that the scope of the present application (including the claims) is limited to these examples. Within the idea of the present application, technical features of the above embodiments or of different embodiments may also be combined, steps may be carried out in any order, and there are many other variations of the different aspects of one or more embodiments of this specification as described above, which are not provided in detail for the sake of brevity.

In addition, to simplify the description and discussion, and so as not to obscure one or more embodiments of this specification, well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided figures. Furthermore, apparatuses may be shown in block diagram form to avoid obscuring one or more embodiments of this specification, which also takes into account the fact that details of the implementation of such block diagram apparatuses depend heavily on the platform on which one or more embodiments of this specification are to be implemented (that is, these details should be well within the understanding of those skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of the present application, it will be apparent to those skilled in the art that one or more embodiments of this specification can be implemented without these specific details or with variations of them. These descriptions are therefore to be regarded as illustrative rather than restrictive.

Although the present application has been described in conjunction with specific embodiments, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (for example, dynamic RAM (DRAM)) may use the embodiments discussed.

One or more embodiments of this specification are intended to cover all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, or improvement made within the spirit and principles of one or more embodiments of this specification shall fall within the protection scope of the present application.

Claims

1. An efficient data migration method for a deduplication storage system, comprising:

receiving a data migration request, determining a source server corresponding to the data migration request, and further determining a file to be migrated in the source server;

determining each available server, further determining a bit vector corresponding to each available server based on the file to be migrated and a preset number of independent hash functions, and determining each candidate server corresponding to the data migration request based on the bit vector;

determining, based on the file to be migrated, file space overlap data and a data volume of the shared partition blocks of each candidate server, determining a deviation between the file space overlap data and the data volume, and further determining a space-saving index corresponding to each candidate server based on the deviation, wherein the space-saving index is the amount of storage resources saved when the file is migrated to a target server;

and determining, based on the space-saving index, a target migration file among the files to be migrated and a target server among the candidate servers, so as to migrate the target migration file to the target server and update the bit vectors and the space-saving indexes corresponding to the target server and the source server.
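The bit-vector step of claim 1 can be illustrated with a minimal sketch. It assumes a Bloom-filter-style reading in which each server's bit vector summarizes the chunk fingerprints it stores; the claim itself does not fix this construction. All names here (`bit_positions`, `build_bit_vector`, `candidate_servers`, `min_shared`) are hypothetical, and salted SHA-256 stands in for the "preset number of independent hash functions".

```python
import hashlib


def bit_positions(chunk_id, num_hashes, num_bits):
    """Map one chunk fingerprint to num_hashes bit positions.

    Assumption: salting SHA-256 with the hash index approximates
    independent hash functions; the patent does not name the functions.
    """
    positions = []
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{chunk_id}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


def build_bit_vector(chunk_ids, num_hashes, num_bits):
    """Summarize the chunks stored on one server as a bit vector."""
    bits = [0] * num_bits
    for chunk_id in chunk_ids:
        for pos in bit_positions(chunk_id, num_hashes, num_bits):
            bits[pos] = 1
    return bits


def estimated_shared_chunks(file_chunks, bits, num_hashes):
    """Count file chunks whose positions are all set (may overcount, never undercounts)."""
    num_bits = len(bits)
    return sum(
        all(bits[pos] for pos in bit_positions(chunk_id, num_hashes, num_bits))
        for chunk_id in file_chunks
    )


def candidate_servers(file_chunks, server_bits, num_hashes, min_shared=1):
    """Keep servers that appear to share at least min_shared chunks with the file."""
    return [
        name
        for name, bits in server_bits.items()
        if estimated_shared_chunks(file_chunks, bits, num_hashes) >= min_shared
    ]
```

Under this reading, a server that truly holds a chunk is never missed, while false positives are possible, so the result is best treated as a pre-filter before the exact overlap is measured for the space-saving index.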

2. The method according to claim 1, wherein the determining a target migration file in the files to be migrated comprises:

determining the occupied space corresponding to each file to be migrated based on the space-saving index;

determining the average data access frequency borne by each file to be migrated;

and determining a target migration file according to the average data access frequency and the occupied space.

3. The method of claim 2, wherein determining the target migration file according to the average data access frequency and the occupied space comprises:

determining a ratio of the average data access frequency to the occupied space;

and in response to the ratio being greater than a preset unit heat value, determining that the file to be migrated corresponding to the ratio is the target migration file.
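The selection rule of claims 2 and 3 can be sketched as follows. This is only an illustration: the function name `select_target_files`, the input layout, and the handling of non-positive occupied space are assumptions, not part of the claims.

```python
def select_target_files(files, unit_heat_threshold):
    """Return files whose access frequency per unit of occupied space exceeds the threshold.

    files: dict mapping file name -> (average_access_frequency, occupied_space).
    Assumption: files with non-positive occupied space are skipped, since the
    ratio is undefined for them; the claims do not address this case.
    """
    targets = []
    for name, (avg_frequency, occupied_space) in files.items():
        if occupied_space <= 0:
            continue
        unit_heat = avg_frequency / occupied_space
        if unit_heat > unit_heat_threshold:
            targets.append(name)
    return targets
```

For example, with a threshold of 1.0, a file with frequency 100 and occupied space 10 is selected (ratio 10), while a file with frequency 5 and occupied space 10 is not (ratio 0.5).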

4. The method of claim 2, wherein determining the target server among the candidate servers comprises:

determining the similarity between the target migration file and each candidate server;

and determining the candidate server corresponding to the similarity greater than the preset similarity threshold as the target server.

5. The method of claim 2, wherein determining the target server among the candidate servers comprises:

determining a target occupation space of the target migration file;

determining a target data volume of the partition blocks shared between the target migration file and each candidate server;

determining an average data access frequency borne by each copy of the target migration file;

determining a difference between the target occupied space and each target data volume, and further determining a ratio of the average data access frequency to each difference;

and in response to determining that the ratio is greater than a preset value, determining that the candidate server corresponding to the ratio is the target server.
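The target-server test of claim 5 can be sketched in the same spirit. The names (`select_target_servers`, `shared_volume`, `preset`) are hypothetical, and treating a non-positive difference as an unbounded ratio is an assumption made here to avoid division by zero; the claim does not say how that case is handled.

```python
import math


def select_target_servers(occupied_space, avg_frequency_per_copy, shared_volume, preset):
    """Return candidate servers whose frequency-to-extra-space ratio exceeds preset.

    occupied_space: target occupied space of the target migration file.
    shared_volume: dict mapping server name -> data volume of blocks shared with the file.
    """
    selected = []
    for server, shared in shared_volume.items():
        # Space the file would actually add on this server after deduplication.
        extra_space = occupied_space - shared
        # Assumption: a fully deduplicated file (no extra space) always qualifies.
        ratio = math.inf if extra_space <= 0 else avg_frequency_per_copy / extra_space
        if ratio > preset:
            selected.append(server)
    return selected
```

With an occupied space of 100, a per-copy frequency of 50, and a preset value of 1.0, a server sharing 90 units yields a ratio of 5 and qualifies, while a server sharing only 10 units yields roughly 0.56 and does not.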

6. An efficient data migration apparatus for a deduplication storage system, comprising:

a receiving unit configured to receive a data migration request, determine a source server corresponding to the data migration request, and further determine a file to be migrated in the source server;

a candidate server determining unit configured to determine each available server, determine a bit vector corresponding to each available server based on the file to be migrated and a preset number of independent hash functions, and determine each candidate server corresponding to the data migration request based on the bit vector;

an index determining unit configured to determine, based on the file to be migrated, file space overlap data and a data volume of the shared partition blocks of each candidate server, determine a deviation between the file space overlap data and the data volume, and further determine a space-saving index corresponding to each candidate server based on the deviation, wherein the space-saving index is the amount of storage resources saved when the file is migrated to a target server;

and a migration unit configured to determine, based on the space-saving index, a target migration file among the files to be migrated and a target server among the candidate servers, so as to migrate the target migration file to the target server, and update the bit vectors and the space-saving indexes corresponding to the target server and the source server.

7. The apparatus of claim 6, wherein the migration unit is further configured to:

8. The apparatus of claim 7, wherein the migration unit is further configured to:

determining a ratio of the average data access frequency to the footprint;

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 5 when executing the program.

10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 5.