
CN113590535B - An efficient data migration method and device for deduplication storage system - Google Patents


Info

Publication number
CN113590535B
Authority
CN
China
Prior art keywords
file
server
target
determining
migration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111156367.9A
Other languages
Chinese (zh)
Other versions
CN113590535A (en
Inventor
郭得科
罗来龙
程葛瑶
夏俊旭
张千桢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202111156367.9A
Publication of CN113590535A
Application granted
Publication of CN113590535B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/119 Details of migration of file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/134 Distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/176 Support for shared access to files; File sharing support

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application provides an efficient data migration method and apparatus for a deduplication storage system. The method includes: receiving a data migration request, determining the corresponding source server, and determining the files to be migrated on the source server; determining a bit vector for each available server based on the files to be migrated and a preset number of independent hash functions, and determining the candidate servers based on the bit vectors; determining, based on the files to be migrated, the file space overlap data and the data volume of blocks shared with each candidate server, and determining a space-saving index for each candidate server from the file space overlap data and the data volume; and, based on the space-saving index, determining the target migration file and the target server among the candidate servers, so as to migrate the target migration file to the target server and update the bit vectors and space-saving indexes. Additional space cost is thereby reduced during data migration, making the migration efficient and highly adaptable.

Description

An efficient data migration method and apparatus for a deduplication storage system

Technical Field

The present application relates to the technical field of data migration, and in particular to an efficient data migration method and apparatus for a deduplication storage system.

Background

With the introduction of data deduplication, traditional migration methods face serious challenges. First, deduplication creates data-sharing dependencies among stored files; breaking these dependencies during migration incurs additional space overhead. Second, eliminating redundancy increases the risk that data becomes unavailable when a server crashes. Existing data migration methods cannot solve both problems at once.

Summary of the Invention

In view of this, the purpose of the present application is to propose an efficient data migration method and apparatus for a deduplication storage system that solves the above technical problems.

To this end, the present application provides:

An efficient data migration method for a deduplication storage system, comprising:

receiving a data migration request, determining the source server corresponding to the data migration request, and then determining the files to be migrated on the source server;

determining the available servers, then determining the bit vector corresponding to each available server based on the files to be migrated and a preset number of independent hash functions, and determining the candidate servers corresponding to the data migration request based on the bit vectors;

determining, based on the files to be migrated, the file space overlap data and the data volume of blocks shared with each candidate server, and then determining the space-saving index corresponding to each candidate server based on the file space overlap data and the data volume;

based on the space-saving index, determining the target migration file among the files to be migrated and the target server among the candidate servers, so as to migrate the target migration file to the target server, and updating the bit vectors and space-saving indexes corresponding to the target server and the source server.

Optionally, determining the target migration file among the files to be migrated includes:

determining the occupied space corresponding to each file to be migrated based on the space-saving index;

determining the average data access frequency borne by each file to be migrated;

determining the target migration file according to the average data access frequency and the occupied space.

Optionally, determining the target migration file according to the average data access frequency and the occupied space includes:

determining the ratio of the average data access frequency to the occupied space;

in response to determining that the ratio is greater than a preset unit heat value, determining the file to be migrated that corresponds to the ratio as the target migration file.
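The ratio test in the two optional steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `FileStats`, `select_target_files` and `unit_heat_threshold` are hypothetical, and skipping files with a non-positive footprint is an assumption made only to avoid division by zero.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file record; field names are illustrative only."""
    name: str
    avg_access_freq: float  # average data access frequency borne by the file
    occupied_bytes: int     # occupied space derived from the space-saving index


def select_target_files(files, unit_heat_threshold):
    """Return the files whose heat per unit of occupied space exceeds the threshold."""
    selected = []
    for f in files:
        if f.occupied_bytes <= 0:
            # Assumption: a file with no measurable footprint is skipped.
            continue
        if f.avg_access_freq / f.occupied_bytes > unit_heat_threshold:
            selected.append(f)
    return selected


files = [
    FileStats("a.log", avg_access_freq=120.0, occupied_bytes=4096),
    FileStats("b.iso", avg_access_freq=5.0, occupied_bytes=1_048_576),
]
hot = select_target_files(files, unit_heat_threshold=0.01)
```

In this example only `a.log` exceeds the threshold, because it carries far more accesses per occupied byte than `b.iso`.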

Optionally, determining the target server among the candidate servers includes:

determining the similarity between the target migration file and each candidate server;

determining a candidate server whose similarity is greater than a preset similarity threshold as the target server.
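A possible reading of this similarity test is sketched below. The text does not define the similarity measure, so the sketch assumes it is the fraction of the file's block fingerprints that already exist on the candidate server; the function names and the example fingerprints are hypothetical.

```python
def similarity(file_blocks, server_blocks):
    """Assumed measure: share of the file's blocks already stored on the server."""
    if not file_blocks:
        return 0.0
    return len(file_blocks & server_blocks) / len(file_blocks)


def pick_target_servers(file_blocks, candidates, threshold):
    """Return the names of candidate servers whose similarity exceeds the threshold."""
    return [
        name
        for name, blocks in candidates.items()
        if similarity(file_blocks, blocks) > threshold
    ]


file_blocks = {"c1", "c2", "c3", "c4"}
candidates = {
    "s1": {"c1", "c2", "c3", "x"},  # holds 3 of the 4 blocks
    "s2": {"c1"},                   # holds 1 of the 4 blocks
}
targets = pick_target_servers(file_blocks, candidates, threshold=0.5)
```

With exact fingerprint sets the measure is precise; if it were computed through Bloom-filter membership queries instead, false positives could inflate it.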

Optionally, determining the target server among the candidate servers includes:

determining the target occupied space of the target migration file;

determining the target data volume of the blocks shared between the target migration file and each candidate server;

determining the average data access frequency borne by each replica of the target migration file;

determining the difference between the target occupied space and each target data volume, and then determining the ratio of the average data access frequency to each difference;

in response to determining that the ratio is greater than a preset value, determining the candidate server that corresponds to the ratio as the target server.
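The steps above amount to computing, for each candidate, the access frequency gained per byte of extra space the file would cost there. A minimal sketch, assuming byte counts for the space quantities; the function name, the parameter names and the handling of a non-positive difference are hypothetical and not taken from the text.

```python
def pick_targets_by_heat(occupied, shared_by_server, freq_per_replica, preset):
    """Select servers where frequency / (occupied - shared) exceeds the preset value."""
    targets = []
    for server, shared in shared_by_server.items():
        added = occupied - shared  # extra space the file would need on this server
        if added <= 0:
            # Assumption: a server that already holds every block needs no
            # extra space, so it is treated as always qualifying.
            targets.append(server)
            continue
        if freq_per_replica / added > preset:
            targets.append(server)
    return targets


targets = pick_targets_by_heat(
    occupied=1000,
    shared_by_server={"s1": 900, "s2": 100},
    freq_per_replica=50.0,
    preset=0.1,
)
```

Here `s1` qualifies (50 / 100 = 0.5) while `s2` does not (50 / 900 is below 0.1), so servers that already share most of the file's blocks are favored.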

An efficient data migration apparatus for a deduplication storage system, comprising:

a receiving unit, configured to receive a data migration request, determine the source server corresponding to the data migration request, and then determine the files to be migrated on the source server;

a candidate server determination unit, configured to determine the available servers, then determine the bit vector corresponding to each available server based on the files to be migrated and a preset number of independent hash functions, and determine the candidate servers corresponding to the data migration request based on the bit vectors;

an index determination unit, configured to determine, based on the files to be migrated, the file space overlap data and the data volume of blocks shared with each candidate server, and then determine the space-saving index corresponding to each candidate server based on the file space overlap data and the data volume;

a migration unit, configured to determine, based on the space-saving index, the target migration file among the files to be migrated and the target server among the candidate servers, so as to migrate the target migration file to the target server, and to update the bit vectors and space-saving indexes corresponding to the target server and the source server.

Optionally, the migration unit is further configured to:

determine the occupied space corresponding to each file to be migrated based on the space-saving index;

determine the average data access frequency borne by each file to be migrated;

determine the target migration file according to the average data access frequency and the occupied space.

Optionally, the migration unit is further configured to:

determine the ratio of the average data access frequency to the occupied space;

in response to determining that the ratio is greater than a preset unit heat value, determine the file to be migrated that corresponds to the ratio as the target migration file.

Optionally, the migration unit is further configured to:

determine the similarity between the target migration file and each candidate server;

determine a candidate server whose similarity is greater than a preset similarity threshold as the target server.

Optionally, the migration unit is further configured to:

determine the target occupied space of the target migration file;

determine the target data volume of the blocks shared between the target migration file and each candidate server;

determine the average data access frequency borne by each replica of the target migration file;

determine the difference between the target occupied space and each target data volume, and then determine the ratio of the average data access frequency to each difference;

in response to determining that the ratio is greater than a preset value, determine the candidate server that corresponds to the ratio as the target server.

An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the efficient data migration method for a deduplication storage system described above.

A non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to execute the efficient data migration method for a deduplication storage system described above.

As can be seen from the above, the efficient data migration method and apparatus for a deduplication storage system provided by the present application receive a data migration request, determine the source server corresponding to the request, and then determine the files to be migrated on the source server; determine the available servers, determine the bit vector corresponding to each available server based on the files to be migrated and a preset number of independent hash functions, and determine the candidate servers corresponding to the request based on the bit vectors; determine, based on the files to be migrated, the file space overlap data and the data volume of blocks shared with each candidate server, and from these determine the space-saving index corresponding to each candidate server; and, based on the space-saving index, determine the target migration file among the files to be migrated and the target server among the candidate servers, migrate the target migration file to the target server, and update the bit vectors and space-saving indexes corresponding to the target server and the source server. Additional space cost is thereby reduced during data migration, and the migration is carried out with an efficient and adaptable strategy.

Brief Description of the Drawings

To illustrate the technical solutions of the present application or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely one or more embodiments of this specification; those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of an efficient data migration method for a deduplication storage system according to an embodiment of this specification;

FIG. 2 is a schematic flowchart of an efficient data migration method for a deduplication storage system according to another embodiment of this specification;

FIG. 3 is a schematic diagram of a bit vector in an efficient data migration method for a deduplication storage system according to another embodiment of this specification;

FIG. 4 is a structural block diagram of an efficient data migration apparatus for a deduplication storage system according to another embodiment of this specification;

FIG. 5 is a schematic diagram of the hardware structure of an electronic device for an efficient data migration method for a deduplication storage system according to an embodiment of this specification.

FIG. 6 is a schematic overview of Jingwei, the efficient data migration method for a deduplication storage system according to an embodiment of this specification.

FIG. 7 shows an example of data migration from server [image 001] to server [image 002] in the efficient data migration method for a deduplication storage system of this specification.

Detailed Description

To make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to specific embodiments and the accompanying drawings.

It should be noted that, unless otherwise defined, the technical or scientific terms used in one or more embodiments of this specification have the ordinary meaning understood by a person of ordinary skill in the field to which this application belongs. "First", "second" and similar words used in one or more embodiments of this specification do not denote any order, quantity or importance; they only distinguish different components. "Comprise", "include" and similar words mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. "Connected", "coupled" and similar words are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. "Up", "down", "left", "right" and the like only indicate relative positions; when the absolute position of the described object changes, the relative position may change accordingly.

FIG. 1 shows a flow 100 of an embodiment of the efficient data migration method for a deduplication storage system according to the present application. The method of this embodiment includes the following steps:

Step S101: receive a data migration request, determine the source server corresponding to the data migration request, and then determine the files to be migrated on the source server.

The execution body of this embodiment may be, for example, a server, which may receive the data migration request over a wired or wireless connection.

Specifically, data migration in this embodiment may mean cutting (moving) data from one server to one or more other servers, or copying data from one server to one or more other servers. The unit of migration may be a whole file, or one or several data blocks of a file.

After receiving the data migration request, the execution body can obtain the source server identifier carried in the request, for example the IP address of the source server, and use it to locate the source server corresponding to the request. The execution body can then obtain the identifier of the file to be migrated carried in the request, for example the file name or the storage address of the file; this embodiment does not limit the specific content of that identifier. The execution body then uses the identifier to locate the file to be migrated on the source server.

Step S102: determine the available servers, then determine the bit vector corresponding to each available server based on the files to be migrated and a preset number of independent hash functions, and determine the candidate servers corresponding to the data migration request based on the bit vectors.

In this embodiment, the execution body may determine multiple non-empty servers as available servers, which can receive migrated files from an overloaded server (for example, the source server). To find the candidate servers for each file to be migrated among the available servers, the fingerprints (for example, MD5 or SHA-1) of the blocks contained in the file can be compared with the fingerprints of the blocks stored on each available server. For a file containing n blocks, determining whether a server contains those blocks in this way has a time complexity of [image 003], where [image 004] is the total number of blocks on the available server. Instead, a Bloom filter (BF) is used to represent the blocks on each available server. It captures the data features and turns similarity detection from pairwise fingerprint checks into membership queries on a bit vector (also called a data sketch). The time complexity of determining whether a server contains the n blocks of a file is then reduced to [image 005], where [image 006] denotes the number of hash functions used.

As an example, as shown in FIG. 3, suppose the block set [image 007] on an available server consists of three blocks, [image 008], [image 009] and [image 010], and the BF representing [image 007] has length d = 19. All d bits of the bit vector corresponding to this available server are initially set to 0. With [image 011], that is, with two independent hash functions, each block on the available server is mapped to [image 006] positions in the bit vector, and the positions that are hit are set to 1. The binary string derived from the hash functions is the bit vector, i.e. the data sketch. Every available server maintains one bit vector, with the same [image 006] hash functions and the same vector length, to record membership information at block level. Using the bit vector and the [image 006] hash functions, a membership query can be answered for any block of any file.

Specifically, when a file [image 012] on the source server is to select its best target server from all available candidate servers, the BF bit vector of each available server is first obtained, and the candidate servers corresponding to the data migration request are then determined based on the bit vectors. For example, for any block [image 013] of file [image 012] on the source server, if any bit at the [image 006] hash positions in the Bloom filter is 0, the BF determines that the block does not belong to the available server. Otherwise, the BF considers that the queried block [image 013] belongs to the available server.
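The membership test described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `BloomSketch`, the use of salted SHA-256 digests to emulate independent hash functions, and the string fingerprints are all assumptions; only the vector length d = 19 and the use of two hash functions come from the FIG. 3 example.

```python
import hashlib


class BloomSketch:
    """Per-server bit vector (data sketch) over block fingerprints."""

    def __init__(self, num_bits=19, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # all d bits start at 0

    def _positions(self, fingerprint):
        # Assumption: salting one hash with a seed stands in for
        # a family of independent hash functions.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{fingerprint}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fingerprint):
        """Set every hit position to 1."""
        for pos in self._positions(fingerprint):
            self.bits[pos] = 1

    def might_contain(self, fingerprint):
        """False if any hashed position is 0; True otherwise."""
        return all(self.bits[pos] for pos in self._positions(fingerprint))


def shared_block_count(file_fingerprints, sketch):
    """Count the file's blocks that the sketch reports as present on the server."""
    return sum(1 for fp in file_fingerprints if sketch.might_contain(fp))


server_sketch = BloomSketch()
for fp in ["c1", "c2", "c3"]:
    server_sketch.add(fp)
```

Because a Bloom filter can return false positives but never false negatives, a count obtained this way may overestimate the number of shared blocks, especially with a vector as short as 19 bits.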

Step S103: based on the files to be migrated, determine the file space overlap data and the data volume of blocks shared with each candidate server, and then determine the space-saving index corresponding to each candidate server based on the file space overlap data and the data volume.

The space-saving index is defined as the amount of storage resources saved when a file is migrated to a target server, that is, the difference between the amount of data released from the source server and the space added on the migration target. Space-saving data migration decides which files to migrate and which target servers they should be directed to, thereby reducing additional space cost.

Specifically, Algorithm 1 for determining the space-saving index is as follows:

[Algorithm 1: image 014]

Specifically, Algorithm 1 can be explained as follows. The input comprises the BF-based data sketches and file sets of the source server [image 015] and of all target candidates [image 016], where [image 017]. To derive the migration variable [image 018] and the corresponding migration target [image 019], a space-efficient index (the space-saving index) is designed to drive the migration sequence. The function is shown in lines 9-13 of Algorithm 1. Let [image 020] denote the data volume of the blocks shared between [image 021] and [image 022], which can be derived from BF-based membership queries for the blocks of [image 021] against the data sketch of [image 022]. The function then returns the index [image 024] according to the deviation between [image 020] and [image 023], which reflects the space resources saved by migrating file [image 021] from [image 015] to [image 022]. The value of [image 023] is computed from a sketch that does not contain [image 021] and reflects the space overlap between [image 021] and the other files on [image 015].
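The index can be illustrated with exact fingerprint sets in place of Bloom-filter queries. This is a sketch under stated assumptions rather than Algorithm 1 itself: the function and parameter names are hypothetical, and the index is computed directly from its definition as the space released on the source minus the space added on the target, which equals the data volume shared with the target minus the overlap with the other files on the source.

```python
def space_saving_index(file_blocks, other_source_blocks, target_blocks, block_size):
    """Space released on the source minus space added on the target, in bytes.

    file_blocks:         fingerprints of the blocks of the file to migrate
    other_source_blocks: fingerprints used by the other files on the source
    target_blocks:       fingerprints already stored on the candidate server
    block_size:          mapping from fingerprint to block size
    """
    def total(blocks):
        return sum(block_size[b] for b in blocks)

    # Only blocks that no other source file references can be freed.
    released = total(file_blocks - other_source_blocks)
    # Only blocks the target does not already hold consume new space.
    added = total(file_blocks - target_blocks)
    return released - added


sizes = {"a": 4, "b": 4, "c": 8}
index = space_saving_index(
    file_blocks={"a", "b", "c"},
    other_source_blocks={"a"},
    target_blocks={"b", "c"},
    block_size=sizes,
)
```

In this example 12 bytes are released on the source and 4 bytes are added on the target, giving an index of 8; the same value follows from shared volume (12) minus source overlap (4).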

Step S104: based on the space-saving index, determine the target migration file among the files to be migrated and the target server among the candidate servers, so as to migrate the target migration file to the target server, and update the bit vectors and space-saving indexes corresponding to the target server and the source server.

In this embodiment, the space-saving index maintained for each file server is used to iteratively find the maximum
Figure 434107DEST_PATH_IMAGE025
, which determines the target migration file (through the variable i of the maximum
Figure 959242DEST_PATH_IMAGE025
) and the target server among the candidate servers (through the variable k of the maximum
Figure 63464DEST_PATH_IMAGE025
), until M percent of the data in
Figure 95005DEST_PATH_IMAGE015
has been migrated (lines 3-5 of Algorithm 1 above). Each migration changes the storage state of the source and target servers, so the data sketch (that is, the bit vector) is updated locally on every server that gains or releases a file. The space-saving index
Figure 829743DEST_PATH_IMAGE025
is updated on the affected servers as well (lines 6-8 of Algorithm 1 above).
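The iteration described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Algorithm 1 itself: exact block sets stand in for the Bloom-filter sketches, the saving is simplified to the bytes of a file that the candidate server already stores, and the names `greedy_migrate`, `files`, `servers`, `block_size`, and `m_percent` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_migrate(files: Dict[str, Set[str]],
                   servers: Dict[str, Set[str]],
                   block_size: Dict[str, int],
                   m_percent: float) -> List[Tuple[str, str]]:
    """Return a migration plan as a list of (file, target server) pairs."""
    remaining = dict(files)                             # files still on the source
    targets = {k: set(v) for k, v in servers.items()}   # working copy of target states
    source_blocks = set().union(*files.values())
    total = sum(block_size[b] for b in source_blocks)   # deduplicated source volume
    moved, plan = 0, []
    while remaining and targets and moved < total * m_percent / 100:
        # Assumed saving: bytes of file i that candidate server k already holds.
        _, i, k = max(
            ((sum(block_size[b] for b in blocks & targets[k]), i, k)
             for i, blocks in remaining.items() for k in targets),
            key=lambda t: t[0])
        blocks = remaining.pop(i)
        moved += sum(block_size[b] for b in blocks)
        targets[k] |= blocks        # the target's state changes after each move
        plan.append((i, k))
    return plan
```

Each pass picks the (i, k) pair with the largest saving, then refreshes the target's state before the next pass, which mirrors the update step in the text.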

As another implementation of this embodiment, the efficient data migration method for the deduplication storage system further includes migrating the files to be migrated to a single empty server.

For example, in the overloaded source server
Figure 172999DEST_PATH_IMAGE015
, there is a set of files
Figure 284175DEST_PATH_IMAGE026
with heat
Figure 232539DEST_PATH_IMAGE027
. Let
Figure 872599DEST_PATH_IMAGE028
denote the set of unique blocks obtained by chunking the files in
Figure 903485DEST_PATH_IMAGE029
. Let
Figure 552772DEST_PATH_IMAGE030
denote the size of block
Figure 152380DEST_PATH_IMAGE031
; the storage cost of server
Figure 228921DEST_PATH_IMAGE032
is then the total size of the blocks stored on it, that is,
Figure 484453DEST_PATH_IMAGE033
. Note that the
Figure 468589DEST_PATH_IMAGE034
function yields a constant value under fixed-size chunking and a non-constant value under variable-size chunking. Let
Figure 594808DEST_PATH_IMAGE035
denote the containment relation, where
Figure 638988DEST_PATH_IMAGE036
means that block
Figure 647395DEST_PATH_IMAGE037
is contained in file
Figure 635555DEST_PATH_IMAGE021
. The case where a file is replicated multiple times on one server is not considered, because it has no effect on access offloading and only affects data redundancy.
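As a small illustration of the storage cost defined above, the following Python sketch sums the sizes of the unique blocks on a server. The function name `storage_cost` and the dictionary layout are assumptions made for the example; the patent's own notation is in the formula images.

```python
from typing import Dict, Set


def storage_cost(files: Dict[str, Set[str]], block_size: Dict[str, int]) -> int:
    """Total size of the unique blocks referenced by the files on one server."""
    unique_blocks = set().union(*files.values()) if files else set()
    # A block shared by several files is stored, and therefore counted, once.
    return sum(block_size[b] for b in unique_blocks)


cost = storage_cost({"f1": {"a", "b"}, "f2": {"b", "c"}},
                    {"a": 4, "b": 8, "c": 2})
```

With variable-size chunking the values in `block_size` differ per block, as in this example; with fixed-size chunking they would all be equal.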

The initial state of the source server
Figure 147439DEST_PATH_IMAGE015
before data migration can be defined as a five-tuple
Figure 362520DEST_PATH_IMAGE038
, where
Figure 592644DEST_PATH_IMAGE039
and
Figure 121845DEST_PATH_IMAGE040
denote the space constraint and the service constraint of
Figure 753815DEST_PATH_IMAGE015
, respectively, and
Figure 343059DEST_PATH_IMAGE041
denotes the set of files stored in the source server. Starting from this initial state, after data migration each candidate migration file
Figure 455689DEST_PATH_IMAGE021
in the source server
Figure 122796DEST_PATH_IMAGE015
will be in one of the following three states:

Migrated:
Figure 939235DEST_PATH_IMAGE021
is migrated to the target server
Figure 699381DEST_PATH_IMAGE042
, and the space it occupied on the source server is released.

The following Boolean state variable
Figure 904097DEST_PATH_IMAGE043
is introduced to represent this state:

Figure 40680DEST_PATH_IMAGE044

Replicated: the source server sends a copy of
Figure 178401DEST_PATH_IMAGE021
to the target server. This usually happens with popular files, whose data access demand may exceed the capacity of the source server.

A Boolean state variable
Figure 375027DEST_PATH_IMAGE045
is introduced to represent this state:

Figure 270302DEST_PATH_IMAGE046

Unchanged:
Figure 944997DEST_PATH_IMAGE021
stays on the source server and is neither migrated nor replicated, which means
Figure 668714DEST_PATH_IMAGE047
and
Figure 36242DEST_PATH_IMAGE048
.

Based on the file states indicated by the Boolean variables above, the associated states of the blocks of each file can also be expressed mathematically. Likewise, a block can be migrated, replicated, or unchanged, and these states are again represented by Boolean variables. To represent the migration state of a block, a Boolean variable
Figure 949971DEST_PATH_IMAGE049
is defined, where:

Figure 428357DEST_PATH_IMAGE050

When
Figure 743932DEST_PATH_IMAGE051
, block
Figure 16781DEST_PATH_IMAGE052
has been migrated from the source server
Figure 683386DEST_PATH_IMAGE015
to the target server
Figure 699884DEST_PATH_IMAGE042
. This state can result from the migration of a file
Figure 132615DEST_PATH_IMAGE021
that contains the block, where
Figure 576365DEST_PATH_IMAGE053
.

A further Boolean variable
Figure 730266DEST_PATH_IMAGE054
is defined such that:

Figure 550455DEST_PATH_IMAGE055

When
Figure 637359DEST_PATH_IMAGE056
, block
Figure 252011DEST_PATH_IMAGE057
is present on both the source server and the target server. If any file that contains
Figure 627629DEST_PATH_IMAGE057
is replicated during the migration, the state of
Figure 782667DEST_PATH_IMAGE057
is marked as replicated. In addition, breaking the data-sharing dependency between two files (one migrated, the other unchanged) also causes the shared blocks to be replicated. Note that
Figure 927340DEST_PATH_IMAGE058
is also related to the extra space cost caused by file movement, which can be expressed as
Figure 733402DEST_PATH_IMAGE059
. If
Figure 596315DEST_PATH_IMAGE060
and
Figure 555044DEST_PATH_IMAGE061
, then
Figure 288645DEST_PATH_IMAGE057
remains unchanged.

The state relationship between the files in
Figure 510679DEST_PATH_IMAGE062
and their blocks can be analyzed and expressed as one of the following three cases, each illustrated with a simple example.

A file remains on the source server, that is,
Figure 64151DEST_PATH_IMAGE063
and
Figure 560992DEST_PATH_IMAGE064
, such as file
Figure 414678DEST_PATH_IMAGE065
in the initial storage state in Figure 7. Here, the 5 in
Figure 804684DEST_PATH_IMAGE066
is the heat of file
Figure 642190DEST_PATH_IMAGE065
, and
Figure 83666DEST_PATH_IMAGE067
are the blocks of file
Figure 57439DEST_PATH_IMAGE065
; the other files are written in the same way and are not described again. In this case, every block in the file is either unchanged (such as block
Figure 418013DEST_PATH_IMAGE068
) or replicated (such as block
Figure 477236DEST_PATH_IMAGE069
), depending on the block-sharing dependencies between the unchanged file and the migrated or replicated files.

A file is migrated to the target server, that is,
Figure 519141DEST_PATH_IMAGE070
, such as file
Figure 409737DEST_PATH_IMAGE071
in the initial storage state in Figure 7. In this case, every block in the file is either migrated (such as block
Figure 141545DEST_PATH_IMAGE072
) or replicated (such as block
Figure 891326DEST_PATH_IMAGE073
).

A file is replicated so that both the source server and the target server hold a copy, such as file
Figure 799239DEST_PATH_IMAGE074
in the initial storage state in Figure 7. In this case, all blocks of file
Figure 747604DEST_PATH_IMAGE074
, that is,
Figure 653243DEST_PATH_IMAGE075
, are replicated. In addition, the access demand of the file is spread across the copies in
Figure 421479DEST_PATH_IMAGE076
using the distribution parameter
Figure 805187DEST_PATH_IMAGE077
. This parameter is adjusted by a load balancer of the kind widely used in networks, which balances the load in the storage system according to the actual load of the servers. File
Figure 670374DEST_PATH_IMAGE078
is a file in the initial-storage-state container.
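The three cases above imply a simple rule for deriving block states from file states, sketched here in Python. The function `block_states` and its arguments are hypothetical names for illustration; the rule is a reading of the prose (a block is migrated only if every file containing it is migrated, and replicated if it is needed on both servers), not a transcription of the patent's formulas.

```python
from typing import Dict, Set


def block_states(file_blocks: Dict[str, Set[str]],
                 migrated: Set[str],
                 replicated: Set[str]) -> Dict[str, str]:
    """Map each block to 'migrated', 'replicated', or 'unchanged'."""
    states = {}
    for block in set().union(*file_blocks.values()):
        owners = {f for f, blocks in file_blocks.items() if block in blocks}
        on_target = owners & (migrated | replicated)  # files needing it on the target
        on_source = owners - migrated                 # files still needing it on the source
        if not on_target:
            states[block] = "unchanged"
        elif on_source:
            states[block] = "replicated"              # needed on both servers
        else:
            states[block] = "migrated"                # every owner has left the source
    return states


states = block_states(
    {"f1": {"b1", "b2"}, "f2": {"b2", "b3"}, "f3": {"b4"}},
    migrated={"f2"}, replicated={"f3"})
```

In this example `b2` is shared by an unchanged file and a migrated file, so breaking that sharing forces a copy of `b2` onto the target, which is the extra space cost the text refers to.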

Using the Boolean variables for files and blocks introduced above, the migration problem for an empty target server can be formulated as follows.

When block
Figure 746915DEST_PATH_IMAGE079
is migrated, that is,
Figure 999517DEST_PATH_IMAGE080
, all files containing that block are migrated to the target server:

Figure 186916DEST_PATH_IMAGE081
(1)

When file
Figure 906610DEST_PATH_IMAGE082
is migrated, that is,
Figure 154052DEST_PATH_IMAGE083
, all blocks it contains are either migrated or replicated:

Figure 162459DEST_PATH_IMAGE084
(2)

When file
Figure 887970DEST_PATH_IMAGE082
is replicated so that it exists on both servers
Figure 665433DEST_PATH_IMAGE085
and
Figure 83776DEST_PATH_IMAGE086
, that is,
Figure 310970DEST_PATH_IMAGE087
, all blocks it involves must also be replicated:

Figure 105751DEST_PATH_IMAGE088
(3)

The file states, that is,
Figure 206562DEST_PATH_IMAGE089
,
Figure 61386DEST_PATH_IMAGE090
, and the block states, that is,
Figure 44385DEST_PATH_IMAGE091
, are mutually exclusive:

Figure 580540DEST_PATH_IMAGE092
,
Figure 67016DEST_PATH_IMAGE093
(4)

The amount of migrated data should satisfy the predefined migration percentage M,
Figure 355390DEST_PATH_IMAGE094
:

Figure 560107DEST_PATH_IMAGE095
(5)

The final space and service overheads of the source and target servers should not exceed the corresponding capacities
Figure 634373DEST_PATH_IMAGE096

Figure 506514DEST_PATH_IMAGE097
(6)

Figure 703140DEST_PATH_IMAGE098
(7)

Figure 129574DEST_PATH_IMAGE099
(8)

Figure 273110DEST_PATH_IMAGE100
(9)

Here,
Figure 731249DEST_PATH_IMAGE101
and
Figure 567618DEST_PATH_IMAGE102
are the space constraint and the service constraint of the initial state
Figure 950189DEST_PATH_IMAGE103
, respectively.
Figure 897416DEST_PATH_IMAGE104
and
Figure 681832DEST_PATH_IMAGE105
are the space constraint and the service constraint of the target server, respectively. All state variables are Boolean:
Figure 689103DEST_PATH_IMAGE106
.

The goal of the Jingwei scheme (that is, the data migration strategies corresponding to the efficient data migration method for the deduplication storage system of this embodiment) is formalized as a graceful trade-off between space efficiency (minimizing the extra space cost) and service adaptability (maximizing the shared data demand), as follows:

Figure 352778DEST_PATH_IMAGE107
(10)

Here, the parameter
Figure 634855DEST_PATH_IMAGE108
can be tuned to weight the two objectives differently, and
Figure 539357DEST_PATH_IMAGE109
is the heat of file
Figure 186370DEST_PATH_IMAGE110
.

With Equation 10 as the migration objective and Equations 1 to 9 as constraints, the problem can be expressed as an integer linear programming (ILP) problem. ILP is known to be NP-hard, and no polynomial-time algorithm for it is known. In particular, when the variables are restricted to Boolean values (0 or 1), even deciding whether a solution exists is already NP-complete. Nevertheless, existing optimizers such as CPLEX, lp_solve, and Gurobi can efficiently solve instances of this kind with hundreds of thousands of variables, so these highly optimized solvers are used to search directly for the optimal migration plan. This scenario may not suit large storage systems, however, because a server rarely joins such a system in an empty state. In addition, the restriction that all files migrate to a single server may limit the achievable performance gain.

In this embodiment, a data migration request is received, the source server corresponding to the request is determined, and the files to be migrated in the source server are determined. The available servers are determined; then, based on the files to be migrated and a preset number of independent hash functions, the bit vector of each available server is determined, and the candidate servers for the data migration request are determined from the bit vectors. Based on the files to be migrated, the file-space overlap data and the amount of data in shared blocks with respect to each candidate server are determined, and from these the space-saving index of each candidate server is determined. Based on the space-saving index, the target migration file among the files to be migrated and the target server among the candidate servers are determined, the target migration file is migrated to the target server, and the bit vectors and space-saving indexes of the target server and the source server are updated. This reduces the extra space cost incurred during data migration and yields an efficient, adaptable data migration strategy.

Referring next to FIG. 2, it shows a flow 200 of another embodiment of the efficient data migration method for a deduplication storage system according to the present application. As shown in FIG. 2, the method of this embodiment may include the following steps:

Step S201: receive a data migration request, determine the source server corresponding to the data migration request, and then determine the files to be migrated in the source server.

Step S202: determine the available servers; then, based on the files to be migrated and a preset number of independent hash functions, determine the bit vector of each available server, and determine the candidate servers for the data migration request based on the bit vectors.

The bit vector is the data sketch of this embodiment. The time complexity of the Bloom filter (BF) based data sketch is
Figure 605850DEST_PATH_IMAGE111
, where
Figure 629301DEST_PATH_IMAGE112
is the number of hash functions used and
Figure 408817DEST_PATH_IMAGE113
is the maximum number of blocks on any candidate server. Note that sampling reduces the time complexity by a factor equal to the sampling ratio. The space complexity of the BF-based data sketch is
Figure 289048DEST_PATH_IMAGE114
, where
Figure 867928DEST_PATH_IMAGE115
is the BF length.
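A Bloom-filter data sketch of the kind described above can be illustrated as follows. This is a minimal sketch under assumptions: the class name `BloomSketch`, the use of seeded SHA-256 digests as the independent hash functions, and the parameter names `m` (bit-vector length) and `k` (number of hash functions) are choices made for the example, not details given in the source.

```python
import hashlib
from typing import Iterator, List


class BloomSketch:
    """Bit vector of length m filled by k seeded hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.bits: List[int] = [0] * m

    def _positions(self, block_id: str) -> Iterator[int]:
        # Assumption: k hash functions are simulated by prefixing a seed.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{block_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, block_id: str) -> None:
        for pos in self._positions(block_id):
            self.bits[pos] = 1

    def might_contain(self, block_id: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(block_id))


sketch = BloomSketch(m=64, k=3)
sketch.add("chunk-a")
```

Inserting n blocks costs k hash evaluations each, and the sketch occupies m bits regardless of n, which matches the roles that the number of hash functions, the block count, and the BF length play in the text.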

Step S203: based on the files to be migrated, determine the file-space overlap data and the amount of data in shared blocks with respect to each candidate server, and then determine the space-saving index of each candidate server based on the file-space overlap data and that data amount.

The time complexity of space-saving data migration is
Figure 226228DEST_PATH_IMAGE116
, where
Figure 574164DEST_PATH_IMAGE117
is the maximum number of blocks in any file. Note that after each file migration the storage states of the source and target servers are partially updated; this does not increase the overall time complexity of the algorithm. The space complexity is
Figure 890876DEST_PATH_IMAGE118
, where
Figure 219702DEST_PATH_IMAGE119
records the server sketches and
Figure 850534DEST_PATH_IMAGE120
records the space-saving indexes.

The principles of steps S201 to S203 are similar to those of steps S101 to S103 and are not repeated here.

Step S204: based on the space-saving index, determine the occupied space
Figure 318556DEST_PATH_IMAGE121
of each file to be migrated.

Step S205: determine the average data access frequency SDA(i) borne by each file to be migrated. SDA(i) is defined as the average data access frequency borne by each copy of file
Figure 806169DEST_PATH_IMAGE110
.

Step S206: determine the target migration file according to the average data access frequency SDA(i) and the occupied space
Figure 94062DEST_PATH_IMAGE122
.

Specifically, determining the target migration file according to the average data access frequency and the occupied space includes: determining the ratio of the average data access frequency to the occupied space; and, in response to determining that the ratio is greater than a preset unit heat value, determining the file to be migrated that corresponds to the ratio as the target migration file. The preset unit heat value reflects the quotient of the split data access (SDA) amount and the extra space required for a copy of the file to be migrated.

In addition, as another implementation of this embodiment, the execution body may also compute a space cost from the occupied space. Specifically, the execution body may determine the amount of data in blocks shared between the file to be migrated and each candidate server, and then determine, from the occupied space
Figure 794165DEST_PATH_IMAGE122
and the shared-block data amount
Figure 113763DEST_PATH_IMAGE123
, the space cost
Figure 772278DEST_PATH_IMAGE124
. The execution body then compares the quotient of the average data access frequency SDA(i) and the space cost
Figure 813046DEST_PATH_IMAGE124
with a preset value to decide whether the file to be migrated is a target migration file: if the quotient of SDA(i) and the space cost
Figure 785681DEST_PATH_IMAGE124
is greater than the preset value, the file to be migrated is determined to be a target migration file.

The execution body may use a heat-aware data replication algorithm to migrate the target migration file.

For example, the heat-aware data replication algorithm is shown in Algorithm 2 below:

Figure 493874DEST_PATH_IMAGE125

Algorithm 2 is explained as follows:

For any file stored in the source server,
Figure 31276DEST_PATH_IMAGE128
is obtained from the function
Figure 726885DEST_PATH_IMAGE126
,
Figure 520529DEST_PATH_IMAGE127
. The values of
Figure 390713DEST_PATH_IMAGE128
are then sorted in descending order to construct the server queue
Figure 328713DEST_PATH_IMAGE129
(lines 3-4). Servers ranked earlier hold more content similar to
Figure 344073DEST_PATH_IMAGE110
. For each server in
Figure 655581DEST_PATH_IMAGE129
, SDA(i) and the extra space cost
Figure 869525DEST_PATH_IMAGE130
are computed. If the quotient of these two values is greater than
Figure 181689DEST_PATH_IMAGE131
, file
Figure 684346DEST_PATH_IMAGE110
is replicated to
Figure 536895DEST_PATH_IMAGE132
, that is, the k-th server in
Figure 867995DEST_PATH_IMAGE129
(lines 5-9). After the replication locations of each file
Figure 413377DEST_PATH_IMAGE110
are determined, the function Heat Allocation
Figure 544275DEST_PATH_IMAGE133
(lines 12-17) is used to adjust the amount of data access assigned to each server in
Figure 934936DEST_PATH_IMAGE134
. This function takes over the role of the load scheduler, which balances the service load according to the available capacity of the servers involved.
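The replication decision explained above can be sketched as follows. This is an illustration under assumptions rather than Algorithm 2 itself: exact block sets replace the sketches, SDA is taken as the file's heat divided by its current number of copies, the extra space cost is the file size minus the bytes already shared with the server, and a zero extra cost is treated as always worth replicating. All names (`replicate_hot_file`, `unit_heat`, and so on) are hypothetical.

```python
from typing import Dict, List, Set


def replicate_hot_file(file_blocks: Set[str], heat: float, copies: int,
                       servers: Dict[str, Set[str]],
                       block_size: Dict[str, int],
                       unit_heat: float) -> List[str]:
    """Return the servers that should receive a replica of the file."""
    size = sum(block_size[b] for b in file_blocks)
    shared = {k: sum(block_size[b] for b in file_blocks & blocks)
              for k, blocks in servers.items()}
    # Server queue: candidates sharing the most data with the file come first.
    queue = sorted(servers, key=lambda k: shared[k], reverse=True)
    chosen: List[str] = []
    for k in queue:
        sda = heat / (copies + len(chosen))   # assumed: access load per existing copy
        extra = size - shared[k]              # bytes that must actually be written
        if extra == 0 or sda / extra > unit_heat:
            chosen.append(k)
    return chosen


picked = replicate_hot_file({"a", "b", "c"}, heat=12, copies=1,
                            servers={"d1": {"a", "b"}, "d2": set()},
                            block_size={"a": 1, "b": 1, "c": 1},
                            unit_heat=5)
```

Because each accepted replica lowers the per-copy access load, later servers in the queue face a stricter test, which keeps the number of replicas bounded by the file's heat.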

Step S207: determine the target server among the candidate servers, migrate the target migration file to the target server, and update the bit vectors and space-saving indexes of the target server and the source server.

After the target migration file and the target server are determined, the next step is to adjust the migration plan using file heat. Overheated files should have multiple copies in the system for fault tolerance and a better user experience. This embodiment provides a unit heat value (
Figure 792646DEST_PATH_IMAGE135
) to indicate whether a file needs to be replicated. The unit heat value (
Figure 508929DEST_PATH_IMAGE135
) reflects the quotient of the split data access (SDA) amount and the extra space required for the copy. SDA(i) is defined as the average data access frequency borne by each copy of file
Figure 923861DEST_PATH_IMAGE110
. A replication is performed only if its quotient is greater than this unit heat value, which ensures that copies of hot files are added at a limited extra space cost.

The time complexity of the heat-aware data replication algorithm is
Figure 118213DEST_PATH_IMAGE136
. Here,
Figure 564851DEST_PATH_IMAGE137
is the cost of sorting the target servers by the amount of data they share with a given file
Figure 452035DEST_PATH_IMAGE110
. The
Figure 416580DEST_PATH_IMAGE138
and
Figure 945782DEST_PATH_IMAGE139
terms arise from the maximum migration time and the maximum replication time of each file in
Figure 108910DEST_PATH_IMAGE140
. The space complexity is
Figure 166996DEST_PATH_IMAGE141
.

Specifically, as another implementation, determining the target server among the candidate servers includes:

Determine the similarity between the target migration file and each candidate server. Specifically, the execution body may determine, for each block in the target migration file, its
Figure 842607DEST_PATH_IMAGE142
positions in the bit vector of the source server and the values at those positions, and then compute the similarity against the values at the same
Figure 644341DEST_PATH_IMAGE142
positions in the bit vector of each candidate server. Concretely, the values at the
Figure 865238DEST_PATH_IMAGE142
positions of the blocks of the target migration file are converted into a vector, and the cosine similarity with the values at the corresponding positions of the candidate server is computed, which yields the similarity between the blocks of the target migration file and each candidate server. Each candidate server whose similarity exceeds a preset similarity threshold is determined to be a target server.
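The similarity check described above can be sketched as follows. This is a minimal illustration, assuming the hashed bit positions of the file's blocks are already known; the helper names `cosine_similarity` and `pick_targets` and the example threshold are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[int], v: Sequence[int]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


def pick_targets(positions: List[int], source_bits: List[int],
                 candidates: Dict[str, List[int]], threshold: float) -> List[str]:
    """Keep candidates whose bits at the file's positions resemble the source's."""
    source_vec = [source_bits[p] for p in positions]
    return [name for name, bits in candidates.items()
            if cosine_similarity(source_vec, [bits[p] for p in positions]) > threshold]


targets = pick_targets(
    positions=[0, 2, 3],
    source_bits=[1, 0, 1, 1],
    candidates={"d1": [1, 0, 1, 1], "d2": [0, 1, 0, 0], "d3": [1, 0, 0, 0]},
    threshold=0.8)
```

A candidate with all of the file's positions set scores close to 1, while one with none of them set scores 0, so the threshold separates servers that probably already hold most of the file's blocks from those that do not.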

Specifically, as a further implementation, determining the target server among the candidate servers includes:

Determine the target occupied space of the target migration file; determine the target data amount of the blocks shared between the target migration file and each candidate server; determine the average data access frequency borne by each copy of the target migration file; determine the difference between the target occupied space and each target data amount, and then determine the ratio of the average data access frequency to each difference; and, in response to determining that a ratio is greater than a preset value, determine the candidate server corresponding to that ratio as a target server. There may be multiple target servers. By migrating the target migration file to multiple servers, a reasonable general migration strategy can make maximal use of the shared blocks on each target server to reconstruct the target migration file, potentially reducing the total space cost.

This embodiment uses a space-saving migration algorithm to select migration files and their migration targets so that the extra space cost is low. On this basis, a heat-aware data replication algorithm is proposed that replicates hot files under a limited extra space overhead, which makes the service adaptive. With similarity-aware file allocation, the space saved grows gradually as the migration percentage increases: when 45% of the data is migrated, about 41% of the occupied space can be released from the source server. The space reduction is achieved by detecting the similarity between the target migration file and the candidate servers.

Jingwei (shown in FIG. 6) is an efficient and adaptable data migration strategy, that is, the efficient data migration method for a deduplication storage system described in the embodiments of this application. It migrates and replicates files to suitable servers, which helps improve the space efficiency and service adaptability of a deduplication storage system. First, for the case where only one empty migration target is allowed, a migration strategy based on integer linear programming (ILP) is designed. The problem is then extended to a more general scenario in which multiple non-empty servers are available for migration, and an efficient heuristic based on Bloom filters is used to solve this general migration problem. Trace-driven experiments in different scenarios show that the embodiments of this application can significantly reduce the extra space cost of migration while increasing the number of hot-file replicas. In summary, Jingwei implements an efficient and adaptable migration strategy that builds a large number of RH within limited additional storage space. Specifically, Jingwei generates replicas for four hot files (66.7%) while increasing space utilization by only 5.7% compared with Goseed. In addition, Jingwei achieves an RS-ratio of [value given as a formula image in the original, not reproduced here], saving about 3.5 times the storage space compared with the SARA method.

Continuing to refer to FIG. 4, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an efficient data migration apparatus for a deduplication storage system. This apparatus embodiment corresponds to the method embodiment shown in FIG. 1, and the apparatus can be applied in various electronic devices.

As shown in FIG. 4, the efficient data migration apparatus 400 for a deduplication storage system in this embodiment includes a receiving unit 401, a candidate server determining unit 402, an index determining unit 403 and a migration unit 404.

The receiving unit 401 is configured to receive a data migration request, determine the source server corresponding to the data migration request, and then determine the files to be migrated in the source server.

The candidate server determining unit 402 is configured to determine each available server, determine the bit vector corresponding to each available server based on the files to be migrated and a preset number of independent hash functions, and determine each candidate server corresponding to the data migration request based on the bit vectors.

The index determining unit 403 is configured to determine, based on the files to be migrated, the file space overlapping data and the data volume of the shared partition blocks with respect to each candidate server, and then determine the space-saving index corresponding to each candidate server based on the file space overlapping data and the data volume.

The migration unit 404 is configured to determine, based on the space-saving index, a target migration file among the files to be migrated and a target server among the candidate servers, so as to migrate the target migration file to the target server, and to update the bit vectors and space-saving indexes corresponding to the target server and the source server.
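The bit vectors built from a preset number of independent hash functions correspond to the Bloom filters mentioned earlier in the description. The following is a minimal sketch of that idea, not the patented implementation: the class and function names, the use of salted SHA-256 to stand in for independent hash functions, the filter size, and the `min_hits` cutoff are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit vector populated through k salted hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, fingerprint):
        # Assumption: k "independent" hashes are simulated by prefixing the
        # block fingerprint with a one-byte salt before hashing.
        return [
            int.from_bytes(
                hashlib.sha256(bytes([i]) + fingerprint).digest()[:8], "big"
            ) % self.num_bits
            for i in range(self.num_hashes)
        ]

    def add(self, fingerprint):
        for pos in self._positions(fingerprint):
            self.bits[pos] = 1

    def might_contain(self, fingerprint):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(fingerprint))


def candidate_servers(file_blocks, server_filters, min_hits=1):
    """Return servers whose bit vector reports at least min_hits file blocks."""
    result = []
    for name, bloom in server_filters.items():
        hits = sum(1 for fp in file_blocks if bloom.might_contain(fp))
        if hits >= min_hits:
            result.append(name)
    return result
```

Because a Bloom filter can report a block that is not actually stored, a real system would presumably confirm the shared blocks before computing the space-saving index; that step is omitted here.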

In some embodiments, the migration unit 404 is further configured to: determine the occupied space corresponding to each file to be migrated based on the space-saving index; determine the average data access frequency borne by each file to be migrated; and determine the target migration file according to the average data access frequency and the occupied space.

In some embodiments, the migration unit 404 is further configured to: determine the ratio of the average data access frequency to the occupied space; and, in response to determining that the ratio is greater than a preset unit heat value, determine that the file to be migrated corresponding to the ratio is the target migration file.
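The unit-heat test in this embodiment amounts to a simple filter over the files to be migrated. A minimal sketch follows; the `FileStats` record, its field names and the function name are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    name: str
    occupied_space: float    # space the file would occupy after migration
    avg_access_freq: float   # average data access frequency borne by the file


def select_target_files(files, unit_heat_threshold):
    """Return names of files whose heat per unit of space exceeds the threshold."""
    selected = []
    for f in files:
        if f.occupied_space <= 0:
            continue  # assumption: skip files with no measurable footprint
        if f.avg_access_freq / f.occupied_space > unit_heat_threshold:
            selected.append(f.name)
    return selected
```

With a threshold of 1, a file accessed 50 times and occupying 10 units (ratio 5) would be selected under this sketch, while a file with the same access frequency occupying 100 units (ratio 0.5) would not.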

In some embodiments, the migration unit 404 is further configured to: determine the similarity between the target migration file and each candidate server; and determine each candidate server whose similarity is greater than a preset similarity threshold as a target server.
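This passage does not define how similarity is measured. As one possible reading, the sketch below assumes similarity is the fraction of the file's block fingerprints that the server already stores; that definition and all names in the code are assumptions, not taken from the patent.

```python
def block_similarity(file_blocks, server_blocks):
    """Fraction of the file's distinct blocks already present on the server."""
    file_set = set(file_blocks)
    if not file_set:
        return 0.0
    return len(file_set & set(server_blocks)) / len(file_set)


def servers_above_threshold(file_blocks, blocks_by_server, threshold):
    """Return servers whose similarity to the file exceeds the preset threshold."""
    return [
        server
        for server, blocks in blocks_by_server.items()
        if block_similarity(file_blocks, blocks) > threshold
    ]
```

Other measures, such as Jaccard similarity over the two block sets, would fit the same threshold test; the containment ratio is used here because it relates directly to how much of the file would not need to be written again.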

In some embodiments, the migration unit 404 is further configured to: determine the target occupied space of the target migration file; determine the target data volume of the shared partition blocks between the target migration file and each candidate server; determine the average data access frequency borne by each copy of the target migration file; determine the difference between the target occupied space and each target data volume, and then the ratio of the average data access frequency to each difference; and, in response to determining that a ratio is greater than a preset value, determine that the candidate server corresponding to that ratio is a target server.

The technical carriers involved in the payment described in the embodiments of this specification may include, for example, Near Field Communication (NFC), Wi-Fi, 3G/4G/5G, POS card-swiping technology, QR code scanning, barcode scanning, Bluetooth, infrared, Short Message Service (SMS), Multimedia Message Service (MMS), and the like.

It should be noted that the methods of one or more embodiments of this specification may be executed by a single device, such as a computer or a server. The methods of this embodiment can also be applied in a distributed scenario and completed by multiple devices cooperating with one another. In such a distributed scenario, one of the devices may execute only one or more steps of the methods of one or more embodiments of this specification, and the devices interact with one another to complete the method.

The foregoing describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the figures do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

For convenience of description, the above apparatus is described as divided into modules by function. Of course, when implementing one or more embodiments of this specification, the functions of the modules may be implemented in the same one or more pieces of software and/or hardware.

The apparatus of the above embodiment is used to implement the corresponding method in the foregoing embodiments and has the beneficial effects of the corresponding method embodiment, which are not repeated here.

FIG. 5 shows a more specific schematic diagram of the hardware structure of an electronic device provided in this embodiment. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040 and a bus 1050. The processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040 are communicatively connected to one another inside the device through the bus 1050.

The processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is used to execute the relevant programs so as to implement the technical solutions provided by the embodiments of this specification.

The memory 1020 may be implemented as ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 1020 and is invoked and executed by the processor 1010.

The input/output interface 1030 is used to connect an input/output module for information input and output. The input/output module may be configured in the device as a component (not shown in the figure) or may be externally connected to the device to provide the corresponding functions. Input devices may include a keyboard, a mouse, a touch screen, a microphone and various sensors; output devices may include a display, a speaker, a vibrator and indicator lights.

The communication interface 1040 is used to connect a communication module (not shown in the figure) so that the device can communicate and interact with other devices. The communication module may communicate in a wired manner (for example USB or a network cable) or in a wireless manner (for example a mobile network, Wi-Fi or Bluetooth).

The bus 1050 includes a path that transfers information between the components of the device (for example the processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040).

It should be noted that although only the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050 are shown for the above device, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may include only the components necessary to implement the solutions of the embodiments of this specification, rather than all the components shown in the figure.

The computer-readable media of this embodiment include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device.

Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is exemplary only and is not intended to imply that the scope of the present application (including the claims) is limited to these examples. Within the idea of the present application, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be carried out in any order, and there are many other variations of the different aspects of one or more embodiments of this specification as described above, which are not provided in detail for the sake of brevity.

In addition, to simplify illustration and discussion, and so as not to obscure one or more embodiments of this specification, well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided figures. Furthermore, apparatuses may be shown in block diagram form to avoid obscuring one or more embodiments of this specification, which also takes into account the fact that the details of implementing such block-diagram apparatuses depend heavily on the platform on which the one or more embodiments are to be implemented (that is, these details should be well within the understanding of those skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of the present application, it will be apparent to those skilled in the art that one or more embodiments of this specification can be practiced without these specific details or with variations of them. Accordingly, these descriptions are to be regarded as illustrative rather than restrictive.

Although the present application has been described in conjunction with specific embodiments, many alternatives, modifications and variations of these embodiments will be apparent to those of ordinary skill in the art from the foregoing description. For example, other memory architectures (for example, dynamic RAM (DRAM)) may use the embodiments discussed.

One or more embodiments of this specification are intended to cover all such alternatives, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement or improvement made within the spirit and principles of one or more embodiments of this specification shall be included within the scope of protection of the present application.

Claims (10)

1. An efficient data migration method for a deduplication storage system, comprising:
receiving a data migration request, determining a source server corresponding to the data migration request, and further determining a file to be migrated in the source server;
determining each available server, further determining a bit vector corresponding to each available server based on the file to be migrated and a preset number of independent hash functions, and determining each candidate server corresponding to the data migration request based on the bit vector;
determining file space overlapping data and data volume of a shared partition block of each candidate server based on the file to be migrated, determining deviation of the file space overlapping data and the data volume, and further determining a space-saving index corresponding to each candidate server based on the deviation, wherein the space-saving index is the amount of storage resources saved when the file is migrated to a target server;
and determining a target migration file in the files to be migrated and a target server in each candidate server based on the space-saving index so as to migrate the target migration file to the target server and update the bit vectors and the space-saving index corresponding to the target server and the source server.
2. The method according to claim 1, wherein the determining a target migration file in the files to be migrated comprises:
determining the occupied space corresponding to each file to be migrated based on the space-saving index;
determining the average data access frequency borne by each file to be migrated;
and determining a target migration file according to the average data access frequency and the occupied space.
3. The method of claim 2, wherein determining a target migration file according to the average data access frequency and the occupied space comprises:
determining a ratio of the average data access frequency to the occupied space;
and in response to determining that the ratio is greater than a preset unit heat value, determining that the file to be migrated corresponding to the ratio is the target migration file.
4. The method of claim 2, wherein determining the target server of the candidate servers comprises:
determining the similarity between the target migration file and each candidate server;
and determining the candidate server corresponding to the similarity greater than the preset similarity threshold as the target server.
5. The method of claim 2, wherein determining the target server of the candidate servers comprises:
determining a target occupation space of the target migration file;
determining the target data volume of the shared partition blocks of the target migration file and each candidate server;
determining an average data access frequency assumed by each copy of the target migration file;
determining a difference value between the target occupied space and each target data quantity, and further determining a ratio of the average data access frequency to each difference value;
and in response to determining that the ratio is greater than a preset value, determining that the candidate server corresponding to the ratio is the target server.
6. An efficient data migration apparatus for a deduplication storage system, comprising:
a receiving unit configured to receive a data migration request, determine a source server corresponding to the data migration request, and further determine a file to be migrated in the source server;
a candidate server determining unit configured to determine each available server, determine a bit vector corresponding to each available server based on the file to be migrated and a preset number of independent hash functions, and determine each candidate server corresponding to the data migration request based on the bit vector;
an index determining unit configured to determine file space overlapping data and data volume of a shared partition block of each candidate server based on the file to be migrated, determine deviation of the file space overlapping data and the data volume, and further determine a space-saving index corresponding to each candidate server based on the deviation, wherein the space-saving index is the amount of storage resources saved when the file is migrated to a target server;
and a migration unit configured to determine a target migration file in the files to be migrated and a target server in each candidate server based on the space-saving index, so as to migrate the target migration file to the target server, and update the bit vectors and the space-saving index corresponding to the target server and the source server.
7. The apparatus of claim 6, wherein the migration unit is further configured to:
determining the occupied space corresponding to each file to be migrated based on the space-saving index;
determining the average data access frequency borne by each file to be migrated;
and determining a target migration file according to the average data access frequency and the occupied space.
8. The apparatus of claim 7, wherein the migration unit is further configured to:
determining a ratio of the average data access frequency to the occupied space;
and in response to determining that the ratio is greater than a preset unit heat value, determining that the file to be migrated corresponding to the ratio is the target migration file.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 5 when executing the program.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111156367.9A CN113590535B (en) 2021-09-30 2021-09-30 An efficient data migration method and device for deduplication storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111156367.9A CN113590535B (en) 2021-09-30 2021-09-30 An efficient data migration method and device for deduplication storage system

Publications (2)

Publication Number Publication Date
CN113590535A CN113590535A (en) 2021-11-02
CN113590535B true CN113590535B (en) 2021-12-17

Family

ID=78242552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111156367.9A Active CN113590535B (en) 2021-09-30 2021-09-30 An efficient data migration method and device for deduplication storage system

Country Status (1)

Country Link
CN (1) CN113590535B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114490581A (en) * 2022-01-27 2022-05-13 中国农业银行股份有限公司 Heterogeneous database migration and data comparison method, device, equipment and storage medium
US12007968B2 (en) 2022-05-26 2024-06-11 International Business Machines Corporation Full allocation volume to deduplication volume migration in a storage system
CN117376403B (en) * 2023-10-08 2024-05-14 上海知享家信息技术服务有限公司 Cloud data migration method and system

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279502A (en) * 2013-05-06 2013-09-04 北京赛思信安技术有限公司 Framework and method of repeated data deleting file system combined with parallel file system
CN103631933A (en) * 2013-12-06 2014-03-12 中国科学院计算技术研究所 Distributed duplication elimination system-oriented data routing method
CN103902735A (en) * 2014-04-18 2014-07-02 中国人民解放军理工大学 Application perception data routing method oriented to large-scale cluster deduplication and system
CN103959254A (en) * 2011-11-30 2014-07-30 国际商业机器公司 Optimizing migration/copy of de-duplicated data
US8898120B1 (en) * 2011-10-09 2014-11-25 Symantec Corporation Systems and methods for distributed data deduplication
US8943032B1 (en) * 2011-09-30 2015-01-27 Emc Corporation System and method for data migration using hybrid modes
CN104932841A (en) * 2015-06-17 2015-09-23 南京邮电大学 Saving type duplicated data deleting method in cloud storage system
CN105897921A (en) * 2016-05-27 2016-08-24 重庆大学 Data block routing method combining fingerprint sampling and reducing data fragments
CN105930545A (en) * 2016-06-29 2016-09-07 浙江宇视科技有限公司 Method and device for migrating files
US10037337B1 (en) * 2015-09-14 2018-07-31 Cohesity, Inc. Global deduplication
CN111966649A (en) * 2020-10-21 2020-11-20 中国人民解放军国防科技大学 Lightweight online file storage method and device capable of efficiently removing weight

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100332401A1 (en) * 2009-06-30 2010-12-30 Anand Prahlad Performing data storage operations with a cloud storage environment, including automatically selecting among multiple cloud storage sites
US9122527B2 (en) * 2012-08-21 2015-09-01 International Business Machines Corporation Resource allocation for migration within a multi-tiered system
US20160132523A1 (en) * 2014-11-12 2016-05-12 Strato Scale Ltd. Exploiting node-local deduplication in distributed storage system
US20180329646A1 (en) * 2017-05-12 2018-11-15 International Business Machines Corporation Distributed storage system virtual and storage data migration
US10846129B2 (en) * 2018-11-30 2020-11-24 Nutanix, Inc. Systems and methods for organizing on-demand migration from private cluster to public cloud

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"分布式重复数据删除系统中路由方法的研究";王奏鸣;《万方学位论文》;20180416;全文 *
"面向备份系统的重复数据删除关键技术研究";孙振;《万方学位论文》;20201231;全文 *

Also Published As

Publication number Publication date
CN113590535A (en) 2021-11-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant