
CN104239493A - Cross-cluster data migration method and system - Google Patents


Info

Publication number
CN104239493A
Authority
CN
China
Prior art keywords: cluster, data, tables, target cluster, source
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410455695.2A
Other languages
Chinese (zh)
Other versions
CN104239493B (en)
Inventor
黄刚
何洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201410455695.2A (patent CN104239493B)
Publication of CN104239493A
Application granted
Publication of CN104239493B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2453: Query optimisation
    • G06F16/24532: Query optimisation of parallel queries
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/252: Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a cross-cluster data migration method and system. Stopping data operations on every child node of the source cluster and persisting the in-memory data of the source cluster's distributed database ensures that the data in that database is persisted before migration. Compressing the data tables in the source cluster's distributed database reduces the amount of data transferred, so migrating the compressed tables to the target cluster improves migration efficiency. Finally, the storage space and total number of file blocks occupied by the data tables in the source cluster before migration are matched against the storage space and total number of file blocks occupied by the data tables in the target cluster after migration, and the integrity of the migration is verified from the matching result.

Description

Cross-cluster data migration method and system

Technical field

The embodiments of the present invention relate to the technical field of databases, and in particular to a cross-cluster data migration method and system.

Background

With the development of Internet applications and the surge in user numbers, the volume of stored data has grown exponentially. Traditional single-database storage cannot meet the access requirements of massive data, and HDFS (Hadoop Distributed File System) and distributed databases emerged in response.

HBase (Hadoop Database) is a scalable, column-oriented distributed database that uses HDFS as its file storage system and stores data in the form of data tables. On commodity hardware it can support large data tables with billions of rows and millions of columns, and it supports random reads and writes on data of that scale. It is widely used because of its high reliability, high scalability, support for random access, and support for MapReduce parallel computing. Hadoop is a distributed system infrastructure developed by the Apache foundation. It lets users develop distributed programs without knowing the underlying details of distribution, and use the full power of a cluster for high-speed computation and for storing and accessing massive data.

In practice, data migration is unavoidable. In particular, when an online HBase cluster has to be taken offline, or a machine room is being maintained or relocated, operators face the urgent task of migrating massive amounts of data: the data tables of the old cluster must be moved to a new cluster, which then continues to provide massive data access services to the business parties that use it.

Existing data migration techniques usually use Hadoop's data copy component to perform a distributed copy, and in this way migrate the data tables of one cluster to a new cluster. When the copy is complete, the service processes of the new cluster are started.

This approach has two defects. First, it cannot guarantee the integrity of the data after migration. Second, the migration time depends strictly on the volume of data being migrated, which makes it hard to control: if network bandwidth between the clusters is limited and the data volume is large, it is difficult to finish the migration within a short migration window. In other words, migration efficiency is low.

Summary of the invention

Embodiments of the present invention provide a cross-cluster data migration method and system that ensure the integrity and efficiency of cross-cluster data migration.

In a first aspect, an embodiment of the present invention provides a cross-cluster data migration method, including:

the master control node of the source cluster invokes a stop command to make each child node of the source cluster stop data operations;

the master control node of the source cluster uses the buffer-flush component of the source cluster's distributed database to persist the data held in that database's memory to the distributed file system HDFS;

the master control node of the source cluster compresses the data tables contained in the source cluster's distributed database using a configured compression algorithm;

the master control node of the source cluster counts the first storage space size and the first total number of file blocks that the data tables in the source cluster's distributed database occupy in HDFS;

the master control node of the source cluster migrates the data tables in the source cluster's distributed database to the distributed database of the target cluster, based on a previously obtained mapping between the IP addresses and host names of the nodes in the target cluster;

if a data-migration-complete message returned by the web management interface of the source cluster's MapReduce process is obtained, the master control node of the target cluster counts the second storage space size and the second total number of file blocks that the data tables in the target cluster's distributed database occupy in the corresponding HDFS, and matches them against the first storage space size and the first total number of file blocks;

if the match succeeds, the master control node of the target cluster decompresses the data tables migrated to the target cluster, using the decompression algorithm that corresponds to the configured compression algorithm;

the master control node of the target cluster starts the target cluster based on a startup policy.

In a second aspect, an embodiment of the present invention further provides a cross-cluster data migration system comprising a source cluster and a target cluster, where the source cluster includes a master control node and at least one child node, and the target cluster includes a master control node and at least one child node;

The master control node of the source cluster includes:

a stop module, configured to invoke a stop command to make each child node of the source cluster stop data operations;

a persistence module, configured to use the buffer-flush component of the source cluster's distributed database to persist the data held in that database's memory to the distributed file system HDFS;

a compression module, configured to compress the data tables contained in the source cluster's distributed database using a configured compression algorithm;

a statistics module, configured to count the first storage space size and the first total number of file blocks that the data tables in the source cluster's distributed database occupy in HDFS;

a migration module, configured to migrate the data tables in the source cluster's distributed database to the distributed database of the target cluster, based on a previously obtained mapping between the IP addresses and host names of the nodes in the target cluster;

The master control node of the target cluster includes:

a statistics module, configured to, if a data-migration-complete message returned by the web management interface of the source cluster's MapReduce process is obtained, count the second storage space size and the second total number of file blocks that the data tables in the target cluster's distributed database occupy in the corresponding HDFS, and match them against the first storage space size and the first total number of file blocks;

a decompression module, configured to, if the match succeeds, decompress the data tables migrated to the target cluster, using the decompression algorithm that corresponds to the configured compression algorithm;

a startup module, configured to start the target cluster based on a startup policy.

In the cross-cluster data migration method and system provided by the embodiments of the present invention, stopping data operations on each child node of the source cluster and persisting the in-memory data of the source cluster's distributed database ensures that the data in that database is persisted before migration. Compressing the data tables in the source cluster's distributed database reduces the amount of data transferred, so migrating the compressed tables to the target cluster improves migration efficiency. The storage space size and total number of file blocks occupied by the data tables in the source cluster before migration are then matched against the storage space size and total number of file blocks occupied by the data tables in the target cluster after migration, and the integrity of the migration is verified from the matching result.

Brief description of the drawings

To describe the present invention more clearly, the drawings it uses are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of a cross-cluster data migration method provided by Embodiment 1 of the present invention;

FIG. 2 is a flowchart of a cross-cluster data migration method provided by Embodiment 3 of the present invention;

FIG. 3a is a schematic structural diagram of the master control node of the source cluster in a cross-cluster data migration system provided by Embodiment 4 of the present invention;

FIG. 3b is a schematic structural diagram of the master control node of the target cluster in a cross-cluster data migration system provided by Embodiment 4 of the present invention.

Detailed description

To make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments are described in further detail below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. The specific embodiments described here serve only to explain the invention and do not limit it; all other embodiments that a person of ordinary skill in the art obtains from them without creative effort fall within the protection scope of the present invention. For ease of description, the drawings show only the parts relevant to the present invention rather than the entire content.

Embodiment 1

Refer to FIG. 1, a flowchart of a cross-cluster data migration method provided by Embodiment 1 of the present invention. The method applies to a cross-cluster data migration system comprising a source cluster and a target cluster, each of which includes a master control node and at least one child node. The master control node and child nodes of the source cluster form an HDFS, and the source cluster stores the data tables to be migrated. The master control node and child nodes of the target cluster can likewise form an HDFS, which receives and stores the data tables migrated from the source cluster.

The method includes:

Step 110. The master control node of the source cluster invokes a stop command to make each child node of the source cluster stop data operations;

This step stops data operations on every child node of the source cluster so that the data on each node is persisted before migration. Specifically, the business party that uses each child node can first be notified to stop writing or reading data, after which the stop command is invoked to stop data operations on the child nodes. Alternatively, the stop command can be invoked directly.

Step 120. The master control node of the source cluster uses the buffer-flush component of the source cluster's distributed database to persist the data held in that database's memory to HDFS;

This step persists the data in the source cluster's distributed database.

The buffer-flush component writes data that is temporarily held in the memory of the distributed database to disk in HDFS.

Step 130. The master control node of the source cluster compresses the data tables contained in the source cluster's distributed database using a configured compression algorithm;

This step compresses the data tables to be migrated in the source cluster. Specifically, the compression status of each data table can be checked and any table that is not yet compressed can be compressed, using the LZO (Lempel-Ziv-Oberhumer) compression algorithm, the SNAPPY compression algorithm, or another compression algorithm.

LZO is a compression algorithm with a high compression ratio and very fast decompression. It is lossless, meaning the compressed data can be restored exactly. SNAPPY is a compression and decompression library designed to provide high compression speed with a reasonable compression ratio.

The distributed database uses HDFS as its file storage system and stores data in the form of data tables, and on commodity hardware it can support large tables with billions of rows and millions of columns. Compressing the data tables to be migrated therefore substantially reduces the volume of data that has to be transferred and helps improve migration efficiency.
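The effect of lossless compression on transfer volume can be illustrated with a minimal sketch. LZO and Snappy bindings are not part of the Python standard library, so `zlib` is swapped in here purely for illustration; the sample rows and the function names are hypothetical and do not come from the patent.

```python
import zlib


def compress_table(raw: bytes, level: int = 6) -> bytes:
    """Losslessly compress the serialized bytes of a table (zlib stands in for LZO/Snappy)."""
    return zlib.compress(raw, level)


def decompress_table(packed: bytes) -> bytes:
    """Restore the exact original bytes from the compressed form."""
    return zlib.decompress(packed)


# Hypothetical, highly repetitive table content.
rows = b"".join(b"row-%06d|promo|2014-09-09\n" % i for i in range(1000))
packed = compress_table(rows)
restored = decompress_table(packed)

print(len(rows), len(packed))  # original size versus compressed size in bytes
```

Because the round trip is exact, the target cluster can decompress after the transfer and recover the same table bytes; how much is saved depends on the data and on the algorithm chosen.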

Step 140. The master control node of the source cluster counts the first storage space size and the first total number of file blocks that the data tables in the source cluster's distributed database occupy in HDFS;

In this step, the data tables contained in the source cluster's distributed database are the tables to be migrated, and they can be stored in the root directory of that database. Because the source cluster's distributed database uses HDFS as its file storage system, its disk storage space is provided by HDFS and is the sum of the disk storage space of all child nodes of the source cluster. The space that the data tables occupy within that HDFS disk storage space is the first storage space.

Because the tables to be migrated hold a very large amount of data, they are stored in a distributed manner: each table is split into multiple file blocks, and different file blocks are stored on the disks of different child nodes of the source cluster. The first total number of file blocks is the sum of the numbers of file blocks of all the tables to be migrated.
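The two statistics from step 140 can be sketched as simple arithmetic over file sizes. This is an illustrative model only: it assumes each file is split into fixed-size blocks with a possibly shorter last block, and the default block size, the file sizes, and the names `TableStats` and `summarize` are hypothetical. A real deployment would read these numbers from the file system rather than compute them.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TableStats:
    total_bytes: int   # storage space occupied by the tables
    total_blocks: int  # total number of file blocks


def summarize(file_sizes: Iterable[int], block_size: int = 128 * 1024 * 1024) -> TableStats:
    """Sum file sizes and count blocks, assuming fixed-size blocks per file."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    sizes = list(file_sizes)
    # Integer ceiling division: a 300-byte file with 128-byte blocks needs 3 blocks.
    blocks = sum((size + block_size - 1) // block_size for size in sizes)
    return TableStats(total_bytes=sum(sizes), total_blocks=blocks)


source_stats = summarize([300, 100, 0], block_size=128)
print(source_stats)
```

An empty file contributes no blocks in this model, which is why both the byte total and the block total are recorded: each catches some discrepancies the other would miss.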

Step 150. The master control node of the source cluster migrates the data tables in the source cluster's distributed database to the distributed database of the target cluster, based on a previously obtained mapping between the IP addresses and host names of the nodes in the target cluster;

This step migrates the data tables from the source cluster's distributed database to the target cluster's distributed database.

Note that step 110 stops only the data access service provided by the source cluster's distributed database; the HDFS service of the source cluster keeps running normally.

Note also that the mapping between the IP addresses and host names of the target cluster's nodes lets the master control node of the source cluster locate the target cluster's HDFS, so the data tables can be migrated from the source cluster's distributed database into the target cluster's HDFS. Because the target cluster's distributed database uses that HDFS as its file storage system, the tables are thereby migrated into the target cluster's distributed database.
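A minimal sketch of how the host mapping and the copy command might fit together follows. The patent does not name the tool used in step 150; this sketch assumes Hadoop's DistCp, whose basic form is `hadoop distcp <source URI> <target URI>`. The hosts-file format, the host names, the IP addresses, port 8020, and the `/hbase` path are all hypothetical. The code only builds the argument list and does not run anything.

```python
from typing import Dict, List


def parse_host_map(text: str) -> Dict[str, str]:
    """Parse hosts-file style lines ("IP name [alias...]") into a name -> IP mapping."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name] = ip
    return mapping


def build_distcp_argv(src_host: str, dst_host: str, path: str, port: int = 8020) -> List[str]:
    """Build (not execute) a DistCp command line copying `path` between two HDFS instances."""
    return [
        "hadoop", "distcp",
        f"hdfs://{src_host}:{port}{path}",
        f"hdfs://{dst_host}:{port}{path}",
    ]


target_hosts = parse_host_map(
    "10.0.1.10 target-master   # hypothetical target master control node\n"
    "10.0.1.11 target-node1\n"
)
argv = build_distcp_argv("source-master", "target-master", "/hbase")
print(target_hosts["target-master"], " ".join(argv))
```

The mapping matters because the copy job addresses the target by host name; every source node that takes part in the copy must be able to resolve those names to the target cluster's IP addresses.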

Step 160. If a data-migration-complete message returned by the web management interface of the source cluster's MapReduce process is obtained, the master control node of the target cluster counts the second storage space size and the second total number of file blocks that the data tables in the target cluster's distributed database occupy in the corresponding HDFS, and matches them against the first storage space size and the first total number of file blocks;

In this step, once the migration of the data tables is detected to be complete, the master control node first counts the second storage space size and the second total number of file blocks that the data tables of the target cluster's distributed database occupy in the corresponding HDFS. It then compares the second storage space size with the first storage space size, and the second total number of file blocks with the first total number of file blocks.

The second storage space size and the second total number of file blocks are defined in the same way as the first storage space size and the first total number of file blocks, so the definitions are not repeated here.

MapReduce is a general programming model for distributed parallel computing tasks and is used for parallel processing of large-scale data. The web management interface of the MapReduce process can be used to monitor the migration in detail, for example the real-time migration speed, the percentage migrated, the estimated remaining time, and descriptive information about the data already migrated.
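The progress figures mentioned above (speed, percentage, estimated remaining time) are related by simple arithmetic, sketched below. This is not how the MapReduce web interface computes them; the function name, the inputs, and the constant-speed assumption behind the estimate are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Progress:
    speed_bytes_per_s: float
    percent_done: float
    eta_seconds: float


def migration_progress(bytes_copied: int, bytes_total: int, elapsed_s: float) -> Progress:
    """Derive average speed, completion percentage, and a constant-speed time estimate."""
    if bytes_total <= 0 or elapsed_s <= 0:
        raise ValueError("bytes_total and elapsed_s must be positive")
    speed = bytes_copied / elapsed_s
    percent = 100.0 * bytes_copied / bytes_total
    remaining = bytes_total - bytes_copied
    # With no data copied yet there is no basis for an estimate.
    eta = remaining / speed if speed > 0 else float("inf")
    return Progress(speed, percent, eta)


snapshot = migration_progress(bytes_copied=250, bytes_total=1000, elapsed_s=10.0)
print(snapshot)
```

The estimate also shows why compression shortens the migration window: for a fixed link speed, the remaining time scales with the number of bytes still to be copied.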

Step 170. If the match succeeds, the master control node of the target cluster decompresses the data tables migrated to the target cluster, using the decompression algorithm that corresponds to the configured compression algorithm;

In this step, a successful match means that the first storage space size occupied in the source cluster's HDFS before migration equals the second storage space size occupied in the target cluster's HDFS after migration, and that the total number of file blocks before migration equals the total number of file blocks after migration. A successful match therefore indicates that the data tables in the source cluster's distributed database have been migrated completely to the target cluster's distributed database.
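The matching rule described above can be sketched as a comparison of two pairs of numbers. The type and function names are hypothetical; the only logic taken from the text is that both the storage size and the total block count must be equal.

```python
from typing import List, NamedTuple


class TableStats(NamedTuple):
    size_bytes: int   # storage space occupied in HDFS
    block_count: int  # total number of file blocks


def find_mismatches(source: TableStats, target: TableStats) -> List[str]:
    """Return a description of each statistic that differs between the clusters."""
    problems: List[str] = []
    if source.size_bytes != target.size_bytes:
        problems.append(f"size differs: source={source.size_bytes} target={target.size_bytes}")
    if source.block_count != target.block_count:
        problems.append(f"blocks differ: source={source.block_count} target={target.block_count}")
    return problems


def migration_is_complete(source: TableStats, target: TableStats) -> bool:
    """The match succeeds only when both statistics are identical."""
    return not find_mismatches(source, target)


first = TableStats(size_bytes=4096, block_count=2)
second_ok = TableStats(size_bytes=4096, block_count=2)
second_bad = TableStats(size_bytes=4096, block_count=1)
print(migration_is_complete(first, second_ok), find_mismatches(first, second_bad))
```

Note that equal block counts presuppose that both clusters split files with the same block size; if the target used a different block size, the counts could differ even after a complete copy.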

This step decompresses the data tables that have been migrated to the target cluster, once the tables are confirmed to have been migrated completely.

Step 180. The master control node of the target cluster starts the target cluster based on a startup policy.

This step starts the target cluster so that each of its nodes works normally.

In the technical solution of this embodiment, stopping data operations on each child node of the source cluster and persisting the in-memory data of the source cluster's distributed database ensures that the data in that database is persisted before migration. Compressing the data tables in the source cluster's distributed database reduces the amount of data transferred, so migrating the compressed tables to the target cluster improves migration efficiency. The storage space size and total number of file blocks occupied by the data tables in the source cluster before migration are then matched against the storage space size and total number of file blocks occupied by the data tables in the target cluster after migration, and the integrity of the migration is verified from the matching result.
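The ordering of steps 110 to 180 can be summarized in a short sketch. Every callable below is a hypothetical stand-in supplied by the caller, not an API from the patent or from Hadoop. The text does not say what happens when the match fails, so returning `False` without touching the target is an assumption of this sketch.

```python
from typing import Callable, List, Tuple

Stats = Tuple[int, int]  # (storage size in bytes, total file blocks)


def run_migration(
    source_steps: List[Callable[[], None]],
    count_source: Callable[[], Stats],
    copy_tables: Callable[[], None],
    count_target: Callable[[], Stats],
    target_steps: List[Callable[[], None]],
) -> bool:
    """Run the steps in order; decompress and start the target only if the statistics match."""
    for step in source_steps:        # steps 110-130: stop, flush, compress
        step()
    expected = count_source()        # step 140: first size and first block count
    copy_tables()                    # step 150: migrate the tables
    if count_target() != expected:   # step 160: second size and block count must match
        return False
    for step in target_steps:        # steps 170-180: decompress, start target cluster
        step()
    return True


log: List[str] = []
ok = run_migration(
    [lambda: log.append("stop"), lambda: log.append("flush"), lambda: log.append("compress")],
    lambda: (4096, 2),
    lambda: log.append("copy"),
    lambda: (4096, 2),
    [lambda: log.append("decompress"), lambda: log.append("start")],
)
print(ok, log)
```

Counting on the source happens after compression, so both counts describe the compressed tables and can be compared directly before anything is decompressed on the target.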

Embodiment 2

Building on the embodiment above, this embodiment adds the following before the master control node of the source cluster counts the first storage space size and the first total number of file blocks that the data tables in the source cluster's distributed database occupy in HDFS:

The master control node of the source cluster uses the full file-merge component of the source cluster's distributed database to purge, from the disk storage space of that database, the data tables that satisfy a preset purge policy.

This step runs after the data tables to be migrated have been compressed. It purges invalid data tables from the disk storage space of the source cluster's distributed database, which further reduces the amount of data to migrate and improves migration efficiency.

The preset clearing policy can be implemented in several ways, for example including at least one of the following:

treating data tables that carry a deletion marker in the disk storage space of the source cluster's distributed database as data tables to be cleared;

treating data tables that have reached their time-to-live in the disk storage space of the source cluster's distributed database as data tables to be cleared;

Note that the time-to-live of a data table can be preset as needed, and expired data tables are retired according to it. Taking an e-commerce platform as an example, the time-to-live of a data table is usually set according to the duration of the corresponding promotion, such as 30 days, 7 days, or a particular period within a specific day, for instance from 10 a.m. to 10 p.m. on the day of a store anniversary sale. When the anniversary sale ends, the data tables whose time-to-live was set for it become expired data tables. Clearing expired data tables helps save storage space and improve migration efficiency.

treating data tables whose maximum number of versions exceeds a threshold in the disk storage space of the source cluster's distributed database as data tables to be cleared.

Note that the maximum number of versions of a data table can be preset as needed and is usually set to 3. For data tables that are updated frequently it can be set to 1, so that invalid data tables are retired quickly, which helps save storage space and improve migration efficiency.
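The three clearing rules above can be sketched as a single predicate. This is a minimal illustration, not the patented implementation: the `TableInfo` record, its field names, and the `should_clear` function are hypothetical, and the sketch assumes that the version rule compares a table's stored version count against the configured maximum.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableInfo:
    """Hypothetical description of one data table on disk."""
    name: str
    delete_marked: bool            # carries a deletion marker
    created_at: float              # creation time, seconds since the epoch
    ttl_seconds: Optional[float]   # time-to-live; None means the table never expires
    version_count: int             # number of versions currently stored


def should_clear(table: TableInfo, now: float, max_versions: int = 3) -> bool:
    """Return True if the table matches at least one clearing rule."""
    # Rule 1: the table carries a deletion marker.
    if table.delete_marked:
        return True
    # Rule 2: the table has reached its time-to-live.
    if table.ttl_seconds is not None and now - table.created_at >= table.ttl_seconds:
        return True
    # Rule 3: the table holds more versions than the configured threshold.
    return table.version_count > max_versions
```

For example, a table created for a 7-day promotion would be selected once `now - created_at` reaches `7 * 86400` seconds, even if it has no deletion marker and only one version.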

In the technical solution of this embodiment, after the data in the source cluster's distributed database has been persisted before migration, the data tables to be migrated are compressed, which reduces the amount of data transmitted; clearing the invalid data tables from the disk storage space of the source cluster's distributed database further reduces the amount of data to migrate. Migrating the compressed and cleared data tables from the source cluster's distributed database to the target cluster therefore improves migration efficiency. In addition, matching the storage space size and total number of file blocks occupied by the data tables of the source cluster's distributed database before migration against those occupied by the data tables of the target cluster after migration makes it possible to verify the integrity of the migration from the match result.

In the above solution, the buffer-flush component and the full file-merge component can be triggered by calling the command line interface of the source cluster's distributed database.

Here, a command line interface is an interface through which the user interacts with the operating system. In the Linux operating system the command line interface is called the shell; its main role is to serve the user, for example by receiving input from the keyboard or displaying execution results on the screen.
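As an illustration only, the following sketch generates a command script for such a command line interface. It assumes the distributed database is HBase, which the text does not state; under that assumption, the HBase shell commands `flush` and `major_compact` would play the roles of the buffer-flush component and the full file-merge component. The function name `build_shell_script` is hypothetical.

```python
from typing import Iterable


def build_shell_script(tables: Iterable[str]) -> str:
    """Build shell commands that flush and then fully compact each table.

    Assumption: an HBase-style shell in which `flush 't'` writes in-memory
    data to disk and `major_compact 't'` merges store files, dropping
    deleted, expired, and excess-version data.
    """
    lines = []
    for table in tables:
        lines.append(f"flush '{table}'")
        lines.append(f"major_compact '{table}'")
    return "\n".join(lines) + "\n"
```

The returned text could be saved to a file and passed to the database shell by the master control node; how the shell is invoked depends on the deployment and is left out of the sketch.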

In the above solution, before the master control node of the target cluster decompresses the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm, the method preferably further includes:

the master control node of the target cluster calling the consistency detection component in the target cluster to check the consistency of the data tables contained in the target cluster's distributed database;

if they are consistent, triggering the master control node of the target cluster to decompress the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm.

Note that checking the consistency of a data table means checking whether the table's description information agrees with the attribute information of the data table that actually exists in the HDFS of the target cluster. If they agree, the master control node of the target cluster is triggered to decompress the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm; if they do not agree, the consistency detection component is used to repair the inconsistency. This step is a supplementary check performed after the migration integrity verification of the data tables, and it improves the consistency of the data tables migrated to the target cluster.
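The consistency check described above can be sketched as a comparison between two views of the same tables. This is a simplified model under stated assumptions: both views are represented as plain dictionaries keyed by table name, and the function name `find_inconsistent_tables` and the attribute keys are hypothetical rather than part of any real consistency detection component.

```python
from typing import Dict, List


def find_inconsistent_tables(
    described: Dict[str, Dict[str, object]],
    on_hdfs: Dict[str, Dict[str, object]],
) -> List[str]:
    """Return the names of tables whose description differs from HDFS.

    A table is inconsistent if it appears in only one view or if its
    attributes differ between the two views.
    """
    names = set(described) | set(on_hdfs)
    return sorted(name for name in names if described.get(name) != on_hdfs.get(name))
```

An empty result corresponds to the "consistent" branch, in which decompression is triggered; a non-empty result lists the tables that would be handed to the repair function of the consistency detection component.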

Embodiment 3

Please refer to FIG. 2, which is a flowchart of a cross-cluster data migration method provided by Embodiment 3 of the present invention. Building on the foregoing embodiments, this embodiment provides a preferred scheme in which the master control node of the target cluster starts the target cluster based on a startup policy. The preferred method includes:

Step 210: the master control node of the target cluster calls a startup command to start the target cluster;

Step 220: if the log files associated with the target cluster's distributed database contain no error log entries or warning log entries, the master control node of the target cluster calls the overall health check component in the target cluster's distributed database to check the overall health of the target cluster;

Specifically, this step inspects the log files associated with the distributed database on the nodes contained in the target cluster. If error or warning log entries exist, the relevant components of the target cluster's distributed database are called, following the prompts, to resolve the problem; if no error or warning log entries exist, the health check component of the target cluster's distributed database is used to check the overall health of the target cluster.

Checking the overall health of the target cluster includes checking whether the data tables of the target cluster are in a normal state.

Step 230: the master control node of the target cluster calls the command line interface of the distributed database to set the state of the data tables to enabled.

Specifically, based on the overall health check result from step 220, this step keeps the data tables in the target cluster in the normal "enabled" state.

In the technical solution of this embodiment, after the target cluster is started, the log files associated with the distributed database on the nodes contained in the target cluster are inspected. If error or warning log entries exist, the relevant components of the target cluster's distributed database are called, following the prompts, to resolve the problem; if no error or warning log entries exist, the health check component of the target cluster's distributed database is used to check the overall health of the target cluster. Based on the overall health check result, the data tables in the target cluster are kept in the normal "enabled" state, so that they can provide normal access services.
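The log inspection that gates the health check can be sketched as follows. The sketch assumes log lines carry a conventional level token such as `ERROR`, `WARN`, or `WARNING`; the log format, the function names, and the returned action labels are hypothetical.

```python
import re
from typing import Iterable, List

# Assumed log format: each problem line contains ERROR, WARN, or WARNING
# as a separate word.
_PROBLEM = re.compile(r"\b(ERROR|WARN(?:ING)?)\b")


def problem_lines(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain an error or warning level token."""
    return [line for line in log_lines if _PROBLEM.search(line)]


def next_startup_action(log_lines: Iterable[str]) -> str:
    """Decide what step 220 does next, based on the log contents."""
    if problem_lines(log_lines):
        # Errors or warnings exist: resolve them before anything else.
        return "resolve_log_problems"
    # Clean logs: proceed to the overall health check.
    return "run_health_check"
```

Only when `next_startup_action` returns `"run_health_check"` would the master control node go on to check overall health and then set the data tables to the enabled state in step 230.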

In the above solution, after the master control node of the target cluster, on finding through the distributed database management page of the target cluster that the data tables contained in the target cluster's distributed database are not in the enabled state, calls the command line interface of the distributed database to set the state of the data tables to enabled, the method preferably further includes:

sending the mapping between the IP addresses and host names of the nodes contained in the target cluster, together with the connection information of the target cluster's distributed database, to the business party, and notifying the business party to verify the data service of the target cluster.
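One plausible way to hand the IP-to-hostname mapping to the business party is as hosts-file style lines. The sketch below is an assumption about the format only; the text does not specify how the mapping is encoded, and `render_hosts_entries` is a hypothetical helper.

```python
from typing import Dict


def render_hosts_entries(ip_to_host: Dict[str, str]) -> str:
    """Render an IP-to-hostname mapping as tab-separated hosts-file lines.

    Entries are ordered by the IP string so the output is deterministic.
    """
    return "\n".join(f"{ip}\t{host}" for ip, host in sorted(ip_to_host.items()))
```

The business party could append these lines to its local name resolution configuration before connecting to the target cluster to verify the data service.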

Embodiment 4

Please refer to FIG. 3a and FIG. 3b. Embodiment 4 of the present invention provides a cross-cluster data migration system. The system includes a source cluster and a target cluster; the source cluster includes a master control node and at least one child node, and the target cluster includes a master control node and at least one child node.

The master control node of the source cluster includes: a stop module 310, a persistence module 320, a compression module 330, a statistics module 340, and a migration module 350.

The master control node of the target cluster includes: a statistics module 360, a decompression module 370, and a startup module 380.

The stop module 310 is configured to call a stop command to make each child node of the source cluster stop data operations; the persistence module 320 is configured to use the buffer-flush component of the source cluster's distributed database to persist the data in the memory of the distributed database to HDFS; the compression module 330 is configured to compress the data tables contained in the source cluster's distributed database using a set compression algorithm; the statistics module 340 is configured to count the first storage space size and the first total number of file blocks of the HDFS occupied by the data tables in the source cluster's distributed database; the migration module 350 is configured to migrate the data tables in the source cluster's distributed database to the target cluster's distributed database, based on a previously obtained mapping between the IP addresses and host names of the nodes contained in the target cluster.

The statistics module 360 is configured to, if a data migration completion message returned by the web management interface of the MapReduce process of the source cluster is obtained, count the second storage space size and the second total number of file blocks of the corresponding HDFS occupied by the data tables in the target cluster's distributed database, and match the second storage space size and the second total number of file blocks against the first storage space size and the first total number of file blocks; the decompression module 370 is configured to, if the match succeeds, decompress the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm; the startup module 380 is configured to start the target cluster based on a startup policy.

In the technical solution of this embodiment, stopping data operations on each child node of the source cluster and persisting the data in the memory of the source cluster's distributed database ensures that the data in the source cluster's distributed database is persisted before migration. Compressing the data tables in the source cluster's distributed database reduces the amount of data transferred, so migrating the compressed data tables to the target cluster improves migration efficiency. Matching the storage space size and total number of file blocks occupied by the data tables in the source cluster's distributed database before migration against those occupied by the data tables of the target cluster after migration then makes it possible to verify the integrity of the migration from the match result.
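The integrity check performed by the two statistics modules reduces to comparing two pairs of numbers. A minimal sketch, assuming both sides report sizes in bytes and that an exact match is required; the `HdfsUsage` type and `migration_is_complete` function are hypothetical names.

```python
from typing import NamedTuple


class HdfsUsage(NamedTuple):
    """HDFS footprint of the data tables on one cluster."""
    size_bytes: int     # storage space occupied
    total_blocks: int   # total number of file blocks


def migration_is_complete(source: HdfsUsage, target: HdfsUsage) -> bool:
    """Return True only if both the size and the block count match exactly."""
    return source == target
```

Here `source` would hold the first storage space size and first total number of file blocks counted before migration, and `target` the second pair counted after the migration completion message arrives; decompression proceeds only when the function returns `True`.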

In the above solution, the master control node of the source cluster preferably further includes: a clearing module, configured to, before the first storage space size and the first total number of file blocks of the HDFS occupied by the data tables in the source cluster's distributed database are counted, use the full file-merge component of the source cluster's distributed database to clear, from the disk storage space of that database, the data tables that satisfy a preset clearing policy.

In the above solution, the preset clearing policy includes at least one of the following:

treating data tables that carry a deletion marker in the disk storage space of the source cluster's distributed database as data tables to be cleared;

treating data tables that have reached their time-to-live in the disk storage space of the source cluster's distributed database as data tables to be cleared;

treating data tables whose maximum number of versions exceeds a threshold in the disk storage space of the source cluster's distributed database as data tables to be cleared.

In the above solution, the persistence module 320 is specifically configured to trigger the buffer-flush component by calling the command line interface of the source cluster's distributed database, so as to persist the data in the memory of the distributed database to the distributed file system HDFS; the clearing module is specifically configured to, before the first storage space size and the first total number of file blocks of the HDFS occupied by the data tables in the source cluster's distributed database are counted, trigger the full file-merge component by calling the command line interface of the source cluster's distributed database, so as to clear, from the disk storage space of that database, the data tables that satisfy the preset clearing policy.

In the above solution, the master control node of the target cluster preferably further includes: a consistency detection module, configured to, before the data tables migrated to the target cluster are decompressed using the decompression algorithm corresponding to the set compression algorithm, call the consistency detection component in the target cluster to check the consistency of the data tables contained in the target cluster's distributed database; and, if they are consistent, trigger the master control node of the target cluster to decompress the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm.

In the above solution, the startup module 380 preferably includes: a startup unit, an overall health detection unit, and a data table state setting unit.

The startup unit is configured to call a startup command to start the target cluster; the overall health detection unit is configured to, if the log files associated with the target cluster's distributed database contain no error log entries or warning log entries, call the overall health check component in the target cluster's distributed database to check the overall health of the target cluster; the data table state setting unit is configured to call the command line interface of the distributed database to set the state of the data tables to enabled.

In the above solution, the startup module 380 may further include: a data service verification unit, configured to, after it is found through the distributed database management page of the target cluster that the data tables contained in the target cluster's distributed database are not in the enabled state and the command line interface of the distributed database is called to set the state of the data tables to enabled, send the mapping between the IP addresses and host names of the nodes contained in the target cluster, together with the connection information of the target cluster's distributed database, to the business party, and notify the business party to verify the data service of the target cluster.

In the cross-cluster data migration system provided by the embodiments of the present invention, the master control node of the source cluster and the master control node of the target cluster can execute the cross-cluster data migration method provided by any embodiment of the present invention, and have the functional modules and beneficial effects corresponding to that method.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them, and the preferred implementations in the embodiments are likewise not limiting. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A cross-cluster data migration method, characterized by comprising:
The master control node of a source cluster calls a stop command to make each child node of the source cluster stop data operations;
The master control node of the source cluster uses the buffer-flush component of the source cluster's distributed database to persist the data in the memory of the distributed database to the distributed file system HDFS;
The master control node of the source cluster compresses the data tables contained in the source cluster's distributed database using a set compression algorithm;
The master control node of the source cluster counts a first storage space size and a first total number of file blocks of the HDFS occupied by the data tables in the source cluster's distributed database;
The master control node of the source cluster, based on a previously obtained mapping between the IP addresses and host names of the nodes contained in a target cluster, migrates the data tables in the source cluster's distributed database to the distributed database of the target cluster;
If a data migration completion message returned by the web management interface of the MapReduce process of the source cluster is obtained, the master control node of the target cluster counts a second storage space size and a second total number of file blocks of the corresponding HDFS occupied by the data tables in the target cluster's distributed database, and matches the second storage space size and the second total number of file blocks against the first storage space size and the first total number of file blocks;
If the match succeeds, the master control node of the target cluster decompresses the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm;
The master control node of the target cluster starts the target cluster based on a startup policy.
2. The method according to claim 1, characterized in that, before the master control node of the source cluster counts the first storage space size and the first total number of file blocks of the HDFS occupied by the data tables in the source cluster's distributed database, the method further comprises:
The master control node of the source cluster uses the full file-merge component of the source cluster's distributed database to clear, from the disk storage space of that database, the data tables that satisfy a preset clearing policy.
3. The method according to claim 2, characterized in that the preset clearing policy comprises at least one of the following:
Treating data tables that carry a deletion marker in the disk storage space of the source cluster's distributed database as data tables to be cleared;
Treating data tables that have reached their time-to-live in the disk storage space of the source cluster's distributed database as data tables to be cleared;
Treating data tables whose maximum number of versions exceeds a threshold in the disk storage space of the source cluster's distributed database as data tables to be cleared.
4. The method according to claim 2, characterized in that the buffer-flush component and the full file-merge component are triggered by calling the command line interface of the source cluster's distributed database.
5. The method according to claim 1, characterized in that, before the master control node of the target cluster decompresses the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm, the method further comprises:
The master control node of the target cluster calls the consistency detection component in the target cluster to check the consistency of the data tables contained in the target cluster's distributed database;
If they are consistent, the master control node of the target cluster is triggered to decompress the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm.
6. The method according to any one of claims 1-5, characterized in that the master control node of the target cluster starting the target cluster based on a startup policy comprises:
The master control node of the target cluster calls a startup command to start the target cluster;
If the log files associated with the target cluster's distributed database contain no error log entries or warning log entries, the master control node of the target cluster calls the overall health check component in the target cluster's distributed database to check the overall health of the target cluster;
The master control node of the target cluster calls the command line interface of the distributed database to set the state of the data tables to enabled.
7. The method according to claim 6, characterized in that, after the master control node of the target cluster, on finding through the distributed database management page of the target cluster that the data tables contained in the target cluster's distributed database are not in the enabled state, calls the command line interface of the distributed database to set the state of the data tables to enabled, the method further comprises:
Sending the mapping between the IP addresses and host names of the nodes contained in the target cluster, together with the connection information of the target cluster's distributed database, to the business party, and notifying the business party to verify the data service of the target cluster.
8. A cross-cluster data migration system, comprising a source cluster and a target cluster, the source cluster comprising a master control node and at least one child node, and the target cluster comprising a master control node and at least one child node, characterized in that:
The master control node of the source cluster comprises:
A stop module, configured to call a stop command to make each child node of the source cluster stop data operations;
A persistence module, configured to use the buffer-flush component of the source cluster's distributed database to persist the data in the memory of the distributed database to the distributed file system HDFS;
A compression module, configured to compress the data tables contained in the source cluster's distributed database using a set compression algorithm;
A statistics module, configured to count a first storage space size and a first total number of file blocks of the HDFS occupied by the data tables in the source cluster's distributed database;
A migration module, configured to migrate the data tables in the source cluster's distributed database to the distributed database of the target cluster, based on a previously obtained mapping between the IP addresses and host names of the nodes contained in the target cluster;
The master control node of the target cluster comprises:
A statistics module, configured to, if a data migration completion message returned by the web management interface of the MapReduce process of the source cluster is obtained, count a second storage space size and a second total number of file blocks of the corresponding HDFS occupied by the data tables in the target cluster's distributed database, and match the second storage space size and the second total number of file blocks against the first storage space size and the first total number of file blocks;
A decompression module, configured to, if the match succeeds, decompress the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm;
A startup module, configured to start the target cluster based on a startup policy.
9. The system according to claim 8, characterized in that the master control node of the source cluster further comprises:
A clearing module, configured to, before the first storage space size and the first total number of file blocks of the HDFS occupied by the data tables in the source cluster's distributed database are counted, use the full file-merge component of the source cluster's distributed database to clear, from the disk storage space of that database, the data tables that satisfy a preset clearing policy.
10. The system according to claim 9, characterized in that the preset clearing policy comprises at least one of the following:
Treating data tables that carry a deletion marker in the disk storage space of the source cluster's distributed database as data tables to be cleared;
Treating data tables that have reached their time-to-live in the disk storage space of the source cluster's distributed database as data tables to be cleared;
Treating data tables whose maximum number of versions exceeds a threshold in the disk storage space of the source cluster's distributed database as data tables to be cleared.
CN201410455695.2A 2014-09-09 2014-09-09 cross-cluster data migration method and system Active CN104239493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410455695.2A CN104239493B (en) 2014-09-09 2014-09-09 cross-cluster data migration method and system

Publications (2)

Publication Number Publication Date
CN104239493A true CN104239493A (en) 2014-12-24
CN104239493B CN104239493B (en) 2017-05-10

Family

ID=52227552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410455695.2A Active CN104239493B (en) 2014-09-09 2014-09-09 cross-cluster data migration method and system

Country Status (1)

Country Link
CN (1) CN104239493B (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069128A (en) * 2015-08-14 2015-11-18 北京京东尚科信息技术有限公司 Data synchronization method and apparatus
CN105159970A (en) * 2015-08-25 2015-12-16 浪潮(北京)电子信息产业有限公司 System and method for database data migration
CN105808612A (en) * 2014-12-31 2016-07-27 北京嘀嘀无限科技发展有限公司 Method and equipment used for migrating data of database
CN106484379A (en) * 2015-08-28 2017-03-08 华为技术有限公司 A kind of processing method and processing device of application
CN106777164A (en) * 2016-12-20 2017-05-31 东软集团股份有限公司 A kind of Data Migration cluster and data migration method
CN106933859A (en) * 2015-12-30 2017-07-07 中国移动通信集团公司 The moving method and device of a kind of medical data
CN107016075A (en) * 2017-03-27 2017-08-04 聚好看科技股份有限公司 Company-data synchronous method and device
CN107515782A (en) * 2017-07-26 2017-12-26 北京天云融创软件技术有限公司 Implementation method of the container across host migration under a kind of Docker environment
CN107704633A (en) * 2017-11-01 2018-02-16 郑州云海信息技术有限公司 A kind of method and system of file migration
CN108021585A (en) * 2016-10-28 2018-05-11 腾讯科技(深圳)有限公司 Distributed data storage method and device
CN108234566A (en) * 2016-12-21 2018-06-29 阿里巴巴集团控股有限公司 The data processing method and device of a kind of cluster
CN109376010A (en) * 2018-09-28 2019-02-22 上海思询信息科技有限公司 A method for cross-cluster resource migration based on Openstack
CN109542882A (en) * 2018-12-05 2019-03-29 南京中孚信息技术有限公司 A kind of database migration method and device
CN109544072A (en) * 2018-11-21 2019-03-29 北京京东尚科信息技术有限公司 Method, system, equipment and medium are reduced in hot spot inventory localization
CN109818794A (en) * 2019-01-31 2019-05-28 北京搜狐互联网信息服务有限公司 Cluster Migration Methods and Tools
CN110209731A (en) * 2019-04-25 2019-09-06 深圳壹账通智能科技有限公司 Method of data synchronization, device and storage medium, electronic device
CN110263044A (en) * 2019-06-21 2019-09-20 深圳前海微众银行股份有限公司 Date storage method, device, equipment and computer readable storage medium
CN110704540A (en) * 2019-10-10 2020-01-17 云南中烟工业有限责任公司 Method for evaluating data quality of source end and target end in data acquisition process
CN110955720A (en) * 2018-09-27 2020-04-03 阿里巴巴集团控股有限公司 Data loading method, device and system
CN111064789A (en) * 2019-12-18 2020-04-24 北京三快在线科技有限公司 Data migration method and system
CN111274213A (en) * 2020-02-13 2020-06-12 苏州浪潮智能科技有限公司 A distributed file system HDFS cross-Insight cluster real-time data transmission method and system
CN111367889A (en) * 2020-03-09 2020-07-03 中国工商银行股份有限公司 Cross-cluster data migration method and device based on webpage interface
CN111756562A (en) * 2019-03-29 2020-10-09 深信服科技股份有限公司 Cluster takeover method, system and related components
CN112799912A (en) * 2021-01-27 2021-05-14 苏州浪潮智能科技有限公司 A kind of data monitoring method, device and system of AMS system
CN113297166A (en) * 2020-07-27 2021-08-24 阿里巴巴集团控股有限公司 Data processing system, method and device
CN110209653B (en) * 2019-06-04 2021-11-23 中国农业银行股份有限公司 HBase data migration method and device
CN113760856A (en) * 2020-06-05 2021-12-07 京东数字科技控股有限公司 Database management method and device, computer readable storage medium and electronic device
CN113778761A (en) * 2021-08-17 2021-12-10 北京金山云网络技术有限公司 Time sequence database cluster and fault processing and operating method and device thereof
CN116303773A (en) * 2023-03-01 2023-06-23 中国建设银行股份有限公司 Data replication method, device, equipment and computer storage medium
CN116795815A (en) * 2022-12-30 2023-09-22 兴业银行股份有限公司 ETL processing demand migration and data verification method and system for cross-cluster version
US12135695B2 (en) 2022-05-06 2024-11-05 International Business Machines Corporation Data migration in a distributed file system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101958808A (en) * 2010-10-18 2011-01-26 华东交通大学 A cluster task scheduling manager serving multi-grid access
US20120198269A1 (en) * 2011-01-27 2012-08-02 International Business Machines Corporation Method and apparatus for application recovery in a file system
CN103207814A (en) * 2012-12-27 2013-07-17 北京仿真中心 Decentralized cross cluster resource management and task scheduling system and scheduling method
US20140040575A1 (en) * 2012-08-01 2014-02-06 Netapp, Inc. Mobile hadoop clusters

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101958808A (en) * 2010-10-18 2011-01-26 华东交通大学 A cluster task scheduling manager serving multi-grid access
US20120198269A1 (en) * 2011-01-27 2012-08-02 International Business Machines Corporation Method and apparatus for application recovery in a file system
US20120284558A1 (en) * 2011-01-27 2012-11-08 International Business Machines Corporation Application recovery in a file system
CN103329105A (en) * 2011-01-27 2013-09-25 国际商业机器公司 Application recovery in file system
US20140040575A1 (en) * 2012-08-01 2014-02-06 Netapp, Inc. Mobile hadoop clusters
CN103207814A (en) * 2012-12-27 2013-07-17 北京仿真中心 Decentralized cross cluster resource management and task scheduling system and scheduling method

Cited By (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808612A (en) * 2014-12-31 2016-07-27 北京嘀嘀无限科技发展有限公司 Method and equipment used for migrating data of database
CN105808612B (en) * 2014-12-31 2019-08-27 北京嘀嘀无限科技发展有限公司 Method and equipment used for migrating data of database
CN105069128B (en) * 2015-08-14 2018-11-09 北京京东尚科信息技术有限公司 Data synchronization method and apparatus
CN105069128A (en) * 2015-08-14 2015-11-18 北京京东尚科信息技术有限公司 Data synchronization method and apparatus
CN105159970A (en) * 2015-08-25 2015-12-16 浪潮(北京)电子信息产业有限公司 System and method for database data migration
CN105159970B (en) * 2015-08-25 2019-03-15 浪潮(北京)电子信息产业有限公司 A database data migration system and method
CN106484379A (en) * 2015-08-28 2017-03-08 华为技术有限公司 Application processing method and device
CN106484379B (en) * 2015-08-28 2019-11-29 华为技术有限公司 Application processing method and device
CN106933859A (en) * 2015-12-30 2017-07-07 中国移动通信集团公司 Method and device for migrating medical data
CN106933859B (en) * 2015-12-30 2020-10-20 中国移动通信集团公司 Method and device for migrating medical data
CN108021585A (en) * 2016-10-28 2018-05-11 腾讯科技(深圳)有限公司 Distributed data storage method and device
CN106777164A (en) * 2016-12-20 2017-05-31 东软集团股份有限公司 Data migration cluster and data migration method
CN106777164B (en) * 2016-12-20 2020-07-10 东软集团股份有限公司 Data migration cluster and data migration method
CN108234566A (en) * 2016-12-21 2018-06-29 阿里巴巴集团控股有限公司 Cluster data processing method and device
CN107016075A (en) * 2017-03-27 2017-08-04 聚好看科技股份有限公司 Cluster data synchronization method and device
CN107515782A (en) * 2017-07-26 2017-12-26 北京天云融创软件技术有限公司 Method for implementing cross-host container migration in a Docker environment
CN107704633A (en) * 2017-11-01 2018-02-16 郑州云海信息技术有限公司 File migration method and system
CN110955720A (en) * 2018-09-27 2020-04-03 阿里巴巴集团控股有限公司 Data loading method, device and system
CN110955720B (en) * 2018-09-27 2023-04-07 阿里巴巴集团控股有限公司 Data loading method, device and system
CN109376010A (en) * 2018-09-28 2019-02-22 上海思询信息科技有限公司 A method for cross-cluster resource migration based on Openstack
CN109544072A (en) * 2018-11-21 2019-03-29 北京京东尚科信息技术有限公司 Localized hot-spot inventory deduction method, system, equipment and medium
CN109542882A (en) * 2018-12-05 2019-03-29 南京中孚信息技术有限公司 Database migration method and device
CN109542882B (en) * 2018-12-05 2020-11-06 南京中孚信息技术有限公司 Database migration method and device
CN109818794A (en) * 2019-01-31 2019-05-28 北京搜狐互联网信息服务有限公司 Cluster Migration Methods and Tools
CN111756562A (en) * 2019-03-29 2020-10-09 深信服科技股份有限公司 Cluster takeover method, system and related components
CN111756562B (en) * 2019-03-29 2023-07-14 深信服科技股份有限公司 Cluster takeover method, system and related components
CN110209731A (en) * 2019-04-25 2019-09-06 深圳壹账通智能科技有限公司 Data synchronization method, device, storage medium and electronic device
CN110209653B (en) * 2019-06-04 2021-11-23 中国农业银行股份有限公司 HBase data migration method and device
CN110263044B (en) * 2019-06-21 2023-03-31 深圳前海微众银行股份有限公司 Data storage method, device, equipment and computer readable storage medium
CN110263044A (en) * 2019-06-21 2019-09-20 深圳前海微众银行股份有限公司 Data storage method, device, equipment and computer readable storage medium
CN110704540A (en) * 2019-10-10 2020-01-17 云南中烟工业有限责任公司 Method for evaluating data quality of source end and target end in data acquisition process
CN110704540B (en) * 2019-10-10 2023-05-02 云南中烟工业有限责任公司 Method for evaluating data quality of source end and target end in data acquisition process
CN111064789A (en) * 2019-12-18 2020-04-24 北京三快在线科技有限公司 Data migration method and system
CN111274213A (en) * 2020-02-13 2020-06-12 苏州浪潮智能科技有限公司 A distributed file system HDFS cross-Insight cluster real-time data transmission method and system
CN111274213B (en) * 2020-02-13 2022-07-15 苏州浪潮智能科技有限公司 A distributed file system HDFS cross-Insight cluster real-time data transmission method and system
CN111367889A (en) * 2020-03-09 2020-07-03 中国工商银行股份有限公司 Cross-cluster data migration method and device based on webpage interface
CN111367889B (en) * 2020-03-09 2023-08-04 中国工商银行股份有限公司 Cross-cluster data migration method and device based on webpage interface
CN113760856A (en) * 2020-06-05 2021-12-07 京东数字科技控股有限公司 Database management method and device, computer readable storage medium and electronic device
CN113297166A (en) * 2020-07-27 2021-08-24 阿里巴巴集团控股有限公司 Data processing system, method and device
CN112799912A (en) * 2021-01-27 2021-05-14 苏州浪潮智能科技有限公司 Data monitoring method, device and system for an AMS system
CN113778761A (en) * 2021-08-17 2021-12-10 北京金山云网络技术有限公司 Time sequence database cluster and fault processing and operating method and device thereof
US12135695B2 (en) 2022-05-06 2024-11-05 International Business Machines Corporation Data migration in a distributed file system
CN116795815A (en) * 2022-12-30 2023-09-22 兴业银行股份有限公司 ETL processing demand migration and data verification method and system for cross-cluster version
CN116795815B (en) * 2022-12-30 2025-07-29 兴业银行股份有限公司 ETL processing demand migration and data verification method and system for cross-cluster version
CN116303773A (en) * 2023-03-01 2023-06-23 中国建设银行股份有限公司 Data replication method, device, equipment and computer storage medium

Also Published As

Publication number Publication date
CN104239493B (en) 2017-05-10

Similar Documents

Publication Publication Date Title
CN104239493B (en) Cross-cluster data migration method and system
US10613791B2 (en) Portable snapshot replication between storage systems
US10599533B2 (en) Cloud storage using merkle trees
CN107045422B (en) Distributed storage method and device
US9031910B2 (en) System and method for maintaining a cluster setup
CN102521072B (en) Virtual tape library equipment and data recovery method
WO2022063284A1 (en) Data synchronization method and apparatus, device, and computer-readable medium
TWI544333B (en) Method, computer storage medium, and system for storage architecture for backup application
US20220398018A1 (en) Tiering Snapshots Across Different Storage Tiers
CN103645969B (en) Data copy method and data-storage system
CN112685499B (en) A method, device and equipment for synchronizing process data of work business flow
CN103561057A (en) Data storage method based on distributed hash table and erasure codes
WO2012051845A1 (en) Data transfer method and system
CN106657356A (en) Data writing method and device for cloud storage system, and cloud storage system
CN101667181A (en) Method, device and system for data disaster tolerance
CN103399941A (en) Distributed file processing method, device and system
CN107391770B (en) Method, device and equipment for processing data and storage medium
CN103902410A (en) Data backup acceleration method for cloud storage system
WO2024230746A1 (en) Method for backing up metadata of file on hdd, and metadata backup server
CN104866528B (en) Multi-platform data acquisition method and system
CN110618790A (en) Fog storage data redundancy removal method based on data deduplication
CN110362590A (en) Data managing method, device, system, electronic equipment and computer-readable medium
CN103365987B (en) Clustered database system and data processing method based on shared-disk framework
CN111316251A (en) Scalable Storage System
CN103150268A (en) Block-level data capture method in CDP (Continuous Data Protection)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant