CN105701117A - ETL (Extract-Transform-Load) dispatching method and apparatus - Google Patents

Info

Publication number
CN105701117A
Authority
CN
China
Prior art keywords
data warehouse
task
stage
parameter
source data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410707712.7A
Other languages
Chinese (zh)
Other versions
CN105701117B (en)
Inventor
周斌彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201410707712.7A priority Critical patent/CN105701117B/en
Publication of CN105701117A publication Critical patent/CN105701117A/en
Application granted granted Critical
Publication of CN105701117B publication Critical patent/CN105701117B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present invention provides an ETL scheduling method and apparatus. The method includes: first, determining a first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage; second, establishing a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse, and establishing a task allocation table according to the distributed mode adopted by the servers corresponding to a second data warehouse; and finally, scheduling the tasks of each stage according to the task replication table and the task allocation table. Because the stages of the system do not need multiple independent ETL apparatuses, a single ETL apparatus can schedule the tasks of every stage by establishing the task replication table and the task allocation table, which improves the management efficiency of the ETL apparatus and reduces maintenance complexity.

Description

ETL scheduling method and apparatus

Technical field

Embodiments of the present invention relate to communication technologies, and in particular to an Extract-Transform-Load (ETL) scheduling method and apparatus.

Background

With the development of big data technology, distributed data storage systems are becoming increasingly common. Big data applications generally need to integrate several different data storage systems to build data warehouses for different applications. ETL describes the process of extracting data from a source data warehouse, transforming it, and loading it into a destination data warehouse. An ETL apparatus, also called an ETL tool, is typically responsible for the scheduling control of the programs the system runs and for resource allocation.

The servers corresponding to the data warehouses described above are generally deployed in a distributed manner, but the deployment modes differ. The two main deployment modes at present are the shared-nothing (Shared Nothing) architecture and the shared-disk (Shared Disk) architecture. In the shared-nothing architecture, each node (server) of a data warehouse has its own central processing unit (CPU), memory, and disk resources, and data is distributed across the nodes according to rules. In the shared-disk architecture, each node of a data warehouse has its own CPU and memory, but the nodes share disk space and the data is stored in one place. In the prior art, a massively parallel processing (MPP) system includes multiple data warehouses. Because the servers of the data warehouses are deployed in different modes, each stage has its own ETL apparatus to distribute and schedule tasks.
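The difference between the two architectures can be illustrated with a minimal sketch. It is not part of the disclosed embodiment: the `Node` class, the `build_cluster` helper, and the disk identifiers are hypothetical names chosen for illustration. Each node always has its own CPU and memory, so the sketch models only the disk resource, which is the point where the two architectures differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One server of a data warehouse (hypothetical model)."""
    name: str
    disk: str  # identifier of the disk resource this node reads and writes


def build_cluster(names, shared_disk):
    """Build the nodes of one data warehouse.

    shared_disk=True  -> every node points at the same disk resource.
    shared_disk=False -> shared-nothing: every node owns a private disk.
    """
    if shared_disk:
        return [Node(name, "shared-disk-0") for name in names]
    return [Node(name, "disk-of-" + name) for name in names]


def distinct_disks(cluster):
    """Count how many separate disk resources the cluster uses."""
    return len({node.disk for node in cluster})


shared_disk_cluster = build_cluster(["n1", "n2", "n3"], shared_disk=True)
shared_nothing_cluster = build_cluster(["n1", "n2", "n3"], shared_disk=False)
```

In this model a task that needs all the data can run on any node of the shared-disk cluster, whereas in the shared-nothing cluster each node can only reach its own slice of the data.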

However, the prior art suffers from low management efficiency and relatively complex maintenance for such scattered ETL apparatuses.

Summary of the invention

The present invention provides an ETL scheduling method and apparatus that improve the management efficiency of the ETL apparatus and reduce maintenance complexity.

According to a first aspect, an embodiment of the present invention provides an ETL scheduling method, including: determining a first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage; establishing a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse, where the task replication table includes entries for the source data warehouse and entries for the destination data warehouse; establishing a task allocation table according to the distributed mode adopted by the servers corresponding to a second data warehouse, where the second data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task allocation table includes the distributed mode adopted by the servers corresponding to each second data warehouse; and scheduling the tasks of each stage according to the task replication table and the task allocation table.

With reference to the first aspect, in a first possible implementation of the first aspect, the task replication table further includes a first parameter and a second parameter; the first parameter indicates that the first data warehouse is the source data warehouse of the stage; and the second parameter indicates that the first data warehouse is the destination data warehouse of the stage.

With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, establishing the task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse specifically includes: determining the entries for the source data warehouse and the entries for the destination data warehouse according to the logical relationship between the source data warehouse and the destination data warehouse; determining the first parameter and the second parameter according to the first data warehouse; and establishing the task replication table according to the entries for the source data warehouse, the entries for the destination data warehouse, the first parameter, and the second parameter.

With reference to the first aspect, the first possible implementation of the first aspect, or the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the distributed mode includes a shared-nothing distribution mode and a shared-disk distribution mode.

With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, scheduling the tasks of each stage according to the task replication table and the task allocation table specifically includes: scheduling the tasks of each stage between the source data warehouse and the destination data warehouse of that stage in the determined distributed mode.

According to a second aspect, an embodiment of the present invention provides an ETL scheduling apparatus, including: a determination module, configured to determine a first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage; an establishment module, configured to establish a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse, where the task replication table includes entries for the source data warehouse and entries for the destination data warehouse; the establishment module being further configured to establish a task allocation table according to the distributed mode adopted by the servers corresponding to a second data warehouse, where the second data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task allocation table includes the distributed mode adopted by the servers corresponding to each second data warehouse; and a scheduling module, configured to schedule the tasks of each stage according to the task replication table and the task allocation table.

With reference to the second aspect, in a first possible implementation of the second aspect, the task replication table further includes a first parameter and a second parameter; the first parameter indicates that the first data warehouse is the source data warehouse of the stage; and the second parameter indicates that the first data warehouse is the destination data warehouse of the stage.

With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the establishment module is specifically configured to: determine the entries for the source data warehouse and the entries for the destination data warehouse according to the logical relationship between the source data warehouse and the destination data warehouse; determine the first parameter and the second parameter according to the first data warehouse; and establish the task replication table according to the entries for the source data warehouse, the entries for the destination data warehouse, the first parameter, and the second parameter.

With reference to the second aspect, the first possible implementation of the second aspect, or the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the distributed mode includes a shared-nothing distribution mode and a shared-disk distribution mode.

With reference to the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the scheduling module is specifically configured to schedule the tasks of each stage between the source data warehouse and the destination data warehouse of that stage in the determined distributed mode.

Embodiments of the present invention provide an ETL scheduling method and apparatus. The method includes: determining a first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage; establishing a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse, where the task replication table includes entries for the source data warehouse and entries for the destination data warehouse; establishing a task allocation table according to the distributed mode adopted by the servers corresponding to a second data warehouse, where the second data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task allocation table includes the distributed mode adopted by the servers corresponding to each second data warehouse; and scheduling the tasks of each stage according to the task replication table and the task allocation table. Because the stages of the system do not need multiple independent ETL apparatuses, a single ETL apparatus can schedule the tasks of every stage by establishing the task replication table and the task allocation table, which improves the management efficiency of the ETL apparatus in the system and reduces maintenance complexity.

Brief description of the drawings

FIG. 1 is a flowchart of an ETL scheduling method according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of an MPP system according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an ETL scheduling apparatus according to an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

FIG. 1 is a flowchart of an ETL scheduling method according to an embodiment of the present invention. The method applies to scenarios that include multiple data warehouses and is executed by an ETL scheduling apparatus, which may be an ETL tool. The method includes the following steps:

S101: Determine a first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage.

Specifically, a system such as a massively parallel processing (MPP) system usually includes multiple data warehouses, and each stage that the data flow passes through includes a source data warehouse and a destination data warehouse. Each data warehouse has its own task execution rule, which covers, for example, the execution time and the execution mode. The ETL apparatus selects either the source data warehouse or the destination data warehouse as the first data warehouse, and the tasks of that stage are carried out according to the task execution rule of the first data warehouse. This embodiment does not limit how the ETL apparatus determines the first data warehouse.
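Step S101 can be sketched as follows. Because the embodiment leaves the selection method open, the `follow` switch below is purely an assumption made for illustration, as are the `Stage` class and the function name; the only property taken from the text is that the first data warehouse must be either the source or the destination of the stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    source: str       # source data warehouse of the stage
    destination: str  # destination data warehouse of the stage


def pick_first_warehouse(stage, follow):
    """Return the warehouse whose task execution rule governs the stage.

    `follow` is a hypothetical switch, "source" or "destination";
    the text does not prescribe how this choice is made.
    """
    if follow == "source":
        return stage.source
    if follow == "destination":
        return stage.destination
    raise ValueError("follow must be 'source' or 'destination'")


stage_one = Stage(source="data source", destination="detail record library")
first_warehouse = pick_first_warehouse(stage_one, follow="source")
```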

S102: Establish a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse.

The task replication table includes entries for the source data warehouse and entries for the destination data warehouse. The task replication table further includes a first parameter and a second parameter; the first parameter indicates that the first data warehouse is the source data warehouse of the stage, and the second parameter indicates that the first data warehouse is the destination data warehouse of the stage.

Optionally, establishing the task replication table according to the logical relationship between the data warehouses in each stage and according to the first data warehouse specifically includes: determining the entries for the source data warehouse and the entries for the destination data warehouse according to the logical relationship between the source data warehouse and the destination data warehouse; determining the first parameter and the second parameter according to the first data warehouse; and establishing the task replication table according to the entries for the source data warehouse, the entries for the destination data warehouse, the first parameter, and the second parameter.

For example, FIG. 2 is a schematic structural diagram of an MPP system according to an embodiment of the present invention. Assume that the MPP system includes the following data warehouses: a data source (DataSource) 201, a detail record library 202, an analysis database (AnalysisDatabase) 203, and a user characteristic database 204. They reside on file servers, a Hive server, SybaseIQ servers, and RTANA servers, respectively. There are three file servers, three SybaseIQ servers, and three RTANA servers. The Hive cluster relies on the Hadoop cluster for distributed internal scheduling and exposes a single unified entry point, so the ETL apparatus can be regarded as seeing only one Hive server. As shown in FIG. 2, the logical relationship between the data warehouses in each stage is as follows: in the first stage, the source data warehouse is the data source 201 and the destination data warehouse is the detail record library 202; in the second stage, the source data warehouse is the detail record library 202 and the destination data warehouse is the analysis database 203; in the third stage, the source data warehouse is the analysis database 203 and the destination data warehouse is the user characteristic database 204.

The task replication table includes entries for the source data warehouse and entries for the destination data warehouse. It further includes a first parameter and a second parameter: the first parameter indicates that the first data warehouse is the source data warehouse of the stage, and the second parameter indicates that the first data warehouse is the destination data warehouse of the stage. Assume that the first parameter is 1 and the second parameter is 2. For example, if step S101 determines that the first data warehouse corresponding to the task execution rule of the first stage is the data source, that is, the tasks follow the execution rule of the data source, then the cell in the first row and first column of the table is set to the first parameter 1. Likewise, the cell in the second row and second column is set to the first parameter 1, and the cell in the third row and third column is set to the second parameter 2. The task replication table can be established in this way.
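The construction described above can be sketched in a few lines. The sketch assumes a simplified layout with one row per stage holding the source entry, the destination entry, and the parameter; the exact row and column layout of the table in the embodiment is not reproduced here, and the class and function names are hypothetical. The first-warehouse choices for the second and third stages are inferred from the parameter values 1, 1, 2 given in the text.

```python
from dataclasses import dataclass

FIRST_IS_SOURCE = 1       # the "first parameter" in the text
FIRST_IS_DESTINATION = 2  # the "second parameter" in the text


@dataclass(frozen=True)
class ReplicationRow:
    source: str
    destination: str
    marker: int  # 1 or 2, as defined above


def build_replication_table(stages, first_warehouses):
    """stages: list of (source, destination) pairs, one per stage.
    first_warehouses: the S101 choice for each stage, in the same order."""
    table = []
    for (source, destination), first in zip(stages, first_warehouses):
        if first == source:
            marker = FIRST_IS_SOURCE
        elif first == destination:
            marker = FIRST_IS_DESTINATION
        else:
            raise ValueError(repr(first) + " is not part of this stage")
        table.append(ReplicationRow(source, destination, marker))
    return table


stages = [
    ("data source", "detail record library"),
    ("detail record library", "analysis database"),
    ("analysis database", "user characteristic database"),
]
# Assumed S101 results: source, source, destination.
first_choices = ["data source", "detail record library",
                 "user characteristic database"]
replication_table = build_replication_table(stages, first_choices)
markers = [row.marker for row in replication_table]
```

Keeping the marker next to the source and destination entries lets the scheduler look up, for each stage, which warehouse's servers determine how many task copies to create.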

The task replication table provided in this embodiment is as follows:

S103: Establish a task allocation table according to the distributed mode adopted by the servers corresponding to the second data warehouse.

Specifically, the second data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task allocation table includes the distributed mode adopted by the servers corresponding to each second data warehouse. The distributed mode includes a shared-nothing distribution mode and a shared-disk distribution mode. In the present invention, the servers may also be deployed in other modes, such as an active/standby mode, and are not limited to these distributed modes. The task allocation table includes the servers corresponding to the second data warehouse of each stage. Assume that the shared-nothing distribution mode is represented by 3 and the shared-disk distribution mode by 4. In this embodiment the second data warehouses all happen to be destination data warehouses, so the first row of the task allocation table lists, from left to right, the Hive server, the SybaseIQ server, and the RTANA server, which adopt the shared-disk, shared-disk, and shared-nothing distribution modes, respectively.

The task allocation table provided in this embodiment is as follows:

Hive server    Sybase IQ server    RTANA server
3              3                   4
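A task allocation table of this kind can be represented as a simple mapping from server to mode code. The sketch below is illustrative only: the function names are hypothetical, the codes 3 and 4 follow the assumption stated in the text, and the example values are copied from the table as printed rather than derived independently.

```python
SHARED_NOTHING = 3  # code assumed in the text for the shared-nothing mode
SHARED_DISK = 4     # code assumed in the text for the shared-disk mode

MODE_NAMES = {SHARED_NOTHING: "shared-nothing", SHARED_DISK: "shared-disk"}


def build_allocation_table(server_modes):
    """server_modes: list of (server name, mode code) pairs.

    Returns a dict mapping each server to its mode code and rejects
    codes other than the two defined above.
    """
    table = {}
    for server, code in server_modes:
        if code not in MODE_NAMES:
            raise ValueError("unknown mode code: " + repr(code))
        table[server] = code
    return table


def mode_name(code):
    """Translate a mode code into a readable name."""
    return MODE_NAMES[code]


# Values as they appear in the printed table.
allocation_table = build_allocation_table([
    ("Hive server", 3),
    ("Sybase IQ server", 3),
    ("RTANA server", 4),
])
```

Other deployment modes mentioned in the text, such as active/standby, would need additional codes in `MODE_NAMES`; none are defined in the source, so none are added here.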

S104: Schedule the tasks of each stage according to the task replication table and the task allocation table.

Optionally, scheduling the tasks of each stage according to the task replication table and the task allocation table specifically includes: scheduling the tasks of each stage between the source data warehouse and the destination data warehouse of that stage in the determined distributed mode.

Continuing the example above, assume that the tasks of the three stages are as follows:

Task 1: Download the original detail records from the file servers and load the data directly into the detail record library on the Hive server.

Task 2: Export the original detail records from the detail record library on the Hive server, then clean and aggregate the data and load it onto the SybaseIQ servers.

Task 3: Export the user attributes from the SybaseIQ servers, load them onto the RTANA servers, and compute the user characteristics on the RTANA servers.

The tasks of the three stages define the logical relationship between the data warehouses: in the first stage, the source data warehouse is the data source and the destination data warehouse is the detail record library; in the second stage, the source data warehouse is the detail record library and the destination data warehouse is the analysis database; in the third stage, the source data warehouse is the analysis database and the destination data warehouse is the user characteristic database. Finally, the task replication table is established according to the logical relationship between the data warehouses in each stage and according to the first data warehouse.

In the task allocation table established above, the first row lists, from left to right, the Hive server corresponding to the data source, the SybaseIQ server, and the RTANA server, which adopt the shared-disk, shared-disk, and shared-nothing distribution modes, respectively. The three tasks are therefore scheduled as follows:

1. According to the task replication table, three copies of task 1 are created, one per file server; then, according to the task allocation table, all of them are executed on the Hive server.

2. According to the task replication table, one copy of task 2 is created for the Hive server; then, according to the task allocation table, that copy is scheduled by an idle SybaseIQ server.

3. According to the task replication table, three copies of task 3 are created, one per RTANA server; then, according to the task allocation table, each RTANA server schedules one copy of task 3.
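The three scheduling steps above can be sketched as a replicate-then-assign routine. This is a minimal illustration under stated assumptions, not the disclosed implementation: the server names, the `replicate` and `assign` helpers, and the use of round-robin as a stand-in for "pick an idle server" are all hypothetical. The modes per server follow the prose walk-through (Hive and SybaseIQ shared-disk, RTANA shared-nothing).

```python
from enum import Enum
from itertools import cycle


class Mode(Enum):
    SHARED_NOTHING = "shared-nothing"
    SHARED_DISK = "shared-disk"


def replicate(task, first_servers):
    """Create one copy of the task per server of the stage's first data warehouse."""
    return [task + "-copy" + str(i) for i in range(1, len(first_servers) + 1)]


def assign(copies, servers, mode):
    """Map each task copy to the server that runs it."""
    if mode is Mode.SHARED_NOTHING:
        # Each server holds its own slice of the data, so it must run its own copy.
        if len(copies) != len(servers):
            raise ValueError("shared-nothing needs exactly one copy per server")
        return dict(zip(copies, servers))
    # Shared disk: any server can reach the data. Round-robin stands in
    # for "pick an idle server" in this sketch.
    return dict(zip(copies, cycle(servers)))


file_servers = ["file-1", "file-2", "file-3"]
hive_servers = ["hive"]  # single unified entry point
iq_servers = ["iq-1", "iq-2", "iq-3"]
rtana_servers = ["rtana-1", "rtana-2", "rtana-3"]

# Step 1: three copies (one per file server), all executed on the Hive server.
plan_one = assign(replicate("task1", file_servers), hive_servers, Mode.SHARED_DISK)
# Step 2: one copy (one Hive server), picked up by a SybaseIQ server.
plan_two = assign(replicate("task2", hive_servers), iq_servers, Mode.SHARED_DISK)
# Step 3: three copies (one per RTANA server), each server runs one copy.
plan_three = assign(replicate("task3", rtana_servers), rtana_servers, Mode.SHARED_NOTHING)
```

Separating replication (how many copies) from assignment (which server runs each copy) mirrors the split between the task replication table and the task allocation table, which is what allows one scheduler to serve stages with different deployment modes.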

This embodiment provides an ETL scheduling method, including: first, determining a first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage; second, establishing a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse, and establishing a task allocation table according to the distributed mode adopted by the servers corresponding to the second data warehouse; and finally, scheduling the tasks of each stage according to the task replication table and the task allocation table. Because the stages of the MPP system do not need multiple independent ETL apparatuses, a single ETL apparatus can schedule the tasks of every stage by establishing the task replication table and the task allocation table, which improves the management efficiency of the ETL apparatus in the MPP system and reduces maintenance complexity.

FIG. 3 is a schematic structural diagram of an ETL scheduling apparatus according to an embodiment of the present invention. The apparatus includes: a determination module 301, configured to determine a first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage; an establishment module 302, configured to establish a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse, where the task replication table includes entries for the source data warehouse and entries for the destination data warehouse; the establishment module 302 being further configured to establish a task allocation table according to the distributed mode adopted by the servers corresponding to the second data warehouse, where the second data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task allocation table includes the distributed mode adopted by the servers corresponding to each second data warehouse; and a scheduling module 303, configured to schedule the tasks of each stage according to the task replication table and the task allocation table.

进一步地,所述任务复制表还包括:第一参数和第二参数;所述第一参数用于表示所述第一数据仓库为该阶段的所述源数据仓库;所述第二参数用于表示所述第一数据仓库为该阶段的所述目的数据仓库。Further, the task replication table further includes: a first parameter and a second parameter; the first parameter is used to indicate that the first data warehouse is the source data warehouse of this stage; the second parameter is used to Indicates that the first data warehouse is the target data warehouse at this stage.

可选地,所述建立模块302,具体用于:根据所述源数据仓库和所述目的数据仓库之间的逻辑关系确定所述源数据仓库的表项和所述目的数据仓库的表项;根据所述第一数据仓库确定所述第一参数和所述第二参数;根据所述所述源数据仓库的表项、所述目的数据仓库的表项、所述第一参数和所述第二参数建立所述任务复制表。Optionally, the establishment module 302 is specifically configured to: determine the entries of the source data warehouse and the entries of the destination data warehouse according to the logical relationship between the source data warehouse and the destination data warehouse; Determine the first parameter and the second parameter according to the first data warehouse; according to the entry of the source data warehouse, the entry of the destination data warehouse, the first parameter and the second The two parameters create the task replication table.

Optionally, the distribution mode includes a shared-nothing distribution mode and a shared-disk distribution mode.
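As a rough sketch, the two modes can be modeled as an enumeration, with the task allocation table mapping each warehouse to the mode of its server. The names `DistributionMode` and `build_allocation_table`, the string labels, and the example warehouse names are assumptions made for illustration.

```python
from enum import Enum


class DistributionMode(Enum):
    SHARED_NOTHING = "shared-nothing"  # each node owns its own storage
    SHARED_DISK = "shared-disk"        # all nodes reach the same storage


def build_allocation_table(server_modes):
    """Map each warehouse name to the mode of its server.

    server_modes: {warehouse: "shared-nothing" | "shared-disk"}.
    An unknown label raises ValueError via the Enum value lookup.
    """
    return {warehouse: DistributionMode(label)
            for warehouse, label in server_modes.items()}


allocation_table = build_allocation_table(
    {"ods": "shared-disk", "dw": "shared-nothing"}
)
```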

Optionally, the scheduling module 303 is specifically configured to schedule the tasks of each stage between the source data warehouse and the destination data warehouse of that stage according to the determined distribution mode.
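The text does not spell out how the distribution mode changes task placement, so the following is only one plausible reading. It assumes that under shared-nothing a task must run on the node that owns its data partition, while under shared-disk any node may take any task and the least-loaded node is chosen. `assign_tasks`, the `(task_id, partition)` tuples, and the node names are hypothetical.

```python
def assign_tasks(tasks, nodes, mode):
    """Assign tasks to nodes according to the distribution mode.

    tasks: list of (task_id, partition) tuples.
    nodes: non-empty list of node names.
    mode: "shared-nothing" or "shared-disk".
    """
    if not nodes:
        raise ValueError("at least one node is required")
    assignment = {node: [] for node in nodes}
    if mode == "shared-nothing":
        # Assumption: partition p lives on node p mod N, so the task goes there.
        for task_id, partition in tasks:
            assignment[nodes[partition % len(nodes)]].append(task_id)
    elif mode == "shared-disk":
        # Assumption: storage is reachable from every node, so balance by load.
        for task_id, _partition in tasks:
            target = min(nodes, key=lambda n: len(assignment[n]))
            assignment[target].append(task_id)
    else:
        raise ValueError(f"unknown distribution mode: {mode!r}")
    return assignment


tasks = [("t0", 0), ("t1", 2), ("t2", 2), ("t3", 2)]
nodes = ["n0", "n1"]
by_partition = assign_tasks(tasks, nodes, "shared-nothing")
by_load = assign_tasks(tasks, nodes, "shared-disk")
```

With these inputs every partition maps to `n0` under shared-nothing, whereas the shared-disk branch spreads the same four tasks evenly across both nodes.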

The ETL scheduling device provided in this embodiment can be used to carry out the technical solution of the ETL scheduling method corresponding to Fig. 1. Its implementation principle and technical effect are similar and are not repeated here.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An extract-transform-load (ETL) scheduling method, characterized by comprising:
determining the first data warehouse corresponding to the task execution rule of each stage, wherein the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage;
establishing a task replication table according to the first data warehouse and the logical relationship between the source data warehouse and the destination data warehouse, wherein the task replication table includes entries of the source data warehouse and entries of the destination data warehouse;
establishing a task allocation table according to the distribution mode adopted by the server corresponding to a second data warehouse, wherein the second data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task allocation table includes the distribution mode adopted by the server corresponding to each second data warehouse; and
scheduling the tasks of each stage according to the task replication table and the task allocation table.

2. The method according to claim 1, characterized in that the task replication table further includes a first parameter and a second parameter;
the first parameter indicates that the first data warehouse is the source data warehouse of the stage; and
the second parameter indicates that the first data warehouse is the destination data warehouse of the stage.

3. The method according to claim 2, characterized in that establishing the task replication table according to the first data warehouse and the logical relationship between the source data warehouse and the destination data warehouse specifically comprises:
determining the entries of the source data warehouse and the entries of the destination data warehouse according to the logical relationship between the source data warehouse and the destination data warehouse;
determining the first parameter and the second parameter according to the first data warehouse; and
establishing the task replication table according to the entries of the source data warehouse, the entries of the destination data warehouse, the first parameter, and the second parameter.

4. The method according to any one of claims 1 to 3, characterized in that the distribution mode includes a shared-nothing distribution mode and a shared-disk distribution mode.

5. The method according to claim 4, characterized in that scheduling the tasks of each stage according to the task replication table and the task allocation table specifically comprises:
scheduling the tasks of each stage between the source data warehouse and the destination data warehouse of that stage according to the determined distribution mode.

6. An ETL scheduling device, characterized by comprising:
a determining module, configured to determine the first data warehouse corresponding to the task execution rule of each stage, wherein the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage;
an establishing module, configured to establish a task replication table according to the first data warehouse and the logical relationship between the source data warehouse and the destination data warehouse, wherein the task replication table includes entries of the source data warehouse and entries of the destination data warehouse;
the establishing module being further configured to establish a task allocation table according to the distribution mode adopted by the server corresponding to a second data warehouse, wherein the second data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task allocation table includes the distribution mode adopted by the server corresponding to each second data warehouse; and
a scheduling module, configured to schedule the tasks of each stage according to the task replication table and the task allocation table.

7. The device according to claim 6, characterized in that the task replication table further includes a first parameter and a second parameter;
the first parameter indicates that the first data warehouse is the source data warehouse of the stage; and
the second parameter indicates that the first data warehouse is the destination data warehouse of the stage.

8. The device according to claim 7, characterized in that the establishing module is specifically configured to:
determine the entries of the source data warehouse and the entries of the destination data warehouse according to the logical relationship between the source data warehouse and the destination data warehouse;
determine the first parameter and the second parameter according to the first data warehouse; and
establish the task replication table according to the entries of the source data warehouse, the entries of the destination data warehouse, the first parameter, and the second parameter.

9. The device according to any one of claims 6 to 8, characterized in that the distribution mode includes a shared-nothing distribution mode and a shared-disk distribution mode.

10. The device according to claim 9, characterized in that the scheduling module is specifically configured to:
schedule the tasks of each stage between the source data warehouse and the destination data warehouse of that stage according to the determined distribution mode.

CN201410707712.7A 2014-11-27 2014-11-27 ETL scheduling method and device Expired - Fee Related CN105701117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410707712.7A CN105701117B (en) 2014-11-27 2014-11-27 ETL scheduling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410707712.7A CN105701117B (en) 2014-11-27 2014-11-27 ETL scheduling method and device

Publications (2)

Publication Number Publication Date
CN105701117A true CN105701117A (en) 2016-06-22
CN105701117B CN105701117B (en) 2019-06-21

Family

ID=56230411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410707712.7A Expired - Fee Related CN105701117B (en) 2014-11-27 2014-11-27 ETL scheduling method and device

Country Status (1)

Country Link
CN (1) CN105701117B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1897025A (en) * 2006-04-27 2007-01-17 南京联创科技股份有限公司 Parallel ETL technology of multi-thread working pack in mass data process
US20090063504A1 (en) * 2007-08-29 2009-03-05 Richard Banister Bi-directional replication between web services and relational databases
CN102693297A (en) * 2012-05-16 2012-09-26 华为技术有限公司 Data processing method, node and ETL (extract transform and load) system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALKIS SIMITSIS et al.: "Optimizing ETL Processes in Data Warehouses", IEEE *
SONG XUDONG et al.: "Research on an ETL task scheduling model for data warehouses", Control and Decision *

Also Published As

Publication number Publication date
CN105701117B (en) 2019-06-21

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190621