CN105701117A - ETL (Extract-Transform-Load) dispatching method and apparatus

Info

Publication number: CN105701117A
Application number: CN201410707712.7A
Authority: CN
Inventors: 周斌彦
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2014-11-27
Filing date: 2014-11-27
Publication date: 2016-06-22
Anticipated expiration: 2034-11-27
Also published as: CN105701117B

Abstract

An embodiment of the present invention provides an ETL scheduling method and device. The method includes: first, determining the first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage; second, establishing a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse, and establishing a task allocation table according to the distribution mode adopted by the servers corresponding to the second data warehouse; and finally, scheduling the tasks of each stage according to the task replication table and the task allocation table. Because the stages of the system no longer need multiple independent ETL devices, a single ETL device can schedule the tasks of every stage by establishing the task replication table and the task allocation table, which improves the management efficiency of the ETL device and reduces maintenance complexity.

Description

ETL scheduling method and device

Technical Field

Embodiments of the present invention relate to communication technologies, and in particular to an Extract-Transform-Load (ETL) scheduling method and device.

Background

With the development of big data technology, distributed data storage systems are becoming increasingly common. Big data applications generally need to integrate multiple different data storage systems to build data warehouses for different applications. ETL describes the process of extracting data from a source data warehouse, transforming it, and loading it into a destination data warehouse. An ETL device, also called an ETL tool, is typically responsible for scheduling control of the programs the system runs and for resource allocation.

The servers corresponding to the above data warehouses are generally deployed in a distributed manner, but the deployment modes differ. The main deployment modes at present are the shared-nothing (Shared Nothing) architecture and the shared-disk (Shared Disk) architecture. In the shared-nothing architecture, the nodes (servers) corresponding to each data warehouse have their own independent central processing unit (CPU), memory, and disk resources, and data is distributed across the nodes according to rules. In the shared-disk architecture, the nodes corresponding to each data warehouse have independent CPU and memory but share disk space, and data is stored in a unified manner. In the prior art, a massively parallel processing (MPP) system includes multiple data warehouses. Because the servers corresponding to the data warehouses are deployed in different modes, each stage has its own ETL device to distribute and schedule tasks.

However, the prior art suffers from low management efficiency and complicated maintenance of these discrete ETL devices.

Summary of the Invention

The present invention provides an ETL scheduling method and device that improve the management efficiency of the ETL device and reduce maintenance complexity.

In a first aspect, an embodiment of the present invention provides an ETL scheduling method, including: determining the first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage; establishing a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse, where the task replication table includes an entry for the source data warehouse and an entry for the destination data warehouse; establishing a task allocation table according to the distribution mode adopted by the servers corresponding to a second data warehouse, where the second data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task allocation table includes the distribution mode adopted by the servers corresponding to each second data warehouse; and scheduling the tasks of each stage according to the task replication table and the task allocation table.

With reference to the first aspect, in a first possible implementation of the first aspect, the task replication table further includes a first parameter and a second parameter; the first parameter indicates that the first data warehouse is the source data warehouse of the stage; the second parameter indicates that the first data warehouse is the destination data warehouse of the stage.

With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, establishing the task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse specifically includes: determining the entry for the source data warehouse and the entry for the destination data warehouse according to the logical relationship between the source data warehouse and the destination data warehouse; determining the first parameter and the second parameter according to the first data warehouse; and establishing the task replication table from the entry for the source data warehouse, the entry for the destination data warehouse, the first parameter, and the second parameter.

With reference to the first aspect, or the first or second possible implementation of the first aspect, in a third possible implementation of the first aspect, the distribution mode includes a shared-nothing distribution mode and a shared-disk distribution mode.

With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, scheduling the tasks of each stage according to the task replication table and the task allocation table specifically includes: scheduling the tasks of each stage between the source data warehouse and the destination data warehouse of that stage according to the determined distribution mode.

In a second aspect, an embodiment of the present invention provides an ETL scheduling device, including: a determination module, configured to determine the first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage; an establishment module, configured to establish a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse, where the task replication table includes an entry for the source data warehouse and an entry for the destination data warehouse; the establishment module being further configured to establish a task allocation table according to the distribution mode adopted by the servers corresponding to a second data warehouse, where the second data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task allocation table includes the distribution mode adopted by the servers corresponding to each second data warehouse; and a scheduling module, configured to schedule the tasks of each stage according to the task replication table and the task allocation table.

With reference to the second aspect, in a first possible implementation of the second aspect, the task replication table further includes a first parameter and a second parameter; the first parameter indicates that the first data warehouse is the source data warehouse of the stage; the second parameter indicates that the first data warehouse is the destination data warehouse of the stage.

With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the establishment module is specifically configured to: determine the entry for the source data warehouse and the entry for the destination data warehouse according to the logical relationship between the source data warehouse and the destination data warehouse; determine the first parameter and the second parameter according to the first data warehouse; and establish the task replication table from the entry for the source data warehouse, the entry for the destination data warehouse, the first parameter, and the second parameter.

With reference to the second aspect, or the first or second possible implementation of the second aspect, in a third possible implementation of the second aspect, the distribution mode includes a shared-nothing distribution mode and a shared-disk distribution mode.

With reference to the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the scheduling module is specifically configured to schedule the tasks of each stage between the source data warehouse and the destination data warehouse of that stage according to the determined distribution mode.

Embodiments of the present invention provide an ETL scheduling method and device. The method includes: determining the first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage; establishing a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse, where the task replication table includes an entry for the source data warehouse and an entry for the destination data warehouse; establishing a task allocation table according to the distribution mode adopted by the servers corresponding to the second data warehouse, where the second data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task allocation table includes the distribution mode adopted by the servers corresponding to each second data warehouse; and scheduling the tasks of each stage according to the task replication table and the task allocation table. Because the stages of the system no longer need multiple independent ETL devices, a single ETL device can schedule the tasks of every stage by establishing the task replication table and the task allocation table, which improves the management efficiency of the ETL device in the system and reduces maintenance complexity.

Description of the Drawings

FIG. 1 is a flowchart of an ETL scheduling method provided by an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of an MPP system provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an ETL scheduling device provided by an embodiment of the present invention.

Detailed Description

To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

FIG. 1 is a flowchart of an ETL scheduling method provided by an embodiment of the present invention. The method applies to scenarios that include multiple data warehouses and is executed by an ETL scheduling device, which may be an ETL tool. The method includes the following steps:

S101: Determine the first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage.

Specifically, a system such as a massively parallel processing (MPP) system usually includes multiple data warehouses, and each stage that the data flow passes through includes a source data warehouse and a destination data warehouse. Each data warehouse has its own task execution rule, which includes the execution time, the execution mode, and so on. The ETL device selects either the source data warehouse or the destination data warehouse as the first data warehouse, and the tasks of that stage are executed according to the task execution rule of the first data warehouse. This embodiment does not limit how the ETL device determines the first data warehouse.
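Step S101 can be illustrated with a minimal Python sketch. The names `ExecutionRule`, `Stage`, `first_warehouse`, and `rule_for_stage`, the rule fields, and the sample values are all hypothetical; the embodiment leaves open how the ETL device actually makes the choice, so the choice is simply passed in as an argument here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionRule:
    execution_time: str  # hypothetical format, e.g. "02:00"
    execution_mode: str  # hypothetical value, e.g. "batch"


@dataclass(frozen=True)
class Stage:
    source: str
    destination: str


def first_warehouse(stage: Stage, follow: str) -> str:
    """Return the warehouse whose execution rule governs the stage."""
    if follow == "source":
        return stage.source
    if follow == "destination":
        return stage.destination
    raise ValueError("follow must be 'source' or 'destination'")


def rule_for_stage(stage: Stage, follow: str, rules: dict) -> ExecutionRule:
    """Look up the execution rule that the tasks of the stage will follow."""
    return rules[first_warehouse(stage, follow)]


# Illustrative data only; the rule values are not taken from the patent.
rules = {
    "data source": ExecutionRule("02:00", "batch"),
    "detailed list library": ExecutionRule("03:00", "batch"),
}
stage_one = Stage("data source", "detailed list library")
chosen = rule_for_stage(stage_one, "source", rules)
```

Here the first stage follows the rule of its source data warehouse, so `chosen` is the rule registered for the data source.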

S102: Establish a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse.

The task replication table includes an entry for the source data warehouse and an entry for the destination data warehouse. The task replication table further includes a first parameter and a second parameter; the first parameter indicates that the first data warehouse is the source data warehouse of the stage; the second parameter indicates that the first data warehouse is the destination data warehouse of the stage.

Optionally, establishing the task replication table according to the logical relationship between the data warehouses in each stage and according to the first data warehouse specifically includes: determining the entry for the source data warehouse and the entry for the destination data warehouse according to the logical relationship between the source data warehouse and the destination data warehouse; determining the first parameter and the second parameter according to the first data warehouse; and establishing the task replication table from the entry for the source data warehouse, the entry for the destination data warehouse, the first parameter, and the second parameter.

For example, FIG. 2 is a schematic structural diagram of an MPP system provided by an embodiment of the present invention. Assume that the MPP system includes the following data warehouses: a data source (DataSource) 201, a detailed list library 202, an analysis database (AnalysisDatabase) 203, and a user characteristic database 204, which reside on file servers, a Hive server, SybaseIQ servers, and RTANA servers respectively. There are three file servers, three SybaseIQ servers, and three RTANA servers. The Hive cluster relies on the Hadoop cluster for distributed internal scheduling and provides a unified entry, so it can be regarded as a single Hive server from the point of view of the ETL device. As shown in FIG. 2, the logical relationship between the data warehouses in each stage is as follows: in the first stage, the source data warehouse is the data source 201 and the destination data warehouse is the detailed list library 202; in the second stage, the source data warehouse is the detailed list library 202 and the destination data warehouse is the analysis database 203; in the third stage, the source data warehouse is the analysis database 203 and the destination data warehouse is the user characteristic database 204.
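The FIG. 2 topology described above can be written down as plain data, which makes the stage chaining easy to check. This is a sketch only: the names `STAGES`, `SERVERS`, `is_chained`, and `warehouses_without_servers` are hypothetical, and the Hive cluster is counted as one server because it exposes a single unified entry.

```python
# (source data warehouse, destination data warehouse) for each stage in FIG. 2.
STAGES = [
    ("data source 201", "detailed list library 202"),
    ("detailed list library 202", "analysis database 203"),
    ("analysis database 203", "user characteristic database 204"),
]

# Server type and server count per data warehouse, as given in the example.
SERVERS = {
    "data source 201": ("file server", 3),
    "detailed list library 202": ("Hive server", 1),
    "analysis database 203": ("SybaseIQ server", 3),
    "user characteristic database 204": ("RTANA server", 3),
}


def is_chained(stages):
    """Return True if the destination of each stage is the source of the next."""
    return all(a[1] == b[0] for a, b in zip(stages, stages[1:]))


def warehouses_without_servers(stages, servers):
    """List the data warehouses named in a stage that have no server entry."""
    names = {name for stage in stages for name in stage}
    return sorted(names - set(servers))
```

For the data above, `is_chained(STAGES)` holds and `warehouses_without_servers(STAGES, SERVERS)` is empty, which is the precondition the later table-building steps rely on.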

The task replication table includes a source data warehouse entry and a destination data warehouse entry. It further includes a first parameter and a second parameter: the first parameter indicates that the first data warehouse is the source data warehouse of the stage, and the second parameter indicates that the first data warehouse is the destination data warehouse of the stage. Assume that the first parameter is 1 and the second parameter is 2. For example, if step S101 determines that the first data warehouse corresponding to the task execution rule of the first stage is the data source, that is, the tasks follow the execution rule of the data source, then the first parameter 1 is placed in the first row, first column of the table. Likewise, the first parameter 1 is placed in the second row, second column, and the second parameter 2 in the third row, third column. The task replication table can be established in this way.
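As a rough illustration of S102, the sketch below builds one row per stage holding the source entry, the destination entry, and the parameter. The row layout and the names `build_replication_table`, `FIRST_PARAMETER`, and `SECOND_PARAMETER` are assumptions made for readability and are not the row-and-column layout of the patent's own table. The first data warehouses chosen for stages two and three are inferred from the parameters 1 and 2 given in the example.

```python
FIRST_PARAMETER = 1   # the first data warehouse is the stage's source
SECOND_PARAMETER = 2  # the first data warehouse is the stage's destination


def build_replication_table(stages, first_warehouses):
    """Build one row per stage: source entry, destination entry, parameter."""
    table = []
    for (source, destination), first in zip(stages, first_warehouses):
        if first == source:
            parameter = FIRST_PARAMETER
        elif first == destination:
            parameter = SECOND_PARAMETER
        else:
            raise ValueError(f"{first!r} is not part of this stage")
        table.append(
            {"source": source, "destination": destination, "parameter": parameter}
        )
    return table


stages = [
    ("data source", "detailed list library"),
    ("detailed list library", "analysis database"),
    ("analysis database", "user characteristic database"),
]
# Assumed outcome of S101 for the three stages of the example.
first_warehouses = ["data source", "detailed list library", "user characteristic database"]
replication_table = build_replication_table(stages, first_warehouses)
```

With these inputs the parameter column reads 1, 1, 2, matching the values the example assigns to the three stages.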

The task replication table provided by this embodiment is as follows:

S103: Establish a task allocation table according to the distribution mode adopted by the servers corresponding to the second data warehouse.

Specifically, the second data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task allocation table includes the distribution mode adopted by the servers corresponding to each second data warehouse. The distribution modes include a shared-nothing distribution mode and a shared-disk distribution mode. In the present invention, the servers may also be deployed in other modes, such as an active/standby mode; the deployment is not limited to these distribution modes. The task allocation table includes the servers corresponding to the second data warehouse of each stage. Assume that the shared-nothing distribution mode is represented by 3 and the shared-disk distribution mode by 4. In this embodiment the second data warehouses all happen to be destination data warehouses, so the first row of the task allocation table lists, from left to right, the Hive server, the SybaseIQ server, and the RTANA server, which adopt the shared-disk, shared-disk, and shared-nothing distribution modes respectively.

The task allocation table provided by this embodiment is as follows:

Hive server | SybaseIQ server | RTANA server
3 | 3 | 4
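A task allocation table of this kind can be sketched as a mapping from each second data warehouse's server to its distribution mode. The names `DistributionMode` and `build_allocation_table` are hypothetical. The sketch records modes by name, following the prose description of S103 (shared disk, shared disk, shared nothing), rather than by numeric code.

```python
from enum import Enum


class DistributionMode(Enum):
    SHARED_NOTHING = "shared nothing"
    SHARED_DISK = "shared disk"


def build_allocation_table(second_warehouses, server_of, mode_of):
    """Map the server of each second data warehouse to its distribution mode."""
    table = {}
    for warehouse in second_warehouses:
        server = server_of[warehouse]
        table[server] = mode_of[server]
    return table


# In the example every second data warehouse is a destination data warehouse.
second_warehouses = [
    "detailed list library",
    "analysis database",
    "user characteristic database",
]
server_of = {
    "detailed list library": "Hive server",
    "analysis database": "SybaseIQ server",
    "user characteristic database": "RTANA server",
}
mode_of = {
    "Hive server": DistributionMode.SHARED_DISK,
    "SybaseIQ server": DistributionMode.SHARED_DISK,
    "RTANA server": DistributionMode.SHARED_NOTHING,
}
allocation_table = build_allocation_table(second_warehouses, server_of, mode_of)
```

Because Python dictionaries preserve insertion order, the keys of `allocation_table` come out in the same left-to-right order as the servers in the table.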

S104: Schedule the tasks of each stage according to the task replication table and the task allocation table.

Optionally, scheduling the tasks of each stage according to the task replication table and the task allocation table specifically includes: scheduling the tasks of each stage between the source data warehouse and the destination data warehouse of that stage according to the determined distribution mode.

Continuing the example above, assume that the tasks of the three stages are as follows:

Task 1: Download the original detailed lists from the file servers and load the data directly into the detailed list library on the Hive server.

Task 2: Export the original detailed lists from the detailed list library on the Hive server, then clean and aggregate the data and load it into the SybaseIQ servers.

Task 3: Export user attributes from the SybaseIQ servers, load them into the RTANA servers, and compute the user characteristics on the RTANA servers.

The tasks of the three stages determine the logical relationship between the data warehouses: in the first stage, the source data warehouse is the data source and the destination data warehouse is the detailed list library; in the second stage, the source data warehouse is the detailed list library and the destination data warehouse is the analysis database; in the third stage, the source data warehouse is the analysis database and the destination data warehouse is the user characteristic database. The task replication table is then established according to the logical relationship between the data warehouses in each stage and according to the first data warehouse.

The first row of the established task allocation table lists, from left to right, the Hive server, the SybaseIQ server, and the RTANA server, which adopt the shared-disk, shared-disk, and shared-nothing distribution modes respectively. The specific scheduling steps for the three tasks are therefore as follows:

1. According to the task replication table, task 1 is replicated into three copies, one per file server; then, according to the task allocation table, all copies are executed on the Hive server.

2. According to the task replication table, task 2 is replicated into one copy for the Hive server; then, according to the task allocation table, that copy is scheduled on an idle SybaseIQ server.

3. According to the task replication table, task 3 is replicated into three copies, one per RTANA server; then, according to the task allocation table, each RTANA server schedules one copy of task 3.
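The three scheduling steps above can be sketched as a single function. This is one possible reading of the example, not the patented implementation: the number of copies is taken to be the server count of the first data warehouse, a shared-nothing destination receives copy i on server i, and a shared-disk destination hands each copy to its least-loaded server, which stands in for "an idle server". The server names and the names `SERVERS`, `REPLICATION`, `ALLOCATION`, and `schedule` are hypothetical.

```python
from collections import Counter

SERVERS = {
    "data source": ["file-1", "file-2", "file-3"],
    "detailed list library": ["hive-1"],  # single unified Hive entry
    "analysis database": ["sybaseiq-1", "sybaseiq-2", "sybaseiq-3"],
    "user characteristic database": ["rtana-1", "rtana-2", "rtana-3"],
}

# (source, destination, parameter); 1 = first warehouse is the source,
# 2 = first warehouse is the destination.
REPLICATION = [
    ("data source", "detailed list library", 1),
    ("detailed list library", "analysis database", 1),
    ("analysis database", "user characteristic database", 2),
]

# Distribution mode of each destination warehouse's servers.
ALLOCATION = {
    "detailed list library": "shared_disk",
    "analysis database": "shared_disk",
    "user characteristic database": "shared_nothing",
}


def schedule(replication, allocation, servers):
    """Return (task, copy number, server) triples for every task copy."""
    plan = []
    for number, (source, destination, parameter) in enumerate(replication, start=1):
        first = source if parameter == 1 else destination
        copies = len(servers[first])  # one copy per server of the first warehouse
        targets = servers[destination]
        load = Counter()
        for copy in range(copies):
            if allocation[destination] == "shared_nothing":
                server = targets[copy % len(targets)]  # each node owns its data
            else:
                server = min(targets, key=lambda s: load[s])  # any node may run it
            load[server] += 1
            plan.append((f"task {number}", copy + 1, server))
    return plan


plan = schedule(REPLICATION, ALLOCATION, SERVERS)
```

Under these assumptions the plan places all three copies of task 1 on the Hive server, the single copy of task 2 on one SybaseIQ server, and one copy of task 3 on each RTANA server, which mirrors the three steps listed above.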

This embodiment provides an ETL scheduling method, including: first, determining the first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage; second, establishing a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse, and establishing a task allocation table according to the distribution mode adopted by the servers corresponding to the second data warehouse; and finally, scheduling the tasks of each stage according to the task replication table and the task allocation table. Because the stages of the MPP system no longer need multiple independent ETL devices, a single ETL device can schedule the tasks of every stage by establishing the task replication table and the task allocation table, which improves the management efficiency of the ETL device in the MPP system and reduces maintenance complexity.

FIG. 3 is a schematic structural diagram of an ETL scheduling device provided by an embodiment of the present invention. The device includes: a determination module 301, configured to determine the first data warehouse corresponding to the task execution rule of each stage, where the first data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of that stage; an establishment module 302, configured to establish a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse, where the task replication table includes an entry for the source data warehouse and an entry for the destination data warehouse; the establishment module 302 being further configured to establish a task allocation table according to the distribution mode adopted by the servers corresponding to the second data warehouse, where the second data warehouse is the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task allocation table includes the distribution mode adopted by the servers corresponding to each second data warehouse; and a scheduling module 303, configured to schedule the tasks of each stage according to the task replication table and the task allocation table.

Further, the task replication table also includes a first parameter and a second parameter; the first parameter indicates that the first data warehouse is the source data warehouse of the stage; the second parameter indicates that the first data warehouse is the destination data warehouse of the stage.

Optionally, the establishment module 302 is specifically configured to: determine the entry for the source data warehouse and the entry for the destination data warehouse according to the logical relationship between the source data warehouse and the destination data warehouse; determine the first parameter and the second parameter according to the first data warehouse; and establish the task replication table from the entry for the source data warehouse, the entry for the destination data warehouse, the first parameter, and the second parameter.

Optionally, the distribution mode includes a shared-nothing distribution mode and a shared-disk distribution mode.

Optionally, the scheduling module 303 is specifically configured to schedule the tasks of each stage between the source data warehouse and the destination data warehouse of that stage according to the determined distribution mode.

The ETL scheduling device provided in this embodiment can be used to carry out the technical solution of the ETL scheduling method corresponding to FIG. 1. Its implementation principle and technical effect are similar and are not repeated here.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An Extract-Transform-Load (ETL) scheduling method, characterized by comprising:

determining a first data warehouse corresponding to the task execution rules of each stage, the first data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage;

establishing a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse, the task replication table including: entries of the source data warehouse and entries of the destination data warehouse;

establishing a task allocation table according to the distributed mode adopted by the server corresponding to a second data warehouse, the second data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task allocation table including: the distributed mode adopted by the server corresponding to each second data warehouse; and

scheduling the tasks of each stage according to the task replication table and the task allocation table.

2. The method according to claim 1, wherein the task replication table further includes: a first parameter and a second parameter;

the first parameter is used to indicate that the first data warehouse is the source data warehouse of the stage; and

the second parameter is used to indicate that the first data warehouse is the destination data warehouse of the stage.

3. The method according to claim 2, wherein establishing the task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse specifically includes:

determining the entries of the source data warehouse and the entries of the destination data warehouse according to the logical relationship between the source data warehouse and the destination data warehouse;

determining the first parameter and the second parameter according to the first data warehouse; and

establishing the task replication table according to the entries of the source data warehouse, the entries of the destination data warehouse, the first parameter, and the second parameter.

4. The method according to any one of claims 1-3, wherein:

the distributed mode includes: a shared-nothing distributed mode and a shared-disk distributed mode.

5. The method according to claim 4, wherein scheduling the tasks of each stage according to the task replication table and the task allocation table specifically includes:

scheduling the tasks of each stage between the source data warehouse and the destination data warehouse of that stage according to the determined distributed mode.

6. An ETL scheduling device, characterized by comprising:

a determining module, configured to determine a first data warehouse corresponding to the task execution rules of each stage, the first data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage;

an establishment module, configured to establish a task replication table according to the logical relationship between the source data warehouse and the destination data warehouse and according to the first data warehouse, the task replication table including: entries of the source data warehouse and entries of the destination data warehouse;

the establishment module being further configured to establish a task allocation table according to the distributed mode adopted by the server corresponding to a second data warehouse, the second data warehouse being the source data warehouse or the destination data warehouse among the data warehouses of each stage, and the task allocation table including: the distributed mode adopted by the server corresponding to each second data warehouse; and

a scheduling module, configured to schedule the tasks of each stage according to the task replication table and the task allocation table.

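7. The device according to claim 6, wherein the task replication table further includes: a first parameter and a second parameter;

the first parameter is used to indicate that the first data warehouse is the source data warehouse of the stage; and

the second parameter is used to indicate that the first data warehouse is the destination data warehouse of the stage.

8. The device according to claim 7, wherein the establishment module is specifically configured to:

determine the entries of the source data warehouse and the entries of the destination data warehouse according to the logical relationship between the source data warehouse and the destination data warehouse;

determine the first parameter and the second parameter according to the first data warehouse; and

establish the task replication table according to the entries of the source data warehouse, the entries of the destination data warehouse, the first parameter, and the second parameter.

9. The device according to any one of claims 6-8, wherein:

the distributed mode includes: a shared-nothing distributed mode and a shared-disk distributed mode.

10. The device according to claim 9, wherein the scheduling module is specifically configured to:

schedule the tasks of each stage between the source data warehouse and the destination data warehouse of that stage according to the determined distributed mode.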