CN114880387A

CN114880387A - Data integration script generation method and device, storage medium and electronic equipment

Info

Publication number: CN114880387A
Application number: CN202210492533.0A
Authority: CN
Inventors: 牙祖将; 钱丽雯
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2022-05-07
Filing date: 2022-05-07
Publication date: 2022-08-09

Abstract

The application discloses a data integration script generation method and device, a storage medium and electronic equipment, which can be applied to the financial field or other fields. The method comprises the following steps: when a task script of a data integration task needs to be generated, data acquisition information is determined; according to the data acquisition information, determining a target metadata identifier and target metadata information corresponding to the target metadata identifier in a preset data lake metadata information base, and according to the target metadata information, determining a database type corresponding to the target metadata identifier; determining a script generation strategy corresponding to the database type; determining a data acquisition mode corresponding to the data integration task; and if the data acquisition mode is a full acquisition mode, generating a full data integration script according to the script generation strategy and the data acquisition information to obtain a task script of the data integration task. By the method, the automatic generation of the data integration script can be realized, the script does not need to be manually compiled, and the working efficiency can be improved.

Description

Data integration script generation method and device, storage medium and electronic device

技术领域technical field

本发明涉及数据处理技术领域，特别是涉及一种数据集成脚本生成方法及装置、存储介质及电子设备。The present invention relates to the technical field of data processing, and in particular, to a method and device for generating a data integration script, a storage medium and an electronic device.

背景技术Background technique

在企业机构的信息化建设中，数据湖是常用的数据存储架构之一。数据湖从企业的多个数据源获取原始数据，所有的数据以它本来的形式进行存储，包括结构化数据(如关系数据库数据)、半结构化数据(如CSV、XML、JSON等)、非结构化数据(如电子邮件，文档，PDF)和二进制数据(如图像、音频、视频)，从而形成一个容纳所有形式数据的集中式数据存储，进而为后续的报表、可视化分析、实时分析或者是机器学习等业务场景提供数据支撑。In the informatization construction of enterprise organizations, data lake is one of the commonly used data storage architectures. The data lake obtains raw data from multiple data sources of the enterprise, and all data is stored in its original form, including structured data (such as relational database data), semi-structured data (such as CSV, XML, JSON, etc.), non- Structured data (e.g. emails, documents, PDFs) and binary data (e.g. images, audio, video) to form a centralized data store for all forms of data for subsequent reporting, visual analysis, real-time analysis or Business scenarios such as machine learning provide data support.

在基于数据湖进行数据分析的过程中，需要对数据湖中的数据进行数据集成，即对数据进行抽取(Extract)、转换(Transform)以及加载(Load)到指定的数据存储中，也就是进行ETL(Extract-Transform-Load)处理。In the process of data analysis based on the data lake, it is necessary to perform data integration on the data in the data lake, that is, extract, transform and load the data into the specified data storage, that is, carry out ETL (Extract-Transform-Load) processing.

数据集成任务通常是基于数据集成脚本，即ETL脚本实现的。目前，ETL脚本主要是由技术人员运用数据库知识、数据表结构和SQL技术，根据自己对业务数据的分析，人工编写得到的。而基于人工编写ETL脚本的方式，需要耗费大量的人力资源，处理过程耗时长，效率较低。其次，在人工编写ETL脚本的过程中，需要人工核对数据来源，容易出现遗漏或差错，给数据集成工作带来不良影响。Data integration tasks are usually implemented based on data integration scripts, ie ETL scripts. At present, ETL scripts are mainly written manually by technicians using database knowledge, data table structure and SQL technology according to their own analysis of business data. However, the way of manually writing ETL scripts requires a lot of human resources, the processing process is time-consuming, and the efficiency is low. Secondly, in the process of manually writing ETL scripts, it is necessary to manually check the data sources, which is prone to omissions or errors, which has a negative impact on the data integration work.

发明内容SUMMARY OF THE INVENTION

有鉴于此，本发明实施例提供了一种数据集成脚本生成方法，以解决人工编写ETL脚本，耗时长，效率较低，容易出现纰漏的问题。In view of this, the embodiments of the present invention provide a method for generating a data integration script, so as to solve the problems that manually writing an ETL script is time-consuming, low in efficiency, and prone to errors.

本发明实施例还提供了一种数据集成脚本生成装置，用以保证上述方法实际中的实现及应用。The embodiment of the present invention also provides a data integration script generation device, so as to ensure the actual realization and application of the above method.

为实现上述目的，本发明实施例提供如下技术方案：To achieve the above purpose, the embodiments of the present invention provide the following technical solutions:

一种数据集成脚本生成方法，包括：A data integration script generation method, comprising:

当需要生成数据集成任务的任务脚本时，确定所述数据集成任务对应的数据采集信息；When a task script of the data integration task needs to be generated, determine the data collection information corresponding to the data integration task;

依据所述数据采集信息，在预设的数据湖元数据信息库中，确定所述数据集成任务对应的目标元数据标识以及所述目标元数据标识对应的目标元数据信息，所述数据湖元数据信息库中包含多个预设元数据标识和每个所述预设元数据标识对应的元数据信息；According to the data collection information, in the preset data lake metadata information database, determine the target metadata identifier corresponding to the data integration task and the target metadata information corresponding to the target metadata identifier, the data lake metadata The data information base contains a plurality of preset metadata identifiers and metadata information corresponding to each of the preset metadata identifiers;

依据所述目标元数据信息，确定所述目标元数据标识对应的数据库类型；According to the target metadata information, determine the database type corresponding to the target metadata identifier;

确定所述数据库类型对应的脚本生成策略；determining the script generation strategy corresponding to the database type;

确定所述数据集成任务对应的数据采集模式；determining the data collection mode corresponding to the data integration task;

若所述数据采集模式为全量采集模式，则依据所述脚本生成策略和所述数据采集信息，生成所述目标元数据标识对应的全量数据集成脚本，并将所述全量数据集成脚本作为所述数据集成任务的任务脚本。If the data collection mode is the full data collection mode, generate a full data integration script corresponding to the target metadata identifier according to the script generation strategy and the data collection information, and use the full data integration script as the Task scripts for data integration tasks.

上述的方法，可选的，所述确定所述数据集成任务对应的数据采集信息，包括：In the above method, optionally, the determining of the data collection information corresponding to the data integration task includes:

获取用户输入的业务主题、所属部门、所属产品、元数据标识和采集模式，并将所述业务主题、所述所属部门、所述所属产品、所述元数据标识和所述采集模式作为所述数据集成任务对应的数据采集信息。Obtain the business subject, the department, the product, the metadata identifier and the collection mode entered by the user, and use the business subject, the department, the product, the metadata identifier and the collection mode as the Data collection information corresponding to the data integration task.

上述的方法，可选的，所述依据所述数据采集信息，在预设的数据湖元数据信息库中，确定所述数据集成任务对应的目标元数据标识，包括：In the above method, optionally, determining the target metadata identifier corresponding to the data integration task in a preset data lake metadata information base according to the data collection information, including:

将所述数据采集信息分别与每个所述预设元数据标识对应的元数据信息进行匹配，并将与所述数据采集信息相匹配的元数据信息所对应的预设元数据标识，确定为所述目标元数据标识。The data collection information is respectively matched with the metadata information corresponding to each of the preset metadata identifiers, and the preset metadata identifier corresponding to the metadata information matched with the data collection information is determined as The target metadata identifier.

上述的方法，可选的，还包括：The above method, optionally, further includes:

若所述数据采集模式并非为所述全量采集模式，则确定所述数据集成任务对应的起止条件；If the data collection mode is not the full collection mode, determining the start and end conditions corresponding to the data integration task;

依据所述目标元数据信息和所述起止条件，对所述数据集成任务进行条件核查；According to the target metadata information and the starting and ending conditions, condition checking is performed on the data integration task;

若所述数据集成任务通过条件核查，则依据所述脚本生成策略、所述数据采集信息、所述起止条件和所述目标元数据信息，生成所述目标元数据标识对应的增量数据集成脚本，并将所述增量数据集成脚本作为所述数据集成任务的任务脚本。If the data integration task passes the condition check, an incremental data integration script corresponding to the target metadata identifier is generated according to the script generation strategy, the data collection information, the start and end conditions, and the target metadata information , and use the incremental data integration script as the task script of the data integration task.

上述的方法，可选的，所述依据所述目标元数据信息和所述起止条件，对所述数据集成任务进行条件核查，包括：In the above method, optionally, the condition checking on the data integration task according to the target metadata information and the start and end conditions includes:

确定所述目标元数据信息对应的增量信息，所述增量信息包括增量变量标识和增量变量数据结构；determining the incremental information corresponding to the target metadata information, where the incremental information includes an incremental variable identifier and an incremental variable data structure;

判断所述起止条件和所述增量信息是否相匹配；Judging whether the start and end conditions match the incremental information;

若所述起止条件和所述增量信息相匹配，则确定所述数据集成任务通过条件核查；If the start and end conditions match the incremental information, determine that the data integration task passes the condition check;

若所述起止条件和所述增量信息不匹配，则确定所述数据集成任务未通过条件核查。If the start and end conditions do not match the incremental information, it is determined that the data integration task fails the condition check.

若所述数据集成任务未通过条件核查，则进行报错提示，结束所述数据集成任务的任务脚本的生成过程。If the data integration task fails the condition check, an error message is prompted, and the generation process of the task script of the data integration task ends.

一种数据集成脚本生成装置，包括：A data integration script generation device, comprising:

第一确定单元，用于当需要生成数据集成任务的任务脚本时，确定所述数据集成任务对应的数据采集信息；a first determining unit, configured to determine data collection information corresponding to the data integration task when a task script of the data integration task needs to be generated;

第二确定单元，用于依据所述数据采集信息，在预设的数据湖元数据信息库中，确定所述数据集成任务对应的目标元数据标识以及所述目标元数据标识对应的目标元数据信息，所述数据湖元数据信息库中包含多个预设元数据标识和每个所述预设元数据标识对应的元数据信息；The second determining unit is configured to, according to the data collection information, determine the target metadata identifier corresponding to the data integration task and the target metadata corresponding to the target metadata identifier in the preset data lake metadata information database information, the data lake metadata information base contains multiple preset metadata identifiers and metadata information corresponding to each of the preset metadata identifiers;

第三确定单元，用于依据所述目标元数据信息，确定所述目标元数据标识对应的数据库类型；a third determining unit, configured to determine the database type corresponding to the target metadata identifier according to the target metadata information;

第四确定单元，用于确定所述数据库类型对应的脚本生成策略；a fourth determining unit, configured to determine a script generation strategy corresponding to the database type;

第五确定单元，用于确定所述数据集成任务对应的数据采集模式；a fifth determination unit, configured to determine a data collection mode corresponding to the data integration task;

第一生成单元，用于若所述数据采集模式为全量采集模式，则依据所述脚本生成策略和所述数据采集信息，生成所述目标元数据标识对应的全量数据集成脚本，并将所述全量数据集成脚本作为所述数据集成任务的任务脚本。The first generation unit is configured to generate a full data integration script corresponding to the target metadata identifier according to the script generation strategy and the data collection information if the data collection mode is the full collection mode, and integrate the The full data integration script is used as the task script of the data integration task.

上述的装置，可选的，还包括：The above-mentioned device, optionally, also includes:

第六确定单元，用于若所述数据采集模式并非为所述全量采集模式，则确定所述数据集成任务对应的起止条件；a sixth determination unit, configured to determine the start and end conditions corresponding to the data integration task if the data collection mode is not the full collection mode;

核查单元，用于依据所述目标元数据信息和所述起止条件，对所述数据集成任务进行条件核查；a verification unit, configured to perform condition verification on the data integration task according to the target metadata information and the start and end conditions;

第二生成单元，用于若所述数据集成任务通过条件核查，则依据所述脚本生成策略、所述数据采集信息、所述起止条件和所述目标元数据信息，生成所述目标元数据标识对应的增量数据集成脚本，并将所述增量数据集成脚本作为所述数据集成任务的任务脚本。A second generating unit, configured to generate the target metadata identifier according to the script generation strategy, the data collection information, the start and end conditions, and the target metadata information if the data integration task passes the condition check corresponding incremental data integration script, and use the incremental data integration script as the task script of the data integration task.

一种存储介质，所述存储介质包括存储的指令，其中，在所述指令运行时控制所述存储介质所在的设备执行如上述的数据集成脚本生成方法。A storage medium comprising stored instructions, wherein when the instructions are executed, a device where the storage medium is located is controlled to execute the above-mentioned method for generating a data integration script.

一种电子设备，包括存储器，以及一个或者一个以上的指令，其中一个或者一个以上指令存储于存储器中，且经配置以由一个或者一个以上处理器执行如上述的数据集成脚本生成方法。An electronic device includes a memory, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to execute, by one or more processors, the data integration script generation method as described above.

基于上述本发明实施例提供的一种数据集成脚本生成方法，包括：当需要生成数据集成任务的任务脚本时，确定数据集成任务对应的数据采集信息；依据数据采集信息，在预设的数据湖元数据信息库中，确定数据集成任务对应的目标元数据标识以及目标元数据信息，所述数据湖元数据信息库中包含多个预设元数据标识和每个所述预设元数据标识对应的元数据信息；依据目标元数据信息，确定目标元数据标识对应的数据库类型；确定该数据库类型对应的脚本生成策略；确定数据集成任务对应的数据采集模式；若该数据采集模式为全量采集模式，则依据脚本生成策略和数据采集信息，生成全量数据集成脚本，并将该全量数据集成脚本作为数据集成任务的任务脚本。应用本发明实施例提供的方法，可以通过自动化的处理过程生成数据集成任务的任务脚本，可实现数据集成脚本的自动化生成，无需人工编写脚本，可节省大量的人力资源，缩短处理过程耗时，提高工作效率。其次，可以避免人工编写脚本所带来的人为纰漏，避免人为原因给数据集成工作带来的不良影响。A method for generating a data integration script provided based on the above embodiments of the present invention includes: when a task script of a data integration task needs to be generated, determining data collection information corresponding to the data integration task; according to the data collection information, in a preset data lake In the metadata information database, the target metadata identifier and target metadata information corresponding to the data integration task are determined, and the data lake metadata information database contains a plurality of preset metadata identifiers corresponding to each of the preset metadata identifiers. According to the target metadata information, determine the database type corresponding to the target metadata identifier; determine the script generation strategy corresponding to the database type; determine the data collection mode corresponding to the data integration task; if the data collection mode is the full collection mode , a full data integration script is generated according to the script generation strategy and data collection information, and the full data integration script is used as the task script of the data integration task. By applying the method provided by the embodiment of the present invention, the task script of the data integration task can be generated through an automated processing process, and the automatic generation of the data integration script can be realized without manual script writing, which can save a lot of human resources and shorten the processing time. Improve work efficiency. Secondly, it can avoid human mistakes caused by manual scripting, and avoid the adverse effects of human factors on data integration work.

附图说明Description of drawings

为了更清楚地说明本发明实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据提供的附图获得其他的附图。In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments or the prior art. Obviously, the accompanying drawings in the following description are only It is an embodiment of the present invention. For those of ordinary skill in the art, other drawings can also be obtained according to the provided drawings without creative work.

图1为本发明实施例提供的一种数据集成脚本生成方法的方法流程图；1 is a method flowchart of a method for generating a data integration script provided by an embodiment of the present invention;

图2为本发明实施例提供的一种数据集成任务的提交界面的界面示意图；2 is a schematic interface diagram of a submission interface of a data integration task provided by an embodiment of the present invention;

图3为本发明实施例提供的一种数据集成脚本生成过程的示例图；3 is an exemplary diagram of a data integration script generation process provided by an embodiment of the present invention;

图4为本发明实施例提供的一种数据集成脚本生成装置的结构示意图；4 is a schematic structural diagram of an apparatus for generating a data integration script according to an embodiment of the present invention;

图5为本发明实施例提供的一种电子设备的结构示意图。FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

具体实施方式Detailed ways

下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

在本申请中，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。In this application, the terms "comprising", "comprising" or any other variation thereof are intended to encompass a non-exclusive inclusion such that a process, method, article or device comprising a list of elements includes not only those elements, but also no Other elements expressly listed, or which are also inherent to such a process, method, article or apparatus. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in a process, method, article or apparatus that includes the element.

本发明实施例提供了一种数据集成脚本生成方法，所述方法可应用于数据集成平台，其执行主体可以为数据集成平台的服务器，所述方法的方法流程图如图1所示，包括：An embodiment of the present invention provides a method for generating a data integration script. The method can be applied to a data integration platform, and the execution body of the method can be a server of the data integration platform. The method flowchart of the method is shown in FIG. 1 , including:

S101：当需要生成数据集成任务的任务脚本时，确定所述数据集成任务对应的数据采集信息；S101: when a task script of a data integration task needs to be generated, determine data collection information corresponding to the data integration task;

本发明实施例提供的方法中，用户可以通过前端提交数据集成任务的相关信息，以向服务器发送需要生成数据集成任务的任务脚本的触发指令。当服务器接收到该触发指令时，可以获取用户输入的相关信息，以确定当前需要生成任务脚本的数据集成任务对应的数据采集信息。数据采集信息可以包括需要采集的元数据的标识、其归属的部门、归属的产品等等信息。In the method provided by the embodiment of the present invention, the user can submit the relevant information of the data integration task through the front end, so as to send the trigger instruction of the task script that needs to generate the data integration task to the server. When the server receives the trigger instruction, it can obtain the relevant information input by the user to determine the data collection information corresponding to the data integration task that needs to generate the task script currently. The data collection information may include the identification of the metadata to be collected, the department to which it belongs, the product to which it belongs, and so on.

S102：依据所述数据采集信息，在预设的数据湖元数据信息库中，确定所述数据集成任务对应的目标元数据标识以及所述目标元数据标识对应的目标元数据信息，所述数据湖元数据信息库中包含多个预设元数据标识和每个所述预设元数据标识对应的元数据信息；S102: According to the data collection information, in a preset data lake metadata information database, determine a target metadata identifier corresponding to the data integration task and target metadata information corresponding to the target metadata identifier, and the data The lake metadata information base contains multiple preset metadata identifiers and metadata information corresponding to each of the preset metadata identifiers;

本发明实施例提供的方法中，可以预先根据数据湖的各个元数据项，预先设置对应的各个元数据标识和每个元数据标识对应的元数据信息，以构建数据湖元数据信息库，该数据湖元数据信息库中包含多个预设元数据标识和每个预设元数据标识对应的元数据信息，各个预设元数据标识对应的元数据信息与数据湖中的各个元数据项相对应。预设元数据标识对应的元数据信息可以包括该预设元数据标识、元数据名称、数据库类型、采集模式、增量变量标识和增量变量数据结构等等信息。In the method provided by the embodiment of the present invention, corresponding metadata identifiers and metadata information corresponding to each metadata identifier may be preset according to each metadata item of the data lake, so as to construct a metadata information database of the data lake. The data lake metadata information base contains multiple preset metadata identifiers and metadata information corresponding to each preset metadata identifier, and the metadata information corresponding to each preset metadata identifier is related to each metadata item in the data lake. correspond. The metadata information corresponding to the preset metadata identifier may include information such as the preset metadata identifier, metadata name, database type, collection mode, incremental variable identifier, and incremental variable data structure.

本发明实施例提供的方法中，可以依据数据集成任务的数据采集信息，在数据湖元数据信息库中，确定目标元数据标识，并将数据湖元数据信息库中，该目标元数据标识对应的元数据信息，确定为目标元数据信息。In the method provided by the embodiment of the present invention, the target metadata identifier can be determined in the data lake metadata information base according to the data collection information of the data integration task, and the target metadata identifier corresponding to the data lake metadata information base can be determined. The metadata information is determined as the target metadata information.

S103：依据所述目标元数据信息，确定所述目标元数据标识对应的数据库类型；S103: According to the target metadata information, determine the database type corresponding to the target metadata identifier;

本发明实施例提供的方法中，可以从目标元数据信息中获取目标元数据标识对应的数据库类型，也就是存储该目标元数据标识对应的元数据的数据库的类型。数据库类型可以为Hive、HBase、Hdfs或是Nosql等类型，也可以为其他数据库类型。In the method provided by the embodiment of the present invention, the database type corresponding to the target metadata identifier can be obtained from the target metadata information, that is, the type of the database storing the metadata corresponding to the target metadata identifier. The database type can be Hive, HBase, Hdfs, or Nosql, or other database types.

S104：确定所述数据库类型对应的脚本生成策略；S104: Determine a script generation strategy corresponding to the database type;

本发明实施例提供的方法中，可以预先根据各个类型的数据库的数据集成方式，设置每类数据库对应的脚本生成策略，具体的，可以设置脚本模板，通过模板参数替换用以生成脚本，也可以预设脚本语句，导入相关参数用于生成脚本等等。In the method provided by the embodiment of the present invention, a script generation strategy corresponding to each type of database can be set in advance according to the data integration mode of each type of database. Specifically, a script template can be set, and the template parameter can be replaced to generate a script, or Preset script statements, import relevant parameters to generate scripts, etc.

本发明实施例提供的方法中，可以通过目标元数据标识对应的数据库类型，在各个预设的脚本生成策略中，匹配得到该数据库类型对应的脚本生成策略。In the method provided by the embodiment of the present invention, the corresponding database type may be identified by target metadata, and among each preset script generation strategy, the script generation strategy corresponding to the database type is obtained by matching.

S105：确定所述数据集成任务对应的数据采集模式；S105: Determine the data collection mode corresponding to the data integration task;

本发明实施例提供的方法中，用户可以通过前端设置数据集成任务的采集模式，该信息可以包含在数据采集信息中，可以从数据采集信息中获取用户设置的采集模式，将该采集模式确定为数据集成任务对应的数据采集模式。具体的，数据采集模式可以为全量采集模式或增量采集模式等模式。In the method provided by the embodiment of the present invention, the user can set the collection mode of the data integration task through the front end, the information can be included in the data collection information, the collection mode set by the user can be obtained from the data collection information, and the collection mode is determined as The data collection mode corresponding to the data integration task. Specifically, the data collection mode may be a full collection mode or an incremental collection mode.

S106：若所述数据采集模式为全量采集模式，则依据所述脚本生成策略和所述数据采集信息，生成所述目标元数据标识对应的全量数据集成脚本，并将所述全量数据集成脚本作为所述数据集成任务的任务脚本。S106: If the data collection mode is the full collection mode, generate a full data integration script corresponding to the target metadata identifier according to the script generation strategy and the data collection information, and use the full data integration script as The task script for the data integration task.

本发明实施例提供的方法中，若数据采集模式为全量采集模式，则依据脚本生成策略和数据采集信息，生成全量数据集成脚本，也就是ETL脚本。将该全量数据集成脚本作为数据集成任务的任务脚本。例如，可以依据脚本生成策略中全量采集的脚本模板，基于数据采集信息中的元数据标识、业务主题、所属部门、所属产品等数据，进行参数替换，基于完成参数替换的脚本模板生成对应的全量数据集成脚本。In the method provided by the embodiment of the present invention, if the data collection mode is the full data collection mode, a full data integration script, that is, an ETL script, is generated according to the script generation strategy and data collection information. Use the full data integration script as the task script of the data integration task. For example, according to the script template of the full collection in the script generation strategy, parameters can be replaced based on the metadata identification, business subject, department, product and other data in the data collection information, and based on the script template that has completed the parameter replacement, the corresponding full Data integration scripts.

本发明实施例提供的方法中，可以将生成的任务脚本展示给用户，用户可以对任务脚本进行检查，后续可以通过调用数据集成引擎执行数据集成任务的任务脚本，也就是调用ETL执行引擎完成ETL的执行。In the method provided by the embodiment of the present invention, the generated task script can be displayed to the user, the user can check the task script, and subsequently the task script of the data integration task can be executed by invoking the data integration engine, that is, the ETL execution engine can be invoked to complete the ETL execution.

基于本发明实施例提供的方法，当需要生成数据集成任务的任务脚本时，确定数据集成任务对应的数据采集信息；依据数据采集信息，在预设的数据湖元数据信息库中，确定数据集成任务对应的目标元数据标识以及目标元数据标识对应的目标元数据信息，依据目标元数据信息，确定目标元数据标识对应的数据库类型；确定数据库类型对应的脚本生成策略；确定数据集成任务对应的数据采集模式；若数据采集模式为全量采集模式，则依据脚本生成策略和数据采集信息，生成目标元数据标识对应的全量数据集成脚本，并将该全量数据集成脚本作为数据集成任务的任务脚本。应用本发明实施例提供的方法，可以基于数据集成任务的数据采集信息和预设的数据湖元数据信息库确定脚本生成策略，继而生成任务脚本。可以通过自动化的处理过程生成数据集成任务的任务脚本，可实现数据集成脚本的自动化生成，无需人工编写脚本，可节省大量的人力资源，缩短处理过程耗时，提高工作效率。其次，可以避免人工编写脚本所带来的人为纰漏，避免人为原因给数据集成工作带来的不良影响。Based on the method provided by the embodiment of the present invention, when the task script of the data integration task needs to be generated, the data collection information corresponding to the data integration task is determined; according to the data collection information, in the preset data lake metadata information database, the data integration task is determined The target metadata identifier corresponding to the task and the target metadata information corresponding to the target metadata identifier, according to the target metadata information, determine the database type corresponding to the target metadata identifier; determine the script generation strategy corresponding to the database type; determine the corresponding data integration task. Data collection mode; if the data collection mode is the full collection mode, the full data integration script corresponding to the target metadata identifier is generated according to the script generation strategy and data collection information, and the full data integration script is used as the task script of the data integration task. By applying the method provided by the embodiment of the present invention, a script generation strategy can be determined based on the data collection information of a data integration task and a preset data lake metadata information base, and then a task script can be generated. The task script of the data integration task can be generated through the automated processing process, and the automatic generation of the data integration script can be realized without manual script writing, which can save a lot of human resources, shorten the processing time, and improve the work efficiency. Secondly, it can avoid human mistakes caused by manual scripting, and avoid the adverse effects of human factors on data integration work.

在图1所示方法的基础上，本发明实施例提供的方法中，步骤S101中提及的确定所述数据集成任务对应的数据采集信息的过程，包括：On the basis of the method shown in FIG. 1 , in the method provided by the embodiment of the present invention, the process of determining the data collection information corresponding to the data integration task mentioned in step S101 includes:

本发明实施例提供的方法中，用户可以在前端输入需要采集的元数据的相关信息，包括业务主题、所属部门、所属产品、元数据标识和采集模式，将用户输入的这些信息作为数据集成任务对应的数据采集信息。In the method provided by the embodiment of the present invention, the user can input the relevant information of the metadata to be collected at the front end, including the business subject, the affiliated department, the affiliated product, the metadata identification and the collection mode, and the information input by the user can be regarded as the data integration task. Corresponding data collection information.

需要说明的是，本发明实施例提供的方法中提及的数据采集信息的具体内容仅是为了更好地说明本发明提供的方法所提供的一个具体实施例，在具体的实现过程中，数据采集信息中还可以包含其他数据内容。It should be noted that the specific content of the data collection information mentioned in the method provided by the embodiment of the present invention is only to better illustrate a specific embodiment provided by the method provided by the present invention. The collected information may also contain other data content.

在图1所示方法的基础上，本发明实施例提供的方法中，步骤S102中提及的，依据所述数据采集信息，在预设的数据湖元数据信息库中，确定所述数据集成任务对应的目标元数据标识的过程，包括：On the basis of the method shown in FIG. 1, in the method provided by the embodiment of the present invention, as mentioned in step S102, according to the data collection information, in the preset data lake metadata information database, determine the data integration The process of identifying the target metadata corresponding to the task, including:

本发明实施例提供的方法中，可以将数据湖元数据信息库中各个元数据信息分别与数据采集信息进行匹配，具体的，可以将元数据信息中包含的元数据标识与数据采集信息中的元数据标识进行比对，若两者相同，则认为该元数据信息与数据采集信息相匹配。将与数据采集信息相匹配的元数据信息所对应的预设元数据标识作为目标元数据标识。在具体的应用场景中，数据湖元数据信息库中的各个预设元数据标识互不相同，故通常仅有一个元数据信息与数据采集信息相匹配。In the method provided by the embodiment of the present invention, each metadata information in the metadata information database of the data lake can be matched with the data collection information respectively. Specifically, the metadata identifier included in the metadata information can be matched with the metadata in the data collection information The metadata identifiers are compared, and if the two are the same, it is considered that the metadata information matches the data collection information. The preset metadata identifier corresponding to the metadata information matching the data collection information is used as the target metadata identifier. In a specific application scenario, the preset metadata identifiers in the metadata information database of the data lake are different from each other, so usually only one metadata information matches the data collection information.

在图1所示方法的基础上，本发明实施例提供的方法中，还包括：On the basis of the method shown in FIG. 1, the method provided by the embodiment of the present invention further includes:

本发明实施例提供的方法中，如果数据集成任务对应的数据采集模式并非全量采集模式，即所述数据采集模式为增量采集模式。当用户选择增量采集模式时，可以通过前端输入数据采集的起止条件，所述起止条件可以包括起始条件和终止条件，也可以仅包含起始条件而不设终止条件。起止条件指的是表征数据范围的条件，例如采集第一预设时间点之后，第二预设时间点之前产生的数据等等。In the method provided by the embodiment of the present invention, if the data collection mode corresponding to the data integration task is not the full collection mode, that is, the data collection mode is the incremental collection mode. When the user selects the incremental acquisition mode, the start and end conditions of the data acquisition can be input through the front end, and the start and end conditions may include a start condition and an end condition, or only a start condition without an end condition. The start and end conditions refer to conditions that characterize the data range, such as data generated after the first preset time point and before the second preset time point, and so on.

本发明实施例提供的方法中，可基于目标元数据信息和起止条件，对数据集成任务进行条件核查，也就是根据目标元数据信息判断该数据集成任务对应的起止条件，是否是可行的。In the method provided by the embodiment of the present invention, the condition check of the data integration task can be performed based on the target metadata information and the start and end conditions, that is, it is determined whether the start and end conditions corresponding to the data integration task are feasible according to the target metadata information.

本发明实施例提供的方法中，如果该数据集成任务通过了条件核查，则基于脚本生成策略、数据采集信息、起止条件和目标元数据信息，生成增量数据集成脚本，将该增量数据集成脚本作为数据集成任务的任务脚本。具体的，可以依据脚本生成策略中增量采集的脚本模板，基于数据采集信息中的元数据标识、业务主题、所属部门、所属产品等数据、起止条件以及目标元数据信息中包含的增量变量标识等数据，进行参数替换，基于完成参数替换的脚本模板生成增量数据集成脚本。In the method provided by the embodiment of the present invention, if the data integration task passes the condition verification, an incremental data integration script is generated based on the script generation strategy, data collection information, start and end conditions, and target metadata information, and the incremental data integration script is generated. Scripts are task scripts for data integration tasks. Specifically, the script template for incremental collection in the script generation strategy can be generated based on the metadata identification, business subject, department, product and other data in the data collection information, start and end conditions, and incremental variables contained in the target metadata information. Identify data such as identifiers, perform parameter replacement, and generate incremental data integration scripts based on the script template that completes the parameter replacement.

在上述实施例提供的方法的基础上，本发明实施例提供的方法中，所述依据所述目标元数据信息和所述起止条件，对所述数据集成任务进行条件核查的过程，包括：On the basis of the method provided by the above embodiment, in the method provided by the embodiment of the present invention, the process of performing condition verification on the data integration task according to the target metadata information and the start and end conditions includes:

本发明实施例提供的方法中，目标元数据信息中包含目标元数据标识对应的增量变量标识和增量变量数据结构，可将增量变量标识和增量变量数据结构确定为增量信息。In the method provided by the embodiment of the present invention, the target metadata information includes an incremental variable identifier and an incremental variable data structure corresponding to the target metadata identifier, and the incremental variable identifier and the incremental variable data structure can be determined as incremental information.

本发明实施例提供的方法中，可以通过起止条件中涉及的增量变量，判断该起止条件与所述增量信息是否相匹配，例如起止条件中涉及的增量变量是时间，而增量信息中的增量变量标识和增量变量数据结构所表征的增量变量也是时间，则确定起止条件和增量信息相匹配，反之，若起止条件中涉及的增量变量和增量信息中表征的增量变量不同，则认为两者不匹配。In the method provided by this embodiment of the present invention, it can be determined whether the start and end conditions match the incremental information through the increment variables involved in the start and end conditions. For example, the increment variable involved in the start and end conditions is time, while the increment information The incremental variable identifier in the incremental variable and the incremental variable represented by the incremental variable data structure are also time, then it is determined that the start and end conditions match the incremental information. If the increment variable is different, it is considered that the two do not match.

本发明实施例提供的方法中，若起止条件与增量信息相匹配，则确定数据集成任务通过了条件核查，反之，则未通过条件核查。In the method provided by the embodiment of the present invention, if the start and end conditions match the incremental information, it is determined that the data integration task has passed the condition check; otherwise, the condition check has not been passed.

在上述实施例提供的方法的基础上，本发明实施例提供的方法中，还包括：On the basis of the methods provided in the foregoing embodiments, the methods provided in the embodiments of the present invention further include:

本发明实施例提供的方法中，如果数据集成任务未通过条件核查，则报告错误信息，结束任务脚本的生成过程。In the method provided by the embodiment of the present invention, if the data integration task fails the condition check, an error message is reported, and the generation process of the task script is ended.

为了更好地说明本发明实施例提供的方法，接下来结合实际的应用场景，本发明实施例提供了又一种数据集成脚本生成方法。In order to better illustrate the method provided by the embodiment of the present invention, the embodiment of the present invention provides another method for generating a data integration script in combination with an actual application scenario.

本发明实施例提供的方法应用于银行机构中，基于数据湖架构的数据分析场景。数据湖架构是在系统或存储库中以自然格式存储数据的方式。数据湖是一个存储企业的各种各样原始数据的大型仓库，其中的数据可供存取、处理、分析及传输。数据湖的三个层次，分为数据库等底层存储、元数据管理、跨不同数据源的SQL引擎。传统的数据仓库将数据存储在关系表中，而数据湖则使用平面结构。每个数据元素被分配唯一标识符，并用一组元数据标签进行标记。The method provided by the embodiment of the present invention is applied to a data analysis scenario based on a data lake architecture in a banking institution. A data lake architecture is a way of storing data in a natural format in a system or repository. A data lake is a large warehouse that stores various raw data of an enterprise, which can be accessed, processed, analyzed and transmitted. The three levels of the data lake are divided into underlying storage such as databases, metadata management, and SQL engines across different data sources. Whereas traditional data warehouses store data in relational tables, data lakes use a flat structure. Each data element is assigned a unique identifier and is marked with a set of metadata tags.

本发明实施例提供的方法的应用过程，包括：The application process of the method provided by the embodiment of the present invention includes:

梳理数据湖中所有企业级元数据及其所属的业务主题、所属部门、所属产品、采集模式、增量变量标识和增量变量数据结构等与ETL有关的信息，为后续的ETL参数(即元数据信息)导入做准备，梳理内容可以按照下表所示结构进行记录：Sort out all enterprise-level metadata in the data lake and its related business topics, departments, products, collection modes, incremental variable identifiers, and incremental variable data structures and other ETL-related information, as the subsequent ETL parameters (that is, metadata). data information) to prepare for import, and the sorting content can be recorded according to the structure shown in the following table:

表1Table 1

其中，指标项由高阶到低阶的顺序为：所属业务主题>所属部门>所属产品>元数据。Among them, the order of index items from high-level to low-level is: business topic to which they belong > department to which they belong > product to which they belong > metadata.

建立任务编号，为后续导入元数据参数信息和新增ETL参数做准备，任务编号数据结构可以如下所示：Create a task number to prepare for the subsequent import of metadata parameter information and new ETL parameters. The task number data structure can be as follows:

表2Table 2

首次导入的铺底数据，需要建立首次铺底所需任务编号，为后续的铺底导入做准备。For the first-time imported base-laying data, the task number required for the first base-laying needs to be established to prepare for the subsequent base-laying import.

选择建立好的任务编号，导入元数据ETL参数(即元数据标识对应的元数据信息)，对于元数据ETL参数的导入，导入时系统为每个数据元素分配唯一标识符，并用一组元数据标签进行标记，元数据标签(共18个字节)的内容如下：Select the established task number and import the metadata ETL parameters (that is, the metadata information corresponding to the metadata identifier). For the import of the metadata ETL parameters, the system assigns a unique identifier to each data element during import, and uses a set of metadata The content of the metadata tag (18 bytes in total) is as follows:

表3table 3

存储元数据ETL参数表，其中，ETL参数表结构可以如下所示：Store the metadata ETL parameter table, where the structure of the ETL parameter table can be as follows:

表4Table 4

其中，数据库类型可以为Hive、HBase、Hdfs、Nosql等类型，但不限于这些类型。The database type may be Hive, HBase, Hdfs, Nosql, etc., but is not limited to these types.

构建ETL参数表后，当有数据分析需要获取加载数据时，用户可根据数据分析师提供的取数规则，选择需要ETL的元数据、采集模式、增量条件或者分段区间，系统根据用户的选择先是自动匹配ETL参数表中的参数，获取对应元数据的数据库类型，如采集模式是增量，则获取增量变量标识，自动生成本次数据分析的ETL脚本。After the ETL parameter table is constructed, when data analysis needs to obtain and load data, the user can select the metadata, collection mode, incremental condition or segment interval that requires ETL according to the number fetching rules provided by the data analyst. Choose to automatically match the parameters in the ETL parameter table, and obtain the database type of the corresponding metadata. If the acquisition mode is incremental, obtain the incremental variable identifier, and automatically generate the ETL script for this data analysis.

ETL参数支持增、删、改、查维护，当将来新增ETL参数时，系统根据经办员所选择的任务编号、元数据名称、业务主题、所属部门、所属产品、采集模式和录入的增量变量标识、增量变量数据结构等信息，进行参数合法性检查，检查通过后，建立该元数据的ETL参数，ETL参数的查询支持元数据级ETL参数查询和全景级列表查询，其中元数据级ETL参数查询展示的是与该元数据有关的所有ETL参数，全景级列表查询是展示所有元数据ETL参数。ETL parameters support adding, deleting, modifying, checking and maintaining. When ETL parameters are added in the future, the system will add the task number, metadata name, business subject, department, product, collection mode and input selected by the operator. Quantitative variable identification, incremental variable data structure and other information, the parameter validity check is carried out. After the check is passed, the ETL parameters of the metadata are established. The query of ETL parameters supports metadata-level ETL parameter query and panorama-level list query. Metadata The level ETL parameter query displays all ETL parameters related to the metadata, and the panorama level list query displays all metadata ETL parameters.

由于ETL工作跟任务是紧密联系的，只有在数据分析任务启动后才提供相关数据ETL工作，该处理机制主要是增加数据湖数据ETL的授权，只有授权后才能进行相关的数据ETL。Since the ETL work is closely related to the task, the related data ETL work is only provided after the data analysis task is started. The processing mechanism is mainly to increase the authorization of the data lake data ETL.

本发明实施例提供的方法可应用于数据集成平台，该数据集成平台是对于图1所示方法的一种实例化，该平台可以包括以下模块：The method provided by the embodiment of the present invention can be applied to a data integration platform. The data integration platform is an instantiation of the method shown in FIG. 1 , and the platform may include the following modules:

联机模块，联机模块主要包括下列几个子模块：Online module, online module mainly includes the following sub-modules:

任务维护子模块：任务维护子模块主要是任务信息的维护功能，主要是ETL参数表中ETL参数的增删改查联机维护交易，主要信息包括元数据标签、元数据名称、采集模式、增量变量标识、增量变量数据结构、任务号等有关的信息联机维护交易。数据湖中数据分析所需的ETL主要是由数据分析的任务驱动的，没有数据分析任务，任何人和组织都不能随意更改已经发布过的ETL参数，包括首次铺底也需要建立铺底任务后才能进行铺底导入。Task maintenance sub-module: The task maintenance sub-module is mainly responsible for the maintenance function of task information, mainly the addition, deletion, modification and checking of ETL parameters in the ETL parameter table Online maintenance transactions, the main information includes metadata tags, metadata names, collection modes, incremental variables Information about identification, incremental variable data structures, task numbers, etc. is maintained online for transactions. The ETL required for data analysis in the data lake is mainly driven by the task of data analysis. Without the task of data analysis, no one or organization can arbitrarily change the ETL parameters that have been published, including the first laying of the groundwork. Bottom import.

批次任务信息的维护主要有增、改、删、查，四个功能，每一项操作都必须经过主管的授权才可以生效。The maintenance of batch task information mainly includes four functions: adding, modifying, deleting, and checking. Each operation must be authorized by the supervisor before it can take effect.

元数据的ETL参数新增子模块：该子模块主要是完成元数据的ETL参数新增工作，元数据的ETL参数新增有单笔新增和批量导入两种方式。单笔新增联机交易界面的主要栏位包括：元数据名称、采集模式、所属业务主题、所属部门、所属产品、数据库类型、增量变量标识、增量变量数据结构、任务编号，其中任务编号为事先建立的任务标号，为下拉菜单选择。批量导入支持按照表1的模板填写EXCEL导入表后，通过导入表加载一条或多条ETL参数的功能。新增ETL参数时，系统会根据经办员所录入的元数据名称、元数据所处的业务主题、所属的产品等信息，进行ETL参数的合法性检查，检查通过后，系统才允许新增该ETL参数记录。New sub-module for ETL parameters of metadata: This sub-module mainly completes the addition of ETL parameters of metadata. There are two ways to add ETL parameters of metadata: single addition and batch import. The main fields of the new online transaction interface for a single transaction include: metadata name, collection mode, business subject, department, product, database type, incremental variable identifier, incremental variable data structure, and task number, among which the task number For the pre-established task label, select from the drop-down menu. Batch import supports the function of loading one or more ETL parameters through the import table after filling in the EXCEL import table according to the template in Table 1. When adding ETL parameters, the system will check the validity of the ETL parameters according to the metadata name entered by the operator, the business subject of the metadata, and the product it belongs to. The ETL parameter record.

元数据标签生成子模块：当用户输入或导入完成后，用户提交新增交易，系统根据用户提交的元数据ETL参数相关信息编码生成该录入或导入的元数据标签，并将之作为一项信息于后续的存储模块中存储在数据库中。Metadata label generation sub-module: When the user submits a new transaction after the user's input or import is completed, the system generates the entered or imported metadata label according to the relevant information encoding of the metadata ETL parameters submitted by the user, and treats it as a piece of information Stored in the database in subsequent storage modules.

查询子模块：ETL参数的查询支持单记录级查询和全景级列表查询，其中单记录级查询展示的是与该记录有关的所有ETL参数，全景级列表查询是展示所有记录的ETL参数。Query sub-module: The query of ETL parameters supports single-record-level query and panorama-level list query. The single-record-level query displays all ETL parameters related to the record, and the panorama-level list query displays the ETL parameters of all records.

删除子模块：与新增操作相反，删除操作则将从ETL参数表中去掉对应记录的ETL参数。Delete submodule: Contrary to the add operation, the delete operation will remove the ETL parameter of the corresponding record from the ETL parameter table.

修改子模块：该子模块支持在ETL参数列表查询中选择某一条列表进行修改。系统会根据修改的信息，重新生成元数据标签信息，并更新到被修改记录中的相关栏位。Modify sub-module: This sub-module supports selecting a list to modify in the ETL parameter list query. The system will regenerate the metadata tag information based on the modified information and update it to the relevant fields in the modified record.

存储模块：主要包括ETL参数存储。Storage module: mainly includes ETL parameter storage.

ETL生成模块：主要是提供一种可视化的联机交易画面，让用户根据需要进行数据分析所需的ETL生成条件的选择，提交后自动生成其所需的ETL脚本。ETL generation module: It mainly provides a visual online transaction screen, allowing users to select the ETL generation conditions required for data analysis according to their needs, and automatically generate the required ETL scripts after submission.

本发明实施例提供的方法中，用户提交数据集成任务相关信息的用户界面可以如图2所示，包括任务编号的选择，业务主题、所属部门、所属产品、元数据(标识)、采集模式、起始条件和终止条件的输入，其中，起始条件和终止条件可手动填写，其他数据项可提供下拉菜单选择。In the method provided by the embodiment of the present invention, the user interface for the user to submit the relevant information of the data integration task may be as shown in FIG. 2, including the selection of the task number, the business topic, the department, the product, the metadata (identification), the collection mode, Input of start condition and end condition, among which, start condition and end condition can be filled in manually, and other data items can be selected from drop-down menu.

如图3所示，本发明实施例提供的数据集成脚本生成过程，具体包括：As shown in FIG. 3 , the data integration script generation process provided by the embodiment of the present invention specifically includes:

S201：用户提交任务；S201: The user submits the task;

本发明实施例提供的方法中，可初始化如图2所示的前端界面，用户可通过该前端界面，输入ETL的相关信息，包括选择数据分析任务编号、数据所属业务主题、所属部门、所属产品、选择元数据(标识)以及选择采集模式，通过点击提交控件，提交生成ETL脚本的任务。In the method provided by the embodiment of the present invention, the front-end interface shown in FIG. 2 can be initialized, and the user can input the relevant information of the ETL through the front-end interface, including selecting the data analysis task number, the business topic to which the data belongs, the department to which it belongs, and the product to which it belongs. , select the metadata (identification) and select the collection mode, and submit the task of generating the ETL script by clicking the submit control.

S202：判断用户选择的采集模式是否为全量采集模式；S202: Determine whether the collection mode selected by the user is the full collection mode;

S203：若该采集模式不是全量采集模式，则判断该采集模式是否为区间采集模式；S203: If the collection mode is not the full collection mode, determine whether the collection mode is the interval collection mode;

本发明实施例提供的方法中，可以将增量采集模式进一步细分为区间采集模式和非区间采集模式，在区间采集模式下，用户需要输入起始条件和终止条件，在非区间采集模式下，用户仅需输入起始条件。In the method provided by the embodiment of the present invention, the incremental acquisition mode can be further subdivided into an interval acquisition mode and a non-interval acquisition mode. In the interval acquisition mode, the user needs to input a start condition and an end condition. , the user only needs to enter the starting conditions.

S204：若该采集模式为区间采集模式，则获取用户输入的起始条件和终止条件，以确定起止条件；S204: If the collection mode is the interval collection mode, obtain the start condition and the end condition input by the user to determine the start and end conditions;

S205：若该采集模式为非区间采集模式，则获取用户输入的起始条件，默认终止条件也为该起始条件，以确定起止条件；S205: If the collection mode is a non-interval collection mode, obtain the starting condition input by the user, and the default ending condition is also the starting condition, so as to determine the starting and ending conditions;

S206：读入用户选择的元数据所对应的ETL参数；S206: Read in the ETL parameters corresponding to the metadata selected by the user;

S207：从ETL参数中获取数据库类型、增量变量标识和增量变量标识对应的数据结构；S207: Obtain the data structure corresponding to the database type, the incremental variable identifier and the incremental variable identifier from the ETL parameters;

S208：检查增量变量的数据结构(即增量变量标识对应的数据结构)与起止条件是否相匹配；S208: Check whether the data structure of the incremental variable (that is, the data structure corresponding to the incremental variable identifier) matches the start and end conditions;

S209：根据检查结果，判断检查是否通过；S209: According to the inspection result, determine whether the inspection has passed;

S210：如果检查通过，则依据用户输入的信息、数据库类型、增量变量标识和起止条件生成增量的ETL脚本；S210: If the check is passed, an incremental ETL script is generated according to the information input by the user, the database type, the incremental variable identifier and the starting and ending conditions;

S211：向用户展示ETL脚本，以便于用户进行复核；S211: Show the ETL script to the user so that the user can review it;

S212：如果检查未通过，则报错退出；S212: If the check fails, report an error and exit;

S213：如果在步骤S202的判断过程中，经判断该采集模式为全量采集模式，则读取用户选择的元数据所对应的ETL参数；S213: If in the judgment process of step S202, it is judged that the collection mode is the full collection mode, then read the ETL parameter corresponding to the metadata selected by the user;

S214：从ETL参数中获取数据库类型；S214: Obtain the database type from the ETL parameter;

S215：根据用户输入的信息和所述数据库类型生成全量ETL脚本，并进入步骤S211。S215: Generate a full ETL script according to the information input by the user and the database type, and go to step S211.

基于本发明实施例提供的方法，可以在数据湖架构的基础上，快速生成ETL脚本，消除ETL作业基础设施方面的重复劳动，让数据湖中的数据集可以被发现、可用于查询和分析，极大地缩短分析项目中做ETL和数据编目阶段的时间，让ETL生成变得更自动、更智能。Based on the method provided by the embodiments of the present invention, ETL scripts can be quickly generated on the basis of the data lake architecture, the repetitive work in the ETL operation infrastructure can be eliminated, and the data sets in the data lake can be discovered and used for query and analysis. Greatly shortens the time for ETL and data cataloging in analysis projects, making ETL generation more automatic and smarter.

与图1所示的一种数据集成脚本生成方法相对应的，本发明实施例还提供了一种数据集成脚本生成装置，用于对图1中所示方法的具体实现，其结构示意图如图4所示，包括：Corresponding to the method for generating a data integration script shown in FIG. 1 , an embodiment of the present invention further provides a device for generating a data integration script, which is used for the specific implementation of the method shown in FIG. 1 , and its structural diagram is shown in the figure. 4, including:

第一确定单元301，用于当需要生成数据集成任务的任务脚本时，确定所述数据集成任务对应的数据采集信息；a first determining unit 301, configured to determine data collection information corresponding to the data integration task when a task script of the data integration task needs to be generated;

第二确定单元302，用于依据所述数据采集信息，在预设的数据湖元数据信息库中，确定所述数据集成任务对应的目标元数据标识以及所述目标元数据标识对应的目标元数据信息，所述数据湖元数据信息库中包含多个预设元数据标识和每个所述预设元数据标识对应的元数据信息；The second determining unit 302 is configured to determine, according to the data collection information, in a preset data lake metadata information base, a target metadata identifier corresponding to the data integration task and a target metadata corresponding to the target metadata identifier data information, the data lake metadata information base contains a plurality of preset metadata identifiers and metadata information corresponding to each of the preset metadata identifiers;

第三确定单元303，用于依据所述目标元数据信息，确定所述目标元数据标识对应的数据库类型；A third determining unit 303, configured to determine the database type corresponding to the target metadata identifier according to the target metadata information;

第四确定单元304，用于确定所述数据库类型对应的脚本生成策略；a fourth determining unit 304, configured to determine a script generation strategy corresponding to the database type;

第五确定单元305，用于确定所述数据集成任务对应的数据采集模式；a fifth determining unit 305, configured to determine a data collection mode corresponding to the data integration task;

第一生成单元306，用于若所述数据采集模式为全量采集模式，则依据所述脚本生成策略和所述数据采集信息，生成所述目标元数据标识对应的全量数据集成脚本，并将所述全量数据集成脚本作为所述数据集成任务的任务脚本。The first generating unit 306 is configured to generate a full data integration script corresponding to the target metadata identifier according to the script generation strategy and the data collection information, if the data collection mode is the full data collection mode, and integrate all the data. The full data integration script is used as the task script of the data integration task.

基于本发明实施例提供的装置，当需要生成数据集成任务的任务脚本时，确定数据集成任务对应的数据采集信息；依据数据采集信息，在预设的数据湖元数据信息库中，确定数据集成任务对应的目标元数据标识以及目标元数据标识对应的目标元数据信息，依据目标元数据信息，确定目标元数据标识对应的数据库类型；确定数据库类型对应的脚本生成策略；确定数据集成任务对应的数据采集模式；若数据采集模式为全量采集模式，则依据脚本生成策略和数据采集信息，生成目标元数据标识对应的全量数据集成脚本，并将该全量数据集成脚本作为数据集成任务的任务脚本。应用本发明实施例提供的装置，可以基于数据集成任务的数据采集信息和预设的数据湖元数据信息库确定脚本生成策略，继而生成任务脚本。可以通过自动化的处理过程生成数据集成任务的任务脚本，可实现数据集成脚本的自动化生成，无需人工编写脚本，可节省大量的人力资源，缩短处理过程耗时，提高工作效率。其次，可以避免人工编写脚本所带来的人为纰漏，避免人为原因给数据集成工作带来的不良影响。Based on the device provided by the embodiment of the present invention, when a task script of a data integration task needs to be generated, the data collection information corresponding to the data integration task is determined; according to the data collection information, in the preset data lake metadata information base, the data integration task is determined The target metadata identifier corresponding to the task and the target metadata information corresponding to the target metadata identifier, according to the target metadata information, determine the database type corresponding to the target metadata identifier; determine the script generation strategy corresponding to the database type; determine the corresponding data integration task. Data collection mode; if the data collection mode is the full collection mode, the full data integration script corresponding to the target metadata identifier is generated according to the script generation strategy and data collection information, and the full data integration script is used as the task script of the data integration task. By applying the device provided by the embodiment of the present invention, a script generation strategy can be determined based on the data collection information of the data integration task and the preset data lake metadata information base, and then a task script can be generated. The task script of the data integration task can be generated through the automated processing process, and the automatic generation of the data integration script can be realized without manual script writing, which can save a lot of human resources, shorten the processing time, and improve the work efficiency. Secondly, it can avoid human mistakes caused by manual scripting, and avoid the adverse effects of human factors on data integration work.

在上述实施例提供的装置的基础上，本发明实施例提供的装置中，还包括：On the basis of the device provided by the foregoing embodiment, the device provided by the embodiment of the present invention further includes:

在上述实施例提供的装置的基础上，本发明实施例提供的装置中，所述第一确定单元301，包括：On the basis of the device provided by the above embodiment, in the device provided by the embodiment of the present invention, the first determining unit 301 includes:

获取子单元，用于获取用户输入的业务主题、所属部门、所属产品、元数据标识和采集模式，并将所述业务主题、所述所属部门、所述所属产品、所述元数据标识和所述采集模式作为所述数据集成任务对应的数据采集信息。The obtaining subunit is used to obtain the business subject, the department to which the user belongs, the product to which it belongs, the metadata identifier and the collection mode entered by the user, and the The collection mode is used as the data collection information corresponding to the data integration task.

在上述实施例提供的装置的基础上，本发明实施例提供的装置中，所述第二确定单元302，包括：On the basis of the apparatus provided by the above embodiment, in the apparatus provided by the embodiment of the present invention, the second determining unit 302 includes:

匹配子单元，用于将所述数据采集信息分别与每个所述预设元数据标识对应的元数据信息进行匹配，并将与所述数据采集信息相匹配的元数据信息所对应的预设元数据标识，确定为所述目标元数据标识。A matching subunit, configured to respectively match the data collection information with the metadata information corresponding to each of the preset metadata identifiers, and match the preset metadata information corresponding to the metadata information matched with the data collection information The metadata identifier, which is determined as the target metadata identifier.

在上述实施例提供的装置的基础上，本发明实施例提供的装置中，所述核查单元，包括：On the basis of the device provided by the above embodiment, in the device provided by the embodiment of the present invention, the verification unit includes:

第一确定子单元，用于确定所述目标元数据信息对应的增量信息，所述增量信息包括增量变量标识和增量变量数据结构；a first determination subunit, configured to determine incremental information corresponding to the target metadata information, where the incremental information includes an incremental variable identifier and an incremental variable data structure;

判断子单元，用于判断所述起止条件和所述增量信息是否相匹配；A judging subunit for judging whether the start and end conditions match the incremental information;

第二确定子单元，用于若所述起止条件和所述增量信息相匹配，则确定所述数据集成任务通过条件核查；若所述起止条件和所述增量信息不匹配，则确定所述数据集成任务未通过条件核查。a second determination subunit, configured to determine that the data integration task passes the condition check if the start and end conditions match the incremental information; if the start and end conditions do not match the incremental information, determine that the The data integration task described above failed the condition check.

报错子单元，用于若所述数据集成任务未通过条件核查，则进行报错提示，结束所述数据集成任务的任务脚本的生成过程。An error reporting subunit, configured to report an error if the data integration task fails the condition check, and end the generation process of the task script of the data integration task.

本发明实施例还提供了一种存储介质，所述存储介质包括存储的指令，其中，在所述指令运行时控制所述存储介质所在的设备执行如上述的数据集成脚本生成方法。An embodiment of the present invention further provides a storage medium, where the storage medium includes stored instructions, wherein when the instructions are executed, a device where the storage medium is located is controlled to execute the above-mentioned method for generating a data integration script.

本发明实施例还提供了一种电子设备，其结构示意图如图5所示，具体包括存储器401，以及一个或者一个以上的指令402，其中一个或者一个以上指令402存储于存储器401中，且经配置以由一个或者一个以上处理器403执行所述一个或者一个以上指令402进行以下操作：An embodiment of the present invention also provides an electronic device, the schematic structural diagram of which is shown in FIG. 5 , and specifically includes a memory 401 and one or more instructions 402 , wherein one or more instructions 402 are stored in the memory 401 and processed through the memory 401 . The one or more instructions 402 are configured to be executed by the one or more processors 403 to:

需要说明的是，本发明提供的数据集成脚本生成方法及装置、存储介质及电子设备可用于金融领域或其他领域，例如，可用于金融领域中的数据分析应用场景。其他领域为除金融领域之外的任意领域，例如，通信服务领域。上述仅为示例，并不对本发明提供的数据集成脚本生成方法及装置、存储介质及电子设备的应用领域进行限定。It should be noted that the data integration script generation method and device, storage medium and electronic device provided by the present invention can be used in the financial field or other fields, for example, can be used in data analysis application scenarios in the financial field. The other field is any field other than the financial field, for example, the field of communication services. The above is only an example, and does not limit the application fields of the data integration script generation method and device, storage medium and electronic device provided by the present invention.

本说明书中的各个实施例均采用递进的方式描述，各个实施例之间相同相似的部分互相参见即可，每个实施例重点说明的都是与其他实施例的不同之处。尤其，对于系统或系统实施例而言，由于其基本相似于方法实施例，所以描述得比较简单，相关之处参见方法实施例的部分说明即可。以上所描述的系统及系统实施例仅仅是示意性的，其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性劳动的情况下，即可以理解并实施。Each embodiment in this specification is described in a progressive manner, and the same and similar parts between the various embodiments may be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the system or the system embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for related parts. The systems and system embodiments described above are only illustrative, wherein the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, It can be located in one place, or it can be distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

专业人员还可以进一步意识到，结合本文中所公开的实施例描述的各示例的单元及算法步骤，能够以电子硬件、计算机软件或者二者的结合来实现，为了清楚地说明硬件和软件的可互换性，在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行，取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能，但是这种实现不应认为超出本发明的范围。Professionals may further realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two, in order to clearly illustrate the possibilities of hardware and software. Interchangeability, the above description has generally described the components and steps of each example in terms of functionality. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of the present invention.

对所公开的实施例的上述说明，使本领域专业技术人员能够实现或使用本发明。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的，本文中所定义的一般原理可以在不脱离本发明的精神或范围的情况下，在其它实施例中实现。因此，本发明将不会被限制于本文所示的这些实施例，而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。The above description of the disclosed embodiments enables any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. a data integration script generation method, is characterized in that, comprises:

When a task script of the data integration task needs to be generated, determine the data collection information corresponding to the data integration task;

According to the data collection information, in the preset data lake metadata information database, determine the target metadata identifier corresponding to the data integration task and the target metadata information corresponding to the target metadata identifier, the data lake metadata The data information base contains a plurality of preset metadata identifiers and metadata information corresponding to each of the preset metadata identifiers;

According to the target metadata information, determine the database type corresponding to the target metadata identifier;

determining the script generation strategy corresponding to the database type;

determining the data collection mode corresponding to the data integration task;

If the data collection mode is the full data collection mode, generate a full data integration script corresponding to the target metadata identifier according to the script generation strategy and the data collection information, and use the full data integration script as the Task scripts for data integration tasks.

2. The method according to claim 1, wherein the determining the data collection information corresponding to the data integration task comprises:

Obtain the business subject, the department, the product, the metadata identifier and the collection mode entered by the user, and use the business subject, the department, the product, the metadata identifier and the collection mode as the Data collection information corresponding to the data integration task.

3 . The method according to claim 1 , wherein determining the target metadata identifier corresponding to the data integration task in a preset data lake metadata information base according to the data collection information, comprising: 4 . :

The data collection information is respectively matched with the metadata information corresponding to each of the preset metadata identifiers, and the preset metadata identifier corresponding to the metadata information matched with the data collection information is determined as The target metadata identifier.

4. The method of claim 1, further comprising:

If the data collection mode is not the full collection mode, determining the start and end conditions corresponding to the data integration task;

According to the target metadata information and the starting and ending conditions, condition checking is performed on the data integration task;

If the data integration task passes the condition check, an incremental data integration script corresponding to the target metadata identifier is generated according to the script generation strategy, the data collection information, the start and end conditions, and the target metadata information , and use the incremental data integration script as the task script of the data integration task.

5 . The method according to claim 4 , wherein the condition checking on the data integration task according to the target metadata information and the start and end conditions comprises: 5 .

determining the incremental information corresponding to the target metadata information, where the incremental information includes an incremental variable identifier and an incremental variable data structure;

Judging whether the start and end conditions match the incremental information;

If the start and end conditions match the incremental information, determine that the data integration task passes the condition check;

If the start and end conditions do not match the incremental information, it is determined that the data integration task fails the condition check.

6. The method of claim 4, further comprising:

If the data integration task fails the condition check, an error message is prompted, and the generation process of the task script of the data integration task ends.

7. A data integration script generation device, characterized in that, comprising:

a first determining unit, configured to determine data collection information corresponding to the data integration task when a task script of the data integration task needs to be generated;

The second determining unit is configured to, according to the data collection information, determine the target metadata identifier corresponding to the data integration task and the target metadata corresponding to the target metadata identifier in the preset data lake metadata information database information, the data lake metadata information base contains multiple preset metadata identifiers and metadata information corresponding to each of the preset metadata identifiers;

a third determining unit, configured to determine the database type corresponding to the target metadata identifier according to the target metadata information;

a fourth determining unit, configured to determine a script generation strategy corresponding to the database type;

a fifth determination unit, configured to determine a data collection mode corresponding to the data integration task;

The first generation unit is configured to generate a full data integration script corresponding to the target metadata identifier according to the script generation strategy and the data collection information if the data collection mode is the full collection mode, and integrate the The full data integration script is used as the task script of the data integration task.

8. The apparatus of claim 7, further comprising:

a sixth determination unit, configured to determine the start and end conditions corresponding to the data integration task if the data collection mode is not the full collection mode;

a verification unit, configured to perform condition verification on the data integration task according to the target metadata information and the start and end conditions;

a second generating unit, configured to generate the target metadata identifier according to the script generation strategy, the data collection information, the start and end conditions and the target metadata information if the data integration task passes the condition check corresponding incremental data integration script, and use the incremental data integration script as the task script of the data integration task.

9 . A storage medium, characterized in that the storage medium comprises stored instructions, wherein when the instructions are executed, a device where the storage medium is located is controlled to execute the data according to any one of claims 1 to 6 Integration script generation method.

10. An electronic device, comprising a memory, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by one or more processors as claimed in claims 1- 6. The method for generating a data integration script according to any one of the items.