CN118193176B

CN118193176B - Multi-dimensional dependency-based data warehouse task layered scheduling method, system and program product

Info

Publication number: CN118193176B
Application number: CN202410612281.XA
Authority: CN
Inventors: 高寿柏; 仲茜; 曹小函; 李珲
Original assignee: Shandong Shichuang Yun Service Co ltd
Current assignee: Shandong Shichuang Yun Service Co ltd
Priority date: 2024-05-17
Filing date: 2024-05-17
Publication date: 2024-07-23
Anticipated expiration: 2044-05-17
Also published as: CN118193176A

Abstract

A multi-dimensional dependency-based data warehouse task layered scheduling method, system and program product belong to the technical field of big data processing. Wherein primary data task scheduling in a data warehouse WThat is, at time t, at the model snapshotFor a specified number of bin tasks in a defined data environmentOne execution of (a)Is subjected to the following steps in the execution processConstraint of the set of dependent task instances, inBefore starting execution, executing all dependent task instances. The invention distinguishes two task scheduling triggering modes of periodicity and manual triggering and two task dependency modes of data and time, effectively and efficiently avoids cyclic dependency among tasks and repeated scheduling of tasks on the premise of basically not reducing modeling capacity of a data warehouse, and improves the efficiency of task scheduling and execution.

Description

Multi-dimensional dependency-based data warehouse task layered scheduling method, system and program product

Technical Field

The invention discloses a multi-dimensional dependency-based data warehouse task layered scheduling method, a multi-dimensional dependency-based data warehouse task layered scheduling system and a multi-dimensional dependency-based data warehouse task layered scheduling program product, and belongs to the technical field of big data processing.

Background

In recent years, with the high-speed development of information technologies such as artificial intelligence, big data, cloud computing and the like, digital economy has explosive growth in the global scope, and the characteristics of high efficiency, convenience and intelligence are utilized to deeply change the traditional industrial form and business mode, so as to promote the transformation and upgrading of economy. The key link of digital economy is the construction of data resources, which is now the only requirement of all types of enterprises, not just the problem of concern for large enterprises.

The valuable data resource is a data asset. To realize data capitalization, all data required by enterprises, including structured and unstructured data, including current and historical data and the like, must be effectively stored, organized and integrated, and on the basis, the data is queried and analyzed, so that the value of the data is furthest exerted. Such work is difficult to achieve for general database application systems, often requiring the use of data warehouse systems that are specific to this problem. The data warehouse system firstly extracts and stores various data in a data source in a timing or real-time manner through a data acquisition tool, then reorganizes and initially statistically gathers the data according to a unified data model which meets the data analysis requirement, and finally processes the data into data products required by users by utilizing a targeted intelligent data analysis method. In order to improve the reusability of the data resource definition and the data processing process in the process and improve the maintainability and the working efficiency of the data warehouse system, the data model is generally organized according to a hierarchical structure, and the data acquisition and processing process is decomposed into a plurality of tasks and scheduled and executed by a task scheduling engine. It can be seen from the data processing procedure of the data warehouse system that the data model and the data task scheduling are core technologies of the data warehouse system.

Traditional data warehouses employ workflows for scheduling, with one scheduling task often appearing in multiple workflows. In addition, execution of a data task sometimes fails, and at this time, the task needs to be re-executed. Therefore, the dependency relationship among the data processing tasks in the system is complex, and the optimization difficulty is high. Meanwhile, in order to ensure the completeness of workflow execution, the same task in different workflows often only reuses execution logic, and does not reuse execution examples, that is, when a certain task in a workflow is executed, whether the task is executed in another workflow is not considered, but the task is directly scheduled to be executed, so that the problem of repeated execution of the task inevitably occurs, the execution efficiency of a system is reduced, and the occupation amount of resources is increased. Meanwhile, the definition and optimization of the workflow have higher technical thresholds for the users of the multi-bin system, and the users not only need to have enough multi-bin construction knowledge but also need to master certain program development skills. The method is not a problem for large enterprises, but the resources mastered by the large, medium and small enterprises are limited, the application requirements of the data warehouse are relatively simple, the data warehouse system is more concerned about the usability of the data warehouse system, and particularly, the business personnel of the business personnel can complete the construction and application of the data warehouse under the guidance of professional personnel.

Chinese patent document CN116244276A discloses a model processing method, a device, electronic equipment and a readable storage medium, and belongs to the technical field of model processing. The method comprises the steps of obtaining a first model to be detected and a preset model set corresponding to the first model, wherein the preset model set comprises at least one second model; determining first basic data corresponding to the first model and second basic data corresponding to each second model, wherein the first basic data and the second basic data are data in a basic number bin; determining, for each second model, model similarity of the first model and the second model based on the first base data and second base data corresponding to the second model; and generating a prompt message based on the first model and the second model under the condition that the model similarity is larger than a preset threshold value, and sending the prompt message to prompt a user to combine the first model and the second model. In the patent document, the data warehouse integrates a large amount of historical business data of an enterprise, a business party can perform several-bin modeling based on the data warehouse, and the created several-bin model is applied to the aspects of algorithm optimization, data analysis, decision making and the like. In the multi-bin architecture, the multi-bin model plays a role in supporting the up-down movement, and is used for standardizing bottom data for lower processing and providing general and easy-to-use data for the lower processing, and the quality of the data model directly determines the achievement of the multi-bin. However, in the process of several-bin modeling, many similar several-bin models can be generated in an enterprise system due to conditions such as business change, personnel change, implementation of convention specifications and the like, so that a large number of repeated data calculation and storage resource occupation occur.

Chinese patent document CN112764911A discloses a task scheduling method, a device, electronic equipment and a readable storage medium, which relate to the technical field of task scheduling, in particular to a task scheduling method, a device, electronic equipment and a computer readable storage medium. The method comprises the following steps: receiving a task scheduling instruction, wherein the task scheduling instruction is used for driving a task scheduling process; reading a task dependency model stored in a database according to the task scheduling instruction; the task dependency model is a dependency tree of a plurality of task nodes which is constructed according to a preset dependency rule and comprises a plurality of dependency relations; each task node has a unique ID; acquiring a target task node from the task dependency model according to a preset task scheduling configuration rule; the target task node is a task node in the current state in execution; and judging and executing a task scheduling strategy according to the order of the target task node.

Most of the current data warehouse systems are applied to large enterprises, and have little attention to vast small and medium-sized enterprises. Large enterprises often have powerful professional data teams, and can customize the data warehouse system for the characteristics of enterprise business and data. Therefore, the current mainstream data warehouse system provides a flexible and configurable data model definition function, and although the data is organized by adopting a hierarchical structure, the hierarchical structure is adopted specifically, and the role of each layer of data in the whole data model can be customized by a user. In contrast, since the data is organized in a flexible manner, the dependency relationship between the data processing tasks based on the data dependency is complex, and is difficult to optimize in a targeted manner, and only a more general workflow-based method can be used for defining. Accordingly, execution of the task also needs to be submitted to a general workflow execution engine, such as Dolphin, azkaban, for scheduling execution. These workflow engines, due to their pursuit of complete and versatile functionality, result in less efficient scheduling of data warehouse tasks and require more computational resources, which is costly to deploy and operate. While data assets are the core assets of an enterprise, data warehouse systems often require localized independent deployment. For small and medium enterprises, the data team and the IT operation and maintenance team are not full-time, and meanwhile, the construction cost of the data assets is relatively insufficient, so that the characteristics of flexibility and configurability and strong universality are rather a main obstacle for the data warehouse to be used.

Disclosure of Invention

Aiming at the defects of the prior art, the invention discloses a multi-dimensional dependency-based data warehouse task hierarchical scheduling method.

The invention also discloses a system for realizing the scheduling method.

The invention further discloses a program product.

It is known from the prior art that, when the data warehouse application has become the needs of various enterprises, it is necessary to define the existing data warehouse data model according to the needs of the middle and small enterprises for the data warehouse application, and propose an optimized data task scheduling method corresponding to the data warehouse data model.

Aiming at the problems of large application difficulty and high operation and maintenance cost of the current mainstream data warehouse task scheduling strategy to large, medium and small enterprises, the invention selects an optimized data model layering strategy and provides a data warehouse task layering scheduling method based on multidimensional dependence.

The present data warehouse adopts a layering mode to organize data, which is a mainstream method in the industry, and has proved to be effective, so the present invention does not introduce a new data organization mode, and still adopts a layering scheme, but adopts a layering mode of targeted optimization based on the rule that the contract is larger than the configuration, namely, the layering scheme is divided into four layers of a source data layer (ODS), a detail data layer (DWD), a data summarizing layer (DWS) and an application data layer (ADS) from low to high, and meanwhile, the dependence among data objects is limited to a high-level dependence and a low-level, and the practice shows that the layering scheme can be suitable for the application scenes of the data warehouse of most small and medium enterprises. After the layering scheme of the data model is determined, the data task scheduling scheme aiming at the layering strategy is provided for the data asset construction requirements of small and medium enterprises, the effectiveness and the efficiency of task scheduling are improved, and the resource occupancy rate is reduced.

In order to facilitate understanding of the technical solution of the present invention, the important concepts of the present invention are defined as follows:

Definition 1:

the data warehouse W is composed of a data model M and a set of data tasks J, which may be referred to hereinafter simply as models and tasks without ambiguity, i.e. And according to the principle that a high layer depends on a low layer, elements in M and J are divided into four layers from low to high, namely: a source data layer (ODS) for storing data directly collected from a data source; a detail data layer (DWD) for organizing data by a dimension model, including two sublayers of a dimension surface layer and a fact surface layer, directly obtaining data from the ODS layer; a data summarizing layer (DWS) for primarily summarizing the data in the DWD; an application data layer (ADS) for performing targeted processing on the data of the DWS and DWD layers according to the user's requirements, to generate a data product meeting the user's requirements, where the specific definitions of M and J are as follows:

(1)

(2)

As can be seen from equations (1) and (2), the data model M can be divided into four sets M _l,M_l having no intersection with each other, and the elements in the sets M _l,M_l are data tablesI indicates the hierarchy to which the data table belongs, including an ODS paste source data layer, a DWD detail data layer, a DWS data summary layer, an ADS application data layer, etc., e.g. M _ODS is the set of all the data tables belonging to the ODS layer in M, i is the serial number of the table for uniquely identifying the table, the serial number is counted from 1, and the maximum number of the data tables is not exceeded；

(3)

(4)

In the formulas (3) and (4), the data task j is the minimum unit independently scheduled and executed in the data warehouse W, and is used for extracting data in a data source or processing the data in the data model M and then storing the data in the data source or the data in the data model M into a target table r; the target table r is a data table in the data model M which is newly added or updated after the task j is executed; the data warehouse W supports two types of tasks: firstly, a data acquisition task, namely extracting data from a source database into a data warehouse W; the other type is a data processing task, namely processing data in a low-layer data model in the data warehouse W, generating and storing the data in a high-layer data model, and forming a dependency relationship between data tables and data tasks according to a data flow direction in the data processing process, wherein an inflow party depends on an outflow party. As can be seen from formulas (3) and (4), J can be divided into four elements in a mutually non-intersecting set J _l,J_l corresponding to ML is the hierarchy of tasks, determined by the hierarchy of the target table, k is the sequence number of tasks, the sequence number is counted from 1, and the maximum number of data tasks is not exceededWithout causing ambiguity, or abbreviated herein as j.

Definition 2:

the data source ds is the source of data in the M _ODS, and the data warehouse W extracts the data from the data source ds through the ETL (Extract Transform Load, extraction, conversion and loading) process, and after necessary data cleaning (the data cleaning is a proper term in the industry and is not the content to be protected by the present invention), the data is stored in the data table of the M _ODS.

Definition 3:

model snapshot For the state of the data model M at the time t, the corresponding model snapshot of the layer I is recorded asIncluding definitions of data tables in the data model and stored data; without loss of generality, since the minimum granularity of statistics of a typical data warehouse is 1 day, the time t in the present invention refers to the date of task initiation execution unless otherwise specified.

Definition 4:

Task execution instance Starting one-time execution at the time t for a data task j in the data warehouse W; the task execution examples include two classes: the first is a periodic automatic triggering execution example, which is recorded as; Another type is a task instance that is manually triggered to execute, noted as。

A set of periodic task instances including all started execution at time t and task instances which are not started execution at time t but have been completed by the previous period is recorded asAnd the latest execution example set of each task at the time t is shown.

(5)

In formula (5), CJ ^t is the set of all periodic task instances that start execution at time t; PJ ^t is a set of task instances that do not start execution at time t, but that start normally one cycle closest to t before t;

,

Wherein the said To the point in time at which the periodic task j was last periodically executed before the point in time t,Is thatAt the position ofThe moment when the latter cycle should start execution indicates that the moment t is between two adjacent periodic schedules of j;

Definition 5:

task scheduling space S in the data warehouse is defined as four tuples { T, J, P, D }, where

T is the time dimension, and the time T of the current dispatching start is the time T;

j is the task dimension, the current scheduled data task J is designated, and the specific execution example is ；

P is a model dimension, and the current scheduled data task is designated to relate to a model snapshotA set of medium data tables;

(6)

D is a dependency dimension, and designates a current task execution instanceA task set directly relied upon;

(7)

In the formula (7), letFor the taskIn the instance of execution at time t,Is thatAt the position ofThe previous cycle should start the moment of execution. As can be seen from equation (7), the dependencies between task execution instances include two dimensions, one being based on target table dependencies, referred to herein as data dependencies between task instances, the set of data dependencies in equation (7) being defined byRepresentation, assume that the current task execution instance isExamples of data dependencies areThe method comprises the following steps:

Representative of a higher level than ，Data taskThe target table needs to be tasked with periodic dataAcquiring data from a target table of (1);

Another is the dependence of the current task instance on the last cycle execution instance of the same task, the invention refers to the time dependence among task instances, namely Depending on。

The invention adopts the detailed technical scheme that:

The multi-dimensional dependency-based data warehouse task layered scheduling method is characterized by comprising the following steps of:

primary data task scheduling in a data warehouse W That is, at time t, at the model snapshotFor a specified number of bin tasks in a defined data environmentOne execution of (a)Is subjected to the following steps in the execution processConstraint of the set of dependent task instances, inBefore starting execution, executing all dependent task instances;

the triggering mode of data task scheduling in the data warehouse W comprises two types of periodic automatic triggering and manual triggering:

the periodic automatic triggering is used for the data task in W Is that of normal execution ofThe main mode of scheduling execution is set by the scheduler according to the settingsThe scheduling period is executed, and in order to optimize the scheduling process, the data warehouse starts periodic scheduling tasks at a designated time: firstly, executing tasks to be scheduled at the current moment of all ODS layers concurrently, and after the tasks are executed, sequentially executing the tasks to be scheduled at the current moment of DWD, DWS, ADS layers in the same mode:

if the dependent task is not successfully executed, the current task is not executed any more, the failure is directly returned, and the task is waited to be manually triggered to run again;

The manual triggering is used for the scenes of task re-running, data source real-time data synchronization, data task testing and the like after the task execution instance fails, is an important supplement of a periodic automatic triggering mode, and can solve personalized user task scheduling.

According to the invention, the multi-dimensional dependency-based data warehouse task hierarchical scheduling method further comprises the following steps:

One key problem of batch task scheduling is to determine the execution sequence of tasks and prevent cyclic dependence and repeated scheduling, wherein the cyclic dependence is that an A task depends on a B task and the B task also depends on the A task;

As described above, the conventional data warehouse organizes task scheduling by means of workflow, the workflow is just a directed acyclic graph, which ensures the order of local task execution, however, for global aspect, because of the flexibility and locality of the hierarchical structure and workflow definition, to ensure that there is no loop in the whole, all the workflows need to be merged into the same graph, when the number of workflow is large, the scale of the graph is significantly increased, and judging whether there is a loop becomes more complex, and meanwhile, because different hierarchical structures can be adopted, the same task may exist in different hierarchical structures, judging whether the task is equivalent becomes very difficult, and it is difficult to ensure that there is no repeated scheduling, and these problems all put high demands on the database system and the database engineer. In order to solve the problems, the invention provides a dispatching method based on a multi-bin hierarchy, which is suitable for the situation that middle and small enterprises lack high-level multi-bin engineers, and solves the problem of cyclic dependence, necessary agreements are carried out on the dependence among tasks. Therefore, the invention adopts ODS, DWD, DWS, ADS four-layer classification mode, and the definition of the four layers can show that the high-layer data is generated by the low-layer data, so the data dependency relationship is generated by the high-layer data depending on the low-layer data, for the same-layer dependence, the node of the same-layer dependence is generated in real time as an intermediate result in consideration of the transitivity of the data dependency, thereby the same-layer dependence is converted into the dependence of the high layer on the low layer, the converted model and the model before conversion are functionally equivalent, only certain original model nodes need to be dynamically generated, a certain processing cost can be generated, but the small and medium-sized enterprise data model is not complex generally, the probability of the same-layer dependence is lower, and the influence on the system execution efficiency is smaller. And hierarchical scheduling is sequentially carried out according to the order from low to high, each task is carried out in one round of scheduling and only carried out once, and repeated scheduling can be effectively prevented.

According to the invention, the scheduling time of the periodic tasks is also a problem to be considered in the periodic automatic triggering. Because in most cases, the minimum period of scheduling the data batch processing task by small and medium enterprises is usually not less than 1 day, namely, the data batch processing task is executed for 1 time per day at most, and because the service is relatively single, the starting scheduling time of the daily task is not required to be repeatedly scheduled for a plurality of times, and the scheduling is usually performed at the same time.

According to the invention, the multi-dimensional dependency-based data warehouse task hierarchical scheduling method specifically comprises the following steps of:

scheme 1: periodic scheduling of data tasks

The automatic triggering of the data warehouse is the most common mode of task scheduling of the data warehouse, and is used for anySetting a scheduling period, determining the starting operation time of each scheduling of j according to the first starting time and the scheduling period of j, and submitting the starting operation time to a scheduler for scheduling on time, wherein the flow of each scheduling execution is as follows:

Step 1-1: screening out a data task set to be executed at the current time t according to the task scheduling period and the first execution time ;

Step 1-2: the data tasks screened in the step 4-1 are sequentially called in the sequence from low to high according to the layers of the tasks, and the tasks on the same layer are scheduled in parallel or concurrently by a scheduler, namely, the ODS layer periodic task set which needs to be scheduled and executed at the moment t is executed firstIs then sequentially executed、、Is a task in (1);

scheme 2: batch re-running of failed data task instances

When a large number of data task instances fail to execute, the data task instances are manually triggered by an engineer to run all the failed data task instances again according to the hierarchy;

Step 2-1: according to the task instance execution result, screening out periodic data task adding set of execution failure to be re-run For task j with multiple instances of error, selecting the instance of the latest failed cycle scheduling execution to be added toIn (a) and (b);

step 2-2: the data tasks screened in the step 3-1 are sequentially called in the process 3 according to the sequence from low to high of the layers of the tasks, and the tasks on the same layer can be scheduled in parallel or concurrently by a scheduler, namely, the tasks are firstly executed Tasks of the middle ODS layer are then sequentially performedA failed task at DWD, ODS, ADS layers in;

Scheme 3: task instance rerun specifying failure status

The program is triggered in two cases, namely, when an engineer manually triggers a certain failed task to run again, the process is directly executed; secondly, other processes call;

step 3-1: acquiring periodic execution instances of execution failure ；

Step 3-2: the execution method S4 relies on a task repeated execution checking method to check and mark the repeated execution condition of the current task, if the current task is executed by other task examples, the execution result is directly returned, otherwise, the subsequent steps of the flow are continuously executed;

Step 3-3: the execution method S1 checks the time-dependent node execution status method to perform time-dependent check if it is time-dependent task instance If not, calling the flow 3 to run the instance again, and waiting for the execution of instance scheduling to finish; if an example isThe running again fails, the flow fails to return, otherwise, the follow-up steps of the flow are continuously executed;

step 3-4: the execution method S2 checks the execution state of the data dependent node to perform data dependency check, and returns the failed data dependency set ; If the call of flow 3 is triggered by either flow 1 or flow 2, then whenNon-space-time stream flow failure return, instanceThe execution state is recorded as "execution failure due to data dependency exception"; otherwise, for allExecuting the process 3 to perform the re-running, if not all re-running is successful, returning the failure of the processThe execution state is recorded as "execution failure due to data dependency exception"; otherwise, continuing to execute the subsequent steps of the flow;

step 3-5: if the task is the several bin levels described by the current task Normally scheduled execution; Otherwise, the execution method S3 performs data compensation by the data compensation method for the execution example of the task, and then the execution method S5 deletes the data of the deleted data processing method when the task is executed;

scheme 4: specifying task new instance execution

The procedure triggers two cases: firstly, the data warehouse engineer manually triggers the new instance which is generated at the moment and is not depended by other instances, and if the execution fails, the new instance is not re-run by the flow 2; secondly, the process 1 is called and triggered, and at the moment, a new instance generated in the program is a periodical execution instance, which can be depended by other instances, and if the execution fails, the process 2 can run again;

step 4-1: acquiring new task instances to be executed ；

Step 4-2: the execution method S4 relies on a task repeated execution checking method to check and mark the repeated execution condition of the current task instance, if the current task instance is executed in the current scheduling, the execution result is directly returned, otherwise, the subsequent steps of the flow are continuously executed;

Step 4-3: the execution method S1 checks the time dimension dependent node execution state method to perform time dependent check if it is time dependent task instance If not, calling the flow 3 to run the instance again, and waiting for the execution of instance scheduling to finish; if an example isThe running again fails, the flow fails to return, otherwise, the follow-up steps of the flow are continuously executed;

step 4-4: processing data dependencies

Step 4-4-1: if the current flow is manually triggered, there are data dependent instances that are not created(ODS layer task has no upstream data dependency), create and invoke flow 4 to execute the instance; waiting for all data dependent instance execution to finish, if there is a dependent instance not successfully executed, returning failure, and the instanceRecording as 'failure in execution due to abnormal layering dependency', otherwise, executing step 5;

step 4-4-2: if the flow is triggered by the flow 1, the execution method S2 checks the execution state method of the data dependent node to obtain a failed data dependent set ; When (when)Non-space-time stream flow failure return, instanceThe execution state is recorded as 'failure in execution due to data dependence exception', otherwise, step 5 is executed;

Step 5: scheduling execution Task content, then execution method S5 delete data for the deleted data processing method when task is executed.

According to a preferred embodiment of the invention, the method S1: the method for checking the execution state of the time dimension dependent node comprises the following specific steps:

Acquisition task Current execution instanceDependency instances in the time dimensionPeriodic tasksThe execution example of the last period at the moment checks the execution state, and if the execution state is unsuccessful, returns。

According to a preferred embodiment of the invention, the method S2: the method for checking the execution state of the data dimension dependent node comprises the following specific steps: obtaining a current task execution instanceData dependent instance set in hierarchical dimensionCheck allTo join a collection if execution failsIs a kind of medium.

According to a preferred embodiment of the invention, the method S3: the specific steps of the data compensation method when the task execution instance re-runs are as follows: for periodic task instanceRestoring the data snapshot of the ODS layer at the time t through the current state of the service data and other historical snapshot dataThis is because the data of the data source may have changed since the time t has elapsed, and recovery of the data at the time t is required.

According to a preferred embodiment of the present invention, the method S3 stores data having a current data creation time or update time between prev (t, j) and t in the incrementally stored dataIn, for data stored in full, it is necessary to storeOn the basis of (a), newly adding or updating the data with the creation time or the update time between prev (t, j) and t in the current data toIs a kind of medium.

According to a preferred embodiment of the invention, the method S4: the specific steps of the checking method are repeatedly executed by the dependent tasks: before executing task content, the task instance checks whether the current flow is executed by the same task instance or is being started by other t time, if the current task instance is executed, the execution result is directly returned, and if the current task instance is being executed, the execution result is returned after the task instance is blocked to wait for the execution completion of the task instance. For example: when task execution instance、Simultaneously dependent on periodic task instancesIf (if)Executing faster, checking to find out that it is necessary to run the dependency instance again firstThe first time is called up, recordHas run again whenDependent relationship constraints also require triggeringIn the case of heavy running, the running speed is reduced,Without re-execution, only wait for the execution of the programAdjusted upAnd after the running is finished, the execution result is directly returned. Adjusted up

According to a preferred embodiment of the invention, the method S5: the method for processing deleted data comprises the following specific steps of: the service system usually has different processing modes aiming at the deletion of the data, and most systems logically delete the data, namely only the deletion identification field is modified and the record is not physically deleted when the data is deleted, but some systems delete the data in a physical deletion mode; the system when the logic deletion is performed modifies the data update time at the same time, and the system when the logic deletion is performed is not modified. When the data warehouse system extracts data to the ODS layer, the data warehouse system needs to consider the difference of the service system and perform unified processing; for physically deleted data, the current task instanceStoring fetched data to a model snapshotComparison ofAnd，Exist in and are not identified as deleted data butIs missing fromTaking out, setting logic deletion mark as deleted, setting data update time as current time, and storing; Service data with logic deleted but not modified update date is first to be usedThe fetched data is stored toComparison of，Not identified as deleted data butThe service deletion identification bit is deleted, the logic deletion identification is set as deleted, and the data updating time is modified as the current time. No additional processing is required for the logical deletion of the modified data update time.

The system for realizing the scheduling method is characterized by comprising a data management platform carrying a data warehouse service module, a gateway cluster and a security management center;

the data management platform also comprises a task scheduling service module;

The data warehouse service module is responsible for development of tasks and configuration of scheduling rules; the task scheduling service is responsible for scheduling methods corresponding to the flow 1, the flow 2, the flow 3 and the flow 4, and executing the methods S1, the method S2, the method S3, the method S4 and the method S5.

A program product for realizing a multi-dimensional dependency-based data warehouse task hierarchical scheduling method, which is characterized by being used for executing the flow 1, the flow 2, the flow 3 and the flow 4; and is also used for calling and executing the methods S1, S2, S3, S4 and S5.

The invention has the technical advantages that:

Aiming at the problems that the current mainstream data warehouse is complicated in modeling, development, application and operation and maintenance processes, high in personnel quality requirement and difficult to effectively implement in large, medium and small sizes, an optimized data model layering strategy with wide application range is selected, and a data warehouse task layering scheduling method based on multidimensional dependence is provided, and the method has the main technical advantages that:

1. Aiming at the application characteristics of data warehouses of small and medium enterprises, the invention distinguishes two task scheduling triggering modes of periodic and manual triggering and two task dependency modes of data and time based on the high-efficiency, sufficient and easy-to-use principle, researches and realizes a targeted data warehouse scheduling algorithm on the basis of a defined number of warehouse layered structures, optimizes the data modeling, development, application and operation and maintenance processes, effectively and efficiently avoids cyclic dependence among tasks and repeated scheduling of the tasks on the premise of basically not reducing the modeling capacity of the data warehouse, and improves the efficiency of task scheduling and execution.

2. When the data task instance is rerun, the current data and the available snapshot data are comprehensively adopted to recover the data snapshot at the rerun moment as much as possible, so that the probability of missing the historical data is reduced. In the running process of the data warehouse system, the failure of executing the data task due to the system or human factors is unavoidable, the data task needs to be re-run, and the most important in the re-running process is to recover the data snapshot at the time. For the incremental data model, taking the date of data creation or modification from the current data snapshot as target date data; and adding the created or modified date in the current data snapshot to the full data model based on the previous period full data snapshot as target date data, and deleting the data modified by the current data snapshot.

3. Different deleting modes such as logical deletion and physical deletion in the data source are distinguished, targeted processing is carried out when the data are collected to the ODS layer, the data are stored in a target number bin in a unified form, and the consistency of the data warehouse and the data of the data source is ensured.

Drawings

FIG. 1 is a flow chart of the periodic scheduling of the data tasks of the process 1 according to the present invention;

FIG. 2 is a flow chart of batch re-running of instances of the failed data task of flow 2 in accordance with the present invention;

FIG. 3 is a flow chart of a task instance re-run of the present invention for the failed state specified in flow 3;

FIG. 4 is a flow chart of the new instance execution of the assignment task of flow 4 of the present invention;

FIG. 5 is a flow chart of a method S1 of the present invention for time dimension dependent checking;

FIG. 6 is a flow chart of a method S2 data dimension dependency check in accordance with the present invention;

FIG. 7 is a flow chart of a method S4 of the present invention for performing a check repeatedly depending on tasks;

FIG. 8 is a schematic diagram of a system module for implementing a multi-dimensional dependency based hierarchical scheduling method for data warehouse tasks according to the present invention.

Detailed Description

The present invention will be described in detail with reference to examples and drawings, but is not limited thereto.

Example 1,

A multi-dimensional dependency-based data warehouse task layered scheduling method comprises the following steps:

The multi-dimensional dependency-based data warehouse task layered scheduling method further comprises the following steps:

EXAMPLE 2,

As shown in fig. 1 to fig. 4, the multi-dimensional dependency-based data warehouse task hierarchical scheduling method specifically includes the following steps for processing periodically triggered task scheduling and manually triggered task scheduling:

scheme 1: periodic scheduling of data tasks

scheme 2: batch re-running of failed data task instances

Scheme 3: task instance rerun specifying failure status

step 3-1: acquiring periodic execution instances of execution failure ；

scheme 4: specifying task new instance execution

step 4-1: acquiring new task instances to be executed ；

step 4-4: processing data dependencies

As shown in fig. 5-7, the method S1: the method for checking the execution state of the time dimension dependent node comprises the following specific steps:

The method S3: the specific steps of the data compensation method when the task execution instance re-runs are as follows: for periodic task instanceRestoring the data snapshot of the ODS layer at the time t through the current state of the service data and other historical snapshot dataThis is because the data of the data source may have changed since the time t has elapsed, and recovery of the data at the time t is required.

In the method S3, for incrementally stored data, data with a current data creation time or update time between prev (t, j) and t is stored inIn, for data stored in full, it is necessary to storeOn the basis of (a), newly adding or updating the data with the creation time or the update time between prev (t, j) and t in the current data toIs a kind of medium.

The method S4: the specific steps of the checking method are repeatedly executed by the dependent tasks: before executing task content, the task instance checks whether the current flow is executed by the same task instance or is being started by other t time, if the current task instance is executed, the execution result is directly returned, and if the current task instance is being executed, the execution result is returned after the task instance is blocked to wait for the execution completion of the task instance. For example: when task execution instance、Simultaneously dependent on periodic task instancesIf (if)Executing faster, checking to find out that it is necessary to run the dependency instance again firstThe first time is called up, recordHas run again whenDependent relationship constraints also require triggeringIn the case of heavy running, the running speed is reduced,Without re-execution, only wait for the execution of the programAdjusted upAnd after the running is finished, the execution result is directly returned.

The method S5: the method for processing deleted data comprises the following specific steps of: the service system usually has different processing modes aiming at the deletion of the data, and most systems logically delete the data, namely only the deletion identification field is modified and the record is not physically deleted when the data is deleted, but some systems delete the data in a physical deletion mode; the system when the logic deletion is performed modifies the data update time at the same time, and the system when the logic deletion is performed is not modified. When the data warehouse system extracts data to the ODS layer, the data warehouse system needs to consider the difference of the service system and perform unified processing; for physically deleted data, the current task instanceStoring fetched data to a model snapshotComparison ofAnd，Exist in and are not identified as deleted data butIs missing fromTaking out, setting logic deletion mark as deleted, setting data update time as current time, and storing; Service data with logic deleted but not modified update date is first to be usedThe fetched data is stored toComparison of，Not identified as deleted data butThe service deletion identification bit is deleted, the logic deletion identification is set as deleted, and the data updating time is modified as the current time. No additional processing is required for the logical deletion of the modified data update time.

EXAMPLE 3,

As shown in fig. 8, a system for implementing the scheduling method includes a data management platform carrying a data warehouse service module, a gateway cluster and a security management center;

the data management platform also comprises a task scheduling service module;

EXAMPLE 4,

Claims

1. The multi-dimensional dependency-based data warehouse task layered scheduling method is characterized by comprising the following steps of:

the periodic automatic triggering is used for the data task in W Is set by the scheduler according to the normal execution ofThe scheduling period is executed, and the data warehouse starts periodic scheduling tasks at a designated time: firstly, executing tasks to be scheduled at the current moment of all ODS layers concurrently, and after the tasks are executed, sequentially executing the tasks to be scheduled at the current moment of DWD, DWS, ADS layers in the same mode:

The multi-dimensional dependency-based data warehouse task layering scheduling method specifically comprises the following steps of processing periodically triggered task scheduling and manually triggered task scheduling:

scheme 1: periodic scheduling of data tasks

scheme 2: batch re-running of failed data task instances

When the execution failure of the data task examples occurs, the data task examples are manually triggered by an engineer to re-run all the failed data task examples according to the hierarchy;

Scheme 3: the task instance re-running of the appointed failure state is triggered by the program under two conditions, namely, when the engineer manually triggers the re-running of a certain failure task, the process is directly executed; secondly, other processes call;

step 3-1: acquiring periodic execution instances of execution failure ；

Step 3-5: if it is the number bin level of the current task Normally scheduled execution; Otherwise, the execution method S3 performs data compensation by the data compensation method for the execution example of the task, and then the execution method S5 deletes the data of the deleted data processing method when the task is executed;

scheme 4: specifying task new instance execution

step 4-1: acquiring new task instances to be executed ；

step 4-4: processing data dependencies

Step 4-4-1: if the current flow is manually triggered, there are data dependent instances that are not createdCreating and calling a flow 4 to execute the instance; waiting for all data dependent instance execution to finish, if there is a dependent instance not successfully executed, returning failure, and the instanceRecording as 'failure in execution due to abnormal layering dependency', otherwise, executing step 5;

2. The multi-dimensional dependency based data warehouse task hierarchical scheduling method as set forth in claim 1, further comprising:

By adopting ODS, DWD, DWS, ADS four-layer classification mode, the high-layer data is generated by the low-layer data, so that the data dependency relationship is formed by the high-layer data depending on the low-layer data, for the same-layer dependency, taking the transitivity of the data dependency into consideration, nodes depending on the same-layer are generated in real time as intermediate results, so that the same-layer dependency is converted into the dependency of the high layer on the low layer, the converted model and the model before conversion are equivalent in function, hierarchical scheduling is sequentially performed according to the order of layers from low to high, and each task is performed in one round of scheduling and only performed once.

3. The multi-dimensional dependency based data warehouse task layering scheduling method of claim 1, wherein in the periodic automatic triggering, the minimum period of data batch task scheduling of a small and medium-sized enterprise is generally not less than 1 day, i.e. at most 1 time per day.

4. The multi-dimensional dependency based data warehouse task hierarchical scheduling method according to claim 1, wherein the method S1: the method for checking the execution state of the time dimension dependent node comprises the following specific steps:

Acquisition task Current execution instanceDependency instances in the time dimensionPeriodic tasksAt the position ofThe execution example of the last period at the moment checks the execution state, and if the execution state is unsuccessful, returns。

5. The multi-dimensional dependency based data warehouse task hierarchical scheduling method according to claim 4, wherein the method S2: the method for checking the execution state of the data dimension dependent node comprises the following specific steps: obtaining a current task execution instanceData dependent instance set in hierarchical dimensionCheck allTo join a collection if execution failsIs a kind of medium.

6. The multi-dimensional dependency based data warehouse task hierarchical scheduling method according to claim 5, wherein the method S3: the specific steps of the data compensation method when the task execution instance re-runs are as follows: for periodic task instanceRestoring the data snapshot of the ODS layer at the time t through the current state of the service data and other historical snapshot data; In the method S3, for incrementally stored data, data with a current data creation time or update time between prev (t, j) and t is stored inIn, for data stored in full, it is necessary to storeOn the basis of (a), newly adding or updating the data with the creation time or the update time between prev (t, j) and t in the current data toIs a kind of medium.

7. The multi-dimensional dependency based data warehouse task hierarchical scheduling method according to claim 6, wherein the method S4: the specific steps of the checking method are repeatedly executed by the dependent tasks: before executing task content, the task instance checks whether the current flow is executed by the same task instance or is being started by other t moments, if the current task instance is executed, an execution result is directly returned, if the current task instance is executed, the execution of the task instance is blocked and waits for completion, and an execution result is returned;

The method S5: the method for processing deleted data comprises the following specific steps of: the system when the logic deletion is performed modifies the data updating time at the same time, and the system when the logic deletion is performed is not modified; when the data warehouse system extracts data to the ODS layer, the data warehouse system needs to consider the difference of the service system and perform unified processing; for physically deleted data, the current task instance Storing fetched data to a model snapshotComparison ofAnd，Exist in and are not identified as deleted data butIs missing fromTaking out, setting logic deletion mark as deleted, setting data update time as current time, and storing; Service data with logic deleted but not modified update date is first to be usedThe fetched data is stored toComparison of，Not identified as deleted data butSetting a logic deletion mark as deleted and modifying the data updating time as the current time; no additional processing is required for the logical deletion of the modified data update time.

8. A system for implementing the scheduling method according to any one of claims 1-7, comprising a data management platform carrying a data warehouse service module, a gateway cluster, and a security management center;

the data management platform also comprises a task scheduling service module;

9. A program product implementing the multi-dimensional dependency based data warehouse task hierarchical scheduling method of any one of claims 1-7, characterized by being used to perform the process 1, process 2, process 3, process 4; and is also used for calling and executing the methods S1, S2, S3, S4 and S5.