CN115421898A

CN115421898A - Big data task scheduling management system and method based on quartz framework

Info

Publication number: CN115421898A
Application number: CN202211382485.6A
Authority: CN
Inventors: 邓明瑞; 王乐珩; 张金银
Original assignee: Hangzhou Bizhi Technology Co ltd
Current assignee: Hangzhou Bizhi Technology Co ltd
Priority date: 2022-11-07
Filing date: 2022-11-07
Publication date: 2022-12-02

Abstract

The invention discloses a big data task scheduling management system and method based on the quartz framework. The system comprises a core scheduling module, a resource scheduling module, at least one scheduling service and a database. The core scheduling module loads all job tasks from the database into the quartz framework, generates corresponding job instances, and sends them to the corresponding executors, which submit them to a Hadoop cluster; the executors monitor the task state through the methods of the respective components and return it to the scheduling service. The core scheduling module comprises a scheduling job manager, a timing manager, a scheduling task manager and a callback manager. The system and method manage the scheduling of various tasks in a unified way, reduce the workload of data development engineers and improve the efficiency of task scheduling throughout data warehouse construction.

Description

Big data task scheduling management system and method based on quartz framework

Technical Field

The invention relates to the field of computers, network communication technology and big data task scheduling, in particular to a big data task scheduling management system and method based on a quartz framework.

Background

With the construction and development of the data middle platform, data analysts analyze enterprise data and raise a series of requirements for data development engineers, operation and maintenance engineers and development engineers. Building the data middle platform is therefore important, but such a platform generally faces the following problems with its tasks:

(1) Job types are numerous and diverse, including java, shell, hiveSql, flink, spark, python and the like;

(2) Job content cannot be managed in a unified way, nor can versions, permissions and submission records;

(3) The stability of periodic scheduling is put to the test;

(4) In traditional data warehouse construction, task scheduling has to be configured through cron on a plurality of servers;

(5) Job failures go unnoticed and logs cannot be viewed;

(6) Data governance, data lineage and data quality cannot be monitored.

Disclosure of Invention

In view of the problems in the prior art, the invention aims to provide a big data task scheduling management system and method based on the quartz framework that improve data development efficiency and solve the problems described above.

In order to achieve this aim, the invention provides a big data task scheduling management system based on the quartz framework, which comprises a core scheduling module, a resource scheduling module, at least one scheduling service and a database; the core scheduling module comprises a scheduling job manager, a timing manager, a scheduling task manager and a callback manager; wherein the scheduling job manager is used for editing jobs, storing the edited script tasks in the database and managing the operation authority of the script tasks;

the timing manager is used for processing timed scheduling tasks, implementing the timed scheduling logic through quartz and supporting a cluster scheduling mode;

the scheduling task manager is used for configuring corresponding dependency relationships for the jobs to form a DAG and allocating execution cycles to the corresponding tasks;

the callback manager is a common class used for processing callbacks in a unified way;

the resource scheduling module comprises a task resource scheduling manager, wherein the task resource scheduling manager is responsible for resource control and processing of each job type when a task is issued, and controls a corresponding processor to process aiming at different job types.

Further, the job types include hive, spark, python, pyspark, flink and datax.

Further, the core scheduling module further comprises a job instance issuing manager, a job instance retry manager and a job instance manager, wherein the job instance issuing manager, the job instance retry manager and the job instance manager are functional modules of the core scheduling module related to processing instance related logic.

Further, the job instance issuance manager generates a record of the instance for each task and/or job run.

Further, the job instance manager is used for issuing job instances; an instance is submitted to the resource scheduling module for processing, and before issuing, the project engine information is acquired through the strategy pattern in order to assemble the parameters required by each job type.

Further, the job instance retry manager handles the logic of job instance retries, including queue slow retries, retries after a task fails, and/or retries when no callback message for the job instance has been received for a long time, and is also configured to monitor the data quality status of the data generated by each job type and trigger alarms.

Further, the core scheduling module further comprises a queue manager, which is responsible for the queue mechanism built into task domain scheduling and can control the number of tasks issued concurrently.

In another aspect, the invention provides a big data task scheduling management method based on the quartz framework, which is implemented in the task scheduling management system and comprises the following steps:

S1, editing each job and storing each edited script task in a database;

S2, allocating corresponding task executors and execution periods to the tasks;

S3, loading each script task from the database into the quartz framework, so that the quartz framework issues an execution instruction for each script task to the corresponding task executor through a message queue according to the execution period;

S4, the task executor executing the corresponding script task according to the issued execution instruction and returning an execution result;

S5, managing the user operation authority of each script task;

S6, when a script task fails or times out on the corresponding task executor, sending an alarm notice to the responsible person.

A job run by the method must first pass parameter verification; next, the instances corresponding to the job and the task are generated, and it is determined whether the job is scheduled on a timer. If it is, it is added to quartz; if it is not, it immediately enters the task dispatch queue. When the trigger condition is met in quartz, quartz likewise sends the job to the dispatch queue. To prevent too many tasks from putting pressure on the big data cluster, a queue manager enforces a concurrency limit; if the concurrency limit is satisfied, it is determined whether the upstream tasks succeeded, because of the DAG dependencies. If the upstream tasks were executed successfully, the task is submitted to the big data cluster; the executor observes the task state through the Hadoop-related SDK, synchronizes the task state through the job instance manager, and sends real-time logs through the message middleware, so that users can conveniently check the progress and condition of the task. The task information is synchronized to the caller in real time through the callback manager, and the running condition of the job is monitored through the alarm module.

According to the big data scheduling system and method based on the quartz framework, scheduling of tasks can be managed in a unified mode, workload of data development engineers is reduced, and task scheduling efficiency in whole warehouse construction is improved.

Drawings

FIG. 1 is a diagram of a big data task scheduling management system architecture based on a quartz framework according to an embodiment of the present invention;

FIG. 2 is a flow chart of a big data task scheduling management system based on a quartz framework according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of task prioritization according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of quota-based control of the scheduling logic of the overall system according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of quota control according to an embodiment of the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "linked," and "connected" are to be construed broadly, e.g., as meaning a fixed connection, a removable connection, or an integral connection; the connection may be mechanical or electrical; elements may be connected directly or indirectly through intervening media, or two elements may communicate internally with each other. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific situation.

Quartz is an open source job scheduling framework written entirely in Java and designed for use in J2SE and J2EE applications. It provides great flexibility without sacrificing simplicity and can be used to create simple or complex schedules for executing jobs. It has many features, such as database support, clustering, plug-ins, prebuilt EJB jobs, JavaMail, and support for cron-like expressions.

The present invention uses the timed, distributed scheduling capability of quartz to handle big data tasks such as hivesql, sparksql, flink, python and algorithm tasks. The invention addresses the problems of improving concurrent scheduling capability, handling Hadoop tasks, ensuring the consistency of task states, automatically recovering from system faults, controlling multi-tenant scheduling capability, and controlling the quota of the big data cluster.

The following describes in detail a specific embodiment of the present invention with reference to fig. 1 to 5. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are given by way of illustration and explanation only, not limitation.

As shown in FIG. 1, the big data task scheduling management system based on the quartz framework provides high-concurrency scheduling capability through the queue manager. The native quartz framework has limited task processing capacity in high-concurrency scenarios because its bottom layer processes tasks with a thread pool, especially when task processing takes a long time. The solution of the invention is to decouple timed triggering from task processing through message middleware in the scheduling job management module. When a task is triggered at its scheduled time, the real business logic is not executed immediately; instead, the task is placed in the task queue manager, so that quartz can trigger tasks in a short time without an upper limit. The real task execution logic runs at the queue consumer, which improves the timing capability of the quartz framework.
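The decoupling described above can be illustrated with a minimal Python sketch. It is not the implementation of the invention: an in-process `queue.Queue` stands in for the message middleware, and the names `on_timer_fired`, `consumer` and `results` are hypothetical. The point it shows is that the trigger callback only enqueues, while a separate consumer performs the real processing.

```python
import queue
import threading

# Assumption: an in-process queue stands in for the message middleware.
task_queue = queue.Queue()
results = []


def on_timer_fired(task_id):
    """Trigger callback: only enqueue, never run business logic here."""
    task_queue.put(task_id)


def consumer():
    """Queue consumer: the real task processing happens here."""
    while True:
        task_id = task_queue.get()
        if task_id is None:  # sentinel value that stops the consumer
            break
        results.append(f"processed {task_id}")


worker = threading.Thread(target=consumer)
worker.start()

# The timer can fire many tasks quickly because each trigger is just a put().
for i in range(1000):
    on_timer_fired(i)

task_queue.put(None)
worker.join()
```

Because the trigger path does no work beyond `put()`, the time spent per trigger does not depend on how long a task takes to run.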

The system provides scheduling capability for Hadoop tasks through the scheduling job manager. Quartz is a Java timed-task framework that can only process logic written in Java and cannot process tasks of the Hadoop ecosystem, yet data warehouse and offline task scenarios require timed tasks. The user only needs to save the code to the scheduling job manager and configure the schedule. The task to be scheduled is put into the timing manager, and when the scheduled time arrives, the business system converts the task into the corresponding Hadoop task and submits it to the Hadoop cluster, thereby providing timed scheduling capability for Hadoop tasks.

The timing manager of the system resolves the problem of task state, because the quartz framework is only an event trigger mechanism and has no concept of task state. The invention defines a job instance manager outside quartz that maintains the state of a Hadoop task through 8 task states: not started, waiting, submitted, started, succeeded, failed, retrying and frozen. A task that has just been put into the timing manager is in the not started state; it changes to the waiting state when its scheduled time arrives and to the submitted state after it is submitted to Hadoop. After receiving the task, the Hadoop platform actually begins execution and the task changes to the started state. When the Hadoop task succeeds or fails, the processor corresponding to the task actively pushes the state to the scheduling system to maintain the system state. In addition, quartz has no retry mechanism; in the timing manager of the scheduling system the user can set the number of retries and the retry interval, so that a task can still be guaranteed to succeed through retries when the network or environment is unstable.
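The eight task states can be sketched as a small state machine. This is an illustrative Python sketch only. The path not started, waiting, submitted, started, succeeded or failed follows the description above; the transitions involving the retrying and frozen states, and the direct move from submitted to failed, are assumptions added for illustration, as are the names `TaskState`, `ALLOWED` and `advance`.

```python
from enum import Enum


class TaskState(Enum):
    NOT_STARTED = "not started"
    WAITING = "waiting"
    SUBMITTED = "submitted"
    STARTED = "started"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    RETRYING = "retrying"
    FROZEN = "frozen"


# Allowed transitions. Retry, freeze and SUBMITTED -> FAILED edges are assumptions.
ALLOWED = {
    TaskState.NOT_STARTED: {TaskState.WAITING, TaskState.FROZEN},
    TaskState.WAITING: {TaskState.SUBMITTED},
    TaskState.SUBMITTED: {TaskState.STARTED, TaskState.FAILED},
    TaskState.STARTED: {TaskState.SUCCEEDED, TaskState.FAILED},
    TaskState.FAILED: {TaskState.RETRYING},
    TaskState.RETRYING: {TaskState.SUBMITTED},
    TaskState.SUCCEEDED: set(),
    TaskState.FROZEN: {TaskState.NOT_STARTED},
}


def advance(current, target):
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


# Walk the path described in the text for a task that succeeds.
state = TaskState.NOT_STARTED
for nxt in (TaskState.WAITING, TaskState.SUBMITTED,
            TaskState.STARTED, TaskState.SUCCEEDED):
    state = advance(state, nxt)
```

Keeping the transitions in one table makes it easy to reject inconsistent state pushes, such as a success report for a task that was never submitted.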

The system achieves automatic recovery through the job instance issuing manager. When Hadoop, a business data source or even the scheduling system itself fails, the system can schedule tasks normally without human intervention after the fault is recovered, and the final result is consistent. A timed-task thread of the scheduling system scans the task states in the table and applies different compensation logic accordingly. If the Hadoop system recovers after downtime, the timed task scans for tasks whose state is failed with a timeout as the error reason and resubmits them; for long-running tasks, it actively fetches the task state from Hadoop to maintain the state in the scheduling system. If a task is abnormal, the job instance retry manager scans for it by task state and puts it into the task dispatch queue again, thereby completing the compensation logic.
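A compensation scan of this kind might look like the following Python sketch. The `Instance` record, its field names, the string values used for states and error reasons, and the plain list used as the dispatch queue are all hypothetical; the sketch only illustrates the rule that instances which failed because of a timeout are put back into the dispatch queue.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical row of the instance table."""
    instance_id: int
    state: str
    error_reason: str = ""


def compensation_scan(instances, dispatch_queue):
    """Requeue instances that failed because of a timeout; leave others alone."""
    for inst in instances:
        if inst.state == "failed" and inst.error_reason == "timeout":
            inst.state = "waiting"
            inst.error_reason = ""
            dispatch_queue.append(inst.instance_id)
    return dispatch_queue


table = [
    Instance(1, "succeeded"),
    Instance(2, "failed", "timeout"),
    Instance(3, "failed", "syntax error"),
    Instance(4, "failed", "timeout"),
]
requeued = compensation_scan(table, [])
```

In this sketch instance 3 is not resubmitted, because a failure caused by the job itself would fail again; which failure reasons qualify for compensation is a policy choice that the description leaves open.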

The big data task scheduling management system based on the quartz framework is divided into two modules, namely a core scheduling module and a resource scheduling module, as shown in fig. 1.

The core scheduling module is used for loading all job tasks from the database into the quartz framework, generating corresponding job instances, and sending them to the corresponding executors for submission to a Hadoop cluster; the executors monitor the task state through the methods of the respective components and return it to the scheduling service. The components comprise encapsulated task processors for hivesql, sparksql, flink, python, datax and the like. Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the underlying details of distribution, making full use of the cluster for high-speed computation and storage. Hadoop implements a distributed file system (Hadoop Distributed File System, HDFS).

The core scheduling module covers tasks, jobs and timing arrangement, and comprises a scheduling job manager, a timing manager, a scheduling task manager and a callback manager.

The scheduling job manager is used for editing hive, spark, python, pyspark, flink and datax jobs, storing each edited script task in the database and managing the operation authority of each script task.

The timing manager is used for processing timed scheduling tasks; the timed scheduling logic is currently implemented through quartz, and a cluster scheduling mode is supported.

The scheduling task manager is used for configuring corresponding dependency relationships for the jobs to form a DAG and allocating execution cycles to the corresponding tasks. A DAG (Directed Acyclic Graph) is, in mathematics, especially graph theory, and in computer science, a directed graph without cycles. A graph is not a directed acyclic graph if, for example, one can go from point A to B, from B to C, and from C back to A, which forms a cycle. Changing the edge from C to A into an edge from A to C turns the graph into a directed acyclic graph. The number of spanning trees of a directed acyclic graph is equal to the product of the in-degrees of the nodes with non-zero in-degree.
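The A, B, C example can be checked with a short Python sketch that uses the standard library module `graphlib` (Python 3.9 or later). The helper name `execution_order` and the dictionary layout, which maps each job to the set of upstream jobs it depends on, are choices made for this illustration and are not taken from the invention.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return one valid run order; raises CycleError if the graph has a cycle.

    `dependencies` maps each job to the set of upstream jobs it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


# A -> B -> C -> A: the edge from C back to A closes a cycle.
cyclic = {"B": {"A"}, "C": {"B"}, "A": {"C"}}
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True

# Replacing the edge C -> A with A -> C removes the cycle.
acyclic = {"B": {"A"}, "C": {"B", "A"}}
order = execution_order(acyclic)  # A has no upstream, so it runs first
```

A scheduler can run the same check when a user saves a dependency configuration, so that a cyclic configuration is rejected before any task is issued.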

The callback manager is a common class used to process callbacks in a unified way. For example, if a job is configured with a callback, the callback is made through an HTTP request; the callback manager also performs the alarm management function.

The core scheduling module also comprises a job instance issuing manager, a job instance retry manager and a job instance manager, which are the functional modules of the core scheduling module that handle instance-related logic. They are introduced as follows. The job instance issuing manager generates an instance record for each run of a task or job; an instance is the entity that actually runs the job. The job instance manager acts as the manager for issuing job instances: an instance is submitted to the resource scheduling module for processing, and before issuing, the project engine information is acquired through the strategy pattern in order to assemble the parameters required by each job type.

The job instance retry manager handles the logic of job instance retries, for example: queue slow retries, retries after a task fails, and retries when no callback message for the job instance has been received for a long time. It supports service failover, high service availability and cluster deployment. It is also used for monitoring the data quality status of the data generated by each job type and triggering alarms.
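Retrying after a task failure, with a configurable number of retries and retry interval, can be sketched in Python as follows. This is a generic retry loop under stated assumptions, not the retry manager of the invention; the names `run_with_retry`, `max_retries`, `interval_seconds` and the injected `sleep` function are hypothetical.

```python
import time


def run_with_retry(task, max_retries, interval_seconds, sleep=time.sleep):
    """Run `task`; on failure wait `interval_seconds` and retry.

    Returns (result, attempts). Re-raises the last error once
    `max_retries` retries have been used up.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > max_retries:
                raise
            sleep(interval_seconds)


# A task that fails twice (for example, because of an unstable network)
# and then succeeds on the third attempt.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


# `sleep` is replaced by a no-op so the example finishes immediately.
result, attempts = run_with_retry(flaky, max_retries=3, interval_seconds=30,
                                  sleep=lambda seconds: None)
```

Injecting `sleep` keeps the loop testable; a real deployment would more likely requeue the instance with a delay than block a thread.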

The core scheduling module also comprises a queue manager, which is responsible for the queue mechanism built into task domain scheduling and can control the number of tasks issued concurrently.

The resource scheduling module is the core function that actually schedules physical resources such as CPU, memory, disk and network. The resource scheduling module comprises a task resource scheduling manager, which is responsible for resource control and for the processing of each job type when a task is issued, and which directs the corresponding processor to handle each job type. For example, when a datax job is executed, the task resource scheduling manager allocates resources and invokes the datax processor to process it; other job types are handled in the same way.
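Routing each job type to its own processor can be sketched with a small registry in Python. The registry `PROCESSORS`, the decorator `processor`, the function `dispatch` and the strings returned by the example processors are all hypothetical; a real processor would submit the job to the cluster rather than return a message.

```python
# Registry that maps a job type to the processor responsible for it.
PROCESSORS = {}


def processor(job_type):
    """Decorator that registers a processor function for one job type."""
    def register(func):
        PROCESSORS[job_type] = func
        return func
    return register


@processor("datax")
def run_datax(script):
    # Placeholder: a real processor would launch the DataX job here.
    return f"datax job submitted: {script}"


@processor("hivesql")
def run_hivesql(script):
    # Placeholder: a real processor would submit the SQL to Hive here.
    return f"hivesql job submitted: {script}"


def dispatch(job_type, script):
    """Look up the processor for `job_type` and hand the script to it."""
    try:
        handler = PROCESSORS[job_type]
    except KeyError:
        raise ValueError(f"no processor registered for job type {job_type!r}") from None
    return handler(script)


message = dispatch("datax", "sync_orders.json")
```

With this layout, supporting a new job type means registering one more processor; the dispatch code does not change.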

In another aspect, as shown in FIG. 2, the present invention provides a big data task scheduling management method based on the quartz framework, which comprises the following steps:

S1, editing each job and storing each edited script task in a database through the scheduling job manager;

S2, allocating corresponding task executors and execution periods to the tasks through the scheduling task manager;

S3, loading each script task from the database into the quartz framework, so that the quartz framework issues an execution instruction for each script task to the corresponding task executor through message middleware according to the execution period;

S4, the task resource scheduling manager executing the corresponding script task according to the issued execution instruction and returning an execution result.

The method further comprises the following steps:

S5, managing the user operation authority of each script task;

S6, when a script task fails or times out on the corresponding task executor, sending an alarm notice to the responsible person.

As shown in FIG. 2, the task flow is as follows:

A job to be run must first pass parameter verification; next, the corresponding job and task instances are generated, and it is determined whether the job is scheduled on a timer. If it is, it is added to quartz; if it is not, it immediately enters the task dispatch queue. When the trigger condition is met in quartz, quartz likewise sends the job to the dispatch queue. To prevent too many tasks from putting pressure on the big data cluster, the queue manager enforces a concurrency limit. If the concurrency limit is satisfied, it is determined whether the upstream tasks succeeded, because of the DAG dependencies. If the upstream tasks were executed successfully, the task is submitted to the big data cluster; the executor observes the task state through the Hadoop-related SDK, synchronizes the task state in real time through the callback manager, and sends real-time logs through the message middleware, so that users can conveniently observe the progress and condition of the task. The task information is synchronized to the caller in real time through the callback manager, and the running condition of the job is monitored through the alarm module. If the requirements are not met, an alarm is raised by telephone, DingTalk, short message, mail, Enterprise WeChat and the like. The lineage between jobs and tables is obtained by analyzing the input and output tables of each job, so that the data can be governed. Data quality checks examine the data fluctuation of a table according to the running condition of the job; if the fluctuation is large, a data quality alarm is triggered.
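The two gates in this flow, the concurrency limit and the upstream check, can be sketched in Python as follows. The function `dispatch_ready` and its parameters are hypothetical, and the decision to leave blocked tasks at the front of the queue in their original order is an assumption; the description does not say how blocked tasks are requeued.

```python
from collections import deque


def dispatch_ready(pending, running, states, upstream, max_concurrency):
    """Submit queued tasks while slots are free and all upstream tasks succeeded.

    pending  : deque of task ids waiting in the dispatch queue
    running  : set of task ids currently running on the cluster
    states   : dict mapping task id to its last known state
    upstream : dict mapping task id to the ids it depends on
    """
    submitted, deferred = [], deque()
    while pending and len(running) + len(submitted) < max_concurrency:
        task = pending.popleft()
        if all(states.get(u) == "succeeded" for u in upstream.get(task, ())):
            submitted.append(task)
        else:
            deferred.append(task)  # upstream not finished: try again later
    # Put blocked tasks back at the front of the queue in their original order.
    pending.extendleft(reversed(deferred))
    running.update(submitted)
    return submitted


pending = deque(["t1", "t2", "t3", "t4"])
running = set()
states = {"a": "succeeded", "b": "failed"}
upstream = {"t2": ["a"], "t3": ["b"]}

first_batch = dispatch_ready(pending, running, states, upstream, max_concurrency=2)
running.clear()  # pretend both tasks finished
second_batch = dispatch_ready(pending, running, states, upstream, max_concurrency=2)
```

In this run t1 and t2 are submitted first because only two slots exist; afterwards t4 is submitted while t3 stays queued, since its upstream task b failed.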

In a big data scheduling system shared by multiple users, several different service teams use the scheduling system through the scheduling task manager, and the scheduling capability available to one team is not affected by the others. For example, if team A creates 5000 tasks and team B creates 10 tasks, the scheduling capability of team B is not affected by team A. This is implemented through a priority queue in the queue manager, and the priority formula is as follows:

scores = user_ratio + task_number/minute + priority

Interpretation of the formula: scores is the total score, user_ratio is the user coefficient, task_number is the number of tasks, task_number/minute is the duration coefficient, and priority is the priority.

The user coefficient (user_ratio) is an attribute of each user and is 10 by default; its value can be adjusted through configuration when the priority of the user's task scheduling is to be raised. The duration coefficient (task_number/minute) is the number of tasks run per minute. The priority (priority) is an attribute of the task itself; the default priority is 5, and the user can adjust the priority of a job through the job attributes.

For example: user A runs 20 tasks on the platform in 1 minute with task priority 5, and user B runs 19 tasks on the platform in 1 minute with task priority 5. According to the formula:

the final score of the user A is 10+20/1+5=35;

the final score of the user B is 10+19/1+5=34;

As shown in FIG. 3, jobs in the priority queue are ranked by score and jobs with smaller scores are executed first. Among the tasks of user A and user B, task 6 of user B has the lowest score (34) and is therefore executed first, followed by task 1 of user A, which ensures fair scheduling in a multi-user scenario.
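The score calculation and the ordering in the priority queue can be reproduced with a short Python sketch. The function name `score` and the use of the standard `heapq` module as the priority queue are assumptions made for illustration; the formula and the numbers are those of the example above.

```python
import heapq


def score(user_ratio, task_number, minutes, priority):
    """scores = user_ratio + task_number/minute + priority"""
    return user_ratio + task_number / minutes + priority


score_a = score(user_ratio=10, task_number=20, minutes=1, priority=5)
score_b = score(user_ratio=10, task_number=19, minutes=1, priority=5)

# heapq is a min-heap, so the entry with the smallest score is popped first.
heap = []
heapq.heappush(heap, (score_a, "user A, task 1"))
heapq.heappush(heap, (score_b, "user B, task 6"))

execution_order = [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Because a user who has run more tasks in the last minute receives a higher score, a team that floods the queue moves behind lighter users, which is how the queue keeps teams from starving one another.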

As shown in FIG. 4, the system also monitors the resources of the big data components through the resource scheduling module and controls the scheduling logic of the whole system through quotas. For example, when team A and team B use the scheduling system at the same time and team A is to use 70% of the cluster resources while team B may use only 30%, this is achieved by modifying the quotas of the scheduling system. Quotas isolate resources through the queues of Hadoop YARN. As shown in FIG. 5, quota A, quota B and quota C are divided according to the available resources and are allocated 20%, 50% and 30% of the physical resources respectively, thereby isolating the resources. The underlying implementation uses the capacity scheduler of YARN.
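The arithmetic behind such a quota split can be sketched in Python. This only illustrates how percentages translate into shares of a resource pool; the actual isolation is enforced by the YARN capacity scheduler, not by code like this. The function `allocate` and the total of 1000 resource units are hypothetical.

```python
def allocate(total_units, quota_percent):
    """Split `total_units` of a resource according to percentage quotas."""
    if sum(quota_percent.values()) != 100:
        raise ValueError("quota percentages must sum to 100")
    return {name: total_units * pct // 100 for name, pct in quota_percent.items()}


# Assumption: the cluster offers 1000 units of some resource (e.g. GB of memory).
shares = allocate(1000, {"quota A": 20, "quota B": 50, "quota C": 30})
```

Checking that the percentages sum to 100 mirrors the constraint that the capacities of sibling queues must account for the whole of their parent.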

The capacity scheduler is a pluggable resource scheduler supported by Hadoop. It allows multiple tenants to share cluster resources securely, so that their tasks are allocated resources in a timely manner within the configured capacity limits. It runs Hadoop applications in an operator-friendly manner while maximizing throughput and cluster utilization.

The core concept of the capacity scheduler is the queue. Queues are usually set up by an administrator; multiple queues are supported, each queue can be configured with a certain amount of resources, and each queue adopts a FIFO scheduling policy. To provide more control and predictability over shared resources, the capacity scheduler supports hierarchical queues, which ensure that resources are shared among the sub-queues of an organization before other queues are allowed to use free resources.

In the description herein, references to the description of the terms "embodiment," "example," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, various embodiments or examples described in this specification and features thereof may be combined or combined by those skilled in the art without contradiction.

Although embodiments of the present invention have been shown and described, it is understood that the above embodiments are illustrative and are not to be construed as limiting the present invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by those of ordinary skill in the art without departing from the scope of the present invention.

Claims

1. A big data task scheduling management system based on the quartz framework, characterized by comprising a core scheduling module, a resource scheduling module, at least one scheduling service and a database; the core scheduling module is used for loading all job tasks from the database into the quartz framework, generating corresponding job instances, and sending the job instances to the corresponding executors for submission to a Hadoop cluster, the executors monitoring the task state through the methods of the respective components and returning the task state to the scheduling service; the core scheduling module comprises a scheduling job manager, a timing manager, a scheduling task manager and a callback manager;

the scheduling job manager is used for editing jobs and storing the edited script tasks in a database and is used for managing the operation authority of the script tasks;

the timing manager is used for processing the tasks scheduled in a timing mode, realizing timing scheduling logic through quartz and supporting a cluster scheduling mode;

the scheduling task manager is used for configuring corresponding dependency relationships for the jobs to form a DAG and distributing execution cycles of corresponding tasks;

the resource scheduling module comprises a task resource scheduling manager, wherein the task resource scheduling manager is responsible for controlling resources and processing each job type when a task is issued, and controls corresponding processors to process according to different job types.

2. The quartz framework-based big data task scheduling management system of claim 1, wherein the job types comprise hive, spark, python, pyspark, flink, datax.

3. The big data task scheduling management system based on the quartz framework as claimed in claim 2, wherein the core scheduling module further comprises a job instance issuing manager, a job instance retry manager and a job instance manager, which are the functional modules of the core scheduling module that handle instance-related logic.

4. The big data task scheduling management system based on the quartz framework as claimed in claim 3, wherein the job instance issuing manager generates a record of the instance for each task and/or job run.

5. The quartz framework-based big data task scheduling management system of claim 4, wherein the job instance manager is used as the manager for issuing job instances, an instance is submitted to the resource scheduling module for processing, and before issuing, the project engine information is obtained through the strategy pattern in order to assemble the parameters required by each job type.

6. The big data task scheduling management system based on the quartz framework as claimed in claim 5, wherein the job instance retry manager handles the logic of job instance retries, including queue slow retries, retries after a task fails and/or retries when no callback message for the job instance has been received for a long time, and is used for monitoring the data quality status of the data generated by each job type and triggering alarms.

7. The big data task scheduling management system based on the quartz framework as claimed in claim 6, wherein the core scheduling module further comprises a queue manager, which is responsible for the queue mechanism built into task domain scheduling and can control the number of tasks issued concurrently.

8. A quartz framework based big data task scheduling management method for implementing the quartz framework based big data task scheduling management system as claimed in any one of claims 1-7, the method comprising the following steps:

S1, editing each job and storing each edited script task in a database;

S2, allocating corresponding task executors and execution periods to the tasks;

S3, loading each script task from the database into the quartz framework, so that the quartz framework issues an execution instruction for each script task to the corresponding task executor through message middleware according to the execution period;

S4, the task executor executing the corresponding script task according to the issued execution instruction and returning an execution result.

9. The big data task scheduling management method based on quartz framework as claimed in claim 8, wherein the method further comprises:

s5, managing the user operation authority of each script task;

10. The big data task scheduling management method based on the quartz framework as claimed in claim 9, wherein a job run by the method must first pass parameter verification; next, the corresponding job and task instances are generated, and it is determined whether the job is scheduled on a timer; if it is, it is added to quartz, and if it is not, it immediately enters the task dispatch queue; when the trigger condition is met in quartz, the job likewise enters the dispatch queue; to prevent too many tasks from putting pressure on the big data cluster, a queue manager enforces a concurrency limit, and if the concurrency limit is satisfied, it is determined whether the upstream tasks succeeded, because of the DAG dependencies; if the upstream tasks were executed successfully, the task is submitted to the big data cluster, the executor observes the task state through the Hadoop-related SDK, monitors the task state through the job instance manager, and sends real-time logs through the message middleware, so that users can conveniently check the progress and condition of the task; and the task information is synchronized to the caller in real time through the callback manager, and the running condition of the job is monitored through the alarm module.